prefilter

package

v0.11.4 Latest Latest Go to latest Published: Jan 16, 2026 License: MIT Imports: 5 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/coregx/coregex

Links

Open Source Insights

Documentation ¶

Rendered for

Overview ¶

Package prefilter provides fast candidate filtering for regex search using extracted literal sequences.

A prefilter is used to quickly reject positions in the haystack that cannot possibly match the full regex pattern. This provides dramatic speedup (10-100x) for patterns with literals, as we can use SIMD-accelerated search primitives instead of running the full automaton.

The package automatically selects the optimal prefilter strategy based on extracted literals:

Single byte → MemchrPrefilter (SIMD byte search)
Single substring → MemmemPrefilter (SIMD substring search)
2-32 literals (len ≥ 3) → TeddyPrefilter (SIMD multi-literal)
Many literals → AhoCorasickPrefilter (automaton, future)

Example usage:

// Extract literals from regex pattern
re, _ := syntax.Parse("(hello|world)", syntax.Perl)
extractor := literal.New(literal.DefaultConfig())
prefixes := extractor.ExtractPrefixes(re)

// Build prefilter
builder := prefilter.NewBuilder(prefixes, nil)
pf := builder.Build()

// Use prefilter to find candidates
haystack := []byte("foo hello bar world baz")
pos := pf.Find(haystack, 0)
// pos == 4 (position of "hello")

Package prefilter implements the Teddy multi-pattern SIMD matching algorithm.

Teddy is a SIMD-accelerated algorithm for searching multiple patterns simultaneously. It uses vector shuffle instructions (PSHUFB) to perform parallel table lookups, identifying candidate match positions that are then verified against actual patterns.

The algorithm is particularly effective for 2-32 patterns with length >= 3 bytes, providing 20-50x speedup over naive multi-pattern search.

Algorithm Overview:

Build Phase (once per pattern set): - Assign patterns to buckets (0-7 for Slim Teddy) - Build nibble lookup masks for fingerprint bytes - Each mask byte contains bucket membership bits
Search Phase (per 16-byte chunk with SSSE3): - Load haystack chunk - Extract low/high nibbles - Use PSHUFB to lookup bucket membership - AND results to find candidates - Extract candidate positions
Verify Phase (per candidate): - Check patterns in candidate's bucket - Compare full pattern bytes - Return first match

Reference:

BurntSushi/aho-corasick Teddy README
docs/dev/research/TEDDY_IMPLEMENTATION_GUIDE.md

Package prefilter provides the Fat Teddy multi-pattern SIMD matching algorithm.

Fat Teddy is an AVX2-based extension of Teddy that uses 16 buckets (vs 8 in Slim Teddy). It processes 256-bit vectors and can efficiently handle 33-64 patterns.

Architecture:

Slim Teddy: 8 buckets, SSSE3 (128-bit), 2-32 patterns
Fat Teddy: 16 buckets, AVX2 (256-bit), 33-64 patterns
Fallback: Aho-Corasick for >64 patterns or no AVX2

Mask Layout (256-bit / 32 bytes per mask):

Low 128-bit lane (bytes 0-15): buckets 0-7
High 128-bit lane (bytes 16-31): buckets 8-15

Reference:

Rust aho-corasick: src/packed/teddy/generic.rs
docs/dev/kanban/done/V010-FAT-001-research.md

Example (TeddyBasic) ¶

Example_teddyBasic demonstrates basic Teddy usage for multi-pattern search

package main

import (
	"fmt"

	"github.com/coregx/coregex/prefilter"
)

func main() {
	patterns := [][]byte{
		[]byte("foo"),
		[]byte("bar"),
		[]byte("baz"),
	}

	teddy := prefilter.NewTeddy(patterns, nil)
	if teddy == nil {
		fmt.Println("Teddy not available")
		return
	}

	haystack := []byte("hello bar world")
	pos := teddy.Find(haystack, 0)

	fmt.Printf("Found at position: %d\n", pos)
}

Output:

Found at position: 6

Example (TeddyConfig) ¶

Example_teddyConfig demonstrates custom Teddy configuration

package main

import (
	"fmt"

	"github.com/coregx/coregex/prefilter"
)

func main() {
	patterns := [][]byte{
		[]byte("apple"),
		[]byte("banana"),
	}

	config := &prefilter.TeddyConfig{
		MinPatterns:    2,
		MaxPatterns:    8,
		MinPatternLen:  3,
		FingerprintLen: 1, // Use 1-byte fingerprint
	}

	teddy := prefilter.NewTeddy(patterns, config)
	if teddy == nil {
		fmt.Println("Teddy not available")
		return
	}

	haystack := []byte("I like banana and apple")
	pos := teddy.Find(haystack, 0)

	fmt.Printf("Found at position: %d\n", pos)
}

Output:

Found at position: 7

Example (TeddyHeapBytes) ¶

Example_teddyHeapBytes demonstrates memory usage reporting

package main

import (
	"fmt"

	"github.com/coregx/coregex/prefilter"
)

func main() {
	patterns := [][]byte{
		[]byte("foo"),
		[]byte("bar"),
		[]byte("baz"),
		[]byte("qux"),
	}

	teddy := prefilter.NewTeddy(patterns, nil)
	if teddy == nil {
		fmt.Println("Teddy not available")
		return
	}

	fmt.Printf("Teddy uses approximately %d bytes of heap memory\n", teddy.HeapBytes())
	// Output will vary, but typically < 1KB
	// Example output: Teddy uses approximately 376 bytes of heap memory
}

Example (TeddyMultipleMatches) ¶

Example_teddyMultipleMatches demonstrates finding multiple matches

package main

import (
	"fmt"

	"github.com/coregx/coregex/prefilter"
)

func main() {
	patterns := [][]byte{
		[]byte("ERROR"),
		[]byte("WARNING"),
	}

	teddy := prefilter.NewTeddy(patterns, nil)
	if teddy == nil {
		fmt.Println("Teddy not available")
		return
	}

	haystack := []byte("INFO: all good\nERROR: something broke\nWARNING: check this")

	// Find all matches
	matches := []int{}
	start := 0
	for {
		pos := teddy.Find(haystack, start)
		if pos == -1 {
			break
		}
		matches = append(matches, pos)
		start = pos + 1 // Continue searching after this match
	}

	fmt.Printf("Found %d matches at positions: %v\n", len(matches), matches)
}

Output:

Found 2 matches at positions: [15 38]

Example (TeddyNoMatch) ¶

Example_teddyNoMatch demonstrates no-match behavior

package main

import (
	"fmt"

	"github.com/coregx/coregex/prefilter"
)

func main() {
	patterns := [][]byte{
		[]byte("foo"),
		[]byte("bar"),
	}

	teddy := prefilter.NewTeddy(patterns, nil)
	if teddy == nil {
		fmt.Println("Teddy not available")
		return
	}

	haystack := []byte("hello world")
	pos := teddy.Find(haystack, 0)

	if pos == -1 {
		fmt.Println("No match found")
	}
}

Output:

No match found

Example (TeddyNotSuitable) ¶

Example_teddyNotSuitable demonstrates when Teddy is not suitable

package main

import (
	"fmt"

	"github.com/coregx/coregex/prefilter"
)

func main() {
	// Too few patterns (need >= 2)
	teddy1 := prefilter.NewTeddy([][]byte{[]byte("foo")}, nil)
	fmt.Printf("1 pattern: %v\n", teddy1 != nil)

	// Pattern too short (need >= 3 bytes)
	teddy2 := prefilter.NewTeddy([][]byte{[]byte("ab"), []byte("cd")}, nil)
	fmt.Printf("Short patterns: %v\n", teddy2 != nil)

	// Too many patterns (max 32)
	manyPatterns := make([][]byte, 35)
	for i := range manyPatterns {
		manyPatterns[i] = []byte(fmt.Sprintf("pat%02d", i))
	}
	teddy3 := prefilter.NewTeddy(manyPatterns, nil)
	fmt.Printf("Too many patterns: %v\n", teddy3 != nil)

}

Output:

1 pattern: false
Short patterns: false
Too many patterns: false

Example (TeddyVsNaive) ¶

Example_teddyVsNaive demonstrates performance comparison

package main

import (
	"fmt"

	"github.com/coregx/coregex/prefilter"
)

func main() {
	patterns := [][]byte{
		[]byte("pattern1"),
		[]byte("pattern2"),
		[]byte("pattern3"),
	}

	teddy := prefilter.NewTeddy(patterns, nil)
	if teddy == nil {
		fmt.Println("Teddy not available")
		return
	}

	haystack := make([]byte, 10000)
	for i := range haystack {
		haystack[i] = 'x'
	}
	copy(haystack[5000:], "pattern2")

	pos := teddy.Find(haystack, 0)
	fmt.Printf("Found pattern at position: %d\n", pos)
	fmt.Println("Teddy provides 20-50x speedup over naive multi-pattern search")
}

Output:

Found pattern at position: 5000
Teddy provides 20-50x speedup over naive multi-pattern search

Example (TeddyWithRegex) ¶

Example_teddyWithRegex demonstrates Teddy with regex literal extraction

package main

import (
	"fmt"
	"regexp/syntax"

	"github.com/coregx/coregex/literal"
	"github.com/coregx/coregex/prefilter"
)

func main() {
	// Parse regex with alternation (use distinct patterns to ensure 3 prefixes)
	re, _ := syntax.Parse("(abc|def|ghi)", syntax.Perl)

	// Extract prefixes
	extractor := literal.New(literal.DefaultConfig())
	prefixes := extractor.ExtractPrefixes(re)

	fmt.Printf("Extracted %d prefixes\n", prefixes.Len())

	// Build prefilter (will select Teddy for 3 patterns)
	builder := prefilter.NewBuilder(prefixes, nil)
	pf := builder.Build()

	if pf == nil {
		fmt.Println("No prefilter available")
		return
	}

	// Use prefilter
	haystack := []byte("test def test")
	pos := pf.Find(haystack, 0)

	fmt.Printf("Found at position: %d\n", pos)
}

Output:

Extracted 3 prefixes
Found at position: 5

Index ¶

Constants
type Builder
- func NewBuilder(prefixes, suffixes *literal.Seq) *Builder
- func (b *Builder) Build() Prefilter
type DigitPrefilter
- func NewDigitPrefilter() *DigitPrefilter
- func (p *DigitPrefilter) Find(haystack []byte, start int) int
- func (p *DigitPrefilter) HeapBytes() int
- func (p *DigitPrefilter) IsComplete() bool
- func (p *DigitPrefilter) LiteralLen() int
type FatTeddy
- func NewFatTeddy(patterns [][]byte, config *FatTeddyConfig) *FatTeddy
- func (t *FatTeddy) Find(haystack []byte, start int) int
- func (t *FatTeddy) FindMatch(haystack []byte, start int) (int, int)
- func (t *FatTeddy) HeapBytes() int
- func (t *FatTeddy) IsComplete() bool
- func (t *FatTeddy) LiteralLen() int
- func (t *FatTeddy) MinimumLen() int
- func (t *FatTeddy) PatternCount() int
- func (t *FatTeddy) Patterns() [][]byte
type FatTeddyConfig
- func DefaultFatTeddyConfig() *FatTeddyConfig
type MatchFinder
type Prefilter
- func WrapWithTracking(inner Prefilter) Prefilter
type Teddy
- func NewTeddy(patterns [][]byte, config *TeddyConfig) *Teddy
- func (t *Teddy) Find(haystack []byte, start int) int
- func (t *Teddy) FindMatch(haystack []byte, start int) (int, int)
- func (t *Teddy) HeapBytes() int
- func (t *Teddy) IsComplete() bool
- func (t *Teddy) LiteralLen() int
type TeddyConfig
- func DefaultTeddyConfig() *TeddyConfig
type TrackedPrefilter
- func (tp *TrackedPrefilter) Find(haystack []byte, start int) int
- func (tp *TrackedPrefilter) HeapBytes() int
- func (tp *TrackedPrefilter) IsComplete() bool
- func (tp *TrackedPrefilter) LiteralLen() int
type Tracker
- func NewTracker(inner Prefilter) *Tracker
- func NewTrackerWithConfig(inner Prefilter, config TrackerConfig) *Tracker
- func (t *Tracker) ConfirmMatch()
- func (t *Tracker) Find(haystack []byte, start int) int
- func (t *Tracker) HeapBytes() int
- func (t *Tracker) Inner() Prefilter
- func (t *Tracker) IsActive() bool
- func (t *Tracker) IsComplete() bool
- func (t *Tracker) LiteralLen() int
- func (t *Tracker) Reset()
- func (t *Tracker) Stats() (candidates, confirms uint64, efficiency float64, active bool)
type TrackerConfig
- func DefaultTrackerConfig() TrackerConfig

Constants ¶

View Source

const (
	// MaxSlimTeddyPatterns is the maximum number of patterns Slim Teddy (SSSE3) can handle.
	// Slim Teddy uses 8 buckets with modulo distribution.
	// Reference: Rust aho-corasick uses 32 patterns as threshold for Fat Teddy.
	MaxSlimTeddyPatterns = 32

	// MaxFatTeddyPatterns is the maximum number of patterns Fat Teddy (AVX2) can handle.
	// Fat Teddy uses 16 buckets and requires AVX2 (256-bit vectors).
	// Reference: Rust aho-corasick supports up to 64 patterns with Fat Teddy.
	MaxFatTeddyPatterns = 64

	// MaxTeddyPatterns is the maximum patterns for auto-selection.
	// If AVX2 available: 64 (Fat Teddy), otherwise: 32 (Slim Teddy)
	MaxTeddyPatterns = MaxFatTeddyPatterns

	// MinTeddyPatterns is the minimum number of patterns required for Teddy
	MinTeddyPatterns = 2

	// MinTeddyPatternLen is the minimum pattern length for effective Teddy search.
	// Patterns shorter than 3 bytes have high false positive rates.
	MinTeddyPatternLen = 3

	// MaxFingerprintLen is the maximum fingerprint length (1-4 bytes).
	// We use 2-byte fingerprint by default for better false positive rejection.
	MaxFingerprintLen = 4

	// NumBucketsSlim is the number of buckets in Slim Teddy (8 buckets, 8 bits per mask byte)
	NumBucketsSlim = 8

	// NumBucketsFat is the number of buckets in Fat Teddy (16 buckets, requires AVX2)
	// Low 128-bit lane = buckets 0-7, High 128-bit lane = buckets 8-15
	NumBucketsFat = 16
)

Constants for Teddy configuration

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Builder ¶

type Builder struct {
	// contains filtered or unexported fields
}

Builder constructs the optimal prefilter from extracted literals.

The builder analyzes the literal sequences (prefixes and suffixes) and selects the most efficient prefilter strategy. The selection is based on:

Number of literals
Length of literals
Completeness flag

Selection strategy (in order of preference):

Single byte literal → MemchrPrefilter (fastest)
Single substring literal → MemmemPrefilter (very fast)
2-32 literals, len≥3 → TeddyPrefilter (SIMD multi-pattern)
Many literals → AhoCorasickPrefilter (automaton, future)
No suitable literals → nil (no prefilter)

Example:

// Build from extracted prefixes
builder := prefilter.NewBuilder(prefixes, nil)
pf := builder.Build()
if pf != nil {
    pos := pf.Find(haystack, 0)
}

Example ¶

ExampleBuilder demonstrates building a prefilter from a regex pattern.

package main

import (
	"fmt"
	"regexp/syntax"

	"github.com/coregx/coregex/literal"
	"github.com/coregx/coregex/prefilter"
)

func main() {
	// Parse the regex pattern
	re, _ := syntax.Parse("hello", syntax.Perl)

	// Extract prefix literals
	extractor := literal.New(literal.DefaultConfig())
	prefixes := extractor.ExtractPrefixes(re)

	// Build the prefilter
	builder := prefilter.NewBuilder(prefixes, nil)
	pf := builder.Build()

	if pf != nil {
		// Use the prefilter to find candidates
		haystack := []byte("foo hello world")
		pos := pf.Find(haystack, 0)
		fmt.Printf("Found candidate at position %d\n", pos)
	}

}

Output:

Found candidate at position 4

Example (Alternation) ¶

ExampleBuilder_alternation demonstrates prefilter with alternations.

package main

import (
	"fmt"
	"regexp/syntax"

	"github.com/coregx/coregex/literal"
	"github.com/coregx/coregex/prefilter"
)

func main() {
	// Pattern with alternation - Go parser may factorize common prefixes
	re, _ := syntax.Parse("(foo|foobar|food)", syntax.Perl)

	extractor := literal.New(literal.DefaultConfig())
	prefixes := extractor.ExtractPrefixes(re)

	// The Go parser factorizes to foo[dbr] or similar
	// After minimization, should extract "foo"
	builder := prefilter.NewBuilder(prefixes, nil)
	pf := builder.Build()

	if pf != nil {
		haystack := []byte("test foobar end")
		pos := pf.Find(haystack, 0)
		fmt.Printf("Found candidate at position %d\n", pos)
		fmt.Printf("Complete match: %v\n", pf.IsComplete())
	}

}

Output:

Found candidate at position 5
Complete match: false

Example (NoPrefilter) ¶

ExampleBuilder_noPrefilter demonstrates patterns with no available prefilter.

package main

import (
	"fmt"
	"regexp/syntax"

	"github.com/coregx/coregex/literal"
	"github.com/coregx/coregex/prefilter"
)

func main() {
	// Pattern with no extractable literals (wildcard)
	re, _ := syntax.Parse(".*", syntax.Perl)

	extractor := literal.New(literal.DefaultConfig())
	prefixes := extractor.ExtractPrefixes(re)

	builder := prefilter.NewBuilder(prefixes, nil)
	pf := builder.Build()

	if pf == nil {
		fmt.Println("No prefilter available, must use full regex engine")
	}

}

Output:

No prefilter available, must use full regex engine

Example (SingleByte) ¶

ExampleBuilder_singleByte demonstrates prefilter selection for single byte patterns.

package main

import (
	"fmt"
	"regexp/syntax"

	"github.com/coregx/coregex/literal"
	"github.com/coregx/coregex/prefilter"
)

func main() {
	// Pattern with single byte literal
	re, _ := syntax.Parse("[a].*", syntax.Perl)

	extractor := literal.New(literal.DefaultConfig())
	prefixes := extractor.ExtractPrefixes(re)

	builder := prefilter.NewBuilder(prefixes, nil)
	pf := builder.Build()

	// Should select MemchrPrefilter for single byte
	haystack := []byte("xxxayyy")
	pos := pf.Find(haystack, 0)
	fmt.Printf("Found 'a' at position %d\n", pos)
	fmt.Printf("Heap usage: %d bytes\n", pf.HeapBytes())

}

Output:

Found 'a' at position 3
Heap usage: 0 bytes

Example (Substring) ¶

ExampleBuilder_substring demonstrates prefilter selection for substring patterns.

package main

import (
	"fmt"
	"regexp/syntax"

	"github.com/coregx/coregex/literal"
	"github.com/coregx/coregex/prefilter"
)

func main() {
	// Pattern with substring literal
	re, _ := syntax.Parse("pattern.*", syntax.Perl)

	extractor := literal.New(literal.DefaultConfig())
	prefixes := extractor.ExtractPrefixes(re)

	builder := prefilter.NewBuilder(prefixes, nil)
	pf := builder.Build()

	// Should select MemmemPrefilter for substring
	haystack := []byte("test pattern matching")
	pos := pf.Find(haystack, 0)
	fmt.Printf("Found 'pattern' at position %d\n", pos)
	fmt.Printf("Heap usage: %d bytes\n", pf.HeapBytes())

}

Output:

Found 'pattern' at position 5
Heap usage: 7 bytes

Example (WithSuffixes) ¶

ExampleBuilder_withSuffixes demonstrates using suffixes when prefixes are empty.

package main

import (
	"fmt"
	"regexp/syntax"

	"github.com/coregx/coregex/literal"
	"github.com/coregx/coregex/prefilter"
)

func main() {
	// Pattern with suffix but no prefix
	re, _ := syntax.Parse(".*world", syntax.Perl)

	extractor := literal.New(literal.DefaultConfig())
	prefixes := extractor.ExtractPrefixes(re) // Will be empty
	suffixes := extractor.ExtractSuffixes(re) // Will have "world"

	// Builder will use suffixes when prefixes are empty
	builder := prefilter.NewBuilder(prefixes, suffixes)
	pf := builder.Build()

	if pf != nil {
		haystack := []byte("hello world")
		pos := pf.Find(haystack, 0)
		fmt.Printf("Found suffix at position %d\n", pos)
	}

}

Output:

Found suffix at position 6

func NewBuilder ¶

func NewBuilder(prefixes, suffixes *literal.Seq) *Builder

NewBuilder creates a new prefilter builder from extracted literal sequences.

Parameters:

prefixes - literals that must appear at the start of matches (from ExtractPrefixes)
suffixes - literals that must appear at the end of matches (from ExtractSuffixes)

The builder prefers prefixes over suffixes because forward search is more natural and cache-friendly. Suffixes are only used if prefixes are empty.

Either or both can be nil, indicating no literals of that type were extracted.

Example:

extractor := literal.New(literal.DefaultConfig())
prefixes := extractor.ExtractPrefixes(re)
suffixes := extractor.ExtractSuffixes(re)
builder := prefilter.NewBuilder(prefixes, suffixes)

func (*Builder) Build ¶

func (b *Builder) Build() Prefilter

Build constructs the best prefilter for the given literals.

Returns nil if no effective prefilter can be built (e.g., no literals, or literals are too complex for available strategies).

The selection logic:

Prefer prefixes over suffixes (forward search)
Single byte → Memchr (10-15x speedup)
Single substring → Memmem (5-10x speedup)
Multiple literals (2-8, len≥3) → Teddy (20-50x speedup)
Many literals → (future) AhoCorasick
Otherwise → nil

Example:

builder := prefilter.NewBuilder(prefixes, nil)
pf := builder.Build()
if pf == nil {
    // No prefilter available, use full regex engine
    return fullRegexSearch(haystack, pattern)
}
// Use prefilter for fast candidate finding
pos := pf.Find(haystack, 0)

type DigitPrefilter ¶ added in v0.9.0

type DigitPrefilter struct{}

DigitPrefilter implements the Prefilter interface for patterns that must start with ASCII digits [0-9].

It uses SIMD-accelerated digit scanning to quickly find candidate positions, allowing the regex engine to skip large regions of non-digit text. This is particularly effective for:

IP address patterns: `(?:25[0-5]|2[0-4][0-9]|...)`
Numeric validators: `[1-9][0-9]*`
Phone number patterns: `\d{3}-\d{3}-\d{4}`

Performance characteristics:

Small inputs (< 32 bytes): 2-3x faster than byte-by-byte
Medium inputs (4KB): 8-10x faster
Large inputs (64KB+): 15-20x faster

This prefilter is NOT complete - finding a digit is only a candidate position. The full regex must be verified at that position.

Example usage (internal):

pf := NewDigitPrefilter()
pos := pf.Find(haystack, 0)
for pos != -1 {
    if regexMatchesAt(haystack, pos) {
        return pos
    }
    pos = pf.Find(haystack, pos+1)
}

func NewDigitPrefilter ¶ added in v0.9.0

func NewDigitPrefilter() *DigitPrefilter

NewDigitPrefilter creates a prefilter for patterns that must start with digits.

This prefilter uses simd.MemchrDigitAt internally for fast digit scanning. It is designed for patterns where all alternation branches must begin with a digit [0-9], enabling efficient skip-ahead through non-digit regions.

Example patterns that benefit:

IP address: `25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]`
Numeric: `[0-9]+`
Date: `[0-9]{4}-[0-9]{2}-[0-9]{2}`

func (*DigitPrefilter) Find ¶ added in v0.9.0

func (p *DigitPrefilter) Find(haystack []byte, start int) int

Find returns the index of the first digit at or after 'start'. Returns -1 if no digit is found in the remaining haystack.

This method uses SIMD acceleration on AMD64 with AVX2 support. For inputs >= 32 bytes, processes 32 bytes per iteration.

Parameters:

haystack: the byte slice to search
start: the starting position (inclusive)

Returns:

index >= start if a digit candidate is found
-1 if no digit exists at or after start

func (*DigitPrefilter) HeapBytes ¶ added in v0.9.0

func (p *DigitPrefilter) HeapBytes() int

HeapBytes returns 0 because DigitPrefilter uses no heap allocation.

The prefilter is stateless and relies on simd.MemchrDigitAt which operates directly on the input slice without additional allocations.

func (*DigitPrefilter) IsComplete ¶ added in v0.9.0

func (p *DigitPrefilter) IsComplete() bool

IsComplete returns false because finding a digit is only a candidate position. The full regex must be verified at that position to confirm a match.

Unlike literal prefilters (Memchr, Memmem) which can sometimes guarantee a match, digit prefiltering only narrows the search space - the actual pattern may still fail to match at the digit position.

func (*DigitPrefilter) LiteralLen ¶ added in v0.9.0

func (p *DigitPrefilter) LiteralLen() int

LiteralLen returns 0 because DigitPrefilter doesn't match fixed-length literals.

The prefilter finds digit characters, but the actual match length depends on the full regex pattern being verified.

type FatTeddy ¶ added in v0.10.0

type FatTeddy struct {
	// contains filtered or unexported fields
}

FatTeddy is an AVX2-accelerated multi-pattern searcher using 16 buckets.

It uses the same algorithm as Slim Teddy but with 256-bit vectors, enabling efficient search for 33-64 patterns.

Thread-safety: FatTeddy is safe for concurrent use (all state is immutable).

func NewFatTeddy ¶ added in v0.10.0

func NewFatTeddy(patterns [][]byte, config *FatTeddyConfig) *FatTeddy

NewFatTeddy creates a new Fat Teddy searcher for the given patterns.

Returns nil if patterns are not suitable for Fat Teddy:

Fewer than MinPatterns (default: 2)
More than MaxPatterns (default: 64)
Any pattern shorter than MinPatternLen (default: 3)

Note: For 2-32 patterns, prefer Slim Teddy (SSSE3) as it has lower overhead. Fat Teddy is optimal for 33-64 patterns on AVX2-capable CPUs.

Example:

patterns := make([][]byte, 50)
for i := range patterns {
    patterns[i] = []byte(fmt.Sprintf("pattern%02d", i))
}
fatTeddy := prefilter.NewFatTeddy(patterns, nil)
if fatTeddy != nil {
    pos := fatTeddy.Find(haystack, 0)
}

func (*FatTeddy) Find ¶ added in v0.10.0

func (t *FatTeddy) Find(haystack []byte, start int) int

Find returns the index of the first match starting at or after 'start'.

This implements the Prefilter interface. It uses AVX2 SIMD search to find candidates, then verifies full pattern matches.

Returns -1 if no match is found.

func (*FatTeddy) FindMatch ¶ added in v0.10.0

func (t *FatTeddy) FindMatch(haystack []byte, start int) (int, int)

FindMatch returns the start and end positions of the first match.

func (*FatTeddy) HeapBytes ¶ added in v0.10.0

func (t *FatTeddy) HeapBytes() int

HeapBytes implements Prefilter.HeapBytes.

func (*FatTeddy) IsComplete ¶ added in v0.10.0

func (t *FatTeddy) IsComplete() bool

IsComplete implements Prefilter.IsComplete.

func (*FatTeddy) LiteralLen ¶ added in v0.10.0

func (t *FatTeddy) LiteralLen() int

LiteralLen implements Prefilter.LiteralLen.

func (*FatTeddy) MinimumLen ¶ added in v0.10.0

func (t *FatTeddy) MinimumLen() int

MinimumLen returns the minimum haystack length for efficient SIMD search.

For haystacks smaller than this, the SIMD setup overhead exceeds the benefit. Callers should use a fallback strategy (like Aho-Corasick) for small inputs.

This follows Rust regex's minimum_len() approach:

AVX2 processes 32 bytes per iteration
With 2-byte fingerprint, minimum is 32 + 1 = 33 bytes
We use 64 as conservative threshold based on benchmarks showing Aho-Corasick is ~2x faster than Fat Teddy on ~37 byte inputs

Reference: rust-aho-corasick/src/packed/teddy/builder.rs:585

func (*FatTeddy) PatternCount ¶ added in v0.10.0

func (t *FatTeddy) PatternCount() int

PatternCount returns the number of patterns in this Fat Teddy searcher. Used by meta-engine to build appropriate fallback strategy.

func (*FatTeddy) Patterns ¶ added in v0.10.0

func (t *FatTeddy) Patterns() [][]byte

Patterns returns the patterns stored in this Fat Teddy searcher. Used by meta-engine to build Aho-Corasick fallback for small haystacks.

type FatTeddyConfig ¶ added in v0.10.0

type FatTeddyConfig struct {
	// MinPatterns is the minimum patterns required (default: 2)
	MinPatterns int

	// MaxPatterns is the maximum patterns allowed (default: 64)
	MaxPatterns int

	// MinPatternLen is the minimum pattern length required (default: 3)
	MinPatternLen int

	// FingerprintLen is the number of fingerprint bytes to use (1-4, default: 2)
	FingerprintLen int
}

FatTeddyConfig configures Fat Teddy construction.

func DefaultFatTeddyConfig ¶ added in v0.10.0

func DefaultFatTeddyConfig() *FatTeddyConfig

DefaultFatTeddyConfig returns the default Fat Teddy configuration.

type MatchFinder ¶ added in v0.8.18

type MatchFinder interface {
	// FindMatch returns the start and end positions of the first match.
	// Returns (start, end) if found, (-1, -1) if not found.
	// The matched bytes are haystack[start:end].
	FindMatch(haystack []byte, start int) (start2, end int)
}

MatchFinder is an optional interface for prefilters that can return the matched range directly, avoiding the need for NFA verification.

This is particularly useful for multi-pattern prefilters like Teddy where the matched pattern length varies.

type Prefilter ¶

type Prefilter interface {
	// Find returns the index of the first candidate match starting at or after
	// 'start', or -1 if no candidate is found.
	//
	// A candidate match means a position where one of the prefilter literals
	// was found. This does NOT guarantee a full regex match - the caller must
	// verify the match using the full regex engine (unless IsComplete() is true).
	//
	// Parameters:
	//   haystack - the byte buffer to search
	//   start - the starting position (must be >= 0 and <= len(haystack))
	//
	// Returns:
	//   index >= start if a candidate is found
	//   -1 if no candidate exists at or after start
	//
	// Example:
	//
	//	pf := /* some prefilter */
	//	pos := pf.Find(haystack, 0)
	//	for pos != -1 {
	//	    // Verify match at pos using full regex engine
	//	    if fullRegexMatches(haystack, pos) {
	//	        return pos
	//	    }
	//	    // Continue searching
	//	    pos = pf.Find(haystack, pos+1)
	//	}
	Find(haystack []byte, start int) int

	// IsComplete returns true if a prefilter match guarantees a full regex match.
	//
	// When true, the regex engine can skip verification and directly return the
	// prefilter result. This is the case when:
	//   - The regex is an exact literal (e.g., /hello/)
	//   - The literal sequence is complete and non-overlapping
	//
	// Most prefilters return false, meaning verification is required.
	//
	// Example:
	//
	//	if pf.IsComplete() {
	//	    // Direct return, no verification needed
	//	    return pf.Find(haystack, start)
	//	} else {
	//	    // Must verify with full regex engine
	//	    pos := pf.Find(haystack, start)
	//	    if pos != -1 && fullRegexMatches(haystack, pos) {
	//	        return pos
	//	    }
	//	}
	IsComplete() bool

	// LiteralLen returns the length of the matched literal when IsComplete() is true.
	//
	// This allows the regex engine to calculate exact match bounds without
	// running the full automata: end = start + LiteralLen().
	//
	// Returns:
	//   > 0 if IsComplete() is true (the length of the complete literal)
	//   0 if IsComplete() is false or if the prefilter matches variable lengths
	//
	// Example:
	//
	//	if pf.IsComplete() {
	//	    pos := pf.Find(haystack, start)
	//	    if pos != -1 {
	//	        matchEnd := pos + pf.LiteralLen()
	//	        return haystack[pos:matchEnd]
	//	    }
	//	}
	LiteralLen() int

	// HeapBytes returns the number of bytes of heap memory used by this prefilter.
	//
	// This is useful for profiling and memory budgeting in the regex engine.
	// Simple prefilters (Memchr, Memmem) typically return 0 as they don't
	// allocate heap memory. Complex prefilters (Teddy, AhoCorasick) may use
	// significant memory for lookup tables.
	//
	// Returns:
	//   0 if no heap allocation
	//   positive value indicating heap bytes used
	HeapBytes() int
}

Prefilter is used to quickly find candidate match positions before running the full regex engine.

The prefilter scans the haystack for literals extracted from the regex pattern. When a literal is found, that position is returned as a candidate. The regex engine then verifies if a full match exists at that position.

Key methods:

Find: returns the next candidate position
IsComplete: indicates if prefilter match is sufficient (no verification needed)
HeapBytes: returns memory usage for profiling

func WrapWithTracking ¶ added in v0.4.0

func WrapWithTracking(inner Prefilter) Prefilter

WrapWithTracking creates a Prefilter that includes effectiveness tracking.

The returned prefilter can be used anywhere a regular Prefilter is expected, but also supports tracking methods via type assertion:

pf := prefilter.WrapWithTracking(innerPf)
// Use as regular prefilter
pos := pf.Find(haystack, 0)
// Access tracking via type assertion
if tracked, ok := pf.(*prefilter.TrackedPrefilter); ok {
    tracked.ConfirmMatch()
    _, _, eff, _ := tracked.Stats()
}

type Teddy ¶

type Teddy struct {
	// contains filtered or unexported fields
}

Teddy is a SIMD-accelerated multi-pattern searcher.

It uses bucket-based filtering to quickly identify candidate positions, then verifies full pattern matches. This provides dramatic speedup for alternation patterns like (foo|bar|baz).

Thread-safety: Teddy is safe for concurrent use (all state is immutable).

func NewTeddy ¶

func NewTeddy(patterns [][]byte, config *TeddyConfig) *Teddy

NewTeddy creates a new Teddy searcher for the given patterns.

Returns nil if patterns are not suitable for Teddy:

Fewer than MinPatterns (default: 2)
More than MaxPatterns (default: 8)
Any pattern shorter than MinPatternLen (default: 3)

Parameters:

patterns - the byte patterns to search for
config - configuration (nil uses DefaultTeddyConfig)

Example:

patterns := [][]byte{
    []byte("foo"),
    []byte("bar"),
    []byte("baz"),
}
teddy := prefilter.NewTeddy(patterns, nil)
if teddy != nil {
    pos := teddy.Find(haystack, 0)
}

func (*Teddy) Find ¶

func (t *Teddy) Find(haystack []byte, start int) int

Find returns the index of the first candidate match starting at or after 'start'.

This implements the Prefilter interface. It uses SIMD search to find candidates, then verifies full pattern matches.

Returns -1 if no match is found.

Algorithm:

If haystack too short: use scalar fallback
SIMD search: find candidate positions using nibble masks
For each candidate: verify full pattern match
Return first match position

Thread-safety: Safe for concurrent use (all state is immutable).

func (*Teddy) FindMatch ¶ added in v0.8.18

func (t *Teddy) FindMatch(haystack []byte, start int) (int, int)

FindMatch returns the start and end positions of the first match. This is more efficient than Find() when the pattern length varies, as it avoids the need for a separate NFA search to find the end.

Returns (start, end) if found, (-1, -1) if not found. The matched bytes are haystack[start:end].

func (*Teddy) HeapBytes ¶

func (t *Teddy) HeapBytes() int

HeapBytes implements Prefilter.HeapBytes.

Returns approximate heap memory used by Teddy:

Mask tables: 264 bytes (fixed)
Pattern storage: sum of pattern lengths
Bucket arrays: ~8 * 8 bytes

Total: typically < 1KB for 2-8 patterns.

func (*Teddy) IsComplete ¶

func (t *Teddy) IsComplete() bool

IsComplete implements Prefilter.IsComplete.

Returns true if finding a Teddy match guarantees a full regex match. This is only true if all patterns are exact literals with no overlap.

For most use cases, this returns false (verification required).

func (*Teddy) LiteralLen ¶ added in v0.8.14

func (t *Teddy) LiteralLen() int

LiteralLen implements Prefilter.LiteralLen.

When all patterns have the same length and complete=true, returns that uniform length. Otherwise returns 0.

type TeddyConfig ¶

type TeddyConfig struct {
	// MinPatterns is the minimum patterns required (default: 2)
	MinPatterns int

	// MaxPatterns is the maximum patterns allowed (default: 8)
	MaxPatterns int

	// MinPatternLen is the minimum pattern length required (default: 3)
	MinPatternLen int

	// FingerprintLen is the number of fingerprint bytes to use (1-4, default: 1)
	// Higher values reduce false positives but increase computation cost.
	FingerprintLen int
}

TeddyConfig configures Teddy construction.

func DefaultTeddyConfig ¶

func DefaultTeddyConfig() *TeddyConfig

DefaultTeddyConfig returns the default Slim Teddy configuration. For Fat Teddy (33-64 patterns), use NewFatTeddy with DefaultFatTeddyConfig.

type TrackedPrefilter ¶ added in v0.4.0

type TrackedPrefilter struct {
	*Tracker
}

TrackedPrefilter wraps a Prefilter to implement the Prefilter interface while providing tracking capabilities.

This allows using a tracked prefilter anywhere a regular Prefilter is expected.

func (*TrackedPrefilter) Find ¶ added in v0.4.0

func (tp *TrackedPrefilter) Find(haystack []byte, start int) int

Find implements Prefilter.Find with tracking.

func (*TrackedPrefilter) HeapBytes ¶ added in v0.4.0

func (tp *TrackedPrefilter) HeapBytes() int

HeapBytes implements Prefilter.HeapBytes.

func (*TrackedPrefilter) IsComplete ¶ added in v0.4.0

func (tp *TrackedPrefilter) IsComplete() bool

IsComplete implements Prefilter.IsComplete.

func (*TrackedPrefilter) LiteralLen ¶ added in v0.8.14

func (tp *TrackedPrefilter) LiteralLen() int

LiteralLen implements Prefilter.LiteralLen.

type Tracker ¶ added in v0.4.0

type Tracker struct {
	// contains filtered or unexported fields
}

Tracker wraps a Prefilter with effectiveness tracking.

The tracker monitors the ratio of confirmed matches to candidates found. When effectiveness drops below a threshold (too many false positives), the prefilter is automatically disabled to prevent catastrophic slowdown.

This implements the rust-regex pattern where prefilters can be "retired" when they become ineffective for certain inputs.

Algorithm:

Track candidates (prefilter finds) and confirms (actual matches)
Every N candidates, check effectiveness ratio
If ratio < threshold, disable prefilter
Once disabled, never re-enable (for this search)

Example usage:

pf := prefilter.NewBuilder(prefixes, nil).Build()
tracker := prefilter.NewTracker(pf)
for {
    if !tracker.IsActive() {
        // Prefilter disabled, use full scan
        break
    }
    pos := tracker.Find(haystack, start)
    if pos == -1 {
        break
    }
    if fullRegexMatches(haystack, pos) {
        tracker.ConfirmMatch()
        return pos
    }
    start = pos + 1
}

func NewTracker ¶ added in v0.4.0

func NewTracker(inner Prefilter) *Tracker

NewTracker creates a new tracker for the given prefilter with default config.

Returns nil if the inner prefilter is nil.

func NewTrackerWithConfig ¶ added in v0.4.0

func NewTrackerWithConfig(inner Prefilter, config TrackerConfig) *Tracker

NewTrackerWithConfig creates a new tracker with custom configuration.

Returns nil if the inner prefilter is nil.

func (*Tracker) ConfirmMatch ¶ added in v0.4.0

func (t *Tracker) ConfirmMatch()

ConfirmMatch should be called when a candidate actually matches.

This increments the confirm counter, improving the effectiveness ratio. Always call this after verifying a candidate is a real match.

func (*Tracker) Find ¶ added in v0.4.0

func (t *Tracker) Find(haystack []byte, start int) int

Find returns the next candidate position, or -1 if none found or disabled.

This method:

Checks if prefilter is still active
Calls inner prefilter's Find
Increments candidate counter if found
Checks effectiveness at intervals

Unlike the inner prefilter, this may return -1 even when candidates exist if the prefilter has been disabled due to low effectiveness.

func (*Tracker) HeapBytes ¶ added in v0.4.0

func (t *Tracker) HeapBytes() int

HeapBytes returns the memory used by the inner prefilter.

func (*Tracker) Inner ¶ added in v0.4.0

func (t *Tracker) Inner() Prefilter

Inner returns the underlying prefilter.

Useful for accessing type-specific methods or bypassing tracking.

func (*Tracker) IsActive ¶ added in v0.4.0

func (t *Tracker) IsActive() bool

IsActive returns true if the prefilter is still being used.

When false, the caller should fall back to full regex search instead of using the prefilter.

func (*Tracker) IsComplete ¶ added in v0.4.0

func (t *Tracker) IsComplete() bool

IsComplete delegates to the inner prefilter's IsComplete.

func (*Tracker) LiteralLen ¶ added in v0.8.14

func (t *Tracker) LiteralLen() int

LiteralLen delegates to the inner prefilter's LiteralLen.

func (*Tracker) Reset ¶ added in v0.4.0

func (t *Tracker) Reset()

Reset clears statistics and re-enables the prefilter.

Useful for reusing the tracker across multiple searches.

func (*Tracker) Stats ¶ added in v0.4.0

func (t *Tracker) Stats() (candidates, confirms uint64, efficiency float64, active bool)

Stats returns the current tracking statistics.

Returns (candidates, confirms, efficiency, active).

type TrackerConfig ¶ added in v0.4.0

type TrackerConfig struct {
	// CheckInterval is how often to check effectiveness (in candidates).
	// Default: 64
	CheckInterval uint64

	// MinEfficiency is the minimum acceptable ratio of confirms/candidates.
	// If efficiency drops below this, prefilter is disabled.
	// Default: 0.1 (10%)
	MinEfficiency float64

	// WarmupPeriod is the minimum number of candidates before checking effectiveness.
	// This prevents premature disabling on small samples.
	// Default: 128
	WarmupPeriod uint64
}

TrackerConfig holds configuration for the effectiveness tracker.

func DefaultTrackerConfig ¶ added in v0.4.0

func DefaultTrackerConfig() TrackerConfig

DefaultTrackerConfig returns the default tracker configuration.

Default values are tuned based on rust-regex heuristics:

CheckInterval: 64 (check frequently but not every candidate)
MinEfficiency: 0.1 (10% - disable if >90% false positives)
WarmupPeriod: 128 (need enough samples for statistical significance)

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL