Documentation
¶
Overview ¶
Package prefilter provides fast candidate filtering for regex search using extracted literal sequences.
A prefilter is used to quickly reject positions in the haystack that cannot possibly match the full regex pattern. This provides dramatic speedup (10-100x) for patterns with literals, as we can use SIMD-accelerated search primitives instead of running the full automaton.
The package automatically selects the optimal prefilter strategy based on extracted literals:
- Single byte → MemchrPrefilter (SIMD byte search)
- Single substring → MemmemPrefilter (SIMD substring search)
- 2-32 literals (len ≥ 3) → TeddyPrefilter (SIMD multi-literal)
- Many literals → AhoCorasickPrefilter (automaton, future)
Example usage:
// Extract literals from regex pattern
re, _ := syntax.Parse("(hello|world)", syntax.Perl)
extractor := literal.New(literal.DefaultConfig())
prefixes := extractor.ExtractPrefixes(re)
// Build prefilter
builder := prefilter.NewBuilder(prefixes, nil)
pf := builder.Build()
// Use prefilter to find candidates
haystack := []byte("foo hello bar world baz")
pos := pf.Find(haystack, 0)
// pos == 4 (position of "hello")
Package prefilter implements the Teddy multi-pattern SIMD matching algorithm.
Teddy is a SIMD-accelerated algorithm for searching multiple patterns simultaneously. It uses vector shuffle instructions (PSHUFB) to perform parallel table lookups, identifying candidate match positions that are then verified against actual patterns.
The algorithm is particularly effective for 2-32 patterns with length >= 3 bytes, providing 20-50x speedup over naive multi-pattern search.
Algorithm Overview:
Build Phase (once per pattern set): - Assign patterns to buckets (0-7 for Slim Teddy) - Build nibble lookup masks for fingerprint bytes - Each mask byte contains bucket membership bits
Search Phase (per 16-byte chunk with SSSE3): - Load haystack chunk - Extract low/high nibbles - Use PSHUFB to lookup bucket membership - AND results to find candidates - Extract candidate positions
Verify Phase (per candidate): - Check patterns in candidate's bucket - Compare full pattern bytes - Return first match
Reference:
- BurntSushi/aho-corasick Teddy README
- docs/dev/research/TEDDY_IMPLEMENTATION_GUIDE.md
Package prefilter provides the Fat Teddy multi-pattern SIMD matching algorithm.
Fat Teddy is an AVX2-based extension of Teddy that uses 16 buckets (vs 8 in Slim Teddy). It processes 256-bit vectors and can efficiently handle 33-64 patterns.
Architecture:
- Slim Teddy: 8 buckets, SSSE3 (128-bit), 2-32 patterns
- Fat Teddy: 16 buckets, AVX2 (256-bit), 33-64 patterns
- Fallback: Aho-Corasick for >64 patterns or no AVX2
Mask Layout (256-bit / 32 bytes per mask):
- Low 128-bit lane (bytes 0-15): buckets 0-7
- High 128-bit lane (bytes 16-31): buckets 8-15
Reference:
- Rust aho-corasick: src/packed/teddy/generic.rs
- docs/dev/kanban/done/V010-FAT-001-research.md
Example (TeddyBasic) ¶
Example_teddyBasic demonstrates basic Teddy usage for multi-pattern search
package main
import (
"fmt"
"github.com/coregx/coregex/prefilter"
)
func main() {
patterns := [][]byte{
[]byte("foo"),
[]byte("bar"),
[]byte("baz"),
}
teddy := prefilter.NewTeddy(patterns, nil)
if teddy == nil {
fmt.Println("Teddy not available")
return
}
haystack := []byte("hello bar world")
pos := teddy.Find(haystack, 0)
fmt.Printf("Found at position: %d\n", pos)
}
Output: Found at position: 6
Example (TeddyConfig) ¶
Example_teddyConfig demonstrates custom Teddy configuration
package main
import (
"fmt"
"github.com/coregx/coregex/prefilter"
)
func main() {
patterns := [][]byte{
[]byte("apple"),
[]byte("banana"),
}
config := &prefilter.TeddyConfig{
MinPatterns: 2,
MaxPatterns: 8,
MinPatternLen: 3,
FingerprintLen: 1, // Use 1-byte fingerprint
}
teddy := prefilter.NewTeddy(patterns, config)
if teddy == nil {
fmt.Println("Teddy not available")
return
}
haystack := []byte("I like banana and apple")
pos := teddy.Find(haystack, 0)
fmt.Printf("Found at position: %d\n", pos)
}
Output: Found at position: 7
Example (TeddyHeapBytes) ¶
Example_teddyHeapBytes demonstrates memory usage reporting
package main
import (
"fmt"
"github.com/coregx/coregex/prefilter"
)
func main() {
patterns := [][]byte{
[]byte("foo"),
[]byte("bar"),
[]byte("baz"),
[]byte("qux"),
}
teddy := prefilter.NewTeddy(patterns, nil)
if teddy == nil {
fmt.Println("Teddy not available")
return
}
fmt.Printf("Teddy uses approximately %d bytes of heap memory\n", teddy.HeapBytes())
// Output will vary, but typically < 1KB
// Example output: Teddy uses approximately 376 bytes of heap memory
}
Example (TeddyMultipleMatches) ¶
Example_teddyMultipleMatches demonstrates finding multiple matches
package main
import (
"fmt"
"github.com/coregx/coregex/prefilter"
)
func main() {
patterns := [][]byte{
[]byte("ERROR"),
[]byte("WARNING"),
}
teddy := prefilter.NewTeddy(patterns, nil)
if teddy == nil {
fmt.Println("Teddy not available")
return
}
haystack := []byte("INFO: all good\nERROR: something broke\nWARNING: check this")
// Find all matches
matches := []int{}
start := 0
for {
pos := teddy.Find(haystack, start)
if pos == -1 {
break
}
matches = append(matches, pos)
start = pos + 1 // Continue searching after this match
}
fmt.Printf("Found %d matches at positions: %v\n", len(matches), matches)
}
Output: Found 2 matches at positions: [15 38]
Example (TeddyNoMatch) ¶
Example_teddyNoMatch demonstrates no-match behavior
package main
import (
"fmt"
"github.com/coregx/coregex/prefilter"
)
func main() {
patterns := [][]byte{
[]byte("foo"),
[]byte("bar"),
}
teddy := prefilter.NewTeddy(patterns, nil)
if teddy == nil {
fmt.Println("Teddy not available")
return
}
haystack := []byte("hello world")
pos := teddy.Find(haystack, 0)
if pos == -1 {
fmt.Println("No match found")
}
}
Output: No match found
Example (TeddyNotSuitable) ¶
Example_teddyNotSuitable demonstrates when Teddy is not suitable
package main
import (
"fmt"
"github.com/coregx/coregex/prefilter"
)
func main() {
// Too few patterns (need >= 2)
teddy1 := prefilter.NewTeddy([][]byte{[]byte("foo")}, nil)
fmt.Printf("1 pattern: %v\n", teddy1 != nil)
// Pattern too short (need >= 3 bytes)
teddy2 := prefilter.NewTeddy([][]byte{[]byte("ab"), []byte("cd")}, nil)
fmt.Printf("Short patterns: %v\n", teddy2 != nil)
// Too many patterns (max 32)
manyPatterns := make([][]byte, 35)
for i := range manyPatterns {
manyPatterns[i] = []byte(fmt.Sprintf("pat%02d", i))
}
teddy3 := prefilter.NewTeddy(manyPatterns, nil)
fmt.Printf("Too many patterns: %v\n", teddy3 != nil)
}
Output: 1 pattern: false Short patterns: false Too many patterns: false
Example (TeddyVsNaive) ¶
Example_teddyVsNaive demonstrates performance comparison
package main
import (
"fmt"
"github.com/coregx/coregex/prefilter"
)
func main() {
patterns := [][]byte{
[]byte("pattern1"),
[]byte("pattern2"),
[]byte("pattern3"),
}
teddy := prefilter.NewTeddy(patterns, nil)
if teddy == nil {
fmt.Println("Teddy not available")
return
}
haystack := make([]byte, 10000)
for i := range haystack {
haystack[i] = 'x'
}
copy(haystack[5000:], "pattern2")
pos := teddy.Find(haystack, 0)
fmt.Printf("Found pattern at position: %d\n", pos)
fmt.Println("Teddy provides 20-50x speedup over naive multi-pattern search")
}
Output: Found pattern at position: 5000 Teddy provides 20-50x speedup over naive multi-pattern search
Example (TeddyWithRegex) ¶
Example_teddyWithRegex demonstrates Teddy with regex literal extraction
package main
import (
"fmt"
"regexp/syntax"
"github.com/coregx/coregex/literal"
"github.com/coregx/coregex/prefilter"
)
func main() {
// Parse regex with alternation (use distinct patterns to ensure 3 prefixes)
re, _ := syntax.Parse("(abc|def|ghi)", syntax.Perl)
// Extract prefixes
extractor := literal.New(literal.DefaultConfig())
prefixes := extractor.ExtractPrefixes(re)
fmt.Printf("Extracted %d prefixes\n", prefixes.Len())
// Build prefilter (will select Teddy for 3 patterns)
builder := prefilter.NewBuilder(prefixes, nil)
pf := builder.Build()
if pf == nil {
fmt.Println("No prefilter available")
return
}
// Use prefilter
haystack := []byte("test def test")
pos := pf.Find(haystack, 0)
fmt.Printf("Found at position: %d\n", pos)
}
Output: Extracted 3 prefixes Found at position: 5
Index ¶
- Constants
- type Builder
- type DigitPrefilter
- type FatTeddy
- func (t *FatTeddy) Find(haystack []byte, start int) int
- func (t *FatTeddy) FindMatch(haystack []byte, start int) (int, int)
- func (t *FatTeddy) HeapBytes() int
- func (t *FatTeddy) IsComplete() bool
- func (t *FatTeddy) LiteralLen() int
- func (t *FatTeddy) MinimumLen() int
- func (t *FatTeddy) PatternCount() int
- func (t *FatTeddy) Patterns() [][]byte
- type FatTeddyConfig
- type MatchFinder
- type Prefilter
- type Teddy
- type TeddyConfig
- type TrackedPrefilter
- type Tracker
- func (t *Tracker) ConfirmMatch()
- func (t *Tracker) Find(haystack []byte, start int) int
- func (t *Tracker) HeapBytes() int
- func (t *Tracker) Inner() Prefilter
- func (t *Tracker) IsActive() bool
- func (t *Tracker) IsComplete() bool
- func (t *Tracker) LiteralLen() int
- func (t *Tracker) Reset()
- func (t *Tracker) Stats() (candidates, confirms uint64, efficiency float64, active bool)
- type TrackerConfig
Examples ¶
- Package (TeddyBasic)
- Package (TeddyConfig)
- Package (TeddyHeapBytes)
- Package (TeddyMultipleMatches)
- Package (TeddyNoMatch)
- Package (TeddyNotSuitable)
- Package (TeddyVsNaive)
- Package (TeddyWithRegex)
- Builder
- Builder (Alternation)
- Builder (NoPrefilter)
- Builder (SingleByte)
- Builder (Substring)
- Builder (WithSuffixes)
Constants ¶
const ( // MaxSlimTeddyPatterns is the maximum number of patterns Slim Teddy (SSSE3) can handle. // Slim Teddy uses 8 buckets with modulo distribution. // Reference: Rust aho-corasick uses 32 patterns as threshold for Fat Teddy. MaxSlimTeddyPatterns = 32 // MaxFatTeddyPatterns is the maximum number of patterns Fat Teddy (AVX2) can handle. // Fat Teddy uses 16 buckets and requires AVX2 (256-bit vectors). // Reference: Rust aho-corasick supports up to 64 patterns with Fat Teddy. MaxFatTeddyPatterns = 64 // MaxTeddyPatterns is the maximum patterns for auto-selection. // If AVX2 available: 64 (Fat Teddy), otherwise: 32 (Slim Teddy) MaxTeddyPatterns = MaxFatTeddyPatterns // MinTeddyPatterns is the minimum number of patterns required for Teddy MinTeddyPatterns = 2 // MinTeddyPatternLen is the minimum pattern length for effective Teddy search. // Patterns shorter than 3 bytes have high false positive rates. MinTeddyPatternLen = 3 // MaxFingerprintLen is the maximum fingerprint length (1-4 bytes). // We use 2-byte fingerprint by default for better false positive rejection. MaxFingerprintLen = 4 // NumBucketsSlim is the number of buckets in Slim Teddy (8 buckets, 8 bits per mask byte) NumBucketsSlim = 8 // NumBucketsFat is the number of buckets in Fat Teddy (16 buckets, requires AVX2) // Low 128-bit lane = buckets 0-7, High 128-bit lane = buckets 8-15 NumBucketsFat = 16 )
Constants for Teddy configuration
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Builder ¶
type Builder struct {
// contains filtered or unexported fields
}
Builder constructs the optimal prefilter from extracted literals.
The builder analyzes the literal sequences (prefixes and suffixes) and selects the most efficient prefilter strategy. The selection is based on:
- Number of literals
- Length of literals
- Completeness flag
Selection strategy (in order of preference):
- Single byte literal → MemchrPrefilter (fastest)
- Single substring literal → MemmemPrefilter (very fast)
- 2-32 literals, len≥3 → TeddyPrefilter (SIMD multi-pattern)
- Many literals → AhoCorasickPrefilter (automaton, future)
- No suitable literals → nil (no prefilter)
Example:
// Build from extracted prefixes
builder := prefilter.NewBuilder(prefixes, nil)
pf := builder.Build()
if pf != nil {
pos := pf.Find(haystack, 0)
}
Example ¶
ExampleBuilder demonstrates building a prefilter from a regex pattern.
package main
import (
"fmt"
"regexp/syntax"
"github.com/coregx/coregex/literal"
"github.com/coregx/coregex/prefilter"
)
func main() {
// Parse the regex pattern
re, _ := syntax.Parse("hello", syntax.Perl)
// Extract prefix literals
extractor := literal.New(literal.DefaultConfig())
prefixes := extractor.ExtractPrefixes(re)
// Build the prefilter
builder := prefilter.NewBuilder(prefixes, nil)
pf := builder.Build()
if pf != nil {
// Use the prefilter to find candidates
haystack := []byte("foo hello world")
pos := pf.Find(haystack, 0)
fmt.Printf("Found candidate at position %d\n", pos)
}
}
Output: Found candidate at position 4
Example (Alternation) ¶
ExampleBuilder_alternation demonstrates prefilter with alternations.
package main
import (
"fmt"
"regexp/syntax"
"github.com/coregx/coregex/literal"
"github.com/coregx/coregex/prefilter"
)
func main() {
// Pattern with alternation - Go parser may factorize common prefixes
re, _ := syntax.Parse("(foo|foobar|food)", syntax.Perl)
extractor := literal.New(literal.DefaultConfig())
prefixes := extractor.ExtractPrefixes(re)
// The Go parser factorizes to foo[dbr] or similar
// After minimization, should extract "foo"
builder := prefilter.NewBuilder(prefixes, nil)
pf := builder.Build()
if pf != nil {
haystack := []byte("test foobar end")
pos := pf.Find(haystack, 0)
fmt.Printf("Found candidate at position %d\n", pos)
fmt.Printf("Complete match: %v\n", pf.IsComplete())
}
}
Output: Found candidate at position 5 Complete match: false
Example (NoPrefilter) ¶
ExampleBuilder_noPrefilter demonstrates patterns with no available prefilter.
package main
import (
"fmt"
"regexp/syntax"
"github.com/coregx/coregex/literal"
"github.com/coregx/coregex/prefilter"
)
func main() {
// Pattern with no extractable literals (wildcard)
re, _ := syntax.Parse(".*", syntax.Perl)
extractor := literal.New(literal.DefaultConfig())
prefixes := extractor.ExtractPrefixes(re)
builder := prefilter.NewBuilder(prefixes, nil)
pf := builder.Build()
if pf == nil {
fmt.Println("No prefilter available, must use full regex engine")
}
}
Output: No prefilter available, must use full regex engine
Example (SingleByte) ¶
ExampleBuilder_singleByte demonstrates prefilter selection for single byte patterns.
package main
import (
"fmt"
"regexp/syntax"
"github.com/coregx/coregex/literal"
"github.com/coregx/coregex/prefilter"
)
func main() {
// Pattern with single byte literal
re, _ := syntax.Parse("[a].*", syntax.Perl)
extractor := literal.New(literal.DefaultConfig())
prefixes := extractor.ExtractPrefixes(re)
builder := prefilter.NewBuilder(prefixes, nil)
pf := builder.Build()
// Should select MemchrPrefilter for single byte
haystack := []byte("xxxayyy")
pos := pf.Find(haystack, 0)
fmt.Printf("Found 'a' at position %d\n", pos)
fmt.Printf("Heap usage: %d bytes\n", pf.HeapBytes())
}
Output: Found 'a' at position 3 Heap usage: 0 bytes
Example (Substring) ¶
ExampleBuilder_substring demonstrates prefilter selection for substring patterns.
package main
import (
"fmt"
"regexp/syntax"
"github.com/coregx/coregex/literal"
"github.com/coregx/coregex/prefilter"
)
func main() {
// Pattern with substring literal
re, _ := syntax.Parse("pattern.*", syntax.Perl)
extractor := literal.New(literal.DefaultConfig())
prefixes := extractor.ExtractPrefixes(re)
builder := prefilter.NewBuilder(prefixes, nil)
pf := builder.Build()
// Should select MemmemPrefilter for substring
haystack := []byte("test pattern matching")
pos := pf.Find(haystack, 0)
fmt.Printf("Found 'pattern' at position %d\n", pos)
fmt.Printf("Heap usage: %d bytes\n", pf.HeapBytes())
}
Output: Found 'pattern' at position 5 Heap usage: 7 bytes
Example (WithSuffixes) ¶
ExampleBuilder_withSuffixes demonstrates using suffixes when prefixes are empty.
package main
import (
"fmt"
"regexp/syntax"
"github.com/coregx/coregex/literal"
"github.com/coregx/coregex/prefilter"
)
func main() {
// Pattern with suffix but no prefix
re, _ := syntax.Parse(".*world", syntax.Perl)
extractor := literal.New(literal.DefaultConfig())
prefixes := extractor.ExtractPrefixes(re) // Will be empty
suffixes := extractor.ExtractSuffixes(re) // Will have "world"
// Builder will use suffixes when prefixes are empty
builder := prefilter.NewBuilder(prefixes, suffixes)
pf := builder.Build()
if pf != nil {
haystack := []byte("hello world")
pos := pf.Find(haystack, 0)
fmt.Printf("Found suffix at position %d\n", pos)
}
}
Output: Found suffix at position 6
func NewBuilder ¶
NewBuilder creates a new prefilter builder from extracted literal sequences.
Parameters:
prefixes - literals that must appear at the start of matches (from ExtractPrefixes) suffixes - literals that must appear at the end of matches (from ExtractSuffixes)
The builder prefers prefixes over suffixes because forward search is more natural and cache-friendly. Suffixes are only used if prefixes are empty.
Either or both can be nil, indicating no literals of that type were extracted.
Example:
extractor := literal.New(literal.DefaultConfig()) prefixes := extractor.ExtractPrefixes(re) suffixes := extractor.ExtractSuffixes(re) builder := prefilter.NewBuilder(prefixes, suffixes)
func (*Builder) Build ¶
Build constructs the best prefilter for the given literals.
Returns nil if no effective prefilter can be built (e.g., no literals, or literals are too complex for available strategies).
The selection logic:
- Prefer prefixes over suffixes (forward search)
- Single byte → Memchr (10-15x speedup)
- Single substring → Memmem (5-10x speedup)
- Multiple literals (2-8, len≥3) → Teddy (20-50x speedup)
- Many literals → (future) AhoCorasick
- Otherwise → nil
Example:
builder := prefilter.NewBuilder(prefixes, nil)
pf := builder.Build()
if pf == nil {
// No prefilter available, use full regex engine
return fullRegexSearch(haystack, pattern)
}
// Use prefilter for fast candidate finding
pos := pf.Find(haystack, 0)
type DigitPrefilter ¶ added in v0.9.0
type DigitPrefilter struct{}
DigitPrefilter implements the Prefilter interface for patterns that must start with ASCII digits [0-9].
It uses SIMD-accelerated digit scanning to quickly find candidate positions, allowing the regex engine to skip large regions of non-digit text. This is particularly effective for:
- IP address patterns: `(?:25[0-5]|2[0-4][0-9]|...)`
- Numeric validators: `[1-9][0-9]*`
- Phone number patterns: `\d{3}-\d{3}-\d{4}`
Performance characteristics:
- Small inputs (< 32 bytes): 2-3x faster than byte-by-byte
- Medium inputs (4KB): 8-10x faster
- Large inputs (64KB+): 15-20x faster
This prefilter is NOT complete - finding a digit is only a candidate position. The full regex must be verified at that position.
Example usage (internal):
pf := NewDigitPrefilter()
pos := pf.Find(haystack, 0)
for pos != -1 {
if regexMatchesAt(haystack, pos) {
return pos
}
pos = pf.Find(haystack, pos+1)
}
func NewDigitPrefilter ¶ added in v0.9.0
func NewDigitPrefilter() *DigitPrefilter
NewDigitPrefilter creates a prefilter for patterns that must start with digits.
This prefilter uses simd.MemchrDigitAt internally for fast digit scanning. It is designed for patterns where all alternation branches must begin with a digit [0-9], enabling efficient skip-ahead through non-digit regions.
Example patterns that benefit:
- IP address: `25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]`
- Numeric: `[0-9]+`
- Date: `[0-9]{4}-[0-9]{2}-[0-9]{2}`
func (*DigitPrefilter) Find ¶ added in v0.9.0
func (p *DigitPrefilter) Find(haystack []byte, start int) int
Find returns the index of the first digit at or after 'start'. Returns -1 if no digit is found in the remaining haystack.
This method uses SIMD acceleration on AMD64 with AVX2 support. For inputs >= 32 bytes, processes 32 bytes per iteration.
Parameters:
- haystack: the byte slice to search
- start: the starting position (inclusive)
Returns:
- index >= start if a digit candidate is found
- -1 if no digit exists at or after start
func (*DigitPrefilter) HeapBytes ¶ added in v0.9.0
func (p *DigitPrefilter) HeapBytes() int
HeapBytes returns 0 because DigitPrefilter uses no heap allocation.
The prefilter is stateless and relies on simd.MemchrDigitAt which operates directly on the input slice without additional allocations.
func (*DigitPrefilter) IsComplete ¶ added in v0.9.0
func (p *DigitPrefilter) IsComplete() bool
IsComplete returns false because finding a digit is only a candidate position. The full regex must be verified at that position to confirm a match.
Unlike literal prefilters (Memchr, Memmem) which can sometimes guarantee a match, digit prefiltering only narrows the search space - the actual pattern may still fail to match at the digit position.
func (*DigitPrefilter) LiteralLen ¶ added in v0.9.0
func (p *DigitPrefilter) LiteralLen() int
LiteralLen returns 0 because DigitPrefilter doesn't match fixed-length literals.
The prefilter finds digit characters, but the actual match length depends on the full regex pattern being verified.
type FatTeddy ¶ added in v0.10.0
type FatTeddy struct {
// contains filtered or unexported fields
}
FatTeddy is an AVX2-accelerated multi-pattern searcher using 16 buckets.
It uses the same algorithm as Slim Teddy but with 256-bit vectors, enabling efficient search for 33-64 patterns.
Thread-safety: FatTeddy is safe for concurrent use (all state is immutable).
func NewFatTeddy ¶ added in v0.10.0
func NewFatTeddy(patterns [][]byte, config *FatTeddyConfig) *FatTeddy
NewFatTeddy creates a new Fat Teddy searcher for the given patterns.
Returns nil if patterns are not suitable for Fat Teddy:
- Fewer than MinPatterns (default: 2)
- More than MaxPatterns (default: 64)
- Any pattern shorter than MinPatternLen (default: 3)
Note: For 2-32 patterns, prefer Slim Teddy (SSSE3) as it has lower overhead. Fat Teddy is optimal for 33-64 patterns on AVX2-capable CPUs.
Example:
patterns := make([][]byte, 50)
for i := range patterns {
patterns[i] = []byte(fmt.Sprintf("pattern%02d", i))
}
fatTeddy := prefilter.NewFatTeddy(patterns, nil)
if fatTeddy != nil {
pos := fatTeddy.Find(haystack, 0)
}
func (*FatTeddy) Find ¶ added in v0.10.0
Find returns the index of the first match starting at or after 'start'.
This implements the Prefilter interface. It uses AVX2 SIMD search to find candidates, then verifies full pattern matches.
Returns -1 if no match is found.
func (*FatTeddy) FindMatch ¶ added in v0.10.0
FindMatch returns the start and end positions of the first match.
func (*FatTeddy) IsComplete ¶ added in v0.10.0
IsComplete implements Prefilter.IsComplete.
func (*FatTeddy) LiteralLen ¶ added in v0.10.0
LiteralLen implements Prefilter.LiteralLen.
func (*FatTeddy) MinimumLen ¶ added in v0.10.0
MinimumLen returns the minimum haystack length for efficient SIMD search.
For haystacks smaller than this, the SIMD setup overhead exceeds the benefit. Callers should use a fallback strategy (like Aho-Corasick) for small inputs.
This follows Rust regex's minimum_len() approach:
- AVX2 processes 32 bytes per iteration
- With 2-byte fingerprint, minimum is 32 + 1 = 33 bytes
- We use 64 as conservative threshold based on benchmarks showing Aho-Corasick is ~2x faster than Fat Teddy on ~37 byte inputs
Reference: rust-aho-corasick/src/packed/teddy/builder.rs:585
func (*FatTeddy) PatternCount ¶ added in v0.10.0
PatternCount returns the number of patterns in this Fat Teddy searcher. Used by meta-engine to build appropriate fallback strategy.
type FatTeddyConfig ¶ added in v0.10.0
type FatTeddyConfig struct {
// MinPatterns is the minimum patterns required (default: 2)
MinPatterns int
// MaxPatterns is the maximum patterns allowed (default: 64)
MaxPatterns int
// MinPatternLen is the minimum pattern length required (default: 3)
MinPatternLen int
// FingerprintLen is the number of fingerprint bytes to use (1-4, default: 2)
FingerprintLen int
}
FatTeddyConfig configures Fat Teddy construction.
func DefaultFatTeddyConfig ¶ added in v0.10.0
func DefaultFatTeddyConfig() *FatTeddyConfig
DefaultFatTeddyConfig returns the default Fat Teddy configuration.
type MatchFinder ¶ added in v0.8.18
type MatchFinder interface {
// FindMatch returns the start and end positions of the first match.
// Returns (start, end) if found, (-1, -1) if not found.
// The matched bytes are haystack[start:end].
FindMatch(haystack []byte, start int) (start2, end int)
}
MatchFinder is an optional interface for prefilters that can return the matched range directly, avoiding the need for NFA verification.
This is particularly useful for multi-pattern prefilters like Teddy where the matched pattern length varies.
type Prefilter ¶
type Prefilter interface {
// Find returns the index of the first candidate match starting at or after
// 'start', or -1 if no candidate is found.
//
// A candidate match means a position where one of the prefilter literals
// was found. This does NOT guarantee a full regex match - the caller must
// verify the match using the full regex engine (unless IsComplete() is true).
//
// Parameters:
// haystack - the byte buffer to search
// start - the starting position (must be >= 0 and <= len(haystack))
//
// Returns:
// index >= start if a candidate is found
// -1 if no candidate exists at or after start
//
// Example:
//
// pf := /* some prefilter */
// pos := pf.Find(haystack, 0)
// for pos != -1 {
// // Verify match at pos using full regex engine
// if fullRegexMatches(haystack, pos) {
// return pos
// }
// // Continue searching
// pos = pf.Find(haystack, pos+1)
// }
Find(haystack []byte, start int) int
// IsComplete returns true if a prefilter match guarantees a full regex match.
//
// When true, the regex engine can skip verification and directly return the
// prefilter result. This is the case when:
// - The regex is an exact literal (e.g., /hello/)
// - The literal sequence is complete and non-overlapping
//
// Most prefilters return false, meaning verification is required.
//
// Example:
//
// if pf.IsComplete() {
// // Direct return, no verification needed
// return pf.Find(haystack, start)
// } else {
// // Must verify with full regex engine
// pos := pf.Find(haystack, start)
// if pos != -1 && fullRegexMatches(haystack, pos) {
// return pos
// }
// }
IsComplete() bool
// LiteralLen returns the length of the matched literal when IsComplete() is true.
//
// This allows the regex engine to calculate exact match bounds without
// running the full automata: end = start + LiteralLen().
//
// Returns:
// > 0 if IsComplete() is true (the length of the complete literal)
// 0 if IsComplete() is false or if the prefilter matches variable lengths
//
// Example:
//
// if pf.IsComplete() {
// pos := pf.Find(haystack, start)
// if pos != -1 {
// matchEnd := pos + pf.LiteralLen()
// return haystack[pos:matchEnd]
// }
// }
LiteralLen() int
// HeapBytes returns the number of bytes of heap memory used by this prefilter.
//
// This is useful for profiling and memory budgeting in the regex engine.
// Simple prefilters (Memchr, Memmem) typically return 0 as they don't
// allocate heap memory. Complex prefilters (Teddy, AhoCorasick) may use
// significant memory for lookup tables.
//
// Returns:
// 0 if no heap allocation
// positive value indicating heap bytes used
HeapBytes() int
}
Prefilter is used to quickly find candidate match positions before running the full regex engine.
The prefilter scans the haystack for literals extracted from the regex pattern. When a literal is found, that position is returned as a candidate. The regex engine then verifies if a full match exists at that position.
Key methods:
- Find: returns the next candidate position
- IsComplete: indicates if prefilter match is sufficient (no verification needed)
- HeapBytes: returns memory usage for profiling
func WrapWithTracking ¶ added in v0.4.0
WrapWithTracking creates a Prefilter that includes effectiveness tracking.
The returned prefilter can be used anywhere a regular Prefilter is expected, but also supports tracking methods via type assertion:
pf := prefilter.WrapWithTracking(innerPf)
// Use as regular prefilter
pos := pf.Find(haystack, 0)
// Access tracking via type assertion
if tracked, ok := pf.(*prefilter.TrackedPrefilter); ok {
tracked.ConfirmMatch()
_, _, eff, _ := tracked.Stats()
}
type Teddy ¶
type Teddy struct {
// contains filtered or unexported fields
}
Teddy is a SIMD-accelerated multi-pattern searcher.
It uses bucket-based filtering to quickly identify candidate positions, then verifies full pattern matches. This provides dramatic speedup for alternation patterns like (foo|bar|baz).
Thread-safety: Teddy is safe for concurrent use (all state is immutable).
func NewTeddy ¶
func NewTeddy(patterns [][]byte, config *TeddyConfig) *Teddy
NewTeddy creates a new Teddy searcher for the given patterns.
Returns nil if patterns are not suitable for Teddy:
- Fewer than MinPatterns (default: 2)
- More than MaxPatterns (default: 8)
- Any pattern shorter than MinPatternLen (default: 3)
Parameters:
patterns - the byte patterns to search for config - configuration (nil uses DefaultTeddyConfig)
Example:
patterns := [][]byte{
[]byte("foo"),
[]byte("bar"),
[]byte("baz"),
}
teddy := prefilter.NewTeddy(patterns, nil)
if teddy != nil {
pos := teddy.Find(haystack, 0)
}
func (*Teddy) Find ¶
Find returns the index of the first candidate match starting at or after 'start'.
This implements the Prefilter interface. It uses SIMD search to find candidates, then verifies full pattern matches.
Returns -1 if no match is found.
Algorithm:
- If haystack too short: use scalar fallback
- SIMD search: find candidate positions using nibble masks
- For each candidate: verify full pattern match
- Return first match position
Thread-safety: Safe for concurrent use (all state is immutable).
func (*Teddy) FindMatch ¶ added in v0.8.18
FindMatch returns the start and end positions of the first match. This is more efficient than Find() when the pattern length varies, as it avoids the need for a separate NFA search to find the end.
Returns (start, end) if found, (-1, -1) if not found. The matched bytes are haystack[start:end].
func (*Teddy) HeapBytes ¶
HeapBytes implements Prefilter.HeapBytes.
Returns approximate heap memory used by Teddy:
- Mask tables: 264 bytes (fixed)
- Pattern storage: sum of pattern lengths
- Bucket arrays: ~8 * 8 bytes
Total: typically < 1KB for 2-8 patterns.
func (*Teddy) IsComplete ¶
IsComplete implements Prefilter.IsComplete.
Returns true if finding a Teddy match guarantees a full regex match. This is only true if all patterns are exact literals with no overlap.
For most use cases, this returns false (verification required).
func (*Teddy) LiteralLen ¶ added in v0.8.14
LiteralLen implements Prefilter.LiteralLen.
When all patterns have the same length and complete=true, returns that uniform length. Otherwise returns 0.
type TeddyConfig ¶
type TeddyConfig struct {
// MinPatterns is the minimum patterns required (default: 2)
MinPatterns int
// MaxPatterns is the maximum patterns allowed (default: 8)
MaxPatterns int
// MinPatternLen is the minimum pattern length required (default: 3)
MinPatternLen int
// FingerprintLen is the number of fingerprint bytes to use (1-4, default: 1)
// Higher values reduce false positives but increase computation cost.
FingerprintLen int
}
TeddyConfig configures Teddy construction.
func DefaultTeddyConfig ¶
func DefaultTeddyConfig() *TeddyConfig
DefaultTeddyConfig returns the default Slim Teddy configuration. For Fat Teddy (33-64 patterns), use NewFatTeddy with DefaultFatTeddyConfig.
type TrackedPrefilter ¶ added in v0.4.0
type TrackedPrefilter struct {
*Tracker
}
TrackedPrefilter wraps a Prefilter to implement the Prefilter interface while providing tracking capabilities.
This allows using a tracked prefilter anywhere a regular Prefilter is expected.
func (*TrackedPrefilter) Find ¶ added in v0.4.0
func (tp *TrackedPrefilter) Find(haystack []byte, start int) int
Find implements Prefilter.Find with tracking.
func (*TrackedPrefilter) HeapBytes ¶ added in v0.4.0
func (tp *TrackedPrefilter) HeapBytes() int
HeapBytes implements Prefilter.HeapBytes.
func (*TrackedPrefilter) IsComplete ¶ added in v0.4.0
func (tp *TrackedPrefilter) IsComplete() bool
IsComplete implements Prefilter.IsComplete.
func (*TrackedPrefilter) LiteralLen ¶ added in v0.8.14
func (tp *TrackedPrefilter) LiteralLen() int
LiteralLen implements Prefilter.LiteralLen.
type Tracker ¶ added in v0.4.0
type Tracker struct {
// contains filtered or unexported fields
}
Tracker wraps a Prefilter with effectiveness tracking.
The tracker monitors the ratio of confirmed matches to candidates found. When effectiveness drops below a threshold (too many false positives), the prefilter is automatically disabled to prevent catastrophic slowdown.
This implements the rust-regex pattern where prefilters can be "retired" when they become ineffective for certain inputs.
Algorithm:
- Track candidates (prefilter finds) and confirms (actual matches)
- Every N candidates, check effectiveness ratio
- If ratio < threshold, disable prefilter
- Once disabled, never re-enable (for this search)
Example usage:
pf := prefilter.NewBuilder(prefixes, nil).Build()
tracker := prefilter.NewTracker(pf)
for {
if !tracker.IsActive() {
// Prefilter disabled, use full scan
break
}
pos := tracker.Find(haystack, start)
if pos == -1 {
break
}
if fullRegexMatches(haystack, pos) {
tracker.ConfirmMatch()
return pos
}
start = pos + 1
}
func NewTracker ¶ added in v0.4.0
NewTracker creates a new tracker for the given prefilter with default config.
Returns nil if the inner prefilter is nil.
func NewTrackerWithConfig ¶ added in v0.4.0
func NewTrackerWithConfig(inner Prefilter, config TrackerConfig) *Tracker
NewTrackerWithConfig creates a new tracker with custom configuration.
Returns nil if the inner prefilter is nil.
func (*Tracker) ConfirmMatch ¶ added in v0.4.0
func (t *Tracker) ConfirmMatch()
ConfirmMatch should be called when a candidate actually matches.
This increments the confirm counter, improving the effectiveness ratio. Always call this after verifying a candidate is a real match.
func (*Tracker) Find ¶ added in v0.4.0
Find returns the next candidate position, or -1 if none found or disabled.
This method:
- Checks if prefilter is still active
- Calls inner prefilter's Find
- Increments candidate counter if found
- Checks effectiveness at intervals
Unlike the inner prefilter, this may return -1 even when candidates exist if the prefilter has been disabled due to low effectiveness.
func (*Tracker) HeapBytes ¶ added in v0.4.0
HeapBytes returns the memory used by the inner prefilter.
func (*Tracker) Inner ¶ added in v0.4.0
Inner returns the underlying prefilter.
Useful for accessing type-specific methods or bypassing tracking.
func (*Tracker) IsActive ¶ added in v0.4.0
IsActive returns true if the prefilter is still being used.
When false, the caller should fall back to full regex search instead of using the prefilter.
func (*Tracker) IsComplete ¶ added in v0.4.0
IsComplete delegates to the inner prefilter's IsComplete.
func (*Tracker) LiteralLen ¶ added in v0.8.14
LiteralLen delegates to the inner prefilter's LiteralLen.
type TrackerConfig ¶ added in v0.4.0
type TrackerConfig struct {
// CheckInterval is how often to check effectiveness (in candidates).
// Default: 64
CheckInterval uint64
// MinEfficiency is the minimum acceptable ratio of confirms/candidates.
// If efficiency drops below this, prefilter is disabled.
// Default: 0.1 (10%)
MinEfficiency float64
// WarmupPeriod is the minimum number of candidates before checking effectiveness.
// This prevents premature disabling on small samples.
// Default: 128
WarmupPeriod uint64
}
TrackerConfig holds configuration for the effectiveness tracker.
func DefaultTrackerConfig ¶ added in v0.4.0
func DefaultTrackerConfig() TrackerConfig
DefaultTrackerConfig returns the default tracker configuration.
Default values are tuned based on rust-regex heuristics:
- CheckInterval: 64 (check frequently but not every candidate)
- MinEfficiency: 0.1 (10% - disable if >90% false positives)
- WarmupPeriod: 128 (need enough samples for statistical significance)