aho_corasick

package module

v0.6.0 Latest Latest Go to latest Published: Apr 30, 2024 License: MIT Imports: 11 Imported by: 1

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/wasilibs/go-aho-corasick

Links

Open Source Insights

README ¶

go-aho-corasick

go-aho-corasick is a library for aho-corasick which wraps the Rust BurntSushi/aho-corasick library. It provides the same API as petar-dambovaliev/aho-corasick and is a drop-in replacement with full API and behavior compatibility. By default, aho-corasick is packaged as a WebAssembly module and accessed with the pure Go runtime, wazero. This means that it is compatible with any Go application, regardless of availability of cgo.

For TinyGo applications being built for WASM, this library will perform significantly better. For Go applications, the performance difference varies by case. petar-dambovaliev/aho-corasick performs quite well, but this library seems to perform better with 4+ patterns, and larger ones. The API is a drop-in replacement, so it is best to try it and benchmark to see the effect. Notably, for unknown input where a decision on using DFA or not cannot be made beforehand, the automatic detection in this library will often perform much better.

Behavior differences

This library will automatically pick an implementations among NFA, contiguous NFA, and DFA unless the DFA flag is explicitly passed when constructing. This generally results in better performance than always using one or the other.

Usage

go-aho-corasick is a standard Go library package and can be added to a go.mod file. It will work fine in Go or TinyGo projects.

go get github.com/wasilibs/go-aho-corasick

Because the library is a drop-in replacement for petar-dambovaliev/aho-corasick, an import rewrite can make migrating code to use it simple.

import "github.com/peter-dambovaliev/aho-corasick"

can be changed to

import "github.com/wasilibs/go-aho-corasick"

cgo

This library also supports opting into using cgo to wrap BurntSushi/aho-corasick instead of using WebAssembly. This requires having a built version of the library available - because Rust libraries are not easily installed for use with pkg-config, environment variables like CGO_LDFLAGS and LD_LIBRARY_PATH may be needed to find the library at build and runtime. The build tag aho_corasick_cgo can be used to enable cgo support.

Performance

Benchmarks are run against every commit in the bench workflow. GitHub action runners are highly virtualized and do not have stable performance across runs, but the relative numbers within a run should still be somewhat, though not precisely, informative.

Microbenchmarks

Microbenchmarks are the same as included in the reference Rust and Go libraries. Full results can be viewed in the workflow, a sample of results for one run looks like this

BurntSushi/random/onebyte/match/default-2                           51.3µs ± 3%     147.5µs ± 0%          74.7µs ± 1%
BurntSushi/random/onebyte/match/dfa-2                               64.4µs ± 0%     148.8µs ± 0%          76.9µs ± 4%
BurntSushi/random/onebyte/nomatch/default-2                         5.67µs ± 2%      7.33µs ±12%          1.33µs ±19%
BurntSushi/random/onebyte/nomatch/dfa-2                             5.79µs ± 1%      6.98µs ± 5%          1.50µs ± 1%
BurntSushi/random/twobytes/match/default-2                          53.9µs ± 0%     156.3µs ± 0%          78.2µs ± 0%
BurntSushi/random/twobytes/match/dfa-2                              66.6µs ± 0%     156.9µs ± 1%          79.9µs ± 3%
BurntSushi/random/twobytes/nomatch/default-2                        10.0µs ± 0%      10.9µs ± 0%           1.6µs ± 1%
BurntSushi/random/twobytes/nomatch/dfa-2                            10.0µs ± 1%      11.0µs ± 1%           1.6µs ± 1%
BurntSushi/random/threebytes/match/default-2                        55.7µs ± 1%     162.1µs ± 0%          81.0µs ± 2%
BurntSushi/random/threebytes/match/dfa-2                            67.3µs ± 0%     162.5µs ± 0%          81.4µs ± 1%
BurntSushi/random/threebytes/nomatch/default-2                      10.0µs ± 0%      13.0µs ± 0%           1.7µs ± 0%
BurntSushi/random/threebytes/nomatch/dfa-2                          10.0µs ± 0%      12.9µs ± 1%           1.7µs ± 0%
BurntSushi/random/fourbytes/match/default-2                          188µs ± 0%       174µs ± 0%            86µs ± 1%
BurntSushi/random/fourbytes/match/dfa-2                              108µs ± 0%       180µs ± 0%            88µs ± 2%
BurntSushi/random/fourbytes/nomatch/default-2                        158µs ± 0%        56µs ± 1%             2µs ± 1%
BurntSushi/random/fourbytes/nomatch/dfa-2                           64.5µs ± 0%      58.3µs ± 0%           1.9µs ± 1%
BurntSushi/random/fivebytes/match/default-2                          188µs ± 0%       177µs ± 0%            85µs ± 1%
BurntSushi/random/fivebytes/match/dfa-2                              119µs ± 0%       181µs ± 0%            85µs ± 1%
BurntSushi/random/fivebytes/nomatch/default-2                        158µs ± 0%        56µs ± 1%             2µs ± 1%
BurntSushi/random/fivebytes/nomatch/dfa-2                           64.5µs ± 0%      58.4µs ± 0%           1.9µs ± 1%
BurntSushi/random/ten-one-prefix/default-2                          33.1µs ± 1%      36.2µs ± 0%           6.0µs ± 1%
BurntSushi/random/ten-one-prefix/dfa-2                              26.9µs ± 0%      40.6µs ± 0%           5.7µs ± 1%
BurntSushi/random/ten-diff-prefix/default-2                          228µs ± 0%       253µs ± 0%            35µs ± 1%
BurntSushi/random/ten-diff-prefix/dfa-2                             64.5µs ± 0%     281.5µs ± 0%          38.9µs ± 1%
BurntSushi/random10x/leftmost-first/5000words/default-2             3.19ms ± 0%      2.84ms ± 1%          0.92ms ± 0%
BurntSushi/random10x/leftmost-first/5000words/dfa-2                  646µs ± 0%       652µs ± 0%           332µs ± 0%
BurntSushi/random10x/leftmost-first/100words/default-2              2.61ms ± 0%      0.59ms ± 0%          0.25ms ± 0%
BurntSushi/random10x/leftmost-first/100words/dfa-2                   647µs ± 0%       573µs ± 0%           251µs ± 0%
BurntSushi/sherlock/name/alt1/default-2                              443µs ± 1%       469µs ± 0%            64µs ± 0%
BurntSushi/sherlock/name/alt1/dfa-2                                  425µs ± 2%       481µs ± 0%            64µs ± 1%
BurntSushi/sherlock/name/alt2/default-2                              813µs ± 1%      1002µs ± 0%           209µs ± 0%
BurntSushi/sherlock/name/alt2/dfa-2                                  793µs ± 0%      1022µs ± 0%           212µs ± 1%
BurntSushi/sherlock/name/alt3/default-2                             9.61ms ± 0%      3.56ms ± 1%          0.26ms ± 0%
BurntSushi/sherlock/name/alt3/dfa-2                                 3.95ms ± 0%      3.65ms ± 0%          0.26ms ± 1%
BurntSushi/sherlock/name/alt4/default-2                              772µs ± 0%       999µs ± 0%           208µs ± 1%
BurntSushi/sherlock/name/alt4/dfa-2                                  763µs ± 0%      1020µs ± 0%           213µs ± 1%
BurntSushi/sherlock/name/alt5/default-2                              891µs ± 1%      1253µs ± 0%           255µs ± 0%
BurntSushi/sherlock/name/alt5/dfa-2                                  881µs ± 1%      1284µs ± 0%           258µs ± 1%
BurntSushi/sherlock/name/alt6/default-2                             9.54ms ± 0%      0.30ms ± 0%          0.01ms ± 0%
BurntSushi/sherlock/name/alt6/dfa-2                                 3.86ms ± 0%      0.30ms ± 0%          0.01ms ± 0%
BurntSushi/sherlock/name/alt7/default-2                              545µs ± 0%       559µs ± 0%            15µs ± 0%
BurntSushi/sherlock/name/alt7/dfa-2                                  542µs ± 0%       561µs ± 0%            15µs ± 0%
BurntSushi/sherlock/name/nocase1/default-2                          12.7ms ± 0%       4.1ms ± 0%           0.9ms ± 1%
BurntSushi/sherlock/name/nocase1/dfa-2                              4.08ms ± 0%      4.04ms ± 0%          0.86ms ± 1%
BurntSushi/sherlock/name/nocase2/default-2                          10.7ms ± 0%       3.8ms ± 0%           0.4ms ± 1%
BurntSushi/sherlock/name/nocase2/dfa-2                              4.03ms ± 0%      3.85ms ± 0%          0.42ms ± 1%
BurntSushi/sherlock/name/nocase3/default-2                          10.9ms ± 0%       3.9ms ± 0%           0.8ms ± 0%
BurntSushi/sherlock/name/nocase3/dfa-2                              4.05ms ± 0%      3.90ms ± 0%          0.79ms ± 0%
BurntSushi/sherlock/5000words/default-2                             21.6ms ± 0%      18.6ms ± 0%           8.4ms ± 0%
BurntSushi/sherlock/5000words/dfa-2                                 4.35ms ± 0%      4.93ms ± 0%          2.65ms ± 0%

Random gives us a look at various different patterns against a random corpus. We see that the four-byte case, with four patterns, seems to be a threshold where this library performs better with auto-detection of the matcher.

Sherlock is a more real-world example, with the literary text for Sherlock Holmes. We again see that cases with 4+ patterns seem to perform significantly better.

wafbench

wafbench tests the performance of replacing the pm operator of the OWASP CoreRuleSet and Coraza implementation with this library. This benchmark is considered a real world performance test, though the bulk of processing is in regex, not pm. The pm matchers themselves have a large number of patterns and are considered complex.

WAF/FTW-2                         28.2s ± 0%          28.3s ± 0%              28.3s ± 0%
WAF/POST/1-2                     3.08ms ± 2%         3.09ms ± 2%             3.10ms ± 2%
WAF/POST/1000-2                  20.0ms ± 1%         19.9ms ± 1%             20.2ms ± 1%
WAF/POST/10000-2                  190ms ± 0%          191ms ± 1%              191ms ± 1%
WAF/POST/100000-2                 1.88s ± 0%          1.87s ± 1%              1.88s ± 1%

In all cases, the libraries perform the same, within noise. This likely reflects that the test is mostly spending time in regex, not pm. It's presented mostly as a case study that wrapping with WebAssembly can provide the same performance without needing to rewrite an entire library.

Documentation ¶

Index ¶

Constants
type AhoCorasick
type AhoCorasickBuilder
- func NewAhoCorasickBuilder(o Opts) *AhoCorasickBuilder
- func (a *AhoCorasickBuilder) Build(patterns []string) AhoCorasick
type Finder
type Iter
type Match
type Opts
type Replacer
- func NewReplacer(finder Finder) Replacer
- func (r Replacer) ReplaceAll(haystack string, replaceWith []string) string
- func (r Replacer) ReplaceAllFunc(haystack string, f func(match Match) (string, bool)) string

Constants ¶

View Source

const (
	// Use standard match semantics, which support overlapping matches. When
	// used with non-overlapping matches, matches are reported as they are seen.
	StandardMatch matchKind = iota
	// Use leftmost-first match semantics, which reports leftmost matches.
	// When there are multiple possible leftmost matches, the match
	// corresponding to the pattern that appeared earlier when constructing
	// the automaton is reported.
	// This does **not** support overlapping matches or stream searching
	LeftMostFirstMatch
	// Use leftmost-longest match semantics, which reports leftmost matches.
	// When there are multiple possible leftmost matches, the longest match is chosen.
	LeftMostLongestMatch
)

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type AhoCorasick ¶

type AhoCorasick struct {
	// contains filtered or unexported fields
}

func (AhoCorasick) FindAll ¶

func (ac AhoCorasick) FindAll(haystack string) []Match

FindAll returns the matches found in the haystack

func (AhoCorasick) FindN ¶ added in v0.6.0

func (ac AhoCorasick) FindN(haystack string, n int) []Match

FindN returns the matches found in the haystack, up to n matches.

func (AhoCorasick) Iter ¶

func (ac AhoCorasick) Iter(haystack string) Iter

Iter gives an iterator over the built patterns

func (AhoCorasick) IterOverlapping ¶

func (ac AhoCorasick) IterOverlapping(haystack string) Iter

IterOverlapping gives an iterator over the built patterns with overlapping matches

func (AhoCorasick) PatternCount ¶

func (ac AhoCorasick) PatternCount() int

type AhoCorasickBuilder ¶

type AhoCorasickBuilder struct {
	// contains filtered or unexported fields
}

func NewAhoCorasickBuilder ¶

func NewAhoCorasickBuilder(o Opts) *AhoCorasickBuilder

NewAhoCorasickBuilder creates a new AhoCorasickBuilder based on Opts

func (*AhoCorasickBuilder) Build ¶

func (a *AhoCorasickBuilder) Build(patterns []string) AhoCorasick

Build builds a (non)deterministic finite automata from the user provided patterns

type Finder ¶

type Finder interface {
	FindAll(haystack string) []Match
	PatternCount() int
}

type Iter ¶

type Iter interface {
	Next() *Match
}

Iter is an iterator over matches found on the current haystack it gives the user more granular control. You can chose how many and what kind of matches you need.

type Match ¶

type Match struct {
	// contains filtered or unexported fields
}

A representation of a match reported by an Aho-Corasick automaton.

A match has two essential pieces of information: the identifier of the pattern that matched, along with the start and end offsets of the match in the haystack.

func (*Match) End ¶

func (m *Match) End() int

End gives the index of the last character of this match inside the haystack

func (*Match) Pattern ¶

func (m *Match) Pattern() int

Pattern returns the index of the pattern in the slice of the patterns provided by the user that was matched

func (*Match) Start ¶

func (m *Match) Start() int

Start gives the index of the first character of this match inside the haystack

type Opts ¶

type Opts struct {
	AsciiCaseInsensitive bool
	MatchOnlyWholeWords  bool
	MatchKind            matchKind
	DFA                  bool
}

Opts defines a set of options applied before the patterns are built

type Replacer ¶

type Replacer struct {
	// contains filtered or unexported fields
}

func NewReplacer ¶

func NewReplacer(finder Finder) Replacer

func (Replacer) ReplaceAll ¶

func (r Replacer) ReplaceAll(haystack string, replaceWith []string) string

ReplaceAll replaces the matches found in the haystack according to the user provided slice `replaceWith` It panics, if `replaceWith` has length different from the patterns that it was built with

func (Replacer) ReplaceAllFunc ¶

func (r Replacer) ReplaceAllFunc(haystack string, f func(match Match) (string, bool)) string

ReplaceAllFunc replaces the matches found in the haystack according to the user provided function it gives fine grained control over what is replaced. A user can chose to stop the replacing process early by returning false in the lambda In that case, everything from that point will be kept as the original haystack

Source Files ¶

View all Source files

Directories ¶

Path	Synopsis
magefiles module

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL