aho_corasick

package module
v0.6.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 30, 2024 License: MIT Imports: 11 Imported by: 1

README

go-aho-corasick

go-aho-corasick is a library for aho-corasick which wraps the Rust BurntSushi/aho-corasick library. It provides the same API as petar-dambovaliev/aho-corasick and is a drop-in replacement with full API and behavior compatibility. By default, aho-corasick is packaged as a WebAssembly module and accessed with the pure Go runtime, wazero. This means that it is compatible with any Go application, regardless of availability of cgo.

For TinyGo applications being built for WASM, this library will perform significantly better. For Go applications, the performance difference varies by case. petar-dambovaliev/aho-corasick performs quite well, but this library seems to perform better with 4+ patterns, and larger ones. The API is a drop-in replacement, so it is best to try it and benchmark to see the effect. Notably, for unknown input where a decision on using DFA or not cannot be made beforehand, the automatic detection in this library will often perform much better.

Behavior differences

This library will automatically pick an implementations among NFA, contiguous NFA, and DFA unless the DFA flag is explicitly passed when constructing. This generally results in better performance than always using one or the other.

Usage

go-aho-corasick is a standard Go library package and can be added to a go.mod file. It will work fine in Go or TinyGo projects.

go get github.com/wasilibs/go-aho-corasick

Because the library is a drop-in replacement for petar-dambovaliev/aho-corasick, an import rewrite can make migrating code to use it simple.

import "github.com/peter-dambovaliev/aho-corasick"

can be changed to

import "github.com/wasilibs/go-aho-corasick"
cgo

This library also supports opting into using cgo to wrap BurntSushi/aho-corasick instead of using WebAssembly. This requires having a built version of the library available - because Rust libraries are not easily installed for use with pkg-config, environment variables like CGO_LDFLAGS and LD_LIBRARY_PATH may be needed to find the library at build and runtime. The build tag aho_corasick_cgo can be used to enable cgo support.

Performance

Benchmarks are run against every commit in the bench workflow. GitHub action runners are highly virtualized and do not have stable performance across runs, but the relative numbers within a run should still be somewhat, though not precisely, informative.

Microbenchmarks

Microbenchmarks are the same as included in the reference Rust and Go libraries. Full results can be viewed in the workflow, a sample of results for one run looks like this

BurntSushi/random/onebyte/match/default-2                           51.3µs ± 3%     147.5µs ± 0%          74.7µs ± 1%
BurntSushi/random/onebyte/match/dfa-2                               64.4µs ± 0%     148.8µs ± 0%          76.9µs ± 4%
BurntSushi/random/onebyte/nomatch/default-2                         5.67µs ± 2%      7.33µs ±12%          1.33µs ±19%
BurntSushi/random/onebyte/nomatch/dfa-2                             5.79µs ± 1%      6.98µs ± 5%          1.50µs ± 1%
BurntSushi/random/twobytes/match/default-2                          53.9µs ± 0%     156.3µs ± 0%          78.2µs ± 0%
BurntSushi/random/twobytes/match/dfa-2                              66.6µs ± 0%     156.9µs ± 1%          79.9µs ± 3%
BurntSushi/random/twobytes/nomatch/default-2                        10.0µs ± 0%      10.9µs ± 0%           1.6µs ± 1%
BurntSushi/random/twobytes/nomatch/dfa-2                            10.0µs ± 1%      11.0µs ± 1%           1.6µs ± 1%
BurntSushi/random/threebytes/match/default-2                        55.7µs ± 1%     162.1µs ± 0%          81.0µs ± 2%
BurntSushi/random/threebytes/match/dfa-2                            67.3µs ± 0%     162.5µs ± 0%          81.4µs ± 1%
BurntSushi/random/threebytes/nomatch/default-2                      10.0µs ± 0%      13.0µs ± 0%           1.7µs ± 0%
BurntSushi/random/threebytes/nomatch/dfa-2                          10.0µs ± 0%      12.9µs ± 1%           1.7µs ± 0%
BurntSushi/random/fourbytes/match/default-2                          188µs ± 0%       174µs ± 0%            86µs ± 1%
BurntSushi/random/fourbytes/match/dfa-2                              108µs ± 0%       180µs ± 0%            88µs ± 2%
BurntSushi/random/fourbytes/nomatch/default-2                        158µs ± 0%        56µs ± 1%             2µs ± 1%
BurntSushi/random/fourbytes/nomatch/dfa-2                           64.5µs ± 0%      58.3µs ± 0%           1.9µs ± 1%
BurntSushi/random/fivebytes/match/default-2                          188µs ± 0%       177µs ± 0%            85µs ± 1%
BurntSushi/random/fivebytes/match/dfa-2                              119µs ± 0%       181µs ± 0%            85µs ± 1%
BurntSushi/random/fivebytes/nomatch/default-2                        158µs ± 0%        56µs ± 1%             2µs ± 1%
BurntSushi/random/fivebytes/nomatch/dfa-2                           64.5µs ± 0%      58.4µs ± 0%           1.9µs ± 1%
BurntSushi/random/ten-one-prefix/default-2                          33.1µs ± 1%      36.2µs ± 0%           6.0µs ± 1%
BurntSushi/random/ten-one-prefix/dfa-2                              26.9µs ± 0%      40.6µs ± 0%           5.7µs ± 1%
BurntSushi/random/ten-diff-prefix/default-2                          228µs ± 0%       253µs ± 0%            35µs ± 1%
BurntSushi/random/ten-diff-prefix/dfa-2                             64.5µs ± 0%     281.5µs ± 0%          38.9µs ± 1%
BurntSushi/random10x/leftmost-first/5000words/default-2             3.19ms ± 0%      2.84ms ± 1%          0.92ms ± 0%
BurntSushi/random10x/leftmost-first/5000words/dfa-2                  646µs ± 0%       652µs ± 0%           332µs ± 0%
BurntSushi/random10x/leftmost-first/100words/default-2              2.61ms ± 0%      0.59ms ± 0%          0.25ms ± 0%
BurntSushi/random10x/leftmost-first/100words/dfa-2                   647µs ± 0%       573µs ± 0%           251µs ± 0%
BurntSushi/sherlock/name/alt1/default-2                              443µs ± 1%       469µs ± 0%            64µs ± 0%
BurntSushi/sherlock/name/alt1/dfa-2                                  425µs ± 2%       481µs ± 0%            64µs ± 1%
BurntSushi/sherlock/name/alt2/default-2                              813µs ± 1%      1002µs ± 0%           209µs ± 0%
BurntSushi/sherlock/name/alt2/dfa-2                                  793µs ± 0%      1022µs ± 0%           212µs ± 1%
BurntSushi/sherlock/name/alt3/default-2                             9.61ms ± 0%      3.56ms ± 1%          0.26ms ± 0%
BurntSushi/sherlock/name/alt3/dfa-2                                 3.95ms ± 0%      3.65ms ± 0%          0.26ms ± 1%
BurntSushi/sherlock/name/alt4/default-2                              772µs ± 0%       999µs ± 0%           208µs ± 1%
BurntSushi/sherlock/name/alt4/dfa-2                                  763µs ± 0%      1020µs ± 0%           213µs ± 1%
BurntSushi/sherlock/name/alt5/default-2                              891µs ± 1%      1253µs ± 0%           255µs ± 0%
BurntSushi/sherlock/name/alt5/dfa-2                                  881µs ± 1%      1284µs ± 0%           258µs ± 1%
BurntSushi/sherlock/name/alt6/default-2                             9.54ms ± 0%      0.30ms ± 0%          0.01ms ± 0%
BurntSushi/sherlock/name/alt6/dfa-2                                 3.86ms ± 0%      0.30ms ± 0%          0.01ms ± 0%
BurntSushi/sherlock/name/alt7/default-2                              545µs ± 0%       559µs ± 0%            15µs ± 0%
BurntSushi/sherlock/name/alt7/dfa-2                                  542µs ± 0%       561µs ± 0%            15µs ± 0%
BurntSushi/sherlock/name/nocase1/default-2                          12.7ms ± 0%       4.1ms ± 0%           0.9ms ± 1%
BurntSushi/sherlock/name/nocase1/dfa-2                              4.08ms ± 0%      4.04ms ± 0%          0.86ms ± 1%
BurntSushi/sherlock/name/nocase2/default-2                          10.7ms ± 0%       3.8ms ± 0%           0.4ms ± 1%
BurntSushi/sherlock/name/nocase2/dfa-2                              4.03ms ± 0%      3.85ms ± 0%          0.42ms ± 1%
BurntSushi/sherlock/name/nocase3/default-2                          10.9ms ± 0%       3.9ms ± 0%           0.8ms ± 0%
BurntSushi/sherlock/name/nocase3/dfa-2                              4.05ms ± 0%      3.90ms ± 0%          0.79ms ± 0%
BurntSushi/sherlock/5000words/default-2                             21.6ms ± 0%      18.6ms ± 0%           8.4ms ± 0%
BurntSushi/sherlock/5000words/dfa-2                                 4.35ms ± 0%      4.93ms ± 0%          2.65ms ± 0%

Random gives us a look at various different patterns against a random corpus. We see that the four-byte case, with four patterns, seems to be a threshold where this library performs better with auto-detection of the matcher.

Sherlock is a more real-world example, with the literary text for Sherlock Holmes. We again see that cases with 4+ patterns seem to perform significantly better.

wafbench

wafbench tests the performance of replacing the pm operator of the OWASP CoreRuleSet and Coraza implementation with this library. This benchmark is considered a real world performance test, though the bulk of processing is in regex, not pm. The pm matchers themselves have a large number of patterns and are considered complex.

WAF/FTW-2                         28.2s ± 0%          28.3s ± 0%              28.3s ± 0%
WAF/POST/1-2                     3.08ms ± 2%         3.09ms ± 2%             3.10ms ± 2%
WAF/POST/1000-2                  20.0ms ± 1%         19.9ms ± 1%             20.2ms ± 1%
WAF/POST/10000-2                  190ms ± 0%          191ms ± 1%              191ms ± 1%
WAF/POST/100000-2                 1.88s ± 0%          1.87s ± 1%              1.88s ± 1%

In all cases, the libraries perform the same, within noise. This likely reflects that the test is mostly spending time in regex, not pm. It's presented mostly as a case study that wrapping with WebAssembly can provide the same performance without needing to rewrite an entire library.

Documentation

Index

Constants

View Source
const (
	// Use standard match semantics, which support overlapping matches. When
	// used with non-overlapping matches, matches are reported as they are seen.
	StandardMatch matchKind = iota
	// Use leftmost-first match semantics, which reports leftmost matches.
	// When there are multiple possible leftmost matches, the match
	// corresponding to the pattern that appeared earlier when constructing
	// the automaton is reported.
	// This does **not** support overlapping matches or stream searching
	LeftMostFirstMatch
	// Use leftmost-longest match semantics, which reports leftmost matches.
	// When there are multiple possible leftmost matches, the longest match is chosen.
	LeftMostLongestMatch
)

Variables

This section is empty.

Functions

This section is empty.

Types

type AhoCorasick

type AhoCorasick struct {
	// contains filtered or unexported fields
}

func (AhoCorasick) FindAll

func (ac AhoCorasick) FindAll(haystack string) []Match

FindAll returns the matches found in the haystack

func (AhoCorasick) FindN added in v0.6.0

func (ac AhoCorasick) FindN(haystack string, n int) []Match

FindN returns the matches found in the haystack, up to n matches.

func (AhoCorasick) Iter

func (ac AhoCorasick) Iter(haystack string) Iter

Iter gives an iterator over the built patterns

func (AhoCorasick) IterOverlapping

func (ac AhoCorasick) IterOverlapping(haystack string) Iter

IterOverlapping gives an iterator over the built patterns with overlapping matches

func (AhoCorasick) PatternCount

func (ac AhoCorasick) PatternCount() int

type AhoCorasickBuilder

type AhoCorasickBuilder struct {
	// contains filtered or unexported fields
}

func NewAhoCorasickBuilder

func NewAhoCorasickBuilder(o Opts) *AhoCorasickBuilder

NewAhoCorasickBuilder creates a new AhoCorasickBuilder based on Opts

func (*AhoCorasickBuilder) Build

func (a *AhoCorasickBuilder) Build(patterns []string) AhoCorasick

Build builds a (non)deterministic finite automata from the user provided patterns

type Finder

type Finder interface {
	FindAll(haystack string) []Match
	PatternCount() int
}

type Iter

type Iter interface {
	Next() *Match
}

Iter is an iterator over matches found on the current haystack it gives the user more granular control. You can chose how many and what kind of matches you need.

type Match

type Match struct {
	// contains filtered or unexported fields
}

A representation of a match reported by an Aho-Corasick automaton.

A match has two essential pieces of information: the identifier of the pattern that matched, along with the start and end offsets of the match in the haystack.

func (*Match) End

func (m *Match) End() int

End gives the index of the last character of this match inside the haystack

func (*Match) Pattern

func (m *Match) Pattern() int

Pattern returns the index of the pattern in the slice of the patterns provided by the user that was matched

func (*Match) Start

func (m *Match) Start() int

Start gives the index of the first character of this match inside the haystack

type Opts

type Opts struct {
	AsciiCaseInsensitive bool
	MatchOnlyWholeWords  bool
	MatchKind            matchKind
	DFA                  bool
}

Opts defines a set of options applied before the patterns are built

type Replacer

type Replacer struct {
	// contains filtered or unexported fields
}

func NewReplacer

func NewReplacer(finder Finder) Replacer

func (Replacer) ReplaceAll

func (r Replacer) ReplaceAll(haystack string, replaceWith []string) string

ReplaceAll replaces the matches found in the haystack according to the user provided slice `replaceWith` It panics, if `replaceWith` has length different from the patterns that it was built with

func (Replacer) ReplaceAllFunc

func (r Replacer) ReplaceAllFunc(haystack string, f func(match Match) (string, bool)) string

ReplaceAllFunc replaces the matches found in the haystack according to the user provided function it gives fine grained control over what is replaced. A user can chose to stop the replacing process early by returning false in the lambda In that case, everything from that point will be kept as the original haystack

Directories

Path Synopsis
magefiles module

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL