aho_corasick

package module
v0.5.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 30, 2024 License: MIT Imports: 7 Imported by: 0

README

aho-corasick

Efficient string matching in Golang via the aho-corasick algorithm.

x20 faster than https://github.com/cloudflare/ahocorasick and x3 faster than https://github.com/anknown/ahocorasick

Memory consuption is a eigth of https://github.com/cloudflare/ahocorasick and half of https://github.com/anknown/ahocorasick

This library is forked from https://github.com/petar-dambovaliev/aho-corasick, which was heavily inspired by https://github.com/BurntSushi/aho-corasick.

Usage

go get -u github.com/pgavlin/aho-corasick
import (
    ahocorasick "github.com/pgavlin/aho-corasick"
)
builder := ahocorasick.NewAhoCorasickBuilder(Opts{
    AsciiCaseInsensitive: true,
    MatchOnlyWholeWords:  true,
    MatchKind:            LeftMostLongestMatch,
    DFA:                  true,
})

ac := builder.Build([]string{"bear", "masha"})
haystack := "The Bear and Masha"
matches := ac.FindAll(haystack)

for _, match := range matches {
    println(haystack[match.Start():match.End()])
}

Matching can be done via NFA or DFA. NFA has runtime complexity O(N + M) in relation to the haystack and number of matches. DFA has runtime complexity O(N), but it uses more memory.

Replacing of matches in the haystack.

replaceWith needs to be the same length as the patterns

r := ahocorasick.NewReplacer(ac)
replaced := r.ReplaceAll(haystack, replaceWith)

ReplaceAllFunc is useful, for example, if you want to use the original text cassing but you are matching case insensitively. You can replace partially by return false and from that point, the original string will be preserved.

replaced := r.ReplaceAllFunc(haystack, func(match Match) (string, bool) {
    return `<a>` + haystack[match.Start():match.End()] + `<\a>`, true
})

Search for matches one at a time via the iterator

iter := ac.Iter(haystack)

for next := iter.Next(); next != nil; next = iter.Next() {
    ...
}

It's plenty fast but if you want to use it in parallel, that is also possible.

Memory consumption won't increase because the read-only automaton is not actually copied, only the counters are.

The magic line is ac := ac

var w sync.WaitGroup

w.Add(50)
for i := 0; i < 50; i++ {
    go func() {
        ac := ac
        matches := ac.FindAll(haystack)
        println(len(matches))
        w.Done()
    }()
}
w.Wait()

Documentation

Index

Constants

View Source
const (
	// Use standard match semantics, which support overlapping matches. When
	// used with non-overlapping matches, matches are reported as they are seen.
	StandardMatch matchKind = iota
	// Use leftmost-first match semantics, which reports leftmost matches.
	// When there are multiple possible leftmost matches, the match
	// corresponding to the pattern that appeared earlier when constructing
	// the automaton is reported.
	// This does **not** support overlapping matches or stream searching
	LeftMostFirstMatch
	// Use leftmost-longest match semantics, which reports leftmost matches.
	// When there are multiple possible leftmost matches, the longest match is chosen.
	LeftMostLongestMatch
)

Variables

This section is empty.

Functions

This section is empty.

Types

type AhoCorasick

type AhoCorasick struct {
	// contains filtered or unexported fields
}

AhoCorasick is the main data structure that does most of the work

func (AhoCorasick) FindAll

func (ac AhoCorasick) FindAll(haystack string) []Match

FindAll returns the matches found in the haystack

func (AhoCorasick) Iter

func (ac AhoCorasick) Iter(haystack string) Iter

Iter gives an iterator over the built patterns

func (AhoCorasick) IterByte

func (ac AhoCorasick) IterByte(haystack []byte) Iter

IterByte gives an iterator over the built patterns

func (AhoCorasick) IterOverlapping

func (ac AhoCorasick) IterOverlapping(haystack string) Iter

Iter gives an iterator over the built patterns with overlapping matches

func (AhoCorasick) IterOverlappingByte

func (ac AhoCorasick) IterOverlappingByte(haystack []byte) Iter

IterOverlappingByte gives an iterator over the built patterns with overlapping matches

func (AhoCorasick) PatternCount

func (ac AhoCorasick) PatternCount() int

type AhoCorasickBuilder

type AhoCorasickBuilder struct {
	// contains filtered or unexported fields
}

AhoCorasickBuilder defines a set of options applied before the patterns are built

func NewAhoCorasickBuilder

func NewAhoCorasickBuilder(o Opts) AhoCorasickBuilder

NewAhoCorasickBuilder creates a new AhoCorasickBuilder based on Opts

func (*AhoCorasickBuilder) Build

func (a *AhoCorasickBuilder) Build(patterns []string) AhoCorasick

Build builds a (non)deterministic finite automata from the user provided patterns

func (*AhoCorasickBuilder) BuildByte

func (a *AhoCorasickBuilder) BuildByte(patterns [][]byte) AhoCorasick

BuildByte builds a (non)deterministic finite automata from the user provided patterns

type Finder

type Finder interface {
	FindAll(haystack string) []Match
	PatternCount() int
}

type Iter

type Iter interface {
	Next() *Match
}

Iter is an iterator over matches found on the current haystack it gives the user more granular control. You can chose how many and what kind of matches you need.

type Match

type Match struct {
	// contains filtered or unexported fields
}

A representation of a match reported by an Aho-Corasick automaton.

A match has two essential pieces of information: the identifier of the pattern that matched, along with the start and end offsets of the match in the haystack.

func (*Match) End

func (m *Match) End() int

End gives the index of the last character of this match inside the haystack

func (*Match) Pattern

func (m *Match) Pattern() int

Pattern returns the index of the pattern in the slice of the patterns provided by the user that was matched

func (*Match) Start

func (m *Match) Start() int

Start gives the index of the first character of this match inside the haystack

type Opts

type Opts struct {
	AsciiCaseInsensitive bool
	MatchOnlyWholeWords  bool
	MatchKind            matchKind
	DFA                  bool
}

Opts defines a set of options applied before the patterns are built MatchOnlyWholeWords does filtering after matching with MatchKind this could lead to situations where, in this case, nothing is matched

    trieBuilder := NewAhoCorasickBuilder(Opts{
	     MatchOnlyWholeWords: true,
	     MatchKind:           LeftMostLongestMatch,
	     DFA:                 false,
    })

		trie := trieBuilder.Build([]string{"testing", "testing 123"})
		result := trie.FindAll("testing 12345")
	 len(result) == 0

this is due to the fact LeftMostLongestMatch is the matching strategy "testing 123" is found but then is filtered out by MatchOnlyWholeWords use MatchOnlyWholeWords with caution

type Replacer

type Replacer struct {
	// contains filtered or unexported fields
}

func NewReplacer

func NewReplacer(finder Finder) Replacer

func (Replacer) ReplaceAll

func (r Replacer) ReplaceAll(haystack string, replaceWith []string) string

ReplaceAll replaces the matches found in the haystack according to the user provided slice `replaceWith` It panics, if `replaceWith` has length different from the patterns that it was built with

func (Replacer) ReplaceAllFunc

func (r Replacer) ReplaceAllFunc(haystack string, f func(match Match) (string, bool)) string

ReplaceAllFunc replaces the matches found in the haystack according to the user provided function it gives fine grained control over what is replaced. A user can chose to stop the replacing process early by returning false in the lambda In that case, everything from that point will be kept as the original haystack

func (Replacer) ReplaceAllWith

func (r Replacer) ReplaceAllWith(haystack, replacement string) string

ReplaceAllWith replaces the matches found in the haystack according with replacement.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL