cedar

package module
v0.0.0-...-59fa815 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 30, 2021 License: GPL-2.0 Imports: 11 Imported by: 0

README

ahocorasick

Go Report Card GoCover GoDoc

Package ahocorasick implementes fast, compact and low memory used aho-corasick algorithm based on double-array trie. and also supports visualizing inner data structures by graphviz

cedar-go is a Golang port of cedar which is written in C++ by Naoki Yoshinaga. cedar-go currently implements the reduced verion of cedar. This package is not thread safe if there is one goroutine doing insertions or deletions.

Visualize

Install graphviz

# for mac os
brew install graphviz
# for ubuntu
sudo apt-get install graphviz

Dump structure in golang

m := cedar.NewMatcher()
m.Insert...
m.Compile...
// dump trie graph in double array trie
m.Cedar().DumpGraph("trie.gv")
// dump aho-corasick dfa graph in double array trie
m.DumpGraph("dfa.gv")

Generate png

dot -Tpng -o out.png trie.gv
# example: words {"she", "he", "her", "hers"}
  • trie GitHub

  • aho-corasick GitHub

Install

go get github.com/aradilov/ahocorasick

Usage

  • aho-corasick
package main

import (
	"fmt"

	"github.com/aradilov/ahocorasick"
)

func main() {
	fmt.Printf("\ntesting in simple case...\n")
	m := cedar.NewMatcher()
	words := []string{
		"she", "he", "her", "hers",
	}
	for i, word := range words {
		m.Insert([]byte(word), i)
	}
	// visualize trie 
	m.Cedar().DumpGraph("trie.gv")
	m.Compile()
	// visualize aho-corasick
	m.DumpGraph("dfa.gv")
	seq := []byte("hershertongher")
	fmt.Printf("searching %s\n", string(seq))
        resp := m.Match(seq)
        for resp.HasNext() {
            items := resp.NextMatchItem(seq)
            for _, itr := range items {
                key := m.Key(seq, itr)
                fmt.Printf("key:%s value:%d\n", key, itr.Value.(int))
            }
        }
	// release buffer to sync.Pool
	resp.Release()
}

​ output

testing in simple case...
searching hershertongher
key:he value:1
key:her value:2
key:hers value:3
key:he value:1
key:she value:0
key:her value:2
key:he value:1
key:her value:2
  • trie
package main

import (
	"fmt"

	"github.com/aradilov/ahocorasick"
)

func main() {
	fmt.Printf("\ntesting int float32 value...\n")
	cd := NewCedar()
	words := []string{
		"she", "hers", "her", "he",
	}
	for i, word := range words {
		v := float32(i) + 1.880001
		cd.Insert([]byte(word), v)
	}
	ids := cd.PrefixMatch([]byte("hers"), 0)
	for _, id := range ids {
		k, _ := cd.Key(id)
		v, _ := cd.Get(k)
		fmt.Printf("key:%s val:%f\n", string(k), v.(float32))
	} 
}

​ output

testing int float32 value...
key:he val:4.880001
key:her val:3.880001
key:hers val:2.880001

Chinese words segment demo

Build demo test

go test .

demo output

Searching 一丁不识一丁点C++的T桖中华人民共和国人民解放军轰炸南京长江大桥
key:一 value:7
key:一丁 value:17
key:丁 value:3317
key:一丁不识 value:18
key:不识 value:9890
key:识 value:290279
key:一 value:7
key:不识一丁 value:9891
key:一丁 value:17
key:丁 value:3317
key:一丁点 value:19
key:丁点 value:3519
key:点 value:214913
key:C++ value:5
key:的 value:233716
key:中 value:13425
key:中华 value:13663
key:华 value:63497
key:华人 value:63545
key:人 value:25372
key:中华人民 value:13667
key:人民 value:25881
key:民 value:195603
key:中华人民共和国 value:13668
key:人民共和国 value:25891
key:共和国 value:44163
key:国 value:88227
key:国人 value:88295
key:人 value:25372
key:人民 value:25881
key:民 value:195603
key:解 value:287247
key:解放 value:287374
key:放 value:160645
key:人民解放军 value:25927
key:解放军 value:287381
...

Benchmark

ahocorasick golang implementation: cloudflare anknown iohub

image

How to run benchmark

git clone https://github.com/aradilov/ahocorasick
cd benchmark
go get -u -v
go build .
./benchmark

Documentation

Index

Constants

View Source
const (
	DefaultTokenBufferSize = 4096
	DefaultMatchBufferSize = 4096
)
View Source
const (
	CJKZhMin = '\u4E00'
	CJKZhMax = '\u9FFF'
)

defines max & min value of chinese CJK code

Variables

View Source
var (
	ErrInvalidDataType = errors.New("cedar: invalid datatype")
	ErrInvalidValue    = errors.New("cedar: invalid value")
	ErrInvalidKey      = errors.New("cedar: invalid key")
	ErrNoPath          = errors.New("cedar: no path")
	ErrNoValue         = errors.New("cedar: no value")
	ErrTooLarge        = errors.New("acmatcher: Tool Large for grow")
)

defines Error type

Functions

This section is empty.

Types

type Cedar

type Cedar struct {
	// contains filtered or unexported fields
}

Cedar encapsulates a fast and compressed double array trie for words query

func NewCedar

func NewCedar() *Cedar

NewCedar new a Cedar instance

func (*Cedar) Delete

func (da *Cedar) Delete(key []byte) (err error)

Delete removes a key-value pair from the cedar. It will return ErrNoPath, if the key has not been added.

func (*Cedar) DumpGraph

func (da *Cedar) DumpGraph(fname string)

DumpGraph dumps inner data structures for graphviz

func (*Cedar) Get

func (da *Cedar) Get(key []byte) (value interface{}, err error)

Get returns the value associated with the given `key`. It is equivalent to

id, err1 = Jump(key)
value, err2 = Value(id)

Thus, it may return ErrNoPath or ErrNoValue,

func (*Cedar) GetByNid

func (da *Cedar) GetByNid(nid int) (value interface{}, err error)

func (*Cedar) GetLabel

func (da *Cedar) GetLabel(nid int) byte

func (*Cedar) Insert

func (da *Cedar) Insert(key []byte, value interface{}) error

Insert adds a key-value pair into the cedar. It will return ErrInvalidValue, if value < 0 or >= valueLimit.

func (*Cedar) Jump

func (da *Cedar) Jump(path []byte, from int) (to int, err error)

Jump travels from a node `from` to another node `to` by following the path `path`. For example, if the following keys were inserted:

id	key
19	abc
23	ab
37	abcd

then

Jump([]byte("ab"), 0) = 23, nil		// reach "ab" from root
Jump([]byte("c"), 23) = 19, nil			// reach "abc" from "ab"
Jump([]byte("cd"), 23) = 37, nil		// reach "abcd" from "ab"

func (*Cedar) Key

func (da *Cedar) Key(id int) (key []byte, err error)

Key returns the key of the node with the given `id`. It will return ErrNoPath, if the node does not exist.

func (*Cedar) Load

func (da *Cedar) Load(in io.Reader, dataType string) error

Load loads the cedar from an io.Writer, where dataType is either "json" or "gob".

func (*Cedar) LoadFromFile

func (da *Cedar) LoadFromFile(fileName string, dataType string) error

LoadFromFile loads the cedar from a file, where dataType is either "json" or "gob".

func (*Cedar) MatchWildcard

func (da *Cedar) MatchWildcard(seq []byte, nid int, cb func(nid int, key []byte, value interface{}))

Match multiple subsequence in seq and return tokens Wildcard in the end (or start) of the seq means zero or any amount of symbols Wildcard in the middle of the seq means one symbol or more

func (*Cedar) PrefixMatch

func (da *Cedar) PrefixMatch(key []byte, num int) (ids []int)

PrefixMatch returns a list of at most `num` nodes which match the prefix of the key. If `num` is 0, it returns all matches. For example, if the following keys were inserted:

id	key
19	abc
23	ab
37	abcd

then

PrefixMatch([]byte("abc"), 1) = [ 23 ]				// match ["ab"]
PrefixMatch([]byte("abcd"), 0) = [ 23, 19, 37]		// match ["ab", "abc", "abcd"]

func (*Cedar) PrefixPredict

func (da *Cedar) PrefixPredict(key []byte, num int) (ids []int)

PrefixPredict returns a list of at most `num` nodes which has the key as their prefix. These nodes are ordered by their keys. If `num` is 0, it returns all matches. For example, if the following keys were inserted:

id	key
19	abc
23	ab
37	abcd

then

PrefixPredict([]byte("ab"), 2) = [ 23, 19 ]			// predict ["ab", "abc"]
PrefixPredict([]byte("ab"), 0) = [ 23, 19, 37 ]		// predict ["ab", "abc", "abcd"]

func (*Cedar) Save

func (da *Cedar) Save(out io.Writer, dataType string) error

Save saves the cedar to an io.Writer, where dataType is either "json" or "gob".

func (*Cedar) SaveToFile

func (da *Cedar) SaveToFile(fileName string, dataType string) error

SaveToFile saves the cedar to a file, where dataType is either "json" or "gob".

func (*Cedar) Status

func (da *Cedar) Status() (keys, nodes, size, capacity int)

Status reports the following statistics of the cedar:

keys:		number of keys that are in the cedar,
nodes:		number of trie nodes (slots in the base array) has been taken,
size:			the size of the base array used by the cedar,
capacity:		the capicity of the base array used by the cedar.

func (*Cedar) Update

func (da *Cedar) Update(key []byte, value int) error

Update increases the value associated with the `key`. The `key` will be inserted if it is not in the cedar. It will return ErrInvalidValue, if the updated value < 0 or >= valueLimit.

type MatchToken

type MatchToken struct {
	KLen  int // len of key
	Value interface{}
	At    int // match position of source text
	Freq  uint
}

MatchToken matched words in Aho Corasick Matcher

type Matcher

type Matcher struct {
	// contains filtered or unexported fields
}

Matcher Aho Corasick Matcher

func NewMatcher

func NewMatcher() *Matcher

NewMatcher new an aho corasick matcher

func (*Matcher) Cedar

func (m *Matcher) Cedar() *Cedar

Cedar return a cedar trie instance

func (*Matcher) Compile

func (m *Matcher) Compile()

Compile trie to aho-corasick

func (*Matcher) DumpGraph

func (m *Matcher) DumpGraph(fname string)

DumpGraph dumps aho-corasick dfa structures to graphviz file

func (*Matcher) Insert

func (m *Matcher) Insert(bs []byte, val interface{})

Insert a byte sequence to double array trie inner matcher

func (*Matcher) Key

func (m *Matcher) Key(seq []byte, t MatchToken) []byte

Key extract matched key in seq

func (*Matcher) Match

func (m *Matcher) Match(seq []byte) *Response

Match multiple subsequence in seq and return tokens

type Response

type Response struct {
	// contains filtered or unexported fields
}

func NewResponse

func NewResponse(ac *Matcher) *Response

func (*Response) HasNext

func (r *Response) HasNext() bool

func (*Response) NextMatchItem

func (r *Response) NextMatchItem(content []byte) []MatchToken

func (*Response) Release

func (r *Response) Release()

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL