tokenizer

package
v1.2.1 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 21, 2026 License: AGPL-3.0 Imports: 4 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func FilterStopword

func FilterStopword(sw Stopwords, in iter.Seq[*Token]) (out iter.Seq[*Token])

func FinishStemmer

func FinishStemmer(raw []byte, keep int, fold func([]byte) []byte) ([]byte, bool)

FinishStemmer returns a borrowed subslice of raw when the kept prefix needs no rewriting; otherwise it returns the newly allocated result of fold. fold lowercases its input and, for Spanish, also strips accents.

func HasSuffixFold

func HasSuffixFold(b []byte, suf string) bool
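
The page gives no description for HasSuffixFold. Going by the name alone, it plausibly tests for a suffix while ignoring case. The sketch below illustrates that reading with the standard bytes.EqualFold; hasSuffixFold is a hypothetical name, and the assumption that suffix and match have the same byte length is a simplification.

```go
package main

import (
	"bytes"
	"fmt"
)

// hasSuffixFold is a hypothetical case-insensitive suffix check. It
// compares the last len(suf) bytes of b against suf under Unicode simple
// case folding. It assumes both casings occupy the same number of bytes.
func hasSuffixFold(b []byte, suf string) bool {
	if len(b) < len(suf) {
		return false
	}
	return bytes.EqualFold(b[len(b)-len(suf):], []byte(suf))
}

func main() {
	fmt.Println(hasSuffixFold([]byte("WALKING"), "ing")) // true
	fmt.Println(hasSuffixFold([]byte("walked"), "ing"))  // false
	fmt.Println(hasSuffixFold([]byte("ng"), "ing"))      // false
}
```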

func NeedsFold

func NeedsFold(b []byte) bool

NeedsFold reports whether b contains an ASCII uppercase letter or any multibyte rune, the only cases that force an allocation.
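
A check of this kind can be written as a single pass over the bytes. The sketch below is a hypothetical re-implementation (needsFold is not the package's function) and assumes that any byte of 0x80 or above counts as part of a multibyte rune.

```go
package main

import "fmt"

// needsFold is a hypothetical version of the check described above. It
// reports whether b contains an ASCII uppercase letter or any byte that
// belongs to a multibyte UTF-8 sequence (any byte >= 0x80).
func needsFold(b []byte) bool {
	for _, c := range b {
		if ('A' <= c && c <= 'Z') || c >= 0x80 {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsFold([]byte("running"))) // false
	fmt.Println(needsFold([]byte("Running"))) // true
	fmt.Println(needsFold([]byte("canción"))) // true
}
```

When the check returns false the input is already in normalized form, so a stemmer can hand back a subslice instead of allocating.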

func TokenizeWithStemmer

func TokenizeWithStemmer(in []byte, stem Stemmer) iter.Seq[*Token]

Types

type Stemmer

type Stemmer func(raw []byte) (term []byte, owned bool)

Stemmer normalizes one raw token. It returns the term and reports whether that term is an owned allocation (true) or a subslice of raw (false).
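
To illustrate the contract, here is a toy function that satisfies the Stemmer signature. It is a sketch only: stripPlural is hypothetical and far cruder than a real stemmer, and the Stemmer type is redeclared locally so the example compiles on its own.

```go
package main

import (
	"bytes"
	"fmt"
)

// Stemmer mirrors the documented signature so this file is self-contained.
type Stemmer func(raw []byte) (term []byte, owned bool)

// stripPlural is a hypothetical toy stemmer. It drops one trailing "s"
// and lowercases the rest, allocating only when lowercasing is needed.
func stripPlural(raw []byte) ([]byte, bool) {
	term := bytes.TrimSuffix(raw, []byte("s")) // subslice of raw, no copy
	if bytes.ContainsAny(term, "ABCDEFGHIJKLMNOPQRSTUVWXYZ") {
		return bytes.ToLower(term), true // new allocation: owned
	}
	return term, false // borrowed from raw
}

func main() {
	var stem Stemmer = stripPlural
	for _, w := range []string{"tokens", "Tokens"} {
		term, owned := stem([]byte(w))
		fmt.Printf("%s owned=%v\n", term, owned)
	}
	// Output:
	// token owned=false
	// token owned=true
}
```

The owned flag lets a caller decide whether the term must be copied before the input buffer is reused.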

type Stopwords

type Stopwords map[uint64]struct{}

func BuildStopWords

func BuildStopWords(words ...string) Stopwords
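
Since Stopwords is a map keyed by uint64, the set presumably stores a hash of each word rather than the word itself. The sketch below shows one way such a set could be built and queried. The hash function is an assumption (FNV-1a from the standard library is used purely for illustration), and stopwords, hashTerm, build, and has are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// stopwords is a local stand-in for the package's Stopwords type.
type stopwords map[uint64]struct{}

// hashTerm hashes a term with FNV-1a. The hash used by the real package
// is not documented on this page; FNV-1a is an assumption.
func hashTerm(b []byte) uint64 {
	h := fnv.New64a()
	h.Write(b) // Write on an FNV hash never returns an error
	return h.Sum64()
}

// build is a hypothetical constructor that stores the hash of each word.
func build(words ...string) stopwords {
	sw := make(stopwords, len(words))
	for _, w := range words {
		sw[hashTerm([]byte(w))] = struct{}{}
	}
	return sw
}

// has reports whether term's hash is in the set.
func (sw stopwords) has(term []byte) bool {
	_, ok := sw[hashTerm(term)]
	return ok
}

func main() {
	sw := build("the", "a", "of")
	fmt.Println(sw.has([]byte("the")))   // true
	fmt.Println(sw.has([]byte("token"))) // false
}
```

Keying by hash lets lookups take a []byte directly, avoiding the string conversion a map[string]struct{} would need, at the cost of a very small chance of hash collisions.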

type Token

type Token struct {
	Value  []byte
	IsStem bool
}

type Tokenizer

type Tokenizer func(in []byte) (seq iter.Seq[*Token])
