tokenizer

package

Version: v1.2.1 (latest) | Published: Jun 21, 2026 | License: AGPL-3.0 | Imports: 4 | Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/RogueTeam/textiplex

Documentation ¶

Index ¶

func FilterStopword(sw Stopwords, in iter.Seq[*Token]) (out iter.Seq[*Token])
func FinishStemmer(raw []byte, keep int, fold func([]byte) []byte) ([]byte, bool)
func HasSuffixFold(b []byte, suf string) bool
func NeedsFold(b []byte) bool
func TokenizeWithStemmer(in []byte, stem Stemmer) iter.Seq[*Token]
type Stemmer
type Stopwords
- func BuildStopWords(words ...string) Stopwords
type Token
type Tokenizer

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func FilterStopword ¶

func FilterStopword(sw Stopwords, in iter.Seq[*Token]) (out iter.Seq[*Token])

func FinishStemmer ¶

func FinishStemmer(raw []byte, keep int, fold func([]byte) []byte) ([]byte, bool)

FinishStemmer returns a borrowed subslice when the kept prefix needs no rewriting; otherwise it returns the folded allocation. fold lowercases and, for Spanish, strips accents.

func HasSuffixFold ¶

func HasSuffixFold(b []byte, suf string) bool

func NeedsFold ¶

func NeedsFold(b []byte) bool

NeedsFold reports whether b contains an ASCII uppercase letter or any multibyte rune, the only cases that force an allocation.
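A predicate with this behavior might look like the sketch below. `needsFold` is a hypothetical re-implementation written from the description alone, not the package's code. It relies on the fact that every byte of a multibyte UTF-8 rune is at or above `utf8.RuneSelf` (0x80), so a single byte-wise pass covers both conditions.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// needsFold is a hypothetical stand-in for the documented NeedsFold.
// It returns true on the first ASCII uppercase letter or non-ASCII byte.
func needsFold(b []byte) bool {
	for _, c := range b {
		if ('A' <= c && c <= 'Z') || c >= utf8.RuneSelf {
			return true
		}
	}
	return false
}

func main() {
	for _, s := range []string{"running", "Running", "canción"} {
		fmt.Printf("%q needs fold: %v\n", s, needsFold([]byte(s)))
	}
}
```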

func TokenizeWithStemmer ¶

func TokenizeWithStemmer(in []byte, stem Stemmer) iter.Seq[*Token]

Types ¶

type Stemmer ¶

type Stemmer func(raw []byte) (term []byte, owned bool)

Stemmer normalizes one raw token. It returns the term and whether the term is an owned allocation (true) or a subslice of raw (false).
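The owned flag matters to callers that keep terms after the input buffer is reused. The sketch below illustrates this with a local copy of the `Stemmer` type. `trimPlural` and `retain` are hypothetical names invented for this example, and `trimPlural` is a toy, not one of the package's stemmers.

```go
package main

import (
	"bytes"
	"fmt"
)

// Stemmer mirrors the documented type.
type Stemmer func(raw []byte) (term []byte, owned bool)

// trimPlural is a hypothetical toy stemmer: it drops one trailing "s"
// and always returns a subslice of raw, so owned is false.
func trimPlural(raw []byte) ([]byte, bool) {
	return bytes.TrimSuffix(raw, []byte("s")), false
}

// retain returns a slice that is safe to keep after raw is overwritten.
// Borrowed terms are copied; owned terms are returned as-is.
func retain(term []byte, owned bool) []byte {
	if owned {
		return term
	}
	return bytes.Clone(term)
}

func main() {
	var stem Stemmer = trimPlural
	buf := []byte("cats")
	term, owned := stem(buf)
	kept := retain(term, owned)
	copy(buf, "dogs") // the caller reuses the input buffer
	fmt.Printf("borrowed=%s kept=%s\n", term, kept)
}
```

After the buffer is overwritten, the borrowed term reflects the new bytes while the retained copy does not, which is why a caller that stores terms needs to check the flag.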

type Stopwords ¶

type Stopwords map[uint64]struct{}

func BuildStopWords ¶

func BuildStopWords(words ...string) Stopwords
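Because Stopwords is keyed by uint64 rather than by string, membership is presumably tested by hashing the term. The sketch below shows one way such a set could be built and queried. The hash function is an assumption (FNV-1a from the standard library is used purely for illustration; the package may use a different one), and `hash64`, `buildStopWords`, and `has` are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Stopwords mirrors the documented type: a set of 64-bit hashes.
type Stopwords map[uint64]struct{}

// hash64 hashes b with FNV-1a. Assumption: the real package's hash is unknown.
func hash64(b []byte) uint64 {
	h := fnv.New64a()
	h.Write(b)
	return h.Sum64()
}

// buildStopWords is a hypothetical stand-in for BuildStopWords.
func buildStopWords(words ...string) Stopwords {
	sw := make(Stopwords, len(words))
	for _, w := range words {
		sw[hash64([]byte(w))] = struct{}{}
	}
	return sw
}

// has reports whether term's hash is in the set.
func (sw Stopwords) has(term []byte) bool {
	_, ok := sw[hash64(term)]
	return ok
}

func main() {
	sw := buildStopWords("the", "a", "of")
	fmt.Println(sw.has([]byte("the")), sw.has([]byte("fox")))
}
```

A hash-keyed set avoids converting each []byte token to a string for lookup. The trade-off is that two different words with the same hash would be indistinguishable, so a rare false positive is possible in principle.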

type Token ¶

type Token struct {
	Value  []byte
	IsStem bool
}

type Tokenizer ¶

type Tokenizer func(in []byte) (seq iter.Seq[*Token])

Source Files ¶

tokenizer.go

Directories ¶

Path	Synopsis
en
es
keyword
