tok

package
v25.0.0-split-vector3
Published: Jul 4, 2025 License: Apache-2.0 Imports: 55 Imported by: 0

Documentation

Index

Constants

View Source
const (
	IdentNone      = 0x0
	IdentTerm      = 0x1
	IdentExact     = 0x2
	IdentExactLang = 0x3
	IdentYear      = 0x4
	IdentMonth     = 0x41
	IdentDay       = 0x42
	IdentHour      = 0x43
	IdentGeo       = 0x5
	IdentInt       = 0x6
	IdentFloat     = 0x7
	IdentFullText  = 0x8
	IdentBool      = 0x9
	IdentTrigram   = 0xA
	IdentHash      = 0xB
	IdentSha       = 0xC
	IdentBigFloat  = 0xD
	IdentVFloat    = 0xE
	IdentCustom    = 0x80
	IdentDelimiter = 0x1f // ASCII 31 - Unit separator
)

Tokenizer identifiers are unique and can't be reused. The range 0x00 - 0x7f is system reserved. The range 0x80 - 0xff is for custom tokenizers. TODO: use these everywhere we must ensure a system tokenizer.
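As a sketch of how the two ranges divide (using the constant values listed above, restated locally so the snippet stands alone; the helper function is illustrative, not part of this package), telling a system identifier from a custom one is a single range check:

```go
package main

import "fmt"

// A few of the identifier constants from the package, restated here.
const (
	IdentTerm      = 0x1
	IdentCustom    = 0x80
	IdentDelimiter = 0x1f // ASCII 31 - Unit separator
)

// isCustomIdentifier reports whether id falls in the 0x80-0xff
// range reserved for user-provided custom tokenizers.
func isCustomIdentifier(id byte) bool {
	return id >= 0x80
}

func main() {
	fmt.Println(isCustomIdentifier(IdentTerm))   // system range: false
	fmt.Println(isCustomIdentifier(IdentCustom)) // custom range: true
}
```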

Variables

This section is empty.

Functions

func BuildTokens

func BuildTokens(val interface{}, t Tokenizer) ([]string, error)

BuildTokens tokenizes a value, creating strings that can be used to create index keys.

func EncodeGeoTokens

func EncodeGeoTokens(tokens []string)

EncodeGeoTokens encodes the given list of tokens as geo tokens.

func EncodeRegexTokens

func EncodeRegexTokens(tokens []string)

EncodeRegexTokens encodes the given list of strings as regex tokens.

func GetFullTextTokens

func GetFullTextTokens(funcArgs []string, lang string) ([]string, error)

GetFullTextTokens returns the full-text tokens for the given value.

func GetIndexFactoryOptsFromSpec

func GetIndexFactoryOptsFromSpec(spec *pb.VectorIndexSpec) (opts.Options, error)

func GetTermTokens

func GetTermTokens(funcArgs []string) ([]string, error)

GetTermTokens returns the term tokens for the given value.

func GetTokens

func GetTokens(id byte, funcArgs ...string) ([]string, error)

GetTokens returns the tokens for the given tokenizer ID and value. funcArgs should only have one element, which is the value that needs to be tokenized.

func LangBase

func LangBase(lang string) string

LangBase returns the BCP47 base of a language. If the confidence of the matching is better than none, we return that base. Otherwise, we return "en" (English), which is a good default.

func LoadCustomTokenizer

func LoadCustomTokenizer(soFile string)

LoadCustomTokenizer reads and loads a custom tokenizer from the given file.

Types

type BigFloatTokenizer

type BigFloatTokenizer struct{}

BigFloatTokenizer generates tokens from big float data.

func (BigFloatTokenizer) Identifier

func (t BigFloatTokenizer) Identifier() byte

func (BigFloatTokenizer) IsLossy

func (t BigFloatTokenizer) IsLossy() bool

func (BigFloatTokenizer) IsSortable

func (t BigFloatTokenizer) IsSortable() bool

func (BigFloatTokenizer) Name

func (t BigFloatTokenizer) Name() string

func (BigFloatTokenizer) Tokens

func (t BigFloatTokenizer) Tokens(v interface{}) ([]string, error)

func (BigFloatTokenizer) Type

func (t BigFloatTokenizer) Type() string

type BoolTokenizer

type BoolTokenizer struct{}

BoolTokenizer returns tokens from boolean data.

func (BoolTokenizer) Identifier

func (t BoolTokenizer) Identifier() byte

func (BoolTokenizer) IsLossy

func (t BoolTokenizer) IsLossy() bool

func (BoolTokenizer) IsSortable

func (t BoolTokenizer) IsSortable() bool

func (BoolTokenizer) Name

func (t BoolTokenizer) Name() string

func (BoolTokenizer) Tokens

func (t BoolTokenizer) Tokens(v interface{}) ([]string, error)

func (BoolTokenizer) Type

func (t BoolTokenizer) Type() string

type CustomTokenizer

type CustomTokenizer struct{ PluginTokenizer }

CustomTokenizer generates tokens from custom logic. It doesn't make sense for plugins to implement the IsSortable and IsLossy methods, so they're hard-coded.

func (CustomTokenizer) IsLossy

func (t CustomTokenizer) IsLossy() bool

func (CustomTokenizer) IsSortable

func (t CustomTokenizer) IsSortable() bool

type DayTokenizer

type DayTokenizer struct{}

DayTokenizer generates day tokens from datetime data.

func (DayTokenizer) Identifier

func (t DayTokenizer) Identifier() byte

func (DayTokenizer) IsLossy

func (t DayTokenizer) IsLossy() bool

func (DayTokenizer) IsSortable

func (t DayTokenizer) IsSortable() bool

func (DayTokenizer) Name

func (t DayTokenizer) Name() string

func (DayTokenizer) Tokens

func (t DayTokenizer) Tokens(v interface{}) ([]string, error)

func (DayTokenizer) Type

func (t DayTokenizer) Type() string

type ExactTokenizer

type ExactTokenizer struct {
	// contains filtered or unexported fields
}

ExactTokenizer returns the exact string as a token. If a collator is provided for a language, it also adds the language to the token prefix.

func (ExactTokenizer) Identifier

func (t ExactTokenizer) Identifier() byte

func (ExactTokenizer) IsLossy

func (t ExactTokenizer) IsLossy() bool

func (ExactTokenizer) IsSortable

func (t ExactTokenizer) IsSortable() bool

func (ExactTokenizer) Name

func (t ExactTokenizer) Name() string

func (ExactTokenizer) Prefix

func (t ExactTokenizer) Prefix() []byte

func (ExactTokenizer) Tokens

func (t ExactTokenizer) Tokens(v interface{}) ([]string, error)

func (ExactTokenizer) Type

func (t ExactTokenizer) Type() string

type FactoryCreateSpec

type FactoryCreateSpec struct {
	// contains filtered or unexported fields
}

FactoryCreateSpec includes an IndexFactory and the options required to instantiate a VectorIndex of the given type. In short, everything needed to create a VectorIndex.

func GetFactoryCreateSpecFromSpec

func GetFactoryCreateSpecFromSpec(spec *pb.VectorIndexSpec) (*FactoryCreateSpec, error)

func (*FactoryCreateSpec) CreateIndex

func (fcs *FactoryCreateSpec) CreateIndex(name string) (index.VectorIndex[float32], error)

func (*FactoryCreateSpec) Name

func (fcs *FactoryCreateSpec) Name() string

type FloatTokenizer

type FloatTokenizer struct{}

FloatTokenizer generates tokens from floating-point data.

func (FloatTokenizer) Identifier

func (t FloatTokenizer) Identifier() byte

func (FloatTokenizer) IsLossy

func (t FloatTokenizer) IsLossy() bool

func (FloatTokenizer) IsSortable

func (t FloatTokenizer) IsSortable() bool

func (FloatTokenizer) Name

func (t FloatTokenizer) Name() string

func (FloatTokenizer) Tokens

func (t FloatTokenizer) Tokens(v interface{}) ([]string, error)

func (FloatTokenizer) Type

func (t FloatTokenizer) Type() string

type FullTextTokenizer

type FullTextTokenizer struct {
	// contains filtered or unexported fields
}

FullTextTokenizer generates full-text tokens from string data.

func (FullTextTokenizer) Identifier

func (t FullTextTokenizer) Identifier() byte

func (FullTextTokenizer) IsLossy

func (t FullTextTokenizer) IsLossy() bool

func (FullTextTokenizer) IsSortable

func (t FullTextTokenizer) IsSortable() bool

func (FullTextTokenizer) Name

func (t FullTextTokenizer) Name() string

func (FullTextTokenizer) Tokens

func (t FullTextTokenizer) Tokens(v interface{}) ([]string, error)

func (FullTextTokenizer) Type

func (t FullTextTokenizer) Type() string

type GeoTokenizer

type GeoTokenizer struct{}

GeoTokenizer generates tokens from geo data.

func (GeoTokenizer) Identifier

func (t GeoTokenizer) Identifier() byte

func (GeoTokenizer) IsLossy

func (t GeoTokenizer) IsLossy() bool

func (GeoTokenizer) IsSortable

func (t GeoTokenizer) IsSortable() bool

func (GeoTokenizer) Name

func (t GeoTokenizer) Name() string

func (GeoTokenizer) Tokens

func (t GeoTokenizer) Tokens(v interface{}) ([]string, error)

func (GeoTokenizer) Type

func (t GeoTokenizer) Type() string

type HashTokenizer

type HashTokenizer struct{}

HashTokenizer returns hash tokens from string data.

func (HashTokenizer) Identifier

func (t HashTokenizer) Identifier() byte

func (HashTokenizer) IsLossy

func (t HashTokenizer) IsLossy() bool

IsLossy returns false for the HashTokenizer. This allows us to avoid retrieving values for the returned results and comparing them against the value in the query, which is slow. The probability of collisions with a 256-bit hash is very low, and we use that fact to speed up equality queries using the hash index.

func (HashTokenizer) IsSortable

func (t HashTokenizer) IsSortable() bool

func (HashTokenizer) Name

func (t HashTokenizer) Name() string

func (HashTokenizer) Tokens

func (t HashTokenizer) Tokens(v interface{}) ([]string, error)

func (HashTokenizer) Type

func (t HashTokenizer) Type() string

type HourTokenizer

type HourTokenizer struct{}

HourTokenizer generates hour tokens from datetime data.

func (HourTokenizer) Identifier

func (t HourTokenizer) Identifier() byte

func (HourTokenizer) IsLossy

func (t HourTokenizer) IsLossy() bool

func (HourTokenizer) IsSortable

func (t HourTokenizer) IsSortable() bool

func (HourTokenizer) Name

func (t HourTokenizer) Name() string

func (HourTokenizer) Tokens

func (t HourTokenizer) Tokens(v interface{}) ([]string, error)

func (HourTokenizer) Type

func (t HourTokenizer) Type() string

type IndexFactory

type IndexFactory interface {
	Tokenizer
	// TODO: Distinguish between float64 and float32, allowing either.
	//       Default should be float32.
	index.IndexFactory[float32]
}

IndexFactory combines the notion of a Tokenizer with index.IndexFactory. We register IndexFactory instances just like we register Tokenizers.

func GetIndexFactoriesFromSpecs

func GetIndexFactoriesFromSpecs(specs []*pb.VectorIndexSpec) []IndexFactory

func GetIndexFactory

func GetIndexFactory(name string) (IndexFactory, bool)

GetIndexFactory returns the IndexFactory with the given name.

func GetIndexFactoryFromSpec

func GetIndexFactoryFromSpec(spec *pb.VectorIndexSpec) (IndexFactory, bool)

type IntTokenizer

type IntTokenizer struct{}

IntTokenizer generates tokens from integer data.

func (IntTokenizer) Identifier

func (t IntTokenizer) Identifier() byte

func (IntTokenizer) IsLossy

func (t IntTokenizer) IsLossy() bool

func (IntTokenizer) IsSortable

func (t IntTokenizer) IsSortable() bool

func (IntTokenizer) Name

func (t IntTokenizer) Name() string

func (IntTokenizer) Tokens

func (t IntTokenizer) Tokens(v interface{}) ([]string, error)

func (IntTokenizer) Type

func (t IntTokenizer) Type() string

type MonthTokenizer

type MonthTokenizer struct{}

MonthTokenizer generates month tokens from datetime data.

func (MonthTokenizer) Identifier

func (t MonthTokenizer) Identifier() byte

func (MonthTokenizer) IsLossy

func (t MonthTokenizer) IsLossy() bool

func (MonthTokenizer) IsSortable

func (t MonthTokenizer) IsSortable() bool

func (MonthTokenizer) Name

func (t MonthTokenizer) Name() string

func (MonthTokenizer) Tokens

func (t MonthTokenizer) Tokens(v interface{}) ([]string, error)

func (MonthTokenizer) Type

func (t MonthTokenizer) Type() string

type PluginTokenizer

type PluginTokenizer interface {
	Name() string
	Type() string
	Tokens(interface{}) ([]string, error)
	Identifier() byte
}

PluginTokenizer is implemented by external plugins loaded dynamically via *.so files. It follows the implementation semantics of the Tokenizer interface.

Think carefully before modifying this interface, as it would break users' plugins.
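A plugin satisfying this interface might look like the sketch below. The rune-splitting tokenizer is invented for illustration, and the exported Tokenizer function follows the export convention commonly shown in Dgraph's custom-tokenizer documentation; check the docs for your version before relying on it.

```go
// Built as a shared object, e.g.:
//   go build -buildmode=plugin -o rune.so rune.go
package main

import "fmt"

// RuneTokenizer is an illustrative plugin implementing the four
// PluginTokenizer methods: it emits one token per rune.
type RuneTokenizer struct{}

func (RuneTokenizer) Name() string { return "rune" }
func (RuneTokenizer) Type() string { return "string" }

// Identifier must fall in the custom range 0x80-0xff.
func (RuneTokenizer) Identifier() byte { return 0xfd }

func (RuneTokenizer) Tokens(value interface{}) ([]string, error) {
	s, ok := value.(string)
	if !ok {
		return nil, fmt.Errorf("rune tokenizer expects a string, got %T", value)
	}
	var toks []string
	for _, r := range s {
		toks = append(toks, string(r))
	}
	return toks, nil
}

// Tokenizer is the symbol the loader is assumed to look up in the
// shared object (hedged: verify the exact export convention).
func Tokenizer() interface{} { return RuneTokenizer{} }

// main is unused in plugin mode but lets the file compile standalone.
func main() {}
```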

type Sha256Tokenizer

type Sha256Tokenizer struct {
	// contains filtered or unexported fields
}

Sha256Tokenizer generates sha256 hash tokens from string data.

func (Sha256Tokenizer) Identifier

func (t Sha256Tokenizer) Identifier() byte

func (Sha256Tokenizer) IsLossy

func (t Sha256Tokenizer) IsLossy() bool

func (Sha256Tokenizer) IsSortable

func (t Sha256Tokenizer) IsSortable() bool

func (Sha256Tokenizer) Name

func (t Sha256Tokenizer) Name() string

func (Sha256Tokenizer) Tokens

func (t Sha256Tokenizer) Tokens(v interface{}) ([]string, error)

func (Sha256Tokenizer) Type

func (t Sha256Tokenizer) Type() string

type TermTokenizer

type TermTokenizer struct {
	// contains filtered or unexported fields
}

TermTokenizer generates term tokens from string data.

func (TermTokenizer) Identifier

func (t TermTokenizer) Identifier() byte

func (TermTokenizer) IsLossy

func (t TermTokenizer) IsLossy() bool

func (TermTokenizer) IsSortable

func (t TermTokenizer) IsSortable() bool

func (TermTokenizer) Name

func (t TermTokenizer) Name() string

func (TermTokenizer) Tokens

func (t TermTokenizer) Tokens(v interface{}) ([]string, error)

func (TermTokenizer) Type

func (t TermTokenizer) Type() string

type Tokenizer

type Tokenizer interface {

	// Name returns the name of the tokenizer. This should be unique.
	Name() string

	// Type returns the string representation of the typeID that we care about.
	Type() string

	// Tokens returns tokens for a given value. The tokens shouldn't be encoded
	// with the byte identifier.
	Tokens(interface{}) ([]string, error)

	// Identifier returns the prefix byte for this token type. This should be
	// unique. The range 0x80 to 0xff (inclusive) is reserved for user-provided
	// custom tokenizers.
	Identifier() byte

	// IsSortable returns true if the tokenizer can be used for sorting/ordering.
	IsSortable() bool

	// IsLossy returns true if we don't store the values directly as index keys
	// during tokenization. If a predicate is tokenized using an IsLossy() tokenizer,
	// then we need to fetch the actual value and compare.
	IsLossy() bool
}

Tokenizer defines what a tokenizer must provide.
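To make the contract concrete, here is a self-contained sketch of a type satisfying it (the interface is restated locally so the snippet compiles on its own, and the word tokenizer itself is invented for illustration; it is not one of the package's tokenizers):

```go
package main

import (
	"fmt"
	"strings"
)

// Tokenizer restates the interface above so this sketch is self-contained.
type Tokenizer interface {
	Name() string
	Type() string
	Tokens(interface{}) ([]string, error)
	Identifier() byte
	IsSortable() bool
	IsLossy() bool
}

// wordTokenizer lower-cases its input and splits on whitespace.
// It is lossy: many inputs map to the same tokens, so an equality
// query would still need to fetch and compare the stored value.
type wordTokenizer struct{}

func (wordTokenizer) Name() string     { return "word" }
func (wordTokenizer) Type() string     { return "string" }
func (wordTokenizer) Identifier() byte { return 0x90 } // custom range
func (wordTokenizer) IsSortable() bool { return false }
func (wordTokenizer) IsLossy() bool    { return true }

func (wordTokenizer) Tokens(v interface{}) ([]string, error) {
	s, ok := v.(string)
	if !ok {
		return nil, fmt.Errorf("word tokenizer expects a string, got %T", v)
	}
	return strings.Fields(strings.ToLower(s)), nil
}

func main() {
	var t Tokenizer = wordTokenizer{}
	toks, _ := t.Tokens("The Quick Fox")
	fmt.Println(toks) // [the quick fox]
}
```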

func GetTokenizer

func GetTokenizer(name string) (Tokenizer, bool)

GetTokenizer returns the tokenizer with the given unique name.

func GetTokenizerByID

func GetTokenizerByID(id byte) (Tokenizer, bool)

GetTokenizerByID tries to find a tokenizer by id in the registered list. Returns the tokenizer and true if found, otherwise nil and false.

func GetTokenizerForLang

func GetTokenizerForLang(t Tokenizer, lang string) Tokenizer

GetTokenizerForLang returns the correct full-text tokenizer for the given language.

func GetTokenizers

func GetTokenizers(names []string) ([]Tokenizer, error)

GetTokenizers returns a list of tokenizers given a list of unique names.

type TrigramTokenizer

type TrigramTokenizer struct{}

TrigramTokenizer returns trigram tokens from string data.

func (TrigramTokenizer) Identifier

func (t TrigramTokenizer) Identifier() byte

func (TrigramTokenizer) IsLossy

func (t TrigramTokenizer) IsLossy() bool

func (TrigramTokenizer) IsSortable

func (t TrigramTokenizer) IsSortable() bool

func (TrigramTokenizer) Name

func (t TrigramTokenizer) Name() string

func (TrigramTokenizer) Tokens

func (t TrigramTokenizer) Tokens(v interface{}) ([]string, error)

func (TrigramTokenizer) Type

func (t TrigramTokenizer) Type() string

type YearTokenizer

type YearTokenizer struct{}

YearTokenizer generates year tokens from datetime data.

func (YearTokenizer) Identifier

func (t YearTokenizer) Identifier() byte

func (YearTokenizer) IsLossy

func (t YearTokenizer) IsLossy() bool

func (YearTokenizer) IsSortable

func (t YearTokenizer) IsSortable() bool

func (YearTokenizer) Name

func (t YearTokenizer) Name() string

func (YearTokenizer) Tokens

func (t YearTokenizer) Tokens(v interface{}) ([]string, error)

func (YearTokenizer) Type

func (t YearTokenizer) Type() string

