Documentation ¶
Index ¶
- Constants
- func BuildTokens(val interface{}, t Tokenizer) ([]string, error)
- func EncodeGeoTokens(tokens []string)
- func EncodeRegexTokens(tokens []string)
- func GetFullTextTokens(funcArgs []string, lang string) ([]string, error)
- func GetIndexFactoryOptsFromSpec(spec *pb.VectorIndexSpec) (opts.Options, error)
- func GetTermTokens(funcArgs []string) ([]string, error)
- func GetTokens(id byte, funcArgs ...string) ([]string, error)
- func LangBase(lang string) string
- func LoadCustomTokenizer(soFile string)
- type BigFloatTokenizer
- type BoolTokenizer
- type CustomTokenizer
- type DayTokenizer
- type ExactTokenizer
- func (t ExactTokenizer) Identifier() byte
- func (t ExactTokenizer) IsLossy() bool
- func (t ExactTokenizer) IsSortable() bool
- func (t ExactTokenizer) Name() string
- func (t ExactTokenizer) Prefix() []byte
- func (t ExactTokenizer) Tokens(v interface{}) ([]string, error)
- func (t ExactTokenizer) Type() string
- type FactoryCreateSpec
- type FloatTokenizer
- type FullTextTokenizer
- type GeoTokenizer
- type HashTokenizer
- type HourTokenizer
- type IndexFactory
- type IntTokenizer
- type MonthTokenizer
- type PluginTokenizer
- type Sha256Tokenizer
- type TermTokenizer
- type Tokenizer
- type TrigramTokenizer
- type YearTokenizer
Constants ¶
const (
	IdentNone      = 0x0
	IdentTerm      = 0x1
	IdentExact     = 0x2
	IdentExactLang = 0x3
	IdentYear      = 0x4
	IdentMonth     = 0x41
	IdentDay       = 0x42
	IdentHour      = 0x43
	IdentGeo       = 0x5
	IdentInt       = 0x6
	IdentFloat     = 0x7
	IdentFullText  = 0x8
	IdentBool      = 0x9
	IdentTrigram   = 0xA
	IdentHash      = 0xB
	IdentSha       = 0xC
	IdentBigFloat  = 0xD
	IdentVFloat    = 0xE
	IdentCustom    = 0x80
	IdentDelimiter = 0x1f // ASCII 31 - Unit separator
)
Tokenizer identifiers are unique and can't be reused. The range 0x00 - 0x7f is reserved for system tokenizers. The range 0x80 - 0xff is for custom tokenizers. TODO: use these everywhere where we must ensure a system tokenizer.
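The identifier byte effectively namespaces index keys per tokenizer, so tokens produced by different tokenizers never collide. A minimal self-contained sketch (the constants are copied from the block above; the `encodeToken` helper is hypothetical, for illustration only, and not the package's actual encoding function):

```go
package main

import "fmt"

// Identifier constants copied from the const block above.
const (
	IdentTerm  = 0x1
	IdentExact = 0x2
)

// encodeToken is a hypothetical helper showing how a raw token can be
// prefixed with its tokenizer's identifier byte, keeping each
// tokenizer's keyspace disjoint in the index.
func encodeToken(ident byte, tok string) string {
	return string([]byte{ident}) + tok
}

func main() {
	// The same raw token yields distinct index keys under distinct tokenizers.
	fmt.Printf("%q\n", encodeToken(IdentTerm, "apple"))  // "\x01apple"
	fmt.Printf("%q\n", encodeToken(IdentExact, "apple")) // "\x02apple"
}
```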
Variables ¶
This section is empty.
Functions ¶
func BuildTokens ¶
func BuildTokens(val interface{}, t Tokenizer) ([]string, error)
BuildTokens tokenizes a value, creating strings that can be used to create index keys.
func EncodeGeoTokens ¶
func EncodeGeoTokens(tokens []string)
EncodeGeoTokens encodes the given list of tokens as geo tokens.
func EncodeRegexTokens ¶
func EncodeRegexTokens(tokens []string)
EncodeRegexTokens encodes the given list of strings as regex tokens.
func GetFullTextTokens ¶
func GetFullTextTokens(funcArgs []string, lang string) ([]string, error)
GetFullTextTokens returns the full-text tokens for the given value.
func GetIndexFactoryOptsFromSpec ¶
func GetIndexFactoryOptsFromSpec(spec *pb.VectorIndexSpec) (opts.Options, error)
func GetTermTokens ¶
func GetTermTokens(funcArgs []string) ([]string, error)
GetTermTokens returns the term tokens for the given value.
func GetTokens ¶
func GetTokens(id byte, funcArgs ...string) ([]string, error)
GetTokens returns the tokens for the given tokenizer ID and value. funcArgs should contain exactly one element: the value to be tokenized.
func LangBase ¶
func LangBase(lang string) string
LangBase returns the BCP 47 base of a language. If the confidence of the match is better than none, we return that base. Otherwise, we return "en" (English), which is a good default.
func LoadCustomTokenizer ¶
func LoadCustomTokenizer(soFile string)
LoadCustomTokenizer reads and loads a custom tokenizer from the given file.
Types ¶
type BigFloatTokenizer ¶
type BigFloatTokenizer struct{}
BigFloatTokenizer generates tokens from big float data.
func (BigFloatTokenizer) Identifier ¶
func (t BigFloatTokenizer) Identifier() byte
func (BigFloatTokenizer) IsLossy ¶
func (t BigFloatTokenizer) IsLossy() bool
func (BigFloatTokenizer) IsSortable ¶
func (t BigFloatTokenizer) IsSortable() bool
func (BigFloatTokenizer) Name ¶
func (t BigFloatTokenizer) Name() string
func (BigFloatTokenizer) Tokens ¶
func (t BigFloatTokenizer) Tokens(v interface{}) ([]string, error)
func (BigFloatTokenizer) Type ¶
func (t BigFloatTokenizer) Type() string
type BoolTokenizer ¶
type BoolTokenizer struct{}
BoolTokenizer returns tokens from boolean data.
func (BoolTokenizer) Identifier ¶
func (t BoolTokenizer) Identifier() byte
func (BoolTokenizer) IsLossy ¶
func (t BoolTokenizer) IsLossy() bool
func (BoolTokenizer) IsSortable ¶
func (t BoolTokenizer) IsSortable() bool
func (BoolTokenizer) Name ¶
func (t BoolTokenizer) Name() string
func (BoolTokenizer) Tokens ¶
func (t BoolTokenizer) Tokens(v interface{}) ([]string, error)
func (BoolTokenizer) Type ¶
func (t BoolTokenizer) Type() string
type CustomTokenizer ¶
type CustomTokenizer struct{ PluginTokenizer }
CustomTokenizer generates tokens from custom logic. It doesn't make sense for plugins to implement the IsSortable and IsLossy methods, so they're hard-coded.
func (CustomTokenizer) IsLossy ¶
func (t CustomTokenizer) IsLossy() bool
func (CustomTokenizer) IsSortable ¶
func (t CustomTokenizer) IsSortable() bool
type DayTokenizer ¶
type DayTokenizer struct{}
DayTokenizer generates day tokens from datetime data.
func (DayTokenizer) Identifier ¶
func (t DayTokenizer) Identifier() byte
func (DayTokenizer) IsLossy ¶
func (t DayTokenizer) IsLossy() bool
func (DayTokenizer) IsSortable ¶
func (t DayTokenizer) IsSortable() bool
func (DayTokenizer) Name ¶
func (t DayTokenizer) Name() string
func (DayTokenizer) Tokens ¶
func (t DayTokenizer) Tokens(v interface{}) ([]string, error)
func (DayTokenizer) Type ¶
func (t DayTokenizer) Type() string
type ExactTokenizer ¶
type ExactTokenizer struct {
// contains filtered or unexported fields
}
ExactTokenizer returns the exact string as a token. If a collator is provided for a language, it also adds the language to the token prefix.
func (ExactTokenizer) Identifier ¶
func (t ExactTokenizer) Identifier() byte
func (ExactTokenizer) IsLossy ¶
func (t ExactTokenizer) IsLossy() bool
func (ExactTokenizer) IsSortable ¶
func (t ExactTokenizer) IsSortable() bool
func (ExactTokenizer) Name ¶
func (t ExactTokenizer) Name() string
func (ExactTokenizer) Prefix ¶
func (t ExactTokenizer) Prefix() []byte
func (ExactTokenizer) Tokens ¶
func (t ExactTokenizer) Tokens(v interface{}) ([]string, error)
func (ExactTokenizer) Type ¶
func (t ExactTokenizer) Type() string
type FactoryCreateSpec ¶
type FactoryCreateSpec struct {
// contains filtered or unexported fields
}
FactoryCreateSpec includes an IndexFactory and the options required to instantiate a VectorIndex of the given type. In short, everything that is needed in order to create a VectorIndex!
func GetFactoryCreateSpecFromSpec ¶
func GetFactoryCreateSpecFromSpec(spec *pb.VectorIndexSpec) (*FactoryCreateSpec, error)
func (*FactoryCreateSpec) CreateIndex ¶
func (fcs *FactoryCreateSpec) CreateIndex(name string) (index.VectorIndex[float32], error)
func (*FactoryCreateSpec) Name ¶
func (fcs *FactoryCreateSpec) Name() string
type FloatTokenizer ¶
type FloatTokenizer struct{}
FloatTokenizer generates tokens from floating-point data.
func (FloatTokenizer) Identifier ¶
func (t FloatTokenizer) Identifier() byte
func (FloatTokenizer) IsLossy ¶
func (t FloatTokenizer) IsLossy() bool
func (FloatTokenizer) IsSortable ¶
func (t FloatTokenizer) IsSortable() bool
func (FloatTokenizer) Name ¶
func (t FloatTokenizer) Name() string
func (FloatTokenizer) Tokens ¶
func (t FloatTokenizer) Tokens(v interface{}) ([]string, error)
func (FloatTokenizer) Type ¶
func (t FloatTokenizer) Type() string
type FullTextTokenizer ¶
type FullTextTokenizer struct {
// contains filtered or unexported fields
}
FullTextTokenizer generates full-text tokens from string data.
func (FullTextTokenizer) Identifier ¶
func (t FullTextTokenizer) Identifier() byte
func (FullTextTokenizer) IsLossy ¶
func (t FullTextTokenizer) IsLossy() bool
func (FullTextTokenizer) IsSortable ¶
func (t FullTextTokenizer) IsSortable() bool
func (FullTextTokenizer) Name ¶
func (t FullTextTokenizer) Name() string
func (FullTextTokenizer) Tokens ¶
func (t FullTextTokenizer) Tokens(v interface{}) ([]string, error)
func (FullTextTokenizer) Type ¶
func (t FullTextTokenizer) Type() string
type GeoTokenizer ¶
type GeoTokenizer struct{}
GeoTokenizer generates tokens from geo data.
func (GeoTokenizer) Identifier ¶
func (t GeoTokenizer) Identifier() byte
func (GeoTokenizer) IsLossy ¶
func (t GeoTokenizer) IsLossy() bool
func (GeoTokenizer) IsSortable ¶
func (t GeoTokenizer) IsSortable() bool
func (GeoTokenizer) Name ¶
func (t GeoTokenizer) Name() string
func (GeoTokenizer) Tokens ¶
func (t GeoTokenizer) Tokens(v interface{}) ([]string, error)
func (GeoTokenizer) Type ¶
func (t GeoTokenizer) Type() string
type HashTokenizer ¶
type HashTokenizer struct{}
HashTokenizer returns hash tokens from string data.
func (HashTokenizer) Identifier ¶
func (t HashTokenizer) Identifier() byte
func (HashTokenizer) IsLossy ¶
func (t HashTokenizer) IsLossy() bool
IsLossy returns false for the HashTokenizer. This lets us avoid retrieving the values for the returned results and comparing them against the value in the query, which is slow. With a 256-bit hash, the probability of a collision is very low; we use that fact to speed up equality queries using the hash index.
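The rationale above can be illustrated with a self-contained sketch (standalone code, not the package's actual implementation): equal inputs always produce equal 256-bit tokens, so an equality query can be answered from the index alone, and the collision probability is negligible.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashToken sketches what a hash tokenizer produces: a fixed-width
// 256-bit digest of the value, usable directly as an index key.
func hashToken(val string) string {
	sum := sha256.Sum256([]byte(val))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := hashToken("dgraph")
	b := hashToken("dgraph")
	c := hashToken("graphql")
	// Equal values yield equal tokens; an equality query only needs
	// the index, not a fetch of the stored value.
	fmt.Println(a == b, a == c) // true false
}
```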
func (HashTokenizer) IsSortable ¶
func (t HashTokenizer) IsSortable() bool
func (HashTokenizer) Name ¶
func (t HashTokenizer) Name() string
func (HashTokenizer) Tokens ¶
func (t HashTokenizer) Tokens(v interface{}) ([]string, error)
func (HashTokenizer) Type ¶
func (t HashTokenizer) Type() string
type HourTokenizer ¶
type HourTokenizer struct{}
HourTokenizer generates hour tokens from datetime data.
func (HourTokenizer) Identifier ¶
func (t HourTokenizer) Identifier() byte
func (HourTokenizer) IsLossy ¶
func (t HourTokenizer) IsLossy() bool
func (HourTokenizer) IsSortable ¶
func (t HourTokenizer) IsSortable() bool
func (HourTokenizer) Name ¶
func (t HourTokenizer) Name() string
func (HourTokenizer) Tokens ¶
func (t HourTokenizer) Tokens(v interface{}) ([]string, error)
func (HourTokenizer) Type ¶
func (t HourTokenizer) Type() string
type IndexFactory ¶
type IndexFactory interface {
	Tokenizer

	// TODO: Distinguish between float64 and float32, allowing either.
	// Default should be float32.
	index.IndexFactory[float32]
}
IndexFactory combines the notion of a Tokenizer with index.IndexFactory. We register IndexFactory instances just like we register Tokenizers.
func GetIndexFactoriesFromSpecs ¶
func GetIndexFactoriesFromSpecs(specs []*pb.VectorIndexSpec) []IndexFactory
func GetIndexFactory ¶
func GetIndexFactory(name string) (IndexFactory, bool)
GetIndexFactory returns the IndexFactory with the given name.
func GetIndexFactoryFromSpec ¶
func GetIndexFactoryFromSpec(spec *pb.VectorIndexSpec) (IndexFactory, bool)
type IntTokenizer ¶
type IntTokenizer struct{}
IntTokenizer generates tokens from integer data.
func (IntTokenizer) Identifier ¶
func (t IntTokenizer) Identifier() byte
func (IntTokenizer) IsLossy ¶
func (t IntTokenizer) IsLossy() bool
func (IntTokenizer) IsSortable ¶
func (t IntTokenizer) IsSortable() bool
func (IntTokenizer) Name ¶
func (t IntTokenizer) Name() string
func (IntTokenizer) Tokens ¶
func (t IntTokenizer) Tokens(v interface{}) ([]string, error)
func (IntTokenizer) Type ¶
func (t IntTokenizer) Type() string
type MonthTokenizer ¶
type MonthTokenizer struct{}
MonthTokenizer generates month tokens from datetime data.
func (MonthTokenizer) Identifier ¶
func (t MonthTokenizer) Identifier() byte
func (MonthTokenizer) IsLossy ¶
func (t MonthTokenizer) IsLossy() bool
func (MonthTokenizer) IsSortable ¶
func (t MonthTokenizer) IsSortable() bool
func (MonthTokenizer) Name ¶
func (t MonthTokenizer) Name() string
func (MonthTokenizer) Tokens ¶
func (t MonthTokenizer) Tokens(v interface{}) ([]string, error)
func (MonthTokenizer) Type ¶
func (t MonthTokenizer) Type() string
type PluginTokenizer ¶
type PluginTokenizer interface {
	Name() string
	Type() string
	Tokens(interface{}) ([]string, error)
	Identifier() byte
}
PluginTokenizer is implemented by external plugins loaded dynamically via *.so files. It follows the implementation semantics of the Tokenizer interface.
Think carefully before modifying this interface, as it would break users' plugins.
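A minimal sketch of a type satisfying this interface. The interface is restated locally so the file compiles standalone; `RuneTokenizer` and its 0x81 identifier are hypothetical examples. A real plugin must pick an unused identifier in the custom range 0x80 - 0xff and be built with `go build -buildmode=plugin` to produce the .so file:

```go
package main

import "fmt"

// PluginTokenizer restates the interface from the documentation so
// this file compiles on its own.
type PluginTokenizer interface {
	Name() string
	Type() string
	Tokens(interface{}) ([]string, error)
	Identifier() byte
}

// RuneTokenizer is a hypothetical custom tokenizer that emits one
// token per rune of a string value.
type RuneTokenizer struct{}

func (RuneTokenizer) Name() string     { return "rune" }
func (RuneTokenizer) Type() string     { return "string" }
func (RuneTokenizer) Identifier() byte { return 0x81 } // custom range 0x80-0xff

func (RuneTokenizer) Tokens(v interface{}) ([]string, error) {
	s, ok := v.(string)
	if !ok {
		return nil, fmt.Errorf("rune tokenizer only supports strings, got %T", v)
	}
	var toks []string
	for _, r := range s {
		toks = append(toks, string(r))
	}
	return toks, nil
}

func main() {
	var t PluginTokenizer = RuneTokenizer{}
	toks, _ := t.Tokens("héllo")
	fmt.Println(t.Name(), toks)
}
```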
type Sha256Tokenizer ¶
type Sha256Tokenizer struct {
// contains filtered or unexported fields
}
Sha256Tokenizer generates tokens from the sha256 hash of string data.
func (Sha256Tokenizer) Identifier ¶
func (t Sha256Tokenizer) Identifier() byte
func (Sha256Tokenizer) IsLossy ¶
func (t Sha256Tokenizer) IsLossy() bool
func (Sha256Tokenizer) IsSortable ¶
func (t Sha256Tokenizer) IsSortable() bool
func (Sha256Tokenizer) Name ¶
func (t Sha256Tokenizer) Name() string
func (Sha256Tokenizer) Tokens ¶
func (t Sha256Tokenizer) Tokens(v interface{}) ([]string, error)
func (Sha256Tokenizer) Type ¶
func (t Sha256Tokenizer) Type() string
type TermTokenizer ¶
type TermTokenizer struct {
// contains filtered or unexported fields
}
TermTokenizer generates term tokens from string data.
func (TermTokenizer) Identifier ¶
func (t TermTokenizer) Identifier() byte
func (TermTokenizer) IsLossy ¶
func (t TermTokenizer) IsLossy() bool
func (TermTokenizer) IsSortable ¶
func (t TermTokenizer) IsSortable() bool
func (TermTokenizer) Name ¶
func (t TermTokenizer) Name() string
func (TermTokenizer) Tokens ¶
func (t TermTokenizer) Tokens(v interface{}) ([]string, error)
func (TermTokenizer) Type ¶
func (t TermTokenizer) Type() string
type Tokenizer ¶
type Tokenizer interface {
	// Name is the name of the tokenizer. This should be unique.
	Name() string

	// Type returns the string representation of the typeID that we care about.
	Type() string

	// Tokens returns tokens for a given value. The tokens shouldn't be encoded
	// with the byte identifier.
	Tokens(interface{}) ([]string, error)

	// Identifier returns the prefix byte for this token type. This should be
	// unique. The range 0x80 to 0xff (inclusive) is reserved for user-provided
	// custom tokenizers.
	Identifier() byte

	// IsSortable returns true if the tokenizer can be used for sorting/ordering.
	IsSortable() bool

	// IsLossy returns true if we don't store the values directly as index keys
	// during tokenization. If a predicate is tokenized using an IsLossy()
	// tokenizer, then we need to fetch the actual value and compare.
	IsLossy() bool
}
Tokenizer defines what a tokenizer must provide.
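To make the contract concrete, here is a self-contained sketch of a trivial tokenizer satisfying an identically shaped interface. The interface is restated locally so the file compiles standalone, and the "0"/"1" token format is illustrative, not necessarily what the package's BoolTokenizer emits:

```go
package main

import "fmt"

// Tokenizer restates the interface above so this file compiles on its own.
type Tokenizer interface {
	Name() string
	Type() string
	Tokens(interface{}) ([]string, error)
	Identifier() byte
	IsSortable() bool
	IsLossy() bool
}

// boolTokenizer is an illustrative implementation: a bool maps to a
// single token, "0" or "1". It is lossless, so equality queries never
// need to fetch the stored value.
type boolTokenizer struct{}

func (boolTokenizer) Name() string     { return "bool" }
func (boolTokenizer) Type() string     { return "bool" }
func (boolTokenizer) Identifier() byte { return 0x9 } // IdentBool
func (boolTokenizer) IsSortable() bool { return false }
func (boolTokenizer) IsLossy() bool    { return false }

func (boolTokenizer) Tokens(v interface{}) ([]string, error) {
	b, ok := v.(bool)
	if !ok {
		return nil, fmt.Errorf("bool tokenizer expects bool, got %T", v)
	}
	if b {
		return []string{"1"}, nil
	}
	return []string{"0"}, nil
}

func main() {
	var t Tokenizer = boolTokenizer{}
	toks, err := t.Tokens(true)
	fmt.Println(toks, err, t.IsLossy())
}
```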
func GetTokenizer ¶
GetTokenizer returns the tokenizer with the given unique name.
func GetTokenizerByID ¶
GetTokenizerByID tries to find a tokenizer by id in the registered list. Returns the tokenizer and true if found, otherwise nil and false.
func GetTokenizerForLang ¶
GetTokenizerForLang returns the correct full-text tokenizer for the given language.
func GetTokenizers ¶
GetTokenizers returns a list of tokenizers given a list of unique names.
type TrigramTokenizer ¶
type TrigramTokenizer struct{}
TrigramTokenizer returns trigram tokens from string data.
func (TrigramTokenizer) Identifier ¶
func (t TrigramTokenizer) Identifier() byte
func (TrigramTokenizer) IsLossy ¶
func (t TrigramTokenizer) IsLossy() bool
func (TrigramTokenizer) IsSortable ¶
func (t TrigramTokenizer) IsSortable() bool
func (TrigramTokenizer) Name ¶
func (t TrigramTokenizer) Name() string
func (TrigramTokenizer) Tokens ¶
func (t TrigramTokenizer) Tokens(v interface{}) ([]string, error)
func (TrigramTokenizer) Type ¶
func (t TrigramTokenizer) Type() string
type YearTokenizer ¶
type YearTokenizer struct{}
YearTokenizer generates year tokens from datetime data.
func (YearTokenizer) Identifier ¶
func (t YearTokenizer) Identifier() byte
func (YearTokenizer) IsLossy ¶
func (t YearTokenizer) IsLossy() bool
func (YearTokenizer) IsSortable ¶
func (t YearTokenizer) IsSortable() bool
func (YearTokenizer) Name ¶
func (t YearTokenizer) Name() string
func (YearTokenizer) Tokens ¶
func (t YearTokenizer) Tokens(v interface{}) ([]string, error)
func (YearTokenizer) Type ¶
func (t YearTokenizer) Type() string