Documentation ¶
Overview ¶
Package api defines the Tokenizer API. It is split into its own package to break a cyclic dependency, so that users can import `tokenizers` and get the default implementations.
Index ¶
- func SpecialTokenStrings() []string
- type Config
- type SpecialToken
- func (i SpecialToken) IsASpecialToken() bool
- func (i SpecialToken) MarshalJSON() ([]byte, error)
- func (i SpecialToken) MarshalText() ([]byte, error)
- func (i SpecialToken) MarshalYAML() (interface{}, error)
- func (i SpecialToken) String() string
- func (i *SpecialToken) UnmarshalJSON(data []byte) error
- func (i *SpecialToken) UnmarshalText(text []byte) error
- func (i *SpecialToken) UnmarshalYAML(unmarshal func(interface{}) error) error
- func (SpecialToken) Values() []string
- type Tokenizer
- type TokensDecoder
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func SpecialTokenStrings ¶ added in v0.1.1
func SpecialTokenStrings() []string
SpecialTokenStrings returns a slice of all String values of the enum
Types ¶
type Config ¶
type Config struct {
	ConfigFile string

	TokenizerClass             string                `json:"tokenizer_class"`
	ChatTemplate               string                `json:"chat_template"`
	UseDefaultSystemPrompt     bool                  `json:"use_default_system_prompt"`
	ModelMaxLength             float64               `json:"model_max_length"`
	MaxLength                  float64               `json:"max_length"`
	SpModelKwargs              map[string]any        `json:"sp_model_kwargs"`
	ClsToken                   string                `json:"cls_token"`
	UnkToken                   string                `json:"unk_token"`
	SepToken                   string                `json:"sep_token"`
	MaskToken                  string                `json:"mask_token"`
	BosToken                   string                `json:"bos_token"`
	EosToken                   string                `json:"eos_token"`
	PadToken                   string                `json:"pad_token"`
	AddBosToken                bool                  `json:"add_bos_token"`
	AddEosToken                bool                  `json:"add_eos_token"`
	AddedTokensDecoder         map[int]TokensDecoder `json:"added_tokens_decoder"`
	AdditionalSpecialTokens    []string              `json:"additional_special_tokens"`
	DoLowerCase                bool                  `json:"do_lower_case"`
	CleanUpTokenizationSpaces  bool                  `json:"clean_up_tokenization_spaces"`
	SpacesBetweenSpecialTokens bool                  `json:"spaces_between_special_tokens"`
	TokenizeChineseChars       bool                  `json:"tokenize_chinese_chars"`
	StripAccents               any                   `json:"strip_accents"`
	NameOrPath                 string                `json:"name_or_path"`
	DoBasicTokenize            bool                  `json:"do_basic_tokenize"`
	NeverSplit                 any                   `json:"never_split"`
	Stride                     int                   `json:"stride"`
	TruncationSide             string                `json:"truncation_side"`
	TruncationStrategy         string                `json:"truncation_strategy"`
}
Config holds the contents of HuggingFace's tokenizer_config.json. There is no formal schema for this file, but these are some common fields that may be of use. Specific tokenizer classes are free to implement additional features as they see fit.
The extra field ConfigFile holds the path to the file with the full config.
func ParseConfigContent ¶
ParseConfigContent parses the given JSON content (of a tokenizer_config.json file) into a Config structure.
func ParseConfigFile ¶
ParseConfigFile parses the given file (a tokenizer_config.json file) into a Config structure.
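To illustrate what parsing a tokenizer_config.json looks like, here is a minimal, self-contained sketch. It decodes a fragment into a struct mirroring a few of the Config fields above; `configSubset` and `parseConfig` are illustrative stand-ins, not part of the package (use ParseConfigContent / ParseConfigFile with the full Config type in practice).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// configSubset mirrors a few fields of api.Config for illustration only;
// the real package exposes the full Config type documented above.
type configSubset struct {
	TokenizerClass string  `json:"tokenizer_class"`
	BosToken       string  `json:"bos_token"`
	EosToken       string  `json:"eos_token"`
	ModelMaxLength float64 `json:"model_max_length"`
}

// parseConfig decodes tokenizer_config.json content, analogous to what
// ParseConfigContent does for the full Config type.
func parseConfig(content []byte) (configSubset, error) {
	var cfg configSubset
	err := json.Unmarshal(content, &cfg)
	return cfg, err
}

func main() {
	// A minimal tokenizer_config.json fragment, as found in HuggingFace repos.
	content := []byte(`{
		"tokenizer_class": "LlamaTokenizer",
		"bos_token": "<s>",
		"eos_token": "</s>",
		"model_max_length": 4096
	}`)
	cfg, err := parseConfig(content)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.TokenizerClass, cfg.BosToken, cfg.ModelMaxLength)
}
```

Unknown JSON keys are silently ignored by encoding/json, which is why the loose, schema-less tokenizer_config.json format maps comfortably onto a fixed struct.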
type SpecialToken ¶
type SpecialToken int
SpecialToken is an enum of commonly used special tokens.
const (
	TokBeginningOfSentence SpecialToken = iota
	TokEndOfSentence
	TokUnknown
	TokPad
	TokMask
	TokClassification
	TokSpecialTokensCount
)
func SpecialTokenString ¶ added in v0.1.1
func SpecialTokenString(s string) (SpecialToken, error)
SpecialTokenString retrieves an enum value from the enum constant's string name. It returns an error if the given string is not part of the enum.
func SpecialTokenValues ¶ added in v0.1.1
func SpecialTokenValues() []SpecialToken
SpecialTokenValues returns all values of the enum.
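The String / SpecialTokenString pair gives a round-trippable mapping between enum values and their names. The following self-contained sketch reimplements that pattern for the constants above; the exact string forms returned by the package's generated String method are an assumption here, chosen to match the constant names.

```go
package main

import "fmt"

// SpecialToken mirrors the enum above.
type SpecialToken int

const (
	TokBeginningOfSentence SpecialToken = iota
	TokEndOfSentence
	TokUnknown
	TokPad
	TokMask
	TokClassification
	TokSpecialTokensCount
)

// specialTokenNames assumes the string form of each value is its constant
// name, which is the common convention for generated enum code.
var specialTokenNames = [...]string{
	"TokBeginningOfSentence", "TokEndOfSentence", "TokUnknown",
	"TokPad", "TokMask", "TokClassification", "TokSpecialTokensCount",
}

// String returns the name of the enum value, or a placeholder for
// out-of-range values.
func (i SpecialToken) String() string {
	if i < 0 || int(i) >= len(specialTokenNames) {
		return fmt.Sprintf("SpecialToken(%d)", int(i))
	}
	return specialTokenNames[i]
}

// SpecialTokenString reverses String, returning an error for unknown names.
func SpecialTokenString(s string) (SpecialToken, error) {
	for i, name := range specialTokenNames {
		if name == s {
			return SpecialToken(i), nil
		}
	}
	return 0, fmt.Errorf("%q does not belong to SpecialToken values", s)
}

func main() {
	tok, err := SpecialTokenString(TokPad.String())
	fmt.Println(tok, err) // round-trips back to TokPad
}
```

The Marshal/Unmarshal methods documented below build on this same name mapping, so a SpecialToken serializes to its string name in JSON, text, and YAML.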
func (SpecialToken) IsASpecialToken ¶ added in v0.1.1
func (i SpecialToken) IsASpecialToken() bool
IsASpecialToken returns true if the value is listed in the enum definition, false otherwise.
func (SpecialToken) MarshalJSON ¶ added in v0.1.1
func (i SpecialToken) MarshalJSON() ([]byte, error)
MarshalJSON implements the json.Marshaler interface for SpecialToken
func (SpecialToken) MarshalText ¶ added in v0.1.1
func (i SpecialToken) MarshalText() ([]byte, error)
MarshalText implements the encoding.TextMarshaler interface for SpecialToken
func (SpecialToken) MarshalYAML ¶ added in v0.1.1
func (i SpecialToken) MarshalYAML() (interface{}, error)
MarshalYAML implements a YAML Marshaler for SpecialToken
func (SpecialToken) String ¶ added in v0.1.1
func (i SpecialToken) String() string
func (*SpecialToken) UnmarshalJSON ¶ added in v0.1.1
func (i *SpecialToken) UnmarshalJSON(data []byte) error
UnmarshalJSON implements the json.Unmarshaler interface for SpecialToken
func (*SpecialToken) UnmarshalText ¶ added in v0.1.1
func (i *SpecialToken) UnmarshalText(text []byte) error
UnmarshalText implements the encoding.TextUnmarshaler interface for SpecialToken
func (*SpecialToken) UnmarshalYAML ¶ added in v0.1.1
func (i *SpecialToken) UnmarshalYAML(unmarshal func(interface{}) error) error
UnmarshalYAML implements a YAML Unmarshaler for SpecialToken
func (SpecialToken) Values ¶ added in v0.1.1
func (SpecialToken) Values() []string
type Tokenizer ¶
type Tokenizer interface {
	Encode(text string) []int
	Decode([]int) string

	// SpecialTokenID returns ID for given special token if registered, or an error if not.
	SpecialTokenID(token SpecialToken) (int, error)
}
Tokenizer interface allows one to convert text to "tokens" (integer ids) and back.
It also allows mapping of special tokens: tokens with a common semantic (like padding) but that may map to different ids (int) for different tokenizers.