api

package
v0.2.2
Published: Jun 27, 2025 License: Apache-2.0 Imports: 5 Imported by: 2

Documentation

Overview

Package api defines the Tokenizer API. It exists only to break a cyclic dependency, so that users can import `tokenizers` and get the default implementations.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func SpecialTokenStrings added in v0.1.1

func SpecialTokenStrings() []string

SpecialTokenStrings returns a slice of the String values of all enum constants.

Types

type Config

type Config struct {
	ConfigFile     string
	TokenizerClass string `json:"tokenizer_class"`

	ChatTemplate           string `json:"chat_template"`
	UseDefaultSystemPrompt bool   `json:"use_default_system_prompt"`

	ModelMaxLength float64        `json:"model_max_length"`
	MaxLength      float64        `json:"max_length"`
	SpModelKwargs  map[string]any `json:"sp_model_kwargs"`

	ClsToken  string `json:"cls_token"`
	UnkToken  string `json:"unk_token"`
	SepToken  string `json:"sep_token"`
	MaskToken string `json:"mask_token"`
	BosToken  string `json:"bos_token"`
	EosToken  string `json:"eos_token"`
	PadToken  string `json:"pad_token"`

	AddBosToken             bool                  `json:"add_bos_token"`
	AddEosToken             bool                  `json:"add_eos_token"`
	AddedTokensDecoder      map[int]TokensDecoder `json:"added_tokens_decoder"`
	AdditionalSpecialTokens []string              `json:"additional_special_tokens"`

	DoLowerCase                bool `json:"do_lower_case"`
	CleanUpTokenizationSpaces  bool `json:"clean_up_tokenization_spaces"`
	SpacesBetweenSpecialTokens bool `json:"spaces_between_special_tokens"`

	TokenizeChineseChars bool   `json:"tokenize_chinese_chars"`
	StripAccents         any    `json:"strip_accents"`
	NameOrPath           string `json:"name_or_path"`
	DoBasicTokenize      bool   `json:"do_basic_tokenize"`
	NeverSplit           any    `json:"never_split"`

	Stride             int    `json:"stride"`
	TruncationSide     string `json:"truncation_side"`
	TruncationStrategy string `json:"truncation_strategy"`
}

Config holds the contents of a HuggingFace tokenizer_config.json file. There is no formal schema for this file, but these are some common fields that may be of use. Specific tokenizer classes are free to implement additional features as they see fit.

The extra field ConfigFile holds the path to the file with the full config.

func ParseConfigContent

func ParseConfigContent(jsonContent []byte) (*Config, error)

ParseConfigContent parses the given JSON content (of a tokenizer_config.json file) into a Config structure.
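A minimal sketch of what ParseConfigContent presumably does under the hood, assuming it wraps encoding/json; the Config type is mirrored here in reduced form (`miniConfig`, a hypothetical name) so the example is self-contained:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// miniConfig mirrors a few fields of api.Config, for illustration only.
type miniConfig struct {
	TokenizerClass string  `json:"tokenizer_class"`
	ModelMaxLength float64 `json:"model_max_length"`
	UnkToken       string  `json:"unk_token"`
}

// parseConfigContent unmarshals raw JSON into the config struct,
// mirroring ParseConfigContent's signature.
func parseConfigContent(jsonContent []byte) (*miniConfig, error) {
	c := &miniConfig{}
	if err := json.Unmarshal(jsonContent, c); err != nil {
		return nil, err
	}
	return c, nil
}

func main() {
	raw := []byte(`{"tokenizer_class": "BertTokenizer", "model_max_length": 512, "unk_token": "[UNK]"}`)
	cfg, err := parseConfigContent(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.TokenizerClass, cfg.ModelMaxLength, cfg.UnkToken) // BertTokenizer 512 [UNK]
}
```

Unknown JSON fields are silently ignored by encoding/json, which matches the "no formal schema" nature of tokenizer_config.json.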

func ParseConfigFile

func ParseConfigFile(filePath string) (*Config, error)

ParseConfigFile parses the given tokenizer_config.json file into a Config structure.

type SpecialToken

type SpecialToken int

SpecialToken is an enum of commonly used special tokens.

const (
	TokBeginningOfSentence SpecialToken = iota
	TokEndOfSentence
	TokUnknown
	TokPad
	TokMask
	TokClassification
	TokSpecialTokensCount
)

func SpecialTokenString added in v0.1.1

func SpecialTokenString(s string) (SpecialToken, error)

SpecialTokenString retrieves an enum value from the enum constant's string name. It returns an error if the given string is not part of the enum.
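The name-to-value lookup can be sketched as follows; the enum and its name table are redeclared locally (in reduced form) so the example is self-contained, and the exact error message is an assumption:

```go
package main

import "fmt"

// Local mini re-declaration of the SpecialToken enum pattern.
type SpecialToken int

const (
	TokBeginningOfSentence SpecialToken = iota
	TokEndOfSentence
	TokUnknown
	TokPad
)

// tokenNames maps each constant to its identifier, as a String method
// generated for the enum typically would.
var tokenNames = map[SpecialToken]string{
	TokBeginningOfSentence: "TokBeginningOfSentence",
	TokEndOfSentence:       "TokEndOfSentence",
	TokUnknown:             "TokUnknown",
	TokPad:                 "TokPad",
}

func (i SpecialToken) String() string { return tokenNames[i] }

// specialTokenString reverses String: name -> enum value, error if unknown.
func specialTokenString(s string) (SpecialToken, error) {
	for tok, name := range tokenNames {
		if name == s {
			return tok, nil
		}
	}
	return 0, fmt.Errorf("%q does not belong to SpecialToken values", s)
}

func main() {
	tok, err := specialTokenString("TokPad")
	fmt.Println(tok, err) // TokPad <nil>
	_, err = specialTokenString("TokBogus")
	fmt.Println(err != nil) // true
}
```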

func SpecialTokenValues added in v0.1.1

func SpecialTokenValues() []SpecialToken

SpecialTokenValues returns all values of the enum.

func (SpecialToken) IsASpecialToken added in v0.1.1

func (i SpecialToken) IsASpecialToken() bool

IsASpecialToken returns true if the value is listed in the enum definition, and false otherwise.

func (SpecialToken) MarshalJSON added in v0.1.1

func (i SpecialToken) MarshalJSON() ([]byte, error)

MarshalJSON implements the json.Marshaler interface for SpecialToken

func (SpecialToken) MarshalText added in v0.1.1

func (i SpecialToken) MarshalText() ([]byte, error)

MarshalText implements the encoding.TextMarshaler interface for SpecialToken

func (SpecialToken) MarshalYAML added in v0.1.1

func (i SpecialToken) MarshalYAML() (interface{}, error)

MarshalYAML implements a YAML Marshaler for SpecialToken

func (SpecialToken) String added in v0.1.1

func (i SpecialToken) String() string

func (*SpecialToken) UnmarshalJSON added in v0.1.1

func (i *SpecialToken) UnmarshalJSON(data []byte) error

UnmarshalJSON implements the json.Unmarshaler interface for SpecialToken

func (*SpecialToken) UnmarshalText added in v0.1.1

func (i *SpecialToken) UnmarshalText(text []byte) error

UnmarshalText implements the encoding.TextUnmarshaler interface for SpecialToken

func (*SpecialToken) UnmarshalYAML added in v0.1.1

func (i *SpecialToken) UnmarshalYAML(unmarshal func(interface{}) error) error

UnmarshalYAML implements a YAML Unmarshaler for SpecialToken

func (SpecialToken) Values added in v0.1.1

func (SpecialToken) Values() []string

type Tokenizer

type Tokenizer interface {
	Encode(text string) []int
	Decode([]int) string

	// SpecialTokenID returns ID for given special token if registered, or an error if not.
	SpecialTokenID(token SpecialToken) (int, error)
}

The Tokenizer interface allows one to convert text to "tokens" (integer IDs) and back.

It also allows mapping of special tokens: tokens with a common semantic (like padding) that may map to different integer IDs in different tokenizers.

type TokensDecoder

type TokensDecoder struct {
	Content    string `json:"content"`
	Lstrip     bool   `json:"lstrip"`
	Normalized bool   `json:"normalized"`
	Rstrip     bool   `json:"rstrip"`
	SingleWord bool   `json:"single_word"`
	Special    bool   `json:"special"`
}
