Documentation ¶
Overview ¶
Package tokenizer provides text tokenization for ML model inference.
The Tokenizer interface abstracts over different tokenization algorithms (whitespace, BPE, SentencePiece). Implementations include WhitespaceTokenizer for testing and BPETokenizer for production use with HuggingFace models.
Index ¶
- type BERTEncoding
- type BPETokenizer
- func (t *BPETokenizer) Decode(ids []int) (string, error)
- func (t *BPETokenizer) Encode(text string) ([]int, error)
- func (t *BPETokenizer) EncodeWithSpecialTokens(text string, addBOS bool, addEOS bool) ([]int, error)
- func (t *BPETokenizer) GetID(token string) (int, bool)
- func (t *BPETokenizer) GetToken(id int) (string, bool)
- func (t *BPETokenizer) SetAddLeadingSpace(enabled bool)
- func (t *BPETokenizer) SetScores(scores []float32)
- func (t *BPETokenizer) SetSentencePiece(enabled bool)
- func (t *BPETokenizer) SetSpecialTokenStrings(tokens map[string]int)
- func (t *BPETokenizer) SpecialTokens() SpecialTokens
- func (t *BPETokenizer) VocabSize() int
- type MergePair
- type NormalizerFunc
- type SpecialTokens
- type Tokenizer
- type WhitespaceTokenizer
- func (t *WhitespaceTokenizer) AddToken(token string) int
- func (t *WhitespaceTokenizer) Decode(ids []int) (string, error)
- func (t *WhitespaceTokenizer) Encode(text string) ([]int, error)
- func (t *WhitespaceTokenizer) GetID(token string) (int, bool)
- func (t *WhitespaceTokenizer) GetToken(id int) (string, bool)
- func (t *WhitespaceTokenizer) SpecialTokens() SpecialTokens
- func (t *WhitespaceTokenizer) VocabSize() int
- type WordPieceTokenizer
- func (t *WordPieceTokenizer) Decode(ids []int) (string, error)
- func (t *WordPieceTokenizer) Encode(text string) ([]int, error)
- func (t *WordPieceTokenizer) EncodeForBERT(textA string, textB string, maxLen int) (*BERTEncoding, error)
- func (t *WordPieceTokenizer) GetID(token string) (int, bool)
- func (t *WordPieceTokenizer) GetToken(id int) (string, bool)
- func (t *WordPieceTokenizer) SpecialTokens() SpecialTokens
- func (t *WordPieceTokenizer) VocabSize() int
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type BERTEncoding ¶ added in v0.3.0
type BERTEncoding struct {
InputIDs []int // Token IDs: [CLS] + tokens + [SEP] (+ tokens + [SEP] for pairs)
AttentionMask []int // 1 for real tokens, 0 for padding
TokenTypeIDs []int // 0 for first sentence, 1 for second sentence
}
BERTEncoding holds the input tensors expected by BERT-family models.
type BPETokenizer ¶
type BPETokenizer struct {
// contains filtered or unexported fields
}
BPETokenizer implements the Tokenizer interface using byte-pair encoding. It loads vocabulary and merge rules from HuggingFace tokenizer.json format. When scores are set and merges are empty, it falls back to SentencePiece encoding using greedy leftmost-longest match with score-based tie-breaking, matching llama.cpp behavior.
Stable.
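The merge phase of BPE can be illustrated with a self-contained sketch (this is not the package's implementation; the `MergePair` shape is mirrored from the type below, and `applyBPE` is a hypothetical helper): starting from individual symbols, the earliest-listed merge rule that occurs in the sequence is applied repeatedly until no rule matches.

```go
package main

import "fmt"

// MergePair mirrors the package's merge-rule shape: merge Left+Right when adjacent.
type MergePair struct{ Left, Right string }

// applyBPE repeatedly applies the earliest-listed merge rule whose pair
// occurs anywhere in the symbol sequence, as classic BPE does.
func applyBPE(symbols []string, merges []MergePair) []string {
	for {
		best := -1    // index into merges of the best applicable rule
		bestPos := -1 // position in symbols where it applies
		for i := 0; i+1 < len(symbols); i++ {
			for r, m := range merges {
				if symbols[i] == m.Left && symbols[i+1] == m.Right {
					if best == -1 || r < best {
						best, bestPos = r, i
					}
				}
			}
		}
		if best == -1 {
			return symbols // no rule applies; done
		}
		merged := symbols[bestPos] + symbols[bestPos+1]
		symbols = append(symbols[:bestPos], append([]string{merged}, symbols[bestPos+2:]...)...)
	}
}

func main() {
	merges := []MergePair{{"l", "o"}, {"lo", "w"}}
	fmt.Println(applyBPE([]string{"l", "o", "w"}, merges)) // [low]
}
```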
func LoadFromJSON ¶
func LoadFromJSON(path string) (*BPETokenizer, error)
LoadFromJSON reads a HuggingFace tokenizer.json file and returns a BPETokenizer. For loading any tokenizer type, use Load instead.
func NewBPETokenizer ¶
func NewBPETokenizer(vocab map[string]int, merges []MergePair, special SpecialTokens, byteLevelBPE bool) *BPETokenizer
NewBPETokenizer creates a BPETokenizer from vocabulary, merge rules, and special tokens.
func (*BPETokenizer) Decode ¶
func (t *BPETokenizer) Decode(ids []int) (string, error)
Decode converts token IDs back to text.
func (*BPETokenizer) Encode ¶
func (t *BPETokenizer) Encode(text string) ([]int, error)
Encode tokenizes text into a sequence of token IDs using BPE.
func (*BPETokenizer) EncodeWithSpecialTokens ¶
func (t *BPETokenizer) EncodeWithSpecialTokens(text string, addBOS bool, addEOS bool) ([]int, error)
EncodeWithSpecialTokens wraps Encode and optionally prepends BOS / appends EOS.
func (*BPETokenizer) GetID ¶
func (t *BPETokenizer) GetID(token string) (int, bool)
GetID returns the ID for a given token string.
func (*BPETokenizer) GetToken ¶
func (t *BPETokenizer) GetToken(id int) (string, bool)
GetToken returns the string for a given token ID.
func (*BPETokenizer) SetAddLeadingSpace ¶ added in v0.3.3
func (t *BPETokenizer) SetAddLeadingSpace(enabled bool)
SetAddLeadingSpace controls whether SentencePiece mode prepends ▁ to the first word. By default this is set to true when SetSentencePiece is called, matching llama.cpp / SentencePiece behavior. GGUF models may override this via the tokenizer.ggml.add_space_prefix metadata key.
func (*BPETokenizer) SetScores ¶ added in v0.3.0
func (t *BPETokenizer) SetScores(scores []float32)
SetScores sets token scores for SentencePiece unigram encoding. When scores are set and merges are empty, the tokenizer uses score-based greedy encoding instead of BPE merge-based encoding. Scores are indexed by token ID (negative log probabilities).
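The greedy leftmost-longest matching used in this fallback mode can be sketched as follows (illustrative only; `greedyEncode` and the toy vocabulary are invented, and the score-based tie-breaking is elided for brevity): at each position, consume the longest vocabulary entry that matches.

```go
package main

import "fmt"

// greedyEncode is an illustrative leftmost-longest tokenizer: at each
// position it consumes the longest vocabulary entry that matches.
// (Score-based tie-breaking from the real implementation is elided.)
func greedyEncode(text string, vocab map[string]int) []int {
	var ids []int
	for i := 0; i < len(text); {
		best := "" // longest vocab entry matching at position i
		for tok := range vocab {
			if len(tok) > len(best) && i+len(tok) <= len(text) && text[i:i+len(tok)] == tok {
				best = tok
			}
		}
		if best == "" {
			i++ // no match: skip a byte (a real tokenizer would emit UNK)
			continue
		}
		ids = append(ids, vocab[best])
		i += len(best)
	}
	return ids
}

func main() {
	vocab := map[string]int{"▁": 0, "▁he": 1, "▁hello": 2, "llo": 3}
	fmt.Println(greedyEncode("▁hello", vocab)) // [2]
}
```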
func (*BPETokenizer) SetSentencePiece ¶
func (t *BPETokenizer) SetSentencePiece(enabled bool)
SetSentencePiece enables SentencePiece-style pre-tokenization where spaces are replaced with ▁ (U+2581) and the text is split at ▁ boundaries.
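The pre-tokenization step, including the leading-space behavior controlled by SetAddLeadingSpace, can be sketched with a hypothetical helper (`spmPretokenize` is invented for illustration): spaces become ▁, an optional leading ▁ is prepended, and the text is split so each piece begins at a ▁ boundary.

```go
package main

import (
	"fmt"
	"strings"
)

// spmPretokenize is an illustrative sketch of SentencePiece-style
// pre-tokenization: spaces become ▁ (U+2581), an optional leading space is
// prepended, and the text is split so each piece starts at a ▁ boundary.
func spmPretokenize(text string, addLeadingSpace bool) []string {
	if addLeadingSpace {
		text = " " + text
	}
	text = strings.ReplaceAll(text, " ", "▁")
	var pieces []string
	start := 0
	for i, r := range text {
		if r == '▁' && i > start {
			pieces = append(pieces, text[start:i])
			start = i
		}
	}
	pieces = append(pieces, text[start:])
	return pieces
}

func main() {
	fmt.Println(spmPretokenize("hello world", true)) // [▁hello ▁world]
}
```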
func (*BPETokenizer) SetSpecialTokenStrings ¶
func (t *BPETokenizer) SetSpecialTokenStrings(tokens map[string]int)
SetSpecialTokenStrings registers token strings that should be matched as single tokens during encoding instead of being split by BPE.
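The effect can be sketched as a splitting pass (hypothetical helper, not the package's code): occurrences of registered special-token strings are cut out of the text so they can be emitted as single IDs, and only the remaining segments go through normal BPE.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOnSpecial is an illustrative sketch: registered special-token
// strings are isolated so they can map to single IDs, and the surrounding
// segments are left for normal encoding.
func splitOnSpecial(text string, special map[string]int) []string {
	for tok := range special {
		if i := strings.Index(text, tok); i >= 0 {
			var parts []string
			if i > 0 {
				parts = append(parts, splitOnSpecial(text[:i], special)...)
			}
			parts = append(parts, tok)
			if i+len(tok) < len(text) {
				parts = append(parts, splitOnSpecial(text[i+len(tok):], special)...)
			}
			return parts
		}
	}
	return []string{text}
}

func main() {
	special := map[string]int{"<|end|>": 100}
	fmt.Println(splitOnSpecial("hi<|end|>bye", special)) // [hi <|end|> bye]
}
```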
func (*BPETokenizer) SpecialTokens ¶
func (t *BPETokenizer) SpecialTokens() SpecialTokens
SpecialTokens returns the special token configuration.
func (*BPETokenizer) VocabSize ¶
func (t *BPETokenizer) VocabSize() int
VocabSize returns the number of tokens in the vocabulary.
type SpecialTokens ¶
type SpecialTokens struct {
BOS int // Beginning of sequence
EOS int // End of sequence
PAD int // Padding
UNK int // Unknown token
}
SpecialTokens holds IDs for commonly used special tokens.
Stable.
type Tokenizer ¶
type Tokenizer interface {
// Encode converts text into a sequence of token IDs.
Encode(text string) ([]int, error)
// Decode converts a sequence of token IDs back into text.
Decode(ids []int) (string, error)
// VocabSize returns the total number of tokens in the vocabulary.
VocabSize() int
// GetToken returns the string token for a given ID and whether it exists.
GetToken(id int) (string, bool)
// GetID returns the token ID for a given string and whether it exists.
GetID(token string) (int, bool)
// SpecialTokens returns the special token IDs for this tokenizer.
SpecialTokens() SpecialTokens
}
Tokenizer is the interface for all tokenizer implementations.
Stable.
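A minimal implementation shows the full method set the interface requires. The sketch below is hypothetical (the `toyTokenizer` type is invented; `Tokenizer` and `SpecialTokens` are mirrored from the declarations above):

```go
package main

import (
	"fmt"
	"strings"
)

// SpecialTokens and Tokenizer mirror the package's declarations.
type SpecialTokens struct{ BOS, EOS, PAD, UNK int }

type Tokenizer interface {
	Encode(text string) ([]int, error)
	Decode(ids []int) (string, error)
	VocabSize() int
	GetToken(id int) (string, bool)
	GetID(token string) (int, bool)
	SpecialTokens() SpecialTokens
}

// toyTokenizer is a hypothetical whitespace-style implementation over a
// fixed word list, shown only to illustrate the required method set.
type toyTokenizer struct {
	vocab   map[string]int
	inverse []string
	special SpecialTokens
}

func (t *toyTokenizer) Encode(text string) ([]int, error) {
	var ids []int
	for _, w := range strings.Fields(text) {
		if id, ok := t.vocab[w]; ok {
			ids = append(ids, id)
		} else {
			ids = append(ids, t.special.UNK)
		}
	}
	return ids, nil
}

func (t *toyTokenizer) Decode(ids []int) (string, error) {
	words := make([]string, 0, len(ids))
	for _, id := range ids {
		if id >= 0 && id < len(t.inverse) {
			words = append(words, t.inverse[id])
		}
	}
	return strings.Join(words, " "), nil
}

func (t *toyTokenizer) VocabSize() int { return len(t.inverse) }

func (t *toyTokenizer) GetToken(id int) (string, bool) {
	if id < 0 || id >= len(t.inverse) {
		return "", false
	}
	return t.inverse[id], true
}

func (t *toyTokenizer) GetID(token string) (int, bool) { id, ok := t.vocab[token]; return id, ok }

func (t *toyTokenizer) SpecialTokens() SpecialTokens { return t.special }

func main() {
	tok := &toyTokenizer{
		vocab:   map[string]int{"<unk>": 0, "hello": 1, "world": 2},
		inverse: []string{"<unk>", "hello", "world"},
		special: SpecialTokens{UNK: 0},
	}
	var _ Tokenizer = tok // compile-time check of interface satisfaction
	ids, _ := tok.Encode("hello world")
	fmt.Println(ids) // [1 2]
}
```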
type WhitespaceTokenizer ¶
type WhitespaceTokenizer struct {
// contains filtered or unexported fields
}
WhitespaceTokenizer provides simple whitespace-based tokenization. It splits text on whitespace boundaries and maps words to integer IDs. Useful for testing and non-production scenarios.
Stable.
func NewWhitespaceTokenizer ¶
func NewWhitespaceTokenizer() *WhitespaceTokenizer
NewWhitespaceTokenizer creates a WhitespaceTokenizer pre-loaded with standard special tokens: <unk> (0), <s> (1), </s> (2), <pad> (3).
func (*WhitespaceTokenizer) AddToken ¶
func (t *WhitespaceTokenizer) AddToken(token string) int
AddToken adds a token to the vocabulary if it does not already exist. Returns the token's ID.
func (*WhitespaceTokenizer) Decode ¶
func (t *WhitespaceTokenizer) Decode(ids []int) (string, error)
Decode converts token IDs back to a space-separated string.
func (*WhitespaceTokenizer) Encode ¶
func (t *WhitespaceTokenizer) Encode(text string) ([]int, error)
Encode splits text on whitespace and returns token IDs. Unknown words map to the UNK token ID.
func (*WhitespaceTokenizer) GetID ¶
func (t *WhitespaceTokenizer) GetID(token string) (int, bool)
GetID returns the token ID for a given string.
func (*WhitespaceTokenizer) GetToken ¶
func (t *WhitespaceTokenizer) GetToken(id int) (string, bool)
GetToken returns the string token for a given ID.
func (*WhitespaceTokenizer) SpecialTokens ¶
func (t *WhitespaceTokenizer) SpecialTokens() SpecialTokens
SpecialTokens returns the special token IDs.
func (*WhitespaceTokenizer) VocabSize ¶
func (t *WhitespaceTokenizer) VocabSize() int
VocabSize returns the number of tokens in the vocabulary.
type WordPieceTokenizer ¶ added in v0.3.0
type WordPieceTokenizer struct {
// contains filtered or unexported fields
}
WordPieceTokenizer implements the Tokenizer interface using the WordPiece algorithm, as used by BERT-family models. It greedily matches the longest subword prefix from the vocabulary, using "##" to denote continuation tokens.
Stable.
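The greedy longest-prefix matching for a single word can be sketched as follows (illustrative only; `wordPiece`, its vocabulary, and the nil-on-failure convention are invented for this sketch):

```go
package main

import "fmt"

// wordPiece is an illustrative sketch of the WordPiece algorithm for one
// word: greedily match the longest vocabulary prefix, then continue with
// "##"-prefixed continuation pieces. Returns nil if any step fails.
func wordPiece(word string, vocab map[string]bool) []string {
	var pieces []string
	start := 0
	for start < len(word) {
		end := len(word)
		var match string
		for end > start {
			sub := word[start:end]
			if start > 0 {
				sub = "##" + sub // continuation pieces carry the ## marker
			}
			if vocab[sub] {
				match = sub
				break
			}
			end--
		}
		if match == "" {
			return nil // unencodable word: caller would emit [UNK]
		}
		pieces = append(pieces, match)
		start = end
	}
	return pieces
}

func main() {
	vocab := map[string]bool{"un": true, "##aff": true, "##able": true, "##a": true}
	fmt.Println(wordPiece("unaffable", vocab)) // [un ##aff ##able]
}
```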
func NewWordPieceTokenizer ¶ added in v0.3.0
func NewWordPieceTokenizer(vocab map[string]int, special SpecialTokens) *WordPieceTokenizer
NewWordPieceTokenizer creates a WordPieceTokenizer from a vocabulary and special tokens.
func (*WordPieceTokenizer) Decode ¶ added in v0.3.0
func (t *WordPieceTokenizer) Decode(ids []int) (string, error)
Decode converts token IDs back to text. Continuation tokens (##prefix) are joined without spaces to reconstruct words.
func (*WordPieceTokenizer) Encode ¶ added in v0.3.0
func (t *WordPieceTokenizer) Encode(text string) ([]int, error)
Encode tokenizes text into a sequence of token IDs using WordPiece.
func (*WordPieceTokenizer) EncodeForBERT ¶ added in v0.3.0
func (t *WordPieceTokenizer) EncodeForBERT(textA string, textB string, maxLen int) (*BERTEncoding, error)
EncodeForBERT tokenizes one or two sentences into the BERT input format.

For a single sentence: [CLS] tokens [SEP]

For a sentence pair: [CLS] tokens_a [SEP] tokens_b [SEP]

The result is padded to maxLen if maxLen > 0.
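The layout of the three tensors can be sketched directly (illustrative only; `buildBERT` and the CLS/SEP/PAD IDs are hypothetical, and `BERTEncoding` mirrors the struct above):

```go
package main

import "fmt"

// BERTEncoding mirrors the package's struct.
type BERTEncoding struct {
	InputIDs      []int
	AttentionMask []int
	TokenTypeIDs  []int
}

// buildBERT is an illustrative sketch of the layout described above, using
// hypothetical CLS/SEP/PAD IDs: [CLS] a [SEP] (b [SEP]), type IDs 0/1 per
// segment, mask 1 for real tokens, then padding to maxLen with mask 0.
func buildBERT(a, b []int, cls, sep, pad, maxLen int) *BERTEncoding {
	enc := &BERTEncoding{}
	appendTok := func(id, typ int) {
		enc.InputIDs = append(enc.InputIDs, id)
		enc.TokenTypeIDs = append(enc.TokenTypeIDs, typ)
		enc.AttentionMask = append(enc.AttentionMask, 1)
	}
	appendTok(cls, 0)
	for _, id := range a {
		appendTok(id, 0)
	}
	appendTok(sep, 0)
	for _, id := range b {
		appendTok(id, 1)
	}
	if len(b) > 0 {
		appendTok(sep, 1)
	}
	for len(enc.InputIDs) < maxLen { // pad with attention mask 0
		enc.InputIDs = append(enc.InputIDs, pad)
		enc.TokenTypeIDs = append(enc.TokenTypeIDs, 0)
		enc.AttentionMask = append(enc.AttentionMask, 0)
	}
	return enc
}

func main() {
	enc := buildBERT([]int{7, 8}, []int{9}, 101, 102, 0, 8)
	fmt.Println(enc.InputIDs)      // [101 7 8 102 9 102 0 0]
	fmt.Println(enc.TokenTypeIDs)  // [0 0 0 0 1 1 0 0]
	fmt.Println(enc.AttentionMask) // [1 1 1 1 1 1 0 0]
}
```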
func (*WordPieceTokenizer) GetID ¶ added in v0.3.0
func (t *WordPieceTokenizer) GetID(token string) (int, bool)
GetID returns the token ID for a given string.
func (*WordPieceTokenizer) GetToken ¶ added in v0.3.0
func (t *WordPieceTokenizer) GetToken(id int) (string, bool)
GetToken returns the string token for a given ID.
func (*WordPieceTokenizer) SpecialTokens ¶ added in v0.3.0
func (t *WordPieceTokenizer) SpecialTokens() SpecialTokens
SpecialTokens returns the special token IDs.
func (*WordPieceTokenizer) VocabSize ¶ added in v0.3.0
func (t *WordPieceTokenizer) VocabSize() int
VocabSize returns the number of tokens in the vocabulary.