tokenizer

v0.0.0-...-0062cc5 | Published: Jan 21, 2026 | License: Apache-2.0 | Imports: 12 | Imported by: 0

Repository

github.com/FrogoAI/tokenizer

README ¶

tokenizer

A configuration-driven, singleton wrapper for the sugarme/tokenizer library (BERT/RoBERTa compatible).

tokenizer simplifies the usage of HuggingFace-style tokenizers in Go. Instead of manually constructing the tokenization pipeline (Normalizers, PreTokenizers, Decoders) in code, this library hydrates the entire state from a standard tokenizer.json configuration file.

It allows you to load tokenizers directly from an embed.FS (or any fs.FS) and provides a thread-safe singleton pattern to ensure resource-heavy models are loaded only once.

Features

JSON Configuration: Builds the entire tokenizer pipeline (Model, Normalizer, PreTokenizer, PostProcessor, Decoder, Padding, Truncation) from a single JSON file.
Embed-Ready: Designed to work seamlessly with Go's embed package via the fs.FS interface.
Singleton Pattern: Includes a thread-safe GetTokenizer function that lazily loads the model on first use and caches it for the application's lifetime.
Full Pipeline Support: Automatically configures:
Added Tokens: Special tokens handling.
Truncation & Padding: max length and padding strategies.
Normalization: Unicode normalization, lowercasing, etc.

Installation

go get github.com/FrogoAI/tokenizer

Usage

1. Using the Singleton (Recommended)

This method ensures the tokenizer is loaded only once, even if called concurrently. It is ideal for use in API handlers.

package main

import (
	"embed"
	"fmt"

	"github.com/FrogoAI/tokenizer"
)

//go:embed resources/tokenizer.json
var assets embed.FS

func main() {
	// 1. Get the singleton instance.
	// The first call parses the JSON and builds the model.
	// Subsequent calls return the cached instance.
	tk, err := tokenizer.GetTokenizer(assets, "resources/tokenizer.json")
	if err != nil {
		panic(err)
	}

	// 2. Encode text
	// Helper method EncodeSingle simplifies the common case
	en, err := tk.EncodeSingle("Hello, World!")
	if err != nil {
		panic(err)
	}

	// 3. Access results
	fmt.Printf("Tokens: %v\n", en.Tokens)
	fmt.Printf("IDs:    %v\n", en.Ids)
}
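The load-once behavior described above is commonly built on sync.Once. The following is a minimal sketch of that pattern under stated assumptions, not the package's actual implementation; resource, getResource, and loads are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// resource stands in for an expensive-to-build object such as a tokenizer.
type resource struct {
	name string
}

var (
	once     sync.Once
	instance *resource
	initErr  error
	loads    int // counts how many times the loader body actually ran
)

// getResource is a hypothetical helper: it builds the resource on the first
// call and returns the cached value (and cached error) on every later call.
func getResource(name string) (*resource, error) {
	once.Do(func() {
		loads++
		instance, initErr = &resource{name: name}, nil
	})
	return instance, initErr
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := getResource("resources/tokenizer.json"); err != nil {
				panic(err)
			}
		}()
	}
	wg.Wait()
	fmt.Println("loader ran", loads, "time(s)") // prints "loader ran 1 time(s)"
}
```

In this sketch the argument passed on later calls is ignored, because the first call has already fixed the instance. A singleton of this shape therefore suits the case of one tokenizer per process.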

2. Manual Loading

If you need to load multiple different tokenizers (e.g., one for BERT and one for GPT), use FromFile directly.

// os.DirFS exposes ./configs as an fs.FS; this snippet needs the "os" import.
tk, err := tokenizer.FromFile(os.DirFS("./configs"), "bert-base.json")

Configuration Format

The tokenizer.json should follow the structure exported by the HuggingFace tokenizers library. The wrapper looks for these top-level keys:

model: The underlying model parameters (e.g., WordPiece, BPE).
normalizer: Text cleaning rules.
pre_tokenizer: Splitting rules (e.g., whitespace).
post_processor: Special token insertion (e.g., [CLS], [SEP]).
decoder: ID to string conversion rules.
added_tokens: Special vocabulary.
truncation: Max length settings.
padding: Padding strategy.

Testing

The repository includes tests that validate encodings against expected token counts, which helps check parity with Python implementations.

go test ./...

License

MIT

Documentation ¶

Index ¶

Constants
func ABTest(data, salt []byte, groups ...uint64) uint64
func Between(data string, keys ...string) string
func ByteSliceToString(b []byte) string
func CommonString(str string) string
func FromFile(fs fs.FS, file string) (*tokenizer.Tokenizer, error)
func GetTokenizer(fs fs.FS, file string) (*tokenizer.Tokenizer, error)
func NFDLowerString(str string) string
func Normalize(str string) (string, error)
func SanitizeEmail(email string) string
func SplitBetweenTokens(data string, keys ...string) []string

Constants ¶

const (
	EmailTagStart = "+"
	EmailAt       = "@"
)

Variables ¶

This section is empty.

Functions ¶

func ABTest ¶

func ABTest(data, salt []byte, groups ...uint64) uint64

func Between ¶

func Between(data string, keys ...string) string

Between returns the content found between two keys.
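The idea of extracting the text between two keys can be sketched with strings.Index. The helper below is hypothetical: its name, its fixed two-key signature, and its choice to return an empty string when a key is missing are all assumptions, and the package's Between may behave differently.

```go
package main

import (
	"fmt"
	"strings"
)

// between is a hypothetical helper: it returns the text after the first
// occurrence of start and before the next occurrence of end. It returns ""
// when either key is not found (an assumption of this sketch).
func between(data, start, end string) string {
	i := strings.Index(data, start)
	if i < 0 {
		return ""
	}
	rest := data[i+len(start):]
	j := strings.Index(rest, end)
	if j < 0 {
		return ""
	}
	return rest[:j]
}

func main() {
	fmt.Println(between("<b>bold</b>", "<b>", "</b>")) // prints "bold"
	fmt.Println(between("x|y|z", "|", "|"))            // prints "y"
}
```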

func ByteSliceToString ¶

func ByteSliceToString(b []byte) string

ByteSliceToString casts the given bytes to a string without allocating memory.

func CommonString ¶

func CommonString(str string) string

func FromFile ¶

func FromFile(fs fs.FS, file string) (*tokenizer.Tokenizer, error)

func GetTokenizer ¶

func GetTokenizer(fs fs.FS, file string) (*tokenizer.Tokenizer, error)

func NFDLowerString ¶

func NFDLowerString(str string) string

func Normalize ¶

func Normalize(str string) (string, error)

func SanitizeEmail ¶

func SanitizeEmail(email string) string

func SplitBetweenTokens ¶

func SplitBetweenTokens(data string, keys ...string) []string

SplitBetweenTokens takes a string and one or two tokens, and cuts everything between the two tokens, or between two occurrences of the first token when only one is given.

Types ¶

This section is empty.

Directories ¶

Path	Synopsis
embedded
