tokenizer

package module
v0.0.0-...-0062cc5 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 21, 2026 License: Apache-2.0 Imports: 12 Imported by: 0

README

Based on the code provided for the tokenizer repository, here is a comprehensive README.md.

I have updated the installation and usage sections to reflect the features found in prepare.go (specifically the pipeline construction logic) and prepare_test.go (usage with embedded files).


tokenizer

A configuration-driven, singleton wrapper for the sugarme/tokenizer library (BERT/RoBERTa compatible).

tokenizer simplifies the usage of HuggingFace-style tokenizers in Go. Instead of manually constructing the tokenization pipeline (Normalizers, PreTokenizers, Decoders) in code, this library hydrates the entire state from a standard tokenizer.json configuration file.

It allows you to load tokenizers directly from an embed.FS (or any fs.FS) and provides a thread-safe singleton pattern to ensure resource-heavy models are loaded only once.

Features

  • JSON Configuration: Builds the entire tokenizer pipeline (Model, Normalizer, PreTokenizer, PostProcessor, Decoder, Padding, Truncation) from a single JSON file.
  • Embed-Ready: Designed to work seamlessly with Go's embed package via the fs.FS interface.
  • Singleton Pattern: Includes a thread-safe GetTokenizer method that lazily loads the model on first use and caches it for the application's lifetime.
  • Full Pipeline Support: Automatically configures:
  • Added Tokens: Special tokens handling.
  • Truncation & Padding: max length and padding strategies.
  • Normalization: Unicode normalization, lowercasing, etc.

Installation

go get github.com/frogoai/tokenizer

Usage

This method ensures the tokenizer is loaded only once, even if called concurrently. It is ideal for use in API handlers.

package main

import (
	"embed"
	"fmt"
	
	"github.com/frogoai/tokenizer"
)

//go:embed resources/tokenizer.json
var assets embed.FS

func main() {
	// 1. Get the singleton instance.
	// The first call parses the JSON and builds the model.
	// Subsequent calls return the cached instance.
	tk, err := tokenizer.GetTokenizer(assets, "resources/tokenizer.json")
	if err != nil {
		panic(err)
	}

	// 2. Encode text
	// Helper method EncodeSingle simplifies the common case
	en, err := tk.EncodeSingle("Hello, World!")
	if err != nil {
		panic(err)
	}

	// 3. Access results
	fmt.Printf("Tokens: %v\n", en.Tokens)
	fmt.Printf("IDs:    %v\n", en.Ids)
}

2. Manual Loading

If you need to load multiple different tokenizers (e.g., one for BERT and one for GPT), use FromFile directly.

tk, err := tokenizer.FromFile(os.DirFS("./configs"), "bert-base.json")

Configuration Format

The tokenizer.json should follow the structure exported by the HuggingFace tokenizers library. The wrapper looks for these top-level keys:

  • model: The underlying model parameters (e.g., WordPiece, BPE).
  • normalizer: Text cleaning rules.
  • pre_tokenizer: Splitting rules (e.g., whitespace).
  • post_processor: Special token insertion (e.g., [CLS], [SEP]).
  • decoder: ID to string conversion rules.
  • added_tokens: Special vocabulary.
  • truncation: Max length settings.
  • padding: Padding strategy.

Testing

The repository includes tests for validating encoding against expected token counts, useful for ensuring parity with Python implementations.

go test ./...

License

MIT

Documentation

Index

Constants

View Source
const (
	EmailTagStart = "+"
	EmailAt       = "@"
)

Variables

This section is empty.

Functions

func ABTest

func ABTest(data, salt []byte, groups ...uint64) uint64

func Between

func Between(data string, keys ...string) string

Between function to get content between two keys

func ByteSliceToString

func ByteSliceToString(b []byte) string

ByteSliceToString cast given bytes to string, without allocation memory

func CommonString

func CommonString(str string) string

func FromFile

func FromFile(fs fs.FS, file string) (*tokenizer.Tokenizer, error)

func GetTokenizer

func GetTokenizer(fs fs.FS, file string) (*tokenizer.Tokenizer, error)

func NFDLowerString

func NFDLowerString(str string) string

func Normalize

func Normalize(str string) (string, error)

func SanitizeEmail

func SanitizeEmail(email string) string

func SplitBetweenTokens

func SplitBetweenTokens(data string, keys ...string) []string

SplitBetweenTokens take string and one or two tokens, and cut everything between two tokens, or between two copies of first token

Types

This section is empty.

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL