tokenizers

v0.9.1
Published: Sep 19, 2024 License: MIT Imports: 3 Imported by: 1

README

Tokenizers

Go bindings for the HuggingFace Tokenizers library.

Installation

Run make build to build libtokenizers.a, the static library that any application using these bindings must link against. You also need to tell the linker where to find it, either on each invocation:

go run -ldflags="-extldflags '-L./path/to/libtokenizers.a'" .

or once, by adding it to the CGO_LDFLAGS environment variable so you don't have to specify it every time:

CGO_LDFLAGS="-L./path/to/libtokenizers.a"

Using pre-built binaries

If you don't want to install the Rust toolchain, build the library in Docker:

docker build --platform=linux/amd64 -f release/Dockerfile .

or use the prebuilt binaries published for several platforms on the releases page.

Getting started

TL;DR: see the working example.

Load a tokenizer from a JSON config:

import "github.com/cohere-ai/tokenizers"

tk, err := tokenizers.FromFile("./data/bert-base-uncased.json")
if err != nil {
    return err
}
// release native resources
defer tk.Close()
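
You can also compile the config into your binary and load it with FromBytes, which accepts the raw JSON. A minimal sketch, assuming the config file is available at build time for go:embed (the embedded path is illustrative):

import (
	_ "embed"

	"github.com/cohere-ai/tokenizers"
)

// Illustrative path; point this at your tokenizer config.
//go:embed data/bert-base-uncased.json
var tokenizerJSON []byte

func newTokenizer() (*tokenizers.Tokenizer, error) {
	// FromBytes parses the same JSON config as FromFile;
	// the caller still needs to Close the returned tokenizer.
	return tokenizers.FromBytes(tokenizerJSON)
}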

Encode text and decode tokens:

fmt.Println("Vocab size:", tk.VocabSize())
// Vocab size: 30522
fmt.Println(tk.Encode("brown fox jumps over the lazy dog", false))
// [2829 4419 14523 2058 1996 13971 3899] [brown fox jumps over the lazy dog]
fmt.Println(tk.Encode("brown fox jumps over the lazy dog", true))
// [101 2829 4419 14523 2058 1996 13971 3899 102] [[CLS] brown fox jumps over the lazy dog [SEP]]
fmt.Println(tk.Decode([]uint32{2829, 4419, 14523, 2058, 1996, 13971, 3899}, true))
// brown fox jumps over the lazy dog
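
When you need more than IDs and token strings, EncodeWithOptions returns a full Encoding; each optional field is populated only if you request it. A short sketch using the documented options (the attention-mask semantics assumed here follow the usual HuggingFace convention of 1 for real tokens):

enc := tk.EncodeWithOptions("brown fox jumps over the lazy dog", true,
	tokenizers.WithReturnTokens(),
	tokenizers.WithReturnAttentionMask(),
	tokenizers.WithReturnOffsets(),
)
fmt.Println(enc.IDs)           // token IDs, as returned by Encode
fmt.Println(enc.Tokens)        // string form of each token
fmt.Println(enc.AttentionMask) // expected: 1 for each real token
fmt.Println(enc.Offsets)       // [start, end] offsets of each token in the input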

Benchmarks

go test . -run=^\$ -bench=. -benchmem -count=10 > test/benchmark/$(git rev-parse HEAD).txt

Decoding overhead (due to CGO and extra allocations) is between 2% and 9%, depending on the benchmark.

go test . -bench=. -benchmem -benchtime=10s

goos: darwin
goarch: arm64
pkg: github.com/cohere-ai/tokenizers
BenchmarkEncodeNTimes-10     	  959494	     12622 ns/op	     232 B/op	      12 allocs/op
BenchmarkEncodeNChars-10      1000000000	     2.046 ns/op	       0 B/op	       0 allocs/op
BenchmarkDecodeNTimes-10     	 2758072	      4345 ns/op	      96 B/op	       3 allocs/op
BenchmarkDecodeNTokens-10    	18689725	     648.5 ns/op	       7 B/op	       0 allocs/op
PASS
ok   github.com/cohere-ai/tokenizers 126.681s

Run the equivalent Rust benchmarks with cargo bench:

decode_n_times          time:   [3.9812 µs 3.9874 µs 3.9939 µs]
                        change: [-0.4103% -0.1338% +0.1275%] (p = 0.33 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

decode_n_tokens         time:   [651.72 ns 661.73 ns 675.78 ns]
                        change: [+0.3504% +2.0016% +3.5507%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Contributing

Please refer to CONTRIBUTING.md for information on how to contribute a PR to this project.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type EncodeOption

type EncodeOption func(eo *encodeOpts)

func WithReturnAllAttributes

func WithReturnAllAttributes() EncodeOption

func WithReturnAttentionMask

func WithReturnAttentionMask() EncodeOption

func WithReturnOffsets

func WithReturnOffsets() EncodeOption

func WithReturnSpecialTokensMask

func WithReturnSpecialTokensMask() EncodeOption

func WithReturnTokens

func WithReturnTokens() EncodeOption

func WithReturnTypeIDs

func WithReturnTypeIDs() EncodeOption

type Encoding

type Encoding struct {
	IDs               []uint32
	TypeIDs           []uint32
	SpecialTokensMask []uint32
	AttentionMask     []uint32
	Tokens            []string
	Offsets           []Offset
}

type Offset

type Offset [2]uint

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

func FromBytes

func FromBytes(data []byte, opts ...TokenizerOption) (*Tokenizer, error)

func FromBytesWithTruncation

func FromBytesWithTruncation(data []byte, maxLen uint32, dir TruncationDirection) (*Tokenizer, error)

func FromFile

func FromFile(path string) (*Tokenizer, error)

func (*Tokenizer) Close

func (t *Tokenizer) Close() error

func (*Tokenizer) Decode

func (t *Tokenizer) Decode(tokenIDs []uint32, skipSpecialTokens bool) string

func (*Tokenizer) Encode

func (t *Tokenizer) Encode(str string, addSpecialTokens bool) ([]uint32, []string)

func (*Tokenizer) EncodeWithOptions

func (t *Tokenizer) EncodeWithOptions(str string, addSpecialTokens bool, opts ...EncodeOption) Encoding

func (*Tokenizer) VocabSize

func (t *Tokenizer) VocabSize() uint32

type TokenizerOption

type TokenizerOption func(to *tokenizerOpts)

func WithEncodeSpecialTokens

func WithEncodeSpecialTokens() TokenizerOption

type TruncationDirection

type TruncationDirection int
const (
	TruncationDirectionLeft TruncationDirection = iota
	TruncationDirectionRight
)
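
As a closing sketch of how these types fit together: FromBytesWithTruncation builds a tokenizer that caps every encoding at a maximum length, truncating from the chosen direction. The config path and the 512-token limit below are assumptions for illustration:

import (
	"os"

	"github.com/cohere-ai/tokenizers"
)

func truncatingTokenizer() (*tokenizers.Tokenizer, error) {
	// Illustrative path; any tokenizer JSON config works here.
	data, err := os.ReadFile("./data/bert-base-uncased.json")
	if err != nil {
		return nil, err
	}
	// Keep at most 512 tokens, dropping from the left when the input is longer.
	return tokenizers.FromBytesWithTruncation(data, 512, tokenizers.TruncationDirectionLeft)
}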

Directories

Path Synopsis
example module
release module
