tokenizer

package module
v1.1.2 Latest
Warning

This package is not in the latest version of its module.

Published: Feb 7, 2023 | License: Apache-2.0 | Imports: 15 | Imported by: 3

README

Tokenizers

Cohere's tokenizers library provides an interface to encode and decode text given a computed vocabulary, and includes pre-computed tokenizers that are used to train Cohere's models.

We plan to eventually open-source tools for creating new tokenizers as well.

Example using Go

Choose a tokenizer from the vocab folder that includes both an encoder.json file and a vocab.bpe file, then create an encoder as shown below. This example uses the coheretext-50k tokenizer.

import (
  ...
  "github.com/cohere-ai/tokenizer"
)

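// NewFromPrebuilt returns (*Encoder, error), so check the error before use.
encoder, err := tokenizer.NewFromPrebuilt("coheretext-50k")
if err != nil {
  log.Fatal(err)
}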

To encode a string of text, use the Encode method. Encode returns two values: a slice of int64 token IDs and a slice of strings. The example below keeps only the IDs.

encoded, _ := encoder.Encode("this is a string to be encoded")
fmt.Printf("%v", encoded)
// [6372 329 258 3852 288 345 37754]

To decode a slice of int64s, use the Decode method. Decode returns a string.

fmt.Println(encoder.Decode(encoded))
// this is a string to be encoded

Speed

On a 2.5 GHz CPU, encoding 1,000 tokens takes approximately 6.5 milliseconds, and decoding 1,000 tokens takes approximately 0.2 milliseconds.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CountReader added in v1.0.0

func CountReader(reader io.Reader) (map[string]int64, error)

func CountString added in v1.0.0

func CountString(s string) map[string]int64

func MergeCounts added in v1.0.0

func MergeCounts(a map[string]int64, b map[string]int64)
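
The counting functions above are listed without descriptions. As a rough illustration of what their signatures suggest, the sketch below counts words in a string and folds one count map into another. The names countWords and mergeInto are hypothetical, splitting on whitespace is an assumption (the package's own WordSplit may split differently), and treating the first argument as the merge destination is a guess.

```go
package main

import (
	"fmt"
	"strings"
)

// countWords is a hypothetical stand-in for CountString.
// Assumption: words are separated by whitespace.
func countWords(s string) map[string]int64 {
	counts := make(map[string]int64)
	for _, w := range strings.Fields(s) {
		counts[w]++
	}
	return counts
}

// mergeInto is a hypothetical stand-in for MergeCounts.
// It adds every count in src to dst, modifying dst in place.
func mergeInto(dst, src map[string]int64) {
	for w, n := range src {
		dst[w] += n
	}
}

func main() {
	a := countWords("the cat and the hat")
	b := countWords("the end")
	mergeInto(a, b)
	fmt.Println(a["the"], a["cat"], a["end"]) // 3 1 1
}
```

Merging in place matches a signature with no return value, and it lets counts from many inputs accumulate into one map without reallocating.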

func WordSplit added in v1.0.2

func WordSplit(s string) []string

Types

type Encoder

type Encoder struct {
	Encoder   map[string]int64
	Decoder   map[int64]string
	BPERanks  map[[2]string]int64
	Cache     map[string]string
	VocabSize int64
}
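
The Encoder and Decoder fields have mirrored types (map[string]int64 and map[int64]string), which suggests that one is the inverse of the other. The page does not say how the package builds them; the sketch below shows one way to derive an ID-to-token map from a token-to-ID map. The function name invertVocab is hypothetical.

```go
package main

import "fmt"

// invertVocab builds an ID-to-token map from a token-to-ID map.
// It returns an error if two tokens share an ID, because decoding
// would then be ambiguous.
func invertVocab(enc map[string]int64) (map[int64]string, error) {
	dec := make(map[int64]string, len(enc))
	for tok, id := range enc {
		if prev, ok := dec[id]; ok {
			return nil, fmt.Errorf("id %d is assigned to both %q and %q", id, prev, tok)
		}
		dec[id] = tok
	}
	return dec, nil
}

func main() {
	enc := map[string]int64{"th": 0, "is": 1, " a": 2}
	dec, err := invertVocab(enc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(dec[1]) // is
}
```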

func New

func New(encoder map[string]int64, bpeMerges [][2]string) (*Encoder, error)

func NewFromPrebuilt

func NewFromPrebuilt(name string) (*Encoder, error)

func NewFromReaders

func NewFromReaders(encoderReader, vocabReader io.Reader) (*Encoder, error)

func (*Encoder) Decode

func (e *Encoder) Decode(tokens []int64) string

func (*Encoder) Encode

func (e *Encoder) Encode(text string) ([]int64, []string)
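
The BPERanks field is keyed by symbol pairs, which suggests that encoding applies merges in rank order, as in standard byte-pair encoding. This page does not document the actual algorithm, so the following is only a minimal sketch of that standard technique. The function name applyMerges is hypothetical, and the sketch assumes that a lower rank means an earlier (higher-priority) merge.

```go
package main

import "fmt"

// applyMerges repeatedly merges the adjacent pair with the lowest rank
// until no adjacent pair appears in ranks. The input slice is not modified.
func applyMerges(symbols []string, ranks map[[2]string]int64) []string {
	out := append([]string(nil), symbols...) // work on a copy
	for len(out) > 1 {
		// Find the adjacent pair with the lowest rank.
		var best [2]string
		var bestRank int64
		found := false
		for i := 0; i+1 < len(out); i++ {
			pair := [2]string{out[i], out[i+1]}
			if r, ok := ranks[pair]; ok && (!found || r < bestRank) {
				best, bestRank, found = pair, r, true
			}
		}
		if !found {
			break // no known pair is left to merge
		}
		// Merge every occurrence of the chosen pair, scanning left to right.
		merged := make([]string, 0, len(out))
		for i := 0; i < len(out); i++ {
			if i+1 < len(out) && out[i] == best[0] && out[i+1] == best[1] {
				merged = append(merged, best[0]+best[1])
				i++ // skip the second half of the merged pair
			} else {
				merged = append(merged, out[i])
			}
		}
		out = merged
	}
	return out
}

func main() {
	ranks := map[[2]string]int64{
		{"l", "o"}:  0,
		{"lo", "w"}: 1,
		{"e", "r"}:  2,
	}
	fmt.Println(applyMerges([]string{"l", "o", "w", "e", "r"}, ranks)) // [low er]
}
```

After merging, each remaining symbol would be looked up in a token-to-ID map such as the Encoder field to produce the final IDs.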

func (*Encoder) EncodeWords added in v1.0.4

func (e *Encoder) EncodeWords(words []string) ([]int64, []string)

type Merge added in v1.0.0

type Merge struct {
	Merge [2]string
	Count int64
}

func BPE added in v1.0.0

func BPE(freq map[string]int64, numSymbols, minFrequency int64) (map[string]int64, []*Merge, error)
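
BPE takes word frequencies plus numSymbols and minFrequency limits, which matches the usual shape of byte-pair-encoding training. The sketch below illustrates that general technique, not this package's implementation. The name learnMerges is hypothetical, and several choices are assumptions: words start as individual characters, ties are broken lexicographically, and only the merge list is returned (the real function also returns a vocabulary map and per-merge counts).

```go
package main

import (
	"fmt"
	"strings"
)

// mergePair replaces every adjacent occurrence of pair in syms with the joined symbol.
func mergePair(syms []string, pair [2]string) []string {
	out := make([]string, 0, len(syms))
	for i := 0; i < len(syms); i++ {
		if i+1 < len(syms) && syms[i] == pair[0] && syms[i+1] == pair[1] {
			out = append(out, pair[0]+pair[1])
			i++
		} else {
			out = append(out, syms[i])
		}
	}
	return out
}

// learnMerges learns up to numMerges merges from word frequencies, stopping
// early when the most frequent pair occurs fewer than minFrequency times.
func learnMerges(freq map[string]int64, numMerges int, minFrequency int64) [][2]string {
	type word struct {
		syms  []string
		count int64
	}
	words := make([]word, 0, len(freq))
	for w, c := range freq {
		words = append(words, word{strings.Split(w, ""), c}) // one symbol per character
	}
	var merges [][2]string
	for len(merges) < numMerges {
		// Count adjacent symbol pairs, weighted by word frequency.
		pairs := make(map[[2]string]int64)
		for _, w := range words {
			for i := 0; i+1 < len(w.syms); i++ {
				pairs[[2]string{w.syms[i], w.syms[i+1]}] += w.count
			}
		}
		// Pick the most frequent pair; break ties lexicographically so the
		// result does not depend on map iteration order.
		var best [2]string
		var bestCount int64
		for p, c := range pairs {
			smaller := p[0] < best[0] || (p[0] == best[0] && p[1] < best[1])
			if c > bestCount || (c == bestCount && smaller) {
				best, bestCount = p, c
			}
		}
		if bestCount == 0 || bestCount < minFrequency {
			break
		}
		merges = append(merges, best)
		for i := range words {
			words[i].syms = mergePair(words[i].syms, best)
		}
	}
	return merges
}

func main() {
	freq := map[string]int64{"low": 5, "lower": 2, "newest": 6, "widest": 3}
	fmt.Println(learnMerges(freq, 3, 2)) // [[e s] [es t] [l o]]
}
```

The returned slice has the same [][2]string type as the bpeMerges parameter of New, so a merge list of this shape could plausibly be paired with a vocabulary map to construct an encoder.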

type WordCount added in v1.0.0

type WordCount struct {
	Pieces []string `json:"pieces"`
	Count  int64    `json:"count"`
}
