tokenizer

package module

v1.1.2 Latest Latest Go to latest Published: Feb 7, 2023 License: Apache-2.0 Imports: 15 Imported by: 3

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/cohere-ai/tokenizer

Links

Open Source Insights

README ¶

Tokenizers

Cohere's tokenizers library provides an interface to encode and decode text given a computed vocabulary, and includes pre-computed tokenizers that are used to train Cohere's models.

We plan on eventually also open sourcing tools to create new tokenizers.

Example using Go

Choose a tokenizer inside of the vocab folder including both a encoder.json file and a vocab.bpe file and create an encoder as seen below. The tokenizer used in this example is named the coheretext-50k tokenizer.

import (
  ...
  "github.com/cohere-ai/tokenizer"
)

encoder := tokenizer.NewFromPrebuilt("coheretext-50k")

To encode a string of text, use the Encode method. Encode returns a slice of int64s.

encoded := encoder.Encode("this is a string to be encoded")
fmt.Printf("%v", encoded)
// [6372 329 258 3852 288 345 37754]

To decode a slice of int64s, use the Decode method. Decode returns a string.

fmt.Printf(encoder.Decode(encoded))
// this is a string to be encoded

Speed

Using a 2.5GHz CPU, encoding 1000 tokens takes approximately 6.5 milliseconds, and decoding 1000 tokens takes approximately 0.2 milliseconds.

Documentation ¶

Index ¶

func CountReader(reader io.Reader) (map[string]int64, error)
func CountString(s string) map[string]int64
func MergeCounts(a map[string]int64, b map[string]int64)
func WordSplit(s string) []string
type Encoder
type Merge
- func BPE(freq map[string]int64, numSymbols, minFrequency int64) (map[string]int64, []*Merge, error)
type WordCount

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func CountReader ¶ added in v1.0.0

func CountReader(reader io.Reader) (map[string]int64, error)

func CountString ¶ added in v1.0.0

func CountString(s string) map[string]int64

func MergeCounts ¶ added in v1.0.0

func MergeCounts(a map[string]int64, b map[string]int64)

func WordSplit ¶ added in v1.0.2

func WordSplit(s string) []string

Types ¶

type Encoder ¶

type Encoder struct {
	Encoder   map[string]int64
	Decoder   map[int64]string
	BPERanks  map[[2]string]int64
	Cache     map[string]string
	VocabSize int64
}

func New ¶

func New(encoder map[string]int64, bpeMerges [][2]string) (*Encoder, error)

func NewFromPrebuilt ¶

func NewFromPrebuilt(name string) (*Encoder, error)

func NewFromReaders ¶

func NewFromReaders(encoderReader, vocabReader io.Reader) (*Encoder, error)

func (*Encoder) Decode ¶

func (e *Encoder) Decode(tokens []int64) string

func (*Encoder) Encode ¶

func (e *Encoder) Encode(text string) ([]int64, []string)

func (*Encoder) EncodeWords ¶ added in v1.0.4

func (e *Encoder) EncodeWords(words []string) ([]int64, []string)

type Merge ¶ added in v1.0.0

type Merge struct {
	Merge [2]string
	Count int64
}

func BPE ¶ added in v1.0.0

func BPE(freq map[string]int64, numSymbols, minFrequency int64) (map[string]int64, []*Merge, error)

type WordCount ¶ added in v1.0.0

type WordCount struct {
	Pieces []string `json:"pieces"`
	Count  int64    `json:"count"`
}

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL