sentencepiece

package
v0.2.2
Published: Jun 27, 2025 | License: Apache-2.0 | Imports: 4 | Imported by: 1

Documentation

Overview

Package sentencepiece implements a tokenizers.Tokenizer based on the SentencePiece tokenizer.

Constants

This section is empty.

Variables

This section is empty.

Functions

func New

func New(config *api.Config, repo *hub.Repo) (api.Tokenizer, error)

New creates a SentencePiece tokenizer based on the "tokenizer.model" file, which must be a SentencePiece Model proto (see protos.Model).

It has the tokenizer.TokenizerConstructor function signature.
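The sketch below illustrates why a shared constructor signature is useful: functions with the same shape can be stored in a table and selected by name. Everything in it is a local stand-in. Config, Repo, Tokenizer, the registry map, and the byte-level toy tokenizer are hypothetical and are not the library's real types or its actual dispatch mechanism.

```go
package main

import (
	"errors"
	"fmt"
)

// Config, Repo and Tokenizer are local stand-ins for api.Config, hub.Repo and
// api.Tokenizer. They are NOT the library's real definitions.
type Config struct{ TokenizerClass string }
type Repo struct{ ID string }
type Tokenizer interface {
	Encode(text string) []int
	Decode(ids []int) string
}

// TokenizerConstructor mirrors the shape of New's signature.
type TokenizerConstructor func(config *Config, repo *Repo) (Tokenizer, error)

// byteTokenizer is a toy tokenizer that emits one id per byte.
type byteTokenizer struct{}

func (byteTokenizer) Encode(text string) []int {
	ids := make([]int, len(text))
	for i := 0; i < len(text); i++ {
		ids[i] = int(text[i])
	}
	return ids
}

func (byteTokenizer) Decode(ids []int) string {
	buf := make([]byte, len(ids))
	for i, id := range ids {
		buf[i] = byte(id)
	}
	return string(buf)
}

// newByteTokenizer has the TokenizerConstructor signature.
func newByteTokenizer(config *Config, repo *Repo) (Tokenizer, error) {
	if repo == nil {
		return nil, errors.New("nil repo")
	}
	return byteTokenizer{}, nil
}

// registry (hypothetical) maps a tokenizer class name to its constructor.
var registry = map[string]TokenizerConstructor{
	"ByteTokenizer": newByteTokenizer,
}

// build looks up the constructor named in config and calls it.
func build(config *Config, repo *Repo) (Tokenizer, error) {
	ctor, ok := registry[config.TokenizerClass]
	if !ok {
		return nil, fmt.Errorf("unknown tokenizer class %q", config.TokenizerClass)
	}
	return ctor(config, repo)
}

func main() {
	tok, err := build(&Config{TokenizerClass: "ByteTokenizer"}, &Repo{ID: "example/model"})
	if err != nil {
		panic(err)
	}
	fmt.Println(tok.Encode("hi")) // prints [104 105]
}
```

Because every constructor shares one signature, adding another tokenizer kind in this sketch only requires a new entry in the map.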

Types

type Tokenizer

type Tokenizer struct {
	*esentencepiece.Processor
	Info *esentencepiece.ModelInfo
}

Tokenizer implements the tokenizers.Tokenizer interface using Google's SentencePiece tokenizer.

func (*Tokenizer) Decode

func (p *Tokenizer) Decode(ids []int) string

Decode returns the text from a sequence of ids. It implements sampler.Vocabulary.

func (*Tokenizer) Encode

func (p *Tokenizer) Encode(text string) []int

Encode returns the text encoded into a sequence of ids. It implements sampler.Vocabulary.

func (*Tokenizer) SpecialTokenID

func (p *Tokenizer) SpecialTokenID(token api.SpecialToken) (int, error)

SpecialTokenID returns the token ID for the given special token, or an error if the token is not known.
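A typical use of such a lookup is appending an end-of-sentence id to an encoded sequence and handling the case where the model defines no such token. The sketch below uses local stand-ins only: the SpecialToken type, its constant names, and toyTokenizer are hypothetical and are not confirmed to match the api package.

```go
package main

import "fmt"

// SpecialToken is a local stand-in for api.SpecialToken.
type SpecialToken int

// Illustrative constants; the names are assumptions made for this sketch.
const (
	TokBeginningOfSentence SpecialToken = iota
	TokEndOfSentence
	TokUnknown
	TokPad
)

// toyTokenizer holds only the special-token table needed for this example.
type toyTokenizer struct {
	special map[SpecialToken]int
}

// SpecialTokenID follows the documented shape: it returns the id, or an error
// when the tokenizer does not define the requested special token.
func (t *toyTokenizer) SpecialTokenID(token SpecialToken) (int, error) {
	id, ok := t.special[token]
	if !ok {
		return 0, fmt.Errorf("special token %d is not defined for this tokenizer", token)
	}
	return id, nil
}

// appendEOS returns a copy of ids with the end-of-sentence id appended.
func appendEOS(t *toyTokenizer, ids []int) ([]int, error) {
	eos, err := t.SpecialTokenID(TokEndOfSentence)
	if err != nil {
		return nil, err
	}
	out := append([]int(nil), ids...)
	return append(out, eos), nil
}

func main() {
	tok := &toyTokenizer{special: map[SpecialToken]int{
		TokBeginningOfSentence: 1,
		TokEndOfSentence:       2,
		TokUnknown:             0,
	}}
	ids, err := appendEOS(tok, []int{5, 6})
	if err != nil {
		panic(err)
	}
	fmt.Println(ids) // prints [5 6 2]

	// TokPad is missing from the table, so the lookup reports an error.
	if _, err := tok.SpecialTokenID(TokPad); err != nil {
		fmt.Println("no pad token:", err)
	}
}
```

Returning an error instead of a sentinel id keeps a missing token from being confused with a valid id such as 0.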

Directories

Path Synopsis
private
protos
Package protos has the Protocol Buffers code for the sentencepiece_model.proto file, downloaded from https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto.
