wordpiecetokenizer

package v0.2.1
Published: Nov 8, 2023 License: BSD-2-Clause Imports: 4 Imported by: 0

Documentation

Index

Constants

View Source
const (
	// DefaultClassToken is the default class token value for the WordPiece tokenizer.
	DefaultClassToken = "[CLS]"
	// DefaultSequenceSeparator is the default sequence separator value for the WordPiece tokenizer.
	DefaultSequenceSeparator = "[SEP]"
	// DefaultUnknownToken is the default unknown token value for the WordPiece tokenizer.
	DefaultUnknownToken = "[UNK]"
	// DefaultMaskToken is the default mask token value for the WordPiece tokenizer.
	DefaultMaskToken = "[MASK]"
	// DefaultSplitPrefix is the default split prefix value for the WordPiece tokenizer.
	DefaultSplitPrefix = "##"
	// DefaultMaxWordChars is the default maximum word length for the WordPiece tokenizer.
	DefaultMaxWordChars = 100
)
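A minimal sketch (not part of this package) of how these default special tokens typically frame a BERT-style input: the class token opens the sequence and the sequence separator closes it. The constants below are local copies of the documented defaults, and `frameSequence` is a hypothetical helper for illustration.

```go
package main

import "fmt"

// Local copies of the package's documented default special tokens.
const (
	classToken        = "[CLS]"
	sequenceSeparator = "[SEP]"
)

// frameSequence brackets a token sequence the way a BERT-style model
// expects it: [CLS] tok1 tok2 ... [SEP].
func frameSequence(tokens []string) []string {
	out := make([]string, 0, len(tokens)+2)
	out = append(out, classToken)
	out = append(out, tokens...)
	out = append(out, sequenceSeparator)
	return out
}

func main() {
	fmt.Println(frameSequence([]string{"hello", "world"}))
}
```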

Variables

This section is empty.

Functions

func GroupSubWords

func GroupSubWords(tokens []tokenizers.StringOffsetsPair) []tokenizers.StringOffsetsPair

GroupSubWords returns a list of token ranges, each of which represents the start and end indexes of the tokens that form a complete word.
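The grouping logic can be sketched as follows. This is a hedged, self-contained re-implementation for illustration only: it operates on plain strings rather than the package's `tokenizers.StringOffsetsPair`, and `tokenRange` is a hypothetical stand-in type. Consecutive tokens carrying the `##` split prefix are merged into the range of the word they continue.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenRange marks the start and end indexes (inclusive) of the tokens
// that together form one complete word. Simplified stand-in type.
type tokenRange struct{ Start, End int }

// groupSubWords merges runs of "##"-prefixed continuation tokens into
// the range of the preceding word-initial token.
func groupSubWords(tokens []string) []tokenRange {
	var ranges []tokenRange
	for i, tok := range tokens {
		if strings.HasPrefix(tok, "##") && len(ranges) > 0 {
			ranges[len(ranges)-1].End = i // extend the current word
			continue
		}
		ranges = append(ranges, tokenRange{Start: i, End: i})
	}
	return ranges
}

func main() {
	// Tokens 0-2 spell "unbelievable"; token 3 is the word "news".
	fmt.Println(groupSubWords([]string{"un", "##believ", "##able", "news"}))
}
```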

func IsDefaultSpecial

func IsDefaultSpecial(word string) bool

IsDefaultSpecial reports whether the word matches one of the default special tokens.
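A sketch of what such a check amounts to, assuming the set of default special tokens listed in the constants above; `isDefaultSpecial` here is a local illustration, not the package's implementation.

```go
package main

import "fmt"

// isDefaultSpecial reports membership in the set of default special
// tokens (assumed to be the four documented constant values).
func isDefaultSpecial(word string) bool {
	switch word {
	case "[CLS]", "[SEP]", "[UNK]", "[MASK]":
		return true
	}
	return false
}

func main() {
	fmt.Println(isDefaultSpecial("[CLS]"), isDefaultSpecial("hello"))
}
```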

Types

type WordPieceTokenizer

type WordPieceTokenizer struct {
	// contains filtered or unexported fields
}

WordPieceTokenizer is a tokenizer that breaks tokens into sub-word units based on a supplied vocabulary. See https://arxiv.org/pdf/1609.08144.pdf Section 4.1 for details. WordPieceTokenizer uses BaseTokenizer to preprocess the input text.

func New

func New(vocabulary *vocabulary.Vocabulary) *WordPieceTokenizer

New returns a new WordPieceTokenizer.

func (*WordPieceTokenizer) Tokenize

func (t *WordPieceTokenizer) Tokenize(text string) []tokenizers.StringOffsetsPair

Tokenize converts the input text to a slice of words or sub-words token units based on the supplied vocabulary. The resulting tokens preserve the alignment with the portion of the original text they belong to.

func (*WordPieceTokenizer) WordPieceTokenize

WordPieceTokenize transforms the input tokens into a new slice of words or sub-word units based on the supplied vocabulary. The resulting tokens preserve the alignment with the portion of the original text they belong to.
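The algorithm behind this is the greedy longest-match-first segmentation described in the GNMT paper (Section 4.1). Below is a hedged, self-contained sketch of that algorithm, not this package's implementation: the vocabulary is a plain string set rather than a `vocabulary.Vocabulary`, continuation pieces carry the `##` split prefix, and over-long or unmatchable words map to `[UNK]`, mirroring the documented defaults.

```go
package main

import "fmt"

// wordPieceTokenize greedily matches the longest vocabulary entry at
// each position of the word. Pieces after the first carry the "##"
// split prefix; words that exceed maxWordChars, or contain a position
// where no piece matches, collapse to the unknown token.
func wordPieceTokenize(word string, vocab map[string]bool) []string {
	const (
		unknownToken = "[UNK]"
		splitPrefix  = "##"
		maxWordChars = 100
	)
	runes := []rune(word)
	if len(runes) > maxWordChars {
		return []string{unknownToken}
	}
	var pieces []string
	start := 0
	for start < len(runes) {
		// Shrink the candidate span until it hits a vocabulary entry.
		end := len(runes)
		var cur string
		for end > start {
			candidate := string(runes[start:end])
			if start > 0 {
				candidate = splitPrefix + candidate
			}
			if vocab[candidate] {
				cur = candidate
				break
			}
			end--
		}
		if cur == "" {
			return []string{unknownToken} // no piece matches here
		}
		pieces = append(pieces, cur)
		start = end
	}
	return pieces
}

func main() {
	vocab := map[string]bool{"un": true, "##believ": true, "##able": true}
	fmt.Println(wordPieceTokenize("unbelievable", vocab))
}
```

The greedy choice makes segmentation deterministic for a fixed vocabulary, which is why the resulting pieces can keep stable offsets into the original text.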
