wordpiecetokenizer

package v0.2.1
Published: Nov 8, 2023 License: BSD-2-Clause Imports: 4 Imported by: 0

Documentation

Index

Constants

View Source
const (
	// DefaultClassToken is the default class token value for the WordPiece tokenizer.
	DefaultClassToken = "[CLS]"
	// DefaultSequenceSeparator is the default sequence separator value for the WordPiece tokenizer.
	DefaultSequenceSeparator = "[SEP]"
	// DefaultUnknownToken is the default unknown token value for the WordPiece tokenizer.
	DefaultUnknownToken = "[UNK]"
	// DefaultMaskToken is the default mask token value for the WordPiece tokenizer.
	DefaultMaskToken = "[MASK]"
	// DefaultSplitPrefix is the default split prefix value for the WordPiece tokenizer.
	DefaultSplitPrefix = "##"
	// DefaultMaxWordChars is the default maximum word length for the WordPiece tokenizer.
	DefaultMaxWordChars = 100
)
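A minimal sketch (not part of this package) of how these default special tokens typically frame a BERT-style input: the class token opens the sequence and the sequence separator closes it. The constants below are local copies of the documented defaults, and `frameSequence` is a hypothetical helper for illustration.

```go
package main

import "fmt"

// Local copies of the package's documented default special tokens.
const (
	classToken        = "[CLS]"
	sequenceSeparator = "[SEP]"
)

// frameSequence brackets a token sequence the way a BERT-style model
// expects it: [CLS] tok1 tok2 ... [SEP].
func frameSequence(tokens []string) []string {
	out := make([]string, 0, len(tokens)+2)
	out = append(out, classToken)
	out = append(out, tokens...)
	out = append(out, sequenceSeparator)
	return out
}

func main() {
	fmt.Println(frameSequence([]string{"hello", "world"}))
}
```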

Variables

This section is empty.

Functions

func GroupSubWords

func GroupSubWords(tokens []tokenizers.StringOffsetsPair) []tokenizers.StringOffsetsPair

GroupSubWords returns a list of token ranges, each of which represents the start and end indexes of the tokens that form a complete word.
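The grouping logic can be sketched as follows. This is a hedged, self-contained re-implementation for illustration only: it operates on plain strings rather than the package's `tokenizers.StringOffsetsPair`, and `tokenRange` is a hypothetical stand-in type. Consecutive tokens carrying the `##` split prefix are merged into the range of the word they continue.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenRange marks the start and end indexes (inclusive) of the tokens
// that together form one complete word. Simplified stand-in type.
type tokenRange struct{ Start, End int }

// groupSubWords merges runs of "##"-prefixed continuation tokens into
// the range of the preceding word-initial token.
func groupSubWords(tokens []string) []tokenRange {
	var ranges []tokenRange
	for i, tok := range tokens {
		if strings.HasPrefix(tok, "##") && len(ranges) > 0 {
			ranges[len(ranges)-1].End = i // extend the current word
			continue
		}
		ranges = append(ranges, tokenRange{Start: i, End: i})
	}
	return ranges
}

func main() {
	// Tokens 0-2 spell "unbelievable"; token 3 is the word "news".
	fmt.Println(groupSubWords([]string{"un", "##believ", "##able", "news"}))
}
```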

func IsDefaultSpecial

func IsDefaultSpecial(word string) bool

IsDefaultSpecial reports whether the word matches one of the default special tokens.
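A sketch of what such a check amounts to, assuming the set of default special tokens listed in the constants above; `isDefaultSpecial` here is a local illustration, not the package's implementation.

```go
package main

import "fmt"

// isDefaultSpecial reports membership in the set of default special
// tokens (assumed to be the four documented constant values).
func isDefaultSpecial(word string) bool {
	switch word {
	case "[CLS]", "[SEP]", "[UNK]", "[MASK]":
		return true
	}
	return false
}

func main() {
	fmt.Println(isDefaultSpecial("[CLS]"), isDefaultSpecial("hello"))
}
```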

Types

type WordPieceTokenizer

type WordPieceTokenizer struct {
	// contains filtered or unexported fields
}

WordPieceTokenizer is a tokenizer that breaks tokens into sub-word units based on a supplied vocabulary. See https://arxiv.org/pdf/1609.08144.pdf Section 4.1 for details. WordPieceTokenizer uses BaseTokenizer to preprocess the input text.

func New

func New(vocabulary *vocabulary.Vocabulary) *WordPieceTokenizer

New returns a new WordPieceTokenizer.

func (*WordPieceTokenizer) Tokenize

func (t *WordPieceTokenizer) Tokenize(text string) []tokenizers.StringOffsetsPair

Tokenize converts the input text to a slice of words or sub-words token units based on the supplied vocabulary. The resulting tokens preserve the alignment with the portion of the original text they belong to.

func (*WordPieceTokenizer) WordPieceTokenize

WordPieceTokenize transforms the input tokens into a new slice of words or sub-word units based on the supplied vocabulary. The resulting tokens preserve the alignment with the portion of the original text they belong to.
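The algorithm behind this is the greedy longest-match-first segmentation described in the GNMT paper (Section 4.1). Below is a hedged, self-contained sketch of that algorithm, not this package's implementation: the vocabulary is a plain string set rather than a `vocabulary.Vocabulary`, continuation pieces carry the `##` split prefix, and over-long or unmatchable words map to `[UNK]`, mirroring the documented defaults.

```go
package main

import "fmt"

// wordPieceTokenize greedily matches the longest vocabulary entry at
// each position of the word. Pieces after the first carry the "##"
// split prefix; words that exceed maxWordChars, or contain a position
// where no piece matches, collapse to the unknown token.
func wordPieceTokenize(word string, vocab map[string]bool) []string {
	const (
		unknownToken = "[UNK]"
		splitPrefix  = "##"
		maxWordChars = 100
	)
	runes := []rune(word)
	if len(runes) > maxWordChars {
		return []string{unknownToken}
	}
	var pieces []string
	start := 0
	for start < len(runes) {
		// Shrink the candidate span until it hits a vocabulary entry.
		end := len(runes)
		var cur string
		for end > start {
			candidate := string(runes[start:end])
			if start > 0 {
				candidate = splitPrefix + candidate
			}
			if vocab[candidate] {
				cur = candidate
				break
			}
			end--
		}
		if cur == "" {
			return []string{unknownToken} // no piece matches here
		}
		pieces = append(pieces, cur)
		start = end
	}
	return pieces
}

func main() {
	vocab := map[string]bool{"un": true, "##believ": true, "##able": true}
	fmt.Println(wordPieceTokenize("unbelievable", vocab))
}
```

The greedy choice makes segmentation deterministic for a fixed vocabulary, which is why the resulting pieces can keep stable offsets into the original text.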
