pretokenizedstring

package
v0.2.0
Published: Dec 12, 2020 License: BSD-2-Clause Imports: 5 Imported by: 2

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type NormalizedByteSplit

type NormalizedByteSplit struct {
	// A slice of the normalized string
	String string
	// The associated byte offsets, in the normalized referential
	Offsets strutils.ByteOffsets
	// The potential tokens
	Tokens *[]models.Token
}

type OriginalByteSplit

type OriginalByteSplit struct {
	// A slice of the normalized string
	String string
	// The associated byte offsets, in the original referential
	Offsets strutils.ByteOffsets
	// The potential tokens
	Tokens *[]models.Token
}

type PreTokenizedString

type PreTokenizedString struct {
	// contains filtered or unexported fields
}

PreTokenizedString is in charge of splitting an underlying string, keeping track of the offsets of each split while doing so, and providing ways to normalize and tokenize these splits.

Once everything has been normalized and tokenized, the PreTokenizedString is able to build an Encoding with all the relevant offsets and word ids, relative to the original string.
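The typical flow is: build a PreTokenizedString with FromString, call Split, optionally Normalize and Tokenize each split, and finally call IntoEncoding. The sketch below shows the first steps only. The import paths are assumptions inferred from the package names used in this documentation and may differ from the module's actual layout; the trivial SplitFunc simply keeps the whole input as a single Split.

package main

import (
	"fmt"

	"github.com/nlpodyssey/gotokenizers/normalizedstring"
	"github.com/nlpodyssey/gotokenizers/pretokenizedstring"
)

func main() {
	p := pretokenizedstring.FromString("Hello world")

	// Trivial SplitFunc: keep the whole input as one Split. A real
	// pre-tokenizer would split further (e.g. on whitespace or punctuation).
	err := p.Split(func(index int, ns *normalizedstring.NormalizedString) ([]pretokenizedstring.Split, error) {
		return pretokenizedstring.SplitsFromNormalizedStrings(
			[]*normalizedstring.NormalizedString{ns},
		), nil
	})
	if err != nil {
		panic(err)
	}

	// Inspect the splits together with their byte offsets in the original string.
	for _, s := range p.GetOriginalByteSplits() {
		fmt.Printf("%q %v\n", s.String, s.Offsets)
	}
}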

func FromString

func FromString(s string) *PreTokenizedString

func (*PreTokenizedString) GetNormalizedByteSplits

func (p *PreTokenizedString) GetNormalizedByteSplits() []NormalizedByteSplit

GetNormalizedByteSplits returns a list of NormalizedByteSplit.

func (*PreTokenizedString) GetOriginalByteSplits

func (p *PreTokenizedString) GetOriginalByteSplits() []OriginalByteSplit

GetOriginalByteSplits returns a list of OriginalByteSplit.
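A short sketch contrasting the two getters. It assumes p is a *PreTokenizedString that has already been split (as in the example above); the offsets are printed as whole values since the fields of strutils.ByteOffsets are not shown here. The two result lists differ only in the referential of the returned offsets.

for _, s := range p.GetOriginalByteSplits() {
	fmt.Printf("original:   %q %v\n", s.String, s.Offsets)
}
for _, s := range p.GetNormalizedByteSplits() {
	fmt.Printf("normalized: %q %v\n", s.String, s.Offsets)
}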

func (*PreTokenizedString) IntoEncoding added in v0.2.0

func (p *PreTokenizedString) IntoEncoding(wordIndex int, typeID int) (*encodings.Encoding, error)

IntoEncoding transforms the current PreTokenizedString into an encodings.Encoding.

If a wordIndex is provided (i.e. >= 0), any word in the generated Encoding will be set to this value. This is generally used with pre-tokenized input that does not need the PreTokenizedString to generate word ids.

This method will fail if some splits do not have associated Tokens.

Offset indices are based on bytes (not runes).
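A sketch of the final step, assuming p has already been split and every split carries Tokens (for example after a successful Tokenize call). Passing a negative wordIndex is read here as "no fixed word index", following the description above; 0 is used as the type ID.

encoding, err := p.IntoEncoding(-1, 0)
if err != nil {
	// Typically: some split has no associated Tokens.
	panic(err)
}
fmt.Println(encoding)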

func (*PreTokenizedString) Normalize

func (p *PreTokenizedString) Normalize(
	normalize func(ns *normalizedstring.NormalizedString) error,
) error

Normalize normalizes all the splits that do not have attached Split.Tokens, using the provided normalization function.
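A sketch of a Normalize call. The lowercase helper is hypothetical and stands in for whatever transformation the normalizedstring package actually provides.

err := p.Normalize(func(ns *normalizedstring.NormalizedString) error {
	// Hypothetical helper: transform ns in place (e.g. lowercasing).
	return lowercase(ns)
})
if err != nil {
	panic(err)
}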

func (*PreTokenizedString) Split

func (p *PreTokenizedString) Split(splitFunc SplitFunc) error

Split splits the PreTokenizedString by providing a SplitFunc in charge of splitting each substring (normalizedstring.NormalizedString) into multiple parts.
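A sketch of a Split call. splitOnWhitespace is a hypothetical helper assumed to derive contiguous parts from ns; see the SplitFunc type below for the constraint these parts must respect.

err := p.Split(func(index int, ns *normalizedstring.NormalizedString) ([]pretokenizedstring.Split, error) {
	// Hypothetical helper returning []*normalizedstring.NormalizedString
	// derived from ns.
	parts, err := splitOnWhitespace(ns)
	if err != nil {
		return nil, err
	}
	return pretokenizedstring.SplitsFromNormalizedStrings(parts), nil
})
if err != nil {
	panic(err)
}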

func (*PreTokenizedString) Splits

func (p *PreTokenizedString) Splits() []Split

func (*PreTokenizedString) Tokenize

func (p *PreTokenizedString) Tokenize(
	tokenize func(ns *normalizedstring.NormalizedString) ([]models.Token, error),
) error

Tokenize tokenizes all the splits that do not have attached Split.Tokens, using the provided tokenization function.
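A sketch of a Tokenize call. The model value and its Tokenize method are hypothetical placeholders, not part of this package; any function producing []models.Token for a split will do.

err := p.Tokenize(func(ns *normalizedstring.NormalizedString) ([]models.Token, error) {
	// Hypothetical model call producing the tokens for this split.
	return model.Tokenize(ns)
})
if err != nil {
	panic(err)
}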

type Split

type Split struct {
	// The underlying normalizedstring.NormalizedString.
	// Each SubString is represented by a normalizedstring.NormalizedString,
	// and in the end we might be carrying a lot of SubStrings representing
	// various parts of the original input string.
	NormalizedString *normalizedstring.NormalizedString
	// Optional Tokens associated to this Split.
	Tokens *[]models.Token
}

Split is a wrapper for a subpart of a NormalizedString.

This Split contains the underlying NormalizedString as well as its offsets in the original string. These offsets are in the "original" referential. It also contains any Token associated to the current split.

func SplitsFromNormalizedStrings

func SplitsFromNormalizedStrings(nss []*normalizedstring.NormalizedString) []Split

SplitsFromNormalizedStrings transforms a slice of NormalizedStrings into a corresponding slice of Splits, with nil tokens.
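As a sketch, the helper behaves like building the Splits by hand with nil Tokens (nss is assumed to be a []*normalizedstring.NormalizedString):

splits := pretokenizedstring.SplitsFromNormalizedStrings(nss)

// Roughly equivalent, per the description above:
manual := make([]pretokenizedstring.Split, len(nss))
for i, ns := range nss {
	manual[i] = pretokenizedstring.Split{NormalizedString: ns, Tokens: nil}
}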

type SplitFunc

type SplitFunc func(
	index int,
	ns *normalizedstring.NormalizedString,
) ([]Split, error)

SplitFunc (used by PreTokenizedString.Split) takes a normalizedstring.NormalizedString and is in charge of returning the Splits produced from it.

SplitFunc is free to modify these NormalizedStrings as relevant, as long as it respects the constraint stated below.

There is only one constraint that MUST be respected: the produced normalizedstring.NormalizedStrings, if combined back together, must have the same "original" string as the one given to SplitFunc. This concretely means that, for the offset tracking to work as expected, SplitFunc must produce "splits" of the original string.
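A minimal SplitFunc that trivially satisfies the constraint is one that returns its input unchanged as a single Split; anything that drops, reorders, or fabricates content would break offset tracking. The sketch below only uses names documented on this page.

var identity pretokenizedstring.SplitFunc = func(index int, ns *normalizedstring.NormalizedString) ([]pretokenizedstring.Split, error) {
	// The single produced Split is exactly the input, so combining the
	// produced NormalizedStrings back together trivially yields the same
	// "original" string that was given to the SplitFunc.
	return pretokenizedstring.SplitsFromNormalizedStrings(
		[]*normalizedstring.NormalizedString{ns},
	), nil
}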
