pretokenizedstring

package
v0.2.0
Published: Dec 12, 2020 License: BSD-2-Clause Imports: 5 Imported by: 2

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type NormalizedByteSplit

type NormalizedByteSplit struct {
	// A slice of the normalized string
	String string
	// The associated byte offsets, in the normalized referential
	Offsets strutils.ByteOffsets
	// The potential tokens
	Tokens *[]models.Token
}

type OriginalByteSplit

type OriginalByteSplit struct {
	// A slice of the normalized string
	String string
	// The associated byte offsets, in the original referential
	Offsets strutils.ByteOffsets
	// The potential tokens
	Tokens *[]models.Token
}

type PreTokenizedString

type PreTokenizedString struct {
	// contains filtered or unexported fields
}

PreTokenizedString is in charge of splitting an underlying string, keeping track of the offsets of each split while doing so, and providing ways to normalize and tokenize these splits.

Once everything has been normalized and tokenized, the PreTokenizedString is able to build an Encoding with all the relevant offsets and word ids, relative to the original string.
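The typical flow is: build a PreTokenizedString with FromString, call Split, optionally Normalize and Tokenize each split, and finally call IntoEncoding. The sketch below shows the first steps only. The import paths are assumptions inferred from the package names used in this documentation and may differ from the module's actual layout; the trivial SplitFunc simply keeps the whole input as a single Split.

package main

import (
	"fmt"

	"github.com/nlpodyssey/gotokenizers/normalizedstring"
	"github.com/nlpodyssey/gotokenizers/pretokenizedstring"
)

func main() {
	p := pretokenizedstring.FromString("Hello world")

	// Trivial SplitFunc: keep the whole input as one Split. A real
	// pre-tokenizer would split further (e.g. on whitespace or punctuation).
	err := p.Split(func(index int, ns *normalizedstring.NormalizedString) ([]pretokenizedstring.Split, error) {
		return pretokenizedstring.SplitsFromNormalizedStrings(
			[]*normalizedstring.NormalizedString{ns},
		), nil
	})
	if err != nil {
		panic(err)
	}

	// Inspect the splits together with their byte offsets in the original string.
	for _, s := range p.GetOriginalByteSplits() {
		fmt.Printf("%q %v\n", s.String, s.Offsets)
	}
}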

func FromString

func FromString(s string) *PreTokenizedString

func (*PreTokenizedString) GetNormalizedByteSplits

func (p *PreTokenizedString) GetNormalizedByteSplits() []NormalizedByteSplit

GetNormalizedByteSplits returns a list of NormalizedByteSplit.

func (*PreTokenizedString) GetOriginalByteSplits

func (p *PreTokenizedString) GetOriginalByteSplits() []OriginalByteSplit

GetOriginalByteSplits returns a list of OriginalByteSplit.
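A short sketch contrasting the two getters. It assumes p is a *PreTokenizedString that has already been split (as in the example above); the offsets are printed as whole values since the fields of strutils.ByteOffsets are not shown here. The two result lists differ only in the referential of the returned offsets.

for _, s := range p.GetOriginalByteSplits() {
	fmt.Printf("original:   %q %v\n", s.String, s.Offsets)
}
for _, s := range p.GetNormalizedByteSplits() {
	fmt.Printf("normalized: %q %v\n", s.String, s.Offsets)
}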

func (*PreTokenizedString) IntoEncoding added in v0.2.0

func (p *PreTokenizedString) IntoEncoding(wordIndex int, typeID int) (*encodings.Encoding, error)

IntoEncoding transforms the current PreTokenizedString into an encodings.Encoding.

If a wordIndex is provided (i.e. >= 0), any word in the generated Encoding will be set to this value. This is generally used with pre-tokenized input that does not need the PreTokenizedString to generate word ids.

This method will fail if some splits do not have associated Tokens.

Offset indices are based on bytes (not runes).
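A sketch of the final step, assuming p has already been split and every split carries Tokens (for example after a successful Tokenize call). Passing a negative wordIndex is read here as "no fixed word index", following the description above; 0 is used as the type ID.

encoding, err := p.IntoEncoding(-1, 0)
if err != nil {
	// Typically: some split has no associated Tokens.
	panic(err)
}
fmt.Println(encoding)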

func (*PreTokenizedString) Normalize

func (p *PreTokenizedString) Normalize(
	normalize func(ns *normalizedstring.NormalizedString) error,
) error

Normalize normalizes all the splits that do not have attached Split.Tokens, using the provided normalization function.
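A sketch of a Normalize call. The lowercase helper is hypothetical and stands in for whatever transformation the normalizedstring package actually provides.

err := p.Normalize(func(ns *normalizedstring.NormalizedString) error {
	// Hypothetical helper: transform ns in place (e.g. lowercasing).
	return lowercase(ns)
})
if err != nil {
	panic(err)
}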

func (*PreTokenizedString) Split

func (p *PreTokenizedString) Split(splitFunc SplitFunc) error

Split splits the PreTokenizedString by providing a SplitFunc in charge of splitting each substring (normalizedstring.NormalizedString) into multiple parts.
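A sketch of a Split call. splitOnWhitespace is a hypothetical helper assumed to derive contiguous parts from ns; see the SplitFunc type below for the constraint these parts must respect.

err := p.Split(func(index int, ns *normalizedstring.NormalizedString) ([]pretokenizedstring.Split, error) {
	// Hypothetical helper returning []*normalizedstring.NormalizedString
	// derived from ns.
	parts, err := splitOnWhitespace(ns)
	if err != nil {
		return nil, err
	}
	return pretokenizedstring.SplitsFromNormalizedStrings(parts), nil
})
if err != nil {
	panic(err)
}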

func (*PreTokenizedString) Splits

func (p *PreTokenizedString) Splits() []Split

func (*PreTokenizedString) Tokenize

func (p *PreTokenizedString) Tokenize(
	tokenize func(ns *normalizedstring.NormalizedString) ([]models.Token, error),
) error

Tokenize tokenizes all the splits that do not have attached Split.Tokens, using the provided tokenization function.
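A sketch of a Tokenize call. The model value and its Tokenize method are hypothetical placeholders, not part of this package; any function producing []models.Token for a split will do.

err := p.Tokenize(func(ns *normalizedstring.NormalizedString) ([]models.Token, error) {
	// Hypothetical model call producing the tokens for this split.
	return model.Tokenize(ns)
})
if err != nil {
	panic(err)
}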

type Split

type Split struct {
	// The underlying normalizedstring.NormalizedString.
	// Each SubString is represented by a normalizedstring.NormalizedString,
	// and in the end we might be carrying a lot of SubStrings representing
	// various parts of the original input string.
	NormalizedString *normalizedstring.NormalizedString
	// Optional Tokens associated to this Split.
	Tokens *[]models.Token
}

Split is a wrapper for a subpart of a NormalizedString.

This Split contains the underlying NormalizedString as well as its offsets in the original string. These offsets are in the "original" referential. It also contains any Token associated to the current split.

func SplitsFromNormalizedStrings

func SplitsFromNormalizedStrings(nss []*normalizedstring.NormalizedString) []Split

SplitsFromNormalizedStrings transforms a slice of NormalizedStrings into a corresponding slice of Splits, with nil tokens.
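As a sketch, the helper behaves like building the Splits by hand with nil Tokens (nss is assumed to be a []*normalizedstring.NormalizedString):

splits := pretokenizedstring.SplitsFromNormalizedStrings(nss)

// Roughly equivalent, per the description above:
manual := make([]pretokenizedstring.Split, len(nss))
for i, ns := range nss {
	manual[i] = pretokenizedstring.Split{NormalizedString: ns, Tokens: nil}
}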

type SplitFunc

type SplitFunc func(
	index int,
	ns *normalizedstring.NormalizedString,
) ([]Split, error)

SplitFunc (used by PreTokenizedString.Split) takes a normalizedstring.NormalizedString and is in charge of returning the Splits produced from it.

SplitFunc is free to modify these NormalizedStrings as relevant, as long as it respects the constraint stated below.

There is only one constraint that MUST be respected: the produced normalizedstring.NormalizedStrings, if combined back together, must have the same "original" string as the one given to SplitFunc. This concretely means that, for the offset tracking to work as expected, SplitFunc must produce "splits" of the original string.
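A minimal SplitFunc that trivially satisfies the constraint is one that returns its input unchanged as a single Split; anything that drops, reorders, or fabricates content would break offset tracking. The sketch below only uses names documented on this page.

var identity pretokenizedstring.SplitFunc = func(index int, ns *normalizedstring.NormalizedString) ([]pretokenizedstring.Split, error) {
	// The single produced Split is exactly the input, so combining the
	// produced NormalizedStrings back together trivially yields the same
	// "original" string that was given to the SplitFunc.
	return pretokenizedstring.SplitsFromNormalizedStrings(
		[]*normalizedstring.NormalizedString{ns},
	), nil
}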
