bpe

Version: v0.0.0-...-67576d3 | Published: Feb 1, 2021 | License: MIT | Imports: 5 | Imported by: 1

Repository

github.com/go-nlp/bpe

README ¶

bpe_prep

bpe

Documentation ¶

Overview ¶

Package bpe provides byte-pair encoding (BPE) for text.

Index ¶

func PreBPE() map[rune]rune
func SimpleTokenizer(a string) []string
type Encoder
- func Learn(c *corpus.Corpus, symbols, minFreq int, markEOW bool) (retVal Encoder, err error)
type FuncOpt
- func MarkEOW(t bool) FuncOpt
- func WithReuse(buf []Pair) FuncOpt
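type Pair
- func P(fst, snd rune) Pair
- func Pairs(word string, opts ...FuncOpt) []Pair
- func PairsRunes(word []rune, opts ...FuncOpt) []Pair
- func PairsRunesWithReuse(word []rune, buf []Pair) []Pair
- func PairsWithReuse(word string, buf []Pair) []Pair
- func (p Pair) Eq(q Pair) bool
- func (p Pair) Format(s fmt.State, c rune)
- func (p Pair) Fst() rune
- func (p Pair) MarshalJSON() ([]byte, error)
- func (p Pair) Snd() rune
- func (p *Pair) UnmarshalJSON(bs []byte) error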
type Statistics
- func PairStats(c *corpus.Corpus, opts ...FuncOpt) Statistics
type Tokenizer

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func PreBPE ¶

func PreBPE() map[rune]rune

PreBPE returns a rune-to-rune mapping. It is intended for handling large text corpora and is derived from OpenAI's GPT-2. The original code may be found here: https://github.com/openai/gpt-2/blob/master/src/encoder.py

The original comment from GPT-2 explains:

The reversible bpe codes work on unicode strings.
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
This is a significant percentage of your normal, say, 32K bpe vocab.
To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
And avoids mapping to whitespace/control characters the bpe code barfs on.

It is unclear what utility this function provides now, given that the design of the BPE package has since gone in a slightly different direction: this package deals with runes instead of messing around with strings and bytes. We sacrifice memory for readability and understandability.
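To make the quoted comment concrete, here is a minimal sketch of the lookup table that GPT-2's bytes_to_unicode builds, ported to Go. The helper name bytesToUnicode is hypothetical, and this is not necessarily what PreBPE returns; it only illustrates the idea of mapping every byte value to a printable rune so that whitespace and control characters never reach the BPE code.

```go
package main

import "fmt"

// bytesToUnicode is a hypothetical helper (not part of this package) that
// follows GPT-2's bytes_to_unicode: every byte value 0..255 is mapped to a
// printable rune.
func bytesToUnicode() map[rune]rune {
	m := make(map[rune]rune, 256)
	// Byte values that are already printable map to themselves.
	ranges := [][2]rune{{'!', '~'}, {'¡', '¬'}, {'®', 'ÿ'}}
	for _, r := range ranges {
		for b := r[0]; b <= r[1]; b++ {
			m[b] = b
		}
	}
	// The remaining values (whitespace, control characters, and so on) are
	// shifted to code points starting at 256, in ascending byte order.
	n := rune(0)
	for b := rune(0); b < 256; b++ {
		if _, ok := m[b]; !ok {
			m[b] = 256 + n
			n++
		}
	}
	return m
}

func main() {
	m := bytesToUnicode()
	fmt.Println(len(m))                   // 256
	fmt.Printf("%c %c\n", m['A'], m[' ']) // A Ġ
}
```

The space character ends up as U+0120 (Ġ), which is why that glyph shows up in GPT-2 vocabularies.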

func SimpleTokenizer ¶

func SimpleTokenizer(a string) []string

SimpleTokenizer is a simple tokenizer of text

Types ¶

type Encoder ¶

type Encoder struct {
	Corpus       *corpus.Corpus
	Pairs        []Pair
	Replacements map[Pair]rune
	MaxRune      rune
}

Encoder represents a state that may be used to encode a word

func Learn ¶

func Learn(c *corpus.Corpus, symbols, minFreq int, markEOW bool) (retVal Encoder, err error)

Learn learns an Encoder from the data in the given corpus.
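For readers unfamiliar with BPE training, the following is a simplified sketch of a learning loop, not this package's implementation. It assumes a plain map of word frequencies in place of *corpus.Corpus, ignores end-of-word marking, and allocates each replacement rune just above the largest rune seen (an interpretation suggested by the MaxRune field, not confirmed by the documentation). The names learn and pair are hypothetical.

```go
package main

import "fmt"

type pair struct{ fst, snd rune }

// learn is a hypothetical, simplified BPE training loop. words maps each word
// to its frequency. At most `symbols` merges are learned, and learning stops
// once the most frequent pair occurs fewer than minFreq times.
func learn(words map[string]int, symbols, minFreq int) ([]pair, map[pair]rune) {
	type entry struct {
		runes []rune
		freq  int
	}
	var corpus []entry
	maxRune := rune(0)
	for w, f := range words {
		rs := []rune(w)
		for _, r := range rs {
			if r > maxRune {
				maxRune = r
			}
		}
		corpus = append(corpus, entry{rs, f})
	}

	var merges []pair
	repl := make(map[pair]rune)
	for i := 0; i < symbols; i++ {
		// Count every adjacent pair, weighted by word frequency.
		stats := make(map[pair]int)
		for _, e := range corpus {
			for j := 0; j+1 < len(e.runes); j++ {
				stats[pair{e.runes[j], e.runes[j+1]}] += e.freq
			}
		}
		// Pick the most frequent pair; ties go to the smaller pair so the
		// result does not depend on map iteration order.
		var best pair
		bestN := 0
		for p, n := range stats {
			smaller := p.fst < best.fst || (p.fst == best.fst && p.snd < best.snd)
			if n > bestN || (n == bestN && smaller) {
				best, bestN = p, n
			}
		}
		if bestN == 0 || bestN < minFreq {
			break
		}
		// Allocate a fresh rune and rewrite every word in place.
		maxRune++
		repl[best] = maxRune
		merges = append(merges, best)
		for k, e := range corpus {
			out := e.runes[:0]
			for j := 0; j < len(e.runes); j++ {
				if j+1 < len(e.runes) && e.runes[j] == best.fst && e.runes[j+1] == best.snd {
					out = append(out, maxRune)
					j++ // skip the second half of the merged pair
				} else {
					out = append(out, e.runes[j])
				}
			}
			corpus[k].runes = out
		}
	}
	return merges, repl
}

func main() {
	words := map[string]int{"low": 5, "lower": 2, "newest": 6, "widest": 3}
	merges, repl := learn(words, 2, 2)
	for _, p := range merges {
		fmt.Printf("%c%c -> %c\n", p.fst, p.snd, repl[p])
	}
	// Prints "es -> x" and then "xt -> y": 'x' is the first unused rune
	// above 'w', the largest rune in this toy corpus.
}
```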

type FuncOpt ¶

type FuncOpt func(*funcMod)

FuncOpt is an option to modify the behaviours of a function

func MarkEOW ¶

func MarkEOW(t bool) FuncOpt

MarkEOW is a modifier to inform the Learn function whether the end of the word should be marked.

func WithReuse ¶

func WithReuse(buf []Pair) FuncOpt

WithReuse uses the given (usually pre-allocated) buffer of Pairs
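FuncOpt, MarkEOW and WithReuse follow the functional options pattern. The sketch below shows how such options are typically defined and applied. The fields of funcMod are assumptions (the real type is unexported and undocumented here), and all lowercase names are hypothetical stand-ins rather than this package's code.

```go
package main

import "fmt"

type pair struct{ fst, snd rune }

// funcMod collects the settings that options may change. The two fields are
// assumptions made for this sketch.
type funcMod struct {
	markEOW bool
	buf     []pair
}

// funcOpt mirrors the documented shape of FuncOpt: a function that mutates
// the settings struct.
type funcOpt func(*funcMod)

// markEOW returns an option that toggles end-of-word marking.
func markEOW(t bool) funcOpt {
	return func(m *funcMod) { m.markEOW = t }
}

// withReuse returns an option that supplies a pre-allocated buffer.
func withReuse(buf []pair) funcOpt {
	return func(m *funcMod) { m.buf = buf }
}

// configure starts from zero-value defaults and applies the options in order.
func configure(opts ...funcOpt) funcMod {
	var m funcMod
	for _, o := range opts {
		o(&m)
	}
	return m
}

func main() {
	m := configure(markEOW(true), withReuse(make([]pair, 0, 16)))
	fmt.Println(m.markEOW, cap(m.buf)) // true 16
}
```

The appeal of this pattern is that a variadic opts parameter keeps the common call short while still letting callers opt into extras such as buffer reuse.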

type Pair ¶

type Pair struct {
	// contains filtered or unexported fields
}

Pair is an immutable tuple of two runes. Use P() to create a new Pair.

func P ¶

func P(fst, snd rune) Pair

P constructs a new Pair

func Pairs ¶

func Pairs(word string, opts ...FuncOpt) []Pair

Pairs returns the Pairs of runes found in a word (as string)

func PairsRunes ¶

func PairsRunes(word []rune, opts ...FuncOpt) []Pair

PairsRunes returns the Pairs of runes found in a word (as []rune)

func PairsRunesWithReuse ¶

func PairsRunesWithReuse(word []rune, buf []Pair) []Pair

PairsRunesWithReuse is the PairsRunes function, but with a buffer passed in specifically.

func PairsWithReuse ¶

func PairsWithReuse(word string, buf []Pair) []Pair

PairsWithReuse is the Pairs function, but with a buffer passed in specifically.
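The buffer-passing variants exist to avoid allocating a new slice per word. As a rough illustration, the sketch below assumes that the pairs of a word are its adjacent runes, in order, with duplicates kept; the documentation does not spell this out, and end-of-word marking is not modelled. The names are hypothetical.

```go
package main

import "fmt"

type pair struct{ fst, snd rune }

// pairsRunesWithReuse is a hypothetical re-implementation for illustration.
// It truncates buf to length zero and appends the adjacent rune pairs of
// word, so the backing array is reused whenever its capacity suffices.
func pairsRunesWithReuse(word []rune, buf []pair) []pair {
	buf = buf[:0]
	for i := 0; i+1 < len(word); i++ {
		buf = append(buf, pair{word[i], word[i+1]})
	}
	return buf
}

func main() {
	buf := make([]pair, 0, 8) // allocated once, reused for every word
	for _, w := range []string{"low", "lower"} {
		buf = pairsRunesWithReuse([]rune(w), buf)
		for _, p := range buf {
			fmt.Printf("(%c,%c) ", p.fst, p.snd)
		}
		fmt.Println()
	}
}
```

Because the returned slice aliases the buffer, its contents are only valid until the next call that reuses the same buffer.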

func (Pair) Eq ¶

func (p Pair) Eq(q Pair) bool

Eq is the comparison function for two pairs, p and q

func (Pair) Format ¶

func (p Pair) Format(s fmt.State, c rune)

Format implements fmt.Formatter
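Implementing fmt.Formatter lets a type control how each verb renders it. The sketch below shows the mechanism on a stand-in pair type; the output layout is an assumption chosen for illustration and may differ from what Pair actually prints.

```go
package main

import "fmt"

type pair struct{ fst, snd rune }

// Format implements fmt.Formatter on the stand-in type. The layout is an
// assumption: %d prints code points, every other verb prints characters.
func (p pair) Format(s fmt.State, c rune) {
	switch c {
	case 'd':
		fmt.Fprintf(s, "(%d, %d)", p.fst, p.snd)
	default:
		fmt.Fprintf(s, "(%c, %c)", p.fst, p.snd)
	}
}

// show renders a pair with two different verbs.
func show(p pair) string {
	return fmt.Sprintf("%v | %d", p, p)
}

func main() {
	fmt.Println(show(pair{'l', 'o'})) // (l, o) | (108, 111)
}
```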

func (Pair) Fst ¶

func (p Pair) Fst() rune

Fst returns the first projection

func (Pair) MarshalJSON ¶

func (p Pair) MarshalJSON() ([]byte, error)

MarshalJSON returns the JSON-encoded version of Pair

func (Pair) Snd ¶

func (p Pair) Snd() rune

Snd returns the second projection

func (*Pair) UnmarshalJSON ¶

func (p *Pair) UnmarshalJSON(bs []byte) error

UnmarshalJSON unmarshals a JSON encoded Pair into the data structure itself
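Because Pair has only unexported fields, encoding/json cannot serialize it without custom methods, which is presumably why both are provided. The sketch below shows one way to write such a pair of methods. The wire format used here, a two-element array of code points, is an assumption; the documentation does not state the actual encoding.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type pair struct{ fst, snd rune }

// MarshalJSON encodes the pair as a two-element JSON array of code points.
// This format is an assumption made for the sketch.
func (p pair) MarshalJSON() ([]byte, error) {
	return json.Marshal([2]rune{p.fst, p.snd})
}

// UnmarshalJSON decodes the same two-element array back into the pair.
func (p *pair) UnmarshalJSON(bs []byte) error {
	var a [2]rune
	if err := json.Unmarshal(bs, &a); err != nil {
		return err
	}
	p.fst, p.snd = a[0], a[1]
	return nil
}

func main() {
	bs, err := json.Marshal(pair{'l', 'o'})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(bs)) // [108,111]

	var q pair
	if err := json.Unmarshal(bs, &q); err != nil {
		panic(err)
	}
	fmt.Println(q == pair{'l', 'o'}) // true
}
```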

type Statistics ¶

type Statistics struct {
	Stats   map[Pair]int
	Indices map[Pair]map[int]int
	Corpus  *corpus.Corpus
	MaxRune rune
}

Statistics is the statistics of a corpus, used to figure out which pairs to replace.

func PairStats ¶

func PairStats(c *corpus.Corpus, opts ...FuncOpt) Statistics

PairStats returns the occurrence frequencies of pairs of runes. It also constructs an index from each pair to the IDs of the words it occurs in, along with its frequency.
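The two maps in Statistics can be pictured with a small sketch. It assumes that Stats holds corpus-wide pair counts weighted by word frequency, and that Indices maps a pair to word IDs and then to the number of times the pair occurs inside that word. This reading follows the field types but is not confirmed by the documentation. Parallel slices stand in for *corpus.Corpus, and all names are hypothetical.

```go
package main

import "fmt"

type pair struct{ fst, snd rune }

// stats mirrors the shape of two fields of the Statistics struct.
type stats struct {
	counts  map[pair]int         // pair -> total frequency in the corpus
	indices map[pair]map[int]int // pair -> word ID -> occurrences in that word
}

// pairStats is a hypothetical stand-in: words[i] is the word with ID i and
// freqs[i] is how often that word occurs in the corpus.
func pairStats(words []string, freqs []int) stats {
	s := stats{counts: map[pair]int{}, indices: map[pair]map[int]int{}}
	for id, w := range words {
		rs := []rune(w)
		for i := 0; i+1 < len(rs); i++ {
			p := pair{rs[i], rs[i+1]}
			s.counts[p] += freqs[id]
			if s.indices[p] == nil {
				s.indices[p] = map[int]int{}
			}
			s.indices[p][id]++
		}
	}
	return s
}

func main() {
	s := pairStats([]string{"low", "lower", "banana"}, []int{5, 2, 1})
	lo, an := pair{'l', 'o'}, pair{'a', 'n'}
	fmt.Println(s.counts[lo], len(s.indices[lo])) // 7 2
	fmt.Println(s.counts[an], s.indices[an][2])   // 2 2
}
```

An index of this kind lets a learner revisit only the words that contain the pair being merged, instead of rescanning the whole corpus after every merge.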

type Tokenizer ¶

type Tokenizer func(a string) []string

Tokenizer is a function that tokenizes a string. This library provides a simple tokenizer.
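Since Tokenizer is just a function type, any function with the signature func(string) []string can be used where one is expected. The sketch below uses a whitespace splitter as a stand-in; how SimpleTokenizer actually splits text is not documented here, so whitespaceTokenizer and countTokens are hypothetical names, not this package's behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenizer mirrors the documented signature of Tokenizer.
type tokenizer func(a string) []string

// whitespaceTokenizer is a hypothetical tokenizer that splits on runs of
// whitespace. It is not claimed to match SimpleTokenizer.
func whitespaceTokenizer(a string) []string {
	return strings.Fields(a)
}

// countTokens accepts any tokenizer and builds a token frequency table.
func countTokens(t tokenizer, text string) map[string]int {
	counts := make(map[string]int)
	for _, tok := range t(text) {
		counts[tok]++
	}
	return counts
}

func main() {
	counts := countTokens(whitespaceTokenizer, "low lower  low")
	fmt.Println(counts["low"], counts["lower"]) // 2 1
}
```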
