bpe

package module
v0.0.0-...-67576d3 Latest
Warning

This package is not in the latest version of its module.

Published: Feb 1, 2021 License: MIT Imports: 5 Imported by: 1

README

bpe_prep

bpe

Documentation

Overview

Package bpe provides byte-pair encoding for text.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func PreBPE

func PreBPE() map[rune]rune

PreBPE returns a rune-to-rune mapping. It is intended for handling large text corpora and is derived from OpenAI's GPT-2. The original code may be found here: https://github.com/openai/gpt-2/blob/master/src/encoder.py

The original comment from GPT-2 clarifies:

The reversible bpe codes work on unicode strings.
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
This is a significant percentage of your normal, say, 32K bpe vocab.
To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
And avoids mapping to whitespace/control characters the bpe code barfs on.

It is unclear what utility this provides now, given that the design of the BPE package has gone in a slightly different direction: this package deals with runes instead of messing around with strings and bytes. We sacrifice memory for readability and understandability.
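For readers unfamiliar with the GPT-2 table that the quoted comment describes, the following is a minimal sketch of it in Go. The helper name bytesToUnicode is hypothetical and is not part of this package; the sketch follows the published GPT-2 bytes_to_unicode logic and does not claim to reproduce what PreBPE actually returns.

```go
package main

import "fmt"

// bytesToUnicode is a hypothetical helper (not part of this package). It
// rebuilds the GPT-2 byte-to-unicode lookup table as a map[rune]rune.
func bytesToUnicode() map[rune]rune {
	m := make(map[rune]rune, 256)

	// Printable byte values map to themselves.
	ranges := [][2]rune{{'!', '~'}, {'¡', '¬'}, {'®', 'ÿ'}}
	for _, r := range ranges {
		for b := r[0]; b <= r[1]; b++ {
			m[b] = b
		}
	}

	// Every remaining byte value (whitespace, control characters, ...) is
	// shifted to the next free code point starting at 256.
	n := rune(0)
	for b := rune(0); b < 256; b++ {
		if _, ok := m[b]; !ok {
			m[b] = 256 + n
			n++
		}
	}
	return m
}

func main() {
	m := bytesToUnicode()
	fmt.Printf("%d entries, space maps to %q\n", len(m), m[' '])
}
```

All 256 byte values receive a distinct, printable rune, so the table is reversible.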

func SimpleTokenizer

func SimpleTokenizer(a string) []string

SimpleTokenizer is a simple tokenizer of text

Types

type Encoder

type Encoder struct {
	Corpus       *corpus.Corpus
	Pairs        []Pair
	Replacements map[Pair]rune
	MaxRune      rune
}

Encoder represents a state that may be used to encode a word
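As an illustration of how the Pairs and Replacements fields could be used together, here is a minimal sketch that applies learned merges to a word in order. The pair type and the encode function are hypothetical stand-ins written for this example; the package's own encoding logic is not shown in this documentation and may differ.

```go
package main

import "fmt"

// pair is a hypothetical stand-in for the package's Pair type.
type pair struct{ fst, snd rune }

// encode applies each learned merge, in order, to the runes of word.
// order plays the role of Encoder.Pairs and repl of Encoder.Replacements
// (an assumption about how those fields relate).
func encode(word string, order []pair, repl map[pair]rune) []rune {
	w := []rune(word)
	for _, p := range order {
		r, ok := repl[p]
		if !ok {
			continue
		}
		// Compact in place: the write index never overtakes the read index.
		out := w[:0]
		for i := 0; i < len(w); i++ {
			if i+1 < len(w) && w[i] == p.fst && w[i+1] == p.snd {
				out = append(out, r)
				i++ // skip the second rune of the merged pair
				continue
			}
			out = append(out, w[i])
		}
		w = out
	}
	return w
}

func main() {
	order := []pair{{'e', 's'}, {'{', 't'}}
	repl := map[pair]rune{order[0]: '{', order[1]: '|'}
	fmt.Println(string(encode("newest", order, repl)))
}
```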

func Learn

func Learn(c *corpus.Corpus, symbols, minFreq int, markEOW bool) (retVal Encoder, err error)

Learn learns an Encoder from the data in the given corpus.
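The general shape of byte-pair-encoding learning can be sketched as follows. This is a standalone illustration, not the package's implementation: the learn, merge, and less functions and the pair type are hypothetical, the corpus is modelled as plain slices rather than a *corpus.Corpus, the merges parameter is assumed to correspond to symbols, and ties are broken lexicographically only to keep the example deterministic.

```go
package main

import "fmt"

// pair is a hypothetical stand-in for the package's Pair type.
type pair struct{ fst, snd rune }

func less(a, b pair) bool {
	return a.fst < b.fst || (a.fst == b.fst && a.snd < b.snd)
}

// merge replaces every occurrence of p in w with the single rune r.
func merge(w []rune, p pair, r rune) []rune {
	out := make([]rune, 0, len(w))
	for i := 0; i < len(w); i++ {
		if i+1 < len(w) && w[i] == p.fst && w[i+1] == p.snd {
			out = append(out, r)
			i++
			continue
		}
		out = append(out, w[i])
	}
	return out
}

// learn performs up to merges merge steps. In each step the most frequent
// adjacent pair is replaced by a fresh rune above maxRune. It stops early
// when the best pair occurs fewer than minFreq times. The elements of
// words are replaced as merges are applied.
func learn(words [][]rune, freqs []int, merges, minFreq int, maxRune rune) ([]pair, map[pair]rune) {
	order := make([]pair, 0, merges)
	repl := make(map[pair]rune, merges)
	for len(order) < merges {
		counts := map[pair]int{}
		for id, w := range words {
			for i := 0; i+1 < len(w); i++ {
				counts[pair{w[i], w[i+1]}] += freqs[id]
			}
		}
		best, bestN := pair{}, 0
		for p, n := range counts {
			if n > bestN || (n == bestN && less(p, best)) {
				best, bestN = p, n
			}
		}
		if bestN == 0 || bestN < minFreq {
			break
		}
		maxRune++
		order = append(order, best)
		repl[best] = maxRune
		for id, w := range words {
			words[id] = merge(w, best, maxRune)
		}
	}
	return order, repl
}

func main() {
	words := [][]rune{[]rune("low"), []rune("lower"), []rune("newest"), []rune("widest")}
	freqs := []int{5, 2, 6, 3}
	order, repl := learn(words, freqs, 2, 2, 'z')
	for _, p := range order {
		fmt.Printf("(%c, %c) -> %c\n", p.fst, p.snd, repl[p])
	}
}
```

Recounting all pairs on every step keeps the sketch short; an index from pairs to word IDs, such as the one in Statistics, would avoid that repeated work.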

type FuncOpt

type FuncOpt func(*funcMod)

FuncOpt is an option to modify the behaviours of a function

func MarkEOW

func MarkEOW(t bool) FuncOpt

MarkEOW is a modifier to inform the Learn function whether the end of the word should be marked.

func WithReuse

func WithReuse(buf []Pair) FuncOpt

WithReuse uses the given (usually pre-allocated) buffer of Pairs
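FuncOpt follows the functional options pattern. The sketch below shows that pattern in isolation; the settings struct stands in for the unexported funcMod, whose real fields are not documented here, so every field and function name in the example is an assumption.

```go
package main

import "fmt"

// settings stands in for the package's unexported funcMod. Its fields are
// assumptions made for this sketch.
type settings struct {
	markEOW bool
	buf     []string
}

// option has the same shape as FuncOpt: a function that mutates settings.
type option func(*settings)

// markEOW returns an option that sets the end-of-word flag.
func markEOW(t bool) option {
	return func(s *settings) { s.markEOW = t }
}

// withBuf returns an option that supplies a pre-allocated buffer.
func withBuf(buf []string) option {
	return func(s *settings) { s.buf = buf }
}

// apply starts from zero-value defaults and applies each option in order.
func apply(opts ...option) settings {
	var s settings
	for _, o := range opts {
		o(&s)
	}
	return s
}

func main() {
	s := apply(markEOW(true), withBuf(make([]string, 0, 8)))
	fmt.Println(s.markEOW, cap(s.buf))
}
```

The pattern lets a variadic function such as Pairs or PairStats accept new settings later without changing its signature.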

type Pair

type Pair struct {
	// contains filtered or unexported fields
}

Pair is a pair of runes - it is an immutable tuple. Use P() to create a new Pair

func P

func P(fst, snd rune) Pair

P constructs a new Pair

func Pairs

func Pairs(word string, opts ...FuncOpt) []Pair

Pairs returns the Pairs of runes found in a word (as string)

func PairsRunes

func PairsRunes(word []rune, opts ...FuncOpt) []Pair

PairsRunes returns the Pairs of runes found in a word (as []rune)
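What "pairs of runes found in a word" can mean is sketched below. The pair type and adjacentPairs function are hypothetical; the sketch assumes the pairs are the adjacent runes in order with duplicates kept, which may not match the package's behaviour (for example, GPT-2 collects them into a set).

```go
package main

import "fmt"

// pair is a hypothetical stand-in for the package's Pair type.
type pair struct{ fst, snd rune }

// adjacentPairs returns every pair of neighbouring runes in word, in order.
// Working on []rune rather than bytes keeps multi-byte characters intact.
func adjacentPairs(word []rune) []pair {
	if len(word) < 2 {
		return nil
	}
	out := make([]pair, 0, len(word)-1)
	for i := 0; i+1 < len(word); i++ {
		out = append(out, pair{word[i], word[i+1]})
	}
	return out
}

func main() {
	for _, p := range adjacentPairs([]rune("lower")) {
		fmt.Printf("(%c, %c) ", p.fst, p.snd)
	}
	fmt.Println()
}
```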

func PairsRunesWithReuse

func PairsRunesWithReuse(word []rune, buf []Pair) []Pair

PairsRunesWithReuse is the PairsRunes function, but with a buffer passed in specifically.

func PairsWithReuse

func PairsWithReuse(word string, buf []Pair) []Pair

PairsWithReuse is the Pairs function, but with a buffer passed in specifically.
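The point of passing a buffer is to avoid allocating a new slice for every word. A minimal sketch of that idea, using a hypothetical pair type and pairsInto function rather than the package's own code:

```go
package main

import "fmt"

// pair is a hypothetical stand-in for the package's Pair type.
type pair struct{ fst, snd rune }

// pairsInto writes the adjacent pairs of word into buf, reusing its backing
// array whenever the capacity is sufficient. Previous contents of buf are
// overwritten.
func pairsInto(word []rune, buf []pair) []pair {
	buf = buf[:0]
	for i := 0; i+1 < len(word); i++ {
		buf = append(buf, pair{word[i], word[i+1]})
	}
	return buf
}

func main() {
	buf := make([]pair, 0, 16)
	for _, w := range []string{"low", "lower", "newest"} {
		buf = pairsInto([]rune(w), buf)
		fmt.Println(w, len(buf), cap(buf))
	}
}
```

Because the result aliases the buffer, a caller that needs to keep the pairs of one word must copy them before processing the next.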

func (Pair) Eq

func (p Pair) Eq(q Pair) bool

Eq is the comparison function for two pairs, p and q

func (Pair) Format

func (p Pair) Format(s fmt.State, c rune)

Format implements fmt.Formatter

func (Pair) Fst

func (p Pair) Fst() rune

Fst returns the first projection

func (Pair) MarshalJSON

func (p Pair) MarshalJSON() ([]byte, error)

MarshalJSON returns the JSON-encoded version of Pair

func (Pair) Snd

func (p Pair) Snd() rune

Snd returns the second projection

func (*Pair) UnmarshalJSON

func (p *Pair) UnmarshalJSON(bs []byte) error

UnmarshalJSON unmarshals a JSON encoded Pair into the data structure itself

type Statistics

type Statistics struct {
	Stats   map[Pair]int
	Indices map[Pair]map[int]int
	Corpus  *corpus.Corpus
	MaxRune rune
}

Statistics is the statistics of a corpus, used to figure out which pairs to replace.

func PairStats

func PairStats(c *corpus.Corpus, opts ...FuncOpt) Statistics

PairStats returns the occurrence frequencies of pairs of runes. It also constructs an index from pairs to word IDs, along with their frequencies.
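The two maps can be pictured with a small standalone sketch. The stats type and pairStats function are hypothetical, the corpus is modelled as parallel slices of words and frequencies, and the meaning assigned to the inner index value (the number of times the pair occurs inside that word) is an assumption borrowed from common BPE implementations.

```go
package main

import "fmt"

// pair is a hypothetical stand-in for the package's Pair type.
type pair struct{ fst, snd rune }

// stats mirrors the Stats and Indices fields of Statistics.
type stats struct {
	counts  map[pair]int         // pair -> frequency-weighted count
	indices map[pair]map[int]int // pair -> word ID -> occurrences in that word
}

// pairStats scans every word once. words[i] is the word with ID i and
// freqs[i] is how often that word occurs in the corpus.
func pairStats(words []string, freqs []int) stats {
	s := stats{counts: map[pair]int{}, indices: map[pair]map[int]int{}}
	for id, w := range words {
		rs := []rune(w)
		for i := 0; i+1 < len(rs); i++ {
			p := pair{rs[i], rs[i+1]}
			s.counts[p] += freqs[id]
			if s.indices[p] == nil {
				s.indices[p] = map[int]int{}
			}
			s.indices[p][id]++
		}
	}
	return s
}

func main() {
	s := pairStats([]string{"low", "lower"}, []int{5, 2})
	lo := pair{'l', 'o'}
	fmt.Println(s.counts[lo], s.indices[lo])
}
```

The index makes it possible to revisit only the words that contain a merged pair instead of rescanning the whole corpus.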

type Tokenizer

type Tokenizer func(a string) []string

Tokenizer is a function that tokenizes a string. This library provides a simple tokenizer.
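Since Tokenizer is a plain function type, any function with a matching signature can be used wherever one is expected. A minimal sketch, assuming whitespace splitting; the names tokenizer, whitespaceTokenizer, and countTokens are hypothetical, and SimpleTokenizer may split text differently.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenizer has the same shape as the package's Tokenizer type.
type tokenizer func(a string) []string

// whitespaceTokenizer splits on runs of Unicode whitespace.
func whitespaceTokenizer(a string) []string {
	return strings.Fields(a)
}

// countTokens shows a consumer that accepts any tokenizer and builds a
// token -> frequency table, the kind of input a BPE learner needs.
func countTokens(t tokenizer, text string) map[string]int {
	counts := map[string]int{}
	for _, tok := range t(text) {
		counts[tok]++
	}
	return counts
}

func main() {
	var t tokenizer = whitespaceTokenizer
	fmt.Println(countTokens(t, "low lower low"))
}
```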
