Documentation
Overview ¶
Package bpe provides byte-pair encoding (BPE) for text.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func PreBPE ¶
PreBPE provides a mapping for runes. It is intended for handling large text corpora and is derived from OpenAI's GPT-2. The original code may be found here: https://github.com/openai/gpt-2/blob/master/src/encoder.py
The original comment from GPT-2 explains:
The reversible bpe codes work on unicode strings. This means you need a large # of unicode characters in your vocab if you want to avoid UNKs. When you're at something like a 10B token dataset you end up needing around 5K for decent coverage. This is a significant percentage of your normal, say, 32K bpe vocab. To avoid that, we want lookup tables between utf-8 bytes and unicode strings. And avoids mapping to whitespace/control characters the bpe code barfs on.
It is unclear what utility this provides now, given that the design of the BPE package has gone in a slightly different direction: this package deals with runes instead of messing around with strings and bytes. We sacrifice memory for readability and understandability.
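The lookup table described in the GPT-2 comment can be sketched in Go as follows. This is a port of the table built in GPT-2's encoder.py, not necessarily what PreBPE does internally; the name bytesToUnicode and its return type are assumptions made for illustration.

```go
package main

import "fmt"

// bytesToUnicode is a hypothetical helper (not part of this package's API).
// It builds the GPT-2 style table that maps each of the 256 byte values to
// a printable rune, so that no byte maps to whitespace or a control character.
func bytesToUnicode() map[byte]rune {
	m := make(map[byte]rune, 256)

	// Bytes that are already printable map to themselves.
	ranges := [][2]int{{'!', '~'}, {0xA1, 0xAC}, {0xAE, 0xFF}}
	for _, r := range ranges {
		for b := r[0]; b <= r[1]; b++ {
			m[byte(b)] = rune(b)
		}
	}

	// Every remaining byte (whitespace, control characters, 0xAD) is
	// shifted to the code points starting at 256, in ascending byte order.
	n := 0
	for b := 0; b < 256; b++ {
		if _, ok := m[byte(b)]; !ok {
			m[byte(b)] = rune(256 + n)
			n++
		}
	}
	return m
}

func main() {
	table := bytesToUnicode()
	// A space is not printable, so it is remapped above U+00FF.
	fmt.Printf("space -> %U, 'A' -> %U\n", table[' '], table['A'])
}
```

Because the mapping is one-to-one, it can be inverted to recover the original bytes, which is what makes the encoding reversible.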
func SimpleTokenizer ¶
SimpleTokenizer is a simple tokenizer of text
Types ¶
type FuncOpt ¶
type FuncOpt func(*funcMod)
FuncOpt is an option to modify the behaviours of a function
type Pair ¶
type Pair struct {
// contains filtered or unexported fields
}
Pair is an immutable tuple of two runes. Use P() to create a new Pair.
func PairsRunes ¶
PairsRunes returns the Pairs of runes found in a word (as []rune)
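As a rough illustration of what "the Pairs of runes found in a word" means, the sketch below collects every adjacent pair of runes in order. The pair type and the adjacentPairs function are hypothetical stand-ins; the signature of PairsRunes is not shown here, and it may deduplicate pairs (GPT-2's get_pairs returns a set) where this sketch does not.

```go
package main

import "fmt"

// pair is a hypothetical stand-in for the package's Pair type.
type pair struct{ a, b rune }

// adjacentPairs returns every adjacent pair of runes in word, in order.
// Duplicates are kept; this is an assumption, not documented behaviour.
func adjacentPairs(word []rune) []pair {
	if len(word) < 2 {
		return nil // a word needs at least two runes to form a pair
	}
	out := make([]pair, 0, len(word)-1)
	for i := 0; i+1 < len(word); i++ {
		out = append(out, pair{word[i], word[i+1]})
	}
	return out
}

func main() {
	for _, p := range adjacentPairs([]rune("lower")) {
		fmt.Printf("(%c,%c) ", p.a, p.b)
	}
	fmt.Println()
}
```

For the word "lower" this yields the four pairs (l,o), (o,w), (w,e) and (e,r), which are the candidates a BPE merge step would count.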
func PairsRunesWithReuse ¶
PairsRunesWithReuse is the PairsRunes function, but with a buffer passed in explicitly for reuse.
func PairsWithReuse ¶
PairsWithReuse is the Pairs function, but with a buffer passed in explicitly for reuse.
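The point of the WithReuse variants is presumably to avoid allocating a fresh slice per word. A minimal sketch of that pattern, assuming the buffer is a slice of pairs that the function truncates and appends to (pair and adjacentPairsInto are hypothetical names, and the real signatures may differ):

```go
package main

import "fmt"

// pair is a hypothetical stand-in for the package's Pair type.
type pair struct{ a, b rune }

// adjacentPairsInto writes the adjacent pairs of word into buf's backing
// array and returns the resulting slice. It only allocates when buf's
// capacity is too small.
func adjacentPairsInto(word []rune, buf []pair) []pair {
	buf = buf[:0] // keep the capacity, drop the old contents
	for i := 0; i+1 < len(word); i++ {
		buf = append(buf, pair{word[i], word[i+1]})
	}
	return buf
}

func main() {
	buf := make([]pair, 0, 16) // allocated once, reused for every word
	for _, w := range []string{"low", "lower", "newest"} {
		buf = adjacentPairsInto([]rune(w), buf)
		fmt.Println(w, len(buf))
	}
}
```

The caller must finish with the returned slice before the next call, since each call overwrites the previous contents.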
func (Pair) MarshalJSON ¶
MarshalJSON returns the JSON-encoded version of Pair
func (*Pair) UnmarshalJSON ¶
UnmarshalJSON unmarshals a JSON-encoded Pair into the receiver.