Documentation
Types
type DefaultWord
type DefaultWord struct {
	// contains filtered or unexported fields
}
DefaultWord is the default tokenizer, designed for bodies of text in English and other Latin-based languages.
func NewDefaultWordTokenizer
func NewDefaultWordTokenizer(stripLinebreaks bool) *DefaultWord
NewDefaultWordTokenizer returns a new default word tokenizer.
func (*DefaultWord) Format
func (tk *DefaultWord) Format(tokens []string) string
Format joins a slice of tokens according to the tokenizer's rules. Tokens that are joined back into text, for example when outputting ngrams, will not always recombine in the same way they were consumed, especially when the tokenizers differ widely. A formatter ensures the resulting text is formatted appropriately.
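To illustrate why a formatter is useful, here is a minimal sketch of one possible joining rule. The function formatTokens and the helper isPunct are hypothetical stand-ins written for this example; they are not the DefaultWord implementation, whose exact rules are not documented here. The sketch assumes that punctuation tokens should attach to the preceding token and that all other tokens are separated by a single space.

```go
package main

import (
	"fmt"
	"strings"
)

// formatTokens is a hypothetical stand-in for a Format method. It joins
// tokens with single spaces, except that a punctuation token is attached
// directly to the token before it.
func formatTokens(tokens []string) string {
	var sb strings.Builder
	for i, tok := range tokens {
		if i > 0 && !isPunct(tok) {
			sb.WriteByte(' ')
		}
		sb.WriteString(tok)
	}
	return sb.String()
}

// isPunct reports whether tok is a single common punctuation mark.
// The set of marks is an assumption made for this sketch.
func isPunct(tok string) bool {
	return len(tok) == 1 && strings.ContainsAny(tok, ".,!?;:")
}

func main() {
	fmt.Println(formatTokens([]string{"Hello", ",", "world", "!"}))
}
```

A plain strings.Join with a space separator would produce "Hello , world !" for the same input, which is the kind of mismatch a tokenizer-aware formatter avoids.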
func (*DefaultWord) Scanner
func (tk *DefaultWord) Scanner(data []byte, atEOF bool) (advance int, token []byte, err error)
Scanner is the core tokenizer, which splits a slice of bytes into tokens. It can be used with bufio.Scanner to tokenize a stream of data.
func (*DefaultWord) Tokenize
func (tk *DefaultWord) Tokenize(str string) []string
Tokenize splits a string into tokens. Each instance of standard punctuation is also treated as a token in order to preserve the expected grammar.
type Tokenizer
type Tokenizer interface {
	// Tokenize splits a string into tokens.
	Tokenize(string) []string
	// Scanner tokenizes a byte slice and can be used by bufio.Scanner.
	Scanner(data []byte, atEOF bool) (advance int, token []byte, err error)
	// Format joins a slice of tokens by the tokenizer rules.
	Format([]string) string
}
Tokenizer is a string tokenizer which splits a string into discrete tokens.
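A custom tokenizer only needs to provide the three methods of the interface. The sketch below redeclares Tokenizer locally so that it compiles without this package's import path, which is not given here. The type spaceTokenizer is a hypothetical implementation that splits on whitespace only; it is meant to show the shape of an implementation, not recommended behavior.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Tokenizer mirrors the interface documented in this section so that the
// sketch is self-contained.
type Tokenizer interface {
	Tokenize(string) []string
	Scanner(data []byte, atEOF bool) (advance int, token []byte, err error)
	Format([]string) string
}

// spaceTokenizer is a hypothetical implementation that splits on whitespace.
type spaceTokenizer struct{}

// Tokenize splits s around runs of whitespace.
func (spaceTokenizer) Tokenize(s string) []string { return strings.Fields(s) }

// Scanner delegates to bufio.ScanWords, which has the same signature.
func (spaceTokenizer) Scanner(data []byte, atEOF bool) (int, []byte, error) {
	return bufio.ScanWords(data, atEOF)
}

// Format joins tokens with single spaces.
func (spaceTokenizer) Format(tokens []string) string { return strings.Join(tokens, " ") }

// Compile-time check that spaceTokenizer satisfies Tokenizer.
var _ Tokenizer = spaceTokenizer{}

func main() {
	var tk Tokenizer = spaceTokenizer{}
	tokens := tk.Tokenize("one  two three")
	fmt.Println(len(tokens), tk.Format(tokens))
}
```

Because Scanner matches the bufio.SplitFunc signature, the method value of any Tokenizer can be handed to the Split method of a bufio.Scanner.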