tokenizers

package
v0.0.0-...-67e2655 Latest
Warning

This package is not in the latest version of its module.

Published: Aug 29, 2019 License: MIT Imports: 4 Imported by: 0

README

Tokenizers

Tokenizers can be passed to the ngrams.NewIndex function to change the data tokenization mechanism. More details can be found in the ngrams README.

Default Word Tokenizer (default)

// Word tokenizer that emits line breaks as distinct tokens.
tk := tokenizers.NewDefaultWordTokenizer(false)

// Word tokenizer that strips line breaks instead of emitting them.
tk = tokenizers.NewDefaultWordTokenizer(true)

New tokenizers can be created by satisfying the tokenizers.Tokenizer interface.
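
As a rough illustration of what satisfying the interface involves, the sketch below defines a whitespace-only tokenizer. `WhitespaceTokenizer` is a hypothetical name, not part of this package, and the `Tokenizer` interface is redeclared locally (copied from the documentation below) only so the example compiles on its own.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Tokenizer mirrors the documented tokenizers.Tokenizer interface.
// It is redeclared here only so this sketch is self-contained.
type Tokenizer interface {
	Tokenize(string) []string
	Scanner(data []byte, atEOF bool) (advance int, token []byte, err error)
	Format([]string) string
}

// WhitespaceTokenizer is a hypothetical tokenizer that splits on whitespace only.
type WhitespaceTokenizer struct{}

// Tokenize splits a string on runs of whitespace.
func (WhitespaceTokenizer) Tokenize(s string) []string {
	return strings.Fields(s)
}

// Scanner delegates to bufio.ScanWords, which already has the split-function shape.
func (WhitespaceTokenizer) Scanner(data []byte, atEOF bool) (int, []byte, error) {
	return bufio.ScanWords(data, atEOF)
}

// Format joins tokens with single spaces.
func (WhitespaceTokenizer) Format(tokens []string) string {
	return strings.Join(tokens, " ")
}

// Compile-time check that WhitespaceTokenizer satisfies the interface.
var _ Tokenizer = WhitespaceTokenizer{}

func main() {
	tk := WhitespaceTokenizer{}
	tokens := tk.Tokenize("the quick  brown fox")
	fmt.Println(len(tokens), tk.Format(tokens)) // prints "4 the quick brown fox"
}
```

In real use the type would implement the package's own interface, and the value would be handed to ngrams.NewIndex in whatever way the ngrams README describes.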

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type DefaultWord

type DefaultWord struct {
	// contains filtered or unexported fields
}

DefaultWord is the default tokenizer, designed to be used with bodies of text in English and other Latin-based languages.

func NewDefaultWordTokenizer

func NewDefaultWordTokenizer(stripLinebreaks bool) *DefaultWord

NewDefaultWordTokenizer returns a new default word tokenizer.

func (*DefaultWord) Format

func (tk *DefaultWord) Format(tokens []string) string

Format joins a slice of tokens according to the tokenizer's rules. When n-grams are written back out, their tokens do not always rejoin in the same way they were consumed, especially when tokenizers differ widely. A formatter ensures the resulting text is appropriate for the tokenizer that produced the tokens.
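
To make the idea of tokenizer-specific joining concrete, here is a minimal sketch of one possible rule set. `joinTokens` is a hypothetical helper, and the rule it applies (no space before a single closing punctuation mark) is an assumption for illustration, not the behavior of DefaultWord.Format.

```go
package main

import (
	"fmt"
	"strings"
)

// joinTokens is a hypothetical formatter, not this package's implementation.
// It separates tokens with a space, except before a token that is exactly
// one of the punctuation marks . , ; : ! ?
func joinTokens(tokens []string) string {
	var b strings.Builder
	for i, tok := range tokens {
		attach := len(tok) == 1 && strings.Contains(".,;:!?", tok)
		if i > 0 && !attach {
			b.WriteByte(' ')
		}
		b.WriteString(tok)
	}
	return b.String()
}

func main() {
	fmt.Println(joinTokens([]string{"Hello", ",", "world", "!"})) // prints "Hello, world!"
}
```

A plain strings.Join with a space would instead produce "Hello , world !", which is the kind of mismatch a formatter is meant to avoid.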

func (*DefaultWord) Scanner

func (tk *DefaultWord) Scanner(data []byte, atEOF bool) (advance int, token []byte, err error)

Scanner is the core tokenizer, which splits a slice of bytes into tokens. It can be used with bufio.Scanner to tokenize a stream of data.

func (*DefaultWord) Tokenize

func (tk *DefaultWord) Tokenize(str string) []string

Tokenize splits a string into tokens. Each instance of standard punctuation is also considered a token, in order to preserve the expected grammar.
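
The sketch below illustrates the general idea of treating punctuation as separate tokens. `splitWordsAndPunct` is a hypothetical function written for this example; which characters count as "standard punctuation" here (anything `unicode.IsPunct` reports) is an assumption and may differ from what DefaultWord actually does.

```go
package main

import (
	"fmt"
	"unicode"
)

// splitWordsAndPunct is a hypothetical illustration, not this package's code.
// Whitespace separates tokens and is dropped, each punctuation character
// becomes its own token, and every other run of characters forms one token.
func splitWordsAndPunct(s string) []string {
	var tokens []string
	var word []rune
	// flush emits the pending word, if any, and resets the buffer.
	flush := func() {
		if len(word) > 0 {
			tokens = append(tokens, string(word))
			word = word[:0]
		}
	}
	for _, r := range s {
		switch {
		case unicode.IsSpace(r):
			flush()
		case unicode.IsPunct(r):
			flush()
			tokens = append(tokens, string(r))
		default:
			word = append(word, r)
		}
	}
	flush()
	return tokens
}

func main() {
	fmt.Println(splitWordsAndPunct("Hello, world!")) // prints "[Hello , world !]"
}
```

Note that this naive rule also splits apostrophes, so "don't" becomes three tokens; a production tokenizer would likely handle such cases more carefully.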

type Tokenizer

type Tokenizer interface {

	// Tokenize tokenizes a string.
	Tokenize(string) []string

	// Scanner tokenizes a byte slice and can be used by bufio.Scanner.
	Scanner(data []byte, atEOF bool) (advance int, token []byte, err error)

	// Format joins a slice of tokens by the tokenizer rules.
	Format([]string) string
}

Tokenizer is a string tokenizer which splits a string into discrete tokens.
