Package nlp implements POS tagging, word tokenization, and sentence segmentation.



var SentenceTokenizer = segment.NewPunktSentenceTokenizer()

SentenceTokenizer splits text into sentences.

var WordTokenizer = NewIterTokenizer()

WordTokenizer splits text into words.


func TextToTokens

func TextToTokens(text string, nlp *Info) []tag.Token

TextToTokens converts a string to a slice of tokens.


type Block

type Block struct {
	Context string // parent content - e.g., sentence -> paragraph
	Line    int    // line of the block
	Scope   string // section selector
	Parent  string // parent (fully-qualified) selector
	Text    string // text content
}

A Block represents a section of text.

func NewBlock

func NewBlock(ctx, txt, sel string) Block

NewBlock makes a new Block with prepared text and a Selector.

func NewBlockWithParent added in v2.24.1

func NewBlockWithParent(ctx, txt, sel, parent string) Block

NewBlockWithParent makes a new Block with prepared text, a Selector, and a parent.

func NewLinedBlock

func NewLinedBlock(ctx, txt, sel string, line int, _ *Info) Block

NewLinedBlock creates a Block with an already-known location.

type Info added in v2.28.3

type Info struct {
	Lang         string // Language of the file.
	Endpoint     string // API endpoint (optional); TODO: should this be per-file?
	Scope        string // The file's ext scope.
	Tagging      bool   // Does the file need POS tagging?
	Segmentation bool   // Does the file need sentence segmentation?
	Splitting    bool   // Does the file need paragraph splitting?
}

Info handles NLP-related tasks.

Assigning this on a per-file basis allows us to handle multi-language projects -- one file might be `en` while another is `ja`, for example.
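As a rough illustration of the per-file idea, the sketch below builds one Info value for each file. It uses a local copy of the struct so that it runs on its own. The helper infoFor, the language lookup table, the "en" fallback, and the use of the file extension as Scope are all assumptions made for this example, not part of the package.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Info mirrors the fields documented above. It is a local stand-in,
// not the package's own type.
type Info struct {
	Lang         string
	Endpoint     string
	Scope        string
	Tagging      bool
	Segmentation bool
	Splitting    bool
}

// infoFor is a hypothetical helper that builds one Info per file.
// The lookup table and the "en" fallback are assumptions.
func infoFor(path string, langs map[string]string) Info {
	lang, ok := langs[path]
	if !ok {
		lang = "en" // assumed default language
	}
	return Info{
		Lang:         lang,
		Scope:        filepath.Ext(path), // assumed scope format
		Segmentation: true,
	}
}

func main() {
	langs := map[string]string{"docs/intro.ja.md": "ja"}
	for _, p := range []string{"docs/intro.md", "docs/intro.ja.md"} {
		info := infoFor(p, langs)
		fmt.Printf("%s: lang=%s scope=%s\n", p, info.Lang, info.Scope)
	}
}
```

Because each file carries its own Info, language-specific choices can be made without any global state.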

func (*Info) Compute added in v2.28.3

func (n *Info) Compute(block *Block) ([]Block, error)

An NLP provider is a library that implements part-of-speech tagging, sentence segmentation, and word tokenization.

The default implementation is the pure-Go prose library, but the goal is to allow (fairly) seamless integration with non-Go libraries too (such as spaCy).

type IterTokenizer added in v2.28.3

type IterTokenizer struct {
	// contains filtered or unexported fields
}

IterTokenizer splits a sentence into words.

func NewIterTokenizer added in v2.18.0

func NewIterTokenizer(opts ...TokenizerOptFunc) *IterTokenizer

NewIterTokenizer creates a new IterTokenizer.

func (*IterTokenizer) Tokenize added in v2.28.3

func (t *IterTokenizer) Tokenize(text string) []string

Tokenize splits a sentence into a slice of words.

type SegmentResult

type SegmentResult struct {
	Sents []string
}

type TagResult

type TagResult struct {
	Tokens []tag.Token
}

type TaggedWord

type TaggedWord struct {
	Token tag.Token
	Line  int
	Span  []int
}

TaggedWord is a word with an NLP context.

type TokenTester added in v2.18.0

type TokenTester func(string) bool

type Tokenizer added in v2.18.0

type Tokenizer interface {
	Tokenize(string) []string
}
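
Any type with a Tokenize(string) []string method satisfies this interface. The sketch below declares the interface locally and pairs it with a hypothetical whitespace-only implementation, fieldsTokenizer. It is far simpler than IterTokenizer and makes no attempt to reproduce its behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// Tokenizer matches the interface documented above.
type Tokenizer interface {
	Tokenize(string) []string
}

// fieldsTokenizer is a hypothetical implementation that splits on
// whitespace only.
type fieldsTokenizer struct{}

// Tokenize returns the whitespace-separated fields of text.
func (fieldsTokenizer) Tokenize(text string) []string {
	return strings.Fields(text)
}

// countWords accepts any Tokenizer, so implementations can be swapped.
func countWords(t Tokenizer, text string) int {
	return len(t.Tokenize(text))
}

func main() {
	var t Tokenizer = fieldsTokenizer{}
	fmt.Println(t.Tokenize("Vale lints prose."))
	fmt.Println(countWords(t, "Vale lints prose."))
}
```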

type TokenizerOptFunc added in v2.18.0

type TokenizerOptFunc func(*IterTokenizer)

func UsingContractions added in v2.18.0

func UsingContractions(x []string) TokenizerOptFunc

UsingContractions sets the provided contractions.

func UsingEmoticons added in v2.18.0

func UsingEmoticons(x map[string]int) TokenizerOptFunc

UsingEmoticons sets the provided map of emoticons.

func UsingIsUnsplittable added in v2.18.0

func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc

UsingIsUnsplittable sets the function that tests whether a token is unsplittable.
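
A TokenTester is a plain predicate over a candidate token. The sketch below defines a hypothetical predicate, isURLLike, and a toy splitter that consults it before detaching a trailing period. Both the predicate and splitTrailingPeriod are assumptions for illustration; the real tokenizer's splitting rules are not documented here.

```go
package main

import (
	"fmt"
	"strings"
)

// TokenTester has the same shape as the documented type.
type TokenTester func(string) bool

// isURLLike is a hypothetical predicate: it reports true for tokens
// that contain a scheme separator and should be kept whole.
func isURLLike(tok string) bool {
	return strings.Contains(tok, "://")
}

// splitTrailingPeriod is a toy splitter. It detaches one trailing
// period unless the tester marks the token as unsplittable.
func splitTrailingPeriod(tok string, unsplittable TokenTester) []string {
	if tok == "." || unsplittable(tok) || !strings.HasSuffix(tok, ".") {
		return []string{tok}
	}
	return []string{strings.TrimSuffix(tok, "."), "."}
}

func main() {
	fmt.Println(splitTrailingPeriod("end.", isURLLike))
	fmt.Println(splitTrailingPeriod("https://example.com/a.", isURLLike))
}
```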

func UsingPrefixes added in v2.18.0

func UsingPrefixes(x []string) TokenizerOptFunc

UsingPrefixes sets the provided prefixes.

func UsingSanitizer added in v2.18.0

func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc

UsingSanitizer sets the provided sanitizer.
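
The sanitizer is a standard *strings.Replacer, so it can be built with strings.NewReplacer. The sketch below constructs one that maps curly quotes to straight quotes. That particular mapping is an assumption chosen for illustration, not the package's default.

```go
package main

import (
	"fmt"
	"strings"
)

// newQuoteSanitizer returns a replacer that normalizes curly quotes.
// The mapping is an illustrative assumption.
func newQuoteSanitizer() *strings.Replacer {
	return strings.NewReplacer(
		"\u201c", `"`, // left double quotation mark
		"\u201d", `"`, // right double quotation mark
		"\u2018", "'", // left single quotation mark
		"\u2019", "'", // right single quotation mark
	)
}

func main() {
	s := newQuoteSanitizer()
	fmt.Println(s.Replace("\u201cdon\u2019t\u201d"))
}
```

A value built this way has the type that UsingSanitizer accepts.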

func UsingSpecialRE added in v2.18.0

func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc

UsingSpecialRE sets the provided special regex for unsplittable tokens.
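
The argument is an ordinary *regexp.Regexp. As a minimal sketch, the pattern below treats hashtags and mentions as special tokens. The pattern and the isSpecial helper are assumptions for illustration, and whether the real tokenizer expects an anchored pattern is not stated in this documentation.

```go
package main

import (
	"fmt"
	"regexp"
)

// specialRE is a hypothetical pattern that matches whole tokens that
// look like hashtags or mentions.
var specialRE = regexp.MustCompile(`^(?:#\w+|@\w+)$`)

// isSpecial reports whether tok matches the hypothetical pattern.
func isSpecial(tok string) bool {
	return specialRE.MatchString(tok)
}

func main() {
	for _, tok := range []string{"#golang", "@vale", "golang"} {
		fmt.Println(tok, isSpecial(tok))
	}
}
```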

func UsingSplitCases added in v2.18.0

func UsingSplitCases(x []string) TokenizerOptFunc

UsingSplitCases sets the provided splitCases.

func UsingSuffixes added in v2.18.0

func UsingSuffixes(x []string) TokenizerOptFunc

UsingSuffixes sets the provided suffixes.

