The highest tagged major version is v3.

package nlp

Version: v2.30.0
Published: Dec 6, 2023
License: MIT
Imports: 11
Imported by: 0

Repository

github.com/errata-ai/vale

Documentation ¶

Overview ¶

Package nlp implements POS tagging, word tokenization, and sentence segmentation.

Index ¶

Variables
func TextToTokens(text string, nlp *Info) []tag.Token
type Block
type Info
- func (n *Info) Compute(block *Block) ([]Block, error)
type IterTokenizer
- func NewIterTokenizer(opts ...TokenizerOptFunc) *IterTokenizer
- func (t *IterTokenizer) Tokenize(text string) []string
type SegmentResult
type TagResult
type TaggedWord
type TokenTester
type Tokenizer
type TokenizerOptFunc

Constants ¶

This section is empty.

Variables ¶

var SentenceTokenizer = segment.NewPunktSentenceTokenizer()

SentenceTokenizer splits text into sentences.

var WordTokenizer = NewIterTokenizer()

WordTokenizer splits text into words.

Functions ¶

func TextToTokens ¶

func TextToTokens(text string, nlp *Info) []tag.Token

TextToTokens converts a string to a slice of tokens.

Types ¶

type Block ¶

type Block struct {
	Context string // parent content - e.g., sentence -> paragraph
	Line    int    // line of the block
	Scope   string // section selector
	Parent  string // parent (fully-qualified) selector
	Text    string // text content
}

A Block represents a section of text.

func NewBlock ¶

func NewBlock(ctx, txt, sel string) Block

NewBlock makes a new Block with prepared text and a Selector.

func NewBlockWithParent ¶ added in v2.24.1

func NewBlockWithParent(ctx, txt, sel, parent string) Block

NewBlockWithParent makes a new Block with prepared text, a Selector, and a parent.

func NewLinedBlock ¶

func NewLinedBlock(ctx, txt, sel string, line int, _ *Info) Block

NewLinedBlock creates a Block with an already-known location.

type Info ¶ added in v2.28.3

type Info struct {
	Lang         string // Language of the file.
	Endpoint     string // API endpoint (optional); TODO: should this be per-file?
	Scope        string // The file's ext scope.
	Tagging      bool   // Does the file need POS tagging?
	Segmentation bool   // Does the file need sentence segmentation?
	Splitting    bool   // Does the file need paragraph splitting?
}

Info handles NLP-related tasks.

Assigning this on a per-file basis allows us to handle multi-language projects -- one file might be `en` while another is `ja`, for example.
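As a rough illustration of per-file assignment, the sketch below builds one Info value per file from a directory-to-language map. The Info type here is a local stand-in that copies the documented fields, because the real type lives in an internal package. The infoFor helper, the directory layout, the "en" fallback, and the use of the file extension as Scope are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"path"
)

// Info is a local stand-in that copies the documented fields.
// It is not the type from the nlp package.
type Info struct {
	Lang         string
	Endpoint     string
	Scope        string
	Tagging      bool
	Segmentation bool
	Splitting    bool
}

// infoFor is a hypothetical helper. It looks up the language by the file's
// directory and falls back to "en" when the directory is not listed.
func infoFor(file string, langs map[string]string) Info {
	lang, ok := langs[path.Dir(file)]
	if !ok {
		lang = "en"
	}
	return Info{
		Lang:         lang,
		Scope:        path.Ext(file), // assumption: Scope holds the extension
		Segmentation: true,
		Splitting:    true,
	}
}

func main() {
	langs := map[string]string{"docs/ja": "ja"}
	for _, f := range []string{"docs/en/intro.md", "docs/ja/intro.md"} {
		info := infoFor(f, langs)
		fmt.Printf("%s -> lang=%s scope=%s\n", f, info.Lang, info.Scope)
	}
}
```

Because each file carries its own value, two files in the same run can use different languages without sharing global state.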

func (*Info) Compute ¶ added in v2.28.3

func (n *Info) Compute(block *Block) ([]Block, error)

An NLP provider is a library that implements part-of-speech tagging, sentence segmentation, and word tokenization.

The default implementation is the pure-Go prose library, but the goal is to allow (fairly) seamless integration with non-Go libraries too (such as spaCy).
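The provider idea can be pictured as a small interface with interchangeable implementations. The sketch below is hypothetical throughout: the provider interface, naiveProvider, and analyze are names invented for illustration, and the splitting logic is deliberately naive. It does not reproduce what Compute does or how the prose library works.

```go
package main

import (
	"fmt"
	"strings"
)

// provider is a hypothetical abstraction over an NLP backend.
type provider interface {
	Segment(text string) []string
	Tokenize(sentence string) []string
}

// naiveProvider is a toy, pure-Go backend used only for this sketch.
type naiveProvider struct{}

// Segment splits after every ". " and drops empty pieces.
func (naiveProvider) Segment(text string) []string {
	var sents []string
	for _, s := range strings.SplitAfter(text, ". ") {
		if s = strings.TrimSpace(s); s != "" {
			sents = append(sents, s)
		}
	}
	return sents
}

// Tokenize splits a sentence on whitespace.
func (naiveProvider) Tokenize(sentence string) []string {
	return strings.Fields(sentence)
}

// analyze works against the interface, so a different backend (for example,
// one that calls a remote service) could be swapped in without changes here.
func analyze(p provider, text string) [][]string {
	var out [][]string
	for _, sent := range p.Segment(text) {
		out = append(out, p.Tokenize(sent))
	}
	return out
}

func main() {
	fmt.Println(analyze(naiveProvider{}, "One two. Three."))
}
```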

type IterTokenizer ¶ added in v2.28.3

type IterTokenizer struct {
	// contains filtered or unexported fields
}

IterTokenizer splits a sentence into words.

func NewIterTokenizer ¶ added in v2.18.0

func NewIterTokenizer(opts ...TokenizerOptFunc) *IterTokenizer

NewIterTokenizer creates a new IterTokenizer.

func (*IterTokenizer) Tokenize ¶ added in v2.28.3

func (t *IterTokenizer) Tokenize(text string) []string

Tokenize splits a sentence into a slice of words.

type SegmentResult ¶

type SegmentResult struct {
	Sents []string
}

type TagResult ¶

type TagResult struct {
	Tokens []tag.Token
}

type TaggedWord ¶

type TaggedWord struct {
	Token tag.Token
	Line  int
	Span  []int
}

TaggedWord is a word with an NLP context.

type TokenTester ¶ added in v2.18.0

type TokenTester func(string) bool

type Tokenizer ¶ added in v2.18.0

type Tokenizer interface {
	Tokenize(string) []string
}
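Any type with a matching Tokenize method satisfies this interface. The following is a minimal sketch, assuming a custom tokenizer is wanted: the interface is restated locally so the example compiles on its own, and fieldsTokenizer and countWords are hypothetical names. It splits on whitespace and punctuation, which is far simpler than what IterTokenizer does.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Tokenizer restates the documented interface so the sketch is self-contained.
type Tokenizer interface {
	Tokenize(string) []string
}

// fieldsTokenizer is a hypothetical implementation that treats whitespace
// and punctuation as separators and discards them.
type fieldsTokenizer struct{}

func (fieldsTokenizer) Tokenize(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return unicode.IsSpace(r) || unicode.IsPunct(r)
	})
}

// countWords accepts any Tokenizer, which is the point of the interface.
func countWords(t Tokenizer, text string) int {
	return len(t.Tokenize(text))
}

func main() {
	var t Tokenizer = fieldsTokenizer{}
	fmt.Println(t.Tokenize("Hello, world! It works."))
	fmt.Println(countWords(t, "Hello, world! It works."))
}
```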

type TokenizerOptFunc ¶ added in v2.18.0

type TokenizerOptFunc func(*IterTokenizer)

func UsingContractions ¶ added in v2.18.0

func UsingContractions(x []string) TokenizerOptFunc

UsingContractions sets the provided contractions.

func UsingEmoticons ¶ added in v2.18.0

func UsingEmoticons(x map[string]int) TokenizerOptFunc

UsingEmoticons sets the provided map of emoticons.

func UsingIsUnsplittable ¶ added in v2.18.0

func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc

UsingIsUnsplittable sets the function used to test whether a token can be split.
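A TokenTester is a plain predicate over a token. The sketch below shows one way such a predicate might be written and consulted, assuming that a true result means "keep this token whole". TokenTester is restated locally; isURL and splitTrailing are hypothetical and are not the package's splitting algorithm.

```go
package main

import (
	"fmt"
	"strings"
)

// TokenTester restates the documented type so the sketch compiles on its own.
type TokenTester func(string) bool

// isURL is a hypothetical tester. It reports true for tokens that look like
// HTTP(S) URLs and should therefore be kept whole.
func isURL(tok string) bool {
	return strings.HasPrefix(tok, "http://") || strings.HasPrefix(tok, "https://")
}

// splitTrailing is a hypothetical splitter. It separates trailing punctuation
// from a token unless the tester says the token must not be split.
func splitTrailing(tok string, unsplittable TokenTester) []string {
	if unsplittable(tok) {
		return []string{tok}
	}
	trimmed := strings.TrimRight(tok, ".,!?")
	if trimmed == "" || trimmed == tok {
		return []string{tok}
	}
	return []string{trimmed, tok[len(trimmed):]}
}

func main() {
	for _, tok := range []string{"done.", "https://example.com/a."} {
		fmt.Printf("%q\n", splitTrailing(tok, isURL))
	}
}
```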

func UsingPrefixes ¶ added in v2.18.0

func UsingPrefixes(x []string) TokenizerOptFunc

UsingPrefixes sets the provided prefixes.

func UsingSanitizer ¶ added in v2.18.0

func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc

UsingSanitizer sets the provided sanitizer.

func UsingSpecialRE ¶ added in v2.18.0

func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc

UsingSpecialRE sets the provided special regex for unsplittable tokens.

func UsingSplitCases ¶ added in v2.18.0

func UsingSplitCases(x []string) TokenizerOptFunc

UsingSplitCases sets the provided splitCases.

func UsingSuffixes ¶ added in v2.18.0

func UsingSuffixes(x []string) TokenizerOptFunc

UsingSuffixes sets the provided suffixes.
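The Using* functions above follow Go's functional options pattern: each returns a function that mutates the tokenizer, and the constructor applies them in order over its defaults. Since IterTokenizer's fields are unexported, the sketch below reproduces the pattern with a local stand-in. The tokenizer type, its two fields, the lowercase function names, and the default values are all assumptions for illustration.

```go
package main

import "fmt"

// tokenizer is a local stand-in for a type with unexported configuration.
type tokenizer struct {
	prefixes []string
	suffixes []string
}

// optFunc plays the role that TokenizerOptFunc plays in the package.
type optFunc func(*tokenizer)

// usingPrefixes returns an option that replaces the prefix list.
func usingPrefixes(x []string) optFunc {
	return func(t *tokenizer) { t.prefixes = x }
}

// usingSuffixes returns an option that replaces the suffix list.
func usingSuffixes(x []string) optFunc {
	return func(t *tokenizer) { t.suffixes = x }
}

// newTokenizer starts from assumed defaults and applies each option in order,
// so later options override earlier ones.
func newTokenizer(opts ...optFunc) *tokenizer {
	t := &tokenizer{
		prefixes: []string{"("},
		suffixes: []string{")"},
	}
	for _, opt := range opts {
		opt(t)
	}
	return t
}

func main() {
	t := newTokenizer(usingSuffixes([]string{")", "!"}))
	fmt.Println(t.prefixes, t.suffixes)
}
```

The variadic signature lets callers pass no options at all, which is how a zero-argument call such as NewIterTokenizer() can still produce a usable tokenizer.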
