Documentation ¶
Overview ¶
Package prose is a text-processing library supporting tokenization, part-of-speech tagging, and named-entity extraction.
Index ¶
- Constants
- func NewIterTokenizer(opts ...TokenizerOptFunc) *iterTokenizer
- func ReadAndDecodeBytes(filename string) (*gob.Decoder, error)
- func ReadBytes(filename string) ([]byte, error)
- type DataSource
- type DocOpt
- type DocOpts
- type Document
- type Entity
- type EntityContext
- type LabeledEntity
- type Model
- type PerceptronTagger
- type Sentence
- type Token
- type TokenTester
- type Tokenizer
- type TokenizerOptFunc
- func UsingContractions(x []string) TokenizerOptFunc
- func UsingEmoticons(x map[string]struct{}) TokenizerOptFunc
- func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
- func UsingPrefixes(x []string) TokenizerOptFunc
- func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
- func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
- func UsingSplitCases(x []string) TokenizerOptFunc
- func UsingSuffixes(x []string) TokenizerOptFunc
- type TupleSlice
Examples ¶
Constants ¶
const NoneFeat = "None"
Variables ¶
This section is empty.
Functions ¶
func NewIterTokenizer ¶
func NewIterTokenizer(opts ...TokenizerOptFunc) *iterTokenizer
NewIterTokenizer creates a default iterTokenizer, applying any provided options.
func ReadAndDecodeBytes ¶
func ReadAndDecodeBytes(filename string) (*gob.Decoder, error)
ReadAndDecodeBytes reads an embedded file into a gob decoder.
Types ¶
type DataSource ¶
type DataSource func(model *Model)
DataSource provides training data to a Model.
func UsingEntities ¶
func UsingEntities(data []EntityContext) DataSource
UsingEntities creates a named-entity recognition (NER) model from labeled data.
func UsingEntitiesAndTokenizer ¶
func UsingEntitiesAndTokenizer(data []EntityContext, tokenizer Tokenizer) DataSource
UsingEntitiesAndTokenizer creates a NER model from labeled data and a custom tokenizer.
type DocOpt ¶
A DocOpt represents a setting that changes the document creation process.
For example, it might disable named-entity extraction:
doc := prose.NewDocument("...", prose.WithExtraction(false))
func UsingModel ¶
UsingModel specifies the Model to use for named-entity extraction.
func UsingTokenizer ¶
UsingTokenizer specifies the Tokenizer to use.
func WithExtraction ¶
WithExtraction can enable (the default) or disable named-entity extraction.
func WithSegmentation ¶
WithSegmentation can enable (the default) or disable sentence segmentation.
func WithTagging ¶
WithTagging can enable (the default) or disable POS tagging.
func WithTokenization ¶
WithTokenization can enable (the default) or disable tokenization.

Deprecated: use UsingTokenizer instead.
type DocOpts ¶
type DocOpts struct {
	Extract   bool      // If true, include named-entity extraction
	Segment   bool      // If true, include segmentation
	Tag       bool      // If true, include POS tagging
	Tokenizer Tokenizer // The Tokenizer to use
}
DocOpts controls the Document creation process.
type Document ¶
A Document represents a parsed body of text.
func NewDocument ¶
NewDocument creates a Document according to the user-specified options.
For example,
doc := prose.NewDocument("...")
type Entity ¶
type Entity struct {
	Text  string // The entity's actual content.
	Label string // The entity's label.
}
An Entity represents an individual named-entity.
type EntityContext ¶
type EntityContext struct {
	// Is this a correct entity?
	//
	// Some annotation software, e.g. Prodigy, includes entities "rejected"
	// by its user. This allows us to handle those cases.
	Accept bool

	Spans []LabeledEntity // The entity locations relative to `Text`.
	Text  string          // The sentence containing the entities.
}
EntityContext represents text containing named-entities.
type LabeledEntity ¶
LabeledEntity represents an externally-labeled named-entity.
type Model ¶
type Model struct {
	Name string
	// contains filtered or unexported fields
}
A Model holds the structures and data used internally by prose.
func ModelFromData ¶
func ModelFromData(name string, sources ...DataSource) (*Model, error)
ModelFromData creates a new Model from user-provided training data.
func ModelFromDisk ¶
ModelFromDisk loads a Model from the user-provided location.
func ModelFromFS ¶
ModelFromFS loads a Model from the provided file system.
type PerceptronTagger ¶
type PerceptronTagger struct {
// contains filtered or unexported fields
}
PerceptronTagger is a port of TextBlob's "fast and accurate" POS tagger. See https://github.com/sloria/textblob-aptagger for details.
func NewPerceptronTagger ¶
func NewPerceptronTagger() (*PerceptronTagger, error)
NewPerceptronTagger creates a new PerceptronTagger and loads the built-in AveragedPerceptron model.
func (*PerceptronTagger) Tag ¶
func (pt *PerceptronTagger) Tag(tokens []*Token) []*Token
Tag takes a slice of words and returns a slice of tagged tokens.
type Sentence ¶
type Sentence struct {
Text string // The sentence's text.
}
A Sentence represents a segmented portion of text.
type Token ¶
type Token struct {
	Tag   string // The token's part-of-speech tag.
	Text  string // The token's actual content.
	Label string // The token's IOB label.
}
A Token represents an individual token of text such as a word or punctuation symbol.
type TokenTester ¶
type TokenTester func(string) bool
type TokenizerOptFunc ¶
type TokenizerOptFunc func(*iterTokenizer)
func UsingContractions ¶
func UsingContractions(x []string) TokenizerOptFunc
Use the provided contractions.
func UsingEmoticons ¶
func UsingEmoticons(x map[string]struct{}) TokenizerOptFunc
Use the provided map of emoticons.
func UsingIsUnsplittable ¶
func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
UsingIsUnsplittable sets a function that tests whether a token is unsplittable.
func UsingSanitizer ¶
func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
Use the provided sanitizer.
func UsingSpecialRE ¶
func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
Use the provided special regex for unsplittable tokens.
func UsingSplitCases ¶
func UsingSplitCases(x []string) TokenizerOptFunc
Use the provided splitCases.
type TupleSlice ¶
type TupleSlice [][][]string
TupleSlice is a slice of tuples in the form (words, tags).
func ReadTagged ¶
func ReadTagged(text, sep string) TupleSlice
ReadTagged converts pre-tagged input into a TupleSlice suitable for training.
Example ¶
tagged := "Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS"
fmt.Println(ReadTagged(tagged, "|"))
Output: [[[Pierre Vinken , 61 years] [NNP NNP , CD NNS]]]
func (TupleSlice) Swap ¶
func (t TupleSlice) Swap(i, j int)
Swap switches the ith and jth elements in a TupleSlice.