prose

v2.0.0 (latest) | Published: Jun 16, 2020 | License: MIT | Imports: 21 | Imported by: 47

Repository

github.com/jdkato/prose

README ¶

prose

prose is a natural language processing library (English only, at the moment) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.

You can find a more detailed summary of the library's performance here: Introducing prose v2.0.0: Bringing NLP to Go.

Installation

$ go get gopkg.in/jdkato/prose.v2

Usage

Overview

package main

import (
    "fmt"
    "log"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag, tok.Label)
        // Go NNP B-GPE
        // is VBZ O
        // an DT O
        // ...
    }

    // Iterate over the doc's named-entities:
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Go GPE
        // Google GPE
    }

    // Iterate over the doc's sentences:
    for _, sent := range doc.Sentences() {
        fmt.Println(sent.Text)
        // Go is an open-source programming language created at Google.
    }
}

The document-creation process adheres to the following sequence of steps:

tokenization -> POS tagging -> NE extraction
            \
             segmentation

Each step may be disabled (assuming later steps aren't required) by passing the appropriate functional option. To disable named-entity extraction, for example, you'd do the following:

doc, err := prose.NewDocument(
        "Go is an open-source programming language created at Google.",
        prose.WithExtraction(false))

Tokenizing

prose includes a tokenizer capable of handling modern text, including the non-word character spans shown below.

Type	Example
Email addresses	`Jane.Doe@example.com`
Hashtags	`#trending`
Mentions	`@jdkato`
URLs	`https://github.com/jdkato/prose`
Emoticons	`:-)`, `>:(`, `o_0`, etc.

package main

import (
    "fmt"
    "log"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag)
        // @jdkato NN
        // , ,
        // go VB
        // to TO
        // http://example.com NN
        // thanks NNS
        // :) SYM
        // . .
    }
}
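To make the span types in the table above more concrete, the sketch below classifies a few whole strings with regular expressions from the standard library. The patterns and the classifySpan helper are illustrative assumptions, not the rules prose's tokenizer uses internally.

```go
package main

import (
	"fmt"
	"regexp"
)

// spanPatterns holds illustrative patterns only; they are assumptions made
// for this sketch and are not taken from prose's tokenizer.
var spanPatterns = []struct {
	kind string
	re   *regexp.Regexp
}{
	{"email", regexp.MustCompile(`^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$`)},
	{"url", regexp.MustCompile(`^https?://\S+$`)},
	{"hashtag", regexp.MustCompile(`^#\w+$`)},
	{"mention", regexp.MustCompile(`^@\w+$`)},
}

// classifySpan returns the first matching (hypothetical) category for s,
// or "other" when no pattern matches.
func classifySpan(s string) string {
	for _, p := range spanPatterns {
		if p.re.MatchString(s) {
			return p.kind
		}
	}
	return "other"
}

func main() {
	samples := []string{
		"Jane.Doe@example.com",
		"#trending",
		"@jdkato",
		"https://github.com/jdkato/prose",
		"thanks",
	}
	for _, s := range samples {
		fmt.Printf("%s -> %s\n", s, classifySpan(s))
	}
}
```

A real tokenizer also has to find these spans inside running text and detach trailing punctuation, which this whole-string check does not attempt.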

Segmenting

prose includes one of the most accurate sentence segmenters available, according to the Golden Rules created by the developers of the pragmatic_segmenter.

Name	Language	License	GRS (English)	GRS (Other)	Speed†
Pragmatic Segmenter	Ruby	MIT	98.08% (51/52)	100.00%	3.84 s
prose	Go	MIT	75.00% (39/52)	N/A	0.96 s
TactfulTokenizer	Ruby	GNU GPLv3	65.38% (34/52)	48.57%	46.32 s
OpenNLP	Java	APLv2	59.62% (31/52)	45.71%	1.27 s
Stanford CoreNLP	Java	GNU GPLv3	59.62% (31/52)	31.43%	0.92 s
Splitta	Python	APLv2	55.77% (29/52)	37.14%	N/A
Punkt	Python	APLv2	46.15% (24/52)	48.57%	1.79 s
SRX English	Ruby	GNU GPLv3	30.77% (16/52)	28.57%	6.19 s
Scalpel	Ruby	GNU GPLv3	28.85% (15/52)	20.00%	0.13 s

† The original tests were performed on a Mac Pro (3.7 GHz Quad-Core Intel Xeon E5) running OS X 10.9.5, while prose was timed on a MacBook Pro (2.9 GHz Intel Core i7) running macOS 10.13.3.
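The GRS (English) column is the share of the 52 Golden Rules a segmenter passes. As a quick check of the arithmetic, the following sketch recomputes two of the table's percentages; grsPercent is a helper name introduced here for illustration.

```go
package main

import "fmt"

// grsPercent formats passed/total as a percentage with two decimal places.
func grsPercent(passed, total int) string {
	return fmt.Sprintf("%.2f%%", 100*float64(passed)/float64(total))
}

func main() {
	fmt.Println(grsPercent(51, 52)) // Pragmatic Segmenter: 98.08%
	fmt.Println(grsPercent(39, 52)) // prose: 75.00%
}
```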

package main

import (
    "fmt"
    "log"
    "strings"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument(strings.Join([]string{
        "I can see Mt. Fuji from here.",
        "St. Michael's Church is on 5th st. near the light."}, " "))
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's sentences:
    sents := doc.Sentences()
    fmt.Println(len(sents)) // 2
    for _, sent := range sents {
        fmt.Println(sent.Text)
        // I can see Mt. Fuji from here.
        // St. Michael's Church is on 5th st. near the light.
    }
}
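For contrast, here is what a naive splitter does with the same input. This sketch uses only the standard library and cuts after every period followed by a space, so abbreviations such as "Mt." and "St." are treated as sentence ends; naiveSplit is a hypothetical helper and is not part of prose.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveSplit cuts the text after every ". " sequence. It knows nothing
// about abbreviations, so it over-segments.
func naiveSplit(text string) []string {
	return strings.SplitAfter(text, ". ")
}

func main() {
	text := "I can see Mt. Fuji from here. St. Michael's Church is on 5th st. near the light."
	parts := naiveSplit(text)
	fmt.Println(len(parts)) // 5 fragments instead of 2 sentences
	for _, p := range parts {
		fmt.Printf("%q\n", p)
	}
}
```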

Tagging

prose includes a tagger based on TextBlob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:

Library	Accuracy	5-Run Average (sec)
NLTK	0.893	7.224
`prose`	0.961	2.538

(See scripts/test_model.py for more information.)

The full list of supported POS tags is given below.

TAG	DESCRIPTION
`(`	left round bracket
`)`	right round bracket
`,`	comma
`:`	colon
`.`	period
`''`	closing quotation mark
``	opening quotation mark
`#`	number sign
`$`	currency
`CC`	conjunction, coordinating
`CD`	cardinal number
`DT`	determiner
`EX`	existential there
`FW`	foreign word
`IN`	conjunction, subordinating or preposition
`JJ`	adjective
`JJR`	adjective, comparative
`JJS`	adjective, superlative
`LS`	list item marker
`MD`	verb, modal auxiliary
`NN`	noun, singular or mass
`NNP`	noun, proper singular
`NNPS`	noun, proper plural
`NNS`	noun, plural
`PDT`	predeterminer
`POS`	possessive ending
`PRP`	pronoun, personal
`PRP$`	pronoun, possessive
`RB`	adverb
`RBR`	adverb, comparative
`RBS`	adverb, superlative
`RP`	adverb, particle
`SYM`	symbol
`TO`	infinitival to
`UH`	interjection
`VB`	verb, base form
`VBD`	verb, past tense
`VBG`	verb, gerund or present participle
`VBN`	verb, past participle
`VBP`	verb, non-3rd person singular present
`VBZ`	verb, 3rd person singular present
`WDT`	wh-determiner
`WP`	wh-pronoun, personal
`WP$`	wh-pronoun, possessive
`WRB`	wh-adverb
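Because related tags in the list above share a prefix (NN, NNP, NNPS, and NNS for nouns; VB through VBZ for verbs), a prefix check is often enough to group them. The sketch below applies that idea to a few word/tag pairs from the tokenizing example; isNoun, isVerb, and the tagged type are names made up for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// isNoun reports whether tag is one of NN, NNP, NNPS, or NNS.
func isNoun(tag string) bool { return strings.HasPrefix(tag, "NN") }

// isVerb reports whether tag is one of VB, VBD, VBG, VBN, VBP, or VBZ.
func isVerb(tag string) bool { return strings.HasPrefix(tag, "VB") }

// tagged is a local stand-in for a token's text and POS tag.
type tagged struct{ Text, Tag string }

func main() {
	toks := []tagged{
		{"@jdkato", "NN"},
		{"go", "VB"},
		{"to", "TO"},
		{"thanks", "NNS"},
	}
	for _, t := range toks {
		switch {
		case isNoun(t.Tag):
			fmt.Println(t.Text, "is a noun")
		case isVerb(t.Tag):
			fmt.Println(t.Text, "is a verb")
		}
	}
}
```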

NER

prose v2.0.0 includes a much-improved version of v1.0.0's chunk package, which can identify people (PERSON) and geographical/political entities (GPE) by default.

package main

import (
    "fmt"
    "log"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    doc, err := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
    if err != nil {
        log.Fatal(err)
    }

    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Lebron James PERSON
        // Los Angeles GPE
    }
}

However, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See Prodigy + prose: Radically efficient machine teaching in Go for a tutorial.

Documentation ¶

Overview ¶

Package prose is a repository of packages related to text processing, including tokenization, part-of-speech tagging, and named-entity extraction.

Index ¶

func Asset(name string) ([]byte, error)
func AssetDir(name string) ([]string, error)
func AssetInfo(name string) (os.FileInfo, error)
func AssetNames() []string
func MustAsset(name string) []byte
func RestoreAsset(dir, name string) error
func RestoreAssets(dir, name string) error
type DataSource
- func UsingEntities(data []EntityContext) DataSource
type DocOpt
- func UsingModel(model *Model) DocOpt
- func WithExtraction(include bool) DocOpt
- func WithSegmentation(include bool) DocOpt
- func WithTagging(include bool) DocOpt
- func WithTokenization(include bool) DocOpt
type DocOpts
type Document
- func NewDocument(text string, opts ...DocOpt) (*Document, error)
- func (doc *Document) Entities() []Entity
- func (doc *Document) Sentences() []Sentence
- func (doc *Document) Tokens() []Token
type Entity
type EntityContext
type LabeledEntity
type Model
- func ModelFromData(name string, sources ...DataSource) *Model
- func ModelFromDisk(path string) *Model
- func (m *Model) Write(path string) error
type Sentence
type Token
type TupleSlice
- func ReadTagged(text, sep string) TupleSlice
- func (t TupleSlice) Len() int
- func (t TupleSlice) Swap(i, j int)

Examples ¶

ReadTagged

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func Asset ¶

func Asset(name string) ([]byte, error)

Asset loads and returns the asset for the given name. It returns an error if the asset could not be found or could not be loaded.

func AssetDir ¶

func AssetDir(name string) ([]string, error)

AssetDir returns the file names below a certain directory embedded in the file by go-bindata. For example, if you run go-bindata on data/... and data contains the following hierarchy:

data/
  foo.txt
  img/
    a.png
    b.png

then:

AssetDir("data") would return []string{"foo.txt", "img"}
AssetDir("data/img") would return []string{"a.png", "b.png"}
AssetDir("foo.txt") and AssetDir("notexist") would return an error
AssetDir("") would return []string{"data"}
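The lookup rules described for AssetDir can be reproduced with a small in-memory tree. This is a sketch under stated assumptions: node, tree, and assetDir are hypothetical names, the hierarchy is hard-coded from the example, and the code go-bindata actually generates may be organized differently.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// node is a hypothetical stand-in for an embedded directory tree:
// a nil children map marks a file, a non-nil map marks a directory.
type node struct {
	children map[string]*node
}

// tree mirrors the data/ hierarchy from the example.
var tree = &node{children: map[string]*node{
	"data": {children: map[string]*node{
		"foo.txt": {},
		"img": {children: map[string]*node{
			"a.png": {},
			"b.png": {},
		}},
	}},
}}

// assetDir lists the entries directly below name. It returns an error when
// name does not exist or refers to a file rather than a directory.
func assetDir(name string) ([]string, error) {
	cur := tree
	if name != "" {
		for _, part := range strings.Split(name, "/") {
			cur = cur.children[part]
			if cur == nil {
				return nil, fmt.Errorf("asset %s not found", name)
			}
		}
	}
	if cur.children == nil {
		return nil, fmt.Errorf("asset %s is not a directory", name)
	}
	names := make([]string, 0, len(cur.children))
	for child := range cur.children {
		names = append(names, child)
	}
	sort.Strings(names) // map iteration order is random, so sort for stable output
	return names, nil
}

func main() {
	for _, name := range []string{"data", "data/img", "", "notexist"} {
		entries, err := assetDir(name)
		fmt.Printf("%q: %v %v\n", name, entries, err)
	}
}
```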

func AssetInfo ¶

func AssetInfo(name string) (os.FileInfo, error)

AssetInfo loads and returns the asset info for the given name. It returns an error if the asset could not be found or could not be loaded.

func AssetNames ¶

func AssetNames() []string

AssetNames returns the names of the assets.

func MustAsset ¶

func MustAsset(name string) []byte

MustAsset is like Asset but panics when Asset would return an error. It simplifies safe initialization of global variables.
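The "Must" convention is useful because a package-level variable declaration cannot handle a second, error return value. The sketch below shows the general pattern with a made-up loader; load and mustLoad are hypothetical stand-ins and do not read real embedded assets.

```go
package main

import (
	"errors"
	"fmt"
)

// load is a hypothetical loader standing in for a function shaped like Asset.
func load(name string) ([]byte, error) {
	if name == "" {
		return nil, errors.New("empty asset name")
	}
	return []byte("contents of " + name), nil
}

// mustLoad wraps load and panics instead of returning an error.
func mustLoad(name string) []byte {
	b, err := load(name)
	if err != nil {
		panic("mustLoad: " + err.Error())
	}
	return b
}

// A package-level initializer has nowhere to put an error value,
// which is where the Must form is convenient.
var banner = mustLoad("banner.txt")

func main() {
	fmt.Println(string(banner))
}
```

Outside of initialization, calling the error-returning form and handling the error is usually the safer choice.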

func RestoreAsset ¶

func RestoreAsset(dir, name string) error

RestoreAsset restores an asset under the given directory.

func RestoreAssets ¶

func RestoreAssets(dir, name string) error

RestoreAssets restores an asset under the given directory recursively.

Types ¶

type DataSource ¶

type DataSource func(model *Model)

DataSource provides training data to a Model.

func UsingEntities ¶

func UsingEntities(data []EntityContext) DataSource

UsingEntities creates an NER data source from labeled data.

type DocOpt ¶

type DocOpt func(doc *Document, opts *DocOpts)

A DocOpt represents a setting that changes the document creation process.

For example, it might disable named-entity extraction:

doc, err := prose.NewDocument("...", prose.WithExtraction(false))
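The functional-option pattern behind DocOpt can be sketched without the library. The lowercase types and functions below are local stand-ins that loosely mirror DocOpts and the With* functions; the single-argument option signature and the resolve helper are simplifications assumed for this example, not the library's implementation.

```go
package main

import "fmt"

// docOpts is a local stand-in with the same fields as DocOpts.
type docOpts struct {
	Extract, Segment, Tag, Tokenize bool
}

// docOpt is a simplified option type: a function that edits the settings.
type docOpt func(opts *docOpts)

// withExtraction enables or disables the (simulated) extraction step.
func withExtraction(include bool) docOpt {
	return func(opts *docOpts) { opts.Extract = include }
}

// withSegmentation enables or disables the (simulated) segmentation step.
func withSegmentation(include bool) docOpt {
	return func(opts *docOpts) { opts.Segment = include }
}

// resolve starts with every step enabled, matching the documented defaults,
// then applies each option in the order it was passed.
func resolve(options ...docOpt) docOpts {
	opts := docOpts{Extract: true, Segment: true, Tag: true, Tokenize: true}
	for _, apply := range options {
		apply(&opts)
	}
	return opts
}

func main() {
	fmt.Printf("%+v\n", resolve())
	fmt.Printf("%+v\n", resolve(withExtraction(false)))
}
```

The appeal of this design is that callers only mention the settings they want to change, and new options can be added later without altering the constructor's signature.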

func UsingModel ¶

func UsingModel(model *Model) DocOpt

UsingModel sets the Model that the Document uses, in place of the default model.

func WithExtraction ¶

func WithExtraction(include bool) DocOpt

WithExtraction can enable (the default) or disable named-entity extraction.

func WithSegmentation ¶

func WithSegmentation(include bool) DocOpt

WithSegmentation can enable (the default) or disable sentence segmentation.

func WithTagging ¶

func WithTagging(include bool) DocOpt

WithTagging can enable (the default) or disable POS tagging.

func WithTokenization ¶

func WithTokenization(include bool) DocOpt

WithTokenization can enable (the default) or disable tokenization.

type DocOpts ¶

type DocOpts struct {
	Extract  bool // If true, include named-entity extraction
	Segment  bool // If true, include segmentation
	Tag      bool // If true, include POS tagging
	Tokenize bool // If true, include tokenization
}

DocOpts controls the Document creation process.

type Document ¶

type Document struct {
	Model *Model
	Text  string
	// contains filtered or unexported fields
}

A Document represents a parsed body of text.

func NewDocument ¶

func NewDocument(text string, opts ...DocOpt) (*Document, error)

NewDocument creates a Document according to the user-specified options.

For example,

doc, err := prose.NewDocument("...")

func (*Document) Entities ¶

func (doc *Document) Entities() []Entity

Entities returns `doc`'s entities.

func (*Document) Sentences ¶

func (doc *Document) Sentences() []Sentence

Sentences returns `doc`'s sentences.

func (*Document) Tokens ¶

func (doc *Document) Tokens() []Token

Tokens returns `doc`'s tokens.

type Entity ¶

type Entity struct {
	Text  string // The entity's actual content.
	Label string // The entity's label.
}

An Entity represents an individual named-entity.

type EntityContext ¶

type EntityContext struct {
	// Is this a correct entity?
	//
	// Some annotation software, e.g. Prodigy, includes entities "rejected" by
	// its user. This allows us to handle those cases.
	Accept bool

	Spans []LabeledEntity // The entity locations relative to `Text`.
	Text  string          // The sentence containing the entities.
}

EntityContext represents text containing named-entities.

type LabeledEntity ¶

type LabeledEntity struct {
	Start int
	End   int
	Label string
}

LabeledEntity represents an externally-labeled named-entity.
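To illustrate how Start and End relate to the Text of an EntityContext, the sketch below derives spans for two phrases and slices them back out. It assumes byte offsets with an exclusive End, which is an assumption about the convention rather than something stated above; labeledSpan and spanFor are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// labeledSpan is a local stand-in with the same fields as LabeledEntity.
type labeledSpan struct {
	Start int
	End   int
	Label string
}

// spanFor locates the first occurrence of phrase in text and labels it.
// Offsets are byte positions and End is exclusive (assumed convention).
func spanFor(text, phrase, label string) (labeledSpan, bool) {
	i := strings.Index(text, phrase)
	if i < 0 {
		return labeledSpan{}, false
	}
	return labeledSpan{Start: i, End: i + len(phrase), Label: label}, true
}

func main() {
	text := "Lebron James plays basketball in Los Angeles."
	wanted := []struct{ phrase, label string }{
		{"Lebron James", "PERSON"},
		{"Los Angeles", "GPE"},
	}
	for _, w := range wanted {
		if s, ok := spanFor(text, w.phrase, w.label); ok {
			// Slicing with the span recovers the labeled phrase.
			fmt.Println(s.Start, s.End, s.Label, text[s.Start:s.End])
		}
	}
}
```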

type Model ¶

type Model struct {
	Name string
	// contains filtered or unexported fields
}

A Model holds the structures and data used internally by prose.

func ModelFromData ¶

func ModelFromData(name string, sources ...DataSource) *Model

ModelFromData creates a new Model from user-provided training data.

func ModelFromDisk ¶

func ModelFromDisk(path string) *Model

ModelFromDisk loads a Model from the user-provided location.

func (*Model) Write ¶

func (m *Model) Write(path string) error

Write saves a Model to the user-provided location.

type Sentence ¶

type Sentence struct {
	Text string // The sentence's text.
}

A Sentence represents a segmented portion of text.

type Token ¶

type Token struct {
	Tag   string // The token's part-of-speech tag.
	Text  string // The token's actual content.
	Label string // The token's IOB label.
}

A Token represents an individual token of text such as a word or punctuation symbol.
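IOB labels mark the first token of an entity with a B- prefix, continuation tokens with I-, and everything else with O. A minimal sketch of collapsing such labels into entities follows. The token and entity types are local stand-ins, the I- labels in the sample are assumed from the standard IOB scheme, and joining words with a single space is a simplification.

```go
package main

import (
	"fmt"
	"strings"
)

// token and entity are local stand-ins mirroring the fields of Token and Entity.
type token struct{ Text, Tag, Label string }
type entity struct{ Text, Label string }

// groupIOB merges each run of B-/I- labeled tokens into one entity.
func groupIOB(tokens []token) []entity {
	var out []entity
	var words []string
	var label string
	// flush emits the entity collected so far, if any, and resets the state.
	flush := func() {
		if len(words) > 0 {
			out = append(out, entity{Text: strings.Join(words, " "), Label: label})
			words, label = nil, ""
		}
	}
	for _, t := range tokens {
		switch {
		case strings.HasPrefix(t.Label, "B-"):
			flush()
			words, label = []string{t.Text}, strings.TrimPrefix(t.Label, "B-")
		case strings.HasPrefix(t.Label, "I-") && len(words) > 0:
			words = append(words, t.Text)
		default: // "O", or a stray I- label with no open entity
			flush()
		}
	}
	flush()
	return out
}

func main() {
	toks := []token{
		{"Lebron", "NNP", "B-PERSON"}, {"James", "NNP", "I-PERSON"},
		{"plays", "VBZ", "O"}, {"basketball", "NN", "O"}, {"in", "IN", "O"},
		{"Los", "NNP", "B-GPE"}, {"Angeles", "NNP", "I-GPE"}, {".", ".", "O"},
	}
	for _, e := range groupIOB(toks) {
		fmt.Println(e.Text, e.Label)
	}
}
```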

type TupleSlice ¶

type TupleSlice [][][]string

TupleSlice is a slice of tuples in the form (words, tags).

func ReadTagged ¶

func ReadTagged(text, sep string) TupleSlice

ReadTagged converts pre-tagged input into a TupleSlice suitable for training.

Example ¶

tagged := "Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS"
fmt.Println(ReadTagged(tagged, "|"))

Output:

[[[Pierre Vinken , 61 years] [NNP NNP , CD NNS]]]
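The (words, tags) shape in that output can be illustrated by splitting one line of pre-tagged text by hand. This is a sketch of the idea, not the library's parser: splitTagged is a hypothetical helper, and skipping pairs that lack the separator is a choice made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTagged separates one line of "word<sep>tag" pairs into parallel
// word and tag slices. Pairs without the separator are skipped.
func splitTagged(line, sep string) (words, tags []string) {
	for _, pair := range strings.Fields(line) {
		// Split at the last separator so the word itself may contain sep.
		i := strings.LastIndex(pair, sep)
		if i < 0 {
			continue
		}
		words = append(words, pair[:i])
		tags = append(tags, pair[i+len(sep):])
	}
	return words, tags
}

func main() {
	words, tags := splitTagged("Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS", "|")
	fmt.Println(words) // [Pierre Vinken , 61 years]
	fmt.Println(tags)  // [NNP NNP , CD NNS]
}
```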

func (TupleSlice) Len ¶

func (t TupleSlice) Len() int

Len returns the length of a TupleSlice.

func (TupleSlice) Swap ¶

func (t TupleSlice) Swap(i, j int)

Swap switches the ith and jth elements in a TupleSlice.
