prose

prose is a natural language processing library (English only) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.
You can find a more detailed summary of the library's performance in the blog post Introducing prose v2.0.0: Bringing NLP to Go.
NOTE: If you're looking for v1.0.0's README, you can still find it here.
Installation
$ go get gopkg.in/jdkato/prose.v2
Usage
Overview
package main

import (
    "fmt"
    "log"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag, tok.Label)
        // Go NNP B-GPE
        // is VBZ O
        // an DT O
        // ...
    }

    // Iterate over the doc's named-entities:
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Go GPE
        // Google GPE
    }

    // Iterate over the doc's sentences:
    for _, sent := range doc.Sentences() {
        fmt.Println(sent.Text)
        // Go is an open-source programming language created at Google.
    }
}
The document-creation process adheres to the following sequence of steps:
tokenization -> POS tagging -> NE extraction
            \
             segmentation
Each step may be disabled (assuming later steps aren't required) by passing the appropriate functional option. To disable named-entity extraction, for example, you'd do the following:
doc, err := prose.NewDocument(
    "Go is an open-source programming language created at Google.",
    prose.WithExtraction(false))
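Conversely, if all you need is the token stream, you can skip the later steps entirely. The snippet below is a minimal sketch of that idea; it assumes WithTagging and WithSegmentation options analogous to the WithExtraction option shown above, so check the package documentation for the exact option names:

package main

import (
    "fmt"
    "log"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    // Tokenization only: skip tagging, segmentation, and entity extraction
    // when the token stream is all you need. WithTagging/WithSegmentation
    // are assumed to mirror WithExtraction above.
    doc, err := prose.NewDocument(
        "Go is an open-source programming language created at Google.",
        prose.WithTagging(false),
        prose.WithSegmentation(false),
        prose.WithExtraction(false))
    if err != nil {
        log.Fatal(err)
    }
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text)
    }
}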
Tokenizing
prose includes a tokenizer capable of handling modern text, including the non-word character spans shown below.
| Type            | Example                         |
|-----------------|---------------------------------|
| Email addresses | Jane.Doe@example.com            |
| Hashtags        | #trending                       |
| Mentions        | @jdkato                         |
| URLs            | https://github.com/jdkato/prose |
| Emoticons       | :-), >:(, o_0, etc.             |
package main

import (
    "fmt"
    "log"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag)
        // @jdkato NN
        // , ,
        // go VB
        // to TO
        // http://example.com NN
        // thanks NNS
        // :) SYM
        // . .
    }
}
Segmenting
prose includes one of the most accurate sentence segmenters available, according to the Golden Rules created by the developers of the pragmatic_segmenter.
| Name                | Language | License   | GRS (English)  | GRS (Other) | Speed†  |
|---------------------|----------|-----------|----------------|-------------|---------|
| Pragmatic Segmenter | Ruby     | MIT       | 98.08% (51/52) | 100.00%     | 3.84 s  |
| prose               | Go       | MIT       | 73.07% (38/52) | N/A         | 0.96 s  |
| TactfulTokenizer    | Ruby     | GNU GPLv3 | 65.38% (34/52) | 48.57%      | 46.32 s |
| OpenNLP             | Java     | APLv2     | 59.62% (31/52) | 45.71%      | 1.27 s  |
| Stanford CoreNLP    | Java     | GNU GPLv3 | 59.62% (31/52) | 31.43%      | 0.92 s  |
| Splitta             | Python   | APLv2     | 55.77% (29/52) | 37.14%      | N/A     |
| Punkt               | Python   | APLv2     | 46.15% (24/52) | 48.57%      | 1.79 s  |
| SRX English         | Ruby     | GNU GPLv3 | 30.77% (16/52) | 28.57%      | 6.19 s  |
| Scapel              | Ruby     | GNU GPLv3 | 28.85% (15/52) | 20.00%      | 0.13 s  |
† The original tests were performed using a MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5, while prose was timed using a MacBook Pro 2.9 GHz Intel Core i7 running 10.13.3.
package main

import (
    "fmt"
    "strings"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, _ := prose.NewDocument(strings.Join([]string{
        "I can see Mt. Fuji from here.",
        "St. Michael's Church is on 5th st. near the light."}, " "))

    // Iterate over the doc's sentences:
    sents := doc.Sentences()
    fmt.Println(len(sents)) // 2
    for _, sent := range sents {
        fmt.Println(sent.Text)
        // I can see Mt. Fuji from here.
        // St. Michael's Church is on 5th st. near the light.
    }
}
Tagging
prose includes a tagger based on TextBlob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:
| Library | Accuracy | 5-Run Average (sec) |
|---------|----------|---------------------|
| NLTK    | 0.893    | 7.224               |
| prose   | 0.961    | 2.538               |
(See scripts/test_model.py for more information.)
The full list of supported POS tags is given below.
| TAG  | DESCRIPTION                               |
|------|-------------------------------------------|
| (    | left round bracket                        |
| )    | right round bracket                       |
| ,    | comma                                     |
| :    | colon                                     |
| .    | period                                    |
| ''   | closing quotation mark                    |
| ``   | opening quotation mark                    |
| #    | number sign                               |
| $    | currency                                  |
| CC   | conjunction, coordinating                 |
| CD   | cardinal number                           |
| DT   | determiner                                |
| EX   | existential there                         |
| FW   | foreign word                              |
| IN   | conjunction, subordinating or preposition |
| JJ   | adjective                                 |
| JJR  | adjective, comparative                    |
| JJS  | adjective, superlative                    |
| LS   | list item marker                          |
| MD   | verb, modal auxiliary                     |
| NN   | noun, singular or mass                    |
| NNP  | noun, proper singular                     |
| NNPS | noun, proper plural                       |
| NNS  | noun, plural                              |
| PDT  | predeterminer                             |
| POS  | possessive ending                         |
| PRP  | pronoun, personal                         |
| PRP$ | pronoun, possessive                       |
| RB   | adverb                                    |
| RBR  | adverb, comparative                       |
| RBS  | adverb, superlative                       |
| RP   | adverb, particle                          |
| SYM  | symbol                                    |
| TO   | infinitival to                            |
| UH   | interjection                              |
| VB   | verb, base form                           |
| VBD  | verb, past tense                          |
| VBG  | verb, gerund or present participle        |
| VBN  | verb, past participle                     |
| VBP  | verb, non-3rd person singular present     |
| VBZ  | verb, 3rd person singular present         |
| WDT  | wh-determiner                             |
| WP   | wh-pronoun, personal                      |
| WP$  | wh-pronoun, possessive                    |
| WRB  | wh-adverb                                 |
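As a small illustration of working with these tags, the sketch below uses only the Tokens and Tag fields from the earlier examples to collect the noun-like tokens, i.e. those whose tag begins with "NN" (the exact tags assigned may vary by sentence):

package main

import (
    "fmt"
    "log"
    "strings"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
    if err != nil {
        log.Fatal(err)
    }

    // Keep every token tagged NN, NNS, NNP, or NNPS.
    var nouns []string
    for _, tok := range doc.Tokens() {
        if strings.HasPrefix(tok.Tag, "NN") {
            nouns = append(nouns, tok.Text)
        }
    }
    fmt.Println(nouns) // e.g. [Go language Google]
}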
NER
prose v2.0.0 includes a much-improved version of v1.0.0's chunk package, which can identify people (PERSON) and geographical/political entities (GPE) by default.
package main

import (
    "fmt"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    doc, _ := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Lebron James PERSON
        // Los Angeles GPE
    }
}
However, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See Prodigy + prose: Radically efficient machine teaching in Go for a tutorial.
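For orientation, here is a minimal sketch of what that workflow looks like, assuming the ModelFromData / UsingEntities / UsingModel API described in the tutorial; the field names and details may differ, so treat the tutorial as the authoritative reference:

package main

import (
    "fmt"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    // A toy training set: each EntityContext pairs a text sample with the
    // labeled spans (offsets into Text) it contains. Real models need far
    // more data than this.
    train := []prose.EntityContext{
        {
            Text:   "Windows 10 is an operating system.",
            Accept: true,
            Spans:  []prose.LabeledEntity{{Start: 0, End: 10, Label: "PRODUCT"}},
        },
        // ... many more examples ...
    }

    // Train a model for the new label and use it when creating documents.
    model := prose.ModelFromData("PRODUCT", prose.UsingEntities(train))
    doc, _ := prose.NewDocument("I just upgraded to Windows 10.", prose.UsingModel(model))
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
    }
}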