
README

lingo


Package lingo provides the data structures and algorithms required for natural language processing.

Specifically, it provides a POS Tagger (lingo/pos), a Dependency Parser (lingo/dep), and a basic tokenizer (lingo/lexer) for English. It also provides data structures for holding corpora (lingo/corpus) and treebanks (lingo/treebank).

The aim of this package is to provide a production-quality pipeline for natural language processing.

Install

The package is go-gettable: go get -u github.com/chewxy/lingo

This package and its subpackages depend on very few external packages. Here they are:

  • gorgonia: machine learning. Vital - it won't be hard to rewrite, but why? Same author. Gorgonia licence (Apache 2.0-like).
  • gographviz: visualization of annotations, and other graph-related visualizations. Vital for visualizations, which are a nice-to-have feature. API last changed 12th April 2017. gographviz licence (Apache 2.0).
  • errors: errors. The package won't die without it, but it's a very nice-to-have. Stable API for the past year. errors licence (MIT/BSD-like).
  • set: set operations. Can be easily replaced. Stable API for the past year. set licence (MIT/BSD-like).

Usage

See the individual packages for usage. There are also a number of executables in the cmd directory. They're meant as examples of how a natural language processing pipeline can be set up.

A natural language pipeline with this package is heavily channel-driven. Here's an example of dependency parsing (posModel and depModel are assumed to have been loaded or trained elsewhere):

package main

import (
	"fmt"
	"strings"

	"github.com/chewxy/lingo/dep"
	"github.com/chewxy/lingo/lexer"
	"github.com/chewxy/lingo/pos"
)

func main() {
	inputString := `The cat sat on the mat`
	// posModel and depModel are assumed to have been loaded or trained elsewhere.
	lx := lexer.New("dummy", strings.NewReader(inputString)) // lexer - required to break a sentence up into words.
	pt := pos.New(pos.WithModel(posModel))                   // POS Tagger - required to tag the words with a part of speech tag.
	dp := dep.New(depModel)                                  // Creates a new parser

	// set up a pipeline
	pt.Input = lx.Output
	dp.Input = pt.Output

	// run all
	go lx.Run()
	go pt.Run()
	go dp.Run()

	// wait to receive:
	for {
		select {
		case d := <-dp.Output:
			fmt.Println(d) // do something with the parse
		case err := <-dp.Error:
			fmt.Println(err) // handle the error
		}
	}
}

How It Works

For specific tasks (POS tagging, parsing, named entity recognition, etc.), refer to the README of each subpackage. This package on its own mainly provides the data structures that the subpackages use.

Perhaps the most important data structure is the *Annotation structure, which holds a word and the associated metadata for the word.

For dependency parses, the graph takes three forms: *Dependency, *DependencyTree and *Annotation. All three forms are convertible from one to another. TODO: explain rationale behind each data type.
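
As a quick illustration, here's a minimal sketch of moving between the forms using the Dependency, Tree, and Sentence methods documented below, assuming as is an AnnotatedSentence obtained from a parse:

import (
	"fmt"

	"github.com/chewxy/lingo"
)

// showForms converts an AnnotatedSentence into the other two forms and back.
func showForms(as lingo.AnnotatedSentence) {
	d := as.Dependency() // AnnotatedSentence -> *Dependency
	t := as.Tree()       // AnnotatedSentence -> *DependencyTree
	back := d.Sentence() // *Dependency -> AnnotatedSentence
	fmt.Println(d, t, back)
}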

Quirks

Very Oddly Specific POS Tags and Dependency Rel Types

A particular quirk you may have noticed is that the POSTag and DependencyType are hardcoded in as constants. This package in fact provides two variations of each: one from Stanford/Penn Treebank and one from Universal Dependencies.

These are hardcoded mainly for performance reasons - knowing ahead of time how much to allocate saves the program a lot of additional work. It also reduces the chances of mutating a global variable.

Of course, this comes as a tradeoff - programs are limited to these two options. Thankfully there are only a limited number of POS tag and dependency relation types. Two of the most popular ones (Stanford/PTB and Universal Dependencies) have been implemented.

The following build tags are supported:

  • stanfordtags
  • universaltags
  • stanfordrel
  • universalrel

To use a specific tagset or relset, build your program thusly: go build -tags='stanfordtags'.

The default tag and dependency relation types are the Universal Dependencies versions.

Lexer

You should also note that the tokenizer, lingo/lexer, is not your usual run-of-the-mill NLP tokenizer. It tokenizes by space, with some specific rules for English. It was inspired by Rob Pike's talk on lexers; I thought it'd be cool to write something like that for NLP.

The test cases in package lingo/lexer showcase how it handles unicode and other pathological English.
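
Here's a minimal sketch of using the lexer on its own, assuming (as the pipeline example above suggests) that the lexer closes its Output channel once the input is exhausted:

import (
	"fmt"
	"strings"

	"github.com/chewxy/lingo/lexer"
)

// printLexemes runs the lexer over a string and prints each Lexeme.
func printLexemes(s string) {
	lx := lexer.New("example", strings.NewReader(s))
	go lx.Run()
	for l := range lx.Output {
		fmt.Println(l) // value, type and position of each Lexeme
	}
}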

Contributing

See CONTRIBUTING.md for more info.

Licence

This package is licenced under the MIT licence.

Documentation

Overview

Package lingo provides the data structures and algorithms required for natural language processing.

Index

Constants

const BUILD_RELSET = "universalrel"

const BUILD_TAGSET = "universaltags"

Variables

var Adjectives = []POSTag{ADJ}

var Adverbs = []POSTag{ADV}

var DeterminerRels = []DependencyType{Det, Det_PreDet}

var Determiners = []POSTag{DET}

var Interrogatives = []POSTag{PRON, DET, ADV}

var Modifiers = []DependencyType{AMod}

var Nouns = []POSTag{NOUN, PROPN}

var NumberWords = map[string]int{
	"zero":        0,
	"one":         1,
	"two":         2,
	"three":       3,
	"four":        4,
	"five":        5,
	"six":         6,
	"seven":       7,
	"eight":       8,
	"nine":        9,
	"ten":         10,
	"eleven":      11,
	"twelve":      12,
	"thirteen":    13,
	"fourteen":    14,
	"fifteen":     15,
	"sixteen":     16,
	"nineteen":    19,
	"seventeen":   17,
	"eighteen":    18,
	"twenty":      20,
	"thirty":      30,
	"forty":       40,
	"fifty":       50,
	"sixty":       60,
	"seventy":     70,
	"eighty":      80,
	"ninety":      90,
	"hundred":     100,
	"thousand":    1000,
	"million":     1000000,
	"billion":     1000000000,
	"trillion":    1000000000000,
	"quadrillion": 1000000000000000,
}

NumberWords was generated with this Python code:

numberWords = {}

simple = '''zero one two three four five six seven eight nine ten eleven twelve
            thirteen fourteen fifteen sixteen seventeen eighteen nineteen
            twenty'''.split()
for i, word in zip(xrange(0, 20+1), simple):
    numberWords[word] = i

tense = '''thirty forty fifty sixty seventy eighty ninety hundred'''.split()
for i, word in zip(xrange(30, 100+1, 10), tense):
    numberWords[word] = i

larges = '''thousand million billion trillion quadrillion quintillion sextillion septillion'''.split()
for i, word in zip(xrange(3, 24+1, 3), larges):
    numberWords[word] = 10**i

var Numbers = []POSTag{NUM}

var ProperNouns = []POSTag{PROPN}

var QuantifingMods = []DependencyType{NumMod}

var Symbols = []POSTag{SYM, PUNCT}

var Verbs = []POSTag{VERB}

Functions

func AllocTree

func AllocTree() depConsOpt

AllocTree allocates the lefts and rights. Typical construction of the *Dependency doesn't allocate the trees as they're not necessary for a number of tasks.
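
For example, a minimal sketch of constructing a *Dependency with the trees allocated up front, assuming as is an existing AnnotatedSentence:

import "github.com/chewxy/lingo"

// buildDep constructs a *Dependency from an annotated sentence,
// with the lefts and rights allocated as well.
func buildDep(as lingo.AnnotatedSentence) *lingo.Dependency {
	return lingo.NewDependency(
		lingo.FromAnnotatedSentence(as),
		lingo.AllocTree(),
	)
}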

func EqStringSlice

func EqStringSlice(a, b []string) bool

func FromAnnotatedSentence

func FromAnnotatedSentence(s AnnotatedSentence) depConsOpt

FromAnnotatedSentence creates a dependency from an AnnotatedSentence.

func InDepTypes

func InDepTypes(x DependencyType, set []DependencyType) bool

func InPOSTags

func InPOSTags(x POSTag, set []POSTag) bool

POSTag related functions

func InStringSlice

func InStringSlice(s string, l []string) bool

func IsAdjective

func IsAdjective(x POSTag) bool

func IsAdverb

func IsAdverb(x POSTag) bool

func IsCompound

func IsCompound(x DependencyType) bool

func IsDeterminer

func IsDeterminer(x POSTag) bool

func IsDeterminerRel

func IsDeterminerRel(x DependencyType) bool

func IsIN

func IsIN(x POSTag) bool

IsIN returns true if the POSTag is a subordinating conjunction. This function exists because the Stanford tagset uses the IN POSTag, while Universal Dependencies uses SCONJ.

func IsInterrogative

func IsInterrogative(x POSTag) bool

func IsModifier

func IsModifier(x DependencyType) bool

func IsMultiword

func IsMultiword(x DependencyType) bool

func IsNoun

func IsNoun(x POSTag) bool

func IsNumber

func IsNumber(x POSTag) bool

func IsProperNoun

func IsProperNoun(x POSTag) bool

func IsQuantifier

func IsQuantifier(x DependencyType) bool

func IsSymbol

func IsSymbol(x POSTag) bool

func IsVerb

func IsVerb(x POSTag) bool

func ReadCluster

func ReadCluster(r io.Reader) map[string]Cluster

ReadCluster reads Percy Liang's cluster file format and returns a map of strings to Cluster.
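
A minimal usage sketch; the file name here is hypothetical, and the file is assumed to be in the brown-cluster output format:

import (
	"log"
	"os"

	"github.com/chewxy/lingo"
)

// loadClusters reads a cluster file into a map of word to Cluster.
func loadClusters() map[string]lingo.Cluster {
	f, err := os.Open("paths.txt") // hypothetical cluster file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	return lingo.ReadCluster(f)
}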

func StringIs

func StringIs(s string, f is) bool

func UnescapeSpecials

func UnescapeSpecials(word string) string

Types

type AnnotatedSentence

type AnnotatedSentence []*Annotation

AnnotatedSentence is a sentence, but each word has been annotated.

func NewAnnotatedSentence

func NewAnnotatedSentence() AnnotatedSentence

func (AnnotatedSentence) Children

func (as AnnotatedSentence) Children(h int) (retVal []int)

func (AnnotatedSentence) Clone

func (AnnotatedSentence) Dependency

func (as AnnotatedSentence) Dependency() *Dependency

func (AnnotatedSentence) Edges

func (as AnnotatedSentence) Edges() (retVal []DependencyEdge)

func (AnnotatedSentence) Fix

func (as AnnotatedSentence) Fix()

func (AnnotatedSentence) Heads

func (as AnnotatedSentence) Heads() []int

Heads returns the head IDs of the sentence. The return value has exactly the same length as the sentence.

func (AnnotatedSentence) IDs

func (as AnnotatedSentence) IDs() []int

IDs returns the list of IDs in the sentence. The return value has exactly the same length as the sentence.

func (AnnotatedSentence) IsValid

func (as AnnotatedSentence) IsValid() bool

func (AnnotatedSentence) Labels

func (as AnnotatedSentence) Labels() []DependencyType

Labels returns the DependencyTypes of the sentence. The return value has exactly the same length as the sentence.

func (AnnotatedSentence) Leaves

func (as AnnotatedSentence) Leaves() (retVal []int)

Leaves returns the *Annotations which are leaves. If the dependency hasn't been set yet, every single *Annotation is a leaf.

func (AnnotatedSentence) LemmaString

func (as AnnotatedSentence) LemmaString() string

func (AnnotatedSentence) Lemmas

func (as AnnotatedSentence) Lemmas() []string

Lemmas returns the lemmas as a slice of string. The return value has exactly the same length as the sentence.

func (AnnotatedSentence) Len

func (as AnnotatedSentence) Len() int

sort interface

func (AnnotatedSentence) Less

func (as AnnotatedSentence) Less(i, j int) bool

func (AnnotatedSentence) LoweredString

func (as AnnotatedSentence) LoweredString() string

func (AnnotatedSentence) LoweredStringSlice

func (as AnnotatedSentence) LoweredStringSlice() []string

LoweredStringSlice returns the lowercased version of the words in the sentence as a slice of string. The return value has exactly the same length as the sentence.

func (AnnotatedSentence) MarshalJSON

func (as AnnotatedSentence) MarshalJSON() ([]byte, error)

func (AnnotatedSentence) Phrase

func (as AnnotatedSentence) Phrase(start, end int) (AnnotatedSentence, error)

Phrase returns a slice of the sentence. While you can do the same by simply doing as[start:end], this method returns errors instead of panicking.

func (AnnotatedSentence) SetID

func (as AnnotatedSentence) SetID()

func (AnnotatedSentence) StemString

func (as AnnotatedSentence) StemString() string

func (AnnotatedSentence) Stems

func (as AnnotatedSentence) Stems() []string

Stems returns the stems as a slice of string. The return value has exactly the same length as the sentence.

func (AnnotatedSentence) String

func (as AnnotatedSentence) String() string

func (AnnotatedSentence) StringSlice

func (as AnnotatedSentence) StringSlice() []string

StringSlice returns the original words as a slice of string. The return value has exactly the same length as the sentence.

func (AnnotatedSentence) Swap

func (as AnnotatedSentence) Swap(i, j int)

func (AnnotatedSentence) Tags

func (as AnnotatedSentence) Tags() []POSTag

Tags returns the POSTags of the sentence. The return value has exactly the same length as the sentence.

func (AnnotatedSentence) Tree

func (as AnnotatedSentence) Tree() *DependencyTree

func (*AnnotatedSentence) UnmarshalJSON

func (as *AnnotatedSentence) UnmarshalJSON(b []byte) error

func (AnnotatedSentence) ValueString

func (as AnnotatedSentence) ValueString() string

type Annotation

type Annotation struct {
	Lexeme
	POSTag

	// fields to do with an annotation being in a collection
	DependencyType
	ID   int
	Head *Annotation

	// info about the annotation itself
	Lemma   string
	Lowered string
	Stem    string

	// auxiliary data for processing
	Cluster
	Shape
	WordFlag
	// contains filtered or unexported fields
}

Annotation is a word and its metadata. This includes the position, its dependency head (if available), its lemma, POSTag, etc.

A collection of Annotations - an AnnotatedSentence - is also a representation of a dependency parse.

Every field is exported for easy gobbing. Be very careful when setting fields.

func AnnotationFromLexTag

func AnnotationFromLexTag(l Lexeme, t POSTag, f AnnotationFixer) *Annotation

AnnotationFromLexTag is only ever used in tests. The fixer is optional.

func NewAnnotation

func NewAnnotation() *Annotation

func NullAnnotation

func NullAnnotation() *Annotation

func RootAnnotation

func RootAnnotation() *Annotation

func StartAnnotation

func StartAnnotation() *Annotation

func StringToAnnotation

func StringToAnnotation(s string, f AnnotationFixer) *Annotation

func (*Annotation) Clone

func (a *Annotation) Clone() *Annotation

func (*Annotation) GoString

func (a *Annotation) GoString() string

func (*Annotation) HeadID

func (a *Annotation) HeadID() int

func (*Annotation) IsNumber

func (a *Annotation) IsNumber() bool

func (*Annotation) MarshalJSON

func (a *Annotation) MarshalJSON() ([]byte, error)

func (*Annotation) Process

func (a *Annotation) Process(f AnnotationFixer) error

func (*Annotation) SetHead

func (a *Annotation) SetHead(headAnn *Annotation)

func (*Annotation) String

func (a *Annotation) String() string

func (*Annotation) UnmarshalJSON

func (a *Annotation) UnmarshalJSON(b []byte) error

type AnnotationFixer

type AnnotationFixer interface {
	Lemmatizer
	Stemmer
	Clusters() (map[string]Cluster, error)
}
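
Any type that provides all three methods satisfies the interface. For illustration, here's a sketch of a no-op fixer (noopFixer is not part of the package):

import "github.com/chewxy/lingo"

// noopFixer is a hypothetical AnnotationFixer that leaves words untouched.
type noopFixer struct{}

func (noopFixer) Lemmatize(s string, _ lingo.POSTag) ([]string, error) { return []string{s}, nil }
func (noopFixer) Stem(s string) (string, error)                        { return s, nil }
func (noopFixer) Clusters() (map[string]lingo.Cluster, error)          { return map[string]lingo.Cluster{}, nil }

Such a fixer can then be passed to functions that take an AnnotationFixer, such as StringToAnnotation or (*Annotation).Process.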

type AnnotationSet

type AnnotationSet []*Annotation

func (AnnotationSet) Add

func (AnnotationSet) Contains

func (as AnnotationSet) Contains(a *Annotation) bool

func (AnnotationSet) Index

func (as AnnotationSet) Index(a *Annotation) int

func (AnnotationSet) Len

func (as AnnotationSet) Len() int

func (AnnotationSet) Less

func (as AnnotationSet) Less(i, j int) bool

func (AnnotationSet) Set

func (as AnnotationSet) Set() AnnotationSet

func (AnnotationSet) Swap

func (as AnnotationSet) Swap(i, j int)

type Cluster

type Cluster int

Cluster represents a Brown cluster.

type Corpus

type Corpus interface {
	// ID returns the ID of a word and whether or not it was found in the corpus
	Id(word string) (id int, ok bool)

	// Word returns the word given the ID, and whether or not it was found in the corpus
	Word(id int) (word string, ok bool)

	// Add adds a word to the corpus and returns its ID. If a word was previously in the corpus, it merely updates the frequency count and returns the ID
	Add(word string) int

	// Size returns the size of the corpus.
	Size() int

	// WordFreq returns the frequency of the word. If the word wasn't in the corpus, it returns 0.
	WordFreq(word string) int

	// IDFreq returns the frequency of a word given an ID. If the word isn't in the corpus it returns 0.
	IDFreq(id int) int

	// TotalFreq returns the total number of words ever seen by the corpus. This number includes the count of repeat words.
	TotalFreq() int

	// MaxWordLength returns the length of the longest known word in the corpus
	MaxWordLength() int

	// WordProb returns the probability of a word appearing in the corpus
	WordProb(word string) (float64, bool)

	// IO stuff
	gob.GobEncoder
	gob.GobDecoder
}

Corpus is the interface for the corpus.
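
Since the corpus is behind an interface, utility code can be written against any implementation. A minimal sketch:

import (
	"fmt"

	"github.com/chewxy/lingo"
)

// describeWord reports what a corpus knows about a word.
func describeWord(c lingo.Corpus, word string) {
	id, ok := c.Id(word)
	if !ok {
		fmt.Printf("%q is not in the corpus\n", word)
		return
	}
	fmt.Printf("%q: id=%d, freq=%d out of %d total words seen\n",
		word, id, c.WordFreq(word), c.TotalFreq())
}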

type Dependency

type Dependency struct {
	AnnotatedSentence
	// contains filtered or unexported fields
}

Dependency represents the dependency parse of a sentence. While an AnnotatedSentence already represents a dependency parse, *Dependency also contains meta information about the parse (specifically, the lefts and rights) that makes working with the dependency a lot faster.

The fields are mostly left unexported for a good reason - a dependency parse SHOULD be static after it's been built.

func NewDependency

func NewDependency(opts ...depConsOpt) *Dependency

NewDependency creates a new *Dependency. It takes optional construction options:

FromAnnotatedSentence
AllocTree

func (*Dependency) AddArc

func (d *Dependency) AddArc(head, child int, label DependencyType)

func (*Dependency) AddChild

func (d *Dependency) AddChild(head, child int)

func (*Dependency) AddRel

func (d *Dependency) AddRel(child int, rel DependencyType)

func (*Dependency) Annotation

func (d *Dependency) Annotation(i int) *Annotation

func (*Dependency) HasSingleRoot

func (d *Dependency) HasSingleRoot() bool

func (*Dependency) Head

func (d *Dependency) Head(i int) int

func (*Dependency) IsLegal

func (d *Dependency) IsLegal() bool

func (*Dependency) IsProjective

func (d *Dependency) IsProjective() bool

func (*Dependency) Label

func (d *Dependency) Label(i int) DependencyType

func (*Dependency) Lefts

func (d *Dependency) Lefts() [][]int

func (*Dependency) N

func (d *Dependency) N() int

func (*Dependency) Rights

func (d *Dependency) Rights() [][]int

func (*Dependency) Root

func (d *Dependency) Root() int

func (*Dependency) Sentence

func (d *Dependency) Sentence() AnnotatedSentence

func (*Dependency) SetLefts

func (d *Dependency) SetLefts(l [][]int)

Please only use SetLefts and SetRights for testing.

func (*Dependency) SetRights

func (d *Dependency) SetRights(r [][]int)

func (*Dependency) SprintRel

func (d *Dependency) SprintRel() string

func (*Dependency) WordCount

func (d *Dependency) WordCount() int

type DependencyEdge

type DependencyEdge struct {
	Gov *Annotation
	Dep *Annotation
	Rel DependencyType
}

type DependencyTree

type DependencyTree struct {
	Parent *DependencyTree

	ID   int            // the word number in a sentence
	Type DependencyType // refers to the dependency type to the parent
	Word *Annotation

	Children []*DependencyTree
}

A DependencyTree is an alternate form of representing a dependency parse. This form makes it easier to traverse the tree

func NewDependencyTree

func NewDependencyTree(parent *DependencyTree, ID int, ann *Annotation) *DependencyTree

func (*DependencyTree) AddChild

func (d *DependencyTree) AddChild(child *DependencyTree)

func (*DependencyTree) AddRel

func (d *DependencyTree) AddRel(rel DependencyType)

func (*DependencyTree) Dot

func (d *DependencyTree) Dot() string

func (*DependencyTree) Walk

func (d *DependencyTree) Walk(fn func(interface{}))
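
Here's a sketch of traversing a tree with Walk. The callback receives an interface{}; the type switch below assumes the visited values are *DependencyTree nodes and simply skips anything else:

import (
	"fmt"

	"github.com/chewxy/lingo"
)

// printTree prints the ID, relation and word of every node visited.
func printTree(root *lingo.DependencyTree) {
	root.Walk(func(v interface{}) {
		if n, ok := v.(*lingo.DependencyTree); ok {
			fmt.Println(n.ID, n.Type, n.Word)
		}
	})
}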

type DependencyType

type DependencyType byte

DependencyType represents the relation between two words

const (
	NoDepType DependencyType = iota
	Dep
	Root

	// nominal dependencies
	NSubj
	NSubjPass
	DObj
	IObj

	// predicate dependencies
	CSubj
	CSubjPass
	CComp

	XComp

	// nominal dependencies
	NumMod
	Appos
	NMod

	// predicate dependencies
	ACl
	ACl_RelCl // RCMod in stanford deps
	Det
	Det_PreDet

	// modifier word
	AMod
	Neg

	// Case Marking, preposition, possessive
	Case

	// Nominal dependencies
	NMod_NPMod
	NMod_TMod
	NMod_Poss

	// Predicate Dependencies
	AdvCl

	// Modifier Word
	AdvMod

	// Compounding and Unanalyzed
	Compound
	Compound_Part
	Name // Unused in English
	MWE
	Foreign  // Unused in English
	GoesWith // Unused in English

	// Loose Joining Relations
	List
	Dislocated // Unused in English
	Parataxis
	Remnant    // Unused in English
	Reparandum // Unused in English

	// Nominal Dependent
	Vocative // Unused in English
	Discourse
	Expl

	// Auxiliary
	Aux
	AuxPass
	Cop

	// Other
	Mark
	Punct

	Conj
	Coordination // CC
	CC_PreConj

	MAXDEPTYPE
)

http://universaldependencies.github.io/docs/en/dep/all.html

func (DependencyType) MarshalText

func (dt DependencyType) MarshalText() ([]byte, error)

func (DependencyType) String

func (i DependencyType) String() string

func (*DependencyType) UnmarshalText

func (dt *DependencyType) UnmarshalText(text []byte) error

type DependencyTypeSet

type DependencyTypeSet [MAXDEPTYPE]bool

DependencyTypeSet is a set of all the DependencyTypes

func (DependencyTypeSet) String

func (dts DependencyTypeSet) String() string

type Lemmatizer

type Lemmatizer interface {
	Lemmatize(string, POSTag) ([]string, error)
}

Lemmatizer is anything that can lemmatize

type Lexeme

type Lexeme struct {
	Value      string
	LexemeType LexemeType

	Line int
	Col  int
	Pos  int
}

func MakeLexeme

func MakeLexeme(s string, t LexemeType) Lexeme

func NullLexeme

func NullLexeme() Lexeme

func RootLexeme

func RootLexeme() Lexeme

func StartLexeme

func StartLexeme() Lexeme

func (Lexeme) Fix

func (l Lexeme) Fix() Lexeme

func (Lexeme) Flags

func (l Lexeme) Flags() WordFlag

func (Lexeme) GoString

func (l Lexeme) GoString() string

func (Lexeme) Shape

func (l Lexeme) Shape() Shape

func (Lexeme) String

func (l Lexeme) String() string

type LexemeSentence

type LexemeSentence []Lexeme

A LexemeSentence is a sentence of Lexemes.

func NewLexemeSentence

func NewLexemeSentence() LexemeSentence

func (LexemeSentence) String

func (ls LexemeSentence) String() string

type LexemeType

type LexemeType byte
const (
	EOF LexemeType = iota
	Word
	Disambig
	URI
	Number
	Date
	Time
	Punctuation
	Symbol
	Space
	SystemUse
)

func (LexemeType) String

func (i LexemeType) String() string

type POSTag

type POSTag byte

POSTag represents a Part of Speech Tag.

const (
	X POSTag = iota // aka NULLTAG
	UNKNOWN_TAG
	ROOT_TAG
	ADJ
	ADP
	ADV
	AUX
	CONJ
	DET
	INTJ
	NOUN
	NUM
	PART
	PRON
	PROPN
	PUNCT
	SCONJ
	SYM
	VERB

	MAXTAG // MAXTAG is provided here as index support
)

func POSTagShortcut

func POSTagShortcut(l Lexeme) (POSTag, bool)

POSTagShortcut is a shortcut function that helps the POS tagger short-circuit some decisions about what the tag is.
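
For example, a lexeme that is already known to be a number presumably doesn't need the statistical tagger. A sketch; the exact shortcut behaviour is up to the implementation:

import (
	"fmt"

	"github.com/chewxy/lingo"
)

func main() {
	l := lingo.MakeLexeme("123", lingo.Number)
	if t, ok := lingo.POSTagShortcut(l); ok {
		fmt.Println(t) // tag decided without consulting a statistical model
	}
}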

func (POSTag) MarshalText

func (p POSTag) MarshalText() ([]byte, error)

func (POSTag) String

func (i POSTag) String() string

func (*POSTag) UnmarshalText

func (p *POSTag) UnmarshalText(text []byte) error

type Sentencer

type Sentencer interface {
	Sentence() AnnotatedSentence
}

Sentencer is anything that returns an AnnotatedSentence

type Shape

type Shape string

Shape represents the shape of a word. It's currently implemented as a string

type Stemmer

type Stemmer interface {
	Stem(string) (string, error)
}

Stemmer is anything that can stem

type TagSet

type TagSet [MAXTAG]bool

TagSet is a set of all the POSTags

func (TagSet) String

func (ts TagSet) String() string

type WordEmbeddings

type WordEmbeddings interface {
	Corpus

	// WordVector returns a vector of embeddings given the word
	WordVector(word string) (vec tensor.Tensor, err error)

	// Vector returns a vector of embeddings given the word ID
	Vector(id int) (vec tensor.Tensor, err error)

	// Embedding returns the matrix
	Embedding() tensor.Tensor
}

WordEmbeddings is any type that is both a Corpus and able to return word vectors.
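
A minimal sketch of retrieving a vector, assuming a concrete implementation is at hand (vec satisfies the tensor.Tensor interface from gorgonia.org/tensor):

import (
	"fmt"
	"log"

	"github.com/chewxy/lingo"
)

// showVector prints the shape of a word's embedding vector.
func showVector(we lingo.WordEmbeddings, word string) {
	vec, err := we.WordVector(word)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(word, vec.Shape())
}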

type WordFlag

type WordFlag uint32

WordFlag represents a property a word may have. A word may have multiple flags.

const (
	NoFlag WordFlag = iota
	IsLetter
	IsAscii
	IsDigit
	IsLower
	IsPunct
	IsSpace
	IsTitle
	IsUpper
	LikeURL
	LikeNum
	LikeEmail
	IsStopWord
	IsOOV // for ner

	MAXFLAG
)

func (WordFlag) String

func (f WordFlag) String() string

Directories

Path Synopsis
cmd
dep
pos
