
README

lingo


Package lingo provides the data structures and algorithms required for natural language processing.

Specifically, it provides a POS Tagger (lingo/pos), a Dependency Parser (lingo/dep), and a basic tokenizer (lingo/lexer) for English. It also provides data structures for holding corpora (lingo/corpus) and treebanks (lingo/treebank).

The aim of this package is to provide a production-quality pipeline for natural language processing.

Install

The package is go-gettable: go get -u github.com/chewxy/lingo

This package and its subpackages depend on very few external packages. Here they are:

  • gorgonia: machine learning. Vital - it won't be hard to rewrite, but why? Same author. Gorgonia licence (Apache 2.0-like).
  • gographviz: visualization of annotations, and other graph-related visualizations. Vital for visualizations, which are a nice-to-have feature. API last changed 12th April 2017. gographviz licence (Apache 2.0).
  • errors: errors. The package won't die without it, but it's a very nice-to-have. Stable API for the past year. errors licence (MIT/BSD-like).
  • set: set operations. Can be easily replaced. Stable API for the past year. set licence (MIT/BSD-like).

Usage

See the individual packages for usage. There are also a number of executables in the cmd directory. They're meant as examples of how a natural language processing pipeline can be set up.

A natural language pipeline with this package is heavily channel-driven. Here's an example of dependency parsing (posModel and depModel are assumed to have been loaded or trained elsewhere):

package main

import (
	"fmt"
	"strings"

	"github.com/chewxy/lingo/dep"
	"github.com/chewxy/lingo/lexer"
	"github.com/chewxy/lingo/pos"
)

func main() {
	inputString := `The cat sat on the mat`
	// posModel and depModel are assumed to have been loaded or trained elsewhere.
	lx := lexer.New("dummy", strings.NewReader(inputString)) // lexer - required to break a sentence up into words.
	pt := pos.New(pos.WithModel(posModel))                   // POS Tagger - required to tag the words with a part of speech tag.
	dp := dep.New(depModel)                                  // Creates a new parser

	// set up a pipeline
	pt.Input = lx.Output
	dp.Input = pt.Output

	// run all
	go lx.Run()
	go pt.Run()
	go dp.Run()

	// wait to receive:
	for {
		select {
		case d := <-dp.Output:
			fmt.Println(d) // do something with the parse
		case err := <-dp.Error:
			fmt.Println(err) // handle the error
		}
	}
}

How It Works

For specific tasks (POS tagging, parsing, named entity recognition, etc.), refer to the README of each subpackage. This package on its own mainly provides the data structures that the subpackages use.

Perhaps the most important data structure is the *Annotation structure, which holds a word and the associated metadata for the word.

For dependency parses, the graph takes three forms: *Dependency, *DependencyTree and *Annotation. All three forms are convertible from one to another. TODO: explain rationale behind each data type.
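
As a quick illustration, here's a minimal sketch of moving between the forms using the Dependency, Tree, and Sentence methods documented below, assuming as is an AnnotatedSentence obtained from a parse:

import (
	"fmt"

	"github.com/chewxy/lingo"
)

// showForms converts an AnnotatedSentence into the other two forms and back.
func showForms(as lingo.AnnotatedSentence) {
	d := as.Dependency() // AnnotatedSentence -> *Dependency
	t := as.Tree()       // AnnotatedSentence -> *DependencyTree
	back := d.Sentence() // *Dependency -> AnnotatedSentence
	fmt.Println(d, t, back)
}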

Quirks

Very Oddly Specific POS Tags and Dependency Rel Types

A particular quirk you may have noticed is that the POSTag and DependencyType are hardcoded in as constants. This package in fact provides two variations of each: one from Stanford/Penn Treebank and one from Universal Dependencies.

These are hardcoded mainly for performance reasons - knowing ahead of time how much to allocate saves the program a lot of additional work. It also reduces the chances of mutating a global variable.

Of course, this comes as a tradeoff - programs are limited to these two options. Thankfully there are only a limited number of POS tag and dependency relation types. Two of the most popular ones (Stanford/PTB and Universal Dependencies) have been implemented.

The following build tags are supported:

  • stanfordtags
  • universaltags
  • stanfordrel
  • universalrel

To use a specific tagset or relset, build your program thusly: go build -tags='stanfordtags'.

The default tag and dependency relation types are the Universal Dependencies versions.

Lexer

You should also note that the tokenizer, lingo/lexer, is not your usual run-of-the-mill NLP tokenizer. It tokenizes by space, with some specific rules for English. It was inspired by Rob Pike's talk on lexers; I thought it'd be cool to write something like that for NLP.

The test cases in package lingo/lexer showcase how it handles unicode and other pathological English.
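
Here's a minimal sketch of using the lexer on its own, assuming (as the pipeline example above suggests) that the lexer closes its Output channel once the input is exhausted:

import (
	"fmt"
	"strings"

	"github.com/chewxy/lingo/lexer"
)

// printLexemes runs the lexer over a string and prints each Lexeme.
func printLexemes(s string) {
	lx := lexer.New("example", strings.NewReader(s))
	go lx.Run()
	for l := range lx.Output {
		fmt.Println(l) // value, type and position of each Lexeme
	}
}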

Contributing

See CONTRIBUTING.md for more info.

Licence

This package is licenced under the MIT licence.

Documentation

Overview

Package lingo provides the data structures and algorithms required for natural language processing.

Index

Constants

const BUILD_RELSET = "universalrel"

const BUILD_TAGSET = "universaltags"

Variables

var Adjectives = []POSTag{ADJ}

var Adverbs = []POSTag{ADV}

var DeterminerRels = []DependencyType{Det, Det_PreDet}

var Determiners = []POSTag{DET}

var Interrogatives = []POSTag{PRON, DET, ADV}

var Modifiers = []DependencyType{AMod}

var Nouns = []POSTag{NOUN, PROPN}

var NumberWords = map[string]int{
	"zero":        0,
	"one":         1,
	"two":         2,
	"three":       3,
	"four":        4,
	"five":        5,
	"six":         6,
	"seven":       7,
	"eight":       8,
	"nine":        9,
	"ten":         10,
	"eleven":      11,
	"twelve":      12,
	"thirteen":    13,
	"fourteen":    14,
	"fifteen":     15,
	"sixteen":     16,
	"nineteen":    19,
	"seventeen":   17,
	"eighteen":    18,
	"twenty":      20,
	"thirty":      30,
	"forty":       40,
	"fifty":       50,
	"sixty":       60,
	"seventy":     70,
	"eighty":      80,
	"ninety":      90,
	"hundred":     100,
	"thousand":    1000,
	"million":     1000000,
	"billion":     1000000000,
	"trillion":    1000000000000,
	"quadrillion": 1000000000000000,
}

NumberWords was generated with this Python code:

numberWords = {}

simple = '''zero one two three four five six seven eight nine ten eleven twelve
            thirteen fourteen fifteen sixteen seventeen eighteen nineteen
            twenty'''.split()
for i, word in zip(xrange(0, 20+1), simple):
    numberWords[word] = i

tense = '''thirty forty fifty sixty seventy eighty ninety hundred'''.split()
for i, word in zip(xrange(30, 100+1, 10), tense):
    numberWords[word] = i

larges = '''thousand million billion trillion quadrillion quintillion sextillion septillion'''.split()
for i, word in zip(xrange(3, 24+1, 3), larges):
    numberWords[word] = 10**i

var Numbers = []POSTag{NUM}

var ProperNouns = []POSTag{PROPN}

var QuantifingMods = []DependencyType{NumMod}

var Symbols = []POSTag{SYM, PUNCT}

var Verbs = []POSTag{VERB}

Functions

func AllocTree

func AllocTree() depConsOpt

AllocTree allocates the lefts and rights. Typical construction of the *Dependency doesn't allocate the trees as they're not necessary for a number of tasks.
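
For example, a minimal sketch of constructing a *Dependency with the trees allocated up front, assuming as is an existing AnnotatedSentence:

import "github.com/chewxy/lingo"

// buildDep constructs a *Dependency from an annotated sentence,
// with the lefts and rights allocated as well.
func buildDep(as lingo.AnnotatedSentence) *lingo.Dependency {
	return lingo.NewDependency(
		lingo.FromAnnotatedSentence(as),
		lingo.AllocTree(),
	)
}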

func EqStringSlice

func EqStringSlice(a, b []string) bool

func FromAnnotatedSentence

func FromAnnotatedSentence(s AnnotatedSentence) depConsOpt

FromAnnotatedSentence creates a dependency from an AnnotatedSentence.

func InDepTypes

func InDepTypes(x DependencyType, set []DependencyType) bool

func InPOSTags

func InPOSTags(x POSTag, set []POSTag) bool

POSTag related functions

func InStringSlice

func InStringSlice(s string, l []string) bool

func IsAdjective

func IsAdjective(x POSTag) bool

func IsAdverb

func IsAdverb(x POSTag) bool

func IsCompound

func IsCompound(x DependencyType) bool

func IsDeterminer

func IsDeterminer(x POSTag) bool

func IsDeterminerRel

func IsDeterminerRel(x DependencyType) bool

func IsIN

func IsIN(x POSTag) bool

IsIN returns true if the POSTag is a subordinating conjunction. This function exists because the Stanford tagset uses the IN POSTag, while Universal Dependencies uses SCONJ.

func IsInterrogative

func IsInterrogative(x POSTag) bool

func IsModifier

func IsModifier(x DependencyType) bool

func IsMultiword

func IsMultiword(x DependencyType) bool

func IsNoun

func IsNoun(x POSTag) bool

func IsNumber

func IsNumber(x POSTag) bool

func IsProperNoun

func IsProperNoun(x POSTag) bool

func IsQuantifier

func IsQuantifier(x DependencyType) bool

func IsSymbol

func IsSymbol(x POSTag) bool

func IsVerb

func IsVerb(x POSTag) bool

func ReadCluster

func ReadCluster(r io.Reader) map[string]Cluster

ReadCluster reads Percy Liang's cluster file format and returns a map of strings to Cluster.
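
A minimal usage sketch; the file name here is hypothetical, and the file is assumed to be in the brown-cluster output format:

import (
	"log"
	"os"

	"github.com/chewxy/lingo"
)

// loadClusters reads a cluster file into a map of word to Cluster.
func loadClusters() map[string]lingo.Cluster {
	f, err := os.Open("paths.txt") // hypothetical cluster file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	return lingo.ReadCluster(f)
}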

func StringIs

func StringIs(s string, f is) bool

func UnescapeSpecials

func UnescapeSpecials(word string) string

Types

type AnnotatedSentence

type AnnotatedSentence []*Annotation

AnnotatedSentence is a sentence, but each word has been annotated.

func NewAnnotatedSentence

func NewAnnotatedSentence() AnnotatedSentence

func (AnnotatedSentence) Children

func (as AnnotatedSentence) Children(h int) (retVal []int)

func (AnnotatedSentence) Clone

func (AnnotatedSentence) Dependency

func (as AnnotatedSentence) Dependency() *Dependency

func (AnnotatedSentence) Edges

func (as AnnotatedSentence) Edges() (retVal []DependencyEdge)

func (AnnotatedSentence) Fix

func (as AnnotatedSentence) Fix()

func (AnnotatedSentence) Heads

func (as AnnotatedSentence) Heads() []int

Heads returns the head IDs of the sentence. The return value has exactly the same length as the sentence.

func (AnnotatedSentence) IDs

func (as AnnotatedSentence) IDs() []int

IDs returns the list of IDs in the sentence. The return value has exactly the same length as the sentence.

func (AnnotatedSentence) IsValid

func (as AnnotatedSentence) IsValid() bool

func (AnnotatedSentence) Labels

func (as AnnotatedSentence) Labels() []DependencyType

Labels returns the DependencyTypes of the sentence. The return value has exactly the same length as the sentence.

func (AnnotatedSentence) Leaves

func (as AnnotatedSentence) Leaves() (retVal []int)

Leaves returns the *Annotations which are leaves. If the dependency hasn't been set yet, every single *Annotation is a leaf.

func (AnnotatedSentence) LemmaString

func (as AnnotatedSentence) LemmaString() string

func (AnnotatedSentence) Lemmas

func (as AnnotatedSentence) Lemmas() []string

Lemmas returns the lemmas as a slice of string. The return value has exactly the same length as the sentence.

func (AnnotatedSentence) Len

func (as AnnotatedSentence) Len() int

sort interface

func (AnnotatedSentence) Less

func (as AnnotatedSentence) Less(i, j int) bool

func (AnnotatedSentence) LoweredString

func (as AnnotatedSentence) LoweredString() string

func (AnnotatedSentence) LoweredStringSlice

func (as AnnotatedSentence) LoweredStringSlice() []string

LoweredStringSlice returns the lowercased version of the words in the sentence as a slice of string. The return value has exactly the same length as the sentence.

func (AnnotatedSentence) MarshalJSON

func (as AnnotatedSentence) MarshalJSON() ([]byte, error)

func (AnnotatedSentence) Phrase

func (as AnnotatedSentence) Phrase(start, end int) (AnnotatedSentence, error)

Phrase returns a slice of the sentence. While you can do the same by simply doing as[start:end], this method returns errors instead of panicking.

func (AnnotatedSentence) SetID

func (as AnnotatedSentence) SetID()

func (AnnotatedSentence) StemString

func (as AnnotatedSentence) StemString() string

func (AnnotatedSentence) Stems

func (as AnnotatedSentence) Stems() []string

Stems returns the stems as a slice of string. The return value has exactly the same length as the sentence.

func (AnnotatedSentence) String

func (as AnnotatedSentence) String() string

func (AnnotatedSentence) StringSlice

func (as AnnotatedSentence) StringSlice() []string

StringSlice returns the original words as a slice of string. The return value has exactly the same length as the sentence.

func (AnnotatedSentence) Swap

func (as AnnotatedSentence) Swap(i, j int)

func (AnnotatedSentence) Tags

func (as AnnotatedSentence) Tags() []POSTag

Tags returns the POSTags of the sentence. The return value has exactly the same length as the sentence.

func (AnnotatedSentence) Tree

func (as AnnotatedSentence) Tree() *DependencyTree

func (*AnnotatedSentence) UnmarshalJSON

func (as *AnnotatedSentence) UnmarshalJSON(b []byte) error

func (AnnotatedSentence) ValueString

func (as AnnotatedSentence) ValueString() string

type Annotation

type Annotation struct {
	Lexeme
	POSTag

	// fields to do with an annotation being in a collection
	DependencyType
	ID   int
	Head *Annotation

	// info about the annotation itself
	Lemma   string
	Lowered string
	Stem    string

	// auxiliary data for processing
	Cluster
	Shape
	WordFlag
	// contains filtered or unexported fields
}

Annotation is a word and its metadata. This includes the position, its dependency head (if available), its lemma, POSTag, etc.

A collection of Annotations - an AnnotatedSentence - is also a representation of a dependency parse.

Every field is exported for easy gobbing. Be very careful when setting fields.

func AnnotationFromLexTag

func AnnotationFromLexTag(l Lexeme, t POSTag, f AnnotationFixer) *Annotation

AnnotationFromLexTag is only ever used in tests. The fixer is optional.

func NewAnnotation

func NewAnnotation() *Annotation

func NullAnnotation

func NullAnnotation() *Annotation

func RootAnnotation

func RootAnnotation() *Annotation

func StartAnnotation

func StartAnnotation() *Annotation

func StringToAnnotation

func StringToAnnotation(s string, f AnnotationFixer) *Annotation

func (*Annotation) Clone

func (a *Annotation) Clone() *Annotation

func (*Annotation) GoString

func (a *Annotation) GoString() string

func (*Annotation) HeadID

func (a *Annotation) HeadID() int

func (*Annotation) IsNumber

func (a *Annotation) IsNumber() bool

func (*Annotation) MarshalJSON

func (a *Annotation) MarshalJSON() ([]byte, error)

func (*Annotation) Process

func (a *Annotation) Process(f AnnotationFixer) error

func (*Annotation) SetHead

func (a *Annotation) SetHead(headAnn *Annotation)

func (*Annotation) String

func (a *Annotation) String() string

func (*Annotation) UnmarshalJSON

func (a *Annotation) UnmarshalJSON(b []byte) error

type AnnotationFixer

type AnnotationFixer interface {
	Lemmatizer
	Stemmer
	Clusters() (map[string]Cluster, error)
}
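
Any type that provides all three methods satisfies the interface. For illustration, here's a sketch of a no-op fixer (noopFixer is not part of the package):

import "github.com/chewxy/lingo"

// noopFixer is a hypothetical AnnotationFixer that leaves words untouched.
type noopFixer struct{}

func (noopFixer) Lemmatize(s string, _ lingo.POSTag) ([]string, error) { return []string{s}, nil }
func (noopFixer) Stem(s string) (string, error)                        { return s, nil }
func (noopFixer) Clusters() (map[string]lingo.Cluster, error)          { return map[string]lingo.Cluster{}, nil }

Such a fixer can then be passed to functions that take an AnnotationFixer, such as StringToAnnotation or (*Annotation).Process.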

type AnnotationSet

type AnnotationSet []*Annotation

func (AnnotationSet) Add

func (AnnotationSet) Contains

func (as AnnotationSet) Contains(a *Annotation) bool

func (AnnotationSet) Index

func (as AnnotationSet) Index(a *Annotation) int

func (AnnotationSet) Len

func (as AnnotationSet) Len() int

func (AnnotationSet) Less

func (as AnnotationSet) Less(i, j int) bool

func (AnnotationSet) Set

func (as AnnotationSet) Set() AnnotationSet

func (AnnotationSet) Swap

func (as AnnotationSet) Swap(i, j int)

type Cluster

type Cluster int

Cluster represents a Brown cluster.

type Corpus

type Corpus interface {
	// ID returns the ID of a word and whether or not it was found in the corpus
	Id(word string) (id int, ok bool)

	// Word returns the word given the ID, and whether or not it was found in the corpus
	Word(id int) (word string, ok bool)

	// Add adds a word to the corpus and returns its ID. If a word was previously in the corpus, it merely updates the frequency count and returns the ID
	Add(word string) int

	// Size returns the size of the corpus.
	Size() int

	// WordFreq returns the frequency of the word. If the word wasn't in the corpus, it returns 0.
	WordFreq(word string) int

	// IDFreq returns the frequency of a word given an ID. If the word isn't in the corpus it returns 0.
	IDFreq(id int) int

	// TotalFreq returns the total number of words ever seen by the corpus. This number includes the count of repeat words.
	TotalFreq() int

	// MaxWordLength returns the length of the longest known word in the corpus
	MaxWordLength() int

	// WordProb returns the probability of a word appearing in the corpus
	WordProb(word string) (float64, bool)

	// IO stuff
	gob.GobEncoder
	gob.GobDecoder
}

Corpus is the interface for the corpus.
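
Since the corpus is behind an interface, utility code can be written against any implementation. A minimal sketch:

import (
	"fmt"

	"github.com/chewxy/lingo"
)

// describeWord reports what a corpus knows about a word.
func describeWord(c lingo.Corpus, word string) {
	id, ok := c.Id(word)
	if !ok {
		fmt.Printf("%q is not in the corpus\n", word)
		return
	}
	fmt.Printf("%q: id=%d, freq=%d out of %d total words seen\n",
		word, id, c.WordFreq(word), c.TotalFreq())
}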

type Dependency

type Dependency struct {
	AnnotatedSentence
	// contains filtered or unexported fields
}

Dependency represents the dependency parse of a sentence. While an AnnotatedSentence already represents a dependency parse, *Dependency also contains meta information about the parse (specifically, the lefts and rights) that makes working with the dependency a lot faster.

The fields are mostly left unexported for a good reason - a dependency parse SHOULD be static after it's been built.

func NewDependency

func NewDependency(opts ...depConsOpt) *Dependency

NewDependency creates a new *Dependency. It takes optional construction options:

FromAnnotatedSentence
AllocTree

func (*Dependency) AddArc

func (d *Dependency) AddArc(head, child int, label DependencyType)

func (*Dependency) AddChild

func (d *Dependency) AddChild(head, child int)

func (*Dependency) AddRel

func (d *Dependency) AddRel(child int, rel DependencyType)

func (*Dependency) Annotation

func (d *Dependency) Annotation(i int) *Annotation

func (*Dependency) HasSingleRoot

func (d *Dependency) HasSingleRoot() bool

func (*Dependency) Head

func (d *Dependency) Head(i int) int

func (*Dependency) IsLegal

func (d *Dependency) IsLegal() bool

func (*Dependency) IsProjective

func (d *Dependency) IsProjective() bool

func (*Dependency) Label

func (d *Dependency) Label(i int) DependencyType

func (*Dependency) Lefts

func (d *Dependency) Lefts() [][]int

func (*Dependency) N

func (d *Dependency) N() int

func (*Dependency) Rights

func (d *Dependency) Rights() [][]int

func (*Dependency) Root

func (d *Dependency) Root() int

func (*Dependency) Sentence

func (d *Dependency) Sentence() AnnotatedSentence

func (*Dependency) SetLefts

func (d *Dependency) SetLefts(l [][]int)

Please only use SetLefts and SetRights for testing.

func (*Dependency) SetRights

func (d *Dependency) SetRights(r [][]int)

func (*Dependency) SprintRel

func (d *Dependency) SprintRel() string

func (*Dependency) WordCount

func (d *Dependency) WordCount() int

type DependencyEdge

type DependencyEdge struct {
	Gov *Annotation
	Dep *Annotation
	Rel DependencyType
}

type DependencyTree

type DependencyTree struct {
	Parent *DependencyTree

	ID   int            // the word number in a sentence
	Type DependencyType // refers to the dependency type to the parent
	Word *Annotation

	Children []*DependencyTree
}

A DependencyTree is an alternate form of representing a dependency parse. This form makes it easier to traverse the tree

func NewDependencyTree

func NewDependencyTree(parent *DependencyTree, ID int, ann *Annotation) *DependencyTree

func (*DependencyTree) AddChild

func (d *DependencyTree) AddChild(child *DependencyTree)

func (*DependencyTree) AddRel

func (d *DependencyTree) AddRel(rel DependencyType)

func (*DependencyTree) Dot

func (d *DependencyTree) Dot() string

func (*DependencyTree) Walk

func (d *DependencyTree) Walk(fn func(interface{}))
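
Here's a sketch of traversing a tree with Walk. The callback receives an interface{}; the type switch below assumes the visited values are *DependencyTree nodes and simply skips anything else:

import (
	"fmt"

	"github.com/chewxy/lingo"
)

// printTree prints the ID, relation and word of every node visited.
func printTree(root *lingo.DependencyTree) {
	root.Walk(func(v interface{}) {
		if n, ok := v.(*lingo.DependencyTree); ok {
			fmt.Println(n.ID, n.Type, n.Word)
		}
	})
}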

type DependencyType

type DependencyType byte

DependencyType represents the relation between two words

const (
	NoDepType DependencyType = iota
	Dep
	Root

	// nominal dependencies
	NSubj
	NSubjPass
	DObj
	IObj

	// predicate dependencies
	CSubj
	CSubjPass
	CComp

	XComp

	// nominal dependencies
	NumMod
	Appos
	NMod

	// predicate dependencies
	ACl
	ACl_RelCl // RCMod in stanford deps
	Det
	Det_PreDet

	// modifier word
	AMod
	Neg

	// Case Marking, preposition, possessive
	Case

	// Nominal dependencies
	NMod_NPMod
	NMod_TMod
	NMod_Poss

	// Predicate Dependencies
	AdvCl

	// Modifier Word
	AdvMod

	// Compounding and Unanalyzed
	Compound
	Compound_Part
	Name // Unused in English
	MWE
	Foreign  // Unused in English
	GoesWith // Unused in English

	// Loose Joining Relations
	List
	Dislocated // Unused in English
	Parataxis
	Remnant    // Unused in English
	Reparandum // Unused in English

	// Nominal Dependent
	Vocative // Unused in English
	Discourse
	Expl

	// Auxiliary
	Aux
	AuxPass
	Cop

	// Other
	Mark
	Punct

	Conj
	Coordination // CC
	CC_PreConj

	MAXDEPTYPE
)

http://universaldependencies.github.io/docs/en/dep/all.html

func (DependencyType) MarshalText

func (dt DependencyType) MarshalText() ([]byte, error)

func (DependencyType) String

func (i DependencyType) String() string

func (*DependencyType) UnmarshalText

func (dt *DependencyType) UnmarshalText(text []byte) error

type DependencyTypeSet

type DependencyTypeSet [MAXDEPTYPE]bool

DependencyTypeSet is a set of all the DependencyTypes

func (DependencyTypeSet) String

func (dts DependencyTypeSet) String() string

type Lemmatizer

type Lemmatizer interface {
	Lemmatize(string, POSTag) ([]string, error)
}

Lemmatizer is anything that can lemmatize

type Lexeme

type Lexeme struct {
	Value      string
	LexemeType LexemeType

	Line int
	Col  int
	Pos  int
}

func MakeLexeme

func MakeLexeme(s string, t LexemeType) Lexeme

func NullLexeme

func NullLexeme() Lexeme

func RootLexeme

func RootLexeme() Lexeme

func StartLexeme

func StartLexeme() Lexeme

func (Lexeme) Fix

func (l Lexeme) Fix() Lexeme

func (Lexeme) Flags

func (l Lexeme) Flags() WordFlag

func (Lexeme) GoString

func (l Lexeme) GoString() string

func (Lexeme) Shape

func (l Lexeme) Shape() Shape

func (Lexeme) String

func (l Lexeme) String() string

type LexemeSentence

type LexemeSentence []Lexeme

A LexemeSentence is a sentence of Lexemes.

func NewLexemeSentence

func NewLexemeSentence() LexemeSentence

func (LexemeSentence) String

func (ls LexemeSentence) String() string

type LexemeType

type LexemeType byte
const (
	EOF LexemeType = iota
	Word
	Disambig
	URI
	Number
	Date
	Time
	Punctuation
	Symbol
	Space
	SystemUse
)

func (LexemeType) String

func (i LexemeType) String() string

type POSTag

type POSTag byte

POSTag represents a Part of Speech Tag.

const (
	X POSTag = iota // aka NULLTAG
	UNKNOWN_TAG
	ROOT_TAG
	ADJ
	ADP
	ADV
	AUX
	CONJ
	DET
	INTJ
	NOUN
	NUM
	PART
	PRON
	PROPN
	PUNCT
	SCONJ
	SYM
	VERB

	MAXTAG // MAXTAG is provided here as index support
)

func POSTagShortcut

func POSTagShortcut(l Lexeme) (POSTag, bool)

POSTagShortcut is a shortcut function that helps the POS tagger short-circuit some decisions about what the tag is.
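
For example, a lexeme that is already known to be a number presumably doesn't need the statistical tagger. A sketch; the exact shortcut behaviour is up to the implementation:

import (
	"fmt"

	"github.com/chewxy/lingo"
)

func main() {
	l := lingo.MakeLexeme("123", lingo.Number)
	if t, ok := lingo.POSTagShortcut(l); ok {
		fmt.Println(t) // tag decided without consulting a statistical model
	}
}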

func (POSTag) MarshalText

func (p POSTag) MarshalText() ([]byte, error)

func (POSTag) String

func (i POSTag) String() string

func (*POSTag) UnmarshalText

func (p *POSTag) UnmarshalText(text []byte) error

type Sentencer

type Sentencer interface {
	Sentence() AnnotatedSentence
}

Sentencer is anything that returns an AnnotatedSentence

type Shape

type Shape string

Shape represents the shape of a word. It's currently implemented as a string

type Stemmer

type Stemmer interface {
	Stem(string) (string, error)
}

Stemmer is anything that can stem

type TagSet

type TagSet [MAXTAG]bool

TagSet is a set of all the POSTags

func (TagSet) String

func (ts TagSet) String() string

type WordEmbeddings

type WordEmbeddings interface {
	Corpus

	// WordVector returns a vector of embeddings given the word
	WordVector(word string) (vec tensor.Tensor, err error)

	// Vector returns a vector of embeddings given the word ID
	Vector(id int) (vec tensor.Tensor, err error)

	// Embedding returns the matrix
	Embedding() tensor.Tensor
}

WordEmbeddings is any type that is both a Corpus and able to return word vectors.
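
A minimal sketch of retrieving a vector, assuming a concrete implementation is at hand (vec satisfies the tensor.Tensor interface from gorgonia.org/tensor):

import (
	"fmt"
	"log"

	"github.com/chewxy/lingo"
)

// showVector prints the shape of a word's embedding vector.
func showVector(we lingo.WordEmbeddings, word string) {
	vec, err := we.WordVector(word)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(word, vec.Shape())
}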

type WordFlag

type WordFlag uint32

WordFlag represents a property a word may have. A word may have multiple flags.

const (
	NoFlag WordFlag = iota
	IsLetter
	IsAscii
	IsDigit
	IsLower
	IsPunct
	IsSpace
	IsTitle
	IsUpper
	LikeURL
	LikeNum
	LikeEmail
	IsStopWord
	IsOOV // for ner

	MAXFLAG
)

func (WordFlag) String

func (f WordFlag) String() string

Directories

Path Synopsis
cmd
dep
pos
