Dependency Parser

Package dependencyparser provides data structures and algorithms for a dependency parser as described by Chen and Manning (2014) [PDF]. It achieves accuracy scores similar to those reported in the cited paper.

Installing

go get -u github.com/chewxy/lingo/dep

How It Works

Transition Based Parsing

The core of the parser is a transition-based parser, as popularized by Nivre 2003 [PDF]. It's essentially a shift-reduce parser with more states. Dan Jurafsky has a very thorough overview of transition-based parsing [PDF], which should be consulted if more questions arise.

Transitions

At the core of a transition-based parser are two data structures: a stack and a queue. The queue, or buffer, holds the list of words waiting to be parsed. Parsing is then simply a matter of manipulating the state of the stack and buffer. Specifically, there are three possible actions in an arc-standard parser:

  • Shift: shifts one word from the front of the buffer onto the top of the stack.
  • Left: the top of the stack is the head of the word underneath it. After the transition is applied (an arc is attached between the two nodes), the word underneath the top of the stack is removed.
  • Right: the top of the stack is the child of the word underneath it. After the transition is applied, the top of the stack is popped.

A word on the terms "head" and "child". Consider the sentence "I am human":

We say "human" is the head of the words "I" and "am". Therefore, "I" and "am" are considered to be children of "human".

Example

Let's look at a simple example to make the ideas concrete: "The cat sat on the mat". Here are the states:

Step | Stack                             | Buffer                                     | Transition
0    | [ROOT]                            | ["The", "cat", "sat", "on", "the", "mat"] | Shift
1    | [ROOT, "The"]                     | ["cat", "sat", "on", "the", "mat"]        | Shift
2    | [ROOT, "The", "cat"]              | ["sat", "on", "the", "mat"]               | Left
3    | [ROOT, "cat"]                     | ["sat", "on", "the", "mat"]               | Shift
4    | [ROOT, "cat", "sat"]              | ["on", "the", "mat"]                      | Left
5    | [ROOT, "sat"]                     | ["on", "the", "mat"]                      | Shift
6    | [ROOT, "sat", "on"]               | ["the", "mat"]                            | Shift
7    | [ROOT, "sat", "on", "the"]        | ["mat"]                                   | Shift
8    | [ROOT, "sat", "on", "the", "mat"] | []                                        | Left
9    | [ROOT, "sat", "on", "mat"]        | []                                        | Left
10   | [ROOT, "sat", "mat"]              | []                                        | Right
11   | [ROOT, "sat"]                     | []                                        | Right

(Note the final transition: with only ROOT and "sat" left, Right attaches "sat" as the child of ROOT and pops it, leaving just [ROOT] on the stack.)

The above transitions produce this parse tree:

ROOT
└── sat
    ├── cat
    │   └── The
    └── mat
        ├── on
        └── the

The real question, of course, is: given a state, how does the system know which transition to emit?

The answer is machine learning.

Machine Learning

What exactly are we learning? Or, more carefully put, what are the inputs and outputs of the machine learning algorithm? The table in the example above provides a template for both. The output is easy: the transition is what we want to learn.

As for the input, it's a little more complex. The input consists of the stack and the buffer, but it'd be impractical and slow to include everything in them (dynamic neural networks are somewhat slower than static ones). So Chen and Manning came up with an ingenious idea:

  • Use the top 3 words of the stack
  • Use the top 3 words of the buffer
  • Use the first and second leftmost/rightmost children of the first two words of the stack
  • Use the leftmost child of the leftmost child, and the rightmost child of the rightmost child, of the first two words of the stack

Rather than using the words, POS tags, and dependency relations directly as sparse features, the ingenious idea was to represent each feature with a vector drawn from an embedding matrix. Concatenating these vectors forms a fixed-size input vector, which makes training the network much more expedient. You'll find this in features.go

Given each state above, it'd be fairly trivial to extract an input vector based on the 18 word "features" listed and feed it forward through a neural network. The result is a fast parser.
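
As a sketch of that idea (the names here are hypothetical; the package's real extraction lives in features.go), building the input is just concatenating one embedding row per feature:

package main

import "fmt"

const embeddingSize = 50 // the default EmbeddingSize in NNConfig

// lookupConcat builds the fixed-size input vector by concatenating one
// embedding row per feature ID. embeddings is a (vocab x embeddingSize)
// matrix; ids holds the feature IDs (18 word features in the real parser).
func lookupConcat(embeddings [][]float64, ids []int) []float64 {
	input := make([]float64, 0, len(ids)*embeddingSize)
	for _, id := range ids {
		input = append(input, embeddings[id]...)
	}
	return input
}

func main() {
	emb := make([][]float64, 3) // toy embedding matrix with 3 words
	for i := range emb {
		emb[i] = make([]float64, embeddingSize)
	}
	fmt.Println(len(lookupConcat(emb, []int{0, 2, 1}))) // 150 = 3 features x 50 dims
}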

Neural Network

The machine learning algorithm behind this parser is a simple 3-layered network. An input layer is constructed from the embedding matrices and forwarded to the hidden layer, which is activated by a cube activation function. This then passes forward through a dropout layer to the last layer, which is a softmax layer.

[image of NN]
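
Here is a toy forward pass mirroring that architecture. The weights are illustrative, not the package's real parameters, and dropout is omitted since it only applies during training:

package main

import (
	"fmt"
	"math"
)

func cube(x float64) float64 { return x * x * x }

// forward runs: input -> cube-activated hidden layer -> softmax over moves.
func forward(w [][]float64, b, in []float64, out [][]float64) []float64 {
	hidden := make([]float64, len(b))
	for i := range hidden {
		s := b[i]
		for j, x := range in {
			s += w[i][j] * x
		}
		hidden[i] = cube(s) // the cube activation function
	}
	scores := make([]float64, len(out))
	var sum float64
	for i := range out {
		var s float64
		for j, h := range hidden {
			s += out[i][j] * h
		}
		scores[i] = math.Exp(s)
		sum += scores[i]
	}
	for i := range scores { // softmax normalisation
		scores[i] /= sum
	}
	return scores
}

func main() {
	in := []float64{0.1, -0.2}                     // a (tiny) input vector
	w := [][]float64{{0.5, 0.3}, {-0.4, 0.8}}      // hidden weights
	b := []float64{0.0, 0.1}                       // hidden biases
	out := [][]float64{{1, 0}, {0, 1}, {0.5, 0.5}} // one output row per move
	fmt.Println(forward(w, b, in, out))            // probabilities for Shift, Left, Right
}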

Hairy Bits

The hairy bit of this parser is the oracle. Specifically, the question: given a training sentence, how do we generate correct training examples such as those in the table above?

TODO: finish writing this section

How To Use

This package provides three main data structures for use:

  • Parser
  • Model
  • Trainer

Trainer takes a []treebank.SentenceTag and produces a Model. Parser requires a Model to run, and is basically an exported wrapper over a configuration that handles a pipeline.

Basic NLP Pipeline

func main() {
	// posModel and depModel are assumed to have been loaded beforehand
	// (for the dependency model, see the Load function documented below).
	inputString := `The cat sat on the mat`
	lx := lexer.New("dummy", strings.NewReader(inputString)) // lexer - required to break a sentence up into words
	pt := pos.New(pos.WithModel(posModel))                   // POS Tagger - required to tag the words with a part of speech tag
	dp := dep.New(depModel)                                  // creates a new parser

	// set up a pipeline
	pt.Input = lx.Output
	dp.Input = pt.Output

	// run all
	go lx.Run()
	go pt.Run()
	go dp.Run()

	// wait to receive:
	for {
		select {
		case d := <-dp.Output:
			fmt.Println(d) // do something with the parsed dependency tree
		case err := <-dp.Error:
			fmt.Println(err) // handle the error
		}
	}
}

Training A Model

To train a model you'd use the Trainer. The trainer accepts a []treebank.SentenceTag. As long as you can parse your training file into those (package treebank accepts CONLL-U-formatted files as well as Penn Treebank-formatted files), you'd be fine.

An example trainer is in the cmd directory of lingo.
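
A minimal sketch of that workflow, using the constructor options documented below (it assumes you have already parsed your training file into a []treebank.SentenceTag):

package main

import (
	"log"

	"github.com/chewxy/lingo/dep"
	"github.com/chewxy/lingo/treebank"
)

func main() {
	var sentences []treebank.SentenceTag // assumed: parsed from a CONLL-U or Penn Treebank file

	t := dep.NewTrainer(
		dep.WithGeneratedCorpus(sentences...), // generate the corpus from the training data
		dep.WithTrainingSet(sentences),
		dep.WithConfig(dep.DefaultNNConfig),
	)
	if err := t.Init(); err != nil {
		log.Fatal(err)
	}
	if err := t.Train(10); err != nil { // train for 10 epochs
		log.Fatal(err)
	}

	// the trained Model is embedded in the Trainer; save it for later use with dep.New
	if err := t.Save("dep.model"); err != nil {
		log.Fatal(err)
	}
}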

FAQ

Why not an LSTM or RNN to encode the state of the stack and buffer?

The answer is simplicity and speed. I have attempted variants of the parser with different neural networks; they don't work as fast as this. I am aware of Parsey McParseface and its slightly better accuracy compared to this model, but its speed has not been as great as I'd expect. This package emphasises parsing speed over accuracy: for most well-written English sentences, this package performs well.

Why are there no models?

I'm afraid you're going to have to train your own models. Training takes days on the Universal Dependencies dataset and I haven't had the time to train on it. All my models are specific to my company's use, and hence cannot be released.

What caveats are there?

Chen and Manning described using pre-computed activations for the top 10000 or so words. I did not implement that, but it would be trivial to revisit and implement it. Feel free to send a pull request.

How can this be sped up?

Use multiple, smaller trainers, each training on a separate batch. You can then train them concurrently (pass the costs in a channel and collect them at the end). At the end, sum the gradients before applying adagrad. The trade-off is that a LOT more memory will be used, which is also the reason this wasn't included as the default. It's quite trivial to write, though. Send a pull request if you have managed to reduce memory usage.
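
A schematic of that idea in Go; the gradient type and the per-batch work are hypothetical stand-ins, and the point is only the shape of the concurrency:

package main

import (
	"fmt"
	"sync"
)

type gradient []float64

func add(dst, src gradient) {
	for i := range src {
		dst[i] += src[i]
	}
}

// trainBatch stands in for one smaller trainer working on its own batch.
func trainBatch(costs chan<- float64, grads chan<- gradient) {
	// ... forward/backward pass over this trainer's batch (omitted) ...
	costs <- 0.5                      // hypothetical batch cost
	grads <- gradient{0.1, -0.2, 0.3} // hypothetical batch gradient
}

func main() {
	const workers = 4
	costs := make(chan float64, workers)
	grads := make(chan gradient, workers)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			trainBatch(costs, grads)
		}()
	}
	wg.Wait()
	close(costs)
	close(grads)

	// sum the gradients and collect the costs before the adagrad step
	sum := make(gradient, 3)
	var total float64
	for g := range grads {
		add(sum, g)
	}
	for c := range costs {
		total += c
	}
	fmt.Println(total, sum) // apply adagrad with the summed gradient here
}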

Contributing

See package lingo's CONTRIBUTING.md for more information. There is currently a list of issues in GitHub Issues. Those are good places to start.

Licence

This package is MIT licensed.

Documentation

Index

Constants

const (
	POS_OFFSET   int = 18
	DEP_OFFSET       = 36
	STACK_OFFSET     = 6
	STACK_NUMBER     = 6
)

const BUILD_DEBUG = "PARSER: RELEASE BUILD"

const BUILD_DIAG = "Non-Diagnostic Build"

const DEBUG = false

const (
	DOES_NOT_EXIST head = iota - 1
)

const (
	MAXFEATURE feature
)

Variables

var ALLMOVES = [...]Move{Left, Right, Shift}

ALLMOVES is the set of all possible moves

var KnownWords *corpus.Corpus // package provided global

var MAXTRANSITION int

var READMEMSTATS = false

var TABCOUNT uint32 = 0

Functions

func SprintScores

func SprintScores(scores []float64, ts []transition) string

Types

type Model

type Model struct {
	// contains filtered or unexported fields
}

Model holds the neural network that a DependencyParser uses. To train, use a Trainer

func Load

func Load(filename string) (*Model, error)

func LoadReader

func LoadReader(rd io.ReadCloser) (*Model, error)

func (*Model) Corpus

func (m *Model) Corpus() *corpus.Corpus

func (*Model) POSTagEmbeddings

func (m *Model) POSTagEmbeddings() *tensor.Dense

func (*Model) Save

func (m *Model) Save(filename string) error

func (*Model) SaveWriter

func (m *Model) SaveWriter(f io.WriteCloser) error

func (*Model) String

func (m *Model) String() string

func (*Model) WordEmbeddings

func (m *Model) WordEmbeddings() *tensor.Dense

type Move

type Move byte

Move is an action that the dependency parser can take: whether to Shift, Attach-Left, or Attach-Right

const (
	Shift Move = iota
	Left
	Right

	MAXMOVE
)

func (Move) String

func (i Move) String() string

type NNConfig

type NNConfig struct {
	BatchSize                  int     // 10000
	Dropout                    float64 // 0.5
	AdaEps                     float64 // 1e-6
	AdaAlpha                   float64 // 0.02
	Reg                        float64 // 1e-8
	HiddenSize                 int     // 200
	EmbeddingSize              int     // 50
	NumPrecomputed             int     // 100000
	EvalPerIteration           int     // 100
	ClearGradientsPerIteration int     // 0

	Dtype tensor.Dtype
}

NNConfig configures the neural network

var DefaultNNConfig NNConfig

DefaultNNConfig is the default config that is passed in, for initialization purposes.

func (*NNConfig) GobDecode

func (c *NNConfig) GobDecode(p []byte) error

func (NNConfig) GobEncode

func (c NNConfig) GobEncode() ([]byte, error)

func (NNConfig) String

func (c NNConfig) String() string

type NonProjectiveError

type NonProjectiveError struct{ *lingo.Dependency }

NonProjectiveError is the error that is emitted when the dependency tree is not projective (that is to say, the arcs between children cross)

func (NonProjectiveError) Error

func (err NonProjectiveError) Error() string

type Parser

type Parser struct {
	Input  chan lingo.AnnotatedSentence
	Output chan *lingo.Dependency
	Error  chan error

	*Model
}

Parser is the object that performs the dependency parsing. It contains a neural network, which is the core of it.

The same object can be used to train the NN

func New

func New(m *Model) *Parser

New creates a new Parser

func (*Parser) Run

func (d *Parser) Run()

Run is used when using the NN to parse a sentence. For training, see Train()

func (*Parser) SprintFeatures

func (d *Parser) SprintFeatures(feature []int) string

func (*Parser) String

func (d *Parser) String() string

type Performance

type Performance struct {
	Iter int     // which training iteration is this?
	UAS  float64 // Unlabelled Attachment Score
	LAS  float64 // Labeled Attachment Score
	UEM  float64 // Unlabelled Exact Match
	Root float64 // Correct Roots Ratio
}

Performance is a tuple that holds performance information from a training session

func Evaluate

func Evaluate(predictedTrees, goldTrees []*lingo.Dependency) Performance

Evaluate compares predicted trees with the gold standard trees and returns a Performance. It panics if the number of predicted trees and the number of gold trees aren't the same

func (Performance) String

func (p Performance) String() string

type TarpitError

type TarpitError struct {
	// contains filtered or unexported fields
}

TarpitError is an error when the arc-standard is stuck. It implements GoStringer, which when called will output the state as a string. It also implements lingo.Sentencer, so the offending sentence can easily be retrieved

func (TarpitError) Error

func (err TarpitError) Error() string

func (TarpitError) GoString

func (c TarpitError) GoString() string

func (TarpitError) String

func (c TarpitError) String() string

type Trainer

type Trainer struct {
	*Model

	// Training configuration
	EvalPerIter int    // for cross validation - evaluate results every n epochs
	PassDirect  bool   // Pass on the costs directly to the cost channel? If false, an average will be used
	SaveBest    string // SaveBest is the filename that will be saved. If it's empty then the best-while-training will not be saved
	// contains filtered or unexported fields
}

Trainer trains a model

func NewTrainer

func NewTrainer(opts ...TrainerConsOpt) *Trainer

NewTrainer creates a new Trainer.

func (*Trainer) Clusters

func (t *Trainer) Clusters() (map[string]lingo.Cluster, error)

Clusters implements lingo.Fixer

func (*Trainer) Cost

func (t *Trainer) Cost() <-chan float64

Cost returns a channel of costs for monitoring the training. If the PassDirect field in the trainer is set to true then the costs are directly returned. Otherwise the costs are averaged over the epoch.

func (*Trainer) Init

func (t *Trainer) Init() (err error)

Init initializes the DependencyParser with a corpus and a neural network config

func (*Trainer) Lemmatize

func (t *Trainer) Lemmatize(a string, pt lingo.POSTag) ([]string, error)

Lemmatize implements lingo.Lemmatizer

func (*Trainer) Perf

func (t *Trainer) Perf() <-chan Performance

Perf returns a channel of Performance for monitoring the training.

func (*Trainer) Stem

func (t *Trainer) Stem(a string) (string, error)

Stem implements lingo.Stemmer

func (*Trainer) Train

func (t *Trainer) Train(epochs int) error

Train trains a model.

If a cross validation set is provided, it will automatically train with the cross validation set

func (*Trainer) TrainWithoutCrossValidation

func (t *Trainer) TrainWithoutCrossValidation(epochs int) error

TrainWithoutCrossValidation trains a model without cross validation.

type TrainerConsOpt

type TrainerConsOpt func(t *Trainer)

TrainerConsOpt is a construction option for trainer

func WithCluster

func WithCluster(c map[string]lingo.Cluster) TrainerConsOpt

WithCluster sets the Brown cluster options for the DependencyParser

func WithConfig

func WithConfig(conf NNConfig) TrainerConsOpt

WithConfig sets up a *Trainer with a NNConfig

func WithCorpus

func WithCorpus(c *corpus.Corpus) TrainerConsOpt

WithCorpus creates a Trainer with a corpus

func WithCrossValidationSet

func WithCrossValidationSet(st []treebank.SentenceTag) TrainerConsOpt

WithCrossValidationSet creates a trainer with a cross validation set

func WithGeneratedCorpus

func WithGeneratedCorpus(sts ...treebank.SentenceTag) TrainerConsOpt

WithGeneratedCorpus creates a Trainer's corpus from a list of SentenceTags. The corpus will be generated from the SentenceTags

func WithLemmatizer

func WithLemmatizer(l lingo.Lemmatizer) TrainerConsOpt

WithLemmatizer sets the lemmatizer option on the Trainer

func WithStemmer

func WithStemmer(s lingo.Stemmer) TrainerConsOpt

WithStemmer sets up the stemmer option on the DependencyParser

func WithTrainingModel

func WithTrainingModel(m *Model) TrainerConsOpt

WithTrainingModel loads a trainer with a model

func WithTrainingSet

func WithTrainingSet(st []treebank.SentenceTag) TrainerConsOpt

WithTrainingSet creates a trainer with a training set
