Dependency Parser

Package dependencyparser provides data structures and algorithms for a dependency parser as described by Chen and Manning (2014) [PDF]. It achieves accuracy scores similar to those reported in the cited paper.

Installing

go get -u github.com/chewxy/lingo/dep

How It Works

Transition Based Parsing

The core of the parser is a transition-based parser, as popularized by Nivre 2003 [PDF]. It's essentially a shift-reduce parser with more states. Dan Jurafsky has a very thorough overview of transition-based parsing [PDF], which should be consulted if more questions arise.

Transitions

At the core of a transition-based parser are two data structures: a stack and a queue. The queue, or buffer, holds the list of words waiting to be parsed. Parsing is then simply a matter of manipulating the state of the stack and buffer. Specifically, there are three possible actions in an arc-standard parser:

  • Shift: shifts one word from the front of the buffer onto the top of the stack.
  • Left: the top of the stack is the head of the word underneath it. After the transition is applied (an arc is attached between the two nodes), the word underneath the top of the stack is removed.
  • Right: the top of the stack is the child of the word underneath it. After the transition is applied, the top of the stack is popped.

A word on the terms "head" and "child". Consider the sentence "I am human":

We say "human" is the head of the words "I" and "am". Therefore, "I" and "am" are considered to be children of "human".

Example

Let's look at a simple example to make the ideas concrete: "The cat sat on the mat". Here are the states:

Step | Stack                             | Buffer                                     | Transition
0    | [ROOT]                            | ["The", "cat", "sat", "on", "the", "mat"] | Shift
1    | [ROOT, "The"]                     | ["cat", "sat", "on", "the", "mat"]        | Shift
2    | [ROOT, "The", "cat"]              | ["sat", "on", "the", "mat"]               | Left
3    | [ROOT, "cat"]                     | ["sat", "on", "the", "mat"]               | Shift
4    | [ROOT, "cat", "sat"]              | ["on", "the", "mat"]                      | Left
5    | [ROOT, "sat"]                     | ["on", "the", "mat"]                      | Shift
6    | [ROOT, "sat", "on"]               | ["the", "mat"]                            | Shift
7    | [ROOT, "sat", "on", "the"]        | ["mat"]                                   | Shift
8    | [ROOT, "sat", "on", "the", "mat"] | []                                        | Left
9    | [ROOT, "sat", "on", "mat"]        | []                                        | Left
10   | [ROOT, "sat", "mat"]              | []                                        | Right
11   | [ROOT, "sat"]                     | []                                        | Right

(Note the final transition: with only ROOT and "sat" left, Right attaches "sat" as the child of ROOT and pops it, leaving just [ROOT] on the stack.)

The above transitions produce this parse tree:

ROOT
└── sat
    ├── cat
    │   └── The
    └── mat
        ├── on
        └── the

The real question, of course, is: given a state, how does the system know which transition to emit?

The answer is machine learning.

Machine Learning

What exactly are we learning? Or, more carefully put, what are the inputs and outputs of the machine learning algorithm? The table in the example above provides a template for both. The output is easy: the transition is what we want to learn.

As for the input, it's a little more complex. The input consists of the stack and the buffer, but it'd be impractical and slow to include everything in them (dynamic neural networks are somewhat slower than static ones). So Chen and Manning came up with an ingenious idea:

  • Use the top 3 words of the stack
  • Use the top 3 words of the buffer
  • Use the first and second leftmost/rightmost children of the first two words of the stack
  • Use the leftmost child of the leftmost child, and the rightmost child of the rightmost child, of the first two words of the stack

Rather than using the words, POS tags, and dependency relations directly as sparse features, the ingenious idea was to represent each feature with a vector drawn from an embedding matrix. Concatenating these vectors forms a fixed-size input vector, which makes training the network much more expedient. You'll find this in features.go

Given each state above, it'd be fairly trivial to extract an input vector based on the 18 word "features" listed and feed it forward through a neural network. The result is a fast parser.
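
As a sketch of that idea (the names here are hypothetical; the package's real extraction lives in features.go), building the input is just concatenating one embedding row per feature:

package main

import "fmt"

const embeddingSize = 50 // the default EmbeddingSize in NNConfig

// lookupConcat builds the fixed-size input vector by concatenating one
// embedding row per feature ID. embeddings is a (vocab x embeddingSize)
// matrix; ids holds the feature IDs (18 word features in the real parser).
func lookupConcat(embeddings [][]float64, ids []int) []float64 {
	input := make([]float64, 0, len(ids)*embeddingSize)
	for _, id := range ids {
		input = append(input, embeddings[id]...)
	}
	return input
}

func main() {
	emb := make([][]float64, 3) // toy embedding matrix with 3 words
	for i := range emb {
		emb[i] = make([]float64, embeddingSize)
	}
	fmt.Println(len(lookupConcat(emb, []int{0, 2, 1}))) // 150 = 3 features x 50 dims
}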

Neural Network

The machine learning algorithm behind this parser is a simple 3-layered network. An input layer is constructed from the embedding matrices and forwarded to the hidden layer, which is activated by a cube activation function. This then passes forward through a dropout layer to the last layer, which is a softmax layer.

[image of NN]
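
Here is a toy forward pass mirroring that architecture. The weights are illustrative, not the package's real parameters, and dropout is omitted since it only applies during training:

package main

import (
	"fmt"
	"math"
)

func cube(x float64) float64 { return x * x * x }

// forward runs: input -> cube-activated hidden layer -> softmax over moves.
func forward(w [][]float64, b, in []float64, out [][]float64) []float64 {
	hidden := make([]float64, len(b))
	for i := range hidden {
		s := b[i]
		for j, x := range in {
			s += w[i][j] * x
		}
		hidden[i] = cube(s) // the cube activation function
	}
	scores := make([]float64, len(out))
	var sum float64
	for i := range out {
		var s float64
		for j, h := range hidden {
			s += out[i][j] * h
		}
		scores[i] = math.Exp(s)
		sum += scores[i]
	}
	for i := range scores { // softmax normalisation
		scores[i] /= sum
	}
	return scores
}

func main() {
	in := []float64{0.1, -0.2}                     // a (tiny) input vector
	w := [][]float64{{0.5, 0.3}, {-0.4, 0.8}}      // hidden weights
	b := []float64{0.0, 0.1}                       // hidden biases
	out := [][]float64{{1, 0}, {0, 1}, {0.5, 0.5}} // one output row per move
	fmt.Println(forward(w, b, in, out))            // probabilities for Shift, Left, Right
}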

Hairy Bits

The hairy bit of this parser is the oracle. Specifically, the question: given a training sentence, how do we generate correct training examples such as those in the table above?

TODO: finish writing this section

How To Use

This package provides three main data structures for use:

  • Parser
  • Model
  • Trainer

Trainer takes a []treebank.SentenceTag and produces a Model. Parser requires a Model to run, and is basically an exported wrapper over a configuration that handles a pipeline.

Basic NLP Pipeline

func main() {
	// posModel and depModel are assumed to have been loaded beforehand
	// (for the dependency model, see the Load function documented below).
	inputString := `The cat sat on the mat`
	lx := lexer.New("dummy", strings.NewReader(inputString)) // lexer - required to break a sentence up into words
	pt := pos.New(pos.WithModel(posModel))                   // POS Tagger - required to tag the words with a part of speech tag
	dp := dep.New(depModel)                                  // creates a new parser

	// set up a pipeline
	pt.Input = lx.Output
	dp.Input = pt.Output

	// run all
	go lx.Run()
	go pt.Run()
	go dp.Run()

	// wait to receive:
	for {
		select {
		case d := <-dp.Output:
			fmt.Println(d) // do something with the parsed dependency tree
		case err := <-dp.Error:
			fmt.Println(err) // handle the error
		}
	}
}

Training A Model

To train a model you'd use the Trainer. The trainer accepts a []treebank.SentenceTag. As long as you can parse your training file into those (package treebank accepts CONLL-U-formatted files as well as Penn Treebank-formatted files), you'd be fine.

An example trainer is in the cmd directory of lingo.
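
A minimal sketch of that workflow, using the constructor options documented below (it assumes you have already parsed your training file into a []treebank.SentenceTag):

package main

import (
	"log"

	"github.com/chewxy/lingo/dep"
	"github.com/chewxy/lingo/treebank"
)

func main() {
	var sentences []treebank.SentenceTag // assumed: parsed from a CONLL-U or Penn Treebank file

	t := dep.NewTrainer(
		dep.WithGeneratedCorpus(sentences...), // generate the corpus from the training data
		dep.WithTrainingSet(sentences),
		dep.WithConfig(dep.DefaultNNConfig),
	)
	if err := t.Init(); err != nil {
		log.Fatal(err)
	}
	if err := t.Train(10); err != nil { // train for 10 epochs
		log.Fatal(err)
	}

	// the trained Model is embedded in the Trainer; save it for later use with dep.New
	if err := t.Save("dep.model"); err != nil {
		log.Fatal(err)
	}
}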

FAQ

Why not an LSTM or RNN to encode the state of the stack and buffer?

The answer is simplicity and speed. I have attempted variants of the parser with different neural networks; they don't work as fast as this. I am aware of Parsey McParseface and its slightly better accuracy compared to this model, but its speed has not been as great as I'd expect. This package emphasises parsing speed over accuracy: for most well-written English sentences, this package performs well.

Why are there no models?

I'm afraid you're going to have to train your own models. Training takes days on the Universal Dependencies dataset and I haven't had the time to train on it. All my models are specific to my company's use, and hence cannot be released.

What caveats are there?

Chen and Manning described using pre-computed activations for the top 10000 or so words. I did not implement that, but it would be trivial to revisit and implement it. Feel free to send a pull request.

How can this be sped up?

Use multiple, smaller trainers, each training on a separate batch. You can then train them concurrently (pass the costs in a channel and collect them at the end). At the end, sum the gradients before applying adagrad. The trade-off is that a LOT more memory will be used, which is also the reason this wasn't included as the default. It's quite trivial to write, though. Send a pull request if you have managed to reduce memory usage.
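
A schematic of that idea in Go; the gradient type and the per-batch work are hypothetical stand-ins, and the point is only the shape of the concurrency:

package main

import (
	"fmt"
	"sync"
)

type gradient []float64

func add(dst, src gradient) {
	for i := range src {
		dst[i] += src[i]
	}
}

// trainBatch stands in for one smaller trainer working on its own batch.
func trainBatch(costs chan<- float64, grads chan<- gradient) {
	// ... forward/backward pass over this trainer's batch (omitted) ...
	costs <- 0.5                      // hypothetical batch cost
	grads <- gradient{0.1, -0.2, 0.3} // hypothetical batch gradient
}

func main() {
	const workers = 4
	costs := make(chan float64, workers)
	grads := make(chan gradient, workers)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			trainBatch(costs, grads)
		}()
	}
	wg.Wait()
	close(costs)
	close(grads)

	// sum the gradients and collect the costs before the adagrad step
	sum := make(gradient, 3)
	var total float64
	for g := range grads {
		add(sum, g)
	}
	for c := range costs {
		total += c
	}
	fmt.Println(total, sum) // apply adagrad with the summed gradient here
}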

Contributing

See package lingo's CONTRIBUTING.md for more information. There is currently a list of issues in GitHub Issues. Those are good places to start.

Licence

This package is MIT licensed.

Documentation

Index

Constants

const (
	POS_OFFSET   int = 18
	DEP_OFFSET       = 36
	STACK_OFFSET     = 6
	STACK_NUMBER     = 6
)

const BUILD_DEBUG = "PARSER: RELEASE BUILD"

const BUILD_DIAG = "Non-Diagnostic Build"

const DEBUG = false

const (
	DOES_NOT_EXIST head = iota - 1
)

const (
	MAXFEATURE feature
)

Variables

var ALLMOVES = [...]Move{Left, Right, Shift}

ALLMOVES is the set of all possible moves

var KnownWords *corpus.Corpus // package provided global

var MAXTRANSITION int

var READMEMSTATS = false

var TABCOUNT uint32 = 0

Functions

func SprintScores

func SprintScores(scores []float64, ts []transition) string

Types

type Model

type Model struct {
	// contains filtered or unexported fields
}

Model holds the neural network that a DependencyParser uses. To train, use a Trainer

func Load

func Load(filename string) (*Model, error)

func LoadReader

func LoadReader(rd io.ReadCloser) (*Model, error)

func (*Model) Corpus

func (m *Model) Corpus() *corpus.Corpus

func (*Model) POSTagEmbeddings

func (m *Model) POSTagEmbeddings() *tensor.Dense

func (*Model) Save

func (m *Model) Save(filename string) error

func (*Model) SaveWriter

func (m *Model) SaveWriter(f io.WriteCloser) error

func (*Model) String

func (m *Model) String() string

func (*Model) WordEmbeddings

func (m *Model) WordEmbeddings() *tensor.Dense

type Move

type Move byte

Move is an action that the dependency parser can take: whether to Shift, Attach-Left, or Attach-Right

const (
	Shift Move = iota
	Left
	Right

	MAXMOVE
)

func (Move) String

func (i Move) String() string

type NNConfig

type NNConfig struct {
	BatchSize                  int     // 10000
	Dropout                    float64 // 0.5
	AdaEps                     float64 // 1e-6
	AdaAlpha                   float64 // 0.02
	Reg                        float64 // 1e-8
	HiddenSize                 int     // 200
	EmbeddingSize              int     // 50
	NumPrecomputed             int     // 100000
	EvalPerIteration           int     // 100
	ClearGradientsPerIteration int     // 0

	Dtype tensor.Dtype
}

NNConfig configures the neural network

var DefaultNNConfig NNConfig

DefaultNNConfig is the default config that is passed in, for initialization purposes.

func (*NNConfig) GobDecode

func (c *NNConfig) GobDecode(p []byte) error

func (NNConfig) GobEncode

func (c NNConfig) GobEncode() ([]byte, error)

func (NNConfig) String

func (c NNConfig) String() string

type NonProjectiveError

type NonProjectiveError struct{ *lingo.Dependency }

NonProjectiveError is the error that is emitted when the dependency tree is not projective (that is to say, the arcs between children cross)

func (NonProjectiveError) Error

func (err NonProjectiveError) Error() string

type Parser

type Parser struct {
	Input  chan lingo.AnnotatedSentence
	Output chan *lingo.Dependency
	Error  chan error

	*Model
}

Parser is the object that performs the dependency parsing. It contains a neural network, which is the core of it.

The same object can be used to train the NN

func New

func New(m *Model) *Parser

New creates a new Parser

func (*Parser) Run

func (d *Parser) Run()

Run is used when using the NN to parse a sentence. For training, see Train()

func (*Parser) SprintFeatures

func (d *Parser) SprintFeatures(feature []int) string

func (*Parser) String

func (d *Parser) String() string

type Performance

type Performance struct {
	Iter int     // which training iteration is this?
	UAS  float64 // Unlabelled Attachment Score
	LAS  float64 // Labeled Attachment Score
	UEM  float64 // Unlabelled Exact Match
	Root float64 // Correct Roots Ratio
}

Performance is a tuple that holds performance information from a training session

func Evaluate

func Evaluate(predictedTrees, goldTrees []*lingo.Dependency) Performance

Evaluate compares predicted trees with the gold standard trees and returns a Performance. It panics if the number of predicted trees and the number of gold trees aren't the same

func (Performance) String

func (p Performance) String() string

type TarpitError

type TarpitError struct {
	// contains filtered or unexported fields
}

TarpitError is an error when the arc-standard is stuck. It implements GoStringer, which when called will output the state as a string. It also implements lingo.Sentencer, so the offending sentence can easily be retrieved

func (TarpitError) Error

func (err TarpitError) Error() string

func (TarpitError) GoString

func (c TarpitError) GoString() string

func (TarpitError) String

func (c TarpitError) String() string

type Trainer

type Trainer struct {
	*Model

	// Training configuration
	EvalPerIter int    // for cross validation - evaluate results every n epochs
	PassDirect  bool   // Pass on the costs directly to the cost channel? If false, an average will be used
	SaveBest    string // SaveBest is the filename that will be saved. If it's empty then the best-while-training will not be saved
	// contains filtered or unexported fields
}

Trainer trains a model

func NewTrainer

func NewTrainer(opts ...TrainerConsOpt) *Trainer

NewTrainer creates a new Trainer.

func (*Trainer) Clusters

func (t *Trainer) Clusters() (map[string]lingo.Cluster, error)

Clusters implements lingo.Fixer

func (*Trainer) Cost

func (t *Trainer) Cost() <-chan float64

Cost returns a channel of costs for monitoring the training. If the PassDirect field in the trainer is set to true then the costs are directly returned. Otherwise the costs are averaged over the epoch.

func (*Trainer) Init

func (t *Trainer) Init() (err error)

Init initializes the DependencyParser with a corpus and a neural network config

func (*Trainer) Lemmatize

func (t *Trainer) Lemmatize(a string, pt lingo.POSTag) ([]string, error)

Lemmatize implements lingo.Lemmatizer

func (*Trainer) Perf

func (t *Trainer) Perf() <-chan Performance

Perf returns a channel of Performance for monitoring the training.

func (*Trainer) Stem

func (t *Trainer) Stem(a string) (string, error)

Stem implements lingo.Stemmer

func (*Trainer) Train

func (t *Trainer) Train(epochs int) error

Train trains a model.

If a cross validation set is provided, it will automatically train with the cross validation set

func (*Trainer) TrainWithoutCrossValidation

func (t *Trainer) TrainWithoutCrossValidation(epochs int) error

TrainWithoutCrossValidation trains a model without cross validation.

type TrainerConsOpt

type TrainerConsOpt func(t *Trainer)

TrainerConsOpt is a construction option for trainer

func WithCluster

func WithCluster(c map[string]lingo.Cluster) TrainerConsOpt

WithCluster sets the Brown cluster options for the DependencyParser

func WithConfig

func WithConfig(conf NNConfig) TrainerConsOpt

WithConfig sets up a *Trainer with a NNConfig

func WithCorpus

func WithCorpus(c *corpus.Corpus) TrainerConsOpt

WithCorpus creates a Trainer with a corpus

func WithCrossValidationSet

func WithCrossValidationSet(st []treebank.SentenceTag) TrainerConsOpt

WithCrossValidationSet creates a trainer with a cross validation set

func WithGeneratedCorpus

func WithGeneratedCorpus(sts ...treebank.SentenceTag) TrainerConsOpt

WithGeneratedCorpus creates a Trainer's corpus from a list of SentenceTags. The corpus will be generated from the SentenceTags

func WithLemmatizer

func WithLemmatizer(l lingo.Lemmatizer) TrainerConsOpt

WithLemmatizer sets the lemmatizer option on the Trainer

func WithStemmer

func WithStemmer(s lingo.Stemmer) TrainerConsOpt

WithStemmer sets up the stemmer option on the DependencyParser

func WithTrainingModel

func WithTrainingModel(m *Model) TrainerConsOpt

WithTrainingModel loads a trainer with a model

func WithTrainingSet

func WithTrainingSet(st []treebank.SentenceTag) TrainerConsOpt

WithTrainingSet creates a trainer with a training set
