words

package
v1.0.0
Published: Aug 29, 2018 License: BSD-3-Clause Imports: 6 Imported by: 0

Documentation

Overview

Package words provides methods to estimate (word) emission probabilities.

The parameters in Hidden Markov Models (HMM) come in two forms: transition and emission probabilities. In a trigram HMM tagger, the transition probabilities are P(t3|t1,t2) and the emission probabilities P(w|t), where 'w' is a word and 't' a tag.

This package concerns itself with estimating emission probabilities. Generally, the emission probabilities are estimated as follows: (1) for words seen in the training data, the probability is the (smoothed) maximum likelihood estimate; (2) for words not seen in the training data, the probabilities are usually estimated from inflectional properties.
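As an illustration of case (1), the unsmoothed maximum likelihood estimate is P(w|t) = count(w, t) / count(t). The sketch below is not this package's implementation: the `Tag` type, the `emissionMLE` function, and the use of plain per-tag counts in place of `model.Unigram` frequencies are assumptions made for the example.

```go
package main

import "fmt"

// Tag is a stand-in for model.Tag (assumed here to be a string).
type Tag string

// emissionMLE returns the unsmoothed maximum likelihood estimate
// P(w|t) = count(w, t) / count(t) for every tag the word was seen with.
// Words that never occurred yield an empty map.
func emissionMLE(word string, wordTagFreqs map[string]map[Tag]int, tagFreqs map[Tag]int) map[Tag]float64 {
	probs := make(map[Tag]float64)
	for tag, freq := range wordTagFreqs[word] {
		if total := tagFreqs[tag]; total > 0 {
			probs[tag] = float64(freq) / float64(total)
		}
	}
	return probs
}

func main() {
	// Hypothetical counts: "run" was seen once as NN and three times as VB.
	wordTagFreqs := map[string]map[Tag]int{
		"run": {"NN": 1, "VB": 3},
	}
	tagFreqs := map[Tag]int{"NN": 10, "VB": 6}

	fmt.Println(emissionMLE("run", wordTagFreqs, tagFreqs))
	fmt.Println(emissionMLE("walk", wordTagFreqs, tagFreqs)) // unseen word: empty map
}
```

An unseen word produces no probabilities at all, which is why a separate estimator is needed for case (2).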

The Lexicon type implements (1), while the SuffixHandler type is a possible implementation of (2) based on Brants, 2000. Both types implement the WordHandler interface.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Lexicon

type Lexicon struct {
	// contains filtered or unexported fields
}

Lexicon is an emission probability estimator for 'known words' (words seen in the training data).

func NewLexicon

func NewLexicon(wtf map[string]map[model.Tag]int, uf map[model.Unigram]int) Lexicon

NewLexicon constructs a new Lexicon from word/tag frequencies and unigram frequencies.

func NewLexiconWithFallback

func NewLexiconWithFallback(wtf map[string]map[model.Tag]int, uf map[model.Unigram]int, fallback WordHandler) Lexicon

NewLexiconWithFallback constructs a new Lexicon from word/tag frequencies, unigram frequencies, and a fallback. The fallback is used to estimate the emission probabilities when the word is not in the lexicon. For instance, this permits combining a Lexicon with a SuffixHandler to estimate the emission probability for any word.

func (Lexicon) TagProbs

func (l Lexicon) TagProbs(word string) map[model.Tag]float64

TagProbs returns P(w|t) for a particular word 'w'. Probabilities are only returned for tags with which the word occurred in the training data, except if the word did not occur in the training data and a fallback is used.
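The fallback behaviour described above can be pictured as a simple chain of handlers. The following is a minimal sketch under stated assumptions, not the package's code: `Tag` stands in for `model.Tag`, and `tableHandler`, `constHandler`, `withFallback`, and `newDemoHandler` are hypothetical names introduced only for this example.

```go
package main

import "fmt"

// Tag is a stand-in for model.Tag (assumed here to be a string).
type Tag string

// WordHandler mirrors the interface documented in this package.
type WordHandler interface {
	TagProbs(word string) map[Tag]float64
}

// tableHandler is a hypothetical known-word estimator backed by a fixed table.
type tableHandler map[string]map[Tag]float64

func (t tableHandler) TagProbs(word string) map[Tag]float64 { return t[word] }

// constHandler is a hypothetical unknown-word estimator that returns the
// same distribution for every word.
type constHandler map[Tag]float64

func (c constHandler) TagProbs(string) map[Tag]float64 { return c }

// withFallback consults primary first and delegates to fallback only when
// primary returns no probabilities for the word.
type withFallback struct {
	primary, fallback WordHandler
}

func (h withFallback) TagProbs(word string) map[Tag]float64 {
	if probs := h.primary.TagProbs(word); len(probs) > 0 {
		return probs
	}
	return h.fallback.TagProbs(word)
}

// newDemoHandler builds a handler with made-up probabilities.
func newDemoHandler() WordHandler {
	return withFallback{
		primary:  tableHandler{"house": {"NN": 0.002}},
		fallback: constHandler{"NN": 0.5, "VB": 0.25},
	}
}

func main() {
	h := newDemoHandler()
	fmt.Println(h.TagProbs("house")) // answered by the table
	fmt.Println(h.TagProbs("blorf")) // answered by the fallback
}
```

Because both the lexicon and the fallback satisfy the same interface, the tagger only ever needs to hold a single WordHandler.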

type LookupSuffixHandler

type LookupSuffixHandler struct {
	// contains filtered or unexported fields
}

LookupSuffixHandler estimates the emission probabilities P(w|t) using word suffixes. In contrast to SuffixHandler, it uses map-based lookups. Constructing a LookupSuffixHandler takes a small amount of extra time, but it is much faster during tagging.

func NewLookupSuffixHandler

func NewLookupSuffixHandler(sh SuffixHandler) LookupSuffixHandler

NewLookupSuffixHandler constructs a LookupSuffixHandler from a SuffixHandler. The SuffixHandler is discarded after construction.

func (LookupSuffixHandler) TagProbs

func (h LookupSuffixHandler) TagProbs(word string) map[model.Tag]float64

TagProbs estimates P(w|t) for a particular word 'w'.

type SubstLexicon

type SubstLexicon struct {
	// contains filtered or unexported fields
}

func NewSubstLexicon

func NewSubstLexicon(lexicon Lexicon, substitutions []Substitution) SubstLexicon

NewSubstLexicon constructs a new lexicon with substitution rules from a Lexicon. If the lexicon does not return results for a word, the substitutions are applied and another lookup is attempted.

func NewSubstLexiconWithFallback

func NewSubstLexiconWithFallback(lexicon Lexicon, fallback WordHandler, substitutions []Substitution) SubstLexicon

NewSubstLexiconWithFallback constructs a new lexicon with substitution rules from a Lexicon and a fallback. If the lexicon does not return results for a word, the substitutions are applied and another lookup is attempted. If this fails as well, the fallback is used.

func (SubstLexicon) TagProbs

func (l SubstLexicon) TagProbs(word string) map[model.Tag]float64

TagProbs returns P(w|t) for a particular word 'w'. Probabilities are only returned for tags with which the word occurred in the training data, except if the word did not occur in the training data and a fallback is used.
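A rough sketch of the lookup, substitute, and retry sequence follows. It is an illustration rather than the package's implementation: the `Substitution` struct copies the documented fields, while `Tag`, `lookupWithSubstitutions`, the demo helpers, and the choice to apply all substitutions in order before the second lookup are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// Tag is a stand-in for model.Tag (assumed here to be a string).
type Tag string

// Substitution has the same fields as the type documented in this package.
type Substitution struct {
	Pattern     *regexp.Regexp
	Replacement string
}

// lookupWithSubstitutions tries the word as-is. If the table has no entry,
// it applies every substitution in order and looks up the rewritten word.
func lookupWithSubstitutions(word string, table map[string]map[Tag]float64, substs []Substitution) map[Tag]float64 {
	if probs, ok := table[word]; ok {
		return probs
	}
	for _, s := range substs {
		word = s.Pattern.ReplaceAllString(word, s.Replacement)
	}
	return table[word]
}

// demoTable returns a hypothetical lexicon in which all numbers share the
// placeholder entry "<NUM>".
func demoTable() map[string]map[Tag]float64 {
	return map[string]map[Tag]float64{
		"house": {"NN": 0.002},
		"<NUM>": {"CD": 0.1},
	}
}

// demoSubstitutions rewrites tokens consisting only of digits to "<NUM>".
func demoSubstitutions() []Substitution {
	return []Substitution{
		{Pattern: regexp.MustCompile(`^[0-9]+$`), Replacement: "<NUM>"},
	}
}

func main() {
	table, substs := demoTable(), demoSubstitutions()
	fmt.Println(lookupWithSubstitutions("house", table, substs)) // direct hit
	fmt.Println(lookupWithSubstitutions("1984", table, substs))  // hit after substitution
	fmt.Println(lookupWithSubstitutions("blorf", table, substs)) // no hit: empty result
}
```

Substitutions of this kind let many surface forms (numbers, dates, and similar tokens) share one lexicon entry before the more expensive fallback is consulted.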

type Substitution

type Substitution struct {
	Pattern     *regexp.Regexp
	Replacement string
}

type SuffixHandler

type SuffixHandler struct {
	// contains filtered or unexported fields
}

SuffixHandler is an emission probability estimator that uses word suffixes. It is normally used for words that were not seen in the training data.

Internally, this estimator uses four different distributions based on properties of the token: (1) tokens that start with an uppercase letter; (2) tokens that contain a dash (currently only '-'); (3) tokens that are recognized as cardinals; and (4) remaining tokens (typically lowercase words).
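To make the four token classes concrete, here is a hedged sketch of how such a classification could look. The names `tokenClass`, `classify`, and `isCardinal`, the order in which the checks are made, and the definition of a cardinal (digits with optional '.' or ',' separators) are all assumptions for illustration; the package's actual rules may differ.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// tokenClass selects which of the four suffix distributions to use.
type tokenClass int

const (
	classUpper    tokenClass = iota // starts with an uppercase letter
	classDash                       // contains '-'
	classCardinal                   // looks like a number
	classOther                      // everything else
)

// isCardinal reports whether token consists of digits with optional
// '.' or ',' separators and contains at least one digit (an assumption).
func isCardinal(token string) bool {
	hasDigit := false
	for _, r := range token {
		switch {
		case unicode.IsDigit(r):
			hasDigit = true
		case r == '.' || r == ',':
			// separators are allowed
		default:
			return false
		}
	}
	return hasDigit
}

// classify assigns a token to one of the four classes, checking them in
// the order in which they are listed in the documentation.
func classify(token string) tokenClass {
	first, _ := utf8.DecodeRuneInString(token)
	switch {
	case unicode.IsUpper(first):
		return classUpper
	case strings.Contains(token, "-"):
		return classDash
	case isCardinal(token):
		return classCardinal
	default:
		return classOther
	}
}

func main() {
	names := map[tokenClass]string{
		classUpper: "upper", classDash: "dash",
		classCardinal: "cardinal", classOther: "other",
	}
	for _, tok := range []string{"Berlin", "well-known", "1,000", "house"} {
		fmt.Printf("%-10s %s\n", tok, names[classify(tok)])
	}
}
```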

func NewSuffixHandler

func NewSuffixHandler(config SuffixHandlerConfig, m model.Model) SuffixHandler

NewSuffixHandler constructs a new SuffixHandler from the given configuration and model.

func (SuffixHandler) TagProbs

func (h SuffixHandler) TagProbs(word string) map[model.Tag]float64

TagProbs estimates P(w|t) for a particular word 'w'.

type SuffixHandlerConfig

type SuffixHandlerConfig struct {
	MaxSuffixLen    int
	UpperMaxFreq    int
	LowerMaxFreq    int
	DashMaxFreq     int
	CardinalMaxFreq int
	MaxTags         int
}

SuffixHandlerConfig stores the configuration for a SuffixHandler. It specifies the maximum length of the suffix to be considered, the maximum frequencies that tokens may have in order to be used as training data, and the maximum number of tags that a SuffixHandler should return P(w|t) for.

Tweaking these parameters can have a profound effect on the quality of the estimator. For instance, the typical length of inflectional suffixes is highly language-dependent. Good values for the maximum frequencies for the various types of tokens depend on the size of the training corpus; the distribution of unknown words is typically closer to that of low-frequency words than to that of high-frequency words.

func DefaultSuffixHandlerConfig

func DefaultSuffixHandlerConfig() SuffixHandlerConfig

DefaultSuffixHandlerConfig returns a SuffixHandlerConfig that works reasonably well on German and English with training corpora of approximately 50,000 to 100,000 sentences.

type WordHandler

type WordHandler interface {
	TagProbs(word string) map[model.Tag]float64
}

A WordHandler returns or estimates the emission probabilities P(w|t) for a given word.
