Documentation
Overview
Package token breaks a text into tokens. It repairs names that were split by line breaks by concatenating their pieces. Each token carries a set of properties, which are used by the heuristic and Bayes approaches to finding names.
Constants
This section is empty.
Variables
This section is empty.
Functions
func NewTokenSN
NewTokenSN is a factory and a wrapper. It takes a gner.TokenNER object and wraps it in the TokenSN interface.
func SetIndices
func SetIndices(ts []TokenSN, d *dict.Dictionary)
SetIndices takes a slice of tokens that correspond to a name candidate. It analyses the tokens and sets Token.Indices according to how feasibly the input tokens could form a scientific name. It checks whether the tokens contain a possible species, rank, and infraspecies.
func UpperIndex
UpperIndex takes the index of a token and the length of the tokens slice and returns the upper index of what could be a slice of a name. We expect that most names will fit into 5 words. Other cases would require more thorough algorithms that we can run later as plugins.
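The windowing described above can be sketched as follows. This is a minimal illustration, not the package's code: the function name upperIndex, the constant maxNameWords, and the clamping to the slice length are assumptions drawn from the "5 words" remark.

```go
package main

import "fmt"

// maxNameWords is an assumed window size, based on the remark that
// most names fit into 5 words.
const maxNameWords = 5

// upperIndex is a hypothetical stand-in for UpperIndex. Given the index i
// of a token and the length l of the tokens slice, it returns an exclusive
// upper bound for the sub-slice that may hold a name starting at i.
func upperIndex(i, l int) int {
	upper := i + maxNameWords
	if upper > l {
		// Never reach past the end of the tokens slice.
		upper = l
	}
	return upper
}

func main() {
	tokens := []string{"Bubo", "bubo", "bubo", "was", "seen", "near", "the", "lake"}
	for _, i := range []int{0, 5} {
		// Each window holds at most maxNameWords tokens.
		fmt.Println(tokens[i:upperIndex(i, len(tokens))])
	}
}
```

Under these assumptions the window is simply truncated near the end of the text, so a caller can slice the tokens without a separate bounds check.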
Types
type Decision
type Decision int
Decision defines the possible kinds of name candidates.
const (
	NotName Decision = iota
	Uninomial
	Binomial
	PossibleBinomial
	Trinomial
	BayesUninomial
	BayesBinomial
	BayesTrinomial
)
Possible Decisions
func (Decision) Cardinality
Cardinality returns the number of elements in the canonical form of a scientific name: 1 for a uninomial, 2 for a binomial, 3 for a trinomial.
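One way such a mapping could look is sketched below. The Decision type here is a local copy of the constants listed above, not the package's own type. Treating the Bayes variants and PossibleBinomial like their heuristic counterparts, and returning 0 for NotName, are assumptions that the description does not confirm.

```go
package main

import "fmt"

// Decision is a local copy of the documented type, for illustration only.
type Decision int

const (
	NotName Decision = iota
	Uninomial
	Binomial
	PossibleBinomial
	Trinomial
	BayesUninomial
	BayesBinomial
	BayesTrinomial
)

// Cardinality sketches the documented behaviour. The handling of the
// Bayes* values, PossibleBinomial, and NotName is an assumption.
func (d Decision) Cardinality() int {
	switch d {
	case Uninomial, BayesUninomial:
		return 1
	case Binomial, PossibleBinomial, BayesBinomial:
		return 2
	case Trinomial, BayesTrinomial:
		return 3
	default:
		// Assumed: anything that is not a name has no elements.
		return 0
	}
}

func main() {
	for _, d := range []Decision{NotName, Uninomial, PossibleBinomial, BayesTrinomial} {
		fmt.Println(int(d), d.Cardinality())
	}
}
```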
type Features
type Features struct {
	// IsCapitalized is true if the first rune that is a letter is
	// capitalized.
	IsCapitalized bool
	// HasDash is true if the token contains a dash.
	HasDash bool
	// HasStartParens is true if the token starts with '('.
	HasStartParens bool
	// HasEndParens is true if the token ends with ')'.
	HasEndParens bool
	// Abbr feature: the token ends with a period.
	Abbr bool
	// PotentialBinomialGenus feature: the token might be the genus of a name.
	PotentialBinomialGenus bool
	// StartsWithLetter feature: the token has the necessary qualities to be
	// the start of a binomial species. It is assumed to be lowercase and two
	// letters or more.
	StartsWithLetter bool
	// EndsWithLetter feature: the token has the necessary quality to be the
	// species part of a trinomial.
	EndsWithLetter bool
	// RankLike is true if the token is a known infraspecific rank.
	RankLike bool
	// UninomialDict defines which Genera or Uninomials dictionary (if any)
	// contained the token.
	UninomialDict dict.DictionaryType
	// SpeciesDict defines which Species dictionary (if any) contained the
	// token.
	SpeciesDict dict.DictionaryType
	// GenSpGreyDict shows how many specific/infraspecific epithets of a
	// putative name matched bi-/trinomials in a full name dictionary for
	// grey genera. For example, the name "Bubo bubo" would set it to 1, and
	// "Bubo bubo bubo" would set it to 2.
	GenSpGreyDict int
}
Features keeps the properties of a token as a possible candidate for a name part.
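To make the purely textual fields concrete, the sketch below derives five of them from a raw token string, following the field comments above. The type surfaceFeatures and the helper newSurfaceFeatures are hypothetical names; the dictionary-backed fields are left out because they depend on dict.Dictionary, and the package may compute these values differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// surfaceFeatures is a hypothetical subset of Features that can be
// computed from the raw token text alone.
type surfaceFeatures struct {
	IsCapitalized  bool
	HasDash        bool
	HasStartParens bool
	HasEndParens   bool
	Abbr           bool
}

// newSurfaceFeatures is an illustrative helper, not part of the package.
func newSurfaceFeatures(raw string) surfaceFeatures {
	var f surfaceFeatures
	// IsCapitalized looks at the first rune that is a letter, skipping
	// leading punctuation such as '('.
	for _, r := range raw {
		if unicode.IsLetter(r) {
			f.IsCapitalized = unicode.IsUpper(r)
			break
		}
	}
	f.HasDash = strings.ContainsRune(raw, '-')
	f.HasStartParens = strings.HasPrefix(raw, "(")
	f.HasEndParens = strings.HasSuffix(raw, ")")
	f.Abbr = strings.HasSuffix(raw, ".")
	return f
}

func main() {
	for _, raw := range []string{"(Bubo)", "B.", "novae-zelandiae"} {
		fmt.Printf("%q %+v\n", raw, newSurfaceFeatures(raw))
	}
}
```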
func (*Features) SetSpeciesDict
func (p *Features) SetSpeciesDict(cleaned string, d *dict.Dictionary)
func (*Features) SetUninomialDict
func (p *Features) SetUninomialDict(cleaned string, d *dict.Dictionary)
type NLP
type NLP struct {
	// Odds are posterior odds.
	Odds float64
	// OddsDetails are elements from which Odds are calculated.
	OddsDetails
	// LabelFreq is used to calculate prior odds of names appearing in a
	// document.
	LabelFreq bayes.LabelFreq
}
NLP collects data received from the Bayes algorithm.
type OddsDetails
type OddsDetails map[string]map[bayes.FeatureName]map[bayes.FeatureValue]float64
OddsDetails are elements from which Odds are calculated
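As a rough illustration of how posterior odds can be assembled from such a nested map, the sketch below multiplies prior odds by every per-feature value stored under a label, which is the usual naive Bayes combination. Whether the package stores likelihood ratios at the leaves and combines them exactly this way is an assumption. The local types, the function posteriorOdds, the label "Name", and the feature names are all hypothetical.

```go
package main

import "fmt"

// Local stand-ins for bayes.FeatureName and bayes.FeatureValue.
type featureName string
type featureValue string

// oddsDetails mirrors the shape of OddsDetails:
// label -> feature name -> feature value -> number.
type oddsDetails map[string]map[featureName]map[featureValue]float64

// posteriorOdds multiplies the prior odds by every value recorded for the
// label. It assumes each leaf is a likelihood ratio for one observed
// feature, as in a naive Bayes classifier.
func posteriorOdds(prior float64, od oddsDetails, label string) float64 {
	odds := prior
	for _, values := range od[label] {
		for _, ratio := range values {
			odds *= ratio
		}
	}
	return odds
}

func main() {
	// Made-up numbers, chosen only to keep the arithmetic easy to follow.
	od := oddsDetails{
		"Name": {
			"uniDict": {"white": 8.0},
			"abbr":    {"false": 0.5},
		},
	}
	fmt.Println(posteriorOdds(0.25, od, "Name"))
}
```

Keeping the individual ratios alongside the final number makes it possible to see which features pushed a candidate towards or away from being a name.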
func NewOddsDetails
func NewOddsDetails(l bayes.Likelihoods) OddsDetails