token

package
v0.10.0
Published: Apr 24, 2020 License: MIT Imports: 4 Imported by: 0

Documentation

Overview

Package token deals with breaking a text into tokens. It cleans up names that are broken across new lines by concatenating the pieces together. Tokens are connected to features, which are used by the heuristic and Bayes' approaches for finding names.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func SetIndices

func SetIndices(ts []Token, d *dict.Dictionary)

SetIndices takes a slice of tokens that correspond to a name candidate. It analyses the tokens and sets Token.Indices according to the feasibility of the input tokens forming a scientific name. It checks whether the tokens contain a possible species, rank, and infraspecies.

func UpperIndex added in v0.8.4

func UpperIndex(i int, l int) int

UpperIndex takes the index of a token and the length of the tokens slice and returns an upper index of what could be a slice of a name. We expect that most names will fit into 5 words. Other cases would require more thorough algorithms that we can run later as plugins.
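The package's own implementation is not shown here, but the documented contract can be sketched as capping the candidate slice at 5 words past the starting token, clamped to the slice length. The function name upperIndex and its body below are assumptions based only on that description.

```go
package main

import "fmt"

// upperIndex sketches the documented behavior of token.UpperIndex: given the
// index i of a token and the length l of the token slice, it returns the
// upper bound of a candidate name slice, assuming names fit into 5 words.
// This mirrors the doc comment, not the package's actual code.
func upperIndex(i, l int) int {
	upper := i + 5
	if upper > l {
		upper = l
	}
	return upper
}

func main() {
	fmt.Println(upperIndex(0, 3))  // slice shorter than 5 words → 3
	fmt.Println(upperIndex(2, 20)) // capped 5 words past i → 7
}
```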

Types

type Decision

type Decision int

Decision defines the possible kinds of name candidates.

const (
	NotName Decision = iota
	Uninomial
	Binomial
	PossibleBinomial
	Trinomial
	BayesUninomial
	BayesBinomial
	BayesTrinomial
)

Possible Decisions

func (Decision) Cardinality

func (d Decision) Cardinality() int

Cardinality returns the number of elements in the canonical form of a scientific name: 1 for a uninomial, 2 for a binomial, 3 for a trinomial.
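Based only on the documented mapping, the method can be sketched as a switch over the Decision constants. The values returned for NotName and PossibleBinomial are assumptions (0 here); the real method may treat them differently.

```go
package main

import "fmt"

// Decision mirrors the package's Decision constants.
type Decision int

const (
	NotName Decision = iota
	Uninomial
	Binomial
	PossibleBinomial
	Trinomial
	BayesUninomial
	BayesBinomial
	BayesTrinomial
)

// cardinality sketches the documented mapping: 1 for uninomials, 2 for
// binomials, 3 for trinomials. Returning 0 for NotName and PossibleBinomial
// is an assumption, not confirmed by the docs.
func cardinality(d Decision) int {
	switch d {
	case Uninomial, BayesUninomial:
		return 1
	case Binomial, BayesBinomial:
		return 2
	case Trinomial, BayesTrinomial:
		return 3
	}
	return 0
}

func main() {
	fmt.Println(cardinality(Binomial), cardinality(BayesTrinomial))
}
```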

func (Decision) In

func (d Decision) In(ds ...Decision) bool

In returns true if the Decision is included in the given list of Decisions.
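The documented behavior is a plain membership test over the variadic arguments; a minimal sketch (using a lowercase stand-in method, not the package's code) looks like this:

```go
package main

import "fmt"

// Decision mirrors the package's Decision type.
type Decision int

const (
	NotName Decision = iota
	Uninomial
	Binomial
	Trinomial
)

// in sketches the documented behavior of Decision.In: it reports whether d
// equals any of the Decisions passed in.
func (d Decision) in(ds ...Decision) bool {
	for _, e := range ds {
		if d == e {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(Binomial.in(Uninomial, Binomial)) // true
	fmt.Println(NotName.in(Uninomial, Binomial))  // false
}
```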

func (Decision) String

func (d Decision) String() string

String returns the string representation of a Decision.

type Features

type Features struct {
	// Candidate to be a start of a uninomial or binomial.
	NameStartCandidate bool
	// The name looks like a possible genus name.
	PotentialBinomialGenus bool
	// The token has the necessary qualities to be the start of a binomial.
	StartsWithLetter bool
	// The token has the necessary qualities to be the species part of a trinomial.
	EndsWithLetter bool
	// Capitalized feature of the first alphabetic character.
	Capitalized bool
	// CapitalizedSpecies -- the first alphabetic character of the species is capitalized.
	CapitalizedSpecies bool
	// HasDash is true if the '-' character is part of the word.
	HasDash bool
	// ParensStart feature: token starts with parentheses.
	ParensStart bool
	// ParensEnd feature: token ends with parentheses.
	ParensEnd bool
	// ParensEndSpecies feature: species token ends with parentheses.
	ParensEndSpecies bool
	// Abbr feature: token ends with a period.
	Abbr bool
	// RankLike is true if token is a known infraspecific rank
	RankLike bool
	// UninomialDict defines which Genera or Uninomials dictionary (if any)
	// contained the token.
	UninomialDict dict.DictionaryType
	// SpeciesDict defines which Species dictionary (if any) contained the token.
	SpeciesDict dict.DictionaryType
}

Features keeps the properties of a token that make it a possible candidate for a name part.

type Indices

type Indices struct {
	Species      int
	Rank         int
	Infraspecies int
}

Indices of the elements of a name candidate.

type NLP

type NLP struct {
	// Odds are posterior odds.
	Odds float64
	// OddsDetails are elements from which Odds are calculated.
	OddsDetails
	// LabelFreq is used to calculate prior odds of names appearing in a
	// document
	LabelFreq bayes.LabelFreq
}

NLP collects data received from the Bayes' algorithm.

type OddsDetails

type OddsDetails map[string]map[bayes.FeatureName]map[bayes.FeatureValue]float64

OddsDetails are the elements from which Odds are calculated.

func NewOddsDetails

func NewOddsDetails(l bayes.Likelihoods) OddsDetails

type Token

type Token struct {
	// Raw is a verbatim presentation of a token as it appears in a text.
	Raw []rune
	// Cleaned is a presentation of a token after normalization.
	Cleaned string
	// Start is the index of the first rune of a token. The first rune
	// does not have to be alpha-numeric.
	Start int
	// End is the index of the last rune of a token. The last rune does not
	// have to be alpha-numeric.
	End int
	// Decision tags the first token of a possible name with a classification
	// decision.
	Decision
	// Indices of semantic elements of a possible name.
	Indices
	// NLP data
	NLP
	// Features is a collection of features associated with the token
	Features
}

Token represents a word separated by spaces in a text. Words split by new lines are concatenated.

func NewToken

func NewToken(text []rune, start int, end int) Token

NewToken constructs a new Token object.

func Tokenize

func Tokenize(text []rune) []Token

Tokenize breaks the text into tokens, returning a slice that contains every word in the document.
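A much-simplified sketch of this kind of tokenization splits the rune slice on whitespace and records each word's rune offsets, matching the Start/End fields described on Token. The real Tokenize also cleans words and rejoins names broken across new lines; the half-open [start, end) convention and all names below are assumptions for illustration.

```go
package main

import (
	"fmt"
	"unicode"
)

// tok is a simplified stand-in for token.Token, keeping only the raw runes
// and the start/end rune indices.
type tok struct {
	raw        []rune
	start, end int
}

// tokenize is a simplified sketch of whitespace tokenization over runes.
// It does none of the cleaning or line-joining the real package performs.
func tokenize(text []rune) []tok {
	var ts []tok
	start := -1
	for i, r := range text {
		if unicode.IsSpace(r) {
			if start >= 0 {
				ts = append(ts, tok{raw: text[start:i], start: start, end: i})
				start = -1
			}
		} else if start < 0 {
			start = i
		}
	}
	if start >= 0 {
		ts = append(ts, tok{raw: text[start:], start: start, end: len(text)})
	}
	return ts
}

func main() {
	for _, t := range tokenize([]rune("Homo sapiens Linnaeus")) {
		fmt.Printf("%q [%d:%d]\n", string(t.raw), t.start, t.end)
	}
}
```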

func (*Token) Clean

func (t *Token) Clean()

Clean converts the verbatim (Raw) string of a token into a normalized, cleaned-up version.

func (*Token) InParentheses

func (t *Token) InParentheses() bool

InParentheses is true if the token is surrounded by parentheses.
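From the description alone, the check can be sketched as testing the first and last runes of the raw token. The standalone function below is an assumption for illustration; the real method reads Token.Raw.

```go
package main

import "fmt"

// inParentheses sketches the documented behavior of (*Token).InParentheses:
// true when the raw token both starts with '(' and ends with ')'.
func inParentheses(raw []rune) bool {
	return len(raw) >= 2 && raw[0] == '(' && raw[len(raw)-1] == ')'
}

func main() {
	fmt.Println(inParentheses([]rune("(Felis)"))) // true
	fmt.Println(inParentheses([]rune("Felis")))   // false
}
```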

func (*Token) SetRank

func (t *Token) SetRank(d *dict.Dictionary)

func (*Token) SetSpeciesDict

func (t *Token) SetSpeciesDict(d *dict.Dictionary)

func (*Token) SetUninomialDict

func (t *Token) SetUninomialDict(d *dict.Dictionary)
