token

package
v1.0.0
Published: Aug 24, 2022 License: MIT Imports: 9 Imported by: 0

Documentation

Overview

Package token deals with breaking a text into tokens. It cleans names broken by new lines, concatenating pieces together. Tokens are connected to properties. Properties are used for heuristic and Bayes' approaches for finding names.
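The cleaning step mentioned above can be illustrated with a minimal sketch. It assumes that a word broken by a new line looks like a letter, a hyphen, a line break, and then a lowercase letter. The rules the package actually applies are not shown in this documentation, and `joinBrokenWords` is a hypothetical helper, not part of the package API.

```go
package main

import (
	"fmt"
	"regexp"
)

// brokenWord matches a letter, a hyphen, a line break with optional
// indentation, and a lowercase letter. Assumption: this pattern marks a word
// split across lines; the package's real rules are not documented here.
var brokenWord = regexp.MustCompile(`(\p{L})-\r?\n[ \t]*(\p{Ll})`)

// joinBrokenWords is a hypothetical helper that glues the two halves of a
// hyphenated word back together.
func joinBrokenWords(text string) string {
	return brokenWord.ReplaceAllString(text, "${1}${2}")
}

func main() {
	fmt.Println(joinBrokenWords("The spider Pardosa mo-\nesta was collected."))
}
```

A hyphen inside a line, as in "novae-zelandiae", is left untouched because the pattern requires a line break right after the hyphen.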

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewTokenSN

func NewTokenSN(token gner.TokenNER) gner.TokenNER

NewTokenSN is a factory and a wrapper. It takes a gner.TokenNER object and wraps it in the TokenSN interface.

func SetIndices

func SetIndices(ts []TokenSN, d *dict.Dictionary)

SetIndices takes a slice of tokens that corresponds to a name candidate. It analyses the tokens and sets Token.Indices according to how feasibly the input tokens form a scientific name. It checks for a possible species, rank, and infraspecies.

func UpperIndex

func UpperIndex(i int, l int) int

UpperIndex takes the index of a token and the length of the tokens slice and returns the upper index of what could be a slice of a name. We expect that most names will fit into 5 words. Other cases would require more thorough algorithms that we can run later as plugins.
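A minimal sketch of this windowing idea, assuming the upper index is exclusive and the window is capped at 5 tokens or at the end of the slice, whichever comes first. `upperIndexSketch` and `maxNameWords` are illustrative names, not the package's own code.

```go
package main

import "fmt"

// maxNameWords is the assumed window size: the docs say most names fit
// into 5 words.
const maxNameWords = 5

// upperIndexSketch is a hypothetical stand-in for UpperIndex. It returns the
// exclusive upper bound of a candidate window that starts at token i in a
// slice of length l.
func upperIndexSketch(i, l int) int {
	upper := i + maxNameWords
	if upper > l {
		upper = l
	}
	return upper
}

func main() {
	words := []string{"The", "owl", "Bubo", "bubo", "was", "seen", "at", "night"}
	// Print the candidate window that starts at every token.
	for i := range words {
		fmt.Println(i, words[i:upperIndexSketch(i, len(words))])
	}
}
```

Capping at the slice length keeps `words[i:upper]` in bounds for tokens near the end of the text.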

Types

type Decision

type Decision int

Decision defines possible kinds of name candidates.

const (
	NotName Decision = iota
	Uninomial
	PossibleUninomial
	Binomial
	PossibleBinomial
	Trinomial
	BayesUninomial
	BayesBinomial
	BayesTrinomial
)

Possible Decisions

func (Decision) Cardinality

func (d Decision) Cardinality() int

Cardinality returns the number of elements in the canonical form of a scientific name: 1 for a uninomial, 2 for a binomial, 3 for a trinomial.
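The mapping can be sketched as a switch over a local copy of the Decision constants. This is an illustration under two assumptions: the Possible* and Bayes* variants share the cardinality of their plain counterparts, and NotName maps to 0. Neither is stated in the documentation, and `cardinalitySketch` is a hypothetical name.

```go
package main

import "fmt"

// Decision is a local copy of the enum listed in this documentation.
type Decision int

const (
	NotName Decision = iota
	Uninomial
	PossibleUninomial
	Binomial
	PossibleBinomial
	Trinomial
	BayesUninomial
	BayesBinomial
	BayesTrinomial
)

// cardinalitySketch is a hypothetical stand-in for Decision.Cardinality.
func cardinalitySketch(d Decision) int {
	switch d {
	case Uninomial, PossibleUninomial, BayesUninomial:
		return 1
	case Binomial, PossibleBinomial, BayesBinomial:
		return 2
	case Trinomial, BayesTrinomial:
		return 3
	default:
		// Assumption: NotName has no canonical elements.
		return 0
	}
}

func main() {
	fmt.Println(
		cardinalitySketch(Uninomial),
		cardinalitySketch(BayesBinomial),
		cardinalitySketch(Trinomial),
		cardinalitySketch(NotName),
	) // prints "1 2 3 0"
}
```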

func (Decision) In

func (d Decision) In(ds ...Decision) bool

In returns true if the Decision is one of the given constants.
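A membership check of this kind is usually a linear scan over the variadic arguments. The sketch below uses a local copy of the Decision constants and a hypothetical free function `inSketch`; it shows the pattern and is not the package's implementation.

```go
package main

import "fmt"

// Decision is a local copy of the enum listed in this documentation.
type Decision int

const (
	NotName Decision = iota
	Uninomial
	PossibleUninomial
	Binomial
	PossibleBinomial
	Trinomial
	BayesUninomial
	BayesBinomial
	BayesTrinomial
)

// inSketch is a hypothetical stand-in for Decision.In: it reports whether d
// equals any of the decisions in ds.
func inSketch(d Decision, ds ...Decision) bool {
	for _, v := range ds {
		if d == v {
			return true
		}
	}
	return false
}

func main() {
	d := PossibleBinomial
	fmt.Println(inSketch(d, Binomial, PossibleBinomial, BayesBinomial)) // true
	fmt.Println(inSketch(d, NotName, Uninomial))                        // false
	fmt.Println(inSketch(d))                                            // false: nothing to match
}
```

The variadic form lets a caller test for a whole family of decisions in one call instead of chaining comparisons.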

func (Decision) String

func (d Decision) String() string

String returns the string representation of a Decision.

type Features

type Features struct {
	// IsCapitalized is true if the first rune that is a letter is capitalized.
	IsCapitalized bool

	// HasDash is true if token contains dash
	HasDash bool

	// HasStartParens is true if token starts with '('
	HasStartParens bool

	// HasEndParens is true if token ends with ')'
	HasEndParens bool

	// Abbr feature: token ends with a period.
	Abbr bool

	// PotentialBinomialGenus feature: the token might be a genus of name.
	PotentialBinomialGenus bool

	// StartsWithLetter feature: the token has the necessary qualities to be a
	// start of a binomial species. It is assumed to be lowercase and two
	// letters or more.
	StartsWithLetter bool

	// EndsWithLetter feature: the token has the necessary quality to be the
	// species part of a trinomial.
	EndsWithLetter bool

	// RankLike is true if token is a known infraspecific rank
	RankLike bool

	// UninomialDict defines which Genera or Uninomials dictionary (if any)
	// contained the token.
	UninomialDict dict.DictionaryType

	// SpeciesDict defines which Species dictionary (if any) contained the token.
	SpeciesDict dict.DictionaryType

	// GenSpInAmbigDict shows how many specific/infraspecific epithets of a putative
	// name matched bi-/tri- nomials in a full name dictionary for grey genera.
	// For example "Bubo bubo" name would set it to 1, and "Bubo bubo bubo" would
	// set it to 2.
	GenSpInAmbigDict int
}

Features keeps the properties of a token as a possible candidate for a name part.
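Several of these fields depend only on the raw token text, so they can be illustrated without a dictionary. The sketch below computes that subset following the field comments above. `surfaceFeatures` and `surfaceFeaturesOf` are hypothetical names, and the package may derive the values differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// surfaceFeatures is a hypothetical subset of Features that can be computed
// from the raw token text alone.
type surfaceFeatures struct {
	IsCapitalized  bool
	HasDash        bool
	HasStartParens bool
	HasEndParens   bool
	Abbr           bool
}

// surfaceFeaturesOf fills the subset according to the field comments in the
// Features struct.
func surfaceFeaturesOf(raw string) surfaceFeatures {
	var f surfaceFeatures
	// IsCapitalized looks only at the first rune that is a letter.
	for _, r := range raw {
		if unicode.IsLetter(r) {
			f.IsCapitalized = unicode.IsUpper(r)
			break
		}
	}
	f.HasDash = strings.ContainsRune(raw, '-')
	f.HasStartParens = strings.HasPrefix(raw, "(")
	f.HasEndParens = strings.HasSuffix(raw, ")")
	f.Abbr = strings.HasSuffix(raw, ".")
	return f
}

func main() {
	for _, raw := range []string{"(Bubo)", "B.", "novae-zelandiae"} {
		fmt.Printf("%-16s %+v\n", raw, surfaceFeaturesOf(raw))
	}
}
```

Fields such as RankLike, UninomialDict, and SpeciesDict need a dict.Dictionary lookup and are left out of this sketch.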

func (*Features) SetRank

func (p *Features) SetRank(raw string, d *dict.Dictionary)

func (*Features) SetSpeciesDict

func (p *Features) SetSpeciesDict(cleaned string, d *dict.Dictionary)

func (*Features) SetUninomialDict

func (p *Features) SetUninomialDict(cleaned string, d *dict.Dictionary)

type Indices

type Indices struct {
	Species      int
	Rank         int
	Infraspecies int
}

Indices of the elements of a name candidate.

type NLP

type NLP struct {
	// Odds are posterior odds.
	Odds float64

	// ClassCases is used to calculate prior odds of names appearing in a
	// document.
	ClassCases map[feature.Class]int

	// OddsDetails are used for calculating final odds for detected names and
	// for displaying results in the output
	OddsDetails
}

NLP collects data received from Bayes' algorithm
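To show how prior and posterior odds relate in a naive Bayes setting, here is a minimal sketch. It assumes that the prior odds are the ratio of name cases to non-name cases and that the posterior odds are the prior odds multiplied by one likelihood ratio per feature. The `class` type, its labels, the feature keys, and all numbers are made up for illustration; the documentation does not describe the package's actual calculation.

```go
package main

import "fmt"

// class is a hypothetical stand-in for feature.Class.
type class string

const (
	isName  class = "Name"    // assumed label
	notName class = "NotName" // assumed label
)

// priorOdds derives the prior odds of a name from class counts.
// It returns 0 when there are no non-name cases to compare against.
func priorOdds(cases map[class]int) float64 {
	if cases[notName] == 0 {
		return 0
	}
	return float64(cases[isName]) / float64(cases[notName])
}

// posteriorOdds multiplies the prior odds by each feature's likelihood
// ratio, as in a naive Bayes classifier.
func posteriorOdds(prior float64, likelihoodRatios map[string]float64) float64 {
	odds := prior
	for _, lr := range likelihoodRatios {
		odds *= lr
	}
	return odds
}

func main() {
	cases := map[class]int{isName: 10, notName: 990} // made-up counts
	details := map[string]float64{                   // made-up likelihood ratios
		"uniDict": 50,
		"spDict":  20,
		"abbr":    0.5,
	}
	prior := priorOdds(cases)
	fmt.Printf("prior odds: %.4f\n", prior)
	fmt.Printf("posterior odds: %.2f\n", posteriorOdds(prior, details))
}
```

Posterior odds above 1 mean a candidate is more likely a name than not under these assumptions. A per-feature map like the one here is the kind of breakdown that OddsDetails is shaped to hold.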

type OddsDetails

type OddsDetails map[string]float64

func NewOddsDetails

func NewOddsDetails(odds posterior.Odds) OddsDetails

func (OddsDetails) MarshalJSON added in v0.17.0

func (od OddsDetails) MarshalJSON() ([]byte, error)

type TokenSN

type TokenSN interface {
	gner.TokenNER
	Features() *Features
	NLP() *NLP
	Indices() *Indices
	Decision() Decision
	SetDecision(d Decision)
}

func Tokenize

func Tokenize(text []rune) []TokenSN

Tokenize creates a slice containing a token for every word in the document.
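A minimal sketch of word-level tokenization over a rune slice, assuming tokens are separated by Unicode whitespace and carry their rune offsets. `wordToken` and `tokenizeSketch` are hypothetical; the real TokenSN values also carry features, indices, and decisions, which this sketch does not attempt.

```go
package main

import (
	"fmt"
	"unicode"
)

// wordToken is a hypothetical, simplified token: the raw text plus its rune
// offsets in the source.
type wordToken struct {
	Raw   string
	Start int // offset of the first rune
	End   int // offset just past the last rune
}

// tokenizeSketch splits text on whitespace and records where each word sits.
func tokenizeSketch(text []rune) []wordToken {
	var tokens []wordToken
	start := -1 // -1 means "not inside a word"
	for i, r := range text {
		switch {
		case unicode.IsSpace(r):
			if start >= 0 {
				tokens = append(tokens, wordToken{string(text[start:i]), start, i})
				start = -1
			}
		case start < 0:
			start = i
		}
	}
	// Flush the last word if the text does not end with whitespace.
	if start >= 0 {
		tokens = append(tokens, wordToken{string(text[start:]), start, len(text)})
	}
	return tokens
}

func main() {
	for _, tok := range tokenizeSketch([]rune("Bubo bubo\n(Linnaeus, 1758)")) {
		fmt.Printf("%q [%d:%d]\n", tok.Raw, tok.Start, tok.End)
	}
}
```

Working on `[]rune` rather than `string` keeps the offsets in characters, not bytes, which matters for texts with non-ASCII author names.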
