gnfinder

Version: v0.6.0
Published: Oct 24, 2018 License: MIT Imports: 12 Imported by: 3

README

Global Names Finder

Finds scientific names using dictionary and NLP approaches.

Features

  • Multiplatform packages (Linux, Windows, Mac OS X).
  • Self-contained, with no external dependencies: only the gnfinder binary (gnfinder.exe on Windows, ~15 MB) is needed.
  • Takes UTF8-encoded text and returns JSON-formatted output that contains the detected scientific names.
  • Automatically detects the language of the text and adjusts the Bayes algorithm for that language. English and German are currently supported.
  • Uses complementary heuristic and natural language processing algorithms.
  • Does not use the Bayes algorithm if the language cannot be detected. There is an option that overrides this rule.
  • Optionally verifies found names against multiple biodiversity databases using the gnindex service.
  • The library can be used concurrently to significantly improve speed. On a server with 40 threads it is able to detect names on 50 million pages in approximately 3 hours using both heuristic and Bayes algorithms. Check the bhlindex project for an example.
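
The concurrent use mentioned in the last feature could be organized as a worker pool. The following is a minimal sketch, not the approach taken by bhlindex: `findNames` is a hypothetical stand-in that only counts capitalized words so that the example runs without gnfinder, and `processPages` is likewise an illustrative name.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// findNames is a hypothetical stand-in for a real name-finding call; it only
// counts capitalized words so that the example runs without gnfinder.
func findNames(page string) int {
	n := 0
	for _, w := range strings.Fields(page) {
		if w[0] >= 'A' && w[0] <= 'Z' {
			n++
		}
	}
	return n
}

// processPages fans pages out to the given number of goroutines and sums
// the per-page results.
func processPages(pages []string, workers int) int {
	jobs := make(chan string)
	results := make(chan int)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				results <- findNames(p)
			}
		}()
	}

	// Feed the jobs, then close results once every worker has finished.
	go func() {
		for _, p := range pages {
			jobs <- p
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()

	total := 0
	for r := range results {
		total += r
	}
	return total
}

func main() {
	pages := []string{"Pomatomus saltator and Parus major", "no names here"}
	fmt.Println(processPages(pages, 4))
}
```

In a real program the stand-in would be replaced by a call into the library, with one shared dictionary loaded before the workers start.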

Install as a command line app

Download the binary executable for your operating system from the latest release.

Linux or OS X

Move the gnfinder executable somewhere in your PATH (for example /usr/local/bin):

sudo mv path_to/gnfinder /usr/local/bin

Windows

One option is to create a folder for executables and place gnfinder there.

Press Windows+R and type "cmd". In the terminal window that appears, type:

mkdir C:\bin
copy path_to\gnfinder.exe C:\bin

Add C:\bin directory to your PATH environment variable.

Go

go get github.com/gnames/gnfinder
cd $GOPATH/src/github.com/gnames/gnfinder
make install

Usage as a command line app

To see flags and usage:

gnfinder --help

Examples:

Getting data from a pipe, forcing the English language and enabling verification

echo "Pomatomus saltator and Parus major" | gnfinder -c -l eng

Verifying data against NCBI and Encyclopedia of Life

echo "Pomatomus saltator and Parus major" | gnfinder -c -l eng -s "4,12"

Getting data from a file and redirecting result to another file

gnfinder file1.txt > file2.json

Usage as a library

go get github.com/gnames/gnfinder
go get github.com/json-iterator/go
go get github.com/rakyll/statik
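# To update dictionaries if they are changed
cd $GOPATH/src/github.com/gnames/gnfinder
go generate

import (
  "github.com/gnames/gnfinder"
  "github.com/gnames/gnfinder/dict"
)

// A call result is not addressable in Go, so keep the dictionary in a
// variable (this assumes LoadDictionary returns a dict.Dictionary value).
d := dict.LoadDictionary()
bytesText := []byte(utfText)

// opts is a []util.Opt slice, expanded into the variadic parameter.
jsonNames := gnfinder.FindNamesJSON(bytesText, &d, opts...)

Development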

To install the latest gnfinder:

go get github.com/gnames/gnfinder
cd $GOPATH/src/github.com/gnames/gnfinder
make
gnfinder -h

Testing

Install ginkgo, a BDD testing framework for Go.

go get github.com/onsi/ginkgo/ginkgo
go get github.com/onsi/gomega

To run the tests, go to the root directory of the project and run

ginkgo

# or

go test

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func FindNamesJSON

func FindNamesJSON(data []byte, dict *dict.Dictionary,
	opts ...util.Opt) []byte

FindNamesJSON takes a text and returns the scientific names found in it, as well as tokens.

func UniqueNameStrings added in v0.6.0

func UniqueNameStrings(names []Name) []string

UniqueNameStrings takes a list of names and returns a list of unique name-strings.
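
The deduplication described here can be sketched with a set-like map. This is not the library's implementation: the `Name` type below is a trimmed local mirror, `uniqueNameStrings` is an illustrative name, and returning results in first-seen order is a choice of this sketch, since the page does not state an ordering.

```go
package main

import "fmt"

// Name is a trimmed local mirror of the documented Name type.
type Name struct {
	Name string
}

// uniqueNameStrings returns each distinct Name value once, in first-seen
// order. The ordering is a choice of this sketch.
func uniqueNameStrings(names []Name) []string {
	seen := make(map[string]struct{}, len(names))
	var res []string
	for _, n := range names {
		if _, ok := seen[n.Name]; ok {
			continue
		}
		seen[n.Name] = struct{}{}
		res = append(res, n.Name)
	}
	return res
}

func main() {
	names := []Name{{"Parus major"}, {"Pomatomus saltator"}, {"Parus major"}}
	fmt.Println(uniqueNameStrings(names))
}
```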

Types

type Meta

type Meta struct {
	// Date represents time when output was generated.
	Date time.Time `json:"date"`
	// Language of the document
	Language string `json:"language"`
	// TotalTokens is a number of 'normalized' words in the text
	TotalTokens int `json:"total_words"`
	// TotalNameCandidates is a number of words that might be a start of
	// a scientific name
	TotalNameCandidates int `json:"total_candidates"`
	// TotalNames is a number of scientific names found
	TotalNames int `json:"total_names"`
	// CurrentName (optional) is the index in the names array that designates a
	// "position of a cursor". It is used by programs like gntagger that allow
	// working on the list of found names interactively.
	CurrentName int `json:"current_index,omitempty"`
}

Meta contains meta-information of name-finding result.

type Name

type Name struct {
	Type         string                `json:"type"`
	Verbatim     string                `json:"verbatim"`
	Name         string                `json:"name"`
	Odds         float64               `json:"odds,omitempty"`
	OddsDetails  token.OddsDetails     `json:"odds_details,omitempty"`
	OffsetStart  int                   `json:"start"`
	OffsetEnd    int                   `json:"end"`
	Annotation   string                `json:"annotation"`
	Verification resolver.Verification `json:"verification,omitempty"`
}

Name represents one found name.
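
The OffsetStart and OffsetEnd fields locate a name in the source text. Because FindNames takes the text as []rune, the sketch below treats the offsets as rune indices with an exclusive end. Both points are assumptions that this page does not confirm, and `verbatimAt` is a hypothetical helper.

```go
package main

import "fmt"

// verbatimAt returns text[start:end] as a string, or "" when the offsets are
// out of range. Offsets are treated as rune indices with an exclusive end,
// which is an assumption of this sketch.
func verbatimAt(text []rune, start, end int) string {
	if start < 0 || end > len(text) || start > end {
		return ""
	}
	return string(text[start:end])
}

func main() {
	// The non-ASCII letters make byte offsets differ from rune offsets.
	text := []rune("Größe: Parus major")
	fmt.Println(verbatimAt(text, 7, 18))
}
```

If the offsets are rune-based, slicing the original string by bytes would cut at the wrong place whenever non-ASCII characters precede the name, so converting to []rune first is the safer default.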

func TokensToName

func TokensToName(ts []token.Token, text []rune) Name

type OddsDatum

type OddsDatum struct {
	Name bool
	Odds float64
}

OddsDatum is a simplified version of a name that stores a boolean decision (Name/NotName) and the corresponding odds of the name.
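
If Odds follows the conventional definition p/(1-p), which is an assumption not confirmed on this page, it can be converted to a probability as odds/(1+odds). The helper name `probabilityFromOdds` is hypothetical.

```go
package main

import "fmt"

// probabilityFromOdds converts odds of the form p/(1-p) to the probability p.
// Treating the Odds field this way is an assumption of this sketch.
func probabilityFromOdds(odds float64) float64 {
	return odds / (1 + odds)
}

func main() {
	fmt.Println(probabilityFromOdds(3))
}
```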

type Output

type Output struct {
	Meta  `json:"metadata"`
	Names []Name `json:"names"`
}

Output type is the result of name-finding.
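
Output embeds Meta with an explicit JSON tag. With the standard encoding/json package, a tagged embedded struct is encoded as a nested object under the tag name instead of being flattened into its parent. The sketch below demonstrates that behaviour on trimmed local copies of the types. The page lists json-iterator among the library's dependencies, so treating its encoding as equivalent to encoding/json is an assumption, and `toJSON` is a hypothetical helper, not the documented ToJSON method.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Meta, Name and Output are trimmed local copies of the documented types.
type Meta struct {
	Language    string `json:"language"`
	TotalNames  int    `json:"total_names"`
	CurrentName int    `json:"current_index,omitempty"`
}

type Name struct {
	Name string `json:"name"`
}

type Output struct {
	Meta  `json:"metadata"`
	Names []Name `json:"names"`
}

// toJSON encodes o with the standard library encoder.
func toJSON(o Output) (string, error) {
	b, err := json.Marshal(o)
	return string(b), err
}

func main() {
	o := Output{
		Meta:  Meta{Language: "eng", TotalNames: 1},
		Names: []Name{{Name: "Parus major"}},
	}
	s, err := toJSON(o)
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	// Meta appears under "metadata"; the zero CurrentName is omitted.
	fmt.Println(s)
}
```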

func CollectOutput

func CollectOutput(ts []token.Token, text []rune, m *util.Model) Output

CollectOutput takes tagged tokens and assembles gnfinder output out of them.

func FindNames

func FindNames(text []rune, d *dict.Dictionary, m *util.Model) Output

FindNames traverses a text and finds scientific names in it.

func NewOutput

func NewOutput(names []Name, ts []token.Token, m *util.Model) Output

NewOutput is a constructor for Output type.

func (*Output) FromJSON

func (o *Output) FromJSON(data []byte)

FromJSON converts a JSON representation of Output to an Output object.

func (*Output) ToJSON

func (o *Output) ToJSON() []byte

ToJSON converts Output to JSON representation.

Directories

Path Synopsis
package dict contains dictionaries for finding scientific names
cmd
Package resolver verifies found name-strings against gnindex site located at https://index.globalnames.org.
scripts
Package token deals with breaking a text into tokens.
Package util contains useful shared functions