gnfinder

package module
v0.11.1
Published: Jun 15, 2020 License: MIT Imports: 9 Imported by: 3

README

Global Names Finder

Finds scientific names using dictionary and nlp approaches.

Features

  • Multiplatform packages (Linux, Windows, Mac OS X).
  • Self-contained, with no external dependencies: only the gnfinder or gnfinder.exe binary (~15 MB) is needed. However, an internet connection is required for name verification.
  • Takes UTF8-encoded text and returns JSON-formatted output that contains the detected scientific names.
  • Optionally, automatically detects the language of the text, and adjusts Bayes algorithm for the language. English and German languages are currently supported.
  • Uses complementary heuristic and natural language processing algorithms.
  • Optionally verifies found names against multiple biodiversity databases using gnindex service.
  • Detection of nomenclatural annotations like sp. nov., comb. nov., ssp. nov. and their variants.
  • Ability to see words that surround detected name-strings.
  • The library can be used concurrently to significantly improve speed. On a server with 40 threads, it detects names on 50 million pages in approximately 3 hours using both heuristic and Bayes algorithms. See the bhlindex project for an example.
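
The concurrency point in the last feature can be illustrated with a small worker pool. This is only a sketch under stated assumptions: finder, toyFinder, isLower and findAll are hypothetical names that are not part of gnfinder, and toyFinder is a deliberately naive stand-in for real name detection. In a real program each worker would presumably own its own finder instance; whether a single instance may be shared between goroutines is not stated here.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"unicode"
)

// finder is a hypothetical stand-in for a name-finding call that takes
// page text and returns the name-strings detected in it.
type finder func(text []byte) []string

// toyFinder is NOT gnfinder's algorithm: it reports a capitalized word
// followed by a lowercase word, which is just enough to exercise the pool.
func toyFinder(text []byte) []string {
	words := strings.Fields(string(text))
	var names []string
	for i := 0; i+1 < len(words); i++ {
		first := []rune(words[i])
		if unicode.IsUpper(first[0]) && isLower(words[i+1]) {
			names = append(names, words[i]+" "+words[i+1])
		}
	}
	return names
}

// isLower reports whether w has at least one lowercase letter and no
// uppercase letters.
func isLower(w string) bool {
	return strings.ToLower(w) == w && strings.ToUpper(w) != w
}

// findAll distributes pages over a fixed number of workers. Each worker
// writes only to the result slot of the page it processes, so the results
// slice needs no extra locking.
func findAll(pages [][]byte, workers int, find finder) [][]string {
	if workers < 1 {
		workers = 1
	}
	results := make([][]string, len(pages))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				results[i] = find(pages[i])
			}
		}()
	}
	for i := range pages {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	pages := [][]byte{
		[]byte("Pomatomus saltator and Parus major"),
		[]byte("no names on this page"),
	}
	for i, names := range findAll(pages, 4, toyFinder) {
		fmt.Printf("page %d: %q\n", i, names)
	}
}
```

Results are stored by page index, so the output order matches the input order regardless of which worker finishes first.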

Install as a command line app

Download the binary executable for your operating system from the latest release.

Linux or OS X

Move the gnfinder executable somewhere in your PATH (for example /usr/local/bin):

sudo mv path_to/gnfinder /usr/local/bin

Windows

One possible way would be to create a default folder for executables and place gnfinder there.

Press the Windows+R key combination and type "cmd". In the terminal window that appears, type:

mkdir C:\bin
copy path_to\gnfinder.exe C:\bin

Add C:\bin directory to your PATH environment variable.

Go

go get github.com/gnames/gnfinder
cd $GOPATH/src/github.com/gnames/gnfinder
make install

Usage

Usage as a command line app

To see flags and usage:

gnfinder --help
# or just
gnfinder

To see the version of the binary:

gnfinder -v

Examples:

Getting data from a pipe forcing English language and verification

echo "Pomatomus saltator and Parus major" | gnfinder find -c -l eng

Displaying matches from NCBI and Encyclopedia of Life, if they exist. For the list of data source IDs, see gnresolver.

echo "Pomatomus saltator and Parus major" | gnfinder find -c -l eng -s "4,12"

Returning 5 words before and after each found name-candidate.

gnfinder find -t 5 file_with_names.txt

Getting data from a file and redirecting result to another file

gnfinder find file1.txt > file2.json

Detection of nomenclatural annotations

echo "Parus major sp. n." | gnfinder find

Usage as gRPC service

Start gnfinder as a gRPC server:

# using default 8778 port
gnfinder grpc

# using some other port
gnfinder grpc -p 8901

Use a gRPC client for gnfinder. To learn how to make one, check a Ruby implementation of a client.

Usage as a library

cd $GOPATH/src/github.com/gnames/gnfinder
make deps

import (
  "github.com/gnames/gnfinder"
)

bytesText := []byte(utfText)

// FindNamesJSON is a method of GNfinder, so create a GNfinder first.
gnf := gnfinder.NewGNfinder()
jsonNames := gnf.FindNamesJSON(bytesText)

Usage as a docker container

docker pull gnames/gnfinder

# run gnfinder server, and map it to port 8888 on the host machine
docker run -d -p 8888:8778 --name gnfinder gnames/gnfinder

Development

To install the latest gnfinder

Download protoc binary compiled for your OS from protobuf releases.

Install protobuf on Mac

brew install protobuf

If you see any error messages, run brew doctor, follow any recommended fixes, and try again. If it still fails, try instead:

brew upgrade protobuf

Alternately, run the following commands:

PROTOC_ZIP=protoc-3.11.4-osx-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.11.4/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP

Or manually download and install protoc from protobuf releases.

Install protobuf on Linux

Run the following commands:

PROTOC_ZIP=protoc-3.11.4-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.11.4/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP

Or manually download and install protoc from protobuf releases.

Install gnfinder

go get github.com/gnames/gnfinder
cd $GOPATH/src/github.com/gnames/gnfinder
make deps
make
gnfinder -h

Testing

Install ginkgo, a BDD testing framework for Go.

make deps

To run tests, go to the root directory of the project and run:

ginkgo

# or

go test

# or

make test

Documentation

Index

Constants

This section is empty.

Variables

var (
	Version = "v0.11.1"
	Build   string
)

Functions

This section is empty.

Types

type GNfinder added in v0.8.4

type GNfinder struct {
	// Language for name-finding in the text.
	Language lang.Language
	// LanguageDetected is the code of a language that was detected in text.
	// It is an empty string, if detection of language is not set.
	LanguageDetected string
	// DetectLanguage flag is true if we want to detect language automatically.
	DetectLanguage bool
	// Bayes is true when we run Bayes algorithm, and false when we don't.
	Bayes bool
	// BayesOddsThreshold sets the limit of posterior odds. Everything bigger
	// than this limit will go to the names output.
	BayesOddsThreshold float64
	// BayesOddsDetails show odds calculation details in the CLI output.
	BayesOddsDetails bool
	// TextOdds captures "concentration" of names as it is found for the whole
	// text by heuristic name-finding. It should be close enough for real
	// number of names in text. We use it when we do not have local concentration
	// of names in a region of text.
	TextOdds bayes.LabelFreq
	// TokensAround gives number of tokens kept before and after each
	// name-candidate.
	TokensAround int

	// Verifier for scientific names.
	Verifier *verifier.Verifier
	// Dict contains black, grey, and white list dictionaries.
	Dict *dict.Dictionary
	// BayesWeights contains training for all supported bayes dictionaries.
	BayesWeights map[lang.Language]*bayes.NaiveBayes
}

GNfinder is responsible for name-finding operations.

func NewGNfinder added in v0.8.4

func NewGNfinder(opts ...Option) *GNfinder

NewGNfinder creates GNfinder object with default data, or with data coming from opts.

func (*GNfinder) FindNames added in v0.8.4

func (gnf *GNfinder) FindNames(data []byte, opts ...Option) *output.Output

FindNames traverses a text and finds scientific names in it.

func (*GNfinder) FindNamesJSON added in v0.8.4

func (gnf *GNfinder) FindNamesJSON(data []byte, opts ...Option) []byte

FindNamesJSON takes a text as bytes and returns a JSON representation of the scientific names found in the text.

func (*GNfinder) Update added in v0.9.0

func (gnf *GNfinder) Update(opts ...Option) []Option

Update applies new options to the GNfinder object and returns options that can be used to revert GNfinder back to its previous state.

type Option added in v0.8.4

type Option func(*GNfinder)

Option type for changing GNfinder settings.
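
The Option type follows the functional-options pattern, and Update is described as returning options that revert a change. The sketch below shows that pattern on a miniature type. Everything in it (config, option, optTokensAround, optBayes, newConfig, update) is a hypothetical stand-in, and snapshotting the whole struct is only one possible way to build the revert options, not necessarily how gnfinder does it.

```go
package main

import "fmt"

// config is a hypothetical miniature of a settings struct.
type config struct {
	tokensAround int
	bayes        bool
}

// option mutates a config, mirroring the shape of func(*GNfinder).
type option func(*config)

func optTokensAround(n int) option { return func(c *config) { c.tokensAround = n } }

func optBayes(b bool) option { return func(c *config) { c.bayes = b } }

// newConfig builds a config with zero-value defaults and then applies opts.
func newConfig(opts ...option) *config {
	c := &config{}
	for _, o := range opts {
		o(c)
	}
	return c
}

// update applies opts and returns options that restore the previous state.
// This sketch snapshots the whole struct before applying the changes.
func (c *config) update(opts ...option) []option {
	prev := *c
	for _, o := range opts {
		o(c)
	}
	return []option{func(cfg *config) { *cfg = prev }}
}

func main() {
	c := newConfig(optTokensAround(2))
	undo := c.update(optTokensAround(5), optBayes(true))
	fmt.Printf("after update: %+v\n", *c)
	c.update(undo...)
	fmt.Printf("after revert: %+v\n", *c)
}
```

Returning the revert options from the update call lets a caller change settings for one document and restore them afterwards without remembering the old values itself.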

func OptBayes added in v0.8.4

func OptBayes(b bool) Option

OptBayes is an option that forces running bayes name-finding even when the language is not supported by training sets.

func OptBayesOddsDetails added in v0.11.0

func OptBayesOddsDetails(o bool) Option

OptBayesOddsDetails is an option to show details of odds calculations.

func OptBayesThreshold added in v0.8.4

func OptBayesThreshold(odds float64) Option

OptBayesThreshold is an option for name finding that sets a new threshold for results from the Bayes name-finding. All name candidates whose posterior odds are higher than the threshold will appear in the resulting names output.
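
The effect of a posterior-odds threshold can be sketched as a simple filter. The candidate type, the aboveThreshold function, and the odds values are hypothetical and exist only for illustration. The strict "greater than" comparison is an assumption based on the wording "bigger than this limit".

```go
package main

import "fmt"

// candidate is a hypothetical name-candidate with posterior odds attached.
type candidate struct {
	name string
	odds float64
}

// aboveThreshold keeps candidates whose odds are strictly greater than
// the threshold and preserves their original order.
func aboveThreshold(cs []candidate, threshold float64) []candidate {
	var kept []candidate
	for _, c := range cs {
		if c.odds > threshold {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	// The odds below are made-up numbers, not output of any real run.
	cs := []candidate{
		{name: "Parus major", odds: 1200},
		{name: "Great tit", odds: 0.3},
		{name: "Pomatomus saltator", odds: 5000},
	}
	for _, c := range aboveThreshold(cs, 100) {
		fmt.Println(c.name)
	}
}
```

Raising the threshold trades recall for precision: fewer candidates pass, but those that do have stronger support.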

func OptBayesWeights added in v0.8.10

func OptBayesWeights(bw map[lang.Language]*bayes.NaiveBayes) Option

OptBayesWeights sets already created Bayes training data and stores it in gnfinder's BayesWeights field. It saves time if multiple workers have to be created by a client app.

func OptDetectLanguage added in v0.9.0

func OptDetectLanguage(bool) Option

OptDetectLanguage when true sets automatic detection of text's language.

func OptDict added in v0.8.4

func OptDict(d *dict.Dictionary) Option

OptDict sets an already created dictionary for GNfinder. It saves time, because the dictionary then does not have to be loaded at construction time.

func OptLanguage added in v0.8.4

func OptLanguage(l lang.Language) Option

OptLanguage sets a language of a text.

func OptTokensAround added in v0.10.0

func OptTokensAround(tokensNum int) Option

OptTokensAround sets the number of tokens remembered on the left and right side of a name-candidate.
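
What "tokens on the left and right side" means can be sketched with a small helper. The around function is hypothetical, and splitting on whitespace is an assumption made for the sketch; the library's own tokenizer may split text differently.

```go
package main

import (
	"fmt"
	"strings"
)

// around returns up to n tokens before and after the span tokens[start:end].
// It assumes 0 <= start <= end <= len(tokens) and n >= 0, and returns fewer
// than n tokens on a side when the span is close to the text boundary.
func around(tokens []string, start, end, n int) (before, after []string) {
	lo := start - n
	if lo < 0 {
		lo = 0
	}
	hi := end + n
	if hi > len(tokens) {
		hi = len(tokens)
	}
	return tokens[lo:start], tokens[end:hi]
}

func main() {
	tokens := strings.Fields("we saw Parus major near the old bridge")
	// The name-candidate "Parus major" occupies tokens[2:4].
	before, after := around(tokens, 2, 4, 2)
	fmt.Printf("before=%q after=%q\n", before, after)
}
```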

func OptVerify added in v0.8.4

func OptVerify(opts ...verifier.Option) Option

OptVerify sets the Verifier that will be used for validation of name-strings against the https://index.globalnames.org service.

Directories

Path      Synopsis
dict      package dict contains dictionaries for finding scientific names
cmd
scripts
token     Package token deals with breaking a text into tokens.
