gnfinder

package module
v0.8.8
Warning

This package is not in the latest version of its module.

Published: Apr 23, 2019 License: MIT Imports: 8 Imported by: 3

README

Global Names Finder

Finds scientific names using dictionary and NLP approaches.

Features

  • Multiplatform packages (Linux, Windows, Mac OS X).
  • Self-contained, no external dependencies: only the gnfinder binary (gnfinder.exe on Windows, ~15 MB) is needed.
  • Takes UTF-8 encoded text and returns JSON-formatted output that contains the detected scientific names.
  • Automatically detects the language of the text and adjusts the Bayes algorithm for that language. English and German are currently supported.
  • Uses complementary heuristic and natural language processing algorithms.
  • Does not use the Bayes algorithm if the language cannot be detected. An option can override this rule.
  • Optionally verifies found names against multiple biodiversity databases using the gnindex service.
  • The library can be used concurrently to significantly improve speed. On a server with 40 threads it is able to detect names on 50 million pages in approximately 3 hours using both heuristic and Bayes algorithms. Check the bhlindex project for an example.

Install as a command line app

Download the binary executable for your operating system from the latest release.

Linux or OS X

Move the gnfinder executable somewhere in your PATH (for example /usr/local/bin):

sudo mv path_to/gnfinder /usr/local/bin
Windows

One possible way would be to create a default folder for executables and place gnfinder there.

Press the Windows+R key combination and type "cmd". In the terminal window that appears, type:

mkdir C:\bin
copy path_to\gnfinder.exe C:\bin

Add C:\bin directory to your PATH environment variable.

Go
go get github.com/gnames/gnfinder
cd $GOPATH/src/github.com/gnames/gnfinder
make install

Usage

Usage as a command line app

To see flags and usage:

gnfinder --help
# or just
gnfinder

To see the version of the binary:

gnfinder -v

Examples:

Getting data from a pipe forcing English language and verification

echo "Pomatomus saltator and Parus major" | gnfinder find -c -l eng

Verifying data against NCBI and Encyclopedia of Life

echo "Pomatomus saltator and Parus major" | gnfinder find -c -l eng -s "4,12"

Getting data from a file and redirecting result to another file

gnfinder find file1.txt > file2.json
Usage as gRPC service

Start gnfinder as a gRPC server:

# using default 8778 port
gnfinder grpc

# using some other port
gnfinder grpc -p 8901

Use a gRPC client for gnfinder. To learn how to make one, check a Ruby implementation of a client.

Usage as a library
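cd $GOPATH/src/github.com/gnames/gnfinder
make deps
import (
  "github.com/gnames/gnfinder"
)

// utfText is assumed to hold the UTF-8 text to search.
bytesText := []byte(utfText)

// FindNamesJSON is a method of GNfinder; NewGNfinder() creates one with default data.
gnf := gnfinder.NewGNfinder()
jsonNames := gnf.FindNamesJSON(bytesText)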
Usage as a docker container
docker pull gnames/gnfinder

# run gnfinder server, and map it to port 8888 on the host machine
docker run -d -p 8888:8778 --name gnfinder gnames/gnfinder

Development

To install the latest gnfinder:

go get github.com/gnames/gnfinder
cd $GOPATH/src/github.com/gnames/gnfinder
make deps
make
gnfinder -h

Testing

Install ginkgo, a BDD testing framework for Go.

make deps

To run tests, go to the root directory of the project and run:

ginkgo

# or

go test

# or

make test

Documentation

Types

type GNfinder added in v0.8.4

type GNfinder struct {
	// Language of the text
	Language lang.Language
	// Bayes flag forces Bayes name-finding to run on texts in unknown languages
	Bayes bool
	// BayesOddsThreshold sets the limit of posterior odds. Everything bigger
	// than this limit will go to the names output.
	BayesOddsThreshold float64
	// TextOdds captures the "concentration" of names as it is found for the
	// whole text by heuristic name-finding. It should be close enough to the
	// real number of names in the text. We use it when we do not have a local
	// concentration of names in a region of text.
	TextOdds bayes.LabelFreq

	// Verifier for scientific names
	Verifier *verifier.Verifier
	// Dict contains black, grey, and white list dictionaries
	Dict *dict.Dictionary
}

GNfinder is responsible for name-finding operations

func NewGNfinder added in v0.8.4

func NewGNfinder(opts ...Option) *GNfinder

NewGNfinder creates a GNfinder object with default data, or with data coming from opts.

func (*GNfinder) FindNames added in v0.8.4

func (gnf *GNfinder) FindNames(data []byte) *output.Output

FindNames traverses a text and finds scientific names in it.

func (*GNfinder) FindNamesJSON added in v0.8.4

func (gnf *GNfinder) FindNamesJSON(data []byte) []byte

FindNamesJSON takes a text as bytes and returns a JSON representation of the scientific names found in the text.

type Option added in v0.8.4

type Option func(*GNfinder)

Option type for changing GNfinder settings.

func OptBayes added in v0.8.4

func OptBayes(b bool) Option

OptBayes is an option that forces running Bayes name-finding even when the language is not supported by training sets.

func OptBayesThreshold added in v0.8.4

func OptBayesThreshold(odds float64) Option

OptBayesThreshold is an option for name-finding that sets a new threshold for results from the Bayes name-finding. All name candidates whose posterior odds are higher than the threshold will appear in the resulting names output.

func OptDict added in v0.8.4

func OptDict(d *dict.Dictionary) Option

OptDict sets an already created dictionary for GNfinder. It saves time, because the dictionary then does not have to be loaded at construction time.

func OptLanguage added in v0.8.4

func OptLanguage(l lang.Language) Option

OptLanguage sets a language of a text.

func OptVerify added in v0.8.4

func OptVerify(opts ...verifier.Option) Option

OptVerify sets the Verifier that will be used for validation of name-strings against the https://index.globalnames.org service.

Directories

Path Synopsis
package dict contains dictionaries for finding scientific names
cmd
scripts
Package statik contains static assets.
Package token deals with breaking a text into tokens.