gnfinder

package module
v0.19.5
Published: May 10, 2022 License: MIT Imports: 13 Imported by: 3

README

Global Names Finder (GNfinder)

GNfinder is a very fast finder of scientific names. It uses dictionary and NLP approaches. On a modern multiprocessor laptop it can process 15 million pages per hour. It works with many file formats and includes name verification against many biological databases. Full functionality requires an Internet connection.

Citing

Zenodo DOI can be used to cite GNfinder.

Features

  • Multiplatform app (supports Linux, Windows, Mac OS X).
  • Self-contained, no external dependencies: only the gnfinder or gnfinder.exe binary (~15 MB) is needed. However, an internet connection is required for name verification.
  • Includes REST API and web-based User Interface.
  • Takes UTF8-encoded text and returns CSV, TSV or JSON-formatted output that contains the detected scientific names.
  • Extracts text from PDF, MS Word, MS Excel, HTML, XML, RTF, JPG, TIFF, GIF and other files for name detection.
  • Downloads a web page from a given URL for name detection.
  • Optionally, automatically detects the language of the text, and adjusts Bayes algorithm for the language. English and German languages are currently supported.
  • Uses complementary heuristic and natural language processing algorithms.
  • Optionally verifies found names against multiple biodiversity databases using gnindex service.
  • Detects nomenclatural annotations like sp. nov., comb. nov., ssp. nov. and their variants.
  • Ability to see words that surround detected name-strings.
  • The library can be used concurrently to significantly improve speed. On a server with 40 threads it can detect names on 50 million pages in approximately 3 hours using both heuristic and Bayes algorithms. Check the bhlindex project for an example.

Install as a command line app

Install with Homebrew on Mac OS X, Linux, and Linux on Windows (WSL2)

Homebrew is a popular package manager for Open Source software originally developed for Mac OS X. It is now also available on Linux, and can easily be used on MS Windows 10 or 11 if Windows Subsystem for Linux (WSL) is installed.

Note that Homebrew requires some other programs to be installed, such as Curl, Git, and a compiler (GCC on Linux, Xcode on Mac). If that is too much, go to the "Linux and Mac without Homebrew" section.

  1. Install Homebrew according to their instructions.

  2. Install GNfinder with:

    brew tap gnames/gn
    brew install gnfinder
    
Install by hand

GNfinder consists of just one executable file, so it is easy to install by hand. To do that, download the binary executable for your operating system from the latest release.

Linux and Mac without Homebrew

Move the gnfinder executable to a directory in your PATH (for example /usr/local/bin):

sudo mv path_to/gnfinder /usr/local/bin

Windows without Homebrew and WSL

It is possible to use GNfinder natively on Windows, without Homebrew or Linux installed.

One possible way would be to create a default folder for executables and place gnfinder there.

Press the Windows+R key combination and type "cmd". In the terminal window that appears, type:

mkdir C:\bin
copy path_to\gnfinder.exe C:\bin

Add the C:\bin directory to your PATH environment variable.

Go

Install Go v1.17 or higher.

git clone git@github.com:/gnames/gnfinder
cd gnfinder
make tools
make install

Configuration

When you run the gnfinder command for the first time, it creates a gnfinder.yml configuration file.

The location of this file depends on the operating system:

MS Windows: C:\Users\<username>\AppData\Roaming\gnfinder.yml

Mac OS: $HOME/.config/gnfinder.yml

Linux: $HOME/.config/gnfinder.yml

This file lets you set options that modify the behaviour of GNfinder according to your needs. It spares you from entering the same command-line flags again and again.

Command line flags will override the settings in the configuration file.

It is also possible to set up environment variables. They override the settings from both the configuration file and the flags.

Settings              Environment variables
BayesOddsThreshold    GNF_BAYES_ODDS_THRESHOLD
DataSources           GNF_DATA_SOURCES
Format                GNF_FORMAT
InputTextOnly         GNF_INPUT_TEXT_ONLY
IncludeInputText      GNF_INCLUDE_INPUT_TEXT
Language              GNF_LANGUAGE
TikaURL               GNF_TIKA_URL
TokensAround          GNF_TOKENS_AROUND
VerifierURL           GNF_VERIFIER_URL
WithAllMatches        GNF_WITH_ALL_MATCHES
WithAmbiguousNames    GNF_WITH_AMBIGUOUS_NAMES
WithBayesOddsDetails  GNF_WITH_BAYES_ODDS_DETAILS
WithOddsAdjustment    GNF_WITH_ODDS_ADJUSTMENT
WithPlainInput        GNF_WITH_PLAIN_INPUT
WithPositionInBytes   GNF_WITH_POSITION_IN_BYTES
WithUniqueNames       GNF_WITH_UNIQUE_NAMES
WithVerification      GNF_WITH_VERIFICATION
WithoutBayes          GNF_WITHOUT_BAYES

Usage

Usage as a command line app

To see flags and usage:

gnfinder --help
# or just
gnfinder

To see the version of the binary:

gnfinder -V

Examples:

Starting as a web-application and an API server on port 8080

gnfinder -p 8080

Getting names from a UTF8-encoded file without remote Tika service.

# -U flag prevents use of remote Apache Tika service for file conversion to
# UTF8-encoded plain text
# -U flag is optional, but it removes unnecessary remote call to Tika.

gnfinder file_with_names.txt -U

Getting names from a UTF8-encoded file in tab-separated values (TSV) format

gnfinder file_with_names.txt -U -f tsv

Getting names from a file that is not a plain UTF8-encoded text

gnfinder file.pdf

Getting names from a URL

gnfinder https://en.wikipedia.org/wiki/Raccoon

Getting unique names from a file in JSON format. Disables -w flag.

gnfinder file_with_names.txt -u -f pretty

Getting names from a file in JSON format, and using jq to process JSON

gnfinder file_with_names.txt -f compact | jq

Getting data from a pipe forcing English language and verification

echo "Pomatomus saltator and Parus major" | gnfinder -v -l eng
echo "Pomatomus saltator and Parus major" | gnfinder --verify --lang eng

Limit matches to NCBI and Encyclopedia of Life. For the list of data source ids go to gnverifier's data sources page.

echo "Pomatomus saltator and Parus major" | gnfinder -v -l eng -s "4,12"
echo "Pomatomus saltator and Parus major" | gnfinder --verify --lang eng --sources "4,12"

Preserve uninomial names that are also common words.

echo "Cancer is a genus" | gnfinder -A
echo "America is also a genus" | gnfinder --ambiguous-uninomials

Show all matches, not only the best result.

echo "Pomatomus saltator and Parus major" | gnfinder -M
echo "Pomatomus saltator and Parus major" | gnfinder --all-matches

Show all matches, but only for selected data-sources.

echo "Pomatomus saltator and Parus major" | gnfinder -M -s 1,12

Adjusting Prior Odds using information about found names. They are calculated as "found names number / (capitalized words number - found names number)". Such adjustment will decrease Odds for texts with very few names, and increase odds for texts with a lot of found names.

gnfinder -a -d -f pretty file_with_names.txt
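
The formula quoted above can be worked through in a few lines. This is a sketch of the arithmetic only: the `priorOdds` function and its handling of the case where no other capitalized words remain are assumptions, not GNfinder's implementation.

```go
package main

import "fmt"

// priorOdds applies the formula from the text:
//   found names / (capitalized words - found names)
// It returns false when the denominator would be zero or negative, a case
// the README does not describe (assumption made for this sketch).
func priorOdds(foundNames, capitalizedWords int) (float64, bool) {
	rest := capitalizedWords - foundNames
	if foundNames < 0 || rest <= 0 {
		return 0, false
	}
	return float64(foundNames) / float64(rest), true
}

func main() {
	cases := []struct{ found, capitalized int }{
		{2, 102},  // few names among many capitalized words: low odds
		{50, 100}, // half of the capitalized words are names: even odds
	}
	for _, c := range cases {
		if odds, ok := priorOdds(c.found, c.capitalized); ok {
			fmt.Printf("found=%d capitalized=%d odds=%.2f\n", c.found, c.capitalized, odds)
		}
	}
	// Prints:
	// found=2 capitalized=102 odds=0.02
	// found=50 capitalized=100 odds=1.00
}
```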

Returning 5 words before and after a found name-candidate. This flag is ignored if unique names are returned.

gnfinder -w 5 file_with_names.txt
gnfinder --words-around 5 file_with_names.txt

Getting data from a file and redirecting result to another file

gnfinder file1.txt > file2.json

Detection of nomenclatural annotations

echo "Parus major sp. n." | gnfinder

Returning the positions of found names as the number of bytes from the beginning of the text instead of the number of UTF-8 characters

echo "Это Parus major" | gnfinder -b

There is also a tutorial about processing many PDF files in parallel.

Usage as a library

import (
  "fmt"

  "github.com/gnames/gnfinder"
  "github.com/gnames/gnfinder/config"
  "github.com/gnames/gnfinder/ent/nlp"
  "github.com/gnames/gnfinder/io/dict"
)

func Example() {
  txt := `Blue Adussel (Mytilus edulis) grows to about two
inches the first year,Pardosa moesta Banks, 1892`
  // Create a configuration; no options are passed here.
  cfg := config.New()
  dictionary := dict.LoadDictionary()
  weights := nlp.BayesWeights()
  gnf := gnfinder.New(cfg, dictionary, weights)
  // The first argument of Find is the name of the file that contains the
  // text; it is empty here because the text is an in-memory string.
  res := gnf.Find("", txt)
  name := res.Names[0]
  fmt.Printf(
    "Name: %s, start: %d, end: %d",
    name.Name,
    name.OffsetStart,
    name.OffsetEnd,
  )
  // Output:
  // Name: Mytilus edulis, start: 13, end: 29
}

Usage as a docker container

docker pull gnames/gnfinder

# run GNfinder server, and map it to port 8888 on the host machine
docker run -d -p 8888:8778 --name gnfinder gnames/gnfinder

Usage of API

The best source for API usage is its documentation.

If you want to start your own API endpoint (for example on localhost, port 8080) use:

gnfinder -p 8080
curl localhost:8080/api/v0/ping
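
The same ping check can be made from Go. This is a minimal sketch that assumes a server was started locally with `gnfinder -p 8080`; the helper names `pingURL` and `ping` are hypothetical, and the content of the response body is not specified in this README, so it is printed as received.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// pingURL builds the address of the ping endpoint from a server base URL.
func pingURL(base string) string {
	return strings.TrimRight(base, "/") + "/api/v0/ping"
}

// ping sends a GET request to the ping endpoint and returns the HTTP status
// code together with the response body as text.
func ping(client *http.Client, base string) (int, string, error) {
	resp, err := client.Get(pingURL(base))
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, "", err
	}
	return resp.StatusCode, string(body), nil
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	status, body, err := ping(client, "http://localhost:8080")
	if err != nil {
		// Expected when no local GNfinder server is running.
		fmt.Println("server not reachable:", err)
		return
	}
	fmt.Println(status, body)
}
```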

To upload a file and detect names from its content:

curl -v -F verification=true -F file=@/path/to/test.txt https://gnfinder.globalnames.org/api/v0/find

Projects based on GNfinder

gnfinder-plus allows working with MS Docs and PDF files without remote services (requires a local install of the poppler package).

bhlindex creates an index of scientific names for Biodiversity Heritage Library (BHL).

bhlnames adds synonymy and currently accepted names to searches in BHL, connects publications to pages in BHL.

Development

To install the latest GNfinder:

git clone git@github.com:/gnames/gnfinder
cd gnfinder
make tools
make install

Modify OpenAPI documentation

docker run -d -p 80:8080 swaggerapi/swagger-editor

Testing

From the root of the project:

make tools
# run make install for CLI testing
make install

To run tests, go to the root directory of the project and run:

go test ./...

# or

make test

Documentation

Overview

Example

package main

import (
	"fmt"

	"github.com/gnames/gnfinder"
	"github.com/gnames/gnfinder/config"
	"github.com/gnames/gnfinder/ent/nlp"
	"github.com/gnames/gnfinder/io/dict"
)

func main() {
	txt := `Blue Adussel (Mytilus edulis) grows to about two
inches the first year,Pardosa moesta Banks, 1892`
	cfg := config.New()
	dictionary := dict.LoadDictionary()
	weights := nlp.BayesWeights()
	gnf := gnfinder.New(cfg, dictionary, weights)
	res := gnf.Find("", txt)
	name := res.Names[0]
	fmt.Printf(
		"Name: %s, start: %d, end: %d",
		name.Name,
		name.OffsetStart,
		name.OffsetEnd,
	)
}

Output:

Name: Mytilus edulis, start: 13, end: 29

Constants

This section is empty.

Variables

var (
	Version = "v0.19.5+"
	Build   string
)

Functions

This section is empty.

Types

type GNfinder added in v0.8.4

type GNfinder interface {
	// Find detects names in a `text`. The `file` argument provides the file-name
	// that contains the `text` (if given).
	Find(file, text string) output.Output

	// GetConfig provides all public Config fields.
	GetConfig() config.Config

	// ChangeConfig allows to modify config fields at the run-time.
	ChangeConfig(opts ...config.Option) GNfinder

	// GetVersion returns the version of GNfinder.
	GetVersion() gnvers.Version
}

GNfinder provides the main use-case functionality: finding names in text, getting and setting configuration options, and reporting the version of the project.

func New added in v0.12.0

func New(
	cfg config.Config,
	dictionaries *dict.Dictionary,
	weights map[lang.Language]bayes.Bayes,
) GNfinder

Directories

Path    Synopsis
ent
api
nlp
token   Package token deals with breaking a text into tokens.
cmd
io
dict    Package dict contains dictionaries for finding scientific names.
web
tools