gnmatcher

v0.3.5 | Published: Nov 19, 2020 | License: MIT | Imports: 6 | Imported by: 2

Repository

github.com/gnames/gnmatcher


README ¶

gnmatcher

The app matches a list of scientific name-strings to canonical forms of scientific names from various biodiversity datasets.

Introduction

This project is a component of gnames, a scientific names verification (reconciliation/resolution) service. The purpose of verification is to compare a list of apparent scientific name-strings to a comprehensive set of scientific names collected from many external biodiversity sources. The gnmatcher project receives a list of name-strings and returns zero or more canonical forms of known names for each name-string.

The project aims to perform such verification as quickly and accurately as possible. Quite often, humans or optical character recognition (OCR) software introduce misspellings into name-strings. For this reason, gnmatcher uses fuzzy-matching algorithms when no exact match exists. Also, when the full name-string does not have a match, gnmatcher tries to match it against parts of names. For example, if a name-string does not match at the subspecies level, the algorithm tries to match it at the species and genus levels.
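The fallback from subspecies to species to genus can be pictured with a minimal sketch. This is not gnmatcher's implementation: the `known` set, the `partialMatch` helper, the made-up epithet "fictus", and the strategy of dropping trailing words are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// known is a hypothetical stand-in for the canonical forms the service knows.
var known = map[string]bool{
	"Pomatomus":           true,
	"Pomatomus saltatrix": true,
}

// partialMatch tries the full canonical form first, then repeatedly drops
// the last word (subspecies -> species -> genus) until a known name is found.
func partialMatch(canonical string) (string, bool) {
	words := strings.Fields(canonical)
	for n := len(words); n > 0; n-- {
		candidate := strings.Join(words[:n], " ")
		if known[candidate] {
			return candidate, true
		}
	}
	return "", false
}

func main() {
	// "fictus" is an invented epithet, so only the species part matches.
	match, ok := partialMatch("Pomatomus saltatrix fictus")
	fmt.Println(match, ok) // prints: Pomatomus saltatrix true
}
```

The real service also removes middle parts of a name and combines partial matching with fuzzy matching, which this sketch leaves out.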

Reconciliation is the normalization of lexical variations of the same name, and comparison of them to normalized names from biodiversity data sources.

Resolution is a determination of how a nomenclaturally registered name can be interpreted from the point of taxonomy. For example, a name can be an accepted name for species, a synonym, or a discarded one.

The gnmatcher app functions as an HTTP service. An app can access it using HTTP client libraries. The API's methods and structures are described in the model dir.

Input and Output

A user calls HTTP resource /match sending an array of name-strings to the service and gets back canonical forms, the match type, as well as other metadata described as an Output message in the [protobuf] file.

The optimal size of the input is 5,000-10,000 name-strings per array. Note that 10,000 is the maximum size; larger arrays will be truncated.
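Because larger arrays are truncated, a client with more than 10,000 name-strings needs to split its input before sending it. The sketch below shows one way to do that; the `batches` helper and the `maxBatch` constant are hypothetical names, not part of gnmatcher.

```go
package main

import "fmt"

// maxBatch mirrors the documented upper limit of 10,000 name-strings
// per request. The constant name is an assumption of this sketch.
const maxBatch = 10_000

// batches splits names into consecutive slices of at most size elements.
// The returned slices share the backing array of names.
func batches(names []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var res [][]string
	for start := 0; start < len(names); start += size {
		end := start + size
		if end > len(names) {
			end = len(names)
		}
		res = append(res, names[start:end])
	}
	return res
}

func main() {
	names := make([]string, 25_000) // placeholder input
	for i, b := range batches(names, maxBatch) {
		fmt.Printf("batch %d: %d names\n", i, len(b))
	}
}
```

Each batch can then be sent to the /match resource as a separate request.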

Performance

To measure performance, we took 100,000 name-strings, of which only 30% were 'real' names. On a modern CPU with 12 hyperthreads and the GNM_JOBS_NUM environment variable set to 8, the service processed about 8,000 name-strings per second. For 'clean' data, where most of the names are 'real', you should see even higher performance.

Prerequisites

You will need PostgreSQL with a restored dump of gnames database.
Docker service

Usage

Usage with docker

Install docker gnmatcher image: docker pull gnames/gnmatcher.
Copy the .env.example file to your disk and change the values of the environment variables accordingly.
Start the service:
```
docker run -p 8080:8080 -d --env-file your-env-file \
gnames/gnmatcher -- rest -p 8080
```
This command starts the service on port 8080 inside the container and makes it available through port 8080 on the local machine.

Usage from command line

Download the [latest version] of the gnmatcher binary, untar it, and put it somewhere in PATH.
Run gnmatcher -V to generate configuration at ~/.config/gnmatcher.yaml
Edit ~/.config/gnmatcher.yaml accordingly.
Run gnmatcher rest -p 1234

The service will run on the given port.

Client

A user can find an example of a client for the service in this test file

Development

To run tests, a developer needs to install the BDD binary ginkgo.

There is a docker-compose file that sets up the HTTP service to run tests. To run it, do the following:

Copy .env.example file to the .env file in the project's root directory, change the settings accordingly.
Build the gnmatcher binary and docker image using make dc command.
Run docker-compose command docker compose
Run tests via go test ./... or ginkgo ./...

Documentation ¶

Overview ¶

Package gnmatcher provides the main use-case of the project, which is matching possible name-strings to scientific names registered in a variety of biodiversity databases.

The goal of the project is to return matched canonical forms of scientific names at a rate of tens of thousands per second, making it possible to handle hundreds of millions or billions of name-string matching events.

The package is intended to be used by long-running services, because it takes a few seconds to initialize its lookup data structures.

Index ¶

Constants
Variables
func Example()
func NewGNMatcher(m matcher.Matcher) gnmatcher
type GNMatcher

Constants ¶

const MaxNamesNumber = 10_000

MaxNamesNumber is the upper limit of the number of name-strings the MatchNames function can process. If the number is higher, the list of name-strings will be truncated.

Variables ¶

var (
	// Version of the gnmatcher
	Version = "v0.2.0"
	// Build timestamp
	Build = "n/a"
)

Functions ¶

func Example ¶ added in v0.3.5

func Example()

func NewGNMatcher ¶

func NewGNMatcher(m matcher.Matcher) gnmatcher

NewGNMatcher is a constructor for the GNMatcher interface.

Types ¶

type GNMatcher ¶

type GNMatcher interface {
	// MatchNames takes a slice of scientific name-strings and returns
	// matches to canonical forms of known scientific names. The following
	// matches are attempted:
	// - Exact string match for viruses
	// - Exact match of the name-string's canonical form
	// - Fuzzy match of the canonical form
	// - Partial match of the canonical form where the middle parts of the name
	//   or last elements of the name are removed.
	// - Partial fuzzy match of the canonical form.
	//
	// The resulting output does provide canonical forms, but not the sources
	// where they are registered.
	//
	MatchNames(names []string) []*mlib.Match
}

GNMatcher is a public API to the project functionality.
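The "fuzzy match of the canonical form" step listed above can be pictured with an edit-distance comparison. The sketch below uses plain Levenshtein distance as a stand-in: the source does not state which algorithm or threshold gnmatcher actually uses, and `levenshtein`, `fuzzyMatch`, and the distance limit of 2 are assumptions of this example.

```go
package main

import "fmt"

// levenshtein returns the edit distance between a and b, counting
// insertions, deletions, and substitutions of single runes.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three integers.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// fuzzyMatch returns the candidates within maxDist edits of the query.
func fuzzyMatch(query string, candidates []string, maxDist int) []string {
	var res []string
	for _, c := range candidates {
		if levenshtein(query, c) <= maxDist {
			res = append(res, c)
		}
	}
	return res
}

func main() {
	candidates := []string{"Pomatomus saltatrix", "Puma concolor"}
	// The misspelled query is two edits away from "Pomatomus saltatrix".
	fmt.Println(fuzzyMatch("Pomatomus saltatrics", candidates, 2))
}
```

Comparing a query against every candidate this way is slow for large name sets, so a production matcher would need an index or pre-filter in front of the distance check.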


Directories ¶

Path	Synopsis
bloom	package bloom creates and serves bloom filters for canonical names, and names of viruses.
config
dbase	Package dbase is an interface to PostgreSQL database that contains Global Names index data
fuzzy
gnmatcher
cmd
matcher
rest
scripts	The purpose of this script is to find out how fast algorithms can go through a list of 100_000 names.
stemskv	stems_db package operates on a key-value store that contains stems and canonical forms that correspond to these stems.
