The app matches a list of scientific name-strings to canonical forms of scientific names from various biodiversity datasets.
This project is a component of a scientific names verification
(reconciliation/resolution) service gnames. The purpose of verification is to
compare a list of apparent scientific name-strings to a comprehensive set of
scientific names collected from many external biodiversity sources. The
gnmatcher project receives a list of name-strings and returns zero or more
canonical forms of known names for each name-string.
The project aims to do such verification as quickly and accurately as possible.
Quite often, humans or optical character recognition (OCR) software introduce
misspellings in the name-strings. For this reason, gnmatcher uses
fuzzy-matching algorithms when no exact match exists. Also, when the
full name-string does not have a match, gnmatcher tries to match it against
parts of the name. For example, if a name-string did not get a match at the
subspecies level, the algorithm will try to match it at the species and genus levels.
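The fallback from subspecies to species to genus can be illustrated with a minimal sketch. It is not gnmatcher's implementation: the helper names `fallbackCandidates` and `firstMatch`, the in-memory `known` set, and the made-up epithet `xyzus` are assumptions chosen only for illustration. The sketch also covers exact lookups only and leaves fuzzy matching out.

```go
package main

import (
	"fmt"
	"strings"
)

// fallbackCandidates (hypothetical helper) returns the full canonical form
// followed by progressively shorter forms, each made by dropping the last
// word. For a trinomial this gives subspecies, species, and genus levels.
func fallbackCandidates(canonical string) []string {
	words := strings.Fields(canonical)
	res := make([]string, 0, len(words))
	for i := len(words); i > 0; i-- {
		res = append(res, strings.Join(words[:i], " "))
	}
	return res
}

// firstMatch (hypothetical helper) returns the longest candidate that is
// present in the set of known canonical forms.
func firstMatch(canonical string, known map[string]struct{}) (string, bool) {
	for _, c := range fallbackCandidates(canonical) {
		if _, ok := known[c]; ok {
			return c, true
		}
	}
	return "", false
}

func main() {
	// A toy stand-in for the real set of known names.
	known := map[string]struct{}{
		"Puma concolor": {},
		"Puma":          {},
	}
	// "xyzus" is a made-up subspecies epithet with no match.
	match, ok := firstMatch("Puma concolor xyzus", known)
	fmt.Println(match, ok)
}
```

Here the subspecies-level lookup fails, so the species-level form is returned. Trying the longest form first keeps the match as specific as the data allows.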
Reconciliation is the normalization of lexical variations of the same name, and comparison of them to normalized names from biodiversity data sources.
Resolution is a determination of how a nomenclaturally registered name can be interpreted from the point of view of taxonomy. For example, a name can be an accepted name for a species, a synonym, or a discarded one.
The gnmatcher app functions as a gRPC service. An application can access it using
gRPC client libraries. The API of the gRPC service is described in its protobuf file.
Input and Output
A user calls the gRPC method MatchAry, sending an array of name-strings to the
service, and gets back canonical forms, the match type, and other
metadata described as an Output message in the protobuf file.
The optimal size of the input is 5,000-10,000 name-strings per array. Note that 10,000 is the maximum size; larger arrays will be truncated.
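Because input beyond 10,000 name-strings is truncated, a client with a longer list may want to split it into batches before calling the service. The following is a minimal sketch of such batching. The names `maxBatch` and `chunk` are assumptions made for this example and are not part of gnmatcher's API, and the gRPC call itself is left out.

```go
package main

import "fmt"

// maxBatch mirrors the documented upper limit of 10,000 name-strings
// per request (a name chosen for this sketch).
const maxBatch = 10_000

// chunk splits names into consecutive slices of at most size elements.
// The returned slices share memory with the input slice.
func chunk(names []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var res [][]string
	for start := 0; start < len(names); start += size {
		end := start + size
		if end > len(names) {
			end = len(names)
		}
		res = append(res, names[start:end])
	}
	return res
}

func main() {
	names := make([]string, 25_000) // placeholder input
	for i, batch := range chunk(names, maxBatch) {
		// Each batch would be sent to the service in its own request.
		fmt.Printf("batch %d: %d name-strings\n", i, len(batch))
	}
}
```

With 25,000 name-strings this produces three batches of 10,000, 10,000, and 5,000 items, so no request exceeds the limit.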
For performance measurement we took 100,000 name-strings, of which only
30% were 'real' names. On a modern CPU with 12 hyper-threads and the
GNM_JOBS_NUM environment variable set to 8, the service was able to process
about 8,000 name-strings per second. For 'clean' data, where most of the names
are 'real', you should see even higher performance.
Usage

### Prerequisites
You will need PostgreSQL with a restored dump of
Pull the Docker image: docker pull gnames/gnmatcher
Copy the .env.example file to your disk and change the values of the environment variables accordingly.
Start the service:
docker run -p 8080:8080 -d --env-file your-env-file gnames/gnmatcher -- grpc 8080
This command starts the service on port 8080 inside the container and makes it available through port 8080 on the local machine.
A user can find an example of a client for the service in this test file
There is a docker-compose file that sets up the gRPC service to run tests. To run it, do the following:
Copy the .env.example file to the .env file in the project's root directory and change the settings accordingly.
Build the gnmatcher binary and docker image using
Run docker-compose command
Run tests via go test ./... or
const MaxNamesNumber = 10_000
MaxNamesNumber is the upper limit of the number of name-strings the MatchNames function can process. If the number is higher, the list of name-strings will be truncated.
var (
	// Version of the gnmatcher
	Version = "v0.2.0"
	// Build timestamp
	Build = "n/a"
)
GNMatcher contains high level methods for scientific name matching.
func NewGNMatcher
NewGNMatcher is a constructor for a GNMatcher instance.
package bloom creates and serves bloom filters for canonical names, and names of viruses.
Package dbase is an interface to PostgreSQL database that contains Global Names index data
Package fuzzy includes a Levenshtein automaton as well as a traditional implementation to calculate Levenshtein Distance.
The purpose of this script is to find out how fast algorithms can go through a list of 100_000 names.
stems_db package operates on a key-value store that contains stems and canonical forms that correspond to these stems.
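The fuzzy package described above mentions a traditional way to calculate Levenshtein distance. As a rough illustration of that idea, here is a minimal sketch of the classic two-row dynamic-programming algorithm. It is not the code of the fuzzy package, and the function names `levenshtein` and `min3` are assumptions made for this example.

```go
package main

import "fmt"

// levenshtein returns the edit distance between a and b, counting
// single-rune insertions, deletions, and substitutions.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	// prev holds distances for the previous row, curr for the current one.
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // distance to the empty prefix of b
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three integers.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

func main() {
	// An OCR-style misspelling: "i" read as "l".
	fmt.Println(levenshtein("Pomatomus saltatrix", "Pomatomus saltatrlx")) // prints 1
}
```

A name-string within a small edit distance of a known canonical form can then be treated as a fuzzy-match candidate. A Levenshtein automaton serves the same purpose without computing the full distance table for every candidate.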