gnmatcher

package module
v0.5.5
Published: Feb 4, 2021 License: MIT Imports: 5 Imported by: 2

README

gnmatcher


import "github.com/gnames/gnmatcher"

The app matches a slice of strings to canonical forms of scientific names from various biodiversity datasets.

Introduction

The gnmatcher project receives a slice of strings and returns zero or more canonical forms of known names for each string. If you do not need to know which biodiversity repositories include the scientific names, the project can be used as a stand-alone RESTful service. If that information is important, the project is used as a component of gnames, a scientific-name verification (reconciliation/resolution) service.

The project aims to perform such verification as quickly and accurately as possible. Quite often, humans or optical character recognition (OCR) software introduce misspellings into the strings. For this reason, gnmatcher uses fuzzy-matching algorithms when no exact match exists. Also, when the full string has no match, gnmatcher tries to match parts of the string. For example, if a string does not match at the subspecies level, the algorithm tries to match it at the species and genus levels.
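
The fallback from subspecies to species to genus can be pictured as trying progressively shorter versions of the name. The following is only a minimal sketch of that idea: fallbackCandidates is a hypothetical helper, not part of gnmatcher, and it assumes that dropping the last word of a canonical form is enough to move one level up.

```go
package main

import (
	"fmt"
	"strings"
)

// fallbackCandidates is a hypothetical helper (not part of gnmatcher).
// It returns the full name followed by progressively shorter prefixes,
// dropping the last word each time, so a matcher could try them in order.
func fallbackCandidates(name string) []string {
	words := strings.Fields(name)
	res := make([]string, 0, len(words))
	for i := len(words); i > 0; i-- {
		res = append(res, strings.Join(words[:i], " "))
	}
	return res
}

func main() {
	// Candidates are listed from the most specific to the least specific.
	for _, c := range fallbackCandidates("Homo sapiens sapiens") {
		fmt.Println(c)
	}
}
```

A real implementation would also have to handle ranks, hybrid signs, and authorship, which this sketch ignores.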

Reconciliation is the normalization of lexical variations of the same name, and comparison of them to normalized names from biodiversity data sources.

Resolution is a determination of how a nomenclaturally registered name can be interpreted from the point of taxonomy. For example, a name can be an accepted name for species, a synonym, or a discarded one.

The gnmatcher app functions as an HTTP service. A client application can access it using HTTP client libraries. The API's methods and structures are described by the OpenAPI specification.

Input and Output

A user calls the HTTP resource /match, sending a slice of strings to the service, and gets back canonical forms, the match type, and other metadata described as an Output message in the [protobuf] file.

The optimal size of the input is 5-10 thousand strings per slice. Note that 10,000 is the maximum size; larger slices will be truncated.
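
Because input beyond 10,000 strings is truncated, a caller with a longer list needs to split it before sending. A minimal sketch of such batching follows; the batches function and the maxBatch constant are illustrative names, not part of gnmatcher.

```go
package main

import "fmt"

// maxBatch mirrors the 10,000-string limit stated in the README.
const maxBatch = 10000

// batches is a hypothetical helper that splits names into consecutive
// sub-slices of at most size elements. Sizes that are not positive or
// exceed maxBatch fall back to maxBatch.
func batches(names []string, size int) [][]string {
	if size <= 0 || size > maxBatch {
		size = maxBatch
	}
	var res [][]string
	for start := 0; start < len(names); start += size {
		end := start + size
		if end > len(names) {
			end = len(names)
		}
		res = append(res, names[start:end])
	}
	return res
}

func main() {
	names := make([]string, 25000) // placeholder input
	for i, b := range batches(names, maxBatch) {
		fmt.Printf("batch %d: %d names\n", i, len(b))
	}
}
```

The sub-slices share memory with the original slice, so no names are copied.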

Performance

For performance measurement, we took 100,000 strings of which only 30% were 'real' names. On a modern CPU with 12 hyper-threads and the GNM_JOBS_NUM environment variable set to 8, the service processed about 8,000 strings per second. For 'clean' data, where most of the names are 'real', you should see even higher performance.

Prerequisites

  • You will need PostgreSQL with a restored dump of gnames database.

  • For PostgreSQL collation to work correctly, set LC_COLLATE=C in /etc/default/locale.

  • Docker service

Usage

Usage with docker
  • Install docker gnmatcher image: docker pull gnames/gnmatcher.

  • Copy the .env.example file to your disk and change the values of the environment variables accordingly.

  • Start the service:

    docker run -p 8080:8080 -d --env-file your-env-file \
    gnames/gnmatcher -- rest -p 8080
    

    This command starts the service on port 8080 inside the container and makes it available through port 8080 on the local machine.

Usage from command line
  • Download the [latest version] of the gnmatcher binary, untar it, and put it somewhere in PATH.

  • Run gnmatcher -V to generate configuration at ~/.config/gnmatcher.yaml

  • Edit ~/.config/gnmatcher.yaml accordingly.

  • Run gnmatcher rest -p 1234

The service will run on the given port.

Usage as a library
package main

import (
  "fmt"
  "github.com/gnames/gnmatcher"
  "github.com/gnames/gnmatcher/config"
  "github.com/gnames/gnmatcher/io/bloom"
  "github.com/gnames/gnmatcher/io/trie"
)

func main() {
  // Note that it takes several minutes to initialize lookup data structures.
  // Requirement for initialization: Postgresql database with loaded
  // http://opendata.globalnames.org/dumps/gnames-latest.sql.gz
  //
  // If data are imported already, it still takes several seconds to
  // load lookup data into memory.
  cfg := config.NewConfig()
  em := bloom.NewExactMatcher(cfg)
  fm := trie.NewFuzzyMatcher(cfg)
  gnm := gnmatcher.NewGNmatcher(em, fm, 1)
  res := gnm.MatchNames([]string{"Pomatomus saltator", "Pardosa moesta"})
  for _, match := range res {
    fmt.Println(match.Name)
    fmt.Println(match.MatchType)
    for _, item := range match.MatchItems {
      fmt.Println(item.MatchStr)
      fmt.Println(item.EditDistance)
    }
  }
}
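
The EditDistance value printed above describes how far a fuzzy match is from the input. As an illustration only, here is a plain Levenshtein distance, which counts single-character insertions, deletions, and substitutions. That gnmatcher uses exactly this metric is an assumption; the package may apply a different variant or additional rules. The editDistance and min3 functions are not part of gnmatcher.

```go
package main

import "fmt"

// editDistance is an illustrative Levenshtein distance (an assumption
// about the metric, not gnmatcher's own code). It uses two rows of the
// dynamic-programming table and compares runes, not bytes.
func editDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution (or match)
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three integers.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

func main() {
	fmt.Println(editDistance("Pomatomus saltator", "Pomatomus saltatror"))
}
```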

Client

A user can find an example of a client for the service in this test file.

The API is formally described in the OpenAPI specification.
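
For orientation, a client call might look roughly like the sketch below. It rests on assumptions that should be checked against the OpenAPI specification: that the names are sent as a JSON array in a POST request, and that a 200 status signals success. The postNames function is hypothetical, and newStubServer only stands in for a running gnmatcher service so that the example is self-contained; its response format is invented for the demonstration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// postNames is a hypothetical client helper. It sends names as a JSON
// array (an assumption about the wire format) and returns the raw body.
func postNames(c *http.Client, url string, names []string) ([]byte, error) {
	body, err := json.Marshal(names)
	if err != nil {
		return nil, err
	}
	resp, err := c.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// newStubServer stands in for the real service. It only reports how
// many names it received; the real response differs.
func newStubServer() *httptest.Server {
	h := func(w http.ResponseWriter, r *http.Request) {
		var names []string
		if err := json.NewDecoder(r.Body).Decode(&names); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		fmt.Fprintf(w, `{"received":%d}`, len(names))
	}
	return httptest.NewServer(http.HandlerFunc(h))
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	names := []string{"Pomatomus saltator", "Pardosa moesta"}
	out, err := postNames(srv.Client(), srv.URL+"/match", names)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

Against a real service, replace srv.URL with the service address and decode the body into the structures given in the API description.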

Development

There is a docker-compose file that sets up the HTTP service to run tests. To run it, do the following:

  1. Copy .env.example file to the .env file in the project's root directory, change the settings accordingly.

  2. Build the gnmatcher binary and docker image using the make dc command.

  3. Start the services with the docker-compose command docker compose up

  4. Run tests via go test ./... -v

Documentation

Overview

package gnmatcher provides the main use-case of the project, which is matching of possible name-strings to scientific names registered in a variety of biodiversity databases.

The goal of the project is to return matched canonical forms of scientific names at a rate of tens of thousands per second, making it possible to work with hundreds of millions or billions of name-string matching events.

The package is intended to be used by long-running services because it takes a few seconds/minutes to initialize its lookup data structures.

Example
package main

import (
	"fmt"

	"github.com/gnames/gnmatcher"
	"github.com/gnames/gnmatcher/config"
	"github.com/gnames/gnmatcher/io/bloom"
	"github.com/gnames/gnmatcher/io/trie"
)

func main() {
	// Note that it takes several minutes to initialize lookup data structures.
	// Requirement for initialization: Postgresql database with loaded
	// http://opendata.globalnames.org/dumps/gnames-latest.sql.gz
	//
	// If data are imported already, it still takes several seconds to
	// load lookup data into memory.
	cfg := config.NewConfig()
	em := bloom.NewExactMatcher(cfg)
	fm := trie.NewFuzzyMatcher(cfg)
	gnm := gnmatcher.NewGNmatcher(em, fm, 1)
	res := gnm.MatchNames([]string{"Pomatomus saltator", "Pardosa moesta"})
	for _, match := range res {
		fmt.Println(match.Name)
		fmt.Println(match.MatchType)
		for _, item := range match.MatchItems {
			fmt.Println(item.MatchStr)
			fmt.Println(item.EditDistance)
		}
	}
}

Constants

This section is empty.

Variables

var (
	// Version of the gnmatcher. When make runs, it automatically
	// sets the variable using git tags and hashes.
	Version = "v0.3.6"
	// Build timestamp. When make runs, it automatically sets the variable.
	Build = "n/a"
)

Functions

This section is empty.

Types

type GNmatcher added in v0.5.4

type GNmatcher interface {
	// MatchNames takes a slice of scientific name-strings and returns back
	// matches to canonical forms of known scientific names. The following
	// matches are attempted:
	// - Exact string match for viruses
	// - Exact match of the name-string's canonical form
	// - Fuzzy match of the canonical form
	// - Partial match of the canonical form where the middle parts of the name
	//   or last elements of the name are removed.
	// - Partial fuzzy match of the canonical form.
	//
	// The resulting output does provide canonical forms, but not the sources
	// where they are registered.
	MatchNames(names []string) []mlib.Match

	// GetVersion returns version number and build timestamp
	GetVersion() gnvers.Version
}

GNmatcher is a public API to the project functionality.
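
The comment on MatchNames lists several matching attempts, from exact to partial fuzzy. One way to picture such a sequence is a list of stages tried in order until one succeeds. The sketch below is a simplified illustration under that assumption, not gnmatcher's implementation: the stage type, firstMatch, demoStages, and the toy lookup tables are all hypothetical.

```go
package main

import "fmt"

// stage pairs a label with a lookup function. Both are hypothetical
// and only illustrate the idea of ordered matching attempts.
type stage struct {
	label string
	match func(name string) (canonical string, ok bool)
}

// firstMatch tries the stages in order and returns the first hit.
func firstMatch(name string, stages []stage) (label, canonical string, ok bool) {
	for _, s := range stages {
		if c, found := s.match(name); found {
			return s.label, c, true
		}
	}
	return "", "", false
}

// demoStages builds two toy stages backed by small in-memory maps.
func demoStages() []stage {
	known := map[string]bool{"Pardosa moesta": true}
	// Stand-in for fuzzy matching: a table of known misspellings.
	typos := map[string]string{"Pardosa moestta": "Pardosa moesta"}
	return []stage{
		{"exact", func(n string) (string, bool) { return n, known[n] }},
		{"fuzzy", func(n string) (string, bool) { c, ok := typos[n]; return c, ok }},
	}
}

func main() {
	for _, n := range []string{"Pardosa moesta", "Pardosa moestta", "Unknown name"} {
		label, canonical, ok := firstMatch(n, demoStages())
		fmt.Println(n, "->", label, canonical, ok)
	}
}
```

Ordering the stages from cheapest and strictest to most permissive means the expensive fuzzy work runs only for strings that failed an exact lookup.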

func NewGNmatcher added in v0.5.4

func NewGNmatcher(em exact.ExactMatcher, fm fuzzy.FuzzyMatcher, j int) GNmatcher

NewGNmatcher is a constructor for the GNmatcher interface. It takes two interfaces, ExactMatcher and FuzzyMatcher, and an integer j.

Directories

Path      Synopsis
config    package config contains information needed to run gnmatcher project.
ent
exact     package exact contains interface for exact-matching strings to known scientific names.
fuzzy     package fuzzy contains interfaces and code to facilitate fuzzy-matching of name-strings to scientific names collected in gnames database.
matcher   package matcher is the central processing unit for matching name-strings to known scientific names.
          package main provides a CLI to the HTTP service to run gnmatcher functionality.
cmd
io
bloom     package bloom creates and serves bloom filters for canonical names, and names of viruses.
dbase     package dbase provides convenience methods for accessing PostgreSQL database.
rest      package rest provides http REST interface to gnmatcher functionality.
trie      package trie implements FuzzyMatcher interface that is responsible for fuzzy-matching strings to canonical forms of scientific names.
          The purpose of this script is to find out how fast algorithms can go through a list of 100_000 names.
