Documentation
Overview
Package gnmatcher implements the main use case of the project: matching candidate name-strings to scientific names registered in a variety of biodiversity databases.
The goal of the project is to return matched canonical forms of scientific names at a rate of tens of thousands per second, making it possible to handle hundreds of millions or billions of name-string matching events.
The package is intended for long-running services, because it takes a few seconds to initialize its lookup data structures.
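Because initialization is the expensive step, a long-running service would typically pay that cost once and reuse the result for every request. The following is a minimal sketch of that pattern using `sync.Once` from the standard library. The names `lookupData`, `loadLookupData`, `getLookup`, and `loadCount` are hypothetical stand-ins for illustration and are not part of gnmatcher.

```go
package main

import (
	"fmt"
	"sync"
)

// lookupData is a hypothetical stand-in for the expensive in-memory
// structures a matcher needs; it is not a gnmatcher type.
type lookupData struct {
	names map[string]struct{}
}

var (
	once      sync.Once
	shared    *lookupData
	loadCount int // counts how many times the expensive load actually ran
)

// loadLookupData simulates the slow initialization step.
func loadLookupData() *lookupData {
	loadCount++
	return &lookupData{names: map[string]struct{}{
		"Pomatomus saltator": {},
		"Pardosa moesta":     {},
	}}
}

// getLookup returns the shared data, loading it on first use only.
func getLookup() *lookupData {
	once.Do(func() { shared = loadLookupData() })
	return shared
}

func main() {
	for _, name := range []string{"Pardosa moesta", "Pomatomus saltator"} {
		_, ok := getLookup().names[name]
		fmt.Println(name, ok)
	}
	fmt.Println("loads:", loadCount)
}
```

In a real service, the body of `loadLookupData` is where the matcher would be constructed, and request handlers would only ever call the accessor.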
Example
    package main

    import (
        "fmt"

        "github.com/gnames/gnmatcher"
        "github.com/gnames/gnmatcher/config"
        "github.com/gnames/gnmatcher/io/bloom"
        "github.com/gnames/gnmatcher/io/trie"
    )

    func main() {
        // Note that it takes several minutes to initialize lookup data structures.
        // Requirement for initialization: Postgresql database with loaded
        // http://opendata.globalnames.org/dumps/gnames-latest.sql.gz
        //
        // If data are imported already, it still takes several seconds to
        // load lookup data into memory.
        cfg := config.NewConfig()
        em := bloom.NewExactMatcher(cfg)
        fm := trie.NewFuzzyMatcher(cfg)
        gnm := gnmatcher.NewGNMatcher(em, fm)
        res := gnm.MatchNames([]string{"Pomatomus saltator", "Pardosa moesta"})
        for _, match := range res {
            fmt.Println(match.Name)
            fmt.Println(match.MatchType)
            for _, item := range match.MatchItems {
                fmt.Println(item.MatchStr)
                fmt.Println(item.EditDistance)
            }
        }
    }
Constants
const MaxNamesNumber = 10_000
MaxNamesNumber is the upper limit on the number of name-strings the MatchNames function can process. If the input is longer, the list of name-strings is truncated.
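Since input beyond the limit is truncated, a caller with a longer list would need to split it into batches before matching. The sketch below shows one way to do that. The helper `splitBatches` and the local constant `maxNamesNumber` are hypothetical; the constant only mirrors the documented value of 10_000.

```go
package main

import "fmt"

// maxNamesNumber mirrors the documented MaxNamesNumber value.
// It is a local copy for illustration, not the library constant.
const maxNamesNumber = 10_000

// splitBatches cuts names into consecutive slices of at most size elements.
// The returned slices share memory with the input; nothing is copied.
func splitBatches(names []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(names); start += size {
		end := start + size
		if end > len(names) {
			end = len(names)
		}
		batches = append(batches, names[start:end])
	}
	return batches
}

func main() {
	names := make([]string, 25_000) // placeholder input
	for i, batch := range splitBatches(names, maxNamesNumber) {
		// In real use, each batch would be passed to MatchNames.
		fmt.Printf("batch %d: %d names\n", i, len(batch))
	}
}
```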
Variables
    var (
        // Version of the gnmatcher
        Version = "v0.3.6"
        // Build timestamp
        Build = "n/a"
    )
Functions
func NewGNMatcher
func NewGNMatcher(em exact.ExactMatcher, fm fuzzy.FuzzyMatcher) gnmatcher
NewGNMatcher is a constructor for the GNMatcher interface.
Types
type GNMatcher
    type GNMatcher interface {
        // MatchNames takes a slice of scientific name-strings and returns
        // matches to canonical forms of known scientific names. The following
        // matches are attempted:
        // - Exact string match for viruses
        // - Exact match of the name-string's canonical form
        // - Fuzzy match of the canonical form
        // - Partial match of the canonical form where the middle parts of the name
        //   or last elements of the name are removed.
        // - Partial fuzzy match of the canonical form.
        //
        // The resulting output does provide canonical forms, but not the sources
        // where they are registered.
        MatchNames(names []string) []*mlib.Match
        gn.Versioner
    }
GNMatcher is a public API to the project functionality.
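The ordering of match attempts listed in the interface comment can be illustrated with a small fallback cascade. This is only a sketch under stated assumptions, not the package's implementation: it uses a linear scan and a plain Levenshtein distance rather than bloom filters or a trie, it covers only the exact, fuzzy, and partial-exact stages, and its partial stage drops only trailing words. The names `matchOne`, `editDistance`, `min3`, `maxEditDistance`, and the result labels are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// maxEditDistance is an assumed threshold for the fuzzy stage.
const maxEditDistance = 1

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// editDistance computes the Levenshtein distance between a and b.
func editDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

// matchOne tries exact, then fuzzy, then partial matching against known.
// It returns the matched name, a stage label, and the edit distance.
func matchOne(name string, known []string) (string, string, int) {
	// Stage 1: exact match.
	for _, k := range known {
		if k == name {
			return k, "exact", 0
		}
	}
	// Stage 2: fuzzy match within maxEditDistance; the closest name wins.
	best, bestDist := "", maxEditDistance+1
	for _, k := range known {
		if d := editDistance(name, k); d < bestDist {
			best, bestDist = k, d
		}
	}
	if bestDist <= maxEditDistance {
		return best, "fuzzy", bestDist
	}
	// Stage 3: partial match, dropping trailing words one at a time.
	words := strings.Fields(name)
	for n := len(words) - 1; n >= 1; n-- {
		short := strings.Join(words[:n], " ")
		for _, k := range known {
			if k == short {
				return k, "partial", 0
			}
		}
	}
	return "", "none", 0
}

func main() {
	known := []string{"Pomatomus saltator", "Pardosa moesta"}
	inputs := []string{"Pardosa moesta", "Pomatomus saltatr", "Pardosa moesta borealis"}
	for _, in := range inputs {
		m, stage, d := matchOne(in, known)
		fmt.Printf("%q -> %q (%s, distance %d)\n", in, m, stage, d)
	}
}
```

Trying the cheapest check first means that most inputs never reach the more expensive fuzzy comparison.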
Directories
Path | Synopsis |
---|---|
entity | |
io | |
bloom | package bloom creates and serves bloom filters for canonical names, and names of viruses. |
 | The purpose of this script is to find out how fast algorithms can go through a list of 100_000 names. |