bhlindex

command module
v1.0.0
Published: Mar 27, 2023 License: MIT Imports: 4 Imported by: 0

README

Biodiversity Heritage Library Scientific Names Index (BHLindex)

Creates an index of scientific names that occur in the literature collection of the Biodiversity Heritage Library.

Performance

This application can traverse the entire digitized corpus of the Biodiversity Heritage Library in a matter of hours. On a modern high-end laptop we observed the following results:

  • name-finding in 275,000 volumes, 60 million pages: 2.5 hours.
  • name-verification of 23 million unique name-strings: 3 hours.
  • preparing a CSV file with 250 million name occurrence/verification records: 40 minutes.

Installation on Linux

The BHL corpus of OCR-ed data is available as a compressed file larger than 50 GB.

Database Preparation

Log in to the PostgreSQL server and create a database that has the same name as the PgDatabase parameter in the configuration file (the default name is bhlindex).

This database will be used to keep the names that are found. The final size of the database upon completion should be around 50 GB.

In the following example we create the database as the postgres superuser and also create a bhl user to operate on the database.

sudo su - postgres
[postgres ~]$ psql
postgres=# create user bhl with password 'my-very-secret-password';
CREATE ROLE
postgres=# create database bhlindex;
CREATE DATABASE
postgres=# grant all privileges on database bhlindex to bhl;
GRANT
postgres=# \c bhlindex
You are now connected to database "bhlindex" as user "postgres".
bhlindex=# alter schema public owner to bhl;
ALTER SCHEMA

The last step is only needed if the bhl user is not a superuser. Every database has its own public schema, so make sure to switch to the correct database with \c my-db-name, as shown in the example above.

Configuration

When you run the app for the first time, it creates a configuration file and reports where the file is located (usually $HOME/.config/bhlnames.yaml).

Edit the file to provide credentials for the PostgreSQL database.

Change the Jobs setting according to the amount of memory and the number of CPUs. This parameter sets the number of concurrent name-finding jobs; with 32 GB of memory, Jobs: 7 works well.

Set the BHLdir parameter to point to the root directory where the BHL texts are located (several hundred gigabytes of text).

Other parameters are optional.

Environment Variables

Environment variables can be used instead of the configuration file, and they override the configuration file settings. The following variables can be used:

Config          Env. Variable
BHLdir          BHLI_BHL_DIR
OutputFormat    BHLI_OUTPUT_FORMAT
PgHost          BHLI_PG_HOST
PgPort          BHLI_PG_PORT
PgUser          BHLI_PG_USER
PgPass          BHLI_PG_PASS
PgDatabase      BHLI_PG_DATABASE
Jobs            BHLI_JOBS
VerifierURL     BHLI_VERIFIER_URL
WithoutConfirm  BHLI_WITHOUT_CONFIRM

Usage

Commands

Get BHLindex version

bhlindex -V

Find names in BHL

bhlindex find
# to avoid the confirmation dialog (-y overrides the configuration file)
bhlindex find -y

Verify detected names using the GNverifier service

bhlindex verify
# to avoid the confirmation dialog (-y overrides the configuration file)
bhlindex verify -y

Dump data into tab-separated files

Three files will be created, including names and occurrences. Their extension follows the selected output format (CSV is the default). To filter verified results by data source, find the list of sources and their IDs on the gnverifier sources page.

Dump files take more than 30 GB of space. If the --short flag is used, the size is reduced to 13 GB.

# Dump files to a designated directory with a reduced number of fields
# and with normalization of verbatim names.
bhlindex dump -d ~/bhldump -S -N

# Dump files to a designated directory.
bhlindex dump -d ~/bhlindex-dump
# or
bhlindex dump --dir ~/bhlindex-dump

# Dump a reduced number of fields to make the output smaller.
bhlindex dump -S
bhlindex dump --short

# Clean up verbatim names by removing repeated spaces and stray characters around the name.
bhlindex dump -N
bhlindex dump --normalize-verbatim

# Dump only records verified against particular data-sources of `gnverifier`.
# In this case verified names are filtered by `The Catalogue of Life` (ID=1)
# and `The Encyclopedia of Life` (ID=12).
bhlindex dump -d ~/bhlindex-dump -s 1,12
# or
bhlindex dump --dir ~/bhlindex-dump --sources 1,12

# Dump using JSON or TSV formats.
bhlindex dump -f tsv -d ~/bhlindex-dump
bhlindex dump -f json -d ~/bhlindex-dump
# or
bhlindex dump --format tsv --dir ~/bhlindex-dump

To run all commands together

bhlindex find -y && \
  bhlindex verify -y && \
  bhlindex dump -d output-dir

Filtering dumped data

The repository includes a Ruby script, filter.rb, which traverses the dump files names.csv and occurrences.csv and filters out names that are more likely to be false positives. Copy the script to the directory with the dump files and run it with:

ruby ./filter.rb

Testing

Testing requires a PostgreSQL database named bhlindex_test. Running the tests deletes all data from the test database.

go test ./...

Documentation

Overview

Copyright © 2022 Dmitry Mozzherin <dmozzherin@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
