htindex

package module
v0.0.9
Published: Mar 6, 2020 License: MIT Imports: 21 Imported by: 0

README

htindex

The purpose of htindex is to create an index of scientific names in the HathiTrust Digital Library. This library contains a large amount of scientific literature (40% public, 60% private). The program makes it possible to add biodiversity information to the library's metadata and to search its corpus by scientific names.

Installation

For the app to work, you need a directory of zipped titles/volumes organized according to the HathiTrust convention, and a file that contains the paths to these zipped files.

The program gets information about these files either from a configuration file, or from command line flags.

For Linux or Mac, download the latest release, untar it, and copy the binary to /usr/local/bin or to any other directory that is in the PATH.

In your home directory, create .htindex.yaml. Use the example .htindex.yaml file for reference; it explains the configuration parameters. You can skip creating .htindex.yaml if you plan to provide all the needed settings via command-line flags.

Usage

htindex reads a file that contains paths to zipped volumes/books/titles from HathiTrust, finds these files, extracts text from all their pages, finds scientific names in the text, and saves the results to a given output directory.
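The text-extraction step can be sketched with Go's standard archive/zip package. This is only an illustration, not the htindex implementation: the function names pagesFromZip and sampleZip, the rule that every ".txt" entry is a page, and the sample file names are all assumptions made for the example.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// pagesFromZip returns the contents of every ".txt" entry of a zip
// archive, in archive order. Treating ".txt" entries as pages is an
// assumption of this sketch.
func pagesFromZip(data []byte) ([]string, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	var pages []string
	for _, f := range zr.File {
		if !strings.HasSuffix(f.Name, ".txt") {
			continue // skip non-text entries
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		b, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return nil, err
		}
		pages = append(pages, string(b))
	}
	return pages, nil
}

// sampleZip builds a tiny in-memory archive (hypothetical entry names)
// so the sketch runs without any HathiTrust data.
func sampleZip() ([]byte, error) {
	entries := [][2]string{
		{"vol/00000001.txt", "Homo sapiens was described by Linnaeus."},
		{"vol/00000002.txt", "Pardosa moesta is a wolf spider."},
		{"vol/meta.xml", "<meta/>"},
	}
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, e := range entries {
		w, err := zw.Create(e[0])
		if err != nil {
			return nil, err
		}
		if _, err := w.Write([]byte(e[1])); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := sampleZip()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	pages, err := pagesFromZip(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, p := range pages {
		fmt.Printf("page %d: %s\n", i+1, p)
	}
}
```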

If the ~/.htindex.yaml file already contains all the settings, it is sufficient to run

htindex
# To see help message:
htindex -h
# To see version of the app:
htindex -v

If some settings for the app need to be modified during command line execution, use the following flags:

-h, --help : Shows help

-j, --jobs : Takes a positive integer. Sets the number of workers (jobs). The optimal number appears to be number_of_threads * 3.

-i, --input : Takes a string. Sets the path to the input data file.

-o, --output : Takes a string. Sets the path to the output directory. This directory will contain the error log and the results data.

-p, --progress : Takes a positive integer. Sets the number of titles in a batch. After each batch, a message in the output states how many titles have been processed and the rate (titles per minute).

-r, --root : Takes a string. Sets a root path that is prepended to the paths in the input file. This creates the complete absolute path to the zip files with volumes.

-w, --words-around : Sets the number of words retained before and after every occurrence of a name-candidate.

-v, --version : Shows htindex version and build timestamp
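As a rough aid for choosing a value for --jobs, the rule of thumb above (number_of_threads * 3) can be computed in Go. The helper name suggestedJobs is hypothetical, and using runtime.NumCPU as the thread count is an assumption; the best value for a given machine may differ.

```go
package main

import (
	"fmt"
	"runtime"
)

// suggestedJobs applies the README's rule of thumb: threads * 3.
// Thread counts below 1 are treated as 1.
func suggestedJobs(threads int) int {
	if threads < 1 {
		threads = 1
	}
	return threads * 3
}

func main() {
	// runtime.NumCPU reports the logical CPUs usable by this process;
	// treating it as "number_of_threads" is an assumption.
	fmt.Println("suggested --jobs value:", suggestedJobs(runtime.NumCPU()))
}
```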

License

Released under the MIT license.

Authors

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type HTindex

type HTindex struct {
	// RootPrefix is concatenated with paths given in input file to get
	// complete path to HathiTrust files.
	RootPrefix string
	// InputPath gives path to file with input data.
	InputPath string
	// OutputPath gives path to a directory to keep output data.
	OutputPath string
	// JobsNum sets number of jobs/workers to run.
	JobsNum int
	// Dict contains shared dictionary for name finding.
	Dict *dict.Dictionary
	// WordsAround sets number of words retained before and after a
	// name-candidate.
	WordsAround int
	// ProgressNum determines how many titles should be processed for
	// a progress report.
	ProgressNum int
}

HTindex detects occurrences of scientific names in HathiTrust data.

func NewHTindex

func NewHTindex(opts ...Option) (*HTindex, error)

NewHTindex creates an HTindex instance with several defaults. If options are provided, they override the default settings.

func (*HTindex) Run

func (hti *HTindex) Run() error

Run is the main method for creation of the scientific names index.

type Option

type Option func(h *HTindex)

Option is the type of all options received during creation of a new HTindex instance.
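The signatures above follow what is commonly called the functional options pattern. The sketch below reproduces the pattern with local, hypothetical stand-ins (indexer, option, optJobs, optWordsAround, newIndexer) so that it runs without the htindex module; the default values are invented for illustration and are not the package's real defaults.

```go
package main

import "fmt"

// indexer is a hypothetical stand-in for HTindex with two of its fields.
type indexer struct {
	JobsNum     int
	WordsAround int
}

// option mirrors the documented shape `func(h *HTindex)`.
type option func(*indexer)

// optJobs returns an option that sets the number of workers.
func optJobs(i int) option {
	return func(x *indexer) { x.JobsNum = i }
}

// optWordsAround returns an option that sets the context size.
func optWordsAround(w int) option {
	return func(x *indexer) { x.WordsAround = w }
}

// newIndexer sets defaults first, then applies the options in order,
// so any option that is provided overrides the matching default.
func newIndexer(opts ...option) *indexer {
	x := &indexer{JobsNum: 4, WordsAround: 0} // assumed defaults
	for _, o := range opts {
		o(x)
	}
	return x
}

func main() {
	fmt.Printf("%+v\n", *newIndexer())
	fmt.Printf("%+v\n", *newIndexer(optJobs(8), optWordsAround(5)))
}
```

With the real package, the equivalent call would presumably pass values such as OptJobs(8) and OptWordsAround(5) to NewHTindex and then call Run on the result.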

func OptInput

func OptInput(s string) Option

OptInput sets an absolute path to the input data file. Each line of that file gives the path to the zipped file of a title.

func OptJobs

func OptJobs(i int) Option

OptJobs sets the number of jobs/workers to run during execution.
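How a fixed number of jobs/workers can share a list of titles is sketched below with goroutines and channels. This is a generic worker pool, not the htindex code; processAll and the work callback are hypothetical names, and the "work" here simply returns the length of each path.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll distributes paths among `jobs` workers, runs work on each
// path, and returns the sum of the results.
func processAll(paths []string, jobs int, work func(string) int) int {
	if jobs < 1 {
		jobs = 1
	}
	in := make(chan string)
	out := make(chan int)

	var wg sync.WaitGroup
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range in {
				out <- work(p)
			}
		}()
	}

	// Feed the workers, then signal that there is no more input.
	go func() {
		for _, p := range paths {
			in <- p
		}
		close(in)
	}()

	// Close the results channel once every worker has finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	total := 0
	for n := range out {
		total += n
	}
	return total
}

func main() {
	paths := []string{"a.zip", "bb.zip", "ccc.zip"} // hypothetical paths
	total := processAll(paths, 2, func(p string) int { return len(p) })
	fmt.Println("total:", total)
}
```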

func OptOutput

func OptOutput(s string) Option

OptOutput sets an absolute path to a directory where results will be written. If the directory does not exist, it will be created during initialization of the HTindex instance.

func OptProgressNum added in v0.0.2

func OptProgressNum(i int) Option

OptProgressNum sets how often to print a line about the progress. When it is set to 1, a report line appears after every processed title; when it is 10, progress is shown after every 10th title.
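The reporting rule described above can be expressed as a small check plus a rate calculation. The helpers shouldReport and titlesPerMinute are hypothetical and only illustrate the arithmetic; the actual format of the progress line is not shown in this document.

```go
package main

import (
	"fmt"
	"time"
)

// shouldReport reports whether a progress line is due after `done`
// titles, given that one is printed every progressNum titles.
func shouldReport(done, progressNum int) bool {
	return progressNum > 0 && done > 0 && done%progressNum == 0
}

// titlesPerMinute converts a count and an elapsed time in seconds
// into a rate. It returns 0 when no time has elapsed.
func titlesPerMinute(done int, elapsedSeconds float64) float64 {
	if elapsedSeconds <= 0 {
		return 0
	}
	return float64(done) / (elapsedSeconds / 60)
}

func main() {
	start := time.Now()
	for done := 1; done <= 30; done++ {
		time.Sleep(time.Millisecond) // stands in for processing a title
		if shouldReport(done, 10) {
			rate := titlesPerMinute(done, time.Since(start).Seconds())
			fmt.Printf("processed %d titles, %.0f titles/min\n", done, rate)
		}
	}
}
```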

func OptRoot

func OptRoot(s string) Option

OptRoot sets the prefix of the path to zipped titles. It will be concatenated with a path provided in the input file to form the complete absolute path.
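A minimal sketch of combining the root prefix with the lines of an input file follows. It uses filepath.Join, which is an assumption: the package may concatenate the strings differently. The function name fullPaths and the sample paths are hypothetical, and the expected results assume a Unix-like system.

```go
package main

import (
	"bufio"
	"fmt"
	"path/filepath"
	"strings"
)

// fullPaths joins root with every non-empty line of input and returns
// the resulting paths in input order.
func fullPaths(root, input string) ([]string, error) {
	var res []string
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // ignore blank lines
		}
		res = append(res, filepath.Join(root, line))
	}
	return res, sc.Err()
}

func main() {
	// Hypothetical content of an input file: one relative path per line.
	input := "a/b/one.zip\n\nc/two.zip\n"
	paths, err := fullPaths("/data/ht", input)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```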

func OptWordsAround added in v0.0.7

func OptWordsAround(w int) Option

OptWordsAround sets number of words retained before and after a name-candidate.
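What "words retained before and after a name-candidate" can mean is sketched below. The function wordsAround is a hypothetical illustration operating on an already tokenized page; how htindex tokenizes text and stores the context is not described here.

```go
package main

import "fmt"

// wordsAround returns up to n words before and up to n words after the
// name-candidate that occupies words[start:end]. Invalid positions
// yield nil slices; a negative n is treated as 0.
func wordsAround(words []string, start, end, n int) (before, after []string) {
	if start < 0 || end > len(words) || start > end {
		return nil, nil
	}
	if n < 0 {
		n = 0
	}
	lo := start - n
	if lo < 0 {
		lo = 0
	}
	hi := end + n
	if hi > len(words) {
		hi = len(words)
	}
	return words[lo:start], words[end:hi]
}

func main() {
	words := []string{"A", "wolf", "spider", "Pardosa", "moesta", "lives", "in", "bogs"}
	// The candidate "Pardosa moesta" occupies words[3:5].
	before, after := wordsAround(words, 3, 5, 2)
	fmt.Println(before, after)
}
```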

Directories

Path Synopsis
cmd
