gntagger

v0.3.0
Published: Oct 28, 2019 License: MIT Imports: 17 Imported by: 0

README

gntagger

gntagger finds scientific names in a document and then lets the user go through each found name, see it in the context of the text, and accept or reject it.

We made this program to improve the quality of the name-finding algorithm, but it is useful for anybody who needs to extract scientific names from a book or a scientific paper. The program works on MS Windows, Mac and Linux and runs from a command line interface: CMD on Windows, or a terminal on Mac and Linux.

gntagger allows you to curate 4000 names spread over 600 pages in about 2 hours. This is significantly faster than curation done in a text editor or PDF viewer.

Installation

The program is a single executable file that runs from a command line. Download the latest zip or tar file for your operating system, extract the file and place it somewhere in your PATH, so that your system can find it.

Conversion of PDF to text

gntagger works with plain text, so if you need to find names in a PDF file, you first need to convert it to text.

Linux

Usually you can just use the less command.

less paper1.pdf | gntagger

Another option is pdftotext from the xpdf package.

Mac

Use the xpdf package:

brew install Caskroom/cask/xquartz
brew install xpdf
pdftotext -layout doc.pdf doc.txt

Windows

Download the Xpdf tools, unzip them, and use pdftotext.exe:

pdftotext.exe -layout doc.pdf doc.txt

Usage

To find out the version:

gntagger -version
gntagger -V

To get names from a file (the processed text and the list of names will be saved in the same directory as the text file):

gntagger file_with_names.txt

# on windows
gntagger.exe file_with_names.txt

To get names from standard input:

# linux

less file.pdf | gntagger
less file.pdf | gntagger -bayes

# mac

# the trailing "-" makes pdftotext write to standard output instead of file.txt
pdftotext -layout file.pdf - | gntagger
pdftotext -layout file.pdf - | gntagger -bayes

Note that the -layout flag for pdftotext tries to preserve the original structure of the text, as it was in the original PDF. It significantly increases the chances of finding names that are split between the end of one line and the start of the next.

User Interface

The user interface of the program consists of 2 panels. The left panel contains detected scientific names, with a "current name" located in the middle of the screen and highlighted. The right panel contains the full text, where the "current name" is highlighted and aligned with the "current name" in the left panel.

The program is designed to move through the names quickly. Navigate to the next/previous name in the left panel using the Right/Left arrow keys. All names start with an empty annotation. Pressing the Right Arrow key automatically "accepts" the found name if its annotation is empty. Other keys annotate the "current name" differently:

  • Space: rejects a name with "NotName" annotation

  • 'y': re-accepts mistakenly rejected name with "Accepted" annotation

  • 'u': marks a name as "Uninomial"

  • 'g': marks a name as "Genus"

  • 's': marks a name as "Species"

  • Ctrl-C: saves curation and exits application

  • Ctrl-S: saves curations made so far
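The key bindings above can be read as a small state transition on the annotation of the "current name". The following is a minimal sketch of that reading, not gntagger's actual code: the Key type, its constants and the annotate function are hypothetical names introduced for illustration, and only the annotation strings come from the list above.

```go
package main

import "fmt"

// Key is a hypothetical stand-in for a terminal key event.
type Key string

const (
	KeyRight Key = "Right"
	KeySpace Key = "Space"
)

// annotate returns the annotation a name carries after a key press.
// The annotation strings are the ones listed in this README; the
// function itself is an illustration, not part of the gntagger API.
func annotate(current string, k Key) string {
	switch k {
	case KeyRight:
		// Right Arrow accepts a name only if nothing was decided yet.
		if current == "" {
			return "Accepted"
		}
		return current
	case KeySpace:
		return "NotName"
	case "y":
		return "Accepted"
	case "u":
		return "Uninomial"
	case "g":
		return "Genus"
	case "s":
		return "Species"
	}
	// Any other key leaves the annotation unchanged.
	return current
}

func main() {
	fmt.Println(annotate("", KeyRight))        // Accepted
	fmt.Println(annotate("NotName", KeyRight)) // NotName
	fmt.Println(annotate("NotName", "y"))      // Accepted
}
```

Note that moving right over an already annotated name keeps the earlier decision, which is why 'y' is needed to re-accept a rejected name.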

Current names are saved to the clipboard automatically, so it is easy to paste them into a browser, spreadsheet, database, or text editor.

The program autosaves the results of curation. If the program crashes or is closed, the user can continue curation from the last saved point instead of starting from scratch.

Development

Running tests

Install ginkgo, a BDD testing framework for Go.

go get github.com/onsi/ginkgo/ginkgo
go get github.com/onsi/gomega

To run the tests, go to the root directory of the project and run

ginkgo

# or

go test

Build executable

go build -ldflags "-X main.buildstamp=`date -u '+%Y-%m-%d_%I:%M:%S%p'` \
                   -X main.githash=`git rev-parse HEAD | cut -c1-7` \
                   -X main.gittag=`git describe --tags`" \
         -o gntagger -a cmd/gntagger/main.go

Documentation

Overview

Package gntagger is a command line application that helps to find and curate scientific names interactively. For example, if there is a monograph about a genus with hundreds of scientific names, gntagger will find the names automatically and then let the user verify them interactively.

Asciicast: https://asciinema.org/a/wNfIt2TfZiyrAwJZKhuq5DkHV

Functions

func IsDoubtful added in v0.3.0

func IsDoubtful(n *output.Name, gnt *GnTagger) bool

func NameStrings added in v0.3.0

func NameStrings(n *output.Name, current bool, i int,
	total int) ([]string, error)

NameStrings composes the text to show in the terminal GUI.

func ShowWarningIfPreviousData added in v0.3.0

func ShowWarningIfPreviousData(text *Text)

ShowWarningIfPreviousData takes a *Text pointer. It warns if previously created data exist and backs up the old data.

Types

type FileType added in v0.2.0

type FileType int

FileType describes the types of files created by gntagger during name finding and curation.

const (
	// InputFile contains text used for name-finding.
	InputFile FileType = iota
	// NamesFile contains JSON output with names and metadata.
	NamesFile
	// MetaFile contains meta-information used for various purposes.
	MetaFile
)

type GnTagger added in v0.3.0

type GnTagger struct {
	// Bayes flag forces bayes name-finding even when the language of the text
	// is not supported.
	Bayes bool
	// OddsHigh marks a limit after which names are considered 'good'.
	OddsHigh float64
	// OddsLow marks a low limit for 'doubtful' names. OddsHigh is the upper
	// limit for such names.
	OddsLow float64
	// Express sets if we skip names that were already marked as 'good'
	// or 'bad'
	Express bool
}

GnTagger keeps configuration parameters of the program
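The field comments suggest that OddsHigh and OddsLow split names into three bands. The sketch below shows one possible reading of those comments. It is an assumption, not the logic of IsDoubtful: the thresholds type, the classify helper, the inclusive boundaries and the threshold values are all hypothetical and do not reflect the package defaults.

```go
package main

import "fmt"

// thresholds mirrors the OddsHigh and OddsLow fields of GnTagger.
type thresholds struct {
	OddsHigh float64
	OddsLow  float64
}

// classify is a hypothetical helper that places a name into a band
// according to its odds. Whether the real boundaries are inclusive
// is not stated in the documentation; this sketch assumes they are.
func classify(odds float64, t thresholds) string {
	switch {
	case odds >= t.OddsHigh:
		return "good"
	case odds >= t.OddsLow:
		return "doubtful"
	default:
		return "bad"
	}
}

func main() {
	// Illustrative values only, not the defaults set by NewGnTagger.
	t := thresholds{OddsHigh: 100, OddsLow: 0.1}
	for _, odds := range []float64{500, 5, 0.01} {
		fmt.Printf("%g -> %s\n", odds, classify(odds, t))
	}
}
```

Under this reading, the Express flag would let a curator skip the "good" and "bad" bands and spend time only on the "doubtful" one.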

func NewGnTagger added in v0.3.0

func NewGnTagger() *GnTagger

NewGnTagger creates a new GnTagger object

type Names

type Names struct {
	// Path to json file with names
	Path string
	// Data is a gnfinder output
	Data output.Output
}

Names is an object that keeps output of a name finder and the path where to save this data on disk

func NamesFromJSON

func NamesFromJSON(path string) *Names

NamesFromJSON creates gntagger's name structure from a finder output

func NewNames added in v0.2.0

func NewNames(text *Text, gnt *GnTagger) *Names

NewNames uses a name finder or existing information to return Names structure generated from a text

func PrepareFilesAndText added in v0.3.0

func PrepareFilesAndText(t *Text, w int, gnt *GnTagger) *Names

PrepareFilesAndText creates files if needed and returns list of names for curation.

func (*Names) GetCurrentName added in v0.2.0

func (n *Names) GetCurrentName() *output.Name

GetCurrentName returns currently selected name

func (*Names) Save

func (n *Names) Save() error

Save writes current state of names to file

func (*Names) UpdateAnnotations added in v0.3.0

func (n *Names) UpdateAnnotations(newAnnot annotation.Annotation, edge int,
	gnt *GnTagger) error

UpdateAnnotations takes an annotation and updates the current name with it. If needed, it propagates annotations further down the 'unseen' list.

type Text

type Text struct {
	// Raw text, as it was given by a user
	Raw []byte
	// Cleaned text after removing non-printable characters and wrapping
	// according to the width of a user's terminal
	Processed []rune
	// Cleaned text in bytes
	ProcessedBytes []byte
	// Path to the text file
	Path string
	// Files is a map that contains names of the files created by gntagger
	Files map[FileType]string
	// TextMeta provides meta-information about the text:
	// Checksum, GNtaggerVersion, Timestamp
	TextMeta
	// contains filtered or unexported fields
}

Text contains text of the input and its metadata

func NewText added in v0.2.0

func NewText(data []byte, path string, GNtaggerVersion string) *Text

NewText creates new Text object

func (*Text) AddError added in v0.2.1

func (t *Text) AddError(err error)

AddError adds a new error to the Text's error collection.

func (*Text) Errors added in v0.2.1

func (t *Text) Errors() []error

Errors returns the list of errors that happened during the execution of gntagger.

func (*Text) FilePath added in v0.2.0

func (t *Text) FilePath(f FileType) string

FilePath returns a file path for a given FileType.
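Since Text keeps its file names in a map keyed by FileType, a lookup like FilePath can be pictured as a plain map access. The sketch below redeclares FileType locally so that it is self-contained. The text type, the filePath method and the file names are hypothetical, and the behavior for a missing key is an assumption.

```go
package main

import "fmt"

// FileType is redeclared here with the same constants as in the package.
type FileType int

const (
	InputFile FileType = iota
	NamesFile
	MetaFile
)

// text keeps only the field that matters for this sketch.
type text struct {
	Files map[FileType]string
}

// filePath returns the path recorded for f, or an empty string
// if no path was recorded (an assumption of this sketch).
func (t *text) filePath(f FileType) string {
	return t.Files[f]
}

func main() {
	t := &text{Files: map[FileType]string{
		InputFile: "book/input.txt",  // hypothetical file names
		NamesFile: "book/names.json",
		MetaFile:  "book/meta.json",
	}}
	fmt.Println(t.filePath(NamesFile)) // book/names.json
}
```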

func (*Text) Process added in v0.2.0

func (t *Text) Process(width int)

Process removes all non-printable characters from Text and wraps its lines making sure that all scientific names are visible.
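To make the two steps of Process concrete, here is a minimal sketch of cleaning and wrapping text. It is not the package's implementation: the clean and wrap helpers are hypothetical, the wrapping is a simple greedy word wrap, and the sketch does not track the positions of scientific names the way the real method would need to.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// clean drops non-printable runes, keeps line breaks and turns tabs
// into spaces so that neighboring words are not glued together.
func clean(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r == '\n' || unicode.IsPrint(r):
			return r
		case r == '\t':
			return ' '
		}
		return -1 // a negative value tells strings.Map to drop the rune
	}, s)
}

// wrap breaks every line at word boundaries so that no line is longer
// than width runes, unless a single word is longer than width.
func wrap(s string, width int) string {
	var out []string
	for _, line := range strings.Split(s, "\n") {
		cur := ""
		for _, word := range strings.Fields(line) {
			switch {
			case cur == "":
				cur = word
			case len([]rune(cur))+1+len([]rune(word)) <= width:
				cur += " " + word
			default:
				out = append(out, cur)
				cur = word
			}
		}
		out = append(out, cur) // also preserves empty lines
	}
	return strings.Join(out, "\n")
}

func main() {
	raw := "Pardosa moesta\x00 was found near\x07 Bubo bubo nests"
	fmt.Println(wrap(clean(raw), 20))
}
```

A plain word wrap like this one can still separate the two parts of a binomial name across lines, which is presumably the case the real Process takes extra care of.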

type TextMeta added in v0.2.0

type TextMeta struct {
	// Checksum is a hash calculated from the content.
	Checksum string `json:"text_checksum"`
	// GNtaggerVersion is the version of gntagger that generated output.
	GNtaggerVersion string `json:"gntagger_version"`
	// Timestamp of the last save.
	Timestamp string `json:"save_timestamp"`
}

TextMeta is used for creating the content of the MetaFile

func (*TextMeta) ToJSON added in v0.2.0

func (t *TextMeta) ToJSON() []byte

ToJSON converts meta-information into JSON format
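The struct tags on TextMeta determine the keys of the resulting JSON. The following sketch shows the kind of output to expect, assuming ToJSON relies on the standard encoding/json package. The toJSON function, its error handling and the sample field values are illustrative and not taken from the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TextMeta is redeclared here with the same fields and tags as above.
type TextMeta struct {
	Checksum        string `json:"text_checksum"`
	GNtaggerVersion string `json:"gntagger_version"`
	Timestamp       string `json:"save_timestamp"`
}

// toJSON is a guess at what ToJSON does: marshal the struct and
// return nil if marshaling fails.
func toJSON(t *TextMeta) []byte {
	res, err := json.Marshal(t)
	if err != nil {
		return nil
	}
	return res
}

func main() {
	m := &TextMeta{
		Checksum:        "abc123", // sample values
		GNtaggerVersion: "v0.3.0",
		Timestamp:       "2019-10-28T00:00:00Z",
	}
	fmt.Println(string(toJSON(m)))
	// {"text_checksum":"abc123","gntagger_version":"v0.3.0","save_timestamp":"2019-10-28T00:00:00Z"}
}
```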

Directories

Path Synopsis
cmd