Version: v0.1.4 | Published: Aug 22, 2023 | License: GPL-3.0

Repository: git.rescribe.xyz/utils


# rescribe.xyz/utils

This repository contains miscellaneous commands and small packages
useful for the OCR of books.

This is a collection of Go packages. It can be installed in the
standard Go way by running `go get rescribe.xyz/utils/...`. The
documentation can be read with the `go doc` command or online at
<https://pkg.go.dev/rescribe.xyz/utils>.

If you just want to install and use the commands, you can get the
package with `git clone https://git.rescribe.xyz/utils`, and then
install them with `go install ./...` from within the `utils`
directory.

## Contributions

Comments, bug reports, patches and pull requests are all very
welcome. Please email them to <nick@rescribe.xyz>.

## License

This package is licensed under the GPLv3. See the LICENSE file for
more details.

## Directories

| Path | Synopsis |
| --- | --- |
| `cmd/analysestats` | analysestats analyses a set of 'best', 'conf', and 'hocr' files in a directory, outputting results to a .csv file for further investigation. |
| `cmd/avg-lines` | avg-lines prints a report of the average confidence for each line, sorted from worst to best |
| `cmd/boxtotxt` | boxtotxt converts a Tesseract .box file to plain text |
| `cmd/bucket-lines` | bucket-lines copies image-text line pairs into different directories according to the average character probability for the line |
| `cmd/dehyphenate` | dehyphenate does basic dehyphenation on a hocr file |
| `cmd/dlgbook` | dlgbook is a wrapper around getgbook which gets metadata and uses it to save to a specially formatted directory |
| `cmd/eeboxmltohocr` | eeboxmltohocr converts the XML from an EEBO download to hOCR, which can be easily incorporated into a searchable PDF |
| `cmd/extracthocrlines` | extracthocrlines copies the text and corresponding image section for each line of a HOCR file into separate files, which is useful for OCR training |
| `cmd/fonttobytes` | fonttobytes outputs a font file as a series of bytes in go format, allowing a font to be easily embedded into a go binary |
| `cmd/hocrtotxt` | hocrtotxt prints the text from a hocr file |
| `cmd/iiifdownloader` | iiifdownloader attempts to download every page of a IIIF book in the best available quality, given a manifest url |
| `cmd/pare-gt` | pare-gt moves some ground truth, ensuring that the same proportions of each ground truth source are represented in the moved section |
| `cmd/pgconf` | pgconf prints the total confidence for a page of hOCR |
| `pkg/hocr` | hocr contains structures and functions for parsing and analysing hocr files |
| `pkg/line` | line contains various functions to manipulate ocr lines |
| `pkg/prob` | prob processes .prob files generated by ocropus |