# rescribe.xyz/utils

This repository contains miscellaneous commands and small packages
useful for the OCR of books.

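This is a collection of Go packages, which can be installed in the
standard Go way by running `go get rescribe.xyz/utils/...`
On Go 1.18 and later, `go get` no longer builds or installs commands;
there, `go install rescribe.xyz/utils/...@latest` should do the same job.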

## Contributions

Comments, bug reports, patches and pull requests are all very
welcome. Please email them to <nick@rescribe.xyz>.

## License

This package is licensed under the GPLv3. See the LICENSE file for
more details.

## Directories

| Path | Synopsis |
| --- | --- |
| cmd/avg-lines | avg-lines prints a report of the average confidence for each line, sorted from worst to best |
| cmd/boxtotxt | boxtotxt converts a Tesseract .box file to plain text |
| cmd/bucket-lines | bucket-lines copies image-text line pairs into different directories according to the average character probability for the line |
| cmd/dehyphenate | dehyphenate does basic dehyphenation on a hocr file |
| cmd/eeboxmltohocr | eeboxmltohocr converts the XML from an EEBO download to hOCR, which can be easily incorporated into a searchable PDF |
| cmd/fonttobytes | fonttobytes outputs a font file as a series of bytes in go format, allowing a font to be easily embedded into a go binary |
| cmd/hocrtotxt | hocrtotxt prints the text from a hocr file |
| cmd/pare-gt | pare-gt moves some ground truth, ensuring that the same proportions of each ground truth source are represented in the moved section |
| cmd/pgconf | pgconf prints the total confidence for a page of hOCR |
| pkg/hocr | hocr contains structures and functions for parsing and analysing hocr files |
| pkg/line | line contains various functions to manipulate ocr lines |
| pkg/prob | prob processes .prob files generated by ocropus |
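
Several of the commands above report confidence figures taken from hOCR. As a rough illustration of that kind of calculation, the sketch below averages the `x_wconf` word confidences found in an hOCR fragment. It does not use the `pkg/hocr` API and is not how `pgconf` or `avg-lines` are implemented; the function name `averageWordConfidence` and the sample markup are assumptions made for this example, and the regular-expression approach is a shortcut rather than real hOCR parsing.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// wconfRe matches the x_wconf property that hOCR word elements carry in
// their title attribute, e.g. title="bbox 10 10 50 30; x_wconf 92".
var wconfRe = regexp.MustCompile(`x_wconf\s+(\d+(?:\.\d+)?)`)

// averageWordConfidence is a hypothetical helper (not part of this
// repository). It returns the mean of all x_wconf values found in the
// given hOCR text, and false if no values were found.
func averageWordConfidence(hocr string) (float64, bool) {
	var total float64
	var n int
	for _, m := range wconfRe.FindAllStringSubmatch(hocr, -1) {
		v, err := strconv.ParseFloat(m[1], 64)
		if err != nil {
			continue // skip anything that does not parse as a number
		}
		total += v
		n++
	}
	if n == 0 {
		return 0, false
	}
	return total / float64(n), true
}

func main() {
	// Made-up hOCR fragment with two words on one line.
	sample := `<span class='ocr_line' title='bbox 0 0 100 20'>
<span class='ocrx_word' title='bbox 0 0 40 20; x_wconf 96'>Hello</span>
<span class='ocrx_word' title='bbox 50 0 100 20; x_wconf 84'>world</span>
</span>`

	if avg, ok := averageWordConfidence(sample); ok {
		fmt.Printf("average word confidence: %.1f\n", avg)
	}
}
```

For real hOCR files, the structures in `pkg/hocr` are likely a better starting point than a regular expression, since they parse the document rather than pattern-matching on its text.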