# rescribe.xyz/utils

This repository contains miscellaneous commands and small packages useful for the OCR of books.

This is a collection of Go packages, and can be installed in the standard Go way, by running `go get rescribe.xyz/utils/...`

## Contributions

Any and all comments, bug reports, patches or pull requests are very welcome. Please email them to <nick@rescribe.xyz>.

## License

This package is licensed under the GPLv3. See the LICENSE file for more details.

## Directories

Path | Synopsis |
---|---|
cmd/avg-lines | avg-lines prints a report of the average confidence for each line, sorted from worst to best |
cmd/boxtotxt | boxtotxt converts a Tesseract .box file to plain text |
cmd/bucket-lines | bucket-lines copies image-text line pairs into different directories according to the average character probability for the line |
cmd/dehyphenate | dehyphenate does basic dehyphenation on an hOCR file |
cmd/eeboxmltohocr | eeboxmltohocr converts the XML from an EEBO download to hOCR, which can be easily incorporated into a searchable PDF |
cmd/fonttobytes | fonttobytes outputs a font file as a series of bytes in Go format, allowing a font to be easily embedded into a Go binary |
cmd/hocrtotxt | hocrtotxt prints the text from an hOCR file |
cmd/pare-gt | pare-gt moves some ground truth, ensuring that the same proportions of each ground truth source are represented in the moved section |
cmd/pgconf | pgconf prints the total confidence for a page of hOCR |
pkg/hocr | hocr contains structures and functions for parsing and analysing hOCR files |
pkg/line | line contains various functions to manipulate OCR lines |
pkg/prob | prob processes .prob files generated by OCRopus |
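
To give a feel for the kind of report `avg-lines` describes (average confidence per line, sorted from worst to best), here is a minimal, self-contained sketch. It does not use the packages in this repository: the `Line` type, its fields, and `SortWorstFirst` are hypothetical names invented for illustration, and the real command may compute and print its report differently.

```go
package main

import (
	"fmt"
	"sort"
)

// Line is a hypothetical stand-in for one OCR line: a name plus the
// confidence values recorded for the words or characters in it.
// It is not the type used by pkg/line.
type Line struct {
	Name  string
	Confs []float64
}

// Avg returns the mean confidence of the line, or 0 for an empty line.
func (l Line) Avg() float64 {
	if len(l.Confs) == 0 {
		return 0
	}
	var sum float64
	for _, c := range l.Confs {
		sum += c
	}
	return sum / float64(len(l.Confs))
}

// SortWorstFirst returns a copy of lines ordered by ascending average
// confidence, so the least reliable lines come first.
func SortWorstFirst(lines []Line) []Line {
	out := make([]Line, len(lines))
	copy(out, lines)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Avg() < out[j].Avg()
	})
	return out
}

func main() {
	// Made-up sample data, purely for demonstration.
	lines := []Line{
		{Name: "line_001", Confs: []float64{90, 80}},
		{Name: "line_002", Confs: []float64{40, 60}},
		{Name: "line_003", Confs: []float64{75}},
	}
	for _, l := range SortWorstFirst(lines) {
		fmt.Printf("%s %.2f\n", l.Name, l.Avg())
	}
}
```

Sorting a copy rather than the input slice keeps the caller's ordering intact, which is convenient if the same lines are later bucketed or moved by another tool.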