pdfocr

command
v0.0.0-...-aaae7e4
Published: May 21, 2025 License: MPL-2.0 Imports: 9 Imported by: 0

Documentation

Overview

pdfocr is a command-line tool for creating searchable PDFs with OCR text layers.

This tool can either add an OCR text layer to an existing PDF or build a new PDF from page images with embedded OCR text. It uses hOCR data to place the text at the position of each recognized word.

Usage:

pdfocr -hocr document.hocr [options]
pdfocr -pdf document.pdf -check-ocr

Required flags:

-hocr string      Path to hOCR file (required except for -check-ocr)
-output string    Output PDF path (required except for -check-ocr)

Input options (one required):

-pdf string       Path to existing PDF to enhance with OCR
-image-dir string Directory containing page images to build a new PDF
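The flag requirements above can be summarized as a small validation routine. This is a minimal sketch, not the tool's actual source: the Options struct and the Validate method are hypothetical names, and the rule that -check-ocr needs -pdf is an assumption drawn from the usage line. Whether -pdf and -image-dir may be combined is not stated, so the sketch only checks that at least one is present.

```go
package main

import "fmt"

// Options mirrors the documented flags. The struct is hypothetical and
// exists only to illustrate the documented requirements.
type Options struct {
	HOCR     string // -hocr
	Output   string // -output
	PDF      string // -pdf
	ImageDir string // -image-dir
	CheckOCR bool   // -check-ocr
}

// Validate returns an error describing the first documented requirement
// that the options do not satisfy, or nil if all are met.
func (o Options) Validate() error {
	missing := func(flag string) error {
		return fmt.Errorf("%s is required", flag)
	}
	if o.CheckOCR {
		// Assumption: -check-ocr inspects the file given by -pdf,
		// as in "pdfocr -pdf document.pdf -check-ocr".
		if o.PDF == "" {
			return missing("-pdf")
		}
		return nil // -hocr and -output are not needed in this mode
	}
	if o.HOCR == "" {
		return missing("-hocr")
	}
	if o.Output == "" {
		return missing("-output")
	}
	if o.PDF == "" && o.ImageDir == "" {
		return missing("one of -pdf or -image-dir")
	}
	return nil
}

func main() {
	opts := Options{HOCR: "document.hocr", PDF: "document.pdf"}
	fmt.Println(opts.Validate()) // prints "-output is required"
}
```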

Processing options:

-start-page int   Start applying OCR from this page (default 1)
-debug            Enable debug mode (shows OCR bounding boxes)
-force            Force reapply OCR even if layer exists
-strict           Exit with an error when OCR detection fails or OCR already exists (unless -force is used)
-overwrite        Overwrite output file if it exists
-debug-pdf        Dump PDF structure for debugging
-check-ocr        Check if the PDF already has OCR and exit

Exit codes:

0 - Success (no warnings or errors)
1 - Error (operation failed)
2 - Success with warnings (including OCR already detected)
3 - Error: OCR already detected in strict mode

Examples:

Add OCR layer to existing PDF:

pdfocr -hocr document.hocr -pdf document.pdf -output document_searchable.pdf

Create PDF from image directory with OCR:

pdfocr -hocr document.hocr -image-dir ./page_images -output document_searchable.pdf

Check if a PDF already has OCR:

pdfocr -pdf document.pdf -check-ocr
