Documentation
Overview
pdfocr is a command-line tool for creating searchable PDFs with OCR text layers.
It can either add an OCR text layer to an existing PDF or build a new PDF from page images with embedded OCR text. It uses hOCR data to place each recognized word at its position on the page.
Usage:
pdfocr -hocr document.hocr [options]
pdfocr -pdf document.pdf -check-ocr
Required flags:
-hocr string    Path to hOCR file (required except for -check-ocr)
-output string  Output PDF path (required except for -check-ocr)
Input options (one required):
-pdf string        Path to existing PDF to enhance with OCR
-image-dir string  Directory containing page images to build a new PDF
Processing options:
-start-page int  Start applying OCR from this page (default 1)
-debug           Enable debug mode (shows OCR bounding boxes)
-force           Force reapplying OCR even if an OCR layer exists
-strict          Error out when OCR detection fails or OCR already exists (unless -force is used)
-overwrite       Overwrite output file if it exists
-debug-pdf       Dump PDF structure for debugging
-check-ocr       Check if the PDF already has OCR and exit
Exit codes:
0 - Success (no warnings or errors)
1 - Error (operation failed)
2 - Success with warnings (including OCR already detected)
3 - Error: OCR already detected in strict mode
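A wrapper script can branch on these exit codes. The following is a minimal sketch: the helper name describe_pdfocr_exit and its message strings are illustrative assumptions, not part of pdfocr. Only the numeric codes come from the list above.

```shell
#!/bin/sh
# describe_pdfocr_exit CODE
# Hypothetical helper (not shipped with pdfocr): maps a pdfocr exit code,
# as documented above, to a short human-readable label.
describe_pdfocr_exit() {
    case "$1" in
        0) echo "success" ;;
        1) echo "error" ;;
        2) echo "success with warnings" ;;
        3) echo "OCR already detected (strict mode)" ;;
        *) echo "unknown exit code: $1" ;;
    esac
}

# Example use after a run:
#   pdfocr -hocr document.hocr -pdf document.pdf -output out.pdf
#   describe_pdfocr_exit "$?"
```

Because code 2 still means the output was produced, a script that treats every non-zero status as failure would misreport runs that only raised warnings.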
Examples:
Add OCR layer to existing PDF:
pdfocr -hocr document.hocr -pdf document.pdf -output document_searchable.pdf
Create PDF from image directory with OCR:
pdfocr -hocr document.hocr -image-dir ./page_images -output document_searchable.pdf
Check if a PDF already has OCR:
pdfocr -pdf document.pdf -check-ocr
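Process a directory of PDFs:

The first example can be wrapped in a loop to handle several documents. This is a sketch under stated assumptions: each name.pdf has a matching name.hocr beside it, and output goes to name_searchable.pdf. That file layout and the ocr_all helper are hypothetical and are not conventions defined by pdfocr. Only the -hocr, -pdf, and -output flags documented above are used.

```shell
#!/bin/sh
# ocr_all DIR
# Hypothetical helper: runs pdfocr on every DIR/name.pdf that has a
# DIR/name.hocr next to it. The naming scheme is an assumption.
ocr_all() {
    dir=$1
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue                 # glob matched nothing
        case "$pdf" in
            *_searchable.pdf) continue ;;         # skip outputs of earlier runs
        esac
        base=${pdf%.pdf}
        hocr="$base.hocr"
        if [ ! -f "$hocr" ]; then
            echo "skipping $pdf: no $hocr" >&2
            continue
        fi
        # Report any non-zero status but keep going with the next file.
        pdfocr -hocr "$hocr" -pdf "$pdf" -output "${base}_searchable.pdf" ||
            echo "pdfocr exited with status $? for $pdf" >&2
    done
}

# Example: ocr_all ./scans
```

The loop continues after a non-zero status because exit code 2 indicates success with warnings, so stopping at the first non-zero result would abort batches that are otherwise fine.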