pdfer

v0.8.0
Published: Jan 11, 2026 License: MIT Imports: 1 Imported by: 0

README

pdfer

A pure Go library for PDF processing with comprehensive XFA (XML Forms Architecture) support.

Features

  • Pure Go - No CGO, no external dependencies
  • Unified API - Clean parse.Open() entry point for all PDF operations
  • Unified Forms - Auto-detect and work with AcroForm or XFA forms
  • PDF Decryption - RC4 (40/128-bit) and AES (128/256-bit)
  • XFA Processing - Extract, parse, modify, and rebuild XFA forms
  • PDF Generation - Create PDFs from scratch with text, graphics, and images
  • Image Embedding - JPEG and PNG images with alpha channel support
  • Content Streams - Full text and graphics operators for page content
  • Object Streams - Full support for compressed object storage
  • Cross-Reference Streams - Parse modern PDF xref streams with predictor filters
  • Stream Filters - FlateDecode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode
  • Incremental Updates - Parse PDFs with multiple revisions, follow /Prev chains
  • Byte-Perfect Parsing - Preserve exact bytes for reconstruction of original PDF
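
To make two of the stream filters listed above concrete, here is a minimal standalone sketch of ASCIIHexDecode and RunLengthDecode as the PDF specification describes them. It uses only the standard library and is not pdfer's implementation; the function names asciiHexDecode and runLengthDecode are illustrative assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// asciiHexDecode decodes ASCIIHexDecode data: pairs of hex digits, whitespace
// ignored, '>' marks end of data, and a trailing odd digit is padded with 0.
func asciiHexDecode(src []byte) ([]byte, error) {
	var out []byte
	var hi byte
	haveHi := false
	for _, c := range src {
		var v byte
		switch {
		case c == '>':
			if haveHi {
				out = append(out, hi<<4)
			}
			return out, nil
		case c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f' || c == 0:
			continue
		case c >= '0' && c <= '9':
			v = c - '0'
		case c >= 'a' && c <= 'f':
			v = c - 'a' + 10
		case c >= 'A' && c <= 'F':
			v = c - 'A' + 10
		default:
			return nil, fmt.Errorf("asciihex: invalid byte %q", c)
		}
		if haveHi {
			out = append(out, hi<<4|v)
			haveHi = false
		} else {
			hi, haveHi = v, true
		}
	}
	if haveHi {
		out = append(out, hi<<4)
	}
	return out, nil
}

// runLengthDecode decodes RunLengthDecode data: a length byte n < 128 copies
// the next n+1 bytes, n > 128 repeats the next byte 257-n times, 128 ends the data.
func runLengthDecode(src []byte) ([]byte, error) {
	var out []byte
	for i := 0; i < len(src); {
		n := int(src[i])
		i++
		switch {
		case n == 128:
			return out, nil
		case n < 128:
			if i+n+1 > len(src) {
				return nil, errors.New("runlength: truncated literal run")
			}
			out = append(out, src[i:i+n+1]...)
			i += n + 1
		default:
			if i >= len(src) {
				return nil, errors.New("runlength: truncated repeat run")
			}
			for k := 0; k < 257-n; k++ {
				out = append(out, src[i])
			}
			i++
		}
	}
	return out, nil
}

func main() {
	text, _ := asciiHexDecode([]byte("48 65 6C 6C 6F>"))
	fmt.Printf("%s\n", text)
	runs, _ := runLengthDecode([]byte{2, 'a', 'b', 'c', 254, 'x', 128})
	fmt.Printf("%s\n", runs)
}
```

Filters in a PDF can be chained, so a decoder would typically apply functions like these in the order given by the stream's /Filter array.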

Installation

go get github.com/benedoc-inc/pdfer

Quick Start

Open and Parse a PDF (Unified API)
import "github.com/benedoc-inc/pdfer/core/parse"

// Open a PDF
pdf, err := parse.Open(pdfBytes)
if err != nil {
    log.Fatal(err)
}

// Get basic info
log.Printf("Version: %s", pdf.Version())
log.Printf("Objects: %d", pdf.ObjectCount())
log.Printf("Revisions: %d", pdf.RevisionCount())

// Get an object
obj, err := pdf.GetObject(1)
if err != nil {
    log.Fatal(err)
}
log.Printf("Object 1: %s", string(obj))

// List all objects
for _, num := range pdf.Objects() {
    log.Printf("Object %d exists", num)
}
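
For readers curious where a value like the one from pdf.Version() originates, the following is a small standalone sketch that reads the version from the "%PDF-x.y" header line. It is independent of pdfer (the helper name headerVersion is an assumption for illustration) and only shows the file-format detail.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// headerVersion returns the version from a "%PDF-x.y" header, e.g. "1.7".
func headerVersion(pdf []byte) (string, error) {
	const marker = "%PDF-"
	// Some files carry junk before the header, so search a short prefix
	// instead of requiring the marker at offset 0.
	limit := len(pdf)
	if limit > 1024 {
		limit = 1024
	}
	i := bytes.Index(pdf[:limit], []byte(marker))
	if i < 0 {
		return "", errors.New("no %PDF- header found")
	}
	rest := pdf[i+len(marker) : limit]
	end := 0
	for end < len(rest) && (rest[end] == '.' || (rest[end] >= '0' && rest[end] <= '9')) {
		end++
	}
	if end == 0 {
		return "", errors.New("header has no version number")
	}
	return string(rest[:end]), nil
}

func main() {
	v, err := headerVersion([]byte("%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"))
	fmt.Println(v, err)
}
```

Note that a document catalog may override the header version with a /Version entry, so the header alone is not always the final word.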
Open an Encrypted PDF
pdf, err := parse.OpenWithOptions(pdfBytes, parse.ParseOptions{
    Password: []byte("secret"),
    Verbose:  true,
})
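
As background on how passwords enter key derivation, here is a minimal sketch of the password padding step used by the legacy standard security handler (RC4 and AES-128, revisions 2 to 4) in the PDF specification. It is not taken from pdfer, and padPassword is an illustrative name; AES-256 (revisions 5 and 6) uses a different, hash-based scheme that this sketch does not cover.

```go
package main

import "fmt"

// passwordPad is the fixed 32-byte padding string from the PDF specification.
var passwordPad = [32]byte{
	0x28, 0xBF, 0x4E, 0x5E, 0x4E, 0x75, 0x8A, 0x41,
	0x64, 0x00, 0x4E, 0x56, 0xFF, 0xFA, 0x01, 0x08,
	0x2E, 0x2E, 0x00, 0xB6, 0xD0, 0x68, 0x3E, 0x80,
	0x2F, 0x0C, 0xA9, 0xFE, 0x64, 0x53, 0x69, 0x7A,
}

// padPassword truncates or pads a password to exactly 32 bytes.
func padPassword(pw []byte) [32]byte {
	var out [32]byte
	n := copy(out[:], pw)         // copies at most 32 bytes of the password
	copy(out[n:], passwordPad[:]) // fills the rest from the start of the pad
	return out
}

func main() {
	empty := padPassword(nil)
	fmt.Printf("% x\n", empty[:8])
	secret := padPassword([]byte("secret"))
	fmt.Printf("%q + %d pad bytes\n", secret[:6], len(secret)-6)
}
```

An empty password therefore still yields a well-defined 32-byte input, which is why files encrypted with an empty user password can be opened without prompting.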
Byte-Perfect PDF Parsing
// Parse preserving exact bytes for reconstruction
pdf, err := parse.OpenWithOptions(pdfBytes, parse.ParseOptions{
    BytePerfect: true,
})

// Reconstruct identical PDF
reconstructed := pdf.Bytes()
// bytes.Equal(reconstructed, pdfBytes) == true

// Access raw object data with byte offsets
rawObj, _ := pdf.GetRawObject(1)
log.Printf("Object at offset %d, raw bytes: %d", rawObj.Offset, len(rawObj.RawBytes))
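
When checking a byte-perfect round trip, a plain bytes.Equal only reports success or failure. The sketch below is a small standard-library helper (firstDiff is a hypothetical name, not part of pdfer) that reports where two buffers first diverge, which can help when a reconstruction does not match.

```go
package main

import "fmt"

// firstDiff returns the offset of the first differing byte, or -1 if the
// two slices are identical. A length mismatch counts as a difference at
// the end of the shorter slice.
func firstDiff(a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	if len(a) != len(b) {
		return n
	}
	return -1
}

func main() {
	original := []byte("%PDF-1.4\n1 0 obj\n<< >>\nendobj\n")
	rebuilt := []byte("%PDF-1.4\n1 0 obj\n<<>>\nendobj\n")
	if off := firstDiff(original, rebuilt); off >= 0 {
		fmt.Printf("round trip diverges at offset %d\n", off)
	}
}
```

In a test, one might call firstDiff(pdfBytes, pdf.Bytes()) and fail with the offset when the result is not -1.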
Extract and Fill Forms (Unified Interface)
import (
    "github.com/benedoc-inc/pdfer/forms"
    "github.com/benedoc-inc/pdfer/types"
)

// Auto-detect and extract any form type (AcroForm or XFA)
form, err := forms.Extract(pdfBytes, password, false)
if err != nil {
    log.Fatal(err)
}

// Work with unified interface
schema := form.Schema()
log.Printf("Form type: %s, Fields: %d", form.Type(), len(schema.Questions))

// Fill the form
formData := types.FormData{
    "FirstName": "John",
    "LastName":  "Doe",
}
filled, err := form.Fill(pdfBytes, formData, password, false)
if err != nil {
    log.Fatal(err)
}
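
Before filling, it can be useful to catch misspelled field names. The following is a hedged sketch that compares data keys against a list of known field IDs. It assumes that fill data is keyed by the same IDs the schema exposes, which the README does not state explicitly; unknownFields and the plain map[string]string stand-in are illustrative, not pdfer types.

```go
package main

import (
	"fmt"
	"sort"
)

// unknownFields returns the keys of data that match no known field ID,
// sorted for stable output.
func unknownFields(data map[string]string, fieldIDs []string) []string {
	known := make(map[string]struct{}, len(fieldIDs))
	for _, id := range fieldIDs {
		known[id] = struct{}{}
	}
	var unknown []string
	for name := range data {
		if _, ok := known[name]; !ok {
			unknown = append(unknown, name)
		}
	}
	sort.Strings(unknown)
	return unknown
}

func main() {
	ids := []string{"FirstName", "LastName"}
	data := map[string]string{"FirstName": "John", "Lastname": "Doe"}
	fmt.Println(unknownFields(data, ids))
}
```

With a real schema, the ID list could be collected by looping over schema.Questions and reading each question's ID.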
Extract XFA from an Encrypted PDF
package main

import (
    "log"
    "os"

    "github.com/benedoc-inc/pdfer"
    "github.com/benedoc-inc/pdfer/core/encrypt"
    "github.com/benedoc-inc/pdfer/forms/xfa"
)

func main() {
    // Read PDF
    pdfBytes, err := os.ReadFile("form.pdf")
    if err != nil {
        log.Fatal(err)
    }

    // Decrypt (empty password for many government forms)
    _, encryptInfo, err := encrypt.DecryptPDF(pdfBytes, []byte(""), false)
    if err != nil {
        log.Fatal(err)
    }

    // Extract XFA streams
    streams, err := xfa.ExtractAllXFAStreams(pdfBytes, encryptInfo, false)
    if err != nil {
        log.Fatal(err)
    }

    // Access template (form structure)
    log.Printf("Template: %d bytes", len(streams.Template.Data))
    
    // Access datasets (form data)
    log.Printf("Datasets: %d bytes", len(streams.Datasets.Data))
    
    // Parse the template; pdfer.FormSchema is a type alias from the root package
    var form *pdfer.FormSchema
    form, err = xfa.ParseXFAForm(string(streams.Template.Data), false)
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("Found %d questions", len(form.Questions))
}
Parse XFA Form Structure
// Parse the template to get form fields
form, err := xfa.ParseXFAForm(string(streams.Template.Data), false)
if err != nil {
    log.Fatal(err)
}

log.Printf("Found %d questions", len(form.Questions))
for _, q := range form.Questions {
    log.Printf("  %s: %s (%s)", q.ID, q.Label, q.Type)
}
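
To illustrate what the template stream contains, here is a minimal standard-library sketch that lists the name attribute of every <field> element in an XFA template. It is not how xfa.ParseXFAForm is implemented (that is not shown in this README); fieldNames and the sample XML are assumptions for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// fieldNames returns the name attribute of every <field> element,
// in document order, regardless of how deeply subforms are nested.
func fieldNames(templateXML string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(templateXML))
	var names []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return nil, err
		}
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Local != "field" {
			continue
		}
		for _, attr := range start.Attr {
			if attr.Name.Local == "name" {
				names = append(names, attr.Value)
			}
		}
	}
}

// sample is a tiny hand-written template, not output from a real form.
const sample = `<template xmlns="http://www.xfa.org/schema/xfa-template/3.0/">
  <subform name="form1">
    <field name="FirstName"><ui><textEdit/></ui></field>
    <field name="LastName"><ui><textEdit/></ui></field>
  </subform>
</template>`

func main() {
	names, err := fieldNames(sample)
	fmt.Println(names, err)
}
```

A token-based walk keeps memory use flat for large templates, at the cost of not building the subform hierarchy that a full parser would track.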
Update Form Field Values
// Create form data (using type alias from root package)
formData := pdfer.FormData{
    "FirstName": "John",
    "LastName":  "Doe",
    "Date":      "2024-01-15",
}

// Update XFA in PDF
updatedPDF, err := xfa.UpdateXFAInPDF(pdfBytes, formData, encryptInfo, false)
if err != nil {
    log.Fatal(err)
}

if err := os.WriteFile("filled.pdf", updatedPDF, 0644); err != nil {
    log.Fatal(err)
}
Create a PDF from Scratch
import "github.com/benedoc-inc/pdfer/core/write"

// Create a simple PDF with text and graphics
builder := write.NewSimplePDFBuilder()

// Add a page
page := builder.AddPage(write.PageSizeLetter)

// Add a font and get its resource name
fontName := page.AddStandardFont("Helvetica")

// Draw content
page.Content().
    // Add text
    BeginText().
    SetFont(fontName, 24).
    SetTextPosition(72, 720).
    ShowText("Hello, PDF World!").
    EndText().
    // Draw a red rectangle
    SetFillColorRGB(1, 0, 0).
    Rectangle(72, 650, 200, 50).
    Fill()

builder.FinalizePage(page)

// Generate PDF bytes
pdfBytes, err := builder.Bytes()
if err != nil {
    log.Fatal(err)
}
Embed Images in a PDF
import "github.com/benedoc-inc/pdfer/core/write"

builder := write.NewSimplePDFBuilder()
page := builder.AddPage(write.PageSizeLetter)

// Add a JPEG image
jpegData, _ := os.ReadFile("photo.jpg")
imgInfo, err := builder.Writer().AddJPEGImage(jpegData, "Im1")
if err != nil {
    log.Fatal(err)
}

// Register image with page and draw it
imgName := page.AddImage(imgInfo)
page.Content().DrawImageAt(imgName, 72, 500, 200, 150)

builder.FinalizePage(page)
pdfBytes, _ := builder.Bytes()
Extract All Content from a PDF
import "github.com/benedoc-inc/pdfer/content/extract"

// Extract all content (text, graphics, images, fonts, annotations)
doc, err := extract.ExtractContent(pdfBytes, nil, false)
if err != nil {
    log.Fatal(err)
}

// Access extracted content
log.Printf("Pages: %d", len(doc.Pages))
for _, page := range doc.Pages {
    log.Printf("  Page %d: %d text elements, %d graphics, %d images",
        page.PageNumber, len(page.Text), len(page.Graphics), len(page.Images))
    
    // Text with positioning and font info
    for _, text := range page.Text {
        log.Printf("    Text: '%s' at (%.2f, %.2f) font: %s size: %.2f",
            text.Text, text.X, text.Y, text.FontName, text.FontSize)
    }
    
    // Resources (fonts, images)
    if page.Resources != nil {
        log.Printf("    Fonts: %d, Images: %d", 
            len(page.Resources.Fonts), len(page.Resources.Images))
    }
    
    // Annotations
    for _, annot := range page.Annotations {
        log.Printf("    Annotation: %s", annot.Type)
    }
}

// Extract as JSON
jsonStr, err := extract.ExtractContentToJSON(pdfBytes, nil, false)
if err != nil {
    log.Fatal(err)
}
log.Printf("JSON: %s", jsonStr)

Extraction Flow:

ExtractContent()
  ├─→ ExtractMetadata() → Document info (title, author, dates)
  ├─→ ExtractPages() → For each page:
  │     ├─→ parseContentStream() → Text, graphics, image refs
  │     ├─→ extractResources() → Fonts, XObjects, images
  │     └─→ extractAnnotations() → Links, comments, highlights
  ├─→ ExtractBookmarks() → Document outline
  └─→ Aggregate → Unique fonts/images from all pages
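
Positioned text elements are often post-processed into reading order. The sketch below shows one simple approach under stated assumptions: coordinates are in default PDF user space (Y grows upward), and elements whose Y values differ by at most a tolerance belong to the same line. The textElement type is a local stand-in carrying only the Text, X, and Y fields used in the example above; it is not pdfer's type.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// textElement is a local stand-in with the Text, X, and Y fields.
type textElement struct {
	Text string
	X, Y float64
}

// readingOrder groups elements into lines by Y (within tol of the line's
// first element) and returns the lines top to bottom, each left to right.
func readingOrder(elems []textElement, tol float64) []string {
	sorted := append([]textElement(nil), elems...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y > sorted[j].Y })
	var lines []string
	for i := 0; i < len(sorted); {
		j := i + 1
		for j < len(sorted) && sorted[i].Y-sorted[j].Y <= tol {
			j++
		}
		row := sorted[i:j]
		sort.SliceStable(row, func(a, b int) bool { return row[a].X < row[b].X })
		words := make([]string, len(row))
		for k, e := range row {
			words[k] = e.Text
		}
		lines = append(lines, strings.Join(words, " "))
		i = j
	}
	return lines
}

func main() {
	elems := []textElement{
		{"World", 120, 700},
		{"Hello", 72, 700.5},
		{"Footer", 72, 50},
	}
	for _, line := range readingOrder(elems, 2) {
		fmt.Println(line)
	}
}
```

This heuristic ignores multi-column layouts and rotated text; it is meant as a starting point, not a layout analysis.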
Parse PDFs with Incremental Updates
import "github.com/benedoc-inc/pdfer/core/parse"

// Parse a PDF that has been edited multiple times
pdfBytes, _ := os.ReadFile("edited.pdf")

// Check how many revisions the PDF has
revisions := parse.CountRevisions(pdfBytes)
log.Printf("PDF has %d revisions", revisions)

// Parse all revisions and merge object tables
result, err := parse.ParseWithIncrementalUpdates(pdfBytes, false)
if err != nil {
    log.Fatal(err)
}

log.Printf("Found %d objects", len(result.Objects))

// Extract a specific revision (e.g., the original before edits)
originalPDF, _ := parse.ExtractRevision(pdfBytes, 1)
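
As a rough illustration of what a revision is at the byte level, each incremental update appends a new body, cross-reference section, and trailer ending in "%%EOF". The sketch below counts those markers and reads the final startxref offset. It is a heuristic under stated assumptions and not necessarily how parse.CountRevisions works: "%%EOF" can also occur inside stream data, and a robust parser would follow the /Prev chain from the last trailer instead. The function names are illustrative.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"strconv"
)

// countEOFMarkers counts "%%EOF" markers as an upper-bound estimate of
// the number of revisions in the file.
func countEOFMarkers(pdf []byte) int {
	return bytes.Count(pdf, []byte("%%EOF"))
}

// lastStartXref returns the byte offset that follows the final
// "startxref" keyword, which locates the newest cross-reference section.
func lastStartXref(pdf []byte) (int64, error) {
	const keyword = "startxref"
	i := bytes.LastIndex(pdf, []byte(keyword))
	if i < 0 {
		return 0, errors.New("startxref not found")
	}
	fields := bytes.Fields(pdf[i+len(keyword):])
	if len(fields) == 0 {
		return 0, errors.New("startxref has no offset")
	}
	return strconv.ParseInt(string(fields[0]), 10, 64)
}

func main() {
	// A synthetic skeleton with two revisions; bodies are omitted.
	doc := []byte("%PDF-1.4\nstartxref\n120\n%%EOF\nstartxref\n480\n%%EOF\n")
	off, err := lastStartXref(doc)
	fmt.Println(countEOFMarkers(doc), off, err)
}
```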
Byte-Perfect Parsing with ParsePDFDocument
import "github.com/benedoc-inc/pdfer/core/parse"

// Parse PDF with full byte preservation
pdfBytes, _ := os.ReadFile("document.pdf")
doc, err := parse.ParsePDFDocument(pdfBytes)
if err != nil {
    log.Fatal(err)
}

// Access individual revisions and objects
log.Printf("PDF has %d revisions, %d objects", doc.RevisionCount(), doc.ObjectCount())

// Get raw bytes of any object
obj := doc.GetObject(1)
log.Printf("Object 1: %d bytes", len(obj.RawBytes))

// Stream objects have parsed components
if obj.IsStream {
    log.Printf("Dictionary: %s", string(obj.DictRaw))
    log.Printf("Stream data: %d bytes", len(obj.StreamRaw))
}

// Reconstruct the PDF (byte-identical to original)
reconstructed := doc.Bytes()
// bytes.Equal(reconstructed, pdfBytes) == true
Build XFA PDF from XML Streams
import "github.com/benedoc-inc/pdfer/core/write"

builder := write.NewXFABuilder(false)

streams := []write.XFAStreamData{
    {Name: "template", Data: templateXML, Compress: true},
    {Name: "datasets", Data: datasetsXML, Compress: true},
    {Name: "config", Data: configXML, Compress: true},
}

pdfBytes, err := builder.BuildFromXFA(streams)
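
The README does not say which filter Compress: true selects. Assuming it means FlateDecode, which is the usual choice for XFA streams, the sketch below shows the corresponding round trip with the standard library: FlateDecode data is a zlib stream. The helper names flateEncode and flateDecode are illustrative and not part of pdfer.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// flateEncode compresses data into a zlib stream, the format that a
// FlateDecode filter expects.
func flateEncode(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	// Close flushes the remaining compressed data and the checksum.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// flateDecode decompresses a zlib stream.
func flateDecode(data []byte) ([]byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	datasets := bytes.Repeat([]byte("<FirstName>John</FirstName>"), 50)
	packed, err := flateEncode(datasets)
	if err != nil {
		panic(err)
	}
	unpacked, err := flateDecode(packed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("raw %d bytes, compressed %d bytes, round trip ok: %v\n",
		len(datasets), len(packed), bytes.Equal(datasets, unpacked))
}
```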

Package Structure

github.com/benedoc-inc/pdfer/
├── pdfer.go         # Root package with type aliases
├── core/            # Foundation layer
│   ├── parse/       # PDF parsing (reading structure)
│   ├── write/       # PDF writing (creating/modifying)
│   └── encrypt/     # Encryption/decryption
├── forms/           # Form processing (unified domain)
│   ├── forms.go     # Unified form interface
│   ├── acroform/    # AcroForm implementation
│   └── xfa/         # XFA implementation
├── content/         # Content operations
│   └── extract/     # Content extraction
├── resources/       # Embeddable resources
│   └── font/        # Font embedding
├── types/           # Shared data structures
├── cmd/pdfer/       # CLI tool
└── examples/        # Usage examples

Type Aliases

For convenience, common types are re-exported from the root package:

import "github.com/benedoc-inc/pdfer"

var enc *pdfer.Encryption      // = types.PDFEncryption
var form *pdfer.FormSchema     // = types.FormSchema
var q pdfer.Question           // = types.Question
var data pdfer.FormData        // = types.FormData

Supported PDF Features

Encryption
  • RC4 40-bit (V1)
  • RC4 128-bit (V2)
  • AES-128 (V4)
  • AES-256 (V5)
  • User password
  • Owner password
PDF Structure
  • Cross-reference tables
  • Cross-reference streams
  • Object streams (ObjStm)
  • FlateDecode filter
  • ASCIIHexDecode filter
  • ASCII85Decode filter
  • RunLengthDecode filter
  • PNG predictor filters
  • Image embedding (JPEG/PNG)
  • Page content streams
  • Incremental updates
  • Linearized PDFs
Content Extraction
  • Text extraction
  • Graphics extraction
  • Image extraction
  • Font extraction
  • Annotation extraction
  • Bookmark extraction
  • Metadata extraction
  • JSON serialization
XFA Forms
  • Template extraction
  • Datasets extraction
  • Config extraction
  • LocaleSet extraction
  • Form field parsing
  • Validation rules
  • Calculation rules
  • Field value update
  • PDF rebuild
  • Dynamic XFA (⚠️ Limited)
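
The PNG predictor filters mentioned under PDF Structure are commonly applied to cross-reference streams before FlateDecode. As a hedged, standalone sketch (not pdfer's code; unpredictPNG and its parameters are illustrative), the following reverses the five PNG row filters. Each row starts with a filter-type byte, rowLen is the row length without that byte, and bpp is the number of bytes per pixel.

```go
package main

import (
	"errors"
	"fmt"
)

// unpredictPNG reverses PNG row filters (None, Sub, Up, Average, Paeth).
func unpredictPNG(data []byte, rowLen, bpp int) ([]byte, error) {
	if rowLen <= 0 || bpp <= 0 || len(data)%(rowLen+1) != 0 {
		return nil, errors.New("png predictor: bad dimensions")
	}
	out := make([]byte, 0, len(data)/(rowLen+1)*rowLen)
	prev := make([]byte, rowLen) // the row above the first row is all zeros
	for off := 0; off < len(data); off += rowLen + 1 {
		ft := data[off]
		cur := make([]byte, rowLen)
		copy(cur, data[off+1:off+1+rowLen])
		for i := range cur {
			var left, upLeft byte
			if i >= bpp {
				left, upLeft = cur[i-bpp], prev[i-bpp]
			}
			up := prev[i]
			switch ft {
			case 0: // None: bytes are stored as-is
			case 1: // Sub: add the byte to the left
				cur[i] += left
			case 2: // Up: add the byte above
				cur[i] += up
			case 3: // Average: add the mean of left and up
				cur[i] += byte((int(left) + int(up)) / 2)
			case 4: // Paeth: add the closest of left, up, upper-left
				cur[i] += paeth(left, up, upLeft)
			default:
				return nil, fmt.Errorf("png predictor: unknown filter type %d", ft)
			}
		}
		out = append(out, cur...)
		prev = cur
	}
	return out, nil
}

// paeth picks whichever of a (left), b (up), c (upper-left) is closest
// to the linear estimate a + b - c.
func paeth(a, b, c byte) byte {
	p := int(a) + int(b) - int(c)
	pa, pb, pc := abs(p-int(a)), abs(p-int(b)), abs(p-int(c))
	if pa <= pb && pa <= pc {
		return a
	}
	if pb <= pc {
		return b
	}
	return c
}

func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

func main() {
	// Two 4-byte rows, both using the Up filter (type 2).
	encoded := []byte{2, 1, 0, 0, 10, 2, 0, 0, 0, 5}
	decoded, err := unpredictPNG(encoded, 4, 1)
	fmt.Println(decoded, err)
}
```

For a cross-reference stream, rowLen would typically come from /Columns in the stream's /DecodeParms, with bpp equal to 1.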

Implementation Status

See GAPS.md for detailed implementation status and contribution opportunities.

High Priority Gaps
  • Incremental updates - Parse PDFs with multiple revisions ✅
  • Font embedding - TrueType/OpenType font subsetting ✅
  • Image embedding - JPEG, PNG image objects ✅
  • Page content streams - Text and graphics operators ✅
  • AES-256 full support - Complete V5/R6 encryption ✅
Not Planned
  • Dynamic XFA rendering (requires full layout engine)
  • Script execution (FormCalc/JavaScript)
  • Digital signatures (complex PKI requirements)

Testing

go test ./...

Run with verbose output:

go test -v ./...

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE for details.

Acknowledgments

This library's PDF parsing approach is inspired by pypdf, implementing the "parse-then-decrypt" strategy for handling encrypted PDFs with object streams.

Documentation

Overview

Package pdfer provides pure Go PDF processing with comprehensive XFA support.

This is a zero-dependency PDF library that can:

  • Decrypt PDFs (RC4 and AES encryption)
  • Parse PDF structure (xref, objects, streams)
  • Extract and modify XFA forms
  • Create PDFs from scratch

Quick Start

Extract XFA from an encrypted PDF:

import "github.com/benedoc-inc/pdfer"
import "github.com/benedoc-inc/pdfer/core/encrypt"
import "github.com/benedoc-inc/pdfer/forms/xfa"

// Decrypt
_, encInfo, _ := encrypt.DecryptPDF(pdfBytes, password, false)

// Extract XFA
streams, _ := xfa.ExtractAllXFAStreams(pdfBytes, encInfo, false)

Packages

  • core/encrypt: PDF decryption (RC4, AES-128, AES-256)
  • core/parse: Low-level PDF parsing
  • types: Common data structures
  • core/write: PDF creation and modification
  • forms/xfa: XFA form processing

Functions

func Version

func Version() string

Version returns the library version.

Types

type Encryption

type Encryption = types.PDFEncryption

Encryption holds PDF encryption parameters and derived keys.

type FormData

type FormData = types.FormData

FormData is a map of field names to values for form filling.

type FormSchema

type FormSchema = types.FormSchema

FormSchema represents a parsed XFA form structure.

type Question

type Question = types.Question

Question represents a single form field.

type Rule

type Rule = types.Rule

Rule represents a validation or calculation rule.

type XFAConfig

type XFAConfig = types.XFAConfig

XFAConfig represents parsed XFA configuration.

type XFADatasets

type XFADatasets = types.XFADatasets

XFADatasets represents parsed XFA datasets.

type XFALocaleSet

type XFALocaleSet = types.XFALocaleSet

XFALocaleSet represents parsed XFA localization data.

Directories

Path                                  Synopsis
cmd/pdfer (command)
content/extract                       Package extraction provides comprehensive PDF content extraction. This package extracts all content types from PDFs into structured data models that can be serialized to JSON.
core/parse                            Package parser provides PDF parsing functionality.
core/write                            Package writer provides PDF writing capabilities including page content streams.
examples/acroform_create (command)    Example: Create a PDF with AcroForm fields
examples/acroform_extract (command)   Example: Extract AcroForm fields from a PDF
examples/acroform_fill (command)      Example: Fill AcroForm fields in a PDF (including object stream support)
examples/create_pdf (command)         Example: Create a simple PDF from scratch
examples/extract_xfa (command)        Example: Extract XFA from an encrypted PDF
examples/font_embedding (command)     Example: Create a PDF with an embedded TrueType font
forms                                 Package forms provides a unified interface for working with PDF forms. It supports both AcroForm and XFA form types with automatic detection.
forms/acroform                        Package acroform provides action support for form fields
forms/xfa
resources/font                        Package font provides TrueType/OpenType font embedding for PDFs
scripts (command)