pdfer

Version: v0.8.0 (latest). Published: Jan 11, 2026. License: MIT. Imports: 1. Imported by: 0.

Repository

github.com/benedoc-inc/pdfer

README

pdfer

A pure Go library for PDF processing with comprehensive XFA (XML Forms Architecture) support.

Features

Pure Go - No CGO, no external dependencies
Unified API - Clean parse.Open() entry point for all PDF operations
Unified Forms - Auto-detect and work with AcroForm or XFA forms
PDF Decryption - RC4 (40/128-bit) and AES (128/256-bit)
XFA Processing - Extract, parse, modify, and rebuild XFA forms
PDF Generation - Create PDFs from scratch with text, graphics, and images
Image Embedding - JPEG and PNG images with alpha channel support
Content Streams - Full text and graphics operators for page content
Object Streams - Full support for compressed object storage
Cross-Reference Streams - Parse modern PDF xref streams with predictor filters
Stream Filters - FlateDecode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode
Incremental Updates - Parse PDFs with multiple revisions, follow /Prev chains
Byte-Perfect Parsing - Preserve exact bytes for reconstruction of original PDF

Installation

go get github.com/benedoc-inc/pdfer

Quick Start

Open and Parse a PDF (Unified API)

import "github.com/benedoc-inc/pdfer/core/parse"

// Open a PDF
pdf, err := parse.Open(pdfBytes)
if err != nil {
    log.Fatal(err)
}

// Get basic info
log.Printf("Version: %s", pdf.Version())
log.Printf("Objects: %d", pdf.ObjectCount())
log.Printf("Revisions: %d", pdf.RevisionCount())

// Get an object
obj, err := pdf.GetObject(1)
if err != nil {
    log.Fatal(err)
}
log.Printf("Object 1: %s", string(obj))

// List all objects
for _, num := range pdf.Objects() {
    log.Printf("Object %d exists", num)
}
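
For orientation, the version string of a PDF comes from its `%PDF-x.y` header line. The sketch below reads that header using only the standard library. `headerVersion` is a hypothetical helper written for this illustration and is not part of pdfer; whether `pdf.Version()` uses the header alone is an assumption, since the document catalog of a PDF may also carry a `/Version` entry.

```go
package main

import (
	"bytes"
	"fmt"
)

// headerVersion returns the version found in a "%PDF-x.y" header,
// or "" when no header is present.
// Hypothetical helper for illustration; not part of pdfer.
func headerVersion(pdf []byte) string {
	const marker = "%PDF-"
	i := bytes.Index(pdf, []byte(marker))
	if i < 0 {
		return ""
	}
	rest := pdf[i+len(marker):]
	end := 0
	// Accept digits and dots only, e.g. "1.7" or "2.0".
	for end < len(rest) && (rest[end] == '.' || (rest[end] >= '0' && rest[end] <= '9')) {
		end++
	}
	return string(rest[:end])
}

func main() {
	fmt.Println(headerVersion([]byte("%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))) // prints 1.7
}
```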

Open an Encrypted PDF

pdf, err := parse.OpenWithOptions(pdfBytes, parse.ParseOptions{
    Password: []byte("secret"),
    Verbose:  true,
})
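
As background on what a password means for RC4 and AES-128 documents: the PDF standard security handler (revisions 2 to 4) pads or truncates the password to exactly 32 bytes with a fixed padding string before deriving a key. The sketch below shows only that padding step; `padPassword` is a hypothetical helper, not a pdfer function, and AES-256 (revisions 5 and 6) uses a different scheme. With an empty password the result is the padding string itself.

```go
package main

import "fmt"

// passwordPadding is the fixed 32-byte padding string defined by the
// PDF specification for the standard security handler.
var passwordPadding = [32]byte{
	0x28, 0xBF, 0x4E, 0x5E, 0x4E, 0x75, 0x8A, 0x41,
	0x64, 0x00, 0x4E, 0x56, 0xFF, 0xFA, 0x01, 0x08,
	0x2E, 0x2E, 0x00, 0xB6, 0xD0, 0x68, 0x3E, 0x80,
	0x2F, 0x0C, 0xA9, 0xFE, 0x64, 0x53, 0x69, 0x7A,
}

// padPassword truncates the password to 32 bytes, or fills the
// remainder with the start of the padding string.
// Hypothetical helper for illustration; not part of pdfer.
func padPassword(password []byte) [32]byte {
	var out [32]byte
	n := copy(out[:], password)
	copy(out[n:], passwordPadding[:])
	return out
}

func main() {
	empty := padPassword(nil)
	fmt.Println(empty == passwordPadding) // prints true

	padded := padPassword([]byte("secret"))
	fmt.Printf("%q followed by %d padding bytes\n", padded[:6], len(padded)-6)
}
```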

Byte-Perfect PDF Parsing

// Parse preserving exact bytes for reconstruction
pdf, err := parse.OpenWithOptions(pdfBytes, parse.ParseOptions{
    BytePerfect: true,
})

// Reconstruct identical PDF
reconstructed := pdf.Bytes()
// bytes.Equal(reconstructed, pdfBytes) == true

// Access raw object data with byte offsets
rawObj, _ := pdf.GetRawObject(1)
log.Printf("Object at offset %d, raw bytes: %d", rawObj.Offset, len(rawObj.RawBytes))

Extract and Fill Forms (Unified Interface)

import (
    "github.com/benedoc-inc/pdfer/forms"
    "github.com/benedoc-inc/pdfer/types"
)

// Auto-detect and extract any form type (AcroForm or XFA)
form, err := forms.Extract(pdfBytes, password, false)
if err != nil {
    log.Fatal(err)
}

// Work with unified interface
schema := form.Schema()
log.Printf("Form type: %s, Fields: %d", form.Type(), len(schema.Questions))

// Fill the form
formData := types.FormData{
    "FirstName": "John",
    "LastName":  "Doe",
}
filled, err := form.Fill(pdfBytes, formData, password, false)

Extract XFA from an Encrypted PDF

package main

import (
    "log"
    "os"

    "github.com/benedoc-inc/pdfer"
    "github.com/benedoc-inc/pdfer/core/encrypt"
    "github.com/benedoc-inc/pdfer/forms/xfa"
)

func main() {
    // Read PDF
    pdfBytes, err := os.ReadFile("form.pdf")
    if err != nil {
        log.Fatal(err)
    }

    // Decrypt (empty password for many government forms)
    _, encryptInfo, err := encrypt.DecryptPDF(pdfBytes, []byte(""), false)
    if err != nil {
        log.Fatal(err)
    }

    // Extract XFA streams
    streams, err := xfa.ExtractAllXFAStreams(pdfBytes, encryptInfo, false)
    if err != nil {
        log.Fatal(err)
    }

    // Access template (form structure)
    log.Printf("Template: %d bytes", len(streams.Template.Data))
    
    // Access datasets (form data)
    log.Printf("Datasets: %d bytes", len(streams.Datasets.Data))
    
    // Use convenient type aliases from root package
    var form *pdfer.FormSchema
    form, err = xfa.ParseXFAForm(string(streams.Template.Data), false)
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("Found %d questions", len(form.Questions))
}

Parse XFA Form Structure

// Parse the template to get form fields
form, err := xfa.ParseXFAForm(string(streams.Template.Data), false)
if err != nil {
    log.Fatal(err)
}

log.Printf("Found %d questions", len(form.Questions))
for _, q := range form.Questions {
    log.Printf("  %s: %s (%s)", q.ID, q.Label, q.Type)
}
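
To make the input of this step more concrete: an XFA template is XML in which each form field is a `field` element with a `name` attribute. The standard-library sketch below lists those names. It is only an illustration of the template format, not how `xfa.ParseXFAForm` is implemented; `fieldNames` and the sample template (including the `form1` subform) are hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// fieldNames returns the name attribute of every <field> element.
// Hypothetical helper for illustration; not part of pdfer.
func fieldNames(template string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(template))
	var names []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return nil, err
		}
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Local != "field" {
			continue
		}
		for _, attr := range start.Attr {
			if attr.Name.Local == "name" {
				names = append(names, attr.Value)
			}
		}
	}
}

func main() {
	// Made-up template for the example.
	template := `<template xmlns="http://www.xfa.org/schema/xfa-template/3.0/">
  <subform name="form1">
    <field name="FirstName"><ui><textEdit/></ui></field>
    <field name="LastName"><ui><textEdit/></ui></field>
  </subform>
</template>`

	names, err := fieldNames(template)
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // prints [FirstName LastName]
}
```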

Update Form Field Values

// Create form data (using type alias from root package)
formData := pdfer.FormData{
    "FirstName": "John",
    "LastName":  "Doe",
    "Date":      "2024-01-15",
}

// Update XFA in PDF
updatedPDF, err := xfa.UpdateXFAInPDF(pdfBytes, formData, encryptInfo, false)
if err != nil {
    log.Fatal(err)
}

if err := os.WriteFile("filled.pdf", updatedPDF, 0644); err != nil {
    log.Fatal(err)
}
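
For context on what a field update touches: in XFA, field values live in the datasets packet as plain XML under `xfa:data`, with one element per field. The sketch below reads the leaf values out of such a packet with the standard library. It illustrates the data layout only and makes no claim about how `xfa.UpdateXFAInPDF` works internally; `leafValues` and the sample packet (including the `form1` root) are hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// leafValues returns the trimmed text of every element that has no
// child elements, keyed by the element's local name.
// Hypothetical helper for illustration; not part of pdfer.
func leafValues(doc string) (map[string]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	values := make(map[string]string)
	var text strings.Builder
	sawChild := false // true once a child element has closed inside the current element
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return values, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			text.Reset()
			sawChild = false
		case xml.CharData:
			text.Write(t)
		case xml.EndElement:
			if !sawChild {
				values[t.Name.Local] = strings.TrimSpace(text.String())
			}
			sawChild = true
			text.Reset()
		}
	}
}

func main() {
	// Made-up datasets packet for the example.
	datasets := `<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
  <xfa:data>
    <form1>
      <FirstName>John</FirstName>
      <LastName>Doe</LastName>
    </form1>
  </xfa:data>
</xfa:datasets>`

	values, err := leafValues(datasets)
	if err != nil {
		panic(err)
	}
	fmt.Println(values["FirstName"], values["LastName"]) // prints John Doe
}
```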

Create a PDF from Scratch

import "github.com/benedoc-inc/pdfer/core/write"

// Create a simple PDF with text and graphics
builder := write.NewSimplePDFBuilder()

// Add a page
page := builder.AddPage(write.PageSizeLetter)

// Add a font and get its resource name
fontName := page.AddStandardFont("Helvetica")

// Draw content
page.Content().
    // Add text
    BeginText().
    SetFont(fontName, 24).
    SetTextPosition(72, 720).
    ShowText("Hello, PDF World!").
    EndText().
    // Draw a red rectangle
    SetFillColorRGB(1, 0, 0).
    Rectangle(72, 650, 200, 50).
    Fill()

builder.FinalizePage(page)

// Generate PDF bytes
pdfBytes, err := builder.Bytes()
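
For readers new to content streams, the sketch below writes out the raw PDF operators that a text-plus-rectangle page like the one above corresponds to in PDF syntax (`BT`/`ET`, `Tf`, `Td`, `Tj`, `rg`, `re`, `f`). The one-to-one mapping from the builder methods to these operators is an assumption; the operators themselves are standard PDF. `textAndRect` and `escapeLiteral` are hypothetical helpers, not pdfer functions.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeLiteral escapes backslashes and parentheses so the text can be
// placed inside a PDF literal string "( ... )".
func escapeLiteral(s string) string {
	return strings.NewReplacer(`\`, `\\`, `(`, `\(`, `)`, `\)`).Replace(s)
}

// textAndRect returns a content stream that shows text at (72, 720)
// and fills a red 200x50 rectangle at (72, 650).
// Hypothetical helper for illustration; not part of pdfer.
func textAndRect(font string, size float64, text string) string {
	var b strings.Builder
	b.WriteString("BT\n")                                 // begin text object
	fmt.Fprintf(&b, "/%s %g Tf\n", font, size)            // select font and size
	b.WriteString("72 720 Td\n")                          // move to text position
	fmt.Fprintf(&b, "(%s) Tj\n", escapeLiteral(text))     // show text
	b.WriteString("ET\n")                                 // end text object
	b.WriteString("1 0 0 rg\n")                           // fill color: red
	b.WriteString("72 650 200 50 re\n")                   // rectangle path
	b.WriteString("f\n")                                  // fill the path
	return b.String()
}

func main() {
	fmt.Print(textAndRect("F1", 24, "Hello, PDF World!"))
}
```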

Embed Images in a PDF

import "github.com/benedoc-inc/pdfer/core/write"

builder := write.NewSimplePDFBuilder()
page := builder.AddPage(write.PageSizeLetter)

// Add a JPEG image
jpegData, _ := os.ReadFile("photo.jpg")
imgInfo, err := builder.Writer().AddJPEGImage(jpegData, "Im1")
if err != nil {
    log.Fatal(err)
}

// Register image with page and draw it
imgName := page.AddImage(imgInfo)
page.Content().DrawImageAt(imgName, 72, 500, 200, 150)

builder.FinalizePage(page)
pdfBytes, _ := builder.Bytes()
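
In PDF syntax, an image XObject is painted into a 1x1 unit square, so placing it means scaling and translating the coordinate system first and then invoking `Do`. The sketch below prints the operators for an image at (72, 500) with size 200x150. That `DrawImageAt` emits exactly this sequence is an assumption; `drawImageOps` is a hypothetical helper, not a pdfer function.

```go
package main

import "fmt"

// drawImageOps returns content-stream operators that paint the named
// image XObject with its lower-left corner at (x, y) and size w x h.
// Hypothetical helper for illustration; not part of pdfer.
func drawImageOps(name string, x, y, w, h float64) string {
	// q/Q save and restore the graphics state so the matrix change
	// does not leak into later drawing.
	return fmt.Sprintf("q\n%g 0 0 %g %g %g cm\n/%s Do\nQ\n", w, h, x, y, name)
}

func main() {
	fmt.Print(drawImageOps("Im1", 72, 500, 200, 150))
}
```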

Extract All Content from a PDF

import "github.com/benedoc-inc/pdfer/content/extract"

// Extract all content (text, graphics, images, fonts, annotations)
doc, err := extract.ExtractContent(pdfBytes, nil, false)
if err != nil {
    log.Fatal(err)
}

// Access extracted content
log.Printf("Pages: %d", len(doc.Pages))
for _, page := range doc.Pages {
    log.Printf("  Page %d: %d text elements, %d graphics, %d images",
        page.PageNumber, len(page.Text), len(page.Graphics), len(page.Images))
    
    // Text with positioning and font info
    for _, text := range page.Text {
        log.Printf("    Text: '%s' at (%.2f, %.2f) font: %s size: %.2f",
            text.Text, text.X, text.Y, text.FontName, text.FontSize)
    }
    
    // Resources (fonts, images)
    if page.Resources != nil {
        log.Printf("    Fonts: %d, Images: %d", 
            len(page.Resources.Fonts), len(page.Resources.Images))
    }
    
    // Annotations
    for _, annot := range page.Annotations {
        log.Printf("    Annotation: %s", annot.Type)
    }
}

// Extract as JSON
jsonStr, err := extract.ExtractContentToJSON(pdfBytes, nil, false)
if err != nil {
    log.Fatal(err)
}
log.Printf("JSON: %s", jsonStr)

Extraction Flow:

ExtractContent()
  ├─→ ExtractMetadata() → Document info (title, author, dates)
  ├─→ ExtractPages() → For each page:
  │     ├─→ parseContentStream() → Text, graphics, image refs
  │     ├─→ extractResources() → Fonts, XObjects, images
  │     └─→ extractAnnotations() → Links, comments, highlights
  ├─→ ExtractBookmarks() → Document outline
  └─→ Aggregate → Unique fonts/images from all pages

Parse PDFs with Incremental Updates

import "github.com/benedoc-inc/pdfer/core/parse"

// Parse a PDF that has been edited multiple times
pdfBytes, _ := os.ReadFile("edited.pdf")

// Check how many revisions the PDF has
revisions := parse.CountRevisions(pdfBytes)
log.Printf("PDF has %d revisions", revisions)

// Parse all revisions and merge object tables
result, err := parse.ParseWithIncrementalUpdates(pdfBytes, false)
if err != nil {
    log.Fatal(err)
}

log.Printf("Found %d objects", len(result.Objects))

// Extract a specific revision (e.g., the original before edits)
originalPDF, _ := parse.ExtractRevision(pdfBytes, 1)
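
As a rough picture of what a revision is: each incremental update appends new objects, a new cross-reference section, a `startxref` offset, and a `%%EOF` marker to the end of the file. The sketch below counts `%%EOF` markers and reads the last `startxref` value with the standard library. This is a heuristic for illustration only (the marker can also occur inside stream data, and a real parser follows `/Prev` chains); both helpers are hypothetical, and the offsets in the sample are made up.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"strconv"
)

// countEOFMarkers counts "%%EOF" markers as a rough revision estimate.
// Hypothetical helper for illustration; not part of pdfer.
func countEOFMarkers(pdf []byte) int {
	return bytes.Count(pdf, []byte("%%EOF"))
}

// lastStartXref returns the offset written after the final "startxref"
// keyword, which points at the newest cross-reference section.
func lastStartXref(pdf []byte) (int, error) {
	const keyword = "startxref"
	i := bytes.LastIndex(pdf, []byte(keyword))
	if i < 0 {
		return 0, errors.New("startxref not found")
	}
	fields := bytes.Fields(pdf[i+len(keyword):])
	if len(fields) == 0 {
		return 0, errors.New("missing offset after startxref")
	}
	return strconv.Atoi(string(fields[0]))
}

func main() {
	// Skeleton of a file with one original revision and one update.
	pdf := []byte("%PDF-1.4\n1 0 obj\n<<>>\nendobj\nstartxref\n9\n%%EOF\n" +
		"1 0 obj\n<< /Edited true >>\nendobj\nstartxref\n52\n%%EOF\n")

	fmt.Println(countEOFMarkers(pdf)) // prints 2
	offset, err := lastStartXref(pdf)
	fmt.Println(offset, err) // prints 52 <nil>
}
```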

Byte-Perfect PDF Parsing (ParsePDFDocument)

import "github.com/benedoc-inc/pdfer/core/parse"

// Parse PDF with full byte preservation
pdfBytes, _ := os.ReadFile("document.pdf")
doc, err := parse.ParsePDFDocument(pdfBytes)
if err != nil {
    log.Fatal(err)
}

// Access individual revisions and objects
log.Printf("PDF has %d revisions, %d objects", doc.RevisionCount(), doc.ObjectCount())

// Get raw bytes of any object
obj := doc.GetObject(1)
log.Printf("Object 1: %d bytes", len(obj.RawBytes))

// Stream objects have parsed components
if obj.IsStream {
    log.Printf("Dictionary: %s", string(obj.DictRaw))
    log.Printf("Stream data: %d bytes", len(obj.StreamRaw))
}

// Reconstruct the PDF (byte-identical to original)
reconstructed := doc.Bytes()
// bytes.Equal(reconstructed, pdfBytes) == true
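
If the raw stream bytes are still encoded as stored in the file (an assumption about `StreamRaw`, which the text above does not spell out), a stream whose dictionary names `/FlateDecode` can be decoded with the standard library, because FlateDecode data is in zlib format. The sketch below round-trips some content through `compress/zlib`; `inflate` and `deflate` are hypothetical helpers, not pdfer functions, and predictor parameters are not handled.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// inflate decodes zlib-compressed stream data (FlateDecode without predictors).
// Hypothetical helper for illustration; not part of pdfer.
func inflate(raw []byte) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(raw))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

// deflate produces zlib-compressed data, as a PDF writer would store it.
func deflate(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	content := []byte("BT /F1 12 Tf 72 720 Td (Hello) Tj ET")
	packed, err := deflate(content)
	if err != nil {
		panic(err)
	}
	unpacked, err := inflate(packed)
	if err != nil {
		panic(err)
	}
	fmt.Println(bytes.Equal(content, unpacked)) // prints true
}
```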

Build XFA PDF from XML Streams

builder := write.NewXFABuilder(false)

streams := []write.XFAStreamData{
    {Name: "template", Data: templateXML, Compress: true},
    {Name: "datasets", Data: datasetsXML, Compress: true},
    {Name: "config", Data: configXML, Compress: true},
}

pdfBytes, err := builder.BuildFromXFA(streams)

Package Structure

github.com/benedoc-inc/pdfer/
├── pdfer.go         # Root package with type aliases
├── core/            # Foundation layer
│   ├── parse/       # PDF parsing (reading structure)
│   ├── write/       # PDF writing (creating/modifying)
│   └── encrypt/     # Encryption/decryption
├── forms/           # Form processing (unified domain)
│   ├── forms.go     # Unified form interface
│   ├── acroform/    # AcroForm implementation
│   └── xfa/         # XFA implementation
├── content/         # Content operations
│   └── extract/     # Content extraction
├── resources/       # Embeddable resources
│   └── font/        # Font embedding
├── types/           # Shared data structures
├── cmd/pdfer/       # CLI tool
└── examples/        # Usage examples

Type Aliases

For convenience, common types are re-exported from the root package:

import "github.com/benedoc-inc/pdfer"

var enc *pdfer.Encryption      // = types.PDFEncryption
var form *pdfer.FormSchema     // = types.FormSchema
var q pdfer.Question           // = types.Question
var data pdfer.FormData        // = types.FormData
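
These are Go type aliases (`type A = B`), so the root-package name and the `types` name denote the same type and values pass between them without conversion. The self-contained sketch below shows that behavior with stand-in types; the underlying `map[string]string` is an assumption made for the example and may not match the actual definition of `types.FormData`.

```go
package main

import "fmt"

// formData stands in for types.FormData. Its underlying type here is
// an assumption for the example.
type formData map[string]string

// FormData is an alias: it names the same type as formData rather
// than defining a new one.
type FormData = formData

// countFields accepts the original type.
func countFields(d formData) int {
	return len(d)
}

func main() {
	data := FormData{"FirstName": "John", "LastName": "Doe"}
	// No conversion is needed because FormData and formData are identical types.
	fmt.Println(countFields(data)) // prints 2
}
```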

Supported PDF Features

Encryption

Feature	Status
RC4 40-bit (V1)	✅
RC4 128-bit (V2)	✅
AES-128 (V4)	✅
AES-256 (V5)	✅
User password	✅
Owner password	✅

PDF Structure

Feature	Status
Cross-reference tables	✅
Cross-reference streams	✅
Object streams (ObjStm)	✅
FlateDecode filter	✅
ASCIIHexDecode filter	✅
ASCII85Decode filter	✅
RunLengthDecode filter	✅
PNG predictor filters	✅
Image embedding (JPEG/PNG)	✅
Page content streams	✅
Incremental updates	✅
Linearized PDFs	❌
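
For reference, a classic cross-reference table stores one fixed-width entry per object: a 10-digit number, a 5-digit generation number, and `n` (in use) or `f` (free). The sketch below parses a single entry with the standard library. It illustrates the file format only and is not pdfer's parser; `xrefEntry` and `parseXrefEntry` are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// xrefEntry is one row of a classic cross-reference table.
// For in-use entries Offset is the byte offset of the object; for free
// entries the same field holds the next free object number.
type xrefEntry struct {
	Offset     int64
	Generation int
	InUse      bool
}

// parseXrefEntry parses a line such as "0000000017 00000 n".
// Hypothetical helper for illustration; not part of pdfer.
func parseXrefEntry(line string) (xrefEntry, error) {
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return xrefEntry{}, fmt.Errorf("want 3 fields, got %d", len(fields))
	}
	offset, err := strconv.ParseInt(fields[0], 10, 64)
	if err != nil {
		return xrefEntry{}, fmt.Errorf("offset: %w", err)
	}
	gen, err := strconv.Atoi(fields[1])
	if err != nil {
		return xrefEntry{}, fmt.Errorf("generation: %w", err)
	}
	switch fields[2] {
	case "n":
		return xrefEntry{Offset: offset, Generation: gen, InUse: true}, nil
	case "f":
		return xrefEntry{Offset: offset, Generation: gen, InUse: false}, nil
	default:
		return xrefEntry{}, fmt.Errorf("unknown entry type %q", fields[2])
	}
}

func main() {
	entry, err := parseXrefEntry("0000000017 00000 n ")
	fmt.Printf("%+v %v\n", entry, err)
}
```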

Content Extraction

Feature	Status
Text extraction	✅
Graphics extraction	✅
Image extraction	✅
Font extraction	✅
Annotation extraction	✅
Bookmark extraction	✅
Metadata extraction	✅
JSON serialization	✅

XFA Forms

Feature	Status
Template extraction	✅
Datasets extraction	✅
Config extraction	✅
LocaleSet extraction	✅
Form field parsing	✅
Validation rules	✅
Calculation rules	✅
Field value update	✅
PDF rebuild	✅
Dynamic XFA	⚠️ Limited

Implementation Status

See GAPS.md for detailed implementation status and contribution opportunities.

High Priority Gaps

Incremental updates - Parse PDFs with multiple revisions ✅
Font embedding - TrueType/OpenType font subsetting ✅
Image embedding - JPEG, PNG image objects ✅
Page content streams - Text and graphics operators ✅
AES-256 full support - Complete V5/R6 encryption ✅

Not Planned

Dynamic XFA rendering (requires full layout engine)
Script execution (FormCalc/JavaScript)
Digital signatures (complex PKI requirements)

Testing

go test ./...

Run with verbose output:

go test -v ./...

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE for details.

Acknowledgments

This library's PDF parsing approach is inspired by pypdf, implementing the "parse-then-decrypt" strategy for handling encrypted PDFs with object streams.

Documentation

Overview

Package pdfer provides pure Go PDF processing with comprehensive XFA support.

This is a zero-dependency PDF library that can:

Decrypt PDFs (RC4 and AES encryption)
Parse PDF structure (xref, objects, streams)
Extract and modify XFA forms
Create PDFs from scratch

Quick Start

Extract XFA from an encrypted PDF:

import "github.com/benedoc-inc/pdfer"
import "github.com/benedoc-inc/pdfer/core/encrypt"
import "github.com/benedoc-inc/pdfer/forms/xfa"

// Decrypt
_, encInfo, _ := encrypt.DecryptPDF(pdfBytes, password, false)

// Extract XFA
streams, _ := xfa.ExtractAllXFAStreams(pdfBytes, encInfo, false)

Packages

core/encrypt: PDF decryption (RC4, AES-128, AES-256)
core/parse: Low-level PDF parsing
types: Common data structures
core/write: PDF creation and modification
forms/xfa: XFA form processing

Functions

func Version

func Version() string

Version returns the library version.

Types

type Encryption

type Encryption = types.PDFEncryption

Encryption holds PDF encryption parameters and derived keys.

type FormData

type FormData = types.FormData

FormData is a map of field names to values for form filling.

type FormSchema

type FormSchema = types.FormSchema

FormSchema represents a parsed XFA form structure.

type Question

type Question = types.Question

Question represents a single form field.

type Rule

type Rule = types.Rule

Rule represents a validation or calculation rule.

type XFAConfig

type XFAConfig = types.XFAConfig

XFAConfig represents parsed XFA configuration.

type XFADatasets

type XFADatasets = types.XFADatasets

XFADatasets represents parsed XFA datasets.

type XFALocaleSet

type XFALocaleSet = types.XFALocaleSet

XFALocaleSet represents parsed XFA localization data.

Source Files

pdfer.go

Directories

Path	Synopsis
cmd/pdfer	(command)
content/extract	Package extraction provides comprehensive PDF content extraction. This package extracts all content types from PDFs into structured data models that can be serialized to JSON.
core/encrypt	
core/parse	Package parser provides PDF parsing functionality.
core/write	Package writer provides PDF writing capabilities including page content streams
examples/acroform_create	(command) Example: Create a PDF with AcroForm fields
examples/acroform_extract	(command) Example: Extract AcroForm fields from a PDF
examples/acroform_fill	(command) Example: Fill AcroForm fields in a PDF (including object stream support)
examples/create_pdf	(command) Example: Create a simple PDF from scratch
examples/extract_xfa	(command) Example: Extract XFA from an encrypted PDF
examples/font_embedding	(command) Example: Create a PDF with an embedded TrueType font
forms	Package forms provides a unified interface for working with PDF forms. It supports both AcroForm and XFA form types with automatic detection.
forms/acroform	Package acroform provides action support for form fields
forms/xfa	
resources/font	Package font provides TrueType/OpenType font embedding for PDFs
tests/scripts	(command)
types	