pdftable

package module
v0.3.0 Latest
Published: May 27, 2026 License: MIT Imports: 13 Imported by: 0

README

pdftable

A Go-native port of Python's pdfplumber.

pdftable reads PDF documents, walks the content streams, and surfaces the positioned primitives — characters, lines, rectangles, curves — that higher-level layout algorithms (text extraction, word grouping, table detection) operate on. It is built on top of pdfcpu for low-level object parsing, xref handling, and FlateDecode decompression; everything above that (operator dispatch, text state, glyph positioning, ToUnicode CMaps, font encodings) is implemented here.

The library targets the gap in the Go PDF ecosystem: existing libraries either render PDFs to images, manipulate metadata, or extract bag-of-words text. None of them give you what pdfplumber gives Python users: a structured per-page object model you can run table-detection heuristics on. This is that.

Status

v0.3.0: full pdfplumber parity for table-finding strategies. All four canonical strategies are implemented: lines, lines_strict, text, and explicit. Mixing strategies per axis (e.g. vertical="text" + horizontal="lines") works as expected. This release also ships the pdftable CLI for extracting text and tables without writing Go.

Install

go get github.com/hallelx2/pdftable@v0.3.0

Requires Go 1.25+ and pdfcpu v0.12+. The Pages() iterator is built on the standard-library iter package (range-over-func, available since Go 1.23).

Quickstart

package main

import (
    "fmt"
    "log"

    "github.com/hallelx2/pdftable"
)

func main() {
    doc, err := pdftable.OpenFile("report.pdf")
    if err != nil {
        log.Fatal(err)
    }
    defer doc.Close()

    for n, page := range doc.Pages() {
        // Primitives (v0.0.1).
        chars, _ := page.Chars()
        rects, _ := page.Rects()
        lines, _ := page.Lines()
        fmt.Printf("page %d: %d chars, %d rects, %d lines\n",
            n, len(chars), len(rects), len(lines))

        // Words and text extraction (v0.1.0).
        words, _ := page.Words(pdftable.DefaultWordOpts())
        text, _ := page.ExtractText(pdftable.DefaultTextOpts())
        fmt.Printf("  %d words; first line: %q\n",
            len(words), firstLine(text))
    }
}

func firstLine(s string) string {
    for i, r := range s {
        if r == '\n' {
            return s[:i]
        }
    }
    return s
}

API surface

// Constructors.
func Open(r io.Reader) (Document, error)
func OpenBytes(b []byte) (Document, error)
func OpenFile(path string) (Document, error)

// Document.
type Document interface {
    NumPages() int
    Page(n int) (Page, error)              // 1-indexed
    Pages() iter.Seq2[int, Page]           // Go 1.23+ range-over-func
    Close() error
}

// Page.
type Page interface {
    Number() int
    Width() float64
    Height() float64
    Chars() ([]Char, error)
    Lines() ([]Line, error)
    Rects() ([]Rect, error)
    Curves() ([]Curve, error)
    Objects() (Objects, error)

    // New in v0.1.0: word + text extraction.
    Words(opts WordOpts) ([]Word, error)
    ExtractText(opts TextOpts) (string, error)
    ExtractTextSimple(xTolerance, yTolerance float64) (string, error)

    // Table finding: lines + lines_strict (v0.2.0); text + explicit (v0.3.0).
    FindTables(settings TableSettings) ([]TableFinder, error)
    ExtractTables(settings TableSettings) ([]*Table, error)
}

// Primitives.
type Char struct {
    Text                  string
    X0, Y0, X1, Y1        float64
    FontName              string
    FontSize              float64
    Upright               bool
    Advance               float64
}

type Line struct { X0, Y0, X1, Y1 float64; Stroke bool; Width float64 }

type Rect struct { X0, Y0, X1, Y1 float64; Stroke, Fill bool; Width float64 }

type Curve struct { Points [][2]float64; Stroke, Fill bool; Width float64 }

type Objects struct { Chars []Char; Lines []Line; Rects []Rect; Curves []Curve }

// Word (new in v0.1.0).
type Word struct {
    Text                string
    X0, Y0, X1, Y1      float64
    Upright             bool
    Direction           string // "ltr" | "rtl" | "ttb" | "btt"
    FontName            string
    FontSize            float64
    Chars               []Char // populated when WordOpts.KeepChars=true
}

// WordOpts: configure Page.Words. Use DefaultWordOpts() for pdfplumber-matching defaults.
type WordOpts struct {
    XTolerance         float64 // default 3
    YTolerance         float64 // default 3
    KeepBlankChars     bool
    UseTextFlow        bool
    HorizontalLTR      bool   // default true
    VerticalTTB        bool   // default true
    ExtraAttrs         []string
    SplitAtPunctuation bool
    Expand             bool   // ligature expansion; default true
    KeepChars          bool
}

// TextOpts: configure Page.ExtractText. Use DefaultTextOpts() for defaults.
type TextOpts struct {
    XTolerance, YTolerance       float64
    Layout                       bool
    LayoutWidthChars             int
    LayoutHeightChars            int
    XDensity, YDensity           float64 // PDF points per character / per line
    UseTextFlow                  bool
    HorizontalLTR                bool
    VerticalTTB                  bool
    ExtraAttrs                   []string
    Expand                       bool
}

// Sentinel errors.
var (
    ErrInvalidPDF     = errors.New("pdftable: invalid PDF")
    ErrPageOutOfRange = errors.New("pdftable: page out of range")
    ErrUnsupported    = errors.New("pdftable: unsupported feature")
    ErrEncrypted      = errors.New("pdftable: encrypted PDF (decryption not yet supported)")
)

Text extraction

doc, _ := pdftable.OpenFile("report.pdf")
defer doc.Close()
page, _ := doc.Page(1)

// Words: each Word is a contiguous text run.
words, _ := page.Words(pdftable.DefaultWordOpts())
for _, w := range words {
    fmt.Printf("%-20s @ (%.1f, %.1f) %s %.1fpt\n",
        w.Text, w.X0, w.Y0, w.FontName, w.FontSize)
}

// ExtractText: all text on the page as one string. Dense (no layout)
// joins words with spaces and lines with "\n".
text, _ := page.ExtractText(pdftable.DefaultTextOpts())
fmt.Println(text)

// Layout-preserving extraction emulates `pdftotext -layout` / pdfplumber's
// extract_text(layout=True) — column-aligned output suitable for forms.
opts := pdftable.DefaultTextOpts()
opts.Layout = true
laid, _ := page.ExtractText(opts)
fmt.Println(laid)

Tables

Page.ExtractTables is the table-detection entry point. It runs the edges → intersections → cells → tables pipeline (a direct port of pdfplumber's TableFinder) and returns one *Table per detected table, with cell text already extracted.

doc, _ := pdftable.OpenFile("invoice.pdf")
defer doc.Close()
page, _ := doc.Page(1)

settings := pdftable.DefaultTableSettings()
// settings.VerticalStrategy = pdftable.StrategyLinesStrict  // ignore rect outlines

tables, _ := page.ExtractTables(settings)
for ti, t := range tables {
    fmt.Printf("table %d: %d rows × %d cols at %+v\n",
        ti, len(t.Rows), len(t.Rows[0]), t.BBox)
    for _, row := range t.Rows {
        fmt.Println(row)
    }
}

TableSettings defaults match pdfplumber's (snap_tolerance=3, join_tolerance=3, edge_min_length=3, intersection_tolerance=3, text_tolerance=3, min_words_vertical=3, min_words_horizontal=1). Override any field on the value returned from DefaultTableSettings() to tighten or loosen the heuristics.

The four implemented strategies (one per axis, chosen independently):

  • StrategyLines — edges come from drawn Line segments, Rect outlines (all four sides), and axis-aligned Curve segments. Default. Best for typical PDFs whose tables have rule lines.
  • StrategyLinesStrict — only drawn Line segments are used. Use this when your PDF draws cell BACKGROUNDS as filled rectangles that you do NOT want treated as row boundaries.
  • StrategyText — edges inferred from word alignment. Vertical edges come from clusters of words sharing X0 / X1 / centre; horizontal edges from clusters sharing top-Y. Tunable via MinWordsVertical (default 3) and MinWordsHorizontal (default 1).
  • StrategyExplicit — caller-supplied edges via ExplicitVerticalLines / ExplicitHorizontalLines. Required when table boundaries are known from layout analysis or manual annotation.
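
The word-alignment idea behind StrategyText can be illustrated with a small 1-D clustering sketch. This is a simplified illustration, not the library's implementation: the function name clusterEdges, its signature, and the choice of the cluster mean as the edge position are all assumptions made for the example. It groups positions whose sorted neighbours lie within a tolerance and keeps only groups with at least a minimum number of members, which is the role MinWordsVertical plays for column edges.

```go
package main

import (
	"fmt"
	"sort"
)

// clusterEdges (hypothetical helper) groups positions whose sorted
// neighbours are within tol of each other, and returns the mean of
// every group that has at least minCount members.
func clusterEdges(xs []float64, tol float64, minCount int) []float64 {
	if len(xs) == 0 {
		return nil
	}
	// Work on a sorted copy so the caller's slice is left untouched.
	sorted := append([]float64(nil), xs...)
	sort.Float64s(sorted)

	var edges []float64
	start := 0
	for i := 1; i <= len(sorted); i++ {
		// Still inside the current cluster: keep scanning.
		if i < len(sorted) && sorted[i]-sorted[i-1] <= tol {
			continue
		}
		// Cluster sorted[start:i] is complete; keep it if large enough.
		if n := i - start; n >= minCount {
			sum := 0.0
			for _, v := range sorted[start:i] {
				sum += v
			}
			edges = append(edges, sum/float64(n))
		}
		start = i
	}
	return edges
}

func main() {
	// Left edges (X0) of words on a page: two columns plus one stray word.
	x0s := []float64{72, 72.5, 71.8, 200, 200.2, 199.9, 340}
	fmt.Println(clusterEdges(x0s, 3, 3))
}
```

With a minimum of 3 words per cluster, the stray word at x=340 does not produce an edge, while the two aligned columns do.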

Side-by-side: pdfplumber → pdftable (lines strategy)

# Python (pdfplumber)
import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    for table in page.find_tables({"vertical_strategy": "lines",
                                    "horizontal_strategy": "lines"}):
        for row in table.extract():
            print(row)

// Go (pdftable)
import "github.com/hallelx2/pdftable"

doc, _ := pdftable.OpenFile("invoice.pdf")
defer doc.Close()
page, _ := doc.Page(1)

settings := pdftable.DefaultTableSettings()
settings.VerticalStrategy = pdftable.StrategyLines
settings.HorizontalStrategy = pdftable.StrategyLines

tables, _ := page.ExtractTables(settings)
for _, t := range tables {
    for _, row := range t.Rows {
        fmt.Println(row)
    }
}

Side-by-side: pdfplumber → pdftable (text strategy)

# Python (pdfplumber) — borderless tables
import pdfplumber

with pdfplumber.open("10k-filing.pdf") as pdf:
    page = pdf.pages[3]
    for table in page.find_tables({"vertical_strategy": "text",
                                    "horizontal_strategy": "text",
                                    "min_words_vertical": 3}):
        for row in table.extract():
            print(row)

// Go (pdftable)
doc, _ := pdftable.OpenFile("10k-filing.pdf")
defer doc.Close()
page, _ := doc.Page(4)

settings := pdftable.DefaultTableSettings()
settings.VerticalStrategy = pdftable.StrategyText
settings.HorizontalStrategy = pdftable.StrategyText
settings.MinWordsVertical = 3

tables, _ := page.ExtractTables(settings)
for _, t := range tables {
    for _, row := range t.Rows {
        fmt.Println(row)
    }
}

Side-by-side: pdfplumber → pdftable (explicit strategy)

# Python (pdfplumber) — caller-supplied edges
import pdfplumber

with pdfplumber.open("statement.pdf") as pdf:
    page = pdf.pages[0]
    table = page.find_tables({
        "vertical_strategy": "explicit",
        "horizontal_strategy": "explicit",
        "explicit_vertical_lines":   [100, 200, 300, 400],
        "explicit_horizontal_lines": [600, 650, 700, 720],
    })[0]
    for row in table.extract():
        print(row)

// Go (pdftable)
doc, _ := pdftable.OpenFile("statement.pdf")
defer doc.Close()
page, _ := doc.Page(1)

settings := pdftable.DefaultTableSettings()
settings.VerticalStrategy = pdftable.StrategyExplicit
settings.HorizontalStrategy = pdftable.StrategyExplicit
settings.ExplicitVerticalLines   = []float64{100, 200, 300, 400}
settings.ExplicitHorizontalLines = []float64{600, 650, 700, 720}

tables, _ := page.ExtractTables(settings)
for _, row := range tables[0].Rows {
    fmt.Println(row)
}

Mixed strategies

Each axis picks its strategy independently. Combinations like vertical=text + horizontal=lines (common for tables with drawn row separators but borderless columns) work out of the box:

settings := pdftable.DefaultTableSettings()
settings.VerticalStrategy   = pdftable.StrategyText
settings.HorizontalStrategy = pdftable.StrategyLines
tables, _ := page.ExtractTables(settings)

The pdfplumber and pdftable outputs match cell-for-cell on the parity fixtures (see testdata/golden/*.tables-text.expected.json and *.tables.expected.json for the regression goldens). Field naming differs in the obvious places: pdftable returns a slice of *Table with rows already extracted, instead of Table objects you have to call .extract() on; rows are []string instead of list[Optional[str]] (missing cells produce "" rather than None); and table bboxes use (X0, Y0, X1, Y1) in PDF user space rather than pdfplumber's image-space (x0, top, x1, bottom).

CLI

pdftable ships a command-line interface that mirrors pdfplumber's CLI surface for the operations the library implements:

go install github.com/hallelx2/pdftable/cmd/pdftable@v0.3.0

Usage:

# Extract every table on every page as JSON.
pdftable extract invoice.pdf --tables --format json

# Borderless tables: use the text strategy.
pdftable extract 10k.pdf --tables \
    --vertical-strategy text --horizontal-strategy text \
    --min-words-vertical 4

# Extract text only (no table detection).
pdftable extract report.pdf --text --format text

# Subset of pages, pretty-printed JSON.
pdftable extract report.pdf --tables --pages 1,3-5 --indent 2

# Caller-supplied edges.
pdftable extract statement.pdf --tables \
    --vertical-strategy explicit --horizontal-strategy explicit \
    --explicit-vertical-lines 100,200,300,400 \
    --explicit-horizontal-lines 600,650,700,720

Flags:

Flag                         Default  Description
--pages                      all      Pages: 1,3-5 syntax.
--tables                     off      Output detected tables.
--text                       off      Output extracted text.
--format                     json     json | text.
--vertical-strategy          lines    lines | lines_strict | text | explicit.
--horizontal-strategy        lines    same set.
--snap-tolerance             3        snap_tolerance (PDF pts).
--join-tolerance             3        join_tolerance (PDF pts).
--edge-min-length            3        drop merged edges shorter than this.
--intersection-tolerance     3        slack on edge crossings.
--text-tolerance             3        per-cell text-extraction tolerance.
--min-words-vertical         3        text strategy column threshold.
--min-words-horizontal       1        text strategy row threshold.
--explicit-vertical-lines    (none)   comma list of X coords.
--explicit-horizontal-lines  (none)   comma list of Y coords.
--indent                     0        JSON indent (0 = compact).
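
The 1,3-5 syntax accepted by --pages can be expanded with a few lines of standard-library Go. The sketch below is an assumption about how such a spec might be parsed, not the CLI's actual code; parsePages is a hypothetical name, and the real tool may accept or reject inputs differently (for example open-ended ranges).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePages (hypothetical helper) expands a spec such as "1,3-5"
// into the 1-based page numbers it names, in the order written.
func parsePages(spec string) ([]int, error) {
	var pages []int
	for _, part := range strings.Split(spec, ",") {
		part = strings.TrimSpace(part)
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("bad page %q: %w", part, err)
		}
		last := first
		if isRange {
			if last, err = strconv.Atoi(hi); err != nil {
				return nil, fmt.Errorf("bad page range %q: %w", part, err)
			}
		}
		// Pages are 1-based and ranges must not run backwards.
		if first < 1 || last < first {
			return nil, fmt.Errorf("bad page range %q", part)
		}
		for p := first; p <= last; p++ {
			pages = append(pages, p)
		}
	}
	return pages, nil
}

func main() {
	pages, err := parsePages("1,3-5")
	fmt.Println(pages, err) // [1 3 4 5] <nil>
}
```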

Side-by-side comparison with pdfplumber

# Python (pdfplumber)
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    for word in page.extract_words(x_tolerance=3, y_tolerance=3):
        print(word["text"], word["x0"], word["top"])
    print(page.extract_text())

// Go (pdftable)
import "github.com/hallelx2/pdftable"

doc, _ := pdftable.OpenFile("report.pdf")
defer doc.Close()
page, _ := doc.Page(1)

words, _ := page.Words(pdftable.DefaultWordOpts())
for _, w := range words {
    // pdftable's Y is PDF user-space (origin bottom-left). The
    // pdfplumber-equivalent "top" is page.Height() - w.Y1.
    fmt.Println(w.Text, w.X0, page.Height()-w.Y1)
}
text, _ := page.ExtractText(pdftable.DefaultTextOpts())
fmt.Println(text)

Three differences worth noting:

  1. Page indexing is 1-based (pdfplumber's own page.page_number attribute is also 1-based). pdfplumber's pdf.pages is a 0-indexed Python list, so pdf.pages[0] is the first page; our Page(1) returns that same first page.
  2. Coordinates are in PDF user space with origin at bottom-left. pdfplumber by default reports top (origin top-left, Y growing down) on its chars and words; we report Y0 / Y1 in PDF native coordinates. The conversion is top = page.Height() - Y1.
  3. Options are explicit Go structs, not **kwargs. Build a WordOpts / TextOpts, override the fields you care about, pass it through. DefaultWordOpts() / DefaultTextOpts() return pdfplumber-matching defaults.
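
The coordinate conversion in point 2 extends naturally from a single top value to a whole bounding box. The sketch below uses two local struct types (pdfBox and topBox, both hypothetical and defined only for this example) rather than the package's own types, and applies top = pageHeight - Y1 and bottom = pageHeight - Y0.

```go
package main

import "fmt"

// pdfBox is a hypothetical stand-in for a bbox in PDF user space
// (origin bottom-left, Y growing up).
type pdfBox struct{ X0, Y0, X1, Y1 float64 }

// topBox uses pdfplumber's naming (origin top-left, Y growing down).
type topBox struct{ X0, Top, X1, Bottom float64 }

// toTopDown flips the Y axis. The upper edge (Y1) becomes Top and the
// lower edge (Y0) becomes Bottom; X coordinates are unchanged.
func toTopDown(b pdfBox, pageHeight float64) topBox {
	return topBox{
		X0:     b.X0,
		Top:    pageHeight - b.Y1,
		X1:     b.X1,
		Bottom: pageHeight - b.Y0,
	}
}

func main() {
	// A word near the top of a US Letter page (792 pt tall).
	w := pdfBox{X0: 72, Y0: 700, X1: 120, Y1: 712}
	fmt.Printf("%+v\n", toTopDown(w, 792)) // {X0:72 Top:80 X1:120 Bottom:92}
}
```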

Parity with pdfplumber

The word-grouping and text-extraction algorithms are direct ports of pdfplumber's WordExtractor and extract_text (see pdfplumber/utils/text.py). Tests in golden_test.go compare the Go output against pdfplumber's reference output on shared fixture PDFs.

Behaviours that match exactly:

  • Word grouping: same line-cluster-then-merge-by-gap algorithm, same defaults (XTolerance=3, YTolerance=3), same handling of blank-char filtering, ligature expansion (ﬁ→fi, etc.), and split-at-punctuation.
  • Ordering: words returned in pdfplumber's order (top-to-bottom, then left-to-right within each line) when UseTextFlow is false.
  • Direction handling: ltr / rtl / ttb / btt mapping from upright + HorizontalLTR + VerticalTTB.
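
The merge-by-gap step of word grouping can be sketched as follows. This is a deliberately reduced illustration under stated assumptions: the glyph type and groupWords function are hypothetical, the input is taken to be a single upright left-to-right line already sorted by X0, and blank-char filtering, ligature expansion, and punctuation splitting are all left out.

```go
package main

import (
	"fmt"
	"strings"
)

// glyph is a hypothetical, minimal stand-in for a positioned character.
type glyph struct {
	Text   string
	X0, X1 float64
}

// groupWords merges consecutive glyphs on one line into words, starting
// a new word whenever the horizontal gap to the previous glyph exceeds
// xTol. The line must already be sorted left to right.
func groupWords(line []glyph, xTol float64) []string {
	var words []string
	var cur strings.Builder
	for i, g := range line {
		if i > 0 && g.X0-line[i-1].X1 > xTol {
			words = append(words, cur.String())
			cur.Reset()
		}
		cur.WriteString(g.Text)
	}
	if cur.Len() > 0 {
		words = append(words, cur.String())
	}
	return words
}

func main() {
	line := []glyph{
		{"H", 0, 5}, {"i", 5, 8},
		// 6 pt gap here, larger than the tolerance of 3.
		{"P", 14, 19}, {"D", 19, 24}, {"F", 24, 29},
	}
	fmt.Println(groupWords(line, 3)) // [Hi PDF]
}
```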

Behaviours that intentionally differ:

  • Position precision drifts when font metrics aren't bundled. pdfplumber uses pdfminer.six's AFM tables for the standard 14 fonts; we use a default-width fallback for now. Word text and order match exactly; word bboxes drift by up to ~10 PDF points on glyphs whose width isn't in the PDF's /Widths array. Golden tests assert text parity exactly and position parity within a 15-point envelope; the envelope tightens to <1pt once the AFM bundle lands (planned for v0.4.x, see the Roadmap).
  • Layout=true output is structurally similar but not byte-equal. Pdfplumber's layout algorithm has version-to-version drift; we produce a column-aligned grid with the same density defaults but don't promise byte-equal output across pdfplumber releases.

Behaviours not yet ported:

  • extract_text_lines (regex-based line extraction).
  • search on TextMap (regex over assembled page text with char-level match back-references).
  • Per-character extra_attrs hooks beyond fontname and size.

Architecture

pdftable/
├── pdftable.go        // Open / OpenBytes / OpenFile entry points
├── pdf.go             // Document interface + implementation
├── page.go            // Page interface + implementation
├── char.go            // Public Char / Line / Rect / Curve / Objects
├── text.go            // Word + ExtractText + ExtractTextSimple (v0.1.0)
├── table.go           // TableStrategy / TableSettings / Table types (v0.2.0)
├── finder.go          // Cells-from-edges algorithm (v0.2.0)
├── finder_text.go     // Text + explicit edge derivation (v0.3.0)
├── clustering.go      // 1-D clusterObjects, groupObjectsByAttr, dedupeChars
├── geometry.go        // BBox helpers: Union, Intersect, Contains, Snap
├── errors.go          // Sentinel errors
├── cmd/
│   └── pdftable/      // Command-line interface (v0.3.0)
│       └── main.go
└── internal/
    ├── layout/
    │   └── lines.go   // Edge type + snap/join/filter pipeline (v0.2.0)
    └── pdf/
        ├── reader.go      // pdfcpu bridge
        ├── content.go     // Content-stream interpreter
        ├── ops.go         // Operator dispatch table
        ├── state.go       // Graphics + text state, matrix math
        ├── font.go        // Font + encoding tables + glyph-name resolution
        └── cmap.go        // ToUnicode CMap parser

The public pdftable package is small and stable. The internal/pdf package owns the interpreter — its types are not exposed because they will evolve as more PDF features are added (Type 3 fonts, vertical writing, more exotic CMaps).

Why pdfcpu and not write a PDF parser from scratch?

PDF object parsing (xref tables, indirect-object resolution, stream decompression such as FlateDecode, LZWDecode, and ASCII85Decode, and encryption) is a large amount of mostly uninteresting code. pdfcpu is mature, well-tested, and gives us a parsed *model.Context to work with. We layer the content-stream interpreter (which pdfcpu doesn't have) on top.

If pdfcpu's dependency footprint becomes a problem (it pulls in image codecs we don't strictly need), the blast radius of swapping it out is limited to internal/pdf/reader.go. The rest of the package is stdlib-only.

Roadmap

  • v0.0.x — content-stream primitives.
  • v0.1.x — text extraction: Page.ExtractText, Page.Words, Page.ExtractTextSimple.
  • v0.2.x — table finding via ruling lines: Page.FindTables / Page.ExtractTables covering the lines and lines_strict strategies.
  • v0.3.x — remaining table strategies and CLI (this release): text (word-alignment edges), explicit (caller-supplied edges), and a pdftable CLI mirroring pdfplumber's surface.
  • v0.4.x — bundle the standard-14 AFM metrics so word bboxes (and therefore cell text) match pdfplumber to within 1 PDF point on standard fonts.
  • v0.5.x — performance pass: parser benchmarking against pdfminer.six and pdfplumber on a representative document corpus.

License

MIT. See LICENSE.

Acknowledgements

This library is a direct port of the algorithms in pdfminer.six and pdfplumber. Their authors did the hard work of figuring out how to robustly recover structure from the PDF wire format; this is that work translated into Go.

Documentation

Overview

Package pdftable is a Go-native port of Python's pdfplumber. It reads a PDF document, walks the content streams, and surfaces the positioned primitives (characters, lines, rectangles, curves) that higher-level layout algorithms — text extraction, word grouping, table detection — operate on.

The library is structured in layers that mirror the pdfplumber + pdfminer.six split:

  • This package (pdftable) is the public API. It is small, stable, and contains no PDF parsing logic of its own — it just exposes Document, Page, Char, Line, Rect, Curve and constructs them from the internal package.
  • github.com/hallelx2/pdftable/internal/pdf is the content-stream interpreter. It is intentionally not public: the data shapes there will evolve as we add more PDF features, and we don't want callers depending on them.

Typical usage:

doc, err := pdftable.OpenFile("report.pdf")
if err != nil { return err }
defer doc.Close()

for n, p := range doc.Pages() {
    chars, _ := p.Chars()
    fmt.Printf("page %d: %d chars\n", n, len(chars))
}

Phase scope: v0.0.x shipped the content-stream primitives and v0.1.0 added text extraction (Page.Words, Page.ExtractText, Page.ExtractTextSimple). Table finding (ExtractTables, FindTables) followed in v0.2.0 (lines, lines_strict) and v0.3.0 (text, explicit); see the README for the roadmap. The Page interface is additive across releases; v0.0.1 callers using only Chars/Lines/Rects/Curves continue to compile against later releases without changes.

Constants

This section is empty.

Variables

var (
	// ErrInvalidPDF is returned by Open / OpenBytes / OpenFile when the
	// input bytes can't be parsed as a PDF. The underlying pdfcpu error
	// is wrapped so callers can still inspect the details with errors.As.
	ErrInvalidPDF = errors.New("pdftable: invalid PDF")

	// ErrPageOutOfRange is returned by Document.Page when n is < 1 or
	// > NumPages(). The PDF page index is 1-based, matching pdfplumber.
	ErrPageOutOfRange = errors.New("pdftable: page out of range")

	// ErrUnsupported is returned when we hit a PDF feature this library
	// does not yet implement (e.g. an exotic CMap, an unsupported XObject
	// subtype, vertical writing). The error string names the feature.
	ErrUnsupported = errors.New("pdftable: unsupported feature")

	// ErrEncrypted is returned when the PDF is encrypted and we can't
	// decrypt it with the empty password. Full encryption support is
	// out of scope for the initial release — callers that need it can
	// pre-decrypt with pdfcpu's api.Decrypt and feed the cleaned bytes
	// to OpenBytes.
	ErrEncrypted = errors.New("pdftable: encrypted PDF (decryption not yet supported)")
)

Sentinel errors returned by the public API. Callers can match these with errors.Is(); functions that surface a parser-level problem from pdfcpu or the content-stream interpreter wrap the underlying error so the cause is preserved.

We keep this set small on purpose. The PDF spec has hundreds of failure modes — most of them collapse into "the bytes do not look like a PDF", "you asked for a page that doesn't exist", or "we don't implement this feature yet". Anything more specific belongs in the wrapped error string, not as a new sentinel.
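
Matching a wrapped sentinel with errors.Is looks like this in practice. The sketch is self-contained: errPageOutOfRange is a local stand-in for the package's ErrPageOutOfRange, and lookupPage is a hypothetical function that wraps it with extra detail using fmt.Errorf and %w. How the library itself formats the wrapped message is not shown in this document, so the message text here is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errPageOutOfRange is a local stand-in for pdftable.ErrPageOutOfRange.
var errPageOutOfRange = errors.New("pdftable: page out of range")

// lookupPage (hypothetical) wraps the sentinel with request details,
// so the cause stays matchable while the message stays specific.
func lookupPage(n, numPages int) error {
	if n < 1 || n > numPages {
		return fmt.Errorf("%w: requested %d of %d", errPageOutOfRange, n, numPages)
	}
	return nil
}

// isOutOfRange shows the caller-side check: errors.Is walks the wrap chain.
func isOutOfRange(err error) bool {
	return errors.Is(err, errPageOutOfRange)
}

func main() {
	err := lookupPage(7, 3)
	fmt.Println(isOutOfRange(err)) // true
	fmt.Println(err)               // pdftable: page out of range: requested 7 of 3
}
```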

Functions

This section is empty.

Types

type BBox added in v0.1.0

type BBox struct {
	X0, Y0, X1, Y1 float64
}

BBox is the canonical four-tuple bounding-box helper that the layout algorithms (clustering, word grouping, text extraction) operate on. Field naming follows the Char/Line/Rect convention used throughout the package: x0,y0 is the lower-left corner and x1,y1 is the upper-right corner in PDF user space (origin at bottom-left, Y growing up).

We expose BBox as a value type — small, stack-allocatable, trivially copyable. Algorithms that need to pass a bbox around without poking at the larger Char/Rect/Line wrappers can construct one with NewBBox or pull one out with the BBoxOf helpers below.

The Go API intentionally chooses (X0,Y0,X1,Y1) over pdfplumber's dict-of-strings ({"x0","top","x1","bottom"}). The two flavours differ because pdfplumber operates in image space (Y growing down, "top" = small Y, "bottom" = large Y) and we operate in PDF user space (Y growing up). Comments call out the mapping wherever it matters.

func BBoxOfChar added in v0.1.0

func BBoxOfChar(c Char) BBox

BBoxOfChar returns the bounding box of a Char.

func BBoxOfChars added in v0.1.0

func BBoxOfChars(cs []Char) BBox

BBoxOfChars returns the smallest bbox enclosing every char in cs. Returns the zero BBox for an empty slice.

func MergeBBoxes added in v0.1.0

func MergeBBoxes(bboxes []BBox) BBox

MergeBBoxes returns the smallest bbox enclosing every input. Empty input returns a zero BBox. This mirrors pdfplumber's merge_bboxes and objects_to_bbox helpers — the typical caller has a slice of Chars and wants the combined bounding box for the resulting Word.

func NewBBox added in v0.1.0

func NewBBox(x0, y0, x1, y1 float64) BBox

NewBBox builds a BBox and normalises it so X0<=X1 and Y0<=Y1. Algorithms downstream rely on the normal form, so we never let an inverted bbox leak past this constructor.

func (BBox) Area added in v0.1.0

func (b BBox) Area() float64

Area returns Width * Height.

func (BBox) Contains added in v0.1.0

func (b BBox) Contains(other BBox) bool

Contains reports whether b fully encloses other. Edges are considered inside (>= on the low side, <= on the high side), so a bbox contains itself.

func (BBox) ContainsPoint added in v0.1.0

func (b BBox) ContainsPoint(x, y float64) bool

ContainsPoint reports whether (x,y) lies inside b (inclusive on edges).

func (BBox) Height added in v0.1.0

func (b BBox) Height() float64

Height returns the bbox's vertical extent.

func (BBox) Intersect added in v0.1.0

func (b BBox) Intersect(other BBox) (BBox, bool)

Intersect returns the overlapping rectangle of b and other, and a boolean reporting whether the intersection has non-empty area (i.e. the two bboxes actually overlap). This mirrors pdfplumber's get_bbox_overlap, which returns None when the boxes don't touch and the overlapping bbox otherwise.

We treat touching-but-not-overlapping (shared edge, zero area) as non-overlap, matching pdfplumber's `o_height + o_width > 0` check — a single-line ruler that grazes a word's bbox should not be reported as "intersecting" the word.
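
A minimal sketch of the documented semantics, using a local box type rather than the package's BBox. The function name intersect and the decision to return the zero box alongside false are assumptions for the example; only the "zero-area overlap counts as no overlap" rule is taken from the description above.

```go
package main

import (
	"fmt"
	"math"
)

// box is a hypothetical stand-in for BBox (PDF user space).
type box struct{ X0, Y0, X1, Y1 float64 }

// intersect returns the overlapping rectangle of a and b. It reports
// false when the boxes are disjoint or merely touch along an edge or
// corner, i.e. whenever the overlap has no area.
func intersect(a, b box) (box, bool) {
	o := box{
		X0: math.Max(a.X0, b.X0),
		Y0: math.Max(a.Y0, b.Y0),
		X1: math.Min(a.X1, b.X1),
		Y1: math.Min(a.Y1, b.Y1),
	}
	if o.X1 <= o.X0 || o.Y1 <= o.Y0 {
		return box{}, false
	}
	return o, true
}

func main() {
	a := box{0, 0, 10, 10}
	fmt.Println(intersect(a, box{5, 5, 15, 15}))  // {5 5 10 10} true
	fmt.Println(intersect(a, box{10, 0, 20, 10})) // {0 0 0 0} false (shared edge only)
}
```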

func (BBox) IsZero added in v0.1.0

func (b BBox) IsZero() bool

IsZero reports whether the bbox is the zero value (all four fields equal to zero). Useful for "did I forget to populate this" checks.

func (BBox) Snap added in v0.1.0

func (b BBox) Snap(step float64) BBox

Snap rounds each of b's four coordinates to the nearest multiple of step. Used by layout-analysis code to coalesce near-equal positions (e.g. ruling lines drawn at 99.9, 100.0, 100.1) before clustering. A step of 0 returns the original bbox unchanged.
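
Rounding to the nearest multiple of a step is a one-liner per coordinate. The sketch below assumes math.Round semantics (halves round away from zero); the package's actual tie-breaking rule is not stated here, and snapValue is a hypothetical name.

```go
package main

import (
	"fmt"
	"math"
)

// snapValue (hypothetical helper) rounds v to the nearest multiple of
// step. A step of 0 leaves the value unchanged.
func snapValue(v, step float64) float64 {
	if step == 0 {
		return v
	}
	return math.Round(v/step) * step
}

func main() {
	// Three ruling lines drawn at almost the same position collapse
	// onto one value once snapped to a 0.5 pt grid.
	for _, x := range []float64{99.9, 100.0, 100.1} {
		fmt.Println(snapValue(x, 0.5)) // 100 each time
	}
}
```

Applying the same function to X0, Y0, X1, and Y1 gives the behaviour described for a whole bbox.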

func (BBox) Union added in v0.1.0

func (b BBox) Union(other BBox) BBox

Union returns the smallest bbox enclosing both b and other.

We DON'T treat a zero-value BBox as "empty" here — a caller that passes BBox{} to Union genuinely means "enclose the origin point". Use MergeBBoxes when you have a slice and want it to be a no-op on the empty slice.

func (BBox) Width added in v0.1.0

func (b BBox) Width() float64

Width returns the bbox's horizontal extent.

type Char

type Char struct {
	// Text is the Unicode payload of this glyph (one or more runes).
	// Empty string means the font's encoding and ToUnicode CMap both
	// failed to map the glyph; the bbox still describes where the
	// glyph was drawn so downstream layout heuristics can use the
	// positioning even when we can't read the text.
	Text string

	// Bounding box in PDF user space, with x0 <= x1 and y0 <= y1.
	// Y0 is the descender baseline of the glyph; Y1 is the top of the
	// ascender. The bbox is the typographic cell, not the ink — it
	// matches what pdfplumber reports.
	X0, Y0, X1, Y1 float64

	// FontName is the /BaseFont (or /Name) value from the font dict,
	// e.g. "Helvetica" or "ABCDEF+TimesNewRoman-Bold". It's surfaced
	// verbatim — callers that want to detect bold weight should match
	// on substrings ("bold", "-bd").
	FontName string

	// FontSize is the size the glyph was rendered at, in PDF user space
	// units (i.e. after applying the text matrix and CTM). For a glyph
	// drawn with `12 Tf` and identity CTM this is 12.0; if the CTM
	// scales by 2x it's 24.0.
	FontSize float64

	// Upright is true when the glyph is drawn in normal reading
	// orientation (a*d > 0 and b*c <= 0 in the combined text matrix).
	// Rotated and mirrored glyphs report false. pdfplumber uses the
	// same predicate to skip vertical-stamp text during table finding.
	Upright bool

	// Advance is the horizontal advance width the text matrix moved by
	// after drawing this glyph (in user-space units). It already
	// includes character spacing, word spacing for the space glyph,
	// and the font's per-glyph width — i.e. the actual ink-to-ink
	// distance to the next glyph.
	Advance float64
}

Char is a single positioned glyph on a page. Adjacent Chars do NOT share state — each carries its own font, size, and absolute bbox so downstream code (word grouping, table finding, text extraction) can reason about each glyph in isolation.

Text is the Unicode string this glyph represents. For most fonts a glyph maps to a single code point ("A"), but Adobe ligature glyph names ("fi", "ffi") and ToUnicode maps with multi-character bfchar entries can produce multi-rune strings — we never split them.
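
The Upright predicate quoted in the field comment can be checked against a few familiar matrices. The sketch takes the a, b, c, d components of a matrix [a b c d e f] as plain arguments; the function name upright is hypothetical and the translation components e and f are omitted because the predicate does not use them.

```go
package main

import "fmt"

// upright applies the predicate from the Char.Upright comment:
// a*d > 0 and b*c <= 0.
func upright(a, b, c, d float64) bool {
	return a*d > 0 && b*c <= 0
}

func main() {
	fmt.Println(upright(1, 0, 0, 1))  // identity: true
	fmt.Println(upright(2, 0, 0, 2))  // uniform 2x scale: true
	fmt.Println(upright(0, 1, -1, 0)) // rotated 90 degrees: false
	fmt.Println(upright(-1, 0, 0, 1)) // mirrored horizontally: false
}
```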

func (Char) Height

func (c Char) Height() float64

Height returns y1 - y0.

func (Char) Width

func (c Char) Width() float64

Width returns x1 - x0.

type Curve

type Curve struct {
	// Points are the vertices of the path in order. For curve segments,
	// only the on-path endpoints are emitted (control points are
	// dropped) — this matches pdfplumber's "pts" field, which is also
	// just the endpoints. Callers that need precise curve geometry
	// should request that extension; we don't expect tables to use
	// curves so dropping control points is fine for the initial scope.
	Points [][2]float64

	Stroke bool
	Fill   bool
	Width  float64
}

Curve is a path that contains at least one curve segment (`c`, `v`, or `y`), or any path that doesn't reduce to a single Rect or a series of straight Lines. Points are the on-path vertices in order; as the field comment above notes, Bezier control points are dropped. Callers that need actual Bezier shapes have to reconstruct them from the operator stream. For now we keep the simpler flattened representation, which is enough for "did the page have decorative curves on it" detection.

type Document

type Document interface {
	// NumPages returns the page count. Always >= 1 for a valid PDF.
	NumPages() int

	// Page returns the n'th page (1-indexed). Returns
	// ErrPageOutOfRange if n is < 1 or > NumPages().
	Page(n int) (Page, error)

	// Pages returns a range iterator over (n, Page) pairs for use
	// with Go 1.23+ range-over-func syntax:
	//
	//   for n, p := range doc.Pages() {
	//       // ...
	//   }
	//
	// The iterator yields pages in 1-based order. It does NOT clone
	// the underlying Document — Close() on the doc still tears down
	// the iterator's pages.
	Pages() iter.Seq2[int, Page]

	// Close releases any resources held by the underlying parser.
	// For pdfcpu-backed documents this is currently a no-op (pdfcpu
	// holds everything in memory), but callers should still defer
	// Close() so that future backends (mmap, file handles) work.
	Close() error
}

Document represents one open PDF file. The interface (not a struct) keeps the API symmetric with Page and lets us swap implementations later — e.g. a Document that streams pages lazily from a remote blob — without breaking callers.

All accessors are safe to call concurrently. The underlying pdfcpu *model.Context is treated as read-only after the document is opened.

func Open

func Open(r io.Reader) (Document, error)

Open reads an entire io.Reader into memory and parses it as a PDF. pdfcpu's API requires an io.ReadSeeker; we buffer the input so the caller doesn't have to.
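
Buffering a plain io.Reader so it can be handed to an API that wants an io.ReadSeeker is a standard-library pattern: read everything with io.ReadAll, then wrap the bytes in a bytes.Reader. The sketch below shows that pattern in isolation; bufferToSeeker and bufferedLen are hypothetical names, and whether Open is written exactly this way is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// bufferToSeeker drains r into memory and returns a seekable view of
// the buffered bytes.
func bufferToSeeker(r io.Reader) (io.ReadSeeker, error) {
	data, err := io.ReadAll(r)
	if err != nil {
		return nil, fmt.Errorf("buffering input: %w", err)
	}
	return bytes.NewReader(data), nil
}

// bufferedLen buffers s and reports its size by seeking to the end,
// something a non-seekable reader could not do.
func bufferedLen(s string) (int64, error) {
	rs, err := bufferToSeeker(strings.NewReader(s))
	if err != nil {
		return 0, err
	}
	return rs.Seek(0, io.SeekEnd)
}

func main() {
	n, err := bufferedLen("%PDF-1.7 ...")
	fmt.Println(n, err) // 12 <nil>
}
```

The cost is one full in-memory copy of the input, which is why OpenBytes is the better choice when the bytes are already in hand.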

Use OpenBytes if you already have the file content as a []byte (avoids an extra copy), or OpenFile if you have a path.

func OpenBytes

func OpenBytes(b []byte) (Document, error)

OpenBytes parses an in-memory PDF.

The bytes are NOT copied — pdfcpu reads from them in-place and may retain references after this function returns. Callers must not mutate the slice for as long as the returned Document is in use.

func OpenFile

func OpenFile(path string) (Document, error)

OpenFile opens a PDF at the given filesystem path.

type Intersection added in v0.2.0

type Intersection struct {
	X, Y float64
	V    []layout.Edge // vertical edges passing through (X, Y)
	H    []layout.Edge // horizontal edges passing through (X, Y)
}

Intersection records one crossing point: an (x, y) tuple plus the vertical and horizontal edges that meet there. We need the edge sets (not just the count) because the cell-finder asks "does the same edge connect points p1 and p2?" — checking that two points lie on a shared edge is how the algorithm distinguishes "two intersections on the same ruler" from "two intersections on parallel rulers that happen to align".

Field naming follows pdfplumber's intersections dict-of-dicts shape: X/Y are the keys, V/H are the value lists. Because V and H are slices, an Intersection cannot be compared with ==; its identity is determined by (X, Y) alone.
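The shared-edge question can be illustrated with a small sketch. The `edge` and `crossing` types below are hypothetical stand-ins for layout.Edge and Intersection, and the integer `ID` used to recognise an edge is an assumption of this example; the real cell-finder may identify edges differently.

```go
package main

import "fmt"

// edge and crossing are hypothetical stand-ins for layout.Edge and
// Intersection. The ID field is an assumption of this sketch.
type edge struct {
	ID             int
	X0, Y0, X1, Y1 float64
}

type crossing struct {
	X, Y float64
	V, H []edge
}

// sharesEdge reports whether any edge ID appears in both lists.
func sharesEdge(a, b []edge) bool {
	seen := make(map[int]bool, len(a))
	for _, e := range a {
		seen[e.ID] = true
	}
	for _, e := range b {
		if seen[e.ID] {
			return true
		}
	}
	return false
}

// connected reports whether p and q lie on a common ruler: a shared
// vertical edge when they have the same X, a shared horizontal edge
// when they have the same Y.
func connected(p, q crossing) bool {
	switch {
	case p.X == q.X:
		return sharesEdge(p.V, q.V)
	case p.Y == q.Y:
		return sharesEdge(p.H, q.H)
	}
	return false
}

func main() {
	// Two vertical rulers at x=10 separated by a gap between y=50 and y=60.
	lower := edge{ID: 1, X0: 10, Y0: 0, X1: 10, Y1: 50}
	upper := edge{ID: 2, X0: 10, Y0: 60, X1: 10, Y1: 100}

	a := crossing{X: 10, Y: 0, V: []edge{lower}}
	b := crossing{X: 10, Y: 50, V: []edge{lower}}
	c := crossing{X: 10, Y: 60, V: []edge{upper}}

	fmt.Println(connected(a, b)) // same ruler
	fmt.Println(connected(b, c)) // aligned, but on separate rulers
}
```

Points b and c share an X coordinate, yet they are not connected because no single edge passes through both.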

type Line

type Line struct {
	X0, Y0, X1, Y1 float64

	// Stroke is true if the path was stroked (S, s, B, b, B*, b*).
	// Lines emitted from non-stroked paths are dropped before the
	// caller sees them — but the field is preserved for symmetry
	// with Rect/Curve so layout code can branch on it uniformly.
	Stroke bool

	// Width is the stroke line width in user-space units at the time
	// the line was emitted. For ruling lines this is typically 0.5–1.0.
	Width float64
}

Line is a single straight-line segment emitted when a stroking operator (`S`, `s`, `B`, `b`, `B*`, `b*`) paints a path that contains an `l` segment. Each `l` becomes one Line; a rectangle drawn with `re` does NOT decompose into four Lines (it becomes a single Rect). That is how pdfplumber tells the two apart, and we keep the same distinction so downstream code can pick the right collection for table-finding.

type Objects

type Objects struct {
	Chars  []Char
	Lines  []Line
	Rects  []Rect
	Curves []Curve
}

Objects is the bundle of all primitive page objects, returned by Page.Objects(). It's a convenience for callers that want everything in one shot rather than four separate Page method calls. The per-type accessors (Chars/Lines/Rects/Curves) are independent and safe to call concurrently, but they each redo the page content-stream walk, so Objects() is the cheaper choice when you need it all.

type Page

type Page interface {
	// Number returns the page number (1-based).
	Number() int

	// Width and Height return the page's mediabox dimensions in PDF
	// points (1/72 inch), already adjusted for the page's /Rotate
	// entry — so a portrait letter-sized page rotated 90 degrees
	// reports Width=792 Height=612 (landscape), matching what a
	// PDF viewer would display.
	Width() float64
	Height() float64

	// Chars walks the page and returns every positioned glyph. The
	// order is content-stream order — i.e. the order the producer
	// drew them, NOT visual reading order. Downstream layout code
	// (extract_text, find_tables) sorts the chars by position.
	Chars() ([]Char, error)

	// Lines returns every straight-line segment drawn on the page.
	// Each `l` segment in the content stream becomes one Line.
	// Rectangles drawn via `re` are NOT decomposed into four Lines;
	// they're reported through Rects() instead.
	Lines() ([]Line, error)

	// Rects returns every rectangle drawn via the `re` operator.
	// Both stroked and filled rectangles are returned; the Stroke
	// and Fill flags say which.
	Rects() ([]Rect, error)

	// Curves returns every Bezier or composite path that isn't a
	// pure line-segment chain or a single rect.
	Curves() ([]Curve, error)

	// Objects returns Chars + Lines + Rects + Curves in a single
	// walk. Use this when you need all four — it's strictly
	// cheaper than calling each accessor separately because the
	// content stream is parsed exactly once.
	Objects() (Objects, error)

	// Words extracts positioned text runs from the page. A "word"
	// is a contiguous group of chars whose horizontal gaps are
	// within WordOpts.XTolerance and whose vertical positions
	// agree within WordOpts.YTolerance. Pass DefaultWordOpts() to
	// use pdfplumber-matching defaults. See WordOpts for the full
	// configuration surface.
	//
	// Returns an empty slice (not nil) when the page contains no
	// extractable text.
	Words(opts WordOpts) ([]Word, error)

	// ExtractText returns the page's text as a single string. By
	// default words on the same line are joined with a single
	// space and lines are joined with "\n". When TextOpts.Layout is
	// true, the output preserves spatial layout (column-aligned
	// text, blank lines for vertical gaps) at the cost of more
	// whitespace. Pass DefaultTextOpts() for pdfplumber-matching
	// defaults.
	ExtractText(opts TextOpts) (string, error)

	// ExtractTextSimple is a no-frills extraction that clusters
	// chars by visual line and joins them by gap detection. Use
	// when ExtractText's word-grouping heuristics produce undesired
	// results on adversarial input.
	ExtractTextSimple(xTolerance, yTolerance float64) (string, error)

	// FindTables runs the geometry-only stage of the table-finding
	// pipeline: derive edges from the page primitives, snap+join
	// into rulers, scan for intersections, assemble cells, group
	// cells into tables. Returns one TableFinder per detected
	// table-group so callers building debugging tools can inspect
	// the intermediate stages (edges / intersections / raw cells)
	// alongside the assembled per-table CellsGrid.
	//
	// v0.3.0 supports all four pdfplumber strategies: "lines",
	// "lines_strict", "text", and "explicit". Each axis (vertical,
	// horizontal) selects its strategy independently, so mixed
	// settings like vertical="text" + horizontal="lines" work as
	// expected.
	FindTables(settings TableSettings) ([]TableFinder, error)

	// ExtractTables wraps FindTables and runs per-cell text
	// extraction on every detected table. Cells with no chars
	// produce an empty string. Leading and trailing whitespace
	// inside each cell is stripped. Returns the slice of fully
	// populated Table structs in visual top-to-bottom-left-to-right
	// order.
	ExtractTables(settings TableSettings) ([]*Table, error)
}

Page is one page of a PDF document. The interface (not a struct) is intentional: it lets us swap implementations later (e.g. for a streaming PDF parser) without breaking callers, and it makes the API surface easy to mock in tests.

Every accessor (Chars, Lines, Rects, Curves, Objects) walks the page content stream from scratch. We do NOT cache between calls because:

  1. Callers that need ALL the objects call Objects() once.
  2. Callers that need just the chars (say, for text extraction) don't pay for the path-painting machinery they aren't using.
  3. Caching means deciding when to invalidate, which is moot because a Page is immutable from the caller's perspective.

Pages are 1-indexed, matching pdfplumber. Number() returns the 1-based index so callers can format error messages without re-tracking which Page they were given.

type Rect

type Rect struct {
	X0, Y0, X1, Y1 float64

	// Stroke and Fill mirror the painting operator that closed the
	// path: S/s set Stroke; f/F/f* set Fill; B/b/B*/b* set both.
	// Either or both can be true — but not neither (paths with `n`
	// produce nothing).
	Stroke bool
	Fill   bool

	// Width is the stroke line width at the time the rectangle was
	// painted, in user-space units. Zero when Stroke is false.
	Width float64
}

Rect is a rectangle path emitted by an `re` operator (a single PDF instruction that draws a closed box). pdfplumber tracks these separately from generic four-segment paths because table grids, borders, and shaded cells are nearly always drawn this way — keeping the distinction makes table detection much more reliable.

type Table added in v0.2.0

type Table struct {
	// Rows is the table's text content as a 2-D slice. Row 0 is the
	// VISUALLY TOP row of the table; column 0 is the leftmost. Empty
	// cells appear as "". Missing cells (when a row has fewer columns
	// than the table's column count, because the underlying cell
	// detection found a hole) are also "" — we promote missing to
	// empty so callers don't have to nil-check every entry.
	Rows [][]string

	// BBox is the union of every cell's bbox, in PDF user-space
	// coordinates (origin bottom-left, Y growing up).
	BBox BBox

	// Page is the 1-based page number the table was found on, copied
	// from the originating Page so callers can carry results across
	// page boundaries without holding Page references.
	Page int

	// CellsBBox is the per-cell bbox aligned to Rows: CellsBBox[i][j]
	// is the bbox of Rows[i][j]. Useful for re-rendering with
	// highlight overlays, or for re-cropping the page to extract the
	// cell's contents in a richer format than plain text.
	CellsBBox [][]BBox
}

Table is the extracted result for one detected table. It carries the assembled cell texts plus the geometry needed for downstream consumers (re-rendering, click-through to source positions).
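Because Rows is a plain [][]string, it can be handed directly to the standard encoding/csv package. The helper below is a hypothetical convenience, not a pdftable API, and the sample rows are invented for illustration.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// rowsToCSV encodes a [][]string, the shape documented for Table.Rows,
// as CSV text. It is a hypothetical helper, not part of pdftable.
func rowsToCSV(rows [][]string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	// WriteAll writes every record and then flushes the writer.
	if err := w.WriteAll(rows); err != nil {
		return "", fmt.Errorf("encoding rows: %w", err)
	}
	return buf.String(), nil
}

func main() {
	// Invented sample data; empty cells are "" as in Table.Rows.
	rows := [][]string{
		{"Item", "Amount"},
		{"Widgets, large", "12.50"},
		{"Refund", ""},
	}
	out, err := rowsToCSV(rows)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

Since missing cells are promoted to "", every row has the same length and the writer needs no nil checks.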

func (Table) Cells added in v0.2.0

func (t Table) Cells() []BBox

Cells returns the cell bboxes flattened into reading order (left-to-right, top-to-bottom). Provided as a convenience for callers that want a single iterable rather than a nested slice.

type TableBox added in v0.2.0

type TableBox struct {
	// BBox is the union of every cell's bbox.
	BBox BBox

	// Rows is the row count.
	Rows int

	// Cols is the column count.
	Cols int

	// CellsGrid is the per-cell bbox aligned to Rows × Cols. The
	// entry at [i][j] is the bbox of the cell at visual row i (0 is
	// topmost) and column j (0 is leftmost). Empty cells are the zero
	// BBox.
	CellsGrid [][]BBox
}

TableBox is one detected table, expressed as a bbox plus a 2-D grid of cell bboxes. Rows are visually top-to-bottom; columns are left-to-right. CellsGrid[i][j] gives the bbox of the cell at row i, column j; missing cells (rectangular gaps in the grid) are reported as the zero BBox, NOT removed, so callers can detect "this cell was missing" by checking IsZero on the entry.

This is the geometry-only intermediate between FindTables and ExtractTables: FindTables reports one of these per detected table in TableFinder.Tables; ExtractTables then runs text extraction per cell and wraps the result in a Table.

func (TableBox) Cells added in v0.2.0

func (t TableBox) Cells() []BBox

Cells returns the cell bboxes flattened into reading order (left-to-right, top-to-bottom). Zero-bbox entries (holes in the grid) are skipped. Convenience helper for callers that want a single iterable.
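The flattening rule is simple enough to sketch. The `bbox` type and `flatten` function below are hypothetical stand-ins for BBox and TableBox.Cells, assuming a hole is represented by the zero value as described above.

```go
package main

import "fmt"

// bbox is a hypothetical stand-in for pdftable's BBox.
type bbox struct{ X0, Y0, X1, Y1 float64 }

// isZero reports whether b is the zero value, which marks a hole.
func (b bbox) isZero() bool { return b == (bbox{}) }

// flatten walks the grid row by row, left to right, and skips holes.
func flatten(grid [][]bbox) []bbox {
	var out []bbox
	for _, row := range grid {
		for _, cell := range row {
			if cell.isZero() {
				continue
			}
			out = append(out, cell)
		}
	}
	return out
}

func main() {
	// A 2x2 grid whose top-right cell is missing.
	grid := [][]bbox{
		{{0, 50, 10, 100}, {}},
		{{0, 0, 10, 50}, {10, 0, 20, 50}},
	}
	for _, c := range flatten(grid) {
		fmt.Printf("%+v\n", c)
	}
}
```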

type TableFinder added in v0.2.0

type TableFinder struct {
	// Edges is the merged, length-filtered edge list used as the
	// input to the intersection scan. Useful for debugging "why
	// didn't this rule get picked up" issues.
	Edges []layout.Edge

	// Intersections is the full set of edge crossings, keyed by
	// (X, Y). The order is deterministic — sorted by Y descending,
	// then X ascending — so callers can rely on iteration order.
	Intersections []Intersection

	// Cells is the raw list of detected cell bboxes BEFORE grouping
	// into tables. Each is a single rectangle whose four corners are
	// intersections joined by shared edges.
	Cells []BBox

	// Tables is the final list of detected tables. Each carries a
	// bbox plus a CellsGrid aligned to row/column order. Tables are
	// sorted top-to-bottom-then-left-to-right by their topmost cell.
	Tables []TableBox
}

TableFinder is the geometry-only result of running the cells-from-edges pipeline on a page. It exposes the intermediate stages (edges, intersections, raw cells) alongside the assembled TableBox list so callers building debugging tools or custom text-extraction can see exactly what the pipeline produced.

Pdfplumber bundles the page reference inside its TableFinder and exposes Table objects with an .extract() method; we keep the finder a pure value (no Page pointer) and let callers either grab the assembled Tables from Page.ExtractTables or compose their own text-fill loop using the public Cells and CellsGrid.
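A custom text-fill loop over a cell grid might look like the following. Everything here is a sketch under assumptions: `bbox` and `word` are hypothetical local types, and assigning a word to the cell that contains its centre point is one possible rule, not necessarily the one ExtractTables uses.

```go
package main

import (
	"fmt"
	"strings"
)

// bbox and word are hypothetical stand-ins for BBox and Word.
type bbox struct{ X0, Y0, X1, Y1 float64 }

type word struct {
	Text string
	bbox
}

// contains reports whether the point (x, y) lies inside b.
func (b bbox) contains(x, y float64) bool {
	return x >= b.X0 && x <= b.X1 && y >= b.Y0 && y <= b.Y1
}

// fillGrid assigns each word to the cell containing its centre and
// joins the words of a cell with single spaces, in input order.
// Zero-value cells (holes) yield "".
func fillGrid(grid [][]bbox, words []word) [][]string {
	rows := make([][]string, len(grid))
	for i, row := range grid {
		rows[i] = make([]string, len(row))
		for j, cell := range row {
			if cell == (bbox{}) {
				continue
			}
			var parts []string
			for _, w := range words {
				cx, cy := (w.X0+w.X1)/2, (w.Y0+w.Y1)/2
				if cell.contains(cx, cy) {
					parts = append(parts, w.Text)
				}
			}
			rows[i][j] = strings.Join(parts, " ")
		}
	}
	return rows
}

func main() {
	grid := [][]bbox{{{0, 0, 50, 20}, {50, 0, 100, 20}}}
	words := []word{
		{Text: "Total", bbox: bbox{5, 5, 30, 15}},
		{Text: "42", bbox: bbox{60, 5, 70, 15}},
	}
	fmt.Printf("%q\n", fillGrid(grid, words))
}
```

A production loop would also need a tie-break for words whose centre sits exactly on a shared cell border; this sketch would place such a word in both cells.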

type TableSettings added in v0.2.0

type TableSettings struct {
	// VerticalStrategy picks the source of vertical edges.
	// Default: StrategyLines.
	VerticalStrategy TableStrategy

	// HorizontalStrategy picks the source of horizontal edges.
	// Default: StrategyLines.
	HorizontalStrategy TableStrategy

	// SnapTolerance is the perpendicular-axis tolerance for clustering
	// near-collinear edges before joining (PDF points). Default: 3.
	SnapTolerance float64

	// JoinTolerance is the along-direction gap that still gets merged
	// during the join pass (PDF points). Default: 3.
	JoinTolerance float64

	// EdgeMinLength drops merged edges shorter than this (PDF points).
	// Default: 3.
	EdgeMinLength float64

	// EdgeMinLengthPrefilter drops raw edges before merging
	// (PDF points). Default: 1 — kills hairline construction
	// segments that snap+join shouldn't pull together.
	EdgeMinLengthPrefilter float64

	// IntersectionTolerance is the slack used when testing whether a
	// vertical edge crosses a horizontal edge — accounts for tiny
	// gaps between the end of a stroked line and the start of the
	// next (PDF points). Default: 3.
	IntersectionTolerance float64

	// TextTolerance is forwarded to the per-cell text-extraction call
	// inside ExtractTables. It overrides both x_tolerance and
	// y_tolerance of the underlying WordExtractor. Default: 3.
	TextTolerance float64

	// MinWordsVertical / MinWordsHorizontal control the "text"
	// strategy thresholds. A candidate column-boundary cluster must
	// contain at least MinWordsVertical words sharing X0 / X1 /
	// centre alignment to be promoted to a vertical edge; row
	// boundaries need MinWordsHorizontal words sharing a top edge.
	// The defaults (3 / 1) mirror those in pdfplumber's
	// table.py:11-12. These fields are ignored when the corresponding
	// strategy is anything other than "text".
	MinWordsVertical   int
	MinWordsHorizontal int

	// KeepBlankChars is forwarded to the per-cell WordExtractor.
	// Default: false (matches pdfplumber's text_keep_blank_chars).
	KeepBlankChars bool

	// ExplicitVerticalLines / ExplicitHorizontalLines hold caller-
	// supplied edge positions. With StrategyLines, StrategyLinesStrict,
	// or StrategyText they are ADDED to the derived edges; with
	// StrategyExplicit they ARE the only source of edges on that axis.
	// Useful when a column or row boundary is invisible in the PDF but
	// known from an external source.
	//
	// Values are X coordinates for vertical lines, Y coordinates for
	// horizontal lines, both in PDF user-space points. Non-finite
	// values (NaN, Inf) are dropped with a log warning. When
	// StrategyExplicit is selected on an axis, at least two
	// coordinates must be supplied on that axis — fewer than two
	// returns an error.
	ExplicitVerticalLines   []float64
	ExplicitHorizontalLines []float64
}

TableSettings controls table finding. Construct via DefaultTableSettings() and override the fields you need — the zero value is NOT usable because the tolerances default to zero and the strategies are empty strings.

Field naming and defaults are 1:1 with pdfplumber's TableSettings dataclass (see pdfplumber/table.py:486-555). Where pdfplumber supports independent x/y tolerances via *_x_tolerance / *_y_tolerance fallbacks, we expose the shared field directly; explicit per-axis overrides can be added later if a real-world need surfaces.
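To make the snap and join tolerances concrete, here is a one-dimensional sketch of both passes. It is loosely modelled on the clustering idea described above and is not pdftable's implementation; `snap`, `join`, and `span` are hypothetical names, and snapping to the cluster mean is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// snap sorts the positions, clusters neighbours that differ by at most
// tol, and replaces each member with its cluster's mean. The result is
// returned in sorted order.
func snap(positions []float64, tol float64) []float64 {
	sorted := append([]float64(nil), positions...)
	sort.Float64s(sorted)

	out := make([]float64, 0, len(sorted))
	for start := 0; start < len(sorted); {
		end := start + 1
		sum := sorted[start]
		for end < len(sorted) && sorted[end]-sorted[end-1] <= tol {
			sum += sorted[end]
			end++
		}
		mean := sum / float64(end-start)
		for i := start; i < end; i++ {
			out = append(out, mean)
		}
		start = end
	}
	return out
}

// span is an interval along an edge's own direction.
type span struct{ Lo, Hi float64 }

// join merges spans whose gap along the direction is at most tol.
func join(spans []span, tol float64) []span {
	if len(spans) == 0 {
		return nil
	}
	sorted := append([]span(nil), spans...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Lo < sorted[j].Lo })

	out := []span{sorted[0]}
	for _, s := range sorted[1:] {
		last := &out[len(out)-1]
		if s.Lo-last.Hi <= tol {
			if s.Hi > last.Hi {
				last.Hi = s.Hi
			}
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	// Three nearly collinear rulers and one distant ruler.
	fmt.Println(snap([]float64{101, 200, 99, 100}, 3))
	// Two segments separated by a 2pt gap, and one far away.
	fmt.Println(join([]span{{0, 10}, {12, 20}, {30, 40}}, 3))
}
```

With a tolerance of 3, the rulers at 99, 100, and 101 collapse onto one position, and the segments ending at 10 and starting at 12 merge into a single span.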

func DefaultTableSettings added in v0.2.0

func DefaultTableSettings() TableSettings

DefaultTableSettings returns settings with the pdfplumber default values pre-populated. The intended pattern is:

settings := pdftable.DefaultTableSettings()
settings.VerticalStrategy = pdftable.StrategyLinesStrict
tables, err := page.ExtractTables(settings)

pdfplumber's defaults (table.py lines 9-12, 486-503):

DEFAULT_SNAP_TOLERANCE         = 3
DEFAULT_JOIN_TOLERANCE         = 3
DEFAULT_MIN_WORDS_VERTICAL     = 3
DEFAULT_MIN_WORDS_HORIZONTAL   = 1
edge_min_length                = 3
edge_min_length_prefilter      = 1
intersection_tolerance         = 3
vertical_strategy              = "lines"
horizontal_strategy            = "lines"
text_x_tolerance/y_tolerance   = 3

type TableStrategy added in v0.2.0

type TableStrategy string

TableStrategy is the enum of edge-derivation strategies. Each axis (vertical, horizontal) picks one independently. All four pdfplumber strategies are implemented as of v0.3.0.

const (
	// StrategyLines derives edges from drawn Lines, Rects (all four
	// sides), and Curves whose segments lie on an axis. Snap and join
	// tolerances are at their defaults — looser than lines_strict so
	// hand-drawn or jittery rules still merge.
	StrategyLines TableStrategy = "lines"

	// StrategyLinesStrict derives edges ONLY from drawn Lines.
	// Rectangle outlines and curve segments are ignored, even if they
	// look like a table grid. Use this when your PDF draws cell
	// backgrounds as filled rects that you do NOT want treated as row
	// boundaries.
	StrategyLinesStrict TableStrategy = "lines_strict"

	// StrategyText infers edges from word alignment. Vertical edges
	// come from clusters of words sharing X0 / X1 / centre positions;
	// horizontal edges from clusters sharing visual top. Best for
	// borderless tables — bank statements, narrative tables in 10-K
	// filings, scanned-then-OCR'd content — where the columns and
	// rows are conveyed by whitespace alignment rather than rules.
	// Tunable via MinWordsVertical (default 3) and
	// MinWordsHorizontal (default 1).
	StrategyText TableStrategy = "text"

	// StrategyExplicit uses caller-supplied coordinates from
	// ExplicitVerticalLines / ExplicitHorizontalLines as the only
	// source of edges on that axis. Useful when the table boundaries
	// are known from an external source (layout analysis, manual
	// annotation) and you want to bypass edge detection entirely.
	// The "explicit" strategy on an axis requires at least two
	// coordinates on that axis; fewer than two produces an error.
	StrategyExplicit TableStrategy = "explicit"
)

type TextOpts added in v0.1.0

type TextOpts struct {
	XTolerance float64
	YTolerance float64

	// Layout: when true, the output preserves the page's spatial
	// layout — words at the same x-position appear in the same column
	// across lines, and lines that are far apart are separated by
	// extra newlines. When false (the default), output is dense:
	// words on the same line are joined with single spaces, lines
	// with "\n".
	Layout bool

	// LayoutWidthChars: when Layout=true, the total width of each
	// emitted line in characters. If 0, defaults to round(page.Width /
	// XDensity).
	LayoutWidthChars int

	// LayoutHeightChars: when Layout=true, the total number of
	// emitted lines (extra blank lines at the bottom pad to this
	// height). If 0, defaults to round(page.Height / YDensity).
	LayoutHeightChars int

	// XDensity / YDensity: PDF points per character / per line when
	// computing layout grid dimensions. Default values match
	// pdfplumber (XDensity=7.25, YDensity=13) — roughly the metrics
	// of 10pt Helvetica.
	XDensity float64
	YDensity float64

	// UseTextFlow / HorizontalLTR / VerticalTTB / ExtraAttrs are
	// passed through to the underlying WordExtractor.
	UseTextFlow   bool
	HorizontalLTR bool
	VerticalTTB   bool
	ExtraAttrs    []string

	// Expand passes through to WordOpts.Expand.
	Expand bool
}

TextOpts configures Page.ExtractText. Like WordOpts, the zero value is not useful; call DefaultTextOpts() for sensible defaults.
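The fallback for LayoutWidthChars and LayoutHeightChars is plain arithmetic, shown here for a US Letter page. `layoutGrid` is a hypothetical helper, and the use of math.Round (half away from zero) is an assumption about how ties are handled.

```go
package main

import (
	"fmt"
	"math"
)

// layoutGrid computes the layout grid size from the page dimensions
// and densities, following round(Width / XDensity) and
// round(Height / YDensity). Hypothetical helper for illustration.
func layoutGrid(width, height, xDensity, yDensity float64) (cols, rows int) {
	cols = int(math.Round(width / xDensity))
	rows = int(math.Round(height / yDensity))
	return cols, rows
}

func main() {
	// US Letter is 612 x 792 points; densities are the documented defaults.
	cols, rows := layoutGrid(612, 792, 7.25, 13)
	fmt.Printf("%d columns x %d lines\n", cols, rows)
}
```

For this page, 612 / 7.25 is about 84.4 and 792 / 13 is about 60.9, so the grid is 84 characters wide and 61 lines tall.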

func DefaultTextOpts added in v0.1.0

func DefaultTextOpts() TextOpts

DefaultTextOpts returns pdfplumber-matching defaults.

type Word added in v0.1.0

type Word struct {
	// Text is the concatenated Unicode payload of the run. Ligature
	// glyphs are expanded into their constituent characters when
	// WordOpts.Expand is true (the default), so "ﬁle" (spelled with
	// the U+FB01 ligature) appears as "file" in the output.
	Text string

	// Bounding box of the run in PDF user space (origin at bottom-
	// left, Y growing up). The bbox is the union of every char's bbox
	// in this run.
	X0, Y0, X1, Y1 float64

	// Upright is true if every char in the run was drawn in normal
	// reading orientation. We don't merge upright and rotated chars
	// into the same Word — they end up in different runs.
	Upright bool

	// Direction is one of "ltr", "rtl", "ttb", "btt". Most words on
	// most pages are "ltr"; rotated stamps may be "ttb"; Arabic/Hebrew
	// content is "rtl". The value is the direction the chars were
	// READ, not the direction they were drawn.
	Direction string

	// FontName / FontSize are copied from the first char in the run.
	// pdfplumber does the same: if a word straddles a font change,
	// only the leading font is reported. In practice such words are
	// rare, because many producers emit a new BT/ET pair on a font
	// change, which breaks the run at the content-stream level.
	FontName string
	FontSize float64

	// Chars is the slice of Char objects this word was assembled from.
	// Populated only when WordOpts.KeepChars is true (it costs O(n)
	// memory per word so we default to off). Useful for callers that
	// want to map word substrings back to glyph positions (highlight,
	// search) or to filter further by per-char attributes.
	Chars []Char
}

Word is one extracted text run. It bundles the assembled string and the bbox of the constituent chars, plus enough metadata for callers who want to filter/restyle on font properties (font name + size of the first char) or know which direction the run reads.

Field names map onto pdfplumber's word dict the way the rest of the package maps onto its char dict: X0/Y0/X1/Y1 instead of "x0"/"top"/"x1"/"bottom". Y0 is the descender (lower edge of the lowest glyph in the run); Y1 is the ascender (upper edge of the tallest glyph).
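When comparing output against pdfplumber, note that its "top" and "bottom" are measured down from the top of the page, while Y0/Y1 here grow upward from the bottom. A minimal conversion sketch, assuming an unrotated page whose origin is at the bottom-left corner; `toTopBottom` is a hypothetical helper.

```go
package main

import "fmt"

// toTopBottom converts bottom-origin Y0/Y1 into pdfplumber-style
// top/bottom distances from the top of the page. It assumes an
// unrotated page with its origin at the bottom-left corner.
func toTopBottom(pageHeight, y0, y1 float64) (top, bottom float64) {
	top = pageHeight - y1    // upper edge, measured from the page top
	bottom = pageHeight - y0 // lower edge, measured from the page top
	return top, bottom
}

func main() {
	// A word spanning Y0=700 to Y1=712 on a 792pt-tall page.
	top, bottom := toTopBottom(792, 700, 712)
	fmt.Printf("top=%.0f bottom=%.0f\n", top, bottom)
}
```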

func (Word) Height added in v0.1.0

func (w Word) Height() float64

Height returns Y1 - Y0.

func (Word) Width added in v0.1.0

func (w Word) Width() float64

Width returns X1 - X0.

type WordOpts added in v0.1.0

type WordOpts struct {
	// XTolerance is the maximum horizontal gap (in PDF points) between
	// adjacent chars that still get merged into the same word.
	// Default: 3.
	XTolerance float64

	// YTolerance is the maximum vertical jitter between chars that
	// still get clustered onto the same line. Default: 3.
	YTolerance float64

	// KeepBlankChars: when false (the default), space chars in the
	// content stream are dropped before word grouping — the word
	// boundary is inferred from the gap, not from the explicit space.
	// Set to true to preserve them (e.g. for diff-style line
	// reconstruction).
	KeepBlankChars bool

	// UseTextFlow: when true, chars are processed in content-stream
	// order rather than re-sorted by position. This is faster and
	// often matches reading order in well-formed PDFs, but breaks for
	// PDFs that draw glyphs in random order (e.g. some scanner OCR
	// output).
	UseTextFlow bool

	// HorizontalLTR: when true (the default), upright text is read
	// left-to-right; when false, right-to-left. Setting this to false
	// is shorthand for Direction="rtl" but only for upright text.
	HorizontalLTR bool

	// VerticalTTB: when true (the default), rotated text is read top-
	// to-bottom; when false, bottom-to-top.
	VerticalTTB bool

	// ExtraAttrs is a list of Char field names that must match
	// EXACTLY for two chars to be merged into the same word. The
	// supported names are: "fontname", "size". Useful when a single
	// physical line has two runs that should be kept separate (e.g. a
	// bold caption followed by regular body text).
	ExtraAttrs []string

	// SplitAtPunctuation: when true, every ASCII punctuation char
	// (string.punctuation in Python) terminates the current word and
	// becomes its own one-char word. Default: false.
	SplitAtPunctuation bool

	// Expand: when true (the default), ligature glyphs (ﬁ, ﬂ, …) are
	// expanded into their constituent ASCII chars during text
	// assembly. The Char's text payload is preserved unchanged; only
	// the Word.Text string is expanded.
	Expand bool

	// KeepChars: when true, Word.Chars is populated with the source
	// chars. Off by default to save memory.
	KeepChars bool
}

WordOpts configures Page.Words. The zero value is NOT useful — call DefaultWordOpts() to get a populated struct with pdfplumber-compatible defaults, then override the fields you care about.

Naming matches pdfplumber's WordExtractor kwargs where possible (XTolerance → x_tolerance, KeepBlankChars → keep_blank_chars, etc.) to make porting examples between the two libraries straightforward.
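The role of XTolerance can be shown with a simplified grouping pass over the chars of one line. This is a sketch, not the library's word extractor: `glyph` and `groupWords` are hypothetical, blank chars are assumed to have been dropped already, and YTolerance, direction, and ExtraAttrs are ignored.

```go
package main

import (
	"fmt"
	"sort"
)

// glyph is a hypothetical, trimmed-down stand-in for Char.
type glyph struct {
	Text   string
	X0, X1 float64
}

// groupWords sorts the glyphs of one line by X0 and starts a new word
// whenever the gap to the previous glyph exceeds xTol.
func groupWords(line []glyph, xTol float64) []string {
	if len(line) == 0 {
		return nil
	}
	sorted := append([]glyph(nil), line...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].X0 < sorted[j].X0 })

	var words []string
	current := sorted[0].Text
	prevX1 := sorted[0].X1
	for _, g := range sorted[1:] {
		if g.X0-prevX1 > xTol {
			words = append(words, current)
			current = ""
		}
		current += g.Text
		prevX1 = g.X1
	}
	return append(words, current)
}

func main() {
	line := []glyph{
		{"H", 0, 6}, {"i", 6, 9},
		{"y", 20, 26}, {"o", 26, 32},
	}
	fmt.Printf("%q\n", groupWords(line, 3))
}
```

Lowering xTol splits tightly kerned text into more words; raising it merges words that are separated only by a narrow gap.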

func DefaultWordOpts added in v0.1.0

func DefaultWordOpts() WordOpts

DefaultWordOpts returns a WordOpts populated with pdfplumber-matching defaults. Use this and override the fields you care about:

opts := pdftable.DefaultWordOpts()
opts.XTolerance = 1.5
words, _ := page.Words(opts)

Directories

Path Synopsis
cmd/pdftable (command)
cmd/pdftable is the command-line interface to the pdftable library.
examples/extract_tables (command)
examples/extract_tables/main.go is the runnable form of the README's "Tables (lines strategy)" example.
internal/layout
Package layout owns the lower-level geometry primitives that drive table-finding: edges, edge-derivation from Lines/Rects/Curves, and edge merging (snap + join).
internal/pdf
Package pdf is the internal content-stream interpreter for pdftable.
