Documentation ¶
Overview ¶
Package pdftable is a Go-native port of Python's pdfplumber. It reads a PDF document, walks the content streams, and surfaces the positioned primitives (characters, lines, rectangles, curves) that higher-level layout algorithms — text extraction, word grouping, table detection — operate on.
The library is structured in layers that mirror the pdfplumber + pdfminer.six split:
- This package (pdftable) is the public API. It is small, stable, and contains no PDF parsing logic of its own — it just exposes Document, Page, Char, Line, Rect, Curve and constructs them from the internal package.
- github.com/hallelx2/pdftable/internal/pdf is the content-stream interpreter. It is intentionally not public: the data shapes there will evolve as we add more PDF features, and we don't want callers depending on them.
Typical usage:
doc, err := pdftable.OpenFile("report.pdf")
if err != nil {
	return err
}
defer doc.Close()
for n, p := range doc.Pages() {
	chars, err := p.Chars()
	if err != nil {
		return err
	}
	fmt.Printf("page %d: %d chars\n", n, len(chars))
}
Phase scope: v0.1.0 shipped content-stream primitives plus text extraction (Page.Words, Page.ExtractText, Page.ExtractTextSimple). Table finding (Page.FindTables, Page.ExtractTables) followed: the table types on this page are marked as added in v0.2.0, and all four pdfplumber strategies are implemented as of v0.3.0. See the README for the roadmap. The Page interface is additive across releases; v0.0.1 callers using only Chars/Lines/Rects/Curves continue to compile against v0.1.0 without changes.
Index ¶
- Variables
- type BBox
- func (b BBox) Area() float64
- func (b BBox) Contains(other BBox) bool
- func (b BBox) ContainsPoint(x, y float64) bool
- func (b BBox) Height() float64
- func (b BBox) Intersect(other BBox) (BBox, bool)
- func (b BBox) IsZero() bool
- func (b BBox) Snap(step float64) BBox
- func (b BBox) Union(other BBox) BBox
- func (b BBox) Width() float64
- type Char
- type Curve
- type Document
- type Intersection
- type Line
- type Objects
- type Page
- type Rect
- type Table
- type TableBox
- type TableFinder
- type TableSettings
- type TableStrategy
- type TextOpts
- type Word
- type WordOpts
Variables ¶
var (
	// ErrInvalidPDF is returned by Open / OpenBytes / OpenFile when the
	// input bytes can't be parsed as a PDF. The underlying pdfcpu error
	// is wrapped so callers can still inspect the details with errors.As.
	ErrInvalidPDF = errors.New("pdftable: invalid PDF")

	// ErrPageOutOfRange is returned by Document.Page when n is < 1 or
	// > NumPages(). The PDF page index is 1-based, matching pdfplumber.
	ErrPageOutOfRange = errors.New("pdftable: page out of range")

	// ErrUnsupported is returned when we hit a PDF feature this library
	// does not yet implement (e.g. an exotic CMap, an unsupported XObject
	// subtype, vertical writing). The error string names the feature.
	ErrUnsupported = errors.New("pdftable: unsupported feature")

	// ErrEncrypted is returned when the PDF is encrypted and we can't
	// decrypt it with the empty password. Full encryption support is
	// out of scope for the initial release; callers that need it can
	// pre-decrypt with pdfcpu's api.Decrypt and feed the cleaned bytes
	// to OpenBytes.
	ErrEncrypted = errors.New("pdftable: encrypted PDF (decryption not yet supported)")
)
Sentinel errors returned by the public API. Callers can match these with errors.Is(); functions that surface a parser-level problem from pdfcpu or the content-stream interpreter wrap the underlying error so the cause is preserved.
We keep this set small on purpose. The PDF spec has hundreds of failure modes — most of them collapse into "the bytes do not look like a PDF", "you asked for a page that doesn't exist", or "we don't implement this feature yet". Anything more specific belongs in the wrapped error string, not as a new sentinel.
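As a rough illustration of the matching pattern described above, the sketch below declares a local stand-in for ErrPageOutOfRange so that it compiles without the library. The lookupPage and isOutOfRange helpers are hypothetical names invented for this example; they are not part of pdftable.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPageOutOfRange is a local stand-in for the pdftable sentinel of the
// same name, declared here only so the example is self-contained.
var ErrPageOutOfRange = errors.New("pdftable: page out of range")

// lookupPage is a hypothetical helper: it validates a 1-based page number
// and wraps the sentinel with extra detail when the number is invalid.
func lookupPage(n, numPages int) error {
	if n < 1 || n > numPages {
		return fmt.Errorf("page %d of %d: %w", n, numPages, ErrPageOutOfRange)
	}
	return nil
}

// isOutOfRange matches the sentinel even when it has been wrapped.
func isOutOfRange(err error) bool {
	return errors.Is(err, ErrPageOutOfRange)
}

func main() {
	err := lookupPage(7, 3)
	if isOutOfRange(err) {
		fmt.Println("out of range:", err)
	}
}
```

Because the detail travels in the wrapped error string, a direct `==` comparison against the sentinel fails here while errors.Is succeeds.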
Types ¶
type BBox ¶ added in v0.1.0
type BBox struct {
X0, Y0, X1, Y1 float64
}
BBox is the canonical four-tuple bounding-box helper that the layout algorithms (clustering, word grouping, text extraction) operate on. Field naming follows the Char/Line/Rect convention used throughout the package: x0,y0 is the lower-left corner and x1,y1 is the upper-right corner in PDF user space (origin at bottom-left, Y growing up).
We expose BBox as a value type — small, stack-allocatable, trivially copyable. Algorithms that need to pass a bbox around without poking at the larger Char/Rect/Line wrappers can construct one with NewBBox or pull one out with the BBoxOf helpers below.
The Go API intentionally chooses (X0,Y0,X1,Y1) over pdfplumber's dict-of-strings ({"x0","top","x1","bottom"}). The two flavours differ because pdfplumber operates in image space (Y growing down, "top" = small Y, "bottom" = large Y) and we operate in PDF user space (Y growing up). Comments call out the mapping wherever it matters.
func BBoxOfChar ¶ added in v0.1.0
BBoxOfChar returns the bounding box of a Char.
func BBoxOfChars ¶ added in v0.1.0
BBoxOfChars returns the smallest bbox enclosing every char in cs. Returns the zero BBox for an empty slice.
func MergeBBoxes ¶ added in v0.1.0
MergeBBoxes returns the smallest bbox enclosing every input. Empty input returns a zero BBox. This mirrors pdfplumber's merge_bboxes and objects_to_bbox helpers — the typical caller has a slice of Chars and wants the combined bounding box for the resulting Word.
func NewBBox ¶ added in v0.1.0
NewBBox builds a BBox and normalises it so X0<=X1 and Y0<=Y1. Algorithms downstream rely on the normal form, so we never let an inverted bbox leak past this constructor.
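A minimal sketch of that normalisation, using a local BBox type that mirrors the documented fields. The signature is not shown on this page, so the parameter order (x0, y0, x1, y1) is an assumption, and the body is an illustration rather than the library's source.

```go
package main

import "fmt"

// BBox is a local stand-in mirroring the four-field struct documented above.
type BBox struct {
	X0, Y0, X1, Y1 float64
}

// NewBBox sketches the documented normalisation: coordinates are swapped
// where needed so that X0 <= X1 and Y0 <= Y1. Parameter order is assumed.
func NewBBox(x0, y0, x1, y1 float64) BBox {
	if x0 > x1 {
		x0, x1 = x1, x0
	}
	if y0 > y1 {
		y0, y1 = y1, y0
	}
	return BBox{X0: x0, Y0: y0, X1: x1, Y1: y1}
}

func main() {
	// Corners supplied in inverted order come back in normal form.
	fmt.Printf("%+v\n", NewBBox(100, 700, 50, 650)) // {X0:50 Y0:650 X1:100 Y1:700}
}
```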
func (BBox) Contains ¶ added in v0.1.0
Contains reports whether b fully encloses other. Edges are considered inside (>= on the low side, <= on the high side), so a bbox contains itself.
func (BBox) ContainsPoint ¶ added in v0.1.0
ContainsPoint reports whether (x,y) lies inside b (inclusive on edges).
func (BBox) Intersect ¶ added in v0.1.0
Intersect returns the overlapping rectangle of b and other, and a boolean reporting whether the intersection has non-empty area (i.e. the two bboxes actually overlap). This mirrors pdfplumber's get_bbox_overlap, which returns None when the boxes don't touch and the overlapping bbox otherwise.
We treat touching-but-not-overlapping (shared edge, zero area) as non-overlap, matching pdfplumber's `o_height + o_width > 0` check — a single-line ruler that grazes a word's bbox should not be reported as "intersecting" the word.
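The zero-area rule can be sketched as follows. This is an illustration written against a local BBox type, not the library's implementation; returning the zero BBox alongside false is an assumption, since the text above only specifies the boolean.

```go
package main

import (
	"fmt"
	"math"
)

// BBox is a local stand-in for the documented struct.
type BBox struct{ X0, Y0, X1, Y1 float64 }

// Intersect returns the overlap of b and other. The boolean is true only
// when the overlap has strictly positive width and height, so boxes that
// merely share an edge or a corner report false.
func (b BBox) Intersect(other BBox) (BBox, bool) {
	out := BBox{
		X0: math.Max(b.X0, other.X0),
		Y0: math.Max(b.Y0, other.Y0),
		X1: math.Min(b.X1, other.X1),
		Y1: math.Min(b.Y1, other.Y1),
	}
	if out.X1 <= out.X0 || out.Y1 <= out.Y0 {
		return BBox{}, false // assumption: zero BBox when there is no overlap
	}
	return out, true
}

func main() {
	word := BBox{10, 10, 50, 20}
	rule := BBox{0, 20, 100, 20.5} // thin rule resting on the word's top edge
	_, ok := word.Intersect(rule)
	fmt.Println(ok) // false

	overlap, ok := word.Intersect(BBox{40, 15, 80, 30})
	fmt.Println(overlap, ok) // {40 15 50 20} true
}
```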
func (BBox) IsZero ¶ added in v0.1.0
IsZero reports whether the bbox is the zero value (all four fields equal to zero). Useful for "did I forget to populate this" checks.
func (BBox) Snap ¶ added in v0.1.0
Snap rounds each of b's four coordinates to the nearest multiple of step. Used by layout-analysis code to coalesce near-equal positions (e.g. ruling lines drawn at 99.9, 100.0, 100.1) before clustering. A step of 0 returns the original bbox unchanged.
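One way to read that description is the sketch below, which rounds with math.Round on a local BBox type. Whether the library rounds halves the same way, or how it treats a negative step, is not stated here, so treat those details as assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// BBox is a local stand-in for the documented struct.
type BBox struct{ X0, Y0, X1, Y1 float64 }

// Snap rounds every coordinate to the nearest multiple of step.
// A step of 0 returns the box unchanged, as documented.
func (b BBox) Snap(step float64) BBox {
	if step == 0 {
		return b
	}
	round := func(v float64) float64 { return math.Round(v/step) * step }
	return BBox{round(b.X0), round(b.Y0), round(b.X1), round(b.Y1)}
}

func main() {
	// Near-equal ruling positions collapse onto the same grid value.
	fmt.Println(BBox{99.9, 100.1, 200.4, 199.6}.Snap(1)) // {100 100 200 200}
}
```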
func (BBox) Union ¶ added in v0.1.0
Union returns the smallest bbox enclosing both b and other.
We DON'T treat a zero-value BBox as "empty" here — a caller that passes BBox{} to Union genuinely means "enclose the origin point". Use MergeBBoxes when you have a slice and want it to be a no-op on the empty slice.
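The difference between the two helpers can be seen in a small sketch. It uses a local BBox type, and the slice parameter of MergeBBoxes is an assumption because the real signature is not reproduced on this page.

```go
package main

import (
	"fmt"
	"math"
)

// BBox is a local stand-in for the documented struct.
type BBox struct{ X0, Y0, X1, Y1 float64 }

// Union returns the smallest box enclosing both b and other. A zero-value
// argument is treated as a real box at the origin, not as "empty".
func (b BBox) Union(other BBox) BBox {
	return BBox{
		X0: math.Min(b.X0, other.X0),
		Y0: math.Min(b.Y0, other.Y0),
		X1: math.Max(b.X1, other.X1),
		Y1: math.Max(b.Y1, other.Y1),
	}
}

// MergeBBoxes folds Union over the inputs, starting from the first element
// so the origin is never pulled in. An empty input yields the zero BBox.
func MergeBBoxes(boxes []BBox) BBox {
	if len(boxes) == 0 {
		return BBox{}
	}
	out := boxes[0]
	for _, b := range boxes[1:] {
		out = out.Union(b)
	}
	return out
}

func main() {
	b := BBox{10, 10, 20, 20}
	fmt.Println(b.Union(BBox{}))          // {0 0 20 20}: the origin is enclosed
	fmt.Println(MergeBBoxes([]BBox{b}))   // {10 10 20 20}
	fmt.Println(MergeBBoxes(nil))         // {0 0 0 0}
}
```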
type Char ¶
type Char struct {
// Text is the Unicode payload of this glyph (one or more runes).
// Empty string means the font's encoding and ToUnicode CMap both
// failed to map the glyph; the bbox still describes where the
// glyph was drawn so downstream layout heuristics can use the
// positioning even when we can't read the text.
Text string
// Bounding box in PDF user space, with x0 <= x1 and y0 <= y1.
// Y0 is the descender baseline of the glyph; Y1 is the top of the
// ascender. The bbox is the typographic cell, not the ink — it
// matches what pdfplumber reports.
X0, Y0, X1, Y1 float64
// FontName is the /BaseFont (or /Name) value from the font dict,
// e.g. "Helvetica" or "ABCDEF+TimesNewRoman-Bold". It's surfaced
// verbatim — callers that want to detect bold weight should match
// on substrings ("bold", "-bd").
FontName string
// FontSize is the size the glyph was rendered at, in PDF user space
// units (i.e. after applying the text matrix and CTM). For a glyph
// drawn with `12 Tf` and identity CTM this is 12.0; if the CTM
// scales by 2x it's 24.0.
FontSize float64
// Upright is true when the glyph is drawn in normal reading
// orientation (a*d > 0 and b*c <= 0 in the combined text matrix).
// Rotated and mirrored glyphs report false. pdfplumber uses the
// same predicate to skip vertical-stamp text during table finding.
Upright bool
// Advance is the horizontal advance width the text matrix moved by
// after drawing this glyph (in user-space units). It already
// includes character spacing, word spacing for the space glyph,
// and the font's per-glyph width — i.e. the actual ink-to-ink
// distance to the next glyph.
Advance float64
}
Char is a single positioned glyph on a page. Adjacent Chars do NOT share state — each carries its own font, size, and absolute bbox so downstream code (word grouping, table finding, text extraction) can reason about each glyph in isolation.
Text is the Unicode string this glyph represents. For most fonts a glyph maps to a single code point ("A"), but Adobe ligature glyph names ("fi", "ffi") and ToUnicode maps with multi-character bfchar entries can produce multi-rune strings — we never split them.
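The Upright predicate quoted in the field comments (a*d > 0 and b*c <= 0) is easy to check by hand. The sketch below assumes the usual PDF matrix layout [a b c d e f] and uses a hypothetical upright function name; it is not taken from the library.

```go
package main

import "fmt"

// upright applies the documented predicate to the first four components of
// a combined text matrix [a b c d e f]. The translation (e, f) is irrelevant.
func upright(a, b, c, d float64) bool {
	return a*d > 0 && b*c <= 0
}

func main() {
	fmt.Println(upright(1, 0, 0, 1))  // identity: true
	fmt.Println(upright(2, 0, 0, 2))  // uniform 2x scale: true
	fmt.Println(upright(0, 1, -1, 0)) // rotated 90 degrees: false
	fmt.Println(upright(-1, 0, 0, 1)) // mirrored horizontally: false
}
```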
type Curve ¶
type Curve struct {
// Points are the vertices of the path in order. For curve segments,
// only the on-path endpoints are emitted (control points are
// dropped) — this matches pdfplumber's "pts" field, which is also
// just the endpoints. Callers that need precise curve geometry
// should request that extension; we don't expect tables to use
// curves so dropping control points is fine for the initial scope.
Points [][2]float64
Stroke bool
Fill bool
Width float64
}
Curve is any path that contains at least one curve segment (`c`, `v`, or `y`), or any other path that doesn't reduce to a single Rect or a series of straight Lines. Points are the on-path vertices in order; as the field comment notes, intermediate control points are dropped. Callers that need actual Bezier shapes can reconstruct them from the operator stream. For now we keep the simpler flattened representation, which is enough for "did the page have decorative curves on it" detection.
type Document ¶
type Document interface {
// NumPages returns the page count. Always >= 1 for a valid PDF.
NumPages() int
// Page returns the n'th page (1-indexed). Returns
// ErrPageOutOfRange if n is < 1 or > NumPages().
Page(n int) (Page, error)
// Pages returns a range iterator over (n, Page) pairs for use
// with Go 1.23+ range-over-func syntax:
//
// for n, p := range doc.Pages() {
// // ...
// }
//
// The iterator yields pages in 1-based order. It does NOT clone
// the underlying Document — Close() on the doc still tears down
// the iterator's pages.
Pages() iter.Seq2[int, Page]
// Close releases any resources held by the underlying parser.
// For pdfcpu-backed documents this is currently a no-op (pdfcpu
// holds everything in memory), but callers should still defer
// Close() so that future backends (mmap, file handles) work.
Close() error
}
Document represents one open PDF file. The interface (not a struct) keeps the API symmetric with Page and lets us swap implementations later — e.g. a Document that streams pages lazily from a remote blob — without breaking callers.
All accessors are safe to call concurrently. The underlying pdfcpu *model.Context is treated as read-only after the document is opened.
func Open ¶
Open reads an entire io.Reader into memory and parses it as a PDF. pdfcpu's API requires an io.ReadSeeker; we buffer the input so the caller doesn't have to.
Use OpenBytes if you already have the file content as a []byte (avoids an extra copy), or OpenFile if you have a path.
type Intersection ¶ added in v0.2.0
type Intersection struct {
X, Y float64
V []layout.Edge // vertical edges passing through (X, Y)
H []layout.Edge // horizontal edges passing through (X, Y)
}
Intersection records one crossing point: an (x, y) tuple plus the vertical and horizontal edges that meet there. We need the edge sets (not just the count) because the cell-finder asks "does the same edge connect points p1 and p2?" — checking that two points lie on a shared edge is how the algorithm distinguishes "two intersections on the same ruler" from "two intersections on parallel rulers that happen to align".
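Field naming follows pdfplumber's intersections dict-of-dicts shape: X/Y are the keys, V/H are the value lists. V and H are plain slice fields, so an Intersection cannot be compared with == (Go does not define equality for structs that contain slices); two Intersections are identified by their (X, Y) pair alone.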
type Line ¶
type Line struct {
X0, Y0, X1, Y1 float64
// Stroke is true if the path was stroked (S, s, B, b, B*, b*).
// Lines emitted from non-stroked paths are dropped before the
// caller sees them — but the field is preserved for symmetry
// with Rect/Curve so layout code can branch on it uniformly.
Stroke bool
// Width is the stroke line width in user-space units at the time
// the line was emitted. For ruling lines this is typically 0.5–1.0.
Width float64
}
Line is a single straight-line segment emitted by an `S` (stroke) operator on a path that contained an `l` segment. Each `l` becomes one Line; a rectangle drawn with `re` does NOT decompose into four Lines (it becomes a single Rect) — that's how pdfplumber tells the two apart, and we keep the same distinction so downstream code can pick the right collection for table-finding.
type Objects ¶
Objects is the bundle of all primitive page objects, returned by Page.Objects(). It's a convenience for callers that want everything in one shot rather than four separate Page method calls. The per-type accessors (Chars/Lines/Rects/Curves) are independent and safe to call concurrently, but they each redo the page content-stream walk, so Objects() is the cheaper choice when you need it all.
type Page ¶
type Page interface {
// Number returns the page number (1-based).
Number() int
// Width and Height return the page's mediabox dimensions in PDF
// points (1/72 inch), already adjusted for the page's /Rotate
// entry — so a portrait letter-sized page rotated 90 degrees
// reports Width=792 Height=612 (landscape), matching what a
// PDF viewer would display.
Width() float64
Height() float64
// Chars walks the page and returns every positioned glyph. The
// order is content-stream order — i.e. the order the producer
// drew them, NOT visual reading order. Downstream layout code
// (extract_text, find_tables) sorts the chars by position.
Chars() ([]Char, error)
// Lines returns every straight-line segment drawn on the page.
// Each `l` segment in the content stream becomes one Line.
// Rectangles drawn via `re` are NOT decomposed into four Lines;
// they're reported through Rects() instead.
Lines() ([]Line, error)
// Rects returns every rectangle drawn via the `re` operator.
// Both stroked and filled rectangles are returned; the Stroke
// and Fill flags say which.
Rects() ([]Rect, error)
// Curves returns every Bezier or composite path that isn't a
// pure line-segment chain or a single rect.
Curves() ([]Curve, error)
// Objects returns Chars + Lines + Rects + Curves in a single
// walk. Use this when you need all four — it's strictly
// cheaper than calling each accessor separately because the
// content stream is parsed exactly once.
Objects() (Objects, error)
// Words extracts positioned text runs from the page. A "word"
// is a contiguous group of chars whose horizontal gaps are
// within WordOpts.XTolerance and whose vertical positions
// agree within WordOpts.YTolerance. Pass DefaultWordOpts() to
// use pdfplumber-matching defaults. See WordOpts for the full
// configuration surface.
//
// Returns an empty slice (not nil) when the page contains no
// extractable text.
Words(opts WordOpts) ([]Word, error)
// ExtractText returns the page's text as a single string. By
// default words on the same line are joined with a single
// space and lines are joined with "\n". When TextOpts.Layout is
// true, the output preserves spatial layout (column-aligned
// text, blank lines for vertical gaps) at the cost of more
// whitespace. Pass DefaultTextOpts() for pdfplumber-matching
// defaults.
ExtractText(opts TextOpts) (string, error)
// ExtractTextSimple is a no-frills extraction that clusters
// chars by visual line and joins them by gap detection. Use
// when ExtractText's word-grouping heuristics produce undesired
// results on adversarial input.
ExtractTextSimple(xTolerance, yTolerance float64) (string, error)
// FindTables runs the geometry-only stage of the table-finding
// pipeline: derive edges from the page primitives, snap+join
// into rulers, scan for intersections, assemble cells, group
// cells into tables. Returns one TableFinder per detected
// table-group so callers building debugging tools can inspect
// the intermediate stages (edges / intersections / raw cells)
// alongside the assembled per-table CellsGrid.
//
// v0.3.0 supports all four pdfplumber strategies: "lines",
// "lines_strict", "text", and "explicit". Each axis (vertical,
// horizontal) selects its strategy independently, so mixed
// settings like vertical="text" + horizontal="lines" work as
// expected.
FindTables(settings TableSettings) ([]TableFinder, error)
// ExtractTables wraps FindTables and runs per-cell text
// extraction on every detected table. Cells with no chars
// produce an empty string. Leading and trailing whitespace
// inside each cell is stripped. Returns the slice of fully
// populated Table structs in visual top-to-bottom-left-to-right
// order.
ExtractTables(settings TableSettings) ([]*Table, error)
}
Page is one page of a PDF document. The interface (not a struct) is intentional: it lets us swap implementations later (e.g. for a streaming PDF parser) without breaking callers, and it makes the API surface easy to mock in tests.
Every accessor (Chars, Lines, Rects, Curves, Objects) walks the page content stream from scratch. We do NOT cache between calls because:
- Callers that need ALL the objects call Objects() once.
- Callers that need just the chars (say, for text extraction) don't pay for the path-painting machinery they aren't using.
- Caching means deciding when to invalidate, which is moot because a Page is immutable from the caller's perspective.
Pages are 1-indexed, matching pdfplumber. Number() returns the 1-based index so callers can format error messages without re-tracking which Page they were given.
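The rotation adjustment mentioned in the Width/Height comments can be sketched as a small helper. The displayedSize name is hypothetical, and the sketch assumes /Rotate is a multiple of 90, as the PDF format requires.

```go
package main

import "fmt"

// displayedSize returns the width and height a viewer would show for a page
// whose unrotated mediabox is w x h points and whose /Rotate entry is rotate.
// Quarter turns (90, 270) swap the two dimensions.
func displayedSize(w, h float64, rotate int) (float64, float64) {
	r := ((rotate % 360) + 360) % 360 // normalise negatives and values >= 360
	if r == 90 || r == 270 {
		return h, w
	}
	return w, h
}

func main() {
	w, h := displayedSize(612, 792, 90) // portrait letter rotated 90 degrees
	fmt.Println(w, h)                   // 792 612
}
```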
type Rect ¶
type Rect struct {
X0, Y0, X1, Y1 float64
// Stroke and Fill mirror the painting operator that closed the
// path: S/s set Stroke; f/F/f* set Fill; B/b/B*/b* set both.
// Either or both can be true — but not neither (paths with `n`
// produce nothing).
Stroke bool
Fill bool
// Width is the stroke line width at the time the rectangle was
// painted, in user-space units. Zero when Stroke is false.
Width float64
}
Rect is a rectangle path emitted by an `re` operator (a single PDF instruction that draws a closed box). pdfplumber tracks these separately from generic four-segment paths because table grids, borders, and shaded cells are nearly always drawn this way — keeping the distinction makes table detection much more reliable.
type Table ¶ added in v0.2.0
type Table struct {
// Rows is the table's text content as a 2-D slice. Row 0 is the
// VISUALLY TOP row of the table; column 0 is the leftmost. Empty
// cells appear as "". Missing cells (when a row has fewer columns
// than the table's column count, because the underlying cell
// detection found a hole) are also "" — we promote missing to
// empty so callers don't have to nil-check every entry.
Rows [][]string
// BBox is the union of every cell's bbox, in PDF user-space
// coordinates (origin bottom-left, Y growing up).
BBox BBox
// Page is the 1-based page number the table was found on, copied
// from the originating Page so callers can carry results across
// page boundaries without holding Page references.
Page int
// CellsBBox is the per-cell bbox aligned to Rows: CellsBBox[i][j]
// is the bbox of Rows[i][j]. Useful for re-rendering with
// highlight overlays, or for re-cropping the page to extract the
// cell's contents in a richer format than plain text.
CellsBBox [][]BBox
}
Table is the extracted result for one detected table. It carries the assembled cell texts plus the geometry needed for downstream consumers (re-rendering, click-through to source positions).
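Because Rows is a plain rectangular [][]string with empty strings for empty or missing cells, it can be handed directly to the standard encoding/csv writer. The sketch below uses a local Table type carrying only Rows, and tableToCSV is a hypothetical helper name.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// Table is a local stand-in carrying only the Rows field documented above.
type Table struct {
	Rows [][]string
}

// tableToCSV renders the table's rows as CSV text, top row first.
func tableToCSV(t Table) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	// WriteAll writes every record and flushes before returning.
	if err := w.WriteAll(t.Rows); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	t := Table{Rows: [][]string{{"Item", "Qty"}, {"Widget", ""}}}
	out, err := tableToCSV(t)
	if err != nil {
		fmt.Println("csv error:", err)
		return
	}
	fmt.Print(out)
}
```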
type TableBox ¶ added in v0.2.0
type TableBox struct {
// BBox is the union of every cell's bbox.
BBox BBox
// Rows is the row count.
Rows int
// Cols is the column count.
Cols int
// CellsGrid is the per-cell bbox aligned to Rows × Cols. The
// entry at [i][j] is the bbox of the cell at visual row i (0 is
// topmost) and column j (0 is leftmost). Empty cells are the zero
// BBox.
CellsGrid [][]BBox
}
TableBox is one detected table, expressed as a bbox plus a 2-D grid of cell bboxes. Rows are visually top-to-bottom; columns are left-to-right. CellsGrid[i][j] gives the bbox of the cell at row i, column j; missing cells (rectangular gaps in the grid) are reported as the zero BBox, NOT removed. Callers can detect "this cell was missing" by checking IsZero on the entry.
This is the geometry-only intermediate between FindTables and ExtractTables: FindTables returns one of these per detected table; ExtractTables then runs text-extraction per cell and wraps the result in a Table.
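Detecting the gaps in a CellsGrid is a matter of scanning for zero-value entries. The sketch below uses local stand-ins for BBox and its IsZero method; missingCells is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// BBox is a local stand-in for the documented struct.
type BBox struct{ X0, Y0, X1, Y1 float64 }

// IsZero reports whether all four fields are zero.
func (b BBox) IsZero() bool { return b == (BBox{}) }

// missingCells returns the (row, col) index of every zero-value entry in a
// grid laid out like TableBox.CellsGrid (row 0 is the topmost row).
func missingCells(grid [][]BBox) [][2]int {
	var missing [][2]int
	for i, row := range grid {
		for j, cell := range row {
			if cell.IsZero() {
				missing = append(missing, [2]int{i, j})
			}
		}
	}
	return missing
}

func main() {
	grid := [][]BBox{
		{{0, 50, 50, 100}, {50, 50, 100, 100}},
		{{0, 0, 50, 50}, {}}, // bottom-right cell was not detected
	}
	fmt.Println(missingCells(grid)) // [[1 1]]
}
```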
type TableFinder ¶ added in v0.2.0
type TableFinder struct {
// Edges is the merged, length-filtered edge list used as the
// input to the intersection scan. Useful for debugging "why
// didn't this rule get picked up" issues.
Edges []layout.Edge
// Intersections is the full set of edge crossings, keyed by
// (X, Y). The order is deterministic — sorted by Y descending,
// then X ascending — so callers can rely on iteration order.
Intersections []Intersection
// Cells is the raw list of detected cell bboxes BEFORE grouping
// into tables. Each is a single rectangle whose four corners are
// intersections joined by shared edges.
Cells []BBox
// Tables is the final list of detected tables. Each carries a
// bbox plus a CellsGrid aligned to row/column order. Tables are
// sorted top-to-bottom-then-left-to-right by their topmost cell.
Tables []TableBox
}
TableFinder is the geometry-only result of running the cells-from-edges pipeline on a page. It exposes the intermediate stages (edges, intersections, raw cells) alongside the assembled TableBox list so callers building debugging tools or custom text-extraction can see exactly what the pipeline produced.
Pdfplumber bundles the page reference inside its TableFinder and exposes Table objects with an .extract() method; we keep the finder a pure value (no Page pointer) and let callers either grab the assembled Tables from Page.ExtractTables or compose their own text-fill loop using the public Cells and CellsGrid.
type TableSettings ¶ added in v0.2.0
type TableSettings struct {
// VerticalStrategy picks the source of vertical edges.
// Default: StrategyLines.
VerticalStrategy TableStrategy
// HorizontalStrategy picks the source of horizontal edges.
// Default: StrategyLines.
HorizontalStrategy TableStrategy
// SnapTolerance is the perpendicular-axis tolerance for clustering
// near-collinear edges before joining (PDF points). Default: 3.
SnapTolerance float64
// JoinTolerance is the along-direction gap that still gets merged
// during the join pass (PDF points). Default: 3.
JoinTolerance float64
// EdgeMinLength drops merged edges shorter than this (PDF points).
// Default: 3.
EdgeMinLength float64
// EdgeMinLengthPrefilter drops raw edges before merging
// (PDF points). Default: 1 — kills hairline construction
// segments that snap+join shouldn't pull together.
EdgeMinLengthPrefilter float64
// IntersectionTolerance is the slack used when testing whether a
// vertical edge crosses a horizontal edge — accounts for tiny
// gaps between the end of a stroked line and the start of the
// next (PDF points). Default: 3.
IntersectionTolerance float64
// TextTolerance is forwarded to the per-cell text-extraction call
// inside ExtractTables. It overrides both x_tolerance and
// y_tolerance of the underlying WordExtractor. Default: 3.
TextTolerance float64
// MinWordsVertical / MinWordsHorizontal control the "text"
// strategy thresholds. A candidate column-boundary cluster must
// contain at least MinWordsVertical words sharing X0 / X1 /
// centre alignment to be promoted to a vertical edge; row
// boundaries need MinWordsHorizontal words sharing a top edge.
// The defaults (3 / 1) mirror those in pdfplumber's
// table.py:11-12. These fields are ignored when the corresponding
// strategy is anything other than "text".
MinWordsVertical int
MinWordsHorizontal int
// KeepBlankChars is forwarded to the per-cell WordExtractor.
// Default: false (matches pdfplumber's text_keep_blank_chars).
KeepBlankChars bool
// ExplicitVerticalLines / ExplicitHorizontalLines hold caller-
// supplied edge positions. With StrategyLines, StrategyLinesStrict,
// or StrategyText they are ADDED to the derived edges; with
// StrategyExplicit they ARE the only source of edges on that axis.
// Useful when a column or row boundary is invisible in the PDF but
// known from an external source.
//
// Values are X coordinates for vertical lines, Y coordinates for
// horizontal lines, both in PDF user-space points. Non-finite
// values (NaN, Inf) are dropped with a log warning. When
// StrategyExplicit is selected on an axis, at least two
// coordinates must be supplied on that axis — fewer than two
// returns an error.
ExplicitVerticalLines []float64
ExplicitHorizontalLines []float64
}
TableSettings controls table finding. Construct via DefaultTableSettings() and override the fields you need — the zero value is NOT usable because the tolerances default to zero and the strategies are empty strings.
Field naming and defaults are 1:1 with pdfplumber's TableSettings dataclass (see pdfplumber/table.py:486-555). Where pdfplumber supports independent x/y tolerances via *_x_tolerance / *_y_tolerance fallbacks, we expose the shared field directly; explicit per-axis overrides can be added later if a real-world need surfaces.
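The rules stated for ExplicitVerticalLines / ExplicitHorizontalLines (non-finite values dropped, at least two coordinates required under the explicit strategy) might be enforced along these lines. cleanExplicitLines is a hypothetical helper; in particular, counting the coordinates after the non-finite ones are dropped is an assumption, and the log warning is omitted.

```go
package main

import (
	"fmt"
	"math"
)

// cleanExplicitLines drops NaN and infinite coordinates, then checks that at
// least two usable coordinates remain for the "explicit" strategy.
func cleanExplicitLines(coords []float64) ([]float64, error) {
	kept := make([]float64, 0, len(coords))
	for _, c := range coords {
		if math.IsNaN(c) || math.IsInf(c, 0) {
			continue // the documented behaviour also logs a warning here
		}
		kept = append(kept, c)
	}
	if len(kept) < 2 {
		return nil, fmt.Errorf("explicit strategy needs at least two coordinates, got %d", len(kept))
	}
	return kept, nil
}

func main() {
	xs, err := cleanExplicitLines([]float64{72, math.NaN(), 300, 540})
	fmt.Println(xs, err) // [72 300 540] <nil>
}
```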
func DefaultTableSettings ¶ added in v0.2.0
func DefaultTableSettings() TableSettings
DefaultTableSettings returns settings with the pdfplumber default values pre-populated. The intended pattern is:
settings := pdftable.DefaultTableSettings()
settings.VerticalStrategy = pdftable.StrategyLinesStrict
tables, err := page.ExtractTables(settings)
pdfplumber's defaults (table.py lines 9-12, 486-503):
DEFAULT_SNAP_TOLERANCE       = 3
DEFAULT_JOIN_TOLERANCE       = 3
DEFAULT_MIN_WORDS_VERTICAL   = 3
DEFAULT_MIN_WORDS_HORIZONTAL = 1
edge_min_length              = 3
edge_min_length_prefilter    = 1
intersection_tolerance       = 3
vertical_strategy            = "lines"
horizontal_strategy          = "lines"
text_x_tolerance/y_tolerance = 3
type TableStrategy ¶ added in v0.2.0
type TableStrategy string
TableStrategy is the enum of edge-derivation strategies. Each axis (vertical, horizontal) picks one independently. All four pdfplumber strategies are implemented as of v0.3.0.
const (
	// StrategyLines derives edges from drawn Lines, Rects (all four
	// sides), and Curves whose segments lie on an axis. Snap and join
	// tolerances are at their defaults, looser than lines_strict so
	// hand-drawn or jittery rules still merge.
	StrategyLines TableStrategy = "lines"

	// StrategyLinesStrict derives edges ONLY from drawn Lines.
	// Rectangle outlines and curve segments are ignored, even if they
	// look like a table grid. Use this when your PDF draws cell
	// backgrounds as filled rects that you do NOT want treated as row
	// boundaries.
	StrategyLinesStrict TableStrategy = "lines_strict"

	// StrategyText infers edges from word alignment. Vertical edges
	// come from clusters of words sharing X0 / X1 / centre positions;
	// horizontal edges from clusters sharing visual top. Best for
	// borderless tables (bank statements, narrative tables in 10-K
	// filings, scanned-then-OCR'd content) where the columns and
	// rows are conveyed by whitespace alignment rather than rules.
	// Tunable via MinWordsVertical (default 3) and
	// MinWordsHorizontal (default 1).
	StrategyText TableStrategy = "text"

	// StrategyExplicit uses caller-supplied coordinates from
	// ExplicitVerticalLines / ExplicitHorizontalLines as the only
	// source of edges on that axis. Useful when the table boundaries
	// are known from an external source (layout analysis, manual
	// annotation) and you want to bypass edge detection entirely.
	// The "explicit" strategy on an axis requires at least two
	// coordinates on that axis; fewer than two produces an error.
	StrategyExplicit TableStrategy = "explicit"
)
type TextOpts ¶ added in v0.1.0
type TextOpts struct {
XTolerance float64
YTolerance float64
// Layout: when true, the output preserves the page's spatial
// layout — words at the same x-position appear in the same column
// across lines, and lines that are far apart are separated by
// extra newlines. When false (the default), output is dense:
// words on the same line are joined with single spaces, lines
// with "\n".
Layout bool
// LayoutWidthChars: when Layout=true, the total width of each
// emitted line in characters. If 0, defaults to round(page.Width /
// XDensity).
LayoutWidthChars int
// LayoutHeightChars: when Layout=true, the total number of
// emitted lines (extra blank lines at the bottom pad to this
// height). If 0, defaults to round(page.Height / YDensity).
LayoutHeightChars int
// XDensity / YDensity: PDF points per character / per line when
// computing layout grid dimensions. Default values match
// pdfplumber (XDensity=7.25, YDensity=13) — roughly the metrics
// of 10pt Helvetica.
XDensity float64
YDensity float64
// UseTextFlow / HorizontalLTR / VerticalTTB / ExtraAttrs are
// passed through to the underlying WordExtractor.
UseTextFlow bool
HorizontalLTR bool
VerticalTTB bool
ExtraAttrs []string
// Expand passes through to WordOpts.Expand.
Expand bool
}
TextOpts configures Page.ExtractText. Like WordOpts, the zero value is not useful; call DefaultTextOpts() for sensible defaults.
func DefaultTextOpts ¶ added in v0.1.0
func DefaultTextOpts() TextOpts
DefaultTextOpts returns pdfplumber-matching defaults.
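To make the layout-grid fallbacks concrete, the sketch below works through the arithmetic described for LayoutWidthChars and LayoutHeightChars using the documented default densities (7.25 and 13). layoutGrid is a hypothetical helper, and the choice of math.Round for "round" is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// layoutGrid computes the character grid used when Layout is true and the
// explicit width/height are left at 0: round(width / XDensity) columns and
// round(height / YDensity) lines.
func layoutGrid(pageW, pageH, xDensity, yDensity float64) (cols, rows int) {
	cols = int(math.Round(pageW / xDensity))
	rows = int(math.Round(pageH / yDensity))
	return cols, rows
}

func main() {
	// US Letter, 612 x 792 points, with the documented default densities.
	cols, rows := layoutGrid(612, 792, 7.25, 13)
	fmt.Println(cols, rows) // 84 61
}
```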
type Word ¶ added in v0.1.0
type Word struct {
// Text is the concatenated Unicode payload of the run. Ligature
// glyphs are expanded into their constituent characters when
// WordOpts.Expand is true (the default), so "ﬁle" (written with
// the U+FB01 "ﬁ" ligature) appears as "file" in the output.
Text string
// Bounding box of the run in PDF user space (origin at bottom-
// left, Y growing up). The bbox is the union of every char's bbox
// in this run.
X0, Y0, X1, Y1 float64
// Upright is true if every char in the run was drawn in normal
// reading orientation. We don't merge upright and rotated chars
// into the same Word — they end up in different runs.
Upright bool
// Direction is one of "ltr", "rtl", "ttb", "btt". Most words on
// most pages are "ltr"; rotated stamps may be "ttb"; Arabic/Hebrew
// content is "rtl". The value is the direction the chars were
// READ, not the direction they were drawn.
Direction string
// FontName / FontSize are copied from the first char in the run.
// pdfplumber does the same — if a word straddles a font change,
// only the leading font is reported, but in practice such words
// are rare because changing font emits a new BT/ET pair which
// breaks the run boundary at the content-stream level.
FontName string
FontSize float64
// Chars is the slice of Char objects this word was assembled from.
// Populated only when WordOpts.KeepChars is true (it costs O(n)
// memory per word so we default to off). Useful for callers that
// want to map word substrings back to glyph positions (highlight,
// search) or to filter further by per-char attributes.
Chars []Char
}
Word is one extracted text run. It bundles the assembled string and the bbox of the constituent chars, plus enough metadata for callers who want to filter/restyle on font properties (font name + size of the first char) or know which direction the run reads.
Field names map onto pdfplumber's word dict the way the rest of the package maps onto its char dict: X0/Y0/X1/Y1 instead of "x0"/"top"/"x1"/"bottom". Y0 is the descender (lower edge of the lowest glyph in the run); Y1 is the ascender (upper edge of the tallest glyph).
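The coordinate mapping above can be sketched as follows. This is a minimal illustration, not the library's code: `box`, `union`, and `topBottom` are hypothetical names, and the conversion assumes an unrotated page whose height is known and whose origin sits at the bottom-left corner. pdfplumber's "top" and "bottom" are measured downward from the top of the page, so they come from flipping Y1 and Y0 respectively.

```go
package main

import (
	"fmt"
	"math"
)

// box is a hypothetical stand-in for the X0/Y0/X1/Y1 fields on Word and
// Char: PDF user space, origin at bottom-left, Y growing up.
type box struct {
	X0, Y0, X1, Y1 float64
}

// union returns the smallest box covering every input box, matching the
// description of the Word bbox as the union of its chars' bboxes.
func union(boxes []box) box {
	if len(boxes) == 0 {
		return box{}
	}
	u := boxes[0]
	for _, b := range boxes[1:] {
		u.X0 = math.Min(u.X0, b.X0)
		u.Y0 = math.Min(u.Y0, b.Y0)
		u.X1 = math.Max(u.X1, b.X1)
		u.Y1 = math.Max(u.Y1, b.Y1)
	}
	return u
}

// topBottom converts bottom-left-origin Y0/Y1 into pdfplumber-style
// top/bottom values. pageHeight is an assumed input; crop boxes and page
// rotation are ignored in this sketch.
func topBottom(b box, pageHeight float64) (top, bottom float64) {
	return pageHeight - b.Y1, pageHeight - b.Y0
}

func main() {
	chars := []box{
		{X0: 72, Y0: 700, X1: 78, Y1: 712},
		{X0: 78, Y0: 697, X1: 84, Y1: 709}, // glyph with a descender
	}
	w := union(chars)
	top, bottom := topBottom(w, 792) // 792 pt is the height of US Letter
	fmt.Printf("word bbox: %+v\n", w)
	fmt.Printf("top=%.0f bottom=%.0f\n", top, bottom)
}
```

When comparing output against pdfplumber, the flipped values are the ones to line up with its "top" and "bottom" keys; X0 and X1 need no conversion.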
type WordOpts ¶ added in v0.1.0
type WordOpts struct {
// XTolerance is the maximum horizontal gap (in PDF points) between
// adjacent chars that still get merged into the same word.
// Default: 3.
XTolerance float64
// YTolerance is the maximum vertical jitter between chars that
// still get clustered onto the same line. Default: 3.
YTolerance float64
// KeepBlankChars: when false (the default), space chars in the
// content stream are dropped before word grouping — the word
// boundary is inferred from the gap, not from the explicit space.
// Set to true to preserve them (e.g. for diff-style line
// reconstruction).
KeepBlankChars bool
// UseTextFlow: when true, chars are processed in content-stream
// order rather than re-sorted by position. This is faster and
// often matches reading order in well-formed PDFs, but breaks for
// PDFs that draw glyphs in random order (e.g. some scanner OCR
// output).
UseTextFlow bool
// HorizontalLTR: when true (the default), upright text is read
// left-to-right; when false, right-to-left. Setting this to false
// is shorthand for Direction="rtl" but only for upright text.
HorizontalLTR bool
// VerticalTTB: when true (the default), rotated text is read top-
// to-bottom; when false, bottom-to-top.
VerticalTTB bool
// ExtraAttrs is a list of Char field names that must match
// EXACTLY for two chars to be merged into the same word. The
// supported names are: "fontname", "size". Useful when a single
// physical line has two runs that should be kept separate (e.g. a
// bold caption followed by regular body text).
ExtraAttrs []string
// SplitAtPunctuation: when true, every ASCII punctuation char
// (string.punctuation in Python) terminates the current word and
// becomes its own one-char word. Default: false.
SplitAtPunctuation bool
// Expand: when true (the default), ligature glyphs (fi, fl, …) are
// expanded into their constituent ASCII chars during text
// assembly. The Char's text payload is preserved unchanged; only
// the Word.Text string is expanded.
Expand bool
// KeepChars: when true, Word.Chars is populated with the source
// chars. Off by default to save memory.
KeepChars bool
}
WordOpts configures Page.Words. The zero value is NOT useful — call DefaultWordOpts() to get a populated struct with pdfplumber-compatible defaults, then override the fields you care about.
Naming matches pdfplumber's WordExtractor kwargs where possible (XTolerance → x_tolerance, KeepBlankChars → keep_blank_chars, etc.) to make porting examples between the two libraries straightforward.
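To make the role of XTolerance concrete, here is a small sketch of gap-based grouping for a single line of upright, left-to-right text. It is an illustration under stated assumptions rather than the package's implementation: `glyph` and `groupLTR` are hypothetical names, the input is assumed to be one line already, and space glyphs are dropped up front in the spirit of KeepBlankChars=false.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// glyph is a hypothetical, trimmed-down stand-in for Char: just the text
// and the horizontal extent in PDF points.
type glyph struct {
	Text   string
	X0, X1 float64
}

// groupLTR splits one line of glyphs into words. A new word starts when
// the horizontal gap to the previous glyph exceeds xTolerance.
func groupLTR(line []glyph, xTolerance float64) []string {
	// Drop blank glyphs so that word boundaries come from gaps alone.
	sorted := make([]glyph, 0, len(line))
	for _, g := range line {
		if strings.TrimSpace(g.Text) != "" {
			sorted = append(sorted, g)
		}
	}
	// Re-sort by position (the UseTextFlow=false behaviour).
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].X0 < sorted[j].X0 })

	var words []string
	var cur strings.Builder
	for i, g := range sorted {
		if i > 0 && g.X0-sorted[i-1].X1 > xTolerance {
			words = append(words, cur.String())
			cur.Reset()
		}
		cur.WriteString(g.Text)
	}
	if cur.Len() > 0 {
		words = append(words, cur.String())
	}
	return words
}

func main() {
	line := []glyph{
		{"H", 10, 16}, {"i", 16.5, 19}, {" ", 19, 22},
		{"y", 25, 30}, {"o", 30.2, 35}, {"u", 35.1, 40},
	}
	fmt.Println(groupLTR(line, 3))   // only the 6 pt gap before "y" splits
	fmt.Println(groupLTR(line, 0.3)) // the 0.5 pt gap after "H" now splits too
}
```

Lowering the tolerance trades merged words for over-split ones, which is why tightly kerned or letter-spaced PDFs often need a per-document value.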
func DefaultWordOpts ¶ added in v0.1.0
func DefaultWordOpts() WordOpts
DefaultWordOpts returns a WordOpts populated with pdfplumber-matching defaults. Use this and override the fields you care about:
opts := pdftable.DefaultWordOpts()
opts.XTolerance = 1.5
words, _ := page.Words(opts)
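The reason the zero value is not useful follows from Go's zero-initialisation: several documented defaults are true or non-zero, while a bare struct literal starts every bool at false and every float at 0. The sketch below uses a hypothetical local mirror type (`wordOpts`, `defaultWordOpts`) carrying only a few of the fields, with default values taken from the field comments above; it is not the library type.

```go
package main

import "fmt"

// wordOpts is a hypothetical local mirror of a few WordOpts fields, used
// only to illustrate the defaults-then-override pattern.
type wordOpts struct {
	XTolerance    float64
	YTolerance    float64
	HorizontalLTR bool
	VerticalTTB   bool
	Expand        bool
	KeepChars     bool
}

// defaultWordOpts mirrors the defaults stated in the WordOpts field
// comments: tolerances of 3, LTR and TTB reading, ligature expansion on,
// KeepChars off.
func defaultWordOpts() wordOpts {
	return wordOpts{
		XTolerance:    3,
		YTolerance:    3,
		HorizontalLTR: true,
		VerticalTTB:   true,
		Expand:        true,
	}
}

func main() {
	// Zero value: zero tolerances, right-to-left, no ligature expansion.
	var zero wordOpts

	// Start from the defaults and override only what differs.
	opts := defaultWordOpts()
	opts.XTolerance = 1.5

	fmt.Printf("zero: %+v\n", zero)
	fmt.Printf("opts: %+v\n", opts)
}
```

Starting from the constructor also means that fields added in later releases pick up their intended defaults without changes at the call site.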
Source Files ¶
Directories ¶
| Path | Synopsis |
|---|---|
| cmd/pdftable (command) | cmd/pdftable is the command-line interface to the pdftable library. |
| examples/extract_tables (command) | examples/extract_tables/main.go is the runnable form of the README's "Tables (lines strategy)" example. |
| internal/layout | Package layout owns the lower-level geometry primitives that drive table-finding: edges, edge-derivation from Lines/Rects/Curves, and edge merging (snap + join). |
| internal/pdf | Package pdf is the internal content-stream interpreter for pdftable. |