layout

package
v0.3.0
Published: May 27, 2026 License: MIT Imports: 2 Imported by: 0

Documentation

Overview

Package layout owns the lower-level geometry primitives that drive table-finding: edges, edge-derivation from Lines/Rects/Curves, and edge merging (snap + join).

The split between this internal package and the public pdftable package mirrors the pdfplumber split between pdfplumber/utils/geometry.py (edge maths) and pdfplumber/table.py (the TableFinder). Keeping the edge maths here lets us evolve the representation freely while the public surface in pdftable's table.go / finder.go stays stable.

Coordinate system: PDF user space — origin at bottom-left, Y growing UP. An edge is a single-axis line segment:

  • Horizontal edge: Y0 == Y1, X0 <= X1. Orientation "h".
  • Vertical edge: X0 == X1, Y0 <= Y1. Orientation "v".

(pdfplumber operates in image space, with Y growing down: its "top" is our larger Y and its "bottom" is our smaller Y. The algorithms here are the same; only the direction of the Y axis is flipped. See the comments at each step for the explicit mapping.)
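The mapping between the two conventions can be sketched as follows. This is a minimal illustration, not code from this package: flipY and pageHeight are hypothetical names, and the sketch assumes the page box starts at Y = 0 with no rotation or crop offset.

```go
package main

import "fmt"

// flipY converts an image-space Y (measured down from the top of the page,
// as pdfplumber reports it) into a PDF user-space Y (measured up from the
// bottom). Hypothetical helper; assumes the page box starts at Y = 0.
func flipY(imageY, pageHeight float64) float64 {
	return pageHeight - imageY
}

func main() {
	const pageHeight = 792.0    // US Letter height in points
	top, bottom := 100.0, 120.0 // image space: top < bottom

	// In user space the "top" of the same box is the larger Y.
	fmt.Println(flipY(top, pageHeight), flipY(bottom, pageHeight)) // prints 692 672
}
```

The ordering reverses: a box whose top is numerically smaller than its bottom in image space has a top that is numerically larger in user space.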

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CurveSegment

type CurveSegment struct {
	Points [][2]float64
	Width  float64
}

CurveSegment is the point list of a curve path. We turn each horizontal or vertical pair of consecutive points into an edge — curves that are entirely diagonal contribute zero edges.

type Edge

type Edge struct {
	X0, Y0, X1, Y1 float64
	Orientation    Orientation
	Source         Source
	// Width is the stroke width of the originating primitive (used by
	// some callers to filter hair-thin construction lines). Zero is
	// "unknown".
	Width float64
}

Edge is one axis-aligned line segment carrying the data the table-finder needs for snap + join + intersection.

Invariants (constructor-enforced via newEdge / normalise):

  • Orientation == "h" → Y0 == Y1, X0 <= X1.
  • Orientation == "v" → X0 == X1, Y0 <= Y1.

Length() returns the along-direction extent; Pos() returns the perpendicular position (the "snap axis" for merging).
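The invariants above could be enforced along these lines. The real newEdge / normalise helpers are unexported and not shown in this documentation, so the function below is a hypothetical stand-in with a trimmed-down Edge type; pinning the second endpoint to the first is an assumption.

```go
package main

import "fmt"

type Orientation string

const (
	Horizontal Orientation = "h"
	Vertical   Orientation = "v"
)

// Edge is a trimmed-down copy of the documented struct, for illustration.
type Edge struct {
	X0, Y0, X1, Y1 float64
	Orientation    Orientation
}

// normalise is a hypothetical stand-in for the unexported helper: it orders
// the endpoints along the edge and pins the perpendicular coordinate.
func normalise(e Edge) Edge {
	switch e.Orientation {
	case Horizontal:
		if e.X0 > e.X1 {
			e.X0, e.X1 = e.X1, e.X0
		}
		e.Y1 = e.Y0 // assumption: the first endpoint's Y wins
	case Vertical:
		if e.Y0 > e.Y1 {
			e.Y0, e.Y1 = e.Y1, e.Y0
		}
		e.X1 = e.X0 // assumption: the first endpoint's X wins
	}
	return e
}

func main() {
	// Endpoints supplied right-to-left are swapped into X0 <= X1 order.
	e := normalise(Edge{X0: 50, Y0: 10, X1: 20, Y1: 10, Orientation: Horizontal})
	fmt.Println(e.X0, e.X1, e.Y0 == e.Y1) // prints 20 50 true
}
```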

func FilterEdgesByLength

func FilterEdgesByLength(edges []Edge, minLength float64) []Edge

FilterEdgesByLength drops edges whose along-direction length is below minLength. pdfplumber applies the equivalent filter twice in the TableFinder pipeline: once with edge_min_length_prefilter on the raw edges (default 1, to drop hairline construction lines) and once with edge_min_length on the merged set (default 3, to drop short stubs after snap+join).
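A minimal sketch of the two-pass usage, assuming an edge is kept when its length is at least minLength. The Edge type and filterByLength are simplified, hypothetical stand-ins for the package's own; the merge step between the passes is omitted.

```go
package main

import "fmt"

// Edge is a simplified stand-in for the package's Edge type.
type Edge struct {
	X0, Y0, X1, Y1 float64
	Horizontal     bool
}

// length returns the extent along the edge's own axis.
func (e Edge) length() float64 {
	if e.Horizontal {
		return e.X1 - e.X0
	}
	return e.Y1 - e.Y0
}

// filterByLength keeps edges whose length is at least minLength.
func filterByLength(edges []Edge, minLength float64) []Edge {
	out := make([]Edge, 0, len(edges))
	for _, e := range edges {
		if e.length() >= minLength {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	raw := []Edge{
		{X0: 0, Y0: 5, X1: 0.4, Y1: 5, Horizontal: true}, // hairline
		{X0: 0, Y0: 5, X1: 2, Y1: 5, Horizontal: true},   // short stub
		{X0: 0, Y0: 9, X1: 40, Y1: 9, Horizontal: true},  // table rule
	}
	pre := filterByLength(raw, 1)  // prefilter pass on raw edges
	post := filterByLength(pre, 3) // second pass (merge step omitted here)
	fmt.Println(len(raw), len(pre), len(post)) // prints 3 2 1
}
```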

func FilterEdgesByOrientation

func FilterEdgesByOrientation(edges []Edge, o Orientation) []Edge

FilterEdgesByOrientation keeps only edges with the given orientation. Convenience over a one-line loop, used by the TableFinder when separating the v / h edge lists post-merge for intersection detection.

func FilterEdgesBySource

func FilterEdgesBySource(edges []Edge, allow ...Source) []Edge

FilterEdgesBySource keeps only edges produced by an allowed Source. Used by lines_strict mode to drop rect and curve edges.
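The variadic allow-list could work roughly like this. The types are cut-down copies for illustration, and the behaviour for an empty allow-list (nothing passes) is an assumption of this sketch rather than documented behaviour.

```go
package main

import "fmt"

type Source uint8

const (
	SourceLine Source = iota
	SourceRect
	SourceCurve
)

// Edge is a cut-down stand-in: only the fields this sketch needs.
type Edge struct {
	Pos    float64
	Source Source
}

// filterBySource keeps edges whose Source appears in allow.
// Assumption: with no allowed sources, no edge passes.
func filterBySource(edges []Edge, allow ...Source) []Edge {
	allowed := make(map[Source]bool, len(allow))
	for _, s := range allow {
		allowed[s] = true
	}
	var out []Edge
	for _, e := range edges {
		if allowed[e.Source] {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	edges := []Edge{{10, SourceLine}, {20, SourceRect}, {30, SourceCurve}}

	// A lines_strict-like selection: literal stroked lines only.
	strict := filterBySource(edges, SourceLine)
	// A more permissive selection that also accepts rect and curve edges.
	loose := filterBySource(edges, SourceLine, SourceRect, SourceCurve)

	fmt.Println(len(strict), len(loose)) // prints 1 3
}
```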

func FromCurve

func FromCurve(c CurveSegment, tolerance float64) []Edge

FromCurve returns one edge per consecutive pair of points that lies on the same axis. Diagonal pairs are dropped. pdfplumber does the same in curve_to_edges.
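The pairwise walk might look like the sketch below. It is illustrative only: curveEdges and the simplified Edge type are hypothetical, the comparison uses "at most tolerance", and zero-length pairs are dropped; both of those choices are assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// Edge is a simplified stand-in for the package's Edge type.
type Edge struct {
	X0, Y0, X1, Y1 float64
	Horizontal     bool
}

// curveEdges walks consecutive point pairs and keeps the axis-aligned ones.
func curveEdges(pts [][2]float64, tol float64) []Edge {
	var out []Edge
	for i := 1; i < len(pts); i++ {
		a, b := pts[i-1], pts[i]
		dx, dy := math.Abs(b[0]-a[0]), math.Abs(b[1]-a[1])
		switch {
		case dy <= tol && dx > tol: // same Y: horizontal pair
			out = append(out, Edge{math.Min(a[0], b[0]), a[1], math.Max(a[0], b[0]), a[1], true})
		case dx <= tol && dy > tol: // same X: vertical pair
			out = append(out, Edge{a[0], math.Min(a[1], b[1]), a[0], math.Max(a[1], b[1]), false})
		}
		// Diagonal pairs, and (by assumption) zero-length pairs, add nothing.
	}
	return out
}

func main() {
	// Two axis-aligned strokes followed by a diagonal closing stroke.
	pts := [][2]float64{{0, 0}, {10, 0}, {10, 5}, {0, 0}}
	for _, e := range curveEdges(pts, 0.01) {
		fmt.Printf("%+v\n", e)
	}
}
```

Only the first two pairs yield edges here; the diagonal pair from (10, 5) back to (0, 0) is skipped.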

func FromLine

func FromLine(l LineSegment, tolerance float64) (Edge, bool)

FromLine returns one Edge for a horizontal or vertical line segment, or (zero, false) for diagonal lines (they aren't axis-aligned and so can't be table rules).

The tolerance relaxes the exact-equality test in pdfplumber's line_to_edge, where a "horizontal" line is one whose Y0 and Y1 are equal. We treat them as equal when their difference is below the supplied tolerance, to absorb floating-point drift from the content-stream interpreter.
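A sketch of the tolerance test, under stated assumptions: lineToEdge is a hypothetical stand-in that takes bare coordinates, the boundary case uses "at most tolerance", and a near-horizontal line is flattened onto its first endpoint's Y.

```go
package main

import (
	"fmt"
	"math"
)

// Edge is a simplified stand-in for the package's Edge type.
type Edge struct {
	X0, Y0, X1, Y1 float64
	Orientation    string // "h" or "v"
}

// lineToEdge classifies a segment as horizontal, vertical, or neither.
func lineToEdge(x0, y0, x1, y1, tol float64) (Edge, bool) {
	switch {
	case math.Abs(y1-y0) <= tol:
		// Assumption: flatten onto the first endpoint's Y.
		return Edge{math.Min(x0, x1), y0, math.Max(x0, x1), y0, "h"}, true
	case math.Abs(x1-x0) <= tol:
		return Edge{x0, math.Min(y0, y1), x0, math.Max(y0, y1), "v"}, true
	}
	return Edge{}, false // diagonal: cannot be a table rule
}

func main() {
	// The two Y values differ by about 1e-4, well inside the tolerance.
	e, ok := lineToEdge(72, 500.0001, 300, 500, 0.001)
	fmt.Println(e.Orientation, ok) // prints h true

	_, ok = lineToEdge(0, 0, 10, 10, 0.001)
	fmt.Println(ok) // prints false
}
```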

func FromRect

func FromRect(r RectSegment) []Edge

FromRect returns the four edges of a rectangle's outline. We mirror pdfplumber's rect_to_edges: top, bottom, left, right — all tagged SourceRect. Filled-only (non-stroked) rectangles still produce edges because pdfplumber's edges property aggregates BOTH stroked and filled rects (a filled cell-background still defines a row boundary).
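In this package's coordinate system the top edge sits at the larger Y. A minimal sketch of the four-edge expansion, using a hypothetical rectEdges helper and a simplified Edge type (the Source tag and stroke width are left out):

```go
package main

import "fmt"

// Edge is a simplified stand-in for the package's Edge type.
type Edge struct {
	X0, Y0, X1, Y1 float64
	Orientation    string // "h" or "v"
}

// rectEdges returns top, bottom, left, right for a bbox in PDF user space,
// where Y grows up and the top edge therefore has the larger Y.
func rectEdges(x0, y0, x1, y1 float64) []Edge {
	if x0 > x1 {
		x0, x1 = x1, x0
	}
	if y0 > y1 {
		y0, y1 = y1, y0
	}
	return []Edge{
		{x0, y1, x1, y1, "h"}, // top
		{x0, y0, x1, y0, "h"}, // bottom
		{x0, y0, x0, y1, "v"}, // left
		{x1, y0, x1, y1, "v"}, // right
	}
}

func main() {
	for _, e := range rectEdges(10, 20, 110, 60) {
		fmt.Printf("%+v\n", e)
	}
}
```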

func JoinEdges

func JoinEdges(edges []Edge, joinXTolerance, joinYTolerance float64) []Edge

JoinEdges merges collinear edges whose along-direction extents touch or nearly touch. Two horizontal edges with the same Y and overlapping or near-touching X ranges (gap within joinXTolerance) become one edge spanning their union; similarly for vertical edges, using joinYTolerance.

Edges that don't overlap and aren't within the join tolerance pass through unchanged.

This is the Go port of pdfplumber's table.join_edge_group, called once per (orientation, perpendicular-position) group.
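The per-group join reduces to interval merging with a tolerance. The sketch below works on bare along-direction spans rather than full edges; span and joinGroup are hypothetical names, and the sort-then-sweep approach is an assumption modelled on the behaviour described above.

```go
package main

import (
	"fmt"
	"sort"
)

// span is the along-direction extent of one edge in a collinear group.
type span struct{ lo, hi float64 }

// joinGroup merges spans that overlap or sit within tol of each other.
func joinGroup(spans []span, tol float64) []span {
	if len(spans) == 0 {
		return nil
	}
	sorted := append([]span(nil), spans...) // leave the input untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].lo < sorted[j].lo })

	joined := []span{sorted[0]}
	for _, s := range sorted[1:] {
		last := &joined[len(joined)-1]
		if s.lo <= last.hi+tol {
			if s.hi > last.hi {
				last.hi = s.hi // extend to the new far end
			}
		} else {
			joined = append(joined, s) // gap too wide: start a new edge
		}
	}
	return joined
}

func main() {
	group := []span{{0, 10}, {12, 20}, {30, 40}}
	fmt.Println(joinGroup(group, 3)) // prints [{0 20} {30 40}]
}
```

The gap of 2 between the first two spans is inside the tolerance of 3, so they fuse; the gap of 10 before the third span is not, so it passes through unchanged.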

func MergeEdges

func MergeEdges(edges []Edge, snapXTol, snapYTol, joinXTol, joinYTol float64) []Edge

MergeEdges is the snap-then-join pipeline. It's the entry point that table.go calls; it mirrors pdfplumber's table.merge_edges.

Order:

  1. Snap (collapse near-collinear edges onto their mean position).
  2. Group by (orientation, position) and join within each group.
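Step 2 relies on step 1: once snapping has written one shared position into every edge of a cluster, edges can be grouped by exact equality on that position. A minimal sketch of the grouping, with hypothetical groupKey and groupEdges names and a simplified Edge type:

```go
package main

import "fmt"

// Edge is a simplified stand-in for the package's Edge type.
type Edge struct {
	X0, Y0, X1, Y1 float64
	Orientation    string // "h" or "v"
}

// pos returns the perpendicular position (the constant coordinate).
func (e Edge) pos() float64 {
	if e.Orientation == "h" {
		return e.Y0
	}
	return e.X0
}

// groupKey identifies one infinite axis-aligned line.
type groupKey struct {
	orientation string
	pos         float64
}

// groupEdges buckets already-snapped edges by (orientation, position).
// Exact float comparison is only safe because snapping is assumed to have
// assigned identical positions to edges meant to share a line.
func groupEdges(edges []Edge) map[groupKey][]Edge {
	groups := make(map[groupKey][]Edge)
	for _, e := range edges {
		k := groupKey{e.Orientation, e.pos()}
		groups[k] = append(groups[k], e)
	}
	return groups
}

func main() {
	edges := []Edge{
		{0, 100, 40, 100, "h"},
		{50, 100, 90, 100, "h"}, // same Y as above: same group
		{0, 80, 90, 80, "h"},
		{0, 80, 0, 100, "v"},
	}
	groups := groupEdges(edges)
	fmt.Println(len(groups), len(groups[groupKey{"h", 100}])) // prints 3 2
}
```

Each group would then be joined independently, which is why the join step never needs to compare edges on different lines.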

func SnapEdges

func SnapEdges(edges []Edge, snapXTolerance, snapYTolerance float64) []Edge

SnapEdges replaces near-collinear edges with edges sharing the average perpendicular position. Horizontal edges within snapYTolerance of each other on the Y axis get unified onto their mean Y; vertical edges within snapXTolerance of each other on the X axis get unified onto their mean X.

This is the Go port of pdfplumber's table.snap_edges, which dispatches into utils.snap_objects per orientation.

A tolerance of 0 leaves the edges unchanged for that orientation.
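The snap step can be sketched on bare perpendicular positions. This is an illustration under assumptions, not the package's implementation: snapPositions is a hypothetical name, and clusters are assumed to chain, so a value joins the current cluster when it lies within the tolerance of the previous sorted value.

```go
package main

import (
	"fmt"
	"sort"
)

// snapPositions returns, for each input position, the mean of the cluster
// it falls into. Input order is preserved in the result.
func snapPositions(pos []float64, tol float64) []float64 {
	out := make([]float64, len(pos))
	if tol <= 0 || len(pos) == 0 {
		copy(out, pos) // a tolerance of 0 leaves positions unchanged
		return out
	}
	idx := make([]int, len(pos))
	for i := range idx {
		idx[i] = i
	}
	sort.SliceStable(idx, func(a, b int) bool { return pos[idx[a]] < pos[idx[b]] })

	start := 0
	flush := func(end int) { // write the mean into idx[start:end]
		sum := 0.0
		for _, i := range idx[start:end] {
			sum += pos[i]
		}
		mean := sum / float64(end-start)
		for _, i := range idx[start:end] {
			out[i] = mean
		}
		start = end
	}
	for k := 1; k < len(idx); k++ {
		if pos[idx[k]]-pos[idx[k-1]] > tol {
			flush(k) // gap exceeds tolerance: close the current cluster
		}
	}
	flush(len(idx))
	return out
}

func main() {
	ys := []float64{10, 50, 12}
	fmt.Println(snapPositions(ys, 3)) // prints [11 50 11]
}
```

Because every member of a cluster receives the same computed mean, later stages can group edges by exact position.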

func SortEdges

func SortEdges(edges []Edge) []Edge

SortEdges returns a stable-sorted copy of edges keyed by (orientation, perpendicular position, along-direction start). This is the deterministic order downstream stages rely on for intersection enumeration.
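A sketch of the three-part key using sort.SliceStable on a copy. The helper names and the simplified Edge type are hypothetical, and placing "h" before "v" (plain string order) is an assumption; the documentation does not say which orientation sorts first.

```go
package main

import (
	"fmt"
	"sort"
)

// Edge is a simplified stand-in for the package's Edge type.
type Edge struct {
	X0, Y0, X1, Y1 float64
	Orientation    string // "h" or "v"
}

// pos is the perpendicular position; start is the along-direction start.
func pos(e Edge) float64 {
	if e.Orientation == "h" {
		return e.Y0
	}
	return e.X0
}

func start(e Edge) float64 {
	if e.Orientation == "h" {
		return e.X0
	}
	return e.Y0
}

// sortEdges returns a sorted copy keyed by (orientation, pos, start).
func sortEdges(edges []Edge) []Edge {
	out := append([]Edge(nil), edges...) // copy: the input is not reordered
	sort.SliceStable(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if a.Orientation != b.Orientation {
			return a.Orientation < b.Orientation // assumption: "h" before "v"
		}
		if pos(a) != pos(b) {
			return pos(a) < pos(b)
		}
		return start(a) < start(b)
	})
	return out
}

func main() {
	edges := []Edge{
		{5, 0, 5, 10, "v"},
		{20, 50, 30, 50, "h"},
		{0, 50, 10, 50, "h"},
		{0, 20, 10, 20, "h"},
	}
	for _, e := range sortEdges(edges) {
		fmt.Println(e.Orientation, pos(e), start(e))
	}
}
```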

func (Edge) Length

func (e Edge) Length() float64

Length returns the edge's extent along its orientation axis.

  • Horizontal edge → X1 - X0.
  • Vertical edge → Y1 - Y0.

func (Edge) Pos

func (e Edge) Pos() float64

Pos returns the perpendicular position of the edge — the coordinate that's constant along its length. We use this as the snap-axis key for grouping edges that lie on the same infinite line.

  • Horizontal edge → Y0 (which equals Y1).
  • Vertical edge → X0 (which equals X1).

type LineSegment

type LineSegment struct {
	X0, Y0, X1, Y1 float64
	Width          float64
}

LineSegment is a minimal struct describing a drawn straight-line segment. It exists so this package doesn't have to import the public pdftable types — keeps the dependency direction one-way (pdftable depends on layout, not the other way round).

type Orientation

type Orientation string

Orientation is the axis an edge lies along. "h" = horizontal, "v" = vertical. Diagonal lines never become edges; they're dropped at derivation time.

const (
	Horizontal Orientation = "h"
	Vertical   Orientation = "v"
)

type RectSegment

type RectSegment struct {
	X0, Y0, X1, Y1 float64
	Width          float64
}

RectSegment is a minimal Rect descriptor that exists for the same reason as LineSegment: it keeps this package from importing the public pdftable types. Only the bbox matters for edge derivation; we drop the Stroke/Fill flags at the layer above by filtering out non-stroked rects before calling FromRect.

type Source

type Source uint8

Source tags say which kind of drawn primitive produced an edge. pdfplumber distinguishes between "line", "rect_edge", and "curve_edge" so that lines_strict mode can ignore everything that isn't a literal stroked line. We carry the same distinction.

const (
	// SourceLine: an edge derived from a stroked Line.
	SourceLine Source = iota
	// SourceRect: an edge derived from one side of a Rect.
	SourceRect
	// SourceCurve: an edge derived from a Curve's straight segment.
	// We accept curve edges in "lines" mode but not "lines_strict".
	SourceCurve
	// SourceExplicit: an edge constructed from an
	// ExplicitVerticalLines / ExplicitHorizontalLines setting.
	SourceExplicit
	// SourceText: an edge inferred from word alignment by the "text"
	// strategy. words_to_edges_v / words_to_edges_h in pdfplumber.
	SourceText
)
