xtract

package module

Version: v0.0.0-...-4068868 | Published: Jan 30, 2026 | License: BSD-3-Clause | Imports: 25 | Imported by: 0

Repository

github.com/sassoftware/pdf-xtract

README ¶

pdf-xtract

Overview

Go-based PDF processing library providing high-fidelity text, content, and metadata extraction capabilities.

Originally forked from ledongthuc/pdf, this library has been extensively refactored to meet enterprise-grade observability, performance, and compliance requirements.

Key features:

Efficient parsing and extraction of plain text, structured content, and document metadata
Robust logging and tracing instrumentation for production debugging
Compatibility with PDF v1.4 to v2.0 standards

Installation

You can install the library using Go modules:

go get -u github.com/sassoftware/pdf-xtract

Getting Started

Import the library in your Go code:

import xtract "github.com/sassoftware/pdf-xtract"

Logging & Observability

import "github.com/sassoftware/pdf-xtract/logger"

The refactored library includes a structured logging layer and a lightweight tracer interface to ensure production-grade observability.

High-level structured logs added at major functional boundaries.
Error logs include contextual information (file, object, and parsing state).

Tracer Integration

import "github.com/sassoftware/pdf-xtract/tracer"

The library includes a lightweight Tracer subsystem that provides fine-grained observability into PDF parsing and extraction operations. It is designed to support debugging and operational monitoring in production environments where PDFs may be large, malformed, or complex.

The tracer records:

Object-level processing times (fonts, content streams, metadata objects, etc.)
Recovery attempts (e.g., corrupted xref tables, missing object references)
Execution flow, enabling reconstruction of what happened during extraction
Error points, with the ability to dump the trace when failures occur

Running

After installing the library, you can either integrate pdf-xtract into your own Go applications or run the provided example programs to get started quickly.

git clone https://github.com/sassoftware/pdf-xtract.git
cd pdf-xtract
cd examples
go run main.go

Usage

See the examples directory for complete programs.
This library supports two primary extraction modes depending on the use case and PDF size:
1. Standard Extraction Mode (Batch Mode) – best for small/medium PDFs, returns complete text at once
2. Streaming Extraction Mode – best for large PDFs, returns text page-by-page without loading the entire file into memory

Standard Extraction Mode (Batch)

ctx := context.Background()

cfg := xtract.NewDefaultConfig()
cfg.MaxConcurrentPDFs = 1
cfg.MaxWorkersPerPDF = 4
cfg.ParsingMode = xtract.BestEffort
cfg.MaxTotalChars = 1000

cfg.Logger = func(level logger.LogLevel, msg string, keyvals ...interface{}) {
	// no-op logger
}

proc := xtract.NewProcessor(cfg)

text, truncated, err := proc.Extract(ctx, "pdf_test.pdf")
if err != nil {
	fmt.Println("Failed to extract text:", err)
	tracer.PrintTrace() // dump the recorded trace to help locate the failure
	return
}

fmt.Println("Truncated?", truncated)
fmt.Println("Extracted Text:", text)

// Metadata extraction
fmt.Println("---- PDF Metadata ----")
if err := proc.Metadata(ctx, "pdf_test.pdf", os.Stdout); err != nil {
	tracer.PrintTrace()
}

Streaming Extraction Mode

stream, truncated, err := proc.ExtractAsStream(ctx, "pdf_test.pdf")
if err != nil {
	return
}

fmt.Println("Streaming output:")
var total string

for pageText := range stream {
	fmt.Println(pageText)
	total += pageText
}

fmt.Println("Truncated?", truncated)
fmt.Println("Final concatenated length:", len(total))
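The page-by-page channel pattern used in the streaming snippet can be tried without the library. The following is a minimal sketch under stated assumptions: streamPages and its byte budget are hypothetical stand-ins for illustration, not the behavior of ExtractAsStream or MaxTotalChars.

```go
package main

import "fmt"

// streamPages is a hypothetical stand-in for a streaming extractor. It sends
// each page's text on a channel and closes the channel when all pages are sent
// or when the byte budget maxChars (0 means unlimited) is exhausted.
func streamPages(pages []string, maxChars int) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		used := 0
		for _, p := range pages {
			if maxChars > 0 && used+len(p) > maxChars {
				p = p[:maxChars-used] // trim the last page to fit the budget
			}
			out <- p
			used += len(p)
			if maxChars > 0 && used >= maxChars {
				return
			}
		}
	}()
	return out
}

func main() {
	total := 0
	for text := range streamPages([]string{"page one", "page two", "page three"}, 12) {
		fmt.Printf("%q\n", text)
		total += len(text)
	}
	fmt.Println("total bytes:", total)
}
```

Because the consumer ranges over the channel, only one page of text needs to be held at a time. Note that this sketch counts bytes, so trimming could split a multi-byte character.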

Metadata Extraction

// Print metadata as pretty JSON to stdout
err := proc.Metadata(ctx, "yourfile.pdf", os.Stdout)
if err != nil {
	fmt.Println("Failed to extract metadata:", err)
}

CPU and Memory Usage Comparison (Batch vs Streaming)

PDF Size (KB)	Batch mode CPU %	Batch mode Memory %	Streaming mode CPU %	Streaming mode Memory %	PDF Characteristics
1	0.491	10.0	0.657	7.15	2-page PDF 1.7, multiline text streams per page, Type1 Helvetica, hybrid compressed XRef stream
2	0.880	10.3	0.811	7.18	Minimal PDF 2.0, single text stream, XMP metadata, hybrid XRef stream with /Prev (incremental update)
3	0.675	10.0	0.383	7.29	5-page PDF 1.7, Flate-compressed streams per page, Info dictionary metadata
4	2.290	9.97	0.373	0.73	1-page PDF 1.7, multiple streams, multiple fonts, transformations (rotate/scale)
23	0.655	8.70	0.520	7.38	Excel-to-PDF, tagged structure, /Lang, XMP and Info dictionary
41	0.258	10.0	0.479	6.45	Layout-heavy PDF (region/box-based content), visual-order extraction required
121	2.190	9.69	0.612	0.849	PowerPoint-to-PDF, multi-slide pages, absolute positioned layout
190	2.010	9.60	0.532	8.29	Linearized PDF 1.6, compressed XRef, /Prev incremental updates
221	3.010	11.3	1.050	9.21	15-page multilingual (CJK and English), CID fonts, ToUnicode CMaps
1382	3.930	12.4	4.160	11.4	Large multi-page PDF (83 pages), dense legislative text
3884	2.280	14.0	1.750	11.9	Mixed text and embedded images, image-heavy pages
5939	100	23.0	100	20.4	Extremely large PDF (~1000+ pages), CPU saturation during extraction

Contributing

Maintainers are accepting patches and contributions to this project. Please read CONTRIBUTING.md for details about submitting contributions to this project.

License

This project is licensed under the BSD 3-Clause License.

Documentation ¶

Overview ¶

Package pdf implements reading of PDF files.

PDF is Adobe's Portable Document Format, ubiquitous on the internet. A PDF document is a complex data format built on a fairly simple structure. This package exposes the simple structure along with some wrappers to extract basic information. If more complex information is needed, it is possible to extract that information by interpreting the structure exposed by this package.

Specifically, a PDF is a data structure built from Values, each of which has one of the following Kinds:

Null, for the null object.
Integer, for an integer.
Real, for a floating-point number.
Bool, for a boolean value.
Name, for a name constant (as in /Helvetica).
String, for a string constant.
Dict, for a dictionary of name-value pairs.
Array, for an array of values.
Stream, for an opaque data stream and associated header dictionary.

The accessors on Value—Int64, Float64, Bool, Name, and so on—return a view of the data as the given type. When there is no appropriate view, the accessor returns a zero result. For example, the Name accessor returns the empty string if called on a Value v for which v.Kind() != Name. Returning zero values this way, especially from the Dict and Array accessors, which themselves return Values, makes it possible to traverse a PDF quickly without writing any error checking. On the other hand, it means that mistakes can go unreported.
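The zero-value convention can be illustrated with a small stand-alone model. This is only a sketch: the value type below is a hypothetical, heavily simplified stand-in for the package's Value and is not its implementation.

```go
package main

import "fmt"

type kind int

const (
	kNull kind = iota
	kName
	kDict
)

// value is a hypothetical, simplified stand-in for the package's Value type.
type value struct {
	kind kind
	name string
	dict map[string]value
}

// Name returns the name, or "" when v is not a name.
func (v value) Name() string {
	if v.kind != kName {
		return ""
	}
	return v.name
}

// Key returns the dictionary entry for k, or a null value when v is not a
// dictionary or the key is absent.
func (v value) Key(k string) value {
	if v.kind != kDict {
		return value{}
	}
	return v.dict[k]
}

// sampleRoot builds a tiny dictionary graph: root -> Font -> BaseFont.
func sampleRoot() value {
	font := value{kind: kDict, dict: map[string]value{
		"BaseFont": {kind: kName, name: "Helvetica"},
	}}
	return value{kind: kDict, dict: map[string]value{"Font": font}}
}

func main() {
	root := sampleRoot()
	fmt.Printf("%q\n", root.Key("Font").Key("BaseFont").Name())
	// A missing key yields a null value, so the chain keeps working and
	// the final accessor returns the empty string instead of failing.
	fmt.Printf("%q\n", root.Key("Missing").Key("BaseFont").Name())
}
```

The second lookup shows the trade-off described above: the traversal never fails, but a typo in a key name is indistinguishable from a field that is absent.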

The basic structure of the PDF file is exposed as the graph of Values.

Most richer data structures in a PDF file are dictionaries with specific interpretations of the name-value pairs. The Font and Page wrappers make the interpretation of a specific Value as the corresponding type easier. They are only helpers, though: they are implemented only in terms of the Value API and could be moved outside the package. Equally important, traversal of other PDF data structures can be implemented in other packages as needed.

Index ¶

Variables
func CheckHeader(f io.ReaderAt) error
func DecodeUTF8OrPreserve(s string) []rune
func EndsWithEOL(buf []byte, start, end int) bool
func FindStartXref(f io.ReaderAt, size int64) (int64, error)
func Interpret(strm Value, do func(stk *Stack, op string))
func IsSameSentence(last, current Text) bool
func NewProcessor(cfg *Config) *processor
func SkipWhitespace(buf []byte, j int) int
func ValidateEOFMarker(f io.ReaderAt, size int64) error
type AccessPermission
type BestEffortExtractor
- func (b *BestEffortExtractor) ExtractPage(ctx context.Context, page *Page) (string, error)
type Column
type Columns
type Config
- func NewDefaultConfig() *Config
- func (cfg *Config) Validate() error
type Content
type ExtractorStrategy
type Font
- func (f Font) BaseFont() string
- func (f Font) Encoder() TextEncoding
- func (f Font) FirstChar() int
- func (f Font) LastChar() int
- func (f Font) Width(code int) float64
- func (f Font) Widths() []float64
type Meta
type MetadataFull
type Outline
type Page
- func (p Page) Content() Content
- func (p Page) Font(name string) Font
- func (p Page) Fonts() []string
- func (p Page) GetPlainText(fonts map[string]*Font) (result string, err error)
- func (p Page) GetTextByColumn() (Columns, error)
- func (p Page) GetTextByRow() (Rows, error)
- func (p Page) Resources() Value
type ParsingMode
type Point
type Processor
type Reader
- func NewReader(f io.ReaderAt, size int64) (*Reader, error)
- func Open(file string) (*os.File, *Reader, error)
- func (r *Reader) GetPlainText() (reader io.Reader, err error)
- func (r *Reader) GetStyledTexts() (sentences []Text, err error)
- func (r *Reader) InfoDict() Value
- func (r *Reader) Metadata() (Meta, error)
- func (r *Reader) MetadataFull() (MetadataFull, error)
- func (r *Reader) MetadataJSON(w io.Writer) error
- func (r *Reader) NumPage() int
- func (r *Reader) Outline() Outline
- func (r *Reader) Page(num int) Page
- func (r *Reader) Trailer() Value
type Rect
type Row
type Rows
type Stack
- func (stk *Stack) Len() int
- func (stk *Stack) Pop() Value
- func (stk *Stack) Push(v Value)
type StrictExtractor
- func (s *StrictExtractor) ExtractPage(ctx context.Context, page *Page) (string, error)
type Text
type TextEncoding
type TextHorizontal
- func (x TextHorizontal) Len() int
- func (x TextHorizontal) Less(i, j int) bool
- func (x TextHorizontal) Swap(i, j int)
type TextVertical
- func (x TextVertical) Len() int
- func (x TextVertical) Less(i, j int) bool
- func (x TextVertical) Swap(i, j int)
type Value
- func (v Value) Bool() bool
- func (v Value) Float64() float64
- func (v Value) Index(i int) Value
- func (v Value) Int64() int64
- func (v Value) IsNull() bool
- func (v Value) Key(key string) Value
- func (v Value) Keys() []string
- func (v Value) Kind() ValueKind
- func (v Value) Len() int
- func (v Value) Name() string
- func (v Value) RawString() string
- func (v Value) Reader() io.ReadCloser
- func (v Value) String() string
- func (v Value) Text() string
- func (v Value) TextFromUTF16() string
type ValueKind
Bugs

Variables ¶

var DebugOn = false

DebugOn enables debug logging to stdout. Set it to true if problems arise during reading.

Functions ¶

func CheckHeader ¶

func CheckHeader(f io.ReaderAt) error

CheckHeader validates the PDF header at the beginning of the file. It ensures the file starts with "%PDF-x.y" and the version is within 1.0–1.7 or 2.0.
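A header check of this kind can be sketched with the standard library alone. The function below is an illustrative assumption of how such a check might look, not the library's implementation; checkHeaderBytes is a hypothetical convenience wrapper.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// checkHeader is an illustrative sketch: it reads the first 8 bytes and
// accepts "%PDF-1.0" through "%PDF-1.7" and "%PDF-2.0".
func checkHeader(f io.ReaderAt) error {
	buf := make([]byte, 8) // "%PDF-x.y" is 8 bytes long
	n, err := f.ReadAt(buf, 0)
	if err != nil && err != io.EOF {
		return err
	}
	buf = buf[:n]
	if n < 8 || !bytes.HasPrefix(buf, []byte("%PDF-")) {
		return errors.New("missing %PDF- header")
	}
	major, dot, minor := buf[5], buf[6], buf[7]
	if dot != '.' {
		return fmt.Errorf("malformed PDF version %q", buf[5:8])
	}
	switch {
	case major == '1' && minor >= '0' && minor <= '7':
		return nil
	case major == '2' && minor == '0':
		return nil
	}
	return fmt.Errorf("unsupported PDF version %q", buf[5:8])
}

// checkHeaderBytes is a hypothetical helper for in-memory data.
func checkHeaderBytes(b []byte) error {
	return checkHeader(bytes.NewReader(b))
}

func main() {
	fmt.Println(checkHeaderBytes([]byte("%PDF-1.7\n")))
	fmt.Println(checkHeaderBytes([]byte("%PDF-1.9\n")))
	fmt.Println(checkHeaderBytes([]byte("not a pdf")))
}
```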

func DecodeUTF8OrPreserve ¶

func DecodeUTF8OrPreserve(s string) []rune

DecodeUTF8OrPreserve decodes s as UTF-8, but if it encounters an invalid byte it preserves that byte verbatim as a rune (no U+FFFD replacement).
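The preserve-on-error behavior described above can be sketched with unicode/utf8. This is a minimal illustration under the assumption that an invalid byte maps to the rune with the same numeric value; decodePreserving is a hypothetical name and not the library's code.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodePreserving decodes s as UTF-8. An invalid byte is kept as a rune
// with the same numeric value instead of being replaced by U+FFFD.
func decodePreserving(s string) []rune {
	out := make([]rune, 0, len(s))
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			// Invalid encoding: a genuine U+FFFD would have size 3.
			out = append(out, rune(s[i]))
			i++
			continue
		}
		out = append(out, r)
		i += size
	}
	return out
}

func main() {
	fmt.Printf("%U\n", decodePreserving("a\xffb"))
	fmt.Printf("%U\n", decodePreserving("é"))
}
```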

func EndsWithEOL ¶

func EndsWithEOL(buf []byte, start, end int) bool

EndsWithEOL checks if the last skipped char is CR or LF.

func FindStartXref ¶

func FindStartXref(f io.ReaderAt, size int64) (int64, error)

FindStartXref locates and parses the "startxref" pointer near the end of the file. Returns the byte offset where the cross-reference table/stream begins.
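Locating the pointer amounts to scanning the tail of the file for the last "startxref" keyword and parsing the number that follows. The sketch below assumes a 1024-byte tail window, which is an arbitrary choice for illustration; it is not the library's implementation, and findStartXrefBytes is a hypothetical helper.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"strconv"
)

// findStartXref scans the last bytes of the file for "startxref" and
// returns the decimal offset that follows it.
func findStartXref(f io.ReaderAt, size int64) (int64, error) {
	const tail = 1024 // assumed window size for this sketch
	start := size - tail
	if start < 0 {
		start = 0
	}
	buf := make([]byte, size-start)
	if _, err := f.ReadAt(buf, start); err != nil && err != io.EOF {
		return 0, err
	}
	i := bytes.LastIndex(buf, []byte("startxref"))
	if i < 0 {
		return 0, errors.New("startxref not found")
	}
	fields := bytes.Fields(buf[i+len("startxref"):])
	if len(fields) == 0 {
		return 0, errors.New("startxref has no offset")
	}
	off, err := strconv.ParseInt(string(fields[0]), 10, 64)
	if err != nil {
		return 0, fmt.Errorf("bad startxref offset: %w", err)
	}
	return off, nil
}

// findStartXrefBytes is a hypothetical helper for in-memory data.
func findStartXrefBytes(b []byte) (int64, error) {
	return findStartXref(bytes.NewReader(b), int64(len(b)))
}

func main() {
	off, err := findStartXrefBytes([]byte("%PDF-1.7\n...\nstartxref\n116\n%%EOF\n"))
	fmt.Println(off, err)
}
```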

func Interpret ¶

func Interpret(strm Value, do func(stk *Stack, op string))

Interpret interprets the content in a stream as a basic PostScript program, pushing values onto a stack and then calling the do function to execute operators. The do function may push or pop values from the stack as needed to implement op.

Interpret handles the operators "dict", "currentdict", "begin", "end", "def", and "pop" itself.

Interpret is not a full-blown PostScript interpreter. Its job is to handle the very limited PostScript found in certain supporting file formats embedded in PDF files, such as cmap files that describe the mapping from font code points to Unicode code points.

A stream can also be represented by an array of streams, which has to be handled as a single stream. A simple stream is read only once; for an array, its length determines how many streams are read.

There is no support for executable blocks, among other limitations.
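The operand-stack and callback design can be illustrated with a toy interpreter. This is a deliberately reduced sketch: it handles only whitespace-separated numeric operands and passes every other token to the callback, and the stack and interpret names are hypothetical stand-ins for Stack and Interpret.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// stack is a simplified operand stack holding numbers only.
type stack struct{ vals []float64 }

func (s *stack) Push(v float64) { s.vals = append(s.vals, v) }
func (s *stack) Len() int       { return len(s.vals) }

// Pop returns the top value, or 0 when the stack is empty.
func (s *stack) Pop() float64 {
	if len(s.vals) == 0 {
		return 0
	}
	v := s.vals[len(s.vals)-1]
	s.vals = s.vals[:len(s.vals)-1]
	return v
}

// interpret pushes numeric tokens and hands every other token to do,
// which pops its operands from the stack.
func interpret(src string, do func(stk *stack, op string)) {
	var stk stack
	for _, tok := range strings.Fields(src) {
		if n, err := strconv.ParseFloat(tok, 64); err == nil {
			stk.Push(n)
			continue
		}
		do(&stk, tok)
	}
}

// runSample interprets two text operators and records what it saw.
func runSample() []string {
	var log []string
	interpret("72 700 Td 14 TL", func(stk *stack, op string) {
		switch op {
		case "Td": // move text position: x y Td
			y, x := stk.Pop(), stk.Pop()
			log = append(log, fmt.Sprintf("Td %g %g", x, y))
		case "TL": // set text leading: leading TL
			log = append(log, fmt.Sprintf("TL %g", stk.Pop()))
		}
	})
	return log
}

func main() {
	for _, line := range runSample() {
		fmt.Println(line)
	}
}
```

Operands are popped in reverse order of appearance, which is why the callback reads y before x.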

func IsSameSentence ¶

func IsSameSentence(last, current Text) bool

IsSameSentence checks if the current text segment likely belongs to the same sentence as the last text segment based on font, size, vertical position, and lack of sentence-ending punctuation in the last segment.
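A heuristic of this shape might look like the sketch below. The vertical tolerance (one font size) and the punctuation set are assumptions made for illustration only; the library's actual thresholds are not documented here, and the text type is a local stand-in for Text.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// text is a local, simplified stand-in for the package's Text type.
type text struct {
	Font     string
	FontSize float64
	X, Y     float64
	S        string
}

// sameSentence reports whether cur plausibly continues the sentence in last.
func sameSentence(last, cur text) bool {
	if last.Font != cur.Font || last.FontSize != cur.FontSize {
		return false
	}
	// Assumption: segments more than one font size apart vertically
	// belong to different blocks of text.
	if math.Abs(last.Y-cur.Y) > last.FontSize {
		return false
	}
	end := strings.TrimSpace(last.S)
	// Assumption: '.', '!' and '?' end a sentence.
	return !strings.HasSuffix(end, ".") &&
		!strings.HasSuffix(end, "!") &&
		!strings.HasSuffix(end, "?")
}

func main() {
	a := text{Font: "F1", FontSize: 12, X: 72, Y: 700, S: "The quick brown"}
	b := text{Font: "F1", FontSize: 12, X: 160, Y: 700, S: "fox."}
	c := text{Font: "F1", FontSize: 12, X: 72, Y: 686, S: "Next sentence"}
	fmt.Println(sameSentence(a, b))
	fmt.Println(sameSentence(b, c))
}
```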

func NewProcessor ¶

func NewProcessor(cfg *Config) *processor

NewProcessor validates the config and creates a new processor. It selects the matching ExtractorStrategy (Strict or BestEffort).

func SkipWhitespace ¶

func SkipWhitespace(buf []byte, j int) int

SkipWhitespace advances j past all whitespace.
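As a rough illustration, skipping whitespace over a byte buffer can be written as follows. The set of whitespace bytes used here is the one defined by the PDF specification (NUL, tab, line feed, form feed, carriage return, space); whether the library uses exactly this set is an assumption.

```go
package main

import "fmt"

// isPDFSpace reports whether b is one of the six PDF whitespace bytes.
func isPDFSpace(b byte) bool {
	switch b {
	case 0x00, '\t', '\n', '\f', '\r', ' ':
		return true
	}
	return false
}

// skipWhitespace returns the index of the first non-whitespace byte at or
// after j, or len(buf) when only whitespace remains.
func skipWhitespace(buf []byte, j int) int {
	for j < len(buf) && isPDFSpace(buf[j]) {
		j++
	}
	return j
}

func main() {
	buf := []byte(" \r\n\t/Type /Page")
	j := skipWhitespace(buf, 0)
	fmt.Println(j, string(buf[j:]))
}
```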

func ValidateEOFMarker ¶

func ValidateEOFMarker(f io.ReaderAt, size int64) error

ValidateEOFMarker checks the last chunk of the file for the "%%EOF" marker. Ensures the PDF file is properly terminated as per the specification.
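A check like this can be sketched by reading the tail of the file and searching it for the marker. The 1024-byte window is an assumption for illustration, and validateEOF and validateEOFBytes are hypothetical names rather than the library's code.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// validateEOF looks for "%%EOF" in the last bytes of the file.
func validateEOF(f io.ReaderAt, size int64) error {
	const tail = 1024 // assumed window size for this sketch
	start := size - tail
	if start < 0 {
		start = 0
	}
	buf := make([]byte, size-start)
	if _, err := f.ReadAt(buf, start); err != nil && err != io.EOF {
		return err
	}
	if !bytes.Contains(buf, []byte("%%EOF")) {
		return errors.New("missing %%EOF marker")
	}
	return nil
}

// validateEOFBytes is a hypothetical helper for in-memory data.
func validateEOFBytes(b []byte) error {
	return validateEOF(bytes.NewReader(b), int64(len(b)))
}

func main() {
	fmt.Println(validateEOFBytes([]byte("%PDF-1.7\n...\n%%EOF\n")))
	fmt.Println(validateEOFBytes([]byte("%PDF-1.7\ntruncated")))
}
```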

Types ¶

type AccessPermission ¶

type AccessPermission struct {
	CanPrint                bool `json:"can_print"`
	CanPrintFaithful        bool `json:"can_print_faithful"`
	CanModify               bool `json:"can_modify"`
	ExtractContent          bool `json:"extract_content"`
	ModifyAnnotations       bool `json:"modify_annotations"`
	FillInForm              bool `json:"fill_in_form"`
	ExtractForAccessibility bool `json:"extract_for_accessibility"`
	AssembleDocument        bool `json:"assemble_document"`
}

type BestEffortExtractor ¶

type BestEffortExtractor struct{}

BestEffortExtractor tolerates errors. If a page fails, it simply skips that page.

func (*BestEffortExtractor) ExtractPage ¶

func (b *BestEffortExtractor) ExtractPage(ctx context.Context, page *Page) (string, error)

type Column ¶

type Column struct {
	Position int64
	Content  TextVertical
}

Column represents the contents of a column.

type Columns ¶

type Columns []*Column

Columns is a list of columns.

type Config ¶

type Config struct {
	MaxConcurrentPDFs int           `validate:"min=1,max=10"`
	MaxWorkersPerPDF  int           `validate:"min=1,max=10"`
	WorkerTimeout     time.Duration `validate:"required"`
	ParsingMode       ParsingMode   `validate:"oneof=strict best-effort"`
	MaxRetries        int           `validate:"min=0,max=3"`
	MaxTotalChars     int           `validate:"min=0"`
	DebugOn           bool
	Logger            logger.LogFunc
}

func NewDefaultConfig ¶

func NewDefaultConfig() *Config

func (*Config) Validate ¶

func (cfg *Config) Validate() error
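The validate tags on Config describe the accepted range of each field. The sketch below spells out those same bounds as plain Go checks on a local copy of the struct; it assumes the tags are enforced as written and is not the library's Validate method, which may rely on a validation package instead.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// config mirrors the validated fields of Config for illustration.
type config struct {
	MaxConcurrentPDFs int
	MaxWorkersPerPDF  int
	WorkerTimeout     time.Duration
	ParsingMode       string
	MaxRetries        int
	MaxTotalChars     int
}

// validate applies the bounds expressed by the struct tags.
func (c config) validate() error {
	switch {
	case c.MaxConcurrentPDFs < 1 || c.MaxConcurrentPDFs > 10:
		return errors.New("MaxConcurrentPDFs must be between 1 and 10")
	case c.MaxWorkersPerPDF < 1 || c.MaxWorkersPerPDF > 10:
		return errors.New("MaxWorkersPerPDF must be between 1 and 10")
	case c.WorkerTimeout == 0:
		return errors.New("WorkerTimeout is required")
	case c.ParsingMode != "strict" && c.ParsingMode != "best-effort":
		return errors.New("ParsingMode must be strict or best-effort")
	case c.MaxRetries < 0 || c.MaxRetries > 3:
		return errors.New("MaxRetries must be between 0 and 3")
	case c.MaxTotalChars < 0:
		return errors.New("MaxTotalChars must not be negative")
	}
	return nil
}

// sampleConfig returns a configuration that satisfies every bound.
// The values are arbitrary and are not the library's defaults.
func sampleConfig() config {
	return config{
		MaxConcurrentPDFs: 1,
		MaxWorkersPerPDF:  4,
		WorkerTimeout:     30 * time.Second,
		ParsingMode:       "best-effort",
		MaxRetries:        1,
		MaxTotalChars:     1000,
	}
}

func main() {
	c := sampleConfig()
	fmt.Println(c.validate())
	c.MaxWorkersPerPDF = 11
	fmt.Println(c.validate())
}
```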

type Content ¶

type Content struct {
	Text []Text
	Rect []Rect
}

Content describes the basic content on a page: the text and any drawn rectangles.

type ExtractorStrategy ¶

type ExtractorStrategy interface {
	ExtractPage(ctx context.Context, page *Page) (string, error)
}

ExtractorStrategy defines how to extract text from a single page. Different strategies handle errors differently (strict vs. best-effort).
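The difference between the two strategies can be shown with a self-contained model. The page, strict, and bestEffort types below are hypothetical simplifications; the real interface also takes a context.Context and a *Page.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// page is a toy page: broken pages fail to parse.
type page struct {
	text   string
	broken bool
}

// strategy mirrors the shape of ExtractorStrategy without the context.
type strategy interface {
	extractPage(p page) (string, error)
}

var errBroken = errors.New("page could not be parsed")

// strict propagates page errors to the caller.
type strict struct{}

func (strict) extractPage(p page) (string, error) {
	if p.broken {
		return "", errBroken
	}
	return p.text, nil
}

// bestEffort swallows page errors and contributes no text for that page.
type bestEffort struct{}

func (bestEffort) extractPage(p page) (string, error) {
	if p.broken {
		return "", nil
	}
	return p.text, nil
}

// extractAll concatenates page text and stops at the first error.
func extractAll(s strategy, pages []page) (string, error) {
	var sb strings.Builder
	for i, p := range pages {
		t, err := s.extractPage(p)
		if err != nil {
			return "", fmt.Errorf("page %d: %w", i+1, err)
		}
		sb.WriteString(t)
	}
	return sb.String(), nil
}

// samplePages returns three pages, the second of which is broken.
func samplePages() []page {
	return []page{{text: "a"}, {broken: true}, {text: "c"}}
}

func main() {
	fmt.Println(extractAll(strict{}, samplePages()))
	fmt.Println(extractAll(bestEffort{}, samplePages()))
}
```

Keeping the error policy inside the strategy means the loop that drives extraction is identical for both modes.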

type Font ¶

type Font struct {
	V Value
	// contains filtered or unexported fields
}

A Font represents a font in a PDF file. The methods interpret a Font dictionary stored in V.

func (Font) BaseFont ¶

func (f Font) BaseFont() string

BaseFont returns the font's name (BaseFont property).

func (Font) Encoder ¶

func (f Font) Encoder() TextEncoding

Encoder returns the encoding between font code point sequences and UTF-8.

func (Font) FirstChar ¶

func (f Font) FirstChar() int

FirstChar returns the code point of the first character in the font.

func (Font) LastChar ¶

func (f Font) LastChar() int

LastChar returns the code point of the last character in the font.

func (Font) Width ¶

func (f Font) Width(code int) float64

Width returns the width of the given code point.

func (Font) Widths ¶

func (f Font) Widths() []float64

Widths returns the widths of the glyphs in the font. In a well-formed PDF, len(f.Widths()) == f.LastChar()+1 - f.FirstChar().
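The relationship between FirstChar, LastChar, and the widths array can be sketched as a bounds-checked lookup. The font type here is a hypothetical stand-in, and returning 0 for codes outside the range is an assumption about reasonable behavior rather than a documented guarantee.

```go
package main

import "fmt"

// font is a simplified stand-in holding the three relevant properties.
type font struct {
	firstChar int
	lastChar  int
	widths    []float64 // widths[i] is the width of code firstChar+i
}

// width returns the glyph width for code, or 0 when code lies outside
// [firstChar, lastChar] or the widths array is shorter than expected.
func (f font) width(code int) float64 {
	if code < f.firstChar || code > f.lastChar {
		return 0
	}
	i := code - f.firstChar
	if i >= len(f.widths) {
		return 0
	}
	return f.widths[i]
}

func main() {
	f := font{firstChar: 32, lastChar: 34, widths: []float64{278, 278, 355}}
	fmt.Println(len(f.widths) == f.lastChar+1-f.firstChar)
	fmt.Println(f.width(34), f.width(35))
}
```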

type Meta ¶

type Meta struct {
	Title        string `json:"title,omitempty"`
	Author       string `json:"author,omitempty"`
	Subject      string `json:"subject,omitempty"`
	Keywords     string `json:"keywords,omitempty"`
	Creator      string `json:"creator,omitempty"`
	Producer     string `json:"producer,omitempty"`
	CreationDate string `json:"creationDate,omitempty"`
	ModDate      string `json:"modDate,omitempty"`
}

Meta is the unified metadata model (Info + XMP fields).
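The omitempty tags mean that unset fields are left out of the JSON output. The sketch below demonstrates this with a local struct carrying a subset of the same tags; meta and toJSON are illustrative names, not part of the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// meta copies a subset of Meta's fields and JSON tags.
type meta struct {
	Title    string `json:"title,omitempty"`
	Author   string `json:"author,omitempty"`
	Producer string `json:"producer,omitempty"`
}

// toJSON renders m as indented JSON, or "" if marshalling fails.
func toJSON(m meta) string {
	b, err := json.MarshalIndent(m, "", "  ")
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Author and Producer are empty, so only "title" appears.
	fmt.Println(toJSON(meta{Title: "Report"}))
}
```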

type MetadataFull ¶

type MetadataFull struct {
	// Core (Info/XMP)
	Title        string `json:"title,omitempty"`
	Author       string `json:"author,omitempty"`
	Subject      string `json:"subject,omitempty"`
	Keywords     string `json:"keywords,omitempty"`
	Creator      string `json:"creator,omitempty"`
	Producer     string `json:"producer,omitempty"`
	CreationDate string `json:"creationDate,omitempty"`
	ModDate      string `json:"modDate,omitempty"`

	// Structural
	PDFVersion              string `json:"pdf:PDFVersion,omitempty"`
	HasXMP                  bool   `json:"pdf:hasXMP"`
	HasCollection           bool   `json:"pdf:hasCollection"`
	Encrypted               bool   `json:"pdf:encrypted"`
	NPages                  int    `json:"xmpTPg:NPages,omitempty"`
	ContainsNonEmbeddedFont bool   `json:"pdf:containsNonEmbeddedFont"`
	Language                string `json:"language,omitempty"`

	// Access permissions (Standard Security)
	AccessPermission AccessPermission `json:"access_permission"`
}

type Outline ¶

type Outline struct {
	Title string    // title for this element
	Child []Outline // child elements
}

An Outline is a tree describing the outline (also known as the table of contents) of a document.

type Page ¶

type Page struct {
	V Value
}

A Page represents a single page in a PDF file. The methods interpret a Page dictionary stored in V.

func (Page) Content ¶

func (p Page) Content() Content

Content returns the page's content.

func (Page) Font ¶

func (p Page) Font(name string) Font

Font returns the font with the given name associated with the page.

func (Page) Fonts ¶

func (p Page) Fonts() []string

Fonts returns a list of the fonts associated with the page.

func (Page) GetPlainText ¶

func (p Page) GetPlainText(fonts map[string]*Font) (result string, err error)

GetPlainText returns all of the page's text without formatting. The fonts map can be passed in (to improve parsing performance) or left nil.

func (Page) GetTextByColumn ¶

func (p Page) GetTextByColumn() (Columns, error)

GetTextByColumn returns all of the page's text grouped by column.

func (Page) GetTextByRow ¶

func (p Page) GetTextByRow() (Rows, error)

GetTextByRow returns all of the page's text grouped by row.

func (Page) Resources ¶

func (p Page) Resources() Value

Resources returns the resources dictionary associated with the page.

type ParsingMode ¶

type ParsingMode string

const (
	Strict     ParsingMode = "strict"
	BestEffort ParsingMode = "best-effort"
)

type Point ¶

type Point struct {
	X float64
	Y float64
}

A Point represents an X, Y pair.

type Processor ¶

type Processor interface {
	Extract(ctx context.Context, path string) (string, bool, error)
}

Processor defines the contract for extracting text from a PDF file.

type Reader ¶

type Reader struct {
	// contains filtered or unexported fields
}

A Reader is a single PDF file open for reading.

func NewReader ¶

func NewReader(f io.ReaderAt, size int64) (*Reader, error)

NewReader opens a file for reading, using the data in f with the given total size.

func Open ¶

func Open(file string) (*os.File, *Reader, error)

func (*Reader) GetPlainText ¶

func (r *Reader) GetPlainText() (reader io.Reader, err error)

GetPlainText returns all the text in the PDF file.

func (*Reader) GetStyledTexts ¶

func (r *Reader) GetStyledTexts() (sentences []Text, err error)

GetStyledTexts returns all sentences in the PDF file as a slice of Text values, including their style information.

func (*Reader) InfoDict ¶

func (r *Reader) InfoDict() Value

InfoDict returns the raw /Info dictionary as a Value (may be Null).

func (*Reader) Metadata ¶

func (r *Reader) Metadata() (Meta, error)

Metadata returns unified metadata with XMP taking precedence over /Info.
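The precedence rule can be sketched as a field-by-field merge in which a non-empty XMP value wins over the /Info value. The meta type and the merge and prefer helpers are hypothetical; treating an empty string as "not set" is an assumption of this sketch.

```go
package main

import "fmt"

// meta holds two representative metadata fields.
type meta struct {
	Title  string
	Author string
}

// prefer returns the XMP value when present, otherwise the Info value.
func prefer(xmp, info string) string {
	if xmp != "" {
		return xmp
	}
	return info
}

// merge combines both sources with XMP taking precedence.
func merge(xmp, info meta) meta {
	return meta{
		Title:  prefer(xmp.Title, info.Title),
		Author: prefer(xmp.Author, info.Author),
	}
}

func main() {
	xmp := meta{Title: "Annual Report 2025"}
	info := meta{Title: "report_final.docx", Author: "J. Doe"}
	fmt.Printf("%+v\n", merge(xmp, info))
}
```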

func (*Reader) MetadataFull ¶

func (r *Reader) MetadataFull() (MetadataFull, error)

MetadataFull returns a comprehensive metadata report for the PDF.

func (*Reader) MetadataJSON ¶

func (r *Reader) MetadataJSON(w io.Writer) error

MetadataJSON writes the full metadata as pretty JSON to the provided writer.

func (*Reader) NumPage ¶

func (r *Reader) NumPage() int

NumPage returns the number of pages in the PDF file.

func (*Reader) Outline ¶

func (r *Reader) Outline() Outline

Outline returns the document outline. The Outline returned is the root of the outline tree and typically has no Title itself. That is, the children of the returned root are the top-level entries in the outline.
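Because the root carries no title, a traversal usually starts with its children. The following sketch walks a locally defined outline tree of the same shape and produces an indented table of contents; flatten and sampleOutline are illustrative helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// outline has the same shape as the package's Outline type.
type outline struct {
	Title string
	Child []outline
}

// flatten appends the titles below o to out, indenting by depth.
// The title of o itself is skipped, matching an untitled root.
func flatten(o outline, depth int, out []string) []string {
	for _, c := range o.Child {
		out = append(out, strings.Repeat("  ", depth)+c.Title)
		out = flatten(c, depth+1, out)
	}
	return out
}

// sampleOutline builds a small tree with an untitled root.
func sampleOutline() outline {
	return outline{Child: []outline{
		{Title: "Chapter 1", Child: []outline{{Title: "Section 1.1"}}},
		{Title: "Chapter 2"},
	}}
}

func main() {
	for _, line := range flatten(sampleOutline(), 0, nil) {
		fmt.Println(line)
	}
}
```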

func (*Reader) Page ¶

func (r *Reader) Page(num int) Page

Page returns the page for the given page number. Page numbers are indexed starting at 1, not 0. If the page is not found, Page returns a Page with p.V.IsNull().

func (*Reader) Trailer ¶

func (r *Reader) Trailer() Value

Trailer returns the file's Trailer value.

type Rect ¶

type Rect struct {
	Min, Max Point
}

A Rect represents a rectangle.

type Row ¶

type Row struct {
	Position int64
	Content  TextHorizontal
}

Row represents the contents of a row.

type Rows ¶

type Rows []*Row

Rows is a list of rows.

type Stack ¶

type Stack struct {
	// contains filtered or unexported fields
}

A Stack represents a stack of values.

func (*Stack) Len ¶

func (stk *Stack) Len() int

func (*Stack) Pop ¶

func (stk *Stack) Pop() Value

func (*Stack) Push ¶

func (stk *Stack) Push(v Value)

type StrictExtractor ¶

type StrictExtractor struct{}

StrictExtractor enforces strict parsing. If any page fails, the entire extraction fails.

func (*StrictExtractor) ExtractPage ¶

func (s *StrictExtractor) ExtractPage(ctx context.Context, page *Page) (string, error)

type Text ¶

type Text struct {
	Font     string  // the font used
	FontSize float64 // the font size, in points (1/72 of an inch)
	X        float64 // the X coordinate, in points, increasing left to right
	Y        float64 // the Y coordinate, in points, increasing bottom to top
	W        float64 // the width of the text, in points
	S        string  // the actual UTF-8 text
}

A Text represents a single piece of text drawn on a page.

type TextEncoding ¶

type TextEncoding interface {
	// Decode returns the UTF-8 text corresponding to
	// the sequence of code points in raw.
	Decode(raw string) (text string)
}

A TextEncoding represents a mapping between font code points and UTF-8 text.

type TextHorizontal ¶

type TextHorizontal []Text

TextHorizontal implements sort.Interface for sorting a slice of Text values in horizontal order, left to right, and then top to bottom within a column.

func (TextHorizontal) Len ¶

func (x TextHorizontal) Len() int

func (TextHorizontal) Less ¶

func (x TextHorizontal) Less(i, j int) bool

func (TextHorizontal) Swap ¶

func (x TextHorizontal) Swap(i, j int)

type TextVertical ¶

type TextVertical []Text

TextVertical implements sort.Interface for sorting a slice of Text values in vertical order, top to bottom, and then left to right within a line.

func (TextVertical) Len ¶

func (x TextVertical) Len() int

func (TextVertical) Less ¶

func (x TextVertical) Less(i, j int) bool

func (TextVertical) Swap ¶

func (x TextVertical) Swap(i, j int)
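The vertical ordering described for TextVertical can be sketched with a local sort.Interface implementation. Since Y increases from bottom to top, "top to bottom" means larger Y values sort first. The byVertical and text types are stand-ins written for this example and are not the package's code.

```go
package main

import (
	"fmt"
	"sort"
)

// text keeps only the fields needed for ordering.
type text struct {
	X, Y float64
	S    string
}

// byVertical orders text top to bottom, then left to right within a line.
type byVertical []text

func (x byVertical) Len() int      { return len(x) }
func (x byVertical) Swap(i, j int) { x[i], x[j] = x[j], x[i] }
func (x byVertical) Less(i, j int) bool {
	if x[i].Y != x[j].Y {
		return x[i].Y > x[j].Y // higher on the page comes first
	}
	return x[i].X < x[j].X
}

// sortedTexts sorts a small sample and returns the strings in order.
func sortedTexts() []string {
	ts := byVertical{
		{X: 100, Y: 700, S: "B"},
		{X: 50, Y: 650, S: "C"},
		{X: 50, Y: 700, S: "A"},
	}
	sort.Sort(ts)
	out := make([]string, 0, len(ts))
	for _, t := range ts {
		out = append(out, t.S)
	}
	return out
}

func main() {
	fmt.Println(sortedTexts())
}
```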

type Value ¶

type Value struct {
	// contains filtered or unexported fields
}

A Value is a single PDF value, such as an integer, dictionary, or array. The zero Value is a PDF null (Kind() == Null, IsNull() == true).

func (Value) Bool ¶

func (v Value) Bool() bool

Bool returns v's boolean value. If v.Kind() != Bool, Bool returns false.

func (Value) Float64 ¶

func (v Value) Float64() float64

Float64 returns v's float64 value, converting from integer if necessary. If v.Kind() != Real and v.Kind() != Integer, Float64 returns 0.

func (Value) Index ¶

func (v Value) Index(i int) Value

Index returns the i'th element in the array v. If v.Kind() != Array or if i is outside the array bounds, Index returns a null Value.

func (Value) Int64 ¶

func (v Value) Int64() int64

Int64 returns v's int64 value. If v.Kind() != Integer, Int64 returns 0.

func (Value) IsNull ¶

func (v Value) IsNull() bool

IsNull reports whether the value is a null. It is equivalent to Kind() == Null.

func (Value) Key ¶

func (v Value) Key(key string) Value

Key returns the value associated with the given name key in the dictionary v. Like the result of the Name method, the key should not include a leading slash. If v is a stream, Key applies to the stream's header dictionary. If v.Kind() != Dict and v.Kind() != Stream, Key returns a null Value.

func (Value) Keys ¶

func (v Value) Keys() []string

Keys returns a sorted list of the keys in the dictionary v. If v is a stream, Keys applies to the stream's header dictionary. If v.Kind() != Dict and v.Kind() != Stream, Keys returns nil.

func (Value) Kind ¶

func (v Value) Kind() ValueKind

Kind reports the kind of value underlying v.

func (Value) Len ¶

func (v Value) Len() int

Len returns the length of the array v. If v.Kind() != Array, Len returns 0.

func (Value) Name ¶

func (v Value) Name() string

Name returns v's name value. If v.Kind() != Name, Name returns the empty string. The returned name does not include the leading slash: if v corresponds to the name written using the syntax /Helvetica, Name() == "Helvetica".

func (Value) RawString ¶

func (v Value) RawString() string

RawString returns v's string value. If v.Kind() != String, RawString returns the empty string.

func (Value) Reader ¶

func (v Value) Reader() io.ReadCloser

Reader returns the data contained in the stream v. If v.Kind() != Stream, Reader returns a ReadCloser that responds to all reads with a “stream not present” error.

func (Value) String ¶

func (v Value) String() string

String returns a textual representation of the value v. Note that String is not the accessor for values with Kind() == String. To access such values, see RawString, Text, and TextFromUTF16.

func (Value) Text ¶

func (v Value) Text() string

Text returns v's string value interpreted as a “text string” (defined in the PDF spec) and converted to UTF-8. If v.Kind() != String, Text returns the empty string.

func (Value) TextFromUTF16 ¶

func (v Value) TextFromUTF16() string

TextFromUTF16 returns v's string value interpreted as big-endian UTF-16 and then converted to UTF-8. If v.Kind() != String or if the data is not valid UTF-16, TextFromUTF16 returns the empty string.
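Decoding big-endian UTF-16 can be sketched with unicode/utf16. This illustration rejects only odd-length input; unlike the behavior documented above, unpaired surrogates are replaced with U+FFFD by utf16.Decode rather than causing an empty result. The function name is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// textFromUTF16BE interprets s as big-endian UTF-16 code units and
// returns the UTF-8 equivalent, or "" when the byte count is odd.
func textFromUTF16BE(s string) string {
	if len(s)%2 != 0 {
		return ""
	}
	units := make([]uint16, 0, len(s)/2)
	for i := 0; i < len(s); i += 2 {
		// High byte first: that is what big-endian means.
		units = append(units, uint16(s[i])<<8|uint16(s[i+1]))
	}
	return string(utf16.Decode(units))
}

func main() {
	fmt.Println(textFromUTF16BE("\x00H\x00i"))
	fmt.Printf("%q\n", textFromUTF16BE("\x00H\x00"))
}
```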

type ValueKind ¶

type ValueKind int

A ValueKind specifies the kind of data underlying a Value.

const (
	Null ValueKind = iota
	Bool
	Integer
	Real
	String
	Name
	Dict
	Array
	Stream
)

The PDF value kinds.

Notes ¶

Bugs ¶

The package is incomplete, although it has been used successfully on some large real-world PDF files.
There is no support for closing open PDF files. If you drop all references to a Reader, the underlying reader will eventually be garbage collected.
The library makes no attempt at efficiency. A value cache maintained in the Reader would probably help significantly.
The support for reading encrypted files is weak.
The Value API does not support error reporting. The intent is to allow users to set an error reporting callback in Reader, but that code has not been implemented.

Directories ¶

Path	Synopsis
examples
logger
tracer
