pdftotext

Version: v0.0.0-...-b076835
Warning: This package is not in the latest version of its module.
Published: Jun 23, 2024 License: MIT Imports: 5 Imported by: 0

README

go-pdftotext

Package pdftotext is a wrapper for the Xpdf command-line tool `pdftotext`.

What is pdftotext?

`pdftotext` converts Portable Document Format (PDF) files to plain text.

Reference: https://www.xpdfreader.com/pdftotext-man.html

Documentation

Functions

func WithByteOrderMarker

func WithByteOrderMarker() option

Insert a Unicode byte order marker (BOM) at the start of the text output.
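A consumer that does not expect the marker may want to strip it again after reading the output. The following is a minimal sketch, assuming UTF-8 output; `stripBOM` is a hypothetical helper and not part of this package.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF, the Unicode byte order mark.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripBOM is a hypothetical helper (not part of this package). It removes a
// leading UTF-8 BOM, if present, and reports whether one was found.
func stripBOM(b []byte) ([]byte, bool) {
	if bytes.HasPrefix(b, utf8BOM) {
		return b[len(utf8BOM):], true
	}
	return b, false
}

func main() {
	// Simulated output that starts with a BOM.
	out := append([]byte{0xEF, 0xBB, 0xBF}, "hello"...)
	text, found := stripBOM(out)
	fmt.Println(string(text), found)
}
```

Other output encodings use different marker bytes, so this check only applies when the encoding is UTF-8.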

func WithCharFixedWidth

func WithCharFixedWidth(width uint64) option

Specify the character pitch (width), in points.

Works only with `WithModeLayout`, `WithModeTable` and `WithModeLinePrinter`.

func WithCustomConfig

func WithCustomConfig(path string) option

Read the configuration file at `path` in place of ~/.xpdfrc or the system-wide config file.

func WithCustomPath

func WithCustomPath(path string) option

Set a custom location for the `pdftotext` executable.

func WithEncoding

func WithEncoding(name string) option

Sets the encoding to use for text output.

The name must be defined with the unicodeMap command (see xpdfrc(5)). The encoding name is case-sensitive. This defaults to "Latin1".

To list the available encodings, run `pdftotext -listencodings`.

func WithEndOfLine

func WithEndOfLine(kind string) option

Sets the end-of-line convention to use for text output.

Available options: "unix", "dos", "mac".
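Since only three values are accepted, a caller may want to check the value before building a command. The sketch below assumes the option maps to the `-eol` flag of `pdftotext`; `eolArgs` is a hypothetical helper, and whether this package performs the same validation is not stated here.

```go
package main

import "fmt"

// eolArgs is a hypothetical helper (not part of this package). It validates
// the end-of-line kind and returns the matching pdftotext arguments.
func eolArgs(kind string) ([]string, error) {
	switch kind {
	case "unix", "dos", "mac":
		return []string{"-eol", kind}, nil
	default:
		return nil, fmt.Errorf("unsupported end-of-line kind %q", kind)
	}
}

func main() {
	for _, kind := range []string{"unix", "windows"} {
		args, err := eolArgs(kind)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(args)
	}
}
```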

func WithLineFixedSpacing

func WithLineFixedSpacing(spacing uint64) option

Specify the line spacing, in points.

Works only with `WithModeLinePrinter`.

func WithMargin

func WithMargin(t, r, b, l uint64) option

Specifies the top, right, bottom and left margins, in points.

func WithMarginBottom

func WithMarginBottom(margin uint64) option

Specifies the bottom margin, in points.

Text in the bottom margin (i.e., within that many points of the bottom edge of the page) is discarded.

func WithMarginLeft

func WithMarginLeft(margin uint64) option

Specifies the left margin, in points.

Text in the left margin (i.e., within that many points of the left edge of the page) is discarded.

func WithMarginRight

func WithMarginRight(margin uint64) option

Specifies the right margin, in points.

Text in the right margin (i.e., within that many points of the right edge of the page) is discarded.

func WithMarginTop

func WithMarginTop(margin uint64) option

Specifies the top margin, in points.

Text in the top margin (i.e., within that many points of the top edge of the page) is discarded.

func WithModeLayout

func WithModeLayout() option

Maintain (as best as possible) the original physical layout of the text.

func WithModeLinePrinter

func WithModeLinePrinter() option

Line printer mode uses a strict fixed-character-pitch and -height layout. The page is broken into a grid, and characters are placed into that grid.

If the grid spacing is too small for the actual characters, the result is extra whitespace. If the grid spacing is too large, the result is missing whitespace.

Use `WithCharFixedWidth` and `WithLineFixedSpacing` to specify the grid spacing. If one or both are not given, `pdftotext` will attempt to compute appropriate values.
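To make the effect of the grid spacing concrete, here is a toy model of placing characters on a fixed-pitch grid. It is an illustration only, not the algorithm Xpdf uses; the `glyph` type, the `placeOnGrid` function and the rounding rule are all assumptions made for this sketch.

```go
package main

import "fmt"

// glyph is a hypothetical record of one character and its horizontal
// position on the page, in points.
type glyph struct {
	x float64
	r rune
}

// placeOnGrid drops each glyph into the grid column that contains its
// x position. Columns that receive no glyph stay blank.
func placeOnGrid(glyphs []glyph, pitch float64) string {
	var cells []rune
	for _, g := range glyphs {
		col := int(g.x / pitch)
		for len(cells) <= col {
			cells = append(cells, ' ')
		}
		cells[col] = g.r
	}
	return string(cells)
}

func main() {
	// "a b" set in 6-point-wide glyphs: 'a' at x=0, a space, 'b' at x=12.
	line := []glyph{{0, 'a'}, {12, 'b'}}
	fmt.Printf("%q\n", placeOnGrid(line, 6))  // pitch matches the glyphs
	fmt.Printf("%q\n", placeOnGrid(line, 3))  // pitch too small
	fmt.Printf("%q\n", placeOnGrid(line, 12)) // pitch too large
}
```

In this model a pitch of 6 reproduces `"a b"`, a pitch of 3 stretches it to `"a   b"`, and a pitch of 12 collapses it to `"ab"`, which mirrors the extra and missing whitespace described above.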

func WithModeRaw

func WithModeRaw() option

Keep the text in content stream order.

Depending on how the PDF file was generated, this may or may not be useful.

func WithModeSimple

func WithModeSimple() option

Similar to `WithModeLayout`, but optimized for simple one-column pages.

This mode will do a better job of maintaining horizontal spacing, but it will only work properly with a single column of text.

func WithModeSimple2

func WithModeSimple2() option

Similar to `WithModeSimple` but handles slightly rotated text better.

Only works for pages with a single column of text.

func WithModeTable

func WithModeTable() option

Table mode is similar to physical layout mode, but optimized for tabular data, with the goal of keeping rows and columns aligned (at the expense of inserting extra whitespace).

If the `WithCharFixedWidth` option is given, character spacing within each line will be determined by the specified character pitch.

func WithNoPageBreak

func WithNoPageBreak() option

Don't insert a page break (form feed character) at the end of each page.
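When page breaks are left in, the form feed characters can be used to recover per-page text from the output. A minimal sketch, with `splitPages` as a hypothetical helper that is not part of this package:

```go
package main

import (
	"fmt"
	"strings"
)

// splitPages is a hypothetical helper (not part of this package). It splits
// extracted text on form feed characters and drops the empty trailing
// element left behind by the form feed that ends the last page.
func splitPages(text string) []string {
	pages := strings.Split(text, "\f")
	if n := len(pages); n > 0 && pages[n-1] == "" {
		pages = pages[:n-1]
	}
	return pages
}

func main() {
	// Simulated output for a two-page document.
	out := "first page\n\fsecond page\n\f"
	for i, page := range splitPages(out) {
		fmt.Printf("page %d: %q\n", i+1, page)
	}
}
```

With `WithNoPageBreak` the output contains no form feeds, so this split would return the whole text as a single element.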

func WithNoTextDiagonal

func WithNoTextDiagonal() option

Diagonal text, i.e., text that is not close to one of the 0, 90, 180, or 270 degree axes, is discarded.

This is useful to skip watermarks drawn on top of body text, etc.

func WithOwnerPassword

func WithOwnerPassword(password string) option

Specify the owner password for the PDF file.

Providing this will bypass all security restrictions.

func WithPageFrom

func WithPageFrom(page uint64) option

Specifies the first page to convert.

func WithPageRange

func WithPageRange(from, to uint64) option

Specifies the range of pages to convert.

func WithPageTo

func WithPageTo(page uint64) option

Specifies the last page to convert.

func WithTextClipping

func WithTextClipping() option

Text which is hidden because of clipping is removed before doing layout, and then added back in.

This can be helpful for tables where clipped (invisible) text would overlap the next column.

func WithUserPassword

func WithUserPassword(password string) option

Specify the user password for the PDF file.

Types

type Command

type Command struct {
	// contains filtered or unexported fields
}

func NewCommand

func NewCommand(opts ...option) (*Command, error)

NewCommand creates a new `pdftotext` command.

func (*Command) Run

func (c *Command) Run(ctx context.Context, inpath string) (io.Reader, error)

Run executes the prepared `pdftotext` command.

func (*Command) String

func (c *Command) String() string

String returns a human-readable description of the command.
