pdftotext

v0.0.0-...-466b15e Latest

This package is not in the latest version of its module.

Published: Nov 1, 2024 License: MIT Imports: 7 Imported by: 0

README

pdftotext

A Go library for converting PDF files to text using the pdftotext utility.

Prerequisites

  • The pdftotext utility installed on your system (usually part of the poppler-utils package)

Installing pdftotext

Ubuntu/Debian:

sudo apt-get install poppler-utils

macOS:

brew install poppler
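
After installing, it can be useful to confirm that the binary is actually reachable before constructing a converter. The following is a minimal sketch using only the standard library's exec.LookPath; the helper name findBinary is hypothetical and is not part of this package, and the library may perform its own lookup differently.

```go
package main

import (
	"fmt"
	"os/exec"
)

// findBinary is a hypothetical helper: it resolves name against PATH
// and returns the resolved path, or a wrapped error if it is missing.
func findBinary(name string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		return "", fmt.Errorf("%s not found on PATH: %w", name, err)
	}
	return path, nil
}

func main() {
	path, err := findBinary("pdftotext")
	if err != nil {
		fmt.Println("install poppler first:", err)
		return
	}
	fmt.Println("using", path)
}
```

Wrapping with %w keeps the underlying exec error available to errors.Is and errors.As in calling code.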

Installation

go get github.com/joeychilson/pdftotext

Quick Start

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/joeychilson/pdftotext"
)

func main() {
    ctx := context.Background()

    converter, err := pdftotext.New()
    if err != nil {
        log.Fatal(err)
    }

    text, err := converter.Convert(ctx, "input.pdf", &pdftotext.Options{
        Layout:   true,
        Encoding: "UTF-8",
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(text)
}

Converting to File

err = converter.ConvertToFile(ctx, "input.pdf", "output.txt", &pdftotext.Options{
    Layout:   true,
    Encoding: "UTF-8",
})
if err != nil {
    log.Fatal(err)
}

Available Options

type Options struct {
	// FirstPage is the first page to convert
	FirstPage int
	// LastPage is the last page to convert
	LastPage int
	// Resolution is the resolution in DPI (default 72)
	Resolution int
	// CropX is the X-coordinate of crop area
	CropX int
	// CropY is the Y-coordinate of crop area
	CropY int
	// CropWidth is the width of crop area
	CropWidth int
	// CropHeight is the height of crop area
	CropHeight int
	// Layout maintains the original layout
	Layout bool
	// FixedPitch assumes fixed-pitch (or tabular) text with the given character width in points
	FixedPitch float64
	// Raw keeps text in content stream order
	Raw bool
	// NoDiagonal discards diagonal text
	NoDiagonal bool
	// HTMLMeta generates HTML with meta information
	HTMLMeta bool
	// BBox generates XHTML with word bounding boxes
	BBox bool
	// BBoxLayout generates XHTML with block/line/word bounding boxes
	BBoxLayout bool
	// TSV generates TSV with bounding box information
	TSV bool
	// CropBox uses crop box instead of media box
	CropBox bool
	// ColSpacing is the column spacing (default 0.7)
	ColSpacing float64
	// Encoding is the text output encoding (default UTF-8)
	Encoding string
	// EOL is the end-of-line convention (default Unix)
	EOL EOLType
	// NoPageBreaks disables the insertion of page breaks between pages
	NoPageBreaks bool
	// OwnerPassword is the PDF owner password
	OwnerPassword string
	// UserPassword is the PDF user password
	UserPassword string
	// Quiet suppresses messages and errors
	Quiet bool
}

Error Handling

The library defines sentinel error values for common failure cases:

var (
    ErrPDFOpen        = errors.New("error opening PDF file")
    ErrOutputFile     = errors.New("error opening output file")
    ErrPermissions    = errors.New("error related to PDF permissions")
    ErrInvalidPage    = errors.New("invalid page number")
    ErrInvalidRange   = errors.New("invalid page range")
    ErrCommandFailed  = errors.New("pdftotext command failed")
    ErrBinaryNotFound = errors.New("pdftotext binary not found")
)

Documentation

Constants

This section is empty.

Variables

var (
	// ErrPDFOpen is returned when there is an error opening the PDF file
	ErrPDFOpen = errors.New("error opening PDF file")
	// ErrOutputFile is returned when there is an error opening the output file
	ErrOutputFile = errors.New("error opening output file")
	// ErrPermissions is returned when there is an error related to PDF permissions
	ErrPermissions = errors.New("error related to PDF permissions")
	// ErrInvalidPage is returned when the page number is invalid
	ErrInvalidPage = errors.New("invalid page number")
	// ErrInvalidRange is returned when the page range is invalid
	ErrInvalidRange = errors.New("invalid page range")
	// ErrCommandFailed is returned when the pdftotext command fails
	ErrCommandFailed = errors.New("pdftotext command failed")
	// ErrBinaryNotFound is returned when the pdftotext binary is not found
	ErrBinaryNotFound = errors.New("pdftotext binary not found")
)

Functions

This section is empty.

Types

type Converter

type Converter struct {
	// contains filtered or unexported fields
}

Converter represents a PDF to text converter

func New

func New() (*Converter, error)

New creates a new Converter instance

func (*Converter) Convert

func (c *Converter) Convert(ctx context.Context, inputPath string, opts *Options) (string, error)

Convert converts a PDF file to text and returns the result

func (*Converter) ConvertToFile

func (c *Converter) ConvertToFile(ctx context.Context, inputPath, outputPath string, opts *Options) error

ConvertToFile converts a PDF file to text and saves it to the specified output file

type EOLType

type EOLType string

EOLType represents the end-of-line convention

const (
	// EOLUnix represents the Unix end-of-line convention
	EOLUnix EOLType = "unix"
	// EOLDos represents the DOS end-of-line convention
	EOLDos EOLType = "dos"
	// EOLMac represents the Mac end-of-line convention
	EOLMac EOLType = "mac"
)
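
For reference, the three conventions correspond to the line terminators LF, CR LF, and CR. The following sketch spells that out; EOLType and its constants are re-declared locally so the example is self-contained, and the terminator function is a hypothetical helper, not part of this package.

```go
package main

import "fmt"

// EOLType and its constants are local copies of the declarations above.
type EOLType string

const (
	EOLUnix EOLType = "unix"
	EOLDos  EOLType = "dos"
	EOLMac  EOLType = "mac"
)

// terminator returns the line ending conventionally associated with
// each value; the boolean is false for unrecognized values.
func terminator(e EOLType) (string, bool) {
	switch e {
	case EOLUnix:
		return "\n", true
	case EOLDos:
		return "\r\n", true
	case EOLMac:
		return "\r", true
	}
	return "", false
}

func main() {
	for _, e := range []EOLType{EOLUnix, EOLDos, EOLMac} {
		t, _ := terminator(e)
		fmt.Printf("%s -> %q\n", e, t)
	}
}
```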

type Options

type Options struct {
	// FirstPage is the first page to convert
	FirstPage int
	// LastPage is the last page to convert
	LastPage int
	// Resolution is the resolution in DPI (default 72)
	Resolution int
	// CropX is the X-coordinate of crop area
	CropX int
	// CropY is the Y-coordinate of crop area
	CropY int
	// CropWidth is the width of crop area
	CropWidth int
	// CropHeight is the height of crop area
	CropHeight int
	// Layout maintains the original layout
	Layout bool
	// FixedPitch assumes fixed-pitch (or tabular) text with the given character width in points
	FixedPitch float64
	// Raw keeps text in content stream order
	Raw bool
	// NoDiagonal discards diagonal text
	NoDiagonal bool
	// HTMLMeta generates HTML with meta information
	HTMLMeta bool
	// BBox generates XHTML with word bounding boxes
	BBox bool
	// BBoxLayout generates XHTML with block/line/word bounding boxes
	BBoxLayout bool
	// TSV generates TSV with bounding box information
	TSV bool
	// CropBox uses crop box instead of media box
	CropBox bool
	// ColSpacing is the column spacing (default 0.7)
	ColSpacing float64
	// Encoding is the text output encoding (default UTF-8)
	Encoding string
	// EOL is the end-of-line convention (default Unix)
	EOL EOLType
	// NoPageBreaks disables the insertion of page breaks between pages
	NoPageBreaks bool
	// OwnerPassword is the PDF owner password
	OwnerPassword string
	// UserPassword is the PDF user password
	UserPassword string
	// Quiet suppresses messages and errors
	Quiet bool
}

Options represents the configuration options for the PDF conversion
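
Since the package exposes ErrInvalidPage and ErrInvalidRange, callers may want to check FirstPage and LastPage before starting a conversion. The documentation does not say which rules the library itself applies, so the sketch below is only one plausible reading: pages are 1-based, zero means "not set", negative values are invalid pages, and a first page greater than the last page is an invalid range. The checkPages helper and the locally re-declared sentinels are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the package's sentinel errors, declared here only
// so the sketch is self-contained.
var (
	ErrInvalidPage  = errors.New("invalid page number")
	ErrInvalidRange = errors.New("invalid page range")
)

// checkPages applies the assumed rules: zero means unset, negative page
// numbers are invalid, and first must not exceed last when both are set.
func checkPages(first, last int) error {
	if first < 0 || last < 0 {
		return ErrInvalidPage
	}
	if first > 0 && last > 0 && first > last {
		return ErrInvalidRange
	}
	return nil
}

func main() {
	fmt.Println(checkPages(2, 5))
	fmt.Println(checkPages(5, 2))
	fmt.Println(checkPages(-1, 3))
}
```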
