pdfextract

package
v1.0.1
Warning

This package is not in the latest version of its module.

Published: May 11, 2026 License: MIT Imports: 19 Imported by: 0

Documentation

Constants

This section is empty.

Variables

var BAD_PDF_SHA1HEX = []string{}/* 152 elements not displayed */

BAD_PDF_SHA1HEX lists SHA-1 hex digests of PDFs that are known to cause timeouts when processed with poppler. It is a workaround: the usual Kafka timeout catcher does not work on these files, possibly due to threading.
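A list like this is typically consulted before any expensive processing starts. The following is a minimal sketch of such a check, not the package's own code: `newBlocklist`, `sha1Hex`, and `isBlocked` are hypothetical helpers, and the single entry is the SHA-1 of empty input, used only as a placeholder.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"strings"
)

// newBlocklist turns a slice of hex digests (such as BAD_PDF_SHA1HEX) into a
// set, normalizing to lowercase so lookups match lowercase digests.
// Hypothetical helper, not part of the package.
func newBlocklist(digests []string) map[string]struct{} {
	set := make(map[string]struct{}, len(digests))
	for _, d := range digests {
		set[strings.ToLower(d)] = struct{}{}
	}
	return set
}

// sha1Hex returns the lowercase hex SHA-1 digest of blob.
func sha1Hex(blob []byte) string {
	sum := sha1.Sum(blob)
	return hex.EncodeToString(sum[:])
}

// isBlocked reports whether the digest of blob is in the set.
func isBlocked(set map[string]struct{}, blob []byte) bool {
	_, ok := set[sha1Hex(blob)]
	return ok
}

func main() {
	// Placeholder entry: the SHA-1 of empty input, not a real bad PDF.
	set := newBlocklist([]string{"DA39A3EE5E6B4B0D3255BFEF95601890AFD80709"})
	fmt.Println(isBlocked(set, nil))             // true
	fmt.Println(isBlocked(set, []byte("%PDF-"))) // false
}
```

Using a set instead of scanning the slice keeps the lookup cost constant per blob, which matters little for 152 entries but costs nothing.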

var ErrNoData = errors.New("no data")

Functions

This section is empty.

Types

type Dim

type Dim struct {
	W int
	H int
}

Dim is the thumbnail size in pixels.

type FileInfo

type FileInfo struct {
	Size      int64  `json:"size"`
	SHA1Hex   string `json:"sha1hex"`
	SHA256Hex string `json:"sha256hex"`
	MD5Hex    string `json:"md5hex"`
	Mimetype  string `json:"mimetype"`
}

FileInfo groups the size, checksums, and MIME type of a file. All checksums should be lowercase hex digests.

func (*FileInfo) FromBytes

func (fi *FileInfo) FromBytes(p []byte)

FromBytes populates the FileInfo from a byte slice.

func (*FileInfo) FromFile

func (fi *FileInfo) FromFile(filename string) error

FromFile populates the FileInfo from the file at the given path.

func (*FileInfo) FromReader

func (fi *FileInfo) FromReader(r io.Reader) error

FromReader populates the FileInfo fields from the data read from r.

type Options

type Options struct {
	Dim       Dim
	ThumbType string
}

Options controls the PDF extraction process.

type Result

type Result struct {
	SHA1Hex        string            `json:"sha1hex,omitempty"`        // The SHA1 of the PDF, used later as key.
	Status         string            `json:"status,omitempty"`         // A free form status string.
	Err            error             `json:"err,omitempty"`            // Any error we encountered.
	FileInfo       *FileInfo         `json:"fileinfo,omitempty"`       // Size and checksums.
	Text           string            `json:"text,omitempty"`           // Fulltext as parsed with a tool, e.g. pdftotext.
	Page0Thumbnail []byte            `json:"page0thumbnail,omitempty"` // Thumbnail image, jpg format.
	MetaXML        string            `json:"metaxml,omitempty"`        // Unassigned.
	Metadata       *pdfinfo.Metadata `json:"metadata,omitempty"`       // New, grouped by tool, info about a pdf.
	PDFExtra       *pdfinfo.PDFExtra `json:"pdfextra,omitempty"`       // pdfextra, as provided by sandcrawler
	Source         json.RawMessage   `json:"source,omitempty"`         // Unassigned.
	Weblinks       []string          `json:"weblinks,omitempty"`       // Extracted link candidates from fulltext.
}

Result is the result of text and thumbnail extraction from a PDF. The two are combined because the previous implementation used the poppler library to produce both in one go, for performance. The first processing error encountered is recorded in Err.

func ProcessBlob

func ProcessBlob(ctx context.Context, blob []byte, opts *Options) *Result

ProcessBlob takes a blob and returns a PDF extraction result.

TODO(martin): we can make this faster by running the various subprocesses in parallel.

TODO(martin): we take a blob from memory only to persist it and run the CLI tools over it; we should not require that much memory.

func ProcessFile

func ProcessFile(ctx context.Context, filename string, opts *Options) *Result

ProcessFile turns a PDF file into structured output.

func (*Result) HasPage0Thumbnail

func (result *Result) HasPage0Thumbnail() bool

HasPage0Thumbnail reports whether the result has a page 0 thumbnail. It is a derived property.
