pdfextract

package

Version: v1.0.1 (latest) | Published: May 11, 2026 | License: MIT | Imports: 19 | Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Repository

github.com/internetarchive/scholar

Documentation ¶

Index ¶

Variables
type Dim
type FileInfo
type Options
type Result
- func ProcessBlob(ctx context.Context, blob []byte, opts *Options) *Result
- func ProcessFile(ctx context.Context, filename string, opts *Options) *Result
- func (result *Result) HasPage0Thumbnail() bool

Constants ¶

This section is empty.

Variables ¶

var BAD_PDF_SHA1HEX = []string{}/* 152 elements not displayed */

BAD_PDF_SHA1HEX lists the SHA1 hex digests of PDFs that cause timeouts when processed with poppler. It is a workaround: for some reason, the usual Kafka timeout catcher does not work on these files, possibly due to threading.
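A skip list like this is usually consulted before any expensive processing starts. The following is a minimal sketch of such a check, not the package's actual code: `newSkipSet`, `sha1Hex`, and `shouldSkip` are hypothetical helpers, and the list contents are made up for illustration.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// newSkipSet turns a slice of hex digests (the shape of BAD_PDF_SHA1HEX)
// into a set for constant-time lookups. Hypothetical helper.
func newSkipSet(hexes []string) map[string]struct{} {
	set := make(map[string]struct{}, len(hexes))
	for _, h := range hexes {
		set[h] = struct{}{}
	}
	return set
}

// sha1Hex returns the lowercase hex SHA1 digest of blob.
func sha1Hex(blob []byte) string {
	sum := sha1.Sum(blob)
	return hex.EncodeToString(sum[:])
}

// shouldSkip reports whether the digest of blob appears in the skip set.
func shouldSkip(blob []byte, skip map[string]struct{}) bool {
	_, ok := skip[sha1Hex(blob)]
	return ok
}

func main() {
	// Stand-in for the real list: the digest of a made-up "bad" payload.
	skip := newSkipSet([]string{sha1Hex([]byte("known bad"))})

	fmt.Println(shouldSkip([]byte("known bad"), skip)) // true
	fmt.Println(shouldSkip([]byte("fine"), skip))      // false
}
```

Converting the slice to a set once keeps each lookup cheap, which matters little for 152 entries but costs nothing either.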

var ErrNoData = errors.New("no data")

Functions ¶

This section is empty.

Types ¶

type Dim ¶

type Dim struct {
	W int
	H int
}

Dim holds a width and height in pixels, used for the thumbnail size.

type FileInfo ¶

type FileInfo struct {
	Size      int64  `json:"size"`
	SHA1Hex   string `json:"sha1hex"`
	SHA256Hex string `json:"sha256hex"`
	MD5Hex    string `json:"md5hex"`
	Mimetype  string `json:"mimetype"`
}

FileInfo groups the size, checksums, and MIME type of a file. The checksums should all be lowercase hex digests.

func (*FileInfo) FromBytes ¶

func (fi *FileInfo) FromBytes(p []byte)

FromBytes populates the FileInfo from a byte slice.

func (*FileInfo) FromFile ¶

func (fi *FileInfo) FromFile(filename string) error

FromFile populates the FileInfo from the file at the given path.

func (*FileInfo) FromReader ¶

func (fi *FileInfo) FromReader(r io.Reader) error

FromReader populates the FileInfo fields from the data read from r.
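To illustrate how size and all three digests can be computed in a single pass over a reader, here is a sketch using only the standard library. The `fileInfo` type below is a local stand-in that mirrors the documented fields; it is an assumption about the approach, not the package's implementation, and it leaves out Mimetype because the detection method is not documented here.

```go
package main

import (
	"bytes"
	"crypto/md5"
	"crypto/sha1"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// fileInfo is a local stand-in mirroring the documented FileInfo fields
// (Mimetype omitted).
type fileInfo struct {
	Size      int64
	SHA1Hex   string
	SHA256Hex string
	MD5Hex    string
}

// fromReader consumes r once, feeding every byte to all three hashes.
func (fi *fileInfo) fromReader(r io.Reader) error {
	h1, h256, h5 := sha1.New(), sha256.New(), md5.New()
	n, err := io.Copy(io.MultiWriter(h1, h256, h5), r)
	if err != nil {
		return err
	}
	fi.Size = n
	// hex.EncodeToString always yields lowercase digits.
	fi.SHA1Hex = hex.EncodeToString(h1.Sum(nil))
	fi.SHA256Hex = hex.EncodeToString(h256.Sum(nil))
	fi.MD5Hex = hex.EncodeToString(h5.Sum(nil))
	return nil
}

// fromBytes wraps the slice in a reader; reading from memory cannot fail.
func (fi *fileInfo) fromBytes(p []byte) {
	_ = fi.fromReader(bytes.NewReader(p))
}

func main() {
	var fi fileInfo
	fi.fromBytes([]byte("abc"))
	fmt.Println(fi.Size)    // 3
	fmt.Println(fi.SHA1Hex) // a9993e364706816aba3e25717850c26c9cd0d89d
	fmt.Println(fi.MD5Hex)  // 900150983cd24fb0d6963f7d28e17f72
}
```

Using `io.MultiWriter` avoids reading the input three times or buffering it fully in memory.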

type Options ¶

type Options struct {
	Dim       Dim
	ThumbType string
}

Options controls the PDF extraction process.

type Result ¶

type Result struct {
	SHA1Hex        string            `json:"sha1hex,omitempty"`        // The SHA1 of the PDF, used later as key.
	Status         string            `json:"status,omitempty"`         // A free form status string.
	Err            error             `json:"err,omitempty"`            // Any error we encountered.
	FileInfo       *FileInfo         `json:"fileinfo,omitempty"`       // Size and checksums.
	Text           string            `json:"text,omitempty"`           // Fulltext as parsed with a tool, e.g. pdftotext.
	Page0Thumbnail []byte            `json:"page0thumbnail,omitempty"` // Thumbnail image, jpg format.
	MetaXML        string            `json:"metaxml,omitempty"`        // Unassigned.
	Metadata       *pdfinfo.Metadata `json:"metadata,omitempty"`       // New, grouped by tool, info about a pdf.
	PDFExtra       *pdfinfo.PDFExtra `json:"pdfextra,omitempty"`       // pdfextra, as provided by sandcrawler
	Source         json.RawMessage   `json:"source,omitempty"`         // Unassigned.
	Weblinks       []string          `json:"weblinks,omitempty"`       // Extracted link candidates from fulltext.
}

Result is the result of a text and thumbnail extraction from a PDF. Both are combined because the previous implementation used the poppler library to do both in one go, for performance. The first processing error encountered is recorded in Err.
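The "first error wins" behavior described for Err can be sketched as below. This is an illustration under assumptions: the `result` type is a reduced local stand-in, the `fail` method and the status strings are hypothetical, and `errNoData` merely mirrors the package's ErrNoData.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoData mirrors the package-level ErrNoData; errLater is a made-up
// second failure used to show that it does not overwrite the first.
var (
	errNoData = errors.New("no data")
	errLater  = errors.New("later failure")
)

// result is a reduced stand-in for Result with only the fields used here.
type result struct {
	Status string
	Err    error
}

// fail records err and status only if no earlier error was stored.
func (r *result) fail(status string, err error) {
	if err == nil || r.Err != nil {
		return
	}
	r.Status = status
	r.Err = err
}

func main() {
	r := &result{}
	r.fail("empty-blob", errNoData)      // recorded
	r.fail("thumbnail-failed", errLater) // ignored, an error is already set

	fmt.Println(r.Status)                     // empty-blob
	fmt.Println(errors.Is(r.Err, errNoData)) // true
}
```

Callers can then inspect a single Err field with `errors.Is` instead of collecting and ranking several failures.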

func ProcessBlob ¶

func ProcessBlob(ctx context.Context, blob []byte, opts *Options) *Result

ProcessBlob takes a blob and returns a PDF extraction result.

TODO(martin): we can make this faster by running various subprocesses in parallel.

TODO(martin): we take a blob from memory only to persist it and run the CLI tools over it; we should not require that much memory.
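The second TODO describes a persist-then-shell-out flow. A rough sketch of that flow, assuming a temporary file and the poppler `pdftotext` command line tool, might look as follows. The helper names `persist` and `extractText`, the temp file pattern, and the 30 second timeout are all assumptions made for illustration; the package's real tool invocations are not documented here.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// persist writes blob to a temporary file and returns its path together
// with a cleanup function that removes the file. Hypothetical helper.
func persist(blob []byte) (string, func(), error) {
	f, err := os.CreateTemp("", "pdfextract-*.pdf")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { os.Remove(f.Name()) }
	if _, err := f.Write(blob); err != nil {
		f.Close()
		cleanup()
		return "", nil, err
	}
	if err := f.Close(); err != nil {
		cleanup()
		return "", nil, err
	}
	return f.Name(), cleanup, nil
}

// extractText runs pdftotext on path; "-" sends the text to stdout.
// The command is killed if ctx expires first.
func extractText(ctx context.Context, path string) (string, error) {
	out, err := exec.CommandContext(ctx, "pdftotext", path, "-").Output()
	return string(out), err
}

func main() {
	// Placeholder bytes, not a valid PDF; a real caller passes the blob.
	path, cleanup, err := persist([]byte("%PDF-1.4 placeholder"))
	if err != nil {
		fmt.Println("persist:", err)
		return
	}
	defer cleanup()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	text, err := extractText(ctx, path)
	if err != nil {
		fmt.Println("pdftotext failed:", err)
		return
	}
	fmt.Println(len(text), "bytes of text")
}
```

Passing the context to `exec.CommandContext` bounds the subprocess runtime, which is one way to guard against the kind of hangs the skip list works around.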

func ProcessFile ¶

func ProcessFile(ctx context.Context, filename string, opts *Options) *Result

ProcessFile turns a PDF file into structured output.

func (*Result) HasPage0Thumbnail ¶

func (result *Result) HasPage0Thumbnail() bool

HasPage0Thumbnail is a derived property: it is computed from the Result's fields rather than stored.
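The documentation does not spell out how the property is derived. One plausible reading, given the method name and the Page0Thumbnail field, is a simple non-empty check; the sketch below shows that reading on a local stand-in type and should be treated as an assumption.

```go
package main

import "fmt"

// result is a reduced stand-in for Result with only the thumbnail field.
type result struct {
	Page0Thumbnail []byte
}

// hasPage0Thumbnail reports whether any thumbnail bytes are present.
// Assumed semantics; the actual method may check more than length.
func (r *result) hasPage0Thumbnail() bool {
	return len(r.Page0Thumbnail) > 0
}

func main() {
	empty := &result{}
	withThumb := &result{Page0Thumbnail: []byte{0xff, 0xd8}} // JPEG magic bytes

	fmt.Println(empty.hasPage0Thumbnail())     // false
	fmt.Println(withThumb.hasPage0Thumbnail()) // true
}
```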

Source Files ¶

pdfextract.go