dossier

package module
v0.0.1 Latest
Warning: This package is not in the latest version of its module.
Published: Mar 4, 2024 License: BSD-3-Clause Imports: 21 Imported by: 2

README

Extract information from PDF documents

Dossier is a Go library for extracting textual information from PDF documents.

PDF is currently the only supported format; parsing relies on MuPDF. Other formats can be supported by implementing a custom parser or by amending the library.

Sketches provide a declarative approach to locating information as an alternative to imperative/procedural access.

Sketches

A sketch is defined using protocol buffers. The sketch protobuf definition documents the available configuration options. Sketches are usually written in the textproto format.

A web-based viewer is included in the command line utility. Screenshot of the viewer with an example sketch for invoices:

[Image: graphical viewer showing an example invoice analysis]

Invocation:

$ dossiercli web ./invoice.pdf ./sketch.textproto
2023/12/31 00:00:00 HTTP server listening on http://[::1]:8080

Installation

go get github.com/hansmi/dossier

Command line utility:

go install github.com/hansmi/dossier/cmd/dossiercli@latest

Documentation

Constants

This section is empty.

Variables

var ErrStopVisitation = errors.New("stop visitation")
var ErrUnsupportedFileFormat = errors.New("unsupported file format")

Functions

This section is empty.

Types

type Document

type Document struct {
	// contains filtered or unexported fields
}

func NewDocument

func NewDocument(path string, opts ...DocumentOption) *Document

NewDocument constructs a new document. The file must not be modified while it's being used. Operations may open and close the file multiple times.

func (*Document) ContentType

func (d *Document) ContentType() (string, error)

ContentType determines and returns the MIME content-type of the source file. The returned string may contain parameters (e.g. charset). Use mime.ParseMediaType or similar to parse the type.
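Because the returned string may carry parameters, comparing it directly against "application/pdf" can fail. The following is a minimal sketch of stripping the parameters with the standard library's mime.ParseMediaType. The helper name baseMediaType and the sample input are illustrative assumptions, not part of this package.

```go
package main

import (
	"fmt"
	"mime"
)

// baseMediaType is a hypothetical helper. It drops parameters such as
// "charset" and returns the lowercased media type.
func baseMediaType(contentType string) (string, error) {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "", fmt.Errorf("parsing content type %q: %w", contentType, err)
	}
	return mediaType, nil
}

func main() {
	// Sample value only; in real code the string would come from
	// (*Document).ContentType.
	mediaType, err := baseMediaType("application/pdf; charset=binary")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(mediaType) // prints "application/pdf"
}
```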

func (*Document) Fingerprint

func (d *Document) Fingerprint() (string, error)

Fingerprint returns a best-effort file version identifier in the form of an opaque, non-empty string. While a changed fingerprint is indicative of a modified file, the fingerprint may also change for an unchanged file.
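One plausible use of such a fingerprint is cache invalidation: recompute a result only when the fingerprint differs from the one stored with the cached value. The sketch below illustrates that idea with made-up fingerprint strings; resultCache and its methods are hypothetical names and not part of this package. Since the fingerprint may change for an unchanged file, the cache may recompute more often than strictly necessary, which is harmless.

```go
package main

import "fmt"

// entry pairs a cached result with the fingerprint it was computed for.
type entry struct {
	fingerprint string
	value       string
}

// resultCache is a hypothetical in-memory cache keyed by file path.
type resultCache struct {
	entries map[string]entry
}

func newResultCache() *resultCache {
	return &resultCache{entries: map[string]entry{}}
}

// get returns the cached value for path if the stored fingerprint matches
// and calls compute otherwise, storing the fresh result.
func (c *resultCache) get(path, fingerprint string, compute func() (string, error)) (string, error) {
	if e, ok := c.entries[path]; ok && e.fingerprint == fingerprint {
		return e.value, nil
	}
	value, err := compute()
	if err != nil {
		return "", err
	}
	c.entries[path] = entry{fingerprint: fingerprint, value: value}
	return value, nil
}

func main() {
	c := newResultCache()
	calls := 0
	extract := func() (string, error) {
		calls++
		return "extracted text", nil
	}

	// The fingerprint strings are invented for the example; real values
	// would come from (*Document).Fingerprint.
	c.get("invoice.pdf", "fp-1", extract)
	c.get("invoice.pdf", "fp-1", extract) // same fingerprint: cached
	c.get("invoice.pdf", "fp-2", extract) // changed fingerprint: recomputed
	fmt.Println(calls) // prints 2
}
```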

func (*Document) ParsePages

func (d *Document) ParsePages(ctx context.Context, r pagerange.Range) ([]*Page, error)

ParsePages uses the underlying document parser to read and parse pages. The returned slice may contain fewer or more pages than requested by the given range, depending on what the document actually contains and the parser's behaviour. Page numbers can be determined via Page.Number.
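Since the returned slice need not match the requested range, callers may want to index the pages by number instead of relying on slice positions. The sketch below shows that idea against a local stand-in interface; numbered, fakePage, indexByNumber and missingPages are hypothetical names, and only the Number method mirrors the documented API.

```go
package main

import "fmt"

// numbered is a local stand-in for the part of *Page used here.
type numbered interface {
	Number() int
}

// fakePage is a test double whose value is its 1-based page number.
type fakePage int

func (p fakePage) Number() int { return int(p) }

// indexByNumber maps page numbers to pages, regardless of slice order.
func indexByNumber[P numbered](pages []P) map[int]P {
	byNumber := make(map[int]P, len(pages))
	for _, p := range pages {
		byNumber[p.Number()] = p
	}
	return byNumber
}

// missingPages lists the numbers in [first, last] that were not returned.
func missingPages[P numbered](pages []P, first, last int) []int {
	byNumber := indexByNumber(pages)
	var missing []int
	for n := first; n <= last; n++ {
		if _, ok := byNumber[n]; !ok {
			missing = append(missing, n)
		}
	}
	return missing
}

func main() {
	// Pretend pages 1-3 were requested but only 3 and 2 came back.
	pages := []fakePage{3, 2}
	fmt.Println(missingPages(pages, 1, 3)) // prints [1]
}
```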

func (*Document) Path

func (d *Document) Path() string

Path returns the file path given to NewDocument.

func (*Document) RenderPageUsing

func (d *Document) RenderPageUsing(ctx context.Context, num int, r renderformat.Renderer) error

RenderPageUsing writes a single page using the given renderer, e.g. as a PNG image via renderformat.PNG.

func (*Document) Validate

func (d *Document) Validate(ctx context.Context) error

Validate is a simple check whether the document can be read and parsed.

type DocumentOption

type DocumentOption func(*Document)

func WithDocumentParserFactory

func WithDocumentParserFactory(f DocumentParserFactory) DocumentOption

WithDocumentParserFactory configures a custom factory function to create parser instances. When left unconfigured, an appropriate parser is chosen automatically based on the source file's content.

func WithStaticDocumentParser

func WithStaticDocumentParser(p Parser) DocumentOption

WithStaticDocumentParser uses a fixed parser for all documents without considering the content type.
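DocumentOption follows Go's functional options pattern: each option is a function that mutates the document during construction. The sketch below reproduces the pattern with local stand-in types so that it runs on its own. The document struct, its parser field and the lowercase function names are assumptions for illustration; the real Document has unexported fields that are not shown in this reference.

```go
package main

import "fmt"

// document is a local stand-in for Document; the parser field is
// hypothetical.
type document struct {
	path   string
	parser string
}

// option has the same shape as DocumentOption.
type option func(*document)

// withStaticParser mimics the role of WithStaticDocumentParser.
func withStaticParser(name string) option {
	return func(d *document) { d.parser = name }
}

// newDocument applies the options in order on top of a default.
func newDocument(path string, opts ...option) *document {
	d := &document{path: path, parser: "auto"}
	for _, opt := range opts {
		opt(d)
	}
	return d
}

func main() {
	fmt.Println(newDocument("invoice.pdf").parser)                             // prints "auto"
	fmt.Println(newDocument("invoice.pdf", withStaticParser("custom")).parser) // prints "custom"
}
```

With the actual package the call would take the form NewDocument(path, WithStaticDocumentParser(p)), using the signatures listed in this reference.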

type DocumentParserFactory

type DocumentParserFactory func(path, contentType string) (Parser, error)
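A factory of this shape receives the file path and its content type and decides which parser to build. The following sketch shows one way such a function could dispatch on the media type. All types here are local stand-ins so that the example compiles by itself, and wrapping a sentinel like ErrUnsupportedFileFormat for unknown types is an assumption about a reasonable convention, not documented behaviour.

```go
package main

import (
	"errors"
	"fmt"
	"mime"
)

// errUnsupportedFileFormat stands in for ErrUnsupportedFileFormat.
var errUnsupportedFileFormat = errors.New("unsupported file format")

// parser is a reduced stand-in for the Parser interface.
type parser interface {
	Name() string
}

// pdfParser is a hypothetical parser implementation.
type pdfParser struct{ path string }

func (pdfParser) Name() string { return "pdf" }

// byContentType has the same parameter list as DocumentParserFactory.
func byContentType(path, contentType string) (parser, error) {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return nil, err
	}
	if mediaType == "application/pdf" {
		return pdfParser{path: path}, nil
	}
	return nil, fmt.Errorf("%w: %s", errUnsupportedFileFormat, mediaType)
}

// isUnsupported reports whether err wraps the sentinel error.
func isUnsupported(err error) bool {
	return errors.Is(err, errUnsupportedFileFormat)
}

func main() {
	p, err := byContentType("invoice.pdf", "application/pdf")
	fmt.Println(p.Name(), err) // prints "pdf <nil>"

	_, err = byContentType("notes.txt", "text/plain; charset=utf-8")
	fmt.Println(isUnsupported(err)) // prints "true"
}
```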

type MuPdfParserFactory

type MuPdfParserFactory struct {
	// Command arguments to invoke MuPDF's "mutool" program. Leave empty to use
	// the default.
	MutoolCommand []string

	// Command arguments to invoke the "xmllint" program. Leave empty to use
	// the default.
	XmllintCommand []string
}

func (MuPdfParserFactory) Check

func (f MuPdfParserFactory) Check(ctx context.Context) error

func (MuPdfParserFactory) Create

func (f MuPdfParserFactory) Create(path, contentType string) (Parser, error)

type Page

type Page struct {
	// contains filtered or unexported fields
}

func (*Page) Document

func (p *Page) Document() *Document

Document returns the source document for the page.

func (*Page) Number

func (p *Page) Number() int

Number returns the 1-based page number.

func (*Page) RenderUsing

func (p *Page) RenderUsing(ctx context.Context, r renderformat.Renderer) error

func (*Page) Size

func (p *Page) Size() geometry.Size

Size returns the physical page size.

func (*Page) VisitElements

func (p *Page) VisitElements(visitor PageElementVisitorFunc) error

VisitElements invokes the visitor function for all elements. The visitation continues until either all elements have been visited or the visitor function returns a non-nil error. Returning ErrStopVisitation stops the visitation immediately without failing it. The visitation order is undefined.
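The early-exit contract can be illustrated with a small, self-contained sketch. It is not the library's implementation: visitAll, visitorFunc and the sentinel below are local stand-ins that mirror the documented behaviour, where the stop sentinel ends the loop without being reported as an error and any other error is passed through.

```go
package main

import (
	"errors"
	"fmt"
)

// errStopVisitation stands in for ErrStopVisitation.
var errStopVisitation = errors.New("stop visitation")

// visitorFunc mirrors PageElementVisitorFunc with "any" as the element type.
type visitorFunc func(any) error

// visitAll calls visit for each element until it returns an error. The
// stop sentinel ends the loop successfully; other errors are returned.
func visitAll(elems []any, visit visitorFunc) error {
	for _, e := range elems {
		if err := visit(e); err != nil {
			if errors.Is(err, errStopVisitation) {
				return nil
			}
			return err
		}
	}
	return nil
}

// firstString finds the first string element and stops visiting early.
func firstString(elems []any) (string, bool, error) {
	var found string
	var ok bool
	err := visitAll(elems, func(e any) error {
		if s, isString := e.(string); isString {
			found, ok = s, true
			return errStopVisitation
		}
		return nil
	})
	return found, ok, err
}

func main() {
	s, ok, err := firstString([]any{1, "total", "due"})
	fmt.Println(s, ok, err) // prints "total true <nil>"
}
```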

func (*Page) VisitElementsIntersecting

func (p *Page) VisitElementsIntersecting(bounds geometry.Rect, visitor PageElementVisitorFunc) error

VisitElementsIntersecting is like VisitElements with the additional restriction that only elements intersecting the specified bounds are visited.

type PageElementVisitorFunc

type PageElementVisitorFunc func(content.Element) error

func AsPageElementVisitor

func AsPageElementVisitor[T content.Element](fn func(T) error) PageElementVisitorFunc

AsPageElementVisitor returns a visitor function wrapper filtering for elements of type T.
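A wrapper of this kind can be built from a generic function and a type assertion. The sketch below shows one way to do it using stand-in element types; asVisitor, textElem and imageElem are hypothetical names, and the package's actual implementation may differ.

```go
package main

import "fmt"

// element stands in for content.Element; the Kind method is hypothetical.
type element interface {
	Kind() string
}

type textElem struct{ text string }

func (textElem) Kind() string { return "text" }

type imageElem struct{}

func (imageElem) Kind() string { return "image" }

// asVisitor wraps fn so that it only sees elements whose dynamic type is T.
// Elements of other types are skipped without an error.
func asVisitor[T element](fn func(T) error) func(element) error {
	return func(e element) error {
		if typed, ok := any(e).(T); ok {
			return fn(typed)
		}
		return nil
	}
}

// collectTexts gathers the text of all textElem values in elems.
func collectTexts(elems []element) ([]string, error) {
	var texts []string
	visit := asVisitor(func(t textElem) error {
		texts = append(texts, t.text)
		return nil
	})
	for _, e := range elems {
		if err := visit(e); err != nil {
			return nil, err
		}
	}
	return texts, nil
}

func main() {
	texts, err := collectTexts([]element{textElem{"Invoice"}, imageElem{}, textElem{"Total"}})
	fmt.Println(texts, err) // prints "[Invoice Total] <nil>"
}
```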

type Parser

type Parser interface {
	// Validate whether the data can be successfully parsed.
	Validate(context.Context) error

	ParsePages(context.Context, pagerange.Range) ([]content.Page, error)

	RenderPage(context.Context, int, renderformat.Renderer) error
}

Directories

Path Synopsis
cmd
internal
mutool
Package mutool wraps MuPDF's "mutool" command line program.
ref
webui/template
templ: version: v0.2.590
pkg
