pdfwrapper

package

v0.1.0 (latest) | Published: Mar 31, 2026 | License: MIT | Imports: 5 | Imported by: 0

Repository

github.com/Alechan/finance-analyzer

Documentation ¶

Index ¶

Variables
func FoldPages[T any](pages []Page, processWord func(string) (T, error), ...) (T, error)
func ProcessPagesOld(pages []Page, processWord func(string) (bool, bool, error)) error
type Document
- func NewDocument(pages []Page) Document
type DocumentIterator
type FakeDocumentIterator
- func NewFakeDocumentIterator(texts []string) *FakeDocumentIterator
- func (f *FakeDocumentIterator) NextText() (string, bool)
type Page
type ProcessResult
type ReaderWrapper
- func NewReaderWrapper() *ReaderWrapper
- func (r *ReaderWrapper) ReadFromBytes(rawBytes []byte) (Document, error)
type RealDocumentIterator
- func NewRealDocumentIterator(doc Document, startingRowPosition int) *RealDocumentIterator
- func (pi *RealDocumentIterator) ContinueProcessingPages(processWord func(string) (ProcessResult, error)) error
- func (pi *RealDocumentIterator) NextText() (string, bool)
type Row

Constants ¶

This section is empty.

Variables ¶

var (
	ErrNilOrEmptyRawBytes   = errors.New("nil or empty raw bytes")
	ErrCreatingVendorReader = errors.New("couldn't create vendor pdf reader")
	ErrNoPages              = errors.New("no pages provided")
	ErrPatternNotFound      = errors.New("reached the end of the file without finding the expected pattern")
)

Functions ¶

func FoldPages ¶

func FoldPages[T any](
	pages []Page,

	processWord func(string) (T, error),

	joinFn func(T, T) (T, error),
) (T, error)

FoldPages is a helper function that processes the rows of the pages of the PDF, calling the processWord function for each word. It is called "fold" because folding a list of elements into a single value is a common pattern in functional programming: https://en.wikipedia.org/wiki/Fold_(higher-order_function)

TODO: reduce code duplication with pages iterator
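
The fold pattern can be sketched as follows. This is not the package's implementation: foldWords is a hypothetical stand-in with the same signature shape, and it assumes the accumulator starts at the zero value of T and that joinFn receives the accumulator first. Row and Page are redeclared locally so the example is self-contained.

```go
package main

import "fmt"

// Row and Page mirror the exported types documented on this page.
type Row struct {
	Position int
	Texts    []string
}

type Page struct {
	Index int
	Rows  []Row
}

// foldWords is a hypothetical stand-in shaped like FoldPages.
// Assumption: the accumulator starts at the zero value of T.
func foldWords[T any](
	pages []Page,
	processWord func(string) (T, error),
	joinFn func(T, T) (T, error),
) (T, error) {
	var acc T
	for _, page := range pages {
		for _, row := range page.Rows {
			for _, word := range row.Texts {
				v, err := processWord(word)
				if err != nil {
					return acc, err
				}
				// Fold the per-word value into the running result.
				if acc, err = joinFn(acc, v); err != nil {
					return acc, err
				}
			}
		}
	}
	return acc, nil
}

// totalLength folds every word into the sum of the word lengths.
func totalLength(pages []Page) (int, error) {
	return foldWords(pages,
		func(w string) (int, error) { return len(w), nil },
		func(a, b int) (int, error) { return a + b, nil },
	)
}

func main() {
	pages := []Page{
		{Index: 1, Rows: []Row{{Position: 0, Texts: []string{"Total", "42"}}}},
		{Index: 2, Rows: []Row{{Position: 0, Texts: []string{"End"}}}},
	}
	n, err := totalLength(pages)
	fmt.Println(n, err)
}
```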

func ProcessPagesOld ¶

func ProcessPagesOld(pages []Page, processWord func(string) (bool, bool, error)) error

ProcessPagesOld is a helper function that processes the rows of the pages of the PDF, calling the processWord function for each word. The processWord return values should be:

1. true if it should break from the current page,
2. true if it should break from the whole document,
3. an error if there was an error processing the word.

ProcessPagesOld returns the error from processWord if there was one. It also returns an error if the processing covers all the pages without breaking the loop.

TODO: reduce code duplication with pages iterator

Deprecated: use FoldPages instead

Types ¶

type Document ¶

type Document struct {
	Pages []Page
}

Document represents the whole PDF document. It contains all the pages.

func NewDocument ¶

func NewDocument(pages []Page) Document

type DocumentIterator ¶

type DocumentIterator interface {
	NextText() (string, bool)
}

type FakeDocumentIterator ¶

type FakeDocumentIterator struct {
	// contains filtered or unexported fields
}

FakeDocumentIterator is a test-friendly implementation of DocumentIterator that allows preloading with a slice of strings to be returned in sequence.

func NewFakeDocumentIterator ¶

func NewFakeDocumentIterator(texts []string) *FakeDocumentIterator

NewFakeDocumentIterator creates a new FakeDocumentIterator with the given texts.

func (*FakeDocumentIterator) NextText ¶

func (f *FakeDocumentIterator) NextText() (string, bool)

NextText implements the DocumentIterator interface. It returns the next text in the sequence and a boolean indicating if there are more texts to return.
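
Code written against the DocumentIterator interface can be exercised without a PDF. The sketch below uses a hypothetical slice-backed sliceIterator in place of FakeDocumentIterator so it compiles on its own, and valueAfter is an illustrative parser function. It assumes the boolean is true whenever the returned text is valid and false once the sequence is exhausted; the description above does not state this precisely.

```go
package main

import "fmt"

// DocumentIterator mirrors the interface documented on this page.
type DocumentIterator interface {
	NextText() (string, bool)
}

// sliceIterator is a hypothetical stand-in for FakeDocumentIterator.
// Assumption: the bool is true while the returned text is valid.
type sliceIterator struct {
	texts []string
	next  int
}

func (s *sliceIterator) NextText() (string, bool) {
	if s.next >= len(s.texts) {
		return "", false
	}
	t := s.texts[s.next]
	s.next++
	return t, true
}

// valueAfter is a hypothetical parser: it returns the text that follows label.
func valueAfter(it DocumentIterator, label string) (string, bool) {
	for {
		text, ok := it.NextText()
		if !ok {
			return "", false
		}
		if text == label {
			return it.NextText()
		}
	}
}

func main() {
	it := &sliceIterator{texts: []string{"Statement", "Total:", "123.45"}}
	value, found := valueAfter(it, "Total:")
	fmt.Println(value, found)
}
```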

type Page ¶

type Page struct {
	// Index is the number of the page in the PDF, starting at 1 (page 1, page 2, etc.)
	Index int
	// The rows of the page. Each row contains the position of the row in the page and the text (multiple strings) of the row.
	Rows []Row
}

Page is a simplified version of a "PDF Page". It contains the minimal information to parse the page.

type ProcessResult ¶

type ProcessResult int

ProcessResult defines the result of processing a word.

const (
	// Continue indicates that processing should continue normally.
	Continue ProcessResult = iota
	// StopIteration indicates that processing should stop (to return an error or to continue iterating in the next function)
	StopIteration
)
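
A callback that reports a ProcessResult might look like the sketch below. ProcessResult and its constants are redeclared locally; untilMarker and drive are hypothetical, and drive only illustrates how a caller could react to StopIteration. The package's own loop may behave differently when the input ends without a stop, for example by returning an error.

```go
package main

import "fmt"

// ProcessResult and its constants mirror the declarations above.
type ProcessResult int

const (
	Continue ProcessResult = iota
	StopIteration
)

// untilMarker is a hypothetical callback factory: the callback collects
// words into out and asks the caller to stop once marker is seen.
func untilMarker(marker string, out *[]string) func(string) (ProcessResult, error) {
	return func(word string) (ProcessResult, error) {
		if word == marker {
			return StopIteration, nil
		}
		*out = append(*out, word)
		return Continue, nil
	}
}

// drive is a hypothetical driver over a flat word list.
// Assumption: reaching the end without StopIteration is not an error here.
func drive(words []string, processWord func(string) (ProcessResult, error)) error {
	for _, w := range words {
		res, err := processWord(w)
		if err != nil {
			return err
		}
		if res == StopIteration {
			return nil
		}
	}
	return nil
}

func main() {
	var collected []string
	err := drive([]string{"Date", "Amount", "END", "ignored"}, untilMarker("END", &collected))
	fmt.Println(collected, err)
}
```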

type ReaderWrapper ¶

type ReaderWrapper struct {
}

ReaderWrapper wraps the vendor pdf.Reader to provide a more convenient interface and encapsulate behavior that shouldn't be exposed to the client.

func NewReaderWrapper ¶

func NewReaderWrapper() *ReaderWrapper

func (*ReaderWrapper) ReadFromBytes ¶

func (r *ReaderWrapper) ReadFromBytes(rawBytes []byte) (Document, error)

ReadFromBytes reads the PDF from a byte slice and returns a Document containing the rows of each page. It returns an error if it encounters any problem. Read the file into a byte slice first, then call this function.
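
The read-then-parse flow could be sketched as below. To keep the example self-contained, bytesReader is a hypothetical interface matching the documented ReadFromBytes signature, stubReader stands in for *ReaderWrapper, and loadDocument and the file name are illustrative. With the real package, the value returned by NewReaderWrapper would presumably be passed instead of the stub.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Row, Page and Document mirror the exported types documented on this page.
type Row struct {
	Position int
	Texts    []string
}

type Page struct {
	Index int
	Rows  []Row
}

type Document struct {
	Pages []Page
}

// bytesReader is a hypothetical interface with the documented method shape.
type bytesReader interface {
	ReadFromBytes(rawBytes []byte) (Document, error)
}

// loadDocument reads the whole file into memory, then hands the bytes over.
func loadDocument(r bytesReader, path string) (Document, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return Document{}, fmt.Errorf("reading %s: %w", path, err)
	}
	doc, err := r.ReadFromBytes(raw)
	if err != nil {
		return Document{}, fmt.Errorf("parsing %s: %w", path, err)
	}
	return doc, nil
}

// stubReader stands in for *ReaderWrapper so the sketch runs without the package.
type stubReader struct{}

func (stubReader) ReadFromBytes(rawBytes []byte) (Document, error) {
	if len(rawBytes) == 0 {
		return Document{}, errors.New("nil or empty raw bytes")
	}
	return Document{Pages: []Page{{Index: 1}}}, nil
}

func main() {
	// "statement.pdf" is a placeholder path; a missing file yields a wrapped error.
	doc, err := loadDocument(stubReader{}, "statement.pdf")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("pages:", len(doc.Pages))
}
```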

type RealDocumentIterator ¶

type RealDocumentIterator struct {
	// contains filtered or unexported fields
}

RealDocumentIterator is a helper to iterate over the pages of a PDF document. It is useful when a function depends on the position where a previously called function stopped, so that it can continue processing the pages from that position. The "Real" prefix only distinguishes it from the DocumentIterator interface and FakeDocumentIterator; the author could not think of a better name.

func NewRealDocumentIterator ¶

func NewRealDocumentIterator(doc Document, startingRowPosition int) *RealDocumentIterator

func (*RealDocumentIterator) ContinueProcessingPages ¶

func (pi *RealDocumentIterator) ContinueProcessingPages(

	processWord func(string) (ProcessResult, error),
) error

ContinueProcessingPages continues processing the pages of the document from the last position. It can be used to share the iterator between different functions that need to process the pages in order: where one function left off, the next one can continue.

TODO: reduce code duplication with FoldPages

Deprecated: use NextText instead to decouple the logic of processing the words from the iterator

func (*RealDocumentIterator) NextText ¶

func (pi *RealDocumentIterator) NextText() (string, bool)

NextText returns the next text element. It traverses:

The next text in the same row,
The first text in the next row,
The first text in the first row of the next page.
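
The traversal order above can be sketched with a small cursor. This is an illustration, not the package's code: cursor, skipPast and take are hypothetical, the startingRowPosition parameter of NewRealDocumentIterator is not modeled, and the boolean is assumed to be false only once every page is exhausted. Because the cursor keeps its position, one function can continue where another stopped.

```go
package main

import "fmt"

// Row and Page mirror the exported types documented on this page.
type Row struct {
	Position int
	Texts    []string
}

type Page struct {
	Index int
	Rows  []Row
}

// cursor is a hypothetical iterator that remembers page, row and text indexes.
type cursor struct {
	pages           []Page
	page, row, text int
}

// NextText walks texts within a row, then rows within a page, then pages.
func (c *cursor) NextText() (string, bool) {
	for c.page < len(c.pages) {
		rows := c.pages[c.page].Rows
		for c.row < len(rows) {
			texts := rows[c.row].Texts
			if c.text < len(texts) {
				t := texts[c.text]
				c.text++
				return t, true
			}
			c.row++ // row exhausted: move to the first text of the next row
			c.text = 0
		}
		c.page++ // page exhausted: move to the first row of the next page
		c.row, c.text = 0, 0
	}
	return "", false
}

// skipPast consumes texts up to and including marker.
func skipPast(c *cursor, marker string) bool {
	for {
		t, ok := c.NextText()
		if !ok {
			return false
		}
		if t == marker {
			return true
		}
	}
}

// take continues from wherever the cursor stopped and reads up to n texts.
func take(c *cursor, n int) []string {
	var out []string
	for len(out) < n {
		t, ok := c.NextText()
		if !ok {
			break
		}
		out = append(out, t)
	}
	return out
}

func main() {
	c := &cursor{pages: []Page{
		{Index: 1, Rows: []Row{
			{Position: 0, Texts: []string{"Header", "Date"}},
			{Position: 1, Texts: []string{"2026-03-31", "10.00"}},
		}},
		{Index: 2, Rows: []Row{{Position: 0, Texts: []string{"Footer"}}}},
	}}
	if skipPast(c, "Date") {
		fmt.Println(take(c, 2))
	}
}
```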

type Row ¶

type Row struct {
	Position int
	Texts    []string
}

Row is a simplified version of a "PDF Row". It contains only the position of the row in the page and the texts of the row, not the font, etc.
