Documentation
Index ¶
- Variables
- func FoldPages[T any](pages []Page, processWord func(string) (T, error), ...) (T, error)
- func ProcessPagesOld(pages []Page, processWord func(string) (bool, bool, error)) error
- type Document
- type DocumentIterator
- type FakeDocumentIterator
- type Page
- type ProcessResult
- type ReaderWrapper
- type RealDocumentIterator
- type Row
Constants ¶
This section is empty.
Variables ¶
Functions ¶
func FoldPages ¶
func FoldPages[T any](
pages []Page,
processWord func(string) (T, error),
joinFn func(T, T) (T, error),
) (T, error)
FoldPages is a helper function that processes the rows of the pages of the PDF, calling the processWord function for each word. It's called "fold" because folding a list of elements into a single value is a common pattern in functional programming: https://en.wikipedia.org/wiki/Fold_(higher-order_function)
TODO: reduce code duplication with pages iterator
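The fold described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the `Texts` field on `Row`, the helper names `foldWords` and `countCharacters`, and the choice to start from the zero value of `T` and apply `joinFn` to the accumulator and each word's result are all assumptions made for the example.

```go
package main

import "fmt"

// Row and Page are simplified stand-ins for the package's types.
// The Texts field name is an assumption made for this sketch.
type Row struct {
	Texts []string
}

type Page struct {
	Index int
	Rows  []Row
}

// foldWords is a hypothetical fold in the spirit of FoldPages. It maps every
// word to a T with processWord and merges that value into an accumulator with
// joinFn, starting from the zero value of T (an assumption).
func foldWords[T any](
	pages []Page,
	processWord func(string) (T, error),
	joinFn func(T, T) (T, error),
) (T, error) {
	var acc T
	for _, page := range pages {
		for _, row := range page.Rows {
			for _, word := range row.Texts {
				value, err := processWord(word)
				if err != nil {
					return acc, fmt.Errorf("page %d: %w", page.Index, err)
				}
				next, err := joinFn(acc, value)
				if err != nil {
					return acc, fmt.Errorf("page %d: %w", page.Index, err)
				}
				acc = next
			}
		}
	}
	return acc, nil
}

// countCharacters folds all words into the total number of bytes they contain.
func countCharacters(pages []Page) (int, error) {
	return foldWords(
		pages,
		func(word string) (int, error) { return len(word), nil },
		func(a, b int) (int, error) { return a + b, nil },
	)
}

func main() {
	pages := []Page{
		{Index: 1, Rows: []Row{{Texts: []string{"Invoice", "2024"}}}},
		{Index: 2, Rows: []Row{{Texts: []string{"Total"}}, {Texts: []string{"42"}}}},
	}
	total, err := countCharacters(pages)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(total) // prints 18
}
```

Keeping the per-word logic (`processWord`) separate from the combining logic (`joinFn`) lets the same traversal produce a count, a concatenated string, or a parsed struct without rewriting the loops.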
func ProcessPagesOld ¶
func ProcessPagesOld(pages []Page, processWord func(string) (bool, bool, error)) error
ProcessPagesOld is a helper function that processes the rows of the pages of the PDF, calling the processWord function for each word. The inner function's return values should be:
1. true if it should break from the current page,
2. true if it should break from the whole document,
3. an error if there was an error processing the word.
The outer function returns the error if there was one. It also returns an error if the processing covers all the pages without breaking out of the loop.
TODO: reduce code duplication with pages iterator
Deprecated: use FoldPages instead.
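The callback contract described above can be illustrated with a small stand-in. This is a sketch under assumptions: `processPages`, `containsWord`, `errNotStopped`, and the `Texts` field on `Row` are hypothetical names, and the loop only mirrors the documented behavior (break from the page, break from the document, or fail when the document ends without a break).

```go
package main

import (
	"errors"
	"fmt"
)

// Row and Page are simplified stand-ins; the Texts field is an assumption.
type Row struct{ Texts []string }

type Page struct {
	Index int
	Rows  []Row
}

// errNotStopped is a hypothetical error for a traversal that never broke out.
var errNotStopped = errors.New("reached the end of the document without stopping")

// processPages mirrors the documented contract of ProcessPagesOld:
// the callback returns (break from page, break from document, error).
func processPages(pages []Page, processWord func(string) (bool, bool, error)) error {
	for _, page := range pages {
	rows:
		for _, row := range page.Rows {
			for _, word := range row.Texts {
				breakPage, breakDocument, err := processWord(word)
				if err != nil {
					return err
				}
				if breakDocument {
					return nil
				}
				if breakPage {
					// Skip the remaining rows of this page, continue with the next page.
					break rows
				}
			}
		}
	}
	return errNotStopped
}

// containsWord reports whether target appears anywhere in pages.
func containsWord(pages []Page, target string) bool {
	err := processPages(pages, func(word string) (bool, bool, error) {
		return false, word == target, nil
	})
	return err == nil
}

func main() {
	pages := []Page{
		{Index: 1, Rows: []Row{{Texts: []string{"Invoice", "2024"}}}},
		{Index: 2, Rows: []Row{{Texts: []string{"Total", "42"}}}},
	}
	fmt.Println(containsWord(pages, "Total"))   // prints true
	fmt.Println(containsWord(pages, "Missing")) // prints false
}
```

Two positional booleans are easy to mix up at call sites, which is one plausible reason to prefer a single typed result or a fold instead.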
Types ¶
type Document ¶
type Document struct {
Pages []Page
}
Document represents the whole PDF document. It contains all the pages.
func NewDocument ¶
type DocumentIterator ¶
type FakeDocumentIterator ¶
type FakeDocumentIterator struct {
// contains filtered or unexported fields
}
FakeDocumentIterator is a test-friendly implementation of DocumentIterator that allows preloading with a slice of strings to be returned in sequence.
func NewFakeDocumentIterator ¶
func NewFakeDocumentIterator(texts []string) *FakeDocumentIterator
NewFakeDocumentIterator creates a new FakeDocumentIterator with the given texts.
func (*FakeDocumentIterator) NextText ¶
func (f *FakeDocumentIterator) NextText() (string, bool)
NextText implements the DocumentIterator interface. It returns the next text in the sequence and a boolean indicating if there are more texts to return.
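A fake like this is mostly useful for testing code that consumes an iterator without parsing a real PDF. The sketch below is self-contained and does not import the package: `fakeIterator` and `readValueAfter` are hypothetical names, the interface is reconstructed from the `NextText() (string, bool)` signature, and the boolean is assumed to be true whenever a text was returned.

```go
package main

import "fmt"

// DocumentIterator is reconstructed here from the NextText signature.
type DocumentIterator interface {
	NextText() (string, bool)
}

// fakeIterator is a local stand-in that returns preloaded texts in order.
type fakeIterator struct {
	texts []string
	next  int
}

// NextText returns the next preloaded text. The boolean is false once the
// texts are exhausted (an assumption about the real semantics).
func (f *fakeIterator) NextText() (string, bool) {
	if f.next >= len(f.texts) {
		return "", false
	}
	text := f.texts[f.next]
	f.next++
	return text, true
}

// readValueAfter is a hypothetical consumer: it scans for label and returns
// the text that immediately follows it.
func readValueAfter(it DocumentIterator, label string) (string, bool) {
	for {
		text, ok := it.NextText()
		if !ok {
			return "", false
		}
		if text == label {
			return it.NextText()
		}
	}
}

func main() {
	it := &fakeIterator{texts: []string{"Name:", "Ada", "Total:", "42"}}
	value, ok := readValueAfter(it, "Total:")
	fmt.Println(value, ok) // prints 42 true
}
```

Because the consumer depends only on the interface, the same function can be exercised with the fake in tests and with a document-backed iterator in production code.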
type Page ¶
type Page struct {
// Index is the number of the page in the PDF, starting at 1 (page 1, page 2, etc.)
Index int
// The rows of the page. Each row contains the position of the row in the page and the text (multiple strings) of the row.
Rows []Row
}
Page is a simplified version of a "PDF Page". It contains the minimal information to parse the page.
type ProcessResult ¶
type ProcessResult int
ProcessResult defines the result of processing a word.
const (
// Continue indicates that processing should continue normally.
Continue ProcessResult = iota
// StopIteration indicates that processing should stop (to return an error or to continue iterating in the next function)
StopIteration
)
type ReaderWrapper ¶
type ReaderWrapper struct {
}
ReaderWrapper wraps the vendor pdf.Reader to provide a more convenient interface and encapsulate behavior that shouldn't be exposed to the client.
func NewReaderWrapper ¶
func NewReaderWrapper() *ReaderWrapper
func (*ReaderWrapper) ReadFromBytes ¶
func (r *ReaderWrapper) ReadFromBytes(rawBytes []byte) (Document, error)
ReadFromBytes reads the PDF from a byte slice and returns a Document containing the rows of each page. It returns an error if it encounters any problem. Read the file into a byte slice first, then call this function.
type RealDocumentIterator ¶
type RealDocumentIterator struct {
// contains filtered or unexported fields
}
RealDocumentIterator is a helper to iterate over the pages of a PDF document. It can be used when one function depends on the position where a previously called function stopped, so that processing continues from that position. The "Real" prefix is there only because the author couldn't think of a better name to distinguish it from the interface and the fake.
func NewRealDocumentIterator ¶
func NewRealDocumentIterator(doc Document, startingRowPosition int) *RealDocumentIterator
func (*RealDocumentIterator) ContinueProcessingPages ¶
func (pi *RealDocumentIterator) ContinueProcessingPages(
processWord func(string) (ProcessResult, error),
) error
ContinueProcessingPages continues processing the pages of the document from the last position. It can be used to share the iterator between different functions that need to process the pages in order: where one function left off, the next one can continue.
TODO: reduce code duplication with FoldPages
Deprecated: use NextText instead to decouple the logic of processing the words from the iterator.
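The hand-off between functions can be sketched with a simplified cursor. Everything here is an assumption made for illustration: `wordCursor`, `continueProcessing`, and `parseItems` are hypothetical names, the cursor walks a flat word list instead of pages and rows, and it is assumed to resume at the word after the one that returned StopIteration.

```go
package main

import "fmt"

// ProcessResult and its values follow the constants documented in this package.
type ProcessResult int

const (
	Continue ProcessResult = iota
	StopIteration
)

// wordCursor is a hypothetical stand-in for RealDocumentIterator that walks a
// flat list of words and remembers its position between calls.
type wordCursor struct {
	words []string
	pos   int
}

// continueProcessing feeds words to processWord from the saved position until
// the callback asks to stop, fails, or the words run out.
func (c *wordCursor) continueProcessing(processWord func(string) (ProcessResult, error)) error {
	for c.pos < len(c.words) {
		word := c.words[c.pos]
		c.pos++ // advance first, so the next caller resumes after this word
		result, err := processWord(word)
		if err != nil {
			return err
		}
		if result == StopIteration {
			return nil
		}
	}
	return nil
}

// parseItems runs two stages over one shared cursor: the first skips ahead to
// the "Items:" marker, the second collects words until "Total:".
func parseItems(words []string) ([]string, error) {
	cursor := &wordCursor{words: words}

	err := cursor.continueProcessing(func(w string) (ProcessResult, error) {
		if w == "Items:" {
			return StopIteration, nil
		}
		return Continue, nil
	})
	if err != nil {
		return nil, err
	}

	var items []string
	err = cursor.continueProcessing(func(w string) (ProcessResult, error) {
		if w == "Total:" {
			return StopIteration, nil
		}
		items = append(items, w)
		return Continue, nil
	})
	return items, err
}

func main() {
	items, err := parseItems([]string{"Invoice", "Items:", "apple", "pear", "Total:", "3"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(items) // prints [apple pear]
}
```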
func (*RealDocumentIterator) NextText ¶
func (pi *RealDocumentIterator) NextText() (string, bool)
NextText returns the next text element. It traverses:
- The next text in the same row,
- The first text in the next row,
- The first text in the first row of the next page.
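The traversal order listed above can be sketched with three indices. This is an illustration under assumptions, not the package's code: `textIterator` and `allTexts` are hypothetical names, the `Texts` field on `Row` is assumed, empty rows and pages are simply skipped, and the `startingRowPosition` argument of the real constructor is not modeled.

```go
package main

import "fmt"

// Simplified stand-ins for the package's types; the Texts field is an assumption.
type Row struct{ Texts []string }

type Page struct {
	Index int
	Rows  []Row
}

type Document struct{ Pages []Page }

// textIterator is a hypothetical iterator that tracks its position as
// (page, row, text) indices into the document.
type textIterator struct {
	doc             Document
	page, row, text int
}

// NextText returns the next text in the same row, else the first text of the
// next row, else the first text of the first row of the next page.
func (it *textIterator) NextText() (string, bool) {
	for it.page < len(it.doc.Pages) {
		rows := it.doc.Pages[it.page].Rows
		for it.row < len(rows) {
			texts := rows[it.row].Texts
			if it.text < len(texts) {
				text := texts[it.text]
				it.text++
				return text, true
			}
			// Row exhausted: move to the first text of the next row.
			it.row++
			it.text = 0
		}
		// Page exhausted: move to the first row of the next page.
		it.page++
		it.row = 0
		it.text = 0
	}
	return "", false
}

// allTexts drains an iterator over doc and returns the texts in visit order.
func allTexts(doc Document) []string {
	it := &textIterator{doc: doc}
	var out []string
	for {
		text, ok := it.NextText()
		if !ok {
			return out
		}
		out = append(out, text)
	}
}

func main() {
	doc := Document{Pages: []Page{
		{Index: 1, Rows: []Row{{Texts: []string{"a", "b"}}, {Texts: []string{"c"}}}},
		{Index: 2, Rows: []Row{{Texts: nil}, {Texts: []string{"d"}}}},
	}}
	fmt.Println(allTexts(doc)) // prints [a b c d]
}
```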