Documentation
Overview ¶
Package extract provides text extraction from binary document formats. Supported formats: PDF, Office (docx/xlsx/pptx), OpenDocument (odt/ods/odp), EPUB, RTF, archives (zip/tar/tar.gz/tar.bz2/tar.xz), iWork (pages/numbers/key), SVG.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func IsDocumentFile ¶
IsDocumentFile reports whether a file path appears to name a file in a supported document format.
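The signature of IsDocumentFile is not shown on this page, so the following is only a minimal sketch of what an extension-based check of this kind might look like. The names `looksLikeDocument` and `documentExts` are hypothetical, and the extension table is an illustrative subset of the formats listed in the overview, not the package's actual table.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// documentExts is an illustrative subset of the extensions named in the
// overview. It is an assumption; the package's real table is not documented.
var documentExts = map[string]bool{
	".pdf": true, ".docx": true, ".xlsx": true, ".pptx": true,
	".odt": true, ".epub": true, ".rtf": true, ".zip": true, ".svg": true,
}

// looksLikeDocument is a hypothetical stand-in for IsDocumentFile.
// It inspects only the final extension of the path, case-insensitively,
// and never opens the file.
func looksLikeDocument(path string) bool {
	ext := strings.ToLower(filepath.Ext(path))
	return documentExts[ext]
}

func main() {
	for _, p := range []string{"report.PDF", "notes.txt", "slides.pptx"} {
		fmt.Printf("%s: %v\n", p, looksLikeDocument(p))
	}
}
```

Note that `filepath.Ext` returns only the last suffix, so compound extensions such as tar.gz would need suffix matching (for example with `strings.HasSuffix`) rather than a plain map lookup.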
Types ¶
type Extractor ¶
type Extractor interface {
	Extract(data []byte) (TextResult, error)
	Format() string
}
Extractor extracts text from a binary document format.
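To illustrate the shape of the interface, here is a minimal sketch of a type that satisfies Extractor. The `plainExtractor` type and its "plain" format name are hypothetical and not part of the package. TextResult and Extractor are redeclared locally, copied from the declarations on this page, so the example compiles without knowing the package's import path.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// TextResult and Extractor mirror the declarations documented on this page.
type TextResult struct {
	Text   string
	Pages  int
	Format string
}

type Extractor interface {
	Extract(data []byte) (TextResult, error)
	Format() string
}

// plainExtractor is a hypothetical extractor that treats its input as text.
type plainExtractor struct{}

// Format returns the format name this extractor reports in its results.
func (plainExtractor) Format() string { return "plain" }

// Extract normalises CRLF line endings, trims surrounding whitespace,
// and rejects empty input.
func (e plainExtractor) Extract(data []byte) (TextResult, error) {
	if len(data) == 0 {
		return TextResult{}, errors.New("plain: empty input")
	}
	text := strings.ReplaceAll(string(data), "\r\n", "\n")
	return TextResult{
		Text:   strings.TrimSpace(text),
		Pages:  0, // pages do not apply to this format
		Format: e.Format(),
	}, nil
}

// Compile-time check that plainExtractor implements Extractor.
var _ Extractor = plainExtractor{}

func main() {
	var ex Extractor = plainExtractor{}
	res, err := ex.Extract([]byte("hello\r\nworld\r\n"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q pages=%d format=%s\n", res.Text, res.Pages, res.Format)
}
```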
type Registry ¶
type Registry struct {
	// contains filtered or unexported fields
}
Registry maps file extensions to Extractor instances.
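Registry's fields and methods are not listed on this page, so its real API is unknown. The sketch below only illustrates the general idea of mapping file extensions to Extractor values. Every name in it (`extRegistry`, `newExtRegistry`, `register`, `lookup`, `stubExtractor`) is hypothetical, and the interface and result types are redeclared locally from the declarations on this page.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// TextResult and Extractor mirror the declarations documented on this page.
type TextResult struct {
	Text   string
	Pages  int
	Format string
}

type Extractor interface {
	Extract(data []byte) (TextResult, error)
	Format() string
}

// stubExtractor is a hypothetical extractor that returns its input unchanged.
type stubExtractor struct{ format string }

func (s stubExtractor) Format() string { return s.format }

func (s stubExtractor) Extract(data []byte) (TextResult, error) {
	return TextResult{Text: string(data), Format: s.format}, nil
}

// extRegistry is a hypothetical registry keyed by lower-cased extension.
type extRegistry struct {
	byExt map[string]Extractor
}

func newExtRegistry() *extRegistry {
	return &extRegistry{byExt: make(map[string]Extractor)}
}

// register associates an extension such as ".pdf" with an extractor.
func (r *extRegistry) register(ext string, e Extractor) {
	r.byExt[strings.ToLower(ext)] = e
}

// lookup finds the extractor for a path by its final extension.
func (r *extRegistry) lookup(path string) (Extractor, bool) {
	e, ok := r.byExt[strings.ToLower(filepath.Ext(path))]
	return e, ok
}

func main() {
	reg := newExtRegistry()
	reg.register(".rtf", stubExtractor{format: "rtf"})

	if ex, ok := reg.lookup("letter.RTF"); ok {
		res, _ := ex.Extract([]byte("Dear reader"))
		fmt.Println(res.Format, res.Text)
	}
	if _, ok := reg.lookup("image.png"); !ok {
		fmt.Println("no extractor registered for image.png")
	}
}
```

Keying the map by lower-cased extension keeps lookups case-insensitive; whether the real Registry does the same is not stated in the documentation.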
type TextResult ¶
type TextResult struct {
	Text   string // extracted plain text
	Pages  int    // page/slide count (0 if not applicable)
	Format string // format name: "pdf", "docx", "zip", etc.
}
TextResult holds extracted text and metadata about the source document.
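Because Pages is 0 when a page or slide count does not apply, callers may want to treat that value as "no count" rather than "empty document". The following is a small sketch of that handling. The `describe` helper is hypothetical, and TextResult is redeclared locally from the declaration on this page.

```go
package main

import "fmt"

// TextResult mirrors the declaration documented on this page.
type TextResult struct {
	Text   string
	Pages  int
	Format string
}

// describe is a hypothetical helper that summarises a result, omitting
// the page count when the format does not have one (Pages == 0).
func describe(r TextResult) string {
	if r.Pages == 0 {
		return fmt.Sprintf("%s: %d bytes of text", r.Format, len(r.Text))
	}
	return fmt.Sprintf("%s: %d bytes of text across %d pages", r.Format, len(r.Text), r.Pages)
}

func main() {
	fmt.Println(describe(TextResult{Text: "abc", Format: "zip"}))
	fmt.Println(describe(TextResult{Text: "hello", Pages: 2, Format: "pdf"}))
}
```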