extract

package
v1.1.64
Published: May 1, 2026 License: MIT Imports: 19 Imported by: 0

Documentation

Overview

Package extract provides text extraction from binary document formats. Supported formats: PDF, Office (docx/xlsx/pptx), OpenDocument (odt/ods/odp), EPUB, RTF, archives (zip/tar/tar.gz/tar.bz2/tar.xz), iWork (pages/numbers/key), SVG.

Constants

This section is empty.

Variables

This section is empty.

Functions

func IsDocumentFile

func IsDocumentFile(path string) bool

IsDocumentFile reports whether path looks like a supported document format.
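
The package does not document how this check is made, so the following is only a sketch of one plausible approach: matching the lowercased path against the extensions named in the overview. The names looksLikeDocument and documentSuffixes are hypothetical and are not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// documentSuffixes lists the extensions named in the package overview.
// Assumption: the real package may recognize a different set.
var documentSuffixes = []string{
	".pdf", ".docx", ".xlsx", ".pptx",
	".odt", ".ods", ".odp",
	".epub", ".rtf",
	".zip", ".tar", ".tar.gz", ".tar.bz2", ".tar.xz",
	".pages", ".numbers", ".key",
	".svg",
}

// looksLikeDocument is a hypothetical stand-in for IsDocumentFile.
// It compares the lowercased path against each known suffix.
func looksLikeDocument(path string) bool {
	lower := strings.ToLower(path)
	for _, suffix := range documentSuffixes {
		if strings.HasSuffix(lower, suffix) {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{"Report.PDF", "backup.tar.gz", "notes.txt"} {
		fmt.Printf("%s: %v\n", p, looksLikeDocument(p))
	}
}
```

Suffix matching is used here instead of filepath.Ext because filepath.Ext returns only the last component (".gz" for "backup.tar.gz"), which would not distinguish compound archive extensions.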

Types

type Extractor

type Extractor interface {
	Extract(data []byte) (TextResult, error)
	Format() string
}

Extractor extracts text from a binary document format.
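
As an illustration of what satisfying this interface might look like, the sketch below redeclares TextResult and Extractor locally (copied from this page) so that it compiles without the package's import path, which is not given here. The utf8Extractor type and its "txt" format name are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// TextResult and Extractor are copied from this page so the sketch
// compiles on its own; real code would import them from the package.
type TextResult struct {
	Text   string
	Pages  int
	Format string
}

type Extractor interface {
	Extract(data []byte) (TextResult, error)
	Format() string
}

// utf8Extractor is a hypothetical extractor, not part of the package.
// It treats the input as plain UTF-8 text.
type utf8Extractor struct{}

// Format returns the format name reported in results.
func (utf8Extractor) Format() string { return "txt" }

// Extract returns the input unchanged, or an error if it is not valid UTF-8.
// Pages stays 0 because plain text has no page count.
func (e utf8Extractor) Extract(data []byte) (TextResult, error) {
	if !utf8.Valid(data) {
		return TextResult{}, errors.New("input is not valid UTF-8")
	}
	return TextResult{Text: string(data), Format: e.Format()}, nil
}

func main() {
	var e Extractor = utf8Extractor{}
	res, err := e.Extract([]byte("hello"))
	fmt.Println(res.Text, res.Format, err)
}
```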

type Registry

type Registry struct {
	// contains filtered or unexported fields
}

Registry maps file extensions to Extractor instances.

func (*Registry) Get

func (r *Registry) Get(ext string) Extractor

Get returns the extractor registered for the given extension, or nil if there is none.

func (*Registry) Register

func (r *Registry) Register(ext string, e Extractor)

Register adds an extractor for the given extension (e.g. ".pdf").
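
Registry's fields are unexported and no constructor is listed on this page, so the following sketch uses a hypothetical map-backed type, mapRegistry, with the same method shapes to show how Register and Get might behave together. Lowercasing the extension is an assumption; the real Registry may be case-sensitive.

```go
package main

import (
	"fmt"
	"strings"
)

// TextResult and Extractor are copied from this page so the sketch
// compiles on its own.
type TextResult struct {
	Text   string
	Pages  int
	Format string
}

type Extractor interface {
	Extract(data []byte) (TextResult, error)
	Format() string
}

// mapRegistry is a hypothetical, map-backed stand-in for Registry.
type mapRegistry struct {
	byExt map[string]Extractor
}

// newMapRegistry is a hypothetical constructor; the package does not
// document how a Registry is created.
func newMapRegistry() *mapRegistry {
	return &mapRegistry{byExt: make(map[string]Extractor)}
}

// Register stores e under the lowercased extension, e.g. ".pdf".
func (r *mapRegistry) Register(ext string, e Extractor) {
	r.byExt[strings.ToLower(ext)] = e
}

// Get returns the extractor registered for ext, or nil if there is none.
func (r *mapRegistry) Get(ext string) Extractor {
	return r.byExt[strings.ToLower(ext)]
}

// stubExtractor is a placeholder that reports a fixed format name.
type stubExtractor struct{ format string }

func (s stubExtractor) Format() string { return s.format }

func (s stubExtractor) Extract(data []byte) (TextResult, error) {
	return TextResult{Text: string(data), Format: s.format}, nil
}

func main() {
	reg := newMapRegistry()
	reg.Register(".pdf", stubExtractor{format: "pdf"})

	if e := reg.Get(".pdf"); e != nil {
		fmt.Println("found extractor for format:", e.Format())
	}
	if reg.Get(".xyz") == nil {
		fmt.Println("no extractor registered for .xyz")
	}
}
```

Because Get signals a missing entry with nil, callers need a nil check before calling methods on the result.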

type TextResult

type TextResult struct {
	Text   string // extracted plain text
	Pages  int    // page/slide count (0 if not applicable)
	Format string // format name: "pdf", "docx", "zip", etc.
}

TextResult holds extracted text and metadata about the source document.

func Extract

func Extract(path string, data []byte) (TextResult, error)

Extract extracts text from data based on the file extension of path. It returns an error if the format is not supported or if extraction fails.
