docpipe

package

Version: v0.1.0 (latest). Published: Mar 1, 2026. License: MIT. Imports: 25. Imported by: 0.

Repository

github.com/hazyhaar/pkg

Documentation ¶

Overview ¶

CLAUDE:SUMMARY Configuration struct and defaults for the docpipe document extraction pipeline.

CLAUDE:SUMMARY Registers docpipe extract and detect handlers on a connectivity Router for inter-service RPC.

CLAUDE:SUMMARY Core pipeline engine that dispatches document extraction by format (docx, odt, pdf, md, txt, html).

Package docpipe extracts structured text from document files.

Supported formats:

.docx — Microsoft Word (archive/zip → word/document.xml)
.odt — OpenDocument Text (archive/zip → content.xml)
.pdf — PDF text extraction (pure Go, cross-reference + stream decoding)
.md — Markdown (parsed with heading detection)
.txt — Plain text (passthrough with whitespace normalization)
.html — HTML (reuses domkeeper extract pipeline)

All parsers are pure Go and build with CGO_ENABLED=0; none requires cgo or system libraries.

Usage:

pipe := docpipe.New(docpipe.Config{})
doc, err := pipe.Extract(ctx, "/path/to/file.docx")
if err != nil {
	log.Fatal(err)
}
fmt.Println(doc.Title, len(doc.Sections), "sections")

CLAUDE:SUMMARY Extracts structured text from .docx files by parsing word/document.xml from the ZIP archive.

CLAUDE:SUMMARY Extracts structured sections (headings, paragraphs, tables, lists) from HTML files.

CLAUDE:SUMMARY Registers docpipe MCP tools (extract, detect, formats) on an MCP server.

CLAUDE:SUMMARY Extracts structured text from .odt (OpenDocument) files by parsing content.xml from the ZIP archive.

CLAUDE:SUMMARY PDF text extractor using pdfcpu: page-aware extraction with quality scoring.
CLAUDE:DEPENDS docpipe/quality.go
CLAUDE:EXPORTS extractPDF

CLAUDE:SUMMARY PDF extraction quality scoring: detects OCR needs and visual gaps.
CLAUDE:EXPORTS ExtractionQuality, NeedsOCR, HasVisualGap, computePrintableRatio, computeWordlikeRatio, countVisualRefs

CLAUDE:SUMMARY Extracts content from plain text and Markdown files with heading detection and whitespace normalization.

CLAUDE:SUMMARY Defines Format, Section, and Document types for the docpipe extraction pipeline.

Index ¶

func SupportedFormats() []string
type Config
type Document
type ExtractionQuality
- func (q *ExtractionQuality) HasVisualGap() bool
- func (q *ExtractionQuality) NeedsOCR() bool
type Format
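type Pipeline
- func New(cfg Config) *Pipeline
- func (p *Pipeline) Detect(path string) (Format, error)
- func (p *Pipeline) Extract(ctx context.Context, path string) (*Document, error)
- func (p *Pipeline) RegisterConnectivity(router *connectivity.Router)
- func (p *Pipeline) RegisterMCP(srv *mcp.Server)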
type Section

Functions ¶

func SupportedFormats ¶

func SupportedFormats() []string

SupportedFormats returns all supported format extensions.
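A caller might use this list to pre-filter files before calling Extract. The sketch below is illustrative only: isSupported is a hypothetical helper, and the formats slice stands in for the return value of SupportedFormats(). Because the documentation does not say whether the returned extensions carry a leading dot, the helper trims it on both sides.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isSupported (hypothetical) reports whether path's extension appears in
// formats. Comparison is case-insensitive and ignores a leading dot, since
// the exact shape of the SupportedFormats() values is an assumption here.
func isSupported(formats []string, path string) bool {
	ext := strings.TrimPrefix(strings.ToLower(filepath.Ext(path)), ".")
	if ext == "" {
		return false
	}
	for _, f := range formats {
		if strings.TrimPrefix(strings.ToLower(f), ".") == ext {
			return true
		}
	}
	return false
}

func main() {
	// Placeholder list; real code would pass docpipe.SupportedFormats().
	formats := []string{".docx", ".odt", ".pdf", ".md", ".txt", ".html"}
	fmt.Println(isSupported(formats, "report.PDF"))
	fmt.Println(isSupported(formats, "archive.tar"))
}
```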

Types ¶

type Config ¶

type Config struct {
	// MaxFileSize is the maximum file size to process (default: 100 MB).
	MaxFileSize int64 `json:"max_file_size" yaml:"max_file_size"`

	// Logger for debug/error messages.
	Logger *slog.Logger `json:"-" yaml:"-"`
}

Config configures the document pipeline.
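The field comment mentions a 100 MB default, which suggests that a zero-valued Config is filled in at construction time. A minimal sketch of that pattern follows, using a local mirror of the struct. withDefaults and defaultMaxFileSize are hypothetical names, and reading "100 MB" as 100 MiB is an assumption; the package may use a different constant or apply defaults elsewhere.

```go
package main

import (
	"fmt"
	"log/slog"
)

// Config mirrors the documented docpipe.Config fields for illustration.
type Config struct {
	MaxFileSize int64
	Logger      *slog.Logger
}

// defaultMaxFileSize assumes "100 MB" means 100 MiB.
const defaultMaxFileSize int64 = 100 << 20

// withDefaults (hypothetical) fills unset fields with fallback values.
func withDefaults(cfg Config) Config {
	if cfg.MaxFileSize <= 0 {
		cfg.MaxFileSize = defaultMaxFileSize
	}
	if cfg.Logger == nil {
		cfg.Logger = slog.Default()
	}
	return cfg
}

func main() {
	cfg := withDefaults(Config{})
	fmt.Println(cfg.MaxFileSize, cfg.Logger != nil)
}
```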

type Document ¶

type Document struct {
	Path     string             `json:"path"`
	Format   Format             `json:"format"`
	Title    string             `json:"title"`
	Sections []Section          `json:"sections"`
	RawText  string             `json:"raw_text"`          // concatenated full text
	Quality  *ExtractionQuality `json:"quality,omitempty"` // PDF extraction quality metrics
}

Document is the result of extracting content from a file.
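The struct tags determine the JSON shape a consumer sees. The sketch below uses local mirrors of the documented types (not the package itself) to show the effect of omitempty on Quality: the key is absent when the pointer is nil, which is presumably the case for non-PDF inputs. encode is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local mirrors of the documented types, for illustration only.
type Format string

type Section struct {
	Title    string            `json:"title,omitempty"`
	Level    int               `json:"level"`
	Text     string            `json:"text"`
	Type     string            `json:"type"`
	Metadata map[string]string `json:"metadata,omitempty"`
}

type ExtractionQuality struct {
	PageCount       int     `json:"page_count"`
	CharsPerPage    float64 `json:"chars_per_page"`
	PrintableRatio  float64 `json:"printable_ratio"`
	WordlikeRatio   float64 `json:"wordlike_ratio"`
	HasImageStreams bool    `json:"has_image_streams"`
	VisualRefCount  int     `json:"visual_ref_count"`
}

type Document struct {
	Path     string             `json:"path"`
	Format   Format             `json:"format"`
	Title    string             `json:"title"`
	Sections []Section          `json:"sections"`
	RawText  string             `json:"raw_text"`
	Quality  *ExtractionQuality `json:"quality,omitempty"`
}

// encode (hypothetical) marshals a Document to a JSON string.
func encode(doc Document) (string, error) {
	b, err := json.Marshal(doc)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	txt, _ := encode(Document{Path: "a.txt", Format: "txt"})
	pdf, _ := encode(Document{Path: "b.pdf", Format: "pdf", Quality: &ExtractionQuality{PageCount: 2}})
	fmt.Println(txt) // no "quality" key
	fmt.Println(pdf) // includes a "quality" object
}
```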

type ExtractionQuality ¶

type ExtractionQuality struct {
	PageCount       int     `json:"page_count"`
	CharsPerPage    float64 `json:"chars_per_page"`
	PrintableRatio  float64 `json:"printable_ratio"`
	WordlikeRatio   float64 `json:"wordlike_ratio"`
	HasImageStreams bool    `json:"has_image_streams"`
	VisualRefCount  int     `json:"visual_ref_count"`
}

ExtractionQuality captures metrics about PDF text extraction quality.
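The documentation does not define how the ratios are computed. As an aid to reading the PrintableRatio field, here is one plausible definition: the share of runes that are printable or whitespace. printableRatio is a hypothetical function, and the package's unexported computePrintableRatio may differ.

```go
package main

import (
	"fmt"
	"unicode"
)

// printableRatio (hypothetical) returns the fraction of runes in s that are
// printable or whitespace. Invalid UTF-8 decodes to the replacement
// character and is counted as non-printable. An empty string yields 0.
func printableRatio(s string) float64 {
	total, printable := 0, 0
	for _, r := range s {
		total++
		if r == unicode.ReplacementChar {
			continue
		}
		if unicode.IsPrint(r) || unicode.IsSpace(r) {
			printable++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(printable) / float64(total)
}

func main() {
	fmt.Println(printableRatio("plain text\n"))
	fmt.Println(printableRatio("ab\x00c"))
}
```

Under this definition, a low ratio would indicate that extraction produced mostly control bytes or undecodable data rather than readable text.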

func (*ExtractionQuality) HasVisualGap ¶

func (q *ExtractionQuality) HasVisualGap() bool

HasVisualGap returns true if the extracted text references figures or tables and the PDF contains image streams, which suggests visual content that text extraction did not capture.

func (*ExtractionQuality) NeedsOCR ¶

func (q *ExtractionQuality) NeedsOCR() bool

NeedsOCR returns true if the PDF likely needs OCR to extract text.
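The two predicates lend themselves to routing a document to a follow-up stage. The sketch below shows one way a caller might do that. The stage names, nextStep, and stubQuality are hypothetical, and the stub does not reproduce docpipe's heuristics. With the real package, check doc.Quality != nil before passing it, since a nil *ExtractionQuality stored in an interface is not a nil interface.

```go
package main

import "fmt"

// qualityReport captures the two documented ExtractionQuality methods.
type qualityReport interface {
	NeedsOCR() bool
	HasVisualGap() bool
}

// stubQuality is a test double with fixed answers.
type stubQuality struct{ ocr, gap bool }

func (s stubQuality) NeedsOCR() bool     { return s.ocr }
func (s stubQuality) HasVisualGap() bool { return s.gap }

// nextStep (hypothetical) picks a follow-up stage. OCR takes priority
// because, if text is missing, figure references cannot be trusted either.
func nextStep(q qualityReport) string {
	switch {
	case q == nil:
		return "index" // no quality metrics available, e.g. non-PDF input
	case q.NeedsOCR():
		return "ocr"
	case q.HasVisualGap():
		return "vision"
	default:
		return "index"
	}
}

func main() {
	fmt.Println(nextStep(stubQuality{ocr: true}))
	fmt.Println(nextStep(stubQuality{gap: true}))
	fmt.Println(nextStep(nil))
}
```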

type Format ¶

type Format string

Format identifies a document type.

const (
	FormatDocx Format = "docx"
	FormatODT  Format = "odt"
	FormatPDF  Format = "pdf"
	FormatMD   Format = "md"
	FormatTXT  Format = "txt"
	FormatHTML Format = "html"
)

type Pipeline ¶

type Pipeline struct {
	// contains filtered or unexported fields
}

Pipeline is the document extraction engine.

func New ¶

func New(cfg Config) *Pipeline

New creates a Pipeline with the given configuration.

func (*Pipeline) Detect ¶

func (p *Pipeline) Detect(path string) (Format, error)

Detect returns the document format based on file extension.
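Extension-based detection can be sketched as a case-insensitive map lookup. The code below is an illustration under assumptions, not the package's implementation: detectByExt and byExt are hypothetical names, only the six documented extensions are mapped, and whether Detect also accepts variants such as .htm or .markdown is not stated in the source.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Format and its constants mirror the documented declarations.
type Format string

const (
	FormatDocx Format = "docx"
	FormatODT  Format = "odt"
	FormatPDF  Format = "pdf"
	FormatMD   Format = "md"
	FormatTXT  Format = "txt"
	FormatHTML Format = "html"
)

// byExt maps lower-cased extensions to formats (assumed mapping).
var byExt = map[string]Format{
	".docx": FormatDocx,
	".odt":  FormatODT,
	".pdf":  FormatPDF,
	".md":   FormatMD,
	".txt":  FormatTXT,
	".html": FormatHTML,
}

// detectByExt (hypothetical) resolves a path to a Format by extension only;
// it never opens the file.
func detectByExt(path string) (Format, error) {
	ext := strings.ToLower(filepath.Ext(path))
	f, ok := byExt[ext]
	if !ok {
		return "", fmt.Errorf("unsupported extension %q", ext)
	}
	return f, nil
}

func main() {
	fmt.Println(detectByExt("/tmp/Report.DOCX"))
	fmt.Println(detectByExt("data.bin"))
}
```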

func (*Pipeline) Extract ¶

func (p *Pipeline) Extract(ctx context.Context, path string) (*Document, error)

Extract parses a document and returns structured sections.

func (*Pipeline) RegisterConnectivity ¶

func (p *Pipeline) RegisterConnectivity(router *connectivity.Router)

RegisterConnectivity registers docpipe service handlers on a connectivity Router.

Registered services:

docpipe_extract — extract content from a document file
docpipe_detect  — detect document format

func (*Pipeline) RegisterMCP ¶

func (p *Pipeline) RegisterMCP(srv *mcp.Server)

RegisterMCP registers docpipe tools on an MCP server.

type Section ¶

type Section struct {
	Title    string            `json:"title,omitempty"`
	Level    int               `json:"level"`              // heading level 1-6, 0 for body
	Text     string            `json:"text"`               // extracted text content
	Type     string            `json:"type"`               // heading, paragraph, table, list
	Metadata map[string]string `json:"metadata,omitempty"` // extra attributes
}

Section is a structural unit of a document.
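To make the Level and Type fields concrete, here is a sketch of how a Markdown ATX heading line could map onto a Section. headingSection is a hypothetical function and the parsing is deliberately simplified (no closing-hash handling, no indentation rules). Storing the label in both Title and Text is an assumption about how the package fills heading sections.

```go
package main

import (
	"fmt"
	"strings"
)

// Section mirrors the documented docpipe.Section fields.
type Section struct {
	Title    string
	Level    int
	Text     string
	Type     string
	Metadata map[string]string
}

// headingSection (hypothetical) converts a line such as "## Setup" into a
// heading Section. It returns false for lines that are not headings.
func headingSection(line string) (Section, bool) {
	trimmed := strings.TrimSpace(line)
	level := 0
	for level < len(trimmed) && trimmed[level] == '#' {
		level++
	}
	if level < 1 || level > 6 {
		return Section{}, false
	}
	rest := trimmed[level:]
	// "#tag" has no space after the hashes, so it is not a heading.
	if rest != "" && !strings.HasPrefix(rest, " ") {
		return Section{}, false
	}
	title := strings.TrimSpace(rest)
	return Section{Title: title, Level: level, Text: title, Type: "heading"}, true
}

func main() {
	for _, line := range []string{"# Title", "## Setup", "#hashtag", "plain text"} {
		s, ok := headingSection(line)
		fmt.Println(ok, s.Level, s.Title)
	}
}
```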
