Documentation
Overview ¶
CLAUDE:SUMMARY Configuration struct and defaults for the docpipe document extraction pipeline.
CLAUDE:SUMMARY Registers docpipe extract and detect handlers on a connectivity Router for inter-service RPC.
CLAUDE:SUMMARY Core pipeline engine that dispatches document extraction by format (docx, odt, pdf, md, txt, html).
Package docpipe extracts structured text from document files.
Supported formats:
- .docx — Microsoft Word (archive/zip → word/document.xml)
- .odt — OpenDocument Text (archive/zip → content.xml)
- .pdf — PDF text extraction (pure Go, cross-reference + stream decoding)
- .md — Markdown (parsed with heading detection)
- .txt — Plain text (passthrough with whitespace normalization)
- .html — HTML (reuses domkeeper extract pipeline)
All parsers are pure Go, CGO_ENABLED=0 compatible, with zero external dependencies.
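The format list above implies a dispatch step keyed on the file extension. The following is a minimal sketch of that idea, not docpipe's actual implementation: the `extractor` type, the `stub` helper, the `extractors` table, and `dispatch` are all hypothetical names, and the per-format parsers are replaced by stubs that only report which parser would run.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// extractor is a hypothetical stand-in for a per-format parser.
type extractor func(path string) (string, error)

// stub returns an extractor that only reports which parser would run.
func stub(name string) extractor {
	return func(path string) (string, error) {
		return name + " parser for " + filepath.Base(path), nil
	}
}

// extractors maps a lowercase extension to its parser (illustrative only).
var extractors = map[string]extractor{
	".docx": stub("docx"),
	".odt":  stub("odt"),
	".pdf":  stub("pdf"),
	".md":   stub("markdown"),
	".txt":  stub("text"),
	".html": stub("html"),
}

// dispatch picks a parser from the file extension, ignoring case.
func dispatch(path string) (string, error) {
	ext := strings.ToLower(filepath.Ext(path))
	fn, ok := extractors[ext]
	if !ok {
		return "", fmt.Errorf("unsupported format %q", ext)
	}
	return fn(path)
}

func main() {
	for _, p := range []string{"report.DOCX", "notes.md", "image.png"} {
		out, err := dispatch(p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(out)
	}
}
```

A lookup table keeps the set of formats in one place, so adding a format in this sketch means adding one map entry.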
Usage:
pipe := docpipe.New(docpipe.Config{})
doc, err := pipe.Extract(ctx, "/path/to/file.docx")
if err != nil {
log.Fatal(err)
}
fmt.Println(doc.Title, len(doc.Sections), "sections")
CLAUDE:SUMMARY Extracts structured text from .docx files by parsing word/document.xml from the ZIP archive.
CLAUDE:SUMMARY Extracts structured sections (headings, paragraphs, tables, lists) from HTML files.
CLAUDE:SUMMARY Registers docpipe MCP tools (extract, detect, formats) on an MCP server.
CLAUDE:SUMMARY Extracts structured text from .odt (OpenDocument) files by parsing content.xml from the ZIP archive.
CLAUDE:SUMMARY PDF text extractor using pdfcpu: page-aware extraction with quality scoring.
CLAUDE:DEPENDS docpipe/quality.go
CLAUDE:EXPORTS extractPDF
CLAUDE:SUMMARY PDF extraction quality scoring: detects OCR needs and visual gaps.
CLAUDE:EXPORTS ExtractionQuality, NeedsOCR, HasVisualGap, computePrintableRatio, computeWordlikeRatio, countVisualRefs
CLAUDE:SUMMARY Extracts content from plain text and Markdown files with heading detection and whitespace normalization.
CLAUDE:SUMMARY Defines Format, Section, and Document types for the docpipe extraction pipeline.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func SupportedFormats ¶
func SupportedFormats() []string
SupportedFormats returns all supported format extensions.
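One plausible use of this list is to pre-filter candidate files before extraction. The sketch below uses a local stand-in, `supportedFormats`, because the exact strings returned by `SupportedFormats` are not shown here; it assumes extensions with a leading dot. `filterSupported` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// supportedFormats is a local stand-in for docpipe.SupportedFormats.
// Assumption: extensions are returned with a leading dot.
func supportedFormats() []string {
	return []string{".docx", ".odt", ".pdf", ".md", ".txt", ".html"}
}

// filterSupported keeps only the paths whose extension appears in formats.
// The comparison ignores case.
func filterSupported(paths, formats []string) []string {
	allowed := make(map[string]bool, len(formats))
	for _, f := range formats {
		allowed[strings.ToLower(f)] = true
	}
	var kept []string
	for _, p := range paths {
		if allowed[strings.ToLower(filepath.Ext(p))] {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	paths := []string{"a.pdf", "b.PNG", "c.Md", "d"}
	fmt.Println(filterSupported(paths, supportedFormats()))
}
```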
Types ¶
type Config ¶
type Config struct {
// MaxFileSize is the maximum file size to process (default: 100 MB).
MaxFileSize int64 `json:"max_file_size" yaml:"max_file_size"`
// Logger for debug/error messages.
Logger *slog.Logger `json:"-" yaml:"-"`
}
Config configures the document pipeline.
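Since an empty `Config{}` is valid in the usage example, zero values are presumably replaced by defaults somewhere. The following sketch shows one way that could look. `withDefaults` and `defaultMaxFileSize` are hypothetical names, the struct tags are omitted, and "100 MB" is assumed to mean 100 × 1024 × 1024 bytes.

```go
package main

import (
	"fmt"
	"io"
	"log/slog"
)

// Config mirrors the documented struct (tags omitted for brevity).
type Config struct {
	MaxFileSize int64
	Logger      *slog.Logger
}

// defaultMaxFileSize is the documented default of 100 MB,
// assumed here to mean 100 * 1024 * 1024 bytes.
const defaultMaxFileSize int64 = 100 << 20

// withDefaults is a hypothetical helper, not part of docpipe.
// It fills unset fields and leaves explicit values untouched.
func withDefaults(c Config) Config {
	if c.MaxFileSize <= 0 {
		c.MaxFileSize = defaultMaxFileSize
	}
	if c.Logger == nil {
		// A logger that discards output, so callers never need a nil check.
		c.Logger = slog.New(slog.NewTextHandler(io.Discard, nil))
	}
	return c
}

func main() {
	cfg := withDefaults(Config{})
	fmt.Println(cfg.MaxFileSize, cfg.Logger != nil)
}
```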
type Document ¶
type Document struct {
Path string `json:"path"`
Format Format `json:"format"`
Title string `json:"title"`
Sections []Section `json:"sections"`
RawText string `json:"raw_text"` // concatenated full text
Quality *ExtractionQuality `json:"quality,omitempty"` // PDF extraction quality metrics
}
Document is the result of extracting content from a file.
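The struct tags determine how a `Document` serializes. As an illustration, the sketch below copies the documented types locally and shows that `quality` is omitted when the pointer is nil. `Format` is assumed to be a string-based type, since its definition is not shown here; `encode` and `hasKey` are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Format is assumed to be string-based; the real definition is not shown.
type Format string

type Section struct {
	Title    string            `json:"title,omitempty"`
	Level    int               `json:"level"`
	Text     string            `json:"text"`
	Type     string            `json:"type"`
	Metadata map[string]string `json:"metadata,omitempty"`
}

type ExtractionQuality struct {
	PageCount       int     `json:"page_count"`
	CharsPerPage    float64 `json:"chars_per_page"`
	PrintableRatio  float64 `json:"printable_ratio"`
	WordlikeRatio   float64 `json:"wordlike_ratio"`
	HasImageStreams bool    `json:"has_image_streams"`
	VisualRefCount  int     `json:"visual_ref_count"`
}

type Document struct {
	Path     string             `json:"path"`
	Format   Format             `json:"format"`
	Title    string             `json:"title"`
	Sections []Section          `json:"sections"`
	RawText  string             `json:"raw_text"`
	Quality  *ExtractionQuality `json:"quality,omitempty"`
}

// encode marshals a Document to JSON and panics on the unexpected error.
func encode(doc Document) []byte {
	data, err := json.Marshal(doc)
	if err != nil {
		panic(err)
	}
	return data
}

// hasKey reports whether the top-level JSON object contains key.
func hasKey(data []byte, key string) bool {
	var m map[string]json.RawMessage
	if err := json.Unmarshal(data, &m); err != nil {
		return false
	}
	_, ok := m[key]
	return ok
}

func main() {
	doc := Document{Path: "notes.txt", Format: "txt", Title: "Notes", RawText: "hello"}
	fmt.Println("quality present:", hasKey(encode(doc), "quality"))

	doc.Quality = &ExtractionQuality{PageCount: 3}
	fmt.Println("quality present:", hasKey(encode(doc), "quality"))
}
```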
type ExtractionQuality ¶
type ExtractionQuality struct {
PageCount int `json:"page_count"`
CharsPerPage float64 `json:"chars_per_page"`
PrintableRatio float64 `json:"printable_ratio"`
WordlikeRatio float64 `json:"wordlike_ratio"`
HasImageStreams bool `json:"has_image_streams"`
VisualRefCount int `json:"visual_ref_count"`
}
ExtractionQuality captures metrics about PDF text extraction quality.
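To make two of these metrics concrete, here is a rough sketch of how a printable ratio and a characters-per-page figure could be computed. It is an assumption about the general technique, not the package's `computePrintableRatio`; `printableRatio` and `charsPerPage` are hypothetical functions.

```go
package main

import (
	"fmt"
	"unicode"
)

// printableRatio returns the fraction of runes in s that are printable
// or whitespace. An empty string yields 0.
func printableRatio(s string) float64 {
	total, printable := 0, 0
	for _, r := range s {
		total++
		if unicode.IsPrint(r) || unicode.IsSpace(r) {
			printable++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(printable) / float64(total)
}

// charsPerPage returns the average number of runes per page.
// A non-positive page count yields 0.
func charsPerPage(s string, pages int) float64 {
	if pages <= 0 {
		return 0
	}
	return float64(len([]rune(s))) / float64(pages)
}

func main() {
	text := "ab\x00\x01"
	fmt.Println(printableRatio(text), charsPerPage(text, 2))
}
```

A low printable ratio in this sketch would indicate that the decoded text stream contains mostly control characters, which is one common sign of a failed text extraction.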
func (*ExtractionQuality) HasVisualGap ¶
func (q *ExtractionQuality) HasVisualGap() bool
HasVisualGap reports whether the text references figures or tables while the PDF contains image streams, which suggests that visual content was not captured by text extraction.
func (*ExtractionQuality) NeedsOCR ¶
func (q *ExtractionQuality) NeedsOCR() bool
NeedsOCR returns true if the PDF likely needs OCR to extract text.
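The two predicates can be read as simple rules over the metrics. The sketch below is one possible shape under stated assumptions: the thresholds are invented for illustration and are not docpipe's values, and `needsOCR` and `hasVisualGap` are local functions rather than the package's methods.

```go
package main

import "fmt"

type ExtractionQuality struct {
	PageCount       int
	CharsPerPage    float64
	PrintableRatio  float64
	WordlikeRatio   float64
	HasImageStreams bool
	VisualRefCount  int
}

// Illustrative thresholds: assumptions for this sketch, not docpipe's values.
const (
	minCharsPerPage   = 50.0
	minPrintableRatio = 0.8
	minWordlikeRatio  = 0.5
)

// needsOCR flags a PDF whose extracted text is too sparse or too noisy.
func needsOCR(q *ExtractionQuality) bool {
	if q == nil || q.PageCount == 0 {
		return false
	}
	return q.CharsPerPage < minCharsPerPage ||
		q.PrintableRatio < minPrintableRatio ||
		q.WordlikeRatio < minWordlikeRatio
}

// hasVisualGap flags text that mentions figures or tables while the
// PDF also carries image streams.
func hasVisualGap(q *ExtractionQuality) bool {
	return q != nil && q.VisualRefCount > 0 && q.HasImageStreams
}

func main() {
	scanned := &ExtractionQuality{PageCount: 10, CharsPerPage: 4, PrintableRatio: 0.3, WordlikeRatio: 0.1}
	report := &ExtractionQuality{PageCount: 5, CharsPerPage: 1800, PrintableRatio: 0.99,
		WordlikeRatio: 0.9, HasImageStreams: true, VisualRefCount: 7}
	fmt.Println(needsOCR(scanned), hasVisualGap(scanned))
	fmt.Println(needsOCR(report), hasVisualGap(report))
}
```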
type Pipeline ¶
type Pipeline struct {
// contains filtered or unexported fields
}
Pipeline is the document extraction engine.
func (*Pipeline) RegisterConnectivity ¶
func (p *Pipeline) RegisterConnectivity(router *connectivity.Router)
RegisterConnectivity registers docpipe service handlers on a connectivity Router.
Registered services:
docpipe_extract: extract content from a document file
docpipe_detect: detect document format
func (*Pipeline) RegisterMCP ¶
RegisterMCP registers docpipe tools on an MCP server.
type Section ¶
type Section struct {
Title string `json:"title,omitempty"`
Level int `json:"level"` // heading level 1-6, 0 for body
Text string `json:"text"` // extracted text content
Type string `json:"type"` // heading, paragraph, table, list
Metadata map[string]string `json:"metadata,omitempty"` // extra attributes
}
Section is a structural unit of a document.
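Because each `Section` carries a `Level` and a `Type`, a table of contents can be derived from a flat section list. The sketch below illustrates this with a local copy of the struct. `outline` is a hypothetical helper, and it assumes that heading sections store their text in `Title`.

```go
package main

import (
	"fmt"
	"strings"
)

// Section is a local copy of the documented struct (tags omitted).
type Section struct {
	Title    string
	Level    int
	Text     string
	Type     string
	Metadata map[string]string
}

// outline renders heading sections as an indented list and skips
// everything else, including body sections with Level 0.
func outline(sections []Section) string {
	var b strings.Builder
	for _, s := range sections {
		if s.Type != "heading" || s.Level < 1 {
			continue
		}
		b.WriteString(strings.Repeat("  ", s.Level-1))
		b.WriteString("- ")
		b.WriteString(s.Title)
		b.WriteString("\n")
	}
	return b.String()
}

func main() {
	sections := []Section{
		{Title: "Intro", Level: 1, Type: "heading"},
		{Text: "Some body text.", Type: "paragraph"},
		{Title: "Scope", Level: 2, Type: "heading"},
	}
	fmt.Print(outline(sections))
}
```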