docpipe

package
v0.1.0
Published: Mar 1, 2026 License: MIT Imports: 25 Imported by: 0

Documentation

Overview

CLAUDE:SUMMARY Configuration struct and defaults for the docpipe document extraction pipeline.

CLAUDE:SUMMARY Registers docpipe extract and detect handlers on a connectivity Router for inter-service RPC.

CLAUDE:SUMMARY Core pipeline engine that dispatches document extraction by format (docx, odt, pdf, md, txt, html).

Package docpipe extracts structured text from document files.

Supported formats:

  • .docx — Microsoft Word (archive/zip → word/document.xml)
  • .odt — OpenDocument Text (archive/zip → content.xml)
  • .pdf — PDF text extraction (pure Go, cross-reference + stream decoding)
  • .md — Markdown (parsed with heading detection)
  • .txt — Plain text (passthrough with whitespace normalization)
  • .html — HTML (reuses domkeeper extract pipeline)
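As an illustration of the "archive/zip → word/document.xml" path listed for .docx, the following is a minimal sketch using only the standard library. The helper names buildSampleDocx and docxParagraphs are hypothetical and are not part of docpipe; the package's actual parser may handle runs, tables, and styles differently.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// buildSampleDocx (hypothetical) writes a tiny in-memory archive containing
// only word/document.xml, which is enough to exercise the reader below.
func buildSampleDocx() ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("word/document.xml")
	if err != nil {
		return nil, err
	}
	const body = `<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">` +
		`<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p>` +
		`<w:p><w:r><w:t>World</w:t></w:r></w:p></w:body></w:document>`
	if _, err := io.WriteString(w, body); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// docxParagraphs (hypothetical) returns the text of each <w:p> paragraph
// found in word/document.xml, concatenating the <w:t> runs inside it.
func docxParagraphs(data []byte) ([]string, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	f, err := zr.Open("word/document.xml")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var paras []string
	var cur strings.Builder
	inText := false
	dec := xml.NewDecoder(f)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "t" {
				inText = true
			}
		case xml.EndElement:
			switch t.Name.Local {
			case "t":
				inText = false
			case "p":
				// A paragraph ends: emit the accumulated run text.
				paras = append(paras, cur.String())
				cur.Reset()
			}
		case xml.CharData:
			if inText {
				cur.Write(t)
			}
		}
	}
	return paras, nil
}

func main() {
	data, err := buildSampleDocx()
	if err != nil {
		panic(err)
	}
	paras, err := docxParagraphs(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(paras)
}
```

The same shape would apply to .odt by opening content.xml instead, with different element names.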

All parsers are pure Go and CGO_ENABLED=0 compatible; no cgo or system libraries are required (the PDF extractor uses pdfcpu, and the HTML extractor reuses domkeeper).

Usage:

pipe := docpipe.New(docpipe.Config{})
doc, err := pipe.Extract(ctx, "/path/to/file.docx")
if err != nil {
	log.Fatal(err)
}
fmt.Println(doc.Title, len(doc.Sections), "sections")

CLAUDE:SUMMARY Extracts structured text from .docx files by parsing word/document.xml from the ZIP archive.

CLAUDE:SUMMARY Extracts structured sections (headings, paragraphs, tables, lists) from HTML files.

CLAUDE:SUMMARY Registers docpipe MCP tools (extract, detect, formats) on an MCP server.

CLAUDE:SUMMARY Extracts structured text from .odt (OpenDocument) files by parsing content.xml from the ZIP archive.

CLAUDE:SUMMARY PDF text extractor using pdfcpu — page-aware extraction with quality scoring. CLAUDE:DEPENDS docpipe/quality.go CLAUDE:EXPORTS extractPDF

CLAUDE:SUMMARY PDF extraction quality scoring: detects OCR needs and visual gaps. CLAUDE:EXPORTS ExtractionQuality, NeedsOCR, HasVisualGap, computePrintableRatio, computeWordlikeRatio, countVisualRefs

CLAUDE:SUMMARY Extracts content from plain text and Markdown files with heading detection and whitespace normalization.

CLAUDE:SUMMARY Defines Format, Section, and Document types for the docpipe extraction pipeline.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func SupportedFormats

func SupportedFormats() []string

SupportedFormats returns all supported format extensions.

Types

type Config

type Config struct {
	// MaxFileSize is the maximum file size to process (default: 100 MB).
	MaxFileSize int64 `json:"max_file_size" yaml:"max_file_size"`

	// Logger for debug/error messages.
	Logger *slog.Logger `json:"-" yaml:"-"`
}

Config configures the document pipeline.
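The field comment mentions a 100 MB default for MaxFileSize. One way such defaults could be applied is sketched below. The Config type here is a local mirror so the example compiles on its own, and withDefaults and defaultMaxFileSize are hypothetical names. Whether docpipe interprets 100 MB as 100 << 20 bytes, and what it does with a nil Logger, is an assumption.

```go
package main

import (
	"fmt"
	"io"
	"log/slog"
)

// Config is a local mirror of the documented struct (tags omitted).
type Config struct {
	MaxFileSize int64
	Logger      *slog.Logger
}

// defaultMaxFileSize assumes "100 MB" means 100 * 1024 * 1024 bytes.
const defaultMaxFileSize int64 = 100 << 20

// withDefaults (hypothetical) fills zero-valued fields with defaults.
func withDefaults(cfg Config) Config {
	if cfg.MaxFileSize <= 0 {
		cfg.MaxFileSize = defaultMaxFileSize
	}
	if cfg.Logger == nil {
		// Assumption: a nil Logger is replaced by one that discards output.
		cfg.Logger = slog.New(slog.NewTextHandler(io.Discard, nil))
	}
	return cfg
}

func main() {
	cfg := withDefaults(Config{})
	fmt.Println(cfg.MaxFileSize, cfg.Logger != nil)

	custom := withDefaults(Config{MaxFileSize: 5 << 20})
	fmt.Println(custom.MaxFileSize)
}
```

Under this reading, passing docpipe.Config{} as in the usage example would select the defaults for both fields.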

type Document

type Document struct {
	Path     string             `json:"path"`
	Format   Format             `json:"format"`
	Title    string             `json:"title"`
	Sections []Section          `json:"sections"`
	RawText  string             `json:"raw_text"`          // concatenated full text
	Quality  *ExtractionQuality `json:"quality,omitempty"` // PDF extraction quality metrics
}

Document is the result of extracting content from a file.
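To show how the struct tags above shape the JSON form, here is a small sketch that encodes a Document with encoding/json. The types are local mirrors of the documented ones (ExtractionQuality is reduced to one field for brevity), and encode is a hypothetical helper. Because Quality carries omitempty and is a pointer, a nil value drops the "quality" key entirely, which is what a non-PDF result would presumably look like.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local mirrors of the documented types, with the same JSON tags.
type Format string

type Section struct {
	Title    string            `json:"title,omitempty"`
	Level    int               `json:"level"`
	Text     string            `json:"text"`
	Type     string            `json:"type"`
	Metadata map[string]string `json:"metadata,omitempty"`
}

// ExtractionQuality is truncated to a single field for this sketch.
type ExtractionQuality struct {
	PageCount int `json:"page_count"`
}

type Document struct {
	Path     string             `json:"path"`
	Format   Format             `json:"format"`
	Title    string             `json:"title"`
	Sections []Section          `json:"sections"`
	RawText  string             `json:"raw_text"`
	Quality  *ExtractionQuality `json:"quality,omitempty"`
}

// encode (hypothetical) marshals a Document to a compact JSON string.
func encode(doc Document) (string, error) {
	b, err := json.Marshal(doc)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	doc := Document{
		Path:     "notes.txt",
		Format:   "txt",
		Title:    "Notes",
		Sections: []Section{{Text: "hello", Type: "paragraph"}},
		RawText:  "hello",
	}
	out, err := encode(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```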

type ExtractionQuality

type ExtractionQuality struct {
	PageCount       int     `json:"page_count"`
	CharsPerPage    float64 `json:"chars_per_page"`
	PrintableRatio  float64 `json:"printable_ratio"`
	WordlikeRatio   float64 `json:"wordlike_ratio"`
	HasImageStreams bool    `json:"has_image_streams"`
	VisualRefCount  int     `json:"visual_ref_count"`
}

ExtractionQuality captures metrics about PDF text extraction quality.

func (*ExtractionQuality) HasVisualGap

func (q *ExtractionQuality) HasVisualGap() bool

HasVisualGap returns true if the extracted text references figures/tables and the PDF contains image streams, which suggests visual content that text extraction did not capture.
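The check can be pictured as combining the VisualRefCount and HasImageStreams fields. The sketch below is one plausible reading, not the package's implementation: the regular expression, the helper names countVisualRefs and hasVisualGap as written here, and the "at least one reference" rule are all assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// visualRef matches phrases such as "Figure 3", "Fig. 2", or "table 10".
// The exact pattern docpipe uses is not documented; this one is illustrative.
var visualRef = regexp.MustCompile(`(?i)\b(?:fig(?:ure)?\.?|table)\s+\d+`)

// countVisualRefs counts figure/table references in extracted text.
func countVisualRefs(text string) int {
	return len(visualRef.FindAllString(text, -1))
}

// quality is a local stand-in holding the two relevant documented fields.
type quality struct {
	HasImageStreams bool
	VisualRefCount  int
}

// hasVisualGap assumes a gap exists when the text points at figures or
// tables and the PDF also carries image streams.
func hasVisualGap(q quality) bool {
	return q.HasImageStreams && q.VisualRefCount > 0
}

func main() {
	text := "See Figure 1 and Table 2. Fig. 3 shows the trend."
	q := quality{HasImageStreams: true, VisualRefCount: countVisualRefs(text)}
	fmt.Println(q.VisualRefCount, hasVisualGap(q))
}
```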

func (*ExtractionQuality) NeedsOCR

func (q *ExtractionQuality) NeedsOCR() bool

NeedsOCR returns true if the PDF likely needs OCR to extract text.
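The metrics in ExtractionQuality suggest a heuristic along the following lines. This is a sketch under stated assumptions: the thresholds are invented for illustration, the ratio definitions are guesses at what computePrintableRatio and computeWordlikeRatio measure, and the local quality type carries only the fields used here.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Hypothetical thresholds; the values docpipe uses are not documented.
const (
	minCharsPerPage   = 50.0
	minPrintableRatio = 0.85
	minWordlikeRatio  = 0.5
)

// quality is a local stand-in for the documented ExtractionQuality.
type quality struct {
	PageCount      int
	CharsPerPage   float64
	PrintableRatio float64
	WordlikeRatio  float64
}

// printableRatio is the share of runes that are printable or whitespace.
func printableRatio(s string) float64 {
	total, printable := 0, 0
	for _, r := range s {
		total++
		if unicode.IsPrint(r) || unicode.IsSpace(r) {
			printable++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(printable) / float64(total)
}

// isWordlike reports whether a token, once surrounding punctuation is
// trimmed, is non-empty and made only of letters.
func isWordlike(tok string) bool {
	tok = strings.TrimFunc(tok, unicode.IsPunct)
	if tok == "" {
		return false
	}
	for _, r := range tok {
		if !unicode.IsLetter(r) {
			return false
		}
	}
	return true
}

// wordlikeRatio is the share of whitespace-separated tokens that look
// like words.
func wordlikeRatio(s string) float64 {
	toks := strings.Fields(s)
	if len(toks) == 0 {
		return 0
	}
	n := 0
	for _, t := range toks {
		if isWordlike(t) {
			n++
		}
	}
	return float64(n) / float64(len(toks))
}

// needsOCR flags PDFs with very little text per page or text that looks
// garbled according to the two ratios.
func needsOCR(q quality) bool {
	if q.PageCount == 0 {
		return false
	}
	return q.CharsPerPage < minCharsPerPage ||
		q.PrintableRatio < minPrintableRatio ||
		q.WordlikeRatio < minWordlikeRatio
}

func main() {
	scanned := quality{PageCount: 3, CharsPerPage: 4, PrintableRatio: 1, WordlikeRatio: 1}
	fmt.Println(needsOCR(scanned))
	fmt.Println(printableRatio("ab\x00\x01"), wordlikeRatio("hello world 12345 %%%"))
}
```

A caller would typically check doc.Quality for nil first, since the field is only populated for PDF input.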

type Format

type Format string

Format identifies a document type.

const (
	FormatDocx Format = "docx"
	FormatODT  Format = "odt"
	FormatPDF  Format = "pdf"
	FormatMD   Format = "md"
	FormatTXT  Format = "txt"
	FormatHTML Format = "html"
)

type Pipeline

type Pipeline struct {
	// contains filtered or unexported fields
}

Pipeline is the document extraction engine.

func New

func New(cfg Config) *Pipeline

New creates a Pipeline with the given configuration.

func (*Pipeline) Detect

func (p *Pipeline) Detect(path string) (Format, error)

Detect returns the document format based on file extension.
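Extension-based detection can be sketched with path/filepath as shown below. The Format constants are local mirrors of the documented ones, and detect is a hypothetical stand-in for the method. Lowercasing the extension and rejecting anything outside the six listed extensions are assumptions; the package may accept aliases such as .htm or .markdown.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Format and its constants mirror the documented declarations.
type Format string

const (
	FormatDocx Format = "docx"
	FormatODT  Format = "odt"
	FormatPDF  Format = "pdf"
	FormatMD   Format = "md"
	FormatTXT  Format = "txt"
	FormatHTML Format = "html"
)

// byExt maps a lowercase file extension to its format.
var byExt = map[string]Format{
	".docx": FormatDocx,
	".odt":  FormatODT,
	".pdf":  FormatPDF,
	".md":   FormatMD,
	".txt":  FormatTXT,
	".html": FormatHTML,
}

// detect (hypothetical) looks only at the extension, never the contents.
func detect(path string) (Format, error) {
	ext := strings.ToLower(filepath.Ext(path))
	f, ok := byExt[ext]
	if !ok {
		return "", fmt.Errorf("unsupported format %q", ext)
	}
	return f, nil
}

func main() {
	for _, p := range []string{"/tmp/Report.DOCX", "notes.md", "photo.png"} {
		f, err := detect(p)
		fmt.Println(p, f, err)
	}
}
```

Because only the extension is inspected, a renamed file would be dispatched to the wrong parser; content sniffing would be needed to catch that case.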

func (*Pipeline) Extract

func (p *Pipeline) Extract(ctx context.Context, path string) (*Document, error)

Extract parses a document and returns structured sections.

func (*Pipeline) RegisterConnectivity

func (p *Pipeline) RegisterConnectivity(router *connectivity.Router)

RegisterConnectivity registers docpipe service handlers on a connectivity Router.

Registered services:

docpipe_extract — extract content from a document file
docpipe_detect  — detect document format

func (*Pipeline) RegisterMCP

func (p *Pipeline) RegisterMCP(srv *mcp.Server)

RegisterMCP registers docpipe tools on an MCP server.

type Section

type Section struct {
	Title    string            `json:"title,omitempty"`
	Level    int               `json:"level"`              // heading level 1-6, 0 for body
	Text     string            `json:"text"`               // extracted text content
	Type     string            `json:"type"`               // heading, paragraph, table, list
	Metadata map[string]string `json:"metadata,omitempty"` // extra attributes
}

Section is a structural unit of a document.
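To make the Level and Type fields concrete, the sketch below turns a small Markdown string into Section values: ATX headings ("#" through "######") become heading sections with levels 1 to 6, and runs of non-blank lines become level-0 paragraphs. The Section type is a local mirror, parseMarkdown and atxHeading are hypothetical helpers, and how docpipe itself fills Title versus Text for headings is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Section is a local mirror of the documented struct (tags omitted).
type Section struct {
	Title    string
	Level    int
	Text     string
	Type     string
	Metadata map[string]string
}

// atxHeading reports the level and title of a line such as "## Intro".
// It requires 1-6 leading '#' characters followed by a space.
func atxHeading(line string) (int, string, bool) {
	level := 0
	for level < len(line) && line[level] == '#' {
		level++
	}
	if level == 0 || level > 6 || level >= len(line) || line[level] != ' ' {
		return 0, "", false
	}
	return level, strings.TrimSpace(line[level:]), true
}

// parseMarkdown splits src into heading and paragraph sections, joining
// the lines of a paragraph with single spaces.
func parseMarkdown(src string) []Section {
	var out []Section
	var body []string
	flush := func() {
		if len(body) > 0 {
			out = append(out, Section{
				Level: 0,
				Text:  strings.Join(body, " "),
				Type:  "paragraph",
			})
			body = body[:0]
		}
	}
	for _, line := range strings.Split(src, "\n") {
		trimmed := strings.TrimSpace(line)
		if level, title, ok := atxHeading(trimmed); ok {
			flush()
			out = append(out, Section{Title: title, Level: level, Text: title, Type: "heading"})
			continue
		}
		if trimmed == "" {
			flush()
			continue
		}
		body = append(body, trimmed)
	}
	flush()
	return out
}

func main() {
	src := "# Title\n\nfirst line\nsecond line\n\n## Sub\ntext"
	for _, s := range parseMarkdown(src) {
		fmt.Printf("%d %-9s %q\n", s.Level, s.Type, s.Text)
	}
}
```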
