rag

package
v1.6.1
Published: Jan 29, 2026 License: MIT Imports: 11 Imported by: 0

Documentation

Overview

Package rag provides semantic chunking and export functionality for RAG (Retrieval-Augmented Generation) workflows. It implements hierarchical, context-aware chunking that respects document structure, ensuring chunks contain complete thoughts rather than breaking mid-sentence or mid-list.

The package prepares extracted document content for use with large language models by providing semantic chunking and a range of export formats for LLM integration.

Chunking

The Chunker splits documents into semantically meaningful chunks:

chunker := rag.NewChunkerWithConfig(config)
result, err := chunker.Chunk(document)

Chunking respects document structure, avoiding splits in the middle of:

  • Tables
  • Lists
  • Paragraphs
  • Headings with their following content

Chunk Configuration

Use ChunkerConfig to control chunking behavior:

  • TargetChunkSize - target chunk size in characters
  • MaxChunkSize - hard upper limit per chunk
  • MinChunkSize - minimum chunk size (avoids tiny chunks)
  • OverlapSize - overlap between consecutive chunks
  • PreserveListCoherence / PreserveTableCoherence - keep lists and tables intact
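
For example, starting from the defaults and tightening the limits (a sketch; the field values are illustrative):

config := rag.DefaultChunkerConfig()
config.TargetChunkSize = 800 // aim for ~800 characters per chunk
config.MaxChunkSize = 1500   // hard limit; oversized chunks split at sentence boundaries
config.MinChunkSize = 200    // smaller chunks are merged with adjacent content
config.OverlapSize = 100     // characters shared between consecutive chunks
chunker := rag.NewChunkerWithConfig(config)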

Chunk Metadata

Each Chunk includes metadata for retrieval:

  • Page numbers and positions
  • Section headings
  • Content type (paragraph, table, list, etc.)
  • Relationships to other chunks
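
A retrieval pipeline can read these fields directly. For example (a sketch over an already-chunked collection):

for _, c := range chunks.ToSlice() {
	m := c.Metadata
	fmt.Printf("%s: pages %d-%d, section %q, table=%v\n",
		c.ID, m.PageStart, m.PageEnd, m.SectionTitle, m.HasTable)
}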

Export Formats

Export chunks in various formats:

  • ToMarkdown() - Markdown with preserved structure
  • ToJSON() / ToJSONL() - structured JSON output
  • ToCSV() / ToTSV() - tabular output for spreadsheets and data pipelines
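
For example, producing JSON Lines for an ingestion pipeline and a combined markdown document (a sketch; error handling elided):

cc := rag.ChunkDocument(doc)
jsonl, _ := cc.ToJSONL() // one JSON object per line
md := cc.ToMarkdown()    // single combined markdown document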

Markdown Export

The MarkdownOptions control markdown generation:

  • IncludeMetadata - add front matter
  • PreserveTables - use markdown table syntax
  • HeadingStyle - ATX (#) or Setext (===) headings

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ApplyOverlap

func ApplyOverlap(currentText string, overlap *OverlapResult, sectionTitle string, includeContext bool) string

ApplyOverlap applies overlap from the previous chunk to the current chunk

func ConvertSize

func ConvertSize(value int, from, to SizeUnit) int

ConvertSize converts a size value from one unit to another (approximate)

func GetListMarkerType

func GetListMarkerType(line string) string

GetListMarkerType returns the marker type for a line

func IsCaptionElement

func IsCaptionElement(elementType model.ElementType) bool

IsCaptionElement checks if an element type is a caption

func IsFigureElement

func IsFigureElement(elementType model.ElementType) bool

IsFigureElement checks if an element type is a figure or image

func IsListMarker

func IsListMarker(text string) bool

IsListMarker checks if text starts with any list marker

func IsTableElement

func IsTableElement(elementType model.ElementType) bool

IsTableElement checks if an element type is a table

func IsWithinAtomicBlock

func IsWithinAtomicBlock(index int, atomicBlocks []AtomicBlock) bool

IsWithinAtomicBlock checks if an index is within any atomic block

func NormalizeListMarkers

func NormalizeListMarkers(text string, useNumbers bool) string

NormalizeListMarkers normalizes list markers to a consistent format

Types

type AtomicBlock

type AtomicBlock struct {
	StartIndex int
	EndIndex   int
	Type       string
	Reason     string
}

AtomicBlock represents a contiguous block that should not be split

func GetAtomicBlockAt

func GetAtomicBlockAt(index int, atomicBlocks []AtomicBlock) *AtomicBlock

GetAtomicBlockAt returns the atomic block containing the given index, if any

type BatchExporter

type BatchExporter struct {
	// contains filtered or unexported fields
}

BatchExporter handles exporting large collections in batches

func NewBatchExporter

func NewBatchExporter(batchSize int) *BatchExporter

NewBatchExporter creates a new batch exporter

func NewBatchExporterWithConfig

func NewBatchExporterWithConfig(batchSize int, config ExportConfig) *BatchExporter

NewBatchExporterWithConfig creates a batch exporter with custom config

func (*BatchExporter) Export

func (be *BatchExporter) Export(chunks []*Chunk, callback func(ExportBatch) error) error

Export exports chunks in batches, calling the callback for each batch
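
For example, streaming batches of 100 chunks to a sink (the upload function is hypothetical, not part of this package):

be := rag.NewBatchExporter(100)
err := be.Export(cc.ToSlice(), func(b rag.ExportBatch) error {
	fmt.Printf("batch %d: chunks %d-%d\n", b.BatchNumber, b.StartIndex, b.EndIndex)
	return upload(b.Data) // hypothetical sink
})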

func (*BatchExporter) ExportToFiles

func (be *BatchExporter) ExportToFiles(chunks []*Chunk, filenamePattern string) error

ExportToFiles exports chunks to numbered files

type Boundary

type Boundary struct {
	// Type is the kind of boundary
	Type BoundaryType

	// Position is the character offset in the text
	Position int

	// Score is the priority score for splitting here
	Score int

	// ElementIndex is the index of the element this boundary follows
	ElementIndex int

	// Context provides additional information about the boundary
	Context string
}

Boundary represents a potential chunk boundary in the content

type BoundaryConfig

type BoundaryConfig struct {
	// MinChunkSize is the minimum characters before considering a boundary
	MinChunkSize int

	// MaxChunkSize is the maximum characters before forcing a boundary
	MaxChunkSize int

	// PreferParagraphBreaks prefers paragraph boundaries over sentence boundaries
	PreferParagraphBreaks bool

	// KeepListsIntact tries to keep lists with their introductory text
	KeepListsIntact bool

	// KeepTablesIntact treats tables as atomic units
	KeepTablesIntact bool

	// KeepFiguresIntact keeps figures with their captions
	KeepFiguresIntact bool

	// LookAheadChars is how far to look ahead for better boundaries
	LookAheadChars int

	// ListIntroPatterns are patterns that indicate list introductions
	ListIntroPatterns []*regexp.Regexp
}

BoundaryConfig holds configuration for boundary detection

func DefaultBoundaryConfig

func DefaultBoundaryConfig() BoundaryConfig

DefaultBoundaryConfig returns sensible defaults for boundary detection

type BoundaryDetector

type BoundaryDetector struct {
	// contains filtered or unexported fields
}

BoundaryDetector detects semantic boundaries in content

func NewBoundaryDetector

func NewBoundaryDetector() *BoundaryDetector

NewBoundaryDetector creates a new boundary detector with default configuration

func NewBoundaryDetectorWithConfig

func NewBoundaryDetectorWithConfig(config BoundaryConfig) *BoundaryDetector

NewBoundaryDetectorWithConfig creates a boundary detector with custom configuration

func (*BoundaryDetector) DetectBoundaries

func (d *BoundaryDetector) DetectBoundaries(blocks []ContentBlock) []Boundary

DetectBoundaries finds all semantic boundaries in a sequence of content blocks

func (*BoundaryDetector) FindAtomicBlocks

func (d *BoundaryDetector) FindAtomicBlocks(blocks []ContentBlock) []AtomicBlock

FindAtomicBlocks identifies sequences of blocks that should stay together

func (*BoundaryDetector) FindBestBoundary

func (d *BoundaryDetector) FindBestBoundary(boundaries []Boundary, minPos, maxPos int) *Boundary

FindBestBoundary finds the best boundary within a range for splitting

func (*BoundaryDetector) FindBoundaryWithLookAhead

func (d *BoundaryDetector) FindBoundaryWithLookAhead(boundaries []Boundary, targetPos int) *Boundary

FindBoundaryWithLookAhead finds a boundary, looking ahead for better options

func (*BoundaryDetector) ShouldKeepTogether

func (d *BoundaryDetector) ShouldKeepTogether(block1, block2 ContentBlock) bool

ShouldKeepTogether determines if two blocks should be kept in the same chunk

type BoundaryType

type BoundaryType int

BoundaryType represents the type of semantic boundary

const (
	// BoundaryNone indicates no boundary (middle of content)
	BoundaryNone BoundaryType = iota
	// BoundarySentence indicates a sentence ending
	BoundarySentence
	// BoundaryParagraph indicates a paragraph break
	BoundaryParagraph
	// BoundaryList indicates end of a list
	BoundaryList
	// BoundaryListItem indicates end of a list item
	BoundaryListItem
	// BoundaryHeading indicates a heading (section break)
	BoundaryHeading
	// BoundaryTable indicates end of a table
	BoundaryTable
	// BoundaryFigure indicates end of a figure/image
	BoundaryFigure
	// BoundaryCodeBlock indicates end of a code block
	BoundaryCodeBlock
	// BoundaryPageBreak indicates a page break
	BoundaryPageBreak
)

func (BoundaryType) Score

func (bt BoundaryType) Score() int

Score returns a priority score for this boundary type (higher = better split point)

func (BoundaryType) String

func (bt BoundaryType) String() string

String returns a human-readable representation of the boundary type

type CaptionDetector

type CaptionDetector struct {
	// contains filtered or unexported fields
}

CaptionDetector helps find captions associated with tables and figures

func NewCaptionDetector

func NewCaptionDetector() *CaptionDetector

NewCaptionDetector creates a new caption detector

func NewCaptionDetectorWithConfig

func NewCaptionDetectorWithConfig(config TableFigureConfig) *CaptionDetector

NewCaptionDetectorWithConfig creates a caption detector with custom config

func (*CaptionDetector) FindFigureCaption

func (d *CaptionDetector) FindFigureCaption(blocks []ContentBlock, figureIndex int) string

FindFigureCaption searches for a caption near a figure

func (*CaptionDetector) FindTableCaption

func (d *CaptionDetector) FindTableCaption(blocks []ContentBlock, tableIndex int) string

FindTableCaption searches for a caption near a table

type Chunk

type Chunk struct {
	// ID is a unique identifier for this chunk
	ID string `json:"id"`

	// Text is the chunk content
	Text string `json:"text"`

	// TextWithContext is the text with section heading prepended for better retrieval
	TextWithContext string `json:"text_with_context,omitempty"`

	// Metadata contains rich contextual information
	Metadata ChunkMetadata `json:"metadata"`
}

Chunk represents a semantic unit of text extracted from a document for RAG

func NewChunk

func NewChunk(id, text string, metadata ChunkMetadata) *Chunk

NewChunk creates a new chunk with the given text and metadata

func (*Chunk) GenerateContextText

func (c *Chunk) GenerateContextText(config MetadataConfig) string

GenerateContextText generates context text based on configuration

func (*Chunk) GetSectionPathString

func (c *Chunk) GetSectionPathString() string

GetSectionPathString returns the section path as a formatted string

func (*Chunk) Summary

func (c *Chunk) Summary() string

Summary returns a brief summary of the chunk

func (*Chunk) ToEmbeddingFormat

func (c *Chunk) ToEmbeddingFormat() string

ToEmbeddingFormat returns text optimized for embedding generation

func (*Chunk) ToMarkdown

func (c *Chunk) ToMarkdown() string

ToMarkdown converts a chunk to markdown format

func (*Chunk) ToMarkdownWithOptions

func (c *Chunk) ToMarkdownWithOptions(opts MarkdownOptions) string

ToMarkdownWithOptions converts a chunk to markdown with custom options

func (*Chunk) ToSearchableText

func (c *Chunk) ToSearchableText() string

ToSearchableText returns text optimized for keyword search

type ChunkCollection

type ChunkCollection struct {
	Chunks []*Chunk
}

ChunkCollection provides filtering and search over chunks

func ChunkDocument

func ChunkDocument(doc *model.Document) *ChunkCollection

ChunkDocument is a convenience function to chunk a document with default settings
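
For example, chunking with defaults and then narrowing the collection (a sketch):

cc := rag.ChunkDocument(doc)
small := cc.FilterWithTables().FilterByMaxTokens(512)
stats := cc.Statistics()
fmt.Printf("%d of %d chunks kept\n", small.Count(), stats.TotalChunks)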

func ChunkDocumentWithConfig

func ChunkDocumentWithConfig(doc *model.Document, config ChunkerConfig, sizeConfig SizeConfig) *ChunkCollection

ChunkDocumentWithConfig chunks a document with custom configuration

func NewChunkCollection

func NewChunkCollection(chunks []*Chunk) *ChunkCollection

NewChunkCollection creates a new collection from chunks

func (*ChunkCollection) Count

func (cc *ChunkCollection) Count() int

Count returns the number of chunks in the collection

func (*ChunkCollection) ExportToFile

func (cc *ChunkCollection) ExportToFile(filename string, config ExportConfig) error

ExportToFile exports the collection to a file

func (*ChunkCollection) Filter

func (cc *ChunkCollection) Filter(predicate func(*Chunk) bool) *ChunkCollection

Filter returns chunks matching a predicate

func (*ChunkCollection) FilterByElementType

func (cc *ChunkCollection) FilterByElementType(elementType string) *ChunkCollection

FilterByElementType returns chunks containing a specific element type

func (*ChunkCollection) FilterByMaxTokens

func (cc *ChunkCollection) FilterByMaxTokens(maxTokens int) *ChunkCollection

FilterByMaxTokens returns chunks with at most N estimated tokens

func (*ChunkCollection) FilterByMinTokens

func (cc *ChunkCollection) FilterByMinTokens(minTokens int) *ChunkCollection

FilterByMinTokens returns chunks with at least N estimated tokens

func (*ChunkCollection) FilterByPage

func (cc *ChunkCollection) FilterByPage(page int) *ChunkCollection

FilterByPage returns chunks on a specific page

func (*ChunkCollection) FilterByPageRange

func (cc *ChunkCollection) FilterByPageRange(startPage, endPage int) *ChunkCollection

FilterByPageRange returns chunks within a page range

func (*ChunkCollection) FilterBySection

func (cc *ChunkCollection) FilterBySection(sectionTitle string) *ChunkCollection

FilterBySection returns chunks in a specific section

func (*ChunkCollection) FilterWithImages

func (cc *ChunkCollection) FilterWithImages() *ChunkCollection

FilterWithImages returns chunks containing images

func (*ChunkCollection) FilterWithLists

func (cc *ChunkCollection) FilterWithLists() *ChunkCollection

FilterWithLists returns chunks containing lists

func (*ChunkCollection) FilterWithTables

func (cc *ChunkCollection) FilterWithTables() *ChunkCollection

FilterWithTables returns chunks containing tables

func (*ChunkCollection) First

func (cc *ChunkCollection) First() *Chunk

First returns the first chunk or nil

func (*ChunkCollection) GetAllSections

func (cc *ChunkCollection) GetAllSections() []string

GetAllSections returns unique section titles

func (*ChunkCollection) GetByID

func (cc *ChunkCollection) GetByID(id string) *Chunk

GetByID returns a chunk by ID

func (*ChunkCollection) GetByIndex

func (cc *ChunkCollection) GetByIndex(index int) *Chunk

GetByIndex returns a chunk by index

func (*ChunkCollection) GetPageRange

func (cc *ChunkCollection) GetPageRange() (int, int)

GetPageRange returns the min and max page numbers

func (*ChunkCollection) GetTotalTokens

func (cc *ChunkCollection) GetTotalTokens() int

GetTotalTokens returns the sum of estimated tokens across all chunks

func (*ChunkCollection) GetTotalWords

func (cc *ChunkCollection) GetTotalWords() int

GetTotalWords returns the sum of words across all chunks

func (*ChunkCollection) Last

func (cc *ChunkCollection) Last() *Chunk

Last returns the last chunk or nil

func (*ChunkCollection) Search

func (cc *ChunkCollection) Search(keyword string) *ChunkCollection

Search returns chunks containing a keyword (case-insensitive)

func (*ChunkCollection) Statistics

func (cc *ChunkCollection) Statistics() CollectionStats

Statistics returns aggregate statistics about the collection

func (*ChunkCollection) ToCSV

func (cc *ChunkCollection) ToCSV() (string, error)

ToCSV exports the collection as CSV

func (*ChunkCollection) ToJSON

func (cc *ChunkCollection) ToJSON() (string, error)

ToJSON exports the collection as JSON array

func (*ChunkCollection) ToJSONL

func (cc *ChunkCollection) ToJSONL() (string, error)

ToJSONL exports the collection as JSON Lines

func (*ChunkCollection) ToMarkdown

func (cc *ChunkCollection) ToMarkdown() string

ToMarkdown converts all chunks to a combined markdown document

func (*ChunkCollection) ToMarkdownChunks

func (cc *ChunkCollection) ToMarkdownChunks() []string

ToMarkdownChunks returns each chunk as a separate markdown string. Useful when you need to process chunks individually but want markdown format

func (*ChunkCollection) ToMarkdownChunksWithOptions

func (cc *ChunkCollection) ToMarkdownChunksWithOptions(opts MarkdownOptions) []string

ToMarkdownChunksWithOptions returns each chunk as separate markdown strings

func (*ChunkCollection) ToMarkdownWithOptions

func (cc *ChunkCollection) ToMarkdownWithOptions(opts MarkdownOptions) string

ToMarkdownWithOptions converts all chunks to markdown with custom options

func (*ChunkCollection) ToSlice

func (cc *ChunkCollection) ToSlice() []*Chunk

ToSlice returns the underlying slice

func (*ChunkCollection) ToTSV

func (cc *ChunkCollection) ToTSV() (string, error)

ToTSV exports the collection as TSV

type ChunkLevel

type ChunkLevel int

ChunkLevel represents the hierarchical level of a chunk

const (
	// ChunkLevelDocument represents the entire document as one chunk
	ChunkLevelDocument ChunkLevel = iota
	// ChunkLevelSection represents a section defined by headings
	ChunkLevelSection
	// ChunkLevelParagraph represents a single paragraph
	ChunkLevelParagraph
	// ChunkLevelSentence represents a single sentence (used for oversized paragraphs)
	ChunkLevelSentence
)

func (ChunkLevel) String

func (cl ChunkLevel) String() string

String returns a human-readable representation of the chunk level

type ChunkMetadata

type ChunkMetadata struct {
	// DocumentTitle is the title of the source document
	DocumentTitle string `json:"document_title,omitempty"`

	// SectionPath is the hierarchical path of headings (e.g., ["Chapter 1", "Introduction", "Overview"])
	SectionPath []string `json:"section_path,omitempty"`

	// SectionTitle is the immediate section heading (last element of SectionPath)
	SectionTitle string `json:"section_title,omitempty"`

	// HeadingLevel is the level of the current section (1-6, 0 if no heading)
	HeadingLevel int `json:"heading_level,omitempty"`

	// PageStart is the starting page number (1-indexed)
	PageStart int `json:"page_start"`

	// PageEnd is the ending page number (1-indexed)
	PageEnd int `json:"page_end"`

	// ChunkIndex is the position of this chunk in the document (0-indexed)
	ChunkIndex int `json:"chunk_index"`

	// TotalChunks is the total number of chunks in the document
	TotalChunks int `json:"total_chunks,omitempty"`

	// Level is the hierarchical level of this chunk
	Level ChunkLevel `json:"level"`

	// ParentID is the ID of the parent chunk (empty for top-level chunks)
	ParentID string `json:"parent_id,omitempty"`

	// ChildIDs are the IDs of child chunks
	ChildIDs []string `json:"child_ids,omitempty"`

	// ElementTypes lists the types of elements contained (paragraph, list, table, etc.)
	ElementTypes []string `json:"element_types,omitempty"`

	// HasTable indicates if the chunk contains a table
	HasTable bool `json:"has_table,omitempty"`

	// HasList indicates if the chunk contains a list
	HasList bool `json:"has_list,omitempty"`

	// HasImage indicates if the chunk contains an image
	HasImage bool `json:"has_image,omitempty"`

	// CharCount is the number of characters in the chunk text
	CharCount int `json:"char_count"`

	// WordCount is the number of words in the chunk text
	WordCount int `json:"word_count"`

	// EstimatedTokens is an estimated token count (chars/4 as rough approximation)
	EstimatedTokens int `json:"estimated_tokens"`

	// BBox is the bounding box of the chunk content on the page
	BBox *model.BBox `json:"bbox,omitempty"`
}

ChunkMetadata contains rich metadata about a chunk's context within the document
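
The chars/4 heuristic behind EstimatedTokens can be reproduced outside the package (a rough approximation for English text, not a real tokenizer):

```go
package main

import "fmt"

// estimateTokens mirrors the chars/4 approximation used for EstimatedTokens.
func estimateTokens(text string) int {
	return len(text) / 4
}

func main() {
	fmt.Println(estimateTokens("hello world, this is a test")) // 6
}
```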

func (*ChunkMetadata) ContainsElementType

func (m *ChunkMetadata) ContainsElementType(elementType string) bool

ContainsElementType checks if the chunk contains a specific element type

func (*ChunkMetadata) GetPageRange

func (m *ChunkMetadata) GetPageRange() string

GetPageRange returns a formatted page range string

func (*ChunkMetadata) GetReadingTimeMinutes

func (m *ChunkMetadata) GetReadingTimeMinutes(wordsPerMinute int) float64

GetReadingTimeMinutes estimates reading time in minutes

func (*ChunkMetadata) GetReadingTimeString

func (m *ChunkMetadata) GetReadingTimeString(wordsPerMinute int) string

GetReadingTimeString returns a human-readable reading time

func (*ChunkMetadata) GetSectionPathString

func (m *ChunkMetadata) GetSectionPathString(separator string) string

GetSectionPathString returns the section path as a formatted string

func (*ChunkMetadata) IsInSection

func (m *ChunkMetadata) IsInSection(sectionTitle string) bool

IsInSection checks if the chunk is within a given section path

func (*ChunkMetadata) IsOnPage

func (m *ChunkMetadata) IsOnPage(page int) bool

IsOnPage checks if the chunk spans a given page

func (*ChunkMetadata) ToJSON

func (m *ChunkMetadata) ToJSON() ([]byte, error)

ToJSON serializes metadata to JSON

func (*ChunkMetadata) ToJSONIndent

func (m *ChunkMetadata) ToJSONIndent() ([]byte, error)

ToJSONIndent serializes metadata to indented JSON

func (*ChunkMetadata) ToMap

func (m *ChunkMetadata) ToMap() map[string]interface{}

ToMap converts metadata to a map for flexible access

type ChunkResult

type ChunkResult struct {
	// Chunks are the generated chunks in reading order
	Chunks []*Chunk

	// DocumentTitle is the document title if available
	DocumentTitle string

	// TotalPages is the total number of pages processed
	TotalPages int

	// Statistics about the chunking process
	Stats ChunkStats
}

ChunkResult contains the chunking output

type ChunkStats

type ChunkStats struct {
	TotalChunks     int
	TotalCharacters int
	TotalWords      int
	TotalTokensEst  int
	AvgChunkSize    int
	MinChunkSize    int
	MaxChunkSize    int
	SectionChunks   int
	ParagraphChunks int
	SentenceChunks  int
}

ChunkStats contains statistics about the chunking process

type ChunkWithOverlap

type ChunkWithOverlap struct {
	*Chunk

	// OverlapPrefix is the overlap content prepended from previous chunk
	OverlapPrefix string

	// OverlapSuffix is the overlap content that will be prepended to next chunk
	OverlapSuffix string

	// HasOverlapPrefix indicates if this chunk has overlap from previous
	HasOverlapPrefix bool

	// HasOverlapSuffix indicates if this chunk provides overlap to next
	HasOverlapSuffix bool
}

ChunkWithOverlap represents a chunk with its overlap information

func ApplyOverlapToChunks

func ApplyOverlapToChunks(chunks []*Chunk, config OverlapConfig) []*ChunkWithOverlap

ApplyOverlapToChunks adds overlap between consecutive chunks

func (*ChunkWithOverlap) GetOriginalText

func (c *ChunkWithOverlap) GetOriginalText() string

GetOriginalText returns the chunk text without overlap prefix

func (*ChunkWithOverlap) GetOverlapText

func (c *ChunkWithOverlap) GetOverlapText() string

GetOverlapText returns just the overlap portion of a chunk (for analysis)

type ChunkWithOverlapResult

type ChunkWithOverlapResult struct {
	// Chunks are the generated chunks with overlap information
	Chunks []*ChunkWithOverlap

	// DocumentTitle is the document title if available
	DocumentTitle string

	// TotalPages is the total number of pages processed
	TotalPages int

	// Statistics about the chunking process
	Stats ChunkStats

	// OverlapStats contains overlap-specific statistics
	OverlapStats OverlapStats
}

ChunkWithOverlapResult contains chunking output with overlap information

type Chunker

type Chunker struct {
	// contains filtered or unexported fields
}

Chunker performs semantic chunking of documents

func NewChunker

func NewChunker() *Chunker

NewChunker creates a new chunker with default configuration

func NewChunkerWithConfig

func NewChunkerWithConfig(config ChunkerConfig) *Chunker

NewChunkerWithConfig creates a chunker with custom configuration

func (*Chunker) Chunk

func (c *Chunker) Chunk(doc *model.Document) (*ChunkResult, error)

Chunk processes a document and returns semantic chunks

func (*Chunker) ChunkWithOverlapEnabled

func (c *Chunker) ChunkWithOverlapEnabled(doc *model.Document) (*ChunkWithOverlapResult, error)

ChunkWithOverlapEnabled processes a document and returns chunks with overlap
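
For example, inspecting which chunks received overlap (a sketch; error handling elided):

chunker := rag.NewChunker()
res, _ := chunker.ChunkWithOverlapEnabled(doc)
for _, c := range res.Chunks {
	if c.HasOverlapPrefix {
		fmt.Println(c.ID, "carries", len(c.OverlapPrefix), "overlap characters")
	}
}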

type ChunkerConfig

type ChunkerConfig struct {
	// TargetChunkSize is the target size for chunks in characters
	// Default: 1000
	TargetChunkSize int

	// MaxChunkSize is the hard limit for chunk size in characters
	// Chunks will be split at sentence boundaries if they exceed this
	// Default: 2000
	MaxChunkSize int

	// MinChunkSize is the minimum size for a chunk in characters
	// Smaller chunks may be merged with adjacent content
	// Default: 100
	MinChunkSize int

	// OverlapSize is the number of characters to overlap between chunks
	// Default: 100
	OverlapSize int

	// OverlapSentences when true, uses sentence-based overlap instead of character-based
	// Default: true
	OverlapSentences bool

	// PreserveListCoherence keeps list intros with their items
	// Default: true
	PreserveListCoherence bool

	// PreserveTableCoherence keeps tables as atomic units
	// Default: true
	PreserveTableCoherence bool

	// IncludeSectionContext prepends section heading to chunk text
	// Default: true
	IncludeSectionContext bool

	// SplitOnHeadings creates new chunks at heading boundaries
	// Default: true
	SplitOnHeadings bool

	// MinHeadingLevel is the minimum heading level to split on (1-6)
	// Lower numbers = split on more headings
	// Default: 3 (split on H1, H2, H3)
	MinHeadingLevel int

	// PreserveParagraphs tries to keep paragraphs intact
	// Default: true
	PreserveParagraphs bool

	// IDPrefix is a prefix for generated chunk IDs
	// Default: "chunk"
	IDPrefix string
}

ChunkerConfig holds configuration options for the chunker

func DefaultChunkerConfig

func DefaultChunkerConfig() ChunkerConfig

DefaultChunkerConfig returns sensible default configuration

type CollectionStats

type CollectionStats struct {
	TotalChunks      int
	TotalTokens      int
	TotalWords       int
	TotalChars       int
	AvgTokens        int
	MinTokens        int
	MaxTokens        int
	ChunksWithTables int
	ChunksWithLists  int
	ChunksWithImages int
	UniqueSections   int
	PageStart        int
	PageEnd          int
}

CollectionStats contains aggregate statistics about a chunk collection

func (*CollectionStats) ToJSON

func (cs *CollectionStats) ToJSON() ([]byte, error)

ToJSON serializes stats to JSON

type ContentBlock

type ContentBlock struct {
	Type     model.ElementType
	Text     string
	Page     int
	Index    int
	ListInfo *model.ListInfo
	IsIntro  bool // True if this appears to introduce the next element
}

ContentBlock represents a block of content for boundary detection

type ContentElement

type ContentElement struct {
	Type     model.ElementType
	Text     string
	Page     int
	BBox     model.BBox
	ListInfo *model.ListInfo
}

ContentElement represents a piece of content within a section

type ContextFormat

type ContextFormat int

ContextFormat defines how context is injected into chunk text

const (
	// ContextFormatNone adds no context
	ContextFormatNone ContextFormat = iota
	// ContextFormatBracket adds context in brackets: [Section Title]
	ContextFormatBracket
	// ContextFormatMarkdown adds context as markdown heading
	ContextFormatMarkdown
	// ContextFormatBreadcrumb adds full path as breadcrumb
	ContextFormatBreadcrumb
	// ContextFormatXML adds context in XML-style tags
	ContextFormatXML
)

func (ContextFormat) String

func (cf ContextFormat) String() string

String returns a human-readable representation of the context format

type DocumentChunkOptions

type DocumentChunkOptions struct {
	ChunkerConfig ChunkerConfig
	SizeConfig    SizeConfig
}

DocumentChunkOptions holds options for document chunking

func DefaultDocumentChunkOptions

func DefaultDocumentChunkOptions() DocumentChunkOptions

DefaultDocumentChunkOptions returns default chunking options

func RAGOptimizedOptions

func RAGOptimizedOptions() DocumentChunkOptions

RAGOptimizedOptions returns options optimized for RAG workflows

type DocumentChunker

type DocumentChunker struct {
	// contains filtered or unexported fields
}

DocumentChunker provides RAG chunking for Document objects

func NewDocumentChunker

func NewDocumentChunker() *DocumentChunker

NewDocumentChunker creates a new document chunker with default configuration

func NewDocumentChunkerWithConfig

func NewDocumentChunkerWithConfig(config ChunkerConfig, sizeConfig SizeConfig) *DocumentChunker

NewDocumentChunkerWithConfig creates a document chunker with custom configuration

func (*DocumentChunker) ChunkDocument

func (dc *DocumentChunker) ChunkDocument(doc *model.Document) *ChunkCollection

ChunkDocument chunks a Document into semantic units for RAG

type EmbeddingExporter

type EmbeddingExporter struct {
	// contains filtered or unexported fields
}

EmbeddingExporter exports chunks with embeddings for vector databases

func NewEmbeddingExporter

func NewEmbeddingExporter() *EmbeddingExporter

NewEmbeddingExporter creates an exporter optimized for embedding export

func (*EmbeddingExporter) ExportForChroma

func (ee *EmbeddingExporter) ExportForChroma(chunks []*Chunk, embeddings [][]float64, w io.Writer) error

ExportForChroma exports in Chroma-compatible format

func (*EmbeddingExporter) ExportForPinecone

func (ee *EmbeddingExporter) ExportForPinecone(chunks []*Chunk, embeddings [][]float64, w io.Writer) error

ExportForPinecone exports in Pinecone-compatible format

func (*EmbeddingExporter) ExportForWeaviate

func (ee *EmbeddingExporter) ExportForWeaviate(chunks []*Chunk, embeddings [][]float64, className string, w io.Writer) error

ExportForWeaviate exports in Weaviate-compatible format

func (*EmbeddingExporter) PrepareForVectorDB

func (ee *EmbeddingExporter) PrepareForVectorDB(chunks []*Chunk) []EmbeddingRecord

PrepareForVectorDB prepares chunks for vector database ingestion
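
For example, pairing chunks with externally computed embeddings (the embed call is hypothetical; any embedding API returning [][]float64 fits):

ee := rag.NewEmbeddingExporter()
records := ee.PrepareForVectorDB(cc.ToSlice())
fmt.Println(len(records), "records prepared")
embeddings := embed(cc.ToSlice()) // hypothetical: one vector per chunk
var buf bytes.Buffer
err := ee.ExportForPinecone(cc.ToSlice(), embeddings, &buf)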

type EmbeddingRecord

type EmbeddingRecord struct {
	ID        string                 `json:"id"`
	Text      string                 `json:"text"`
	Embedding []float64              `json:"embedding,omitempty"`
	Metadata  map[string]interface{} `json:"metadata,omitempty"`
}

EmbeddingRecord represents a single record for vector DB ingestion

type ExportBatch

type ExportBatch struct {
	// BatchNumber is the zero-indexed batch number
	BatchNumber int

	// StartIndex is the starting chunk index in the original collection
	StartIndex int

	// EndIndex is the ending chunk index (exclusive)
	EndIndex int

	// ChunkCount is the number of chunks in this batch
	ChunkCount int

	// Data contains the exported data
	Data string
}

ExportBatch represents a single exported batch

type ExportConfig

type ExportConfig struct {
	// Format specifies the export format
	Format ExportFormat

	// IncludeMetadata determines which metadata fields to include
	IncludeMetadata bool

	// MetadataFields specifies which metadata fields to include (nil = all)
	MetadataFields []string

	// IncludeText includes the chunk text content
	IncludeText bool

	// IncludeEmbeddings includes embedding vectors if present
	IncludeEmbeddings bool

	// FlattenMetadata flattens nested metadata into dot-notation keys
	FlattenMetadata bool

	// CSVDelimiter specifies the delimiter for CSV export (default: comma)
	CSVDelimiter rune

	// IncludeHeader includes header row in CSV/TSV exports
	IncludeHeader bool

	// PrettyPrint enables pretty printing for JSON formats
	PrettyPrint bool

	// TextColumnName specifies the column name for text content
	TextColumnName string

	// ChunkIDColumnName specifies the column name for chunk ID
	ChunkIDColumnName string
}

ExportConfig holds configuration options for export

func CSVExportConfig

func CSVExportConfig() ExportConfig

CSVExportConfig returns config optimized for CSV export

func DefaultExportConfig

func DefaultExportConfig() ExportConfig

DefaultExportConfig returns sensible defaults for export configuration

func JSONLExportConfig

func JSONLExportConfig() ExportConfig

JSONLExportConfig returns config optimized for JSON Lines export

func TSVExportConfig

func TSVExportConfig() ExportConfig

TSVExportConfig returns config optimized for TSV export

func VectorDBExportConfig

func VectorDBExportConfig() ExportConfig

VectorDBExportConfig returns config optimized for vector DB ingestion

type ExportFormat

type ExportFormat int

ExportFormat defines the available export formats

const (
	// ExportFormatJSONL exports as JSON Lines (one JSON object per line)
	ExportFormatJSONL ExportFormat = iota
	// ExportFormatJSON exports as a JSON array
	ExportFormatJSON
	// ExportFormatCSV exports as comma-separated values
	ExportFormatCSV
	// ExportFormatTSV exports as tab-separated values
	ExportFormatTSV
)

func (ExportFormat) FileExtension

func (ef ExportFormat) FileExtension() string

FileExtension returns the typical file extension for this format

func (ExportFormat) String

func (ef ExportFormat) String() string

String returns a human-readable representation of the export format

type ExportedChunk

type ExportedChunk struct {
	// ID is the unique identifier for the chunk
	ID string `json:"id,omitempty"`

	// Text is the chunk content
	Text string `json:"text,omitempty"`

	// Metadata holds all metadata fields as a map
	Metadata map[string]interface{} `json:"metadata,omitempty"`

	// Embeddings holds the embedding vector(s) if present
	Embeddings []float64 `json:"embeddings,omitempty"`

	// Source document information
	DocumentTitle string `json:"document_title,omitempty"`
	PageStart     int    `json:"page_start,omitempty"`
	PageEnd       int    `json:"page_end,omitempty"`

	// Position within the document
	ChunkIndex int `json:"chunk_index,omitempty"`

	// Section information
	SectionTitle string   `json:"section_title,omitempty"`
	SectionPath  []string `json:"section_path,omitempty"`

	// Content indicators
	HasTable bool `json:"has_table,omitempty"`
	HasList  bool `json:"has_list,omitempty"`
	HasImage bool `json:"has_image,omitempty"`
}

ExportedChunk represents a chunk prepared for export

type Exporter

type Exporter struct {
	// contains filtered or unexported fields
}

Exporter handles exporting chunks to various formats

func NewExporter

func NewExporter() *Exporter

NewExporter creates a new exporter with default configuration

func NewExporterWithConfig

func NewExporterWithConfig(config ExportConfig) *Exporter

NewExporterWithConfig creates an exporter with custom configuration

func (*Exporter) Export

func (e *Exporter) Export(chunks []*Chunk, w io.Writer) error

Export exports chunks to the specified writer

func (*Exporter) ExportToFile

func (e *Exporter) ExportToFile(chunks []*Chunk, filename string) error

ExportToFile exports chunks to a file

func (*Exporter) ExportToString

func (e *Exporter) ExportToString(chunks []*Chunk) (string, error)

ExportToString exports chunks to a string

type FigureChunk

type FigureChunk struct {
	// Image is the source image (if available)
	Image *model.Image

	// Caption is the associated caption text
	Caption string

	// HasCaption indicates if a caption was found
	HasCaption bool

	// AltText is alternative text for the image
	AltText string

	// Description is a generated description
	Description string

	// Format is the image format
	Format string

	// PageNumber is the source page
	PageNumber int
}

FigureChunk represents a figure/image as a chunk

func (*FigureChunk) ToChunk

func (fc *FigureChunk) ToChunk(chunkIndex int) *Chunk

ToChunk converts a FigureChunk to a generic Chunk

type LimitType

type LimitType int

LimitType defines whether a limit is soft or hard

const (
	// LimitTypeSoft is a preference - try not to exceed but allow if necessary
	LimitTypeSoft LimitType = iota
	// LimitTypeHard is a strict limit - must not exceed
	LimitTypeHard
)

func (LimitType) String

func (lt LimitType) String() string

String returns a human-readable representation of the limit type

type ListBlock

type ListBlock struct {
	// Type is the kind of list
	Type ListType

	// IntroText is the introductory paragraph (if any)
	IntroText string

	// HasIntro indicates if there's an introductory paragraph
	HasIntro bool

	// Items are the list items
	Items []*ListItem

	// MaxLevel is the deepest nesting level
	MaxLevel int

	// TotalItems is the total count including nested items
	TotalItems int

	// IsComplete indicates if the list is complete
	IsComplete bool
}

ListBlock represents a complete list with its context

type ListCoherenceAnalyzer

type ListCoherenceAnalyzer struct {
	// contains filtered or unexported fields
}

ListCoherenceAnalyzer analyzes and manages list coherence

func NewListCoherenceAnalyzer

func NewListCoherenceAnalyzer() *ListCoherenceAnalyzer

NewListCoherenceAnalyzer creates a new analyzer with default config

func NewListCoherenceAnalyzerWithConfig

func NewListCoherenceAnalyzerWithConfig(config ListCoherenceConfig) *ListCoherenceAnalyzer

NewListCoherenceAnalyzerWithConfig creates an analyzer with custom config

func (*ListCoherenceAnalyzer) AnalyzeListBlock

func (a *ListCoherenceAnalyzer) AnalyzeListBlock(listText string, precedingText string) *ListBlock

AnalyzeListBlock creates a complete ListBlock from text

func (*ListCoherenceAnalyzer) AnalyzeListCoherence

func (a *ListCoherenceAnalyzer) AnalyzeListCoherence(blocks []ContentBlock) *ListCoherenceResult

AnalyzeListCoherence analyzes list coherence in a sequence of text blocks

func (*ListCoherenceAnalyzer) DetectListType

func (a *ListCoherenceAnalyzer) DetectListType(text string) ListType

DetectListType identifies the type of list from its content

func (*ListCoherenceAnalyzer) FindListSplitPoints

func (a *ListCoherenceAnalyzer) FindListSplitPoints(block *ListBlock) []int

FindListSplitPoints finds safe points to split a large list

func (*ListCoherenceAnalyzer) FormatListBlock

func (a *ListCoherenceAnalyzer) FormatListBlock(block *ListBlock, preserveMarkers bool) string

FormatListBlock formats a list block back to text

func (*ListCoherenceAnalyzer) IsListIntro

func (a *ListCoherenceAnalyzer) IsListIntro(text string) bool

IsListIntro checks if text appears to introduce a list

func (*ListCoherenceAnalyzer) ParseListItems

func (a *ListCoherenceAnalyzer) ParseListItems(text string) []*ListItem

ParseListItems extracts structured list items from text

func (*ListCoherenceAnalyzer) ShouldKeepListTogether

func (a *ListCoherenceAnalyzer) ShouldKeepListTogether(block *ListBlock) bool

ShouldKeepListTogether determines if a list should be kept as one chunk

func (*ListCoherenceAnalyzer) SplitListBlock

func (a *ListCoherenceAnalyzer) SplitListBlock(block *ListBlock, atIndex int) (*ListBlock, *ListBlock)

SplitListBlock splits a list at the specified item index

type ListCoherenceConfig

type ListCoherenceConfig struct {
	// KeepIntroWithList keeps introductory text with the list
	KeepIntroWithList bool

	// MaxIntroDistance is max chars between intro and list
	MaxIntroDistance int

	// PreserveNesting keeps nested lists together
	PreserveNesting bool

	// MaxListSize is the maximum number of characters in a list before a split is considered
	MaxListSize int

	// MinItemsBeforeSplit is the minimum number of items a list must have before a split is considered
	MinItemsBeforeSplit int


	// AllowSplitAtLevel allows splitting only at this nesting level or higher
	AllowSplitAtLevel int

	// IntroPatterns are patterns that detect list introductions
	IntroPatterns []*regexp.Regexp
}

ListCoherenceConfig holds configuration for list coherence

func DefaultListCoherenceConfig

func DefaultListCoherenceConfig() ListCoherenceConfig

DefaultListCoherenceConfig returns sensible defaults

type ListCoherenceResult

type ListCoherenceResult struct {
	// Blocks are the identified list blocks
	Blocks []*ListBlock

	// IntroOrphans are introductions without following lists
	IntroOrphans []string

	// TotalLists is the number of lists found
	TotalLists int

	// ListsWithIntros is the number of lists with introductions
	ListsWithIntros int

	// NestedLists is the number of lists with nesting
	NestedLists int
}

ListCoherenceResult holds the result of list coherence analysis

type ListItem

type ListItem struct {
	// Text is the item content
	Text string

	// Marker is the bullet/number (e.g., "•", "1.", "a)")
	Marker string

	// Level is the nesting level (0 = top level)
	Level int

	// Index is the position in the list
	Index int

	// Children are nested list items
	Children []*ListItem

	// IsComplete indicates if the item text is complete
	IsComplete bool
}

ListItem represents a single item in a list

type ListType

type ListType int

ListType represents the type of list

const (
	// ListTypeUnordered is a bullet list
	ListTypeUnordered ListType = iota
	// ListTypeOrdered is a numbered list
	ListTypeOrdered
	// ListTypeDefinition is a definition list (term: definition)
	ListTypeDefinition
	// ListTypeChecklist is a checkbox list
	ListTypeChecklist
)

func (ListType) String

func (lt ListType) String() string

String returns a human-readable representation of the list type

type MarkdownOptions

type MarkdownOptions struct {
	// IncludeMetadata adds metadata comments at the start
	IncludeMetadata bool

	// IncludeTableOfContents generates a TOC from section headings
	IncludeTableOfContents bool

	// IncludeChunkSeparators adds horizontal rules between chunks
	IncludeChunkSeparators bool

	// IncludePageNumbers adds page references
	IncludePageNumbers bool

	// IncludeChunkIDs adds chunk IDs as HTML comments
	IncludeChunkIDs bool

	// HeadingLevelOffset adjusts heading levels (e.g., 1 makes H1 -> H2)
	HeadingLevelOffset int

	// MaxHeadingLevel caps heading depth (default: 6)
	MaxHeadingLevel int

	// SectionSeparator is text between major sections (default: "\n\n---\n\n")
	SectionSeparator string
}

MarkdownOptions configures markdown output generation

func DefaultMarkdownOptions

func DefaultMarkdownOptions() MarkdownOptions

DefaultMarkdownOptions returns sensible defaults for markdown generation

func RAGOptimizedMarkdownOptions

func RAGOptimizedMarkdownOptions() MarkdownOptions

RAGOptimizedMarkdownOptions returns options optimized for RAG ingestion

type MetadataConfig

type MetadataConfig struct {
	// ContextFormat determines how context is added to chunk text
	ContextFormat ContextFormat

	// IncludeDocumentTitle includes document title in context
	IncludeDocumentTitle bool

	// IncludePageNumbers includes page numbers in context
	IncludePageNumbers bool

	// IncludeSectionPath includes full section path (not just title)
	IncludeSectionPath bool

	// WordsPerMinute for reading time estimation (default: 200)
	WordsPerMinute int
}

MetadataConfig holds configuration for metadata handling

func DefaultMetadataConfig

func DefaultMetadataConfig() MetadataConfig

DefaultMetadataConfig returns sensible defaults

type OrphanedContentDetector

type OrphanedContentDetector struct {
	// MinOrphanSize is the minimum size for standalone content
	MinOrphanSize int
}

OrphanedContentDetector helps avoid creating orphaned content at chunk boundaries

func NewOrphanedContentDetector

func NewOrphanedContentDetector(minSize int) *OrphanedContentDetector

NewOrphanedContentDetector creates a new orphan detector

func (*OrphanedContentDetector) AdjustForOrphans

func (o *OrphanedContentDetector) AdjustForOrphans(text string, position int, boundaries []Boundary) int

AdjustForOrphans adjusts a split position to avoid orphaned content

func (*OrphanedContentDetector) WouldCreateOrphan

func (o *OrphanedContentDetector) WouldCreateOrphan(text string, position int) bool

WouldCreateOrphan checks if splitting at position would create orphaned content

type OverlapConfig

type OverlapConfig struct {
	// Strategy determines how overlap is computed
	Strategy OverlapStrategy

	// Size is the target overlap size in characters (for character-based)
	// or number of sentences/paragraphs (for sentence/paragraph-based)
	Size int

	// MinOverlap is the minimum overlap to include (avoids tiny overlaps)
	MinOverlap int

	// MaxOverlap is the maximum overlap allowed (prevents excessive duplication)
	MaxOverlap int

	// PreserveWords ensures character overlap doesn't break words
	PreserveWords bool

	// IncludeHeadingContext includes section heading in overlap for context
	IncludeHeadingContext bool
}

OverlapConfig holds configuration for chunk overlap

func DefaultOverlapConfig

func DefaultOverlapConfig() OverlapConfig

DefaultOverlapConfig returns sensible defaults for overlap

type OverlapGenerator

type OverlapGenerator struct {
	// contains filtered or unexported fields
}

OverlapGenerator generates overlap content between chunks

func NewOverlapGenerator

func NewOverlapGenerator() *OverlapGenerator

NewOverlapGenerator creates a new overlap generator with default configuration

func NewOverlapGeneratorWithConfig

func NewOverlapGeneratorWithConfig(config OverlapConfig) *OverlapGenerator

NewOverlapGeneratorWithConfig creates an overlap generator with custom configuration

func (*OverlapGenerator) GenerateOverlap

func (og *OverlapGenerator) GenerateOverlap(chunkText string) *OverlapResult

GenerateOverlap extracts overlap content from the end of a chunk

type OverlapResult

type OverlapResult struct {
	// Text is the overlap content to prepend to the next chunk
	Text string

	// CharCount is the number of characters in the overlap
	CharCount int

	// SentenceCount is the number of complete sentences in the overlap
	SentenceCount int

	// Strategy is the strategy that was used
	Strategy OverlapStrategy
}

OverlapResult contains the computed overlap text and metadata

type OverlapStats

type OverlapStats struct {
	// TotalOverlapChars is the total characters in overlap regions
	TotalOverlapChars int

	// AvgOverlapChars is the average overlap size
	AvgOverlapChars int

	// ChunksWithOverlap is the number of chunks that have overlap
	ChunksWithOverlap int

	// OverlapStrategy is the strategy used
	OverlapStrategy OverlapStrategy
}

OverlapStats contains statistics about overlap in chunks

type OverlapStrategy

type OverlapStrategy int

OverlapStrategy defines how overlap between chunks is computed

const (
	// OverlapNone disables overlap between chunks
	OverlapNone OverlapStrategy = iota
	// OverlapCharacter uses character-based overlap (simple but can break words/sentences)
	OverlapCharacter
	// OverlapSentence uses sentence-based overlap (preserves complete sentences)
	OverlapSentence
	// OverlapParagraph uses paragraph-based overlap (preserves complete paragraphs)
	OverlapParagraph
)

func (OverlapStrategy) String

func (os OverlapStrategy) String() string

String returns a human-readable representation of the overlap strategy

type Section

type Section struct {
	// Heading is the section heading (nil for content before first heading)
	Heading *model.HeadingInfo

	// HeadingLevel is the heading level (0 if no heading)
	HeadingLevel int

	// Title is the section title
	Title string

	// Path is the hierarchical path of parent section titles
	Path []string

	// Content is the text content of this section
	Content []ContentElement

	// PageStart is the starting page (1-indexed)
	PageStart int

	// PageEnd is the ending page (1-indexed)
	PageEnd int

	// Children are nested subsections
	Children []*Section

	// Parent is the parent section (nil for top-level)
	Parent *Section
}

Section represents a document section defined by a heading

type SizeAction

type SizeAction int

SizeAction suggests what action to take for size issues

const (
	// SizeActionNone - no action needed
	SizeActionNone SizeAction = iota
	// SizeActionSplit - chunk should be split
	SizeActionSplit
	// SizeActionMerge - chunk should be merged with neighbor
	SizeActionMerge
	// SizeActionTruncate - chunk must be truncated (hard limit exceeded)
	SizeActionTruncate
)

func (SizeAction) String

func (sa SizeAction) String() string

String returns a human-readable representation of the size action

type SizeCalculator

type SizeCalculator struct {
	// contains filtered or unexported fields
}

SizeCalculator calculates various size metrics for text

func NewSizeCalculator

func NewSizeCalculator() *SizeCalculator

NewSizeCalculator creates a new size calculator with default config

func NewSizeCalculatorWithConfig

func NewSizeCalculatorWithConfig(config SizeConfig) *SizeCalculator

NewSizeCalculatorWithConfig creates a size calculator with custom config

func (*SizeCalculator) Calculate

func (sc *SizeCalculator) Calculate(text string) SizeMetrics

Calculate computes all size metrics for the given text

func (*SizeCalculator) Check

func (sc *SizeCalculator) Check(text string) SizeCheckResult

Check performs a comprehensive size check on the text

func (*SizeCalculator) EstimateTokens

func (sc *SizeCalculator) EstimateTokens(text string) int

EstimateTokens estimates token count using the configured ratio

func (*SizeCalculator) ExceedsLimit

func (sc *SizeCalculator) ExceedsLimit(text string, limit SizeLimit) bool

ExceedsLimit checks if text exceeds a specific limit

func (*SizeCalculator) FindSplitPoint

func (sc *SizeCalculator) FindSplitPoint(text string, boundaries []Boundary) int

FindSplitPoint finds the best position to split text to meet size constraints

func (*SizeCalculator) FindSplitPointAt

func (sc *SizeCalculator) FindSplitPointAt(text string, boundaries []Boundary, targetSize int, targetUnit SizeUnit) int

FindSplitPointAt finds the best position to split text at a specific size limit

func (*SizeCalculator) GetSize

func (sc *SizeCalculator) GetSize(text string, unit SizeUnit) int

GetSize returns the size in the specified unit

func (*SizeCalculator) IsAboveMax

func (sc *SizeCalculator) IsAboveMax(text string) bool

IsAboveMax checks if text exceeds maximum size

func (*SizeCalculator) IsBelowMin

func (sc *SizeCalculator) IsBelowMin(text string) bool

IsBelowMin checks if text is below minimum size

func (*SizeCalculator) IsWithinTarget

func (sc *SizeCalculator) IsWithinTarget(text string) bool

IsWithinTarget checks if size is within target range

func (*SizeCalculator) SplitToSize

func (sc *SizeCalculator) SplitToSize(text string, boundaries []Boundary) []string

SplitToSize splits text into chunks that meet size constraints

type SizeCheckResult

type SizeCheckResult struct {
	// Metrics are the calculated size metrics
	Metrics SizeMetrics

	// IsValid indicates if the size is acceptable
	IsValid bool

	// Reason explains why the size is not valid (if applicable)
	Reason string

	// SuggestedAction suggests what to do if size is not valid
	SuggestedAction SizeAction

	// TargetDiff is the difference from target size
	TargetDiff int
}

SizeCheckResult contains the result of a size check

type SizeConfig

type SizeConfig struct {
	// Target is the ideal chunk size to aim for
	Target SizeLimit

	// Min is the minimum chunk size
	Min SizeLimit

	// Max is the maximum chunk size
	Max SizeLimit

	// TokensPerChar is the ratio of tokens to characters (default: 0.25)
	// Used for token estimation
	TokensPerChar float64

	// AllowExceedForAtomicContent allows exceeding max for tables/lists
	AllowExceedForAtomicContent bool

	// MergeSmallChunks merges chunks below min with neighbors
	MergeSmallChunks bool

	// SplitAtSemanticBoundaries prefers semantic boundaries over exact sizes
	SplitAtSemanticBoundaries bool
}

SizeConfig holds comprehensive size configuration for chunking

func ClaudeContextConfig

func ClaudeContextConfig() SizeConfig

ClaudeContextConfig returns config for Claude's context window

func CohereEmbeddingConfig

func CohereEmbeddingConfig() SizeConfig

CohereEmbeddingConfig returns config optimized for Cohere embeddings

func DefaultSizeConfig

func DefaultSizeConfig() SizeConfig

DefaultSizeConfig returns sensible defaults for size configuration

func LargeChunkConfig

func LargeChunkConfig() SizeConfig

LargeChunkConfig returns config for large chunks (good for context)

func MediumChunkConfig

func MediumChunkConfig() SizeConfig

MediumChunkConfig returns config for medium chunks (balanced)

func OpenAIEmbeddingConfig

func OpenAIEmbeddingConfig() SizeConfig

OpenAIEmbeddingConfig returns config optimized for OpenAI embeddings (8191 tokens max)

func SemanticSizeConfig

func SemanticSizeConfig(targetParagraphs, maxParagraphs int) SizeConfig

SemanticSizeConfig returns configuration for semantic unit-based chunking

func SmallChunkConfig

func SmallChunkConfig() SizeConfig

SmallChunkConfig returns config for small chunks (good for precise retrieval)

func TokenBasedSizeConfig

func TokenBasedSizeConfig(targetTokens, maxTokens int) SizeConfig

TokenBasedSizeConfig returns configuration optimized for token-based chunking

type SizeLimit

type SizeLimit struct {
	// Value is the limit value
	Value int

	// Unit is the unit of measurement
	Unit SizeUnit

	// Type determines if this is a soft or hard limit
	Type LimitType
}

SizeLimit represents a size limit with its type and value

func (SizeLimit) String

func (sl SizeLimit) String() string

String returns a human-readable representation of the size limit

type SizeMetrics

type SizeMetrics struct {
	Characters int
	Tokens     int
	Words      int
	Sentences  int
	Paragraphs int
}

SizeMetrics holds all size measurements for a piece of text

func (SizeMetrics) GetByUnit

func (m SizeMetrics) GetByUnit(unit SizeUnit) int

GetByUnit returns the metric value for the specified unit

type SizeUnit

type SizeUnit int

SizeUnit defines the unit of measurement for chunk sizes

const (
	// SizeUnitCharacters measures size in characters
	SizeUnitCharacters SizeUnit = iota
	// SizeUnitTokens measures size in estimated tokens (chars/4)
	SizeUnitTokens
	// SizeUnitWords measures size in words
	SizeUnitWords
	// SizeUnitSentences measures size in sentences
	SizeUnitSentences
	// SizeUnitParagraphs measures size in paragraphs
	SizeUnitParagraphs
)

func (SizeUnit) String

func (su SizeUnit) String() string

String returns a human-readable representation of the size unit

type StreamExporter

type StreamExporter struct {
	// contains filtered or unexported fields
}

StreamExporter handles streaming export for very large collections

func NewStreamExporter

func NewStreamExporter(w io.Writer) *StreamExporter

NewStreamExporter creates a new stream exporter

func NewStreamExporterWithConfig

func NewStreamExporterWithConfig(w io.Writer, config ExportConfig) *StreamExporter

NewStreamExporterWithConfig creates a stream exporter with custom config

func (*StreamExporter) Close

func (se *StreamExporter) Close() error

Close finalizes the stream export

func (*StreamExporter) WriteChunk

func (se *StreamExporter) WriteChunk(chunk *Chunk, index int) error

WriteChunk writes a single chunk to the stream

type TableChunk

type TableChunk struct {
	// Table is the source table
	Table *model.Table

	// Caption is the associated caption text
	Caption string

	// HasCaption indicates if a caption was found
	HasCaption bool

	// FormattedText is the table rendered as text
	FormattedText string

	// Summary is a brief description of the table
	Summary string

	// RowCount is the number of rows
	RowCount int

	// ColCount is the number of columns
	ColCount int

	// Headers are the column headers (if detected)
	Headers []string

	// IsSplit indicates if this is part of a split table
	IsSplit bool

	// SplitIndex is the index of this part (0-based)
	SplitIndex int

	// TotalSplits is the total number of parts
	TotalSplits int

	// PageNumber is the source page
	PageNumber int
}

TableChunk represents a table as a chunk

func (*TableChunk) ToChunk

func (tc *TableChunk) ToChunk(chunkIndex int) *Chunk

ToChunk converts a TableChunk to a generic Chunk

type TableFigureConfig

type TableFigureConfig struct {
	// TableFormat determines how tables are rendered in chunks
	TableFormat TableFormat

	// MaxTableSize is the maximum number of characters in a table before a split is considered
	MaxTableSize int

	// MaxTableRows is the maximum number of rows before a split is considered
	MaxTableRows int


	// SplitLargeTables allows splitting tables that exceed limits
	SplitLargeTables bool

	// IncludeTableCaption includes detected captions with tables
	IncludeTableCaption bool

	// IncludeFigureCaption includes detected captions with figures
	IncludeFigureCaption bool

	// CaptionSearchDistance is max chars to search for caption
	CaptionSearchDistance int

	// IncludeTableSummary adds a brief summary of table dimensions
	IncludeTableSummary bool

	// IncludeFigureAltText includes alt text for figures
	IncludeFigureAltText bool

	// PreserveTableStructure keeps structural info for RAG
	PreserveTableStructure bool
}

TableFigureConfig holds configuration for table and figure chunking

func DefaultTableFigureConfig

func DefaultTableFigureConfig() TableFigureConfig

DefaultTableFigureConfig returns sensible defaults

type TableFigureHandler

type TableFigureHandler struct {
	// contains filtered or unexported fields
}

TableFigureHandler handles table and figure chunking

func NewTableFigureHandler

func NewTableFigureHandler() *TableFigureHandler

NewTableFigureHandler creates a new handler with default config

func NewTableFigureHandlerWithConfig

func NewTableFigureHandlerWithConfig(config TableFigureConfig) *TableFigureHandler

NewTableFigureHandlerWithConfig creates a handler with custom config

func (*TableFigureHandler) ProcessBlocks

func (h *TableFigureHandler) ProcessBlocks(blocks []ContentBlock) *TableFigureResult

ProcessBlocks processes content blocks to extract tables and figures

func (*TableFigureHandler) ProcessFigure

func (h *TableFigureHandler) ProcessFigure(image *model.Image, caption string, pageNumber int) *FigureChunk

ProcessFigure converts a figure/image to a chunk

func (*TableFigureHandler) ProcessTable

func (h *TableFigureHandler) ProcessTable(table *model.Table, caption string, pageNumber int) []*TableChunk

ProcessTable converts a table to one or more chunks

type TableFigureResult

type TableFigureResult struct {
	// TableChunks are the processed table chunks
	TableChunks []*TableChunk

	// FigureChunks are the processed figure chunks
	FigureChunks []*FigureChunk

	// Stats contains processing statistics
	Stats TableFigureStats
}

TableFigureResult holds the result of processing tables and figures

type TableFigureStats

type TableFigureStats struct {
	TotalTables        int
	TotalFigures       int
	TablesWithCaption  int
	FiguresWithCaption int
	SplitTables        int
	TotalTableRows     int
	TotalTableCols     int
}

TableFigureStats contains statistics about table/figure processing

type TableFormat

type TableFormat int

TableFormat defines how tables are formatted in chunks

const (
	// TableFormatPlainText formats table as tab-separated text
	TableFormatPlainText TableFormat = iota
	// TableFormatMarkdown formats table as markdown
	TableFormatMarkdown
	// TableFormatCSV formats table as CSV
	TableFormatCSV
	// TableFormatHTML formats table as HTML
	TableFormatHTML
)

func (TableFormat) String

func (tf TableFormat) String() string

String returns a human-readable representation of the table format
