Documentation ¶
Overview ¶
Package rag provides semantic chunking and export functionality for RAG (Retrieval-Augmented Generation) workflows and LLM integration. It implements hierarchical, context-aware chunking that respects document structure, so chunks hold complete thoughts rather than breaking mid-sentence or mid-list.
The package prepares extracted document content for use with large language models through semantic chunking and several export formats.
Chunking ¶
The Chunker splits documents into semantically meaningful chunks:
    // config is a ChunkerConfig; document is a *model.Document
    chunker := rag.NewChunkerWithConfig(config)
    result, err := chunker.Chunk(document)
Chunking respects document structure, avoiding splits in the middle of:
- Tables
- Lists
- Paragraphs
- Headings with their following content
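To make "respecting structure" concrete, the sketch below packs whole paragraphs into chunks and never cuts inside one. It is not this package's implementation: `packParagraphs`, the blank-line paragraph rule, and measuring size in bytes are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// packParagraphs is a hypothetical helper, not part of package rag.
// It groups whole paragraphs (separated by blank lines) into chunks of
// at most maxChars bytes. A paragraph longer than maxChars is kept whole
// in its own chunk rather than being cut mid-sentence.
func packParagraphs(text string, maxChars int) []string {
	var chunks []string
	var current strings.Builder
	for _, p := range strings.Split(text, "\n\n") {
		p = strings.TrimSpace(p)
		if p == "" {
			continue
		}
		// Start a new chunk if adding this paragraph would exceed the limit.
		if current.Len() > 0 && current.Len()+2+len(p) > maxChars {
			chunks = append(chunks, current.String())
			current.Reset()
		}
		if current.Len() > 0 {
			current.WriteString("\n\n")
		}
		current.WriteString(p)
	}
	if current.Len() > 0 {
		chunks = append(chunks, current.String())
	}
	return chunks
}

func main() {
	text := "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
	for i, c := range packParagraphs(text, 40) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```

With a 40-byte limit the first two paragraphs share a chunk and the third starts a new one, because adding it would exceed the limit.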
Chunk Configuration ¶
Use ChunkerConfig to control chunking behavior:
- MaxChunkSize - hard limit on chunk size, in characters
- MinChunkSize - minimum chunk size in characters (avoids tiny chunks)
- OverlapSize - number of characters to overlap between consecutive chunks
- PreserveListCoherence, PreserveTableCoherence - keep lists and tables intact
Chunk Metadata ¶
Each Chunk includes metadata for retrieval:
- Page numbers and positions
- Section headings
- Content type (paragraph, table, list, etc.)
- Relationships to other chunks
Export Formats ¶
Export chunks in various formats:
- ToMarkdown() - Markdown with preserved structure
- ToPlainText() - Plain text extraction
- ToJSON() - Structured JSON output
Markdown Export ¶
The MarkdownOptions control markdown generation:
- IncludeMetadata - add front matter
- PreserveTables - use markdown table syntax
- HeadingStyle - ATX (#) or Setext (===) headings
Index ¶
- func ApplyOverlap(currentText string, overlap *OverlapResult, sectionTitle string, ...) string
- func ConvertSize(value int, from, to SizeUnit) int
- func GetListMarkerType(line string) string
- func IsCaptionElement(elementType model.ElementType) bool
- func IsFigureElement(elementType model.ElementType) bool
- func IsListMarker(text string) bool
- func IsTableElement(elementType model.ElementType) bool
- func IsWithinAtomicBlock(index int, atomicBlocks []AtomicBlock) bool
- func NormalizeListMarkers(text string, useNumbers bool) string
- type AtomicBlock
- type BatchExporter
- type Boundary
- type BoundaryConfig
- type BoundaryDetector
- func (d *BoundaryDetector) DetectBoundaries(blocks []ContentBlock) []Boundary
- func (d *BoundaryDetector) FindAtomicBlocks(blocks []ContentBlock) []AtomicBlock
- func (d *BoundaryDetector) FindBestBoundary(boundaries []Boundary, minPos, maxPos int) *Boundary
- func (d *BoundaryDetector) FindBoundaryWithLookAhead(boundaries []Boundary, targetPos int) *Boundary
- func (d *BoundaryDetector) ShouldKeepTogether(block1, block2 ContentBlock) bool
- type BoundaryType
- type CaptionDetector
- type Chunk
- func (c *Chunk) GenerateContextText(config MetadataConfig) string
- func (c *Chunk) GetSectionPathString() string
- func (c *Chunk) Summary() string
- func (c *Chunk) ToEmbeddingFormat() string
- func (c *Chunk) ToMarkdown() string
- func (c *Chunk) ToMarkdownWithOptions(opts MarkdownOptions) string
- func (c *Chunk) ToSearchableText() string
- type ChunkCollection
- func (cc *ChunkCollection) Count() int
- func (cc *ChunkCollection) ExportToFile(filename string, config ExportConfig) error
- func (cc *ChunkCollection) Filter(predicate func(*Chunk) bool) *ChunkCollection
- func (cc *ChunkCollection) FilterByElementType(elementType string) *ChunkCollection
- func (cc *ChunkCollection) FilterByMaxTokens(maxTokens int) *ChunkCollection
- func (cc *ChunkCollection) FilterByMinTokens(minTokens int) *ChunkCollection
- func (cc *ChunkCollection) FilterByPage(page int) *ChunkCollection
- func (cc *ChunkCollection) FilterByPageRange(startPage, endPage int) *ChunkCollection
- func (cc *ChunkCollection) FilterBySection(sectionTitle string) *ChunkCollection
- func (cc *ChunkCollection) FilterWithImages() *ChunkCollection
- func (cc *ChunkCollection) FilterWithLists() *ChunkCollection
- func (cc *ChunkCollection) FilterWithTables() *ChunkCollection
- func (cc *ChunkCollection) First() *Chunk
- func (cc *ChunkCollection) GetAllSections() []string
- func (cc *ChunkCollection) GetByID(id string) *Chunk
- func (cc *ChunkCollection) GetByIndex(index int) *Chunk
- func (cc *ChunkCollection) GetPageRange() (int, int)
- func (cc *ChunkCollection) GetTotalTokens() int
- func (cc *ChunkCollection) GetTotalWords() int
- func (cc *ChunkCollection) Last() *Chunk
- func (cc *ChunkCollection) Search(keyword string) *ChunkCollection
- func (cc *ChunkCollection) Statistics() CollectionStats
- func (cc *ChunkCollection) ToCSV() (string, error)
- func (cc *ChunkCollection) ToJSON() (string, error)
- func (cc *ChunkCollection) ToJSONL() (string, error)
- func (cc *ChunkCollection) ToMarkdown() string
- func (cc *ChunkCollection) ToMarkdownChunks() []string
- func (cc *ChunkCollection) ToMarkdownChunksWithOptions(opts MarkdownOptions) []string
- func (cc *ChunkCollection) ToMarkdownWithOptions(opts MarkdownOptions) string
- func (cc *ChunkCollection) ToSlice() []*Chunk
- func (cc *ChunkCollection) ToTSV() (string, error)
- type ChunkLevel
- type ChunkMetadata
- func (m *ChunkMetadata) ContainsElementType(elementType string) bool
- func (m *ChunkMetadata) GetPageRange() string
- func (m *ChunkMetadata) GetReadingTimeMinutes(wordsPerMinute int) float64
- func (m *ChunkMetadata) GetReadingTimeString(wordsPerMinute int) string
- func (m *ChunkMetadata) GetSectionPathString(separator string) string
- func (m *ChunkMetadata) IsInSection(sectionTitle string) bool
- func (m *ChunkMetadata) IsOnPage(page int) bool
- func (m *ChunkMetadata) ToJSON() ([]byte, error)
- func (m *ChunkMetadata) ToJSONIndent() ([]byte, error)
- func (m *ChunkMetadata) ToMap() map[string]interface{}
- type ChunkResult
- type ChunkStats
- type ChunkWithOverlap
- type ChunkWithOverlapResult
- type Chunker
- type ChunkerConfig
- type CollectionStats
- type ContentBlock
- type ContentElement
- type ContextFormat
- type DocumentChunkOptions
- type DocumentChunker
- type EmbeddingExporter
- func (ee *EmbeddingExporter) ExportForChroma(chunks []*Chunk, embeddings [][]float64, w io.Writer) error
- func (ee *EmbeddingExporter) ExportForPinecone(chunks []*Chunk, embeddings [][]float64, w io.Writer) error
- func (ee *EmbeddingExporter) ExportForWeaviate(chunks []*Chunk, embeddings [][]float64, className string, w io.Writer) error
- func (ee *EmbeddingExporter) PrepareForVectorDB(chunks []*Chunk) []EmbeddingRecord
- type EmbeddingRecord
- type ExportBatch
- type ExportConfig
- type ExportFormat
- type ExportedChunk
- type Exporter
- type FigureChunk
- type LimitType
- type ListBlock
- type ListCoherenceAnalyzer
- func (a *ListCoherenceAnalyzer) AnalyzeListBlock(listText string, precedingText string) *ListBlock
- func (a *ListCoherenceAnalyzer) AnalyzeListCoherence(blocks []ContentBlock) *ListCoherenceResult
- func (a *ListCoherenceAnalyzer) DetectListType(text string) ListType
- func (a *ListCoherenceAnalyzer) FindListSplitPoints(block *ListBlock) []int
- func (a *ListCoherenceAnalyzer) FormatListBlock(block *ListBlock, preserveMarkers bool) string
- func (a *ListCoherenceAnalyzer) IsListIntro(text string) bool
- func (a *ListCoherenceAnalyzer) ParseListItems(text string) []*ListItem
- func (a *ListCoherenceAnalyzer) ShouldKeepListTogether(block *ListBlock) bool
- func (a *ListCoherenceAnalyzer) SplitListBlock(block *ListBlock, atIndex int) (*ListBlock, *ListBlock)
- type ListCoherenceConfig
- type ListCoherenceResult
- type ListItem
- type ListType
- type MarkdownOptions
- type MetadataConfig
- type OrphanedContentDetector
- type OverlapConfig
- type OverlapGenerator
- type OverlapResult
- type OverlapStats
- type OverlapStrategy
- type Section
- type SizeAction
- type SizeCalculator
- func (sc *SizeCalculator) Calculate(text string) SizeMetrics
- func (sc *SizeCalculator) Check(text string) SizeCheckResult
- func (sc *SizeCalculator) EstimateTokens(text string) int
- func (sc *SizeCalculator) ExceedsLimit(text string, limit SizeLimit) bool
- func (sc *SizeCalculator) FindSplitPoint(text string, boundaries []Boundary) int
- func (sc *SizeCalculator) FindSplitPointAt(text string, boundaries []Boundary, targetSize int, targetUnit SizeUnit) int
- func (sc *SizeCalculator) GetSize(text string, unit SizeUnit) int
- func (sc *SizeCalculator) IsAboveMax(text string) bool
- func (sc *SizeCalculator) IsBelowMin(text string) bool
- func (sc *SizeCalculator) IsWithinTarget(text string) bool
- func (sc *SizeCalculator) SplitToSize(text string, boundaries []Boundary) []string
- type SizeCheckResult
- type SizeConfig
- func ClaudeContextConfig() SizeConfig
- func CohereEmbeddingConfig() SizeConfig
- func DefaultSizeConfig() SizeConfig
- func LargeChunkConfig() SizeConfig
- func MediumChunkConfig() SizeConfig
- func OpenAIEmbeddingConfig() SizeConfig
- func SemanticSizeConfig(targetParagraphs, maxParagraphs int) SizeConfig
- func SmallChunkConfig() SizeConfig
- func TokenBasedSizeConfig(targetTokens, maxTokens int) SizeConfig
- type SizeLimit
- type SizeMetrics
- type SizeUnit
- type StreamExporter
- type TableChunk
- type TableFigureConfig
- type TableFigureHandler
- func (h *TableFigureHandler) ProcessBlocks(blocks []ContentBlock) *TableFigureResult
- func (h *TableFigureHandler) ProcessFigure(image *model.Image, caption string, pageNumber int) *FigureChunk
- func (h *TableFigureHandler) ProcessTable(table *model.Table, caption string, pageNumber int) []*TableChunk
- type TableFigureResult
- type TableFigureStats
- type TableFormat
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func ApplyOverlap ¶
func ApplyOverlap(currentText string, overlap *OverlapResult, sectionTitle string, includeContext bool) string
ApplyOverlap applies overlap from the previous chunk to the current chunk
func ConvertSize ¶
func ConvertSize(value int, from, to SizeUnit) int
ConvertSize converts a size value from one unit to another (approximate)
func GetListMarkerType ¶
func GetListMarkerType(line string) string
GetListMarkerType returns the marker type for a line
func IsCaptionElement ¶
func IsCaptionElement(elementType model.ElementType) bool
IsCaptionElement checks if an element type is a caption
func IsFigureElement ¶
func IsFigureElement(elementType model.ElementType) bool
IsFigureElement checks if an element type is a figure or image
func IsListMarker ¶
func IsListMarker(text string) bool
IsListMarker checks if text starts with any list marker
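The documentation does not say which markers are recognized. As a rough sketch of how such a check might work, the example below uses a regular expression for bullets and numbered or lettered markers. The pattern and the name `startsWithListMarker` are assumptions and may differ from what the package accepts.

```go
package main

import (
	"fmt"
	"regexp"
)

// listMarker is an assumed pattern: a bullet (-, *, +, •) or a numbered or
// lettered marker such as "1." or "a)", followed by whitespace.
var listMarker = regexp.MustCompile(`^\s*(?:[-*+•]|\d+[.)]|[A-Za-z][.)])\s+`)

// startsWithListMarker is a hypothetical stand-in for IsListMarker.
func startsWithListMarker(text string) bool {
	return listMarker.MatchString(text)
}

func main() {
	for _, s := range []string{"- item", "2) item", "plain text", "e.g. not a list"} {
		fmt.Printf("%q -> %v\n", s, startsWithListMarker(s))
	}
}
```

Requiring whitespace after the marker keeps abbreviations such as "e.g." from being read as a lettered list item.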
func IsTableElement ¶
func IsTableElement(elementType model.ElementType) bool
IsTableElement checks if an element type is a table
func IsWithinAtomicBlock ¶
func IsWithinAtomicBlock(index int, atomicBlocks []AtomicBlock) bool
IsWithinAtomicBlock checks if an index is within any atomic block
func NormalizeListMarkers ¶
func NormalizeListMarkers(text string, useNumbers bool) string
NormalizeListMarkers normalizes list markers to a consistent format
Types ¶
type AtomicBlock ¶
AtomicBlock identifies a run of blocks that should not be split
func GetAtomicBlockAt ¶
func GetAtomicBlockAt(index int, atomicBlocks []AtomicBlock) *AtomicBlock
GetAtomicBlockAt returns the atomic block containing the given index, if any
type BatchExporter ¶
type BatchExporter struct {
// contains filtered or unexported fields
}
BatchExporter handles exporting large collections in batches
func NewBatchExporter ¶
func NewBatchExporter(batchSize int) *BatchExporter
NewBatchExporter creates a new batch exporter
func NewBatchExporterWithConfig ¶
func NewBatchExporterWithConfig(batchSize int, config ExportConfig) *BatchExporter
NewBatchExporterWithConfig creates a batch exporter with custom config
func (*BatchExporter) Export ¶
func (be *BatchExporter) Export(chunks []*Chunk, callback func(ExportBatch) error) error
Export exports chunks in batches, calling the callback for each batch
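The callback pattern described here can be sketched without the package itself. In the example below, `chunk`, `batch`, and `exportInBatches` are simplified stand-ins. The fields of the real ExportBatch are not shown on this page, so the `Index` and `Chunks` fields are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// chunk and batch are simplified stand-ins; the real Chunk and ExportBatch
// types carry more fields than shown here.
type chunk struct{ ID string }

type batch struct {
	Index  int
	Chunks []chunk
}

// exportInBatches calls fn once per batch of up to batchSize chunks and
// stops at the first error returned by fn.
func exportInBatches(chunks []chunk, batchSize int, fn func(batch) error) error {
	if batchSize <= 0 {
		return errors.New("batchSize must be positive")
	}
	for start, i := 0, 0; start < len(chunks); start, i = start+batchSize, i+1 {
		end := start + batchSize
		if end > len(chunks) {
			end = len(chunks)
		}
		if err := fn(batch{Index: i, Chunks: chunks[start:end]}); err != nil {
			return fmt.Errorf("batch %d: %w", i, err)
		}
	}
	return nil
}

func main() {
	chunks := []chunk{{"c0"}, {"c1"}, {"c2"}, {"c3"}, {"c4"}}
	err := exportInBatches(chunks, 2, func(b batch) error {
		fmt.Printf("batch %d has %d chunks\n", b.Index, len(b.Chunks))
		return nil
	})
	if err != nil {
		fmt.Println("export failed:", err)
	}
}
```

Returning the callback's error immediately lets a caller abort a long export, for example when a downstream write fails.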
func (*BatchExporter) ExportToFiles ¶
func (be *BatchExporter) ExportToFiles(chunks []*Chunk, filenamePattern string) error
ExportToFiles exports chunks to numbered files
type Boundary ¶
type Boundary struct {
// Type is the kind of boundary
Type BoundaryType
// Position is the character offset in the text
Position int
// Score is the priority score for splitting here
Score int
// ElementIndex is the index of the element this boundary follows
ElementIndex int
// Context provides additional information about the boundary
Context string
}
Boundary represents a potential chunk boundary in the content
type BoundaryConfig ¶
type BoundaryConfig struct {
// MinChunkSize is the minimum characters before considering a boundary
MinChunkSize int
// MaxChunkSize is the maximum characters before forcing a boundary
MaxChunkSize int
// PreferParagraphBreaks prefers paragraph boundaries over sentence boundaries
PreferParagraphBreaks bool
// KeepListsIntact tries to keep lists with their introductory text
KeepListsIntact bool
// KeepTablesIntact treats tables as atomic units
KeepTablesIntact bool
// KeepFiguresIntact keeps figures with their captions
KeepFiguresIntact bool
// LookAheadChars is how far to look ahead for better boundaries
LookAheadChars int
// ListIntroPatterns are patterns that indicate list introductions
ListIntroPatterns []*regexp.Regexp
}
BoundaryConfig holds configuration for boundary detection
func DefaultBoundaryConfig ¶
func DefaultBoundaryConfig() BoundaryConfig
DefaultBoundaryConfig returns sensible defaults for boundary detection
type BoundaryDetector ¶
type BoundaryDetector struct {
// contains filtered or unexported fields
}
BoundaryDetector detects semantic boundaries in content
func NewBoundaryDetector ¶
func NewBoundaryDetector() *BoundaryDetector
NewBoundaryDetector creates a new boundary detector with default configuration
func NewBoundaryDetectorWithConfig ¶
func NewBoundaryDetectorWithConfig(config BoundaryConfig) *BoundaryDetector
NewBoundaryDetectorWithConfig creates a boundary detector with custom configuration
func (*BoundaryDetector) DetectBoundaries ¶
func (d *BoundaryDetector) DetectBoundaries(blocks []ContentBlock) []Boundary
DetectBoundaries finds all semantic boundaries in a sequence of content blocks
func (*BoundaryDetector) FindAtomicBlocks ¶
func (d *BoundaryDetector) FindAtomicBlocks(blocks []ContentBlock) []AtomicBlock
FindAtomicBlocks identifies sequences of blocks that should stay together
func (*BoundaryDetector) FindBestBoundary ¶
func (d *BoundaryDetector) FindBestBoundary(boundaries []Boundary, minPos, maxPos int) *Boundary
FindBestBoundary finds the best boundary within a range for splitting
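The selection rule is not spelled out here. One plausible reading, sketched below, picks the highest-scoring boundary whose position falls in the range and prefers the later position on a tie. The `boundary` type mirrors only the Position and Score fields of Boundary. The tie-break and the example scores are assumptions.

```go
package main

import "fmt"

// boundary mirrors the Position and Score fields of Boundary.
type boundary struct {
	Position int
	Score    int
}

// bestBoundary returns the highest-scoring boundary whose position lies in
// [minPos, maxPos], preferring the later position on a tie. It returns nil
// when no boundary falls in the range. The tie-break is an assumption.
func bestBoundary(boundaries []boundary, minPos, maxPos int) *boundary {
	var best *boundary
	for i := range boundaries {
		b := &boundaries[i]
		if b.Position < minPos || b.Position > maxPos {
			continue
		}
		if best == nil || b.Score > best.Score ||
			(b.Score == best.Score && b.Position > best.Position) {
			best = b
		}
	}
	return best
}

func main() {
	bs := []boundary{
		{Position: 120, Score: 10},  // e.g. a sentence end
		{Position: 480, Score: 50},  // e.g. a paragraph break
		{Position: 900, Score: 10},  // another sentence end
		{Position: 2400, Score: 90}, // outside the range used below
	}
	if b := bestBoundary(bs, 100, 1000); b != nil {
		fmt.Printf("split at %d (score %d)\n", b.Position, b.Score) // split at 480 (score 50)
	}
}
```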
func (*BoundaryDetector) FindBoundaryWithLookAhead ¶
func (d *BoundaryDetector) FindBoundaryWithLookAhead(boundaries []Boundary, targetPos int) *Boundary
FindBoundaryWithLookAhead finds a boundary, looking ahead for better options
func (*BoundaryDetector) ShouldKeepTogether ¶
func (d *BoundaryDetector) ShouldKeepTogether(block1, block2 ContentBlock) bool
ShouldKeepTogether determines if two blocks should be kept in the same chunk
type BoundaryType ¶
type BoundaryType int
BoundaryType represents the type of semantic boundary
const (
// BoundaryNone indicates no boundary (middle of content)
BoundaryNone BoundaryType = iota
// BoundarySentence indicates a sentence ending
BoundarySentence
// BoundaryParagraph indicates a paragraph break
BoundaryParagraph
// BoundaryList indicates end of a list
BoundaryList
// BoundaryListItem indicates end of a list item
BoundaryListItem
// BoundaryHeading indicates a heading (section break)
BoundaryHeading
// BoundaryTable indicates end of a table
BoundaryTable
// BoundaryFigure indicates end of a figure/image
BoundaryFigure
// BoundaryCodeBlock indicates end of a code block
BoundaryCodeBlock
// BoundaryPageBreak indicates a page break
BoundaryPageBreak
)
func (BoundaryType) Score ¶
func (bt BoundaryType) Score() int
Score returns a priority score for this boundary type (higher = better split point)
func (BoundaryType) String ¶
func (bt BoundaryType) String() string
String returns a human-readable representation of the boundary type
type CaptionDetector ¶
type CaptionDetector struct {
// contains filtered or unexported fields
}
CaptionDetector helps find captions associated with tables and figures
func NewCaptionDetector ¶
func NewCaptionDetector() *CaptionDetector
NewCaptionDetector creates a new caption detector
func NewCaptionDetectorWithConfig ¶
func NewCaptionDetectorWithConfig(config TableFigureConfig) *CaptionDetector
NewCaptionDetectorWithConfig creates a caption detector with custom config
func (*CaptionDetector) FindFigureCaption ¶
func (d *CaptionDetector) FindFigureCaption(blocks []ContentBlock, figureIndex int) string
FindFigureCaption searches for a caption near a figure
func (*CaptionDetector) FindTableCaption ¶
func (d *CaptionDetector) FindTableCaption(blocks []ContentBlock, tableIndex int) string
FindTableCaption searches for a caption near a table
type Chunk ¶
type Chunk struct {
// ID is a unique identifier for this chunk
ID string `json:"id"`
// Text is the chunk content
Text string `json:"text"`
// TextWithContext is the text with section heading prepended for better retrieval
TextWithContext string `json:"text_with_context,omitempty"`
// Metadata contains rich contextual information
Metadata ChunkMetadata `json:"metadata"`
}
Chunk represents a semantic unit of text extracted from a document for RAG
func NewChunk ¶
func NewChunk(id, text string, metadata ChunkMetadata) *Chunk
NewChunk creates a new chunk with the given text and metadata
func (*Chunk) GenerateContextText ¶
func (c *Chunk) GenerateContextText(config MetadataConfig) string
GenerateContextText generates context text based on configuration
func (*Chunk) GetSectionPathString ¶
func (c *Chunk) GetSectionPathString() string
GetSectionPathString returns the section path as a formatted string
func (*Chunk) ToEmbeddingFormat ¶
func (c *Chunk) ToEmbeddingFormat() string
ToEmbeddingFormat returns text optimized for embedding generation
func (*Chunk) ToMarkdown ¶
func (c *Chunk) ToMarkdown() string
ToMarkdown converts a chunk to markdown format
func (*Chunk) ToMarkdownWithOptions ¶
func (c *Chunk) ToMarkdownWithOptions(opts MarkdownOptions) string
ToMarkdownWithOptions converts a chunk to markdown with custom options
func (*Chunk) ToSearchableText ¶
func (c *Chunk) ToSearchableText() string
ToSearchableText returns text optimized for keyword search
type ChunkCollection ¶
type ChunkCollection struct {
Chunks []*Chunk
}
ChunkCollection provides filtering and search over chunks
func ChunkDocument ¶
func ChunkDocument(doc *model.Document) *ChunkCollection
ChunkDocument is a convenience function to chunk a document with default settings
func ChunkDocumentWithConfig ¶
func ChunkDocumentWithConfig(doc *model.Document, config ChunkerConfig, sizeConfig SizeConfig) *ChunkCollection
ChunkDocumentWithConfig chunks a document with custom configuration
func NewChunkCollection ¶
func NewChunkCollection(chunks []*Chunk) *ChunkCollection
NewChunkCollection creates a new collection from chunks
func (*ChunkCollection) Count ¶
func (cc *ChunkCollection) Count() int
Count returns the number of chunks in the collection
func (*ChunkCollection) ExportToFile ¶
func (cc *ChunkCollection) ExportToFile(filename string, config ExportConfig) error
ExportToFile exports the collection to a file
func (*ChunkCollection) Filter ¶
func (cc *ChunkCollection) Filter(predicate func(*Chunk) bool) *ChunkCollection
Filter returns chunks matching a predicate
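Because Filter returns another collection, predicates can be chained. The sketch below reproduces that shape with cut-down stand-in types (`chunk`, `collection`, `filter`), since the package's import path is not given here. Whether the real method copies or shares the underlying chunks is not documented, so the behaviour shown is an assumption.

```go
package main

import "fmt"

// chunk and collection are cut-down stand-ins for Chunk and ChunkCollection.
type chunk struct {
	ID              string
	HasTable        bool
	EstimatedTokens int
}

type collection struct{ Chunks []*chunk }

// filter returns a new collection holding the chunks for which keep
// returns true; the receiver is left unchanged.
func (c *collection) filter(keep func(*chunk) bool) *collection {
	out := &collection{}
	for _, ch := range c.Chunks {
		if keep(ch) {
			out.Chunks = append(out.Chunks, ch)
		}
	}
	return out
}

func main() {
	all := &collection{Chunks: []*chunk{
		{ID: "chunk-0", HasTable: true, EstimatedTokens: 300},
		{ID: "chunk-1", HasTable: false, EstimatedTokens: 120},
		{ID: "chunk-2", HasTable: true, EstimatedTokens: 900},
	}}
	// Chaining works because each call returns a new collection.
	small := all.
		filter(func(c *chunk) bool { return c.HasTable }).
		filter(func(c *chunk) bool { return c.EstimatedTokens <= 512 })
	for _, c := range small.Chunks {
		fmt.Println(c.ID) // chunk-0
	}
}
```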
func (*ChunkCollection) FilterByElementType ¶
func (cc *ChunkCollection) FilterByElementType(elementType string) *ChunkCollection
FilterByElementType returns chunks containing a specific element type
func (*ChunkCollection) FilterByMaxTokens ¶
func (cc *ChunkCollection) FilterByMaxTokens(maxTokens int) *ChunkCollection
FilterByMaxTokens returns chunks with at most N estimated tokens
func (*ChunkCollection) FilterByMinTokens ¶
func (cc *ChunkCollection) FilterByMinTokens(minTokens int) *ChunkCollection
FilterByMinTokens returns chunks with at least N estimated tokens
func (*ChunkCollection) FilterByPage ¶
func (cc *ChunkCollection) FilterByPage(page int) *ChunkCollection
FilterByPage returns chunks on a specific page
func (*ChunkCollection) FilterByPageRange ¶
func (cc *ChunkCollection) FilterByPageRange(startPage, endPage int) *ChunkCollection
FilterByPageRange returns chunks within a page range
func (*ChunkCollection) FilterBySection ¶
func (cc *ChunkCollection) FilterBySection(sectionTitle string) *ChunkCollection
FilterBySection returns chunks in a specific section
func (*ChunkCollection) FilterWithImages ¶
func (cc *ChunkCollection) FilterWithImages() *ChunkCollection
FilterWithImages returns chunks containing images
func (*ChunkCollection) FilterWithLists ¶
func (cc *ChunkCollection) FilterWithLists() *ChunkCollection
FilterWithLists returns chunks containing lists
func (*ChunkCollection) FilterWithTables ¶
func (cc *ChunkCollection) FilterWithTables() *ChunkCollection
FilterWithTables returns chunks containing tables
func (*ChunkCollection) First ¶
func (cc *ChunkCollection) First() *Chunk
First returns the first chunk or nil
func (*ChunkCollection) GetAllSections ¶
func (cc *ChunkCollection) GetAllSections() []string
GetAllSections returns unique section titles
func (*ChunkCollection) GetByID ¶
func (cc *ChunkCollection) GetByID(id string) *Chunk
GetByID returns a chunk by ID
func (*ChunkCollection) GetByIndex ¶
func (cc *ChunkCollection) GetByIndex(index int) *Chunk
GetByIndex returns a chunk by index
func (*ChunkCollection) GetPageRange ¶
func (cc *ChunkCollection) GetPageRange() (int, int)
GetPageRange returns the min and max page numbers
func (*ChunkCollection) GetTotalTokens ¶
func (cc *ChunkCollection) GetTotalTokens() int
GetTotalTokens returns the sum of estimated tokens across all chunks
func (*ChunkCollection) GetTotalWords ¶
func (cc *ChunkCollection) GetTotalWords() int
GetTotalWords returns the sum of words across all chunks
func (*ChunkCollection) Last ¶
func (cc *ChunkCollection) Last() *Chunk
Last returns the last chunk or nil
func (*ChunkCollection) Search ¶
func (cc *ChunkCollection) Search(keyword string) *ChunkCollection
Search returns chunks containing a keyword (case-insensitive)
func (*ChunkCollection) Statistics ¶
func (cc *ChunkCollection) Statistics() CollectionStats
Statistics returns aggregate statistics about the collection
func (*ChunkCollection) ToCSV ¶
func (cc *ChunkCollection) ToCSV() (string, error)
ToCSV exports the collection as CSV
func (*ChunkCollection) ToJSON ¶
func (cc *ChunkCollection) ToJSON() (string, error)
ToJSON exports the collection as JSON array
func (*ChunkCollection) ToJSONL ¶
func (cc *ChunkCollection) ToJSONL() (string, error)
ToJSONL exports the collection as JSON Lines
func (*ChunkCollection) ToMarkdown ¶
func (cc *ChunkCollection) ToMarkdown() string
ToMarkdown converts all chunks to a combined markdown document
func (*ChunkCollection) ToMarkdownChunks ¶
func (cc *ChunkCollection) ToMarkdownChunks() []string
ToMarkdownChunks returns each chunk as a separate markdown string. Useful when you need to process chunks individually but want markdown format.
func (*ChunkCollection) ToMarkdownChunksWithOptions ¶
func (cc *ChunkCollection) ToMarkdownChunksWithOptions(opts MarkdownOptions) []string
ToMarkdownChunksWithOptions returns each chunk as a separate markdown string, rendered with custom options
func (*ChunkCollection) ToMarkdownWithOptions ¶
func (cc *ChunkCollection) ToMarkdownWithOptions(opts MarkdownOptions) string
ToMarkdownWithOptions converts all chunks to markdown with custom options
func (*ChunkCollection) ToSlice ¶
func (cc *ChunkCollection) ToSlice() []*Chunk
ToSlice returns the underlying slice
func (*ChunkCollection) ToTSV ¶
func (cc *ChunkCollection) ToTSV() (string, error)
ToTSV exports the collection as TSV
type ChunkLevel ¶
type ChunkLevel int
ChunkLevel represents the hierarchical level of a chunk
const (
// ChunkLevelDocument represents the entire document as one chunk
ChunkLevelDocument ChunkLevel = iota
// ChunkLevelSection represents a section defined by headings
ChunkLevelSection
// ChunkLevelParagraph represents a single paragraph
ChunkLevelParagraph
// ChunkLevelSentence represents a single sentence (used for oversized paragraphs)
ChunkLevelSentence
)
func (ChunkLevel) String ¶
func (cl ChunkLevel) String() string
String returns a human-readable representation of the chunk level
type ChunkMetadata ¶
type ChunkMetadata struct {
// DocumentTitle is the title of the source document
DocumentTitle string `json:"document_title,omitempty"`
// SectionPath is the hierarchical path of headings (e.g., ["Chapter 1", "Introduction", "Overview"])
SectionPath []string `json:"section_path,omitempty"`
// SectionTitle is the immediate section heading (last element of SectionPath)
SectionTitle string `json:"section_title,omitempty"`
// HeadingLevel is the level of the current section (1-6, 0 if no heading)
HeadingLevel int `json:"heading_level,omitempty"`
// PageStart is the starting page number (1-indexed)
PageStart int `json:"page_start"`
// PageEnd is the ending page number (1-indexed)
PageEnd int `json:"page_end"`
// ChunkIndex is the position of this chunk in the document (0-indexed)
ChunkIndex int `json:"chunk_index"`
// TotalChunks is the total number of chunks in the document
TotalChunks int `json:"total_chunks,omitempty"`
// Level is the hierarchical level of this chunk
Level ChunkLevel `json:"level"`
// ParentID is the ID of the parent chunk (empty for top-level chunks)
ParentID string `json:"parent_id,omitempty"`
// ChildIDs are the IDs of child chunks
ChildIDs []string `json:"child_ids,omitempty"`
// ElementTypes lists the types of elements contained (paragraph, list, table, etc.)
ElementTypes []string `json:"element_types,omitempty"`
// HasTable indicates if the chunk contains a table
HasTable bool `json:"has_table,omitempty"`
// HasList indicates if the chunk contains a list
HasList bool `json:"has_list,omitempty"`
// HasImage indicates if the chunk contains an image
HasImage bool `json:"has_image,omitempty"`
// CharCount is the number of characters in the chunk text
CharCount int `json:"char_count"`
// WordCount is the number of words in the chunk text
WordCount int `json:"word_count"`
// EstimatedTokens is an estimated token count (chars/4 as rough approximation)
EstimatedTokens int `json:"estimated_tokens"`
// BBox is the bounding box of the chunk content on the page
BBox *model.BBox `json:"bbox,omitempty"`
}
ChunkMetadata contains rich metadata about a chunk's context within the document
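The size fields can be approximated as follows. The chars/4 token estimate comes from the EstimatedTokens comment. Counting characters as runes, splitting words on whitespace, and rounding down are assumptions of this sketch, and `sizeOf` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sizeOf computes rough equivalents of CharCount, WordCount and
// EstimatedTokens for a piece of text. Counting characters as runes and
// using integer division for the token estimate are assumptions.
func sizeOf(text string) (chars, words, tokens int) {
	chars = utf8.RuneCountInString(text)
	words = len(strings.Fields(text))
	tokens = chars / 4
	return chars, words, tokens
}

func main() {
	chars, words, tokens := sizeOf("Chunking respects document structure.")
	fmt.Printf("chars=%d words=%d tokens≈%d\n", chars, words, tokens)
}
```

The chars/4 rule is only a rough guide for English text. When a hard token limit matters, count with the target model's tokenizer instead.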
func (*ChunkMetadata) ContainsElementType ¶
func (m *ChunkMetadata) ContainsElementType(elementType string) bool
ContainsElementType checks if the chunk contains a specific element type
func (*ChunkMetadata) GetPageRange ¶
func (m *ChunkMetadata) GetPageRange() string
GetPageRange returns a formatted page range string
func (*ChunkMetadata) GetReadingTimeMinutes ¶
func (m *ChunkMetadata) GetReadingTimeMinutes(wordsPerMinute int) float64
GetReadingTimeMinutes estimates reading time in minutes
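A reading-time estimate of this kind is normally the word count divided by the reading speed. The sketch below assumes that formula and falls back to 200 words per minute for non-positive input. Both the fallback and the name `readingTimeMinutes` are assumptions.

```go
package main

import "fmt"

// readingTimeMinutes divides the word count by the reading speed.
// The 200 wpm fallback for non-positive input is an assumption.
func readingTimeMinutes(wordCount, wordsPerMinute int) float64 {
	if wordsPerMinute <= 0 {
		wordsPerMinute = 200
	}
	return float64(wordCount) / float64(wordsPerMinute)
}

func main() {
	fmt.Printf("%.2f minutes\n", readingTimeMinutes(450, 200)) // 2.25 minutes
}
```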
func (*ChunkMetadata) GetReadingTimeString ¶
func (m *ChunkMetadata) GetReadingTimeString(wordsPerMinute int) string
GetReadingTimeString returns a human-readable reading time
func (*ChunkMetadata) GetSectionPathString ¶
func (m *ChunkMetadata) GetSectionPathString(separator string) string
GetSectionPathString returns the section path as a formatted string
func (*ChunkMetadata) IsInSection ¶
func (m *ChunkMetadata) IsInSection(sectionTitle string) bool
IsInSection checks if the chunk is within a given section path
func (*ChunkMetadata) IsOnPage ¶
func (m *ChunkMetadata) IsOnPage(page int) bool
IsOnPage checks if the chunk spans a given page
func (*ChunkMetadata) ToJSON ¶
func (m *ChunkMetadata) ToJSON() ([]byte, error)
ToJSON serializes metadata to JSON
func (*ChunkMetadata) ToJSONIndent ¶
func (m *ChunkMetadata) ToJSONIndent() ([]byte, error)
ToJSONIndent serializes metadata to indented JSON
func (*ChunkMetadata) ToMap ¶
func (m *ChunkMetadata) ToMap() map[string]interface{}
ToMap converts metadata to a map for flexible access
type ChunkResult ¶
type ChunkResult struct {
// Chunks are the generated chunks in reading order
Chunks []*Chunk
// DocumentTitle is the document title if available
DocumentTitle string
// TotalPages is the total number of pages processed
TotalPages int
// Statistics about the chunking process
Stats ChunkStats
}
ChunkResult contains the chunking output
type ChunkStats ¶
type ChunkStats struct {
TotalChunks int
TotalCharacters int
TotalWords int
TotalTokensEst int
AvgChunkSize int
MinChunkSize int
MaxChunkSize int
SectionChunks int
ParagraphChunks int
SentenceChunks int
}
ChunkStats contains statistics about the chunking process
type ChunkWithOverlap ¶
type ChunkWithOverlap struct {
*Chunk
// OverlapPrefix is the overlap content prepended from previous chunk
OverlapPrefix string
// OverlapSuffix is the overlap content that will be prepended to next chunk
OverlapSuffix string
// HasOverlapPrefix indicates if this chunk has overlap from previous
HasOverlapPrefix bool
// HasOverlapSuffix indicates if this chunk provides overlap to next
HasOverlapSuffix bool
}
ChunkWithOverlap represents a chunk with its overlap information
func ApplyOverlapToChunks ¶
func ApplyOverlapToChunks(chunks []*Chunk, config OverlapConfig) []*ChunkWithOverlap
ApplyOverlapToChunks adds overlap between consecutive chunks
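The idea behind overlap is to repeat the tail of one chunk at the head of the next so context survives the split. The package describes character-based and sentence-based overlap. The sketch below deliberately swaps in a simpler word-based variant to keep the example short, and `lastWords` and `withOverlap` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// lastWords returns the final n whitespace-separated words of s.
func lastWords(s string, n int) string {
	words := strings.Fields(s)
	if n <= 0 {
		return ""
	}
	if n > len(words) {
		n = len(words)
	}
	return strings.Join(words[len(words)-n:], " ")
}

// withOverlap prepends the last n words of each chunk to the chunk that
// follows it. The first chunk is returned unchanged.
func withOverlap(chunks []string, n int) []string {
	out := make([]string, len(chunks))
	for i, c := range chunks {
		out[i] = c
		if i == 0 {
			continue
		}
		if prefix := lastWords(chunks[i-1], n); prefix != "" {
			out[i] = prefix + " " + c
		}
	}
	return out
}

func main() {
	chunks := []string{
		"The table lists quarterly revenue.",
		"Revenue grew in every region.",
	}
	for _, c := range withOverlap(chunks, 2) {
		fmt.Println(c)
	}
}
```

The overlap is taken from the original previous chunk, not from its overlapped form, so prefixes do not accumulate along the sequence.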
func (*ChunkWithOverlap) GetOriginalText ¶
func (c *ChunkWithOverlap) GetOriginalText() string
GetOriginalText returns the chunk text without overlap prefix
func (*ChunkWithOverlap) GetOverlapText ¶
func (c *ChunkWithOverlap) GetOverlapText() string
GetOverlapText returns just the overlap portion of a chunk (for analysis)
type ChunkWithOverlapResult ¶
type ChunkWithOverlapResult struct {
// Chunks are the generated chunks with overlap information
Chunks []*ChunkWithOverlap
// DocumentTitle is the document title if available
DocumentTitle string
// TotalPages is the total number of pages processed
TotalPages int
// Statistics about the chunking process
Stats ChunkStats
// OverlapStats contains overlap-specific statistics
OverlapStats OverlapStats
}
ChunkWithOverlapResult contains chunking output with overlap information
type Chunker ¶
type Chunker struct {
// contains filtered or unexported fields
}
Chunker performs semantic chunking of documents
func NewChunker ¶
func NewChunker() *Chunker
NewChunker creates a new chunker with default configuration
func NewChunkerWithConfig ¶
func NewChunkerWithConfig(config ChunkerConfig) *Chunker
NewChunkerWithConfig creates a chunker with custom configuration
func (*Chunker) Chunk ¶
func (c *Chunker) Chunk(doc *model.Document) (*ChunkResult, error)
Chunk processes a document and returns semantic chunks
func (*Chunker) ChunkWithOverlapEnabled ¶
func (c *Chunker) ChunkWithOverlapEnabled(doc *model.Document) (*ChunkWithOverlapResult, error)
ChunkWithOverlapEnabled processes a document and returns chunks with overlap
type ChunkerConfig ¶
type ChunkerConfig struct {
// TargetChunkSize is the target size for chunks in characters
// Default: 1000
TargetChunkSize int
// MaxChunkSize is the hard limit for chunk size in characters
// Chunks will be split at sentence boundaries if they exceed this
// Default: 2000
MaxChunkSize int
// MinChunkSize is the minimum size for a chunk in characters
// Smaller chunks may be merged with adjacent content
// Default: 100
MinChunkSize int
// OverlapSize is the number of characters to overlap between chunks
// Default: 100
OverlapSize int
// OverlapSentences, when true, uses sentence-based overlap instead of character-based overlap
// Default: true
OverlapSentences bool
// PreserveListCoherence keeps list intros with their items
// Default: true
PreserveListCoherence bool
// PreserveTableCoherence keeps tables as atomic units
// Default: true
PreserveTableCoherence bool
// IncludeSectionContext prepends section heading to chunk text
// Default: true
IncludeSectionContext bool
// SplitOnHeadings creates new chunks at heading boundaries
// Default: true
SplitOnHeadings bool
// MinHeadingLevel is the minimum heading level to split on (1-6)
// Higher numbers = split on more (deeper) headings
// Default: 3 (split on H1, H2, H3)
MinHeadingLevel int
// PreserveParagraphs tries to keep paragraphs intact
// Default: true
PreserveParagraphs bool
// IDPrefix is a prefix for generated chunk IDs
// Default: "chunk"
IDPrefix string
}
ChunkerConfig holds configuration options for the chunker
func DefaultChunkerConfig ¶
func DefaultChunkerConfig() ChunkerConfig
DefaultChunkerConfig returns sensible default configuration
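The defaults listed in the ChunkerConfig field comments can be collected into a small sketch. The `chunkerConfig` type below is a local stand-in that mirrors a subset of the documented fields, and `validate` is a hypothetical sanity check; neither is part of the package API.

```go
package main

import (
	"errors"
	"fmt"
)

// chunkerConfig is a local stand-in mirroring a subset of the documented
// ChunkerConfig fields. It is not the package's own type.
type chunkerConfig struct {
	TargetChunkSize int
	MaxChunkSize    int
	MinChunkSize    int
	OverlapSize     int
	MinHeadingLevel int
	IDPrefix        string
}

// documentedDefaults returns the default values stated in the field comments.
func documentedDefaults() chunkerConfig {
	return chunkerConfig{
		TargetChunkSize: 1000,
		MaxChunkSize:    2000,
		MinChunkSize:    100,
		OverlapSize:     100,
		MinHeadingLevel: 3,
		IDPrefix:        "chunk",
	}
}

// validate is a hypothetical check (an assumption, not documented behavior)
// that the size fields are ordered consistently.
func validate(c chunkerConfig) error {
	if c.MinChunkSize > c.TargetChunkSize || c.TargetChunkSize > c.MaxChunkSize {
		return errors.New("sizes must satisfy MinChunkSize <= TargetChunkSize <= MaxChunkSize")
	}
	if c.OverlapSize < 0 || c.OverlapSize >= c.TargetChunkSize {
		return errors.New("OverlapSize must be non-negative and smaller than TargetChunkSize")
	}
	if c.MinHeadingLevel < 1 || c.MinHeadingLevel > 6 {
		return errors.New("MinHeadingLevel must be in the range 1-6")
	}
	return nil
}

func main() {
	cfg := documentedDefaults()
	fmt.Println(validate(cfg)) // <nil>

	cfg.MaxChunkSize = 500 // now smaller than the target size
	fmt.Println(validate(cfg))
}
```

With the real package, the natural pattern would presumably be to start from DefaultChunkerConfig(), override individual fields, and pass the result to NewChunkerWithConfig.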
type CollectionStats ¶
type CollectionStats struct {
TotalChunks int
TotalTokens int
TotalWords int
TotalChars int
AvgTokens int
MinTokens int
MaxTokens int
ChunksWithTables int
ChunksWithLists int
ChunksWithImages int
UniqueSections int
PageStart int
PageEnd int
}
CollectionStats contains aggregate statistics about a chunk collection
func (*CollectionStats) ToJSON ¶
func (cs *CollectionStats) ToJSON() ([]byte, error)
ToJSON serializes stats to JSON
type ContentBlock ¶
type ContentBlock struct {
Type model.ElementType
Text string
Page int
Index int
ListInfo *model.ListInfo
IsIntro bool // True if this appears to introduce the next element
}
ContentBlock represents a block of content for boundary detection
type ContentElement ¶
type ContentElement struct {
Type model.ElementType
Text string
Page int
BBox model.BBox
ListInfo *model.ListInfo
}
ContentElement represents a piece of content within a section
type ContextFormat ¶
type ContextFormat int
ContextFormat defines how context is injected into chunk text
const (
// ContextFormatNone adds no context
ContextFormatNone ContextFormat = iota
// ContextFormatBracket adds context in brackets: [Section Title]
ContextFormatBracket
// ContextFormatMarkdown adds context as markdown heading
ContextFormatMarkdown
// ContextFormatBreadcrumb adds full path as breadcrumb
ContextFormatBreadcrumb
// ContextFormatXML adds context in XML-style tags
ContextFormatXML
)
func (ContextFormat) String ¶
func (cf ContextFormat) String() string
String returns a human-readable representation of the context format
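To make the ContextFormat variants concrete, here is a rough sketch of how section context could be prepended to chunk text. Only the bracket form `[Section Title]` comes from the documentation; the separators, the heading level, the breadcrumb delimiter, and the XML tag name are assumptions, and `injectContext` is a hypothetical helper rather than a package function.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// contextFormat is a local stand-in for the documented ContextFormat type.
type contextFormat int

const (
	formatNone contextFormat = iota
	formatBracket
	formatMarkdown
	formatBreadcrumb
	formatXML
)

// injectContext prepends section context to text. path is the section path
// from the document root to the current section.
func injectContext(f contextFormat, path []string, text string) string {
	if f == formatNone || len(path) == 0 {
		return text
	}
	title := path[len(path)-1]
	switch f {
	case formatBracket:
		// Documented shape: [Section Title]
		return "[" + title + "]\n" + text
	case formatMarkdown:
		// Assumed heading level.
		return "## " + title + "\n\n" + text
	case formatBreadcrumb:
		// Assumed delimiter.
		return strings.Join(path, " > ") + "\n" + text
	case formatXML:
		// Assumed tag and attribute names; the title is escaped.
		return "<section title=\"" + html.EscapeString(title) + "\">\n" + text + "\n</section>"
	}
	return text
}

func main() {
	path := []string{"Guide", "Install"}
	fmt.Println(injectContext(formatBracket, path, "Run it."))
	fmt.Println(injectContext(formatBreadcrumb, path, "Run it."))
}
```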
type DocumentChunkOptions ¶
type DocumentChunkOptions struct {
ChunkerConfig ChunkerConfig
SizeConfig SizeConfig
}
DocumentChunkOptions holds options for document chunking
func DefaultDocumentChunkOptions ¶
func DefaultDocumentChunkOptions() DocumentChunkOptions
DefaultDocumentChunkOptions returns default chunking options
func RAGOptimizedOptions ¶
func RAGOptimizedOptions() DocumentChunkOptions
RAGOptimizedOptions returns options optimized for RAG workflows
type DocumentChunker ¶
type DocumentChunker struct {
// contains filtered or unexported fields
}
DocumentChunker provides RAG chunking for Document objects
func NewDocumentChunker ¶
func NewDocumentChunker() *DocumentChunker
NewDocumentChunker creates a new document chunker with default configuration
func NewDocumentChunkerWithConfig ¶
func NewDocumentChunkerWithConfig(config ChunkerConfig, sizeConfig SizeConfig) *DocumentChunker
NewDocumentChunkerWithConfig creates a document chunker with custom configuration
func (*DocumentChunker) ChunkDocument ¶
func (dc *DocumentChunker) ChunkDocument(doc *model.Document) *ChunkCollection
ChunkDocument chunks a Document into semantic units for RAG
type EmbeddingExporter ¶
type EmbeddingExporter struct {
// contains filtered or unexported fields
}
EmbeddingExporter exports chunks with embeddings for vector databases
func NewEmbeddingExporter ¶
func NewEmbeddingExporter() *EmbeddingExporter
NewEmbeddingExporter creates an exporter optimized for embedding export
func (*EmbeddingExporter) ExportForChroma ¶
func (ee *EmbeddingExporter) ExportForChroma(chunks []*Chunk, embeddings [][]float64, w io.Writer) error
ExportForChroma exports in Chroma-compatible format
func (*EmbeddingExporter) ExportForPinecone ¶
func (ee *EmbeddingExporter) ExportForPinecone(chunks []*Chunk, embeddings [][]float64, w io.Writer) error
ExportForPinecone exports in Pinecone-compatible format
func (*EmbeddingExporter) ExportForWeaviate ¶
func (ee *EmbeddingExporter) ExportForWeaviate(chunks []*Chunk, embeddings [][]float64, className string, w io.Writer) error
ExportForWeaviate exports in Weaviate-compatible format
func (*EmbeddingExporter) PrepareForVectorDB ¶
func (ee *EmbeddingExporter) PrepareForVectorDB(chunks []*Chunk) []EmbeddingRecord
PrepareForVectorDB prepares chunks for vector database ingestion
type EmbeddingRecord ¶
type EmbeddingRecord struct {
ID string `json:"id"`
Text string `json:"text"`
Embedding []float64 `json:"embedding,omitempty"`
Metadata map[string]interface{} `json:"metadata,omitempty"`
}
EmbeddingRecord represents a single record for vector DB ingestion
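The struct tags above determine the JSON shape of a record. The sketch below copies the documented fields and tags into a local `embeddingRecord` type (a stand-in, so the example does not depend on the package import path) and encodes it with the standard library. `encodeRecord` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// embeddingRecord copies the documented EmbeddingRecord fields and JSON tags.
type embeddingRecord struct {
	ID        string                 `json:"id"`
	Text      string                 `json:"text"`
	Embedding []float64              `json:"embedding,omitempty"`
	Metadata  map[string]interface{} `json:"metadata,omitempty"`
}

// encodeRecord marshals one record to a JSON string.
func encodeRecord(r embeddingRecord) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	records := []embeddingRecord{
		{ID: "chunk-0", Text: "hello"},
		{ID: "chunk-1", Text: "hi", Embedding: []float64{0.5, 1}},
	}
	for _, r := range records {
		out, err := encodeRecord(r)
		if err != nil {
			fmt.Println("encode failed:", err)
			continue
		}
		fmt.Println(out)
	}
	// Prints:
	// {"id":"chunk-0","text":"hello"}
	// {"id":"chunk-1","text":"hi","embedding":[0.5,1]}
}
```

Because of `omitempty`, a record without an embedding or metadata serializes to just its `id` and `text`.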
type ExportBatch ¶
type ExportBatch struct {
// BatchNumber is the zero-indexed batch number
BatchNumber int
// StartIndex is the starting chunk index in the original collection
StartIndex int
// EndIndex is the ending chunk index (exclusive)
EndIndex int
// ChunkCount is the number of chunks in this batch
ChunkCount int
// Data contains the exported data
Data string
}
ExportBatch represents a single exported batch
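The index fields of ExportBatch follow the usual half-open convention: StartIndex is inclusive and EndIndex is exclusive. A minimal sketch of how such batches could be planned is shown below; `planBatches` and the local `exportBatch` type are hypothetical and only mirror the documented index fields.

```go
package main

import "fmt"

// exportBatch mirrors the index fields of the documented ExportBatch type.
type exportBatch struct {
	BatchNumber int // zero-indexed
	StartIndex  int // inclusive
	EndIndex    int // exclusive
	ChunkCount  int
}

// planBatches divides total chunks into consecutive batches of at most
// batchSize chunks each.
func planBatches(total, batchSize int) []exportBatch {
	if total <= 0 || batchSize <= 0 {
		return nil
	}
	var batches []exportBatch
	for start := 0; start < total; start += batchSize {
		end := start + batchSize
		if end > total {
			end = total
		}
		batches = append(batches, exportBatch{
			BatchNumber: len(batches),
			StartIndex:  start,
			EndIndex:    end,
			ChunkCount:  end - start,
		})
	}
	return batches
}

func main() {
	for _, b := range planBatches(5, 2) {
		fmt.Printf("batch %d: [%d, %d) count=%d\n", b.BatchNumber, b.StartIndex, b.EndIndex, b.ChunkCount)
	}
	// Prints:
	// batch 0: [0, 2) count=2
	// batch 1: [2, 4) count=2
	// batch 2: [4, 5) count=1
}
```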
type ExportConfig ¶
type ExportConfig struct {
// Format specifies the export format
Format ExportFormat
// IncludeMetadata determines whether metadata fields are included
IncludeMetadata bool
// MetadataFields specifies which metadata fields to include (nil = all)
MetadataFields []string
// IncludeText includes the chunk text content
IncludeText bool
// IncludeEmbeddings includes embedding vectors if present
IncludeEmbeddings bool
// FlattenMetadata flattens nested metadata into dot-notation keys
FlattenMetadata bool
// CSVDelimiter specifies the delimiter for CSV export (default: comma)
CSVDelimiter rune
// IncludeHeader includes header row in CSV/TSV exports
IncludeHeader bool
// PrettyPrint enables pretty printing for JSON formats
PrettyPrint bool
// TextColumnName specifies the column name for text content
TextColumnName string
// ChunkIDColumnName specifies the column name for chunk ID
ChunkIDColumnName string
}
ExportConfig holds configuration options for export
func CSVExportConfig ¶
func CSVExportConfig() ExportConfig
CSVExportConfig returns config optimized for CSV export
func DefaultExportConfig ¶
func DefaultExportConfig() ExportConfig
DefaultExportConfig returns sensible defaults for export configuration
func JSONLExportConfig ¶
func JSONLExportConfig() ExportConfig
JSONLExportConfig returns config optimized for JSON Lines export
func TSVExportConfig ¶
func TSVExportConfig() ExportConfig
TSVExportConfig returns config optimized for TSV export
func VectorDBExportConfig ¶
func VectorDBExportConfig() ExportConfig
VectorDBExportConfig returns config optimized for vector DB ingestion
type ExportFormat ¶
type ExportFormat int
ExportFormat defines the available export formats
const (
// ExportFormatJSONL exports as JSON Lines (one JSON object per line)
ExportFormatJSONL ExportFormat = iota
// ExportFormatJSON exports as a JSON array
ExportFormatJSON
// ExportFormatCSV exports as comma-separated values
ExportFormatCSV
// ExportFormatTSV exports as tab-separated values
ExportFormatTSV
)
func (ExportFormat) FileExtension ¶
func (ef ExportFormat) FileExtension() string
FileExtension returns the typical file extension for this format
func (ExportFormat) String ¶
func (ef ExportFormat) String() string
String returns a human-readable representation of the export format
type ExportedChunk ¶
type ExportedChunk struct {
// ID is the unique identifier for the chunk
ID string `json:"id,omitempty"`
// Text is the chunk content
Text string `json:"text,omitempty"`
// Metadata holds all metadata fields as a map
Metadata map[string]interface{} `json:"metadata,omitempty"`
// Embeddings holds the embedding vector(s) if present
Embeddings []float64 `json:"embeddings,omitempty"`
// Source document information
DocumentTitle string `json:"document_title,omitempty"`
PageStart int `json:"page_start,omitempty"`
PageEnd int `json:"page_end,omitempty"`
// Position within the document
ChunkIndex int `json:"chunk_index,omitempty"`
// Section information
SectionTitle string `json:"section_title,omitempty"`
SectionPath []string `json:"section_path,omitempty"`
// Content indicators
HasTable bool `json:"has_table,omitempty"`
HasList bool `json:"has_list,omitempty"`
HasImage bool `json:"has_image,omitempty"`
}
ExportedChunk represents a chunk prepared for export
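As an illustration of the JSON Lines shape (one JSON object per line), the sketch below encodes a few chunks with the standard library. The `exportedChunk` type is a local stand-in that copies three of the documented fields and their tags; `toJSONL` is a hypothetical helper, not the package's exporter.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// exportedChunk copies a subset of the documented ExportedChunk fields and tags.
type exportedChunk struct {
	ID         string `json:"id,omitempty"`
	Text       string `json:"text,omitempty"`
	ChunkIndex int    `json:"chunk_index,omitempty"`
}

// toJSONL encodes each chunk as one JSON object per line.
func toJSONL(chunks []exportedChunk) (string, error) {
	var b strings.Builder
	enc := json.NewEncoder(&b) // Encode appends a newline after each value
	for _, c := range chunks {
		if err := enc.Encode(c); err != nil {
			return "", err
		}
	}
	return b.String(), nil
}

func main() {
	out, err := toJSONL([]exportedChunk{
		{ID: "chunk-0", Text: "a", ChunkIndex: 0},
		{ID: "chunk-1", Text: "b", ChunkIndex: 1},
	})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Print(out)
	// Prints:
	// {"id":"chunk-0","text":"a"}
	// {"id":"chunk-1","text":"b","chunk_index":1}
}
```

Note that with the `omitempty` tag, encoding/json drops zero values, so the first chunk (index 0) is written without a `chunk_index` key. Consumers should treat a missing key as 0.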
type Exporter ¶
type Exporter struct {
// contains filtered or unexported fields
}
Exporter handles exporting chunks to various formats
func NewExporter ¶
func NewExporter() *Exporter
NewExporter creates a new exporter with default configuration
func NewExporterWithConfig ¶
func NewExporterWithConfig(config ExportConfig) *Exporter
NewExporterWithConfig creates an exporter with custom configuration
func (*Exporter) ExportToFile ¶
ExportToFile exports chunks to a file
type FigureChunk ¶
type FigureChunk struct {
// Image is the source image (if available)
Image *model.Image
// Caption is the associated caption text
Caption string
// HasCaption indicates if a caption was found
HasCaption bool
// AltText is alternative text for the image
AltText string
// Description is a generated description
Description string
// Format is the image format
Format string
// PageNumber is the source page
PageNumber int
}
FigureChunk represents a figure/image as a chunk
func (*FigureChunk) ToChunk ¶
func (fc *FigureChunk) ToChunk(chunkIndex int) *Chunk
ToChunk converts a FigureChunk to a generic Chunk
type ListBlock ¶
type ListBlock struct {
// Type is the kind of list
Type ListType
// IntroText is the introductory paragraph (if any)
IntroText string
// HasIntro indicates if there's an introductory paragraph
HasIntro bool
// Items are the list items
Items []*ListItem
// MaxLevel is the deepest nesting level
MaxLevel int
// TotalItems is the total count including nested items
TotalItems int
// IsComplete indicates if the list is complete
IsComplete bool
}
ListBlock represents a complete list with its context
type ListCoherenceAnalyzer ¶
type ListCoherenceAnalyzer struct {
// contains filtered or unexported fields
}
ListCoherenceAnalyzer analyzes and manages list coherence
func NewListCoherenceAnalyzer ¶
func NewListCoherenceAnalyzer() *ListCoherenceAnalyzer
NewListCoherenceAnalyzer creates a new analyzer with default config
func NewListCoherenceAnalyzerWithConfig ¶
func NewListCoherenceAnalyzerWithConfig(config ListCoherenceConfig) *ListCoherenceAnalyzer
NewListCoherenceAnalyzerWithConfig creates an analyzer with custom config
func (*ListCoherenceAnalyzer) AnalyzeListBlock ¶
func (a *ListCoherenceAnalyzer) AnalyzeListBlock(listText string, precedingText string) *ListBlock
AnalyzeListBlock creates a complete ListBlock from text
func (*ListCoherenceAnalyzer) AnalyzeListCoherence ¶
func (a *ListCoherenceAnalyzer) AnalyzeListCoherence(blocks []ContentBlock) *ListCoherenceResult
AnalyzeListCoherence analyzes list coherence in a sequence of text blocks
func (*ListCoherenceAnalyzer) DetectListType ¶
func (a *ListCoherenceAnalyzer) DetectListType(text string) ListType
DetectListType identifies the type of list from its content
func (*ListCoherenceAnalyzer) FindListSplitPoints ¶
func (a *ListCoherenceAnalyzer) FindListSplitPoints(block *ListBlock) []int
FindListSplitPoints finds safe points to split a large list
func (*ListCoherenceAnalyzer) FormatListBlock ¶
func (a *ListCoherenceAnalyzer) FormatListBlock(block *ListBlock, preserveMarkers bool) string
FormatListBlock formats a list block back to text
func (*ListCoherenceAnalyzer) IsListIntro ¶
func (a *ListCoherenceAnalyzer) IsListIntro(text string) bool
IsListIntro checks if text appears to introduce a list
func (*ListCoherenceAnalyzer) ParseListItems ¶
func (a *ListCoherenceAnalyzer) ParseListItems(text string) []*ListItem
ParseListItems extracts structured list items from text
func (*ListCoherenceAnalyzer) ShouldKeepListTogether ¶
func (a *ListCoherenceAnalyzer) ShouldKeepListTogether(block *ListBlock) bool
ShouldKeepListTogether determines if a list should be kept as one chunk
func (*ListCoherenceAnalyzer) SplitListBlock ¶
func (a *ListCoherenceAnalyzer) SplitListBlock(block *ListBlock, atIndex int) (*ListBlock, *ListBlock)
SplitListBlock splits a list at the specified item index
type ListCoherenceConfig ¶
type ListCoherenceConfig struct {
// KeepIntroWithList keeps introductory text with the list
KeepIntroWithList bool
// MaxIntroDistance is max chars between intro and list
MaxIntroDistance int
// PreserveNesting keeps nested lists together
PreserveNesting bool
// MaxListSize is max chars for a list before considering split
MaxListSize int
// MinItemsBeforeSplit is minimum items to have before splitting
MinItemsBeforeSplit int
// AllowSplitAtLevel allows splitting only at this nesting level or higher
AllowSplitAtLevel int
// IntroPatterns are patterns that detect list introductions
IntroPatterns []*regexp.Regexp
}
ListCoherenceConfig holds configuration for list coherence
func DefaultListCoherenceConfig ¶
func DefaultListCoherenceConfig() ListCoherenceConfig
DefaultListCoherenceConfig returns sensible defaults
type ListCoherenceResult ¶
type ListCoherenceResult struct {
// Blocks are the identified list blocks
Blocks []*ListBlock
// IntroOrphans are introductions without following lists
IntroOrphans []string
// TotalLists is the number of lists found
TotalLists int
// ListsWithIntros is the number of lists with introductions
ListsWithIntros int
// NestedLists is the number of lists with nesting
NestedLists int
}
ListCoherenceResult holds the result of list coherence analysis
type ListItem ¶
type ListItem struct {
// Text is the item content
Text string
// Marker is the bullet/number (e.g., "•", "1.", "a)")
Marker string
// Level is the nesting level (0 = top level)
Level int
// Index is the position in the list
Index int
// Children are nested list items
Children []*ListItem
// IsComplete indicates if the item text is complete
IsComplete bool
}
ListItem represents a single item in a list
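Since ListItem is recursive through Children, aggregate values such as ListBlock.TotalItems ("total count including nested items") and ListBlock.MaxLevel can be derived by walking the tree. The sketch below uses a local `listItem` stand-in; `countItems`, `maxLevel`, and `sampleList` are hypothetical helpers, and the package may compute these values differently.

```go
package main

import "fmt"

// listItem mirrors the documented ListItem fields needed for traversal.
type listItem struct {
	Text     string
	Marker   string
	Level    int // 0 = top level
	Children []*listItem
}

// countItems returns the number of items including all nested items.
func countItems(items []*listItem) int {
	n := 0
	for _, it := range items {
		n += 1 + countItems(it.Children)
	}
	return n
}

// maxLevel returns the deepest nesting level found in the tree.
func maxLevel(items []*listItem) int {
	deepest := 0
	for _, it := range items {
		if it.Level > deepest {
			deepest = it.Level
		}
		if d := maxLevel(it.Children); d > deepest {
			deepest = d
		}
	}
	return deepest
}

// sampleList builds a two-item list whose first item has two nested items.
func sampleList() []*listItem {
	return []*listItem{
		{Text: "Install", Marker: "1.", Level: 0, Children: []*listItem{
			{Text: "Download", Marker: "a)", Level: 1},
			{Text: "Unpack", Marker: "b)", Level: 1},
		}},
		{Text: "Configure", Marker: "2.", Level: 0},
	}
}

func main() {
	items := sampleList()
	fmt.Println(countItems(items), maxLevel(items)) // 4 1
}
```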
type MarkdownOptions ¶
type MarkdownOptions struct {
// IncludeMetadata adds metadata comments at the start
IncludeMetadata bool
// IncludeTableOfContents generates a TOC from section headings
IncludeTableOfContents bool
// IncludeChunkSeparators adds horizontal rules between chunks
IncludeChunkSeparators bool
// IncludePageNumbers adds page references
IncludePageNumbers bool
// IncludeChunkIDs adds chunk IDs as HTML comments
IncludeChunkIDs bool
// HeadingLevelOffset adjusts heading levels (e.g., 1 makes H1 -> H2)
HeadingLevelOffset int
// MaxHeadingLevel caps heading depth (default: 6)
MaxHeadingLevel int
// SectionSeparator is text between major sections (default: "\n\n---\n\n")
SectionSeparator string
}
MarkdownOptions configures markdown output generation
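HeadingLevelOffset and MaxHeadingLevel interact: the offset shifts a heading deeper, and the cap limits how deep it can go. A minimal sketch of that arithmetic follows. `headingPrefix` is a hypothetical helper, and treating a non-positive cap as the documented default of 6 is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// headingPrefix returns the ATX marker for a heading after applying the
// offset and the depth cap.
func headingPrefix(level, offset, maxLevel int) string {
	if maxLevel <= 0 {
		maxLevel = 6 // documented default for MaxHeadingLevel
	}
	adjusted := level + offset
	if adjusted < 1 {
		adjusted = 1
	}
	if adjusted > maxLevel {
		adjusted = maxLevel
	}
	return strings.Repeat("#", adjusted)
}

func main() {
	fmt.Println(headingPrefix(1, 1, 6)) // ## (an offset of 1 makes H1 -> H2)
	fmt.Println(headingPrefix(6, 1, 6)) // ###### (capped at the maximum)
}
```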
func DefaultMarkdownOptions ¶
func DefaultMarkdownOptions() MarkdownOptions
DefaultMarkdownOptions returns sensible defaults for markdown generation
func RAGOptimizedMarkdownOptions ¶
func RAGOptimizedMarkdownOptions() MarkdownOptions
RAGOptimizedMarkdownOptions returns options optimized for RAG ingestion
type MetadataConfig ¶
type MetadataConfig struct {
// ContextFormat determines how context is added to chunk text
ContextFormat ContextFormat
// IncludeDocumentTitle includes document title in context
IncludeDocumentTitle bool
// IncludePageNumbers includes page numbers in context
IncludePageNumbers bool
// IncludeSectionPath includes full section path (not just title)
IncludeSectionPath bool
// WordsPerMinute for reading time estimation (default: 200)
WordsPerMinute int
}
MetadataConfig holds configuration for metadata handling
func DefaultMetadataConfig ¶
func DefaultMetadataConfig() MetadataConfig
DefaultMetadataConfig returns sensible defaults
type OrphanedContentDetector ¶
type OrphanedContentDetector struct {
// MinOrphanSize is the minimum size for standalone content
MinOrphanSize int
}
OrphanedContentDetector helps avoid creating orphaned content at chunk boundaries
func NewOrphanedContentDetector ¶
func NewOrphanedContentDetector(minSize int) *OrphanedContentDetector
NewOrphanedContentDetector creates a new orphan detector
func (*OrphanedContentDetector) AdjustForOrphans ¶
func (o *OrphanedContentDetector) AdjustForOrphans(text string, position int, boundaries []Boundary) int
AdjustForOrphans adjusts a split position to avoid orphaned content
func (*OrphanedContentDetector) WouldCreateOrphan ¶
func (o *OrphanedContentDetector) WouldCreateOrphan(text string, position int) bool
WouldCreateOrphan checks if splitting at position would create orphaned content
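One plausible reading of the orphan check is that a split is rejected when either side would be smaller than MinOrphanSize. The sketch below implements that reading. It is an assumption about the logic, not the package's actual algorithm; `wouldCreateOrphan` is a hypothetical free function, and positions are taken to be byte offsets.

```go
package main

import (
	"fmt"
	"strings"
)

// wouldCreateOrphan reports whether splitting text at position (a byte
// offset) leaves a fragment shorter than minOrphanSize on either side,
// ignoring surrounding whitespace.
func wouldCreateOrphan(text string, position, minOrphanSize int) bool {
	if position <= 0 || position >= len(text) {
		return false // no split happens at the edges
	}
	before := len(strings.TrimSpace(text[:position]))
	after := len(strings.TrimSpace(text[position:]))
	return before < minOrphanSize || after < minOrphanSize
}

func main() {
	text := "Intro. This is a much longer body of text."
	fmt.Println(wouldCreateOrphan(text, 6, 10))  // true: "Intro." is only 6 bytes
	fmt.Println(wouldCreateOrphan(text, 20, 10)) // false: both sides have at least 10 bytes
}
```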
type OverlapConfig ¶
type OverlapConfig struct {
// Strategy determines how overlap is computed
Strategy OverlapStrategy
// Size is the target overlap size in characters (for character-based)
// or number of sentences/paragraphs (for sentence/paragraph-based)
Size int
// MinOverlap is the minimum overlap to include (avoids tiny overlaps)
MinOverlap int
// MaxOverlap is the maximum overlap allowed (prevents excessive duplication)
MaxOverlap int
// PreserveWords ensures character overlap doesn't break words
PreserveWords bool
// IncludeHeadingContext includes section heading in overlap for context
IncludeHeadingContext bool
}
OverlapConfig holds configuration for chunk overlap
func DefaultOverlapConfig ¶
func DefaultOverlapConfig() OverlapConfig
DefaultOverlapConfig returns sensible defaults for overlap
type OverlapGenerator ¶
type OverlapGenerator struct {
// contains filtered or unexported fields
}
OverlapGenerator generates overlap content between chunks
func NewOverlapGenerator ¶
func NewOverlapGenerator() *OverlapGenerator
NewOverlapGenerator creates a new overlap generator with default configuration
func NewOverlapGeneratorWithConfig ¶
func NewOverlapGeneratorWithConfig(config OverlapConfig) *OverlapGenerator
NewOverlapGeneratorWithConfig creates an overlap generator with custom configuration
func (*OverlapGenerator) GenerateOverlap ¶
func (og *OverlapGenerator) GenerateOverlap(chunkText string) *OverlapResult
GenerateOverlap extracts overlap content from the end of a chunk
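For the sentence-based strategy, the idea of taking overlap from the end of a chunk can be sketched as follows. The sentence splitter here is deliberately naive (it breaks on `.`, `!`, and `?`, so abbreviations and decimals are mishandled), and both `splitSentences` and `sentenceOverlap` are hypothetical helpers rather than the package's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSentences breaks text after '.', '!' or '?'. This is a naive
// approximation and does not handle abbreviations or decimal numbers.
func splitSentences(text string) []string {
	var sentences []string
	start := 0
	for i, r := range text {
		if r == '.' || r == '!' || r == '?' {
			end := i + 1 // these terminators are single-byte runes
			if s := strings.TrimSpace(text[start:end]); s != "" {
				sentences = append(sentences, s)
			}
			start = end
		}
	}
	if tail := strings.TrimSpace(text[start:]); tail != "" {
		sentences = append(sentences, tail)
	}
	return sentences
}

// sentenceOverlap returns the last n sentences of chunkText, joined by a
// space, as the text to prepend to the next chunk.
func sentenceOverlap(chunkText string, n int) string {
	sentences := splitSentences(chunkText)
	if n <= 0 || len(sentences) == 0 {
		return ""
	}
	if n > len(sentences) {
		n = len(sentences)
	}
	return strings.Join(sentences[len(sentences)-n:], " ")
}

func main() {
	fmt.Println(sentenceOverlap("One. Two! Three?", 2)) // Two! Three?
}
```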
type OverlapResult ¶
type OverlapResult struct {
// Text is the overlap content to prepend to the next chunk
Text string
// CharCount is the number of characters in the overlap
CharCount int
// SentenceCount is the number of complete sentences in the overlap
SentenceCount int
// Strategy is the strategy that was used
Strategy OverlapStrategy
}
OverlapResult contains the computed overlap text and metadata
type OverlapStats ¶
type OverlapStats struct {
// TotalOverlapChars is the total characters in overlap regions
TotalOverlapChars int
// AvgOverlapChars is the average overlap size
AvgOverlapChars int
// ChunksWithOverlap is the number of chunks that have overlap
ChunksWithOverlap int
// OverlapStrategy is the strategy used
OverlapStrategy OverlapStrategy
}
OverlapStats contains statistics about overlap in chunks
type OverlapStrategy ¶
type OverlapStrategy int
OverlapStrategy defines how overlap between chunks is computed
const (
// OverlapNone disables overlap between chunks
OverlapNone OverlapStrategy = iota
// OverlapCharacter uses character-based overlap (simple but can break words/sentences)
OverlapCharacter
// OverlapSentence uses sentence-based overlap (preserves complete sentences)
OverlapSentence
// OverlapParagraph uses paragraph-based overlap (preserves complete paragraphs)
OverlapParagraph
)
func (OverlapStrategy) String ¶
func (os OverlapStrategy) String() string
String returns a human-readable representation of the overlap strategy
type Section ¶
type Section struct {
// Heading is the section heading (nil for content before first heading)
Heading *model.HeadingInfo
// HeadingLevel is the heading level (0 if no heading)
HeadingLevel int
// Title is the section title
Title string
// Path is the hierarchical path of parent section titles
Path []string
// Content is the text content of this section
Content []ContentElement
// PageStart is the starting page (1-indexed)
PageStart int
// PageEnd is the ending page (1-indexed)
PageEnd int
// Children are nested subsections
Children []*Section
// Parent is the parent section (nil for top-level)
Parent *Section
}
Section represents a document section defined by a heading
type SizeAction ¶
type SizeAction int
SizeAction suggests what action to take for size issues
const (
// SizeActionNone - no action needed
SizeActionNone SizeAction = iota
// SizeActionSplit - chunk should be split
SizeActionSplit
// SizeActionMerge - chunk should be merged with neighbor
SizeActionMerge
// SizeActionTruncate - chunk must be truncated (hard limit exceeded)
SizeActionTruncate
)
func (SizeAction) String ¶
func (sa SizeAction) String() string
String returns a human-readable representation of the size action
type SizeCalculator ¶
type SizeCalculator struct {
// contains filtered or unexported fields
}
SizeCalculator calculates various size metrics for text
func NewSizeCalculator ¶
func NewSizeCalculator() *SizeCalculator
NewSizeCalculator creates a new size calculator with default config
func NewSizeCalculatorWithConfig ¶
func NewSizeCalculatorWithConfig(config SizeConfig) *SizeCalculator
NewSizeCalculatorWithConfig creates a size calculator with custom config
func (*SizeCalculator) Calculate ¶
func (sc *SizeCalculator) Calculate(text string) SizeMetrics
Calculate computes all size metrics for the given text
func (*SizeCalculator) Check ¶
func (sc *SizeCalculator) Check(text string) SizeCheckResult
Check performs a comprehensive size check on the text
func (*SizeCalculator) EstimateTokens ¶
func (sc *SizeCalculator) EstimateTokens(text string) int
EstimateTokens estimates token count using the configured ratio
func (*SizeCalculator) ExceedsLimit ¶
func (sc *SizeCalculator) ExceedsLimit(text string, limit SizeLimit) bool
ExceedsLimit checks if text exceeds a specific limit
func (*SizeCalculator) FindSplitPoint ¶
func (sc *SizeCalculator) FindSplitPoint(text string, boundaries []Boundary) int
FindSplitPoint finds the best position to split text to meet size constraints
func (*SizeCalculator) FindSplitPointAt ¶
func (sc *SizeCalculator) FindSplitPointAt(text string, boundaries []Boundary, targetSize int, targetUnit SizeUnit) int
FindSplitPointAt finds the best position to split text at a specific size limit
func (*SizeCalculator) GetSize ¶
func (sc *SizeCalculator) GetSize(text string, unit SizeUnit) int
GetSize returns the size in the specified unit
func (*SizeCalculator) IsAboveMax ¶
func (sc *SizeCalculator) IsAboveMax(text string) bool
IsAboveMax checks if text exceeds maximum size
func (*SizeCalculator) IsBelowMin ¶
func (sc *SizeCalculator) IsBelowMin(text string) bool
IsBelowMin checks if text is below minimum size
func (*SizeCalculator) IsWithinTarget ¶
func (sc *SizeCalculator) IsWithinTarget(text string) bool
IsWithinTarget checks if size is within target range
func (*SizeCalculator) SplitToSize ¶
func (sc *SizeCalculator) SplitToSize(text string, boundaries []Boundary) []string
SplitToSize splits text into chunks that meet size constraints
type SizeCheckResult ¶
type SizeCheckResult struct {
// Metrics are the calculated size metrics
Metrics SizeMetrics
// IsValid indicates if the size is acceptable
IsValid bool
// Reason explains why the size is not valid (if applicable)
Reason string
// SuggestedAction suggests what to do if size is not valid
SuggestedAction SizeAction
// TargetDiff is the difference from target size
TargetDiff int
}
SizeCheckResult contains the result of a size check
type SizeConfig ¶
type SizeConfig struct {
// Target is the ideal chunk size to aim for
Target SizeLimit
// Min is the minimum chunk size
Min SizeLimit
// Max is the maximum chunk size
Max SizeLimit
// TokensPerChar is the ratio of tokens to characters (default: 0.25)
// Used for token estimation
TokensPerChar float64
// AllowExceedForAtomicContent allows exceeding max for tables/lists
AllowExceedForAtomicContent bool
// MergeSmallChunks merges chunks below min with neighbors
MergeSmallChunks bool
// SplitAtSemanticBoundaries prefers semantic boundaries over exact sizes
SplitAtSemanticBoundaries bool
}
SizeConfig holds comprehensive size configuration for chunking
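The TokensPerChar ratio gives a cheap token estimate: with the default of 0.25, roughly four characters count as one token. The sketch below shows that arithmetic. `estimateTokens` is a hypothetical helper, and two details are assumptions: characters are counted as Unicode code points, and the result is rounded up.

```go
package main

import (
	"fmt"
	"math"
	"unicode/utf8"
)

// estimateTokens approximates the token count of text as
// characters * tokensPerChar, rounded up.
func estimateTokens(text string, tokensPerChar float64) int {
	if tokensPerChar <= 0 {
		tokensPerChar = 0.25 // documented default ratio
	}
	chars := utf8.RuneCountInString(text)
	return int(math.Ceil(float64(chars) * tokensPerChar))
}

func main() {
	fmt.Println(estimateTokens("abcdefgh", 0.25))  // 2
	fmt.Println(estimateTokens("abcdefghi", 0.25)) // 3
}
```

An estimate like this is only a heuristic. Real tokenizers vary by model and language, so hard limits such as an embedding model's maximum input are safer with some headroom.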
func ClaudeContextConfig ¶
func ClaudeContextConfig() SizeConfig
ClaudeContextConfig returns config for Claude's context window
func CohereEmbeddingConfig ¶
func CohereEmbeddingConfig() SizeConfig
CohereEmbeddingConfig returns config optimized for Cohere embeddings
func DefaultSizeConfig ¶
func DefaultSizeConfig() SizeConfig
DefaultSizeConfig returns sensible defaults for size configuration
func LargeChunkConfig ¶
func LargeChunkConfig() SizeConfig
LargeChunkConfig returns config for large chunks (good for context)
func MediumChunkConfig ¶
func MediumChunkConfig() SizeConfig
MediumChunkConfig returns config for medium chunks (balanced)
func OpenAIEmbeddingConfig ¶
func OpenAIEmbeddingConfig() SizeConfig
OpenAIEmbeddingConfig returns config optimized for OpenAI embeddings (8191 tokens max)
func SemanticSizeConfig ¶
func SemanticSizeConfig(targetParagraphs, maxParagraphs int) SizeConfig
SemanticSizeConfig returns configuration for semantic unit-based chunking
func SmallChunkConfig ¶
func SmallChunkConfig() SizeConfig
SmallChunkConfig returns config for small chunks (good for precise retrieval)
func TokenBasedSizeConfig ¶
func TokenBasedSizeConfig(targetTokens, maxTokens int) SizeConfig
TokenBasedSizeConfig returns configuration optimized for token-based chunking
type SizeLimit ¶
type SizeLimit struct {
// Value is the limit value
Value int
// Unit is the unit of measurement
Unit SizeUnit
// Type determines if this is a soft or hard limit
Type LimitType
}
SizeLimit represents a size limit with its type and value
type SizeMetrics ¶
SizeMetrics holds all size measurements for a piece of text
func (SizeMetrics) GetByUnit ¶
func (m SizeMetrics) GetByUnit(unit SizeUnit) int
GetByUnit returns the metric value for the specified unit
type SizeUnit ¶
type SizeUnit int
SizeUnit defines the unit of measurement for chunk sizes
const (
// SizeUnitCharacters measures size in characters
SizeUnitCharacters SizeUnit = iota
// SizeUnitTokens measures size in estimated tokens (chars/4)
SizeUnitTokens
// SizeUnitWords measures size in words
SizeUnitWords
// SizeUnitSentences measures size in sentences
SizeUnitSentences
// SizeUnitParagraphs measures size in paragraphs
SizeUnitParagraphs
)
type StreamExporter ¶
type StreamExporter struct {
// contains filtered or unexported fields
}
StreamExporter handles streaming export for very large collections
func NewStreamExporter ¶
func NewStreamExporter(w io.Writer) *StreamExporter
NewStreamExporter creates a new stream exporter
func NewStreamExporterWithConfig ¶
func NewStreamExporterWithConfig(w io.Writer, config ExportConfig) *StreamExporter
NewStreamExporterWithConfig creates a stream exporter with custom config
func (*StreamExporter) Close ¶
func (se *StreamExporter) Close() error
Close finalizes the stream export
func (*StreamExporter) WriteChunk ¶
func (se *StreamExporter) WriteChunk(chunk *Chunk, index int) error
WriteChunk writes a single chunk to the stream
type TableChunk ¶
type TableChunk struct {
// Table is the source table
Table *model.Table
// Caption is the associated caption text
Caption string
// HasCaption indicates if a caption was found
HasCaption bool
// FormattedText is the table rendered as text
FormattedText string
// Summary is a brief description of the table
Summary string
// RowCount is the number of rows
RowCount int
// ColCount is the number of columns
ColCount int
// Headers are the column headers (if detected)
Headers []string
// IsSplit indicates if this is part of a split table
IsSplit bool
// SplitIndex is the index of this part (0-based)
SplitIndex int
// TotalSplits is the total number of parts
TotalSplits int
// PageNumber is the source page
PageNumber int
}
TableChunk represents a table as a chunk
func (*TableChunk) ToChunk ¶
func (tc *TableChunk) ToChunk(chunkIndex int) *Chunk
ToChunk converts a TableChunk to a generic Chunk
type TableFigureConfig ¶
type TableFigureConfig struct {
// TableFormat determines how tables are rendered in chunks
TableFormat TableFormat
// MaxTableSize is the maximum characters for a table before considering split
MaxTableSize int
// MaxTableRows is the maximum rows before considering split
MaxTableRows int
// SplitLargeTables allows splitting tables that exceed limits
SplitLargeTables bool
// IncludeTableCaption includes detected captions with tables
IncludeTableCaption bool
// IncludeFigureCaption includes detected captions with figures
IncludeFigureCaption bool
// CaptionSearchDistance is max chars to search for caption
CaptionSearchDistance int
// IncludeTableSummary adds a brief summary of table dimensions
IncludeTableSummary bool
// IncludeFigureAltText includes alt text for figures
IncludeFigureAltText bool
// PreserveTableStructure keeps structural info for RAG
PreserveTableStructure bool
}
TableFigureConfig holds configuration for table and figure chunking
func DefaultTableFigureConfig ¶
func DefaultTableFigureConfig() TableFigureConfig
DefaultTableFigureConfig returns sensible defaults
type TableFigureHandler ¶
type TableFigureHandler struct {
// contains filtered or unexported fields
}
TableFigureHandler handles table and figure chunking
func NewTableFigureHandler ¶
func NewTableFigureHandler() *TableFigureHandler
NewTableFigureHandler creates a new handler with default config
func NewTableFigureHandlerWithConfig ¶
func NewTableFigureHandlerWithConfig(config TableFigureConfig) *TableFigureHandler
NewTableFigureHandlerWithConfig creates a handler with custom config
func (*TableFigureHandler) ProcessBlocks ¶
func (h *TableFigureHandler) ProcessBlocks(blocks []ContentBlock) *TableFigureResult
ProcessBlocks processes content blocks to extract tables and figures
func (*TableFigureHandler) ProcessFigure ¶
func (h *TableFigureHandler) ProcessFigure(image *model.Image, caption string, pageNumber int) *FigureChunk
ProcessFigure converts a figure/image to a chunk
func (*TableFigureHandler) ProcessTable ¶
func (h *TableFigureHandler) ProcessTable(table *model.Table, caption string, pageNumber int) []*TableChunk
ProcessTable converts a table to one or more chunks
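When a table exceeds MaxTableRows, ProcessTable may return several chunks, each carrying IsSplit, SplitIndex, and TotalSplits. A rough sketch of row-based splitting is shown below. It assumes headers are repeated in every part, which the documentation does not state; `tablePart`, `splitRows`, and `sampleRows` are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// tablePart mirrors the split-related fields of the documented TableChunk.
type tablePart struct {
	Headers     []string
	Rows        [][]string
	IsSplit     bool
	SplitIndex  int // 0-based
	TotalSplits int
}

// splitRows divides rows into parts of at most maxRows rows. Headers are
// repeated in every part (an assumption).
func splitRows(headers []string, rows [][]string, maxRows int) []tablePart {
	if maxRows <= 0 || len(rows) <= maxRows {
		return []tablePart{{Headers: headers, Rows: rows, TotalSplits: 1}}
	}
	total := (len(rows) + maxRows - 1) / maxRows // ceiling division
	parts := make([]tablePart, 0, total)
	for i := 0; i < total; i++ {
		start := i * maxRows
		end := start + maxRows
		if end > len(rows) {
			end = len(rows)
		}
		parts = append(parts, tablePart{
			Headers:     headers,
			Rows:        rows[start:end],
			IsSplit:     true,
			SplitIndex:  i,
			TotalSplits: total,
		})
	}
	return parts
}

// sampleRows builds n single-column rows for demonstration.
func sampleRows(n int) [][]string {
	rows := make([][]string, n)
	for i := range rows {
		rows[i] = []string{strconv.Itoa(i)}
	}
	return rows
}

func main() {
	for _, p := range splitRows([]string{"id"}, sampleRows(5), 2) {
		fmt.Printf("part %d of %d: %d rows\n", p.SplitIndex+1, p.TotalSplits, len(p.Rows))
	}
	// Prints:
	// part 1 of 3: 2 rows
	// part 2 of 3: 2 rows
	// part 3 of 3: 1 rows
}
```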
type TableFigureResult ¶
type TableFigureResult struct {
// TableChunks are the processed table chunks
TableChunks []*TableChunk
// FigureChunks are the processed figure chunks
FigureChunks []*FigureChunk
// Stats contains processing statistics
Stats TableFigureStats
}
TableFigureResult holds the result of processing tables and figures
type TableFigureStats ¶
type TableFigureStats struct {
TotalTables int
TotalFigures int
TablesWithCaption int
FiguresWithCaption int
SplitTables int
TotalTableRows int
TotalTableCols int
}
TableFigureStats contains statistics about table/figure processing
type TableFormat ¶
type TableFormat int
TableFormat defines how tables are formatted in chunks
const (
// TableFormatPlainText formats table as tab-separated text
TableFormatPlainText TableFormat = iota
// TableFormatMarkdown formats table as markdown
TableFormatMarkdown
// TableFormatCSV formats table as CSV
TableFormatCSV
// TableFormatHTML formats table as HTML
TableFormatHTML
)
func (TableFormat) String ¶
func (tf TableFormat) String() string
String returns a human-readable representation of the table format
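To illustrate the difference between two of the table formats, the sketch below renders the same table as tab-separated text and as a Markdown table. `renderPlainText` and `renderMarkdown` are hypothetical helpers; the package's exact output (padding, alignment, escaping of `|` inside cells) may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// renderPlainText writes the header and rows as tab-separated lines.
func renderPlainText(headers []string, rows [][]string) string {
	var b strings.Builder
	b.WriteString(strings.Join(headers, "\t") + "\n")
	for _, r := range rows {
		b.WriteString(strings.Join(r, "\t") + "\n")
	}
	return b.String()
}

// renderMarkdown writes a pipe table with a separator row after the header.
// Cells are not escaped, so a literal '|' in a cell would break the layout.
func renderMarkdown(headers []string, rows [][]string) string {
	var b strings.Builder
	b.WriteString("| " + strings.Join(headers, " | ") + " |\n")
	sep := make([]string, len(headers))
	for i := range sep {
		sep[i] = "---"
	}
	b.WriteString("| " + strings.Join(sep, " | ") + " |\n")
	for _, r := range rows {
		b.WriteString("| " + strings.Join(r, " | ") + " |\n")
	}
	return b.String()
}

func main() {
	headers := []string{"a", "b"}
	rows := [][]string{{"1", "2"}}
	fmt.Print(renderPlainText(headers, rows))
	fmt.Print(renderMarkdown(headers, rows))
}
```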