Documentation ¶
Overview ¶
Package chunk splits extracted text into overlapping chunks suitable for RAG (Retrieval-Augmented Generation) and full-text search indexing.
Splitting strategy:
- Split on paragraph boundaries first (double newline)
- If paragraphs exceed max tokens, split on sentence boundaries
- If sentences exceed max tokens, split on word boundaries
- Apply configurable overlap between consecutive chunks
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CountTokens ¶
CountTokens is the exported token-counting function, intended for callers outside the package.
func EstimateTokens ¶
EstimateTokens estimates the GPT-style token count of text. It uses a rough heuristic for English: about 0.75 words per token, or roughly 4 characters per token.
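To make the heuristic concrete, here is a small sketch of how such an estimate could be computed. The function name `estimateTokens` is hypothetical, and taking the larger of the word-based and character-based figures is an assumption; the documentation does not say how the package combines the two ratios.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"unicode/utf8"
)

// estimateTokens applies both documented ratios and returns the larger result.
// Assumption: combining them with max is a choice made for this sketch only.
func estimateTokens(text string) int {
	words := len(strings.Fields(text))
	if words == 0 {
		return 0
	}
	chars := utf8.RuneCountInString(text)
	byWords := int(math.Ceil(float64(words) / 0.75)) // ~0.75 words per token
	byChars := int(math.Ceil(float64(chars) / 4.0))  // ~4 characters per token
	if byWords > byChars {
		return byWords
	}
	return byChars
}

func main() {
	fmt.Println(estimateTokens("hello world"))
}
```

Rounding up keeps the estimate conservative, which is usually the safer direction when the count is used to stay under a model's context limit.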
Types ¶
type Chunk ¶
type Chunk struct {
	Index       int    // 0-based position in the sequence
	Text        string // chunk text content
	TokenCount  int    // approximate token count
	OverlapPrev int    // how many tokens overlap with the previous chunk
}
Chunk is one text fragment with metadata.
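The sketch below shows one way the fields could be populated when overlap is applied. It redeclares Chunk locally so it compiles on its own; `withOverlap` is a hypothetical helper, and counting one token per whitespace-separated word is an assumption, not the package's behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk mirrors the documented struct so the sketch is self-contained.
type Chunk struct {
	Index       int
	Text        string
	TokenCount  int
	OverlapPrev int
}

// withOverlap prepends the last `overlap` words of the previous piece to each
// piece after the first and records how many words were carried over.
func withOverlap(pieces []string, overlap int) []Chunk {
	if overlap < 0 {
		overlap = 0
	}
	chunks := make([]Chunk, 0, len(pieces))
	var prev []string // words of the previous piece, before overlap was added
	for i, piece := range pieces {
		words := strings.Fields(piece)
		carried := 0
		if i > 0 {
			carried = overlap
			if carried > len(prev) {
				carried = len(prev)
			}
		}
		merged := append(append([]string{}, prev[len(prev)-carried:]...), words...)
		chunks = append(chunks, Chunk{
			Index:       i,
			Text:        strings.Join(merged, " "),
			TokenCount:  len(merged),
			OverlapPrev: carried,
		})
		prev = words
	}
	return chunks
}

func main() {
	for _, c := range withOverlap([]string{"a b c d", "e f g"}, 2) {
		fmt.Printf("%+v\n", c)
	}
}
```

In this sketch the overlap is taken from the previous piece's own words, so overlap does not compound from chunk to chunk.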
type Options ¶
type Options struct {
	// MaxTokens is the maximum number of tokens per chunk. Default: 512.
	MaxTokens int
	// OverlapTokens is the number of tokens to overlap between chunks. Default: 64.
	OverlapTokens int
	// MinChunkTokens is the minimum chunk size; shorter chunks are merged. Default: 32.
	MinChunkTokens int
}
Options configures the chunking behaviour.
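As an illustration of the documented defaults, the sketch below fills unset fields with 512, 64, and 32. Options is redeclared locally so the example compiles on its own; `withDefaults` is a hypothetical helper, and treating a zero or negative value as "use the default" is an assumption about how the package interprets unset fields.

```go
package main

import "fmt"

// Options mirrors the documented struct so the sketch is self-contained.
type Options struct {
	MaxTokens      int
	OverlapTokens  int
	MinChunkTokens int
}

// withDefaults returns a copy of o with non-positive fields replaced by the
// documented defaults. Assumption: zero means "unset" in this sketch.
func withDefaults(o Options) Options {
	if o.MaxTokens <= 0 {
		o.MaxTokens = 512
	}
	if o.OverlapTokens <= 0 {
		o.OverlapTokens = 64
	}
	if o.MinChunkTokens <= 0 {
		o.MinChunkTokens = 32
	}
	return o
}

func main() {
	fmt.Printf("%+v\n", withDefaults(Options{MaxTokens: 256}))
}
```

Under this assumption a caller cannot request zero overlap by setting OverlapTokens to 0; whether the real package allows that is not stated in the documentation.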