Documentation
Overview ¶
Package chunking provides AST-aware text chunking for code files. It uses Go's native go/ast parser for Go files and regex-based boundary detection for Python, JavaScript, TypeScript, Java, C/C++, and Rust.
AST-aware chunking ensures semantic units (functions, classes, methods) are not split mid-definition, producing higher-quality chunks for code RAG.
Every chunk carries byte-level provenance: exact source location metadata so search results trace back to the precise file, byte offset, and line number they originated from. This enables debuggable, explainable RAG pipelines.
When tree-sitter support is not compiled in, a stub implementation is used and all non-Go languages fall back to regex-based boundary detection. To enable tree-sitter:
go build -tags treesitter
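The Go path can be sketched with the standard library alone. The helper below (funcSpans is illustrative, not part of this package) shows the AST positions an AST-aware chunker would cut at, so a function body is never split mid-definition:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// funcSpans parses Go source and returns one "name:start-end" entry per
// top-level function, using the declaration boundaries an AST-aware
// chunker would respect.
func funcSpans(src string) []string {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "src.go", src, 0)
	if err != nil {
		return nil
	}
	var spans []string
	for _, decl := range file.Decls {
		if fn, ok := decl.(*ast.FuncDecl); ok {
			spans = append(spans, fmt.Sprintf("%s:%d-%d",
				fn.Name.Name,
				fset.Position(fn.Pos()).Line,
				fset.Position(fn.End()).Line))
		}
	}
	return spans
}

func main() {
	src := "package demo\n\nfunc Add(a, b int) int {\n\treturn a + b\n}\n"
	fmt.Println(funcSpans(src)) // [Add:3-5]
}
```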
Index ¶
- func EnrichChunkMetadata(meta map[string]any, prov ProvenanceMetadata) map[string]any
- func IsCodeFile(filename string) bool
- func IsCodeSourceFile(filename string) bool
- func IsMarkdownFile(filename string) bool
- func TreeSitterAvailable() bool
- type ASTChunker
- type ASTChunkerConfig
- type Chunk
- type CodeChunk
- type CodeChunker
- type DocumentMeta
- type Language
- type MarkdownChunk
- type MarkdownChunker
- type MarkdownSection
- type ProvenanceMetadata
- type SentenceSplitter
- type StructuredDocument
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func EnrichChunkMetadata ¶
func EnrichChunkMetadata(meta map[string]any, prov ProvenanceMetadata) map[string]any
EnrichChunkMetadata merges provenance info into a chunk's metadata map.
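A minimal sketch of such a merge, using a stand-in provenance struct and assumed key names (the actual keys written by EnrichChunkMetadata are not shown here):

```go
package main

import (
	"fmt"
	"time"
)

// provenance mirrors a subset of ProvenanceMetadata for illustration.
type provenance struct {
	Source    string
	StartLine int
	EndLine   int
	IndexedAt time.Time
}

// enrich copies provenance fields into a fresh copy of the chunk metadata
// map, leaving the caller's map untouched. Key names are assumptions.
func enrich(meta map[string]any, prov provenance) map[string]any {
	out := make(map[string]any, len(meta)+4)
	for k, v := range meta {
		out[k] = v
	}
	out["source"] = prov.Source
	out["start_line"] = prov.StartLine
	out["end_line"] = prov.EndLine
	out["indexed_at"] = prov.IndexedAt
	return out
}

func main() {
	meta := enrich(map[string]any{"lang": "go"},
		provenance{Source: "main.go", StartLine: 1, EndLine: 10})
	fmt.Println(meta["source"], meta["start_line"]) // main.go 1
}
```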
func IsCodeFile ¶
func IsCodeFile(filename string) bool
IsCodeFile checks if a filename looks like a code file.
func IsCodeSourceFile ¶
func IsCodeSourceFile(filename string) bool
IsCodeSourceFile returns true if the filename has a known code extension with full AST/regex support (subset of IsCodeFile).
func IsMarkdownFile ¶
func IsMarkdownFile(filename string) bool
IsMarkdownFile checks if a filename is a markdown file.
func TreeSitterAvailable ¶
func TreeSitterAvailable() bool
TreeSitterAvailable reports whether tree-sitter support is compiled in. Use this to check at runtime: if chunking.TreeSitterAvailable() { ... }
Types ¶
type ASTChunker ¶
type ASTChunker struct {
// contains filtered or unexported fields
}
ASTChunker splits code into semantic chunks using AST analysis.
func NewASTChunker ¶
func NewASTChunker(config ASTChunkerConfig) *ASTChunker
NewASTChunker creates a new AST-aware chunker.
func (*ASTChunker) Chunk ¶
func (c *ASTChunker) Chunk(text string) []string
Chunk implements the gleann Chunker interface; it splits text without attaching metadata.
func (*ASTChunker) ChunkCode ¶
func (c *ASTChunker) ChunkCode(source, filename string) []CodeChunk
ChunkCode splits source code into semantic chunks. It uses Go's native AST parser for Go files, and regex-based boundary detection for other supported languages. For unknown languages, it falls back to sliding window chunking.
func (*ASTChunker) ChunkWithMetadata ¶
func (c *ASTChunker) ChunkWithMetadata(text string, metadata map[string]any) []Chunk
ChunkWithMetadata implements the gleann Chunker interface with metadata.
type ASTChunkerConfig ¶
type ASTChunkerConfig struct {
// MaxChunkSize is the maximum number of characters per chunk.
MaxChunkSize int
// ChunkOverlap is the number of overlapping characters at boundaries.
ChunkOverlap int
// AddLineNumbers prepends line numbers to each line of the chunk.
AddLineNumbers bool
// ChunkExpansion prepends parent scope context as a header comment to each chunk.
// E.g. "// File: api.py | Scope: Calculator > add"
// This improves embedding quality by giving LLMs semantic context.
// Only effective when tree-sitter is enabled (-tags treesitter).
ChunkExpansion bool
}
ASTChunkerConfig holds configuration for the AST chunker.
func DefaultASTChunkerConfig ¶
func DefaultASTChunkerConfig() ASTChunkerConfig
DefaultASTChunkerConfig returns reasonable defaults.
type CodeChunk ¶
type CodeChunk struct {
Text string `json:"text"`
Metadata map[string]any `json:"metadata"`
StartLine int `json:"start_line"`
EndLine int `json:"end_line"`
NodeType string `json:"node_type"` // "function", "class", "method", "block", etc.
Name string `json:"name"` // e.g. function/class name
OutboundCalls []string `json:"outbound_calls,omitempty"`
}
CodeChunk represents a semantic code chunk with metadata.
type CodeChunker ¶
CodeChunker splits code files into logical chunks based on structure.
func NewCodeChunker ¶
func NewCodeChunker(chunkSize, chunkOverlap int) *CodeChunker
NewCodeChunker creates a new code chunker.
func (*CodeChunker) Chunk ¶
func (c *CodeChunker) Chunk(code string) []string
Chunk splits code into logical chunks (by function/class boundaries).
func (*CodeChunker) ChunkWithMetadata ¶
func (c *CodeChunker) ChunkWithMetadata(code string, metadata map[string]any) []Chunk
ChunkWithMetadata splits code and preserves metadata.
func (*CodeChunker) ChunkWithProvenance ¶
func (c *CodeChunker) ChunkWithProvenance(fullText string, baseMeta map[string]any) []Chunk
ChunkWithProvenance splits code and attaches byte-level provenance metadata to each chunk.
type DocumentMeta ¶
type DocumentMeta struct {
Title string `json:"title"`
Format string `json:"format"`
PageCount *int `json:"page_count,omitempty"`
WordCount int `json:"word_count"`
Summary string `json:"summary"`
}
DocumentMeta holds top-level document metadata from the plugin.
type Language ¶
type Language string
Language represents a supported programming language.
const (
	LangGo         Language = "go"
	LangPython     Language = "python"
	LangJavaScript Language = "javascript"
	LangTypeScript Language = "typescript"
	LangJava       Language = "java"
	LangC          Language = "c"
	LangCPP        Language = "cpp"
	LangRust       Language = "rust"
	LangCSharp     Language = "csharp"
	LangRuby       Language = "ruby"
	LangPHP        Language = "php"
	LangKotlin     Language = "kotlin"
	LangScala      Language = "scala"
	LangSwift      Language = "swift"
	LangLua        Language = "lua"
	LangElixir     Language = "elixir"
	LangZig        Language = "zig"
	LangPowerShell Language = "powershell"
	LangJulia      Language = "julia"
	LangObjectiveC Language = "objc"
	LangVue        Language = "vue"
	LangSvelte     Language = "svelte"
	LangUnknown    Language = "unknown"
)
func DetectLanguage ¶
func DetectLanguage(filename string) Language
DetectLanguage determines the programming language from the filename.
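Extension-based detection can be sketched as a lookup table; this standalone version (detectLanguage, with a truncated extension map) is an assumption about the approach, not the package's actual implementation:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// detectLanguage maps a file extension to a language tag. The map below
// is a small illustrative subset of the supported languages.
func detectLanguage(filename string) string {
	exts := map[string]string{
		".go": "go", ".py": "python", ".js": "javascript",
		".ts": "typescript", ".java": "java", ".rs": "rust",
		".c": "c", ".cpp": "cpp",
	}
	ext := strings.ToLower(filepath.Ext(filename))
	if lang, ok := exts[ext]; ok {
		return lang
	}
	return "unknown"
}

func main() {
	fmt.Println(detectLanguage("server.py")) // python
}
```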
type MarkdownChunk ¶
type MarkdownChunk struct {
// Text is the chunk text WITH the context header prepended.
Text string
// RawText is the chunk text WITHOUT context header.
RawText string
// SectionID is the parent section ID (e.g. "s0.1").
SectionID string
// SectionPath is the full heading breadcrumb, e.g. ["Introduction", "Background"].
SectionPath []string
// HeadingLevel of the parent section (1-6).
HeadingLevel int
// DocTitle from the document metadata.
DocTitle string
// Metadata carries all indexing metadata for this chunk.
Metadata map[string]any
}
MarkdownChunk is a chunk with hierarchical context preserved.
type MarkdownChunker ¶
type MarkdownChunker struct {
ChunkSize int
ChunkOverlap int
// contains filtered or unexported fields
}
MarkdownChunker splits structured documents into context-aware chunks that preserve heading hierarchy as a breadcrumb context header.
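The breadcrumb context header can be sketched as joining the document title and the section's heading path; the exact separator and bracket format here are assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

// contextHeader builds a breadcrumb header from a document title and a
// section's heading path, the kind of context MarkdownChunker prepends
// to each chunk. The "[A > B > C]" format is illustrative.
func contextHeader(docTitle string, sectionPath []string) string {
	parts := append([]string{docTitle}, sectionPath...)
	return "[" + strings.Join(parts, " > ") + "]"
}

func main() {
	fmt.Println(contextHeader("Guide", []string{"Introduction", "Background"}))
	// [Guide > Introduction > Background]
}
```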
func NewMarkdownChunker ¶
func NewMarkdownChunker(chunkSize, chunkOverlap int) *MarkdownChunker
NewMarkdownChunker creates a new markdown-aware chunker.
func (*MarkdownChunker) ChunkDocument ¶
func (mc *MarkdownChunker) ChunkDocument(doc *StructuredDocument) []MarkdownChunk
ChunkDocument splits a structured document into context-aware chunks.
func (*MarkdownChunker) ChunkMarkdown ¶
func (mc *MarkdownChunker) ChunkMarkdown(markdown string, source string) []MarkdownChunk
ChunkMarkdown parses raw markdown by headings when structured JSON is unavailable (e.g. plain .md files on disk rather than plugin output). This is the fallback path.
type MarkdownSection ¶
type MarkdownSection struct {
ID string `json:"id"`
Heading string `json:"heading"`
Level int `json:"level"`
Content string `json:"content"`
Summary string `json:"summary"`
ParentID string `json:"parent_id,omitempty"`
Order int `json:"order"`
Children []string `json:"children,omitempty"`
}
MarkdownSection represents a heading-delimited section from structured JSON.
func ParseMarkdownHeadings ¶
func ParseMarkdownHeadings(markdown string) []MarkdownSection
ParseMarkdownHeadings extracts heading structure from raw markdown text. This is the Go-side parser for when structured JSON is not available (e.g. indexing a local .md file directly or parsing markitdown output).
Handles edge cases:
- Skips headings inside fenced code blocks (``` or ~~~)
- Skips YAML front matter (--- delimited at file start)
- Skips headings inside HTML comments (<!-- ... -->)
- Strips heading anchor tags ({#id}) and trailing hashes
- Supports setext-style headings (Title\n==== or Title\n----)
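A simplified sketch of the first edge case, ATX heading extraction that skips fenced code blocks (setext headings, front matter, and HTML comments are omitted here):

```go
package main

import (
	"fmt"
	"strings"
)

type heading struct {
	Level int
	Text  string
}

// parseHeadings extracts ATX headings (# through ######) while skipping
// lines inside fenced code blocks. A minimal sketch, not the package's
// full ParseMarkdownHeadings.
func parseHeadings(markdown string) []heading {
	var out []heading
	inFence := false
	for _, line := range strings.Split(markdown, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, "```") || strings.HasPrefix(trimmed, "~~~") {
			inFence = !inFence // toggle fenced-block state
			continue
		}
		if inFence || !strings.HasPrefix(trimmed, "#") {
			continue
		}
		level := 0
		for level < len(trimmed) && trimmed[level] == '#' {
			level++
		}
		// Require 1-6 hashes followed by a space, per CommonMark.
		if level > 6 || level == len(trimmed) || trimmed[level] != ' ' {
			continue
		}
		out = append(out, heading{Level: level, Text: strings.TrimSpace(trimmed[level:])})
	}
	return out
}

func main() {
	md := "# Title\n```\n# not a heading\n```\n## Section\n"
	fmt.Println(parseHeadings(md)) // [{1 Title} {2 Section}]
}
```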
type ProvenanceMetadata ¶
type ProvenanceMetadata struct {
// Source is the relative file path within the indexed directory.
Source string `json:"source"`
// StartByte is the inclusive byte offset in the source file.
StartByte int `json:"start_byte"`
// EndByte is the exclusive byte offset in the source file.
EndByte int `json:"end_byte"`
// StartLine is the 1-based line number of the first line in the chunk.
StartLine int `json:"start_line"`
// EndLine is the 1-based line number of the last line in the chunk.
EndLine int `json:"end_line"`
// ChunkIndex is the ordinal position within the source.
ChunkIndex int `json:"chunk_index"`
// TotalChunks is how many chunks were produced from this source.
TotalChunks int `json:"total_chunks"`
// ContentHash is the SHA-256 hex digest of the chunk text.
ContentHash string `json:"content_hash"`
// IndexedAt is when this chunk was indexed.
IndexedAt time.Time `json:"indexed_at"`
// EmbeddingModel is which model produced the vector for this chunk.
EmbeddingModel string `json:"embedding_model,omitempty"`
}
ProvenanceMetadata holds byte-level lineage information for a chunk. Embed this in chunk metadata so every search result carries its exact origin.
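The ContentHash field is documented as a SHA-256 hex digest of the chunk text, which can be computed with the standard library:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentHash returns the SHA-256 hex digest of a chunk's text, the
// fingerprint that ProvenanceMetadata.ContentHash documents.
func contentHash(text string) string {
	sum := sha256.Sum256([]byte(text))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(contentHash("func main() {}")[:12]) // first 12 hex chars
}
```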
func ComputeByteOffsets ¶
func ComputeByteOffsets(fullText string, chunks []string) []ProvenanceMetadata
ComputeByteOffsets calculates byte-level provenance for each chunk given the full source text and the chunk texts. It searches for each chunk in the full text sequentially to find byte offsets.
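The sequential search can be sketched as follows; this standalone helper (offsets, returning only [start,end) byte pairs) is an assumption about the approach and omits the line-number and hash fields the real function fills in:

```go
package main

import (
	"fmt"
	"strings"
)

// offsets finds each chunk's [start,end) byte range in the full text by
// searching forward from just past the previous match, so duplicate and
// overlapping chunks still resolve in order. Unfound chunks get {-1,-1}.
func offsets(fullText string, chunks []string) [][2]int {
	out := make([][2]int, len(chunks))
	pos := 0
	for i, chunk := range chunks {
		idx := strings.Index(fullText[pos:], chunk)
		if idx < 0 {
			out[i] = [2]int{-1, -1}
			continue
		}
		start := pos + idx
		out[i] = [2]int{start, start + len(chunk)}
		pos = start + 1 // advance minimally to tolerate chunk overlap
	}
	return out
}

func main() {
	full := "alpha beta gamma"
	fmt.Println(offsets(full, []string{"alpha", "gamma"})) // [[0 5] [11 16]]
}
```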
type SentenceSplitter ¶
SentenceSplitter splits text into chunks by sentences.
func NewSentenceSplitter ¶
func NewSentenceSplitter(chunkSize, chunkOverlap int) *SentenceSplitter
NewSentenceSplitter creates a new sentence splitter with the given parameters.
func (*SentenceSplitter) Chunk ¶
func (s *SentenceSplitter) Chunk(text string) []string
Chunk splits text into chunks.
func (*SentenceSplitter) ChunkWithMetadata ¶
func (s *SentenceSplitter) ChunkWithMetadata(text string, metadata map[string]any) []Chunk
ChunkWithMetadata splits text and preserves source metadata.
func (*SentenceSplitter) ChunkWithProvenance ¶
func (s *SentenceSplitter) ChunkWithProvenance(fullText string, baseMeta map[string]any) []Chunk
ChunkWithProvenance splits text and attaches byte-level provenance metadata.
type StructuredDocument ¶
type StructuredDocument struct {
Document DocumentMeta `json:"document"`
Sections []MarkdownSection `json:"sections"`
Markdown string `json:"markdown"`
}
StructuredDocument is the full response from the plugin /convert endpoint.