chunking

v0.0.0-...-7845765
Published: May 12, 2026 License: MIT Imports: 10 Imported by: 0

Documentation

Overview

Package chunking provides AST-aware text chunking for code files. It uses Go's native go/ast parser for Go files and regex-based boundary detection for Python, JavaScript, TypeScript, Java, C/C++, and Rust.

AST-aware chunking ensures semantic units (functions, classes, methods) are not split mid-definition, producing higher quality chunks for code RAG.

Package chunking provides text splitting utilities for indexing.

Byte-level provenance (provenance.go): every chunk carries exact source-location metadata, so search results trace back to the precise file, byte offset, and line number they originated from. This enables debuggable, explainable RAG pipelines.

Stub implementation when tree-sitter is not available. All non-Go languages fall back to regex-based boundary detection.

To enable tree-sitter: go build -tags treesitter

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func EnrichChunkMetadata

func EnrichChunkMetadata(meta map[string]any, prov ProvenanceMetadata) map[string]any

EnrichChunkMetadata merges provenance info into a chunk's metadata map.

func IsCodeFile

func IsCodeFile(filename string) bool

IsCodeFile checks if a filename looks like a code file.

func IsCodeSourceFile

func IsCodeSourceFile(filename string) bool

IsCodeSourceFile returns true if the filename has a known code extension with full AST/regex support (subset of IsCodeFile).

func IsMarkdownFile

func IsMarkdownFile(filename string) bool

IsMarkdownFile checks if a filename is a markdown file.

func TreeSitterAvailable

func TreeSitterAvailable() bool

TreeSitterAvailable reports whether tree-sitter support is compiled in. Use this to check at runtime: if chunking.TreeSitterAvailable() { ... }

Types

type ASTChunker

type ASTChunker struct {
	// contains filtered or unexported fields
}

ASTChunker splits code into semantic chunks using AST analysis.

func NewASTChunker

func NewASTChunker(config ASTChunkerConfig) *ASTChunker

NewASTChunker creates a new AST-aware chunker.

func (*ASTChunker) Chunk

func (c *ASTChunker) Chunk(text string) []string

Chunk implements the gleann Chunker interface — splits text with no metadata.

func (*ASTChunker) ChunkCode

func (c *ASTChunker) ChunkCode(source, filename string) []CodeChunk

ChunkCode splits source code into semantic chunks. It uses Go's native AST parser for Go files, and regex-based boundary detection for other supported languages. For unknown languages, it falls back to sliding window chunking.

func (*ASTChunker) ChunkWithMetadata

func (c *ASTChunker) ChunkWithMetadata(text string, metadata map[string]any) []Chunk

ChunkWithMetadata implements the gleann Chunker interface with metadata.

type ASTChunkerConfig

type ASTChunkerConfig struct {
	// MaxChunkSize is the maximum number of characters per chunk.
	MaxChunkSize int

	// ChunkOverlap is the number of overlapping characters at boundaries.
	ChunkOverlap int

	// AddLineNumbers prepends line numbers to each line of the chunk.
	AddLineNumbers bool

	// ChunkExpansion prepends parent scope context as a header comment to each chunk.
	// E.g. "// File: api.py | Scope: Calculator > add"
	// This improves embedding quality by giving LLMs semantic context.
	// Only effective when tree-sitter is enabled (-tags treesitter).
	ChunkExpansion bool
}

ASTChunkerConfig holds configuration for the AST chunker.

func DefaultASTChunkerConfig

func DefaultASTChunkerConfig() ASTChunkerConfig

DefaultASTChunkerConfig returns reasonable defaults.

type Chunk

type Chunk struct {
	Text     string
	Metadata map[string]any
}

Chunk represents a piece of text and its associated metadata.

type CodeChunk

type CodeChunk struct {
	Text          string         `json:"text"`
	Metadata      map[string]any `json:"metadata"`
	StartLine     int            `json:"start_line"`
	EndLine       int            `json:"end_line"`
	NodeType      string         `json:"node_type"` // "function", "class", "method", "block", etc.
	Name          string         `json:"name"`      // e.g. function/class name
	OutboundCalls []string       `json:"outbound_calls,omitempty"`
}

CodeChunk represents a semantic code chunk with metadata.

type CodeChunker

type CodeChunker struct {
	ChunkSize    int
	ChunkOverlap int
}

CodeChunker splits code files into logical chunks based on structure.

func NewCodeChunker

func NewCodeChunker(chunkSize, chunkOverlap int) *CodeChunker

NewCodeChunker creates a new code chunker.

func (*CodeChunker) Chunk

func (c *CodeChunker) Chunk(code string) []string

Chunk splits code into logical chunks (by function/class boundaries).

func (*CodeChunker) ChunkWithMetadata

func (c *CodeChunker) ChunkWithMetadata(code string, metadata map[string]any) []Chunk

ChunkWithMetadata splits code and preserves metadata.

func (*CodeChunker) ChunkWithProvenance

func (c *CodeChunker) ChunkWithProvenance(fullText string, baseMeta map[string]any) []Chunk

ChunkWithProvenance splits code and attaches byte-level provenance metadata.

type DocumentMeta

type DocumentMeta struct {
	Title     string `json:"title"`
	Format    string `json:"format"`
	PageCount *int   `json:"page_count,omitempty"`
	WordCount int    `json:"word_count"`
	Summary   string `json:"summary"`
}

DocumentMeta holds top-level document metadata from the plugin.

type Language

type Language string

Language represents a supported programming language.

const (
	LangGo         Language = "go"
	LangPython     Language = "python"
	LangJavaScript Language = "javascript"
	LangTypeScript Language = "typescript"
	LangJava       Language = "java"
	LangC          Language = "c"
	LangCPP        Language = "cpp"
	LangRust       Language = "rust"
	LangCSharp     Language = "csharp"
	LangRuby       Language = "ruby"
	LangPHP        Language = "php"
	LangKotlin     Language = "kotlin"
	LangScala      Language = "scala"
	LangSwift      Language = "swift"
	LangLua        Language = "lua"
	LangElixir     Language = "elixir"
	LangZig        Language = "zig"
	LangPowerShell Language = "powershell"
	LangJulia      Language = "julia"
	LangObjectiveC Language = "objc"
	LangVue        Language = "vue"
	LangSvelte     Language = "svelte"
	LangUnknown    Language = "unknown"
)

func DetectLanguage

func DetectLanguage(filename string) Language

DetectLanguage determines the programming language from the filename.

type MarkdownChunk

type MarkdownChunk struct {
	// Text is the chunk text WITH the context header prepended.
	Text string
	// RawText is the chunk text WITHOUT context header.
	RawText string
	// SectionID is the parent section ID (e.g. "s0.1").
	SectionID string
	// SectionPath is the full heading breadcrumb, e.g. ["Introduction", "Background"].
	SectionPath []string
	// HeadingLevel of the parent section (1-6).
	HeadingLevel int
	// DocTitle from the document metadata.
	DocTitle string
	// Metadata carries all indexing metadata for this chunk.
	Metadata map[string]any
}

MarkdownChunk is a chunk with hierarchical context preserved.

type MarkdownChunker

type MarkdownChunker struct {
	ChunkSize    int
	ChunkOverlap int
	// contains filtered or unexported fields
}

MarkdownChunker splits structured documents into context-aware chunks that preserve heading hierarchy as a breadcrumb context header.

func NewMarkdownChunker

func NewMarkdownChunker(chunkSize, chunkOverlap int) *MarkdownChunker

NewMarkdownChunker creates a new markdown-aware chunker.

func (*MarkdownChunker) ChunkDocument

func (mc *MarkdownChunker) ChunkDocument(doc *StructuredDocument) []MarkdownChunk

ChunkDocument splits a structured document into context-aware chunks.

func (*MarkdownChunker) ChunkMarkdown

func (mc *MarkdownChunker) ChunkMarkdown(markdown string, source string) []MarkdownChunk

ChunkMarkdown parses raw markdown by headings when structured JSON is unavailable (e.g. plain .md files on disk, not from plugin). This is the fallback path.

type MarkdownSection

type MarkdownSection struct {
	ID       string   `json:"id"`
	Heading  string   `json:"heading"`
	Level    int      `json:"level"`
	Content  string   `json:"content"`
	Summary  string   `json:"summary"`
	ParentID string   `json:"parent_id,omitempty"`
	Order    int      `json:"order"`
	Children []string `json:"children,omitempty"`
}

MarkdownSection represents a heading-delimited section from structured JSON.

func ParseMarkdownHeadings

func ParseMarkdownHeadings(markdown string) []MarkdownSection

ParseMarkdownHeadings extracts heading structure from raw markdown text. This is the Go-side parser for when structured JSON is not available (e.g. indexing a local .md file directly or parsing markitdown output).

Handles edge cases:

  • Skips headings inside fenced code blocks (``` or ~~~)
  • Skips YAML front matter (--- delimited at file start)
  • Skips headings inside HTML comments (<!-- ... -->)
  • Strips heading anchor tags ({#id}) and trailing hashes
  • Supports setext-style headings (Title\n==== or Title\n----)

type ProvenanceMetadata

type ProvenanceMetadata struct {
	// Source is the relative file path within the indexed directory.
	Source string `json:"source"`
	// StartByte is the inclusive byte offset in the source file.
	StartByte int `json:"start_byte"`
	// EndByte is the exclusive byte offset in the source file.
	EndByte int `json:"end_byte"`
	// StartLine is the 1-based line number of the first line in the chunk.
	StartLine int `json:"start_line"`
	// EndLine is the 1-based line number of the last line in the chunk.
	EndLine int `json:"end_line"`
	// ChunkIndex is the ordinal position within the source.
	ChunkIndex int `json:"chunk_index"`
	// TotalChunks is how many chunks were produced from this source.
	TotalChunks int `json:"total_chunks"`
	// ContentHash is the SHA-256 hex digest of the chunk text.
	ContentHash string `json:"content_hash"`
	// IndexedAt is when this chunk was indexed.
	IndexedAt time.Time `json:"indexed_at"`
	// EmbeddingModel is which model produced the vector for this chunk.
	EmbeddingModel string `json:"embedding_model,omitempty"`
}

ProvenanceMetadata holds byte-level lineage information for a chunk. Embed this in chunk metadata so every search result carries its exact origin.

func ComputeByteOffsets

func ComputeByteOffsets(fullText string, chunks []string) []ProvenanceMetadata

ComputeByteOffsets calculates byte-level provenance for each chunk given the full source text and the chunk texts. It searches for each chunk in the full text sequentially to find byte offsets.

type SentenceSplitter

type SentenceSplitter struct {
	ChunkSize    int
	ChunkOverlap int
}

SentenceSplitter splits text into chunks by sentences.

func NewSentenceSplitter

func NewSentenceSplitter(chunkSize, chunkOverlap int) *SentenceSplitter

NewSentenceSplitter creates a new sentence splitter with the given parameters.

func (*SentenceSplitter) Chunk

func (s *SentenceSplitter) Chunk(text string) []string

Chunk splits text into chunks.

func (*SentenceSplitter) ChunkWithMetadata

func (s *SentenceSplitter) ChunkWithMetadata(text string, metadata map[string]any) []Chunk

ChunkWithMetadata splits text and preserves source metadata.

func (*SentenceSplitter) ChunkWithProvenance

func (s *SentenceSplitter) ChunkWithProvenance(fullText string, baseMeta map[string]any) []Chunk

ChunkWithProvenance splits text and attaches byte-level provenance metadata.

type StructuredDocument

type StructuredDocument struct {
	Document DocumentMeta      `json:"document"`
	Sections []MarkdownSection `json:"sections"`
	Markdown string            `json:"markdown"`
}

StructuredDocument is the full response from the plugin /convert endpoint.
