chunking

v0.0.0-...-7845765
Published: May 12, 2026 License: MIT Imports: 10 Imported by: 0

Documentation

Overview

Package chunking provides AST-aware text chunking for code files. It uses Go's native go/ast parser for Go files and regex-based boundary detection for Python, JavaScript, TypeScript, Java, C/C++, and Rust.

AST-aware chunking ensures semantic units (functions, classes, methods) are not split mid-definition, producing higher quality chunks for code RAG.

Package chunking provides text splitting utilities for indexing.

Byte-level provenance (provenance.go): every chunk carries exact source-location metadata, so search results trace back to the precise file, byte offset, and line number they originated from. This enables debuggable, explainable RAG pipelines.

Stub implementation when tree-sitter is not available. All non-Go languages fall back to regex-based boundary detection.

To enable tree-sitter: go build -tags treesitter

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func EnrichChunkMetadata

func EnrichChunkMetadata(meta map[string]any, prov ProvenanceMetadata) map[string]any

EnrichChunkMetadata merges provenance info into a chunk's metadata map.

func IsCodeFile

func IsCodeFile(filename string) bool

IsCodeFile checks if a filename looks like a code file.

func IsCodeSourceFile

func IsCodeSourceFile(filename string) bool

IsCodeSourceFile returns true if the filename has a known code extension with full AST/regex support (subset of IsCodeFile).

func IsMarkdownFile

func IsMarkdownFile(filename string) bool

IsMarkdownFile checks if a filename is a markdown file.

func TreeSitterAvailable

func TreeSitterAvailable() bool

TreeSitterAvailable reports whether tree-sitter support is compiled in. Use this to check at runtime: if chunking.TreeSitterAvailable() { ... }

Types

type ASTChunker

type ASTChunker struct {
	// contains filtered or unexported fields
}

ASTChunker splits code into semantic chunks using AST analysis.

func NewASTChunker

func NewASTChunker(config ASTChunkerConfig) *ASTChunker

NewASTChunker creates a new AST-aware chunker.

func (*ASTChunker) Chunk

func (c *ASTChunker) Chunk(text string) []string

Chunk implements the gleann Chunker interface — splits text with no metadata.

func (*ASTChunker) ChunkCode

func (c *ASTChunker) ChunkCode(source, filename string) []CodeChunk

ChunkCode splits source code into semantic chunks. It uses Go's native AST parser for Go files, and regex-based boundary detection for other supported languages. For unknown languages, it falls back to sliding window chunking.

func (*ASTChunker) ChunkWithMetadata

func (c *ASTChunker) ChunkWithMetadata(text string, metadata map[string]any) []Chunk

ChunkWithMetadata implements the gleann Chunker interface with metadata.

type ASTChunkerConfig

type ASTChunkerConfig struct {
	// MaxChunkSize is the maximum number of characters per chunk.
	MaxChunkSize int

	// ChunkOverlap is the number of overlapping characters at boundaries.
	ChunkOverlap int

	// AddLineNumbers prepends line numbers to each line of the chunk.
	AddLineNumbers bool

	// ChunkExpansion prepends parent scope context as a header comment to each chunk.
	// E.g. "// File: api.py | Scope: Calculator > add"
	// This improves embedding quality by giving LLMs semantic context.
	// Only effective when tree-sitter is enabled (-tags treesitter).
	ChunkExpansion bool
}

ASTChunkerConfig holds configuration for the AST chunker.

func DefaultASTChunkerConfig

func DefaultASTChunkerConfig() ASTChunkerConfig

DefaultASTChunkerConfig returns reasonable defaults.

type Chunk

type Chunk struct {
	Text     string
	Metadata map[string]any
}

Chunk represents a piece of text and its associated metadata.

type CodeChunk

type CodeChunk struct {
	Text          string         `json:"text"`
	Metadata      map[string]any `json:"metadata"`
	StartLine     int            `json:"start_line"`
	EndLine       int            `json:"end_line"`
	NodeType      string         `json:"node_type"` // "function", "class", "method", "block", etc.
	Name          string         `json:"name"`      // e.g. function/class name
	OutboundCalls []string       `json:"outbound_calls,omitempty"`
}

CodeChunk represents a semantic code chunk with metadata.

type CodeChunker

type CodeChunker struct {
	ChunkSize    int
	ChunkOverlap int
}

CodeChunker splits code files into logical chunks based on structure.

func NewCodeChunker

func NewCodeChunker(chunkSize, chunkOverlap int) *CodeChunker

NewCodeChunker creates a new code chunker.

func (*CodeChunker) Chunk

func (c *CodeChunker) Chunk(code string) []string

Chunk splits code into logical chunks (by function/class boundaries).

func (*CodeChunker) ChunkWithMetadata

func (c *CodeChunker) ChunkWithMetadata(code string, metadata map[string]any) []Chunk

ChunkWithMetadata splits code and preserves metadata.

func (*CodeChunker) ChunkWithProvenance

func (c *CodeChunker) ChunkWithProvenance(fullText string, baseMeta map[string]any) []Chunk

ChunkWithProvenance splits code and attaches byte-level provenance metadata.

type DocumentMeta

type DocumentMeta struct {
	Title     string `json:"title"`
	Format    string `json:"format"`
	PageCount *int   `json:"page_count,omitempty"`
	WordCount int    `json:"word_count"`
	Summary   string `json:"summary"`
}

DocumentMeta holds top-level document metadata from the plugin.

type Language

type Language string

Language represents a supported programming language.

const (
	LangGo         Language = "go"
	LangPython     Language = "python"
	LangJavaScript Language = "javascript"
	LangTypeScript Language = "typescript"
	LangJava       Language = "java"
	LangC          Language = "c"
	LangCPP        Language = "cpp"
	LangRust       Language = "rust"
	LangCSharp     Language = "csharp"
	LangRuby       Language = "ruby"
	LangPHP        Language = "php"
	LangKotlin     Language = "kotlin"
	LangScala      Language = "scala"
	LangSwift      Language = "swift"
	LangLua        Language = "lua"
	LangElixir     Language = "elixir"
	LangZig        Language = "zig"
	LangPowerShell Language = "powershell"
	LangJulia      Language = "julia"
	LangObjectiveC Language = "objc"
	LangVue        Language = "vue"
	LangSvelte     Language = "svelte"
	LangUnknown    Language = "unknown"
)

func DetectLanguage

func DetectLanguage(filename string) Language

DetectLanguage determines the programming language from the filename.

type MarkdownChunk

type MarkdownChunk struct {
	// Text is the chunk text WITH the context header prepended.
	Text string
	// RawText is the chunk text WITHOUT context header.
	RawText string
	// SectionID is the parent section ID (e.g. "s0.1").
	SectionID string
	// SectionPath is the full heading breadcrumb, e.g. ["Introduction", "Background"].
	SectionPath []string
	// HeadingLevel of the parent section (1-6).
	HeadingLevel int
	// DocTitle from the document metadata.
	DocTitle string
	// Metadata carries all indexing metadata for this chunk.
	Metadata map[string]any
}

MarkdownChunk is a chunk with hierarchical context preserved.

type MarkdownChunker

type MarkdownChunker struct {
	ChunkSize    int
	ChunkOverlap int
	// contains filtered or unexported fields
}

MarkdownChunker splits structured documents into context-aware chunks that preserve heading hierarchy as a breadcrumb context header.

func NewMarkdownChunker

func NewMarkdownChunker(chunkSize, chunkOverlap int) *MarkdownChunker

NewMarkdownChunker creates a new markdown-aware chunker.

func (*MarkdownChunker) ChunkDocument

func (mc *MarkdownChunker) ChunkDocument(doc *StructuredDocument) []MarkdownChunk

ChunkDocument splits a structured document into context-aware chunks.

func (*MarkdownChunker) ChunkMarkdown

func (mc *MarkdownChunker) ChunkMarkdown(markdown string, source string) []MarkdownChunk

ChunkMarkdown parses raw markdown by headings when structured JSON is unavailable (e.g. plain .md files on disk, not from plugin). This is the fallback path.

type MarkdownSection

type MarkdownSection struct {
	ID       string   `json:"id"`
	Heading  string   `json:"heading"`
	Level    int      `json:"level"`
	Content  string   `json:"content"`
	Summary  string   `json:"summary"`
	ParentID string   `json:"parent_id,omitempty"`
	Order    int      `json:"order"`
	Children []string `json:"children,omitempty"`
}

MarkdownSection represents a heading-delimited section from structured JSON.

func ParseMarkdownHeadings

func ParseMarkdownHeadings(markdown string) []MarkdownSection

ParseMarkdownHeadings extracts heading structure from raw markdown text. This is the Go-side parser for when structured JSON is not available (e.g. indexing a local .md file directly or parsing markitdown output).

Handles edge cases:

  • Skips headings inside fenced code blocks (``` or ~~~)
  • Skips YAML front matter (--- delimited at file start)
  • Skips headings inside HTML comments (<!-- ... -->)
  • Strips heading anchor tags ({#id}) and trailing hashes
  • Supports setext-style headings (Title\n==== or Title\n----)

type ProvenanceMetadata

type ProvenanceMetadata struct {
	// Source is the relative file path within the indexed directory.
	Source string `json:"source"`
	// StartByte is the inclusive byte offset in the source file.
	StartByte int `json:"start_byte"`
	// EndByte is the exclusive byte offset in the source file.
	EndByte int `json:"end_byte"`
	// StartLine is the 1-based line number of the first line in the chunk.
	StartLine int `json:"start_line"`
	// EndLine is the 1-based line number of the last line in the chunk.
	EndLine int `json:"end_line"`
	// ChunkIndex is the ordinal position within the source.
	ChunkIndex int `json:"chunk_index"`
	// TotalChunks is how many chunks were produced from this source.
	TotalChunks int `json:"total_chunks"`
	// ContentHash is the SHA-256 hex digest of the chunk text.
	ContentHash string `json:"content_hash"`
	// IndexedAt is when this chunk was indexed.
	IndexedAt time.Time `json:"indexed_at"`
	// EmbeddingModel is which model produced the vector for this chunk.
	EmbeddingModel string `json:"embedding_model,omitempty"`
}

ProvenanceMetadata holds byte-level lineage information for a chunk. Embed this in chunk metadata so every search result carries its exact origin.

func ComputeByteOffsets

func ComputeByteOffsets(fullText string, chunks []string) []ProvenanceMetadata

ComputeByteOffsets calculates byte-level provenance for each chunk given the full source text and the chunk texts. It searches for each chunk in the full text sequentially to find byte offsets.

type SentenceSplitter

type SentenceSplitter struct {
	ChunkSize    int
	ChunkOverlap int
}

SentenceSplitter splits text into chunks by sentences.

func NewSentenceSplitter

func NewSentenceSplitter(chunkSize, chunkOverlap int) *SentenceSplitter

NewSentenceSplitter creates a new sentence splitter with the given parameters.

func (*SentenceSplitter) Chunk

func (s *SentenceSplitter) Chunk(text string) []string

Chunk splits text into chunks.

func (*SentenceSplitter) ChunkWithMetadata

func (s *SentenceSplitter) ChunkWithMetadata(text string, metadata map[string]any) []Chunk

ChunkWithMetadata splits text and preserves source metadata.

func (*SentenceSplitter) ChunkWithProvenance

func (s *SentenceSplitter) ChunkWithProvenance(fullText string, baseMeta map[string]any) []Chunk

ChunkWithProvenance splits text and attaches byte-level provenance metadata.

type StructuredDocument

type StructuredDocument struct {
	Document DocumentMeta      `json:"document"`
	Sections []MarkdownSection `json:"sections"`
	Markdown string            `json:"markdown"`
}

StructuredDocument is the full response from the plugin /convert endpoint.
