Documentation
¶
Index ¶
- Constants
- type Chunk
- type DirectoryLoader
- type DirectoryLoaderOption
- type Document
- type DocumentLoader
- type Embedder
- type IngestionPipeline
- type NanovecStore
- type PipelineOption
- type RecursiveCharacterTextSplitter
- type Retriever
- type StandardRetriever
- type TextLoader
- type TextLoaderOption
- type TextSplitter
- type VectorStore
Constants ¶
const ( // DefaultMaxDocSize limits the RAM usage per file to 5MB. // If a file is larger, it will be split into multiple Documents. DefaultMaxDocSize = 5 * core.MB )
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Chunk ¶
type Chunk struct {
ID string `json:"id"`
ParentID string `json:"parent_id"`
ChunkIndex int `json:"chunk_index"`
Content string `json:"content"`
// Vector representation of the chunk's content, generated by the Embedder.
Embedding []float32 `json:"embedding,omitempty"`
// Similar scores
Score float32 `json:"score,omitempty"`
// Metadata is inherited from the original document plus Chunk's own metadata (such as chunk_index).
Metadata map[string]any `json:"metadata"`
}
Chunk represents a semantic shard. This is the actual object stored in the Vector DB and returned to the LLM.
type DirectoryLoader ¶
type DirectoryLoader struct {
// contains filtered or unexported fields
}
DirectoryLoader recursively scans a directory and uses TextLoader to load all matching files.
func NewDirectoryLoader ¶
func NewDirectoryLoader(dirPath string, opts ...DirectoryLoaderOption) *DirectoryLoader
NewDirectoryLoader creates a loader that scans an entire folder.
type DirectoryLoaderOption ¶
type DirectoryLoaderOption func(*DirectoryLoader)
DirectoryLoaderOption follows the functional option pattern for flexible configuration.
func WithAllowedExtensions ¶
func WithAllowedExtensions(exts ...string) DirectoryLoaderOption
WithAllowedExtensions specifies which file extensions to load.
func WithMaxConcurrency ¶
func WithMaxConcurrency(n int) DirectoryLoaderOption
WithMaxConcurrency allows tuning the maximum number of files opened simultaneously. Default is 50. Increase for NVMe SSDs, decrease for HDDs or strict OS ulimits.
type Document ¶
type Document struct {
ID string `json:"id"`
Content string `json:"content"`
Metadata map[string]any `json:"metadata"` // FilePath, Author, Title...
}
Document represents a raw, source file (e.g., a PDF, a Web page).
type DocumentLoader ¶
type DocumentLoader interface {
// Load streams Documents into the provided channel.
// The implementation MUST NOT close the channel (the pipeline manages it).
Load(ctx context.Context, docChan chan<- Document) error
}
DocumentLoader defines the contract for streaming data from various sources.
type Embedder ¶
type Embedder interface {
Embed(ctx context.Context, text string) ([]float32, error)
EmbedBatch(ctx context.Context, texts []string) ([][]float32, error)
}
Embedder converts text into high-dimensional vector representations.
type IngestionPipeline ¶
type IngestionPipeline struct {
Loader DocumentLoader
Splitter TextSplitter
Embedder Embedder
Store VectorStore
BatchSize int // Number of chunks to embed in one API call
NumWorkers int // Concurrent workers for embedding/upserting
}
IngestionPipeline orchestrates the flow of data from raw files to VectorDB.
func NewIngestionPipeline ¶
func NewIngestionPipeline( loader DocumentLoader, splitter TextSplitter, embedder Embedder, store VectorStore, opts ...PipelineOption, ) (*IngestionPipeline, error)
NewIngestionPipeline creates a safely initialized pipeline with default fallbacks.
type NanovecStore ¶
type NanovecStore struct {
// contains filtered or unexported fields
}
NanovecStore implements the VectorStore interface using the embedded nanovec library.
func NewNanovecStore ¶
func NewNanovecStore(path string, dimension int) (*NanovecStore, error)
NewNanovecStore creates a highly optimized, in-memory vector database using nanovec.
func (*NanovecStore) Close ¶
func (n *NanovecStore) Close() error
Close gracefully shuts down the nanovec database.
func (*NanovecStore) SimilaritySearch ¶
func (n *NanovecStore) SimilaritySearch(ctx context.Context, queryVector []float32, topK int) ([]Chunk, error)
SimilaritySearch performs a blazing fast nearest-neighbor search using nanovec's HNSW index.
type PipelineOption ¶
type PipelineOption func(*IngestionPipeline)
PipelineOption follows the Functional Option pattern for flexible configuration.
func WithBatchSize ¶
func WithBatchSize(size int) PipelineOption
WithBatchSize configures how many chunks are sent to the Embedder at once.
func WithNumWorkers ¶
func WithNumWorkers(workers int) PipelineOption
WithNumWorkers configures the number of concurrent embedding workers.
type RecursiveCharacterTextSplitter ¶
RecursiveCharacterTextSplitter implements the TextSplitter interface. It splits text recursively based on a hierarchy of separators to keep semantically related pieces together.
func NewRecursiveCharacterTextSplitter ¶
func NewRecursiveCharacterTextSplitter(chunkSize, chunkOverlap int) *RecursiveCharacterTextSplitter
NewRecursiveCharacterTextSplitter creates a new splitter with default LangChain-like separators.
type Retriever ¶
type Retriever interface {
// Search queries the Vector Database and returns topK relevant documents.
Search(ctx context.Context, query string, topK int) ([]Chunk, error)
}
Retriever defines the contract for any knowledge base search engine.
type StandardRetriever ¶
type StandardRetriever struct {
// contains filtered or unexported fields
}
StandardRetriever implements the Retriever interface for the Agent's Tool.
func NewStandardRetriever ¶
func NewStandardRetriever(embedder Embedder, store VectorStore) *StandardRetriever
NewStandardRetriever creates a new Retriever that uses the provided Embedder and VectorStore.
type TextLoader ¶
type TextLoader struct {
// contains filtered or unexported fields
}
TextLoader reads a single local text file (.txt, .md, .csv, .go, .py, ...) into a Document.
func NewTextLoader ¶
func NewTextLoader(filePath string, opts ...TextLoaderOption) *TextLoader
NewTextLoader creates a new loader for a specific file path.
func (*TextLoader) Load ¶
func (l *TextLoader) Load(ctx context.Context, docChan chan<- Document) error
Load reads the file content and attaches file-specific metadata. Principle: 1. Open the file using `os.Open` (only creates a descriptor file, no RAM usage). 2. Use `bufio.Reader` to read each line or small chunk. 3. Accumulate data into a `strings.Builder`. When the buffer reaches a safe threshold (e.g., 5MB), we will split it into parts and push the document into a `docchan`. 4. Repeat until the entire file (`io.EOF`) is read.
type TextLoaderOption ¶
type TextLoaderOption func(*TextLoader)
TextLoaderOption follows the Functional Option pattern.
func WithFileInfo ¶
func WithFileInfo(info fs.FileInfo) TextLoaderOption
WithFileInfo injects pre-fetched file stats to avoid redundant os.Stat syscalls.
func WithMaxDocSize ¶
func WithMaxDocSize(bytes core.ByteSize) TextLoaderOption
WithMaxDocSize allows custom configuration for memory limits per file.
type TextSplitter ¶
type TextSplitter interface {
// Split reads from inChan, splits the documents, and streams chunks to outChan.
Split(ctx context.Context, inChan <-chan Document, outChan chan<- Chunk) error
}
TextSplitter breaks down large documents into semantic chunks on the fly.
type VectorStore ¶
type VectorStore interface {
// Upsert saves chunks and their embeddings to the database.
Upsert(ctx context.Context, chunks []Chunk) error
// SimilaritySearch finds the top-k most similar nodes.
SimilaritySearch(ctx context.Context, queryVector []float32, limit int) ([]Chunk, error)
// Close gracefully shuts down the database connection and releases resources.
Close() error
}
VectorStore handles the storage and similarity search of vectors.