embedding

package
v0.1.11
Published: Jun 7, 2026 License: MIT Imports: 19 Imported by: 0

Documentation

Overview

Package embedding provides text embedding infrastructure for VaultMind.

Constants

const (
	DefaultModelName    = "sentence-transformers/all-MiniLM-L6-v2"
	DefaultDims         = 384
	DefaultMaxTokens    = 510 // MiniLM max is 512 minus 2 for CLS/SEP tokens
	DefaultOnnxFilePath = "onnx/model.onnx"
)

Default model configuration for the all-MiniLM-L6-v2 embedder.

const (
	BGEM3ModelName    = "BAAI/bge-m3"
	BGEM3Dims         = 1024
	BGEM3MaxTokens    = 8190 // 8192 minus 2 for CLS/SEP
	BGEM3OnnxFilePath = "onnx/model.onnx"
)

BGE-M3 model configuration.

Variables

This section is empty.

Functions

func Acceleration

func Acceleration() string

Acceleration mirrors the ORT-build's Acceleration() so callers don't need to special-case build tags. Pure-Go has no GPU path; "go-cpu" names the slow path explicitly.

func BackendName

func BackendName() string

BackendName identifies which hugot backend the binary was built against. Consumers (e.g. the index command) use this to warn when BGE-M3 indexing is about to run on the slow pure-Go path so operators don't mistake "hours-long indexing" for a hang or OOM. The value is determined by the build tag the binary was compiled with.

func ColBERTHead

func ColBERTHead(hiddenStates [][]float32, weights [][]float32, bias []float32) [][]float32

ColBERTHead projects each non-CLS token through a linear layer and L2-normalizes. Input: hiddenStates[seq_len][dims], weights[out_dims][in_dims], bias[out_dims]. Output: [seq_len-1][out_dims] (CLS at index 0 is skipped).

func DefaultCacheDir

func DefaultCacheDir() string

DefaultCacheDir returns the default model cache directory (~/.vaultmind/models).

func DefaultModel

func DefaultModel() string

DefaultModel returns the embedding model to use when the operator hasn't picked one explicitly. Adapts to the backend the binary was built against:

  • ORT-tagged binaries → "bge-m3" (4-way hybrid retrieval — fast on this build path; what the README's retrieval description is built around).
  • Pure-Go binaries → "minilm" (BGE-M3 indexing on pure-Go takes hours per medium vault; minilm is the always-fast baseline).

The default is conservative: it never picks a model the binary can't run reasonably. Users who want minilm on an ORT binary (e.g. for fast re-indexing during development) can pass --model minilm explicitly. Users who want bge-m3 on a pure-Go binary can opt in via --model bge-m3 + --allow-slow-backend.

The 2026-05-05 dogfood surfaced this gap: the prior hardcoded "minilm" default contradicted the system's own framing. A user running `vaultmind index --embed` on an ORT-capable build silently got MiniLM-only embeddings, learning about it only from doctor's post-hoc warning. The runtime-aware default closes that gap by matching the model to what the binary can actually run well.
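The selection rule described above can be sketched as a small decision function. Everything here is hypothetical: pickModel, its parameters, and the choice to return an error (rather than a warning) when bge-m3 is requested on a pure-Go build without the opt-in are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// pickModel is a hypothetical sketch of the runtime-aware default.
// requested is the value of --model ("" when the operator did not pick one),
// ortBuild reports whether the binary was built with the ORT backend, and
// allowSlow mirrors --allow-slow-backend.
func pickModel(requested string, ortBuild, allowSlow bool) (string, error) {
	if requested == "" {
		if ortBuild {
			return "bge-m3", nil // fast on this build path
		}
		return "minilm", nil // always-fast baseline on pure Go
	}
	if requested == "bge-m3" && !ortBuild && !allowSlow {
		// Assumption: the real CLI may warn instead of failing here.
		return "", errors.New("bge-m3 on the pure-Go backend requires --allow-slow-backend")
	}
	return requested, nil
}

func main() {
	fmt.Println(pickModel("", true, false))
	fmt.Println(pickModel("", false, false))
	fmt.Println(pickModel("bge-m3", false, true))
}
```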

func DenseHead

func DenseHead(hiddenStates [][]float32) []float32

DenseHead extracts the CLS token embedding (index 0) and L2-normalizes it. Input: hiddenStates[seq_len][dims]. Output: [dims] unit vector.

func DownloadBGEM3

func DownloadBGEM3(cacheDir string) (string, error)

DownloadBGEM3 downloads BGE-M3 model files from HuggingFace if not already cached. Returns the path to the model directory.

func L2Normalize

func L2Normalize(vec []float32) []float32

L2Normalize returns a unit vector. Returns a zero vector if the magnitude is zero.

func LoadLinearWeights

func LoadLinearWeights(path string) (weight [][]float32, bias []float32, err error)

LoadLinearWeights loads a PyTorch nn.Linear layer's weight and bias from a .pt file. Returns weight as [out_features][in_features] and bias as [out_features]. The .pt file must be a state_dict saved via torch.save(state_dict, path).

func MaxSimScore

func MaxSimScore(queryTokens, docTokens [][]float32) float64

MaxSimScore computes the ColBERT MaxSim score between query and document token matrices. For each query token, finds max similarity across all doc tokens, then sums. Assumes both query and doc tokens are L2-normalized (from ColBERTHead), so dot product = cosine.
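The scoring rule above can be sketched as below. maxSim and dot are hypothetical names, and returning 0 for an empty document is an assumption, since the docs do not describe that edge case.

```go
package main

import "fmt"

// dot returns the dot product of a and b. It assumes len(b) >= len(a).
func dot(a, b []float32) float64 {
	var s float64
	for i := range a {
		s += float64(a[i]) * float64(b[i])
	}
	return s
}

// maxSim is a hypothetical sketch of ColBERT MaxSim: for every query token,
// take the best dot product against any document token, then sum those
// per-token maxima. With L2-normalized inputs each dot product is a cosine.
func maxSim(query, doc [][]float32) float64 {
	if len(doc) == 0 {
		return 0 // assumption: no document tokens means no score
	}
	var total float64
	for _, q := range query {
		best := dot(q, doc[0])
		for _, d := range doc[1:] {
			if s := dot(q, d); s > best {
				best = s
			}
		}
		total += best
	}
	return total
}

func main() {
	query := [][]float32{{1, 0}, {0, 1}}
	doc := [][]float32{{1, 0}, {0.6, 0.8}}
	fmt.Printf("%.2f\n", maxSim(query, doc))
}
```

In the example, the first query token matches the first document token exactly (similarity 1), and the second query token is best matched by the second document token (similarity about 0.8), so the score is about 1.8.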

func SparseDotProduct

func SparseDotProduct(a, b map[int32]float32) float64

SparseDotProduct computes the dot product between two sparse vectors. Only overlapping keys contribute.
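A minimal sketch of the overlap rule, under the assumption that iterating the smaller map is an acceptable optimization (the docs do not say how the real function iterates). sparseDot is a hypothetical name.

```go
package main

import "fmt"

// sparseDot is a hypothetical sketch of a sparse dot product: only keys
// present in both maps contribute. Iterating the smaller map keeps the
// number of lookups at min(len(a), len(b)).
func sparseDot(a, b map[int32]float32) float64 {
	if len(b) < len(a) {
		a, b = b, a
	}
	var sum float64
	for k, va := range a {
		if vb, ok := b[k]; ok {
			sum += float64(va) * float64(vb)
		}
	}
	return sum
}

func main() {
	query := map[int32]float32{1: 2, 5: 3}
	doc := map[int32]float32{5: 4, 9: 1}
	fmt.Println(sparseDot(query, doc)) // only key 5 overlaps
}
```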

func SparseHead

func SparseHead(hiddenStates [][]float32, tokenIDs, specialMask []uint32, weights []float32, bias float32) map[int32]float32

SparseHead computes learned lexical weights per token. For each non-special token: weight = ReLU(dot(hidden, w) + bias). Weights are scattered to vocabulary positions via tokenIDs. Duplicate token IDs keep the maximum weight.
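The steps above can be sketched as follows. sparseHead is a hypothetical name, and two details are assumptions: a non-zero specialMask entry marks a special token, and weights that ReLU clamps to zero are omitted from the map (consistent with the "non-zero only" note on BGEM3Output.Sparse).

```go
package main

import "fmt"

// sparseHead is a hypothetical sketch of the learned lexical weighting.
// For each non-special token it computes ReLU(dot(hidden, w) + bias) and
// stores the result under the token's vocabulary ID, keeping the maximum
// when the same ID occurs more than once.
func sparseHead(hidden [][]float32, tokenIDs, specialMask []uint32, w []float32, bias float32) map[int32]float32 {
	out := make(map[int32]float32)
	for t, h := range hidden {
		if specialMask[t] != 0 {
			continue // assumption: non-zero marks CLS/SEP/padding
		}
		acc := float64(bias)
		for i, x := range h {
			acc += float64(x) * float64(w[i])
		}
		if acc <= 0 {
			continue // ReLU output is zero; zero weights are not stored
		}
		id := int32(tokenIDs[t])
		if v := float32(acc); v > out[id] {
			out[id] = v
		}
	}
	return out
}

func main() {
	hidden := [][]float32{{1, 1}, {2, 0}, {0.5, 0}, {-3, 0}, {1, 1}}
	tokenIDs := []uint32{0, 42, 42, 7, 2}
	special := []uint32{1, 0, 0, 0, 1}
	fmt.Println(sparseHead(hidden, tokenIDs, special, []float32{1, 0}, 0.5))
}
```

In the example, token ID 42 appears twice with weights 2.5 and 1.0, so 2.5 is kept; token ID 7 has a negative pre-activation and is dropped.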

func TruncateForEmbedding

func TruncateForEmbedding(text string, maxTokens int) string

TruncateForEmbedding truncates text to fit within the model's token limit. Uses a character-based approximation (2 chars/token, empirically derived). Breaks at word boundaries when possible.

Tail loss: content beyond maxTokens × 2 chars is dropped before tokenization — the head is embedded correctly but the tail is invisible to semantic retrieval (lexical FTS still sees the full body). For long-form notes where the tail carries information not in the head, this under-covers retrieval. Tracked as a quality improvement in vaultmind#30 (chunk-and-pool); not a silent failure or robustness bug, just a coverage limit. Build the chunking fix when retrieval visibly misses tail content; don't preempt.
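The character-based truncation can be sketched as below. truncateApprox is a hypothetical name. Counting runes rather than bytes, treating maxTokens <= 0 as "no truncation", and the exact word-boundary rule are all assumptions; the docs only state the 2 chars/token budget and the preference for word boundaries.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// truncateApprox is a hypothetical sketch of a 2-chars-per-token budget.
// Text within the budget is returned unchanged. Otherwise the text is cut at
// the budget and, if the cut fell inside a word, backed up to the previous
// whitespace when one exists.
func truncateApprox(text string, maxTokens int) string {
	if maxTokens <= 0 {
		return text // assumption: non-positive limit disables truncation
	}
	limit := maxTokens * 2
	runes := []rune(text)
	if len(runes) <= limit {
		return text
	}
	cut := string(runes[:limit])
	if !unicode.IsSpace(runes[limit]) {
		// The next character continues a word, so the cut is mid-word.
		if i := strings.LastIndexFunc(cut, unicode.IsSpace); i > 0 {
			cut = cut[:i]
		}
	}
	return strings.TrimRightFunc(cut, unicode.IsSpace)
}

func main() {
	fmt.Printf("%q\n", truncateApprox("alpha beta gamma delta", 6))
}
```

With maxTokens = 6 the budget is 12 characters, which lands inside "gamma", so the sketch backs up to the end of "beta". Everything after that point is the tail that semantic retrieval never sees.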

Types

type BGEM3Embedder

type BGEM3Embedder struct {
	// contains filtered or unexported fields
}

BGEM3Embedder produces dense, sparse, and ColBERT embeddings using BGE-M3.

func NewBGEM3Embedder

func NewBGEM3Embedder(cfg HugotConfig) (*BGEM3Embedder, error)

NewBGEM3Embedder creates a BGE-M3 embedder with all three heads.

func (*BGEM3Embedder) Close

func (e *BGEM3Embedder) Close() error

Close releases the hugot session.

func (*BGEM3Embedder) Dims

func (e *BGEM3Embedder) Dims() int

Dims returns the embedding dimensionality (1024).

func (*BGEM3Embedder) Embed

func (e *BGEM3Embedder) Embed(ctx context.Context, text string) ([]float32, error)

Embed returns the dense embedding (Embedder interface compatibility).

func (*BGEM3Embedder) EmbedBatch

func (e *BGEM3Embedder) EmbedBatch(ctx context.Context, texts []string) ([][]float32, error)

EmbedBatch returns dense embeddings (Embedder interface compatibility).

func (*BGEM3Embedder) EmbedColBERT

func (e *BGEM3Embedder) EmbedColBERT(ctx context.Context, text string) ([][]float32, error)

EmbedColBERT produces only the ColBERT per-token embeddings (used by ColBERTRetriever).

func (*BGEM3Embedder) EmbedFull

func (e *BGEM3Embedder) EmbedFull(ctx context.Context, text string) (*BGEM3Output, error)

EmbedFull produces all three embedding types for a single text.

func (*BGEM3Embedder) EmbedFullBatch

func (e *BGEM3Embedder) EmbedFullBatch(_ context.Context, texts []string) ([]*BGEM3Output, error)

EmbedFullBatch produces all three embedding types for multiple texts. Bypasses hugot's Postprocess to access raw per-token hidden states.

func (*BGEM3Embedder) EmbedSparse

func (e *BGEM3Embedder) EmbedSparse(ctx context.Context, text string) (map[int32]float32, error)

EmbedSparse produces only the sparse embedding (used by SparseRetriever).

type BGEM3Output

type BGEM3Output struct {
	Dense   []float32         // [1024] CLS-pooled, L2-normalized
	Sparse  map[int32]float32 // vocab_id -> weight (non-zero only)
	ColBERT [][]float32       // [seq_len-1][1024] per-token, L2-normalized
}

BGEM3Output contains all three embedding types from a BGE-M3 forward pass.

type Embedder

type Embedder interface {
	// Embed produces a single embedding vector for the given text.
	Embed(ctx context.Context, text string) ([]float32, error)

	// EmbedBatch produces embedding vectors for multiple texts.
	EmbedBatch(ctx context.Context, texts []string) ([][]float32, error)

	// Dims returns the dimensionality of the embedding vectors.
	Dims() int

	// Close releases resources (model session, etc.).
	Close() error
}

Embedder converts text into dense vector representations.

type FullEmbedder

type FullEmbedder interface {
	Embedder
	EmbedFullBatch(ctx context.Context, texts []string) ([]*BGEM3Output, error)
}

FullEmbedder extends Embedder with multi-output capability (BGE-M3).

type HugotConfig

type HugotConfig struct {
	// ModelPath is the local path to the ONNX model directory.
	// If empty, the model will be downloaded from HuggingFace.
	ModelPath string

	// ModelName is the HuggingFace model ID (e.g., "sentence-transformers/all-MiniLM-L6-v2").
	// Used for downloading if ModelPath is not set.
	ModelName string

	// CacheDir is where downloaded models are stored.
	CacheDir string

	// Dims is the embedding dimensionality (e.g., 384 for MiniLM, 1024 for BGE-M3).
	Dims int

	// OnnxFilePath specifies which ONNX file to use when a model has multiple variants.
	// E.g., "onnx/model.onnx" for the default, "onnx/model_O2.onnx" for optimized.
	OnnxFilePath string

	// MaxTokens is the model's context window size. Texts longer than this (in approximate
	// tokens) are truncated before embedding. 0 means no truncation.
	MaxTokens int
}

HugotConfig configures the HugotEmbedder.

func BGEM3Config

func BGEM3Config() HugotConfig

BGEM3Config returns the HugotConfig for BGE-M3.

func DefaultHugotConfig

func DefaultHugotConfig() HugotConfig

DefaultHugotConfig returns the standard HugotConfig for all-MiniLM-L6-v2.

type HugotEmbedder

type HugotEmbedder struct {
	// contains filtered or unexported fields
}

HugotEmbedder wraps the hugot library to produce embeddings using ONNX models.

func NewHugotEmbedder

func NewHugotEmbedder(cfg HugotConfig) (*HugotEmbedder, error)

NewHugotEmbedder creates an embedder using hugot with the Go backend. For the ORT backend (faster, supports larger models), build with -tags ORT.

func (*HugotEmbedder) Close

func (e *HugotEmbedder) Close() error

Close releases the hugot session.

func (*HugotEmbedder) Dims

func (e *HugotEmbedder) Dims() int

Dims returns the dimensionality of the embedding vectors.

func (*HugotEmbedder) Embed

func (e *HugotEmbedder) Embed(ctx context.Context, text string) ([]float32, error)

Embed produces a single embedding vector.

func (*HugotEmbedder) EmbedBatch

func (e *HugotEmbedder) EmbedBatch(_ context.Context, texts []string) ([][]float32, error)

EmbedBatch produces embedding vectors for multiple texts. Texts exceeding the model's token limit are truncated automatically.

type SidecarBGEM3Config

type SidecarBGEM3Config struct {
	// Python is the interpreter path. Must have torch + transformers
	// installed and be on a platform where torch.backends.mps.is_available()
	// returns true (Apple Silicon). When empty, falls back to "python3" on
	// PATH.
	Python string
	// ScriptPath is the absolute path to embed_server.py. When empty, the
	// embedder looks for the script alongside this Go file (resolved via
	// the project's $CLAUDE_PROJECT_DIR or the executable's directory).
	ScriptPath string
}

SidecarBGEM3Config controls how the sidecar process is launched.

type SidecarBGEM3Embedder

type SidecarBGEM3Embedder struct {
	// contains filtered or unexported fields
}

SidecarBGEM3Embedder runs BGE-M3 inference in an external Python process that uses PyTorch + MPS (Apple Silicon GPU). The Go side does no tokenization: the Python sidecar tokenizes with the HF tokenizer loaded from cache. The heads also run inside the sidecar, so the per-modality tensors flow through MPS without round-tripping to CPU mid-batch.

Why a sidecar instead of in-process: in-process ORT (via hugot) saturates CPU during indexing on Apple Silicon — there's no GPU acceleration path (vaultmind#34). The sidecar pattern moves heavy inference behind a JSON contract, isolating vaultmind core from the inference engine choice. Today the engine is PyTorch+MPS; tomorrow it could be CoreML or MLX without touching the Go side.

Lifecycle: the embedder spawns the Python subprocess in NewSidecarBGEM3, and Close() tears it down. Per-batch round-trips happen via Send (write a JSON line to stdin, read a JSON line from stdout). A mutex serializes access, since the protocol is synchronous request/response on a single FD pair.

func NewSidecarBGEM3

func NewSidecarBGEM3(cfg SidecarBGEM3Config) (*SidecarBGEM3Embedder, error)

NewSidecarBGEM3 spawns the Python sidecar and waits for its ready signal. Returns an error if the subprocess fails to start, the Python imports fail, or the model can't be loaded. The caller MUST defer Close() to reap the subprocess.

func (*SidecarBGEM3Embedder) Close

func (e *SidecarBGEM3Embedder) Close() error

Close terminates the sidecar process. Safe to call multiple times.

func (*SidecarBGEM3Embedder) Device

func (e *SidecarBGEM3Embedder) Device() string

Device reports the device the sidecar selected ("mps" or "cpu"). Useful for the doctor / index summary so the operator can see which acceleration is in use.

func (*SidecarBGEM3Embedder) Dims

func (e *SidecarBGEM3Embedder) Dims() int

Dims reports the dense embedding dimensionality.

func (*SidecarBGEM3Embedder) Embed

func (e *SidecarBGEM3Embedder) Embed(ctx context.Context, text string) ([]float32, error)

Embed produces a single dense embedding via the sidecar.

func (*SidecarBGEM3Embedder) EmbedBatch

func (e *SidecarBGEM3Embedder) EmbedBatch(ctx context.Context, texts []string) ([][]float32, error)

EmbedBatch produces dense embeddings for a batch via the sidecar.

func (*SidecarBGEM3Embedder) EmbedFull

func (e *SidecarBGEM3Embedder) EmbedFull(ctx context.Context, text string) (*BGEM3Output, error)

EmbedFull is the singleton form of EmbedFullBatch.

func (*SidecarBGEM3Embedder) EmbedFullBatch

func (e *SidecarBGEM3Embedder) EmbedFullBatch(_ context.Context, texts []string) ([]*BGEM3Output, error)

EmbedFullBatch sends a batch of texts to the sidecar and parses the response. Sparse keys (token IDs) arrive as strings in the JSON and are parsed to int32 here, so the sidecar protocol stays portable across languages.
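The key conversion can be sketched as follows. parseSparse is a hypothetical helper, and the assumption that the sparse vector arrives as a flat JSON object of string keys to numeric weights is made for illustration; the actual response schema is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// parseSparse is a hypothetical sketch: it decodes a JSON object whose keys
// are decimal token IDs and converts it to the map[int32]float32 shape used
// by BGEM3Output.Sparse. Keys that are not valid 32-bit integers are errors.
func parseSparse(raw []byte) (map[int32]float32, error) {
	var byString map[string]float32
	if err := json.Unmarshal(raw, &byString); err != nil {
		return nil, err
	}
	out := make(map[int32]float32, len(byString))
	for k, v := range byString {
		id, err := strconv.ParseInt(k, 10, 32)
		if err != nil {
			return nil, fmt.Errorf("sparse key %q: %w", k, err)
		}
		out[int32(id)] = v
	}
	return out, nil
}

func main() {
	sparse, err := parseSparse([]byte(`{"42": 0.5, "7": 1.25}`))
	fmt.Println(sparse, err)
}
```

JSON object keys are always strings, so some conversion has to happen on one side of the protocol; doing it in Go keeps the sidecar's output plain JSON.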
