embedding

package
v0.1.10
Warning: This package is not in the latest version of its module.
Published: Mar 28, 2026 License: MIT Imports: 16 Imported by: 0

Documentation

Overview

Package downloader provides functionality for downloading embedding models from remote sources. It supports progress tracking, caching, and multi-file downloads.

Package embedding provides interfaces and implementations for text embedding generation. It supports both local and remote embedding models with batch processing and caching capabilities.

Package models provides pre-configured providers for popular embedding models. It includes automatic model type detection and factory functions for easy model loading.

Package local provides implementations for local embedding model providers. It supports various model formats and includes a tokenizer for text preprocessing.

Constants

const (
	// CLSTokenID is the classification token ID, added at the start of sequences.
	CLSTokenID = 101

	// SEITokenID is the separator token ID, added at the end of sequences.
	SEITokenID = 102

	// PadTokenID is the padding token ID, used to fill sequences to uniform length.
	PadTokenID = 0

	// VocabStart is the starting ID for vocabulary tokens (tokens beyond this
	// are dynamically assigned during tokenization).
	VocabStart = 10000
)

Token ID constants for BERT-style tokenization. These represent special tokens added to sequences.

Variables

This section is empty.

Functions

This section is empty.

Types

type BGEProvider added in v0.1.2

type BGEProvider struct {
	// contains filtered or unexported fields
}

BGEProvider is an embedding provider for BGE models

func NewBGEProvider added in v0.1.2

func NewBGEProvider(modelPath string) (*BGEProvider, error)

NewBGEProvider creates a new BGE provider

func (*BGEProvider) Dimension added in v0.1.2

func (p *BGEProvider) Dimension() int

Dimension returns the embedding dimension

func (*BGEProvider) Embed added in v0.1.2

func (p *BGEProvider) Embed(ctx context.Context, texts []string) ([][]float32, error)

Embed generates embeddings for the given texts

type BatchOptions

type BatchOptions struct {
	// MaxBatchSize is the maximum number of texts to process in a single batch.
	// Larger batches are more efficient but use more memory.
	// Default: 32
	MaxBatchSize int

	// MaxConcurrent is the maximum number of concurrent API calls.
	// Higher values increase throughput but may hit rate limits.
	// Default: 4
	MaxConcurrent int

	// MaxCacheSize is the maximum number of entries in the LRU cache.
	// When exceeded, least recently used entries are evicted.
	// Default: 10000
	MaxCacheSize int
}

BatchOptions configures the behavior of a BatchProcessor. These options control batching strategy, concurrency, and caching.

type BatchProcessor

type BatchProcessor struct {
	// contains filtered or unexported fields
}

BatchProcessor processes multiple texts through an embedding provider with support for batching, concurrent processing, and LRU caching. It improves throughput by batching requests and reduces API calls by caching embeddings for repeated texts.

BatchProcessor is safe for concurrent use from multiple goroutines.

Example:

processor := embedding.NewBatchProcessor(provider, embedding.BatchOptions{
    MaxBatchSize:  32,
    MaxConcurrent: 4,
    MaxCacheSize:  10000,
})
embeddings, err := processor.Process(ctx, texts)
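
The LRU caching described above can be illustrated with a minimal sketch. This is not the package's internal cache: lruCache, entry, and newLRUCache are hypothetical names, and the sketch is not goroutine-safe (a concurrent processor would also need a mutex around it).

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one cached embedding keyed by its source text.
type entry struct {
	key string
	vec []float32
}

// lruCache is a hypothetical, non-thread-safe LRU cache used only to
// illustrate the eviction policy; it is not the package's implementation.
type lruCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // text -> element in order
}

func newLRUCache(capacity int) *lruCache {
	return &lruCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Get returns the cached vector and marks the entry as recently used.
func (c *lruCache) Get(key string) ([]float32, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).vec, true
}

// Put stores a vector, evicting the least recently used entry when the
// cache grows past its capacity.
func (c *lruCache) Put(key string, vec []float32) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).vec = vec
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, vec: vec})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// Len reports the number of cached entries.
func (c *lruCache) Len() int { return c.order.Len() }

func main() {
	cache := newLRUCache(2)
	cache.Put("a", []float32{1})
	cache.Put("b", []float32{2})
	cache.Get("a")               // "a" becomes most recently used
	cache.Put("c", []float32{3}) // evicts "b"
	_, hasB := cache.Get("b")
	fmt.Println(cache.Len(), hasB) // prints "2 false"
}
```

Keying the cache by the input text is what lets repeated texts skip the provider entirely.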

func NewBatchProcessor

func NewBatchProcessor(provider Provider, options BatchOptions) *BatchProcessor

NewBatchProcessor creates a new BatchProcessor with the specified provider and options. Default values are applied to any options that are zero:

  • MaxBatchSize: 32
  • MaxConcurrent: 4
  • MaxCacheSize: 10000

Parameters:

  • provider: The embedding provider to use for generating embeddings
  • options: Configuration options (zero-valued fields use the defaults above)

Returns a configured BatchProcessor
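
The defaulting rule can be sketched as follows. BatchOptions mirrors the documented fields; withDefaults is a hypothetical helper name, and the package may apply its defaults differently.

```go
package main

import "fmt"

// BatchOptions mirrors the documented struct fields.
type BatchOptions struct {
	MaxBatchSize  int
	MaxConcurrent int
	MaxCacheSize  int
}

// withDefaults is a hypothetical helper that replaces zero-valued fields
// with the documented defaults and leaves explicit values untouched.
func withDefaults(o BatchOptions) BatchOptions {
	if o.MaxBatchSize == 0 {
		o.MaxBatchSize = 32
	}
	if o.MaxConcurrent == 0 {
		o.MaxConcurrent = 4
	}
	if o.MaxCacheSize == 0 {
		o.MaxCacheSize = 10000
	}
	return o
}

func main() {
	fmt.Printf("%+v\n", withDefaults(BatchOptions{MaxBatchSize: 8}))
	// prints "{MaxBatchSize:8 MaxConcurrent:4 MaxCacheSize:10000}"
}
```

One consequence of zero meaning "use the default" is that, under this scheme, a zero value cannot be used to request a cache of size zero.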

func (*BatchProcessor) CacheSize added in v0.1.10

func (bp *BatchProcessor) CacheSize() int

CacheSize returns the current number of entries in the cache.

func (*BatchProcessor) ClearCache added in v0.1.10

func (bp *BatchProcessor) ClearCache()

ClearCache removes all entries from the embedding cache. Call this to free memory or when the underlying model changes.

func (*BatchProcessor) Process

func (bp *BatchProcessor) Process(ctx context.Context, texts []string) ([][]float32, error)

Process generates embeddings for a list of texts. This is a convenience method equivalent to calling ProcessWithProgress with no progress callback.

Parameters:

  • ctx: Context for cancellation
  • texts: The texts to embed

Returns a 2D slice of embeddings (one per text) and any error

func (*BatchProcessor) ProcessWithProgress

func (bp *BatchProcessor) ProcessWithProgress(
	ctx context.Context,
	texts []string,
	callback ProgressCallback,
) ([][]float32, error)

ProcessWithProgress generates embeddings for a list of texts with optional progress reporting. The method processes texts in batches, issues API calls concurrently, and caches results to avoid recomputing embeddings for the same text.

The callback, if provided, is called after each batch completes with current progress. Returning false from the callback cancels processing.

Parameters:

  • ctx: Context for cancellation
  • texts: The texts to embed
  • callback: Optional progress reporter (nil for no reporting)

Returns a 2D slice of embeddings (one per text) and any error

type CLIPProvider added in v0.1.9

type CLIPProvider struct {
	// contains filtered or unexported fields
}

CLIPProvider implements MultimodalProvider for CLIP models.

func NewCLIPProvider added in v0.1.9

func NewCLIPProvider(modelPath string) (*CLIPProvider, error)

NewCLIPProvider creates a new CLIP provider.

func (*CLIPProvider) Dimension added in v0.1.9

func (p *CLIPProvider) Dimension() int

Dimension returns the embedding dimension (e.g., 512 for ViT-B/32).

func (*CLIPProvider) Embed added in v0.1.9

func (p *CLIPProvider) Embed(ctx context.Context, texts []string) ([][]float32, error)

Embed generates text embeddings using the CLIP text encoder.

func (*CLIPProvider) EmbedImages added in v0.1.9

func (p *CLIPProvider) EmbedImages(ctx context.Context, images [][]byte) ([][]float32, error)

EmbedImages generates image embeddings using the CLIP vision encoder.

type Config added in v0.1.2

type Config struct {
	Model        EmbeddingModel // Local embedding model implementation
	Dimension    int            // Embedding dimension
	MaxBatchSize int            // Maximum batch size for embedding generation
}

Config contains configuration parameters for the local embedding provider.

type DownloadModelInfo added in v0.1.2

type DownloadModelInfo struct {
	Name        string   // Model name
	Type        string   // Model type (e.g., "bge", "sentence-bert")
	URLs        []string // URLs of files to download
	Size        string   // Approximate total size
	Description string   // Model description
}

DownloadModelInfo contains information about a model available for download.

type DownloadProgressCallback added in v0.1.2

type DownloadProgressCallback func(modelName, fileName string, downloaded, total int64)

DownloadProgressCallback is a callback function for tracking download progress.

Parameters:

  • modelName: Name of the model being downloaded
  • fileName: Name of the file being downloaded
  • downloaded: Number of bytes downloaded so far
  • total: Total number of bytes to download (0 if unknown)
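
A callback with this signature might render progress as below. formatProgress is a hypothetical helper, and the file names are illustrative only; the point of the sketch is the handling of an unknown total.

```go
package main

import "fmt"

// formatProgress renders one progress update. A total of 0 means the
// size is unknown, so no percentage can be computed.
func formatProgress(modelName, fileName string, downloaded, total int64) string {
	if total <= 0 {
		return fmt.Sprintf("%s/%s: %d bytes", modelName, fileName, downloaded)
	}
	pct := float64(downloaded) / float64(total) * 100
	return fmt.Sprintf("%s/%s: %.1f%% (%d/%d bytes)", modelName, fileName, pct, downloaded, total)
}

func main() {
	// The closure has the documented DownloadProgressCallback signature.
	callback := func(modelName, fileName string, downloaded, total int64) {
		fmt.Println(formatProgress(modelName, fileName, downloaded, total))
	}
	callback("bge-small-zh-v1.5", "model.onnx", 512, 2048)
	// prints "bge-small-zh-v1.5/model.onnx: 25.0% (512/2048 bytes)"
	callback("bge-small-zh-v1.5", "vocab.txt", 300, 0)
	// prints "bge-small-zh-v1.5/vocab.txt: 300 bytes"
}
```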

type Downloader added in v0.1.2

type Downloader struct {
	// contains filtered or unexported fields
}

Downloader handles the downloading and caching of embedding models.

func NewDownloader added in v0.1.2

func NewDownloader(cacheDir string) *Downloader

NewDownloader creates a new downloader with the specified cache directory.

Parameters:

  • cacheDir: Directory to cache downloaded models (empty for default)

Returns:

  • *Downloader: A new downloader instance

func (*Downloader) DownloadModel added in v0.1.2

func (d *Downloader) DownloadModel(modelName string, callback DownloadProgressCallback) (string, error)

DownloadModel downloads a model by name

func (*Downloader) GetModelInfo added in v0.1.2

func (d *Downloader) GetModelInfo() []DownloadModelInfo

GetModelInfo returns information about available models

type EmbeddingModel added in v0.1.2

type EmbeddingModel interface {
	// Run performs inference on the given inputs and returns the model outputs.
	Run(inputs map[string]interface{}) (map[string]interface{}, error)

	// Close releases any resources associated with the model.
	Close() error
}

EmbeddingModel defines the interface for local embedding models. Implementations should handle model loading, inference, and resource cleanup.

type ImageProcessor added in v0.1.9

type ImageProcessor struct {
	TargetSize int
	Mean       [3]float32
	Std        [3]float32
}

ImageProcessor handles image preprocessing for vision models like CLIP.

func NewImageProcessor added in v0.1.9

func NewImageProcessor() *ImageProcessor

NewImageProcessor creates an image processor matching CLIP default preprocessing.

func (*ImageProcessor) ProcessBatch added in v0.1.9

func (p *ImageProcessor) ProcessBatch(images [][]byte) ([][][][]float32, error)

ProcessBatch decodes, resizes, crops, and normalizes a batch of images. Returns a 4D float32 slice matching ONNX `pixel_values` shape [Batch][Channel][Height][Width].

type LocalProvider added in v0.1.2

type LocalProvider struct {
	// contains filtered or unexported fields
}

LocalProvider implements the embedding.Provider interface for local models. It handles tokenization, batch processing, and model inference.

func New added in v0.1.2

func New(config Config) (*LocalProvider, error)

New creates a new local embedding provider with the given configuration.

Parameters:

  • config: Configuration parameters for the provider

Returns:

  • *LocalProvider: A new local embedding provider instance
  • error: Error if configuration is invalid or initialization fails

func (*LocalProvider) Close added in v0.1.10

func (p *LocalProvider) Close() error

Close releases resources associated with the provider

func (*LocalProvider) Dimension added in v0.1.2

func (p *LocalProvider) Dimension() int

Dimension returns the embedding dimension

func (*LocalProvider) Embed added in v0.1.2

func (p *LocalProvider) Embed(ctx context.Context, texts []string) ([][]float32, error)

Embed generates embeddings for the given texts

type Model added in v0.1.2

type Model struct {
	// contains filtered or unexported fields
}

Model is a generic model implementation for embedding generation

func NewModel added in v0.1.2

func NewModel(dimension int, modelPath string) (*Model, error)

NewModel creates a new model instance

func (*Model) Close added in v0.1.2

func (m *Model) Close() error

Close closes the model

func (*Model) Run added in v0.1.2

func (m *Model) Run(inputs map[string]interface{}) (map[string]interface{}, error)

Run runs inference on the given inputs

type ModelInfo added in v0.1.2

type ModelInfo struct {
	Type      ModelType
	Name      string
	Dimension int
	ModelPath string
}

ModelInfo contains information about a model: its type, name, embedding dimension, and file path.

func NewModelInfo added in v0.1.2

func NewModelInfo(modelPath string) (*ModelInfo, error)

NewModelInfo creates a new ModelInfo instance by analyzing the model file path.
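
The documentation does not say how the path is analyzed. One plausible approach, shown purely as an assumption, is substring matching on the lowercased path; detectModelType and its rule table are hypothetical and cover only a few of the documented model types.

```go
package main

import (
	"fmt"
	"strings"
)

// ModelType and the constants below mirror the documented declarations.
type ModelType string

const (
	ModelTypeBERT         ModelType = "bert"
	ModelTypeSentenceBERT ModelType = "sentence-bert"
	ModelTypeBGE          ModelType = "bge"
	ModelTypeCLIP         ModelType = "clip"
)

// detectModelType is a hypothetical heuristic: it reports the first
// rule whose marker occurs in the lowercased path.
func detectModelType(modelPath string) (ModelType, bool) {
	name := strings.ToLower(modelPath)
	// More specific markers come first: "sentence-bert" contains "bert".
	rules := []struct {
		marker string
		kind   ModelType
	}{
		{"bge", ModelTypeBGE},
		{"clip", ModelTypeCLIP},
		{"sentence-bert", ModelTypeSentenceBERT},
		{"bert", ModelTypeBERT},
	}
	for _, r := range rules {
		if strings.Contains(name, r.marker) {
			return r.kind, true
		}
	}
	return "", false
}

func main() {
	kind, ok := detectModelType("models/bge-small-zh-v1.5/model.onnx")
	fmt.Println(kind, ok) // prints "bge true"
}
```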

type ModelType added in v0.1.2

type ModelType string

ModelType defines the type of embedding model.

const (
	// ModelTypeBERT represents BERT models
	ModelTypeBERT ModelType = "bert"
	// ModelTypeSentenceBERT represents Sentence-BERT models
	ModelTypeSentenceBERT ModelType = "sentence-bert"
	// ModelTypeBGE represents BGE models
	ModelTypeBGE ModelType = "bge"
	// ModelTypeGPT represents GPT models
	ModelTypeGPT ModelType = "gpt"
	// ModelTypeFastText represents FastText models
	ModelTypeFastText ModelType = "fasttext"
	// ModelTypeGloVe represents GloVe models
	ModelTypeGloVe ModelType = "glove"
	// ModelTypeCLIP represents CLIP models
	ModelTypeCLIP ModelType = "clip"
)

type MultimodalProvider added in v0.1.9

type MultimodalProvider interface {
	Provider

	// EmbedImages generates embeddings for the given images.
	//
	// Parameters:
	// - ctx: Context for cancellation and timeout
	// - images: Slice of byte arrays, where each byte array is a raw image (JPEG/PNG)
	//
	// Returns:
	// - [][]float32: Slice of embeddings, one for each image
	// - error: Error if embedding generation fails
	EmbedImages(ctx context.Context, images [][]byte) ([][]float32, error)
}

MultimodalProvider extends Provider with image embedding capabilities.

func WithCLIP added in v0.1.9

func WithCLIP(modelName, modelPath string) (MultimodalProvider, error)

WithCLIP creates a CLIP multimodal embedding provider.

Parameters:

  • modelName: Model name, for example "clip-vit-base-patch32"
  • modelPath: Model path; if empty, the model is downloaded automatically

type ProgressCallback

type ProgressCallback func(current, total int, err error) bool

ProgressCallback is called periodically during batch processing. It receives the current progress and can return false to cancel processing.

Parameters:

  • current: Number of texts processed so far
  • total: Total number of texts to process
  • err: Error if processing failed, nil otherwise

Returns true to continue processing, false to cancel
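
The callback contract can be illustrated with a small driver loop. runBatches and errCancelled are hypothetical names; the sketch assumes the callback is invoked once per batch and that an embedding error takes precedence over a cancellation request.

```go
package main

import (
	"errors"
	"fmt"
)

// ProgressCallback has the documented signature.
type ProgressCallback func(current, total int, err error) bool

var errCancelled = errors.New("processing cancelled by callback")

// runBatches is a hypothetical driver: it walks texts in batches, runs
// work on each batch, and reports progress after every batch.
func runBatches(texts []string, batchSize int, work func([]string) error, cb ProgressCallback) error {
	if batchSize <= 0 {
		return errors.New("batchSize must be positive")
	}
	total := len(texts)
	for start := 0; start < total; start += batchSize {
		end := start + batchSize
		if end > total {
			end = total
		}
		err := work(texts[start:end])
		done := end
		if err != nil {
			done = start // the failed batch does not count as processed
		}
		keepGoing := cb == nil || cb(done, total, err)
		if err != nil {
			return err
		}
		if !keepGoing {
			return errCancelled
		}
	}
	return nil
}

func main() {
	texts := []string{"a", "b", "c", "d", "e"}
	noop := func([]string) error { return nil }
	err := runBatches(texts, 2, noop, func(current, total int, err error) bool {
		fmt.Printf("%d/%d\n", current, total)
		return current < 4 // stop once four texts are done
	})
	fmt.Println(err)
	// prints "2/5", then "4/5", then "processing cancelled by callback"
}
```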

type Provider

type Provider interface {
	// Embed generates embeddings for the given texts.
	//
	// Parameters:
	// - ctx: Context for cancellation and timeout
	// - texts: Slice of texts to embed
	//
	// Returns:
	// - [][]float32: Slice of embeddings, one for each text
	// - error: Error if embedding generation fails
	Embed(ctx context.Context, texts []string) ([][]float32, error)

	// Dimension returns the dimension of the embeddings generated by this provider.
	//
	// Returns:
	// - int: Embedding dimension
	Dimension() int
}

Provider defines the interface for embedding providers.

This interface is implemented by all embedding model providers and allows the application to generate embeddings for text.

Example implementation:

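type LocalProvider struct {
    model     *LocalModel
    dimension int
}

func (p *LocalProvider) Embed(ctx context.Context, texts []string) ([][]float32, error) {
    // Generate embeddings using the local model. Inference is elided in this
    // example, so the placeholder returns an error to keep the method compilable.
    return nil, errors.New("embedding generation not implemented")
}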

func (p *LocalProvider) Dimension() int {
    return p.dimension
}

func NewProvider added in v0.1.2

func NewProvider(modelPath string) (Provider, error)

NewProvider creates a new provider based on the model path

func WithBEG added in v0.1.5

func WithBEG(modelName, modelPath string) (Provider, error)

WithBEG creates a BGE embedding provider.

Parameters:

  • modelName: Model name, for example "bge-small-zh-v1.5"
  • modelPath: Model path; if empty, the model is downloaded automatically

func WithBERT added in v0.1.5

func WithBERT(modelName, modelPath string) (Provider, error)

WithBERT creates a BERT embedding provider.

Parameters:

  • modelName: Model name, for example "all-mpnet-base-v2"
  • modelPath: Model path; if empty, the model is downloaded automatically

type SentenceBERTProvider added in v0.1.2

type SentenceBERTProvider struct {
	// contains filtered or unexported fields
}

SentenceBERTProvider is an embedding provider for Sentence-BERT models

func NewSentenceBERTProvider added in v0.1.2

func NewSentenceBERTProvider(modelPath string) (*SentenceBERTProvider, error)

NewSentenceBERTProvider creates a new Sentence-BERT provider

func (*SentenceBERTProvider) Dimension added in v0.1.2

func (p *SentenceBERTProvider) Dimension() int

Dimension returns the embedding dimension

func (*SentenceBERTProvider) Embed added in v0.1.2

func (p *SentenceBERTProvider) Embed(ctx context.Context, texts []string) ([][]float32, error)

Embed generates embeddings for the given texts

type Tokenizer added in v0.1.2

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer converts text into token IDs for embedding models. It implements a simple wordpiece tokenization scheme with vocabulary building during tokenization.

Tokenizer is safe for concurrent use from multiple goroutines. The vocabulary is built dynamically as new tokens are encountered.

Example:

tokenizer, _ := embedding.NewTokenizer()
inputIDs, attentionMask, _ := tokenizer.TokenizeBatch([]string{"hello world"})

func NewTokenizer added in v0.1.2

func NewTokenizer() (*Tokenizer, error)

NewTokenizer creates a new Tokenizer with an empty vocabulary. The tokenizer will build its vocabulary dynamically as texts are processed.

Returns a new Tokenizer ready for use

func (*Tokenizer) TokenizeBatch added in v0.1.2

func (t *Tokenizer) TokenizeBatch(texts []string) ([][]int64, [][]int64, error)

TokenizeBatch tokenizes multiple texts into token IDs with attention masks. This is the main method for converting text to model input.

Each input text is tokenized and padded/truncated to a uniform length. The vocabulary is built dynamically: unknown tokens are assigned new IDs.

Parameters:

  • texts: Slice of text strings to tokenize

Returns:

  • inputIDs: 2D slice of token IDs, shape [len(texts), maxLength]
  • attentionMask: 2D slice of binary masks, shape [len(texts), maxLength]; 1 indicates a real token, 0 indicates padding
  • error: Any error that occurred during tokenization
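
Framing, truncation, padding, and mask construction can be sketched as follows. The constants mirror the documented token IDs; padBatch is a hypothetical helper, and keeping the separator token when truncating is an assumption rather than documented behavior.

```go
package main

import "fmt"

// Token IDs mirror the documented constants.
const (
	CLSTokenID = 101
	SEITokenID = 102
	PadTokenID = 0
)

// padBatch frames each sequence as [CLS] tokens [SEP], truncates it to
// fit maxLength, pads the remainder, and builds the attention masks.
func padBatch(seqs [][]int64, maxLength int) (inputIDs, attentionMask [][]int64) {
	if maxLength < 2 {
		maxLength = 2 // always leave room for the two special tokens
	}
	inputIDs = make([][]int64, len(seqs))
	attentionMask = make([][]int64, len(seqs))
	for i, seq := range seqs {
		if len(seq) > maxLength-2 {
			seq = seq[:maxLength-2] // truncate, keeping room for CLS and SEP
		}
		ids := make([]int64, maxLength)
		mask := make([]int64, maxLength)
		ids[0] = CLSTokenID
		copy(ids[1:], seq)
		ids[len(seq)+1] = SEITokenID
		used := len(seq) + 2
		for j := 0; j < used; j++ {
			mask[j] = 1 // real token
		}
		for j := used; j < maxLength; j++ {
			ids[j] = PadTokenID // mask stays 0 for padding
		}
		inputIDs[i], attentionMask[i] = ids, mask
	}
	return inputIDs, attentionMask
}

func main() {
	ids, mask := padBatch([][]int64{{10000, 10001}, {10002}}, 5)
	fmt.Println(ids)  // prints "[[101 10000 10001 102 0] [101 10002 102 0 0]]"
	fmt.Println(mask) // prints "[[1 1 1 1 0] [1 1 1 0 0]]"
}
```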

func (*Tokenizer) VocabSize added in v0.1.10

func (t *Tokenizer) VocabSize() int

VocabSize returns the current number of unique tokens in the vocabulary. This grows as new tokens are encountered during tokenization.

Returns the vocabulary size
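
A dynamically growing, goroutine-safe vocabulary might look like the sketch below. The vocab type is hypothetical, whitespace splitting stands in for the wordpiece scheme, and starting dynamic IDs exactly at VocabStart is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// VocabStart mirrors the documented constant.
const VocabStart = 10000

// vocab is a hypothetical vocabulary that assigns IDs on first sight.
type vocab struct {
	mu   sync.Mutex
	ids  map[string]int64
	next int64
}

func newVocab() *vocab {
	return &vocab{ids: make(map[string]int64), next: VocabStart}
}

// ID returns the ID for token, assigning the next free ID if the token
// has not been seen before. The mutex makes concurrent calls safe.
func (v *vocab) ID(token string) int64 {
	v.mu.Lock()
	defer v.mu.Unlock()
	if id, ok := v.ids[token]; ok {
		return id
	}
	id := v.next
	v.ids[token] = id
	v.next++
	return id
}

// Size reports how many distinct tokens have been seen so far.
func (v *vocab) Size() int {
	v.mu.Lock()
	defer v.mu.Unlock()
	return len(v.ids)
}

// encode lowercases text, splits it on whitespace, and maps each word to
// its ID. Real wordpiece tokenization would split words further.
func (v *vocab) encode(text string) []int64 {
	words := strings.Fields(strings.ToLower(text))
	out := make([]int64, len(words))
	for i, w := range words {
		out[i] = v.ID(w)
	}
	return out
}

func main() {
	v := newVocab()
	first := v.encode("hello world")
	second := v.encode("Hello again")
	fmt.Println(first, second, v.Size()) // prints "[10000 10001] [10000 10002] 3"
}
```

Because IDs depend on the order in which tokens are first seen, two tokenizer instances fed different texts will generally disagree on IDs.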
