hnsw

package module
v0.0.0-...-7845765 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 12, 2026 License: MIT Imports: 13 Imported by: 0

Documentation

Overview

Package hnsw implements a pure Go HNSW (Hierarchical Navigable Small World) graph for approximate nearest neighbor search.

This is a from-scratch implementation that avoids CGo/FAISS dependency, following the algorithm from "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs" by Malkov & Yashunin (2018).

Package hnsw provides a pure-Go HNSW (Hierarchical Navigable Small World) vector index. It is designed as a standalone, dependency-free module that can be used as a drop-in replacement for CGO-based vector libraries.

Index

Constants

View Source
const CSRMagic uint32 = 0x474C454E // "GLEN"

CSRMagic identifies a gleann CSR file.

View Source
const CSRVersion uint32 = 1

CSRVersion is the current format version.

Variables

This section is empty.

Functions

func CosineDistance

func CosineDistance(a, b []float32) float32

CosineDistance computes cosine distance: 1 - cosine_similarity. Values range from 0 (identical) to 2 (opposite).

func InnerProductDistance

func InnerProductDistance(a, b []float32) float32

InnerProductDistance computes negative inner product distance. We negate because HNSW expects lower = more similar.

func L2DistanceSquared

func L2DistanceSquared(a, b []float32) float32

L2DistanceSquared computes the squared L2 (Euclidean) distance between two vectors.

func PruneEmbeddings

func PruneEmbeddings(g *Graph, csr *CSRGraph, keepFraction float64)

PruneEmbeddings removes embeddings from the graph, keeping only a specified fraction. Always keeps the entry point and high-degree nodes. This is LEANN's core storage optimization.

keepFraction: 0.0 = keep none (except entry point), 1.0 = keep all.

Types

type BackendBuilder

type BackendBuilder interface {
	Build(ctx context.Context, embeddings [][]float32) ([]byte, error)
	AddVectors(ctx context.Context, indexData []byte, embeddings [][]float32, startID int64) ([]byte, error)
	RemoveVectors(ctx context.Context, indexData []byte, ids []int64) ([]byte, error)
}

BackendBuilder constructs and serializes an index.

type BackendSearcher

type BackendSearcher interface {
	Load(ctx context.Context, indexData []byte, meta IndexMeta) error
	Search(ctx context.Context, query []float32, topK int) ([]int64, []float32, error)
	SearchWithRecompute(ctx context.Context, query []float32, topK int, recompute EmbeddingRecomputer) ([]int64, []float32, error)
	Close() error
}

BackendSearcher loads and queries an index.

type Builder

type Builder struct {
	// contains filtered or unexported fields
}

Builder builds HNSW indexes.

func (*Builder) AddVectors

func (b *Builder) AddVectors(ctx context.Context, indexData []byte, embeddings [][]float32, startID int64) ([]byte, error)

func (*Builder) Build

func (b *Builder) Build(ctx context.Context, embeddings [][]float32) ([]byte, error)

func (*Builder) RemoveVectors

func (b *Builder) RemoveVectors(ctx context.Context, indexData []byte, ids []int64) ([]byte, error)

type CSRGraph

type CSRGraph struct {
	// Header fields
	NumNodes   int64
	Dimensions int
	MaxLevel   int
	EntryPoint int64
	M          int

	// Per-level CSR structure
	Levels []CSRLevel

	// Optional stored embeddings (for high-degree nodes kept during pruning).
	// Map from node ID to embedding vector.
	StoredEmbeddings map[int64][]float32

	// Node-to-internal ID mapping.
	IDMap []int64 // internal index -> external ID

	// Per-node level in the original HNSW graph.
	// NodeLevels[i] is the level of IDMap[i].
	NodeLevels []int
}

CSRGraph is a Compressed Sparse Row representation of an HNSW graph. This is the compact format used for storage — embeddings are pruned and recomputed on-demand during search.

Format mirrors Python LEANN's convert_to_csr:

Header: magic, version, numNodes, dimensions, maxLevel, entryPoint, M
Per-level:
  LevelPtr[level]   -> offset into NodeOffsets
  NodeOffsets[node]  -> offset into NeighborsData
  NeighborsData      -> flat list of neighbor IDs

Optional: embeddings for a fraction of nodes (entry point + high-degree nodes).

func ConvertToCSR

func ConvertToCSR(g *Graph) *CSRGraph

ConvertToCSR converts an HNSW graph to CSR format.

func ReadCSR

func ReadCSR(r io.Reader) (*CSRGraph, error)

ReadCSR deserializes a CSR graph from a reader.

func (*CSRGraph) Stats

func (csr *CSRGraph) Stats() StorageStats

Stats computes storage statistics for the CSR graph.

func (*CSRGraph) ToGraph

func (csr *CSRGraph) ToGraph() *Graph

ToGraph reconstructs an HNSW Graph from a CSR representation. Nodes that had their embeddings pruned will have nil Vector fields.

func (*CSRGraph) ToGraphWithDistance

func (csr *CSRGraph) ToGraphWithDistance(distFunc DistanceFunc) *Graph

ToGraphWithDistance reconstructs an HNSW Graph with a custom distance function.

func (*CSRGraph) WriteTo

func (csr *CSRGraph) WriteTo(w io.Writer) (int64, error)

WriteTo serializes the CSR graph to a writer.

type CSRLevel

type CSRLevel struct {
	// NodeOffsets[i] gives the start index in Neighbors for node i.
	// NodeOffsets[i+1] - NodeOffsets[i] gives the number of neighbors.
	NodeOffsets []int64

	// Neighbors is a flat array of neighbor IDs.
	Neighbors []int64

	// NumNodes at this level.
	NumNodes int
}

CSRLevel represents one level of the HNSW graph in CSR format.

type Candidate

type Candidate struct {
	ID       int64
	Distance float32
}

Candidate is a (id, distance) pair used during search.

type Config

type Config struct {
	IndexDir   string     `json:"index_dir"`
	Backend    string     `json:"backend"`
	HNSWConfig HNSWConfig `json:"hnsw_config,omitempty"`
}

Config is a minimal config struct carrying HNSW settings. In the main gleann application this is the gleann.Config struct; here we define a local equivalent to keep gleann-hnsw dependency-free.

func DefaultConfig

func DefaultConfig() Config

DefaultConfig returns a Config with sensible defaults.

type DistanceFunc

type DistanceFunc func(a, b []float32) float32

DistanceFunc is a function that computes distance between two vectors. Lower values indicate more similar vectors.

func GetDistanceFunc

func GetDistanceFunc(metric string) DistanceFunc

GetDistanceFunc returns the distance function for the given metric name.

type DistanceMetric

type DistanceMetric string

DistanceMetric specifies the distance function for vector comparison.

const (
	DistanceL2     DistanceMetric = "l2"
	DistanceCosine DistanceMetric = "cosine"
	DistanceIP     DistanceMetric = "ip"
)

type EmbeddingRecomputer

type EmbeddingRecomputer func(ctx context.Context, ids []int64) ([][]float32, error)

EmbeddingRecomputer is a function that recomputes embeddings for given passage IDs.

type Factory

type Factory struct{}

Factory creates HNSW backend builders and searchers.

func (*Factory) Name

func (f *Factory) Name() string

func (*Factory) NewBuilder

func (f *Factory) NewBuilder(config Config) BackendBuilder

func (*Factory) NewSearcher

func (f *Factory) NewSearcher(config Config) BackendSearcher

type Graph

type Graph struct {
	// contains filtered or unexported fields
}

Graph is a multi-level HNSW graph for approximate nearest neighbor search.

func NewGraph

func NewGraph(m, efConstruction, dimensions int) *Graph

NewGraph creates a new HNSW graph with the given parameters.

func NewGraphFromTopology

func NewGraphFromTopology(topo GraphTopology, distFunc DistanceFunc) *Graph

NewGraphFromTopology constructs a Graph from an externally-built topology. The returned graph is ready for ConvertToCSR / PruneEmbeddings / Search.

func NewGraphWithDistance

func NewGraphWithDistance(m, efConstruction, dimensions int, distFunc DistanceFunc) *Graph

NewGraphWithDistance creates a new HNSW graph with a custom distance function. If distFunc is nil, L2 squared distance is used.

func NewGraphWithOptions

func NewGraphWithOptions(m, efConstruction, dimensions int, distFunc DistanceFunc, useHeuristic bool) *Graph

NewGraphWithOptions creates a new HNSW graph with all options.

func (*Graph) AllNodeIDs

func (g *Graph) AllNodeIDs() []int64

AllNodeIDs returns all node IDs in the graph.

func (*Graph) Dimensions

func (g *Graph) Dimensions() int

Dimensions returns the vector dimensions.

func (*Graph) GetEntryPoint

func (g *Graph) GetEntryPoint() int64

GetEntryPoint returns the entry point node ID.

func (*Graph) GetNode

func (g *Graph) GetNode(id int64) (*Node, bool)

GetNode returns the node with the given ID.

func (*Graph) Insert

func (g *Graph) Insert(id int64, vector []float32)

Insert adds a vector to the graph with the given ID.

func (*Graph) MaxLevel

func (g *Graph) MaxLevel() int

MaxLevel returns the maximum level in the graph.

func (*Graph) RemoveVector

func (g *Graph) RemoveVector(id int64)

RemoveVector sets a node's vector to nil (for embedding pruning).

func (*Graph) Search

func (g *Graph) Search(query []float32, topK, ef int) []Candidate

Search performs an approximate nearest neighbor search.

func (*Graph) SearchWithRecompute

func (g *Graph) SearchWithRecompute(query []float32, topK, ef int, recompute func(ids []int64) [][]float32) []Candidate

SearchWithRecompute performs search using on-demand embedding recomputation. Instead of using stored vectors, it calls the recompute function to get embeddings only for nodes visited during graph traversal.

func (*Graph) Size

func (g *Graph) Size() int64

Size returns the number of nodes in the graph.

type GraphTopology

type GraphTopology struct {
	NumNodes   int
	Dimensions int
	M          int
	MaxLevel   int
	EntryPoint int64

	// Per-node data (len = NumNodes).
	NodeLevels []int
	Embeddings [][]float32

	// Per-node-per-level neighbor lists.
	// Neighbors[i] has len = NodeLevels[i]+1; Neighbors[i][l] = IDs at level l.
	Neighbors [][][]int64
}

GraphTopology holds the raw graph structure for importing an externally-constructed HNSW graph (e.g. from FAISS).

type HNSWConfig

type HNSWConfig struct {
	M                 int            `json:"m"`
	EfConstruction    int            `json:"ef_construction"`
	EfSearch          int            `json:"ef_search"`
	UseMmap           bool           `json:"use_mmap,omitempty"`
	MaxLevel          int            `json:"max_level,omitempty"`
	DistanceMetric    DistanceMetric `json:"distance_metric,omitempty"`
	UseHeuristic      bool           `json:"use_heuristic"`
	PruneEmbeddings   bool           `json:"prune_embeddings"`
	PruneKeepFraction float64        `json:"prune_keep_fraction"`
}

HNSWConfig holds HNSW-specific build and search parameters.

type IndexMeta

type IndexMeta struct {
	Name           string    `json:"name"`
	Backend        string    `json:"backend"`
	EmbeddingModel string    `json:"embedding_model"`
	Dimensions     int       `json:"dimensions"`
	NumPassages    int       `json:"num_passages"`
	CreatedAt      time.Time `json:"created_at"`
	UpdatedAt      time.Time `json:"updated_at"`
	Version        string    `json:"version"`
}

IndexMeta stores metadata about a built index.

func (IndexMeta) MarshalJSON

func (m IndexMeta) MarshalJSON() ([]byte, error)

MarshalJSON implements custom JSON marshaling for IndexMeta.

type MmapBackendSearcher

type MmapBackendSearcher interface {
	BackendSearcher
	LoadFromFile(ctx context.Context, path string) error
}

MmapBackendSearcher optionally supports zero-copy memory-mapped search.

type MmapGraph

type MmapGraph struct {

	// Extracted header info
	NumNodes   int64
	Dimensions int
	MaxLevel   int
	EntryPoint int64
	M          int
	NumStored  int64
	// contains filtered or unexported fields
}

MmapGraph is a zero-copy, memory-mapped representation of the CSRGraph. It bypasses the Go Garbage Collector by keeping all node and edge data off-heap in a flat `[]byte` slice backed directly by a file on disk.

func OpenMmapGraph

func OpenMmapGraph(path string) (*MmapGraph, error)

OpenMmapGraph opens a CSR binary file and maps it directly into memory.

func (*MmapGraph) Close

func (mg *MmapGraph) Close() error

Close unmaps the memory and closes the file.

func (*MmapGraph) GetExternalID

func (mg *MmapGraph) GetExternalID(internalIdx int64) int64

GetExternalID returns the external node ID for a given internal index.

func (*MmapGraph) GetNeighbors

func (mg *MmapGraph) GetNeighbors(internalIdx int64, level int) []int64

GetNeighbors returns the internal neighbor indices for a node at a specific level. This is the core operation of graph traversal, completely GC-free.

func (*MmapGraph) GetNodeLevel

func (mg *MmapGraph) GetNodeLevel(internalIdx int64) int

GetNodeLevel returns the maximum HNSW level for a given internal index.

func (*MmapGraph) GetStoredEmbedding

func (mg *MmapGraph) GetStoredEmbedding(externalID int64) []float32

GetStoredEmbedding tries to retrieve a pruned stored embedding for an internal index.

type MmapSearcher

type MmapSearcher struct {
	// contains filtered or unexported fields
}

MmapSearcher implements BackendSearcher directly on a memory-mapped graph.

func (*MmapSearcher) Close

func (s *MmapSearcher) Close() error

func (*MmapSearcher) Load

func (s *MmapSearcher) Load(ctx context.Context, indexData []byte, meta IndexMeta) error

func (*MmapSearcher) LoadFromFile

func (s *MmapSearcher) LoadFromFile(ctx context.Context, path string) error

LoadFromFile allows mmapping the index directly bypassing bytes parameter.

func (*MmapSearcher) Search

func (s *MmapSearcher) Search(ctx context.Context, query []float32, topK int) ([]int64, []float32, error)

func (*MmapSearcher) SearchWithRecompute

func (s *MmapSearcher) SearchWithRecompute(ctx context.Context, query []float32, topK int, recompute EmbeddingRecomputer) ([]int64, []float32, error)

type Node

type Node struct {
	ID     int64
	Vector []float32
	Level  int
	// Neighbors at each level. neighbors[level] = list of neighbor IDs.
	Neighbors [][]int64
}

Node represents a single node in the HNSW graph.

type Searcher

type Searcher struct {
	// contains filtered or unexported fields
}

Searcher searches an HNSW index.

func (*Searcher) Close

func (s *Searcher) Close() error

func (*Searcher) Load

func (s *Searcher) Load(ctx context.Context, indexData []byte, meta IndexMeta) error

func (*Searcher) Search

func (s *Searcher) Search(ctx context.Context, query []float32, topK int) ([]int64, []float32, error)

func (*Searcher) SearchWithRecompute

func (s *Searcher) SearchWithRecompute(ctx context.Context, query []float32, topK int, recompute EmbeddingRecomputer) ([]int64, []float32, error)

type StorageStats

type StorageStats struct {
	NumNodes           int64
	NumLevels          int
	TotalEdges         int64
	StoredEmbeddings   int
	PrunedEmbeddings   int64
	GraphSizeBytes     int64
	EmbeddingSizeBytes int64
	TotalSizeBytes     int64
	CompressionRatio   float64
	OriginalSizeBytes  int64
}

StorageStats returns statistics about CSR storage.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL