projection

package
v0.1.4
Warning

This package is not in the latest version of its module.

Published: Jan 5, 2026 License: GPL-3.0 Imports: 5 Imported by: 0

Documentation

Overview

Package projection provides dimensionality reduction and clustering for high-dimensional embedding vectors.

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) Overview

HDBSCAN is a density-based clustering algorithm that finds clusters of varying densities without requiring the number of clusters to be specified in advance. Unlike K-means, it can discover arbitrarily shaped clusters and automatically identifies noise points (outliers).

The algorithm works in five main steps:

  1. Compute Core Distances: For each point, find the distance to its k-th nearest neighbor, where k is set by MinSamples. This measures local density: points in dense regions have small core distances.

  2. Build Mutual Reachability Graph: Transform the distance metric to account for density. The mutual reachability distance between points a and b is:

         max(core_dist(a), core_dist(b), dist(a, b))

     This makes sparse regions "farther apart" even if Euclidean distance is small.

  3. Construct Minimum Spanning Tree: Build an MST using mutual reachability distances. This captures the hierarchical cluster structure—edges with small weights connect dense regions, while edges with large weights span sparse regions.

  4. Build Condensed Tree: Walk the MST from longest to shortest edges, tracking when clusters split. Small splits (fewer than MinClusterSize points) are treated as points "falling out" of a cluster rather than true splits.

  5. Extract Clusters: Use cluster stability (how long points persist in a cluster) to select the most prominent clusters. Points not in any stable cluster are noise.

Key advantages over other clustering methods:

  • No need to specify number of clusters
  • Robust to noise and outliers (labels them as -1)
  • Finds clusters of varying densities
  • Produces a hierarchy that can be cut at different levels

Reference: Campello, R.J.G.B., Moulavi, D., & Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. https://doi.org/10.1007/978-3-642-37456-2_14

Principal Component Analysis (PCA) Overview

PCA is a technique that reduces high-dimensional data (like 768-dimensional text embeddings) down to fewer dimensions (like 2D for visualization) while preserving as much variance as possible.

The key insight is that most high-dimensional data lies on or near a lower-dimensional subspace. PCA finds this subspace by identifying the directions (principal components) along which the data varies the most.

Why We Use Singular Value Decomposition (SVD)

While PCA can be computed by finding eigenvectors of the covariance matrix, SVD is generally more numerically stable, because explicitly forming X^T * X squares the condition number of X, and it is often more efficient. For a centered data matrix X, the right singular vectors (V) give us the principal components directly, without needing to compute X^T * X.

The mathematical relationship is:

  • X = U * Σ * V^T (singular value decomposition)
  • The columns of V are the principal components (directions of maximum variance)
  • The singular values in Σ indicate how much variance each component captures: for n centered samples, component i has variance σ_i^2 / (n - 1)
  • Projecting data: X_projected = X * V[:, 0:k] gives us the k-dimensional representation

UMAP (Uniform Manifold Approximation and Projection) Overview

UMAP is a nonlinear dimensionality reduction technique that aims to preserve local neighborhood structure, along with some global structure, in cases where linear methods like PCA cannot follow a curved manifold. It works by:

  1. Constructing a k-nearest neighbor graph in high-dimensional space
  2. Converting distances to fuzzy membership strengths (fuzzy simplicial set)
  3. Initializing a low-dimensional embedding via spectral methods
  4. Optimizing the embedding via stochastic gradient descent with negative sampling

Reference: McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. https://arxiv.org/abs/1802.03426

This is a Go port of the Python umap-learn library: https://github.com/lmcinnes/umap

Constants

This section is empty.

Variables

This section is empty.

Functions

func ClusterLabels added in v0.1.4

func ClusterLabels(vectors [][]float32, config HDBSCANConfig) []int

ClusterLabels is a convenience function that returns only cluster labels. Points labeled -1 are noise (not part of any cluster).
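A label slice of this shape is usually regrouped before display. The sketch below shows one way to do that; groupByLabel is a hypothetical helper, and the input slice is hand-written sample data rather than output from ClusterLabels.

```go
package main

import "fmt"

// groupByLabel splits point indices into clusters keyed by label and a
// separate list of noise points (label -1).
func groupByLabel(labels []int) (clusters map[int][]int, noise []int) {
	clusters = make(map[int][]int)
	for i, label := range labels {
		if label == -1 {
			noise = append(noise, i)
			continue
		}
		clusters[label] = append(clusters[label], i)
	}
	return clusters, noise
}

func main() {
	// Sample labels in the documented format: one entry per input vector.
	labels := []int{0, 1, -1, 0, 1, -1}
	clusters, noise := groupByLabel(labels)
	fmt.Println("clusters:", clusters)
	fmt.Println("noise:", noise)
}
```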

Types

type COOMatrix added in v0.1.3

type COOMatrix struct {
	Rows []int
	Cols []int
	Data []float64
	NRow int
	NCol int
}

COOMatrix represents a sparse matrix in coordinate (COO) format.
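As an illustration of the coordinate format, the sketch below multiplies a COO matrix by a dense vector. It uses a local type, cooMatrix, that mirrors the documented fields, and it assumes the usual COO convention that Rows[k], Cols[k], and Data[k] together describe one stored entry, with duplicate entries summed. The mulVec helper is hypothetical; the package does not document such a method.

```go
package main

import "fmt"

// cooMatrix mirrors the documented COOMatrix fields for this sketch.
type cooMatrix struct {
	Rows []int
	Cols []int
	Data []float64
	NRow int
	NCol int
}

// mulVec returns m * x. Each stored entry contributes
// Data[k] * x[Cols[k]] to output row Rows[k].
func mulVec(m cooMatrix, x []float64) []float64 {
	y := make([]float64, m.NRow)
	for k, v := range m.Data {
		y[m.Rows[k]] += v * x[m.Cols[k]]
	}
	return y
}

func main() {
	// The 2x2 matrix [[1, 0], [2, 3]] with three stored entries.
	m := cooMatrix{
		Rows: []int{0, 1, 1},
		Cols: []int{0, 0, 1},
		Data: []float64{1, 2, 3},
		NRow: 2,
		NCol: 2,
	}
	fmt.Println(mulVec(m, []float64{1, 1}))
}
```

COO is convenient for building a sparse graph incrementally, since adding an edge is three appends; formats such as CSR are usually preferred once the matrix is fixed and repeated row access matters.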

type ClusterResult added in v0.1.4

type ClusterResult struct {
	Labels        []int     // Cluster assignment for each point (-1 = noise)
	Probabilities []float64 // Confidence score (0-1) for each point's cluster membership
}

ClusterResult contains the output of HDBSCAN clustering.
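The two slices are parallel, so they can be combined to keep only confidently assigned points. The sketch below is one way to do that; confidentMembers and the 0.8 threshold are hypothetical, and the inputs are hand-written sample data in the documented format.

```go
package main

import "fmt"

// confidentMembers returns the indices of points that belong to a cluster
// (label != -1) with membership probability of at least minProb.
func confidentMembers(labels []int, probs []float64, minProb float64) []int {
	var keep []int
	for i, label := range labels {
		if label != -1 && probs[i] >= minProb {
			keep = append(keep, i)
		}
	}
	return keep
}

func main() {
	labels := []int{0, 0, -1, 1, 1}
	probs := []float64{0.95, 0.40, 0.0, 0.85, 0.80}
	fmt.Println(confidentMembers(labels, probs, 0.8))
}
```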

func Cluster added in v0.1.4

func Cluster(vectors [][]float32, config HDBSCANConfig) ClusterResult

Cluster performs HDBSCAN clustering on high-dimensional vectors. Returns cluster labels and membership probabilities for each point.

The algorithm pipeline:

  1. Compute core distances (local density estimation)
  2. Build minimum spanning tree using mutual reachability distance
  3. Convert MST to single-linkage dendrogram
  4. Condense the tree by removing spurious splits
  5. Extract stable clusters using persistence-based selection
  6. Compute membership probabilities based on cluster lifetime
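Step 2 of the pipeline can be illustrated with Prim's algorithm over a dense weight matrix. This is a sketch only: the source does not say which MST algorithm the package uses, the names edge and primMST are hypothetical, and the weights below stand in for mutual reachability distances.

```go
package main

import (
	"fmt"
	"math"
)

// edge is one undirected MST edge with its weight.
type edge struct {
	From, To int
	Weight   float64
}

// primMST builds a minimum spanning tree of the complete graph whose
// symmetric weights are given by w. Runs in O(n^2), which suits dense graphs.
func primMST(w [][]float64) []edge {
	n := len(w)
	if n == 0 {
		return nil
	}
	inTree := make([]bool, n)
	best := make([]float64, n) // cheapest known edge into each vertex
	parent := make([]int, n)   // other endpoint of that edge
	for i := range best {
		best[i] = math.Inf(1)
		parent[i] = -1
	}
	best[0] = 0
	edges := make([]edge, 0, n-1)
	for step := 0; step < n; step++ {
		// Pick the cheapest vertex not yet in the tree.
		u := -1
		for i := 0; i < n; i++ {
			if !inTree[i] && (u == -1 || best[i] < best[u]) {
				u = i
			}
		}
		inTree[u] = true
		if parent[u] >= 0 {
			edges = append(edges, edge{From: parent[u], To: u, Weight: best[u]})
		}
		// Relax edges from u to the remaining vertices.
		for v := 0; v < n; v++ {
			if !inTree[v] && w[u][v] < best[v] {
				best[v] = w[u][v]
				parent[v] = u
			}
		}
	}
	return edges
}

func main() {
	w := [][]float64{
		{0, 1, 4},
		{1, 0, 2},
		{4, 2, 0},
	}
	for _, e := range primMST(w) {
		fmt.Printf("%d - %d (weight %.1f)\n", e.From, e.To, e.Weight)
	}
}
```

Sorting the resulting edges by weight yields the merge order of the single-linkage dendrogram described in step 3.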

type HDBSCANConfig added in v0.1.4

type HDBSCANConfig struct {
	MinClusterSize int // Minimum points required to form a cluster (default: 5)
	MinSamples     int // Points used to estimate density; affects core distance (default: MinClusterSize)
}

HDBSCANConfig holds hyperparameters for HDBSCAN clustering.

func DefaultHDBSCANConfig added in v0.1.4

func DefaultHDBSCANConfig() HDBSCANConfig

DefaultHDBSCANConfig returns sensible default hyperparameters. MinClusterSize=5 works well for most datasets; increase for larger datasets or when you want to ignore small clusters.
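The documented defaults (MinClusterSize of 5, MinSamples equal to MinClusterSize) can be expressed as a small fill-in function. The sketch below uses a local mirror type, hdbscanConfig, and a hypothetical helper, withDefaults. Treating non-positive fields as "unset" is an assumption made for illustration; the source does not describe how the package handles zero values.

```go
package main

import "fmt"

// hdbscanConfig mirrors the documented HDBSCANConfig fields for this sketch.
type hdbscanConfig struct {
	MinClusterSize int
	MinSamples     int
}

// withDefaults fills unset fields with the documented defaults.
// Assumption: a non-positive value means "use the default".
func withDefaults(c hdbscanConfig) hdbscanConfig {
	if c.MinClusterSize <= 0 {
		c.MinClusterSize = 5
	}
	if c.MinSamples <= 0 {
		c.MinSamples = c.MinClusterSize
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", withDefaults(hdbscanConfig{}))
	fmt.Printf("%+v\n", withDefaults(hdbscanConfig{MinClusterSize: 20}))
}
```

Raising MinClusterSize while leaving MinSamples unset raises both values, which makes the density estimate more conservative as well as discarding small clusters.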

type Point2D

type Point2D struct {
	X, Y float64
	Text string
}

Point2D represents a single data point projected into 2D space for visualization. It preserves the original text label for display in the UI.

func ProjectTo2D

func ProjectTo2D(embeddingVectors [][]float32, textLabels []string) []Point2D

ProjectTo2D reduces high-dimensional embedding vectors to 2D points using PCA. Each input vector (typically 768 dimensions from text embeddings) is transformed into a 2D point that can be plotted. The projection keeps the two directions of greatest variance, so relative distances and cluster structure are preserved only to the extent that those two components capture them.

Parameters:

  • embeddingVectors: slice of high-dimensional vectors (e.g., from Ollama embeddings)
  • textLabels: corresponding text labels for each vector

Returns:

  • slice of Point2D structs ready for 2D visualization

func ProjectTo2DUMAP added in v0.1.3

func ProjectTo2DUMAP(embeddingVectors [][]float32, textLabels []string) []Point2D

ProjectTo2DUMAP reduces high-dimensional embedding vectors to 2D points using UMAP. This provides an alternative to PCA that better preserves nonlinear manifold structure.

func ProjectTo2DUMAPWithConfig added in v0.1.3

func ProjectTo2DUMAPWithConfig(embeddingVectors [][]float32, textLabels []string, config UMAPConfig) []Point2D

ProjectTo2DUMAPWithConfig allows customizing UMAP hyperparameters.

type UMAPConfig added in v0.1.3

type UMAPConfig struct {
	NNeighbors         int     // Number of nearest neighbors (default: 15)
	MinDist            float64 // Minimum distance in low-dim space (default: 0.1)
	Spread             float64 // Effective scale of embedded points (default: 1.0)
	NEpochs            int     // Number of optimization epochs (default: 200)
	LearningRate       float64 // Initial learning rate (default: 1.0)
	NegativeSampleRate float64 // Negative samples per positive (default: 5.0)
	RandomSeed         int64   // Random seed for reproducibility
}

UMAPConfig holds hyperparameters for UMAP dimensionality reduction.

func DefaultUMAPConfig added in v0.1.3

func DefaultUMAPConfig() UMAPConfig

DefaultUMAPConfig returns sensible default hyperparameters.
