projection

package

v0.1.4 (latest) | Published: Jan 5, 2026 | License: GPL-3.0 | Imports: 5 | Imported by: 0

Repository

github.com/alDuncanson/latent

Documentation ¶

Overview ¶

Package projection provides dimensionality reduction and clustering for high-dimensional embedding vectors.

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) Overview ¶

HDBSCAN is a density-based clustering algorithm that finds clusters of varying densities without requiring the number of clusters to be specified in advance. Unlike K-means, it can discover arbitrarily shaped clusters and automatically identifies noise points (outliers).

The algorithm works in five main steps:

1. Compute Core Distances: For each point, find the distance to its k-th nearest neighbor. This measures local density: points in dense regions have small core distances.
2. Build Mutual Reachability Graph: Transform the distance metric to account for density. The mutual reachability distance between points a and b is:

       max(core_dist(a), core_dist(b), dist(a, b))

   This makes sparse regions "farther apart" even if Euclidean distance is small.
3. Construct Minimum Spanning Tree: Build an MST using mutual reachability distances. This captures the hierarchical cluster structure: edges with small weights connect dense regions, while edges with large weights span sparse regions.
4. Build Condensed Tree: Walk the MST from longest to shortest edges, tracking when clusters split. Small splits (fewer than MinClusterSize points) are treated as points "falling out" of a cluster rather than true splits.
5. Extract Clusters: Use cluster stability (how long points persist in a cluster) to select the most prominent clusters. Points not in any stable cluster are noise.
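
The first two steps can be illustrated with a brute-force sketch. It is not the package's implementation: the helper names (euclidean, coreDistances, mutualReachability) are hypothetical, Euclidean distance is assumed, and a point is assumed not to count as its own neighbor. It also assumes k is at most len(points)-1.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// euclidean returns the Euclidean distance between two equal-length vectors.
func euclidean(a, b []float32) float64 {
	var sum float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		sum += d * d
	}
	return math.Sqrt(sum)
}

// coreDistances returns, for each point, the distance to its k-th nearest
// neighbor. Assumption: the point itself is not counted as a neighbor.
func coreDistances(points [][]float32, k int) []float64 {
	core := make([]float64, len(points))
	for i := range points {
		dists := make([]float64, 0, len(points)-1)
		for j := range points {
			if i != j {
				dists = append(dists, euclidean(points[i], points[j]))
			}
		}
		sort.Float64s(dists)
		core[i] = dists[k-1]
	}
	return core
}

// mutualReachability computes max(core_dist(a), core_dist(b), dist(a, b)).
func mutualReachability(points [][]float32, core []float64, a, b int) float64 {
	return math.Max(euclidean(points[a], points[b]), math.Max(core[a], core[b]))
}

func main() {
	points := [][]float32{{0, 0}, {1, 0}, {0, 1}, {10, 10}}
	core := coreDistances(points, 2)
	fmt.Println("core distances:", core)
	fmt.Println("mutual reachability(0, 1):", mutualReachability(points, core, 0, 1))
}
```

In this toy data, points 0 and 1 are one unit apart, but point 1's second-nearest neighbor is farther away, so the mutual reachability distance between them is larger than their raw distance.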

Key advantages over other clustering methods:

No need to specify number of clusters
Robust to noise and outliers (labels them as -1)
Finds clusters of varying densities
Produces a hierarchy that can be cut at different levels

Reference: Campello, R.J.G.B., Moulavi, D., & Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. https://doi.org/10.1007/978-3-642-37456-2_14

Principal Component Analysis (PCA) Overview ¶

PCA is a technique that reduces high-dimensional data (like 768-dimensional text embeddings) down to fewer dimensions (like 2D for visualization) while preserving as much variance as possible.

The key insight is that most high-dimensional data lies on or near a lower-dimensional subspace. PCA finds this subspace by identifying the directions (principal components) along which the data varies the most.

Why We Use Singular Value Decomposition (SVD) ¶

While PCA can be computed by finding eigenvectors of the covariance matrix, SVD is numerically more stable and often more efficient: explicitly forming X^T * X squares the condition number of the problem, which amplifies rounding error. For a centered data matrix X, the right singular vectors (V) give us the principal components directly, without needing to compute X^T * X.

The mathematical relationship is:

X = U * Σ * V^T (SVD decomposition)
The columns of V are the principal components (directions of maximum variance)
The singular values in Σ indicate how much variance each component captures: for n samples, the variance along component i is σ_i^2 / (n - 1)
Projecting data: X_projected = X * V[:, 0:k] gives us the k-dimensional representation
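
The relationship above can be made explicit with a short derivation. It assumes X is an n x d matrix whose columns have been centered, and that the covariance is normalized by n - 1 (the usual sample convention, not confirmed for this package).

```latex
\begin{aligned}
X &= U \Sigma V^{T} \\
C &= \frac{1}{n-1} X^{T} X
   = \frac{1}{n-1} V \Sigma^{T} U^{T} U \Sigma V^{T}
   = V \left( \frac{\Sigma^{T} \Sigma}{n-1} \right) V^{T} \\
\lambda_i &= \frac{\sigma_i^{2}}{n-1}, \qquad
\text{explained variance ratio}_i = \frac{\sigma_i^{2}}{\sum_j \sigma_j^{2}} \\
X_{\text{projected}} &= X V_{[:,\,0:k]} = U_{[:,\,0:k]} \, \Sigma_{[0:k,\,0:k]}
\end{aligned}
```

So the columns of V are the eigenvectors of the covariance matrix C, and each eigenvalue is the squared singular value scaled by 1 / (n - 1).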

UMAP (Uniform Manifold Approximation and Projection) Overview ¶

UMAP is a nonlinear dimensionality reduction technique. It preserves local neighborhood structure better than linear methods like PCA while still retaining much of the global structure. It works by:

1. Constructing a k-nearest neighbor graph in high-dimensional space
2. Converting distances to fuzzy membership strengths (fuzzy simplicial set)
3. Initializing a low-dimensional embedding via spectral methods
4. Optimizing the embedding via stochastic gradient descent with negative sampling
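
As a rough illustration of the second step, the sketch below follows the formulas in the UMAP paper: a neighbor distance becomes a membership strength via an exponential kernel, and the two directed strengths of an edge are merged with a fuzzy union. The function names are hypothetical, and rho (distance to the nearest neighbor) and sigma (a per-point scale that the reference implementation finds by binary search) are assumed to be already computed.

```go
package main

import (
	"fmt"
	"math"
)

// membership converts a neighbor distance into a fuzzy membership strength
// in (0, 1]. Neighbors no farther than rho get full strength 1.
func membership(dist, rho, sigma float64) float64 {
	if dist <= rho {
		return 1.0
	}
	return math.Exp(-(dist - rho) / sigma)
}

// fuzzyUnion merges the two directed strengths of an edge (a->b and b->a)
// into one symmetric weight: w = wab + wba - wab*wba.
func fuzzyUnion(wab, wba float64) float64 {
	return wab + wba - wab*wba
}

func main() {
	wab := membership(2.0, 1.0, 1.0) // b seen from a
	wba := membership(1.5, 1.5, 0.5) // a seen from b (a is b's nearest neighbor)
	fmt.Println("directed strengths:", wab, wba)
	fmt.Println("symmetric weight:", fuzzyUnion(wab, wba))
}
```

The fuzzy union never produces a weight below either directed strength, so an edge that is strong in one direction stays strong after symmetrization.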

Reference: McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. https://arxiv.org/abs/1802.03426

This is a Go port of the Python umap-learn library: https://github.com/lmcinnes/umap

Index ¶

func ClusterLabels(vectors [][]float32, config HDBSCANConfig) []int
type COOMatrix
type ClusterResult
- func Cluster(vectors [][]float32, config HDBSCANConfig) ClusterResult
type HDBSCANConfig
- func DefaultHDBSCANConfig() HDBSCANConfig
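type Point2D
- func ProjectTo2D(embeddingVectors [][]float32, textLabels []string) []Point2D
- func ProjectTo2DUMAP(embeddingVectors [][]float32, textLabels []string) []Point2D
- func ProjectTo2DUMAPWithConfig(embeddingVectors [][]float32, textLabels []string, config UMAPConfig) []Point2D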
type UMAPConfig
- func DefaultUMAPConfig() UMAPConfig

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func ClusterLabels ¶ added in v0.1.4

func ClusterLabels(vectors [][]float32, config HDBSCANConfig) []int

ClusterLabels is a convenience function that returns only cluster labels. Points labeled -1 are noise (not part of any cluster).
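
A common next step is to group point indices by label. The sketch below works on a plain []int shaped like the documented return value; groupByLabel is a hypothetical helper, not part of the package, and the labels in main are made up for illustration.

```go
package main

import "fmt"

// groupByLabel maps each cluster label to the indices of the points that
// carry it. Noise points end up under the key -1.
func groupByLabel(labels []int) map[int][]int {
	groups := make(map[int][]int)
	for i, label := range labels {
		groups[label] = append(groups[label], i)
	}
	return groups
}

func main() {
	// Hypothetical output of ClusterLabels for six input vectors.
	labels := []int{0, 0, -1, 1, 1, 0}
	groups := groupByLabel(labels)
	fmt.Println("noise:", groups[-1])
	fmt.Println("cluster 0:", groups[0])
	fmt.Println("cluster 1:", groups[1])
}
```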

Types ¶

type COOMatrix ¶ added in v0.1.3

type COOMatrix struct {
	Rows []int
	Cols []int
	Data []float64
	NRow int
	NCol int
}

COOMatrix represents a sparse matrix in coordinate (COO) format.
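
In COO format, entry k of the three slices describes one stored value: Data[k] sits at row Rows[k], column Cols[k]. The sketch below copies the struct locally so it compiles on its own and expands it into a dense matrix. toDense is a hypothetical helper, and summing duplicate coordinates is a common COO convention assumed here, not something the package documents.

```go
package main

import "fmt"

// COOMatrix mirrors the documented struct so this example is self-contained.
type COOMatrix struct {
	Rows []int
	Cols []int
	Data []float64
	NRow int
	NCol int
}

// toDense expands the (row, col, value) triplets into a dense NRow x NCol
// matrix. Assumption: duplicate coordinates are summed.
func toDense(m COOMatrix) [][]float64 {
	dense := make([][]float64, m.NRow)
	for i := range dense {
		dense[i] = make([]float64, m.NCol)
	}
	for k := range m.Data {
		dense[m.Rows[k]][m.Cols[k]] += m.Data[k]
	}
	return dense
}

func main() {
	m := COOMatrix{
		Rows: []int{0, 1, 2},
		Cols: []int{1, 0, 2},
		Data: []float64{0.5, 0.5, 1.0},
		NRow: 3,
		NCol: 3,
	}
	for _, row := range toDense(m) {
		fmt.Println(row)
	}
}
```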

type ClusterResult ¶ added in v0.1.4

type ClusterResult struct {
	Labels        []int     // Cluster assignment for each point (-1 = noise)
	Probabilities []float64 // Confidence score (0-1) for each point's cluster membership
}

ClusterResult contains the output of HDBSCAN clustering.

func Cluster ¶ added in v0.1.4

func Cluster(vectors [][]float32, config HDBSCANConfig) ClusterResult

Cluster performs HDBSCAN clustering on high-dimensional vectors. Returns cluster labels and membership probabilities for each point.

The algorithm pipeline:

1. Compute core distances (local density estimation)
2. Build minimum spanning tree using mutual reachability distance
3. Convert MST to single-linkage dendrogram
4. Condense the tree by removing spurious splits
5. Extract stable clusters using persistence-based selection
6. Compute membership probabilities based on cluster lifetime
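
The MST step of the pipeline can be sketched with Prim's algorithm over a dense distance function. Whether the package uses Prim's algorithm is not stated, so treat this as one possible approach. The Edge type and primMST function are hypothetical, and in HDBSCAN the dist callback would return the mutual reachability distance rather than the plain distance used in main.

```go
package main

import (
	"fmt"
	"math"
)

// Edge is a hypothetical MST edge between two point indices.
type Edge struct {
	From, To int
	Weight   float64
}

// primMST builds a minimum spanning tree over n points in O(n^2) time.
// dist must be symmetric.
func primMST(n int, dist func(i, j int) float64) []Edge {
	if n == 0 {
		return nil
	}
	inTree := make([]bool, n)
	best := make([]float64, n) // cheapest known edge from the tree to each vertex
	parent := make([]int, n)   // tree endpoint of that cheapest edge
	for i := range best {
		best[i] = math.Inf(1)
		parent[i] = -1
	}
	best[0] = 0
	edges := make([]Edge, 0, n-1)
	for step := 0; step < n; step++ {
		// Pick the cheapest vertex that is not yet in the tree.
		u := -1
		for v := 0; v < n; v++ {
			if !inTree[v] && (u == -1 || best[v] < best[u]) {
				u = v
			}
		}
		inTree[u] = true
		if parent[u] >= 0 {
			edges = append(edges, Edge{From: parent[u], To: u, Weight: best[u]})
		}
		// Relax the edges from u to every vertex still outside the tree.
		for v := 0; v < n; v++ {
			if !inTree[v] {
				if w := dist(u, v); w < best[v] {
					best[v] = w
					parent[v] = u
				}
			}
		}
	}
	return edges
}

func main() {
	xs := []float64{0, 1, 3, 10} // 1-D points for illustration
	edges := primMST(len(xs), func(i, j int) float64 {
		return math.Abs(xs[i] - xs[j])
	})
	for _, e := range edges {
		fmt.Printf("%d -> %d (%.1f)\n", e.From, e.To, e.Weight)
	}
}
```

Sorting the resulting edges by weight and merging their endpoints in that order yields the single-linkage dendrogram named in the next step.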

type HDBSCANConfig ¶ added in v0.1.4

type HDBSCANConfig struct {
	MinClusterSize int // Minimum points required to form a cluster (default: 5)
	MinSamples     int // Points used to estimate density; affects core distance (default: MinClusterSize)
}

HDBSCANConfig holds hyperparameters for HDBSCAN clustering.

func DefaultHDBSCANConfig ¶ added in v0.1.4

func DefaultHDBSCANConfig() HDBSCANConfig

DefaultHDBSCANConfig returns sensible default hyperparameters. MinClusterSize=5 is a reasonable starting point; increase it for larger datasets or when you want to ignore small clusters.

type Point2D ¶

type Point2D struct {
	X, Y float64
	Text string
}

Point2D represents a single data point projected into 2D space for visualization. It preserves the original text label for display in the UI.

func ProjectTo2D ¶

func ProjectTo2D(embeddingVectors [][]float32, textLabels []string) []Point2D

ProjectTo2D reduces high-dimensional embedding vectors to 2D points using PCA. Each input vector (typically 768 dimensions from text embeddings) is transformed into a 2D point that can be plotted. The projection keeps the two directions of greatest variance, so relative distances and cluster structure are preserved only approximately.

Parameters:

embeddingVectors: slice of high-dimensional vectors (e.g., from Ollama embeddings)
textLabels: corresponding text labels for each vector

Returns:

slice of Point2D structs ready for 2D visualization
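
To make the PCA projection concrete, here is a minimal self-contained sketch. It deliberately swaps in power iteration with deflation in place of the SVD described in the overview, because the standard library has no SVD routine. All names (center, topComponent, projectTo2D) are hypothetical, the input is assumed to be non-empty with equal-length vectors, and the fixed starting vector can fail if it happens to be orthogonal to the leading component.

```go
package main

import (
	"fmt"
	"math"
)

func dot(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// center subtracts the per-dimension mean and returns a float64 copy.
func center(vectors [][]float32) [][]float64 {
	n, d := len(vectors), len(vectors[0])
	mean := make([]float64, d)
	for _, v := range vectors {
		for j, x := range v {
			mean[j] += float64(x) / float64(n)
		}
	}
	out := make([][]float64, n)
	for i, v := range vectors {
		out[i] = make([]float64, d)
		for j, x := range v {
			out[i][j] = float64(x) - mean[j]
		}
	}
	return out
}

// topComponent estimates the leading principal direction of centered data
// by power iteration, applying X^T (X w) so X^T X is never formed.
func topComponent(x [][]float64, iters int) []float64 {
	d := len(x[0])
	w := make([]float64, d)
	for j := range w {
		w[j] = float64(j + 1) // fixed, non-random start (sketch assumption)
	}
	scale := math.Sqrt(dot(w, w))
	for j := range w {
		w[j] /= scale
	}
	for it := 0; it < iters; it++ {
		next := make([]float64, d)
		for _, row := range x {
			p := dot(row, w)
			for j := range row {
				next[j] += p * row[j]
			}
		}
		norm := math.Sqrt(dot(next, next))
		if norm == 0 {
			break // no variance reachable from the current direction
		}
		for j := range next {
			next[j] /= norm
		}
		w = next
	}
	return w
}

// projectTo2D returns each point's coordinates along the top two components.
func projectTo2D(vectors [][]float32) [][2]float64 {
	x := center(vectors)
	out := make([][2]float64, len(x))
	for c := 0; c < 2; c++ {
		w := topComponent(x, 100)
		for i, row := range x {
			p := dot(row, w)
			out[i][c] = p
			for j := range row {
				row[j] -= p * w[j] // deflate: remove the part along w
			}
		}
	}
	return out
}

func main() {
	vectors := [][]float32{{1, 2, 0}, {2, 4, 1}, {3, 6, 0}, {4, 8, 1}}
	for _, p := range projectTo2D(vectors) {
		fmt.Printf("%.3f %.3f\n", p[0], p[1])
	}
}
```

The sign of each component is arbitrary, so a projection may come out mirrored relative to another PCA implementation without being wrong.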

func ProjectTo2DUMAP ¶ added in v0.1.3

func ProjectTo2DUMAP(embeddingVectors [][]float32, textLabels []string) []Point2D

ProjectTo2DUMAP reduces high-dimensional embedding vectors to 2D points using UMAP. This provides an alternative to PCA that better preserves nonlinear manifold structure.

func ProjectTo2DUMAPWithConfig ¶ added in v0.1.3

func ProjectTo2DUMAPWithConfig(embeddingVectors [][]float32, textLabels []string, config UMAPConfig) []Point2D

ProjectTo2DUMAPWithConfig allows customizing UMAP hyperparameters.

type UMAPConfig ¶ added in v0.1.3

type UMAPConfig struct {
	NNeighbors         int     // Number of nearest neighbors (default: 15)
	MinDist            float64 // Minimum distance in low-dim space (default: 0.1)
	Spread             float64 // Effective scale of embedded points (default: 1.0)
	NEpochs            int     // Number of optimization epochs (default: 200)
	LearningRate       float64 // Initial learning rate (default: 1.0)
	NegativeSampleRate float64 // Negative samples per positive (default: 5.0)
	RandomSeed         int64   // Random seed for reproducibility
}

UMAPConfig holds hyperparameters for UMAP dimensionality reduction.
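
LearningRate is described as the initial learning rate, which implies it changes during optimization. In the Python umap-learn reference implementation, the step size decays linearly to zero over NEpochs. The sketch below assumes this port does the same, which the documentation does not confirm. learningRateAt is a hypothetical helper.

```go
package main

import "fmt"

// learningRateAt returns the step size at a given epoch under linear decay
// from the initial rate toward zero. Assumes nEpochs > 0.
func learningRateAt(initial float64, epoch, nEpochs int) float64 {
	return initial * (1.0 - float64(epoch)/float64(nEpochs))
}

func main() {
	// Documented defaults: LearningRate 1.0, NEpochs 200.
	for _, epoch := range []int{0, 50, 100, 199} {
		fmt.Printf("epoch %3d: %.3f\n", epoch, learningRateAt(1.0, epoch, 200))
	}
}
```

Under this schedule, early epochs move points aggressively to establish the overall layout and later epochs make progressively smaller adjustments.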

func DefaultUMAPConfig ¶ added in v0.1.3

func DefaultUMAPConfig() UMAPConfig

DefaultUMAPConfig returns sensible default hyperparameters.
