Documentation
Overview ¶
Package projection provides dimensionality reduction and clustering for high-dimensional embedding vectors.
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) Overview ¶
HDBSCAN is a density-based clustering algorithm that finds clusters of varying densities without requiring the number of clusters to be specified in advance. Unlike K-means, it can discover arbitrarily shaped clusters and automatically identifies noise points (outliers).
The algorithm works in five main steps:
1. Compute Core Distances: For each point, find the distance to its k-th nearest neighbor. This measures local density: points in dense regions have small core distances.
2. Build Mutual Reachability Graph: Transform the distance metric to account for density. The mutual reachability distance between points a and b is:
       max(core_dist(a), core_dist(b), dist(a, b))
   This makes sparse regions "farther apart" even if the Euclidean distance is small.
3. Construct Minimum Spanning Tree: Build an MST using mutual reachability distances. This captures the hierarchical cluster structure: edges with small weights connect dense regions, while edges with large weights span sparse regions.
4. Build Condensed Tree: Walk the MST from longest to shortest edges, tracking when clusters split. Small splits (fewer than MinClusterSize points) are treated as points "falling out" of a cluster rather than true splits.
5. Extract Clusters: Use cluster stability (how long points persist in a cluster) to select the most prominent clusters. Points not in any stable cluster are noise.
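The first two steps can be illustrated with a small brute-force sketch. It is not this package's implementation: the helper names (euclidean, coreDistances, mutualReachability) are hypothetical, the input uses float64 for brevity, and it assumes a point does not count itself among its k nearest neighbors, a convention that varies between implementations.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// euclidean returns the Euclidean distance between two equal-length vectors.
func euclidean(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// coreDistances returns, for each point, the distance to its k-th nearest
// neighbor, excluding the point itself (an assumption of this sketch).
// It is brute force, O(n^2 log n), and requires 1 <= k <= len(points)-1.
func coreDistances(points [][]float64, k int) []float64 {
	core := make([]float64, len(points))
	for i := range points {
		dists := make([]float64, 0, len(points)-1)
		for j := range points {
			if i != j {
				dists = append(dists, euclidean(points[i], points[j]))
			}
		}
		sort.Float64s(dists)
		core[i] = dists[k-1]
	}
	return core
}

// mutualReachability returns max(core[a], core[b], dist(a, b)).
func mutualReachability(points [][]float64, core []float64, a, b int) float64 {
	return math.Max(math.Max(core[a], core[b]), euclidean(points[a], points[b]))
}

func main() {
	// Three points close together and one far away.
	points := [][]float64{{0, 0}, {0, 1}, {1, 0}, {10, 10}}
	core := coreDistances(points, 2)
	fmt.Println("core distances:", core)
	fmt.Println("mreach(0,1):", mutualReachability(points, core, 0, 1))
	fmt.Println("mreach(0,3):", mutualReachability(points, core, 0, 3))
}
```

The isolated point gets a large core distance, so every mutual reachability distance involving it is at least that large, which is what pushes sparse regions apart.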
Key advantages over other clustering methods:
- No need to specify number of clusters
- Robust to noise and outliers (labels them as -1)
- Finds clusters of varying densities
- Produces a hierarchy that can be cut at different levels
Reference: Campello, R.J.G.B., Moulavi, D., & Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. https://doi.org/10.1007/978-3-642-37456-2_14
Principal Component Analysis (PCA) Overview ¶
PCA is a technique that reduces high-dimensional data (like 768-dimensional text embeddings) down to fewer dimensions (like 2D for visualization) while preserving as much variance as possible.
The key insight is that most high-dimensional data lies on or near a lower-dimensional subspace. PCA finds this subspace by identifying the directions (principal components) along which the data varies the most.
Why We Use Singular Value Decomposition (SVD) ¶
While PCA can be computed by finding eigenvectors of the covariance matrix, SVD is numerically more stable and typically more efficient: forming X^T * X squares the condition number of X, which loses precision. For a centered data matrix X, the right singular vectors (V) give us the principal components directly, without needing to compute X^T * X explicitly.
The mathematical relationship is:
- X = U * Σ * V^T (the singular value decomposition)
- The columns of V are the principal components (directions of maximum variance)
- The singular values in Σ indicate how much variance each component captures: for n data points, component i has variance σ_i^2 / (n - 1)
- Projecting data: X_projected = X * V[:, 0:k] gives us the k-dimensional representation
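The idea of finding a principal direction without forming X^T * X can be sketched with power iteration, which is a swapped-in technique and not the full SVD described above. The sketch below recovers only the first component; the helper names (centerColumns, dot, topComponent) are hypothetical and are not part of this package.

```go
package main

import (
	"fmt"
	"math"
)

// centerColumns subtracts each column's mean and returns a new matrix.
// It assumes x is non-empty and rectangular.
func centerColumns(x [][]float64) [][]float64 {
	n, d := len(x), len(x[0])
	mean := make([]float64, d)
	for _, row := range x {
		for j, v := range row {
			mean[j] += v / float64(n)
		}
	}
	out := make([][]float64, n)
	for i, row := range x {
		out[i] = make([]float64, d)
		for j, v := range row {
			out[i][j] = v - mean[j]
		}
	}
	return out
}

// dot returns the inner product of two equal-length vectors.
func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// topComponent estimates the first right singular vector of the centered
// matrix x by power iteration. Each step applies X^T (X v), so the d-by-d
// matrix X^T X is never materialized.
func topComponent(x [][]float64, iters int) []float64 {
	d := len(x[0])
	v := make([]float64, d)
	for j := range v {
		// Deterministic start; assumes it is not orthogonal to the answer.
		v[j] = 1 / math.Sqrt(float64(d))
	}
	for it := 0; it < iters; it++ {
		next := make([]float64, d)
		for _, row := range x {
			s := dot(row, v) // i-th entry of X v
			for j, r := range row {
				next[j] += s * r // accumulate X^T (X v)
			}
		}
		norm := math.Sqrt(dot(next, next))
		if norm == 0 {
			return v // all rows are zero; no direction to find
		}
		for j := range next {
			next[j] /= norm
		}
		v = next
	}
	return v
}

func main() {
	// Points on the line y = 2x, so the first component is along (1, 2).
	x := centerColumns([][]float64{{1, 2}, {2, 4}, {3, 6}, {4, 8}})
	v := topComponent(x, 50)
	fmt.Println("first component:", v)
	for _, row := range x {
		fmt.Println("1-D coordinate:", dot(row, v)) // one column of X * V
	}
}
```

Further components would need deflation or a real SVD routine; this only shows that projecting onto a column of V is a dot product with the centered row.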
UMAP (Uniform Manifold Approximation and Projection) Overview ¶
UMAP is a nonlinear dimensionality reduction technique. It preserves local neighborhood structure, and often much of the global structure, on curved manifolds where a linear method like PCA cannot. It works by:
- Constructing a k-nearest neighbor graph in high-dimensional space
- Converting distances to fuzzy membership strengths (fuzzy simplicial set)
- Initializing a low-dimensional embedding via spectral methods
- Optimizing the embedding via stochastic gradient descent with negative sampling
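The second step, converting neighbor distances to fuzzy membership strengths, can be sketched for a single point as follows. This is a simplified illustration of the idea from the UMAP paper, not this package's code: the function name membershipStrengths is hypothetical, and details that umap-learn handles (local connectivity interpolation, a lower bound on sigma, symmetrizing the graph) are left out.

```go
package main

import (
	"fmt"
	"math"
)

// membershipStrengths converts the ascending distances from one point to its
// k nearest neighbors (self excluded) into strengths
// exp(-max(0, d-rho)/sigma). rho is the distance to the nearest neighbor, so
// that neighbor always gets strength 1. sigma is found by binary search so
// that the strengths sum to log2(k).
func membershipStrengths(knnDists []float64) []float64 {
	k := len(knnDists)
	target := math.Log2(float64(k))
	rho := knnDists[0]

	// sum grows monotonically with sigma, which makes bisection valid.
	sum := func(sigma float64) float64 {
		var s float64
		for _, d := range knnDists {
			s += math.Exp(-math.Max(0, d-rho) / sigma)
		}
		return s
	}

	lo, hi := 0.0, math.Inf(1)
	sigma := 1.0
	for i := 0; i < 64; i++ {
		s := sum(sigma)
		if math.Abs(s-target) < 1e-5 {
			break
		}
		if s > target {
			hi = sigma // sigma too large: shrink
			sigma = (lo + hi) / 2
		} else {
			lo = sigma // sigma too small: grow
			if math.IsInf(hi, 1) {
				sigma *= 2
			} else {
				sigma = (lo + hi) / 2
			}
		}
	}

	out := make([]float64, k)
	for i, d := range knnDists {
		out[i] = math.Exp(-math.Max(0, d-rho) / sigma)
	}
	return out
}

func main() {
	strengths := membershipStrengths([]float64{1, 2, 3, 4})
	fmt.Println(strengths)
}
```

Because sigma is fitted per point, points in sparse regions get a wider kernel than points in dense regions, which is how the graph adapts to varying density.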
Reference: McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. https://arxiv.org/abs/1802.03426
This is a Go port of the Python umap-learn library: https://github.com/lmcinnes/umap
Functions ¶
func ClusterLabels ¶ added in v0.1.4
func ClusterLabels(vectors [][]float32, config HDBSCANConfig) []int
ClusterLabels is a convenience function that returns only cluster labels. Points labeled -1 are noise (not part of any cluster).
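A label slice of this shape is usually summarized before use. The sketch below is one way to do that, with a hypothetical helper summarizeLabels that is not part of this package; the only convention it relies on from the documentation above is that -1 marks noise.

```go
package main

import "fmt"

// summarizeLabels counts the points in each cluster and the number of noise
// points. Any negative label is treated as noise; the documented value is -1.
func summarizeLabels(labels []int) (sizes map[int]int, noise int) {
	sizes = make(map[int]int)
	for _, l := range labels {
		if l < 0 {
			noise++
			continue
		}
		sizes[l]++
	}
	return sizes, noise
}

func main() {
	// Hand-written labels standing in for the output of ClusterLabels.
	labels := []int{0, 0, 1, -1, 1, 1, -1}
	sizes, noise := summarizeLabels(labels)
	fmt.Println(len(sizes), sizes[0], sizes[1], noise) // prints "2 2 3 2"
}
```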
Types ¶
type ClusterResult ¶ added in v0.1.4
type ClusterResult struct {
	Labels        []int     // Cluster assignment for each point (-1 = noise)
	Probabilities []float64 // Confidence score (0-1) for each point's cluster membership
}
ClusterResult contains the output of HDBSCAN clustering.
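One possible use of the two fields together is to keep only points that are confidently assigned to a cluster. The sketch below declares a local mirror of ClusterResult so that it compiles on its own; confidentMembers and the 0.5 threshold are hypothetical choices for illustration, not part of this package.

```go
package main

import "fmt"

// ClusterResult is a local mirror of the documented struct, declared here
// only so the example is self-contained.
type ClusterResult struct {
	Labels        []int
	Probabilities []float64
}

// confidentMembers returns the indices of points that belong to some cluster
// (label >= 0) with membership probability of at least minProb.
func confidentMembers(res ClusterResult, minProb float64) []int {
	var idx []int
	for i, label := range res.Labels {
		if label >= 0 && res.Probabilities[i] >= minProb {
			idx = append(idx, i)
		}
	}
	return idx
}

func main() {
	// Hand-written values standing in for the output of Cluster.
	res := ClusterResult{
		Labels:        []int{0, 0, -1, 1},
		Probabilities: []float64{0.9, 0.4, 0, 1.0},
	}
	fmt.Println(confidentMembers(res, 0.5)) // prints "[0 3]"
}
```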
func Cluster ¶ added in v0.1.4
func Cluster(vectors [][]float32, config HDBSCANConfig) ClusterResult
Cluster performs HDBSCAN clustering on high-dimensional vectors. Returns cluster labels and membership probabilities for each point.
The algorithm pipeline:
- Compute core distances (local density estimation)
- Build minimum spanning tree using mutual reachability distance
- Convert MST to single-linkage dendrogram
- Condense the tree by removing spurious splits
- Extract stable clusters using persistence-based selection
- Compute membership probabilities based on cluster lifetime
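The spanning-tree stage of this pipeline can be sketched with Prim's algorithm over a dense distance matrix. This is an illustration under assumptions, not the package's implementation: the edge type and primMST are hypothetical names, and the input is taken to be a precomputed symmetric matrix of mutual reachability distances.

```go
package main

import (
	"fmt"
	"math"
)

// edge is one spanning-tree edge between points From and To.
type edge struct {
	From, To int
	Weight   float64
}

// primMST builds a minimum spanning tree over a dense symmetric distance
// matrix using Prim's algorithm in O(n^2) time, starting from point 0.
func primMST(dist [][]float64) []edge {
	n := len(dist)
	if n == 0 {
		return nil
	}
	inTree := make([]bool, n)
	best := make([]float64, n) // cheapest known connection to the tree
	parent := make([]int, n)   // tree node that provides that connection
	for i := range best {
		best[i] = math.Inf(1)
		parent[i] = -1
	}
	best[0] = 0

	edges := make([]edge, 0, n-1)
	for iter := 0; iter < n; iter++ {
		// Pick the cheapest point that is not yet in the tree.
		u := -1
		for v := 0; v < n; v++ {
			if !inTree[v] && (u == -1 || best[v] < best[u]) {
				u = v
			}
		}
		inTree[u] = true
		if parent[u] >= 0 {
			edges = append(edges, edge{From: parent[u], To: u, Weight: best[u]})
		}
		// Relax connections from the newly added point.
		for v := 0; v < n; v++ {
			if !inTree[v] && dist[u][v] < best[v] {
				best[v] = dist[u][v]
				parent[v] = u
			}
		}
	}
	return edges
}

func main() {
	// A small hand-written distance matrix.
	dist := [][]float64{
		{0, 1, 4, 9},
		{1, 0, 2, 8},
		{4, 2, 0, 3},
		{9, 8, 3, 0},
	}
	var total float64
	for _, e := range primMST(dist) {
		fmt.Println(e.From, e.To, e.Weight)
		total += e.Weight
	}
	fmt.Println("total weight:", total) // total weight: 6
}
```

Sorting the returned edges by weight gives the merge order of a single-linkage dendrogram, which is the input to the next stage.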
type HDBSCANConfig ¶ added in v0.1.4
type HDBSCANConfig struct {
	MinClusterSize int // Minimum points required to form a cluster (default: 5)
	MinSamples     int // Points used to estimate density; affects core distance (default: MinClusterSize)
}
HDBSCANConfig holds hyperparameters for HDBSCAN clustering.
func DefaultHDBSCANConfig ¶ added in v0.1.4
func DefaultHDBSCANConfig() HDBSCANConfig
DefaultHDBSCANConfig returns sensible default hyperparameters. MinClusterSize=5 works well for most datasets; increase for larger datasets or when you want to ignore small clusters.
type Point2D ¶
Point2D represents a single data point projected into 2D space for visualization. It preserves the original text label for display in the UI.
func ProjectTo2D ¶
ProjectTo2D reduces high-dimensional embedding vectors to 2D points using PCA. Each input vector (typically 768 dimensions from text embeddings) is transformed into a 2D point that can be plotted. The projection keeps the two directions of greatest variance, so large-scale relative distances and cluster structure are preserved as far as a linear 2D projection allows.
Parameters:
- embeddingVectors: slice of high-dimensional vectors (e.g., from Ollama embeddings)
- textLabels: corresponding text labels for each vector
Returns:
- slice of Point2D structs ready for 2D visualization
func ProjectTo2DUMAP ¶ added in v0.1.3
ProjectTo2DUMAP reduces high-dimensional embedding vectors to 2D points using UMAP. This provides an alternative to PCA that better preserves nonlinear manifold structure.
func ProjectTo2DUMAPWithConfig ¶ added in v0.1.3
func ProjectTo2DUMAPWithConfig(embeddingVectors [][]float32, textLabels []string, config UMAPConfig) []Point2D
ProjectTo2DUMAPWithConfig allows customizing UMAP hyperparameters.
type UMAPConfig ¶ added in v0.1.3
type UMAPConfig struct {
	NNeighbors         int     // Number of nearest neighbors (default: 15)
	MinDist            float64 // Minimum distance in low-dim space (default: 0.1)
	Spread             float64 // Effective scale of embedded points (default: 1.0)
	NEpochs            int     // Number of optimization epochs (default: 200)
	LearningRate       float64 // Initial learning rate (default: 1.0)
	NegativeSampleRate float64 // Negative samples per positive (default: 5.0)
	RandomSeed         int64   // Random seed for reproducibility
}
UMAPConfig holds hyperparameters for UMAP dimensionality reduction.
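To see how MinDist and Spread interact, it helps to look at the target curve that umap-learn fits its low-dimensional similarity function to: similarity is 1 up to MinDist and then decays exponentially at a rate set by Spread. The sketch below evaluates that curve; targetCurve is a hypothetical name, and whether this Go port uses exactly the same curve is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// targetCurve returns the desired similarity of two embedded points at
// distance d: 1 while d <= minDist, then exp(-(d-minDist)/spread).
func targetCurve(d, minDist, spread float64) float64 {
	if d <= minDist {
		return 1
	}
	return math.Exp(-(d - minDist) / spread)
}

func main() {
	// Documented defaults: MinDist 0.1, Spread 1.0.
	for _, d := range []float64{0.05, 0.1, 0.5, 1.1, 3.0} {
		fmt.Printf("d=%.2f similarity=%.4f\n", d, targetCurve(d, 0.1, 1.0))
	}
}
```

A smaller MinDist lets points pack more tightly before the similarity starts to drop, while a larger Spread stretches the decay so that clusters sit farther apart.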
func DefaultUMAPConfig ¶ added in v0.1.3
func DefaultUMAPConfig() UMAPConfig
DefaultUMAPConfig returns sensible default hyperparameters.