README
¶
Skyline
A Go library for skyline queries — multi-dimensional optimization to find the set of Pareto-optimal points from a dataset.
Supports both static and dynamic skyline computation with multiple algorithms.
What is a Skyline Query?
Skyline queries identify the subset of points in a multi-dimensional dataset that are not dominated by any other point.
A point A dominates B if A is as good or better than B in all dimensions and strictly better in at least one dimension.
This is useful in scenarios like:
- Finding products that are best across multiple criteria (price, performance, battery life)
- Multi-criteria decision making in finance, logistics, recommendation systems, etc.
The result, called the skyline set, represents the Pareto-optimal front of your data.
What does this library do?
- Compute skyline points from static datasets using multiple algorithms (Block Nested Loop, Divide & Conquer, SkyTree)
- Support dynamic updates: insert, batch insertdelete, and update points incrementally without recomputing from scratch
- Allow flexible dimension selection and preference (minimize/maximize per dimension)
- Provide a simple, idiomatic Go API for both static and dynamic skyline queries
Installation
go get github.com/gkoos/skyline
API Overview
Types
type Point map[string]float64
type Preference map[string]Order
type Order int
const (
Min Order = iota
Max
Ignore // Skip this dimension in dominance comparisons
)
Static Computation
func Skyline(points []Point, dims []string, prefs Preference, algo string) ([]Point, error)
Computes the skyline from a static dataset.
points
: input pointsdims
: dimensions to considerprefs
: preferences per dimension (Min or Max)algo
: algorithm to use ("bnl"
,"dnc"
,"skytree"
)
Dynamic Updates
You can use the dynamic skyline engine for incremental and batch updates. Two constructors are available:
1. DynamicSkyline (with initial skyline computation)
engine, err := skyline.DynamicSkyline(points, dims, prefs, algo)
if err != nil {
panic(err)
}
This computes the initial skyline from the dataset using the specified algorithm ("bnl", "dnc", or "skytree").
2. DynamicSkylineRaw (no initial skyline computation)
engine := skyline.DynamicSkylineRaw(points, dims, prefs, algo)
This skips initial skyline computation and treats the provided points as the current skyline. Useful for streaming or batch scenarios.
Supported Operations
engine.Insert(point)
— Insert a single point (uses BNL logic)engine.InsertBatch(points)
— Insert multiple points at once (uses the configured algorithm for batch skyline computation)engine.Update(oldPoint, newPoint)
— Replace a point and update the skylineengine.Delete(point)
— Remove a point and update the skylineengine.Skyline()
— Get the current skyline set
Example
package main
import (
"fmt"
"github.com/yourname/skyline"
)
func main() {
points := []skyline.Point{
{"price": 400, "battery": 10},
{"price": 500, "battery": 12},
{"price": 300, "battery": 9},
{"price": 450, "battery": 11},
}
prefs := skyline.Preference{
"price": skyline.Min,
"battery": skyline.Max,
}
// Dynamic skyline with initial computation
engine, err := skyline.DynamicSkyline(points, []string{"price", "battery"}, prefs, "dnc")
if err != nil {
panic(err)
}
// Insert a single point
newPoint := skyline.Point{"price": 420, "battery": 15}
engine.Insert(newPoint)
fmt.Println("After insert:", engine.Skyline())
// Batch insert
batch := []skyline.Point{
{"price": 410, "battery": 13},
{"price": 390, "battery": 16},
}
engine.InsertBatch(batch)
fmt.Println("After batch insert:", engine.Skyline())
// Update a point
updatedPoint := skyline.Point{"price": 460, "battery": 14}
engine.Update(newPoint, updatedPoint)
fmt.Println("After update:", engine.Skyline())
// Delete a point
engine.Delete(updatedPoint)
fmt.Println("After delete:", engine.Skyline())
}
Partial Skyline
Preference
includes an Ignore
option to skip dimensions in dominance checks. This allows you to compute skylines based on a subset of dimensions, which can be useful in scenarios where some dimensions are not relevant.
A practical example is adding a unique key to each point which then can be ignored in dominance checks.
Algorithms
Block Nested Loop (BNL)
- Simple, intuitive algorithm
- Compares each point with all others to find dominating points
- Works well for small datasets and supports incremental updates easily
- In dynamic mode, we always use this algorithm to insert a single point
Divide & Conquer (D&C)
- Recursively divides data into smaller subsets, computes skylines, and merges results
- More efficient than BNL for larger datasets
- Static algorithm, dynamic extension is complex
SkyTree
- Advanced algorithm using tree structures to prune comparisons
- Scales well with high-dimensional and large datasets
- Designed primarily for static datasets
Optimization Steps
The SkyTree implementation in this library includes several advanced optimizations for performance and scalability:
- Advanced Pivot Selection: Uses median or custom pivot selection to improve partitioning and pruning efficiency
- Parallelization: SkyTree uses parallelism in two main phases:
- Parallel Recursion: When the number of partitions (regions) exceeds a threshold, recursive calls for each partition are executed in parallel using goroutines. This allows the algorithm to process different branches of the tree concurrently, greatly speeding up computation on multicore systems.
- Parallel Merge: After recursion, the partial skylines from each partition are merged in parallel using a pairwise, multi-stage approach. At each stage, pairs of skylines are merged concurrently, reducing the total merge time to log₂(N) stages for N partitions.
- Worker Pool: Both recursion and merge parallelism are managed by a configurable worker pool, which limits the number of concurrent goroutines to avoid oversubscription and maximize CPU efficiency. The pool size defaults to the number of available CPU cores, but can be tuned for your workload.
- Dominance Caching: Caches dominance checks between points to avoid redundant computations, reducing overall work
- Slice Reuse: Minimizes memory allocations by reusing slices in recursive calls and helpers
- Custom Deduplication: Uses a fast custom join for point keys, improving deduplication speed for large datasets
- Configurable Recursion Depth: Allows limiting recursion depth to prevent stack overflow and excessive computation; falls back to BNL if the limit is reached
- Small Partition BNL Switch: If the number of points in a partition is lsmall, SkyTree will use the Block Nested Loop (BNL) algorithm for that partition instead of recursing further. This optimization avoids SkyTree's overhead on small datasets, where BNL is typically faster, and can significantly improve performance for workloads with many small partitions. The threshold is tunable; see the configuration section for details.
These optimizations make SkyTree suitable for very large and high-dimensional datasets, balancing speed, memory usage, and accuracy.
Approximate Skyline Queries
All skyline algorithms in this package support an epsilon parameter, which allows for approximate dominance. Epsilon is a non-negative float that relaxes the strictness of dominance comparisons:
- Strict dominance (
epsilon = 0
): A pointA
strictly dominates pointB
if all coordinates ofA
are less than or equal to those ofB
, and at least one is strictly less. - Epsilon dominance (
epsilon > 0
): A pointA
epsilon-dominates pointB
if all coordinates ofA
are less than or equal to those ofB
plusepsilon
, and at least one coordinate is strictly less by more thanepsilon
.
This is useful for handling floating-point imprecision or for applications where small differences are not significant.
Epsilon dominance checks can be combined with sampling and partitioning of the data to calculate approximate skylines which can be more efficient for large datasets - essentially a tradeoff between accuracy and performance.
Example: Dominance with Epsilon
import "skyline/internal/algorithms"
a := []float64{1.0, 2.0}
b := []float64{1.0, 2.0000001}
// Strict dominance (epsilon = 0): false
algorithms.DominatesEpsilon(a, b, 0) // false
// Epsilon dominance (epsilon = 1e-6): true
algorithms.DominatesEpsilon(a, b, 1e-6) // true
Configuration
Skyline algorithms can be fine-tuned using configuration options to optimize performance and scalability for different dataset sizes and characteristics. Tuning these options allows you to balance speed, memory usage, and accuracy, especially for large or high-dimensional data. Proper configuration is essential to avoid bottlenecks, excessive memory consumption, or incomplete results, and lets you adapt the algorithms to your specific workload and hardware.
Block Nested Loop (BNL)
Epsilon
: Dominance threshold for comparisons. Lower values increase accuracy but may slow down performance. Higher values speed up comparisons but may miss some dominated points. Default is0.0
, meaning exact dominance checks.
Divide & Conquer (DNC)
Threshold
: Minimum number of points in a partition before switching to BNL. Lower values increase recursion, higher values use BNL more often. Tune for your dataset size.BatchSize
: Number of points processed together in each batch. Larger batches can improve cache locality and throughput, but may use more memory.Epsilon
: Dominance threshold for comparisons. See Block Nested Loop (BNL) section.
SkyTree
PivotSelector
: Function to choose the pivot point for partitioning. The default is median selection, but you can provide a custom function for domain-specific optimization.ParallelThreshold
: Minimum number of partitions before enabling parallel processing. Lower values increase parallelism, higher values reduce goroutine overhead.MaxRecursionDepth
: Maximum allowed recursion depth. If exceeded, SkyTree falls back to BNL for the remaining data. Prevents stack overflow and excessive computation for very large or complex datasets.BNLSwitchThreshold
: If the number of points in a partition is less than or equal to this threshold, SkyTree will use the Block Nested Loop (BNL) algorithm for that partition instead of recursing further. This improves performance by avoiding SkyTree's overhead on small datasets, where BNL is typically faster. The default is 32, but you can tune this value for your workload and hardware. Lower values reduce BNL usage; higher values make SkyTree switch to BNL more often for small partitions.WorkerPoolSize
: Controls the maximum number of goroutines (workers) used for parallel recursion and merging in SkyTree. Setting this to0
(the default) will use the number of available CPU cores on your system, which is usually optimal for most workloads. You can set a specific positive value to limit CPU usage or experiment with different levels of parallelism. Increasing this value may improve performance on large, partitionable datasets, but setting it too high can cause oversubscription and reduce efficiency. For most users, leaving it at0
is recommended.Epsilon
: Dominance threshold for comparisons. See Block Nested Loop (BNL) section.
Refer to the code and examples for how to set these options in your application.
Running Tests
To run all unit tests for algorithms and utilities:
go test ./...
This will execute all correctness and edge case tests for the skyline algorithms, domination logic, and dynamic updates. Tests cover small, large, and high-dimensional datasets, as well as pathological cases.
Benchmarking Skyline Algorithms
To run performance benchmarks and compare algorithms:
go test -bench . ./internal/algorithms
This command runs repeatable benchmarks for all implemented skyline algorithms (BNL, D&C, SkyTree) on a variety of datasets, including small, large, high-dimensional, clustered, and pathological cases.
Algorithm selection guidance:
- Block Nested Loop (BNL): Best for small datasets or when incremental updates are needed. Simple, but slow for large or high-dimensional data.
- Divide & Conquer (DNC): Generally fastest for large, diverse datasets with a moderate skyline size. Uses more memory, but scales well unless most points are dominated.
- SkyTree: Optimized for very large, high-dimensional datasets with many dominated (clustered) points and a small skyline. SkyTree is much faster and more memory-efficient than DNC when the dataset is highly clustered and the skyline is small. For datasets where most points are Pareto-optimal (large skyline), DNC may be faster.
Summary:
- Use BNL for small or dynamic datasets.
- Use DNC for large, diverse datasets with a moderate skyline.
- Use SkyTree for large, high-dimensional, clustered datasets with a small skyline.
- Benchmark your own data to select the best algorithm for your use case.
Future Improvements
- Add debug visualizer (especially for the SkyTree algorithm) (optional CLI)
- Add batch processing support to the SkyTree implementation
License
MIT License
Contact
Contributions and issues are welcome! Please open an issue or submit a pull request.
For questions, issues, or contributions, please use the GitHub repository:
Directories
¶
Path | Synopsis |
---|---|
examples
|
|
bnl
command
|
|
dnc
command
|
|
dnc_with_config
command
|
|
skytree
command
|
|
internal
|
|
algorithms
...removed stray type alias...
|
...removed stray type alias... |
...existing code...
|
...existing code... |