datas3t
A high-performance data management system for storing, indexing, and retrieving large-scale datas3ts in S3-compatible object storage.
Overview
datas3t is designed for efficiently managing datas3ts containing millions of individual files (called "datapoints"). It stores files as indexed TAR archives in S3-compatible storage, enabling fast random access without the overhead of extracting entire archives.
Key Features
Efficient Storage
- Packs individual files into TAR archives
- Eliminates S3 object overhead for small files
- Supports datas3ts with millions of datapoints
Fast Random Access
- Creates lightweight indices for TAR archives
- Enables direct access to specific files without extraction
- Disk-based caching for frequently accessed indices
Flexible TLS Configuration
- TLS usage determined by endpoint protocol (https:// vs http://)
- No separate TLS flags needed - follows standard URL conventions
- Seamless integration with various S3-compatible services
Range-based Operations
- Upload and download data in configurable chunks (dataranges)
- Supports partial datas3t retrieval
- Parallel processing of multiple ranges
Direct Client-to-Storage Transfer
- Uses S3 presigned URLs for efficient data transfer
- Bypasses server for large file operations
- Supports multipart uploads for large datas3ts
Datarange Aggregation
- Combines multiple small dataranges into larger ones
- Reduces S3 object count and improves download performance
- Validates continuous datapoint coverage before aggregation
- Atomic operations with automatic cleanup on failure
Datas3t Import
- Discovers and imports existing datas3ts from S3 buckets
- Scans for objects matching datas3t patterns automatically
- Disaster recovery and migration support
- Maintains upload counter consistency
- Prevents duplicate imports with idempotent operations
Data Integrity
- Validates TAR structure and file naming conventions
- Ensures datapoint consistency across operations
- Transactional database operations
Architecture
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────────┐
│   Client/CLI    │─────▶│    HTTP API      │─────▶│     PostgreSQL      │
│                 │      │     Server       │      │     (Metadata)      │
└─────────────────┘      └──────────────────┘      └─────────────────────┘
                                  │
                                  ▼
                         ┌──────────────────┐
                         │  S3-Compatible   │
                         │  Object Storage  │
                         │  (TAR Archives)  │
                         └──────────────────┘
Components
- HTTP API Server: REST API for datas3t management
- Client Library: Go SDK for programmatic access
- PostgreSQL Database: Stores metadata and indices
- S3-Compatible Storage: Stores TAR archives and indices
- TAR Indexing Engine: Creates fast-access indices
- Disk Cache: Local caching for performance
- Key Deletion Service: Background worker for automatic S3 object cleanup
Core Concepts
Datas3ts
Named collections of related datapoints. Each datas3t is associated with an S3 bucket configuration.
Datapoints
Individual files within a datas3t, numbered sequentially:
00000000000000000001.txt
00000000000000000002.jpg
00000000000000000003.json
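Sequential numbering makes a datapoint's name derivable from its index alone. A minimal Go sketch of producing conforming names (the helper name is illustrative, not part of the datas3t API):

```go
package main

import "fmt"

// datapointName formats a datapoint index and extension into the
// zero-padded %020d.<extension> naming convention used by datas3t.
func datapointName(index uint64, ext string) string {
	return fmt.Sprintf("%020d.%s", index, ext)
}

func main() {
	fmt.Println(datapointName(1, "txt"))    // 00000000000000000001.txt
	fmt.Println(datapointName(1337, "json")) // 00000000000000001337.json
}
```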
Dataranges
Contiguous chunks of datapoints stored as TAR archives:
datas3t/my-datas3t/dataranges/00000000000000000001-00000000000000001000-000000000001.tar
datas3t/my-datas3t/dataranges/00000000000000001001-00000000000000002000-000000000002.tar
TAR Indices
Lightweight index files enabling fast random access:
datas3t/my-datas3t/dataranges/00000000000000000001-00000000000000001000-000000000001.index
Import Operations
Process of discovering and importing existing datas3ts from S3 buckets:
- Pattern Recognition: Automatically detects objects matching datas3t naming conventions
- Duplicate Prevention: Skips existing dataranges to prevent conflicts
- Upload Counter Management: Maintains counter consistency for future uploads
- Transaction Safety: All imports are performed atomically per datas3t
Aggregation Operations
Process of combining multiple small dataranges into larger ones for improved efficiency:
- Coverage Validation: Ensures continuous datapoint coverage with no gaps
- Atomic Replacement: Original dataranges are replaced atomically after successful aggregation
- Parallel Processing: Downloads and uploads are performed in parallel for optimal performance
- Multipart Support: Large aggregates use multipart uploads for reliability
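The coverage-validation step can be illustrated with a small Go sketch. The `Datarange` struct and `covers` helper here are hypothetical simplifications of the real checks: they only verify that sorted dataranges abut with no gaps and span the requested bounds.

```go
package main

import (
	"fmt"
	"sort"
)

// Datarange describes the inclusive datapoint span covered by one TAR archive.
type Datarange struct {
	First, Last uint64
}

// covers reports whether the dataranges cover [first, last] contiguously:
// no gaps, no overlaps, and both endpoints reached.
func covers(ranges []Datarange, first, last uint64) bool {
	if len(ranges) == 0 {
		return false
	}
	sort.Slice(ranges, func(i, j int) bool { return ranges[i].First < ranges[j].First })
	if ranges[0].First > first || ranges[len(ranges)-1].Last < last {
		return false
	}
	for i := 1; i < len(ranges); i++ {
		if ranges[i].First != ranges[i-1].Last+1 {
			return false // gap or overlap between consecutive dataranges
		}
	}
	return true
}

func main() {
	fmt.Println(covers([]Datarange{{1, 1000}, {1001, 2000}}, 1, 2000)) // true
	fmt.Println(covers([]Datarange{{1, 1000}, {1501, 2000}}, 1, 2000)) // false
}
```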
Quick Start
Prerequisites
- Nix (with flakes enabled) for the development environment
- A PostgreSQL database
- S3-compatible object storage (e.g. AWS S3, MinIO)
Development Setup
# Clone the repository
git clone https://github.com/draganm/datas3t.git
cd datas3t
# Enter development environment
nix develop
# Run tests
nix develop -c go test ./...
# Generate code
nix develop -c go generate ./...
Running the Server
# Set environment variables
export DB_URL="postgres://user:password@localhost:5432/datas3t"
export ADDR=":8765"
# Run the server
nix develop -c go run ./cmd/datas3t server
# Server will start on http://localhost:8765
API Usage
1. Add Bucket Configuration
curl -X POST http://localhost:8765/api/bucket \
-H "Content-Type: application/json" \
-d '{
"name": "my-bucket-config",
"endpoint": "https://s3.amazonaws.com",
"bucket": "my-data-bucket",
"access_key": "ACCESS_KEY",
"secret_key": "SECRET_KEY"
}'
2. Create Datas3t
curl -X POST http://localhost:8765/api/datas3t \
-H "Content-Type: application/json" \
-d '{
"name": "my-datas3t",
"bucket": "my-bucket-config"
}'
3. Upload Datarange
# Start upload
curl -X POST http://localhost:8765/api/datarange/upload/start \
-H "Content-Type: application/json" \
-d '{
"datas3t_name": "my-datas3t",
"first_datapoint_index": 1,
"number_of_datapoints": 1000,
"data_size": 1048576
}'
# Use returned presigned URLs to upload TAR archive and index
# Then complete the upload
curl -X POST http://localhost:8765/api/datarange/upload/complete \
-H "Content-Type: application/json" \
-d '{
"datarange_upload_id": 123
}'
4. Download Datapoints
curl -X POST http://localhost:8765/api/download/presign \
-H "Content-Type: application/json" \
-d '{
"datas3t_name": "my-datas3t",
"first_datapoint": 100,
"last_datapoint": 200
}'
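The presign endpoint returns segments that the client fetches directly from S3 using HTTP Range requests. A hedged Go sketch of building one such request (the `newRangeRequest` helper and the example URL are illustrative; check the client library for the actual response shape):

```go
package main

import (
	"fmt"
	"net/http"
)

// newRangeRequest builds a GET request against a presigned URL, restricted
// to the given byte range (e.g. "bytes=0-1048575"). Pass an empty byteRange
// to fetch the whole object.
func newRangeRequest(presignedURL, byteRange string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, presignedURL, nil)
	if err != nil {
		return nil, err
	}
	if byteRange != "" {
		req.Header.Set("Range", byteRange)
	}
	return req, nil
}

func main() {
	req, err := newRangeRequest("https://my-data-bucket.s3.amazonaws.com/some-object", "bytes=0-1048575")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("Range"))
	// Execute with http.DefaultClient.Do(req) and stream the body to disk.
}
```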
5. Import Existing Datas3ts
# Import datas3ts from S3 bucket
curl -X POST http://localhost:8765/api/v1/datas3ts/import \
-H "Content-Type: application/json" \
-d '{
"bucket_name": "my-bucket-config"
}'
6. Clear Datas3t
# Clear all dataranges from a datas3t
curl -X POST http://localhost:8765/api/v1/datas3ts/clear \
-H "Content-Type: application/json" \
-d '{
"name": "my-datas3t"
}'
7. Aggregate Dataranges
# Start aggregation
curl -X POST http://localhost:8765/api/v1/aggregate \
-H "Content-Type: application/json" \
-d '{
"datas3t_name": "my-datas3t",
"first_datapoint_index": 1,
"last_datapoint_index": 5000
}'
# Complete aggregation (after processing returned URLs)
curl -X POST http://localhost:8765/api/v1/aggregate/complete \
-H "Content-Type: application/json" \
-d '{
"aggregate_upload_id": 456
}'
Client Library Usage
package main

import (
	"context"
	"fmt"

	"github.com/draganm/datas3t/client"
)

func main() {
	// Create client
	c := client.New("http://localhost:8765")

	// List datas3ts
	datas3ts, err := c.ListDatas3ts(context.Background())
	if err != nil {
		panic(err)
	}
	fmt.Printf("Found %d datas3ts\n", len(datas3ts))

	// Download specific datapoints
	response, err := c.PreSignDownloadForDatapoints(context.Background(), &client.PreSignDownloadForDatapointsRequest{
		Datas3tName:    "my-datas3t",
		FirstDatapoint: 1,
		LastDatapoint:  100,
	})
	if err != nil {
		panic(err)
	}

	// Use presigned URLs to download data directly from S3
	for _, segment := range response.DownloadSegments {
		// Download using segment.PresignedURL and segment.Range
		_ = segment
	}

	// Import existing datas3ts from S3 bucket
	importResponse, err := c.ImportDatas3t(context.Background(), &client.ImportDatas3tRequest{
		BucketName: "my-bucket-config",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("Imported %d datas3ts: %v\n", importResponse.ImportedCount, importResponse.ImportedDatas3ts)

	// Clear all dataranges from a datas3t
	clearResponse, err := c.ClearDatas3t(context.Background(), &client.ClearDatas3tRequest{
		Name: "my-datas3t",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("Cleared datas3t: deleted %d dataranges, scheduled %d objects for deletion\n",
		clearResponse.DatarangesDeleted, clearResponse.ObjectsScheduled)

	// Aggregate multiple dataranges into a single larger one
	err = c.AggregateDataRanges(context.Background(), "my-datas3t", 1, 5000, &client.AggregateOptions{
		MaxParallelism: 8,
		MaxRetries:     3,
		ProgressCallback: func(phase string, current, total int64) {
			fmt.Printf("Phase %s: %d/%d\n", phase, current, total)
		},
	})
	if err != nil {
		panic(err)
	}
}
CLI Usage
The datas3t CLI provides a comprehensive command-line interface for managing buckets, datas3ts, datarange operations, and aggregation.
Building the CLI
# Build the CLI binary
nix develop -c go build -o datas3t ./cmd/datas3t
# Or run directly
nix develop -c go run ./cmd/datas3t [command]
Global Options
All commands support:
--server-url - Server URL (default: http://localhost:8765, env: DATAS3T_SERVER_URL)
Server Management
Start the Server
# Start the datas3t server
./datas3t server \
--db-url "postgres://user:password@localhost:5432/datas3t" \
--cache-dir "/path/to/cache" \
--encryption-key "your-base64-encoded-key"
# Using environment variables
export DB_URL="postgres://user:password@localhost:5432/datas3t"
export CACHE_DIR="/path/to/cache"
export ENCRYPTION_KEY="your-encryption-key"
./datas3t server
Generate Encryption Key
# Generate a new AES-256 encryption key
./datas3t keygen
Bucket Management
Add S3 Bucket Configuration
./datas3t bucket add \
--name my-bucket-config \
--endpoint https://s3.amazonaws.com \
--bucket my-data-bucket \
--access-key ACCESS_KEY \
--secret-key SECRET_KEY
Options:
--name - Bucket configuration name (required)
--endpoint - S3 endpoint (include https:// for TLS) (required)
--bucket - S3 bucket name (required)
--access-key - S3 access key (required)
--secret-key - S3 secret key (required)
List Bucket Configurations
# List all bucket configurations
./datas3t bucket list
# Output as JSON
./datas3t bucket list --json
Datas3t Management
Add New Datas3t
./datas3t datas3t add \
--name my-dataset \
--bucket my-bucket-config
Options:
--name - Datas3t name (required)
--bucket - Bucket configuration name (required)
List Datas3ts
# List all datas3ts with statistics
./datas3t datas3t list
# Output as JSON
./datas3t datas3t list --json
Import Existing Datas3ts
# Import datas3ts from S3 bucket
./datas3t datas3t import \
--bucket my-bucket-config
# Output as JSON
./datas3t datas3t import \
--bucket my-bucket-config \
--json
Options:
--bucket - Bucket configuration name to scan for existing datas3ts (required)
--json - Output results as JSON
Clear Datas3t
# Clear all dataranges from a datas3t (with confirmation prompt)
./datas3t datas3t clear \
--name my-dataset
# Clear without confirmation prompt
./datas3t datas3t clear \
--name my-dataset \
--force
Options:
--name - Datas3t name to clear (required)
--force - Skip confirmation prompt
What it does:
- Removes all dataranges from the specified datas3t
- Schedules all associated S3 objects (TAR files and indices) for deletion
- Keeps the datas3t record itself (allows future uploads)
- The datas3t remains in the database with zero dataranges and datapoints
- S3 objects are deleted asynchronously by the background key deletion worker
TAR Upload Operations
Upload TAR File
./datas3t upload-tar \
--datas3t my-dataset \
--file /path/to/data.tar \
--max-parallelism 8 \
--max-retries 5
Options:
--datas3t - Datas3t name (required)
--file - Path to TAR file to upload (required)
--max-parallelism - Maximum concurrent uploads (default: 4)
--max-retries - Maximum retry attempts per chunk (default: 3)
Datarange Operations
Download Datapoints as TAR
./datas3t datarange download-tar \
--datas3t my-dataset \
--first-datapoint 1 \
--last-datapoint 1000 \
--output /path/to/downloaded.tar \
--max-parallelism 8 \
--max-retries 5 \
--chunk-size 10485760
Options:
--datas3t - Datas3t name (required)
--first-datapoint - First datapoint to download (required)
--last-datapoint - Last datapoint to download (required)
--output - Output TAR file path (required)
--max-parallelism - Maximum concurrent downloads (default: 4)
--max-retries - Maximum retry attempts per chunk (default: 3)
--chunk-size - Download chunk size in bytes (default: 5MB)
Aggregation Operations
Aggregate Multiple Dataranges
./datas3t aggregate \
--datas3t my-dataset \
--first-datapoint 1 \
--last-datapoint 5000 \
--max-parallelism 4 \
--max-retries 3
Options:
--datas3t - Datas3t name (required)
--first-datapoint - First datapoint index to include in aggregate (required)
--last-datapoint - Last datapoint index to include in aggregate (required)
--max-parallelism - Maximum number of concurrent operations (default: 4)
--max-retries - Maximum number of retry attempts per operation (default: 3)
What it does:
- Downloads all source dataranges in the specified range
- Merges them into a single TAR archive with continuous datapoint numbering
- Uploads the merged archive to S3
- Atomically replaces the original dataranges with the new aggregate
- Validates that the datapoint range is fully covered by existing dataranges with no gaps
Optimization Operations
Optimize Datarange Storage
./datas3t optimize \
--datas3t my-dataset \
--dry-run \
--min-score 2.0 \
--target-size 2GB
Options:
--datas3t - Datas3t name (required)
--dry-run - Show optimization recommendations without executing them
--daemon - Run continuously, monitoring for optimization opportunities
--interval - Interval between optimization checks in daemon mode (default: 5m)
--min-score - Minimum AVS score required to perform aggregation (default: 1.0)
--target-size - Target size for aggregated files (default: 1GB)
--max-aggregate-size - Maximum size for aggregated files (default: 5GB)
--max-operations - Maximum number of aggregation operations per run (default: 10)
--max-parallelism - Maximum number of concurrent operations for each aggregation (default: 4)
--max-retries - Maximum number of retry attempts per operation (default: 3)
What it does:
- Analyzes existing dataranges to identify optimization opportunities
- Uses an Aggregation Value Score (AVS) algorithm to prioritize operations
- Automatically performs beneficial aggregations using the existing aggregate functionality
- Supports both one-time optimization and continuous monitoring modes
Optimization Strategies:
- Small file aggregation: Combines many small files into larger ones
- Adjacent ID range aggregation: Merges consecutive datapoint ranges
- Size bucket aggregation: Groups similarly sized files together
Scoring Algorithm:
Each potential aggregation is scored based on:
- Objects reduced: Fewer files to manage (reduces S3 object count)
- Size efficiency: How close the result is to the target size
- Consecutive bonus: Bonus for adjacent datapoint ranges
- Operation cost: Download/upload overhead consideration
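The four factors above can be sketched as a toy scoring function. This is an illustration of the documented factors only, not datas3t's actual AVS algorithm; the weights and formula are invented for the example.

```go
package main

import "fmt"

// aggregationScore is an illustrative AVS-style score combining the
// documented factors: objects reduced, size efficiency relative to the
// target, a consecutive-range bonus, and a flat per-object operation cost.
func aggregationScore(objectCount int, totalSize, targetSize int64, consecutive bool) float64 {
	// More objects merged into one means a larger reduction in S3 object count.
	objectsReduced := float64(objectCount - 1)

	// Size efficiency peaks at 1.0 when the aggregate hits the target size
	// exactly and falls off as it under- or overshoots.
	ratio := float64(totalSize) / float64(targetSize)
	sizeEfficiency := ratio
	if ratio > 1 {
		sizeEfficiency = 1 / ratio
	}

	score := objectsReduced * sizeEfficiency
	if consecutive {
		score *= 1.5 // bonus for adjacent datapoint ranges
	}
	// Operation cost: subtract a flat download/upload overhead per object.
	return score - 0.1*float64(objectCount)
}

func main() {
	// Ten small 100 MiB dataranges aggregated toward a 1 GiB target.
	fmt.Printf("%.2f\n", aggregationScore(10, 10*100<<20, 1<<30, true))
}
```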
Example Usage:
# One-time optimization with dry-run to see recommendations
./datas3t optimize \
--datas3t my-dataset \
--dry-run
# Execute optimization with custom thresholds
./datas3t optimize \
--datas3t my-dataset \
--min-score 2.0 \
--target-size 2GB \
--max-operations 5
# Continuous monitoring mode
./datas3t optimize \
--datas3t my-dataset \
--daemon \
--interval 5m
# Show all available optimization opportunities
./datas3t optimize \
--datas3t my-dataset \
--dry-run \
--min-score 0.5 \
--max-operations 20
Benefits:
- Intelligent optimization: Automatically identifies the best aggregation opportunities
- Cost reduction: Reduces S3 object count and storage costs
- Performance improvement: Fewer, larger files improve download performance
- Hands-off operation: Can run continuously to maintain optimal storage layout
- Safe operations: Uses existing battle-tested aggregation system
- Flexible configuration: Customizable thresholds and strategies
Manual Aggregation Example
# Example: You have uploaded multiple small TAR files and want to consolidate them
# First, check your current dataranges
./datas3t datas3t list
# Aggregate the first 10,000 datapoints into a single larger datarange
./datas3t aggregate \
--datas3t my-dataset \
--first-datapoint 1 \
--last-datapoint 10000 \
--max-parallelism 6
# Check the result - you should see fewer, larger dataranges
./datas3t datas3t list
Benefits:
- Reduces the number of S3 objects (lower storage costs)
- Improves download performance for large ranges
- Maintains all data integrity and accessibility
- Can be run multiple times to further consolidate data
Complete Workflow Example
# 1. Generate encryption key
./datas3t keygen
export ENCRYPTION_KEY="generated-key-here"
# 2. Start server
./datas3t server &
# 3. Add bucket configuration
./datas3t bucket add \
--name production-bucket \
--endpoint https://s3.amazonaws.com \
--bucket my-production-data \
--access-key "$AWS_ACCESS_KEY" \
--secret-key "$AWS_SECRET_KEY"
# 4. Create datas3t
./datas3t datas3t add \
--name image-dataset \
--bucket production-bucket
# 5. Upload data
./datas3t upload-tar \
--datas3t image-dataset \
--file ./images-batch-1.tar
# 6. List datasets
./datas3t datas3t list
# 7. Import existing datas3ts (disaster recovery/migration)
./datas3t datas3t import \
--bucket production-bucket
# 8. Download specific range
./datas3t datarange download-tar \
--datas3t image-dataset \
--first-datapoint 100 \
--last-datapoint 200 \
--output ./images-100-200.tar
# 9. Optimize datarange storage automatically
./datas3t optimize \
--datas3t image-dataset \
--dry-run
# 10. Execute optimization
./datas3t optimize \
--datas3t image-dataset \
--min-score 2.0
# 11. Or aggregate specific ranges manually
./datas3t aggregate \
--datas3t image-dataset \
--first-datapoint 1 \
--last-datapoint 10000
# 12. Clear all data from a datas3t (keeping the datas3t record)
./datas3t datas3t clear \
--name image-dataset \
--force
# 13. Check results after optimization/aggregation/clear
./datas3t datas3t list
Environment Variables
All CLI commands support these environment variables:
DATAS3T_SERVER_URL - Default server URL for all commands
DB_URL - Database connection string (server command)
CACHE_DIR - Cache directory path (server command)
ENCRYPTION_KEY - Base64-encoded encryption key (server command)
File Naming Convention
Datapoints must follow the naming pattern %020d.<extension>:
Valid:
- 00000000000000000001.txt
- 00000000000000000042.jpg
- 00000000000000001337.json

Invalid:
- file1.txt
- 1.txt
- 001.txt
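Names can be checked before packing with a simple pattern: exactly 20 digits, a dot, and an extension. The regular expression below is a sketch; the extension character set it accepts is an assumption, not datas3t's exact validation rule.

```go
package main

import (
	"fmt"
	"regexp"
)

// datapointNameRe matches the %020d.<extension> convention:
// exactly 20 digits, a dot, and a non-empty alphanumeric extension.
var datapointNameRe = regexp.MustCompile(`^[0-9]{20}\.[A-Za-z0-9]+$`)

func main() {
	for _, name := range []string{"00000000000000000001.txt", "file1.txt", "001.txt"} {
		fmt.Printf("%s valid=%v\n", name, datapointNameRe.MatchString(name))
	}
}
```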
Storage Efficiency
- Small Files: 99%+ storage efficiency vs individual S3 objects
- Large Datas3ts: Linear scaling with datas3t size
- Index Lookup: O(1) file location within TAR
- Range Queries: Optimized byte-range requests
- Caching: Local disk cache for frequently accessed indices
Scalability
- Concurrent Operations: Supports parallel uploads/downloads
- Large Datas3ts: Tested with millions of datapoints
- Distributed: Stateless server design for horizontal scaling
Contributing
Currently we are not accepting contributions to this project.
Architecture Details
Database Schema
- s3_buckets: S3 configuration storage
- datas3ts: Datas3t metadata
- dataranges: TAR archive metadata and byte ranges
- datarange_uploads: Temporary upload state management
- aggregate_uploads: Aggregation operation tracking and state management
- keys_to_delete: Immediate deletion queue for obsolete S3 objects
TAR Index Format
Binary format with one 16-byte entry per file:
- Bytes 0-7: File position in TAR (big-endian uint64)
- Bytes 8-9: Header blocks count (big-endian uint16)
- Bytes 10-15: File size (big-endian, 48-bit)
Caching Strategy
- Memory: In-memory index objects during operations
- Disk: Persistent cache for TAR indices
- LRU Eviction: Automatic cleanup based on access patterns
- Cache Keys: SHA-256 hash of datarange metadata
Key Deletion Service
- Background Worker: Automatic cleanup of obsolete S3 objects
- Batch Processing: Processes 5 deletion requests at a time
- Immediate Processing: No delay between queuing and deletion
- Error Handling: Retries failed deletions, logs errors without blocking
- Database Consistency: Atomic removal from deletion queue after successful S3 deletion
- Graceful Shutdown: Respects context cancellation for clean server shutdown
License
This project is licensed under the AGPL-3.0 License - see the LICENSE file for details.
Support
For questions, issues, or contributions:
- Open an issue on GitHub
- Check existing documentation
- Review test files for usage examples
Installation
git clone https://github.com/draganm/datas3t.git
cd datas3t
nix develop -c make build
Configuration
Database Setup
Create a PostgreSQL database and set the connection string:
export DB_URL="postgres://user:password@localhost:5432/datas3t"
export CACHE_DIR="/path/to/cache"
S3 Credential Encryption
Important: S3 credentials are encrypted at rest using AES-256-GCM with unique random nonces.
The encryption system provides the following security features:
- AES-256-GCM encryption: Industry-standard authenticated encryption
- Unique nonces: Each encryption uses a random nonce, so identical credentials produce different encrypted values
- Key derivation: Input keys are SHA-256 hashed to ensure proper 32-byte key size
- Authenticated encryption: Protects against tampering and ensures data integrity
- Transparent operation: All S3 operations automatically encrypt/decrypt credentials
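The scheme described above (SHA-256 key derivation, AES-256-GCM, fresh random nonce prepended to each ciphertext) can be sketched in Go with the standard library. This is an illustration of the documented design, not datas3t's actual implementation.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// encrypt derives a 32-byte AES-256 key by SHA-256 hashing the input key,
// draws a fresh random nonce, and returns nonce||ciphertext. Because the
// nonce is random, identical plaintexts encrypt to different values.
func encrypt(key, plaintext []byte) ([]byte, error) {
	k := sha256.Sum256(key)
	block, err := aes.NewCipher(k[:])
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decrypt splits off the nonce and opens the authenticated ciphertext;
// any tampering causes Open to return an error.
func decrypt(key, ciphertext []byte) ([]byte, error) {
	k := sha256.Sum256(key)
	block, err := aes.NewCipher(k[:])
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	return gcm.Open(nil, ciphertext[:n], ciphertext[n:], nil)
}

func main() {
	ct, _ := encrypt([]byte("my-key"), []byte("SECRET_KEY"))
	pt, _ := decrypt([]byte("my-key"), ct)
	fmt.Println(string(pt)) // SECRET_KEY
}
```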
Key Generation
Generate a cryptographically secure 256-bit encryption key:
nix develop -c go run ./cmd/datas3t keygen
This generates a 32-byte (256-bit) random key encoded as base64. Store this key securely and set it as an environment variable:
export ENCRYPTION_KEY="your-generated-key-here"
Alternative Key Generation
You can also use datas3t keygen if you have built the binary:
./datas3t keygen
Critical Security Notes:
- Keep this key secure and backed up! If you lose it, you won't be able to decrypt your stored S3 credentials
- The same key must be used consistently across server restarts
- Changing the key will make existing encrypted credentials unreadable
- Store the key separately from your database backups for additional security
Starting the Server
./datas3t server --db-url "$DB_URL" --cache-dir "$CACHE_DIR" --encryption-key "$ENCRYPTION_KEY"
Or using environment variables:
export DB_URL="postgres://user:password@localhost:5432/datas3t"
export CACHE_DIR="/path/to/cache"
export ENCRYPTION_KEY="your-encryption-key"
./datas3t server