go-delta


A smart delta compression tool for backups, written in Go.

Features

  • Multiple compression formats - GDELTA (custom format with optional deduplication), standard ZIP (universal compatibility), or XZ (best compression ratio)
  • Dictionary compression - Auto-trained zstd dictionary for better compression of many small files with common patterns (GDELTA03 format)
  • Content-based deduplication - FastCDC content-defined chunking with BLAKE3 hashing (GDELTA02 format)
  • Streaming chunking - Process large files (GB+) with constant memory usage via callback-based chunking
  • Human-readable sizes - Use 64KB, 128MB, 2GB instead of raw byte counts
  • Smart memory management - Auto-calculated thread memory with system RAM detection and safety warnings
  • Bounded chunk store - LRU eviction prevents memory exhaustion on large datasets
  • Minimum chunk size enforcement - 4KB minimum prevents metadata overhead from exceeding savings
  • Zstandard compression - Industry-leading compression with configurable levels (1-22) for GDELTA
  • Deflate compression - Standard ZIP deflate compression (levels 1-9) for universal compatibility
  • GC-free ZIP mode - Optional garbage collection bypass with pooled buffers for reduced latency spikes
  • True parallel compression - Folder-based worker pool with independent compression (no mutex contention)
  • Streaming architecture - Temporary file streaming avoids loading compressed data into RAM
  • Robust cleanup - Automatic temp file deletion on normal exit, errors, and interruptions (Ctrl+C)
  • Cross-platform - Native system memory detection for Linux, macOS, and Windows
  • Subdirectory support - Recursively compress directory structures
  • Custom file selection - Library API supports custom file/folder lists (independent of directory structure)
  • Progress visualization - Multi-bar progress tracking for concurrent operations
  • Archive verification - Structural and data integrity validation for GDELTA01, GDELTA02, GDELTA03, ZIP, and XZ formats
  • CLI and Library - Use as a command-line tool or Go library
  • Compress & Decompress - Full round-trip support with integrity validation
  • Overwrite protection - Safe decompression with optional overwrite mode
  • Gitignore support - Respect .gitignore files (including nested) to exclude matching paths during compression

Installation

From source
git clone https://github.com/creativeyann17/go-delta.git
cd go-delta
make build

The binary will be in bin/godelta.

Development setup
# Install git hooks for automatic code formatting
make install-hooks

# Run tests
make test

# Format code
make fmt

CLI Usage

Compress files
# Basic compression
godelta compress -i /path/to/files -o backup.delta

# With custom settings
godelta compress \
  --input /data \
  --output archive.delta \
  --threads 8 \
  --level 9 \
  --verbose

# Enable chunk-based deduplication (64KB chunks recommended)
godelta compress \
  --input /data \
  --output archive.delta \
  --chunk-size 64KB \
  --verbose

# Deduplication with bounded memory (5GB chunk store limit)
# Store keeps metadata for all chunks but evicts LRU chunk data
godelta compress \
  --input /data \
  --output archive.delta \
  --chunk-size 128KB \
  --chunk-store-size 5GB \
  --thread-memory 2GB \
  --verbose

# Auto-calculate thread memory from input size
godelta compress \
  --input /large/dataset \
  --output backup.delta \
  --threads 16 \
  --thread-memory 0

# Dry run to see what would be compressed
godelta compress -i /data -o test.delta --dry-run

# Create standard ZIP archive (universal compatibility)
# Multi-threaded ZIP creates multiple archive files for true parallelism
# Example: --threads 8 creates archive_01.zip through archive_08.zip
godelta compress \
  --input /data \
  --output archive.zip \
  --zip \
  --level 9 \
  --threads 8

# Respect .gitignore files to exclude matching paths
# Works with nested .gitignore files throughout the directory tree
godelta compress \
  --input /project \
  --output project-backup.delta \
  --gitignore \
  --verbose

# Dictionary compression for many small files with common patterns
# Auto-trains a zstd dictionary from input files (GDELTA03 format)
godelta compress \
  --input /configs \
  --output configs.delta \
  --dictionary \
  --verbose

# ZIP compression with GC disabled for reduced latency spikes
# Uses pooled buffers to minimize allocations during compression
godelta compress \
  --input /data \
  --output backup.zip \
  --zip \
  --no-gc \
  --threads 8

# XZ compression for best compression ratio (LZMA2 algorithm)
# Multi-threaded XZ creates multiple archive files for true parallelism
# Example: --threads 4 creates archive_01.tar.xz through archive_04.tar.xz
godelta compress \
  --input /data \
  --output backup.tar.xz \
  --xz \
  --level 9 \
  --threads 4

Note: ZIP format with multiple threads creates one archive file per thread (e.g., archive_01.zip, archive_02.zip, etc.) for true parallel compression without mutex contention. Decompression auto-detects and extracts all parts.

Decompress files
# Basic decompression
godelta decompress -i backup.delta -o /restore/path

# With overwrite (replace existing files)
godelta decompress -i backup.delta -o /restore/path --overwrite

# Verbose output
godelta decompress -i backup.delta -o /restore/path --verbose
Verify archives

Verify archive integrity without extracting files. Supports GDELTA01, GDELTA02, GDELTA03, ZIP, and XZ formats.

# Quick structural validation (fast)
godelta verify -i backup.delta

# Full data integrity check (slower, decompresses all data)
godelta verify -i backup.delta --data

# Verbose output with detailed information
godelta verify -i backup.delta --data --verbose

# Minimal output (only shows final result)
godelta verify -i backup.delta --quiet

What gets verified:

  • Structural validation (default, fast):

    • Header magic bytes and format
    • File count and metadata
    • Chunk index integrity (GDELTA02)
    • Footer marker
    • Duplicate path detection
    • Orphaned/missing chunks (GDELTA02)
  • Data integrity (with --data flag):

    • All structural checks above
    • Decompress all data to validate
    • Size verification (decompressed vs expected)
    • Chunk decompression (GDELTA02)
    • Reports corrupt files/chunks

Multi-part archive support:

  • ZIP: Auto-detects archive_01.zip, archive_02.zip, etc.
  • XZ: Auto-detects archive_01.tar.xz, archive_02.tar.xz, etc.
  • Verifies all parts when given the first part (e.g., godelta verify -i backup_01.zip)

Performance notes:

  • ZIP verification is fast: ZIP has a central directory, so metadata can be read without decompression
  • XZ verification is slower: tar.xz is a streaming format requiring full decompression to read file metadata
  • Use ZIP format when fast verification is important

Exit codes:

  • 0 - Archive is valid
  • 1 - Archive has errors or validation failed

Example output:

Verifying archive: backup.delta
Mode: Structural validation only

  Progress: 1234/1234 files

Archive: backup.delta [VALID]
Format:  GDELTA02
Size:    2.45 GB
Files:   1234
Original:   5.12 GB
Compressed: 2.45 GB (47.9% ratio)
Saved:      2.67 GB (52.1%)

Chunk Info:
  Chunk Size:  64.00 KB
  Unique:      38452 chunks
  References:  78903 total
  Dedup Ratio: 51.3%
Compress Options
  • -i, --input: Input file or directory (required)
  • -o, --output: Output archive file (default: "archive.delta")
  • -t, --threads: Max concurrent threads (default: CPU count)
  • --thread-memory: Max memory per thread (e.g. 128MB, 1GB, 0=auto, default: 0)
  • -l, --level: Compression level 1-9 for ZIP, 1-22 for GDELTA (default: 5)
  • --chunk-size: Average chunk size for content-defined dedup (e.g. 64KB, 512KB, actual chunks vary 1/4x-4x, min: 4KB, 0=disabled, default: 0, GDELTA only)
  • --chunk-store-size: Max in-memory dedup cache size (e.g. 1GB, 500MB, 0=unlimited, default: 0, GDELTA only)
  • --zip: Create standard ZIP archive instead of GDELTA format (universally compatible, no deduplication)
  • --xz: Create XZ archive with LZMA2 compression (best compression ratio, slower)
  • --dictionary: Use dictionary compression (GDELTA03 format, auto-trains from input, best for many small files with common patterns)
  • --no-gc: Disable garbage collection during ZIP compression (reduces latency spikes, uses pooled buffers)
  • --gitignore: Respect .gitignore files to exclude matching paths (supports nested .gitignore files)
  • --dry-run: Simulate without writing
  • --verbose: Show detailed output including chunk statistics
  • --quiet: Minimal output

Size format: All size parameters accept human-readable formats:

  • Bytes: 1024B or 1024
  • Kilobytes: 64KB or 64K
  • Megabytes: 128MB or 128M
  • Gigabytes: 2GB or 2G
  • Terabytes: 1TB or 1T
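
The sketch below shows one way such strings could be parsed into byte counts (illustration only: parseSize is hypothetical, not part of the documented library API, and it assumes binary 1024-based multiples):

// parseSize is a hypothetical helper shown for illustration only.
// It converts strings such as "64KB", "128M", or "1024" into byte counts,
// assuming binary (1024-based) multiples.
func parseSize(s string) (uint64, error) {
    s = strings.ToUpper(strings.TrimSpace(s))
    units := map[string]uint64{
        "B": 1, "K": 1 << 10, "KB": 1 << 10, "M": 1 << 20, "MB": 1 << 20,
        "G": 1 << 30, "GB": 1 << 30, "T": 1 << 40, "TB": 1 << 40,
    }
    // Find where the numeric prefix ends and the unit suffix begins.
    i := strings.IndexFunc(s, func(r rune) bool { return r < '0' || r > '9' })
    if i == -1 { // plain byte count, e.g. "1024"
        return strconv.ParseUint(s, 10, 64)
    }
    n, err := strconv.ParseUint(s[:i], 10, 64)
    if err != nil {
        return 0, err
    }
    mul, ok := units[s[i:]]
    if !ok {
        return 0, fmt.Errorf("unknown size suffix %q", s[i:])
    }
    return n * mul, nil
}
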
Decompress Options
  • -i, --input: Input archive file (required, auto-detects .gdelta or .zip format)
  • -o, --output: Output directory (default: current directory)
  • --overwrite: Overwrite existing files
  • --verbose: Show detailed output
  • --quiet: Minimal output

Note: Decompression automatically detects the archive format (GDELTA01, GDELTA02, GDELTA03, ZIP, or XZ) by reading the file signature.
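
A minimal sketch of what signature-based detection looks like (illustration only: the ZIP and XZ magic bytes shown are the standard ones, but the GDELTA header strings are assumptions and may not match the actual on-disk layout):

// detectFormat is an illustrative sketch of signature-based format detection.
func detectFormat(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()

    buf := make([]byte, 8)
    if _, err := io.ReadFull(f, buf); err != nil {
        return "", err
    }
    switch {
    case bytes.HasPrefix(buf, []byte("PK\x03\x04")): // standard ZIP signature
        return "ZIP", nil
    case bytes.HasPrefix(buf, []byte{0xFD, '7', 'z', 'X', 'Z', 0x00}): // standard XZ signature
        return "XZ", nil
    case bytes.HasPrefix(buf, []byte("GDELTA")): // assumed GDELTA header string
        return string(buf[:8]), nil // e.g. "GDELTA01"
    default:
        return "", fmt.Errorf("unknown archive signature")
    }
}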

Verify Options
  • -i, --input: Input archive file to verify (required)
  • --data: Perform full data integrity check by decompressing all content (default: false)
  • --verbose: Show detailed progress and file-by-file verification
  • --quiet: Minimal output, only show final result

Note: Structural validation is fast and checks metadata, headers, and index integrity. Data verification decompresses all content and is slower but provides complete validation.

Archive Formats

ZIP (Standard)

Standard ZIP archive format with deflate compression:

  • Universal compatibility: Works with any ZIP tool (unzip, 7zip, WinZip, etc.)
  • Deflate compression: Industry-standard compression (levels 1-9)
  • Multi-part parallel compression: Each worker thread creates its own ZIP file for true parallelism (no mutex bottleneck)
  • No deduplication: Each file compressed independently
  • Use case: Maximum portability, sharing archives, integration with existing tools

Multi-threaded behavior: When using multiple threads (e.g., --threads 8), godelta creates one ZIP file per thread:

  • Single thread: backup.zip
  • Multi-threaded: backup_01.zip, backup_02.zip, ..., backup_08.zip
  • Files are distributed evenly across worker ZIPs
  • True parallel writes (no serialization bottleneck)
  • Decompression auto-detects and extracts all parts

Performance: Slightly slower than GDELTA01 (deflate vs zstd), but universally compatible.

# Create ZIP archive (creates backup_01.zip through backup_08.zip with 8 threads)
godelta compress -i /data -o backup.zip --zip --level 9 --threads 8

# Extract with godelta (auto-detects all parts)
godelta decompress -i backup_01.zip -o /restore

# Or extract individual parts with standard tools
unzip -d /restore backup_01.zip
unzip -d /restore backup_02.zip
# ... etc
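
The same multi-part ZIP behavior is available from the library by setting UseZipFormat on compress.Options (see the API Reference below); a minimal sketch:

opts := &compress.Options{
    InputPath:    "/data",
    OutputPath:   "backup.zip",
    Level:        9,
    MaxThreads:   8,    // with ZIP, each worker writes its own backup_NN.zip part
    UseZipFormat: true, // standard ZIP output instead of GDELTA
}

result, err := compress.Compress(opts, nil)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Wrote %d files across ZIP parts\n", result.FilesProcessed)
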
XZ (Best Compression)

Standard tar.xz archive format with LZMA2 compression:

  • Best compression ratio: LZMA2 typically achieves 10-30% better compression than zstd or deflate
  • Universal compatibility: Works with standard tar and xz tools
  • Multi-part parallel compression: Each worker thread creates its own .tar.xz file for true parallelism
  • No deduplication: Each file compressed independently
  • Use case: Maximum compression for archival, cold storage, distribution

Multi-threaded behavior: When using multiple threads (e.g., --threads 4), godelta creates one tar.xz file per thread:

  • Single thread: backup_01.tar.xz
  • Multi-threaded: backup_01.tar.xz, backup_02.tar.xz, ..., backup_04.tar.xz
  • Files are distributed evenly across worker archives
  • True parallel writes (no serialization bottleneck)
  • Decompression auto-detects and extracts all parts

Performance: Slowest compression but best ratio. Use for archival where compression time is less critical than final size.

Compression levels: XZ uses LZMA2 with levels 1-9:

Level  Speed   Compression  Memory
1      Fast    Good         Low
5      Medium  Very Good    Medium
9      Slow    Best         High

# Create XZ archive (creates backup_01.tar.xz through backup_04.tar.xz with 4 threads)
godelta compress -i /data -o backup.tar.xz --xz --level 9 --threads 4

# Extract with godelta (auto-detects all parts)
godelta decompress -i backup_01.tar.xz -o /restore

# Or extract individual parts with standard tools
tar -xJf backup_01.tar.xz -C /restore
tar -xJf backup_02.tar.xz -C /restore
# ... etc

When to use XZ:

  • Archival storage where size matters more than speed
  • Distributing compressed files over slow networks
  • Cold storage backups accessed infrequently
  • Text-heavy data (source code, logs, configs) where LZMA excels

When NOT to use XZ:

  • Frequent backups where compression speed matters
  • Already compressed data (images, videos, archives)
  • Real-time or streaming applications
ZIP Performance Tuning

--no-gc flag: Disables Go's garbage collector during ZIP compression for reduced latency spikes:

  • Forces a GC cleanup before starting compression
  • Disables GC during the compression phase
  • Uses pooled buffers to minimize heap allocations
  • GC is automatically re-enabled after compression completes
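
A minimal sketch of the general Go pattern behind these steps (illustration only, not necessarily the project's exact implementation; it uses the standard runtime, runtime/debug, and sync packages):

// Collect once up front, then disable the GC for the compression phase;
// restore the previous setting when compression completes.
runtime.GC()
prev := debug.SetGCPercent(-1) // -1 disables the garbage collector
defer debug.SetGCPercent(prev) // re-enable GC afterwards

// Pooled buffers keep heap allocations low while the GC is off.
var bufPool = sync.Pool{
    New: func() any { return make([]byte, 64*1024) },
}
buf := bufPool.Get().([]byte)
// ... compress into buf ...
bufPool.Put(buf)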

When to use --no-gc:

  • Large archives with many files where GC pauses cause noticeable latency
  • Performance-critical backup jobs where consistent throughput matters
  • Systems with limited memory where GC pressure is high
# ZIP compression with GC disabled
godelta compress -i /data -o backup.zip --zip --no-gc --threads 8

Gitignore Support

The --gitignore flag enables automatic exclusion of files matching patterns defined in .gitignore files. This feature is useful for excluding build artifacts, dependencies, logs, and other generated files from backups.

Features:

  • Nested .gitignore files: Supports multiple .gitignore files throughout the directory tree
  • Pattern inheritance: Child directories inherit patterns from parent .gitignore files
  • Git-compliant behavior: Follows standard Git ignore semantics
  • Efficient pre-scanning: Scans for all .gitignore files once before compression
  • Directory pruning: Skips entire directories matching ignore patterns (e.g., node_modules/, build/)

Supported patterns:

  • Wildcards: *.log, *.tmp
  • Directories: build/, node_modules/
  • Negation: !important.log (within same file)
  • Double-star: **/temp/, **/*.bak
  • Comments: # This is a comment

Example:

# Create backup excluding files matched by .gitignore
godelta compress \
  --input /project \
  --output project-backup.delta \
  --gitignore \
  --verbose

How it works:

  1. Scans directory tree for all .gitignore files before compression
  2. Compiles each .gitignore into pattern matchers
  3. During file traversal, checks each file against applicable patterns (root to child hierarchy)
  4. Prunes entire directories matching directory patterns (e.g., build/)
  5. Skips individual files matching file patterns (e.g., *.log)

Pattern priority:

  • More specific (child) .gitignore patterns apply to files in subdirectories
  • Parent patterns apply to all descendants unless negated
  • Directory-specific patterns (with trailing /) only match directories

Note: .gitignore files themselves are included in the archive by default. To exclude them, add .gitignore to your .gitignore file.
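
From the library, the same behavior is enabled with the UseGitignore field on compress.Options (documented in the API Reference below); a minimal sketch:

opts := &compress.Options{
    InputPath:    "/project",
    OutputPath:   "project-backup.delta",
    UseGitignore: true, // honor .gitignore files, including nested ones
    Verbose:      true,
}

result, err := compress.Compress(opts, nil)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Compressed %d of %d files (gitignored paths excluded)\n",
    result.FilesProcessed, result.FilesTotal)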

GDELTA03 (Dictionary Compression)

Custom format with auto-trained zstd dictionary for better compression of similar files:

  • Header: Magic number + dictionary size + file count
  • Dictionary: Auto-trained zstd dictionary (32KB-112KB based on input size)
  • Entry metadata: Path, original size, compressed size, data offset
  • Compressed data: Dictionary-compressed file contents

How it works:

  1. Scans input files and collects samples for dictionary training
  2. Auto-computes optimal dictionary size based on total data volume
  3. Trains a zstd dictionary from the samples
  4. Compresses all files using the trained dictionary
  5. Stores dictionary in archive header for decompression

Dictionary size selection:

Input Size   Dictionary Size
< 10 MB      32 KB
10-100 MB    64 KB
> 100 MB     112 KB

When to use GDELTA03:

  • Many small files with common patterns (config files, JSON, XML, logs)
  • Source code repositories with similar file structures
  • Collections of text files with shared vocabulary
  • Any dataset where files share common byte sequences

When NOT to use GDELTA03:

  • Few large files (dictionary overhead not worth it)
  • Already compressed files (photos, videos, archives)
  • Encrypted or random data
  • Files with no common patterns

Limitations:

  • Cannot be combined with --chunk-size (deduplication)
  • Cannot be combined with --zip
  • Dictionary training adds overhead for small datasets
# Dictionary compression for config files
godelta compress -i /etc/configs -o configs.delta --dictionary --verbose

# Dictionary compression for source code
godelta compress -i /src/project -o source.delta --dictionary --level 9
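
The equivalent library call sets UseDictionary on compress.Options (see the API Reference below); a minimal sketch:

opts := &compress.Options{
    InputPath:     "/configs",
    OutputPath:    "configs.delta",
    Level:         9,
    UseDictionary: true, // GDELTA03: auto-trains a zstd dictionary from the input
}

result, err := compress.Compress(opts, nil)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Dictionary-compressed %d files\n", result.FilesProcessed)
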
GDELTA01 (Traditional)

Custom format with zstandard compression (no deduplication):

  • Header: Magic number + file count
  • Entry metadata: Path, original size, compressed size, data offset
  • Compressed data: Zstandard-compressed file contents

Files are stored sequentially with entry headers followed immediately by compressed data.

Performance: Fastest compression, best compression ratio (zstd), no deduplication overhead.

GDELTA02 (Chunked with Deduplication)

Content-based deduplication using FastCDC (Fast Content-Defined Chunking):

  • Header: Magic number + chunk size + counts
  • Chunk Index: Hash → offset mapping for all unique chunks
  • File Metadata: Path + chunk hash list for each file
  • Chunk Data: Deduplicated compressed chunks
  • Footer: End marker

Why FastCDC (Content-Defined Chunking)?

Unlike fixed-size chunking, FastCDC finds chunk boundaries based on content patterns using a rolling hash. This makes deduplication resilient to insertions and deletions:

Fixed-size chunking (old approach):
  File A: [chunk1][chunk2][chunk3]
  File B: X[chunk1'][chunk2'][chunk3']  ← 1 byte inserted
          ↑ ALL boundaries shift, ZERO matches!

Content-defined chunking (FastCDC):
  File A: [chunk1][chunk2][chunk3]
  File B: [X][chunk1][chunk2][chunk3]  ← Only 1 new chunk, rest match!
          ↑ Boundaries based on content patterns
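
A highly simplified sketch of how content-defined boundaries can be found with a gear-style rolling hash (illustration only; the project's FastCDC implementation adds normalized chunking and min/max size bounds not shown here):

// gear maps each byte value to a pseudo-random 64-bit value; FastCDC uses a
// similar fixed table. Built here from a tiny deterministic generator.
var gear = func() [256]uint64 {
    var t [256]uint64
    seed := uint64(0x9E3779B97F4A7C15)
    for i := range t {
        seed ^= seed << 13
        seed ^= seed >> 7
        seed ^= seed << 17
        t[i] = seed
    }
    return t
}()

// findBoundaries returns content-defined cut points. The hash is shifted left
// each step, so only the most recent bytes influence it, and a boundary is
// declared whenever its low bits are zero. Identical content therefore yields
// identical cuts even after bytes are inserted or deleted earlier in the file.
func findBoundaries(data []byte, avgSize int) []int {
    mask := uint64(avgSize - 1) // avgSize must be a power of two in this sketch
    var cuts []int
    var h uint64
    for i, b := range data {
        h = (h << 1) + gear[b]
        if h&mask == 0 {
            cuts = append(cuts, i+1)
        }
    }
    return append(cuts, len(data)) // final chunk ends at end of data
}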

Real-world test results:

  • Files with 1-byte prefix difference: 95% chunk match (vs 0% with fixed chunking)
  • Similar files with shared content: 65% deduplication ratio
  • Archives are reproducible (deterministic chunk ordering)

Deduplication benefits:

  • Shared content across files stored once (even with small shifts/edits)
  • BLAKE3 hashing for chunk identification
  • Configurable average chunk size (actual chunks vary 1/4x to 4x)
  • Bounded chunk store with LRU eviction (prevents OOM on large datasets)
  • Streaming temp file architecture (compressed chunks written to disk, not RAM)
  • Statistics: Total chunks, unique chunks, deduplication ratio, bytes saved, evictions

Memory management:

  • Chunk metadata (~56 bytes per chunk in archive index + ~32 bytes per file reference)
  • In-memory overhead (~120 bytes per chunk: metadata + LRU structures)
  • Deduplication cache (LRU): Evicts least-recently-used chunks when --chunk-store-size limit reached
  • Compressed chunk data: Written to temporary file during compression, streamed to final archive
  • Temp file cleanup: Automatic cleanup on normal exit, errors, and user interruption (Ctrl+C)
  • Thread memory: Auto-calculated from input size when --thread-memory 0, with safety warnings if exceeding system RAM
  • Cross-platform memory detection: Linux (sysinfo), macOS (sysctl), Windows (GlobalMemoryStatusEx)

Minimum chunk size: 4 KB

  • Chunks smaller than 4KB have metadata overhead that exceeds compression benefits
  • Each chunk requires 56 bytes in the archive index + 32 bytes per file reference

Recommended chunk sizes:

Use Case                     Chunk Size     Why
General purpose              64KB           Good balance of dedup granularity vs overhead
Source code, logs, configs   32KB-64KB      Smaller changes need finer granularity
VM images, database dumps    128KB-256KB    Large files with big repeated sections

Trade-offs:

  • Smaller chunks (8-32KB): Better dedup for small edits, but more metadata overhead (~88 bytes/chunk)
  • Larger chunks (128-512KB): Less overhead and faster, but need larger matching regions for dedup

⚠️ IMPORTANT: Chunk deduplication only benefits repetitive data

  • Use chunking for: VM images, database backups, log files, source code repositories
  • DON'T use chunking for: Unique media files (photos, videos, music), compressed archives, encrypted data, random data
  • Why: Metadata overhead (56 bytes per chunk) can make archive LARGER if there's little duplication
  • Example: 5 million unique 10KB chunks = ~421 MB of pure metadata overhead
  • Rule of thumb: If you don't expect at least 10% duplication, disable chunking (--chunk-size 0)
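
To estimate this overhead for your own dataset, a back-of-the-envelope sketch using the per-chunk figures quoted above (~56 bytes per unique chunk in the index plus ~32 bytes per chunk reference per file; illustration only):

// estimateIndexOverhead approximates dedup metadata size in bytes.
func estimateIndexOverhead(uniqueChunks, totalReferences uint64) uint64 {
    return uniqueChunks*56 + totalReferences*32
}

// The warning above: 5 million unique 10KB chunks, each referenced once,
// works out to roughly 420 MB of metadata, in line with the figure quoted there.
overhead := estimateIndexOverhead(5_000_000, 5_000_000)
fmt.Printf("~%.0f MB of metadata\n", float64(overhead)/(1024*1024))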

When to use GDELTA02:

  • Backups with duplicate files (e.g., VM images, database dumps, logs with repeated patterns)
  • Similar files with repeated content (e.g., source code with shared libraries, config files)
  • Large datasets with redundant blocks (e.g., incremental backups, version-controlled data)
  • NOT recommended for: Collections of unique compressed files, media libraries, encrypted archives

Format selection:

  • With --xz: XZ format (LZMA2 compression, best ratio, slowest)
  • With --zip: ZIP format (deflate compression, universal compatibility)
  • With --dictionary: GDELTA03 (zstd + auto-trained dictionary)
  • With --chunk-size N: GDELTA02 (zstd + deduplication)
  • Default (no flags): GDELTA01 (zstd compression, fastest)

Note: --xz, --zip, --dictionary, and --chunk-size are mutually exclusive.

Architecture

Folder-Based Parallelism

go-delta achieves true parallel compression by grouping files by their parent directory:

  1. File Grouping: Files are organized into folder-based tasks
  2. Parallel Compression: Workers compress files independently (no locks during compression)
  3. Minimal Mutex Locking: Lock only during quick archive writes or chunk store updates
  4. Streaming Architecture: Compressed chunks written to temporary file, then streamed to archive

Example workflow with 4 threads:

Worker 1: Compress /src/utils/* → Write chunks to temp file → Update chunk store
Worker 2: Compress /src/models/* → Write chunks to temp file → Update chunk store (parallel!)
Worker 3: Compress /docs/* → Write chunks to temp file → Update chunk store
Worker 4: Compress /tests/* → Write chunks to temp file → Update chunk store
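
A simplified sketch of this folder-based worker pool (an illustration of the architecture only, not the project's actual code; compressFolder and appendToArchive stand in for the real work):

// Each task is one folder; workers compress folders independently and only
// take the shared lock for the brief archive write.
type folderTask struct {
    folder string
    files  []string
}

func runWorkers(tasks []folderTask, threads int,
    compressFolder func(folderTask) []byte, appendToArchive func([]byte)) {

    ch := make(chan folderTask)
    var wg sync.WaitGroup
    var archiveMu sync.Mutex // guards only the short serialized write

    for i := 0; i < threads; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for t := range ch {
                data := compressFolder(t) // CPU-heavy work, no locks held
                archiveMu.Lock()
                appendToArchive(data) // quick write into the shared archive
                archiveMu.Unlock()
            }
        }()
    }
    for _, t := range tasks {
        ch <- t
    }
    close(ch)
    wg.Wait()
}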

Bounded memory (when --chunk-store-size is set):

  • LRU eviction keeps only most-recently-used chunks in deduplication cache
  • Evicted chunks remain in archive (metadata preserved, just removed from cache)
  • Prevents OOM on large datasets while maintaining full deduplication capability
Progress Tracking

Multi-progress bar visualization using mpb/v8:

  • Individual progress bar per file being compressed
  • Overall progress bar showing total completion
  • Bars auto-remove on completion for clean output

Library Usage

Compression Example
package main

import (
    "fmt"
    "log"
    "github.com/creativeyann17/go-delta/pkg/compress"
)

func main() {
    opts := &compress.Options{
        InputPath:  "/path/to/files",
        OutputPath: "backup.delta",
        Level:      5,
        MaxThreads: 4,
    }

    result, err := compress.Compress(opts, nil)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Compressed %d files: %.2f MB -> %.2f MB (%.1f%%)\n",
        result.FilesProcessed,
        float64(result.OriginalSize)/1024/1024,
        float64(result.CompressedSize)/1024/1024,
        result.CompressionRatio())
}
With Progress Callback
progressCb := func(event compress.ProgressEvent) {
    switch event.Type {
    case compress.EventFileStart:
        fmt.Printf("Compressing %s...\n", event.FilePath)
    case compress.EventFileComplete:
        fmt.Printf("Done: %s\n", event.FilePath)
    case compress.EventComplete:
        fmt.Printf("Completed: %d files\n", event.Current)
    }
}

result, err := compress.Compress(opts, progressCb)
With Chunk-Based Deduplication
opts := &compress.Options{
    InputPath:       "/path/to/files",
    OutputPath:      "backup.delta",
    MaxThreads:      4,
    Level:           5,
    ChunkSize:       128 * 1024,           // 128 KB chunks
    ChunkStoreSize:  5 * 1024,             // 5 GB chunk store limit (in MB)
    MaxThreadMemory: 2 * 1024 * 1024 * 1024, // 2 GB per thread
}

result, err := compress.Compress(opts, nil)
if err != nil {
    log.Fatal(err)
}

fmt.Printf("Compressed %d files: %.2f MB -> %.2f MB (%.1f%%)\n",
    result.FilesProcessed,
    float64(result.OriginalSize)/1024/1024,
    float64(result.CompressedSize)/1024/1024,
    result.CompressionRatio())

if result.ChunkSize > 0 {
    fmt.Printf("Deduplication: %d/%d chunks deduplicated (%.1f%%), %.2f MB saved\n",
        result.DedupedChunks,
        result.TotalChunks,
        result.DedupRatio(),
        float64(result.BytesSaved)/1024/1024)
}
With Custom File List (Library Only)
// Compress specific files/folders without using InputPath
opts := &compress.Options{
    Files: []string{
        "/path/to/file1.txt",
        "/path/to/folder1",
        "/another/path/file2.log",
        "relative/path/to/folder",
    },
    OutputPath: "custom.delta",
    MaxThreads: 4,
    Level:      9,
}

result, err := compress.Compress(opts, nil)
if err != nil {
    log.Fatal(err)
}

fmt.Printf("Compressed %d files from custom list\n", result.FilesProcessed)

Note: When using Files, the InputPath option is ignored. Each path in Files can be absolute or relative, and can point to files or directories. This option is designed for library use only and is not exposed in the CLI.

With Progress Tracking and Formatted Summary
// Use built-in progress bar callback
progressCb, progress := compress.ProgressBarCallback()

opts := &compress.Options{
    InputPath:  "/path/to/files",
    OutputPath: "backup.delta",
    Level:      9,
}

result, err := compress.Compress(opts, progressCb)

// Wait for progress bars to complete
progress.Wait()

if err != nil {
    log.Fatal(err)
}

// Print formatted summary
fmt.Print(compress.FormatSummary(result))

Helper Functions for Library Users:

Compression Helpers:

  • compress.ProgressBarCallback() - Creates a multi-progress bar callback (returns callback and progress container)
  • compress.FormatSummary(result) - Formats compression results as human-readable text
  • compress.FormatSize(bytes) - Converts bytes to human-readable size (KB, MB, GB, etc.)
  • compress.TruncateLeft(path, maxLen) - Truncates file paths from left, preserving filename

Decompression Helpers:

  • decompress.ProgressBarCallback() - Creates a multi-progress bar callback (returns callback and progress container)
  • decompress.FormatSummary(result) - Formats decompression results as human-readable text

Note: Both compression and decompression helpers use the same underlying generic implementation from pkg/godelta, ensuring consistent behavior and formatting across operations.

Decompression with Progress and Summary
package main

import (
    "fmt"
    "log"
    "github.com/creativeyann17/go-delta/pkg/decompress"
)

func main() {
    // Use built-in progress bar callback
    progressCb, progress := decompress.ProgressBarCallback()

    opts := &decompress.Options{
        InputPath:  "backup.delta",
        OutputPath: "/restore/location",
        Overwrite:  true,
    }

    result, err := decompress.Decompress(opts, progressCb)

    // Wait for progress bars to complete
    progress.Wait()

    if err != nil {
        log.Fatal(err)
    }

    // Print formatted summary
    fmt.Print(decompress.FormatSummary(result))

    if !result.Success() {
        log.Fatalf("Decompression completed with %d errors", len(result.Errors))
    }
}
Verification with Progress
package main

import (
    "fmt"
    "log"
    "github.com/creativeyann17/go-delta/pkg/verify"
)

func main() {
    opts := &verify.Options{
        InputPath:  "backup.delta",
        VerifyData: true, // Full data integrity check
        Verbose:    false,
    }

    // Custom progress callback
    progressCb := func(event verify.ProgressEvent) {
        switch event.Type {
        case verify.EventStart:
            fmt.Printf("Starting: %s\n", event.Message)
        case verify.EventFileVerify:
            fmt.Printf("Checking file %d/%d: %s\n", event.Current, event.Total, event.FilePath)
        case verify.EventChunkVerify:
            if event.Current%100 == 0 {
                fmt.Printf("Verified %d/%d chunks\n", event.Current, event.Total)
            }
        case verify.EventComplete:
            fmt.Println("Verification complete")
        case verify.EventError:
            fmt.Printf("Error: %s\n", event.Message)
        }
    }

    result, err := verify.Verify(opts, progressCb)
    if err != nil && result == nil {
        log.Fatal(err)
    }

    // Print formatted summary
    fmt.Print(result.Summary())

    if !result.IsValid() {
        log.Fatalf("Archive validation failed with %d errors", len(result.Errors))
    }

    fmt.Printf("✓ Archive is valid (%.1f%% compression ratio)\n", result.CompressionRatio())
}

API Reference

Compression
compress.Options
type Options struct {
    InputPath       string   // Source file/directory (ignored if Files is provided)
    Files           []string // Custom list of files/folders to compress (library only, overrides InputPath)
    OutputPath      string   // Output archive path
    MaxThreads      int      // Max concurrent threads (default: CPU count)
    MaxThreadMemory uint64   // Max memory per thread in bytes (0=auto-calculate from input size)
    Level           int      // Compression level 1-22 for GDELTA, 1-9 for ZIP (default: 5)
    ChunkSize       uint64   // Chunk size in bytes for dedup (0=disabled, min 4096, GDELTA only)
    ChunkStoreSize  uint64   // Max chunk store size in MB (0=unlimited, GDELTA only)
    UseZipFormat    bool     // Create ZIP archive instead of GDELTA (no deduplication)
    UseXzFormat     bool     // Create XZ archive with LZMA2 (best compression ratio)
    UseDictionary   bool     // Use dictionary compression (GDELTA03 format)
    DisableGC       bool     // Disable GC during ZIP compression (reduces latency)
    UseGitignore    bool     // Respect .gitignore files
    DryRun          bool     // Simulate without writing
    Verbose         bool     // Detailed logging
    Quiet           bool     // Suppress output
}
compress.Result
type Result struct {
    FilesTotal     int      // Total files found
    FilesProcessed int      // Successfully compressed
    OriginalSize   uint64   // Total original bytes
    CompressedSize uint64   // Total compressed bytes
    Errors         []error  // Non-fatal errors
    
    // Deduplication statistics (GDELTA02 only)
    TotalChunks    uint64   // Total chunks processed (including duplicates)
    UniqueChunks   uint64   // Unique chunks stored in archive
    DedupedChunks  uint64   // Chunks deduplicated (found in cache, not re-written)
    BytesSaved     uint64   // Compressed bytes saved by deduplication
    Evictions      uint64   // Chunks evicted from bounded store (only affects RAM, not archive)
}

func (r *Result) CompressionRatio() float64  // Returns ratio as percentage
func (r *Result) DedupRatio() float64        // Returns dedup ratio as percentage (DedupedChunks/TotalChunks)
func (r *Result) Success() bool              // Returns true if no errors
Decompression
decompress.Options
type Options struct {
    InputPath  string  // Input archive file
    OutputPath string  // Output directory (default: ".")
    Overwrite  bool    // Overwrite existing files
    Verbose    bool    // Detailed logging
    Quiet      bool    // Suppress output
}
decompress.Result
type Result struct {
    FilesTotal       int      // Total files in archive
    FilesProcessed   int      // Successfully decompressed
    CompressedSize   uint64   // Archive file size in bytes
    DecompressedSize uint64   // Total decompressed bytes
    Errors           []error  // Non-fatal errors (e.g., file exists)
}
Verification
verify.Options
type Options struct {
    InputPath  string  // Archive file to verify (required)
    VerifyData bool    // Perform full data integrity check (default: false)
    Verbose    bool    // Detailed logging
    Quiet      bool    // Suppress output
}
verify.Result
type Result struct {
    // Archive metadata
    Format      Format // GDELTA01, GDELTA02, GDELTA03, ZIP, XZ, or UNKNOWN
    ArchivePath string // Path to verified archive
    ArchiveSize uint64 // Total archive size in bytes
    
    // Validation status
    HeaderValid    bool // Header is valid
    FooterValid    bool // Footer is valid
    StructureValid bool // Overall structure is valid
    IndexValid     bool // Chunk index is valid (GDELTA02)
    MetadataValid  bool // File metadata is valid
    
    // File statistics
    FileCount     int    // Number of files
    TotalOrigSize uint64 // Sum of original sizes
    TotalCompSize uint64 // Sum of compressed sizes
    EmptyFiles    int    // Number of zero-byte files
    
    // GDELTA02 chunk info
    ChunkSize     uint64 // Configured chunk size
    ChunkCount    uint64 // Unique chunks
    TotalChunkRef uint64 // Total chunk references
    
    // Data integrity (when VerifyData=true)
    DataVerified   bool // Data verification was performed
    FilesVerified  int  // Files with verified data
    ChunksVerified int  // Chunks with verified data
    CorruptFiles   int  // Files that failed verification
    CorruptChunks  int  // Chunks that failed verification
    
    // Issues found
    DuplicatePaths int     // Files with duplicate paths
    OrphanedChunks int     // Unreferenced chunks (GDELTA02)
    MissingChunks  int     // Missing chunk references (GDELTA02)
    Errors         []error // All errors encountered
    
    // File details
    Files []FileInfo // Per-file verification info
}

func (r *Result) IsValid() bool                     // True if archive passed all checks
func (r *Result) Success() bool                      // Alias for IsValid()
func (r *Result) CompressionRatio() float64          // Compression ratio as percentage
func (r *Result) SpaceSaved() uint64                 // Bytes saved by compression
func (r *Result) SpaceSavedRatio() float64           // Space saved as percentage
func (r *Result) ChunkDeduplicationRatio() float64   // Deduplication ratio (GDELTA02)
func (r *Result) AverageChunksPerFile() float64      // Average chunks per file (GDELTA02)
func (r *Result) Summary() string                    // Human-readable summary
verify.ProgressEvent
type ProgressEvent struct {
    Type     EventType // Start, FileVerify, ChunkVerify, Complete, Error
    FilePath string    // File being verified
    Current  int       // Current progress
    Total    int       // Total items
    Message  string    // Progress message
}

// Event types
const (
    EventStart       EventType = iota
    EventFileVerify
    EventChunkVerify
    EventComplete
    EventError
)
Error Handling

All operations return two types of errors:

  1. Fatal errors - Returned as error (operation cannot continue)
  2. Non-fatal errors - Collected in result.Errors (operation continues)

Common errors:

  • Compression: File read errors, permission denied
  • Decompression: decompress.ErrFileExists (use --overwrite)
  • Verification: verify.ErrInvalidMagic, verify.ErrTruncatedArchive, verify.ErrCorruptData
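
For example, non-fatal errors collected during decompression can be inspected with the standard errors package and the documented decompress.ErrFileExists sentinel (a minimal sketch):

result, err := decompress.Decompress(opts, nil)
if err != nil {
    log.Fatal(err) // fatal: the operation could not continue
}

for _, e := range result.Errors { // non-fatal: the operation continued
    if errors.Is(e, decompress.ErrFileExists) {
        fmt.Printf("skipped existing file: %v (set Overwrite to replace)\n", e)
    } else {
        fmt.Printf("warning: %v\n", e)
    }
}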

Development

Build
make build          # Build for current platform -> bin/godelta
make build-all      # Cross-compile for linux/darwin/windows
make clean          # Remove build artifacts
Testing
make test           # Run all tests
make fmt            # Format code with go fmt

The test suite includes:

  • Round-trip compression/decompression with MD5 validation
  • ZIP format with multi-part archive creation and extraction
  • XZ format compression and decompression
  • Archive verification (structural and data integrity) for all formats
  • Subdirectory handling
  • Empty file and directory edge cases
  • Overwrite protection
  • Duplicate compression/decompression scenarios
  • Thread safety and parallel processing
Git Hooks
make install-hooks  # Install pre-commit hook

CI/CD

The project uses GitHub Actions for continuous integration:

  1. Test - Run all tests on tag push
  2. Release - Build binaries and create GitHub release (only if tests pass)

Workflow file: .github/workflows/test-and-release.yml

Testing

Comprehensive test suite with 40+ tests covering:

  • FastCDC content-defined chunking with BLAKE3 hashing
  • Content-shift resilience - verifies chunks match after insertions/deletions
  • Chunked vs non-chunked comparison - asserts dedup produces smaller archives
  • Thread-safe deduplication with bounded LRU store
  • LRU eviction under capacity pressure
  • Round-trip compression/decompression with integrity checks
  • Archive verification for all formats (GDELTA01, GDELTA02, GDELTA03, ZIP, XZ)
  • Multi-part archive creation and verification
  • Cross-directory deduplication
  • Concurrent operations
  • Error handling and edge cases

License

See LICENSE file.
