diffx

package module
v0.1.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Dec 28, 2025 License: MIT Imports: 3 Imported by: 0

README

diffx

A Go diffing library implementing the Myers O(ND) algorithm with heuristics for better output quality. Designed for use cases where readability of diff output matters more than minimal edit distance.

Installation

go get github.com/dacharyc/diffx

Usage

Basic Usage
package main

import (
    "fmt"
    "github.com/dacharyc/diffx"
)

func main() {
    old := []string{"The", "quick", "brown", "fox", "jumps"}
    new := []string{"A", "slow", "red", "fox", "leaps"}

    ops := diffx.Diff(old, new)

    for _, op := range ops {
        switch op.Type {
        case diffx.Equal:
            fmt.Printf("  %v\n", old[op.AStart:op.AEnd])
        case diffx.Delete:
            fmt.Printf("- %v\n", old[op.AStart:op.AEnd])
        case diffx.Insert:
            fmt.Printf("+ %v\n", new[op.BStart:op.BEnd])
        }
    }
}

Output:

- [The quick brown]
+ [A slow red]
  [fox]
- [jumps]
+ [leaps]
Custom Elements

For non-string sequences, implement the Element interface:

type Element interface {
    Equal(other Element) bool
    Hash() uint64
}

Then use DiffElements:

ops := diffx.DiffElements(elementsA, elementsB)
Histogram Diff

For files with many common tokens (prose, code), histogram diff often produces cleaner output:

ops := diffx.DiffHistogram(a, b)
Options
// Force minimal edit distance (slower, but mathematically optimal)
ops := diffx.Diff(a, b, diffx.WithMinimal(true))

// Disable preprocessing (element filtering)
ops := diffx.Diff(a, b, diffx.WithPreprocessing(false))

// Disable postprocessing (boundary shifting)
ops := diffx.Diff(a, b, diffx.WithPostprocessing(false))

Why diffx?

Standard diff algorithms like Myers produce the mathematically optimal edit sequence (minimum number of operations). However, this can result in semantically confusing output when common tokens get matched across unrelated contexts.

The Problem

Consider diffing these two sentences:

Old: "The quick brown fox jumps over the lazy dog"
New: "A slow red fox leaps over the sleeping cat"

A naive Myers implementation might produce fragmented output like:

- The
+ A
  <space>
- quick
+ slow
  <space>
- brown
+ red
  <space>
  fox
  <space>
- jumps
+ leaps
  over
  <space>
  the
  <space>
- lazy
+ sleeping
  <space>
- dog
+ cat

The common words "the" and spaces cause spurious matches that fragment the output.

The diffx Solution

diffx produces cleaner, grouped output:

- The quick brown
+ A slow red
  fox
- jumps over the lazy dog
+ leaps over the sleeping cat

How It Works

diffx uses several techniques to improve output quality:

1. Histogram-Style Anchor Selection

Instead of matching any common element, diffx preferentially anchors on low-frequency elements. Common words like "the", "for", "a" are deprioritized because they appear in many unrelated contexts.

2. Stopword Filtering

Known stopwords (articles, prepositions, common verbs) are excluded from anchor selection entirely, preventing them from creating spurious matches.

3. Preprocessing

High-frequency elements that would cause noisy matches are filtered before the core algorithm runs. The results are then mapped back to original indices.

4. Boundary Shifting

After computing the diff, boundaries are shifted to align with logical breaks:

  • Blank lines are kept as separators (not part of changes)
  • Changes align with punctuation and line boundaries
  • Adjacent operations are merged
5. Weak Anchor Elimination

Short Equal regions consisting entirely of high-frequency elements (like a lone "the" between two changes) can be converted to explicit delete/insert pairs for cleaner output.

Comparison with go-diff

Feature go-diff diffx
Algorithm Myers Myers + Histogram hybrid
Stopword handling None Filtered from anchors
Boundary optimization None Shift to logical breaks
High-frequency filtering None Preprocessing step
Output focus Minimal edits Readable grouping
When to Use diffx
  • Word-level diffs for prose or documentation
  • Token-level diffs where readability matters
  • Cases where common tokens cause fragmented output
When to Use go-diff
  • Character-level diffs
  • Cases requiring mathematically minimal edit distance
  • When you need the standard diff-match-patch feature set

API Reference

Types
type OpType int

const (
    Equal  OpType = iota  // Elements unchanged
    Insert                 // Elements added to B
    Delete                 // Elements removed from A
)

type DiffOp struct {
    Type   OpType
    AStart int  // Start index in A (inclusive)
    AEnd   int  // End index in A (exclusive)
    BStart int  // Start index in B (inclusive)
    BEnd   int  // End index in B (exclusive)
}
Functions
// Diff compares two string slices
func Diff(a, b []string, opts ...Option) []DiffOp

// DiffElements compares arbitrary Element slices
func DiffElements(a, b []Element, opts ...Option) []DiffOp

// DiffHistogram uses histogram-style diff explicitly
func DiffHistogram(a, b []string, opts ...Option) []DiffOp
Options
func WithHeuristic(enabled bool) Option      // Speed heuristics (default: true)
func WithMinimal(minimal bool) Option        // Force minimal edit (default: false)
func WithPreprocessing(enabled bool) Option  // Element filtering (default: true)
func WithPostprocessing(enabled bool) Option // Boundary shifting (default: true)
func WithAnchorElimination(enabled bool) Option // Remove weak anchors (default: true)

Performance

diffx is optimized for quality over raw speed, but remains performant:

Input Size Typical Time
5 elements ~1µs
100 elements ~20µs
1000 elements ~400µs

For large inputs with scattered changes, diffx is typically faster than character-based diff libraries because it operates at the element level.

References

License

MIT License - see LICENSE for details.

Documentation

Overview

Package diffx implements the Myers O(ND) diff algorithm with heuristics for better output quality on large files with many small, scattered changes.

Unlike simple Myers implementations, diffx includes:

  • Preprocessing: Filters out high-frequency elements that cause spurious matches
  • Heuristics: Early termination for expensive comparisons
  • Postprocessing: Shifts diff boundaries for more readable output
Example
package main

import (
	"fmt"

	"github.com/dacharyc/diffx"
)

func main() {
	old := []string{"The", "quick", "brown", "fox", "jumps"}
	new := []string{"A", "slow", "red", "fox", "leaps"}

	ops := diffx.Diff(old, new)

	for _, op := range ops {
		switch op.Type {
		case diffx.Equal:
			fmt.Printf("  %v\n", old[op.AStart:op.AEnd])
		case diffx.Delete:
			fmt.Printf("- %v\n", old[op.AStart:op.AEnd])
		case diffx.Insert:
			fmt.Printf("+ %v\n", new[op.BStart:op.BEnd])
		}
	}
}
Output:

- [The quick brown]
+ [A slow red]
  [fox]
- [jumps]
+ [leaps]

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type DiffOp

type DiffOp struct {
	Type   OpType
	AStart int // start index in sequence A (inclusive)
	AEnd   int // end index in sequence A (exclusive)
	BStart int // start index in sequence B (inclusive)
	BEnd   int // end index in sequence B (exclusive)
}

DiffOp represents a single edit operation with index ranges.

func Diff

func Diff(a, b []string, opts ...Option) []DiffOp

Diff compares two string slices using the Myers algorithm. For histogram-style diff, use DiffHistogram instead.

Example
package main

import (
	"fmt"

	"github.com/dacharyc/diffx"
)

func main() {
	old := []string{"hello", "world"}
	new := []string{"hello", "there", "world"}

	ops := diffx.Diff(old, new)

	for _, op := range ops {
		switch op.Type {
		case diffx.Equal:
			fmt.Printf("KEEP:   %v\n", old[op.AStart:op.AEnd])
		case diffx.Delete:
			fmt.Printf("DELETE: %v\n", old[op.AStart:op.AEnd])
		case diffx.Insert:
			fmt.Printf("INSERT: %v\n", new[op.BStart:op.BEnd])
		}
	}
}
Output:

KEEP:   [hello]
INSERT: [there]
KEEP:   [world]
Example (Prose)
package main

import (
	"fmt"
	"strings"

	"github.com/dacharyc/diffx"
)

func main() {
	// diffx groups changes coherently, avoiding fragmentation around common words
	old := strings.Split("The quick brown fox jumps over the lazy dog", " ")
	new := strings.Split("A slow red fox leaps over the sleeping cat", " ")

	ops := diffx.Diff(old, new)

	// Count change regions (consecutive delete/insert operations)
	changeRegions := 0
	inChange := false
	for _, op := range ops {
		if op.Type == diffx.Equal {
			inChange = false
		} else if !inChange {
			changeRegions++
			inChange = true
		}
	}

	fmt.Printf("Change regions: %d\n", changeRegions)
}
Output:

Change regions: 3
Example (WithOptions)
package main

import (
	"fmt"

	"github.com/dacharyc/diffx"
)

func main() {
	old := []string{"a", "b", "c"}
	new := []string{"a", "x", "c"}

	// Force minimal edit distance (slower but mathematically optimal)
	ops := diffx.Diff(old, new, diffx.WithMinimal(true))

	for _, op := range ops {
		fmt.Printf("%s: A[%d:%d] B[%d:%d]\n",
			op.Type, op.AStart, op.AEnd, op.BStart, op.BEnd)
	}
}
Output:

Equal: A[0:1] B[0:1]
Delete: A[1:2] B[1:1]
Insert: A[2:2] B[1:2]
Equal: A[2:3] B[2:3]

func DiffElements

func DiffElements(a, b []Element, opts ...Option) []DiffOp

DiffElements compares arbitrary Element slices using the Myers algorithm. For histogram-style diff, use DiffElementsHistogram instead.

Example
package main

import (
	"fmt"

	"github.com/dacharyc/diffx"
)

// CustomElement demonstrates implementing the Element interface
// for custom types.
type CustomElement struct {
	ID   int
	Name string
}

func (e CustomElement) Equal(other diffx.Element) bool {
	o, ok := other.(CustomElement)
	if !ok {
		return false
	}
	return e.ID == o.ID
}

func (e CustomElement) Hash() uint64 {
	return uint64(e.ID)
}

func main() {
	old := []diffx.Element{
		CustomElement{1, "Alice"},
		CustomElement{2, "Bob"},
		CustomElement{3, "Charlie"},
	}
	new := []diffx.Element{
		CustomElement{1, "Alice Smith"}, // Same ID, different name - considered equal
		CustomElement{4, "David"},       // New element
		CustomElement{3, "Charlie"},
	}

	ops := diffx.DiffElements(old, new)

	for _, op := range ops {
		switch op.Type {
		case diffx.Equal:
			fmt.Printf("KEEP:   IDs %v\n", getIDs(old[op.AStart:op.AEnd]))
		case diffx.Delete:
			fmt.Printf("DELETE: IDs %v\n", getIDs(old[op.AStart:op.AEnd]))
		case diffx.Insert:
			fmt.Printf("INSERT: IDs %v\n", getIDs(new[op.BStart:op.BEnd]))
		}
	}
}

func getIDs(elems []diffx.Element) []int {
	ids := make([]int, len(elems))
	for i, e := range elems {
		ids[i] = e.(CustomElement).ID
	}
	return ids
}
Output:

KEEP:   IDs [1]
DELETE: IDs [2]
INSERT: IDs [4]
KEEP:   IDs [3]

func DiffElementsHistogram

func DiffElementsHistogram(a, b []Element, opts ...Option) []DiffOp

DiffElementsHistogram performs histogram-style diff on Element slices.

func DiffHistogram

func DiffHistogram(a, b []string, opts ...Option) []DiffOp

DiffHistogram performs histogram-style diff on string slices.

Example
package main

import (
	"fmt"

	"github.com/dacharyc/diffx"
)

func main() {
	// Histogram diff is especially good for files with many common tokens
	old := []string{"the", "quick", "fox", "the", "end"}
	new := []string{"the", "slow", "fox", "the", "end"}

	ops := diffx.DiffHistogram(old, new)

	for _, op := range ops {
		switch op.Type {
		case diffx.Equal:
			fmt.Printf("  %v\n", old[op.AStart:op.AEnd])
		case diffx.Delete:
			fmt.Printf("- %v\n", old[op.AStart:op.AEnd])
		case diffx.Insert:
			fmt.Printf("+ %v\n", new[op.BStart:op.BEnd])
		}
	}
}
Output:

  [the]
- [quick]
+ [slow]
  [fox the end]

type Element

type Element interface {
	// Equal reports whether this element is equal to another.
	Equal(other Element) bool
	// Hash returns a hash value for this element.
	// Equal elements must have equal hashes.
	Hash() uint64
}

Element represents a comparable unit (line, word, token). Implementations must provide equality comparison and hashing.

type OpType

type OpType int

OpType identifies the type of edit operation.

const (
	// Equal means the elements are unchanged.
	Equal OpType = iota
	// Insert means elements were added to B that are not in A.
	Insert
	// Delete means elements were removed from A that are not in B.
	Delete
)

func (OpType) String

func (t OpType) String() string

String returns a string representation of the OpType.

Example
package main

import (
	"fmt"

	"github.com/dacharyc/diffx"
)

func main() {
	ops := []diffx.OpType{diffx.Equal, diffx.Insert, diffx.Delete}
	for _, op := range ops {
		fmt.Println(op.String())
	}
}
Output:

Equal
Insert
Delete

type Option

type Option func(*options)

Option configures diff behavior.

func WithAnchorElimination

func WithAnchorElimination(enabled bool) Option

WithAnchorElimination enables or disables anchor elimination post-processing. Default: true.

func WithCostLimit

func WithCostLimit(n int) Option

WithCostLimit sets custom early termination threshold. 0 means auto-calculate based on input size. Default: 0.

func WithHeuristic

func WithHeuristic(enabled bool) Option

WithHeuristic enables or disables speed heuristics. Default: true.

func WithMinimal

func WithMinimal(minimal bool) Option

WithMinimal forces minimal edit script even if slow. Default: false.

func WithPostprocessing

func WithPostprocessing(enabled bool) Option

WithPostprocessing enables or disables boundary shifting. Default: true.

func WithPreprocessing

func WithPreprocessing(enabled bool) Option

WithPreprocessing enables or disables confusing element filtering. Default: true.

type StringElement

type StringElement string

StringElement is the common case for line/word comparison.

func (StringElement) Equal

func (s StringElement) Equal(other Element) bool

Equal reports whether s equals other. Returns false if other is not a StringElement.

func (StringElement) Hash

func (s StringElement) Hash() uint64

Hash returns a FNV-1a hash of the string.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL