deepdiff

package module
v0.1.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 30, 2019 License: LGPL-3.0 Imports: 11 Imported by: 4

README

deepdiff

Qri GoDoc License Codecov CI Go Report Card

deepdiff is a structured data differ that aims for near-linear time complexity. It's intended to calculate differences & apply patches to structured data ranging from 0-500MBish of encoded JSON.

Diffing structured data carries additional complexity when compared to the standard unix diff utility, which operates on lines of text. By using the structure of data itself, deepdiff is able to provide a rich description of changes that maps onto the structure of the data itself. deepdiff ignores semantically irrelevant changes like whitespace, and can isolate changes like column changes to tabular data to only the relevant switches

Most algorithms in this space have quadratic time complexity, which from our testing makes them very slow on 3MB JSON documents and unable to complete on 5MB or more. deepdiff currently hovers around the 0.9Sec/MB range on 4 core processors

Instead of operating on JSON directly, deepdiff operates on document trees consisting of the go types created by unmarshaling from JSON, which are two complex types:

  map[string]interface{}
  []interface{}

and five scalar types:

  string, int, float64, bool, nil

By operating on native go types deepdiff can compare documents encoded in different formats, for example decoded CSV or CBOR.

deepdiff is based off an algorithm designed for diffing XML documents outlined in Detecting Changes in XML Documents by Grégory Cobéna & Amélie Marian

It's been adapted to fit purposes of diffing for Qri: https://github.com/qri-io/qri, folding in parallelism primitives afforded by the go language

deepdiff also includes a tool for applying patches, see documentation for details.

Project Status:

👷♀ 👷♂ This is a very new project that hasn't been properly vetted in testing enviornments. Issues/PRs welcome & appriciated. 👷♀ 👷♂

Benchmarks

Run on a 4 core MacBook Pro:

$ go test -bench . --run XXX -v --benchmem
goos: darwin
goarch: amd64
pkg: github.com/qri-io/deepdiff
BenchmarkDiff1-4          	   20000	     88167 ns/op	   13324 B/op	     496 allocs/op
BenchmarkDiffDatasets-4   	    5000	    241119 ns/op	   53367 B/op	    1614 allocs/op
BenchmarkDiff5MB-4        	       1	4357009141 ns/op	783217944 B/op	29952860 allocs/op
PASS
ok  	github.com/qri-io/deepdiff	8.369s
Getting Involved

We would love involvement from more people! If you notice any errors or would like to submit changes, please see our Contributing Guidelines.

Basic Usage

Here's a quick example pulled from the godoc:

package main

import (
	"encoding/json"
  "fmt"
  
  "github.com/qri-io/deepdiff"
)

// start with two slightly different json documents
var aJSON = []byte(`{
  "a": 100,
  "foo": [1,2,3],
  "bar": false,
  "baz": {
    "a": {
      "b": 4,
      "c": false,
      "d": "apples-and-oranges"
    },
    "e": null,
    "g": "apples-and-oranges"
  }
}`)

var bJSON = []byte(`{
  "a": 99,
  "foo": [1,2,3],
  "bar": false,
  "baz": {
    "a": {
      "b": 5,
      "c": false,
      "d": "apples-and-oranges"
    },
    "e": "thirty-thousand-something-dogecoin",
    "f": false
  }
}`)

func main() {
	// unmarshal the data into generic interfaces
	var a, b interface{}
	if err := json.Unmarshal(aJSON, &a); err != nil {
		panic(err)
	}
	if err := json.Unmarshal(bJSON, &b); err != nil {
		panic(err)
	}

	// Diff will use default configuration to produce a slice of Deltas
	// that describe the structured changes. by default Diff will not calculate
	// moves, only inserts, deletes, and updates
	diffs, err := deepdiff.Diff(a, b)
	if err != nil {
		panic(err)
	}

	// Format the changes for terminal output
	change, err := FormatPretty(diffs)
	if err != nil {
		panic(err)
	}

	fmt.Println(change)
	// Output: baz:
	//   + f: false
	//   - g: "apples-and-oranges"
	//   a:
	//     ~ b: 5
	//   ~ e: "thirty-thousand-something-dogecoin"
	// ~ a: 99
}

License

The deepdiff library is licensed under the GNU Lesser General Public License v3.0

Documentation

Overview

Package deepdiff is a structured data differ that aims for near-linear time complexity. It's intended to calculate differences & apply patches to structured data ranging from 0-500MBish of encoded JSON

Diffing structured data carries additional complexity when compared to the standard unix diff utility, which operates on lines of text. By using the structure of data itself, deepdiff is able to provide a rich description of changes that maps onto the structure of the data itself. deepdiff ignores semantically irrelevant changes like whitespace, and can isolate changes like column changes to tabular data to only the relevant switches

Most algorithms in this space have quadratic time complexity, which from our testing makes them very slow on 3MB JSON documents and unable to complete on 5MB or more. deepdiff currently hovers around the 0.9Sec/MB range on 4 core processors

Instead of operating on JSON directly, deepdiff operates on document trees consisting of the go types created by unmarshaling from JSON, which aretwo complex types:

map[string]interface{}
[]interface{}

and five scalar types:

string, int, float64, bool, nil

by operating on native go types deepdiff can compare documents encoded in different formats, for example decoded CSV or CBOR.

deepdiff is based off an algorithm designed for diffing XML documents outlined in: Detecting Changes in XML Documents by Grégory Cobéna & Amélie Marian https://ieeexplore.ieee.org/document/994696 it's been adapted to fit purposes of diffing for Qri: https://github.com/qri-io/qri the guiding use case for this work

deepdiff also includes a tool for applying patches, see documentation for details

Index

Constants

View Source
const (
	// DTDelete means making the children of a node
	// become the children of a node's parent
	DTDelete = Operation("delete")
	// DTInsert is the compliment of deleting, adding
	// children of a parent node to a new node, and making
	// that node a child of the original parent
	DTInsert = Operation("insert")
	// DTMove is the succession of a deletion & insertion
	// of the same node
	DTMove = Operation("move")
	// DTUpdate is an alteration of a scalar data type (string, bool, float, etc)
	DTUpdate = Operation("update")
)

Variables

View Source
var NewHash = func() hash.Hash {
	return fnv.New64()
}

NewHash returns a new hash interface, wrapped in a function for easy hash algorithm switching, package consumers can override NewHash with their own desired hash.Hash implementation if the value space is particularly large. default is 64-bit FNV 1 for fast, cheap, (non-cryptographic) hashing

Functions

func FormatPretty

func FormatPretty(changes []*Delta) (string, error)

FormatPretty converts a []*Delta into a colored text report, with: red "-" for deletions green "+" for insertions blue "~" for changes (an insert & delete at the same path) This is very much a work in progress

func FormatPrettyColor

func FormatPrettyColor(changes []*Delta) (string, error)

FormatPrettyColor is the same as format pretty, but with tty color tags to print colored text to terminals

func FormatPrettyStats

func FormatPrettyStats(diffStat *Stats) string

FormatPrettyStats prints a string of stats info

func FormatPrettyStatsColor

func FormatPrettyStatsColor(diffStat *Stats) string

FormatPrettyStatsColor prints a string of stats info with ANSI colors

func Patch

func Patch(v interface{}, patch []*Delta) (err error)

Patch applies a change script (patch) to a value

Types

type Delta

type Delta struct {
	// the type of change
	Type Operation `json:"type"`
	// Path is a string representation of the patch to where the delta operation
	// begins in the destination documents
	// path should conform to the IETF JSON-pointer specification, outlined
	// in RFC 6901: https://tools.ietf.org/html/rfc6901
	Path string `json:"path"`
	// The value to change in the destination document
	Value interface{} `json:"value"`

	// To make delta's revesible, original values are included
	// the original path this change from
	SourcePath string `json:"SourcePath,omitempty"`
	// the original  value this was changed from, will not always be present
	SourceValue interface{} `json:"originalValue,omitempty"`
}

Delta represents a change between a source & destination document a delta is a single "edit" that describes changes to the destination document

func Diff

func Diff(d1, d2 interface{}, opts ...DiffOption) ([]*Delta, error)

Diff computes a slice of deltas that define an edit script for turning the value at d1 into d2 currently Diff will never return an error, error returns are reserved for future use. specifically: bailing before delta calculation based on a configurable threshold

type DiffConfig

type DiffConfig struct {
	// If true Diff will calculate "moves" that describe changing the parent of
	// a subtree
	MoveDeltas bool
	// Provide a non-nil stats pointer & diff will populate it with data from
	// the diff process
	Stats *Stats
}

DiffConfig are any possible configuration parameters for calculating diffs

type DiffOption

type DiffOption func(cfg *DiffConfig)

DiffOption is a function that adjust a config, zero or more DiffOptions can be passed to the Diff function

func OptionSetStats

func OptionSetStats(st *Stats) DiffOption

OptionSetStats will set the passed-in stats pointer when Diff is called

type Operation

type Operation string

Operation defines the operation of a Delta item

type Stats

type Stats struct {
	Left  int `json:"leftNodes"`  // count of nodes in the left tree
	Right int `json:"rightNodes"` // count of nodes in the right tree

	LeftWeight  int `json:"leftWeight"`  // byte-ish count of left tree
	RightWeight int `json:"rightWeight"` // byte-ish count of right tree

	Inserts int `json:"inserts,omitempty"` // number of nodes inserted
	Updates int `json:"updates,omitempty"` // number of nodes updated
	Deletes int `json:"deletes,omitempty"` // number of nodes deleted
	Moves   int `json:"moves,omitempty"`   // number of nodes moved
}

Stats holds statistical metadata about a diff

func (Stats) NodeChange

func (s Stats) NodeChange() int

NodeChange returns a count of the shift between left & right trees

func (Stats) PctWeightChange

func (s Stats) PctWeightChange() float64

PctWeightChange returns a value from -1.0 to max(float64) representing the size shift between left & right trees

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL