gourd

package module
v0.0.0-...-6c24a08 Latest
Published: Oct 2, 2023 License: MIT Imports: 15 Imported by: 0

README

Gourd

Gourd is a command line tool to find duplicate files.

Acknowledgements

Gourd is inspired by my use of rdfind, but is not designed to be compatible with rdfind in terms of output, command line flags, or feature set. Gourd is not related to, a port of, or based on the source of rdfind.

Gourd came from a use case where I wanted to deduplicate data on a server, but could not install rdfind natively and could not use rdfind from my local machine due to differing libc versions. I was able to work around this problem using Docker, but doing so was unwieldy, and I wanted an easily portable solution.

Cautions and Warnings

This software is experimental and untested.

Use at your own risk.

Build

git clone https://github.com/nabowler/gourd.git
cd gourd
go build -mod=readonly -ldflags="-s -w" cmd/gourd/gourd.go

Installation

With Go

go install github.com/nabowler/gourd/cmd/gourd@latest

Use

gourd -r -v -sha1 path/to/directory [path/to/directory2 ...]

See gourd -h for all available options

Just

A justfile is provided for convenience.

Benchmarks and Comparison to rdfind

Benchmarks

11.7 GiB images containing 184 duplicate files
$ hyperfine -w 5 -N  --export-markdown /tmp/gourd-rdfind-comparison.md 'gourd -r -md5 .' 'rdfind -makeresultsfile false -checksum md5 -dryrun true .' 'gourd -r -sha1 .' 'rdfind -makeresultsfile false -checksum sha1 -dryrun true .' 'gourd -r -sha256 .' 'gourd -r -sha512 .'
Benchmark 1: gourd -r -md5 .
  Time (mean ± σ):     920.6 ms ±  45.0 ms    [User: 728.4 ms, System: 216.5 ms]
  Range (min … max):   835.5 ms … 989.4 ms    10 runs

Benchmark 2: rdfind -makeresultsfile false -checksum md5 -dryrun true .
  Time (mean ± σ):      1.028 s ±  0.016 s    [User: 0.872 s, System: 0.152 s]
  Range (min … max):    1.007 s …  1.049 s    10 runs

Benchmark 3: gourd -r -sha1 .
  Time (mean ± σ):      1.301 s ±  0.058 s    [User: 1.126 s, System: 0.200 s]
  Range (min … max):    1.193 s …  1.383 s    10 runs

Benchmark 4: rdfind -makeresultsfile false -checksum sha1 -dryrun true .
  Time (mean ± σ):      1.246 s ±  0.014 s    [User: 1.084 s, System: 0.158 s]
  Range (min … max):    1.224 s …  1.264 s    10 runs

Benchmark 5: gourd -r -sha256 .
  Time (mean ± σ):      2.636 s ±  0.053 s    [User: 2.446 s, System: 0.216 s]
  Range (min … max):    2.555 s …  2.711 s    10 runs

Benchmark 6: gourd -r -sha512 .
  Time (mean ± σ):      1.915 s ±  0.047 s    [User: 1.717 s, System: 0.222 s]
  Range (min … max):    1.830 s …  1.987 s    10 runs

Summary
  gourd -r -md5 . ran
    1.12 ± 0.06 times faster than rdfind -makeresultsfile false -checksum md5 -dryrun true .
    1.35 ± 0.07 times faster than rdfind -makeresultsfile false -checksum sha1 -dryrun true .
    1.41 ± 0.09 times faster than gourd -r -sha1 .
    2.08 ± 0.11 times faster than gourd -r -sha512 .
    2.86 ± 0.15 times faster than gourd -r -sha256 .

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| gourd -r -md5 . | 920.6 ± 45.0 | 835.5 | 989.4 | 1.00 |
| rdfind -makeresultsfile false -checksum md5 -dryrun true . | 1028.2 ± 15.6 | 1007.4 | 1049.3 | 1.12 ± 0.06 |
| gourd -r -sha1 . | 1300.9 ± 57.9 | 1193.3 | 1383.2 | 1.41 ± 0.09 |
| rdfind -makeresultsfile false -checksum sha1 -dryrun true . | 1246.0 ± 13.9 | 1224.3 | 1264.4 | 1.35 ± 0.07 |
| gourd -r -sha256 . | 2636.1 ± 52.9 | 2555.3 | 2711.3 | 2.86 ± 0.15 |
| gourd -r -sha512 . | 1914.6 ± 46.6 | 1830.3 | 1987.0 | 2.08 ± 0.11 |

Notes:

  • Results are system dependent.
  • The most consistent result across the systems I tested is that gourd -sha256 is slower than gourd -sha512.

Comparisons

Duplicates Found

$ gourd -r -v -md5 . > /dev/null
Found 9286 files totaling 11.7GiB
       Size: Before: 1|9286 After: 115|232 Eliminated: -114|9054 TotalSize: 447.6MiB Took: 139.169211ms
  Same File: Before: 115|232 After: 115|232 Eliminated: 0|0 TotalSize: 447.6MiB Took: 55.478µs
First Bytes: Before: 115|232 After: 94|190 Eliminated: 21|42 TotalSize: 435.2MiB Took: 2.647062ms
 Last Bytes: Before: 94|190 After: 91|184 Eliminated: 3|6 TotalSize: 434.1MiB Took: 2.074249ms
        MD5: Before: 91|184 After: 91|184 Eliminated: 0|0 TotalSize: 434.1MiB Took: 638.124048ms
Found 184 duplicate files with 220.1MiB reclaimable space
$ rdfind -makeresultsfile false -checksum md5 -dryrun true .
(DRYRUN MODE) Now scanning ".", found 9286 files.
(DRYRUN MODE) Now have 9286 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 12509236983 bytes or 12 GiB
Removed 9054 files due to unique sizes from list. 232 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 42 files from list. 190 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 6 files from list. 184 files left.
(DRYRUN MODE) Now eliminating candidates based on md5 checksum: removed 0 files from list. 184 files left.
(DRYRUN MODE) It seems like you have 184 files that are not unique
(DRYRUN MODE) Totally, 220 MiB can be reduced.

Linked Dependencies

$ ldd $(which gourd) $(which rdfind)
~/go/bin/gourd:
        not a dynamic executable
/usr/bin/rdfind:
        linux-vdso.so.1 (0x00007ffcf3326000)
        libnettle.so.8 => /usr/lib/libnettle.so.8 (0x00007fb2d6b20000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007fb2d6800000)
        libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007fb2d6afb000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007fb2d6400000)
        libm.so.6 => /usr/lib/libm.so.6 (0x00007fb2d6713000)
        /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007fb2d6bcd000)

Binary Size

$ du -h $(which gourd) $(which rdfind)
1.5M    /home/nathan/go/bin/gourd
96K     /usr/bin/rdfind

Notes:

  • rdfind installed from system repos
  • gourd installed via just install, which strips the binary
    • Non-stripped gourd binary installed via go install is 2.3M

Development Notes

Planned Bucketers

  • md5
    • configurable from commandline (-md5 flag)
  • sha1
    • configurable from commandline (-sha1 flag)
  • sha256
    • configurable from commandline (-sha256 flag)
  • sha512
    • configurable from commandline (-sha512 flag)
  • firstbytes
    • number of bytes configurable with -firstbytessize flag
  • lastbytes
    • number of bytes configurable with -lastbytessize flag
  • filesize
  • statted
    • Outputs information about the number of files before and after an inner Bucketer
    • configurable from commandline (-v flag)

Note: -md5, -sha1, -sha256, and -sha512 are additive and will be applied in that order if set. If none are set, a default of SHA-1 is used.
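
The ordering rule in the note above can be sketched as follows. This is a minimal illustration, not gourd's actual flag handling: `selectHashes`, `describe`, and the boolean parameters are hypothetical names standing in for the parsed flags.

```go
package main

import (
	"crypto"
	"fmt"
	"strings"
)

// selectHashes returns the hash steps in the fixed order MD5, SHA-1,
// SHA-256, SHA-512, falling back to SHA-1 when nothing is selected.
// The function name and boolean parameters are hypothetical.
func selectHashes(md5, sha1, sha256, sha512 bool) []crypto.Hash {
	var selected []crypto.Hash
	if md5 {
		selected = append(selected, crypto.MD5)
	}
	if sha1 {
		selected = append(selected, crypto.SHA1)
	}
	if sha256 {
		selected = append(selected, crypto.SHA256)
	}
	if sha512 {
		selected = append(selected, crypto.SHA512)
	}
	if len(selected) == 0 {
		selected = []crypto.Hash{crypto.SHA1}
	}
	return selected
}

// describe joins the names of the selected hashes for display.
func describe(md5, sha1, sha256, sha512 bool) string {
	names := []string{}
	for _, h := range selectHashes(md5, sha1, sha256, sha512) {
		names = append(names, h.String())
	}
	return strings.Join(names, ",")
}

func main() {
	fmt.Println(describe(false, false, false, false)) // SHA-1
	fmt.Println(describe(true, false, true, false))   // MD5,SHA-256
}
```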

Other steps

  • duplicate device and inode detection

To Do:

  • -exclude on CLI
    • excludes a path
    • Glob? Regex? TBD.

To Consider

  • goroutines for Hash/file steps?
    • one per input bucket?

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func HumanReadableSize

func HumanReadableSize(size int64) string
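
The package does not document the output format of this function. As a rough sketch, assuming binary (1024-based) units and one decimal place, which matches sizes such as 11.7GiB in the README output, a formatter could look like this. `humanReadableSize` is a hypothetical stand-in, not the package's implementation.

```go
package main

import "fmt"

// humanReadableSize is a hypothetical stand-in for HumanReadableSize.
// It formats a byte count with binary (1024-based) units and one decimal.
func humanReadableSize(size int64) string {
	const unit = 1024
	if size < unit {
		return fmt.Sprintf("%dB", size)
	}
	units := []string{"KiB", "MiB", "GiB", "TiB", "PiB", "EiB"}
	value := float64(size) / unit
	i := 0
	// Keep dividing until the value fits the current unit.
	for value >= unit && i < len(units)-1 {
		value /= unit
		i++
	}
	return fmt.Sprintf("%.1f%s", value, units[i])
}

func main() {
	fmt.Println(humanReadableSize(512))         // 512B
	fmt.Println(humanReadableSize(1536))        // 1.5KiB
	fmt.Println(humanReadableSize(12509236983)) // 11.7GiB
}
```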

func SubBucketName

func SubBucketName(currentBucketName, newBucketName string) string

SubBucketName makes consistent bucket namings.

Types

type Bucket

type Bucket []File

Bucket is a list of Files sharing common attributes

func (Bucket) TotalFileSize

func (b Bucket) TotalFileSize() int64

type Bucketer

type Bucketer interface {
	Bucket(in Buckets) (Buckets, error)
}

Bucketer receives Buckets and returns a set of Buckets. Bucketers should ignore a Bucket with `len(in[key]) < 2`. Similarly, Bucketers may omit a Bucket with fewer than 2 items from the returned Buckets, but are not required to.
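
To illustrate the contract, here is a minimal sketch of a custom Bucketer that splits buckets by file extension. The types are simplified local mirrors of the package's types (`File` carries only a path), and `extensionBucketer` and the `|ext:` key separator are assumptions made for the example; the real package provides SubBucketName for building keys.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Simplified local mirrors of the package types, for illustration only.
type File struct{ Path string }
type Bucket []File
type Buckets map[string]Bucket
type Bucketer interface {
	Bucket(in Buckets) (Buckets, error)
}

// extensionBucketer is a hypothetical Bucketer keyed by file extension.
type extensionBucketer struct{}

func (extensionBucketer) Bucket(in Buckets) (Buckets, error) {
	out := Buckets{}
	for key, bucket := range in {
		// A bucket with fewer than 2 files cannot contain duplicates.
		if len(bucket) < 2 {
			continue
		}
		for _, f := range bucket {
			// The new key keeps the current bucket name as a prefix.
			sub := key + "|ext:" + filepath.Ext(f.Path)
			out[sub] = append(out[sub], f)
		}
	}
	return out, nil
}

// demo runs the sketch on a small in-memory input.
func demo() (Buckets, error) {
	var b Bucketer = extensionBucketer{}
	return b.Bucket(Buckets{
		"root": {{Path: "a.jpg"}, {Path: "b.jpg"}, {Path: "c.png"}},
		"solo": {{Path: "d.jpg"}},
	})
}

func main() {
	out, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(out["root|ext:.jpg"]), len(out["root|ext:.png"]), len(out)) // 2 1 2
}
```

The single-file `.png` bucket is returned here, which the contract allows; a later step or PossibleDuplicates can drop it.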

func NewCryptoHashBucketer

func NewCryptoHashBucketer(hash crypto.Hash) (Bucketer, error)

NewCryptoHashBucketer returns a Bucketer that calculates the provided crypto.Hash on the file. By default, MD5, SHA1, SHA256, and SHA512 are supported. Other hashes are only supported if `hash.Available()` is true.
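
The `hash.Available()` check refers to the standard library's registration mechanism: a crypto.Hash can only be instantiated if its implementing package has been linked into the binary. The sketch below shows that check together with streaming content through the hash. `hashReader` and `sha256Hex` are hypothetical helpers, not part of the package.

```go
package main

import (
	"crypto"
	_ "crypto/sha256" // links SHA-256 so crypto.SHA256.Available() is true
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// hashReader streams r through the given hash and returns the hex digest.
// It is a hypothetical helper written for this example.
func hashReader(h crypto.Hash, r io.Reader) (string, error) {
	if !h.Available() {
		return "", fmt.Errorf("hash %v is not linked into the binary", h)
	}
	hasher := h.New()
	if _, err := io.Copy(hasher, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(hasher.Sum(nil)), nil
}

// sha256Hex hashes a string with SHA-256.
func sha256Hex(s string) (string, error) {
	return hashReader(crypto.SHA256, strings.NewReader(s))
}

func main() {
	sum, err := sha256Hex("abc")
	fmt.Println(sum, err)

	// MD4 is not imported here, so Available() reports false.
	_, err = hashReader(crypto.MD4, strings.NewReader("abc"))
	fmt.Println(err != nil) // true
}
```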

func NewFileBucketer

func NewFileBucketer(sbf SubBucketFunc) (Bucketer, error)

NewFileBucketer returns a Bucketer that uses the provided SubBucketFunc to generate the output Buckets.

func NewFileSizeBucketer

func NewFileSizeBucketer(minSize int64) Bucketer

NewFileSizeBucketer returns a Bucketer based on the size of the File.

func NewFirstBytesBucketer

func NewFirstBytesBucketer(numBytes int64) Bucketer

NewFirstBytesBucketer creates a Bucketer that examines the first `numBytes` of a file. If numBytes is more than the file size, the file size is used as numBytes. If numBytes is <= 0 after comparing to the file size, the file is bucketed under a constant key that will not overlap with normal bucket keys.

func NewLastBytesBucketer

func NewLastBytesBucketer(numBytes int64) Bucketer

NewLastBytesBucketer creates a Bucketer that examines the last `numBytes` of a file. If numBytes is more than the file size, the file size is used as numBytes. If numBytes is <= 0 after comparing to the file size, the file is bucketed under a constant key that will not overlap with normal bucket keys.
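
The clamping rule shared by the first-bytes and last-bytes Bucketers can be sketched as below. This is an assumption-laden illustration rather than the package's code: `edgeBytes` and `edgesOf` are hypothetical helpers, and returning nil stands in for the "constant key" case.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// edgeBytes reads up to numBytes from the start of f, or from the end
// when fromEnd is true. numBytes is clamped to the file size first.
func edgeBytes(f *os.File, numBytes int64, fromEnd bool) ([]byte, error) {
	info, err := f.Stat()
	if err != nil {
		return nil, err
	}
	if numBytes > info.Size() {
		numBytes = info.Size()
	}
	if numBytes <= 0 {
		// Nothing to read; a caller would use a constant bucket key here.
		return nil, nil
	}
	var offset int64
	if fromEnd {
		offset = info.Size() - numBytes
	}
	buf := make([]byte, numBytes)
	section := io.NewSectionReader(f, offset, numBytes)
	if _, err := io.ReadFull(section, buf); err != nil {
		return nil, err
	}
	return buf, nil
}

// edgesOf writes content to a temporary file and returns its first and
// last n bytes as strings.
func edgesOf(content string, n int64) (first, last string, err error) {
	f, err := os.CreateTemp("", "gourd-sketch-*")
	if err != nil {
		return "", "", err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if _, err = f.WriteString(content); err != nil {
		return "", "", err
	}
	head, err := edgeBytes(f, n, false)
	if err != nil {
		return "", "", err
	}
	tail, err := edgeBytes(f, n, true)
	if err != nil {
		return "", "", err
	}
	return string(head), string(tail), nil
}

func main() {
	first, last, err := edgesOf("hello world", 5)
	fmt.Println(first, last, err) // hello world <nil>
}
```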

type Buckets

type Buckets map[string]Bucket

Buckets is a map from a common attribute to a Bucket. The key is determined by the Bucketer, but must include the current Bucket name.

func (Buckets) PossibleDuplicates

func (b Buckets) PossibleDuplicates() Buckets

PossibleDuplicates returns Buckets containing at least two entries.
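
A filter with this behavior is short enough to sketch in full. The types below are simplified local mirrors, and `possibleDuplicates` is a hypothetical free function standing in for the method; the package's implementation may differ.

```go
package main

import "fmt"

// Simplified local mirrors of the package types, for illustration only.
type File struct{ Path string }
type Bucket []File
type Buckets map[string]Bucket

// possibleDuplicates keeps only the buckets holding at least two files.
func possibleDuplicates(b Buckets) Buckets {
	out := make(Buckets, len(b))
	for key, bucket := range b {
		if len(bucket) >= 2 {
			out[key] = bucket
		}
	}
	return out
}

func main() {
	in := Buckets{
		"size:10": {{Path: "a"}, {Path: "b"}},
		"size:20": {{Path: "c"}},
		"size:30": {},
	}
	out := possibleDuplicates(in)
	fmt.Println(len(out), len(out["size:10"])) // 1 2
}
```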

func (Buckets) String

func (b Buckets) String() string

type ChainedBucketer

type ChainedBucketer struct {
	Bucketers []Bucketer
}

ChainedBucketer applies a series of Bucketers.

func (ChainedBucketer) Bucket

func (bm ChainedBucketer) Bucket(in Buckets) (Buckets, error)
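
Chaining can be pictured as feeding each step's output into the next step. The sketch below assumes the chain stops at the first error, which the documentation does not state. All names (`chain`, `groupBy`, `runDemo`) and the `/name=value` key layout are hypothetical, and the types are simplified local mirrors.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Simplified local mirrors of the package types, for illustration only.
type File struct {
	Path string
	Size int64
}
type Bucket []File
type Buckets map[string]Bucket
type Bucketer interface {
	Bucket(in Buckets) (Buckets, error)
}

// groupBy is a hypothetical Bucketer that sub-buckets by keyOf(file).
type groupBy struct {
	name  string
	keyOf func(File) string
}

func (g groupBy) Bucket(in Buckets) (Buckets, error) {
	out := Buckets{}
	for key, bucket := range in {
		if len(bucket) < 2 {
			continue // cannot contain duplicates
		}
		for _, f := range bucket {
			sub := fmt.Sprintf("%s/%s=%s", key, g.name, g.keyOf(f))
			out[sub] = append(out[sub], f)
		}
	}
	return out, nil
}

// chain applies its steps in order, passing each result to the next step.
type chain struct{ steps []Bucketer }

func (c chain) Bucket(in Buckets) (Buckets, error) {
	current := in
	for i, step := range c.steps {
		next, err := step.Bucket(current)
		if err != nil {
			return nil, fmt.Errorf("step %d: %w", i, err)
		}
		current = next
	}
	return current, nil
}

// runDemo buckets four files by size, then by extension.
func runDemo() (Buckets, error) {
	c := chain{steps: []Bucketer{
		groupBy{"size", func(f File) string { return fmt.Sprint(f.Size) }},
		groupBy{"ext", func(f File) string { return filepath.Ext(f.Path) }},
	}}
	return c.Bucket(Buckets{"all": {
		{"a.jpg", 10}, {"b.jpg", 10}, {"c.png", 10}, {"d.jpg", 20},
	}})
}

func main() {
	out, err := runDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for key, bucket := range out {
		fmt.Println(key, len(bucket))
	}
}
```

Cheap steps placed first (size, first bytes) shrink the input before expensive steps such as hashing run, which is the ordering shown in the README's verbose output.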

type DirWalker

type DirWalker struct {
	Key         string
	Exclude     map[string]struct{}
	Recursive   bool
	AppendDevID bool
}

func (DirWalker) Walk

func (dw DirWalker) Walk(rootPaths ...string) (Buckets, error)

type File

type File struct {
	Path           string
	FileInfo       os.FileInfo
	DuplicatePaths []string
}

File is a path and FileInfo.

type SameFilterBucketer

type SameFilterBucketer struct{}

SameFilterBucketer filters out files that appear to already be the same file on disk, as per `os.SameFile`. Unlike most other Bucketer implementations, this will only remove entries from Buckets rather than create new Buckets.

func (SameFilterBucketer) Bucket

func (bm SameFilterBucketer) Bucket(in Buckets) (Buckets, error)
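
`os.SameFile` compares two os.FileInfo values and reports whether they describe the same underlying file, for example when the same path is reached twice through overlapping root directories. A minimal sketch of such a filter follows; `dropSameFiles` and `demoSameFile` are hypothetical helpers, and the package's own filtering may work differently.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// dropSameFiles keeps the first path for each distinct on-disk file.
func dropSameFiles(paths []string) ([]string, error) {
	var kept []string
	var infos []os.FileInfo
	for _, p := range paths {
		info, err := os.Stat(p)
		if err != nil {
			return nil, err
		}
		alreadySeen := false
		for _, seen := range infos {
			if os.SameFile(seen, info) {
				alreadySeen = true
				break
			}
		}
		if !alreadySeen {
			kept = append(kept, p)
			infos = append(infos, info)
		}
	}
	return kept, nil
}

// demoSameFile lists one file twice next to a separate file with identical
// content and returns how many paths survive the filter.
func demoSameFile() (int, error) {
	dir, err := os.MkdirTemp("", "gourd-sketch-*")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	a := filepath.Join(dir, "a.txt")
	b := filepath.Join(dir, "b.txt")
	for _, p := range []string{a, b} {
		if err := os.WriteFile(p, []byte("same content"), 0o600); err != nil {
			return 0, err
		}
	}
	kept, err := dropSameFiles([]string{a, a, b})
	return len(kept), err
}

func main() {
	n, err := demoSameFile()
	fmt.Println(n, err) // 2 <nil>
}
```

Files a.txt and b.txt have equal content but are distinct files, so both survive; only the repeated listing of a.txt is dropped.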

type StattedBucketer

type StattedBucketer struct {
	StepName string
	Bucketer Bucketer
}

func (StattedBucketer) Bucket

func (b StattedBucketer) Bucket(in Buckets) (Buckets, error)

type SubBucketFunc

type SubBucketFunc func(*os.File) (subBucketName string, accept bool, err error)

SubBucketFunc determines the sub-bucket name and if the file should be sub-bucketed based on the contents of the file. SubBucketFunc implementations _should not_ Close the *os.File.
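
As an example of the shape such a function takes, the sketch below keys files by the hex encoding of their first four bytes and declines empty files. `magicBytes` and `demoMagic` are hypothetical, and treating `accept == false` as "skip this file" is an assumption about the semantics. Note that the function leaves closing the file to its caller.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// SubBucketFunc mirrors the package's type for this self-contained example.
type SubBucketFunc func(*os.File) (subBucketName string, accept bool, err error)

// magicBytes keys a file by its first four bytes. It does not close f.
func magicBytes(f *os.File) (string, bool, error) {
	buf := make([]byte, 4)
	n, err := f.ReadAt(buf, 0)
	if err != nil && err != io.EOF {
		return "", false, err
	}
	if n == 0 {
		return "", false, nil // empty file: decline to sub-bucket
	}
	return hex.EncodeToString(buf[:n]), true, nil
}

// demoMagic writes content to a temporary file and applies magicBytes.
// The caller side owns the file and is responsible for closing it.
func demoMagic(content string) (string, bool, error) {
	var fn SubBucketFunc = magicBytes
	f, err := os.CreateTemp("", "gourd-sketch-*")
	if err != nil {
		return "", false, err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if _, err := f.WriteString(content); err != nil {
		return "", false, err
	}
	return fn(f)
}

func main() {
	name, accept, err := demoMagic("GIF89a")
	fmt.Println(name, accept, err) // 47494638 true <nil>
}
```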

Directories

Path Synopsis
cmd
