Gourd
Gourd is a command line tool to find duplicate files.
Acknowledgements
Gourd is inspired by my use of rdfind, but is not designed to be compatible with rdfind in terms of output, command line flags, or feature set. Gourd is not related to, a port of, or based on the source of rdfind.
Gourd came from a use case where I wanted to deduplicate data on a server, but could not install rdfind natively and could not use rdfind from my local machine due to differing libc versions. I was able to work around this problem using Docker, but doing so was unwieldy and cumbersome, and I wanted an easily portable solution.
Cautions and Warnings
This software is experimental and untested.
Use at your own risk.
Build
git clone https://github.com/nabowler/gourd.git
cd gourd
go build -mod=readonly -ldflags="-s -w" cmd/gourd/gourd.go
Installation
With Go
go install github.com/nabowler/gourd/cmd/gourd@latest
Use
gourd -r -v -sha1 path/to/directory [path/to/directory2 ...]
See gourd -h for all available options.
Just
A justfile is provided for convenience.
Benchmarks and Comparison to rdfind
Benchmarks
11.7 GiB images containing 184 duplicate files
$ hyperfine -w 5 -N --export-markdown /tmp/gourd-rdfind-comparison.md 'gourd -r -md5 .' 'rdfind -makeresultsfile false -checksum md5 -dryrun true .' 'gourd -r -sha1 .' 'rdfind -makeresultsfile false -checksum sha1 -dryrun true .' 'gourd -r -sha256 .' 'gourd -r -sha512 .'
Benchmark 1: gourd -r -md5 .
Time (mean ± σ): 920.6 ms ± 45.0 ms [User: 728.4 ms, System: 216.5 ms]
Range (min … max): 835.5 ms … 989.4 ms 10 runs
Benchmark 2: rdfind -makeresultsfile false -checksum md5 -dryrun true .
Time (mean ± σ): 1.028 s ± 0.016 s [User: 0.872 s, System: 0.152 s]
Range (min … max): 1.007 s … 1.049 s 10 runs
Benchmark 3: gourd -r -sha1 .
Time (mean ± σ): 1.301 s ± 0.058 s [User: 1.126 s, System: 0.200 s]
Range (min … max): 1.193 s … 1.383 s 10 runs
Benchmark 4: rdfind -makeresultsfile false -checksum sha1 -dryrun true .
Time (mean ± σ): 1.246 s ± 0.014 s [User: 1.084 s, System: 0.158 s]
Range (min … max): 1.224 s … 1.264 s 10 runs
Benchmark 5: gourd -r -sha256 .
Time (mean ± σ): 2.636 s ± 0.053 s [User: 2.446 s, System: 0.216 s]
Range (min … max): 2.555 s … 2.711 s 10 runs
Benchmark 6: gourd -r -sha512 .
Time (mean ± σ): 1.915 s ± 0.047 s [User: 1.717 s, System: 0.222 s]
Range (min … max): 1.830 s … 1.987 s 10 runs
Summary
gourd -r -md5 . ran
1.12 ± 0.06 times faster than rdfind -makeresultsfile false -checksum md5 -dryrun true .
1.35 ± 0.07 times faster than rdfind -makeresultsfile false -checksum sha1 -dryrun true .
1.41 ± 0.09 times faster than gourd -r -sha1 .
2.08 ± 0.11 times faster than gourd -r -sha512 .
2.86 ± 0.15 times faster than gourd -r -sha256 .
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| gourd -r -md5 . | 920.6 ± 45.0 | 835.5 | 989.4 | 1.00 |
| rdfind -makeresultsfile false -checksum md5 -dryrun true . | 1028.2 ± 15.6 | 1007.4 | 1049.3 | 1.12 ± 0.06 |
| gourd -r -sha1 . | 1300.9 ± 57.9 | 1193.3 | 1383.2 | 1.41 ± 0.09 |
| rdfind -makeresultsfile false -checksum sha1 -dryrun true . | 1246.0 ± 13.9 | 1224.3 | 1264.4 | 1.35 ± 0.07 |
| gourd -r -sha256 . | 2636.1 ± 52.9 | 2555.3 | 2711.3 | 2.86 ± 0.15 |
| gourd -r -sha512 . | 1914.6 ± 46.6 | 1830.3 | 1987.0 | 2.08 ± 0.11 |
Notes:
- Results are system dependent
- The most consistent result across systems, in my experience, is that gourd -sha256 is slower than gourd -sha512
Comparisons
Duplicates Found
$ gourd -r -v -md5 . > /dev/null
Found 9286 files totaling 11.7GiB
Size: Before: 1|9286 After: 115|232 Eliminated: -114|9054 TotalSize: 447.6MiB Took: 139.169211ms
Same File: Before: 115|232 After: 115|232 Eliminated: 0|0 TotalSize: 447.6MiB Took: 55.478µs
First Bytes: Before: 115|232 After: 94|190 Eliminated: 21|42 TotalSize: 435.2MiB Took: 2.647062ms
Last Bytes: Before: 94|190 After: 91|184 Eliminated: 3|6 TotalSize: 434.1MiB Took: 2.074249ms
MD5: Before: 91|184 After: 91|184 Eliminated: 0|0 TotalSize: 434.1MiB Took: 638.124048ms
Found 184 duplicate files with 220.1MiB reclaimable space
$ rdfind -makeresultsfile false -checksum md5 -dryrun true .
(DRYRUN MODE) Now scanning ".", found 9286 files.
(DRYRUN MODE) Now have 9286 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 12509236983 bytes or 12 GiB
Removed 9054 files due to unique sizes from list. 232 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 42 files from list. 190 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 6 files from list. 184 files left.
(DRYRUN MODE) Now eliminating candidates based on md5 checksum: removed 0 files from list. 184 files left.
(DRYRUN MODE) It seems like you have 184 files that are not unique
(DRYRUN MODE) Totally, 220 MiB can be reduced.
Linked Dependencies
$ ldd $(which gourd) $(which rdfind)
~/go/bin/gourd:
not a dynamic executable
/usr/bin/rdfind:
linux-vdso.so.1 (0x00007ffcf3326000)
libnettle.so.8 => /usr/lib/libnettle.so.8 (0x00007fb2d6b20000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007fb2d6800000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007fb2d6afb000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007fb2d6400000)
libm.so.6 => /usr/lib/libm.so.6 (0x00007fb2d6713000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007fb2d6bcd000)
Binary Size
$ du -h $(which gourd) $(which rdfind)
1.5M /home/nathan/go/bin/gourd
96K /usr/bin/rdfind
Notes:
- rdfind installed from system repos
- gourd installed via just install, which strips the binary
- Non-stripped gourd binary installed via go install is 2.3M
Development Notes
Planned Bucketers
- md5
- configurable from commandline (-md5 flag)
- sha1
- configurable from commandline (-sha1 flag)
- sha256
- configurable from commandline (-sha256 flag)
- sha512
- configurable from commandline (-sha512 flag)
- firstbytes
- number of bytes configurable with -firstbytessize flag
- lastbytes
- number of bytes configurable with -lastbytessize flag
- filesize
- statted
- Outputs information about the number of files before and after an inner Bucketer
- configurable from commandline (-v flag)
Note: -md5, -sha1, -sha256, and -sha512 are additive and will be applied in that order if set. If none are set, a default of SHA-1 is used.
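The staged approach above, with cheap checks first and full hashes last, can be sketched roughly as follows. This is not Gourd's actual code: the Bucketer type, bySize, bySHA1, refine, and demo are hypothetical names chosen for illustration, and only the size and SHA-1 stages are shown.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strconv"
)

// Bucketer is a hypothetical type: it maps a file path to a key. Files that
// share a key remain duplicate candidates; files with a unique key drop out.
type Bucketer func(path string) (string, error)

// bySize keys a file by its size in bytes.
func bySize(path string) (string, error) {
	info, err := os.Stat(path)
	if err != nil {
		return "", err
	}
	return strconv.FormatInt(info.Size(), 10), nil
}

// bySHA1 keys a file by the SHA-1 digest of its full contents.
func bySHA1(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha1.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// refine splits every group by key and keeps only groups of two or more files.
func refine(groups [][]string, b Bucketer) ([][]string, error) {
	var out [][]string
	for _, g := range groups {
		byKey := map[string][]string{}
		for _, p := range g {
			k, err := b(p)
			if err != nil {
				return nil, err
			}
			byKey[k] = append(byKey[k], p)
		}
		for _, files := range byKey {
			if len(files) > 1 {
				out = append(out, files)
			}
		}
	}
	return out, nil
}

// demo writes four small temporary files and runs the stages in order.
func demo() ([][]string, error) {
	dir, err := os.MkdirTemp("", "bucketer-sketch")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	var paths []string
	for name, data := range map[string]string{"a": "hello", "b": "hello", "c": "hellp", "d": "hi"} {
		p := filepath.Join(dir, name)
		if err := os.WriteFile(p, []byte(data), 0o600); err != nil {
			return nil, err
		}
		paths = append(paths, p)
	}
	groups := [][]string{paths}
	for _, b := range []Bucketer{bySize, bySHA1} {
		if groups, err = refine(groups, b); err != nil {
			return nil, err
		}
	}
	return groups, nil
}

func main() {
	groups, err := demo()
	fmt.Println("duplicate groups:", len(groups), "error:", err)
}
```

In the demo, the size stage drops "d", and the hash stage then separates "c" from the identical pair "a" and "b". Only files that survive the cheap stages are ever read in full.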
Other steps
- duplicate device and inode detection
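One possible way to detect paths that share a device and inode in Go is the standard library's os.SameFile. The sketch below is an assumption about how such a step could look, not a description of Gourd's implementation; sameFile and demo are hypothetical helpers.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// sameFile reports whether two paths refer to the same underlying file.
// On Unix-like systems os.SameFile compares device and inode numbers.
func sameFile(a, b string) (bool, error) {
	ai, err := os.Stat(a)
	if err != nil {
		return false, err
	}
	bi, err := os.Stat(b)
	if err != nil {
		return false, err
	}
	return os.SameFile(ai, bi), nil
}

// demo creates a file, a hard link to it, and an independent copy, then
// reports whether each of the latter two is the same file as the original.
func demo() (linked, copied bool, err error) {
	dir, err := os.MkdirTemp("", "samefile-sketch")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)
	orig := filepath.Join(dir, "orig")
	link := filepath.Join(dir, "link")
	cp := filepath.Join(dir, "copy")
	if err = os.WriteFile(orig, []byte("data"), 0o600); err != nil {
		return false, false, err
	}
	if err = os.Link(orig, link); err != nil { // hard link: same inode
		return false, false, err
	}
	if err = os.WriteFile(cp, []byte("data"), 0o600); err != nil {
		return false, false, err
	}
	if linked, err = sameFile(orig, link); err != nil {
		return false, false, err
	}
	copied, err = sameFile(orig, cp)
	return linked, copied, err
}

func main() {
	linked, copied, err := demo()
	fmt.Println("hard link is same file:", linked, "copy is same file:", copied, "error:", err)
}
```

A hard link and its original already share storage, so counting them as duplicates would overstate the reclaimable space. An independent copy with identical content is a separate file and remains a candidate.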
To Do:
- -exclude on CLI
- excludes a path
- Glob? Regex? TBD.
To Consider
- goroutines for Hash/file steps?
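If the goroutine idea is pursued, one common shape is a fixed pool of workers that hash files concurrently. The following is only a sketch under that assumption; hashFile, hashAll, result, and demo are hypothetical names, and SHA-1 is used because it is the default mentioned above.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"
)

// hashFile returns the hex SHA-1 digest of one file's contents.
func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha1.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

type result struct {
	path, sum string
	err       error
}

// hashAll hashes paths with a fixed pool of worker goroutines and returns
// a map from path to digest, plus one of the errors encountered, if any.
func hashAll(paths []string, workers int) (map[string]string, error) {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	results := make(chan result)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				sum, err := hashFile(p)
				results <- result{path: p, sum: sum, err: err}
			}
		}()
	}
	go func() { // feed jobs, then close results once every worker is done
		for _, p := range paths {
			jobs <- p
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()
	sums := make(map[string]string, len(paths))
	var anyErr error
	for r := range results {
		if r.err != nil {
			anyErr = r.err
			continue
		}
		sums[r.path] = r.sum
	}
	return sums, anyErr
}

// demo hashes three temporary files, two of which are identical.
func demo() (same, different bool, err error) {
	dir, err := os.MkdirTemp("", "hash-sketch")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)
	a, b, c := filepath.Join(dir, "a"), filepath.Join(dir, "b"), filepath.Join(dir, "c")
	for p, data := range map[string]string{a: "hello", b: "hello", c: "world"} {
		if err = os.WriteFile(p, []byte(data), 0o600); err != nil {
			return false, false, err
		}
	}
	sums, err := hashAll([]string{a, b, c}, 2)
	if err != nil {
		return false, false, err
	}
	return sums[a] == sums[b], sums[a] != sums[c], nil
}

func main() {
	fmt.Println(demo())
}
```

Whether this helps depends on the system: on a single spinning disk, concurrent reads may be slower than sequential ones, so the worker count would likely need to be configurable and benchmarked.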