dedupe

package module
v0.0.0-...-7f2b3d4 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 3, 2025 License: MIT Imports: 9 Imported by: 0

README

Image Deduplication

This is a simple cli tool for finding and removing duplicate images written in pure Go. A duplicate image is any image that has similar visual content to another. This is done with perceptual hashing of the image.

It is capable of finding duplicates with various changes in size, rotation, brightness, saturation or contrast.

It cannot find duplicates that undergo extreme transforms or major changes to the visual content like color inversion, cropping, full rotation or flipping. The detection method works on the visual content of the file but does not do anything semantic in that the same subject on a different background would likely not be detected.

As an example all the images below are transformed from the first cat image and will be flagged as duplicates.

cat dark gray distorted upscaled saturated skewed shrink

dedupe testimages/cats/cat.jpg testimages/cats

Install

Binaries are available for each release. Download and use whichever for your system. For now this only contains binaries for Intel based Windows and Linux systems.

Alternatively you can install through go or build it from source. This requires Go version 1.23+.

go install github.com/alexgQQ/dedupe/cmd/dedupe@latest
# or
git clone https://github.com/alexgQQ/dedupe.git
cd dedupe
make build

Usage

There are two main usages. Checking for specific duplicates and checking for any duplicates. If an image file is provided as the first argument then it will return any duplicates of that image. For example we can check if two image are duplicates

dedupe image.jpg image-copy.jpg

Or we can find duplicate occurrences of an image across other directories

dedupe image.jpg path/to/images path/to/other/images

Otherwise if the first argument is a directory it will search and return any duplicates found

dedupe path/to/images

Alternatively you can provide a flag to force it into this search mode. This is most relevant if passing a file of images to handle.

cat images.txt | dedupe --search -

The cli is designed to work with pipelining in this way too. You can generate a csv like list of duplicates from that list of images like this.

cat images.txt | dedupe --search -o - > duplicates.csv

More flag usage and options are listed in the help message.

dedupe --help

You can also import this package and use it in your code

go get github.com/alexgQQ/dedupe@latest

and in your code

package main

import (
	"fmt"
	"strings"

	"github.com/alexgQQ/dedupe"
)

func main() {

	images := []string{
		"testimages/cats/cat-on-couch.jpg",
		"testimages/cats/cat.jpg",
		"testimages/cats/cat-shrink.jpg",
	}

	results, total, _ := dedupe.Duplicates(dedupe.DCT, images)
	fmt.Printf("Found %d duplicates\n", total)
	if total <= 0 {
		return
	}
	for _, files := range results {
		msg := strings.Join(files, ", ")
		fmt.Printf("%s\n", msg)
	}
}

Development

It's a straightforward package so clone and use whatever go workflow you like. The cli at cmd/dedupe/main.go is the best entrypoint and I'd recommend to have the verbose flag set and point it at the test images in the repo. Configure your debugger to do that.

go run cmd/dedupe/main.go -v -t testimages

Keep it clean and tidy

make fmt && make test

Issues and contributions are welcome. Guidelines are pretty loose:

  • For any large change please open an issue to have some discussion (for small changes feel free to just fork this and open a PR).
  • Don't add external dependencies
Implementation

The process works by computing images perceptual hashes and using a vantage point tree to find hashes close to each other by their hamming distance. The hashing method has an impact and this currently implements the dhash and dct perceptual hashes. Reasonable thresholds are defined from here and the dhash implementation description. By default the dct method is used as it is more accurate and resilient to image variation. However the dhash method is a bit faster and might be more appropriate for large amounts of images at the cost of some accuracy. For now this is sufficient but would be fun to implement more hashing methods like average hash or radial hash.

Test Images

The testimages directory contains some images to test against. These are a collection of cat images and images from a wallpaper dump. In particular these images have variation of direct duplicates, recolorings, and similar looking images for a solid test case. We can observe the accuracy difference between dhash and dct against these.

Building For Additional Targets

I find the elastic golang crossbuild image to be very helpful. There's lots of documentation on how to use it but for now I mainly target linux and windows amd64 targets. At some point it would be good to do this for OSX targets but it's more of a hassle. Mac users should build from source. How to manually build:

GOVERSION=1.24.3
docker run \
  -v .:/go/src/github.com/alexgQQ/go-image-deduper \
  -w /go/src/github.com/alexgQQ/go-image-deduper \
  -e CGO_ENABLED=0 \
  -e BINARY=dedupe.linux \
  docker.elastic.co/beats-dev/golang-crossbuild:${GOVERSION}-main \
  --build-cmd "make build" \
  -p linux/amd64

docker run \
  -v .:/go/src/github.com/alexgQQ/go-image-deduper \
  -w /go/src/github.com/alexgQQ/go-image-deduper \
  -e CGO_ENABLED=0 \
  -e BINARY=dedupe.window.exe \
  docker.elastic.co/beats-dev/golang-crossbuild:${GOVERSION}-main \
  --build-cmd "make build" \
  -p windows/amd64

Documentation

Index

Constants

This section is empty.

Variables

Functions

func Compare

func Compare(hashType hash.HashType, target string, files ...string) (filenames []string, err error)

Find any duplicate images of the target image from given image files hashTypes determines the hashing method and can be either dedupe.DCT or dedupe.DHASH

func Duplicates

func Duplicates(hashType hash.HashType, files []string) (duplicates [][]string, total int, err error)

Find groups of duplicate images from a list of given images hashTypes determines the hashing method and can be either dedupe.DCT or dedupe.DHASH

Types

This section is empty.

Directories

Path Synopsis
cmd

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL