simhash

Version: v0.0.0-...-5756091
Published: Feb 22, 2026 License: MIT Imports: 7 Imported by: 0

README

Small, production-quality Go module for fast 64-bit SimHash on HTML documents.

It includes:

  • Core streaming SimHash hasher (Hasher64)
  • Visible-text HTML tokenization (TokenizeHTMLText)
  • DOM-structure tokenization (TokenizeHTMLDOM)
  • Comparison helpers (Hamming64, Similarity64)

Install

go get github.com/evanleleux/simhash

Quickstart

package main

import (
	"fmt"
	"log"

	"github.com/evanleleux/simhash"
)

func main() {
	a := []byte(`<html><body><h1>Checkout</h1><p>Pay securely</p></body></html>`)
	b := []byte(`<html><body><h1>Checkout</h1><p>Pay securely today</p></body></html>`)

	// Fingerprint the visible text of each document.
	h1, err := simhash.FingerprintHTMLText64(a)
	if err != nil {
		log.Fatal(err)
	}
	h2, err := simhash.FingerprintHTMLText64(b)
	if err != nil {
		log.Fatal(err)
	}

	// Compare the two fingerprints bit by bit.
	fmt.Printf("h1=0x%016x h2=0x%016x\n", h1, h2)
	fmt.Printf("hamming=%d similarity=%.4f\n", simhash.Hamming64(h1, h2), simhash.Similarity64(h1, h2))
}

API

  • FingerprintTokens64(tokens TokenStream, opts ...Option) (uint64, error)
  • FingerprintHTMLText64(html []byte, opts ...Option) (uint64, error)
  • FingerprintHTMLDOM64(html []byte, opts ...Option) (uint64, error)
  • Hamming64(a, b uint64) int
  • Similarity64(a, b uint64) float64

Hasher64 supports streaming accumulation without building token slices:

h := simhash.NewHasher64()
h.AddStringToken("checkout", 1)
h.AddStringToken("payment", 1)
fp := h.Sum64()
_ = fp
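
To illustrate what streaming accumulation means here, the following is a minimal sketch of the textbook SimHash procedure: one signed counter per output bit, updated per token, with the fingerprint taken from the counter signs. It is not the package's implementation. The names sketchHasher and hashToken are hypothetical, and FNV-1a from the standard library stands in for the documented xxhash default.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// sketchHasher is a hypothetical stand-in for illustration only.
// It keeps one signed counter per fingerprint bit.
type sketchHasher struct {
	counts [64]int64
}

// hashToken hashes a token with FNV-1a (an assumption made so the
// sketch needs only the standard library).
func hashToken(tok string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(tok))
	return h.Sum64()
}

// add folds one weighted token into the counters: a set hash bit
// adds the weight, a clear hash bit subtracts it.
func (s *sketchHasher) add(tok string, weight int16) {
	hv := hashToken(tok)
	for i := 0; i < 64; i++ {
		if hv&(1<<uint(i)) != 0 {
			s.counts[i] += int64(weight)
		} else {
			s.counts[i] -= int64(weight)
		}
	}
}

// sum sets bit i of the fingerprint when counter i is positive.
func (s *sketchHasher) sum() uint64 {
	var fp uint64
	for i, c := range s.counts {
		if c > 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}

func main() {
	var s sketchHasher
	s.add("checkout", 1)
	s.add("payment", 1)
	fmt.Printf("fingerprint=0x%016x\n", s.sum())
}
```

Because only the 64 counters are kept, tokens can be discarded as soon as they are added, and the result does not depend on the order in which tokens arrive.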

HTML Tokenization Behavior

Visible text
  • Parses with golang.org/x/net/html (no regex parsing)
  • Ignores text under <script> and <style>
  • Best-effort hidden filtering (hidden, inline display:none, visibility:hidden) by default
  • Collapses whitespace by tokenizing into word tokens
DOM structure
  • Emits path tokens like html/body/div/form/input
  • Ignores attributes by default (stable across changing classes/IDs)
  • Uses configurable max depth (default 8)
  • Can focus on form tags with WithDOMFormOnly(true)
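
The path-token idea can be sketched on a hand-built element tree. This is an illustration under assumptions, not the package's tokenizer: the node type and emitPaths function are hypothetical, real HTML parsing is left out, and the sketch assumes that elements deeper than the limit are simply not emitted.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical, minimal element tree used only for this sketch.
type node struct {
	tag      string
	children []*node
}

// emitPaths walks the tree depth-first and calls sink with one
// slash-joined, lowercased tag path per element. Elements deeper than
// maxDepth are skipped; maxDepth <= 0 means unlimited.
func emitPaths(n *node, prefix []string, maxDepth int, sink func(string)) {
	if maxDepth > 0 && len(prefix) >= maxDepth {
		return
	}
	// Copy the prefix so sibling subtrees never share a backing array.
	path := make([]string, len(prefix)+1)
	copy(path, prefix)
	path[len(prefix)] = strings.ToLower(n.tag)

	sink(strings.Join(path, "/"))
	for _, c := range n.children {
		emitPaths(c, path, maxDepth, sink)
	}
}

func main() {
	tree := &node{tag: "html", children: []*node{
		{tag: "body", children: []*node{
			{tag: "div", children: []*node{
				{tag: "form", children: []*node{{tag: "input"}}},
			}},
		}},
	}}
	emitPaths(tree, nil, 8, func(p string) { fmt.Println(p) })
}
```

Since attributes never enter the path, two pages that differ only in class names or IDs produce the same token set in this sketch.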

Options

  • WithHashFunc(HashFunc64) to override hashing function
  • WithWeightFunc(WeightFunc) to override token weighting
  • WithMaxTextBytes(n int) to cap visible text bytes processed
  • WithDOMMaxDepth(depth int) to cap emitted DOM depth
  • WithIgnoreHidden(enabled bool) to toggle hidden-node filtering
  • WithLowercaseTags(enabled bool) to toggle lowercasing tag names
  • WithDOMFormOnly(enabled bool) to emit only form-related DOM paths

Default token hash is github.com/cespare/xxhash/v2.
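
As a sketch of what a custom hash or weight function might look like, the example below declares local copies of the documented HashFunc64 and WeightFunc signatures so that it compiles without the module. The functions fnvHash and lengthWeight are hypothetical, and the weighting policy is purely illustrative.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Local mirrors of the documented signatures, declared here only so
// the sketch is self-contained.
type HashFunc64 func([]byte) uint64
type WeightFunc func([]byte) int16

// fnvHash is a candidate replacement token hash based on FNV-1a.
func fnvHash(tok []byte) uint64 {
	h := fnv.New64a()
	h.Write(tok)
	return h.Sum64()
}

// lengthWeight is a hypothetical policy: tokens shorter than three
// bytes get weight 1, all others get weight 2.
func lengthWeight(tok []byte) int16 {
	if len(tok) < 3 {
		return 1
	}
	return 2
}

// Compile-time checks that both functions fit the mirrored types.
var (
	_ HashFunc64 = fnvHash
	_ WeightFunc = lengthWeight
)

func main() {
	tok := []byte("checkout")
	fmt.Printf("hash=0x%016x weight=%d\n", fnvHash(tok), lengthWeight(tok))
}
```

With the module imported, functions of these shapes should be accepted by WithHashFunc and WithWeightFunc, given the signatures listed in the API reference.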

Threshold Guidance

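For 64-bit SimHash, near-duplicate detection often starts around Hamming distance <= 3, but this is dataset-dependent. Fingerprints of unrelated documents differ in about 32 of 64 bits on average, so a useful threshold sits well below that. Tune thresholds on your corpus and objective (precision vs. recall).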
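
A threshold check can be sketched with the standard library alone. The helper names below are hypothetical, the similarity formula (1 minus distance over 64) is an assumption about how a distance is typically mapped to a score rather than a confirmed description of Similarity64, and the threshold passed in main is only a placeholder to tune.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hamming counts the bits that differ between two 64-bit fingerprints.
func hamming(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// similarity maps a distance to the range [0, 1].
// Assumed formula for illustration: 1 - distance/64.
func similarity(a, b uint64) float64 {
	return 1 - float64(hamming(a, b))/64
}

// nearDuplicate reports whether two fingerprints are within maxDist bits.
func nearDuplicate(a, b uint64, maxDist int) bool {
	return hamming(a, b) <= maxDist
}

func main() {
	a := uint64(0xF0F0F0F0F0F0F0F0)
	b := a ^ 0x5 // flip two low bits

	// maxDist is a placeholder; pick it from labelled pairs in your corpus.
	const maxDist = 3
	fmt.Printf("hamming=%d similarity=%.4f near=%v\n",
		hamming(a, b), similarity(a, b), nearDuplicate(a, b, maxDist))
}
```

Lowering the threshold favors precision, while raising it favors recall.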

Important Note

Do not shingle before SimHash in this workflow. Shingling is mainly useful for MinHash/Jaccard-style similarity; this package is intended to take direct token streams into SimHash.

Example Program

Run:

go run ./cmd/example [fileA.html fileB.html]

Without args, it compares:

  • https://evanleleux.dev/simhash/page-01 through https://evanleleux.dev/simhash/page-10

Default output includes per-page hashes plus adjacent-page similarity comparisons.

You can also pass two local file paths or URLs for direct pair comparison.

Documentation

Overview

Package simhash provides fast 64-bit SimHash for text and HTML structure.

Quick example:

h1, _ := FingerprintHTMLText64([]byte(`<html><body>Hello world</body></html>`))
h2, _ := FingerprintHTMLText64([]byte(`<html><body>Hello brave world</body></html>`))
d := Hamming64(h1, h2)
s := Similarity64(h1, h2)
_, _ = d, s

For HTML inputs, use:

  • FingerprintHTMLText64 for visible text similarity.
  • FingerprintHTMLDOM64 for structure similarity.

Index

Functions

func FingerprintHTMLDOM64

func FingerprintHTMLDOM64(html []byte, opts ...Option) (uint64, error)

func FingerprintHTMLText64

func FingerprintHTMLText64(html []byte, opts ...Option) (uint64, error)

func FingerprintTokens64

func FingerprintTokens64(tokens TokenStream, opts ...Option) (uint64, error)

func Hamming64

func Hamming64(a, b uint64) int

func Similarity64

func Similarity64(a, b uint64) float64

func TokenizeHTMLDOM

func TokenizeHTMLDOM(src []byte, sink func(tok []byte)) error

func TokenizeHTMLText

func TokenizeHTMLText(src []byte, sink func(tok []byte)) error

Types

type HashFunc64

type HashFunc64 func([]byte) uint64

type Hasher64

type Hasher64 struct {
	// contains filtered or unexported fields
}

func NewHasher64

func NewHasher64(opts ...Option) *Hasher64

func (*Hasher64) AddStringToken

func (h *Hasher64) AddStringToken(tok string, weight int16)

func (*Hasher64) AddToken

func (h *Hasher64) AddToken(tok []byte, weight int16)

func (*Hasher64) Reset

func (h *Hasher64) Reset()

func (*Hasher64) Sum64

func (h *Hasher64) Sum64() uint64

type Option

type Option func(*config)

func WithDOMFormOnly

func WithDOMFormOnly(enabled bool) Option

func WithDOMMaxDepth

func WithDOMMaxDepth(depth int) Option

A value <= 0 means unlimited.

func WithHashFunc

func WithHashFunc(fn HashFunc64) Option

func WithIgnoreHidden

func WithIgnoreHidden(enabled bool) Option

func WithLowercaseTags

func WithLowercaseTags(enabled bool) Option

func WithMaxTextBytes

func WithMaxTextBytes(n int) Option

A value <= 0 means unlimited.

func WithWeightFunc

func WithWeightFunc(fn WeightFunc) Option

type TokenStream

type TokenStream func(sink func(tok []byte)) error

The token memory may be reused by the producer after sink returns.
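
The reuse rule matters for any sink that keeps tokens beyond the call. The sketch below mirrors the documented TokenStream type locally and uses a hypothetical producer, reusingStream, that overwrites one buffer for every token; the sink stays correct because it copies each token before returning.

```go
package main

import "fmt"

// TokenStream mirrors the documented type so the sketch is self-contained.
type TokenStream func(sink func(tok []byte)) error

// reusingStream is a hypothetical producer that reuses a single buffer
// for every token it emits.
func reusingStream(words []string) TokenStream {
	return func(sink func(tok []byte)) error {
		var buf []byte
		for _, w := range words {
			buf = append(buf[:0], w...) // overwrite the previous token
			sink(buf)
		}
		return nil
	}
}

// collect retains tokens safely by copying each one inside the sink.
func collect(ts TokenStream) ([]string, error) {
	var out []string
	err := ts(func(tok []byte) {
		out = append(out, string(tok)) // the string conversion copies the bytes
	})
	return out, err
}

func main() {
	toks, err := collect(reusingStream([]string{"checkout", "payment"}))
	if err != nil {
		fmt.Println("stream error:", err)
		return
	}
	fmt.Println(toks)
}
```

A sink that stored the tok slices directly would end up with every entry aliasing the same buffer, so its contents would change as later tokens are written.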

type WeightFunc

type WeightFunc func([]byte) int16

Directories

  • cmd/example (command)
