bktree

package module
v0.2.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 28, 2026 License: GPL-3.0 Imports: 7 Imported by: 0

README

bktree

A Go implementation of the Burkhard-Keller Tree (BK-Tree) for fast approximate string matching in metric spaces.

Features

  • Generic distance function — plug in any metric (Levenshtein, Hamming, or your own)
  • Single tree + ForestBKTree for general use; Forest partitions by string length for natural Hamming support and Levenshtein length-based pruning
  • Exists shortcutExists(word, maxDist) stops at the first match, faster than Query when you only need a boolean
  • Sorted resultsQuery returns results ordered by distance (then word)
  • Unicode-aware — Levenshtein distance operates on runes, not raw bytes
  • Zero dependencies — standard library only

Install

go get github.com/IcyDesert/bktree

Quick Start

Single Tree with Levenshtein
package main

import (
    "fmt"
    "github.com/IcyDesert/bktree"
)

func main() {
    tree := bktree.New(bktree.Levenshtein)

    words := []string{"book", "books", "cake", "boo", "cape"}
    for _, w := range words {
        tree.Add(w)
    }

    // Find all words within edit distance 2 of "boak"
    results := tree.Query("boak", 2)
    for _, r := range results {
        fmt.Printf("%s (distance: %d)\n", r.Word, r.Distance)
    }
    // Output:
    // book (distance: 1)
    // boo (distance: 2)
    // books (distance: 2)

    // Just check if anything is close enough
    if tree.Exists("hallo", 1) {
        fmt.Println("found a near-match")
    }
}
Forest with Hamming

Forest groups words by length into separate trees. This makes Hamming distance trivial — only the same-length tree is ever queried.

forest := bktree.NewForest(bktree.Hamming)

words := []string{"000", "001", "010", "111"}
for _, w := range words {
    forest.Add(w)
}

results := forest.Query("001", 1)
for _, r := range results {
    fmt.Printf("%s (hamming: %d)\n", r.Word, r.Distance)
}
// Output:
// 001 (hamming: 0)
// 000 (hamming: 1)
Custom Distance Function
// Jaccard distance on character sets (example)
func jaccard(a, b string) int {
    // ... compute distance ...
    return distance
}

tree := bktree.New(jaccard)
tree.Add("foo")
tree.Add("bar")

results := tree.Query("baz", 2)

Persistence

Trees can be saved to disk with gob encoding. Only the topology is persisted — the distance function is not saved. You must provide the same function when loading.

import "os"

// Build
forest := bktree.NewForest(bktree.Levenshtein)
forest.Add("book")
forest.Add("books")

// Save
f, _ := os.Create(bktree.DefaultFilename("index.gob", bktree.Levenshtein))
// Creates "index_levenshtein.gob"
forest.Save(f)
f.Close()

// Load (elsewhere, later)
f, _ = os.Open("index_levenshtein.gob")
defer f.Close()

loaded, err := bktree.LoadForest(f, bktree.Levenshtein)
if err != nil {
    log.Fatal(err)
}

// Use exactly like before
results := loaded.Query("boak", 2)

BKTree uses the same pattern with Save and Load.

How Forest Pruning Works

For Levenshtein and other reasonable metrics, if dist(a, b) <= d then |len(a) - len(b)| <= d. Forest.Query exploits this to skip entire length-buckets, reducing the search space.

For Hamming, the same logic collapses to checking only the exact-length bucket.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func DefaultFilename added in v0.2.0

func DefaultFilename(path string, fn DistanceFunc) string

DefaultFilename returns a filename with the metric name embedded, e.g. DefaultFilename("data.gob", Levenshtein) → "data_levenshtein.gob".

func Hamming

func Hamming(a, b string) int

Hamming computes the Hamming distance between two equal-length strings. Panics if the strings have different lengths.

func Levenshtein

func Levenshtein(a, b string) int

Levenshtein computes the edit distance between two strings using the Wagner-Fischer algorithm with O(min(len(a),len(b))) space. The distance is calculated over Unicode code points (runes), not bytes.

Types

type BKTree

type BKTree struct {
	// contains filtered or unexported fields
}

BKTree is a Burkhard-Keller tree for fast approximate string matching.

func Load added in v0.2.0

func Load(r io.Reader, dist DistanceFunc) (*BKTree, error)

Load reads a tree topology from r using gob encoding.

func New

func New(dist DistanceFunc) *BKTree

New creates a new empty BKTree with the given distance function. Panics if dist is nil.

func (*BKTree) Add

func (t *BKTree) Add(word string)

Add inserts a word into the tree.

func (*BKTree) Exists

func (t *BKTree) Exists(word string, maxDist int) bool

Exists returns true if there exists at least one word in the tree within maxDist of the query word. It short-circuits on the first match.

func (*BKTree) Query

func (t *BKTree) Query(word string, maxDist int) []Result

Query returns all words in the tree within maxDist of the query word, sorted by distance ascending.

func (*BKTree) Save added in v0.2.0

func (t *BKTree) Save(w io.Writer) error

Save writes the tree topology to w using gob encoding. The distance function is not persisted.

type DistanceFunc

type DistanceFunc func(a, b string) int

DistanceFunc defines a metric distance function between two strings.

type Forest

type Forest struct {
	// contains filtered or unexported fields
}

Forest partitions words by length into multiple BK-Trees. This enables natural Hamming support and Levenshtein length-based pruning.

func LoadForest added in v0.2.0

func LoadForest(r io.Reader, dist DistanceFunc) (*Forest, error)

LoadForest reads a forest topology from r using gob encoding.

func NewForest

func NewForest(dist DistanceFunc) *Forest

NewForest creates a new empty Forest with the given distance function. Panics if dist is nil.

func (*Forest) Add

func (f *Forest) Add(word string)

Add inserts a word into the forest, routing it to the tree keyed by its length.

func (*Forest) Exists

func (f *Forest) Exists(word string, maxDist int) bool

Exists returns true if there exists at least one word in the forest within maxDist of the query word. Short-circuits on first match.

func (*Forest) Query

func (f *Forest) Query(word string, maxDist int) []Result

Query returns all words in the forest within maxDist of the query word. It only queries trees whose length is in [len(word)-maxDist, len(word)+maxDist]. Results are sorted by distance ascending.

func (*Forest) Save added in v0.2.0

func (f *Forest) Save(w io.Writer) error

Save writes the forest topology to w using gob encoding. The distance function is not persisted.

type Node added in v0.2.0

type Node struct {
	Word     string
	Children map[int]*Node
}

Node is a node in the BK-tree.

type Result

type Result struct {
	Word     string
	Distance int
}

Result represents a single match from a Query.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL