sim

package
v0.0.0-...-5c30f4a
Warning

This package is not in the latest version of its module.

Published: Aug 20, 2022 | License: AGPL-3.0 | Imports: 9 | Imported by: 0

README

sim

Text-similarity checker for large collections of text.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type DocID

type DocID uint32

type DocMatch

type DocMatch struct {
	ID    DocID
	Score float64
}

type Hash

type Hash uint64

func UniqHashes

func UniqHashes(src []Hash) []Hash

UniqHashes removes duplicates from a list of hashes. Ordering is otherwise preserved.
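
The package source is not shown on this page, so the following is only a minimal sketch of an order-preserving de-duplication with the same signature. The name uniqHashes and the locally redeclared Hash type are stand-ins for illustration, and keeping the first occurrence of each hash is an assumption about what "otherwise preserved" means.

```go
package main

import "fmt"

// Hash mirrors the package's Hash type so the sketch is self-contained.
type Hash uint64

// uniqHashes is a hypothetical stand-in for UniqHashes. ASSUMPTION: the
// first occurrence of each hash is kept and later repeats are dropped.
func uniqHashes(src []Hash) []Hash {
	seen := make(map[Hash]struct{}, len(src))
	out := make([]Hash, 0, len(src))
	for _, h := range src {
		if _, dup := seen[h]; dup {
			continue
		}
		seen[h] = struct{}{}
		out = append(out, h)
	}
	return out
}

func main() {
	fmt.Println(uniqHashes([]Hash{7, 3, 7, 1, 3})) // prints [7 3 1]
}
```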

type Index

type Index struct {
	// Docs holds a list of hashes for each document.
	Docs map[DocID][]Hash
	// Hashes holds lists of documents for each hash.
	Hashes map[Hash][]DocID
}

func NewIndex

func NewIndex() *Index

NewIndex creates a new Index.

func (*Index) Match

func (index *Index) Match(hashes []Hash, threshold float64) []DocMatch

Match finds documents containing the given hashes. Matches below the given threshold factor are ignored.
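
This page does not say how Score is computed, so the sketch below only illustrates the general idea of a lookup through the Hashes map followed by a threshold filter. It assumes the score is the fraction of distinct query hashes that a document contains, and that each document appears at most once in a hash's list. The function name match, that scoring rule, and the sort order of the results are all assumptions, not the package's documented behaviour.

```go
package main

import (
	"fmt"
	"sort"
)

// Local copies of the package's types, so the sketch is self-contained.
type DocID uint32
type Hash uint64

type DocMatch struct {
	ID    DocID
	Score float64
}

type Index struct {
	Docs   map[DocID][]Hash
	Hashes map[Hash][]DocID
}

// match is a hypothetical stand-in for (*Index).Match. ASSUMPTION: the score
// is (distinct query hashes found in the doc) / (distinct query hashes).
func match(index *Index, hashes []Hash, threshold float64) []DocMatch {
	seen := make(map[Hash]struct{}, len(hashes))
	counts := make(map[DocID]int)
	for _, h := range hashes {
		if _, dup := seen[h]; dup {
			continue // count each query hash only once
		}
		seen[h] = struct{}{}
		for _, id := range index.Hashes[h] {
			counts[id]++
		}
	}
	var out []DocMatch
	for id, n := range counts {
		score := float64(n) / float64(len(seen))
		if score >= threshold {
			out = append(out, DocMatch{ID: id, Score: score})
		}
	}
	// Best match first; ties broken by ID so the output is deterministic.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].ID < out[j].ID
	})
	return out
}

func main() {
	idx := &Index{Hashes: map[Hash][]DocID{1: {10, 20}, 2: {10}}}
	// Doc 10 contains both query hashes; doc 20 contains one of two and
	// falls below the 0.6 threshold.
	fmt.Println(match(idx, []Hash{1, 2}, 0.6))
}
```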

type Indexer

type Indexer struct {
	Lang      string // language used for indexing
	NgramSize int
	// contains filtered or unexported fields
}

func NewIndexer

func NewIndexer(ngramSize int, lang string) (*Indexer, error)

NewIndexer creates a new Indexer.

func (*Indexer) HashString

func (indexer *Indexer) HashString(txt string) []Hash

HashString tokenises a string and returns a list of hashed ngrams.
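
To make "hashed ngrams" concrete, here is a rough sketch of the technique: split the text into lower-cased words, slide a window of ngramSize words over them, and hash each window. The tokeniser (strings.Fields), the hash function (FNV-1a from the standard library), and the name hashString are assumptions for illustration. The real Indexer also takes a language, whose effect on tokenisation is not described on this page and is not modelled here.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// Hash mirrors the package's Hash type so the sketch is self-contained.
type Hash uint64

// hashString is a hypothetical stand-in for (*Indexer).HashString.
// ASSUMPTIONS: whitespace tokenisation, lower-casing, word n-grams, FNV-1a.
func hashString(txt string, ngramSize int) []Hash {
	words := strings.Fields(strings.ToLower(txt))
	if ngramSize <= 0 || len(words) < ngramSize {
		return nil // not enough words for a single n-gram
	}
	out := make([]Hash, 0, len(words)-ngramSize+1)
	for i := 0; i+ngramSize <= len(words); i++ {
		h := fnv.New64a()
		h.Write([]byte(strings.Join(words[i:i+ngramSize], " ")))
		out = append(out, Hash(h.Sum64()))
	}
	return out
}

func main() {
	// Five words with a window of three give three n-grams.
	hashes := hashString("the quick brown fox jumps", 3)
	fmt.Println(len(hashes)) // prints 3
}
```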

func (*Indexer) IndexDoc

func (indexer *Indexer) IndexDoc(targ *Index, docID DocID, txt string)

IndexDoc indexes a document and adds it to the target index. It assumes the document does not already exist in the index.
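
The sketch below shows one plausible way the two maps in Index could be filled, and why indexing the same document twice would be a problem. It takes already-computed hashes rather than text to stay self-contained. The names newIndex and indexHashes are hypothetical, and listing each document once per distinct hash is an assumption rather than the package's confirmed behaviour.

```go
package main

import "fmt"

// Local copies of the package's types, so the sketch is self-contained.
type DocID uint32
type Hash uint64

type Index struct {
	Docs   map[DocID][]Hash
	Hashes map[Hash][]DocID
}

// newIndex is a hypothetical stand-in for NewIndex: both maps start empty.
func newIndex() *Index {
	return &Index{
		Docs:   make(map[DocID][]Hash),
		Hashes: make(map[Hash][]DocID),
	}
}

// indexHashes is a hypothetical stand-in for the bookkeeping in IndexDoc.
// ASSUMPTION: a document is listed once under each distinct hash it contains.
func indexHashes(targ *Index, docID DocID, hashes []Hash) {
	targ.Docs[docID] = hashes
	seen := make(map[Hash]struct{}, len(hashes))
	for _, h := range hashes {
		if _, dup := seen[h]; dup {
			continue
		}
		seen[h] = struct{}{}
		targ.Hashes[h] = append(targ.Hashes[h], docID)
	}
}

func main() {
	idx := newIndex()
	indexHashes(idx, 1, []Hash{100, 200})
	indexHashes(idx, 2, []Hash{200, 300})
	fmt.Println(idx.Hashes[200]) // prints [1 2]
}
```

In this sketch, calling indexHashes a second time for the same docID would append that ID to each hash's list again, which is consistent with the requirement that a document not already be in the index.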
