Documentation
Overview ¶
Package simhash provides fast 64-bit SimHash fingerprints for text and HTML structure.
Quick example:
	h1, _ := FingerprintHTMLText64([]byte(`<html><body>Hello world</body></html>`))
	h2, _ := FingerprintHTMLText64([]byte(`<html><body>Hello brave world</body></html>`))
	d := Hamming64(h1, h2)
	s := Similarity64(h1, h2)
	_, _ = d, s
For HTML inputs, use:
- FingerprintHTMLText64 for visible text similarity.
- FingerprintHTMLDOM64 for structure similarity.
Index ¶
- func FingerprintHTMLDOM64(html []byte, opts ...Option) (uint64, error)
- func FingerprintHTMLText64(html []byte, opts ...Option) (uint64, error)
- func FingerprintTokens64(tokens TokenStream, opts ...Option) (uint64, error)
- func Hamming64(a, b uint64) int
- func Similarity64(a, b uint64) float64
- func TokenizeHTMLDOM(src []byte, sink func(tok []byte)) error
- func TokenizeHTMLText(src []byte, sink func(tok []byte)) error
- type HashFunc64
- type Hasher64
- type Option
- type TokenStream
- type WeightFunc
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func FingerprintHTMLText64 ¶
func FingerprintTokens64 ¶
func FingerprintTokens64(tokens TokenStream, opts ...Option) (uint64, error)
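For readers unfamiliar with the technique, the following is a minimal sketch of how a 64-bit SimHash can be computed from a sequence of tokens. It is not this package's implementation: the name `simhash64`, the use of FNV-1a as the per-token hash, and the fixed weight of 1 per token are all assumptions made for illustration. The package's real defaults are presumably what `WithHashFunc` and `WithWeightFunc` override.

```go
package main

import "hash/fnv"

// simhash64 is an illustrative 64-bit SimHash over string tokens.
// Assumptions (not taken from the package): FNV-1a is the per-token
// hash and every token has weight 1.
func simhash64(tokens []string) uint64 {
	// counts[i] accumulates a vote for bit i: +1 when a token hash
	// has the bit set, -1 when it does not.
	var counts [64]int
	for _, tok := range tokens {
		h := fnv.New64a()
		h.Write([]byte(tok)) // Write on a hash.Hash never returns an error
		sum := h.Sum64()
		for bit := 0; bit < 64; bit++ {
			if sum&(1<<uint(bit)) != 0 {
				counts[bit]++
			} else {
				counts[bit]--
			}
		}
	}

	// A bit of the fingerprint is set when its votes are positive.
	var fp uint64
	for bit, c := range counts {
		if c > 0 {
			fp |= 1 << uint(bit)
		}
	}
	return fp
}
```

Because each bit is decided by a majority vote, the result does not depend on token order, and inputs that share most of their tokens tend to differ in only a few bits.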
func Similarity64 ¶
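As a rough illustration of what a Hamming distance and a similarity score over two 64-bit fingerprints look like, here is a self-contained sketch. The lowercase names `hamming64` and `similarity64` are hypothetical stand-ins, and the formula `1 - d/64` is an assumption; the source does not state how `Similarity64` is actually defined.

```go
package main

import "math/bits"

// hamming64 counts the bit positions in which a and b differ.
func hamming64(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// similarity64 maps the distance onto [0, 1], where 1 means identical
// fingerprints. Assumption: a linear 1 - d/64 scale.
func similarity64(a, b uint64) float64 {
	return 1 - float64(hamming64(a, b))/64
}
```

Under this assumed scale, two fingerprints that differ in 16 of 64 bits score 0.75.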
func TokenizeHTMLDOM ¶
func TokenizeHTMLText ¶
Types ¶
type HashFunc64 ¶
type Hasher64 ¶
type Hasher64 struct {
// contains filtered or unexported fields
}
func NewHasher64 ¶
func (*Hasher64) AddStringToken ¶
type Option ¶
type Option func(*config)
func WithDOMFormOnly ¶
func WithHashFunc ¶
func WithHashFunc(fn HashFunc64) Option
func WithIgnoreHidden ¶
func WithLowercaseTags ¶
func WithWeightFunc ¶
func WithWeightFunc(fn WeightFunc) Option
type TokenStream ¶
The token memory may be reused by the producer after sink returns.
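The practical consequence of the note above is that a sink must copy any token it wants to keep. The sketch below demonstrates this with a hypothetical producer, `emitWords`, which is not part of the package; it only mimics the `sink func(tok []byte)` callback shape and deliberately reuses one buffer for every token.

```go
package main

// emitWords is a hypothetical producer. It splits src on spaces and
// hands each word to sink through a single buffer that it overwrites
// on every call. It assumes words no longer than 16 bytes, so the
// buffer is never reallocated.
func emitWords(src []byte, sink func(tok []byte)) {
	buf := make([]byte, 0, 16)
	start := 0
	for i := 0; i <= len(src); i++ {
		if i == len(src) || src[i] == ' ' {
			if i > start {
				buf = append(buf[:0], src[start:i]...)
				sink(buf)
			}
			start = i + 1
		}
	}
}

// collectTokens keeps tokens safely: string(tok) copies the bytes
// before the producer can overwrite them.
func collectTokens(src []byte) []string {
	var out []string
	emitWords(src, func(tok []byte) {
		out = append(out, string(tok))
	})
	return out
}

// collectAliased shows the mistake: it retains the slices themselves,
// so every retained slice ends up pointing at the same reused buffer.
func collectAliased(src []byte) [][]byte {
	var out [][]byte
	emitWords(src, func(tok []byte) {
		out = append(out, tok)
	})
	return out
}
```

With the input `hello brave world`, `collectTokens` returns the three words, while every slice returned by `collectAliased` reads `world`, because all of them share the producer's buffer.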
type WeightFunc ¶