visigoth

package module
v0.3.0
Published: Jul 3, 2025 License: Apache-2.0 Imports: 14 Imported by: 0

README

Visigoth

A Go package for full-text search and indexing

Features

  • Text analysis with tokenization, filtering, and stemming
  • Memory-efficient inverted index
  • Multiple search algorithms (Linear, Hits-based, Noop)
  • Spanish language support with Snowball stemming
  • Repository pattern for data persistence

Installation

go get github.com/sonirico/visigoth

Usage

package main

import (
    "fmt"
    "log"

    "github.com/sonirico/visigoth"
)

type VisigothSearcher struct {
    repo visigoth.Repo
}

type SearchPayload struct {
    Index string
    Terms string
}

func (v *VisigothSearcher) Search(p SearchPayload) error {
    stream, err := v.repo.Search(p.Index, p.Terms, visigoth.HitsSearch)
    if err != nil {
        return err
    }
    
    // Process results from stream
    for result := range stream.Chan() {
        fmt.Printf("Document: %s, Hits: %d\n", 
            result.Doc().ID(), result.Hits)
    }
    
    return nil
}

func main() {
    // Create tokenization pipeline with Spanish support
    tokenizer := visigoth.NewKeepAlphanumericTokenizer()
    pipeline := visigoth.NewTokenizationPipeline(
        tokenizer,
        visigoth.NewLowerCaseTokenizer(),
        visigoth.NewStopWordsFilter(visigoth.SpanishStopWords),
        visigoth.NewSpanishStemmer(true),
    )
    
    // Create repository with memory index builder
    repo := visigoth.NewIndexRepo(visigoth.NewMemoryIndexBuilder(pipeline))
    
    // Create searcher
    searcher := &VisigothSearcher{repo: repo}
    
    // Index some documents
    repo.Put("courses", visigoth.NewDocRequest("java-course", "Curso de programación en Java"))
    repo.Put("courses", visigoth.NewDocRequest("go-course", "Curso de programación en Go"))
    repo.Put("courses", visigoth.NewDocRequest("python-course", "Curso de programación en Python"))
    
    // Search
    if err := searcher.Search(SearchPayload{
        Index: "courses",
        Terms: "programación java",
    }); err != nil {
        log.Fatal(err)
    }
}

Architecture

The package is organized into focused components:

  • analyze - Text processing pipeline
  • index - Document indexing and storage
  • search - Query processing and ranking
  • stemmer - Language-specific text normalization
  • repos - Data persistence layer with aliasing
  • loaders - Document loading utilities
  • entities - Core data structures
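
The components are exposed through the flat visigoth package used in the Usage example above. A quick sketch mapping the main constructors to the components they come from (the mapping is inferred from the component names; it is illustrative, not a statement of the internal package layout):

package main

import "github.com/sonirico/visigoth"

func main() {
    // analyze + stemmer: the text processing pipeline
    pipeline := visigoth.NewTokenizationPipeline(
        visigoth.NewKeepAlphanumericTokenizer(),
        visigoth.NewSpanishStemmer(true),
    )
    // index: an in-memory inverted index is built per index name
    builder := visigoth.NewMemoryIndexBuilder(pipeline)
    // repos: a collection of named indexes, with aliasing
    repo := visigoth.NewIndexRepo(builder)
    // entities: DocRequest is the unit of ingestion
    repo.Put("docs", visigoth.NewDocRequest("doc-1", "some content"))
    // search: engines (HitsSearch, LinearSearch, Noop*) plug into Search
    _, _ = repo.Search("docs", "content", visigoth.HitsSearch)
}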

Search Algorithms

Visigoth provides two main search algorithms, both implementing AND logic (all query tokens must be present in matching documents):

HitsSearch vs LinearSearch

Feature                  HitsSearch                                   LinearSearch
Algorithm                Hit counting + threshold filtering           Set intersection
Time Complexity          O(T × D + R log R)                           O(T × D + I)
Space Complexity         O(R)                                         O(I)
Result Ordering          Relevance (hit count), then document order   Document index order
Best For                 Relevance ranking, scoring                   Boolean matching, performance
Multi-token Efficiency   Constant work per token                      Early termination possible

Where:

  • T = number of search tokens
  • D = average documents per token
  • R = number of matching documents
  • I = size of intersection sets
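
For intuition, plugging made-up workload numbers into these formulas (illustrative assumptions, not benchmarks) shows the two engines land in the same ballpark, with LinearSearch skipping the sort:

package main

import (
    "fmt"
    "math"
)

func main() {
    // Assume 3 query tokens, 10,000 documents per token on average,
    // 500 matching documents, and an intersection of comparable size.
    T, D, R, I := 3.0, 10000.0, 500.0, 500.0
    hits := T*D + R*math.Log2(R) // ≈ 34,483 unit operations
    linear := T*D + I            // = 30,500 unit operations
    fmt.Printf("HitsSearch ≈ %.0f ops, LinearSearch ≈ %.0f ops\n", hits, linear)
}
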
Algorithm Details

HitsSearch

// Uses hit counting to rank results by relevance
results := HitsSearch([]string{"programming", "tutorial"}, indexer)
// Returns documents sorted by relevance (most matching tokens first)

Process:

  1. Count hits (unique tokens) per document
  2. Filter documents with hits ≥ threshold (AND logic)
  3. Sort by hit count (relevance), then by document order (determinism)
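
A minimal, self-contained sketch of these three steps over a toy posting list (illustrative only; the real implementation works against the Indexer interface rather than a plain map):

package main

import (
    "fmt"
    "sort"
)

// hitsSearch counts unique query tokens per document, keeps the documents
// that contain all of them, and sorts by hit count, breaking ties by
// document order.
func hitsSearch(tokens []string, postings map[string][]int) []int {
    hits := make(map[int]int)
    for _, tok := range tokens {
        for _, doc := range postings[tok] {
            hits[doc]++ // each posting list holds a doc once, so this counts unique tokens
        }
    }
    var out []int
    for doc, n := range hits {
        if n >= len(tokens) { // AND logic: every token must be present
            out = append(out, doc)
        }
    }
    sort.Slice(out, func(i, j int) bool {
        if hits[out[i]] != hits[out[j]] {
            return hits[out[i]] > hits[out[j]] // most hits first
        }
        return out[i] < out[j] // document order keeps ties deterministic
    })
    return out
}

func main() {
    postings := map[string][]int{
        "java":        {0, 1, 2},
        "programming": {1, 2},
    }
    fmt.Println(hitsSearch([]string{"java", "programming"}, postings)) // [1 2]
}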

Best for:

  • ✅ Relevance ranking needed
  • ✅ User expects "best matches first"
  • ✅ Flexible scoring systems
  • ✅ Apps with ranked suggestions

LinearSearch

// Uses set intersection for exact boolean matching
results := LinearSearch([]string{"programming", "tutorial"}, indexer)
// Returns documents in document index order

Process:

  1. Get document sets for each token
  2. Compute intersection of all sets (AND logic)
  3. Return matches in document index order
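
A comparable sketch of the intersection approach (again a toy model, not the package's code): intersect the posting lists token by token, bailing out as soon as the running intersection is empty:

package main

import (
    "fmt"
    "sort"
)

// linearSearch intersects the posting lists of all tokens; a document
// survives only if every token's list contains it (AND logic).
func linearSearch(tokens []string, postings map[string][]int) []int {
    if len(tokens) == 0 {
        return nil
    }
    set := make(map[int]struct{})
    for _, doc := range postings[tokens[0]] {
        set[doc] = struct{}{}
    }
    for _, tok := range tokens[1:] {
        next := make(map[int]struct{})
        for _, doc := range postings[tok] {
            if _, ok := set[doc]; ok {
                next[doc] = struct{}{}
            }
        }
        set = next
        if len(set) == 0 {
            return nil // early termination: no document can match anymore
        }
    }
    out := make([]int, 0, len(set))
    for doc := range set {
        out = append(out, doc)
    }
    sort.Ints(out) // document index order
    return out
}

func main() {
    postings := map[string][]int{
        "java":        {0, 1, 2},
        "programming": {1, 2},
    }
    fmt.Println(linearSearch([]string{"java", "programming"}, postings)) // [1 2]
}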

Best for:

  • ✅ Pure boolean matching
  • ✅ Performance-critical applications
  • ✅ Large queries (many tokens)
  • ✅ Consistent ordering needed

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT

Documentation

Constants

This section is empty.

Variables

var (
	SpanishStopWords = StopWords{} /* 443 elements not displayed */
)

Functions

func HitsSearch

func HitsSearch(tokens []string, indexer Indexer) slices.Slice[SearchResult]

HitsSearch implements a hit-counting based search algorithm with AND logic.

Algorithm:

  1. For each search token, find all documents containing that token
  2. Count hits per document (the number of unique search tokens each document contains)
  3. Filter documents that have ALL tokens (hits >= number of search tokens)
  4. Sort results by hit count in descending order (most relevant first)
  5. Return results in deterministic order by preserving document index order for ties

Behavior:

  • Uses AND logic: only returns documents that contain ALL search tokens
  • Hit count = number of unique search tokens found in the document (not total occurrences)
  • Results are sorted by relevance (hit count), then by document order for determinism
  • For multi-token queries, only documents containing all tokens are returned
  • Time complexity: O(T × D + R log R), where T = tokens, D = avg docs per token, R = results
  • Space complexity: O(R), where R = number of matching documents

Differences from LinearSearch:

  • HitsSearch: uses hit counting with hash map lookup, then sorts by relevance
  • LinearSearch: uses set intersection with early termination, preserving document order
  • Both implement AND logic, but with different performance characteristics
  • HitsSearch is better for relevance ranking; LinearSearch for simple boolean matching

Example:

Query: "java programming"
Doc1: "java tutorial" (hits=1, excluded - doesn't have "programming")
Doc2: "java programming guide" (hits=2, included)
Doc3: "advanced java programming concepts" (hits=2, included)
Result: [Doc2, Doc3] (both have hits=2, ordered by document index)

Note: Hit counting is per unique token, not total occurrences:

"java java programming" with query "java programming" = 2 hits (not 3)

func LinearSearch

func LinearSearch(tokens []string, indexable Indexer) slices.Slice[SearchResult]

LinearSearch performs an intersection-based search across all query tokens.

Algorithm:

  1. For each token in the query, get the list of documents that contain it
  2. Find the intersection of all these document lists (documents that contain ALL tokens)
  3. Return only documents that contain every single token in the query

Key characteristics:

  • Uses AND logic: ALL tokens must be present in a document for it to match
  • Same semantics as HitsSearch, but a different algorithm (intersection vs. hit counting)
  • Results have Hits = len(tokens), since all tokens are guaranteed to be found
  • More efficient for queries with many tokens, due to early termination
  • Deterministic order based on document index order

Comparison with HitsSearch:

  • Both implement AND logic (only documents with ALL tokens are returned)
  • LinearSearch: uses set intersection operations (more efficient for large queries)
  • HitsSearch: uses hit counting with threshold filtering (more flexible for scoring)

Example:

Query: "programming java"
- Only returns documents that contain BOTH "programming" AND "java"
- A document with only "programming" will NOT be returned
- A document with only "java" will NOT be returned

func NoopAllSearch

func NoopAllSearch(tokens []string, indexable Indexer) slices.Slice[SearchResult]

NoopAllSearch returns all documents as results

func NoopZeroSearch

func NoopZeroSearch(tokens []string, indexable Indexer) slices.Slice[SearchResult]

NoopZeroSearch returns empty results

func R1R2RV

func R1R2RV(w []byte) (r1 []byte, r2 []byte, rv []byte)
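
R1R2RV computes the R1, R2, and RV regions of a word, the standard regions on which Snowball stemmers operate (the naming follows the Snowball definitions).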

Types

type AliasesResult

type AliasesResult struct {
	Aliases []AliasesResultRow
}

type AliasesResultRow

type AliasesResultRow struct {
	Alias   string
	Indices []string
}

type Builder

type Builder func(name string) Index

func NewMemoryIndexBuilder

func NewMemoryIndexBuilder(tokenizer tokenizer) Builder

type CleanTokenizer

type CleanTokenizer struct {
	// contains filtered or unexported fields
}

func NewCleanTokenizer

func NewCleanTokenizer(fns ...cleanFunc) CleanTokenizer

func NewKeepAlphanumericTokenizer

func NewKeepAlphanumericTokenizer() *CleanTokenizer

func (*CleanTokenizer) Tokenize

func (c *CleanTokenizer) Tokenize(text string) []string

type Doc

type Doc struct {
	Name    string `json:"id"`
	Content string `json:"raw"`
}

func NewDoc

func NewDoc(name, content string) Doc

func (Doc) Hash

func (d Doc) Hash() HashKey

func (Doc) ID

func (d Doc) ID() string

func (Doc) Raw

func (d Doc) Raw() string

type DocRequest

type DocRequest struct {
	Name    string
	Content string

	MimeType MimeType
	// contains filtered or unexported fields
}

func NewDocRequest

func NewDocRequest(name, content string) DocRequest

func NewDocRequestWith

func NewDocRequestWith(name, content, statement string) DocRequest

func NewDocRequestWithMime

func NewDocRequestWithMime(name, content string, mime MimeType) DocRequest

func (DocRequest) ID

func (d DocRequest) ID() string

func (DocRequest) Mime

func (d DocRequest) Mime() MimeType

func (DocRequest) Raw

func (d DocRequest) Raw() string

func (DocRequest) Statement

func (d DocRequest) Statement() string

type Engine

type Engine func(tokens []string, indexable Indexer) slices.Slice[SearchResult]

Engine defines the function signature for search functions
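
Because Engine is a plain function type, a custom engine can be passed anywhere a built-in one is accepted. A minimal sketch composing two built-ins (the import path of the slices package is inferred from the signatures here; verify it against the module's go.mod):

package engines

import (
	"github.com/sonirico/slices"
	"github.com/sonirico/visigoth"
)

// fallbackEngine returns no results for empty queries and otherwise defers
// to the built-in HitsSearch.
func fallbackEngine(tokens []string, idx visigoth.Indexer) slices.Slice[visigoth.SearchResult] {
	if len(tokens) == 0 {
		return visigoth.NoopZeroSearch(tokens, idx)
	}
	return visigoth.HitsSearch(tokens, idx)
}

// Compile-time check that fallbackEngine satisfies Engine.
var _ visigoth.Engine = fallbackEngine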

type EngineType

type EngineType byte

const (
	NoopZero EngineType = iota
	NoopAll
	Hits
	SmartsHits
	Linear
)

type Filter

type Filter interface {
	Filter(tokens []string) []string
}

type HashKey

type HashKey struct {
	// contains filtered or unexported fields
}

type Hashable

type Hashable interface {
	Hash() HashKey
}

type Index

type Index interface {
	Put(payload DocRequest) Index
	Search(terms string, engine Engine) slices.Slice[SearchResult]
}

type IndexRepo

type IndexRepo struct {
	// contains filtered or unexported fields
}

IndexRepo handles a collection of indexes

func NewIndexRepo

func NewIndexRepo(builder Builder) *IndexRepo

func (*IndexRepo) Alias

func (h *IndexRepo) Alias(alias string, index string) bool

func (*IndexRepo) Drop

func (h *IndexRepo) Drop(indexName string) bool

func (*IndexRepo) Has

func (h *IndexRepo) Has(name string) bool

func (*IndexRepo) HasAlias

func (h *IndexRepo) HasAlias(name string) bool

func (*IndexRepo) List

func (h *IndexRepo) List() []string

func (*IndexRepo) ListAliases

func (h *IndexRepo) ListAliases() AliasesResult

func (*IndexRepo) Put

func (h *IndexRepo) Put(indexName string, doc DocRequest)

func (*IndexRepo) Rename

func (h *IndexRepo) Rename(old string, new string) bool

Rename handles index renaming

func (*IndexRepo) Search

func (h *IndexRepo) Search(
	indexName string,
	terms string,
	engine Engine,
) (streams.ReadStream[SearchResult], error)

func (*IndexRepo) String

func (h *IndexRepo) String() string

func (*IndexRepo) UnAlias

func (h *IndexRepo) UnAlias(alias, index string) bool
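
Pulling the alias-related methods together, a sketch of index aliasing (that Search resolves aliases to their target index is an assumption based on the repos component's "aliasing" description):

package main

import (
	"fmt"

	"github.com/sonirico/visigoth"
)

func main() {
	pipeline := visigoth.NewTokenizationPipeline(visigoth.NewKeepAlphanumericTokenizer())
	repo := visigoth.NewIndexRepo(visigoth.NewMemoryIndexBuilder(pipeline))

	repo.Put("courses-v2", visigoth.NewDocRequest("go-course", "Curso de Go"))
	repo.Alias("courses", "courses-v2")   // point the stable name at the new index
	fmt.Println(repo.HasAlias("courses")) // true
	fmt.Println(repo.ListAliases())       // alias -> indices mapping
	repo.UnAlias("courses", "courses-v2") // detach when no longer needed
}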

type Indexer

type Indexer interface {
	Len() int
	Indexed(key string) []int
	Document(index int) Doc
}

type LowerCaseFilter

type LowerCaseFilter struct{}

func NewLowerCaseTokenizer

func NewLowerCaseTokenizer() LowerCaseFilter

func (LowerCaseFilter) Filter

func (l LowerCaseFilter) Filter(tokens []string) []string

type MemoryIndex

type MemoryIndex struct {
	Docs          []Doc            `json:"indexed"`
	InvertedIndex map[string][]int `json:"inverted"`
	// contains filtered or unexported fields
}

func NewMemoryIndex

func NewMemoryIndex(name string, tkr tokenizer) *MemoryIndex

func (*MemoryIndex) Document

func (mi *MemoryIndex) Document(index int) Doc

func (*MemoryIndex) Indexed

func (mi *MemoryIndex) Indexed(key string) []int

func (*MemoryIndex) Len

func (mi *MemoryIndex) Len() int

func (*MemoryIndex) Put

func (mi *MemoryIndex) Put(payload DocRequest) Index

func (*MemoryIndex) Search

func (mi *MemoryIndex) Search(payload string, engine Engine) slices.Slice[SearchResult]

func (*MemoryIndex) String

func (mi *MemoryIndex) String() string

type MimeType

type MimeType byte

const (
	MimeText MimeType = iota + 1
	MimeJSON
)

type Repo

type Repo interface {
	List() []string
	ListAliases() AliasesResult
	Has(name string) bool
	HasAlias(name string) bool
	Alias(alias string, in string) bool
	UnAlias(alias, index string) bool
	Put(in string, req DocRequest)
	Search(index string, terms string, engine Engine) (streams.ReadStream[SearchResult], error)
	Rename(old string, new string) bool
	Drop(in string) bool
}

type SearchResult

type SearchResult struct {
	Document Doc `json:"doc"`
	Hits     int `json:"hits"`
}

func (SearchResult) Doc

func (r SearchResult) Doc() Doc

Implementation of the Row interface

func (SearchResult) GetDoc

func (r SearchResult) GetDoc() Doc

func (SearchResult) GetHits

func (r SearchResult) GetHits() int

func (SearchResult) MarshalEasyJSON added in v0.3.0

func (v SearchResult) MarshalEasyJSON(w *jwriter.Writer)

MarshalEasyJSON supports easyjson.Marshaler interface

func (SearchResult) MarshalJSON added in v0.3.0

func (v SearchResult) MarshalJSON() ([]byte, error)

MarshalJSON supports json.Marshaler interface

func (*SearchResult) UnmarshalEasyJSON added in v0.3.0

func (v *SearchResult) UnmarshalEasyJSON(l *jlexer.Lexer)

UnmarshalEasyJSON supports easyjson.Unmarshaler interface

func (*SearchResult) UnmarshalJSON added in v0.3.0

func (v *SearchResult) UnmarshalJSON(data []byte) error

UnmarshalJSON supports json.Unmarshaler interface
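
Given the struct tags shown on SearchResult and Doc, a marshaled result should take the following shape (a sketch; encoding/json dispatches to the generated MarshalJSON):

package main

import (
	"encoding/json"
	"fmt"

	"github.com/sonirico/visigoth"
)

func main() {
	res := visigoth.SearchResult{
		Document: visigoth.NewDoc("go-course", "Curso de Go"),
		Hits:     2,
	}
	b, err := json.Marshal(res)
	if err != nil {
		panic(err)
	}
	// Expected per the json tags: {"doc":{"id":"go-course","raw":"Curso de Go"},"hits":2}
	fmt.Println(string(b))
}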

type SearchResults

type SearchResults []SearchResult

func (SearchResults) Len

func (r SearchResults) Len() int

sort.Interface implementation

func (SearchResults) Less

func (r SearchResults) Less(i, j int) bool

func (SearchResults) Swap

func (r SearchResults) Swap(i, j int)

type SpanishStemmerFilter

type SpanishStemmerFilter struct {
	// contains filtered or unexported fields
}

func NewSpanishStemmer

func NewSpanishStemmer(removeStopWords bool) SpanishStemmerFilter

func (SpanishStemmerFilter) Filter

func (s SpanishStemmerFilter) Filter(tokens []string) []string

type Stemmer

type Stemmer struct{}

func NewSnowball

func NewSnowball() *Stemmer

func (*Stemmer) Stem

func (s *Stemmer) Stem(w []byte) []byte
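
The stemmer can also be applied directly to raw bytes. A small sketch (NewSnowball is presumably the Spanish Snowball stemmer, given the package's language support):

package main

import (
	"fmt"

	"github.com/sonirico/visigoth"
)

func main() {
	s := visigoth.NewSnowball()
	// The exact stem depends on the Snowball rules for Spanish.
	fmt.Printf("%s\n", s.Stem([]byte("programación")))
}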

type StopWords

type StopWords map[string]struct{}

type StopWordsFilter

type StopWordsFilter struct {
	// contains filtered or unexported fields
}

func NewStopWordsFilter

func NewStopWordsFilter(sw StopWords) StopWordsFilter

func (StopWordsFilter) Filter

func (s StopWordsFilter) Filter(tokens []string) []string

type TokenizationPipeline

type TokenizationPipeline struct {
	// contains filtered or unexported fields
}

func NewTokenizationPipeline

func NewTokenizationPipeline(t Tokenizer, f ...Filter) *TokenizationPipeline

func (*TokenizationPipeline) Tokenize

func (p *TokenizationPipeline) Tokenize(text string) []string
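
A small end-to-end example of the pipeline from the README usage, run on its own (stop words such as "de" and "en" are dropped; the exact stems depend on the Snowball rules):

package main

import (
	"fmt"

	"github.com/sonirico/visigoth"
)

func main() {
	p := visigoth.NewTokenizationPipeline(
		visigoth.NewKeepAlphanumericTokenizer(),
		visigoth.NewLowerCaseTokenizer(),
		visigoth.NewStopWordsFilter(visigoth.SpanishStopWords),
		visigoth.NewSpanishStemmer(true),
	)
	fmt.Println(p.Tokenize("Curso de programación en Go"))
}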

type Tokenizer

type Tokenizer interface {
	Tokenize(text string) []string
}
