visigoth

package module
v0.3.0
Published: Jul 3, 2025 License: Apache-2.0 Imports: 14 Imported by: 0

README

Visigoth

A Go package for full-text search and indexing

Features

  • Text analysis with tokenization, filtering, and stemming
  • Memory-efficient inverted index
  • Multiple search algorithms (Linear, Hits-based, Noop)
  • Spanish language support with Snowball stemming
  • Repository pattern for data persistence

Installation

go get github.com/sonirico/visigoth

Usage

package main

import (
    "fmt"
    "log"

    "github.com/sonirico/visigoth"
)

type VisigothSearcher struct {
    repo visigoth.Repo
}

type SearchPayload struct {
    Index string
    Terms string
}

func (v *VisigothSearcher) Search(p SearchPayload) error {
    stream, err := v.repo.Search(p.Index, p.Terms, visigoth.HitsSearch)
    if err != nil {
        return err
    }
    
    // Process results from stream
    for result := range stream.Chan() {
        fmt.Printf("Document: %s, Hits: %d\n", 
            result.Doc().ID(), result.Hits)
    }
    
    return nil
}

func main() {
    // Create tokenization pipeline with Spanish support
    tokenizer := visigoth.NewKeepAlphanumericTokenizer()
    pipeline := visigoth.NewTokenizationPipeline(
        tokenizer,
        visigoth.NewLowerCaseTokenizer(),
        visigoth.NewStopWordsFilter(visigoth.SpanishStopWords),
        visigoth.NewSpanishStemmer(true),
    )
    
    // Create repository with memory index builder
    repo := visigoth.NewIndexRepo(visigoth.NewMemoryIndexBuilder(pipeline))
    
    // Create searcher
    searcher := &VisigothSearcher{repo: repo}
    
    // Index some documents
    repo.Put("courses", visigoth.NewDocRequest("java-course", "Curso de programación en Java"))
    repo.Put("courses", visigoth.NewDocRequest("go-course", "Curso de programación en Go"))
    repo.Put("courses", visigoth.NewDocRequest("python-course", "Curso de programación en Python"))
    
    // Search
    if err := searcher.Search(SearchPayload{
        Index: "courses",
        Terms: "programación java",
    }); err != nil {
        log.Fatal(err)
    }
}

Architecture

The package is organized into focused components:

  • analyze - Text processing pipeline
  • index - Document indexing and storage
  • search - Query processing and ranking
  • stemmer - Language-specific text normalization
  • repos - Data persistence layer with aliasing
  • loaders - Document loading utilities
  • entities - Core data structures
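
The components are exposed through the flat visigoth package used in the Usage example above. A quick sketch mapping the main constructors to the components they come from (the mapping is inferred from the component names; it is illustrative, not a statement of the internal package layout):

package main

import "github.com/sonirico/visigoth"

func main() {
    // analyze + stemmer: the text processing pipeline
    pipeline := visigoth.NewTokenizationPipeline(
        visigoth.NewKeepAlphanumericTokenizer(),
        visigoth.NewSpanishStemmer(true),
    )
    // index: an in-memory inverted index is built per index name
    builder := visigoth.NewMemoryIndexBuilder(pipeline)
    // repos: a collection of named indexes, with aliasing
    repo := visigoth.NewIndexRepo(builder)
    // entities: DocRequest is the unit of ingestion
    repo.Put("docs", visigoth.NewDocRequest("doc-1", "some content"))
    // search: engines (HitsSearch, LinearSearch, Noop*) plug into Search
    _, _ = repo.Search("docs", "content", visigoth.HitsSearch)
}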

Search Algorithms

Visigoth provides two main search algorithms, both implementing AND logic (all query tokens must be present in matching documents):

HitsSearch vs LinearSearch

Feature                  HitsSearch                                   LinearSearch
Algorithm                Hit counting + threshold filtering           Set intersection
Time Complexity          O(T × D + R log R)                           O(T × D + I)
Space Complexity         O(R)                                         O(I)
Result Ordering          Relevance (hit count), then document order   Document index order
Best For                 Relevance ranking, scoring                   Boolean matching, performance
Multi-token Efficiency   Constant work per token                      Early termination possible

Where:

  • T = number of search tokens
  • D = average documents per token
  • R = number of matching documents
  • I = size of intersection sets
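
For intuition, plugging made-up workload numbers into these formulas (illustrative assumptions, not benchmarks) shows the two engines land in the same ballpark, with LinearSearch skipping the sort:

package main

import (
    "fmt"
    "math"
)

func main() {
    // Assume 3 query tokens, 10,000 documents per token on average,
    // 500 matching documents, and an intersection of comparable size.
    T, D, R, I := 3.0, 10000.0, 500.0, 500.0
    hits := T*D + R*math.Log2(R) // ≈ 34,483 unit operations
    linear := T*D + I            // = 30,500 unit operations
    fmt.Printf("HitsSearch ≈ %.0f ops, LinearSearch ≈ %.0f ops\n", hits, linear)
}
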
Algorithm Details

HitsSearch

// Uses hit counting to rank results by relevance
results := HitsSearch([]string{"programming", "tutorial"}, indexer)
// Returns documents sorted by relevance (most matching tokens first)

Process:

  1. Count hits (unique tokens) per document
  2. Filter documents with hits ≥ threshold (AND logic)
  3. Sort by hit count (relevance), then by document order (determinism)
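
A minimal, self-contained sketch of these three steps over a toy posting list (illustrative only; the real implementation works against the Indexer interface rather than a plain map):

package main

import (
    "fmt"
    "sort"
)

// hitsSearch counts unique query tokens per document, keeps the documents
// that contain all of them, and sorts by hit count, breaking ties by
// document order.
func hitsSearch(tokens []string, postings map[string][]int) []int {
    hits := make(map[int]int)
    for _, tok := range tokens {
        for _, doc := range postings[tok] {
            hits[doc]++ // each posting list holds a doc once, so this counts unique tokens
        }
    }
    var out []int
    for doc, n := range hits {
        if n >= len(tokens) { // AND logic: every token must be present
            out = append(out, doc)
        }
    }
    sort.Slice(out, func(i, j int) bool {
        if hits[out[i]] != hits[out[j]] {
            return hits[out[i]] > hits[out[j]] // most hits first
        }
        return out[i] < out[j] // document order keeps ties deterministic
    })
    return out
}

func main() {
    postings := map[string][]int{
        "java":        {0, 1, 2},
        "programming": {1, 2},
    }
    fmt.Println(hitsSearch([]string{"java", "programming"}, postings)) // [1 2]
}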

Best for:

  • ✅ Relevance ranking needed
  • ✅ User expects "best matches first"
  • ✅ Flexible scoring systems
  • ✅ Apps with ranked suggestions

LinearSearch

// Uses set intersection for exact boolean matching
results := LinearSearch([]string{"programming", "tutorial"}, indexer)
// Returns documents in document index order

Process:

  1. Get document sets for each token
  2. Compute intersection of all sets (AND logic)
  3. Return matches in document index order
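
A comparable sketch of the intersection approach (again a toy model, not the package's code): intersect the posting lists token by token, bailing out as soon as the running intersection is empty:

package main

import (
    "fmt"
    "sort"
)

// linearSearch intersects the posting lists of all tokens; a document
// survives only if every token's list contains it (AND logic).
func linearSearch(tokens []string, postings map[string][]int) []int {
    if len(tokens) == 0 {
        return nil
    }
    set := make(map[int]struct{})
    for _, doc := range postings[tokens[0]] {
        set[doc] = struct{}{}
    }
    for _, tok := range tokens[1:] {
        next := make(map[int]struct{})
        for _, doc := range postings[tok] {
            if _, ok := set[doc]; ok {
                next[doc] = struct{}{}
            }
        }
        set = next
        if len(set) == 0 {
            return nil // early termination: no document can match anymore
        }
    }
    out := make([]int, 0, len(set))
    for doc := range set {
        out = append(out, doc)
    }
    sort.Ints(out) // document index order
    return out
}

func main() {
    postings := map[string][]int{
        "java":        {0, 1, 2},
        "programming": {1, 2},
    }
    fmt.Println(linearSearch([]string{"java", "programming"}, postings)) // [1 2]
}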

Best for:

  • ✅ Pure boolean matching
  • ✅ Performance-critical applications
  • ✅ Large queries (many tokens)
  • ✅ Consistent ordering needed

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT

Documentation

Constants

This section is empty.

Variables

var (
	SpanishStopWords = StopWords{} /* 443 elements not displayed */
)

Functions

func HitsSearch

func HitsSearch(tokens []string, indexer Indexer) slices.Slice[SearchResult]

HitsSearch implements a hit-counting based search algorithm with AND logic.

Algorithm:

  1. For each search token, find all documents containing that token
  2. Count hits per document (the number of unique search tokens each document contains)
  3. Filter documents that have ALL tokens (hits >= number of search tokens)
  4. Sort results by hit count in descending order (most relevant first)
  5. Return results in deterministic order by preserving document index order for ties

Behavior:

  • Uses AND logic: only returns documents that contain ALL search tokens
  • Hit count = number of unique search tokens found in the document (not total occurrences)
  • Results are sorted by relevance (hit count), then by document order for determinism
  • For multi-token queries, only documents containing all tokens are returned
  • Time complexity: O(T × D + R log R), where T = tokens, D = avg docs per token, R = results
  • Space complexity: O(R), where R = number of matching documents

Differences from LinearSearch:

  • HitsSearch: uses hit counting with hash map lookup, then sorts by relevance
  • LinearSearch: uses set intersection with early termination, preserving document order
  • Both implement AND logic, but with different performance characteristics
  • HitsSearch is better for relevance ranking; LinearSearch for simple boolean matching

Example:

Query: "java programming"
Doc1: "java tutorial" (hits=1, excluded - doesn't have "programming")
Doc2: "java programming guide" (hits=2, included)
Doc3: "advanced java programming concepts" (hits=2, included)
Result: [Doc2, Doc3] (both have hits=2, ordered by document index)

Note: Hit counting is per unique token, not total occurrences:

"java java programming" with query "java programming" = 2 hits (not 3)

func LinearSearch

func LinearSearch(tokens []string, indexable Indexer) slices.Slice[SearchResult]

LinearSearch performs an intersection-based search across all query tokens.

Algorithm:

  1. For each token in the query, get the list of documents that contain it
  2. Find the intersection of all these document lists (documents that contain ALL tokens)
  3. Return only documents that contain every single token in the query

Key characteristics:

  • Uses AND logic: ALL tokens must be present in a document for it to match
  • Same semantics as HitsSearch, but a different algorithm (intersection vs. hit counting)
  • Results have Hits = len(tokens), since all tokens are guaranteed to be found
  • More efficient for queries with many tokens, due to early termination
  • Deterministic order based on document index order

Comparison with HitsSearch:

  • Both implement AND logic (only documents with ALL tokens are returned)
  • LinearSearch: uses set intersection operations (more efficient for large queries)
  • HitsSearch: uses hit counting with threshold filtering (more flexible for scoring)

Example:

Query: "programming java"
- Only returns documents that contain BOTH "programming" AND "java"
- A document with only "programming" will NOT be returned
- A document with only "java" will NOT be returned

func NoopAllSearch

func NoopAllSearch(tokens []string, indexable Indexer) slices.Slice[SearchResult]

NoopAllSearch returns all documents as results

func NoopZeroSearch

func NoopZeroSearch(tokens []string, indexable Indexer) slices.Slice[SearchResult]

NoopZeroSearch returns empty results

func R1R2RV

func R1R2RV(w []byte) (r1 []byte, r2 []byte, rv []byte)
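
R1R2RV computes the R1, R2, and RV regions of a word, the standard regions on which Snowball stemmers operate (the naming follows the Snowball definitions).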

Types

type AliasesResult

type AliasesResult struct {
	Aliases []AliasesResultRow
}

type AliasesResultRow

type AliasesResultRow struct {
	Alias   string
	Indices []string
}

type Builder

type Builder func(name string) Index

func NewMemoryIndexBuilder

func NewMemoryIndexBuilder(tokenizer tokenizer) Builder

type CleanTokenizer

type CleanTokenizer struct {
	// contains filtered or unexported fields
}

func NewCleanTokenizer

func NewCleanTokenizer(fns ...cleanFunc) CleanTokenizer

func NewKeepAlphanumericTokenizer

func NewKeepAlphanumericTokenizer() *CleanTokenizer

func (*CleanTokenizer) Tokenize

func (c *CleanTokenizer) Tokenize(text string) []string

type Doc

type Doc struct {
	Name    string `json:"id"`
	Content string `json:"raw"`
}

func NewDoc

func NewDoc(name, content string) Doc

func (Doc) Hash

func (d Doc) Hash() HashKey

func (Doc) ID

func (d Doc) ID() string

func (Doc) Raw

func (d Doc) Raw() string

type DocRequest

type DocRequest struct {
	Name    string
	Content string

	MimeType MimeType
	// contains filtered or unexported fields
}

func NewDocRequest

func NewDocRequest(name, content string) DocRequest

func NewDocRequestWith

func NewDocRequestWith(name, content, statement string) DocRequest

func NewDocRequestWithMime

func NewDocRequestWithMime(name, content string, mime MimeType) DocRequest

func (DocRequest) ID

func (d DocRequest) ID() string

func (DocRequest) Mime

func (d DocRequest) Mime() MimeType

func (DocRequest) Raw

func (d DocRequest) Raw() string

func (DocRequest) Statement

func (d DocRequest) Statement() string

type Engine

type Engine func(tokens []string, indexable Indexer) slices.Slice[SearchResult]

Engine defines the function signature for search functions
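
Because Engine is a plain function type, a custom engine can be passed anywhere a built-in one is accepted. A minimal sketch composing two built-ins (the import path of the slices package is inferred from the signatures here; verify it against the module's go.mod):

package engines

import (
	"github.com/sonirico/slices"
	"github.com/sonirico/visigoth"
)

// fallbackEngine returns no results for empty queries and otherwise defers
// to the built-in HitsSearch.
func fallbackEngine(tokens []string, idx visigoth.Indexer) slices.Slice[visigoth.SearchResult] {
	if len(tokens) == 0 {
		return visigoth.NoopZeroSearch(tokens, idx)
	}
	return visigoth.HitsSearch(tokens, idx)
}

// Compile-time check that fallbackEngine satisfies Engine.
var _ visigoth.Engine = fallbackEngine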

type EngineType

type EngineType byte

const (
	NoopZero EngineType = iota
	NoopAll
	Hits
	SmartsHits
	Linear
)

type Filter

type Filter interface {
	Filter(tokens []string) []string
}

type HashKey

type HashKey struct {
	// contains filtered or unexported fields
}

type Hashable

type Hashable interface {
	Hash() HashKey
}

type Index

type Index interface {
	Put(payload DocRequest) Index
	Search(terms string, engine Engine) slices.Slice[SearchResult]
}

type IndexRepo

type IndexRepo struct {
	// contains filtered or unexported fields
}

IndexRepo handles a collection of indexes

func NewIndexRepo

func NewIndexRepo(builder Builder) *IndexRepo

func (*IndexRepo) Alias

func (h *IndexRepo) Alias(alias string, index string) bool

func (*IndexRepo) Drop

func (h *IndexRepo) Drop(indexName string) bool

func (*IndexRepo) Has

func (h *IndexRepo) Has(name string) bool

func (*IndexRepo) HasAlias

func (h *IndexRepo) HasAlias(name string) bool

func (*IndexRepo) List

func (h *IndexRepo) List() []string

func (*IndexRepo) ListAliases

func (h *IndexRepo) ListAliases() AliasesResult

func (*IndexRepo) Put

func (h *IndexRepo) Put(indexName string, doc DocRequest)

func (*IndexRepo) Rename

func (h *IndexRepo) Rename(old string, new string) bool

Rename handles index renaming

func (*IndexRepo) Search

func (h *IndexRepo) Search(
	indexName string,
	terms string,
	engine Engine,
) (streams.ReadStream[SearchResult], error)

func (*IndexRepo) String

func (h *IndexRepo) String() string

func (*IndexRepo) UnAlias

func (h *IndexRepo) UnAlias(alias, index string) bool
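
Pulling the alias-related methods together, a sketch of index aliasing (that Search resolves aliases to their target index is an assumption based on the repos component's "aliasing" description):

package main

import (
	"fmt"

	"github.com/sonirico/visigoth"
)

func main() {
	pipeline := visigoth.NewTokenizationPipeline(visigoth.NewKeepAlphanumericTokenizer())
	repo := visigoth.NewIndexRepo(visigoth.NewMemoryIndexBuilder(pipeline))

	repo.Put("courses-v2", visigoth.NewDocRequest("go-course", "Curso de Go"))
	repo.Alias("courses", "courses-v2")   // point the stable name at the new index
	fmt.Println(repo.HasAlias("courses")) // true
	fmt.Println(repo.ListAliases())       // alias -> indices mapping
	repo.UnAlias("courses", "courses-v2") // detach when no longer needed
}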

type Indexer

type Indexer interface {
	Len() int
	Indexed(key string) []int
	Document(index int) Doc
}

type LowerCaseFilter

type LowerCaseFilter struct{}

func NewLowerCaseTokenizer

func NewLowerCaseTokenizer() LowerCaseFilter

func (LowerCaseFilter) Filter

func (l LowerCaseFilter) Filter(tokens []string) []string

type MemoryIndex

type MemoryIndex struct {
	Docs          []Doc            `json:"indexed"`
	InvertedIndex map[string][]int `json:"inverted"`
	// contains filtered or unexported fields
}

func NewMemoryIndex

func NewMemoryIndex(name string, tkr tokenizer) *MemoryIndex

func (*MemoryIndex) Document

func (mi *MemoryIndex) Document(index int) Doc

func (*MemoryIndex) Indexed

func (mi *MemoryIndex) Indexed(key string) []int

func (*MemoryIndex) Len

func (mi *MemoryIndex) Len() int

func (*MemoryIndex) Put

func (mi *MemoryIndex) Put(payload DocRequest) Index

func (*MemoryIndex) Search

func (mi *MemoryIndex) Search(payload string, engine Engine) slices.Slice[SearchResult]

func (*MemoryIndex) String

func (mi *MemoryIndex) String() string

type MimeType

type MimeType byte

const (
	MimeText MimeType = iota + 1
	MimeJSON
)

type Repo

type Repo interface {
	List() []string
	ListAliases() AliasesResult
	Has(name string) bool
	HasAlias(name string) bool
	Alias(alias string, in string) bool
	UnAlias(alias, index string) bool
	Put(in string, req DocRequest)
	Search(index string, terms string, engine Engine) (streams.ReadStream[SearchResult], error)
	Rename(old string, new string) bool
	Drop(in string) bool
}

type SearchResult

type SearchResult struct {
	Document Doc `json:"doc"`
	Hits     int `json:"hits"`
}

func (SearchResult) Doc

func (r SearchResult) Doc() Doc

Implementation of the Row interface

func (SearchResult) GetDoc

func (r SearchResult) GetDoc() Doc

func (SearchResult) GetHits

func (r SearchResult) GetHits() int

func (SearchResult) MarshalEasyJSON added in v0.3.0

func (v SearchResult) MarshalEasyJSON(w *jwriter.Writer)

MarshalEasyJSON supports easyjson.Marshaler interface

func (SearchResult) MarshalJSON added in v0.3.0

func (v SearchResult) MarshalJSON() ([]byte, error)

MarshalJSON supports json.Marshaler interface

func (*SearchResult) UnmarshalEasyJSON added in v0.3.0

func (v *SearchResult) UnmarshalEasyJSON(l *jlexer.Lexer)

UnmarshalEasyJSON supports easyjson.Unmarshaler interface

func (*SearchResult) UnmarshalJSON added in v0.3.0

func (v *SearchResult) UnmarshalJSON(data []byte) error

UnmarshalJSON supports json.Unmarshaler interface
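
Given the struct tags shown on SearchResult and Doc, a marshaled result should take the following shape (a sketch; encoding/json dispatches to the generated MarshalJSON):

package main

import (
	"encoding/json"
	"fmt"

	"github.com/sonirico/visigoth"
)

func main() {
	res := visigoth.SearchResult{
		Document: visigoth.NewDoc("go-course", "Curso de Go"),
		Hits:     2,
	}
	b, err := json.Marshal(res)
	if err != nil {
		panic(err)
	}
	// Expected per the json tags: {"doc":{"id":"go-course","raw":"Curso de Go"},"hits":2}
	fmt.Println(string(b))
}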

type SearchResults

type SearchResults []SearchResult

func (SearchResults) Len

func (r SearchResults) Len() int

sort.Interface implementation

func (SearchResults) Less

func (r SearchResults) Less(i, j int) bool

func (SearchResults) Swap

func (r SearchResults) Swap(i, j int)

type SpanishStemmerFilter

type SpanishStemmerFilter struct {
	// contains filtered or unexported fields
}

func NewSpanishStemmer

func NewSpanishStemmer(removeStopWords bool) SpanishStemmerFilter

func (SpanishStemmerFilter) Filter

func (s SpanishStemmerFilter) Filter(tokens []string) []string

type Stemmer

type Stemmer struct{}

func NewSnowball

func NewSnowball() *Stemmer

func (*Stemmer) Stem

func (s *Stemmer) Stem(w []byte) []byte
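
The stemmer can also be applied directly to raw bytes. A small sketch (NewSnowball is presumably the Spanish Snowball stemmer, given the package's language support):

package main

import (
	"fmt"

	"github.com/sonirico/visigoth"
)

func main() {
	s := visigoth.NewSnowball()
	// The exact stem depends on the Snowball rules for Spanish.
	fmt.Printf("%s\n", s.Stem([]byte("programación")))
}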

type StopWords

type StopWords map[string]struct{}

type StopWordsFilter

type StopWordsFilter struct {
	// contains filtered or unexported fields
}

func NewStopWordsFilter

func NewStopWordsFilter(sw StopWords) StopWordsFilter

func (StopWordsFilter) Filter

func (s StopWordsFilter) Filter(tokens []string) []string

type TokenizationPipeline

type TokenizationPipeline struct {
	// contains filtered or unexported fields
}

func NewTokenizationPipeline

func NewTokenizationPipeline(t Tokenizer, f ...Filter) *TokenizationPipeline

func (*TokenizationPipeline) Tokenize

func (p *TokenizationPipeline) Tokenize(text string) []string
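
A small end-to-end example of the pipeline from the README usage, run on its own (stop words such as "de" and "en" are dropped; the exact stems depend on the Snowball rules):

package main

import (
	"fmt"

	"github.com/sonirico/visigoth"
)

func main() {
	p := visigoth.NewTokenizationPipeline(
		visigoth.NewKeepAlphanumericTokenizer(),
		visigoth.NewLowerCaseTokenizer(),
		visigoth.NewStopWordsFilter(visigoth.SpanishStopWords),
		visigoth.NewSpanishStemmer(true),
	)
	fmt.Println(p.Tokenize("Curso de programación en Go"))
}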

type Tokenizer

type Tokenizer interface {
	Tokenize(text string) []string
}
