bm25f

v0.4.0
Published: Jun 8, 2026 License: MIT Imports: 5 Imported by: 0

README

BM25F

A Go implementation of the BM25F ranking algorithm.

Features

  • BM25 algorithm base:
    • Score calculates how closely a document matches a search query.
    • Free parameter b adjusts document length normalization.
    • Free parameter k1 adjusts the contribution of high-frequency terms.
  • BM25F algorithm:
    • Documents are composed of zero or more fields.
    • Each field has a weighted contribution to the search score.
  • Free parameters k1 and b are customizable but have sane defaults.
  • Metadata can be attached to documents and is returned with search results.
  • Corpus and BM25F can be serialized to and from JSON.

Limitations

Tokenizing

This library does not include a way to tokenize documents and queries.

For tokenizing, I recommend starting with UAX #29. A good implementation of that is clipperhouse/uax29.

Sorting and pruning non-matches

This library does not include a way to sort results or prune non-matches.

An example of sorting and pruning results is in examples/search. However, some applications may require different sorting rules.

Quick start

go get "github.com/computerghost/bm25f"

Create the corpus:

corpus := bm25f.NewCorpus()
corpus.Upsert("hello.md", bm25f.NewDocument(
    bm25f.WithField("title", []string{"Hello"}),
    bm25f.WithField("body", []string{"hello", "world"}),
    bm25f.WithMetadata("title", "Hello"),
))
corpus.Upsert("nature.md", bm25f.NewDocument(
    bm25f.WithField("title", []string{"Nature"}),
    bm25f.WithField("body", []string{"blue", "world"}),
    bm25f.WithMetadata("title", "Nature"),
))

Create the BM25F algorithm:

bm := bm25f.New()
bm.SetWeight("title", 2.0)
bm.SetWeight("body", 1.0)

Tip: Both the corpus and the ranker can be serialized to and from JSON. Use this for easy saving and loading.

Now search:

query := []string{"world"}
scores := bm.Score(corpus, query)

// Remove non-matches.
scores = slices.DeleteFunc(scores, func(r bm25f.Result) bool {
    return r.Score == 0
})

// Sort the remaining results by score then ID
slices.SortFunc(scores, func(a, b bm25f.Result) int {
    if c := cmp.Compare(b.Score, a.Score); c != 0 {
        return c
    }
    return cmp.Compare(a.ID, b.ID)
})

fmt.Println("Results:")
for i, result := range scores {
    // Metadata returns (value, ok), so unpack it before printing.
    title, _ := result.Metadata("title")
    fmt.Printf("  #%d: %s: %s\n", i+1, result.ID, title)
}

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type BM25F

type BM25F struct {
	// contains filtered or unexported fields
}

func New added in v0.3.0

func New() *BM25F

func (*BM25F) MarshalJSON added in v0.3.0

func (bm *BM25F) MarshalJSON() ([]byte, error)

func (*BM25F) Score

func (bm *BM25F) Score(corpus *Corpus, query []string) []Result

Score calculates how well each document matches the query. The results include every document and are unsorted. Removing non-matches and sorting the results is left to the caller; examples/search shows one way to do it.

func (*BM25F) SetB

func (bm *BM25F) SetB(field string, b float64) error

SetB sets the `b` parameter of the BM25F algorithm for the given field. It controls the strength of field length normalization. With a value of 0, field lengths are not taken into consideration. With a value of 1, field lengths are fully normalized. For most corpora, a value between 0.5 and 0.8 is good. The default is 0.72.

An error is returned if the value is less than 0 or greater than 1.

func (*BM25F) SetK1

func (bm *BM25F) SetK1(k1 float64) error

SetK1 sets the `k1` parameter of the BM25F algorithm. It controls the impact of frequent terms on the scores. With lower values, frequent terms affect the score less. With higher values, frequent terms affect the score more. For most corpora, a value between 1.2 and 2 is good. The default is 1.2.

An error is returned if the value is less than or equal to 0.

func (*BM25F) SetWeight

func (bm *BM25F) SetWeight(field string, weight float64) error

SetWeight sets the relative weight of the field. The field with the bulk of the content should be 1.0. The default is 1.0.

An error is returned if the value is less than 0.
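
The role of field weights can be sketched with the commonly published BM25F formulation, in which each field's term frequency is length-normalized, scaled by its weight, and summed before saturation is applied. This is an assumption about the general technique, not a description of this package's internals; the type fieldStat and the function bm25fTermScore are hypothetical names.

```go
package main

import "fmt"

// fieldStat holds what one field contributes for a single query term.
// (Hypothetical type, for illustration only.)
type fieldStat struct {
	weight float64 // relative field weight
	b      float64 // length normalization strength for this field
	tf     float64 // occurrences of the term in this field
	length float64 // length of this field in terms
	avgLen float64 // average length of this field across the corpus
}

// bm25fTermScore combines per-field statistics in the textbook BM25F way:
//   pseudoTF = sum over fields of weight * tf / (1 - b + b*length/avgLen)
//   score    = idf * pseudoTF / (k1 + pseudoTF)
func bm25fTermScore(fields []fieldStat, idf, k1 float64) float64 {
	var pseudoTF float64
	for _, f := range fields {
		if f.tf == 0 || f.avgLen == 0 {
			continue // the field does not contain the term
		}
		norm := 1 - f.b + f.b*f.length/f.avgLen
		pseudoTF += f.weight * f.tf / norm
	}
	if pseudoTF == 0 {
		return 0
	}
	return idf * pseudoTF / (k1 + pseudoTF)
}

func main() {
	// One hit in a title weighted 2.0 versus one hit in a body weighted 1.0,
	// both fields at their average length.
	titleHit := []fieldStat{{weight: 2.0, b: 0.72, tf: 1, length: 1, avgLen: 1}}
	bodyHit := []fieldStat{{weight: 1.0, b: 0.72, tf: 1, length: 2, avgLen: 2}}
	fmt.Printf("title hit: %.3f\n", bm25fTermScore(titleHit, 1, 1.2))
	fmt.Printf("body hit:  %.3f\n", bm25fTermScore(bodyHit, 1, 1.2))
}
```

Because the weight is applied before saturation, doubling a field's weight behaves like doubling the term's frequency in that field, so it raises the score but does not double it.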

func (*BM25F) UnmarshalJSON added in v0.3.0

func (bm *BM25F) UnmarshalJSON(data []byte) error

type Corpus

type Corpus struct {
	// contains filtered or unexported fields
}

func NewCorpus added in v0.3.0

func NewCorpus() *Corpus

NewCorpus creates an empty Corpus.

func (*Corpus) Documents

func (c *Corpus) Documents() map[string]*Document

Documents returns a map from document id to Document.

func (*Corpus) Len added in v0.3.0

func (c *Corpus) Len() int

Len returns the number of documents in the corpus.

func (*Corpus) MarshalJSON added in v0.3.0

func (c *Corpus) MarshalJSON() ([]byte, error)

func (*Corpus) Remove

func (c *Corpus) Remove(id string)

Remove removes all data associated with a document.

func (*Corpus) UnmarshalJSON added in v0.3.0

func (c *Corpus) UnmarshalJSON(data []byte) error

func (*Corpus) Upsert

func (c *Corpus) Upsert(id string, document *Document)

Upsert processes and adds a document into the corpus.

The document must not be changed after passing it to this function.

type Document

type Document struct {
	// contains filtered or unexported fields
}

Document is a searchable entity in the corpus. It can have multiple independently configured fields that contribute to its search ranking.

func NewDocument added in v0.3.0

func NewDocument(opts ...DocumentOption) *Document

func (*Document) Count added in v0.3.0

func (d *Document) Count(field, term string) int

Count returns the number of times a term appears in a field.

func (*Document) FieldLen added in v0.3.0

func (d *Document) FieldLen(field string) int

FieldLen returns the length of a field (in terms).

func (*Document) FieldNames added in v0.3.0

func (d *Document) FieldNames() []string

FieldNames returns the names of all document fields in lexicographic order.

func (*Document) MarshalJSON added in v0.3.0

func (d *Document) MarshalJSON() ([]byte, error)

func (*Document) Metadata added in v0.3.0

func (d *Document) Metadata(name string) (string, bool)

Metadata gets a metadata entry associated with the document. It is the same value previously passed to SetMetadata.

func (*Document) SetField added in v0.3.0

func (d *Document) SetField(name string, tokens []string)

SetField sets a document field to represent the given tokens.

func (*Document) SetMetadata added in v0.3.0

func (d *Document) SetMetadata(name string, text string)

SetMetadata sets data that is not parsed or used by BM25F, but it is included in results from BM25F.Score.

func (*Document) UnmarshalJSON added in v0.3.0

func (d *Document) UnmarshalJSON(data []byte) error

type DocumentOption added in v0.3.0

type DocumentOption func(d *Document)

func WithField added in v0.3.0

func WithField(name string, tokens []string) DocumentOption

func WithMetadata added in v0.3.0

func WithMetadata(name string, text string) DocumentOption

type Field added in v0.3.0

type Field struct {
	// contains filtered or unexported fields
}

Field is a part of a document, such as the title, byline, or body.

Use Document.SetField to set a field value for a document.

func (*Field) MarshalJSON added in v0.3.0

func (f *Field) MarshalJSON() ([]byte, error)

func (*Field) UnmarshalJSON added in v0.3.0

func (f *Field) UnmarshalJSON(data []byte) error

type Result

type Result struct {
	ID string

	// Score indicates how well the document matches the query.
	//
	// A value of 0 indicates no match.
	// Other values are meaningless except in comparison to other results:
	// a higher value indicates a better match.
	Score float64
	// contains filtered or unexported fields
}

func (*Result) Metadata added in v0.3.0

func (r *Result) Metadata(name string) (string, bool)

Metadata returns the metadata associated with the document in the result.

Directories

Path Synopsis
examples
search command
