storage

package
v1.2.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 21, 2026 License: AGPL-3.0 Imports: 14 Imported by: 0

README

Layout

┌─────────────────────────────────────────────────────┐
│                      HEADER                         │
│  magic (8B) | version (2B) | padding (6B)           │
│  total_docs (8B) | field_count (8B)                 │
│  total_posting_lists (8B)                           │
│  total_token_frequencies (8B)                       │
├─────────────────────────────────────────────────────┤
│                  DOC ID TABLE                       │
│  [size (8B) | data (48B)] × total_docs              │
│  (fixed 56B stride; sorted alphabetically;          │
│   position = internal ID; mapped zero-copy)         │
├─────────────────────────────────────────────────────┤
│                 FIELD BLOCKS                        │
│  ┌───────────────────────────────────────────────┐  │
│  │  hash (8B) | avgdl (8B f64)                   │  │
│  │  token_count (8B) | doc_length_count (8B)     │  │
│  ├───────────────────────────────────────────────┤  │
│  │           DOC LENGTH ENTRIES                  │  │
│  │  [doc_index (8B) | length (8B)]               │  │
│  │  × doc_length_count                           │  │
│  │  (sorted by doc_index ascending)              │  │
│  ├───────────────────────────────────────────────┤  │
│  │             TOKEN ENTRIES                     │  │
│  │  [frequency_count (8B) |                      │  │
│  │   posting_list_index (8B) |                   │  │
│  │   frequencies_index (8B) |                    │  │
│  │   value_size (8B) | value_data (48B)]         │  │
│  │  × token_count                                │  │
│  │  (fixed 80B stride; sorted alphabetically)    │  │
│  └───────────────────────────────────────────────┘  │
│  ...repeated for each field...                      │
├─────────────────────────────────────────────────────┤
│              POSTING LISTS REGION                   │
│  [bitmap_size (8B) | roaring bitmap bytes]          │
│  × total_posting_lists                              │
│  (indexed by posting_list_index)                    │
├─────────────────────────────────────────────────────┤
│           TOKEN FREQUENCIES REGION                  │
│  [doc_index (8B) | frequency (8B)]                  │
│  × total_token_frequencies                          │
│  (indexed by frequencies_index + doc_freq as count) │
└─────────────────────────────────────────────────────┘

Invariants

  • Doc IDs and token values are stored as RawValue: an 8-byte length plus a fixed MaxRawValueSize-byte (currently 128) inline buffer. Values longer than the cap are truncated to it. The fixed stride is what allows the doc ID table and each field's token table to be mapped directly over the file as native Go slices (unsafe.Slice) with zero allocation and zero deserialization.
  • Doc IDs are sorted alphabetically. A document's position in the table is its internal sequential ID used in posting lists and TF entries.
  • Doc length entries within each field block are sorted by doc_index ascending. This enables a merge scan during BM25 scoring instead of binary search.
  • Token entries within each field block are sorted alphabetically by token bytes, so they are binary-searched in place at query time — no btree is built at load. (A btree is used only as a transient accumulator during BuildFrom.)
  • TF entries for a given token occupy a contiguous slice: TokenFrequencies[FrequenciesIndex : FrequenciesIndex+FrequencyCount]. The writer must maintain this invariant.
  • Posting lists and TF entries within a field are written in the same alphabetical token order, enabling sequential page access during sorted query processing.

Storage.Size contract

Storage.Size is set by both BuildFrom and LoadBytes:

  • After BuildFrom: exact byte count that SaveTo will write. Use it to pre-allocate via Truncate + mmap before calling Save.
  • After LoadBytes: bytes consumed from the source buffer. Supports multiple storages packed into one buffer — load the second at src[s.Size:].

Canonical write path

var s Storage
s.BuildFrom(docs...)   // computes s.Size

err := s.SaveTo("index.bin") // Truncate + mmap + msync, atomic on success

Or manually with a pre-allocated mmap:

f, _ := os.Create("index.bin")
f.Truncate(int64(s.Size))

mapped, _ := mmap(f, s.Size, PROT_READ|PROT_WRITE)
s.Save(mapped[:0])           // write into mmap'd region, no heap allocation
msync(mapped)
munmap(mapped)

os.Rename("index.bin.tmp", "index.bin") // atomic swap

Canonical read path

var s Storage
err := s.Load("index.bin")  // mmap + LoadBytes, zero-copy
defer s.Close()             // munmap + close fd

// s is now queryable, all reads are O(log n) btree + O(1) slice access

Or from a raw byte slice:

f, _ := os.Open("index.bin")
mapped, _ := mmap(f, fileSize, PROT_READ)

var s Storage
s.LoadBytes(mapped)         // zero-copy, points into mmap pages
                            // token tables mapped as fixed-stride slices
                            // (no btree rebuilt; binary search in place)

Posting list decoding

A PostingList is a thin { Data []byte } view into the mmap'd file. Decode it into a roaring64 bitmap with Bitmap, which uses FromUnsafeBytes — the bitmap aliases the mmap buffer with no copy:

var bm roaring64.Bitmap
pl.Bitmap(&bm)              // zero-copy decode into the caller-owned bitmap

Because the decoded bitmap points into read-only mmap pages, it must not be mutated in place. Clone first if you need to modify it:

owned := bm.Clone()
owned.Add(newDocID)

Merge

Merger.Merge combines two storages into a single output file using a streaming, temp-file-backed pipeline. No output size pre-computation is required. The only precondition is that doc ID ranges are disjoint: every doc ID in b must sort after every doc ID in a. This is guaranteed when the corpus is partitioned by range before building each batch.

m := storage.Merger{TempDir: "/tmp"} // temp files written here during merge

err := m.Merge("merged.bin", &a, &b)

var merged Storage
err = merged.Load("merged.bin")
defer merged.Close()

Merge semantics

Fields are classified into three groups per merge:

  • A-only fields: written verbatim, indices resequenced via global cursors.
  • B-only fields: written with all internal doc indices shifted by len(a.DocumentsIds). Posting list bitmaps are re-encoded with the offset applied. Token frequency DocumentIndex values are shifted accordingly.
  • Collision fields (same field hash in both): tokens are further classified as A-only, B-only, or shared. Shared tokens have their posting lists unioned (B's doc IDs shifted before OR), their frequency slices concatenated (B's entries shifted), and FrequencyCount summed. AvgDocumentLength is recomputed from the merged DocumentLengths slice.

Field count in the output is len(a.Fields) + len(b.Fields) - collisions.

Large-corpus ingestion pattern

For corpora too large to fit in one BuildFrom call, partition docs into unsorted batches, build each batch into a segment file in parallel, then merge pairwise:

// Build segments in parallel (each worker owns a disjoint doc ID range)
segments := buildSegmentsParallel(batches) // []string of file paths

// Pairwise merge until one file remains
for len(segments) > 1 {
    next := mergePass(segments, m)
    segments = next
}
// segments[0] is the final index

With ~2 GB of heap per 1M-doc segment build and a merge that holds only the working set (not the whole corpus), ingestion scales to arbitrarily large corpora within a bounded memory envelope. This is not a projection — textiplex indexed the full 120 GB English Wikipedia export (25.65M documents) into a 70 GB index on a single i9-10900K: ~35 min of segment creation (peak 27 GB RAM) followed by ~113 min of merge (peak 12 GB RAM), ~2 h 28 min total. Bluge and Bleve cannot complete this corpus — they OOM during segment combination.

BM25 scoring formula

score(doc, term) = idf(term) × (tf × (k1+1)) / (tf + k1 × (1 - b + b × docLen/avgdl))

idf(term) = log(1 + (totalDocs - docFreq + 0.5) / (docFreq + 0.5))

Where:

  • totalDocs — from the header
  • docFreqToken.FrequencyCount
  • tfTokenFrequencies[FrequenciesIndex + offset].Frequency for the candidate doc
  • docLenDocumentLengths[i].Length for the candidate doc (binary search or merge scan)
  • avgdlField.AvgDocumentLength

Benchmarks

All benchmarks run on an Intel Core i9-10900K @ 3.70GHz (10C/20T). The 1M-doc corpus has 3 fields per document, 1 unique token per field (Bluge-equivalent).

Operation Time Throughput Heap Allocs
BuildFrom (textiplex) 2.84s 2.12 GB 33.5M
Merge 2×500K→1M (textiplex) 1.35s 375 MB/s 0.71 GB 19.5M
LoadBytes (textiplex) 0.63s 804 MB/s 0.78 GB 27.0M
OfflineWriter (Bluge fork) 5.47s 6.34 GB 104.9M
OfflineWriter (Bluge upstream) 12.28s 8.20 GB 131.0M
OfflineWriter (Bleve) 24.28s 10.07 GB 146.5M

Build vs Bluge fork: 1.9× faster, 3.0× less heap, 3.1× fewer allocs. Build vs Bluge upstream: 4.3× faster, 3.9× less heap, 3.9× fewer allocs. Build vs Bleve: 8.5× faster, 4.7× less heap, 4.4× fewer allocs.

The fixed-stride layout makes LoadBytes a pure mmap-and-slice operation: doc IDs, token tables, doc-length tables and token-frequency tables are mapped in place with no per-record parsing, which is why load throughput reaches 804 MB/s and the merge runs at 375 MB/s while holding only the working set.

Full English Wikipedia

The decisive result: textiplex indexed the complete 120 GB English Wikipedia export — 25,653,263 documents (id + title + content, parsed with encoding/json/v2) into a 70 GB index on the same i9-10900K.

Phase Time Peak RAM
Segment creation (indexing) 35.1 min 27 GB
Merge 113.2 min 12 GB
Total 2 h 28 min

~205 GB/hour during segment creation, ~48.5 GB/hour end-to-end. Bluge (upstream and fork) and Bleve all OOM during the merge and never finish — textiplex is the only Go FTS engine able to fully index Wikipedia.

Documentation

Index

Constants

View Source
const DocumentIdSize = unsafe.Sizeof(DocumentId{})
View Source
const DocumentLengthEntrySize = unsafe.Sizeof(DocumentLengthEntry{})
View Source
const FieldHeaderSize = unsafe.Sizeof(FieldHeader{})
View Source
const HeaderSize = unsafe.Sizeof(Header{})
View Source
const MagicNumber uint64 = 0x7E7127E9
View Source
const MaxRawValueSize = 128
View Source
const PostingListHeaderSize = unsafe.Sizeof(PostingListHeader{})
View Source
const TokenFrequencyEntrySize = unsafe.Sizeof(TokenFrequencyEntry{})
View Source
const TokenSize = unsafe.Sizeof(Token{})
View Source
const (
	VersionV1 uint16 = iota
)

Variables

This section is empty.

Functions

func CloseAndRemove

func CloseAndRemove(file *os.File)

func CompareTokens

func CompareTokens(a, b Token) (cmp int)

Types

type Document

type Document struct {
	// External document identifier e.g. "CO1.PCCNTR.123456"
	// Must be unique across the index
	// Will be inserted sorted into DocumentsIds
	Id DocumentId
	// Fields present in this document
	// Fields absent from this slice are treated as empty for this document
	Fields []*FieldDefinition
}

type DocumentId

type DocumentId struct {
	Value RawValue
}

type DocumentLengthEntry

type DocumentLengthEntry struct {
	// Index of the document referenced
	Index uint64
	// Actual length of the document in number of tokens
	Length uint64
}

This is per field Meaning the length only references what the field is actually storing for that particular document Writer must ensure they are sorted based on index

type Field

type Field struct {
	// Used for BM25 calculation
	AvgDocumentLength float64
	// Tokens present on the file
	// This field is stored in memory but most of its references
	// are direct mmap zero-copied arrays
	Tokens Tokens
	// DocumentLength entries
	// Keys are indexes of the documents
	DocumentLengths []DocumentLengthEntry
}

type FieldDefinition

type FieldDefinition struct {
	// xxh3 hash of the field name
	Hash uint64
	// Total number of tokens in this field for this document
	// Used to update avgdl and store as DocumentLengthEntry
	Length uint64
	// Tokens found in this field for this document
	// Caller must deduplicate — one entry per unique token
	// Frequency carries the count of occurrences
	Tokens []*TokenDefinition
}

type FieldHeader

type FieldHeader struct {
	// xxh3 hashed representation of the field string
	Hash uint64
	// Avgdl used in the BM25 formula
	// Precomputed so the reader can go directly to queries
	AvgDocumentLength float64
	// Number of total tokens the field has
	TokenCount uint64
	// Number of document lengths included
	DocumentLengthCount uint64
}
type Header struct {
	Magic   uint64
	Version uint16

	TotalDocuments        uint64
	FieldCount            uint64
	TotalPostingLists     uint64
	TotalTokenFrequencies uint64
	// contains filtered or unexported fields
}

type Merger

type Merger struct {
	TempDir string
}

Handles merges between storages

func (*Merger) CreateTemp

func (m *Merger) CreateTemp(pattern string) (file *os.File, err error)

func (*Merger) Merge

func (m *Merger) Merge(name string, a, b *Storage) (err error)

Merges storages B and B into the specified file Document ids should not collide in both storage otherwise undefined behavior will ocurr

type PostingList

type PostingList struct {
	Data []byte
}

func (*PostingList) Bitmap

func (l *PostingList) Bitmap(dst *roaring64.Bitmap)

type PostingListHeader

type PostingListHeader struct {
	Size uint64
}

type RawValue

type RawValue struct {
	Size uint64
	Data [MaxRawValueSize]byte
}

func RawValueFrom

func RawValueFrom[T ~string | ~[]byte](b T) (v RawValue)

func (*RawValue) Bytes

func (v *RawValue) Bytes() (b []byte)

func (*RawValue) Hash

func (v *RawValue) Hash() (hash uint64)

func (*RawValue) UnsafeString

func (v *RawValue) UnsafeString() (s string)

type Storage

type Storage struct {
	// Read-only intended field
	Version uint16
	// Read-only intended field
	Size uint64
	// Reference of the internal buffer of the file
	// exposed only if the caller needs to hack his way around
	Buffer []byte
	// File reference
	File *os.File
	// Fast reference to mapped fields for O(1) lookups
	Fields map[uint64]*Field
	// Documents mapped only unce to the sub-slices of buffer for quick convertion between
	// index form and human-readble form
	DocumentsIds []DocumentId
	// Posting lists used once the caller knows which fields-tokens to query
	PostingLists []PostingList
	// Token frequencies
	TokenFrequencies []TokenFrequencyEntry
	// Used to determine if the storage was already initialized or not
	Initialized bool
}

func (*Storage) BuildFrom

func (s *Storage) BuildFrom(docs ...*Document)

Builds the entire storage from a set of document definitions

func (*Storage) Close

func (s *Storage) Close() (err error)

func (*Storage) ColdInitialize

func (s *Storage) ColdInitialize()

func (*Storage) Load

func (s *Storage) Load(name string) (err error)

Load the storage from a file source

func (*Storage) Reset

func (s *Storage) Reset() (err error)

func (*Storage) SaveTo

func (s *Storage) SaveTo(name string) (err error)

Saves the file to the target file

func (*Storage) SortAndBuildFrom

func (s *Storage) SortAndBuildFrom(docs ...*Document)

This function will allocate a new batch and sort documents in the batch by their ID if the batch in ensured to be in order already call directly BuildFrom

type Token

type Token struct {
	// Document frequency of the token in all documents
	FrequencyCount uint64
	// Posting list of the documents for this token
	PostingListIndex uint64
	// Token frequencies per document
	FrequenciesIndex uint64
	// Actual content of the token
	Value RawValue
}

type TokenDefinition

type TokenDefinition struct {
	// Normalized token value e.g. "interventoria" not "Interventoría"
	// The caller is responsible for normalization before insertion
	Value []byte
	// How many times this token appears in this field of this document
	Frequency uint64
}

type TokenFrequencyEntry

type TokenFrequencyEntry struct {
	// The index of the document
	// Mapping this to a human readable key consist in
	// indexing the document id on the document id table
	DocumentIndex uint64
	// Token frequency on this document
	// this value is used by BM25
	Frequency uint64
}

type Tokens

type Tokens []Token

func (*Tokens) GetBytes

func (s *Tokens) GetBytes(b []byte) (token *Token, found bool)

func (*Tokens) GetBytesOrNear

func (s *Tokens) GetBytesOrNear(b []byte) (token *Token, found bool)

func (*Tokens) GetString

func (s *Tokens) GetString(ss string) (token *Token, found bool)

func (*Tokens) IterBytes

func (s *Tokens) IterBytes(lo, hi []byte) (seq iter.Seq[*Token])

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL