docs

package
v1.1.0
Published: Dec 29, 2020 License: Apache-2.0 Imports: 8 Imported by: 4

README

Documents

Two files are used to represent the documents in a segment. The data file contains the data for each document in the segment. The index file contains, for each document, its corresponding offset in the data file.

Data File

The data file contains the fields for each document. The documents are stored serially.

┌───────────────────────────┐
│ ┌───────────────────────┐ │
│ │      Document 1       │ │
│ ├───────────────────────┤ │
│ │          ...          │ │
│ ├───────────────────────┤ │
│ │      Document n       │ │
│ └───────────────────────┘ │
└───────────────────────────┘

Document

Each document is composed of an ID and its fields. The ID is a sequence of valid UTF-8 bytes. It is encoded as its length in bytes, written as a variable-sized unsigned integer, followed by the bytes themselves. The fields follow the ID: the number of fields in the document is encoded first as a variable-sized unsigned integer, and then the fields themselves are encoded.

┌───────────────────────────┐
│ ┌───────────────────────┐ │
│ │     Length of ID      │ │
│ │       (uvarint)       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │          ID           │ │
│ │        (bytes)        │ │
│ │                       │ │
│ ├───────────────────────┤ │
│ │   Number of Fields    │ │
│ │       (uvarint)       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │        Field 1        │ │
│ │                       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │          ...          │ │
│ │                       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │        Field n        │ │
│ │                       │ │
│ └───────────────────────┘ │
└───────────────────────────┘
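
The layout above can be sketched with the standard library's uvarint helpers. This is a minimal illustration, not the package's implementation: the names encodeDocument and decodeDocumentHeader are hypothetical, and fields are passed in as already-encoded byte slices so that the sketch covers only the document framing.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodeDocument (hypothetical helper) writes: uvarint ID length, ID bytes,
// uvarint field count, then each already-encoded field in order.
func encodeDocument(id []byte, encodedFields [][]byte) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], uint64(len(id)))
	out := append([]byte(nil), tmp[:n]...)
	out = append(out, id...)
	n = binary.PutUvarint(tmp[:], uint64(len(encodedFields)))
	out = append(out, tmp[:n]...)
	for _, f := range encodedFields {
		out = append(out, f...)
	}
	return out
}

// decodeDocumentHeader (hypothetical helper) reads the ID and the field count
// and returns the remaining bytes, which hold the encoded fields.
func decodeDocumentHeader(buf []byte) (id []byte, numFields uint64, rest []byte, err error) {
	idLen, n := binary.Uvarint(buf)
	if n <= 0 {
		return nil, 0, nil, errors.New("invalid ID length")
	}
	buf = buf[n:]
	if uint64(len(buf)) < idLen {
		return nil, 0, nil, errors.New("truncated ID")
	}
	id, buf = buf[:idLen], buf[idLen:]
	numFields, n = binary.Uvarint(buf)
	if n <= 0 {
		return nil, 0, nil, errors.New("invalid field count")
	}
	return id, numFields, buf[n:], nil
}

func main() {
	enc := encodeDocument([]byte("doc-1"), nil)
	id, numFields, rest, err := decodeDocumentHeader(enc)
	fmt.Println(string(id), numFields, len(rest), err)
}
```

Because both lengths are uvarints, a short ID and a small field count each cost a single prefix byte.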

Field

Each field is composed of a name and a value. The name and the value are each a sequence of valid UTF-8 bytes, stored as the length in bytes, written as a variable-sized unsigned integer, followed by the bytes themselves. The name is encoded first and the value second.

┌───────────────────────────┐
│ ┌───────────────────────┐ │
│ │  Length of Field Name │ │
│ │       (uvarint)       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │      Field Name       │ │
│ │        (bytes)        │ │
│ │                       │ │
│ ├───────────────────────┤ │
│ │ Length of Field Value │ │
│ │       (uvarint)       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │      Field Value      │ │
│ │        (bytes)        │ │
│ │                       │ │
│ └───────────────────────┘ │
└───────────────────────────┘
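
A field is therefore two length-prefixed byte sequences back to back. The following is a sketch of that framing using only encoding/binary; the helper names (appendBytes, readBytes, encodeField, decodeField) are assumptions made for illustration and are not part of this package's API.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// appendBytes appends a uvarint length prefix followed by the bytes themselves.
func appendBytes(dst, b []byte) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], uint64(len(b)))
	dst = append(dst, tmp[:n]...)
	return append(dst, b...)
}

// readBytes reads one length-prefixed byte sequence and returns it together
// with the unread remainder of the input.
func readBytes(buf []byte) (b, rest []byte, err error) {
	l, n := binary.Uvarint(buf)
	if n <= 0 {
		return nil, nil, errors.New("invalid length prefix")
	}
	buf = buf[n:]
	if uint64(len(buf)) < l {
		return nil, nil, errors.New("truncated payload")
	}
	return buf[:l], buf[l:], nil
}

// encodeField writes the name first and the value second.
func encodeField(name, value []byte) []byte {
	return appendBytes(appendBytes(nil, name), value)
}

// decodeField reads a name and a value and returns any bytes that follow.
func decodeField(buf []byte) (name, value, rest []byte, err error) {
	if name, buf, err = readBytes(buf); err != nil {
		return nil, nil, nil, err
	}
	if value, buf, err = readBytes(buf); err != nil {
		return nil, nil, nil, err
	}
	return name, value, buf, nil
}

func main() {
	enc := encodeField([]byte("city"), []byte("Oslo"))
	name, value, rest, err := decodeField(enc)
	fmt.Printf("%s=%s rest=%d err=%v\n", name, value, len(rest), err)
}
```

Returning the unread remainder from decodeField lets a caller decode consecutive fields in a loop, stopping after the field count read from the document header.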

Index File

The index file contains, for each postings ID in the segment, the offset of the corresponding document in the data file. The base postings ID is stored at the start of the file as a little-endian uint64. Following it are the actual offsets.

┌───────────────────────────┐
│            Base           │
│          (uint64)         │
├───────────────────────────┤
│                           │
│                           │
│          Offsets          │
│                           │
│                           │
└───────────────────────────┘

Offsets

The offsets are stored serially, starting from the offset for the base postings ID. Each offset is a little-endian uint64. Since each offset has a fixed size, we can access the offset for a given postings ID by calculating its index relative to the start of the offsets. An offset equal to the maximum value for a uint64 indicates that there is no corresponding document for a given postings ID.

┌───────────────────────────┐
│ ┌───────────────────────┐ │
│ │       Offset 1        │ │
│ │       (uint64)        │ │
│ ├───────────────────────┤ │
│ │          ...          │ │
│ ├───────────────────────┤ │
│ │       Offset n        │ │
│ │       (uint64)        │ │
│ └───────────────────────┘ │
└───────────────────────────┘
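
The fixed-size layout makes a lookup a matter of arithmetic: skip the 8-byte base, then index by (id - base). Below is a minimal sketch under stated assumptions: lookupOffset and buildIndex are hypothetical names, and postings IDs are modelled as uint64 for simplicity, which may differ from the package's postings.ID type.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"math"
)

const (
	baseSize   = 8              // little-endian uint64 base postings ID
	offsetSize = 8              // each offset is a little-endian uint64
	noDocument = math.MaxUint64 // marks a postings ID with no document
)

var errNotFound = errors.New("no document for postings ID")

// lookupOffset returns the data file offset recorded for id.
func lookupOffset(index []byte, id uint64) (uint64, error) {
	if len(index) < baseSize {
		return 0, errors.New("index file too short")
	}
	base := binary.LittleEndian.Uint64(index[:baseSize])
	count := uint64(len(index)-baseSize) / offsetSize
	if id < base || id-base >= count {
		return 0, errNotFound
	}
	start := baseSize + (id-base)*offsetSize
	off := binary.LittleEndian.Uint64(index[start : start+offsetSize])
	if off == noDocument {
		return 0, errNotFound
	}
	return off, nil
}

// buildIndex assembles an in-memory index file for the example.
func buildIndex(base uint64, offsets ...uint64) []byte {
	index := make([]byte, baseSize+offsetSize*len(offsets))
	binary.LittleEndian.PutUint64(index, base)
	for i, off := range offsets {
		binary.LittleEndian.PutUint64(index[baseSize+offsetSize*i:], off)
	}
	return index
}

func main() {
	index := buildIndex(10, 0, noDocument, 42)
	for _, id := range []uint64{10, 11, 12, 13} {
		off, err := lookupOffset(index, id)
		fmt.Println(id, off, err)
	}
}
```

The returned offset is a position in the data file, so a full read would continue by decoding the document that starts at that position.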

Documentation

Types

type DataReader

type DataReader struct {
	// contains filtered or unexported fields
}

DataReader is a reader for the data file for documents.

func NewDataReader

func NewDataReader(data []byte) *DataReader

NewDataReader returns a new DataReader.

func (*DataReader) Read

func (r *DataReader) Read(offset uint64) (doc.Document, error)

type DataWriter

type DataWriter struct {
	// contains filtered or unexported fields
}

DataWriter writes the data file for documents.

func NewDataWriter

func NewDataWriter(w io.Writer) *DataWriter

NewDataWriter returns a new DataWriter.

func (*DataWriter) Reset

func (w *DataWriter) Reset(wr io.Writer)

Reset resets the DataWriter.

func (*DataWriter) Write

func (w *DataWriter) Write(d doc.Document) (int, error)

type IndexReader

type IndexReader struct {
	// contains filtered or unexported fields
}

IndexReader is a reader for the index file for documents.

func NewIndexReader

func NewIndexReader(data []byte) (*IndexReader, error)

NewIndexReader returns a new IndexReader.

func (*IndexReader) Base

func (r *IndexReader) Base() postings.ID

Base returns the base postings ID.

func (*IndexReader) Len

func (r *IndexReader) Len() int

Len returns the number of postings IDs.

func (*IndexReader) Read

func (r *IndexReader) Read(id postings.ID) (uint64, error)

type IndexWriter

type IndexWriter struct {
	// contains filtered or unexported fields
}

IndexWriter is a writer for the index file for documents.

func NewIndexWriter

func NewIndexWriter(w io.Writer) *IndexWriter

NewIndexWriter returns a new IndexWriter.

func (*IndexWriter) Reset

func (w *IndexWriter) Reset(wr io.Writer)

Reset resets the IndexWriter.

func (*IndexWriter) Write

func (w *IndexWriter) Write(id postings.ID, offset uint64) error

Write writes the offset for an id. IDs must be written in increasing order but can be non-contiguous.

type Reader added in v0.15.14

type Reader interface {
	// Len is the number of documents contained by the reader.
	Len() int
	// Read reads a document with the given postings ID.
	Read(id postings.ID) (doc.Document, error)
	// Iter returns a document iterator.
	Iter() index.IDDocIterator
}

Reader is a document reader from an encoded source.

type SliceReader added in v0.5.0

type SliceReader struct {
	// contains filtered or unexported fields
}

SliceReader is a docs slice reader for use with documents stored in memory.

func NewSliceReader added in v0.5.0

func NewSliceReader(docs []doc.Document) *SliceReader

NewSliceReader returns a new docs slice reader.

func (*SliceReader) Doc added in v0.15.0

func (r *SliceReader) Doc(id postings.ID) (doc.Document, error)

Doc implements DocRetriever and reads the document with the given postings ID.

func (*SliceReader) Iter added in v0.15.14

func (r *SliceReader) Iter() index.IDDocIterator

Iter returns a docs iterator.

func (*SliceReader) Len added in v0.5.0

func (r *SliceReader) Len() int

Len returns the number of documents in the slice reader.

func (*SliceReader) Read added in v0.5.0

func (r *SliceReader) Read(id postings.ID) (doc.Document, error)

Read returns a document from the docs slice reader.
