dory

v0.1.0-alpha.4
Published: Apr 7, 2026 License: MIT Imports: 10 Imported by: 0

README

Dory

Dory is a retrieval library for Go. It provides a modular pipeline for chunking, embedding, indexing, retrieving, reranking, and evaluating knowledge — with authorization built in from the ground up.

Every stage of the pipeline is expressed as a Go interface. Bring your own vector store, embedding model, and authorization backend.

Installation

go get github.com/i33ym/dory

Quick Start

ctx := context.Background()

// Create a document.
doc, err := dory.NewDocument("doc-001", dory.TextContent("Your content here.", "text/plain"))
if err != nil {
    log.Fatal(err)
}

// Split into chunks.
splitter := chunk.NewFixed(chunk.FixedConfig{Size: 512, Overlap: 64})
chunks, err := splitter.Split(ctx, doc)
if err != nil {
    log.Fatal(err)
}

// Embed each chunk, then store the embedded chunks.
embedder := embed.NewOpenAI("text-embedding-3-small")
vectorStore := store.NewMemory()
for _, c := range chunks {
    if c.Vector, err = embedder.Embed(ctx, c.AsText()); err != nil {
        log.Fatal(err)
    }
}
if err := vectorStore.Store(ctx, chunks); err != nil {
    log.Fatal(err)
}

// Retrieve.
retriever := retrieve.NewVector(vectorStore, embedder)
results, err := retriever.Retrieve(ctx, dory.Query{Text: "your question", TopK: 5})
if err != nil {
    log.Fatal(err)
}

Or wire everything together with a Pipeline:

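// Share one embedder and one store between the pipeline and its retriever,
// so that Retrieve searches the same store that Ingest writes to.
embedder := embed.NewOpenAI("text-embedding-3-small")
vectorStore := store.NewMemory()
pipe, err := dory.NewPipeline(dory.PipelineConfig{
    Splitter:  chunk.NewFixed(chunk.FixedConfig{Size: 512, Overlap: 64}),
    Embedder:  embedder,
    Store:     vectorStore,
    Retriever: retrieve.NewVector(vectorStore, embedder),
    Reranker:  rerank.NewLostInTheMiddle(),
})
if err != nil {
    log.Fatal(err)
}

if err := pipe.Ingest(ctx, doc); err != nil {
    log.Fatal(err)
}
results, err := pipe.Retrieve(ctx, dory.Query{Text: "your question", TopK: 5})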

See examples/ for hybrid retrieval, graph retrieval, and authorization demos.

What's Included

Chunking

Strategy                        Package
Fixed-size with overlap         chunk.NewFixed
Recursive character splitting   chunk.NewRecursive
Sentence-aware grouping         chunk.NewSentence
Semantic boundary detection     chunk.NewSemantic
Late chunking                   chunk.NewLate
Contextual retrieval            chunk.NewContextual
Proposition extraction          chunk.NewProposition

Retrieval

Strategy                        Package
Dense vector search             retrieve.NewVector
BM25 sparse search              retrieve.NewBM25
Hybrid (RRF fusion)             retrieve.NewHybrid
Ensemble (multi-retriever)      retrieve.NewEnsemble
Query routing                   retrieve.NewRouter
Knowledge graph                 retrieve.NewGraph
Text-to-SQL                     retrieve.NewStructured
Web search                      retrieve.NewWeb

Reranking

Strategy                        Package
Cross-encoder scoring           rerank.NewCrossEncoder
Lost-in-the-middle reordering   rerank.NewLostInTheMiddle

Vector Stores

Backend                         Package
In-memory (dev/test)            store.NewMemory
PostgreSQL + pgvector           store.NewPgVector
Qdrant                          store.NewQdrant

Authorization

Backend                         Package
No-op (allow all)               auth.NoopAuthorizer
Allowlist                       auth.NewAllowlist
OpenFGA                         auth.NewOpenFGA
Casbin-style RBAC               auth.NewCasbin

Pre-filter and post-filter authorization modes are selected via PipelineConfig.AuthMode. A hybrid mode is also defined, but in this version it is not yet implemented and falls back to post-filter.

Evaluation

Four metrics are provided: context precision, context recall, faithfulness, and answer relevance. Faithfulness and answer relevance use LLM-as-judge scoring via a configurable JudgeFunc.

Contributing

Contributions are welcome. Please open an issue before submitting a pull request for significant changes. See CONTRIBUTING.md.

License

MIT

Documentation

Overview

Package dory provides a retrieval intelligence library for Go.

Dory is organized around a pipeline of composable, interface-driven stages:

  1. Chunking — split documents into retrievable units
  2. Embedding — transform text into vector representations
  3. Indexing — store chunks in a searchable backend
  4. Retrieval — find the most relevant units for a query
  5. Reranking — reorder candidates by cross-encoder relevance
  6. Authorization — filter results by what the caller is allowed to see
  7. Evaluation — measure retrieval quality with quantitative metrics

Every stage is expressed as a Go interface. Dory ships with concrete implementations for each, but any implementation of the interface works — the library has no opinion about which vector store, embedding model, or authorization backend you use.

The canonical entry point for most users is Pipeline, which wires the stages together into a single coherent retrieval flow.

Types

type Action

type Action string

Action describes what the subject wants to do with the resource.

const (
	// ActionRead is the action checked for retrieval. Most RAG systems
	// only need this single action.
	ActionRead Action = "read"
)

type AuthorizationMode

type AuthorizationMode int

AuthorizationMode controls where in the pipeline authorization is enforced.

const (
	// PostFilter retrieves candidates first, then authorizes each result.
	// This is the safe default: correct regardless of metadata staleness.
	PostFilter AuthorizationMode = iota

	// PreFilter passes authorization constraints to the VectorStore as
	// metadata filters before the similarity search runs.
	// Faster, but requires keeping chunk metadata in sync with the
	// authorization system when permissions change.
	PreFilter

	// Hybrid applies tenant isolation as a pre-filter and fine-grained
	// per-document authorization as a post-filter.
	Hybrid
)

type Authorizer

type Authorizer interface {
	// Check answers: can this subject perform this action on this resource?
	Check(ctx context.Context, req CheckRequest) (bool, error)

	// Filter answers: which of these resources (or all resources if
	// Candidates is nil) can this subject access?
	Filter(ctx context.Context, req FilterRequest) (ResourceSet, error)
}

Authorizer is the authorization backend interface. OpenFGA, Casbin, simple allowlists, and the NoopAuthorizer all implement this.

type BytesContent

type BytesContent struct {
	// contains filtered or unexported fields
}

BytesContent is a Content backed by raw bytes. Use this for binary formats like PDF or images where the content has not yet been extracted to text.

func BinaryContent

func BinaryContent(data []byte, mimeType string) *BytesContent

BinaryContent creates a BytesContent with the given data and mime type.

func (*BytesContent) MimeType

func (b *BytesContent) MimeType() string

func (*BytesContent) Reader

func (b *BytesContent) Reader() (io.ReadCloser, error)

func (*BytesContent) Size

func (b *BytesContent) Size() int64

func (*BytesContent) Text

func (b *BytesContent) Text() (string, error)

type CheckRequest

type CheckRequest struct {
	Subject  Subject
	Action   Action
	Resource Resource
}

CheckRequest is the input to a single authorization check.

type Chunk

type Chunk struct {

	// Vector is the dense embedding of this chunk's text.
	// Nil until the embedder processes this chunk.
	Vector []float32

	// Position describes where in the source document this chunk came from.
	Position *Position

	// TokenCount is the number of tokens in this chunk's text,
	// computed by the Splitter at creation time. Zero if not computed.
	TokenCount int

	// ParentID, if non-empty, points to the larger parent chunk
	// this chunk was derived from (small-to-big retrieval).
	ParentID string

	// WindowText, if non-empty, is the surrounding sentence window.
	// When set, AsText returns this instead of the raw chunk text.
	WindowText string

	// ContextPrefix is a short LLM-generated sentence that situates
	// this chunk within its source document.
	ContextPrefix string
	// contains filtered or unexported fields
}

Chunk is the concrete RetrievedUnit for text-based retrieval strategies: vector search, sparse search, hybrid search, and their variants.

func NewChunk

func NewChunk(id, sourceDocID, text string, metadata map[string]any) *Chunk

NewChunk constructs a Chunk with the required identity fields.

Example
package main

import (
	"fmt"

	"github.com/i33ym/dory"
)

func main() {
	chunk := dory.NewChunk("chunk-1", "doc-1", "The quick brown fox.", nil)
	fmt.Println(chunk.ID())
	fmt.Println(chunk.AsText())
}
Output:
chunk-1
The quick brown fox.

func NewChunkWithOptions

func NewChunkWithOptions(id, sourceDocID, text string, metadata map[string]any, sourceURI string, pos *Position, tokenCount int) *Chunk

NewChunkWithOptions constructs a Chunk with additional fields.

func (*Chunk) AsText

func (c *Chunk) AsText() string

func (*Chunk) ID

func (c *Chunk) ID() string

func (*Chunk) MarshalJSON

func (c *Chunk) MarshalJSON() ([]byte, error)

func (*Chunk) Metadata

func (c *Chunk) Metadata() map[string]any

func (*Chunk) Score

func (c *Chunk) Score() float64

func (*Chunk) Scores

func (c *Chunk) Scores() []ScoreEntry

func (*Chunk) SourceDocumentID

func (c *Chunk) SourceDocumentID() string

func (*Chunk) SourceURI

func (c *Chunk) SourceURI() string

func (*Chunk) Text

func (c *Chunk) Text() string

Text returns the raw chunk text (before any window or context prefix).

func (*Chunk) UnmarshalJSON

func (c *Chunk) UnmarshalJSON(data []byte) error

func (*Chunk) WithScore

func (c *Chunk) WithScore(stage string, score float64) RetrievedUnit

type Content

type Content interface {
	// Reader returns the content as a stream of bytes.
	// Callers are responsible for closing the reader.
	Reader() (io.ReadCloser, error)

	// Text returns the content as a UTF-8 string, if possible.
	// Returns an error if the content is binary or not yet extracted.
	// Splitters call this — they work on text, not bytes.
	Text() (string, error)

	// MimeType describes the format of the content.
	MimeType() string

	// Size returns the content length in bytes, or -1 if unknown.
	Size() int64
}

Content is the raw material of a Document. It abstracts over text, binary, and streaming content so that Dory's pipeline can handle each appropriately.

type Document

type Document struct {
	// contains filtered or unexported fields
}

Document is the ingestion unit — a raw source of knowledge before it has been chunked or indexed. A document carries its content, its identity, and the metadata the authorizer and retriever will consult later.

Documents are created via NewDocument, which validates required fields and computes a content fingerprint for change detection.

func NewDocument

func NewDocument(id string, content Content, opts ...DocumentOption) (*Document, error)

NewDocument constructs a validated Document. Returns an error if the document cannot be used by Dory's pipeline — for example, if the ID is empty or the content is nil.

Example
package main

import (
	"fmt"

	"github.com/i33ym/dory"
)

func main() {
	content := dory.TextContent("Hello, Dory!", "text/plain")
	doc, err := dory.NewDocument("doc-1", content,
		dory.WithTenantID("acme"),
		dory.WithLanguage("en"),
		dory.WithMetadata("author", "alice"),
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(doc.ID())
	fmt.Println(doc.TenantID())
	fmt.Println(doc.Language())
	fmt.Println(doc.Metadata()["author"])
}
Output:
doc-1
acme
en
alice

func (*Document) Content

func (d *Document) Content() Content

Content returns the document's content.

func (*Document) CreatedAt

func (d *Document) CreatedAt() time.Time

CreatedAt returns when this document was first ingested.

func (*Document) Fingerprint

func (d *Document) Fingerprint() [32]byte

Fingerprint returns the SHA-256 hash of this document's content. If two Documents have the same ID and the same Fingerprint, re-ingestion can be skipped safely.
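
The skip-on-unchanged check can be sketched with the standard library alone. The example below is an illustration under assumptions: the seen type and needsIngest helper are hypothetical caller-side bookkeeping, not part of Dory, and the hash is taken over raw content bytes.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// seen maps a document ID to the fingerprint recorded at its last ingestion.
// It is a hypothetical stand-in for whatever state the caller keeps.
type seen map[string][32]byte

// needsIngest reports whether the content for id is new or has changed,
// and records the new fingerprint when it has.
func (s seen) needsIngest(id string, content []byte) bool {
	fp := sha256.Sum256(content)
	if old, ok := s[id]; ok && old == fp {
		return false // same ID, same fingerprint: safe to skip
	}
	s[id] = fp
	return true
}

func main() {
	s := seen{}
	fmt.Println(s.needsIngest("doc-1", []byte("Hello, Dory!"))) // true
	fmt.Println(s.needsIngest("doc-1", []byte("Hello, Dory!"))) // false
	fmt.Println(s.needsIngest("doc-1", []byte("Hello again."))) // true
}
```

Because [32]byte is a comparable array type, fingerprints can be compared with == and used directly as map values or keys.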

func (*Document) ID

func (d *Document) ID() string

ID returns the document's unique identifier.

func (*Document) Language

func (d *Document) Language() string

Language returns the BCP-47 language tag for this document's content.

func (*Document) Metadata

func (d *Document) Metadata() map[string]any

Metadata returns the document's metadata.

func (*Document) SourceURI

func (d *Document) SourceURI() string

SourceURI returns the canonical location of this document's original source.

func (*Document) TenantID

func (d *Document) TenantID() string

TenantID returns the tenant this document belongs to.

func (*Document) UpdatedAt

func (d *Document) UpdatedAt() time.Time

UpdatedAt returns the last time this document was re-ingested.

type DocumentOption

type DocumentOption func(*Document) error

DocumentOption configures a Document at construction time.

func WithLanguage

func WithLanguage(tag string) DocumentOption

WithLanguage sets the BCP-47 language tag for this document's content. Used by sentence-aware chunking strategies to apply the correct sentence boundary detection rules. Defaults to "en" if not set.

func WithMetadata

func WithMetadata(key string, value any) DocumentOption

WithMetadata attaches a key-value pair to the document's metadata.

func WithSourceURI

func WithSourceURI(uri string) DocumentOption

WithSourceURI sets the canonical source location for this document. Examples: "s3://bucket/path/to/file.pdf", "https://docs.example.com/api".

func WithTenantID

func WithTenantID(id string) DocumentOption

WithTenantID sets the tenant this document belongs to.

func WithTimestamps

func WithTimestamps(createdAt, updatedAt time.Time) DocumentOption

WithTimestamps overrides the default creation and update timestamps.

type Embedder

type Embedder interface {
	// Embed returns the vector representation of the given text.
	Embed(ctx context.Context, text string) ([]float32, error)

	// EmbedBatch embeds multiple texts in a single call.
	// Implementations that do not support native batching should
	// loop over Embed internally. Callers should prefer EmbedBatch
	// during ingestion to reduce API round-trips and cost.
	EmbedBatch(ctx context.Context, texts []string) ([][]float32, error)

	// Dimensions returns the dimensionality of the vectors this embedder
	// produces. The vector store needs this at collection creation time.
	Dimensions() int
}

Embedder transforms text into a dense vector representation. The library is agnostic about which model or provider is used — any implementation of this interface is interchangeable.

type EvalMetrics

type EvalMetrics struct {
	// ContextPrecision measures what fraction of retrieved chunks
	// were actually relevant to the question.
	ContextPrecision *float64

	// ContextRecall measures what fraction of the information needed
	// to answer the question was present in the retrieved chunks.
	ContextRecall *float64

	// Faithfulness measures whether the generated answer is supported
	// by the retrieved context rather than the model's parametric knowledge.
	Faithfulness *float64

	// AnswerRelevance measures whether the generated answer actually
	// addresses what the question asked.
	AnswerRelevance *float64
}

EvalMetrics holds the computed scores for a single test case. All scores are in the range [0.0, 1.0]. A nil pointer means the metric was not requested or could not be computed.
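
The two context metrics can be approximated from document IDs alone. The sketch below is one possible formulation, not necessarily the one the eval package uses: precision is the share of retrieved chunks whose source document is in the relevant set, and recall is the share of relevant documents that appear at least once. contextPrecisionRecall is a hypothetical helper; it returns nil pointers when a metric has no denominator, mirroring the nil convention described above.

```go
package main

import "fmt"

// contextPrecisionRecall scores one test case from document IDs alone.
// retrieved holds the source document ID of each retrieved chunk;
// relevant holds the document IDs the test case expects.
// A metric whose denominator is zero is returned as nil.
func contextPrecisionRecall(retrieved, relevant []string) (precision, recall *float64) {
	want := make(map[string]bool, len(relevant))
	for _, id := range relevant {
		want[id] = true
	}
	hits := 0                      // retrieved chunks from a relevant document
	found := make(map[string]bool) // distinct relevant documents retrieved
	for _, id := range retrieved {
		if want[id] {
			hits++
			found[id] = true
		}
	}
	if len(retrieved) > 0 {
		p := float64(hits) / float64(len(retrieved))
		precision = &p
	}
	if len(want) > 0 {
		r := float64(len(found)) / float64(len(want))
		recall = &r
	}
	return precision, recall
}

func main() {
	p, r := contextPrecisionRecall(
		[]string{"doc-1", "doc-1", "doc-4", "doc-2"},
		[]string{"doc-1", "doc-2", "doc-3", "doc-5"},
	)
	fmt.Println(*p, *r) // 0.75 0.5
}
```

Three of the four retrieved chunks come from relevant documents (precision 0.75), and two of the four relevant documents were found (recall 0.5).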

type EvalResult

type EvalResult struct {
	TestCase        TestCase
	RetrievedUnits  []RetrievedUnit
	GeneratedAnswer string
	Metrics         EvalMetrics
}

EvalResult captures the full output of evaluating one TestCase.

type Evaluator

type Evaluator interface {
	Evaluate(ctx context.Context, cases []TestCase) ([]EvalResult, error)
}

Evaluator runs a retrieval pipeline against a set of test cases and produces scored results for each.

type FilterOp

type FilterOp string

FilterOp is the comparison operator in a MetadataFilter.

const (
	// FilterOpEq matches documents where the field equals the value exactly.
	FilterOpEq FilterOp = "eq"

	// FilterOpIn matches documents where the field equals any value in the list.
	FilterOpIn FilterOp = "in"

	// FilterOpAnyOf matches documents where a metadata array field contains
	// any value from the list. Used for multi-value fields like role lists.
	FilterOpAnyOf FilterOp = "any_of"
)

type FilterRequest

type FilterRequest struct {
	Subject Subject
	Action  Action
	// Candidates, if non-nil, restricts the check to this set of resources.
	// If nil, implementations should return ALL authorized resources for
	// the subject — used for the pre-filter path.
	Candidates []Resource
}

FilterRequest is the input to a bulk authorization filter.

type GraphFact

type GraphFact struct {
	Subject   string
	Predicate string
	Object    string
	// contains filtered or unexported fields
}

GraphFact is the concrete RetrievedUnit for graph retrieval. It represents a single fact extracted from the knowledge graph: a subject, a predicate (relationship type), and an object.

func NewGraphFact

func NewGraphFact(id, sourceDocID, subject, predicate, object string, metadata map[string]any) *GraphFact

NewGraphFact constructs a GraphFact with the required identity fields.

func (*GraphFact) AsText

func (g *GraphFact) AsText() string

func (*GraphFact) ID

func (g *GraphFact) ID() string

func (*GraphFact) MarshalJSON

func (g *GraphFact) MarshalJSON() ([]byte, error)

func (*GraphFact) Metadata

func (g *GraphFact) Metadata() map[string]any

func (*GraphFact) Score

func (g *GraphFact) Score() float64

func (*GraphFact) Scores

func (g *GraphFact) Scores() []ScoreEntry

func (*GraphFact) SourceDocumentID

func (g *GraphFact) SourceDocumentID() string

func (*GraphFact) SourceURI

func (g *GraphFact) SourceURI() string

func (*GraphFact) UnmarshalJSON

func (g *GraphFact) UnmarshalJSON(data []byte) error

func (*GraphFact) WithScore

func (g *GraphFact) WithScore(stage string, score float64) RetrievedUnit

type Hook

type Hook struct {
	// BeforeIngest is called before documents are ingested.
	// Receives the number of documents about to be processed.
	BeforeIngest func(ctx context.Context, docCount int)

	// AfterIngest is called after documents are ingested.
	// Receives the number of chunks produced and any error.
	AfterIngest func(ctx context.Context, chunkCount int, err error)

	// BeforeRetrieve is called before a retrieval query is executed.
	BeforeRetrieve func(ctx context.Context, query Query)

	// AfterRetrieve is called after retrieval completes.
	// Receives the number of results and any error.
	AfterRetrieve func(ctx context.Context, resultCount int, err error)

	// BeforeRerank is called before reranking.
	BeforeRerank func(ctx context.Context, query string, candidateCount int)

	// AfterRerank is called after reranking completes.
	AfterRerank func(ctx context.Context, resultCount int, err error)
}

Hook is called at specific points in the pipeline lifecycle. Hooks observe but do not modify pipeline behavior.

func NewLogHook

func NewLogHook() Hook

NewLogHook creates a Hook that logs pipeline events using the standard log package.

type MetadataFilter

type MetadataFilter struct {
	Field string
	Op    FilterOp
	Value any // string for Eq; []string for In and AnyOf
}

MetadataFilter is Dory's portable filter expression. It is intentionally minimal — just enough to express tenant isolation and authorization constraints. Each VectorStore implementation translates this into its native query language.
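
The semantics of the three operators can be sketched as an in-memory predicate. The types are redeclared locally so the example compiles on its own; matches and contains are hypothetical helpers, and the sketch assumes string metadata values for Eq and In and []string values for AnyOf. Real VectorStore implementations translate the filter into their own query language instead.

```go
package main

import "fmt"

type FilterOp string

const (
	FilterOpEq    FilterOp = "eq"
	FilterOpIn    FilterOp = "in"
	FilterOpAnyOf FilterOp = "any_of"
)

type MetadataFilter struct {
	Field string
	Op    FilterOp
	Value any // string for Eq; []string for In and AnyOf
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// matches is a hypothetical in-memory evaluation of a filter against
// one chunk's metadata. Missing fields and mistyped values never match.
func matches(f MetadataFilter, meta map[string]any) bool {
	got, ok := meta[f.Field]
	if !ok {
		return false
	}
	switch f.Op {
	case FilterOpEq:
		s, isStr := got.(string)
		want, ok := f.Value.(string)
		return isStr && ok && s == want
	case FilterOpIn:
		s, isStr := got.(string)
		list, ok := f.Value.([]string)
		return isStr && ok && contains(list, s)
	case FilterOpAnyOf:
		have, isList := got.([]string)
		list, ok := f.Value.([]string)
		if !isList || !ok {
			return false
		}
		for _, h := range have {
			if contains(list, h) {
				return true
			}
		}
	}
	return false
}

func main() {
	// "tenant_id" and "roles" are illustrative metadata keys.
	meta := map[string]any{"tenant_id": "acme", "roles": []string{"eng", "ops"}}
	fmt.Println(matches(MetadataFilter{Field: "tenant_id", Op: FilterOpEq, Value: "acme"}, meta))                       // true
	fmt.Println(matches(MetadataFilter{Field: "tenant_id", Op: FilterOpIn, Value: []string{"globex", "initech"}}, meta)) // false
	fmt.Println(matches(MetadataFilter{Field: "roles", Op: FilterOpAnyOf, Value: []string{"ops", "legal"}}, meta))       // true
}
```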

type Pipeline

type Pipeline struct {
	// contains filtered or unexported fields
}

Pipeline wires Dory's pipeline stages together into a single coherent retrieval flow: ingest documents, retrieve relevant units, and optionally rerank and authorize the results.

func NewPipeline

func NewPipeline(config PipelineConfig) (*Pipeline, error)

NewPipeline constructs a Pipeline from the given configuration. Returns an error if any required component is nil.

func (*Pipeline) Delete

func (p *Pipeline) Delete(ctx context.Context, docIDs ...string) error

Delete removes chunks associated with the given document IDs from the vector store.

func (*Pipeline) Ingest

func (p *Pipeline) Ingest(ctx context.Context, docs ...*Document) error

Ingest splits each document into chunks, embeds them in batch, and stores them in the vector store. This is the ingestion path.

func (*Pipeline) Retrieve

func (p *Pipeline) Retrieve(ctx context.Context, q Query) ([]RetrievedUnit, error)

Retrieve finds relevant units for the given query. It calls the retriever, then optionally reranks, then optionally authorizes based on the configured AuthorizationMode.

type PipelineConfig

type PipelineConfig struct {
	Splitter  Splitter
	Embedder  Embedder
	Store     VectorStore
	Retriever Retriever

	// Reranker, if non-nil, reorders retrieval results for higher precision.
	Reranker Reranker

	// Authorizer, if non-nil, enforces access control on retrieval results.
	Authorizer Authorizer

	// AuthMode controls where authorization is enforced. Defaults to PostFilter.
	// Hybrid mode is not yet implemented and falls back to PostFilter.
	AuthMode AuthorizationMode

	// Hooks are called at key points in the pipeline lifecycle.
	// Multiple hooks are called in the order they appear in the slice.
	Hooks []Hook
}

PipelineConfig holds the components needed to construct a Pipeline. Splitter, Embedder, Store, and Retriever are required; Reranker and Authorizer are optional.

type Position

type Position struct {
	// StartByte and EndByte are the byte offsets in the original
	// document content. Used for precise deduplication and
	// for reconstructing the original context.
	StartByte int `json:"start_byte"`
	EndByte   int `json:"end_byte"`

	// Page is the page number in a paginated document (PDF, DOCX).
	// Nil for documents without pagination.
	Page *int `json:"page,omitempty"`

	// Section is the heading path to this chunk's location in a
	// structured document. For a Markdown file:
	// ["Introduction", "Background", "Prior Work"].
	// Nil for unstructured documents.
	Section []string `json:"section,omitempty"`
}

Position describes where in the source document a chunk came from.

type Query

type Query struct {
	// Text is the raw natural language question from the user.
	Text string

	// TenantID is mandatory for multi-tenant knowledge bases.
	// Retrievers must enforce tenant isolation before any other
	// filtering. An empty TenantID is valid only for single-tenant systems.
	TenantID string

	// Subject is the identity of the caller for authorization checks.
	// Passed to the Authorizer when authorization is enabled.
	Subject string

	// TopK is the maximum number of results the caller wants.
	// Retrievers may internally over-fetch (e.g., for reranking)
	// but should return at most TopK results.
	TopK int

	// Filters are additional metadata constraints the caller wants
	// applied beyond tenant isolation and authorization.
	Filters []MetadataFilter
}

Query carries everything a retriever needs to find relevant units.

type ReaderContent

type ReaderContent struct {
	// contains filtered or unexported fields
}

ReaderContent is a Content backed by a lazy reader function. Use this for streaming large files without loading them into memory.

func StreamContent

func StreamContent(open func() (io.ReadCloser, error), mimeType string, size int64) *ReaderContent

StreamContent creates a ReaderContent with the given reader factory. The open function is called each time Reader() is invoked, allowing multiple reads of the same content. Pass size=-1 if the size is unknown.

func (*ReaderContent) MimeType

func (r *ReaderContent) MimeType() string

func (*ReaderContent) Reader

func (r *ReaderContent) Reader() (io.ReadCloser, error)

func (*ReaderContent) Size

func (r *ReaderContent) Size() int64

func (*ReaderContent) Text

func (r *ReaderContent) Text() (string, error)

type Reranker

type Reranker interface {
	// Rerank takes the original query text and the candidate units
	// returned by the retriever, and returns them in a new order
	// with updated scores. The returned slice may be shorter than
	// the input if the reranker applies a relevance threshold.
	Rerank(ctx context.Context, query string, units []RetrievedUnit) ([]RetrievedUnit, error)
}

Reranker reorders a slice of RetrievedUnits by their relevance to the original query. It operates after initial retrieval, trading latency for precision.

type Resource

type Resource string

Resource identifies a document or chunk for authorization purposes.

type ResourceSet

type ResourceSet struct {
	// Resources is the explicit list of authorized resource IDs.
	Resources []Resource

	// Predicate, if non-nil, can be passed directly to a VectorStore
	// to restrict the search space at the database level.
	Predicate *MetadataFilter
}

ResourceSet is the result of a FilterRequest.

type RetrievedUnit

type RetrievedUnit interface {
	// ID returns a stable unique identifier for this unit.
	ID() string

	// SourceDocumentID returns the document or resource this unit came from.
	SourceDocumentID() string

	// SourceURI returns the canonical location of the source document.
	// Used for citations and traceability.
	SourceURI() string

	// AsText returns a natural language representation of this unit
	// suitable for injection into an LLM prompt.
	AsText() string

	// Score returns the most recent relevance score.
	Score() float64

	// Scores returns the complete scoring history of this unit,
	// from initial retrieval through all reranking passes.
	Scores() []ScoreEntry

	// WithScore returns a copy of this unit with the given score
	// appended to the score history. The stage parameter identifies
	// which pipeline stage assigned the score.
	WithScore(stage string, score float64) RetrievedUnit

	// Metadata returns arbitrary key-value pairs attached to this unit.
	Metadata() map[string]any
}

RetrievedUnit is the common interface for everything Dory can retrieve, regardless of which retrieval strategy produced it. The pipeline — reranking, authorization, and prompt injection — works exclusively against this interface, remaining agnostic about the concrete type.
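
The copy-on-score contract of WithScore can be illustrated with a stripped-down type. This is a sketch under assumptions: unit is a hypothetical type carrying only the scoring methods, its WithScore returns *unit rather than RetrievedUnit, and returning 0 from Score for an empty history is a choice made here, not documented behavior.

```go
package main

import "fmt"

// ScoreEntry is a local copy of the documented type, without JSON tags.
type ScoreEntry struct {
	Stage string
	Score float64
}

// unit is a hypothetical minimal type showing only the scoring methods.
type unit struct {
	id     string
	scores []ScoreEntry
}

// Score returns the most recent score, or 0 if none has been assigned.
func (u *unit) Score() float64 {
	if len(u.scores) == 0 {
		return 0
	}
	return u.scores[len(u.scores)-1].Score
}

func (u *unit) Scores() []ScoreEntry { return u.scores }

// WithScore copies the unit and its history before appending, so the
// receiver and its score slice are left unchanged.
func (u *unit) WithScore(stage string, score float64) *unit {
	history := make([]ScoreEntry, len(u.scores), len(u.scores)+1)
	copy(history, u.scores)
	history = append(history, ScoreEntry{Stage: stage, Score: score})
	return &unit{id: u.id, scores: history}
}

func main() {
	retrieved := (&unit{id: "chunk-1"}).WithScore("vector", 0.82)
	reranked := retrieved.WithScore("crossencoder", 0.97)
	fmt.Println(retrieved.Score(), len(retrieved.Scores())) // 0.82 1
	fmt.Println(reranked.Score(), len(reranked.Scores()))   // 0.97 2
}
```

Copying the slice before appending matters: appending to a shared backing array could let a later stage overwrite the history seen through an earlier copy.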

func UnwrapUnit

func UnwrapUnit(e UnitEnvelope) (RetrievedUnit, error)

UnwrapUnit recovers a RetrievedUnit from an envelope.

type Retriever

type Retriever interface {
	Retrieve(ctx context.Context, q Query) ([]RetrievedUnit, error)
}

Retriever finds the most relevant RetrievedUnits for a Query. All retrieval strategies — vector, sparse, hybrid, graph, structured, web — implement this interface.

type ScoreEntry

type ScoreEntry struct {
	// Stage is the name of the pipeline stage that assigned this score.
	// Examples: "vector", "bm25", "rrf_fusion", "crossencoder", "final".
	Stage string `json:"stage"`

	// Score is the relevance score assigned at this stage.
	Score float64 `json:"score"`
}

ScoreEntry records a single scoring event in a unit's retrieval history.

type ScoredChunk

type ScoredChunk struct {
	Chunk *Chunk
	Score float64
}

ScoredChunk is a Chunk returned from a vector store search, paired with its similarity score.

type SearchRequest

type SearchRequest struct {
	// QueryVector is the embedding of the user's (possibly transformed) query.
	QueryVector []float32

	// TopK is the maximum number of results to return.
	TopK int

	// Filter, if non-nil, restricts the search to chunks matching
	// these metadata conditions. Tenant isolation and pre-filter
	// authorization constraints are passed here.
	Filter *MetadataFilter
}

SearchRequest bundles everything a VectorStore needs to execute a search.

type Splitter

type Splitter interface {
	// Split takes a Document and returns the chunks produced from it.
	// Implementations must propagate doc.ID as each chunk's SourceDocumentID
	// and doc.Metadata as the base for each chunk's metadata.
	Split(ctx context.Context, doc *Document) ([]*Chunk, error)
}

Splitter transforms a Document into a sequence of Chunks. Each concrete implementation in the chunk/ sub-package represents a different strategy for finding good chunk boundaries.

type StringContent

type StringContent struct {
	// contains filtered or unexported fields
}

StringContent is a Content backed by a plain UTF-8 string. This is the most common case for pre-extracted text.

func TextContent

func TextContent(text, mimeType string) *StringContent

TextContent creates a StringContent with the given text and mime type. Pass an empty mimeType to default to "text/plain".

Example
package main

import (
	"fmt"

	"github.com/i33ym/dory"
)

func main() {
	c := dory.TextContent("some plain text", "")
	text, _ := c.Text()
	fmt.Println(text)
	fmt.Println(c.MimeType())
	fmt.Println(c.Size())
}
Output:
some plain text
text/plain
15

func (*StringContent) MimeType

func (s *StringContent) MimeType() string

func (*StringContent) Reader

func (s *StringContent) Reader() (io.ReadCloser, error)

func (*StringContent) Size

func (s *StringContent) Size() int64

func (*StringContent) Text

func (s *StringContent) Text() (string, error)

type StructuredRow

type StructuredRow struct {

	// Columns preserves the relational structure of the row,
	// keyed by column name.
	Columns map[string]any
	// contains filtered or unexported fields
}

StructuredRow is the concrete RetrievedUnit for structured retrieval — the case where the knowledge base is a database and the retriever executed a generated SQL query.

func NewStructuredRow

func NewStructuredRow(id, sourceDocID string, columns map[string]any, metadata map[string]any) *StructuredRow

NewStructuredRow constructs a StructuredRow with the required identity fields.

func (*StructuredRow) AsText

func (s *StructuredRow) AsText() string

func (*StructuredRow) ID

func (s *StructuredRow) ID() string

func (*StructuredRow) MarshalJSON

func (s *StructuredRow) MarshalJSON() ([]byte, error)

func (*StructuredRow) Metadata

func (s *StructuredRow) Metadata() map[string]any

func (*StructuredRow) Score

func (s *StructuredRow) Score() float64

func (*StructuredRow) Scores

func (s *StructuredRow) Scores() []ScoreEntry

func (*StructuredRow) SourceDocumentID

func (s *StructuredRow) SourceDocumentID() string

func (*StructuredRow) SourceURI

func (s *StructuredRow) SourceURI() string

func (*StructuredRow) UnmarshalJSON

func (s *StructuredRow) UnmarshalJSON(data []byte) error

func (*StructuredRow) WithScore

func (s *StructuredRow) WithScore(stage string, score float64) RetrievedUnit

type Subject

type Subject string

Subject identifies the entity making the retrieval request.

type TestCase

type TestCase struct {
	// ID uniquely identifies this test case for result tracking.
	ID string

	// Question is the natural language query to evaluate.
	Question string

	// ReferenceAnswer is a high-quality answer to the question.
	// Used to score faithfulness and answer relevance.
	ReferenceAnswer string

	// RelevantDocumentIDs, if provided, are the document IDs that
	// should appear in the retrieved context.
	// Used to score context precision and context recall.
	RelevantDocumentIDs []string
}

TestCase is a single evaluation example.

type UnitEnvelope

type UnitEnvelope struct {
	Type UnitType        `json:"type"`
	Data json.RawMessage `json:"data"`
}

UnitEnvelope is a serializable wrapper around a RetrievedUnit. It carries a type discriminator so that deserializers know which concrete type to decode into.

func WrapUnit

func WrapUnit(u RetrievedUnit) (UnitEnvelope, error)

WrapUnit packs a RetrievedUnit into a serializable envelope.
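
The envelope pattern itself is easy to sketch with encoding/json. In the example below, UnitEnvelope and the UnitType constants are local copies of the documented declarations, while chunk, graphFact, wrap, and unwrap are hypothetical stand-ins; the payload field names are illustrative and not Dory's wire format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type UnitType string

const (
	UnitTypeChunk     UnitType = "chunk"
	UnitTypeGraphFact UnitType = "graph_fact"
)

type UnitEnvelope struct {
	Type UnitType        `json:"type"`
	Data json.RawMessage `json:"data"`
}

// chunk and graphFact are hypothetical payloads for illustration only.
type chunk struct {
	ID   string `json:"id"`
	Text string `json:"text"`
}

type graphFact struct {
	Subject   string `json:"subject"`
	Predicate string `json:"predicate"`
	Object    string `json:"object"`
}

// wrap picks the discriminator from the concrete type, then encodes the payload.
func wrap(v any) (UnitEnvelope, error) {
	var t UnitType
	switch v.(type) {
	case chunk:
		t = UnitTypeChunk
	case graphFact:
		t = UnitTypeGraphFact
	default:
		return UnitEnvelope{}, fmt.Errorf("unsupported unit type %T", v)
	}
	data, err := json.Marshal(v)
	if err != nil {
		return UnitEnvelope{}, err
	}
	return UnitEnvelope{Type: t, Data: data}, nil
}

// unwrap reads the discriminator to choose the concrete type to decode into.
func unwrap(e UnitEnvelope) (any, error) {
	switch e.Type {
	case UnitTypeChunk:
		var c chunk
		err := json.Unmarshal(e.Data, &c)
		return c, err
	case UnitTypeGraphFact:
		var g graphFact
		err := json.Unmarshal(e.Data, &g)
		return g, err
	}
	return nil, fmt.Errorf("unknown unit type %q", e.Type)
}

func main() {
	env, err := wrap(graphFact{Subject: "Dory", Predicate: "is_a", Object: "library"})
	if err != nil {
		panic(err)
	}
	wire, err := json.Marshal(env)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(wire))

	var decoded UnitEnvelope
	if err := json.Unmarshal(wire, &decoded); err != nil {
		panic(err)
	}
	u, err := unwrap(decoded)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%T %+v\n", u, u)
}
```

Keeping Data as json.RawMessage defers decoding of the payload until the Type field has been read, which is what lets a mixed slice of units round-trip through JSON.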

type UnitType

type UnitType string

UnitType identifies the concrete type of a serialized RetrievedUnit.

const (
	UnitTypeChunk         UnitType = "chunk"
	UnitTypeGraphFact     UnitType = "graph_fact"
	UnitTypeStructuredRow UnitType = "structured_row"
)

type VectorStore

type VectorStore interface {
	// Store persists a set of chunks. Implementations decide how to
	// physically store the vector, text, and metadata fields.
	Store(ctx context.Context, chunks []*Chunk) error

	// Search finds the top-k chunks whose vectors are nearest to the
	// query vector, applying any metadata filter before scoring.
	Search(ctx context.Context, req SearchRequest) ([]ScoredChunk, error)

	// Delete removes chunks by their IDs. Called on re-ingestion
	// or when a document is permanently removed.
	Delete(ctx context.Context, ids []string) error
}

VectorStore is the persistence and similarity search abstraction. The library never depends on a concrete implementation — only on this contract.
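
The Search contract (filter first, then rank by similarity, then truncate to top-k) can be sketched as a brute-force scan. This is an illustration only: item, scored, cosine, and search are hypothetical names, a single tenant string stands in for a MetadataFilter, and cosine similarity is assumed as the distance measure. It is not the implementation behind store.NewMemory.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// item is a hypothetical stand-in for a stored chunk.
type item struct {
	ID     string
	Vector []float32
	Tenant string
}

// scored pairs an item ID with its similarity to the query.
type scored struct {
	ID    string
	Score float64
}

// cosine returns the cosine similarity of a and b, or 0 when the
// lengths differ or either vector has zero magnitude.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// search applies the tenant filter before scoring, sorts by descending
// similarity, and keeps at most topK results.
func search(items []item, query []float32, tenant string, topK int) []scored {
	var out []scored
	for _, it := range items {
		if tenant != "" && it.Tenant != tenant {
			continue
		}
		out = append(out, scored{ID: it.ID, Score: cosine(query, it.Vector)})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	if topK >= 0 && len(out) > topK {
		out = out[:topK]
	}
	return out
}

func main() {
	items := []item{
		{ID: "a", Vector: []float32{1, 0}, Tenant: "acme"},
		{ID: "b", Vector: []float32{0, 1}, Tenant: "acme"},
		{ID: "c", Vector: []float32{1, 1}, Tenant: "acme"},
		{ID: "d", Vector: []float32{1, 0}, Tenant: "globex"},
	}
	for _, r := range search(items, []float32{1, 0}, "acme", 2) {
		fmt.Printf("%s %.3f\n", r.ID, r.Score)
	}
	// a 1.000
	// c 0.707
}
```

Item d matches the query exactly but belongs to another tenant, so it is excluded before any scoring takes place.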

Directories

Path                    Synopsis
auth                    Package auth provides authorization backend implementations for Dory.
chunk                   Package chunk provides text splitting strategies for Dory.
embed                   Package embed provides embedder implementations for Dory.
eval                    Package eval provides the evaluation pipeline for Dory.
examples/basic_rag      (command) basic_rag demonstrates the simplest possible Dory pipeline: fixed-size chunking, OpenAI embeddings, in-memory vector store, and vector retrieval.
examples/graph_rag      (command) graph_rag demonstrates graph-based retrieval using GraphFact triples.
examples/hybrid_rag     (command) hybrid_rag demonstrates hybrid retrieval combining dense vector search with BM25 sparse retrieval, fused via Reciprocal Rank Fusion (RRF).
examples/with_auth      (command) with_auth demonstrates Dory's authorization integration using the Allowlist backend in PostFilter mode.
internal/filter         Package filter provides MetadataFilter translation utilities used internally by VectorStore implementations.
internal/similarity     Package similarity provides vector similarity calculations used internally by Dory.
internal/tokenizer      Package tokenizer provides token counting utilities used internally by Dory's chunking strategies.
rerank                  Package rerank provides reranking implementations for Dory.
retrieve                Package retrieve provides retrieval strategy implementations for Dory.
store                   Package store provides VectorStore implementations for Dory.
