dory

Version: v0.1.0-alpha.4 | Published: Apr 7, 2026 | License: MIT | Imports: 10 | Imported by: 0

Repository

github.com/i33ym/dory

README ¶

Dory

Dory is a retrieval library for Go. It provides a modular pipeline for chunking, embedding, indexing, retrieving, reranking, and evaluating knowledge — with authorization built in from the ground up.

Every stage of the pipeline is expressed as a Go interface. Bring your own vector store, embedding model, and authorization backend.

Installation

go get github.com/i33ym/dory

Quick Start

ctx := context.Background()

// Create a document.
doc, _ := dory.NewDocument("doc-001", dory.TextContent("Your content here.", "text/plain"))

// Split into chunks.
splitter := chunk.NewFixed(chunk.FixedConfig{Size: 512, Overlap: 64})
chunks, _ := splitter.Split(ctx, doc)

// Embed and store.
embedder := embed.NewOpenAI("text-embedding-3-small")
vectorStore := store.NewMemory()
for _, c := range chunks {
    c.Vector, _ = embedder.Embed(ctx, c.AsText())
}
vectorStore.Store(ctx, chunks)

// Retrieve.
retriever := retrieve.NewVector(vectorStore, embedder)
results, _ := retriever.Retrieve(ctx, dory.Query{Text: "your question", TopK: 5})

Or wire everything together with a Pipeline:

// Reuse the embedder, vectorStore, and retriever from the snippet above so
// that the pipeline ingests into the same store the retriever searches.
pipe, _ := dory.NewPipeline(dory.PipelineConfig{
    Splitter:  chunk.NewFixed(chunk.FixedConfig{Size: 512, Overlap: 64}),
    Embedder:  embedder,
    Store:     vectorStore,
    Retriever: retriever,
    Reranker:  rerank.NewLostInTheMiddle(),
})

pipe.Ingest(ctx, doc)
results, _ := pipe.Retrieve(ctx, dory.Query{Text: "your question", TopK: 5})

See examples/ for hybrid retrieval, graph retrieval, and authorization demos.

What's Included

Chunking

Strategy	Constructor
Fixed-size with overlap	`chunk.NewFixed`
Recursive character splitting	`chunk.NewRecursive`
Sentence-aware grouping	`chunk.NewSentence`
Semantic boundary detection	`chunk.NewSemantic`
Late chunking	`chunk.NewLate`
Contextual retrieval	`chunk.NewContextual`
Proposition extraction	`chunk.NewProposition`

Retrieval

Strategy	Constructor
Dense vector search	`retrieve.NewVector`
BM25 sparse search	`retrieve.NewBM25`
Hybrid (RRF fusion)	`retrieve.NewHybrid`
Ensemble (multi-retriever)	`retrieve.NewEnsemble`
Query routing	`retrieve.NewRouter`
Knowledge graph	`retrieve.NewGraph`
Text-to-SQL	`retrieve.NewStructured`
Web search	`retrieve.NewWeb`

Reranking

Strategy	Constructor
Cross-encoder scoring	`rerank.NewCrossEncoder`
Lost-in-the-middle reordering	`rerank.NewLostInTheMiddle`

Vector Stores

Backend	Constructor
In-memory (dev/test)	`store.NewMemory`
PostgreSQL + pgvector	`store.NewPgVector`
Qdrant	`store.NewQdrant`

Authorization

Backend	Implementation
No-op (allow all)	`auth.NoopAuthorizer`
Allowlist	`auth.NewAllowlist`
OpenFGA	`auth.NewOpenFGA`
Casbin-style RBAC	`auth.NewCasbin`

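Pre-filter and post-filter authorization modes are selected via PipelineConfig.AuthMode. A Hybrid mode is also defined, but the PipelineConfig documentation notes that it is not yet implemented and falls back to PostFilter.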

Evaluation

Dory reports four metrics: context precision, context recall, faithfulness, and answer relevance. Faithfulness and answer relevance use LLM-as-judge scoring via a configurable JudgeFunc.

Contributing

Contributions are welcome. Please open an issue before submitting a pull request for significant changes. See CONTRIBUTING.md.

License

MIT

Documentation ¶

Overview ¶

Package dory provides a retrieval intelligence library for Go.

Dory is organized around a pipeline of composable, interface-driven stages:

Chunking — split documents into retrievable units
Embedding — transform text into vector representations
Indexing — store chunks in a searchable backend
Retrieval — find the most relevant units for a query
Reranking — reorder candidates by cross-encoder relevance
Authorization — filter results by what the caller is allowed to see
Evaluation — measure retrieval quality with quantitative metrics

Every stage is expressed as a Go interface. Dory ships with concrete implementations for each, but any implementation of the interface works — the library has no opinion about which vector store, embedding model, or authorization backend you use.

The canonical entry point for most users is Pipeline, which wires the stages together into a single coherent retrieval flow.

Index ¶

type Action
type AuthorizationMode
type Authorizer
type BytesContent
- func BinaryContent(data []byte, mimeType string) *BytesContent
- func (b *BytesContent) MimeType() string
- func (b *BytesContent) Reader() (io.ReadCloser, error)
- func (b *BytesContent) Size() int64
- func (b *BytesContent) Text() (string, error)
type CheckRequest
type Chunk
- func NewChunk(id, sourceDocID, text string, metadata map[string]any) *Chunk
- func NewChunkWithOptions(id, sourceDocID, text string, metadata map[string]any, sourceURI string, ...) *Chunk
- func (c *Chunk) AsText() string
- func (c *Chunk) ID() string
- func (c *Chunk) MarshalJSON() ([]byte, error)
- func (c *Chunk) Metadata() map[string]any
- func (c *Chunk) Score() float64
- func (c *Chunk) Scores() []ScoreEntry
- func (c *Chunk) SourceDocumentID() string
- func (c *Chunk) SourceURI() string
- func (c *Chunk) Text() string
- func (c *Chunk) UnmarshalJSON(data []byte) error
- func (c *Chunk) WithScore(stage string, score float64) RetrievedUnit
type Content
type Document
- func NewDocument(id string, content Content, opts ...DocumentOption) (*Document, error)
- func (d *Document) Content() Content
- func (d *Document) CreatedAt() time.Time
- func (d *Document) Fingerprint() [32]byte
- func (d *Document) ID() string
- func (d *Document) Language() string
- func (d *Document) Metadata() map[string]any
- func (d *Document) SourceURI() string
- func (d *Document) TenantID() string
- func (d *Document) UpdatedAt() time.Time
type DocumentOption
- func WithLanguage(tag string) DocumentOption
- func WithMetadata(key string, value any) DocumentOption
- func WithSourceURI(uri string) DocumentOption
- func WithTenantID(id string) DocumentOption
- func WithTimestamps(createdAt, updatedAt time.Time) DocumentOption
type Embedder
type EvalMetrics
type EvalResult
type Evaluator
type FilterOp
type FilterRequest
type GraphFact
- func NewGraphFact(id, sourceDocID, subject, predicate, object string, metadata map[string]any) *GraphFact
- func (g *GraphFact) AsText() string
- func (g *GraphFact) ID() string
- func (g *GraphFact) MarshalJSON() ([]byte, error)
- func (g *GraphFact) Metadata() map[string]any
- func (g *GraphFact) Score() float64
- func (g *GraphFact) Scores() []ScoreEntry
- func (g *GraphFact) SourceDocumentID() string
- func (g *GraphFact) SourceURI() string
- func (g *GraphFact) UnmarshalJSON(data []byte) error
- func (g *GraphFact) WithScore(stage string, score float64) RetrievedUnit
type Hook
- func NewLogHook() Hook
type MetadataFilter
type Pipeline
- func NewPipeline(config PipelineConfig) (*Pipeline, error)
- func (p *Pipeline) Delete(ctx context.Context, docIDs ...string) error
- func (p *Pipeline) Ingest(ctx context.Context, docs ...*Document) error
- func (p *Pipeline) Retrieve(ctx context.Context, q Query) ([]RetrievedUnit, error)
type PipelineConfig
type Position
type Query
type ReaderContent
- func StreamContent(open func() (io.ReadCloser, error), mimeType string, size int64) *ReaderContent
- func (r *ReaderContent) MimeType() string
- func (r *ReaderContent) Reader() (io.ReadCloser, error)
- func (r *ReaderContent) Size() int64
- func (r *ReaderContent) Text() (string, error)
type Reranker
type Resource
type ResourceSet
type RetrievedUnit
- func UnwrapUnit(e UnitEnvelope) (RetrievedUnit, error)
type Retriever
type ScoreEntry
type ScoredChunk
type SearchRequest
type Splitter
type StringContent
- func TextContent(text, mimeType string) *StringContent
- func (s *StringContent) MimeType() string
- func (s *StringContent) Reader() (io.ReadCloser, error)
- func (s *StringContent) Size() int64
- func (s *StringContent) Text() (string, error)
type StructuredRow
- func NewStructuredRow(id, sourceDocID string, columns map[string]any, metadata map[string]any) *StructuredRow
- func (s *StructuredRow) AsText() string
- func (s *StructuredRow) ID() string
- func (s *StructuredRow) MarshalJSON() ([]byte, error)
- func (s *StructuredRow) Metadata() map[string]any
- func (s *StructuredRow) Score() float64
- func (s *StructuredRow) Scores() []ScoreEntry
- func (s *StructuredRow) SourceDocumentID() string
- func (s *StructuredRow) SourceURI() string
- func (s *StructuredRow) UnmarshalJSON(data []byte) error
- func (s *StructuredRow) WithScore(stage string, score float64) RetrievedUnit
type Subject
type TestCase
type UnitEnvelope
- func WrapUnit(u RetrievedUnit) (UnitEnvelope, error)
type UnitType
type VectorStore

Types ¶

type Action ¶

type Action string

Action describes what the subject wants to do with the resource.

const (
	// ActionRead is the action checked for retrieval. Most RAG systems
	// only need this single action.
	ActionRead Action = "read"
)

type AuthorizationMode ¶

type AuthorizationMode int

AuthorizationMode controls where in the pipeline authorization is enforced.

const (
	// PostFilter retrieves candidates first, then authorizes each result.
	// This is the safe default: correct regardless of metadata staleness.
	PostFilter AuthorizationMode = iota

	// PreFilter passes authorization constraints to the VectorStore as
	// metadata filters before the similarity search runs.
	// Faster, but requires keeping chunk metadata in sync with the
	// authorization system when permissions change.
	PreFilter

	// Hybrid applies tenant isolation as a pre-filter and fine-grained
	// per-document authorization as a post-filter.
	Hybrid
)
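
The difference between the two implemented modes can be sketched without any of Dory's own types. The program below is only an illustration: `hit`, `search`, `postFilter`, and `preFilter` are hypothetical names, and the permission set is a plain map rather than an Authorizer.

```go
package main

import "fmt"

// hit is a hypothetical stand-in for a retrieved unit: a document ID and a score.
type hit struct {
	DocID string
	Score float64
}

// search is a toy similarity search. It returns the first topK hits from index
// (assumed to be sorted by descending score) that satisfy keep.
// A nil keep accepts every hit.
func search(index []hit, topK int, keep func(hit) bool) []hit {
	var out []hit
	for _, h := range index {
		if len(out) == topK {
			break
		}
		if keep == nil || keep(h) {
			out = append(out, h)
		}
	}
	return out
}

// postFilter searches first, then drops hits the caller may not read.
// It can return fewer than topK results.
func postFilter(index []hit, topK int, allowed map[string]bool) []hit {
	var out []hit
	for _, h := range search(index, topK, nil) {
		if allowed[h.DocID] {
			out = append(out, h)
		}
	}
	return out
}

// preFilter pushes the permission check into the search itself, so the
// search only ever considers permitted documents.
func preFilter(index []hit, topK int, allowed map[string]bool) []hit {
	return search(index, topK, func(h hit) bool { return allowed[h.DocID] })
}

func main() {
	index := []hit{{"a", 0.9}, {"b", 0.8}, {"c", 0.7}, {"d", 0.6}}
	allowed := map[string]bool{"a": true, "c": true, "d": true}

	fmt.Println(len(postFilter(index, 2, allowed))) // 1: "b" is removed after the search
	fmt.Println(len(preFilter(index, 2, allowed)))  // 2: "a" and "c"
}
```

In this toy setup the post-filter path returns fewer than TopK results because an unauthorized hit occupied a slot, which is one reason a retriever might over-fetch before authorization.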

type Authorizer ¶

type Authorizer interface {
	// Check answers: can this subject perform this action on this resource?
	Check(ctx context.Context, req CheckRequest) (bool, error)

	// Filter answers: which of these resources (or all resources if
	// Candidates is nil) can this subject access?
	Filter(ctx context.Context, req FilterRequest) (ResourceSet, error)
}

Authorizer is the authorization backend interface. OpenFGA, Casbin, simple allowlists, and the NoopAuthorizer all implement this.

type BytesContent ¶

type BytesContent struct {
	// contains filtered or unexported fields
}

BytesContent is a Content backed by raw bytes. Use this for binary formats like PDF or images where the content has not yet been extracted to text.

func BinaryContent ¶

func BinaryContent(data []byte, mimeType string) *BytesContent

BinaryContent creates a BytesContent with the given data and mime type.

func (*BytesContent) MimeType ¶

func (b *BytesContent) MimeType() string

func (*BytesContent) Reader ¶

func (b *BytesContent) Reader() (io.ReadCloser, error)

func (*BytesContent) Size ¶

func (b *BytesContent) Size() int64

func (*BytesContent) Text ¶

func (b *BytesContent) Text() (string, error)

type CheckRequest ¶

type CheckRequest struct {
	Subject  Subject
	Action   Action
	Resource Resource
}

CheckRequest is the input to a single authorization check.

type Chunk ¶

type Chunk struct {

	// Vector is the dense embedding of this chunk's text.
	// Nil until the embedder processes this chunk.
	Vector []float32

	// Position describes where in the source document this chunk came from.
	Position *Position

	// TokenCount is the number of tokens in this chunk's text,
	// computed by the Splitter at creation time. Zero if not computed.
	TokenCount int

	// ParentID, if non-empty, points to the larger parent chunk
	// this chunk was derived from (small-to-big retrieval).
	ParentID string

	// WindowText, if non-empty, is the surrounding sentence window.
	// When set, AsText returns this instead of the raw chunk text.
	WindowText string

	// ContextPrefix is a short LLM-generated sentence that situates
	// this chunk within its source document.
	ContextPrefix string
	// contains filtered or unexported fields
}

Chunk is the concrete RetrievedUnit for text-based retrieval strategies: vector search, sparse search, hybrid search, and their variants.

func NewChunk ¶

func NewChunk(id, sourceDocID, text string, metadata map[string]any) *Chunk

NewChunk constructs a Chunk with the required identity fields.

Example ¶

package main

import (
	"fmt"

	"github.com/i33ym/dory"
)

func main() {
	chunk := dory.NewChunk("chunk-1", "doc-1", "The quick brown fox.", nil)
	fmt.Println(chunk.ID())
	fmt.Println(chunk.AsText())
}

Output:
chunk-1
The quick brown fox.

func NewChunkWithOptions ¶

func NewChunkWithOptions(id, sourceDocID, text string, metadata map[string]any, sourceURI string, pos *Position, tokenCount int) *Chunk

NewChunkWithOptions constructs a Chunk with additional fields.

func (*Chunk) AsText ¶

func (c *Chunk) AsText() string

func (*Chunk) ID ¶

func (c *Chunk) ID() string

func (*Chunk) MarshalJSON ¶

func (c *Chunk) MarshalJSON() ([]byte, error)

func (*Chunk) Metadata ¶

func (c *Chunk) Metadata() map[string]any

func (*Chunk) Score ¶

func (c *Chunk) Score() float64

func (*Chunk) Scores ¶

func (c *Chunk) Scores() []ScoreEntry

func (*Chunk) SourceDocumentID ¶

func (c *Chunk) SourceDocumentID() string

func (*Chunk) SourceURI ¶

func (c *Chunk) SourceURI() string

func (*Chunk) Text ¶

func (c *Chunk) Text() string

Text returns the raw chunk text (before any window or context prefix).

func (*Chunk) UnmarshalJSON ¶

func (c *Chunk) UnmarshalJSON(data []byte) error

func (*Chunk) WithScore ¶

func (c *Chunk) WithScore(stage string, score float64) RetrievedUnit

type Content ¶

type Content interface {
	// Reader returns the content as a stream of bytes.
	// Callers are responsible for closing the reader.
	Reader() (io.ReadCloser, error)

	// Text returns the content as a UTF-8 string, if possible.
	// Returns an error if the content is binary or not yet extracted.
	// Splitters call this — they work on text, not bytes.
	Text() (string, error)

	// MimeType describes the format of the content.
	MimeType() string

	// Size returns the content length in bytes, or -1 if unknown.
	Size() int64
}

Content is the raw material of a Document. It abstracts over text, binary, and streaming content so that Dory's pipeline can handle each appropriately.

type Document ¶

type Document struct {
	// contains filtered or unexported fields
}

Document is the ingestion unit — a raw source of knowledge before it has been chunked or indexed. A document carries its content, its identity, and the metadata the authorizer and retriever will consult later.

Documents are created via NewDocument, which validates required fields and computes a content fingerprint for change detection.

func NewDocument ¶

func NewDocument(id string, content Content, opts ...DocumentOption) (*Document, error)

NewDocument constructs a validated Document. Returns an error if the document cannot be used by Dory's pipeline — for example, if the ID is empty or the content is nil.

Example ¶

package main

import (
	"fmt"

	"github.com/i33ym/dory"
)

func main() {
	content := dory.TextContent("Hello, Dory!", "text/plain")
	doc, err := dory.NewDocument("doc-1", content,
		dory.WithTenantID("acme"),
		dory.WithLanguage("en"),
		dory.WithMetadata("author", "alice"),
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(doc.ID())
	fmt.Println(doc.TenantID())
	fmt.Println(doc.Language())
	fmt.Println(doc.Metadata()["author"])
}

Output:
doc-1
acme
en
alice

func (*Document) Content ¶

func (d *Document) Content() Content

Content returns the document's content.

func (*Document) CreatedAt ¶

func (d *Document) CreatedAt() time.Time

CreatedAt returns when this document was first ingested.

func (*Document) Fingerprint ¶

func (d *Document) Fingerprint() [32]byte

Fingerprint returns the SHA-256 hash of this document's content. If two Documents have the same ID and the same Fingerprint, re-ingestion can be skipped safely.
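
The skip-on-unchanged idea can be sketched with the standard library alone. The `seen` map and `needsIngest` helper below are hypothetical and exist only for this example; how Dory itself tracks previously ingested fingerprints is not described here.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// fingerprint hashes document text with SHA-256, giving a [32]byte value.
func fingerprint(text string) [32]byte {
	return sha256.Sum256([]byte(text))
}

// seen is a hypothetical record of the last fingerprint ingested per document ID.
type seen map[string][32]byte

// needsIngest reports whether the document is new or its content changed,
// and records the fingerprint for next time.
func (s seen) needsIngest(id, text string) bool {
	fp := fingerprint(text)
	if old, ok := s[id]; ok && old == fp {
		return false
	}
	s[id] = fp
	return true
}

func main() {
	s := seen{}
	fmt.Println(s.needsIngest("doc-1", "hello"))  // true: first time
	fmt.Println(s.needsIngest("doc-1", "hello"))  // false: unchanged
	fmt.Println(s.needsIngest("doc-1", "hello!")) // true: content changed
}
```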

func (*Document) ID ¶

func (d *Document) ID() string

ID returns the document's unique identifier.

func (*Document) Language ¶

func (d *Document) Language() string

Language returns the BCP-47 language tag for this document's content.

func (*Document) Metadata ¶

func (d *Document) Metadata() map[string]any

Metadata returns the document's metadata.

func (*Document) SourceURI ¶

func (d *Document) SourceURI() string

SourceURI returns the canonical location of this document's original source.

func (*Document) TenantID ¶

func (d *Document) TenantID() string

TenantID returns the tenant this document belongs to.

func (*Document) UpdatedAt ¶

func (d *Document) UpdatedAt() time.Time

UpdatedAt returns the last time this document was re-ingested.

type DocumentOption ¶

type DocumentOption func(*Document) error

DocumentOption configures a Document at construction time.
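
DocumentOption follows the functional options pattern. The sketch below shows the general shape of that pattern with a hypothetical `document` type and lowercase option names; the validation rules are assumptions for the example and are not taken from Dory.

```go
package main

import (
	"errors"
	"fmt"
)

// document is a hypothetical stand-in for a type with unexported fields.
type document struct {
	id       string
	language string
	metadata map[string]any
}

// option mirrors the shape of DocumentOption: it mutates the value under
// construction and may reject invalid input.
type option func(*document) error

func withLanguage(tag string) option {
	return func(d *document) error {
		if tag == "" {
			return errors.New("language tag must not be empty")
		}
		d.language = tag
		return nil
	}
}

func withMetadata(key string, value any) option {
	return func(d *document) error {
		d.metadata[key] = value
		return nil
	}
}

// newDocument applies defaults first, then each option in order, and stops
// at the first error.
func newDocument(id string, opts ...option) (*document, error) {
	if id == "" {
		return nil, errors.New("id must not be empty")
	}
	d := &document{id: id, language: "en", metadata: map[string]any{}}
	for _, opt := range opts {
		if err := opt(d); err != nil {
			return nil, err
		}
	}
	return d, nil
}

func main() {
	d, err := newDocument("doc-1", withLanguage("de"), withMetadata("author", "alice"))
	fmt.Println(d.language, d.metadata["author"], err) // de alice <nil>
}
```

Because each option returns an error, the constructor can reject a bad value at construction time instead of producing a half-configured document.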

func WithLanguage ¶

func WithLanguage(tag string) DocumentOption

WithLanguage sets the BCP-47 language tag for this document's content. Used by sentence-aware chunking strategies to apply the correct sentence boundary detection rules. Defaults to "en" if not set.

func WithMetadata ¶

func WithMetadata(key string, value any) DocumentOption

WithMetadata attaches a key-value pair to the document's metadata.

func WithSourceURI ¶

func WithSourceURI(uri string) DocumentOption

WithSourceURI sets the canonical source location for this document. Examples: "s3://bucket/path/to/file.pdf", "https://docs.example.com/api".

func WithTenantID ¶

func WithTenantID(id string) DocumentOption

WithTenantID sets the tenant this document belongs to.

func WithTimestamps ¶

func WithTimestamps(createdAt, updatedAt time.Time) DocumentOption

WithTimestamps overrides the default creation and update timestamps.

type Embedder ¶

type Embedder interface {
	// Embed returns the vector representation of the given text.
	Embed(ctx context.Context, text string) ([]float32, error)

	// EmbedBatch embeds multiple texts in a single call.
	// Implementations that do not support native batching should
	// loop over Embed internally. Callers should prefer EmbedBatch
	// during ingestion to reduce API round-trips and cost.
	EmbedBatch(ctx context.Context, texts []string) ([][]float32, error)

	// Dimensions returns the dimensionality of the vectors this embedder
	// produces. The vector store needs this at collection creation time.
	Dimensions() int
}

Embedder transforms text into a dense vector representation. The library is agnostic about which model or provider is used — any implementation of this interface is interchangeable.

type EvalMetrics ¶

type EvalMetrics struct {
	// ContextPrecision measures what fraction of retrieved chunks
	// were actually relevant to the question.
	ContextPrecision *float64

	// ContextRecall measures what fraction of the information needed
	// to answer the question was present in the retrieved chunks.
	ContextRecall *float64

	// Faithfulness measures whether the generated answer is supported
	// by the retrieved context rather than the model's parametric knowledge.
	Faithfulness *float64

	// AnswerRelevance measures whether the generated answer actually
	// addresses what the question asked.
	AnswerRelevance *float64
}

EvalMetrics holds the computed scores for a single test case. All scores are in the range [0.0, 1.0]. A nil pointer means the metric was not requested or could not be computed.
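
As a worked example of one metric and of the nil-pointer convention, the sketch below computes context precision in its plain, unranked form: relevant retrieved chunks divided by all retrieved chunks. The function name and the use of chunk IDs are assumptions for illustration; Dory's evaluator may compute the metric differently.

```go
package main

import "fmt"

// contextPrecision returns the fraction of retrieved IDs that appear in the
// relevant set. It returns nil when nothing was retrieved, since the ratio is
// then undefined.
func contextPrecision(retrieved []string, relevant map[string]bool) *float64 {
	if len(retrieved) == 0 {
		return nil
	}
	hits := 0
	for _, id := range retrieved {
		if relevant[id] {
			hits++
		}
	}
	p := float64(hits) / float64(len(retrieved))
	return &p
}

func main() {
	relevant := map[string]bool{"c1": true, "c4": true}
	p := contextPrecision([]string{"c1", "c2", "c3", "c4"}, relevant)
	fmt.Println(*p)                                     // 0.5: 2 of 4 retrieved chunks are relevant
	fmt.Println(contextPrecision(nil, relevant) == nil) // true: metric could not be computed
}
```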

type EvalResult ¶

type EvalResult struct {
	TestCase        TestCase
	RetrievedUnits  []RetrievedUnit
	GeneratedAnswer string
	Metrics         EvalMetrics
}

EvalResult captures the full output of evaluating one TestCase.

type Evaluator ¶

type Evaluator interface {
	Evaluate(ctx context.Context, cases []TestCase) ([]EvalResult, error)
}

Evaluator runs a retrieval pipeline against a set of test cases and produces scored results for each.

type FilterOp ¶

type FilterOp string

FilterOp is the comparison operator in a MetadataFilter.

const (
	// FilterOpEq matches documents where the field equals the value exactly.
	FilterOpEq FilterOp = "eq"

	// FilterOpIn matches documents where the field equals any value in the list.
	FilterOpIn FilterOp = "in"

	// FilterOpAnyOf matches documents where a metadata array field contains
	// any value from the list. Used for multi-value fields like role lists.
	FilterOpAnyOf FilterOp = "any_of"
)

type FilterRequest ¶

type FilterRequest struct {
	Subject Subject
	Action  Action
	// Candidates, if non-nil, restricts the check to this set of resources.
	// If nil, implementations should return ALL authorized resources for
	// the subject — used for the pre-filter path.
	Candidates []Resource
}

FilterRequest is the input to a bulk authorization filter.

type GraphFact ¶

type GraphFact struct {
	Subject   string
	Predicate string
	Object    string
	// contains filtered or unexported fields
}

GraphFact is the concrete RetrievedUnit for graph retrieval. It represents a single fact extracted from the knowledge graph: a subject, a predicate (relationship type), and an object.

func NewGraphFact ¶

func NewGraphFact(id, sourceDocID, subject, predicate, object string, metadata map[string]any) *GraphFact

NewGraphFact constructs a GraphFact with the required identity fields.

func (*GraphFact) AsText ¶

func (g *GraphFact) AsText() string

func (*GraphFact) ID ¶

func (g *GraphFact) ID() string

func (*GraphFact) MarshalJSON ¶

func (g *GraphFact) MarshalJSON() ([]byte, error)

func (*GraphFact) Metadata ¶

func (g *GraphFact) Metadata() map[string]any

func (*GraphFact) Score ¶

func (g *GraphFact) Score() float64

func (*GraphFact) Scores ¶

func (g *GraphFact) Scores() []ScoreEntry

func (*GraphFact) SourceDocumentID ¶

func (g *GraphFact) SourceDocumentID() string

func (*GraphFact) SourceURI ¶

func (g *GraphFact) SourceURI() string

func (*GraphFact) UnmarshalJSON ¶

func (g *GraphFact) UnmarshalJSON(data []byte) error

func (*GraphFact) WithScore ¶

func (g *GraphFact) WithScore(stage string, score float64) RetrievedUnit

type Hook ¶

type Hook struct {
	// BeforeIngest is called before documents are ingested.
	// Receives the number of documents about to be processed.
	BeforeIngest func(ctx context.Context, docCount int)

	// AfterIngest is called after documents are ingested.
	// Receives the number of chunks produced and any error.
	AfterIngest func(ctx context.Context, chunkCount int, err error)

	// BeforeRetrieve is called before a retrieval query is executed.
	BeforeRetrieve func(ctx context.Context, query Query)

	// AfterRetrieve is called after retrieval completes.
	// Receives the number of results and any error.
	AfterRetrieve func(ctx context.Context, resultCount int, err error)

	// BeforeRerank is called before reranking.
	BeforeRerank func(ctx context.Context, query string, candidateCount int)

	// AfterRerank is called after reranking completes.
	AfterRerank func(ctx context.Context, resultCount int, err error)
}

Hook is called at specific points in the pipeline lifecycle. Hooks observe but do not modify pipeline behavior.

func NewLogHook ¶

func NewLogHook() Hook

NewLogHook creates a Hook that logs pipeline events using the standard log package.

type MetadataFilter ¶

type MetadataFilter struct {
	Field string
	Op    FilterOp
	Value any // string for Eq; []string for In and AnyOf
}

MetadataFilter is Dory's portable filter expression. It is intentionally minimal — just enough to express tenant isolation and authorization constraints. Each VectorStore implementation translates this into its native query language.
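
To make the three operators concrete, here is a sketch of how an in-memory store might evaluate a filter against one chunk's metadata. The `matches` and `contains` helpers are hypothetical, and the sketch assumes scalar fields hold strings and multi-value fields hold []string.

```go
package main

import "fmt"

type FilterOp string

const (
	FilterOpEq    FilterOp = "eq"
	FilterOpIn    FilterOp = "in"
	FilterOpAnyOf FilterOp = "any_of"
)

type MetadataFilter struct {
	Field string
	Op    FilterOp
	Value any
}

// contains reports whether s is an element of list.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// matches is a hypothetical in-memory evaluator. A missing field or a value
// of an unexpected type never matches.
func matches(f MetadataFilter, meta map[string]any) bool {
	switch f.Op {
	case FilterOpEq:
		want, ok := f.Value.(string)
		got, ok2 := meta[f.Field].(string)
		return ok && ok2 && got == want
	case FilterOpIn:
		want, ok := f.Value.([]string)
		got, ok2 := meta[f.Field].(string)
		return ok && ok2 && contains(want, got)
	case FilterOpAnyOf:
		want, ok := f.Value.([]string)
		got, ok2 := meta[f.Field].([]string)
		if !ok || !ok2 {
			return false
		}
		for _, g := range got {
			if contains(want, g) {
				return true
			}
		}
	}
	return false
}

func main() {
	meta := map[string]any{"tenant": "acme", "roles": []string{"eng", "ops"}}
	fmt.Println(matches(MetadataFilter{"tenant", FilterOpEq, "acme"}, meta))                        // true
	fmt.Println(matches(MetadataFilter{"tenant", FilterOpIn, []string{"globex", "initech"}}, meta)) // false
	fmt.Println(matches(MetadataFilter{"roles", FilterOpAnyOf, []string{"ops", "sales"}}, meta))    // true
}
```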

type Pipeline ¶

type Pipeline struct {
	// contains filtered or unexported fields
}

Pipeline wires Dory's pipeline stages together into a single coherent retrieval flow: ingest documents, retrieve relevant units, and optionally rerank and authorize the results.

func NewPipeline ¶

func NewPipeline(config PipelineConfig) (*Pipeline, error)

NewPipeline constructs a Pipeline from the given configuration. Returns an error if any required component is nil.

func (*Pipeline) Delete ¶

func (p *Pipeline) Delete(ctx context.Context, docIDs ...string) error

Delete removes chunks associated with the given document IDs from the vector store.

func (*Pipeline) Ingest ¶

func (p *Pipeline) Ingest(ctx context.Context, docs ...*Document) error

Ingest splits each document into chunks, embeds them in batch, and stores them in the vector store. This is the ingestion path.

func (*Pipeline) Retrieve ¶

func (p *Pipeline) Retrieve(ctx context.Context, q Query) ([]RetrievedUnit, error)

Retrieve finds relevant units for the given query. It calls the retriever, then optionally reranks, then optionally authorizes based on the configured AuthorizationMode.

type PipelineConfig ¶

type PipelineConfig struct {
	Splitter  Splitter
	Embedder  Embedder
	Store     VectorStore
	Retriever Retriever

	// Reranker, if non-nil, reorders retrieval results for higher precision.
	Reranker Reranker

	// Authorizer, if non-nil, enforces access control on retrieval results.
	Authorizer Authorizer

	// AuthMode controls where authorization is enforced. Defaults to PostFilter.
	// Hybrid mode is not yet implemented and falls back to PostFilter.
	AuthMode AuthorizationMode

	// Hooks are called at key points in the pipeline lifecycle.
	// Multiple hooks are called in the order they appear in the slice.
	Hooks []Hook
}

PipelineConfig holds the components needed to construct a Pipeline. Splitter, Embedder, Store, and Retriever are required; Reranker and Authorizer are optional.

type Position ¶

type Position struct {
	// StartByte and EndByte are the byte offsets in the original
	// document content. Used for precise deduplication and
	// for reconstructing the original context.
	StartByte int `json:"start_byte"`
	EndByte   int `json:"end_byte"`

	// Page is the page number in a paginated document (PDF, DOCX).
	// Nil for documents without pagination.
	Page *int `json:"page,omitempty"`

	// Section is the heading path to this chunk's location in a
	// structured document. For a Markdown file:
	// ["Introduction", "Background", "Prior Work"].
	// Nil for unstructured documents.
	Section []string `json:"section,omitempty"`
}

Position describes where in the source document a chunk came from.
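
The effect of the `omitempty` tags can be seen by marshalling a copy of the struct with the standard encoding/json package. The `encode` helper is hypothetical and exists only for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Position is copied from the declaration above, without the comments.
type Position struct {
	StartByte int      `json:"start_byte"`
	EndByte   int      `json:"end_byte"`
	Page      *int     `json:"page,omitempty"`
	Section   []string `json:"section,omitempty"`
}

// encode marshals p to JSON and returns it as a string ("" on error).
func encode(p Position) string {
	b, err := json.Marshal(p)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// An unpaginated, unstructured source: page and section are omitted.
	fmt.Println(encode(Position{StartByte: 0, EndByte: 42}))
	// {"start_byte":0,"end_byte":42}

	page := 3
	fmt.Println(encode(Position{
		StartByte: 100,
		EndByte:   180,
		Page:      &page,
		Section:   []string{"Introduction", "Background"},
	}))
	// {"start_byte":100,"end_byte":180,"page":3,"section":["Introduction","Background"]}
}
```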

type Query ¶

type Query struct {
	// Text is the raw natural language question from the user.
	Text string

	// TenantID is mandatory for multi-tenant knowledge bases.
	// Retrievers must enforce tenant isolation before any other
	// filtering. An empty TenantID is valid only for single-tenant systems.
	TenantID string

	// Subject is the identity of the caller for authorization checks.
	// Passed to the Authorizer when authorization is enabled.
	Subject string

	// TopK is the maximum number of results the caller wants.
	// Retrievers may internally over-fetch (e.g., for reranking)
	// but should return at most TopK results.
	TopK int

	// Filters are additional metadata constraints the caller wants
	// applied beyond tenant isolation and authorization.
	Filters []MetadataFilter
}

Query carries everything a retriever needs to find relevant units.

type ReaderContent ¶

type ReaderContent struct {
	// contains filtered or unexported fields
}

ReaderContent is a Content backed by a lazy reader function. Use this for streaming large files without loading them into memory.

func StreamContent ¶

func StreamContent(open func() (io.ReadCloser, error), mimeType string, size int64) *ReaderContent

StreamContent creates a ReaderContent with the given reader factory. The open function is called each time Reader() is invoked, allowing multiple reads of the same content. Pass size=-1 if the size is unknown.

func (*ReaderContent) MimeType ¶

func (r *ReaderContent) MimeType() string

func (*ReaderContent) Reader ¶

func (r *ReaderContent) Reader() (io.ReadCloser, error)

func (*ReaderContent) Size ¶

func (r *ReaderContent) Size() int64

func (*ReaderContent) Text ¶

func (r *ReaderContent) Text() (string, error)

type Reranker ¶

type Reranker interface {
	// Rerank takes the original query text and the candidate units
	// returned by the retriever, and returns them in a new order
	// with updated scores. The returned slice may be shorter than
	// the input if the reranker applies a relevance threshold.
	Rerank(ctx context.Context, query string, units []RetrievedUnit) ([]RetrievedUnit, error)
}

Reranker reorders a slice of RetrievedUnits by their relevance to the original query. It operates after initial retrieval, trading latency for precision.
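
A reranker does not have to call a model. One common form of lost-in-the-middle reordering puts the strongest candidates at both ends of the list and the weakest in the middle. The sketch below illustrates that idea with a hypothetical `scored` type and `reorderEdges` function; it is not necessarily how rerank.NewLostInTheMiddle is implemented.

```go
package main

import (
	"fmt"
	"sort"
)

// scored is a hypothetical stand-in for a retrieved unit.
type scored struct {
	ID    string
	Score float64
}

// reorderEdges sorts by descending score, then deals items alternately to the
// front and the back of the result, so the weakest land in the middle.
func reorderEdges(units []scored) []scored {
	sorted := append([]scored(nil), units...) // do not mutate the input
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Score > sorted[j].Score })

	out := make([]scored, len(sorted))
	front, back := 0, len(sorted)-1
	for i, u := range sorted {
		if i%2 == 0 {
			out[front] = u
			front++
		} else {
			out[back] = u
			back--
		}
	}
	return out
}

func main() {
	units := []scored{{"a", 0.9}, {"b", 0.8}, {"c", 0.7}, {"d", 0.6}, {"e", 0.5}}
	for _, u := range reorderEdges(units) {
		fmt.Println(u.ID, u.Score)
	}
	// Resulting order of IDs: a, c, e, d, b
}
```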

type Resource ¶

type Resource string

Resource identifies a document or chunk for authorization purposes.

type ResourceSet ¶

type ResourceSet struct {
	// Resources is the explicit list of authorized resource IDs.
	Resources []Resource

	// Predicate, if non-nil, can be passed directly to a VectorStore
	// to restrict the search space at the database level.
	Predicate *MetadataFilter
}

ResourceSet is the result of a FilterRequest.

type RetrievedUnit ¶

type RetrievedUnit interface {
	// ID returns a stable unique identifier for this unit.
	ID() string

	// SourceDocumentID returns the document or resource this unit came from.
	SourceDocumentID() string

	// SourceURI returns the canonical location of the source document.
	// Used for citations and traceability.
	SourceURI() string

	// AsText returns a natural language representation of this unit
	// suitable for injection into an LLM prompt.
	AsText() string

	// Score returns the most recent relevance score.
	Score() float64

	// Scores returns the complete scoring history of this unit,
	// from initial retrieval through all reranking passes.
	Scores() []ScoreEntry

	// WithScore returns a copy of this unit with the given score
	// appended to the score history. The stage parameter identifies
	// which pipeline stage assigned the score.
	WithScore(stage string, score float64) RetrievedUnit

	// Metadata returns arbitrary key-value pairs attached to this unit.
	Metadata() map[string]any
}

RetrievedUnit is the common interface for everything Dory can retrieve, regardless of which retrieval strategy produced it. The pipeline — reranking, authorization, and prompt injection — works exclusively against this interface, remaining agnostic about the concrete type.
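
The copy-on-score contract of WithScore, Score, and Scores can be sketched with a small value type. The `unit` type below is hypothetical and returns its own concrete type rather than an interface, purely to keep the example short.

```go
package main

import "fmt"

type ScoreEntry struct {
	Stage string
	Score float64
}

// unit is a hypothetical minimal unit that tracks its score history.
type unit struct {
	id     string
	scores []ScoreEntry
}

// WithScore returns a copy with one more entry; the receiver is unchanged
// because the history slice is copied before appending.
func (u unit) WithScore(stage string, score float64) unit {
	next := make([]ScoreEntry, len(u.scores), len(u.scores)+1)
	copy(next, u.scores)
	u.scores = append(next, ScoreEntry{Stage: stage, Score: score})
	return u
}

// Score returns the most recent score, or 0 if none has been assigned.
func (u unit) Score() float64 {
	if len(u.scores) == 0 {
		return 0
	}
	return u.scores[len(u.scores)-1].Score
}

func main() {
	retrieved := unit{id: "chunk-1"}.WithScore("vector", 0.71)
	reranked := retrieved.WithScore("crossencoder", 0.93)
	fmt.Println(retrieved.Score(), reranked.Score(), len(reranked.scores)) // 0.71 0.93 2
}
```

Keeping the earlier value intact means a stage can compare a unit's score before and after reranking without defensive copying.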

func UnwrapUnit ¶

func UnwrapUnit(e UnitEnvelope) (RetrievedUnit, error)

UnwrapUnit recovers a RetrievedUnit from an envelope.

type Retriever ¶

type Retriever interface {
	Retrieve(ctx context.Context, q Query) ([]RetrievedUnit, error)
}

Retriever finds the most relevant RetrievedUnits for a Query. All retrieval strategies — vector, sparse, hybrid, graph, structured, web — implement this interface.
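
Hybrid retrieval typically merges the ranked outputs of several retrievers. The sketch below shows reciprocal rank fusion (RRF) over plain ID lists. The `rrf` function is hypothetical, and k = 60 is a conventional constant, not a value taken from Dory.

```go
package main

import (
	"fmt"
	"sort"
)

// rrf fuses several ranked ID lists with reciprocal rank fusion: each list
// contributes 1/(k+rank) to an ID's score, with rank starting at 1.
func rrf(k float64, rankings ...[]string) []string {
	scores := map[string]float64{}
	for _, ranking := range rankings {
		for i, id := range ranking {
			scores[id] += 1 / (k + float64(i+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	// Highest fused score first; ties broken by ID for a stable result.
	sort.Slice(ids, func(i, j int) bool {
		if scores[ids[i]] != scores[ids[j]] {
			return scores[ids[i]] > scores[ids[j]]
		}
		return ids[i] < ids[j]
	})
	return ids
}

func main() {
	vector := []string{"a", "b", "c"} // ranking from a dense retriever
	bm25 := []string{"b", "d", "a"}   // ranking from a sparse retriever
	fmt.Println(rrf(60, vector, bm25)) // [b a d c]
}
```

RRF uses only ranks, so it avoids having to normalise scores that live on different scales, such as cosine similarity and BM25.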

type ScoreEntry ¶

type ScoreEntry struct {
	// Stage is the name of the pipeline stage that assigned this score.
	// Examples: "vector", "bm25", "rrf_fusion", "crossencoder", "final".
	Stage string `json:"stage"`

	// Score is the relevance score assigned at this stage.
	Score float64 `json:"score"`
}

ScoreEntry records a single scoring event in a unit's retrieval history.

type ScoredChunk ¶

type ScoredChunk struct {
	Chunk *Chunk
	Score float64
}

ScoredChunk is a Chunk returned from a vector store search, paired with its similarity score.

type SearchRequest ¶

type SearchRequest struct {
	// QueryVector is the embedding of the user's (possibly transformed) query.
	QueryVector []float32

	// TopK is the maximum number of results to return.
	TopK int

	// Filter, if non-nil, restricts the search to chunks matching
	// these metadata conditions. Tenant isolation and pre-filter
	// authorization constraints are passed here.
	Filter *MetadataFilter
}

SearchRequest bundles everything a VectorStore needs to execute a search.

type Splitter ¶

type Splitter interface {
	// Split takes a Document and returns the chunks produced from it.
	// Implementations must propagate doc.ID as each chunk's SourceDocumentID
	// and doc.Metadata as the base for each chunk's metadata.
	Split(ctx context.Context, doc *Document) ([]*Chunk, error)
}

Splitter transforms a Document into a sequence of Chunks. Each concrete implementation in the chunk/ sub-package represents a different strategy for finding good chunk boundaries.

type StringContent ¶

type StringContent struct {
	// contains filtered or unexported fields
}

StringContent is a Content backed by a plain UTF-8 string. This is the most common case for pre-extracted text.

func TextContent ¶

func TextContent(text, mimeType string) *StringContent

TextContent creates a StringContent with the given text and mime type. Pass an empty mimeType to default to "text/plain".

Example ¶

package main

import (
	"fmt"

	"github.com/i33ym/dory"
)

func main() {
	c := dory.TextContent("some plain text", "")
	text, _ := c.Text()
	fmt.Println(text)
	fmt.Println(c.MimeType())
	fmt.Println(c.Size())
}

Output:
some plain text
text/plain
15

func (*StringContent) MimeType ¶

func (s *StringContent) MimeType() string

func (*StringContent) Reader ¶

func (s *StringContent) Reader() (io.ReadCloser, error)

func (*StringContent) Size ¶

func (s *StringContent) Size() int64

func (*StringContent) Text ¶

func (s *StringContent) Text() (string, error)

type StructuredRow ¶

type StructuredRow struct {

	// Columns preserves the relational structure of the row,
	// keyed by column name.
	Columns map[string]any
	// contains filtered or unexported fields
}

StructuredRow is the concrete RetrievedUnit for structured retrieval — the case where the knowledge base is a database and the retriever executed a generated SQL query.

func NewStructuredRow ¶

func NewStructuredRow(id, sourceDocID string, columns map[string]any, metadata map[string]any) *StructuredRow

NewStructuredRow constructs a StructuredRow with the required identity fields.

func (*StructuredRow) AsText ¶

func (s *StructuredRow) AsText() string

func (*StructuredRow) ID ¶

func (s *StructuredRow) ID() string

func (*StructuredRow) MarshalJSON ¶

func (s *StructuredRow) MarshalJSON() ([]byte, error)

func (*StructuredRow) Metadata ¶

func (s *StructuredRow) Metadata() map[string]any

func (*StructuredRow) Score ¶

func (s *StructuredRow) Score() float64

func (*StructuredRow) Scores ¶

func (s *StructuredRow) Scores() []ScoreEntry

func (*StructuredRow) SourceDocumentID ¶

func (s *StructuredRow) SourceDocumentID() string

func (*StructuredRow) SourceURI ¶

func (s *StructuredRow) SourceURI() string

func (*StructuredRow) UnmarshalJSON ¶

func (s *StructuredRow) UnmarshalJSON(data []byte) error

func (*StructuredRow) WithScore ¶

func (s *StructuredRow) WithScore(stage string, score float64) RetrievedUnit

type Subject ¶

type Subject string

Subject identifies the entity making the retrieval request.

type TestCase ¶

type TestCase struct {
	// ID uniquely identifies this test case for result tracking.
	ID string

	// Question is the natural language query to evaluate.
	Question string

	// ReferenceAnswer is a high-quality answer to the question.
	// Used to score faithfulness and answer relevance.
	ReferenceAnswer string

	// RelevantDocumentIDs, if provided, are the document IDs that
	// should appear in the retrieved context.
	// Used to score context precision and context recall.
	RelevantDocumentIDs []string
}

TestCase is a single evaluation example.
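To make the role of RelevantDocumentIDs concrete, here is a minimal set-based sketch of context precision and context recall. The function precisionRecall is hypothetical, and the eval package may well use a different definition (for example rank-weighted or model-judged); this only shows the plain set arithmetic.

```go
package main

import "fmt"

// precisionRecall compares retrieved document IDs against the relevant set.
// Precision is hits / distinct retrieved IDs; recall is hits / relevant IDs.
// Duplicate retrieved IDs are counted once.
func precisionRecall(retrieved, relevant []string) (precision, recall float64) {
	want := make(map[string]bool, len(relevant))
	for _, id := range relevant {
		want[id] = true
	}
	seen := make(map[string]bool, len(retrieved))
	hits := 0
	for _, id := range retrieved {
		if seen[id] {
			continue
		}
		seen[id] = true
		if want[id] {
			hits++
		}
	}
	if len(seen) > 0 {
		precision = float64(hits) / float64(len(seen))
	}
	if len(want) > 0 {
		recall = float64(hits) / float64(len(want))
	}
	return precision, recall
}

func main() {
	retrieved := []string{"doc-1", "doc-7", "doc-3", "doc-4"}
	relevant := []string{"doc-1", "doc-3"} // e.g. TestCase.RelevantDocumentIDs
	p, r := precisionRecall(retrieved, relevant)
	fmt.Printf("precision=%.2f recall=%.2f\n", p, r)
	// precision=0.50 recall=1.00
}
```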

type UnitEnvelope ¶

type UnitEnvelope struct {
	Type UnitType        `json:"type"`
	Data json.RawMessage `json:"data"`
}

UnitEnvelope is a serializable wrapper around a RetrievedUnit. It carries a type discriminator so that deserializers know which concrete type to decode into.

func WrapUnit ¶

func WrapUnit(u RetrievedUnit) (UnitEnvelope, error)

WrapUnit packs a RetrievedUnit into a serializable envelope.

type UnitType ¶

type UnitType string

UnitType identifies the concrete type of a serialized RetrievedUnit.

const (
	UnitTypeChunk         UnitType = "chunk"
	UnitTypeGraphFact     UnitType = "graph_fact"
	UnitTypeStructuredRow UnitType = "structured_row"
)

type VectorStore ¶

type VectorStore interface {
	// Store persists a set of chunks. Implementations decide how to
	// physically store the vector, text, and metadata fields.
	Store(ctx context.Context, chunks []*Chunk) error

	// Search finds the top-k chunks whose vectors are nearest to the
	// query vector, applying any metadata filter before scoring.
	Search(ctx context.Context, req SearchRequest) ([]ScoredChunk, error)

	// Delete removes chunks by their IDs. Called on re-ingestion
	// or when a document is permanently removed.
	Delete(ctx context.Context, ids []string) error
}

VectorStore is the persistence and similarity search abstraction. The library never depends on a concrete implementation — only on this contract.

Directories ¶

Path	Synopsis
auth	Package auth provides authorization backend implementations for Dory.
chunk	Package chunk provides text splitting strategies for Dory.
embed	Package embed provides embedder implementations for Dory.
eval	Package eval provides the evaluation pipeline for Dory.
examples/basic_rag (command)	basic_rag demonstrates the simplest possible Dory pipeline: fixed-size chunking, OpenAI embeddings, in-memory vector store, and vector retrieval.
examples/graph_rag (command)	graph_rag demonstrates graph-based retrieval using GraphFact triples.
examples/hybrid_rag (command)	hybrid_rag demonstrates hybrid retrieval combining dense vector search with BM25 sparse retrieval, fused via Reciprocal Rank Fusion (RRF).
examples/with_auth (command)	with_auth demonstrates Dory's authorization integration using the Allowlist backend in PostFilter mode.
internal/filter	Package filter provides MetadataFilter translation utilities used internally by VectorStore implementations.
internal/similarity	Package similarity provides vector similarity calculations used internally by Dory.
internal/tokenizer	Package tokenizer provides token counting utilities used internally by Dory's chunking strategies.
rerank	Package rerank provides reranking implementations for Dory.
retrieve	Package retrieve provides retrieval strategy implementations for Dory.
store	Package store provides VectorStore implementations for Dory.