Documentation
¶
Index ¶
- Constants
- Variables
- func CanonicalJSON(e Event) ([]byte, error)
- func IDF(df, N int) float64
- func LoadIndexWithCursor(indexPath, cursorPath string) (*Index, *IndexCursor, bool, error)
- func Merge(a, b []string) []string
- func NGrams(tokens []string, n int) []string
- func PersistIndex(index *Index, path string, hiULID string) error
- func RebuildIndexFromJournalInto(ctx context.Context, j Journal, idx *Index) (string, error)
- func Stem(word string) string
- func Tokenize(s string) []string
- func TokenizeFull(s string, stopwords map[string]struct{}) []string
- func Unique(tokens []string) []string
- type BM25Okapi
- type BM25Result
- type Classifier
- type Event
- type EventType
- type Index
- func (ix *Index) Add(e Event)
- func (ix *Index) Excerpt(docID string) string
- func (ix *Index) MarshalJSON() ([]byte, error)
- func (ix *Index) Params() IndexParams
- func (ix *Index) Persist(path string) error
- func (ix *Index) Remove(id string)
- func (ix *Index) SearchBM25(query string, limit int) []ScoredHit
- func (ix *Index) SearchTrigram(query string, limit int) []ScoredHit
- func (ix *Index) SetParams(p IndexParams)
- func (ix *Index) TotalDocs() int
- func (ix *Index) UnmarshalJSON(data []byte) error
- type IndexCursor
- type IndexParams
- type Journal
- type JournalOptions
- type PostingList
- type Priority
- type Rule
- type RuleMatch
- type ScoredHit
- type Source
- type TrigramIndex
- type ULID
Constants ¶
const ( // ULIDLen is the length of a ULID string. ULIDLen = 26 // MaxEventSize is the maximum size of an event in bytes. MaxEventSize = 1024 * 1024 // 1 MB // DefaultBudget is the default recall budget in bytes. DefaultBudget = 4096 // MaxBudget is the maximum recall budget. MaxBudget = 1024 * 1024 // 1 MB )
const ( DefaultBM25K1 = 1.2 DefaultBM25B = 0.75 DefaultHeadingBoost = 5.0 )
Default index constants.
const ( KeyInputTokens = "input_tokens" // int - LLM input token count KeyOutputTokens = "output_tokens" // int - LLM output token count KeyCachedTokens = "cached_tokens" // int - prompt cache savings KeyModel = "model" // string - model name KeyCacheHit = "cache_hit" // bool - cache hit occurred )
Token data keys for LLM event tracking.
const ( KeyRawBytes = "raw_bytes" // int - pre-filter (post-redact) output size KeyReturnedBytes = "returned_bytes" // int - bytes actually returned to caller )
MCP byte-tracking keys for sandbox tool calls. Populated automatically by Exec/Read/Fetch — measure what the agent context would have consumed without intent filtering vs what was actually returned.
const DefaultDurability = "batched"
DefaultDurability is the default journal durability mode.
const TokenizerVersion = 1
TokenizerVersion tracks changes to the tokenizer that require rebuild.
Variables ¶
var ( ErrJournalFull = errors.New("journal has reached max bytes") ErrJournalNotFound = errors.New("journal not found") // ErrEventTooLarge is returned when a single event exceeds maxEventBytes. ErrEventTooLarge = errors.New("event exceeds max serialized size") )
var EnglishStopwords = map[string]struct{}{
"a": {}, "an": {}, "and": {}, "are": {}, "as": {}, "at": {},
"be": {}, "been": {}, "being": {}, "but": {}, "by": {},
"can": {}, "could": {}, "did": {}, "do": {}, "does": {},
"doing": {}, "done": {}, "for": {}, "from": {},
"had": {}, "has": {}, "have": {}, "having": {},
"he": {}, "her": {}, "here": {}, "him": {}, "his": {},
"how": {}, "i": {}, "if": {}, "in": {}, "into": {},
"is": {}, "it": {}, "its": {}, "just": {},
"me": {}, "my": {},
"no": {}, "not": {}, "of": {}, "on": {}, "or": {},
"our": {}, "out": {},
"said": {}, "she": {}, "so": {}, "some": {},
"that": {}, "the": {}, "their": {}, "them": {}, "then": {},
"there": {}, "these": {}, "they": {}, "this": {}, "those": {},
"to": {}, "too": {},
"us": {}, "was": {}, "we": {}, "were": {}, "what": {},
"when": {}, "where": {}, "which": {}, "while": {},
"who": {}, "will": {}, "with": {}, "would": {},
"you": {}, "your": {},
}
Default stopwords for English.
var Now = time.Now
Now returns the current time.
var TurkishStopwords = map[string]struct{}{
"bir": {}, "bu": {}, "da": {}, "de": {}, "daha": {},
"ile": {}, "için": {}, "kadar": {}, "ne": {}, "oysa": {},
"ve": {}, "ya": {}, "yani": {}, "zaten": {},
}
Turkish stopwords.
var Version = version.Current
Version is the DFMT version. Mirrors internal/version.Current — the single build-time-injected source of truth. Kept as a re-export to preserve the historical core.Version reference.
Functions ¶
func CanonicalJSON ¶
CanonicalJSON returns canonical JSON bytes for an event. Nested maps are sorted recursively via canonicalize — encoding/json already sorts map[string]X keys, but recursing explicitly insulates the signature from any future Marshaler behavior change and makes the invariant obvious.
func IDF ¶
IDF computes inverse document frequency. Uses the smoothed formula to avoid negative values.
func LoadIndexWithCursor ¶
func LoadIndexWithCursor(indexPath, cursorPath string) (*Index, *IndexCursor, bool, error)
LoadIndexWithCursor loads an index and cursor, returns whether rebuild needed.
func PersistIndex ¶
PersistIndex saves the index and cursor to disk atomically. Each file is written to a sibling <name>.tmp, fsynced, and renamed into place. A crash mid-write leaves the previous complete file intact instead of a truncated stub that would force a full rebuild.
func RebuildIndexFromJournalInto ¶ added in v0.2.0
RebuildIndexFromJournalInto streams the journal into an existing Index. It exists so the daemon can hand a fresh empty Index to handlers immediately (so the listener can come up on time) and populate it asynchronously, rather than blocking startup behind a 5-10s rebuild that times out the agent's first MCP call. Index is concurrent-safe via its internal RWMutex, so search/recall during rebuild observes a partial-but-growing view instead of "daemon not responding".
The context is honored so daemon.Stop can interrupt a long-running rebuild.
func Stem ¶
Stem returns the Porter stem of a word. Porter's vowel/consonant rules are defined over ASCII letters only — running it byte-wise on a UTF-8 string like "çalışma" treats every continuation byte as a consonant and produces nonsense stems. For any non-ASCII input we skip stemming and return the lowercased word unchanged; a dedicated Turkish stemmer would be needed to do better.
func TokenizeFull ¶
Tokenize splits text into lowercase tokens for indexing. It drops tokens shorter than 2 or longer than 64 characters, and removes stopwords.
Types ¶
type BM25Okapi ¶
type BM25Okapi struct {
// contains filtered or unexported fields
}
BM25Okapi implements the Okapi BM25 ranking function.
func NewBM25Okapi ¶
func NewBM25Okapi() *BM25Okapi
NewBM25Okapi creates a BM25 scorer with package-default parameters (k1=DefaultBM25K1=1.2, b=DefaultBM25B=0.75).
func NewBM25OkapiWithParams ¶ added in v0.2.2
NewBM25OkapiWithParams creates a scorer with operator-configured parameters. Zero or negative inputs fall back to the package defaults — defense-in-depth for the load path where Index.UnmarshalJSON leaves the per-Index k1/b at zero until the daemon calls SetParams. Production configs are gated by config.Validate so reaching the fallback usually means a freshly-deserialized index whose owner hasn't called SetParams yet.
type BM25Result ¶
BM25Result holds a scored document result.
type Classifier ¶
type Classifier struct {
// contains filtered or unexported fields
}
Classifier assigns priority tiers to events.
func NewClassifier ¶
func NewClassifier() *Classifier
NewClassifier creates a new Classifier with default priorities and seeded note-elevation rules. The seeded rules let dfmt_remember callers raise a note above the default P4 by attaching the right tag — without the seed, every manually-recorded note ranked equal to a tool.read event in the recall budget pass.
func (*Classifier) AddRule ¶
func (c *Classifier) AddRule(rule Rule)
AddRule adds a classification rule. The MessageRegex is compiled once here; an invalid regex is logged as a silent skip (matcher treats the rule's regex clause as always-unmet) rather than failing AddRule, so a malformed user rule doesn't prevent the classifier from being built.
func (*Classifier) Classify ¶
func (c *Classifier) Classify(e Event) Priority
Classify returns the priority for an event.
func (*Classifier) SetDefault ¶
func (c *Classifier) SetDefault(typ EventType, pri Priority)
SetDefault sets the default priority for an event type.
type Event ¶
type Event struct {
ID string `json:"id"`
TS time.Time `json:"ts"`
Project string `json:"project"`
Type EventType `json:"type"`
Priority Priority `json:"priority"`
Source Source `json:"source"`
Actor string `json:"actor,omitempty"`
Data map[string]any `json:"data,omitempty"`
Refs []string `json:"refs,omitempty"`
Tags []string `json:"tags,omitempty"`
Sig string `json:"sig"`
}
Event represents a single event in the journal.
func (*Event) ComputeSig ¶
ComputeSig computes the signature of the event. It uses the first 16 hex chars of SHA-256 of the canonical JSON.
type EventType ¶
type EventType string
EventType represents the type of an event.
const ( EvtFileRead EventType = "file.read" EvtFileEdit EventType = "file.edit" EvtFileCreate EventType = "file.create" EvtFileDelete EventType = "file.delete" EvtTaskCreate EventType = "task.create" EvtTaskUpdate EventType = "task.update" EvtTaskDone EventType = "task.done" EvtDecision EventType = "decision" EvtError EventType = "error" EvtGitCommit EventType = "git.commit" EvtGitCheckout EventType = "git.checkout" EvtGitPush EventType = "git.push" EvtGitStash EventType = "git.stash" EvtGitDiff EventType = "git.diff" EvtEnvCwd EventType = "env.cwd" EvtEnvVars EventType = "env.vars" EvtEnvInstall EventType = "env.install" EvtShellCmd EventType = "shell.cmd" EvtPrompt EventType = "prompt" EvtMCPCall EventType = "mcp.call" EvtSubagent EventType = "subagent" EvtSkill EventType = "skill" EvtRole EventType = "role" EvtIntent EventType = "intent" EvtDataRef EventType = "data.ref" EvtNote EventType = "note" EvtTombstone EventType = "tombstone" )
type Index ¶
type Index struct {
// contains filtered or unexported fields
}
Index implements an in-memory inverted index with BM25 scoring.
func NewIndex ¶
func NewIndex() *Index
NewIndex creates a new Index with package-default BM25 parameters (k1=DefaultBM25K1, b=DefaultBM25B, headingBoost=DefaultHeadingBoost). This is the zero-config constructor used by tests and the dfmt-bench harness; the daemon production path uses NewIndexWithParams so the operator's config.Index.* values flow through.
func NewIndexWithParams ¶ added in v0.2.2
func NewIndexWithParams(p IndexParams) *Index
NewIndexWithParams creates an Index with operator-configured BM25 parameters. Zero fields are treated as "use package defaults" by the search path (defense-in-depth), but production callers should pass fully-resolved values; Validate already gates input ranges at config-load time. ADR-0015 v0.4.
func RebuildIndexFromJournal ¶ added in v0.1.6
RebuildIndexFromJournal streams every event from j into a fresh Index. The returned hiID is the ID of the last event ingested (empty if the journal was empty), suitable for passing to PersistIndex so the next cursor reflects the rebuild.
Use when LoadIndexWithCursor returned needsRebuild=true — without this, callers that just took NewIndex() lose the entire historical journal from search/recall until the user generates new events. A tokenizer-version bump is the canonical case the version field was designed to handle.
func (*Index) Add ¶
Add adds an event to the index. If the event ID was already indexed the call is a no-op — this prevents totalDocs drift and duplicate posting-list entries when a caller (e.g. a retry path) re-submits the same event.
func (*Index) Excerpt ¶ added in v0.2.0
Excerpt returns the short text snippet attached to docID at index time, or "" when no excerpt is recorded (event was indexed before the excerpt feature, or never had a message/path field). Safe for concurrent reads.
func (*Index) MarshalJSON ¶ added in v0.1.2
MarshalJSON implements json.Marshaler for Index.
func (*Index) Params ¶ added in v0.2.2
func (ix *Index) Params() IndexParams
Params returns the currently configured BM25 / scoring parameters under read-lock. Used by tests and a future dfmt doctor row that surfaces the effective values to operators.
func (*Index) Remove ¶
Remove removes a document from the index. Cleans up both the stem and trigram posting lists and recomputes avgDocLen from scratch so the BM25 scoring stays consistent after deletes.
func (*Index) SearchBM25 ¶
SearchBM25 searches the index using BM25. Each query token contributes a BM25 partial score for every document that contains it — summed across tokens. The prior implementation iterated only the smallest posting list and used its own length as the document frequency for every token, which effectively collapsed multi-term queries down to single-term behavior.
func (*Index) SearchTrigram ¶ added in v0.2.0
SearchTrigram is the substring-match fallback layer. BM25 is the primary search path, but it relies on the Porter stemmer and the project's stopword list; tokens that the tokenizer drops or splits awkwardly (synthetic markers like "AUDIT_PROBE_XJ7Q3", UUID-style IDs, mixed-case all-caps acronyms) silently become unsearchable. Trigram match restores them: the query is tokenized identically to indexed text, every token of length >= 3 contributes its trigrams, and a doc is scored by how many query tokens it covers.
Score is the count of matched query tokens (so a doc that matches 3 of 4 query tokens outranks one that matches only 1). Layer is set to 2 so the response can report which layer produced the hit, and the per-tier BM25 layer (1) still wins on direct ties.
func (*Index) SetParams ¶ added in v0.2.2
func (ix *Index) SetParams(p IndexParams)
SetParams overrides the BM25 / scoring parameters in place. Used by the daemon load flow: LoadIndexWithCursor returns an Index whose k1/b/headingBoost are zero (not persisted on disk), and the daemon follows up with SetParams to apply config-derived values. Tests can also use this to exercise a mid-life parameter swap.
func (*Index) TotalDocs ¶ added in v0.2.2
TotalDocs returns the number of indexed documents under read-lock. Surfaced for observability (the /metrics endpoint publishes it as dfmt_index_docs); kept as a method rather than exposing the field so the lock contract is enforced at the call site.
func (*Index) UnmarshalJSON ¶ added in v0.1.2
UnmarshalJSON implements json.Unmarshaler for Index.
type IndexCursor ¶
type IndexCursor struct {
HiULID string `json:"hi_ulid"`
TokenVer int `json:"token_ver"`
TotalDocs int `json:"total_docs"`
AvgDocLen float64 `json:"avg_doc_len"`
}
IndexCursor tracks the state needed to resume indexing.
type IndexParams ¶ added in v0.2.2
IndexParams are the tunable BM25 and scoring parameters. Zero values in any field mean "use package defaults" — both the constructor (NewIndexWithParams) and the search path (NewBM25OkapiWithParams) implement that fallback so a freshly-deserialized Index stays scorable until SetParams is called.
type Journal ¶
type Journal interface {
Append(ctx context.Context, e Event) error
Stream(ctx context.Context, from string) (<-chan Event, error)
Checkpoint(ctx context.Context) (string, error)
Rotate(ctx context.Context) error
// Size returns the active journal's on-disk byte count. Excludes
// rotated archive files. Used by the Prometheus /metrics surface
// (dfmt_journal_bytes); a -1 reading at the gauge layer means
// the implementation returned a non-nil error here. ADR-0017.
Size() (int64, error)
Close() error
}
Journal appends events to an append-only JSONL file.
func OpenJournal ¶
func OpenJournal(path string, opt JournalOptions) (Journal, error)
OpenJournal opens or creates a journal at the given path.
type JournalOptions ¶
JournalOptions configures the journal.
type PostingList ¶
type PostingList struct {
IDs []string // ULIDs, sorted
TFs []uint32 // term frequencies, parallel to IDs
}
PostingList holds the document IDs and term frequencies for a term. TFs is uint32: uint16 would overflow on tokens repeated 65 536+ times (e.g. a huge log containing the same identifier over and over), silently corrupting BM25 scores.
type Priority ¶
type Priority string
Priority represents the priority tier of an event.
type RuleMatch ¶
type RuleMatch struct {
Type EventType `yaml:"type"`
PathGlob string `yaml:"path_glob,omitempty"`
MessageRegex string `yaml:"message_regex,omitempty"`
TagAny []string `yaml:"tag_any,omitempty"`
}
RuleMatch defines matching criteria for a classification rule.
All present clauses must match (AND semantics). TagAny is the one disjunctive clause: the event matches if ANY of the listed tags is present in Event.Tags. The previous schema (Type + PathGlob + MessageRegex only) had no way to elevate priority based on tags, so callers using dfmt_remember with tags like "audit" or "decision" landed at the default P4 priority — same as routine tool calls — and got dropped first under a tight byte-budget recall.
type Source ¶
type Source string
Source represents the source of an event.
type TrigramIndex ¶
type TrigramIndex struct {
// contains filtered or unexported fields
}
TrigramIndex provides fast substring search via trigram inverted index.
func NewTrigramIndex ¶
func NewTrigramIndex() *TrigramIndex
NewTrigramIndex creates a new trigram index.
func (*TrigramIndex) Add ¶
func (ti *TrigramIndex) Add(id string, text string)
Add indexes a document's tokens for trigram search. Every 3-character window within each token is indexed so the search side (which generates all windows from the query) can intersect them correctly.
func (*TrigramIndex) Search ¶
func (ti *TrigramIndex) Search(substring string) []string
Search finds document IDs that might contain the given substring. The substring is lowercased to match how Add indexes via TokenizeFull, which lowercases every token. Without this, a mixed-case query (e.g. "Path", "URL") silently returned zero matches.
type ULID ¶
type ULID string
ULID is a Universally Unique Lexicographically Sortable Identifier.
V-07 contract: ULID is an event identifier, NOT a security token. The crypto/rand fallback below degrades to pid+counter+nanotime when the OS entropy pool refuses to read — fine for "make sure two events in the same millisecond don't collide," but unsuitable for auth tokens, session IDs, CSRF nonces, or any value an attacker could be motivated to predict. Anyone repurposing this type for that role MUST replace the fallback with a hard error.