core

package

v0.2.2 Latest Latest Go to latest Published: May 1, 2026 License: MIT Imports: 25 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/ersinkoc/dfmt

Links

Open Source Insights

Documentation ¶

Index ¶

Constants
Variables
func CanonicalJSON(e Event) ([]byte, error)
func IDF(df, N int) float64
func LoadIndexWithCursor(indexPath, cursorPath string) (*Index, *IndexCursor, bool, error)
func Merge(a, b []string) []string
func NGrams(tokens []string, n int) []string
func PersistIndex(index *Index, path string, hiULID string) error
func RebuildIndexFromJournalInto(ctx context.Context, j Journal, idx *Index) (string, error)
func Stem(word string) string
func Tokenize(s string) []string
func TokenizeFull(s string, stopwords map[string]struct{}) []string
func Unique(tokens []string) []string
type BM25Okapi
- func NewBM25Okapi() *BM25Okapi
- func NewBM25OkapiWithParams(k1, b float64) *BM25Okapi
- func (bm *BM25Okapi) Score(tf int, docLen int, avgDocLen float64, df, N int) float64
type BM25Result
type Classifier
- func NewClassifier() *Classifier
- func (c *Classifier) AddRule(rule Rule)
- func (c *Classifier) Classify(e Event) Priority
- func (c *Classifier) SetDefault(typ EventType, pri Priority)
type Event
- func (e *Event) ComputeSig() string
- func (e *Event) Validate() bool
type EventType
type Index
- func LoadIndex(path string) (*Index, error)
- func NewIndex() *Index
- func NewIndexWithParams(p IndexParams) *Index
- func RebuildIndexFromJournal(ctx context.Context, j Journal) (*Index, string, error)
- func (ix *Index) Add(e Event)
- func (ix *Index) Excerpt(docID string) string
- func (ix *Index) MarshalJSON() ([]byte, error)
- func (ix *Index) Params() IndexParams
- func (ix *Index) Persist(path string) error
- func (ix *Index) Remove(id string)
- func (ix *Index) SearchBM25(query string, limit int) []ScoredHit
- func (ix *Index) SearchTrigram(query string, limit int) []ScoredHit
- func (ix *Index) SetParams(p IndexParams)
- func (ix *Index) TotalDocs() int
- func (ix *Index) UnmarshalJSON(data []byte) error
type IndexCursor
type IndexParams
type Journal
- func OpenJournal(path string, opt JournalOptions) (Journal, error)
type JournalOptions
type PostingList
type Priority
type Rule
type RuleMatch
type ScoredHit
type Source
type TrigramIndex
- func NewTrigramIndex() *TrigramIndex
- func (ti *TrigramIndex) Add(id string, text string)
- func (ti *TrigramIndex) Search(substring string) []string
type ULID
- func NewULID(ts time.Time) ULID
- func (u ULID) Time() time.Time

Constants ¶

View Source

const (
	// ULIDLen is the length of a ULID string.
	ULIDLen = 26

	// MaxEventSize is the maximum size of an event in bytes.
	MaxEventSize = 1024 * 1024 // 1 MB

	// DefaultBudget is the default recall budget in bytes.
	DefaultBudget = 4096

	// MaxBudget is the maximum recall budget.
	MaxBudget = 1024 * 1024 // 1 MB
)

View Source

const (
	DefaultBM25K1       = 1.2
	DefaultBM25B        = 0.75
	DefaultHeadingBoost = 5.0
)

Default index constants.

View Source

const (
	KeyInputTokens  = "input_tokens"  // int - LLM input token count
	KeyOutputTokens = "output_tokens" // int - LLM output token count
	KeyCachedTokens = "cached_tokens" // int - prompt cache savings
	KeyModel        = "model"         // string - model name
	KeyCacheHit     = "cache_hit"     // bool - cache hit occurred
)

Token data keys for LLM event tracking.

View Source

const (
	KeyRawBytes      = "raw_bytes"      // int - pre-filter (post-redact) output size
	KeyReturnedBytes = "returned_bytes" // int - bytes actually returned to caller
)

MCP byte-tracking keys for sandbox tool calls. Populated automatically by Exec/Read/Fetch — measure what the agent context would have consumed without intent filtering vs what was actually returned.

View Source

const DefaultDurability = "batched"

DefaultDurability is the default journal durability mode.

View Source

const TokenizerVersion = 1

TokenizerVersion tracks changes to the tokenizer that require rebuild.

Variables ¶

View Source

var (
	ErrJournalFull     = errors.New("journal has reached max bytes")
	ErrJournalNotFound = errors.New("journal not found")
	// ErrEventTooLarge is returned when a single event exceeds maxEventBytes.
	ErrEventTooLarge = errors.New("event exceeds max serialized size")
)

View Source

var EnglishStopwords = map[string]struct{}{
	"a": {}, "an": {}, "and": {}, "are": {}, "as": {}, "at": {},
	"be": {}, "been": {}, "being": {}, "but": {}, "by": {},
	"can": {}, "could": {}, "did": {}, "do": {}, "does": {},
	"doing": {}, "done": {}, "for": {}, "from": {},
	"had": {}, "has": {}, "have": {}, "having": {},
	"he": {}, "her": {}, "here": {}, "him": {}, "his": {},
	"how": {}, "i": {}, "if": {}, "in": {}, "into": {},
	"is": {}, "it": {}, "its": {}, "just": {},
	"me": {}, "my": {},
	"no": {}, "not": {}, "of": {}, "on": {}, "or": {},
	"our": {}, "out": {},
	"said": {}, "she": {}, "so": {}, "some": {},
	"that": {}, "the": {}, "their": {}, "them": {}, "then": {},
	"there": {}, "these": {}, "they": {}, "this": {}, "those": {},
	"to": {}, "too": {},
	"us": {}, "was": {}, "we": {}, "were": {}, "what": {},
	"when": {}, "where": {}, "which": {}, "while": {},
	"who": {}, "will": {}, "with": {}, "would": {},
	"you": {}, "your": {},
}

Default stopwords for English.

View Source

var Now = time.Now

Now returns the current time.

View Source

var TurkishStopwords = map[string]struct{}{
	"bir": {}, "bu": {}, "da": {}, "de": {}, "daha": {},
	"ile": {}, "için": {}, "kadar": {}, "ne": {}, "oysa": {},
	"ve": {}, "ya": {}, "yani": {}, "zaten": {},
}

Turkish stopwords.

View Source

var Version = version.Current

Version is the DFMT version. Mirrors internal/version.Current — the single build-time-injected source of truth. Kept as a re-export to preserve the historical core.Version reference.

Functions ¶

func CanonicalJSON ¶

func CanonicalJSON(e Event) ([]byte, error)

CanonicalJSON returns canonical JSON bytes for an event. Nested maps are sorted recursively via canonicalize — encoding/json already sorts map[string]X keys, but recursing explicitly insulates the signature from any future Marshaler behavior change and makes the invariant obvious.

func IDF ¶

func IDF(df, N int) float64

IDF computes inverse document frequency. Uses the smoothed formula to avoid negative values.

func LoadIndexWithCursor ¶

func LoadIndexWithCursor(indexPath, cursorPath string) (*Index, *IndexCursor, bool, error)

LoadIndexWithCursor loads an index and cursor, returns whether rebuild needed.

func Merge ¶

func Merge(a, b []string) []string

Merge merges two sorted token slices.

func NGrams ¶

func NGrams(tokens []string, n int) []string

NGrams generates character n-grams from tokens.

func PersistIndex ¶

func PersistIndex(index *Index, path string, hiULID string) error

PersistIndex saves the index and cursor to disk atomically. Each file is written to a sibling <name>.tmp, fsynced, and renamed into place. A crash mid-write leaves the previous complete file intact instead of a truncated stub that would force a full rebuild.

func RebuildIndexFromJournalInto ¶ added in v0.2.0

func RebuildIndexFromJournalInto(ctx context.Context, j Journal, idx *Index) (string, error)

RebuildIndexFromJournalInto streams the journal into an existing Index. It exists so the daemon can hand a fresh empty Index to handlers immediately (so the listener can come up on time) and populate it asynchronously, rather than blocking startup behind a 5-10s rebuild that times out the agent's first MCP call. Index is concurrent-safe via its internal RWMutex, so search/recall during rebuild observes a partial-but-growing view instead of "daemon not responding".

The context is honored so daemon.Stop can interrupt a long-running rebuild.

func Stem ¶

func Stem(word string) string

Stem returns the Porter stem of a word. Porter's vowel/consonant rules are defined over ASCII letters only — running it byte-wise on a UTF-8 string like "çalışma" treats every continuation byte as a consonant and produces nonsense stems. For any non-ASCII input we skip stemming and return the lowercased word unchanged; a dedicated Turkish stemmer would be needed to do better.

func Tokenize ¶

func Tokenize(s string) []string

Tokenize splits text into tokens for indexing.

func TokenizeFull ¶

func TokenizeFull(s string, stopwords map[string]struct{}) []string

Tokenize splits text into lowercase tokens for indexing. It drops tokens shorter than 2 or longer than 64 characters, and removes stopwords.

func Unique ¶

func Unique(tokens []string) []string

Unique returns unique tokens preserving order.

Types ¶

type BM25Okapi ¶

type BM25Okapi struct {
	// contains filtered or unexported fields
}

BM25Okapi implements the Okapi BM25 ranking function.

func NewBM25Okapi ¶

func NewBM25Okapi() *BM25Okapi

NewBM25Okapi creates a BM25 scorer with package-default parameters (k1=DefaultBM25K1=1.2, b=DefaultBM25B=0.75).

func NewBM25OkapiWithParams ¶ added in v0.2.2

func NewBM25OkapiWithParams(k1, b float64) *BM25Okapi

NewBM25OkapiWithParams creates a scorer with operator-configured parameters. Zero or negative inputs fall back to the package defaults — defense-in-depth for the load path where Index.UnmarshalJSON leaves the per-Index k1/b at zero until the daemon calls SetParams. Production configs are gated by config.Validate so reaching the fallback usually means a freshly-deserialized index whose owner hasn't called SetParams yet.

func (*BM25Okapi) Score ¶

func (bm *BM25Okapi) Score(tf int, docLen int, avgDocLen float64, df, N int) float64

Score computes the BM25 score for a single term.

type BM25Result ¶

type BM25Result struct {
	ID    string
	Score float64
}

BM25Result holds a scored document result.

type Classifier ¶

type Classifier struct {
	// contains filtered or unexported fields
}

Classifier assigns priority tiers to events.

func NewClassifier ¶

func NewClassifier() *Classifier

NewClassifier creates a new Classifier with default priorities and seeded note-elevation rules. The seeded rules let dfmt_remember callers raise a note above the default P4 by attaching the right tag — without the seed, every manually-recorded note ranked equal to a tool.read event in the recall budget pass.

func (*Classifier) AddRule ¶

func (c *Classifier) AddRule(rule Rule)

AddRule adds a classification rule. The MessageRegex is compiled once here; an invalid regex is logged as a silent skip (matcher treats the rule's regex clause as always-unmet) rather than failing AddRule, so a malformed user rule doesn't prevent the classifier from being built.

func (*Classifier) Classify ¶

func (c *Classifier) Classify(e Event) Priority

Classify returns the priority for an event.

func (*Classifier) SetDefault ¶

func (c *Classifier) SetDefault(typ EventType, pri Priority)

SetDefault sets the default priority for an event type.

type Event ¶

type Event struct {
	ID       string         `json:"id"`
	TS       time.Time      `json:"ts"`
	Project  string         `json:"project"`
	Type     EventType      `json:"type"`
	Priority Priority       `json:"priority"`
	Source   Source         `json:"source"`
	Actor    string         `json:"actor,omitempty"`
	Data     map[string]any `json:"data,omitempty"`
	Refs     []string       `json:"refs,omitempty"`
	Tags     []string       `json:"tags,omitempty"`
	Sig      string         `json:"sig"`
}

Event represents a single event in the journal.

func (*Event) ComputeSig ¶

func (e *Event) ComputeSig() string

ComputeSig computes the signature of the event. It uses the first 16 hex chars of SHA-256 of the canonical JSON.

func (*Event) Validate ¶

func (e *Event) Validate() bool

Validate checks if the event's signature is valid. Empty Sig is treated as "pre-fix event — skip validation but accept". Non-empty Sig must match ComputeSig() computed from the current canonical form.

type EventType ¶

type EventType string

EventType represents the type of an event.

const (
	EvtFileRead    EventType = "file.read"
	EvtFileEdit    EventType = "file.edit"
	EvtFileCreate  EventType = "file.create"
	EvtFileDelete  EventType = "file.delete"
	EvtTaskCreate  EventType = "task.create"
	EvtTaskUpdate  EventType = "task.update"
	EvtTaskDone    EventType = "task.done"
	EvtDecision    EventType = "decision"
	EvtError       EventType = "error"
	EvtGitCommit   EventType = "git.commit"
	EvtGitCheckout EventType = "git.checkout"
	EvtGitPush     EventType = "git.push"
	EvtGitStash    EventType = "git.stash"
	EvtGitDiff     EventType = "git.diff"
	EvtEnvCwd      EventType = "env.cwd"
	EvtEnvVars     EventType = "env.vars"
	EvtEnvInstall  EventType = "env.install"
	EvtShellCmd    EventType = "shell.cmd"
	EvtPrompt      EventType = "prompt"
	EvtMCPCall     EventType = "mcp.call"
	EvtSubagent    EventType = "subagent"
	EvtSkill       EventType = "skill"
	EvtRole        EventType = "role"
	EvtIntent      EventType = "intent"
	EvtDataRef     EventType = "data.ref"
	EvtNote        EventType = "note"
	EvtTombstone   EventType = "tombstone"
)

type Index ¶

type Index struct {
	// contains filtered or unexported fields
}

Index implements an in-memory inverted index with BM25 scoring.

func LoadIndex ¶

func LoadIndex(path string) (*Index, error)

LoadIndex loads an index from a file using JSON deserialization.

func NewIndex ¶

func NewIndex() *Index

NewIndex creates a new Index with package-default BM25 parameters (k1=DefaultBM25K1, b=DefaultBM25B, headingBoost=DefaultHeadingBoost). This is the zero-config constructor used by tests and the dfmt-bench harness; the daemon production path uses NewIndexWithParams so the operator's config.Index.* values flow through.

func NewIndexWithParams ¶ added in v0.2.2

func NewIndexWithParams(p IndexParams) *Index

NewIndexWithParams creates an Index with operator-configured BM25 parameters. Zero fields are treated as "use package defaults" by the search path (defense-in-depth), but production callers should pass fully-resolved values; Validate already gates input ranges at config-load time. ADR-0015 v0.4.

func RebuildIndexFromJournal ¶ added in v0.1.6

func RebuildIndexFromJournal(ctx context.Context, j Journal) (*Index, string, error)

RebuildIndexFromJournal streams every event from j into a fresh Index. The returned hiID is the ID of the last event ingested (empty if the journal was empty), suitable for passing to PersistIndex so the next cursor reflects the rebuild.

Use when LoadIndexWithCursor returned needsRebuild=true — without this, callers that just took NewIndex() lose the entire historical journal from search/recall until the user generates new events. A tokenizer-version bump is the canonical case the version field was designed to handle.

func (*Index) Add ¶

func (ix *Index) Add(e Event)

Add adds an event to the index. If the event ID was already indexed the call is a no-op — this prevents totalDocs drift and duplicate posting-list entries when a caller (e.g. a retry path) re-submits the same event.

func (*Index) Excerpt ¶ added in v0.2.0

func (ix *Index) Excerpt(docID string) string

Excerpt returns the short text snippet attached to docID at index time, or "" when no excerpt is recorded (event was indexed before the excerpt feature, or never had a message/path field). Safe for concurrent reads.

func (*Index) MarshalJSON ¶ added in v0.1.2

func (ix *Index) MarshalJSON() ([]byte, error)

MarshalJSON implements json.Marshaler for Index.

func (*Index) Params ¶ added in v0.2.2

func (ix *Index) Params() IndexParams

Params returns the currently configured BM25 / scoring parameters under read-lock. Used by tests and a future dfmt doctor row that surfaces the effective values to operators.

func (*Index) Persist ¶

func (ix *Index) Persist(path string) error

Persist saves the index to a file using JSON serialization.

func (*Index) Remove ¶

func (ix *Index) Remove(id string)

Remove removes a document from the index. Cleans up both the stem and trigram posting lists and recomputes avgDocLen from scratch so the BM25 scoring stays consistent after deletes.

func (*Index) SearchBM25 ¶

func (ix *Index) SearchBM25(query string, limit int) []ScoredHit

SearchBM25 searches the index using BM25. Each query token contributes a BM25 partial score for every document that contains it — summed across tokens. The prior implementation iterated only the smallest posting list and used its own length as the document frequency for every token, which effectively collapsed multi-term queries down to single-term behavior.

func (*Index) SearchTrigram ¶ added in v0.2.0

func (ix *Index) SearchTrigram(query string, limit int) []ScoredHit

SearchTrigram is the substring-match fallback layer. BM25 is the primary search path, but it relies on the Porter stemmer and the project's stopword list; tokens that the tokenizer drops or splits awkwardly (synthetic markers like "AUDIT_PROBE_XJ7Q3", UUID-style IDs, mixed-case all-caps acronyms) silently become unsearchable. Trigram match restores them: the query is tokenized identically to indexed text, every token of length >= 3 contributes its trigrams, and a doc is scored by how many query tokens it covers.

Score is the count of matched query tokens (so a doc that matches 3 of 4 query tokens outranks one that matches only 1). Layer is set to 2 so the response can report which layer produced the hit, and the per-tier BM25 layer (1) still wins on direct ties.

func (*Index) SetParams ¶ added in v0.2.2

func (ix *Index) SetParams(p IndexParams)

SetParams overrides the BM25 / scoring parameters in place. Used by the daemon load flow: LoadIndexWithCursor returns an Index whose k1/b/headingBoost are zero (not persisted on disk), and the daemon follows up with SetParams to apply config-derived values. Tests can also use this to exercise a mid-life parameter swap.

func (*Index) TotalDocs ¶ added in v0.2.2

func (ix *Index) TotalDocs() int

TotalDocs returns the number of indexed documents under read-lock. Surfaced for observability (the /metrics endpoint publishes it as dfmt_index_docs); kept as a method rather than exposing the field so the lock contract is enforced at the call site.

func (*Index) UnmarshalJSON ¶ added in v0.1.2

func (ix *Index) UnmarshalJSON(data []byte) error

UnmarshalJSON implements json.Unmarshaler for Index.

type IndexCursor ¶

type IndexCursor struct {
	HiULID    string  `json:"hi_ulid"`
	TokenVer  int     `json:"token_ver"`
	TotalDocs int     `json:"total_docs"`
	AvgDocLen float64 `json:"avg_doc_len"`
}

IndexCursor tracks the state needed to resume indexing.

type IndexParams ¶ added in v0.2.2

type IndexParams struct {
	K1           float64
	B            float64
	HeadingBoost float64
}

IndexParams are the tunable BM25 and scoring parameters. Zero values in any field mean "use package defaults" — both the constructor (NewIndexWithParams) and the search path (NewBM25OkapiWithParams) implement that fallback so a freshly-deserialized Index stays scorable until SetParams is called.

type Journal ¶

type Journal interface {
	Append(ctx context.Context, e Event) error
	Stream(ctx context.Context, from string) (<-chan Event, error)
	Checkpoint(ctx context.Context) (string, error)
	Rotate(ctx context.Context) error
	// Size returns the active journal's on-disk byte count. Excludes
	// rotated archive files. Used by the Prometheus /metrics surface
	// (dfmt_journal_bytes); a -1 reading at the gauge layer means
	// the implementation returned a non-nil error here. ADR-0017.
	Size() (int64, error)
	Close() error
}

Journal appends events to an append-only JSONL file.

func OpenJournal ¶

func OpenJournal(path string, opt JournalOptions) (Journal, error)

OpenJournal opens or creates a journal at the given path.

type JournalOptions ¶

type JournalOptions struct {
	Path     string
	MaxBytes int64
	Durable  bool
	BatchMS  int
	Compress bool
}

JournalOptions configures the journal.

type PostingList ¶

type PostingList struct {
	IDs []string // ULIDs, sorted
	TFs []uint32 // term frequencies, parallel to IDs
}

PostingList holds the document IDs and term frequencies for a term. TFs is uint32: uint16 would overflow on tokens repeated 65 536+ times (e.g. a huge log containing the same identifier over and over), silently corrupting BM25 scores.

type Priority ¶

type Priority string

Priority represents the priority tier of an event.

const (
	PriorityP1 Priority = "p1" // Critical: decisions, task outcomes
	PriorityP2 Priority = "p2" // Important: file edits, git operations
	PriorityP3 Priority = "p3" // Normal: shell commands, searches
	PriorityP4 Priority = "p4" // Low: reads, minor events
)

Priority tiers.

const (
	PriP1 Priority = "p1"
	PriP2 Priority = "p2"
	PriP3 Priority = "p3"
	PriP4 Priority = "p4"
)

type Rule ¶

type Rule struct {
	Match    RuleMatch `yaml:"match"`
	Priority Priority  `yaml:"priority"`
}

Rule defines a classification rule.

type RuleMatch ¶

type RuleMatch struct {
	Type         EventType `yaml:"type"`
	PathGlob     string    `yaml:"path_glob,omitempty"`
	MessageRegex string    `yaml:"message_regex,omitempty"`
	TagAny       []string  `yaml:"tag_any,omitempty"`
}

RuleMatch defines matching criteria for a classification rule.

All present clauses must match (AND semantics). TagAny is the one disjunctive clause: the event matches if ANY of the listed tags is present in Event.Tags. The previous schema (Type + PathGlob + MessageRegex only) had no way to elevate priority based on tags, so callers using dfmt_remember with tags like "audit" or "decision" landed at the default P4 priority — same as routine tool calls — and got dropped first under a tight byte-budget recall.

type ScoredHit ¶

type ScoredHit struct {
	ID    string
	Score float64
	Layer int
}

ScoredHit represents a scored search result.

type Source ¶

type Source string

Source represents the source of an event.

const (
	SourceCLI   Source = "cli"
	SourceMCP   Source = "mcp"
	SourceHook  Source = "hook"
	SourceFS    Source = "fs"
	SourceShell Source = "shell"
	SourceGit   Source = "git"
)

Source types.

const (
	SrcMCP     Source = "mcp"
	SrcFSWatch Source = "fswatch"
	SrcGitHook Source = "githook"
	SrcShell   Source = "shell"
	SrcCLI     Source = "cli"
)

type TrigramIndex ¶

type TrigramIndex struct {
	// contains filtered or unexported fields
}

TrigramIndex provides fast substring search via trigram inverted index.

func NewTrigramIndex ¶

func NewTrigramIndex() *TrigramIndex

NewTrigramIndex creates a new trigram index.

func (*TrigramIndex) Add ¶

func (ti *TrigramIndex) Add(id string, text string)

Add indexes a document's tokens for trigram search. Every 3-character window within each token is indexed so the search side (which generates all windows from the query) can intersect them correctly.

func (*TrigramIndex) Search ¶

func (ti *TrigramIndex) Search(substring string) []string

Search finds document IDs that might contain the given substring. The substring is lowercased to match how Add indexes via TokenizeFull, which lowercases every token. Without this, a mixed-case query (e.g. "Path", "URL") silently returned zero matches.

type ULID ¶

type ULID string

ULID is a Universally Unique Lexicographically Sortable Identifier.

V-07 contract: ULID is an event identifier, NOT a security token. The crypto/rand fallback below degrades to pid+counter+nanotime when the OS entropy pool refuses to read — fine for "make sure two events in the same millisecond don't collide," but unsuitable for auth tokens, session IDs, CSRF nonces, or any value an attacker could be motivated to predict. Anyone repurposing this type for that role MUST replace the fallback with a hard error.

func NewULID ¶

func NewULID(ts time.Time) ULID

NewULID generates a new ULID.

func (ULID) Time ¶

func (u ULID) Time() time.Time

Time extracts the timestamp from a ULID.

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL