temporal

package
v0.0.0-...-721e887 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 3, 2026 License: AGPL-3.0 Imports: 9 Imported by: 0

Documentation

Overview

Package temporal extracts git-history derived facts (commits + per-file touch lists) used to emit CKS G6 Temporal edges (`changed_in`, `blame`) in the build pipeline.

V0 scope (E4): single `git log --raw --no-renames` invocation per build, streamed and parsed into a per-file commit list. Line-level blame is deferred (G6 Phase 2). Repos that aren't git checkouts return an empty FileHistory + nil error so callers degrade gracefully.

Package temporal — hunks.go extracts unified-diff hunks from `git log -p`, the foundation for the CKS G6 Hunk-graph (schema 1.8 H1 stage). Each HunkInfo is one contiguous block of changed lines in one file in one commit; the buildpipe layer turns these into NodeHunk rows + has_hunk / adjacent edges, with the gzip-compressed unified-diff text persisted as a blob keyed by the Hunk's node ID.

Why a fresh collector instead of extending LoadHistory: LoadHistory uses `git log --raw` to enumerate (commit, file) pairs — it never sees the patch body. The hunk pass needs `git log -p` to materialise diff text; the two streams have incompatible parse states. Keeping them separate lets each pass stay simple and lets the build pipeline run them concurrently if the budget ever requires it.

Repos that aren't git checkouts return nil + nil error so callers degrade gracefully (mirrors LoadHistory's contract).

Package temporal: issueid.go extracts issue/ticket identifiers from commit subject lines for the H4 stage of the hunk-graph series (docs/design/hunk-graph.md §10.4). The regex passes recognise four reference forms:

  • GitHub-style bare references: `#123`, `Fixes #45`, `(#67)` → normalised as `GH-123` / `GH-45` / `GH-67`.

  • Bracketed Linear / Jira / internal IDs: `[ABC-456]`, `[INGEST-789]` → kept verbatim (`ABC-456`, `INGEST-789`).

  • Unbracketed Jira-style ticket prefix at line start: `INGEST-789: brief subject` → `INGEST-789`. The leading-position constraint avoids accidental matches like `e.g. SOME-123 we use…` in the middle of a sentence (false-positive risk too high there).

  • GitHub issue URLs: `https://github.com/owner/repo/issues/42` → `GH-owner/repo#42`. URL form is rare in subjects (more common in bodies) but we keep the parser symmetric for completeness.

Output is deduped and sorted lexicographically so the encoded `doc_comment` stays deterministic across builds — the same commit always produces the same issue_ids string regardless of regex match order.
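
The passes above can be sketched as follows. This is a minimal illustration, not the package's implementation: the four regular expressions and the lowercase name extractIssueIDs are assumptions written to match the examples in this overview, and the real patterns may be stricter or looser.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// Hypothetical patterns written to match the documented examples only.
var (
	// Bare GitHub reference: "#123", "Fixes #45", "(#67)".
	reHash = regexp.MustCompile(`(?:^|[\s(])#(\d+)\b`)
	// Bracketed ID: "[ABC-456]".
	reBracket = regexp.MustCompile(`\[([A-Z][A-Z0-9]*-\d+)\]`)
	// Unbracketed ticket prefix, accepted only at the start of the subject.
	rePrefix = regexp.MustCompile(`^([A-Z][A-Z0-9]*-\d+):`)
	// GitHub issue URL.
	reURL = regexp.MustCompile(`https://github\.com/([\w.-]+)/([\w.-]+)/issues/(\d+)`)
)

// extractIssueIDs returns deduped, lexicographically sorted identifiers,
// or nil when nothing matches.
func extractIssueIDs(subject string) []string {
	seen := map[string]struct{}{}
	for _, m := range reHash.FindAllStringSubmatch(subject, -1) {
		seen["GH-"+m[1]] = struct{}{}
	}
	for _, m := range reBracket.FindAllStringSubmatch(subject, -1) {
		seen[m[1]] = struct{}{}
	}
	if m := rePrefix.FindStringSubmatch(subject); m != nil {
		seen[m[1]] = struct{}{}
	}
	for _, m := range reURL.FindAllStringSubmatch(subject, -1) {
		seen["GH-"+m[1]+"/"+m[2]+"#"+m[3]] = struct{}{}
	}
	if len(seen) == 0 {
		return nil
	}
	ids := make([]string, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Strings(ids) // deterministic regardless of match order
	return ids
}

func main() {
	fmt.Println(extractIssueIDs("INGEST-789: fix parser (#67) [ABC-456]"))
	// prints [ABC-456 GH-67 INGEST-789]
}
```

Collecting into a set before sorting is what makes the result independent of the order in which the passes run.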

Package temporal — unreachable.go collects hunks from commits that exist in the local git object store but are NOT reachable from HEAD. Those commits live in two surfaces:

  1. `git reflog --all --pretty=%H` — local HEAD/branch movement records. Captures force-pushed-away SHAs that haven't been GC'd yet (default 90 days). Misses commits that landed via fetch but were never moved into a ref's history.

  2. `git fsck --no-reflogs --unreachable` — dangling objects that no ref or reflog points at. Catches the second category above and any commit explicitly excluded from a fetch's tip walk.

Together: a near-complete view of the local object store's "history humans rolled back". Used to populate the schema-1.8 §11.3 "AMBIGUOUS" hunk class — see docs/design/hunk-graph.md for the storage / retrieval / recovery layering.

Distinct from LoadHunks: that pass walks `git log HEAD --` only and produces the EXTRACTED-confidence baseline. This pass is additive — the caller merges the two result sets into one Commit/Hunk emission.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func DecodeIssueIDs

func DecodeIssueIDs(docComment string) []string

DecodeIssueIDs is the inverse of EncodeIssueIDs — parses a Hunk's doc_comment field back into the slice of issue identifiers. Returns nil for any input that doesn't carry the `issues:` prefix (so plain doc_comment text on non-Hunk nodes doesn't get mistaken for issue data).

func EncodeIssueIDs

func EncodeIssueIDs(ids []string) string

EncodeIssueIDs serialises ids into the design §10.4 storage shape: `issues:ID1;ID2;…`. Returns the empty string when ids is empty so callers can assign the result directly to Node.DocComment without a nil check (an empty doc_comment is valid).
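
A minimal sketch of the `issues:ID1;ID2;…` round trip described for EncodeIssueIDs and DecodeIssueIDs. The lowercase function names and the issuePrefix constant are illustrative; the handling of an empty payload after the prefix is an assumption, since the documentation does not specify it.

```go
package main

import (
	"fmt"
	"strings"
)

// issuePrefix marks a doc_comment that carries issue data.
const issuePrefix = "issues:"

// encodeIssueIDs returns "" for an empty slice, so the result can be
// assigned to a doc_comment field without a separate check.
func encodeIssueIDs(ids []string) string {
	if len(ids) == 0 {
		return ""
	}
	return issuePrefix + strings.Join(ids, ";")
}

// decodeIssueIDs returns nil for input without the prefix.
// Assumption: a bare "issues:" with no payload also decodes to nil.
func decodeIssueIDs(docComment string) []string {
	if !strings.HasPrefix(docComment, issuePrefix) {
		return nil
	}
	rest := strings.TrimPrefix(docComment, issuePrefix)
	if rest == "" {
		return nil
	}
	return strings.Split(rest, ";")
}

func main() {
	enc := encodeIssueIDs([]string{"ABC-456", "GH-67"})
	fmt.Println(enc)                                  // prints issues:ABC-456;GH-67
	fmt.Println(decodeIssueIDs(enc))                  // prints [ABC-456 GH-67]
	fmt.Println(decodeIssueIDs("plain text") == nil) // prints true
}
```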

func ExtractIssueIDs

func ExtractIssueIDs(subject string) []string

ExtractIssueIDs returns deduped sorted issue identifiers found in subject. Returns nil (not a zero-length slice) when no patterns match so callers can use `if ids := ExtractIssueIDs(...); ids == nil` for the "no issues" branch.

The patterns are intentionally conservative — over-matching would pollute the H3 EvidencePack with spurious "this hunk fixes #123" claims that don't reflect the commit's stated intent. False negatives are recoverable (a follow-up PR can widen the patterns when eval shows demand); false positives ship to the LLM and can mislead.

func LoadUnreachableHunks

func LoadUnreachableHunks(repoRoot string, maxCommits int) ([]CommitInfo, []HunkInfo, error)

LoadUnreachableHunks returns commits + hunks for SHAs reachable via reflog or fsck-unreachable but NOT from HEAD. maxCommits ≤ 0 uses unreachableCommitsDefault.

Returns (nil, nil, nil) for non-git directories — same graceful degrade contract as LoadHunks. Other git failures bubble up.

Performance: reflog + fsck typically run in < 1s on repos with 10K+ commits. Each per-SHA `git show` takes ~50ms, so a 100-commit cap bounds the `git show` cost at roughly 5s on commodity hardware.
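
The candidate-selection step can be sketched as a pure function over the two command outputs. Everything here is illustrative: the function name, the `reachable` set (how the package tests HEAD reachability is not documented), and the placeholder value given to unreachableCommitsDefault are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// Placeholder value; the package's actual default is not documented here.
const unreachableCommitsDefault = 100

// unreachableCandidates merges SHAs from `git reflog --all --pretty=%H`
// (one SHA per line) and `git fsck --no-reflogs --unreachable`
// ("unreachable commit <sha>" lines), drops anything in the reachable
// set, dedupes, and caps the result at maxCommits.
func unreachableCandidates(reflogOut, fsckOut string, reachable map[string]bool, maxCommits int) []string {
	if maxCommits <= 0 {
		maxCommits = unreachableCommitsDefault
	}
	seen := map[string]bool{}
	var out []string
	add := func(sha string) {
		if sha == "" || seen[sha] || reachable[sha] || len(out) >= maxCommits {
			return
		}
		seen[sha] = true
		out = append(out, sha)
	}
	for _, line := range strings.Split(reflogOut, "\n") {
		add(strings.TrimSpace(line))
	}
	for _, line := range strings.Split(fsckOut, "\n") {
		f := strings.Fields(line)
		// fsck also reports blobs and trees; only commits are of interest.
		if len(f) == 3 && f[0] == "unreachable" && f[1] == "commit" {
			add(f[2])
		}
	}
	return out
}

func main() {
	reflog := "aaa\nbbb\naaa\n"
	fsck := "unreachable blob ccc\nunreachable commit ddd\nunreachable commit bbb\n"
	fmt.Println(unreachableCandidates(reflog, fsck, map[string]bool{"aaa": true}, 10))
	// prints [bbb ddd]
}
```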

Types

type CommitInfo

type CommitInfo struct {
	SHA       string // full 40-char hex
	Timestamp int64  // unix seconds
	Subject   string // first line of message (already trimmed of trailing \n)
}

CommitInfo describes a single git commit. Times are unix seconds (the git `%at` author-time format) so they collate with the staleness fingerprint already used by the manifest.

type FileHistory

type FileHistory struct {
	Files   map[string][]string
	Commits map[string]CommitInfo
}

FileHistory is the parsed result of one `git log --raw` invocation:

  • Files: repo-rooted slash-form path → list of commit SHAs that touched that file, most-recent-first, capped at the per-file limit.
  • Commits: SHA → CommitInfo for every commit referenced from Files.

The maps are nil-safe (callers can range over nil maps).

func LoadHistory

func LoadHistory(repoRoot string, maxPerFile int) (FileHistory, error)

LoadHistory runs `git -C repoRoot log --raw --no-renames --pretty=format:'COMMIT %H %at %s' HEAD -- .` and parses the output into a FileHistory. maxPerFile bounds per-file commit count (default 10 if maxPerFile <= 0).

Returns an empty FileHistory + nil error if repoRoot is not a git checkout (graceful degrade — temporal edges simply won't be emitted). Other git failures bubble up as errors so the caller can log them.

Performance: a single git invocation streams the entire repo history. For 2000-file corpora with ~200 commits, output is on the order of 1MB and parses in <1s. We deliberately bound per-file count rather than passing `-n` to git, because `-n` caps TOTAL commits in the stream (not per file) and the goal is "10 most recent per file" regardless of how many other files were touched in those commits.

type HunkInfo

type HunkInfo struct {
	SHA      string
	FilePath string
	Index    int
	OldStart int
	OldLines int
	NewStart int
	NewLines int
	Added    int
	Removed  int
	Binary   bool
	Patch    []byte
}

HunkInfo is one unified-diff hunk extracted from a single commit. The fields mirror the @@ -OldStart,OldLines +NewStart,NewLines @@ header:

  • OldStart/OldLines: pre-image range in the parent file. Zero/zero when the file is newly added in this commit.
  • NewStart/NewLines: post-image range in the file at this commit. Zero/zero when the file is deleted in this commit.
  • Added/Removed: literal '+' / '-' line counts inside the hunk body (excluding the `--- a/...` / `+++ b/...` file headers and the `\ No newline at end of file` marker).
  • Patch: raw bytes of the hunk INCLUDING the `@@` header line and a trailing newline. NOT gzipped — the buildpipe layer applies the §11.6 64KB truncation and gzip compression before persisting.
  • Index: 0-based per-(commit, file) hunk position. Stable within one parse — used as the third coordinate of the Hunk node ID so multiple hunks per commit-file pair don't collide.

SHA is the full 40-char hex commit ID. FilePath is the post-image (b/) side of the `diff --git` header — for renames under --no-renames the file appears as a deletion of the old path + an addition of the new path, so SHA × FilePath × Index uniquely identifies any hunk.
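
To illustrate how the range and count fields relate to the raw patch text, here is a small sketch that parses one hunk. The hunkRange type and parseHunk function are hypothetical names, not part of this package. It follows the unified-diff convention that an omitted line count means 1.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// hunkHeader matches "@@ -OldStart[,OldLines] +NewStart[,NewLines] @@".
var hunkHeader = regexp.MustCompile(`^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@`)

type hunkRange struct {
	OldStart, OldLines, NewStart, NewLines int
	Added, Removed                         int
}

// num converts a captured group; an empty (omitted) count defaults to 1.
func num(s string) int {
	if s == "" {
		return 1
	}
	n, _ := strconv.Atoi(s) // regex guarantees digits
	return n
}

// parseHunk expects patch to begin with the @@ header line.
func parseHunk(patch string) (hunkRange, bool) {
	lines := strings.Split(patch, "\n")
	m := hunkHeader.FindStringSubmatch(lines[0])
	if m == nil {
		return hunkRange{}, false
	}
	h := hunkRange{OldStart: num(m[1]), OldLines: num(m[2]), NewStart: num(m[3]), NewLines: num(m[4])}
	for _, l := range lines[1:] {
		switch {
		case strings.HasPrefix(l, "+"):
			h.Added++
		case strings.HasPrefix(l, "-"):
			h.Removed++
		}
		// Context lines and "\ No newline at end of file" are not counted.
	}
	return h, true
}

func main() {
	h, ok := parseHunk("@@ -10,3 +10,4 @@ func foo() {\n ctx\n-old\n+new\n+extra\n tail\n")
	fmt.Println(h, ok) // prints {10 3 10 4 2 1} true
}
```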

func LoadHunks

func LoadHunks(repoRoot string, maxCommits int) ([]HunkInfo, error)

LoadHunks runs `git log -p` over the most-recent maxCommits HEAD-reachable commits and parses the patch stream into HunkInfo records. maxCommits<=0 uses hunkCommitsDefault.

Returns (nil, nil) if repoRoot isn't a git checkout — same graceful-degrade contract as LoadHistory. Other git failures bubble up as errors.

Why --no-renames: matches LoadHistory's flag set so changed_in (file-level) and has_hunk (per-hunk) agree on the path identity. Renames appear as delete+add hunk pairs; the H2 modifies-edge pass that lands later in this schema bump can re-link them via AST overlap if needed.

Why --no-merges: merge commits' patches are noisy (they show the resolution against one parent, not the actual code change), and in the few cases where a merge introduces real code, that code already surfaces elsewhere via the merged branch's own commits. Excluding them halves the hunk count on heavily-rebased branches without losing signal.
