# kronos

Durable workflows you can debug like regular programs.

Kronos is a Postgres-native durable workflow engine with a built-in time-travel debugger: it records every step, lets you scrub through the timeline, replay runs in your IDE with breakpoints, and fork failed runs from any midpoint. It ships as a single Go binary with zero external dependencies beyond Postgres and sustains 2,800+ workflow steps/sec.
## Why Kronos?

Debugging failed workflows in production is painful: you get stack traces and logs, but you can't actually step through what happened. Kronos fixes this:

- Time-travel debugger — scrub through any workflow run's timeline and inspect the exact inputs/outputs at each step
- Local replay with breakpoints — `kronos replay <run-id>` re-executes production runs locally; attach Delve or your IDE debugger
- Fork from any midpoint — failed at step 5 of 10? Fork from step 5 with a fixed handler and skip re-executing steps 1–4
- Single binary — no cluster, no Elasticsearch, no separate services. Just Postgres.
## Comparison

|                      | Kronos | Temporal | Hatchet | River |
|----------------------|--------|----------|---------|-------|
| Time-travel debugger | ✅     | ❌       | ❌      | ❌    |
| Local IDE replay     | ✅     | ❌       | ❌      | ❌    |
| Fork from midpoint   | ✅     | ❌       | ❌      | ❌    |
| Single binary        | ✅     | ❌       | ❌      | ✅    |
| Postgres-native      | ✅     | ✅\*     | ✅      | ✅    |
| Workflow versioning  | ✅     | ✅       | ❌      | ❌    |
| Zero external deps   | ✅     | ❌       | ❌      | ✅    |

\* Temporal supports Postgres but requires a separate Temporal server cluster plus Elasticsearch for production visibility.
## Quick Start

```bash
# Start Postgres
docker-compose up -d

# Build and run
make build && ./bin/kronos
```

Open the debugger UI at http://localhost:8080/debugger/

```bash
# Submit a workflow
grpcurl -plaintext -d '{
  "workflow_name": "example-workflow",
  "input": "{\"message\": \"hello\"}"
}' localhost:50051 kronos.v1.KronosService/StartWorkflow
```
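The same call can be made from Go. This is a sketch only: it assumes the generated package in `gen/kronos/v1` exposes a `KronosServiceClient` and a `StartWorkflowRequest` message following standard protoc naming — check the generated code for the real identifiers and module path.

```go
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Hypothetical import path — replace with this repo's actual module path.
	pb "example.com/kronos/gen/kronos/v1"
)

func main() {
	conn, err := grpc.NewClient("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	// Mirrors the grpcurl call above: workflow_name plus JSON-encoded input.
	resp, err := pb.NewKronosServiceClient(conn).StartWorkflow(context.Background(),
		&pb.StartWorkflowRequest{
			WorkflowName: "example-workflow",
			Input:        `{"message": "hello"}`,
		})
	if err != nil {
		log.Fatalf("start workflow: %v", err)
	}
	log.Printf("started run: %v", resp)
}
```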
## Define a Workflow

```go
wf := workflow.NewWorkflow("data-pipeline", "v1").
	AddStep("fetch-user", fetchUserData, nil).
	AddStep("enrich", enrichProfile, []string{"fetch-user"}).
	AddStep("validate", validate, []string{"enrich"}).
	AddStep("notify-1", sendEmail, []string{"enrich"}).
	AddStep("notify-2", slackMessage, []string{"enrich"}).
	AddStep("finish", summarize, []string{"validate", "notify-1", "notify-2"})

registry.Register(wf)
```

Each step lists the steps it depends on, so `validate`, `notify-1`, and `notify-2` run in parallel once `enrich` completes, and `finish` fans in on all three. Step functions accept a context and JSON input:
```go
func enrichProfile(ctx context.Context, input json.RawMessage) (json.RawMessage, error) {
	var data struct{ UserID string }
	if err := json.Unmarshal(input, &data); err != nil {
		return nil, err
	}

	// ... do work, producing result ...
	var result struct{ UserID string /* plus enriched fields */ }
	result.UserID = data.UserID

	return json.Marshal(result)
}
```
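Typed step I/O via generics is on the roadmap for v1.1; until then, a small wrapper can adapt a strongly typed function to the raw signature above. This helper is not part of Kronos — it's a minimal sketch you could write yourself:

```go
package steps

import (
	"context"
	"encoding/json"
)

// Typed adapts a typed step function to the raw JSON-in/JSON-out signature
// that Kronos step functions use. In and Out are (un)marshaled with
// encoding/json.
func Typed[In, Out any](fn func(context.Context, In) (Out, error)) func(context.Context, json.RawMessage) (json.RawMessage, error) {
	return func(ctx context.Context, raw json.RawMessage) (json.RawMessage, error) {
		var in In
		if err := json.Unmarshal(raw, &in); err != nil {
			return nil, err
		}
		out, err := fn(ctx, in)
		if err != nil {
			return nil, err
		}
		return json.Marshal(out)
	}
}
```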
## Replay & Debug

```bash
# Replay a production run locally
kronos replay <run-id>

# Pause at a step — attach Delve or VS Code
kronos replay <run-id> --break-at enrich

# Fork from a step — test your fix without re-running upstream
kronos replay <run-id> --fork-from enrich
```

Output:

```
[replay] run abc-123 (workflow: data-pipeline v1)
[replay] fetched 24 events
[replay] step 1/6: fetch-user OK (12ms, diff: 0 bytes)
[replay] step 2/6: enrich BREAK
[replay] pid=48291 — attach your debugger now, press enter to continue
```
## Architecture

```
SubmitWorkflow RPC
        │
        ▼
┌──────────────────────┐
│  workflow_runs row   │        RunStarted event
│   (status=running)   │               │
└──────────┬───────────┘               ▼
           │             ┌──────────────────────┐
           │             │   workflow_events    │
           │             │  (append-only log)   │
           ▼             └──────────────────────┘
For each ready step:
           │
           ▼
┌──────────────────────┐
│  jobs row inserted   │ ◄── existing scheduler/worker
│ type=kronos.workflow │     infrastructure handles this
│        _step         │     (zero code changes)
└──────────┬───────────┘
           ▼
workflow_step handler:
  • load workflow definition (pinned version)
  • assemble step input from upstream outputs
  • invoke user's StepFunc
  • write events transactionally
  • schedule downstream steps
```
The workflow engine layers on top of the existing job system — no changes to the scheduler, worker pool, retries, dead letter, or cron. Workflows execute as ordinary jobs through the production-proven pipeline.
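As a rough sketch of that handler flow in Go — purely conceptual, with every type and helper name invented for illustration (the real implementation lives in `internal/workflow`):

```go
package sketch

import (
	"context"
	"encoding/json"
)

type StepFunc func(context.Context, json.RawMessage) (json.RawMessage, error)

type stepJob struct {
	RunID, Step string
}

// store is a stand-in for the event-log and job-queue operations above.
type store interface {
	PinnedStep(runID, step string) (StepFunc, error)                // pinned workflow version
	AssembleInput(runID, step string) (json.RawMessage, error)      // join upstream outputs
	Append(runID, kind, step string, payload json.RawMessage) error // append-only event log
	EnqueueReady(runID string) error                                // jobs rows for unblocked steps
}

func handleStep(ctx context.Context, s store, job stepJob) error {
	fn, err := s.PinnedStep(job.RunID, job.Step)
	if err != nil {
		return err
	}
	input, err := s.AssembleInput(job.RunID, job.Step)
	if err != nil {
		return err
	}
	out, err := fn(ctx, input)
	if err != nil {
		// Record the failure; the job system's retry/backoff takes over.
		s.Append(job.RunID, "StepFailed", job.Step, nil)
		return err
	}
	if err := s.Append(job.RunID, "StepCompleted", job.Step, out); err != nil {
		return err
	}
	return s.EnqueueReady(job.RunID)
}
```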
## Core Features

### Durable Workflows
- DAG-based execution — steps form a directed acyclic graph with parallel branches and fan-in
- Per-step checkpointing — step outputs are durably recorded; retries resume from the last completed step
- Event sourcing — every run is an append-only log of events (see the sketch after this list)
- Workflow versioning — in-flight runs pin their definition; new runs use the latest version
- Blob deduplication — large step inputs/outputs are content-addressed and shared across forks
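To make the event-sourcing model concrete, here is a minimal, self-contained sketch of how replaying a prefix of the event log reconstructs run state — the same idea the timeline scrubber relies on. The event kind names here are hypothetical:

```go
package main

import "fmt"

// Event is a simplified stand-in for a workflow_events row.
type Event struct {
	Kind string // e.g. "StepStarted", "StepCompleted", "StepFailed"
	Step string
}

// stateAt folds the first idx events into a per-step status map —
// rewinding the timeline is just choosing a smaller idx.
func stateAt(events []Event, idx int) map[string]string {
	state := make(map[string]string)
	for _, e := range events[:idx] {
		switch e.Kind {
		case "StepStarted":
			state[e.Step] = "running"
		case "StepCompleted":
			state[e.Step] = "completed"
		case "StepFailed":
			state[e.Step] = "failed"
		}
	}
	return state
}

func main() {
	log := []Event{
		{"StepStarted", "fetch-user"},
		{"StepCompleted", "fetch-user"},
		{"StepStarted", "enrich"},
		{"StepFailed", "enrich"},
	}
	fmt.Println(stateAt(log, 3)) // map[enrich:running fetch-user:completed]
}
```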
### Time-Travel Debugger UI
- Runs list — filter by workflow/status, card layout with duration and status badges
- Run detail — 3-pane layout: step tree, timeline scrubber, step inspector
- Timeline scrubber — click/drag to rewind through events, see step state at any point
- Step inspection — syntax-highlighted JSON for inputs, outputs, errors
- Live streaming — SSE auto-updates for in-progress runs
- Fork interface — one click to fork a failed run from any step
### Local Replay & IDE Debugging

- `kronos replay <run-id>` — re-executes a production run locally using recorded outputs
- `kronos replay <run-id> --break-at step_name` — pause at a step; attach Delve / VS Code / GoLand
- `kronos replay <run-id> --fork-from step_name` — test a fix without re-executing upstream steps
- Output diffs surface divergence from production
### Production Infrastructure
- PostgreSQL persistence — single source of truth, survives crashes
- Leader election — advisory locks ensure one active scheduler per cluster (see the sketch after this list)
- Multi-node clustering — stateless API nodes, one scheduler; scale horizontally
- Prometheus metrics — workflow throughput, step latency, queue depth
- gRPC health checks — ready for Kubernetes liveness/readiness probes
- Graceful shutdown — drains in-flight steps on SIGTERM
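Advisory-lock leader election boils down to a single query. This is a generic sketch of the technique, not Kronos's `internal/leader` code, and the lock key is an arbitrary made-up value:

```go
package leader

import (
	"context"
	"database/sql"
)

// Any int64 agreed on cluster-wide identifies the scheduler lock.
const schedulerLockKey = 0x6B726F6E // hypothetical value

// tryBecomeLeader reports whether this session now holds the lock.
// Session-level advisory locks are released automatically when the
// connection drops, so a crashed leader is replaced on a later attempt.
func tryBecomeLeader(ctx context.Context, conn *sql.Conn) (bool, error) {
	var acquired bool
	err := conn.QueryRowContext(ctx,
		"SELECT pg_try_advisory_lock($1)", schedulerLockKey).Scan(&acquired)
	return acquired, err
}
```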
### Built-in Job Queue
For simple one-off background work, Kronos also functions as a traditional job queue:
- Job priorities — 1–10 scale
- Cron scheduling — recurring jobs
- Exponential backoff with jitter — intelligent retries (see the sketch after this list)
- Dead-letter handling — failed jobs preserved for manual inspection
- Idempotency keys — deduplication across retries
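The backoff computation looks roughly like the following — an illustrative sketch, not Kronos's exact parameters (the base, cap, and jitter fraction are assumptions):

```go
package retry

import (
	"math/rand"
	"time"
)

// backoff returns the delay before retry number attempt (0-based):
// exponential growth, capped, plus up to ~50% random jitter so that
// failed jobs don't retry in lockstep.
func backoff(attempt int) time.Duration {
	const maxDelay = 5 * time.Minute
	d := time.Second << uint(attempt) // 1s, 2s, 4s, ...
	if d <= 0 || d > maxDelay {       // d <= 0 guards shift overflow
		d = maxDelay
	}
	jitter := time.Duration(rand.Int63n(int64(d)/2 + 1))
	return d + jitter
}
```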
Worker pool throughput (CPU-bound, no Postgres):
- 10 workers: 1,256,838 jobs/sec (795ns/op)
- 50 workers: 1,135,330 jobs/sec (880ns/op)
- 100 workers: 1,117,641 jobs/sec (894ns/op)
End-to-end throughput (with Postgres 17, 50 workers):
- ~2,800 workflow steps/sec
Benchmark locally:

```bash
docker-compose up -d
make bench BENCH_FLAGS="-jobs=10000 -workers=50"
```
## Configuration

| Environment Variable      | Default                                                          | Description                                  |
|---------------------------|------------------------------------------------------------------|----------------------------------------------|
| `DATABASE_URL`            | `postgres://kronos:kronos@localhost:5432/kronos?sslmode=disable` | PostgreSQL DSN                               |
| `GRPC_ADDR`               | `:50051`                                                         | gRPC API listen address                      |
| `ADMIN_ADDR`              | `:8080`                                                          | HTTP admin server (debugger UI, dead-letter) |
| `METRICS_ADDR`            | `:2112`                                                          | Prometheus metrics address                   |
| `WORKER_CONCURRENCY`      | `10`                                                             | Worker pool size                             |
| `SCHEDULER_POLL_INTERVAL` | `2s`                                                             | Job polling interval                         |
| `SCHEDULER_BATCH_SIZE`    | `10`                                                             | Max jobs to claim per poll                   |
| `JOB_TIMEOUT`             | `30m`                                                            | Maximum execution time per step              |
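A minimal sketch of how these settings might be read, assuming the env-then-default pattern that `internal/config` implies (the actual struct and parsing there may differ):

```go
package config

import (
	"os"
	"strconv"
	"time"
)

type Config struct {
	DatabaseURL       string
	GRPCAddr          string
	WorkerConcurrency int
	JobTimeout        time.Duration
}

// getenv falls back to def when the variable is unset or empty.
func getenv(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

func Load() (Config, error) {
	timeout, err := time.ParseDuration(getenv("JOB_TIMEOUT", "30m"))
	if err != nil {
		return Config{}, err
	}
	concurrency, err := strconv.Atoi(getenv("WORKER_CONCURRENCY", "10"))
	if err != nil {
		return Config{}, err
	}
	return Config{
		DatabaseURL:       getenv("DATABASE_URL", "postgres://kronos:kronos@localhost:5432/kronos?sslmode=disable"),
		GRPCAddr:          getenv("GRPC_ADDR", ":50051"),
		WorkerConcurrency: concurrency,
		JobTimeout:        timeout,
	}, nil
}
```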
## Running Tests

```bash
make test              # Unit tests (race detector enabled)
make test-integration  # Integration tests (requires Docker)
```
## Project Structure

```
kronos/
├── cmd/
│   ├── kronos/        main entrypoint + replay subcommand
│   └── bench/         throughput benchmark binary
├── internal/
│   ├── api/           gRPC server (jobs + workflows)
│   ├── config/        environment-based configuration
│   ├── cron/          cron scheduling
│   ├── debugger/      web UI for time-travel debugging (HTMX + Alpine + Tailwind)
│   ├── health/        gRPC health protocol
│   ├── leader/        PostgreSQL advisory lock leader election
│   ├── metrics/       Prometheus instrumentation
│   ├── middleware/    gRPC interceptors
│   ├── replay/        local replay client + replayer engine + JSON diff
│   ├── retry/         exponential backoff
│   ├── scheduler/     job polling loop
│   ├── store/         job persistence (Store interface + Postgres)
│   ├── worker/        goroutine pool + handler registry
│   └── workflow/      workflow engine, event log, step dispatcher, fork
├── proto/kronos/v1/   protobuf definitions
├── gen/kronos/v1/     generated gRPC code
└── migrations/        SQL migrations
```
## Roadmap
v1 (current):
- Durable workflow engine with DAG support
- Time-travel debugger UI
- Local replay and IDE debugging
- Fork from any midpoint
- Workflow versioning
v1.1:
- Typed step I/O via Go generics
- Embeddable library mode (no separate binary)
- Workflow signals & queries
- Human-in-the-loop approval steps
v2:
- AI step primitives (LLM calls, token counting, cost tracking)
- Python / TypeScript client SDKs
- Nested fan-out (map steps)
## License
MIT