Event-Driven Notification System

A scalable notification system that ingests requests via HTTP, persists them,
dispatches them asynchronously through SMS / Email / Push channels with
intelligent retry, and exposes real-time status updates via WebSocket.
Built as a Senior Software Engineer (Golang) technical assessment
submission for Insider One. Full brief: docs/brief.pdf.
Table of contents
What is this
A production-shaped backend that the brief calls out: millions of
notifications daily, burst-tolerant, intelligent retries, end-to-end
visibility. Every architectural decision traces back to a sentence in
the assessment brief — see CLAUDE.md §1 for the
constitution, PLAN.md for the seven-phase build plan.
Three binaries, one Docker image:
cmd/api — REST + WebSocket front door
cmd/worker — asynq consumer, atomic claim, provider dispatch, retry
cmd/reconciler — safety net for stuck or orphaned notifications
Plus cmd/migrate for schema management.
Four load-bearing phrases from the brief
Every architectural decision in this repo defends one of these:
- "Millions of notifications daily" → horizontal scale, stateless
tiers, batch ingestion, connection pooling.
- "Burst traffic (flash sales, breaking news)" → asynchronous
processing, queue back-pressure, distributed rate limiting.
- "Retry failed deliveries intelligently" → exponential backoff
with jitter, error classification, circuit breaker, dead letter
queue, reconciliation safety net.
- "Visibility for both internal teams and API consumers" →
structured logging, Prometheus metrics, correlation IDs, status
query endpoints, notification trace endpoint, operational alerting.
When in doubt while reading the code, re-read these four — they are
the constitution (CLAUDE.md §1).
Status
v1.0.0 · production-ready baseline · MIT licensed · Go 1.26
| Signal |
Where |
| Quality gate |
SonarCloud Passed — 0 open issues, ≥80% new-code coverage |
| CI pipeline |
8 jobs: lint, vuln, unit (race + coverage), integration, e2e, build, docker-smoke, sonarcloud upload |
| Architecture |
11 ADRs in docs/adr/ — every load-bearing choice documented |
| Test discipline |
Strict TDD: every feat commit is preceded by a matching test commit. See CONTRIBUTING.md for the rhythm. |
| Commit convention |
Conventional Commits, lowercase, no trailing period (CLAUDE.md §8) |
| Branch policy |
Feature branches → PR → merge commit (not squash) so the TDD rhythm stays visible in main's history |
Quickstart
git clone https://github.com/afbora/event-driven-notification.git
cd event-driven-notification
docker compose up -d --wait # ~30 s on first run (image build + migrations)
curl -i http://localhost:8080/healthz/live
# → 200 OK · {"status":"ok"}
One command brings up Postgres, Redis, a one-shot migration runner, the
three Go binaries (api / worker / reconciler), and the full operational
UI stack. The --wait flag blocks until every service is healthy or
the one-shot migrate container has exited successfully — so when the
command returns, the API is ready to accept traffic.
No .env file required. Every env var lives inline in
docker-compose.yml with a working default (CLAUDE.md §2.7 / ADR-0010).
No make required either — migrations run as a Compose service
(see Database migrations below).
Port collisions: if 5432, 6379, or 8080-8083 are already in
use locally, edit the host-side port mapping in docker-compose.yml
before bringing the stack up.
Database migrations
The migrate service in docker-compose.yml runs every pending
migration on startup, then exits. api, worker, and reconciler all
declare depends_on: migrate: condition: service_completed_successfully,
so the application binaries do not start until the schema is current.
golang-migrate is idempotent (state lives in the schema_migrations
table) — re-running docker compose up is safe; subsequent runs exit
immediately with "no change."
If you need to drive migrations manually (rollback, force a version,
or inspect the current state):
docker compose run --rm migrate go run ./cmd/migrate down 1
docker compose run --rm migrate go run ./cmd/migrate version
Testing
The suite is split into four tiers so feedback stays fast during
development and rigorous in CI:
| Target |
Scope |
Typical runtime |
make test |
Unit (race + coverage, no Docker) |
~15 s |
make test-integration |
Adapter tests via testcontainers |
~1-2 min |
make test-e2e |
Full stack via testcontainers |
~3 min |
make load-test |
k6 baseline + burst + rate-limit scenarios |
~6 min |
make coverage-merge |
gocovmerge unit + integration + e2e |
<5 s |
CI runs every tier on every PR. Coverage policy and test discipline
live in CLAUDE.md §7. The TDD rhythm — test(scope)
commit precedes every feat(scope) commit — is documented in
CONTRIBUTING.md.
Repository layout
.
├── cmd/ # api · worker · reconciler · migrate
├── internal/
│ ├── domain/ # entities, value objects, FSM (stdlib only)
│ ├── application/ # use cases (stdlib + ports only)
│ ├── ports/ # interfaces — the DI boundary
│ ├── adapters/ # http · postgres · redis · asynq · provider · websocket
│ └── infrastructure/ # config · logger · metrics · tracing · clock · id · circuit · correlation
├── api/openapi.yaml # source of truth for the REST contract
├── deploy/ # prometheus · alertmanager · grafana
├── db/migrations/ # numbered SQL up/down pairs
├── tests/{integration,e2e,load}/ # build-tag-gated higher tiers
└── docs/
├── adr/ # 11 architectural decision records
├── RUNBOOK.md # one entry per alert
├── LOAD_TEST.md # k6 methodology + results
├── API_EXAMPLES.md # curl recipes for every endpoint
├── brief.pdf # the assessment brief
└── bruno/ # importable Bruno collection
The hexagonal boundary is enforced by import discipline:
internal/domain/ and internal/application/ import only stdlib and
each other (CLAUDE.md §3.3). Delete internal/adapters/postgres/ and
the domain still compiles.
After docker compose up -d, these UIs are immediately reachable:
Architecture
Hexagonal layering
flowchart TB
subgraph adapters["Adapters · internal/adapters/"]
http["http / websocket"]
pg["postgres"]
redis["redis"]
asynq["asynq"]
provider["provider"]
end
subgraph ports["Ports · internal/ports/"]
direction LR
portsList["Repository · Queue · Provider · IdempotencyStore<br/>RateLimiter · StatusBroadcaster · Clock · IDGenerator"]
end
subgraph app["Application · internal/application/"]
usecases["Use cases<br/>(CreateNotification · ProcessNotification · …)"]
end
subgraph domain["Domain · internal/domain/"]
entities["Notification · Batch · Template · NotificationLog<br/>FSM · sentinel errors"]
end
adapters --> ports
ports --> app
app --> domain
classDef domainStyle fill:#fde68a,stroke:#92400e,stroke-width:2px,color:#92400e
classDef appStyle fill:#bbf7d0,stroke:#065f46,stroke-width:2px,color:#065f46
classDef portsStyle fill:#c7d2fe,stroke:#312e81,stroke-width:2px,color:#312e81
classDef adaptStyle fill:#fbcfe8,stroke:#9d174d,stroke-width:2px,color:#9d174d
class domain domainStyle
class app appStyle
class ports portsStyle
class adapters adaptStyle
internal/domain/ and internal/application/ import only stdlib and
each other (CLAUDE.md §3.3). The reviewer can delete
internal/adapters/postgres/ and the domain still compiles.
Request lifecycle (happy path)
sequenceDiagram
autonumber
participant C as Client
participant A as API (cmd/api)
participant DB as Postgres
participant Q as asynq queue
participant W as Worker
participant P as Provider
participant R as Redis pub/sub
participant WS as WebSocket clients
C->>A: POST /api/v1/notifications
A->>DB: insert (status=pending) + log "created"
A->>Q: enqueue task
A->>DB: mark queued + log "queued"
A-->>C: 202 Accepted · {id, status=queued}
Note over W: worker polls asynq
Q->>W: dequeue task
W->>DB: atomic claim (queued → processing)
W->>P: provider.Send
P-->>W: DeliveryResult{success}
W->>DB: mark delivered + log "delivered"
W->>R: publish status update
R->>WS: fan-out broadcast
WS-->>C: status=delivered
Failure handling
flowchart TD
A([Provider call]) --> B{Result}
B -->|2xx success| D[mark delivered]
B -->|4xx permanent| E[mark failed<br/>reason=permanent]
B -->|5xx / timeout / network| F{attempts<br/>< maxAttempts?}
F -->|yes| G[mark retrying<br/>schedule backoff]
F -->|no| H[mark failed<br/>reason=retries_exhausted]
G --> I([Reconciler tick — every 1min])
I --> J{next_retry_at<br/>elapsed?}
J -->|yes| K[re-enqueue]
J -->|no| I
K --> A
L([Worker crashes mid-Send]) --> M[notification stuck<br/>in processing]
M --> N([Reconciler tick])
N --> O{updated_at > 5min ago?}
O -->|yes| P[mark failed<br/>reason=worker_timeout]
O -->|no| N
Q([Provider sick]) --> R[Circuit breaker trips]
R --> S[fail-fast for 30s]
S --> T[half-open probe]
T -->|success| U[close]
T -->|fail| R
style D fill:#bbf7d0,stroke:#065f46
style E fill:#fecaca,stroke:#7f1d1d
style H fill:#fecaca,stroke:#7f1d1d
style P fill:#fecaca,stroke:#7f1d1d
style G fill:#fef3c7,stroke:#92400e
Tech stack
| Layer |
Choice |
Why |
| Language |
Go 1.26 |
Standard for the brief |
| Router |
chi |
Minimal, idiomatic, composable middleware |
| DB driver |
pgx/v5 |
Fastest postgres driver for Go; first-class context support |
| Query gen |
sqlc |
Raw SQL → type-safe Go; no ORM tax (ADR-0002) |
| Migrations |
golang-migrate |
Versioned, reversible, language-agnostic |
| Queue |
asynq |
Priority + retry + scheduled + unique tasks + rate limit (ADR-0003) |
| Redis client |
go-redis/v9 |
Maintained, context-aware |
| WebSocket |
github.com/coder/websocket |
Modern, idiomatic, context-native (ADR-0006) |
| Logging |
log/slog (stdlib) |
Structured JSON, zero external deps |
| Metrics |
prometheus/client_golang |
Standard |
| Tracing |
go.opentelemetry.io/otel |
Vendor-neutral; no-op default |
| Circuit breaker |
sony/gobreaker |
Small, focused, no goroutine leaks |
| OpenAPI codegen |
oapi-codegen |
Spec-first; strict-server contract |
| Testing |
testing · testify · testcontainers-go |
Idiomatic; real DB in integration tests |
API endpoints
All endpoints declared in api/openapi.yaml; see
also Swagger UI at http://localhost:8080/docs and the
Bruno collection for ready-to-run requests.
Notifications
# Create one
curl -X POST http://localhost:8080/api/v1/notifications \
-H 'content-type: application/json' \
-H 'idempotency-key: 5e9bcb15-7c61-4e8a-9fcd-1a90d2c1d111' \
-d '{
"channel": "sms",
"recipient": "+15555550001",
"content": "Your verification code is 123456",
"priority": "high"
}'
# → 202 Accepted · Location header + JSON body
# Read one
curl http://localhost:8080/api/v1/notifications/{id}
# List with filters and cursor pagination
curl 'http://localhost:8080/api/v1/notifications?status=delivered&channel=sms&limit=25'
# Cancel (only pending / queued / retrying)
curl -X PATCH http://localhost:8080/api/v1/notifications/{id}/cancel
# Full lifecycle trace
curl http://localhost:8080/api/v1/notifications/{id}/trace
Batch
curl -X POST http://localhost:8080/api/v1/notifications/batch \
-H 'content-type: application/json' \
-d '{
"notifications": [
{ "channel":"sms", "recipient":"+15555550001", "content":"Bulk 1" },
{ "channel":"email", "recipient":"a@example.com","content":"Bulk 2" }
]
}'
# → 202 Accepted · {id, size, correlation_id}
# Fetch a batch with all member notifications inlined
curl http://localhost:8080/api/v1/notifications/batch/{id}
Templates
curl -X POST http://localhost:8080/api/v1/templates \
-H 'content-type: application/json' \
-d '{ "name":"welcome", "channel":"sms", "body":"Hello {{.Name}}!" }'
curl http://localhost:8080/api/v1/templates
curl http://localhost:8080/api/v1/templates/{id}
curl -X PUT http://localhost:8080/api/v1/templates/{id} -d '...'
curl -X DELETE http://localhost:8080/api/v1/templates/{id}
# Use a template at notification time
curl -X POST http://localhost:8080/api/v1/notifications \
-H 'content-type: application/json' \
-d '{
"channel":"sms", "recipient":"+15555550001", "content":"placeholder",
"template_id":"<uuid>", "template_variables":{"Name":"Ada"}
}'
# content is replaced with the rendered template body
WebSocket (real-time updates)
wscat -c ws://localhost:8080/api/v1/ws/notifications
> {"action":"subscribe","notification_id":"<uuid>"}
< {"notification_id":"<uuid>","status":"processing"}
< {"notification_id":"<uuid>","status":"delivered"}
curl http://localhost:8080/healthz/live # process alive
curl http://localhost:8080/healthz/ready # pg + redis reachable
curl http://localhost:8080/metrics # prometheus exposition
curl http://localhost:8080/api/v1/metrics # json-friendly summary
For complete curl recipes (every method, every status code, every
header) see docs/API_EXAMPLES.md.
Observability
CLAUDE.md §3.8 names the three-layer stack — every alert has a runbook
entry, every dashboard panel is backed by a documented metric.
| Tier |
Where |
| Logs |
stdout JSON via log/slog; every line carries service + correlation_id |
| Metrics |
/metrics on api / worker / reconciler; collectors per CLAUDE.md §12.1 |
| Traces |
OTel SDK, no-op default; set OTEL_EXPORTER_OTLP_ENDPOINT to flip live |
| Alerts |
12 rules in deploy/prometheus/alerts.yml |
| Routing |
Critical / warning severities → dedicated receivers; see alertmanager.yml |
| Dashboards |
Notifications Overview · HTTP API Performance · Worker & Queue Health (auto-loaded from deploy/grafana/dashboards/) |
| Runbook |
docs/RUNBOOK.md — one entry per alert (what / check / causes / remediate / escalate) |
The per-notification audit trail is exposed at
GET /api/v1/notifications/{id}/trace — the support endpoint for
"what actually happened to this notification".
Three k6 scenarios live under tests/load/ and the
methodology + results sit in docs/LOAD_TEST.md:
| Scenario |
Shape |
What it proves |
baseline |
300 rps for 60 s |
sustained throughput · p95 < 200 ms · no DLQ growth |
burst |
1000 rps spike + 50 s drain |
queue absorbs flash sales without spillover |
rate_limit |
200 rps single-channel |
outbound limiter throttles without surfacing 5xx |
Headline numbers (reference hardware: 8-perf-core x86, 16 GB):
- baseline accept-latency p95 ≈ 90 ms, p99 ≈ 130 ms, failure rate 0 %
- burst absorbs 10 000 enqueues, queue drains in ~45 s post-spike
- rate-limit scenario: 100 % POST 202, zero DLQ entries attributable to throttling
Design decisions
Eleven ADRs in docs/adr/ document the load-bearing
choices made before the first line of business code:
| ADR |
Decision |
| 0001 |
Hexagonal architecture (ports & adapters) |
| 0002 |
sqlc for type-safe SQL — no ORM |
| 0003 |
asynq as the queue (priority + retry + unique + rate limit) |
| 0004 |
Strategy pattern for providers (Registry routes by channel) |
| 0005 |
No bespoke dashboard — operational UIs already exist |
| 0006 |
coder/websocket for the WebSocket adapter |
| 0007 |
Circuit breaker thresholds |
| 0008 |
Idempotency at two layers (Redis header cache + DB unique key) |
| 0009 |
Atomic claim pattern in the worker (CLAUDE.md §3.10) |
| 0010 |
No .env file — config inline in docker-compose.yml |
| 0011 |
Reconciler safety net instead of an outbox pattern |
Scaling considerations
- Stateless tiers: api, worker, and reconciler hold no in-memory
state beyond a request's lifetime. Scale horizontally by adjusting
the replica count — no leader election, no sticky sessions.
- Atomic claim (CLAUDE.md §3.10, ADR-0009): the worker uses
UPDATE ... WHERE status IN ('queued', 'retrying') RETURNING *,
so concurrent workers across replicas (or asynq redeliveries) never
double-send.
FOR UPDATE SKIP LOCKED in reconciler queries (CLAUDE.md §3.11):
competing reconciler instances see disjoint rows; horizontal scaling
is safe.
- Two-layer rate limiting (CLAUDE.md §2.6): inbound 60 req/min/IP
protects the API, outbound 100 msg/s/channel protects providers.
Independent keyspaces in Redis (
ip: vs channel:).
- Cursor pagination: keyset
(created_at, id) < (...) survives
concurrent writes — offsets would drift.
- Idempotency: client-supplied
Idempotency-Key cached in Redis
for 24 h; the DB has a partial unique index on the same column as a
belt-and-braces second layer.
Troubleshooting
| Symptom |
Likely cause / where to look |
http://localhost:8080 not responding |
docker compose ps — is api healthy? make migrate-up run? |
200 OK on /healthz/live but 503 on /healthz/ready |
Postgres or Redis container is down or unreachable |
Notifications stuck in queued |
Worker not running, or asynq misconfigured (check asynqmon) |
Notifications stuck in processing |
Worker crash mid-Send; reconciler sweeps after 5 min |
| 429 on every request |
Inbound rate limit (INBOUND_RATE_LIMIT); raise or back off |
| Provider always fails |
Check circuit breaker state in Notifications Overview dashboard |
| Tests time out in CI |
Docker daemon slow on the runner — bump -timeout flag |
See docs/RUNBOOK.md for alert-specific operator
playbooks.
Limitations
Honest scope boundaries — what is not in this baseline:
- Providers are mocked.
MockProvider (configurable failure rate)
is the default; WebhookProvider accepts any HTTP endpoint as a
generic outbound sink. Twilio / SendGrid / FCM adapters are designed
for via ports.Provider (ADR-0004) but not bundled — adding one is
a single new file behind the same interface.
- Single-tenant. No
tenant_id column or per-tenant rate-limit
isolation. The schema and rate-limit Redis keys would extend cleanly
for multi-tenant SaaS.
- OpenTelemetry exporter is no-op by default. The SDK is wired; set
OTEL_EXPORTER_OTLP_ENDPOINT and point it at Jaeger / Tempo to flip
live spans on with no code change.
- AlertManager routes to a log receiver. Slack / PagerDuty / email
receivers are one config block away; left as log-only for the
assessment to keep
docker compose up self-contained.
cmd/migrate is not unit-tested. It is exercised via integration
tests (the e2e harness applies migrations); unit-testing the CLI
wrapper would require mocking *migrate.Migrate, which CLAUDE.md §4
forbids ("do not mock what you don't own").
Future work
Not in scope for the assessment but obvious next steps:
- Webhook signing — HMAC the outbound webhook body so receivers
can verify provenance.
- Per-tenant config — currently single-tenant; a
tenant_id
column + middleware unlock multi-tenant SaaS.
- OpenTelemetry collector —
OTEL_EXPORTER_OTLP_ENDPOINT is
wired; pointing it at Jaeger / Tempo in compose surfaces traces
in Grafana with one env var change.
- Real provider plugins —
WebhookProvider is the seam;
Twilio / SendGrid / FCM adapters would slot in behind the
ports.Provider interface without touching domain code.
- Distributed reconciler election — Phase 5 task 17 proves
FOR UPDATE SKIP LOCKED lets multiple instances run safely; a
leader-election layer would let them coordinate intervals.
License
MIT · see LICENSE.