samsara

package module
v0.4.0
Published: Apr 4, 2026 License: MIT Imports: 14 Imported by: 0

README

samsara

A small, explicit lifecycle runtime for Go services — zero external dependencies.

The name comes from the concept of cyclical existence: components fail, are restarted, and return to service. samsara makes that cycle explicit, controlled, and observable — rather than something that happens ad hoc in main.go.

samsara coordinates the startup, health monitoring, graceful shutdown, and orchestrator integration of your service's infrastructure components. It handles the questions every production service eventually faces: what order do dependencies start? what happens when Redis dies at 3am? when should the pod stop receiving traffic?

go get github.com/sunkek/samsara

Core concept

Everything your service depends on is a Component:

type Component interface {
    Name() string
    Start(ctx context.Context, ready func()) error
    Stop(ctx context.Context) error
}

Start blocks for the component's entire lifetime. Call ready() exactly once when the component can serve traffic. Stop unblocks Start and cleans up.


Lifecycle contract

This section is the authoritative specification. Read it before implementing a component.

Start(ctx, ready)
  • Must block until the component exits — do not return early while the component is still serving.
  • Call ready() exactly once, only when the component is actually able to serve its intended function — not before a connection is verified, not before a port is bound.
  • ready() is safe to call multiple times (idempotent internally), but should only be called once semantically.
  • If ctx is cancelled (clean shutdown), return nil.
  • Return a non-nil error only on abnormal failure — not on clean context cancellation.
  • If ready() is never called, the supervisor waits up to startTimeout (default 15s) and treats the attempt as a failure.
Stop(ctx)
  • Must unblock Start and release resources within the context deadline (stopTimeout, default 10s).
  • Must be idempotent — the supervisor may call Stop more than once in some shutdown paths.
  • Must be concurrency-safe: Stop may be called concurrently with a still-initialising Start (e.g. before a port is bound). Guard shared state accordingly.
Background goroutines

If Start spawns background goroutines, they must exit when ctx is cancelled or Stop is called. The supervisor has no way to detect or reap leaked goroutines. A leaked goroutine from a restarted component will accumulate across restart cycles.

Cancellation semantics

Event                                 Start should return     Stop is called
Clean shutdown (signal / Shutdown())  nil                     Yes
Component failure                     non-nil error           No (already exited)
Restart due to health failure         nil (Stop unblocks it)  Yes, before restart

Minimal example

func main() {
    app := samsara.NewApplication(
        samsara.WithMainFunc(func(ctx context.Context) error {
            log.Println("running — press Ctrl+C to stop")
            <-ctx.Done()
            return nil
        }),
    )
    if err := app.Run(); err != nil {
        log.Fatal(err)
    }
}

ctx is cancelled automatically on SIGINT, SIGTERM, SIGHUP, or SIGQUIT. A clean signal shutdown returns nil — exit code 0.


Real-world example

func main() {
    logger := slog.Default()

    sup := samsara.NewSupervisor(
        samsara.WithSupervisorLogger(logger),
        samsara.WithHealthInterval(10 * time.Second),
        samsara.WithEventHooks(&samsara.EventHooks{
            OnUnhealthy: func(component string, err error) {
                logger.Error("component unhealthy", "component", component, "error", err)
            },
            OnRecovered: func(component string) {
                logger.Info("component recovered", "component", component)
            },
            OnFailed: func(component string, err error) {
                logger.Error("component permanently failed", "component", component, "error", err)
            },
        }),
    )

    // Register HealthServer FIRST — it starts before all other components
    // and stops LAST, keeping orchestrators informed throughout.
    hs := samsara.NewHealthServer(sup, samsara.WithHealthAddr(":8080"))
    sup.Add(hs)

    sup.Add(postgres.New(cfg.Postgres),
        samsara.WithTier(samsara.TierCritical),
        samsara.WithRestartPolicy(samsara.ExponentialBackoff(5, time.Second)),
    )
    sup.Add(redis.New(cfg.Redis),
        samsara.WithTier(samsara.TierCritical),
        samsara.WithRestartPolicy(samsara.ExponentialBackoff(5, time.Second)),
    )
    sup.Add(s3.New(cfg.S3),
        samsara.WithTier(samsara.TierSignificant),
        samsara.WithRestartPolicy(samsara.AlwaysRestart(5*time.Second)),
    )
    // HTTP server starts only after postgres and redis are ready.
    sup.Add(httpserver.New(cfg.HTTP),
        samsara.WithTier(samsara.TierCritical),
        samsara.WithRestartPolicy(samsara.MaxRetries(3, 2*time.Second)),
        samsara.WithDependencies("postgres", "redis"),
    )

    app := samsara.NewApplication(
        samsara.WithSupervisor(sup),
        samsara.WithLogger(logger),
        samsara.WithShutdownTimeout(30*time.Second),
    )
    if err := app.Run(); err != nil {
        logger.Error("application exited with error", "error", err)
        os.Exit(1)
    }
}

Component tiers

Tiers define how a component's health affects the rest of the application.

Tier                    Transient unhealthy          Permanent failure
TierCritical (default)  App shuts down               App shuts down
TierSignificant         /readyz → 503, app stays up  App shuts down
TierAuxiliary           Logged only, no effect       Component removed, app continues

Use TierSignificant for components that degrade — but don't break — your service (e.g. a cache, a metrics sink). The app keeps running and the load balancer is informed via /readyz.

Use TierAuxiliary for components whose failure is entirely non-blocking (e.g. an audit log sink, a tracing exporter).

Failure model

Each component goes through these states:

          start failure
               │
[Starting] ───────────────► [Failed] ─── if restart policy allows ──► [Starting]
               │                                                            ↑
           ready() called                                                   │
               │                                                      stop + delay
           [Running]                                                        │
               │                                                            │
        health check fails ──────────────────────────────────────────────────
               │
        restart policy exhausted
               │
           [Permanently Failed]
               │
        TierCritical/Significant ──► shutdown
        TierAuxiliary            ──► removed from monitoring, app continues

Recovery is automatic: if a restarted component calls ready() and health checks pass, OnRecovered fires and /readyz flips back to 200.


Restart policies

samsara.NeverRestart()                         // fail once → permanent (default)
samsara.AlwaysRestart(2*time.Second)           // retry forever, fixed delay
samsara.MaxRetries(5, time.Second)             // up to 5 retries, fixed delay
samsara.ExponentialBackoff(5, time.Second)     // 1s, 2s, 4s, 8s, 16s (±25% jitter)

The attempt counter resets to zero if the component runs without fault for longer than WithRestartResetWindow (default 5 minutes).

When to use restarts vs when to crash

Use internal restarts for components whose failure is transient and independent — a cache client that loses its connection, a queue consumer that gets disconnected, a metrics exporter. These are safe to restart because their failure doesn't affect application correctness.

Be cautious restarting core request-path components (the primary HTTP server, the main DB pool). A flapping critical component can create a misleading "alive but broken" state. If a component fails repeatedly, consider whether the correct response is to restart it or to crash the pod and let the orchestrator restart the whole process from a clean state.


Health checking

Implement HealthChecker to participate in health polling:

type HealthChecker interface {
    Health(ctx context.Context) error
}

The supervisor calls Health every WithHealthInterval (default 10s). A non-nil result triggers the tier logic above. Health checks are bounded by WithHealthTimeout (default 5s).


Health endpoints

HealthServer exposes three HTTP endpoints for orchestrators:

Endpoint      200 when                                       Use for
GET /livez    Process is alive                               Kubernetes livenessProbe
GET /readyz   All Critical + Significant components healthy  Kubernetes readinessProbe
GET /healthz  Same as /readyz                                Docker HEALTHCHECK

/readyz flips during startup — it returns 503 until every Critical and Significant component has called ready() and passed its first health check.

/readyz flips during shutdown — the HealthServer stops last, so it returns 503 as soon as shutdown begins, before any other component stops. This drains load balancer connections cleanly.

Recovery — if a degraded component recovers, /readyz returns 200 again automatically.

/readyz response body:

{
  "status": "degraded",
  "components": [
    { "name": "postgres",    "status": "ok",       "restart_count": 0 },
    { "name": "redis",       "status": "degraded", "error": "connection refused", "restart_count": 2 },
    { "name": "http-server", "status": "ok",       "restart_count": 0 }
  ]
}

restart_count is omitted from JSON when zero.
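A consumer can decode that body into a small struct; the sketch below mirrors the JSON shape shown above (the Go struct itself is an assumption, not a type exported by the package):

```go
package main

import "encoding/json"

// ReadyzResponse mirrors the /readyz body shown above.
type ReadyzResponse struct {
	Status     string `json:"status"`
	Components []struct {
		Name         string `json:"name"`
		Status       string `json:"status"`
		Error        string `json:"error,omitempty"`
		RestartCount int    `json:"restart_count,omitempty"`
	} `json:"components"`
}

// decodeReadyz parses a /readyz response body.
func decodeReadyz(body []byte) (ReadyzResponse, error) {
	var r ReadyzResponse
	err := json.Unmarshal(body, &r)
	return r, err
}
```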

Docker example:

HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
  CMD wget -qO- http://localhost:8080/healthz || exit 1

Register HealthServer first:

hs := samsara.NewHealthServer(sup,
    samsara.WithHealthAddr(":8080"),
    samsara.WithHealthLogger(logger),
)
sup.Add(hs) // always first

Dependency ordering

Components start sequentially in registration order. Use WithDependencies when a component must not start until another is ready:

sup.Add(postgres.New(cfg))
sup.Add(redis.New(cfg))
sup.Add(httpServer,
    samsara.WithDependencies("postgres", "redis"),
)

httpServer starts only after both postgres and redis have called ready(). On shutdown, components stop in reverse start order — httpServer stops before postgres and redis, so no in-flight requests touch a closed pool.


Pitfalls and best practices

Call ready() too early
// ❌ Wrong — ready() before the connection is verified
func (c *Cache) Start(ctx context.Context, ready func()) error {
    c.client = redis.NewClient(opts) // lazy — no connection yet
    ready()                          // supervisor proceeds, but cache may be broken
    <-ctx.Done()
    return nil
}

// ✅ Right — ready() only after a successful ping
func (c *Cache) Start(ctx context.Context, ready func()) error {
    c.client = redis.NewClient(opts)
    if err := c.client.Ping(ctx).Err(); err != nil {
        return err
    }
    ready()
    <-ctx.Done()
    return nil
}
Forget to unblock Start on Stop
// ❌ Wrong — Stop closes the client but Start blocks forever
func (s *Server) Start(ctx context.Context, ready func()) error {
    ready()
    for job := range s.jobs { process(job) } // blocks; Stop never unblocks this
    return nil
}

// ✅ Right — Stop signals Start to exit
func (s *Server) Start(ctx context.Context, ready func()) error {
    s.stop = make(chan struct{})
    ready()
    select {
    case <-s.stop:
    case <-ctx.Done():
    }
    return nil
}
func (s *Server) Stop(_ context.Context) error {
    close(s.stop)
    return nil
}
Leak goroutines after Stop
// ❌ Wrong — background goroutine outlives the component
func (w *Worker) Start(ctx context.Context, ready func()) error {
    go w.runLoop() // no way to stop this
    ready()
    <-ctx.Done()
    return nil
}

// ✅ Right — goroutine exits when ctx is cancelled
func (w *Worker) Start(ctx context.Context, ready func()) error {
    go w.runLoop(ctx) // ctx cancellation stops the loop
    ready()
    <-ctx.Done()
    return nil
}
Return non-nil error on clean shutdown
// ❌ Wrong — ctx.Err() looks like a failure to the supervisor
func (s *Server) Start(ctx context.Context, ready func()) error {
    ready()
    <-ctx.Done()
    return ctx.Err() // returns context.Canceled — treated as a crash
}

// ✅ Right
func (s *Server) Start(ctx context.Context, ready func()) error {
    ready()
    <-ctx.Done()
    return nil
}

Runtime status

Inspect all component states at any time via the supervisor:

for _, status := range sup.HealthReportOrdered() {
    fmt.Printf("%-20s healthy=%-5v restarts=%d\n",
        status.Name,
        status.Err == nil,
        status.RestartCount,
    )
}

ComponentStatus fields:

Field         Type   Description
Err           error  nil = healthy; non-nil = last health check error
Known         bool   false until first health check completes
Tier          Tier   TierCritical, TierSignificant, or TierAuxiliary
RestartCount  int    How many times the supervisor has restarted this component

Metrics integration

Implement MetricsObserver to receive telemetry without adding dependencies to the package:

type MetricsObserver interface {
    ComponentStarted(component string, attempt int)
    ComponentStopped(component string, err error)
    ComponentRestarting(component string, err error, attempt int, delay time.Duration)
    HealthCheckCompleted(component string, duration time.Duration, err error)
}

Programmatic shutdown

app.Shutdown(errors.New("config reload required"))
// cause is available inside Start/main via context.Cause(ctx)

app.Shutdown(nil) is a clean shutdown — same semantics as Ctrl+C, returns nil from app.Run().
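The cause propagation rides on Go's standard context.Cause mechanism. A standalone illustration of that mechanism (this is plain stdlib code, not samsara internals):

```go
package main

import (
	"context"
	"errors"
)

// shutdownCause demonstrates what Shutdown(cause) does conceptually:
// cancel a context with a cause that Start and the main function can
// later recover via context.Cause.
func shutdownCause() error {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errors.New("config reload required"))
	<-ctx.Done()
	return context.Cause(ctx) // the original cause, not context.Canceled
}
```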


Logger interface

samsara.Logger is satisfied directly by *slog.Logger and most structured loggers:

samsara.WithLogger(slog.Default())
samsara.WithSupervisorLogger(slog.Default())
samsara.WithHealthLogger(slog.Default())

Configuration reference

Supervisor

Option                  Default  Description
WithHealthInterval      10s      How often to poll Health()
WithStartTimeout        15s      How long to wait for ready() to be called
WithHealthTimeout       5s       Deadline for each Health() call
WithStopTimeout         10s      Deadline for each Stop() call
WithRestartResetWindow  5m       Stable runtime before restart counter resets
WithSupervisorLogger    nop      Structured logger
WithEventHooks          nil      Lifecycle event callbacks
WithMetricsObserver     nop      Telemetry receiver

Application

Option               Default  Description
WithShutdownTimeout  15s      How long to wait for clean exit after signal
WithLogger           nop      Structured logger
WithMainFunc         nil      Optional blocking main function
WithSupervisor       nil      Optional supervisor to run alongside main

HealthServer

Option                  Default        Description
WithHealthAddr          :9090          Listen address
WithHealthName          health-server  Component name (for multi-instance setups)
WithHealthLogger        nop            Structured logger
WithHealthReadTimeout   5s             HTTP read timeout
WithHealthWriteTimeout  5s             HTTP write timeout

Acknowledgements

Initially inspired by this article and appctl.

Documentation

Index

Constants

View Source
const (
	// ErrNothingToRun is returned by Application.Run when neither a MainFunc
	// nor a Supervisor was provided.
	ErrNothingToRun appError = "samsara: nothing to run (no main function or supervisor provided)"

	// ErrShutdownTimeout is returned when the application does not stop within
	// the configured ShutdownTimeout after the context is cancelled.
	ErrShutdownTimeout appError = "samsara: shutdown timeout exceeded"

	// ErrComponentAlreadyRegistered is returned when a component with the same
	// name is added to the Supervisor more than once.
	ErrComponentAlreadyRegistered appError = "samsara: component already registered"

	// ErrCircularDependency is returned when the Supervisor detects a cycle in
	// the component dependency graph.
	ErrCircularDependency appError = "samsara: circular dependency detected"

	// ErrUnknownDependency is returned when a component declares a dependency on
	// a name that has not been registered with the Supervisor.
	ErrUnknownDependency appError = "samsara: unknown dependency"

	// ErrSupervisorRunning is returned when Add is called after Run has started.
	ErrSupervisorRunning appError = "samsara: cannot add component after supervisor has started"
)

Variables

This section is empty.

Functions

This section is empty.

Types

type Application

type Application struct {
	// contains filtered or unexported fields
}

Application is the top-level entry point for a service. It wires together signal handling, an optional Supervisor, and a main function into a single blocking Run call.

Typical usage:

sup := samsara.NewSupervisor(...)
sup.Add(myDB, samsara.WithTier(samsara.TierCritical))
sup.Add(myCache, samsara.WithTier(samsara.TierSignificant))

app := samsara.NewApplication(
    samsara.WithSupervisor(sup),
    samsara.WithMainFunc(server.Run),
    samsara.WithShutdownTimeout(20*time.Second),
)
if err := app.Run(); err != nil {
    log.Fatal(err)
}

func NewApplication

func NewApplication(opts ...ApplicationOption) *Application

NewApplication constructs an Application with the supplied options.

func (*Application) Run

func (a *Application) Run() error

Run starts the application and blocks until it exits.

Startup order:

  1. Root context is created and wired to OS signals (SIGINT, SIGTERM, SIGHUP, SIGQUIT).
  2. Supervisor.Run is launched in a goroutine (if a Supervisor was provided).
  3. The main function is launched in a goroutine (if one was provided).

Shutdown is triggered by any of:

  • An OS signal.
  • A call to Application.Shutdown(cause).
  • The main function returning (with or without an error).
  • The Supervisor encountering a critical failure.

After the shutdown signal, Run waits up to ShutdownTimeout for both goroutines to finish. If they do not, ErrShutdownTimeout is joined into the returned error.

func (*Application) Shutdown

func (a *Application) Shutdown(cause error)

Shutdown cancels the application's root context, triggering a graceful shutdown. The optional cause is attached to the context so that components and the main function can inspect it via context.Cause if needed.

It is safe to call from any goroutine. Calling Shutdown before Run is a no-op. Calling it multiple times is safe; only the first cause is recorded.

type ApplicationOption

type ApplicationOption func(*applicationConfig)

ApplicationOption configures an Application.

func WithLogger

func WithLogger(l Logger) ApplicationOption

WithLogger sets the logger used by the Application itself (not the Supervisor — pass WithSupervisorLogger to NewSupervisor for that).

func WithMainFunc

func WithMainFunc(f func(ctx context.Context) error) ApplicationOption

WithMainFunc sets the primary function that runs as the application's main goroutine. The context passed to f is cancelled when an OS shutdown signal is received or when the Supervisor encounters a critical failure. Returning a non-nil error from f is treated as an application-level failure.

func WithShutdownTimeout

func WithShutdownTimeout(d time.Duration) ApplicationOption

WithShutdownTimeout sets how long the application waits for the main function and supervisor to exit after the root context is cancelled. Defaults to 15s. If the timeout is exceeded, ErrShutdownTimeout is joined into the returned error.

func WithSupervisor

func WithSupervisor(s *Supervisor) ApplicationOption

WithSupervisor attaches a Supervisor to the application. The supervisor is started alongside the main function and both receive the same root context.

type Component

type Component interface {
	Name() string
	Start(ctx context.Context, ready func()) error
	Stop(ctx context.Context) error
}

Component is the fundamental unit managed by the Supervisor.

Lifecycle contract

Start must block for the entire lifetime of the component. The ready function must be called exactly once, as soon as the component is ready to serve traffic — not before, and never more than once (the supervisor wraps it in sync.Once so double-calls are safe, but semantically wrong). The supervisor will not start the next component until ready is called. Start should return nil on a clean exit (ctx cancelled, Stop called) and a non-nil error on unexpected failure.

If ready is never called, the supervisor will wait up to startTimeout and then treat the attempt as a failure.

Stop is called with a context carrying the configured stop timeout. It must not block longer than that context allows. Stop must be idempotent — the supervisor may call it more than once in some shutdown paths. Stop must also be safe to call concurrently with a still-running Start (e.g. before a port is bound), so components must guard shared state accordingly.

If ctx is cancelled during Start (clean shutdown), Start should return nil. Only return a non-nil error when an abnormal failure occurs that the supervisor should treat as a crash.

Background goroutines

Start is allowed to spawn background goroutines, but those goroutines must exit when ctx is cancelled or Stop is called — whichever comes first. A component that leaks goroutines after Stop returns will cause resource leaks on restart. The supervisor has no way to detect or recover from this.

Example — an HTTP server:

func (s *Server) Start(ctx context.Context, ready func()) error {
    ln, err := net.Listen("tcp", s.addr)
    if err != nil { return err }
    ready()                       // port is bound — supervisor proceeds
    return s.srv.Serve(ln)        // blocks until Stop calls Shutdown
}

Example — a DB pool (no run loop needed):

func (p *Pool) Start(ctx context.Context, ready func()) error {
    p.stop = make(chan struct{})
    pool, err := pgxpool.New(ctx, p.dsn)
    if err != nil { return err }
    p.pool = pool
    ready()                       // pool is up — supervisor proceeds
    select {
    case <-p.stop:
    case <-ctx.Done():
    }
    return nil
}

type ComponentOption

type ComponentOption func(*componentConfig)

ComponentOption configures a managedComponent at registration time.

func WithDependencies

func WithDependencies(names ...string) ComponentOption

WithDependencies declares that this component must not be started until all named dependencies are running. Names must match Component.Name() of other registered components.

func WithRestartPolicy

func WithRestartPolicy(p RestartPolicy) ComponentOption

WithRestartPolicy sets the restart policy. Defaults to NeverRestart().

func WithTier

func WithTier(t Tier) ComponentOption

WithTier sets the importance tier of a component. Defaults to TierCritical.

type ComponentStatus

type ComponentStatus struct {
	Err          error // nil means healthy; non-nil means last health check failed
	Known        bool  // false until the first health check completes
	Tier         Tier
	RestartCount int // number of times the component has been restarted by the supervisor
}

ComponentStatus is a point-in-time snapshot of a single component's health.

type EventHooks

type EventHooks struct {
	// OnUnhealthy is called when a component's Health check returns a non-nil
	// error. It receives the component name and the health error.
	OnUnhealthy func(component string, err error)

	// OnRecovered is called when a component's Health check returns nil again
	// after a previous OnUnhealthy event.
	OnRecovered func(component string)

	// OnFailed is called when a component fails permanently — either because
	// its restart policy decided not to retry, or because all retries were
	// exhausted. It receives the component name and the final error.
	OnFailed func(component string, err error)

	// OnRestart is called each time the supervisor schedules a restart attempt
	// for a component. It receives the component name, the triggering error,
	// and the attempt number (1-based).
	OnRestart func(component string, err error, attempt int)
}

EventHooks carries optional callbacks that the Supervisor fires on significant component lifecycle events. All fields are optional; a nil function is silently skipped.

Hooks are called synchronously inside the supervisor goroutine that manages the component, so they must not block. Enqueue to a channel or spawn a goroutine if you need non-trivial work (e.g. sending a PagerDuty alert).
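One way to keep hooks non-blocking is to have them enqueue events onto a buffered channel that a worker goroutine drains. A sketch, with illustrative types and names (nothing here is package API):

```go
package main

// hookEvent carries the data a hook receives.
type hookEvent struct {
	kind      string // "unhealthy", "recovered", "failed", "restart"
	component string
	err       error
}

// newAsyncHooks returns a send function suitable for use inside
// EventHooks callbacks, plus the channel a worker should drain.
// Sends never block: if the buffer is full the event is dropped,
// which is preferable to stalling the supervisor goroutine.
func newAsyncHooks(buffer int) (func(hookEvent), <-chan hookEvent) {
	ch := make(chan hookEvent, buffer)
	send := func(ev hookEvent) {
		select {
		case ch <- ev:
		default: // buffer full: drop rather than block
		}
	}
	return send, ch
}
```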

type HealthChecker

type HealthChecker interface {
	Health(ctx context.Context) error
}

HealthChecker is an optional extension of Component. When implemented, the supervisor polls Health on the configured healthInterval and acts on the result according to the component's Tier.

type HealthReporter

type HealthReporter interface {
	HealthReportOrdered() []NamedComponentStatus
}

HealthReporter is the interface the HealthServer uses to query component health. *Supervisor satisfies this via HealthReportOrdered().

type HealthServer

type HealthServer struct {
	// contains filtered or unexported fields
}

HealthServer is a Component that exposes three HTTP endpoints:

GET /livez   — liveness:  200 while the process is alive
GET /readyz  — readiness: 200 if all Critical/Significant components healthy
GET /healthz — alias for /readyz (Docker HEALTHCHECK compatibility)

Register HealthServer first with the Supervisor so it starts before everything else and stops last.

func NewHealthServer

func NewHealthServer(reporter HealthReporter, opts ...HealthServerOption) *HealthServer

NewHealthServer creates a HealthServer. Pass a *Supervisor as the reporter.

func (*HealthServer) Name

func (h *HealthServer) Name() string

func (*HealthServer) Start

func (h *HealthServer) Start(_ context.Context, ready func()) error

Start implements Component. It binds the TCP port, calls ready() to signal the supervisor, then serves until Stop is called.

func (*HealthServer) Stop

func (h *HealthServer) Stop(ctx context.Context) error

type HealthServerOption

type HealthServerOption func(*healthServerConfig)

HealthServerOption configures a HealthServer.

func WithHealthAddr

func WithHealthAddr(addr string) HealthServerOption

func WithHealthLogger

func WithHealthLogger(l Logger) HealthServerOption

func WithHealthName

func WithHealthName(name string) HealthServerOption

WithHealthName overrides the component name returned by HealthServer.Name. This is useful when registering multiple HealthServer instances (e.g. on different ports) with the same Supervisor. Defaults to "health-server".

func WithHealthReadTimeout

func WithHealthReadTimeout(d time.Duration) HealthServerOption

func WithHealthWriteTimeout

func WithHealthWriteTimeout(d time.Duration) HealthServerOption

type Logger

type Logger interface {
	Debug(msg string, kv ...any)
	Info(msg string, kv ...any)
	Error(msg string, kv ...any)
}

Logger is a minimal structured logging interface. It is intentionally narrow so that any slog, zap, zerolog, or logrus wrapper satisfies it with a thin adapter, keeping the samsara package free of logging dependencies.

Key-value pairs are passed as alternating key, value arguments (slog style).

type MetricsObserver

type MetricsObserver interface {
	// ComponentStarted is called each time a component's Start call returns
	// without error and the component is considered running.
	ComponentStarted(component string, attempt int)

	// ComponentStopped is called when a component's Stop call returns,
	// regardless of whether it returned an error.
	ComponentStopped(component string, err error)

	// ComponentRestarting is called when the supervisor decides to restart a
	// component after a failure. attempt is 1-based.
	ComponentRestarting(component string, err error, attempt int, delay time.Duration)

	// HealthCheckCompleted is called after every health check poll, whether
	// healthy or not. duration is how long the Health() call took.
	HealthCheckCompleted(component string, duration time.Duration, err error)
}

MetricsObserver receives structured telemetry events from the Supervisor. Implement this interface to bridge into Prometheus, OpenTelemetry, Datadog, or any other metrics backend without adding a hard dependency to this package.

All methods are called synchronously from the supervisor goroutine that manages the component, so they must not block. Enqueue or use a non-blocking write if your backend requires I/O.

All fields are optional at the implementation level — a partial observer that only cares about restarts is perfectly valid.

type NamedComponentStatus

type NamedComponentStatus struct {
	Name string
	ComponentStatus
}

NamedComponentStatus is a ComponentStatus with its component name.

type RestartPolicy

type RestartPolicy interface {
	ShouldRestart(err error, attempt int) (restart bool, delay time.Duration)
}

RestartPolicy decides whether a component should be restarted after a failure and, if so, how long to wait before the next attempt.

attempt is zero-based: the first restart is attempt 0, the second is 1, etc. Returning false for restart means the component has failed permanently.

func AlwaysRestart

func AlwaysRestart(delay time.Duration) RestartPolicy

AlwaysRestart returns a policy that restarts a component unconditionally with a fixed delay between attempts.

func ExponentialBackoff

func ExponentialBackoff(maxRetries int, baseDelay time.Duration) RestartPolicy

ExponentialBackoff returns a policy that restarts a component up to maxRetries times. The delay doubles with each attempt starting from baseDelay, with ±25% jitter applied to spread restarts when many instances fail simultaneously:

attempt 0: baseDelay  × [0.75, 1.25)
attempt 1: 2×baseDelay × [0.75, 1.25)
attempt 2: 4×baseDelay × [0.75, 1.25)  …and so on

func MaxRetries

func MaxRetries(maxRetries int, delay time.Duration) RestartPolicy

MaxRetries returns a policy that restarts a component up to maxRetries times with a fixed delay. After maxRetries attempts the component fails permanently.

func NeverRestart

func NeverRestart() RestartPolicy

NeverRestart returns a policy that never restarts a component. Use this for components whose failure should propagate immediately.

type Supervisor

type Supervisor struct {
	// contains filtered or unexported fields
}

Supervisor starts, monitors, and stops a set of Components in dependency order. Components are started sequentially (dependencies first) and stopped in reverse order (dependents first).

func NewSupervisor

func NewSupervisor(opts ...SupervisorOption) *Supervisor

NewSupervisor constructs a Supervisor with the given options.

func (*Supervisor) Add

func (s *Supervisor) Add(c Component, opts ...ComponentOption)

Add registers a Component with the Supervisor. Panics if called after Run has started or if a component with the same name is already registered.

func (*Supervisor) ComponentHealth

func (s *Supervisor) ComponentHealth(name string) (err error, known bool)

ComponentHealth returns the last known health error for a named component.

func (*Supervisor) HealthReport

func (s *Supervisor) HealthReport() map[string]ComponentStatus

HealthReport returns a snapshot of all component health states keyed by name.

func (*Supervisor) HealthReportOrdered

func (s *Supervisor) HealthReportOrdered() []NamedComponentStatus

HealthReportOrdered returns a name-sorted slice of component health states.

func (*Supervisor) Run

func (s *Supervisor) Run(ctx context.Context) error

Run starts all registered components in dependency order, monitors them, and blocks until ctx is cancelled or a critical failure occurs.

type SupervisorOption

type SupervisorOption func(*supervisorConfig)

SupervisorOption configures a Supervisor.

func WithEventHooks

func WithEventHooks(h *EventHooks) SupervisorOption

func WithHealthInterval

func WithHealthInterval(d time.Duration) SupervisorOption

WithHealthInterval sets how often the supervisor polls each component's Health method. Defaults to 10s.

func WithHealthTimeout

func WithHealthTimeout(d time.Duration) SupervisorOption

func WithMetricsObserver

func WithMetricsObserver(m MetricsObserver) SupervisorOption

WithMetricsObserver registers a MetricsObserver for telemetry events.

func WithRestartResetWindow

func WithRestartResetWindow(d time.Duration) SupervisorOption

func WithStartTimeout

func WithStartTimeout(d time.Duration) SupervisorOption

WithStartTimeout sets how long the supervisor waits for a component to call ready() after Start is launched. Defaults to 15s.

func WithStopTimeout

func WithStopTimeout(d time.Duration) SupervisorOption

func WithSupervisorLogger

func WithSupervisorLogger(l Logger) SupervisorOption

type Tier

type Tier int

Tier expresses how important a component is to overall application health.

const (
	// TierCritical (default) — a permanently failed or persistently unhealthy
	// critical component causes the entire application to shut down.
	TierCritical Tier = iota

	// TierSignificant — while a significant component is transiently unhealthy
	// the application is marked not-ready (/readyz returns 503) but keeps
	// running. A permanent failure (restart policy exhausted) triggers a full
	// shutdown, identical to TierCritical.
	TierSignificant

	// TierAuxiliary — health problems are logged and hooks are fired, but they
	// have no effect on /readyz and do not trigger a shutdown. Even a permanent
	// failure only removes the component from monitoring; the app continues.
	TierAuxiliary
)

func (Tier) String

func (t Tier) String() string
