samsara

package module
v0.4.0
Published: Apr 4, 2026 License: MIT Imports: 14 Imported by: 0

README

samsara

A small, explicit lifecycle runtime for Go services — zero external dependencies.

The name comes from the concept of cyclical existence: components fail, are restarted, and return to service. samsara makes that cycle explicit, controlled, and observable — rather than something that happens ad hoc in main.go.

samsara coordinates the startup, health monitoring, graceful shutdown, and orchestrator integration of your service's infrastructure components. It handles the questions every production service eventually faces: what order do dependencies start? what happens when Redis dies at 3am? when should the pod stop receiving traffic?

go get github.com/sunkek/samsara

Core concept

Everything your service depends on is a Component:

type Component interface {
    Name() string
    Start(ctx context.Context, ready func()) error
    Stop(ctx context.Context) error
}

Start blocks for the component's entire lifetime. Call ready() exactly once when the component can serve traffic. Stop unblocks Start and cleans up.


Lifecycle contract

This section is the authoritative specification. Read it before implementing a component.

Start(ctx, ready)
  • Must block until the component exits — do not return early while the component is still serving.
  • Call ready() exactly once, only when the component is actually able to serve its intended function — not before a connection is verified, not before a port is bound.
  • ready() is safe to call multiple times (idempotent internally), but should only be called once semantically.
  • If ctx is cancelled (clean shutdown), return nil.
  • Return a non-nil error only on abnormal failure — not on clean context cancellation.
  • If ready() is never called, the supervisor waits up to startTimeout (default 15s) and treats the attempt as a failure.
Stop(ctx)
  • Must unblock Start and release resources within the context deadline (stopTimeout, default 10s).
  • Must be idempotent — the supervisor may call Stop more than once in some shutdown paths.
  • Must be concurrency-safe: Stop may be called concurrently with a still-initialising Start (e.g. before a port is bound). Guard shared state accordingly.
Background goroutines

If Start spawns background goroutines, they must exit when ctx is cancelled or Stop is called. The supervisor has no way to detect or reap leaked goroutines. A leaked goroutine from a restarted component will accumulate across restart cycles.

Cancellation semantics

Event                                 Start should return     Stop is called
Clean shutdown (signal / Shutdown())  nil                     Yes
Component failure                     non-nil error           No (already exited)
Restart due to health failure         nil (Stop unblocks it)  Yes, before restart

Minimal example

func main() {
    app := samsara.NewApplication(
        samsara.WithMainFunc(func(ctx context.Context) error {
            log.Println("running — press Ctrl+C to stop")
            <-ctx.Done()
            return nil
        }),
    )
    if err := app.Run(); err != nil {
        log.Fatal(err)
    }
}

ctx is cancelled automatically on SIGINT, SIGTERM, SIGHUP, or SIGQUIT. A clean signal shutdown returns nil — exit code 0.


Real-world example

func main() {
    logger := slog.Default()

    sup := samsara.NewSupervisor(
        samsara.WithSupervisorLogger(logger),
        samsara.WithHealthInterval(10 * time.Second),
        samsara.WithEventHooks(&samsara.EventHooks{
            OnUnhealthy: func(component string, err error) {
                logger.Error("component unhealthy", "component", component, "error", err)
            },
            OnRecovered: func(component string) {
                logger.Info("component recovered", "component", component)
            },
            OnFailed: func(component string, err error) {
                logger.Error("component permanently failed", "component", component, "error", err)
            },
        }),
    )

    // Register HealthServer FIRST — it starts before all other components
    // and stops LAST, keeping orchestrators informed throughout.
    hs := samsara.NewHealthServer(sup, samsara.WithHealthAddr(":8080"))
    sup.Add(hs)

    sup.Add(postgres.New(cfg.Postgres),
        samsara.WithTier(samsara.TierCritical),
        samsara.WithRestartPolicy(samsara.ExponentialBackoff(5, time.Second)),
    )
    sup.Add(redis.New(cfg.Redis),
        samsara.WithTier(samsara.TierCritical),
        samsara.WithRestartPolicy(samsara.ExponentialBackoff(5, time.Second)),
    )
    sup.Add(s3.New(cfg.S3),
        samsara.WithTier(samsara.TierSignificant),
        samsara.WithRestartPolicy(samsara.AlwaysRestart(5*time.Second)),
    )
    // HTTP server starts only after postgres and redis are ready.
    sup.Add(httpserver.New(cfg.HTTP),
        samsara.WithTier(samsara.TierCritical),
        samsara.WithRestartPolicy(samsara.MaxRetries(3, 2*time.Second)),
        samsara.WithDependencies("postgres", "redis"),
    )

    app := samsara.NewApplication(
        samsara.WithSupervisor(sup),
        samsara.WithLogger(logger),
        samsara.WithShutdownTimeout(30*time.Second),
    )
    if err := app.Run(); err != nil {
        logger.Error("application exited with error", "error", err)
        os.Exit(1)
    }
}

Component tiers

Tiers define how a component's health affects the rest of the application.

Tier                    Transient unhealthy          Permanent failure
TierCritical (default)  App shuts down               App shuts down
TierSignificant         /readyz → 503, app stays up  App shuts down
TierAuxiliary           Logged only, no effect       Component removed, app continues

Use TierSignificant for components that degrade — but don't break — your service (e.g. a cache, a metrics sink). The app keeps running and the load balancer is informed via /readyz.

Use TierAuxiliary for components whose failure is entirely non-blocking (e.g. an audit log sink, a tracing exporter).

Failure model

Each component goes through these states:

          start failure
               │
[Starting] ───────────────► [Failed] ─── if restart policy allows ──► [Starting]
               │                                                            ↑
           ready() called                                                   │
               │                                                      stop + delay
           [Running]                                                        │
               │                                                            │
        health check fails ──────────────────────────────────────────────────
               │
        restart policy exhausted
               │
           [Permanently Failed]
               │
        TierCritical/Significant ──► shutdown
        TierAuxiliary            ──► removed from monitoring, app continues

Recovery is automatic: if a restarted component calls ready() and health checks pass, OnRecovered fires and /readyz flips back to 200.


Restart policies

samsara.NeverRestart()                         // fail once → permanent (default)
samsara.AlwaysRestart(2*time.Second)           // retry forever, fixed delay
samsara.MaxRetries(5, time.Second)             // up to 5 retries, fixed delay
samsara.ExponentialBackoff(5, time.Second)     // 1s, 2s, 4s, 8s, 16s (±25% jitter)

The attempt counter resets to zero if the component runs without fault for longer than WithRestartResetWindow (default 5 minutes).

When to use restarts vs when to crash

Use internal restarts for components whose failure is transient and independent — a cache client that loses its connection, a queue consumer that gets disconnected, a metrics exporter. These are safe to restart because their failure doesn't affect application correctness.

Be cautious restarting core request-path components (the primary HTTP server, the main DB pool). A flapping critical component can create a misleading "alive but broken" state. If a component fails repeatedly, consider whether the correct response is to restart it or to crash the pod and let the orchestrator restart the whole process from a clean state.


Health checking

Implement HealthChecker to participate in health polling:

type HealthChecker interface {
    Health(ctx context.Context) error
}

The supervisor calls Health every WithHealthInterval (default 10s). A non-nil result triggers the tier logic above. Health checks are bounded by WithHealthTimeout (default 5s).


Health endpoints

HealthServer exposes three HTTP endpoints for orchestrators:

Endpoint      200 when                                       Use for
GET /livez    Process is alive                               Kubernetes livenessProbe
GET /readyz   All Critical + Significant components healthy  Kubernetes readinessProbe
GET /healthz  Same as /readyz                                Docker HEALTHCHECK

/readyz flips during startup — it returns 503 until every Critical and Significant component has called ready() and passed its first health check.

/readyz flips during shutdown — the HealthServer stops last, so it returns 503 as soon as shutdown begins, before any other component stops. This drains load balancer connections cleanly.

Recovery — if a degraded component recovers, /readyz returns 200 again automatically.

/readyz response body:

{
  "status": "degraded",
  "components": [
    { "name": "postgres",    "status": "ok",       "restart_count": 0 },
    { "name": "redis",       "status": "degraded", "error": "connection refused", "restart_count": 2 },
    { "name": "http-server", "status": "ok",       "restart_count": 0 }
  ]
}

restart_count is omitted from JSON when zero.
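A consumer can decode that body into a small struct; the sketch below mirrors the JSON shape shown above (the Go struct itself is an assumption, not a type exported by the package):

```go
package main

import "encoding/json"

// ReadyzResponse mirrors the /readyz body shown above.
type ReadyzResponse struct {
	Status     string `json:"status"`
	Components []struct {
		Name         string `json:"name"`
		Status       string `json:"status"`
		Error        string `json:"error,omitempty"`
		RestartCount int    `json:"restart_count,omitempty"`
	} `json:"components"`
}

// decodeReadyz parses a /readyz response body.
func decodeReadyz(body []byte) (ReadyzResponse, error) {
	var r ReadyzResponse
	err := json.Unmarshal(body, &r)
	return r, err
}
```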

Docker example:

HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
  CMD wget -qO- http://localhost:8080/healthz || exit 1

Register HealthServer first:

hs := samsara.NewHealthServer(sup,
    samsara.WithHealthAddr(":8080"),
    samsara.WithHealthLogger(logger),
)
sup.Add(hs) // always first

Dependency ordering

Components start sequentially in registration order. Use WithDependencies when a component must not start until another is ready:

sup.Add(postgres.New(cfg))
sup.Add(redis.New(cfg))
sup.Add(httpServer,
    samsara.WithDependencies("postgres", "redis"),
)

httpServer starts only after both postgres and redis have called ready(). On shutdown, components stop in reverse start order — httpServer stops before postgres and redis, so no in-flight requests touch a closed pool.


Pitfalls and best practices

Call ready() too early
// ❌ Wrong — ready() before the connection is verified
func (c *Cache) Start(ctx context.Context, ready func()) error {
    c.client = redis.NewClient(opts) // lazy — no connection yet
    ready()                          // supervisor proceeds, but cache may be broken
    <-ctx.Done()
    return nil
}

// ✅ Right — ready() only after a successful ping
func (c *Cache) Start(ctx context.Context, ready func()) error {
    c.client = redis.NewClient(opts)
    if err := c.client.Ping(ctx).Err(); err != nil {
        return err
    }
    ready()
    <-ctx.Done()
    return nil
}
Forget to unblock Start on Stop
// ❌ Wrong — Stop closes the client but Start blocks forever
func (s *Server) Start(ctx context.Context, ready func()) error {
    ready()
    for job := range s.jobs { process(job) } // blocks; Stop never unblocks this
    return nil
}

// ✅ Right — Stop signals Start to exit
func (s *Server) Start(ctx context.Context, ready func()) error {
    s.stop = make(chan struct{})
    ready()
    select {
    case <-s.stop:
    case <-ctx.Done():
    }
    return nil
}
func (s *Server) Stop(_ context.Context) error {
    close(s.stop)
    return nil
}
Leak goroutines after Stop
// ❌ Wrong — background goroutine outlives the component
func (w *Worker) Start(ctx context.Context, ready func()) error {
    go w.runLoop() // no way to stop this
    ready()
    <-ctx.Done()
    return nil
}

// ✅ Right — goroutine exits when ctx is cancelled
func (w *Worker) Start(ctx context.Context, ready func()) error {
    go w.runLoop(ctx) // ctx cancellation stops the loop
    ready()
    <-ctx.Done()
    return nil
}
Return non-nil error on clean shutdown
// ❌ Wrong — ctx.Err() looks like a failure to the supervisor
func (s *Server) Start(ctx context.Context, ready func()) error {
    ready()
    <-ctx.Done()
    return ctx.Err() // returns context.Canceled — treated as a crash
}

// ✅ Right
func (s *Server) Start(ctx context.Context, ready func()) error {
    ready()
    <-ctx.Done()
    return nil
}

Runtime status

Inspect all component states at any time via the supervisor:

for _, status := range sup.HealthReportOrdered() {
    fmt.Printf("%-20s healthy=%-5v restarts=%d\n",
        status.Name,
        status.Err == nil,
        status.RestartCount,
    )
}

ComponentStatus fields:

Field         Type   Description
Err           error  nil = healthy; non-nil = last health check error
Known         bool   false until first health check completes
Tier          Tier   TierCritical, TierSignificant, or TierAuxiliary
RestartCount  int    How many times the supervisor has restarted this component

Metrics integration

Implement MetricsObserver to receive telemetry without adding dependencies to the package:

type MetricsObserver interface {
    ComponentStarted(component string, attempt int)
    ComponentStopped(component string, err error)
    ComponentRestarting(component string, err error, attempt int, delay time.Duration)
    HealthCheckCompleted(component string, duration time.Duration, err error)
}

Programmatic shutdown

app.Shutdown(errors.New("config reload required"))
// cause is available inside Start/main via context.Cause(ctx)

app.Shutdown(nil) is a clean shutdown — same semantics as Ctrl+C, returns nil from app.Run().
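The cause propagation rides on Go's standard context.Cause mechanism. A standalone illustration of that mechanism (this is plain stdlib code, not samsara internals):

```go
package main

import (
	"context"
	"errors"
)

// shutdownCause demonstrates what Shutdown(cause) does conceptually:
// cancel a context with a cause that Start and the main function can
// later recover via context.Cause.
func shutdownCause() error {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errors.New("config reload required"))
	<-ctx.Done()
	return context.Cause(ctx) // the original cause, not context.Canceled
}
```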


Logger interface

samsara.Logger is satisfied directly by *slog.Logger and most structured loggers:

samsara.WithLogger(slog.Default())
samsara.WithSupervisorLogger(slog.Default())
samsara.WithHealthLogger(slog.Default())

Configuration reference

Supervisor

Option                  Default  Description
WithHealthInterval      10s      How often to poll Health()
WithStartTimeout        15s      How long to wait for ready() to be called
WithHealthTimeout       5s       Deadline for each Health() call
WithStopTimeout         10s      Deadline for each Stop() call
WithRestartResetWindow  5m       Stable runtime before restart counter resets
WithSupervisorLogger    nop      Structured logger
WithEventHooks          nil      Lifecycle event callbacks
WithMetricsObserver     nop      Telemetry receiver

Application

Option               Default  Description
WithShutdownTimeout  15s      How long to wait for clean exit after signal
WithLogger           nop      Structured logger
WithMainFunc         nil      Optional blocking main function
WithSupervisor       nil      Optional supervisor to run alongside main

HealthServer

Option                  Default        Description
WithHealthAddr          :9090          Listen address
WithHealthName          health-server  Component name (for multi-instance setups)
WithHealthLogger        nop            Structured logger
WithHealthReadTimeout   5s             HTTP read timeout
WithHealthWriteTimeout  5s             HTTP write timeout

Acknowledgements

Initially inspired by this article and appctl.

Documentation

Index

Constants

View Source
const (
	// ErrNothingToRun is returned by Application.Run when neither a MainFunc
	// nor a Supervisor was provided.
	ErrNothingToRun appError = "samsara: nothing to run (no main function or supervisor provided)"

	// ErrShutdownTimeout is returned when the application does not stop within
	// the configured ShutdownTimeout after the context is cancelled.
	ErrShutdownTimeout appError = "samsara: shutdown timeout exceeded"

	// ErrComponentAlreadyRegistered is returned when a component with the same
	// name is added to the Supervisor more than once.
	ErrComponentAlreadyRegistered appError = "samsara: component already registered"

	// ErrCircularDependency is returned when the Supervisor detects a cycle in
	// the component dependency graph.
	ErrCircularDependency appError = "samsara: circular dependency detected"

	// ErrUnknownDependency is returned when a component declares a dependency on
	// a name that has not been registered with the Supervisor.
	ErrUnknownDependency appError = "samsara: unknown dependency"

	// ErrSupervisorRunning is returned when Add is called after Run has started.
	ErrSupervisorRunning appError = "samsara: cannot add component after supervisor has started"
)

Variables

This section is empty.

Functions

This section is empty.

Types

type Application

type Application struct {
	// contains filtered or unexported fields
}

Application is the top-level entry point for a service. It wires together signal handling, an optional Supervisor, and a main function into a single blocking Run call.

Typical usage:

sup := samsara.NewSupervisor(...)
sup.Add(myDB, samsara.WithTier(samsara.TierCritical))
sup.Add(myCache, samsara.WithTier(samsara.TierSignificant))

app := samsara.NewApplication(
    samsara.WithSupervisor(sup),
    samsara.WithMainFunc(server.Run),
    samsara.WithShutdownTimeout(20*time.Second),
)
if err := app.Run(); err != nil {
    log.Fatal(err)
}

func NewApplication

func NewApplication(opts ...ApplicationOption) *Application

NewApplication constructs an Application with the supplied options.

func (*Application) Run

func (a *Application) Run() error

Run starts the application and blocks until it exits.

Startup order:

  1. Root context is created and wired to OS signals (SIGINT, SIGTERM, SIGHUP, SIGQUIT).
  2. Supervisor.Run is launched in a goroutine (if a Supervisor was provided).
  3. The main function is launched in a goroutine (if one was provided).

Shutdown is triggered by any of:

  • An OS signal.
  • A call to Application.Shutdown(cause).
  • The main function returning (with or without an error).
  • The Supervisor encountering a critical failure.

After the shutdown signal, Run waits up to ShutdownTimeout for both goroutines to finish. If they do not, ErrShutdownTimeout is joined into the returned error.

func (*Application) Shutdown

func (a *Application) Shutdown(cause error)

Shutdown cancels the application's root context, triggering a graceful shutdown. The optional cause is attached to the context so that components and the main function can inspect it via context.Cause if needed.

It is safe to call from any goroutine. Calling Shutdown before Run is a no-op. Calling it multiple times is safe; only the first cause is recorded.

type ApplicationOption

type ApplicationOption func(*applicationConfig)

ApplicationOption configures an Application.

func WithLogger

func WithLogger(l Logger) ApplicationOption

WithLogger sets the logger used by the Application itself (not the Supervisor — pass WithSupervisorLogger to NewSupervisor for that).

func WithMainFunc

func WithMainFunc(f func(ctx context.Context) error) ApplicationOption

WithMainFunc sets the primary function that runs as the application's main goroutine. The context passed to f is cancelled when an OS shutdown signal is received or when the Supervisor encounters a critical failure. Returning a non-nil error from f is treated as an application-level failure.

func WithShutdownTimeout

func WithShutdownTimeout(d time.Duration) ApplicationOption

WithShutdownTimeout sets how long the application waits for the main function and supervisor to exit after the root context is cancelled. Defaults to 15s. If the timeout is exceeded, ErrShutdownTimeout is joined into the returned error.

func WithSupervisor

func WithSupervisor(s *Supervisor) ApplicationOption

WithSupervisor attaches a Supervisor to the application. The supervisor is started alongside the main function and both receive the same root context.

type Component

type Component interface {
	Name() string
	Start(ctx context.Context, ready func()) error
	Stop(ctx context.Context) error
}

Component is the fundamental unit managed by the Supervisor.

Lifecycle contract

Start must block for the entire lifetime of the component. The ready function must be called exactly once, as soon as the component is ready to serve traffic — not before, and never more than once (the supervisor wraps it in sync.Once so double-calls are safe, but semantically wrong). The supervisor will not start the next component until ready is called. Start should return nil on a clean exit (ctx cancelled, Stop called) and a non-nil error on unexpected failure.

If ready is never called, the supervisor will wait up to startTimeout and then treat the attempt as a failure.

Stop is called with a context carrying the configured stop timeout. It must not block longer than that context allows. Stop must be idempotent — the supervisor may call it more than once in some shutdown paths. Stop must also be safe to call concurrently with a still-running Start (e.g. before a port is bound), so components must guard shared state accordingly.

If ctx is cancelled during Start (clean shutdown), Start should return nil. Only return a non-nil error when an abnormal failure occurs that the supervisor should treat as a crash.

Background goroutines

Start is allowed to spawn background goroutines, but those goroutines must exit when ctx is cancelled or Stop is called — whichever comes first. A component that leaks goroutines after Stop returns will cause resource leaks on restart. The supervisor has no way to detect or recover from this.

Example — an HTTP server:

func (s *Server) Start(ctx context.Context, ready func()) error {
    ln, err := net.Listen("tcp", s.addr)
    if err != nil { return err }
    ready()                       // port is bound — supervisor proceeds
    return s.srv.Serve(ln)        // blocks until Stop calls Shutdown
}

Example — a DB pool (no run loop needed):

func (p *Pool) Start(ctx context.Context, ready func()) error {
    p.stop = make(chan struct{})
    pool, err := pgxpool.New(ctx, p.dsn)
    if err != nil { return err }
    p.pool = pool
    ready()                       // pool is up — supervisor proceeds
    select {
    case <-p.stop:
    case <-ctx.Done():
    }
    return nil
}

type ComponentOption

type ComponentOption func(*componentConfig)

ComponentOption configures a managedComponent at registration time.

func WithDependencies

func WithDependencies(names ...string) ComponentOption

WithDependencies declares that this component must not be started until all named dependencies are running. Names must match Component.Name() of other registered components.

func WithRestartPolicy

func WithRestartPolicy(p RestartPolicy) ComponentOption

WithRestartPolicy sets the restart policy. Defaults to NeverRestart().

func WithTier

func WithTier(t Tier) ComponentOption

WithTier sets the importance tier of a component. Defaults to TierCritical.

type ComponentStatus

type ComponentStatus struct {
	Err          error // nil means healthy; non-nil means last health check failed
	Known        bool  // false until the first health check completes
	Tier         Tier
	RestartCount int // number of times the component has been restarted by the supervisor
}

ComponentStatus is a point-in-time snapshot of a single component's health.

type EventHooks

type EventHooks struct {
	// OnUnhealthy is called when a component's Health check returns a non-nil
	// error. It receives the component name and the health error.
	OnUnhealthy func(component string, err error)

	// OnRecovered is called when a component's Health check returns nil again
	// after a previous OnUnhealthy event.
	OnRecovered func(component string)

	// OnFailed is called when a component fails permanently — either because
	// its restart policy decided not to retry, or because all retries were
	// exhausted. It receives the component name and the final error.
	OnFailed func(component string, err error)

	// OnRestart is called each time the supervisor schedules a restart attempt
	// for a component. It receives the component name, the triggering error,
	// and the attempt number (1-based).
	OnRestart func(component string, err error, attempt int)
}

EventHooks carries optional callbacks that the Supervisor fires on significant component lifecycle events. All fields are optional; a nil function is silently skipped.

Hooks are called synchronously inside the supervisor goroutine that manages the component, so they must not block. Enqueue to a channel or spawn a goroutine if you need non-trivial work (e.g. sending a PagerDuty alert).
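One way to keep hooks non-blocking is to have them enqueue events onto a buffered channel that a worker goroutine drains. A sketch, with illustrative types and names (nothing here is package API):

```go
package main

// hookEvent carries the data a hook receives.
type hookEvent struct {
	kind      string // "unhealthy", "recovered", "failed", "restart"
	component string
	err       error
}

// newAsyncHooks returns a send function suitable for use inside
// EventHooks callbacks, plus the channel a worker should drain.
// Sends never block: if the buffer is full the event is dropped,
// which is preferable to stalling the supervisor goroutine.
func newAsyncHooks(buffer int) (func(hookEvent), <-chan hookEvent) {
	ch := make(chan hookEvent, buffer)
	send := func(ev hookEvent) {
		select {
		case ch <- ev:
		default: // buffer full: drop rather than block
		}
	}
	return send, ch
}
```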

type HealthChecker

type HealthChecker interface {
	Health(ctx context.Context) error
}

HealthChecker is an optional extension of Component. When implemented, the supervisor polls Health on the configured healthInterval and acts on the result according to the component's Tier.

type HealthReporter

type HealthReporter interface {
	HealthReportOrdered() []NamedComponentStatus
}

HealthReporter is the interface the HealthServer uses to query component health. *Supervisor satisfies this via HealthReportOrdered().

type HealthServer

type HealthServer struct {
	// contains filtered or unexported fields
}

HealthServer is a Component that exposes three HTTP endpoints:

GET /livez   — liveness:  200 while the process is alive
GET /readyz  — readiness: 200 if all Critical/Significant components healthy
GET /healthz — alias for /readyz (Docker HEALTHCHECK compatibility)

Register HealthServer first with the Supervisor so it starts before everything else and stops last.

func NewHealthServer

func NewHealthServer(reporter HealthReporter, opts ...HealthServerOption) *HealthServer

NewHealthServer creates a HealthServer. Pass a *Supervisor as the reporter.

func (*HealthServer) Name

func (h *HealthServer) Name() string

func (*HealthServer) Start

func (h *HealthServer) Start(_ context.Context, ready func()) error

Start implements Component. It binds the TCP port, calls ready() to signal the supervisor, then serves until Stop is called.

func (*HealthServer) Stop

func (h *HealthServer) Stop(ctx context.Context) error

type HealthServerOption

type HealthServerOption func(*healthServerConfig)

HealthServerOption configures a HealthServer.

func WithHealthAddr

func WithHealthAddr(addr string) HealthServerOption

func WithHealthLogger

func WithHealthLogger(l Logger) HealthServerOption

func WithHealthName

func WithHealthName(name string) HealthServerOption

WithHealthName overrides the component name returned by HealthServer.Name. This is useful when registering multiple HealthServer instances (e.g. on different ports) with the same Supervisor. Defaults to "health-server".

func WithHealthReadTimeout

func WithHealthReadTimeout(d time.Duration) HealthServerOption

func WithHealthWriteTimeout

func WithHealthWriteTimeout(d time.Duration) HealthServerOption

type Logger

type Logger interface {
	Debug(msg string, kv ...any)
	Info(msg string, kv ...any)
	Error(msg string, kv ...any)
}

Logger is a minimal structured logging interface. It is intentionally narrow so that any slog, zap, zerolog, or logrus wrapper satisfies it with a thin adapter, keeping the samsara package free of logging dependencies.

Key-value pairs are passed as alternating key, value arguments (slog style).

type MetricsObserver

type MetricsObserver interface {
	// ComponentStarted is called each time a component's Start call returns
	// without error and the component is considered running.
	ComponentStarted(component string, attempt int)

	// ComponentStopped is called when a component's Stop call returns,
	// regardless of whether it returned an error.
	ComponentStopped(component string, err error)

	// ComponentRestarting is called when the supervisor decides to restart a
	// component after a failure. attempt is 1-based.
	ComponentRestarting(component string, err error, attempt int, delay time.Duration)

	// HealthCheckCompleted is called after every health check poll, whether
	// healthy or not. duration is how long the Health() call took.
	HealthCheckCompleted(component string, duration time.Duration, err error)
}

MetricsObserver receives structured telemetry events from the Supervisor. Implement this interface to bridge into Prometheus, OpenTelemetry, Datadog, or any other metrics backend without adding a hard dependency to this package.

All methods are called synchronously from the supervisor goroutine that manages the component, so they must not block. Enqueue or use a non-blocking write if your backend requires I/O.

All fields are optional at the implementation level — a partial observer that only cares about restarts is perfectly valid.

type NamedComponentStatus

type NamedComponentStatus struct {
	Name string
	ComponentStatus
}

NamedComponentStatus is a ComponentStatus with its component name.

type RestartPolicy

type RestartPolicy interface {
	ShouldRestart(err error, attempt int) (restart bool, delay time.Duration)
}

RestartPolicy decides whether a component should be restarted after a failure and, if so, how long to wait before the next attempt.

attempt is zero-based: the first restart is attempt 0, the second is 1, etc. Returning false for restart means the component has failed permanently.

func AlwaysRestart

func AlwaysRestart(delay time.Duration) RestartPolicy

AlwaysRestart returns a policy that restarts a component unconditionally with a fixed delay between attempts.

func ExponentialBackoff

func ExponentialBackoff(maxRetries int, baseDelay time.Duration) RestartPolicy

ExponentialBackoff returns a policy that restarts a component up to maxRetries times. The delay doubles with each attempt starting from baseDelay, with ±25% jitter applied to spread restarts when many instances fail simultaneously:

attempt 0: baseDelay  × [0.75, 1.25)
attempt 1: 2×baseDelay × [0.75, 1.25)
attempt 2: 4×baseDelay × [0.75, 1.25)  …and so on

func MaxRetries

func MaxRetries(maxRetries int, delay time.Duration) RestartPolicy

MaxRetries returns a policy that restarts a component up to maxRetries times with a fixed delay. After maxRetries attempts the component fails permanently.

func NeverRestart

func NeverRestart() RestartPolicy

NeverRestart returns a policy that never restarts a component. Use this for components whose failure should propagate immediately.

type Supervisor

type Supervisor struct {
	// contains filtered or unexported fields
}

Supervisor starts, monitors, and stops a set of Components in dependency order. Components are started sequentially (dependencies first) and stopped in reverse order (dependents first).

func NewSupervisor

func NewSupervisor(opts ...SupervisorOption) *Supervisor

NewSupervisor constructs a Supervisor with the given options.

func (*Supervisor) Add

func (s *Supervisor) Add(c Component, opts ...ComponentOption)

Add registers a Component with the Supervisor. Panics if called after Run has started or if a component with the same name is already registered.

func (*Supervisor) ComponentHealth

func (s *Supervisor) ComponentHealth(name string) (err error, known bool)

ComponentHealth returns the last known health error for a named component.

func (*Supervisor) HealthReport

func (s *Supervisor) HealthReport() map[string]ComponentStatus

HealthReport returns a snapshot of all component health states keyed by name.

func (*Supervisor) HealthReportOrdered

func (s *Supervisor) HealthReportOrdered() []NamedComponentStatus

HealthReportOrdered returns a name-sorted slice of component health states.

func (*Supervisor) Run

func (s *Supervisor) Run(ctx context.Context) error

Run starts all registered components in dependency order, monitors them, and blocks until ctx is cancelled or a critical failure occurs.

type SupervisorOption

type SupervisorOption func(*supervisorConfig)

SupervisorOption configures a Supervisor.

func WithEventHooks

func WithEventHooks(h *EventHooks) SupervisorOption

func WithHealthInterval

func WithHealthInterval(d time.Duration) SupervisorOption

WithHealthInterval sets how often the supervisor polls each component's Health method. Defaults to 10s.

func WithHealthTimeout

func WithHealthTimeout(d time.Duration) SupervisorOption

func WithMetricsObserver

func WithMetricsObserver(m MetricsObserver) SupervisorOption

WithMetricsObserver registers a MetricsObserver for telemetry events.

func WithRestartResetWindow

func WithRestartResetWindow(d time.Duration) SupervisorOption

func WithStartTimeout

func WithStartTimeout(d time.Duration) SupervisorOption

WithStartTimeout sets how long the supervisor waits for a component to call ready() after Start is launched. Defaults to 15s.

func WithStopTimeout

func WithStopTimeout(d time.Duration) SupervisorOption

func WithSupervisorLogger

func WithSupervisorLogger(l Logger) SupervisorOption

type Tier

type Tier int

Tier expresses how important a component is to overall application health.

const (
	// TierCritical (default) — a permanently failed or persistently unhealthy
	// critical component causes the entire application to shut down.
	TierCritical Tier = iota

	// TierSignificant — while a significant component is transiently unhealthy
	// the application is marked not-ready (/readyz returns 503) but keeps
	// running. A permanent failure (restart policy exhausted) triggers a full
	// shutdown, identical to TierCritical.
	TierSignificant

	// TierAuxiliary — health problems are logged and hooks are fired, but they
	// have no effect on /readyz and do not trigger a shutdown. Even a permanent
	// failure only removes the component from monitoring; the app continues.
	TierAuxiliary
)

func (Tier) String

func (t Tier) String() string
