docker

package
v0.7.0 Latest Latest
Published: Jul 3, 2026 License: Apache-2.0 Imports: 56 Imported by: 0

README

Docker Backend

The Docker backend provisions ephemeral containers for tenant workloads. It receives provision requests from Fred, manages the full container lifecycle (pull, create, start, verify, deprovision), enforces SKU-based resource limits, and reports results via HMAC-signed callbacks.

Configuration Reference

All fields are set in the backend's YAML config block. Defaults come from DefaultConfig().

Core

| Field | YAML Key | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Name | name | string | "docker" | Backend identifier |
| ListenAddr | listen_addr | string | ":9001" | HTTP server listen address |
| DockerHost | docker_host | string | "unix:///var/run/docker.sock" | Docker daemon socket path or URL |
| HostAddress | host_address | string | (required) | External IP/hostname for port mappings. Must be a valid IP or hostname, not a URL |
| HostBindIP | host_bind_ip | string | "0.0.0.0" | IP address to bind container ports to |
| LogLevel | log_level | string | "info" | Log verbosity: debug, info, warn, error. Not set in DefaultConfig(); defaults to "info" at startup via cmp.Or |
| ProductionMode | production_mode | bool | false | Tightens startup checks beyond basic validation. When true, Validate rejects dev-only insecure toggles — currently callback_insecure_skip_verify. Mirrors providerd's production_mode |

TLS & mTLS (ENG-103)

| Field | YAML Key | Type | Default | Description |
| --- | --- | --- | --- | --- |
| TLSCertFile | tls_cert_file | string | (empty) | Server certificate. When set together with tls_key_file, the listener serves HTTPS; otherwise plaintext HTTP (default). Loaded once at startup — rotation requires a restart (ENG-294) |
| TLSKeyFile | tls_key_file | string | (empty) | Server private key. Must be set together with tls_cert_file |
| TLSClientCAFile | tls_client_ca_file | string | (empty) | Enables mutual TLS: the listener requires and verifies a client certificate signed by this CA. Requires tls_cert_file + tls_key_file |
| TLSClientAllowedNames | tls_client_allowed_names | []string | (empty) | Optionally pins the mTLS client identity — the presented cert's CommonName or a DNS SAN must be in this list. Empty accepts any cert signed by tls_client_ca_file. Requires tls_client_ca_file |

Resources

| Field | YAML Key | Type | Default | Description |
| --- | --- | --- | --- | --- |
| TotalCPUCores | total_cpu_cores | float64 | 8.0 | Total CPU cores in the resource pool |
| TotalMemoryMB | total_memory_mb | int64 | 16384 | Total memory available (MB) |
| TotalDiskMB | total_disk_mb | int64 | 102400 | Total disk space available (MB) |

SKU Management

| Field | YAML Key | Type | Default | Description |
| --- | --- | --- | --- | --- |
| SKUMapping | sku_mapping | map[string]string | (empty) | Maps on-chain SKU UUIDs to profile names |
| SKUProfiles | sku_profiles | map[string]SKUProfile | (required, non-empty) | Maps profile names to resource limits. Operator-declared; no defaults |

sku_profiles is required and authoritative — DefaultConfig() deliberately does not seed it, because yaml.v3 merges map keys during Unmarshal and a partial operator config would silently inherit defaults (see ENG-238). Validate rejects an empty map with "at least one SKU profile is required".

Recommended starter profiles (copy these into your config if you want the previous four-tier shape):

| Profile | CPU Cores | Memory MB | Disk MB |
| --- | --- | --- | --- |
| docker-micro | 0.25 | 256 | 512 |
| docker-small | 0.5 | 512 | 1024 |
| docker-medium | 1.0 | 1024 | 2048 |
| docker-large | 2.0 | 2048 | 4096 |

SKU resolution: the backend first checks SKUMapping for a UUID-to-name translation, then looks up the name in SKUProfiles. This allows on-chain UUIDs to map to human-readable profile names.
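The two-step lookup described above can be sketched as follows. This is a minimal illustration, not the backend's code: `resolveSKU` is a hypothetical helper, the UUID is made up, and falling back to the raw value when `SKUMapping` has no entry is an assumption of the sketch.

```go
package main

import "fmt"

// SKUProfile mirrors the three documented profile fields.
type SKUProfile struct {
	CPUCores float64
	MemoryMB int64
	DiskMB   int64
}

// resolveSKU is a hypothetical helper, not the backend's real function.
// It translates sku through mapping when an entry exists, then looks the
// resulting name up in profiles. Using the raw value as the profile name
// when no mapping entry exists is an assumption of this sketch.
func resolveSKU(sku string, mapping map[string]string, profiles map[string]SKUProfile) (string, SKUProfile, error) {
	name := sku
	if mapped, ok := mapping[sku]; ok {
		name = mapped
	}
	profile, ok := profiles[name]
	if !ok {
		return "", SKUProfile{}, fmt.Errorf("no SKU profile for %q (resolved name %q)", sku, name)
	}
	return name, profile, nil
}

func main() {
	// Illustrative UUID and profile values only.
	mapping := map[string]string{"11111111-2222-3333-4444-555555555555": "docker-small"}
	profiles := map[string]SKUProfile{
		"docker-small": {CPUCores: 0.5, MemoryMB: 512, DiskMB: 1024},
	}
	name, p, err := resolveSKU("11111111-2222-3333-4444-555555555555", mapping, profiles)
	fmt.Println(name, p.MemoryMB, err)
}
```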

Image Security

| Field | YAML Key | Type | Default | Description |
| --- | --- | --- | --- | --- |
| AllowedRegistries | allowed_registries | []string | ["docker.io", "ghcr.io"] | Registries from which images may be pulled |

Images are validated before pull. The registry is extracted from the image reference (e.g., ghcr.io/org/app:v1 -> ghcr.io). Bare names like nginx resolve to docker.io.
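A rough sketch of the registry extraction, assuming the usual Docker reference convention (the first path component is treated as a registry only when it contains a `.` or `:` or equals `localhost`). The helper names `registryOf` and `imageAllowed` are hypothetical; the backend's actual parser may handle more cases.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// registryOf returns the registry host implied by an image reference.
// Assumption: the first path component is a registry only if it contains
// '.' or ':' or equals "localhost"; otherwise docker.io is implied.
func registryOf(image string) string {
	first, _, hasSlash := strings.Cut(image, "/")
	if !hasSlash {
		return "docker.io" // bare name such as "nginx" or "nginx:1.27"
	}
	if strings.ContainsAny(first, ".:") || first == "localhost" {
		return first
	}
	return "docker.io" // e.g. "library/nginx" or "org/app"
}

// imageAllowed reports whether the image's registry is in the allow list.
func imageAllowed(image string, allowed []string) bool {
	return slices.Contains(allowed, registryOf(image))
}

func main() {
	allowed := []string{"docker.io", "ghcr.io"} // documented default
	for _, img := range []string{"nginx", "ghcr.io/org/app:v1", "quay.io/org/app:v1"} {
		fmt.Printf("%-22s registry=%-10s allowed=%t\n", img, registryOf(img), imageAllowed(img, allowed))
	}
}
```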

Callbacks

| Field | YAML Key | Type | Default | Description |
| --- | --- | --- | --- | --- |
| CallbackSecret | callback_secret | string | (required, min 32 chars) | HMAC-SHA256 secret for signing callbacks |
| CallbackInsecureSkipVerify | callback_insecure_skip_verify | bool | false | Skip TLS verification for callbacks (dev only) |
| CallbackDBPath | callback_db_path | string | "callbacks.db" | Path to bbolt database for persisting pending callbacks |
| CallbackMaxAge | callback_max_age | duration | 24h | Maximum age of persisted callback entries before cleanup |

Diagnostics

| Field | YAML Key | Type | Default | Description |
| --- | --- | --- | --- | --- |
| DiagnosticsDBPath | diagnostics_db_path | string | "diagnostics.db" | Path to bbolt database for persisting failure diagnostics |
| DiagnosticsMaxAge | diagnostics_max_age | duration | 168h | Maximum age of persisted diagnostic entries before cleanup (7 days) |

When a provision fails (during provisioning, state recovery, or partial deprovision), the backend persists full failure diagnostics and container logs to a bbolt database. GET /provisions/{lease_uuid} and GET /logs/{lease_uuid} fall back to this store when the provision is no longer in memory (e.g., after deprovision or restart), returning the persisted error and logs with a 7-day default retention.

Releases

| Field | YAML Key | Type | Default | Description |
| --- | --- | --- | --- | --- |
| ReleasesDBPath | releases_db_path | string | "releases.db" | Path to bbolt database for persisting release history |
| ReleasesMaxAge | releases_max_age | duration | 2160h | Maximum age of persisted release entries before cleanup (90 days) |

Soft-delete & Retention

| Field | YAML Key | Type | Default | Description |
| --- | --- | --- | --- | --- |
| RetainOnClose | retain_on_close | bool | false | When true, managed volumes are renamed into a fred-retained- namespace on lease close/expire instead of being destroyed. Tenants can restore data into a new lease via POST /v1/leases/{new_lease_uuid}/restore. |
| RetentionDBPath | retention_db_path | string | "retention.db" | Path to bbolt database tracking retained lease records |
| RetentionMaxAge | retention_max_age | duration | 2160h (90 days) | How long retained volumes are kept before the grace reaper destroys them. When > 0 it also gates restore eligibility — a retained record older than this is no longer restorable. Set to 0 to disable age-based reaping and the age gate: retained volumes are then kept indefinitely and stay restorable until evicted, unless a per-tenant cap (max_retained_leases_per_tenant > 0) is configured to evict them. |
| RetentionReapInterval | retention_reap_interval | duration | 1h | Cadence of the background retention sweep, which destroys expired retained volumes and reconciles in-flight restores. If set to 0 it falls back to retention_max_age, then to a hard-coded 1h. The sweep still runs (to reconcile restores) when retain_on_close is set even with retention_max_age: 0. |
| MaxRetainedLeasesPerTenant | max_retained_leases_per_tenant | int | 0 (unlimited) | Maximum number of retained leases kept per tenant. When a soft-delete would exceed the cap, the tenant's oldest retained lease(s) are evicted (hard-deleted) at close time — oldest-first until cap-1 remain (so a single close can drop multiple old leases). Never touches other tenants and never evicts a record being restored. 0 means no cap. |
| RetentionOrphanConfirmations | retention_orphan_confirmations | int | 3 | Number of consecutive retention sweeps a soft-deleted record must be observed with all its retained volumes missing before the record is pruned (ENG-370). Catches records orphaned when their backing volumes vanish out-of-band (host/docker churn, docker volume prune, data-root reset) so they don't linger for the full grace window. Fail-safe: a sweep that cannot enumerate volumes, or finds the volume root absent/unreadable, skips rather than pruning. This is a sweep count, not a duration — the effective confirmation window is N × retention_reap_interval (≈3h at the 1h default). 0 disables orphan pruning entirely (kill-switch). |
| MaxRetainedDiskMB | max_retained_disk_mb | int64 | 0 (unlimited) | Per-provider cap on the aggregate retained-volume disk footprint (MB) across all tenants. When retaining a closing lease would exceed this cap, the lease is destroyed immediately instead of retained (existing in-grace data is never evicted). 0 means no cap. Must be ≤ total_disk_mb when set. |

Writable-path-only reclaim (ENG-406): even with retain_on_close: true, a closing lease's volumes that hold only ephemeral _wp/ writable-path scaffolding (no declared-VOLUME durable data) are destroyed (reclaimed) at close instead of retained — restore reseeds _wp from the image regardless, so retaining them preserves nothing restorable. The detector is conservative toward RETAIN (it never destroys a stateful volume). Counted by fred_docker_backend_retention_writable_path_reclaimed_total.

Duration syntax: retention_max_age and retention_reap_interval use Go duration syntax — valid units are h, m, s (e.g. 2160h for 90 days, 336h for 14 days). The units d (days) and w (weeks) are not valid and will fail config validation.
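To see why `d` and `w` are rejected, the sketch below runs a few values through the standard library's `time.ParseDuration`. It assumes the config loader ultimately relies on that function (the text says "Go duration syntax" but does not name the parser); `daysToGoDuration` and `parseConfigDuration` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

// daysToGoDuration renders a whole number of days as an hour-based
// duration string, since Go has no day unit.
func daysToGoDuration(days int) string {
	return fmt.Sprintf("%dh", days*24)
}

// parseConfigDuration is a hypothetical wrapper. It assumes the config
// loader ends up calling time.ParseDuration.
func parseConfigDuration(s string) (time.Duration, error) {
	return time.ParseDuration(s)
}

func main() {
	for _, s := range []string{daysToGoDuration(90), "336h", "90d", "2w"} {
		d, err := parseConfigDuration(s)
		if err != nil {
			fmt.Printf("%-6s invalid: %v\n", s, err)
			continue
		}
		fmt.Printf("%-6s ok: %.0f hours\n", s, d.Hours())
	}
}
```

Converting days to hours up front (90 days is 90 × 24 = 2160h) avoids the invalid units entirely.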

Tenant Quotas

| Field | YAML Key | Type | Default | Description |
| --- | --- | --- | --- | --- |
| TenantQuota | tenant_quota | object | (none) | Per-tenant resource limits (optional) |
| TenantQuota.MaxCPUCores | tenant_quota.max_cpu_cores | float64 | - | Maximum CPU cores per tenant |
| TenantQuota.MaxMemoryMB | tenant_quota.max_memory_mb | int64 | - | Maximum memory per tenant (MB) |
| TenantQuota.MaxDiskMB | tenant_quota.max_disk_mb | int64 | - | Maximum disk per tenant (MB) |

When tenant_quota is configured, no single tenant can consume more than the specified limits, even if the resource pool has capacity available. Quota values cannot exceed the total pool capacity.
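One way to picture the quota rule is an admission check that must pass both the tenant quota and the pool capacity. The sketch below is illustrative only: `Resources`, `fits`, and `admit` are hypothetical names, and it assumes all three quota dimensions are set.

```go
package main

import (
	"errors"
	"fmt"
)

// Resources is a hypothetical triple used for usage, quota, and capacity.
type Resources struct {
	CPUCores float64
	MemoryMB int64
	DiskMB   int64
}

// add returns the element-wise sum of a and b.
func add(a, b Resources) Resources {
	return Resources{a.CPUCores + b.CPUCores, a.MemoryMB + b.MemoryMB, a.DiskMB + b.DiskMB}
}

// fits reports whether a is within limit on every dimension.
func fits(a, limit Resources) bool {
	return a.CPUCores <= limit.CPUCores && a.MemoryMB <= limit.MemoryMB && a.DiskMB <= limit.DiskMB
}

// admit rejects a request that would push the tenant past its quota,
// even when the pool itself still has room.
func admit(tenantUsed, poolUsed, request, quota, pool Resources) error {
	if !fits(add(tenantUsed, request), quota) {
		return errors.New("tenant quota exceeded")
	}
	if !fits(add(poolUsed, request), pool) {
		return errors.New("resource pool exhausted")
	}
	return nil
}

func main() {
	pool := Resources{CPUCores: 8, MemoryMB: 16384, DiskMB: 102400} // documented defaults
	quota := Resources{CPUCores: 2, MemoryMB: 2048, DiskMB: 4096}   // illustrative values
	used := Resources{CPUCores: 1.5, MemoryMB: 1024}
	fmt.Println(admit(used, used, Resources{CPUCores: 1, MemoryMB: 512}, quota, pool))
}
```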

Timeouts

| Field | YAML Key | Type | Default | Description |
| --- | --- | --- | --- | --- |
| ImagePullTimeout | image_pull_timeout | duration | 5m | Timeout for pulling images |
| ContainerCreateTimeout | container_create_timeout | duration | 30s | Timeout for creating containers |
| ContainerStartTimeout | container_start_timeout | duration | 30s | Timeout for starting containers |
| ProvisionTimeout | provision_timeout | duration | 10m | Maximum time for the entire provisioning operation. Validated as positive — must be > 0. |
| ReconcileInterval | reconcile_interval | duration | 5m | How often to reconcile state with Docker |
| StartupVerifyDuration | startup_verify_duration | duration | 5s | Grace period after start before verifying containers are still running |
| ContainerStopTimeout | container_stop_timeout | duration | 30s | Grace period before SIGKILL when stopping containers |
| MigrationReadyTimeout | migration_ready_timeout | duration | 90s | Caps how long a recover-time migration waits for the new stack-form container to reach healthy (or running when no health check is declared) before declaring the migration failed for that lease |
| MigrationGracePeriod | migration_grace_period | duration | 1m | How long the renamed -prev legacy container lingers after a successful recover-time migration before forced removal (preserves rollback potential) |

Container Hardening

| Field | YAML Key | Type | Default | Description |
| --- | --- | --- | --- | --- |
| NetworkIsolation | network_isolation | *bool | true | Per-tenant Docker network isolation |
| ContainerReadonlyRootfs | container_readonly_rootfs | *bool | true | Read-only root filesystem |
| ContainerPidsLimit | container_pids_limit | *int64 | 256 | Maximum PIDs per container |
| ContainerTmpfsSizeMB | container_tmpfs_size_mb | int | 64 | Tmpfs size (MB) for /tmp and /run when readonly rootfs is enabled |

Ingress (Traefik Integration)

| Field | YAML Key | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Ingress.Enabled | ingress.enabled | bool | false | Enable reverse proxy label generation |
| Ingress.WildcardDomain | ingress.wildcard_domain | string | (required when enabled) | Base domain for tenant subdomains (e.g., apps.example.com) |
| Ingress.Entrypoint | ingress.entrypoint | string | (required when enabled) | Traefik entrypoint name (e.g., websecure) |

When enabled, containers with routable TCP ports receive Traefik Docker labels for automatic HTTPS routing. Each container gets a unique subdomain under wildcard_domain derived from lease UUID and service metadata (guaranteed ≤63 chars per RFC 1035). Port selection: 80 > 8080 > lowest TCP port. Requires network_isolation to be enabled — Traefik routes traffic via the per-tenant Docker network.
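The port-selection rule (80, then 8080, then the lowest TCP port) can be sketched as below. `pickIngressPort` is a hypothetical name; the sketch only models the documented preference order, not label generation.

```go
package main

import (
	"fmt"
	"slices"
)

// pickIngressPort applies the documented preference: 80, then 8080,
// then the lowest remaining TCP port. The second return value is false
// when the container exposes no TCP ports at all.
func pickIngressPort(tcpPorts []int) (int, bool) {
	if len(tcpPorts) == 0 {
		return 0, false
	}
	for _, preferred := range []int{80, 8080} {
		if slices.Contains(tcpPorts, preferred) {
			return preferred, true
		}
	}
	return slices.Min(tcpPorts), true
}

func main() {
	for _, ports := range [][]int{{3000, 8080, 80}, {9000, 8080}, {5432, 3000}, nil} {
		port, ok := pickIngressPort(ports)
		fmt.Println(ports, "->", port, ok)
	}
}
```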

Routers are generated with tls=true but no certresolver. The wildcard certificate for wildcard_domain must be provisioned at the Traefik level — typically via a DNS-01 ACME resolver with domains set in Traefik's static config, or via a default certificate in tls.stores. Fred does not drive per-domain ACME challenges.

Example:

```yaml
ingress:
  enabled: true
  wildcard_domain: "apps.example.com"
  entrypoint: "websecure"
```

Volume Management

| Field | YAML Key | Type | Default | Description |
| --- | --- | --- | --- | --- |
| VolumeDataPath | volume_data_path | string | (empty) | Host directory for managed volumes. Required when any SKU has disk_mb > 0 |
| VolumeFilesystem | volume_filesystem | string | (auto-detected) | Filesystem type: btrfs, xfs, or zfs. Auto-detected from volume_data_path if empty |

When any SKU profile has disk_mb > 0, the backend manages quota-enforced host directories that are bind-mounted into containers at their Dockerfile VOLUME paths.

Supported Filesystems

| Filesystem | Mechanism | Requirements |
| --- | --- | --- |
| btrfs | Subvolumes with qgroup quotas | btrfs quota enable on the filesystem |
| xfs | Project quotas | pquota mount option, xfs_quota binary |
| zfs | Child datasets with quota property | Parent dataset exists, zfs binary |

Stateful vs Ephemeral Containers

| SKU disk_mb | Behavior | Image VOLUME paths |
| --- | --- | --- |
| > 0 (stateful) | Quota-enforced host directory created per container | Bind-mounted from host directory |
| 0 (ephemeral) | No host directory | Overridden with tmpfs (prevents anonymous volumes) |

All containers have a readonly root filesystem by default (configurable via container_readonly_rootfs). Stateful containers write to bind-mounted volumes; ephemeral containers write to tmpfs.

Example stateful SKU:

```yaml
volume_data_path: "/var/lib/fred/volumes"
# volume_filesystem: "btrfs"  # optional, auto-detected

sku_profiles:
  docker-redis:
    cpu_cores: 0.5
    memory_mb: 512
    disk_mb: 2048
```

When provisioning redis:latest on this SKU:

  1. Image inspected — discovers VOLUME /data
  2. Host directory created: /var/lib/fred/volumes/fred-<lease>-0/ with 2048 MB quota
  3. Subdirectory data/ bind-mounted to container /data
  4. Redis writes to /data — quota enforced by kernel
  5. On deprovision: host directory destroyed

SKU Profile Fields

| Field | YAML Key | Type | Default | Description |
| --- | --- | --- | --- | --- |
| CPUCores | cpu_cores | float64 | | CPU cores allocated to each container |
| MemoryMB | memory_mb | int64 | | Memory in MB allocated to each container |
| DiskMB | disk_mb | int64 | 0 | Disk budget in MB. When > 0, a quota-enforced host directory is bind-mounted to image VOLUME paths (requires volume_data_path). When 0, image VOLUME paths are overridden with tmpfs |

Tenant Manifest Reference

See Manifest Guide for the full tenant-facing manifest specification (image, ports, env, health check, tmpfs). A formal JSON Schema is also available.

Soft-delete & Restore

When retain_on_close: true is set, the backend performs a soft-delete instead of a hard destroy at lease close or auto-expire time:

  1. Canonical volumes that are writable-path-only — they hold only the ephemeral _wp/ scaffolding (a read-only-rootfs writable-path mount) and no declared-VOLUME durable data — are destroyed (reclaimed), not retained. Restore reseeds _wp from the image anyway (the ENG-367 wipe-contract), so retaining such a volume preserves nothing restorable and only pollutes the retention record, a per-tenant slot, the retained-disk budget, and the volume root. The detector (isWritablePathOnly) is conservative toward RETAIN — it destroys only provably _wp-only volumes, never a stateful one (ENG-406). Observable via fred_docker_backend_retention_writable_path_reclaimed_total.
  2. The remaining managed volumes for the lease are renamed from fred-<lease_uuid>-… into a fred-retained-<lease_uuid>-… namespace and kept on disk.
  3. The original containers and resource-pool allocations are still released (the running workload is stopped; resources are freed for new leases).
  4. Fred publishes a retained status event to any connected tenant WebSocket so the tenant knows their data may be recoverable.
  5. Retained volumes are held for up to retention_max_age (default 90 days). The grace reaper runs every retention_reap_interval (default 1h) and destroys expired retained volumes.
  6. If a retained lease's backing volumes disappear out-of-band (host/docker churn, docker volume prune, a data-root reset on redeploy) while its record survives, the periodic sweep prunes the now-orphaned record after it is observed fully volume-less for retention_orphan_confirmations consecutive sweeps (default 3). This keeps dead records from accumulating for the full grace window. The prune is fail-safe — a sweep that errors listing volumes, or finds the volume root absent/unreadable, skips entirely rather than risk pruning a record whose volumes are merely transiently unavailable. Observable via fred_docker_backend_retention_orphans_pruned_total and fred_docker_backend_retention_orphan_skips_total{reason}.
Restore flow

To restore data from a closed lease into a new lease:

  1. Open a fresh lease on the same provider by requesting the same service names and quantities as the original closed lease. The new lease UUID (new_lease_uuid) will be in PENDING state.
  2. Call POST /v1/leases/{new_lease_uuid}/restore with body {"from_lease_uuid": "<original_closed_lease_uuid>"}. Fred validates the request and delegates to the backend.
  3. The backend renames the retained volumes into the new lease's namespace (the synchronous adopt phase) and re-deploys the retained manifest (the exact deployment that was running at close time) onto them. The new lease becomes active with the same data. To change the image or configuration after restore, use the normal update path once the lease is active.

Restore-specific re-deploy behavior worth knowing:

  • Image must already be present on the node. Restore re-uses the replace machinery, which inspects the image but does not pull it. If the image was garbage-collected from the node since close, restore fails with an image-inspect error — pre-pull the image (or restore before the node's image GC runs).
  • Image and configuration are fixed. Restore deploys strictly from the retained StackManifest and items; the request carries no manifest. The new lease's requested service names and quantities must shape-match the retained set exactly (otherwise the restore is rejected with a validation error).
  • Containers are recreated, ownership is not rewritten. Restore does not force-recreate beyond the normal replace, and the volume chown is non-recursive (it sets ownership on the VOLUME mount point only), so existing files keep their on-disk ownership.
Limitations
  • Best-effort and capacity-bounded: retention is not a guarantee. When a per-tenant cap (max_retained_leases_per_tenant > 0) is configured, a soft-delete may evict that tenant's oldest retained lease(s) — independent of age — to make room for the newer one. Always restore within the grace window.
  • Same-backend-node only: a restore can only run on the backend node that physically holds the retained volumes (the rename is local; nothing is copied between nodes). In single-backend deployments this is always satisfied. In multi-node deployments restore routing is automatic: the reconciler queries each backend's GET /retentions and records each retained lease's backend in the placement store, so a restore is routed to the node holding the source data (ENG-333). Restore returns 404 if no backend still holds that lease's retained data.
  • Not a backup: retained data is a single copy on the node's local disk (RAID-backed by the operator). It provides a grace window against accidental lease closure, not protection against node-level data loss. Operators should run separate backup procedures for production data.
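The per-tenant cap mentioned in the first limitation above evicts oldest-first until cap-1 records remain, leaving room for the lease being closed. A minimal sketch of that selection follows. The type `retainedLease`, the function `evictForCap`, and the choice to skip (rather than count out) records that are being restored are all assumptions of this sketch.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

// retainedLease is a hypothetical, simplified retention record.
type retainedLease struct {
	LeaseUUID string
	ClosedAt  int64 // Unix seconds; older leases have smaller values
	Restoring bool
}

// evictForCap returns the lease UUIDs to hard-delete for one tenant so that
// at most limit-1 records remain before the closing lease is retained.
// A limit of 0 means no cap. Records being restored are never selected.
func evictForCap(existing []retainedLease, limit int) []string {
	if limit <= 0 {
		return nil
	}
	sorted := slices.Clone(existing)
	slices.SortFunc(sorted, func(a, b retainedLease) int {
		return cmp.Compare(a.ClosedAt, b.ClosedAt) // oldest first
	})
	excess := len(sorted) - (limit - 1)
	var evict []string
	for _, r := range sorted {
		if excess <= 0 {
			break
		}
		if r.Restoring {
			continue // never evict a record being restored
		}
		evict = append(evict, r.LeaseUUID)
		excess--
	}
	return evict
}

func main() {
	existing := []retainedLease{
		{LeaseUUID: "c", ClosedAt: 300},
		{LeaseUUID: "a", ClosedAt: 100},
		{LeaseUUID: "b", ClosedAt: 200},
	}
	// With a cap of 2, a single close drops the two oldest records.
	fmt.Println(evictForCap(existing, 2))
}
```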
Failure handling & crash recovery

Restore is crash-safe and self-healing. A retention record carries one of three persisted statuses — active (awaiting restore or reap), restoring (a restore is in flight), and reaping (volumes are pending physical destruction) — and the adopt rename is the only on-disk mutation:

The reaping status is a finalizer tombstone (ENG-376): when a retained record is reaped (grace-expired, cap-evicted, or abandoned by a deprovision give-up) the record is not deleted at the active→reaping transition — it outlives its volumes and is Deleted only once every volume is confirmed destroyed. The bytes still sit on disk, so a reaping record keeps counting against the retained footprint while it is no longer restore-claimable. A record that cannot be reclaimed sticks in reaping (observable via the fred_docker_backend_retention_reaping_leases / fred_docker_backend_retention_reaping_bytes gauges); a failed destroy / give-up / uncommitted revert also bumps fred_docker_backend_retention_leaked_total.

  • Restore failure (or worker panic): the new lease's compose project is torn down, the adopted volumes are re-quarantined back into the fred-retained- namespace, pool allocations are released, and the record is reverted restoring → active. The original data is preserved and the lease can be restored again. The new lease settles as a Failed lease (a failure callback is still emitted), exactly like a failed restart/update.
  • Crash mid-restore: on the next startup (and on every periodic sweep) the backend reconciles dangling restoring records — finalizing those that completed and rolling back (re-quarantining) those that did not, before the orphan-volume reaper runs. A record is written before the rename, so a crash in the narrow window between the two is repaired by re-quarantine rather than data loss.

Provisioning Lifecycle

  1. Synchronous validation -- the Provision method validates the request before returning:

    • Checks for duplicate lease (returns ErrAlreadyProvisioned unless existing provision is failed)
    • Resolves all SKUs to profiles via SKUMapping + SKUProfiles
    • Parses the JSON manifest and validates image, ports, labels, and health check
    • Validates the image against AllowedRegistries
    • Allocates resources for all instances from the resource pool (rolls back on failure)
  2. Asynchronous provisioning -- runs in a goroutine tracked by a WaitGroup:

    • Pulls the image (once, shared across all containers in the lease)
    • Inspects the image to discover Dockerfile VOLUME declarations
    • Creates/ensures the per-tenant network (if NetworkIsolation is enabled)
    • For each item in the lease (supports multi-SKU), for each unit (supports multi-unit):
      • For stateful SKUs (disk_mb > 0): creates a quota-enforced host directory and bind-mounts image VOLUME paths into it
      • For ephemeral SKUs (disk_mb == 0): overrides image VOLUME paths with tmpfs to prevent anonymous volumes
      • Creates a container with the appropriate SKU profile, hardening settings, and labels
      • Starts the container
    • Verifies startup (see Startup Verification for the two paths)
  3. Callback -- on success or failure, sends an HMAC-signed callback to the URL provided in the provision request.
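For readers implementing the receiving side, HMAC-SHA256 signing and verification with Go's standard library looks roughly like this. The source only states that callbacks are HMAC-signed with `callback_secret`; the hex encoding, the payload shape, and the function names `signCallback` and `verifyCallback` are assumptions, and the header that carries the signature is not covered here.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signCallback returns the hex-encoded HMAC-SHA256 of body under secret.
// Hex encoding is an assumption of this sketch.
func signCallback(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verifyCallback recomputes the MAC and compares it in constant time.
func verifyCallback(secret, body []byte, sigHex string) bool {
	got, err := hex.DecodeString(sigHex)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(mac.Sum(nil), got)
}

func main() {
	secret := []byte("0123456789abcdef0123456789abcdef") // 32 chars, the documented minimum
	body := []byte(`{"lease_uuid":"example","status":"success"}`) // hypothetical payload
	sig := signCallback(secret, body)
	fmt.Println("valid:", verifyCallback(secret, body, sig))
	fmt.Println("tampered:", verifyCallback(secret, []byte(`{"status":"failed"}`), sig))
}
```

Using `hmac.Equal` rather than a plain string comparison avoids leaking timing information about the expected signature.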

Multi-unit leases create multiple containers from the same manifest. Multi-SKU leases create containers with different resource profiles per SKU. Instance indices are 0-based across all items.

The entire async operation is bounded by ProvisionTimeout and is canceled on backend shutdown.

Stack Provisioning

When lease items carry service_name fields (and the payload is a stack manifest), the backend provisions a multi-service stack:

  1. Synchronous validation — same as single-container, plus:

    • Detects stack vs single mode via IsStack(items)
    • Validates 1:1 mapping between manifest service names and lease item service names
    • Validates each per-service manifest independently
  2. Asynchronous provisioning — Docker Compose-based deployment:

    • Each service's image is pulled and inspected independently (pre-flight, before Compose)
    • Volumes are pre-created for stateful services (disk_mb > 0 with image VOLUMEs)
      • Resource allocation ID: {leaseUUID}-{serviceName}-{instanceIndex}
      • Volume ID: fred-{leaseUUID}-{serviceName}-{instanceIndex}
    • A Compose project is built in-memory from the stack manifest via buildComposeProject
    • Service startup ordering is controlled by depends_on declarations in the manifest (supports service_started and service_healthy conditions with cycle detection)
    • compose.Up atomically creates, starts, and network-attaches all service containers
    • compose.PS discovers the resulting container IDs per service
    • Startup verification runs per-service, each using its own health check config
    • Restart/update uses compose.Up with the updated project; on failure, the previous manifest is rebuilt and rolled back via another compose.Up
    • Deprovision uses compose.Down for atomic cleanup, with fallback to individual container removal
  3. Callback — single callback for the entire stack (success only when all services are healthy/running).

Container Hardening

Every container is created with the following security measures:

| Feature | Implementation | Notes |
| --- | --- | --- |
| Drop all capabilities | CapDrop: ["ALL"] | No Linux capabilities granted |
| No new privileges | SecurityOpt: ["no-new-privileges:true"] | Prevents privilege escalation via setuid/setgid |
| Read-only root filesystem | ReadonlyRootfs: true | Configurable via container_readonly_rootfs |
| Tmpfs for /tmp and /run | Tmpfs: {"/tmp": "size=64M", "/run": "size=64M"} | Only when readonly rootfs is enabled; size from container_tmpfs_size_mb. Tenants may request up to 4 additional tmpfs mounts via manifest, for a maximum of 6 total (384MB at default size). Note: On cgroup v1, tmpfs memory is not counted against the container's cgroup memory limit. On cgroup v2 (default on modern systems), it is. |
| PID limit | PidsLimit: 256 | Configurable via container_pids_limit |
| Memory (no swap) | MemorySwap == Memory | Prevents swap usage entirely |
| Restart policy disabled | RestartPolicyDisabled | Failed containers stay dead for crash detection |
| Network isolation | Per-tenant bridge network | Configurable via network_isolation |

Startup Verification

After all containers in a lease are started, the backend verifies they are ready before sending a success callback. The verification path depends on whether the manifest declares an active health check.

No health check (fixed-wait path)

When the manifest has no health_check (or sets Test[0] to "NONE"), the backend waits for StartupVerifyDuration (default 5s) and then inspects each container. If any container has exited during this window, the entire provision is marked as failed and cleaned up.

This catches containers that crash immediately on startup due to bad configuration, read-only filesystem errors, missing dependencies, or similar issues -- before a success callback is sent and the lease is acknowledged as active on chain.

Note: the runtime uses cmp.Or to fall back to 5s when the value is zero, so setting startup_verify_duration: 0 does not disable verification -- it uses the 5s default.
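The behavior in the note follows directly from how `cmp.Or` works: it returns its first non-zero argument, so a zero duration is indistinguishable from "unset". A small sketch, with `effectiveStartupVerify` and `defaultStartupVerify` as hypothetical names:

```go
package main

import (
	"cmp"
	"fmt"
	"time"
)

// defaultStartupVerify is the documented 5s default.
const defaultStartupVerify = 5 * time.Second

// effectiveStartupVerify mirrors the documented fallback: cmp.Or returns
// the first non-zero argument, so a configured 0 yields the default.
func effectiveStartupVerify(configured time.Duration) time.Duration {
	return cmp.Or(configured, defaultStartupVerify)
}

func main() {
	fmt.Println(effectiveStartupVerify(0))                // zero falls back to the default
	fmt.Println(effectiveStartupVerify(10 * time.Second)) // non-zero values pass through
}
```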

With health check (health-aware path)

When the manifest declares an active health check (health_check with Test[0] of "CMD" or "CMD-SHELL"), the backend polls every 2s until all containers report healthy. The behavior on each poll:

  • healthy -- container passes, removed from the pending set.
  • unhealthy -- provision fails immediately with an error.
  • Container exited -- provision fails immediately (caught before checking health status).
  • starting -- keep polling.
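The per-poll rules above can be sketched as a small classifier. The types and names here (`containerStatus`, `classify`, `pollOnce`) are hypothetical, and treating any status other than healthy or unhealthy as "keep polling" is an assumption; the real backend reads these values from Docker inspect results.

```go
package main

import "fmt"

// containerStatus is a hypothetical, reduced view of an inspect result.
type containerStatus struct {
	Exited bool
	Health string // "starting", "healthy", or "unhealthy"
}

type verdict int

const (
	keepPolling verdict = iota
	passed
	failed
)

// classify applies the documented order: an exited container fails before
// its health status is even consulted.
func classify(s containerStatus) verdict {
	if s.Exited {
		return failed
	}
	switch s.Health {
	case "healthy":
		return passed
	case "unhealthy":
		return failed
	default: // "starting" (and, by assumption, anything unrecognized)
		return keepPolling
	}
}

// pollOnce removes healthy containers from pending and returns an error
// as soon as one container fails. An empty pending set means success.
func pollOnce(pending map[string]containerStatus) error {
	for id, s := range pending {
		switch classify(s) {
		case passed:
			delete(pending, id)
		case failed:
			return fmt.Errorf("container %s failed startup verification (exited=%t, health=%q)", id, s.Exited, s.Health)
		}
	}
	return nil
}

func main() {
	pending := map[string]containerStatus{
		"web": {Health: "healthy"},
		"db":  {Health: "starting"},
	}
	err := pollOnce(pending)
	fmt.Println("still pending:", len(pending), "error:", err)
}
```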

The polling is bounded by the existing ProvisionTimeout context (default 10m). If the timeout fires before all containers are healthy, the provision fails. Operators must ensure ProvisionTimeout is compatible with their health check timing (start period + interval * retries).

A health check defined in the Dockerfile but not in the manifest does not trigger the health-aware path -- the manifest is the contract.

Re-provisioning

When a provision has status=failed (e.g., a container crashed and was detected by the reconciler), a new Provision call for the same lease UUID is allowed. The re-provision flow:

  1. The existing FailCount is carried over from the failed provision record.
  2. Resource allocations are released and old containers are removed. Managed volumes are kept — stateful data persists across re-provisions.
  3. A new provision record is created with FailCount preserved.
  4. The full provisioning flow runs again (image pull, image inspect, volume setup via idempotent Create, container create/start, startup verification). Existing volumes are reused with quota updated; only new volumes are created.
  5. On failure, FailCount is incremented. The FailCount is also persisted in the fred.fail_count container label. Only newly created volumes are cleaned up; reused volumes are preserved.

Lease State Machine

One concept: the lease actor is the scope of atomicity for its messages and its workers. Everything else falls out of that invariant:

  • Registry atomicity — the actor registry (b.actors) is guarded by a mutex; routeToLease(uuid, msg) resolves-or-creates AND enqueues under that mutex, so callers never hold a *leaseActor pointer. Stale-pointer races are unreachable by construction.
  • Worker ownership — every worker goroutine (provision, restart, update, diag) is spawned by the actor and tracked by its per-actor workers barrier (a channel-signaled reference counter; see work_barrier.go). The actor's exit path selects on workers.Zero() BEFORE registry-delete / done-close / drain — the actor cannot exit while a worker is in flight, so orphan-worker races are eliminated. The barrier's channel-based wait means waitForWorkers spawns no helper goroutine, so a wedged worker adds no leaked waiter on top of itself.
  • Drain-with-handle — on exit, any message in the inbox is processed via handle() (not just closed-and-dropped). Terminal events delivered during the shutdown window still drive their SM transition. Silent drops are gone.
  • Non-blocking routing: routeToLease uses a non-blocking inbox send under the registry mutex. A wedged actor cannot stall the event loop; full-inbox refusals increment die_event_dropped_total and the reconciler re-detects within its cycle.

Every lease is owned by a per-lease actor goroutine with a bounded inbox (16 messages). All transitions flow through a state machine, one per actor, which serializes transitions and owns the side effects (callback emission, diagnostics persistence, gauge updates). The SM's initial state is the lease's current Status at actor creation — new leases start in Provisioning, recovered leases start in whatever state they were in.
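The resolve-or-create plus non-blocking enqueue can be sketched with a mutex and a `select` with a `default` branch. This is a simplified illustration, not the backend's code: messages are plain strings, no actor goroutine drains the inbox, and the `dropped` field stands in for the die_event_dropped_total counter.

```go
package main

import (
	"fmt"
	"sync"
)

const inboxSize = 16 // bounded inbox size stated in the text

// leaseActor is reduced to its inbox; the real actor also runs a goroutine.
type leaseActor struct {
	inbox chan string
}

type registry struct {
	mu      sync.Mutex
	actors  map[string]*leaseActor
	dropped int // stand-in for the die_event_dropped_total counter
}

func newRegistry() *registry {
	return &registry{actors: make(map[string]*leaseActor)}
}

// routeToLease resolves or creates the actor and enqueues msg, all under
// the registry mutex, so callers never hold an actor pointer. The send is
// non-blocking: a full inbox returns false instead of stalling the caller.
func (r *registry) routeToLease(uuid, msg string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	a, ok := r.actors[uuid]
	if !ok {
		a = &leaseActor{inbox: make(chan string, inboxSize)}
		r.actors[uuid] = a
	}
	select {
	case a.inbox <- msg:
		return true
	default:
		r.dropped++
		return false
	}
}

func main() {
	r := newRegistry()
	accepted := 0
	for i := 0; i < inboxSize+2; i++ {
		if r.routeToLease("lease-a", fmt.Sprintf("event-%d", i)) {
			accepted++
		}
	}
	fmt.Println("accepted:", accepted, "dropped:", r.dropped)
	fmt.Println("other lease still accepts:", r.routeToLease("lease-b", "event-0"))
}
```

Because nothing drains `lease-a` in this sketch, it behaves like a wedged actor: the inbox fills, later sends are refused, and routing to `lease-b` is unaffected.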

```mermaid
stateDiagram-v2
    Provisioning --> Ready: ProvisionCompleted
    Provisioning --> Failed: ProvisionErrored
    Provisioning --> Deprovisioning: DeprovisionRequested

    Ready --> Failing: ContainerDied [guard]
    Ready --> Deprovisioning: DeprovisionRequested
    Ready --> Restarting: RestartRequested
    Ready --> Updating: UpdateRequested

    Failing --> Failed: DiagGathered
    Failing --> Deprovisioning: DeprovisionRequested

    Failed --> Provisioning: ProvisionRequested
    Failed --> Restarting: RestartRequested
    Failed --> Updating: UpdateRequested
    Failed --> Deprovisioning: DeprovisionRequested

    Restarting --> Ready: ReplaceCompleted
    Restarting --> Ready: ReplaceRecovered
    Restarting --> Failed: ReplaceFailed
    Restarting --> Deprovisioning: DeprovisionRequested

    Updating --> Ready: ReplaceCompleted
    Updating --> Ready: ReplaceRecovered
    Updating --> Failed: ReplaceFailed
    Updating --> Deprovisioning: DeprovisionRequested

    Deprovisioning --> [*]
```

The edges above are the complete set of allowed transitions; any event not listed against a source state is either ignored (see below) or rejected as an invalid trigger. The authoritative source is internal/backend/shared/leasesm/lease_sm.go.
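As a reading aid, the allowed edges can be written as a lookup table. This is a hand-transcribed sketch of the diagram, not the contents of lease_sm.go: the `State` and `Event` types and the `next` function are hypothetical, and the Inspect guard on ContainerDied is not modeled.

```go
package main

import "fmt"

type State string
type Event string

// transitions is transcribed from the diagram above. Deprovisioning has
// no outgoing edges here because it only leads to the terminal state.
var transitions = map[State]map[Event]State{
	"Provisioning": {
		"ProvisionCompleted":   "Ready",
		"ProvisionErrored":     "Failed",
		"DeprovisionRequested": "Deprovisioning",
	},
	"Ready": {
		"ContainerDied":        "Failing", // guarded by a Docker Inspect in the real SM
		"DeprovisionRequested": "Deprovisioning",
		"RestartRequested":     "Restarting",
		"UpdateRequested":      "Updating",
	},
	"Failing": {
		"DiagGathered":         "Failed",
		"DeprovisionRequested": "Deprovisioning",
	},
	"Failed": {
		"ProvisionRequested":   "Provisioning",
		"RestartRequested":     "Restarting",
		"UpdateRequested":      "Updating",
		"DeprovisionRequested": "Deprovisioning",
	},
	"Restarting": {
		"ReplaceCompleted":     "Ready",
		"ReplaceRecovered":     "Ready",
		"ReplaceFailed":        "Failed",
		"DeprovisionRequested": "Deprovisioning",
	},
	"Updating": {
		"ReplaceCompleted":     "Ready",
		"ReplaceRecovered":     "Ready",
		"ReplaceFailed":        "Failed",
		"DeprovisionRequested": "Deprovisioning",
	},
}

// next returns the target state for an event, or false when the event is
// not an allowed edge from the current state (ignored or invalid).
func next(from State, ev Event) (State, bool) {
	to, ok := transitions[from][ev]
	return to, ok
}

func main() {
	fmt.Println(next("Ready", "ContainerDied"))
	fmt.Println(next("Deprovisioning", "ProvisionCompleted"))
}
```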

Key behaviors
  • Ready → Failing guard. The ContainerDied trigger fires only if a Docker Inspect confirms the container actually exited. Die events can be duplicated or stale; the guard filters them.
  • Preemption via OnExit cancellation + workers.Zero(). Failing, Provisioning, Restarting, and Updating each own one async worker goroutine (diag gather, provision, or replace). Every transition out of these states calls the worker's CancelFunc via OnExit, then a.waitForWorkers() selects on the per-actor workers.Zero() channel until the goroutine has returned and its terminal sendTerminal has landed in the inbox. A preempting Deprovision observes post-cleanup state deterministically — no orphan-container race.
  • Defense-in-depth Ignore on Deprovisioning. Cancellation is best-effort: a goroutine can race past the cancel signal and fire its completion event anyway. Deprovisioning ignores every such event (DiagGathered, ProvisionCompleted, ProvisionErrored, ReplaceCompleted, ReplaceRecovered, ReplaceFailed) so the race is structurally safe.
  • One terminal callback per lease. Callback emission lives in SM entry actions (onEnterReadyFromProvision, onEnterFailedFromDiag, onEnterFailedFromProvision, onEnterReadyFromReplaceCompleted, onEnterReadyFromReplaceRecovered, onEnterFailedFromReplace), never in goroutines. Combined with the preemption/ignore rules above, this guarantees at most one success/failed/deprovisioned callback per lease per terminal transition.
  • Three Replace* events for two terminal states. ReplaceCompleted means restart/update succeeded (→ Ready, Success callback). ReplaceRecovered means it failed but rollback restored a working lease (→ Ready, Failed callback with rollback suffix). ReplaceFailed means both the operation and the rollback failed (→ Failed, Failed callback).
  • Non-blocking routing, reconciler backstop. routeToLease is non-blocking: a full inbox returns false rather than blocking the caller. containerEventLoop and the reconcile die-event dispatch treat refusal as "reconciler will re-detect within its cycle" and increment die_event_dropped_total. One wedged actor can no longer stall die-event delivery for other leases.
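
The non-blocking routing described in the last bullet can be sketched with a select that has a default branch. This is an illustrative sketch only: message and tryRoute are hypothetical names, not the backend's routeToLease.

```go
package main

import "fmt"

// message is a placeholder for whatever the real actor inbox carries.
type message struct{ kind string }

// tryRoute attempts a non-blocking send into a bounded inbox. It returns
// false instead of blocking when the inbox is full.
func tryRoute(inbox chan<- message, m message) bool {
	select {
	case inbox <- m:
		return true
	default:
		return false
	}
}

func main() {
	inbox := make(chan message, 16) // bounded inbox; 16 is the size quoted above
	dropped := 0
	for i := 0; i < 20; i++ {
		if !tryRoute(inbox, message{kind: "ContainerDied"}) {
			dropped++ // the backend would increment die_event_dropped_total here
		}
	}
	fmt.Printf("queued=%d dropped=%d\n", len(inbox), dropped)
}
```

Because nothing drains the inbox in this example, the first 16 sends are queued and the remaining 4 are refused, which is the situation the reconciler backstop exists for.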
Observability
  • fred_docker_backend_lease_sm_transitions_total{from,to,event} — every transition.
  • fred_docker_backend_lease_actors_created_total — cumulative actor count; should track distinct leases (recycled UUIDs after Deprovision produce a fresh actor, so this counter grows faster than the live-actor count).
  • fred_docker_backend_lease_actor_stuck_seconds — age of the oldest in-flight actor handler. Alert threshold should exceed the longest legitimate operation (Deprovision can hold an actor for minutes during container/volume cleanup).
  • fred_docker_backend_lease_actor_inbox_depth — histogram of per-actor inbox depth; p99 near 0 is healthy.
  • fred_docker_backend_lease_actor_panics_total — counts panics recovered inside actor handlers. Any non-zero is a bug; the actor survives and keeps processing, but the message that panicked did not drive its transition.
  • fred_docker_backend_lease_terminal_event_dropped_total{event} — worker terminal sends refused because the actor had exited (pathological waitForWorkers timeout). Should be zero in normal operation.
  • fred_docker_backend_die_event_dropped_total{source} — container-death events refused because the actor's inbox was full or the backend was shutting down. source is event_loop or reconcile. Not data loss — the reconciler re-detects — but a sustained non-zero value flags a wedged actor or chronic burst.

State Recovery

On startup and on every reconciler cycle (each ReconcileInterval, via RefreshState), recoverState rebuilds in-memory state from Docker:

  1. List managed containers -- filters by fred.managed=true label.
  2. Group by lease UUID -- containers are grouped into provision records. The highest FailCount across containers in a lease is used (handles partial re-provisions).
  3. Detect ready-to-failed transitions -- if a provision was in-memory as ready but Docker shows the container as exited/dead, the provision is marked failed, its FailCount is incremented, and a failure callback is sent.
  4. Cold-start FailCount correction -- provisions recovered as failed with no prior in-memory state have their FailCount incremented by 1. The label value was written at creation time (before the crash), so the increment accounts for the observed failure.
  5. Preserve in-flight provisions -- provisions with status=provisioning that have no containers yet are kept to avoid dropping active async work.
  6. Reset resource pool -- pool.Reset() clears all allocations and rebuilds them from the recovered containers' SKU profiles.
  7. Orphaned network cleanup -- if NetworkIsolation is enabled, removes any managed networks whose tenant has no active provisions and no connected containers.
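
Steps 2 and 4 can be illustrated with a small grouping function. This is a sketch under stated assumptions, not the recoverState implementation: the container and provision types are hypothetical, and treating a lease as failed when any of its containers has exited is a simplification.

```go
package main

import "fmt"

// container is a hypothetical, trimmed-down view of a managed container.
// LeaseUUID and FailCount would come from the fred.lease_uuid and
// fred.fail_count labels; Exited from the Docker container state.
type container struct {
	LeaseUUID string
	FailCount int
	Exited    bool
}

// provision is a hypothetical recovered record.
type provision struct {
	LeaseUUID string
	Status    string // "ready" or "failed" in this sketch
	FailCount int
}

// recoverProvisions groups containers by lease, keeping the highest FailCount
// (step 2), then adds 1 to failed leases that had no prior in-memory state
// (step 4). known holds the lease UUIDs that were already in memory.
func recoverProvisions(cs []container, known map[string]bool) map[string]*provision {
	out := make(map[string]*provision)
	for _, c := range cs {
		p, ok := out[c.LeaseUUID]
		if !ok {
			p = &provision{LeaseUUID: c.LeaseUUID, Status: "ready"}
			out[c.LeaseUUID] = p
		}
		if c.FailCount > p.FailCount {
			p.FailCount = c.FailCount
		}
		if c.Exited {
			p.Status = "failed"
		}
	}
	for id, p := range out {
		if p.Status == "failed" && !known[id] {
			p.FailCount++ // the label predates the failure being observed
		}
	}
	return out
}

func main() {
	got := recoverProvisions([]container{
		{LeaseUUID: "lease-a", FailCount: 1},
		{LeaseUUID: "lease-a", FailCount: 2, Exited: true},
		{LeaseUUID: "lease-b", FailCount: 0},
	}, map[string]bool{})
	fmt.Println(*got["lease-a"], *got["lease-b"])
}
```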

After state recovery, the backend also runs orphaned volume cleanup: lists all fred- prefixed directories in volume_data_path, compares against expected volumes from recovered provisions, and destroys any that have no matching provision. This catches volumes leaked by crashes between volume creation and container creation, or between container removal and volume destruction.

Callback Protocol

Callbacks notify Fred of provisioning results.

Signing

Each callback carries an X-Fred-Signature header in the format:

t=<unix-timestamp>,sha256=<hex-encoded-hmac>

The HMAC-SHA256 is computed over the canonical string <timestamp>\n<METHOD>\n<canonical-URI>\n<hex(sha256(body))> using the configured CallbackSecret. Binding the method and URI prevents cross-endpoint replay of captured signatures; hashing the body keeps the canonical string binary-safe. See internal/hmacauth for the reference implementation.
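
For illustration, the canonical string and header can be built with the Go standard library as follows. This is a minimal sketch based only on the format described above, not the internal/hmacauth code: signCallback and verifyCallback are hypothetical names, the canonical URI is taken as an already-normalized input, and timestamp freshness checks are not shown.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// signCallback returns an X-Fred-Signature value of the form
// t=<unix-timestamp>,sha256=<hex-encoded-hmac>.
func signCallback(secret []byte, ts int64, method, canonicalURI string, body []byte) string {
	bodyHash := sha256.Sum256(body)
	// <timestamp>\n<METHOD>\n<canonical-URI>\n<hex(sha256(body))>
	canonical := strconv.FormatInt(ts, 10) + "\n" + method + "\n" +
		canonicalURI + "\n" + hex.EncodeToString(bodyHash[:])

	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(canonical))
	return fmt.Sprintf("t=%d,sha256=%s", ts, hex.EncodeToString(mac.Sum(nil)))
}

// verifyCallback recomputes the signature and compares it in constant time.
func verifyCallback(secret []byte, header string, ts int64, method, canonicalURI string, body []byte) bool {
	want := signCallback(secret, ts, method, canonicalURI, body)
	return hmac.Equal([]byte(header), []byte(want))
}

func main() {
	secret := []byte("example-secret") // placeholder, not a real CallbackSecret
	body := []byte(`{"lease_uuid":"abc-123","status":"success","error":""}`)
	sig := signCallback(secret, 1700000000, "POST", "/api/v1/backend/callback", body)
	fmt.Println("X-Fred-Signature:", sig)
	fmt.Println("verifies:", verifyCallback(secret, sig, 1700000000, "POST", "/api/v1/backend/callback", body))
}
```

Because the method and URI are part of the signed string, the same header fails verification when replayed against a different endpoint.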

Error Message Sanitization

Callback error messages use hardcoded, deterministic strings and never include container logs or runtime-specific data. This prevents secrets, API keys, or other sensitive data from being permanently recorded on-chain as rejection reasons.

Full diagnostics (exit codes, OOM status, container logs) are available via the HMAC-authenticated GET /provisions/{lease_uuid} and GET /logs/{lease_uuid} endpoints.

Payload
{
  "lease_uuid": "...",
  "status": "success" | "failed",
  "error": ""
}
Retry Strategy
  • 3 attempts with backoff delays of 0s, 1s, 5s.
  • Each attempt has a 10s HTTP timeout.
  • Retries abort immediately if the backend is shutting down (stopCtx is canceled).
  • A 2xx response is considered success; any other status triggers a retry.

HTTP API

All authenticated endpoints require an X-Fred-Signature HMAC-SHA256 header (see Signing). Request bodies are limited to 2 MiB by default (DefaultMaxRequestBodySize, configurable via max_request_body_size). All JSON responses use Content-Type: application/json. Errors return {"error": "message"}.

POST /provision (authenticated)

Starts async container provisioning. Pre-flight validation (SKU, manifest, image allowlist, resources) is synchronous; the actual container lifecycle runs in a background goroutine with results delivered via callback.

Request (single-container):

{
  "lease_uuid": "abc-123",
  "tenant": "manifest1...",
  "provider_uuid": "prov-1",
  "items": [
    { "sku": "docker-small", "quantity": 2 }
  ],
  "callback_url": "https://fred-host/api/v1/backend/callback",
  "payload": "<base64-encoded manifest JSON>"
}

Request (stack):

{
  "lease_uuid": "abc-123",
  "tenant": "manifest1...",
  "provider_uuid": "prov-1",
  "items": [
    { "sku": "docker-small", "quantity": 1, "service_name": "web" },
    { "sku": "docker-medium", "quantity": 1, "service_name": "db" }
  ],
  "callback_url": "https://fred-host/api/v1/backend/callback",
  "payload": "<base64-encoded stack manifest JSON>"
}

Response (202 Accepted):

{
  "provision_id": "abc-123"
}

Errors: 400 (validation), 409 (already provisioned), 503 (insufficient resources).

POST /deprovision (authenticated)

Removes all containers and managed volumes for a lease and releases resources. Idempotent — deprovisioning a nonexistent lease returns success.

Request:

{
  "lease_uuid": "abc-123"
}

Response (200):

{
  "status": "ok"
}
GET /info/{lease_uuid} (authenticated)

Returns connection details for a running lease. Only available when the provision status is ready — returns 404 otherwise.

Response (200, single-container):

{
  "host": "192.168.1.100",
  "instances": [
    {
      "instance_index": 0,
      "container_id": "abcdefghijkl",
      "image": "nginx:latest",
      "status": "running",
      "ports": {
        "80/tcp": { "host_ip": "0.0.0.0", "host_port": "32768" }
      }
    }
  ]
}

Response (200, stack):

For stack provisions, instances are grouped by service name under a "services" map. Each service value is an object with an "instances" key:

{
  "host": "192.168.1.100",
  "services": {
    "web": {
      "instances": [
        {
          "instance_index": 0,
          "container_id": "abcdefghijkl",
          "image": "ghcr.io/myorg/webapp:v2.1.0",
          "status": "running",
          "ports": {
            "8080/tcp": { "host_ip": "0.0.0.0", "host_port": "32768" }
          }
        }
      ]
    },
    "db": {
      "instances": [
        {
          "instance_index": 0,
          "container_id": "mnopqrstuvwx",
          "image": "postgres:16",
          "status": "running",
          "ports": {
            "5432/tcp": { "host_ip": "0.0.0.0", "host_port": "32769" }
          }
        }
      ]
    }
  }
}
GET /logs/{lease_uuid} (authenticated)

Returns container stdout/stderr. For single-container provisions, logs are keyed by instance index. For stack provisions, logs use "serviceName/instanceIndex" keys (e.g., "web/0", "db/0"). Works for any provision status (provisioning, ready, or failed).

Query parameters: tail — number of lines (default 100, max 10000).

Response (200, single-container):

{
  "0": "2025-01-15T10:00:00Z Starting server...\n...",
  "1": "2025-01-15T10:00:00Z Worker ready\n..."
}

Response (200, stack):

{
  "web/0": "2025-01-15T10:00:00Z Listening on :8080\n...",
  "db/0": "2025-01-15T10:00:00Z database system is ready to accept connections\n..."
}

If log retrieval fails for a specific instance, its value contains <error: ...> instead.

GET /provisions/{lease_uuid} (authenticated)

Returns a single provision record. This is the primary endpoint for retrieving full failure diagnostics after a sanitized callback.

Response (200):

{
  "lease_uuid": "abc-123",
  "provider_uuid": "prov-1",
  "status": "failed",
  "created_at": "2025-01-15T10:00:00Z",
  "fail_count": 2,
  "last_error": "container 0 exited during startup (status: exited): exit_code=1; logs:\nError: config file not found"
}

status is one of: provisioning, ready, failing, failed, unknown, restarting, updating, deprovisioning. failing marks the brief window between container-death detection and the Failed callback being emitted; a concurrent Deprovision arriving in this window transitions the lease straight to deprovisioning without ever reaching failed, preventing a stale Failed callback. last_error is only present on failure and contains full diagnostics (exit codes, OOM status, container logs).

GET /provisions (authenticated)

Returns provision records. GET /provisions is keyset-paginated. Query params: limit (max page size) and continue (a lease UUID — the continue cursor returned by the previous page). The JSON response carries a top-level continue field set to the last record's lease UUID, omitted once the list is exhausted. An invalid limit or a non-UUID continue returns 400, as does a continue cursor supplied without a positive limit. A limit above the server maximum (5000) is coerced down to it rather than rejected. With no params it returns the full list unpaginated (back-compat). One or more lease_uuid query params return just those records. (ENG-380)

Response (200):

{
  "provisions": [
    {
      "lease_uuid": "abc-123",
      "provider_uuid": "prov-1",
      "status": "ready",
      "created_at": "2025-01-15T10:00:00Z",
      "fail_count": 0
    },
    {
      "lease_uuid": "def-456",
      "provider_uuid": "prov-1",
      "status": "failed",
      "created_at": "2025-01-15T10:05:00Z",
      "fail_count": 3,
      "last_error": "container exited unexpectedly: exit_code=137, oom_killed=true; logs:\nKilled"
    }
  ],
  "continue": "def-456"
}

The continue field is present only on a full page with more records remaining; it is omitted once the list is exhausted.
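
The keyset behavior can be sketched in memory as follows. This is a minimal illustration, assuming records are ordered by lease UUID; page and maxPageLimit are hypothetical names, and the 400 responses for malformed parameters are not modeled.

```go
package main

import (
	"fmt"
	"sort"
)

const maxPageLimit = 5000 // server maximum quoted above

// page returns up to limit UUIDs strictly after the cursor, plus the continue
// token. The token is empty once the list is exhausted.
func page(uuids []string, after string, limit int) ([]string, string) {
	sorted := append([]string(nil), uuids...)
	sort.Strings(sorted)
	if limit <= 0 {
		return sorted, "" // no limit: full list, unpaginated
	}
	if limit > maxPageLimit {
		limit = maxPageLimit // coerced down rather than rejected
	}
	start := sort.SearchStrings(sorted, after)
	if after != "" && start < len(sorted) && sorted[start] == after {
		start++ // the cursor itself was already returned on the previous page
	}
	end := start + limit
	if end >= len(sorted) {
		return sorted[start:], ""
	}
	return sorted[start:end], sorted[end-1]
}

func main() {
	ids := []string{"c", "a", "e", "b", "d"}
	cursor := ""
	for {
		items, next := page(ids, cursor, 2)
		fmt.Println(items, "continue:", next)
		if next == "" {
			break
		}
		cursor = next
	}
}
```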

POST /restart (authenticated)

Restarts a lease's containers in place (same image, same configuration). Async — the result is delivered via callback. Returns 202 ({"status": "restarting"}), 404 if not provisioned, 409 for an invalid state.

POST /update (authenticated)

Re-deploys a lease with a new manifest (image/config change). The payload field carries the new base64-encoded manifest. Async — result via callback. On failure the previous manifest is rolled back. Returns 202 ({"status": "updating"}), 400 (validation), 404, 409.

POST /restore (authenticated)

Restores a closed lease's retained volumes into a fresh lease. Body carries from_lease_uuid (the original closed lease) and callback_url. Async — result via callback. Returns 202 ({"status": "restoring"}); 422 when no retained data exists, 409 for invalid state / already provisioned, 400 (validation), 503 (insufficient resources). See Soft-delete & Restore.

GET /retentions (authenticated)

Lists this backend's retained (soft-deleted) leases. Used by the reconciler to route restores to the node physically holding each lease's retained volumes (ENG-333).

Response (200): {"retentions": [ ... ]} (serialized as [] when empty).

GET /releases/{lease_uuid} (authenticated)

Returns the persisted release (deployment) history for a lease, retained for releases_max_age (default 90 days).

POST /reconcile_custom_domain (authenticated)

Reconciles a lease's custom-domain ingress labels to match the supplied items. Body carries lease_uuid and items. Returns 204 No Content; 404 if not provisioned, 409 for an invalid state.

GET /health (unauthenticated)

Docker daemon reachability check.

Response (200):

{
  "status": "healthy"
}

Returns 503 if the Docker daemon is unreachable.

GET /stats (unauthenticated)

Resource pool usage.

Response (200):

{
  "total_cpu_cores": 8.0,
  "total_memory_mb": 16384,
  "total_disk_mb": 102400,
  "allocated_cpu_cores": 2.5,
  "allocated_memory_mb": 4096,
  "allocated_disk_mb": 10240,
  "available_cpu_cores": 5.5,
  "available_memory_mb": 12288,
  "available_disk_mb": 92160,
  "active_containers": 5
}
GET /metrics (unauthenticated)

Prometheus metrics in exposition format. Served by promhttp.Handler().

Resource Pool

The resource pool tracks CPU, memory, and disk allocations.

  • Allocation IDs are per-instance: <lease-uuid>-<instance-index> for single-container leases (e.g., abc123-0, abc123-1), or <lease-uuid>-<service-name>-<instance-index> for stack leases (e.g., abc123-web-0, abc123-db-0).
  • TryAllocate atomically checks capacity and reserves resources for a SKU. On insufficient resources, returns an error and the caller rolls back any partial allocations.
  • Release is idempotent -- releasing a non-existent allocation is a no-op.
  • Stats returns total, allocated, and available CPU/memory/disk.
  • Reset clears all allocations and rebuilds from a provided list. Used during state recovery to synchronize with Docker's actual state.
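
The pool semantics listed above can be sketched with a mutex-guarded map. This is an illustrative sketch, not the backend's pool: the resources and pool types are hypothetical, TryAllocate takes a resource profile directly instead of a SKU, and rejecting a duplicate allocation ID is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// resources is a hypothetical stand-in for a SKU's resource profile.
type resources struct {
	CPUCores float64
	MemoryMB int64
	DiskMB   int64
}

var errInsufficient = errors.New("insufficient resources")

type pool struct {
	mu     sync.Mutex
	total  resources
	allocs map[string]resources // keyed by allocation ID, e.g. abc123-0
}

func newPool(total resources) *pool {
	return &pool{total: total, allocs: make(map[string]resources)}
}

// allocated sums current allocations; callers must hold p.mu.
func (p *pool) allocated() resources {
	var sum resources
	for _, r := range p.allocs {
		sum.CPUCores += r.CPUCores
		sum.MemoryMB += r.MemoryMB
		sum.DiskMB += r.DiskMB
	}
	return sum
}

// TryAllocate checks capacity and reserves r under one lock, so the check and
// the reservation cannot interleave with another caller.
func (p *pool) TryAllocate(id string, r resources) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, exists := p.allocs[id]; exists {
		return fmt.Errorf("allocation %q already exists", id)
	}
	used := p.allocated()
	if used.CPUCores+r.CPUCores > p.total.CPUCores ||
		used.MemoryMB+r.MemoryMB > p.total.MemoryMB ||
		used.DiskMB+r.DiskMB > p.total.DiskMB {
		return errInsufficient
	}
	p.allocs[id] = r
	return nil
}

// Release is idempotent: deleting a missing key is a no-op.
func (p *pool) Release(id string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.allocs, id)
}

// Available returns total minus allocated.
func (p *pool) Available() resources {
	p.mu.Lock()
	defer p.mu.Unlock()
	used := p.allocated()
	return resources{
		CPUCores: p.total.CPUCores - used.CPUCores,
		MemoryMB: p.total.MemoryMB - used.MemoryMB,
		DiskMB:   p.total.DiskMB - used.DiskMB,
	}
}

// Reset replaces all allocations with the provided set.
func (p *pool) Reset(allocs map[string]resources) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.allocs = make(map[string]resources, len(allocs))
	for id, r := range allocs {
		p.allocs[id] = r
	}
}

func main() {
	p := newPool(resources{CPUCores: 8, MemoryMB: 16384, DiskMB: 102400})
	fmt.Println(p.TryAllocate("abc123-0", resources{CPUCores: 2.5, MemoryMB: 4096, DiskMB: 10240}))
	fmt.Println(p.TryAllocate("abc123-1", resources{CPUCores: 6, MemoryMB: 1024, DiskMB: 1024}))
	fmt.Printf("%+v\n", p.Available())
}
```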

Tenant Network Isolation

When network_isolation is enabled (default), each tenant's containers are placed in a dedicated Docker bridge network. This provides:

  • Same-tenant communication: containers on the same tenant bridge can reach each other directly.
  • Cross-tenant isolation: Docker's DOCKER-ISOLATION iptables chains DROP forwarded traffic between different bridge networks. Containers from different tenants cannot communicate directly.
  • Outbound internet: containers can reach the internet (required for port bindings).
  • Port bindings: inbound traffic to published ports works normally. Cross-tenant communication is only possible through public-facing endpoints (published ports on the host).

Prerequisite: Docker must have iptables enabled (the default). If the daemon runs with --iptables=false, cross-tenant isolation is lost. Fred logs daemon warnings at startup to help detect this.

Why not Internal: true? Docker's Internal network flag prevents port publishing entirely (moby#36174), which would make tenant services unreachable.

Network lifecycle
  • Naming: fred-tenant-<hex(sha256(tenant)[:8])> -- first 8 bytes of the SHA-256 hash, hex-encoded to 16 characters. Deterministic, derived from the tenant address.
  • Creation: EnsureTenantNetwork creates the network on first use, or returns the existing one.
  • Removal: RemoveTenantNetworkIfEmpty removes the network when no containers are connected. Called during deprovision.
  • Orphan cleanup: during state recovery, managed networks with no active provisions and no connected containers are removed.
  • Networks carry fred.managed=true and fred.tenant labels.
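
The naming rule can be reproduced in a few lines. This is a sketch of the documented formula only: tenantNetworkName is a hypothetical lowercase stand-in for TenantNetworkName, hashing the tenant string's raw bytes is an assumption, and the tenant values are placeholders.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// tenantNetworkName returns fred-tenant-<hex(sha256(tenant)[:8])>: the first
// 8 bytes of the SHA-256 digest, hex-encoded to 16 characters.
func tenantNetworkName(tenant string) string {
	sum := sha256.Sum256([]byte(tenant))
	return "fred-tenant-" + hex.EncodeToString(sum[:8])
}

func main() {
	fmt.Println(tenantNetworkName("tenant-address-a"))
	fmt.Println(tenantNetworkName("tenant-address-b"))
}
```

The same tenant always maps to the same name, so the network can be looked up without storing any mapping.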

Container Labels

All managed containers and networks carry labels in the fred.* namespace.

Label Value Description
fred.managed "true" Marks the container/network as managed by Fred
fred.lease_uuid lease UUID Associates the container with a lease
fred.tenant tenant address Tenant that owns the container/network
fred.provider_uuid provider UUID Provider that fulfills the lease
fred.sku SKU identifier SKU profile used for resource limits
fred.created_at RFC 3339 timestamp When the container was created
fred.instance_index integer string 0-based index within a multi-unit lease
fred.fail_count integer string Number of provision failures for this lease at creation time
fred.callback_url URL string Callback URL for provision results; persisted so failure callbacks survive backend restarts
fred.service_name service name string Service name within a stack (stack provisions only)

User-provided labels in the manifest are also applied, but may not use the fred.* prefix.

Bandwidth Limiting

Network bandwidth limiting is an operational concern handled outside of the docker-backend process. Operators can use Linux tc (traffic control) to rate-limit container network traffic on the host.

Identifying container interfaces

Each Docker container gets a veth pair. The host-side interface can be found by inspecting the container's network namespace:

# Get the container's PID
PID=$(docker inspect --format '{{.State.Pid}}' <container_id>)

# Get the veth peer index from inside the container's namespace
PEER_IDX=$(nsenter -t $PID -n ip link show eth0 | grep -oP '(?<=@if)\d+')

# Find the host-side veth interface by index
HOST_VETH=$(ip link | grep "^${PEER_IDX}:" | awk '{print $2}' | cut -d@ -f1)
Applying rate limits with tc

Use tc to set ingress and egress limits on the host-side veth interface:

# Egress (container → network): limit to 10 Mbit/s with a 32 kbit burst
tc qdisc add dev $HOST_VETH root tbf rate 10mbit burst 32kbit latency 50ms

# Ingress (network → container): use an IFB (intermediate functional block) device
modprobe ifb
ip link set dev ifb0 up
tc qdisc add dev $HOST_VETH ingress
tc filter add dev $HOST_VETH parent ffff: protocol ip u32 match u32 0 0 \
    action mirred egress redirect dev ifb0
tc qdisc add dev ifb0 root tbf rate 10mbit burst 32kbit latency 50ms
Automation

For production use, integrate tc rules into a container lifecycle hook or a script triggered by Docker events (docker events --filter event=start). The docker-backend does not manage bandwidth limits directly to keep the provisioning path simple and avoid requiring CAP_NET_ADMIN.

Documentation

Overview

Package docker implements a Docker backend for Fred that provisions ephemeral containers with SKU-based resource profiles, registry allowlisting, and port mapping for tenant connectivity.

Package docker implements the Backend interface for Docker container provisioning. It is the production backend bundled with Fred.

For operators and tenants, see the README.md alongside this package (internal/backend/docker/README.md) for the full configuration reference, HTTP API, lease state machine, callback protocol, and Traefik integration.

Architecture overview (for developers)

The package is organized around a single concurrency primitive: the per-lease actor. Every active lease owns one goroutine that serializes all state-mutating operations for that lease through a stateless state machine. The actor and SM implementations are substrate-agnostic and live in internal/backend/shared/leasesm; this package supplies the Docker-specific seams via the closure-builder factory in lease_actor_factory.go. All Docker calls happen outside any shared mutex; linearization comes from the actor's inbox.

The actor model is what gives the backend its key properties:

  • No held locks during slow I/O (image pull, container create/start)
  • Deterministic preemption — a Deprovision arriving mid-provisioning cancels the in-flight worker via OnExit and transitions cleanly
  • Blast-radius-contained panics — recover() in each handler keeps unrelated leases unaffected
  • One terminal callback per lease — emission lives only in SM entry actions, never in worker goroutines
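
The panic-containment property can be sketched with a handler loop that recovers per message. This is a simplified illustration, not the leasesm actor: runActor is a hypothetical name, messages are modeled as plain functions, and the real actor would run in its own goroutine.

```go
package main

import "fmt"

// runActor drains inbox one handler at a time until it is closed and returns
// how many handlers panicked. A panic is recovered inside the handler call, so
// the loop survives and keeps processing later messages.
func runActor(inbox <-chan func()) (panics int) {
	for handle := range inbox {
		func() {
			defer func() {
				if r := recover(); r != nil {
					panics++ // the backend would count this in lease_actor_panics_total
				}
			}()
			handle()
		}()
	}
	return panics
}

func main() {
	inbox := make(chan func(), 16)
	inbox <- func() { panic("boom") }
	inbox <- func() { fmt.Println("next message still processed") }
	close(inbox)
	fmt.Println("panics recovered:", runActor(inbox))
}
```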

Major components

  • internal/backend/shared/leasesm: per-lease actor + state machine (substrate-agnostic; consumed by every backend, not just Docker)
  • lease_actor_factory.go, lease_actor_routing.go: factory wiring Docker dependencies into leasesm.NewLeaseActor, plus Backend-side routing/dispatch around the actor inbox (b.actors map, routeToLease, DebugActors)
  • leasesm_adapters.go, leasesm_metrics.go: Docker implementations of leasesm.InstanceInspector / DiagnosticsGatherer / LeaseProvisionStore / SMMetrics
  • internal/backend/shared/workbarrier: per-actor worker reference counter (used by OnExit to wait for canceled goroutines before completing the transition)
  • provision.go, deprovision.go, restart_update.go: the lifecycle workers that the actor spawns for each long-running operation
  • recover.go: state recovery from Docker labels on startup and during each reconciliation cycle (via RefreshState)
  • compose.go, compose_project.go: Compose-based stack provisioning
  • reconcile_custom_domain.go: Traefik label sync for tenant custom domains
  • volume.go (+ volume_btrfs.go, volume_xfs.go, volume_zfs.go): filesystem-specific quota enforcement for stateful SKUs
  • ingress.go: Traefik label generation for routable ports
  • metrics.go: Prometheus metrics under fred_docker_backend_*

Container hardening

Every container is created with: dropped capabilities, no-new-privileges, read-only rootfs, tmpfs for /tmp and /run, PID limits, no swap, restart policy disabled (for crash detection), and per-tenant network isolation. See the README for the full list and operator-facing knobs.

Constants

const (
	LabelManaged       = "fred.managed"
	LabelLeaseUUID     = "fred.lease_uuid"
	LabelTenant        = "fred.tenant"
	LabelProviderUUID  = "fred.provider_uuid"
	LabelSKU           = "fred.sku"
	LabelCreatedAt     = "fred.created_at"
	LabelInstanceIndex = "fred.instance_index"
	LabelFailCount     = "fred.fail_count"
	LabelCallbackURL   = "fred.callback_url"
	LabelBackendName   = "fred.backend_name"
	LabelServiceName   = "fred.service_name"
	LabelFQDN          = "fred.fqdn"
	LabelCustomDomain  = "fred.custom_domain"
)

Labels used for tracking managed containers.

const DefaultMaxRequestBodySize int64 = 2 << 20 // 2 MiB

DefaultMaxRequestBodySize caps inbound HTTP request bodies for the docker backend. It is deliberately larger than providerd's config.DefaultMaxRequestBodySize (1 MiB): providerd caps the raw tenant body, then re-serializes and wraps it (JSON envelope + base64) before forwarding, so a manifest that just cleared providerd can exceed 1 MiB on the backend hop and would otherwise be rejected with an opaque 400. Configurable via max_request_body_size / DOCKER_BACKEND_MAX_REQUEST_BODY_SIZE. (ENG-448 / F42)

Functions

func ComputeFQDN

func ComputeFQDN(subdomain, wildcardDomain string) string

ComputeFQDN returns subdomain + "." + wildcardDomain.

func ComputeSubdomain

func ComputeSubdomain(leaseUUID, serviceName string, instanceIndex, quantity int) string

ComputeSubdomain derives a unique, human-friendly subdomain from lease and service metadata. The result is guaranteed to be at most 63 characters (the DNS label limit); long service names are truncated to fit.

The hash suffix is derived from all discriminating fields (leaseUUID, serviceName, instanceIndex) to prevent cross-pattern collisions — e.g., service "web" instance 0 vs. service "web-0" instance 0.

Matrix:

serviceName == "" && quantity <= 1 → {hash7}
serviceName == "" && quantity > 1  → {idx}-{hash7}
serviceName != "" && quantity <= 1 → {svc}-{hash7}
serviceName != "" && quantity > 1  → {svc}-{idx}-{hash7}

func CustomDomainRouterName

func CustomDomainRouterName(leaseUUID, serviceName string) string

CustomDomainRouterName returns a Traefik router name shared across all instances of (leaseUUID, serviceName). Custom domain is per-LeaseItem (per-service, not per-instance), so this name is the same regardless of instance index — every instance container of the service emits the same secondary-router labels and Traefik aggregates them into one router with N backends. The "-custom" suffix avoids any collision with the per-instance primary router name produced by RouterName.

func RouterName

func RouterName(leaseUUID, serviceName string, instanceIndex, quantity int) string

RouterName returns a router name derived from lease metadata.

func SelectIngressPort

func SelectIngressPort(ports map[string]manifest.PortConfig) (int, bool)

SelectIngressPort picks the best TCP port for ingress routing. Preference: ingress hint > 80 > 8080 > lowest TCP port number. Returns (port, true) if a suitable port is found, (0, false) otherwise.
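
The preference order can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: portConfig is a hypothetical stand-in for manifest.PortConfig with an assumed Ingress field for the hint, port keys are assumed to look like "80/tcp", and ties between several hinted ports go to the lowest number.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// portConfig is a hypothetical stand-in for manifest.PortConfig; the Ingress
// field models the "ingress hint" and is an assumption.
type portConfig struct {
	Ingress bool
}

// selectIngressPort applies the documented preference order:
// ingress hint > 80 > 8080 > lowest TCP port number.
func selectIngressPort(ports map[string]portConfig) (int, bool) {
	hinted, lowest := 0, 0
	has80, has8080 := false, false
	for key, cfg := range ports {
		num, proto, hasProto := strings.Cut(key, "/")
		if hasProto && proto != "tcp" {
			continue // only TCP ports are candidates
		}
		p, err := strconv.Atoi(num)
		if err != nil || p < 1 || p > 65535 {
			continue
		}
		if cfg.Ingress && (hinted == 0 || p < hinted) {
			hinted = p // lowest hinted port wins, to keep the result deterministic
		}
		if lowest == 0 || p < lowest {
			lowest = p
		}
		has80 = has80 || p == 80
		has8080 = has8080 || p == 8080
	}
	switch {
	case hinted != 0:
		return hinted, true
	case has80:
		return 80, true
	case has8080:
		return 8080, true
	case lowest != 0:
		return lowest, true
	}
	return 0, false
}

func main() {
	fmt.Println(selectIngressPort(map[string]portConfig{"5432/tcp": {}, "8080/tcp": {}}))
	fmt.Println(selectIngressPort(map[string]portConfig{"80/tcp": {}, "3000/tcp": {Ingress: true}}))
	fmt.Println(selectIngressPort(map[string]portConfig{"53/udp": {}}))
}
```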

func TenantNetworkName

func TenantNetworkName(tenant string) string

TenantNetworkName returns a deterministic network name for a tenant address.

func TraefikCustomDomainLabels

func TraefikCustomDomainLabels(cfg IngressConfig, customRouterName, customDomain string, containerPort int) map[string]string

TraefikCustomDomainLabels generates the secondary Traefik labels that route a tenant-supplied custom domain to the same per-tenant container(s) served by the primary router. The returned map is intended to be merged into a container's label set alongside the primary TraefikLabels output.

Multi-instance services (quantity > 1) are load-balanced by emitting these byte-identical labels on every instance container: Traefik's Docker provider deduplicates the router by name and aggregates the service into one with N backends. For single-instance services this is the same path with a degenerate single backend.

The router uses cfg.CustomDomainCertResolver (default "http01") for per-domain ACME and cfg.CustomDomainMiddlewares (default "security-headers@file") for transport-level hardening. The entrypoint reuses the existing IngressConfig.Entrypoint so the secondary router shares the primary's TLS termination posture.

func TraefikLabels

func TraefikLabels(cfg IngressConfig, networkName, routerName, fqdn string, containerPort int) map[string]string

TraefikLabels generates the Docker labels that Traefik uses for auto-discovery and routing configuration. networkName is the per-tenant Docker network that Traefik should use to reach the container (set via traefik.docker.network). This is the only Traefik-specific function; everything else is proxy-agnostic.

Routers are emitted with tls=true and no certresolver; see IngressConfig for the wildcard-cert provisioning contract.

The explicit `.service` label binding is required even though Traefik's docker provider can auto-bind a router to a same-named service: auto-binding only fires when exactly one service is declared on the container. As soon as a second service appears (notably TraefikCustomDomainLabels' secondary service for tenant custom domains), auto-binding becomes ambiguous and Traefik leaves the router orphaned (returns HTTP 404 on the primary URL). Always emitting the explicit binding is cheap and immune to future label additions.

Types

type ActorSnapshot

type ActorSnapshot struct {
	LeaseUUID  string `json:"lease_uuid"`
	SMState    string `json:"sm_state"`    // current SM state
	InboxDepth int    `json:"inbox_depth"` // pending messages not yet processed
	InboxCap   int    `json:"inbox_cap"`
}

ActorSnapshot is a point-in-time view of one lease actor's state for operator introspection. Safe to marshal to JSON for a /debug/actors endpoint when integrated with the HTTP layer.

type Backend

type Backend struct {
	// contains filtered or unexported fields
}

Backend implements the backend.Backend interface for Docker containers.

func New

func New(cfg Config, logger *slog.Logger) (*Backend, error)

New creates a new Docker backend.

func (*Backend) DebugActors

func (b *Backend) DebugActors() []ActorSnapshot

DebugActors returns a snapshot of every live lease actor. The result is a copy, so it stays stable for the caller even though the registry may grow or change state after the call returns. Intended for ops introspection during incidents; pair it with a /debug/actors HTTP handler that JSON-encodes the return value.

func (*Backend) Deprovision

func (b *Backend) Deprovision(ctx context.Context, leaseUUID string) error

Deprovision is the public shim: it routes the request through the lease's actor so that container-death and deprovision messages serialize per lease. Routing forces a Ready/Failing/Failed → Deprovisioning SM transition whose Failing.OnExit cancels the in-flight diag goroutine — the structural suppression of stale Failed callbacks.

func (*Backend) GetInfo

func (b *Backend) GetInfo(ctx context.Context, leaseUUID string) (*backend.LeaseInfo, error)

GetInfo returns lease information including connection details. Always populates `Services` keyed by service name (the primary post-Tasks-3-9 source of truth). `Instances` is a flattened convenience view computed by concatenating service instances in deterministic service-name order so older tooling that consumes the flat array keeps working.

func (*Backend) GetLoadStats added in v0.5.0

func (b *Backend) GetLoadStats(_ context.Context) (*backend.LoadStats, error)

GetLoadStats returns the backend's current CPU-load snapshot for least-loaded provision routing (ENG-318). It wraps the resource pool's Stats() — the same source the HTTP GET /stats endpoint serves — so the in-process docker backend exposes the same load signal as the HTTP path.

func (*Backend) GetLogs

func (b *Backend) GetLogs(ctx context.Context, leaseUUID string, tail int) (map[string]string, error)

GetLogs returns the last N lines of stdout/stderr for each container in a lease, keyed by "serviceName/instanceIndex" (e.g., "web/0", "db/0"). Falls back to the diagnostics store when the provision is not in memory (e.g., after deprovision). Returns ErrNotProvisioned only if both miss. On partial failure (some containers succeed, some fail), the successful logs are returned along with error placeholders, and the errors are logged.

Post-Tasks-3-9 the live path always populates `prov.ServiceContainers`, so the legacy "key by instance index only" branch is gone — every lease is stack-shaped from the in-memory state's perspective.

func (*Backend) GetProvision

func (b *Backend) GetProvision(_ context.Context, leaseUUID string) (*backend.ProvisionInfo, error)

GetProvision returns a single provision by lease UUID.

When the lease is not in the in-memory map (e.g., after close/expire), the retention store is consulted BEFORE the diagnostics fallback so a soft-deleted lease surfaces as Status=retained (with RetainedUntil + Items for the restore shape) for the offline tenant to self-serve within the grace window — and never regresses to a stale Status=failed diagnostics entry. Falls back to the diagnostics store otherwise. Returns ErrNotProvisioned only if all sources miss.

func (*Backend) GetReleases

func (b *Backend) GetReleases(_ context.Context, leaseUUID string) ([]backend.ReleaseInfo, error)

GetReleases returns the release history for a lease.

func (*Backend) Health

func (b *Backend) Health(ctx context.Context) error

Health checks that the Docker daemon is reachable AND the persistence stores are readable. Probing the bbolt stores (not just docker.Ping) means a locked/corrupt/read-only retention or release store surfaces as unhealthy instead of the backend reporting healthy while soft-delete/restore silently fail — the most data-loss-sensitive subsystem must not be the unmonitored one. (ENG-448 / F31)

func (*Backend) ListProvisions

func (b *Backend) ListProvisions(_ context.Context) ([]backend.ProvisionInfo, error)

ListProvisions returns all currently provisioned resources.

func (*Backend) ListProvisionsPage added in v0.7.0

func (b *Backend) ListProvisionsPage(ctx context.Context, after string, limit int) ([]backend.ProvisionInfo, string, error)

ListProvisionsPage returns one keyset page of provisions for the /provisions handler — the same paged-handler role ListRetentionsPage plays for /retentions, but NOT the same performance profile. The docker provision store is an in-memory map with no ordered index, so this snapshots it via ListProvisions and paginates in memory (O(N log N) per page, no disk I/O), whereas ListRetentionsPage serves O(limit) pages directly from the ordered bbolt index via a cursor seek. A true store-level O(limit) provision read is tracked as ENG-455 (it needs the ENG-381 ordered snapshot, since recoverState rebuilds the whole map each tick).
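As a rough illustration of "snapshot, then paginate in memory", the sketch below sorts a copy of the UUIDs on every call and slices the page after the cursor. `pageInMemory` is a hypothetical helper; treating `limit <= 0` as "return everything" is an assumption borrowed from the ListRetentionsPage description, not something stated for this method.

```go
package main

import (
	"fmt"
	"sort"
)

// pageInMemory sorts a snapshot of lease UUIDs and returns the page that
// follows the `after` cursor. The returned cursor is empty on the last page.
func pageInMemory(snapshot []string, after string, limit int) ([]string, string) {
	ids := append([]string(nil), snapshot...) // copy: do not reorder the caller's slice
	sort.Strings(ids)                          // O(N log N) on every call

	// First index whose UUID is strictly greater than the cursor.
	start := sort.Search(len(ids), func(i int) bool { return ids[i] > after })
	if limit <= 0 || start+limit >= len(ids) {
		return ids[start:], ""
	}
	page := ids[start : start+limit]
	return page, page[len(page)-1]
}

func main() {
	ids := []string{"c", "a", "d", "b", "e"}
	cursor := ""
	for {
		page, next := pageInMemory(ids, cursor, 2)
		fmt.Println(page, next)
		if next == "" {
			break
		}
		cursor = next
	}
}
```

The per-call sort is the cost the paragraph above refers to; an ordered store avoids it by seeking straight to the cursor.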

func (*Backend) ListRetentions added in v0.5.0

func (b *Backend) ListRetentions(_ context.Context) ([]backend.RetainedLease, error)

ListRetentions returns the leases this docker backend currently retains (soft-deleted, awaiting restore or grace-reap), read from the retention store. Used by fred's reconciler for restore backend affinity (ENG-333). Returns an empty slice when the retention store is not configured.

func (*Backend) ListRetentionsPage added in v0.7.0

func (b *Backend) ListRetentionsPage(_ context.Context, after string, limit int) ([]backend.RetainedLease, string, error)

ListRetentionsPage returns one keyset page of retained lease UUIDs, served directly from the retention store's ordered bbolt index via a cursor Seek (O(limit) per page) rather than reading the whole set and paginating in memory. It is the paged sibling of ListRetentions used by the /retentions handler. limit is coerced down to backend.MaxPageLimit, mirroring PaginateRetentions; limit <= 0 is the unpaginated passthrough.
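The seek-based read and the limit coercion can be approximated with the standard library alone. This sketch uses a sorted slice and a binary search as a stand-in for the bbolt cursor Seek; it is not the store's implementation, and `maxPageLimit` is a placeholder because the value of backend.MaxPageLimit is not given here.

```go
package main

import (
	"fmt"
	"sort"
)

// maxPageLimit is a placeholder for backend.MaxPageLimit (value assumed).
const maxPageLimit = 3

// clampLimit caps oversized limits; limit <= 0 is left alone and means
// "no paging".
func clampLimit(limit int) int {
	if limit > maxPageLimit {
		return maxPageLimit
	}
	return limit
}

// pageFromIndex serves a page from an already-sorted index: a binary-search
// "seek" to the cursor, then at most `limit` sequential reads.
func pageFromIndex(sorted []string, after string, limit int) ([]string, string) {
	limit = clampLimit(limit)
	start := sort.SearchStrings(sorted, after)
	if start < len(sorted) && sorted[start] == after {
		start++ // the cursor itself belongs to the previous page
	}
	if limit <= 0 || start+limit >= len(sorted) {
		return sorted[start:], ""
	}
	end := start + limit
	return sorted[start:end], sorted[end-1]
}

func main() {
	index := []string{"a", "b", "c", "d", "e"}
	page, next := pageFromIndex(index, "", 10) // 10 is coerced down to 3
	fmt.Println(page, next)
	page, next = pageFromIndex(index, next, 10)
	fmt.Println(page, next)
}
```

Unlike the in-memory variant, nothing here touches entries before the cursor or beyond the page.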

func (*Backend) LookupProvisions

func (b *Backend) LookupProvisions(_ context.Context, uuids []string) ([]backend.ProvisionInfo, error)

LookupProvisions returns provision info for the requested lease UUIDs. Missing leases are absent from the returned slice (not an error). O(k) lookups against the in-memory provisions map, where k = len(uuids).

func (*Backend) Name

func (b *Backend) Name() string

Name returns the backend name.

func (*Backend) Provision

func (b *Backend) Provision(ctx context.Context, req backend.ProvisionRequest) error

Provision starts async provisioning of containers. For multi-unit leases (quantity > 1), multiple containers are created. For multi-SKU leases, containers are created with the appropriate profile for each SKU.

Pre-flight validation errors (unknown SKU, invalid manifest, disallowed image, insufficient resources) are returned synchronously so the caller can respond with an appropriate HTTP status. Only truly asynchronous failures (image pull, container create/start) are communicated via callback.

func (*Backend) ReconcileCustomDomain

func (b *Backend) ReconcileCustomDomain(ctx context.Context, leaseUUID string, items []backend.LeaseItem) error

ReconcileCustomDomain reapplies the per-LeaseItem custom_domain values from the chain onto the running provision. When at least one item's CustomDomain differs from the in-memory state, the backend computes the diff read-only, then routes the redeploy through the lease actor via routeReplaceRestart. The actor commits prov.Items on success, so there is no off-actor mutation and no CAS rollback. A failed redeploy leaves prov.Items untouched so the next reconciler tick retries (ENG-278).

Reconciliation only runs when the provision is in ProvisionStatusReady. Any other state (Provisioning, Restarting, Updating, Failing, Failed, Deprovisioning, Unknown) is treated as "not the right time": skip without error and let the periodic reconciler call back when the provision settles.

The reconcile uses a two-RLock-pass candidate-only shape (ENG-277): a candidate pre-pass (RLock) finds which incoming domains actually differ from what's emitted and need DNS resolution; only those are resolved off-lock via b.dnsGateAllows; a main pass (RLock) re-reads fresh prov and runs computeCustomDomainOverrides read-only. A steady-state lease performs zero DNS lookups; a not-Ready lease short-circuits in the pre-pass with zero DNS.
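The two-pass shape can be sketched with a `sync.RWMutex`: a read-locked pre-pass picks candidates, the expensive check runs with no lock held, and a second read-locked pass recomputes the diff from fresh state. Everything below (`store`, `candidates`, `diff`, `reconcile`, the `gate` callback) is a hypothetical simplification; the real code uses b.dnsGateAllows and computeCustomDomainOverrides, whose signatures are not shown in this documentation.

```go
package main

import (
	"fmt"
	"sync"
)

type store struct {
	mu      sync.RWMutex
	ready   bool
	domains map[string]string // item name -> currently emitted custom domain
}

// candidates is the pre-pass: under RLock, list only the incoming domains
// that differ from the emitted ones. A not-ready store yields none.
func (s *store) candidates(incoming map[string]string) []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if !s.ready {
		return nil
	}
	var out []string
	for item, domain := range incoming {
		if s.domains[item] != domain {
			out = append(out, domain)
		}
	}
	return out
}

// diff is the main pass: re-read fresh state under RLock and compute the
// overrides read-only, keeping only domains the gate allowed.
func (s *store) diff(incoming map[string]string, allowed map[string]bool) map[string]string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	overrides := map[string]string{}
	if !s.ready {
		return overrides
	}
	for item, domain := range incoming {
		if s.domains[item] != domain && allowed[domain] {
			overrides[item] = domain
		}
	}
	return overrides
}

// reconcile runs the gate off-lock between the two passes and reports how
// many gate checks ran.
func reconcile(s *store, incoming map[string]string, gate func(string) bool) (map[string]string, int) {
	allowed := map[string]bool{}
	lookups := 0
	for _, d := range s.candidates(incoming) {
		lookups++
		allowed[d] = gate(d) // no lock held here
	}
	return s.diff(incoming, allowed), lookups
}

func main() {
	s := &store{ready: true, domains: map[string]string{"web": "a.example.com"}}
	gate := func(string) bool { return true } // stand-in for the DNS gate
	fmt.Println(reconcile(s, map[string]string{"web": "a.example.com"}, gate))
	fmt.Println(reconcile(s, map[string]string{"web": "b.example.com"}, gate))
}
```

In the steady-state call the candidate list is empty, so the gate is never invoked, which is the zero-DNS property described above.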

No-op when this backend is configured with ingress disabled. Without ingress, applyIngressLabels emits no Traefik labels (primary or secondary) at all, so a Restart triggered for custom-domain drift would recreate the containers with no LabelCustomDomain — and the next recoverState tick would rebuild prov.Items[].CustomDomain back to "" from the unlabeled containers, putting the reconciler into a permanent restart loop against the chain's non-empty value. Returning early here avoids the loop.

func (*Backend) RefreshState

func (b *Backend) RefreshState(ctx context.Context) error

RefreshState synchronizes in-memory provision state with Docker.

func (*Backend) Restart

func (b *Backend) Restart(ctx context.Context, req backend.RestartRequest) error

Restart restarts containers for a lease without changing the manifest. State machine: Ready|Failed → Restarting → Ready|Failed

SEAM CLOSED (ENG-230). This prelude is read-only: it fast-fails on ErrNotProvisioned / ErrInvalidState under provisionsMu, snapshots the fields the worker needs, then does pure work (manifest marshal + release-store Append). It performs NO write to prov.Status / prov.CallbackURL — the lease actor's onEnterRestarting entry action is the sole writer of those fields, firing inside handleRestartRequested BEFORE the ack. Because Restart() returns only after observing that ack, the "Restart() returns => prov.Status == Restarting" invariant the HTTP handler's event-broker publish depends on (api/handlers.go: RestartLease) is preserved without an off-actor write.

The prelude's fast-fail is only a route-time precondition — it does NOT guarantee the lease is still Ready/Failed when the actor dequeues the message. The real serialization is the actor inbox (the only path that mutates prov.Status). So a same-lease concurrent restart that passes the route-time check but loses the race (the winner already ran onEnterRestarting) is REJECTED by the actor, not prevented here: handleRestartRequested's classifyReplaceReject returns ErrInvalidState for the busy SM, which this function forwards and api/handlers.go maps to a clean 409.

Since no off-actor Status write remains, there is nothing to roll back on a marshal / Append / routing / ack failure: the error paths just return (the release-store Append is on a separate bbolt store; a "deploying" record left behind on routing/ack failure is cosmetic — recover.go skips non-active releases and deprovision deletes them).
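A toy version of the actor serialization may help make the ordering concrete. In this sketch a single goroutine owns the status, the entry action runs before the ack, and a second request that arrives while the state machine is busy is rejected by the actor. The `lease`, `restartMsg`, and `status` types are invented for illustration, and the toy never completes a restart (it stays in restarting), so it shows only the accept/reject step.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInvalidState is a local stand-in for the sentinel named in the docs.
var ErrInvalidState = errors.New("invalid state")

type status string

const (
	ready      status = "ready"
	failed     status = "failed"
	restarting status = "restarting"
)

type restartMsg struct {
	ack chan error // the actor replies here once it has accepted or rejected
}

// lease is a toy actor: run is the only goroutine that writes st.
type lease struct {
	st    status
	inbox chan restartMsg
}

func (l *lease) run() {
	for msg := range l.inbox {
		if l.st != ready && l.st != failed {
			msg.ack <- ErrInvalidState // lost the race: state machine is busy
			continue
		}
		l.st = restarting // entry action runs BEFORE the ack
		msg.ack <- nil
	}
}

// restart enqueues a request and returns only after the actor's ack, so a
// nil return implies the lease has entered restarting.
func (l *lease) restart() error {
	msg := restartMsg{ack: make(chan error, 1)}
	l.inbox <- msg
	return <-msg.ack
}

func main() {
	l := &lease{st: ready, inbox: make(chan restartMsg)}
	go l.run()
	fmt.Println(l.restart()) // <nil>
	fmt.Println(l.restart()) // invalid state
	close(l.inbox)
}
```

Because the caller blocks on the ack, it can rely on the status having changed by the time `restart` returns, without ever writing the status itself.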

func (*Backend) Restore added in v0.5.0

func (b *Backend) Restore(ctx context.Context, req backend.RestoreRequest) error

Restore adopts a soft-deleted lease's retained volumes into a NEW lease and brings up its stack from the retained manifest (ENG-325). The new lease is reserved at Provisioning and driven through the existing replace machinery via evRestoreRequested (Provisioning→Restarting→Ready|Failed).

The flow is the reviewed Rev 5 design; ordering is load-bearing:

(a) validate against the retained record (read-only),
(b) reserve the new-lease provision at Provisioning (reject if live),
(c) allocate pool slots,
(d) ATOMICALLY claim active→restoring (closes the prelude-vs-reaper race),
(e) adopt: rename retained→canonical (full rollback on failure),
(f) hand off to the actor; doRestore's terminal defer owns success/failure/panic.

Synchronous errors (validation, already-provisioned, insufficient resources, not-retained, not-restorable) are returned to the caller; asynchronous outcomes flow via the lease callback.
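The load-bearing ordering in steps (a) to (f) resembles a generic "run in order, undo in reverse on failure" pattern. The sketch below shows that pattern only; which steps the real Restore rolls back, and how, is not specified here, so the `step` type, `runOrdered`, and the undo closures are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errRename is an illustrative failure for the adopt step.
var errRename = errors.New("rename failed")

// step pairs an action with the undo that reverses it.
type step struct {
	name string
	do   func() error
	undo func() // nil when there is nothing to reverse
}

// runOrdered executes steps in order. On the first failure it undoes the
// completed steps in reverse order and returns the wrapped error.
func runOrdered(steps []step) error {
	var done []step
	for _, s := range steps {
		if err := s.do(); err != nil {
			for i := len(done) - 1; i >= 0; i-- {
				if done[i].undo != nil {
					done[i].undo()
				}
			}
			return fmt.Errorf("%s: %w", s.name, err)
		}
		done = append(done, s)
	}
	return nil
}

func main() {
	var trace []string
	ok := func(n string) func() error {
		return func() error { trace = append(trace, n); return nil }
	}
	undo := func(n string) func() {
		return func() { trace = append(trace, "undo "+n) }
	}
	err := runOrdered([]step{
		{name: "validate", do: ok("validate")},
		{name: "reserve", do: ok("reserve"), undo: undo("reserve")},
		{name: "allocate", do: ok("allocate"), undo: undo("allocate")},
		{name: "claim", do: ok("claim"), undo: undo("claim")},
		{name: "adopt", do: func() error { return errRename }},
	})
	fmt.Println(err)
	fmt.Println(trace)
}
```

Reversing the undo order matters for the same reason the forward order does: each step may depend on state established by the one before it.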

func (*Backend) Start

func (b *Backend) Start(ctx context.Context) error

Start initializes the backend, recovers state, and starts background tasks.

func (*Backend) Stats

func (b *Backend) Stats() shared.ResourceStats

Stats returns current resource usage statistics.

func (*Backend) Stop

func (b *Backend) Stop() error

Stop shuts down the backend gracefully.

func (*Backend) Update

func (b *Backend) Update(ctx context.Context, req backend.UpdateRequest) error

Update deploys a new manifest for a lease, replacing containers. State machine: Ready|Failed → Updating → Ready|Failed

SEAM CLOSED (ENG-230) — see the extended comment on Backend.Restart. Like Restart, the prelude is read-only: it fast-fails / validates under provisionsMu, snapshots fields, then records the release. It performs NO write to prov.Status / prov.CallbackURL — the actor's onEnterUpdating entry action is the sole writer, firing inside handleUpdateRequested BEFORE the ack, so the "Update() returns => Status is Updating" contract holds without an off-actor write. No rollback is needed on any failure path (nothing on prov was mutated).

type Config

type Config struct {
	// LogLevel controls the log verbosity (debug, info, warn, error).
	// When empty, defaults to "info" at startup (see cmd/docker-backend/main.go).
	LogLevel string `yaml:"log_level"`

	// Name is the backend identifier.
	Name string `yaml:"name"`

	// ListenAddr is the address the HTTP server listens on.
	ListenAddr string `yaml:"listen_addr"`

	// MaxRequestBodySize caps inbound HTTP request bodies (bytes). It must
	// exceed providerd's request cap plus forward-wrapping overhead; defaults to
	// DefaultMaxRequestBodySize when unset or non-positive. (ENG-448 / F42)
	MaxRequestBodySize int64 `yaml:"max_request_body_size"`

	// ProductionMode tightens startup checks beyond basic validation. When true,
	// Validate rejects dev-only insecure toggles — currently
	// callback_insecure_skip_verify, which disables TLS verification on the
	// backend → Fred callback hop. Mirrors providerd's production_mode (which
	// gates the reverse providerd → backend tls_skip_verify). Defaults to false.
	ProductionMode bool `yaml:"production_mode"`

	// TLSCertFile and TLSKeyFile enable HTTPS on the listener when both are
	// set; otherwise it serves plaintext HTTP (the default). Loaded once at
	// startup — rotation requires a restart (see ENG-294).
	TLSCertFile string `yaml:"tls_cert_file"`
	TLSKeyFile  string `yaml:"tls_key_file"`

	// TLSClientCAFile turns on mutual TLS when set: the listener requires and
	// verifies a client certificate signed by this CA. Requires TLSCertFile and
	// TLSKeyFile (the listener must be on TLS first).
	TLSClientCAFile string `yaml:"tls_client_ca_file"`

	// TLSClientAllowedNames optionally pins the mTLS client's identity: the
	// presented certificate's CommonName or a DNS SAN must be in this list.
	// Empty accepts any certificate signed by TLSClientCAFile. Requires
	// TLSClientCAFile. Use this whenever the client CA is not dedicated solely
	// to providerd.
	TLSClientAllowedNames []string `yaml:"tls_client_allowed_names"`

	// DockerHost is the Docker daemon socket path or URL.
	DockerHost string `yaml:"docker_host"`

	// TotalCPUCores is the total CPU cores available in the resource pool.
	TotalCPUCores float64 `yaml:"total_cpu_cores"`

	// TotalMemoryMB is the total memory available in MB.
	TotalMemoryMB int64 `yaml:"total_memory_mb"`

	// TotalDiskMB is the total disk space available in MB.
	TotalDiskMB int64 `yaml:"total_disk_mb"`

	// SKUMapping maps on-chain SKU UUIDs to profile names.
	// This allows the backend to translate chain SKU UUIDs to local resource profiles.
	// Example: {"019c1ee7-1aaf-7000-802c-ad775c72cc27": "docker-small"}
	SKUMapping map[string]string `yaml:"sku_mapping"`

	// SKUProfiles maps SKU names to resource profiles.
	SKUProfiles map[string]SKUProfile `yaml:"sku_profiles"`

	// AllowedRegistries is the list of allowed container registries.
	AllowedRegistries []string `yaml:"allowed_registries"`

	// CallbackSecret is the HMAC secret for signing callbacks.
	CallbackSecret config.Secret `yaml:"callback_secret"`

	// HostAddress is the external address for port mappings.
	HostAddress string `yaml:"host_address"`

	// ImagePullTimeout is the timeout for pulling images.
	ImagePullTimeout time.Duration `yaml:"image_pull_timeout"`

	// ContainerCreateTimeout is the timeout for creating containers.
	ContainerCreateTimeout time.Duration `yaml:"container_create_timeout"`

	// ContainerStartTimeout is the timeout for starting containers.
	ContainerStartTimeout time.Duration `yaml:"container_start_timeout"`

	// ContainerStopTimeout is the grace period for stopping containers.
	// Containers receive SIGTERM and have this long to shut down gracefully
	// before being force-killed (SIGKILL). Defaults to 30 seconds.
	ContainerStopTimeout time.Duration `yaml:"container_stop_timeout"`

	// ReconcileInterval is how often to reconcile state with Docker.
	ReconcileInterval time.Duration `yaml:"reconcile_interval"`

	// CallbackInsecureSkipVerify skips TLS certificate verification for callbacks.
	// WARNING: This disables TLS certificate validation, enabling MITM attacks.
	// NEVER enable in production. Only use for local development with self-signed certificates.
	CallbackInsecureSkipVerify bool `yaml:"callback_insecure_skip_verify"`

	// CallbackDBPath is the path to the bbolt database for persisting pending callbacks.
	// Defaults to "callbacks.db".
	CallbackDBPath string `yaml:"callback_db_path"`

	// ProvisionTimeout is the maximum time allowed for the entire provisioning
	// operation (image pull + container creation + start). If exceeded, the
	// provisioning is canceled and a failure callback is sent.
	ProvisionTimeout time.Duration `yaml:"provision_timeout"`

	// HostBindIP is the IP address to bind container ports to.
	// Defaults to "0.0.0.0" (all interfaces).
	HostBindIP string `yaml:"host_bind_ip"`

	// NetworkIsolation enables per-tenant Docker network isolation.
	// When true, each tenant's containers are placed in a separate bridge network.
	// Provides inter-tenant isolation. The network is created with Internal:false
	// — required for port publishing (moby#36174) — so containers retain outbound
	// internet access as a side effect. Defaults to true.
	NetworkIsolation *bool `yaml:"network_isolation"`

	// ContainerReadonlyRootfs sets the container's root filesystem to read-only.
	// When true, /tmp and /run are mounted as tmpfs. Defaults to true.
	ContainerReadonlyRootfs *bool `yaml:"container_readonly_rootfs"`

	// ContainerPidsLimit limits the number of PIDs in each container.
	// Defaults to 256.
	ContainerPidsLimit *int64 `yaml:"container_pids_limit"`

	// ContainerTmpfsSizeMB sets the tmpfs size in MB for /tmp and /run when
	// readonly rootfs is enabled. Defaults to 64.
	ContainerTmpfsSizeMB int `yaml:"container_tmpfs_size_mb"`

	// StartupVerifyDuration is how long to wait after starting containers before
	// verifying they're still running. This catches containers that crash immediately
	// on startup (e.g., bad config, read-only filesystem errors, missing dependencies).
	// The success callback is only sent after verification passes.
	// Defaults to 5 seconds. Setting to 0 uses the default (verification cannot be disabled).
	StartupVerifyDuration time.Duration `yaml:"startup_verify_duration"`

	// TenantQuota configures per-tenant resource limits.
	// When set, prevents any single tenant from consuming the entire pool.
	TenantQuota *TenantQuotaConfig `yaml:"tenant_quota"`

	// VolumeDataPath is the host directory for managed volumes.
	// Required when any SKU profile has DiskMB > 0.
	// Each container gets a quota-enforced subdirectory under this path.
	VolumeDataPath string `yaml:"volume_data_path"`

	// VolumeFilesystem specifies the filesystem type for volume quota enforcement.
	// Supported values: "btrfs", "xfs", "zfs". If empty, auto-detected from VolumeDataPath.
	VolumeFilesystem string `yaml:"volume_filesystem"`

	// CallbackMaxAge is the maximum age of a persisted callback entry.
	// Entries older than this are removed by the callback store's background cleanup.
	// Defaults to 24h.
	CallbackMaxAge time.Duration `yaml:"callback_max_age"`

	// DiagnosticsDBPath is the path to the bbolt database for persisting failure diagnostics.
	// Defaults to "diagnostics.db".
	DiagnosticsDBPath string `yaml:"diagnostics_db_path"`

	// DiagnosticsMaxAge is the maximum age of a persisted diagnostic entry.
	// Entries older than this are removed by the diagnostics store's background cleanup.
	// Defaults to 7 days.
	DiagnosticsMaxAge time.Duration `yaml:"diagnostics_max_age"`

	// ReleasesDBPath is the path to the bbolt database for persisting release history.
	// Defaults to "releases.db".
	ReleasesDBPath string `yaml:"releases_db_path"`

	// ReleasesMaxAge is the maximum age of a persisted release entry.
	// Entries older than this are removed by the release store's background cleanup.
	// Defaults to 90 days.
	ReleasesMaxAge time.Duration `yaml:"releases_max_age"`

	// RetainOnClose controls whether a lease's managed VOLUMES are soft-deleted
	// (renamed into the fred-retained- namespace and recorded in the retention
	// store) instead of immediately destroyed when the lease is closed. The
	// lease's containers are still stopped and removed either way; only the
	// volumes are retained. When false (default), the volumes are destroyed
	// immediately on close.
	RetainOnClose bool `yaml:"retain_on_close"`

	// RetentionDBPath is the path to the bbolt database for persisting
	// soft-deleted leases awaiting restore or reaping.
	// Defaults to "retention.db".
	RetentionDBPath string `yaml:"retention_db_path"`

	// RetentionMaxAge is the grace window after which a soft-deleted lease
	// becomes eligible for reaping. 0 disables reaping entirely.
	// Defaults to 90 days.
	RetentionMaxAge time.Duration `yaml:"retention_max_age"`

	// RetentionReapInterval is how often the background reaper sweeps for
	// expired soft-deleted leases. Decoupled from RetentionMaxAge so the
	// sweep cadence can be tuned independently of the grace window.
	// Defaults to 1h.
	RetentionReapInterval time.Duration `yaml:"retention_reap_interval"`

	// MaxRetainedLeasesPerTenant caps how many soft-deleted leases a single
	// tenant may have in the retention store at once. 0 means unlimited.
	MaxRetainedLeasesPerTenant int `yaml:"max_retained_leases_per_tenant"`

	// RetentionOrphanConfirmations is the number of consecutive retention sweeps a
	// soft-deleted record must be observed with ALL its retained volumes missing
	// before the record is pruned (ENG-370). It is a SWEEP COUNT, not a duration:
	// the effective confirmation window is N × RetentionReapInterval (≈3h at the
	// default 1h interval), so shortening RetentionReapInterval proportionally
	// shrinks the window — re-tune N to keep a fixed grace. 0 is valid and
	// disables orphan pruning entirely (kill-switch); negative values are rejected
	// by Validate. Defaults to 3.
	RetentionOrphanConfirmations int `yaml:"retention_orphan_confirmations"`

	// MaxRetainedDiskMB caps the aggregate disk (MB) the provider will hold in
	// the retained (soft-deleted) tier across ALL tenants. When retaining a
	// closing lease would push the total over this cap, the lease is destroyed
	// immediately instead of retained (refuse-to-retain) — never evicting
	// another tenant's in-grace data. 0 means unlimited (default; retained
	// volumes still count against total_disk_mb via the admission gate, but are
	// not separately bounded). Must be <= total_disk_mb and, when set, >= the
	// largest single stateful SKU's disk_mb (else a SKU-legal lease could never
	// be retained). Independent of tenant_quota.max_disk_mb: it may be smaller,
	// in which case a tenant's max-sized lease is SKU-legal yet refused
	// retention. Value is a plain integer in MB (mebibytes, 2^20 bytes —
	// consistent with total_disk_mb and the SKU disk_mb fields).
	MaxRetainedDiskMB int64 `yaml:"max_retained_disk_mb"`

	// MigrationGracePeriod is how long the renamed `-prev` legacy container
	// lingers after a successful recover-time migration before forced
	// removal. Preserves rollback potential if the operator interrupts fred
	// in the migration window to inspect. Defaults to 1m.
	MigrationGracePeriod time.Duration `yaml:"migration_grace_period"`

	// MigrationReadyTimeout caps how long the recover-time migration waits
	// for the new stack-form container to reach `healthy` (or `running`
	// when no health check is declared) before declaring the migration
	// failed for that lease. Defaults to 90s.
	MigrationReadyTimeout time.Duration `yaml:"migration_ready_timeout"`

	// Ingress configures optional reverse proxy integration.
	// When enabled, containers with routable TCP ports get proxy labels
	// pointing Traefik at the per-tenant network for HTTPS auto-discovery.
	// Requires network_isolation to be enabled.
	Ingress IngressConfig `yaml:"ingress"`
}

Config holds the configuration for the Docker backend.
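The RetentionOrphanConfirmations comment in the struct above describes a window of N × RetentionReapInterval. A small arithmetic sketch, using hypothetical helpers `orphanWindow` and `confirmationsFor`, shows how the window shrinks with the interval and how N could be re-tuned to keep a fixed grace.

```go
package main

import (
	"fmt"
	"time"
)

// defaultInterval matches the documented 1h default for the reap interval.
var defaultInterval = time.Hour

// orphanWindow returns the effective confirmation window: the number of
// consecutive sweeps times the sweep interval. Zero confirmations disables
// pruning, so there is no window (negative values are rejected upstream).
func orphanWindow(confirmations int, reapInterval time.Duration) (time.Duration, bool) {
	if confirmations <= 0 {
		return 0, false
	}
	return time.Duration(confirmations) * reapInterval, true
}

// confirmationsFor picks the smallest sweep count that keeps at least the
// desired grace for a given interval (ceiling division).
func confirmationsFor(grace, reapInterval time.Duration) int {
	if reapInterval <= 0 {
		return 0
	}
	return int((grace + reapInterval - 1) / reapInterval)
}

func main() {
	w, _ := orphanWindow(3, defaultInterval)
	fmt.Println(w) // 3h0m0s

	// Sweeping every 15 minutes with N=3 would give only 45 minutes.
	short, _ := orphanWindow(3, 15*time.Minute)
	fmt.Println(short) // 45m0s

	// Keeping a 3h grace at a 15-minute interval needs N=12.
	fmt.Println(confirmationsFor(3*time.Hour, 15*time.Minute)) // 12
}
```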

func DefaultConfig

func DefaultConfig() Config

DefaultConfig returns a Config with sensible defaults.

SKUProfiles is intentionally left empty: tier sizing is operator policy, not a code default. yaml.v3 merges map keys, so seeding defaults here would silently leak into any partial sku_profiles: block in YAML and trip the bidirectional sku_mapping/sku_profiles reachability check in Validate (see ENG-238). Operators must declare sku_profiles in their config; Validate enforces non-empty.

func (*Config) GetHostBindIP

func (c *Config) GetHostBindIP() string

GetHostBindIP returns the configured bind IP, defaulting to "0.0.0.0".

func (*Config) GetPidsLimit

func (c *Config) GetPidsLimit() *int64

GetPidsLimit returns the PID limit for containers. Defaults to 256.

func (*Config) GetSKUProfile

func (c *Config) GetSKUProfile(sku string) (SKUProfile, error)

GetSKUProfile returns the profile for a SKU, or an error if not found. It first checks if the SKU is a UUID that maps to a profile name via SKUMapping, then falls back to direct profile lookup.
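The two-stage lookup can be sketched as below. The `profile` fields and the `config` type are hypothetical stand-ins (SKUProfile's real fields are not shown in this section); only the SKUMapping and SKUProfiles map shapes and the example UUID come from the Config documentation.

```go
package main

import "fmt"

// profile is a hypothetical stand-in for SKUProfile.
type profile struct {
	CPUCores float64
	MemoryMB int64
}

type config struct {
	SKUMapping  map[string]string  // on-chain SKU UUID -> profile name
	SKUProfiles map[string]profile // profile name -> resource profile
}

// lookupProfile resolves a SKU: first through the UUID mapping, then as a
// direct profile name.
func (c config) lookupProfile(sku string) (profile, error) {
	name := sku
	if mapped, ok := c.SKUMapping[sku]; ok {
		name = mapped
	}
	p, ok := c.SKUProfiles[name]
	if !ok {
		return profile{}, fmt.Errorf("unknown SKU %q", sku)
	}
	return p, nil
}

func main() {
	c := config{
		SKUMapping:  map[string]string{"019c1ee7-1aaf-7000-802c-ad775c72cc27": "docker-small"},
		SKUProfiles: map[string]profile{"docker-small": {CPUCores: 0.5, MemoryMB: 512}},
	}
	fmt.Println(c.lookupProfile("019c1ee7-1aaf-7000-802c-ad775c72cc27"))
	fmt.Println(c.lookupProfile("docker-small"))
	fmt.Println(c.lookupProfile("docker-huge"))
}
```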

func (*Config) GetTmpfsSizeMB

func (c *Config) GetTmpfsSizeMB() int

GetTmpfsSizeMB returns the tmpfs size in MB. Defaults to 64.

func (*Config) HasStatefulSKUs

func (c *Config) HasStatefulSKUs() bool

HasStatefulSKUs returns true if any SKU profile has DiskMB > 0, indicating that volume management is needed.

func (*Config) IsNetworkIsolation

func (c *Config) IsNetworkIsolation() bool

IsNetworkIsolation returns whether per-tenant network isolation is enabled. Defaults to true (secure by default).

func (*Config) IsReadonlyRootfs

func (c *Config) IsReadonlyRootfs() bool

IsReadonlyRootfs returns whether containers should have a read-only root filesystem. Defaults to true (secure by default).
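NetworkIsolation and ContainerReadonlyRootfs are `*bool` fields, which is what lets an absent YAML key default to true while an explicit `false` is still honoured. A minimal sketch of that idiom follows; the lowercase `config` type and the `boolOr` helper are illustrative, not the package's code.

```go
package main

import "fmt"

type config struct {
	NetworkIsolation        *bool
	ContainerReadonlyRootfs *bool
}

// boolOr returns *p, or def when p is nil (the key was absent).
func boolOr(p *bool, def bool) bool {
	if p == nil {
		return def
	}
	return *p
}

func (c *config) isNetworkIsolation() bool { return boolOr(c.NetworkIsolation, true) }
func (c *config) isReadonlyRootfs() bool   { return boolOr(c.ContainerReadonlyRootfs, true) }

func main() {
	off := false
	fmt.Println((&config{}).isNetworkIsolation())                       // true
	fmt.Println((&config{NetworkIsolation: &off}).isNetworkIsolation()) // false
}
```

A plain `bool` field could not distinguish "unset" from "false", so a secure-by-default `true` would be impossible to override.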

func (*Config) Validate

func (c *Config) Validate() error

Validate checks that the configuration is valid.

type ContainerEvent

type ContainerEvent struct {
	ContainerID string
	Action      string // "die", "stop", etc.
}

ContainerEvent represents a container lifecycle event from the Docker daemon. This keeps Docker SDK types out of the interface boundary.

type ContainerInfo

type ContainerInfo struct {
	ContainerID   string
	LeaseUUID     string
	Tenant        string
	ProviderUUID  string
	SKU           string
	ServiceName   string // Stack service name (empty for single-container leases)
	InstanceIndex int
	FailCount     int
	CallbackURL   string
	Image         string
	Status        string
	Health        HealthStatus // Health check status (HealthStatusHealthy, HealthStatusUnhealthy, HealthStatusStarting, or HealthStatusNone)
	ExitCode      int          // Process exit code (meaningful when Status is "exited"/"dead")
	OOMKilled     bool         // True if container was killed by the OOM killer
	CreatedAt     time.Time
	Ports         map[string]PortBinding
	FQDN          string
	CustomDomain  string // Tenant-supplied custom FQDN (empty when not set)

	// Name is the human-readable container name (without the leading "/"
	// the Docker API prepends). Populated by both InspectContainer and
	// ListManagedContainers. Used by the recover-time migration to filter
	// out already-migrated `-prev` remnants.
	Name string

	// Mounts is the set of bind/volume mounts attached to the container.
	// Populated by InspectContainer always; populated by
	// ListManagedContainers from `types.Container.Mounts` (which the
	// list-containers API includes inline, so no extra Inspect round-trip
	// is needed at startup). Used by the recover-time migration to locate
	// managed volumes that need renaming under the new naming convention.
	Mounts []ContainerMount
}

ContainerInfo holds information about a managed container.

type ContainerMount

type ContainerMount struct {
	Source string
	Target string
	Type   string // "bind" | "volume" | "tmpfs"
}

ContainerMount mirrors the subset of Docker's mount data that fred needs for the recover-time migration: where the host bind comes from, where it is mounted inside the container, and the type discriminator (bind / volume / tmpfs). Type is a plain string rather than the typed `mount.Type` because the list-containers and inspect-container APIs surface it as a free-form string; copying those semantics here keeps callers from caring about which API shape they are consuming.

type CreateContainerParams

type CreateContainerParams struct {
	LeaseUUID     string
	Tenant        string
	ProviderUUID  string
	SKU           string
	ServiceName   string // Stack service name (empty for single-container leases)
	Manifest      *manifest.Manifest
	Profile       SKUProfile
	InstanceIndex int // For multi-unit leases (0-based index)

	// Retry tracking
	FailCount int

	// CallbackURL is persisted as a label so failure callbacks can be
	// sent after a docker-backend restart (when in-memory state is lost).
	CallbackURL string

	// Hardening parameters
	HostBindIP     string
	ReadonlyRootfs bool
	PidsLimit      *int64
	TmpfsSizeMB    int
	NetworkConfig  *networktypes.NetworkingConfig

	// VolumeBinds are bind mounts from host to container for managed volumes.
	// Each entry maps a host path to a container path.
	// Used for stateful containers (disk_mb > 0).
	VolumeBinds map[string]string

	// ImageVolumes are VOLUME paths from the image that need tmpfs overrides
	// (for ephemeral containers only, when VolumeBinds is nil).
	ImageVolumes []string

	// WritablePathBinds are bind mounts from managed volume subdirectories
	// to auto-detected writable directories in the container. Each entry
	// maps a host path (with pre-extracted image content) to a container path.
	WritablePathBinds map[string]string

	// User overrides the container's runtime user (e.g., "999:999").
	// When set, container.Config.User is set to this value so the container
	// runs directly as the target user instead of relying on the entrypoint
	// to switch users (which requires CAP_CHOWN that we drop).
	User string

	// BackendName identifies the backend instance that created this container,
	// stored as a label to scope list/filter operations per backend.
	BackendName string

	// Ingress holds the reverse proxy configuration.
	// When Enabled, proxy labels are injected into the container.
	// NetworkName must be non-empty when Ingress is enabled (enforced by
	// Config.Validate requiring network_isolation).
	Ingress     IngressConfig
	NetworkName string // Per-tenant network name for traefik.docker.network label
	Quantity    int    // Total quantity for this service (used in subdomain computation)

	// CustomDomain is the optional tenant-supplied FQDN for this LeaseItem.
	// When non-empty (and a routable HTTP port exists), CreateContainer
	// emits a secondary Traefik router routing Host(<CustomDomain>) to the
	// shared per-service loadbalancer. Validated defense-in-depth before
	// emission; chain authoritatively validates upstream.
	CustomDomain string
}

CreateContainerParams holds parameters for creating a container.

type DaemonSecurityInfo

type DaemonSecurityInfo struct {
	StorageDriver     string
	BackingFilesystem string
	SecurityOptions   []string
	IPv4Forwarding    bool
	Warnings          []string
}

DaemonSecurityInfo contains Docker daemon capabilities relevant to container hardening validation.

type DockerClient

type DockerClient struct {
	// contains filtered or unexported fields
}

DockerClient wraps the Docker client for container lifecycle operations.

func NewDockerClient

func NewDockerClient(host string, backendName string) (*DockerClient, error)

NewDockerClient creates a new Docker client. The backendName parameter scopes all list/filter/event operations so that multiple backend instances sharing the same Docker daemon do not interfere with each other.

func (*DockerClient) Close

func (d *DockerClient) Close() error

Close closes the Docker client.

func (*DockerClient) ContainerEvents

func (d *DockerClient) ContainerEvents(ctx context.Context) (<-chan ContainerEvent, <-chan error)

ContainerEvents subscribes to Docker container lifecycle events, filtering for "die" events on containers managed by fred (label fred.managed=true). Returns a channel of ContainerEvent and a channel of errors. Both channels are closed when the context is canceled or the Docker event stream closes.
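A consumer of the two returned channels typically selects on both until they close. The sketch below is one way to write that loop; `watch` and `onDie` are hypothetical names, and the local `ContainerEvent` copy exists only to keep the example self-contained. Since the documentation says both channels are closed when the context is canceled, the loop does not watch the context itself.

```go
package main

import "fmt"

// ContainerEvent is a local copy of the documented struct.
type ContainerEvent struct {
	ContainerID string
	Action      string
}

// watch drains both channels until they are closed, calling onDie for every
// "die" event. It returns the first stream error, if any.
func watch(events <-chan ContainerEvent, errs <-chan error, onDie func(string)) error {
	for events != nil || errs != nil {
		select {
		case ev, ok := <-events:
			if !ok {
				events = nil // closed: a nil channel is never selected again
				continue
			}
			if ev.Action == "die" { // defensive: the stream is already filtered
				onDie(ev.ContainerID)
			}
		case err, ok := <-errs:
			if !ok {
				errs = nil
				continue
			}
			return err
		}
	}
	return nil
}

func main() {
	events := make(chan ContainerEvent, 1)
	errs := make(chan error)
	events <- ContainerEvent{ContainerID: "abc123", Action: "die"}
	close(events)
	close(errs)

	err := watch(events, errs, func(id string) { fmt.Println("container died:", id) })
	fmt.Println("stream ended:", err)
}
```

Setting a closed channel to nil avoids a busy loop, because receiving from a closed channel would otherwise succeed immediately on every iteration.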

func (*DockerClient) ContainerLogs

func (d *DockerClient) ContainerLogs(ctx context.Context, containerID string, tail int) (string, error)

ContainerLogs returns the last `tail` lines of combined stdout/stderr for a container. If tail is <= 0 it defaults to 100.

func (*DockerClient) CreateContainer

func (d *DockerClient) CreateContainer(ctx context.Context, params CreateContainerParams, timeout time.Duration) (string, error)

CreateContainer creates a new container with the specified configuration. For ephemeral port bindings, it retries up to portBindRetries times on port conflict errors since these can be transient.
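The retry-on-port-conflict behaviour might look roughly like the loop below. This is a sketch under stated assumptions: the documentation names portBindRetries but gives neither its value nor how a conflict is detected, so the constant's value, the `errPortConflict` sentinel, and `createWithRetry` are all placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

// errPortConflict is a placeholder for however a port conflict is detected.
var errPortConflict = errors.New("port is already allocated")

// portBindRetries uses an assumed value; the real constant is not documented.
const portBindRetries = 3

// createWithRetry calls create until it succeeds, fails with a non-conflict
// error, or has retried portBindRetries times after the first attempt.
func createWithRetry(create func(attempt int) (string, error)) (string, error) {
	var lastErr error
	for attempt := 0; attempt <= portBindRetries; attempt++ {
		id, err := create(attempt)
		if err == nil {
			return id, nil
		}
		if !errors.Is(err, errPortConflict) {
			return "", err // not transient: fail fast
		}
		lastErr = err
	}
	return "", fmt.Errorf("giving up after %d retries: %w", portBindRetries, lastErr)
}

func main() {
	id, err := createWithRetry(func(attempt int) (string, error) {
		if attempt < 2 {
			return "", errPortConflict // first two attempts collide
		}
		return "container-1", nil
	})
	fmt.Println(id, err)
}
```

Only the conflict error is retried; any other failure is returned immediately so a genuine misconfiguration is not masked by repeated attempts.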

func (*DockerClient) DaemonInfo

func (d *DockerClient) DaemonInfo(ctx context.Context) (DaemonSecurityInfo, error)

DaemonInfo returns Docker daemon capabilities for hardening validation.

func (*DockerClient) DetectVolumeOwner

func (d *DockerClient) DetectVolumeOwner(ctx context.Context, imageName string, volumePaths []string) (uid, gid int, err error)

DetectVolumeOwner inspects the ownership of VOLUME directories inside an image by creating a temporary container (never started) and reading the tar headers from CopyFromContainer. If all volume paths share the same non-root UID:GID, those values are returned. If the paths have mixed ownership, are owned by root, or a path doesn't exist, (0, 0, nil) is returned.
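The decision rule at the end of that description (a shared non-root owner wins, anything else yields 0:0) can be isolated as a pure function. In this sketch `owner` and `commonOwner` are hypothetical, a nil entry represents a path that does not exist, "root" is taken to mean UID 0, and returning 0:0 for an empty list is an assumption.

```go
package main

import "fmt"

// owner is the UID:GID read from one VOLUME path's tar header.
type owner struct{ UID, GID int }

// commonOwner returns the shared UID:GID when every path exists and all are
// owned by the same non-root user; otherwise it returns 0, 0.
func commonOwner(owners []*owner) (uid, gid int) {
	if len(owners) == 0 {
		return 0, 0
	}
	first := owners[0]
	for _, o := range owners {
		// o == nil is checked first, so a nil first entry is never dereferenced.
		if o == nil || o.UID == 0 || *o != *first {
			return 0, 0
		}
	}
	return first.UID, first.GID
}

func main() {
	fmt.Println(commonOwner([]*owner{{UID: 999, GID: 999}, {UID: 999, GID: 999}})) // 999 999
	fmt.Println(commonOwner([]*owner{{UID: 999, GID: 999}, {UID: 1000, GID: 1000}})) // 0 0
	fmt.Println(commonOwner([]*owner{{UID: 999, GID: 999}, nil}))                   // 0 0
}
```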

func (*DockerClient) DetectWritablePaths

func (d *DockerClient) DetectWritablePaths(ctx context.Context, imageName string, uid int, candidateParents []string) ([]string, error)

DetectWritablePaths scans candidate parent directories inside an image for depth-1 subdirectories owned by uid. When uid is 0 (root image), it matches directories owned by any non-root user — this handles images like neo4j that run as root but chown directories to a service user during build.

func (*DockerClient) EnsureTenantNetwork

func (d *DockerClient) EnsureTenantNetwork(ctx context.Context, tenant string) (string, error)

EnsureTenantNetwork creates or returns the existing network for a tenant. The network is a per-tenant bridge that provides inter-tenant isolation. Note: Internal must be false because Docker internal networks do not support port publishing (moby/moby#36174). Outbound internet access from containers is a side effect.

func (*DockerClient) ExtractImageContent

func (d *DockerClient) ExtractImageContent(ctx context.Context, imageName string, paths []string, destDir string, maxBytes int64) map[string]error

ExtractImageContent extracts pre-existing image content for the given paths into destDir on the host filesystem. For each path, it creates a temporary container from the image (never started), streams the tar archive of that path via CopyFromContainer, sanitizes the tar stream, and extracts it to the appropriate subdirectory under destDir.

Returns nil on full success, or a map of path → error for failures. Callers should log failures but not fail the provision (graceful degradation).
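One way a caller might consume the returned map is sketched below. The `reportExtractFailures` helper, the `errTooLarge` value, and the log message are assumptions made for illustration; only the map[string]error shape and the "log but do not fail" guidance come from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errTooLarge is a stand-in failure used only by this sketch.
var errTooLarge = errors.New("content exceeds size limit")

// reportExtractFailures logs each failed path in a stable order and returns
// the number of failures. It never returns an error, so the caller can carry
// on with the provision (graceful degradation).
func reportExtractFailures(failures map[string]error, logf func(format string, args ...any)) int {
	paths := make([]string, 0, len(failures))
	for p := range failures {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	for _, p := range paths {
		logf("image content extraction failed for %s: %v", p, failures[p])
	}
	return len(paths)
}

func main() {
	logf := func(format string, args ...any) { fmt.Printf(format+"\n", args...) }

	// A nil map stands for full success: nothing is logged.
	fmt.Println(reportExtractFailures(nil, logf))

	failures := map[string]error{"/var/lib/data": errTooLarge}
	fmt.Println(reportExtractFailures(failures, logf))
}
```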

func (*DockerClient) InspectContainer

func (d *DockerClient) InspectContainer(ctx context.Context, containerID string) (*ContainerInfo, error)

InspectContainer returns detailed information about a container.

func (*DockerClient) InspectImage

func (d *DockerClient) InspectImage(ctx context.Context, imageName string) (*ImageInfo, error)

InspectImage inspects a pulled image and returns its metadata.

func (*DockerClient) ListManagedContainers

func (d *DockerClient) ListManagedContainers(ctx context.Context) ([]ContainerInfo, error)

ListManagedContainers returns all containers managed by Fred. When backendName is set, only containers belonging to this backend are returned.

func (*DockerClient) ListManagedNetworks

func (d *DockerClient) ListManagedNetworks(ctx context.Context) ([]networktypes.Inspect, error)

ListManagedNetworks returns all networks created by Fred, with full details. When backendName is set, only networks belonging to this backend are returned.

func (*DockerClient) Ping

func (d *DockerClient) Ping(ctx context.Context) error

Ping checks connectivity to the Docker daemon.

func (*DockerClient) PullImage

func (d *DockerClient) PullImage(ctx context.Context, imageName string, timeout time.Duration) error

PullImage pulls a container image, bounded by the given timeout.

func (*DockerClient) RemoveContainer

func (d *DockerClient) RemoveContainer(ctx context.Context, containerID string) error

RemoveContainer removes a container. It is idempotent: if the container is already gone it returns nil, and if the daemon is already removing it, it returns nil only after the removal has actually completed. Callers can therefore safely proceed with operations that assume the container is physically gone (volume destroy, name reuse).

func (*DockerClient) RemoveTenantNetworkIfEmpty

func (d *DockerClient) RemoveTenantNetworkIfEmpty(ctx context.Context, tenant string) error

RemoveTenantNetworkIfEmpty removes the tenant's network if no containers are connected.

func (*DockerClient) RenameContainer

func (d *DockerClient) RenameContainer(ctx context.Context, containerID string, newName string) error

RenameContainer changes the name of a container. The container can be running or stopped. This is used during updates/restarts to free the canonical name for the replacement container while keeping the old one available for rollback.
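The rename-then-replace flow can be sketched against a small interface. Everything here except the RenameContainer signature is hypothetical: the `renamer` interface, the "-prev" suffix, the `startNew` callback, and the in-memory fake exist only so the example runs without Docker.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// renamer is the slice of the client this sketch needs; the method mirrors
// the RenameContainer signature above.
type renamer interface {
	RenameContainer(ctx context.Context, containerID string, newName string) error
}

// replaceKeepingOld frees canonicalName by renaming the old container, then
// runs startNew. If startNew fails, the old container gets its name back.
func replaceKeepingOld(ctx context.Context, r renamer, oldID, canonicalName string, startNew func(context.Context) error) error {
	if err := r.RenameContainer(ctx, oldID, canonicalName+"-prev"); err != nil {
		return fmt.Errorf("free canonical name: %w", err)
	}
	if err := startNew(ctx); err != nil {
		if rbErr := r.RenameContainer(ctx, oldID, canonicalName); rbErr != nil {
			return fmt.Errorf("%w (rollback rename also failed: %v)", err, rbErr)
		}
		return err
	}
	return nil
}

// fakeRenamer records names in memory so the sketch runs without Docker.
type fakeRenamer struct{ names map[string]string }

func (f *fakeRenamer) RenameContainer(_ context.Context, id, newName string) error {
	f.names[id] = newName
	return nil
}

var errStart = errors.New("replacement failed to start")

// simulate runs one replacement and reports the old container's final name.
func simulate(startErr error) (string, error) {
	f := &fakeRenamer{names: map[string]string{"abc123": "web"}}
	err := replaceKeepingOld(context.Background(), f, "abc123", "web",
		func(context.Context) error { return startErr })
	return f.names["abc123"], err
}

func main() {
	fmt.Println(simulate(errStart)) // old container is back under "web"
	fmt.Println(simulate(nil))      // old container stays parked as "web-prev"
}
```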

func (*DockerClient) ResolveImageUser

func (d *DockerClient) ResolveImageUser(ctx context.Context, imageName string, userOverride string) (uid, gid int, err error)

ResolveImageUser resolves a container user specification to numeric UID/GID. If userOverride is non-empty, it is used instead of the image's Config.User. This is needed for images like postgres that start as root and expect to chown data directories — since we drop CAP_CHOWN, the manifest must specify the target user explicitly. If both userOverride and Config.User are empty, returns (0, 0, nil) (root). Numeric UID/GID values are parsed directly. Non-numeric usernames are resolved by reading /etc/passwd (and optionally /etc/group) from a temporary container created from the image.
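The override precedence and the numeric fast path might look roughly like the sketch below. Both helpers are hypothetical. In particular, reporting a bare "uid" with gid -1 (meaning "no group given") is an assumption of this sketch; the description does not say which GID the package returns in that case.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// effectiveUser picks the override when present, else the image's user.
func effectiveUser(userOverride, imageUser string) string {
	if userOverride != "" {
		return userOverride
	}
	return imageUser
}

// parseNumericUser handles the specs that need no lookup: "" (root), "uid",
// and "uid:gid". ok is false when a name would have to be resolved from
// /etc/passwd or /etc/group instead.
// Assumption: a bare "uid" is reported with gid -1 ("not specified").
func parseNumericUser(spec string) (uid, gid int, ok bool) {
	if spec == "" {
		return 0, 0, true
	}
	userPart, groupPart, hasGroup := strings.Cut(spec, ":")
	uid, err := strconv.Atoi(userPart)
	if err != nil || uid < 0 {
		return 0, 0, false
	}
	if !hasGroup {
		return uid, -1, true
	}
	gid, err = strconv.Atoi(groupPart)
	if err != nil || gid < 0 {
		return 0, 0, false
	}
	return uid, gid, true
}

func main() {
	fmt.Println(parseNumericUser(effectiveUser("", "1000:1000")))  // 1000 1000 true
	fmt.Println(parseNumericUser(effectiveUser("999", "postgres"))) // 999 -1 true
	fmt.Println(parseNumericUser(effectiveUser("", "postgres")))    // 0 0 false
	fmt.Println(parseNumericUser(effectiveUser("", "")))            // 0 0 true
}
```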

func (*DockerClient) StartContainer

func (d *DockerClient) StartContainer(ctx context.Context, containerID string, timeout time.Duration) error

StartContainer starts a container.

func (*DockerClient) StopContainer

func (d *DockerClient) StopContainer(ctx context.Context, containerID string, timeout time.Duration) error

StopContainer gracefully stops a running container with a timeout. After the timeout, the container is forcefully killed.

type HealthStatus

type HealthStatus string

HealthStatus represents the health check status of a Docker container.

const (
	HealthStatusHealthy   HealthStatus = "healthy"
	HealthStatusUnhealthy HealthStatus = "unhealthy"
	HealthStatusStarting  HealthStatus = "starting"
	HealthStatusNone      HealthStatus = "" // No health check configured
)
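As a small illustration of consuming these values, the sketch below maps each status to a short phrase. The type and constants are redeclared locally so the example compiles on its own, and the `describe` helper is hypothetical.

```go
package main

import "fmt"

// Local copy of the type and constants above so the sketch compiles alone.
type HealthStatus string

const (
	HealthStatusHealthy   HealthStatus = "healthy"
	HealthStatusUnhealthy HealthStatus = "unhealthy"
	HealthStatusStarting  HealthStatus = "starting"
	HealthStatusNone      HealthStatus = "" // No health check configured
)

// describe maps a status to a short human-readable phrase. Note that the
// empty string is a meaningful value here, not a missing one.
func describe(s HealthStatus) string {
	switch s {
	case HealthStatusHealthy:
		return "health check passing"
	case HealthStatusUnhealthy:
		return "health check failing"
	case HealthStatusStarting:
		return "health check still starting"
	case HealthStatusNone:
		return "no health check configured"
	default:
		return fmt.Sprintf("unknown status %q", string(s))
	}
}

func main() {
	for _, s := range []HealthStatus{HealthStatusHealthy, HealthStatusNone, "paused"} {
		fmt.Println(describe(s))
	}
}
```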

type ImageInfo

type ImageInfo struct {
	// ID is the content-addressable image ID (e.g., "sha256:...").
	// Immutable for a given image build, suitable as a cache key.
	ID string
	// Volumes are the VOLUME declarations from the Dockerfile.
	Volumes map[string]struct{}
	// User is the USER directive from the Dockerfile (may be name, uid, uid:gid, or name:group).
	User string
}

ImageInfo holds metadata from a container image inspection.

type IngressConfig

type IngressConfig struct {
	Enabled        bool   `yaml:"enabled"`
	WildcardDomain string `yaml:"wildcard_domain"`
	Entrypoint     string `yaml:"entrypoint"`

	// CustomDomainCertResolver is the Traefik certresolver name used for
	// per-tenant custom domains (HTTP-01 by default). Defaults to "http01"
	// when empty.
	CustomDomainCertResolver string `yaml:"custom_domain_cert_resolver"`

	// CustomDomainMiddlewares is the list of Traefik middleware references
	// applied to the secondary custom-domain router. Defaults to
	// ["security-headers@file"] when empty.
	CustomDomainMiddlewares []string `yaml:"custom_domain_middlewares"`

	// CustomDomainDNSResolvers are the DNS servers (host:port) fred queries to
	// check whether a tenant custom domain resolves to this host before
	// emitting its HTTP-01 router (ENG-266). Public resolvers are used so the
	// answer matches what the ACME CA sees. Defaults to Cloudflare 1.1.1.1:53 /
	// Google 8.8.8.8:53 / Quad9 9.9.9.9:53.
	CustomDomainDNSResolvers []string `yaml:"custom_domain_dns_resolvers"`

	// CustomDomainDNSQuorum is how many of the resolvers must independently see
	// the domain at this host before the gate opens (ENG-266). 0 (default) ==
	// a majority of CustomDomainDNSResolvers. Clamped to [1, len(resolvers)],
	// or 1 when no resolvers are configured.
	CustomDomainDNSQuorum int `yaml:"custom_domain_dns_quorum"`

	// CustomDomainDNSCheckDisabled turns OFF the readiness gate (ENG-266),
	// reverting to emitting the custom-domain router immediately. Default false.
	CustomDomainDNSCheckDisabled bool `yaml:"custom_domain_dns_check_disabled"`
}

IngressConfig holds configuration for reverse proxy integration. When Enabled, containers with routable TCP ports get proxy labels pointing Traefik at the per-tenant network for auto-discovery. Requires network_isolation to be enabled (validated at config load time). Currently generates Traefik-specific labels; the config layer is proxy-agnostic so a future backend swap only changes label generation.

The wildcard certificate covering *.WildcardDomain must be provisioned at the Traefik level (e.g. via a DNS-01 ACME resolver with domains in static config, or a default cert in tls.stores). Fred does not drive per-domain ACME challenges — routers are emitted with tls=true but name no certresolver, so Traefik serves whichever cert matches SNI.
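The quorum rule documented on CustomDomainDNSQuorum can be worked through with a short sketch. The `effectiveQuorum` helper is hypothetical, and it assumes "majority" means more than half of the resolvers (n/2 + 1 with integer division); the package may compute it differently.

```go
package main

import "fmt"

// effectiveQuorum resolves the configured quorum against the resolver count:
// 0 selects a majority, the result is clamped to [1, resolvers], and 1 is
// used when no resolvers are configured.
// Assumption: "majority" is resolvers/2 + 1.
func effectiveQuorum(configured, resolvers int) int {
	if resolvers <= 0 {
		return 1
	}
	q := configured
	if q == 0 {
		q = resolvers/2 + 1
	}
	if q < 1 {
		q = 1
	}
	if q > resolvers {
		q = resolvers
	}
	return q
}

func main() {
	fmt.Println(effectiveQuorum(0, 3)) // 2: majority of the three default resolvers
	fmt.Println(effectiveQuorum(5, 3)) // 3: clamped down to the resolver count
	fmt.Println(effectiveQuorum(0, 0)) // 1: no resolvers configured
}
```

With the three default resolvers and the default quorum of 0, two resolvers would have to see the domain at this host before the gate opens, under the majority assumption stated above.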

func (*IngressConfig) Validate

func (ic *IngressConfig) Validate() error

Validate checks that all required IngressConfig fields are set when enabled.

type PortBinding

type PortBinding struct {
	HostIP   string `json:"host_ip"`
	HostPort string `json:"host_port"`
}

PortBinding represents a port mapping.

type SKUProfile

type SKUProfile = shared.SKUProfile

SKUProfile is an alias for shared.SKUProfile. This and the other type aliases here exist for readability within the docker package.

type TenantQuotaConfig

type TenantQuotaConfig = shared.TenantQuotaConfig
