cluster

package
v0.9.0 Latest
Warning

This package is not in the latest version of its module.
Published: Apr 22, 2026 | License: GPL-3.0, LGPL-3.0 | Imports: 18 | Imported by: 0

Documentation

Overview

Package cluster provides types and operations for sind cluster management.

Constants

const (
	LabelRealm        = "sind.realm"
	LabelCluster      = "sind.cluster"
	LabelRole         = "sind.role"
	LabelSlurmVersion = "sind.slurm.version"
	LabelDataHostPath = "sind.data.hostpath"
)

Label keys used on sind containers.
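
As an illustration of how these label keys might be used, the sketch below builds `docker ps` arguments that select one cluster's containers by realm and cluster label. The helper name `clusterFilterArgs` is hypothetical and not part of this package; only the label keys and the standard `docker ps --filter label=key=value` syntax are taken as given.

```go
package main

import "fmt"

// Label keys copied from the constants documented above.
const (
	LabelRealm   = "sind.realm"
	LabelCluster = "sind.cluster"
)

// clusterFilterArgs is a hypothetical helper (not part of the package).
// It returns docker ps arguments that match every container, running or
// not, carrying both the given realm label and the given cluster label.
func clusterFilterArgs(realm, cluster string) []string {
	return []string{
		"ps", "--all",
		"--filter", "label=" + LabelRealm + "=" + realm,
		"--filter", "label=" + LabelCluster + "=" + cluster,
	}
}

func main() {
	fmt.Println(clusterFilterArgs("sind", "dev"))
}
```

Filtering on both labels keeps resources that carry only the realm label out of the result.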

const DNSSuffix = "sind"

DNSSuffix is the base domain for all sind DNS names.

const DefaultDataMountPath = "/data"

DefaultDataMountPath is the default mount path for the shared data volume.

Variables

AllVolumeTypes lists the cluster volume types in creation order.

Functions

func BuildContainerExecArgs added in v0.3.0

func BuildContainerExecArgs(container docker.ContainerName, isTTY bool, command []string) []string

BuildContainerExecArgs builds docker CLI arguments for running a command directly inside a cluster container via docker exec. The working directory is set to /data (the shared data mount). When command is nil, an interactive login shell is started.
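
A rough sketch of what such an argument builder could look like follows. It is not the package's implementation: the function name `execArgs`, the unconditional `-i` flag, and the choice of `bash -l` as the login shell are all assumptions made for illustration. Only the `/data` working directory and the nil-command behaviour come from the description above.

```go
package main

import "fmt"

// dataMountPath mirrors DefaultDataMountPath from this package.
const dataMountPath = "/data"

// execArgs is a hypothetical stand-in for BuildContainerExecArgs.
// It assembles arguments for "docker exec" with the working directory
// set to the shared data mount.
func execArgs(container string, isTTY bool, command []string) []string {
	args := []string{"exec", "-i"} // assumption: stdin is always attached
	if isTTY {
		args = append(args, "-t")
	}
	args = append(args, "-w", dataMountPath, container)
	if len(command) == 0 {
		// Assumption: the interactive login shell is bash.
		return append(args, "bash", "-l")
	}
	return append(args, command...)
}

func main() {
	fmt.Println(execArgs("example-container", true, nil))
	fmt.Println(execArgs("example-container", false, []string{"hostname"}))
}
```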

func BuildRunArgs

func BuildRunArgs(cfg RunConfig) []string

BuildRunArgs returns the docker arguments for creating a node container. The returned slice does not include "create" or "run -d" — the caller passes these args to Client.CreateContainer or Client.RunContainer.

func BuildSSHArgs

func BuildSSHArgs(sshContainer docker.ContainerName, node, cluster, realm string, isTTY bool, sshOptions, command []string) []string

BuildSSHArgs builds the docker CLI arguments for running SSH through the sind-ssh relay container. The returned args are suitable for passing to docker directly (e.g. "docker exec -i -t sind-ssh ssh ...").

node is the short name (e.g. "worker-0"), cluster is the cluster name, isTTY controls whether -t is added to docker exec, sshOptions are passed through to SSH before the target, and command is the optional remote command.

func ComposeProject

func ComposeProject(realm, clusterName string) string

ComposeProject returns the Docker Compose project name for a cluster.

func ContainerLogArgs

func ContainerLogArgs(realm, node, cluster string, follow bool) []string

ContainerLogArgs builds docker CLI arguments for streaming container logs. node is the short name (e.g. "controller", "worker-0"), cluster is the cluster name, and follow controls whether --follow is added.

func ContainerName

func ContainerName(realm, cluster, shortName string) docker.ContainerName

ContainerName returns the Docker container name for a node. shortName is the node's hostname, e.g. "controller", "submitter", "worker-0".

func ContainerPrefix

func ContainerPrefix(realm, cluster string) string

ContainerPrefix returns the container name prefix for a cluster, used to extract short names from full container names.
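
The prefix-stripping idea can be sketched as below. The helper `shortNameFrom` is hypothetical, and the `"sind-dev-"` prefix is only an assumed example value; the actual naming scheme is whatever ContainerPrefix returns.

```go
package main

import (
	"fmt"
	"strings"
)

// shortNameFrom is a hypothetical helper. It strips a cluster's
// container-name prefix from a full container name. ok is false when the
// name does not start with the prefix, i.e. it belongs to another cluster.
func shortNameFrom(prefix, containerName string) (shortName string, ok bool) {
	if !strings.HasPrefix(containerName, prefix) {
		return "", false
	}
	return strings.TrimPrefix(containerName, prefix), true
}

func main() {
	// "sind-dev-" is an assumed prefix used purely for illustration.
	fmt.Println(shortNameFrom("sind-dev-", "sind-dev-worker-0"))
	fmt.Println(shortNameFrom("sind-dev-", "sind-prod-controller"))
}
```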

func CreateClusterNetwork

func CreateClusterNetwork(ctx context.Context, client *docker.Client, realm, clusterName string) error

CreateClusterNetwork creates the cluster-specific Docker bridge network.

func CreateClusterNodes

func CreateClusterNodes(ctx context.Context, client *docker.Client, meshMgr *mesh.Manager, configs []RunConfig) error

CreateClusterNodes creates all node containers for the cluster. Each node is created, connected to the mesh network, and started.

func CreateClusterVolume added in v0.9.0

func CreateClusterVolume(ctx context.Context, client *docker.Client, realm, clusterName string, vtype VolumeType) error

CreateClusterVolume creates a single cluster volume.

func CreateNode

func CreateNode(ctx context.Context, client *docker.Client, meshMgr *mesh.Manager, cfg RunConfig) (docker.ContainerID, error)

CreateNode creates a node container, connects it to the mesh network, and starts it. Returns the container ID.

func DNSName

func DNSName(shortName, cluster, realm string) string

DNSName returns the fully qualified DNS name for a node. shortName is the node's hostname, e.g. "controller", "worker-0".

func DNSSearchDomain

func DNSSearchDomain(cluster, realm string) string

DNSSearchDomain returns the DNS search domain for a cluster.

func Delete

func Delete(ctx context.Context, client *docker.Client, meshMgr *mesh.Manager, clusterName string) error

Delete orchestrates the full cluster deletion flow.

Deleting a non-existent cluster is not an error. The function handles partial clusters (e.g., from a failed creation) by removing whatever resources exist.

deleteClusterResources
      │
HasOtherClusters?
    yes → done
    no  → CleanupMesh
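
The diagram above can be read as the following control flow. This is a sketch with stand-in function values, not the package's code: the `deleteSteps` struct, `runDelete`, and `traceDelete` are hypothetical names introduced so the flow can run without Docker.

```go
package main

import (
	"context"
	"fmt"
)

// deleteSteps holds stand-ins for the three stages in the diagram.
type deleteSteps struct {
	deleteClusterResources func(context.Context) error
	hasOtherClusters       func(context.Context) (bool, error)
	cleanupMesh            func(context.Context) error
}

// runDelete removes the cluster's resources, then tears down the mesh
// only if no other cluster still exists.
func runDelete(ctx context.Context, s deleteSteps) error {
	if err := s.deleteClusterResources(ctx); err != nil {
		return fmt.Errorf("delete cluster resources: %w", err)
	}
	others, err := s.hasOtherClusters(ctx)
	if err != nil {
		return fmt.Errorf("check for other clusters: %w", err)
	}
	if others {
		return nil // mesh is still in use
	}
	return s.cleanupMesh(ctx)
}

// traceDelete runs the flow with no-op steps and records which ones ran.
func traceDelete(others bool) []string {
	var trace []string
	steps := deleteSteps{
		deleteClusterResources: func(context.Context) error {
			trace = append(trace, "deleteClusterResources")
			return nil
		},
		hasOtherClusters: func(context.Context) (bool, error) {
			trace = append(trace, "HasOtherClusters")
			return others, nil
		},
		cleanupMesh: func(context.Context) error {
			trace = append(trace, "CleanupMesh")
			return nil
		},
	}
	if err := runDelete(context.Background(), steps); err != nil {
		trace = append(trace, "error: "+err.Error())
	}
	return trace
}

func main() {
	fmt.Println(traceDelete(true))
	fmt.Println(traceDelete(false))
}
```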

func DeleteContainers

func DeleteContainers(ctx context.Context, client *docker.Client, containers []docker.ContainerListEntry) error

DeleteContainers force-removes the given containers in parallel (docker rm -f).

func DeleteNetwork

func DeleteNetwork(ctx context.Context, client *docker.Client, name docker.NetworkName) error

DeleteNetwork removes the cluster network.

func DeleteVolumes

func DeleteVolumes(ctx context.Context, client *docker.Client, volumes []docker.VolumeName) error

DeleteVolumes removes the given cluster volumes. Each removal retries past the dockerd async-cleanup race that follows `docker rm -f`.
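
One common way to ride out a race like this is a bounded retry loop. The sketch below shows that general pattern only; the attempt count, the delay, and the names `retry` and `simulateRemoval` are assumptions, and the package's actual retry policy is not documented here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retry calls op up to attempts times, sleeping delay between failures.
// It returns nil on the first success, the last error once attempts are
// exhausted, or ctx.Err() if the context ends while waiting.
func retry(ctx context.Context, attempts int, delay time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
	}
	return err
}

// simulateRemoval pretends the volume stays busy for the first busy calls.
// The limits (5 attempts, 1ms delay) are illustrative values only.
func simulateRemoval(busy int) (calls int, err error) {
	err = retry(context.Background(), 5, time.Millisecond, func() error {
		calls++
		if calls <= busy {
			return errors.New("volume is in use")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := simulateRemoval(2)
	fmt.Println(calls, err)
}
```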

func DeregisterMesh

func DeregisterMesh(ctx context.Context, meshMgr *mesh.Manager, clusterName string, containers []docker.ContainerListEntry) error

DeregisterMesh removes DNS records and known_hosts entries for all containers in batch. This is the inverse of registerMesh during cluster creation.

func DiscoverClusterNames added in v0.3.0

func DiscoverClusterNames(ctx context.Context, client *docker.Client, realm string) ([]string, error)

DiscoverClusterNames finds cluster names from orphaned networks and volumes that may not have containers. This supplements GetClusters (which only finds clusters with running containers) for cleanup operations.

Filters on both the realm and cluster labels so mesh resources (which carry only the realm label) are skipped, and resources from other realms can't match even when names collide.

func EnableSlurmServices

func EnableSlurmServices(ctx context.Context, client *docker.Client, configs []RunConfig) error

EnableSlurmServices enables the role-appropriate Slurm daemon on each node. Controller nodes get slurmctld; managed worker nodes get slurmd. Submitter and unmanaged worker nodes are skipped.

func EnterTarget

func EnterTarget(ctx context.Context, client *docker.Client, realm, clusterName string) (string, error)

EnterTarget determines the target node for an interactive shell. Returns "submitter" if present in the cluster, otherwise "controller".

func GetMungeKey

func GetMungeKey(ctx context.Context, client *docker.Client, realm, clusterName string) ([]byte, error)

GetMungeKey reads the munge key from a cluster's node container. Any container in the cluster can be used since all mount the same munge volume.

func HasOtherClusters

func HasOtherClusters(ctx context.Context, client *docker.Client, realm, clusterName string) (bool, error)

HasOtherClusters checks whether any sind cluster containers exist besides the named cluster. This is used to decide whether to clean up mesh infrastructure after deleting a cluster.

func NetworkName

func NetworkName(realm, cluster string) docker.NetworkName

NetworkName returns the Docker network name for a cluster.

func NextComputeIndex

func NextComputeIndex(ctx context.Context, client *docker.Client, realm, clusterName string) (int, error)

NextComputeIndex determines the next worker node index by examining existing containers in the cluster. Returns max(existing indices) + 1, or 0 if no worker containers exist.
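
The index rule can be illustrated on plain short names. This sketch assumes worker short names follow the `worker-N` pattern used in the examples on this page; `nextWorkerIndex` is a hypothetical helper that skips the Docker lookup and works on a slice of names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const workerPrefix = "worker-"

// nextWorkerIndex returns max(existing worker indices) + 1, or 0 when no
// name of the form "worker-N" is present. Other names are ignored.
func nextWorkerIndex(shortNames []string) int {
	next := 0
	for _, name := range shortNames {
		if !strings.HasPrefix(name, workerPrefix) {
			continue
		}
		idx, err := strconv.Atoi(strings.TrimPrefix(name, workerPrefix))
		if err != nil || idx < 0 {
			continue // not a numeric worker suffix
		}
		if idx+1 > next {
			next = idx + 1
		}
	}
	return next
}

func main() {
	fmt.Println(nextWorkerIndex([]string{"controller", "worker-0", "worker-3"}))
	fmt.Println(nextWorkerIndex(nil))
}
```

Taking the maximum rather than the count means gaps left by removed workers are not reused.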

func NodeLabels

func NodeLabels(realm, clusterName string, role config.Role, slurmVersion, dataHostPath string, containerNumber int) docker.Labels

NodeLabels returns the standard labels for a node container. containerNumber is the 1-based instance number for compose compatibility. The slurm version label is omitted when slurmVersion is empty. The data host path label is omitted when dataHostPath is empty (Docker volume mode).

func NodeShortNames

func NodeShortNames(nodes []config.Node) []string

NodeShortNames returns the short hostname for each node defined in the config. Worker nodes are indexed sequentially across all worker groups, matching the indexing used in slurm.GenerateNodesConf.
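
Sequential indexing across worker groups might look like the sketch below. The `nodeGroup` type and its `Count` field are hypothetical stand-ins, since the shape of config.Node is not shown on this page; only the idea of one running worker counter shared by all groups is taken from the description.

```go
package main

import "fmt"

// nodeGroup is a hypothetical stand-in for config.Node.
type nodeGroup struct {
	Role  string // "controller", "submitter", or "worker"
	Count int    // number of nodes in a worker group (assumed field)
}

// shortNames names non-worker nodes after their role and numbers workers
// with a single counter that continues across worker groups.
func shortNames(groups []nodeGroup) []string {
	var names []string
	worker := 0
	for _, g := range groups {
		if g.Role != "worker" {
			names = append(names, g.Role)
			continue
		}
		for i := 0; i < g.Count; i++ {
			names = append(names, fmt.Sprintf("worker-%d", worker))
			worker++
		}
	}
	return names
}

func main() {
	fmt.Println(shortNames([]nodeGroup{
		{Role: "controller", Count: 1},
		{Role: "worker", Count: 2},
		{Role: "worker", Count: 1},
	}))
}
```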

func PowerCut

func PowerCut(ctx context.Context, client *docker.Client, realm, clusterName string, shortNames []string) error

PowerCut immediately kills the specified nodes (docker kill).

func PowerCycle

func PowerCycle(ctx context.Context, client *docker.Client, realm, clusterName string, shortNames []string) error

PowerCycle hard-restarts the specified nodes (docker kill + start).

func PowerFreeze

func PowerFreeze(ctx context.Context, client *docker.Client, realm, clusterName string, shortNames []string) error

PowerFreeze suspends all processes in the specified nodes (docker pause). The containers remain running but are completely unresponsive.

func PowerOn

func PowerOn(ctx context.Context, client *docker.Client, realm, clusterName string, shortNames []string) error

PowerOn starts the specified stopped nodes (docker start).

func PowerReboot

func PowerReboot(ctx context.Context, client *docker.Client, realm, clusterName string, shortNames []string) error

PowerReboot gracefully restarts the specified nodes (docker stop + start).

func PowerShutdown

func PowerShutdown(ctx context.Context, client *docker.Client, realm, clusterName string, shortNames []string) error

PowerShutdown gracefully stops the specified nodes (docker stop).

func PowerUnfreeze

func PowerUnfreeze(ctx context.Context, client *docker.Client, realm, clusterName string, shortNames []string) error

PowerUnfreeze resumes the specified frozen nodes (docker unpause).
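
The docker subcommands named in the Power* descriptions above can be collected into one table. The sketch below is illustrative only: the operation keys and the helper `powerCommands` are hypothetical, and whether the package passes all containers to one docker invocation is an assumption.

```go
package main

import "fmt"

// powerSubcommands maps a hypothetical operation key to the docker
// subcommands listed in the corresponding Power* function description.
var powerSubcommands = map[string][]string{
	"cut":      {"kill"},
	"cycle":    {"kill", "start"},
	"freeze":   {"pause"},
	"on":       {"start"},
	"reboot":   {"stop", "start"},
	"shutdown": {"stop"},
	"unfreeze": {"unpause"},
}

// powerCommands returns one docker argument list per subcommand, each
// applied to all given containers. Unknown operations yield nil.
func powerCommands(op string, containers []string) [][]string {
	subs, ok := powerSubcommands[op]
	if !ok {
		return nil
	}
	cmds := make([][]string, 0, len(subs))
	for _, sub := range subs {
		cmds = append(cmds, append([]string{sub}, containers...))
	}
	return cmds
}

func main() {
	for _, cmd := range powerCommands("cycle", []string{"node-a", "node-b"}) {
		fmt.Println("docker", cmd)
	}
}
```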

func PreflightCheck

func PreflightCheck(ctx context.Context, client *docker.Client, realm string, cfg *config.Cluster) error

PreflightCheck verifies that no Docker resources conflict with the cluster that would be created from the given configuration. It checks for existing networks, volumes, and containers with matching names.

Container existence is checked with a single `docker ps` filtered by the cluster's realm + name labels; this keeps the call count constant regardless of node count.

func ServiceLogArgs

func ServiceLogArgs(realm, node, cluster, service string, follow bool) []string

ServiceLogArgs builds docker CLI arguments for streaming service journal logs. node is the short name, cluster is the cluster name, service is the systemd unit name (e.g. "slurmctld", "slurmd"), and follow controls whether --follow is added.

func ValidateWorkerAdd

func ValidateWorkerAdd(ctx context.Context, client *docker.Client, realm string, opts WorkerAddOptions) error

ValidateWorkerAdd checks prerequisites for adding workers to a cluster. For managed workers, it verifies that sind-nodes.conf exists on the controller (indicating sind-generated Slurm configuration is in use). Unmanaged workers bypass the sind-nodes.conf check.

func VolumeName

func VolumeName(realm, cluster string, volumeType VolumeType) docker.VolumeName

VolumeName returns the Docker volume name for a cluster resource.

func WorkerRemove

func WorkerRemove(ctx context.Context, client *docker.Client, meshMgr *mesh.Manager, clusterName string, shortNames []string) error

WorkerRemove removes worker nodes from a cluster.

For managed nodes (those present in sind-nodes.conf), the flow is:

  1. Update sind-nodes.conf to remove the node definitions
  2. Reconfigure slurmctld
  3. Deregister DNS + known_hosts
  4. Stop + remove containers

For unmanaged nodes, only steps 3–4 are performed.
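
The managed/unmanaged split above can be sketched as a small planning step. Everything here is illustrative: `splitManaged` and `removalSteps` are hypothetical helpers, and the set of names found in sind-nodes.conf is passed in as a plain map rather than parsed from the file.

```go
package main

import "fmt"

// splitManaged partitions the requested nodes by whether they appear in
// sind-nodes.conf (represented here as a set of short names).
func splitManaged(inNodesConf map[string]bool, shortNames []string) (managed, unmanaged []string) {
	for _, name := range shortNames {
		if inNodesConf[name] {
			managed = append(managed, name)
		} else {
			unmanaged = append(unmanaged, name)
		}
	}
	return managed, unmanaged
}

// removalSteps lists the steps for one group: all four for managed
// nodes, only the last two for unmanaged nodes.
func removalSteps(isManaged bool) []string {
	var steps []string
	if isManaged {
		steps = append(steps, "update sind-nodes.conf", "reconfigure slurmctld")
	}
	return append(steps, "deregister DNS + known_hosts", "stop + remove containers")
}

func main() {
	conf := map[string]bool{"worker-0": true, "worker-1": true}
	managed, unmanaged := splitManaged(conf, []string{"worker-1", "worker-7"})
	fmt.Println(managed, removalSteps(true))
	fmt.Println(unmanaged, removalSteps(false))
}
```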

func WriteClusterConfig

func WriteClusterConfig(ctx context.Context, client *docker.Client, realm string, cfg *config.Cluster, image string, pull bool) error

WriteClusterConfig generates and writes slurm.conf, sind-nodes.conf, and cgroup.conf to the config volume. Uses a temporary container to access the volume.

func WriteMungeKey

func WriteMungeKey(ctx context.Context, client *docker.Client, realm, clusterName string, key []byte, image string, pull bool) error

WriteMungeKey writes the given munge key to the munge volume. Uses a temporary container to access the volume.

Types

type Cluster

type Cluster struct {
	Name         string
	SlurmVersion string
	State        State
	Nodes        []*Node
}

Cluster represents a live sind cluster as it exists in Docker. This is distinct from config.Cluster, which represents the configuration input.

func Create

func Create(ctx context.Context, client *docker.Client, meshMgr *mesh.Manager, cfg *config.Cluster, readinessInterval time.Duration) (result *Cluster, retErr error)

Create orchestrates the full cluster creation flow.

The caller must ensure mesh infrastructure exists (via mesh.Manager.EnsureMesh) before calling Create. The context deadline controls the overall timeout; readinessInterval controls the polling interval for readiness probes.

┌ PreflightCheck → createResources → ConnectNetwork ┐
┤                                                   ├→ setupNodes
└ resolveInfra (DNS IP ║ SSH key ║ Slurm version) ──┘
                        │
              registerMesh ║ enableSlurm
                        │
                    *Cluster
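
One possible reading of the diagram is two branches that run concurrently, a join before setupNodes, and a second concurrent pair afterwards. The sketch below follows that reading; treating `║` as "in parallel" is an assumption, and `inParallel` and `createTrace` are hypothetical helpers with no-op steps. It uses `errors.Join`, which requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// inParallel runs every function in its own goroutine, waits for all of
// them, and returns their errors joined (nil if all succeeded).
func inParallel(fns ...func() error) error {
	errs := make([]error, len(fns))
	var wg sync.WaitGroup
	for i, fn := range fns {
		wg.Add(1)
		go func(i int, fn func() error) {
			defer wg.Done()
			errs[i] = fn()
		}(i, fn)
	}
	wg.Wait()
	return errors.Join(errs...)
}

// createTrace walks the stages with no-op steps and records the order in
// which they ran. Order inside a parallel pair is not deterministic.
func createTrace() ([]string, error) {
	var mu sync.Mutex
	var trace []string
	step := func(name string) func() error {
		return func() error {
			mu.Lock()
			defer mu.Unlock()
			trace = append(trace, name)
			return nil
		}
	}
	// Branch 1 stands for PreflightCheck, createResources, ConnectNetwork.
	if err := inParallel(step("resources"), step("resolveInfra")); err != nil {
		return trace, err
	}
	if err := step("setupNodes")(); err != nil {
		return trace, err
	}
	if err := inParallel(step("registerMesh"), step("enableSlurm")); err != nil {
		return trace, err
	}
	return trace, nil
}

func main() {
	trace, err := createTrace()
	fmt.Println(trace, err)
}
```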

type MountPoint added in v0.3.0

type MountPoint struct {
	Path   string             `json:"path"`   // mount path inside the container (e.g. "/etc/slurm")
	Source string             `json:"source"` // volume name or host path
	Type   config.StorageType `json:"type"`   // "volume" or "hostPath"
	OK     bool               `json:"ok"`     // true if the Docker volume exists (always true for hostPath)
}

MountPoint describes a volume or bind mount on cluster containers.
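
To show the JSON shape implied by the struct tags, the sketch below marshals a local copy of the struct. The `Type` field is declared as a plain string here because config.StorageType is not defined on this page, and the source name `example-config` is a made-up placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MountPoint is a local copy of the documented struct, with Type as a
// plain string instead of config.StorageType.
type MountPoint struct {
	Path   string `json:"path"`
	Source string `json:"source"`
	Type   string `json:"type"`
	OK     bool   `json:"ok"`
}

// encodeMount returns the compact JSON encoding of a mount point.
func encodeMount(m MountPoint) (string, error) {
	b, err := json.Marshal(m)
	return string(b), err
}

func main() {
	out, err := encodeMount(MountPoint{
		Path:   "/etc/slurm",
		Source: "example-config", // placeholder volume name
		Type:   "volume",
		OK:     true,
	})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(out)
}
```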

func GetMountPoints added in v0.3.0

func GetMountPoints(ctx context.Context, client *docker.Client, realm, clusterName string, containers []docker.ContainerListEntry) ([]MountPoint, error)

GetMountPoints returns the mount points for a cluster, checking volume existence for Docker volumes. The data mount source is determined from the sind.data.hostpath label on cluster containers: when present it is a host-path bind mount, otherwise it is a Docker volume.

type NetworkHealth

type NetworkHealth struct {
	Mesh           bool   `json:"mesh_ok"`         // sind-mesh network exists
	MeshName       string `json:"mesh_name"`       // mesh network name (e.g. "sind-mesh")
	MeshDriver     string `json:"mesh_driver"`     // mesh network driver (e.g. "bridge")
	MeshSubnet     string `json:"mesh_subnet"`     // mesh network subnet
	MeshGateway    string `json:"mesh_gateway"`    // mesh network gateway
	DNS            bool   `json:"dns_ok"`          // sind-dns container exists
	DNSName        string `json:"dns_name"`        // DNS container name (e.g. "sind-dns")
	Cluster        bool   `json:"cluster_ok"`      // cluster network exists
	ClusterName    string `json:"cluster_name"`    // cluster network name (e.g. "sind-dev-net")
	ClusterDriver  string `json:"cluster_driver"`  // cluster network driver (e.g. "bridge")
	ClusterSubnet  string `json:"cluster_subnet"`  // cluster network subnet
	ClusterGateway string `json:"cluster_gateway"` // cluster network gateway
}

NetworkHealth holds the health and IPAM details of cluster networking.

func GetNetworkHealth

func GetNetworkHealth(ctx context.Context, client *docker.Client, realm, clusterName string) (*NetworkHealth, error)

GetNetworkHealth checks the health of mesh, DNS, and cluster networking.

type NetworkSummary

type NetworkSummary struct {
	Name    string `json:"name"`
	Driver  string `json:"driver"`
	Subnet  string `json:"subnet"`
	Gateway string `json:"gateway"`
}

NetworkSummary holds summary information about a sind network.

func GetNetworks

func GetNetworks(ctx context.Context, client *docker.Client, realm string) ([]*NetworkSummary, error)

GetNetworks lists all sind-related Docker networks with IPAM details. This includes per-cluster networks (sind-<cluster>-net) and the mesh network (sind-mesh).

type Node

type Node struct {
	Name        string             // short name: "controller", "worker-0"
	Role        config.Role        // "controller", "submitter", "worker"
	ContainerID docker.ContainerID // Docker container ID
	IP          string             // container IP address
	State       State
}

Node represents a running node in a sind cluster.

func WorkerAdd

func WorkerAdd(ctx context.Context, client *docker.Client, meshMgr *mesh.Manager, opts WorkerAddOptions, readinessInterval time.Duration) (result []*Node, retErr error)

WorkerAdd adds worker nodes to an existing cluster.

For managed workers (default), the flow is:

  1. Validate: controller exists, sind-nodes.conf present
  2. Create worker container(s)
  3. Wait for readiness, inject SSH keys, collect host keys
  4. Register DNS + known_hosts
  5. Update sind-nodes.conf with new node definitions
  6. Reconfigure slurmctld
  7. Enable slurmd on new nodes

For unmanaged workers (Unmanaged=true), steps 5–7 are skipped.
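
The step list above, including the skip for unmanaged workers, can be modelled as a small pipeline that stops at the first error. This is a sketch with no-op steps; the `step` type, `runSteps`, and `addWorkerTrace` are hypothetical and say nothing about how WorkerAdd is actually structured.

```go
package main

import "fmt"

// step is one stage of the flow. managedOnly marks steps 5 to 7, which
// are skipped for unmanaged workers.
type step struct {
	name        string
	managedOnly bool
	run         func() error
}

// runSteps executes steps in order, skipping managed-only ones when
// unmanaged is true, and stops at the first error.
func runSteps(steps []step, unmanaged bool) (ran []string, err error) {
	for _, s := range steps {
		if unmanaged && s.managedOnly {
			continue
		}
		if err := s.run(); err != nil {
			return ran, fmt.Errorf("%s: %w", s.name, err)
		}
		ran = append(ran, s.name)
	}
	return ran, nil
}

// addWorkerTrace runs the documented step names with no-op bodies.
func addWorkerTrace(unmanaged bool) []string {
	noop := func() error { return nil }
	steps := []step{
		{"validate", false, noop},
		{"create containers", false, noop},
		{"wait for readiness", false, noop},
		{"register DNS + known_hosts", false, noop},
		{"update sind-nodes.conf", true, noop},
		{"reconfigure slurmctld", true, noop},
		{"enable slurmd", true, noop},
	}
	ran, _ := runSteps(steps, unmanaged)
	return ran
}

func main() {
	fmt.Println(addWorkerTrace(false))
	fmt.Println(addWorkerTrace(true))
}
```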

type NodeDetail added in v0.9.0

type NodeDetail struct {
	Container string                `json:"container"`
	Cluster   string                `json:"cluster"`
	Role      config.Role           `json:"role"`
	FQDN      string                `json:"fqdn"`
	IP        string                `json:"ip"`
	Status    docker.ContainerState `json:"status"`
	Services  ServiceHealth         `json:"services"`
}

NodeDetail holds the full identity and health information for a single node as reported by 'sind get node'. It extends NodeSummary with a per-service health map. All readiness-checked services (munge, sshd, and the role's Slurm services) are reported under Services.

type NodeHealth

type NodeHealth struct {
	State    docker.ContainerState `json:"status"`   // container state from Docker (e.g. "running", "exited")
	IP       string                `json:"ip"`       // container IP address
	Services ServiceHealth         `json:"services"` // all readiness-checked services (munge, sshd, and role-specific services like slurmctld/slurmd)
}

NodeHealth holds the health status of a single node.

func GetNodeHealth

func GetNodeHealth(ctx context.Context, client *docker.Client, containerName string, role config.Role, realm, clusterName string) (*NodeHealth, error)

GetNodeHealth checks the health of a single node container. If the container is not running, remaining checks are skipped and default to false. The role determines which Slurm services are checked. clusterName is used to select the cluster network IP.
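
The "skip and default to false" behaviour can be sketched as follows. This is a simplified model: service names are plain strings rather than probe.Service values, the role-to-service mapping is inferred from the descriptions on this page, and `checkServices` with its `probe` callback is a hypothetical helper.

```go
package main

import "fmt"

// ServiceHealth is a simplified copy keyed by string.
type ServiceHealth map[string]bool

// servicesFor lists the services checked for a role: munge and sshd on
// every node, plus the role's Slurm daemon where one applies.
func servicesFor(role string) []string {
	services := []string{"munge", "sshd"}
	switch role {
	case "controller":
		services = append(services, "slurmctld")
	case "worker":
		services = append(services, "slurmd")
	}
	return services
}

// checkServices probes each service only when the container is running.
// For any other state the probe is never called and every entry is false.
func checkServices(state, role string, probe func(service string) bool) ServiceHealth {
	health := ServiceHealth{}
	for _, svc := range servicesFor(role) {
		health[svc] = state == "running" && probe(svc)
	}
	return health
}

func main() {
	alwaysUp := func(string) bool { return true }
	fmt.Println(checkServices("running", "worker", alwaysUp))
	fmt.Println(checkServices("exited", "controller", alwaysUp))
}
```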

type NodeStatus

type NodeStatus struct {
	Name   string      `json:"name"`   // DNS-style name: "controller.dev"
	Role   config.Role `json:"role"`   // "controller", "submitter", "worker"
	Health *NodeHealth `json:"health"` //nolint:revive // nested health is intentional
}

NodeStatus combines node identity with health information.

type NodeSummary

type NodeSummary struct {
	Container string      `json:"container"` // Docker container name
	Cluster   string      `json:"cluster"`   // cluster name
	Role      config.Role `json:"role"`
	FQDN      string      `json:"fqdn"` // DNS name
	IP        string      `json:"ip"`   // container IP on cluster network
	State     State       `json:"status"`
}

NodeSummary holds summary information about a node in a sind cluster.

func GetAllNodes added in v0.9.0

func GetAllNodes(ctx context.Context, client *docker.Client, realm string) ([]*NodeSummary, error)

GetAllNodes lists all nodes across all clusters in the realm.

func GetNodes

func GetNodes(ctx context.Context, client *docker.Client, realm, clusterName string) ([]*NodeSummary, error)

GetNodes lists all nodes in the named cluster.

type RealmSummary added in v0.9.0

type RealmSummary struct {
	Name     string `json:"name"`
	Clusters int    `json:"clusters"`
}

RealmSummary holds summary information about a sind realm.

func GetRealms added in v0.9.0

func GetRealms(ctx context.Context, client *docker.Client) ([]*RealmSummary, error)

GetRealms lists all sind realms by querying Docker for containers with the sind.realm label. Containers are grouped by realm, and clusters are counted per realm.
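
The grouping described above can be sketched on plain label maps. The input shape, the sorted output, and the helpers `summarizeRealms` and `sampleContainers` are assumptions for illustration; the real function queries Docker and returns pointers.

```go
package main

import (
	"fmt"
	"sort"
)

// RealmSummary is a local copy of the documented struct.
type RealmSummary struct {
	Name     string `json:"name"`
	Clusters int    `json:"clusters"`
}

// summarizeRealms groups containers by their sind.realm label and counts
// the distinct sind.cluster values seen in each realm.
func summarizeRealms(containers []map[string]string) []RealmSummary {
	clusters := map[string]map[string]bool{}
	for _, labels := range containers {
		realm, ok := labels["sind.realm"]
		if !ok {
			continue // not a sind container
		}
		if clusters[realm] == nil {
			clusters[realm] = map[string]bool{}
		}
		if cluster, ok := labels["sind.cluster"]; ok {
			clusters[realm][cluster] = true
		}
	}
	out := make([]RealmSummary, 0, len(clusters))
	for realm, set := range clusters {
		out = append(out, RealmSummary{Name: realm, Clusters: len(set)})
	}
	// Sorting is an assumption made here to keep the output stable.
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// sampleContainers returns made-up label sets for the demonstration.
func sampleContainers() []map[string]string {
	return []map[string]string{
		{"sind.realm": "sind", "sind.cluster": "dev"},
		{"sind.realm": "sind", "sind.cluster": "dev"},
		{"sind.realm": "sind", "sind.cluster": "prod"},
		{"sind.realm": "sind"}, // realm label only
		{"sind.realm": "lab", "sind.cluster": "dev"},
		{"unrelated": "true"},
	}
}

func main() {
	fmt.Println(summarizeRealms(sampleContainers()))
}
```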

type Resources

type Resources struct {
	Containers    []docker.ContainerListEntry
	Network       docker.NetworkName
	NetworkExists bool
	Volumes       []docker.VolumeName
}

Resources holds the Docker resources belonging to a cluster.

func ListClusterResources

func ListClusterResources(ctx context.Context, client *docker.Client, realm, clusterName string) (*Resources, error)

ListClusterResources discovers all Docker resources belonging to the named cluster. Containers are found by label filter; network and volumes are checked by name convention.

type RunConfig

type RunConfig struct {
	Realm           string      // realm name (e.g. "sind")
	ClusterName     string      // cluster name
	ShortName       string      // node hostname: "controller", "worker-0"
	Role            config.Role // "controller", "submitter", "worker"
	Image           string      // container image
	CPUs            int         // CPU limit
	Memory          string      // memory limit (e.g. "2g")
	TmpSize         string      // /tmp tmpfs size (e.g. "1g")
	SlurmVersion    string      // slurm version for labels (optional)
	DNSIP           string      // mesh DNS container IP (optional)
	DataHostPath    string      // host path for data volume (empty = use docker volume)
	DataMountPath   string      // mount point for data (default: /data)
	Managed         bool        // start slurmd and add to slurm.conf (worker only)
	ContainerNumber int         // 1-based compose container instance number
	Pull            bool        // force fresh image pull (--pull always)
	CapAdd          []string    // extra Linux capabilities (e.g. "SYS_ADMIN")
	CapDrop         []string    // dropped Linux capabilities
	Devices         []string    // host devices to expose (e.g. "/dev/fuse")
	SecurityOpt     []string    // extra security options
}

RunConfig holds the parameters needed to build docker run arguments for creating a node container.

func NodeRunConfigs

func NodeRunConfigs(cfg *config.Cluster, realm, dnsIP, slurmVersion string) []RunConfig

NodeRunConfigs builds RunConfig entries for all nodes in the cluster config. Worker nodes are indexed sequentially across all worker groups.

type ServiceHealth added in v0.7.0

type ServiceHealth map[probe.Service]bool

ServiceHealth maps a readiness-check service to its health status.

type State added in v0.2.0

type State string

State represents the state of a cluster or node.

const (
	StateRunning State = "running"
	StateStopped State = "stopped"
	StatePaused  State = "paused"
	StateMixed   State = "mixed"   // cluster: nodes in different states
	StateEmpty   State = "empty"   // cluster: no nodes exist
	StateUnknown State = "unknown" // node: unrecognised container state
)

Possible cluster/node states.
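
A cluster-level state could be derived from node states along the lines of the comments on StateMixed and StateEmpty. The `aggregate` function below is a hypothetical sketch of that rule, not the package's implementation.

```go
package main

import "fmt"

// State and its values are copied from the declarations above.
type State string

const (
	StateRunning State = "running"
	StateStopped State = "stopped"
	StatePaused  State = "paused"
	StateMixed   State = "mixed"
	StateEmpty   State = "empty"
	StateUnknown State = "unknown"
)

// aggregate returns StateEmpty for no nodes, the shared state when all
// nodes agree, and StateMixed otherwise.
func aggregate(nodes []State) State {
	if len(nodes) == 0 {
		return StateEmpty
	}
	first := nodes[0]
	for _, s := range nodes[1:] {
		if s != first {
			return StateMixed
		}
	}
	return first
}

func main() {
	fmt.Println(aggregate(nil))
	fmt.Println(aggregate([]State{StateRunning, StateRunning}))
	fmt.Println(aggregate([]State{StateRunning, StateStopped}))
}
```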

type Status

type Status struct {
	Name         string         `json:"name"`
	SlurmVersion string         `json:"slurm_version"`
	State        State          `json:"status"`
	Nodes        []*NodeStatus  `json:"nodes"`
	Network      *NetworkHealth `json:"network"`
	Mounts       []MountPoint   `json:"mounts"`
}

Status holds the full status of a sind cluster.

func GetStatus

func GetStatus(ctx context.Context, client *docker.Client, realm, clusterName string) (*Status, error)

GetStatus returns the full status of a cluster, aggregating node, network, and volume health information.

type Summary added in v0.2.0

type Summary struct {
	Name         string `json:"name"`
	SlurmVersion string `json:"slurm_version"`
	State        State  `json:"status"`
	NodeCount    int    `json:"nodes"`
	Submitters   int    `json:"submitters"`
	Controllers  int    `json:"controllers"`
	Workers      int    `json:"workers"`
}

Summary holds summary information about a sind cluster.

func GetClusters

func GetClusters(ctx context.Context, client *docker.Client, realm string) ([]*Summary, error)

GetClusters lists all sind clusters by querying Docker for containers with the sind.cluster label. Containers are grouped by cluster name.

type VolumeSummary

type VolumeSummary struct {
	Name   string `json:"name"`
	Driver string `json:"driver"`
}

VolumeSummary holds summary information about a sind volume.

func GetVolumes

func GetVolumes(ctx context.Context, client *docker.Client, realm string) ([]*VolumeSummary, error)

GetVolumes lists all sind-related Docker volumes.

type VolumeType added in v0.7.0

type VolumeType string

VolumeType identifies a cluster volume kind.

const (
	VolumeConfig VolumeType = "config"
	VolumeMunge  VolumeType = "munge"
	VolumeData   VolumeType = "data"
)

Cluster volume types.

type WorkerAddOptions

type WorkerAddOptions struct {
	ClusterName string
	Count       int
	Image       string
	CPUs        int
	Memory      string
	TmpSize     string
	Unmanaged   bool
	Pull        bool
	CapAdd      []string
	CapDrop     []string
	Devices     []string
	SecurityOpt []string
}

WorkerAddOptions holds the parameters for adding worker nodes to a cluster.
