v0.6.0
Published: Jan 7, 2020 License: Apache-2.0



Multicluster-scheduler is a system of Kubernetes controllers that intelligently schedules workloads across clusters. It is simple to use and simple to integrate with other tools.

  1. Install the scheduler in any cluster and the agent in each cluster that you want to federate.
  2. Annotate any pod or pod template (e.g., of a Deployment, Job, or Argo Workflow, among others) in any member cluster with"".
  3. Multicluster-scheduler mutates the elected pods into proxy pods ("running" on a virtual-kubelet) and deploys delegate pods to other clusters (where containers are actually run).
  4. A feedback loop updates the statuses and annotations of the proxy pods to reflect the statuses and annotations of the delegate pods.
  5. Services that target proxy pods are rerouted to their delegates, replicated across clusters, and annotated with io.cilium/global-service=true to be load-balanced across a Cilium cluster mesh, if installed.

Check out Admiralty's blog post demonstrating how to run an Argo workflow across clusters to combine data from different regions or clouds and better utilize resources.

Getting Started

We assume that you are a cluster admin for two clusters, associated with, e.g., the contexts "cluster1" and "cluster2" in your kubeconfig. We're going to install a basic scheduler in cluster1 and agents in cluster1 and cluster2. Then, we will deploy a multi-cluster NGINX.

CLUSTER1=cluster1 # change me
CLUSTER2=cluster2 # change me
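
Before going further, it can help to confirm that both context names actually exist in your kubeconfig. The following is a minimal sketch, assuming kubectl is on your PATH; require_context is a hypothetical helper written for this guide, not part of multicluster-scheduler.

```shell
# require_context NAME: succeed if NAME is a context in the current kubeconfig.
# Hypothetical helper for this guide, not part of multicluster-scheduler.
require_context() {
  # "get-contexts -o name" prints one context name per line;
  # grep -x matches whole lines only, -F treats NAME literally, -q is silent.
  kubectl config get-contexts -o name | grep -qxF -- "$1"
}

# Usage:
#   require_context "$CLUSTER1" && require_context "$CLUSTER2" \
#     || echo "check CLUSTER1 and CLUSTER2" >&2
```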

Cert-manager v0.11+ must be installed in each member cluster:

helm repo add jetstack
helm repo update

for CONTEXT in $CLUSTER1 $CLUSTER2
do
  kubectl --context $CONTEXT apply --validate=false -f
  kubectl --context $CONTEXT create namespace cert-manager
  helm --kube-context $CONTEXT install cert-manager \
    --namespace cert-manager \
    --version v0.12.0 \
    jetstack/cert-manager
done

Optional: Cilium cluster mesh

For cross-cluster service calls, multicluster-scheduler relies on a Cilium cluster mesh and global services. If you need this feature, install Cilium and set up a cluster mesh. If you install Cilium later, you may have to restart pods.


Starting from v0.4, the recommended way to install multicluster-scheduler is with Helm (v3):

helm repo add admiralty
helm repo update

helm install multicluster-scheduler admiralty/multicluster-scheduler \
  --kube-context $CLUSTER1 \
  --set 'global.clusters[0].name=c1' \
  --set 'global.clusters[1].name=c2' \
  --set scheduler.enabled=true \
  --set clusters.enabled=true \
  --set agent.enabled=true \
  --set agent.clusterName=c1

helm install multicluster-scheduler admiralty/multicluster-scheduler \
  --kube-context $CLUSTER2 \
  --set agent.enabled=true \
  --set agent.clusterName=c2

Note: the Helm chart is flexible enough to configure multiple federations and/or refine RBAC so clusters can't see each other's observations. See the chart's documentation.
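
If you prefer to keep configuration under version control, the flags for cluster1 can be written as a Helm values file instead. This is only a sketch: it assumes the flags shown above are all you need, the keys are copied from those flags without being checked against the chart, and write_scheduler_values and the file name are hypothetical.

```shell
# write_scheduler_values FILE: write the cluster1 settings from the --set flags
# above as a Helm values file. Hypothetical helper; keys are copied from the
# flags and are not validated against the chart.
write_scheduler_values() {
  cat > "$1" <<'EOF'
global:
  clusters:
    - name: c1
    - name: c2
scheduler:
  enabled: true
clusters:
  enabled: true
agent:
  enabled: true
  clusterName: c1
EOF
}

# Usage:
#   write_scheduler_values values-c1.yaml
#   helm install multicluster-scheduler admiralty/multicluster-scheduler \
#     --kube-context "$CLUSTER1" -f values-c1.yaml
```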

Service Account Exchange

For agents to talk to the scheduler across cluster boundaries (via custom resource definitions, cf. How it Works), we need to export service accounts in the scheduler's cluster as kubeconfig files and save those files inside secrets in the agents' clusters.

Luckily, the kubemcsa export command of multicluster-service-account can prepare the secrets for us. First, install kubemcsa (you don't need to deploy multicluster-service-account):

OS=linux # or darwin (i.e., OS X) or windows
ARCH=amd64 # if you're on a different platform, you must know how to build from source
curl -Lo kubemcsa "$MCSA_RELEASE_URL/kubemcsa-$OS-$ARCH"
chmod +x kubemcsa

Then, run kubemcsa export to generate templates for secrets containing kubeconfigs equivalent to the c1 and c2 service accounts created by Helm in cluster1, and apply the templates with kubectl in cluster1 and cluster2, respectively:

./kubemcsa export --context $CLUSTER1 c1 --as remote | kubectl --context $CLUSTER1 apply -f -
./kubemcsa export --context $CLUSTER1 c2 --as remote | kubectl --context $CLUSTER2 apply -f -

Note: you may wonder why the agent in cluster1 needs a kubeconfig as it runs in the same cluster as the scheduler. We simply like symmetry and didn't want to make the agent's configuration special in that case.


After a minute, check that a virtual node named admiralty and node pool objects have been created in each agent's cluster, and observations appear in the scheduler's cluster:

kubectl --context $CLUSTER1 get node admiralty
kubectl --context $CLUSTER2 get node admiralty

kubectl --context $CLUSTER1 get nodepools # or np
kubectl --context $CLUSTER2 get nodepools # or np

kubectl config use-context $CLUSTER1
kubectl get nodepoolobservations # or npobs
kubectl get nodeobservations # or nodeobs
kubectl get podobservations # or podobs
kubectl get serviceobservations # or svcobs
# or, by category
kubectl get observations --show-kind # or obs
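
Rather than waiting a fixed minute, you could poll for the virtual node. The sketch below assumes nothing beyond kubectl; wait_for_node is a hypothetical helper, and its defaults (30 tries, 5 seconds apart) are arbitrary.

```shell
# wait_for_node CONTEXT NAME [TRIES] [DELAY]: poll until the node exists.
# Hypothetical helper; the default retry count and delay are arbitrary.
wait_for_node() {
  context=$1
  name=$2
  tries=${3:-30}
  delay=${4:-5}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # "get node" exits non-zero while the node does not exist yet.
    if kubectl --context "$context" get node "$name" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "node $name not found in context $context" >&2
  return 1
}

# Usage:
#   wait_for_node "$CLUSTER1" admiralty && wait_for_node "$CLUSTER2" admiralty
```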

Multicluster-scheduler's pod admission controller operates in namespaces labeled with multicluster-scheduler=enabled. In any of the member clusters, e.g., cluster2, label the default namespace:

kubectl --context "$CLUSTER2" label namespace default multicluster-scheduler=enabled
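
To confirm that the label took effect, you can list the namespaces that carry it. A small sketch, where namespace_enabled is a hypothetical helper name:

```shell
# namespace_enabled CONTEXT NAMESPACE: succeed if the namespace is labeled
# multicluster-scheduler=enabled. Hypothetical helper for this guide.
namespace_enabled() {
  # With a label selector and "-o name", kubectl prints lines such as
  # "namespace/default" for each matching namespace.
  kubectl --context "$1" get namespaces \
    -l multicluster-scheduler=enabled -o name |
    grep -qxF -- "namespace/$2"
}

# Usage:
#   namespace_enabled "$CLUSTER2" default || echo "namespace not labeled" >&2
```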

Then, deploy NGINX in it with the election annotation on the pod template:

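cat <<EOF | kubectl --context "$CLUSTER2" apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 10
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
      annotations: ""
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          requests:
            cpu: 100m
            memory: 32Mi
        ports:
        - containerPort: 80
EOF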

Things to check:

  1. The original pods have been transformed into proxy pods "running" on the virtual node admiralty. Notice the original manifest saved as an annotation.
  2. Proxy pod observations have been created in the scheduler's cluster.
  3. Delegate pod decisions have been created in the scheduler's cluster as well. Each decision was made based on all of the observations available at the time.
  4. Delegate pods have been created in either cluster. Notice that their spec matches the original manifest.

kubectl --context "$CLUSTER2" get pods # (-o yaml for details)
kubectl --context "$CLUSTER1" get podobs # (-o yaml)
kubectl --context "$CLUSTER1" get poddecisions # or poddec (-o yaml)

kubectl --context "$CLUSTER1" get pods # (-o yaml)
kubectl --context "$CLUSTER2" get pods # (-o yaml)
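
To see at a glance which pods are proxies and which are delegates, it may help to count pods per node: proxy pods should show up on the virtual node admiralty, delegate pods on real nodes. A minimal sketch, where pods_per_node is a hypothetical helper that looks at the context's current namespace:

```shell
# pods_per_node CONTEXT: print "<count> <node>" for each node that has pods
# in the context's current namespace. Hypothetical helper for this guide.
pods_per_node() {
  # The jsonpath template prints one spec.nodeName per line;
  # sort | uniq -c counts repeats; awk normalizes the spacing.
  kubectl --context "$1" get pods \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' |
    sort | uniq -c | awk '{print $1, $2}'
}

# Usage:
#   pods_per_node "$CLUSTER1"
#   pods_per_node "$CLUSTER2"
```
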

Enforcing Placement

In some cases, you may want to specify the target cluster, rather than let the scheduler decide. For example, you may want an Argo workflow to execute certain steps in certain clusters, e.g., to be closer to external dependencies. You can enforce placement using the annotation. Admiralty's blog post presents multicloud Argo workflows. To complete this getting started guide, let's annotate our NGINX deployment's pod template to reschedule all pods to cluster1.

kubectl --context "$CLUSTER2" patch deployment nginx -p '{

After a little while, delegate pods in cluster2 will be terminated and more will be created in cluster1.

Optional: Service Reroute and Globalization

Our NGINX deployment isn't much use without a service to expose it. Kubernetes services route traffic to pods based on label selectors. We could directly create a service to match the labels of the delegate pods, but that would make it tightly coupled with multicluster-scheduler. Instead, let's create a service as usual, targeting the proxy pods. If a proxy pod were to receive traffic, it wouldn't know how to handle it, so multicluster-scheduler will change the service's label selector for us, to match the delegate pods instead, whose labels are similar to those of the proxy pods, except that their keys are prefixed with

If some or all of the delegate pods are in a different cluster, we also need the service to route traffic to them. For that, we rely on a Cilium cluster mesh and global services. Multicluster-scheduler will annotate the service with io.cilium/global-service=true and replicate it across clusters. (Multicluster-scheduler replicates any global service across clusters, not just services targeting proxy pods.)

kubectl --context "$CLUSTER2" expose deployment nginx

We just created a service in cluster2, alongside our deployment. However, in the previous step, we rescheduled all NGINX pods to cluster1. Check that the service was rerouted, globalized, and replicated to cluster1:

kubectl --context "$CLUSTER2" get service nginx -o yaml
# Check the annotations and the selector,
# then check that a copy exists in cluster1:
kubectl --context "$CLUSTER1" get service nginx -o yaml
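
If you would rather script the annotation check than read the YAML, something like the following could work. It is a sketch only: service_is_global is a hypothetical helper, and the dots in the annotation key are escaped because kubectl's jsonpath treats an unescaped dot as a field separator.

```shell
# service_is_global CONTEXT NAME: succeed if the service is annotated with
# io.cilium/global-service=true. Hypothetical helper for this guide.
service_is_global() {
  value=$(kubectl --context "$1" get service "$2" \
    -o jsonpath='{.metadata.annotations.io\.cilium/global-service}')
  [ "$value" = "true" ]
}

# Usage:
#   service_is_global "$CLUSTER2" nginx && service_is_global "$CLUSTER1" nginx
```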

Now call the delegate pods in cluster1 from cluster2:

kubectl --context "$CLUSTER2" run foo -it --rm --image alpine --command -- sh -c "apk add curl && curl nginx"

How it Works

Multicluster-scheduler is a system of Kubernetes controllers managed by the scheduler, deployed in any cluster, and its agents, deployed in the member clusters. The scheduler manages three controllers: schedule, bind, and global service. Each agent manages seven controllers: pod admission, service reroute, observations, decisions, delegate state, feedback, and node pool.

  1. The pod admission controller, a dynamic, mutating admission webhook, intercepts pod creation requests. If a pod is annotated with"", its original manifest is saved as an annotation, and its spec.nodeName is set to admiralty (a virtual kubelet managed by the agent).
  2. The service reroute controller modifies services whose endpoints target proxy pods. The keys of their label selectors are prefixed with, to match corresponding delegate pods (see below). Also, the services are annotated with io.cilium/global-service=true, to be load-balanced across a Cilium cluster mesh.
  3. The observations controller, a multi-cluster controller, watches pods (including proxy pods), services (including global services), nodes, and node pools (created by the node pool controller, see below) in the agent's cluster and reconciles corresponding observations in the scheduler's cluster. Observations are images of the source objects' states.
  4. The schedule controller watches proxy pod observations in the scheduler's cluster and updates them with target cluster name annotations, based on other observations.
  5. The bind controller watches proxy pod observations with target cluster name annotations in the scheduler's cluster and reconciles delegate pod decisions, also in the scheduler's cluster. The scheduler doesn't push anything to the member clusters.
  6. The global service controller watches global service observations (observations of services annotated with io.cilium/global-service=true, either by the service reroute controller or by another tool or user) and reconciles global service decisions (copies of the originals), for all clusters of the federation.
  7. The decisions controller, another multi-cluster controller, watches pod and service decisions in the scheduler's cluster and reconciles corresponding delegates in the agent's cluster.
  8. The delegate state controller watches delegate pod observations and copies their states into the DelegateState field of their parent proxy pod observations.
  9. The feedback controller watches proxy pod observations with set delegate states and reconciles the corresponding proxy pods' statuses and annotations (e.g., Argo outputs). The feedback controller maintains the contract between proxy pods and their controllers, e.g., replica sets or Argo workflows.
  10. The node pool controller automatically creates node pool objects in the agent's cluster. In GKE and AKS, it uses the or agentpool label, respectively; in the absence of those labels, a default node pool object is created. Min/max node counts and pricing information can be updated by the user, or controlled by other tools. Custom node pool objects can also be created using label selectors. Node pool information can be used for scheduling.
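
The selector rewrite performed by the service reroute controller (step 2) can be illustrated with a few lines of shell. This is only a sketch of the idea, not the controller's code: prefix_selector_keys is a hypothetical helper, and the prefix in the usage line is a made-up placeholder rather than the one multicluster-scheduler uses.

```shell
# prefix_selector_keys PREFIX KEY=VALUE...: print each selector entry with
# PREFIX prepended to its key, one entry per line.
# Illustration only; not the controller's actual implementation.
prefix_selector_keys() {
  prefix=$1
  shift
  for entry in "$@"; do
    printf '%s%s\n' "$prefix" "$entry"
  done
}

# Usage (example.com/ is a placeholder prefix):
#   prefix_selector_keys example.com/ app=nginx tier=web
```

A service that selected proxy pods by app=nginx would, after the rewrite, select pods carrying the prefixed key instead, which only the delegate pods have.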

Observations, decisions, and node pools are custom resources. Node pools are defined (by CRDs) in each member cluster, whereas all observations and decisions are only defined in the scheduler's cluster.

Comparison with Kubefed (Federation v2)

The goal of Kubefed is similar to multicluster-scheduler's. However, they differ in several ways:

  • Kubefed has a broader scope than multicluster-scheduler. Any resource can be federated and deployed via a single control plane, even if the same result could be achieved with continuous delivery, e.g., GitOps. Multicluster-scheduler focuses on scheduling.
  • Multicluster-scheduler doesn't require using new federated resource types (cf. Kubefed's templates, placements and overrides). Instead, pods only need to be annotated to be scheduled to other clusters. This makes adopting multicluster-scheduler painless and ensures compatibility with other tools like Argo.
  • Whereas Kubefed's API resides in a single cluster, multicluster-scheduler's annotated pods can be declared in any member cluster and/or the scheduler's cluster. Teams can keep working in separate clusters, while utilizing available resources in other clusters as needed.
  • Kubefed propagates scheduling resources with a push-sync reconciler. Multicluster-scheduler's agents push observations to, and pull scheduling decisions from, the scheduler's cluster. The scheduler reconciles scheduling decisions with observations, but never calls the Kubernetes APIs of the member clusters. Clusters that allow outbound traffic to, but no inbound traffic from, the scheduler's cluster (e.g., on-prem, in some cases) can join the federation. Also, if the scheduler's cluster is compromised, attackers don't automatically gain access to the entire federation.
  • Kubefed integrates with ExternalDNS to provide cross-cluster service discovery and multicluster ingress. Multicluster-scheduler doesn't solve multicluster ingress at the moment, but integrates with Cilium for cross-cluster service discovery, and everything else Cilium has to offer. A detailed comparison of the two approaches is beyond the scope of this README (but certainly worth a future blog post).


  • Integration with Argo
  • Integration with Cilium cluster mesh and global services
  • One namespace per member cluster in the scheduler's cluster for more granular RBAC
  • Alternative cross-cluster networking implementations: Istio (1.1), Submariner
  • More integrations: Horizontal Pod Autoscaler, Knative, Rook, k3s, kube-batch
  • Advanced scheduling, respecting affinities, anti-affinities, taints, tolerations, quotas, etc.
  • Port-forward between proxy and delegate pods
  • Integrate node pool concept with other tools

API Reference


go get
godoc -http=:6060

then http://localhost:6060/pkg/


Path Synopsis
Package apis contains Kubernetes API groups.
Package multicluster contains multicluster API versions
Package v1alpha1 contains API Schema definitions for the multicluster v1alpha1 API group +k8s:openapi-gen=true +k8s:deepcopy-gen=package,register +k8s:defaulter-gen=TypeMeta
