TorchElastic Controller for Kubernetes

IMPORTANT This CRD is no longer being actively maintained. We encourage you to look at TorchX to launch PyTorch jobs on Kubernetes. The two relevant sections are:

  1. TorchX-Kubernetes setup: https://pytorch.org/torchx/latest/schedulers/kubernetes.html
  2. TorchX dist.ddp builtin (uses torchelastic under the hood): https://pytorch.org/torchx/latest/components/distributed.html

Overview

TorchElastic Controller for Kubernetes manages a Kubernetes custom resource ElasticJob and makes it easy to run Torch Elastic workloads on Kubernetes.

Prerequisites

NOTE:

  1. (Recommended) Create a cluster with GPU instances, as some examples (e.g. imagenet) only work on GPU.
  2. If you provision instances with a single GPU, you will only be able to run a single worker per node.
  3. Our examples assume 1 GPU per node. If you want to run multiple workers per container, adjust --nproc_per_node to match the number of CUDA devices on the instance you are using (see the snippet below).
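
A quick way to check how many CUDA devices a node has (a small sketch; it assumes the NVIDIA driver and nvidia-smi are installed on the instance):

# each GPU is printed on its own line
nvidia-smi -L | wc -l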

(Optional) Setup

Here we provide instructions to create an Amazon EKS cluster or a Microsoft AKS cluster. If you are not using AWS or Azure, please refer to your cloud/infrastructure provider's manual to set up a Kubernetes cluster.

NOTE: EKS/AKS/GKE is not required to run this controller; you can use any other Kubernetes cluster.

Create your Kubernetes Cluster
AWS: EKS

Use eksctl to create an Amazon EKS cluster. This process takes ~15 minutes.

eksctl create cluster \
    --name=torchelastic \
    --node-type=p3.2xlarge \
    --region=us-west-2 \
    --version=1.15 \
    --ssh-access \
    --ssh-public-key=~/.ssh/id_rsa.pub \
    --nodes=2
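
Once the cluster is up, sanity-check that both nodes registered (eksctl writes your kubeconfig by default; the commented fallback is only needed if it did not):

# if kubectl is not yet pointed at the new cluster:
# aws eks update-kubeconfig --region us-west-2 --name torchelastic
kubectl get nodes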
Azure: AKS

Use az aks to create a Microsoft AKS cluster. This process takes ~4 minutes.

az aks create \
    --resource-group myResourceGroup \
    --name torchelastic \
    --node-vm-size Standard_NC6 \
    --node-count 3 \
    --location westus2 \
    --kubernetes-version 1.15 \
    --generate-ssh-keys
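
After the cluster is created, fetch credentials so kubectl can reach it (the resource group and cluster name match the az aks create flags above):

az aks get-credentials --resource-group myResourceGroup --name torchelastic
kubectl get nodes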
Google Cloud: GKE

Use gcloud to create a Google GKE cluster. This process takes ~3 minutes.

gcloud container clusters create torchelastic \
    --accelerator type=nvidia-tesla-v100,count=1 \
    --machine-type=n1-standard-4 \
    --zone=us-west1-b \
    --cluster-version=1.15 \
    --num-nodes=2

Add --preemptible to use preemptible instances.
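
gcloud normally configures kubectl for the new cluster automatically; if you created it from another machine or need to refresh credentials, fetch them explicitly:

gcloud container clusters get-credentials torchelastic --zone=us-west1-b
kubectl get nodes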

Install the NVIDIA device plugin to enable GPU support on your cluster.

Deploy the following DaemonSet:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
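
To confirm GPU support is active, check that the plugin DaemonSet is running and that your nodes now advertise nvidia.com/gpu as an allocatable resource (a quick sanity check; the DaemonSet name varies between plugin versions):

kubectl get daemonsets -n kube-system | grep nvidia
kubectl describe nodes | grep nvidia.com/gpu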

Install ElasticJob controller and CRD


git clone https://github.com/pytorch/elastic.git
cd elastic/kubernetes

kubectl apply -k config/default
# or
# kustomize build config/default  | kubectl apply -f -

You will see output like the following:

namespace/elastic-job created
customresourcedefinition.apiextensions.k8s.io/elasticjobs.elastic.pytorch.org created
role.rbac.authorization.k8s.io/leader-election-role created
clusterrole.rbac.authorization.k8s.io/manager-role created
rolebinding.rbac.authorization.k8s.io/leader-election-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/elastic-job-k8s-controller-rolebinding created
deployment.apps/elastic-job-k8s-controller created

Verify that the ElasticJob custom resource is installed

kubectl get crd

The output should include elasticjobs.elastic.pytorch.org

NAME                                              CREATED AT
...
elasticjobs.elastic.pytorch.org                   2020-03-18T07:40:53Z
...

Verify the controller is ready

kubectl get pods -n elastic-job

NAME                                          READY   STATUS    RESTARTS   AGE
elastic-job-k8s-controller-6d4884c75b-z22cm   1/1     Running   0          15s

Check the controller logs

kubectl logs -f elastic-job-k8s-controller-6d4884c75b-z22cm -n elastic-job

2020-03-19T10:13:43.532Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": ":8080"}
2020-03-19T10:13:43.534Z	INFO	controller-runtime.controller	Starting EventSource	{"controller": "elasticjob", "source": "kind source: /, Kind="}
2020-03-19T10:13:43.534Z	INFO	controller-runtime.controller	Starting EventSource	{"controller": "elasticjob", "source": "kind source: /, Kind="}
2020-03-19T10:13:43.534Z	INFO	controller-runtime.controller	Starting EventSource	{"controller": "elasticjob", "source": "kind source: /, Kind="}
2020-03-19T10:13:43.534Z	INFO	setup	starting manager
2020-03-19T10:13:43.534Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
2020-03-19T10:13:43.822Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"ConfigMap","namespace":"elastic-job","name":"controller-leader-election-helper","uid":"50269b8b-69ca-11ea-b995-0653198c16be","apiVersion":"v1","resourceVersion":"2107564"}, "reason": "LeaderElection", "message": "elastic-job-k8s-controller-6d4884c75b-z22cm_4cf549b7-3289-4285-8e64-647d067178bf became leader"}
2020-03-19T10:13:44.021Z	INFO	controller-runtime.controller	Starting Controller	{"controller": "elasticjob"}
2020-03-19T10:13:44.121Z	INFO	controller-runtime.controller	Starting workers	{"controller": "elasticjob", "worker count": 1}

Deploy an ElasticJob

  1. Deploy an etcd server. This will expose a Kubernetes service etcd-service with port 2379.

    kubectl apply -f config/samples/etcd.yaml
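
    Before moving on, you can confirm the etcd pod is running (its name comes from the sample manifest):

    kubectl get pods -n elastic-job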
    
  2. Get the etcd server endpoint

    $ kubectl get svc -n elastic-job
    
    NAME           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
    etcd-service   ClusterIP   10.100.104.168   <none>        2379/TCP   5m5s
    
  3. Update config/samples/imagenet.yaml:

    1. set rdzvEndpoint (e.g. 10.100.104.168:2379) to the etcd server you just provisioned.
    2. set minReplicas and maxReplicas to the desired minimum and maximum number of nodes (max should not exceed your cluster capacity).
    3. set Worker.replicas to the number of nodes to start with (you may modify this later to scale the job in/out)
    4. set the correct --nproc_per_node in container.args based on the instance you are running on.

    NOTE the ENTRYPOINT to torchelastic/examples is python -m torchelastic.distributed.launch <args...>. Note that you do not have to specify certain launch options such as --rdzv_endpoint and --rdzv_id; these are set automatically by the controller.

    IMPORTANT a Worker in the context of Kubernetes refers to a Node in torchelastic.distributed.launch. Each Kubernetes Worker can run multiple trainer processes (a.k.a. worker in torchelastic.distributed.launch). An illustrative excerpt of the fields above is shown below.
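
    For reference, an illustrative excerpt of the fields described above (the field names follow this list, but the exact layout of config/samples/imagenet.yaml may differ, and the container name and image shown here are placeholders):

    apiVersion: elastic.pytorch.org/v1alpha1
    kind: ElasticJob
    metadata:
      name: imagenet
      namespace: elastic-job
    spec:
      rdzvEndpoint: "10.100.104.168:2379"  # 1. the etcd service endpoint
      minReplicas: 1                       # 2. minimum number of nodes
      maxReplicas: 3                       # 2. maximum number of nodes
      replicaSpecs:
        Worker:
          replicas: 2                      # 3. number of nodes to start with
          template:
            spec:
              containers:
                - name: elasticjob-worker  # placeholder container name
                  image: torchelastic/examples
                  args:
                    - "--nproc_per_node=1" # 4. CUDA devices per node
                    # ...remaining sample args unchanged...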

  4. Submit the training job.

    kubectl apply -f config/samples/imagenet.yaml
    

    As you can see, the training pods and headless services have been created.

    $ kubectl get pods -n elastic-job
    NAME                                          READY   STATUS    RESTARTS   AGE
    elastic-job-k8s-controller-6d4884c75b-z22cm   1/1     Running   0          11m
    imagenet-worker-0                             1/1     Running   0          5s
    imagenet-worker-1                             1/1     Running   0          5s
    
  5. You can scale the number of nodes by adjusting .spec.replicaSpecs[Worker].replicas and applying the change (a kubectl patch alternative is sketched after this list).

    kubectl apply -f config/samples/imagenet.yaml
    

    NOTE since you are scaling containers, you will be scaling in increments of nproc_per_node trainers. In our case --nproc_per_node=1. For better performance, consider using an instance with multiple GPUs and setting --nproc_per_node=$NUM_CUDA_DEVICES.

    WARNING the name of the job is used as rdzv_id, which uniquely identifies a job run instance. Hence, to run multiple parallel jobs with the same spec, change .metadata.name to give each a unique run id (e.g. imagenet_run_0); otherwise the new nodes will attempt to join the membership of a different run.
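
As an alternative to editing and re-applying the file, the replica count can also be patched in place (a sketch; this assumes .spec.replicaSpecs is a map keyed by Worker, as the field path above suggests):

kubectl -n elastic-job patch elasticjob imagenet --type merge \
    -p '{"spec":{"replicaSpecs":{"Worker":{"replicas":3}}}}'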

Monitoring jobs

You can describe the job to check its status and related events. In the following example, the imagenet job was created in the elastic-job namespace; substitute your own job name and namespace in the command.

$ kubectl describe elasticjob imagenet -n elastic-job

Name:         imagenet
Namespace:    elastic-job
<... OMITTED ...>
Status:
  Conditions:
    Last Transition Time:  2020-03-19T10:30:55Z
    Last Update Time:      2020-03-19T10:30:55Z
    Message:               ElasticJob imagenet is running.
    Reason:                ElasticJobRunning
    Status:                True
    Type:                  Running
<... OMITTED ...>
Events:
  Type    Reason                   Age   From                    Message
  ----    ------                   ----  ----                    -------
  Normal  SuccessfulCreatePod      13s   elastic-job-controller  Created pod: imagenet-worker-0

Tail the logs of a worker:

$ kubectl logs -f -n elastic-job imagenet-worker-0

Next Steps

We have included other sample job specs in the config/samples directory (e.g. config/samples/classy_vision.yaml); try them out by replacing imagenet.yaml with the appropriate spec filename in the instructions above.
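
For example, to try the Classy Vision sample (after editing its rdzvEndpoint, replica counts, and --nproc_per_node as in step 3 above):

kubectl apply -f config/samples/classy_vision.yaml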

To use your own script, build a Docker image containing it. You can use torchelastic/examples as your base image. Then point your job spec at your container by editing Worker.template.spec.containers.image.
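
A minimal sketch of such an image (the script name, destination path, and any extra dependencies are placeholders for your own):

# Dockerfile -- torchelastic/examples already sets python -m torchelastic.distributed.launch as the ENTRYPOINT
FROM torchelastic/examples
# copy your training script (and anything it needs) into the image; the path is illustrative
COPY my_trainer.py /workspace/my_trainer.py
# RUN pip install -r requirements.txt   # extra Python dependencies, if any

Build and push the image to a registry your cluster can pull from, then set Worker.template.spec.containers.image to the pushed tag and point container.args at your script.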

Our examples save checkpoints and models inside the container, so the trained model and checkpoints are not accessible after the job completes. In your own scripts, use a persistent store such as AWS S3 or Azure Blob Storage.
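
For example, a job could copy its final checkpoint out of the container before exiting (a sketch; it assumes the AWS CLI and credentials are available in the container, and the local path and bucket name are placeholders):

aws s3 cp /workspace/checkpoint.pth s3://<your-bucket>/imagenet/checkpoint.pth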

Troubleshooting

Please check TROUBLESHOOTING.md.
