caffe2-operator

module
v0.0.0-...-3773f5b Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Dec 1, 2021 License: Apache-2.0

README

kubeflow/caffe2-operator is not maintained

This repository has been deprecated and archived on Nov 30th, 2021.

caffe2-operator

Build Status Go Report Card

Experimental repository for a caffe2 operator

Motivation

Caffe2 is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow.

For distributed training, Caffe2 has no parameter server compared with Tensorflow, so it has to use Redis/Gloo to find the other nodes to communicate.

Build

$ make
mkdir -p _output/bin
go build -o _output/bin/caffe2-operator ./cmd/caffe2-operator/
$ _output/bin/caffe2-operator --help

Custom Resource Definition

The custom resource submitted to the Kubernetes API would look something like this:

apiVersion: "kubeflow.org/v1alpha1"
kind: "Caffe2Job"
metadata:
  name: "example-job"
spec:
  backendSpecs:
      backendType: redis
      redisHost: redis-service
      redisPort: 6379
  replicaSpecs:
      replicas: 2
      template:
        spec:
          hostNetwork: true
          containers:
          - image: kubeflow/caffe2:py2-cuda9.0-cudnn7-ubuntu16.04
            name: caffe2
            resources:
              limits:
                nvidia.com/gpu: 2
            workingDir: /usr/local/caffe2/caffe2/python/examples/
            command: ["python", "resnet50_trainer.py"]

A full resnet50 trainer with redis example is here, and resnet50 trainer with nfs example is here.

This Caffe2Job resembles the existing TFJob for the tf-operator. The main differences being the omission of the parameter server replica type, and the addition of backend options.

backend Defines the distributed type the Caffe2 master and workers will use to communicate when initializing the worker group. Information on the different backends (and the functions they support) can be found here.

For redis backend, you need a working Redis server to serve for workers communication.

Resulting Worker

apiVersion: v1
kind: Pod
metadata:
  name: caffe2-worker-${job_id}
  labels:
      caffe2_job_key: default-example-job
      caffe2_replica_index: "0"
      caffe2_replica_type: worker
      group_name: kubeflow.org
      runtime_id: "1529307087"
spec:
  containers:
    image: kubeflow/caffe2:py2-cuda9.0-cudnn7-ubuntu16.04
    imagePullPolicy: IfNotPresent
    name: caffe2
    env:
      - name: SHARD_ID
        value: "0"
      - name: NUM_SHARDS
        value: "1"
      - name: RUN_ID
        value: "1529307087"
      - name: CAFFE2_CONFIG
        value: '{"cluster":{"worker":["default-example-job-worker-0:2222"]},"task":{"type":"worker","index":0}}'
      - name: REDIS_HOST
        value: "redis-service"
        value: "1529307087"
    ...

The worker spec generates a pod. They will communicate to the master through the redis's service name.

NOTE: There are three additional environments which are generated based on worker role, such as index for SHARD_ID, replicas for NUM_SHARDS and running ID for RUN_ID.

Design

This is an implementaion of the Caffe2 distributed design patterns, found here.

Other backends

Form here, Caffe2 also support NFS backend, however, we do not test the nfs backend now.

How to setup

Setup kubernetes

  • A full function kubernetes.
  • Open the features-gate if you want to use GPU

Create a CRD for kuberntes

# kubectl apply -f https://raw.githubusercontent.com/kubeflow/caffe2-operator/master/examples/crd.yaml
customresourcedefinition.apiextensions.k8s.io "caffe2jobs.kubeflow.org" created

Start the caffe2-operator

# ./caffe2-operator -alsologtostderr -v 4 -controller-config-file /root/admin.conf

Prepare the dataset

In the example, we use handwritten. You need to convert it to levelDB type by using make_mnist_db.

$ make_mnist_db --channel_first --db leveldb --image_file data/mnist/train-images-idx3-ubyte --label_file data/mnist/train-labels-idx1-ubyte --output_file data/mnist/mnist-train-nchw-leveldb 

$ make_mnist_db --channel_first --db leveldb --image_file data/mnist/t10k-images-idx3-ubyte --label_file data/mnist/t10k-labels-idx1-ubyte --output_file data/mnist/mnist-test-nchw-leveldb 

Run the job

$ kubectl apply -f ./examples/resnet50_nfs.yaml
$ kubectl get caffe2jobs
NAME          AGE
example-job   29m
$ kubectl get pods
NAME                READY     STATUS    RESTARTS   AGE
example-job-pdcbs   1/1       Running   0          29m
example-job-abcdc   1/1       Running   0          29m

In this example, we use hostNetwork = true, it is not the better solution, but it will train more quickly. Because the overlay network will reduce some performance.

Directories

Path Synopsis
cmd
pkg
apis/caffe2/v1alpha1
Package v1alpha1 is the v1alpha1 version of the API.
Package v1alpha1 is the v1alpha1 version of the API.
client/clientset/versioned
This package has the automatically generated clientset.
This package has the automatically generated clientset.
client/clientset/versioned/fake
This package has the automatically generated fake clientset.
This package has the automatically generated fake clientset.
client/clientset/versioned/scheme
This package contains the scheme of the automatically generated clientset.
This package contains the scheme of the automatically generated clientset.
client/clientset/versioned/typed/kubeflow/v1alpha1
This package has the automatically generated typed clients.
This package has the automatically generated typed clients.
client/clientset/versioned/typed/kubeflow/v1alpha1/fake
Package fake has the automatically generated clients.
Package fake has the automatically generated clients.
controller
Package controller provides a Kubernetes controller for a Caffe2Job resource.
Package controller provides a Kubernetes controller for a Caffe2Job resource.
util
Package util provides various helper routines.
Package util provides various helper routines.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL