
NVIDIA GPU Operator Version 2


TL;DR

# Deploy the Out-of-tree Operator
$ kubectl apply -k https://github.com/qbarrand/oot-operator/config/default

# Deploy the NVIDIA GPU Operator Version 2
$ git clone git@github.com:smgglrs/nvidia-gpu-operator.git && cd nvidia-gpu-operator
$ make deploy

# Create a sample DeviceConfig targeting GPU nodes.
#
# NOTE: the `driverImage` tag should be adjusted to the kernel version
# of the nodes selected
$ kubectl apply -f config/samples/gpu_v1alpha1_deviceconfig.yaml
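
# For reference, a DeviceConfig similar to the bundled sample could look like
# the sketch below. The apiVersion and kind come from the api/v1alpha1 package;
# the driverImage tag and the node selector field/label are illustrative
# assumptions, so check config/samples/gpu_v1alpha1_deviceconfig.yaml for the
# authoritative fields.
#
#   apiVersion: gpu.nvidia.com/v1alpha1
#   kind: DeviceConfig
#   metadata:
#     name: deviceconfig-sample
#     namespace: nvidia-gpu-operator
#   spec:
#     driverImage: <registry>/nvidia-driver:<kernel-version-of-selected-nodes>
#     nodeSelector:
#       nvidia.com/gpu.present: "true"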

# Wait until all NVIDIA components are healthy
$ kubectl get -n nvidia-gpu-operator all

# Run a sample GPU workload pod
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
  namespace: nvidia-gpu-operator
spec:
  restartPolicy: OnFailure
  containers:
  - image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.6.0-ubi8
    imagePullPolicy: IfNotPresent
    name: cuda-vectoradd
    resources:
      limits:
        nvidia.com/gpu: "1"
EOF

$ kubectl logs -n nvidia-gpu-operator pod/cuda-vectoradd
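# If the GPU stack is healthy, the vectoradd sample typically prints
# "Test PASSED" near the end of its log (exact output depends on the image).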

Overview

Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, Infiniband adapters and other devices through the device plugin framework.
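
For example, once a device plugin advertises GPUs on a node, they show up as an allocatable extended resource that workloads request through resource limits, as in the sample pod above. A quick way to confirm this (the node name is a placeholder):

$ kubectl get node <gpu-node> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'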

However, configuring and managing Kubernetes nodes with these hardware resources requires configuration of multiple software components, such as drivers, container runtimes, device plugins and other libraries, which is time consuming and error prone.

The NVIDIA GPU Operator Version 2 is used in tandem with the Out-of-tree Operator to automate the management of all the NVIDIA software components needed to provision GPUs within Kubernetes, from the drivers to the respective monitoring metrics.

Dependencies

Components

The components managed by the operator are:

Design

For a detailed description of the design trade-offs of the NVIDIA GPU Operator v2, check this doc.

Changelog
  • Removed the NVIDIA driver and device plugin management; this is now offloaded to the Out-of-tree Operator
  • Deprecated the hard dependency on Node Feature Discovery and made it optional
  • Added support for deploying different GPU configurations per node group by refactoring the NVIDIA GPU Operator v1 singleton ClusterPolicy CR (also renamed to DeviceConfig); see the sketch after this list
  • Added a GitHub Actions workflow that builds the operator, OLM bundle, and OLM catalog images for the main branch
  • Added the NVIDIA GPU Prometheus Exporter, which exports additional metrics such as gpu_info following the established Prometheus practices, so GPU information does not rely only on Kubernetes node labels (set via the NVIDIA GPU Feature Discovery)
  • Added 80%+ test coverage
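
To illustrate the per-node-group support above, here is a minimal sketch of two DeviceConfig objects, one per node group. It assumes the CR exposes a node selector alongside the driverImage field used in the TL;DR; the label keys/values and image tags are placeholders, and the authoritative schema lives in api/v1alpha1 and config/samples/gpu_v1alpha1_deviceconfig.yaml.

apiVersion: gpu.nvidia.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-nodes-a
  namespace: nvidia-gpu-operator
spec:
  # hypothetical selector: whatever label identifies this node group
  nodeSelector:
    node-group: gpu-a
  driverImage: <registry>/nvidia-driver:<kernel-version-of-group-a-nodes>
---
apiVersion: gpu.nvidia.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-nodes-b
  namespace: nvidia-gpu-operator
spec:
  nodeSelector:
    node-group: gpu-b
  driverImage: <registry>/nvidia-driver:<kernel-version-of-group-b-nodes>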

OpenShift

Use the NVIDIA GPU Operator Version 2, along with the Out-of-tree Operator, in your OpenShift cluster to automatically provision and manage different GPU configurations per node group.

The following guide leverages the automatically generated container images of:

# Given an OpenShift cluster with GPU powered nodes
$ kubectl get clusterversions.config.openshift.io
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.11   True        False         30h     Cluster version is 4.10.11

# Deploy the Out-of-tree Operator
$ kubectl apply -k https://github.com/qbarrand/oot-operator/config/default

# Deploy the NVIDIA GPU Operator v2 via OLM
$ kubectl apply -f https://github.com/smgglrs/nvidia-gpu-operator/hack/openshift/deploy.yaml

# Create a sample DeviceConfig targeting GPU nodes.
#
# NOTE: the `driverImage` tag should be adjusted to the kernel version
# of the selected nodes
$ kubectl apply -f https://github.com/smgglrs/nvidia-gpu-operator/hack/openshift/deviceconfig.yaml

# Wait until all NVIDIA GPU components are healthy
$ kubectl get -n nvidia-gpu-operator all

# Verify the setup by running a sample GPU workload pod
$ kubectl apply -f https://github.com/smgglrs/nvidia-gpu-operator/hack/openshift/sample-gpu-workload.yaml

# Check the GPU workload logs
$ kubectl logs -n nvidia-gpu-operator pod/cuda-vectoradd

Directories

Path: api/v1alpha1
Synopsis: Package v1alpha1 contains API Schema definitions for the gpu v1alpha1 API group. +kubebuilder:object:generate=true +groupName=gpu.nvidia.com
