pulumi-nvidia-aicr

module

Version: v0.2.0 (latest). Published: Jun 10, 2026. License: Apache-2.0.

Repository

github.com/pulumi-labs/pulumi-nvidia-aicr

README

Pulumi NVIDIA AICR Provider

Deploy validated NVIDIA AI Cluster Runtime (AICR) configurations on Kubernetes clusters using Pulumi. Define your GPU infrastructure and software stack in a single program with full lifecycle management.

Overview

NVIDIA AICR captures validated combinations of GPU drivers, operators, and system configurations as reproducible recipes for GPU-accelerated Kubernetes clusters. This Pulumi provider brings AICR into the Infrastructure as Code ecosystem, enabling:

Single program: Define both the cloud infrastructure (EKS, GKE, AKS) and the GPU software stack together
Validated recipes: Deploy known-good component combinations, not ad-hoc configs
Full lifecycle: Preview, deploy, update, and destroy with standard Pulumi workflows
Multi-language: Use TypeScript, Python, Go, C#, Java, or YAML

Quick Start

Python

import pulumi
import pulumi_nvidia_aicr as aicr

# Deploy the NVIDIA AICR-validated GPU software stack.
# Uses your ambient kubeconfig (~/.kube/config or KUBECONFIG).
gpu_stack = aicr.ClusterStack("nvidia-aicr",
    accelerator="h100",
    service="eks",
    intent="training",
    platform="kubeflow",
)

pulumi.export("recipe_name", gpu_stack.recipe_name)
pulumi.export("components", gpu_stack.deployed_components)

TypeScript

import * as eks from "@pulumi/eks";
import * as aicr from "@pulumi/nvidia-aicr";

// Create an EKS cluster with H100 GPU nodes
const cluster = new eks.Cluster("gpu-cluster", {
    instanceType: "p5.48xlarge",
    desiredCapacity: 2,
});

// Deploy ~11 validated Helm charts: GPU Operator, Kubeflow Trainer,
// KAI Scheduler, Prometheus, cert-manager, and more.
const gpuStack = new aicr.ClusterStack("nvidia-aicr", {
    kubeconfig: cluster.kubeconfigJson,
    accelerator: "h100",
    service: "eks",
    intent: "training",
    platform: "kubeflow",
});

export const recipeName = gpuStack.recipeName;
export const components = gpuStack.deployedComponents;

Resource: ClusterStack

The ClusterStack component resource deploys a complete AICR-validated GPU software stack on a Kubernetes cluster.

Inputs

Property	Type	Required	Description
`accelerator`	`string`	Yes	GPU type: `"h100"`, `"gb200"`, `"b200"`
`service`	`string`	Yes	Kubernetes service: `"aks"`, `"eks"`, `"gke"`, `"kind"`, `"oke"`
`intent`	`string`	Yes	Workload type: `"training"`, `"inference"`
`os`	`string`	No	OS: `"ubuntu"` (default), `"cos"` (GKE only)
`platform`	`string`	No	ML platform: `"kubeflow"` (training), `"dynamo"` (inference), `"nim"` (inference, EKS+H100 only). Leave unset for the base recipe without a platform-specific runtime. `intent: "inference"` always includes the kgateway inference gateway as part of the base inference stack; choosing a platform layers a runtime on top.
`kubeconfig`	`Input<string>`	No	Kubeconfig contents (accepts outputs from cluster resources)
`kubeconfigPath`	`string`	No	Path to kubeconfig file
`context`	`string`	No	Kubeconfig context
`componentOverrides`	`map`	No	Per-component Helm value overrides
`skipComponents`	`string[]`	No	Components to exclude from deployment
`skipAwait`	`bool`	No	Skip waiting for Helm releases (default: false)

If neither `kubeconfig` nor `kubeconfigPath` is set, the ambient kubeconfig (`~/.kube/config` or the `KUBECONFIG` environment variable) is used.
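The fallback described above can be pictured as a small precedence chain. The sketch below is illustrative only: `resolve_kubeconfig` is a hypothetical helper, not part of the provider, and the relative order of `kubeconfig` versus `kubeconfigPath` when both are set is an assumption (the README only states what happens when neither is set).

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def resolve_kubeconfig(
    kubeconfig: Optional[str] = None,
    kubeconfig_path: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (source, value) describing which kubeconfig would be used.

    Assumed precedence (not confirmed by the provider): inline contents,
    then an explicit path, then KUBECONFIG, then ~/.kube/config.
    """
    env = os.environ if env is None else env
    if kubeconfig:
        return ("inline", kubeconfig)
    if kubeconfig_path:
        return ("path", kubeconfig_path)
    if env.get("KUBECONFIG"):
        # KUBECONFIG may list several files separated by os.pathsep;
        # this sketch passes the raw value through unchanged.
        return ("env", env["KUBECONFIG"])
    return ("default", str(Path.home() / ".kube" / "config"))


# With no explicit inputs and no KUBECONFIG, the default path is chosen.
source, value = resolve_kubeconfig(env={})
```

Passing `env` explicitly keeps the helper easy to exercise without touching the real process environment.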

Outputs

Property	Type	Description
`recipeName`	`string`	Resolved recipe identifier
`recipeVersion`	`string`	AICR recipe version
`deployedComponents`	`string[]`	Names of deployed components
`componentCount`	`int`	Number of deployed components

What Gets Deployed

A typical training recipe (H100 + EKS + Kubeflow) deploys these validated components:

Component	Purpose
cert-manager	TLS certificate management
gpu-operator	NVIDIA GPU drivers, device plugin, DCGM
nvsentinel	GPU node fault detection and remediation
skyhook-operator	Node-level host configuration and tuning
kube-prometheus-stack	Monitoring with GPU metrics (Prometheus + Grafana)
k8s-ephemeral-storage-metrics	Storage monitoring
nvidia-dra-driver-gpu	Dynamic Resource Allocation for GPUs
kai-scheduler	GPU-aware workload scheduling
aws-ebs-csi-driver	EKS: Persistent volume provisioning
aws-efa	EKS: Elastic Fabric Adapter for RDMA networking
kubeflow-trainer	Distributed training with TrainJob

Customization

Component Overrides

Customize specific components while keeping the validated recipe baseline:

TypeScript

const gpuStack = new aicr.ClusterStack("aicr", {
    kubeconfig: cluster.kubeconfigJson,
    accelerator: "h100",
    service: "eks",
    intent: "training",
    componentOverrides: {
        "gpu-operator": {
            version: "v25.11.0",
            values: {
                driver: { version: "535.129.03" },
            },
        },
    },
});

Python

gpu_stack = aicr.ClusterStack("aicr",
    kubeconfig=cluster.kubeconfig_json,
    accelerator="h100",
    service="eks",
    intent="training",
    component_overrides={
        "gpu-operator": aicr.ComponentOverrideArgs(
            version="v25.11.0",
            values={
                "driver": {"version": "535.129.03"},
            },
        ),
    },
)

Go

gpuStack, err := aicr.NewClusterStack(ctx, "aicr", &aicr.ClusterStackArgs{
    Kubeconfig:  cluster.KubeconfigJson,
    Accelerator: "h100",
    Service:     "eks",
    Intent:      "training",
    ComponentOverrides: aicr.ComponentOverrideMap{
        "gpu-operator": aicr.ComponentOverrideArgs{
            Version: pulumi.StringPtr("v25.11.0"),
            Values: pulumi.Map{
                "driver": pulumi.Map{"version": pulumi.String("535.129.03")},
            },
        },
    },
})
if err != nil {
    return err
}

C#

var gpuStack = new ClusterStack("aicr", new ClusterStackArgs
{
    Kubeconfig = cluster.KubeconfigJson,
    Accelerator = "h100",
    Service = "eks",
    Intent = "training",
    ComponentOverrides =
    {
        ["gpu-operator"] = new ComponentOverrideArgs
        {
            Version = "v25.11.0",
            Values = { ["driver"] = new InputMap<object> { ["version"] = "535.129.03" } },
        },
    },
});
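To make "keeping the validated recipe baseline" concrete, here is a minimal sketch of how an override's `values` could be layered over a component's baseline Helm values. It assumes Helm-style semantics (nested maps merge key by key, everything else is replaced); whether the provider merges exactly this way is not stated in this README. The `BASELINE` values are invented for illustration and do not come from any real recipe.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_values(baseline: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Deep-merge override into a copy of baseline (maps merge, other types replace)."""
    merged = deepcopy(baseline)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are maps: recurse so sibling keys in the baseline survive.
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical baseline values; a real recipe's values will differ.
BASELINE = {
    "driver": {"version": "0.0.0-baseline", "enabled": True},
    "toolkit": {"enabled": True},
}

# The override from the examples above: only the driver version changes.
OVERRIDE = {"driver": {"version": "535.129.03"}}

merged = merge_values(BASELINE, OVERRIDE)
```

Under these assumptions only `driver.version` changes; `driver.enabled` and the `toolkit` block keep their baseline values, and `BASELINE` itself is left unmodified.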

Skipping Components

Exclude components that are already installed or not needed:

TypeScript

const stack = new aicr.ClusterStack("aicr", {
    accelerator: "h100",
    service: "eks",
    intent: "inference",
    platform: "dynamo",
    skipComponents: ["cert-manager", "kube-prometheus-stack"],
});

Python

stack = aicr.ClusterStack("aicr",
    accelerator="h100",
    service="eks",
    intent="inference",
    platform="dynamo",
    skip_components=["cert-manager", "kube-prometheus-stack"],
)

Go

stack, err := aicr.NewClusterStack(ctx, "aicr", &aicr.ClusterStackArgs{
    Accelerator: "h100",
    Service:     "eks",
    Intent:      "inference",
    Platform:    pulumi.StringPtr("dynamo"),
    SkipComponents: pulumi.StringArray{
        pulumi.String("cert-manager"),
        pulumi.String("kube-prometheus-stack"),
    },
})
if err != nil {
    return err
}

C#

var stack = new ClusterStack("aicr", new ClusterStackArgs
{
    Accelerator = "h100",
    Service = "eks",
    Intent = "inference",
    Platform = "dynamo",
    SkipComponents = { "cert-manager", "kube-prometheus-stack" },
});
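As a rough illustration of the effect of `skipComponents`, the sketch below filters the training component list from "What Gets Deployed" by name. It assumes skipping is a plain name match; `filter_components` is a hypothetical helper, and reporting unknown names is a convenience of this sketch for catching typos, not documented provider behavior.

```python
from typing import Iterable, List, Tuple

# Component names copied from the H100 + EKS + Kubeflow table in this README.
TRAINING_RECIPE = [
    "cert-manager",
    "gpu-operator",
    "nvsentinel",
    "skyhook-operator",
    "kube-prometheus-stack",
    "k8s-ephemeral-storage-metrics",
    "nvidia-dra-driver-gpu",
    "kai-scheduler",
    "aws-ebs-csi-driver",
    "aws-efa",
    "kubeflow-trainer",
]


def filter_components(recipe: List[str], skip: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (remaining components in recipe order, skip names not found in the recipe)."""
    skip_set = set(skip)
    remaining = [name for name in recipe if name not in skip_set]
    unknown = sorted(skip_set - set(recipe))
    return remaining, unknown


remaining, unknown = filter_components(
    TRAINING_RECIPE, ["cert-manager", "kube-prometheus-stack"]
)
```

Running a check like this before `pulumi up` gives a quick view of what would still be installed and flags misspelled component names.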

Examples

Full working examples for every supported cloud and scenario. See examples/ for prerequisites, cost estimates, and detailed instructions.

Training — PyTorch distributed training with Kubeflow Trainer:

Cloud	TypeScript	Python	Go	C#	Java
AWS EKS (H100)	ts	py	go	cs	java
Azure AKS (H100)	ts	py	go	cs	java
GCP GKE (H100)	ts	py	go	cs	java
OCI OKE (GB200)	ts	py	go	cs	java

Inference — vLLM model serving with NIM / Dynamo:

Cloud	TypeScript	Python	Go	C#	Java
AWS EKS + NIM (H100)	ts	py	go	cs	java
CoreWeave + Dynamo (H100)	ts	py	go	cs	java

Getting started:

Scenario	TypeScript	Python	Go	C#	Java	YAML
Existing cluster (quickstart)	ts	py	go	cs	java	yaml
Kind local dev (no GPUs)	ts	py	go	cs	java	yaml

Supported Configurations

Validated recipe overlays shipped by upstream AICR:

Accelerator	Services	Intents	Platforms
H100	EKS, GKE, AKS, Kind	Training, Inference	Kubeflow, Dynamo, NIM (EKS only)
GB200	EKS, OKE	Training, Inference	Kubeflow, Dynamo
B200	Any	Training	--

The kind service overlay targets local development with kind clusters -- useful for exercising the deployment pipeline without provisioning real GPU hardware.
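The table above can be encoded as a small pre-flight check that runs before any resources are created. This is a sketch transcribed from the tables in this README, not the provider's own validation: `check_combination` is a hypothetical helper, and the platform-to-intent pairing is taken from the `platform` row of the Inputs table.

```python
from typing import List, Optional

# Transcribed from the Supported Configurations table; None means "any service".
SUPPORTED = {
    "h100": {
        "services": {"eks", "gke", "aks", "kind"},
        "intents": {"training", "inference"},
        "platforms": {"kubeflow", "dynamo", "nim"},
    },
    "gb200": {
        "services": {"eks", "oke"},
        "intents": {"training", "inference"},
        "platforms": {"kubeflow", "dynamo"},
    },
    "b200": {"services": None, "intents": {"training"}, "platforms": set()},
}

# From the Inputs table: the intent each platform is paired with.
PLATFORM_INTENT = {"kubeflow": "training", "dynamo": "inference", "nim": "inference"}


def check_combination(
    accelerator: str, service: str, intent: str, platform: Optional[str] = None
) -> List[str]:
    """Return a list of problems; an empty list means the combination is in the table."""
    row = SUPPORTED.get(accelerator)
    if row is None:
        return [f"unknown accelerator {accelerator!r}"]
    problems = []
    if row["services"] is not None and service not in row["services"]:
        problems.append(f"{accelerator} is not listed for service {service!r}")
    if intent not in row["intents"]:
        problems.append(f"{accelerator} is not listed for intent {intent!r}")
    if platform is not None:
        if platform not in row["platforms"]:
            problems.append(f"{accelerator} is not listed for platform {platform!r}")
        elif PLATFORM_INTENT[platform] != intent:
            problems.append(f"platform {platform!r} is paired with intent {PLATFORM_INTENT[platform]!r}")
        if platform == "nim" and service != "eks":
            problems.append("platform 'nim' is listed for EKS only")
    return problems


# The Quick Start combination passes; NIM on GKE does not.
ok = check_combination("h100", "eks", "training", "kubeflow")
bad = check_combination("h100", "gke", "inference", "nim")
```

Returning a list of messages instead of raising lets a caller report every mismatch at once.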

Development

# Build provider
make provider

# Run tests
make test

# Generate schema
make schema

# Generate SDKs
make nodejs_sdk python_sdk go_sdk

AICR Version Compatibility

This provider embeds AICR recipe data. The provider version tracks the AICR version:

Provider Version	AICR Version
0.1.x	main (development)

License

Apache 2.0 -- see LICENSE for details.

Directories

Path	Synopsis
provider
cmd/pulumi-resource-nvidia-aicr command
pkg/provider
pkg/recipe	Package recipe implements the NVIDIA AICR recipe resolution engine.
pkg/recipes
pkg/version
sdk
go/nvidiaaicr module
