What is NodePulse?
High-performance AI training jobs silently degrade when nodes have misconfigured networking, suboptimal hardware placement, or GPU communication libraries that cannot initialize correctly. By the time you notice slow throughput, you have already wasted expensive GPU-hours.
NodePulse is a continuous Kubernetes DaemonSet that runs a dedicated validation agent on every GPU node and exposes a real-time web dashboard for fleet-wide infrastructure visibility.
It continuously validates four critical infrastructure dimensions that are the most common root causes of degraded distributed training performance:
| Module | What it checks | Failure indicator |
| --- | --- | --- |
| RDMA/EFA Fabric | ibv_devinfo port states, fi_info EFA providers | Any port DOWN or below minimum device count |
| MTU Consistency | Per-interface MTU via sysfs; end-to-end DF-bit ICMP sweep | Any interface < 9000 bytes (Jumbo Frames required) |
| GPU-NIC Affinity | nvidia-smi topo -m PCIe/NUMA topology matrix | Any GPU↔NIC pair below ACCEPTABLE affinity class |
| UCCL Compatibility | GPU vendor, EFA/RDMA detection, ROCm version, NIC driver type, per-API readiness | Missing GPU, unsupported platform, or blocking condition for UCCL-Collective/P2P/EP |
Unlike a pre-flight job that runs once before training, NodePulse keeps watching. It detects hardware faults, driver resets, and configuration drift within one check interval (60 seconds by default), before they silently degrade your next training run.
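As an illustration of the MTU Consistency row, the per-interface check can be sketched in Go as follows. This is a minimal sketch, not the module's actual code: the function names are hypothetical, and it does not filter out virtual interfaces the way a check limited to physical interfaces would.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strconv"
	"strings"
)

// parseMTU converts the contents of a sysfs mtu file (for example "9001\n") to an int.
func parseMTU(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

// belowMinimum returns the sorted names of interfaces whose MTU is under minMTU.
func belowMinimum(mtus map[string]int, minMTU int) []string {
	var bad []string
	for name, mtu := range mtus {
		if mtu < minMTU {
			bad = append(bad, name)
		}
	}
	sort.Strings(bad)
	return bad
}

// readMTUs reads <root>/<iface>/mtu for every interface under root.
// On a Linux node root is normally /sys/class/net.
func readMTUs(root string) (map[string]int, error) {
	paths, err := filepath.Glob(filepath.Join(root, "*", "mtu"))
	if err != nil {
		return nil, err
	}
	out := make(map[string]int)
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			continue // interface vanished or is unreadable; skip it
		}
		mtu, err := parseMTU(string(raw))
		if err != nil {
			continue // unexpected file contents; skip it
		}
		out[filepath.Base(filepath.Dir(p))] = mtu
	}
	return out, nil
}

func main() {
	mtus, err := readMTUs("/sys/class/net")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("interfaces below 9000:", belowMinimum(mtus, 9000))
}
```

Keeping the parsing and the threshold comparison separate from the sysfs read makes both easy to unit-test without real hardware.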
UCCL Integration
NodePulse integrates GPU node compatibility checks derived from the UCCL project — an efficient GPU communication library that is a drop-in replacement for NCCL/RCCL with up to 2.5× performance improvement.
The UCCL module (internal/modules/uccl) performs:
| Check | Method | Details |
| --- | --- | --- |
| GPU vendor detection | nvidia-smi / amd-smi / rocm-smi | Identifies NVIDIA (CUDA cap) or AMD (GFX arch) |
| ARM64 blocking check | runtime.GOARCH | ARM64 + AMD GPU blocks ROCm per UCCL build.sh |
| EFA NIC detection | /sys/class/infiniband/rdmap* | Mirrors uccl/__init__.py has_efa() exactly |
| RDMA port states | ibstat rdma0..rdma7 | Counts State: Active ports per UCCL slurm_monitor.sh |
| NIC driver type | /sys/class/net/*/device/msi_irqs/ | Detects RDMA vs VirtIO/ENA (AF-XDP path) |
| ROCm version | /opt/rocm/.info/version or rocm-smi --version | EP requires ROCm 7+; 6.x is blocked |
| GPU idle state | GPU utilization % | < 5% = idle per UCCL scheduling threshold |
| Required env vars | UCCL_SOCKET_IFNAME, UCCL_IB_GID_INDEX | WARN if missing (UCCL-EP will hang without these) |
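Two of the checks above are simple enough to sketch directly. The Go snippet below is an illustration under stated assumptions, not the module's implementation: the function names are hypothetical, and it assumes the ROCm version string starts with a dotted number such as "6.2.0-66".

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
	"strings"
)

// hasEFA reports whether any rdmap* device exists under root.
// On a real node root would be /sys/class/infiniband.
func hasEFA(root string) bool {
	matches, err := filepath.Glob(filepath.Join(root, "rdmap*"))
	return err == nil && len(matches) > 0
}

// rocmMajor extracts the major version from a string such as "6.2.0-66".
// The exact file format is an assumption of this sketch.
func rocmMajor(version string) (int, error) {
	v := strings.TrimSpace(version)
	major, _, _ := strings.Cut(v, ".")
	n, err := strconv.Atoi(major)
	if err != nil {
		return 0, fmt.Errorf("unrecognized ROCm version %q: %w", v, err)
	}
	return n, nil
}

// epAllowed applies the rule from the table: UCCL-EP needs ROCm 7 or newer.
func epAllowed(version string) bool {
	n, err := rocmMajor(version)
	return err == nil && n >= 7
}

func main() {
	fmt.Println("EFA present:", hasEFA("/sys/class/infiniband"))
	for _, v := range []string{"6.2.0-66", "7.0.1"} {
		fmt.Printf("ROCm %s allows EP: %v\n", v, epAllowed(v))
	}
}
```

An unparsable version is treated as blocking here, which errs on the side of reporting a problem rather than hiding one.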
Per-API readiness flags:
| UCCL API | Required | Condition |
| --- | --- | --- |
| collective_rdma | GPU + RDMA NIC (IB/RoCE) | NVIDIA CX or Broadcom Thor |
| collective_efa | GPU + EFA NIC | AWS p4d/p5/p5e/p5en/p6 |
| collective_afxdp | GPU + ENA/VirtIO NIC | AWS ENA, IBM VirtIO (AF-XDP path) |
| p2p | GPU + RDMA NIC | NVIDIA or AMD |
| ep | GPU + RDMA/EFA, ROCm 7+ for AMD, no ARM64+AMD | Expert-parallel (DeepEP-compatible) |
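The readiness rules in this table reduce to a few boolean expressions. The Go sketch below shows one way to derive the flags from already-detected facts; the NodeFacts type and its field names are hypothetical, and the rules follow only the "Required" column above.

```go
package main

import "fmt"

// NodeFacts holds the detection results the readiness rules depend on.
// The type and its field names are hypothetical.
type NodeFacts struct {
	GPUVendor  string // "nvidia", "amd", or "" when no GPU was found
	Arch       string // value of runtime.GOARCH, e.g. "amd64" or "arm64"
	HasRDMA    bool   // IB/RoCE NIC present
	HasEFA     bool   // EFA NIC present
	HasENAVirt bool   // ENA or VirtIO NIC present (AF-XDP path)
	ROCmMajor  int    // 0 when ROCm is not installed
}

// readiness maps each UCCL API name to whether the node satisfies its requirements.
func readiness(f NodeFacts) map[string]bool {
	gpu := f.GPUVendor != ""
	amd := f.GPUVendor == "amd"
	// EP is blocked for AMD GPUs on ARM64 or with ROCm older than 7.
	epBlocked := amd && (f.Arch == "arm64" || f.ROCmMajor < 7)
	return map[string]bool{
		"collective_rdma":  gpu && f.HasRDMA,
		"collective_efa":   gpu && f.HasEFA,
		"collective_afxdp": gpu && f.HasENAVirt,
		"p2p":              gpu && f.HasRDMA,
		"ep":               gpu && (f.HasRDMA || f.HasEFA) && !epBlocked,
	}
}

func main() {
	node := NodeFacts{GPUVendor: "amd", Arch: "amd64", HasRDMA: true, ROCmMajor: 6}
	flags := readiness(node)
	for _, api := range []string{"collective_rdma", "collective_efa", "collective_afxdp", "p2p", "ep"} {
		fmt.Printf("%-17s %v\n", api, flags[api])
	}
}
```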
Architecture
┌──────────────────────────── Kubernetes Cluster ──────────────────────────────┐
│                                                                              │
│     GPU Node A                GPU Node B                GPU Node C           │
│   ┌─────────────┐           ┌─────────────┐           ┌─────────────┐        │
│   │ nodepulse   │           │ nodepulse   │           │ nodepulse   │        │
│   │ -agent      │           │ -agent      │           │ -agent      │        │
│   │ DaemonSet   │           │ DaemonSet   │           │ DaemonSet   │        │
│   │             │           │             │           │             │        │
│   │ :9100/      │           │ :9100/      │           │ :9100/      │        │
│   │  metrics    │           │  metrics    │           │  metrics    │        │
│   │  api/status │           │  api/status │           │  api/status │        │
│   └──────┬──────┘           └──────┬──────┘           └──────┬──────┘        │
│          │                         │                         │               │
│          └─────────────────────────┼─────────────────────────┘               │
│                                    │ poll every 30s                          │
│                       ┌────────────▼────────────┐                            │
│                       │      nodepulse-ui       │                            │
│                       │    Deployment :8080     │                            │
│                       │                         │                            │
│                       │     Fleet Dashboard     │                            │
│                       │ ● RDMA status per node  │                            │
│                       │ ● MTU values per iface  │                            │
│                       │ ● GPU-NIC topology map  │                            │
│                       │ ● UCCL readiness flags  │                            │
│                       └─────────────────────────┘                            │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
Each agent runs all four checks on a configurable interval (default: 60 seconds), publishes results as Prometheus metrics on :9100/metrics, and serves a JSON status snapshot on :9100/api/status. The UI service aggregates all agents and renders a live fleet health view.
For a full deep-dive, see docs/architecture.md.
Quick Start
1 — Label your GPU nodes
kubectl label node <gpu-node-name> nodepulse/gpu-node=true
# Label all nodes with GPUs at once
kubectl get nodes -o json | \
jq -r '.items[] | select(.status.capacity["nvidia.com/gpu"]) | .metadata.name' | \
xargs -I{} kubectl label node {} nodepulse/gpu-node=true
2 — Deploy NodePulse
# Apply RBAC (namespace + service accounts)
kubectl apply -f deploy/rbac.yaml
# Deploy the DaemonSet agent on all GPU nodes
kubectl apply -f deploy/daemonset.yaml
# Deploy the Web UI
kubectl apply -f deploy/ui-deployment.yaml
Or use the Makefile shortcut:
make deploy
3 — Open the Dashboard
# Port-forward the UI to your local machine
kubectl port-forward -n nodepulse svc/nodepulse-ui 8080:80
# Open in your browser
open http://localhost:8080
4 — Watch the agents
# Stream agent logs from all GPU nodes
make logs-agent
# Check deployment status
make status
5 — Run the pre-flight validator (optional)
# Label the target node and run a one-shot validation Job
kubectl label node <gpu-node-name> nodepulse/target-node=true
kubectl apply -f gpu-validator.yaml
kubectl wait --for=condition=complete job/nodepulse-validator -n nodepulse --timeout=120s
kubectl logs -n nodepulse -l app.kubernetes.io/component=validator
Installation
See the full Installation Guide → for:
- Detailed kubectl steps and expected output
- Optional DeepFlow credentials setup
- Configuring agent endpoint discovery for the UI
- Security hardening with OPA/Kyverno
Uninstallation
See the Uninstall Guide →.
Configuration Reference
All runtime configuration is injected via environment variables; no config file is baked into the binary. Hardware baselines are the one exception: they are read as YAML from NODEPULSE_BASELINE_DIR.
Agent (nodepulse-agent)
| Variable | Default | Description |
| --- | --- | --- |
| NODE_NAME | (K8s downward API) | Node name, injected automatically by Kubernetes |
| NODEPULSE_AGENT_ADDR | :9100 | Agent HTTP server bind address |
| NODEPULSE_CHECK_INTERVAL_SECONDS | 60 | How often to run all validation checks |
| NODEPULSE_MIN_RDMA_DEVICES | 1 | Minimum number of active RDMA devices required |
| NODEPULSE_RDMA_LATENCY_US | 5.0 | Maximum acceptable RDMA round-trip latency (µs) |
| NODEPULSE_MIN_MTU | 9000 | Minimum MTU on all physical interfaces (bytes) |
| NODEPULSE_PEER_NODE_IP | (empty) | Peer IP for end-to-end MTU sweep; omit to skip |
| NODEPULSE_MIN_AFFINITY_CLASS | 4 | Minimum GPU↔NIC affinity class (1=Bottleneck … 5=Optimal) |
| NODEPULSE_BASELINE_DIR | /etc/nodepulse/baselines | Directory containing YAML hardware baselines |
| NODEPULSE_DEEPFLOW_ENDPOINT | (empty) | DeepFlow ADE base URL (optional) |
| NODEPULSE_DEEPFLOW_API_KEY | (empty) | Bearer token for DeepFlow API (optional) |
UCCL env vars (checked by UCCL module, WARN if missing)
| Variable | Description |
| --- | --- |
| UCCL_SOCKET_IFNAME | Control NIC for UCCL-EP bootstrapping (must match torchrun --master_addr interface) |
| UCCL_IB_GID_INDEX | RDMA GID index; must match NCCL_IB_GID_INDEX |
| UCCL_IB_HCA | IB/RoCE device names (auto-detected if unset) |
| UCCL_IB_MAX_INFLIGHT_BYTES | Recommended for Broadcom Thor-2 NICs |
UI (nodepulse-ui)
| Variable | Default | Description |
| --- | --- | --- |
| NODEPULSE_UI_ADDR | :8080 | UI HTTP server bind address |
| NODEPULSE_AGENT_ENDPOINTS | (empty) | Comma-separated list of agent pod URLs |
| NODEPULSE_POLL_INTERVAL_SECONDS | 30 | How often the UI polls agent endpoints |
Hardware Baselines
NodePulse ships with pre-tuned baselines for A100, H100, H200, and a conservative default. Override by editing the nodepulse-baselines ConfigMap in deploy/daemonset.yaml:
# A100.yaml — included in the ConfigMap
hardware_gen: A100
min_rdma_devices: 4
max_rdma_latency_us: 5.0
min_bandwidth_gbps: 200
min_mtu: 9000
min_affinity_class: 4
min_dma_throughput_gbps: 50
max_latency_ns: 1000
max_error_count: 0
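To show how such a baseline might be applied, the Go sketch below compares observed values against a subset of the baseline fields. The Baseline struct tags follow the YAML keys above; the Observed type, the violations function, and the sample numbers in main are hypothetical, and YAML parsing itself is left out.

```go
package main

import "fmt"

// Baseline holds a subset of the YAML fields shown above.
type Baseline struct {
	HardwareGen      string  `yaml:"hardware_gen"`
	MinRDMADevices   int     `yaml:"min_rdma_devices"`
	MaxRDMALatencyUS float64 `yaml:"max_rdma_latency_us"`
	MinMTU           int     `yaml:"min_mtu"`
	MinAffinityClass int     `yaml:"min_affinity_class"`
}

// Observed is a hypothetical summary of one check cycle.
type Observed struct {
	RDMADevices    int
	RDMALatencyUS  float64
	LowestMTU      int
	LowestAffinity int
}

// violations lists every baseline limit the observed values fail to meet.
func violations(b Baseline, o Observed) []string {
	var out []string
	if o.RDMADevices < b.MinRDMADevices {
		out = append(out, fmt.Sprintf("rdma devices %d < %d", o.RDMADevices, b.MinRDMADevices))
	}
	if o.RDMALatencyUS > b.MaxRDMALatencyUS {
		out = append(out, fmt.Sprintf("rdma latency %.1fus > %.1fus", o.RDMALatencyUS, b.MaxRDMALatencyUS))
	}
	if o.LowestMTU < b.MinMTU {
		out = append(out, fmt.Sprintf("mtu %d < %d", o.LowestMTU, b.MinMTU))
	}
	if o.LowestAffinity < b.MinAffinityClass {
		out = append(out, fmt.Sprintf("affinity class %d < %d", o.LowestAffinity, b.MinAffinityClass))
	}
	return out
}

func main() {
	a100 := Baseline{HardwareGen: "A100", MinRDMADevices: 4, MaxRDMALatencyUS: 5.0, MinMTU: 9000, MinAffinityClass: 4}
	seen := Observed{RDMADevices: 4, RDMALatencyUS: 3.2, LowestMTU: 1500, LowestAffinity: 4} // made-up sample
	for _, v := range violations(a100, seen) {
		fmt.Println("FAIL:", v)
	}
}
```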
API Reference
Agent endpoints (:9100)
| Endpoint | Description |
| --- | --- |
| GET /metrics | Prometheus text format; all validation metrics |
| GET /api/status | JSON snapshot of the most recent check cycle |
| GET /healthz | Liveness probe; always returns 200 OK |
UI endpoints (:8080)
| Endpoint | Description |
| --- | --- |
| GET / | Web dashboard (fleet-wide health view) |
| GET /api/fleet | Aggregated JSON from all agents |
| GET /healthz | Liveness probe |
Prometheus metrics
| Metric | Labels | Description |
| --- | --- | --- |
| nodepulse_check_status | node, module | 1=PASS, 0=FAIL, -1=WARN |
| nodepulse_rdma_devices_found | node | RDMA device count |
| nodepulse_rdma_device_active | node, hca_id | 1 if PORT_ACTIVE |
| nodepulse_efa_providers_found | node | EFA provider count |
| nodepulse_interface_mtu_bytes | node, interface | Per-interface MTU |
| nodepulse_mtu_sweep_achieved_bytes | node | End-to-end path MTU |
| nodepulse_gpu_nic_affinity_class | node, gpu, nic | Affinity score 1-5 |
| nodepulse_gpu_nic_affinity_ok | node, gpu, nic | 1 if meets threshold |
| nodepulse_cross_numa_pairs_total | node | Count of cross-NUMA pairs |
| nodepulse_hardware_gen_info | node, gen | GPU generation label |
| nodepulse_uccl_gpu_count | node, vendor | GPU count per vendor |
| nodepulse_uccl_active_rdma_ports | node | Active RDMA ports (ibstat) |
| nodepulse_uccl_api_ready | node, api | 1 if ready for UCCL API |
| nodepulse_uccl_efa_detected | node | 1 if EFA NIC present |
| nodepulse_uccl_blocking_conditions_total | node | Count of blocking conditions |
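For readers writing alert rules or parsers, the Go sketch below shows what a nodepulse_check_status sample looks like in the Prometheus text format, using the 1/0/-1 encoding from the table. It illustrates the wire format only and is not how the agent's metrics registry works; the module label value "mtu" is an assumed example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labelEscaper escapes backslash, double quote, and newline in label values,
// as the Prometheus text exposition format requires.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// statusValue maps a check status to the metric value documented above.
func statusValue(status string) (int, bool) {
	switch status {
	case "PASS":
		return 1, true
	case "FAIL":
		return 0, true
	case "WARN":
		return -1, true
	}
	return 0, false
}

// sampleLine renders one sample with its labels sorted by name.
func sampleLine(name string, labels map[string]string, value int) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf(`%s="%s"`, k, labelEscaper.Replace(labels[k])))
	}
	return fmt.Sprintf("%s{%s} %d", name, strings.Join(pairs, ","), value)
}

func main() {
	v, _ := statusValue("WARN")
	labels := map[string]string{"node": "gpu-node-42", "module": "mtu"}
	fmt.Println(sampleLine("nodepulse_check_status", labels, v))
}
```

Because WARN is encoded as -1, an alert on `nodepulse_check_status < 1` would fire for both WARN and FAIL, while `== 0` isolates hard failures.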
DeepFlow Integration
When NODEPULSE_DEEPFLOW_ENDPOINT is set, the agent posts structured events to the DeepFlow ADE API after each check cycle:
{
"event_type": "NODEPULSE_MODULE_RESULT",
"validator_version": "3.0.0",
"node_name": "gpu-node-42",
"module": "RDMA_EFA",
"status": "FAIL",
"hardware_gen": "H100",
"errors": ["PORT_DOWN detected on: mlx5_2: PORT_DOWN"],
"timestamp": "2026-04-12T05:30:00Z"
}
DeepFlow failures are non-fatal — the agent logs a warning and continues. Your cluster monitoring is never blocked by a telemetry outage.
Development
Prerequisites
- Go 1.23+
- Docker (for image builds)
- kind + kubectl (for E2E tests)
Build & Test
# Tidy dependencies
go mod tidy
# Build all binaries (agent, ui, validator)
make build
# Run all unit tests with race detector
make test
# Run unit tests with HTML coverage report
make test-cover
# Lint (requires golangci-lint)
make lint
Build the Docker image
make docker-build
# Produces: ghcr.io/bytehooks/nodepulse:<git-tag>
Run locally (for development)
# Start the agent (will report FAIL without real hardware — expected)
./bin/nodepulse-agent
# Start the UI pointing at a local agent
NODEPULSE_AGENT_ENDPOINTS=http://localhost:9100 ./bin/nodepulse-ui
# Run the pre-flight validator once
./bin/nodepulse-validator
E2E Tests (kind cluster)
make e2e-setup # Spin up 3-node kind cluster
make e2e-run # Run the Go E2E test suite
make e2e-teardown # Tear down the cluster
Project Structure
nodepulse/
├── cmd/
│ ├── agent/main.go # DaemonSet agent — continuous monitoring loop
│ ├── ui/
│ │ ├── main.go # Web UI service — fleet aggregation + HTTP
│ │ └── dashboard.go # Embedded single-file HTML dashboard
│ └── validator/main.go # Pre-flight Job validator (single-shot)
├── internal/
│ ├── models/ # Shared types (Status, AffinityClass, ModuleResult, UCCLNodeReadiness …)
│ ├── config/ # Env-var loader + YAML baseline reader
│ ├── shell/ # Runner interface → RealRunner / FakeRunner (tests)
│ ├── deepflow/ # DeepFlow HTTP client + FakeClient (tests)
│ ├── server/ # Agent HTTP server (/metrics, /healthz, /api/status)
│ ├── metrics/ # Prometheus registry + UpdateFromResults()
│ ├── modules/
│ │ ├── rdma/ # Module 1: RDMA/EFA checks
│ │ ├── mtu/ # Module 2: MTU consistency
│ │ ├── affinity/ # Module 3: GPU-NIC affinity
│ │ └── uccl/ # Module 4: UCCL GPU node compatibility
│ └── gatekeeper/
│ ├── ebpf/ # eBPF telemetry (sysfs fallback on old kernels)
│ ├── threshold/ # Dynamic baseline comparison
│ └── reporter/ # Incident summary + webhook alert
├── config/baselines/ # Per-GPU-generation YAML baselines (A100, H100, H200)
├── deploy/
│ ├── rbac.yaml # Namespace, ServiceAccounts, ClusterRole
│ ├── daemonset.yaml # DaemonSet + headless Service + ConfigMap
│ └── ui-deployment.yaml # UI Deployment + Service + ConfigMap
├── docs/ # Install, uninstall, architecture docs
├── e2e_test/ # kind cluster + Go E2E tests
├── gpu-validator.yaml # Pre-flight Kubernetes Job manifest
├── Dockerfile # Multi-stage build → 3 static binaries
└── Makefile # build / test / deploy / docker targets
Affinity Classes
| Class | Score | Description |
| --- | --- | --- |
| OPTIMAL | 5 | PIX or NVLink, same NUMA node; ideal for RDMA |
| ACCEPTABLE | 4 | PXB or PHB, same NUMA node |
| DEGRADED | 3 | PHB or PIX, different NUMA node |
| CROSS_NUMA | 2 | SYS traversal; significant latency penalty |
| BOTTLENECK | 1 | Cross root complex; worst case, avoid for training |
Default threshold: NODEPULSE_MIN_AFFINITY_CLASS=4 (ACCEPTABLE and above required).
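The scoring above can be sketched as a small Go classifier over one cell of the nvidia-smi topo -m matrix plus a same-NUMA flag. This is an interpretation of the table, not the affinity module's code: the identifiers are hypothetical, and any combination the table does not list (for example PXB across NUMA nodes, or NODE) conservatively falls through to BOTTLENECK.

```go
package main

import (
	"fmt"
	"strings"
)

// AffinityClass is the 1-5 score from the table above.
type AffinityClass int

const (
	Bottleneck AffinityClass = iota + 1 // 1
	CrossNUMA                           // 2
	Degraded                            // 3
	Acceptable                          // 4
	Optimal                             // 5
)

// classify scores one GPU-NIC pair from its topology token and NUMA locality.
// Tokens follow nvidia-smi topo -m: PIX, PXB, PHB, SYS, and NV# for NVLink.
func classify(link string, sameNUMA bool) AffinityClass {
	nvlink := strings.HasPrefix(link, "NV")
	switch {
	case link == "SYS":
		return CrossNUMA
	case sameNUMA && (link == "PIX" || nvlink):
		return Optimal
	case sameNUMA && (link == "PXB" || link == "PHB"):
		return Acceptable
	case !sameNUMA && (link == "PHB" || link == "PIX"):
		return Degraded
	default:
		return Bottleneck // not covered by the table; assume the worst
	}
}

// meets reports whether class c satisfies the configured minimum.
func meets(c, threshold AffinityClass) bool { return c >= threshold }

func main() {
	threshold := Acceptable // NODEPULSE_MIN_AFFINITY_CLASS=4
	pairs := []struct {
		link     string
		sameNUMA bool
	}{{"PIX", true}, {"PHB", true}, {"PHB", false}, {"SYS", false}}
	for _, p := range pairs {
		c := classify(p.link, p.sameNUMA)
		fmt.Printf("%s sameNUMA=%v -> class %d ok=%v\n", p.link, p.sameNUMA, c, meets(c, threshold))
	}
}
```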
Security Model
| Control | Detail |
| --- | --- |
| Privileges | privileged: true, CAP_NET_ADMIN, CAP_SYS_ADMIN scoped only to the NodePulse agent DaemonSet |
| Training isolation | These capabilities are never granted to training workloads |
| Policy enforcement | Pair with OPA/Kyverno to prevent other workloads from running privileged pods in the nodepulse namespace |
| Secret handling | DeepFlow credentials via K8s Secret (optional: true); the agent runs without them |
| Read-only mounts | /sys and /proc mounted read-only; /dev mounted for RDMA character devices |
| Agent RBAC | Agent ServiceAccount has automountServiceAccountToken: false, so it has zero K8s API access |
| UI RBAC | UI ServiceAccount has read-only access to endpoints and pods for agent discovery only |
Roadmap
- [ ] Helm chart for simplified installation
- [ ] Kubernetes-native endpoint auto-discovery (replace static ConfigMap)
- [ ] Admission Controller integration: block pod scheduling on degraded nodes
- [ ] Slack / PagerDuty alert adapters
- [ ] NCCL bandwidth test integration
- [ ] H200 / GH200 NVLink5 topology support
- [ ] UCCL binary test integration (run actual UCCL collective benchmarks)
- [x] UCCL GPU node compatibility checks (Module 4)
- [x] Prometheus metrics endpoint
- [x] Real-time Web UI dashboard
- [x] Continuous DaemonSet monitoring
Contributing
We welcome contributions! Please open an issue or pull request on GitHub.
- Fork the repo
- Create a feature branch: git checkout -b feat/my-feature
- Write tests for new behaviour
- Run make test to verify
- Open a PR against main
License
NodePulse is open source under the Apache 2.0 License.