nodepulse

module
v0.0.0-...-d8e9b63 Latest
Published: Apr 12, 2026 License: MIT

README

NodePulse

Continuous GPU Infrastructure Health Monitor & UCCL Node Validator for AI Training Clusters

Built by ByteHooks · Open Source · Apache-2.0


What is NodePulse?

High-performance AI training jobs silently degrade when nodes have misconfigured networking, suboptimal hardware placement, or GPU communication libraries that cannot initialize correctly. By the time you notice slow throughput, you have already wasted expensive GPU-hours.

NodePulse runs as a Kubernetes DaemonSet that places a dedicated validation agent on every GPU node, and it exposes a real-time web dashboard for fleet-wide infrastructure visibility.

It continuously validates four critical infrastructure dimensions that are the most common root causes of degraded distributed training performance:

| Module | What it checks | Failure indicator |
| --- | --- | --- |
| RDMA/EFA Fabric | ibv_devinfo port states, fi_info EFA providers | Any port DOWN or below minimum device count |
| MTU Consistency | Per-interface MTU via sysfs; end-to-end DF-bit ICMP sweep | Any interface MTU below 9000 bytes (jumbo frames required) |
| GPU-NIC Affinity | nvidia-smi topo -m PCIe/NUMA topology matrix | Any GPU↔NIC pair below ACCEPTABLE affinity class |
| UCCL Compatibility | GPU vendor, EFA/RDMA detection, ROCm version, NIC driver type, per-API readiness | Missing GPU, unsupported platform, or blocking condition for UCCL-Collective/P2P/EP |

Unlike a pre-flight job that runs once before training, NodePulse never stops watching. It catches hardware faults, driver resets, and configuration drift the moment they happen — before they silently corrupt your next training run.
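
As an illustration of the per-interface half of the MTU Consistency check, here is a minimal Go sketch that reads MTU values from a sysfs-style directory. It is not NodePulse's own code: the function names `parseMTU` and `belowMinMTU` are hypothetical, and the sketch inspects every entry under the directory rather than filtering down to physical interfaces.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseMTU converts the contents of a sysfs mtu file (e.g. "9001\n") to an int.
func parseMTU(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

// belowMinMTU scans a sysfs-style directory (normally /sys/class/net) and
// returns every interface whose MTU is below minMTU.
func belowMinMTU(root string, minMTU int) (map[string]int, error) {
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil, err
	}
	bad := make(map[string]int)
	for _, e := range entries {
		raw, err := os.ReadFile(filepath.Join(root, e.Name(), "mtu"))
		if err != nil {
			continue // entry has no readable mtu file; skip it
		}
		if mtu, err := parseMTU(string(raw)); err == nil && mtu < minMTU {
			bad[e.Name()] = mtu
		}
	}
	return bad, nil
}

func main() {
	bad, err := belowMinMTU("/sys/class/net", 9000)
	if err != nil {
		fmt.Println("cannot read sysfs:", err)
		return
	}
	for iface, mtu := range bad {
		fmt.Printf("FAIL %s: MTU %d < 9000\n", iface, mtu)
	}
}
```

Taking the directory as a parameter keeps the scan testable against a temporary directory instead of the live host.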


UCCL Integration

NodePulse integrates GPU node compatibility checks derived from the UCCL project — an efficient GPU communication library that is a drop-in replacement for NCCL/RCCL with up to 2.5× performance improvement.

The UCCL module (internal/modules/uccl) performs:

| Check | Method | Details |
| --- | --- | --- |
| GPU vendor detection | nvidia-smi / amd-smi / rocm-smi | Identifies NVIDIA (CUDA cap) or AMD (GFX arch) |
| ARM64 blocking check | runtime.GOARCH | ARM64 + AMD GPU blocks ROCm per UCCL build.sh |
| EFA NIC detection | /sys/class/infiniband/rdmap* | Mirrors uccl/__init__.py has_efa() exactly |
| RDMA port states | ibstat rdma0..rdma7 | Counts State: Active ports per UCCL slurm_monitor.sh |
| NIC driver type | /sys/class/net/*/device/msi_irqs/ | Detects RDMA vs VirtIO/ENA (AF-XDP path) |
| ROCm version | /opt/rocm/.info/version or rocm-smi --version | EP requires ROCm 7+; 6.x is blocked |
| GPU idle state | GPU utilization % | Utilization below 5% counts as idle (UCCL scheduling threshold) |
| Required env vars | UCCL_SOCKET_IFNAME, UCCL_IB_GID_INDEX | WARN if missing (UCCL-EP will hang without these) |
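
Two rows of the table above, EFA NIC detection and the required environment variables, are simple enough to sketch in Go. This is an illustration under assumptions, not the module's actual code: `hasEFA` and `missingEnv` are hypothetical names, and treating an empty value the same as an unset variable is a choice made here.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// requiredUCCLEnv lists the variables the table says trigger a WARN when missing.
var requiredUCCLEnv = []string{"UCCL_SOCKET_IFNAME", "UCCL_IB_GID_INDEX"}

// missingEnv returns the names in required for which lookup reports no value.
// lookup has the same shape as os.LookupEnv so a fake can be supplied in tests.
func missingEnv(required []string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range required {
		if v, ok := lookup(name); !ok || v == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

// hasEFA reports whether any device matching rdmap* exists under root
// (normally /sys/class/infiniband).
func hasEFA(root string) bool {
	matches, err := filepath.Glob(filepath.Join(root, "rdmap*"))
	return err == nil && len(matches) > 0
}

func main() {
	fmt.Println("EFA detected:", hasEFA("/sys/class/infiniband"))
	for _, name := range missingEnv(requiredUCCLEnv, os.LookupEnv) {
		fmt.Println("WARN: missing", name)
	}
}
```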

Per-API readiness flags:

| UCCL API | Required | Condition |
| --- | --- | --- |
| collective_rdma | GPU + RDMA NIC (IB/RoCE) | NVIDIA CX or Broadcom Thor |
| collective_efa | GPU + EFA NIC | AWS p4d/p5/p5e/p5en/p6 |
| collective_afxdp | GPU + ENA/VirtIO NIC | AWS ENA, IBM VirtIO (AF-XDP path) |
| p2p | GPU + RDMA NIC | NVIDIA or AMD |
| ep | GPU + RDMA/EFA, ROCm 7+ for AMD, no ARM64+AMD | Expert-parallel (DeepEP-compatible) |
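
The readiness rules in the table above reduce to a handful of boolean expressions. The following Go sketch shows one way to encode them. The `NodeFacts` struct and its field names are hypothetical, and the real module may weigh further conditions (for example specific NIC models) that are not modeled here.

```go
package main

import "fmt"

// NodeFacts is a hypothetical summary of what the detection checks found.
type NodeFacts struct {
	HasGPU     bool
	AMD        bool // true for AMD GPUs, false for NVIDIA
	ARM64      bool
	HasRDMA    bool // IB/RoCE NIC present
	HasEFA     bool
	HasENAVirt bool // ENA or VirtIO NIC (AF-XDP path)
	ROCmMajor  int  // 0 when ROCm is not installed
}

// readiness maps each UCCL API name from the table to a ready flag.
func readiness(f NodeFacts) map[string]bool {
	// EP is blocked on AMD when the host is ARM64 or ROCm is older than 7.
	epBlocked := f.AMD && (f.ARM64 || f.ROCmMajor < 7)
	return map[string]bool{
		"collective_rdma":  f.HasGPU && f.HasRDMA,
		"collective_efa":   f.HasGPU && f.HasEFA,
		"collective_afxdp": f.HasGPU && f.HasENAVirt,
		"p2p":              f.HasGPU && f.HasRDMA,
		"ep":               f.HasGPU && (f.HasRDMA || f.HasEFA) && !epBlocked,
	}
}

func main() {
	amdNode := NodeFacts{HasGPU: true, AMD: true, HasRDMA: true, ROCmMajor: 6}
	for api, ready := range readiness(amdNode) {
		fmt.Printf("%-17s ready=%v\n", api, ready)
	}
}
```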

Architecture

┌─────────────────────────── Kubernetes Cluster ───────────────────────────────┐
│                                                                               │
│   GPU Node A                GPU Node B                GPU Node C             │
│  ┌─────────────┐           ┌─────────────┐           ┌─────────────┐        │
│  │ nodepulse   │           │ nodepulse   │           │ nodepulse   │        │
│  │  -agent     │           │  -agent     │           │  -agent     │        │
│  │  DaemonSet  │           │  DaemonSet  │           │  DaemonSet  │        │
│  │             │           │             │           │             │        │
│  │ :9100/      │           │ :9100/      │           │ :9100/      │        │
│  │  metrics    │           │  metrics    │           │  metrics    │        │
│  │  api/status │           │  api/status │           │  api/status │        │
│  └──────┬──────┘           └──────┬──────┘           └──────┬──────┘        │
│         │                         │                          │               │
│         └─────────────────────────┴──────────────────────────┘               │
│                                           │ poll every 30s                   │
│                              ┌────────────▼────────────┐                    │
│                              │   nodepulse-ui           │                    │
│                              │   Deployment  :8080      │                    │
│                              │                          │                    │
│                              │  Fleet Dashboard         │                    │
│                              │  ● RDMA status per node  │                    │
│                              │  ● MTU values per iface  │                    │
│                              │  ● GPU-NIC topology map  │                    │
│                              │  ● UCCL readiness flags  │                    │
│                              └──────────────────────────┘                    │
│                                                                               │
└───────────────────────────────────────────────────────────────────────────────┘

Each agent runs all four checks on a configurable interval (default: 60 seconds), publishes results as Prometheus metrics on :9100/metrics, and serves a JSON status snapshot on :9100/api/status. The UI service aggregates all agents and renders a live fleet health view.
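
The check-then-publish cycle described above can be sketched in a few lines of Go. This is a simplified illustration, not the agent's actual implementation: `ModuleResult`, `Snapshot`, and `runLoop` are hypothetical names, the JSON shape is invented for the example, and the loop stops after a fixed number of cycles so the program terminates.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// ModuleResult is a hypothetical, trimmed-down result record.
type ModuleResult struct {
	Module string `json:"module"`
	Status string `json:"status"` // "PASS", "WARN" or "FAIL"
}

// Snapshot holds the results of the most recent check cycle.
type Snapshot struct {
	mu      sync.RWMutex
	results []ModuleResult
}

// Set replaces the stored results.
func (s *Snapshot) Set(r []ModuleResult) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.results = r
}

// JSON returns the stored results as a JSON array.
func (s *Snapshot) JSON() ([]byte, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return json.Marshal(s.results)
}

// ServeHTTP lets a Snapshot be mounted at a path such as /api/status.
func (s *Snapshot) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	body, err := s.JSON()
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(body)
}

// runLoop runs checks once immediately, then once per tick, until it has
// completed the requested number of cycles.
func runLoop(s *Snapshot, interval time.Duration, cycles int, checks func() []ModuleResult) {
	s.Set(checks())
	t := time.NewTicker(interval)
	defer t.Stop()
	for i := 1; i < cycles; i++ {
		<-t.C
		s.Set(checks())
	}
}

func main() {
	snap := &Snapshot{}
	checks := func() []ModuleResult {
		return []ModuleResult{{Module: "MTU", Status: "PASS"}}
	}
	runLoop(snap, 10*time.Millisecond, 2, checks)
	body, _ := snap.JSON()
	fmt.Println(string(body))
	// A long-running agent would loop indefinitely and also serve the snapshot:
	//   http.Handle("/api/status", snap)
	//   http.ListenAndServe(":9100", nil)
}
```

Guarding the snapshot with a read-write mutex lets HTTP readers proceed concurrently while a check cycle swaps in new results.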

For a full deep-dive, see docs/architecture.md.


Quick Start

1. Label your GPU nodes

kubectl label node <gpu-node-name> nodepulse/gpu-node=true

# Or label every node that advertises nvidia.com/gpu capacity in one pass
kubectl get nodes -o json | \
  jq -r '.items[] | select(.status.capacity["nvidia.com/gpu"]) | .metadata.name' | \
  xargs -I{} kubectl label node {} nodepulse/gpu-node=true

2. Deploy NodePulse

# Apply RBAC (namespace + service accounts)
kubectl apply -f deploy/rbac.yaml

# Deploy the DaemonSet agent on all GPU nodes
kubectl apply -f deploy/daemonset.yaml

# Deploy the Web UI
kubectl apply -f deploy/ui-deployment.yaml

Or use the Makefile shortcut:

make deploy

3. Open the Dashboard

# Port-forward the UI to your local machine
kubectl port-forward -n nodepulse svc/nodepulse-ui 8080:80

# Open in your browser (open is macOS; use xdg-open on Linux)
open http://localhost:8080

4. Watch the agents

# Stream agent logs from all GPU nodes
make logs-agent

# Check deployment status
make status

5. Run the pre-flight validator (optional)

# Label the target node and run a one-shot validation Job
kubectl label node <gpu-node-name> nodepulse/target-node=true
kubectl apply -f gpu-validator.yaml
kubectl wait --for=condition=complete job/nodepulse-validator -n nodepulse --timeout=120s
kubectl logs -n nodepulse -l app.kubernetes.io/component=validator

Installation

See the full Installation Guide → for:

  • Detailed kubectl steps and expected output
  • Optional DeepFlow credentials setup
  • Configuring agent endpoint discovery for the UI
  • Security hardening with OPA/Kyverno

Uninstallation

See the Uninstall Guide →.


Configuration Reference

All runtime configuration is injected via environment variables; no settings are baked into the binaries. Hardware baselines are the one file-based input and are read as YAML from NODEPULSE_BASELINE_DIR.

Agent (nodepulse-agent)

| Variable | Default | Description |
| --- | --- | --- |
| NODE_NAME | (K8s downward API) | Node name, injected automatically by Kubernetes |
| NODEPULSE_AGENT_ADDR | :9100 | Agent HTTP server bind address |
| NODEPULSE_CHECK_INTERVAL_SECONDS | 60 | How often to run all validation checks |
| NODEPULSE_MIN_RDMA_DEVICES | 1 | Minimum number of active RDMA devices required |
| NODEPULSE_RDMA_LATENCY_US | 5.0 | Maximum acceptable RDMA round-trip latency (µs) |
| NODEPULSE_MIN_MTU | 9000 | Minimum MTU on all physical interfaces (bytes) |
| NODEPULSE_PEER_NODE_IP | (empty) | Peer IP for end-to-end MTU sweep; omit to skip |
| NODEPULSE_MIN_AFFINITY_CLASS | 4 | Minimum GPU↔NIC affinity class (1=Bottleneck … 5=Optimal) |
| NODEPULSE_BASELINE_DIR | /etc/nodepulse/baselines | Directory containing YAML hardware baselines |
| NODEPULSE_DEEPFLOW_ENDPOINT | (empty) | DeepFlow ADE base URL (optional) |
| NODEPULSE_DEEPFLOW_API_KEY | (empty) | Bearer token for DeepFlow API (optional) |
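
To show how these variables and defaults might be consumed, here is a minimal Go loader sketch covering a subset of the agent settings. The `AgentConfig` struct, `intEnv`, and `loadConfig` are hypothetical names, and silently falling back to the default on a malformed value is an assumption of this sketch, not documented NodePulse behaviour.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// AgentConfig holds a subset of the agent settings from the table above.
type AgentConfig struct {
	Addr          string
	CheckInterval time.Duration
	MinRDMA       int
	MinMTU        int
	PeerNodeIP    string // empty means: skip the end-to-end MTU sweep
}

// intEnv returns the integer value of key, or def when unset or malformed.
func intEnv(get func(string) string, key string, def int) int {
	if n, err := strconv.Atoi(get(key)); err == nil {
		return n
	}
	return def
}

// loadConfig builds an AgentConfig from a getter shaped like os.Getenv.
func loadConfig(get func(string) string) AgentConfig {
	addr := get("NODEPULSE_AGENT_ADDR")
	if addr == "" {
		addr = ":9100"
	}
	interval := intEnv(get, "NODEPULSE_CHECK_INTERVAL_SECONDS", 60)
	return AgentConfig{
		Addr:          addr,
		CheckInterval: time.Duration(interval) * time.Second,
		MinRDMA:       intEnv(get, "NODEPULSE_MIN_RDMA_DEVICES", 1),
		MinMTU:        intEnv(get, "NODEPULSE_MIN_MTU", 9000),
		PeerNodeIP:    get("NODEPULSE_PEER_NODE_IP"),
	}
}

func main() {
	fmt.Printf("%+v\n", loadConfig(os.Getenv))
}
```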

UCCL env vars (checked by UCCL module, WARN if missing)

| Variable | Description |
| --- | --- |
| UCCL_SOCKET_IFNAME | Control NIC for UCCL-EP bootstrapping (must match torchrun --master_addr interface) |
| UCCL_IB_GID_INDEX | RDMA GID index; should match NCCL_IB_GID_INDEX |
| UCCL_IB_HCA | IB/RoCE device names (auto-detected if unset) |
| UCCL_IB_MAX_INFLIGHT_BYTES | Recommended for Broadcom Thor-2 NICs |

UI (nodepulse-ui)

| Variable | Default | Description |
| --- | --- | --- |
| NODEPULSE_UI_ADDR | :8080 | UI HTTP server bind address |
| NODEPULSE_AGENT_ENDPOINTS | (empty) | Comma-separated list of agent pod URLs |
| NODEPULSE_POLL_INTERVAL_SECONDS | 30 | How often the UI polls agent endpoints |
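
The UI's use of NODEPULSE_AGENT_ENDPOINTS can be illustrated with a short Go sketch that splits the list and fetches each agent's /api/status once. The helper names `statusURLs` and `fetch` are hypothetical, and the 5-second client timeout is an arbitrary choice for the example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// statusURLs splits a comma-separated endpoint list and appends /api/status
// to each entry, ignoring blanks and trailing slashes.
func statusURLs(endpoints string) []string {
	var urls []string
	for _, e := range strings.Split(endpoints, ",") {
		e = strings.TrimRight(strings.TrimSpace(e), "/")
		if e != "" {
			urls = append(urls, e+"/api/status")
		}
	}
	return urls
}

// fetch returns the raw body from one agent, or an error on a transport
// failure or a non-200 response.
func fetch(client *http.Client, url string) ([]byte, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("%s: unexpected status %s", url, resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	for _, u := range statusURLs(os.Getenv("NODEPULSE_AGENT_ENDPOINTS")) {
		if body, err := fetch(client, u); err != nil {
			fmt.Println("unreachable:", err)
		} else {
			fmt.Printf("%s -> %d bytes\n", u, len(body))
		}
	}
}
```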

Hardware Baselines

NodePulse ships with pre-tuned baselines for A100, H100, H200, and a conservative default. Override by editing the nodepulse-baselines ConfigMap in deploy/daemonset.yaml:

# A100.yaml — included in the ConfigMap
hardware_gen: A100
min_rdma_devices: 4
max_rdma_latency_us: 5.0
min_bandwidth_gbps: 200
min_mtu: 9000
min_affinity_class: 4
min_dma_throughput_gbps: 50
max_latency_ns: 1000
max_error_count: 0
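
To make the role of a baseline concrete, the Go sketch below compares a set of observed values against a few of the thresholds from the A100 example. YAML parsing is left out, and the `Baseline` and `Observed` types and the `violations` function are hypothetical; they only mirror the key names shown above.

```go
package main

import "fmt"

// Baseline mirrors a few keys of the A100 YAML example; parsing is omitted.
type Baseline struct {
	HardwareGen      string
	MinRDMADevices   int
	MaxRDMALatencyUs float64
	MinMTU           int
	MinAffinityClass int
}

// Observed is a hypothetical set of measurements from one check cycle.
type Observed struct {
	RDMADevices   int
	RDMALatencyUs float64
	MTU           int
	AffinityClass int
}

// violations lists every threshold in b that o fails to meet.
func violations(b Baseline, o Observed) []string {
	var v []string
	if o.RDMADevices < b.MinRDMADevices {
		v = append(v, fmt.Sprintf("rdma devices: %d < %d", o.RDMADevices, b.MinRDMADevices))
	}
	if o.RDMALatencyUs > b.MaxRDMALatencyUs {
		v = append(v, fmt.Sprintf("rdma latency: %.1fus > %.1fus", o.RDMALatencyUs, b.MaxRDMALatencyUs))
	}
	if o.MTU < b.MinMTU {
		v = append(v, fmt.Sprintf("mtu: %d < %d", o.MTU, b.MinMTU))
	}
	if o.AffinityClass < b.MinAffinityClass {
		v = append(v, fmt.Sprintf("affinity class: %d < %d", o.AffinityClass, b.MinAffinityClass))
	}
	return v
}

func main() {
	a100 := Baseline{HardwareGen: "A100", MinRDMADevices: 4, MaxRDMALatencyUs: 5.0, MinMTU: 9000, MinAffinityClass: 4}
	seen := Observed{RDMADevices: 2, RDMALatencyUs: 7.5, MTU: 9001, AffinityClass: 5}
	for _, msg := range violations(a100, seen) {
		fmt.Println("FAIL", msg)
	}
}
```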

API Reference

Agent endpoints (:9100)

| Endpoint | Description |
| --- | --- |
| GET /metrics | All validation metrics in Prometheus text format |
| GET /api/status | JSON snapshot of the most recent check cycle |
| GET /healthz | Liveness probe; always returns 200 OK |

UI endpoints (:8080)

| Endpoint | Description |
| --- | --- |
| GET / | Web dashboard (fleet-wide health view) |
| GET /api/fleet | Aggregated JSON from all agents |
| GET /healthz | Liveness probe |

Prometheus metrics

| Metric | Labels | Description |
| --- | --- | --- |
| nodepulse_check_status | node, module | 1=PASS, 0=FAIL, -1=WARN |
| nodepulse_rdma_devices_found | node | RDMA device count |
| nodepulse_rdma_device_active | node, hca_id | 1 if PORT_ACTIVE |
| nodepulse_efa_providers_found | node | EFA provider count |
| nodepulse_interface_mtu_bytes | node, interface | Per-interface MTU |
| nodepulse_mtu_sweep_achieved_bytes | node | End-to-end path MTU |
| nodepulse_gpu_nic_affinity_class | node, gpu, nic | Affinity score 1-5 |
| nodepulse_gpu_nic_affinity_ok | node, gpu, nic | 1 if meets threshold |
| nodepulse_cross_numa_pairs_total | node | Count of cross-NUMA pairs |
| nodepulse_hardware_gen_info | node, gen | GPU generation label |
| nodepulse_uccl_gpu_count | node, vendor | GPU count per vendor |
| nodepulse_uccl_active_rdma_ports | node | Active RDMA ports (ibstat) |
| nodepulse_uccl_api_ready | node, api | 1 if ready for UCCL API |
| nodepulse_uccl_efa_detected | node | 1 if EFA NIC present |
| nodepulse_uccl_blocking_conditions_total | node | Count of blocking conditions |
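
As an illustration of the 1/0/-1 encoding used by nodepulse_check_status, the Go sketch below renders samples by hand in the Prometheus text exposition format. It makes several assumptions: the metric is treated as a gauge, unrecognized statuses are mapped to 0, the module label values are guesses, and Go's `%q` quoting stands in for Prometheus label escaping (equivalent for plain ASCII values). A real exporter would more likely use a Prometheus client library.

```go
package main

import (
	"fmt"
	"strings"
)

// statusValue maps a status string to the value used by nodepulse_check_status.
func statusValue(status string) int {
	switch status {
	case "PASS":
		return 1
	case "WARN":
		return -1
	default:
		return 0 // FAIL, and (by assumption) anything unrecognized
	}
}

// renderCheckStatus emits one sample per {module, status} pair in
// Prometheus text format.
func renderCheckStatus(node string, results [][2]string) string {
	var b strings.Builder
	b.WriteString("# TYPE nodepulse_check_status gauge\n")
	for _, r := range results {
		fmt.Fprintf(&b, "nodepulse_check_status{node=%q,module=%q} %d\n",
			node, r[0], statusValue(r[1]))
	}
	return b.String()
}

func main() {
	fmt.Print(renderCheckStatus("gpu-node-42", [][2]string{
		{"RDMA_EFA", "FAIL"},
		{"MTU", "PASS"},
		{"UCCL", "WARN"},
	}))
}
```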

DeepFlow Integration

When NODEPULSE_DEEPFLOW_ENDPOINT is set, the agent posts structured events to the DeepFlow ADE API after each check cycle:

{
  "event_type": "NODEPULSE_MODULE_RESULT",
  "validator_version": "3.0.0",
  "node_name": "gpu-node-42",
  "module": "RDMA_EFA",
  "status": "FAIL",
  "hardware_gen": "H100",
  "errors": ["PORT_DOWN detected on: mlx5_2: PORT_DOWN"],
  "timestamp": "2026-04-12T05:30:00Z"
}

DeepFlow failures are non-fatal — the agent logs a warning and continues. Your cluster monitoring is never blocked by a telemetry outage.
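
The non-fatal behaviour described above can be sketched in Go as a sender that logs failures and reports success through its return value, never through a panic or exit. The `Event` struct mirrors the JSON example. The function names, the `/events` path, and the rule that any status below 300 counts as delivered are assumptions of this sketch.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// Event mirrors the JSON payload shown above.
type Event struct {
	EventType        string   `json:"event_type"`
	ValidatorVersion string   `json:"validator_version"`
	NodeName         string   `json:"node_name"`
	Module           string   `json:"module"`
	Status           string   `json:"status"`
	HardwareGen      string   `json:"hardware_gen"`
	Errors           []string `json:"errors"`
	Timestamp        string   `json:"timestamp"` // RFC 3339, UTC
}

// payload encodes ev as JSON.
func payload(ev Event) ([]byte, error) { return json.Marshal(ev) }

// post sends ev with a bearer token. Every failure is logged and turned into
// a false return value so the caller can carry on with its check loop.
func post(client *http.Client, url, apiKey string, ev Event) bool {
	body, err := payload(ev)
	if err != nil {
		log.Printf("deepflow: encode failed (continuing): %v", err)
		return false
	}
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		log.Printf("deepflow: bad request (continuing): %v", err)
		return false
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+apiKey)
	resp, err := client.Do(req)
	if err != nil {
		log.Printf("deepflow: post failed (continuing): %v", err)
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode < 300
}

func main() {
	ev := Event{
		EventType: "NODEPULSE_MODULE_RESULT", NodeName: "gpu-node-42",
		Module: "RDMA_EFA", Status: "FAIL", HardwareGen: "H100",
		Errors:    []string{"PORT_DOWN detected on: mlx5_2: PORT_DOWN"},
		Timestamp: time.Now().UTC().Format(time.RFC3339),
	}
	// Hypothetical URL; nothing is expected to listen there, so this
	// exercises the failure path.
	ok := post(&http.Client{Timeout: time.Second}, "http://127.0.0.1:1/events", "token", ev)
	fmt.Println("delivered:", ok)
}
```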


Development

Prerequisites

  • Go 1.23+
  • Docker (for image builds)
  • kind + kubectl (for E2E tests)

Build & Test

# Tidy dependencies
go mod tidy

# Build all binaries (agent, ui, validator)
make build

# Run all unit tests with race detector
make test

# Run unit tests with HTML coverage report
make test-cover

# Lint (requires golangci-lint)
make lint

Build the Docker image

make docker-build
# Produces: ghcr.io/bytehooks/nodepulse:<git-tag>

Run locally (for development)

# Start the agent (will report FAIL without real hardware — expected)
./bin/nodepulse-agent

# Start the UI pointing at a local agent
NODEPULSE_AGENT_ENDPOINTS=http://localhost:9100 ./bin/nodepulse-ui

# Run the pre-flight validator once
./bin/nodepulse-validator

E2E Tests (kind cluster)

make e2e-setup    # Spin up 3-node kind cluster
make e2e-run      # Run the Go E2E test suite
make e2e-teardown # Tear down the cluster

Project Structure

nodepulse/
├── cmd/
│   ├── agent/main.go          # DaemonSet agent — continuous monitoring loop
│   ├── ui/
│   │   ├── main.go            # Web UI service — fleet aggregation + HTTP
│   │   └── dashboard.go       # Embedded single-file HTML dashboard
│   └── validator/main.go      # Pre-flight Job validator (single-shot)
├── internal/
│   ├── models/                # Shared types (Status, AffinityClass, ModuleResult, UCCLNodeReadiness …)
│   ├── config/                # Env-var loader + YAML baseline reader
│   ├── shell/                 # Runner interface → RealRunner / FakeRunner (tests)
│   ├── deepflow/              # DeepFlow HTTP client + FakeClient (tests)
│   ├── server/                # Agent HTTP server (/metrics, /healthz, /api/status)
│   ├── metrics/               # Prometheus registry + UpdateFromResults()
│   ├── modules/
│   │   ├── rdma/              # Module 1: RDMA/EFA checks
│   │   ├── mtu/               # Module 2: MTU consistency
│   │   ├── affinity/          # Module 3: GPU-NIC affinity
│   │   └── uccl/              # Module 4: UCCL GPU node compatibility
│   └── gatekeeper/
│       ├── ebpf/              # eBPF telemetry (sysfs fallback on old kernels)
│       ├── threshold/         # Dynamic baseline comparison
│       └── reporter/          # Incident summary + webhook alert
├── config/baselines/          # Per-GPU-generation YAML baselines (A100, H100, H200)
├── deploy/
│   ├── rbac.yaml              # Namespace, ServiceAccounts, ClusterRole
│   ├── daemonset.yaml         # DaemonSet + headless Service + ConfigMap
│   └── ui-deployment.yaml     # UI Deployment + Service + ConfigMap
├── docs/                      # Install, uninstall, architecture docs
├── e2e_test/                  # kind cluster + Go E2E tests
├── gpu-validator.yaml         # Pre-flight Kubernetes Job manifest
├── Dockerfile                 # Multi-stage build → 3 static binaries
└── Makefile                   # build / test / deploy / docker targets
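
The tree above mentions a Runner interface with RealRunner and FakeRunner implementations in internal/shell. The Go sketch below shows the general pattern of abstracting command execution so that checks can be tested without GPU hardware. The method signature, the `Outputs` map, and the `activePorts` helper are hypothetical; the actual package may look different.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// Runner abstracts command execution so checks can run against canned output.
type Runner interface {
	Run(name string, args ...string) (string, error)
}

// RealRunner shells out to the host.
type RealRunner struct{}

func (RealRunner) Run(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).CombinedOutput()
	return string(out), err
}

// FakeRunner returns canned output keyed by the full command line.
type FakeRunner struct{ Outputs map[string]string }

func (f FakeRunner) Run(name string, args ...string) (string, error) {
	key := strings.Join(append([]string{name}, args...), " ")
	out, ok := f.Outputs[key]
	if !ok {
		return "", fmt.Errorf("fake: no output registered for %q", key)
	}
	return out, nil
}

// activePorts counts "State: Active" occurrences in ibstat output.
func activePorts(r Runner) (int, error) {
	out, err := r.Run("ibstat")
	if err != nil {
		return 0, err
	}
	return strings.Count(out, "State: Active"), nil
}

func main() {
	fake := FakeRunner{Outputs: map[string]string{
		"ibstat": "Port 1:\n\tState: Active\nPort 2:\n\tState: Down\n",
	}}
	n, err := activePorts(fake)
	fmt.Println("active ports:", n, "err:", err)

	// On a machine without ibstat, RealRunner simply returns an error.
	if _, err := activePorts(RealRunner{}); err != nil {
		fmt.Println("real runner:", err)
	}
}
```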

Affinity Classes

| Class | Score | Description |
| --- | --- | --- |
| OPTIMAL | 5 | PIX or NVLink, same NUMA node; ideal for RDMA |
| ACCEPTABLE | 4 | PXB or PHB, same NUMA node |
| DEGRADED | 3 | PHB or PIX, different NUMA node |
| CROSS_NUMA | 2 | SYS traversal; significant latency penalty |
| BOTTLENECK | 1 | Cross root complex; worst case, avoid for training |

Default threshold: NODEPULSE_MIN_AFFINITY_CLASS=4 (ACCEPTABLE and above required).
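
The scores and the default threshold can be modeled as an ordered Go type, as in the sketch below. The constant names follow the table, but the type, the `failingPairs` helper, and the pair labels are hypothetical and not taken from internal/models.

```go
package main

import (
	"fmt"
	"sort"
)

// AffinityClass scores a GPU-NIC pair; higher is better.
type AffinityClass int

const (
	Bottleneck AffinityClass = iota + 1 // 1
	CrossNUMA                           // 2
	Degraded                            // 3
	Acceptable                          // 4
	Optimal                             // 5
)

// String returns the class name used in the table.
func (c AffinityClass) String() string {
	switch c {
	case Bottleneck:
		return "BOTTLENECK"
	case CrossNUMA:
		return "CROSS_NUMA"
	case Degraded:
		return "DEGRADED"
	case Acceptable:
		return "ACCEPTABLE"
	case Optimal:
		return "OPTIMAL"
	}
	return fmt.Sprintf("UNKNOWN(%d)", int(c))
}

// failingPairs returns, in sorted order, the pairs whose class is below minClass.
func failingPairs(pairs map[string]AffinityClass, minClass AffinityClass) []string {
	var failing []string
	for pair, class := range pairs {
		if class < minClass {
			failing = append(failing, pair+" "+class.String())
		}
	}
	sort.Strings(failing)
	return failing
}

func main() {
	pairs := map[string]AffinityClass{
		"GPU0/mlx5_0": Optimal,
		"GPU1/mlx5_1": Degraded,
		"GPU2/mlx5_2": Acceptable,
	}
	for _, f := range failingPairs(pairs, Acceptable) {
		fmt.Println("FAIL", f)
	}
}
```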


Security Model

| Control | Detail |
| --- | --- |
| Privileges | privileged: true, CAP_NET_ADMIN, CAP_SYS_ADMIN scoped only to the NodePulse agent DaemonSet |
| Training isolation | These capabilities are never granted to training workloads |
| Policy enforcement | Pair with OPA/Kyverno to prevent other workloads from running privileged pods in the nodepulse namespace |
| Secret handling | DeepFlow credentials via K8s Secret (optional: true); the agent runs without them |
| Read-only mounts | /sys and /proc mounted read-only; /dev mounted for RDMA character devices |
| Agent RBAC | Agent ServiceAccount has automountServiceAccountToken: false, so it has zero K8s API access |
| UI RBAC | UI ServiceAccount has read-only access to endpoints and pods for agent discovery only |

Roadmap

Planned:

  • Helm chart for simplified installation
  • Kubernetes-native endpoint auto-discovery (replace static ConfigMap)
  • Admission Controller integration to block pod scheduling on degraded nodes
  • Slack / PagerDuty alert adapters
  • NCCL bandwidth test integration
  • H200 / GH200 NVLink5 topology support
  • UCCL binary test integration (run actual UCCL collective benchmarks)

Done:

  • UCCL GPU node compatibility checks (Module 4)
  • Prometheus metrics endpoint
  • Real-time Web UI dashboard
  • Continuous DaemonSet monitoring

Contributing

We welcome contributions! Please open an issue or pull request on GitHub.

  1. Fork the repo
  2. Create a feature branch: git checkout -b feat/my-feature
  3. Write tests for new behaviour
  4. Run make test to verify
  5. Open a PR against main

License

NodePulse is open source under the Apache 2.0 License.


Made with ♥ by ByteHooks

Website · GitHub · Issues

Directories

| Path | Synopsis |
| --- | --- |
| cmd/agent (command) | Command agent is the NodePulse DaemonSet entrypoint. |
| cmd/ui (command) | Command ui is the NodePulse web UI service. |
| cmd/validator (command) | Command validator is the entrypoint for the NodePulse pre-flight validator. |
| internal/gatekeeper/ebpf | Package ebpf provides eBPF-based bus telemetry collection for the GIV. |
| internal/metrics | Package metrics provides a Prometheus-compatible metrics registry for the NodePulse agent. |
| internal/modules/uccl | Package uccl implements GPU node compatibility checks derived from the UCCL project (https://github.com/uccl-project/uccl). |
| internal/server | Package server provides the NodePulse agent HTTP server. |
