nodepulse

module

v0.0.0-...-d8e9b63 Latest Latest Go to latest Published: Apr 12, 2026 License: MIT

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/bytehooks/nodepulse

Links

Open Source Insights

README ¶

NodePulse

Continuous GPU Infrastructure Health Monitor & UCCL Node Validator for AI Training Clusters

Built by ByteHooks · Open Source · Apache-2.0

What is NodePulse?

High-performance AI training jobs silently degrade when nodes have misconfigured networking, suboptimal hardware placement, or GPU communication libraries that cannot initialize correctly. By the time you notice slow throughput, you have already wasted expensive GPU-hours.

NodePulse is a continuous Kubernetes DaemonSet that runs a dedicated validation agent on every GPU node and exposes a real-time web dashboard for fleet-wide infrastructure visibility.

It continuously validates four critical infrastructure dimensions that are the most common root causes of degraded distributed training performance:

Module	What it checks	Failure indicator
RDMA/EFA Fabric	`ibv_devinfo` port states, `fi_info` EFA providers	Any port DOWN or below minimum device count
MTU Consistency	Per-interface MTU via sysfs; end-to-end DF-bit ICMP sweep	Any interface < 9000 bytes (Jumbo Frames required)
GPU-NIC Affinity	`nvidia-smi topo -m` PCIe/NUMA topology matrix	Any GPU↔NIC pair below ACCEPTABLE affinity class
UCCL Compatibility	GPU vendor, EFA/RDMA detection, ROCm version, NIC driver type, per-API readiness	Missing GPU, unsupported platform, or blocking condition for UCCL-Collective/P2P/EP

Unlike a pre-flight job that runs once before training, NodePulse never stops watching. It catches hardware faults, driver resets, and configuration drift the moment they happen — before they silently corrupt your next training run.

UCCL Integration

NodePulse integrates GPU node compatibility checks derived from the UCCL project — an efficient GPU communication library that is a drop-in replacement for NCCL/RCCL with up to 2.5× performance improvement.

The UCCL module (internal/modules/uccl) performs:

Check	Method	Details
GPU vendor detection	`nvidia-smi` / `amd-smi` / `rocm-smi`	Identifies NVIDIA (CUDA cap) or AMD (GFX arch)
ARM64 blocking check	`runtime.GOARCH`	ARM64 + AMD GPU blocks ROCm per UCCL build.sh
EFA NIC detection	`/sys/class/infiniband/rdmap*`	Mirrors `uccl/__init__.py has_efa()` exactly
RDMA port states	`ibstat rdma0..rdma7`	Counts `State: Active` ports per UCCL slurm_monitor.sh
NIC driver type	`/sys/class/net/*/device/msi_irqs/`	Detects RDMA vs VirtIO/ENA (AF-XDP path)
ROCm version	`/opt/rocm/.info/version` or `rocm-smi --version`	EP requires ROCm 7+; 6.x is blocked
GPU idle state	GPU utilization %	< 5% = idle per UCCL scheduling threshold
Required env vars	`UCCL_SOCKET_IFNAME`, `UCCL_IB_GID_INDEX`	WARN if missing (UCCL-EP will hang without these)

Per-API readiness flags:

UCCL API	Required	Condition
`collective_rdma`	GPU + RDMA NIC (IB/RoCE)	NVIDIA CX or Broadcom Thor
`collective_efa`	GPU + EFA NIC	AWS p4d/p5/p5e/p5en/p6
`collective_afxdp`	GPU + ENA/VirtIO NIC	AWS ENA, IBM VirtIO (AF-XDP path)
`p2p`	GPU + RDMA NIC	NVIDIA or AMD
`ep`	GPU + RDMA/EFA, ROCm 7+ for AMD, no ARM64+AMD	Expert-parallel (DeepEP-compatible)

Architecture

┌─────────────────────────── Kubernetes Cluster ───────────────────────────────┐
│                                                                               │
│   GPU Node A                GPU Node B                GPU Node C             │
│  ┌─────────────┐           ┌─────────────┐           ┌─────────────┐        │
│  │ nodepulse   │           │ nodepulse   │           │ nodepulse   │        │
│  │  -agent     │           │  -agent     │           │  -agent     │        │
│  │  DaemonSet  │           │  DaemonSet  │           │  DaemonSet  │        │
│  │             │           │             │           │             │        │
│  │ :9100/      │           │ :9100/      │           │ :9100/      │        │
│  │  metrics    │           │  metrics    │           │  metrics    │        │
│  │  api/status │           │  api/status │           │  api/status │        │
│  └──────┬──────┘           └──────┬──────┘           └──────┬──────┘        │
│         │                         │                          │               │
│         └─────────────────────────┴──────────────────────────┘               │
│                                           │ poll every 30s                   │
│                              ┌────────────▼────────────┐                    │
│                              │   nodepulse-ui           │                    │
│                              │   Deployment  :8080      │                    │
│                              │                          │                    │
│                              │  Fleet Dashboard         │                    │
│                              │  ● RDMA status per node  │                    │
│                              │  ● MTU values per iface  │                    │
│                              │  ● GPU-NIC topology map  │                    │
│                              │  ● UCCL readiness flags  │                    │
│                              └──────────────────────────┘                    │
│                                                                               │
└───────────────────────────────────────────────────────────────────────────────┘

Each agent runs all four checks on a configurable interval (default: 60 seconds), publishes results as Prometheus metrics on :9100/metrics, and serves a JSON status snapshot on :9100/api/status. The UI service aggregates all agents and renders a live fleet health view.

For a full deep-dive, see docs/architecture.md.

Quick Start

1 — Label your GPU nodes

kubectl label node <gpu-node-name> nodepulse/gpu-node=true

# Label all nodes with GPUs at once
kubectl get nodes -o json | \
  jq -r '.items[] | select(.status.capacity["nvidia.com/gpu"]) | .metadata.name' | \
  xargs -I{} kubectl label node {} nodepulse/gpu-node=true

2 — Deploy NodePulse

# Apply RBAC (namespace + service accounts)
kubectl apply -f deploy/rbac.yaml

# Deploy the DaemonSet agent on all GPU nodes
kubectl apply -f deploy/daemonset.yaml

# Deploy the Web UI
kubectl apply -f deploy/ui-deployment.yaml

Or use the Makefile shortcut:

make deploy

3 — Open the Dashboard

# Port-forward the UI to your local machine
kubectl port-forward -n nodepulse svc/nodepulse-ui 8080:80

# Open in your browser
open http://localhost:8080

4 — Watch the agents

# Stream agent logs from all GPU nodes
make logs-agent

# Check deployment status
make status

5 — Run the pre-flight validator (optional)

# Label the target node and run a one-shot validation Job
kubectl label node <gpu-node-name> nodepulse/target-node=true
kubectl apply -f gpu-validator.yaml
kubectl wait --for=condition=complete job/nodepulse-validator -n nodepulse --timeout=120s
kubectl logs -n nodepulse -l app.kubernetes.io/component=validator

Installation

See the full Installation Guide → for:

Detailed kubectl steps and expected output
Optional DeepFlow credentials setup
Configuring agent endpoint discovery for the UI
Security hardening with OPA/Kyverno

Uninstallation

See the Uninstall Guide →.

Configuration Reference

All configuration is injected via environment variables. NodePulse uses no config files inside the binary.

Agent (`nodepulse-agent`)

Variable	Default	Description
`NODE_NAME`	(K8s downward API)	Node name — injected automatically by Kubernetes
`NODEPULSE_AGENT_ADDR`	`:9100`	Agent HTTP server bind address
`NODEPULSE_CHECK_INTERVAL_SECONDS`	`60`	How often to run all validation checks
`NODEPULSE_MIN_RDMA_DEVICES`	`1`	Minimum number of active RDMA devices required
`NODEPULSE_RDMA_LATENCY_US`	`5.0`	Maximum acceptable RDMA round-trip latency (µs)
`NODEPULSE_MIN_MTU`	`9000`	Minimum MTU on all physical interfaces (bytes)
`NODEPULSE_PEER_NODE_IP`	(empty)	Peer IP for end-to-end MTU sweep; omit to skip
`NODEPULSE_MIN_AFFINITY_CLASS`	`4`	Minimum GPU↔NIC affinity class (1=Bottleneck … 5=Optimal)
`NODEPULSE_BASELINE_DIR`	`/etc/nodepulse/baselines`	Directory containing YAML hardware baselines
`NODEPULSE_DEEPFLOW_ENDPOINT`	(empty)	DeepFlow ADE base URL (optional)
`NODEPULSE_DEEPFLOW_API_KEY`	(empty)	Bearer token for DeepFlow API (optional)

UCCL env vars (checked by UCCL module, WARN if missing)

Variable	Description
`UCCL_SOCKET_IFNAME`	Control NIC for UCCL-EP bootstrapping (must match `torchrun --master_addr` interface)
`UCCL_IB_GID_INDEX`	RDMA GID index — match with `NCCL_IB_GID_INDEX`
`UCCL_IB_HCA`	IB/RoCE device names (auto-detected if unset)
`UCCL_IB_MAX_INFLIGHT_BYTES`	Recommended for Broadcom Thor-2 NICs

UI (`nodepulse-ui`)

Variable	Default	Description
`NODEPULSE_UI_ADDR`	`:8080`	UI HTTP server bind address
`NODEPULSE_AGENT_ENDPOINTS`	(empty)	Comma-separated list of agent pod URLs
`NODEPULSE_POLL_INTERVAL_SECONDS`	`30`	How often the UI polls agent endpoints

Hardware Baselines

NodePulse ships with pre-tuned baselines for A100, H100, H200, and a conservative default. Override by editing the nodepulse-baselines ConfigMap in deploy/daemonset.yaml:

# A100.yaml — included in the ConfigMap
hardware_gen: A100
min_rdma_devices: 4
max_rdma_latency_us: 5.0
min_bandwidth_gbps: 200
min_mtu: 9000
min_affinity_class: 4
min_dma_throughput_gbps: 50
max_latency_ns: 1000
max_error_count: 0

API Reference

Agent endpoints (`:9100`)

Endpoint	Description
`GET /metrics`	Prometheus text format — all validation metrics
`GET /api/status`	JSON snapshot of the most recent check cycle
`GET /healthz`	Liveness probe — always `200 OK`

UI endpoints (`:8080`)

Endpoint	Description
`GET /`	Web dashboard (fleet-wide health view)
`GET /api/fleet`	Aggregated JSON from all agents
`GET /healthz`	Liveness probe

Prometheus metrics

Metric	Labels	Description
`nodepulse_check_status`	`node`, `module`	1=PASS, 0=FAIL, -1=WARN
`nodepulse_rdma_devices_found`	`node`	RDMA device count
`nodepulse_rdma_device_active`	`node`, `hca_id`	1 if PORT_ACTIVE
`nodepulse_efa_providers_found`	`node`	EFA provider count
`nodepulse_interface_mtu_bytes`	`node`, `interface`	Per-interface MTU
`nodepulse_mtu_sweep_achieved_bytes`	`node`	End-to-end path MTU
`nodepulse_gpu_nic_affinity_class`	`node`, `gpu`, `nic`	Affinity score 1-5
`nodepulse_gpu_nic_affinity_ok`	`node`, `gpu`, `nic`	1 if meets threshold
`nodepulse_cross_numa_pairs_total`	`node`	Count of cross-NUMA pairs
`nodepulse_hardware_gen_info`	`node`, `gen`	GPU generation label
`nodepulse_uccl_gpu_count`	`node`, `vendor`	GPU count per vendor
`nodepulse_uccl_active_rdma_ports`	`node`	Active RDMA ports (ibstat)
`nodepulse_uccl_api_ready`	`node`, `api`	1 if ready for UCCL API
`nodepulse_uccl_efa_detected`	`node`	1 if EFA NIC present
`nodepulse_uccl_blocking_conditions_total`	`node`	Count of blocking conditions

DeepFlow Integration

When NODEPULSE_DEEPFLOW_ENDPOINT is set, the agent posts structured events to the DeepFlow ADE API after each check cycle:

{
  "event_type": "NODEPULSE_MODULE_RESULT",
  "validator_version": "3.0.0",
  "node_name": "gpu-node-42",
  "module": "RDMA_EFA",
  "status": "FAIL",
  "hardware_gen": "H100",
  "errors": ["PORT_DOWN detected on: mlx5_2: PORT_DOWN"],
  "timestamp": "2026-04-12T05:30:00Z"
}

DeepFlow failures are non-fatal — the agent logs a warning and continues. Your cluster monitoring is never blocked by a telemetry outage.

Development

Prerequisites

Go 1.23+
Docker (for image builds)
kind + kubectl (for E2E tests)

Build & Test

# Tidy dependencies
go mod tidy

# Build all binaries (agent, ui, validator)
make build

# Run all unit tests with race detector
make test

# Run unit tests with HTML coverage report
make test-cover

# Lint (requires golangci-lint)
make lint

Build the Docker image

make docker-build
# Produces: ghcr.io/bytehooks/nodepulse:<git-tag>

Run locally (for development)

# Start the agent (will report FAIL without real hardware — expected)
./bin/nodepulse-agent

# Start the UI pointing at a local agent
NODEPULSE_AGENT_ENDPOINTS=http://localhost:9100 ./bin/nodepulse-ui

# Run the pre-flight validator once
./bin/nodepulse-validator

E2E Tests (kind cluster)

make e2e-setup    # Spin up 3-node kind cluster
make e2e-run      # Run the Go E2E test suite
make e2e-teardown # Tear down the cluster

Project Structure

nodepulse/
├── cmd/
│   ├── agent/main.go          # DaemonSet agent — continuous monitoring loop
│   ├── ui/
│   │   ├── main.go            # Web UI service — fleet aggregation + HTTP
│   │   └── dashboard.go       # Embedded single-file HTML dashboard
│   └── validator/main.go      # Pre-flight Job validator (single-shot)
├── internal/
│   ├── models/                # Shared types (Status, AffinityClass, ModuleResult, UCCLNodeReadiness …)
│   ├── config/                # Env-var loader + YAML baseline reader
│   ├── shell/                 # Runner interface → RealRunner / FakeRunner (tests)
│   ├── deepflow/              # DeepFlow HTTP client + FakeClient (tests)
│   ├── server/                # Agent HTTP server (/metrics, /healthz, /api/status)
│   ├── metrics/               # Prometheus registry + UpdateFromResults()
│   ├── modules/
│   │   ├── rdma/              # Module 1: RDMA/EFA checks
│   │   ├── mtu/               # Module 2: MTU consistency
│   │   ├── affinity/          # Module 3: GPU-NIC affinity
│   │   └── uccl/              # Module 4: UCCL GPU node compatibility
│   └── gatekeeper/
│       ├── ebpf/              # eBPF telemetry (sysfs fallback on old kernels)
│       ├── threshold/         # Dynamic baseline comparison
│       └── reporter/          # Incident summary + webhook alert
├── config/baselines/          # Per-GPU-generation YAML baselines (A100, H100, H200)
├── deploy/
│   ├── rbac.yaml              # Namespace, ServiceAccounts, ClusterRole
│   ├── daemonset.yaml         # DaemonSet + headless Service + ConfigMap
│   └── ui-deployment.yaml     # UI Deployment + Service + ConfigMap
├── docs/                      # Install, uninstall, architecture docs
├── e2e_test/                  # kind cluster + Go E2E tests
├── gpu-validator.yaml         # Pre-flight Kubernetes Job manifest
├── Dockerfile                 # Multi-stage build → 3 static binaries
└── Makefile                   # build / test / deploy / docker targets

Affinity Classes

Class	Score	Description
`OPTIMAL`	5	PIX or NVLink, same NUMA node — ideal for RDMA
`ACCEPTABLE`	4	PXB or PHB, same NUMA node
`DEGRADED`	3	PHB or PIX, different NUMA node
`CROSS_NUMA`	2	SYS traversal — significant latency penalty
`BOTTLENECK`	1	Cross root complex — worst case, avoid for training

Default threshold: NODEPULSE_MIN_AFFINITY_CLASS=4 (ACCEPTABLE and above required).

Security Model

Control	Detail
Privileges	`privileged: true`, `CAP_NET_ADMIN`, `CAP_SYS_ADMIN` scoped only to the NodePulse agent DaemonSet
Training isolation	These capabilities are never granted to training workloads
Policy enforcement	Pair with OPA/Kyverno to prevent other workloads from running privileged pods in the `nodepulse` namespace
Secret handling	DeepFlow credentials via K8s Secret (`optional: true`) — agent runs without them
Read-only mounts	`/sys` and `/proc` mounted read-only; `/dev` mounted for RDMA character devices
Agent RBAC	Agent ServiceAccount has `automountServiceAccountToken: false` — zero K8s API access
UI RBAC	UI ServiceAccount has read-only access to `endpoints` and `pods` for agent discovery only

Roadmap

Helm chart for simplified installation
Kubernetes-native endpoint auto-discovery (replace static ConfigMap)
Admission Controller integration — block pod scheduling on degraded nodes
Slack / PagerDuty alert adapters
NCCL bandwidth test integration
H200 / GH200 NVLink5 topology support
UCCL binary test integration (run actual UCCL collective benchmarks)
UCCL GPU node compatibility checks (Module 4)
Prometheus metrics endpoint
Real-time Web UI dashboard
Continuous DaemonSet monitoring

Contributing

We welcome contributions! Please open an issue or pull request on GitHub.

Fork the repo
Create a feature branch: git checkout -b feat/my-feature
Write tests for new behaviour
Run make test to verify
Open a PR against main

License

NodePulse is open source under the Apache 2.0 License.

Made with ♥ by ByteHooks

Website · GitHub · Issues

Directories ¶

Path	Synopsis
cmd
agent command Command agent is the NodePulse DaemonSet entrypoint.	Command agent is the NodePulse DaemonSet entrypoint.
ui command Command ui is the NodePulse web UI service.	Command ui is the NodePulse web UI service.
validator command Command validator is the entrypoint for the NodePulse pre-flight validator.	Command validator is the entrypoint for the NodePulse pre-flight validator.
internal
config
deepflow
gatekeeper/ebpf Package ebpf provides eBPF-based bus telemetry collection for the GIV.	Package ebpf provides eBPF-based bus telemetry collection for the GIV.
gatekeeper/reporter
gatekeeper/threshold
metrics Package metrics provides a Prometheus-compatible metrics registry for the NodePulse agent.	Package metrics provides a Prometheus-compatible metrics registry for the NodePulse agent.
models
modules/affinity
modules/mtu
modules/rdma
modules/uccl Package uccl implements GPU node compatibility checks derived from the UCCL project (https://github.com/uccl-project/uccl).	Package uccl implements GPU node compatibility checks derived from the UCCL project (https://github.com/uccl-project/uccl).
server Package server provides the NodePulse agent HTTP server.	Package server provides the NodePulse agent HTTP server.
shell

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL