fabricmanager

package
v0.11.4
Published: Apr 4, 2026 License: Apache-2.0 Imports: 24 Imported by: 0

Documentation

Overview

Package fabricmanager tracks NVIDIA fabric manager and fabric health monitoring services.

Fabric Management Architecture

NVIDIA systems use different fabric management and monitoring approaches depending on the GPU generation:

## Pre-NVL5 Systems (DGX A100, DGX H100, HGX A100, HGX H100)

These systems run the traditional nvidia-fabricmanager daemon directly on the compute nodes.

## NVL5+ Systems (GB200 NVL72)

Distributed fabric management architecture with NVML-based health monitoring:

### NVLink Switch Trays - Run NVOS (NVSwitch Operating System)

NVOS includes integrated fabric management services:

Quote: "NVOS includes the NVLink Subnet Manager (NVLSM), the Fabric Manager (FM),
       NMX services such as NMX-Controller and NMX-Telemetry, and the NVSwitch firmware."
Reference: https://docs.nvidia.com/networking/display/nvidianvosusermanualfornvlinkswitchesv25021884/cluster+management

Quote: "NVOS software image includes the NMX-C application, the FM application,
       and the NVLSM application, with no standalone software installation required
       for these components."
Reference: https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html

NMX-Controller (NMX-C) - Provides Global Fabric Manager (GFM):

Quote: "In the GB200 NVL the SDN services are the subnet manager (SM) and
       global fabric manager (GFM)"
Reference: https://docs.nvidia.com/networking/display/nmxcv11/nmx-controller

### Compute Nodes - Run NVSM (NVIDIA System Management)

NVSM provides system management and exposes fabric health via NVML APIs:

On GB200 compute nodes:

  • Traditional fabric-manager daemon (port 6666) does NOT run
  • NMX services do NOT run (they run on switch trays, not compute nodes)
  • Fabric management is handled by NVOS on the switch trays
  • NVSM handles system management and exposes fabric health telemetry via NVML

Attempting to start traditional fabric-manager on GB200 fails with NV_WARN_NOTHING_TO_DO because no NVSwitch kernel driver/devices are present on compute nodes. Reference: https://github.com/NVIDIA/gpu-operator/issues/610

Fabric Health Monitoring via NVML

For GB200 and newer GPUs that support fabric state telemetry, this component uses:

  • nvmlDeviceGetGpuFabricInfo() for basic fabric info (V1 API)
  • nvmlDeviceGetGpuFabricInfoV().V3() for detailed health metrics (V3 API)

The V3 API provides comprehensive health information including:

  • Clique ID and Cluster UUID
  • Fabric state (Not Started, In Progress, Completed)
  • Health summary (Healthy, Unhealthy, Limited Capacity)
  • Detailed health mask covering:
      • Bandwidth status (Full, Degraded)
      • Route recovery progress
      • Route health status
      • Access timeout recovery
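The V3 health mask packs these conditions as small bit fields in a single integer. The sketch below decodes such a mask; the shift positions and the NOT_SUPPORTED/TRUE/FALSE encoding follow the general nvml.h pattern but are assumptions for illustration, not the authoritative NVML definitions.

```go
package main

import "fmt"

// Hypothetical 2-bit field layout for the V3 health mask. The shift and
// value constants below are assumptions, not the official NVML values.
const (
	shiftDegradedBW     = 0 // bandwidth status
	shiftRouteRecovery  = 2 // route recovery progress
	shiftRouteUnhealthy = 4 // route health status
	shiftAccessTimeout  = 6 // access timeout recovery

	fieldNotSupported = 0 // telemetry not available on this platform
	fieldTrue         = 1 // condition is present
	fieldFalse        = 2 // condition is absent
)

// field extracts one 2-bit field from the packed health mask.
func field(mask uint32, shift uint) uint32 {
	return (mask >> shift) & 0x3
}

// bandwidthDegraded reports whether the mask flags degraded bandwidth.
func bandwidthDegraded(mask uint32) bool {
	return field(mask, shiftDegradedBW) == fieldTrue
}

func main() {
	// Mask encoding "degraded bandwidth = TRUE, route unhealthy = FALSE".
	mask := uint32(fieldTrue<<shiftDegradedBW | fieldFalse<<shiftRouteUnhealthy)
	fmt.Println(bandwidthDegraded(mask))                       // true
	fmt.Println(field(mask, shiftRouteUnhealthy) == fieldTrue) // false
}
```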

Detection Strategy

This component checks for fabric management/monitoring in the following order:

  1. Check if nvmlDeviceGetGpuFabricInfo* is supported (GB200 NVL72 and newer)
     - If supported, use NVML fabric state APIs for health monitoring
     - This path is taken for systems with NVSM-based fabric telemetry
  2. Check the traditional fabric-manager daemon on port 6666 (pre-NVL5 systems)
     - For DGX A100, DGX H100, HGX A100, HGX H100
     - Validates service activeness and monitors logs for errors
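The ordering above can be sketched as a small selection function. This is an illustrative reduction of the detection strategy to its two inputs, not the package's actual implementation; the strategy names are hypothetical.

```go
package main

import "fmt"

// Hypothetical strategy names for fabric health monitoring.
const (
	strategyNVML          = "nvml-fabric-info"
	strategyFabricManager = "fabric-manager-port-6666"
	strategyNone          = "none"
)

// pickStrategy applies the documented order: prefer the NVML fabric state
// APIs when the GPU supports them (GB200 NVL72 and newer), otherwise fall
// back to checking the traditional fabric-manager daemon on port 6666.
func pickStrategy(nvmlFabricInfoSupported, fabricManagerListening bool) string {
	if nvmlFabricInfoSupported {
		return strategyNVML
	}
	if fabricManagerListening {
		return strategyFabricManager
	}
	return strategyNone
}

func main() {
	fmt.Println(pickStrategy(true, false)) // GB200 path: nvml-fabric-info
	fmt.Println(pickStrategy(false, true)) // DGX A100 path: fabric-manager-port-6666
}
```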

Package fabricmanager provides utilities for detecting and managing NVIDIA NVSwitch devices.

NVSwitch and Fabric Manager

NVIDIA NVSwitch is a physical interconnect switch that enables high-bandwidth GPU-to-GPU communication in multi-GPU systems. Fabric Manager is the user-space software daemon that manages and configures NVSwitch hardware.

Fabric Manager is specifically designed to manage NVSwitch hardware, and NVSwitch hardware requires Fabric Manager to function properly. Without Fabric Manager running:

  • NVSwitch devices remain uninitialized
  • GPU-to-GPU communication through NVSwitch is unavailable
  • Multi-GPU workloads cannot utilize the full NVLink fabric bandwidth

The relationship between them:

  • NVSwitch: Physical hardware switch chips (PCIe bridge devices)
  • Fabric Manager: Software service that initializes, configures, and monitors NVSwitch

Official Documentation

For more information, see the NVIDIA Fabric Manager User Guide and the NVSwitch product documentation.

Supported Systems

Fabric Manager is required for NVSwitch-based systems including:

  • NVIDIA DGX systems (DGX A100, DGX H100, DGX GB200)
  • NVIDIA HGX systems (HGX A100, HGX H100, HGX H200)
  • Systems with multiple GPUs connected via NVSwitch

Detection Methods

This package provides two methods for detecting NVSwitch hardware:

  1. PCIe enumeration via lspci (ListPCINVSwitches function)
  2. nvidia-smi nvlink status query (CountSMINVSwitches function)

The nvidia-smi query serves as a fallback when PCIe enumeration is unavailable or unreliable, ensuring robust NVSwitch detection across different system configurations.

Index

Constants

const DeviceVendorID = "10de"

DeviceVendorID defines the NVIDIA PCI vendor ID. This is used to filter PCI devices to only NVIDIA hardware.

Example usage with lspci:

lspci -nn | grep -i "10de.*"

Reference: https://devicehunt.com/view/type/pci/vendor/10DE

const (

	// EventNVSwitchNothingToDo indicates fabric manager found no NVSwitch devices to manage.
	// e.g., request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]
	// This occurs when fabric manager starts but finds no NVSwitch devices to manage.
	// This is NOT an error condition - it indicates the system does not need fabric manager.
	// Common scenarios: GH200 standalone (no NVSwitch), simple NVLink bridges, PCIe GPUs.
	// Ref: https://forums.developer.nvidia.com/t/nvidia-fabricmanager-running-error-with-nv-warn-nothing-to-do/272899
	// Ref: https://github.com/NVIDIA/gpu-operator/issues/610
	EventNVSwitchNothingToDo = "fabricmanager_nvswitch_nothing_to_do"
)
const (
	// Name is the component name reported by the NVIDIA fabric manager checker.
	Name = "accelerator-nvidia-fabric-manager"
)

Variables

This section is empty.

Functions

func CountSMINVSwitches added in v0.8.0

func CountSMINVSwitches(ctx context.Context) ([]string, error)

CountSMINVSwitches queries nvidia-smi to count GPUs with NVLink connections, which indicates the presence of NVSwitch hardware.

This function uses "nvidia-smi nvlink --status" to enumerate GPUs that have NVLink connections. In systems with NVSwitch, all GPUs will be connected through the NVSwitch fabric.

Returns a list of GPU description lines, where each line represents a GPU with NVLink connectivity. The number of lines indicates the number of GPUs connected to the NVSwitch fabric.

This is used as a fallback detection method when PCIe enumeration (lspci) is unavailable or unreliable.

Example output line:

GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-754035b4-4708-efcd-b261-623aea38bcad)
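Such lines can be picked apart with a small regular expression. The pattern below is an assumption derived from the single example line above, not a guaranteed nvidia-smi output format.

```go
package main

import (
	"fmt"
	"regexp"
)

// gpuLine matches "GPU <index>: <name> (UUID: GPU-...)" lines as emitted
// by "nvidia-smi nvlink --status". The pattern is inferred from the
// example output above and may not cover every firmware variant.
var gpuLine = regexp.MustCompile(`^GPU (\d+): (.+) \(UUID: (GPU-[0-9a-f-]+)\)$`)

func main() {
	line := "GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-754035b4-4708-efcd-b261-623aea38bcad)"
	m := gpuLine.FindStringSubmatch(line)
	fmt.Println(m[1]) // index: 7
	fmt.Println(m[2]) // name: NVIDIA A100-SXM4-80GB
}
```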

func HasNVSwitchFatalSXid added in v0.4.5

func HasNVSwitchFatalSXid(line string) bool

HasNVSwitchFatalSXid reports whether the log line contains an NVSwitch fatal SXid error.

func HasNVSwitchNVLinkFailure added in v0.4.5

func HasNVSwitchNVLinkFailure(line string) bool

HasNVSwitchNVLinkFailure reports whether the log line contains an NVLink failure.

func HasNVSwitchNonFatalSXid added in v0.4.5

func HasNVSwitchNonFatalSXid(line string) bool

HasNVSwitchNonFatalSXid reports whether the log line contains an NVSwitch non-fatal SXid error.

func HasNVSwitchNothingToDo added in v0.10.0

func HasNVSwitchNothingToDo(line string) bool

HasNVSwitchNothingToDo reports whether the log line contains the NV_WARN_NOTHING_TO_DO marker.

func HasNVSwitchTopologyMismatch added in v0.10.0

func HasNVSwitchTopologyMismatch(line string) bool

HasNVSwitchTopologyMismatch reports whether the log line contains a topology mismatch error.

func ListPCINVSwitches added in v0.8.0

func ListPCINVSwitches(ctx context.Context) ([]string, error)

ListPCINVSwitches returns all lspci lines that represent NVIDIA NVSwitch devices.

NVSwitch devices appear as PCI bridge devices in the lspci output. This function enumerates all NVIDIA bridge devices, which typically represent NVSwitch hardware.

NVSwitch is the physical interconnect hardware that connects multiple GPUs in a high-performance fabric. Without NVSwitch, GPUs cannot communicate efficiently in multi-GPU configurations.

Example lspci output for NVSwitch:

Older format (A100-era):
  0005:00:00.0 Bridge [0680]: NVIDIA Corporation Device [10de:1af1] (rev a1)

Newer format (GB200 and later):
  0018:00:00.0 PCI bridge [0604]: NVIDIA Corporation Device [10de:22b1]

This function is the primary method for detecting NVSwitch hardware via PCIe enumeration. The Fabric Manager component uses it to determine whether NVSwitch is present and therefore whether the Fabric Manager service is required.
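One plausible filter covering both lspci formats shown above matches a bridge class code plus the NVIDIA vendor ID 10de. This is a sketch of such a filter, not the package's exact pattern.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nvSwitchLine matches both formats shown above: the A100-era
// `Bridge [0680]` form and the GB200-era `PCI bridge [0604]` form,
// each followed by an NVIDIA [10de:xxxx] vendor:device pair.
var nvSwitchLine = regexp.MustCompile(`(?i)bridge\s*\[06[08]\d\]:.*\[10de:[0-9a-f]{4}\]`)

// isNVSwitch reports whether an lspci -nn line looks like an NVIDIA
// NVSwitch bridge device (illustrative heuristic).
func isNVSwitch(line string) bool {
	return strings.Contains(strings.ToLower(line), "nvidia") &&
		nvSwitchLine.MatchString(line)
}

func main() {
	older := "0005:00:00.0 Bridge [0680]: NVIDIA Corporation Device [10de:1af1] (rev a1)"
	newer := "0018:00:00.0 PCI bridge [0604]: NVIDIA Corporation Device [10de:22b1]"
	fmt.Println(isNVSwitch(older)) // true
	fmt.Println(isNVSwitch(newer)) // true
}
```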

func Match added in v0.4.5

func Match(line string) (eventName string, message string)

Match returns the first known event name and message for a fabric-manager log line.
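A matcher of this shape can be sketched as a table of marker substrings scanned in order. Only the NV_WARN_NOTHING_TO_DO marker and the EventNVSwitchNothingToDo name are taken from the constants above; the table structure and message text are illustrative, not the package's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// matcher pairs a log-line marker with the event it signals.
type matcher struct {
	marker string // substring to look for in the log line
	event  string // event name to report
	msg    string // human-readable message
}

// matchers is scanned in order; the first marker found wins.
var matchers = []matcher{
	{
		marker: "NV_WARN_NOTHING_TO_DO",
		event:  "fabricmanager_nvswitch_nothing_to_do",
		msg:    "fabric manager found no NVSwitch devices to manage",
	},
}

// matchLine returns the first known event name and message for a
// fabric-manager log line, or empty strings if nothing matches.
func matchLine(line string) (eventName, message string) {
	for _, m := range matchers {
		if strings.Contains(line, m.marker) {
			return m.event, m.msg
		}
	}
	return "", ""
}

func main() {
	ev, _ := matchLine("failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]")
	fmt.Println(ev) // fabricmanager_nvswitch_nothing_to_do
}
```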

func New

func New(gpudInstance *components.GPUdInstance) (components.Component, error)

New creates the NVIDIA fabric manager component.

Types

This section is empty.
