Documentation
Overview ¶
Package fabricmanager tracks NVIDIA fabric manager and fabric health monitoring services.
Fabric Management Architecture ¶
NVIDIA systems use different fabric management and monitoring approaches depending on the GPU generation:
Pre-NVL5 Systems (DGX A100, DGX H100, HGX A100, HGX H100) ¶
Traditional nvidia-fabricmanager daemon running on compute nodes:
- Service: nvidia-fabricmanager.service
- Port: 6666 (FM_CMD_PORT_NUMBER)
- Architecture: Userspace daemon managing NVSwitch kernel driver
- Requires: /dev/nvidia-switch* devices via kernel driver
- Monitoring: Service activeness via port check
- Reference: https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/
NVL5+ Systems (GB200 NVL72) ¶
Distributed fabric management architecture with NVML-based health monitoring:
NVLink Switch Trays - Run NVOS (NVSwitch Operating System) ¶
NVOS includes integrated fabric management services:
Quote: "NVOS includes the NVLink Subnet Manager (NVLSM), the Fabric Manager (FM),
NMX services such as NMX-Controller and NMX-Telemetry, and the NVSwitch firmware."
Reference: https://docs.nvidia.com/networking/display/nvidianvosusermanualfornvlinkswitchesv25021884/cluster+management
Quote: "NVOS software image includes the NMX-C application, the FM application,
and the NVLSM application, with no standalone software installation required
for these components."
Reference: https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html
NMX-Controller (NMX-C) - Provides Global Fabric Manager (GFM):
Quote: "In the GB200 NVL the SDN services are the subnet manager (SM) and
global fabric manager (GFM)"
Reference: https://docs.nvidia.com/networking/display/nmxcv11/nmx-controller
Compute Nodes - Run NVSM (NVIDIA System Management) ¶
NVSM provides system management and exposes fabric health via NVML APIs:
- Services: nvsm-core.service, nvsm-api-gateway.service
- Port: 273 (nvsm-api-gateway REST API)
- Function: Monitors system health, exposes fabric state via nvmlDeviceGetGpuFabricInfo*
- Reference: https://docs.nvidia.com/datacenter/nvsm/nvsm-user-guide/latest/
- Reference: https://docs.nvidia.com/dgx/dgxgb200-user-guide/software.html
On GB200 compute nodes:
- Traditional fabric-manager daemon (port 6666) does NOT run
- NMX services do NOT run (they run on switch trays, not compute nodes)
- Fabric management is handled by NVOS on the switch trays
- NVSM handles system management and exposes fabric health telemetry via NVML
Attempting to start traditional fabric-manager on GB200 fails with NV_WARN_NOTHING_TO_DO because no NVSwitch kernel driver/devices are present on compute nodes. Reference: https://github.com/NVIDIA/gpu-operator/issues/610
Fabric Health Monitoring via NVML ¶
For GB200 and newer GPUs that support fabric state telemetry, this component uses:
- nvmlDeviceGetGpuFabricInfo() for basic fabric info (V1 API)
- nvmlDeviceGetGpuFabricInfoV().V3() for detailed health metrics (V3 API)
The V3 API provides comprehensive health information including:
- Clique ID and Cluster UUID
- Fabric state (Not Started, In Progress, Completed)
- Health summary (Healthy, Unhealthy, Limited Capacity)
- Detailed health mask covering:
  - Bandwidth status (Full, Degraded)
  - Route recovery progress
  - Route health status
  - Access timeout recovery
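The detailed health mask packs several small fields into a single value. The sketch below shows the generic shift-and-mask extraction; the shift and width constants are placeholders, not the actual NVML bit layout, which is defined by the NVML headers.

```go
package main

import "fmt"

// healthField extracts a width-bit field starting at bit position shift
// from a packed health mask. The real field positions for bandwidth,
// route recovery, route health, and access timeout recovery come from
// the NVML headers; the values used in main are illustrative only.
func healthField(mask uint32, shift, width uint) uint32 {
	return (mask >> shift) & ((1 << width) - 1)
}

func main() {
	// Hypothetical layout: a 2-bit bandwidth-status field at bits 0-1.
	const bwShift, bwWidth = 0, 2
	mask := uint32(0b10) // example packed mask value
	fmt.Println(healthField(mask, bwShift, bwWidth)) // 2
}
```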
Detection Strategy ¶
This component checks for fabric management/monitoring in the following order:
- Check if nvmlDeviceGetGpuFabricInfo* is supported (GB200 NVL72 and newer)
  - If supported, use NVML fabric state APIs for health monitoring
  - This path is taken for systems with NVSM-based fabric telemetry
- Check traditional fabric-manager on port 6666 (Pre-NVL5 systems)
  - For DGX A100, DGX H100, HGX A100, HGX H100
  - Validates service activeness and monitors logs for errors
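The ordered strategy can be sketched as two probes tried in sequence, with the NVML path preferred. The probe signatures and mode names below are illustrative, not the component's actual API.

```go
package main

import "fmt"

// detectFabricMode picks a monitoring mode by trying probes in order:
// NVML fabric-info support first (GB200 NVL72 and newer), then the
// legacy fabric-manager port check (pre-NVL5 systems).
func detectFabricMode(nvmlFabricSupported, fmPortActive func() bool) string {
	if nvmlFabricSupported() {
		return "nvml-fabric-state" // NVSM-based fabric telemetry path
	}
	if fmPortActive() {
		return "fabric-manager-daemon" // traditional daemon on port 6666
	}
	return "none" // no fabric management needed (e.g. no NVSwitch)
}

func main() {
	mode := detectFabricMode(
		func() bool { return true },  // stub: NVML fabric info supported
		func() bool { return false }, // stub: port 6666 closed
	)
	fmt.Println(mode) // nvml-fabric-state
}
```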
Package fabricmanager also provides utilities for detecting and managing NVIDIA NVSwitch devices.
NVSwitch and Fabric Manager ¶
NVIDIA NVSwitch is a physical interconnect switch that enables high-bandwidth GPU-to-GPU communication in multi-GPU systems. Fabric Manager is the user-space software daemon that manages and configures NVSwitch hardware.
Fabric Manager is specifically designed to manage NVSwitch hardware, and NVSwitch hardware requires Fabric Manager to function properly. Without Fabric Manager running:
- NVSwitch devices remain uninitialized
- GPU-to-GPU communication through NVSwitch is unavailable
- Multi-GPU workloads cannot utilize the full NVLink fabric bandwidth
The relationship between them:
- NVSwitch: Physical hardware switch chips (PCIe bridge devices)
- Fabric Manager: Software service that initializes, configures, and monitors NVSwitch
Official Documentation ¶
For more information about Fabric Manager and NVSwitch, see:
- NVIDIA Fabric Manager User Guide: https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
- NVIDIA NVSwitch Documentation: https://www.nvidia.com/en-us/data-center/nvlink/
Supported Systems ¶
Fabric Manager is required for NVSwitch-based systems including:
- NVIDIA DGX systems (DGX A100, DGX H100, DGX GB200)
- NVIDIA HGX systems (HGX A100, HGX H100, HGX H200)
- Systems with multiple GPUs connected via NVSwitch
Detection Methods ¶
This package provides two methods for detecting NVSwitch hardware:
- PCIe enumeration via lspci (ListPCINVSwitches function)
- nvidia-smi nvlink status query (CountSMINVSwitches function)
The nvidia-smi query serves as a fallback to PCIe enumeration, ensuring robust NVSwitch detection across different system configurations.
Index ¶
- Constants
- func CountSMINVSwitches(ctx context.Context) ([]string, error)
- func HasNVSwitchFatalSXid(line string) bool
- func HasNVSwitchNVLinkFailure(line string) bool
- func HasNVSwitchNonFatalSXid(line string) bool
- func HasNVSwitchNothingToDo(line string) bool
- func HasNVSwitchTopologyMismatch(line string) bool
- func ListPCINVSwitches(ctx context.Context) ([]string, error)
- func Match(line string) (eventName string, message string)
- func New(gpudInstance *components.GPUdInstance) (components.Component, error)
Constants ¶
const DeviceVendorID = "10de"
DeviceVendorID defines the NVIDIA PCI vendor ID. This is used to filter PCI devices to only NVIDIA hardware.
Example usage with lspci:
lspci -nn | grep -i "10de.*"
Reference: https://devicehunt.com/view/type/pci/vendor/10DE
const (
	// EventNVSwitchNothingToDo indicates fabric manager found no NVSwitch devices to manage.
	// e.g., request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]
	// This occurs when fabric manager starts but finds no NVSwitch devices to manage.
	// This is NOT an error condition - it indicates the system does not need fabric manager.
	// Common scenarios: GH200 standalone (no NVSwitch), simple NVLink bridges, PCIe GPUs.
	// Ref: https://forums.developer.nvidia.com/t/nvidia-fabricmanager-running-error-with-nv-warn-nothing-to-do/272899
	// Ref: https://github.com/NVIDIA/gpu-operator/issues/610
	EventNVSwitchNothingToDo = "fabricmanager_nvswitch_nothing_to_do"
)
const (
// Name is the component name reported by the NVIDIA fabric manager checker.
Name = "accelerator-nvidia-fabric-manager"
)
Variables ¶
This section is empty.
Functions ¶
func CountSMINVSwitches ¶ added in v0.8.0
func CountSMINVSwitches(ctx context.Context) ([]string, error)
CountSMINVSwitches queries nvidia-smi to count GPUs with NVLink connections, which indicates the presence of NVSwitch hardware.
This function uses "nvidia-smi nvlink --status" to enumerate GPUs that have NVLink connections. In systems with NVSwitch, all GPUs will be connected through the NVSwitch fabric.
Returns a list of GPU description lines, where each line represents a GPU with NVLink connectivity. The number of lines indicates the number of GPUs connected to the NVSwitch fabric.
This is used as a fallback detection method when PCIe enumeration (lspci) is unavailable or unreliable.
Example output line:
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-754035b4-4708-efcd-b261-623aea38bcad)
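Counting such lines amounts to a simple prefix filter. The sketch below assumes each NVLink-connected GPU contributes one "GPU N:" header line, as in the example output above; the helper name is illustrative, not the package's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// countNVLinkGPUs collects the "GPU N: ..." header lines from
// "nvidia-smi nvlink --status" output, one per NVLink-connected GPU.
// Per-link detail lines (e.g. "Link 0: ...") are skipped.
func countNVLinkGPUs(output string) []string {
	var gpus []string
	for _, line := range strings.Split(output, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "GPU ") && strings.Contains(line, ":") {
			gpus = append(gpus, line)
		}
	}
	return gpus
}

func main() {
	out := "GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-a)\n" +
		"\t Link 0: 25 GB/s\n" +
		"GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-b)"
	fmt.Println(len(countNVLinkGPUs(out))) // 2
}
```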
func HasNVSwitchFatalSXid ¶ added in v0.4.5
func HasNVSwitchFatalSXid(line string) bool
HasNVSwitchFatalSXid reports whether the log line contains an NVSwitch fatal SXid error.
func HasNVSwitchNVLinkFailure ¶ added in v0.4.5
func HasNVSwitchNVLinkFailure(line string) bool
HasNVSwitchNVLinkFailure reports whether the log line contains an NVLink failure.
func HasNVSwitchNonFatalSXid ¶ added in v0.4.5
func HasNVSwitchNonFatalSXid(line string) bool
HasNVSwitchNonFatalSXid reports whether the log line contains an NVSwitch non-fatal SXid error.
func HasNVSwitchNothingToDo ¶ added in v0.10.0
func HasNVSwitchNothingToDo(line string) bool
HasNVSwitchNothingToDo reports whether the log line contains the NV_WARN_NOTHING_TO_DO marker.
func HasNVSwitchTopologyMismatch ¶ added in v0.10.0
func HasNVSwitchTopologyMismatch(line string) bool
HasNVSwitchTopologyMismatch reports whether the log line contains a topology mismatch error.
func ListPCINVSwitches ¶ added in v0.8.0
func ListPCINVSwitches(ctx context.Context) ([]string, error)
ListPCINVSwitches returns all lspci lines that represent NVIDIA NVSwitch devices.
NVSwitch devices appear as PCI bridge devices in the lspci output. This function enumerates all NVIDIA bridge devices which typically represent NVSwitch hardware.
NVSwitch is the physical interconnect hardware that connects multiple GPUs in a high-performance fabric. Without NVSwitch, GPUs cannot communicate efficiently in multi-GPU configurations.
Example lspci output for NVSwitch:
Older format (A100-era):

	0005:00:00.0 Bridge [0680]: NVIDIA Corporation Device [10de:1af1] (rev a1)

Newer format (GB200 and later):

	0018:00:00.0 PCI bridge [0604]: NVIDIA Corporation Device [10de:22b1]
This function is the primary method for detecting NVSwitch hardware via PCIe enumeration. It's used by the Fabric Manager component to determine if NVSwitch is present and therefore if Fabric Manager service is required.
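One plausible shape for this filtering is to match on the NVIDIA vendor ID (10de) together with a bridge class string, which covers both lspci formats shown above; the actual matching rules of ListPCINVSwitches may differ, and the helper name here is illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// isNVSwitchLine reports whether an `lspci -nn` line looks like an
// NVIDIA bridge device: it must carry the NVIDIA vendor ID (10de) and a
// bridge class ("Bridge" for A100-era, "PCI bridge" for GB200 and later).
// Plain NVIDIA GPUs (class "3D controller") are excluded.
func isNVSwitchLine(line string) bool {
	lower := strings.ToLower(line)
	return strings.Contains(lower, "10de") && strings.Contains(lower, "bridge")
}

func main() {
	lines := []string{
		"0005:00:00.0 Bridge [0680]: NVIDIA Corporation Device [10de:1af1] (rev a1)",
		"0018:00:00.0 PCI bridge [0604]: NVIDIA Corporation Device [10de:22b1]",
		"0000:17:00.0 3D controller [0302]: NVIDIA Corporation GA100 [10de:20b2] (rev a1)",
	}
	for _, l := range lines {
		fmt.Println(isNVSwitchLine(l)) // true, true, false
	}
}
```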
func Match ¶ added in v0.4.5
func Match(line string) (eventName string, message string)
Match returns the first known event name and message for a fabric-manager log line.
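One plausible shape for such matching is a substring marker table scanned in order. The NV_WARN_NOTHING_TO_DO marker and its event name come from the constants section above; the function name, table layout, and return convention below are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// matchLine scans a fabric-manager log line against known substring
// markers and returns the first matching event name, or "" if none match.
func matchLine(line string) string {
	markers := []struct{ substr, event string }{
		// Marker and event name taken from the constants section above;
		// a real table would carry one entry per known log pattern.
		{"NV_WARN_NOTHING_TO_DO", "fabricmanager_nvswitch_nothing_to_do"},
	}
	for _, m := range markers {
		if strings.Contains(line, m.substr) {
			return m.event
		}
	}
	return ""
}

func main() {
	line := "failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]"
	fmt.Println(matchLine(line)) // fabricmanager_nvswitch_nothing_to_do
}
```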
func New ¶
func New(gpudInstance *components.GPUdInstance) (components.Component, error)
New creates the NVIDIA fabric manager component.
Types ¶
This section is empty.