Documentation
Overview ¶
Package fabricmanager tracks NVIDIA fabric manager and fabric health monitoring services.
Fabric Management Architecture ¶
NVIDIA systems use different fabric management and monitoring approaches depending on the GPU generation:
Pre-NVL5 Systems (DGX A100, DGX H100, HGX A100, HGX H100) ¶
Traditional nvidia-fabricmanager daemon running on compute nodes:
- Service: nvidia-fabricmanager.service
- Port: 6666 (FM_CMD_PORT_NUMBER)
- Architecture: Userspace daemon managing NVSwitch kernel driver
- Requires: /dev/nvidia-switch* devices via kernel driver
- Monitoring: Service activeness via port check
- Reference: https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/
NVL5+ Systems (GB200 NVL72) ¶
Distributed fabric management architecture with NVML-based health monitoring:
NVLink Switch Trays - Run NVOS (NVSwitch Operating System) ¶
NVOS includes integrated fabric management services:
Quote: "NVOS includes the NVLink Subnet Manager (NVLSM), the Fabric Manager (FM),
NMX services such as NMX-Controller and NMX-Telemetry, and the NVSwitch firmware."
Reference: https://docs.nvidia.com/networking/display/nvidianvosusermanualfornvlinkswitchesv25021884/cluster+management
Quote: "NVOS software image includes the NMX-C application, the FM application,
and the NVLSM application, with no standalone software installation required
for these components."
Reference: https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html
NMX-Controller (NMX-C) - Provides Global Fabric Manager (GFM):
Quote: "In the GB200 NVL the SDN services are the subnet manager (SM) and
global fabric manager (GFM)"
Reference: https://docs.nvidia.com/networking/display/nmxcv11/nmx-controller
Compute Nodes - Run NVSM (NVIDIA System Management) ¶
NVSM provides system management and exposes fabric health via NVML APIs:
- Services: nvsm-core.service, nvsm-api-gateway.service
- Port: 273 (nvsm-api-gateway REST API)
- Function: Monitors system health, exposes fabric state via nvmlDeviceGetGpuFabricInfo*
- Reference: https://docs.nvidia.com/datacenter/nvsm/nvsm-user-guide/latest/
- Reference: https://docs.nvidia.com/dgx/dgxgb200-user-guide/software.html
On GB200 compute nodes:
- Traditional fabric-manager daemon (port 6666) does NOT run
- NMX services do NOT run (they run on switch trays, not compute nodes)
- Fabric management is handled by NVOS on the switch trays
- NVSM handles system management and exposes fabric health telemetry via NVML
Attempting to start traditional fabric-manager on GB200 fails with NV_WARN_NOTHING_TO_DO because no NVSwitch kernel driver/devices are present on compute nodes. Reference: https://github.com/NVIDIA/gpu-operator/issues/610
Fabric Health Monitoring via NVML ¶
For GB200 and newer GPUs that support fabric state telemetry, this component uses:
- nvmlDeviceGetGpuFabricInfo() for basic fabric info (V1 API)
- nvmlDeviceGetGpuFabricInfoV().V3() for detailed health metrics (V3 API)
The V3 API provides comprehensive health information including:
- Clique ID and Cluster UUID
- Fabric state (Not Started, In Progress, Completed)
- Health summary (Healthy, Unhealthy, Limited Capacity)
- Detailed health mask covering:
  - Bandwidth status (Full, Degraded)
  - Route recovery progress
  - Route health status
  - Access timeout recovery
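The detailed health mask packs several small fields into a single value. The sketch below shows the generic shift-and-mask extraction; the shift and width constants are placeholders, not the actual NVML bit layout, which is defined by the NVML headers.

```go
package main

import "fmt"

// healthField extracts a width-bit field starting at bit position shift
// from a packed health mask. The real field positions for bandwidth,
// route recovery, route health, and access timeout recovery come from
// the NVML headers; the values used in main are illustrative only.
func healthField(mask uint32, shift, width uint) uint32 {
	return (mask >> shift) & ((1 << width) - 1)
}

func main() {
	// Hypothetical layout: a 2-bit bandwidth-status field at bits 0-1.
	const bwShift, bwWidth = 0, 2
	mask := uint32(0b10) // example packed mask value
	fmt.Println(healthField(mask, bwShift, bwWidth)) // 2
}
```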
Detection Strategy ¶
This component checks for fabric management/monitoring in the following order:
- Check if nvmlDeviceGetGpuFabricInfo* is supported (GB200 NVL72 and newer)
  - If supported, use NVML fabric state APIs for health monitoring
  - This path is taken for systems with NVSM-based fabric telemetry
- Check traditional fabric-manager on port 6666 (Pre-NVL5 systems)
  - For DGX A100, DGX H100, HGX A100, HGX H100
  - Validates service activeness and monitors logs for errors
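The ordered strategy can be sketched as two probes tried in sequence, with the NVML path preferred. The probe signatures and mode names below are illustrative, not the component's actual API.

```go
package main

import "fmt"

// detectFabricMode picks a monitoring mode by trying probes in order:
// NVML fabric-info support first (GB200 NVL72 and newer), then the
// legacy fabric-manager port check (pre-NVL5 systems).
func detectFabricMode(nvmlFabricSupported, fmPortActive func() bool) string {
	if nvmlFabricSupported() {
		return "nvml-fabric-state" // NVSM-based fabric telemetry path
	}
	if fmPortActive() {
		return "fabric-manager-daemon" // traditional daemon on port 6666
	}
	return "none" // no fabric management needed (e.g. no NVSwitch)
}

func main() {
	mode := detectFabricMode(
		func() bool { return true },  // stub: NVML fabric info supported
		func() bool { return false }, // stub: port 6666 closed
	)
	fmt.Println(mode) // nvml-fabric-state
}
```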
Package fabricmanager also provides utilities for detecting and managing NVIDIA NVSwitch devices.
NVSwitch and Fabric Manager ¶
NVIDIA NVSwitch is a physical interconnect switch that enables high-bandwidth GPU-to-GPU communication in multi-GPU systems. Fabric Manager is the user-space software daemon that manages and configures NVSwitch hardware.
Fabric Manager is specifically designed to manage NVSwitch hardware, and NVSwitch hardware requires Fabric Manager to function properly. Without Fabric Manager running:
- NVSwitch devices remain uninitialized
- GPU-to-GPU communication through NVSwitch is unavailable
- Multi-GPU workloads cannot utilize the full NVLink fabric bandwidth
The relationship between them:
- NVSwitch: Physical hardware switch chips (PCIe bridge devices)
- Fabric Manager: Software service that initializes, configures, and monitors NVSwitch
Official Documentation ¶
For more information about Fabric Manager and NVSwitch, see:
- NVIDIA Fabric Manager User Guide: https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
- NVIDIA NVSwitch Documentation: https://www.nvidia.com/en-us/data-center/nvlink/
Supported Systems ¶
Fabric Manager is required for NVSwitch-based systems including:
- NVIDIA DGX systems (DGX A100, DGX H100, DGX GB200)
- NVIDIA HGX systems (HGX A100, HGX H100, HGX H200)
- Systems with multiple GPUs connected via NVSwitch
Detection Methods ¶
This package provides two methods for detecting NVSwitch hardware:
- PCIe enumeration via lspci (ListPCINVSwitches function)
- nvidia-smi nvlink status query (CountSMINVSwitches function)
The nvidia-smi query serves as a fallback to PCIe enumeration, ensuring robust NVSwitch detection across different system configurations.
Index ¶
- Constants
- func CountSMINVSwitches(ctx context.Context) ([]string, error)
- func HasNVSwitchFatalSXid(line string) bool
- func HasNVSwitchNVLinkFailure(line string) bool
- func HasNVSwitchNonFatalSXid(line string) bool
- func HasNVSwitchNothingToDo(line string) bool
- func HasNVSwitchTopologyMismatch(line string) bool
- func ListPCINVSwitches(ctx context.Context) ([]string, error)
- func Match(line string) (eventName string, message string)
- func New(gpudInstance *components.GPUdInstance) (components.Component, error)
Constants ¶
const DeviceVendorID = "10de"
DeviceVendorID defines the NVIDIA PCI vendor ID. This is used to filter PCI devices to only NVIDIA hardware.
Example usage with lspci:
lspci -nn | grep -i "10de.*"
Reference: https://devicehunt.com/view/type/pci/vendor/10DE
const (
	// EventNVSwitchNothingToDo indicates fabric manager found no NVSwitch devices to manage.
	// e.g., request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]
	// This occurs when fabric manager starts but finds no NVSwitch devices to manage.
	// This is NOT an error condition - it indicates the system does not need fabric manager.
	// Common scenarios: GH200 standalone (no NVSwitch), simple NVLink bridges, PCIe GPUs.
	// Ref: https://forums.developer.nvidia.com/t/nvidia-fabricmanager-running-error-with-nv-warn-nothing-to-do/272899
	// Ref: https://github.com/NVIDIA/gpu-operator/issues/610
	EventNVSwitchNothingToDo = "fabricmanager_nvswitch_nothing_to_do"
)
const (
// Name is the component name reported by the NVIDIA fabric manager checker.
Name = "accelerator-nvidia-fabric-manager"
)
Variables ¶
This section is empty.
Functions ¶
func CountSMINVSwitches ¶ added in v0.8.0
func CountSMINVSwitches(ctx context.Context) ([]string, error)
CountSMINVSwitches queries nvidia-smi to count GPUs with NVLink connections, which indicates the presence of NVSwitch hardware.
This function uses "nvidia-smi nvlink --status" to enumerate GPUs that have NVLink connections. In systems with NVSwitch, all GPUs will be connected through the NVSwitch fabric.
Returns a list of GPU description lines, where each line represents a GPU with NVLink connectivity. The number of lines indicates the number of GPUs connected to the NVSwitch fabric.
This is used as a fallback detection method when PCIe enumeration (lspci) is unavailable or unreliable.
Example output line:
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-754035b4-4708-efcd-b261-623aea38bcad)
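Counting such lines amounts to a simple prefix filter. The sketch below assumes each NVLink-connected GPU contributes one "GPU N:" header line, as in the example output above; the helper name is illustrative, not the package's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// countNVLinkGPUs collects the "GPU N: ..." header lines from
// "nvidia-smi nvlink --status" output, one per NVLink-connected GPU.
// Per-link detail lines (e.g. "Link 0: ...") are skipped.
func countNVLinkGPUs(output string) []string {
	var gpus []string
	for _, line := range strings.Split(output, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "GPU ") && strings.Contains(line, ":") {
			gpus = append(gpus, line)
		}
	}
	return gpus
}

func main() {
	out := "GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-a)\n" +
		"\t Link 0: 25 GB/s\n" +
		"GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-b)"
	fmt.Println(len(countNVLinkGPUs(out))) // 2
}
```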
func HasNVSwitchFatalSXid ¶ added in v0.4.5
func HasNVSwitchFatalSXid(line string) bool
HasNVSwitchFatalSXid reports whether the log line contains an NVSwitch fatal SXid error.
func HasNVSwitchNVLinkFailure ¶ added in v0.4.5
func HasNVSwitchNVLinkFailure(line string) bool
HasNVSwitchNVLinkFailure reports whether the log line contains an NVLink failure.
func HasNVSwitchNonFatalSXid ¶ added in v0.4.5
func HasNVSwitchNonFatalSXid(line string) bool
HasNVSwitchNonFatalSXid reports whether the log line contains an NVSwitch non-fatal SXid error.
func HasNVSwitchNothingToDo ¶ added in v0.10.0
func HasNVSwitchNothingToDo(line string) bool
HasNVSwitchNothingToDo reports whether the log line contains the NV_WARN_NOTHING_TO_DO marker.
func HasNVSwitchTopologyMismatch ¶ added in v0.10.0
func HasNVSwitchTopologyMismatch(line string) bool
HasNVSwitchTopologyMismatch reports whether the log line contains a topology mismatch error.
func ListPCINVSwitches ¶ added in v0.8.0
func ListPCINVSwitches(ctx context.Context) ([]string, error)
ListPCINVSwitches returns all lspci lines that represent NVIDIA NVSwitch devices.
NVSwitch devices appear as PCI bridge devices in the lspci output. This function enumerates all NVIDIA bridge devices which typically represent NVSwitch hardware.
NVSwitch is the physical interconnect hardware that connects multiple GPUs in a high-performance fabric. Without NVSwitch, GPUs cannot communicate efficiently in multi-GPU configurations.
Example lspci output for NVSwitch:
Older format (A100-era):

	0005:00:00.0 Bridge [0680]: NVIDIA Corporation Device [10de:1af1] (rev a1)

Newer format (GB200 and later):

	0018:00:00.0 PCI bridge [0604]: NVIDIA Corporation Device [10de:22b1]
This function is the primary method for detecting NVSwitch hardware via PCIe enumeration. It's used by the Fabric Manager component to determine if NVSwitch is present and therefore if Fabric Manager service is required.
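One plausible shape for this filtering is to match on the NVIDIA vendor ID (10de) together with a bridge class string, which covers both lspci formats shown above; the actual matching rules of ListPCINVSwitches may differ, and the helper name here is illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// isNVSwitchLine reports whether an `lspci -nn` line looks like an
// NVIDIA bridge device: it must carry the NVIDIA vendor ID (10de) and a
// bridge class ("Bridge" for A100-era, "PCI bridge" for GB200 and later).
// Plain NVIDIA GPUs (class "3D controller") are excluded.
func isNVSwitchLine(line string) bool {
	lower := strings.ToLower(line)
	return strings.Contains(lower, "10de") && strings.Contains(lower, "bridge")
}

func main() {
	lines := []string{
		"0005:00:00.0 Bridge [0680]: NVIDIA Corporation Device [10de:1af1] (rev a1)",
		"0018:00:00.0 PCI bridge [0604]: NVIDIA Corporation Device [10de:22b1]",
		"0000:17:00.0 3D controller [0302]: NVIDIA Corporation GA100 [10de:20b2] (rev a1)",
	}
	for _, l := range lines {
		fmt.Println(isNVSwitchLine(l)) // true, true, false
	}
}
```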
func Match ¶ added in v0.4.5
func Match(line string) (eventName string, message string)
Match returns the first known event name and message for a fabric-manager log line.
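One plausible shape for such matching is a substring marker table scanned in order. The NV_WARN_NOTHING_TO_DO marker and its event name come from the constants section above; the function name, table layout, and return convention below are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// matchLine scans a fabric-manager log line against known substring
// markers and returns the first matching event name, or "" if none match.
func matchLine(line string) string {
	markers := []struct{ substr, event string }{
		// Marker and event name taken from the constants section above;
		// a real table would carry one entry per known log pattern.
		{"NV_WARN_NOTHING_TO_DO", "fabricmanager_nvswitch_nothing_to_do"},
	}
	for _, m := range markers {
		if strings.Contains(line, m.substr) {
			return m.event
		}
	}
	return ""
}

func main() {
	line := "failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]"
	fmt.Println(matchLine(line)) // fabricmanager_nvswitch_nothing_to_do
}
```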
func New ¶
func New(gpudInstance *components.GPUdInstance) (components.Component, error)
New creates the NVIDIA fabric manager component.
Types ¶
This section is empty.