nvidia_smi

package
v1.23.3
Published: Jul 25, 2022 | License: MIT | Imports: 12 | Imported by: 5

README

Nvidia System Management Interface (SMI) Input Plugin

This plugin queries the nvidia-smi binary to pull GPU stats, including memory and GPU usage, temperature, and other metrics.

Configuration

# Pulls statistics from nvidia GPUs attached to the host
[[inputs.nvidia_smi]]
  ## Optional: path to nvidia-smi binary, defaults to "/usr/bin/nvidia-smi"
  ## The binary is first looked up at the explicitly specified (or default) path.
  ## If it is not found there, it is searched for on PATH (exec.LookPath).
  ## If it is still not found, an error is returned.
  # bin_path = "/usr/bin/nvidia-smi"

  ## Optional: timeout for GPU polling
  # timeout = "5s"
Linux

On Linux, nvidia-smi is generally located at /usr/bin/nvidia-smi

Windows

On Windows, nvidia-smi is generally located at C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe. On Windows 10, you may also find it at C:\Windows\System32\nvidia-smi.exe.

You'll need to escape the \ within the telegraf.conf like this: C:\\Program Files\\NVIDIA Corporation\\NVSMI\\nvidia-smi.exe

Metrics

  • measurement: nvidia_smi
    • tags
      • name (type of GPU e.g. GeForce GTX 1070 Ti)
      • compute_mode (The compute mode of the GPU e.g. Default)
      • index (The port index where the GPU is connected to the motherboard e.g. 1)
      • pstate (Performance state of the GPU e.g. P0)
      • uuid (A unique identifier for the GPU e.g. GPU-f9ba66fc-a7f5-94c5-da19-019ef2f9c665)
    • fields
      • fan_speed (integer, percentage)
      • fbc_stats_session_count (integer)
      • fbc_stats_average_fps (integer)
      • fbc_stats_average_latency (integer)
      • memory_free (integer, MiB)
      • memory_used (integer, MiB)
      • memory_total (integer, MiB)
      • power_draw (float, W)
      • temperature_gpu (integer, degrees C)
      • utilization_gpu (integer, percentage)
      • utilization_memory (integer, percentage)
      • utilization_encoder (integer, percentage)
      • utilization_decoder (integer, percentage)
      • pcie_link_gen_current (integer)
      • pcie_link_width_current (integer)
      • encoder_stats_session_count (integer)
      • encoder_stats_average_fps (integer)
      • encoder_stats_average_latency (integer)
      • clocks_current_graphics (integer, MHz)
      • clocks_current_sm (integer, MHz)
      • clocks_current_memory (integer, MHz)
      • clocks_current_video (integer, MHz)
      • driver_version (string)
      • cuda_version (string)
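The fields above are typed as integers and floats, while the types documented for this package hold the raw values as strings. A minimal sketch of one way to convert such values is shown here. It assumes the raw strings look like `"8112 MiB"` or `"61.25 W"` (a number, a space, then a unit) and that unavailable values appear as `"N/A"`; `parseInt` and `parseFloat` are hypothetical helpers, not functions exported by the plugin.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// firstToken returns the first whitespace-separated token of raw.
func firstToken(raw string) (string, error) {
	fields := strings.Fields(raw)
	if len(fields) == 0 {
		return "", fmt.Errorf("empty value")
	}
	return fields[0], nil
}

// parseInt extracts the leading integer from a value such as "8112 MiB".
// The "<number> <unit>" layout is an assumption about the raw strings.
func parseInt(raw string) (int64, error) {
	tok, err := firstToken(raw)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(tok, 10, 64)
}

// parseFloat extracts the leading float from a value such as "61.25 W".
func parseFloat(raw string) (float64, error) {
	tok, err := firstToken(raw)
	if err != nil {
		return 0, err
	}
	return strconv.ParseFloat(tok, 64)
}

func main() {
	mem, _ := parseInt("8112 MiB")
	power, _ := parseFloat("61.25 W")
	_, err := parseInt("N/A") // non-numeric values yield an error
	fmt.Println(mem, power, err != nil) // prints: 8112 61.25 true
}
```

Returning an error for non-numeric input lets the caller skip a field instead of emitting a misleading zero.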

Sample Query

The query below could be used to alert on the average temperature of your GPUs, computed in one-minute buckets over the last 5 minutes.

SELECT mean("temperature_gpu") FROM "nvidia_smi" WHERE time > now() - 5m GROUP BY time(1m), "index", "name", "host"

Troubleshooting

Check the full output by running the nvidia-smi binary manually.

Linux:

sudo -u telegraf -- /usr/bin/nvidia-smi -q -x

Windows:

"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe" -q -x

Please include the output of this command when opening a GitHub issue.

Example Output

nvidia_smi,compute_mode=Default,host=8218cf,index=0,name=GeForce\ GTX\ 1070,pstate=P2,uuid=GPU-823bc202-6279-6f2c-d729-868a30f14d96 fan_speed=100i,memory_free=7563i,memory_total=8112i,memory_used=549i,temperature_gpu=53i,utilization_gpu=100i,utilization_memory=90i 1523991122000000000
nvidia_smi,compute_mode=Default,host=8218cf,index=1,name=GeForce\ GTX\ 1080,pstate=P2,uuid=GPU-f9ba66fc-a7f5-94c5-da19-019ef2f9c665 fan_speed=100i,memory_free=7557i,memory_total=8114i,memory_used=557i,temperature_gpu=50i,utilization_gpu=100i,utilization_memory=85i 1523991122000000000
nvidia_smi,compute_mode=Default,host=8218cf,index=2,name=GeForce\ GTX\ 1080,pstate=P2,uuid=GPU-d4cfc28d-0481-8d07-b81a-ddfc63d74adf fan_speed=100i,memory_free=7557i,memory_total=8114i,memory_used=557i,temperature_gpu=58i,utilization_gpu=100i,utilization_memory=86i 1523991122000000000

Limitations

Note that there seems to be an issue with getting current memory clock values when the memory is overclocked. This may or may not apply to everyone but it's confirmed to be an issue on an EVGA 2080 Ti.

NOTE: To use this plugin with Docker, either build a custom image based on nvidia/cuda that also installs the telegraf package, or use a volume mount to inject the required binary into the container.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type ClockStats

type ClockStats struct {
	Graphics string `xml:"graphics_clock"` // int
	SM       string `xml:"sm_clock"`       // int
	Memory   string `xml:"mem_clock"`      // int
	Video    string `xml:"video_clock"`    // int
}

ClockStats defines the structure of the clocks portion of the smi output.

type EncoderStats

type EncoderStats struct {
	SessionCount   string `xml:"session_count"`   // int
	AverageFPS     string `xml:"average_fps"`     // int
	AverageLatency string `xml:"average_latency"` // int
}

EncoderStats defines the structure of the encoder_stats portion of the smi output.

type FBCStats added in v1.15.0

type FBCStats struct {
	SessionCount   string `xml:"session_count"`   // int
	AverageFPS     string `xml:"average_fps"`     // int
	AverageLatency string `xml:"average_latency"` // int
}

FBCStats defines the structure of the fbc_stats portion of the smi output.

type GPU

type GPU []struct {
	FanSpeed    string           `xml:"fan_speed"` // int
	Memory      MemoryStats      `xml:"fb_memory_usage"`
	PState      string           `xml:"performance_state"`
	Temp        TempStats        `xml:"temperature"`
	ProdName    string           `xml:"product_name"`
	UUID        string           `xml:"uuid"`
	ComputeMode string           `xml:"compute_mode"`
	Utilization UtilizationStats `xml:"utilization"`
	Power       PowerReadings    `xml:"power_readings"`
	PCI         PCI              `xml:"pci"`
	Encoder     EncoderStats     `xml:"encoder_stats"`
	FBC         FBCStats         `xml:"fbc_stats"`
	Clocks      ClockStats       `xml:"clocks"`
}

GPU defines the structure of the GPU portion of the smi output.

type MemoryStats

type MemoryStats struct {
	Total string `xml:"total"` // int
	Used  string `xml:"used"`  // int
	Free  string `xml:"free"`  // int
}

MemoryStats defines the structure of the memory portions in the smi output.

type NvidiaSMI

type NvidiaSMI struct {
	BinPath string
	Timeout config.Duration
}

NvidiaSMI holds the methods for this plugin

func (*NvidiaSMI) Gather

func (smi *NvidiaSMI) Gather(acc telegraf.Accumulator) error

Gather implements the telegraf interface

func (*NvidiaSMI) Init added in v1.20.4

func (smi *NvidiaSMI) Init() error

func (*NvidiaSMI) SampleConfig

func (*NvidiaSMI) SampleConfig() string

type PCI

type PCI struct {
	LinkInfo struct {
		PCIEGen struct {
			CurrentLinkGen string `xml:"current_link_gen"` // int
		} `xml:"pcie_gen"`
		LinkWidth struct {
			CurrentLinkWidth string `xml:"current_link_width"` // int
		} `xml:"link_widths"`
	} `xml:"pci_gpu_link_info"`
}

PCI defines the structure of the pci portion of the smi output.

type PowerReadings

type PowerReadings struct {
	PowerDraw string `xml:"power_draw"` // float
}

PowerReadings defines the structure of the power_readings portion of the smi output.

type SMI

type SMI struct {
	GPU           GPU    `xml:"gpu"`
	DriverVersion string `xml:"driver_version"`
	CUDAVersion   string `xml:"cuda_version"`
}

SMI defines the structure for the output of _nvidia-smi -q -x_.
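To illustrate how struct tags like these map onto XML, the sketch below decodes a small document with `encoding/xml`. The `smiLog` and `gpu` types are trimmed-down stand-ins for the package's types, and the `sample` document (including its root element name and version strings) is hand-written for illustration, not captured from a real device.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// smiLog and gpu are reduced stand-ins for the package's SMI and GPU types,
// keeping only a few fields.
type smiLog struct {
	DriverVersion string `xml:"driver_version"`
	CUDAVersion   string `xml:"cuda_version"`
	GPU           []gpu  `xml:"gpu"`
}

type gpu struct {
	ProdName string `xml:"product_name"`
	FanSpeed string `xml:"fan_speed"`
	Temp     struct {
		GPUTemp string `xml:"gpu_temp"`
	} `xml:"temperature"`
}

// sample is hypothetical input written by hand for this example.
const sample = `<nvidia_smi_log>
  <driver_version>1.2.3</driver_version>
  <cuda_version>1.0</cuda_version>
  <gpu>
    <product_name>GeForce GTX 1070</product_name>
    <fan_speed>100 %</fan_speed>
    <temperature><gpu_temp>53 C</gpu_temp></temperature>
  </gpu>
</nvidia_smi_log>`

// parse decodes XML bytes into smiLog. Repeated <gpu> elements are
// collected into the GPU slice.
func parse(data []byte) (smiLog, error) {
	var s smiLog
	err := xml.Unmarshal(data, &s)
	return s, err
}

func main() {
	s, err := parse([]byte(sample))
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for i, g := range s.GPU {
		fmt.Printf("gpu %d: %s, temp %s, fan %s\n", i, g.ProdName, g.Temp.GPUTemp, g.FanSpeed)
	}
}
```

Because every value is decoded as a string, unit suffixes and non-numeric placeholders survive decoding and can be handled in a later conversion step.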

type TempStats

type TempStats struct {
	GPUTemp string `xml:"gpu_temp"` // int
}

TempStats defines the structure of the temperature portion of the smi output.

type UtilizationStats

type UtilizationStats struct {
	GPU     string `xml:"gpu_util"`     // int
	Memory  string `xml:"memory_util"`  // int
	Encoder string `xml:"encoder_util"` // int
	Decoder string `xml:"decoder_util"` // int
}

UtilizationStats defines the structure of the utilization portion of the smi output.
