gpu-metrics-agent

command module

v0.1.3 Latest Latest Go to latest Published: Jan 7, 2026 License: Apache-2.0 Imports: 44 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/polarsignals/gpu-metrics-agent

Links

Open Source Insights

README ¶

GPU Metrics Agent

A high-performance agent for collecting NVIDIA GPU metrics and exporting them via OpenTelemetry Arrow protocol. Developed by Polar Signals for comprehensive GPU observability.

Why GPU Metrics Agent?

Modern GPU workloads require deep visibility into resource utilization, performance characteristics, and system behavior. This agent provides the metrics you need to optimize GPU usage, reduce costs, and troubleshoot performance issues.

Use Cases

🎯 GPU Resource Optimization

Track GPU and memory utilization to identify underutilized resources. Optimize batch sizes and workload scheduling based on actual GPU usage patterns.

Key metrics: gpu_utilization_percent, gpu_utilization_memory_percent

Monitor per-process GPU utilization to ensure fair resource allocation in shared environments. Track which processes are consuming GPU resources and enforce usage policies.

Key metrics: Per-process gpu_utilization_percent with pid and comm attributes

🔍 Performance Troubleshooting

Identify performance bottlenecks by correlating GPU metrics with application behavior. Detect thermal throttling, power limitations, and PCIe bandwidth constraints.

Key metrics: gpu_temperature_celsius, gpu_clock_hertz, gpu_pcie_throughput_*_bytes

💰 Cost Management

Monitor power consumption to calculate operational costs and optimize for efficiency. Track power usage trends and identify opportunities for cost reduction.

Key metrics: gpu_power_watt, gpu_power_limit_watt

📊 Capacity Planning

Analyze historical utilization patterns to plan for future GPU infrastructure needs. Understand peak usage times and resource requirements for different workloads.

Key metrics: All utilization metrics with time-series analysis

🤖 ML/AI Workload Monitoring

Track GPU performance during model training and inference. Ensure optimal resource allocation for different phases of machine learning pipelines.

Key metrics: All metrics combined with workload-specific context

Collected Metrics

Metric	Description	Collection Interval
`gpu_utilization_percent`	GPU compute utilization (0-100%)	5s
`gpu_utilization_memory_percent`	GPU memory utilization (0-100%)	5s
`gpu_power_watt`	Current power consumption	1s
`gpu_power_limit_watt`	Maximum power limit	1s
`gpu_clock_hertz`	Clock speeds (graphics, SM, memory, video)	1s
`gpu_temperature_celsius`	GPU temperature	1s
`gpu_pcie_throughput_transmit_bytes`	PCIe transmit throughput	100ms
`gpu_pcie_throughput_receive_bytes`	PCIe receive throughput	100ms

All metrics include uuid (GPU identifier) and index (GPU index) attributes. Process-level metrics also include pid and comm attributes.

Quick Start

Installation

For Kubernetes deployments, see our comprehensive setup guide.

Basic Usage

./gpu-metrics-agent \
  --remote-store-address=grpc.polarsignals.com:443 \
  --metrics-producer-nvidia-gpu=true \
  --node=$(hostname) \
  --log-level=info

Docker

docker run -it --rm \
  --gpus all \
  -v /etc/machine-id:/etc/machine-id:ro \
  -v /var/run/secrets/polarsignals.com:/var/run/secrets/polarsignals.com:ro \
  ghcr.io/polarsignals/gpu-metrics-agent:latest \
  --remote-store-address=grpc.polarsignals.com:443 \
  --metrics-producer-nvidia-gpu=true

Configuration Options

Flag	Description	Default
`--remote-store-address`	gRPC endpoint for metric storage	Required
`--metrics-producer-nvidia-gpu`	Enable NVIDIA GPU metrics	false
`--collection-interval`	Metric export interval	10s
`--node`	Node name for metric labeling	Machine ID
`--bearer-token`	Authentication token	-

Architecture

The agent consists of three main components:

NVIDIA Producer: Interfaces with NVIDIA Management Library (NVML) to collect GPU metrics
Metric Exporter: Batches and exports metrics using OpenTelemetry Arrow protocol
gRPC Client: Manages secure connections to remote storage endpoints

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  NVIDIA GPUs    │────▶│  GPU Metrics     │────▶│ Remote Storage  │
│  (via NVML)     │     │  Agent           │     │ (gRPC/OTel)    │
└─────────────────┘     └──────────────────┘     └─────────────────┘

Requirements

NVIDIA GPU with driver version 390.x or newer
Linux operating system
NVIDIA Management Library (NVML) available

Building from Source

go build .

License

Apache License 2.0

Support

Documentation: polarsignals.com/docs
Issues: GitHub Issues

Documentation ¶

There is no documentation for this package.

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL