README
ΒΆ
GPU Metrics Agent
A high-performance agent for collecting NVIDIA GPU metrics and exporting them via OpenTelemetry Arrow protocol. Developed by Polar Signals for comprehensive GPU observability.
Why GPU Metrics Agent?
Modern GPU workloads require deep visibility into resource utilization, performance characteristics, and system behavior. This agent provides the metrics you need to optimize GPU usage, reduce costs, and troubleshoot performance issues.
Use Cases
π― GPU Resource Optimization
Track GPU and memory utilization to identify underutilized resources. Optimize batch sizes and workload scheduling based on actual GPU usage patterns.
Key metrics: gpu_utilization_percent, gpu_utilization_memory_percent
π₯ Multi-tenant GPU Sharing
Monitor per-process GPU utilization to ensure fair resource allocation in shared environments. Track which processes are consuming GPU resources and enforce usage policies.
Key metrics: Per-process gpu_utilization_percent with pid and comm attributes
π Performance Troubleshooting
Identify performance bottlenecks by correlating GPU metrics with application behavior. Detect thermal throttling, power limitations, and PCIe bandwidth constraints.
Key metrics: gpu_temperature_celsius, gpu_clock_hertz, gpu_pcie_throughput_*_bytes
π° Cost Management
Monitor power consumption to calculate operational costs and optimize for efficiency. Track power usage trends and identify opportunities for cost reduction.
Key metrics: gpu_power_watt, gpu_power_limit_watt
π Capacity Planning
Analyze historical utilization patterns to plan for future GPU infrastructure needs. Understand peak usage times and resource requirements for different workloads.
Key metrics: All utilization metrics with time-series analysis
π€ ML/AI Workload Monitoring
Track GPU performance during model training and inference. Ensure optimal resource allocation for different phases of machine learning pipelines.
Key metrics: All metrics combined with workload-specific context
Collected Metrics
| Metric | Description | Collection Interval |
|---|---|---|
gpu_utilization_percent |
GPU compute utilization (0-100%) | 5s |
gpu_utilization_memory_percent |
GPU memory utilization (0-100%) | 5s |
gpu_power_watt |
Current power consumption | 1s |
gpu_power_limit_watt |
Maximum power limit | 1s |
gpu_clock_hertz |
Clock speeds (graphics, SM, memory, video) | 1s |
gpu_temperature_celsius |
GPU temperature | 1s |
gpu_pcie_throughput_transmit_bytes |
PCIe transmit throughput | 100ms |
gpu_pcie_throughput_receive_bytes |
PCIe receive throughput | 100ms |
All metrics include uuid (GPU identifier) and index (GPU index) attributes. Process-level metrics also include pid and comm attributes.
Quick Start
Installation
For Kubernetes deployments, see our comprehensive setup guide.
Basic Usage
./gpu-metrics-agent \
--remote-store-address=grpc.polarsignals.com:443 \
--metrics-producer-nvidia-gpu=true \
--node=$(hostname) \
--log-level=info
Docker
docker run -it --rm \
--gpus all \
-v /etc/machine-id:/etc/machine-id:ro \
-v /var/run/secrets/polarsignals.com:/var/run/secrets/polarsignals.com:ro \
ghcr.io/polarsignals/gpu-metrics-agent:latest \
--remote-store-address=grpc.polarsignals.com:443 \
--metrics-producer-nvidia-gpu=true
Configuration Options
| Flag | Description | Default |
|---|---|---|
--remote-store-address |
gRPC endpoint for metric storage | Required |
--metrics-producer-nvidia-gpu |
Enable NVIDIA GPU metrics | false |
--collection-interval |
Metric export interval | 10s |
--node |
Node name for metric labeling | Machine ID |
--bearer-token |
Authentication token | - |
Architecture
The agent consists of three main components:
- NVIDIA Producer: Interfaces with NVIDIA Management Library (NVML) to collect GPU metrics
- Metric Exporter: Batches and exports metrics using OpenTelemetry Arrow protocol
- gRPC Client: Manages secure connections to remote storage endpoints
βββββββββββββββββββ ββββββββββββββββββββ βββββββββββββββββββ
β NVIDIA GPUs ββββββΆβ GPU Metrics ββββββΆβ Remote Storage β
β (via NVML) β β Agent β β (gRPC/OTel) β
βββββββββββββββββββ ββββββββββββββββββββ βββββββββββββββββββ
Requirements
- NVIDIA GPU with driver version 390.x or newer
- Linux operating system
- NVIDIA Management Library (NVML) available
Building from Source
go build .
License
Apache License 2.0
Support
- Documentation: polarsignals.com/docs
- Issues: GitHub Issues
Documentation
ΒΆ
There is no documentation for this package.