gcp-gpu-metrics

command module

v0.2.1 Latest Latest Go to latest Published: Jan 11, 2021 License: MIT Imports: 19 Imported by: 0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/instadeepai/gcp-gpu-metrics

Links

Open Source Insights

README ¶

gcp-gpu-metrics

Tiny Go binary that aims to export Nvidia GPU metrics to GCP monitoring, based on nvidia-smi.

Requirements ⚓

Your machine must be a GCE (Google Compute Engine) instance.
The Cloud API access scopes of the instance or Service Account must have the Monitoring Metric Writer permission.
You need the nvidia-smi binary installed on your GCE instance.

Protip: You can use a machine learning image provided by GCP as default image.

Install ⏬

If you're root, you can install the latest binary version using the following script:

$ bash < <(curl -sSL https://raw.githubusercontent.com/instadeepai/gcp-gpu-metrics/master/install-latest.sh)

Or, you can download a release/binary from this page and install it manually.

Usage 💻

gcp-gpu-metrics is an UNIX compliant and very simple CLI, you just have to use it as usual:

$ gcp-gpu-metrics

Available flags:

--service-account-path string | GCP service account path. (default "")
--metrics-interval uint | Fetch metrics interval in seconds. (default 10)
--enable-nvidiasmi-pm | Enable persistence mod for nvidia-smi. (default false)
--version | Display current version/release and commit hash.

Available env variables:

GGM_SERVICE_ACCOUNT_PATH=./service-account.json linked to --service-account-path flag.
GGM_METRICS_INTERVAL=10 linked to --metrics-interval flag.
GGM_ENABLE_NVIDIASMI_PM=true linked to --enable-nvidiasmi-pm flag.

Priority order is binary flag ➡️ env var ➡️ default value.

Nvidia-smi persistence mod is very useful, the option permits to run nvidia-smi as a daemon in background to prevent 100% of GPU load at each request. Enabling this option requires root.

About logs, they're all located under syslog.

Metrics 📈

There are 6 differents metrics fetched, this number will grow in the future.

temperature.gpu as custom.googleapis.com/gpu/temperature_gpu | Core GPU temperature. in degrees C.
utilization.gpu as custom.googleapis.com/gpu/utilization_gpu | Percent of time over the past sample period during which one or more kernels were executed on the GPU.
utilization.memory as custom.googleapis.com/gpu/utilization_memory | Percent of time over the past sample period during which global (device) memory was being read or written.
memory.total as custom.googleapis.com/gpu/memory_total | Total installed GPU memory.
memory.free as custom.googleapis.com/gpu/memory_free | Total GPU free memory.
memory.used as custom.googleapis.com/gpu/memory_used | Total memory allocated by active contexts.