check-nvidia-gpu

Nagios / Icinga plugin to monitor NVIDIA GPU utilization and memory usage using nvidia-smi.

check_nvidia_gpu.sh

License: MIT · Built by Nous Research



Description

check_nvidia_gpu.sh checks all available NVIDIA GPUs on a system and evaluates:

  • GPU utilization (%)
  • GPU memory usage (MB)

It returns a Nagios-compatible status and performance data.
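As a rough illustration of where these values can come from, nvidia-smi has a CSV query mode. The sketch below assumes that mode is used and is not necessarily how check_nvidia_gpu.sh parses its input; the function name parse_gpu_csv is hypothetical.

```shell
# Hypothetical helper: turn nvidia-smi CSV rows into a human-readable summary.
# Expects lines such as "0, 45, 12000, 24576" on stdin
# (index, utilization %, memory used, memory total).
parse_gpu_csv() {
    awk -F', *' '{ printf "GPU%d: %d%% (%d/%dMB)\n", $1, $2, $3, $4 }'
}

# On a host with a GPU, the input could come from (assumed invocation):
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits | parse_gpu_csv
# Sample data is used here so the sketch runs without a GPU.
printf '0, 45, 12000, 24576\n1, 30, 8000, 24576\n' | parse_gpu_csv
```

Reading from stdin keeps the parsing step testable on machines without a GPU.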


Requirements

  • NVIDIA GPU
  • NVIDIA drivers installed
  • nvidia-smi available in PATH
  • Linux system with:
    • bash
    • awk
  • Nagios / Icinga (or compatible monitoring system)
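A preflight check along these lines can confirm that the required tools are present. It is a minimal sketch: require_cmd is a hypothetical helper, and mapping a missing tool to UNKNOWN (3) is an assumption based on common Nagios plugin practice rather than documented behavior of this script.

```shell
# Hypothetical preflight: return 3 (UNKNOWN) when a required command is missing.
require_cmd() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "[UNKNOWN]: required command '$1' not found in PATH"
        return 3
    fi
    return 0
}

# Check the tools listed above and stop at the first one that is missing.
for cmd in nvidia-smi bash awk; do
    require_cmd "$cmd" || break
done
```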

Installation

  1. Copy the script to the plugin directory:

    /usr/lib/nagios/plugins/check_nvidia_gpu.sh

  2. Set executable permissions:

    chmod 755 /usr/lib/nagios/plugins/check_nvidia_gpu.sh


Usage

check_nvidia_gpu.sh [--warn PERCENT] [--crit PERCENT] [--warn-mem MB] [--crit-mem MB]

Parameters

Parameter    Description
--warn       Warning threshold for GPU utilization (%)
--crit       Critical threshold for GPU utilization (%)
--warn-mem   Warning threshold for GPU memory usage (MB)
--crit-mem   Critical threshold for GPU memory usage (MB)

Default Values

  • warn utilization: 80%
  • crit utilization: 95%
  • memory thresholds: disabled unless specified

Threshold Semantics

Metric             Behavior
GPU utilization    higher is worse
GPU memory usage   higher is worse

Example:

--warn 80 --crit 95
  • ≥ 80% (and below 95%) → WARNING
  • ≥ 95% → CRITICAL
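The comparison described above can be sketched as follows. The function name classify is hypothetical, and treating an empty threshold as "disabled" is an assumption made to mirror the optional memory thresholds.

```shell
# Hypothetical helper: print OK, WARNING or CRITICAL for an integer value.
# Usage: classify VALUE WARN CRIT
# An empty WARN or CRIT is treated as "threshold disabled" (assumption).
classify() {
    value=$1 warn=$2 crit=$3
    if [ -n "$crit" ] && [ "$value" -ge "$crit" ]; then
        echo CRITICAL
    elif [ -n "$warn" ] && [ "$value" -ge "$warn" ]; then
        echo WARNING
    else
        echo OK
    fi
}

classify 45 80 95   # prints OK
classify 82 80 95   # prints WARNING
classify 97 80 95   # prints CRITICAL
```

The critical threshold is tested first so that a value above both thresholds is reported as CRITICAL rather than WARNING.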

Example Usage

check_nvidia_gpu.sh
check_nvidia_gpu.sh --warn 70 --crit 90
check_nvidia_gpu.sh --warn 80 --crit 95 --warn-mem 20000 --crit-mem 24000

Example Output

[OK]: GPU0: 45% (12000/24576MB); GPU1: 30% (8000/24576MB) | gpu0_util=45;80;95;0;100 gpu0_mem_used=12000;;;0;24576 gpu1_util=30;80;95;0;100 gpu1_mem_used=8000;;;0;24576
[WARNING]: GPU0: 82% (20000/24576MB) | gpu0_util=82;80;95;0;100 gpu0_mem_used=20000;;;0;24576
[CRITICAL]: GPU0: 97% (24000/24576MB) | gpu0_util=97;80;95;0;100 gpu0_mem_used=24000;;;0;24576
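The part after the pipe character follows the Nagios performance data layout label=value;warn;crit;min;max. A small sketch of how such a token might be assembled is shown below; perfdata is a hypothetical helper, not a function taken from the script.

```shell
# Hypothetical helper: build one perfdata token.
# Usage: perfdata LABEL VALUE WARN CRIT MIN MAX
# Pass empty strings for WARN and CRIT when no threshold is configured.
perfdata() {
    printf '%s=%s;%s;%s;%s;%s' "$1" "$2" "$3" "$4" "$5" "$6"
}

perfdata gpu0_util 45 80 95 0 100; echo          # gpu0_util=45;80;95;0;100
perfdata gpu0_mem_used 12000 "" "" 0 24576; echo # gpu0_mem_used=12000;;;0;24576
```

The empty warn and crit fields in the second call correspond to the disabled memory thresholds in the example output.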

Return Codes

Code   State
0      OK
1      WARNING
2      CRITICAL
3      UNKNOWN

NRPE Integration

Example /etc/nagios/nrpe.cfg:

command[check_gpu]=/usr/lib/nagios/plugins/check_nvidia_gpu.sh --warn 80 --crit 95 --warn-mem 20000 --crit-mem 24000

Alternatively, pass the arguments dynamically through NRPE (this requires dont_blame_nrpe=1 in nrpe.cfg):

command[check_gpu]=/usr/lib/nagios/plugins/check_nvidia_gpu.sh $ARG1$

Restart NRPE after changing the configuration.


Notes

  • Relies on nvidia-smi output; changes in format may affect parsing
  • Memory thresholds are optional
  • Multiple GPUs are evaluated individually; the worst state determines the overall result
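The "worst state wins" rule can be sketched as taking the maximum of the per-GPU return codes. This is an illustration under that assumption; worst_state is a hypothetical helper, and the script may combine states differently.

```shell
# Hypothetical helper: keep the higher of two Nagios return codes
# (0 OK, 1 WARNING, 2 CRITICAL).
worst_state() {
    if [ "$1" -ge "$2" ]; then echo "$1"; else echo "$2"; fi
}

overall=0
for state in 0 1 0; do       # e.g. GPU0 OK, GPU1 WARNING, GPU2 OK
    overall=$(worst_state "$overall" "$state")
done
echo "$overall"              # prints 1
```

With a plain numeric maximum, UNKNOWN (3) would outrank CRITICAL (2); whether that ordering is desirable is a design choice this sketch does not settle.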

License

MIT