check-nvidia-gpu
check_nvidia_gpu.sh
Nagios / Icinga plugin to monitor NVIDIA GPU utilization and memory usage using nvidia-smi.
Table of Contents
- Description
- Requirements
- Installation
- Usage
- Threshold Semantics
- Example Usage
- Example Output
- Return Codes
- NRPE Integration
- Notes
- License
Description
check_nvidia_gpu.sh checks all available NVIDIA GPUs on a system and evaluates:
- GPU utilization (%)
- GPU memory usage (MB)
It returns a Nagios-compatible status and performance data.
Requirements
- NVIDIA GPU
- NVIDIA drivers installed
- nvidia-smi available in PATH
- Linux system with:
  - bash
  - awk
- Nagios / Icinga (or compatible monitoring system)
Installation
- Copy the script to the plugin directory:
  /usr/lib/nagios/plugins/check_nvidia_gpu.sh
- Set executable permissions:
  chmod 755 /usr/lib/nagios/plugins/check_nvidia_gpu.sh
Usage
check_nvidia_gpu.sh [--warn <percent>] [--crit <percent>] [--warn-mem <MB>] [--crit-mem <MB>]
Parameters
| Parameter | Description |
|---|---|
| --warn | Warning threshold for GPU utilization (%) |
| --crit | Critical threshold for GPU utilization (%) |
| --warn-mem | Warning threshold for GPU memory usage (MB) |
| --crit-mem | Critical threshold for GPU memory usage (MB) |
Default Values
- warn utilization: 80%
- crit utilization: 95%
- memory thresholds: disabled unless specified
Threshold Semantics
| Metric | Behavior |
|---|---|
| GPU utilization | higher is worse |
| GPU memory usage | higher is worse |
Example:
--warn 80 --crit 95
- ≥ 80% → WARNING
- ≥ 95% → CRITICAL
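The "higher is worse" rule can be sketched as a small shell function. This is an illustration only, not the plugin's actual code: the name `eval_threshold` is hypothetical, and treating an empty threshold as "disabled" is an assumption based on the optional memory thresholds.

```bash
# Hypothetical helper: map a value to a Nagios state code using ">=" comparisons.
# Arguments: value, warning threshold, critical threshold (integers).
# Assumption: an empty threshold means that check is disabled.
eval_threshold() {
    value=$1; warn=$2; crit=$3
    if [ -n "$crit" ] && [ "$value" -ge "$crit" ]; then
        echo 2   # CRITICAL
    elif [ -n "$warn" ] && [ "$value" -ge "$warn" ]; then
        echo 1   # WARNING
    else
        echo 0   # OK
    fi
}

eval_threshold 45 80 95   # prints 0
eval_threshold 82 80 95   # prints 1
eval_threshold 97 80 95   # prints 2
```

The critical threshold is tested first so that a value exceeding both thresholds reports CRITICAL rather than WARNING.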
Example Usage
check_nvidia_gpu.sh
check_nvidia_gpu.sh --warn 70 --crit 90
check_nvidia_gpu.sh --warn 80 --crit 95 --warn-mem 20000 --crit-mem 24000
Example Output
[OK]: GPU0: 45% (12000/24576MB); GPU1: 30% (8000/24576MB) | gpu0_util=45;80;95;0;100 gpu0_mem_used=12000;;;0;24576 gpu1_util=30;80;95;0;100 gpu1_mem_used=8000;;;0;24576
[WARNING]: GPU0: 82% (20000/24576MB) | gpu0_util=82;80;95;0;100 gpu0_mem_used=20000;;;0;24576
[CRITICAL]: GPU0: 97% (24000/24576MB) | gpu0_util=97;80;95;0;100 gpu0_mem_used=24000;;;0;24576
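The part after the `|` is Nagios performance data in `label=value;warn;crit;min;max` form. As a rough sketch of how such a line could be assembled with awk, the hypothetical function `format_perfdata` below reads rows of `index, util, mem_used, mem_total` from stdin. Both the function and the input layout are assumptions for illustration; the plugin's real parsing may differ.

```bash
# Hypothetical helper: turn CSV rows "index, util, mem_used, mem_total"
# (read from stdin) into space-separated perfdata fields.
# Arguments: warning and critical thresholds for utilization.
format_perfdata() {
    awk -F', *' -v w="$1" -v c="$2" '{
        printf "%sgpu%s_util=%s;%s;%s;0;100 gpu%s_mem_used=%s;;;0;%s",
               (NR > 1 ? " " : ""), $1, $2, w, c, $1, $3, $4
    } END { print "" }'
}

# On a real system the rows could come from (assumed invocation):
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
printf '0, 45, 12000, 24576\n1, 30, 8000, 24576\n' | format_perfdata 80 95
```

With the sample input, this yields the same performance data as the first example line above. The memory fields leave warn and crit empty because no memory thresholds are passed in this sketch.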
Return Codes
| Code | State |
|---|---|
| 0 | OK |
| 1 | WARNING |
| 2 | CRITICAL |
| 3 | UNKNOWN |
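For illustration, the mapping between exit codes and the state labels in the output could look like the following. The function name `state_label` is hypothetical and not taken from the script.

```bash
# Hypothetical helper: translate a Nagios exit code into its state label.
# Any code other than 0, 1, or 2 is reported as UNKNOWN.
state_label() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

state=1
echo "[$(state_label "$state")]: GPU0: 82% (20000/24576MB)"
# A plugin would then finish with: exit "$state"
```

The monitoring system reads the state from the exit code, so the label in the text output is for human readers only.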
NRPE Integration
Example /etc/nagios/nrpe.cfg:
command[check_gpu]=/usr/lib/nagios/plugins/check_nvidia_gpu.sh --warn 80 --crit 95 --warn-mem 20000 --crit-mem 24000
Or pass the arguments dynamically through NRPE (this requires an NRPE build with command arguments enabled and dont_blame_nrpe=1 in nrpe.cfg):
command[check_gpu]=/usr/lib/nagios/plugins/check_nvidia_gpu.sh $ARG1$
(Restart NRPE after changes)
Notes
- Relies on nvidia-smi output; changes in its format may affect parsing
- Memory thresholds are optional
- Multiple GPUs are evaluated individually, worst state wins
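"Worst state wins" can be sketched as keeping the numerically highest per-GPU state. The helper `worst_state` is hypothetical, and this sketch only compares OK, WARNING, and CRITICAL; how the plugin ranks UNKNOWN is not covered here.

```bash
# Hypothetical helper: return the more severe of two Nagios state codes.
# Numeric order 0 < 1 < 2 matches OK < WARNING < CRITICAL.
worst_state() {
    if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi
}

overall=0
for gpu_state in 0 1 0; do   # sample per-GPU states: OK, WARNING, OK
    overall=$(worst_state "$overall" "$gpu_state")
done
echo "$overall"   # prints 1
```

With one GPU in WARNING and the others OK, the overall result is WARNING.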
License
MIT