check_pve

A Naemon/Icinga/Nagios plugin that checks cluster, node and VM health of Proxmox Virtual Environment via the Proxmox API (v2).

Requirements

Ruby

  • Ruby >2.3

PVE

A user/role with appropriate rights. See User Management for more information.

# /etc/pve/user.cfg
user:monitoring@pve:1:0::::::

role:PVE_monitoring:Datastore.Audit,Sys.Audit,Sys.Modify,VM.Audit:
acl:1:/:monitoring@pve:PVE_monitoring:

How to add user and role

pveum useradd monitoring@pve -comment "Monitoring User"
pveum passwd monitoring@pve
pveum roleadd PVE_monitoring -privs "Datastore.Audit,Sys.Audit,Sys.Modify,VM.Audit"
pveum aclmod / -user monitoring@pve -role PVE_monitoring
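
For context, an API client normally logs in with these credentials by requesting a ticket from the PVE API. The following is a minimal sketch of that request, not check_pve's actual code: the helper name `build_ticket_request` is hypothetical, the default port 8006 is assumed, and the request is only built, never sent.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) the login request
# for the PVE API ticket endpoint. Port 8006 is assumed as the default.
def build_ticket_request(host, username, password, port: 8006)
  uri = URI::HTTPS.build(host: host, port: port, path: '/api2/json/access/ticket')
  request = Net::HTTP::Post.new(uri)
  # Credentials are sent as a URL-encoded form body.
  request.set_form_data('username' => username, 'password' => password)
  [uri, request]
end

uri, request = build_ticket_request('pve.example.com', 'monitoring@pve', 'test1234')
puts uri
puts request.body
```

A successful response contains a ticket that is sent back as the `PVEAuthCookie` cookie on subsequent requests.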

Usage

check_pve v0.3.0 [https://gitlab.com/6uellerBpanda/check_pve]

This plugin checks various parameters of Proxmox Virtual Environment via API(v2)

Mode:
  Cluster:
    cluster-status            Checks quorum of cluster
  Node:
    node-smart-status          Checks SMART health of disks
    node-updates-available     Checks for available updates
    node-subscription-valid    Checks for valid subscription
    node-services-status       Checks if services are running
    node-task-errors           Checks for task errors
    node-storage-usage         Checks storage usage in percentage
    node-storage-status        Checks if storage is online/offline
    node-cpu-usage             Checks CPU usage in percentage
    node-memory-usage          Checks Memory usage in percentage
    node-io-wait               Checks IO wait in percentage
    node-net-in-usage          Checks inbound network usage
    node-net-out-usage         Checks outbound network usage
    node-ksm-usage             Checks KSM sharing usage
  VM:
    vm-cpu-usage               Checks CPU usage in percentage
    vm-memory-usage            Checks memory usage in percentage
    vm-disk-read-usage         Checks how much last 60s was read (timeframe: hour)
    vm-disk-write-usage        Checks how much last 60s was written (timeframe: hour)
    vm-net-in-usage            Checks incoming usage from last 60s (timeframe: hour)
    vm-net-out-usage           Checks outgoing usage from last 60s (timeframe: hour)

Usage: check_pve.rb [mode] [options]

Options:
    -s, -H, --address ADDRESS        PVE host address
    -k, --insecure                   No SSL verification
    -m, --mode MODE                  Mode to check
    -n, --node NODE                  PVE Node name
    -u, --username USERNAME          Username with auth realm e.g. monitoring@pve
    -p, --password PASSWORD          Password
    -w, --warning WARNING            Warning threshold
    -c, --critical CRITICAL          Critical threshold
        --unit UNIT                  Unit - kb, mb, gb, tb
        --name NAME                  Name for storage or user filter for tasks
    -i, --vmid VMID                  ID of qemu/lxc machine
    -t, --type TYPE                  VM type lxc, qemu or type filter for tasks
    -x, --exclude EXCLUDE            Exclude (regex)
        --timeframe TIMEFRAME        Timeframe for vm checks: hour,day,week,month or year
        --cf CONSOLIDATION_FUNCTION  RRD cf: average or max
        --lookback LOOKBACK          Lookback in seconds
    -v, --version                    Print version information
    -h, --help                       Show this help message

Options

  • -s, -H: PVE host address, only https supported, e.g. pve-01.example.com

  • -k: Don't validate certificate

  • -m: check to be used (cluster-status, node-cpu-usage, ...)

  • -n: PVE Node name

  • -u: Username with auth realm, e.g. monitoring@pve, root@pam

  • -i: ID of qemu/lxc machine

  • -x: Exclude items (regex)

  • --name: Storage name, e.g. local, local-lvm, also used as user filter for tasks

  • --type: either lxc or qemu, also used as type filter for tasks

  • --timeframe: time frame for rrd data; hour, day, week, month or year. Default 'hour'

  • --cf: consolidation function for rrd data; average or max. Default 'max'

  • --lookback: time in seconds to look back

  • --unit: specify desired unit output: kb, mb, gb, tb. Default 'mb'

Modes

Cluster

Checks if the cluster is quorate. Warning if not. (/cluster/status)

./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -m cluster-status
OK - LNZ: Cluster ready - quorum is ok
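
To illustrate the kind of evaluation behind this check, here is a minimal sketch that inspects a `/cluster/status` response. It is not the plugin's code: the helper name is hypothetical, the response is assumed to contain an entry of type `cluster` with `name` and `quorate` fields, and the warning text is illustrative.

```ruby
require 'json'

# Hypothetical helper: evaluates a /cluster/status response body.
# Assumes the entry of type "cluster" carries "name" and "quorate" (1 or 0).
# Returns [exit_code, message] following the Nagios exit code convention.
def cluster_quorum_status(json_body)
  entries = JSON.parse(json_body).fetch('data')
  cluster = entries.find { |entry| entry['type'] == 'cluster' }
  return [3, 'Unknown - no cluster entry found'] if cluster.nil?

  if cluster['quorate'] == 1
    [0, "OK - #{cluster['name']}: Cluster ready - quorum is ok"]
  else
    # Illustrative wording; the plugin's real message may differ.
    [1, "Warning - #{cluster['name']}: Cluster not ready - no quorum"]
  end
end

sample = '{"data":[{"type":"cluster","name":"LNZ","quorate":1},{"type":"node","name":"pve"}]}'
code, message = cluster_quorum_status(sample)
puts message
```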

Node

The node name (via -n option) is required for all node checks.

SMART

Checks SMART status of the disks. (/nodes/{node}/disks/list)

Allows exclude option: --exclude '^/dev/sda'

./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m node-smart-status
OK - No SMART errors detected

Updates

Displays a warning if new updates are available. (/nodes/{node}/apt/update)

./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m node-updates-available
Warning - 12 updates available

Subscription

Checks if subscription is valid. (/nodes/{node}/subscription)

Specify warning threshold for minimum number of days subscription has to be valid. Critical status if the subscription has expired.

./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m node-subscription-valid -w 60
Warning - Subscription will end at 2018-10-13
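
The threshold logic described above can be sketched as follows. This is an assumption-laden illustration rather than the plugin's implementation: the helper name is hypothetical, the due date is assumed to arrive as a `YYYY-MM-DD` string, and the Critical and OK messages are invented for the example.

```ruby
require 'date'

# Hypothetical helper: compares the remaining subscription days
# with the warning threshold. Returns [exit_code, message].
def subscription_state(next_due_date, warning_days, today: Date.today)
  due = Date.parse(next_due_date)
  days_left = (due - today).to_i

  if days_left.negative?
    [2, "Critical - Subscription expired at #{due}"]
  elsif days_left <= warning_days
    [1, "Warning - Subscription will end at #{due}"]
  else
    [0, "OK - Subscription is valid until #{due}"]
  end
end

code, message = subscription_state('2018-10-13', 60, today: Date.new(2018, 9, 1))
puts message
```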

Services

Displays a warning if a service isn't running. (/nodes/{node}/services)

Allows exclude option: --exclude 'ksmtuned'

./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m node-services-status
Warning - postfix, spiceproxy not running

To exclude services:

./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m node-services-status -x 'postfix|spiceproxy'
OK - All services running
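
The effect of the exclude regex can be sketched like this. The helper name is hypothetical, and the `name` and `state` fields are assumptions about the service entries; the point is only how a pattern such as `'postfix|spiceproxy'` filters the result.

```ruby
# Hypothetical helper: lists services that are not running,
# skipping any whose name matches the exclude regex.
def not_running(services, exclude: nil)
  pattern = exclude && Regexp.new(exclude)
  services
    .reject { |service| service['state'] == 'running' }
    .map { |service| service['name'] }
    .reject { |name| pattern && name =~ pattern }
end

services = [
  { 'name' => 'pveproxy',   'state' => 'running' },
  { 'name' => 'postfix',    'state' => 'stopped' },
  { 'name' => 'spiceproxy', 'state' => 'stopped' }
]

puts not_running(services).inspect
puts not_running(services, exclude: 'postfix|spiceproxy').inspect
```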

Tasks

Displays a warning if failed tasks occurred. (/nodes/{node}/tasks)

Use the --lookback option to set how many seconds back from the current time to check.

The --name and --type options can be used to filter tasks by user and by task type.

The --exclude option can be specified to filter on the status message.

# only show errors from shutdown tasks in the last hour
./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m node-task-errors --lookback 3600 -t qmshutdown
Warning - 2022-07-24 14:20:08 +0200: qmshutdown/root@pam - received interrupt

# but exclude tasks with 'interrupt' in the status message
./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m node-task-errors --lookback 3600 -t qmshutdown -x 'interrupt'
OK - No failed tasks
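
How the lookback window and the filters combine can be sketched as below. This is an illustration under assumptions, not the plugin's code: the helper name is hypothetical, and task entries are assumed to expose `starttime` (epoch seconds), `status` and `type`, with `OK` marking a successful task.

```ruby
# Hypothetical helper: selects failed tasks inside the lookback window,
# optionally filtered by task type and by an exclude regex on the status.
def failed_tasks(tasks, lookback:, now: Time.now.to_i, type: nil, exclude: nil)
  since = now - lookback
  pattern = exclude && Regexp.new(exclude)

  tasks.select do |task|
    next false if task['starttime'] < since          # too old
    next false if task['status'] == 'OK'             # succeeded
    next false if type && task['type'] != type       # other task type
    next false if pattern && task['status'] =~ pattern
    true
  end
end

now = 1_658_665_208
tasks = [
  { 'type' => 'qmshutdown', 'starttime' => now - 600,  'status' => 'received interrupt' },
  { 'type' => 'vzdump',     'starttime' => now - 100,  'status' => 'OK' },
  { 'type' => 'qmstart',    'starttime' => now - 7200, 'status' => 'start failed' }
]

puts failed_tasks(tasks, lookback: 3600, now: now, type: 'qmshutdown').length
```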

Storage

Usage

Checks storage usage in percentage. Value will be rounded. (/nodes/{node}/storage/{storage}/status)

Specify datastore/storage with --name option.

./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m node-storage-usage --name local -w 40 -c 60
Warning - Storage usage: 45% | Usage=45%;40;60

Status

Checks the status (online/offline) of all enabled storages on the node. (/nodes/{node}/storage/{storage})

Allows exclude option.

./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m node-storage-status
Warning - local-lvm not active

CPU

Checks CPU usage in percentage. Value will be rounded. (/nodes/{node}/status)

./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m node-cpu-usage -w 40 -c 60
OK - CPU usage: 30% | Usage=30%;40;60

Memory

Checks memory usage in percentage. Value will be rounded. (/nodes/{node}/status)

./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m node-memory-usage -w 90 -c 95
OK - Memory usage: 85.03% | Usage=85.03%;90;95
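
The output of the usage checks follows the usual Nagios plugin layout: a status text, a `|` separator, then performance data in the form `label=value[unit];warn;crit`. A minimal sketch of producing such a line is shown below; the helper name is hypothetical, and comparing with `>=` is an assumption about how thresholds are evaluated.

```ruby
# Standard Nagios plugin exit codes.
EXIT_CODES = { 'OK' => 0, 'Warning' => 1, 'Critical' => 2 }.freeze

# Hypothetical helper: compares a value with the thresholds and builds
# the status line including performance data. Returns [exit_code, line].
def evaluate(label, value, warning, critical, uom: '%')
  state = if value >= critical
            'Critical'
          elsif value >= warning
            'Warning'
          else
            'OK'
          end
  perfdata = "Usage=#{value}#{uom};#{warning};#{critical}"
  [EXIT_CODES.fetch(state), "#{state} - #{label}: #{value}#{uom} | #{perfdata}"]
end

code, line = evaluate('Memory usage', 85.03, 90, 95)
puts line
exit_code_for_monitoring = code # a real plugin would call exit(code) here
```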

IO Wait

Checks IO wait/delay usage in percentage. Value will be rounded. (/nodes/{node}/status)

./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m node-io-wait -w 1 -c 3
OK - IO Wait: 0% | Wait=0%;1;3

Network usage

Checks network usage (In/Out). Value will be rounded. (/nodes/{node}/rrddata)

# inbound
./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m node-net-in-usage -w 100 -c 200
OK - Network usage in: 2.54MB | Usage=2.54MB;100;200
# outbound
./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m node-net-out-usage -w 100 -c 200
OK - Network usage out: 1.12MB | Usage=1.12MB;100;200

KSM

Checks KSM usage. Value will be rounded. (/nodes/{node}/status)

./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m node-ksm-usage --unit gb -w 20 -c 25
OK - KSM sharing: 14.26GB | Usage=14.26GB;20;25
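
The --unit option implies a conversion from a raw byte count to the requested unit. A minimal sketch of such a conversion follows; the helper name is hypothetical, and both the base of 1024 and rounding to two decimals are assumptions that may not match the plugin.

```ruby
# Exponent of 1024 for each supported unit (base 1024 is an assumption).
UNIT_EXPONENTS = { 'kb' => 1, 'mb' => 2, 'gb' => 3, 'tb' => 4 }.freeze

# Hypothetical helper: converts a byte count to the given unit,
# rounded to two decimals.
def convert_bytes(bytes, unit = 'mb')
  exponent = UNIT_EXPONENTS.fetch(unit.downcase) do
    raise ArgumentError, "unknown unit: #{unit}"
  end
  (bytes.to_f / 1024**exponent).round(2)
end

puts convert_bytes(1_572_864)        # bytes to the default unit, mb
puts convert_bytes(1024**3, 'gb')
```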

VM

QEMU/KVM and LXC are supported.

The following options are required for all VM checks:

  • node (-n)
  • type (--type)
  • vmid (-i)

> Note: These checks parse the rrddata from PVE, so they do not reflect live values at the moment the check runs. The last item in the rrddata array is always used.
>
> Example: With timeframe hour, the disk read check reports how much read I/O (kb) was done in the last 60s.
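
Picking the last item of the rrddata array can be sketched as follows. The helper name is hypothetical, and the field names `time` and `diskread` are assumptions about the shape of the rrddata entries.

```ruby
# Hypothetical helper: returns the requested metric from the most recent
# rrddata entry, or nil if the array is empty or the key is missing.
def latest_rrd_value(rrddata, key)
  sample = rrddata.last
  sample && sample[key]
end

rrddata = [
  { 'time' => 1_658_661_600, 'diskread' => 1024.0 },
  { 'time' => 1_658_661_660, 'diskread' => 4096.0 },
  { 'time' => 1_658_661_720, 'diskread' => 2048.0 }
]

puts latest_rrd_value(rrddata, 'diskread')
```

With timeframe hour the entries are one minute apart, which is why the reported value covers the last 60 seconds rather than the instant the check runs.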

CPU

Checks CPU usage in percentage. Value will be rounded. (/nodes/{node}/{type}/{vmid}/rrddata)

./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m vm-cpu-usage -t qemu -i 126 -w 80 -c 90
OK - CPU usage: 5% | Usage=5%;80;90

Disk read, write

Checks the amount of disk read/write I/O. Value will be rounded. (/nodes/{node}/{type}/{vmid}/rrddata)

# read
./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m vm-disk-read-usage -t qemu -i 126 -w 80 -c 90
OK - Disk read: 2MB | Usage=2MB;80;90
# write
./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m vm-disk-write-usage -t qemu -i 126 -w 80 -c 90
OK - Disk write: 15.4MB | Usage=15.4MB;80;90

Network usage

Checks the amount of incoming/outgoing network traffic. Value will be rounded. (/nodes/{node}/{type}/{vmid}/rrddata)

# inbound
./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m vm-net-in-usage -t qemu -i 126 -w 50 -c 60
OK - Network usage in: 2.45MB | Usage=2.45MB;50;60
# outbound
./check_pve.rb -s pve.example.com -u monitoring@pve -p test1234 -n pve -m vm-net-out-usage -t qemu -i 126 -w 50 -c 60
OK - Network usage out: 1.1MB | Usage=1.1MB;50;60