check_amanda.pl

Disk-to-Disk

Nagios plugin to check Amanda disk-to-disk backups are fresh (not too old) and have a reasonable minimum size

check_amanda.pl

This is a Nagios plugin to check that

  • All configured backups have a recent backup item
  • Backup sizes exceed a minimum size
  • At least N backup items are configured

Prerequisites

  • If you are using Amanda backup, and
  • If your backups go to disk (Disk-to-Disk backups)

Then this plugin is for you.

What it can do

This check reads the disklist.conf file found in each directory /etc/amanda/backup-set-name/.

It creates a list of all Disk List Entries (DLEs).

A DLE is a backup item - normally a file-system on a client host.

Other than specifying the top of the backup file-tree using -m /path/to/backups, no other configuration is required.

For the backup items (DLEs) found, 3 things are checked by this plugin:

  • Maximum backup age
    • warning or critical if the most recent backup (for any item/DLE) is more than 2 or 3 days old, respectively. This is configurable in hours. Either level 0 or level 1+ backups qualify.
  • Minimum backup size
    • warning or critical if the most recent level 0 backup is less than 64k or 32k, respectively. This is configurable in k, M, G or T.
  • Minimum number of backup items (DLEs) found
    • critical if fewer than 1 backup item is found. This is configurable.
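The age and size checks above can be sketched in a few lines of Python. This is only an illustration of the threshold logic, not the plugin's own Perl code: the function names are hypothetical, the size suffixes are assumed to be binary multiples (k = 1024 bytes), and the size thresholds passed in the example calls are illustrative values.

```python
import re

# Assumption: suffixes are binary multiples; values are expressed in KiB.
_UNITS_KIB = {"k": 1, "M": 1024, "G": 1024**2, "T": 1024**3}


def parse_size_kib(text):
    """Convert a threshold such as '64k' or '1G' to a number of KiB."""
    match = re.fullmatch(r"(\d+)([kMGT])", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _UNITS_KIB[match.group(2)]


def classify(age_hours, l0_size_kib, warn_hours, crit_hours, warn_size, crit_size):
    """Return 'critical', 'warning' or 'ok' for a single backup item."""
    if age_hours > crit_hours or l0_size_kib < parse_size_kib(crit_size):
        return "critical"
    if age_hours > warn_hours or l0_size_kib < parse_size_kib(warn_size):
        return "warning"
    return "ok"


# 2 and 3 days expressed in hours; size thresholds are illustrative.
limits = dict(warn_hours=48, crit_hours=72, warn_size="64k", crit_size="32k")
print(classify(13, 818 * 1024, **limits))  # recent and large enough
print(classify(60, 818 * 1024, **limits))  # older than 2 days
print(classify(13, 16, **limits))          # level 0 backup is tiny
```

An item is reported at the worst level reached by either check, which is why a single small level 0 backup is enough to raise a warning even when the backup is fresh.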

In addition, the plugin will also

  • Report as performance data the total size of all level 0 backups plus the most recent level 1 backups (for capacity planning), suitable for use with PNP4Nagios.

By default, every backup item (DLE) which is configured is checked, but the check can be restricted to a particular backup-set or client server.

A PNP4Nagios template is also provided.

Sample Output - OK

/usr/lib/nagios/plugins/check_amanda.pl -C 50

OK: 52 ok in 20 backup sets, size 1625Gi +314Gi|total_sets=20 total_servers=32 total_items=52 ok=52 warn=0 crit=0 l0size=1704404516k l1size=329119065k

Sample Output - Errors

/usr/lib/nagios/plugins/check_amanda.pl -C 6 -i proxy,redmine -w 18 -s 1G

CRITICAL: Only 5 backups found, 6 required, 2 warning in proxy (20hrs), redmine (49Mi), 3 ok in redmine (17hrs,15Gi), size 42Gi +397Mi|total_sets=2 total_servers=3 total_items=5 ok=3 warn=2 crit=0 l0size=44175852k l1size=406706k

In this case, the plugin is reporting critical because only 5 items were found, and we specified '-C 6' on the command line.

The backup-set 'proxy' is reporting a warning because, among the most recent backups of each item in the set, the oldest is 20 hrs old, and we specified '-w 18' on the command line.

The backup-set 'redmine' is reporting a warning because the smallest backup in the set is 49M, and we specified '-s 1G' on the command line.

There are 3 backups which are OK in the backup set redmine, the oldest of which is 17 hrs old, and the total size of which is 15G, excluding the one which is not OK.

'proxy' has no OK backups, as it does not appear after the '3 ok in...' message.

'size' reports that the latest level 0 backups in 'proxy' and 'redmine' use 42 GiB, and this includes OK and non-OK backups.

Detailed, per-backup-item information can be obtained by adding '-v' to the command line.

Sample Output - Verbose

/usr/lib/nagios/plugins/check_amanda.pl -I nagios -s 10M -v

OK:       offsite-servers  cloud-nagios:        /var                      last backup l1 20150129040006   (6hrs) 146Mi+72Mi
OK:       internal-servers onsite-nagios:       /var                      last backup l0 20150128210024  (13hrs) 818Mi
Warning:  internal-servers onsite-nagios:       /etc                      last backup l0 20150128210024  (13hrs) 1268Ki
NO Conf:  test-servers     test-nagios:         /etc                      last backup l0 20150105034106    (24d) 894Ki
Oldest backup is 20150128210024 (13hrs)
WARNING: 1 warning in internal-servers (1268Ki), 2 ok in internal-servers (14hrs,818Mi), offsite-servers (7hrs,146Mi), size 965Mi +72Mi|total_sets=2 total_servers=2 total_items=3 ok=2 warn=1 crit=0 l0size=988216k l1size=73873k

The '-v' flag turns on verbose output.

In this case, we have used -I to search all backup-sets for servers with 'nagios' in their name.

The status message is telling us that one of the items in internal-servers is in state warning because its size is only 1268Ki, and -s 10M asks for a warning if any item is less than 10M in size.

The status message also tells us there are 2 items which are OK, and that these are within the backup-sets internal-servers and offsite-servers. The total l0 size of OK backups and age of the oldest OK backup is shown for each OK backup-set.

The verbose messages shown above go to stderr output, and show us the state of each backup-item that was examined and included in the status message.

There is an additional message, 'NO Conf' that says that it found a backup-item for the server 'test-nagios' that did not have a corresponding disklist.conf entry. This means that the backup-item or backup-set was removed, but the files were not purged from the disk.

Items which appear as 'NO Conf' are not included in the status message, nor in the performance data.

The 'Oldest backup' message is the oldest of all the most-recent backups which were considered (OK, Warning and Critical), excluding the 'NO Conf' items.

Sample Output - List Files

# /usr/lib/nagios/plugins/ixa/check_amanda.pl -I samba -l
OK:       samba-set         samba-srv-01:          /data/groups               last backup l1 20150729183007  (15hrs) 766Gi+425Mi
   /mnt/store1/samba-set/slot60/00001.samba-srv-01._data_groups.0
   /mnt/store1/samba-set/slot68/00001.samba-srv-01._data_groups.1
OK:       samba-set         samba-srv-01:          /data/homes                last backup l2 20150729183007  (15hrs) 154Gi+68Gi
   /mnt/store1/samba-set/slot59/00001.samba-srv-01._data_homes.0
   /mnt/store1/samba-set/slot65/00001.samba-srv-01._data_homes.1
   /mnt/store1/samba-set/slot67/00001.samba-srv-01._data_homes.2
Oldest backup is 20150729183007 (15hrs)
OK: 2 ok in samba-set (15hrs,920Gi), size 920Gi +69Gi|total_sets=1 total_servers=1 total_items=2 ok=2 warn=0 crit=0 l0size=965122950k l1size=72047326k

The '-l' flag turns on the 'list files' output and also turns on verbose output.

In the above output, we have 2 backup items (DLEs) for the server samba-srv-01, these being /data/groups and /data/homes. The individual status messages OK: ... tell us the size and age of the most recent backup. Underneath each status message are listed the most recent level 0, level 1 and (if applicable) level 2 backup-files that would be required to restore the most recent backup.

This can be useful if there is a problem with the most-recent backup, and makes it easy to locate the files in question.

It is also useful for NO Conf: ... backups, which still have a presence on the backup file-system, but are not present in the Zmanda web-interface.
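The listed paths can be split back into backup set, slot, host, disk and level. The sketch below infers the layout purely from the sample listing above (media-dir/backup-set/slot/NNNNN.host.disk.level, with '/' in the disk name replaced by '_'); that layout, and the helper name parse_backup_path, are assumptions rather than something the plugin documents.

```python
import re
from pathlib import PurePosixPath

# Assumed file name layout: <file number>.<host>.<disk with / as _>.<level>
_NAME = re.compile(r"^(?P<fileno>\d+)\.(?P<host>.+)\.(?P<disk>_[^.]*)\.(?P<level>\d+)$")


def parse_backup_path(path):
    """Split a listed backup file path into its components, or return None."""
    p = PurePosixPath(path)
    match = _NAME.match(p.name)
    if match is None or len(p.parts) < 3:
        return None
    return {
        "backup_set": p.parts[-3],
        "slot": p.parts[-2],
        "host": match["host"],
        # Ambiguous if the original path itself contains '_' or '.'.
        "disk": match["disk"].replace("_", "/"),
        "level": int(match["level"]),
    }


info = parse_backup_path(
    "/mnt/store1/samba-set/slot67/00001.samba-srv-01._data_homes.2"
)
print(info)
```

The mapping back to a disk name is lossy: a file-system such as /data/my_files cannot be told apart from /data/my/files, so treat the result as a hint for locating files rather than an exact reconstruction.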

Use-case examples

Default use case

check_amanda.pl

Check all backup items with default settings:

  • Warning if more than 2 days old, critical if more than 3 days old
  • Warning if the most recent level 0 backup is less than 64k, critical if less than 32k
  • Critical if no backup items found

check_amanda.pl -m /mnt/bigdisk

As above, but the 'backup media' directory is /mnt/bigdisk instead of the default /mnt/store1

Check specific backup sets

check_amanda.pl -i weekly-backups -w 168 -c 336

Check only the backup set named weekly-backups, warning if any item is more than 7 days old (168 hours), critical if more than 14 days old.

Search for specific client hosts

check_amanda.pl -I '(prod|prd)' -C 7 -w 24

Look for backup items for servers with 'prod' or 'prd' in their name, and ensure that at least 7 items (DLEs) are found. Generate a warning if any of these items is more than 24 hours old. (Other checks are performed as per the default settings.)

Exclude specific backup sets

check_amanda.pl -x weekly-backups

Exclude the backup set named 'weekly-backups'. Check everything else against the default settings.

Exclude specific client hosts

check_amanda.pl -X '(dev|test)' -C 30

Exclude backup items for servers with 'dev' or 'test' in the name from the check. Check that at least 30 backup items are included in the check. Check everything else with the default settings.

The -X ... option is particularly useful for excluding servers which have been retired (decommissioned), but for which the backup config and data are still kept (in order to make it easy to restore data if required).

Notes

The arguments -i and -x expect a comma-separated list of specific backup-set names (not regular expressions).

The arguments -I and -X expect a single regular expression and select by hostname. For example, to include one specific hostname only, use:

-I '^my-db-server$'
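The difference between the two kinds of selector can be sketched as follows. This is a Python illustration, not the plugin's Perl code: select_items and its parameter names are hypothetical, and the two 'my-db-server' hosts are made up to show the effect of anchoring. It assumes set names are compared exactly and host patterns are matched anywhere in the hostname unless anchored.

```python
import re


def select_items(items, include_sets=None, exclude_sets=None,
                 include_host=None, exclude_host=None):
    """Filter (backup_set, host, disk) tuples the way -i/-x/-I/-X are described."""
    wanted = set(include_sets.split(",")) if include_sets else None
    unwanted = set(exclude_sets.split(",")) if exclude_sets else set()
    selected = []
    for backup_set, host, disk in items:
        if wanted is not None and backup_set not in wanted:
            continue  # -i: exact set names only
        if backup_set in unwanted:
            continue  # -x: exact set names only
        if include_host and not re.search(include_host, host):
            continue  # -I: regular expression on the hostname
        if exclude_host and re.search(exclude_host, host):
            continue  # -X: regular expression on the hostname
        selected.append((backup_set, host, disk))
    return selected


items = [
    ("offsite-servers", "cloud-nagios", "/var"),
    ("internal-servers", "onsite-nagios", "/var"),
    ("internal-servers", "onsite-nagios", "/etc"),
    ("internal-servers", "my-db-server", "/data"),      # hypothetical host
    ("internal-servers", "my-db-server-old", "/data"),  # hypothetical host
]
print(len(select_items(items, include_host="nagios")))
print(select_items(items, include_host="^my-db-server$"))
```

Without the ^ and $ anchors, the pattern my-db-server would also select my-db-server-old.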

A backup with a file-size of zero (after subtracting the header) is considered a failed backup, and is ignored.

Other than checking size and date, no further integrity checking is done, i.e. this plugin will not detect corrupted or truncated backups. Although this would be feasible, such a check would be unlikely to complete in under 60s.

This plugin will normally be used with NRPE. The default timeout for check_nrpe is 10 seconds. Increase this to at least 60 seconds to avoid 'false criticals' due to the plugin taking a long time to go through the whole tree of backup items on the backup-media disk.

The plugin reads the 1st line of each backup item.

This plugin distinguishes between a backup-in-progress and a completed backup by using the time-stamp on the file. Any backup file modified in the last 30 seconds is considered 'in-progress' and is excluded from the check results. The next-oldest file is used instead.
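The in-progress rule can be illustrated with a short Python sketch. The function name newest_completed is hypothetical and the code only mirrors the rule as described here (modification time within the last 30 seconds means 'in-progress'); it is not taken from the plugin.

```python
import os
import time

IN_PROGRESS_SECONDS = 30  # files modified more recently than this are skipped


def newest_completed(paths, now=None):
    """Return the most recently modified file that is not still being written."""
    now = time.time() if now is None else now
    completed = [
        p for p in paths
        if now - os.path.getmtime(p) >= IN_PROGRESS_SECONDS
    ]
    # Fall back to None when every candidate is still in progress.
    return max(completed, key=os.path.getmtime, default=None)
```

Because the decision is based only on the time-stamp, a backup that stalls for more than 30 seconds while being written would be treated as complete by this logic.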

Sample config

define service {
  use                            generic-service-quiet          ; template name
  service_description            amanda-backups
  hostgroup_name                 backup-servers
  check_interval                 60
  max_check_attempts             12
  retry_interval                 5
  notification_interval          120
  stalking_options               o,w,c    ; save output when it changes - should be infrequent
  check_command                  check_nrpe_1arg!check_amanda -t 60
}

nrpe.cfg config

command[check_amanda]=/usr/bin/sudo -u amandabackup /usr/lib/nagios/plugins/check_amanda.pl -C 50

/etc/sudoers config

Defaults:nagios !requiretty
nagios  ALL=(amandabackup) NOPASSWD: /usr/lib/nagios/plugins/check_amanda.pl

Performance Data

This plugin generates some summary information as Nagios performance data. This can be graphed using PNP4Nagios.

Of particular interest are the items:

  • l0size - the total size of the most recent level 0 backup for each backup item (DLE)
  • l1size - the total size of the most recent level 1 backup which is more recent than the level 0 backup for each backup item

These statistics are reported for the selected backup-sets and servers only, i.e. after -i/-x/-I/-X have been applied.

A PNP4Nagios template is supplied.
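For ad-hoc analysis outside PNP4Nagios, the performance data can be pulled out of the status line directly. The following Python sketch parses the first sample output shown earlier; parse_plugin_output is a hypothetical helper, and it assumes the 'k' suffix means KiB, which is consistent with the Gi figures in the status text.

```python
def parse_plugin_output(line):
    """Split a status line into its text and a dict of integer metrics."""
    status_text, _, perfdata = line.partition("|")
    metrics = {}
    for token in perfdata.split():
        label, _, value = token.partition("=")
        metrics[label] = int(value.rstrip("k"))  # sizes carry a 'k' suffix
    return status_text.strip(), metrics


line = (
    "OK: 52 ok in 20 backup sets, size 1625Gi +314Gi|"
    "total_sets=20 total_servers=32 total_items=52 ok=52 warn=0 crit=0 "
    "l0size=1704404516k l1size=329119065k"
)
status, metrics = parse_plugin_output(line)
print(status)
print(round(metrics["l0size"] / 1024**2), "Gi of level 0 backups")   # 1625
print(round(metrics["l1size"] / 1024**2), "Gi of level 1 backups")   # 314
```

Dividing the k values by 1024^2 reproduces the '1625Gi +314Gi' shown in the status text.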

Sample pnp4nagios graphs

In the sample graph, there are a number of failing checks on the 22nd of the month. This is due to new backups being added, and not having been backed up yet.

This shows on the output as

CRITICAL: 14 critical in internal-servers (never), 43 ok in 19 backup sets, size 1785Gi +75Gi

Running a manual backup will clear this alert, once the backup is complete.

The l0size and l1size statistics are useful for estimating what retention is feasible.

The l0size and l1size give a simple answer to the question:

  • Q. How much disk space do I need to store 4 weeks of full weekly backups and their daily incremental backups?
  • A. 4 * l0size + 24 * l1size = 1.92T * 4 + 89.7G * 24 = 9.8 TB

Note that the plugin is reporting 1785 GiB (gibibytes, i.e. multiples of 1024^3 bytes), while rrdtool is reporting 1917 GB (gigabytes, i.e. multiples of 10^9 bytes). Both describe the same amount of data.
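The retention estimate and the unit conversion can be checked with a few lines of Python. The figures are the ones quoted above; the function name retention_bytes is just for illustration.

```python
TB = 10**12
GB = 10**9
GIB = 1024**3


def retention_bytes(l0size, l1size, fulls, incrementals):
    """Space needed to keep a number of full and incremental backups."""
    return fulls * l0size + incrementals * l1size


# 4 weekly full backups plus 24 daily incrementals, using the rrdtool figures.
needed = retention_bytes(1.92 * TB, 89.7 * GB, fulls=4, incrementals=24)
print(f"{needed / TB:.1f} TB")           # 9.8 TB

# The plugin's 1785 GiB expressed in the decimal units used by rrdtool.
print(round(1785 * GIB / GB), "GB")      # 1917 GB
```

Mixing the two unit systems changes the answer by roughly 7%, so it is worth picking one before comparing the estimate against the size of the backup disk.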

l0size and l1size are also useful for observing trends in the size of a complete backup of all servers. Simple disk-usage statistics are difficult to interpret, because they do not show how many complete backups are present at any time.