check_cdu

Monitor Server Technology Cabinet Distribution Unit (CDU) Products

Latest Release: 2014-09-08
Created: 2014-09-08

Application Requirements

Several standard Perl libraries are required for this program to function, namely Net::SNMP, Getopt::Std, Getopt::Long, and Nagios::Plugin::Threshold.

General Usage

check_cdu.pl -H <hostname> -C <community> [-t SNMP timeout] [-p SNMP port]

Required Arguments

Only the hostname and community are required. Timeout will default to 2 seconds, port 161.

Thresholds

I opted to use the Nagios::Plugin::Threshold class to handle thresholds. In general I prefer not to use Nagios::Plugin objects, but I could not avoid the Threshold class. I apologize for the added dependency; I could not afford to reinvent the wheel. The benefit is that the threshold logic in this plugin follows the standard used by many other plugins. For reference, here are the general threshold guidelines:

Range definition    Generate an alert if x...
10                  < 0 or > 10, (outside the range of {0 .. 10})
10:                 < 10, (outside {10 .. ∞})
~:10                > 10, (outside the range of {-∞ .. 10})
10:20               < 10 or > 20, (outside the range of {10 .. 20})
@10:20              ≥ 10 and ≤ 20, (inside the range of {10 .. 20})

See http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT for the full, official documentation.
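
As a rough illustration of how these range strings are interpreted, here is a minimal Python sketch. It is not the plugin's code (the plugin is Perl and delegates this to Nagios::Plugin::Threshold); the function names parse_range and should_alert are made up for this example, and treating an omitted start value as 0 is an assumption.

```python
import math


def parse_range(spec):
    """Parse a Nagios-style range string into (low, high, alert_inside)."""
    alert_inside = spec.startswith("@")  # '@' means: alert when the value is INSIDE the range
    if alert_inside:
        spec = spec[1:]
    if ":" in spec:
        low_text, high_text = spec.split(":", 1)
    else:
        low_text, high_text = "0", spec  # a bare "10" is shorthand for "0:10"
    if low_text == "~":
        low = -math.inf                  # '~' stands for negative infinity
    elif low_text == "":
        low = 0.0                        # assumption: an omitted start defaults to 0
    else:
        low = float(low_text)
    high = math.inf if high_text == "" else float(high_text)
    return low, high, alert_inside


def should_alert(value, spec):
    """Return True if value violates the range described by spec."""
    low, high, alert_inside = parse_range(spec)
    within = low <= value <= high
    return within if alert_inside else not within
```

For example, should_alert(85, "60:80") is true because 85 lies outside 60..80, while should_alert(15, "@10:20") is true because 15 lies inside the inverted range.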

Full Documentation

check_cdu is intended to provide extremely flexible and extensive monitoring support for Server Technology Cabinet Distribution Units (CDU). In general the workflow for this application follows this procedure:

Pull in an entire SNMP table using a Net::SNMP session and get_table().
Re-enumerate these "flat" values into a structured hash.
Evaluate any options or thresholds passed on the command line by the user.
Process the command line options against the data collected from the CDU.
Exit appropriately given the status results.
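
The second step above (turning a flat SNMP table into a structured hash) can be sketched roughly as follows. This is a Python illustration, not the plugin's Perl code. The base OID, the column numbers, and the column names are all placeholders invented for the example; the real ones come from the CDU's MIB.

```python
from collections import defaultdict

# Placeholder column numbers and names; the real ones are defined by the CDU's MIB.
COLUMNS = {"2": "Name", "3": "Status"}


def restructure(base_oid, flat_table):
    """Group a flat {oid: value} table into {row_index: {column_name: value}}."""
    rows = defaultdict(dict)
    prefix = base_oid + "."
    for oid, value in flat_table.items():
        if not oid.startswith(prefix):
            continue  # ignore anything outside the requested table
        # What remains after the prefix is "<column>.<row index>"
        column, _, index = oid[len(prefix):].partition(".")
        name = COLUMNS.get(column)
        if name and index:
            rows[index][name] = value
    return dict(rows)


# Placeholder base OID for a hypothetical tower table
BASE = "1.3.6.1.4.1.99999.1.1"
flat = {
    BASE + ".2.1": "TowerA",
    BASE + ".3.1": 0,
    BASE + ".2.2": "TowerB",
    BASE + ".3.2": 0,
}
towers = restructure(BASE, flat)
```

Keying the result by row index keeps every column of one tower, infeed, or probe together, which makes the later threshold checks a simple loop over rows.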

This workflow is generally followed in four slightly different ways depending on the desired options.

These four procedures are:

General System
Environment
Towers
Infeeds

Environment

Temperature and humidity (T/H) probes are an optional feature of a CDU. On most units, only two T/H ports exist. When an EMCU 1-1B is used in conjunction with a CDU, this can be expanded to 4 T/H ports. I am not aware of a way to increase the number of T/H probes past this amount on a single CDU. Regardless, this application is designed to support any number of T/H probes.

Prior to running any checks for temperature or humidity, this plugin will check the T/H probe status. The following states will result in an UNKNOWN return:

notFound
readError
lost
noComm

This applies to both temperature and humidity. The LowThresh and HighThresh states are ignored. Since no data is available in any of the states listed above, it's not logical to initiate a WARNING or CRITICAL and roll someone out of bed. This behavior can easily be changed in the code, if desired.

In its simplest form, the environment checks will query all available T/H probes connected to the system. The CDU has internal High/Low thresholds configured for both Temperature and Humidity, and this is done on a per sensor basis. Without any arguments, this plugin will honor those values.

Considering that there is only one high:low range, I opted to designate this as a WARNING threshold. This behavior can easily be changed in the code to the CRITICAL state, if desired, but it is NOT modifiable from the command line. A basic invocation would resemble:

$ check_cdu.pl -H 192.168.0.1 -C public --temp --humid
OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 18C, Bottom-Rack-Inlet_F31(A1): 43%,
Bottom-Rack-Exhaust_F32(A2): 33C, Bottom-Rack-Exhaust_F32(A2): 16%, Top-Rack-Inlet_F31(B1): 24C, 
Top-Rack-Inlet_F31(B1): 28%, Top-Rack-Exhaust_F32(B2): 36.5C, Top-Rack-Exhaust_F32(B2): 12%

The plugin output always includes the systemLocation defined on the CDU first. The various objects queried are then returned in a comma-separated list. For temperature and humidity probes, the sensor Name is returned along with the ID in parentheses. If names haven't been set, the defaults will still be displayed. Finally, the value is listed for each sensor. The temperature scale is automatically determined from the TempScale object provided via SNMP. For instances where the CDU is configured for one scale, but the user wants the plugin to report in another, the --fahrenheit and --celsius options are quite handy:

$ check_cdu.pl -H 192.168.0.1 -C public --temp --humid --fahrenheit
OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 62.6F, Bottom-Rack-Inlet_F31(A1): 45%, Bottom-Rack-Exhaust_F32(A2): 91.4F

--celsius works in a similar fashion. If a scale is passed to the plugin and the T/H probe is already configured for that scale, no error will occur. The values will be reported in the native scale for that sensor.
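
The scale handling described above amounts to a conditional unit conversion. A minimal Python sketch follows (not the plugin's Perl code). The function name, the 'C'/'F' scale labels, and rounding to one decimal place are assumptions made for this example.

```python
def convert_temperature(value, native_scale, wanted_scale=None):
    """Return (value, scale), converting only when a different scale is requested."""
    if wanted_scale is None or wanted_scale == native_scale:
        return value, native_scale  # already in the desired scale, report as-is
    if native_scale == "C" and wanted_scale == "F":
        return round(value * 9 / 5 + 32, 1), "F"
    if native_scale == "F" and wanted_scale == "C":
        return round((value - 32) * 5 / 9, 1), "C"
    raise ValueError("unsupported scale: %r -> %r" % (native_scale, wanted_scale))
```

With this sketch, a probe reading 33 on a Celsius-configured CDU is reported as 91.4 when Fahrenheit is requested, and is passed through unchanged when Celsius is requested.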

Expanding on this basic functionality is the --ths option. --ths allows the user to select which sensors to query, based on the sensor ID (not the name!). --ths will automatically determine if the sensors exist, and exit UNKNOWN if they were not found. All of the regular sensor status checks are still performed.

$ check_cdu.pl -H 192.168.0.1 -C public --temp --ths A1,B2 --fahrenheit
OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 62.6F, Top-Rack-Exhaust_F32(B2): 97.7F

Note that I also left out the --humid option. Either option can be specified alone, or both together, providing maximum flexibility for designing purpose-built Nagios service checks.

User-supplied WARNING and CRITICAL thresholds can be applied to the temperature and humidity sensors using the --warning and --critical directives. This overrides the automatic threshold logic that relies upon the internal CDU configuration. Either --warning or --critical can be used, or both can be used together. When querying multiple temperature sensors, a single threshold is applied across all sensors. The same is true for querying multiple humidity sensors. Both temperature and humidity can be queried together in the same command by "chaining" the thresholds together. Here are a few examples:

$ check_cdu.pl -H 192.168.0.1 -C public --temp --fahrenheit --ths A1,B1 --warning 60:80
OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 62.6F, Top-Rack-Inlet_F31(B1): 77F

(Query just the temperature from T/H probes A1 and B1 and apply a warning threshold to alarm if either sensor falls below 60F or above 80F)

$ check_cdu.pl -H 192.168.0.1 -C public --humid --ths A2,B2 --warning 10:70
OK - BLDG_ROOM_RACK, Bottom-Rack-Exhaust_F32(A2): 18%, Top-Rack-Exhaust_F32(B2): 13%

(Query just the humidity from T/H probes A2 and B2 and apply a warning threshold to alarm if either sensor falls below 10% or above 70% relative humidity)

$ check_cdu.pl -H 192.168.0.1 -C public --temp --humid --fahrenheit --ths A1 --warning 80,20: --critical 95,10:
OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 64.4F, Bottom-Rack-Inlet_F31(A1): 48%

(Check just sensor A1, but query both temperature and humidity from this sensor. If the temperature rises above 80F or the humidity falls below 20% generate a WARNING. If the temperature rises above 95 or the humidity falls below 10% generate a CRITICAL.)

IMPORTANT NOTE: When specifying both --temp and --humid, the thresholds are chained together as temperature_threshold,humidity_threshold regardless of the order in which --temp and --humid are passed! In other words, the following are equivalent: '--temp --humid --warning 45,60' and '--humid --temp --warning 45,60'. The following are NOT equivalent: '--temp --humid --warning 45,60' and '--humid --temp --warning 60,45'.
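
To make the chaining rule concrete, here is a small Python sketch (not the plugin's Perl code). The function names are invented for this example, and raising an error when the number of thresholds does not match the number of metrics is an assumption about how a mismatch might be handled.

```python
def split_chained(threshold, metrics):
    """Map a comma-chained threshold string onto an ordered list of metrics."""
    if threshold is None:
        return {metric: None for metric in metrics}
    parts = threshold.split(",")
    if len(parts) != len(metrics):
        # Assumption: a count mismatch is treated as a user input error
        raise ValueError("expected %d thresholds, got %d" % (len(metrics), len(parts)))
    return dict(zip(metrics, parts))


def environment_thresholds(want_temp, want_humid, threshold):
    """Temperature always comes first, whatever order the options were given in."""
    metrics = []
    if want_temp:
        metrics.append("temp")
    if want_humid:
        metrics.append("humid")
    return split_chained(threshold, metrics)
```

Calling environment_thresholds(True, True, "80,20:") yields a temperature threshold of "80" and a humidity threshold of "20:", matching the third example above.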

Towers

Tower state and statistics are checked using the --tower directive. If specified with no arguments, only the overall state of the tower(s) is checked. The ability to query a specific tower does not exist at this time. If the 'noComm' state is encountered for a tower, a WARNING state is generated. This is likely only possible on a slave tower. If the master tower is in state 'noComm', I doubt you'd get this far with it ;) If 'fanFail' or 'overTemp' states are encountered, the state is returned as CRITICAL.

Various metrics from the tower can also be queried by passing them to the --tower directive as a comma separated list. At the time of development, these metrics are only supported on PIPS units. A regular SMART or SWITCHED CDU will likely not benefit from any of these enhancements. The plugin will correctly identify the absence of these metrics if you attempt to query them. The metrics are:

VACapacity
ApparentPower
VACapacityUsed
ActivePower
Energy
LineFrequency

It is very important to note that the 'Status' checks are largely skipped when querying any of these metrics. The 'fanFail' and 'overTemp' states are completely ignored. If the 'noComm' state is encountered, the metric(s) are skipped and a state UNKNOWN is returned. Given this, to fully utilize the features of this plugin one should ALWAYS have a service check using just '--tower'. It was not logical to exit on WARNING/CRITICAL for a 'noComm' state multiple times (say, for instance if there are separate service checks defined for every metric listed above).
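
The two behaviours described for tower Status can be summarised in a short Python sketch (not the plugin's Perl code). The function name is invented, and treating any status not mentioned in this document as OK is an assumption.

```python
def tower_state(status, metrics_requested):
    """Classify a tower's Status value.

    Returns a Nagios state name, or None when the caller should go on
    to evaluate the requested metrics against their thresholds.
    """
    if metrics_requested:
        # With metrics, only noComm matters: the metrics are skipped and reported UNKNOWN
        return "UNKNOWN" if status == "noComm" else None
    if status == "noComm":
        return "WARNING"
    if status in ("fanFail", "overTemp"):
        return "CRITICAL"
    return "OK"  # assumption: any other status (e.g. normal) is healthy
```

This shows why a plain '--tower' service check is still needed: tower_state("fanFail", True) returns None, so a metric-only check never raises the fan failure.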

The towers are identified similarly to the T/H probes, in the form of NAME(ID): VALUE. These are all configurable on the CDU itself. Typically, a circuit name would be used for a Tower name. Thresholds are applied in a similar manner to the --temp and --humid checks. ORDER DOES MATTER. The order in which the metrics are listed is the order in which the thresholds should be "chained". The same logic applies to these thresholds; see the Thresholds section for specifics.

Here are some examples:

$ check_cdu.pl -H 192.168.0.1 -C public --tower
OK - BLDG_ROOM_RACK, TowerA(A) Status: normal(0), TowerB(B) Status: normal(0)

$ check_cdu.pl -H 192.168.0.1 -C public --tower ApparentPower,ActivePower,VACapacityUsed --warning 1200,1000,30
OK - BLDG_ROOM_RACK, TowerA(A) ApparentPower: 993VA, TowerA(A) ActivePower: 939W, TowerA(A) VACapacityUsed: 9.1%,
TowerB(B) ApparentPower: 927VA, TowerB(B) ActivePower: 870W, TowerB(B) VACapacityUsed: 8.5%

(Check that ApparentPower does not exceed 1200VA, ActivePower does not exceed 1000W and the Capacity used does not exceed 30%. If any of these scenarios occur, generate a WARNING)

$ check_cdu.pl -H 192.168.0.1 -C public --tower Energy --warning 10000 --critical 15000
OK - BLDG_ROOM_RACK, TowerA(A) Energy: 6654kWh, TowerB(B) Energy: 7658kWh

(If the kWh consumption of either tower exceeds 10,000 generate a WARNING. If it exceeds 15,000 generate a CRITICAL. Say you're in a co-lo paying for power utilization and your piggy bank will run dry if you use too much power ...)

$ check_cdu.pl -H 192.168.0.1 -C public --tower VACapacity --warning 10800
WARNING - BLDG_ROOM_RACK, TowerA(A) VACapacity: -1VA

(This is a very bizarre but interesting scenario. I included VACapacity because it was there, but who would logically check a static value such as the capacity of a tower? Well, it turns out that this particular unit is slightly broken and the Capacity is -1. This should just provide some ideas on why it may be useful to monitor things that otherwise wouldn't make sense)

Infeeds

Infeed state and statistics are checked using the --infeed directive. It is very similar to the --tower check. If specified with no arguments, the infeed 'Status' and 'LoadStatus' objects are checked. The ability to query a specific infeed does not exist at this time (and likely never will). The following infeed Statuses will generate a WARNING:

noComm
offWait
onWait
off

A CRITICAL will be generated if the Infeed has the following Status:

offError
onError

Likewise the LoadStatus object is checked for each infeed as well. A WARNING is generated for the following LoadStatus conditions:

noComm
reading
loadLow

I wasn't sure what the 'reading' state was; this state is also present across many other CDU objects. There is a good chance it simply implies that the value is currently being "read" or updated, and if that is the case it will likely be ignored in future versions of the plugin. The loadLow state must be determined by an internal CDU threshold; however, this threshold isn't available via SNMP, so I left it alone. A CRITICAL is generated for the other LoadStatus states:

notOn
loadHigh
overLoad
readError

Simple modifications to the code can be done to move these various Statuses between the CRITICAL and WARNING states if desired, but it is not possible from the command line.
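
The Status and LoadStatus mappings listed above boil down to two lookup tables. The following Python sketch (not the plugin's Perl code) shows one way to express them. The names are invented for this example, and treating any state not listed above, such as 'on' or 'normal', as OK is an assumption.

```python
# State names are taken from the lists above; the table and function names are illustrative.
STATUS_STATES = {
    "noComm": "WARNING", "offWait": "WARNING", "onWait": "WARNING", "off": "WARNING",
    "offError": "CRITICAL", "onError": "CRITICAL",
}
LOAD_STATUS_STATES = {
    "noComm": "WARNING", "reading": "WARNING", "loadLow": "WARNING",
    "notOn": "CRITICAL", "loadHigh": "CRITICAL", "overLoad": "CRITICAL",
    "readError": "CRITICAL",
}


def infeed_states(status, load_status):
    """Return the (Status, LoadStatus) Nagios states for one infeed."""
    # Assumption: anything not listed (e.g. 'on', 'normal') counts as OK
    return (
        STATUS_STATES.get(status, "OK"),
        LOAD_STATUS_STATES.get(load_status, "OK"),
    )
```

Moving a state between WARNING and CRITICAL is then a one-line change to the relevant table, which mirrors the kind of code modification mentioned above.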

Similar to the --tower directive, many of these Status checks are skipped when querying specific metrics from the infeed. If any metrics are provided to --infeed, the infeed Status is checked for the 'noComm' status. If it is found, the plugin will append this to the UNKNOWN 'bucket' and skip checking the metric. The following infeed metrics are currently supported:

PhaseVoltage *
Voltage
CapacityUsed *
Power
ApparentPower *
Energy *
LoadValue
PhaseCurrent *

* These metrics are only available on PIPS units.

The infeeds are identified similarly to the T/H probes, in the form of NAME(ID): VALUE. These are all configurable on the CDU itself. Typically, a circuit name would be used for an infeed name. Thresholds are applied in a similar manner to the --temp and --humid checks. ORDER DOES MATTER. The order in which the metrics are listed is the order in which the thresholds should be "chained". The same logic applies to these thresholds; see the Thresholds section for specifics.

Some examples:

$ check_cdu.pl -H 192.168.0.1 -C public --infeed
OK - BLDG_ROOM_RACK, TowerA_InfeedA(AA) Status: on(1), TowerA_InfeedA(AA) LoadStatus: normal(0), 
TowerA_InfeedB(AB) Status: on(1), TowerA_InfeedB(AB) LoadStatus: normal(0), TowerA_InfeedC(AC) Status:
on(1), TowerA_InfeedC(AC) LoadStatus: normal(0), TowerB_InfeedA(BA) Status: on(1), TowerB_InfeedA(BA)
LoadStatus: normal(0), TowerB_InfeedB(BB) Status: on(1), TowerB_InfeedB(BB) LoadStatus: normal(0),
TowerB_InfeedC(BC) Status: on(1), TowerB_InfeedC(BC) LoadStatus: normal(0)

(This is a basic infeed check for a master/slave 3 phase CDU. There are 6 infeeds total across both towers, and two separate checks are performed (Status, LoadStatus) for each infeed. This is a lot of data.)

$ check_cdu.pl -H 192.168.0.1 -C public --infeed LoadValue --warning 12 --critical 24
OK - BLDG_ROOM_RACK, TowerA_InfeedA(AA) LoadValue: 4.07A, TowerA_InfeedB(AB) LoadValue: 3.21A,
TowerA_InfeedC(AC) LoadValue: 1.62A, TowerB_InfeedA(BA) LoadValue: 3.61A, TowerB_InfeedB(BB) LoadValue:
2.76A, TowerB_InfeedC(BC) LoadValue: 1.73A

(This is a simple load/current check which applies a warning and critical threshold to the load of all 6 infeeds on a dual tower 3 phase CDU.)

$ check_cdu.pl -H 192.168.0.1 -C public --infeed ApparentPower,CapacityUsed --warning 1000,20
OK - BLDG_ROOM_RACK, TowerA_InfeedA(AA) ApparentPower: 673VA, TowerA_InfeedA(AA) CapacityUsed: 12.6%,
TowerA_InfeedB(AB) ApparentPower: 0VA, TowerA_InfeedB(AB) CapacityUsed: 10.5%, TowerA_InfeedC(AC)
ApparentPower: 317VA, TowerA_InfeedC(AC) CapacityUsed: 5.3%, TowerB_InfeedA(BA) ApparentPower: 575VA,
TowerB_InfeedA(BA) CapacityUsed: 12%, TowerB_InfeedB(BB) ApparentPower: 0VA, TowerB_InfeedB(BB) CapacityUsed:
8.9%, TowerB_InfeedC(BC) ApparentPower: 348VA, TowerB_InfeedC(BC) CapacityUsed: 5.7%

(Generate a warning if the ApparentPower of any infeed exceeds 1000VA, and generate a warning if the Capacity Used exceeds 20% on any infeed)

PhaseVoltage and PhaseCurrent use the PhaseID instead of infeedID in the plugin output. Throughout our testing, it has been difficult to ascertain a difference between PhaseVoltage and Voltage. There is generally a considerable difference between PhaseCurrent and LoadValue, however it most likely makes sense to only check one of these.

Enhanced Infeed checks

There are two additional metrics that can be checked with the '--infeed' directive. They are:

LoadImbalance
VoltageImbalance

These metrics are not provided directly by the CDU, rather they are computed internally by the plugin.

Please note, these special metrics are ONLY available on 3 phase units. Some versions of the CDU firmware provide a '3-Phase Load Out-of-Balance Threshold' setting and the results are displayed on the 'istat' menu. None of this information is provided via SNMP. Thresholds are required for either of these computed metrics. Unlike the display in 'istat' only the out-of-balance infeed(s) will be displayed, not infeeds across the entire tower. I used a basic 3 phase motor load phase imbalance equation to generate the imbalance percentages for both Current and Voltage:

Percent imbalance = maximum deviation from average / average of three phases * 100

When an infeed is queried for either voltage or current imbalance, the plugin determines which tower the infeed is a part of. All infeed values (voltage or current) for that tower are then averaged together. The deviation from the average is then determined for this particular infeed, accommodating either a negative or positive delta from the average. This is then divided by the average and multiplied by 100 to determine the percent imbalance. This equation was pulled from the following document:

http://support.fluke.com/educators/download/asset/2161031_b_w.pdf
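
The per-infeed calculation described above can be written as a few lines of Python (a sketch, not the plugin's Perl code). The function name is invented, and returning 0 when the tower average is zero is an assumption added to avoid a division by zero.

```python
def percent_imbalance(value, tower_values):
    """Percent deviation of one infeed's value from its tower's average."""
    average = sum(tower_values) / len(tower_values)
    if average == 0:
        return 0.0  # assumption: an idle tower is reported as balanced
    # abs() accommodates either a negative or a positive delta from the average
    return abs(value - average) / average * 100
```

For three phases carrying 10 A, 10 A, and 12 A, the average is about 10.67 A, so the 12 A infeed deviates by 12.5% and each 10 A infeed by 6.25%.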

An example invocation of this check would look like:

$ check_cdu.pl -H 192.168.0.1 -C public --infeed LoadImbalance --warning 20 --critical 30
CRITICAL - BLDG_ROOM_RACK, TowerA_InfeedA(AA) LoadImbalance: 39.07%, TowerA_InfeedC(AC)
LoadImbalance: 50.46%, TowerB_InfeedA(BA) LoadImbalance: 33.54%, TowerB_InfeedC(BC) LoadImbalance: 34.54%

(Generate a WARNING if the load imbalance of any infeed exceeds 20%, and a CRITICAL if the imbalance exceeds 30%. Clearly, this is not a well balanced rack! Hence the need for such a check)

The same can be done for voltage, however the margins should be much, much smaller than load.

This can be useful to detect bad incoming power conditions. Unfortunately, this only evaluates an imbalance across the phases of a single tower. A more useful approach would be to judge imbalance between two separate towers, and hence two separate feeds/circuits which could be coming from two separate sources (i.e. UPS/utility). Currently that functionality does not exist.

Here is an example:

$ check_cdu.pl -H 192.168.0.1 -C public --infeed VoltageImbalance --warning .5 --critical 2
WARNING - BLDG_ROOM_RACK, TowerB_InfeedB(BB) VoltageImbalance: 0.65%

(Generate a WARNING if the imbalance between voltages per infeed is greater than .5% and a CRITICAL if the imbalance is greater than 2%)

Plugin Termination

Numerous scenarios exist where the plugin will exit abnormally. This could be due to user input error, failure to retrieve required SNMP data, etc. In all identifiable cases, the plugin will exit with an UNKNOWN state and a descriptive message indicating the failure. Users should be aware that if UNKNOWN states are not reported (a common configuration), monitoring of the CDU is effectively rendered useless when all SNMP calls fail. This is unlike plugins such as check_nrpe, which exit CRITICAL if an SSL negotiation error occurs!

Throughout the workflow of the plugin metrics are evaluated against thresholds and the results are placed into various 'buckets' reflecting OK,WARNING,CRITICAL and UNKNOWN states. At the end of the workflow, reporting is done based upon the presence or absence of these buckets. If both CRITICAL and WARNING conditions exist, they are BOTH reported in the plugin_output text, however the state is reported as CRITICAL. An example of this can be seen in the following output:

$ check_cdu.pl -H 192.168.0.1 -C public --temp --humid --ths A1 --warning 16,30 --critical 20,40
CRITICAL - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 43%, WARNING - Bottom-Rack-Inlet_F31(A1): 17C
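
The bucket approach can be sketched in Python as follows (not the plugin's Perl code). The exit codes are the standard Nagios values. The function name, the ranking of UNKNOWN below WARNING, and the omission of OK items whenever a problem exists are assumptions made for this example.

```python
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}
# Assumption: UNKNOWN ranks below WARNING when several buckets are filled
PROBLEM_ORDER = ["CRITICAL", "WARNING", "UNKNOWN"]


def summarize(location, buckets):
    """Build (plugin_output, exit_code) from {state: [messages]} buckets."""
    problems = [state for state in PROBLEM_ORDER if buckets.get(state)]
    if not problems:
        ok_items = buckets.get("OK", [])
        return "OK - " + ", ".join([location] + ok_items), EXIT_CODES["OK"]
    overall = problems[0]  # the most severe non-empty bucket sets the state
    parts = [overall + " - " + ", ".join([location] + buckets[overall])]
    for state in problems[1:]:
        # Less severe buckets are still listed in the output text
        parts.append(state + " - " + ", ".join(buckets[state]))
    return ", ".join(parts), EXIT_CODES[overall]
```

Feeding it one CRITICAL humidity message and one WARNING temperature message for location BLDG_ROOM_RACK produces text in the same shape as the output above, with exit code 2.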

Some options produce a large amount of output, which could easily exceed what Nagios can accept or the character limits of various notification devices (maybe you're tweeting your CDU status, for instance ;P). The '--oksummary' option exists to summarize the output for any type of check. If all metrics being checked are in state 'OK', the output suppresses the specifics of these metrics and simply reports 'N metrics are OK'. The version and location are also displayed in the plugin_output.
