check_ftServer

On a fault-tolerant server (1), an unaugmented OS never notices a hardware failure, because the system switches to the spare hardware automatically. The need to replace failed hardware therefore has to be monitored separately.
On NEC and Stratus ftServers, the "ftsmaint" command reports the state of the actual underlying hardware. This plugin executes ftsmaint and condenses its output into the usual plugin API format.

(1) Imagine two sets of server hardware, interwoven so as to allow failover for individual subunits, and running just one OS instance.
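
As a rough illustration of the "usual plugin API format" mentioned above, here is a minimal sketch. It assumes the Nagios-style convention of one status line, an optional "|" followed by performance data, and an exit code of 0 to 3. The names format_plugin_output and EXIT_CODES are hypothetical and not taken from check_ftServer, and this page does not say which exit code the plugin returns with its "PROBLEMS:" line.

```python
# Sketch only: format_plugin_output and EXIT_CODES are hypothetical names,
# not taken from the check_ftServer source.
# Assumption: Nagios-style exit codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def format_plugin_output(status, message, perfdata=()):
    """Build the one-line 'STATUS: text|perfdata' output and its exit code."""
    line = f"{status}: {message}"
    # Each perfdata item is a (label, value-with-thresholds) pair.
    perf = " ".join(f"{label}={value}" for label, value in perfdata)
    if perf:
        line += "|" + perf
    return line, EXIT_CODES[status]


if __name__ == "__main__":
    line, code = format_plugin_output(
        "OK",
        "No problems found",
        [("pathes", "124"), ("err_pathes", "0;1;1;0;124")],
    )
    print(line)
```

A real plugin would print the line and then exit with the returned code, so that the monitoring system can read both.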

Example Data

This is what check_ftServer will report while everything works:

$ ./check_ftServer
OK: No problems found|pathes=124 err_pathes=0;1;1;0;124 errors=0;1;1;0;
   if_count=2;2:;2:;0; if_errs=0;1;1;0;3
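
For readers unfamiliar with the part after the "|", the following sketch splits such a line into its fields. It assumes the conventional performance data layout label=value[unit];warn;crit;min;max, where trailing fields may be empty or missing. parse_perfdata is a hypothetical helper, not part of the plugin, and it does not handle quoted labels containing spaces.

```python
import re

# Threshold and range fields, in the order they follow the value.
FIELDS = ("warn", "crit", "min", "max")


def parse_perfdata(output):
    """Return {label: {"value", "uom", warn/crit/min/max...}} from plugin output."""
    _, _, perf = output.partition("|")
    result = {}
    for token in perf.split():  # split() also copes with wrapped lines
        label, sep, rest = token.partition("=")
        if not sep:
            continue  # not a label=value pair
        parts = rest.split(";")
        # Separate the number from an optional unit such as "V" or "degC".
        match = re.match(r"^(-?[0-9.]+)(.*)$", parts[0])
        value, uom = (match.group(1), match.group(2)) if match else (parts[0], "")
        entry = {"value": value, "uom": uom}
        entry.update(zip(FIELDS, parts[1:]))
        result[label] = entry
    return result


if __name__ == "__main__":
    sample = (
        "OK: No problems found|pathes=124 err_pathes=0;1;1;0;124 errors=0;1;1;0;\n"
        "   if_count=2;2:;2:;0; if_errs=0;1;1;0;3"
    )
    for label, entry in parse_perfdata(sample).items():
        print(label, entry)
```

Read this way, err_pathes=0;1;1;0;124 means a current value of 0, warning and critical thresholds of 1, and a range of 0 to 124.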

This was the output (no performance data yet) during the incident that prompted me to write the plugins:

PROBLEMS: [0:Combined CPU/IO] opstate=SIMPLEX whiteled=ON (flashing)
   [1:Combined CPU/IO] state=BROKEN opstate=BROKEN greenled=ON (flashing)
   yellowled=ON whiteled=OFF

CPU #1 had a hardware fault, and CPU #0 was thus running in SIMPLEX mode (keeping the server-as-a-whole operational).

This is the current output of the companion stats_ftServer plugin (sensor readings as performance data) on the same machine:

$ ./stats_ftServer
OK: Sensors read|t020130_Internal=35degC v020150_12V=12.12V
   t023130_Internal=128degC v023150_12V=12.12V t0130_Ambient=23degC
   r0140_Fan1=4331 r0141_Fan2=4222 v0150_1_2V_VTT=1.199V
   v0151_1_8V_VDD=1.806V v0152_12V=12.06V t120130_Internal=41degC
   v120150_12V=12.06V t123130_Internal=128degC v123150_12V=12.12V
   t1130_Ambient=23degC r1140_Fan1=4276 r1141_Fan2=4222
   v1150_1_2V_VTT=1.186V v1151_1_8V_VDD=1.793V v1152_12V=12.06V
   r10140_Fan=3312 v10150_-12V=-11.901V v10151_1_3V=1.277V
   v10152_1_5V_GB=1.509V v10153_2_5V_GB=2.476V v10154_2_5V_SATA=2.502V
   v10155_2_5V_VGA=2.515V v10156_3V_CLK=3.613V v10157_3_3V=3.349V
   v10158_3_3Vs=3.3V v10159_3_3V_GBE=3.333V v10160_5V=5.056V
   v10161_5Vs=5.056V v10162_12V=12.12V r11140_Fan=3312
   v11150_-12V=-11.901V v11151_1_3V=1.277V v11152_1_5V_GB=1.483V
   v11153_2_5V_GB=2.502V v11154_2_5V_SATA=2.489V v11155_2_5V_VGA=2.489V
   v11156_3V_CLK=3.613V v11157_3_3V=3.382V v11158_3_3Vs=3.283V
   v11159_3_3V_GBE=3.349V v11160_5V=5.056V v11161_5Vs=5.056V
   v11162_12V=12.06V

For comparison, this is the output from another ftServer which is just one major version away from the above machine:

$ ./stats_ftServer
OK: Sensors read|t0130_Baseboard=25degC t0131_MCH=37degC
   r0140_PSU_Fan=4968.204 r0141_CPU_Fan1=5362.505 r0142_CPU_Fan2=5362.505
   r0143_IO_Fan=7650 v0150_Baseboard_3_3V=3.380V v0151_BB_3_3Vs=3.337V
   v0152_Baseboard_12V=12.444V v0153_Proc_0_Vccp=1.240V
   v0154_Vtt_1_2V=1.206V v0155_Baseboard_1_5V=1.482V
   t1130_Baseboard=25degC t1131_MCH=37degC r1140_PSU_Fan=5197.505
   r1141_CPU_Fan1=5448.998 r1142_CPU_Fan2=5448.998 r1143_IO_Fan=7650
   v1150_Baseboard_3_3V=3.380V v1151_BB_3_3Vs=3.337V
   v1152_Baseboard_12V=12.318V v1153_Proc_0_Vccp=1.240V
   v1154_Vtt_1_2V=1.223V v1155_Baseboard_1_5V=1.492V
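
The sensor labels in both outputs can be sorted by type, for example before feeding them into separate graphs. The sketch below assumes that the leading letter of each label encodes the sensor type (t for temperature, v for voltage, r for fan speed). That reading is inferred from the units in the examples above, not from the plugin source, and group_sensors is a hypothetical helper.

```python
import re

# Assumption: the first letter of a label gives the sensor type.
KINDS = {"t": "temperature", "v": "voltage", "r": "fan_speed"}


def group_sensors(output):
    """Group 'label=value' perfdata items by sensor type, values as floats."""
    _, _, perf = output.partition("|")
    groups = {kind: {} for kind in KINDS.values()}
    for token in perf.split():
        label, sep, raw = token.partition("=")
        kind = KINDS.get(label[:1])
        # Take the leading number and ignore any unit suffix (degC, V).
        match = re.match(r"-?\d+(?:\.\d+)?", raw)
        if not sep or kind is None or match is None:
            continue
        groups[kind][label] = float(match.group(0))
    return groups


if __name__ == "__main__":
    sample = (
        "OK: Sensors read|t0130_Baseboard=25degC t0131_MCH=37degC "
        "r0143_IO_Fan=7650 v0154_Vtt_1_2V=1.206V"
    )
    for kind, sensors in group_sensors(sample).items():
        print(kind, sensors)
```

Because the two machines expose different sensor names, grouping by type rather than by exact label keeps such a script usable across both.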

Version History

v0.2 (01-Mar-2010):

  • (Still-)Quick-and-dirty version without "--help" or "--version" options.

  • Scans the "State", "Op State" and LED states in the output of "ftsmaint lsLong", skipping network interfaces (which are reported in state "BROKEN" when unused).

  • Also scans the output of "ftsmaint lsVnd" and makes sure that logical interface bond0 is "ONLINE" and has at least two "DUPLEX UP" physical interfaces.

  • The also-included stats_ftServer plugin collects the various sensor data (voltages, temperatures, fan speeds) into performance data.

  • Also provided are a code snippet and templates for n2rrd, to keep the performance data in RRD databases.

  • ToDo: --help, --version.
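
The bond0 rule from the list above can be sketched as follows. The format of the "ftsmaint lsVnd" output is not shown on this page, so the sketch works on values that are assumed to have been extracted already. check_bond and its parameters are hypothetical names, and "DOWN" in the usage example is a made-up state used purely for illustration.

```python
def check_bond(logical_state, physical_states, minimum=2):
    """Return a list of problem strings; an empty list means the bond is fine.

    logical_state   -- state of the logical interface, e.g. "ONLINE"
    physical_states -- states of its physical interfaces, e.g. "DUPLEX UP"
    """
    problems = []
    if logical_state != "ONLINE":
        problems.append(f"bond0 state={logical_state}")
    # Count only the physical interfaces that are fully up.
    up = sum(1 for state in physical_states if state == "DUPLEX UP")
    if up < minimum:
        problems.append(f"bond0 has {up} DUPLEX UP interface(s), need {minimum}")
    return problems


if __name__ == "__main__":
    print(check_bond("ONLINE", ["DUPLEX UP", "DUPLEX UP"]))
    print(check_bond("ONLINE", ["DUPLEX UP", "DOWN"]))  # "DOWN" is hypothetical
```

Returning a list of problems rather than a boolean makes it easy to append the findings to the other hardware problems in a single status line.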

v0.3 (23-Mar-2010):

  • Updated check_ftServer to accept LED colors like "(Green Redundancy)" (in addition to the old single-word colors).