History: ======== - 22 Mar 2012 M.Fuerstenau - Started with actual version 0.5.0 - Impelemented check_esx3new2.diff from Simon (simeg / simmerl) - Reimplemented the changes of Markus Obstmayer for the actual version - Comments within the code inform you about the changes - It may happen that controllers has been found which are not active. Therefor around line 2300 the following was added: # M.Fuerstenau - additional to avoid using inactive controllers -------------- elsif (uc($dev->status) eq "3") { $status = 0; } #---------------------------------------------------------------------------- else { $state = 3; } $actual_state = Nagios::Plugin::Functions::max_state($actual_state, $status); } $perfdata = $perfdata . " adapters=" . $count . ";"$perf_thresholds . ";;"; # M.Fuerstenau - changed the output a little bit $output .= $count . " of " . @{$storage->storageDeviceInfo->hostBusAdapter} . " defined/possible adapters online, "; - 30 Mar 2012 M.Fuerstenau - added --ignore_unknown. This maps 3 to 0. Why? You have for example several host adapters. Some are reported as unknown by the plugin because they are not used or have not the capability to reports something senseful. - 02 Apr 2012 M.Fuerstenau - _info (Adapter/LUN/Path). Removed perfdata. To count such items and display as perfdata doesn't make sense - Changed PATH to MPATH and help from "path - list logical unit paths" to "mpath - list logical unit multipath info" because it is NOT an information about a path - it is an information about multipathing. - 08 Jan 2013 M.Fuerstenau - Removed installation informations for the perl SDK from VMware. This informations are part of the SDK and have nothing to do with this plugin. - Replaced global variables with my variables. Instead of "define every variable on the fly as needed it is a good practice to define variables at the beginning and place a comment within it. It gives you better readability. - 22 Jan 2013 M.Fuerstenau - Merged with the actual version from op5. Therfore all changes done to the op5 version: - 2012-05-28 Kostyantyn Hushchyn Rename check_esx3 to check_vmware_api(Fixed issue #3745) - 2012-05-28 Kostyantyn Hushchyn Minor cosmetic changes - 2012-05-29 Kostyantyn Hushchyn Clear cluster failover perfdata units, as it describes count of possible failures to tolerate, so can't be mesured in MB - 2012-05-30 Kostyantyn Hushchyn Minor help message changes - 2012-05-30 Kostyantyn Hushchyn Implemented timeshift for cluster checks, which could fix data retrievel issues. Small refactoring. - 2012-05-31 Kostyantyn Hushchyn Remove dependency on inteval value for cluster checks, which allows to run commands that doesn't require historical intervals - 2012-05-31 Kostyantyn Hushchyn Remove unnecessary/unimplemented function which caused cluster effectivecpu subcheck to fail - 2012-06-01 Kostyantyn Hushchyn Hide NIC status output for net check in case of empty perf data result(Fixed issue #5450) - 2012-06-07 Kostyantyn Hushchyn Fixed manipulation with undefined values, which caused perl interpreter warnings output - 2012-06-08 Kostyantyn Hushchyn Moved out global variables from perfdata functions. Added '-M' max sample number argument, which specify maximum data count to retrive. - 2012-06-08 Kostyantyn Hushchyn Added help text for Cluster checks. - 2012-06-08 Kostyantyn Hushchyn Increment version number Kostyantyn Hushchyn - 2012-06-11 Kostyantyn Hushchyn Reimplemented csv parser to process all values in sequence. Now all required functionality for max sample number argument are present in the plugin. - 2012-06-13 Kostyantyn Hushchyn Fixed cluster failover perf counter output. - 2012-06-22 Kostyantyn Hushchyn Added help message for literal values in interval argument. - 2012-06-22 Kostyantyn Hushchyn Added nicknames for intervals(-i argument), which helps to provide correct values in case you can not find them in GUI. Supported values are: r - realtime interval, h - historical interval at position , starting from 0. - 2012-07-02 Kostyantyn Hushchyn Reimplemented Datastore checking in Datacenter using different approach(Might fix issue #5712) - 2012-07-06 Kostyantyn Hushchyn Fixed Datacenter runtime check Kostyantyn Hushchyn - 2012-07-06 Kostyantyn Hushchyn Fixed Datastore checking in Datacenter(Might fix issue #5712) - 2012-07-09 Kostyantyn Hushchyn Added help info for Host runtime 'sensor' subcheck - 2012-07-09 Kostyantyn Hushchyn Added Host runtime subcheck to threshold sensor data - 2012-07-09 Kostyantyn Hushchyn Fixed Host temperature subcheck causing perl interpreter messages output - 2012-07-10 Kostyantyn Hushchyn Added listall option to output all available sensors. Sensor name now trieted as regexp, so result will be outputed for the first match. - 2012-07-26 Fixed issue which prevents plugin... v2.8.8 v2.8.8-beta1 Kostyantyn Hushchyn Fixed issue which prevents plugin from executing under EPN(Fixed issue #5796) - 2012-09-03 Kostyantyn Hushchyn Implemented plugin timeout(returns 3). - 2012-09-05 Kostyantyn Hushchyn Added storage refresh functionality in case when it's present(Fixed issue #5787) - 2012-09-21 Kostyantyn Hushchyn Added check for dead pathes, which generates 2 in case when at least one is present(Fixed issue #5811) - 2012-09-25 Kostyantyn Hushchyn Changed comparison logic in storage path check - 2012-09-26 Kostyantyn Hushchyn Fixed 'Global symbol normalizedPathState requires explicit package name' - 2012-10-02 Kostyantyn Hushchyn Changed timeshift argument type to integer, so that non number values will be treated as invalid. - 2012-10-02 Kostyantyn Hushchyn Changed to a conditional datastore refresh(Reduce overhead of solution suggested in issue #5787) - 2012-10-05 Kostyantyn Hushchyn Updated description so now almost all options are documented, though somewhere should be documented arguments like timeshift(-T), max samples(-M) and interval(-i) (Solve ticket #5950) - 31 Jan 2013 M.Fuerstenau - Replaced most die with a normal if statement and an exit. - 1 Feb 2013 M.Fuerstenau - Replaced unless with if. unless was only used eight times in the program. In all other statements we had an if statement with the appropriate negotiation for the statement. - 5 Feb 2013 M.Fuerstenau - Replaced all add_perfdata statements with simple concatenated variable $perfdata - 6 Feb 2013 M.Fuerstenau - Corrected bug. Name of subroutine was sub check_percantage but thist was a typo. - 7 Feb 2013 M.Fuerstenau - Replaced $percc and $percw with $crit_is_percent and $warn_is_percent. This was just cosmetic for better readability. - Removed check_percentage(). It was replaced by two one liners directly in the code. Easier to read. - The only codeblocks using check_percentage() were the blocks checking warning and critical. But unfortunately the plausability check was not sufficient. Now it is tested that no other values than numbers and the % sign can be submitted. It is also checked that in case of percent the values are in a valid level between 0 and 100 - 12 Feb 2013 M.Fuerstenau - Replaced literals like CRITICAL with numerical values. Easier to type and anyone developing plugins should be safe with the use - Replaced $state with $actual_state and $res with $state. More for cosmetical issues but the state is returned to Nagios. - check_against_threshold from Nagios::Plugin replaced with a little own subroutine check_against_threshold. - Nagios::Plugin::Functions::max_state replaced with own routine check_state - 14 Feb 2013 M.Fuerstenau - Replaced hash %STATUS_TEXT from Nagios::Plugin::Functions with own hash %status2. - 15 Feb 2013 M.Fuerstenau - Own help (print_help()) and usage (print_usage()) function. - Nagios::plugin kicked finally out. - Mo more global variables. - 25 Feb 2013 M.Fuerstenau - $quickstats instead of $quickStats for better readability. - 5 Mar 2013 M.Fuerstenau - Removed return_cluster_DRS_recommendations() because for daily use this was more of an exotical feature - Removed --quickstats for host_cpu_info and dc_cpu_info because quickstats is not a valid option here. - 6 Mar 2013 M.Fuerstenau - Replaced -o listitems with --listitems - 8 Mar 2013 M.Fuerstenau - --usedspace replaces -o used. $usedflag has been replaced by $usedflag. - --listvms replaces -o listvm. $outputlist has been replaced by $listvms. - --alertonly replaces -o brief. $briefflag has been replaced by $alertonly. - --blacklistregexp replaces -o blacklistregexp. $blackregexpflag has been replaced by $blacklistregexp. - --isregexp replaces -o regexp. $regexpflag has been replaced by $isregexp. - 9 Mar 2013 M.Fuerstenau - Main selection is now transfered to a subrouting main_select because after a successfull if statement the rest can be skipped leaving the subroutine with return - 19 Mar 2013 M.Fuerstenau - Reformatted and cleaned up a lot of code. Variable definitions are now at the beginning of each subroutine instead of defining them "on the fly" as needed with "my". Especially using "my" for definition in a loop is not goog coding style - 21 Mar 2013 M.Fuerstenau - --listvms removed as extra switch. Ballooning or swapping VMs will always be listed. - Changed subselect list(vm) to listvm for better readability. listvm was accepted before (equal to list) but not mentioned in the help. To have list or listvm for the same is a little bit exotic. Fixed this inconsistency. - 22 Mar 2013 M.Fuerstenau - Removed timeshift, interval and maxsamples. If needed use original program from op5. - 25 Mar 2013 M.Fuerstenau - Removed $defperfargs because no values will be handled over. Only performance check that needed another which needed another sampling invel was cluster. This is now fix with 3000. - 11 Apr 2013 M.Fuerstenau - Rewritten and cleaned subroutine host_mem_info. Removed $value1 - $value5. Stepwise completion of $output makes this unsophisticated construct obsolete. - 16 Apr 2013 M.Fuerstenau - Stripped down vm_cpu_info. Monitoring CPU usage in Mhz makes no sense under normal circumstances Mhz is no valid unit for performance data according to the plugin developer guide. I have never found a reason to monitor wait time or ready time in a normal alerting evironment. This data has some interest for performance analysis. But this can be done better with the vmware tools. - Rewritten and cleaned subroutine vm_mem_info. Removed $value1 - $value5. Stepwise completion of $output makes this unsophisticated construct obsolete. - 24 Apr 2013 M.Fuerstenau - Because there is a lot of different performance counters for memory in vmware we ave changed something to be more specific. - Enhenced explanations in help. - Changed swap to swapUSED in host_mem_info(). - Changed usageMB to CONSUMED in host_mem_info(). Same for variables. - Removed overall in host_mem_info(). After reading the documentation carefully the addition of consumed.average + overhead.average seems a little bit senseless because consumed.average includes overhead.average. - Changed usageMB to CONSUMED in vm_mem_info(). Same for variables. - Removed swapIN and swapOUT in vm_mem_info(). Not so sensefull for Nagios alerting because it is hard to find valid thresholds - Removed swap in vm_mem_info(). From the vmware documentation: "Current amount of guest physical memory swapped out to the virtual machine's swap file by the VMkernel. Swapped memory stays on disk until the virtual machine needs it. This statistic refers to VMkernel swapping and not to guest OS swapping. swapped = swapin + swapout" This is more an issue of performance tuning rather than alerting. It is not swapping inside the virtual machine. it is not possible to do any alerting here because (especially with vmotion) you have no thresholds. - Removed OVERHEAD in vm_mem_info(). From the vmware documentation: "Amount of machine memory used by the VMkernel to run the virtual machine." So using this we have a useless information about a virtual machine because we have no valid context and we have no valid thresholds. More important is overhead for the host system. And if we are running in problems here we have to look which machine must be moved to another host. - As a result of this overall in vm_mem_info() makes no sense. - 25 Apr 2013 M.Fuerstenau - Removed swap in vm_mem_info(). From vmware documentation: "Amount of guest physical memory that is currently reclaimed from the virtual machine through ballooning. This is the amount of guest physical memory that has been allocated and pinned by the balloon driver." So here we have again data which makes no sense used alone. You need the context for interpreting them and there are no thresholds for alerting. - 29 Apr 2013 M.Fuerstenau - Renamed $esx to $esx_server. This is only for cosmetics and better reading of the code. - Reimplmented subselect ready in vm_cpu_info and implemented it new in host_cpu_info. From the vmware documentation: "Percentage of time that the virtual machine was ready, but could not get scheduled to run on the physical CPU. CPU ready time is dependent on the number of virtual machines on the host and their CPU loads." High or growing ready time can be a hint CPU bottlenecks (host and guest system) - Reimplmented subselect wait in vm_cpu_info and implemented it new in host_cpu_info. From the vmware documentation: "CPU time spent in wait state. The wait total includes time spent the CPU Idle, CPU Swap Wait, and CPU I/O Wait states. " High or growing wait time can be a hint I/O bottlenecks (host and guest system) - 30 Apr 2013 M.Fuerstenau - Removed subroutines return_dc_performance_values, dc_cpu_info, dc_mem_info, dc_net_info and dc_disk_io_info. Monitored entity was view type HostSystem. This means, that the CPU of the data center server is monitored. The data center server (vcenter) is either a physical MS Windows server (which can be monitored better directly with SNMP and/or NSClient++) or the new Linux based appliance which is a virtual machine and can be monitored as any virtual machine. The OS (Linux) on that virtual machine can be monitored like any standard Linux. - 5 May 2013 M.Fuerstenau - Revised the code of dc_list_vm_volumes_info() - 9 May 2013 M.Fuerstenau - Revised the code of host_net_info(). The function was devided in two parts (like others): - subselects - else which included all. So most of the code existed twice. One for each subselect and nearly the same for all together. The else block was removed and in case no subselect was defined we defined all as $subselect. With the variable set to all we can decide wether to leave the function after a subselect section has been processed or stay and enhance $output and $perfdata. So the code is more clear and has nearly half the lines of code left. - Removed KBps as unit in performance data. This unit is not specified in the plugin developer guide. Performance data is now just a number without a unit. Adding the unit has to be done in the graphing tool (like pnp4nagios). - Removed the number of NICs as performance data. A little bit senseless to have those data here. - 10 May 2013 M.Fuerstenau - Revised the code of vm_net_info(). Same changes as for host_net_info() exept the NIC section. This is not available for VMs. - 14 May 2013 M.Fuerstenau - Replaced $command and $subselect with $select and $subselect. Therfore also the options --command --subselect changed to --select and --subselect. This has been done to become it more clear. In fact these items where no commands (or subselects). It were selections from the amount of performance counters available in vmware. - 15 May 2013 M.Fuerstenau - Kicked out all (I hope so) code for processing historic data from generic_performance_values(). generic_performance_values() is called by return_host_performance_values(), return_host_vmware_performance_values() and return_cluster_performance_values() (return_cluster_performance_values() must be rewritten now). The code length of generic_performance_values() was reduced to one third by doing this. - 6 Jun 2013 M.Fuerstenau - Substituted commandline option for select -l with -S. Therefore -S can't be used as option for the sessionfile Only --sessionfile is accepted nor the name of the sessionfile. - Corrected some bugs in check_against_threshold() - Ensured that in case of thresholds critical must be greater than warning. - 11 Jun 2013 M.Fuerstenau - Changed select option for datastore from vmfs to volumes because we will have volumes on nfs AND vmfs. - Changed output for datastore check to use the option --multiline. This will add a \n (unset -> default) for every line of output. If set it will use HTML tag
. The option --multiline sets a
tag instead of \n. This must be filtered out before using the output in notifications like an email. sed will do the job: sed 's/<[Bb][Rr]>/&\n/g' | sed 's/<[^<>]*>//g' Example: # 'notify-by-email' command definition define command{ command_name notify-by-email command_line /usr/bin/printf "%b" "Message from Nagios:\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTNAME$\nHostalias: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $SHORTDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n$LONGSERVICEOUTPUT$" | sed 's/<[Bb][Rr]>/&\n/g' | sed 's/<[^<>]*>//g' | /bin/mail -s "** $NOTIFICATIONTYPE$ alert - $HOSTNAME$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$ } - 13 Jun 2013 M.Fuerstenau - Replaced a previous change because it was wrong done: - --listvms replaced by subselect listvms - 14 Jun 2013 M.Fuerstenau - Some minor corrections like a doubled chop() datastore_volumes_info() - Added volume type to datastore_volumes_info(). So you can see whether the volume is vmfs (local or SAN) or NFS. - variables like $subselect or $blacklist are global there is no need to handle them over to subroutines like ($result, $output) = vm_cpu_info($vmname, local_uc($subselect)) . For $subselect we have now one uppercase (around line 580) instead of having one with each call in the main selection. - Later on I renamed local_uc to local_lc because I recognized that in cases the subselect is a volume name upper cases won't work. - replaced last -o $addopts (only for the name of a sensor) with --sensorname - 18 Jun 2013 M.Fuerstenau - Rewritten and cleaned subroutine host_disk_io_info(). Removed $value1 - $value7. Stepwise completion of $output makes this unsophisticated construct obsolete. - Removed use of performance thresholds in performance data when used disk io without subselect because threshold can only be used for on item not for all. Therefore they weren't checked in that section. Senseless. - Changed the output. Opposite to vm_disk_io_info() most vlues in host_disk_io_info() are not transfer rates but latency in milliseconds. The output is now clearly understandable. - Added subselect read. Average number of kilobytes read from the disk each second. Rate at which data is read from each LUN on the host.read rate = # blocksRead per second x blockSize. - Added subselect write. Average number of kilobytes written to disk each second. Rate at which data is written to each LUN on the host.write rate = # blocksRead per second x blockSize - Added subselect usage. Aggregated disk I/O rate. For hosts, this metric includes the rates for all virtual machines running on the host. - 21 Jun 2013 M.Fuerstenau - Rewritten and cleaned subroutine vm_disk_io_info(). Removed $value1 - $valuen. Stepwise completion of $output makes this unsophisticated construct obsolete. - Removed use of performance thresholds in performance data when used disk io without subselect because threshold can only be used for on item not for all. Therefore they weren't checked in that section. Senseless. - 24 Jun 2013 M.Fuerstenau - Changed all .= (for example $output .= $xxx.....) to = $var... (for example $output = $output . $xxx...). .= is shorter but the longer form of notification is better readable. The probability of overlooking the dot (especially for older eyes like mine) is smaller. - 07 Aug 2013 M.Fuerstenau - Changed "eval { require VMware::VIRuntime };" to "use VMware::VIRuntime;". The eval construct made no sense. If the module isn't available the program will crash with a compile error. - Removed own subroutine format_uptime() only used by host_uptime_info(). The complete work of this function was done converting seconds to days, hours etc.. Instead of the we use the perl module Time::Duration. So instead of $output = "uptime=" . format_uptime($value); we simply use $output = "uptime=" . duration_exact($value); - Removed perfdata from host_uptime_info(). Perfdata for uptime seems senseless. Same for threshold. - Started modularization of the plugin. The reason is that it is much more easier to patch modules than to patch a large file. - Variables used in that functions which are defined on the top level with "my" must now be defined with "our". BEWARE! Using "our" with unknown modules can lead to curious results if in this functions are variables with the same name. But in this case it is no risk because the modules are not generic. We have only broken the plugin in handy pieces. - Made an seperate modules: - help.pm -> print_help() - process_perfdata.pm -> get_key_metrices() -> generic_performance_values() -> return_host_performance_values() -> return_host_vmware_performance_values() -> return_cluster_performance_values() -> return_host_temporary_vc_4_1_network_performance_values() - host_cpu_info.pm -> host_cpu_info() - host_mem_info.pm -> host_mem_info() - host_net_info.pm -> host_net_info() - host_disk_io_info.pm -> host_disk_io_info() - datastore_volumes_info.pm -> datastore_volumes_info() - host_list_vm_volumes_info.pm -> host_list_vm_volumes_info() - host_runtime_info.pm -> host_runtime_info() - host_service_info.pm -> host_service_info() - host_storage_info.pm -> host_storage_info() - host_uptime_info.pm -> host_uptime_info() - 13 Aug 2013 M.Fuerstenau - Moved host_device_info to host_mounted_media_info. Opposite to it's name and the description this function wasn't designed to list all devices on a host. It was designed to show host cds/dvds mounted to one or more virtual machines. This is important for monitoring because a virtual machine with a mount cd or dvd drive can not be moved to another host. - Made an seperate modules: - host_mounted_media_info.pm -> host_mounted_media_info() - 19 Aug 2013 M.Fuerstenau - Added SOAP check from Simon Meggle, Consol. Slightly modified to fit. - Added isblacklisted and isnotwhitelisted from Simon Meggle, Consol. . Same as above. Following subroutines or modules are affected: - datastore_volumes_info.pm - host_runtime_info.pm - Enhanced host_mounted_media_info.pm - Added check for host floppy - Added isblacklisted and isnotwhitelisted - Added $multiline - 21 Aug 2013 M.Fuerstenau - Reformatted and cleaned up host_runtime_info(). - A lot of bugs in it. - 17 Aug 2013 M.Fuerstenau - Minor bug fix. - $subselect was always converted to lower case characters. This is correct exect $subselect contains a name (e.g. volumes). Volume names can contain upper and lower letters. Fixed. - datastore_volumes_info.pm had my ($datastore, $subselect) = @_; as second line This was incorrect because "global" variables (defined as our in the main program) are not handled over via function call. (Yes - may be handling over maybe more ok in the sense of structured programming. But really - does handling over and giving back a variable makes the code so much clearer? More a kind of philosophy :-)) ) - 3 Oct 2013 M.Fuerstenau - host_runtime_info(). - Filtered out the sensor type "software components". Makes no sense for alerting. - 01 Nov 2013 M.Fuerstenau version 0.8.10 - removed return_host_temporary_vc_4_1_network_performance_values() from process_perfdata.pm. Not needed in ESX version 5 and above. Affected subroutine: host_net_info(). - Bug fixed in generic_performance_values(). Unfortunaltely I had moved my @values = () to main function. Therefore instead of containing a new array reference with each run the new references were added to the array but only the first one was processed. Thanks to Timo Weber for discovering this bug. - 20 Nov 2013 M.Fuerstenau version 0.8.11 - check_state(). Bugfix. Logical error. Complete rewrite. - host_net_info() - Minor bugfix. Added exit 2 in case of unallowed thresholds - Simplified the output. Instead of doing it for every subselect (or not if subselect=all for selecting all info) we have a new helper variable called $true_sub_sel. 0 means not a true subselect. 1 means a true subselect (including all) - host_runtime_info(). - Filtered out the sensor type "software components". Makes no sense for alerting. - The complete else tree (no subselect) was scrap due to the fact that the return code was always ok. There would never be any alarm. Kicked out. - The else tree was replaced by a $subselect=all as in host_net_info(). - Same for output. - subselect listvms. - The (OK) in the output was a hardcoded. Replaced by the deliverd value (UP). (in my oipinion senseless) - Removed perfdata. The number of virtual machines as perfdata doesn't make so much sense. - Rearranged the order of subselects. Must be the same as the statements in the kicked out else sequence to get the same order in output - connection state. From the docs: connected Connected to the server. For ESX Server, this is always the setting. disconnected The user has explicitly taken the host down. VirtualCenter does not expect to receive heartbeats from the host. The next time a heartbeat is received, the host is moved to the connected state again and an event is logged. notResponding VirtualCenter is not receiving heartbeats from the server . The state automatically changes to connected once heartbeats are received again. This state is typically used to trigger an alarm on the host. In the past the presset of returncode was 2. It was set to 0 in case of a connected. But disconnected doesn' mean a critical error. It means main- tenance or somesting like that. Therefore we now return a 1 for warning. notResonding will cause a 2 for critical. - Kicked out maintenance info in runtime summary and as subselect. In the beginning of the function is a check for maintenance. In the original program in this case the program will be left with a die which caused a red alert in Nagios. Now an info is displayed and a return code of 1 (warning) is deliverd because a maintenance is regular work. Although monitoring should take notice of it. Therefore a warning. Therefor the maintenence check in the else tree was scrap. It was never reached. - listvms In case of no VMs the plugin returned a critical. But this is not correct. No VMs on a host is not an error. It is simply what it says: No VMs. - Replaced --listitems and --listall with --listsensors because there were two options for the same use. - Later on I decided to kick out the complete sensorname construction. To monitor a seperate sensor by name (a string which ist different for each sensor) seems not to make so much sense. To monitor sensors by type gave no usefull information (exept temperature which is still monitored. All usefull informations are in the health section. Better to implement some out here if needed. renamed -s temperature to -s temp (because I am lazy). - Changed output of --listsensors. | is the seperation symbol for performance data. So it should not be used in output strings. - Same for the error output of sensors in case of a problem. - subselect=health. Filter out sensors which have not valid data. Often a sensor is reckognizedby vmware but has not the ability to report something senseful. In this case an unknown is reported and the message "Cannot report on the current health state of the element". This it can be skipped. - Bugfix for vm_net_info(). It worked perfectly for -H but produced bullshit with the -D option. With -D in PerfQuerySpec->new(...) intervalId and maxSample must be set. Default was 20 for intervalId and 1 for maxSample - datastore_volumes_info() - New option --gigabyte - Fixed bug for treating name as regexp (--isregexp). - Fixed bug in perfdata (%:MB). Is now MB or GB. - If no single volume is selected warning/critical threshold can be given only in percent. - Fixed bug in regexp for blacklists and whitelists and replace --blacklistregexp and --whitelistregexp with --isregexp. Used in - host_mounted_media_info() - datastore_volumes_info() - host_runtime_info() All subroutines revised later will include it automatically - Fixed bug in regexp for blacklists and whitelists and replace --blacklistregexp and --whitelistregexp with - Opposite to the op5 original blacklists and whitelists can now contain true reular expressions. - help.pm rewritten. It was too much output. Now the user has several choices: -h|--help= The complete help for all. -h|--help= Help for datacenter/vcenter checks. -h|--help= Help for vmware host checks. -h|--help= Help for virtual machines checks. -h|--help= Help for cluster checks. - host_service_info() rewritten. - No longer a list of services via subselect. - Instead full blacklist/whitelist support - With --isregexp blacklist/whitelist items are interpreted as regular expressions - Changed switch of blacklist from -x to -B - Changed switch of whitelist from -y to -W - main_select(). Something for the scatterbrained of us. Replaced string compare (eq,ne etc.) with a regexp pattern matching. So it is possible to type in services or Services instead of service. - 29 Nov 2013 M.Fuerstenau version 0.8.13 - In all functions checking host for maintenance we had the following contruction: if (uc($host_view->get_property('runtime.inMaintenanceMode')) eq "TRUE") This is quite stupid. runtime.inMaintenanceMode is xsd:boolean wich means true or false (in lower cas letters. So a uc() make no sense. Removed - Bypass SSL certificate validation - thanks Robert Karliczek for the hint - Added --sslport if any port other port than 443 is wanted. 443 is used as default. - Rewritten authorization part. Sessionfiles are working now. - Updated README. - 03 Dec 2013 M.Fuerstenau version 0.8.14 - host_runtime_info() - Fixed minor bug. In case of an unknown with sensors the unknown was always mapped to a warning. This wasn't sensefull under all circumstances. Now it is only mapped to warning if --ignoreunknow is not set. - README - Updated installation notes - Optional pathname for sessionfile - 10 Dec 2013 M.Fuerstenau version 0.8.15 - datastore_volumes_info() - Changed "For all volumes" to "For all selected volumes". This should fit all. The subroutine datastore_volumes_info is called from several subroutines and to make a difference here for a single volume or more volumes will cause a lot of unnecessary work just for cosmetics. - Fixed typo in error message. - Modified error messages - Removed plausibility check whether critical must be greater than warning. In case of freespace for example it must be the other way round. The plausibility check was nice but too complicated for all the different conditions. - check_against_threshold() - Cleaned up and partially rewritten. - 12 Dec 2013 M.Fuerstenau version 0.8.16 - datastore_volumes_info() - Some small fixes - Changed the output for OK from "For all selected volumes" to "OK for all seleted volumes." - Same for error plus an error counter for the alarms - Parameter usedspace was ignored on volume checks because a local variable defined with my instead of a more global one. Changed it from my to our fixed it. Thanks to dgoetz for reporting (and fixing) this. - Capacity is now displayed and deliverd as perfdata - Main selection - GetOptions. Unce upon a time I had kicked out $timeout unintended. Fixed now. Thanks to Andreas Daubner for the hint. - 12 Dec 2013 M.Fuerstenau version 0.8.17 - host_net_info() - Changed output - Added total number of NICs - 25 Dec 2013 M.Fuerstenau version 0.9.0 - help() - Removed -v/--verbose. The code was not debugging the plugin but not for working with it. - host_storage_info() - Removed optional switch for adaptermodel. Displaying the adapter model is default now. - Removed upper case conversion for status. The status was converted (for examplele online to ONLINE) and compared with a upper case string like "ONLINE". Sensefull like a second asshole. - Added working blacklist/whitelist. - Blacklist: blacklisted adapters will not be displayed. - Whitelist: only whitelisted adapters will be displayed. - Removed perfdata for the number of hostbusadapters. These perfdata was absolutely senseless. - Status for hostbusadapters was not checked correctly. The check was only done for online and unknown but NOT(!!) for offline and unbound. - LUN states were not correct. UNKNOWN is not a valid state. Not all states different from unknown are supposed to be critical. From the docs: degraded One or more paths to the LUN are down, but I/O is still possible. Further path failures may result in lost connectivity. error The LUN is dead and/or not reachable. lostCommunication No more paths are available to the LUN. off The LUN is off. ok The LUN is on and available. quiesced The LUN is inactive. timeout All Paths have been down for the timeout condition determined by a user-configurable host advanced option. unknownState The LUN state is unknown. - Removed number of LUNs as perfdata. Senseless (again). - In the original selection for the displayed LUN the displayName was used first, then the deviceName and the last one was the canonical name. Unfortunately in the GUI SCSI ID, canonical name an runtime name is displayed. So using the freely configurable DisplayName is senseless. The device name is formed from the path (/vmfs/devices/disks/) followed by the canonical name. So it is either senseless. - The real numeric LUN number wasn'nt display. Fixed. Output is now LUN, canonical name, everything from the display name not equal canonical name and status. - Complete rewrite of the paths part. Only the state of the multipath was checked but not the state of the paths. So a multipath can be "Active" which is ok but the second line is dead. So if the active path becomes dead the failover won't work.There must be an alarm for a standby path too. It is now grouped in the output. - Multiline support for this. - 03 Jan 20143 M.Fuerstenau version 0.9.1 - check_vmware_esx.pl Added new flag --ignore_warning. This will map a warning to ok. - host_runtime_info() - some minor changes - Changed state to powerstate. In the original version the power state was mapped: poweredOn => UP poweredOff => DOWN suspended => SUSPENDED This suggested a machine state but it is only a powerstate. All other than UP caused a critical. But this is not true. A power off machine can also be ok. But to be sure that it is noticed we will have a warning for powerd off and suspended - Perfdata changed. - vm_up -> vm_powerdon - New: vm_poweroff - New: vm_suspended - vm_runtime_info() -> vm_runtime_info.pm - Removed a lot of unnecessary variables and hashes. Rewritten a lot. - Connection state. Only "connected" was checked. All other caused a critical error without a usefull message. This wa a little bit incomplete. Corrected. States delivered from VMware are connected, disconnected, inaccessible, invalid and orphaned. - Removed cpu. VirtualMachineRuntimeInfo maxCpuUsage (in Mhz) doesn't make so much sense for monitoring/alerting. See VMware docs for further information for this performance counter. - Removed mem. VirtualMachineRuntimeInfo maxMemoryUsage doesn't make so much sense for monitoring/alerting. See VMware docs for further information for this performance counter. - Changed state to powerstate. In the original version the power state was mapped: poweredOn => UP poweredOff => DOWN suspended => SUSPENDED This suggested a machine state but it is only a powerstate. All other than UP caused a critical. But this is not true. A power off machine can also be ok. But to be sure that it is noticed we will have a warning for powerd off and suspended - Changed guest to gueststate. This is more descriptive. Removed mapping in guest state. Mapping was "running" => "Running", "notrunning" => "Not running", "shuttingdown" => "Shutting down", "resetting" => "Resetting", "standby" => "Standby", "unknown" => "Unknown". This was not necessary from a technical point of view. The original messages are clearly understandable. - The guest states were not interpreted correctly. In check_vmware_api.pl all states different from running caused a "Critical" error. But this is nonsense. A planned shutted down machine is not an error. It's daily business. But the operator should probably have a notice of that. So it causing a "Warning". The states are (from the docs): running -> Guest is running normally. (returns 0) shuttingdown -> Guest has a pending shutdown command. (returns 1) resetting -> Guest has a pending reset command. (returns 1) standby -> Guest has a pending standby command. (returns 1) notrunning -> Guest is not running. (returns 1) unknown -> Guest information is not available. (returns 3) - Rewritten subselect tools. VirtualMachineToolsStatus was deprecated. As of vSphere API 4.0 VirtualMachineToolsVersionStatus and VirtualMachineToolsRunningStatus has to be used. So a great part of this subselect was not working. - vm_disk_io_info() - Minor bug in output and perfdata corrected. I/O is not in MB but in MB/s. Some of the counters were in MB. - Corrected help. The original on was nonsense. - Changed all values to KB/s because so it is equal to host disk I/O and so it it is deleverd from the API. - help() - Some corrections. - host_disk_io_info() - added total_latency. - 08 Jan 20143 M.Fuerstenau version 0.9.2 - help() - Some small bug fixes. - vm_disk_io_info() - Removed duplicated code. (if subselect ..... else ....) The code was 90% identical. - host_disk_io_info() - Removed duplicated code. (if subselect ..... else ....) The code was 90% identical. - Bug fix. Usage was given without subselect but missing as subselect. Not detected earlier due to the duplicate code. - host_cpu_info() - Removed duplicated code. (if subselect ..... else ....) The code was 90% identical. - Added usage as subselect. - vm_cpu_info() - Removed duplicated code. (if subselect ..... else ....) The code was 90% identical. - Added usage as subselect. - host_mem_info() - Removed duplicated code. (if subselect ..... else ....) The code was 90% identical. - swapused - I swapused is a subselect there should be enhanced information about the virtual machines and should be available. If this won't work nothing will happen. In the past this caused a critical error which is nonsense here. - memctl - Same as swapused - vm_mem_info() - Removed duplicated code. (if subselect ..... else ....) The code was 90% identical. - Added vmmemctl.average (memctl) to monitor balloning. - 16 Jan 2014 M.Fuerstenau version 0.9.3 - All modules - Corrected typo at the end. common instead of commen - host_storage_info() - Removed ignored counter for whitelisted items. A typical copy and paste b...shit. - vm_runtime_info() - issues - Some bugs with the output. Corrected. - tools - Minor bug fixed. Previously used variable was not removed. - host_runtime_info() - issues. - Some bugs with the output. Corrected. - listvms - output now sorted by powerstate (suspended, poweredoff, powerdon) - Corrected some minor bugs - dc_list_vm_volumes_info() - Removed handing over of unnecessary parameters - dc_runtime_info() -> dc_runtime_info.pm - Code cleaned up and reformated - listvms - output now sorted by powerstate (suspended, poweredoff, powerdon) - Added working blacklist/whitelist with the ability to use regular expressions - Added --alertonly here - Added --multiline here - listhosts - %host_state_strings was mostly nonsense. The mapped poser states from for virtual machines were used. Hash removed. Using now the orginal power states from the system (from the docs): - poweredOff -> The host was specifically powered off by the user through VirtualCenter. This state is not a certain state, because after VirtualCenter issues the command to power off the host, the host might crash, or kill all the processes but fail to power off. - poweredOn -> The host is powered on - standBy -> The host was specifically put in standby mode, either explicitly by the user, or automatically by DPM. This state is not a cetain state, because after VirtualCenter issues the command to put the host in stand-by state, the host might crash, or kill all the processes but fail to power off. - unknown -> If the host is disconnected, or notResponding, we can not possibly have knowledge of its power state. Hence, the host is marked as unknown. - Added working blacklist/whitelist with the ability to use regular expressions - Added --alertonly here - Added --multiline here - listcluster - Removed senseless perf data - More detailed check than before - Added working blacklist/whitelist with the ability to use regular expressions - Added --alertonly here - Added --multiline here - status - Rewritten and reformatted - tools - Rewritten and reformatted - Improved more detailed output. expressions - Added --alertonly here - Added --multiline here - 24 Jan 2014 M.Fuerstenau version 0.9.4 - Merged pull request from Sven Nierlein - Modified hel to work with Thruk - Added Makefile. This is optional. Calling it generates a single file from all the modules. Maybe it is a little bit slower than the modules. The readon for modules was speed and better maintenance. - host_runtime_info() - Added quotes in perfdata for temp. - Enhanced README. Explained the differences un host_storage_info() between the original one and this one - host_net_info() - Minor bugfix in output. Corrected typo. - 29 Jan 2014 M.Fuerstenau version 0.9.5 - host_runtime_info() - Minor bug. Corrected quotes in perfdata for temp. - vm_net_info() - Quotes in perfdata - Removed VM name from output - vm_mem_info() - Quotes in perfdata - vm_disk_io_info() - Quotes in perfdata - vm_cpu_info() - Quotes in perfdata - host_net_info() - Quotes in perfdata - host_mem_info() - Quotes in perfdata - host_disk_io_info() - Quotes in perfdata - host_cpu_info() - Quotes in perfdata - dc_runtime_info() - Quotes in perfdata - datastore_volumes_info() - Quotes in perfdata - 04 Feb 2014 M.Fuerstenau version 0.9.6 - host_storage_info() - New switch --standbyok for storage systems where a standby multipath is ok and not a warning. - 06 Feb 2014 M.Fuerstenau version 0.9.7 - Bugfixes/Enhancements - In some cases it might happen that no performance counters are delivered by VMware. Especially if the version is old (4.x, 3.x). Under these circumstances an undef was returned by the routines from process_perfdata.pm and not handled correctly in the calling subroutines. Fixed. Affected subroutines: - host_cpu_info() - vm_net_info() - vm_mem_info() - vm_net_info() - vm_disk_io_info() - vm_cpu_info() - host_net_info() - host_mem_info() - host_disk_io_info() - vm_net_info() - Rewritten to the same structure as similar modules - host_net_info() - Rewritten to the same structure as similar modules - 24 Feb 2014 M.Fuerstenau version 0.9.8 - Corrected a type in the help() - Moved the block for constructing the full path of the sessionfile downward to the authentication stuff to have all in one place. - Authentication: - To reduce amounts of login/logout events in the vShpere logfiles or a lot of open sessions using sessionfiles the login part has been rewritten. Using session files is now the default. Only one session file per host or vCenter is used as default The sessionfile name is automatically set to the vSphere host or the vCenter (IP or name - whatever is used in the check). Multiple sessions are possible using different session file names. To form different session file names the default name is enhenced by the value you set with --sessionfile. NOTICE! All checks using the same session are serialized. So a lot of checks using only one session can cause timeouts. In this case you should enhence the number of sessions by using --sessionfile in the command definition and define the value in the service definition command as an extra argument so it can be used in the command definition as $ARGn$. - --sessionfile is now optional and only used to enhance the sessionfile name to have multiple sessions. - If a session logs in it sets a lock file (sessionfilename_locked). - The lock file is been set when the session starts and removed at the end of the plugin run. - A newly started check looks for the lock file and waits until it is no longer there. So here we have a serialization now. It will not hang forever due to the alarm routine. - Fixed bug "Can't call method "unset_logout_on_disconnect"". I mixed object orientated code and classical code. (Thanks copy & paste for this bug) - $timeout set to 40 seconds instead of 30 to have a little longer waiting before automatic cancelling the check to prevent unwanted cancelling due to longer waiting caused by serialization. - 25 Feb 2014 M.Fuerstenau version 0.9.9 - Bugfix and improvement for "lost" lock files. In case of a Nagios reload (or kill -HUP) Nagios is restarted with the same PID as before. Unfortunately Nagios sends a SIGINT or SIGTERM to the plugins. This causes the plugin to terminate without removing the lockfile. - So we have to catch several signals now - SIGINT and SIGTERM. One of this will be send from Nagios - SIGALRM. Caused by alarm(). Now with output usable in Nagios. - Instead of generating an empty file as lock file we write the process identifier of the running plugin process into the lock file. If a session crashes for some reason an a lock file is left we are in a situation where signal processing won't help. But here the next run of the plugin reads the PID and checks for the process. If there is no process anymore it will remove the lock file and create a new one. Thanks to Simon Meggle, Consol, for the idea. - Removed "die" for opening the authfile or the session lock file with an unless construct. The plugin will report an "understandable" message to the monitor instead of causing an internal error code. - vm_cpu_info() and host_cpu_info() - Removed threshold for ready and wait. Therefore thresholds are no possible without subselect. - 26 Feb 2014 M.Fuerstenau version 0.9.10 - Bugfixes. - Corrected typo in dc_runtime_info() line 660. - Corrected typo in help(). - Corrected bug in datastore_volumes_info(). Giving absolute thresholds for a single volume was not possible. Fixed. - Removed print_usage(). Due to mass of parameters it is not possible to display a short usage message. Instead of that the output of the help is included in the package as a file. - Updated default timeout to 90 secs. to avoid timeouts. - Before accessing the session file (and lock file) we have a random sleep up 7 secs.. This is to avoid a concurrent access in case of a monitor restart or a "Schedule a check of all services on this host" - In case of a locked session file the wait loop is not fix to 1 sec any more. Instead of this it uses a random period up to 5 sec.. So we minimize the risc of concurrent access. - 7 Mar 2014 M.Fuerstenau version 0.9.11 - Updated README - Section for removing HTML tags was reworked - Added blacklist to host_net_info() so that interfaces with -S net can be blacklisted. - 11 Mar 2014 M.Fuerstenau version 0.9.12 - Changed sleep() to usleep() and using now microseconds instead of seconds - So before accessing the session file (and lock file) we now have a random sleep up to 1500 milliseconds (default - see $ms_ts). This is to avoid a concurrent access in case of a monitor restart or a "Schedule a check of all services on this host" but takes much less time while having much more alternatives. - In case of a locked session file the wait loop is not fix to a random period up to 5 sec. any more. Instead of this it uses also $ms_ts which means a max of 1.5 secs. instead of 5. - 3 Apr 2014 M.Fuerstenau version 0.9.13 - --trace= was not working. Fixed. Small typo. - Removed comment sign in front of unlink around. It was there due to some some tests and I had forgotten to remove it. - datastore_volumes_info(). Some bugs corrected. - Wrong percent calculation - Wrong processing of thresholds for usedspace - Wrong processing for thresholds which are not percent - If threshold is in percent it is calculated in MB/GB for perfdata because mixing percent and MB/GB doesn't make sense. - 5 Apr 2014 M.Fuerstenau version 0.9.14 - host_runtime_info() - Fixed some bugs with issues ignored and whitelist. Some counters were calculated wrong - New plausibility check to ensured that blacklist and whitelist can not be used together. - 29 Apr 2014 M.Fuerstenau version 0.9.15 - host_mem_info(), vm_mem_info(), host_cpu_info() and vm_cpu_info(). - Sometimes it may happen on Vmware 5.5 (not seen when testing with Update 1) that getting the perfdata for cpu and/or memory will result in an empty construct because one or more values are not delivered. In this case we have a fallback and and every value - dc_runtime_info() - New option --poweredonly to list only machines which are powered on