2018-12-01 // Check_MK Monitoring - Brocade / Broadcom Fibre Channel Switches
This article provides patches for the standard Check_MK distribution in order to fix and enhance the support for the monitoring of Brocade / Broadcom Fibre Channel Switches.
Out of the box, the standard Check_MK distribution already provides monitoring support for Brocade / Broadcom Fibre Channel Switches. Unfortunately, there are several issues in the check brocade_fcport, which is used to monitor the status and several metrics of the fibre channel switch ports. Those issues prevent the check from working as intended and are fixed in the version provided in this article. Up until recently, there also was no support in the standard Check_MK distribution for monitoring CPU and memory metrics on a switch level and no support for monitoring SFP metrics on a port level. This article introduces the new checks brocade_cpu, brocade_mem and brocade_sfp to cover those metrics. With the most recent upstream version 1.5.x of the Check_MK distribution, the standard checks brocade_sys (covering CPU and memory metrics) and brocade_sfp (covering the SFP port metrics) are now available and provide basically the same functionality.
For the impatient (TL;DR), here is the enhanced version of the brocade_fcport check:
Enhanced version of the brocade_fcport check
And the new brocade_cpu, brocade_mem and brocade_sfp checks:
The new brocade_cpu check
The new brocade_mem check
The new brocade_sfp check
The sources to the new and enhanced versions of all the checks can be found in my Check_MK Plugins repository on GitHub.
Additional Checks
CPU Usage
The Check_MK service check brocade_cpu monitors the current CPU utilization of Brocade / Broadcom fibre channel switches. It uses the SNMP OID swCpuUsage from the fibre channel switch MIB (SW-MIB) in order to create one service check for each CPU found in the system. The current CPU utilization in percent is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the standard WATO plugin for CPU utilization it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 80%; critical: 90% of CPU utilization).
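The threshold comparison described above can be sketched in a few lines. This is only an illustration, not the code of the brocade_cpu check itself: the function name evaluate_upper_levels and the use of ">=" at the boundaries are assumptions, while the default levels of 80% and 90% are taken from the text.

```python
def evaluate_upper_levels(value, warn, crit):
    """Return a Nagios-style state (0=OK, 1=WARN, 2=CRIT) for an upper-bound metric.

    Hypothetical helper for illustration; boundary handling (>=) is an assumption.
    """
    if value >= crit:
        return 2
    if value >= warn:
        return 1
    return 0


# Default levels from the text: warning at 80%, critical at 90% CPU utilization.
DEFAULT_CPU_LEVELS = (80.0, 90.0)

for cpu_percent in (42.0, 85.0, 97.5):
    state = evaluate_upper_levels(cpu_percent, *DEFAULT_CPU_LEVELS)
    print("CPU utilization %.1f%% -> state %d" % (cpu_percent, state))
```

The same comparison applies to the memory check described below, since it uses the same default levels.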
The following image shows a status output example for the brocade_cpu service check from the WATO WebUI:
This example shows the current CPU utilization of the system in percent.
The following image shows an example of the service metrics graph for the brocade_cpu service check:
The selected example graph shows the current CPU utilization of the system in percent as well as the default warning and critical threshold values.
Memory Usage
The Check_MK service check brocade_mem monitors the current memory (RAM) usage on Brocade / Broadcom fibre channel switches. It uses the SNMP OID swMemUsage from the fibre channel switch MIB (SW-MIB) in order to create a service check for the overall memory usage on the system. The amount of currently used memory is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 80%; critical: 90% of used memory). The configuration options for the used memory levels can be found under:
-> Host & Service Parameters -> Parameters for discovered services -> Operating System Resources -> Brocade Fibre Channel Memory Usage -> Create rule in folder ... [x] Levels for memory usage
The following image shows a status output example for the brocade_mem service check from the WATO WebUI:
This example shows the current memory utilization of the system in percent.
The following image shows an example of the service metrics graph for the brocade_mem service check:
The selected example graph shows the current memory utilization of the system in percent as well as the default warning and critical threshold values.
SFP Health
The Check_MK service check brocade_sfp monitors several metrics of the SFPs in all enabled and active ports of Brocade / Broadcom fibre channel switches. It uses several SNMP OIDs from the fibre channel switch MIB (SW-MIB) and from the Fabric Alliance Extension MIB (FA-EXT-MIB) in order to create one service check for each SFP found in an enabled and active port on the system. The OIDs from the fibre channel switch MIB (swFCPortSpecifier, swFCPortName, swFCPortPhyState, swFCPortOpStatus, swFCPortAdmStatus) are used to determine the number, name and status of the switch port. The OIDs from the Fabric Alliance Extension MIB are used to determine the actual SFP metrics. Those metrics are the SFP temperature (swSfpTemperature), voltage (swSfpVoltage) and current (swSfpCurrent), as well as the optical receive and transmit power (swSfpRxPower and swSfpTxPower) of the SFP. Unfortunately, the SFP metrics are not available on all Brocade / Broadcom switch hardware platforms and FabricOS versions. This check was verified to work with Gen6 hardware and FabricOS v8. Another limitation is that the SFP metrics are not available in real time, but are only gathered by the switch from all of its SFPs at a 5 minute interval. This is most likely a precaution so as not to overload the processing capacity of the SFPs with too many status requests.
The current temperature level as well as the optical receive and transmit power levels are compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values:
Metric | Warning Threshold | Critical Threshold |
---|---|---|
System temperature | 55°C | 65°C |
Optical receive power | -7.0 dBm | -9.0 dBm |
Optical transmit power | -2.0 dBm | -3.0 dBm |
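To make the direction of these thresholds explicit, here is a minimal sketch of how the default levels from the table could be evaluated. It is not the code of the brocade_sfp check: the names SFP_DEFAULT_LEVELS and sfp_state are hypothetical, and it assumes that the temperature levels are upper bounds while the optical power levels are lower bounds (which the ordering of the warning and critical values suggests).

```python
# Default levels from the table above. The dictionary keys are hypothetical names.
SFP_DEFAULT_LEVELS = {
    "temp": (55.0, 65.0),      # upper levels in degrees Celsius (warn, crit)
    "rx_power": (-7.0, -9.0),  # lower levels in dBm (warn, crit), assumed lower bounds
    "tx_power": (-2.0, -3.0),  # lower levels in dBm (warn, crit), assumed lower bounds
}


def upper_state(value, warn, crit):
    """State for a metric that becomes worse as it rises."""
    if value >= crit:
        return 2
    if value >= warn:
        return 1
    return 0


def lower_state(value, warn, crit):
    """State for a metric that becomes worse as it falls."""
    if value <= crit:
        return 2
    if value <= warn:
        return 1
    return 0


def sfp_state(temp_c, rx_dbm, tx_dbm, levels=SFP_DEFAULT_LEVELS):
    """Return the worst state (0=OK, 1=WARN, 2=CRIT) over all three metrics."""
    return max(
        upper_state(temp_c, *levels["temp"]),
        lower_state(rx_dbm, *levels["rx_power"]),
        lower_state(tx_dbm, *levels["tx_power"]),
    )


print(sfp_state(40.0, -3.0, -1.0))
print(sfp_state(58.0, -3.0, -1.0))
print(sfp_state(40.0, -9.5, -1.0))
```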
The configuration options for the used temperature and optical receive and transmit power levels can be found under:
-> Host & Service Parameters -> Parameters for discovered services -> Networking -> Brocade Fibre Channel SFP -> Create rule in folder ... [x] Temperature levels in degrees celcius [x] Receive power levels in dBm [x] Transmit power levels in dBm
Currently, the electrical current and voltage metrics of the SFPs are only used for long-term trends via the respective service metric template and thus are not used to raise any alarms. This is because there is little to no information available on which precise electrical current and voltage levels would constitute an indicator of an immediate or impending failure state.
The following image shows several status output examples for the brocade_sfp service check from the WATO WebUI:
This example shows SFP Port service check items for eight consecutive ports on a switch. For each item the current temperature, voltage and current, as well as the optical receive and transmit power of the SFP are shown. The optical receive and transmit power levels are also visualized in the Perf-O-Meter, optical receive power growing from the middle to the left, optical transmit power growing from the middle to the right.
The following image shows an example of the service metrics graph for the brocade_sfp service check:
The first graph shows the optical receive and transmit power of the SFP. The second shows the electrical current drawn by the SFP. The third graph shows the temperature of the SFP. The fourth and last graph shows the electrical voltage provided to the SFP.
Modified Check
Fibre Channel Port
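The Check_MK service check brocade_fcport has several issues which are addressed by the following set of patches. The patch is actually a monolithic one and is just broken up into individual patches here for ease of discussion. Note that the diffs in this section were generated with the enhanced version as the first file (a/) and the stock version as the second file (b/). Lines prefixed with "-" are therefore the ones present in the enhanced check, and lines prefixed with "+" are the ones from the stock check.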
The first patch is a simple but ugly workaround to prevent OID_END, which carries the port index information, from being treated as a BINARY SNMP value:
- brocade_fcport.patch
--- a/checks/brocade_fcport 2018-11-25 18:06:04.674930057 +0100
+++ b/checks/brocade_fcport 2018-11-25 08:43:58.715721271 +0100
@@ -154,8 +135,8 @@
     bbcredits = None
     if len(if64_info) > 0:
         fcmgmt_portstats = []
-        for oidend, dummy, tx_elements, rx_elements, bbcredits_64 in if64_info:
-            if int(index) == int(oidend.split(".")[-1]):
+        for oidend, tx_elements, rx_elements, bbcredits_64 in if64_info:
+            if index == oidend.split(".")[-1]:
                 fcmgmt_portstats = [
                     binstring_to_int(''.join(map(chr, tx_elements))) / 4,
                     binstring_to_int(''.join(map(chr, rx_elements))) / 4,
@@ -477,7 +426,6 @@
     # Not every device supports that
     (".1.3.6.1.3.94.4.5.1", [
         OID_END,
-        "1", # Dummy value, otherwise OID_END is also treated as a BINARY value
         BINARY("6"), # FCMGMT-MIB::connUnitPortStatCountTxElements
         BINARY("7"), # FCMGMT-MIB::connUnitPortStatCountRxElements
         BINARY("8"), # FCMGMT-MIB::connUnitPortStatCountBBCreditZero
This is achieved by simply inserting a dummy value into the list of SNMP OIDs in the snmp_info variable. I haven't had time to dig out the root cause of this behaviour, but I guess it must be somewhere in the core SNMP components of Check_MK. The patch also implicitly addresses and fixes an issue where two variables with differently typed content are being compared. This is achieved by simply changing the line:
if index == oidend.split(".")[-1]:
to:
if int(index) == int(oidend.split(".")[-1]):
and thus forcing a conversion to integer values.
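Why the explicit conversion matters can be shown with a small stand-alone sketch. The helper name same_port and the sample OID suffixes are hypothetical, and which of the two values is the integer in the real check is an assumption; the point is only that an integer never compares equal to a string in Python, even if both represent the same number.

```python
def same_port(index, oidend):
    """Compare a port index with the last component of an OID_END value.

    Both sides are converted to int, so it does not matter whether either
    value arrives as a string or as an integer.
    """
    return int(index) == int(oidend.split(".")[-1])


# Hypothetical sample values: an integer index and an OID suffix string.
index = 7
oidend = "16.0.0.5.30.1.2.3.7"

print(index == oidend.split(".")[-1])  # naive comparison of int and str
print(same_port(index, oidend))        # comparison after conversion
```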
The next patch adds support for additional encoding schemes in the FC-1 layer, which are used for fibre channel beyond the speed of 8 GBit. Up to and including a speed of 8 GBit, fibre channel uses an 8b/10b encoding. This means that for every 8 bits of data, 10 bits are actually sent over the fibre channel link. The encoding scheme changes for speeds higher than 8 GBit. 16 GBit fibre channel, like 10 GBit Ethernet, uses a 64b/66b encoding scheme; 32 GBit fibre channel and higher use a 256b/257b encoding scheme. Thus the wirespeed calculations of the brocade_fcport service check are incorrect for speeds above 8 GBit. The following patch addresses and fixes this issue:
- brocade_fcport.patch
--- a/checks/brocade_fcport 2018-11-25 18:06:04.674930057 +0100
+++ b/checks/brocade_fcport 2018-11-25 08:43:58.715721271 +0100
@@ -283,15 +267,8 @@
     output.append(speedmsg)
-    if gbit > 16:
-        # convert gbit netto link-rate to Byte/s (256/257 enc)
-        wirespeed = gbit * 1000000000.0 * ( 256.0 / 257.0 ) / 8
-    elif gbit > 8:
-        # convert gbit netto link-rate to Byte/s (64/66 enc)
-        wirespeed = gbit * 1000000000.0 * ( 64.0 / 66.0 ) / 8
-    else:
     # convert gbit netto link-rate to Byte/s (8/10 enc)
-        wirespeed = gbit * 1000000000.0 * ( 8.0 / 10.0 ) / 8
+    wirespeed = gbit * 1000000000.0 * 0.8 / 8
     in_bytes = 4 * get_rate("brocade_fcport.rxwords.%s" % index, this_time, rxwords)
     out_bytes = 4 * get_rate("brocade_fcport.txwords.%s" % index, this_time, txwords)
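The encoding-dependent calculation can also be written as a small stand-alone function, which makes it easy to try out the effect of the three encoding schemes. This is a sketch rather than the code of the check: the function name wirespeed_bytes_per_sec is hypothetical. Float literals are used for the encoding ratios on purpose, because under Python 2 an expression such as 8 / 10 between two integers is truncated to 0.

```python
def wirespeed_bytes_per_sec(gbit):
    """Convert a nominal link speed in GBit/s to the usable rate in Byte/s.

    The encoding efficiency depends on the speed class:
    up to 8 GBit   -> 8b/10b
    above 8 GBit   -> 64b/66b
    above 16 GBit  -> 256b/257b
    """
    if gbit > 16:
        efficiency = 256.0 / 257.0
    elif gbit > 8:
        efficiency = 64.0 / 66.0
    else:
        efficiency = 8.0 / 10.0
    return gbit * 1000000000.0 * efficiency / 8


for speed in (4, 8, 16, 32):
    print("%2d GBit -> %.0f Byte/s" % (speed, wirespeed_bytes_per_sec(speed)))
```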
The third patch simply adds the notxcredits counter and its value to the output of the service check. This information is currently missing from the status output of the check and is just added as a convenience:
- brocade_fcport.patch
--- a/checks/brocade_fcport 2018-11-25 18:06:04.674930057 +0100
+++ b/checks/brocade_fcport 2018-11-25 08:43:58.715721271 +0100
@@ -408,9 +361,6 @@
                 summarystate = max(1, summarystate)
                 text += "(!)"
             output.append(text)
-        else:
-            if counter == "notxcredits":
-                output.append(text)
     # P O R T S T A T E
     for dev_state, state_key, state_info, warn_states, state_map in [
The last patch addresses and fixes a more serious issue. The details are explained in the comment of the following code snippet. In short, the metric notxcredits (“No TX buffer credits”) gathered from the SNMP OID swFCPortNoTxCredits is calculated incorrectly. The brocade_fcport service check treats this metric like all the other error metrics of a switch port (e.g. “CRC errors”, “ENC-Out”, “ENC-In” and “C3 discards”) and puts it in relation to the number of frames transmitted over a link. In reality, though, the metric behind the SNMP OID swFCPortNoTxCredits is defined relative to time. This issue leads to false positives in certain edge cases, for example when a lightly utilized switch port sees an otherwise uncritical number of swFCPortNoTxCredits due to the normal activity of the fibre channel flow control. In such a case, a relatively high number of swFCPortNoTxCredits is put into relation to the sum of a relatively low number of frames transmitted over the link and, again, the relatively high number of swFCPortNoTxCredits. See the last line of the code snippet below for this calculation. The result is a high value for the metric notxcredits (“No TX buffer credits”), although the fibre channel flow control was working perfectly fine, the configured switch.edgeHoldTime was most likely never reached and no frames were dropped.
In order to address this issue, the following patch adds a few lines of code for a special treatment of the metric notxcredits to the brocade_fcport service check. This is done in the last section of the patch. The second section of the patch is just for the purpose of clarification, as it sets the reference metric to None in the case of notxcredits. The first section of the patch adjusts the default warning and critical thresholds to lower levels. The previous levels were rather high due to the miscalculation explained above. With the new calculation added by the third section of the patch, the default warning and critical threshold levels need to be much more sensitive.
- brocade_fcport.patch
--- a/checks/brocade_fcport 2018-11-25 18:06:04.674930057 +0100
+++ b/checks/brocade_fcport 2018-11-25 08:43:58.715721271 +0100
@@ -100,11 +100,11 @@
 factory_settings["brocade_fcport_default_levels"] = {
     "rxcrcs": (3.0, 20.0), # allowed percentage of CRC errors
     "rxencoutframes": (3.0, 20.0), # allowed percentage of Enc-OUT Frames
     "rxencinframes": (3.0, 20.0), # allowed percentage of Enc-In Frames
-    "notxcredits": (1.0, 3.0), # allowed percentage of No Tx Credits
+    "notxcredits": (3.0, 20.0), # allowed percentage of No Tx Credits
     "c3discards": (3.0, 20.0), # allowed percentage of C3 discards
     "assumed_speed": 2.0, # used if speed not available in SNMP data
 }
@@ -349,7 +327,7 @@
         ("ENC-Out", "rxencoutframes", rxencoutframes, rxframes_rate),
         ("ENC-In", "rxencinframes", rxencinframes, rxframes_rate),
         ("C3 discards", "c3discards", c3discards, txframes_rate),
-        ("No TX buffer credits", "notxcredits", notxcredits, None),
+        ("No TX buffer credits", "notxcredits", notxcredits, txframes_rate),
         ]:
         per_sec = get_rate("brocade_fcport.%s.%s" % (counter, index), this_time, value)
         perfdata.append((counter, per_sec))
@@ -360,31 +338,6 @@
                 (counter, item), this_time, per_sec, average)
             perfdata.append( ("%s_avg" % counter, per_sec_avg ) )
-        # Calculate error rates
-        if counter == "notxcredits":
-            # Calculate the error rate for "notxcredits" (buffer credit zero). Since this value
-            # is relative to time instead of the number of transmitted frames it needs special
-            # treatment.
-            # Semantics of the buffer credit zero value on Brocade / Broadcom devices:
-            # The switch ASIC checks the buffer credit value of a switch port every 2.5us and
-            # increments the buffer credit zero counter if the buffer credit value is zero.
-            # This means if in a one second interval the buffer credit zero counter increases
-            # by 400000 the link on this switch port is not allowed to send any frames.
-            # By default the edge hold time on a Brocade / Broadcom device is about 200ms:
-            # switch.edgeHoldTime:220
-            # If a C3 frame remains in the switches queue for more than 220ms without being
-            # given any credits to be transmitted, it is subsequently dropped. Thus the buffer
-            # credit zero counter would optimally be correlated to the C3 discards counter.
-            # Unfortunately the Brocade / Broadcom devices have no egress buffering and do
-            # ingress buffering instead. Thus the C3 discards counters are increased on the
-            # ingress port, while the buffer credit zero counters are increased on the egress
-            # port. The trade-off is to correlate the buffer credit zero counters relative to
-            # the measured time interval.
-            if per_sec > 0:
-                rate = per_sec / 400000.00
-            else:
-                rate = 0
-        else:
         # compute error rate (errors in relation to number of frames) (from 0.0 to 1.0)
         if ref > 0 or per_sec > 0:
             rate = per_sec / (ref + per_sec)
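The difference between the two calculations is easiest to see with numbers. The following sketch puts the frame-relative and the time-relative formula from the patch side by side. The function names and the sample rates for a lightly used port are made up for illustration; the divisor of 400000 samples per second follows from the 2.5 µs sampling interval mentioned in the patch comment.

```python
# One buffer credit sample every 2.5 microseconds gives 400000 samples per second.
SAMPLES_PER_SECOND = 400000.0


def frame_relative_rate(errors_per_sec, frames_per_sec):
    """Stock calculation: errors in relation to frames plus errors (0.0 to 1.0)."""
    if frames_per_sec > 0 or errors_per_sec > 0:
        return errors_per_sec / float(frames_per_sec + errors_per_sec)
    return 0.0


def time_relative_rate(no_credits_per_sec):
    """Patched calculation: share of samples in which the credit count was zero."""
    if no_credits_per_sec > 0:
        return no_credits_per_sec / SAMPLES_PER_SECOND
    return 0.0


# Hypothetical lightly used port: few frames, moderate number of zero-credit samples.
no_credits_per_sec = 4000.0
tx_frames_per_sec = 100.0

print("frame-relative: %.1f%%" % (100.0 * frame_relative_rate(no_credits_per_sec, tx_frames_per_sec)))
print("time-relative:  %.1f%%" % (100.0 * time_relative_rate(no_credits_per_sec)))
```

With these sample values the frame-relative formula yields 4000 / 4100, while the time-relative formula yields 4000 / 400000, which illustrates how the stock calculation overstates the problem on a port that transmits few frames.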
Conclusion
Adding the three new checks for Brocade / Broadcom fibre channel switches to your Check_MK server enables you to monitor additional CPU and memory aspects as well as the SFPs of your Brocade / Broadcom fibre channel devices. More recent versions of Check_MK should already include the same functionality through the now included standard checks brocade_sys and brocade_sfp.
New Brocade / Broadcom fibre channel devices should pick up the additional service checks immediately. Existing Brocade / Broadcom fibre channel devices might need a Check_MK inventory to be run explicitly on them in order to pick up the additional service checks.
The enhanced version of the brocade_fcport service check addresses and fixes several issues present in the standard Check_MK service check. It should provide you with a more complete and future-proof monitoring of your fibre channel network infrastructure.
I hope you find the provided new and enhanced checks useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.
2017-10-01 // Check_MK Monitoring - HPE Virtual Connect Fibre Channel Modules
This article provides patches for the standard Check_MK distribution in order to add support for the monitoring of HPE Virtual Connect Fibre Channel Modules.
Out of the box, there is currently no monitoring support for HPE Virtual Connect Fibre Channel Modules in the standard Check_MK distribution. Those modules, e.g. the HPE Virtual Connect 8Gb 20-port Fibre Channel Module, are used in HPE c-Class BladeSystem to provide Fibre Channel connectivity for the individual server blades. Fortunately, the modules provide status and performance data via the standard SNMP FIBRE-CHANNEL-FE-MIB defined in RFC 2837 as well as its successor, the SNMP FCMGMT-MIB defined in RFC 4044. Those two SNMP MIBs are already covered by the checks qlogic_fcport, qlogic_sanbox and qlogic_sanbox_fabric_element, which are part of the standard Check_MK distribution. This reduces the task of adding support for the HPE Virtual Connect Fibre Channel modules to extending the already existing checks with three rather simple patches.
For the impatient (TL;DR), here are the enhanced versions of the qlogic_fcport, qlogic_sanbox and qlogic_sanbox_fabric_element checks:
Enhanced version of the qlogic_fcport check
Enhanced version of the qlogic_sanbox check
Enhanced version of the qlogic_sanbox_fabric_element check
The sources to the enhanced versions of all three checks can be found in my Check_MK Plugins repository on GitHub.
The necessary changes to qlogic_fcport and qlogic_sanbox_fabric_element are limited to the snmp_scan_function used by the Check_MK inventory. Here, the vendor specific OIDs for the HPE Virtual Connect Fibre Channel modules are added. The following patches show the respective lines for qlogic_fcport:
- qlogic_fcport.patch
--- a/checks/qlogic_fcport 2017-03-06 21:00:07.397607946 +0100
+++ b/checks/qlogic_fcport 2017-10-01 14:34:48.153710776 +0200
@@ -218,12 +218,14 @@
     # .1.3.6.1.4.1.3873.1.12 QLogic 8 Gb and 4/8 Gb Intelligent Pass-thru Module
     # .1.3.6.1.4.1.3873.1.9 QLogic SANBox 5802 FC Switch
     # .1.3.6.1.4.1.3873.1.11 HP StorageWorks 8/20q Fibre Channel Switch
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function' : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
         or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
         or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.11") \
         or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.12") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.9"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.9") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
     'group': 'qlogic_fcport',
     'default_levels_variable': 'qlogic_fcport_default_levels',
 }
and for qlogic_sanbox_fabric_element:
- qlogic_sanbox_fabric_element.patch
--- a/checks/qlogic_sanbox_fabric_element 2017-03-06 21:00:07.397607946 +0100
+++ b/checks/qlogic_sanbox_fabric_element 2017-10-01 14:47:35.000003198 +0200
@@ -54,7 +54,9 @@
                     OID_END]),
     # .1.3.6.1.4.1.3873.1.14 Qlogic-Switch
     # .1.3.6.1.4.1.3873.1.8 Qlogic-4Gb SAN Switch Module for IBM BladeCenter
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function' : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
 }
In both cases, the relevant line is:
or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
After those two simple changes, the checks will now be able to successfully inventorize the overall fabric status as well as the status of individual ports of HPE Virtual Connect Fibre Channel modules.
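As the list of supported devices grows, the chain of "or" clauses becomes harder to read. One possible alternative, shown here only as a sketch and not as part of the patches above, is to keep the enterprise OID prefixes in a tuple and pass it to str.startswith(), which accepts a tuple of prefixes. The names SYS_OBJECT_ID, SUPPORTED_PREFIXES and scan_function are hypothetical, and the lambda at the end merely stands in for the oid() callback that Check_MK passes to a scan function.

```python
SYS_OBJECT_ID = ".1.3.6.1.2.1.1.2.0"  # sysObjectID.0

# Enterprise OID prefixes taken from the patches above.
SUPPORTED_PREFIXES = (
    ".1.3.6.1.4.1.3873.1.14",  # Qlogic-Switch
    ".1.3.6.1.4.1.3873.1.8",   # Qlogic-4Gb SAN Switch Module for IBM BladeCenter
    ".1.3.6.1.4.1.3873.1.16",  # HPE Virtual Connect FlexFabric
)


def scan_function(oid):
    """Return True if the device's sysObjectID starts with a supported prefix."""
    return oid(SYS_OBJECT_ID).startswith(SUPPORTED_PREFIXES)


# Stand-in for the oid() callback, returning a made-up sysObjectID value.
print(scan_function(lambda _oid: ".1.3.6.1.4.1.3873.1.16.1"))
print(scan_function(lambda _oid: ".1.3.6.1.4.1.9.1.1"))
```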
The necessary changes to qlogic_sanbox also require the extension of the snmp_scan_function used by the Check_MK inventory as shown by the patches above. In addition to that, the string operations on the sensor_id need to be adjusted in order to get a more user-friendly name for the temperature and power supply sensors which are also present in the HPE Virtual Connect Fibre Channel modules. Since the sensor IDs are encoded in the SNMP OIDs and the SNMP tree for those OIDs can vary from module to module, the simple string replacement in the original qlogic_sanbox check was exchanged for a more general, regular expression based substitution. The following patch shows the respective lines for the combined changes to qlogic_sanbox:
- qlogic_sanbox.patch
--- a/checks/qlogic_sanbox 2017-03-06 21:00:07.397607946 +0100
+++ b/checks/qlogic_sanbox 2017-10-01 14:47:51.348002546 +0200
@@ -44,7 +44,7 @@
     inventory = []
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_type == "8" and sensor_characteristic == "3" and \
             sensor_name != "Temperature Status":
             inventory.append( (sensor_id, None) )
@@ -53,7 +53,7 @@
 def check_qlogic_sanbox_temp(item, _no_params, info):
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_id == item:
             sensor_status = int(sensor_status)
             if sensor_status < 0 or sensor_status >= len(qlogic_sanbox_status_map):
@@ -93,9 +93,11 @@
                     OID_END]),
     # .1.3.6.1.4.1.3873.1.14 Qlogic-Switch
     # .1.3.6.1.4.1.3873.1.8 Qlogic-4Gb SAN Switch Module for IBM BladeCenter
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function' : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
 }
 #.
@@ -113,7 +115,7 @@
     inventory = []
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_type == "5":
             inventory.append( (sensor_id, None) )
     return inventory
@@ -121,7 +123,7 @@
 def check_qlogic_sanbox_psu(item, _no_params, info):
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_id == item:
             sensor_status = int(sensor_status)
             if sensor_status < 0 or sensor_status >= len(qlogic_sanbox_status_map):
@@ -153,7 +155,9 @@
                     OID_END]),
     # .1.3.6.1.4.1.3873.1.14 Qlogic-Switch
     # .1.3.6.1.4.1.3873.1.8 Qlogic-4Gb SAN Switch Module for IBM BladeCenter
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function' : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
 }
After those additional, but still simple, changes, the check will now be able to successfully inventorize the temperature and power supply sensors of HPE Virtual Connect Fibre Channel modules.
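The effect of the regular expression from the patch can be tried out in isolation. The pattern below is the one from the patch, written as a raw string and compiled once; the helper name short_sensor_id and the sample sensor IDs are made up for illustration and do not come from a real device.

```python
import re

# Pattern from the patch: a known OID prefix, any middle part, then eight zero components.
SENSOR_ID_PREFIX = re.compile(
    r'^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.'
)


def short_sensor_id(sensor_id):
    """Strip the module specific OID prefix and keep only the trailing sensor number."""
    return SENSOR_ID_PREFIX.sub('', sensor_id)


# Hypothetical sensor IDs, one for each of the two prefixes.
print(short_sensor_id("16.0.0.192.221.48.1.2.0.0.0.0.0.0.0.0.5"))
print(short_sensor_id("16.0.116.70.160.113.9.0.0.0.0.0.0.0.0.3"))
```

IDs that do not start with one of the two known prefixes are returned unchanged, so the substitution stays harmless for modules it does not know about.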
2017-07-21 // Check_MK - Race Condition in Processing of Piggyback Data
In current versions of Check_MK (namely 1.2.8p24 and 1.4.0p8), which are included in the Open Monitoring Distribution, there are several race conditions in the code parts responsible for processing piggyback data sent by the agents. These race conditions are usually triggered when a host is monitored more than once, e.g. by the use of cluster services or if by chance a manual check on the command line coincides with a scheduled check from the monitoring core. While not critical to the whole monitoring process, the effect of these races is intermittent and annoying UNKNOWN - [Errno 2] No such file or directory errors for the monitored host. This article provides patches for both Check_MK versions in order to deal with the spurious error messages.
The normal order of processing of piggybacked data received from agents seems to be as follows:
1. The Check_MK server calls the agent on the monitored host.
2. The agent on the monitored host responds with the output of its own host as well as the piggybacked output from other entities.
3. The Check_MK server processes the agent's response and extracts the piggybacked data.
4. If there is already data stored from the piggybacked host, it's considered stale and is thus removed.
5. The piggybacked data is stored in a file named <AGENT_HOSTNAME> in the directory named <PIGGYBACKED_HOSTNAME> under the path ${OMD_ROOT}/var/check_mk/piggyback/ (e.g.: ${OMD_ROOT}/var/check_mk/piggyback/<PIGGYBACKED_HOSTNAME>/<AGENT_HOSTNAME>).
6. The Check_MK server lists the contents of the directory ${OMD_ROOT}/var/check_mk/piggyback/<PIGGYBACKED_HOSTNAME>/<AGENT_HOSTNAME>. For each file item in the list it processes the data contained in the file.
7. The Check_MK server removes the files in and subsequently the directory ${OMD_ROOT}/var/check_mk/piggyback/<PIGGYBACKED_HOSTNAME>/ itself.
In case a host is monitored more than once, e.g. through the use of cluster services, the steps 4 through 7 above can overlap for two or more concurrent monitoring processes. Such an overlap can cause one process at step 4 to delete the directory ${OMD_ROOT}/var/check_mk/piggyback/<PIGGYBACKED_HOSTNAME>/<AGENT_HOSTNAME> while another process is still working on the directory at step 5 or step 6.
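A common, generic way to make such a step tolerant of a concurrent process is to treat "file not found" as a normal outcome instead of an error. The following sketch demonstrates the pattern on a temporary directory. It is not the patch provided in this article, and the helper name remove_if_present as well as the file name are hypothetical.

```python
import errno
import os
import tempfile


def remove_if_present(path):
    """Remove a file; return False instead of raising if it is already gone."""
    try:
        os.remove(path)
        return True
    except OSError as exc:
        # Only swallow "No such file or directory"; re-raise everything else.
        if exc.errno != errno.ENOENT:
            raise
        return False


# Simulate two processes cleaning up the same stale piggyback file.
workdir = tempfile.mkdtemp()
stale_file = os.path.join(workdir, "agent-host")
open(stale_file, "w").close()

print(remove_if_present(stale_file))  # first process removes the file
print(remove_if_present(stale_file))  # second process finds it already gone
os.rmdir(workdir)
```

The same idea applies to os.rename() and os.listdir(): the caller decides explicitly what a missing path means instead of letting the exception propagate to the monitoring core.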
The effects of this issue are mainly intermittent and annoying UNKNOWN - [Errno 2] No such file or directory errors in the Check_MK WebUI as well as errors like the following examples in the log files:
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> An exception occured while processing host "hostA"
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> Traceback (most recent call last):
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> File "/omd/sites/SITE/share/check_mk/modules/keepalive.py", line 125, in do_keepalive
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> status = command_function(command_tuple)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> File "/omd/sites/SITE/share/check_mk/modules/keepalive.py", line 437, in execute_keepalive_command
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> return mode_function(hostname, ipaddress)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 1290, in do_check
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> do_all_checks_on_host(hostname, ipaddress, only_check_types)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 1506, in do_all_checks_on_host
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> info = get_info_for_check(hostname, ipaddress, infotype)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 319, in get_info_for_check
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> info = apply_parse_function(get_host_info(hostname, ipaddress, section_name, max_cachefile_age, ignore_check_interval), section_name)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 371, in get_host_info
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> ignore_check_interval=True)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 541, in get_realhost_info
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> store_persisted_info(hostname, persisted)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 567, in store_persisted_info
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> os.rename("%s.#new" % file_path, file_path)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> OSError: [Errno 2] No such file or directory
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> An exception occured while processing host "hostA"
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> Traceback (most recent call last):
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> File "/omd/sites/SITE/share/check_mk/modules/keepalive.py", line 125, in do_keepalive
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> status = command_function(command_tuple)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> File "/omd/sites/SITE/share/check_mk/modules/keepalive.py", line 437, in execute_keepalive_command
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> return mode_function(hostname, ipaddress)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 1241, in do_check
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> do_all_checks_on_host(hostname, ipaddress, only_check_types)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 1457, in do_all_checks_on_host
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> info = get_info_for_check(hostname, ipaddress, infotype)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 319, in get_info_for_check
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> info = apply_parse_function(get_host_info(hostname, ipaddress, section_name, max_cachefile_age, ignore_check_interval), section_name)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 371, in get_host_info
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> ignore_check_interval=True)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 540, in get_realhost_info
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> store_piggyback_info(hostname, piggybacked)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 651, in store_piggyback_info
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> os.rename(dir + "/.new." + sourcehost, dir + "/" + sourcehost)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> OSError: [Errno 2] No such file or directory
In version 1.4.0p8 of Check_MK there is already a partial fix for the described issue in the form of Werk 4755 (Git commit 77b3bfc3). For both Check_MK versions the following two trivial patches are provided in order to deal with the spurious error messages:
- check_mk_base.py_1.2.8p24.patch, to be applied by the command:
root@host:~# patch -b -z .race < check_mk_base.py_1.2.8p24.patch
- check_mk_base.py_1.4.0p8.patch, to be applied by the command:
root@host:~# patch -b -z .race < check_mk_base.py_1.4.0p8.patch
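The patches themselves are linked above rather than reproduced here. Purely as an illustration of how this class of error can be avoided, the following Python sketch writes to a uniquely named temporary file before renaming it into place. It assumes the OSError comes from two helper processes sharing one fixed temporary file name, which is my reading of the tracebacks and not a confirmed description of what the patches change. The function name store_atomically is hypothetical.

```python
import os
import tempfile


def store_atomically(file_path, data):
    """Write data to file_path via a uniquely named temporary file.

    Hypothetical helper, not taken from Check_MK: because every writer
    gets its own temporary name, no writer can rename another writer's
    temporary file away before that writer's own os.rename() runs.
    """
    directory = os.path.dirname(file_path) or "."
    # mkstemp() creates the file with a unique name inside the target directory
    fd, tmp_path = tempfile.mkstemp(prefix=".new.", dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(data)
    # Renaming within one filesystem replaces the target in a single step
    os.rename(tmp_path, file_path)
```

Keeping the temporary file in the same directory as the target matters here, since a rename across filesystems is not atomic.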
In the current development version of Check_MK – probably to be named 1.5 – a rather large amount of code has been rewritten and moved around. At first glance I couldn't determine whether the issue still exists there.
2017-04-12 // Check_MK Monitoring - Open-iSCSI
The Open-iSCSI project provides a high-performance, transport independent implementation of RFC 3720 iSCSI for Linux. It allows remote access to SCSI targets via TCP/IP over several different transport technologies. This article introduces a new Check_MK service check to monitor the status of Open-iSCSI sessions as well as several statistical metrics on Open-iSCSI sessions and iSCSI hardware initiator hosts.
For the impatient and TL;DR here is the Check_MK package of the Open-iSCSI monitoring checks:
Open-iSCSI monitoring checks (Compatible with Check_MK versions 1.2.8 and later)
The sources are to be found in my Check_MK repository on GitHub
The Check_MK service check to monitor Open-iSCSI consists of two major parts, an agent plugin and three check plugins.
The first part, a Check_MK agent plugin named open-iscsi, is a simple Bash shell script. It calls the Open-iSCSI administration tool iscsiadm in order to retrieve a list of currently active iSCSI sessions. The exact call to iscsiadm to retrieve the session list is:
/usr/bin/iscsiadm -m session -P 1
If there are any active iSCSI sessions, the open-iscsi agent plugin also tries to collect several statistics for each iSCSI session. This is done by another call to iscsiadm for each iSCSI session ${SID}, which is shown in the following example:
/usr/bin/iscsiadm -m session -r ${SID} -s
Unfortunately, the iSCSI session statistics are currently only supported for Open-iSCSI software initiators or dependent hardware iSCSI initiators like the Broadcom BCM577xx or BCM578xx adapters, which are covered by the bnx2i kernel module. See Debugging Segfaults in Open-iSCSIs iscsiuio on Intel Broadwell and Backporting Open-iSCSI to Debian 8 "Jessie" for additional information on those dependent hardware iSCSI initiators.
For hardware iSCSI initiators, like the QLogic 4000 and QLogic 8200 Series network adapters and iSCSI HBAs, which provide a full iSCSI offload engine (iSOE) implementation in the adapter's firmware, there is currently no support for iSCSI session statistics. Instead, the open-iscsi agent plugin collects several global statistics on each iSOE host ${HST} that is covered by the qla4xxx kernel module, using the command shown in the following example:
/usr/bin/iscsiadm -m host -H ${HST} -C stats
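The choice between the two statistics calls can be summed up in a few lines. The following Python sketch only builds the three command lines quoted above; the actual agent plugin is a Bash script, and the function names as well as the idea of keying the decision on the driver name (bnx2i, qla4xxx) are assumptions made for illustration.

```python
ISCSIADM = "/usr/bin/iscsiadm"


def session_list_command():
    """Command line that lists the currently active iSCSI sessions."""
    return [ISCSIADM, "-m", "session", "-P", "1"]


def stats_command(driver, ident):
    """Command line that collects statistics for one session or iSOE host.

    driver: kernel module name, e.g. "bnx2i" or "qla4xxx" (assumed selector)
    ident:  session ID (${SID}) or host number (${HST})
    """
    if driver == "qla4xxx":
        # Full offload adapters only offer global statistics per host
        return [ISCSIADM, "-m", "host", "-H", str(ident), "-C", "stats"]
    # Software and dependent hardware initiators offer per-session statistics
    return [ISCSIADM, "-m", "session", "-r", str(ident), "-s"]
```

The returned lists are in the form expected by subprocess.check_output(), should one want to run them from Python.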
The output of the above commands is parsed and reformatted by the agent plugin for easier processing in the check plugins. The following example shows the agent plugin output for a system with two BCM578xx dependent hardware iSCSI initiators:
<<<open-iscsi_sessions>>>
bnx2i 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS> bnx2i.f8:ca:b8:7d:bf:2d eth2 10.0.3.52 LOGGED_IN LOGGED_IN NO_CHANGE
bnx2i 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS> bnx2i.f8:ca:b8:7d:c2:34 eth3 10.0.3.53 LOGGED_IN LOGGED_IN NO_CHANGE
<<<open-iscsi_session_stats>>>
[session stats f8:ca:b8:7d:bf:2d iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS>]
txdata_octets: 40960 rxdata_octets: 461171313 noptx_pdus: 0 scsicmd_pdus: 153967 tmfcmd_pdus: 0 login_pdus: 0 text_pdus: 0 dataout_pdus: 0 logout_pdus: 0 snack_pdus: 0 noprx_pdus: 0 scsirsp_pdus: 153967 tmfrsp_pdus: 0 textrsp_pdus: 0 datain_pdus: 112420 logoutrsp_pdus: 0 r2t_pdus: 0 async_pdus: 0 rjt_pdus: 0 digest_err: 0 timeout_err: 0
[session stats f8:ca:b8:7d:c2:34 iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS>]
txdata_octets: 16384 rxdata_octets: 255666052 noptx_pdus: 0 scsicmd_pdus: 84312 tmfcmd_pdus: 0 login_pdus: 0 text_pdus: 0 dataout_pdus: 0 logout_pdus: 0 snack_pdus: 0 noprx_pdus: 0 scsirsp_pdus: 84312 tmfrsp_pdus: 0 textrsp_pdus: 0 datain_pdus: 62418 logoutrsp_pdus: 0 r2t_pdus: 0 async_pdus: 0 rjt_pdus: 0 digest_err: 0 timeout_err: 0
The next example shows the agent plugin output for a system with two QLogic 8200 Series hardware iSCSI initiators:
<<<open-iscsi_sessions>>>
qla4xxx 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-57e572d50-80e0000001458a32-v-sto2-tst-000001 qla4xxx.f8:ca:b8:7d:c1:7d.ipv4.0 none 10.0.3.50 LOGGED_IN Unknown Unknown
qla4xxx 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-57e572d50-80e0000001458a32-v-sto2-tst-000001 qla4xxx.f8:ca:b8:7d:c1:7e.ipv4.0 none 10.0.3.51 LOGGED_IN Unknown Unknown
<<<open-iscsi_host_stats>>>
[host stats f8:ca:b8:7d:c1:7d iqn.2000-04.com.qlogic:isp8214.000e1e3574ac.4]
mactx_frames: 563454 mactx_bytes: 52389948 mactx_multicast_frames: 877513 mactx_broadcast_frames: 0 mactx_pause_frames: 0 mactx_control_frames: 0 mactx_deferral: 0 mactx_excess_deferral: 0 mactx_late_collision: 0 mactx_abort: 0 mactx_single_collision: 0 mactx_multiple_collision: 0 mactx_collision: 0 mactx_frames_dropped: 0 mactx_jumbo_frames: 0 macrx_frames: 1573455 macrx_bytes: 440845678 macrx_unknown_control_frames: 0 macrx_pause_frames: 0 macrx_control_frames: 0 macrx_dribble: 0 macrx_frame_length_error: 0 macrx_jabber: 0 macrx_carrier_sense_error: 0 macrx_frame_discarded: 0 macrx_frames_dropped: 1755017 mac_crc_error: 0 mac_encoding_error: 0 macrx_length_error_large: 0 macrx_length_error_small: 0 macrx_multicast_frames: 0 macrx_broadcast_frames: 0 iptx_packets: 508160 iptx_bytes: 29474232 iptx_fragments: 0 iprx_packets: 401785 iprx_bytes: 354673156 iprx_fragments: 0 ip_datagram_reassembly: 0 ip_invalid_address_error: 0 ip_error_packets: 0 ip_fragrx_overlap: 0 ip_fragrx_outoforder: 0 ip_datagram_reassembly_timeout: 0 ipv6tx_packets: 0 ipv6tx_bytes: 0 ipv6tx_fragments: 0 ipv6rx_packets: 0 ipv6rx_bytes: 0 ipv6rx_fragments: 0 ipv6_datagram_reassembly: 0 ipv6_invalid_address_error: 0 ipv6_error_packets: 0 ipv6_fragrx_overlap: 0 ipv6_fragrx_outoforder: 0 ipv6_datagram_reassembly_timeout: 0 tcptx_segments: 508160 tcptx_bytes: 19310736 tcprx_segments: 401785 tcprx_byte: 346637456 tcp_duplicate_ack_retx: 1 tcp_retx_timer_expired: 1 tcprx_duplicate_ack: 0 tcprx_pure_ackr: 0 tcptx_delayed_ack: 106449 tcptx_pure_ack: 106489 tcprx_segment_error: 0 tcprx_segment_outoforder: 0 tcprx_window_probe: 0 tcprx_window_update: 695915 tcptx_window_probe_persist: 0 ecc_error_correction: 0 iscsi_pdu_tx: 401697 iscsi_data_bytes_tx: 29225 iscsi_pdu_rx: 401697 iscsi_data_bytes_rx: 327355963 iscsi_io_completed: 101 iscsi_unexpected_io_rx: 0 iscsi_format_error: 0 iscsi_hdr_digest_error: 0 iscsi_data_digest_error: 0 iscsi_sequence_error: 0
[host stats f8:ca:b8:7d:c1:7e iqn.2000-04.com.qlogic:isp8214.000e1e3574ad.5]
mactx_frames: 563608 mactx_bytes: 52411412 mactx_multicast_frames: 877517 mactx_broadcast_frames: 0 mactx_pause_frames: 0 mactx_control_frames: 0 mactx_deferral: 0 mactx_excess_deferral: 0 mactx_late_collision: 0 mactx_abort: 0 mactx_single_collision: 0 mactx_multiple_collision: 0 mactx_collision: 0 mactx_frames_dropped: 0 mactx_jumbo_frames: 0 macrx_frames: 1573572 macrx_bytes: 441630442 macrx_unknown_control_frames: 0 macrx_pause_frames: 0 macrx_control_frames: 0 macrx_dribble: 0 macrx_frame_length_error: 0 macrx_jabber: 0 macrx_carrier_sense_error: 0 macrx_frame_discarded: 0 macrx_frames_dropped: 1755017 mac_crc_error: 0 mac_encoding_error: 0 macrx_length_error_large: 0 macrx_length_error_small: 0 macrx_multicast_frames: 0 macrx_broadcast_frames: 0 iptx_packets: 508310 iptx_bytes: 29490504 iptx_fragments: 0 iprx_packets: 401925 iprx_bytes: 355436636 iprx_fragments: 0 ip_datagram_reassembly: 0 ip_invalid_address_error: 0 ip_error_packets: 0 ip_fragrx_overlap: 0 ip_fragrx_outoforder: 0 ip_datagram_reassembly_timeout: 0 ipv6tx_packets: 0 ipv6tx_bytes: 0 ipv6tx_fragments: 0 ipv6rx_packets: 0 ipv6rx_bytes: 0 ipv6rx_fragments: 0 ipv6_datagram_reassembly: 0 ipv6_invalid_address_error: 0 ipv6_error_packets: 0 ipv6_fragrx_overlap: 0 ipv6_fragrx_outoforder: 0 ipv6_datagram_reassembly_timeout: 0 tcptx_segments: 508310 tcptx_bytes: 19323952 tcprx_segments: 401925 tcprx_byte: 347398136 tcp_duplicate_ack_retx: 2 tcp_retx_timer_expired: 4 tcprx_duplicate_ack: 0 tcprx_pure_ackr: 0 tcptx_delayed_ack: 106466 tcptx_pure_ack: 106543 tcprx_segment_error: 0 tcprx_segment_outoforder: 0 tcprx_window_probe: 0 tcprx_window_update: 696035 tcptx_window_probe_persist: 0 ecc_error_correction: 0 iscsi_pdu_tx: 401787 iscsi_data_bytes_tx: 37970 iscsi_pdu_rx: 401791 iscsi_data_bytes_rx: 328112050 iscsi_io_completed: 127 iscsi_unexpected_io_rx: 0 iscsi_format_error: 0 iscsi_hdr_digest_error: 0 iscsi_data_digest_error: 0 iscsi_sequence_error: 0
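To give an idea of how such agent output can be turned into data structures, here is a minimal Python sketch. It is not the code of the actual check plugins; the function names are made up, and the parser deliberately does not rely on where the line breaks fall, only on the section markers, the bracketed headers and the "name: value" pairs visible in the examples above.

```python
import re

SECTION_RE = re.compile(r"<<<([^>]+)>>>")
HEADER_RE = re.compile(r"\[(?:session|host) stats (\S+) (\S+)\]")
PAIR_RE = re.compile(r"(\w+):\s+(\d+)")


def split_sections(agent_output):
    """Map each section name to the raw text following its <<<...>>> marker."""
    parts = SECTION_RE.split(agent_output)
    # parts is [preamble, name1, body1, name2, body2, ...]
    return dict((name, body.strip()) for name, body in zip(parts[1::2], parts[2::2]))


def parse_stats(body):
    """Return {(mac, iqn): {counter_name: value}} for a *_stats section body."""
    pieces = HEADER_RE.split(body)
    # pieces is [text_before, mac1, iqn1, counters1, mac2, iqn2, counters2, ...]
    stats = {}
    for mac, iqn, counters in zip(pieces[1::3], pieces[2::3], pieces[3::3]):
        stats[(mac, iqn)] = dict((k, int(v)) for k, v in PAIR_RE.findall(counters))
    return stats
```

Keying the result on the pair of MAC address and IQN mirrors the bracketed headers in the agent output.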
Although a simple Bash shell script, the agent plugin open-iscsi has several dependencies which need to be installed in order for the agent plugin to work properly. Namely those are the commands iscsiadm, sed, tr and egrep. On Debian based systems, the necessary packages can be installed with the following command:
root@host:~# apt-get install coreutils grep open-iscsi sed
The second part of the Check_MK service check for Open-iSCSI provides the necessary check logic through individual inventory and check functions. This is implemented in the three Check_MK check plugins open-iscsi_sessions, open-iscsi_host_stats and open-iscsi_session_stats, which will be discussed separately in the following sections.
Open-iSCSI Session Status
The check plugin open-iscsi_sessions is responsible for the monitoring of individual iSCSI sessions and their internal session states. Upon inventory this check plugin creates a service check for each pair of iSCSI network interface name and IQN of the iSCSI target volume. Unlike the iSCSI session ID, which changes over time (e.g. after iSCSI logout and login), this pair uniquely identifies an iSCSI session on a host. During normal check execution, the list of currently active iSCSI sessions on a host is compared to the list of active iSCSI sessions gathered during inventory on that host. If a session is missing or if the session has an erroneous internal state, an alarm is raised accordingly.
For all types of initiators – software, dependent hardware and hardware – there is the state session_state, which can take on the following values:
ISCSI_STATE_FREE
ISCSI_STATE_LOGGED_IN
ISCSI_STATE_FAILED
ISCSI_STATE_TERMINATE
ISCSI_STATE_IN_RECOVERY
ISCSI_STATE_RECOVERY_FAILED
ISCSI_STATE_LOGGING_OUT
An alarm is raised if the session is in any state other than ISCSI_STATE_LOGGED_IN. For software and dependent hardware initiators there are two additional states – connection_state and internal_state. The state connection_state can take on the values:
FREE
TRANSPORT WAIT
IN LOGIN
LOGGED IN
IN LOGOUT
LOGOUT REQUESTED
CLEANUP WAIT
and internal_state can take on the values:
NO CHANGE
CLEANUP
REOPEN
REDIRECT
In addition to the above session_state, an alarm is raised if connection_state is in any state other than LOGGED IN or internal_state is in any state other than NO CHANGE.
No performance data is currently reported by this check.
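The state rules above can be sketched in a few lines of Python. This is an illustration only, not the plugin's code: the function names are hypothetical, the severity CRIT for every deviation is an assumption (the text only says that an alarm is raised), and so is the treatment of the value Unknown, which the agent example reports for full offload initiators, as "not available".

```python
OK, CRIT = 0, 2  # Nagios-style states


def normalize(value):
    """Map 'ISCSI_STATE_LOGGED_IN', 'LOGGED IN' and 'LOGGED_IN' to one spelling."""
    return value.upper().replace("ISCSI_STATE_", "").replace(" ", "_")


def evaluate_session(session_state, connection_state="Unknown", internal_state="Unknown"):
    """Return (state, text) for one iSCSI session."""
    problems = []
    if normalize(session_state) != "LOGGED_IN":
        problems.append("session_state is %s" % session_state)
    # connection_state and internal_state only exist for software and
    # dependent hardware initiators; "Unknown" is skipped (assumption)
    if connection_state != "Unknown" and normalize(connection_state) != "LOGGED_IN":
        problems.append("connection_state is %s" % connection_state)
    if internal_state != "Unknown" and normalize(internal_state) != "NO_CHANGE":
        problems.append("internal_state is %s" % internal_state)
    if problems:
        return CRIT, ", ".join(problems)
    return OK, "all states as expected"
```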
Open-iSCSI Hosts Statistics
The check plugin open-iscsi_host_stats is responsible for the monitoring of the global statistics on an iSOE host. Upon inventory this check plugin creates a service check for each pair of MAC address and iSCSI network interface name. During normal check execution, an extensive list of statistics – see the above example output of the Check_MK agent plugin – is determined for each inventoried item. If the rate of one of the statistics values exceeds the configured warning or critical threshold, an alarm is raised accordingly. For all statistics, performance data is reported by the check.
With the additional WATO plugin open-iscsi_host_stats.py it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values. The default values for all statistics are a rate of zero (0) units per second for both warning and critical thresholds. The configuration options for the iSOE host statistics levels can be found in the WATO WebUI under:
-> Host & Service Parameters
-> Parameters for discovered services
-> Storage, Filesystems and Files
-> Open-iSCSI Host Statistics
-> Create Rule in Folder ...
-> The levels for the Open-iSCSI host statistics values
    [x] The levels for the number of transmitted MAC/Layer2 frames on an iSOE host.
    [x] The levels for the number of received MAC/Layer2 frames on an iSOE host.
    [x] The levels for the number of transmitted MAC/Layer2 bytes on an iSOE host.
    [x] The levels for the number of received MAC/Layer2 bytes on an iSOE host.
    [x] The levels for the number of transmitted MAC/Layer2 multicast frames on an iSOE host.
    [x] The levels for the number of received MAC/Layer2 multicast frames on an iSOE host.
    [x] The levels for the number of transmitted MAC/Layer2 broadcast frames on an iSOE host.
    [x] The levels for the number of received MAC/Layer2 broadcast frames on an iSOE host.
    [x] The levels for the number of transmitted MAC/Layer2 pause frames on an iSOE host.
    [x] The levels for the number of received MAC/Layer2 pause frames on an iSOE host.
    [x] The levels for the number of transmitted MAC/Layer2 control frames on an iSOE host.
    [x] The levels for the number of received MAC/Layer2 control frames on an iSOE host.
    [x] The levels for the number of transmitted MAC/Layer2 dropped frames on an iSOE host.
    [x] The levels for the number of received MAC/Layer2 dropped frames on an iSOE host.
    [x] The levels for the number of transmitted MAC/Layer2 deferral frames on an iSOE host.
    [x] The levels for the number of transmitted MAC/Layer2 deferral frames on an iSOE host.
    [x] The levels for the number of transmitted MAC/Layer2 abort frames on an iSOE host.
    [x] The levels for the number of transmitted MAC/Layer2 jumbo frames on an iSOE host.
    [x] The levels for the number of MAC/Layer2 late transmit collisions on an iSOE host.
    [x] The levels for the number of MAC/Layer2 single transmit collisions on an iSOE host.
    [x] The levels for the number of MAC/Layer2 multiple transmit collisions on an iSOE host.
    [x] The levels for the number of MAC/Layer2 collisions on an iSOE host.
    [x] The levels for the number of received MAC/Layer2 control frames on an iSOE host.
    [x] The levels for the number of received MAC/Layer2 dribble on an iSOE host.
    [x] The levels for the number of received MAC/Layer2 frame length errors on an iSOE host.
    [x] The levels for the number of discarded received MAC/Layer2 frames on an iSOE host.
    [x] The levels for the number of received MAC/Layer2 jabber on an iSOE host.
    [x] The levels for the number of received MAC/Layer2 carrier sense errors on an iSOE host.
    [x] The levels for the number of received MAC/Layer2 CRC errors on an iSOE host.
    [x] The levels for the number of received MAC/Layer2 encoding errors on an iSOE host.
    [x] The levels for the number of received MAC/Layer2 length too large errors on an iSOE host.
    [x] The levels for the number of received MAC/Layer2 length too small errors on an iSOE host.
    [x] The levels for the number of transmitted IP packets on an iSOE host.
    [x] The levels for the number of received IP packets on an iSOE host.
    [x] The levels for the number of transmitted IP bytes on an iSOE host.
    [x] The levels for the number of received IP bytes on an iSOE host.
    [x] The levels for the number of transmitted IP fragments on an iSOE host.
    [x] The levels for the number of received IP fragments on an iSOE host.
    [x] The levels for the number of IP datagram reassemblies on an iSOE host.
    [x] The levels for the number of IP invalid address errors on an iSOE host.
    [x] The levels for the number of IP packet errors on an iSOE host.
    [x] The levels for the number of IP fragmentation overlaps on an iSOE host.
    [x] The levels for the number of IP fragmentation out-of-order on an iSOE host.
    [x] The levels for the number of IP datagram reassembly timeouts on an iSOE host.
    [x] The levels for the number of transmitted IPv6 packets on an iSOE host.
    [x] The levels for the number of received IPv6 packets on an iSOE host.
    [x] The levels for the number of transmitted IPv6 bytes on an iSOE host.
    [x] The levels for the number of received IPv6 bytes on an iSOE host.
    [x] The levels for the number of transmitted IPv6 fragments on an iSOE host.
    [x] The levels for the number of received IPv6 fragments on an iSOE host.
    [x] The levels for the number of IPv6 datagram reassemblies on an iSOE host.
    [x] The levels for the number of IPv6 invalid address errors on an iSOE host.
    [x] The levels for the number of IPv6 packet errors on an iSOE host.
    [x] The levels for the number of IPv6 fragmentation overlaps on an iSOE host.
    [x] The levels for the number of IPv6 fragmentation out-of-order on an iSOE host.
    [x] The levels for the number of IPv6 datagram reassembly timeouts on an iSOE host.
    [x] The levels for the number of transmitted TCP segments on an iSOE host.
    [x] The levels for the number of received TCP segments on an iSOE host.
    [x] The levels for the number of transmitted TCP bytes on an iSOE host.
    [x] The levels for the number of received TCP bytes on an iSOE host.
    [x] The levels for the number of duplicate TCP ACK retransmits on an iSOE host.
    [x] The levels for the number of received TCP retransmit timer expiries on an iSOE host.
    [x] The levels for the number of received TCP duplicate ACKs on an iSOE host.
    [x] The levels for the number of received TCP pure ACKs on an iSOE host.
    [x] The levels for the number of transmitted TCP delayed ACKs on an iSOE host.
    [x] The levels for the number of transmitted TCP pure ACKs on an iSOE host.
    [x] The levels for the number of received TCP segment errors on an iSOE host.
    [x] The levels for the number of received TCP segment out-of-order on an iSOE host.
    [x] The levels for the number of received TCP window probe on an iSOE host.
    [x] The levels for the number of received TCP window update on an iSOE host.
    [x] The levels for the number of transmitted TCP window probe persist on an iSOE host.
    [x] The levels for the number of transmitted iSCSI PDUs on an iSOE host.
    [x] The levels for the number of received iSCSI PDUs on an iSOE host.
    [x] The levels for the number of transmitted iSCSI Bytes on an iSOE host.
    [x] The levels for the number of received iSCSI Bytes on an iSOE host.
    [x] The levels for the number of iSCSI I/Os completed on an iSOE host.
    [x] The levels for the number of iSCSI unexpected I/Os on an iSOE host.
    [x] The levels for the number of iSCSI format errors on an iSOE host.
    [x] The levels for the number of iSCSI header digest (CRC) errors on an iSOE host.
    [x] The levels for the number of iSCSI data digest (CRC) errors on an iSOE host.
    [x] The levels for the number of iSCSI sequence errors on an iSOE host.
    [x] The levels for the number of ECC error corrections on an iSOE host.
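Since the agent delivers ever-increasing counters while the levels above apply to rates, each counter has to be converted into a per-second rate between two check runs. The following Python sketch shows that step and the threshold comparison in isolation. It is not taken from the check plugin: the function names are hypothetical, a rate "above" a level is read as a strict greater-than, and levels of None stand for "no levels configured" because the text does not say how the default of zero is treated internally.

```python
def counter_rate(prev_time, prev_value, now_time, now_value):
    """Per-second rate between two counter samples, or None if unusable."""
    elapsed = now_time - prev_time
    if elapsed <= 0 or now_value < prev_value:
        return None  # no usable interval, or the counter was reset
    return (now_value - prev_value) / float(elapsed)


def check_rate(rate, levels=None):
    """Return a Nagios-style state (0 OK, 1 WARN, 2 CRIT) for one rate."""
    if rate is None or levels is None:
        return 0
    warn, crit = levels
    if rate > crit:
        return 2
    if rate > warn:
        return 1
    return 0
```

For example, a counter that grows from 100 to 700 within 60 seconds yields a rate of 10.0 units per second, which with levels of (5, 20) would result in a warning.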
The following image shows a status output example from the WATO WebUI with several open-iscsi_sessions (iSCSI Session Status) and open-iscsi_host_stats (iSCSI Host Stats) service checks over two QLogic 8200 Series hardware iSCSI initiators:
This example shows six iSCSI Session Status service check items, which are pairs of iSCSI network interface names and – here anonymized – IQNs of the iSCSI target volumes. For each item the current session_state – in this example LOGGED_IN – is shown. There are also two iSCSI Host Stats service check items in the example, which are pairs of MAC addresses and iSCSI network interface names. For each of those items the current throughput rate on the MAC, IP/IPv6, TCP and iSCSI protocol layer is shown. The throughput rate on the MAC protocol layer is also visualized in the Perf-O-Meter, with received traffic growing from the middle to the left and transmitted traffic growing from the middle to the right.
The following three images show examples of the PNP4Nagios graphs for the open-iscsi_host_stats (iSCSI Host Stats) service check.
The middle graph shows a combined view of the throughput rate for received and transmitted traffic on the different MAC, IP/IPv6, TCP and iSCSI protocol layers. The upper graph shows the throughput rate for various frame types on the MAC protocol layer. The lower graph shows the rate for various error frame types on the MAC protocol layer.
The upper graph shows the throughput rate for received and transmitted traffic on the IP/IPv6 protocol layer. The middle graph shows the rate for various error packet types on the IP/IPv6 protocol layer. The lower graph shows the throughput rate for received and transmitted traffic on the TCP protocol layer.
The first graph shows the rate for various protocol control and error segment types on the TCP protocol layer. The second graph shows the rate of ECC error corrections that occurred on the QLogic 8200 Series hardware iSCSI initiator. The third graph shows the throughput rate for received and transmitted traffic on the iSCSI protocol layer. The fourth and last graph shows the rate for various control and error PDUs on the iSCSI protocol layer.
Open-iSCSI Session Statistics
The check plugin open-iscsi_session_stats is responsible for the monitoring of the statistics on individual iSCSI sessions. Upon inventory this check plugin creates a service check for each pair of MAC address of the network interface and IQN of the iSCSI target volume. During normal check execution, an extensive list of statistics – see the above example output of the Check_MK agent plugin – is collected for each inventoried item. If the rate of one of the statistics values exceeds the configured warning or critical threshold, an alarm is raised accordingly. For all statistics, performance data is reported by the check.
With the additional WATO plugin open-iscsi_session_stats.py it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values. The default values for all statistics are a rate of zero (0) units per second for both warning and critical thresholds. The configuration options for the iSCSI session statistics levels can be found in the WATO WebUI under:
-> Host & Service Parameters
-> Parameters for discovered services
-> Storage, Filesystems and Files
-> Open-iSCSI Session Statistics
-> Create Rule in Folder ...
-> The levels for the Open-iSCSI session statistics values
    [x] The levels for the number of transmitted bytes in an Open-iSCSI session
    [x] The levels for the number of received bytes in an Open-iSCSI session
    [x] The levels for the number of digest (CRC) errors in an Open-iSCSI session
    [x] The levels for the number of timeout errors in an Open-iSCSI session
    [x] The levels for the number of transmitted NOP commands in an Open-iSCSI session
    [x] The levels for the number of received NOP commands in an Open-iSCSI session
    [x] The levels for the number of transmitted SCSI command requests in an Open-iSCSI session
    [x] The levels for the number of received SCSI command responses in an Open-iSCSI session
    [x] The levels for the number of transmitted task management function commands in an Open-iSCSI session
    [x] The levels for the number of received task management function responses in an Open-iSCSI session
    [x] The levels for the number of transmitted login requests in an Open-iSCSI session
    [x] The levels for the number of transmitted logout requests in an Open-iSCSI session
    [x] The levels for the number of received logout responses in an Open-iSCSI session
    [x] The levels for the number of transmitted text PDUs in an Open-iSCSI session
    [x] The levels for the number of received text PDUs in an Open-iSCSI session
    [x] The levels for the number of transmitted data PDUs in an Open-iSCSI session
    [x] The levels for the number of received data PDUs in an Open-iSCSI session
    [x] The levels for the number of transmitted single negative ACKs in an Open-iSCSI session
    [x] The levels for the number of received ready to transfer PDUs in an Open-iSCSI session
    [x] The levels for the number of received reject PDUs in an Open-iSCSI session
    [x] The levels for the number of received asynchronous messages in an Open-iSCSI session
The following image shows a status output example from the WATO WebUI with several open-iscsi_sessions (iSCSI Session Status) and open-iscsi_session_stats (iSCSI Session Stats) service checks over two BCM578xx dependent hardware iSCSI initiators:
This example shows six iSCSI Session Status service check items, which are pairs of iSCSI network interface names and – here anonymized – IQNs of the iSCSI target volumes. For each item the current session_state, connection_state and internal_state – in this example with the respective values LOGGED_IN, LOGGED_IN and NO_CHANGE – are shown. There is also an equal number of iSCSI Session Stats service check items in the example, which are pairs of MAC addresses of the network interfaces and IQNs of the iSCSI target volumes. For each of those items the current throughput rate of the individual iSCSI session is shown. As long as the rate of the digest (CRC) and timeout error counters is zero, the string no protocol errors is displayed. Otherwise the name and rate of any non-zero error counter is shown. The throughput rate of the iSCSI session is also visualized in the Perf-O-Meter, with received traffic growing from the middle to the left and transmitted traffic growing from the middle to the right.
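The rule for the error part of the status text can be expressed compactly. The Python sketch below is only an illustration: the function name and the exact formatting of a non-zero counter are made up, while the counter names digest_err and timeout_err are the ones from the agent plugin output.

```python
def error_summary(rates):
    """Build the error part of the status text from per-second rates.

    rates: mapping of counter name to rate, e.g. {"digest_err": 0.0}
    """
    errors = ["%s: %.2f/s" % (name, rates[name])
              for name in ("digest_err", "timeout_err")
              if rates.get(name, 0.0) > 0]
    return ", ".join(errors) if errors else "no protocol errors"
```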
The following image shows an example of the three PNP4Nagios graphs for a single open-iscsi_session_stats (iSCSI Session Stats) service check.
The upper graph shows the throughput rate for received and transmitted traffic of the iSCSI session. The middle graph shows the rate for received and transmitted iSCSI PDUs, broken down by the different types of PDUs on the iSCSI protocol layer. The lower graph shows the rate for the digest (CRC) and timeout errors on the iSCSI protocol layer.
The described Check_MK service check to monitor the status of Open-iSCSI sessions, Open-iSCSI session metrics and iSCSI hardware initiator host metrics has been verified to work with version 2.0.874-2~bpo8+1 of the open-iscsi package from the backports repository of Debian stable (Jessie) on the client side and the Check_MK versions 1.2.6 and 1.2.8 on the server side.
I hope you find the provided new check useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.
2017-03-16 // Check_MK Monitoring - XFS Filesystem Quotas
XFS, like most other modern filesystems, offers the ability to configure and use disk quotas. Usually those quotas set limits with regard to the amount of allocatable disk space or the number of filesystem objects. In most filesystems those limits are bound to users or groups of users. XFS is one of the few filesystems which offer quota limits on a per directory basis. In the case of XFS, directory quotas are implemented via projects. This article introduces a new Check_MK service check to monitor the use of XFS filesystem quotas with a strong focus on directory based quotas.
For the impatient and TL;DR here is the Check_MK package of the XFS filesystem quota monitoring checks:
XFS filesystem quota monitoring checks (Compatible with Check_MK versions 1.2.6 and earlier)
XFS filesystem quota monitoring checks (Compatible with Check_MK versions 1.2.8 and later)
The sources are to be found in my Check_MK repository on GitHub
The Check_MK service check to monitor XFS filesystem quotas consists of two major parts, an agent plugin and a check plugin.
The Check_MK agent plugin named xfs_quota is a simple Bash shell script. It calls the XFS quota administration tool xfs_quota in order to retrieve a report of the current quota usage. The exact call to xfs_quota is:
/usr/sbin/xfs_quota -x -c 'report -p -b -i -a'
which currently reports only block and inode quotas of projects or directories on all available XFS filesystems. An example output of the above command is:
Project quota on /srv/xfs (/dev/mapper/vg00-xfs)
                               Blocks                                          Inodes
Project ID       Used       Soft       Hard    Warn/Grace           Used       Soft       Hard    Warn/ Grace
---------- -------------------------------------------------- --------------------------------------------------
test1               0          0       1024     00 [--------]          1          0          0     00 [--------]
test2               0          0       2048     00 [--------]          1          0          0     00 [--------]
test3               0          0       3072     00 [--------]          1          0          0     00 [--------]
The output of the above command is parsed and reformatted by the agent plugin for easier processing in the check plugin. The above example output would thus be transformed into the agent plugin output shown in the following example:
<<<xfs_quota>>>
/srv/xfs:/dev/mapper/vg00-xfs:test1:0:0:1024:1:0:0
/srv/xfs:/dev/mapper/vg00-xfs:test2:0:0:2048:1:0:0
/srv/xfs:/dev/mapper/vg00-xfs:test3:0:0:3072:1:0:0
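The transformation from the report to the colon-separated agent lines can be sketched as follows. The real agent plugin does this in Bash with sed and egrep; this Python version is merely an equivalent illustration with a made-up function name. The field order (mountpoint, device, project ID, then used/soft/hard for blocks and for inodes) is read off the two examples above.

```python
import re

HEADER_RE = re.compile(r"^Project quota on (\S+) \((\S+)\)")
# project, blocks used/soft/hard, warn, [grace], inodes used/soft/hard, warn, [grace]
ROW_RE = re.compile(
    r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+\d+\s+\[[^\]]*\]"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s+\d+\s+\[[^\]]*\]"
)


def report_to_agent_lines(report):
    """Convert 'xfs_quota ... report' output into agent plugin output lines."""
    lines = ["<<<xfs_quota>>>"]
    mount = device = None
    for raw in report.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            # Remember which filesystem the following rows belong to
            mount, device = header.groups()
            continue
        row = ROW_RE.match(line)
        if row and mount:
            lines.append(":".join((mount, device) + row.groups()))
    return lines
```

Column titles and the separator line are dropped simply because they do not match the row pattern.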
Although a simple Bash shell script, the agent plugin xfs_quota has several dependencies which need to be installed in order for the agent plugin to work properly. Namely those are the commands xfs_quota, sed and egrep. On Debian based systems, the necessary packages can be installed with the following command:
root@host:~# apt-get install xfsprogs sed grep
The second part, a Check_MK check plugin also named xfs_quota, provides the necessary inventory and check functions. Upon inventory it creates a service check for each pair of XFS filesystem mountpoint and quota project ID. During normal check execution, the number of used XFS filesystem blocks and inodes is determined for each inventoried item (pair of XFS filesystem mountpoint and quota project ID). If the hard and soft block and inode quotas for a particular item are all set to zero, no further checks are carried out and only performance data is reported by the check. If any one of the hard or soft, block or inode quotas is set to a non-zero value, the number of remaining free XFS filesystem blocks or inodes is compared to the warning and critical threshold values and an alarm is raised accordingly.
With the additional WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values. The default values for blocks_hard and blocks_soft are zero (0) free blocks for both warning and critical thresholds. The default values for inodes_hard and inodes_soft are zero (0) free inodes for both warning and critical thresholds. The configuration options for the free block or inode levels can be found under:
-> Host & Service Parameters
-> Parameters for discovered services
-> Storage, Filesystems and Files
-> XFS Quota Utilization
-> Create Rule in Folder ...
-> The levels for the soft/hard block/inode quotas on XFS filesystems
    [x] The levels for the hard block quotas on XFS filesystems
    [x] The levels for the soft block quotas on XFS filesystems
    [x] The levels for the hard inode quotas on XFS filesystems
    [x] The levels for the soft inode quotas on XFS filesystems
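The check logic for one agent line can be sketched in Python as shown below. This is an illustration under stated assumptions rather than the plugin's code: the function name is hypothetical, the parameter names blocks_hard, blocks_soft, inodes_hard and inodes_soft are the ones mentioned above, and "free count at or below the level" is my reading of how the default of zero free blocks or inodes would trigger an alarm.

```python
DEFAULT_LEVELS = {
    "blocks_hard": (0, 0), "blocks_soft": (0, 0),
    "inodes_hard": (0, 0), "inodes_soft": (0, 0),
}


def check_xfs_quota(line, levels=DEFAULT_LEVELS):
    """Return a Nagios-style state (0 OK, 1 WARN, 2 CRIT) for one agent line."""
    fields = line.split(":")
    b_used, b_soft, b_hard, i_used, i_soft, i_hard = [int(f) for f in fields[3:9]]
    quotas = {
        "blocks_soft": (b_used, b_soft), "blocks_hard": (b_used, b_hard),
        "inodes_soft": (i_used, i_soft), "inodes_hard": (i_used, i_hard),
    }
    state = 0
    for name, (used, limit) in sorted(quotas.items()):
        if limit == 0:
            continue  # a quota of zero means none is configured: nothing to compare
        free = limit - used
        warn, crit = levels[name]
        if free <= crit:      # assumed comparison, see lead-in
            state = max(state, 2)
        elif free <= warn:
            state = max(state, 1)
    return state
```

With the default levels an item only becomes critical once a configured quota is fully used up; raising the levels through WATO moves the alarm to an earlier point.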
The following image shows a status output example for several xfs_quota service checks from the WATO WebUI:
This example shows several service check items, which again are pairs of XFS filesystem mountpoints (here: /backup) and anonymized quota project IDs. For each item the numbers of blocks and inodes used are shown along with the appropriate hard and soft quota values. The numbers of blocks and inodes used are also visualized in the Perf-O-Meter, with blocks on a logarithmic scale growing from the middle to the left and inodes on a logarithmic scale growing from the middle to the right.
The following image shows an example of the two PNP4Nagios graphs for a single service check:
The upper graph shows the number of used XFS filesystem blocks for a pair of XFS filesystem mountpoint (here: /backup) and (again anonymized) quota project ID. The lower graph shows the number of used inodes for the same pair. Both graphs show the warning and critical threshold values, which in this example are at their default value of zero. If configured, as in the upper graph of the example, block or inode quotas are also shown as blue horizontal lines in the respective graphs.
The described Check_MK service check to monitor XFS filesystem quotas has been verified to work with version 3.2.1 of the xfsprogs package on Debian stable (Jessie) on the client side and the Check_MK versions 1.2.6 and 1.2.8 on the server side.
I hope you find the provided new check useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.