bityard Blog

// Check_MK Monitoring - Brocade / Broadcom Fibre Channel Switches

This article provides patches for the standard Check_MK distribution in order to fix and enhance the support for the monitoring of Brocade / Broadcom Fibre Channel Switches.


The standard Check_MK distribution already provides monitoring support for Brocade / Broadcom Fibre Channel Switches out of the box. Unfortunately, there are several issues in the check brocade_fcport, which is used to monitor the status and several metrics of the fibre channel switch ports. Those issues prevent the check from working as intended and are fixed in the version provided in this article. Until recently, the standard Check_MK distribution also had no support for monitoring CPU and memory metrics on a switch level, or SFP metrics on a port level. This article introduces the new checks brocade_cpu, brocade_mem and brocade_sfp to cover those metrics. The most recent upstream version 1.5.x of the Check_MK distribution now includes the standard checks brocade_sys (covering CPU and memory metrics) and brocade_sfp (covering the SFP port metrics), which provide basically the same functionality.

For the impatient (TL;DR), here is the enhanced version of the brocade_fcport check:

Enhanced version of the brocade_fcport check

And the new brocade_cpu, brocade_mem and brocade_sfp checks:

The new brocade_cpu check
The new brocade_mem check
The new brocade_sfp check

The sources to the new and enhanced versions of all the checks can be found in my Check_MK Plugins repository on GitHub.

Additional Checks

CPU Usage

The Check_MK service check brocade_cpu monitors the current CPU utilization of Brocade / Broadcom fibre channel switches. It uses the SNMP OID swCpuUsage from the fibre channel switch MIB (SW-MIB) in order to create one service check for each CPU found in the system. The current CPU utilization in percent is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the standard WATO plugin for CPU utilization it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 80%; critical: 90% of CPU utilization).
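
The threshold comparison described above can be sketched roughly as follows. This is a minimal illustration and not the code of the brocade_cpu check itself: the function name evaluate_cpu_util and the numeric state encoding (0 = OK, 1 = WARN, 2 = CRIT) are assumptions made for this example, as is the choice to alarm when a level is reached or exceeded.

```python
# Hypothetical sketch of a warn/crit comparison for a CPU utilization value.
# The defaults mirror the values mentioned above (warning: 80%, critical: 90%).

def evaluate_cpu_util(util_percent, levels=(80.0, 90.0)):
    """Return a (state, text) tuple: 0 = OK, 1 = WARN, 2 = CRIT."""
    warn, crit = levels
    if util_percent >= crit:
        state = 2
    elif util_percent >= warn:
        state = 1
    else:
        state = 0
    text = "CPU utilization: %.1f%% (warn/crit at %.1f%%/%.1f%%)" % (
        util_percent, warn, crit)
    return state, text
```

Passing a different levels tuple corresponds to overriding the default values through a WATO rule.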

The following image shows a status output example for the brocade_cpu service check from the WATO WebUI:

Status output example for the new brocade_cpu service check

This example shows the current CPU utilization of the system in percent.

The following image shows an example of the service metrics graph for the brocade_cpu service check:

Example service metrics graph for the new brocade_cpu service check

The selected example graph shows the current CPU utilization of the system in percent as well as the default warning and critical threshold values.

Memory Usage

The Check_MK service check brocade_mem monitors the current memory (RAM) usage on Brocade / Broadcom fibre channel switches. It uses the SNMP OID swMemUsage from the fibre channel switch MIB (SW-MIB) in order to create a service check for the overall memory usage on the system. The amount of currently used memory is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 80%; critical: 90% of used memory). The configuration options for the used memory levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Operating System Resources
         -> Brocade Fibre Channel Memory Usage
            -> Create rule in folder ...
               [x] Levels for memory usage

The following image shows a status output example for the brocade_mem service check from the WATO WebUI:

Status output example for the new brocade_mem service check

This example shows the current memory utilization of the system in percent.

The following image shows an example of the service metrics graph for the brocade_mem service check:

Example service metrics graph for the new brocade_mem service check

The selected example graph shows the current memory utilization of the system in percent as well as the default warning and critical threshold values.

SFP Health

The Check_MK service check brocade_sfp monitors several metrics of the SFPs in all enabled and active ports of Brocade / Broadcom fibre channel switches. It uses several SNMP OIDs from the fibre channel switch MIB (SW-MIB) and from the Fabric Alliance Extension MIB (FA-EXT-MIB) in order to create one service check for each SFP found in an enabled and active port on the system. The OIDs from the fibre channel switch MIB (swFCPortSpecifier, swFCPortName, swFCPortPhyState, swFCPortOpStatus, swFCPortAdmStatus) are used to determine the number, name and status of the switch port. The OIDs from the Fabric Alliance Extension MIB are used to determine the actual SFP metrics. Those metrics are the SFP temperature (swSfpTemperature), voltage (swSfpVoltage) and current (swSfpCurrent), as well as the optical receive and transmit power (swSfpRxPower and swSfpTxPower) of the SFP. Unfortunately, the SFP metrics are not available on all Brocade / Broadcom switch hardware platforms and FabricOS versions. This check was verified to work with Gen6 hardware and FabricOS v8. Another limitation is that the SFP metrics are not available in real-time, but are only gathered by the switch from all of its SFPs in a 5 minute interval. This is most likely a precaution so as not to overload the processing capacity of the SFPs with too many status requests.

The current temperature level as well as the optical receive and transmit power levels are compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values:

Metric                   Warning Threshold   Critical Threshold
System temperature       55°C                65°C
Optical receive power    -7.0 dBm            -9.0 dBm
Optical transmit power   -2.0 dBm            -3.0 dBm

The configuration options for the used temperature and optical receive and transmit power levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Networking
         -> Brocade Fibre Channel SFP
            -> Create rule in folder ...
               [x] Temperature levels in degrees celcius
               [x] Receive power levels in dBm
               [x] Transmit power levels in dBm

Currently, the electrical current and voltage metrics of the SFPs are only used for long-term trends via the respective service metric template and thus are not used to raise any alarms. This is due to the fact that there is little to no information available on which precise electrical current and voltage levels would constitute an indicator for an immediate or impending failure state.
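
To illustrate how the three alarmed metrics could be evaluated, here is a minimal sketch. It is not taken from the brocade_sfp check: all names are hypothetical, and it assumes that the temperature levels are upper bounds while the optical power levels are lower bounds (the critical values being lower than the warning values suggests this, but it is an assumption here).

```python
# Hypothetical sketch: upper levels for temperature, lower levels for optical power.
DEFAULT_LEVELS = {
    "temp": (55.0, 65.0),      # degrees Celsius, alarm when at or above
    "rx_power": (-7.0, -9.0),  # dBm, assumed to alarm when at or below
    "tx_power": (-2.0, -3.0),  # dBm, assumed to alarm when at or below
}


def state_upper(value, levels):
    """0 = OK, 1 = WARN, 2 = CRIT for a value that must stay below its levels."""
    warn, crit = levels
    if value >= crit:
        return 2
    if value >= warn:
        return 1
    return 0


def state_lower(value, levels):
    """0 = OK, 1 = WARN, 2 = CRIT for a value that must stay above its levels."""
    warn, crit = levels
    if value <= crit:
        return 2
    if value <= warn:
        return 1
    return 0


def evaluate_sfp(temp, rx_power, tx_power, levels=DEFAULT_LEVELS):
    """Return the worst state over the three alarmed metrics.

    Voltage and current are deliberately left out, since they are only
    recorded for long-term trends.
    """
    return max(
        state_upper(temp, levels["temp"]),
        state_lower(rx_power, levels["rx_power"]),
        state_lower(tx_power, levels["tx_power"]),
    )
```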

The following image shows several status output examples for the brocade_sfp service check from the WATO WebUI:

Status output examples for the new brocade_sfp service check

This example shows SFP Port service check items for eight consecutive ports on a switch. For each item the current temperature, voltage and current, as well as the optical receive and transmit power of the SFP are shown. The optical receive and transmit power levels are also visualized in the Perf-O-Meter, optical receive power growing from the middle to the left, optical transmit power growing from the middle to the right.

The following image shows an example of the service metrics graph for the brocade_sfp service check:

Example service metrics graph for the new brocade_sfp service check

The first graph shows the optical receive and transmit power of the SFP. The second shows the electrical current drawn by the SFP. The third graph shows the temperature of the SFP. The fourth and last graph shows the electrical voltage provided to the SFP.

Modified Check

Fibre Channel Port

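The Check_MK service check brocade_fcport has several issues which are addressed by the following set of patches. The patch is actually a monolithic one and is just broken up into individual patches here for ease of discussion. Note that the diffs below are shown in the direction from the enhanced version (a/) to the standard version (b/), so the lines prefixed with - are the ones that belong to the enhanced check.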

The first patch is a simple, but ugly workaround to prevent OID_END, which carries the port index information, from being treated as a BINARY SNMP value:

brocade_fcport.patch
--- a/checks/brocade_fcport   2018-11-25 18:06:04.674930057 +0100
+++ b/checks/brocade_fcport   2018-11-25 08:43:58.715721271 +0100
@@ -154,8 +135,8 @@
         bbcredits = None
         if len(if64_info) > 0:
             fcmgmt_portstats = []
-            for oidend, dummy, tx_elements, rx_elements, bbcredits_64 in if64_info:
-                if int(index) == int(oidend.split(".")[-1]):
+            for oidend, tx_elements, rx_elements, bbcredits_64 in if64_info:
+                if index == oidend.split(".")[-1]:
                     fcmgmt_portstats = [
                         binstring_to_int(''.join(map(chr, tx_elements))) / 4,
                         binstring_to_int(''.join(map(chr, rx_elements))) / 4,
@@ -477,7 +426,6 @@
         # Not every device supports that
         (".1.3.6.1.3.94.4.5.1", [
             OID_END,
-            "1",            # Dummy value, otherwise OID_END is also treated as a BINARY value
             BINARY("6"),    # FCMGMT-MIB::connUnitPortStatCountTxElements
             BINARY("7"),    # FCMGMT-MIB::connUnitPortStatCountRxElements
             BINARY("8"),    # FCMGMT-MIB::connUnitPortStatCountBBCreditZero

This is achieved by simply inserting a dummy value into the list of SNMP OIDs in the snmp_info variable. I haven't had time to dig out the root cause of this behaviour, but I guess it must be somewhere in the core SNMP components of Check_MK. The patch also implicitly addresses and fixes an issue where two variables with differently typed content are being compared. This is achieved by simply changing the line:

               if index == oidend.split(".")[-1]:

to:

               if int(index) == int(oidend.split(".")[-1]):

and thus forcing a conversion to integer values.
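
A short illustration of why the conversion matters. The values are made up, and which of the two sides is an integer and which a string in the actual check is an assumption here:

```python
# Illustrative values only: a port index as an integer and an OID suffix as a string.
index = 7
oidend = "16.0.0.5.30.1.2.3.7"   # made-up OID_END value

last = oidend.split(".")[-1]     # "7" (a string)

mismatch = (index == last)          # False: an int never equals a str
match = (int(index) == int(last))   # True: both sides converted to int
```

Without the conversion, the comparison silently evaluates to False and the matching entry is never found.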

The next patch adds support for additional encoding schemes in the FC-1 layer, which are used for fibre channel speeds beyond 8 GBit. Up to and including a speed of 8 GBit, fibre channel uses an 8b/10b encoding. This means that for every 8 bits of data, 10 bits are actually sent over the fibre channel link. The encoding scheme changes for speeds higher than 8 GBit. 16 GBit fibre channel, like 10 GBit ethernet, uses a 64b/66b encoding scheme, while 32 GBit fibre channel and higher use a 256b/257b encoding scheme. Thus the wirespeed calculations of the standard brocade_fcport service check are incorrect for speeds above 8 GBit. The following patch addresses and fixes this issue:

brocade_fcport.patch
--- a/checks/brocade_fcport   2018-11-25 18:06:04.674930057 +0100
+++ b/checks/brocade_fcport   2018-11-25 08:43:58.715721271 +0100
@@ -283,15 +267,8 @@
 
     output.append(speedmsg)
 
-    if gbit > 16:
-        # convert gbit netto link-rate to Byte/s (256/257 enc)
-        wirespeed = gbit * 1000000000.0 * ( 256.0 / 257 ) / 8
-    elif gbit > 8:
-        # convert gbit netto link-rate to Byte/s (64/66 enc)
-        wirespeed = gbit * 1000000000.0 * ( 64.0 / 66 ) / 8
-    else:
         # convert gbit netto link-rate to Byte/s (8/10 enc)
-        wirespeed = gbit * 1000000000.0 * ( 8.0 / 10 ) / 8
+    wirespeed = gbit * 1000000000.0 * 0.8 / 8
     in_bytes = 4 * get_rate("brocade_fcport.rxwords.%s" % index, this_time, rxwords)
     out_bytes = 4 * get_rate("brocade_fcport.txwords.%s" % index, this_time, txwords)
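
As a standalone illustration of the encoding-dependent calculation, the following sketch computes the usable line rate for a given nominal speed. The function name is hypothetical. The ratios are written with float literals so that they do not collapse to zero under Python 2 integer division.

```python
def fc_wirespeed_bytes_per_sec(gbit):
    """Return the usable line rate in bytes per second for a nominal FC speed in GBit."""
    if gbit > 16:
        efficiency = 256.0 / 257.0   # 256b/257b encoding, 32 GBit and above
    elif gbit > 8:
        efficiency = 64.0 / 66.0     # 64b/66b encoding, 16 GBit
    else:
        efficiency = 8.0 / 10.0      # 8b/10b encoding, up to and including 8 GBit
    return gbit * 1000000000.0 * efficiency / 8
```

For an 8 GBit link this yields about 800 MByte/s, for a 16 GBit link about 1939 MByte/s and for a 32 GBit link about 3984 MByte/s. With the 8b/10b assumption applied to all speeds, the latter two would be underestimated as 1600 MByte/s and 3200 MByte/s.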

The third patch simply adds the notxcredits counter and its value to the output of the service check. This information is currently missing from the status output of the check and is just added as a convenience:

brocade_fcport.patch
--- a/checks/brocade_fcport   2018-11-25 18:06:04.674930057 +0100
+++ b/checks/brocade_fcport   2018-11-25 08:43:58.715721271 +0100
@@ -408,9 +361,6 @@
             summarystate = max(1, summarystate)
             text += "(!)"
             output.append(text)
-        else:
-            if counter == "notxcredits":
-                output.append(text)
 
     # P O R T S T A T E
     for dev_state, state_key, state_info, warn_states, state_map in [

The last patch addresses and fixes a more serious issue. The details are explained in the comment of the following code snippet. In short, the metric notxcredits (“No TX buffer credits”), gathered from the SNMP OID swFCPortNoTxCredits, is calculated incorrectly. The brocade_fcport service check treats this metric like all the other error metrics of a switch port (e.g. “CRC errors”, “ENC-Out”, “ENC-In” and “C3 discards”) and puts it in relation to the number of frames transmitted over a link. In reality though, the metric behind the SNMP OID swFCPortNoTxCredits is defined relative to time. This issue leads to false positives in certain edge cases, for example when a lightly utilized switch port sees an otherwise uncritical number of swFCPortNoTxCredits due to the normal activity of the fibre channel flow control. In such a case, a relatively high number of swFCPortNoTxCredits is put into relation to the sum of a relatively low number of frames transmitted over the link and, again, the relatively high number of swFCPortNoTxCredits. See the last line of the code snippet below for this calculation. The result is a high value for the metric notxcredits (“No TX buffer credits”), although the fibre channel flow control was working perfectly fine, the configured switch.edgeHoldTime was most likely never reached and no frames were ever dropped.

In order to address this issue, the following patch adds a few lines of code to the brocade_fcport service check for a special treatment of the metric notxcredits. This is done in the third and last section of the patch. The second section of the patch is just for the purpose of clarification: it sets the reference metric to None for the metric notxcredits, since this metric is no longer related to the number of frames. The first section of the patch adjusts the default warning and critical thresholds to lower levels. The previous levels were rather high due to the miscalculation explained above. With the new calculation added by the third section of the patch, the default warning and critical threshold levels need to be much more sensitive.

brocade_fcport.patch
--- a/checks/brocade_fcport   2018-11-25 18:06:04.674930057 +0100
+++ b/checks/brocade_fcport   2018-11-25 08:43:58.715721271 +0100
@@ -100,11 +100,11 @@
 factory_settings["brocade_fcport_default_levels"] = {
     "rxcrcs":           (3.0, 20.0),   # allowed percentage of CRC errors
     "rxencoutframes":   (3.0, 20.0),   # allowed percentage of Enc-OUT Frames
     "rxencinframes":    (3.0, 20.0),   # allowed percentage of Enc-In Frames
-    "notxcredits":      (1.0, 3.0),    # allowed percentage of No Tx Credits
+    "notxcredits": (3.0, 20.0),  # allowed percentage of No Tx Credits
     "c3discards":       (3.0, 20.0),   # allowed percentage of C3 discards
     "assumed_speed":    2.0,           # used if speed not available in SNMP data
 }
 
 
@@ -349,7 +327,7 @@
            ("ENC-Out",              "rxencoutframes",      rxencoutframes,  rxframes_rate),
            ("ENC-In",               "rxencinframes",       rxencinframes,   rxframes_rate),
            ("C3 discards",          "c3discards",          c3discards,      txframes_rate),
-           ("No TX buffer credits", "notxcredits",         notxcredits,     None),
+        ("No TX buffer credits", "notxcredits", notxcredits, txframes_rate),
     ]:
         per_sec = get_rate("brocade_fcport.%s.%s" % (counter, index), this_time, value)
         perfdata.append((counter, per_sec))
@@ -360,31 +338,6 @@
                     (counter, item), this_time, per_sec, average)
             perfdata.append( ("%s_avg" % counter, per_sec_avg ) )
 
-        # Calculate error rates
-        if counter == "notxcredits":
-            # Calculate the error rate for "notxcredits" (buffer credit zero). Since this value
-            # is relative to time instead of the number of transmitted frames it needs special
-            # treatment.
-            # Semantics of the buffer credit zero value on Brocade / Broadcom devices:
-            # The switch ASIC checks the buffer credit value of a switch port every 2.5us and
-            # increments the buffer credit zero counter if the buffer credit value is zero.
-            # This means if in a one second interval the buffer credit zero counter increases
-            # by 400000 the link on this switch port is not allowed to send any frames.
-            # By default the edge hold time on a Brocade / Broadcom device is about 200ms:
-            #   switch.edgeHoldTime:220
-            # If a C3 frame remains in the switches queue for more than 220ms without being
-            # given any credits to be transmitted, it is subsequently dropped. Thus the buffer
-            # credit zero counter would optimally be correlated to the C3 discards counter.
-            # Unfortunately the Brocade / Broadcom devices have no egress buffering and do
-            # ingress buffering instead. Thus the C3 discards counters are increased on the
-            # ingress port, while the buffer credit zero counters are increased on the egress
-            # port. The trade-off is to correlate the buffer credit zero counters relative to
-            # the measured time interval.
-            if per_sec > 0:
-                rate = per_sec / 400000.00
-            else:
-                rate = 0
-        else:
             # compute error rate (errors in relation to number of frames) (from 0.0 to 1.0)
             if ref > 0 or per_sec > 0:
                 rate = per_sec / (ref + per_sec)

Conclusion

Adding the three new checks for Brocade / Broadcom fibre channel switches to your Check_MK server enables you to monitor additional CPU and memory aspects as well as the SFPs of your Brocade / Broadcom fibre channel devices. More recent versions of Check_MK should already include the same functionality through the now included standard checks brocade_sys and brocade_sfp.

New Brocade / Broadcom fibre channel devices should pick up the additional service checks immediately. Existing Brocade / Broadcom fibre channel devices might need a Check_MK inventory to be run explicitly on them in order to pick up the additional service checks.

The enhanced version of the brocade_fcport service check addresses and fixes several issues present in the standard Check_MK service check. It should provide you with a more complete and future-proof monitoring of your fibre channel network infrastructure.

I hope you find the provided new and enhanced checks useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.

// Check_MK Monitoring - HPE Virtual Connect Fibre Channel Modules

This article provides patches for the standard Check_MK distribution in order to add support for the monitoring of HPE Virtual Connect Fibre Channel Modules.


Out of the box, there is currently no monitoring support for HPE Virtual Connect Fibre Channel Modules in the standard Check_MK distribution. Those modules, such as the HPE Virtual Connect 8Gb 20-port Fibre Channel Module, are used in HPE c-Class BladeSystem to provide Fibre Channel connectivity for the individual server blades. Fortunately, the modules provide status and performance data via the standard SNMP FIBRE-CHANNEL-FE-MIB defined in RFC 2837 as well as its successor, the SNMP FCMGMT-MIB defined in RFC 4044. Those two SNMP MIBs are already covered by the checks qlogic_fcport, qlogic_sanbox and qlogic_sanbox_fabric_element, which are part of the standard Check_MK distribution. This simplifies the task of adding support for the HPE Virtual Connect Fibre Channel modules and reduces it to a matter of extending the already existing checks with three rather simple patches.

For the impatient (TL;DR), here are the enhanced versions of the qlogic_fcport, qlogic_sanbox and qlogic_sanbox_fabric_element checks:

Enhanced version of the qlogic_fcport check
Enhanced version of the qlogic_sanbox check
Enhanced version of the qlogic_sanbox_fabric_element check

The sources to the enhanced versions of all three checks can be found in my Check_MK Plugins repository on GitHub.

The necessary changes to qlogic_fcport and qlogic_sanbox_fabric_element are limited to the snmp_scan_function used by the Check_MK inventory. Here, the vendor specific OID for the HPE Virtual Connect Fibre Channel modules is added. The following patch shows the respective lines for qlogic_fcport:

qlogic_fcport.patch
--- a/checks/qlogic_fcport   2017-03-06 21:00:07.397607946 +0100
+++ b/checks/qlogic_fcport   2017-10-01 14:34:48.153710776 +0200
@@ -218,12 +218,14 @@
     # .1.3.6.1.4.1.3873.1.12 QLogic 8 Gb and 4/8 Gb Intelligent Pass-thru Module
     # .1.3.6.1.4.1.3873.1.9  QLogic SANBox 5802 FC Switch
     # .1.3.6.1.4.1.3873.1.11 HP StorageWorks 8/20q Fibre Channel Switch
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function'    : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
         or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
         or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.11") \
         or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.12") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.9"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.9") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
     'group':                   'qlogic_fcport',
     'default_levels_variable': 'qlogic_fcport_default_levels',
 }

and for qlogic_sanbox_fabric_element:

qlogic_sanbox_fabric_element.patch
--- a/checks/qlogic_sanbox_fabric_element    2017-03-06 21:00:07.397607946 +0100
+++ b/checks/qlogic_sanbox_fabric_element    2017-10-01 14:47:35.000003198 +0200
@@ -54,7 +54,9 @@
                                                            OID_END]),
     # .1.3.6.1.4.1.3873.1.14 Qlogic-Switch
     # .1.3.6.1.4.1.3873.1.8  Qlogic-4Gb SAN Switch Module for IBM BladeCenter
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function'    : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
 }

In both cases, the relevant line is:

        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),

After those two simple changes, the checks will now be able to successfully inventorize the overall fabric status as well as the status of individual ports of HPE Virtual Connect Fibre Channel modules.
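
The prefix matching done by the scan function can also be illustrated outside of Check_MK. The following is only a sketch: the oid argument is replaced by a hypothetical dictionary-based stand-in, and the chained or expressions are condensed into a tuple passed to str.startswith(), which is an alternative formulation and not what the patches above use.

```python
SYS_OBJECT_ID = ".1.3.6.1.2.1.1.2.0"

SUPPORTED_PREFIXES = (
    ".1.3.6.1.4.1.3873.1.14",  # Qlogic-Switch
    ".1.3.6.1.4.1.3873.1.8",   # Qlogic-4Gb SAN Switch Module for IBM BladeCenter
    ".1.3.6.1.4.1.3873.1.16",  # HPE Virtual Connect FlexFabric
)


def scan(oid):
    """Return True if the device's sysObjectID starts with a supported prefix."""
    return oid(SYS_OBJECT_ID).startswith(SUPPORTED_PREFIXES)


def make_oid_lookup(values):
    """Hypothetical stand-in for the oid() callable: look up values in a dict."""
    return lambda name: values.get(name, "")
```

Calling scan(make_oid_lookup({SYS_OBJECT_ID: ".1.3.6.1.4.1.3873.1.16.1"})) returns True, while a device reporting a sysObjectID outside of the listed prefixes is skipped.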

The necessary changes to qlogic_sanbox also require the extension of the snmp_scan_function used by the Check_MK inventory as shown by the patches above. In addition to that, the string operations on the sensor_id need to be adjusted in order to get a more user-friendly name for the temperature and power supply sensors which are also present in the HPE Virtual Connect Fibre Channel modules. Since the sensor IDs are encoded in the SNMP OIDs and the SNMP tree for those OIDs can vary from module to module, the simple string replacement in the original qlogic_sanbox check was exchanged for a more general, regular expression based substitution. The following patch shows the respective lines for the combined changes to qlogic_sanbox:

qlogic_sanbox.patch
--- a/checks/qlogic_sanbox   2017-03-06 21:00:07.397607946 +0100
+++ b/checks/qlogic_sanbox   2017-10-01 14:47:51.348002546 +0200
@@ -44,7 +44,7 @@
     inventory = []
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_type == "8" and sensor_characteristic == "3" and \
             sensor_name != "Temperature Status":
             inventory.append( (sensor_id, None) )
@@ -53,7 +53,7 @@
 def check_qlogic_sanbox_temp(item, _no_params, info):
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_id == item:
             sensor_status = int(sensor_status)
             if sensor_status < 0 or sensor_status >= len(qlogic_sanbox_status_map):
@@ -93,9 +93,11 @@
                                                        OID_END]),
     # .1.3.6.1.4.1.3873.1.14 Qlogic-Switch
     # .1.3.6.1.4.1.3873.1.8  Qlogic-4Gb SAN Switch Module for IBM BladeCenter
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function'    : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
 }
 
 #.
@@ -113,7 +115,7 @@
     inventory = []
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_type == "5":
             inventory.append( (sensor_id, None) )
     return inventory
@@ -121,7 +123,7 @@
 def check_qlogic_sanbox_psu(item, _no_params, info):
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_id == item:
             sensor_status = int(sensor_status)
             if sensor_status < 0 or sensor_status >= len(qlogic_sanbox_status_map):
@@ -153,7 +155,9 @@
                                                        OID_END]),
     # .1.3.6.1.4.1.3873.1.14 Qlogic-Switch
     # .1.3.6.1.4.1.3873.1.8  Qlogic-4Gb SAN Switch Module for IBM BladeCenter
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function'    : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
 }

After those additional, but still simple, changes, the check will now be able to successfully inventorize the temperature and power supply sensors of HPE Virtual Connect Fibre Channel modules.
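
The effect of the regular expression based substitution can be tried out in isolation. The sensor IDs below are made up for illustration and only follow the general pattern matched by the expression. The helper name short_sensor_id is hypothetical as well.

```python
import re

# Same expression as in the patch above, written as a raw string.
SENSOR_ID_PREFIX = re.compile(
    r'^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.'
)


def short_sensor_id(sensor_id):
    """Strip the module specific prefix and the zero padding from a sensor ID."""
    return SENSOR_ID_PREFIX.sub('', sensor_id)


# Made-up sensor IDs
qlogic_id = "16.0.0.192.221.48.1.2.3.0.0.0.0.0.0.0.0.5"
hpe_id = "16.0.116.70.160.113.4.5.6.0.0.0.0.0.0.0.0.2"
other_id = "16.1.2.3.4.5.6.7.8.0.0.0.0.0.0.0.0.9"
```

With these inputs, short_sensor_id() reduces qlogic_id to "5" and hpe_id to "2", while other_id is returned unchanged because it starts with neither of the two known prefixes.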

// Check_MK - Race Condition in Processing of Piggyback Data

In current versions of Check_MK, namely 1.2.8p24 and 1.4.0p8, which are included in the Open Monitoring Distribution, there are several race conditions in the code responsible for processing the piggyback data sent by the agents. These race conditions are usually triggered when a host is monitored more than once, e.g. by the use of cluster services or if by chance a manual check on the command line coincides with a scheduled check from the monitoring core. While not critical to the whole monitoring process, the effects of these races are intermittent and annoying UNKNOWN - [Errno 2] No such file or directory errors for the monitored host. This article provides patches for both Check_MK versions in order to deal with the spurious error messages.

The normal order of processing of piggybacked data received from agents seems to be as follows:

  1. The Check_MK server calls the agent on the monitored host.

  2. The agent on the monitored host responds with the output of its own host as well as the piggybacked output from other entities.

  3. The Check_MK server processes the agent's response and extracts the piggybacked data.

  4. If there is already data stored from the piggybacked host, it's considered stale and is thus removed.

  5. The piggybacked data is stored in a file named <AGENT_HOSTNAME> in the directory named <PIGGYBACKED_HOSTNAME> under the path ${OMD_ROOT}/var/check_mk/piggyback/ (e.g.: ${OMD_ROOT}/var/check_mk/piggyback/<PIGGYBACKED_HOSTNAME>/<AGENT_HOSTNAME>).

  6. The Check_MK server lists the contents of the directory ${OMD_ROOT}/var/check_mk/piggyback/<PIGGYBACKED_HOSTNAME>/. For each file in the list, it processes the data contained in the file.

  7. The Check_MK server removes the files in and subsequently the directory ${OMD_ROOT}/var/check_mk/piggyback/<PIGGYBACKED_HOSTNAME>/ itself.

In case a host is monitored more than once, e.g. through the use of cluster services, steps 4 through 7 above can overlap for two or more concurrent monitoring processes. Such an overlap can cause one process at step 4 to delete the file ${OMD_ROOT}/var/check_mk/piggyback/<PIGGYBACKED_HOSTNAME>/<AGENT_HOSTNAME> while another process is still working on it at step 5 or step 6.
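
To make the failure mode more tangible, here is a minimal sketch of a reader that tolerates files or directories vanishing underneath it. This is not the Check_MK code and not the patch provided by this article. The function name and the return format are hypothetical, and it only illustrates the general idea of treating ENOENT as "another process was faster".

```python
import errno
import os


def read_piggyback_files(piggyback_dir):
    """Return {file_name: content} for all files in piggyback_dir.

    Files or directories that vanish while we work on them are skipped
    instead of raising an exception.
    """
    result = {}
    try:
        names = os.listdir(piggyback_dir)
    except OSError as e:
        if e.errno == errno.ENOENT:
            return result  # directory was removed by a concurrent process
        raise
    for name in names:
        path = os.path.join(piggyback_dir, name)
        try:
            with open(path) as f:
                result[name] = f.read()
        except (IOError, OSError) as e:
            if e.errno != errno.ENOENT:
                raise
            # file was removed between listdir() and open(); ignore it
    return result
```

Any error other than ENOENT is re-raised, so genuine problems such as permission errors remain visible.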

The effects of this issue are mainly intermittent and annoying UNKNOWN - [Errno 2] No such file or directory errors in the Check_MK WebUI as well as errors like the following examples in the log files:

2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> An exception occured while processing host "hostA"
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> Traceback (most recent call last):
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>   File "/omd/sites/SITE/share/check_mk/modules/keepalive.py", line 125, in do_keepalive
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>     status = command_function(command_tuple)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>   File "/omd/sites/SITE/share/check_mk/modules/keepalive.py", line 437, in execute_keepalive_command
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>     return mode_function(hostname, ipaddress)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 1290, in do_check
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>     do_all_checks_on_host(hostname, ipaddress, only_check_types)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 1506, in do_all_checks_on_host
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>     info = get_info_for_check(hostname, ipaddress, infotype)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 319, in get_info_for_check
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>     info = apply_parse_function(get_host_info(hostname, ipaddress, section_name, max_cachefile_age, ignore_check_interval), section_name)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 371, in get_host_info
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>     ignore_check_interval=True)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 541, in get_realhost_info
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>     store_persisted_info(hostname, persisted)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 567, in store_persisted_info
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>     os.rename("%s.#new" % file_path, file_path)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> OSError: [Errno 2] No such file or directory
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> An exception occured while processing host "hostA"
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> Traceback (most recent call last):
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>   File "/omd/sites/SITE/share/check_mk/modules/keepalive.py", line 125, in do_keepalive
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>     status = command_function(command_tuple)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>   File "/omd/sites/SITE/share/check_mk/modules/keepalive.py", line 437, in execute_keepalive_command
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>     return mode_function(hostname, ipaddress)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 1241, in do_check
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>     do_all_checks_on_host(hostname, ipaddress, only_check_types)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 1457, in do_all_checks_on_host
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>     info = get_info_for_check(hostname, ipaddress, infotype)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 319, in get_info_for_check
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>     info = apply_parse_function(get_host_info(hostname, ipaddress, section_name, max_cachefile_age, ignore_check_interval), section_name)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 371, in get_host_info
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>     ignore_check_interval=True)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 540, in get_realhost_info
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>     store_piggyback_info(hostname, piggybacked)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 651, in store_piggyback_info
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>     os.rename(dir + "/.new." + sourcehost, dir + "/" + sourcehost)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> OSError: [Errno 2] No such file or directory

In version 1.4.0p8 of Check_MK there is already a partial fix for the described issue in the form of Werk 4755 (Git commit 77b3bfc3). For both Check_MK versions the following two trivial patches are provided in order to deal with the spurious error messages:

In the current development version of Check_MK – probably to be named 1.5 – a rather large amount of code has been rewritten and moved. At first glance I couldn't determine whether the issue still exists there.
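
The race described above boils down to an os.rename() call whose source file or directory has already been removed by a concurrent process. The following is only a minimal Python 3 sketch of one way to tolerate that situation; it is not one of the patches mentioned above, and the helper name rename_if_present is hypothetical:

```python
import errno
import os
import tempfile


def rename_if_present(src, dst):
    """Rename src to dst; return False instead of raising if a path vanished.

    Hypothetical helper for illustration, not part of Check_MK.
    """
    try:
        os.rename(src, dst)
    except OSError as exc:
        # ENOENT is the "[Errno 2] No such file or directory" seen in the logs.
        if exc.errno == errno.ENOENT:
            return False
        raise
    return True


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    new_path = os.path.join(workdir, ".new.hostA")
    final_path = os.path.join(workdir, "hostA")
    with open(new_path, "w") as handle:
        handle.write("piggyback data\n")
    print(rename_if_present(new_path, final_path))  # source exists
    print(rename_if_present(new_path, final_path))  # source is gone by now
```

Swallowing the error only hides the symptom. Whether that is acceptable depends on the other process having written equivalent data in the meantime, which is an assumption of this sketch.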

// Check_MK Monitoring - Open-iSCSI

The Open-iSCSI project provides a high-performance, transport-independent implementation of RFC 3720 iSCSI for Linux. It allows remote access to SCSI targets via TCP/IP over several different transport technologies. This article introduces a new Check_MK service check to monitor the status of Open-iSCSI sessions as well as several statistical metrics on Open-iSCSI sessions and iSCSI hardware initiator hosts.

For the impatient and TL;DR here is the Check_MK package of the Open-iSCSI monitoring checks:

Open-iSCSI monitoring checks (Compatible with Check_MK versions 1.2.8 and later)

The sources are to be found in my Check_MK repository on GitHub.


The Check_MK service check to monitor Open-iSCSI consists of two major parts, an agent plugin and three check plugins.

The first part, a Check_MK agent plugin named open-iscsi, is a simple Bash shell script. It calls the Open-iSCSI administration tool iscsiadm in order to retrieve a list of currently active iSCSI sessions. The exact call to iscsiadm to retrieve the session list is:

/usr/bin/iscsiadm -m session -P 1

If there are any active iSCSI sessions, the open-iscsi agent plugin also tries to collect several statistics for each iSCSI session. This is done by another call to iscsiadm for each iSCSI Session ${SID}, which is shown in the following example:

/usr/bin/iscsiadm -m session -r ${SID} -s

Unfortunately, the iSCSI session statistics are currently only supported for Open-iSCSI software initiators or dependent hardware iSCSI initiators like the Broadcom BCM577xx or BCM578xx adapters which are covered by the bnx2i kernel module. See Debugging Segfaults in Open-iSCSIs iscsiuio on Intel Broadwell and Backporting Open-iSCSI to Debian 8 "Jessie" for additional information on those dependent hardware iSCSI initiators.

For hardware iSCSI initiators, like the QLogic 4000 and QLogic 8200 Series network adapters and iSCSI HBAs, which provide a full iSCSI offload engine (iSOE) implementation in the adapter's firmware, there is currently no support for iSCSI session statistics. Instead, the open-iscsi agent plugin collects several global statistics on each iSOE host ${HST} which is covered by the qla4xxx kernel module with the command shown in the following example:

/usr/bin/iscsiadm -m host -H ${HST} -C stats

The output of the above commands is parsed and reformatted by the agent plugin for easier processing in the check plugins. The following example shows the agent plugin output for a system with two BCM578xx dependent hardware iSCSI initiators:

<<<open-iscsi_sessions>>>
bnx2i 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS> bnx2i.f8:ca:b8:7d:bf:2d eth2 10.0.3.52 LOGGED_IN LOGGED_IN NO_CHANGE
bnx2i 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS> bnx2i.f8:ca:b8:7d:c2:34 eth3 10.0.3.53 LOGGED_IN LOGGED_IN NO_CHANGE

<<<open-iscsi_session_stats>>>
[session stats f8:ca:b8:7d:bf:2d iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS>]
txdata_octets: 40960
rxdata_octets: 461171313
noptx_pdus: 0
scsicmd_pdus: 153967
tmfcmd_pdus: 0
login_pdus: 0
text_pdus: 0
dataout_pdus: 0
logout_pdus: 0
snack_pdus: 0
noprx_pdus: 0
scsirsp_pdus: 153967
tmfrsp_pdus: 0
textrsp_pdus: 0
datain_pdus: 112420
logoutrsp_pdus: 0
r2t_pdus: 0
async_pdus: 0
rjt_pdus: 0
digest_err: 0
timeout_err: 0

[session stats f8:ca:b8:7d:c2:34 iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS>]
txdata_octets: 16384
rxdata_octets: 255666052
noptx_pdus: 0
scsicmd_pdus: 84312
tmfcmd_pdus: 0
login_pdus: 0
text_pdus: 0
dataout_pdus: 0
logout_pdus: 0
snack_pdus: 0
noprx_pdus: 0
scsirsp_pdus: 84312
tmfrsp_pdus: 0
textrsp_pdus: 0
datain_pdus: 62418
logoutrsp_pdus: 0
r2t_pdus: 0
async_pdus: 0
rjt_pdus: 0
digest_err: 0
timeout_err: 0
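
A check plugin has to turn a section like the one above back into per-item counters. The following Python sketch shows one way this could be done. It is an illustration only, not the code of the actual check plugin: the function name parse_stats_section and the item format "<MAC> <IQN>" are assumptions, and the IQN in the sample data is made up.

```python
def parse_stats_section(lines):
    """Group 'key: value' lines under their '[... stats <MAC> <name>]' header.

    Hypothetical helper; returns {"<MAC> <name>": {"counter": int, ...}}.
    """
    parsed = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # Header, e.g. "[session stats <MAC> <IQN>]": skip the two
            # leading words and use the remainder as the item name.
            fields = line[1:-1].split()
            current = " ".join(fields[2:])
            parsed[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            parsed[current][key.strip()] = int(value)
    return parsed


if __name__ == "__main__":
    sample = [
        "[session stats f8:ca:b8:7d:bf:2d iqn.2001-05.com.example:vol1]",
        "txdata_octets: 40960",
        "rxdata_octets: 461171313",
        "digest_err: 0",
    ]
    for item, counters in parse_stats_section(sample).items():
        print(item, counters)
```

Header lines are recognized before the key/value test, so the colons inside the MAC address do not confuse the parser.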

The next example shows the agent plugin output for a system with two QLogic 8200 Series hardware iSCSI initiators:

<<<open-iscsi_sessions>>>
qla4xxx 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-57e572d50-80e0000001458a32-v-sto2-tst-000001 qla4xxx.f8:ca:b8:7d:c1:7d.ipv4.0 none 10.0.3.50 LOGGED_IN Unknown Unknown
qla4xxx 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-57e572d50-80e0000001458a32-v-sto2-tst-000001 qla4xxx.f8:ca:b8:7d:c1:7e.ipv4.0 none 10.0.3.51 LOGGED_IN Unknown Unknown

<<<open-iscsi_host_stats>>>
[host stats f8:ca:b8:7d:c1:7d iqn.2000-04.com.qlogic:isp8214.000e1e3574ac.4]
mactx_frames: 563454
mactx_bytes: 52389948
mactx_multicast_frames: 877513
mactx_broadcast_frames: 0
mactx_pause_frames: 0
mactx_control_frames: 0
mactx_deferral: 0
mactx_excess_deferral: 0
mactx_late_collision: 0
mactx_abort: 0
mactx_single_collision: 0
mactx_multiple_collision: 0
mactx_collision: 0
mactx_frames_dropped: 0
mactx_jumbo_frames: 0
macrx_frames: 1573455
macrx_bytes: 440845678
macrx_unknown_control_frames: 0
macrx_pause_frames: 0
macrx_control_frames: 0
macrx_dribble: 0
macrx_frame_length_error: 0
macrx_jabber: 0
macrx_carrier_sense_error: 0
macrx_frame_discarded: 0
macrx_frames_dropped: 1755017
mac_crc_error: 0
mac_encoding_error: 0
macrx_length_error_large: 0
macrx_length_error_small: 0
macrx_multicast_frames: 0
macrx_broadcast_frames: 0
iptx_packets: 508160
iptx_bytes: 29474232
iptx_fragments: 0
iprx_packets: 401785
iprx_bytes: 354673156
iprx_fragments: 0
ip_datagram_reassembly: 0
ip_invalid_address_error: 0
ip_error_packets: 0
ip_fragrx_overlap: 0
ip_fragrx_outoforder: 0
ip_datagram_reassembly_timeout: 0
ipv6tx_packets: 0
ipv6tx_bytes: 0
ipv6tx_fragments: 0
ipv6rx_packets: 0
ipv6rx_bytes: 0
ipv6rx_fragments: 0
ipv6_datagram_reassembly: 0
ipv6_invalid_address_error: 0
ipv6_error_packets: 0
ipv6_fragrx_overlap: 0
ipv6_fragrx_outoforder: 0
ipv6_datagram_reassembly_timeout: 0
tcptx_segments: 508160
tcptx_bytes: 19310736
tcprx_segments: 401785
tcprx_byte: 346637456
tcp_duplicate_ack_retx: 1
tcp_retx_timer_expired: 1
tcprx_duplicate_ack: 0
tcprx_pure_ackr: 0
tcptx_delayed_ack: 106449
tcptx_pure_ack: 106489
tcprx_segment_error: 0
tcprx_segment_outoforder: 0
tcprx_window_probe: 0
tcprx_window_update: 695915
tcptx_window_probe_persist: 0
ecc_error_correction: 0
iscsi_pdu_tx: 401697
iscsi_data_bytes_tx: 29225
iscsi_pdu_rx: 401697
iscsi_data_bytes_rx: 327355963
iscsi_io_completed: 101
iscsi_unexpected_io_rx: 0
iscsi_format_error: 0
iscsi_hdr_digest_error: 0
iscsi_data_digest_error: 0
iscsi_sequence_error: 0

[host stats f8:ca:b8:7d:c1:7e iqn.2000-04.com.qlogic:isp8214.000e1e3574ad.5]
mactx_frames: 563608
mactx_bytes: 52411412
mactx_multicast_frames: 877517
mactx_broadcast_frames: 0
mactx_pause_frames: 0
mactx_control_frames: 0
mactx_deferral: 0
mactx_excess_deferral: 0
mactx_late_collision: 0
mactx_abort: 0
mactx_single_collision: 0
mactx_multiple_collision: 0
mactx_collision: 0
mactx_frames_dropped: 0
mactx_jumbo_frames: 0
macrx_frames: 1573572
macrx_bytes: 441630442
macrx_unknown_control_frames: 0
macrx_pause_frames: 0
macrx_control_frames: 0
macrx_dribble: 0
macrx_frame_length_error: 0
macrx_jabber: 0
macrx_carrier_sense_error: 0
macrx_frame_discarded: 0
macrx_frames_dropped: 1755017
mac_crc_error: 0
mac_encoding_error: 0
macrx_length_error_large: 0
macrx_length_error_small: 0
macrx_multicast_frames: 0
macrx_broadcast_frames: 0
iptx_packets: 508310
iptx_bytes: 29490504
iptx_fragments: 0
iprx_packets: 401925
iprx_bytes: 355436636
iprx_fragments: 0
ip_datagram_reassembly: 0
ip_invalid_address_error: 0
ip_error_packets: 0
ip_fragrx_overlap: 0
ip_fragrx_outoforder: 0
ip_datagram_reassembly_timeout: 0
ipv6tx_packets: 0
ipv6tx_bytes: 0
ipv6tx_fragments: 0
ipv6rx_packets: 0
ipv6rx_bytes: 0
ipv6rx_fragments: 0
ipv6_datagram_reassembly: 0
ipv6_invalid_address_error: 0
ipv6_error_packets: 0
ipv6_fragrx_overlap: 0
ipv6_fragrx_outoforder: 0
ipv6_datagram_reassembly_timeout: 0
tcptx_segments: 508310
tcptx_bytes: 19323952
tcprx_segments: 401925
tcprx_byte: 347398136
tcp_duplicate_ack_retx: 2
tcp_retx_timer_expired: 4
tcprx_duplicate_ack: 0
tcprx_pure_ackr: 0
tcptx_delayed_ack: 106466
tcptx_pure_ack: 106543
tcprx_segment_error: 0
tcprx_segment_outoforder: 0
tcprx_window_probe: 0
tcprx_window_update: 696035
tcptx_window_probe_persist: 0
ecc_error_correction: 0
iscsi_pdu_tx: 401787
iscsi_data_bytes_tx: 37970
iscsi_pdu_rx: 401791
iscsi_data_bytes_rx: 328112050
iscsi_io_completed: 127
iscsi_unexpected_io_rx: 0
iscsi_format_error: 0
iscsi_hdr_digest_error: 0
iscsi_data_digest_error: 0
iscsi_sequence_error: 0

Although a simple Bash shell script, the agent plugin open-iscsi has several dependencies which need to be installed in order for the agent plugin to work properly. Namely, those are the commands iscsiadm, sed, tr and egrep. On Debian-based systems, the necessary packages can be installed with the following command:

root@host:~# apt-get install coreutils grep open-iscsi sed

The second part of the Check_MK service check for Open-iSCSI provides the necessary check logic through individual inventory and check functions. This is implemented in the three Check_MK check plugins open-iscsi_sessions, open-iscsi_host_stats and open-iscsi_session_stats, which will be discussed separately in the following sections.

Open-iSCSI Session Status

The check plugin open-iscsi_sessions is responsible for the monitoring of individual iSCSI sessions and their internal session states. Upon inventory this check plugin creates a service check for each pair of iSCSI network interface name and IQN of the iSCSI target volume. Unlike the iSCSI session ID, which changes over time (e.g. after iSCSI logout and login), this pair uniquely identifies an iSCSI session on a host. During normal check execution, the list of currently active iSCSI sessions on a host is compared to the list of active iSCSI sessions gathered during inventory on that host. If a session is missing or if the session has an erroneous internal state, an alarm is raised accordingly.
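
The inventory and the comparison against the currently active sessions could look roughly like the following Python sketch. It assumes the field order of the open-iscsi_sessions agent section shown earlier (driver, portal, IQN, interface name, network device, IP address and the three states). The function names and the item format "<interface name> <IQN>" are hypothetical, and the IQNs in the sample data are made up.

```python
def session_items(lines):
    """Map '<interface name> <IQN>' items to their three state fields."""
    items = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 9:
            continue  # not a complete session line
        iqn, iface = fields[2], fields[3]
        items["%s %s" % (iface, iqn)] = fields[6:9]
    return items


def missing_sessions(inventoried, current):
    """Return the inventoried items that are no longer reported."""
    return sorted(set(inventoried) - set(current))


if __name__ == "__main__":
    at_inventory = session_items([
        "bnx2i 10.0.3.4:3260,1 iqn.2001-05.com.example:vol1 "
        "bnx2i.f8:ca:b8:7d:bf:2d eth2 10.0.3.52 LOGGED_IN LOGGED_IN NO_CHANGE",
        "bnx2i 10.0.3.4:3260,1 iqn.2001-05.com.example:vol1 "
        "bnx2i.f8:ca:b8:7d:c2:34 eth3 10.0.3.53 LOGGED_IN LOGGED_IN NO_CHANGE",
    ])
    now = session_items([
        "bnx2i 10.0.3.4:3260,1 iqn.2001-05.com.example:vol1 "
        "bnx2i.f8:ca:b8:7d:bf:2d eth2 10.0.3.52 LOGGED_IN LOGGED_IN NO_CHANGE",
    ])
    print(missing_sessions(at_inventory, now))
```

Because the item is built from the interface name and the IQN rather than from the session ID, a session that logs out and back in maps to the same service check.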

For all types of initiators – software, dependent hardware and hardware – there is the state session_state which can take on the following values:

ISCSI_STATE_FREE
ISCSI_STATE_LOGGED_IN
ISCSI_STATE_FAILED
ISCSI_STATE_TERMINATE
ISCSI_STATE_IN_RECOVERY
ISCSI_STATE_RECOVERY_FAILED
ISCSI_STATE_LOGGING_OUT

An alarm is raised if the session is in any state other than ISCSI_STATE_LOGGED_IN. For software and dependent hardware initiators there are two additional states – connection_state and internal_state. The state connection_state can take on the values:

FREE
TRANSPORT WAIT
IN LOGIN
LOGGED IN
IN LOGOUT
LOGOUT REQUESTED
CLEANUP WAIT

and internal_state can take on the values:

NO CHANGE
CLEANUP
REOPEN
REDIRECT

In addition to the above session_state, an alarm is raised if the connection_state is in any state other than LOGGED IN and internal_state is in any state other than NO CHANGE.
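
The state evaluation can be sketched in Python as shown below. Several points are assumptions rather than facts taken from the check plugin: the state values are spelled with underscores as in the agent output, hardware initiators report Unknown for the two additional states (as in the QLogic example output), each additional state is evaluated on its own, and every deviation is mapped to the Nagios state CRITICAL (2). The function name evaluate_session is hypothetical.

```python
OK, CRIT = 0, 2


def evaluate_session(session_state, connection_state, internal_state):
    """Return (state, problems) for one iSCSI session.

    Assumption: any deviation from the expected values counts as CRIT.
    """
    problems = []
    if session_state != "LOGGED_IN":
        problems.append("session_state is %s" % session_state)
    # Assumption: hardware initiators report "Unknown" here, which is
    # treated as "not available" instead of as an error.
    if connection_state not in ("Unknown", "LOGGED_IN"):
        problems.append("connection_state is %s" % connection_state)
    if internal_state not in ("Unknown", "NO_CHANGE"):
        problems.append("internal_state is %s" % internal_state)
    return (CRIT if problems else OK), problems


if __name__ == "__main__":
    print(evaluate_session("LOGGED_IN", "LOGGED_IN", "NO_CHANGE"))
    print(evaluate_session("LOGGED_IN", "Unknown", "Unknown"))
    print(evaluate_session("FAILED", "IN_LOGIN", "REOPEN"))
```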

No performance data is currently reported by this check.

Open-iSCSI Hosts Statistics

The check plugin open-iscsi_host_stats is responsible for the monitoring of the global statistics on an iSOE host. Upon inventory this check plugin creates a service check for each pair of MAC address and iSCSI network interface name. During normal check execution, an extensive list of statistics – see the above example output of the Check_MK agent plugin – is determined for each inventoried item. If the rate of one of the statistics values is above the configured warning or critical threshold value, an alarm is raised accordingly. For all statistics, performance data is reported by the check.

With the additional WATO plugin open-iscsi_host_stats.py it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values. The default values for all statistics are a rate of zero (0) units per second for both warning and critical thresholds. The configuration options for the iSOE host statistics levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Storage, Filesystems and Files
         -> Open-iSCSI Host Statistics
            -> Create Rule in Folder ...
               -> The levels for the Open-iSCSI host statistics values
                  [x] The levels for the number of transmitted MAC/Layer2 frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 bytes on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 bytes on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 multicast frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 multicast frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 broadcast frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 broadcast frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 pause frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 pause frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 control frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 control frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 dropped frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 dropped frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 deferral frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 deferral frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 abort frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 jumbo frames on an iSOE host.
                  [x] The levels for the number of MAC/Layer2 late transmit collisions on an iSOE host.
                  [x] The levels for the number of MAC/Layer2 single transmit collisions on an iSOE host.
                  [x] The levels for the number of MAC/Layer2 multiple transmit collisions on an iSOE host.
                  [x] The levels for the number of MAC/Layer2 collisions on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 control frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 dribble on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 frame length errors on an iSOE host.
                  [x] The levels for the number of discarded received MAC/Layer2 frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 jabber on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 carrier sense errors on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 CRC errors on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 encoding errors on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 length too large errors on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 length too small errors on an iSOE host.
                  [x] The levels for the number of transmitted IP packets on an iSOE host.
                  [x] The levels for the number of received IP packets on an iSOE host.
                  [x] The levels for the number of transmitted IP bytes on an iSOE host.
                  [x] The levels for the number of received IP bytes on an iSOE host.
                  [x] The levels for the number of transmitted IP fragments on an iSOE host.
                  [x] The levels for the number of received IP fragments on an iSOE host.
                  [x] The levels for the number of IP datagram reassemblies on an iSOE host.
                  [x] The levels for the number of IP invalid address errors on an iSOE host.
                  [x] The levels for the number of IP packet errors on an iSOE host.
                  [x] The levels for the number of IP fragmentation overlaps on an iSOE host.
                  [x] The levels for the number of IP fragmentation out-of-order on an iSOE host.
                  [x] The levels for the number of IP datagram reassembly timeouts on an iSOE host.
                  [x] The levels for the number of transmitted IPv6 packets on an iSOE host.
                  [x] The levels for the number of received IPv6 packets on an iSOE host.
                  [x] The levels for the number of transmitted IPv6 bytes on an iSOE host.
                  [x] The levels for the number of received IPv6 bytes on an iSOE host.
                  [x] The levels for the number of transmitted IPv6 fragments on an iSOE host.
                  [x] The levels for the number of received IPv6 fragments on an iSOE host.
                  [x] The levels for the number of IPv6 datagram reassemblies on an iSOE host.
                  [x] The levels for the number of IPv6 invalid address errors on an iSOE host.
                  [x] The levels for the number of IPv6 packet errors on an iSOE host.
                  [x] The levels for the number of IPv6 fragmentation overlaps on an iSOE host.
                  [x] The levels for the number of IPv6 fragmentation out-of-order on an iSOE host.
                  [x] The levels for the number of IPv6 datagram reassembly timeouts on an iSOE host.
                  [x] The levels for the number of transmitted TCP segments on an iSOE host.
                  [x] The levels for the number of received TCP segments on an iSOE host.
                  [x] The levels for the number of transmitted TCP bytes on an iSOE host.
                  [x] The levels for the number of received TCP bytes on an iSOE host.
                  [x] The levels for the number of duplicate TCP ACK retransmits on an iSOE host.
                  [x] The levels for the number of received TCP retransmit timer expiries on an iSOE host.
                  [x] The levels for the number of received TCP duplicate ACKs on an iSOE host.
                  [x] The levels for the number of received TCP pure ACKs on an iSOE host.
                  [x] The levels for the number of transmitted TCP delayed ACKs on an iSOE host.
                  [x] The levels for the number of transmitted TCP pure ACKs on an iSOE host.
                  [x] The levels for the number of received TCP segment errors on an iSOE host.
                  [x] The levels for the number of received TCP segment out-of-order on an iSOE host.
                  [x] The levels for the number of received TCP window probe on an iSOE host.
                  [x] The levels for the number of received TCP window update on an iSOE host.
                  [x] The levels for the number of transmitted TCP window probe persist on an iSOE host.
                  [x] The levels for the number of transmitted iSCSI PDUs on an iSOE host.
                  [x] The levels for the number of received iSCSI PDUs on an iSOE host.
                  [x] The levels for the number of transmitted iSCSI Bytes on an iSOE host.
                  [x] The levels for the number of received iSCSI Bytes on an iSOE host.
                  [x] The levels for the number of iSCSI I/Os completed on an iSOE host.
                  [x] The levels for the number of iSCSI unexpected I/Os on an iSOE host.
                  [x] The levels for the number of iSCSI format errors on an iSOE host.
                  [x] The levels for the number of iSCSI header digest (CRC) errors on an iSOE host.
                  [x] The levels for the number of iSCSI data digest (CRC) errors on an iSOE host.
                  [x] The levels for the number of iSCSI sequence errors on an iSOE host.
                  [x] The levels for the number of ECC error corrections on an iSOE host.
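
How a counter is turned into a rate and compared against such levels can be sketched in Python as follows. This is a simplified illustration under stated assumptions and not the code of the check plugin, which presumably relies on Check_MK's own counter handling. The function names counter_rate and check_rate are hypothetical, and the defaults of zero units per second are taken from the description above.

```python
def counter_rate(prev_time, prev_value, now_time, now_value):
    """Return the per-second rate between two counter samples.

    Returns None if no rate can be computed, e.g. after a counter reset.
    """
    elapsed = now_time - prev_time
    if elapsed <= 0 or now_value < prev_value:
        return None
    return (now_value - prev_value) / float(elapsed)


def check_rate(rate, warn=0.0, crit=0.0):
    """Map a rate to a Nagios state: 0 = OK, 1 = WARNING, 2 = CRITICAL."""
    if rate > crit:
        return 2
    if rate > warn:
        return 1
    return 0


if __name__ == "__main__":
    # Hypothetical samples of one counter, taken 60 seconds apart.
    rate = counter_rate(0, 100, 60, 160)
    print(rate, check_rate(rate))
    print(check_rate(rate, warn=5.0, crit=10.0))
```

With both default levels at zero, any non-zero rate immediately results in a CRITICAL state. For pure traffic counters it is therefore worth configuring sensible levels through the WATO rule.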

The following image shows a status output example from the WATO WebUI with several open-iscsi_sessions (iSCSI Session Status) and open-iscsi_host_stats (iSCSI Host Stats) service checks over two QLogic 8200 Series hardware iSCSI initiators:

Status output example for open-iscsi_sessions and open-iscsi_host_stats service checks over QLogic 8200 Series hardware iSCSI initiators

This example shows six iSCSI Session Status service check items, which are pairs of iSCSI network interface names and – here anonymized – IQNs of the iSCSI target volumes. For each item the current session_state – in this example LOGGED_IN – is shown. There are also two iSCSI Host Stats service check items in the example, which are pairs of MAC addresses and iSCSI network interface names. For each of those items the current throughput rate on the MAC, IP/IPv6, TCP and iSCSI protocol layer is shown. The throughput rate on the MAC protocol layer is also visualized in the Perf-O-Meter, received traffic growing from the middle to the left, transmitted traffic growing from the middle to the right.

The following three images show examples of the PNP4Nagios graphs for the open-iscsi_host_stats (iSCSI Host Stats) service check.

Example PNP4Nagios graphs for an open-iscsi_host_stats service check (MAC Frames, Traffic, MAC Errors)

The upper graph shows the throughput rate for various frame types on the MAC protocol layer. The middle graph shows a combined view of the throughput rate for received and transmitted traffic on the different MAC, IP/IPv6, TCP and iSCSI protocol layers. The lower graph shows the rate for various error frame types on the MAC protocol layer.

Example PNP4Nagios graphs for an open-iscsi_host_stats service check (IP Packets and Fragments, IP Errors, TCP Segments)

The upper graph shows the throughput rate for received and transmitted traffic on the IP/IPv6 protocol layer. The middle graph shows the rate for various error packet types on the IP/IPv6 protocol layer. The lower graph shows the throughput rate for received and transmitted traffic on the TCP protocol layer.

Example PNP4Nagios graphs for an open-iscsi_host_stats service check (TCP Errors, ECC Error Correction, iSCSI PDUs, iSCSI Errors)

The first graph shows the rate for various protocol control and error segment types on the TCP protocol layer. The second graph shows the rate of ECC error corrections that occurred on the QLogic 8200 Series hardware iSCSI initiator. The third graph shows the throughput rate for received and transmitted traffic on the iSCSI protocol layer. The fourth and last graph shows the rate for various control and error PDUs on the iSCSI protocol layer.

Open-iSCSI Session Statistics

The check plugin open-iscsi_session_stats is responsible for the monitoring of the statistics on individual iSCSI sessions. Upon inventory this check plugin creates a service check for each pair of MAC address of the network interface and IQN of the iSCSI target volume. During normal check execution, an extensive list of statistics – see the above example output of the Check_MK agent plugin – is collected for each inventoried item. If the rate of one of the statistics values is above the configured warning or critical threshold value, an alarm is raised accordingly. For all statistics, performance data is reported by the check.

With the additional WATO plugin open-iscsi_session_stats.py it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values. The default values for all statistics are a rate of zero (0) units per second for both warning and critical thresholds. The configuration options for the iSCSI session statistics levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Storage, Filesystems and Files
         -> Open-iSCSI Session Statistics
            -> Create Rule in Folder ...
               -> The levels for the Open-iSCSI session statistics values
                  [x] The levels for the number of transmitted bytes in an Open-iSCSI session
                  [x] The levels for the number of received bytes in an Open-iSCSI session
                  [x] The levels for the number of digest (CRC) errors in an Open-iSCSI session
                  [x] The levels for the number of timeout errors in an Open-iSCSI session
                  [x] The levels for the number of transmitted NOP commands in an Open-iSCSI session
                  [x] The levels for the number of received NOP commands in an Open-iSCSI session
                  [x] The levels for the number of transmitted SCSI command requests in an Open-iSCSI session
                  [x] The levels for the number of received SCSI command reponses in an Open-iSCSI session
                  [x] The levels for the number of transmitted task management function commands in an Open-iSCSI session
                  [x] The levels for the number of received task management function responses in an Open-iSCSI session
                  [x] The levels for the number of transmitted login requests in an Open-iSCSI session
                  [x] The levels for the number of transmitted logout requests in an Open-iSCSI session
                  [x] The levels for the number of received logout responses in an Open-iSCSI session
                  [x] The levels for the number of transmitted text PDUs in an Open-iSCSI session
                  [x] The levels for the number of received text PDUs in an Open-iSCSI session
                  [x] The levels for the number of transmitted data PDUs in an Open-iSCSI session
                  [x] The levels for the number of received data PDUs in an Open-iSCSI session
                  [x] The levels for the number of transmitted single negative ACKs in an Open-iSCSI session
                  [x] The levels for the number of received ready to transfer PDUs in an Open-iSCSI session
                  [x] The levels for the number of received reject PDUs in an Open-iSCSI session
                  [x] The levels for the number of received asynchronous messages in an Open-iSCSI session

The following image shows a status output example from the WATO WebUI with several open-iscsi_sessions (iSCSI Session Status) and open-iscsi_session_stats (iSCSI Session Stats) service checks over two BCM578xx dependent hardware iSCSI initiators:

Status output example for open-iscsi_sessions and open-iscsi_session_stats service checks over BCM578xx dependent hardware iSCSI initiators

This example shows six iSCSI Session Status service check items, which are pairs of iSCSI network interface names and – here anonymized – IQNs of the iSCSI target volumes. For each item the current session_state, connection_state and internal_state – in this example with the respective values LOGGED_IN, LOGGED_IN and NO_CHANGE – are shown. There is also an equal number of iSCSI Session Stats service check items in the example, which are pairs of MAC addresses of the network interfaces and IQNs of the iSCSI target volumes. For each of those items the current throughput rate of the individual iSCSI session is shown. As long as the rate of the digest (CRC) and timeout error counters is zero, the string no protocol errors is displayed. Otherwise the name and rate of any non-zero error counter is shown. The throughput rate of the iSCSI session is also visualized in the Perf-O-Meter, received traffic growing from the middle to the left, transmitted traffic growing from the middle to the right.
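
The error part of the status line described above can be sketched in Python like this. The function name error_summary and the exact output format for non-zero counters are assumptions made for illustration. Only the string "no protocol errors" and the counter names digest_err and timeout_err are taken from the text and the agent output.

```python
def error_summary(error_rates):
    """Summarize error counter rates for the status line.

    error_rates maps counter names to rates in units per second.
    """
    nonzero = [(name, rate) for name, rate in sorted(error_rates.items())
               if rate > 0]
    if not nonzero:
        return "no protocol errors"
    # Assumed format: "<counter name>: <rate>/s", comma separated.
    return ", ".join("%s: %.2f/s" % (name, rate) for name, rate in nonzero)


if __name__ == "__main__":
    print(error_summary({"digest_err": 0.0, "timeout_err": 0.0}))
    print(error_summary({"digest_err": 0.5, "timeout_err": 0.0}))
```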

The following image shows an example of the three PNP4Nagios graphs for a single open-iscsi_session_stats (iSCSI Session Stats) service check.

Example PNP4Nagios graphs for a single open-iscsi_session_stats service check

The upper graph shows the throughput rate for received and transmitted traffic of the iSCSI session. The middle graph shows the rate for received and transmitted iSCSI PDUs, broken down by the different types of PDUs on the iSCSI protocol layer. The lower graph shows the rate for the digest (CRC) and timeout errors on the iSCSI protocol layer.

The described Check_MK service check to monitor the status of Open-iSCSI sessions, Open-iSCSI session metrics and iSCSI hardware initiator host metrics has been verified to work with version 2.0.874-2~bpo8+1 of the open-iscsi package from the backports repository of Debian stable (Jessie) on the client side and the Check_MK versions 1.2.6 and 1.2.8 on the server side.

I hope you find the provided new check useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.

// Check_MK Monitoring - XFS Filesystem Quotas

XFS, like most other modern filesystems, offers the ability to configure and use disk quotas. Usually those quotas limit the amount of allocatable disk space or the number of filesystem objects. In most filesystems those limits are bound to users or groups of users. XFS is one of the few filesystems which offer quota limits on a per-directory basis. In the case of XFS, directory quotas are implemented via projects. This article introduces a new Check_MK service check to monitor the use of XFS filesystem quotas with a strong focus on directory-based quotas.

For the impatient and TL;DR here is the Check_MK package of the XFS filesystem quota monitoring checks:

XFS filesystem quota monitoring checks (Compatible with Check_MK versions 1.2.6 and earlier)
XFS filesystem quota monitoring checks (Compatible with Check_MK versions 1.2.8 and later)

The sources can be found in my Check_MK repository on GitHub.


The Check_MK service check to monitor XFS filesystem quotas consists of two major parts, an agent plugin and a check plugin.

The Check_MK agent plugin named xfs_quota is a simple Bash shell script. It calls the XFS quota administration tool xfs_quota in order to retrieve a report of the current quota usage. The exact call to xfs_quota is:

/usr/sbin/xfs_quota -x -c 'report -p -b -i -a'

which currently reports only the block and inode quotas of projects or directories on all available XFS filesystems. An example output of the above command is:

Project quota on /srv/xfs (/dev/mapper/vg00-xfs)
                               Blocks                                          Inodes                     
Project ID       Used       Soft       Hard    Warn/Grace           Used       Soft       Hard    Warn/ Grace     
---------- -------------------------------------------------- -------------------------------------------------- 
test1               0          0       1024     00 [--------]          1          0          0     00 [--------]
test2               0          0       2048     00 [--------]          1          0          0     00 [--------]
test3               0          0       3072     00 [--------]          1          0          0     00 [--------]

The output of the above command is parsed and reformatted by the agent plugin for easier processing in the check plugin. The above example output would thus be transformed into the agent plugin output shown in the following example:

<<<xfs_quota>>>
/srv/xfs:/dev/mapper/vg00-xfs:test1:0:0:1024:1:0:0
/srv/xfs:/dev/mapper/vg00-xfs:test2:0:0:2048:1:0:0
/srv/xfs:/dev/mapper/vg00-xfs:test3:0:0:3072:1:0:0
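
To make the transformation more tangible, here is a minimal sketch of it in Python. Note that the actual agent plugin is a Bash script built on sed and egrep. The code below is not that script, only an illustration of the same reformatting step. The function name reformat_report and the two regular expressions are assumptions made for this example, and the field order is taken from the two example outputs shown above.

```python
import re

# Section header line, e.g. "Project quota on /srv/xfs (/dev/mapper/vg00-xfs)"
HEADER = re.compile(r"^Project quota on (\S+) \((\S+)\)$")

# Data row: project ID, then used/soft/hard/warn/[grace] for blocks and inodes.
# The warn and grace columns are matched but not captured.
ROW = re.compile(
    r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+\d+\s+\[[^\]]*\]"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s+\d+\s+\[[^\]]*\]\s*$"
)


def reformat_report(report):
    """Turn the text of an xfs_quota project report into colon-separated lines."""
    lines = []
    mountpoint = device = None
    for raw in report.splitlines():
        line = raw.rstrip()
        header = HEADER.match(line)
        if header:
            # Remember the filesystem that the following rows belong to.
            mountpoint, device = header.groups()
            continue
        row = ROW.match(line)
        if row and mountpoint is not None:
            lines.append(":".join((mountpoint, device) + row.groups()))
    return lines


if __name__ == "__main__":
    sample = (
        "Project quota on /srv/xfs (/dev/mapper/vg00-xfs)\n"
        "test1  0  0  1024  00 [--------]  1  0  0  00 [--------]\n"
    )
    print("<<<xfs_quota>>>")
    for entry in reformat_report(sample):
        print(entry)
```

The column header and separator lines of the report are skipped simply because they do not match the row pattern, so no explicit filtering is needed.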

Although a simple Bash shell script, the agent plugin xfs_quota has several dependencies which need to be installed in order for the agent plugin to work properly. Namely those are the commands xfs_quota, sed and egrep. On Debian based systems, the necessary packages can be installed with the following command:

root@host:~# apt-get install xfsprogs sed grep

The second part, a Check_MK check plugin also named xfs_quota, provides the necessary inventory and check functions. Upon inventory it creates a service check for each pair of XFS filesystem mountpoint and quota project ID. During normal check execution, the numbers of used XFS filesystem blocks and inodes are determined for each inventoried item (pair of XFS filesystem mountpoint and quota project ID). If the hard and soft, block and inode quotas for a particular item are all set to zero, no further checks are carried out and only performance data is reported by the check. If any one of the hard or soft, block or inode quotas is set to a non-zero value, the number of remaining free XFS filesystem blocks or inodes is compared to the warning and critical threshold values and an alarm is raised accordingly.
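
The decision logic described above can be sketched roughly as follows. This is a simplified stand-alone illustration in Python and not the code of the actual check plugin: the names parse_line, evaluate and DEFAULT_LEVELS, the item format and the status texts are assumptions. The sketch also assumes that a state is raised as soon as the number of free blocks or inodes drops to or below the respective threshold, and that neither the mountpoint nor the project ID contains a colon.

```python
# Assumed (warn, crit) levels, expressed as remaining free blocks or inodes.
DEFAULT_LEVELS = {
    "blocks_hard": (0, 0),
    "blocks_soft": (0, 0),
    "inodes_hard": (0, 0),
    "inodes_soft": (0, 0),
}


def parse_line(line):
    """Split one colon-separated agent line into an item name and its counters."""
    mountpoint, _device, project, *counters = line.split(":")
    b_used, b_soft, b_hard, i_used, i_soft, i_hard = (int(c) for c in counters)
    item = "%s:%s" % (mountpoint, project)  # hypothetical item naming
    return item, {
        "blocks_used": b_used, "blocks_soft": b_soft, "blocks_hard": b_hard,
        "inodes_used": i_used, "inodes_soft": i_soft, "inodes_hard": i_hard,
    }


def evaluate(quota, levels=DEFAULT_LEVELS):
    """Return (state, text) with state 0 = OK, 1 = WARN, 2 = CRIT."""
    limits = {
        "blocks_hard": (quota["blocks_used"], quota["blocks_hard"]),
        "blocks_soft": (quota["blocks_used"], quota["blocks_soft"]),
        "inodes_hard": (quota["inodes_used"], quota["inodes_hard"]),
        "inodes_soft": (quota["inodes_used"], quota["inodes_soft"]),
    }
    # All four quotas unset: nothing to compare, only performance data.
    if all(limit == 0 for _used, limit in limits.values()):
        return 0, "no quota set, performance data only"
    state, details = 0, []
    for name, (used, limit) in limits.items():
        if limit == 0:
            continue  # this particular quota is not configured
        free = limit - used
        warn, crit = levels[name]
        if free <= crit:
            state = max(state, 2)
        elif free <= warn:
            state = max(state, 1)
        details.append("%s: %d free" % (name, free))
    return state, ", ".join(details)


if __name__ == "__main__":
    item, quota = parse_line("/srv/xfs:/dev/mapper/vg00-xfs:test1:0:0:1024:1:0:0")
    print(item, evaluate(quota))
```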

With the additional WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values. The default values for blocks_hard and blocks_soft are zero (0) free blocks for both warning and critical thresholds. The default values for inodes_hard and inodes_soft are zero (0) free inodes for both warning and critical thresholds. The configuration options for the free block or inode levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Storage, Filesystems and Files
         -> XFS Quota Utilization
            -> Create Rule in Folder ...
               -> The levels for the soft/hard block/inode quotas on XFS filesystems
                  [x] The levels for the hard block quotas on XFS filesystems
                  [x] The levels for the soft block quotas on XFS filesystems
                  [x] The levels for the hard inode quotas on XFS filesystems
                  [x] The levels for the soft inode quotas on XFS filesystems 

The following image shows a status output example for several xfs_quota service checks from the WATO WebUI:

Status output example for xfs_quota service checks

This example shows several service check items, which again are pairs of XFS filesystem mountpoints (here: /backup) and anonymized quota project IDs. For each item the numbers of blocks and inodes used are shown along with the corresponding hard and soft quota values. The numbers of blocks and inodes used are also visualized in the Perf-O-Meter, with blocks growing from the middle to the left and inodes growing from the middle to the right, both on a logarithmic scale.

The following image shows an example of the two PNP4Nagios graphs for a single service check:

Example PNP4Nagios graph for a single xfs_quota service check

The upper graph shows the number of used XFS filesystem blocks for a pair of XFS filesystem mountpoint (here: /backup) and an (again anonymized) quota project ID. The lower graph shows the number of used inodes for the same pair. Both graphs show the warning and critical threshold values, which in this example are at their default value of zero. If configured, as in the upper graph of the example, block or inode quotas are also shown as blue horizontal lines in the respective graphs.

The described Check_MK service check to monitor XFS filesystem quotas has been verified to work with version 3.2.1 of the xfsprogs package on Debian stable (Jessie) on the client side and the Check_MK versions 1.2.6 and 1.2.8 on the server side.

I hope you find the provided new check useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.
