bityard Blog

// Check_MK Monitoring - Brocade / Broadcom Fibre Channel Switches

This article provides patches for the standard Check_MK distribution in order to fix and enhance the support for the monitoring of Brocade / Broadcom Fibre Channel Switches.


The standard Check_MK distribution already ships with monitoring support for Brocade / Broadcom Fibre Channel Switches. Unfortunately there are several issues in the check brocade_fcport, which is used to monitor the status and several metrics of the fibre channel switch ports. Those issues prevent the check from working as intended and are fixed in the version provided in this article. Up until recently, the standard Check_MK distribution also had no support for monitoring CPU and memory metrics on a switch level or SFP metrics on a port level. This article introduces the new checks brocade_cpu, brocade_mem and brocade_sfp to cover those metrics. The most recent upstream version 1.5.x of the Check_MK distribution now includes the standard checks brocade_sys (covering CPU and memory metrics) and brocade_sfp (covering the SFP port metrics), which provide basically the same functionality.

For the impatient, here is the enhanced version of the brocade_fcport check:

Enhanced version of the brocade_fcport check

And the new brocade_cpu, brocade_mem and brocade_sfp checks:

The new brocade_cpu check
The new brocade_mem check
The new brocade_sfp check

The sources to the new and enhanced versions of all the checks can be found in my Check_MK Plugins repository on GitHub.

Additional Checks

CPU Usage

The Check_MK service check brocade_cpu monitors the current CPU utilization of Brocade / Broadcom fibre channel switches. It uses the SNMP OID swCpuUsage from the fibre channel switch MIB (SW-MIB) in order to create one service check for each CPU found in the system. The current CPU utilization in percent is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the standard WATO plugin for CPU utilization it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 80%; critical: 90% of CPU utilization).
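
The threshold comparison described above can be sketched roughly as follows. This is a minimal illustration and not the actual brocade_cpu source: the function names, the perfdata tuple layout and the use of "greater than or equal" for the comparison are assumptions, only the default levels of 80% and 90% are taken from the text.

```python
# Hypothetical sketch of an upper-level check, not the brocade_cpu source.
DEFAULT_CPU_LEVELS = (80.0, 90.0)  # warning, critical in percent (defaults from the text)


def evaluate_upper_levels(value, warn, crit):
    """Return a Nagios-style state: 0 = OK, 1 = WARN, 2 = CRIT."""
    if value >= crit:
        return 2
    if value >= warn:
        return 1
    return 0


def check_cpu(usage_percent, levels=DEFAULT_CPU_LEVELS):
    """Compare a CPU utilization value (percent) against warn/crit levels."""
    warn, crit = levels
    state = evaluate_upper_levels(usage_percent, warn, crit)
    text = "CPU utilization: %.1f%% (warn/crit at %.1f%%/%.1f%%)" % (
        usage_percent, warn, crit)
    # Assumed perfdata layout: (name, value, warn, crit, min, max)
    perfdata = [("util", usage_percent, warn, crit, 0, 100)]
    return state, text, perfdata
```

Passing a different tuple as levels mimics what a WATO rule would do when it overrides the defaults.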

The following image shows a status output example for the brocade_cpu service check from the WATO WebUI:

Status output example for the new brocade_cpu service check

This example shows the current CPU utilization of the system in percent.

The following image shows an example of the service metrics graph for the brocade_cpu service check:

Example service metrics graph for the new brocade_cpu service check

The selected example graph shows the current CPU utilization of the system in percent as well as the default warning and critical threshold values.

Memory Usage

The Check_MK service check brocade_mem monitors the current memory (RAM) usage on Brocade / Broadcom fibre channel switches. It uses the SNMP OID swMemUsage from the fibre channel switch MIB (SW-MIB) in order to create a service check for the overall memory usage on the system. The amount of currently used memory is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 80%; critical: 90% of used memory). The configuration options for the used memory levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Operating System Resources
         -> Brocade Fibre Channel Memory Usage
            -> Create rule in folder ...
               [x] Levels for memory usage

The following image shows a status output example for the brocade_mem service check from the WATO WebUI:

Status output example for the new brocade_mem service check

This example shows the current memory utilization of the system in percent.

The following image shows an example of the service metrics graph for the brocade_mem service check:

Example service metrics graph for the new brocade_mem service check

The selected example graph shows the current memory utilization of the system in percent as well as the default warning and critical threshold values.

SFP Health

The Check_MK service check brocade_sfp monitors several metrics of the SFPs in all enabled and active ports of Brocade / Broadcom fibre channel switches. It uses several SNMP OIDs from the fibre channel switch MIB (SW-MIB) and from the Fabric Alliance Extension MIB (FA-EXT-MIB) in order to create one service check for each SFP found in an enabled and active port on the system. The OIDs from the fibre channel switch MIB (swFCPortSpecifier, swFCPortName, swFCPortPhyState, swFCPortOpStatus, swFCPortAdmStatus) are used to determine the number, name and status of the switch port. The OIDs from the Fabric Alliance Extension MIB are used to determine the actual SFP metrics. Those metrics are the SFP temperature (swSfpTemperature), voltage (swSfpVoltage) and current (swSfpCurrent), as well as the optical receive and transmit power (swSfpRxPower and swSfpTxPower) of the SFP. Unfortunately the SFP metrics are not available on all Brocade / Broadcom switch hardware platforms and FabricOS versions. This check was verified to work with Gen6 hardware and FabricOS v8. Another limitation is that the SFP metrics are not available in real-time, but are only gathered by the switch from all of its SFPs in a 5 minute interval. This is most likely a precaution to avoid overloading the processing capacity of the SFPs with too many status requests.

The current temperature level as well as the optical receive and transmit power levels are compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values:

Metric                  Warning Threshold  Critical Threshold
System temperature      55°C               65°C
Optical receive power   -7.0 dBm           -9.0 dBm
Optical transmit power  -2.0 dBm           -3.0 dBm
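
As a rough sketch of how these defaults could be evaluated: the temperature is treated as an upper level, while the optical power values are treated as lower levels, since their critical thresholds lie below the warning thresholds. This reading of the table, as well as all function and key names, are assumptions for illustration and are not taken from the brocade_sfp source.

```python
# Hypothetical sketch, not the brocade_sfp source. Default values from the table.
DEFAULT_SFP_LEVELS = {
    "temp": (55.0, 65.0),      # upper levels in degrees Celsius (warn, crit)
    "rx_power": (-7.0, -9.0),  # assumed lower levels in dBm (warn, crit)
    "tx_power": (-2.0, -3.0),  # assumed lower levels in dBm (warn, crit)
}


def state_upper(value, warn, crit):
    """0 = OK, 1 = WARN, 2 = CRIT when the value rises to or above a level."""
    if value >= crit:
        return 2
    if value >= warn:
        return 1
    return 0


def state_lower(value, warn, crit):
    """0 = OK, 1 = WARN, 2 = CRIT when the value drops to or below a level."""
    if value <= crit:
        return 2
    if value <= warn:
        return 1
    return 0


def check_sfp(temp, rx_power, tx_power, levels=DEFAULT_SFP_LEVELS):
    """Return the worst state and the per-metric states for one SFP."""
    states = {
        "temp": state_upper(temp, *levels["temp"]),
        "rx_power": state_lower(rx_power, *levels["rx_power"]),
        "tx_power": state_lower(tx_power, *levels["tx_power"]),
    }
    return max(states.values()), states
```

Voltage and current are deliberately left out of the sketch, as the check only records them without applying any levels.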

The configuration options for the used temperature and optical receive and transmit power levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Networking
         -> Brocade Fibre Channel SFP
            -> Create rule in folder ...
               [x] Temperature levels in degrees celcius
               [x] Receive power levels in dBm
               [x] Transmit power levels in dBm

Currently, the electrical current and voltage metrics of the SFPs are only used for long-term trends via the respective service metric template and thus are not used to raise any alarms. This is due to the fact that there is little to no information available on which precise electrical current and voltage levels would constitute an indicator for an immediate or impending failure state.

The following image shows several status output examples for the brocade_sfp service check from the WATO WebUI:

Status output examples for the new brocade_sfp service check

This example shows SFP Port service check items for eight consecutive ports on a switch. For each item the current temperature, voltage and current, as well as the optical receive and transmit power of the SFP are shown. The optical receive and transmit power levels are also visualized in the Perf-O-Meter, optical receive power growing from the middle to the left, optical transmit power growing from the middle to the right.

The following image shows an example of the service metrics graph for the brocade_sfp service check:

Example service metrics graph for the new brocade_sfp service check

The first graph shows the optical receive and transmit power of the SFP. The second shows the electrical current drawn by the SFP. The third graph shows the temperature of the SFP. The fourth and last graph shows the electrical voltage provided to the SFP.

Modified Check

Fibre Channel Port

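The Check_MK service check brocade_fcport has several issues which are addressed by the following set of patches. The patch is actually a monolithic one and is only broken up into individual patches here for ease of discussion. Note that the diffs are shown with the enhanced check on the "a" side: lines prefixed with - belong to the enhanced version, lines prefixed with + belong to the standard Check_MK version.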

The first patch is a simple but ugly workaround that prevents OID_END, which carries the port index information, from being treated as a BINARY SNMP value:

brocade_fcport.patch
--- a/checks/brocade_fcport   2018-11-25 18:06:04.674930057 +0100
+++ b/checks/brocade_fcport   2018-11-25 08:43:58.715721271 +0100
@@ -154,8 +135,8 @@
         bbcredits = None
         if len(if64_info) > 0:
             fcmgmt_portstats = []
-            for oidend, dummy, tx_elements, rx_elements, bbcredits_64 in if64_info:
-                if int(index) == int(oidend.split(".")[-1]):
+            for oidend, tx_elements, rx_elements, bbcredits_64 in if64_info:
+                if index == oidend.split(".")[-1]:
                     fcmgmt_portstats = [
                         binstring_to_int(''.join(map(chr, tx_elements))) / 4,
                         binstring_to_int(''.join(map(chr, rx_elements))) / 4,
@@ -477,7 +426,6 @@
         # Not every device supports that
         (".1.3.6.1.3.94.4.5.1", [
             OID_END,
-            "1",            # Dummy value, otherwise OID_END is also treated as a BINARY value
             BINARY("6"),    # FCMGMT-MIB::connUnitPortStatCountTxElements
             BINARY("7"),    # FCMGMT-MIB::connUnitPortStatCountRxElements
             BINARY("8"),    # FCMGMT-MIB::connUnitPortStatCountBBCreditZero

This is achieved by simply inserting a dummy value into the list of SNMP OIDs in the snmp_info variable. I haven't had time to dig out the root cause of this behaviour, but I guess it must be somewhere in the core SNMP components of Check_MK. The patch also implicitly addresses and fixes an issue where two variables with differently typed content are being compared. This is achieved by simply changing the line:

               if index == oidend.split(".")[-1]:

to:

               if int(index) == int(oidend.split(".")[-1]):

and thus forcing a conversion to integer values.
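
A small standalone illustration of why the conversion matters: in Python, an integer and a string never compare as equal, even if they show the same digits. The helper name and the example OID suffix below are made up for this sketch; which of the two values actually arrives as a string in the check is not examined here.

```python
def port_index_matches(index, oidend):
    """Compare a port index with the last component of an OID_END value.

    Both sides are converted to int, so it does not matter whether
    either of them is delivered as a string or as an integer.
    """
    return int(index) == int(oidend.split(".")[-1])


# Hypothetical OID suffix whose last component is the port index.
example_oidend = "16.0.0.5"

# Without the conversion, differently typed values never match:
naive_match = (5 == example_oidend.split(".")[-1])

# With the conversion, the comparison behaves as intended:
converted_match = port_index_matches(5, example_oidend)
```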

The next patch adds support for additional encoding schemes in the FC-1 layer, which are used for fibre channel beyond the speed of 8 GBit. Up to and including a speed of 8 GBit, fibre channel uses an 8/10b encoding. This means that for every 8 bits of data, 10 bits are actually sent over the fibre channel link. The encoding scheme changes for speeds higher than 8 GBit. 16 GBit fibre channel – like 10 GBit ethernet – uses a 64/66b encoding scheme, while 32 GBit fibre channel and higher use a 256/257b encoding scheme. Thus the wirespeed calculations of the standard brocade_fcport service check are incorrect for speeds above 8 GBit. The following patch addresses and fixes this issue:

brocade_fcport.patch
--- a/checks/brocade_fcport   2018-11-25 18:06:04.674930057 +0100
+++ b/checks/brocade_fcport   2018-11-25 08:43:58.715721271 +0100
@@ -283,15 +267,8 @@
 
     output.append(speedmsg)
 
-    if gbit > 16:
-        # convert gbit netto link-rate to Byte/s (256/257 enc)
-        wirespeed = gbit * 1000000000.0 * ( 256 / 257 ) / 8
-    elif gbit > 8:
-        # convert gbit netto link-rate to Byte/s (64/66 enc)
-        wirespeed = gbit * 1000000000.0 * ( 64 / 66 ) / 8
-    else:
         # convert gbit netto link-rate to Byte/s (8/10 enc)
-        wirespeed = gbit * 1000000000.0 * ( 8 / 10 ) / 8
+    wirespeed = gbit * 1000000000.0 * 0.8 / 8
     in_bytes = 4 * get_rate("brocade_fcport.rxwords.%s" % index, this_time, rxwords)
     out_bytes = 4 * get_rate("brocade_fcport.txwords.%s" % index, this_time, txwords)
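
The effect of the three encoding schemes on the usable byte rate can be sketched as a standalone function. This is an illustration of the calculation and not a drop-in replacement for the check code; the function name is made up. The ratios are written with float literals so that the result does not depend on whether the interpreter performs integer or true division (under Python 2, an expression like 256 / 257 with integer literals evaluates to 0).

```python
def wirespeed_bytes_per_sec(gbit):
    """Convert a nominal link rate in GBit/s to usable Byte/s.

    The encoding efficiency depends on the link speed:
      up to 8 GBit   -> 8/10b
      above 8 GBit   -> 64/66b
      above 16 GBit  -> 256/257b
    """
    if gbit > 16:
        efficiency = 256.0 / 257.0
    elif gbit > 8:
        efficiency = 64.0 / 66.0
    else:
        efficiency = 8.0 / 10.0
    return gbit * 1000000000.0 * efficiency / 8


# Usable byte rates for common fibre channel speeds
for speed in (4, 8, 16, 32):
    print("%2d GBit: %.0f Byte/s" % (speed, wirespeed_bytes_per_sec(speed)))
```

With the old fixed factor of 0.8, a 16 GBit link would be assumed to carry 1.6 GByte/s, whereas the 64/66b encoding allows for roughly 1.94 GByte/s.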

The third patch simply adds the notxcredits counter and its value to the output of the service check. This information is currently missing from the status output of the check and is just added as a convenience:

brocade_fcport.patch
--- a/checks/brocade_fcport   2018-11-25 18:06:04.674930057 +0100
+++ b/checks/brocade_fcport   2018-11-25 08:43:58.715721271 +0100
@@ -408,9 +361,6 @@
             summarystate = max(1, summarystate)
             text += "(!)"
             output.append(text)
-        else:
-            if counter == "notxcredits":
-                output.append(text)
 
     # P O R T S T A T E
     for dev_state, state_key, state_info, warn_states, state_map in [

The last patch addresses and fixes a more serious issue. The details are explained in the code comment of the patch below. The gist is that the metric notxcredits (“No TX buffer credits”), gathered from the SNMP OID swFCPortNoTxCredits, is calculated incorrectly. The brocade_fcport service check treats this metric like all the other error metrics of a switch port (e.g. “CRC errors”, “ENC-Out”, “ENC-In” and “C3 discards”) and puts it in relation to the number of frames transmitted over a link. By definition, though, the metric behind the SNMP OID swFCPortNoTxCredits is relative to time. This leads to false positives in certain edge cases, for example when a lightly utilized switch port sees an otherwise uncritical number of swFCPortNoTxCredits due to the normal activity of the fibre channel flow control. In such a case, a relatively high number of swFCPortNoTxCredits is put into relation to the sum of a relatively low number of frames transmitted over the link and, again, the relatively high number of swFCPortNoTxCredits. See the last line of the code snippet below for this calculation. The result is a high value for the metric notxcredits (“No TX buffer credits”), although the fibre channel flow control was working perfectly fine, the configured switch.edgeHoldTime was most likely never reached and no frames were dropped.

In order to address this issue, the following patch adds a few lines of code for a special treatment of the metric notxcredits to the brocade_fcport service check. This is done in the last section of the patch. The second section of the patch only serves as a clarification: for the metric notxcredits it sets the reference metric to None. The first section of the patch adjusts the default warning and critical thresholds to lower levels. The previous levels were rather high due to the miscalculation explained above. With the new calculation added by the third section of the patch, the default warning and critical threshold levels need to be much more sensitive.

brocade_fcport.patch
--- a/checks/brocade_fcport   2018-11-25 18:06:04.674930057 +0100
+++ b/checks/brocade_fcport   2018-11-25 08:43:58.715721271 +0100
@@ -100,11 +100,11 @@
 factory_settings["brocade_fcport_default_levels"] = {
     "rxcrcs":           (3.0, 20.0),   # allowed percentage of CRC errors
     "rxencoutframes":   (3.0, 20.0),   # allowed percentage of Enc-OUT Frames
     "rxencinframes":    (3.0, 20.0),   # allowed percentage of Enc-In Frames
-    "notxcredits":      (1.0, 3.0),    # allowed percentage of No Tx Credits
+    "notxcredits": (3.0, 20.0),  # allowed percentage of No Tx Credits
     "c3discards":       (3.0, 20.0),   # allowed percentage of C3 discards
     "assumed_speed":    2.0,           # used if speed not available in SNMP data
 }
 
 
@@ -349,7 +327,7 @@
            ("ENC-Out",              "rxencoutframes",      rxencoutframes,  rxframes_rate),
            ("ENC-In",               "rxencinframes",       rxencinframes,   rxframes_rate),
            ("C3 discards",          "c3discards",          c3discards,      txframes_rate),
-           ("No TX buffer credits", "notxcredits",         notxcredits,     None),
+        ("No TX buffer credits", "notxcredits", notxcredits, txframes_rate),
     ]:
         per_sec = get_rate("brocade_fcport.%s.%s" % (counter, index), this_time, value)
         perfdata.append((counter, per_sec))
@@ -360,31 +338,6 @@
                     (counter, item), this_time, per_sec, average)
             perfdata.append( ("%s_avg" % counter, per_sec_avg ) )
 
-        # Calculate error rates
-        if counter == "notxcredits":
-            # Calculate the error rate for "notxcredits" (buffer credit zero). Since this value
-            # is relative to time instead of the number of transmitted frames it needs special
-            # treatment.
-            # Semantics of the buffer credit zero value on Brocade / Broadcom devices:
-            # The switch ASIC checks the buffer credit value of a switch port every 2.5us and
-            # increments the buffer credit zero counter if the buffer credit value is zero.
-            # This means if in a one second interval the buffer credit zero counter increases
-            # by 400000 the link on this switch port is not allowed to send any frames.
-            # By default the edge hold time on a Brocade / Broadcom device is about 200ms:
-            #   switch.edgeHoldTime:220
-            # If a C3 frame remains in the switches queue for more than 220ms without being
-            # given any credits to be transmitted, it is subsequently dropped. Thus the buffer
-            # credit zero counter would optimally be correlated to the C3 discards counter.
-            # Unfortunately the Brocade / Broadcom devices have no egress buffering and do
-            # ingress buffering instead. Thus the C3 discards counters are increased on the
-            # ingress port, while the buffer credit zero counters are increased on the egress
-            # port. The trade-off is to correlate the buffer credit zero counters relative to
-            # the measured time interval.
-            if per_sec > 0:
-                rate = per_sec / 400000.00
-            else:
-                rate = 0
-        else:
             # compute error rate (errors in relation to number of frames) (from 0.0 to 1.0)
             if ref > 0 or per_sec > 0:
                 rate = per_sec / (ref + per_sec)
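
To make the difference between the two calculations tangible, here is a small standalone sketch that applies both formulas to the same input. The function names and the sample numbers are made up for illustration; the divisor of 400000 follows from the 2.5us sampling interval mentioned in the code comment above.

```python
# One buffer credit sample every 2.5us gives 400000 samples per second.
BB_CREDIT_ZERO_SAMPLES_PER_SEC = 400000.0


def rate_relative_to_frames(per_sec, ref):
    """Standard calculation: errors in relation to the number of frames."""
    if ref > 0 or per_sec > 0:
        return float(per_sec) / (ref + per_sec)
    return 0.0


def rate_relative_to_time(per_sec):
    """Enhanced calculation: share of the interval spent without TX credits."""
    if per_sec > 0:
        return float(per_sec) / BB_CREDIT_ZERO_SAMPLES_PER_SEC
    return 0.0


# Hypothetical lightly utilized port: 100 frames/s transmitted,
# buffer credit zero counter increasing by 2000 per second.
frames_based = rate_relative_to_frames(2000.0, 100.0)
time_based = rate_relative_to_time(2000.0)

print("relative to frames: %.1f%%" % (frames_based * 100))
print("relative to time:   %.1f%%" % (time_based * 100))
```

For this made-up port the frame-based formula reports about 95%, far above any sensible threshold, while the time-based formula shows that the port was without transmit credits for only 0.5% of the interval.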

Conclusion

Adding the three new checks for Brocade / Broadcom fibre channel switches to your Check_MK server enables you to monitor additional CPU and memory aspects as well as the SFPs of your Brocade / Broadcom fibre channel devices. More recent versions of Check_MK should already include the same functionality through the now included standard checks brocade_sys and brocade_sfp.

New Brocade / Broadcom fibre channel devices should pick up the additional service checks immediately. Existing Brocade / Broadcom fibre channel devices might need a Check_MK inventory to be run explicitly on them in order to pick up the additional service checks.

The enhanced version of the brocade_fcport service check addresses and fixes several issues present in the standard Check_MK service check. It should provide you with a more complete and future-proof monitoring of your fibre channel network infrastructure.

I hope you find the provided new and enhanced checks useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.
