bityard Blog

// Check_MK Monitoring - Brocade / Broadcom Fibre Channel Switches

This article provides patches for the standard Check_MK distribution in order to fix and enhance the support for the monitoring of Brocade / Broadcom Fibre Channel Switches.


Out of the box, the standard Check_MK distribution already includes monitoring support for Brocade / Broadcom Fibre Channel Switches. Unfortunately there are several issues in the check brocade_fcport, which is used to monitor the status and several metrics of the fibre channel switch ports. Those issues prevent the check from working as intended and are fixed in the version provided in this article. Up until recently, there was also no support in the standard Check_MK distribution for monitoring CPU and memory metrics on a switch level and no support for monitoring SFP metrics on a port level. This article introduces the new checks brocade_cpu, brocade_mem and brocade_sfp to cover those metrics. With the most recent upstream version 1.5.x of the Check_MK distribution, the standard checks brocade_sys (covering CPU and memory metrics) and brocade_sfp (covering the SFP port metrics) are now available and provide essentially the same functionality.

For the impatient (TL;DR), here is the enhanced version of the brocade_fcport check:

Enhanced version of the brocade_fcport check

And the new brocade_cpu, brocade_mem and brocade_sfp checks:

The new brocade_cpu check
The new brocade_mem check
The new brocade_sfp check

The sources to the new and enhanced versions of all the checks can be found in my Check_MK Plugins repository on GitHub.

Additional Checks

CPU Usage

The Check_MK service check brocade_cpu monitors the current CPU utilization of Brocade / Broadcom fibre channel switches. It uses the SNMP OID swCpuUsage from the fibre channel switch MIB (SW-MIB) in order to create one service check for each CPU found in the system. The current CPU utilization in percent is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the standard WATO plugin for CPU utilization it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 80%; critical: 90% of CPU utilization).
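
The threshold comparison described above can be sketched in a few lines of Python. This is only an illustration and not the code of the brocade_cpu check: the function name cpu_state is hypothetical, and whether the real check treats a value exactly on a threshold as exceeded (>=) is an assumption.

```python
# Illustrative sketch only; cpu_state is a hypothetical helper,
# not a function of the brocade_cpu check.
def cpu_state(util_percent, levels=(80.0, 90.0)):
    """Map a CPU utilization in percent to a Nagios-style state.

    levels is a (warning, critical) tuple; the defaults mirror the
    default thresholds named in the text (80% / 90%).
    Assumption: a value equal to a threshold counts as exceeded.
    """
    warn, crit = levels
    if util_percent >= crit:
        return 2, "CRIT"
    if util_percent >= warn:
        return 1, "WARN"
    return 0, "OK"
```

A rule configured through WATO would then simply replace the default levels tuple, e.g. cpu_state(85.0, levels=(70.0, 95.0)).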

The following image shows a status output example for the brocade_cpu service check from the WATO WebUI:

Status output example for the new brocade_cpu service check

This example shows the current CPU utilization of the system in percent.

The following image shows an example of the service metrics graph for the brocade_cpu service check:

Example service metrics graph for the new brocade_cpu service check

The selected example graph shows the current CPU utilization of the system in percent as well as the default warning and critical threshold values.

Memory Usage

The Check_MK service check brocade_mem monitors the current memory (RAM) usage on Brocade / Broadcom fibre channel switches. It uses the SNMP OID swMemUsage from the fibre channel switch MIB (SW-MIB) in order to create a service check for the overall memory usage on the system. The amount of currently used memory is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 80%; critical: 90% of used memory). The configuration options for the used memory levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Operating System Resources
         -> Brocade Fibre Channel Memory Usage
            -> Create rule in folder ...
               [x] Levels for memory usage

The following image shows a status output example for the brocade_mem service check from the WATO WebUI:

Status output example for the new brocade_mem service check

This example shows the current memory utilization of the system in percent.

The following image shows an example of the service metrics graph for the brocade_mem service check:

Example service metrics graph for the new brocade_mem service check

The selected example graph shows the current memory utilization of the system in percent as well as the default warning and critical threshold values.

SFP Health

The Check_MK service check brocade_sfp monitors several metrics of the SFPs in all enabled and active ports of Brocade / Broadcom fibre channel switches. It uses several SNMP OIDs from the fibre channel switch MIB (SW-MIB) and from the Fabric Alliance Extension MIB (FA-EXT-MIB) in order to create one service check for each SFP found in an enabled and active port on the system. The OIDs from the fibre channel switch MIB (swFCPortSpecifier, swFCPortName, swFCPortPhyState, swFCPortOpStatus, swFCPortAdmStatus) are used to determine the number, name and status of the switch port. The OIDs from the Fabric Alliance Extension MIB are used to determine the actual SFP metrics. Those metrics are the SFP temperature (swSfpTemperature), voltage (swSfpVoltage) and current (swSfpCurrent), as well as the optical receive and transmit power (swSfpRxPower and swSfpTxPower) of the SFP. Unfortunately the SFP metrics are not available on all Brocade / Broadcom switch hardware platforms and FabricOS versions. This check was verified to work with Gen6 hardware and FabricOS v8. Another limitation is that the SFP metrics are not available in real-time, but are only gathered by the switch from all of its SFPs in a 5 minute interval. This is most likely a precaution so as not to overload the processing capacity of the SFPs with too many status requests.

The current temperature level as well as the optical receive and transmit power levels are compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values:

Metric                   Warning Threshold   Critical Threshold
System temperature       55°C                65°C
Optical receive power    -7.0 dBm            -9.0 dBm
Optical transmit power   -2.0 dBm            -3.0 dBm
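
Note that for the optical power levels the critical threshold is lower than the warning threshold, which suggests that they act as lower bounds, while the temperature levels act as upper bounds. The following sketch illustrates that reading of the defaults. It is not taken from the brocade_sfp check; all function names are hypothetical and the boundary handling (<= and >=) is an assumption.

```python
# Illustrative sketch only; none of these helpers exist in brocade_sfp.
DEFAULT_LEVELS = {
    "temp": (55.0, 65.0),      # degrees Celsius, upper bounds
    "rx_power": (-7.0, -9.0),  # dBm, assumed to be lower bounds
    "tx_power": (-2.0, -3.0),  # dBm, assumed to be lower bounds
}


def upper_state(value, warn, crit):
    """State for a metric that is bad when it gets too high."""
    if value >= crit:
        return 2
    if value >= warn:
        return 1
    return 0


def lower_state(value, warn, crit):
    """State for a metric that is bad when it gets too low."""
    if value <= crit:
        return 2
    if value <= warn:
        return 1
    return 0


def sfp_state(temp_c, rx_dbm, tx_dbm, levels=DEFAULT_LEVELS):
    """Worst state (0=OK, 1=WARN, 2=CRIT) over the three alarmed metrics."""
    return max(
        upper_state(temp_c, *levels["temp"]),
        lower_state(rx_dbm, *levels["rx_power"]),
        lower_state(tx_dbm, *levels["tx_power"]),
    )
```

With these defaults, an SFP at 40°C receiving at -8.0 dBm and transmitting at -1.0 dBm would end up in the warning state because of its receive power alone.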

The configuration options for the used temperature and optical receive and transmit power levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Networking
         -> Brocade Fibre Channel SFP
            -> Create rule in folder ...
               [x] Temperature levels in degrees celcius
               [x] Receive power levels in dBm
               [x] Transmit power levels in dBm

Currently, the electrical current and voltage metrics of the SFPs are only used for long-term trends via the respective service metric template and thus are not used to raise any alarms. This is due to the fact that there is little to no information available on which precise electrical current and voltage levels would constitute an indicator for an immediate or impending failure state.

The following image shows several status output examples for the brocade_sfp service check from the WATO WebUI:

Status output examples for the new brocade_sfp service check

This example shows SFP Port service check items for eight consecutive ports on a switch. For each item the current temperature, voltage and current, as well as the optical receive and transmit power of the SFP are shown. The optical receive and transmit power levels are also visualized in the Perf-O-Meter, optical receive power growing from the middle to the left, optical transmit power growing from the middle to the right.

The following image shows an example of the service metrics graph for the brocade_sfp service check:

Example service metrics graph for the new brocade_sfp service check

The first graph shows the optical receive and transmit power of the SFP. The second shows the electrical current drawn by the SFP. The third graph shows the temperature of the SFP. The fourth and last graph shows the electrical voltage provided to the SFP.

Modified Check

Fibre Channel Port

The Check_MK service check brocade_fcport has several issues which are addressed by the following set of patches. The patch is actually a monolithic one and is just broken up into individual patches here for ease of discussion.

The first patch is a simple, but ugly workaround to prevent the conversion of OID_END, which carries the port index information, from being treated as a BINARY SNMP value:

brocade_fcport.patch
--- a/checks/brocade_fcport   2018-11-25 18:06:04.674930057 +0100
+++ b/checks/brocade_fcport   2018-11-25 08:43:58.715721271 +0100
@@ -154,8 +135,8 @@
         bbcredits = None
         if len(if64_info) > 0:
             fcmgmt_portstats = []
-            for oidend, dummy, tx_elements, rx_elements, bbcredits_64 in if64_info:
-                if int(index) == int(oidend.split(".")[-1]):
+            for oidend, tx_elements, rx_elements, bbcredits_64 in if64_info:
+                if index == oidend.split(".")[-1]:
                     fcmgmt_portstats = [
                         binstring_to_int(''.join(map(chr, tx_elements))) / 4,
                         binstring_to_int(''.join(map(chr, rx_elements))) / 4,
@@ -477,7 +426,6 @@
         # Not every device supports that
         (".1.3.6.1.3.94.4.5.1", [
             OID_END,
-            "1",            # Dummy value, otherwise OID_END is also treated as a BINARY value
             BINARY("6"),    # FCMGMT-MIB::connUnitPortStatCountTxElements
             BINARY("7"),    # FCMGMT-MIB::connUnitPortStatCountRxElements
             BINARY("8"),    # FCMGMT-MIB::connUnitPortStatCountBBCreditZero

This is achieved by simply inserting a dummy value into the list of SNMP OIDs in the snmp_info variable. I haven't had time to dig out the root cause of this behaviour, but I guess it must be somewhere in the core SNMP components of Check_MK. The patch also implicitly fixes an issue where two variables with differently typed content are compared. This is achieved by changing the line:

               if index == oidend.split(".")[-1]:

to:

               if int(index) == int(oidend.split(".")[-1]):

and thus forcing a conversion to integer values.
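
Why the conversion matters can be shown with a small standalone example. In Python, an integer never compares equal to a string holding the same digits, so the comparison silently fails as soon as one side arrives as a string. The function name port_matches and the sample OID_END value are made up for this illustration.

```python
# Illustrative sketch only; port_matches and the sample values are hypothetical.
def port_matches(index, oidend):
    """Compare a port index with the last component of an OID_END value.

    Both sides are converted to int, so it does not matter whether
    either of them arrives as a string or as an integer.
    """
    return int(index) == int(oidend.split(".")[-1])


# The naive comparison fails when the types differ:
naive = (5 == "1.2.5".split(".")[-1])   # int 5 vs str "5"
robust = port_matches(5, "1.2.5")
```

Here naive evaluates to False although both sides describe port 5, while robust evaluates to True.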

The next patch adds support for additional encoding schemes in the FC-1 layer, which are used for fibre channel at speeds beyond 8 GBit. Up to and including a speed of 8 GBit, fibre channel uses an 8/10b encoding. This means that for every 8 bits of data, 10 bits are actually sent over the fibre channel link. The encoding scheme changes for speeds higher than 8 GBit. 16 GBit fibre channel, like 10 GBit ethernet, uses a 64/66b encoding scheme, while 32 GBit fibre channel and higher use a 256/257b encoding scheme. Thus the wirespeed calculations of the brocade_fcport service check are incorrect for speeds above 8 GBit. The following patch addresses and fixes this issue:

brocade_fcport.patch
--- a/checks/brocade_fcport   2018-11-25 18:06:04.674930057 +0100
+++ b/checks/brocade_fcport   2018-11-25 08:43:58.715721271 +0100
@@ -283,15 +267,8 @@
 
     output.append(speedmsg)
 
-    if gbit > 16:
-        # convert gbit netto link-rate to Byte/s (256/257 enc)
-        wirespeed = gbit * 1000000000.0 * ( 256 / 257 ) / 8
-    elif gbit > 8:
-        # convert gbit netto link-rate to Byte/s (64/66 enc)
-        wirespeed = gbit * 1000000000.0 * ( 64 / 66 ) / 8
-    else:
         # convert gbit netto link-rate to Byte/s (8/10 enc)
-        wirespeed = gbit * 1000000000.0 * ( 8 / 10 ) / 8
+    wirespeed = gbit * 1000000000.0 * 0.8 / 8
     in_bytes = 4 * get_rate("brocade_fcport.rxwords.%s" % index, this_time, rxwords)
     out_bytes = 4 * get_rate("brocade_fcport.txwords.%s" % index, this_time, txwords)
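
The encoding-dependent calculation from the patch above can be sketched as a standalone function. The function name is hypothetical. One detail is worth pointing out: under Python 2 integer division semantics (without "from __future__ import division"), a parenthesised ratio such as ( 256 / 257 ) evaluates to 0, so this sketch deliberately uses float literals for the encoding efficiency.

```python
# Illustrative sketch only; wirespeed_bytes_per_sec is a hypothetical helper.
def wirespeed_bytes_per_sec(gbit):
    """Usable link rate in bytes per second for a nominal speed in GBit."""
    if gbit > 16:
        efficiency = 256.0 / 257.0   # 256/257b encoding
    elif gbit > 8:
        efficiency = 64.0 / 66.0     # 64/66b encoding
    else:
        efficiency = 8.0 / 10.0      # 8/10b encoding
    # Float literals keep the ratio from truncating to 0 under
    # Python 2 integer division.
    return gbit * 1000000000.0 * efficiency / 8
```

For an 8 GBit link this yields 800,000,000 bytes per second, the same value the original 0.8 factor produced, while for 16 GBit and above the result is noticeably higher than with the 8/10b assumption.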

The third patch simply adds the notxcredits counter and its value to the output of the service check. This information is currently missing from the status output of the check and is just added as a convenience:

brocade_fcport.patch
--- a/checks/brocade_fcport   2018-11-25 18:06:04.674930057 +0100
+++ b/checks/brocade_fcport   2018-11-25 08:43:58.715721271 +0100
@@ -408,9 +361,6 @@
             summarystate = max(1, summarystate)
             text += "(!)"
             output.append(text)
-        else:
-            if counter == "notxcredits":
-                output.append(text)
 
     # P O R T S T A T E
     for dev_state, state_key, state_info, warn_states, state_map in [

The last patch addresses and fixes a more serious issue. The details are explained in the comment of the following code snippet. In short, the metric notxcredits (“No TX buffer credits”), gathered from the SNMP OID swFCPortNoTxCredits, is calculated incorrectly. This is because the brocade_fcport service check treats this metric like all the other error metrics of a switch port (e.g. “CRC errors”, “ENC-Out”, “ENC-In” and “C3 discards”) and puts it in relation to the number of frames transmitted over a link. In reality, though, the metric behind the SNMP OID swFCPortNoTxCredits is defined relative to time. This issue leads to false positives in certain edge cases, for example when a lightly utilized switch port sees an otherwise uncritical number of swFCPortNoTxCredits due to the normal activity of the fibre channel flow control. In such a case, a relatively high number of swFCPortNoTxCredits is put into relation to the sum of a relatively low number of frames transmitted over the link and, again, the relatively high number of swFCPortNoTxCredits. See the last line of the code snippet below for this calculation. The result is a high value for the metric notxcredits (“No TX buffer credits”), although the fibre channel flow control was working perfectly fine, the configured switch.edgeHoldTime was most likely never reached and no frames were ever dropped.

In order to address this issue, the following patch adds a few lines of code for a special treatment of the metric notxcredits to the brocade_fcport service check. This is done in the last section of the patch. The second section of the patch is just for the purpose of clarification, as it sets the metric that is being related to, to None in case of the metric notxcredits. The first section of the patch adjusts the default warning and critical thresholds to lower levels. The previous levels were rather high due to the miscalculation explained above. With the new calculation added by the third section of the patch, the default warning and critical threshold levels need to be much more sensitive.

brocade_fcport.patch
--- a/checks/brocade_fcport   2018-11-25 18:06:04.674930057 +0100
+++ b/checks/brocade_fcport   2018-11-25 08:43:58.715721271 +0100
@@ -100,11 +100,11 @@
 factory_settings["brocade_fcport_default_levels"] = {
     "rxcrcs":           (3.0, 20.0),   # allowed percentage of CRC errors
     "rxencoutframes":   (3.0, 20.0),   # allowed percentage of Enc-OUT Frames
     "rxencinframes":    (3.0, 20.0),   # allowed percentage of Enc-In Frames
-    "notxcredits":      (1.0, 3.0),    # allowed percentage of No Tx Credits
+    "notxcredits": (3.0, 20.0),  # allowed percentage of No Tx Credits
     "c3discards":       (3.0, 20.0),   # allowed percentage of C3 discards
     "assumed_speed":    2.0,           # used if speed not available in SNMP data
 }
 
 
@@ -349,7 +327,7 @@
            ("ENC-Out",              "rxencoutframes",      rxencoutframes,  rxframes_rate),
            ("ENC-In",               "rxencinframes",       rxencinframes,   rxframes_rate),
            ("C3 discards",          "c3discards",          c3discards,      txframes_rate),
-           ("No TX buffer credits", "notxcredits",         notxcredits,     None),
+        ("No TX buffer credits", "notxcredits", notxcredits, txframes_rate),
     ]:
         per_sec = get_rate("brocade_fcport.%s.%s" % (counter, index), this_time, value)
         perfdata.append((counter, per_sec))
@@ -360,31 +338,6 @@
                     (counter, item), this_time, per_sec, average)
             perfdata.append( ("%s_avg" % counter, per_sec_avg ) )
 
-        # Calculate error rates
-        if counter == "notxcredits":
-            # Calculate the error rate for "notxcredits" (buffer credit zero). Since this value
-            # is relative to time instead of the number of transmitted frames it needs special
-            # treatment.
-            # Semantics of the buffer credit zero value on Brocade / Broadcom devices:
-            # The switch ASIC checks the buffer credit value of a switch port every 2.5us and
-            # increments the buffer credit zero counter if the buffer credit value is zero.
-            # This means if in a one second interval the buffer credit zero counter increases
-            # by 400000 the link on this switch port is not allowed to send any frames.
-            # By default the edge hold time on a Brocade / Broadcom device is about 200ms:
-            #   switch.edgeHoldTime:220
-            # If a C3 frame remains in the switches queue for more than 220ms without being
-            # given any credits to be transmitted, it is subsequently dropped. Thus the buffer
-            # credit zero counter would optimally be correlated to the C3 discards counter.
-            # Unfortunately the Brocade / Broadcom devices have no egress buffering and do
-            # ingress buffering instead. Thus the C3 discards counters are increased on the
-            # ingress port, while the buffer credit zero counters are increased on the egress
-            # port. The trade-off is to correlate the buffer credit zero counters relative to
-            # the measured time interval.
-            if per_sec > 0:
-                rate = per_sec / 400000.00
-            else:
-                rate = 0
-        else:
             # compute error rate (errors in relation to number of frames) (from 0.0 to 1.0)
             if ref > 0 or per_sec > 0:
                 rate = per_sec / (ref + per_sec)

Conclusion

Adding the three new checks for Brocade / Broadcom fibre channel switches to your Check_MK server enables you to monitor additional CPU and memory aspects as well as the SFPs of your Brocade / Broadcom fibre channel devices. More recent versions of Check_MK should already include the same functionality through the now included standard checks brocade_sys and brocade_sfp.

New Brocade / Broadcom fibre channel devices should pick up the additional service checks immediately. Existing Brocade / Broadcom fibre channel devices might need a Check_MK inventory to be run explicitly on them in order to pick up the additional service checks.

The enhanced version of the brocade_fcport service check addresses and fixes several issues present in the standard Check_MK service check. It should provide you with a more complete and future-proof monitoring of your fibre channel network infrastructure.

I hope you find the provided new and enhanced checks useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.

// Brocade Fabric OS Authentication Failure with SSH Public Key

With the update to Fabric OS v7.4.1d on Brocade fibre channel SAN switches, the CLI login via SSH public key authentication will sometimes be broken for administrative users. This blog post describes a manual workaround which can be used in order to temporarily correct this issue without the immediate need for another Fabric OS update.


During the preparation phase for migrating from our aging Brocade 5100 and 5300 Gen4 fibre channel SAN switches to the shiny new Brocade G620 Gen6 fibre channel SAN Switches, we needed to update the Fabric OS on the old switches to a v7.4.x version. Due to compatibility and support constraints with the IBM SAN Volume Controller (SVC), we decided to go with the Fabric OS v7.4.1d version.

After the successful Fabric OS update, the CLI login via SSH public key authentication was broken for some, but not all, users with admin level privileges on some, but not all, switches. A re-upload of the SSH public key for those users with the sshUtil importpubkey command didn't solve the issue. Debugging this further with a strace attached to the SSH daemon process on an affected switch revealed why the SSH public key authentication was failing:

[...]
[pid 27941] connect(8, {sa_family=AF_FILE, path="/dev/log"}, 16) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 27941] close(8)                    = 0
[pid 27941] socket(PF_FILE, SOCK_STREAM, 0) = 8
[pid 27941] fcntl64(8, F_SETFD, FD_CLOEXEC) = 0
[pid 27941] connect(8, {sa_family=AF_FILE, path="/dev/log"}, 16) = 0
[pid 27941] send(8, "<39>Feb  1 22:07:00 sshd[27941]: debug1: trying public key file /fabos/users/admin/.ssh/authorized_keys.<USERNAME>\0", 117, MSG_NOSIGNAL) = 117
[pid 27941] close(8)                    = 0
[pid 27941] open("/fabos/users/admin/.ssh/authorized_keys.<USERNAME_2>", O_RDONLY|O_NONBLOCK|O_LARGEFILE) = -1 EACCES (Permission denied)
[...]

This was done by using the root account on the Brocade switch. The same account was also used for the following research and the temporary workaround derived from this. Beware that using the root account on Brocade switches might have serious implications on the warranty or support for the devices. Be extra careful what you are doing as root on the Brocade switch, since it might easily affect the operational status of the device.

The last line from the above snippet of the strace output shows that the cause of the issue with SSH public key authentication was in the permissions of the user's authorized_keys file. Looking at the permissions of the directory containing the file and the file itself showed:

switch:FID128:root> cd /fabos/users/admin/
switch:FID128:root> ls -al
total 28
drwxr-xr-x   4 root     admin        4096 Jan 23 12:36 ./
drwxr-xr-x  12 root     sys          4096 Jul 15  2016 ../
-rw-r--r--   1 root     admin         507 Jul 15  2016 .bash_logout
-rw-r--r--   1 root     admin          27 Jul 15  2016 .inputrc
-rw-r--r--   1 root     admin        1275 Jul 15  2016 .profile
drwxr-xr-x   2 root     admin        4096 Feb  1 22:54 .ssh/
drwxrwxrwx   3 root     sys          4096 Aug 11  2011 .terminfo/

switch:FID128:root> cd .ssh
switch:FID128:root> pwd
/fabos/users/admin/.ssh

switch:FID128:root> ls -al
total 44
drwxr-xr-x   2 root     admin        4096 Feb  1 22:54 ./
drwxr-xr-x   4 root     admin        4096 Jan 23 12:36 ../
-rw-r--r--   1 root     admin       10240 Feb  1 22:54 authorizedKeys.tar
-rw-------   1 root     root          408 Jan 23 12:43 authorized_keys
-rw-r--r--   1 root     admin         755 Dec  8 11:13 authorized_keys.<USERNAME_1>
-rw-------   1 root     admin        1230 Feb  1 22:54 authorized_keys.<USERNAME_2>
-rw-------   1 root     root          408 Mar 22  2016 authorized_keys.<USERNAME_3>
-rw-r--r--   1 root     root          605 Sep 19 11:06 authorized_keys.<USERNAME_4>
-rw-r--r--   1 root     admin         134 Jul 15  2016 environment

switch:FID128:root> tar tvf authorizedKeys.tar
-rw-r--r-- root/admin      755 2017-12-08 11:13:06 authorized_keys.<USERNAME_1>
-rw------- root/admin     1230 2018-02-01 22:54:01 authorized_keys.<USERNAME_2>
-rw------- root/root       408 2016-03-22 10:12:37 authorized_keys.<USERNAME_3>
-rw-r--r-- root/root       606 2017-09-19 11:06:10 authorized_keys.<USERNAME_4>

switch:FID128:root> tar tvf /mnt/fabos/users/admin/.ssh/authorizedKeys.tar
-rw-r--r-- root/admin      755 2017-12-08 11:13:06 authorized_keys.<USERNAME_1>
-rw------- root/admin     1230 2018-02-01 22:54:01 authorized_keys.<USERNAME_2>
-rw------- root/root       408 2016-03-22 10:12:37 authorized_keys.<USERNAME_3>
-rw-r--r-- root/root       606 2017-09-19 11:06:10 authorized_keys.<USERNAME_4>

The permissions on some, but not all, of the authorized_keys.<USERNAME_*> files were too restrictive, since the SSH daemon was trying to read them as an effective user of the admin group. An immediate fix for this issue was to alter the permissions on the authorized_keys.<USERNAME_*> files in order to allow the admin group to read the content of the files:

switch:FID128:root> cd /fabos/users/admin/.ssh/
switch:FID128:root> chmod 640 authorized_keys.*
switch:FID128:root> chown root:admin authorized_keys.*
switch:FID128:root> tar cpf authorizedKeys.tar authorized_keys.*

switch:FID128:root> cd /mnt/fabos/users/admin/.ssh/
switch:FID128:root> chmod 640 authorized_keys.*
switch:FID128:root> chown root:admin authorized_keys.*
switch:FID128:root> tar cpf authorizedKeys.tar authorized_keys.*

Looking again at the permissions of the directory containing the authorized_keys.<USERNAME_*> files and of the files themselves now showed:

switch:FID128:root> cd /fabos/users/admin/.ssh/
switch:FID128:root> ls -la
total 44
drwxr-xr-x   2 root     admin        4096 Feb  1 22:54 ./
drwxr-xr-x   4 root     admin        4096 Jan 23 12:36 ../
-rw-r--r--   1 root     admin       10240 Feb  9 06:59 authorizedKeys.tar
-rw-------   1 root     root          408 Jan 23 12:43 authorized_keys
-rw-r-----   1 root     admin         755 Dec  8 11:13 authorized_keys.<USERNAME_1>
-rw-r-----   1 root     admin        1230 Feb  1 22:54 authorized_keys.<USERNAME_2>
-rw-r-----   1 root     admin         408 Mar 22  2016 authorized_keys.<USERNAME_3>
-rw-r-----   1 root     admin         605 Sep 19 11:06 authorized_keys.<USERNAME_4>
-rw-r--r--   1 root     admin         134 Jul 15  2016 environment

switch:FID128:root> tar tvf /fabos/users/admin/.ssh/authorizedKeys.tar
-rw-r----- root/admin      755 2017-12-08 11:13:06 authorized_keys.<USERNAME_1>
-rw-r----- root/admin     1230 2018-02-01 22:54:01 authorized_keys.<USERNAME_2>
-rw-r----- root/admin      408 2016-03-22 10:12:37 authorized_keys.<USERNAME_3>
-rw-r----- root/admin      606 2017-09-19 11:06:10 authorized_keys.<USERNAME_4>

switch:FID128:root> cd /mnt/fabos/users/admin/.ssh/
switch:FID128:root> ls -la
total 44
drwxr-xr-x   2 root     admin        4096 Feb  1 22:54 ./
drwxr-xr-x   4 root     admin        4096 Jan 23 12:50 ../
-rw-r--r--   1 root     admin       10240 Feb  1 22:54 authorizedKeys.tar
-rw-------   1 root     root          408 Jan 23 12:43 authorized_keys
-rw-r-----   1 root     admin         755 Dec  8 11:13 authorized_keys.<USERNAME_1>
-rw-r-----   1 root     admin        1230 Feb  1 22:54 authorized_keys.<USERNAME_2>
-rw-r-----   1 root     admin         408 Mar 22  2016 authorized_keys.<USERNAME_3>
-rw-r-----   1 root     admin         605 Sep 19 11:06 authorized_keys.<USERNAME_4>
-rw-r--r--   1 root     admin         134 Jan 23 12:50 environment

switch:FID128:root> tar tvf /mnt/fabos/users/admin/.ssh/authorizedKeys.tar
-rw-r----- root/admin      755 2017-12-08 11:13:06 authorized_keys.<USERNAME_1>
-rw-r----- root/admin     1230 2018-02-01 22:54:01 authorized_keys.<USERNAME_2>
-rw-r----- root/admin      408 2016-03-22 10:12:37 authorized_keys.<USERNAME_3>
-rw-r----- root/admin      606 2017-09-19 11:06:10 authorized_keys.<USERNAME_4>

With the corrected permissions on the authorized_keys.<USERNAME_*> files, CLI login via SSH public key authentication was now possible again.
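
The permission problem described above boils down to a missing group-read bit on the key files. As an illustration of that check, here is a small Python sketch that lists the files lacking it. It is a generic POSIX example and not something taken from, or meant to be run on, the switch; the function names and the file names in the demo are hypothetical.

```python
# Illustrative sketch only; not Brocade tooling. All names are hypothetical.
import os
import stat
import tempfile


def group_unreadable(paths):
    """Return the paths whose file mode lacks the group-read bit."""
    return [p for p in paths if not os.stat(p).st_mode & stat.S_IRGRP]


def demo():
    """Create one 600 and one 640 file and report the unreadable one."""
    tmpdir = tempfile.mkdtemp()
    broken = os.path.join(tmpdir, "authorized_keys.user_a")
    fixed = os.path.join(tmpdir, "authorized_keys.user_b")
    for path, mode in ((broken, 0o600), (fixed, 0o640)):
        open(path, "w").close()
        os.chmod(path, mode)   # chmod is not affected by the umask
    return group_unreadable([broken, fixed])
```

Calling demo() returns only the path of the file with mode 600, which corresponds to the files the SSH daemon could not open.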

Unfortunately this is only a temporary workaround, since the next upload of an SSH public key with the sshUtil importpubkey command will likely set the wrong permissions on the newly created or replaced authorized_keys.<USERNAME_*> file. This is because the root cause of the issue actually lies in the sshUtil importpubkey command. The snippet of strace output shown below was captured from a running sshUtil importpubkey command:

[...]
chdir("/fabos/users/admin/.ssh")        = 0
[...]
[pid 10611] execve("/bin/cat", ["cat", "<USERNAME_2>_brocade_dsa.pub"], [/* 45 vars */]) = 0
[...]
[pid 10612] execve("/bin/chmod", ["/bin/chmod", "600", "authorized_keys.<USERNAME_2>"], [/* 45 vars */]) = 0
[...]
[pid 10612] lstat64("authorized_keys.<USERNAME_2>", {st_mode=S_IFREG|0640, st_size=1230, ...}) = 0
[pid 10612] chmod("authorized_keys.<USERNAME_2>", 0600) = 0
[...]
[pid 10613] execve("/bin/cp", ["cp", "-f", "authorized_keys.<USERNAME_2>", "/mnt/fabos/users/admin/.ssh/"], [/* 45 vars */]) = 0
[...]
[pid 10613] chmod("/mnt/fabos/users/admin/.ssh/authorized_keys.<USERNAME_2>", 0100600) = 0
[...]
[pid 10618] execve("/bin/tar", ["tar", "-cf", "authorizedKeys.tar", "authorized_keys.<USERNAME_1>", "authorized_keys.<USERNAME_2>", "authorized_keys.<USERNAME_3>"], [/* 45 vars */]) = 0
[...]

The /bin/chmod command in the above strace output shows that the file permission for the authorized_keys.<USERNAME_2> file is mistakenly set to 600 (-rw-------) instead of at least 640 (-rw-r-----). Exactly why this sometimes happens can't be analyzed further, since the source code of the sshUtil command is not available.

A permanent resolution to this issue will be to update to at least Fabric OS v7.4.1e. The Release Notes for Fabric OS v7.4.1e indicate this in the following known defect:

Defect ID: DEFECT000616486
Technical Severity: Medium
Probability: Medium
Product: Brocade Fabric OS
Technology Group: Security
Reported In Release: FOS7.4.1
Technology: SSH - Secure Shell
Symptom: Unable to authenticate an SSH session after importing public key to switch.
Condition: This is encountered by admin level users on a switch running Fabric OS v7.4.1d

// Check_MK Monitoring - HPE Virtual Connect Fibre Channel Modules

This article provides patches for the standard Check_MK distribution in order to add support for the monitoring of HPE Virtual Connect Fibre Channel Modules.


Out of the box, there is currently no monitoring support for HPE Virtual Connect Fibre Channel Modules in the standard Check_MK distribution. Those modules, e.g. the HPE Virtual Connect 8Gb 20-port Fibre Channel Module, are used in HPE c-Class BladeSystem to provide Fibre Channel connectivity for the individual server blades. Fortunately the modules provide status and performance data via the standard SNMP FIBRE-CHANNEL-FE-MIB defined in RFC 2837 as well as its successor, the SNMP FCMGMT-MIB defined in RFC 4044. Those two SNMP MIBs are already covered by the checks qlogic_fcport, qlogic_sanbox and qlogic_sanbox_fabric_element, which are part of the standard Check_MK distribution. This simplifies the task of adding support for the HPE Virtual Connect Fibre Channel modules and reduces it to extending the already existing checks with three rather simple patches.

For the impatient (TL;DR), here are the enhanced versions of the qlogic_fcport, qlogic_sanbox and qlogic_sanbox_fabric_element checks:

Enhanced version of the qlogic_fcport check
Enhanced version of the qlogic_sanbox check
Enhanced version of the qlogic_sanbox_fabric_element check

The sources to the enhanced versions of all three checks can be found in my Check_MK Plugins repository on GitHub.

The necessary changes to qlogic_fcport and qlogic_sanbox_fabric_element are limited to the snmp_scan_function used by the Check_MK inventory. Here, the vendor specific OIDs for the HPE Virtual Connect Fibre Channel modules are added. The following patches show the respective lines for qlogic_fcport:

qlogic_fcport.patch
--- a/checks/qlogic_fcport   2017-03-06 21:00:07.397607946 +0100
+++ b/checks/qlogic_fcport   2017-10-01 14:34:48.153710776 +0200
@@ -218,12 +218,14 @@
     # .1.3.6.1.4.1.3873.1.12 QLogic 8 Gb and 4/8 Gb Intelligent Pass-thru Module
     # .1.3.6.1.4.1.3873.1.9  QLogic SANBox 5802 FC Switch
     # .1.3.6.1.4.1.3873.1.11 HP StorageWorks 8/20q Fibre Channel Switch
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function'    : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
         or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
         or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.11") \
         or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.12") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.9"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.9") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
     'group':                   'qlogic_fcport',
     'default_levels_variable': 'qlogic_fcport_default_levels',
 }

and for qlogic_sanbox_fabric_element:

qlogic_sanbox_fabric_element.patch
--- a/checks/qlogic_sanbox_fabric_element    2017-03-06 21:00:07.397607946 +0100
+++ b/checks/qlogic_sanbox_fabric_element    2017-10-01 14:47:35.000003198 +0200
@@ -54,7 +54,9 @@
                                                            OID_END]),
     # .1.3.6.1.4.1.3873.1.14 Qlogic-Switch
     # .1.3.6.1.4.1.3873.1.8  Qlogic-4Gb SAN Switch Module for IBM BladeCenter
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function'    : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
 }

In both cases, the relevant line is:

        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),

After those two simple changes, the checks will now be able to successfully inventorize the overall fabric status as well as the status of individual ports of HPE Virtual Connect Fibre Channel modules.
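As a side note, the chained startswith() tests can be written more compactly, because Python's str.startswith() also accepts a tuple of prefixes. The following is only a sketch of that idea and not part of the patches above. The names QLOGIC_SYS_OBJECT_ID_PREFIXES, scan_qlogic and fake_oid are made up for illustration, fake_oid merely stands in for the oid callback that Check_MK passes to the scan function, and the two sysObjectID values at the end are invented test inputs.

```python
# sysObjectID prefixes as listed in the qlogic_sanbox_fabric_element patch.
QLOGIC_SYS_OBJECT_ID_PREFIXES = (
    ".1.3.6.1.4.1.3873.1.14",  # Qlogic-Switch
    ".1.3.6.1.4.1.3873.1.8",   # Qlogic-4Gb SAN Switch Module for IBM BladeCenter
    ".1.3.6.1.4.1.3873.1.16",  # HPE Virtual Connect FlexFabric
)

# OID of sysObjectID, the value the scan function inspects.
SYS_OBJECT_ID = ".1.3.6.1.2.1.1.2.0"


def scan_qlogic(oid):
    """Return True if the device's sysObjectID starts with a known prefix."""
    # 'or ""' guards against a device that does not answer the query.
    return (oid(SYS_OBJECT_ID) or "").startswith(QLOGIC_SYS_OBJECT_ID_PREFIXES)


def fake_oid(sys_object_id):
    """Hypothetical stand-in for the oid() callback, for testing only."""
    return lambda requested: sys_object_id if requested == SYS_OBJECT_ID else None


# Invented sysObjectID values: one below the HPE prefix, one unrelated.
print(scan_qlogic(fake_oid(".1.3.6.1.4.1.3873.1.16.1")))
print(scan_qlogic(fake_oid(".1.3.6.1.4.1.8072.3.2.10")))
```

Keeping the prefixes in one tuple means that supporting another module only requires one additional entry, at the cost of deviating from the style of the upstream checks.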

The necessary changes to qlogic_sanbox also require the extension of the snmp_scan_function used by the Check_MK inventory, as shown by the patches above. In addition to that, the string operations on the sensor_id need to be adjusted in order to get a more user-friendly name for the temperature and power supply sensors, which are also present in the HPE Virtual Connect Fibre Channel modules. Since the sensor IDs are encoded in the SNMP OIDs and the SNMP tree for those OIDs can vary from module to module, the simple string replacement in the original qlogic_sanbox check was replaced with a more general substitution based on a regular expression. The following patch shows the respective lines for the combined changes to qlogic_sanbox:

qlogic_sanbox.patch
--- a/checks/qlogic_sanbox   2017-03-06 21:00:07.397607946 +0100
+++ b/checks/qlogic_sanbox   2017-10-01 14:47:51.348002546 +0200
@@ -44,7 +44,7 @@
     inventory = []
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_type == "8" and sensor_characteristic == "3" and \
             sensor_name != "Temperature Status":
             inventory.append( (sensor_id, None) )
@@ -53,7 +53,7 @@
 def check_qlogic_sanbox_temp(item, _no_params, info):
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_id == item:
             sensor_status = int(sensor_status)
             if sensor_status < 0 or sensor_status >= len(qlogic_sanbox_status_map):
@@ -93,9 +93,11 @@
                                                        OID_END]),
     # .1.3.6.1.4.1.3873.1.14 Qlogic-Switch
     # .1.3.6.1.4.1.3873.1.8  Qlogic-4Gb SAN Switch Module for IBM BladeCenter
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function'    : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
 }
 
 #.
@@ -113,7 +115,7 @@
     inventory = []
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_type == "5":
             inventory.append( (sensor_id, None) )
     return inventory
@@ -121,7 +123,7 @@
 def check_qlogic_sanbox_psu(item, _no_params, info):
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_id == item:
             sensor_status = int(sensor_status)
             if sensor_status < 0 or sensor_status >= len(qlogic_sanbox_status_map):
@@ -153,7 +155,9 @@
                                                        OID_END]),
     # .1.3.6.1.4.1.3873.1.14 Qlogic-Switch
     # .1.3.6.1.4.1.3873.1.8  Qlogic-4Gb SAN Switch Module for IBM BladeCenter
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function'    : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
 }

After those additional, but still simple, changes, the check will now be able to successfully inventorize the temperature and power supply sensors of HPE Virtual Connect Fibre Channel modules.
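To illustrate what the regular expression from the patch does, here is a small stand-alone sketch. The pattern is copied from the patch, but the sensor IDs used as input are invented for this example and do not come from a real module. The helper name short_sensor_id is likewise hypothetical.

```python
import re

# Pattern copied from the qlogic_sanbox patch: a known prefix, anything in
# between, then a run of eight ".0" components and the trailing dot.
SENSOR_ID_PREFIX = re.compile(
    r'^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)'
    r'\..*0\.0\.0\.0\.0\.0\.0\.0\.'
)


def short_sensor_id(sensor_id):
    """Strip the prefix up to and including the zero run, if present."""
    return SENSOR_ID_PREFIX.sub('', sensor_id)


# Invented sensor IDs, one per prefix, plus one that does not match at all.
for sensor_id in (
    "16.0.0.192.221.48.12.34.56.0.0.0.0.0.0.0.0.1",
    "16.0.116.70.160.113.1.2.3.0.0.0.0.0.0.0.0.2",
    "1.2.3",
):
    print("%s -> %s" % (sensor_id, short_sensor_id(sensor_id)))
```

For the two matching inputs only the last component remains, while the non-matching ID is passed through unchanged. Because the middle part is matched with a wildcard, the substitution does not depend on the module-specific components between the prefix and the zero run.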

// Ganglia Fibre Channel Power/Attenuation Monitoring on AIX and VIO Servers

Although usually only available upon request via IBM support, efc_power is quite a handy tool when it comes to debugging or narrowing down fibre channel link issues. It provides the transmit and receive power and attenuation values for a given FC port on an AIX or VIO server. Fortunately the output of efc_power:

$ /opt/freeware/bin/efc_power /dev/fscsi2
TX: 1232 -> 0.4658 mW, -3.32 dBm
RX: 10a9 -> 0.4265 mW, -3.70 dBm

is very parser-friendly, so it can easily be read by a script for further processing. In this case, further processing means continuous Ganglia monitoring of the transmit and receive power and attenuation values for each FC port on an AIX or VIO server. This is accomplished by the two RPM packages ganglia-addons-aix and ganglia-addons-aix-scripts:

RPM packages

Source RPM packages

Filename                                  Filesize  Last modified
ganglia-addons-aix-0.1-1.src.rpm          6.6 KiB   2013/07/30 09:41
ganglia-addons-aix-0.1-1.src.rpm.sha1sum  75.0 B    2013/07/30 09:41

The package ganglia-addons-aix-scripts is to be installed on the AIX or VIO server which has the FC adapter installed. It depends on the aaa_base package for the efc_power binary and on the ganglia-addons-base package, specifically on the cronjob (/opt/freeware/etc/run_parts/conf.d/ganglia-addons.sh) defined by this package. In the context of this cronjob, all available scripts in the directory /opt/freeware/libexec/ganglia-addons/ are executed. This specific Ganglia addon iterates over all fscsi devices in the system and calls efc_power for each of them. Devices can be excluded by assigning a regex pattern to the BLACKLIST variable in the configuration file /opt/freeware/etc/ganglia-addons/ganglia-addons-efc_power.cfg. The output of each efc_power call is parsed and fed via the gmetric command into a Ganglia monitoring system, which has to be set up already.
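The parsing and reporting steps described above could look roughly like the following sketch. It is not the script shipped in the RPM package, whose implementation is not shown here. The function names, the metric naming scheme fc_<direction>_<unit>_<device> and the example blacklist pattern are assumptions made for illustration. Only the efc_power sample output is taken from this article, and the gmetric options used are --name, --value, --type and --units. The sketch only builds the gmetric command lines and does not execute them.

```python
import re

# One line of efc_power output, e.g. "TX: 1232 -> 0.4658 mW, -3.32 dBm".
LINE_RE = re.compile(
    r"^(TX|RX):\s+([0-9a-fA-F]+)\s+->\s+([0-9.]+)\s+mW,\s+(-?[0-9.]+)\s+dBm\s*$"
)

# Sample output as shown earlier in this article.
SAMPLE = """TX: 1232 -> 0.4658 mW, -3.32 dBm
RX: 10a9 -> 0.4265 mW, -3.70 dBm
"""


def parse_efc_power(output):
    """Return {"TX": {"mW": ..., "dBm": ...}, "RX": {...}} from the output."""
    values = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            direction, _raw, milliwatt, dbm = match.groups()
            values[direction] = {"mW": float(milliwatt), "dBm": float(dbm)}
    return values


def is_blacklisted(device, blacklist):
    """True if a non-empty blacklist regex matches the device name."""
    return bool(blacklist) and re.search(blacklist, device) is not None


def gmetric_commands(device, values):
    """Build one gmetric argument list per direction and unit (hypothetical names)."""
    commands = []
    for direction in sorted(values):
        for unit in ("mW", "dBm"):
            name = "fc_%s_%s_%s" % (direction.lower(), unit.lower(), device)
            commands.append([
                "gmetric",
                "--name", name,
                "--value", str(values[direction][unit]),
                "--type", "float",
                "--units", unit,
            ])
    return commands


if not is_blacklisted("fscsi2", "^fscsi[01]$"):
    for command in gmetric_commands("fscsi2", parse_efc_power(SAMPLE)):
        print(" ".join(command))
```

In a real script the sample string would be replaced by the output of an efc_power call for each fscsi device, and the argument lists would be handed to a process runner instead of being printed.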

The package ganglia-addons-aix is to be installed on the host running the Ganglia web interface. It contains templates for the customization of the FC power and attenuation metrics within the Ganglia Web 2 interface. See the README.templates file for further installation instructions. Here are samples of the two graphs created with those Ganglia monitoring templates:

Example of FC power and attenuation with a bad cable

In section “1” of the graphs, the receive attenuation on FC port fscsi2 was about -7.7 dBm, which means that of the 476.6 uW sent from the Brocade switchport:

$ sfpshow 1/10

Identifier:  3    SFP
Connector:   7    LC
Transceiver: 540c404000000000 200,400,800_MB/s M5,M6 sw Short_dist
Encoding:    1    8B10B
Baud Rate:   85   (units 100 megabaud)
Length 9u:   0    (units km)
Length 9u:   0    (units 100 meters)
Length 50u:  5    (units 10 meters)
Length 62.5u:2    (units 10 meters)
Length Cu:   0    (units 1 meter)
Vendor Name: BROCADE
Vendor OUI:  00:05:1e
Vendor PN:   57-1000012-01
Vendor Rev:  A
Wavelength:  850  (units nm)
Options:     003a Loss_of_Sig,Tx_Fault,Tx_Disable
BR Max:      0
BR Min:      0
Serial No:   UAF1112600001JW
Date Code:   110619
DD Type:     0x68
Enh Options: 0xfa
Status/Ctrl: 0x82
Alarm flags[0,1] = 0x5, 0x40
Warn Flags[0,1] = 0x5, 0x40
                                          Alarm                  Warn
                                      low        high       low         high
Temperature: 41      Centigrade     -10         90         -5          85
Current:     7.392   mAmps          1.000       17.000     2.000       14.000
Voltage:     3264.9  mVolts         2900.0      3700.0     3000.0      3600.0
RX Power:    -4.0    dBm (400.1 uW) 10.0   uW   1258.9 uW  15.8   uW   1000.0 uW
TX Power:    -3.2    dBm (476.6 uW) 125.9  uW   631.0  uW  158.5  uW   562.3  uW

only about 200 uW actually made it to the FC port fscsi2 on the VIO server. Section “2” shows even worse values during the time the FC connections and cables were checked, which basically means that the FC link was down during that time period. Section “3” shows the values after the bad cable was found and replaced. Receive attenuation on FC port fscsi2 went down to about -3.7 dBm, which means that of the now 473.6 uW sent from the Brocade switchport, 427.3 uW actually make it to the FC port fscsi2 on the VIO server.
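The relation between the dBm and uW figures quoted above can be checked with a few lines of code. A dBm value is an absolute power level relative to 1 mW, so P[mW] = 10^(dBm/10), and the loss over the link in dB is the difference between the transmit and receive levels. The following is a minimal sketch with made-up function names, using only values mentioned in this article as input.

```python
import math


def dbm_to_microwatt(dbm):
    """Convert a power level in dBm to microwatts (0 dBm = 1 mW = 1000 uW)."""
    return 1000.0 * 10 ** (dbm / 10.0)


def microwatt_to_dbm(microwatt):
    """Convert a power in microwatts to dBm."""
    return 10.0 * math.log10(microwatt / 1000.0)


def link_loss_db(tx_dbm, rx_dbm):
    """Loss over the link in dB: transmit level minus receive level."""
    return tx_dbm - rx_dbm


# Receive levels before (-7.7 dBm) and after (-3.7 dBm) the cable replacement.
print("%.1f uW" % dbm_to_microwatt(-7.7))
print("%.1f uW" % dbm_to_microwatt(-3.7))
# Transmit power reported by sfpshow, converted back to dBm.
print("%.2f dBm" % microwatt_to_dbm(476.6))
# Loss with the bad cable, using the switch TX level of -3.2 dBm.
print("%.1f dB" % link_loss_db(-3.2, -7.7))
```

By this formula -7.7 dBm corresponds to roughly 170 uW and -3.7 dBm to roughly 427 uW, which is in line with the values read from the graphs. With the bad cable the link lost about 4.5 dB, compared to about 0.5 dB after the replacement.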

The goal of continuously monitoring the fibre channel transmit and receive power and attenuation values is to catch slowly deteriorating situations early on, before they become a real issue or even a service interruption. As shown above, this can be accomplished with Ganglia and the two RPM packages ganglia-addons-aix and ganglia-addons-aix-scripts. For ad hoc checks, e.g. during the debugging of the components in a suspicious FC link, efc_power is still best called directly from the AIX or VIO server command line.
