bityard Blog

// Experiences with Dell PowerConnect Switches

This blog post is about my recent experiences with the Broadcom FASTPATH based Dell PowerConnect M-Series M8024-k and M6348 switches, especially their various limitations and – in my opinion – sometimes buggy behaviour.


Recently I was given the opportunity to build a new and central storage and virtualization environment from the ground up. This involved a set of hardware systems which – unfortunately – had been chosen and purchased before I came on board with the project.

System environment

Specifically those hardware components were:

  • Multiple Dell PowerEdge M1000e blade chassis

  • Multiple Dell PowerEdge M-Series blade servers, all equipped with Intel X520 network interfaces for LAN connectivity through fabric A of the blade chassis. Servers with additional central storage requirements were also equipped with QLogic QME/QMD8262 or QLogic/Broadcom BCM57810S iSCSI HBAs for SAN connectivity through fabric B of the blade chassis.

  • Multiple Dell PowerConnect M8024-k switches in fabric A of the blade chassis forming the LAN network. Those were configured and interconnected as a stack of switches. Each stack of switches had two uplinks, one to each of two carrier grade Cisco border routers. Since the network edge was between those two border routers on the one side and the stack of M8024-k switches on the other side, the switch stack was also used as a layer 3 device and was thus running the default gateways of the local network segments provided to the blade servers.

  • Multiple Dell PowerConnect M6348 switches, which were connected through aggregated links to the stack of M8024-k switches described above. These switches were exclusively used to provide a LAN connection for external, standalone servers and devices through their external 1 GBit ethernet interfaces. The M6348 switches were located in the slots belonging to fabric C of the blade chassis.

  • Multiple Dell PowerConnect M8024-k switches in fabric B of the blade chassis forming the SAN network. In contrast to the M8024-k LAN switches, the M8024-k SAN switches were configured and interconnected as individual switches. Since there was no need for outside SAN connectivity, the M8024-k switches in fabric B ran a flat layer 2 network without any layer 3 configuration.

  • Initially all PowerConnect switches – M8024-k both LAN and SAN and M6348 – ran the firmware version 5.1.8.2.

  • Multiple Dell EqualLogic PS Series storage systems, providing central block storage capacity for the PowerEdge M-Series blade servers via iSCSI over the SAN mentioned above. Some blade chassis based PS Series models (PS-M4110) were internally connected to the SAN formed by the M8024-k switches in fabric B. Other standalone PS Series models were connected to the same SAN utilizing the external ports of the M8024-k switches.

  • Multiple Dell EqualLogic FS Series file server appliances, providing central NFS and CIFS storage capacity over the LAN mentioned above. In the back-end those FS Series file server appliances also used the block storage capacity provided by the PS Series storage systems via iSCSI over the SAN mentioned above. Both LAN and SAN connections of the EqualLogic FS Series were made through the external ports of the M8024-k switches.

There were multiple locations with roughly the same setup composed of the hardware components described above. Each location had two daisy-chained Dell PowerEdge M1000e blade chassis systems. The layer 2 LAN and SAN networks stretched over the two blade chassis. The setup at each location is shown in the following schematic:

Schematic of the Dell PowerConnect LAN and SAN setup

All in all not an ideal setup. Instead, I would have preferred a pair of capable – both functionality- and performance-wise – central top-of-rack switches to which the individual M1000e blade chassis would have been connected, preferably a separate pair each for LAN and SAN connectivity. But again, the mentioned components were already preselected and pre-purchased.

During the implementation and later the operational phase, several limitations and issues surfaced with regard to the Dell PowerConnect switches and the networks built with them. The following – probably not exhaustive – list of limitations and issues I've encountered is in no particular order with regard to their occurrence or severity.

Limitations

  • While the Dell PowerConnect switches support VRRP as a redundancy protocol for layer 3 instances, there is only support for VRRP version 2, described in RFC 3768. This limits the use of VRRP to IPv4 only. VRRP version 3 described in RFC 5798, which is needed for the implementation of redundant layer 3 instances for both IPv4 and IPv6, is not supported by Dell PowerConnect switches. Due to this limitation and the need for full IPv6 support in the whole environment, the design decision was made to run the Dell PowerConnect M8024-k switches for the LAN as a stack of switches.

  • Limited support of routing protocols. There is only support for the routing protocols OSPF and RIP v2 in Dell PowerConnect switches. In this specific setup and triggered by the design decision to run the LAN switches as layer 3 devices, BGP would have been a more suitable routing protocol. Unfortunately there were no plans to implement BGP on the Dell PowerConnect devices.

  • Limitation in the number of secondary interface addresses. Only one IPv4 secondary address is supported per interface on a layer 3 instance running on the Dell PowerConnect switches. Compared to e.g. Cisco-based layer 3 capable switches, this limitation caused – in this particular setup – the need for a lot more (VLAN) interfaces than would otherwise have been necessary.

  • No IPv6 secondary interface addresses. For IPv6 based layer 3 instances there is no support at all for secondary interface addresses, although this might be a fundamental rather than a product-specific limitation.

  • For layer 3 instances in general there is no support for very small IPv4 subnets (e.g. /31 with 2 IPv4 addresses), which are usually used for transfer networks. In setups using private IPv4 address ranges this is no big issue. In this case though, public IPv4 addresses were used, and in conjunction with the excessive need for VLAN interfaces this limitation caused a lot of wasted public IPv4 addresses.

  • The access control list (ACL) feature of the Dell PowerConnect switches is very limited and rather rudimentary. There is no support for port ranges, no statefulness, and each access list has a hard limit of 256 access list entries. All three – and possibly even more – limitations in combination make the ACL feature of the Dell PowerConnect switches almost useless, especially if there are separate layer 3 networks on the system which are in need of fine-grained traffic control.

  • Regarding the performance aspect of ACLs, I have gotten the impression that especially IPv6 ACLs are handled by the switch's CPU. If IPv6 is used in conjunction with extensive ACLs, this would dramatically impact the network performance of IPv6-based traffic. Admittedly, I have no hard proof to support this suspicion.

  • The out-of-band (OOB) management interface of the Dell PowerConnect switches does not provide true out-of-band management. Instead it is integrated into the switch as just another IP interface – although one with a special purpose. Due to this interaction of the OOB with the IP stack of the Dell PowerConnect switch, there are side-effects when the switch is running at least one layer 3 instance. In this case, the standard IP routing table of the switch is not only used for routing decisions of the payload traffic, but also to determine the destination of packets originating from the OOB interface. This behaviour can cause an asymmetric traffic flow when the systems connecting to the OOB are covered by an entry in the switch's IP routing table. Far from ideal when it comes to true OOB management, not to mention the issues arising when there are also stateful firewall rules involved.

    I addressed this limitation with a support case at Dell and got the following statement back:

    FASTPATH can learn a default gateway for the service port, the network port,
    or a routing interface. The IP stack can only have a single default gateway.
    (The stack may accept multiple default routes, but if we let that happen we may
    end up with load balancing across the network and service port or some other
    combination we don't want.) RTO may report an ECMP default route. We only give
    the IP stack a single next hop in this case, since it's not likely we need to
    additional capacity provided by load sharing for packets originating on the
    box.

    The precedence of default gateways is as follows:
    - via routing interface
    - via service port
    - via network port

    As per the above precedence, ip stack is having the default gateway which is
    configured through RTO. When the customer is trying to ping the OOB from
    different subnet , route table donesn't have the exact route so,it prefers the
    default route and it is having the RTO default gateway as next hop ip. Due to
    this, it egresses from the data port.

    If we don't have the default route which is configured through RTO then IP
    stack is having the OOB default gateway as next hop ip. So, it egresses from
    the OOB IP only.

    In my opinion this just confirms how the OOB management of the Dell PowerConnect switches is severely broken by design.

  • Another issue with the out-of-band (OOB) management interface of the Dell PowerConnect switches is that they support only a very limited access control list (ACL) in order to protect the access to the switch. The management ACL only supports one IPv4 ACL entry. IPv6 support within the management ACL protecting the OOB interface is missing altogether.

  • The Dell PowerConnect switches have no support for Shortest Path Bridging (SPB) as defined in the IEEE 802.1aq standard. On layer 2, the traditional spanning-tree protocols STP (IEEE 802.1D), RSTP (IEEE 802.1w) or MSTP (IEEE 802.1s) have to be used. This is particularly a drawback in the SAN network shown in the schematic above, due to the protocol-determined inactivity of one inter-switch link. With the use of SPB, all inter-switch links could be equally utilized and a traffic interruption upon link failure and spanning-tree (re)convergence could be avoided.

  • Another SAN-specific limitation is the incomplete implementation of Data Center Bridging (DCB) in the Dell PowerConnect switches. Although the protocols Priority-based Flow Control (PFC) according to IEEE 802.1Qbb and Congestion Notification (CN) according to IEEE 802.1Qau are supported, the third needed protocol, Enhanced Transmission Selection (ETS) according to IEEE 802.1Qaz, is missing in the Dell PowerConnect switches. The Dell EqualLogic PS Series storage systems used in the setup shown above explicitly need ETS if DCB is to be used on layer 2. Since ETS is not implemented in the Dell PowerConnect switches, the traditional layer 2 protocols had to be used in the SAN.

Issues

  • Not per se an issue, but the baseline CPU utilization on Dell PowerConnect M8024-k switches running layer 3 instances is significantly higher compared to those running only as layer 2 devices. The following CPU utilization graphs show a direct comparison of a layer 3 (upper graph) and a layer 2 (lower graph) device:

    CPU utilization on a Dell PowerConnect M8024-k switch as a Layer 3 device
    CPU utilization on a Dell PowerConnect M8024-k switch as a Layer 2 device

    The CPU utilization is between 10 and 15% higher once the tasks of processing layer 3 traffic are involved. What kind of switch function or what type of traffic is causing this additional CPU utilization is completely opaque. Documentation on such in-depth subjects or details on how the processing within the Dell PowerConnect switches works is very scarce. It would be very interesting to know what kind of traffic is sent to the switch's CPU for processing instead of being handled by the hardware.

  • The very high CPU utilization plateau on the right hand side of the upper graph (approximately between 10:50 - 11:05) was due to a bug in the processing of IPv6 traffic on Dell PowerConnect switches. This issue caused IPv6 packets to be sent to the switch's CPU for processing instead of the forwarding decision being made in hardware. I narrowed down the issue by transferring a large file between two hosts via the SCP protocol. In the first case, an IPv6 connection was used, determined by preferred name resolution via DNS:

    user@host1:~$ scp testfile.dmp user@host2:/var/tmp/
    testfile.dmp                                   8%  301MB 746.0KB/s 1:16:05 ETA

    The CPU utilization during the transfer was monitored on the CLI of the switch stack:

    stack1(config)# show process cpu
    
    Memory Utilization Report
    
    status      bytes
    ------ ----------
      free  170642152
     alloc  298144904
    
    CPU Utilization:
    
      PID      Name                    5 Secs     60 Secs    300 Secs
    -----------------------------------------------------------------
     41be030 tNet0                     27.05%      30.44%      21.13%
     41cbae0 tXbdService                2.60%       0.40%       0.09%
     43d38d0 ipnetd                     0.40%       0.11%       0.11%
     43ee580 tIomEvtMon                 0.40%       0.09%       0.22%
     43f7d98 osapiTimer                 2.00%       3.56%       3.13%
     4608b68 bcmL2X.0                   0.00%       0.08%       1.16%
     462f3a8 bcmCNTR.0                  1.00%       0.87%       1.04%
     4682d40 bcmTX                      4.20%       5.12%       3.83%
     4d403a0 bcmRX                      9.21%      12.64%      10.35%
     4d60558 bcmNHOP                    0.80%       0.21%       0.11%
     4d72e10 bcmATP-TX                  0.80%       0.24%       0.32%
     4d7c310 bcmATP-RX                  0.20%       0.12%       0.14%
     53321e0 MAC Send Task              0.20%       0.19%       0.40%
     533b6e0 MAC Age Task               0.00%       0.05%       0.09%
     5d59520 bcmLINK.0                  5.41%       2.75%       2.15%
     84add18 tL7Timer0                  0.00%       0.22%       0.23%
     84ca140 osapiWdTask                0.00%       0.05%       0.05%
     84d3640 osapiMonTask               0.00%       0.00%       0.01%
     84d8b40 serialInput                0.00%       0.00%       0.01%
     95e8a70 servPortMonTask            0.40%       0.09%       0.12%
     975a370 portMonTask                0.00%       0.06%       0.09%
     9783040 simPts_task                0.80%       0.73%       1.40%
     9b70100 dtlTask                    5.81%       7.52%       5.62%
     9dc3da8 emWeb                      0.40%       0.12%       0.09%
     a1c9400 hapiRxTask                 4.00%       8.84%       6.46%
     a65ba38 hapiL3AsyncTask            1.60%       0.45%       0.37%
     abcd0c0 DHCP snoop                 0.00%       0.00%       0.20%
     ac689d0 Dynamic ARP Inspect        0.40%       0.10%       0.05%
     ac7a6c0 SNMPTask                   0.40%       0.19%       0.95%
     b8fa268 dot1s_timer_task           1.00%       0.78%       2.74%
     b9134c8 dot1s_task                 0.20%       0.07%       0.04%
     bdb63e8 dot1xTimerTask             0.00%       0.03%       0.02%
     c520db8 radius_task                0.00%       0.02%       0.05%
     c52a0b0 radius_rx_task             0.00%       0.03%       0.03%
     c58a2e0 tacacs_rx_task             0.20%       0.06%       0.15%
     c59ce70 unitMgrTask                0.40%       0.10%       0.20%
     c5c7410 umWorkerTask               1.80%       0.27%       0.13%
     c77ef60 snoopTask                  0.60%       0.25%       0.16%
     c8025a0 dot3ad_timer_task          1.00%       0.24%       0.61%
     ca2ab58 dot3ad_core_lac_tas        0.00%       0.02%       0.00%
     d1860b0 dhcpsPingTask              0.20%       0.13%       0.39%
     d18faa0 SNTP                       0.00%       0.02%       0.01%
     d4dc3b0 sFlowTask                  0.00%       0.00%       0.03%
     d6a4448 spmTask                    0.00%       0.13%       0.14%
     d6b79c8 fftpTask                   0.40%       0.06%       0.01%
     d6dcdf0 tCkptSvc                   0.00%       0.00%       0.01%
     d7babe8 ipMapForwardingTask        0.40%       0.18%       0.29%
     dba91b8 tArpCallback               0.00%       0.04%       0.04%
     defb340 ARP Timer                  2.60%       0.92%       1.29%
     e1332f0 tRtrDiscProcessingT        0.00%       0.00%       0.11%
    12cabe30 ip6MapLocalDataTask        0.00%       0.03%       0.01%
    12cb5290 ip6MapExceptionData       11.42%      12.95%       9.41%
    12e1a0d8 lldpTask                   0.60%       0.17%       0.30%
    12f8cd10 dnsTask                    0.00%       0.00%       0.01%
    140b4e18 dnsRxTask                  0.00%       0.03%       0.03%
    14176898 DHCPv4 Client Task         0.00%       0.01%       0.02%
    1418a3f8 isdpTask                   0.00%       0.00%       0.10%
    14416738 RMONTask                   0.00%       0.20%       0.42%
    144287f8 boxs Req                   0.20%       0.09%       0.21%
    15c90a18 sshd                       0.40%       0.07%       0.07%
    15cde0e0 sshd[0]                    0.20%       0.05%       0.02%
    -----------------------------------------------------------------
     Total CPU Utilization             89.77%      92.50%      77.29%

    In the second case, an IPv4 connection was deliberately chosen:

    user@host1:~$ scp testfile.dmp user@10.0.0.1:/var/tmp/
    testfile.dmp                                 100% 3627MB  31.8MB/s   01:54

    Not only was the transfer rate of the SCP copy process significantly higher – and the transfer time thus much lower – in the second case using an IPv4 connection, but the CPU utilization on the switch stack during the transfer was also much lower:

    stack1(config)# show process cpu
    
    Memory Utilization Report
    
    status      bytes
    ------ ----------
      free  170642384
     alloc  298144672
    
    CPU Utilization:
    
      PID      Name                    5 Secs     60 Secs    300 Secs
    -----------------------------------------------------------------
     41be030 tNet0                      0.80%      23.49%      21.10%
     41cbae0 tXbdService                0.00%       0.17%       0.08%
     43d38d0 ipnetd                     0.20%       0.14%       0.12%
     43ee580 tIomEvtMon                 0.60%       0.26%       0.24%
     43f7d98 osapiTimer                 2.20%       3.10%       3.08%
     4608b68 bcmL2X.0                   4.20%       1.10%       1.22%
     462f3a8 bcmCNTR.0                  0.80%       0.80%       0.99%
     4682d40 bcmTX                      0.20%       3.35%       3.59%
     4d403a0 bcmRX                      4.80%       9.90%      10.06%
     4d60558 bcmNHOP                    0.00%       0.11%       0.10%
     4d72e10 bcmATP-TX                  1.00%       0.30%       0.32%
     4d7c310 bcmATP-RX                  0.00%       0.14%       0.15%
     53321e0 MAC Send Task              0.80%       0.39%       0.42%
     533b6e0 MAC Age Task               0.00%       0.12%       0.10%
     5d59520 bcmLINK.0                  1.80%       2.38%       2.14%
     84add18 tL7Timer0                  0.00%       0.11%       0.20%
     84ca140 osapiWdTask                0.00%       0.05%       0.05%
     84d3640 osapiMonTask               0.00%       0.00%       0.01%
     84d8b40 serialInput                0.00%       0.00%       0.01%
     95e8a70 servPortMonTask            0.20%       0.09%       0.11%
     975a370 portMonTask                0.00%       0.06%       0.09%
     9783040 simPts_task                3.20%       1.54%       1.49%
     9b70100 dtlTask                    0.20%       5.47%       5.45%
     9dc3da8 emWeb                      0.40%       0.13%       0.09%
     a1c9400 hapiRxTask                 0.20%       6.46%       6.30%
     a65ba38 hapiL3AsyncTask            0.40%       0.37%       0.35%
     abcd0c0 DHCP snoop                 0.00%       0.02%       0.18%
     ac689d0 Dynamic ARP Inspect        0.40%       0.15%       0.07%
     ac7a6c0 SNMPTask                   0.00%       1.32%       1.12%
     b8fa268 dot1s_timer_task           7.21%       2.99%       2.97%
     b9134c8 dot1s_task                 0.00%       0.03%       0.03%
     bdb63e8 dot1xTimerTask             0.00%       0.01%       0.02%
     c520db8 radius_task                0.00%       0.01%       0.04%
     c52a0b0 radius_rx_task             0.00%       0.03%       0.03%
     c58a2e0 tacacs_rx_task             0.20%       0.21%       0.17%
     c59ce70 unitMgrTask                0.60%       0.20%       0.21%
     c5c7410 umWorkerTask               0.20%       0.17%       0.12%
     c77ef60 snoopTask                  0.20%       0.18%       0.15%
     c8025a0 dot3ad_timer_task          2.20%       0.80%       0.68%
     d1860b0 dhcpsPingTask              1.80%       0.58%       0.45%
     d18faa0 SNTP                       0.00%       0.00%       0.01%
     d4dc3b0 sFlowTask                  0.20%       0.03%       0.03%
     d6a4448 spmTask                    0.20%       0.15%       0.14%
     d6b79c8 fftpTask                   0.00%       0.02%       0.01%
     d6dcdf0 tCkptSvc                   0.00%       0.00%       0.01%
     d7babe8 ipMapForwardingTask        0.20%       0.19%       0.28%
     dba91b8 tArpCallback               0.00%       0.06%       0.05%
     defb340 ARP Timer                  4.60%       1.54%       1.36%
     e1332f0 tRtrDiscProcessingT        0.40%       0.14%       0.12%
    12cabe30 ip6MapLocalDataTask        0.00%       0.01%       0.01%
    12cb5290 ip6MapExceptionData        0.00%       8.60%       8.91%
    12cbe790 ip6MapNbrDiscTask          0.00%       0.02%       0.00%
    12e1a0d8 lldpTask                   0.80%       0.24%       0.29%
    12f8cd10 dnsTask                    0.00%       0.00%       0.01%
    140b4e18 dnsRxTask                  0.40%       0.07%       0.04%
    14176898 DHCPv4 Client Task         0.00%       0.00%       0.02%
    1418a3f8 isdpTask                   0.00%       0.00%       0.09%
    14416738 RMONTask                   1.00%       0.44%       0.44%
    144287f8 boxs Req                   0.40%       0.16%       0.21%
    15c90a18 sshd                       0.20%       0.06%       0.06%
    15cde0e0 sshd[0]                    0.00%       0.03%       0.02%
    -----------------------------------------------------------------
     Total CPU Utilization             43.28%      78.79%      76.50%

    Comparing the two output samples above by per-process CPU utilization showed that the major share of the higher CPU utilization in the case of an IPv6 connection is allotted to the processes tNet0, bcmTX, bcmRX, bcmLINK.0, dtlTask, hapiRxTask and ip6MapExceptionData. In a process by process comparison, those seven processes used 60.3% more CPU time in the case of an IPv6 connection compared to the case using an IPv4 connection. Unfortunately the documentation on what the individual processes are exactly doing is very sparse or not available at all. In order to further analyze this issue, a support case with the collected information was opened with Dell. A fix for the described issue was made available with firmware version 5.1.9.3.
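
    For reference, the per-process comparison can be reproduced with a few lines of Python. This is only a rough sketch of my own: the values are the "5 Secs" averages copied from the two show process cpu outputs above, and the resulting percentage depends on which averaging column is used.

    # Rough sketch: per-process CPU utilization delta between the IPv6 and the
    # IPv4 SCP transfer, using the "5 Secs" column of the two samples above.
    ipv6_case = {
        "tNet0": 27.05, "bcmTX": 4.20, "bcmRX": 9.21, "bcmLINK.0": 5.41,
        "dtlTask": 5.81, "hapiRxTask": 4.00, "ip6MapExceptionData": 11.42,
    }
    ipv4_case = {
        "tNet0": 0.80, "bcmTX": 0.20, "bcmRX": 4.80, "bcmLINK.0": 1.80,
        "dtlTask": 0.20, "hapiRxTask": 0.20, "ip6MapExceptionData": 0.00,
    }
    for name in sorted(ipv6_case):
        print("%-20s %+6.2f%%" % (name, ipv6_case[name] - ipv4_case[name]))
    print("%-20s %+6.2f%%" % ("Total", sum(ipv6_case.values()) - sum(ipv4_case.values())))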

  • The LAN stack of several Dell PowerConnect M8024-k switches sometimes showed erratic behaviour. There were several occasions where the switch stack would suddenly show a hugely increased latency in packet processing or where it would just stop passing certain types of traffic altogether. Usually a reload of the stack would restore its operation, and the increased latency or the packet drops would disappear with the reload as suddenly as they had appeared. The root cause of this was unfortunately never really found. Maybe it was the combination of functions (layer 3, dual stack IPv4 and IPv6, extensive ACLs, etc.) that were running simultaneously on the stack in this setup.

  • During both planned and unplanned failovers of the master switch in the stack, there is a time period of up to 120 seconds where no packets are processed by the switch stack. This occurs even with continuous forwarding enabled. I had a strong suspicion that this issue was related to the layer 3 instances running on the switch stack. A comparison between a pure layer 2 stack and a layer 3 enabled stack in a controlled test environment confirmed this. As soon as at least one layer 3 instance was added, the described delay occurred on switch failovers. The fact that migrating layer 3 instances from the former master switch to the new one takes some time makes sense to me. What's unclear to me is why this seems to also affect the layer 2 traffic going over the stack.

  • There were several occasions where the hardware and software MAC tables of the Dell PowerConnect switches got out of sync. While the root cause (hardware defect, bit flip, power surge, cosmic radiation, etc.) of this issue is unknown, the effect was a sudden reboot of the affected switch. Luckily we had console servers in place, which were storing a console output history from the time the issue occurred. After raising a support case with Dell with the information from the console output, we got a firmware update (v5.1.9.4) in which the issue no longer triggers a sudden reboot, but instead logs an appropriate message to the switch's log. With this fix, out of sync MAC tables still require a reboot of the affected switch, but this can now be done in a controlled fashion. Still, a solution requiring no reboot at all would have been much preferable.

  • While querying the Dell PowerConnect switches with the SNMP protocol for monitoring purposes, obscure and confusing messages containing the string MGMT_ACAL would reproducibly be logged into the switch's log. See the article Check_MK Monitoring - Dell PowerConnect Switches - Global Status in this blog for the gory details.

  • With a stack of Dell PowerConnect M8024-k switches, the information provided via the SNMP protocol would occasionally get out of sync with the information available from the CLI. For example, the temperature values from the stack stack1 of LAN switches compared to the standalone SAN switches standalone{1,2,3,4,5,6}:

    user@host:# for HST in stack1 standalone1 standalone2 standalone3 stack2 standalone4 standalone5 standalone6; do 
      echo "$HST: ";
      for OID in 4 5; do
        echo -n "  ";
        snmpbulkwalk -v2c -c [...] -m '' -M '' -Cc -OQ -OU -On -Ot $HST .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.${OID};
      done;
    done
    
    stack1: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4 = No Such Object available on this agent at this OID
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5 = No Such Object available on this agent at this OID
    standalone1: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 40
    standalone2: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 37
    standalone3: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 32
    stack2: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.2.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 42
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.2.0 = 41
    standalone4: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 39
    standalone5: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 39
    standalone6: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 35

    At the same time the CLI management interface of the switch stack showed the correct temperature values:

    stack1# show system       
    
    System Description: Dell Ethernet Switch
    System Up Time: 89 days, 01h:50m:11s
    System Name: stack1
    Burned In MAC Address: F8B1.566E.4AFB
    System Object ID: 1.3.6.1.4.1.674.10895.3041
    System Model ID: PCM8024-k
    Machine Type: PowerConnect M8024-k
    Temperature Sensors:
    
    Unit     Description       Temperature    Status
                                (Celsius)
    ----     -----------       -----------    ------
    1        System            39             Good
    2        System            39             Good
    [...]

    Only after a reboot of the switch stack did the information provided via the SNMP protocol:

    user@host:# for HST in stack1 standalone1 standalone2 standalone3 stack2 standalone4 standalone5 standalone6; do 
      echo "$HST: ";
      for OID in 4 5; do
        echo -n "  ";
        snmpbulkwalk -v2c -c [...] -m '' -M '' -Cc -OQ -OU -On -Ot $HST .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.${OID};
      done;
    done
    
    stack1: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.2.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 37
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.2.0 = 37
    standalone1: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 39
    standalone2: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 37
    standalone3: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 32
    stack2: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.2.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 41
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.2.0 = 41
    standalone4: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 38
    standalone5: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 38
    standalone6: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 34

    once again match the information available from the CLI:

    stack1# show system   
    
    System Description: Dell Ethernet Switch
    System Up Time: 0 days, 00h:05m:32s
    System Name: stack1
    Burned In MAC Address: F8B1.566E.4AFB
    System Object ID: 1.3.6.1.4.1.674.10895.3041
    System Model ID: PCM8024-k
    Machine Type: PowerConnect M8024-k
    Temperature Sensors:
    
    Unit     Description       Temperature    Status
                                (Celsius)
    ----     -----------       -----------    ------
    1        System            37             Good
    2        System            37             Good
    [...]

Conclusion

Although the setup built with the Dell PowerConnect switches and the other hardware components was working and providing its basic, intended functionality, there were some pretty big and annoying limitations associated with it. A lot of these limitations would not have been that significant to the entire setup if certain design decisions had been made more carefully. For example, if the layer 3 part of the LAN had been implemented in external network components, or if a proper, fully meshed, fabric-based SAN had been favored over what can only be described as a legacy technology. From the reliability, availability and serviceability (RAS) points of view, the setup is also far from ideal. By daisy-chaining the Dell PowerEdge M1000e blade chassis, stacking the LAN switches, stretching the LAN and SAN over both chassis and connecting external devices through the external ports of the Dell PowerConnect switches, there are a lot of parts in the setup that depend on each other. This makes normal operations difficult at best and can have disastrous effects in case of a failure.

In retrospect, either using pure pass-through network modules in the Dell PowerEdge M1000e blade chassis in conjunction with capable 10GE top-of-rack switches, or using the much more capable Dell Force10 MXL switches in the Dell PowerEdge M1000e blade chassis, seems to be the better solution. The price premium for the Dell Force10 MXL switches of about €2000 list price per device compared to the Dell PowerConnect switches seems negligible compared to the costs that arose through debugging, bugfixing and finding workarounds for the various limitations of the Dell PowerConnect switches. In either case a pair of capable, central layer 3 devices for gateway redundancy, routing and possibly fine-grained traffic control would be advisable.

For simpler setups without some of the more special requirements of this particular setup, the Dell PowerConnect switches still offer a nice price-performance ratio, especially with regard to their 10GE port density.

// Check_MK Monitoring - Open-iSCSI

The Open-iSCSI project provides a high-performance, transport-independent implementation of RFC 3720 iSCSI for Linux. It allows remote access to SCSI targets via TCP/IP over several different transport technologies. This article introduces a new Check_MK service check to monitor the status of Open-iSCSI sessions, as well as the monitoring of several statistical metrics on Open-iSCSI sessions and iSCSI hardware initiator hosts.

For the impatient and as a TL;DR, here is the Check_MK package of the Open-iSCSI monitoring checks:

Open-iSCSI monitoring checks (Compatible with Check_MK versions 1.2.8 and later)

The sources can be found in my Check_MK repository on GitHub.


The Check_MK service check to monitor Open-iSCSI consists of two major parts: an agent plugin and three check plugins.

The first part, a Check_MK agent plugin named open-iscsi, is a simple Bash shell script. It calls the Open-iSCSI administration tool iscsiadm in order to retrieve a list of currently active iSCSI sessions. The exact call to iscsiadm to retrieve the session list is:

/usr/bin/iscsiadm -m session -P 1

If there are any active iSCSI sessions, the open-iscsi agent plugin also tries to collect several statistics for each iSCSI session. This is done by another call to iscsiadm for each iSCSI Session ${SID}, which is shown in the following example:

/usr/bin/iscsiadm -m session -r ${SID} -s

Unfortunately, the iSCSI session statistics are currently only supported for Open-iSCSI software initiators or dependent hardware iSCSI initiators like the Broadcom BCM577xx or BCM578xx adapters which are covered by the bnx2i kernel module. See Debugging Segfaults in Open-iSCSIs iscsiuio on Intel Broadwell and Backporting Open-iSCSI to Debian 8 "Jessie" for additional information on those dependent hardware iSCSI initiators.

For hardware iSCSI initiators, like the QLogic 4000 and QLogic 8200 Series network adapters and iSCSI HBAs, which provide a full iSCSI offload engine (iSOE) implementation in the adapter's firmware, there is currently no support for iSCSI session statistics. Instead, the open-iscsi agent plugin collects several global statistics on each iSOE host ${HST} covered by the qla4xxx kernel module, with the command shown in the following example:

/usr/bin/iscsiadm -m host -H ${HST} -C stats

The output of the above commands is parsed and reformatted by the agent plugin for easier processing in the check plugins. The following example shows the agent plugin output for a system with two BCM578xx dependent hardware iSCSI initiators:

<<<open-iscsi_sessions>>>
bnx2i 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS> bnx2i.f8:ca:b8:7d:bf:2d eth2 10.0.3.52 LOGGED_IN LOGGED_IN NO_CHANGE
bnx2i 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS> bnx2i.f8:ca:b8:7d:c2:34 eth3 10.0.3.53 LOGGED_IN LOGGED_IN NO_CHANGE

<<<open-iscsi_session_stats>>>
[session stats f8:ca:b8:7d:bf:2d iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS>]
txdata_octets: 40960
rxdata_octets: 461171313
noptx_pdus: 0
scsicmd_pdus: 153967
tmfcmd_pdus: 0
login_pdus: 0
text_pdus: 0
dataout_pdus: 0
logout_pdus: 0
snack_pdus: 0
noprx_pdus: 0
scsirsp_pdus: 153967
tmfrsp_pdus: 0
textrsp_pdus: 0
datain_pdus: 112420
logoutrsp_pdus: 0
r2t_pdus: 0
async_pdus: 0
rjt_pdus: 0
digest_err: 0
timeout_err: 0

[session stats f8:ca:b8:7d:c2:34 iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS>]
txdata_octets: 16384
rxdata_octets: 255666052
noptx_pdus: 0
scsicmd_pdus: 84312
tmfcmd_pdus: 0
login_pdus: 0
text_pdus: 0
dataout_pdus: 0
logout_pdus: 0
snack_pdus: 0
noprx_pdus: 0
scsirsp_pdus: 84312
tmfrsp_pdus: 0
textrsp_pdus: 0
datain_pdus: 62418
logoutrsp_pdus: 0
r2t_pdus: 0
async_pdus: 0
rjt_pdus: 0
digest_err: 0
timeout_err: 0

The next example shows the agent plugin output for a system with two QLogic 8200 Series hardware iSCSI initiators:

<<<open-iscsi_sessions>>>
qla4xxx 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-57e572d50-80e0000001458a32-v-sto2-tst-000001 qla4xxx.f8:ca:b8:7d:c1:7d.ipv4.0 none 10.0.3.50 LOGGED_IN Unknown Unknown
qla4xxx 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-57e572d50-80e0000001458a32-v-sto2-tst-000001 qla4xxx.f8:ca:b8:7d:c1:7e.ipv4.0 none 10.0.3.51 LOGGED_IN Unknown Unknown

<<<open-iscsi_host_stats>>>
[host stats f8:ca:b8:7d:c1:7d iqn.2000-04.com.qlogic:isp8214.000e1e3574ac.4]
mactx_frames: 563454
mactx_bytes: 52389948
mactx_multicast_frames: 877513
mactx_broadcast_frames: 0
mactx_pause_frames: 0
mactx_control_frames: 0
mactx_deferral: 0
mactx_excess_deferral: 0
mactx_late_collision: 0
mactx_abort: 0
mactx_single_collision: 0
mactx_multiple_collision: 0
mactx_collision: 0
mactx_frames_dropped: 0
mactx_jumbo_frames: 0
macrx_frames: 1573455
macrx_bytes: 440845678
macrx_unknown_control_frames: 0
macrx_pause_frames: 0
macrx_control_frames: 0
macrx_dribble: 0
macrx_frame_length_error: 0
macrx_jabber: 0
macrx_carrier_sense_error: 0
macrx_frame_discarded: 0
macrx_frames_dropped: 1755017
mac_crc_error: 0
mac_encoding_error: 0
macrx_length_error_large: 0
macrx_length_error_small: 0
macrx_multicast_frames: 0
macrx_broadcast_frames: 0
iptx_packets: 508160
iptx_bytes: 29474232
iptx_fragments: 0
iprx_packets: 401785
iprx_bytes: 354673156
iprx_fragments: 0
ip_datagram_reassembly: 0
ip_invalid_address_error: 0
ip_error_packets: 0
ip_fragrx_overlap: 0
ip_fragrx_outoforder: 0
ip_datagram_reassembly_timeout: 0
ipv6tx_packets: 0
ipv6tx_bytes: 0
ipv6tx_fragments: 0
ipv6rx_packets: 0
ipv6rx_bytes: 0
ipv6rx_fragments: 0
ipv6_datagram_reassembly: 0
ipv6_invalid_address_error: 0
ipv6_error_packets: 0
ipv6_fragrx_overlap: 0
ipv6_fragrx_outoforder: 0
ipv6_datagram_reassembly_timeout: 0
tcptx_segments: 508160
tcptx_bytes: 19310736
tcprx_segments: 401785
tcprx_byte: 346637456
tcp_duplicate_ack_retx: 1
tcp_retx_timer_expired: 1
tcprx_duplicate_ack: 0
tcprx_pure_ackr: 0
tcptx_delayed_ack: 106449
tcptx_pure_ack: 106489
tcprx_segment_error: 0
tcprx_segment_outoforder: 0
tcprx_window_probe: 0
tcprx_window_update: 695915
tcptx_window_probe_persist: 0
ecc_error_correction: 0
iscsi_pdu_tx: 401697
iscsi_data_bytes_tx: 29225
iscsi_pdu_rx: 401697
iscsi_data_bytes_rx: 327355963
iscsi_io_completed: 101
iscsi_unexpected_io_rx: 0
iscsi_format_error: 0
iscsi_hdr_digest_error: 0
iscsi_data_digest_error: 0
iscsi_sequence_error: 0

[host stats f8:ca:b8:7d:c1:7e iqn.2000-04.com.qlogic:isp8214.000e1e3574ad.5]
mactx_frames: 563608
mactx_bytes: 52411412
mactx_multicast_frames: 877517
mactx_broadcast_frames: 0
mactx_pause_frames: 0
mactx_control_frames: 0
mactx_deferral: 0
mactx_excess_deferral: 0
mactx_late_collision: 0
mactx_abort: 0
mactx_single_collision: 0
mactx_multiple_collision: 0
mactx_collision: 0
mactx_frames_dropped: 0
mactx_jumbo_frames: 0
macrx_frames: 1573572
macrx_bytes: 441630442
macrx_unknown_control_frames: 0
macrx_pause_frames: 0
macrx_control_frames: 0
macrx_dribble: 0
macrx_frame_length_error: 0
macrx_jabber: 0
macrx_carrier_sense_error: 0
macrx_frame_discarded: 0
macrx_frames_dropped: 1755017
mac_crc_error: 0
mac_encoding_error: 0
macrx_length_error_large: 0
macrx_length_error_small: 0
macrx_multicast_frames: 0
macrx_broadcast_frames: 0
iptx_packets: 508310
iptx_bytes: 29490504
iptx_fragments: 0
iprx_packets: 401925
iprx_bytes: 355436636
iprx_fragments: 0
ip_datagram_reassembly: 0
ip_invalid_address_error: 0
ip_error_packets: 0
ip_fragrx_overlap: 0
ip_fragrx_outoforder: 0
ip_datagram_reassembly_timeout: 0
ipv6tx_packets: 0
ipv6tx_bytes: 0
ipv6tx_fragments: 0
ipv6rx_packets: 0
ipv6rx_bytes: 0
ipv6rx_fragments: 0
ipv6_datagram_reassembly: 0
ipv6_invalid_address_error: 0
ipv6_error_packets: 0
ipv6_fragrx_overlap: 0
ipv6_fragrx_outoforder: 0
ipv6_datagram_reassembly_timeout: 0
tcptx_segments: 508310
tcptx_bytes: 19323952
tcprx_segments: 401925
tcprx_byte: 347398136
tcp_duplicate_ack_retx: 2
tcp_retx_timer_expired: 4
tcprx_duplicate_ack: 0
tcprx_pure_ackr: 0
tcptx_delayed_ack: 106466
tcptx_pure_ack: 106543
tcprx_segment_error: 0
tcprx_segment_outoforder: 0
tcprx_window_probe: 0
tcprx_window_update: 696035
tcptx_window_probe_persist: 0
ecc_error_correction: 0
iscsi_pdu_tx: 401787
iscsi_data_bytes_tx: 37970
iscsi_pdu_rx: 401791
iscsi_data_bytes_rx: 328112050
iscsi_io_completed: 127
iscsi_unexpected_io_rx: 0
iscsi_format_error: 0
iscsi_hdr_digest_error: 0
iscsi_data_digest_error: 0
iscsi_sequence_error: 0

Although it is a simple Bash shell script, the agent plugin open-iscsi has several dependencies which need to be installed in order for it to work properly. Namely, those are the commands iscsiadm, sed, tr and egrep. On Debian based systems, the necessary packages can be installed with the following command:

root@host:~# apt-get install coreutils grep open-iscsi sed

The second part of the Check_MK service check for Open-iSCSI provides the necessary check logic through individual inventory and check functions. This is implemented in the three Check_MK check plugins open-iscsi_sessions, open-iscsi_host_stats and open-iscsi_session_stats, which will be discussed separately in the following sections.

Open-iSCSI Session Status

The check plugin open-iscsi_sessions is responsible for the monitoring of individual iSCSI sessions and their internal session states. Upon inventory, this check plugin creates a service check for each pair of iSCSI network interface name and IQN of the iSCSI target volume. Unlike the iSCSI session ID, which changes over time (e.g. after iSCSI logout and login), this pair uniquely identifies an iSCSI session on a host. During normal check execution, the list of currently active iSCSI sessions on a host is compared to the list of active iSCSI sessions gathered during inventory on that host. If a session is missing or if the session has an erroneous internal state, an alarm is raised accordingly.

For all types of initiators – software, dependent hardware and hardware – there is the state session_state which can take on the following values:

ISCSI_STATE_FREE
ISCSI_STATE_LOGGED_IN
ISCSI_STATE_FAILED
ISCSI_STATE_TERMINATE
ISCSI_STATE_IN_RECOVERY
ISCSI_STATE_RECOVERY_FAILED
ISCSI_STATE_LOGGING_OUT

An alarm is raised if the session is in any state other than ISCSI_STATE_LOGGED_IN. For software and dependent hardware initiators there are two additional states – connection_state and internal_state. The state connection_state can take on the values:

FREE
TRANSPORT WAIT
IN LOGIN
LOGGED IN
IN LOGOUT
LOGOUT REQUESTED
CLEANUP WAIT

and internal_state can take on the values:

NO CHANGE
CLEANUP       
REOPEN
REDIRECT

In addition to the above session_state, an alarm is raised if connection_state is in any state other than LOGGED IN and internal_state is in any state other than NO CHANGE.

No performance data is currently reported by this check.
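
To illustrate the check logic described above, here is a heavily simplified sketch of what a Check_MK (1.2.x API) check plugin for the session status could look like. This is not the actual plugin code from the package; the field positions of the agent section, the exact state strings and the service description are assumptions made for the purpose of the example:

# Simplified sketch, not the actual open-iscsi_sessions check plugin.
# Assumed agent section line format (see example output above):
# driver portal target_iqn iface_name netdev ip session_state [conn_state internal_state]

def inventory_open_iscsi_sessions(info):
    # One service check per pair of iSCSI interface name and target IQN.
    return [("%s %s" % (line[3], line[2]), None) for line in info]

def check_open_iscsi_sessions(item, params, info):
    for line in info:
        if "%s %s" % (line[3], line[2]) != item:
            continue
        session_state = line[6]
        if session_state != "LOGGED_IN":
            return 2, "session state is %s" % session_state
        # Software and dependent hardware initiators report two additional
        # states; hardware iSOE initiators report them as "Unknown".
        if len(line) > 8:
            conn_state, internal_state = line[7], line[8]
            if conn_state not in ("LOGGED_IN", "Unknown") or \
               internal_state not in ("NO_CHANGE", "Unknown"):
                return 2, "connection state %s, internal state %s" % (conn_state, internal_state)
        return 0, "session is logged in"
    # A session found during inventory is now missing.
    return 2, "iSCSI session not found"

check_info["open-iscsi_sessions"] = {
    "inventory_function":  inventory_open_iscsi_sessions,
    "check_function":      check_open_iscsi_sessions,
    "service_description": "iSCSI Session Status %s",
}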

Open-iSCSI Host Statistics

The check plugin open-iscsi_host_stats is responsible for the monitoring of the global statistics on an iSOE host. Upon inventory, this check plugin creates a service check for each pair of MAC address and iSCSI network interface name. During normal check execution, an extensive list of statistics – see the above example output of the Check_MK agent plugin – is determined for each inventorized item. If the rate of one of the statistics values is above the configured warning or critical threshold values, an alarm is raised accordingly. For all statistics, performance data is reported by the check.
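
The threshold handling essentially boils down to computing a per-second rate for each counter and comparing it against the configured levels. The following is a minimal sketch of that logic, assuming the get_rate helper of the Check_MK 1.2.x check API and a parse step that turns the agent section into a dictionary of counters per host item; the actual plugin may differ in its details:

# Minimal sketch of the rate and threshold handling, not the actual
# open-iscsi_host_stats check plugin code.
import time

def check_open_iscsi_host_stats(item, params, parsed):
    stats = parsed.get(item)  # assumed: dict of counter name -> absolute value
    if stats is None:
        return 3, "no statistics found for this iSOE host"

    now = time.time()
    state, details, perfdata = 0, [], []
    for counter, value in sorted(stats.items()):
        # get_rate() keeps the previous counter value in the Check_MK counter
        # store and returns the per-second rate since the last check run.
        rate = get_rate("open-iscsi_host_stats.%s.%s" % (item, counter), now, value)
        warn, crit = params.get(counter, (0.0, 0.0))  # default levels: 0/s
        perfdata.append((counter, rate, warn, crit))
        if rate > crit:
            state = max(state, 2)
            details.append("%s: %.1f/s (!!)" % (counter, rate))
        elif rate > warn:
            state = max(state, 1)
            details.append("%s: %.1f/s (!)" % (counter, rate))
    return state, ", ".join(details) or "no counters above their levels", perfdata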

With the additional WATO plugin open-iscsi_host_stats.py it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values. The default values for all statistics are a rate of zero (0) units per second for both warning and critical thresholds. The configuration options for the iSOE host statistics levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Storage, Filesystems and Files
         -> Open-iSCSI Host Statistics
            -> Create Rule in Folder ...
               -> The levels for the Open-iSCSI host statistics values
                  [x] The levels for the number of transmitted MAC/Layer2 frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 bytes on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 bytes on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 multicast frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 multicast frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 broadcast frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 broadcast frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 pause frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 pause frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 control frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 control frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 dropped frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 dropped frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 deferral frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 excess deferral frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 abort frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 jumbo frames on an iSOE host.
                  [x] The levels for the number of MAC/Layer2 late transmit collisions on an iSOE host.
                  [x] The levels for the number of MAC/Layer2 single transmit collisions on an iSOE host.
                  [x] The levels for the number of MAC/Layer2 multiple transmit collisions on an iSOE host.
                  [x] The levels for the number of MAC/Layer2 collisions on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 control frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 dribble on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 frame length errors on an iSOE host.
                  [x] The levels for the number of discarded received MAC/Layer2 frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 jabber on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 carrier sense errors on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 CRC errors on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 encoding errors on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 length too large errors on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 length too small errors on an iSOE host.
                  [x] The levels for the number of transmitted IP packets on an iSOE host.
                  [x] The levels for the number of received IP packets on an iSOE host.
                  [x] The levels for the number of transmitted IP bytes on an iSOE host.
                  [x] The levels for the number of received IP bytes on an iSOE host.
                  [x] The levels for the number of transmitted IP fragments on an iSOE host.
                  [x] The levels for the number of received IP fragments on an iSOE host.
                  [x] The levels for the number of IP datagram reassemblies on an iSOE host.
                  [x] The levels for the number of IP invalid address errors on an iSOE host.
                  [x] The levels for the number of IP packet errors on an iSOE host.
                  [x] The levels for the number of IP fragmentation overlaps on an iSOE host.
                  [x] The levels for the number of IP fragmentation out-of-order on an iSOE host.
                  [x] The levels for the number of IP datagram reassembly timeouts on an iSOE host.
                  [x] The levels for the number of transmitted IPv6 packets on an iSOE host.
                  [x] The levels for the number of received IPv6 packets on an iSOE host.
                  [x] The levels for the number of transmitted IPv6 bytes on an iSOE host.
                  [x] The levels for the number of received IPv6 bytes on an iSOE host.
                  [x] The levels for the number of transmitted IPv6 fragments on an iSOE host.
                  [x] The levels for the number of received IPv6 fragments on an iSOE host.
                  [x] The levels for the number of IPv6 datagram reassemblies on an iSOE host.
                  [x] The levels for the number of IPv6 invalid address errors on an iSOE host.
                  [x] The levels for the number of IPv6 packet errors on an iSOE host.
                  [x] The levels for the number of IPv6 fragmentation overlaps on an iSOE host.
                  [x] The levels for the number of IPv6 fragmentation out-of-order on an iSOE host.
                  [x] The levels for the number of IPv6 datagram reassembly timeouts on an iSOE host.
                  [x] The levels for the number of transmitted TCP segments on an iSOE host.
                  [x] The levels for the number of received TCP segments on an iSOE host.
                  [x] The levels for the number of transmitted TCP bytes on an iSOE host.
                  [x] The levels for the number of received TCP bytes on an iSOE host.
                  [x] The levels for the number of duplicate TCP ACK retransmits on an iSOE host.
                  [x] The levels for the number of received TCP retransmit timer expiries on an iSOE host.
                  [x] The levels for the number of received TCP duplicate ACKs on an iSOE host.
                  [x] The levels for the number of received TCP pure ACKs on an iSOE host.
                  [x] The levels for the number of transmitted TCP delayed ACKs on an iSOE host.
                  [x] The levels for the number of transmitted TCP pure ACKs on an iSOE host.
                  [x] The levels for the number of received TCP segment errors on an iSOE host.
                  [x] The levels for the number of received TCP segment out-of-order on an iSOE host.
                  [x] The levels for the number of received TCP window probe on an iSOE host.
                  [x] The levels for the number of received TCP window update on an iSOE host.
                  [x] The levels for the number of transmitted TCP window probe persist on an iSOE host.
                  [x] The levels for the number of transmitted iSCSI PDUs on an iSOE host.
                  [x] The levels for the number of received iSCSI PDUs on an iSOE host.
                  [x] The levels for the number of transmitted iSCSI Bytes on an iSOE host.
                  [x] The levels for the number of received iSCSI Bytes on an iSOE host.
                  [x] The levels for the number of iSCSI I/Os completed on an iSOE host.
                  [x] The levels for the number of iSCSI unexpected I/Os on an iSOE host.
                  [x] The levels for the number of iSCSI format errors on an iSOE host.
                  [x] The levels for the number of iSCSI header digest (CRC) errors on an iSOE host.
                  [x] The levels for the number of iSCSI data digest (CRC) errors on an iSOE host.
                  [x] The levels for the number of iSCSI sequence errors on an iSOE host.
                  [x] The levels for the number of ECC error corrections on an iSOE host.

The following image shows a status output example from the WATO WebUI with several open-iscsi_sessions (iSCSI Session Status) and open-iscsi_host_stats (iSCSI Host Stats) service checks over two QLogic 8200 Series hardware iSCSI initiators:

Status output example for open-iscsi_sessions and open-iscsi_host_stats service checks over QLogic 8200 Series hardware iSCSI initiators

This example shows six iSCSI Session Status service check items, which are pairs of iSCSI network interface names and – here anonymized – IQNs of the iSCSI target volumes. For each item the current session_state – in this example LOGGED_IN – is shown. There are also two iSCSI Host Stats service check items in the example, which are pairs of MAC addresses and iSCSI network interface names. For each of those items the current throughput rate on the MAC, IP/IPv6, TCP and iSCSI protocol layers is shown. The throughput rate on the MAC protocol layer is also visualized in the Perf-O-Meter, with received traffic growing from the middle to the left and transmitted traffic growing from the middle to the right.

The following three images show examples of the PNP4Nagios graphs for the open-iscsi_host_stats (iSCSI Host Stats) service check.

Example PNP4Nagios graphs for a open-iscsi_host_stats service check (MAC Frames, Traffic, MAC Errors)

The middle graph shows a combined view of the throughput rate for received and transmitted traffic on the different MAC, IP/IPv6, TCP and iSCSI protocol layers. The upper graph shows the throughput rate for various frame types on the MAC protocol layer. The lower graph shows the rate for various error frame types on the MAC protocol layer.

Example PNP4Nagios graphs for a open-iscsi_host_stats service check (IP Packets and Fragments, IP Errors, TCP Segments)

The upper graph shows the throughput rate for received and transmitted traffic on the IP/IPv6 protocol layer. The middle graph shows the rate for various error packet types on the IP/IPv6 protocol layer. The lower graph shows the throughput rate for received and transmitted traffic on the TCP protocol layer.

Example PNP4Nagios graphs for a open-iscsi_host_stats service check (TCP Errors, ECC Error Correction, iSCSI PDUs, iSCSI Errors)

The first graph shows the rate for various protocol control and error segment types on the TCP protocol layer. The second graph shows the rate of ECC error corrections that occurred on the QLogic 8200 Series hardware iSCSI initiator. The third graph shows the throughput rate for received and transmitted traffic on the iSCSI protocol layer. The fourth and last graph shows the rate for various control and error PDUs on the iSCSI protocol layer.

Open-iSCSI Session Statistics

The check plugin open-iscsi_session_stats is responsible for the monitoring of the statistics on individual iSCSI sessions. Upon inventory, this check plugin creates a service check for each pair of MAC address of the network interface and IQN of the iSCSI target volume. During normal check execution, an extensive list of statistics – see the above example output of the Check_MK agent plugin – is collected for each inventorized item. If the rate of one of the statistics values is above the configured warning or critical threshold values, an alarm is raised accordingly. For all statistics, performance data is reported by the check.

With the additional WATO plugin open-iscsi_session_stats.py it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values. The default values for all statistics are a rate of zero (0) units per second for both warning and critical thresholds. The configuration options for the iSCSI session statistics levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Storage, Filesystems and Files
         -> Open-iSCSI Session Statistics
            -> Create Rule in Folder ...
               -> The levels for the Open-iSCSI session statistics values
                  [x] The levels for the number of transmitted bytes in an Open-iSCSI session
                  [x] The levels for the number of received bytes in an Open-iSCSI session
                  [x] The levels for the number of digest (CRC) errors in an Open-iSCSI session
                  [x] The levels for the number of timeout errors in an Open-iSCSI session
                  [x] The levels for the number of transmitted NOP commands in an Open-iSCSI session
                  [x] The levels for the number of received NOP commands in an Open-iSCSI session
                  [x] The levels for the number of transmitted SCSI command requests in an Open-iSCSI session
                  [x] The levels for the number of received SCSI command responses in an Open-iSCSI session
                  [x] The levels for the number of transmitted task management function commands in an Open-iSCSI session
                  [x] The levels for the number of received task management function responses in an Open-iSCSI session
                  [x] The levels for the number of transmitted login requests in an Open-iSCSI session
                  [x] The levels for the number of transmitted logout requests in an Open-iSCSI session
                  [x] The levels for the number of received logout responses in an Open-iSCSI session
                  [x] The levels for the number of transmitted text PDUs in an Open-iSCSI session
                  [x] The levels for the number of received text PDUs in an Open-iSCSI session
                  [x] The levels for the number of transmitted data PDUs in an Open-iSCSI session
                  [x] The levels for the number of received data PDUs in an Open-iSCSI session
                  [x] The levels for the number of transmitted single negative ACKs in an Open-iSCSI session
                  [x] The levels for the number of received ready to transfer PDUs in an Open-iSCSI session
                  [x] The levels for the number of received reject PDUs in an Open-iSCSI session
                  [x] The levels for the number of received asynchronous messages in an Open-iSCSI session

The following image shows a status output example from the WATO WebUI with several open-iscsi_sessions (iSCSI Session Status) and open-iscsi_session_stats (iSCSI Session Stats) service checks over two BCM578xx dependent hardware iSCSI initiators:

Status output example for open-iscsi_sessions and open-iscsi_session_stats service checks over BCM578xx dependent hardware iSCSI initiators

This example shows six iSCSI Session Status service check items, which are pairs of iSCSI network interface names and – here anonymized – IQNs of the iSCSI target volumes. For each item the current session_state, connection_state and internal_state – in this example with the respective values LOGGED_IN, LOGGED_IN and NO_CHANGE – are shown. There is also an equal number of iSCSI Session Stats service check items in the example, which are also pairs of MAC addresses of the network interfaces and IQNs of the iSCSI target volumes. For each of those items the current throughput rate of the individual iSCSI session is shown. As long as the rate of the digest (CRC) and timeout error counters is zero, the string no protocol errors is displayed. Otherwise the name and throughput rate of any non-zero error counter is shown. The throughput rate of the iSCSI session is also visualized in the Perf-O-Meter, with received traffic growing from the middle to the left and transmitted traffic growing from the middle to the right.

The following image shows an example of the three PNP4Nagios graphs for a single open-iscsi_session_stats (iSCSI Session Stats) service check.

Example PNP4Nagios graphs for a single open-iscsi_session_stats service check

The upper graph shows the throughput rate for received and transmitted traffic of the iSCSI session. The middle graph shows the rate for received and transmitted iSCSI PDUs, broken down by the different types of PDUs on the iSCSI protocol layer. The lower graph shows the rate for the digest (CRC) and timeout errors on the iSCSI protocol layer.

The described Check_MK service checks to monitor the status of Open-iSCSI sessions, Open-iSCSI session metrics and iSCSI hardware initiator host metrics have been verified to work with version 2.0.874-2~bpo8+1 of the open-iscsi package from the backports repository of Debian stable (Jessie) on the client side and the Check_MK versions 1.2.6 and 1.2.8 on the server side.

I hope you find the provided new checks useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.

// Backporting Open-iSCSI to Debian 8 "Jessie"

Starting with the Debian open-iscsi release 2.0.874-1, which is now available in the backports repository for Debian 8 "Jessie", the manual backport of Open-iSCSI described below is no longer necessary.

The Debian Open-iSCSI package is now based on a current upstream version of Open-iSCSI. Open-iSCSI's iscsiuio is now provided through its own Debian package. Several improvements (Git commit d05fe0e1, Git commit 6004a7e7) have been made in handling hardware initiator based iSCSI sessions.
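
For reference, a minimal sketch of pulling in the backported packages on an otherwise standard Debian 8 "Jessie" installation – the mirror URL is just an example – would be to add a jessie-backports entry to the APT sources and then explicitly select the backports release for the open-iscsi and iscsiuio packages:

host:~# echo "deb http://ftp.debian.org/debian jessie-backports main" > /etc/apt/sources.list.d/jessie-backports.list
host:~# apt-get update
host:~# apt-get install -t jessie-backports open-iscsi iscsiuio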

Thanks to Christian Seiler for his work on bringing the Debian Open-iSCSI package up to a current upstream version and for helping to sort out some issues related to the use of hardware initiators!

In the previous article Debugging Segfaults in Open-iSCSIs iscsiuio on Intel Broadwell i mentioned using a backported version of Open-iSCSI on Debian 8 (“Jessie”). This new post describes the backport and the changes provided by it in greater detail. All the changes to the original Debian package from “unstable” (“Sid”) can be found in my Debian Open-iSCSI Packaging repository on GitHub.

Starting point was a clone of the Debian Open-iSCSI Packaging repository at Git commit df150d90. Mind, though, that in the meantime between creating the backport and writing this, the Debian Open-iSCSI maintainers have been busy and a more recent version of the Debian Open-iSCSI package from “unstable” (“Sid”) is now available.

Within this particular version of the Debian Open-iSCSI package, i first enabled the build of Open-iSCSI's iscsiuio. On the one hand, this was done in order to ensure that the iscsiuio code would successfully build even at this old level of the Open-iSCSI code. On the other hand, it would serve as a baseline for any issues surfacing later on, after the move to the more recent upstream Open-iSCSI sources: the root cause of those would then have to lie solely with the newer upstream version of Open-iSCSI. Some integration into the general system environment was also added at this point. In detail the changes were:

  • Git commit 32c96e6c removes the Debian patch 05-disable-iscsiuio.patch which disables the build of iscsiuio.

  • Git commit 984344a1 enables the build of iscsiuio, extends the cleanup build targets and adds iscsiuio to the dh_systemd build targets.

  • Git commit 89d845a9 adds the results from the successful build – the iscsiuio binary, the iscsiuio manual page, a readme file and a logrotate configuration file – to the Debian package. It also adds the kernel modules bnx2i and cnic to the list of kernel modules to be loaded at installation time.

  • Git commit 89195bbe adds the systemd service and socket unit files for iscsiuio. Those files have been taken from this discussion on the Open-iSCSI mailing list and have been slightly altered.

With the above changes an intermediary package was built for testing purposes. During the following tests, sometimes all currently mounted filesystems – even those distinctly not based on iSCSI volumes – would suddenly be unmounted. For some filesystems this would succeed, for others, like e.g. the /var and the root filesystem, this would fail due to them being currently in use. The issue particularly occurred while stopping the open-iscsi service either via its regular init script or via its systemd service. This is usually done at system shutdown or during uninstall of the Open-iSCSI package. Tracking down the root cause of this issue led to an unhandled case in the umountiscsi.sh script, which is called while stopping the open-iscsi service. Specifically, the following code section is responsible for the observed behaviour:

debian/extra/umountiscsi.sh
256    if [ $HAVE_LVM -eq 1 ] ; then
257        # Look for all LVM volume groups that have a backing store
258        # on any iSCSI device we found. Also, add $LVMGROUPS set in
259        # /etc/default/open-iscsi (for more complicated stacking
260        # configurations we don't automatically detect).
261        for _vg in $(cd /dev ; $PVS --noheadings -o vg_name $iscsi_disks $iscsi_partitions $iscsi_multipath_disks $iscsi_multipath_partitions 2>/dev/null) $LVMGROUPS ; do
262            add_to_set iscsi_lvm_vgs "$_vg"
263        done

The heuristics of the umountiscsi.sh script try to identify iSCSI based disk devices which are valid candidates for proper deactivation upon system shutdown. It turned out that in LVM based setups where there are currently no iSCSI based disk devices present, the variables $iscsi_disks, $iscsi_partitions, $iscsi_multipath_disks and $iscsi_multipath_partitions are left empty by the script's logic. In line 261 in the above code snippet, this leads to a call to the pvs --noheadings -o vg_name command without any additional arguments limiting its output of volume groups. Hence, the returned output is instead a complete list of all volume groups currently present on the system. Based on this list, the associated logical volumes for each volume group are determined and added to the list of devices to be unmounted. Finally all devices in this list are actually unmounted.
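
This effect can easily be reproduced directly on the shell. The following sketch – the device and volume group names are of course only illustrative – shows the difference between a pvs call limited to specific physical volumes and one called without any device arguments:

host:~# pvs --noheadings -o vg_name /dev/sdb1
  vg_iscsi
host:~# pvs --noheadings -o vg_name
  vg_iscsi
  vg_system

Without any physical volumes given as arguments, pvs simply reports the volume group of every physical volume present on the system, which is exactly what happens in line 261 of the code snippet above when the $iscsi_* variables are all empty.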

Without making too invasive changes to the script logic of umountiscsi.sh a quick'n'dirty solution was to introduce a check before the call to pvs which would determine whether the variables $iscsi_disks, $iscsi_partitions, $iscsi_multipath_disks and $iscsi_multipath_partitions are all empty. If this is the case, the call to pvs is simply skipped. The following patch shows the necessary code changes which are also available in Git commit 5118af7f:

umountiscsi.sh.patch
diff --git a/debian/extra/umountiscsi.sh b/debian/extra/umountiscsi.sh
index 1206fa1..485069c 100755
--- a/debian/extra/umountiscsi.sh
+++ b/debian/extra/umountiscsi.sh
@@ -258,9 +258,11 @@ enumerate_iscsi_devices() {
                # on any iSCSI device we found. Also, add $LVMGROUPS set in
                # /etc/default/open-iscsi (for more complicated stacking
                # configurations we don't automatically detect).
-               for _vg in $(cd /dev ; $PVS --noheadings -o vg_name $iscsi_disks $iscsi_partitions $iscsi_multipath_disks $iscsi_multipath_partitions 2>/dev/null) $LVMGROUPS ; do
-                       add_to_set iscsi_lvm_vgs "$_vg"
-               done
+               if [ -n "$iscsi_disks" -o -n "$iscsi_partitions" -o -n "$iscsi_multipath_disks" -o -n "$iscsi_multipath_partitions" ]; then
+                   for _vg in $(cd /dev ; $PVS --noheadings -o vg_name $iscsi_disks $iscsi_partitions $iscsi_multipath_disks $iscsi_multipath_partitions 2>/dev/null) $LVMGROUPS ; do
+                           add_to_set iscsi_lvm_vgs "$_vg"
+                   done
+               fi
 
                # $iscsi_lvm_vgs is now unique list
                for _vg in $iscsi_lvm_vgs ; do

After this was fixed, the last step was to finally move to the more recent upstream Open-iSCSI sources. In detail the changes in this last step were:

  • Git commit f5ab51ff moves the code to version 2.0.873+git1.1dfb88a4 which is based upon the upstream Git commit 1dfb88a4. This is the last commit before the externalization of the Open-iSNS library. Since i didn't want to also backport the Open-iSNS packages from Debian “unstable” (“Sid”), i decided to just skip the next two upstream commits 76832662 and c6d1117b and stick with the locally delivered Open-iSNS library.

  • Git commit 8c1e6974 removes the local Debian patches 01_spelling-errors-and-manpage-hyphen-fixes.patch, 02_make-iscsistart-a-dynamic-binary.patch and 03_respect-build-flags.patch which have already been merged into the more recent upstream Open-iSCSI sources. The remaining local Debian patches were renamed and reordered to 01_fix_iscsi_path.patch, 02_var-lock_var-run_transition.patch and 03_makefile_reproducibility_issues.patch. A whole bunch of new patches named {04,05,06,07,08,09,10,11,12,13,14,15}_upstream_git_commit_<Git commit ID>.patch were added in order to bring the sources up to the – by then most recent – upstream Git commit 0fa43f29.

  • Git commit d051dece removes some files from the Debian package, which were dynamically generated during the build of iscsiuio.

  • Finally Git commit 0fabb948 deals with the issue described in Debugging Segfaults in Open-iSCSIs iscsiuio on Intel Broadwell.

With the steps and changes described above, a backported version of Open-iSCSI using its most recent sources was created as a package for Debian 8 (“Jessie”). This package also supports offloaded iSCSI connections via the Broadcom BCM577xx and BCM578xx iSOEs with the use of iscsiuio. The package has been in production use for over a month now and no major issues – neither with the newer upstream Open-iSCSI sources, nor with use of Broadcom BCM577xx and BCM578xx iSOEs through iscsiuio – have emerged so far.

// Debugging Segfaults in Open-iSCSIs iscsiuio on Intel Broadwell

Open-iSCSI's tool iscsiuio, which is used to configure and manage Broadcom BCM577xx and BCM578xx iSCSI offload engines (iSOE), currently crashes with a segmentation fault upon target login on Intel Broadwell based systems. Comparable system setups based on the older Intel Haswell chips do not show this issue.

In the past i've been using QLogic 4000 and QLogic 8200 Series network adapters and iSCSI HBAs which provide a full iSCSI offload engine (iSOE) implementation in the adapter's firmware. Unfortunately the QLogic 8200 Series network adapters are no longer available for Dell M-Series blade servers. The alternatives offered by Dell are the Intel X520 and Intel X540 series adapters, or the Broadcom BCM57810S series adapters. Instead of using the Intel X520/X540, which provide no iSOE at all, i decided to go with the Broadcom BCM57810S, which at least provide some kind of iSOE. According to the VMware terminology the Broadcom BCM57810S are dependent hardware iSCSI initiators. Dependent in this context means that the iSOE does not implement all the necessary features and thus cannot perform all the tasks (e.g. TCP/IP stack, configuration and session management, authentication, etc.) necessary for target handling by itself. Rather, some of these tasks are provided by a third party on which this kind of iSOE depends. In case of the Broadcom BCM57810S this third party is the iscsiuio daemon, which has for some time been part of the Open-iSCSI project. Simply put, the iscsiuio daemon acts as an intermediary between the iscsid on the one side and the QLogic1) NetXtreme II driver (kernel module bnx2 or bnx2x) and the QLogic2) CNIC driver (kernel module cnic) on the other side, facilitating the creation and overall management of offloaded iSCSI sessions. Very simplified, the flow of information is as follows:

iscsiadm ←→ iscsid ←→ iscsiuio ←→ bnx2/bnx2x ←→ cnic ←→ Broadcom BCM57810S adapter ←→ Network ←→ Target

In my environment the Broadcom BCM57810S adapters are installed and used on six hosts (host1 and host{5,6,7,8,9}). They all connect to the same Dell EqualLogic storage systems in the backend, using the same network dedicated to iSCSI traffic. All hosts are Dell M630 blade servers with exactly the same firmware, operating system (Debian 8) and software versions. I'm using a backported version of Open-iSCSI, which is based on Git commit 0fa43f29, but excluding the commits 76832662 and c6d1117b which just implement the externalization of the Open-iSNS code. The systems originate from the same install image, so their configuration is – to a large extent – exactly the same. The only difference between the hosts is that host1 has Intel E5 v3 (aka Haswell) CPUs, while host{5,6,7,8,9} have Intel E5 v4 (aka Broadwell) CPUs.

On host1 everything works fine, iscsiuio runs as expected and access to targets via the Broadcom BCM57810S iSOEs is working flawlessly. On host{5,6,7,8,9} on the other hand, i was getting segmentation faults like the one in the example shown below, while trying to log in to any target.

host5:~# gdb /sbin/iscsiuio
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /sbin/iscsiuio...(no debugging symbols found)...done.


(gdb) # run -d 4 -f
Starting program: /sbin/iscsiuio -d 4 -f
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
INFO  [Wed Jul 27 10:01:45 2016]Initialize logger using log file: /var/log/iscsiuio.log
INFO  [Wed Jul 27 10:01:45 2016]Started iSCSI uio stack: Ver 0.7.8.2
INFO  [Wed Jul 27 10:01:45 2016]Build date: Fri Jul 22 15:40:04 CEST 2016
INFO  [Wed Jul 27 10:01:45 2016]Debug mode enabled
INFO  [Wed Jul 27 10:01:45 2016]Running on sysname: 'Linux', release: '3.16.0-4-amd64', version '#1 SMP Debian 3.16.7-ckt
25-2+deb8u3 (2016-07-02)' machine: 'x86_64'
DBG   [Wed Jul 27 10:01:45 2016]Loaded nic library 'bnx2' Version: '0.7.8.2' build on Fri Jul 22 15:40:04 CEST 2016'
DBG   [Wed Jul 27 10:01:45 2016]Added 'bnx2' nic library
DBG   [Wed Jul 27 10:01:45 2016]Loaded nic library 'bnx2x' Version: '0.7.8.2' build on Fri Jul 22 15:40:04 CEST 2016'
DBG   [Wed Jul 27 10:01:45 2016]Added 'bnx2x' nic library
[New Thread 0x7ffff760f700 (LWP 4942)]
INFO  [Wed Jul 27 10:01:45 2016]signal handling thread ready
INFO  [Wed Jul 27 10:01:45 2016]nic_utils Found host[11]: host11
INFO  [Wed Jul 27 10:01:45 2016]Done capturing /sys/class/iscsi_host/host11/netdev
INFO  [Wed Jul 27 10:01:45 2016]Done capturing /sys/class/iscsi_host/host11/netdev
INFO  [Wed Jul 27 10:01:45 2016]nic_utils looking for uio device for eth3
WARN  [Wed Jul 27 10:01:45 2016]Could not find assoicate uio device with eth3
ERR   [Wed Jul 27 10:01:45 2016]nic_utils Could not determine UIO name for eth3
INFO  [Wed Jul 27 10:01:45 2016]nic_utils Found host[12]: host12
INFO  [Wed Jul 27 10:01:45 2016]Done capturing /sys/class/iscsi_host/host12/netdev
INFO  [Wed Jul 27 10:01:45 2016]Done capturing /sys/class/iscsi_host/host12/netdev
INFO  [Wed Jul 27 10:01:45 2016]nic_utils looking for uio device for eth2
INFO  [Wed Jul 27 10:01:45 2016]nic_utils eth2 associated with uio0
INFO  [Wed Jul 27 10:01:45 2016]nic_utils NIC not found creating an instance for host_no: 12 eth2
DBG   [Wed Jul 27 10:01:45 2016]Could not increase process priority: Success
[New Thread 0x7ffff6e0e700 (LWP 4943)]
DBG   [Wed Jul 27 10:01:45 2016]iscsi_ipc Started iscsid listening thread
DBG   [Wed Jul 27 10:01:45 2016]iscsi_ipc Waiting for iscsid command
INFO  [Wed Jul 27 10:01:45 2016]NIC_NL Netlink to CNIC on pid 4938 is ready
DBG   [Wed Jul 27 10:01:57 2016]iscsi_ipc recv iscsid request: cmd: 1, payload_len: 11720
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc Received request for 'eth2' to set IP address: '10.0.1.62' VLAN: '0'
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc Using netmask: 0.0.0.0
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc  eth2, using existing NIC
INFO  [Wed Jul 27 10:01:57 2016]nic_utils looking for uio device for eth2
INFO  [Wed Jul 27 10:01:57 2016]nic_utils eth2 associated with uio0
INFO  [Wed Jul 27 10:01:57 2016]Done capturing /sys/class/uio/uio0/name
INFO  [Wed Jul 27 10:01:57 2016]nic_utils eth2: Verified uio name bnx2x_cnic with library bnx2x
INFO  [Wed Jul 27 10:01:57 2016]eth2: found NIC with library 'bnx2x'
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2 library set using transport_name bnx2i
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2: requesting configuration using static IP address
DBG   [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2 couldn't find interface with ip_type: 0x2 creating it
INFO  [Wed Jul 27 10:01:57 2016]nic eth2: Added nic interface for VLAN: 0, protocol: 2
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2: created network interface
[New Thread 0x7ffff660d700 (LWP 4947)]
WARN  [Wed Jul 27 10:01:57 2016]nic_utils eth2: device already disabled: flag: 0x1088 state: 0x1
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2: configuring using static IP IPv4 address :10.0.1.62
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc  netmask: 255.255.255.0
[New Thread 0x7ffff5e0c700 (LWP 4948)]
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc ISCSID_UIP_IPC_GET_IFACE: command: 1 name: bnx2i.d0:43:1e:51:98:53, netdev: eth2 ipaddr: 10.0.1.62 vlan: 0 transport_name:bnx2i
INFO  [Wed Jul 27 10:01:57 2016]nic_utils eth2: spinning up thread for nic
DBG   [Wed Jul 27 10:01:57 2016]iscsi_ipc Waiting for iscsid command
[New Thread 0x7ffff560b700 (LWP 4949)]
DBG   [Wed Jul 27 10:01:57 2016]nic eth2: Waiting to be enabled
INFO  [Wed Jul 27 10:01:57 2016]Created nic thread: eth2
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2: started NIC enable thread state: 0x1
DBG   [Wed Jul 27 10:01:57 2016]nic eth2: is now enabled
INFO  [Wed Jul 27 10:01:57 2016]bnx2x eth2: bnx2x driver using version 1.78.19
ERR   [Wed Jul 27 10:01:58 2016]bnx2x /dev/uio0: uio device has been brought up via pid: 4938 on fd: 7
INFO  [Wed Jul 27 10:01:58 2016]Done capturing /sys/class/uio/uio0/name
INFO  [Wed Jul 27 10:01:58 2016]bnx2x eth2: Verified is a cnic_uio device
DBG   [Wed Jul 27 10:01:58 2016]bnx2x eth2: using rx ring size: 15, rx buffer size: 1024
INFO  [Wed Jul 27 10:01:58 2016]Done capturing /sys/class/uio/uio0/event
DBG   [Wed Jul 27 10:01:58 2016]bnx2x Chip ID: 168e1000
INFO  [Wed Jul 27 10:01:58 2016]nic_id eth2: is found at 03:00.00
INFO  [Wed Jul 27 10:01:58 2016]bnx2x eth2: func 0x0, pfid 0x0, client_id 0x88, cid 0x1
DBG   [Wed Jul 27 10:01:58 2016]bnx2x eth2: mode = 0x100
INFO  [Wed Jul 27 10:01:58 2016]bnx2x eth2:  Using mac address: d0:43:1e:51:98:53
INFO  [Wed Jul 27 10:01:58 2016]eth2: bnx2x initialized
INFO  [Wed Jul 27 10:01:58 2016]nic eth2: Initialized ip stack: VLAN: 0
INFO  [Wed Jul 27 10:01:58 2016]nic eth2: mac: d0:43:1e:51:98:53
INFO  [Wed Jul 27 10:01:58 2016]nic eth2: Using IP address: 10.0.1.62
INFO  [Wed Jul 27 10:01:58 2016]nic eth2: Using netmask: 255.255.255.0
INFO  [Wed Jul 27 10:01:58 2016]nic eth2: enabled vlan 0 protocol: 2
INFO  [Wed Jul 27 10:01:58 2016]nic eth2: entering main nic loop
DBG   [Wed Jul 27 10:01:58 2016]nic_utils eth2: device enabled
[Thread 0x7ffff5e0c700 (LWP 4948) exited]
DBG   [Wed Jul 27 10:01:59 2016]iscsi_ipc recv iscsid request: cmd: 1, payload_len: 11720
INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc Received request for 'eth2' to set IP address: '10.0.1.62' VLAN: '0'
INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc Using netmask: 0.0.0.0
INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc  eth2, using existing NIC
INFO  [Wed Jul 27 10:01:59 2016]nic_utils looking for uio device for eth2
INFO  [Wed Jul 27 10:01:59 2016]nic_utils eth2 associated with uio0
INFO  [Wed Jul 27 10:01:59 2016]eth2: Have NIC library 'bnx2x'
INFO  [Wed Jul 27 10:01:59 2016]Done capturing /sys/class/uio/uio0/name
INFO  [Wed Jul 27 10:01:59 2016]nic_utils eth2: Verified uio name bnx2x_cnic with library bnx2x
INFO  [Wed Jul 27 10:01:59 2016]eth2: found NIC with library 'bnx2x'
INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc eth2 library set using transport_name bnx2i
INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc eth2: requesting configuration using static IP address
INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc eth2: using existing network interface
INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc eth2: IP configuration didn't change using 0x2
INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc eth2: NIC already enabled flags: 0x1084 state: 0x4

INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc ISCSID_UIP_IPC_GET_IFACE: command: 1 name: bnx2i.d0:43:1e:51:98:53, netdev: eth2 ipaddr: 10.0.1.62 vlan: 0 transport_name:bnx2i
DBG   [Wed Jul 27 10:01:59 2016]iscsi_ipc Waiting for iscsid command
INFO  [Wed Jul 27 10:02:00 2016]NIC_NL Received path_req for host 12
INFO  [Wed Jul 27 10:02:00 2016]Done capturing /sys/class/iscsi_host/host12/netdev
DBG   [Wed Jul 27 10:02:00 2016]NIC_NL Pulled nl event
INFO  [Wed Jul 27 10:02:00 2016]NIC_NL eth2: Processing 'path_req'
DBG   [Wed Jul 27 10:02:00 2016]NIC_NL eth2: PATH_REQ with iface_num -1 VLAN 32768
DBG   [Wed Jul 27 10:02:00 2016]CNIC eth2: Netlink message with VLAN ID: 0, path MTU: 9000 minor: 0 ip_addr_len: 4
DBG   [Wed Jul 27 10:02:00 2016]CNIC eth2: src=10.0.1.62
DBG   [Wed Jul 27 10:02:00 2016]CNIC eth2: dst=10.0.1.2
DBG   [Wed Jul 27 10:02:00 2016]CNIC eth2: nm=255.255.255.0
INFO  [Wed Jul 27 10:02:00 2016]CNIC eth2: Didn't find IPv4: '10.0.1.2' in ARP table
DBG   [Wed Jul 27 10:02:00 2016]CNIC eth2: Sent cnic arp request for IP: 10.0.1.2
INFO  [Wed Jul 27 10:02:00 2016]Found 10.0.1.2 at b0:83:fe:cc:57:bb
DBG   [Wed Jul 27 10:02:00 2016]CNIC neighbor reply sent back to kernel 10.0.1.62 at b0:83:fe:cc:57:bb with vlan 0
INFO  [Wed Jul 27 10:02:00 2016]NIC_NL eth2: 'path_req' operation finished

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff660d700 (LWP 4947)]
__lll_unlock_elision (lock=0x55555577fd40, private=0) at ../nptl/sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
29      ../nptl/sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or directory.


(gdb) # info threads
  Id   Target Id         Frame
  6    Thread 0x7ffff560b700 (LWP 4949) "iscsiuio" 0x00007ffff76f1ae3 in select () at ../sysdeps/unix/syscall-template.S:81
* 4    Thread 0x7ffff660d700 (LWP 4947) "iscsiuio" __lll_unlock_elision (lock=0x55555577fd40, private=0) at ../nptl/sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
  3    Thread 0x7ffff6e0e700 (LWP 4943) "iscsiuio" 0x00007ffff79c9ccd in accept () at ../sysdeps/unix/syscall-template.S:81
  2    Thread 0x7ffff760f700 (LWP 4942) "iscsiuio" do_sigwait (set=<optimized out>, sig=0x7ffff760eeac) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:63
  1    Thread 0x7ffff7fea700 (LWP 4938) "iscsiuio" 0x00007ffff79c9e9d in recvmsg () at ../sysdeps/unix/syscall-template.S:81


(gdb) # thread apply all bt

Thread 6 (Thread 0x7ffff560b700 (LWP 4949)):
#0  0x00007ffff76f1ae3 in select () at ../sysdeps/unix/syscall-template.S:81
#1  0x000055555555ac06 in ?? ()
#2  0x000055555555d39e in nic_loop ()
#3  0x00007ffff79c30a4 in start_thread (arg=0x7ffff560b700) at pthread_create.c:309
#4  0x00007ffff76f887d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7ffff660d700 (LWP 4947)):
#0  __lll_unlock_elision (lock=0x55555577fd40, private=0) at ../nptl/sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
#1  0x00007ffff79c7007 in pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:94
#2  0x000055555555e803 in nl_process_handle_thread ()
#3  0x00007ffff79c30a4 in start_thread (arg=0x7ffff660d700) at pthread_create.c:309
#4  0x00007ffff76f887d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7ffff6e0e700 (LWP 4943)):
#0  0x00007ffff79c9ccd in accept () at ../sysdeps/unix/syscall-template.S:81
#1  0x00005555555641b0 in ?? ()
#2  0x00007ffff79c30a4 in start_thread (arg=0x7ffff6e0e700) at pthread_create.c:309
#3  0x00007ffff76f887d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7ffff760f700 (LWP 4942)):
#0  do_sigwait (set=<optimized out>, sig=0x7ffff760eeac) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:63
#1  0x00007ffff79ca693 in __sigwait (set=0x7ffff760eeb0, sig=0x0) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:97
#2  0x000055555555a49c in _start ()

Thread 1 (Thread 0x7ffff7fea700 (LWP 4938)):
#0  0x00007ffff79c9e9d in recvmsg () at ../sysdeps/unix/syscall-template.S:81
#1  0x000055555555e5e9 in ?? ()
#2  0x000055555555eea8 in nic_nl_open ()
#3  0x000055555555a1b8 in main ()

The Open-iSCSI command which led to this segmentation fault was a simple login at a previously defined target node:

host5:~# iscsiadm -m node -T <target iqn> -I <interface> --login

Double-checking the configuration, the firmware and software versions as well as the general hardware setup didn't yield any usable indication as to where the root cause of this issue might be. Searching the web for __lll_unlock_elision in conjunction with the pthread_* function calls led me to the following resources:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800574
https://lwn.net/Articles/534758/

Those are pointing towards a CPU (Broadwell and Skylake) specific problem when not carefully using mutexes. The general opinion from there and also from other related bug reports seems to be that the source of such issues is almost always an improper use of mutex locking, which – up to now – has either been tolerated or just by chance not led to a failure. More recent CPUs and software implementations (e.g. the GNU libc) appear to be less forgiving in this regard. Thus the advice is to change the application behaviour towards a proper use of mutex locking in order to address such an issue.

The article Intel's Broadwell Xeon E5-2600 v4 chips: So what's in it for you, smartie-pants coders offers a rather nice write-up of the new features introduced in the Intel Broadwell CPUs.

Tracking this issue further down in the Open-iSCSI sources, i ended up in the function nl_process_handle_thread() in iscsiuio/src/unix/nic_nl.c and specifically in the following code section:

iscsiuio/src/unix/nic_nl.c
474 /* NIC specific nl processing thread */
475 void *nl_process_handle_thread(void *arg)
476 {
[...]
483         while (!event_loop_stop) {
484                 char *data = NULL;
485
486                 rc = pthread_cond_wait(&nic->nl_process_cond,
487                                        &nic->nl_process_mutex);
488                 if (rc != 0) {
489                         LOG_ERR("Fatal error in NL processing thread "
490                                 "during wait[%s]", strerror(rc));
491                         break;
492                 }
[...]
499                 pthread_mutex_unlock(&nic->nl_process_mutex);
[...]

Debugging this revealed that the call to pthread_cond_wait() from the above GDB backtrace output of thread number 4 is the one from line 486 in the above code snippet.

Looking at the pthread_cond_wait() manpage showed the following constraint for its proper use:

[…]
The pthread_cond_timedwait() and pthread_cond_wait() functions shall
block on a condition variable. They shall be called with mutex locked
by the calling thread or undefined behavior results.
[…]

Although not shown in the above GDB output, this would on occasion – and again, probably just by chance – work on the first pass of the loop. At the end of the loop, at line 499 in the above code snippet, the mutex is then unlocked. Thus the cited constraint from the pthread_cond_wait() manpage is no longer met on the subsequent passes of the loop. On Intel E5 v3 (aka Haswell) CPUs, this seemed to be tolerated and without any impact. But on Intel E5 v4 (aka Broadwell) – and probably other CPUs implementing HLE and RTM – this causes the observed segmentation fault.
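
For reference, the canonical pattern demanded by the manpage – take the mutex, wait on the condition variable in a loop around the guarded predicate and only then release the mutex – looks roughly like the following minimal sketch. The file name, the identifiers and the predicate used here are purely illustrative and not taken from the iscsiuio sources:

cond_wait_sketch.c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int event_pending;           /* predicate, protected by the mutex */

static void *wait_for_event(void *arg)
{
        (void)arg;

        /* The mutex must be held *before* pthread_cond_wait() is called. */
        pthread_mutex_lock(&lock);
        while (!event_pending)
                /* The call atomically releases the mutex while waiting and
                   re-acquires it before returning. */
                pthread_cond_wait(&cond, &lock);
        event_pending = 0;
        /* The mutex is still held here and has to be unlocked explicitly. */
        pthread_mutex_unlock(&lock);
        return NULL;
}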

In order to verify my analysis and test this theory, i added a call to pthread_mutex_lock() right before the call to pthread_cond_wait() in line 486. The resulting change is available in the Git commit 9f770f9e of my Open-iSCSI fork on Github and also shown in the following patch:

nic_nl.c.patch
diff --git a/iscsiuio/src/unix/nic_nl.c b/iscsiuio/src/unix/nic_nl.c
index 391003f..581ddb0 100644
--- a/iscsiuio/src/unix/nic_nl.c
+++ b/iscsiuio/src/unix/nic_nl.c
@@ -483,6 +483,7 @@ void *nl_process_handle_thread(void *arg)
        while (!event_loop_stop) {
                char *data = NULL;
 
+               pthread_mutex_lock(&nic->nl_process_mutex);
                rc = pthread_cond_wait(&nic->nl_process_cond,
                                       &nic->nl_process_mutex);
                if (rc != 0) {

Posting this on the Open-iSCSI mailing list led to this discussion. The suggestion from there was an additional change to the error handling code of nl_process_handle_thread() starting at line 488 in the above code snippet. This adds proper handling of the locked mutex in case the loop is left due to an error returned from the call to pthread_cond_wait(). The resulting additional change is available in the Git commit 4191ca6b of my Open-iSCSI fork on Github. The following patch shows the summarised code changes:

nic_nl.c.patch
diff --git a/iscsiuio/src/unix/nic_nl.c b/iscsiuio/src/unix/nic_nl.c
index 391003f..1a920c7 100644
--- a/iscsiuio/src/unix/nic_nl.c
+++ b/iscsiuio/src/unix/nic_nl.c
@@ -483,9 +483,11 @@ void *nl_process_handle_thread(void *arg)
        while (!event_loop_stop) {
                char *data = NULL;
 
+               pthread_mutex_lock(&nic->nl_process_mutex);
                rc = pthread_cond_wait(&nic->nl_process_cond,
                                       &nic->nl_process_mutex);
                if (rc != 0) {
+                       pthread_mutex_unlock(&nic->nl_process_mutex);
                        LOG_ERR("Fatal error in NL processing thread "
                                "during wait[%s]", strerror(rc));
                        break;

With those two small changes to the sources of Open-iSCSI's iscsiuio, the iSCSI connections via the Broadcom BCM57810S iSOE now work flawlessly even on newer Intel E5 v4 (aka Broadwell) based systems. Hopefully the original authors of the iscsiuio code at Broadcom/QLogic will also take part in the discussion and provide their feedback on the proposed code changes too.

1) , 2)
formerly Broadcom

// QLogic iSCSI HBA and Limitations in Bi-Directional Authentication

In the past the QLogic QConvergeConsole (qaucli) was used as an administration tool for the hardware initiator part of the QLogic 4000 and QLogic 8200 Series network adapters and iSCSI HBAs. Unfortunately this tool was only supported on the so-called “enterprise Linux distributions” like RHEL and SLES. If you were running any other Linux distribution like e.g. Debian or even one of the BSD distributions you were out of luck.

Thankfully QLogic addressed this support issue indirectly, by first announcing and since then by actually moving from an IOCTL based management method towards the Open-iSCSI based management method via the iscsiadm command. The announcement QLogic iSCSI Solution for Transitioning to the Open-iSCSI Model and the User's Guide IOCTL to Open-iSCSI Interface can be found at the QLogic web site.

While trying to test and use the new management method for the hardware initiator via the Open-iSCSI iscsiadm command, i soon ran into the issue that the packaged version of Open-iSCSI, which is shipped with Debian Wheezy, is based on the last stable release v2.0.873 from Open-iSCSI and is thus hopelessly out of date. The Open-iSCSI package shipped with Debian Jessie is a bit better, since it's already based on a newer version from the project's GitHub repository. Still, the Git commit used there dates back to August 23rd of 2013, which is also fairly old. After updating my system to Debian Jessie, i soon decided to rebuild the Open-iSCSI package from a much more recent version from the project's GitHub repository. With this, the management of the QLogic hardware initiators worked very well via the Open-iSCSI iscsiadm command and its now enhanced host mode.

In the host mode there are now the three sub-modes chap, flashnode and stats. See man iscsiadm and /usr/share/doc/open-iscsi/README.gz for more details on how to use them. By first calling the host mode without any sub-mode, iscsiadm prints a list of available iSCSI HBAs along with the host number – shown in the first pair of square brackets – associated with each host by the OS kernel:

root@host:~$ iscsiadm -m host
qla4xxx: [1] 10.0.0.5,[84:8f:69:35:fc:70],<empty> iqn.2000-04.com.qlogic:isp8214.000e1e37da2c.4
qla4xxx: [2] 10.0.0.6,[84:8f:69:35:fc:71],<empty> iqn.2000-04.com.qlogic:isp8214.000e1e37da2d.5

The host number – in the above example 1 and 2 – is used in the following examples showing the three sub-modes:

  • The stats sub-mode displays various statistics of the given HBA port, e.g. counters on the MAC, IP/IPv6, TCP and iSCSI protocol layers:

    root@host:~$ iscsiadm -m host -H 1 -C stats
    
    Host Statistics:
        mactx_frames: 2351750
        mactx_bytes: 233065914
        mactx_multicast_frames: 1209409
        mactx_broadcast_frames: 0
        mactx_pause_frames: 0
        mactx_control_frames: 0
        mactx_deferral: 0
        mactx_excess_deferral: 0
        mactx_late_collision: 0
        mactx_abort: 0
        mactx_single_collision: 0
        mactx_multiple_collision: 0
        mactx_collision: 0
        mactx_frames_dropped: 0
        mactx_jumbo_frames: 0
        macrx_frames: 4037613
        macrx_bytes: 1305799553
        macrx_unknown_control_frames: 0
        macrx_pause_frames: 0
        macrx_control_frames: 0
        macrx_dribble: 0
        macrx_frame_length_error: 0
        macrx_jabber: 0
        macrx_carrier_sense_error: 0
        macrx_frame_discarded: 0
        macrx_frames_dropped: 2409752
        mac_crc_error: 0
        mac_encoding_error: 0
        macrx_length_error_large: 0
        macrx_length_error_small: 0
        macrx_multicast_frames: 0
        macrx_broadcast_frames: 0
        iptx_packets: 1694187
        iptx_bytes: 112412836
        iptx_fragments: 0
        iprx_packets: 1446806
        iprx_bytes: 721191324
        iprx_fragments: 0
        ip_datagram_reassembly: 0
        ip_invalid_address_error: 0
        ip_error_packets: 0
        ip_fragrx_overlap: 0
        ip_fragrx_outoforder: 0
        ip_datagram_reassembly_timeout: 0
        ipv6tx_packets: 0
        ipv6tx_bytes: 0
        ipv6tx_fragments: 0
        ipv6rx_packets: 0
        ipv6rx_bytes: 0
        ipv6rx_fragments: 0
        ipv6_datagram_reassembly: 0
        ipv6_invalid_address_error: 0
        ipv6_error_packets: 0
        ipv6_fragrx_overlap: 0
        ipv6_fragrx_outoforder: 0
        ipv6_datagram_reassembly_timeout: 0
        tcptx_segments: 1694187
        tcptx_bytes: 69463008
        tcprx_segments: 1446806
        tcprx_byte: 692255204
        tcp_duplicate_ack_retx: 8
        tcp_retx_timer_expired: 28
        tcprx_duplicate_ack: 0
        tcprx_pure_ackr: 0
        tcptx_delayed_ack: 247594
        tcptx_pure_ack: 247710
        tcprx_segment_error: 0
        tcprx_segment_outoforder: 0
        tcprx_window_probe: 0
        tcprx_window_update: 2248673
        tcptx_window_probe_persist: 0
        ecc_error_correction: 0
        iscsi_pdu_tx: 1446486
        iscsi_data_bytes_tx: 30308
        iscsi_pdu_rx: 1446510
        iscsi_data_bytes_rx: 622721801
        iscsi_io_completed: 253632
        iscsi_unexpected_io_rx: 0
        iscsi_format_error: 0
        iscsi_hdr_digest_error: 0
        iscsi_data_digest_error: 0
        iscsi_sequence_error: 0
  • The chap sub-mode displays and alters a table containing authentication information. Calling this sub-mode with the -o show option displays the current contents of the table:

    root@host:~$ iscsiadm -m host -H 1 -C chap -o show
    # BEGIN RECORD 2.0-873
    host.auth.tbl_idx = 0
    host.auth.username_in = <empty>
    host.auth.password_in = <empty>
    # END RECORD
    # BEGIN RECORD 2.0-873
    host.auth.tbl_idx = 1
    host.auth.username = <empty>
    host.auth.password = <empty>
    # END RECORD
    
    [...]

    Why show isn't the default option in the context of the chap sub-mode, like it is in many other iscsiadm modes and sub-modes, is something i haven't quite understood yet. Maybe it's a security measure to not accidentally divulge sensitive information, maybe it has just been overlooked by the developers.

    Usually, there are already two initial records with the indexes 0 and 1 present on an HBA. As shown in the example above, each authentication record consists of three parameters: a record index host.auth.tbl_idx to reference it, a username host.auth.username or host.auth.username_in and a password host.auth.password or host.auth.password_in. Depending on whether the record is used for outgoing authentication of an initiator against a target or the other way around for incoming authentication of a target against an initiator, the parameter pairs username/password or username_in/password_in are used. Apparently both types of parameter pairs – incoming and outgoing – cannot be mixed together in a single record. My guess is that this isn't a limitation in Open-iSCSI, but rather a limitation in the specification and/or of the underlying hardware.

    New authentication records can be added with the -o new option:

    root@host:~$ iscsiadm -m host -H 1 -C chap -x 2 -o new
    root@host:~$ iscsiadm -m host -H 1 -C chap -o show
    # BEGIN RECORD 2.0-873
    host.auth.tbl_idx = 0
    host.auth.username_in = <empty>
    host.auth.password_in = <empty>
    # END RECORD
    # BEGIN RECORD 2.0-873
    host.auth.tbl_idx = 1
    host.auth.username = <empty>
    host.auth.password = <empty>
    # END RECORD
    # BEGIN RECORD 2.0-873
    host.auth.tbl_idx = 2
    host.auth.username = <empty>
    host.auth.password = <empty>
    # END RECORD
    
    [...]

    Parameters of existing authentication records can be set or updated with the -o update option. The particular record to be set or to be updated is selected with the -x <host.auth.tbl_idx> option, which references the record's host.auth.tbl_idx value. Multiple parameters can be set or updated with a single iscsiadm command by calling it with multiple pairs of -n <parameter-name> and -v <parameter-value>:

    root@host:~$ iscsiadm -m host -H 1 -C chap -x 2 -o update -n host.auth.username -v testuser -n host.auth.password -v testpassword
    root@host:~$ iscsiadm -m host -H 1 -C chap -o show
    # BEGIN RECORD 2.0-873
    host.auth.tbl_idx = 0
    host.auth.username_in = <empty>
    host.auth.password_in = <empty>
    # END RECORD
    # BEGIN RECORD 2.0-873
    host.auth.tbl_idx = 1
    host.auth.username = <empty>
    host.auth.password = <empty>
    # END RECORD
    # BEGIN RECORD 2.0-873
    host.auth.tbl_idx = 2
    host.auth.username = testuser
    host.auth.password = testpassword
    # END RECORD
    
    [...]

    Finally, existing authentication records can be deleted with the -o delete option:

    root@host:~$ iscsiadm -m host -H 1 -C chap -x 2 -o delete
  • The flashnode sub-mode displays and alters a table containing information about the iSCSI targets. Calling this sub-mode without any other options displays an overview of the currently configured flash nodes (i.e. targets) on a particular HBA:

    root@host:~$ iscsiadm -m host -H 1 -C flashnode
    qla4xxx: [0] 10.0.0.2:3260,0 iqn.2001-05.com.equallogic:0-fe83b6-a35c152cc-c72004e10ff558d4-lun-000002

    Similar to the previously mentioned host mode, each output line of the flashnode sub-mode contains an index number for each flash node entry (i.e. iSCSI target), which is shown in the first pair of square brackets. With this index number the individual flash node entries are referenced in all further operations.

    New flash nodes or target entries can be added with the -o new option. This operation also needs the information on whether the target addressed via the flash node will be reached via IPv4 or IPv6 addresses. This is accomplished with the -A ipv4 or -A ipv6 option:

    root@host:~$ iscsiadm -m host -H 1 -C flashnode -o new -A ipv4
    Create new flashnode for host 1.
    New flashnode for host 1 added at index 1.

    If the operation of adding a new flash node is successful, the index under which the new flash node is addressable is returned.

    root@host:~$ iscsiadm -m host -H 1 -C flashnode
    qla4xxx: [0] 10.0.0.2:3260,0 iqn.2001-05.com.equallogic:0-fe83b6-a35c152cc-c72004e10ff558d4-lun-000002
    qla4xxx: [1] 0.0.0.0:3260,0 <empty>

    Unlike authentication records, the flash node or target records contain a lot more parameters. They can be displayed by selecting a specific record by its index with the -x <flashnode_idx> option. The -o show option is the default and is thus optional:

    root@host:~$ iscsiadm -m host -H 1 -C flashnode -x 1
    # BEGIN RECORD 2.0-873
    flashnode.session.auto_snd_tgt_disable = 0
    flashnode.session.discovery_session = 0
    flashnode.session.portal_type = ipv4
    flashnode.session.entry_enable = 0
    flashnode.session.immediate_data = 0
    flashnode.session.initial_r2t = 0
    flashnode.session.data_seq_in_order = 1
    flashnode.session.data_pdu_in_order = 1
    flashnode.session.chap_auth_en = 1
    flashnode.session.discovery_logout_en = 0
    flashnode.session.bidi_chap_en = 0
    flashnode.session.discovery_auth_optional = 0
    flashnode.session.erl = 0
    flashnode.session.first_burst_len = 0
    flashnode.session.def_time2wait = 0
    flashnode.session.def_time2retain = 0
    flashnode.session.max_outstanding_r2t = 0
    flashnode.session.isid = 000e1e17da2c
    flashnode.session.tsid = 0
    flashnode.session.max_burst_len = 0
    flashnode.session.def_taskmgmt_tmo = 10
    flashnode.session.targetalias = <empty>
    flashnode.session.targetname = <empty>
    flashnode.session.discovery_parent_idx = 0
    flashnode.session.discovery_parent_type = Sendtarget
    flashnode.session.tpgt = 0
    flashnode.session.chap_out_idx = 2
    flashnode.session.chap_in_idx = 65535
    flashnode.session.username = <empty>
    flashnode.session.username_in = <empty>
    flashnode.session.password = <empty>
    flashnode.session.password_in = <empty>
    flashnode.session.is_boot_target = 0
    flashnode.conn[0].is_fw_assigned_ipv6 = 0
    flashnode.conn[0].header_digest_en = 0
    flashnode.conn[0].data_digest_en = 0
    flashnode.conn[0].snack_req_en = 0
    flashnode.conn[0].tcp_timestamp_stat = 0
    flashnode.conn[0].tcp_nagle_disable = 0
    flashnode.conn[0].tcp_wsf_disable = 0
    flashnode.conn[0].tcp_timer_scale = 0
    flashnode.conn[0].tcp_timestamp_en = 0
    flashnode.conn[0].fragment_disable = 0
    flashnode.conn[0].max_xmit_dlength = 0
    flashnode.conn[0].max_recv_dlength = 65536
    flashnode.conn[0].keepalive_tmo = 0
    flashnode.conn[0].port = 3260
    flashnode.conn[0].ipaddress = 0.0.0.0
    flashnode.conn[0].redirect_ipaddr = 0.0.0.0
    flashnode.conn[0].max_segment_size = 0
    flashnode.conn[0].local_port = 0
    flashnode.conn[0].ipv4_tos = 0
    flashnode.conn[0].ipv6_traffic_class = 0
    flashnode.conn[0].ipv6_flow_label = 0
    flashnode.conn[0].link_local_ipv6 = <empty>
    flashnode.conn[0].tcp_xmit_wsf = 0
    flashnode.conn[0].tcp_recv_wsf = 0
    flashnode.conn[0].statsn = 0
    flashnode.conn[0].exp_statsn = 0
    # END RECORD

    From the various parameters of a flash node or target record, the following are the most relevant in day to day use:

    • flashnode.session.chap_auth_en: Controls whether the initiator should authenticate against the target. This is enabled by default.

    • flashnode.session.bidi_chap_en: Controls whether the target should also authenticate itself against the initiator. This is disabled by default.

    • flashnode.session.targetname: The IQN of the target to be logged into and to be accessed.

    • flashnode.session.chap_out_idx: The index number (i.e. the value of the host.auth.tbl_idx parameter) of the authentication record to be used for authentication of the initiator against the target.

    • flashnode.conn[0].port: The TCP port of the target portal. The default is port 3260.

    • flashnode.conn[0].ipaddress: The IP address of the target portal.

    The parameter pairs flashnode.session.username/flashnode.session.password and flashnode.session.username_in/flashnode.session.password_in are handled differently from all the other parameters. They are not set or updated directly, but are rather filled in automatically. This is done by setting the respective flashnode.session.chap_out_idx or flashnode.session.chap_in_idx parameter to a value which references the index (i.e. the value of the host.auth.tbl_idx parameter) of an appropriate authentication record.

    Parameters of existing flash nodes or target entries can be set or updated with the -o update option. The particular record of which the parameters are to be set or to be updated is selected with the -x <flashnode_idx> option. This references an index number gathered from the list of flash nodes or an index number returned at the time of creation of a particular flash node. Multiple parameters can be set or updated with a single iscsiadm command by calling it with multiple pairs of -n <parameter-name> and -v <parameter-value>:

    root@host:~$ iscsiadm -m host -H 1 -C flashnode -x 1 -o update -n flashnode.session.chap_out_idx -v 2 -n flashnode.session.targetname -v iqn.2001-05.com.equallogic:0-fe83b6-d63c152cc-7ce004e1102558d4-lun-000003 -n flashnode.conn[0].ipaddress -v 10.0.0.2

    The flash node or target entry updated by this command is shown below in a cut-down fashion for brevity:

    root@host:~$ iscsiadm -m host -H 1 -C flashnode -x 1
    # BEGIN RECORD 2.0-873
    [...]
    flashnode.session.chap_auth_en = 1
    flashnode.session.discovery_logout_en = 0
    flashnode.session.bidi_chap_en = 0
    [...]
    flashnode.session.targetname = iqn.2001-05.com.equallogic:0-fe83b6-d63c152cc-7ce004e1102558d4-lun-000003
    [...]
    flashnode.session.chap_out_idx = 4
    flashnode.session.chap_in_idx = 65535
    flashnode.session.username = testuser
    flashnode.session.username_in = <empty>
    flashnode.session.password = testpassword
    flashnode.session.password_in = <empty>
    [...]
    flashnode.conn[0].port = 3260
    flashnode.conn[0].ipaddress = 10.0.0.2
    flashnode.conn[0].redirect_ipaddr = 0.0.0.0
    [...]
    # END RECORD

    Once configured, login and logout actions can be performed on the flash node (i.e. target) with the respective command options:

    root@host:~$ iscsiadm -m host -H 1 -C flashnode -x 1 -o login
    root@host:~$ iscsiadm -m host -H 1 -C flashnode -x 1 -o logout

    Unfortunately the iscsiadm command will return with a success status on the login and logout actions once it has passed them successfully to the HBA. This does not reflect the status of the actual login and logout actions subsequently taken by the HBA against the target configured in the respective flash node! To my knowledge there is currently no information passed back to the command line about the result of the login and logout actions at the HBA level.

    iSCSI sessions via the hardware initiator that were set up with the login action shown above – newly established as well as already existing ones – are shown along with all other Open-iSCSI session information in the output of the iscsiadm command in its session mode:

    root@host:~$ iscsiadm -m session
    qla4xxx: [1] 10.0.0.2:3260,1 iqn.2001-05.com.equallogic:0-fe83b6-a35c152cc-c72004e10ff558d4-lun-000002 (flash)
    qla4xxx: [2] 10.0.0.2:3260,1 iqn.2001-05.com.equallogic:0-fe83b6-a35c152cc-c72004e10ff558d4-lun-000002 (flash)
    
    [...]

    The fact that the session information is about an iSCSI session established via a hardware initiator is only signified by the flash label in the parentheses at the end of each line. In case of an iSCSI session established via a software initiator, the label in the parentheses reads non-flash.

    Finally, existing flash nodes can be deleted with the -o delete option:

    root@host:~$ iscsiadm -m host -H 1 -C flashnode -x 1 -o delete

The records from both the chap and the flashnode tables are stored in the HBA's flash memory. For the limits on how many entries can be stored in each table, see the specification of the particular HBA.

In my opinion, the integration of the management of the QLogic hardware initiators into the Open-iSCSI iscsiadm command improves and simplifies the administration and configuration a lot over the previous IOCTL based management method via the QLogic QConvergeConsole (qaucli). It finally opens management access to the QLogic hardware initiators to non-“enterprise Linux distributions” like Debian. Definitely a big step in the right direction! The importance of using a current version of Open-iSCSI can – in my experience – not be stressed enough. Building and maintaining a package based on a current version from the project's GitHub repository is definitely worth the effort.

One thing i couldn't get to work, though, was the incoming part of a bi-directional CHAP authentication. In this scenario, not only does the initiator authenticate itself at the target for an iSCSI session to be successfully established, the target also has to authenticate itself against the initiator. My initial thought was that a setup with bi-directional CHAP authentication should be easily accomplished by just performing the following three steps (roughly sketched as iscsiadm commands right after the list):

  1. creating an incoming authentication record with the value of the parameters host.auth.username_in and host.auth.password_in set to the respective values configured at the target storage system.

  2. setting the value of the flash node parameter flashnode.session.bidi_chap_en to 1.

  3. setting the value of the flash node parameter flashnode.session.chap_in_idx to the value of the parameter host.auth.tbl_idx, gathered from the newly created incoming authentication record in step 1.
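
Expressed with the iscsiadm syntax shown in the previous sections, these three steps would roughly look like the following sketch. The host number, the authentication record index, the flash node index and the credentials are purely illustrative, and the last command corresponds to the third step, which – as described below – did not work as expected:

root@host:~$ iscsiadm -m host -H 1 -C chap -x 3 -o new
root@host:~$ iscsiadm -m host -H 1 -C chap -x 3 -o update -n host.auth.username_in -v targetuser -n host.auth.password_in -v targetpassword
root@host:~$ iscsiadm -m host -H 1 -C flashnode -x 1 -o update -n flashnode.session.bidi_chap_en -v 1 -n flashnode.session.chap_in_idx -v 3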

The first two of the above tasks were indeed easily accomplished. The third one seemed easy too, but turned out to be more of a challenge. Setting the flash node parameter flashnode.session.chap_in_idx to the value of the host.auth.tbl_idx parameter from the previously created incoming authentication record just didn't work. Any attempt to change the default value 65535 failed. Neither was the flashnode.session.username_in/flashnode.session.password_in parameter pair automatically updated with the values from the parameters host.auth.username_in and host.auth.password_in. Oddly enough, bi-directional CHAP authentication worked as long as there was only one storage system with one set of incoming authentication credentials! Adding another set of flash nodes for a second storage system with its own set of incoming authentication credentials would cause the bi-directional CHAP authentication to fail for all targets on this second storage system.

Being unable to debug this weird behaviour any further on my own, i turned to the Open-iSCSI mailing list for help. See the thread on the Open-iSCSI mailing list for the details. Don't be confused by the fact that the thread at the mailing list was initially about getting the network communication with jumbo frames to work. This initial issue was resolved for me by switching to a more recent Open-iSCSI version as already mentioned above. My last question there was not answered publicly on the mailing list, but Adheer Chandravanshi from development at QLogic got in touch via email. Here's his explanation of the observed behaviour:

[…]

I see that you have added multiple incoming CHAP entries for both hosts 1 and 2.
But looks like even in case of multiple incoming CHAP entries in flash only the first entry takes effect for that particular host for all
the target node entries in flash.
This seems to be a limitation with the flash target node entries.
So you cannot map different incoming CHAP entry for different target nodes in HBA flash.

In your case, only following incoming CHAP entry will be effective for Host 1 as it's the first incoming chap entry. Same goes for Host
2.
# BEGIN RECORD 2.0-873
host.auth.tbl_idx = 2
host.auth.username_in = <username-from-target1>
host.auth.password_in = <password-from-target1>
# END RECORD


Try using only one incoming CHAP entry per host, you can have different outgoing CHAP entries though for each flash target node.

[…]

To sum this up, the QLogic HBAs basically use a first match approach when it comes to the incoming part of a bi-directional CHAP authentication. After finding the first configured incoming authentication record, the HBA uses the credentials stored there. Any other – and possibly more suitable – records for incoming authentication are ignored. There's also no way to override this behaviour on a case by case basis via the flash node entries (i.e. iSCSI targets).

In my humble opinion this is a rather serious limitation of the security features in the QLogic hardware initiators. No security policy i have ever encountered in any organisation would allow for the reuse of authentication credentials over different systems. Unfortunately i have no further information as to why the implementation turned out this way. Maybe there was no feature request for this yet, maybe it was just an oversight or maybe there is a limitation in the hardware, preventing a more flexible implementation. Unfortunately my reply to the above email with an inquiry whether such a feature would possibly be implemented in future firmware versions has – up to now – not been answered.
