bityard Blog

// Experiences with Java and X.509 Certificates - Certificate Revocation Lists

As already mentioned in the article Experiences with Java and X.509 Certificates - Code Signing, I recently had the task of researching two issues involving Java and X.509 certificates. While I'm familiar with X.509 certificates, I'm not too familiar with the inner workings of Java or the Java run-time environment, let alone programming in Java. So this was a good opportunity to familiarize myself with the inner workings of Java and also a great learning experience.

The second issue, which this blog post will be about, was in the area of TLS-secured HTTPS network connections, naturally involving an X.509 certificate on the server side. In particular, the server processes of two Java application servers were connecting the application logic of two different systems via an API over an HTTPS-based network connection. The source system would make RPC calls to an API on the target system in order to store data in and retrieve data from this system.

The communication between the two systems would work fine while using an unencrypted HTTP-based network connection, but would fail while using a secured HTTPS network connection. The errors reported back to the server process on the source system, which was initiating the communication, would point in the direction of an issue with the X.509 certificate that was being used on the target server. The error message would look something like this:

%% Invalidated:  [Session-1, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256]
main, SEND TLSv1.2 ALERT:  fatal, description = certificate_unknown
main, WRITE: TLSv1.2 Alert, length = 2
[Raw write]: length = 7
0000: 15 03 03 00 02 02 2E                               .......
main, called closeSocket()
main, handling exception: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException:
  PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find
  valid certification path to requested target
  Exception Failed to access the WSDL at: https://hostname:port/url/rpc-call. It failed with: 
        sun.security.validator.ValidatorException: PKIX path building failed:
            sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification
                path to requested target.
x.y.z.ExceptionHandler
        at x.y.z.method1(source1.java:line-number)
        at x.y.z.method2(source2.java:line-number)
        at x.y.z.method3(source3.java:line-number)

Checking the certificate used on the side of the target server turned up nothing unusual. The certificate was issued by the internal CA of our infrastructure and, like many other certificates, looked valid. Verifying this by initiating a non-Java-based HTTPS request (e.g. with a browser) to the target system worked fine. The certificate on the side of the target server was accepted and a secured HTTPS connection was successfully established. The next debugging step was to check the certificate store on the side of the source server, which was initiating the connection. It showed that a valid entry, containing the full chain of the root and the intermediate certificates of our internal CA, existed.
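The certificate store check can also be done from within Java itself. The following is only a minimal sketch, assuming the application relies on the JVM's default truststore (the class and method names are made up for this illustration). It lists the subjects of all CA certificates that the default trust managers would accept:

```java
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.List;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper class, not part of the original test setup.
class TrustAnchorLister {
    /** Returns the subject DNs of all CA certificates accepted by the default trust managers. */
    static List<String> trustedSubjects() throws Exception {
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init((KeyStore) null); // null selects the JVM's default truststore
        List<String> subjects = new ArrayList<>();
        for (TrustManager tm : tmf.getTrustManagers()) {
            if (tm instanceof X509TrustManager) {
                for (X509Certificate ca : ((X509TrustManager) tm).getAcceptedIssuers()) {
                    subjects.add(ca.getSubjectX500Principal().getName());
                }
            }
        }
        return subjects;
    }

    public static void main(String[] args) throws Exception {
        for (String subject : trustedSubjects()) {
            System.out.println(subject);
        }
    }
}
```

If the root and intermediate CA certificates show up in this list, the truststore itself is unlikely to be the cause of a failing handshake.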

From an infrastructure point of view, everything looked reasonably good and should work just fine. Still, the connection between the two systems was failing.

Unfortunately, but inevitably, the Java routines responsible for initiating the HTTPS connection on the source system were embedded in program logic far more complex than necessary or useful for the purpose of low-level debugging. For further debugging I wanted a Java-based test program which was stripped down to the bare minimum of initiating an HTTPS connection. Again, not being a seasoned Java programmer, I managed to cobble together the following test program from several examples on the internet:

CertChecker.java
import java.net.*;
import java.io.*;
import javax.net.ssl.*;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.SecureRandom;
import java.security.cert.*;
import java.util.*;
 
public class CertChecker implements X509TrustManager {
    public static void main(String[] args) throws Exception {
        try {
            SSLContext sc = SSLContext.getInstance("SSL");
            sc.init(null, new TrustManager[]{new CertChecker()}, new SecureRandom());
            SSLSocketFactory factory = (SSLSocketFactory)SSLSocketFactory.getDefault();
            SSLSocket socket = (SSLSocket)factory.createSocket("hostname", port);
            socket.startHandshake();
 
            PrintWriter out = new PrintWriter(
                                  new BufferedWriter(
                                  new OutputStreamWriter(
                                  socket.getOutputStream())));
 
            out.println("GET / HTTP/1.0");
            out.println();
            out.flush();
 
            if (out.checkError())
                System.out.println("SSLSocketClient:  java.io.PrintWriter error");
 
            BufferedReader in = new BufferedReader(
                                    new InputStreamReader(
                                    socket.getInputStream()));
 
            String inputLine;
            while ((inputLine = in.readLine()) != null)
                System.out.println(inputLine);
 
            in.close();
            out.close();
            socket.close();
 
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
 
    private final X509TrustManager defaultTM;
 
    public CertChecker() throws GeneralSecurityException {
        TrustManagerFactory tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init((KeyStore) null);
        defaultTM = (X509TrustManager) tmf.getTrustManagers()[0];
    }
 
    public void checkServerTrusted(X509Certificate[] certs, String authType) {
        if (defaultTM != null) {
            try {
                defaultTM.checkServerTrusted(certs, authType);
                Set<TrustAnchor> trustAnchors = getTrustAnchors();
                System.out.println("Certificate valid");
            } catch (CertificateException ex) {
                System.out.println("Certificate invalid: " + ex.getMessage());
            }
        }
    }
 
    private Set<TrustAnchor> getTrustAnchors() {
        X509Certificate[] acceptedIssuers = defaultTM.getAcceptedIssuers();
        Set<TrustAnchor> trustAnchors = new HashSet<TrustAnchor>();
        for (X509Certificate acceptedIssuer : acceptedIssuers) {
            TrustAnchor trustAnchor = new TrustAnchor(acceptedIssuer, null);
            trustAnchors.add(trustAnchor);
        }
        return trustAnchors;
    }
 
    public void checkClientTrusted(X509Certificate[] chain, String authType) throws CertificateException {
    }
 
    public X509Certificate[] getAcceptedIssuers() {
        return null;
    }
}

In line 16 of the above Java program, the placeholders hostname and port were replaced by the appropriate values for the target server system. The code was then compiled and run against the target server system, with the following Java command line call:

user@host:$ java -Djava.security.debug=all -Djavax.net.debug=all CertChecker

The result from this was the following error message output:

[...]
certpath: BasicChecker.updateState issuer: CN=root-ca, DC=domain, DC=tld; subject: CN=intermediate-ca, DC=domain, DC=tld; serial#: ...
certpath: -checker6 validation succeeded
certpath: -Using checker7 ... [sun.security.provider.certpath.RevocationChecker]
certpath: RevocationChecker.check: checking cert
  SN:     26000000 0cff68dd 06c50fb0 4a000100 00000c
  Subject: CN=intermediate-ca, DC=domain, DC=tld
  Issuer: CN=root-ca, DC=domain, DC=tld
certpath: RevocationChecker.checkCRLs() ---checking revocation status ...
certpath: RevocationChecker.checkCRLs() possible crls.size() = 0
certpath: RevocationChecker.checkCRLs() approved crls.size() = 0
certpath: DistributionPointFetcher.getCRLs: Checking CRLDPs for CN=intermediate-ca, DC=domain, DC=tld
certpath: Trying to fetch CRL from DP http://crl-fqdn/filename.crl
certpath: CertStore URI:http://crl-fqdn/filename.crl
certpath: Downloading new CRL...
certpath: Exception fetching CRL:
java.security.cert.CRLException: Empty input
java.security.cert.CRLException: Empty input
        at sun.security.provider.X509Factory.engineGenerateCRL(X509Factory.java:397)
        at java.security.cert.CertificateFactory.generateCRL(CertificateFactory.java:497)
        at sun.security.provider.certpath.URICertStore.engineGetCRLs(URICertStore.java:419)
        at java.security.cert.CertStore.getCRLs(CertStore.java:181)
        at sun.security.provider.certpath.DistributionPointFetcher.getCRL(DistributionPointFetcher.java:245)
        at sun.security.provider.certpath.DistributionPointFetcher.getCRLs(DistributionPointFetcher.java:189)
        at sun.security.provider.certpath.DistributionPointFetcher.getCRLs(DistributionPointFetcher.java:121)
        at sun.security.provider.certpath.RevocationChecker.checkCRLs(RevocationChecker.java:552)
        at sun.security.provider.certpath.RevocationChecker.checkCRLs(RevocationChecker.java:465)
        at sun.security.provider.certpath.RevocationChecker.check(RevocationChecker.java:367)
        at sun.security.provider.certpath.RevocationChecker.check(RevocationChecker.java:337)
        at sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(PKIXMasterCertPathValidator.java:125)
        at sun.security.provider.certpath.PKIXCertPathValidator.validate(PKIXCertPathValidator.java:219)
        at sun.security.provider.certpath.PKIXCertPathValidator.validate(PKIXCertPathValidator.java:140)
        at sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(PKIXCertPathValidator.java:79)
        at java.security.cert.CertPathValidator.validate(CertPathValidator.java:292)
        at sun.security.validator.PKIXValidator.doValidate(PKIXValidator.java:347)
        at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:249)
        at sun.security.validator.Validator.validate(Validator.java:260)
        at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
        at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
        at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
        at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1496)
        at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
        at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
        at sun.security.ssl.Handshaker.process_record(Handshaker.java:961)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
        at CertChecker.main(CertChecker.java:31)
[...]
%% Invalidated:  [Session-1, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256]
main, SEND TLSv1.2 ALERT:  fatal, description = certificate_unknown
main, WRITE: TLSv1.2 Alert, length = 2
[Raw write]: length = 7
0000: 15 03 03 00 02 02 2E                               .......
main, called closeSocket()
main, handling exception: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: \
    PKIX path validation failed: java.security.cert.CertPathValidatorException: Could not determine revocation status
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path validation failed: \
    java.security.cert.CertPathValidatorException: Could not determine revocation status
        at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
        at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
        at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
        at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
        at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1514)
        at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
        at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
        at sun.security.ssl.Handshaker.process_record(Handshaker.java:961)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
        at CertChecker.main(CertChecker.java:31)
Caused by: sun.security.validator.ValidatorException: PKIX path validation failed: \
    java.security.cert.CertPathValidatorException: Could not determine revocation status
        at sun.security.validator.PKIXValidator.doValidate(PKIXValidator.java:352)
        at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:249)
        at sun.security.validator.Validator.validate(Validator.java:260)
        at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
        at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
        at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
        at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1496)
        ... 8 more
Caused by: java.security.cert.CertPathValidatorException: Could not determine revocation status
        at sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(PKIXMasterCertPathValidator.java:135)
        at sun.security.provider.certpath.PKIXCertPathValidator.validate(PKIXCertPathValidator.java:219)
        at sun.security.provider.certpath.PKIXCertPathValidator.validate(PKIXCertPathValidator.java:140)
        at sun.security.provider.certpath.PKIXCertPathValidator.engineValidate(PKIXCertPathValidator.java:79)
        at java.security.cert.CertPathValidator.validate(CertPathValidator.java:292)
        at sun.security.validator.PKIXValidator.doValidate(PKIXValidator.java:347)
        ... 14 more
Caused by: java.security.cert.CertPathValidatorException: Could not determine revocation status
        at sun.security.provider.certpath.RevocationChecker.buildToNewKey(RevocationChecker.java:1092)
        at sun.security.provider.certpath.RevocationChecker.verifyWithSeparateSigningKey(RevocationChecker.java:910)
        at sun.security.provider.certpath.RevocationChecker.checkCRLs(RevocationChecker.java:577)
        at sun.security.provider.certpath.RevocationChecker.checkCRLs(RevocationChecker.java:465)
        at sun.security.provider.certpath.RevocationChecker.check(RevocationChecker.java:367)
        at sun.security.provider.certpath.RevocationChecker.check(RevocationChecker.java:337)
        at sun.security.provider.certpath.PKIXMasterCertPathValidator.validate(PKIXMasterCertPathValidator.java:125)
        ... 19 more

The Java exceptions shown above suggested an issue with the Certificate Revocation List (CRL) of the root CA. This was confusing, since the certificate worked when the target server was accessed with a browser. Manually downloading the CRL was also possible. Verifying the downloaded CRL with the help of OpenSSL showed no anomalies to which this issue could be attributed.

The line:

java.security.cert.CRLException: Empty input

from the above error messages led the way to solving this issue. This line could be interpreted in two ways:

  1. either there was a field in the CRL currently being parsed that had an empty or an unexpected value leading to a parse error
  2. or the downloaded CRL data which was handed over to the CRL parsing code was an empty set.
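The second interpretation can be tried out in isolation. The following is a minimal sketch (the class and method names are invented for this illustration) that hands zero bytes, and then some arbitrary non-CRL bytes, to the standard X.509 certificate factory and reports the resulting error message:

```java
import java.io.ByteArrayInputStream;
import java.security.cert.CRLException;
import java.security.cert.CertificateException;
import java.security.cert.CertificateFactory;

// Hypothetical helper class for experimenting with the CRL parser.
class EmptyCrlDemo {
    /** Tries to parse the bytes as an X.509 CRL; returns null on success, else the error message. */
    static String parseError(byte[] data) {
        try {
            CertificateFactory cf = CertificateFactory.getInstance("X.509");
            cf.generateCRL(new ByteArrayInputStream(data));
            return null;
        } catch (CRLException | CertificateException e) {
            return String.valueOf(e.getMessage());
        }
    }

    public static void main(String[] args) {
        // Case 2 from the list above: the parser receives no data at all.
        System.out.println("zero bytes:    " + parseError(new byte[0]));
        // For comparison: data is present, but it is not a CRL.
        System.out.println("garbage bytes: " + parseError("not a CRL".getBytes()));
    }
}
```

On a JRE like the one that produced the log above, the zero-byte case would be expected to report the same "Empty input" message, while the exact wording for malformed data may differ between JRE versions.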

I decided to first check option number 2 and, if this turned out not to be the case, move on from there to check option number 1. Since a manual download of the CRL worked per se, I decided to take a closer look at the download process with the curl command line utility:

user@host:$ curl -v http://crl-fqdn/filename.crl
* Hostname was NOT found in DNS cache
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 10.0.0.15...
* Connected to crl-fqdn (10.0.0.15) port 80 (#0)
> GET /filename.crl HTTP/1.1
> User-Agent: curl/7.38.0
> Host: crl-fqdn
> Accept: */*
> 
* HTTP 1.0, assume close after body
< HTTP/1.0 302 Found
< Location: https://crl-fqdn/filename.crl
< Server: loadbalancer-vendor
* HTTP/1.0 connection set to keep alive!
< Connection: Keep-Alive
< Content-Length: 0
< 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Connection #0 to host crl-fqdn left intact

The HTTP request for the CRL file was in fact answered with an HTTP status code 302 and a new location for the CRL file, redirecting any client to an HTTPS-secured location. What apparently works for any regular browser, and with the “-L” command line option even for the curl command line utility, seemed to be an issue for the part of the JRE that handles the download of the CRL. Instead of following the HTTP redirect and requesting the CRL from the alternate location, the CRL-handling code within the JRE is given the content data returned from the initial HTTP request, which is empty (see the Content-Length: 0 in the above curl output). This in turn leads to the failure to parse the CRL, which leads to the failure to verify the certificate against the CRL, which ultimately leads to the failure to establish a TLS-secured HTTPS network connection between the source and the target server system.
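This behaviour can be reproduced without the real infrastructure. The following sketch rests on the assumption that the JRE's CRL download behaves like a plain HttpURLConnection, which follows redirects by default but not when the redirect switches the protocol from HTTP to HTTPS. It starts a throw-away local HTTP server that plays the role of the load balancer; the class name and the redirect target crl-fqdn.invalid are placeholders invented for this illustration:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;

// Hypothetical reproduction, not the code path used inside the JRE.
class CrlRedirectDemo {
    /** Returns { HTTP status seen by the client, number of body bytes received }. */
    static int[] fetchBehindRedirect() throws Exception {
        // Stand-in for the load balancer: every request gets a 302 pointing
        // to an https:// location, together with an empty body.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            exchange.getResponseHeaders()
                    .add("Location", "https://crl-fqdn.invalid/filename.crl");
            exchange.sendResponseHeaders(302, -1); // -1 means: no response body
            exchange.close();
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            HttpURLConnection conn = (HttpURLConnection) URI
                    .create("http://127.0.0.1:" + port + "/filename.crl")
                    .toURL().openConnection();
            conn.setInstanceFollowRedirects(true); // already the default, stated for clarity
            int status = conn.getResponseCode();
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            try (InputStream in = conn.getInputStream()) {
                byte[] buf = new byte[4096];
                for (int n; (n = in.read(buf)) != -1; ) {
                    body.write(buf, 0, n);
                }
            }
            return new int[] { status, body.size() };
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        int[] result = fetchBehindRedirect();
        System.out.println("status=" + result[0] + ", body bytes=" + result[1]);
    }
}
```

The client ends up with the 302 response and a zero-length body rather than with the content behind the redirect, which is exactly the kind of input that makes a CRL parser give up.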

The immediate and simple solution to this issue was to add an exception to the HTTP-to-HTTPS redirect for the URL of the CRL. This global HTTP-to-HTTPS redirect had been implemented a while ago upon request by the information security department, in an attempt to make the site serving the CRL and other content more secure. Arguably, serving CRLs via HTTPS is generally a bad idea, since it can cause a chicken-and-egg type of problem: validating the certificate of the server providing the CRL may itself require fetching a CRL from that very server. In this case, though, it would have worked, since the host serving the CRL was protected by a certificate from a different CA than the one for which the CRL was being provided.

// Brocade Fabric OS Authentication Failure with SSH Public Key

With the update to Fabric OS v7.4.1d on Brocade fibre channel SAN switches, the CLI login via SSH public key authentication will sometimes be broken for administrative users. This blog post describes a manual workaround which can be used in order to temporarily correct this issue without the immediate need for another Fabric OS update.


During the preparation phase for migrating from our aging Brocade 5100 and 5300 Gen4 fibre channel SAN switches to the shiny new Brocade G620 Gen6 fibre channel SAN Switches, we needed to update the Fabric OS on the old switches to a v7.4.x version. Due to compatibility and support constraints with the IBM SAN Volume Controller (SVC), we decided to go with the Fabric OS v7.4.1d version.

After the successful Fabric OS update, the CLI login via SSH public key authentication was broken for some, but not all, users with admin level privileges on some, but not all, switches. A re-upload of the SSH public key for those users with the sshUtil importpubkey command didn't solve the issue. Debugging this further with strace attached to the SSH daemon process on an affected switch revealed why the SSH public key authentication was failing:

[...]
[pid 27941] connect(8, {sa_family=AF_FILE, path="/dev/log"}, 16) = -1 EPROTOTYPE (Protocol wrong type for socket)
[pid 27941] close(8)                    = 0
[pid 27941] socket(PF_FILE, SOCK_STREAM, 0) = 8
[pid 27941] fcntl64(8, F_SETFD, FD_CLOEXEC) = 0
[pid 27941] connect(8, {sa_family=AF_FILE, path="/dev/log"}, 16) = 0
[pid 27941] send(8, "<39>Feb  1 22:07:00 sshd[27941]: debug1: trying public key file /fabos/users/admin/.ssh/authorized_keys.<USERNAME>\0", 117, MSG_NOSIGNAL) = 117
[pid 27941] close(8)                    = 0
[pid 27941] open("/fabos/users/admin/.ssh/authorized_keys.<USERNAME_2>", O_RDONLY|O_NONBLOCK|O_LARGEFILE) = -1 EACCES (Permission denied)
[...]

This was done by using the root account on the Brocade switch. The same account was also used for the following research and the temporary workaround derived from this. Beware that using the root account on Brocade switches might have serious implications on the warranty or support for the devices. Be extra careful what you are doing as root on the Brocade switch, since it might easily affect the operational status of the device.

The last line from the above snippet of the strace output shows that the cause of the issue with SSH public key authentication was in the permissions of the user's authorized_keys file. Looking at the permissions of the directory containing the file and the file itself showed:

switch:FID128:root> cd /fabos/users/admin/
switch:FID128:root> ls -al
total 28
drwxr-xr-x   4 root     admin        4096 Jan 23 12:36 ./
drwxr-xr-x  12 root     sys          4096 Jul 15  2016 ../
-rw-r--r--   1 root     admin         507 Jul 15  2016 .bash_logout
-rw-r--r--   1 root     admin          27 Jul 15  2016 .inputrc
-rw-r--r--   1 root     admin        1275 Jul 15  2016 .profile
drwxr-xr-x   2 root     admin        4096 Feb  1 22:54 .ssh/
drwxrwxrwx   3 root     sys          4096 Aug 11  2011 .terminfo/

switch:FID128:root> cd .ssh
switch:FID128:root> pwd
/fabos/users/admin/.ssh

switch:FID128:root> ls -al
total 44
drwxr-xr-x   2 root     admin        4096 Feb  1 22:54 ./
drwxr-xr-x   4 root     admin        4096 Jan 23 12:36 ../
-rw-r--r--   1 root     admin       10240 Feb  1 22:54 authorizedKeys.tar
-rw-------   1 root     root          408 Jan 23 12:43 authorized_keys
-rw-r--r--   1 root     admin         755 Dec  8 11:13 authorized_keys.<USERNAME_1>
-rw-------   1 root     admin        1230 Feb  1 22:54 authorized_keys.<USERNAME_2>
-rw-------   1 root     root          408 Mar 22  2016 authorized_keys.<USERNAME_3>
-rw-r--r--   1 root     root          605 Sep 19 11:06 authorized_keys.<USERNAME_4>
-rw-r--r--   1 root     admin         134 Jul 15  2016 environment

switch:FID128:root> tar tvf authorizedKeys.tar
-rw-r--r-- root/admin      755 2017-12-08 11:13:06 authorized_keys.<USERNAME_1>
-rw------- root/admin     1230 2018-02-01 22:54:01 authorized_keys.<USERNAME_2>
-rw------- root/root       408 2016-03-22 10:12:37 authorized_keys.<USERNAME_3>
-rw-r--r-- root/root       606 2017-09-19 11:06:10 authorized_keys.<USERNAME_4>

switch:FID128:root> tar tvf /mnt/fabos/users/admin/.ssh/authorizedKeys.tar
-rw-r--r-- root/admin      755 2017-12-08 11:13:06 authorized_keys.<USERNAME_1>
-rw------- root/admin     1230 2018-02-01 22:54:01 authorized_keys.<USERNAME_2>
-rw------- root/root       408 2016-03-22 10:12:37 authorized_keys.<USERNAME_3>
-rw-r--r-- root/root       606 2017-09-19 11:06:10 authorized_keys.<USERNAME_4>

The permissions on some, but not all, of the authorized_keys.<USERNAME_*> files were too restrictive, since the SSH daemon was trying to read them as an effective user belonging to the admin group. An immediate fix for this issue was to alter the permissions on the authorized_keys.<USERNAME_*> files in order to allow the admin group to read the content of the files:
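Purely as an illustration of the permission logic, and not as something to run on the switch, the difference between the two relevant modes can be sketched as follows. The class and method names are invented for this example; it only parses the symbolic mode strings as shown by ls and does not touch any files:

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Hypothetical helper, only meant to illustrate the mode bits.
class GroupReadCheck {
    /** True if a file with the given symbolic mode (e.g. "rw-r-----") is readable by its group. */
    static boolean groupCanRead(String symbolicMode) {
        Set<PosixFilePermission> perms = PosixFilePermissions.fromString(symbolicMode);
        return perms.contains(PosixFilePermission.GROUP_READ);
    }

    public static void main(String[] args) {
        // Mode 600: only the owner (root) may read the file.
        System.out.println("rw------- -> group read: " + groupCanRead("rw-------")); // false
        // Mode 640: members of the owning group may read it as well.
        System.out.println("rw-r----- -> group read: " + groupCanRead("rw-r-----")); // true
    }
}
```

Group read access on its own only helps if the file's group is actually admin, which is why the ownership matters in addition to the mode.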

switch:FID128:root> cd /fabos/users/admin/.ssh/
switch:FID128:root> chmod 640 authorized_keys.*
switch:FID128:root> chown root:admin authorized_keys.*
switch:FID128:root> tar cpf authorizedKeys.tar authorized_keys.*

switch:FID128:root> cd /mnt/fabos/users/admin/.ssh/
switch:FID128:root> chmod 640 authorized_keys.*
switch:FID128:root> chown root:admin authorized_keys.*
switch:FID128:root> tar cpf authorizedKeys.tar authorized_keys.*

Looking again at the permissions of the directory containing the authorized_keys.<USERNAME_*> files and the files themselves now showed:

switch:FID128:root> cd /fabos/users/admin/.ssh/
switch:FID128:root> ls -la
total 44
drwxr-xr-x   2 root     admin        4096 Feb  1 22:54 ./
drwxr-xr-x   4 root     admin        4096 Jan 23 12:36 ../
-rw-r--r--   1 root     admin       10240 Feb  9 06:59 authorizedKeys.tar
-rw-------   1 root     root          408 Jan 23 12:43 authorized_keys
-rw-r-----   1 root     admin         755 Dec  8 11:13 authorized_keys.<USERNAME_1>
-rw-r-----   1 root     admin        1230 Feb  1 22:54 authorized_keys.<USERNAME_2>
-rw-r-----   1 root     admin         408 Mar 22  2016 authorized_keys.<USERNAME_3>
-rw-r-----   1 root     admin         605 Sep 19 11:06 authorized_keys.<USERNAME_4>
-rw-r--r--   1 root     admin         134 Jul 15  2016 environment

switch:FID128:root> tar tvf /fabos/users/admin/.ssh/authorizedKeys.tar
-rw-r----- root/admin      755 2017-12-08 11:13:06 authorized_keys.<USERNAME_1>
-rw-r----- root/admin     1230 2018-02-01 22:54:01 authorized_keys.<USERNAME_2>
-rw-r----- root/admin      408 2016-03-22 10:12:37 authorized_keys.<USERNAME_3>
-rw-r----- root/admin      606 2017-09-19 11:06:10 authorized_keys.<USERNAME_4>

switch:FID128:root> cd /mnt/fabos/users/admin/.ssh/
switch:FID128:root> ls -la
total 44
drwxr-xr-x   2 root     admin        4096 Feb  1 22:54 ./
drwxr-xr-x   4 root     admin        4096 Jan 23 12:50 ../
-rw-r--r--   1 root     admin       10240 Feb  1 22:54 authorizedKeys.tar
-rw-------   1 root     root          408 Jan 23 12:43 authorized_keys
-rw-r-----   1 root     admin         755 Dec  8 11:13 authorized_keys.<USERNAME_1>
-rw-r-----   1 root     admin        1230 Feb  1 22:54 authorized_keys.<USERNAME_2>
-rw-r-----   1 root     admin         408 Mar 22  2016 authorized_keys.<USERNAME_3>
-rw-r-----   1 root     admin         605 Sep 19 11:06 authorized_keys.<USERNAME_4>
-rw-r--r--   1 root     admin         134 Jan 23 12:50 environment

switch:FID128:root> tar tvf /mnt/fabos/users/admin/.ssh/authorizedKeys.tar
-rw-r----- root/admin      755 2017-12-08 11:13:06 authorized_keys.<USERNAME_1>
-rw-r----- root/admin     1230 2018-02-01 22:54:01 authorized_keys.<USERNAME_2>
-rw-r----- root/admin      408 2016-03-22 10:12:37 authorized_keys.<USERNAME_3>
-rw-r----- root/admin      606 2017-09-19 11:06:10 authorized_keys.<USERNAME_4>

With the corrected permissions on the authorized_keys.<USERNAME_*> files, CLI login via SSH public key authentication was now possible again.

Unfortunately this is only a temporary workaround, since the next upload of an SSH public key with the sshUtil importpubkey command will likely set the wrong permissions on the newly created or replaced authorized_keys.<USERNAME_*> file. This is because the root cause of the issue actually lies with the sshUtil importpubkey command. The snippet of strace output shown below was captured from a running sshUtil importpubkey command:

[...]
chdir("/fabos/users/admin/.ssh")        = 0
[...]
[pid 10611] execve("/bin/cat", ["cat", "<USERNAME_2>_brocade_dsa.pub"], [/* 45 vars */]) = 0
[...]
[pid 10612] execve("/bin/chmod", ["/bin/chmod", "600", "authorized_keys.<USERNAME_2>"], [/* 45 vars */]) = 0
[...]
[pid 10612] lstat64("authorized_keys.<USERNAME_2>", {st_mode=S_IFREG|0640, st_size=1230, ...}) = 0
[pid 10612] chmod("authorized_keys.<USERNAME_2>", 0600) = 0
[...]
[pid 10613] execve("/bin/cp", ["cp", "-f", "authorized_keys.<USERNAME_2>", "/mnt/fabos/users/admin/.ssh/"], [/* 45 vars */]) = 0
[...]
[pid 10613] chmod("/mnt/fabos/users/admin/.ssh/authorized_keys.<USERNAME_2>", 0100600) = 0
[...]
[pid 10618] execve("/bin/tar", ["tar", "-cf", "authorizedKeys.tar", "authorized_keys.<USERNAME_1>", "authorized_keys.<USERNAME_2>", "authorized_keys.<USERNAME_3>"], [/* 45 vars */]) = 0
[...]

The /bin/chmod command on the third line of the above strace output shows that the file permission for the authorized_keys.<USERNAME_2> file is mistakenly set to 600 (-rw-------) instead of at least 640 (-rw-r-----). Exactly why this only happens sometimes can't be analyzed further, since the source code of the sshUtil command is not available.

A permanent resolution to this issue will be to update to at least Fabric OS v7.4.1e. The Release Notes for Fabric OS v7.4.1e indicate this in the following known defect:

Defect ID: DEFECT000616486
Technical Severity: Medium
Probability: Medium
Product: Brocade Fabric OS
Technology Group: Security
Reported In Release: FOS7.4.1
Technology: SSH - Secure Shell
Symptom: Unable to authenticate an SSH session after importing public key to switch.
Condition: This is encountered by admin level users on a switch running Fabric OS v7.4.1d

// Experiences with Java and X.509 Certificates - Code Signing

Recently I had the task of researching two issues involving Java and X.509 certificates. While I'm familiar with X.509 certificates, I'm not too familiar with the inner workings of Java or the Java run-time environment, let alone programming in Java. So this was a good opportunity to familiarize myself with the inner workings of Java and also a great learning experience.

The first issue, which this blog post will be about, was in the area of code signing of an additional third party component to Java. The second issue was in the area of HTTPS network connections and will be the subject of another blog post.

In our Windows client environment, we use the smartcard system ActivIdentity from HID in conjunction with the single sign-on software SecureLogin from NetIQ, now a part of Micro Focus. There is a Java extension, in our case primarily used for an in-house Java-based application, which integrates an interface to the ActivIdentity software into the client's JRE. This aims to make the ActivIdentity smartcard system available for all Java-based applications in order to provide a single sign-on feature for the users.

The Java extension consists of the files:

$JAVA_HOME/lib/ext/javasso.jar
$JAVA_HOME/lib/ext/xbean.jar

During the preliminary tests for a rollout of a current release of the JRE version v1.8.0 on the Windows clients, the following issue surfaced. Probably due to the stricter enforcement of security measures in the current JRE version, the single sign-on integration would not work reliably any more, and sometimes not at all. There had previously been issues with this, and a workaround (albeit an ugly one) implemented by our Windows client team was to disable the certificate revocation checks for the entire JRE on the Windows clients. Now, with the new JRE to be rolled out, even this workaround wouldn't get the single sign-on to work any more.

From the console of the JRE the only clue was the following, but probably unrelated, Java exception:

java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at com.actividentity.sso.javasso.awt_swing.JavaSSOHook.addListenersRecursively(JavaSSOHook.java:356)
    at com.actividentity.sso.javasso.awt_swing.JavaSSOHook.addListenersRecursively(JavaSSOHook.java:455)
    at com.actividentity.sso.javasso.awt_swing.JavaSSOHook.addListenersRecursively(JavaSSOHook.java:455)
    at com.actividentity.sso.javasso.awt_swing.JavaSSOHook.addListenersRecursively(JavaSSOHook.java:455)
    at com.actividentity.sso.javasso.awt_swing.JavaSSOHook.addListenersRecursively(JavaSSOHook.java:455)
    at com.actividentity.sso.javasso.awt_swing.JavaSSOHook.addListenersRecursively(JavaSSOHook.java:455)
    at com.actividentity.sso.javasso.awt_swing.JavaSSOJob.refreshComponentTree(JavaSSOJob.java:168)
    at com.actividentity.sso.javasso.JavaSSOJobMgr.refreshComponentTrees(JavaSSOJobMgr.java:93)
    at com.actividentity.sso.javasso.JavaSSOJobMgr.run(JavaSSOJobMgr.java:190)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NullPointerException
    at com.sun.proxy.$Proxy0.equals(Unknown Source)
    at java.util.Vector.indexOf(Unknown Source)
    at java.util.Vector.indexOf(Unknown Source)
    at java.util.Vector.removeElement(Unknown Source)
    at oracle.ewt.event.ListenerManager.removeListener(Unknown Source)
    at oracle.ewt.lwAWT.lwWindow.DesktopContainer.removeDesktopListener(Unknown Source)
    ... 14 more

After some searching it turned out that the file $JAVA_HOME/lib/ext/javasso.jar, which is part of the Java extension provided by ActivIdentity, was signed with an X.509 certificate which expired in 2016:

user@host:$ openssl pkcs7 -inform DER -print_certs -in SUNCODES.RSA -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            0a:77:eb:6f:b1:d6:74:7c:f2:7d:4e:3d:43:fa:72:1c
    Signature Algorithm: sha1WithRSAEncryption
        Issuer: C=US, O=VeriSign, Inc., OU=VeriSign Trust Network, OU=Terms of use at https://www.verisign.com/rpa (c)10, CN=VeriSign Class 3 Code Signing 2010 CA
        Validity
            Not Before: Mar  6 00:00:00 2013 GMT
            Not After : Jun  4 23:59:59 2016 GMT
        Subject: C=US, ST=Utah, L=Provo, O=Novell, Inc., OU=Digital ID Class 3 - Java Object Signing, CN=Novell, Inc.
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                Modulus:
                    00:eb:e8:89:56:52:0f:be:7d:7a:90:8c:f6:a6:46:
                    c2:c5:d7:8d:de:ab:9d:44:79:b9:ca:be:d3:22:94:
                    58:a3:b9:49:b3:59:71:52:98:ec:30:48:c3:60:32:
                    13:19:ec:b0:19:f6:9c:4a:4b:89:6f:fd:cc:67:f1:
                    a4:c0:b6:37:b9:c7:3c:58:aa:0d:0e:cd:dc:06:ff:
                    17:64:ec:a9:9d:29:ef:ae:5b:49:ef:8c:ef:8c:38:
                    a4:1b:ec:b5:26:c2:65:80:c3:cf:b8:73:d5:e7:dc:
                    e2:54:3f:63:c8:c4:12:40:57:dd:9a:bc:56:ad:6a:
                    bc:65:a8:34:a0:df:d1:87:58:2c:06:65:74:a0:48:
                    0f:df:41:e4:6b:9b:d5:45:f2:3f:3a:c3:a9:c1:84:
                    bf:a0:d4:fa:ee:53:a3:09:51:b5:18:bf:98:aa:f0:
                    6e:77:8a:c1:fd:1c:4d:62:47:ca:2d:ae:93:4c:5a:
                    ae:32:39:eb:cc:4b:da:fe:cb:e7:5f:02:af:d1:c4:
                    5f:6b:d5:e0:3c:06:3c:3a:29:83:bc:c7:10:7a:4c:
                    9a:ff:ff:bd:84:62:a8:4c:bf:76:20:b8:d8:20:9c:
                    f7:86:3b:96:d4:30:52:30:66:f5:9f:48:59:e1:1c:
                    2d:10:e8:6b:67:be:8f:21:41:be:83:af:9f:e7:41:
                    10:73
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Basic Constraints:
                CA:FALSE
            X509v3 Key Usage: critical
                Digital Signature
            X509v3 CRL Distribution Points:

                Full Name:
                  URI:http://csc3-2010-crl.verisign.com/CSC3-2010.crl

            X509v3 Certificate Policies:
                Policy: 2.16.840.1.113733.1.7.23.3
                  CPS: https://www.verisign.com/rpa

            X509v3 Extended Key Usage:
                Code Signing
            Authority Information Access:
                OCSP - URI:http://ocsp.verisign.com
                CA Issuers - URI:http://csc3-2010-aia.verisign.com/CSC3-2010.cer

            X509v3 Authority Key Identifier:
                keyid:CF:99:A9:EA:7B:26:F4:4B:C9:8E:8F:D7:F0:05:26:EF:E3:D2:A7:9D

            Netscape Cert Type:
                Object Signing
            1.3.6.1.4.1.311.2.1.27:
                0.......
    Signature Algorithm: sha1WithRSAEncryption
         40:18:43:e9:58:06:c5:3e:82:de:ec:8e:69:20:26:43:3f:0b:
         41:0f:1b:cf:ca:5d:f6:e2:f2:c3:31:e7:c3:d0:07:f4:ea:8e:
         d5:1f:72:de:1e:4c:d6:8a:d6:c5:87:5a:7b:d5:46:d1:18:1b:
         85:5c:d2:fe:62:76:ff:94:e9:7a:db:32:99:51:9a:36:55:c4:
         b1:5e:f0:9a:0b:42:07:2e:ce:b6:84:d7:20:b6:51:ef:f6:c7:
         20:fd:7d:95:68:52:f3:91:6f:5e:5f:25:3f:13:ee:f2:8d:75:
         2c:ef:b4:26:43:c5:dc:af:78:9c:45:b7:04:87:b8:a1:fd:c3:
         f4:84:7e:91:97:12:02:ad:d9:16:5a:45:62:56:85:03:71:90:
         a9:cf:61:01:9b:6d:8d:9e:59:bc:fc:8f:46:de:27:db:71:e2:
         58:13:d2:fb:1b:e0:58:f0:9f:2d:3a:bc:ca:12:78:33:d3:7a:
         76:95:7e:53:c2:2b:4d:fb:6d:bb:92:8f:c6:28:0f:15:1d:af:
         7d:60:b5:a3:21:b3:66:e1:44:ab:91:10:85:d2:20:44:45:96:
         2c:14:3e:c1:87:92:ae:a9:d6:a9:84:2a:5e:15:6c:d8:bf:37:
         f2:33:2e:cc:64:49:ce:2c:e8:30:84:22:2c:b6:a9:c1:fc:30:
         97:48:d1:fa

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            52:00:e5:aa:25:56:fc:1a:86:ed:96:c9:d4:4b:33:c7
    Signature Algorithm: sha1WithRSAEncryption
        Issuer: C=US, O=VeriSign, Inc., OU=VeriSign Trust Network, OU=(c) 2006 VeriSign, Inc. - For authorized use only, CN=VeriSign Class 3 Public Primary Certification Authority - G5
        Validity
            Not Before: Feb  8 00:00:00 2010 GMT
            Not After : Feb  7 23:59:59 2020 GMT
        Subject: C=US, O=VeriSign, Inc., OU=VeriSign Trust Network, OU=Terms of use at https://www.verisign.com/rpa (c)10, CN=VeriSign Class 3 Code Signing 2010 CA
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                Modulus:
                    00:f5:23:4b:5e:a5:d7:8a:bb:32:e9:d4:57:f7:ef:
                    e4:c7:26:7e:ad:19:98:fe:a8:9d:7d:94:f6:36:6b:
                    10:d7:75:81:30:7f:04:68:7f:cb:2b:75:1e:cd:1d:
                    08:8c:df:69:94:a7:37:a3:9c:7b:80:e0:99:e1:ee:
                    37:4d:5f:ce:3b:14:ee:86:d4:d0:f5:27:35:bc:25:
                    0b:38:a7:8c:63:9d:17:a3:08:a5:ab:b0:fb:cd:6a:
                    62:82:4c:d5:21:da:1b:d9:f1:e3:84:3b:8a:2a:4f:
                    85:5b:90:01:4f:c9:a7:76:10:7f:27:03:7c:be:ae:
                    7e:7d:c1:dd:f9:05:bc:1b:48:9c:69:e7:c0:a4:3c:
                    3c:41:00:3e:df:96:e5:c5:e4:94:71:d6:55:01:c7:
                    00:26:4a:40:3c:b5:a1:26:a9:0c:a7:6d:80:8e:90:
                    25:7b:cf:bf:3f:1c:eb:2f:96:fa:e5:87:77:c6:b5:
                    56:b2:7a:3b:54:30:53:1b:df:62:34:ff:1e:d1:f4:
                    5a:93:28:85:e5:4c:17:4e:7e:5b:fd:a4:93:99:7f:
                    df:cd:ef:a4:75:ef:ef:15:f6:47:e7:f8:19:72:d8:
                    2e:34:1a:a6:b4:a7:4c:7e:bd:bb:4f:0c:3d:57:f1:
                    30:d6:a6:36:8e:d6:80:76:d7:19:2e:a5:cd:7e:34:
                    2d:89
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Basic Constraints: critical
                CA:TRUE, pathlen:0
            X509v3 Certificate Policies:
                Policy: 2.16.840.1.113733.1.7.23.3
                  CPS: https://www.verisign.com/cps
                  User Notice:
                    Explicit Text: https://www.verisign.com/rpa

            X509v3 Key Usage: critical
                Certificate Sign, CRL Sign
            1.3.6.1.5.5.7.1.12:
                0_.].[0Y0W0U..image/gif0!0.0...+..............k...j.H.,{..0%.#http://logo.verisign.com/vslogo.gif
            X509v3 CRL Distribution Points:

                Full Name:
                  URI:http://crl.verisign.com/pca3-g5.crl

            Authority Information Access:
                OCSP - URI:http://ocsp.verisign.com

            X509v3 Extended Key Usage:
                TLS Web Client Authentication, Code Signing
            X509v3 Subject Alternative Name:
                DirName:/CN=VeriSignMPKI-2-8
            X509v3 Subject Key Identifier:
                CF:99:A9:EA:7B:26:F4:4B:C9:8E:8F:D7:F0:05:26:EF:E3:D2:A7:9D
            X509v3 Authority Key Identifier:
                keyid:7F:D3:65:A7:C2:DD:EC:BB:F0:30:09:F3:43:39:FA:02:AF:33:31:33

    Signature Algorithm: sha1WithRSAEncryption
         56:22:e6:34:a4:c4:61:cb:48:b9:01:ad:56:a8:64:0f:d9:8c:
         91:c4:bb:cc:0c:e5:ad:7a:a0:22:7f:df:47:38:4a:2d:6c:d1:
         7f:71:1a:7c:ec:70:a9:b1:f0:4f:e4:0f:0c:53:fa:15:5e:fe:
         74:98:49:24:85:81:26:1c:91:14:47:b0:4c:63:8c:bb:a1:34:
         d4:c6:45:e8:0d:85:26:73:03:d0:a9:8c:64:6d:dc:71:92:e6:
         45:05:60:15:59:51:39:fc:58:14:6b:fe:d4:a4:ed:79:6b:08:
         0c:41:72:e7:37:22:06:09:be:23:e9:3f:44:9a:1e:e9:61:9d:
         cc:b1:90:5c:fc:3d:d2:8d:ac:42:3d:65:36:d4:b4:3d:40:28:
         8f:9b:10:cf:23:26:cc:4b:20:cb:90:1f:5d:8c:4c:34:ca:3c:
         d8:e5:37:d6:6f:a5:20:bd:34:eb:26:d9:ae:0d:e7:c5:9a:f7:
         a1:b4:21:91:33:6f:86:e8:58:bb:25:7c:74:0e:58:fe:75:1b:
         63:3f:ce:31:7c:9b:8f:1b:96:9e:c5:53:76:84:5b:9c:ad:91:
         fa:ac:ed:93:ba:5d:c8:21:53:c2:82:53:63:af:12:0d:50:87:
         11:1b:3d:54:52:96:8a:2c:9c:3d:92:1a:08:9a:05:2e:c7:93:
         a5:48:91:d3
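
The relevant detail in the dump above is the Validity block of the signing certificate. As a minimal sketch, and not part of the original analysis, the following Python snippet pulls the "Not After" timestamps out of such openssl text output and compares them against a reference time. The function names find_not_after and is_expired are made up for this illustration, and the pattern assumes the date format shown in the output above.

```python
import re
from datetime import datetime, timezone

# Month abbreviations as printed by openssl (independent of the system locale).
MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Matches lines such as "Not After : Jun  4 23:59:59 2016 GMT".
NOT_AFTER = re.compile(
    r"Not After\s*:\s*(\w{3})\s+(\d{1,2})\s+(\d{2}):(\d{2}):(\d{2})\s+(\d{4})\s+GMT")


def find_not_after(openssl_text):
    """Return all 'Not After' timestamps in the text as UTC datetimes."""
    result = []
    for mon, day, hh, mm, ss, year in NOT_AFTER.findall(openssl_text):
        result.append(datetime(int(year), MONTHS[mon], int(day),
                               int(hh), int(mm), int(ss), tzinfo=timezone.utc))
    return result


def is_expired(not_after, now=None):
    """True if the validity period ended before 'now' (default: current time)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now > not_after
```

Fed with the output above, find_not_after returns one timestamp per certificate in the order printed, so the first entry belongs to the Novell signing certificate and the second to the issuing VeriSign CA.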

Due to dependencies between the smartcard readers installed on our Windows clients and various driver and software versions, an update to the current and properly signed Java extension provided by ActivIdentity was not possible. The most immediate solution to this issue was to remove the original code signing information from the file $JAVA_HOME/lib/ext/javasso.jar and, for security purposes, to re-sign it with a still valid code signing certificate from our internal CA which is trusted by our Windows clients.
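
Removing the signing information from a JAR file essentially means dropping the signature files from its META-INF directory. The following Python sketch shows one way this could be done. It is not the procedure actually used here, the function names are made up for illustration, and it assumes that the signature block file seen above lives at META-INF/SUNCODES.RSA, as is usual for signed JAR files. The per-entry digests in MANIFEST.MF are left untouched by this sketch.

```python
import zipfile

# Signature-related entries in META-INF: the signature file (*.SF) and the
# signature block files (*.RSA, *.DSA, *.EC).
SIGNATURE_SUFFIXES = (".SF", ".RSA", ".DSA", ".EC")


def is_signature_entry(name):
    """Return True for signature files located directly in META-INF/."""
    upper = name.upper()
    if not upper.startswith("META-INF/"):
        return False
    base = upper[len("META-INF/"):]
    return "/" not in base and base.endswith(SIGNATURE_SUFFIXES)


def strip_jar_signature(src_path, dst_path):
    """Copy the JAR at src_path to dst_path, leaving out all signature entries.

    Returns the list of entry names that were dropped.
    """
    dropped = []
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if is_signature_entry(info.filename):
                dropped.append(info.filename)
                continue
            # Reuse the original ZipInfo so names and timestamps are kept.
            dst.writestr(info, src.read(info.filename))
    return dropped
```

The resulting unsigned copy could then be re-signed with a tool such as the JDK's jarsigner and a key from the internal CA. Writing to a new file instead of modifying the JAR in place keeps the original around for comparison.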

// Experiences with Dell PowerConnect Switches

This blog post is going to be about my recent experiences with Broadcom FASTPATH based Dell PowerConnect M-Series M8024-k and M6348 switches. Especially with their various limitations and – in my opinion – sometimes buggy behaviour.


Recently i was given the opportunity to build a new and central storage and virtualization environment from ground up. This involved a set of hardware systems which – unfortunately – were chosen and purchased previously, before i came on board with the project.

System environment

Specifically those hardware components were:

  • Multiple Dell PowerEdge M1000e blade chassis

  • Multiple Dell PowerEdge M-Series blade servers, all equipped with Intel X520 network interfaces for LAN connectivity through fabric A of the blade chassis. Servers with additional central storage requirements were also equipped with QLogic QME/QMD8262 or QLogic/Broadcom BCM57810S iSCSI HBAs for SAN connectivity through fabric B of the blade chassis.

  • Multiple Dell PowerConnect M8024-k switches in fabric A of the blade chassis forming the LAN network. Those were configured and interconnected as a stack of switches. Each stack of switches had two uplinks, one to each of two carrier grade Cisco border routers. Since the network edge was between those two border routers on the one side and the stack of M8024-k switches on the other side, the switch stack was also used as a layer 3 device and was thus running the default gateways of the local network segments provided to the blade servers.

  • Multiple Dell PowerConnect M6348 switches, which were connected through aggregated links to the stack of M8024-k switches described above. These switches were exclusively used to provide a LAN connection for external, standalone servers and devices through their external 1 GBit ethernet interfaces. The M6348 switches were located in the slots belonging to fabric C of the blade chassis.

  • Multiple Dell PowerConnect M8024-k switches in fabric B of the blade chassis forming the SAN network. In contrast to the M8024-k LAN switches, the M8024-k SAN switches were configured and interconnected as individual switches. Since there was no need for outside SAN connectivity, the M8024-k switches in fabric B ran a flat layer 2 network without any layer 3 configuration.

  • Initially all PowerConnect switches – M8024-k both LAN and SAN and M6348 – ran the firmware version 5.1.8.2.

  • Multiple Dell EqualLogic PS Series storage systems, providing central block storage capacity for the PowerEdge M-Series blade servers via iSCSI over the SAN mentioned above. Some blade chassis based PS Series models (PS-M4110) were internally connected to the SAN formed by the M8024-k switches in fabric B. Other standalone PS Series models were connected to the same SAN utilizing the external ports of the M8024-k switches.

  • Multiple Dell EqualLogic FS Series file server appliances, providing central NFS and CIFS storage capacity over the LAN mentioned above. In the back-end those FS Series file server appliances also used the block storage capacity provided by the PS Series storage systems via iSCSI over the SAN mentioned above. Both LAN and SAN connections of the EqualLogic FS Series were made through the external ports of the M8024-k switches.

There were multiple locations with roughly the same setup composed of the hardware components described above. Each location had two daisy-chained Dell PowerEdge M1000e blade chassis systems. The layer 2 LAN and SAN networks stretched over the two blade chassis. The setup at each location is shown in the following schematic:

Schematic of the Dell PowerConnect LAN and SAN setup

All in all not an ideal setup. Instead, i would have preferred a pair of central top-of-rack switches, capable both in terms of functionality and performance, to which the individual M1000e blade chassis would have been connected. Preferably a separate pair each for LAN and SAN connectivity. But again, the mentioned components were already preselected and pre-purchased.

During the implementation and later the operational phase several limitations and issues surfaced with regard to the Dell PowerConnect switches and the networks built with them. The following, probably not exhaustive, list of limitations and issues i've encountered is in no particular order with regard to their occurrence or severity.

Limitations

  • While the Dell PowerConnect switches support VRRP as a redundancy protocol for layer 3 instances, there is only support for VRRP version 2, described in RFC 3768. This limits the use of VRRP to IPv4 only. VRRP version 3 described in RFC 5798, which is needed for the implementation of redundant layer 3 instances for both IPv4 and IPv6, is not supported by Dell PowerConnect switches. Due to this limitation and the need for full IPv6 support in the whole environment, the design decision was made to run the Dell PowerConnect M8024-k switches for the LAN as a stack of switches.

  • Limited support of routing protocols. There is only support for the routing protocols OSPF and RIP v2 in Dell PowerConnect switches. In this specific setup and triggered by the design decision to run the LAN switches as layer 3 devices, BGP would have been a more suitable routing protocol. Unfortunately there were no plans to implement BGP on the Dell PowerConnect devices.

  • Limitation in the number of secondary interface addresses. Only one IPv4 secondary address is supported per interface on a layer 3 instance running on the Dell PowerConnect switches. In contrast to, for example, Cisco-based layer 3 capable switches, this limitation caused the need for a lot more (VLAN) interfaces in this particular setup than would otherwise have been necessary.

  • No IPv6 secondary interface addresses. For IPv6 based layer 3 instances there is no support at all for secondary interface addresses. Although this might be a fundamental rather than product specific limitation.

  • For layer 3 instances in general there is no support for very small IPv4 subnets (e.g. /31 with 2 IPv4 addresses) which are usually used for transfer networks. In setups using private IPv4 address ranges this is no big issue. In this case though, official IPv4 addresses were used and in conjunction with the excessive need for VLAN interfaces this limitation caused a lot of wasted official IPv4 addresses.

  • The access control list (ACL) feature is very limited and rather rudimentary in Dell PowerConnect switches. There is no support for port ranges, no statefulness and each access list has a hard limit of 256 access list entries. These three limitations in combination, and there may be more, make the ACL feature of Dell PowerConnect switches almost useless. Especially if there are separate layer 3 networks on the system which are in need of fine-grained traffic control.

  • Regarding the performance of ACLs, i have gotten the impression that especially IPv6 ACLs are handled by the switch's CPU. If IPv6 is used in conjunction with extensive ACLs, this would dramatically impact the network performance of IPv6-based traffic. Admittedly i have no hard proof to support this suspicion.

  • The out-of-band (OOB) management interface of the Dell PowerConnect switches does not provide true out-of-band management. Instead it is integrated into the switch as just another IP interface, although one with a special purpose. Due to this interaction of the OOB interface with the IP stack of the Dell PowerConnect switch, there are side-effects when the switch is running at least one layer 3 instance. In this case, the standard IP routing table of the switch is not only used for routing decisions of the payload traffic, but also to determine the next hop of packets originating from the OOB interface. This behaviour can cause an asymmetric traffic flow when the systems connecting to the OOB interface are covered by an entry in the switch's IP routing table. Far from ideal when it comes to true OOB management, not to mention the issues arising when there are also stateful firewall rules involved.

    I addressed this limitation with a support case at Dell and got the following statement back:

    FASTPATH can learn a default gateway for the service port, the network port,
    or a routing interface. The IP stack can only have a single default gateway.
    (The stack may accept multiple default routes, but if we let that happen we may
    end up with load balancing across the network and service port or some other
    combination we don't want.) RTO may report an ECMP default route. We only give
    the IP stack a single next hop in this case, since it's not likely we need to
    additional capacity provided by load sharing for packets originating on the
    box.

    The precedence of default gateways is as follows:
    - via routing interface
    - via service port
    - via network port

    As per the above precedence, ip stack is having the default gateway which is
    configured through RTO. When the customer is trying to ping the OOB from
    different subnet , route table donesn't have the exact route so,it prefers the
    default route and it is having the RTO default gateway as next hop ip. Due to
    this, it egresses from the data port.

    If we don't have the default route which is configured through RTO then IP
    stack is having the OOB default gateway as next hop ip. So, it egresses from
    the OOB IP only.

    In my opinion this just confirms how the OOB management of the Dell PowerConnect switches is severely broken by design.

  • Another issue with the out-of-band (OOB) management interface of the Dell PowerConnect switches is that it supports only a very limited access control list (ACL) to protect the access to the switch. The management ACL only supports one IPv4 ACL entry. IPv6 support within the management ACL protecting the OOB interface is missing altogether.

  • The Dell PowerConnect switches have no support for Shortest Path Bridging (SPB) as defined in the IEEE 802.1aq standard. On layer 2 the traditional spanning-tree protocols STP (IEEE 802.1D), RSTP (IEEE 802.1w) or MSTP (IEEE 802.1s) have to be used. This is particularly a drawback in the SAN network shown in the schematic above, because spanning tree keeps one inter-switch link inactive. With the use of SPB, all inter-switch links could be equally utilized and a traffic interruption upon link failure and spanning-tree (re)convergence could be avoided.

  • Another SAN-specific limitation is the incomplete implementation of Data Center Bridging (DCB) in the Dell PowerConnect switches. Although the protocols Priority-based Flow Control (PFC) according to IEEE 802.1Qbb and Congestion Notification (CN) according to IEEE 802.1Qau are supported, the third required protocol, Enhanced Transmission Selection (ETS) according to IEEE 802.1Qaz, is missing in Dell PowerConnect switches. The Dell EqualLogic PS Series storage systems used in the setup shown above explicitly need ETS if DCB is to be used on layer 2. Since ETS is not implemented in Dell PowerConnect switches, the traditional layer 2 protocols had to be used in the SAN.
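
The default gateway precedence quoted in the support statement on the OOB interface can be restated as a few lines of Python. This is only a toy model of what the statement says, not of the actual FASTPATH implementation. The function name and the addresses are made up, and it takes the "service port" to be the OOB interface, as the statement suggests.

```python
# Precedence of default gateway sources as quoted in the vendor statement,
# highest first.
PRECEDENCE = ("routing interface", "service port", "network port")


def effective_default_gateway(configured):
    """Pick the single default gateway the IP stack ends up using.

    'configured' maps a source name from PRECEDENCE to a next-hop address.
    Returns a (source, next_hop) tuple, or None if no default gateway exists.
    """
    for source in PRECEDENCE:
        if source in configured:
            return source, configured[source]
    return None


# Hypothetical next hops taken from the documentation address ranges.
routed_and_oob = {"routing interface": "192.0.2.1", "service port": "198.51.100.1"}
oob_only = {"service port": "198.51.100.1"}

print(effective_default_gateway(routed_and_oob))  # ('routing interface', '192.0.2.1')
print(effective_default_gateway(oob_only))        # ('service port', '198.51.100.1')
```

In the first case a reply to a management station in a remote subnet leaves through a data port although the request arrived on the OOB interface, which is exactly the asymmetric flow described above.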

Issues

  • Not per se an issue, but the baseline CPU utilization on Dell PowerConnect M8024-k switches running layer 3 instances is significantly higher compared to those running only as layer 2 devices. The following CPU utilization graphs show a direct comparison of a layer 3 (upper graph) and a layer 2 (lower graph) device:

    CPU utilization on a Dell PowerConnect M8024-k switch as a Layer 3 device
    CPU utilization on a Dell PowerConnect M8024-k switch as a Layer 2 device

    The CPU utilization is between 10 and 15% higher once the tasks of processing layer 3 traffic are involved. What kind of switch function or what type of traffic is causing this additional CPU utilization is completely opaque. Documentation on such in-depth subjects or details on how the processing within the Dell PowerConnect switches works is very scarce. It would be very interesting to know what kind of traffic is sent to the switch's CPU for processing instead of being handled by the hardware.

  • The very high CPU utilization plateau on the right hand side of the upper graph (approximately between 10:50 and 11:05) was due to a bug in the processing of IPv6 traffic on Dell PowerConnect switches. This issue caused IPv6 packets to be sent to the switch's CPU for processing instead of the forwarding decision being made in hardware. I narrowed down the issue by transferring a large file between two hosts via the SCP protocol. In the first case, determined by the preferred name resolution via DNS, an IPv6 connection was used:

    user@host1:~$ scp testfile.dmp user@host2:/var/tmp/
    testfile.dmp                                   8%  301MB 746.0KB/s 1:16:05 ETA

    The CPU utilization on the switch stack during the transfer was monitored on the switch's CLI:

    stack1(config)# show process cpu
    
    Memory Utilization Report
    
    status      bytes
    ------ ----------
      free  170642152
     alloc  298144904
    
    CPU Utilization:
    
      PID      Name                    5 Secs     60 Secs    300 Secs
    -----------------------------------------------------------------
     41be030 tNet0                     27.05%      30.44%      21.13%
     41cbae0 tXbdService                2.60%       0.40%       0.09%
     43d38d0 ipnetd                     0.40%       0.11%       0.11%
     43ee580 tIomEvtMon                 0.40%       0.09%       0.22%
     43f7d98 osapiTimer                 2.00%       3.56%       3.13%
     4608b68 bcmL2X.0                   0.00%       0.08%       1.16%
     462f3a8 bcmCNTR.0                  1.00%       0.87%       1.04%
     4682d40 bcmTX                      4.20%       5.12%       3.83%
     4d403a0 bcmRX                      9.21%      12.64%      10.35%
     4d60558 bcmNHOP                    0.80%       0.21%       0.11%
     4d72e10 bcmATP-TX                  0.80%       0.24%       0.32%
     4d7c310 bcmATP-RX                  0.20%       0.12%       0.14%
     53321e0 MAC Send Task              0.20%       0.19%       0.40%
     533b6e0 MAC Age Task               0.00%       0.05%       0.09%
     5d59520 bcmLINK.0                  5.41%       2.75%       2.15%
     84add18 tL7Timer0                  0.00%       0.22%       0.23%
     84ca140 osapiWdTask                0.00%       0.05%       0.05%
     84d3640 osapiMonTask               0.00%       0.00%       0.01%
     84d8b40 serialInput                0.00%       0.00%       0.01%
     95e8a70 servPortMonTask            0.40%       0.09%       0.12%
     975a370 portMonTask                0.00%       0.06%       0.09%
     9783040 simPts_task                0.80%       0.73%       1.40%
     9b70100 dtlTask                    5.81%       7.52%       5.62%
     9dc3da8 emWeb                      0.40%       0.12%       0.09%
     a1c9400 hapiRxTask                 4.00%       8.84%       6.46%
     a65ba38 hapiL3AsyncTask            1.60%       0.45%       0.37%
     abcd0c0 DHCP snoop                 0.00%       0.00%       0.20%
     ac689d0 Dynamic ARP Inspect        0.40%       0.10%       0.05%
     ac7a6c0 SNMPTask                   0.40%       0.19%       0.95%
     b8fa268 dot1s_timer_task           1.00%       0.78%       2.74%
     b9134c8 dot1s_task                 0.20%       0.07%       0.04%
     bdb63e8 dot1xTimerTask             0.00%       0.03%       0.02%
     c520db8 radius_task                0.00%       0.02%       0.05%
     c52a0b0 radius_rx_task             0.00%       0.03%       0.03%
     c58a2e0 tacacs_rx_task             0.20%       0.06%       0.15%
     c59ce70 unitMgrTask                0.40%       0.10%       0.20%
     c5c7410 umWorkerTask               1.80%       0.27%       0.13%
     c77ef60 snoopTask                  0.60%       0.25%       0.16%
     c8025a0 dot3ad_timer_task          1.00%       0.24%       0.61%
     ca2ab58 dot3ad_core_lac_tas        0.00%       0.02%       0.00%
     d1860b0 dhcpsPingTask              0.20%       0.13%       0.39%
     d18faa0 SNTP                       0.00%       0.02%       0.01%
     d4dc3b0 sFlowTask                  0.00%       0.00%       0.03%
     d6a4448 spmTask                    0.00%       0.13%       0.14%
     d6b79c8 fftpTask                   0.40%       0.06%       0.01%
     d6dcdf0 tCkptSvc                   0.00%       0.00%       0.01%
     d7babe8 ipMapForwardingTask        0.40%       0.18%       0.29%
     dba91b8 tArpCallback               0.00%       0.04%       0.04%
     defb340 ARP Timer                  2.60%       0.92%       1.29%
     e1332f0 tRtrDiscProcessingT        0.00%       0.00%       0.11%
    12cabe30 ip6MapLocalDataTask        0.00%       0.03%       0.01%
    12cb5290 ip6MapExceptionData       11.42%      12.95%       9.41%
    12e1a0d8 lldpTask                   0.60%       0.17%       0.30%
    12f8cd10 dnsTask                    0.00%       0.00%       0.01%
    140b4e18 dnsRxTask                  0.00%       0.03%       0.03%
    14176898 DHCPv4 Client Task         0.00%       0.01%       0.02%
    1418a3f8 isdpTask                   0.00%       0.00%       0.10%
    14416738 RMONTask                   0.00%       0.20%       0.42%
    144287f8 boxs Req                   0.20%       0.09%       0.21%
    15c90a18 sshd                       0.40%       0.07%       0.07%
    15cde0e0 sshd[0]                    0.20%       0.05%       0.02%
    -----------------------------------------------------------------
     Total CPU Utilization             89.77%      92.50%      77.29%

    In the second case an IPv4 connection was deliberately chosen:

    user@host1:~$ scp testfile.dmp user@10.0.0.1:/var/tmp/
    testfile.dmp                                 100% 3627MB  31.8MB/s   01:54

    Not only was the transfer rate of the SCP copy process significantly higher, and the transfer time consequently much lower, in the second case using an IPv4 connection; the CPU utilization on the switch stack during the transfer was also much lower:

    stack1(config)# show process cpu
    
    Memory Utilization Report
    
    status      bytes
    ------ ----------
      free  170642384
     alloc  298144672
    
    CPU Utilization:
    
      PID      Name                    5 Secs     60 Secs    300 Secs
    -----------------------------------------------------------------
     41be030 tNet0                      0.80%      23.49%      21.10%
     41cbae0 tXbdService                0.00%       0.17%       0.08%
     43d38d0 ipnetd                     0.20%       0.14%       0.12%
     43ee580 tIomEvtMon                 0.60%       0.26%       0.24%
     43f7d98 osapiTimer                 2.20%       3.10%       3.08%
     4608b68 bcmL2X.0                   4.20%       1.10%       1.22%
     462f3a8 bcmCNTR.0                  0.80%       0.80%       0.99%
     4682d40 bcmTX                      0.20%       3.35%       3.59%
     4d403a0 bcmRX                      4.80%       9.90%      10.06%
     4d60558 bcmNHOP                    0.00%       0.11%       0.10%
     4d72e10 bcmATP-TX                  1.00%       0.30%       0.32%
     4d7c310 bcmATP-RX                  0.00%       0.14%       0.15%
     53321e0 MAC Send Task              0.80%       0.39%       0.42%
     533b6e0 MAC Age Task               0.00%       0.12%       0.10%
     5d59520 bcmLINK.0                  1.80%       2.38%       2.14%
     84add18 tL7Timer0                  0.00%       0.11%       0.20%
     84ca140 osapiWdTask                0.00%       0.05%       0.05%
     84d3640 osapiMonTask               0.00%       0.00%       0.01%
     84d8b40 serialInput                0.00%       0.00%       0.01%
     95e8a70 servPortMonTask            0.20%       0.09%       0.11%
     975a370 portMonTask                0.00%       0.06%       0.09%
     9783040 simPts_task                3.20%       1.54%       1.49%
     9b70100 dtlTask                    0.20%       5.47%       5.45%
     9dc3da8 emWeb                      0.40%       0.13%       0.09%
     a1c9400 hapiRxTask                 0.20%       6.46%       6.30%
     a65ba38 hapiL3AsyncTask            0.40%       0.37%       0.35%
     abcd0c0 DHCP snoop                 0.00%       0.02%       0.18%
     ac689d0 Dynamic ARP Inspect        0.40%       0.15%       0.07%
     ac7a6c0 SNMPTask                   0.00%       1.32%       1.12%
     b8fa268 dot1s_timer_task           7.21%       2.99%       2.97%
     b9134c8 dot1s_task                 0.00%       0.03%       0.03%
     bdb63e8 dot1xTimerTask             0.00%       0.01%       0.02%
     c520db8 radius_task                0.00%       0.01%       0.04%
     c52a0b0 radius_rx_task             0.00%       0.03%       0.03%
     c58a2e0 tacacs_rx_task             0.20%       0.21%       0.17%
     c59ce70 unitMgrTask                0.60%       0.20%       0.21%
     c5c7410 umWorkerTask               0.20%       0.17%       0.12%
     c77ef60 snoopTask                  0.20%       0.18%       0.15%
     c8025a0 dot3ad_timer_task          2.20%       0.80%       0.68%
     d1860b0 dhcpsPingTask              1.80%       0.58%       0.45%
     d18faa0 SNTP                       0.00%       0.00%       0.01%
     d4dc3b0 sFlowTask                  0.20%       0.03%       0.03%
     d6a4448 spmTask                    0.20%       0.15%       0.14%
     d6b79c8 fftpTask                   0.00%       0.02%       0.01%
     d6dcdf0 tCkptSvc                   0.00%       0.00%       0.01%
     d7babe8 ipMapForwardingTask        0.20%       0.19%       0.28%
     dba91b8 tArpCallback               0.00%       0.06%       0.05%
     defb340 ARP Timer                  4.60%       1.54%       1.36%
     e1332f0 tRtrDiscProcessingT        0.40%       0.14%       0.12%
    12cabe30 ip6MapLocalDataTask        0.00%       0.01%       0.01%
    12cb5290 ip6MapExceptionData        0.00%       8.60%       8.91%
    12cbe790 ip6MapNbrDiscTask          0.00%       0.02%       0.00%
    12e1a0d8 lldpTask                   0.80%       0.24%       0.29%
    12f8cd10 dnsTask                    0.00%       0.00%       0.01%
    140b4e18 dnsRxTask                  0.40%       0.07%       0.04%
    14176898 DHCPv4 Client Task         0.00%       0.00%       0.02%
    1418a3f8 isdpTask                   0.00%       0.00%       0.09%
    14416738 RMONTask                   1.00%       0.44%       0.44%
    144287f8 boxs Req                   0.40%       0.16%       0.21%
    15c90a18 sshd                       0.20%       0.06%       0.06%
    15cde0e0 sshd[0]                    0.00%       0.03%       0.02%
    -----------------------------------------------------------------
     Total CPU Utilization             43.28%      78.79%      76.50%

    Comparing the two output samples above by per-process CPU utilization showed that the major share of the higher CPU utilization in the case of an IPv6 connection is attributable to the processes tNet0, bcmTX, bcmRX, bcmLINK.0, dtlTask, hapiRxTask and ip6MapExceptionData. In a process-by-process comparison, those seven processes used 60.3% more CPU time in the case of an IPv6 connection than in the case of an IPv4 connection. Unfortunately, documentation on what the individual processes are doing exactly is very sparse or not available at all. In order to further analyze this issue, a support case with the collected information was opened with Dell. A fix for the described issue was made available with firmware version 5.1.9.3.

  • The LAN stack of several Dell PowerConnect M8024-k switches sometimes showed erratic behaviour. There were several occasions where the switch stack would suddenly show a hugely increased latency in packet processing or where it would just stop passing certain types of traffic altogether. Usually a reload of the stack would restore its operation, and the increased latency or the packet drops would disappear with the reload as suddenly as they had appeared. Unfortunately, the root cause of this was never really found. Maybe it was the combination of functions (layer 3, dual stack IPv4 and IPv6, extensive ACLs, etc.) that were running simultaneously on the stack in this setup.

  • During both planned and unplanned failovers of the master switch in the stack, there is a time period of up to 120 seconds where no packets are processed by the switch stack. This occurs even with continuous forwarding enabled. I had a strong suspicion that this issue was related to the layer 3 instances running on the switch stack. A comparison between a pure layer 2 stack and a layer 3 enabled stack in a controlled test environment confirmed this. As soon as at least one layer 3 instance was added, the described delay occurred on switch failovers. The fact that migrating layer 3 instances from the former master switch to the new one takes some time makes sense to me. What's unclear to me is why this also seems to affect the layer 2 traffic going over the stack.

  • There were several occasions where the hardware and software MAC tables of the Dell PowerConnect switches got out of sync. While the root cause (hardware defect, bit flip, power surge, cosmic radiation, etc.) of this issue is unknown, the effect was a sudden reboot of the affected switch. Luckily we had console servers in place, which were storing a console output history from the time the issue occurred. After raising a support case with Dell with the information from the console output, we got a firmware update (v5.1.9.4) in which the issue would no longer trigger a sudden reboot, but instead log an appropriate message to the switch's log. With this fix the out-of-sync MAC tables will still require a reboot of the affected switch, but this can now be done in a controlled fashion. Still, a solution requiring no reboot at all would have been much preferable.

  • While querying the Dell PowerConnect switches with the SNMP protocol for monitoring purposes, obscure and confusing messages containing the string MGMT_ACAL would reproducibly be logged into the switch's log. See the article Check_MK Monitoring - Dell PowerConnect Switches - Global Status in this blog for the gory details.

  • With a stack of Dell PowerConnect M8024-k switches the information provided via the SNMP protocol would occasionally get out of sync with the information available from the CLI. E.g. the temperature values from the stack stack1 of LAN switches compared to the standalone SAN switches standalone{1,2,3,4,5,6}:

    user@host:# for HST in stack1 standalone1 standalone2 standalone3 stack2 standalone4 standalone5 standalone6; do 
      echo "$HST: ";
      for OID in 4 5; do
        echo -n "  ";
        snmpbulkwalk -v2c -c [...] -m '' -M '' -Cc -OQ -OU -On -Ot $HST .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.${OID};
      done;
    done
    
    stack1: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4 = No Such Object available on this agent at this OID
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5 = No Such Object available on this agent at this OID
    standalone1: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 40
    standalone2: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 37
    standalone3: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 32
    stack2: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.2.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 42
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.2.0 = 41
    standalone4: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 39
    standalone5: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 39
    standalone6: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 35

    At the same time the CLI management interface of the switch stack showed the correct temperature values:

    stack1# show system       
    
    System Description: Dell Ethernet Switch
    System Up Time: 89 days, 01h:50m:11s
    System Name: stack1
    Burned In MAC Address: F8B1.566E.4AFB
    System Object ID: 1.3.6.1.4.1.674.10895.3041
    System Model ID: PCM8024-k
    Machine Type: PowerConnect M8024-k
    Temperature Sensors:
    
    Unit     Description       Temperature    Status
                                (Celsius)
    ----     -----------       -----------    ------
    1        System            39             Good
    2        System            39             Good
    [...]

    Only after a reboot of the switch stack, the information provided via the SNMP protocol:

    user@host:# for HST in stack1 standalone1 standalone2 standalone3 stack2 standalone4 standalone5 standalone6; do 
      echo "$HST: ";
      for OID in 4 5; do
        echo -n "  ";
        snmpbulkwalk -v2c -c [...] -m '' -M '' -Cc -OQ -OU -On -Ot $HST .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.${OID};
      done;
    done
    
    stack1: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.2.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 37
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.2.0 = 37
    standalone1: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 39
    standalone2: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 37
    standalone3: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 32
    stack2: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.2.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 41
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.2.0 = 41
    standalone4: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 38
    standalone5: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 38
    standalone6: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 34

    would again be in sync with the information available from the CLI:

    stack1# show system   
    
    System Description: Dell Ethernet Switch
    System Up Time: 0 days, 00h:05m:32s
    System Name: stack1
    Burned In MAC Address: F8B1.566E.4AFB
    System Object ID: 1.3.6.1.4.1.674.10895.3041
    System Model ID: PCM8024-k
    Machine Type: PowerConnect M8024-k
    Temperature Sensors:
    
    Unit     Description       Temperature    Status
                                (Celsius)
    ----     -----------       -----------    ------
    1        System            37             Good
    2        System            37             Good
    [...]
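
The kind of SNMP/CLI mismatch described in the last item can be spotted automatically by scanning the snmpbulkwalk output for "No Such Object" answers. The following is only a minimal sketch in Python, assuming the output has exactly the shape produced by the shell loop above (a "host:" line followed by "OID = value" lines). The function names parse_walk_output and hosts_without_sensor_data are made up for this illustration, and the sample text is copied from the first output shown above.

```python
SAMPLE = """\
stack1:
  .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4 = No Such Object available on this agent at this OID
  .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5 = No Such Object available on this agent at this OID
standalone1:
  .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
  .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 40
"""


def parse_walk_output(text):
    """Group 'OID = value' lines under the preceding 'host:' line."""
    results = {}
    host = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if " = " in line:
            # Value line; ignore it if no host line has been seen yet.
            if host is not None:
                oid, value = line.split(" = ", 1)
                results[host][oid] = value
        elif line.endswith(":"):
            # Host line as printed by 'echo "$HST: "' in the loop above.
            host = line[:-1]
            results[host] = {}
    return results


def hosts_without_sensor_data(results):
    """Return hosts that answered with 'No Such Object' or with nothing at all."""
    return sorted(
        host
        for host, values in results.items()
        if not values
        or any(value.startswith("No Such Object") for value in values.values())
    )


if __name__ == "__main__":
    print(hosts_without_sensor_data(parse_walk_output(SAMPLE)))
```

Run against the sample, the sketch reports only stack1, which matches what is visible by eye in the output above. It only detects the missing SNMP objects; comparing the actual temperature values with the CLI output would still have to be done separately.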

Conclusion

Although the setup built with the Dell PowerConnect switches and the other hardware components was working and providing its basic, intended functionality, there were some pretty big and annoying limitations associated with it. A lot of these limitations would not have been that significant to the entire setup if certain design decisions had been made more carefully, for example if the layer 3 part of the LAN had been implemented in external network components or if a proper fully meshed, fabric-based SAN had been favored over what can only be described as a legacy technology. From the reliability, availability and serviceability (RAS) points of view, the setup is also far from ideal. By daisy-chaining the Dell PowerEdge M1000e blade chassis, stacking the LAN switches, stretching the LAN and SAN over both chassis and by connecting external devices through the external ports of the Dell PowerConnect switches, there are a lot of parts in the setup that depend on each other. This makes normal operations difficult at best and can have disastrous effects in case of a failure.

In retrospect, either using pure pass-through network modules in the Dell PowerEdge M1000e blade chassis in conjunction with capable 10GE top-of-rack switches or using the much more capable Dell Force10 MXL switches in the Dell PowerEdge M1000e blade chassis seems to be the better solution. The premium for Dell Force10 MXL switches of about €2000 list price per device compared to the Dell PowerConnect switches seems negligible compared to the costs that arose through debugging, bugfixing and finding workarounds for the various limitations of the Dell PowerConnect switches. In either case a pair of capable, central layer 3 devices for gateway redundancy, routing and possibly fine-grained traffic control would be advisable.

For simpler setups, without some of the more special requirements of this particular setup, the Dell PowerConnect switches still offer a nice price-performance ratio, especially with regard to their 10GE port density.

// Check_MK Monitoring - HPE Virtual Connect Fibre Channel Modules

This article provides patches for the standard Check_MK distribution in order to add support for the monitoring of HPE Virtual Connect Fibre Channel Modules.


Out of the box, there is currently no monitoring support for HPE Virtual Connect Fibre Channel Modules in the standard Check_MK distribution. Those modules, e.g. the HPE Virtual Connect 8Gb 20-port Fibre Channel Module, are used in HPE c-Class BladeSystem to provide Fibre Channel connectivity for the individual server blades. Fortunately the modules provide status and performance data via the standard SNMP FIBRE-CHANNEL-FE-MIB defined in RFC 2837 as well as its successor, the SNMP FCMGMT-MIB defined in RFC 4044. Those two SNMP MIBs are already covered by the checks qlogic_fcport, qlogic_sanbox and qlogic_sanbox_fabric_element, which are part of the standard Check_MK distribution. This simplifies the task of adding support for the HPE Virtual Connect Fibre Channel modules and reduces it to a matter of extending the already existing checks with three rather simple patches.

For the impatient (TL;DR), here are the enhanced versions of the qlogic_fcport, qlogic_sanbox and qlogic_sanbox_fabric_element checks:

Enhanced version of the qlogic_fcport check
Enhanced version of the qlogic_sanbox check
Enhanced version of the qlogic_sanbox_fabric_element check

The sources to the enhanced versions of all three checks can be found in my Check_MK Plugins repository on GitHub.

The necessary changes to qlogic_fcport and qlogic_sanbox_fabric_element are limited to the snmp_scan_function used by the Check_MK inventory. Here, the vendor specific OIDs for the HPE Virtual Connect Fibre Channel modules are added. The following patches show the respective lines for qlogic_fcport:

qlogic_fcport.patch
--- a/checks/qlogic_fcport   2017-03-06 21:00:07.397607946 +0100
+++ b/checks/qlogic_fcport   2017-10-01 14:34:48.153710776 +0200
@@ -218,12 +218,14 @@
     # .1.3.6.1.4.1.3873.1.12 QLogic 8 Gb and 4/8 Gb Intelligent Pass-thru Module
     # .1.3.6.1.4.1.3873.1.9  QLogic SANBox 5802 FC Switch
     # .1.3.6.1.4.1.3873.1.11 HP StorageWorks 8/20q Fibre Channel Switch
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function'    : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
         or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
         or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.11") \
         or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.12") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.9"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.9") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
     'group':                   'qlogic_fcport',
     'default_levels_variable': 'qlogic_fcport_default_levels',
 }

and for qlogic_sanbox_fabric_element:

qlogic_sanbox_fabric_element.patch
--- a/checks/qlogic_sanbox_fabric_element    2017-03-06 21:00:07.397607946 +0100
+++ b/checks/qlogic_sanbox_fabric_element    2017-10-01 14:47:35.000003198 +0200
@@ -54,7 +54,9 @@
                                                            OID_END]),
     # .1.3.6.1.4.1.3873.1.14 Qlogic-Switch
     # .1.3.6.1.4.1.3873.1.8  Qlogic-4Gb SAN Switch Module for IBM BladeCenter
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function'    : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
 }

In both cases, the relevant line is:

        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),

After those two simple changes, the checks will now be able to successfully inventorize the overall fabric status as well as the status of individual ports of HPE Virtual Connect Fibre Channel modules.
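
To illustrate what the extended snmp_scan_function decides, here is a small standalone sketch in Python. It is not the code shipped with Check_MK: the names SUPPORTED_PREFIXES, scan and fake_oid are hypothetical, fake_oid is merely a stand-in for the oid() callable that Check_MK passes to the scan function, and the "or ''" fallback is a defensive assumption of this sketch. The prefixes themselves are taken from the patched qlogic_fcport check shown above.

```python
# sysObjectID prefixes as listed in the patched qlogic_fcport check.
SUPPORTED_PREFIXES = (
    ".1.3.6.1.4.1.3873.1.14",
    ".1.3.6.1.4.1.3873.1.8",
    ".1.3.6.1.4.1.3873.1.11",
    ".1.3.6.1.4.1.3873.1.12",
    ".1.3.6.1.4.1.3873.1.9",
    ".1.3.6.1.4.1.3873.1.16",  # added for the HPE Virtual Connect modules
)

SYS_OBJECT_ID = ".1.3.6.1.2.1.1.2.0"


def scan(oid):
    """Return True if the device's sysObjectID starts with a known prefix."""
    # Assumption: treat a missing answer like an empty string.
    sys_object_id = oid(SYS_OBJECT_ID) or ""
    # str.startswith() accepts a tuple and checks each prefix in turn,
    # which is equivalent to the chain of "or" expressions in the patch.
    return sys_object_id.startswith(SUPPORTED_PREFIXES)


def fake_oid(sys_object_id):
    """Hypothetical stand-in for the oid() callable provided by Check_MK."""
    def oid(requested):
        return sys_object_id if requested == SYS_OBJECT_ID else None
    return oid


if __name__ == "__main__":
    # Both sysObjectID values below are made-up examples.
    print(scan(fake_oid(".1.3.6.1.4.1.3873.1.16.1")))
    print(scan(fake_oid(".1.3.6.1.4.1.8072.3.2.10")))
```

Keeping the prefixes in one tuple is just a design choice of this sketch; the patches above deliberately stick to the style of the existing checks in order to keep the diff small.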

The necessary changes to qlogic_sanbox also require the extension of the snmp_scan_function used by the Check_MK inventory as shown by the patches above. In addition to that, the string operations on the sensor_id need to be adjusted in order to get a more user-friendly name for the temperature and power supply sensors which are also present in the HPE Virtual Connect Fibre Channel modules. Since the sensor IDs are encoded in the SNMP OIDs and the SNMP tree for those OIDs can vary from module to module, the simple string replacement in the original qlogic_sanbox check was exchanged for a more general, regular expression based substitution. The following patch shows the respective lines for the combined changes to qlogic_sanbox:

qlogic_sanbox.patch
--- a/checks/qlogic_sanbox   2017-03-06 21:00:07.397607946 +0100
+++ b/checks/qlogic_sanbox   2017-10-01 14:47:51.348002546 +0200
@@ -44,7 +44,7 @@
     inventory = []
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_type == "8" and sensor_characteristic == "3" and \
             sensor_name != "Temperature Status":
             inventory.append( (sensor_id, None) )
@@ -53,7 +53,7 @@
 def check_qlogic_sanbox_temp(item, _no_params, info):
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_id == item:
             sensor_status = int(sensor_status)
             if sensor_status < 0 or sensor_status >= len(qlogic_sanbox_status_map):
@@ -93,9 +93,11 @@
                                                        OID_END]),
     # .1.3.6.1.4.1.3873.1.14 Qlogic-Switch
     # .1.3.6.1.4.1.3873.1.8  Qlogic-4Gb SAN Switch Module for IBM BladeCenter
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function'    : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
 }
 
 #.
@@ -113,7 +115,7 @@
     inventory = []
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_type == "5":
             inventory.append( (sensor_id, None) )
     return inventory
@@ -121,7 +123,7 @@
 def check_qlogic_sanbox_psu(item, _no_params, info):
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_id == item:
             sensor_status = int(sensor_status)
             if sensor_status < 0 or sensor_status >= len(qlogic_sanbox_status_map):
@@ -153,7 +155,9 @@
                                                        OID_END]),
     # .1.3.6.1.4.1.3873.1.14 Qlogic-Switch
     # .1.3.6.1.4.1.3873.1.8  Qlogic-4Gb SAN Switch Module for IBM BladeCenter
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function'    : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
 }

After those additional, but still simple, changes, the check will now be able to successfully inventorize the temperature and power supply sensors of HPE Virtual Connect Fibre Channel modules.
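
The effect of the regular expression based substitution can be tried out in isolation. The following Python sketch copies the pattern from the patch above (written as a raw string) and compares it with the original string replacement. The function names are hypothetical, and the sample sensor ID is invented for this illustration; it only follows the shape implied by the pattern (vendor prefix, some intermediate components, eight zero components, trailing index) and is not taken from a real module.

```python
import re

# Pattern copied from the patch above, written as a raw string.
SENSOR_ID_PREFIX = re.compile(
    r'^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.'
)


def new_sensor_name(sensor_id):
    """Substitution as used in the patched qlogic_sanbox check."""
    return SENSOR_ID_PREFIX.sub('', sensor_id)


def old_sensor_name(sensor_id):
    """String replacement as used in the original qlogic_sanbox check."""
    return sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")


if __name__ == "__main__":
    # Invented sensor ID carrying the second prefix from the pattern.
    sample = "16.0.116.70.160.113.7.0.0.0.0.0.0.0.0.3"
    print("old:", old_sensor_name(sample))
    print("new:", new_sensor_name(sample))
```

With this invented ID the original replacement leaves the unknown prefix in place, while the new pattern strips everything up to and including the run of zeros, so only the trailing index remains. IDs that do not start with one of the two prefixes pass through the new substitution unchanged.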
