bityard Blog

// IBM Storwize V3700 out of Memory

Under certain conditions it is possible to inadvertently run into an out of memory situation on IBM Storwize V3700 systems, by simply running a Download Support Package procedure or the respective CLI command. This will – of course – bring all I/O on the affected system to a grinding halt.

A few days ago, the Nagios monitoring plugin introduced in “Nagios Monitoring - IBM SVC and Storwize” reported a failed PSU on one of our IBM Storwize V3700 systems. After raising a PMR with IBM in order to get the seemingly defective PSU replaced, I was told to simply reseat the PSU. According to IBM support this would usually fix this – apparently known – issue.

This was the first WTF moment and it turned out not to be the last one. Either IBM produces and sells subpar components – in this case a PSU which needs to be given a boot (yes, PSUs nowadays have their own firmware too) in order to be persuaded to cooperate again – or IBM produces and sells subpar software which is unable to properly detect a component failure and distinguish between a faulty and a good PSU. Or perhaps it's an unfortunate combination of both.

In any case, the procedure to reseat the PSU was carried out, which fixed the PSU issue. During the fix procedure, though, the system became unreachable via TCP/IP for a rather long time: definitely over two minutes, although I didn't get a chance to measure it exactly. After the system was reachable again I followed up on this strange behaviour and had a look at the system's event log. There were quite a lot of “Error Code: 1370, Error Code Text: SCSI ERP occurred” messages, so I decided to bother IBM support again and send them a support collection in order to get an analysis of the reachability issue as well as the 1370 errors.

From previous occasions I knew that IBM support would most likely request a support collection which was run with the “svc_livedump” CLI command or with the “Standard logs plus new statesaves” option from the WebUI. The latter is marked red in the following screenshot example:

SVC and Storwize "Download Support Package" Dialog from the WebUI

So I decided to pull a support collection with this option. Some time into the support collection process, the SVC sitting in front of the V3700 and other storage systems started to show very high latencies (~60 sec.) on the primary VDisks backed by the V3700. On other VDisks which “only” had their secondary VDisk mirror located on MDiskGroups of the V3700, the latency peak was less dramatic, but still very noticeable. Eventually degraded paths to the MDisks located on the V3700 started showing up on the SVC. After the support collection process finished, the situation went back to normal. The latency on both primary and secondary VDisks instantly dropped to the usual values and after running the fix procedures on the SVC, the degraded paths came back online.

The performance issues were severe enough in magnitude and duration to affect several applications pretty badly. Although the immediate issue was resolved, I still needed an analysis and a written statement from IBM support, both for an action plan on how to prevent this kind of situation in the future and for compliance reasons. Here is the digest of what – according to IBM – happened:

  1. During the procedure to reseat the reportedly defective PSU, one or more power surges occurred.

  2. These power surges apparently caused issues on the internal disk buses, which led to the 1370 errors being logged.

  3. The power surges or the resulting 1370 errors are probably also the cause of a failover of the config node. Hence the connectivity issues with the CLI and the WebUI via TCP/IP.

  4. More 1370 errors were logged during the runtime of the subsequent “Standard logs plus new statesaves” support collection process.

  5. The issue of very high latency and MDisk paths becoming degraded was caused by the “Standard logs plus new statesaves” support collection process using up all of the memory – yes, including the data cache – on the V3700 system. This behaviour is specific and limited to the V3700 systems with “only” 4GB of memory.

As one can imagine, the last item was my second WTF moment. Apparently there are no programmatic safeguards to prevent the support collection process at the “Standard logs plus new statesaves” level from using up all of the system's memory. This would not be that bad if “Standard logs plus new statesaves” weren't the particular level which IBM support usually requests on support cases concerning SVC and Storwize systems. On the phone the IBM support technician admitted that this common practice is in general probably a bad idea. But he also mentioned that up to now he hadn't heard of the known side effects actually occurring, as they did in this case.

The suggestion on how to prevent this kind of situation in the future was to either upgrade the V3700 systems from 4GB to 8GB of memory – a solution I would gladly take, provided it came free of charge – or to run the support collection process only with the “svc_snap” CLI command or the “Standard logs” option from the WebUI. Since a free memory upgrade isn't likely to happen, I'll stick with the second suggestion for now.

Incidentally, a third option came up over the last weekend. Looking at the IBM System Storage SAN Volume Controller V7.3.0.9 Release Note, it could be construed that someone at IBM SVC and Storwize development came to the realization that the issue could also be addressed in software, by altering the resource utilization of the support collection process:

HU00636          Livedump prepare fails on V3500 & V3700 systems with 4GB
                 memory when cache partition fullness is less than 35%

Fingers crossed, this fix really addresses and resolves the issue described above.

// Nagios Monitoring - IBM SVC and Storwize (Update)

After upgrading our test IBM SAN Volume Controller (SVC) systems from version 7.3.0.7 to 7.3.0.8 or later, the previously described Nagios monitoring plugin (Nagios Monitoring - IBM SVC and Storwize) ceased to work. A quick check revealed that the “wbemcli” command line tool from the Standards Based Linux Instrumentation project, which is used in the Nagios plugin to query the CIMOM server on the SVC or Storwize systems, would fail with the following error message:

$ /opt/sblim-wbemcli/bin/wbemcli -noverify ecn https://user:pass@svc-test:5989/root/ibm
*
* /opt/sblim-wbemcli/bin/wbemcli: Http Exception: SSL connect error
*
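
To reproduce this check from a script rather than by hand, the call can be wrapped roughly as follows. This is only a sketch in Python and not code from the Nagios plugin: the function names `wbemcli_command`, `is_transport_error` and `enumerate_class_names` are made up for illustration, and the sketch assumes nothing more than the wbemcli path, the `-noverify ecn` arguments and the error text shown above. Credentials are pasted into the URL unescaped, so special characters in the password would need extra care.

```python
import subprocess

WBEMCLI = "/opt/sblim-wbemcli/bin/wbemcli"  # path used in this post


def wbemcli_command(host, user, password, namespace="root/ibm", port=5989):
    """Build the 'enumerate class names' (ecn) call shown above (hypothetical helper)."""
    url = "https://{}:{}@{}:{}/{}".format(user, password, host, port, namespace)
    return [WBEMCLI, "-noverify", "ecn", url]


def is_transport_error(output):
    """True if wbemcli reported an HTTP/SSL failure instead of CIM data."""
    return "Http Exception" in output


def enumerate_class_names(host, user, password):
    """Run wbemcli and return its combined output; raise on transport errors."""
    proc = subprocess.run(
        wbemcli_command(host, user, password),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=60,
    )
    if is_transport_error(proc.stdout):
        raise RuntimeError(proc.stdout.strip())
    return proc.stdout
```

Keeping the error detection in its own function makes it possible to test it against captured output without a live SVC or Storwize system.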

Re-checking the release notes, this sudden change in behaviour seemed to be explained by the fix:

SSL vulnerability CVE-2014-3566

That is not a very verbose description, but a quick look at CVE-2014-3566 showed that this is a fix for the “POODLE” issue. So IBM probably switched off support for the SSLv3 protocol in the SVC and Storwize code. But why would this cause the “wbemcli” command line tool to fail? Here are the steps taken in an analysis of the issue:

  1. First, I tried to get the “wbemcli” command line tool to be a tad more verbose about what it is actually doing:

    $ /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
    To server: <?xml version="1.0" encoding="utf-8" ?>
    <CIM CIMVERSION="2.0" DTDVERSION="2.0">
    <MESSAGE ID="4711" PROTOCOLVERSION="1.0"><SIMPLEREQ><IMETHODCALL NAME="EnumerateClassNames"><LOCALNAMESPACEPATH><NAMESPACE NAME="root"></NAMESPACE><NAMESPACE NAME="ibm"></NAMESPACE></LOCALNAMESPACEPATH>
    <IPARAMVALUE NAME="DeepInheritance"><VALUE>TRUE</VALUE></IPARAMVALUE>
    </IMETHODCALL></SIMPLEREQ>
    </MESSAGE></CIM>
    *
    * /opt/sblim-wbemcli/bin/wbemcli: Http Exception: SSL connect error
    *

    Not really an abundance of information in here either.

  2. Next, I got the source code for the “wbemcli” command line tool and, while at it, updated it from version 1.6.0 to 1.6.3. I spent some time looking through the source code and stepping through it with the “gdb” debugger to get a feeling for the general program flow and the functions and methods being called. While looking through the source code I noticed that the cURL debugging options are enabled if an environment variable named “CURLDEBUG” is set to “true”. I later also found this mentioned in the ChangeLog file:

    $ CURLDEBUG=true /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
    To server: <?xml version="1.0" encoding="utf-8" ?>
    <CIM CIMVERSION="2.0" DTDVERSION="2.0">
    <MESSAGE ID="4711" PROTOCOLVERSION="1.0"><SIMPLEREQ><IMETHODCALL NAME="EnumerateClassNames"><LOCALNAMESPACEPATH><NAMESPACE NAME="root"></NAMESPACE><NAMESPACE NAME="ibm"></NAMESPACE></LOCALNAMESPACEPATH>
    <IPARAMVALUE NAME="DeepInheritance"><VALUE>TRUE</VALUE></IPARAMVALUE>
    </IMETHODCALL></SIMPLEREQ>
    </MESSAGE></CIM>
    * About to connect() to svc-test port 5989 (#0)
    *   Trying 192.168.x.x...
    * connected
    * Connected to svc-test (192.168.x.x) port 5989 (#0)
    * successfully set certificate verify locations:
    *   CAfile: none
      CApath: /etc/ssl/certs
    * Unknown SSL protocol error in connection to svc-test:5989
    * Closing connection #0
    * SSL connect error
    *
    * /opt/sblim-wbemcli/bin/wbemcli: Http Exception: SSL connect error
    *

    Now we know that we're encountering an “Unknown SSL protocol error” – not that helpful either.

  3. I searched the web for further information on how to debug the cURL library and found the debug.c example source code, which was very helpful. I incorporated it into the file “CimCurl.cpp”:

    sblim-wbemcli-1.6.3_debug.patch
    --- sblim-wbemcli-1.6.3_orig/CimCurl.cpp    2013-09-21 01:26:32.000000000 +0200
    +++ sblim-wbemcli-1.6.3_new/CimCurl.cpp     2014-11-26 16:30:19.000000000 +0100
    @@ -37,6 +37,100 @@
     extern int waitTime;
     extern int expect100;
     
    +// Trace Begin
    +struct data {
    +  char trace_ascii; /* 1 or 0 */ 
    +};
    +
    +static
    +void dump(const char *text,
    +          FILE *stream, unsigned char *ptr, size_t size,
    +          char nohex)
    +{ 
    +  size_t i;
    +  size_t c;
    +  
    +  unsigned int width=0x10;
    +  
    +  if(nohex)
    +    /* without the hex output, we can fit more on screen */
    +    width = 0x40;
    +
    +  fprintf(stream, "%s, %010.10ld bytes (0x%08.8lx)\n",
    +          text, (long)size, (long)size);
    +
    +  for(i=0; i<size; i+= width) {
    +
    +    fprintf(stream, "%04.4lx: ", (long)i);
    +
    +    if(!nohex) {
    +      /* hex not disabled, show it */
    +      for(c = 0; c < width; c++)
    +        if(i+c < size)
    +          fprintf(stream, "%02x ", ptr[i+c]);
    +        else
    +          fputs("   ", stream);
    +    }
    +
    +    for(c = 0; (c < width) && (i+c < size); c++) {
    +      /* check for 0D0A; if found, skip past and start a new line of output */
    +      if (nohex && (i+c+1 < size) && ptr[i+c]==0x0D && ptr[i+c+1]==0x0A) {
    +        i+=(c+2-width);
    +        break;
    +      }
    +      fprintf(stream, "%c",
    +              (ptr[i+c]>=0x20) && (ptr[i+c]<0x80)?ptr[i+c]:'.');
    +      /* check again for 0D0A, to avoid an extra \n if it's at width */
    +      if (nohex && (i+c+2 < size) && ptr[i+c+1]==0x0D && ptr[i+c+2]==0x0A) {
    +        i+=(c+3-width);
    +        break;
    +      }
    +    }
    +    fputc('\n', stream); /* newline */
    +  }
    +  fflush(stream);
    +}
    +
    +static
    +int my_trace(CURL *handle, curl_infotype type,
    +             char *data, size_t size,
    +             void *userp)
    +{
    +  struct data *config = (struct data *)userp;
    +  const char *text;
    +  (void)handle; /* prevent compiler warning */
    +
    +  switch (type) {
    +  case CURLINFO_TEXT:
    +    fprintf(stderr, "== Info: %s", data);
    +  default: /* in case a new one is introduced to shock us */
    +    return 0;
    +
    +  case CURLINFO_HEADER_OUT:
    +    text = "=> Send header";
    +    break;
    +  case CURLINFO_DATA_OUT:
    +    text = "=> Send data";
    +    break;
    +  case CURLINFO_SSL_DATA_OUT:
    +    text = "=> Send SSL data";
    +    break;
    +  case CURLINFO_HEADER_IN:
    +    text = "<= Recv header";
    +    break;
    +  case CURLINFO_DATA_IN:
    +    text = "<= Recv data";
    +    break;
    +  case CURLINFO_SSL_DATA_IN:
    +    text = "<= Recv SSL data";
    +    break;
    +  }
    +
    +  dump(text, stderr, (unsigned char *)data, size, config->trace_ascii);
    +  return 0;
    +}
    +// Trace End
    +
     // These are the constant headers added to all requests
     static const char *headers[] = {
         "Content-Type: application/xml; charset=\"utf-8\"",
    @@ -152,6 +246,11 @@
         CURLcode rv;
         string sb;
     
    +// Trace Begin
    +struct data config;
    +config.trace_ascii = 1;
    +// Trace End
    +
         mUri = url.scheme + "://" + url.host + ":" + url.port + "/cimom";
         url.ns.toStringBuffer(sb,"%2F");
     
    @@ -248,6 +347,11 @@
     
         rv = curl_easy_setopt(mHandle, CURLOPT_WRITEHEADER, &mErrorData);
         rv = curl_easy_setopt(mHandle, CURLOPT_HEADERFUNCTION, headerCb);
    +
    +// Trace Begin
    +    rv = curl_easy_setopt(mHandle, CURLOPT_DEBUGFUNCTION, my_trace);
    +    rv = curl_easy_setopt(mHandle, CURLOPT_DEBUGDATA, &config);
    +// Trace End
     }
     
     static string getErrorMessage(CURLcode err)

    Rebuilt the “wbemcli” command line tool and tried again:

    $ CURLDEBUG=true /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
    To server: <?xml version="1.0" encoding="utf-8" ?>
    <CIM CIMVERSION="2.0" DTDVERSION="2.0">
    <MESSAGE ID="4711" PROTOCOLVERSION="1.0"><SIMPLEREQ><IMETHODCALL NAME="EnumerateClassNames"><LOCALNAMESPACEPATH><NAMESPACE NAME="root"></NAMESPACE><NAMESPACE NAME="ibm"></NAMESPACE></LOCALNAMESPACEPATH>
    <IPARAMVALUE NAME="DeepInheritance"><VALUE>TRUE</VALUE></IPARAMVALUE>
    </IMETHODCALL></SIMPLEREQ>
    </MESSAGE></CIM>
    == Info: About to connect() to svc-test port 5989 (#0)
    == Info:   Trying 192.168.x.x...
    == Info: connected
    == Info: Connected to svc-test (192.168.x.x) port 5989 (#0)
    == Info: found 172 certificates in /etc/ssl/certs/ca-certificates.crt
    == Info: gnutls_handshake() failed: A TLS packet with unexpected length was received.
    == Info: Closing connection #0
    == Info: SSL connect error
    *
    * /opt/sblim-wbemcli/bin/wbemcli: Http Exception: SSL connect error
    *

    Now we know there is an issue in the way the GnuTLS library used by libcURL interacts with the CIMOM server on the SVC or Storwize systems.

  4. Searching the web again for similar issues with the error message “gnutls_handshake() failed: A TLS packet with unexpected length was received.” suggested that a rebuild against the OpenSSL library instead of the GnuTLS library could solve this issue:

    $ dpkg -l | grep curl
    ii  curl                        7.26.0-1+wheezy11   powerpc     command line tool for transferring data with URL syntax
    ii  libcurl3:powerpc            7.26.0-1+wheezy11   powerpc     easy-to-use client-side URL transfer library (OpenSSL flavour)
    ii  libcurl3-gnutls:powerpc     7.26.0-1+wheezy11   powerpc     easy-to-use client-side URL transfer library (GnuTLS flavour)
    ii  libcurl4-gnutls-dev         7.26.0-1+wheezy11   powerpc     development files and documentation for libcurl (GnuTLS flavour)
    
    $ apt-get install libcurl4-openssl-dev
    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    Suggested packages:
      libcurl3-dbg
    The following packages will be REMOVED:
      libcurl4-gnutls-dev
    The following NEW packages will be installed:
      libcurl4-openssl-dev
    0 upgraded, 1 newly installed, 1 to remove and 0 not upgraded.
    Need to get 0 B/1,259 kB of archives.
    After this operation, 28.7 kB of additional disk space will be used.
    Do you want to continue [Y/n]? y
    (Reading database ... 65717 files and directories currently installed.)
    Removing libcurl4-gnutls-dev ...
    Processing triggers for man-db ...
    Selecting previously unselected package libcurl4-openssl-dev.
    (Reading database ... 65463 files and directories currently installed.)
    Unpacking libcurl4-openssl-dev (from .../libcurl4-openssl-dev_7.26.0-1+wheezy11_powerpc.deb) ...
    Processing triggers for man-db ...
    Setting up libcurl4-openssl-dev (7.26.0-1+wheezy11) ...
    
    $ dpkg -l | grep curl
    ii  curl                        7.26.0-1+wheezy11   powerpc     command line tool for transferring data with URL syntax
    ii  libcurl3:powerpc            7.26.0-1+wheezy11   powerpc     easy-to-use client-side URL transfer library (OpenSSL flavour)
    ii  libcurl3-gnutls:powerpc     7.26.0-1+wheezy11   powerpc     easy-to-use client-side URL transfer library (GnuTLS flavour)
    ii  libcurl4-openssl-dev        7.26.0-1+wheezy11   powerpc     development files and documentation for libcurl (OpenSSL flavour)

    Rebuilt the “wbemcli” command line tool and tried again:

    $ CURLDEBUG=true /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
    To server: <?xml version="1.0" encoding="utf-8" ?>
    <CIM CIMVERSION="2.0" DTDVERSION="2.0">
    <MESSAGE ID="4711" PROTOCOLVERSION="1.0"><SIMPLEREQ><IMETHODCALL NAME="EnumerateClassNames"><LOCALNAMESPACEPATH><NAMESPACE NAME="root"></NAMESPACE><NAMESPACE NAME="ibm"></NAMESPACE></LOCALNAMESPACEPATH>
    <IPARAMVALUE NAME="DeepInheritance"><VALUE>TRUE</VALUE></IPARAMVALUE>
    </IMETHODCALL></SIMPLEREQ>
    </MESSAGE></CIM>
    == Info: About to connect() to svc-test port 5989 (#0)
    == Info:   Trying 192.168.x.x...
    == Info: connected
    == Info: Connected to svc-test (192.168.x.x) port 5989 (#0)
    == Info: successfully set certificate verify locations:
    == Info:   CAfile: none
      CApath: /etc/ssl/certs
    == Info: SSLv3, TLS handshake, Client hello (1):
    => Send SSL data, 0000000134 bytes (0x00000086)
    0000: ......T.."\.~....`...K4.......p..7"&4...Z.....9.8.........5.....
    0040: ................3.2.....E.D...../...A...........................
    0080: ......
    == Info: Unknown SSL protocol error in connection to svc-test:5989
    == Info: Closing connection #0
    == Info: SSL connect error
    *
    * /opt/sblim-wbemcli/bin/wbemcli: Http Exception: SSL connect error
    *

    Now we know that the “wbemcli” command line tool is actually – as already suspected – trying to initiate an SSLv3 connection.

  5. In order to confirm we're on the right track, we first verify manually that we're unable to connect with an SSLv3 secured connection:

    $ openssl s_client -host svc-test -port 5989 -ssl3
    CONNECTED(00000003)
    write:errno=104
    ---
    no peer certificate available
    ---
    No client certificate CA names sent
    ---
    SSL handshake has read 0 bytes and written 0 bytes
    ---
    New, (NONE), Cipher is (NONE)
    Secure Renegotiation IS NOT supported
    Compression: NONE
    Expansion: NONE
    SSL-Session:
        Protocol  : SSLv3
        Cipher    : 0000
        Session-ID:
        Session-ID-ctx:
        Master-Key:
        Key-Arg   : None
        PSK identity: None
        PSK identity hint: None
        SRP username: None
        Start Time: 1418397594
        Timeout   : 7200 (sec)
        Verify return code: 0 (ok)
    ---
    quit

    And that we're instead able to connect with a TLS secured connection:

    $ openssl s_client -host svc-test -port 5989
    CONNECTED(00000003)
    depth=0 C = GB, L = Hursley, O = IBM, OU = SSG, CN = 2145, emailAddress = support@ibm.com
    verify error:num=18:self signed certificate
    verify return:1
    depth=0 C = GB, L = Hursley, O = IBM, OU = SSG, CN = 2145, emailAddress = support@ibm.com
    verify return:1
    ---
    Certificate chain
     0 s:/C=GB/L=Hursley/O=IBM/OU=SSG/CN=2145/emailAddress=support@ibm.com
       i:/C=GB/L=Hursley/O=IBM/OU=SSG/CN=2145/emailAddress=support@ibm.com
    ---
    Server certificate
    -----BEGIN CERTIFICATE-----
    MIICyDCCAjGgAwIBAgIEUAPzmTANBgkqhkiG9w0BAQUFADBqMQswCQYDVQQGEwJH
    QjEQMA4GA1UEBxMHSHVyc2xleTEMMAoGA1UEChMDSUJNMQwwCgYDVQQLEwNTU0cx
    DTALBgNVBAMTBDIxNDUxHjAcBgkqhkiG9w0BCQEWD3N1cHBvcnRAaWJtLmNvbTAe
    Fw0xMjA3MTYxMDU3MjlaFw0yNzA3MTMxMDU3MjlaMGoxCzAJBgNVBAYTAkdCMRAw
    DgYDVQQHEwdIdXJzbGV5MQwwCgYDVQQKEwNJQk0xDDAKBgNVBAsTA1NTRzENMAsG
    A1UEAxMEMjE0NTEeMBwGCSqGSIb3DQEJARYPc3VwcG9ydEBpYm0uY29tMIGfMA0G
    CSqGSIb3DQEBAQUAA4GNADCBiQKBgQC3E7+7mE2GAID/35o5/s7cnzoqu9PQdOGB
    ryGMa8adD4Wd9hpmTkrsgyNvkUB6sPIifbFstGooOkQtK9ZNgP5OHOorZmqINSxM
    9goCkSCQG9xRKAvNt2tA8gujaV+p42oVEhIH6naJUul96qZI31y3GffUu2CRrJL7
    4wG/8cv0BQIDAQABo3sweTAJBgNVHRMEAjAAMCwGCWCGSAGG+EIBDQQfFh1PcGVu
    U1NMIEdlbmVyYXRlZCBDZXJ0aWZpY2F0ZTAdBgNVHQ4EFgQUkPMkXUjn0YHlfQW8
    TJiRC5jWQO4wHwYDVR0jBBgwFoAUkPMkXUjn0YHlfQW8TJiRC5jWQO4wDQYJKoZI
    hvcNAQEFBQADgYEAKqu7KpVxnOXonQE3unC1O7qUHKoyQUEWqcKsM/4tPI+lsBMZ
    jvoPwn8yQRWiLehFmVc8VSZfdFPLzshNabXp5qbZo/EFberXrgI2CbtPiULYyyyH
    DUhWF+vhwb6uqwfBbGncvTvI2ewU8+0oTXsuTkSjumJ7+chpaHFWWyj2cJA=
    -----END CERTIFICATE-----
    subject=/C=GB/L=Hursley/O=IBM/OU=SSG/CN=2145/emailAddress=support@ibm.com
    issuer=/C=GB/L=Hursley/O=IBM/OU=SSG/CN=2145/emailAddress=support@ibm.com
    ---
    No client certificate CA names sent
    ---
    SSL handshake has read 1029 bytes and written 498 bytes
    ---
    New, TLSv1/SSLv3, Cipher is AES256-GCM-SHA384
    Server public key is 1024 bit
    Secure Renegotiation IS supported
    Compression: NONE
    Expansion: NONE
    SSL-Session:
        Protocol  : TLSv1.2
        Cipher    : AES256-GCM-SHA384
        Session-ID: 48291D368E0A8584A8DFA00A9881B8979BDE370FC6C9439294C670695D031239
        Session-ID-ctx:
        Master-Key: 71BD12A161FC595CD056DA8E6D6E27420F37468E47498B7591A403A86844C55F61FF02B2FEC7739FAAEDCE3DFEA0F217
        Key-Arg   : None
        PSK identity: None
        PSK identity hint: None
        SRP username: None
        TLS session ticket lifetime hint: 300 (seconds)
        TLS session ticket:
        0000 - a0 e0 b1 9b c2 37 9a ca-49 1c 54 f5 26 4b d6 24   .....7..I.T.&K.$
        0010 - af 6a 7d cc 5e 4a 97 a8-b3 6d b7 66 0b b7 0a 65   .j}.^J...m.f...e
        0020 - 47 af ef 47 76 fc c7 e9-38 ff 84 28 ca 8e 73 25   G..Gv...8..(..s%
        0030 - 47 25 f6 0d 36 01 04 f1-f9 f7 0c b6 42 ef cf 09   G%..6.......B...
        0040 - 64 8f df ff 89 38 ed 7c-ae 1d 0e 25 d1 c1 77 86   d....8.|...%..w.
        0050 - b6 61 88 15 cf fe 9f 20-86 0d 17 74 18 da ea c0   .a..... ...t....
        0060 - 33 3a 47 f5 f9 51 24 ae-48 37 8a 3f 19 dd c6 04   3:G..Q$.H7.?....
        0070 - 7e d1 20 78 35 99 0b 9f-3b 1f ce 7c bc 11 93 e4   ~. x5...;..|....
        0080 - 0f 94 de 94 f1 0d 0c da-64 ca 0d f6 10 2a c8 fa   ........d....*..
        0090 - dc 3e e4 1a 97 d1 34 7a-9c f5 c3 00 e8 1b 10 d7   .>....4z........
    
        Start Time: 1418397614
        Timeout   : 300 (sec)
        Verify return code: 18 (self signed certificate)
    ---
    quit

    The connection negotiated with TLS succeeded, so we're on the right track!

  6. Now we need to find the spot in the source code where the “wbemcli” command line tool is forced to initiate an SSLv3 connection. We know this is probably done in a cURL related function call, since libcURL is used for the network connection. So let's first search the cURL related code for all lines showing any sign of SSL related operations:

    $ grep -n curl CimCurl.cpp | grep -i ssl
    185:    // Assume we support SSL if we don't have the curl_version_info API
    276:    rv = curl_easy_setopt(mHandle, CURLOPT_SSL_VERIFYHOST, 0);
    277:    //    rv = curl_easy_setopt(mHandle, CURLOPT_SSL_VERIFYPEER, 0);
    280:    rv = curl_easy_setopt(mHandle, CURLOPT_SSLVERSION, 3);
    441:     if ((rv=curl_easy_setopt(mHandle,CURLOPT_SSL_VERIFYPEER,0))) {
    448:       if ((rv=curl_easy_setopt(mHandle,CURLOPT_SSL_VERIFYPEER,1))) {
    466:     if ((rv=curl_easy_setopt(mHandle,CURLOPT_SSLCERT,certificate))) {
    470:     if ((rv=curl_easy_setopt(mHandle,CURLOPT_SSLKEY,key))) {

    Looks like a perfect match in line 280 of the file “CimCurl.cpp”. The line numbers are a bit off from the original source code, since the file “CimCurl.cpp” was patched with our above debugging code. The code in the original, unpatched source file “CimCurl.cpp”, within the function “CimomCurl::genRequest” looks like this:

    CimCurl.cpp
    175     [...]
    176     /* Disable SSL host verification */
    177     rv = curl_easy_setopt(mHandle, CURLOPT_SSL_VERIFYHOST, 0);
    178     //    rv = curl_easy_setopt(mHandle, CURLOPT_SSL_VERIFYPEER, 0);
    179
    180     /* Force using SSL V3 */
    181     rv = curl_easy_setopt(mHandle, CURLOPT_SSLVERSION, 3);
    182
    183     /* Set username and password */
    184     if (url.user.length() > 0 && url.password.length() > 0) {
    185         mUserPass = url.user + ":" + url.password;
    186         rv = curl_easy_setopt(mHandle, CURLOPT_USERPWD, mUserPass.c_str());
    187     }
    188     [...]

    In line 181 the cURL option “CURLOPT_SSLVERSION” is unconditionally set to use SSLv3 and nothing else, which we know from the above deduction is bound to fail on systems that address the “POODLE” issue by disabling SSLv3.

  7. With the knowledge where the issue is actually caused, an easy quick'n'dirty fix can be implemented:

    sblim-wbemcli-1.6.3_debug.patch
    --- sblim-wbemcli-1.6.3_orig/CimCurl.cpp    2013-09-21 01:26:32.000000000 +0200
    +++ sblim-wbemcli-1.6.3/CimCurl.cpp         2014-11-26 16:46:09.000000000 +0100
    @@ -178,7 +178,7 @@
         //    rv = curl_easy_setopt(mHandle, CURLOPT_SSL_VERIFYPEER, 0);
     
         /* Force using SSL V3 */
    -    rv = curl_easy_setopt(mHandle, CURLOPT_SSLVERSION, 3);
    +    //rv = curl_easy_setopt(mHandle, CURLOPT_SSLVERSION, 3);
     
         /* Set username and password */
         if (url.user.length() > 0 && url.password.length() > 0) {

    Commenting out the line where the cURL option “CURLOPT_SSLVERSION” is forced to SSLv3 causes libcURL to fall back to its default behaviour of negotiating the protocol version with the server, which now results in TLS.

    Rebuilt the “wbemcli” command line tool and tried again:

    $ CURLDEBUG=true /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
    To server: <?xml version="1.0" encoding="utf-8" ?>
    <CIM CIMVERSION="2.0" DTDVERSION="2.0">
    <MESSAGE ID="4711" PROTOCOLVERSION="1.0"><SIMPLEREQ><IMETHODCALL NAME="EnumerateClassNames"><LOCALNAMESPACEPATH><NAMESPACE NAME="root"></NAMESPACE><NAMESPACE NAME="ibm"></NAMESPACE></LOCALNAMESPACEPATH>
    <IPARAMVALUE NAME="DeepInheritance"><VALUE>TRUE</VALUE></IPARAMVALUE>
    </IMETHODCALL></SIMPLEREQ>
    </MESSAGE></CIM>
    * About to connect() to svc-test port 5989 (#0)
    *   Trying 192.168.x.x...
    * connected
    * Connected to svc-test (192.168.x.x) port 5989 (#0)
    * found 172 certificates in /etc/ssl/certs/ca-certificates.crt
    *    server certificate verification SKIPPED
    *    common name: 2145 (does not match 'svc-test')
    *    server certificate expiration date OK
    *    server certificate activation date OK
    *    certificate public key: RSA
    *    certificate version: #3
    *    subject: C=GB,L=Hursley,O=IBM,OU=SSG,CN=2145,EMAIL=support@ibm.com
    *    start date: Mon, 16 Jul 2012 10:57:29 GMT
    
    *    expire date: Tue, 13 Jul 2027 10:57:29 GMT
    
    *    issuer: C=GB,L=Hursley,O=IBM,OU=SSG,CN=2145,EMAIL=support@ibm.com
    *    compression: NULL
    *    cipher: AES-128-CBC
    *    MAC: SHA1
    * Server auth using Basic with user 'user'
    > POST /cimom HTTP/1.1
    Authorization: Basic enp6bmFnaW9zOm5hZ2lvcw==
    Host: svc-test:5989
    Content-Type: application/xml; charset="utf-8"
    Connection: Keep-Alive, TE
    CIMProtocolVersion: 1.0
    CIMOperation: MethodCall
    CIMMethod: EnumerateClassNames
    CIMObject: root%2Fibm
    Content-Length: 396
    
    * upload completely sent off: 396 out of 396 bytes
    * additional stuff not fine transfer.c:1037: 0 0
    * HTTP 1.1 or later with persistent connection, pipelining supported
    < HTTP/1.1 200 OK
    < Content-Type: application/xml; charset="utf-8"
    From server: Content-Type: application/xml; charset="utf-8"
    < content-length: 0000072284
    From server: content-length: 0000072284
    < CIMOperation: MethodResponse
    From server: CIMOperation: MethodResponse
    <
    * Connection #0 to host svc-test left intact
    From server: <?xml version="1.0" encoding="utf-8" ?>
    <CIM CIMVERSION="2.0" DTDVERSION="2.0">
    <MESSAGE ID="4711" PROTOCOLVERSION="1.0">
    <SIMPLERSP>
    <IMETHODRESPONSE NAME="EnumerateClassNames">
    <IRETURNVALUE>
    <CLASSNAME NAME="CIM_ConcreteIdentity"/>
    <CLASSNAME NAME="CIM_NetworkPacketAction"/>
    <CLASSNAME NAME="CIM_CollectionInSystem"/>
    [...]
    
    $ /opt/sblim-wbemcli/bin/wbemcli -noverify ecn https://user:pass@svc-test:5989/root/ibm
    svc-test:5989/root/ibm:CIM_ConcreteIdentity
    svc-test:5989/root/ibm:CIM_NetworkPacketAction
    svc-test:5989/root/ibm:CIM_CollectionInSystem
    svc-test:5989/root/ibm:CIM_DeviceSAPImplementation
    svc-test:5989/root/ibm:CIM_ProtocolControllerAccessesUnit
    svc-test:5989/root/ibm:CIM_ControlledBy
    [...]

    Great, now we're finally able to query and monitor the IBM SVC or Storwize systems again with the Nagios monitoring plugin (Nagios Monitoring - IBM SVC and Storwize)!
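
The manual `openssl s_client` comparison from step 5 can also be scripted, for example to re-check a system after a firmware upgrade. The following is a rough Python sketch under a few assumptions: the helper names `parse_s_client` and `probe` are hypothetical, the `openssl` binary is expected to be on the PATH, and the parser only looks at the `Protocol` and `Cipher` lines of the `SSL-Session` block as they appear in the output above. A cipher of `0000` is treated as "no handshake", which matches the failed SSLv3 attempt.

```python
import re
import subprocess


def parse_s_client(output):
    """Return (protocol, cipher, handshake_ok) from 'openssl s_client' output."""
    proto = re.search(r"^[ \t]*Protocol[ \t]*:[ \t]*(\S+)", output, re.MULTILINE)
    cipher = re.search(r"^[ \t]*Cipher[ \t]*:[ \t]*(\S+)", output, re.MULTILINE)
    protocol = proto.group(1) if proto else None
    cipher_name = cipher.group(1) if cipher else None
    # "0000" / "(NONE)" is what s_client prints when no cipher was negotiated
    ok = cipher_name not in (None, "0000", "(NONE)")
    return protocol, cipher_name, ok


def probe(host, port=5989, extra_args=()):
    """Run s_client once, e.g. with extra_args=("-ssl3",), and summarise the result."""
    cmd = ["openssl", "s_client", "-host", host, "-port", str(port)]
    cmd.extend(extra_args)
    proc = subprocess.run(
        cmd,
        input="",  # empty stdin lets s_client exit after the handshake
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=30,
    )
    return parse_s_client(proc.stdout)
```

Note that newer OpenSSL builds may no longer offer the `-ssl3` option at all, in which case only the default (TLS) probe is meaningful.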

Between first noticing and researching the issue and creating the quick'n'dirty fix shown above, an official bug report was filed and a patch was submitted to the source code repository in order to address the issue more thoroughly. Hopefully an updated official source code package will be released soon.
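
As background on the magic number in the “CURLOPT_SSLVERSION” call: libcurl's curl/curl.h defines a small enumeration for this option, and the value 3 is `CURL_SSLVERSION_SSLv3`. The Python snippet below merely restates the four values available in the libcurl 7.26.0 used here as a lookup aid. It is not taken from wbemcli or from the submitted patch, and the helper names are invented. A less blunt fix than commenting the line out might be to pass `CURL_SSLVERSION_TLSv1` (value 1) instead; whether the upstream patch takes that route is not something this post confirms.

```python
# Values of libcurl's CURL_SSLVERSION_* enum as of 7.26.0 (later releases add more).
CURL_SSLVERSION = {
    0: "CURL_SSLVERSION_DEFAULT",  # let libcurl and the server negotiate
    1: "CURL_SSLVERSION_TLSv1",    # TLS 1.x
    2: "CURL_SSLVERSION_SSLv2",
    3: "CURL_SSLVERSION_SSLv3",    # what wbemcli 1.6.3 hard-codes
}


def describe_sslversion(value):
    """Map the integer passed to CURLOPT_SSLVERSION to its symbolic name."""
    return CURL_SSLVERSION.get(value, "unknown ({})".format(value))


def forces_legacy_ssl(value):
    """True if the value pins the connection to SSLv2 or SSLv3."""
    return value in (2, 3)
```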

Nonetheless, the process of researching and debugging this issue was an excellent hands-on exercise for me, which I enjoyed very much. Hopefully the steps taken and described here will turn out to be of use to others as well.

// Nagios Monitoring - IBM SVC and Storwize

Please be sure to also read the update Nagios Monitoring - IBM SVC and Storwize (Update) to this blog post.

Some time ago I wrote a – rather crude – Nagios plugin to monitor IBM SAN Volume Controller (SVC) systems. The plugin was initially targeted at version 4.3.x of the SVC software on the 2145-8F2 nodes we used back then. Since the initial implementation of the plugin we upgraded the hard- and software of our SVC systems several times and are now at version 7.1.x of the SVC software on 2145-CG8 nodes. Recently we also got some IBM Storwize V3700 storage arrays, which share the same code as the SVC, but lack some of its features and provide some additional ones. A code and functional review of the original plugin for the SVC as well as an adaptation for the Storwize arrays seemed to be in order. The result is the two plugins check_ibm_svc.pl and check_ibm_storwize.pl. They share a lot of common code with the original plugin, but are still maintained separately for the simple reason that IBM might develop the SVC and the Storwize code in slightly different, incompatible directions.

In order to run the plugins, you need to have the command line tool wbemcli from the Standards Based Linux Instrumentation project installed on the Nagios system. In my case the wbemcli command line tool is placed in /opt/sblim-wbemcli/bin/wbemcli. If you use a different path, adapt the configuration hash entry “$conf{'wbemcli'}” according to your environment. The plugins use wbemcli to query the CIMOM service on the SVC or Storwize system for the necessary information. Therefore a network connection from the Nagios system to the SVC or Storwize systems on port TCP/5989 must be allowed and a user with the “Monitor” authorization must be created on the SVC or Storwize systems:

IBM_2145:svc:admin$ mkuser -name nagios -usergrp Monitor -password <password>

Or in the WebUI:

Nagios user on the SVC or Storwize system to query information via CIMOM.
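
Before wiring the plugins into Nagios, it can be useful to confirm by hand that the new user is able to query the CIMOM service. The following is only a minimal sketch: the helper cim_url is a made-up name, and the namespace root/ibm, the class IBMTSSVC_Cluster and the -noverify and -nl options are assumptions about a typical setup rather than something stated in this post.

```shell
#!/bin/sh
# Build the object path URL that is handed to wbemcli.
# The namespace "root/ibm" is an assumption, not taken from the post.
cim_url() {
    # $1 = user, $2 = password, $3 = host, $4 = CIM class name
    printf 'https://%s:%s@%s:5989/root/ibm:%s\n' "$1" "$2" "$3" "$4"
}

# Example call, commented out because it needs a reachable system:
# /opt/sblim-wbemcli/bin/wbemcli -noverify -nl ei \
#     "$(cim_url nagios '<password>' svc1 IBMTSSVC_Cluster)"
```

If the call returns instance data instead of an authentication or connection error, the user and the path to port TCP/5989 are most likely fine. Passwords containing characters that are special in URLs would probably need to be encoded first.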

Generic

  1. Optional: Enable SNMP traps to be sent to the Nagios system on each of the SVC or Storwize devices. This requires SNMPD and SNMPTT to be already set up on the Nagios system. Log in to the SVC or Storwize CLI and issue the command:

    IBM_2145:svc:admin$ mksnmpserver -ip <IP> -community public -error on -warning on -info on -port 162

    Where <IP> is the IP address of your Nagios system. Alternatively, in the SVC or Storwize WebUI navigate to:

    -> Settings
       -> Event Notifications
          -> SNMP
             -> <Enter IP of the Nagios system and the SNMPDs community string>

    Verify the port UDP/162 on the Nagios system can be reached from the SVC or Storwize devices.
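
One part of that verification can be done on the Nagios system itself, namely checking that something is actually listening on UDP/162. The sketch below assumes the ss tool from iproute2 is available; udp_port_listening is a hypothetical helper name. It only covers the local listener and says nothing about firewalls between the storage systems and the Nagios host.

```shell
#!/bin/sh
# Report whether a UDP listener on the given port appears in `ss -uln`
# style output read from stdin. Every field is scanned, so the helper does
# not depend on the exact column layout of the installed ss version.
udp_port_listening() {
    awk -v port="$1" '
        { for (i = 1; i <= NF; i++) if ($i ~ (":" port "$")) found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Usage on the Nagios system:
# ss -uln | udp_port_listening 162 && echo "trap listener present"
```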

SAN Volume Controller (SVC)

For SAN Volume Controller (SVC) devices the whole setup looks like this:

  1. Download the Nagios plugin check_ibm_svc.pl and place it in the plugins directory of your Nagios system, in this example /usr/lib/nagios/plugins/:

    $ mv -i check_ibm_svc.pl /usr/lib/nagios/plugins/
    $ chmod 755 /usr/lib/nagios/plugins/check_ibm_svc.pl
  2. Adjust the plugin settings according to your environment. Edit the following variable assignments:

    my %conf = (
        wbemcli => '/opt/sblim-wbemcli/bin/wbemcli',
  3. Define the following Nagios commands. In this example this is done in the file /etc/nagios-plugins/config/check_svc.cfg:

    # check SVC Backend Controller status
    define command {
        command_name    check_svc_bc
        command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C BackendController
    }
    # check SVC Backend SCSI Status
    define command {
        command_name    check_svc_btspe
        command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C BackendTargetSCSIPE
    }
    # check SVC MDisk status
    define command {
        command_name    check_svc_bv
        command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C BackendVolume
    }
    # check SVC Cluster status
    define command {
        command_name    check_svc_cl
        command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C Cluster
    }
    # check SVC MDiskGroup status
    define command {
        command_name    check_svc_csp
        command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C ConcreteStoragePool
    }
    # check SVC Ethernet Port status
    define command {
        command_name    check_svc_eth
        command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C EthernetPort
    }
    # check SVC FC Port status
    define command {
        command_name    check_svc_fcp
        command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C FCPort
    }
    # check SVC FC Port statistics
    define command {
        command_name    check_svc_fcp_stats
        command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C FCPortStatistics
    }
    # check SVC I/O Group status and memory allocation
    define command {
        command_name    check_svc_iogrp
        command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C IOGroup -w $ARG1$ -c $ARG2$
    }
    # check SVC WebUI status
    define command {
        command_name    check_svc_mc
        command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C MasterConsole
    }
    # check SVC VDisk Mirror status
    define command {
        command_name    check_svc_mirror
        command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C MirrorExtent
    }
    # check SVC Node status
    define command {
        command_name    check_svc_node
        command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C Node
    }
    # check SVC Quorum Disk status
    define command {
        command_name    check_svc_quorum
        command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C QuorumDisk
    }
    # check SVC Storage Volume status
    define command {
        command_name    check_svc_sv
        command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C StorageVolume
    }

    Replace <user> and <password> with name and password of the CIMOM user created above.

  4. Define a group of services in your Nagios configuration to be checked for each SVC system:

    # check sshd
    define service {
        use                     generic-service
        hostgroup_name          svc
        service_description     Check_SSH
        check_command           check_ssh
    }
    # check_tcp CIMOM
    define service {
        use                     generic-service-pnp
        hostgroup_name          svc
        service_description     Check_CIMOM
        check_command           check_tcp!5989
    }
    # check_svc_bc
    define service {
        use                     generic-service-pnp
        hostgroup_name          svc
        service_description     Check_Backend_Controller
        check_command           check_svc_bc
    }
    # check_svc_btspe
    define service {
        use                     generic-service-pnp
        hostgroup_name          svc
        service_description     Check_Backend_Target
        check_command           check_svc_btspe
    }
    # check_svc_bv
    define service {
        use                     generic-service-pnp
        hostgroup_name          svc
        service_description     Check_Backend_Volume
        check_command           check_svc_bv
    }
    # check_svc_cl
    define service {
        use                     generic-service-pnp
        hostgroup_name          svc
        service_description     Check_Cluster
        check_command           check_svc_cl
    }
    # check_svc_csp
    define service {
        use                     generic-service-pnp
        hostgroup_name          svc
        service_description     Check_Storage_Pool
        check_command           check_svc_csp
    }
    # check_svc_eth
    define service {
        use                     generic-service-pnp
        hostgroup_name          svc
        service_description     Check_Ethernet_Port
        check_command           check_svc_eth
    }
    # check_svc_fcp
    define service {
        use                     generic-service-pnp
        hostgroup_name          svc
        service_description     Check_FC_Port
        check_command           check_svc_fcp
    }
    # check_svc_fcp_stats
    define service {
        use                     generic-service-pnp
        hostgroup_name          svc
        service_description     Check_FC_Port_Statistics
        check_command           check_svc_fcp_stats
    }
    # check_svc_iogrp
    define service {
        use                     generic-service-pnp
        hostgroup_name          svc
        service_description     Check_IO_Group
        check_command           check_svc_iogrp!102400!204800
    }
    # check_svc_mc
    define service {
        use                     generic-service-pnp
        hostgroup_name          svc
        service_description     Check_Master_Console
        check_command           check_svc_mc
    }
    # check_svc_mirror
    define service {
        use                     generic-service-pnp
        hostgroup_name          svc
        service_description     Check_Mirror_Extents
        check_command           check_svc_mirror
    }
    # check_svc_node
    define service {
        use                     generic-service-pnp
        hostgroup_name          svc
        service_description     Check_Node
        check_command           check_svc_node
    }
    # check_svc_quorum
    define service {
        use                     generic-service-pnp
        hostgroup_name          svc
        service_description     Check_Quorum
        check_command           check_svc_quorum
    }
    # check_svc_sv
    define service {
        use                     generic-service-pnp
        hostgroup_name          svc
        service_description     Check_Storage_Volume
        check_command           check_svc_sv
    }

    Replace generic-service with your Nagios service template. Replace generic-service-pnp with your Nagios service template that has performance data processing enabled.

  5. Define hosts in your Nagios configuration for each SVC device. In this example it is named svc1:

    define host {
        use         svc
        host_name   svc1
        alias       SAN Volume Controller 1
        address     10.0.0.1
        parents     parent_lan
    }

    Replace svc with your Nagios host template for SVC devices. Adjust the address and parents parameters according to your environment.

  6. Define a hostgroup in your Nagios configuration for all SVC systems. In this example it is named svc. The above checks are run against each member of the hostgroup:

    define hostgroup {
        hostgroup_name  svc
        alias           IBM SVC Clusters
        members         svc1
    }
  7. Run a configuration check and if successful reload the Nagios process:

    $ /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg
    $ /etc/init.d/nagios3 reload

The new hosts and services should soon show up in the Nagios web interface.
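
Since the command definitions in step 3 differ only in the command name and the value passed to -C, they could also be generated instead of typed. The following is a sketch of that idea, not part of the plugin distribution; nagios_command is a hypothetical helper, and <user> and <password> are left as placeholders just like in the listing above.

```shell
#!/bin/sh
# Print one Nagios command definition.
# $1 = command name, $2 = plugin file name, $3 = value for the -C option
nagios_command() {
    printf 'define command {\n'
    printf '    command_name    %s\n' "$1"
    printf '    command_line    $USER1$/%s -H $HOSTNAME$ -u <user> -p <password> -C %s\n' "$2" "$3"
    printf '}\n'
}

# Generate a few of the SVC commands from "name:class" pairs.
for pair in check_svc_bc:BackendController check_svc_cl:Cluster check_svc_node:Node; do
    nagios_command "${pair%%:*}" check_ibm_svc.pl "${pair#*:}"
done
```

The IOGroup check takes the additional -w $ARG1$ -c $ARG2$ options and would still have to be written by hand or handled as a special case.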

Storwize

For Storwize devices the whole setup looks like this:

  1. Download the Nagios plugin check_ibm_storwize.pl and place it in the plugins directory of your Nagios system, in this example /usr/lib/nagios/plugins/:

    $ mv -i check_ibm_storwize.pl /usr/lib/nagios/plugins/
    $ chmod 755 /usr/lib/nagios/plugins/check_ibm_storwize.pl
  2. Adjust the plugin settings according to your environment. Edit the following variable assignments:

    my %conf = (
        wbemcli => '/opt/sblim-wbemcli/bin/wbemcli',
  3. Define the following Nagios commands. In this example this is done in the file /etc/nagios-plugins/config/check_storwize.cfg:

    # check Storwize RAID Array status
    define command {
        command_name    check_storwize_array
        command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C Array
    }
    # check Storwize Hot Spare coverage
    define command {
        command_name    check_storwize_asc
        command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C ArrayBasedOnDiskDrive
    }
    # check Storwize MDisk status
    define command {
        command_name    check_storwize_bv
        command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C BackendVolume
    }
    # check Storwize Cluster status
    define command {
        command_name    check_storwize_cl
        command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C Cluster
    }
    # check Storwize MDiskGroup status
    define command {
        command_name    check_storwize_csp
        command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C ConcreteStoragePool
    }
    # check Storwize Disk status
    define command {
        command_name    check_storwize_disk
        command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C DiskDrive
    }
    # check Storwize Enclosure status
    define command {
        command_name    check_storwize_enc
        command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C Enclosure
    }
    # check Storwize Ethernet Port status
    define command {
        command_name    check_storwize_eth
        command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C EthernetPort
    }
    # check Storwize FC Port status
    define command {
        command_name    check_storwize_fcp
        command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C FCPort
    }
    # check Storwize I/O Group status and memory allocation
    define command {
        command_name    check_storwize_iogrp
        command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C IOGroup -w $ARG1$ -c $ARG2$
    }
    # check Storwize Hot Spare status
    define command {
        command_name    check_storwize_is
        command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C IsSpare
    }
    # check Storwize WebUI status
    define command {
        command_name    check_storwize_mc
        command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C MasterConsole
    }
    # check Storwize VDisk Mirror status
    define command {
        command_name    check_storwize_mirror
        command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C MirrorExtent
    }
    # check Storwize Node status
    define command {
        command_name    check_storwize_node
        command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C Node
    }
    # check Storwize Quorum Disk status
    define command {
        command_name    check_storwize_quorum
        command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C QuorumDisk
    }
    # check Storwize Storage Volume status
    define command {
        command_name    check_storwize_sv
        command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C StorageVolume
    }

    Replace <user> and <password> with name and password of the CIMOM user created above.

  4. Define a group of services in your Nagios configuration to be checked for each Storwize system:

    # check sshd
    define service {
        use                     generic-service
        hostgroup_name          storwize
        service_description     Check_SSH
        check_command           check_ssh
    }
    # check_tcp CIMOM
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_CIMOM
        check_command           check_tcp!5989
    }
    # check_storwize_array
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_Array
        check_command           check_storwize_array
    }
    # check_storwize_asc
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_Array_Spare_Coverage
        check_command           check_storwize_asc
    }
    # check_storwize_bv
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_Backend_Volume
        check_command           check_storwize_bv
    }
    # check_storwize_cl
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_Cluster
        check_command           check_storwize_cl
    }
    # check_storwize_csp
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_Storage_Pool
        check_command           check_storwize_csp
    }
    # check_storwize_disk
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_Disk_Drive
        check_command           check_storwize_disk
    }
    # check_storwize_enc
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_Enclosure
        check_command           check_storwize_enc
    }
    # check_storwize_eth
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_Ethernet_Port
        check_command           check_storwize_eth
    }
    # check_storwize_fcp
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_FC_Port
        check_command           check_storwize_fcp
    }
    # check_storwize_iogrp
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_IO_Group
        check_command           check_storwize_iogrp!102400!204800
    }
    # check_storwize_is
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_Hot_Spare
        check_command           check_storwize_is
    }
    # check_storwize_mc
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_Master_Console
        check_command           check_storwize_mc
    }
    # check_storwize_mirror
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_Mirror_Extents
        check_command           check_storwize_mirror
    }
    # check_storwize_node
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_Node
        check_command           check_storwize_node
    }
    # check_storwize_quorum
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_Quorum
        check_command           check_storwize_quorum
    }
    # check_storwize_sv
    define service {
        use                     generic-service-pnp
        hostgroup_name          storwize
        service_description     Check_Storage_Volume
        check_command           check_storwize_sv
    }

    Replace generic-service with your Nagios service template. Replace generic-service-pnp with your Nagios service template that has performance data processing enabled.

  5. Define hosts in your Nagios configuration for each Storwize device. In this example it is named storwize1:

    define host {
        use         disk
        host_name   storwize1
        alias       Storwize Disk Storage 1
        address     10.0.0.1
        parents     parent_lan
    }

    Replace disk with your Nagios host template for storage devices. Adjust the address and parents parameters according to your environment.

  6. Define a hostgroup in your Nagios configuration for all Storwize systems. In this example it is named storwize. The above checks are run against each member of the hostgroup:

    define hostgroup {
        hostgroup_name  storwize
        alias           IBM Storwize Devices
        members         storwize1
    }
  7. Run a configuration check and if successful reload the Nagios process:

    $ /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg
    $ /etc/init.d/nagios3 reload

The new hosts and services should soon show up in the Nagios web interface.
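
With this many service and command definitions, a mistyped check_command is easy to overlook. The Nagios configuration check in step 7 should catch it, but a quick cross-check of the files can point to the culprit more directly. The helper below is a sketch under the assumption that all definitions use the one-directive-per-line layout shown above; undefined_commands is a made-up name.

```shell
#!/bin/sh
# List every check_command referenced by a service that has no matching
# command_name definition in the given Nagios object configuration files.
# Arguments after "!" in check_command (e.g. check_tcp!5989) are ignored.
undefined_commands() {
    awk '
        $1 == "command_name"  { defined[$2] = 1 }
        $1 == "check_command" { split($2, parts, "!"); used[parts[1]] = 1 }
        END { for (c in used) if (!(c in defined)) print c }
    ' "$@" | sort
}

# Usage (the service file path is a placeholder for your environment):
# undefined_commands /etc/nagios-plugins/config/*.cfg <your service config files>
```

Commands such as check_ssh and check_tcp are usually defined in the configuration files shipped with the standard plugins, so those files need to be passed as well, otherwise they are reported as missing.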

Generic

If the optional step in the “Generic” section above was done, SNMPTT also needs to be configured to be able to understand the incoming SNMP traps from the SVC and Storwize systems. This can be achieved by the following steps:

  1. Download the IBM SVC/Storwize SNMP MIB matching your software version from ftp://ftp.software.ibm.com/storage/san/sanvc/.

  2. Convert the IBM SVC/Storwize SNMP MIB definitions in SVC_MIB_<version>.MIB into a format that SNMPTT can understand.

    $ /opt/snmptt/snmpttconvertmib --in=MIB/SVC_MIB_7.1.0.MIB --out=/opt/snmptt/conf/snmptt.conf.ibm-svc-710
    
    ...
    Done
    
    Total translations:        3
    Successful translations:   3
    Failed translations:       0
  3. Edit the trap severity according to your requirements, e.g.:

    $ vim /opt/snmptt/conf/snmptt.conf.ibm-svc-710
    
    ...
    EVENT tsveETrap .1.3.6.1.4.1.2.6.190.1 "Status Events" Critical
    
    ...
    EVENT tsveWTrap .1.3.6.1.4.1.2.6.190.2 "Status Events" Warning
    ...
  4. Optional: Apply the following patch to the configuration to reduce the number of false positives:

    snmptt.conf.ibm-svc-710
    --- /opt/snmptt/conf/snmptt.conf.ibm-svc-710.orig       2013-12-28 21:16:25.000000000 +0100
    +++ /opt/snmptt/conf/snmptt.conf.ibm-svc-710    2013-12-28 21:17:55.000000000 +0100
    @@ -29,11 +29,21 @@
       16: tsveMPNO
       17: tsveOBJN
     EDESC
    +# Filter and ignore the following events that are not really warnings
    +#   "Error ID = 980440": Failed to transfer file from remote node
    +#   "Error ID = 981001": Cluster Fabric View updated by fabric discovery
    +#   "Error ID = 981014": LUN Discovery failed
    +#   "Error ID = 982009": Migration complete
     #
    +EVENT tsveWTrap .1.3.6.1.4.1.2.6.190.2 "Status Events" Normal
    +FORMAT tsve information trap $*
    +MATCH $3: (Error ID = 980440|981001|981014|982009)
     #
    +# All remaining events with this OID are actually warnings
     #
     EVENT tsveWTrap .1.3.6.1.4.1.2.6.190.2 "Status Events" Warning
     FORMAT tsve warning trap $*
    +MATCH $3: !(Error ID = 980440|981001|981014|982009)
     SDESC
     tsve warning trap
     Variables:
  5. Add the new configuration file to be included in the global SNMPTT configuration and restart the SNMPTT daemon:

    $ vim /opt/snmptt/snmptt.ini
    
    ...
    [TrapFiles]
    snmptt_conf_files = <<END
    ...
    /opt/snmptt/conf/snmptt.conf.ibm-svc-710
    ...
    END
    
    $ /etc/init.d/snmptt reload
  6. Download the Nagios plugin check_snmp_traps.sh and place it in the plugins directory of your Nagios system, in this example /usr/lib/nagios/plugins/:

    $ mv -i check_snmp_traps.sh /usr/lib/nagios/plugins/
    $ chmod 755 /usr/lib/nagios/plugins/check_snmp_traps.sh
  7. Define the following Nagios command to check for SNMP traps in the SNMPTT database. In this example this is done in the file /etc/nagios-plugins/config/check_snmp_traps.cfg:

    # check for snmp traps
    define command {
        command_name    check_snmp_traps
        command_line    $USER1$/check_snmp_traps.sh -H $HOSTNAME$:$HOSTADDRESS$ -u <user> -p <pass> -d <snmptt_db>
    }

    Replace <user>, <pass> and <snmptt_db> with values suitable for your SNMPTT database environment.

  8. Add another service in your Nagios configuration to be checked for each SVC:

    # check snmptraps
    define service {
        use                     generic-service
        hostgroup_name          svc
        service_description     Check_SNMP_traps
        check_command           check_snmp_traps
    }

    or Storwize system:

    # check snmptraps
    define service {
        use                     generic-service
        hostgroup_name          storwize
        service_description     Check_SNMP_traps
        check_command           check_snmp_traps
    }
  9. Optional: Define a serviceextinfo to display a folder icon next to the Check_SNMP_traps service check for each SVC:

    define  serviceextinfo {
        hostgroup_name          svc
        service_description     Check_SNMP_traps
        notes                   SNMP Alerts
        #notes_url               http://<hostname>/nagios3/nagtrap/index.php?hostname=$HOSTNAME$
        #notes_url               http://<hostname>/nagios3/nsti/index.php?perpage=100&hostname=$HOSTNAME$
    }

    or Storwize system:

    define  serviceextinfo {
        hostgroup_name          storwize
        service_description     Check_SNMP_traps
        notes                   SNMP Alerts
        #notes_url               http://<hostname>/nagios3/nagtrap/index.php?hostname=$HOSTNAME$
        #notes_url               http://<hostname>/nagios3/nsti/index.php?perpage=100&hostname=$HOSTNAME$
    }

    This icon provides a direct link to the SNMPTT web interface with a filter for the selected host. Uncomment the notes_url line matching the web interface in use (nagtrap or nsti). Replace <hostname> with the FQDN or IP address of the server running the web interface.

  10. Run a configuration check and if successful reload the Nagios process:

    $ /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg
    $ /etc/init.d/nagios3 reload
  11. Optional: If you're running PNP4Nagios v0.6 or later to graph Nagios performance data, you can use the PNP4Nagios templates in pnp4nagios_svc.tar.bz2 and pnp4nagios_storwize.tar.bz2 to beautify the graphs. Download both archives, extract them and place the templates in the PNP4Nagios template directory, in this example /usr/share/pnp4nagios/html/templates/:

    $ tar jxf pnp4nagios_storwize.tar.bz2
    $ mv -i check_storwize_*.php /usr/share/pnp4nagios/html/templates/
    $ chmod 644 /usr/share/pnp4nagios/html/templates/check_storwize_*.php
    
    $ tar jxf pnp4nagios_svc.tar.bz2
    $ mv -i check_svc_*.php /usr/share/pnp4nagios/html/templates/
    $ chmod 644 /usr/share/pnp4nagios/html/templates/check_svc_*.php
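
The MATCH expressions from the optional patch in step 4 can be tried against sample strings before relying on them. The sketch below uses grep -E purely as a stand-in, since SNMPTT evaluates the expression itself as a Perl regular expression; trap_severity is a hypothetical helper and the sample strings are made up for illustration, not captured from a real trap.

```shell
#!/bin/sh
# Approximate the MATCH filter from the SNMPTT patch with grep -E.
PATTERN='Error ID = 980440|981001|981014|982009'

trap_severity() {
    # $1 = text of trap variable $3; prints the severity the patch would assign
    if printf '%s\n' "$1" | grep -Eq "$PATTERN"; then
        echo Normal
    else
        echo Warning
    fi
}

trap_severity 'Error ID = 981001'    # prints Normal
trap_severity 'Error ID = 123456'    # prints Warning
```

Note that the alternation applies to the whole expression, so only the first alternative includes the “Error ID = ” prefix, while the other three match the bare number anywhere in the variable. For these event IDs that is probably harmless, but it is worth keeping in mind when extending the list.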

All done, you should now have a complete Nagios-based monitoring solution for your IBM SVC and Storwize systems.
