2015-02-03 // IBM Storwize V3700 out of Memory
Under certain conditions it is possible to inadvertently run into an out of memory situation on IBM Storwize V3700 systems, by simply running a Download Support Package procedure or the respective CLI command. This will – of course – bring all I/O on the affected system to a grinding halt.
A few days ago, the Nagios monitoring plugin introduced in “Nagios Monitoring - IBM SVC and Storwize” reported a failed PSU on one of our IBM Storwize V3700 systems. After raising a PMR with IBM in order to get the seemingly defective PSU replaced, i was told to simply reseat the PSU. According to IBM support this would usually fix this – apparently known – issue.
This was the first WTF moment and it turned out to not be the last one. So either IBM produces and sells subpar components – in this case the PSU – which need to be given a boot – yes, PSUs nowadays have their own firmware too – in order to be persuaded to cooperate again. Or it means IBM produces and sells subpar software which is not at all able to properly detect a component failure and distinguish between a faulty and a good PSU. Or perhaps it's an unfortunate combination of both.
In any case, the procedure to reseat the PSU was carried out, which fixed the PSU issue. During the course of the fix procedure the system became unreachable via TCP/IP for a rather long time, though. Definitely over two minutes, but i haven't had a chance for an exact measurement. After the system was reachable again i followed this strange behaviour up and had a look at the system's event log. There were quite a lot of “Error Code: 1370, Error Code Text: SCSI ERP occurred” messages, so i decided to bother the IBM support again and send them a support collection in order to get an analysis with regard to the reachability issue as well as the 1370 errors.
From previous occasions i knew that the IBM support would most likely request a support collection which was run with the “svc_livedump” CLI command or with the “Standard logs plus new statesaves” option from the WebUI. The latter one is marked red in the following screenshot example:
So i decided to pull a support collection with this option. Some time into the support collection process, the SVC sitting in front of the V3700 and other storage systems started to show very high latencies (~60 sec.) on the primary VDisks backed by the V3700. On other VDisks which “only” had their secondary VDisk-Mirror located on MDiskGroups of the V3700, the latency peak was less dramatic, but still very noticeable. Eventually degraded paths to the MDisks located on the V3700 started showing up on the SVC. After the support collection process finished the situation went back to normal. The latency on both primary and secondary VDisks instantly dropped down to the usual values and after running the fix procedures on the SVC, the degraded paths came back online.
The performance issues were in magnitude and duration severe enough to affect several applications pretty badly. Although the immediate issue was resolved, i still needed an analysis and written statement from IBM support for an action plan on how to prevent this kind of situation in the future and for compliance reasons as well. Here is the digest of what – according to IBM – happened:
During the procedure to reseat the reportedly defective PSU, one or more power surges occurred.
These power surges apparently caused issues on the internal disk buses, which led to the 1370 errors being logged.
The power surges or the resulting 1370 errors are probably the cause of a failover of the config node too. Hence the connectivity issues with the CLI and the WebUI via TCP/IP.
More 1370 errors were logged during the runtime of the subsequent “Standard logs plus new statesaves” support collection process.
The issue of very high latency and MDisk paths becoming degraded was caused by the “Standard logs plus new statesaves” support collection process using up all of the memory – yes, including the data cache – on the V3700 system. This behaviour is specific and limited to the V3700 systems with “only” 4GB of memory.
As one can imagine, the last item was my second WTF moment. Apparently there are no programmatic safeguards to prevent the support collection process at a “Standard logs plus new statesaves” level from using up all of the system's memory. This would normally not be that bad at all, if the “Standard logs plus new statesaves” wasn't the particular level which IBM support would usually request on support cases concerning SVC and Storwize systems. On the phone the IBM support technician admitted that this common practice is in general probably a bad idea. But he also mentioned that he up to now hadn't heard of the known side effects actually occurring, like in this case.
The suggestion on how to prevent this kind of situation in the future was to either upgrade the V3700 systems from 4GB to 8GB memory – a solution i would gladly take provided it came free of charge – or to run the support collection process only with the “svc_snap” CLI command or the “Standard logs” option from the WebUI. Since the memory upgrade for free isn't likely to happen, i'll stick with the second suggestion for now.
Incidentally a third option came up over the last weekend. Looking at the IBM System Storage SAN Volume Controller V7.3.0.9 Release Note, it could be construed that someone at IBM SVC and Storwize development came to the realization that the issue could also be addressed in software, by altering the resource utilization of the support collection process:
HU00636 Livedump prepare fails on V3500 & V3700 systems with 4GB memory when cache partition fullness is less than 35%
Fingers crossed, this fix really addresses and resolves the issue described above.
2014-12-24 // Nagios Monitoring - IBM SVC and Storwize (Update)
After upgrading our test IBM SAN Volume Controller (SVC) systems from version 7.3.0.7 to 7.3.0.8 or later, the previously described Nagios monitoring plugin (Nagios Monitoring - IBM SVC and Storwize) ceased to work. A quick check revealed that the “wbemcli” command line tool from the Standards Based Linux Instrumentation project, which is used in the Nagios plugin to query the CIMOM server on the SVC or Storwize systems, would fail with the following error message:
$ /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
*
* /opt/sblim-wbemcli/bin/wbemcli: Http Exception: SSL connect error
*
Re-checking the release notes, this sudden change in behaviour seemed to be explained by the fix:
SSL vulnerability CVE-2014-3566
That description is not very verbose, but a quick look at CVE-2014-3566 showed that this is a fix for the “POODLE” issue. So IBM probably switched off the support for the SSLv3 protocol in the SVC and Storwize code. But why would this cause the “wbemcli” command line tool to fail? Here are the steps taken in an analysis of the issue:
First, i tried to get the “wbemcli” command line tool to be a tad more verbose about what it is actually doing:
$ /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
To server: <?xml version="1.0" encoding="utf-8" ?>
<CIM CIMVERSION="2.0" DTDVERSION="2.0">
<MESSAGE ID="4711" PROTOCOLVERSION="1.0"><SIMPLEREQ><IMETHODCALL NAME="EnumerateClassNames"><LOCALNAMESPACEPATH><NAMESPACE NAME="root"></NAMESPACE><NAMESPACE NAME="ibm"></NAMESPACE></LOCALNAMESPACEPATH>
<IPARAMVALUE NAME="DeepInheritance"><VALUE>TRUE</VALUE></IPARAMVALUE>
</IMETHODCALL></SIMPLEREQ>
</MESSAGE></CIM>
*
* /opt/sblim-wbemcli/bin/wbemcli: Http Exception: SSL connect error
*
Not really an abundance of information in here either.
Got the source code for the “wbemcli” command line tool and, while at it, updated it from version 1.6.0 to 1.6.3. Spent some time looking through the source code and with the “gdb” debugger to get a feeling for the general program flow and functions/methods being called. While looking through the source code i noticed the cURL debugging options are being set if an environment variable named “CURLDEBUG” is set to “true”. Later also found this mentioned in the ChangeLog file:
$ CURLDEBUG=true /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
To server: <?xml version="1.0" encoding="utf-8" ?>
<CIM CIMVERSION="2.0" DTDVERSION="2.0">
<MESSAGE ID="4711" PROTOCOLVERSION="1.0"><SIMPLEREQ><IMETHODCALL NAME="EnumerateClassNames"><LOCALNAMESPACEPATH><NAMESPACE NAME="root"></NAMESPACE><NAMESPACE NAME="ibm"></NAMESPACE></LOCALNAMESPACEPATH>
<IPARAMVALUE NAME="DeepInheritance"><VALUE>TRUE</VALUE></IPARAMVALUE>
</IMETHODCALL></SIMPLEREQ>
</MESSAGE></CIM>
* About to connect() to svc-test port 5989 (#0)
* Trying 192.168.x.x...
* connected
* Connected to svc-test (192.168.x.x) port 5989 (#0)
* successfully set certificate verify locations:
* CAfile: none CApath: /etc/ssl/certs
* Unknown SSL protocol error in connection to svc-test:5989
* Closing connection #0
* SSL connect error
*
* /opt/sblim-wbemcli/bin/wbemcli: Http Exception: SSL connect error
*
Now we know that we're encountering an “Unknown SSL protocol error” – not that helpful either.
Searched the web for further information on how to debug the cURL library and found the debug.c example source code, which was very helpful. Incorporated it into the file “CimCurl.cpp”:
- sblim-wbemcli-1.6.3_debug.patch
--- sblim-wbemcli-1.6.3_orig/CimCurl.cpp 2013-09-21 01:26:32.000000000 +0200 +++ sblim-wbemcli-1.6.3_new/CimCurl.cpp 2014-11-26 16:30:19.000000000 +0100 @@ -37,6 +37,100 @@ extern int waitTime; extern int expect100; +// Trace Begin +struct data { + char trace_ascii; /* 1 or 0 */ +}; + +static +void dump(const char *text, + FILE *stream, unsigned char *ptr, size_t size, + char nohex) +{ + size_t i; + size_t c; + + unsigned int width=0x10; + + if(nohex) + /* without the hex output, we can fit more on screen */ + width = 0x40; + + fprintf(stream, "%s, %010.10ld bytes (0x%08.8lx)\n", + text, (long)size, (long)size); + + for(i=0; i<size; i+= width) { + + fprintf(stream, "%04.4lx: ", (long)i); + + if(!nohex) { + /* hex not disabled, show it */ + for(c = 0; c < width; c++) + if(i+c < size) + fprintf(stream, "%02x ", ptr[i+c]); + else + fputs(" ", stream); + } + + for(c = 0; (c < width) && (i+c < size); c++) { + /* check for 0D0A; if found, skip past and start a new line of output */ + if (nohex && (i+c+1 < size) && ptr[i+c]==0x0D && ptr[i+c+1]==0x0A) { + i+=(c+2-width); + break; + } + fprintf(stream, "%c", + (ptr[i+c]>=0x20) && (ptr[i+c]<0x80)?ptr[i+c]:'.'); + /* check again for 0D0A, to avoid an extra \n if it's at width */ + if (nohex && (i+c+2 < size) && ptr[i+c+1]==0x0D && ptr[i+c+2]==0x0A) { + i+=(c+3-width); + break; + } + } + fputc('\n', stream); /* newline */ + } + fflush(stream); +} + +static +int my_trace(CURL *handle, curl_infotype type, + char *data, size_t size, + void *userp) +{ + struct data *config = (struct data *)userp; + const char *text; + (void)handle; /* prevent compiler warning */ + + switch (type) { + case CURLINFO_TEXT: + fprintf(stderr, "== Info: %s", data); + default: /* in case a new one is introduced to shock us */ + return 0; + + case CURLINFO_HEADER_OUT: + text = "=> Send header"; + break; + case CURLINFO_DATA_OUT: + text = "=> Send data"; + break; + case CURLINFO_SSL_DATA_OUT: + text = "=> Send SSL data"; + break; + case CURLINFO_HEADER_IN: 
+ text = "<= Recv header"; + break; + case CURLINFO_DATA_IN: + text = "<= Recv data"; + break; + case CURLINFO_SSL_DATA_IN: + text = "<= Recv SSL data"; + break; + } + + dump(text, stderr, (unsigned char *)data, size, config->trace_ascii); + return 0; +} +// Trace End + // These are the constant headers added to all requests static const char *headers[] = { "Content-Type: application/xml; charset=\"utf-8\"", @@ -152,6 +246,11 @@ CURLcode rv; string sb; +// Trace Begin +struct data config; +config.trace_ascii = 1; +// Trace End + mUri = url.scheme + "://" + url.host + ":" + url.port + "/cimom"; url.ns.toStringBuffer(sb,"%2F"); @@ -248,6 +347,11 @@ rv = curl_easy_setopt(mHandle, CURLOPT_WRITEHEADER, &mErrorData); rv = curl_easy_setopt(mHandle, CURLOPT_HEADERFUNCTION, headerCb); + +// Trace Begin + rv = curl_easy_setopt(mHandle, CURLOPT_DEBUGFUNCTION, my_trace); + rv = curl_easy_setopt(mHandle, CURLOPT_DEBUGDATA, &config); +// Trace End } static string getErrorMessage(CURLcode err)
Rebuilt the “wbemcli” command line tool and tried again:
$ CURLDEBUG=true /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
To server: <?xml version="1.0" encoding="utf-8" ?>
<CIM CIMVERSION="2.0" DTDVERSION="2.0">
<MESSAGE ID="4711" PROTOCOLVERSION="1.0"><SIMPLEREQ><IMETHODCALL NAME="EnumerateClassNames"><LOCALNAMESPACEPATH><NAMESPACE NAME="root"></NAMESPACE><NAMESPACE NAME="ibm"></NAMESPACE></LOCALNAMESPACEPATH>
<IPARAMVALUE NAME="DeepInheritance"><VALUE>TRUE</VALUE></IPARAMVALUE>
</IMETHODCALL></SIMPLEREQ>
</MESSAGE></CIM>
== Info: About to connect() to svc-test port 5989 (#0)
== Info: Trying 192.168.x.x...
== Info: connected
== Info: Connected to svc-test (192.168.x.x) port 5989 (#0)
== Info: found 172 certificates in /etc/ssl/certs/ca-certificates.crt
== Info: gnutls_handshake() failed: A TLS packet with unexpected length was received.
== Info: Closing connection #0
== Info: SSL connect error
*
* /opt/sblim-wbemcli/bin/wbemcli: Http Exception: SSL connect error
*
Now we know there is an issue in the way the GnuTLS library used by libcURL interacts with the CIMOM server on the SVC or Storwize systems.
Again, searching the web for similar issues with the error message “gnutls_handshake() failed: A TLS packet with unexpected length was received.”, we find that a rebuild against the OpenSSL library instead of the GnuTLS library could solve this issue:
$ dpkg -l | grep curl
ii curl 7.26.0-1+wheezy11 powerpc command line tool for transferring data with URL syntax
ii libcurl3:powerpc 7.26.0-1+wheezy11 powerpc easy-to-use client-side URL transfer library (OpenSSL flavour)
ii libcurl3-gnutls:powerpc 7.26.0-1+wheezy11 powerpc easy-to-use client-side URL transfer library (GnuTLS flavour)
ii libcurl4-gnutls-dev 7.26.0-1+wheezy11 powerpc development files and documentation for libcurl (GnuTLS flavour)

$ apt-get install libcurl4-openssl-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
Suggested packages:
  libcurl3-dbg
The following packages will be REMOVED:
  libcurl4-gnutls-dev
The following NEW packages will be installed:
  libcurl4-openssl-dev
0 upgraded, 1 newly installed, 1 to remove and 0 not upgraded.
Need to get 0 B/1,259 kB of archives.
After this operation, 28.7 kB of additional disk space will be used.
Do you want to continue [Y/n]? y
(Reading database ... 65717 files and directories currently installed.)
Removing libcurl4-gnutls-dev ...
Processing triggers for man-db ...
Selecting previously unselected package libcurl4-openssl-dev.
(Reading database ... 65463 files and directories currently installed.)
Unpacking libcurl4-openssl-dev (from .../libcurl4-openssl-dev_7.26.0-1+wheezy11_powerpc.deb) ...
Processing triggers for man-db ...
Setting up libcurl4-openssl-dev (7.26.0-1+wheezy11) ...

$ dpkg -l | grep curl
ii curl 7.26.0-1+wheezy11 powerpc command line tool for transferring data with URL syntax
ii libcurl3:powerpc 7.26.0-1+wheezy11 powerpc easy-to-use client-side URL transfer library (OpenSSL flavour)
ii libcurl3-gnutls:powerpc 7.26.0-1+wheezy11 powerpc easy-to-use client-side URL transfer library (GnuTLS flavour)
ii libcurl4-openssl-dev 7.26.0-1+wheezy11 powerpc development files and documentation for libcurl (OpenSSL flavour)
Rebuilt the “wbemcli” command line tool and tried again:
$ CURLDEBUG=true /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
To server: <?xml version="1.0" encoding="utf-8" ?>
<CIM CIMVERSION="2.0" DTDVERSION="2.0">
<MESSAGE ID="4711" PROTOCOLVERSION="1.0"><SIMPLEREQ><IMETHODCALL NAME="EnumerateClassNames"><LOCALNAMESPACEPATH><NAMESPACE NAME="root"></NAMESPACE><NAMESPACE NAME="ibm"></NAMESPACE></LOCALNAMESPACEPATH>
<IPARAMVALUE NAME="DeepInheritance"><VALUE>TRUE</VALUE></IPARAMVALUE>
</IMETHODCALL></SIMPLEREQ>
</MESSAGE></CIM>
== Info: About to connect() to svc-test port 5989 (#0)
== Info: Trying 192.168.x.x...
== Info: connected
== Info: Connected to svc-test (192.168.x.x) port 5989 (#0)
== Info: successfully set certificate verify locations:
== Info: CAfile: none CApath: /etc/ssl/certs
== Info: SSLv3, TLS handshake, Client hello (1):
=> Send SSL data, 0000000134 bytes (0x00000086)
0000: ......T.."\.~....`...K4.......p..7"&4...Z.....9.8.........5.....
0040: ................3.2.....E.D...../...A...........................
0080: ......
== Info: Unknown SSL protocol error in connection to svc-test:5989
== Info: Closing connection #0
== Info: SSL connect error
*
* /opt/sblim-wbemcli/bin/wbemcli: Http Exception: SSL connect error
*
Now we know that the “wbemcli” command line tool is actually – as already suspected – trying to initiate an SSLv3 connection.
In order to confirm we're on the right track, first verify manually that we're unable to connect with an SSLv3 secured connection:
$ openssl s_client -host svc-test -port 5989 -ssl3
CONNECTED(00000003)
write:errno=104
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 0 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol : SSLv3
    Cipher : 0000
    Session-ID:
    Session-ID-ctx:
    Master-Key:
    Key-Arg : None
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Start Time: 1418397594
    Timeout : 7200 (sec)
    Verify return code: 0 (ok)
---
quit
And that we're instead able to connect with a TLS secured connection:
$ openssl s_client -host svc-test -port 5989 CONNECTED(00000003) depth=0 C = GB, L = Hursley, O = IBM, OU = SSG, CN = 2145, emailAddress = support@ibm.com verify error:num=18:self signed certificate verify return:1 depth=0 C = GB, L = Hursley, O = IBM, OU = SSG, CN = 2145, emailAddress = support@ibm.com verify return:1 --- Certificate chain 0 s:/C=GB/L=Hursley/O=IBM/OU=SSG/CN=2145/emailAddress=support@ibm.com i:/C=GB/L=Hursley/O=IBM/OU=SSG/CN=2145/emailAddress=support@ibm.com --- Server certificate -----BEGIN CERTIFICATE----- MIICyDCCAjGgAwIBAgIEUAPzmTANBgkqhkiG9w0BAQUFADBqMQswCQYDVQQGEwJH QjEQMA4GA1UEBxMHSHVyc2xleTEMMAoGA1UEChMDSUJNMQwwCgYDVQQLEwNTU0cx DTALBgNVBAMTBDIxNDUxHjAcBgkqhkiG9w0BCQEWD3N1cHBvcnRAaWJtLmNvbTAe Fw0xMjA3MTYxMDU3MjlaFw0yNzA3MTMxMDU3MjlaMGoxCzAJBgNVBAYTAkdCMRAw DgYDVQQHEwdIdXJzbGV5MQwwCgYDVQQKEwNJQk0xDDAKBgNVBAsTA1NTRzENMAsG A1UEAxMEMjE0NTEeMBwGCSqGSIb3DQEJARYPc3VwcG9ydEBpYm0uY29tMIGfMA0G CSqGSIb3DQEBAQUAA4GNADCBiQKBgQC3E7+7mE2GAID/35o5/s7cnzoqu9PQdOGB ryGMa8adD4Wd9hpmTkrsgyNvkUB6sPIifbFstGooOkQtK9ZNgP5OHOorZmqINSxM 9goCkSCQG9xRKAvNt2tA8gujaV+p42oVEhIH6naJUul96qZI31y3GffUu2CRrJL7 4wG/8cv0BQIDAQABo3sweTAJBgNVHRMEAjAAMCwGCWCGSAGG+EIBDQQfFh1PcGVu U1NMIEdlbmVyYXRlZCBDZXJ0aWZpY2F0ZTAdBgNVHQ4EFgQUkPMkXUjn0YHlfQW8 TJiRC5jWQO4wHwYDVR0jBBgwFoAUkPMkXUjn0YHlfQW8TJiRC5jWQO4wDQYJKoZI hvcNAQEFBQADgYEAKqu7KpVxnOXonQE3unC1O7qUHKoyQUEWqcKsM/4tPI+lsBMZ jvoPwn8yQRWiLehFmVc8VSZfdFPLzshNabXp5qbZo/EFberXrgI2CbtPiULYyyyH DUhWF+vhwb6uqwfBbGncvTvI2ewU8+0oTXsuTkSjumJ7+chpaHFWWyj2cJA= -----END CERTIFICATE----- subject=/C=GB/L=Hursley/O=IBM/OU=SSG/CN=2145/emailAddress=support@ibm.com issuer=/C=GB/L=Hursley/O=IBM/OU=SSG/CN=2145/emailAddress=support@ibm.com --- No client certificate CA names sent --- SSL handshake has read 1029 bytes and written 498 bytes --- New, TLSv1/SSLv3, Cipher is AES256-GCM-SHA384 Server public key is 1024 bit Secure Renegotiation IS supported Compression: NONE Expansion: NONE SSL-Session: Protocol : TLSv1.2 Cipher : AES256-GCM-SHA384 Session-ID: 
48291D368E0A8584A8DFA00A9881B8979BDE370FC6C9439294C670695D031239 Session-ID-ctx: Master-Key: 71BD12A161FC595CD056DA8E6D6E27420F37468E47498B7591A403A86844C55F61FF02B2FEC7739FAAEDCE3DFEA0F217 Key-Arg : None PSK identity: None PSK identity hint: None SRP username: None TLS session ticket lifetime hint: 300 (seconds) TLS session ticket: 0000 - a0 e0 b1 9b c2 37 9a ca-49 1c 54 f5 26 4b d6 24 .....7..I.T.&K.$ 0010 - af 6a 7d cc 5e 4a 97 a8-b3 6d b7 66 0b b7 0a 65 .j}.^J...m.f...e 0020 - 47 af ef 47 76 fc c7 e9-38 ff 84 28 ca 8e 73 25 G..Gv...8..(..s% 0030 - 47 25 f6 0d 36 01 04 f1-f9 f7 0c b6 42 ef cf 09 G%..6.......B... 0040 - 64 8f df ff 89 38 ed 7c-ae 1d 0e 25 d1 c1 77 86 d....8.|...%..w. 0050 - b6 61 88 15 cf fe 9f 20-86 0d 17 74 18 da ea c0 .a..... ...t.... 0060 - 33 3a 47 f5 f9 51 24 ae-48 37 8a 3f 19 dd c6 04 3:G..Q$.H7.?.... 0070 - 7e d1 20 78 35 99 0b 9f-3b 1f ce 7c bc 11 93 e4 ~. x5...;..|.... 0080 - 0f 94 de 94 f1 0d 0c da-64 ca 0d f6 10 2a c8 fa ........d....*.. 0090 - dc 3e e4 1a 97 d1 34 7a-9c f5 c3 00 e8 1b 10 d7 .>....4z........ Start Time: 1418397614 Timeout : 300 (sec) Verify return code: 18 (self signed certificate) --- quit
The TLS secured connection succeeded, so we're on the right track!
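The manual check above can also be scripted. The following is a minimal Python sketch, not part of the original analysis, that reports which protocol version a server negotiates. The function names `make_probe_context` and `negotiated_protocol` are made up for this illustration, certificate verification is switched off to mirror the `-noverify` flag, and since many current Python builds cannot offer SSLv3 at all, it only shows the TLS side of the comparison.

```python
import socket
import ssl
from typing import Optional


def make_probe_context() -> ssl.SSLContext:
    """Build a client context that skips certificate checks (like -noverify)."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # check_hostname has to be disabled before verify_mode can be CERT_NONE
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def negotiated_protocol(host: str, port: int = 5989,
                        timeout: float = 10.0) -> Optional[str]:
    """Return the negotiated protocol name, or None if the handshake fails."""
    ctx = make_probe_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                return tls.version()
    except (OSError, ssl.SSLError):
        return None


if __name__ == "__main__":
    # "svc-test" is the example host name used throughout this post
    print(negotiated_protocol("svc-test"))
```

Whether the handshake succeeds against a particular firmware level also depends on the local OpenSSL policy, so a `None` result needs a closer look with `openssl s_client` before drawing conclusions.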
Now we need to find the spot in the source code where the “wbemcli” command line tool is forced to initiate an SSLv3 connection. We know this is probably done in a cURL related function call, since libcURL is used for the network connection. So let's first look into the cURL code for all the lines showing any sign of SSL related operations:
$ grep -n curl CimCurl.cpp | grep -i ssl
185: // Assume we support SSL if we don't have the curl_version_info API
276: rv = curl_easy_setopt(mHandle, CURLOPT_SSL_VERIFYHOST, 0);
277: // rv = curl_easy_setopt(mHandle, CURLOPT_SSL_VERIFYPEER, 0);
280: rv = curl_easy_setopt(mHandle, CURLOPT_SSLVERSION, 3);
441: if ((rv=curl_easy_setopt(mHandle,CURLOPT_SSL_VERIFYPEER,0))) {
448: if ((rv=curl_easy_setopt(mHandle,CURLOPT_SSL_VERIFYPEER,1))) {
466: if ((rv=curl_easy_setopt(mHandle,CURLOPT_SSLCERT,certificate))) {
470: if ((rv=curl_easy_setopt(mHandle,CURLOPT_SSLKEY,key))) {
Looks like a perfect match in line 280 of the file “CimCurl.cpp”. The line numbers are a bit off from the original source code, since the file “CimCurl.cpp” was patched with our above debugging code. The code in the original, unpatched source file “CimCurl.cpp”, within the function “CimomCurl::genRequest”, looks like this:
- CimCurl.cpp
175 [...]
176   /* Disable SSL host verification */
177   rv = curl_easy_setopt(mHandle, CURLOPT_SSL_VERIFYHOST, 0);
178   // rv = curl_easy_setopt(mHandle, CURLOPT_SSL_VERIFYPEER, 0);
179
180   /* Force using SSL V3 */
181   rv = curl_easy_setopt(mHandle, CURLOPT_SSLVERSION, 3);
182
183   /* Set username and password */
184   if (url.user.length() > 0 && url.password.length() > 0) {
185     mUserPass = url.user + ":" + url.password;
186     rv = curl_easy_setopt(mHandle, CURLOPT_USERPWD, mUserPass.c_str());
187   }
188 [...]
In line 181 the cURL option “CURLOPT_SSLVERSION” is indiscriminately set to use SSLv3 and nothing else, which we know from the above deduction is bound to fail on systems addressing the “POODLE” issue.
With the knowledge of where the issue is actually caused, an easy quick'n'dirty fix can be implemented:
- sblim-wbemcli-1.6.3_debug.patch
--- sblim-wbemcli-1.6.3_orig/CimCurl.cpp 2013-09-21 01:26:32.000000000 +0200
+++ sblim-wbemcli-1.6.3/CimCurl.cpp 2014-11-26 16:46:09.000000000 +0100
@@ -178,7 +178,7 @@
   // rv = curl_easy_setopt(mHandle, CURLOPT_SSL_VERIFYPEER, 0);
 
   /* Force using SSL V3 */
-  rv = curl_easy_setopt(mHandle, CURLOPT_SSLVERSION, 3);
+  //rv = curl_easy_setopt(mHandle, CURLOPT_SSLVERSION, 3);
 
   /* Set username and password */
   if (url.user.length() > 0 && url.password.length() > 0) {
Commenting out the line where the cURL option “CURLOPT_SSLVERSION” is forced to SSLv3 causes libcURL to fall back to its default value, which is now TLS.
Rebuilt the “wbemcli” command line tool and tried again:
$ CURLDEBUG=true /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
To server: <?xml version="1.0" encoding="utf-8" ?>
<CIM CIMVERSION="2.0" DTDVERSION="2.0">
<MESSAGE ID="4711" PROTOCOLVERSION="1.0"><SIMPLEREQ><IMETHODCALL NAME="EnumerateClassNames"><LOCALNAMESPACEPATH><NAMESPACE NAME="root"></NAMESPACE><NAMESPACE NAME="ibm"></NAMESPACE></LOCALNAMESPACEPATH>
<IPARAMVALUE NAME="DeepInheritance"><VALUE>TRUE</VALUE></IPARAMVALUE>
</IMETHODCALL></SIMPLEREQ>
</MESSAGE></CIM>
* About to connect() to svc-test port 5989 (#0)
* Trying 192.168.x.x...
* connected
* Connected to svc-test (192.168.x.x) port 5989 (#0)
* found 172 certificates in /etc/ssl/certs/ca-certificates.crt
* server certificate verification SKIPPED
* common name: 2145 (does not match 'svc-test')
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #3
* subject: C=GB,L=Hursley,O=IBM,OU=SSG,CN=2145,EMAIL=support@ibm.com
* start date: Mon, 16 Jul 2012 10:57:29 GMT
* expire date: Tue, 13 Jul 2027 10:57:29 GMT
* issuer: C=GB,L=Hursley,O=IBM,OU=SSG,CN=2145,EMAIL=support@ibm.com
* compression: NULL
* cipher: AES-128-CBC
* MAC: SHA1
* Server auth using Basic with user 'user'
> POST /cimom HTTP/1.1
Authorization: Basic enp6bmFnaW9zOm5hZ2lvcw==
Host: svc-test:5989
Content-Type: application/xml; charset="utf-8"
Connection: Keep-Alive, TE
CIMProtocolVersion: 1.0
CIMOperation: MethodCall
CIMMethod: EnumerateClassNames
CIMObject: root%2Fibm
Content-Length: 396
* upload completely sent off: 396 out of 396 bytes
* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
< Content-Type: application/xml; charset="utf-8"
From server: Content-Type: application/xml; charset="utf-8"
< content-length: 0000072284
From server: content-length: 0000072284
< CIMOperation: MethodResponse
From server: CIMOperation: MethodResponse
<
* Connection #0 to host svc-test left intact
From server: <?xml version="1.0" encoding="utf-8" ?>
<CIM CIMVERSION="2.0" DTDVERSION="2.0">
<MESSAGE ID="4711" PROTOCOLVERSION="1.0">
<SIMPLERSP>
<IMETHODRESPONSE NAME="EnumerateClassNames">
<IRETURNVALUE>
<CLASSNAME NAME="CIM_ConcreteIdentity"/>
<CLASSNAME NAME="CIM_NetworkPacketAction"/>
<CLASSNAME NAME="CIM_CollectionInSystem"/>
[...]

$ /opt/sblim-wbemcli/bin/wbemcli -noverify ecn https://user:pass@svc-test:5989/root/ibm
svc-test:5989/root/ibm:CIM_ConcreteIdentity
svc-test:5989/root/ibm:CIM_NetworkPacketAction
svc-test:5989/root/ibm:CIM_CollectionInSystem
svc-test:5989/root/ibm:CIM_DeviceSAPImplementation
svc-test:5989/root/ibm:CIM_ProtocolControllerAccessesUnit
svc-test:5989/root/ibm:CIM_ControlledBy
[...]
Great, now we're finally able to query and monitor the IBM SVC or Storwize systems again with the Nagios monitoring plugin (Nagios Monitoring - IBM SVC and Storwize)!
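As a small illustration of how the plain `ecn` listing shown above could be consumed by a script, here is a hedged Python sketch. The helper name `parse_class_names` is invented for this example, and it assumes every result line has the `host:port/namespace:ClassName` shape seen in the output above; it is not how the Nagios plugin itself processes the data.

```python
from typing import List


def parse_class_names(output: str) -> List[str]:
    """Extract the CIM class names from 'wbemcli ecn' style output.

    Assumes lines of the form host:port/namespace:ClassName, as seen in
    the listing above. Blank lines and '[...]' elisions are skipped.
    """
    names = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line == "[...]":
            continue
        # The class name is whatever follows the last colon on the line
        _, sep, name = line.rpartition(":")
        if sep and name:
            names.append(name)
    return names


if __name__ == "__main__":
    sample = (
        "svc-test:5989/root/ibm:CIM_ConcreteIdentity\n"
        "svc-test:5989/root/ibm:CIM_NetworkPacketAction\n"
        "[...]\n"
    )
    for class_name in parse_class_names(sample):
        print(class_name)
```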
Between first noticing and researching the issue and creating the quick'n'dirty fix shown above, an official bug report has been filed on the issue and a patch has already been submitted to the source code repository in order to address the issue more thoroughly. Hopefully an updated official source code package will be released soon.
Nonetheless, the process of researching and debugging this issue was an excellent hands-on exercise for me, which i enjoyed very much. Hopefully the steps taken and described here will turn out to be of use for others as well.
2013-12-28 // Nagios Monitoring - IBM SVC and Storwize
Some time ago i wrote a – rather crude – Nagios plugin to monitor IBM SAN Volume Controller (SVC) systems. The plugin was initially targeted at version 4.3.x of the SVC software on 2145-8F2 nodes we used back then. Since the initial implementation of the plugin we upgraded the hard- and software of our SVC systems several times and are now at version 7.1.x of the SVC software on 2145-CG8 nodes. Recently we also got some IBM Storwize V3700 storage arrays, which share the same code as the SVC, but are missing some of the features and provide additional other features. A code and functional review of the original plugin for the SVC as well as an adaptation for the Storwize arrays seemed to be in order. The results were the two plugins check_ibm_svc.pl and check_ibm_storwize.pl. They share a lot of common code with the original plugin, but are still maintained separately for the simple reason that IBM might develop the SVC and the Storwize code in slightly different, incompatible directions.
In order to run the plugins, you need to have the command line tool wbemcli from the Standards Based Linux Instrumentation project installed on the Nagios system. In my case the wbemcli command line tool is placed in /opt/sblim-wbemcli/bin/wbemcli. If you use a different path, adapt the configuration hash entry “%conf{'wbemcli'}” according to your environment. The plugins use wbemcli to query the CIMOM service on the SVC or Storwize system for the necessary information. Therefore a network connection from the Nagios system to the SVC or Storwize systems on port TCP/5989 must be allowed and a user with the “Monitor” authorization must be created on the SVC or Storwize systems:
IBM_2145:svc:admin$ mkuser -name nagios -usergrp Monitor -password <password>
Or in the WebUI:
Generic
Optional: Enable SNMP traps to be sent to the Nagios system on each of the SVC or Storwize devices. This requires SNMPD and SNMPTT to be already set up on the Nagios system. Log in to the SVC or Storwize CLI and issue the command:
IBM_2145:svc:admin$ mksnmpserver -ip <IP address> -community public -error on -warning on -info on -port 162
Where <IP address> is the IP address of your Nagios system. Or in the SVC or Storwize WebUI navigate to:
-> Settings -> Event Notifications -> SNMP -> <Enter IP of the Nagios system and the SNMPDs community string>
Verify the port UDP/162 on the Nagios system can be reached from the SVC or Storwize devices.
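For the TCP/5989 requirement mentioned earlier, reachability from the Nagios system can be checked with a few lines of Python. This is only a sketch under the assumption that a plain TCP connect is a sufficient test; the function name `tcp_port_open` is made up for this example. The UDP/162 trap port cannot be verified this way, since UDP has no connection handshake.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors
        return False


if __name__ == "__main__":
    # "svc-test" stands in for the name of an SVC or Storwize system
    state = "reachable" if tcp_port_open("svc-test", 5989) else "NOT reachable"
    print("CIMOM port TCP/5989 is " + state)
```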
SAN Volume Controller (SVC)
For SAN Volume Controller (SVC) devices the whole setup looks like this:
Download the Nagios plugin check_ibm_svc.pl and place it in the plugins directory of your Nagios system, in this example /usr/lib/nagios/plugins/:
$ mv -i check_ibm_svc.pl /usr/lib/nagios/plugins/
$ chmod 755 /usr/lib/nagios/plugins/check_ibm_svc.pl
Adjust the plugin settings according to your environment. Edit the following variable assignments:
my %conf = ( wbemcli => '/opt/sblim-wbemcli/bin/wbemcli',
Define the following Nagios commands. In this example this is done in the file /etc/nagios-plugins/config/check_svc.cfg:
# check SVC Backend Controller status
define command {
    command_name    check_svc_bc
    command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C BackendController
}
# check SVC Backend SCSI Status
define command {
    command_name    check_svc_btspe
    command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C BackendTargetSCSIPE
}
# check SVC MDisk status
define command {
    command_name    check_svc_bv
    command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C BackendVolume
}
# check SVC Cluster status
define command {
    command_name    check_svc_cl
    command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C Cluster
}
# check SVC MDiskGroup status
define command {
    command_name    check_svc_csp
    command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C ConcreteStoragePool
}
# check SVC Ethernet Port status
define command {
    command_name    check_svc_eth
    command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C EthernetPort
}
# check SVC FC Port status
define command {
    command_name    check_svc_fcp
    command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C FCPort
}
# check SVC FC Port statistics
define command {
    command_name    check_svc_fcp_stats
    command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C FCPortStatistics
}
# check SVC I/O Group status and memory allocation
define command {
    command_name    check_svc_iogrp
    command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C IOGroup -w $ARG1$ -c $ARG2$
}
# check SVC WebUI status
define command {
    command_name    check_svc_mc
    command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C MasterConsole
}
# check SVC VDisk Mirror status
define command {
    command_name    check_svc_mirror
    command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C MirrorExtent
}
# check SVC Node status
define command {
    command_name    check_svc_node
    command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C Node
}
# check SVC Quorum Disk status
define command {
    command_name    check_svc_quorum
    command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C QuorumDisk
}
# check SVC Storage Volume status
define command {
    command_name    check_svc_sv
    command_line    $USER1$/check_ibm_svc.pl -H $HOSTNAME$ -u <user> -p <password> -C StorageVolume
}
Replace <user> and <password> with the name and password of the CIMOM user created above.
Define a group of services in your Nagios configuration to be checked for each SVC system:
# check sshd
define service {
    use                  generic-service
    hostgroup_name       svc
    service_description  Check_SSH
    check_command        check_ssh
}
# check_tcp CIMOM
define service {
    use                  generic-service-pnp
    hostgroup_name       svc
    service_description  Check_CIMOM
    check_command        check_tcp!5989
}
# check_svc_bc
define service {
    use                  generic-service-pnp
    hostgroup_name       svc
    service_description  Check_Backend_Controller
    check_command        check_svc_bc
}
# check_svc_btspe
define service {
    use                  generic-service-pnp
    hostgroup_name       svc
    service_description  Check_Backend_Target
    check_command        check_svc_btspe
}
# check_svc_bv
define service {
    use                  generic-service-pnp
    hostgroup_name       svc
    service_description  Check_Backend_Volume
    check_command        check_svc_bv
}
# check_svc_cl
define service {
    use                  generic-service-pnp
    hostgroup_name       svc
    service_description  Check_Cluster
    check_command        check_svc_cl
}
# check_svc_csp
define service {
    use                  generic-service-pnp
    hostgroup_name       svc
    service_description  Check_Storage_Pool
    check_command        check_svc_csp
}
# check_svc_eth
define service {
    use                  generic-service-pnp
    hostgroup_name       svc
    service_description  Check_Ethernet_Port
    check_command        check_svc_eth
}
# check_svc_fcp
define service {
    use                  generic-service-pnp
    hostgroup_name       svc
    service_description  Check_FC_Port
    check_command        check_svc_fcp
}
# check_svc_fcp_stats
define service {
    use                  generic-service-pnp
    hostgroup_name       svc
    service_description  Check_FC_Port_Statistics
    check_command        check_svc_fcp_stats
}
# check_svc_iogrp
define service {
    use                  generic-service-pnp
    hostgroup_name       svc
    service_description  Check_IO_Group
    check_command        check_svc_iogrp!102400!204800
}
# check_svc_mc
define service {
    use                  generic-service-pnp
    hostgroup_name       svc
    service_description  Check_Master_Console
    check_command        check_svc_mc
}
# check_svc_mirror
define service {
    use                  generic-service-pnp
    hostgroup_name       svc
    service_description  Check_Mirror_Extents
    check_command        check_svc_mirror
}
# check_svc_node
define service {
    use                  generic-service-pnp
    hostgroup_name       svc
    service_description  Check_Node
    check_command        check_svc_node
}
# check_svc_quorum
define service {
    use                  generic-service-pnp
    hostgroup_name       svc
    service_description  Check_Quorum
    check_command        check_svc_quorum
}
# check_svc_sv
define service {
    use                  generic-service-pnp
    hostgroup_name       svc
    service_description  Check_Storage_Volume
    check_command        check_svc_sv
}
Replace generic-service with your Nagios service template. Replace generic-service-pnp with your Nagios service template that has performance data processing enabled.
Define hosts in your Nagios configuration for each SVC device. In this example it is named svc1:
define host {
    use        svc
    host_name  svc1
    alias      SAN Volume Controller 1
    address    10.0.0.1
    parents    parent_lan
}
Replace svc with your Nagios host template for SVC devices. Adjust the address and parents parameters according to your environment.
Define a hostgroup in your Nagios configuration for all SVC systems. In this example it is named svc. The above checks are run against each member of the hostgroup:
define hostgroup {
    hostgroup_name  svc
    alias           IBM SVC Clusters
    members         svc1
}
Run a configuration check and if successful reload the Nagios process:
$ /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg
$ /etc/init.d/nagios3 reload
The new hosts and services should soon show up in the Nagios web interface.
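The SVC service definitions above differ only in their description and check command, so they can also be generated instead of typed by hand. The following is a minimal sketch under that assumption; the function names service_def and svc_sample_services are made up for this illustration and are not part of Nagios or the plugin.

```shell
# Print one Nagios service definition.
# $1 = hostgroup, $2 = service_description, $3 = check_command,
# $4 = service template (defaults to generic-service-pnp, as used above).
# service_def is a hypothetical helper, not something shipped with Nagios.
service_def() {
    printf 'define service {\n'
    printf '    use                  %s\n' "${4:-generic-service-pnp}"
    printf '    hostgroup_name       %s\n' "$1"
    printf '    service_description  %s\n' "$2"
    printf '    check_command        %s\n' "$3"
    printf '}\n'
}

# Emit definitions for a few of the SVC checks listed above.
# Each input line is "<service_description> <check_command>".
svc_sample_services() {
    while read -r desc cmd; do
        service_def svc "$desc" "$cmd"
    done <<'EOF'
Check_Cluster check_svc_cl
Check_Node check_svc_node
Check_IO_Group check_svc_iogrp!102400!204800
EOF
}
```

Calling svc_sample_services and redirecting its output into a file below your Nagios configuration directory would give the same kind of definitions as shown above; extend the list between the EOF markers as needed.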
Storwize
For Storwize devices the whole setup looks like this:
Download the Nagios plugin check_ibm_storwize.pl and place it in the plugins directory of your Nagios system, in this example /usr/lib/nagios/plugins/:
$ mv -i check_ibm_storwize.pl /usr/lib/nagios/plugins/
$ chmod 755 /usr/lib/nagios/plugins/check_ibm_storwize.pl
Adjust the plugin settings according to your environment. Edit the following variable assignments:
my %conf = (
    wbemcli => '/opt/sblim-wbemcli/bin/wbemcli',
    ...
);
Define the following Nagios commands. In this example this is done in the file /etc/nagios-plugins/config/check_storwize.cfg:
# check Storwize RAID Array status
define command {
    command_name    check_storwize_array
    command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C Array
}
# check Storwize Hot Spare coverage
define command {
    command_name    check_storwize_asc
    command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C ArrayBasedOnDiskDrive
}
# check Storwize MDisk status
define command {
    command_name    check_storwize_bv
    command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C BackendVolume
}
# check Storwize Cluster status
define command {
    command_name    check_storwize_cl
    command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C Cluster
}
# check Storwize MDiskGroup status
define command {
    command_name    check_storwize_csp
    command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C ConcreteStoragePool
}
# check Storwize Disk status
define command {
    command_name    check_storwize_disk
    command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C DiskDrive
}
# check Storwize Enclosure status
define command {
    command_name    check_storwize_enc
    command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C Enclosure
}
# check Storwize Ethernet Port status
define command {
    command_name    check_storwize_eth
    command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C EthernetPort
}
# check Storwize FC Port status
define command {
    command_name    check_storwize_fcp
    command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C FCPort
}
# check Storwize I/O Group status and memory allocation
define command {
    command_name    check_storwize_iogrp
    command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C IOGroup -w $ARG1$ -c $ARG2$
}
# check Storwize Hot Spare status
define command {
    command_name    check_storwize_is
    command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C IsSpare
}
# check Storwize WebUI status
define command {
    command_name    check_storwize_mc
    command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C MasterConsole
}
# check Storwize VDisk Mirror status
define command {
    command_name    check_storwize_mirror
    command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C MirrorExtent
}
# check Storwize Node status
define command {
    command_name    check_storwize_node
    command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C Node
}
# check Storwize Quorum Disk status
define command {
    command_name    check_storwize_quorum
    command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C QuorumDisk
}
# check Storwize Storage Volume status
define command {
    command_name    check_storwize_sv
    command_line    $USER1$/check_ibm_storwize.pl -H $HOSTNAME$ -u <user> -p <password> -C StorageVolume
}
Replace <user> and <password> with the name and password of the CIMOM user created above.
Define a group of services in your Nagios configuration to be checked for each Storwize system:
# check sshd
define service {
    use                  generic-service
    hostgroup_name       storwize
    service_description  Check_SSH
    check_command        check_ssh
}
# check_tcp CIMOM
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_CIMOM
    check_command        check_tcp!5989
}
# check_storwize_array
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_Array
    check_command        check_storwize_array
}
# check_storwize_asc
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_Array_Spare_Coverage
    check_command        check_storwize_asc
}
# check_storwize_bv
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_Backend_Volume
    check_command        check_storwize_bv
}
# check_storwize_cl
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_Cluster
    check_command        check_storwize_cl
}
# check_storwize_csp
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_Storage_Pool
    check_command        check_storwize_csp
}
# check_storwize_disk
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_Disk_Drive
    check_command        check_storwize_disk
}
# check_storwize_enc
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_Enclosure
    check_command        check_storwize_enc
}
# check_storwize_eth
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_Ethernet_Port
    check_command        check_storwize_eth
}
# check_storwize_fcp
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_FC_Port
    check_command        check_storwize_fcp
}
# check_storwize_iogrp
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_IO_Group
    check_command        check_storwize_iogrp!102400!204800
}
# check_storwize_is
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_Hot_Spare
    check_command        check_storwize_is
}
# check_storwize_mc
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_Master_Console
    check_command        check_storwize_mc
}
# check_storwize_mirror
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_Mirror_Extents
    check_command        check_storwize_mirror
}
# check_storwize_node
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_Node
    check_command        check_storwize_node
}
# check_storwize_quorum
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_Quorum
    check_command        check_storwize_quorum
}
# check_storwize_sv
define service {
    use                  generic-service-pnp
    hostgroup_name       storwize
    service_description  Check_Storage_Volume
    check_command        check_storwize_sv
}
Replace generic-service with your Nagios service template. Replace generic-service-pnp with your Nagios service template that has performance data processing enabled.
Define hosts in your Nagios configuration for each Storwize device. In this example it is named storwize1:
define host {
    use        disk
    host_name  storwize1
    alias      Storwize Disk Storage 1
    address    10.0.0.1
    parents    parent_lan
}
Replace disk with your Nagios host template for storage devices. Adjust the address and parents parameters according to your environment.
Define a hostgroup in your Nagios configuration for all Storwize systems. In this example it is named storwize. The above checks are run against each member of the hostgroup:
define hostgroup {
    hostgroup_name  storwize
    alias           IBM Storwize Devices
    members         storwize1
}
Run a configuration check and if successful reload the Nagios process:
$ /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg
$ /etc/init.d/nagios3 reload
The new hosts and services should soon show up in the Nagios web interface.
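A note on the check_command values used above: in a string such as check_storwize_iogrp!102400!204800, the exclamation marks separate the command name from its arguments, which Nagios substitutes for $ARG1$ and $ARG2$ in the matching command_line (here the -w and -c thresholds). The sketch below only illustrates that mapping with plain string splitting; expand_check_command is a hypothetical helper, and Nagios does the real substitution internally.

```shell
# Split a check_command string on "!" and show which part ends up in
# which $ARGn$ macro. Illustration only, not how Nagios is implemented.
expand_check_command() {
    old_ifs=$IFS
    IFS='!'
    set -f              # disable globbing while the string is split
    set -- $1           # unquoted on purpose: split on "!"
    set +f
    IFS=$old_ifs

    # Nothing left after splitting means an empty or missing argument.
    [ "$#" -gt 0 ] || { echo "usage: expand_check_command name!arg1!arg2" >&2; return 1; }

    printf 'command_name=%s\n' "$1"
    shift
    n=1
    for arg in "$@"; do
        printf 'ARG%d=%s\n' "$n" "$arg"
        n=$((n + 1))
    done
}
```

For check_storwize_iogrp!102400!204800 this prints the command name followed by ARG1=102400 and ARG2=204800, which corresponds to "-w 102400 -c 204800" on the plugin command line defined earlier.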
Generic
If the optional step in the “Generic” section above was done, SNMPTT also needs to be configured to be able to understand the incoming SNMP traps from Storwize systems. This can be achieved by the following steps:
Download the IBM SVC/Storwize SNMP MIB matching your software version from ftp://ftp.software.ibm.com/storage/san/sanvc/.
Convert the IBM SVC/Storwize SNMP MIB definitions in SVC_MIB_<version>.MIB into a format that SNMPTT can understand:
$ /opt/snmptt/snmpttconvertmib --in=MIB/SVC_MIB_7.1.0.MIB --out=/opt/snmptt/conf/snmptt.conf.ibm-svc-710
...
Done
Total translations: 3
Successful translations: 3
Failed translations: 0
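If the conversion is scripted, the summary at the end of the converter output can be checked automatically. The following is a small sketch that assumes the output contains a "Failed translations: N" line as in the example above; translations_ok is a name invented for this illustration.

```shell
# Read converter output on stdin and succeed only if it reports
# "Failed translations: 0". Fails if the line is missing altogether.
# translations_ok is a hypothetical helper, not part of SNMPTT.
translations_ok() {
    failed=$(sed -n 's/.*Failed translations:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | tail -n 1)
    [ -n "$failed" ] && [ "$failed" -eq 0 ]
}
```

It could be used by piping the output of the snmpttconvertmib call shown above into translations_ok and aborting when it returns a non-zero status.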
Edit the trap severity according to your requirements, e.g.:
$ vim /opt/snmptt/conf/snmptt.conf.ibm-svc-710
...
EVENT tsveETrap .1.3.6.1.4.1.2.6.190.1 "Status Events" Critical
...
EVENT tsveWTrap .1.3.6.1.4.1.2.6.190.2 "Status Events" Warning
...
Optional: Apply the following patch to the configuration to reduce the number of false positives:
snmptt.conf.ibm-svc-710
--- /opt/snmptt/conf/snmptt.conf.ibm-svc-710.orig 2013-12-28 21:16:25.000000000 +0100
+++ /opt/snmptt/conf/snmptt.conf.ibm-svc-710 2013-12-28 21:17:55.000000000 +0100
@@ -29,11 +29,21 @@
 16: tsveMPNO
 17: tsveOBJN
 EDESC
+# Filter and ignore the following events that are not really warnings
+# "Error ID = 980440": Failed to transfer file from remote node
+# "Error ID = 981001": Cluster Fabric View updated by fabric discovery
+# "Error ID = 981014": LUN Discovery failed
+# "Error ID = 982009": Migration complete
 #
+EVENT tsveWTrap .1.3.6.1.4.1.2.6.190.2 "Status Events" Normal
+FORMAT tsve information trap $*
+MATCH $3: (Error ID = 980440|981001|981014|982009)
 #
+# All remaining events with this OID are actually warnings
 #
 EVENT tsveWTrap .1.3.6.1.4.1.2.6.190.2 "Status Events" Warning
 FORMAT tsve warning trap $*
+MATCH $3: !(Error ID = 980440|981001|981014|982009)
 SDESC
 tsve warning trap
 Variables:
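To get a feeling for which traps the two MATCH lines of the patch separate, the pattern can be tried against sample text on the command line. This is only an approximation: SNMPTT evaluates the expression as a Perl regular expression, while the sketch uses grep -E, which behaves the same for this simple pattern. The function name trap_severity and the sample strings are made up for illustration.

```shell
# Pattern copied from the MATCH lines of the patch above.
pattern='Error ID = 980440|981001|981014|982009'

# Print the severity the patched configuration is meant to assign to a
# trap whose third variable ($3) contains the given text.
# trap_severity is a hypothetical helper for experimenting, not SNMPTT code.
trap_severity() {
    if printf '%s\n' "$1" | grep -Eq "$pattern"; then
        echo Normal     # first EVENT entry: MATCH $3: (...)
    else
        echo Warning    # second EVENT entry: MATCH $3: !(...)
    fi
}
```

One detail worth knowing: alternation binds loosely, so only the first alternative is tied to the "Error ID = " prefix, and the remaining three match the bare number anywhere in the variable.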
Add the new configuration file to be included in the global SNMPTT configuration and restart the SNMPTT daemon:
$ vim /opt/snmptt/snmptt.ini
...
[TrapFiles]
snmptt_conf_files = <<END
...
/opt/snmptt/conf/snmptt.conf.ibm-svc-710
...
END

$ /etc/init.d/snmptt reload
Download the Nagios plugin check_snmp_traps.sh and place it in the plugins directory of your Nagios system, in this example /usr/lib/nagios/plugins/:
$ mv -i check_snmp_traps.sh /usr/lib/nagios/plugins/
$ chmod 755 /usr/lib/nagios/plugins/check_snmp_traps.sh
Define the following Nagios command to check for SNMP traps in the SNMPTT database. In this example this is done in the file /etc/nagios-plugins/config/check_snmp_traps.cfg:
# check for snmp traps
define command {
    command_name    check_snmp_traps
    command_line    $USER1$/check_snmp_traps.sh -H $HOSTNAME$:$HOSTADDRESS$ -u <user> -p <pass> -d <snmptt_db>
}
Replace <user>, <pass> and <snmptt_db> with values suitable for your SNMPTT database environment.
Add another service in your Nagios configuration to be checked for each SVC:
# check snmptraps
define service {
    use                  generic-service
    hostgroup_name       svc
    service_description  Check_SNMP_traps
    check_command        check_snmp_traps
}
or Storwize system:
# check snmptraps
define service {
    use                  generic-service
    hostgroup_name       storwize
    service_description  Check_SNMP_traps
    check_command        check_snmp_traps
}
Optional: Define a serviceextinfo to display a folder icon next to the Check_SNMP_traps service check for each SVC:
define serviceextinfo {
    hostgroup_name       svc
    service_description  Check_SNMP_traps
    notes                SNMP Alerts
    #notes_url           http://<hostname>/nagios3/nagtrap/index.php?hostname=$HOSTNAME$
    #notes_url           http://<hostname>/nagios3/nsti/index.php?perpage=100&hostname=$HOSTNAME$
}
or Storwize system:
define serviceextinfo {
    hostgroup_name       storwize
    service_description  Check_SNMP_traps
    notes                SNMP Alerts
    #notes_url           http://<hostname>/nagios3/nagtrap/index.php?hostname=$HOSTNAME$
    #notes_url           http://<hostname>/nagios3/nsti/index.php?perpage=100&hostname=$HOSTNAME$
}
This icon provides a direct link to the SNMPTT web interface with a filter for the selected host. Uncomment the notes_url line that matches the web interface in use (nagtrap or nsti). Replace <hostname> with the FQDN or IP address of the server running the web interface.
Run a configuration check and if successful reload the Nagios process:
$ /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg
$ /etc/init.d/nagios3 reload
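Since the same check-then-reload sequence comes up after every configuration change, it can be wrapped so that the reload only happens when the check passes. A minimal sketch, using the paths from this article as defaults; the variable names NAGIOS_BIN, NAGIOS_CFG and NAGIOS_INIT and the function check_and_reload are made up for this example.

```shell
# Reload Nagios only if the configuration check succeeds.
# Defaults are the paths used in this article; the variable names are
# hypothetical and can be overridden for other installations.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/sbin/nagios3}
NAGIOS_CFG=${NAGIOS_CFG:-/etc/nagios3/nagios.cfg}
NAGIOS_INIT=${NAGIOS_INIT:-/etc/init.d/nagios3}

check_and_reload() {
    # "-v" verifies the configuration and exits non-zero on errors.
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        "$NAGIOS_INIT" reload
    else
        echo "configuration check failed, not reloading" >&2
        return 1
    fi
}
```

Run check_and_reload as a user that is allowed to reload the Nagios service; on failure the output of the configuration check shows which definition needs fixing.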
Optional: If you're running PNP4Nagios v0.6 or later to graph Nagios performance data, you can use the PNP4Nagios templates in pnp4nagios_storwize.tar.bz2 and pnp4nagios_svc.tar.bz2 to beautify the graphs. Download both archives, unpack them and place the templates in the PNP4Nagios template directory, in this example /usr/share/pnp4nagios/html/templates/:
$ tar jxf pnp4nagios_storwize.tar.bz2
$ mv -i check_storwize_*.php /usr/share/pnp4nagios/html/templates/
$ chmod 644 /usr/share/pnp4nagios/html/templates/check_storwize_*.php
$ tar jxf pnp4nagios_svc.tar.bz2
$ mv -i check_svc_*.php /usr/share/pnp4nagios/html/templates/
$ chmod 644 /usr/share/pnp4nagios/html/templates/check_svc_*.php
All done, you should now have a complete Nagios-based monitoring solution for your IBM SVC and Storwize systems.
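If one of the checks does not behave as expected, it can help to run the plugin by hand and look at its exit code, which follows the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below wraps that; nagios_state and run_check are hypothetical helper names, and the example call in the comment reuses the host name and placeholders from this article.

```shell
# Translate a plugin exit code into the Nagios state name.
# Anything other than 0, 1 or 2 is reported as UNKNOWN in this sketch.
nagios_state() {
    case $1 in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Run a plugin command line, print "<STATE>: <plugin output>" and
# hand the plugin's exit code back to the caller.
run_check() {
    output=$("$@")
    rc=$?
    printf '%s: %s\n' "$(nagios_state "$rc")" "$output"
    return "$rc"
}

# Example (not executed here):
#   run_check /usr/lib/nagios/plugins/check_ibm_storwize.pl \
#       -H storwize1 -u <user> -p <password> -C Cluster
```

Running the plugin this way as the Nagios user also shows whether the CIMOM credentials and the wbemcli path are set up correctly, independent of the Nagios scheduler.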