2014-04-27 // Recover from a broken VIO Server Update
If you're – like me – running Ganglia to monitor the performance metrics of your AIX and VIOS systems, it can make sense to disable the now somewhat redundant default collection method for performance metrics, in order not to waste resources and to prevent unnecessary cluttering of the local VIOS filesystem. This can be done via the “xmdaily” entry in /etc/inittab:
- /etc/inittab
:xmdaily:2:once:/usr/bin/topasrec -L -s 300 -R 1 -r 6 -o /home/ios/perf/topas/ -ypersistent=1 2>&1 >/dev/null
$ lsitab xmdaily; echo $?
1
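The disabling itself is accomplished by prefixing the entry with a colon, which the AIX init treats as a comment marker. Alternatively the entry could be removed altogether, e.g.:

$ rmitab xmdaily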
Unfortunately this breaks the VIOS upgrade process, since the good folks at IBM insist on the “xmdaily” inittab entry being vital for a properly functioning VIOS system. Specifically, the failure is caused by the /usr/lpp/ios.cli/ios.cli.rte/<VERSION>/inst_root/ios.cli.rte.post_u script in the ios.cli.rte package. E.g.:
- /usr/lpp/ios.cli/ios.cli.rte/6.1.9.1/inst_root/ios.cli.rte.post_u
507 rt=`/usr/sbin/lsitab xmdaily`
508 if [[ $rt != "" ]]
[...]
528 else
529     echo "Warning: lsitab failed..."
530     exit 1
531 fi
I beg to differ on the whole subject of “xmdaily” being necessary at all, but one could file this under the category of “philosophical differences”. Failing the whole package update with the “exit 1” in line 530 of the above code sample seems a bit too harsh though.
So normally I would just put the “xmdaily” entry back into the inittab right before a VIOS update. Unfortunately, on the update to 2.2.3.1-FP27 and subsequently to 2.2.3.2-FP27-SP02, I forgot to do that on a few VIOS systems. The result of this negligence were failure messages during the update process (RBAC output omitted for better readability!):
[...]
installp: APPLYING software for:
        ios.cli.rte 6.1.9.2

. . . . . << Copyright notice for ios.cli >> . . . . . . .
 Licensed Materials - Property of IBM
 5765G3400
 Copyright International Business Machines Corp. 2004, 2014.
 All rights reserved.
 US Government Users Restricted Rights - Use, duplication or disclosure
 restricted by GSA ADP Schedule Contract with IBM Corp.
. . . . . << End of copyright notice for ios.cli >>. . . .

sysck: 3001-022 The file /usr/ios/utils/part was not found.
Start Creating VIOS Authorizations
Completed Creating VIOS Authorizations
Warning: lsitab failed...
update: Failed while executing the ios.cli.rte.post_u script.
0503-464 installp: The installation has FAILED for the "root" part
        ios.cli.rte 6.1.9.2
installp: Cleaning up software for:
        ios.cli.rte 6.1.9.2
[...]
as well as at the end of the update process:
[...]
Installation Summary
--------------------
Name                        Level           Part        Event       Result
-------------------------------------------------------------------------------
ios.cli.rte                 6.1.9.2         USR         APPLY       SUCCESS
ios.cli.rte                 6.1.9.2         ROOT        APPLY       FAILED
ios.cli.rte                 6.1.9.2         ROOT        CLEANUP     SUCCESS
devices.vtdev.scsi.rte      6.1.9.2         USR         APPLY       SUCCESS
devices.vtdev.scsi.rte      6.1.9.2         ROOT        APPLY       SUCCESS
[...]
This manifested itself in the fact that no IOS command would work properly anymore:
IBM Virtual I/O Server

login: padmin
padmin's Password:
Last login: Sun Apr 27 11:40:41 DFT 2014 on /dev/vty0

Access to run command is not valid.

[vios1-p550-b1] /home/padmin $ license -accept
Access to run command is not valid.

[vios2-p550-b1] /home/padmin $ lsmap -all
Access to run command is not valid.

[vios2-p550-b1] /home/padmin $
A quick check with a login to the VIOS as root via SSH confirmed that the “root” part of the package ios.cli.rte had been rolled back entirely:
root@vios2-p550-b1:/$ lppchk -v
lppchk:  The following filesets need to be installed or corrected to bring
         the system to a consistent state:

  ios.cli.rte 6.1.9.2          (usr: APPLIED, root: not installed)
To fix this issue, it worked for me to just run the “ios.cli.rte.pre_u” script of the ios.cli.rte package manually, in order to redefine the rolled-back RBAC authorizations:
root@vios2-p550-b1:/$ SAVEDIR=/tmp/ /usr/lpp/ios.cli/ios.cli.rte/6.1.9.2/inst_root/ios.cli.rte.pre_u
root@vios2-p550-b1:/$ rm /tmp/org_*
For future reference: make sure you have a valid “xmdaily” entry in /etc/inittab before attempting to update any VIOS system:
$ lsitab xmdaily; echo $?
xmdaily:2:once:/usr/bin/topasrec -L -s 300 -R 1 -r 6 -o /home/ios/perf/topas/ -ypersistent=1 2>&1 >/dev/null
0
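To automate that, a pre-update check along the following lines could be used – just a sketch, assuming the default topasrec recording parameters shown above:

#!/bin/ksh
# Sketch: restore the "xmdaily" inittab entry if it is missing, so that
# the ios.cli.rte.post_u script won't fail the next VIOS update.
ENTRY='xmdaily:2:once:/usr/bin/topasrec -L -s 300 -R 1 -r 6 -o /home/ios/perf/topas/ -ypersistent=1 2>&1 >/dev/null'
if /usr/sbin/lsitab xmdaily >/dev/null 2>&1; then
    echo "xmdaily inittab entry present, nothing to do."
else
    /usr/sbin/mkitab "$ENTRY" && echo "xmdaily inittab entry restored."
fi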
2014-04-26 // HMC Update to 7.7.8.0 SP1
Updating the HMC from v7.7.8.0 to v7.7.8.0 SP1 was – like the previous HMC Update to 7.7.6.0 SP1 and HMC Update to 7.7.7.0 SP2 – once again very painless. The service pack MH01397 was easily installable from the ISO images via the HMC GUI.
The service pack and the additional efixes showed the following output during the update process:
MH01397:
Management console corrective service installation in progress. Please wait...
Corrective service file offload from remote server in progress...
The corrective service file offload was successful.
Continuing with HMC service installation...
Verifying Certificate Information
Authenticating Install Packages
Installing Packages
--- Installing ptf-req ....
--- Installing RSCT ....
src-3.1.4.9-13275
rsct.core.utils-3.1.4.9-13275
rsct.core-3.1.4.9-13275
rsct.service-3.5.0.0-1
rsct.basic-3.1.4.9-13275
--- Installing CSM ....
csm.core-1.7.1.20-1
csm.deploy-1.7.1.20-1
csm_hmc.server-1.7.1.20-1
csm_hmc.hdwr_svr-7.0-3.4.0
csm_hmc.client-1.7.1.20-1
csm.server.hsc-1.7.1.20-1
--- Installing LPARCMD ....
hsc.lparcmd-3.3.0.1-1
ln: creating symbolic link `/usr/hmcrbin/lsnodeid': File exists
ln: creating symbolic link `/usr/hmcrbin/lsrsrc-api': File exists
ln: creating symbolic link `/usr/hmcrbin/mkrsrc-api': File exists
ln: creating symbolic link `/usr/hmcrbin/rmrsrc-api': File exists
--- Installing InventoryScout ....
--- Installing Pegasus ....
--- Updating baseOS ....
cp: cannot stat `.dev': No such file or directory
cp: cannot stat `.image.updates': No such file or directory
PreInstalling HMC REST Web Services ...
Installing HMC REST Web Services ...
/dump/hsc_install.images/images/./installK2Payloads.sh: line 84: K2Payloads.txt: No such file or directory
Corrective service installation was successful.
MH01416:
Management console corrective service installation in progress. Please wait...
Corrective service file offload from remote server in progress...
The corrective service file offload was successful.
Continuing with HMC service installation...
Verifying Certificate Information
Authenticating Install Packages
Installing Packages
--- Installing ptf-req ....
Corrective service installation was successful.
MH01423:
Management console corrective service installation in progress. Please wait...
Corrective service file offload from remote server in progress...
The corrective service file offload was successful.
Continuing with HMC service installation...
Verifying Certificate Information
Authenticating Install Packages
Installing Packages
--- Installing ptf-req ....
Verifying archive integrity... All good.
Uncompressing Updating from 2014-02-17 to 2014-04-08.
...................................................................................................
.........................................................................................................................
... .. ........ .. .. .. .. .. ........ ... .. .. .. .. ........ .. .. ... . ... . .. .. . ......... . . . . .
Updating to hmc61_1-mcp-2014-04-08-203505...
upgrading distro_id...
Preparing... ################################################## distro_id ######### ################################### # # ## ##
Preparing... ########## ########## ########## ########## ########## openssl # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # ###### #
binutils_static # # # # # # # # # # # ### # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
distro_id ######### ################################## # # ### ##
openssl-fips ## # ##### ###### ###### ##### ###### ###### ###### ##### ##
Updating distribution id...
Updating manifest...
Corrective service installation was successful.
The MH01397 update still shows the error messages with regard to symlink creation during the update process, which can be safely ignored. The error message “cp: cannot stat `.dev': No such file or directory” is also becoming an old friend; it originates from the shell script /images/installImages inside the MH01397 installation ISO image, where non-existent files are attempted to be copied. The newly introduced error message “cp: cannot stat `.image.updates': No such file or directory” falls into the same category.
The error message “/dump/hsc_install.images/images/./installK2Payloads.sh: line 84: K2Payloads.txt: No such file or directory” seems to be more serious. Tracing down the call order, from /images/installImages inside the MH01397 installation ISO image:
- /images/installImages
448
449 if [ -f $image/installK2Payloads.sh ]
450 then
451     echo "Installing HMC REST Web Services ..."
452     $image/./installK2Payloads.sh $image
453 fi
454
to:
- /images/installK2Payloads.sh
  4
  5 LogFile=/tmp/K2Install.log
  6 IndexFile=K2Payloads.txt
  7
[...]
 68
 69 # Install the K2 rpms if K2 directory exists
 70 if [ -d $image/K2 ]; then
 71     cd $image/K2
 72
 73     while read line
 74     do
 75         rpmToInstall=`echo $line | cut -d '/' -f 4`
 76         rpmToUpdate=`echo $line | cut -d '/' -f 4 | cut -d '-' -f 1`
 77         /bin/rpm -q $rpmToUpdate
 78         if [ $? -eq 0 ]; then
 79             /bin/rpm -vvh -U $rpmToInstall >> $LogFile 2>&1
 80         else
 81             /bin/rpm -vv -i $rpmToInstall >> $LogFile 2>&1
 82         fi
 83     done < $IndexFile
 84 fi
 85
it seems the file K2Payloads.txt is missing from the “K2” directory containing the RPM files for the HMC REST web services:
$ mount -o loop HMC_Update_V7R780_SP1.iso /mnt
$ ls -al /mnt/images/K2/
total 251394
dr-xr-xr-x 2 root root     4096 Mar  5 19:57 .
dr-xr-xr-x 8 root root     2048 Mar  5 19:57 ..
-r--r--r-- 1 root root       30 Feb 18 03:52 k2.version
-r--r--r-- 1 root root    13398 Feb 18 04:12 pmc.core-7.7.8.1-20140217T2012.i386.rpm
-r--r--r-- 1 root root 14242627 Feb 18 04:00 pmc.pcm.rest-7.7.8.1-20140217T2019.i386.rpm
-r--r--r-- 1 root root 33865936 Feb 18 03:58 pmc.soliddb-7.7.8.1-20140217T2012.i386.rpm
-r--r--r-- 1 root root    15375 Feb 18 04:01 pmc.soliddb.pcm.sql-7.7.8.1-20140217T2020.i386.rpm
-r--r--r-- 1 root root    46559 Feb 18 04:10 pmc.soliddb.rest.sql-7.7.8.1-20140217T2019.i386.rpm
-r--r--r-- 1 root root 54565876 Feb 18 03:59 pmc.ui.developer-7.7.8.1-20140217T2021.i386.rpm
-r--r--r-- 1 root root 74508025 Feb 18 03:57 pmc.war.rest-7.7.8.1-20140217T2013.i386.rpm
-r--r--r-- 1 root root 76751466 Feb 18 03:58 pmc.wlp-7.7.8.1-20140217T2012.i386.rpm
-r--r--r-- 1 root root   478438 Feb 18 04:08 pmc.wlp.commons-7.7.8.1-20140217T2013.i386.rpm
-r--r--r-- 1 root root  1695200 Feb 18 04:08 pmc.wlp.guava-7.7.8.1-20140217T2019.i386.rpm
-r--r--r-- 1 root root   109789 Feb 18 04:09 pmc.wlp.jaxb2.runtime-7.7.8.1-20140217T2019.i386.rpm
-r--r--r-- 1 root root   446974 Feb 18 04:08 pmc.wlp.log4j-7.7.8.1-20140217T2019.i386.rpm
-r--r--r-- 1 root root   428041 Feb 18 04:09 pmc.wlp.quartz-7.7.8.1-20140217T2020.i386.rpm
-r--r--r-- 1 root root    29861 Feb 18 04:09 pmc.wlp.slf4j.api-7.7.8.1-20140217T2013.i386.rpm
-r--r--r-- 1 root root   219413 Feb 18 03:58 pmc.wlp.soliddriver-7.7.8.1-20140217T2012.i386.rpm
Without privileged access to the HMC's OS it's hard to verify, but it appears that the updates to the HMC REST web services have not been installed. So I guess I'll be opening up a PMR about that now.
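As a side note, the cut -d '/' -f 4 calls in installK2Payloads.sh suggest that the missing index file presumably contains RPM paths relative to the ISO root, i.e. lines whose fourth '/'-separated field is the plain RPM file name. A quick simulation of the script's parsing (hypothetical line format, untested):

$ echo "./images/K2/pmc.core-7.7.8.1-20140217T2012.i386.rpm" | cut -d '/' -f 4
pmc.core-7.7.8.1-20140217T2012.i386.rpm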
The weird “.” and “#” characters in the output of the MH01423 update can also safely be ignored. They're just unhandled output from the update script and the RPM commands called therein. To take a look for yourself:
$ mkdir /tmp/hmc && cd /tmp/hmc
$ mount -o loop MH01423.iso /mnt
$ /mnt/images/update-hmc61_1-20140217-20140408.sh --tar xvf
$ ls -al
total 4100
drwxr-xr-x  3 root root    4096 Apr 26 23:07 .
drwxrwxrwt 35 root root   24576 Apr 26 23:07 ..
-rw-rw-r--  1 root root 2524328 Jun 19  2010 binutils_static-2.20.0-21.i686.rpm
drwxr-xr-x  2 root root   20480 Apr  9 17:48 DEPENDS
-rw-rw-r--  1 root root   24564 Aug 23  2012 distro_id-1.0-9.i686.rpm
-rwxr-xr-x  1 root root   56125 Apr  9 17:49 hmc61_1-mcp-2014-04-08-203505-manifest.txt
-rw-rw-r--  1 root root 1316850 Apr  8 22:56 openssl-1.0.1e-20.i686.rpm
-rw-rw-r--  1 root root  198060 Apr  8 22:56 openssl-fips-1.0.1e-20.i686.rpm
-rwxr-xr-x  1 root root    3462 Apr  9 17:49 setup.sh
$ less setup.sh
In the file setup.sh, look for the RPM install (“rpm -ivh”) or update (“rpm -Uvh”) calls, which are explicitly done with the verbose and hash options. This seems to be a bad choice, because the WebUI is apparently unable to handle the output produced by the RPM commands. It's certainly not pretty to have that in the output of the update process, where it could be cause for some confusion. But in the wake of the OpenSSL Heartbleed bug, which MH01423 addresses, a quick and dirty solution probably seemed more appropriate.
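A quieter variant – just a sketch, not what IBM actually ships – would drop the verbose and hash flags and redirect the RPM output to a log file instead:

# instead of e.g.:
#   rpm -Uvh openssl-1.0.1e-20.i686.rpm
# something like this keeps the WebUI output clean:
rpm -U openssl-1.0.1e-20.i686.rpm >> /tmp/setup.log 2>&1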
More importantly though, while going through the HMC WebUI and checking for any differences and functional breakage, I stumbled upon a long-missed feature that IBM apparently had already sneaked in with the previous HMC version 7.7.8.0: the “Sync current configuration” capability. And without me even noticing! Finally, no more bad surprises after an LPAR shutdown and re-activation!
2014-01-20 // HMC Update to 7.7.8.0
Like before with HMC Update to 7.7.5.0 and HMC Update to 7.7.7.0 SP1, the recent HMC update to v7.7.8.0 was again not installable directly from the ISO images via the HMC GUI. Along with the HMC network installation images, which are now mentioned in the release notes, there is now also an official documentation of the update procedure using the HMC network installation images. It's called “HMC network installation” and provides a more remote-admin-friendly way of performing the update. Since it's only a slightly shortened version of the procedure I had already tested and used in HMC Update to 7.7.7.0 SP1, I decided to stick with my own procedure.
Another turn for the better is that the release notes as well as FixCentral now clearly point out the dependencies between the fixpacks that are supposed to go on top of the update release, and the order they are supposed to be applied in. In the case of MH01377 (aka V7R7.8.0.0) these are MH01388 (aka "Required fix for HMC V7R7.8.0 (11-25-2013)") and MH01396 (aka "Fix for HMC V7R7.8.0 (12-10-2013)").
Compared to earlier updates, the restriction of having to shut down or disconnect the second HMC in a dual-HMC setup has been weakened to:
When two HMCs manage the same server, both HMCs must be at the same version. Once the server is connected to the higher version of the management console, the partition configuration is migrated to the latest version. Lower management consoles will not be able to understand the data properly. […]
To me this reads “ensure that you don't do any configuration from the second HMC while the first HMC has already been updated”, which still is some kind of restriction, but a far less intrusive and thus a much more manageable one compared to before.
As always, be sure to study the release notes thoroughly before an update attempt. Depending on your environment and HMC hardware there might be a road block in there. The document “HMC Version 7 Release 7.8.0 Upgrade sizing”, mentioned in the release notes of MH01377, deserves special attention.
For me, the good news gathered from the release notes was:
- Power Enterprise Pools to share CoD resources among the machines in your environment without the need of a special RPQ (chcodpool, lscodpool and mkcodpool). I still dare to dream that some day IBM will allow a fully dynamic sharing, even of non-CoD resources for us mere mortals.
- LPM evacuation to support the migration of all LPARs off a system within a single command, much like the VMware maintenance mode (migrlpar […] --all; see the sketch after this list).
- Regeneration and XML export of the hypervisor runtime configuration (mkprofdata).
- Refresh of OS level information on the HMC in case the OS within the LPAR was updated (lssyscfg -r lpar -m <managed system> --osrefresh).
As already mentioned above, I used the update procedure described earlier in HMC Update to 7.7.7.0 SP1. In my setup this worked well for all HMCs and showed no fatal or noticeable errors. It seems others might not have been so lucky. I did the update on 2014/01/17 and checked back on FixCentral on 2014/01/19. By then the following warning messages had been put up:
So it's probably best to hold off on the update for now, if it hasn't already been done. For reference purposes, here are some example screen shots from a KVM session to the HMC during an – eventually successful – update to MH01377:
After the upgrade to V7R7.8.0.0 is complete, you can apply the additional efixes in the usual way via the HMC GUI. For me the additional efixes showed the following output during the update process:
MH01388:
Management console corrective service installation in progress. Please wait...
Corrective service file offload from remote server in progress...
The corrective service file offload was successful.
Continuing with HMC service installation...
Verifying Certificate Information
Authenticating Install Packages
Installing Packages
--- Installing ptf-req ....
--- Installing RSCT ....
src-3.1.4.9-13275
rsct.core.utils-3.1.4.9-13275
rsct.core-3.1.4.9-13275
rsct.service-3.5.0.0-1
rsct.basic-3.1.4.9-13275
--- Installing CSM ....
csm.core-1.7.1.20-1
csm.deploy-1.7.1.20-1
csm_hmc.server-1.7.1.20-1
csm_hmc.hdwr_svr-7.0-3.4.0
csm_hmc.client-1.7.1.20-1
csm.server.hsc-1.7.1.20-1
--- Installing LPARCMD ....
hsc.lparcmd-3.3.0.1-1
ln: creating symbolic link `/usr/hmcrbin/lsnodeid': File exists
ln: creating symbolic link `/usr/hmcrbin/lsrsrc-api': File exists
ln: creating symbolic link `/usr/hmcrbin/mkrsrc-api': File exists
ln: creating symbolic link `/usr/hmcrbin/rmrsrc-api': File exists
--- Installing InventoryScout ....
--- Installing Pegasus ....
--- Installing service documentation ....
cp: cannot stat `.dev': No such file or directory
PreInstalling HMC REST Web Services ...
Installing HMC REST Web Services ...
pmc.core-7.7.8.0-20131027T1102
pmc.soliddb-7.7.8.0-20131027T1102
pmc.wlp-7.7.8.0-20131027T1102
pmc.wlp.soliddriver-7.7.8.0-20131027T1102
pmc.wlp.log4j-7.7.8.0-20131027T1110
pmc.wlp.guava-7.7.8.0-20131027T1110
pmc.wlp.jaxb2.runtime-7.7.8.0-20131027T1111
pmc.wlp.slf4j.api-7.7.8.0-20131027T1103
pmc.wlp.quartz-7.7.8.0-20131027T1111
pmc.wlp.commons-7.7.8.0-20131027T1103
pmc.war.rest-7.7.8.0-20131027T1103
pmc.soliddb.rest.sql-7.7.8.0-20131027T1110
pmc.soliddb.pcm.sql-7.7.8.0-20131027T1111
pmc.pcm.rest-7.7.8.0-20131027T1110
pmc.ui.developer-7.7.8.0-20131027T1112
Corrective service installation was successful.
MH01396:
Management console corrective service installation in progress. Please wait...
Corrective service file offload from remote server in progress...
The corrective service file offload was successful.
Continuing with HMC service installation...
Verifying Certificate Information
Authenticating Install Packages
Installing Packages
--- Installing ptf-req ....
PreInstalling HMC REST Web Services ...
Installing HMC REST Web Services ...
pmc.core-7.7.8.0-20131124T1024
pmc.soliddb-7.7.8.0-20131124T1024
pmc.wlp-7.7.8.0-20131124T1024
pmc.wlp.soliddriver-7.7.8.0-20131124T1024
pmc.wlp.log4j-7.7.8.0-20131124T1038
pmc.wlp.guava-7.7.8.0-20131124T1038
pmc.wlp.jaxb2.runtime-7.7.8.0-20131124T1038
pmc.wlp.slf4j.api-7.7.8.0-20131124T1025
pmc.wlp.quartz-7.7.8.0-20131124T1039
pmc.wlp.commons-7.7.8.0-20131124T1025
pmc.war.rest-7.7.8.0-20131124T1025
pmc.soliddb.rest.sql-7.7.8.0-20131124T1038
pmc.soliddb.pcm.sql-7.7.8.0-20131124T1039
pmc.pcm.rest-7.7.8.0-20131124T1038
pmc.ui.developer-7.7.8.0-20131124T1039
Corrective service installation was successful.
The MH01388 update still shows error messages with regard to symlink creation during the update process, so I guess my DCR MR0809134336 went straight into the circular file. The newly introduced error message “cp: cannot stat `.dev': No such file or directory” probably also originates from the shell script /images/installImages inside the MH01388 installation ISO image:
- /images/installImages
380 cp -p finishUpdate /console/HSC/
381 cp -p postinstall /console/HSC/
382 cp -p .VERSION /console/HSC/.VERSION
383 cp -p .dev /console/HSC/.dev
384
385 cp -p -r baseHMC /console/HSC/
which is trying to copy the file .dev that doesn't exist on the MH01388 ISO image:
$ mount -o loop MH01388.iso /mnt
$ ls -al /mnt/images/
total 121
dr-xr-xr-x 8 root root  4096 Nov 25 17:32 .
dr-xr-xr-x 3 root root  2048 Nov 22 23:50 ..
dr-xr-xr-x 2 root root 10240 Nov 25 17:32 baseHMC
-r-xr-xr-x 1 root root 40644 Nov  5 22:45 finishUpdate
-r--r--r-- 1 root root  2035 Nov 25 17:30 IBMhmc.MH01388_d1-7.0-7.8.0.i386.rpm
-r--r--r-- 1 root root     9 Oct 29 01:45 .image.updates
dr-xr-xr-x 2 root root  2048 Nov 22 23:47 info
-r-xr-xr-x 1 root root 15397 Nov  6 18:56 installImages
-r-xr-xr-x 1 root root  2644 Oct 29 22:05 installK2Payloads.sh
-r--r--r-- 1 root root    28 Oct 29 01:45 inventory
dr-xr-xr-x 2 root root  4096 Nov 25 17:30 K2
dr-xr-xr-x 2 root root  2048 Nov 25 17:32 pegasus
-r-xr-xr-x 1 root root 10805 Apr 16  2013 postinstall
-r-xr-xr-x 1 root root  1520 Nov  7 17:30 preInstallK2Payloads.sh
dr-xr-xr-x 2 root root  4096 Nov 25 17:32 rmc
dr-xr-xr-x 2 root root  2048 Nov 25 17:32 service
-r--r--r-- 1 root root    17 Oct 29 01:45 .signature
-r--r--r-- 2 root root    37 Nov 25 17:32 .VERSION
So either someone forgot to remove the unnecessary copy command from the install script, or the file .dev that is supposed to exist was forgotten during creation of the ISO image. Either way, let's hope the file .dev doesn't contain any vital information. Aside from that, no issues with the new HMC version so far.
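If it's the former, a defensive variant of the copy commands in installImages – merely a sketch of what a fix could look like – would simply guard against missing files:

# copy the optional metadata files only if they actually exist on the ISO:
for f in .VERSION .dev; do
    [ -f "$f" ] && cp -p "$f" /console/HSC/"$f"
done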
2014-01-14 // Nagios Monitoring - Knürr / Emerson CoolLoop
We use Knürr (now Emerson) CoolLoop units to chill the 19″ equipment in the datacenters. The CoolLoop units come with their own CoolCon controller units for management and monitoring purposes. The CoolCon controllers can – similar to the Rittal CMC-TC – be queried via SNMP for the status and values of the environmental sensors. The kind and number of sensors depends on the specific configuration that was ordered. It ranges from simple fan, air temperature and water valve sensors in the basic setup, to humidity, additional temperature, water flow, water temperature and electrical current, voltage and energy sensors in the extended setup. To monitor the sensor status and environmental values provided by the CoolCon controller, I wrote the Nagios plugin check_knuerr_coolcon.sh. In order to run the Nagios plugin, you need to have SNMP activated on the CoolCon controller unit, and a network connection from the Nagios system to the CoolCon controller unit on port UDP/161 must be allowed.
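A quick way to verify both prerequisites at once is a manual query from the Nagios host, e.g. with the hypothetical host name coolcon1 and the – here assumed – public community:

$ snmpwalk -v 2c -c public coolcon1 system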
The whole setup for monitoring Knürr / Emerson CoolLoop - and possibly, but untested, also CoolTherm - units with Nagios looks like this:
1. Enable SNMP queries on the CoolCon controller unit. Verify the port UDP/161 on the CoolCon controller unit can be reached from the Nagios system.
2. Optional: Enable SNMP traps to be sent to the Nagios system on the CoolCon controller unit. This requires SNMPD and SNMPTT to be already set up on the Nagios system. Verify the port UDP/162 on the Nagios system can be reached from the CoolCon controller unit.
3. Download the Nagios plugin check_knuerr_coolcon.sh and place it in the plugins directory of your Nagios system, in this example /usr/lib/nagios/plugins/:

$ mv -i check_knuerr_coolcon.sh /usr/lib/nagios/plugins/
$ chmod 755 /usr/lib/nagios/plugins/check_knuerr_coolcon.sh
4. Define the following Nagios commands. In this example this is done in the file /etc/nagios-plugins/config/check_coolcon.cfg:

# check Knuerr CoolCon/CoolLoop energy status
define command {
    command_name    check_coolcon_energy
    command_line    $USER1$/check_knuerr_coolcon.sh -H $HOSTNAME$ -C energy
}

# check Knuerr CoolCon/CoolLoop fan status
define command {
    command_name    check_coolcon_fan
    command_line    $USER1$/check_knuerr_coolcon.sh -H $HOSTNAME$ -C fan
}

# check Knuerr CoolCon/CoolLoop humidity status
define command {
    command_name    check_coolcon_humidity
    command_line    $USER1$/check_knuerr_coolcon.sh -H $HOSTNAME$ -C humidity
}

# check Knuerr CoolCon/CoolLoop temperature status
define command {
    command_name    check_coolcon_temperature
    command_line    $USER1$/check_knuerr_coolcon.sh -H $HOSTNAME$ -C temperature
}

# check Knuerr CoolCon/CoolLoop valve status
define command {
    command_name    check_coolcon_valve
    command_line    $USER1$/check_knuerr_coolcon.sh -H $HOSTNAME$ -C valve
}

# check Knuerr CoolCon/CoolLoop waterflow status
define command {
    command_name    check_coolcon_waterflow
    command_line    $USER1$/check_knuerr_coolcon.sh -H $HOSTNAME$ -C waterflow
}

# check Knuerr CoolCon/CoolLoop watertemperature status
define command {
    command_name    check_coolcon_watertemperature
    command_line    $USER1$/check_knuerr_coolcon.sh -H $HOSTNAME$ -C watertemperature
}
5. Define a group of services in your Nagios configuration to be checked for each CoolLoop system:

# check snmpd
define service {
    use                     generic-service
    hostgroup_name          coolcon
    service_description     Check_SNMPDv2
    check_command           check_snmpdv2
}

# check_coolcon_energy
define service {
    use                     generic-service-pnp
    hostgroup_name          coolcon
    service_description     Check_CoolCon_Energy
    check_command           check_coolcon_energy
}

# check_coolcon_fan
define service {
    use                     generic-service-pnp
    hostgroup_name          coolcon
    service_description     Check_CoolCon_Fan
    check_command           check_coolcon_fan
}

# check_coolcon_humidity
define service {
    use                     generic-service-pnp
    hostgroup_name          coolcon
    service_description     Check_CoolCon_Humidity
    check_command           check_coolcon_humidity
}

# check_coolcon_temperature
define service {
    use                     generic-service-pnp
    hostgroup_name          coolcon
    service_description     Check_CoolCon_Temp
    check_command           check_coolcon_temperature
}

# check_coolcon_valve
define service {
    use                     generic-service-pnp
    hostgroup_name          coolcon
    service_description     Check_CoolCon_Valve
    check_command           check_coolcon_valve
}

# check_coolcon_waterflow
define service {
    use                     generic-service-pnp
    hostgroup_name          coolcon
    service_description     Check_CoolCon_Waterflow
    check_command           check_coolcon_waterflow
}

# check_coolcon_watertemperature
define service {
    use                     generic-service-pnp
    hostgroup_name          coolcon
    service_description     Check_CoolCon_WaterTemp
    check_command           check_coolcon_watertemperature
}
Replace generic-service with your Nagios service template. Replace generic-service-pnp with your Nagios service template that has performance data processing enabled.

6. Define a service dependency to run the above checks only if the Check_SNMPDv2 check was run successfully:

# Knuerr CoolCon SNMPD dependencies
define servicedependency {
    hostgroup_name                  coolcon
    service_description             Check_SNMPDv2
    dependent_service_description   Check_CoolCon_.*
    execution_failure_criteria      c,p,u,w
    notification_failure_criteria   c,p,u,w
}
7. Define hosts in your Nagios configuration for each CoolLoop device. In this example it's named coolcon1:

define host {
    use         coolcon
    host_name   coolcon1
    alias       Knuerr CoolCon CoolLoop 1
    address     10.0.0.1
    parents     parent_lan
}
Replace coolcon with your Nagios host template for the CoolCon controller units. Adjust the address and parents parameters according to your environment.

8. Define a hostgroup in your Nagios configuration for all CoolLoop devices. In this example it is named coolcon. The above checks are run against each member of the hostgroup:

define hostgroup {
    hostgroup_name  coolcon
    alias           Knuerr CoolCon/CoolLoop
    members         coolcon1
}
9. Run a configuration check and, if successful, reload the Nagios process:

$ /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg
$ /etc/init.d/nagios3 reload
The new hosts and services should soon show up in the Nagios web interface.
If the optional step number 2 in the above list was done, SNMPTT also needs to be configured to be able to understand the incoming SNMP traps from CoolCon controller units. This can be achieved by the following steps:
1. Request a current version of the CoolCon SNMP MIB file from Knürr / Emerson. In this example it's 080104140000010a_KNUERR-COOLCON-MIB-V10.mib. Transfer the file 080104140000010a_KNUERR-COOLCON-MIB-V10.mib to the Nagios server.

2. Convert the SNMP MIB definitions in 080104140000010a_KNUERR-COOLCON-MIB-V10.mib into a format that SNMPTT can understand:

$ /opt/snmptt/snmpttconvertmib --in=MIB/080104140000010a_KNUERR-COOLCON-MIB-V10.mib --out=/opt/snmptt/conf/snmptt.conf.knuerr-coolcon
...
Done

Total translations:      201
Successful translations: 201
Failed translations:     0
3. Edit the trap severity according to your requirements, e.g.:

$ vim /opt/snmptt/conf/snmptt.conf.knuerr-coolcon
...
EVENT fans .1.3.6.1.4.1.2769.2.1.5.0.1 "Status Events" Warning
...
4. Add the new configuration file to be included in the global SNMPTT configuration and restart the SNMPTT daemon (a synthetic test trap example can be found at the end of this post):

$ vim /opt/snmptt/snmptt.ini
...
[TrapFiles]
snmptt_conf_files = <<END
...
/opt/snmptt/conf/snmptt.conf.knuerr-coolcon
...
END

$ /etc/init.d/snmptt reload
5. Download the Nagios plugin check_snmp_traps.sh and place it in the plugins directory of your Nagios system, in this example /usr/lib/nagios/plugins/:

$ mv -i check_snmp_traps.sh /usr/lib/nagios/plugins/
$ chmod 755 /usr/lib/nagios/plugins/check_snmp_traps.sh
6. Define the following Nagios command to check for SNMP traps in the SNMPTT database. In this example this is done in the file /etc/nagios-plugins/config/check_snmp_traps.cfg:

# check for snmp traps
define command {
    command_name    check_snmp_traps
    command_line    $USER1$/check_snmp_traps.sh -H $HOSTNAME$:$HOSTADDRESS$ -u <user> -p <pass> -d <snmptt_db>
}

Replace <user>, <pass> and <snmptt_db> with values suitable for your SNMPTT database environment.

7. Add another service in your Nagios configuration to be checked for each CoolLoop device:

# check snmptraps
define service {
    use                     generic-service
    hostgroup_name          coolcon
    service_description     Check_SNMP_traps
    check_command           check_snmp_traps
}
8. Optional: Define a serviceextinfo to display a folder icon next to the Check_SNMP_traps service check for each CoolLoop device. This icon provides a direct link to the SNMPTT web interface with a filter for the selected host:

define serviceextinfo {
    hostgroup_name          coolcon
    service_description     Check_SNMP_traps
    notes                   SNMP Alerts
#   notes_url               http://<hostname>/nagios3/nagtrap/index.php?hostname=$HOSTNAME$
#   notes_url               http://<hostname>/nagios3/nsti/index.php?perpage=100&hostname=$HOSTNAME$
}

Uncomment the notes_url depending on which web interface (nagtrap or nsti) is used. Replace <hostname> with the FQDN or IP address of the server running the web interface.

9. Run a configuration check and, if successful, reload the Nagios process:

$ /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg
$ /etc/init.d/nagios3 reload
10. Optional: If you're running PNP4Nagios v0.6 or later to graph Nagios performance data, you can use the PNP4Nagios templates in pnp4nagios_coolcon.tar.bz2 to beautify the graphs. Download pnp4nagios_coolcon.tar.bz2 and place the templates in the PNP4Nagios template directory, in this example /usr/share/pnp4nagios/html/templates/:

$ tar jxf pnp4nagios_coolcon.tar.bz2
$ mv -i check_coolcon_*.php /usr/share/pnp4nagios/html/templates/
$ chmod 644 /usr/share/pnp4nagios/html/templates/check_coolcon_*.php
The following image shows an example of what the PNP4Nagios graphs look like for a CoolLoop unit:
All done, you should now have a complete Nagios-based monitoring solution for your Knürr / Emerson CoolLoop systems.
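By the way: to verify the whole trap chain from step 4 without waiting for a real event, a synthetic trap matching the fans event from step 3 can be sent from the Nagios host itself. This is just a sketch – it assumes snmptrapd on the Nagios host listens on localhost and accepts the public community:

# send an SNMPv1 enterprise-specific trap (generic 6, specific 1) for
# enterprise .1.3.6.1.4.1.2769.2.1.5, spoofing coolcon1 as the agent:
$ snmptrap -v 1 -c public localhost .1.3.6.1.4.1.2769.2.1.5 coolcon1 6 1 ''

The trap should subsequently show up in the SNMPTT database and be picked up by the Check_SNMP_traps service check.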
2014-01-12 // Nagios Monitoring - Infortrend EonStor
We use several Infortrend EonStor and EonStor DS storage arrays as low-cost bulk storage units in our datacenters. With check_infortrend.pl, check_infortrend and check_ift_{dev|hdd|ld}.pl there are already several Nagios plugins to monitor Infortrend EonStor storage arrays. Since I wanted a low-overhead, shell-based plugin with support for performance data, I decided to write my own version, check_infortrend.sh. In order to run the Nagios plugin, you need to have SNMP activated on the Infortrend storage array, and a network connection from the Nagios system to the Infortrend device on port UDP/161 must be allowed.
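As with the CoolCon units above, a manual query from the Nagios host verifies SNMP and the UDP/161 connectivity in one go – esds1 being a hypothetical host name and public an assumed community:

$ snmpget -v 2c -c public esds1 sysDescr.0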
The whole setup looks like this:
1. Enable SNMP queries on the Infortrend storage array. Log in via Telnet or SSH and navigate to:

-> view and edit Configuration parameters
  -> Communication Parameters
    -> Network Protocol Support
      -> SNMP - Disabled
        -> Enable SNMP Protocol? -> Yes

Verify the port UDP/161 on the Infortrend device can be reached from the Nagios system.
2. Optional: Enable SNMP traps to be sent to the Nagios system on the Infortrend storage array. This requires SNMPD and SNMPTT to be already set up on the Nagios system. Verify the port UDP/162 on the Nagios system can be reached from the Infortrend device.
3. Download the Nagios plugin check_infortrend.sh and place it in the plugins directory of your Nagios system, in this example /usr/lib/nagios/plugins/:

$ mv -i check_infortrend.sh /usr/lib/nagios/plugins/
$ chmod 755 /usr/lib/nagios/plugins/check_infortrend.sh
4. Define the following Nagios commands. In this example this is done in the file /etc/nagios-plugins/config/check_infortrend.cfg:

# check Infortrend ESDS cache status
define command {
    command_name    check_infortrend_cache
    command_line    $USER1$/check_infortrend.sh -H $HOSTNAME$ -C cache
}

# check Infortrend ESDS controller status
define command {
    command_name    check_infortrend_controller
    command_line    $USER1$/check_infortrend.sh -H $HOSTNAME$ -C controller
}

# check Infortrend ESDS disk status
define command {
    command_name    check_infortrend_disk
    command_line    $USER1$/check_infortrend.sh -H $HOSTNAME$ -C disk
}

# check Infortrend ESDS logicaldrive status
define command {
    command_name    check_infortrend_logicaldrive
    command_line    $USER1$/check_infortrend.sh -H $HOSTNAME$ -C logicaldrive
}

# check Infortrend ESDS logicalunit status
define command {
    command_name    check_infortrend_logicalunit
    command_line    $USER1$/check_infortrend.sh -H $HOSTNAME$ -C logicalunit
}

# check Infortrend ESDS event status
define command {
    command_name    check_infortrend_events
    command_line    $USER1$/check_infortrend.sh -H $HOSTNAME$ -C events
}
5. Define a group of services in your Nagios configuration to be checked for each Infortrend system:

# check snmpd
define service {
    use                     generic-service
    hostgroup_name          infortrend
    service_description     Check_SNMPD
    check_command           check_snmpd
}

# check_infortrend_cache
define service {
    use                     generic-service-pnp
    hostgroup_name          infortrend
    service_description     Check_IFT_Cache
    check_command           check_infortrend_cache
}

# check_infortrend_controller
define service {
    use                     generic-service
    hostgroup_name          infortrend
    service_description     Check_IFT_Controller
    check_command           check_infortrend_controller
}

# check_infortrend_disk
define service {
    use                     generic-service-pnp
    hostgroup_name          infortrend
    service_description     Check_IFT_Disk
    check_command           check_infortrend_disk
}

# check_infortrend_logicaldrive
define service {
    use                     generic-service-pnp
    hostgroup_name          infortrend
    service_description     Check_IFT_LogicalDrive
    check_command           check_infortrend_logicaldrive
}

# check_infortrend_logicalunit
define service {
    use                     generic-service-pnp
    hostgroup_name          infortrend
    service_description     Check_IFT_LogicalUnit
    check_command           check_infortrend_logicalunit
}

# check_infortrend_events
define service {
    use                     generic-service
    hostgroup_name          infortrend
    service_description     Check_IFT_Events
    check_command           check_infortrend_events
}
Replace generic-service with your Nagios service template. Replace generic-service-pnp with your Nagios service template that has performance data processing enabled.

6. Define a service dependency to run the above checks only if the Check_SNMPD check was run successfully:

# Infortrend SNMPD dependencies
define servicedependency {
    hostgroup_name                  infortrend
    service_description             Check_SNMPD
    dependent_service_description   Check_IFT_.*
    execution_failure_criteria      c,p,u,w
    notification_failure_criteria   c,p,u,w
}
7. Define hosts in your Nagios configuration for each Infortrend device. In this example it's named esds1:

define host {
    use         disk
    host_name   esds1
    alias       Infortrend Disk Storage 1
    address     10.0.0.1
    parents     parent_lan
}
Replace disk with your Nagios host template for storage devices. Adjust the address and parents parameters according to your environment.

8. Define a hostgroup in your Nagios configuration for all Infortrend devices. In this example it is named infortrend. The above checks are run against each member of the hostgroup:

define hostgroup {
    hostgroup_name  infortrend
    alias           Infortrend Disk Storages
    members         esds1
}
9. Run a configuration check and, if successful, reload the Nagios process:

$ /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg
$ /etc/init.d/nagios3 reload
The new hosts and services should soon show up in the Nagios web interface.
If the optional step number 2 in the above list was done, SNMPTT also needs to be configured to be able to understand the incoming SNMP traps from Infortrend devices. This can be achieved by the following steps:
1. Request a current version of the Infortrend SNMP MIB file from Infortrend support. In this example it's IFT_MIB_v1.40A02.mib. Transfer the file IFT_MIB_v1.40A02.mib to the Nagios server.

2. Convert the SNMP MIB definitions in IFT_MIB_v1.40A02.mib into a format that SNMPTT can understand:

$ /opt/snmptt/snmpttconvertmib --in=MIB/IFT_MIB_v1.40A02.mib --out=/opt/snmptt/conf/snmptt.conf.infortrend
...
Done

Total translations:      1
Successful translations: 1
Failed translations:     0
3. Edit the trap severity according to your requirements, e.g.:

$ vim /opt/snmptt/conf/snmptt.conf.infortrend
...
EVENT iftEventText .1.3.6.1.4.1.1714.2.1.1 "Status Events" Warning
...
4. Add the new configuration file to be included in the global SNMPTT configuration and restart the SNMPTT daemon (note that the include path must match the --out location used in step 2):

$ vim /opt/snmptt/snmptt.ini
...
[TrapFiles]
snmptt_conf_files = <<END
...
/opt/snmptt/conf/snmptt.conf.infortrend
...
END

$ /etc/init.d/snmptt reload
5. Download the Nagios plugin check_snmp_traps.sh and place it in the plugins directory of your Nagios system, in this example /usr/lib/nagios/plugins/:

$ mv -i check_snmp_traps.sh /usr/lib/nagios/plugins/
$ chmod 755 /usr/lib/nagios/plugins/check_snmp_traps.sh
6. Define the following Nagios command to check for SNMP traps in the SNMPTT database. In this example this is done in the file /etc/nagios-plugins/config/check_snmp_traps.cfg:

# check for snmp traps
define command {
    command_name    check_snmp_traps
    command_line    $USER1$/check_snmp_traps.sh -H $HOSTNAME$:$HOSTADDRESS$ -u <user> -p <pass> -d <snmptt_db>
}

Replace <user>, <pass> and <snmptt_db> with values suitable for your SNMPTT database environment.

7. Add another service in your Nagios configuration to be checked for each Infortrend device:

# check snmptraps
define service {
    use                     generic-service
    hostgroup_name          infortrend
    service_description     Check_SNMP_traps
    check_command           check_snmp_traps
}
8. Optional: Define a serviceextinfo to display a folder icon next to the Check_SNMP_traps service check for each Infortrend device. This icon provides a direct link to the SNMPTT web interface with a filter for the selected host:

define serviceextinfo {
    hostgroup_name          infortrend
    service_description     Check_SNMP_traps
    notes                   SNMP Alerts
#   notes_url               http://<hostname>/nagios3/nagtrap/index.php?hostname=$HOSTNAME$
#   notes_url               http://<hostname>/nagios3/nsti/index.php?perpage=100&hostname=$HOSTNAME$
}

Uncomment the notes_url depending on which web interface (nagtrap or nsti) is used. Replace <hostname> with the FQDN or IP address of the server running the web interface.

9. Run a configuration check and, if successful, reload the Nagios process:

$ /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg
$ /etc/init.d/nagios3 reload
10. Optional: If you're running PNP4Nagios v0.6 or later to graph Nagios performance data, you can use the check_infortrend_cache.php, check_infortrend_disk.php, check_infortrend_logicaldrive.php and check_infortrend_logicalunit.php PNP4Nagios templates to beautify the graphs. Download these four templates and place them in the PNP4Nagios template directory, in this example /usr/share/pnp4nagios/html/templates/:

$ mv -i check_infortrend_*.php /usr/share/pnp4nagios/html/templates/
$ chmod 644 /usr/share/pnp4nagios/html/templates/check_infortrend_*.php
The following image shows an example of what the PNP4Nagios graphs look like for an Infortrend EonStor unit:
All done, you should now have a complete Nagios-based monitoring solution for your Infortrend EonStor systems.