2014-04-27 // Recover from a broken VIO Server Update
If you're – like me – running Ganglia to monitor the performance metrics of your AIX and VIOS systems, it can make sense to disable the now somewhat redundant default collection method for performance metrics, in order not to waste resources and to prevent unnecessary cluttering of the local VIOS filesystem. This can be done via the “xmdaily” entry in /etc/inittab:
- /etc/inittab
:xmdaily:2:once:/usr/bin/topasrec -L -s 300 -R 1 -r 6 -o /home/ios/perf/topas/ -ypersistent=1 2>&1 >/dev/null
$ lsitab xmdaily; echo $?
1
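The disabling itself is accomplished by prefixing the entry with a colon, which the AIX init treats as a comment marker. Alternatively the entry could be removed altogether, e.g.:

$ rmitab xmdaily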
Unfortunately this breaks the VIOS upgrade process, since the good folks at IBM insist on the “xmdaily” inittab entry being vital for a properly functioning VIOS system. Specifically, the failure is caused by the /usr/lpp/ios.cli/ios.cli.rte/<VERSION>/inst_root/ios.cli.rte.post_u script in the ios.cli.rte package. E.g.:
- /usr/lpp/ios.cli/ios.cli.rte/6.1.9.1/inst_root/ios.cli.rte.post_u
507 rt=`/usr/sbin/lsitab xmdaily`
508 if [[ $rt != "" ]]
[...]
528 else
529     echo "Warning: lsitab failed..."
530     exit 1
531 fi
I beg to differ on the whole subject of “xmdaily” being necessary at all, but one could file this under the category of “philosophical differences”. Failing the whole package update with the “exit 1” in line 530 of the above code sample seems a bit too harsh though.
So normally I would just put the “xmdaily” entry back into the inittab right before a VIOS update. Unfortunately, on the update to 2.2.3.1-FP27 and subsequently to 2.2.3.2-FP27-SP02, I forgot to do that on a few VIOS systems. The result of this negligence were failure messages during the update process (RBAC output omitted for better readability!):
[...]
installp: APPLYING software for:
        ios.cli.rte 6.1.9.2

. . . . . << Copyright notice for ios.cli >> . . . . . . .
 Licensed Materials - Property of IBM
 5765G3400
 Copyright International Business Machines Corp. 2004, 2014.
 All rights reserved.
 US Government Users Restricted Rights - Use, duplication or disclosure
 restricted by GSA ADP Schedule Contract with IBM Corp.
. . . . . << End of copyright notice for ios.cli >>. . . .

sysck: 3001-022 The file /usr/ios/utils/part was not found.
Start Creating VIOS Authorizations
Completed Creating VIOS Authorizations
Warning: lsitab failed...
update: Failed while executing the ios.cli.rte.post_u script.
0503-464 installp: The installation has FAILED for the "root" part
        ios.cli.rte 6.1.9.2
installp: Cleaning up software for:
        ios.cli.rte 6.1.9.2
[...]
as well as at the end of the update process:
[...]
Installation Summary
--------------------
Name                        Level           Part        Event       Result
-------------------------------------------------------------------------------
ios.cli.rte                 6.1.9.2         USR         APPLY       SUCCESS
ios.cli.rte                 6.1.9.2         ROOT        APPLY       FAILED
ios.cli.rte                 6.1.9.2         ROOT        CLEANUP     SUCCESS
devices.vtdev.scsi.rte      6.1.9.2         USR         APPLY       SUCCESS
devices.vtdev.scsi.rte      6.1.9.2         ROOT        APPLY       SUCCESS
[...]
This manifested itself in the fact that no IOS command would work properly anymore:
IBM Virtual I/O Server

login: padmin
padmin's Password:
Last login: Sun Apr 27 11:40:41 DFT 2014 on /dev/vty0

Access to run command is not valid.

[vios1-p550-b1] /home/padmin $ license -accept
Access to run command is not valid.

[vios2-p550-b1] /home/padmin $ lsmap -all
Access to run command is not valid.

[vios2-p550-b1] /home/padmin $
A quick check with a login to the VIOS as root via SSH confirmed that the “root” part of the package ios.cli.rte had been rolled back entirely:
root@vios2-p550-b1:/$ lppchk -v
lppchk:  The following filesets need to be installed or corrected to bring
         the system to a consistent state:

  ios.cli.rte 6.1.9.2          (usr: APPLIED, root: not installed)
To fix this issue, it worked for me to just run the “ios.cli.rte.pre_u” script of the ios.cli.rte package manually, in order to redefine the rolled-back RBAC authorizations:
root@vios2-p550-b1:/$ SAVEDIR=/tmp/ /usr/lpp/ios.cli/ios.cli.rte/6.1.9.2/inst_root/ios.cli.rte.pre_u
root@vios2-p550-b1:/$ rm /tmp/org_*
For future reference: make sure you have a valid “xmdaily” entry in /etc/inittab before attempting to update any VIOS system:
$ lsitab xmdaily; echo $?
xmdaily:2:once:/usr/bin/topasrec -L -s 300 -R 1 -r 6 -o /home/ios/perf/topas/ -ypersistent=1 2>&1 >/dev/null
0
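To automate that, a pre-update check along the following lines could be used – just a sketch, assuming the default topasrec recording parameters shown above:

#!/bin/ksh
# Sketch: restore the "xmdaily" inittab entry if it is missing, so that
# the ios.cli.rte.post_u script won't fail the next VIOS update.
ENTRY='xmdaily:2:once:/usr/bin/topasrec -L -s 300 -R 1 -r 6 -o /home/ios/perf/topas/ -ypersistent=1 2>&1 >/dev/null'
if /usr/sbin/lsitab xmdaily >/dev/null 2>&1; then
    echo "xmdaily inittab entry present, nothing to do."
else
    /usr/sbin/mkitab "$ENTRY" && echo "xmdaily inittab entry restored."
fi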
2014-04-26 // HMC Update to 7.7.8.0 SP1
Updating the HMC from v7.7.8.0 to v7.7.8.0 SP1 was – like the previous HMC Update to 7.7.6.0 SP1 and HMC Update to 7.7.7.0 SP2 – once again very painless. The service pack MH01397 was easily installable from the ISO images via the HMC GUI.
The service pack and the additional efixes showed the following output during the update process:
MH01397:
Management console corrective service installation in progress. Please wait...
Corrective service file offload from remote server in progress...
The corrective service file offload was successful.
Continuing with HMC service installation...
Verifying Certificate Information
Authenticating Install Packages
Installing Packages
--- Installing ptf-req ....
--- Installing RSCT ....
src-3.1.4.9-13275
rsct.core.utils-3.1.4.9-13275
rsct.core-3.1.4.9-13275
rsct.service-3.5.0.0-1
rsct.basic-3.1.4.9-13275
--- Installing CSM ....
csm.core-1.7.1.20-1
csm.deploy-1.7.1.20-1
csm_hmc.server-1.7.1.20-1
csm_hmc.hdwr_svr-7.0-3.4.0
csm_hmc.client-1.7.1.20-1
csm.server.hsc-1.7.1.20-1
--- Installing LPARCMD ....
hsc.lparcmd-3.3.0.1-1
ln: creating symbolic link `/usr/hmcrbin/lsnodeid': File exists
ln: creating symbolic link `/usr/hmcrbin/lsrsrc-api': File exists
ln: creating symbolic link `/usr/hmcrbin/mkrsrc-api': File exists
ln: creating symbolic link `/usr/hmcrbin/rmrsrc-api': File exists
--- Installing InventoryScout ....
--- Installing Pegasus ....
--- Updating baseOS ....
cp: cannot stat `.dev': No such file or directory
cp: cannot stat `.image.updates': No such file or directory
PreInstalling HMC REST Web Services ...
Installing HMC REST Web Services ...
/dump/hsc_install.images/images/./installK2Payloads.sh: line 84: K2Payloads.txt: No such file or directory
Corrective service installation was successful.
MH01416:
Management console corrective service installation in progress. Please wait...
Corrective service file offload from remote server in progress...
The corrective service file offload was successful.
Continuing with HMC service installation...
Verifying Certificate Information
Authenticating Install Packages
Installing Packages
--- Installing ptf-req ....
Corrective service installation was successful.
MH01423:
Management console corrective service installation in progress. Please wait...
Corrective service file offload from remote server in progress...
The corrective service file offload was successful.
Continuing with HMC service installation...
Verifying Certificate Information
Authenticating Install Packages
Installing Packages
--- Installing ptf-req ....
Verifying archive integrity... All good.
Uncompressing Updating from 2014-02-17 to 2014-04-08.
...................................................................................................
.........................................................................................................................
... .. ........ .. .. .. .. .. ........ ... .. .. .. .. ........ .. .. ... . ... . .. .. . ......... . . . . .
Updating to hmc61_1-mcp-2014-04-08-203505...
upgrading distro_id...
Preparing... ################################################## distro_id ######### ################################### # # ## ##
Preparing... ########## ########## ########## ########## ########## openssl # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # ###### #
binutils_static # # # # # # # # # # # ### # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
distro_id ######### ################################## # # ### ##
openssl-fips ## # ##### ###### ###### ##### ###### ###### ###### ##### ##
Updating distribution id...
Updating manifest...
Corrective service installation was successful.
The MH01397 update still shows the error messages with regard to symlink creation during the update process, which can be safely ignored. The error message “cp: cannot stat `.dev': No such file or directory” is also becoming an old friend; it originates from the shell script /images/installImages inside the MH01397 installation ISO image, where non-existent files are attempted to be copied. The newly introduced error message “cp: cannot stat `.image.updates': No such file or directory” falls into the same category.
The error message “/dump/hsc_install.images/images/./installK2Payloads.sh: line 84: K2Payloads.txt: No such file or directory” seems to be more serious. Tracing down the call order, from /images/installImages inside the MH01397 installation ISO image:
- /images/installImages
448
449 if [ -f $image/installK2Payloads.sh ]
450 then
451     echo "Installing HMC REST Web Services ..."
452     $image/./installK2Payloads.sh $image
453 fi
454
to:
- /images/installK2Payloads.sh
  4
  5 LogFile=/tmp/K2Install.log
  6 IndexFile=K2Payloads.txt
  7
[...]
 68
 69 # Install the K2 rpms if K2 directory exists
 70 if [ -d $image/K2 ]; then
 71     cd $image/K2
 72
 73     while read line
 74     do
 75         rpmToInstall=`echo $line | cut -d '/' -f 4`
 76         rpmToUpdate=`echo $line | cut -d '/' -f 4 | cut -d '-' -f 1`
 77         /bin/rpm -q $rpmToUpdate
 78         if [ $? -eq 0 ]; then
 79             /bin/rpm -vvh -U $rpmToInstall >> $LogFile 2>&1
 80         else
 81             /bin/rpm -vv -i $rpmToInstall >> $LogFile 2>&1
 82         fi
 83     done < $IndexFile
 84 fi
 85
it seems the file K2Payloads.txt is missing from the “K2” directory containing the RPM files for the HMC REST web services:
$ mount -o loop HMC_Update_V7R780_SP1.iso /mnt
$ ls -al /mnt/images/K2/
total 251394
dr-xr-xr-x 2 root root     4096 Mar  5 19:57 .
dr-xr-xr-x 8 root root     2048 Mar  5 19:57 ..
-r--r--r-- 1 root root       30 Feb 18 03:52 k2.version
-r--r--r-- 1 root root    13398 Feb 18 04:12 pmc.core-7.7.8.1-20140217T2012.i386.rpm
-r--r--r-- 1 root root 14242627 Feb 18 04:00 pmc.pcm.rest-7.7.8.1-20140217T2019.i386.rpm
-r--r--r-- 1 root root 33865936 Feb 18 03:58 pmc.soliddb-7.7.8.1-20140217T2012.i386.rpm
-r--r--r-- 1 root root    15375 Feb 18 04:01 pmc.soliddb.pcm.sql-7.7.8.1-20140217T2020.i386.rpm
-r--r--r-- 1 root root    46559 Feb 18 04:10 pmc.soliddb.rest.sql-7.7.8.1-20140217T2019.i386.rpm
-r--r--r-- 1 root root 54565876 Feb 18 03:59 pmc.ui.developer-7.7.8.1-20140217T2021.i386.rpm
-r--r--r-- 1 root root 74508025 Feb 18 03:57 pmc.war.rest-7.7.8.1-20140217T2013.i386.rpm
-r--r--r-- 1 root root 76751466 Feb 18 03:58 pmc.wlp-7.7.8.1-20140217T2012.i386.rpm
-r--r--r-- 1 root root   478438 Feb 18 04:08 pmc.wlp.commons-7.7.8.1-20140217T2013.i386.rpm
-r--r--r-- 1 root root  1695200 Feb 18 04:08 pmc.wlp.guava-7.7.8.1-20140217T2019.i386.rpm
-r--r--r-- 1 root root   109789 Feb 18 04:09 pmc.wlp.jaxb2.runtime-7.7.8.1-20140217T2019.i386.rpm
-r--r--r-- 1 root root   446974 Feb 18 04:08 pmc.wlp.log4j-7.7.8.1-20140217T2019.i386.rpm
-r--r--r-- 1 root root   428041 Feb 18 04:09 pmc.wlp.quartz-7.7.8.1-20140217T2020.i386.rpm
-r--r--r-- 1 root root    29861 Feb 18 04:09 pmc.wlp.slf4j.api-7.7.8.1-20140217T2013.i386.rpm
-r--r--r-- 1 root root   219413 Feb 18 03:58 pmc.wlp.soliddriver-7.7.8.1-20140217T2012.i386.rpm
Without privileged access to the HMC's OS it's hard to verify, but it appears that the updates to the HMC REST web services have not been installed. So I guess I'll be opening up a PMR about that now.
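As a side note, the cut -d '/' -f 4 calls in installK2Payloads.sh suggest that the missing index file presumably contains RPM paths relative to the ISO root, i.e. lines whose fourth '/'-separated field is the plain RPM file name. A quick simulation of the script's parsing (hypothetical line format, untested):

$ echo "./images/K2/pmc.core-7.7.8.1-20140217T2012.i386.rpm" | cut -d '/' -f 4
pmc.core-7.7.8.1-20140217T2012.i386.rpm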
The weird “.” and “#” characters in the output of the MH01423 update can also safely be ignored. They're just unhandled output from the update script and the RPM commands called therein. To take a look for yourself:
$ mkdir /tmp/hmc && cd /tmp/hmc
$ mount -o loop MH01423.iso /mnt
$ /mnt/images/update-hmc61_1-20140217-20140408.sh --tar xvf
$ ls -al
total 4100
drwxr-xr-x  3 root root    4096 Apr 26 23:07 .
drwxrwxrwt 35 root root   24576 Apr 26 23:07 ..
-rw-rw-r--  1 root root 2524328 Jun 19  2010 binutils_static-2.20.0-21.i686.rpm
drwxr-xr-x  2 root root   20480 Apr  9 17:48 DEPENDS
-rw-rw-r--  1 root root   24564 Aug 23  2012 distro_id-1.0-9.i686.rpm
-rwxr-xr-x  1 root root   56125 Apr  9 17:49 hmc61_1-mcp-2014-04-08-203505-manifest.txt
-rw-rw-r--  1 root root 1316850 Apr  8 22:56 openssl-1.0.1e-20.i686.rpm
-rw-rw-r--  1 root root  198060 Apr  8 22:56 openssl-fips-1.0.1e-20.i686.rpm
-rwxr-xr-x  1 root root    3462 Apr  9 17:49 setup.sh
$ less setup.sh
In the file setup.sh, look for the RPM install (“rpm -ivh”) or update (“rpm -Uvh”) calls, which are explicitly done with the verbose and hash options. This seems to be a bad choice, because the WebUI is apparently unable to handle the output produced by the RPM commands. It's certainly not pretty to have that in the output of the update process, where it could be cause for some confusion. But in the wake of the OpenSSL Heartbleed bug, which MH01423 addresses, a quick and dirty solution probably seemed more appropriate.
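A quieter variant – just a sketch, not what IBM actually ships – would drop the verbose and hash flags and redirect the RPM output to a log file instead:

# instead of e.g.:
#   rpm -Uvh openssl-1.0.1e-20.i686.rpm
# something like this keeps the WebUI output clean:
rpm -U openssl-1.0.1e-20.i686.rpm >> /tmp/setup.log 2>&1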
More importantly though, while going through the HMC WebUI and checking for any differences and functional breakage, I stumbled upon a long-missed feature that IBM apparently had already sneaked in with the previous HMC version 7.7.8.0: the “Sync current configuration” capability. And without me even noticing! Finally, no more bad surprises after an LPAR shutdown and re-activation!
2014-01-20 // HMC Update to 7.7.8.0
Like before with HMC Update to 7.7.5.0 and HMC Update to 7.7.7.0 SP1, the recent HMC update to v7.7.8.0 was again not installable directly from the ISO images via the HMC GUI. Along with the HMC network installation images, which are now mentioned in the release notes, there is now also an official documentation of the update procedure using the HMC network installation images. It's called “HMC network installation” and provides a more remote-admin-friendly way of performing the update. Since it's only a slightly shortened version of the procedure I had already tested and used in HMC Update to 7.7.7.0 SP1, I decided to stick with my own procedure.
Another turn for the better is that the release notes as well as FixCentral now clearly point out the dependencies between the fixpacks that are supposed to go on top of the update release, and the order they are supposed to be applied in. In the case of MH01377 (aka V7R7.8.0.0) these are MH01388 (aka "Required fix for HMC V7R7.8.0 (11-25-2013)") and MH01396 (aka "Fix for HMC V7R7.8.0 (12-10-2013)").
Compared to earlier updates, the restriction of having to shut down or disconnect the second HMC in a dual-HMC setup has been weakened to:
When two HMCs manage the same server, both HMCs must be at the same version. Once the server is connected to the higher version of the management console, the partition configuration is migrated to the latest version. Lower management consoles will not be able to understand the data properly. […]
To me this reads “ensure that you don't do any configuration from the second HMC while the first HMC has already been updated”, which still is some kind of restriction, but a far less intrusive and thus a much more manageable one compared to before.
As always, be sure to study the release notes thoroughly before an update attempt. Depending on your environment and HMC hardware there might be a road block in there. The document “HMC Version 7 Release 7.8.0 Upgrade sizing”, mentioned in the release notes of MH01377, deserves special attention.
For me, the good news gathered from the release notes was:
- Power Enterprise Pools to share CoD resources among the machines in your environment without the need of a special RPQ (chcodpool, lscodpool and mkcodpool). I still dare to dream that some day IBM will allow a fully dynamic sharing, even of non-CoD resources for us mere mortals.
- LPM evacuation to support the migration of all LPARs off a system within a single command, much like the VMware maintenance mode (migrlpar […] --all; see the sketch after this list).
- Regeneration and XML export of the hypervisor runtime configuration (mkprofdata).
- Refresh of OS level information on the HMC in case the OS within the LPAR was updated (lssyscfg -r lpar -m <managed system> --osrefresh).
As already mentioned above, I used the update procedure described earlier in HMC Update to 7.7.7.0 SP1. In my setup this worked well for all HMCs and showed no fatal or noticeable errors. It seems others might not have been so lucky. I did the update on 2014/01/17 and checked back on FixCentral on 2014/01/19. By then the following warning messages had been put up:
So it's probably best to hold off on the update for now, if it hasn't already been done. For reference purposes, here are some example screen shots from a KVM session to the HMC during an – eventually successful – update to MH01377:
After the upgrade to V7R7.8.0.0 is complete, you can apply the additional efixes in the usual way via the HMC GUI. For me the additional efixes showed the following output during the update process:
MH01388:
Management console corrective service installation in progress. Please wait...
Corrective service file offload from remote server in progress...
The corrective service file offload was successful.
Continuing with HMC service installation...
Verifying Certificate Information
Authenticating Install Packages
Installing Packages
--- Installing ptf-req ....
--- Installing RSCT ....
src-3.1.4.9-13275
rsct.core.utils-3.1.4.9-13275
rsct.core-3.1.4.9-13275
rsct.service-3.5.0.0-1
rsct.basic-3.1.4.9-13275
--- Installing CSM ....
csm.core-1.7.1.20-1
csm.deploy-1.7.1.20-1
csm_hmc.server-1.7.1.20-1
csm_hmc.hdwr_svr-7.0-3.4.0
csm_hmc.client-1.7.1.20-1
csm.server.hsc-1.7.1.20-1
--- Installing LPARCMD ....
hsc.lparcmd-3.3.0.1-1
ln: creating symbolic link `/usr/hmcrbin/lsnodeid': File exists
ln: creating symbolic link `/usr/hmcrbin/lsrsrc-api': File exists
ln: creating symbolic link `/usr/hmcrbin/mkrsrc-api': File exists
ln: creating symbolic link `/usr/hmcrbin/rmrsrc-api': File exists
--- Installing InventoryScout ....
--- Installing Pegasus ....
--- Installing service documentation ....
cp: cannot stat `.dev': No such file or directory
PreInstalling HMC REST Web Services ...
Installing HMC REST Web Services ...
pmc.core-7.7.8.0-20131027T1102
pmc.soliddb-7.7.8.0-20131027T1102
pmc.wlp-7.7.8.0-20131027T1102
pmc.wlp.soliddriver-7.7.8.0-20131027T1102
pmc.wlp.log4j-7.7.8.0-20131027T1110
pmc.wlp.guava-7.7.8.0-20131027T1110
pmc.wlp.jaxb2.runtime-7.7.8.0-20131027T1111
pmc.wlp.slf4j.api-7.7.8.0-20131027T1103
pmc.wlp.quartz-7.7.8.0-20131027T1111
pmc.wlp.commons-7.7.8.0-20131027T1103
pmc.war.rest-7.7.8.0-20131027T1103
pmc.soliddb.rest.sql-7.7.8.0-20131027T1110
pmc.soliddb.pcm.sql-7.7.8.0-20131027T1111
pmc.pcm.rest-7.7.8.0-20131027T1110
pmc.ui.developer-7.7.8.0-20131027T1112
Corrective service installation was successful.
MH01396:
Management console corrective service installation in progress. Please wait...
Corrective service file offload from remote server in progress...
The corrective service file offload was successful.
Continuing with HMC service installation...
Verifying Certificate Information
Authenticating Install Packages
Installing Packages
--- Installing ptf-req ....
PreInstalling HMC REST Web Services ...
Installing HMC REST Web Services ...
pmc.core-7.7.8.0-20131124T1024
pmc.soliddb-7.7.8.0-20131124T1024
pmc.wlp-7.7.8.0-20131124T1024
pmc.wlp.soliddriver-7.7.8.0-20131124T1024
pmc.wlp.log4j-7.7.8.0-20131124T1038
pmc.wlp.guava-7.7.8.0-20131124T1038
pmc.wlp.jaxb2.runtime-7.7.8.0-20131124T1038
pmc.wlp.slf4j.api-7.7.8.0-20131124T1025
pmc.wlp.quartz-7.7.8.0-20131124T1039
pmc.wlp.commons-7.7.8.0-20131124T1025
pmc.war.rest-7.7.8.0-20131124T1025
pmc.soliddb.rest.sql-7.7.8.0-20131124T1038
pmc.soliddb.pcm.sql-7.7.8.0-20131124T1039
pmc.pcm.rest-7.7.8.0-20131124T1038
pmc.ui.developer-7.7.8.0-20131124T1039
Corrective service installation was successful.
The MH01388 update still shows error messages with regard to symlink creation during the update process, so I guess my DCR MR0809134336 went straight into the circular file. The newly introduced error message “cp: cannot stat `.dev': No such file or directory” probably also originates from the shell script /images/installImages inside the MH01388 installation ISO image:
- /images/installImages
380 cp -p finishUpdate /console/HSC/
381 cp -p postinstall /console/HSC/
382 cp -p .VERSION /console/HSC/.VERSION
383 cp -p .dev /console/HSC/.dev
384
385 cp -p -r baseHMC /console/HSC/
which is trying to copy the file .dev that doesn't exist on the MH01388 ISO image:
$ mount -o loop MH01388.iso /mnt
$ ls -al /mnt/images/
total 121
dr-xr-xr-x 8 root root  4096 Nov 25 17:32 .
dr-xr-xr-x 3 root root  2048 Nov 22 23:50 ..
dr-xr-xr-x 2 root root 10240 Nov 25 17:32 baseHMC
-r-xr-xr-x 1 root root 40644 Nov  5 22:45 finishUpdate
-r--r--r-- 1 root root  2035 Nov 25 17:30 IBMhmc.MH01388_d1-7.0-7.8.0.i386.rpm
-r--r--r-- 1 root root     9 Oct 29 01:45 .image.updates
dr-xr-xr-x 2 root root  2048 Nov 22 23:47 info
-r-xr-xr-x 1 root root 15397 Nov  6 18:56 installImages
-r-xr-xr-x 1 root root  2644 Oct 29 22:05 installK2Payloads.sh
-r--r--r-- 1 root root    28 Oct 29 01:45 inventory
dr-xr-xr-x 2 root root  4096 Nov 25 17:30 K2
dr-xr-xr-x 2 root root  2048 Nov 25 17:32 pegasus
-r-xr-xr-x 1 root root 10805 Apr 16  2013 postinstall
-r-xr-xr-x 1 root root  1520 Nov  7 17:30 preInstallK2Payloads.sh
dr-xr-xr-x 2 root root  4096 Nov 25 17:32 rmc
dr-xr-xr-x 2 root root  2048 Nov 25 17:32 service
-r--r--r-- 1 root root    17 Oct 29 01:45 .signature
-r--r--r-- 2 root root    37 Nov 25 17:32 .VERSION
So either someone forgot to remove the unnecessary copy command from the install script, or the file .dev that is supposed to exist was forgotten during creation of the ISO image. Either way, let's hope the file .dev doesn't contain any vital information. Aside from that, no issues with the new HMC version so far.
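If it's the former, a defensive variant of the copy commands in installImages – merely a sketch of what a fix could look like – would simply guard against missing files:

# copy the optional metadata files only if they actually exist on the ISO:
for f in .VERSION .dev; do
    [ -f "$f" ] && cp -p "$f" /console/HSC/"$f"
done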
2014-01-14 // Nagios Monitoring - Knürr / Emerson CoolLoop
We use Knürr (now Emerson) CoolLoop units to chill the 19″ equipment in the datacenters. The CoolLoop units come with their own CoolCon controller units for management and monitoring purposes. The CoolCon controllers can – similar to the Rittal CMC-TC – be queried via SNMP for the status and values of the environmental sensors. The kind and number of sensors depends on the specific configuration that was ordered. It ranges from simple fan, air temperature and water valve sensors in the basic setup, to humidity, additional temperature, water flow, water temperature and electrical current, voltage and energy sensors in the extended setup. To monitor the sensor status and environmental values provided by the CoolCon controller, I wrote the Nagios plugin check_knuerr_coolcon.sh. In order to run the Nagios plugin, you need to have SNMP activated on the CoolCon controller unit, and a network connection from the Nagios system to the CoolCon controller unit on port UDP/161 must be allowed.
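A quick way to verify both prerequisites at once is a manual query from the Nagios host, e.g. with the hypothetical host name coolcon1 and the – here assumed – public community:

$ snmpwalk -v 2c -c public coolcon1 system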
The whole setup for monitoring Knürr / Emerson CoolLoop - and possibly, but untested, also CoolTherm - units with Nagios looks like this:
1. Enable SNMP queries on the CoolCon controller unit. Verify the port UDP/161 on the CoolCon controller unit can be reached from the Nagios system.
2. Optional: Enable SNMP traps to be sent to the Nagios system on the CoolCon controller unit. This requires SNMPD and SNMPTT to be already set up on the Nagios system. Verify the port UDP/162 on the Nagios system can be reached from the CoolCon controller unit.
3. Download the Nagios plugin check_knuerr_coolcon.sh and place it in the plugins directory of your Nagios system, in this example /usr/lib/nagios/plugins/:

$ mv -i check_knuerr_coolcon.sh /usr/lib/nagios/plugins/
$ chmod 755 /usr/lib/nagios/plugins/check_knuerr_coolcon.sh
4. Define the following Nagios commands. In this example this is done in the file /etc/nagios-plugins/config/check_coolcon.cfg:

# check Knuerr CoolCon/CoolLoop energy status
define command {
    command_name    check_coolcon_energy
    command_line    $USER1$/check_knuerr_coolcon.sh -H $HOSTNAME$ -C energy
}

# check Knuerr CoolCon/CoolLoop fan status
define command {
    command_name    check_coolcon_fan
    command_line    $USER1$/check_knuerr_coolcon.sh -H $HOSTNAME$ -C fan
}

# check Knuerr CoolCon/CoolLoop humidity status
define command {
    command_name    check_coolcon_humidity
    command_line    $USER1$/check_knuerr_coolcon.sh -H $HOSTNAME$ -C humidity
}

# check Knuerr CoolCon/CoolLoop temperature status
define command {
    command_name    check_coolcon_temperature
    command_line    $USER1$/check_knuerr_coolcon.sh -H $HOSTNAME$ -C temperature
}

# check Knuerr CoolCon/CoolLoop valve status
define command {
    command_name    check_coolcon_valve
    command_line    $USER1$/check_knuerr_coolcon.sh -H $HOSTNAME$ -C valve
}

# check Knuerr CoolCon/CoolLoop waterflow status
define command {
    command_name    check_coolcon_waterflow
    command_line    $USER1$/check_knuerr_coolcon.sh -H $HOSTNAME$ -C waterflow
}

# check Knuerr CoolCon/CoolLoop watertemperature status
define command {
    command_name    check_coolcon_watertemperature
    command_line    $USER1$/check_knuerr_coolcon.sh -H $HOSTNAME$ -C watertemperature
}
5. Define a group of services in your Nagios configuration to be checked for each CoolLoop system:

# check snmpd
define service {
    use                     generic-service
    hostgroup_name          coolcon
    service_description     Check_SNMPDv2
    check_command           check_snmpdv2
}

# check_coolcon_energy
define service {
    use                     generic-service-pnp
    hostgroup_name          coolcon
    service_description     Check_CoolCon_Energy
    check_command           check_coolcon_energy
}

# check_coolcon_fan
define service {
    use                     generic-service-pnp
    hostgroup_name          coolcon
    service_description     Check_CoolCon_Fan
    check_command           check_coolcon_fan
}

# check_coolcon_humidity
define service {
    use                     generic-service-pnp
    hostgroup_name          coolcon
    service_description     Check_CoolCon_Humidity
    check_command           check_coolcon_humidity
}

# check_coolcon_temperature
define service {
    use                     generic-service-pnp
    hostgroup_name          coolcon
    service_description     Check_CoolCon_Temp
    check_command           check_coolcon_temperature
}

# check_coolcon_valve
define service {
    use                     generic-service-pnp
    hostgroup_name          coolcon
    service_description     Check_CoolCon_Valve
    check_command           check_coolcon_valve
}

# check_coolcon_waterflow
define service {
    use                     generic-service-pnp
    hostgroup_name          coolcon
    service_description     Check_CoolCon_Waterflow
    check_command           check_coolcon_waterflow
}

# check_coolcon_watertemperature
define service {
    use                     generic-service-pnp
    hostgroup_name          coolcon
    service_description     Check_CoolCon_WaterTemp
    check_command           check_coolcon_watertemperature
}
Replace generic-service with your Nagios service template. Replace generic-service-pnp with your Nagios service template that has performance data processing enabled.

6. Define a service dependency to run the above checks only if the Check_SNMPDv2 check was run successfully:

# Knuerr CoolCon SNMPD dependencies
define servicedependency {
    hostgroup_name                  coolcon
    service_description             Check_SNMPDv2
    dependent_service_description   Check_CoolCon_.*
    execution_failure_criteria      c,p,u,w
    notification_failure_criteria   c,p,u,w
}
7. Define hosts in your Nagios configuration for each CoolLoop device. In this example it's named coolcon1:

define host {
    use         coolcon
    host_name   coolcon1
    alias       Knuerr CoolCon CoolLoop 1
    address     10.0.0.1
    parents     parent_lan
}
Replace coolcon with your Nagios host template for the CoolCon controller units. Adjust the address and parents parameters according to your environment.

8. Define a hostgroup in your Nagios configuration for all CoolLoop devices. In this example it is named coolcon. The above checks are run against each member of the hostgroup:

define hostgroup {
    hostgroup_name  coolcon
    alias           Knuerr CoolCon/CoolLoop
    members         coolcon1
}
9. Run a configuration check and, if successful, reload the Nagios process:

$ /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg
$ /etc/init.d/nagios3 reload
The new hosts and services should soon show up in the Nagios web interface.
If the optional step number 2 in the above list was done, SNMPTT also needs to be configured to be able to understand the incoming SNMP traps from CoolCon controller units. This can be achieved by the following steps:
1. Request a current version of the CoolCon SNMP MIB file from Knürr / Emerson. In this example it's 080104140000010a_KNUERR-COOLCON-MIB-V10.mib. Transfer the file 080104140000010a_KNUERR-COOLCON-MIB-V10.mib to the Nagios server.

2. Convert the SNMP MIB definitions in 080104140000010a_KNUERR-COOLCON-MIB-V10.mib into a format that SNMPTT can understand:

$ /opt/snmptt/snmpttconvertmib --in=MIB/080104140000010a_KNUERR-COOLCON-MIB-V10.mib --out=/opt/snmptt/conf/snmptt.conf.knuerr-coolcon
...
Done

Total translations:      201
Successful translations: 201
Failed translations:     0
3. Edit the trap severity according to your requirements, e.g.:

$ vim /opt/snmptt/conf/snmptt.conf.knuerr-coolcon
...
EVENT fans .1.3.6.1.4.1.2769.2.1.5.0.1 "Status Events" Warning
...
4. Add the new configuration file to be included in the global SNMPTT configuration and restart the SNMPTT daemon (a synthetic test trap example can be found at the end of this post):

$ vim /opt/snmptt/snmptt.ini
...
[TrapFiles]
snmptt_conf_files = <<END
...
/opt/snmptt/conf/snmptt.conf.knuerr-coolcon
...
END

$ /etc/init.d/snmptt reload
5. Download the Nagios plugin check_snmp_traps.sh and place it in the plugins directory of your Nagios system, in this example /usr/lib/nagios/plugins/:

$ mv -i check_snmp_traps.sh /usr/lib/nagios/plugins/
$ chmod 755 /usr/lib/nagios/plugins/check_snmp_traps.sh
6. Define the following Nagios command to check for SNMP traps in the SNMPTT database. In this example this is done in the file /etc/nagios-plugins/config/check_snmp_traps.cfg:

# check for snmp traps
define command {
    command_name    check_snmp_traps
    command_line    $USER1$/check_snmp_traps.sh -H $HOSTNAME$:$HOSTADDRESS$ -u <user> -p <pass> -d <snmptt_db>
}

Replace <user>, <pass> and <snmptt_db> with values suitable for your SNMPTT database environment.

7. Add another service in your Nagios configuration to be checked for each CoolLoop device:

# check snmptraps
define service {
    use                     generic-service
    hostgroup_name          coolcon
    service_description     Check_SNMP_traps
    check_command           check_snmp_traps
}
8. Optional: Define a serviceextinfo to display a folder icon next to the Check_SNMP_traps service check for each CoolLoop device. This icon provides a direct link to the SNMPTT web interface with a filter for the selected host:

define serviceextinfo {
    hostgroup_name          coolcon
    service_description     Check_SNMP_traps
    notes                   SNMP Alerts
#   notes_url               http://<hostname>/nagios3/nagtrap/index.php?hostname=$HOSTNAME$
#   notes_url               http://<hostname>/nagios3/nsti/index.php?perpage=100&hostname=$HOSTNAME$
}

Uncomment the notes_url depending on which web interface (nagtrap or nsti) is used. Replace <hostname> with the FQDN or IP address of the server running the web interface.

9. Run a configuration check and, if successful, reload the Nagios process:

$ /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg
$ /etc/init.d/nagios3 reload
10. Optional: If you're running PNP4Nagios v0.6 or later to graph Nagios performance data, you can use the PNP4Nagios templates in pnp4nagios_coolcon.tar.bz2 to beautify the graphs. Download pnp4nagios_coolcon.tar.bz2 and place the templates in the PNP4Nagios template directory, in this example /usr/share/pnp4nagios/html/templates/:

$ tar jxf pnp4nagios_coolcon.tar.bz2
$ mv -i check_coolcon_*.php /usr/share/pnp4nagios/html/templates/
$ chmod 644 /usr/share/pnp4nagios/html/templates/check_coolcon_*.php
The following image shows an example of what the PNP4Nagios graphs look like for a CoolLoop unit:
All done, you should now have a complete Nagios-based monitoring solution for your Knürr / Emerson CoolLoop systems.
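By the way: to verify the whole trap chain from step 4 without waiting for a real event, a synthetic trap matching the fans event from step 3 can be sent from the Nagios host itself. This is just a sketch – it assumes snmptrapd on the Nagios host listens on localhost and accepts the public community:

# send an SNMPv1 enterprise-specific trap (generic 6, specific 1) for
# enterprise .1.3.6.1.4.1.2769.2.1.5, spoofing coolcon1 as the agent:
$ snmptrap -v 1 -c public localhost .1.3.6.1.4.1.2769.2.1.5 coolcon1 6 1 ''

The trap should subsequently show up in the SNMPTT database and be picked up by the Check_SNMP_traps service check.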
2014-01-12 // Nagios Monitoring - Infortrend EonStor
We use several Infortrend EonStor and EonStor DS storage arrays as low-cost bulk storage units in our datacenters. With check_infortrend.pl, check_infortrend and check_ift_{dev|hdd|ld}.pl there are already several Nagios plugins to monitor Infortrend EonStor storage arrays. Since I wanted a low-overhead, shell-based plugin with support for performance data, I decided to write my own version, check_infortrend.sh. In order to run the Nagios plugin, you need to have SNMP activated on the Infortrend storage array, and a network connection from the Nagios system to the Infortrend device on port UDP/161 must be allowed.
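As with the CoolCon units above, a manual query from the Nagios host verifies SNMP and the UDP/161 connectivity in one go – esds1 being a hypothetical host name and public an assumed community:

$ snmpget -v 2c -c public esds1 sysDescr.0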
The whole setup looks like this:
1. Enable SNMP queries on the Infortrend storage array. Log in via Telnet or SSH and navigate to:

-> view and edit Configuration parameters
  -> Communication Parameters
    -> Network Protocol Support
      -> SNMP - Disabled
        -> Enable SNMP Protocol? -> Yes

Verify the port UDP/161 on the Infortrend device can be reached from the Nagios system.
2. Optional: Enable SNMP traps to be sent to the Nagios system on the Infortrend storage array. This requires SNMPD and SNMPTT to be already set up on the Nagios system. Verify the port UDP/162 on the Nagios system can be reached from the Infortrend device.
3. Download the Nagios plugin check_infortrend.sh and place it in the plugins directory of your Nagios system, in this example /usr/lib/nagios/plugins/:

$ mv -i check_infortrend.sh /usr/lib/nagios/plugins/
$ chmod 755 /usr/lib/nagios/plugins/check_infortrend.sh
4. Define the following Nagios commands. In this example this is done in the file /etc/nagios-plugins/config/check_infortrend.cfg:

# check Infortrend ESDS cache status
define command {
    command_name    check_infortrend_cache
    command_line    $USER1$/check_infortrend.sh -H $HOSTNAME$ -C cache
}

# check Infortrend ESDS controller status
define command {
    command_name    check_infortrend_controller
    command_line    $USER1$/check_infortrend.sh -H $HOSTNAME$ -C controller
}

# check Infortrend ESDS disk status
define command {
    command_name    check_infortrend_disk
    command_line    $USER1$/check_infortrend.sh -H $HOSTNAME$ -C disk
}

# check Infortrend ESDS logicaldrive status
define command {
    command_name    check_infortrend_logicaldrive
    command_line    $USER1$/check_infortrend.sh -H $HOSTNAME$ -C logicaldrive
}

# check Infortrend ESDS logicalunit status
define command {
    command_name    check_infortrend_logicalunit
    command_line    $USER1$/check_infortrend.sh -H $HOSTNAME$ -C logicalunit
}

# check Infortrend ESDS event status
define command {
    command_name    check_infortrend_events
    command_line    $USER1$/check_infortrend.sh -H $HOSTNAME$ -C events
}
5. Define a group of services in your Nagios configuration to be checked for each Infortrend system:

# check snmpd
define service {
    use                     generic-service
    hostgroup_name          infortrend
    service_description     Check_SNMPD
    check_command           check_snmpd
}

# check_infortrend_cache
define service {
    use                     generic-service-pnp
    hostgroup_name          infortrend
    service_description     Check_IFT_Cache
    check_command           check_infortrend_cache
}

# check_infortrend_controller
define service {
    use                     generic-service
    hostgroup_name          infortrend
    service_description     Check_IFT_Controller
    check_command           check_infortrend_controller
}

# check_infortrend_disk
define service {
    use                     generic-service-pnp
    hostgroup_name          infortrend
    service_description     Check_IFT_Disk
    check_command           check_infortrend_disk
}

# check_infortrend_logicaldrive
define service {
    use                     generic-service-pnp
    hostgroup_name          infortrend
    service_description     Check_IFT_LogicalDrive
    check_command           check_infortrend_logicaldrive
}

# check_infortrend_logicalunit
define service {
    use                     generic-service-pnp
    hostgroup_name          infortrend
    service_description     Check_IFT_LogicalUnit
    check_command           check_infortrend_logicalunit
}

# check_infortrend_events
define service {
    use                     generic-service
    hostgroup_name          infortrend
    service_description     Check_IFT_Events
    check_command           check_infortrend_events
}
Replace generic-service with your Nagios service template. Replace generic-service-pnp with your Nagios service template that has performance data processing enabled.

6. Define a service dependency to run the above checks only if the Check_SNMPD check was run successfully:

# Infortrend SNMPD dependencies
define servicedependency {
    hostgroup_name                  infortrend
    service_description             Check_SNMPD
    dependent_service_description   Check_IFT_.*
    execution_failure_criteria      c,p,u,w
    notification_failure_criteria   c,p,u,w
}
7. Define hosts in your Nagios configuration for each Infortrend device. In this example it's named esds1:

define host {
    use         disk
    host_name   esds1
    alias       Infortrend Disk Storage 1
    address     10.0.0.1
    parents     parent_lan
}
Replace disk with your Nagios host template for storage devices. Adjust the address and parents parameters according to your environment.

8. Define a hostgroup in your Nagios configuration for all Infortrend devices. In this example it is named infortrend. The above checks are run against each member of the hostgroup:

define hostgroup {
    hostgroup_name  infortrend
    alias           Infortrend Disk Storages
    members         esds1
}
9. Run a configuration check and, if successful, reload the Nagios process:

$ /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg
$ /etc/init.d/nagios3 reload
The new hosts and services should soon show up in the Nagios web interface.
If the optional step number 2 in the above list was done, SNMPTT also needs to be configured to be able to understand the incoming SNMP traps from Infortrend devices. This can be achieved by the following steps:
1. Request a current version of the Infortrend SNMP MIB file from Infortrend support. In this example it's IFT_MIB_v1.40A02.mib. Transfer the file IFT_MIB_v1.40A02.mib to the Nagios server.

2. Convert the SNMP MIB definitions in IFT_MIB_v1.40A02.mib into a format that SNMPTT can understand:

$ /opt/snmptt/snmpttconvertmib --in=MIB/IFT_MIB_v1.40A02.mib --out=/opt/snmptt/conf/snmptt.conf.infortrend
...
Done

Total translations:      1
Successful translations: 1
Failed translations:     0
3. Edit the trap severity according to your requirements, e.g.:

$ vim /opt/snmptt/conf/snmptt.conf.infortrend
...
EVENT iftEventText .1.3.6.1.4.1.1714.2.1.1 "Status Events" Warning
...
4. Add the new configuration file to be included in the global SNMPTT configuration and restart the SNMPTT daemon (note that the include path must match the --out location used in step 2):

$ vim /opt/snmptt/snmptt.ini
...
[TrapFiles]
snmptt_conf_files = <<END
...
/opt/snmptt/conf/snmptt.conf.infortrend
...
END

$ /etc/init.d/snmptt reload
5. Download the Nagios plugin check_snmp_traps.sh and place it in the plugins directory of your Nagios system, in this example /usr/lib/nagios/plugins/:

$ mv -i check_snmp_traps.sh /usr/lib/nagios/plugins/
$ chmod 755 /usr/lib/nagios/plugins/check_snmp_traps.sh
6. Define the following Nagios command to check for SNMP traps in the SNMPTT database. In this example this is done in the file /etc/nagios-plugins/config/check_snmp_traps.cfg:

# check for snmp traps
define command {
    command_name    check_snmp_traps
    command_line    $USER1$/check_snmp_traps.sh -H $HOSTNAME$:$HOSTADDRESS$ -u <user> -p <pass> -d <snmptt_db>
}

Replace <user>, <pass> and <snmptt_db> with values suitable for your SNMPTT database environment.

7. Add another service in your Nagios configuration to be checked for each Infortrend device:

# check snmptraps
define service {
    use                     generic-service
    hostgroup_name          infortrend
    service_description     Check_SNMP_traps
    check_command           check_snmp_traps
}
8. Optional: Define a serviceextinfo to display a folder icon next to the Check_SNMP_traps service check for each Infortrend device. This icon provides a direct link to the SNMPTT web interface with a filter for the selected host:

define serviceextinfo {
    hostgroup_name          infortrend
    service_description     Check_SNMP_traps
    notes                   SNMP Alerts
#   notes_url               http://<hostname>/nagios3/nagtrap/index.php?hostname=$HOSTNAME$
#   notes_url               http://<hostname>/nagios3/nsti/index.php?perpage=100&hostname=$HOSTNAME$
}

Uncomment the notes_url depending on which web interface (nagtrap or nsti) is used. Replace <hostname> with the FQDN or IP address of the server running the web interface.

9. Run a configuration check and, if successful, reload the Nagios process:

$ /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg
$ /etc/init.d/nagios3 reload
10. Optional: If you're running PNP4Nagios v0.6 or later to graph Nagios performance data, you can use the check_infortrend_cache.php, check_infortrend_disk.php, check_infortrend_logicaldrive.php and check_infortrend_logicalunit.php PNP4Nagios templates to beautify the graphs. Download these four templates and place them in the PNP4Nagios template directory, in this example /usr/share/pnp4nagios/html/templates/:

$ mv -i check_infortrend_*.php /usr/share/pnp4nagios/html/templates/
$ chmod 644 /usr/share/pnp4nagios/html/templates/check_infortrend_*.php
The following image shows an example of what the PNP4Nagios graphs look like for an Infortrend EonStor unit:
All done, you should now have a complete Nagios-based monitoring solution for your Infortrend EonStor systems.