bityard Blog

// TSM DLLA Procedure Performance

Some time ago we were hit by the dreaded DB corruption on one of our TSM 5.5.5.2 server instances. Upon investigating the object IDs with the hidden SHOW INVO command, we found exactly which objects were affected – some Windows filesystem backups and Oracle database archivelog backups – and backed them up again. Getting rid of the broken remains was not so easy, though. The usual AUDIT VOLUME … FIX=YES took care of some of the issues, but not all. Because of the remaining defective objects and database entries we're seeing error messages like:

ANR9999D_0902881829 DetermineBackupRetention(imexp.c:7812) Thread<1150837>: No inactive versions found for 0:392144678
ANR9999D Thread<1150837> issued message 9999 from:
ANR9999D Thread<1150837>  000000010000c7e8 StdPutText
ANR9999D Thread<1150837>  000000010000fb90 OutDiagToCons
ANR9999D Thread<1150837>  000000010000a2d0 outDiagfExt
ANR9999D Thread<1150837>  0000000100784bcc DetermineBackupRetention
ANR9999D Thread<1150837>  0000000100789024 ExpirationQualifies
ANR9999D Thread<1150837>  000000010078b48c ExpirationProcess
ANR9999D Thread<1150837>  000000010078e550 ImDoExpiration
ANR9999D Thread<1150837>  000000010001509c StartThread
ANR9999D_2753579289 ExpirationQualifies(imexp.c:5116) Thread<1150837>: DetermineBackupRetention for 0:392144678 failed, rc=19
ANR9999D Thread<1150837> issued message 9999 from:
ANR9999D Thread<1150837>  000000010000c7e8 StdPutText
ANR9999D Thread<1150837>  000000010000fb90 OutDiagToCons
ANR9999D Thread<1150837>  000000010000a2d0 outDiagfExt
ANR9999D Thread<1150837>  000000010078905c ExpirationQualifies
ANR9999D Thread<1150837>  000000010078b48c ExpirationProcess
ANR9999D Thread<1150837>  000000010078e550 ImDoExpiration
ANR9999D Thread<1150837>  000000010001509c StartThread

on the daily EXPIRE INVENTORY, and the content of some tape volumes cannot be moved or reclaimed. Calling TSM support at IBM was not as helpful as we had hoped. Although there are – undocumented – commands to manipulate database entries directly, we were told the only way to clean up the logical inconsistencies was to perform a DUMPDB / LOADFORMAT / LOADDB / AUDITDB (DLLA) procedure, or to create a new TSM server instance and move over all nodes from the broken instance.
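For reference, the four DLLA steps are run against the halted server instance with the offline dsmserv utility. A rough outline follows; all arguments (device classes, volume lists, sizes) are deliberately omitted here – consult the DSMSERV utility reference for the exact syntax of your TSM version:

```
dsmserv dumpdb ...       # 1. dump all database contents to sequential media
dsmserv loadformat ...   # 2. re-initialize (format) the database and log volumes
dsmserv loaddb ...       # 3. reload the dumped database contents
dsmserv auditdb fix=yes  # 4. audit the reloaded database and fix inconsistencies
```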

We went for the first option, the DLLA procedure. The runtime of the DLLA procedure, and thus the downtime for the TSM server, depends largely on the size of the database, the number of objects in the database and probably the number of inconsistencies, too. Since there was no way to even roughly estimate the runtime, we decided to test the DLLA procedure on a non-production test system. This way we could also familiarize ourselves with the steps needed – luckily this kind of activity does not fall into the category of regular duties for a TSM admin – and also check whether the issues were actually resolved by the DLLA procedure. There's actually a good chance they won't be, and you'd have to fall back to the second option of creating a new TSM server instance and moving over all nodes and their data via node export/import!

We're running TSM 5.5.5.2 in AIX 6.1 LPARs; the test system had the exact same versions. The database size is 96 GB with ~650 million objects. We tried a lot of different test setups to optimize the DLLA runtime; the iterations with the most visible gain in runtime were:

  1. Test 1 (initial test): Power6+ @5GHz; 2 shared CPUs; Dump FS, DB and Log disks on SVC LUNs provided by an EMC Clariion with 15k FC disks.

  2. Test 2: Power6+ @5GHz; 2 shared CPUs; Dump FS, DB and Log disks on SVC LUNs provided by a TMS RamSan-630 flash array.

  3. Test 3: Power6+ @5GHz; 3 dedicated CPUs; DB and Log on RamDisk devices; Dump FS on SVC LUNs provided by a TMS RamSan-630 flash array.

  4. Test 4: Power7 @3.1GHz; 4 dedicated CPUs; DB and Log on RamDisk devices; Dump FS on SVC LUNs provided by a TMS RamSan-630 flash array.

The runtime for each DLLA step in those iterations was:

DLLA Step    Runtime (h = hours, m = minutes, s = seconds)
             Test 1      Test 2      Test 3         Test 4
DUMPDB       0h 15m      0h 5m       0h 4m 11s      0h 3m 12s
LOADFORMAT   -           -           0h 0m 20s      0h 0m 13s
LOADDB       11h 30m     4h 45m      4h 54m 55s     3h 15m 4s
AUDITDB      30h         9h 36m      7h 34m 14s     5h 18m 55s
SUM          ~41h 45m    ~14h 26m    ~12h 33m 40s   ~8h 37m 24s
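As a quick sanity check, the SUM row can be recomputed from the individual step runtimes; a small shell helper does it for Test 4:

```shell
# Add up the Test 4 step runtimes (0h 3m 12s + 0h 0m 13s + 3h 15m 4s
# + 5h 18m 55s) in seconds, then print the result as h/m/s again.
to_seconds() { echo $(( $1 * 3600 + $2 * 60 + $3 )); }

total=$(( $(to_seconds 0 3 12) + $(to_seconds 0 0 13) \
        + $(to_seconds 3 15 4) + $(to_seconds 5 18 55) ))

printf '%dh %dm %ds\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
# prints: 8h 37m 24s
```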

As those numbers show, there is vast room for runtime improvement. Between Test 1 and Test 2 the runtime dropped to almost a third in both LOADDB and AUDITDB. Since the only change between Test 1 and Test 2 was the move of the storage to a system that is very good at low-latency, random I/O, it's safe to say the setup was I/O bound in Test 1.

Having an I/O bound system, we tried to lift this constraint further by moving to RamDisk-based DB and Log volumes in Test 3. There was also the suspicion that the scheduling of the shared CPU resources back and forth between the LPAR and the hypervisor could have a negative impact. To mitigate this we also switched to dedicated CPU assignments. Unfortunately the result of Test 3 was “only” another 2 hours of reduction in runtime.

While observing the system's activity during the LOADDB and AUDITDB phases with different system monitoring tools, including truss, we noticed a rather high level of activity in the area of pthreads. Although the Power7 systems at hand had a lower clock rate than the Power6+ systems, the connection between memory and CPU is much improved with Power7. Especially thread intercommunication and synchronisation should benefit from this. The result of the move to Power7 (Test 4) was another 4 hours of reduction in runtime.

The end result of approximately 8.5 hours of DLLA runtime is still a lot, but much more manageable than the initial 42 hours in terms of a possible downtime window. Considering the kind of hardware resources we have to throw at this issue, it makes one wonder how TSM support and development could see the DLLA procedure as an actually valid way to resolve inconsistencies in the TSM database. The whole process, as well as each of its steps, appears to be implemented horribly inefficiently, eating resources like crazy. I know that the development of TSM 5.x virtually dried up some time ago with the advent of TSM 6.x and its DB2 database backend. But should one really believe that in the past there was never the time nor the necessity to come up with another, more practical way to deal with TSM database issues?

// HMC Update to 7.7.6.0 SP1

Compared to the previous ordeal of the HMC update to v7.7.5.0, the recent HMC update to v7.7.6.0 SP1 went smooth as silk. Well, as long as one had enough patience to wait for the versions to stabilize. IBM released v7.7.6.0 (MH01326), an eFix for v7.7.6.0 (MH01328) and v7.7.6.0 SP1 (MH01329) in short succession. Judging from this, and from the pending issues and restrictions mentioned in the release notes, one could get the impression that the new HMC versions were very forcefully – and maybe prematurely – shoved out the door in order to make the announcement date for the new Power7+ systems in October 2012.

On the upside, the new versions are now easily installable from the ISO images via the HMC GUI again. On the downside, if you're running a dual HMC setup, you might – depending on your infrastructure – still need a trip to the datacenter. This is necessary in order to shut down, disconnect or otherwise prevent your second HMC, still running a lower code version, from accessing the managed systems while the first HMC is being updated or already running the newer version. The release notes for MH01326 state this:

“Before upgrading server firmware to 760 on a CEC with redundant HMCs connected, you must disconnect or power off the HMC not being used to perform the upgrade. This restriction is enforced by the Upgrade Licensed Internal code task.”

This is kind of an annoyance, since the whole purpose of a dual HMC setup is to have a fallback in case something goes south with one of the HMCs. Unfortunately the release notes aren't very clear as to whether this restriction applies only during the update process or whether you can't run different HMC versions in parallel at all. I decided to play it safe and physically disconnected the second HMC's network interface facing the service processor network for about two weeks, until the new HMC version had proven itself ready for day-to-day use. After this grace period I updated the second HMC as well and plugged it back into the service processor network.

As with earlier updates, there are still error messages regarding symlink creation appearing during the update process. E.g.:

HMC corrective service installation in progress. Please wait...
Corrective service file offload from remote server in progress...
The corrective service file offload was successful. Continuing with HMC service installation...
Verifying Certificate Information
Authenticating Install Packages
Installing Packages
--- Installing ptf-req ....
--- Installing RSCT ....
src-1.3.1.1-12163
rsct.core.utils-3.1.2.5-12163
rsct.core-3.1.2.5-12163
rsct.service-3.1.0.0-1
rsct.basic-3.1.2.5-12163
--- Installing CSM ....
csm.core-1.7.1.20-1
csm.deploy-1.7.1.20-1
csm_hmc.server-1.7.1.20-1
csm_hmc.hdwr_svr-7.0-3.4.0
csm_hmc.client-1.7.1.20-1
csm.server.hsc-1.7.1.20-1
--- Installing LPARCMD ....
hsc.lparcmd-2.0.0.0-1
ln: creating symbolic link `/usr/hmcrbin/lsnodeid': File exists
ln: creating symbolic link `/usr/hmcrbin/lsrsrc-api': File exists
ln: creating symbolic link `/usr/hmcrbin/mkrsrc-api': File exists
ln: creating symbolic link `/usr/hmcrbin/rmrsrc-api': File exists
--- Installing InventoryScout ....
--- Installing Pegasus ....
--- Installing service documentation ....
--- Updating baseOS ....
Corrective service installation was successful.

This time I got fed up and curious enough to search for what was actually breaking and where. After some loopback mounting and sifting through the installation ISO images, I found the culprit in the shell script /images/installImages, which is part of the /images/disk2.img ISO image inside the installation ISO image:

/images/installImages
 515     # links for xCAT support - 755346
 516     if [ ! -L /usr/sbin/rsct/bin/lsnodeid ]; then
 517        ln -s /usr/sbin/rsct/bin/lsnodeid /usr/hmcrbin/lsnodeid
 518     fi
 519     if [ ! -L /usr/sbin/rsct/bin/lsrsrc-api ]; then
 520        ln -s /usr/sbin/rsct/bin/lsrsrc-api /usr/hmcrbin/lsrsrc-api
 521     fi
 522     if [ ! -L /usr/sbin/rsct/bin/mkrsrc-api ]; then
 523        ln -s /usr/sbin/rsct/bin/mkrsrc-api /usr/hmcrbin/mkrsrc-api
 524     fi
 525     if [ ! -L /usr/sbin/rsct/bin/rmrsrc-api ]; then
 526        ln -s /usr/sbin/rsct/bin/rmrsrc-api /usr/hmcrbin/rmrsrc-api
 527     fi
 528
 529     # defect 788462 - create the lspartition for the restricted shell after rsct.service rpm installed.
 530     if [ -L /opt/hsc/bin/lspartition ]
 531     then
 532        ln -s /opt/hsc/bin/lspartition /usr/hmcrbin/ 1>&2 2>/dev/null
 533     else
 534        ln -f /opt/hsc/bin/lspartition /usr/hmcrbin/ 1>&2 2>/dev/null
 535     fi

So this is either a semantic error, and the goal was actually to check whether the symlinks to be created are already present, in which case e.g. /usr/hmcrbin/lsnodeid should be in the test brackets. Or this is not a semantic error, in which case the aforementioned test for already-present symlinks should actually be implemented, or the symlink creation should be forced (“-f” flag). Besides that, I would've probably handled the symlink creation in the RPM that installs the binaries in the first place. Well, let's open a PMR on this and find out what IBM thinks about this issue.
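A minimal sketch of the presumed intent – testing the link to be created under /usr/hmcrbin instead of the source binary under /usr/sbin/rsct/bin – using temporary stand-in directories so it can be run safely anywhere:

```shell
# Stand-ins for /usr/sbin/rsct/bin (sources) and /usr/hmcrbin (link targets).
src=$(mktemp -d)
dst=$(mktemp -d)
touch "$src/lsnodeid"

# Check the path of the symlink that is to be CREATED, not the source file.
if [ ! -L "$dst/lsnodeid" ]; then
    ln -s "$src/lsnodeid" "$dst/lsnodeid"
fi

# A second run is now a clean no-op instead of failing with "File exists".
if [ ! -L "$dst/lsnodeid" ]; then
    ln -s "$src/lsnodeid" "$dst/lsnodeid"
fi
```

Alternatively, "ln -sf" would simply force re-creation of the link, which is probably the more robust choice inside an installer.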

Another issue – although a minor one – with the above code example is in lines 532 and 534. The shell's I/O redirection is used in the wrong order, causing STDOUT to be redirected to STDERR instead of – what was probably intended – both streams being redirected to /dev/null.
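The difference is easy to demonstrate in a shell, with echo standing in for the ln calls from the installer:

```shell
# As written in installImages: "1>&2 2>/dev/null" first duplicates STDOUT
# onto the current STDERR (the terminal), and only then points STDERR at
# /dev/null -- so the output still appears.
echo "still visible" 1>&2 2>/dev/null

# Probable intent: ">/dev/null 2>&1" first points STDOUT at /dev/null,
# then duplicates STDERR onto it -- both streams are silenced.
echo "silenced" >/dev/null 2>&1
```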

Aside from that, we're currently not experiencing any issues with the new HMC version. I've opened an RPQ to get feature 5799 – the HMC remote restart function – which is again supported with v7.7.6.0 SP1 (MH01329). This will come in quite handy for the simplification of our disaster recovery procedures. But that'll be material for another post. Stay tuned until then …

// Cacti Monitoring Templates for TMS RamSan-630 and RamSan-810

This is an update to the previous post about Cacti Monitoring Templates and Nagios Plugin for TMS RamSan-630. With the two new RamSan-810 units we got and the new firmware releases available for our existing RamSan-630, an update to the previously introduced Cacti templates and Nagios plugins seemed to be in order. The good news is that the new Cacti templates can still be used with older firmware versions; the graphs depending on newer performance counters will just remain empty. I suspect they'll work for all 6x0, 7x0 and 8x0 models. Also good news: the RamSan-630 and the RamSan-810 have basically the same SNMP MIB.

There are just some nomenclature differences with regard to the product name, so the same Cacti templates can be used for either RamSan-630 or RamSan-810 systems. For historic reasons the string “TMS RamSan-630” still appears in several template names.

As the release notes for current firmware versions mention, several new SNMP counters have been added:

** Release 5.4.6 - May 17, 2012 **
[N 23014] SNMP MIB now includes a new table for flashcard information.

** Release 5.4.5 - May 2, 2012 **
[N 23014] SNMP MIB now includes interface stats for transfer latency and DMA command sizes.

A diff on the two RamSan-630 MIBs mentioned above shows the new SNMP counters:

fcReadAvgLatency
fcWriteAvgLatency
fcReadMaxLatency
fcWriteMaxLatency
fcReadSampleLow
fcReadSampleMed
fcReadSampleHigh
fcWriteSampleLow
fcWriteSampleMed
fcWriteSampleHigh
fcscsi4k
fcscsi8k
fcscsi16k
fcscsi32k
fcscsi64k
fcscsi128k
fcscsi256k
fcRMWCount

flashTableIndex
flashObject
flashTableState
flashHealthState
flashHealthPercent
flashSizeMiB

With a little bit of reading through the MIB and comparing the new SNMP counters to the corresponding performance counters in the RamSan web interface, the following metrics were added to the Cacti templates:

  • FC port average and maximum read and write latency measured in microseconds.

    Example RamSan-630 Average and Maximum Read/Write Latency on Port fc-1a:

    Example RamSan-630 Average and Maximum Read/Write Latency on Port fc-1a

    Example RamSan-810 Average and Maximum Read/Write Latency on Port fc-1a:

    Example RamSan-810 Average and Maximum Read/Write Latency on Port fc-1a

  • FC port SCSI command count grouped by the SCSI command size.

    Example RamSan-630 SCSI Command Count on Port fc-1a:

    Example RamSan-630 SCSI Command Count on Port fc-1a

    Example RamSan-810 SCSI Command Count on Port fc-1a:

    Example RamSan-810 SCSI Command Count on Port fc-1a

  • FC port SCSI command latency grouped by latency classes (low, medium, high).

    Example RamSan-630 SCSI Command Latency on Port fc-1a:

    Example RamSan-630 SCSI Command Latency on Port fc-1a

    Example RamSan-810 SCSI Command Latency on Port fc-1a:

    Example RamSan-810 SCSI Command Latency on Port fc-1a

  • FC port read-modify-write command count (although it seems to remain at the maximum value for a 32-bit signed integer all the time).

  • Flashcard health percentage (good vs. failed flash cells).

    Example Health Status of Flashcard flashcard-1:

    Example Health Status of Flashcard flashcard-1

  • Flashcard size.

There still seem to be some issues with the existing and the new SNMP counters. For example, the fcCacheHit, fcCacheMiss and fcCacheLookup counters always remain at zero. The fcRXFrames counter always stays at the same value (2147483647), which is the maximum for a 32-bit signed integer and could suggest a counter overflow. The fcWriteSample* counters also seem to remain at zero, even though the corresponding performance counters in the RamSan web interface show steady growth.
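For comparison, 2147483647 is indeed exactly the 32-bit signed integer maximum, which supports the overflow theory:

```shell
# 2^31 - 1, the largest value a 32-bit signed integer can hold.
echo $(( (1 << 31) - 1 ))
# prints: 2147483647
```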

Since there are still some performance counters left that are only accessible via the web interface, there's still some room for improvement. I hope that with the acquisition by IBM we'll see some more interesting changes in the future.

The Nagios plugins and the updated Cacti templates can be downloaded here: Nagios Plugin and Cacti Templates.

// Nagios Plugins - Hostname vs. IP-Address

For obvious reasons I usually like to keep monitoring systems as independent as possible from other services (e.g. DNS). That's why I mostly use IP addresses instead of hostnames or FQDNs when configuring hosts and services in Nagios. A typical Nagios command configuration stanza would look like this:

define command {
    command_name    <name>
    command_line    <path>/<executable> -H $HOSTADDRESS$ <args>
}

Now, for example, to check for date and time differences between the server OS and the actual time provided by our time servers, I've defined the following NRPE command on each server:

command[check_time]=<nagios plugin path>/check_ntp_time -w $ARG1$ -c $ARG2$ -H $ARG3$

On the Nagios server I've defined the following Nagios command in /etc/nagios-plugins/config/check_nrpe.cfg:

# 'check_nrpe_time' command definition
define command {
    command_name    check_nrpe_time
    command_line    $USER1$/check_nrpe -H '$HOSTADDRESS$' -c check_time -a $ARG1$ $ARG2$ $ARG3$
}

And the following Nagios service definitions are assigned to the various server groups:

# check_nrpe_time: ntp time check to NTP-1
define service {
    ...
    service_description     Check_time_NTP-1
    check_command           check_nrpe_time!<warn level>!<crit level>!10.1.1.1
}

# check_nrpe_time: ntp time check to NTP-2
define service {
    ...
    service_description     Check_time_NTP-2
    check_command           check_nrpe_time!<warn level>!<crit level>!10.1.1.2
}

The help output of check_ntp_time reads:

 ...
 -H, --hostname=ADDRESS
    Host name, IP Address, or unix socket (must be an absolute path)
 ...

which I believed to implicitly mean that the Nagios plugin knows how to differentiate between hostnames and IP addresses and behaves accordingly. Unfortunately this assumption turned out to be wrong.

Recently our support group responsible for the MS ADS had trouble with the DNS service provided by the domain controllers. Probably due to a data corruption, the DNS service could no longer resolve hostnames for some, but not all, of our internally used domains. Unfortunately, the time servers mentioned above are located in a domain that was affected by this issue. As a result, the above checks started to fail, although they were thought to be configured to use IP addresses.

The usual head scratching, debugging, testing and putting the pieces together ensued. After looking through the Nagios Plugins source code, it became clear that hostnames and IP addresses aren't treated differently at all – getaddrinfo() is called indiscriminately. Since in the case of an IP address this is not only unnecessary, but might also not be what was actually intended, I wrote a patch nagios-plugins_do_not_resolve_IP_address.patch for the current nagios-plugins-1.4.16 sources to fix this issue.
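The idea behind the patch can be sketched in shell terms: skip name resolution entirely when the argument already looks like an IPv4 address. This is a simplified illustration with a naive pattern match – the actual patch works at the C level around the getaddrinfo() call:

```shell
resolve_host() {
    host="$1"
    case "$host" in
        # Naive dotted-quad match -- good enough for an illustration.
        [0-9]*.[0-9]*.[0-9]*.[0-9]*)
            # Already an IP address: use it as-is, no DNS lookup needed.
            echo "$host"
            ;;
        *)
            # A hostname: resolve it (getent stands in for getaddrinfo()).
            getent hosts "$host" | awk '{ print $1; exit }'
            ;;
    esac
}

resolve_host 10.1.1.1   # prints 10.1.1.1 without ever touching DNS
```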

// Nagios Monitoring - Rittal CMC-TC with LCP

We use Rittal LCP – Liquid Cooling Package – units to chill the 19" racks and equipment in the datacenters. The LCPs come with their own Rittal CMC-TC units for management and monitoring purposes. With check_rittal_health there is already a Nagios plugin to monitor Rittal CMC-TC units. Unfortunately this plugin didn't cover the LCPs, which come with a plethora of built-in sensors. The existing plugin also didn't allow setting individual monitoring thresholds. Therefore I modified the existing plugin to accommodate our needs. The modified version can be downloaded here: check_rittal_health.pl.

The whole setup for monitoring Rittal CMC-TC and LCP with Nagios looks like this:

  1. Configure your CMC-TC unit; see the manual: CMC-TC Basic CMC DK 7320.111 - Montage, Installation und Bedienung. Essential are the network settings, a user for SNMPv3 access and an SNMP trap receiver (your Nagios server running SNMPTT). Optional, but highly recommended, are the settings for the NTP server, changing the default user passwords and disabling insecure services (Telnet, FTP, HTTP).

  2. Download the Nagios plugin check_rittal_health.pl and place it in the plugins directory of your Nagios system, in this example /usr/lib/nagios/plugins/:

    $ mv -i check_rittal_health.pl /usr/lib/nagios/plugins/
    $ chmod 755 /usr/lib/nagios/plugins/check_rittal_health.pl
    
  3. Define the following Nagios commands. In this example this is done in the file /etc/nagios-plugins/config/check_cmc.cfg:

    # check Rittal CMC status
    define command {
        command_name    check_cmc_status
        command_line    $USER1$/check_rittal_health.pl --hostname $HOSTNAME$ --protocol $ARG1$ --username $ARG2$ --authpassword $ARG3$ --customthresholds "$ARG4$"
    }
  4. Verify that a generic check command for a running SNMPD is already present in your Nagios configuration. If not, add a new check command like this:

    define command {
        command_name    check_snmpdv3
        command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -o .1.3.6.1.2.1.1.3.0 -P 3 -t 30 -L $ARG1$ -U $ARG2$ -A $ARG3$
    }

    Verify that a generic check command for an SSH service is already present in your Nagios configuration. If not, add a new check command like this:

    # 'check_ssh' command definition
    define command {
        command_name    check_ssh
        command_line    /usr/lib/nagios/plugins/check_ssh -t 20 '$HOSTADDRESS$'
    }

    Verify that a generic check command for an HTTPS service is already present in your Nagios configuration. If not, add a new check command like this:

    # 'check_https_port_uri' command definition
    define command {
        command_name    check_https_port_uri
        command_line    /usr/lib/nagios/plugins/check_http --ssl -I '$HOSTADDRESS$' -p '$ARG1$' -u '$ARG2$'
    }
  5. Define a group of services in your Nagios configuration to be checked for each CMC-TC device:

    # check host alive
    define service {
        use                     generic-service-pnp
        hostgroup_name          cmc
        service_description     Check_host_alive
        check_command           check-host-alive
    }
    
    # check sshd
    define service {
        use                     generic-service
        hostgroup_name          cmc
        service_description     Check_SSH
        check_command           check_ssh
    }
    
    # check snmpd
    define service {
        use                     generic-service
        hostgroup_name          cmc
        service_description     Check_SNMPDv3
        check_command           check_snmpdv3!authNoPriv!<user>!<pass>
    }
    
    # check httpd
    define service {
        use                     generic-service-pnp
        hostgroup_name          cmc
        service_description     Check_service_https
        check_command           check_https_port_uri!443!/
    }
    
    # check Rittal CMC status
    define service {
        use                     generic-service-pnp
        servicegroups           snmpchecks
        hostgroup_name          cmc
        service_description     Check_CMC_Status
        check_command           check_cmc_status!3!<user>!<pass>!airTemp,15:32,10:35\;coolingCapacity,0:10000,0:15000\;events,0:1,0:2\;fan,450:2000,400:2500\;temp,15:30,10:35\;waterFlow,0:70,0:100\;waterTemp,10:25,5:30
    }

    Replace generic-service-pnp with your Nagios service template that has performance data processing enabled. Replace <user> and <pass> with the user credentials configured on the CMC-TC devices for SNMPv3 access. Adjust the sensor threshold settings according to your requirements; see the output of check_rittal_health.pl -h for an explanation of the threshold settings format.

  6. Define a service dependency to run the Check_CMC_Status check only if Check_SNMPDv3 ran successfully:

    # Rittal CMC SNMPD dependencies
    define servicedependency {
        hostgroup_name                  cmc
        service_description             Check_SNMPDv3
        dependent_service_description   Check_CMC_Status
        execution_failure_criteria      c,p,u,w
        notification_failure_criteria   c,p,u,w
    }
  7. Define hosts in your Nagios configuration for each CMC-TC device. In this example it is named cmc-host1:

    define host {
        use         cmc
        host_name   cmc-host1
        alias       Rittal CMC LPC
        address     10.0.0.1
        parents     parent_lan
    }

    Replace cmc with your Nagios host template for the CMC-TC devices. Adjust the address and parents parameters according to your environment.

  8. Define a hostgroup in your Nagios configuration for all CMC-TC devices. In this example it is named cmc. The above checks are run against each member of the hostgroup:

    define hostgroup {
        hostgroup_name  cmc
        alias           Rittal CMC
        members         cmc-host1
    }
  9. Run a configuration check and if successful reload the Nagios process:

    $ /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg
    $ /etc/init.d/nagios3 reload
    

The new hosts and services should soon show up in the Nagios web interface.

If the Nagios server is running SNMPTT and was configured as an SNMP trap receiver in step 1 above, SNMPTT also needs to be configured to understand the incoming SNMP traps from CMC-TC devices. This can be achieved with the following steps:

  1. Convert the Rittal SNMP MIB definitions in CMC-TC_MIB_v1.1h.txt into a format that SNMPTT can understand:

    $ /opt/snmptt/snmpttconvertmib --in=MIB/CMC-TC_MIB_v1.1h.txt --out=/opt/snmptt/conf/snmptt.conf.rittal-cmc-tc
    
    ...
    Done
    
    Total translations:        10
    Successful translations:   10
    Failed translations:       0
    
  2. The trap severity settings should be pretty reasonable by default, but you can edit them according to your requirements with:

    $ vim /opt/snmptt/conf/snmptt.conf.rittal-cmc-tc
    
  3. Add the new configuration file to be included in the global SNMPTT configuration and restart the SNMPTT daemon:

    $ vim /opt/snmptt/snmptt.ini
    
    ...
    [TrapFiles]
    snmptt_conf_files = <<END
    ...
    /opt/snmptt/conf/snmptt.conf.rittal-cmc-tc
    ...
    END
    
    $ /etc/init.d/snmptt reload
    
  4. Download the Nagios plugin check_snmp_traps.sh and place it in the plugins directory of your Nagios system, in this example /usr/lib/nagios/plugins/:

    $ mv -i check_snmp_traps.sh /usr/lib/nagios/plugins/
    $ chmod 755 /usr/lib/nagios/plugins/check_snmp_traps.sh
    
  5. Define the following Nagios command to check for SNMP traps in the SNMPTT database. In this example this is done in the file /etc/nagios-plugins/config/check_snmp_traps.cfg:

    # check for snmp traps
    define command {
        command_name    check_snmp_traps
        command_line    $USER1$/check_snmp_traps.sh -H $HOSTNAME$:$HOSTADDRESS$ -u <user> -p <pass> -d <snmptt_db>
    }

    Replace user, pass and snmptt_db with values suitable for your SNMPTT database environment.

  6. Add another service in your Nagios configuration to be checked for each CMC device:

    # check snmptraps
    define service {
        use                     generic-service
        hostgroup_name          cmc
        service_description     Check_SNMP_traps
        check_command           check_snmp_traps
    }
  7. Optional: Define a serviceextinfo to display a folder icon next to the Check_SNMP_traps service check for each CMC device. This icon provides a direct link to the SNMPTT web interface with a filter for the selected host:

    define  serviceextinfo {
        hostgroup_name          cmc
        service_description     Check_SNMP_traps
        notes                   SNMP Alerts
        #notes_url               http://<hostname>/nagios3/nagtrap/index.php?hostname=$HOSTNAME$
        #notes_url               http://<hostname>/nagios3/nsti/index.php?perpage=100&hostname=$HOSTNAME$
    }

    Uncomment the notes_url depending on which web interface (nagtrap or nsti) is used. Replace hostname with the FQDN or IP address of the server running the web interface.

  8. Run a configuration check and if successful reload the Nagios process:

    $ /usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg
    $ /etc/init.d/nagios3 reload
    
  9. Optional: If you're running PNP4Nagios v0.6 or later to graph Nagios performance data, you can use the check_cmc_status.php PNP4Nagios template to beautify the graphs. Download the PNP4Nagios template check_cmc_status.php and place it in the PNP4Nagios template directory, in this example /usr/share/pnp4nagios/html/templates/:

    $ mv -i check_cmc_status.php /usr/share/pnp4nagios/html/templates/
    $ chmod 644 /usr/share/pnp4nagios/html/templates/check_cmc_status.php
    

    The following image shows an example of what the PNP4Nagios graphs look like for a Rittal CMC-TC with a LCP-T3+ unit:

    PNP4Nagios graphs for a Rittal CMC-TC with a LCP-T3+ unit

All done, you should now have a complete Nagios-based monitoring solution for your Rittal CMC-TC and LCP devices.
