bityard Blog

// TSM DLLA Procedure Performance

Some time ago we were hit by the dreaded DB corruption on one of our TSM 5.5.5.2 server instances. By investigating the object IDs with the hidden SHOW INVO command, we found out exactly which objects were affected (some Windows filesystem backups and Oracle database archivelog backups) and backed them up again. Getting rid of the broken remains was not so easy, though. The usual AUDIT VOLUME … FIX=YES took care of some of the issues, but not all. Because of the remaining defective objects and database entries, we're seeing error messages:

ANR9999D_0902881829 DetermineBackupRetention(imexp.c:7812) Thread<1150837>: No inactive versions found for 0:392144678
ANR9999D Thread<1150837> issued message 9999 from:
ANR9999D Thread<1150837>  000000010000c7e8 StdPutText
ANR9999D Thread<1150837>  000000010000fb90 OutDiagToCons
ANR9999D Thread<1150837>  000000010000a2d0 outDiagfExt
ANR9999D Thread<1150837>  0000000100784bcc DetermineBackupRetention
ANR9999D Thread<1150837>  0000000100789024 ExpirationQualifies
ANR9999D Thread<1150837>  000000010078b48c ExpirationProcess
ANR9999D Thread<1150837>  000000010078e550 ImDoExpiration
ANR9999D Thread<1150837>  000000010001509c StartThread
ANR9999D_2753579289 ExpirationQualifies(imexp.c:5116) Thread<1150837>: DetermineBackupRetention for 0:392144678 failed, rc=19
ANR9999D Thread<1150837> issued message 9999 from:
ANR9999D Thread<1150837>  000000010000c7e8 StdPutText
ANR9999D Thread<1150837>  000000010000fb90 OutDiagToCons
ANR9999D Thread<1150837>  000000010000a2d0 outDiagfExt
ANR9999D Thread<1150837>  000000010078905c ExpirationQualifies
ANR9999D Thread<1150837>  000000010078b48c ExpirationProcess
ANR9999D Thread<1150837>  000000010078e550 ImDoExpiration
ANR9999D Thread<1150837>  000000010001509c StartThread

on the daily EXPIRE INVENTORY, and the contents of some tape volumes cannot be moved or reclaimed. Calling TSM support at IBM was not as helpful as we had hoped. Although there are undocumented commands to manipulate database entries directly, we were told the only way to clean up the logical inconsistencies was to perform a DUMPDB / LOADFORMAT / LOADDB / AUDITDB (DLLA) procedure, or to create a new TSM server instance and move all nodes over from the broken instance.
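To get an overview of which objects keep tripping up the expiration, the object IDs can be pulled out of the ANR9999D diagnostics. The following is only a minimal sketch: it assumes the messages have been saved as plain text lines in the format shown above, and the function name and the regular expression are illustrative, not part of any TSM tooling.

```python
import re
from collections import Counter

# Matches an object ID of the form "<number>:<number>" following the word "for",
# as it appears in the two diagnostic messages quoted above.
OBJECT_ID = re.compile(r"\bfor (\d+:\d+)\b")

def affected_objects(log_lines):
    """Count how often each object ID is mentioned in ANR9999D lines."""
    counts = Counter()
    for line in log_lines:
        if not line.startswith("ANR9999D"):
            continue  # ignore everything that is not a diagnostic message
        match = OBJECT_ID.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample = [
    "ANR9999D_0902881829 DetermineBackupRetention(imexp.c:7812) "
    "Thread<1150837>: No inactive versions found for 0:392144678",
    "ANR9999D Thread<1150837> issued message 9999 from:",
    "ANR9999D_2753579289 ExpirationQualifies(imexp.c:5116) "
    "Thread<1150837>: DetermineBackupRetention for 0:392144678 failed, rc=19",
]

for object_id, hits in affected_objects(sample).items():
    print(object_id, hits)
```

Stack trace lines carry no object ID and are skipped, so each failing object is counted once per diagnostic message that names it.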

We went for the first option, the DLLA procedure. The runtime of the DLLA procedure, and thus the downtime of the TSM server, depends largely on the size of the database, the number of objects in the database and probably the number of inconsistencies too. Since there was no way to even roughly estimate the runtime, we decided to test the DLLA procedure on a non-production test system. This way we could also familiarize ourselves with the steps needed (luckily this kind of activity is not among the regular duties of a TSM admin) and check whether the issues were actually resolved by the DLLA procedure. There's a good chance they won't be, in which case you'd have to fall back to the second option of creating a new TSM server instance and moving over all nodes and their data via node export/import!
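For such a test run it helps to record the wall-clock time of every step in one place. The sketch below shows one possible way to do this and is not the method we actually used. The step commands are harmless placeholders; the real dsmserv invocations and their arguments are deliberately left out and would have to be taken from the TSM documentation.

```python
import subprocess
import sys
import time

def run_timed(steps):
    """Run each (name, argv) step in order and return a list of (name, seconds).

    Stops after the first failing step, since every DLLA step depends on
    the previous one having completed successfully.
    """
    results = []
    for name, argv in steps:
        start = time.monotonic()
        completed = subprocess.run(argv)
        results.append((name, time.monotonic() - start))
        if completed.returncode != 0:
            break
    return results

def format_duration(seconds):
    """Render seconds as 'Xh Ym Zs'."""
    hours, rest = divmod(int(round(seconds)), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"

# Placeholder command that does nothing; substitute the real step commands.
placeholder = [sys.executable, "-c", "pass"]
steps = [
    ("DUMPDB", placeholder),
    ("LOADFORMAT", placeholder),
    ("LOADDB", placeholder),
    ("AUDITDB", placeholder),
]

timings = run_timed(steps)
for name, seconds in timings:
    print(f"{name:<10} {format_duration(seconds)}")
```

time.monotonic() is used instead of the wall clock so that a clock adjustment during a run of many hours does not distort the measurement.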

We're running TSM 5.5.5.2 within AIX 6.1 LPARs, and the test system had the exact same versions. The database size is 96 GB with ~650 million objects. We tried a lot of different test setups to optimize the DLLA runtime. The main iterations, where the largest gains in runtime were visible, were:

  1. Test 1 (initial test): Power6+ @ 5 GHz; 2 shared CPUs; Dump FS, DB and Log disks on SVC LUNs provided by an EMC Clariion with 15k FC disks.

  2. Test 2: Power6+ @ 5 GHz; 2 shared CPUs; Dump FS, DB and Log disks on SVC LUNs provided by a TMS RamSan-630 flash array.

  3. Test 3: Power6+ @ 5 GHz; 3 dedicated CPUs; DB and Log on RamDisk devices; Dump FS on SVC LUNs provided by a TMS RamSan-630 flash array.

  4. Test 4: Power7 @ 3.1 GHz; 4 dedicated CPUs; DB and Log on RamDisk devices; Dump FS on SVC LUNs provided by a TMS RamSan-630 flash array.

The runtime for each DLLA step in those iterations was:

DLLA step runtime (h = hours, m = minutes, s = seconds; "-" = no value given)

DLLA Step    Test 1      Test 2      Test 3          Test 4
DUMPDB       0h 15m      0h 5m       0h 4m 11s       0h 3m 12s
LOADFORMAT   -           -           0h 0m 20s       0h 0m 13s
LOADDB       11h 30m     4h 45m      4h 54m 55s      3h 15m 4s
AUDITDB      30h         9h 36m      7h 34m 14s      5h 18m 55s
SUM          ~41h 45m    ~14h 26m    ~12h 33m 40s    ~8h 37m 24s
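The SUM row can be cross-checked with a few lines of Python. This is just a sketch with hypothetical helper names. It assumes the two LOADFORMAT values belong to Test 3 and Test 4, which is the assignment under which the per-step values add up exactly to the sums in the table.

```python
import re

UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}

def to_seconds(text):
    """Convert a duration such as '4h 54m 55s' or '30h' to seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)\s*([hms])", text))

def to_text(seconds):
    """Render seconds as 'Xh Ym Zs'."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"

# Per-step runtimes copied from the table; Test 1 and Test 2 have no
# LOADFORMAT value.
runtimes = {
    "Test 1": ["0h 15m", "11h 30m", "30h"],
    "Test 2": ["0h 5m", "4h 45m", "9h 36m"],
    "Test 3": ["0h 4m 11s", "0h 0m 20s", "4h 54m 55s", "7h 34m 14s"],
    "Test 4": ["0h 3m 12s", "0h 0m 13s", "3h 15m 4s", "5h 18m 55s"],
}

totals = {test: sum(to_seconds(d) for d in steps) for test, steps in runtimes.items()}
for test, total in totals.items():
    print(f"{test}: {to_text(total)}")
```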

As those numbers show, there is vast room for runtime improvement. Between Test 1 and Test 2 the runtime dropped to roughly 41% for LOADDB and to about a third for AUDITDB. Since the only change between Test 1 and Test 2 was the move of the storage to a system that is very good at low-latency, random I/O, it's safe to say the setup was I/O bound in Test 1.
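The gain from Test 1 to Test 2 can be worked out per step from the table values. A small sketch, with illustrative names:

```python
def hours(h=0, m=0, s=0):
    """Convert hours, minutes and seconds to fractional hours."""
    return h + m / 60 + s / 3600

# LOADDB and AUDITDB runtimes for Test 1 and Test 2, taken from the table.
test1 = {"LOADDB": hours(11, 30), "AUDITDB": hours(30)}
test2 = {"LOADDB": hours(4, 45), "AUDITDB": hours(9, 36)}

ratios = {step: test2[step] / test1[step] for step in test1}
for step, ratio in ratios.items():
    print(f"{step}: Test 2 needed {ratio:.0%} of the Test 1 runtime")
```

This comes out at about 41% for LOADDB and 32% for AUDITDB.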

With the system being I/O bound, we tried to lift this constraint further by moving to RamDisk based DB and Log volumes in Test 3. There was also the suspicion that the scheduling of the shared CPU resources back and forth between the LPAR and the hypervisor could have a negative impact. To mitigate this we also switched to dedicated CPU assignments. Unfortunately the result of Test 3 was “only” another reduction in runtime of just under 2 hours.

While observing the system's activity during the LOADDB and AUDITDB phases with different system monitoring tools, including truss, we noticed a rather high level of activity in the area of pthreads. Although the Power7 systems at hand had a lower clock rate than the Power6+ systems, the connection between memory and CPU is very much improved with Power7. Thread intercommunication and synchronisation in particular should benefit from this. The result of the move to Power7 (Test 4) was another reduction in runtime of almost 4 hours.

The end result of approximately 8.5 hours of DLLA runtime is still a lot, but much more manageable than the initial 42 hours in terms of a possible downtime window. Considering the kind of hardware resources we had to throw at this issue, it makes one wonder how TSM support and development could see the DLLA procedure as an actually valid way to resolve inconsistencies in the TSM database. The whole process, as well as each of its steps, appears to be implemented horribly inefficiently and eats resources like crazy. I know that the development of TSM 5.x virtually dried up some time ago with the advent of TSM 6.x and its DB2 database backend. But should one really believe that in the past there was never the time or the necessity to come up with another, more practical way to deal with TSM database issues?
