With all the buzz and hype around flash storage, maybe you've already asked yourself whether flash-based storage is really all it's cracked up to be. Maybe you're already convinced that flash-based storage can ease or even solve some of the issues and challenges you're facing within your infrastructure, but you need some numbers to convince upper management to provide the necessary (and still quite substantial) funding for it. Well, in any case, here's a hands-on, before-and-after example of the use of flash-based storage.
We're currently running two IBM Tivoli Storage Manager (TSM) servers for our backup infrastructure. They're still on TSM version 184.108.40.206, so the dreaded hard limit of 13GB for the transaction log space of the catalog database applies. The databases on each TSM server are around 100GB in size. The database volumes as well as the transaction log volumes reside on RAID-1 LUNs provided by two IBM DCS3700 storage systems, which are distributed over two datacenters. The LUNs are backed by 300GB 15k RPM SAS disks. Redundancy is provided by TSM database and transaction log volume mirroring over the two storage systems. The two DCS3700 and the TSM server hardware (IBM Power with AIX) are attached to a dual-fabric 8Gbit FC SAN. The following image shows an overview of the whole backup infrastructure, with some additional components not discussed in this context:
With the increasing number of Windows 2008 servers to be backed up as TSM clients, we noticed very heavy database and transaction log activity on the TSM servers. At busy times, this would even lead to a situation where the 12GB transaction logs would fill up and a manual recovery of the TSM server would be necessary. The strain on the database was so high that the database backup process triggered to free up transaction log space would just sit there, showing no activity. Sometimes two hours would pass between the start of the database backup process and the first database pages being processed. After some research and raising a PMR with IBM, it turned out that the handling of the Windows 2008 system state backup was modeled with the DB2-backed catalog database of TSM version 6 in mind. The embedded database of TSM version 5 was apparently no longer considered and was simply not up to the task (see IBM TSM Flash "Windows system state backup with a V5 Tivoli Storage Manager server"). So the suggested solutions were:
Migration to TSM server version 6.
Spread client backup windows over time and/or set up additional TSM server instances to take over the load.
Disable Windows system state backup.
For various reasons we ended up spreading the system state backup of the Windows clients over time, which allowed us to get by for quite some time. But in the end even this didn't help anymore.
Luckily, around that time we still had some free space left over on our four TMS RamSan 630 and 810 systems. After updating the OS of the TSM servers to AIX 220.127.116.11 and installing iFix IV38225, we were able to attach the flash-based LUNs with proper multipathing support. We then moved the database and transaction log volume groups over to the flash storage with the AIX migratepv command. The effect was incredible and instantaneous: without any other changes to the client or server environment, the database backup trigger at 50% transaction log space didn't fire even once during the next backup window! Gathering the available historical runtime data of database backup processes and graphing them over time confirmed the incredible performance gain for database backups on both TSM server instances:
Another I/O-intensive operation on the TSM server is the expiration of backup and archive objects to be deleted from the catalog database. In our case this process is run daily on each TSM server. Given the above results, chances were that we'd see an improvement in this area too. As above, we gathered the available historical runtime data of expire inventory processes and graphed them over time:
Monitoring the situation for several weeks after the migration to flash-based storage volumes showed us several interesting facts that we found to be characteristic of our overall experience with flash vs. disk-based storage:
As expected, an increased number of I/O operations per second (IOPS) and thus a generally increased throughput.
In this particular case this is reflected by a number of symptoms:
A greatly reduced number of unintended database backups that were triggered by a filling transaction log.
A generally lower transaction log usage, probably because the increased number of available IOPS allowed more database transactions to complete in time.
A greatly reduced runtime of the deliberate database backups and the expire inventory processes started as part of the daily TSM server maintenance.
A very low variance of the response time, which is independent of the load on the system. This is especially in contrast to disk-based storage systems, where one can observe a snowballing effect of increasing latency under medium and heavy load. In the above example graphs this is represented indirectly by the low-level runtime plateau after the migration to flash-based storage.
A shift of the performance bottleneck into other areas. Previously, disk I/O was the quite convenient excuse for performance issues, and reducing it was the best countermeasure. With the introduction of flash-based storage the focus has shifted, and other areas like CPU, memory, network and storage-network latency are now in the spotlight.
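The snowballing latency of disk-based storage under load can be illustrated with a textbook single-server queueing model (M/M/1), in which the mean response time is the service time divided by (1 - utilization). This is a simplified model chosen for illustration, not a measurement from our systems, and the function name and the utilization values are assumptions made for this sketch:

```python
def response_time_factor(utilization):
    """Mean response time as a multiple of the bare service time (M/M/1 model)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in the range [0, 1)")
    return 1.0 / (1.0 - utilization)


# Illustrative utilization levels (assumed, not measured).
for rho in (0.1, 0.5, 0.8, 0.9, 0.95):
    factor = response_time_factor(rho)
    print(f"utilization {rho:.0%}: response time = {factor:.1f} x service time")
```

In this model the response time doubles at 50% utilization and is ten times the service time at 90%. A device with a much shorter service time reaches a given utilization only at a correspondingly higher request rate, which is one plausible reading of why the flash-based volumes stay on a flat plateau under the same workload.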
Some time ago we were hit by the dreaded DB corruption on one of our TSM 18.104.22.168 server instances. Upon investigating the object IDs with the hidden SHOW INVO command, we found exactly which objects were affected (some Windows filesystem backups and Oracle database archivelog backups) and backed them up again. Getting rid of the broken remains was not so easy, though. The usual AUDIT VOLUME … FIX=YES took care of some of the issues, but not all. Because of those remaining defective objects and database entries, we're seeing the following error messages on the daily EXPIRE INVENTORY:
ANR9999D_0902881829 DetermineBackupRetention(imexp.c:7812) Thread<1150837>: No inactive versions found for 0:392144678
ANR9999D Thread<1150837> issued message 9999 from:
ANR9999D Thread<1150837> 000000010000c7e8 StdPutText
ANR9999D Thread<1150837> 000000010000fb90 OutDiagToCons
ANR9999D Thread<1150837> 000000010000a2d0 outDiagfExt
ANR9999D Thread<1150837> 0000000100784bcc DetermineBackupRetention
ANR9999D Thread<1150837> 0000000100789024 ExpirationQualifies
ANR9999D Thread<1150837> 000000010078b48c ExpirationProcess
ANR9999D Thread<1150837> 000000010078e550 ImDoExpiration
ANR9999D Thread<1150837> 000000010001509c StartThread
ANR9999D_2753579289 ExpirationQualifies(imexp.c:5116) Thread<1150837>: DetermineBackupRetention for 0:392144678 failed, rc=19
ANR9999D Thread<1150837> issued message 9999 from:
ANR9999D Thread<1150837> 000000010000c7e8 StdPutText
ANR9999D Thread<1150837> 000000010000fb90 OutDiagToCons
ANR9999D Thread<1150837> 000000010000a2d0 outDiagfExt
ANR9999D Thread<1150837> 000000010078905c ExpirationQualifies
ANR9999D Thread<1150837> 000000010078b48c ExpirationProcess
ANR9999D Thread<1150837> 000000010078e550 ImDoExpiration
ANR9999D Thread<1150837> 000000010001509c StartThread
In addition, the content of some tape volumes cannot be moved or reclaimed. Calling up TSM support at IBM was not as helpful as we had hoped. Although there are undocumented commands to manipulate database entries directly, we were told the only way to clean up the logical inconsistencies was to perform a DUMPDB / LOADFORMAT / LOADDB / AUDITDB (DLLA) procedure or to create a new TSM server instance and move over all nodes from the broken instance.
We went for the first option, the DLLA procedure. The runtime of the DLLA procedure, and thus the downtime for the TSM server, depends largely on the size of the database, the number of objects in the database and probably the number of inconsistencies too. Since there was no way to even roughly estimate the runtime, we decided to test the DLLA procedure on a non-production test system. This way we could also familiarize ourselves with the steps needed (luckily this kind of activity does not fall in the category of regular duties for a TSM admin) and check whether the issues were actually resolved by the DLLA procedure. There's a good chance they won't be, in which case you'd have to fall back to the second option of creating a new TSM server instance and moving over all nodes and their data via node export/import!
We're running TSM 22.214.171.124 within AIX 6.1 LPARs; the test system had the exact same versions. The database size is 96GB with ~650 million objects. We tried a lot of different test setups to optimize the DLLA runtime. The main iterations, where the largest gains in runtime were visible, were:
Test 1 (initial test): Power6+ @ 5 GHz; 2 shared CPUs; Dump FS, DB and Log disks on SVC LUNs provided by an EMC Clariion with 15k FC disks.
Test 2: Power6+ @ 5 GHz; 2 shared CPUs; Dump FS, DB and Log disks on SVC LUNs provided by a TMS RamSan-630 flash array.
Test 3: Power6+ @ 5 GHz; 3 dedicated CPUs; DB and Log on RamDisk devices; Dump FS on SVC LUNs provided by a TMS RamSan-630 flash array.
Test 4: Power7 @ 3.1 GHz; 4 dedicated CPUs; DB and Log on RamDisk devices; Dump FS on SVC LUNs provided by a TMS RamSan-630 flash array.
The runtime for each DLLA step in those iterations was:
Runtimes are given in hours (h), minutes (m) and seconds (s).
|DLLA Step|Test 1|Test 2|Test 3|Test 4|
|---|---|---|---|---|
|DUMPDB|0h 15m|0h 5m|0h 4m 11s|0h 3m 12s|
|LOADFORMAT|–|–|0h 0m 20s|0h 0m 13s|
|LOADDB|11h 30m|4h 45m|4h 54m 55s|3h 15m 4s|
|AUDITDB|30h|9h 36m|7h 34m 14s|5h 18m 55s|
|SUM|~41h 45m|~14h 26m|~12h 33m 40s|~8h 37m 24s|
As those numbers show, there is vast room for runtime improvement. Between Test 1 and 2 the runtime dropped to roughly 40% for LOADDB and to about a third for AUDITDB. Since the only change between Test 1 and 2 was the move of the storage to a system that is very good at low-latency, random I/O, it's safe to say the setup was I/O bound in Test 1.
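The sums and ratios can be checked directly from the step runtimes in the table. The following is a small sketch for that arithmetic; the helper names are made up for this example, and the LOADFORMAT steps that were not recorded in Tests 1 and 2 are counted as zero:

```python
import re

# Step runtimes copied from the table, one entry per test (Test 1..4).
RUNTIMES = {
    "DUMPDB":     ["0h 15m", "0h 5m", "0h 4m 11s", "0h 3m 12s"],
    "LOADFORMAT": [None, None, "0h 0m 20s", "0h 0m 13s"],  # not recorded for Tests 1 and 2
    "LOADDB":     ["11h 30m", "4h 45m", "4h 54m 55s", "3h 15m 4s"],
    "AUDITDB":    ["30h", "9h 36m", "7h 34m 14s", "5h 18m 55s"],
}

UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def to_seconds(text):
    """Convert a string such as '4h 54m 55s' to seconds; None counts as 0."""
    if text is None:
        return 0
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([hms])", text))


def fmt(seconds):
    """Format a number of seconds as 'Xh Ym Zs'."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


def total(test_index):
    """Total DLLA runtime in seconds for one test (0-based index)."""
    return sum(to_seconds(times[test_index]) for times in RUNTIMES.values())


def ratio(step, before, after):
    """Runtime of a step in test 'after' as a fraction of test 'before'."""
    return to_seconds(RUNTIMES[step][after]) / to_seconds(RUNTIMES[step][before])


for i in range(4):
    print(f"Test {i + 1}: {fmt(total(i))}")
for step in ("LOADDB", "AUDITDB"):
    print(f"{step}: Test 2 took {ratio(step, 0, 1):.0%} of the Test 1 runtime")
```

The totals match the SUM row of the table, and the per-step ratios come out at about 41% for LOADDB and 32% for AUDITDB between Test 1 and Test 2.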
Having an I/O-bound system, we tried to lift this constraint further by moving to RamDisk-based DB and Log volumes in Test 3. There was also the suspicion that the scheduling of the shared CPU resources back and forth between the LPAR and the hypervisor could have a negative impact. To mitigate this we also switched to dedicated CPU assignments. Unfortunately the result of Test 3 was “only” another 2 hours of reduction in runtime.
While observing the system's activity during the LOADDB and AUDITDB phases with different system monitoring tools, including truss, we noticed a rather high level of activity in the area of pthreads. Although the Power7 systems at hand had a lower clock rate than the Power6+ systems, the connection between memory and CPU is very much improved with Power7. Thread intercommunication and synchronisation in particular should benefit from this. The result of the move to Power7 (Test 4) was another 4 hours of reduction in runtime.
The end result of approximately 8.5 hours of DLLA runtime is still a lot, but much more manageable than the initial 42 hours in terms of a possible downtime window. Considering the kind of hardware resources we had to throw at this issue, it makes one wonder how TSM support and development could see the DLLA procedure as an actually valid way to resolve inconsistencies in the TSM database. The whole process, as well as each of its steps, appears to be implemented horribly inefficiently and eats resources like crazy. I know that the development of TSM 5.x virtually dried up some time ago with the advent of TSM 6.x and its DB2 database backend. But should one really believe that in the past there was neither the time nor the necessity to come up with another, more practical way to deal with TSM database issues?