Description TSM

This overview of “How TSM works” was first published in German in GWDG news 11/2014 and will be published in English in GWDG news 1-2/2016:

A Short Abstract On How TSM Works and The Principles Behind It

When talking to customers and colleagues about backup and restore, we find that many of them know the basic principles but almost nothing about how the IBM Tivoli Storage Manager (TSM) implements them. This document gives a short overview of “how TSM works”, so that readers do not have to work through the extensive documentation IBM offers for TSM [1]. Many details not mentioned here can, of course, be found there. The original German text was co-authored by Wolfgang Hitzler, an IBM TSM Pre-Sales Engineer.

Basic Principles

Like most backup solutions, the IBM Tivoli Storage Manager (TSM) follows the “client-server principle”. The backup client is installed on the client machine. We refer to both the backup client software and the backup client machine as the “client”. The client transfers the backup data to the TSM server, which then writes the data to a low-cost but secure storage medium. Depending on the actual requirements, either magnetic tapes or disk systems can be used. At this point the similarities between backup solutions end. How backup data is identified, transferred to the server, and processed on the server can vary considerably from one backup solution to another. This article does not compare backup solutions; it focuses on how GWDG’s TSM works.

TSM Client Mechanism: “File-Level Backup”

The TSM client bases its backup on file spaces. First, the client requests from the TSM server a list of the files held in the backup, together with their attributes (metadata). It then builds a virtual directory tree from this data in the client’s memory. Using the file and directory attributes, the TSM client compares the virtual directory tree with its physical counterpart. Files and directories whose attributes differ are transmitted to the server. Since only changed files and metadata are transmitted, TSM uses an incremental backup strategy [2]. TSM is not designed to perform regular full backups, which is why IBM calls this principle “incremental forever”.
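The comparison step described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the actual client logic: the `Attrs` type, the choice of size and modification time as the compared attributes, and the function name `plan_incremental` are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Attrs(NamedTuple):
    """Hypothetical subset of the attributes kept per object."""
    size: int
    mtime: int  # modification time, seconds since the epoch


def plan_incremental(
    server_view: Dict[str, Attrs],
    local_view: Dict[str, Attrs],
) -> Tuple[List[str], List[str]]:
    """Return (objects to send, objects no longer present locally)."""
    # New objects and objects whose attributes differ must be transmitted.
    to_send = sorted(
        path for path, attrs in local_view.items()
        if server_view.get(path) != attrs
    )
    # Objects known to the server but missing on the client.
    to_expire = sorted(path for path in server_view if path not in local_view)
    return to_send, to_expire


server_view = {
    "/home/a.txt": Attrs(100, 1000),
    "/home/b.txt": Attrs(200, 1000),
    "/home/c.txt": Attrs(300, 1000),
}
local_view = {
    "/home/a.txt": Attrs(100, 1000),  # unchanged
    "/home/b.txt": Attrs(250, 2000),  # changed
    "/home/d.txt": Attrs(50, 2000),   # new
}

to_send, to_expire = plan_incremental(server_view, local_view)
print(to_send)    # ['/home/b.txt', '/home/d.txt']
print(to_expire)  # ['/home/c.txt']
```

Note that every local object has to be inspected even though only two are sent, which is why the number of objects dominates the run time.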

Searching through the drives and partitions is likely to be much more time-consuming than the actual data transfer, although this varies greatly with the amount of data. For the search process, the volume of data matters less than the number of files and directories (which TSM counts together as “objects”). This is clearly visible in the summary at the end of a client’s log file, shown below for a medium-sized client (4.6 million objects, 8.6 TByte, about 514 MByte of new data):

Total number of objects inspected:         4,629,154
Total number of objects backed up:             2,134
Total number of objects updated:                 414
Total number of objects rebound:                   0
Total number of objects deleted:                   0
Total number of objects expired:                 454
Total number of objects failed:                   10
Total number of bytes inspected:             8.64 TB
Total number of bytes transferred:         513.64 MB
Data transfer time:                        13.06 sec
Network data transfer rate:         40,269.79 KB/sec
Aggregate data transfer rate:          107.49 KB/sec
Objects compressed by:                           0 %
Total data reduction ratio:                 100.00 %
Elapsed processing time:                    01:21:33

While the transmission itself took just a few seconds (13.06 sec), the entire backup took about 1 hour and 22 minutes (01:21:33). The client spent most of that time identifying changed files. The discrepancy also shows in the transfer rates: the “Network data transfer rate” is about 40 MByte/sec, whereas the “Aggregate data transfer rate” (the total size of the backed-up files divided by the total backup time) is much lower at 107.5 KByte/sec.
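The two rates can be recomputed from the figures in the log summary. The short Python calculation below assumes that the log uses binary units (1 MB = 1024 KB). The results agree with the logged values up to the rounding of the printed inputs.

```python
KB_PER_MB = 1024  # assumption: the client log uses binary units

transferred_mb = 513.64               # "Total number of bytes transferred"
transfer_time_s = 13.06               # "Data transfer time"
elapsed_s = 1 * 3600 + 21 * 60 + 33   # "Elapsed processing time" 01:21:33

# Rate while data was actually flowing over the network.
network_rate = transferred_mb * KB_PER_MB / transfer_time_s
# Rate averaged over the whole backup run, including the lookup time.
aggregate_rate = transferred_mb * KB_PER_MB / elapsed_s

print(f"network rate:   {network_rate:,.2f} KB/sec")
print(f"aggregate rate: {aggregate_rate:,.2f} KB/sec")
print(f"time spent transferring: {transfer_time_s / elapsed_s:.2%}")
```

The last line shows that the transfer accounts for well under one percent of the elapsed time.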

In the past, IBM focused on the problem of low bandwidth (keywords: “data compression” and “deduplication” [4]). With client version 6.3, IBM started to address what is called the “seek time” or “lookup time”. For Windows and Linux systems, a journal function (journal daemon/service) was added. It monitors the local mass storage and records all changes made since the last backup. The TSM client can read this list and selects only the files it reports as changed.
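The idea behind the journal can be illustrated with a small Python sketch. The `ChangeJournal` class and its methods are hypothetical stand-ins for the journal service and do not reflect IBM’s implementation. The point is only that the backup reads a prepared list instead of scanning every object.

```python
from typing import List, Set


class ChangeJournal:
    """Hypothetical stand-in for the journal service: records changed paths."""

    def __init__(self) -> None:
        self._changed: Set[str] = set()

    def record(self, path: str) -> None:
        """Called whenever the monitored file system reports a change."""
        self._changed.add(path)

    def drain(self) -> List[str]:
        """Return the changed paths and reset the journal for the next run."""
        changed = sorted(self._changed)
        self._changed.clear()
        return changed


journal = ChangeJournal()
for path in ["/data/report.txt", "/data/log.txt", "/data/report.txt"]:
    journal.record(path)

candidates = journal.drain()   # the backup only looks at these objects
print(candidates)              # ['/data/log.txt', '/data/report.txt']
print(journal.drain())         # []
```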

Unfortunately, this function is not available for network storage (e.g. Linux: NFS; Windows and Mac: SMB, CIFS), because changes and backups can also be made from other devices. Continuously monitoring the network drives would be no cheaper than a regular incremental TSM backup. However, a few NAS filers (e.g. NetApp FAS series running ONTAP 7.3 or higher, IBM SONAS) can generate the list of changed files on the filer itself and provide it to TSM. IBM’s own file system GPFS supports this function as well. EMC Isilon systems can detect file changes, too, but because of differing character sets, evaluating the file list is expensive and error-prone. At present, simpler file systems do not support a comparable function.

In addition to file backups, an entire partition can be transferred. This so-called “image backup” is a kind of full backup, since the whole partition is transferred. The operating system or the services providing the file space must support a pause or copy function for the entire file space, because write operations during the backup might cause data inconsistency.

TSM Client Mechanism: "API Backup"

In addition to the “File-Level Backup”, TSM can use application APIs (e.g. SQL, Exchange). Special TSM clients are provided for the different APIs.

Examples:

  • TSM for Databases (for Oracle databases incl. RAC, and MS SQL Server)
  • TSM DataProtection Client / DB2 (for DB2 databases; this license is included in the DB2 license)
  • TSM for Virtual Environments (plugin in VMware vCenter; the VADP interface is used here)
  • TSM for Mail (backup for MS Exchange and Domino)

In general, the API clients work as follows: the client identifies the files that need to be backed up. Once it has collected them, they are transferred to the TSM server. Incremental data transfer is possible, but depends on the implementation of the respective API client.

Data Exchange between Server and Client

The data transfer between a TSM client and a TSM server uses a proprietary TSM protocol based on TCP/IP. By default, all data is sent as a plain stream. However, TSM provides two security features, which can be combined:

  1. Transport encryption via SSL
    The TSM server holds an SSL certificate [5] (similar to a web server) to authenticate itself to the TSM client. Server and client negotiate a cryptographic key for the actual data transfer. The data is decrypted at the server and processed (for example, written to tape) as plain data. SSL encryption brings higher security, but also a considerable workload for clients and especially for servers. Therefore, the TSM server of the GWDG does not support SSL.
  2. Complete data encryption
    The client encrypts all data before transferring it to the server. The TSM server then processes the encrypted data just like plain, unencrypted data: it is compressed and written to the LTO tapes. Because of the encryption, the compression is less effective than it would be for unencrypted data.

The GWDG strongly recommends complete encryption, as it protects the data from unauthorized access both during transmission and on the TSM servers and tape libraries.
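Why encrypted data compresses poorly can be shown with the Python standard library. The sketch below is an illustration under stated assumptions. It uses pseudo-random bytes as a stand-in for ciphertext (good ciphertext is statistically similar to random data) and `zlib` as a stand-in for the compression applied on the way to tape. Neither is what TSM or an LTO drive actually uses.

```python
import random
import zlib

SIZE = 100_000  # bytes per sample

# Highly repetitive "plain" data, as often found in documents and logs.
phrase = b"backup data with many repeated phrases "
plain = (phrase * (SIZE // len(phrase) + 1))[:SIZE]

# Stand-in for encrypted data: reproducible pseudo-random bytes.
rng = random.Random(0)
pseudo_ciphertext = bytes(rng.getrandbits(8) for _ in range(SIZE))


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data)) / len(data)


print(f"plain data:        {compression_ratio(plain):.3f}")
print(f"pseudo-ciphertext: {compression_ratio(pseudo_ciphertext):.3f}")
```

The repetitive sample shrinks to a small fraction of its size, while the random-looking sample does not shrink at all.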

TSM Server Mechanism: Data Management

Different approaches are used on TSM servers to optimize data management. The most important ones are described below:

Management Classes, Policy Sets And Copygroups

The parameters of a management class define settings such as storage location, retention period, and number of versions. The default management class can be overridden by providing special “include/exclude” rules to the client. Please contact us to determine which class suits your needs best.

Storage Areas

TSM uses a so-called “Storage Pool” to store the data transferred by a client. A storage pool is an abstract storage area that is usually not bound to a single physical storage medium such as a tape or disk. TSM requests tapes from a tape library or container files from an operating system on demand. A storage pool always consists of storage media with identical properties, but it can still hold data with different requirements.

Cache “Staging Pools”

To write directly to tape without interruption, a continuous stream of new data over a 10 Gb network interface would be needed. In practice, however, several of these conditions are not met:

  1. Data is transmitted in bursts, interrupted by the lookup time.
  2. Only a few clients have a 10 Gb connection.
  3. The amount of data from a single client is quite small compared to the total data processed on the TSM server.

Current tape drives have a high write throughput [6]. Writing client data directly to tape would therefore cause a stop-and-go state whenever the drive’s cache runs empty. This stop-and-go operation reduces the performance of the drive and shortens the tape’s lifetime, as the tape must be moved back and forth.
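A rough calculation makes the mismatch concrete. The Python sketch below uses the tape speed quoted in the footnotes (a minimum of 40 MB/sec) and the aggregate rate of the example client from the log summary. The function names are hypothetical, and the model ignores drive caches and protocol overhead.

```python
import math

TAPE_MIN_STREAM_MB_S = 40.0  # minimum streaming speed quoted for LTO-5/LTO-6


def can_stream(incoming_mb_s: float,
               min_stream_mb_s: float = TAPE_MIN_STREAM_MB_S) -> bool:
    """True if the incoming rate keeps the drive above its minimum speed."""
    return incoming_mb_s >= min_stream_mb_s


def clients_needed(per_client_mb_s: float,
                   target_mb_s: float = TAPE_MIN_STREAM_MB_S) -> int:
    """Number of identical clients whose combined rate reaches the target."""
    return math.ceil(target_mb_s / per_client_mb_s)


# Aggregate rate of the example client: 107.49 KB/sec, converted to MB/sec.
example_client_mb_s = 107.49 / 1024

print(can_stream(example_client_mb_s))      # False
print(clients_needed(example_client_mb_s))  # 382
```

Under these assumptions, several hundred such clients would have to send data simultaneously and evenly just to keep a single drive streaming at its minimum speed.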

TSM optimizes tape usage by first collecting the data in the “Staging Pool”, a fast storage area attached to the TSM server, and then writing it to tape in bulk. As part of the rebuilding of the TSM environment (see “GWDG-Nachrichten” 8/2014), it is planned to expand the cache far enough to bridge even maintenance work on the tape libraries. All data sent to the TSM server during maintenance is then held in the cache until the library can write again. Errors such as “ANS… Server out of space” should then be a thing of the past. In addition, a restore is faster while the data is still in the cache, as no tape mount is required.
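The buffering behaviour can be sketched as follows. The `StagingPool` class, its threshold parameter, and the sizes used are hypothetical. The sketch only shows how many small client transfers turn into a few large sequential tape writes.

```python
from typing import List


class StagingPool:
    """Hypothetical disk cache that migrates to tape in large batches."""

    def __init__(self, migrate_threshold_mb: float) -> None:
        self.migrate_threshold_mb = migrate_threshold_mb
        self.buffered_mb = 0.0
        self.tape_writes: List[float] = []  # size of each sequential write

    def receive(self, size_mb: float) -> None:
        """Accept client data; migrate once enough has accumulated."""
        self.buffered_mb += size_mb
        if self.buffered_mb >= self.migrate_threshold_mb:
            self.migrate()

    def migrate(self) -> None:
        """Write everything buffered so far to tape in one go."""
        if self.buffered_mb > 0:
            self.tape_writes.append(self.buffered_mb)
            self.buffered_mb = 0.0


pool = StagingPool(migrate_threshold_mb=1000)
for _ in range(25):
    pool.receive(100)  # 25 small client transfers of 100 MB each

print(pool.tape_writes)  # [1000.0, 1000.0]
print(pool.buffered_mb)  # 500.0
```

The 500 MB still buffered at the end can be restored without any tape mount, which corresponds to the faster restore from cache mentioned above.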

Active Data Storage Areas ("Active Data Pools")

In TSM environments, all data representing the client’s state at the time of the last backup is called “active data”. Due to TSM’s “incremental forever” approach, files that never or rarely change are transferred to the TSM server only once or infrequently. As a consequence, the active data is very likely spread over many tapes, and a restore requires multiple tape mounts. These mounts take a lot of time and delay other internal processes, because restores have a higher priority than backups.

Special “Active Data Pools” can be defined for entire clients, for a single file space, or for multiple file spaces. Such a pool can serve either as the only storage destination or as an additional copy. These destinations keep copies of the latest version only. This concept massively improves restore speed. The disadvantage is a significantly higher price due to the greater need for resources, which arises because “Active Data Pools” use regular hard drives instead of tape libraries.

Considering the costs and the resources required, the GWDG has so far decided against “Active Data Pools”. However, together with our customers we are currently evaluating the requirements, as “Active Data Pools” and the concept of local mirroring may become economical, at least in part.

Virtual Full Backup

TSM uses an “incremental forever” approach for backups; full backups are not made. However, the TSM server can combine the data from all the client backups it holds [7] into one virtual full backup for a specific point in time, just as if a full backup had been made at that time. The virtual full backup can be saved as a “BackupSet” within TSM or on an external storage device (e.g. a USB drive). The retention period of a backup set can differ from the standard backup retention. In this way, the GFS principle [8] known from other backup solutions can be emulated without having to perform frequent full backups.
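How a virtual full backup can be assembled from incremental versions is sketched below in Python. The `Version` record, the use of day numbers as timestamps, and the function `virtual_full` are hypothetical simplifications. The sketch assumes that at most one version of a path is valid at any point in time.

```python
from typing import Dict, List, NamedTuple, Optional


class Version(NamedTuple):
    path: str
    backed_up: int              # day on which this version was sent
    deactivated: Optional[int]  # day it was superseded or deleted, None if active


def virtual_full(versions: List[Version], point_in_time: int) -> Dict[str, Version]:
    """Pick, per path, the version that was current at point_in_time."""
    result: Dict[str, Version] = {}
    for v in versions:
        existed = v.backed_up <= point_in_time
        still_current = v.deactivated is None or v.deactivated > point_in_time
        if existed and still_current:
            result[v.path] = v
    return result


inventory = [
    Version("/a.txt", backed_up=1, deactivated=5),     # replaced on day 5
    Version("/a.txt", backed_up=5, deactivated=None),
    Version("/b.txt", backed_up=2, deactivated=4),     # deleted on day 4
    Version("/c.txt", backed_up=3, deactivated=None),
]

day3 = virtual_full(inventory, 3)
day6 = virtual_full(inventory, 6)
print(sorted(day3))                # ['/a.txt', '/b.txt', '/c.txt']
print(sorted(day6))                # ['/a.txt', '/c.txt']
print(day6["/a.txt"].backed_up)    # 5
```

No data is re-sent by the client. The server only selects among versions it already stores, which is why this works solely within the retention period.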

The GWDG has not supported defining “BackupSets” in addition to the usual retention period so far, because a large number of tape mounts is typically needed when the data is spread over multiple tapes. These tape mounts compete with restores and with writing to tape, and may take several days in some cases. With “Active Data Pools”, the cost of creating “BackupSets” drops, and they can be created with little effort. Please get in touch with us if you are interested in this topic.

Deleting Data from Tape

Data on hard drives and tapes is deleted only indirectly: it is marked as “deletable” but physically remains on the storage device. Unlike on a hard drive, the marked areas on a tape are not simply overwritten; instead they reduce the usable capacity of the tape. When the share of still-valid data on a tape falls below a limit, the unmarked data is copied to a new tape to free the partially used one. The limit for this consolidation (“Reclamation”) is usually set low, so a single new tape can take up the remaining data of several old tapes.
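The reclamation step can be illustrated with a simplified Python model. The data layout, the function names, and the interpretation of the limit as a maximum fraction of still-valid objects are assumptions made for this sketch. All objects are treated as equally large.

```python
from typing import Dict, List, Tuple

# volume name -> list of (object name, marked as deletable)
Tapes = Dict[str, List[Tuple[str, bool]]]


def live_fraction(objects: List[Tuple[str, bool]]) -> float:
    """Fraction of objects on a tape that are not marked as deletable."""
    if not objects:
        return 0.0
    live = sum(1 for _, deletable in objects if not deletable)
    return live / len(objects)


def reclaim(tapes: Tapes, max_live_fraction: float) -> Tuple[List[str], List[str]]:
    """Return (objects copied to the new tape, volumes freed)."""
    copied: List[str] = []
    freed: List[str] = []
    for volume, objects in sorted(tapes.items()):
        if live_fraction(objects) <= max_live_fraction:
            copied.extend(name for name, deletable in objects if not deletable)
            freed.append(volume)
    return copied, freed


tapes: Tapes = {
    "T1": [("a", False), ("b", True), ("c", True), ("d", True)],
    "T2": [("e", True), ("f", False), ("g", True), ("h", True)],
    "T3": [("i", False), ("j", False), ("k", False), ("l", True)],
}

copied, freed = reclaim(tapes, max_live_fraction=0.3)
print(copied)  # ['a', 'f']
print(freed)   # ['T1', 'T2']
```

In this example the valid data of two sparsely used tapes ends up on one new tape, while the well-filled tape T3 is left alone.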

The tapes freed by “Reclamation” keep their data until it is overwritten, so it remains readable in principle. Recovering deleted data this way is only a theoretical concern, however, as it takes a very large amount of time (the TSM server would have to be restored to an earlier state in order to read the skipped data).

Mechanism “Restore”

Beyond what has been said about client and server, the restore needs no separate description. In general, several restore scenarios are possible:

  1. The user selects the data to restore in the GUI or via the client command line interface. The server receives the request and collects the data to hand it over to the client.
  2. The user selects entire partitions or folders to restore. Due to the “incremental forever” approach, the needed data is spread across several storage locations. The server must then wind back and forth on a tape, or even mount multiple tapes, to gather the data.
  3. Files, folders, and entire partitions can be restored to any point in time [9] in the past. Here, too, the data has to be collected from different locations.

High latency can occur during any restore operation, since multiple tape mounts might be needed or the tape has to be wound back and forth to the correct position. It can only be avoided if the data to be restored is still in the server’s disk cache. “Active Data Pools” would significantly speed up a restore of the latest version, as no tape mounts would be needed. The restore time for older versions would improve only slightly or not at all, as tape mounts would still be needed for the older data.
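The effect of data placement on a restore can be expressed as a simple count of the tapes involved. In the hypothetical Python sketch below, each object maps either to a tape volume name or to `None` when it is still held on disk. The names are invented for illustration.

```python
from typing import Dict, Optional


def mounts_needed(locations: Dict[str, Optional[str]]) -> int:
    """Number of distinct tapes that must be mounted for a restore.

    Objects located on disk (None) require no tape mount.
    """
    return len({volume for volume in locations.values() if volume is not None})


restore_request = {
    "/etc/hosts": "TAPE01",
    "/home/a": "TAPE07",
    "/home/b": "TAPE07",
    "/var/log/x": None,  # still in the server's disk cache
}

print(mounts_needed(restore_request))                          # 2
print(mounts_needed({p: None for p in restore_request}))       # 0
```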

As with backups, the TSM server bundles the data into larger packets to make better use of the bandwidth. All restore processes have a higher priority than any other TSM operation. Nevertheless, a restore still takes a lot of time because of the tape mounts and tape positioning. In most cases, the wait for the first response is significantly longer than the time needed to send all the data.

Footnotes And References

  1. IBM documentation for the Tivoli Storage Manager (TSM)
  2. For differences between multiple backup strategies see https://en.wikipedia.org/wiki/Backup
  3. Exceptions are possible, of course. A full backup can be forced with the option “ABSOLUTE”, but we strongly recommend against using it: a full backup involves disproportionate effort, and TSM can create virtual full backups (details below).
  4. Deduplication looks for duplicated files or file parts and replaces them with a reference to the already existing file (part). See also https://en.wikipedia.org/wiki/Data_deduplication
  5. SSL (Secure Sockets Layer), see https://en.wikipedia.org/wiki/Secure_Sockets_Layer
  6. LTO-6 with up to 160 MB/sec, Jaguar5 with up to 350 MB/sec; the minimum speed requirement is still 40 MB/sec (LTO-5, LTO-6)
  7. As data versions expire at the end of the retention period, virtual full backups are possible only within the retention period.
  8. “Grandfather-Father-Son principle”: the retention period differs by the type of backup, e.g. end of the week, end of the month, end of the year. See also https://en.wikipedia.org/wiki/Backup_rotation_scheme#Grandfather-father-son
  9. Only if the “point in time” lies within the retention period.

We thank Mr. Thomas Ripping and Mr. Peter Chronz for translating the text into English.