Checklist for data deduplication

Data deduplication requires additional processing resources on the server or client. Use this checklist to verify that your hardware and your Tivoli® Storage Manager configuration have the characteristics that are key to good performance.

Are you using fast disk storage for the Tivoli Storage Manager database as measured in terms of input/output operations per second (IOPS)?

Use a high-performance disk for the Tivoli Storage Manager database. At a minimum, use 10,000-rpm drives for smaller databases that are 200 GB or less. For databases over 500 GB, use 15,000-rpm drives or solid-state drives.

The Tivoli Storage Manager database should have a minimum capability of 3,000 IOPS. For each TB of data that is backed up daily (before data deduplication), add 1,000 IOPS to this minimum.

For example, a Tivoli Storage Manager server that is ingesting 3 TB of data per day would need 6,000 IOPS for the database disks:
  3,000 IOPS minimum + 3,000 IOPS (3 TB x 1,000 IOPS per TB) = 6,000 IOPS
See Checklist for server database disks.

For more information about IOPS, see the Tivoli Storage Manager Blueprint at https://www.ibm.com/developerworks/community/wikis/home/wiki/Tivoli Storage Manager/page/NEW - Tivoli Storage Manager Blueprint - Improve the time-to-value of your deployments

Do you have enough memory for the size of your database? Use a minimum of 64 GB of system memory for Tivoli Storage Manager servers that deduplicate data. If the retained capacity of backup data grows, the memory requirement might be higher.

Monitor memory usage regularly to determine whether more memory is required.

Use additional system memory to improve caching of database pages. The following memory size guidelines are based on the daily amount of new data that you back up:
  • 128 GB of system memory for daily backups of up to 8 TB of new data
  • 192 GB of system memory for daily backups of more than 8 TB of new data
 
Have you properly sized your disk space for the database, logs, and storage pools? For a rough estimate, plan for 100 GB of database storage for every 10 TB of data that is to be protected in deduplicated storage pools. Protected data is the amount of data before deduplication, including all versions of objects stored.
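
For example, protecting 50 TB of source data (before deduplication) calls for roughly 500 GB of database storage.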

Configure the server to have the maximum active log size of 128 GB by setting the ACTIVELOGSIZE server option to a value of 131072.

Use a directory for the database archive logs with an initial free capacity of at least 500 GB. Specify the directory by using the ARCHLOGDIRECTORY server option.

Define space for the archive failover log by using the ARCHFAILOVERLOGDIRECTORY server option.
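
For illustration, these settings might appear in the server options file (dsmserv.opt) as follows; the log directory paths are hypothetical examples:
  ACTIVELOGSIZE             131072
  ARCHLOGDIRECTORY          /tsm/archlog
  ARCHFAILOVERLOGDIRECTORY  /tsm/archfailoverlog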

 
Are the Tivoli Storage Manager database and logs on separate disk volumes (LUNs)?

Is the disk that is used for the database configured according to best practices for a transactional database?

The Tivoli Storage Manager database must not share disk volumes with Tivoli Storage Manager database logs or storage pools, or with any other application or file system.

See Server database and recovery log configuration and tuning.
Are you using a minimum of 8 processor cores (2.2 GHz or equivalent) for each Tivoli Storage Manager server that you plan to use with data deduplication? If you plan to use client-side data deduplication, verify that client systems have adequate resources available during a backup operation to perform data deduplication processing. Use at least the equivalent of one 2.2 GHz processor core per backup process that uses client-side data deduplication.
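
As a minimal sketch, client-side data deduplication is enabled both on the server for the node and in the client options file; the node name here is a hypothetical example:
  update node client1 deduplication=clientorserver    (server command)
  deduplication yes                                    (client option, in dsm.opt or dsm.sys)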

See Effective Planning and Use of IBM® Tivoli Storage Manager V6 Deduplication.

Have you estimated storage pool capacity to configure enough space for the size of your environment? You can estimate storage pool capacity requirements for a deduplicated storage pool by using the following technique:
  1. Estimate the base size of the source data.
  2. Estimate the daily backup size by using an estimated change and growth rate.
  3. Determine retention requirements.
  4. Estimate the total amount of source data by factoring in the base size, daily backup size, and retention requirements.
  5. Apply the deduplication ratio factor.
  6. Round up the estimate to consider transient storage pool usage.

For an example of using this technique, see Effective Planning and Use of IBM Tivoli Storage Manager V6 Deduplication.
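
As an illustration only, applying the technique with assumed values (a 40 TB base, 0.8 TB of new and changed data backed up daily, 30-day retention, and a 3:1 deduplication ratio):
  Total source data: 40 TB + (0.8 TB/day x 30 days) = 64 TB
  After 3:1 deduplication: 64 TB / 3 = about 21.3 TB
  Rounded up for transient storage pool usage (for example, by 20%): about 26 TB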

Have you distributed disk I/O over many disk devices and controllers? Use arrays that consist of as many disks as possible, which is sometimes referred to as wide striping.

Specify 8 or more file systems for the deduplicated storage pool device class so that I/O is distributed across as many LUNs and physical devices as possible.
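
For example, a FILE device class might spread its volumes across eight file systems; the device class name, mount points, and sizes here are hypothetical:
  define devclass dedupfile devtype=file mountlimit=32 maxcapacity=50G directory=/tsmfile00,/tsmfile01,/tsmfile02,/tsmfile03,/tsmfile04,/tsmfile05,/tsmfile06,/tsmfile07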

See Checklist for storage pools on disk.
Have you scheduled data deduplication processing based on your backup strategy? If you are not creating a secondary copy of backup data or if you are using node replication for the second copy, client backup and duplicate identification can be overlapped. This can reduce the total elapsed time for these operations, but might increase the time that is required for client backup.

If you are using storage pool backup, do not overlap client backup and duplicate identification. The best practice sequence of operations is client backup, storage pool backup, and then duplicate identification.

For data that is not stored with client-side data deduplication, schedule storage-pool backup operations to complete before data deduplication processing starts. This sequence avoids reconstructing deduplicated objects to make a non-deduplicated copy in a different storage pool.
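
As a sketch of that sequence with administrative schedules, where the schedule names, pool names, start times, and duration are hypothetical:
  define schedule stgbackup type=administrative cmd="backup stgpool dedupool copypool" starttime=06:00 period=1 perunits=days active=yes
  define schedule identdup type=administrative cmd="identify duplicates dedupool duration=480" starttime=10:00 period=1 perunits=days active=yes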

See Scheduling data deduplication and node replication processes.
Are the processes for identifying duplicates able to handle all new data that is backed up each day? If the process completes or goes into an idle state before the next scheduled operation begins, then all new data is being processed.
Is reclamation able to run to a sufficiently low threshold? If a low threshold cannot be reached, consider the following actions:
  • Increase the number of processes that are used for reclamation.
  • Upgrade to faster hardware.
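
For example, the reclamation threshold and the number of parallel reclamation processes can be adjusted on the storage pool; the pool name and values are hypothetical:
  update stgpool dedupool reclaim=60 reclaimprocess=4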
 
Is deduplication cleanup processing able to clean out the dereferenced extents to free disk space before the start of the next backup cycle? Run the SHOW DEDUPDELETE command. The output shows that all threads are idle when the workload is complete.
If cleanup processing cannot complete, consider the following actions:
  • Increase the number of processes that are used for duplicate identification.
  • Upgrade to faster hardware.
  • Determine whether the Tivoli Storage Manager server is ingesting more data than it can process with data deduplication, and consider deploying an additional Tivoli Storage Manager server.