Linux: Example: Estimating active and archive log sizes for data deduplication operations

If you deduplicate data, you must consider its effects on space requirements for active and archive logs.

The following factors affect requirements for active and archive log space:

The amount of deduplicated data
The effect of data deduplication on the active log and archive log space depends on the percentage of data that is eligible for deduplication. If the percentage of data that can be deduplicated is relatively high, more log space is required.
The size and number of extents
Approximately 1,500 bytes of active log space are required for each extent that is identified by a duplicate-identification process. For example, if 250,000 extents are identified by a duplicate-identification process, the estimated size of the active log is 358 MB:
250,000 extents identified during each process x 1,500 bytes for each extent = 358 MB
Consider the following scenario. Three hundred backup-archive clients back up 100,000 files each night. This activity creates a workload of 30,000,000 files. The average number of extents for each file is two. Therefore, the total number of extents is 60,000,000, and the space requirement for the archive log is 84 GB:
60,000,000 extents x 1,500 bytes for each extent = 84 GB
A duplicate-identification process operates on aggregates of files. An aggregate consists of files that are stored in a given transaction, as specified by the TXNGROUPMAX server option. Suppose that the TXNGROUPMAX server option is set to the default of 4096. If the average number of extents for each file is two, the total number of extents in each aggregate is 8192, and the space required for the active log is 12 MB:
8192 extents in each aggregate x 1,500 bytes for each extent = 12 MB
The timing and number of the duplicate-identification processes
The timing and number of duplicate-identification processes also affect the size of the active log. Using the 12 MB active-log size that was calculated in the preceding example, the concurrent load on the active log is 120 MB if 10 duplicate-identification processes are running in parallel:
12 MB for each process x 10 processes = 120 MB
File size
Large files that are processed for duplicate identification can also affect the size of the active log. For example, suppose that a backup-archive client backs up an 80 GB file-system image. This object can have a high number of duplicate extents if, for instance, the files included in the file-system image were backed up incrementally. Assume that the file-system image has 1.2 million duplicate extents. These 1.2 million extents represent a single transaction for a duplicate-identification process. The total space in the active log that is required for this single object is 1.7 GB:
1,200,000 extents x 1,500 bytes for each extent = 1.7 GB
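
The per-extent estimates above can be reproduced with a short script. This is a minimal sketch, not a sizing tool: the function and variable names are illustrative, and it assumes that the MB and GB figures in this topic are binary units (2^20 and 2^30 bytes), which is the interpretation that matches the stated results.

```python
BYTES_PER_EXTENT = 1_500  # approximate active log space per identified extent (from the text)
MIB = 2**20  # assumption: the "MB" figures in the text are binary units
GIB = 2**30  # assumption: likewise for the "GB" figures


def log_bytes(extents: int) -> int:
    """Return the estimated log space, in bytes, for a number of extents."""
    return extents * BYTES_PER_EXTENT


# 250,000 extents identified by one duplicate-identification process
per_process_mb = log_bytes(250_000) / MIB

# 300 clients x 100,000 files x 2 extents per file (archive log estimate)
nightly_extents = 300 * 100_000 * 2
nightly_gb = log_bytes(nightly_extents) / GIB

# One aggregate: TXNGROUPMAX of 4096 files x 2 extents per file
aggregate_mb = log_bytes(4096 * 2) / MIB

# Ten processes in parallel, using the rounded 12 MB figure as the text does
concurrent_mb = round(aggregate_mb) * 10

# A single large object with 1.2 million duplicate extents
large_object_gb = log_bytes(1_200_000) / GIB

print(f"{per_process_mb:.0f} MB, {nightly_gb:.0f} GB, {aggregate_mb:.0f} MB, "
      f"{concurrent_mb} MB, {large_object_gb:.1f} GB")
# prints: 358 MB, 84 GB, 12 MB, 120 MB, 1.7 GB
```

Substituting the extent counts from your own environment gives a first estimate only. The timing of concurrent processes still has to be considered separately.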

If other, smaller duplicate-identification processes occur at the same time as the duplicate-identification process for a single large object, the active log might not have enough space. For example, suppose that a storage pool is enabled for deduplication. The storage pool has a mixture of data, including many relatively small files that range from 10 KB to several hundred KB. The storage pool also has a few large objects that have a high percentage of duplicate extents.

To take into account not only space requirements but also the timing and duration of concurrent transactions, increase the estimated size of the active log by a factor of two. For example, suppose that your calculations for space requirements are 25 GB (23.3 GB + 1.7 GB for deduplication of a large object). If deduplication processes are running concurrently, the suggested size of the active log is 50 GB. The suggested size of the archive log is 150 GB.
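
The doubling rule can be sketched as a small function. The function name and constants are illustrative assumptions. The factor of 2 comes from the paragraph above, and the factor of 3 for the archive log matches the 50 GB and 150 GB figures in the same example.

```python
ACTIVE_LOG_FACTOR = 2   # doubling for concurrent transactions (from the text)
ARCHIVE_LOG_FACTOR = 3  # archive log is three times the active log in the example


def suggested_log_sizes(base_gb: float, large_object_gb: float) -> tuple[float, float]:
    """Return (active_gb, archive_gb) for the given space estimates."""
    active_gb = (base_gb + large_object_gb) * ACTIVE_LOG_FACTOR
    archive_gb = active_gb * ARCHIVE_LOG_FACTOR
    # Round to one decimal place, the precision used in this topic.
    return round(active_gb, 1), round(archive_gb, 1)


active, archive = suggested_log_sizes(base_gb=23.3, large_object_gb=1.7)
print(f"active log: {active} GB, archive log: {archive} GB")
# prints: active log: 50.0 GB, archive log: 150.0 GB
```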

The examples in the following tables show calculations for active and archive logs. The example in the first table uses an average size of 700 KB for extents. The example in the second table uses an average size of 256 KB. As the examples show, the smaller average duplicate-extent size of 256 KB produces more extents and therefore a larger estimated size for the active log. To minimize or prevent operational problems for the server, use 256 KB to estimate the size of the active log in your production environment.

Table 1. Average duplicate-extent size of 700 KB
Item Example values Description
Size of largest single object to deduplicate
Example values: 800 GB; 4 TB
The granularity of processing for deduplication is at the file level. Therefore, the largest single file to deduplicate represents the largest transaction and a correspondingly large load on the active and archive logs.

Average size of extents
Example values: 700 KB; 700 KB
The deduplication algorithms use a variable block method. Not all deduplicated extents for a given file are the same size, so this calculation assumes an average size for extents.

Extents for a given file
Example values: 1,198,372 extents (800 GB object); 6,135,667 extents (4 TB object)
Using the average extent size (700 KB), these calculations represent the total number of extents for a given object.

The following calculation was used for an 800 GB object: (800 GB ÷ 700 KB) = 1,198,372 extents

The following calculation was used for a 4 TB object: (4 TB ÷ 700 KB) = 6,135,667 extents

Active log: Suggested size that is required for the deduplication of a single large object during a single duplicate-identification process
Example values: 1.7 GB (800 GB object); 8.6 GB (4 TB object)
The estimated active log space that is needed for this transaction.

Active log: Suggested total size
Example values: 66 GB (800 GB object); 79.8 GB (4 TB object); see note 1
After considering other aspects of the workload on the server in addition to deduplication, multiply the existing estimate by a factor of two. In these examples, the active log space required to deduplicate a single large object is considered along with previous estimates for the required active log size.

The following calculation was used for multiple transactions and an 800 GB object:

(23.3 GB + 1.7 GB) x 2 = 50 GB

Increase that amount by the suggested starting size of 16 GB:

50 + 16 = 66 GB

The following calculation was used for multiple transactions and a 4 TB object:

(23.3 GB + 8.6 GB) x 2 = 63.8 GB

Increase that amount by the suggested starting size of 16 GB:

63.8 + 16 = 79.8 GB

Archive log: Suggested size
Example values: 198 GB (800 GB object); 239.4 GB (4 TB object); see note 1
Multiply the estimated size of the active log by a factor of 3.

The following calculation was used for multiple transactions and an 800 GB object:

50 GB x 3 = 150 GB

Increase that amount by the suggested starting size of 48 GB:

150 + 48 = 198 GB

The following calculation was used for multiple transactions and a 4 TB object:

63.8 GB x 3 = 191.4 GB

Increase that amount by the suggested starting size of 48 GB:

191.4 + 48 = 239.4 GB

Note 1: The example values in this table are used only to illustrate how the sizes for active logs and archive logs are calculated. In a production environment that uses deduplication, 32 GB is the suggested minimum size for an active log. The suggested minimum size for an archive log in a production environment that uses deduplication is 96 GB. If you substitute values from your environment and the results are larger than 32 GB and 96 GB, use your results to size the active log and archive log.

Monitor your logs and adjust their size if necessary.
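
The steps in Table 1 can be chained in a short script, which makes it easier to substitute values from your own environment. This is a sketch under stated assumptions: the function and constant names are hypothetical, sizes are treated as binary units, the fractional extent is dropped, and the single-object estimate is rounded to one decimal place before it is reused, which is what reproduces the values in the table.

```python
KIB, GIB, TIB = 2**10, 2**30, 2**40  # assumption: binary units reproduce the table's figures
BYTES_PER_EXTENT = 1_500             # active log space per extent
BASE_ESTIMATE_GB = 23.3              # previous active log estimate used in the examples
ACTIVE_START_GB = 16                 # suggested starting size added to the active log
ARCHIVE_START_GB = 48                # suggested starting size added to the archive log


def size_logs(object_bytes: int, extent_bytes: int) -> dict:
    """Follow the table's steps for one large object."""
    extents = object_bytes // extent_bytes  # fractional extent dropped, as in the table
    # The table rounds the single-object estimate before reusing it.
    single_gb = round(extents * BYTES_PER_EXTENT / GIB, 1)
    doubled_gb = round((BASE_ESTIMATE_GB + single_gb) * 2, 1)
    return {
        "extents": extents,
        "single_object_gb": single_gb,
        "active_gb": round(doubled_gb + ACTIVE_START_GB, 1),
        "archive_gb": round(doubled_gb * 3 + ARCHIVE_START_GB, 1),
    }


for label, size in (("800 GB", 800 * GIB), ("4 TB", 4 * TIB)):
    print(label, size_logs(size, 700 * KIB))
```

With a 700 KB average extent size, the script yields 66 GB and 198 GB for the 800 GB object, and 79.8 GB and 239.4 GB for the 4 TB object, matching Table 1.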

Table 2. Average duplicate-extent size of 256 KB
Item Example values Description
Size of largest single object to deduplicate
Example values: 800 GB; 4 TB
The granularity of processing for deduplication is at the file level. Therefore, the largest single file to deduplicate represents the largest transaction and a correspondingly large load on the active and archive logs.

Average size of extents
Example values: 256 KB; 256 KB
The deduplication algorithms use a variable block method. Not all deduplicated extents for a given file are the same size, so this calculation assumes an average extent size.

Extents for a given file
Example values: 3,276,800 extents (800 GB object); 16,777,216 extents (4 TB object)
Using the average extent size (256 KB), these calculations represent the total number of extents for a given object.

The following calculation was used for an 800 GB object:

(800 GB ÷ 256 KB) = 3,276,800 extents

The following calculation was used for a 4 TB object:

(4 TB ÷ 256 KB) = 16,777,216 extents

Active log: Suggested size that is required for the deduplication of a single large object during a single duplicate-identification process
Example values: 4.5 GB (800 GB object); 23.4 GB (4 TB object)
The estimated size of the active log space that is required for this transaction.

Active log: Suggested total size
Example values: 71.6 GB (800 GB object); 109.4 GB (4 TB object); see note 1
After considering other aspects of the workload on the server in addition to deduplication, multiply the existing estimate by a factor of 2. In these examples, the active log space required to deduplicate a single large object is considered along with previous estimates for the required active log size.

The following calculation was used for multiple transactions and an 800 GB object:

(23.3 GB + 4.5 GB) x 2 = 55.6 GB

Increase that amount by the suggested starting size of 16 GB:

55.6 + 16 = 71.6 GB

The following calculation was used for multiple transactions and a 4 TB object:

(23.3 GB + 23.4 GB) x 2 = 93.4 GB

Increase that amount by the suggested starting size of 16 GB:

93.4 + 16 = 109.4 GB

Archive log: Suggested size
Example values: 214.8 GB (800 GB object); 328.2 GB (4 TB object); see note 1
The estimated size of the active log multiplied by a factor of 3.

The following calculation was used for an 800 GB object:

55.6 GB x 3 = 166.8 GB

Increase that amount by the suggested starting size of 48 GB:

166.8 + 48 = 214.8 GB

The following calculation was used for a 4 TB object:

93.4 GB x 3 = 280.2 GB

Increase that amount by the suggested starting size of 48 GB:

280.2 + 48 = 328.2 GB

Note 1: The example values in this table are used only to illustrate how the sizes for active logs and archive logs are calculated. In a production environment that uses deduplication, 32 GB is the suggested minimum size for an active log. The suggested minimum size for an archive log in a production environment that uses deduplication is 96 GB. If you substitute values from your environment and the results are larger than 32 GB and 96 GB, use your results to size the active log and archive log.

Monitor your logs and adjust their size if necessary.
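
The rule in the table note, use your calculated sizes only when they exceed the suggested production minimums, can be expressed as a simple comparison. The function name is illustrative, and the second call uses hypothetical results that are not taken from the tables.

```python
MIN_ACTIVE_GB = 32   # suggested production minimum for the active log
MIN_ARCHIVE_GB = 96  # suggested production minimum for the archive log


def production_sizes(active_gb: float, archive_gb: float) -> tuple[float, float]:
    """Use the calculated sizes only when they exceed the suggested minimums."""
    return max(active_gb, MIN_ACTIVE_GB), max(archive_gb, MIN_ARCHIVE_GB)


# Values for the 800 GB object in Table 2 exceed both minimums.
print(production_sizes(71.6, 214.8))  # prints: (71.6, 214.8)

# Hypothetical smaller results fall back to the minimums.
print(production_sizes(20.0, 60.0))   # prints: (32, 96)
```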