IBM Support

Boundary Condition Can Occur When Using Client-Side Deduplication Causing Some Data in a Container Storage Pool to Be Left Unrecoverable

Flashes (Alerts)


Abstract

APARs IT28096 and IT29362 may affect directory-container and cloud-container storage pools, which can result in damaged deduplicated extents (chunks). Any files that reference the damaged deduplicated extents may become unrecoverable. This potential loss of data may affect the ability to restore, retrieve, or recall (for space-managed data) the data at some future time.

Both APARs represent boundary conditions that may result in incorrect data being written to a container storage pool. For APAR IT28096, the issue occurs when client-side deduplication is being performed and the server makes an incorrect determination about where in the data stream extents begin and end. For APAR IT29362, the issue occurs during error handling while writing to the container storage pool. For both APARs, the result is that deduplicated extents written to the storage pool may be incorrect. Read operations that touch these incorrect extents, such as a client restore or retrieve or a REPLICATE NODE operation, will fail. In both cases, in a directory-container pool, AUDIT CONTAINER will detect and report the damaged deduplicated extents in the respective containers. Note that the incorrect deduplicated extents may be propagated from the source to the target in a server-to-server replication pair where PROTECT STGPOOL is being used.

Content

Problem Summary:

For both APARs, data stored to a directory-container or cloud-container storage pool may encounter a processing error that leaves portions of the stored data incorrect.  As a result, the data may not be readable by the client.  If the affected data is backup data, then restore of the affected data may not be possible.  Similarly, if the affected data is archive data, then retrieve of the affected data may not be possible. Finally, if the affected data is space-managed (HSM) data, then recall of the affected data may not be possible.  Please refer to each individual APAR for additional details describing the respective issue and symptom.

Who is affected:

These APARs affect users of client-side deduplication writing to directory-container or cloud-container storage pools where the IBM Spectrum Protect server version is either V7 or V8.  For version 7, affected fix pack levels are 7.1.3.0 or higher.  For version 8, affected fix pack levels are 8.1.0.0 or higher.
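
The lower bounds above can be expressed as a small check. The following is a minimal Python sketch, not an IBM tool: the function names are hypothetical, and it tests only the version floors stated in this section. It does not decide whether a given level already contains the fix; that must be checked against the fix levels listed under Problem Resolution.

```python
def parse_level(level):
    """Turn a level string such as '7.1.10.000' into a 4-part integer tuple."""
    parts = [int(p) for p in level.strip().split(".")]
    # Pad short levels (for example '8.1') with zeros so comparisons are uniform.
    parts += [0] * (4 - len(parts))
    return tuple(parts[:4])


def at_or_above_affected_floor(level):
    """True if the level is at or above the affected floor for V7 or V8.

    Assumption: only the floors named in the advisory are checked
    (7.1.3.0 for V7, 8.1.0.0 for V8). Fixed levels are NOT considered here.
    """
    v = parse_level(level)
    if v[0] == 7:
        return v >= (7, 1, 3, 0)
    if v[0] == 8:
        return v >= (8, 1, 0, 0)
    return False
```

A server for which this returns True still needs its exact level compared with the fix list before concluding that it is exposed.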

Problem Resolution:

Fixes for the IBM Spectrum Protect server which resolve both APAR IT28096 and IT29362 will be delivered in the following server levels. This is subject to change at the sole discretion of IBM:

7.1.7.500
7.1.9.300
7.1.10.000
8.1.1.400
8.1.6.200
8.1.7.0 for IT28096 and 8.1.7.100 for IT29362

If you require a fix before these levels become available, contact IBM Software Support.

Recommended Actions:

The following are the recommended actions to perform.  Please review the entire set of instructions first.  Some steps must be repeated, often once per affected storage pool, before proceeding to the next step.  Some steps may not apply to you; for example, step 3 may be skipped if you do not have a storage type “CLOUD” storage pool.  The recommended steps are:

  1. Issue the command:  QUERY STGPOOL * FORMAT=DETAILED

    For each storage pool of storage type “DIRECTORY” or “CLOUD”, proceed through the following steps.  If all storage pools for a given server are storage type “DEVCLASS”, then that server is not affected by the APARs discussed in this advisory.

    This needs to be performed for all IBM Spectrum Protect servers. 
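
The triage in step 1 can be sketched in Python as follows. This is an illustration only: it assumes the pool names and their storage types have already been read from the QUERY STGPOOL output into a dictionary, and the function name and the pool names in the usage note are hypothetical.

```python
def pools_to_audit(pools):
    """Group pool names by the storage types this advisory covers.

    `pools` maps pool name -> storage type as shown by QUERY STGPOOL
    (an assumed, already-parsed input). Pools of storage type DEVCLASS
    are not affected and are left out of the result.
    """
    by_type = {"DIRECTORY": [], "CLOUD": []}
    for name, stg_type in pools.items():
        key = stg_type.strip().upper()
        if key in by_type:
            by_type[key].append(name)
    return by_type


def server_is_unaffected(pools):
    """True when no pool on the server is of type DIRECTORY or CLOUD."""
    grouped = pools_to_audit(pools)
    return not grouped["DIRECTORY"] and not grouped["CLOUD"]
```

For example, a server with pools `{"DIRPOOL1": "DIRECTORY", "TAPEPOOL": "DEVCLASS"}` yields one DIRECTORY pool for step 2 and nothing for step 3.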

     
  2. For each storage pool of storage type “DIRECTORY” identified in step 1, perform the following:

    Steps 2-1 and 2-2 need to be performed on all servers before performing step 2-3.
    1. Minimize the defragmentation processing (which is done through background MOVE CONTAINER operations) for the directory container storage pools. This is done by setting the server options DEFRAGFSTRIGGER and DEFRAGCNTRTRIGGER to a value of 99.  Before changing these values, issue QUERY OPTION and make note of the current values for these options so that they can be returned to their original values after these remediation actions are completed.  The default values are 90 for the DEFRAGFSTRIGGER and 95 for the DEFRAGCNTRTRIGGER.

      These options can be set using the SETOPT command.  For example, issue the following commands to the IBM Spectrum Protect server:

      SETOPT DEFRAGFSTRIGGER 99
      SETOPT DEFRAGCNTRTRIGGER 99

      NOTE: Manually performed MOVE CONTAINER processes should not be run until the entire set of clean up instructions is complete. If a MOVE CONTAINER process is started before the pool is fully audited (whether manually or automatically), then the audit must be performed again.

       
    2. Perform the following to ensure that the deduplication catalogs are synchronized so that the AUDIT CONTAINER operations in the following steps are complete:

      For each server that runs the PROTECT STGPOOL command, ensure there is a complete and successful run of PROTECT STGPOOL before proceeding to the next step.

      If REPLICATE NODE is used in the environment (without the use of PROTECT STGPOOL), ensure there is a complete and successful run of REPLICATE NODE before proceeding.   
    3. Perform AUDIT CONTAINER on all container files in the storage pool.  The AUDIT CONTAINER command will audit any container that has a “Last Audit Date” older than the day these AUDIT CONTAINER operations are started. 

      This audit can be performed manually or automatically. To perform it automatically, use storage rules, which are available in 8.1.5 and higher. This is documented at the following link:
      Audit Storage Rule


      To do this manually, make note of the date when this action is started.  For example, if the fixing level of the server was installed and the AUDIT CONTAINER operations were started on April 17, 2019, then that is the date to use.  The recommended syntax of the command is:

      AUDIT CONTAINER STGPOOL=<name of storage pool> ACTION=SCANALL MAXPR=20 WAIT=NO ENDDATE=04/17/2019

      This will cause all containers in the referenced pool with a last audit date prior to April 17, 2019 to be audited. 

      This command can be scheduled to run every day for some number of hours.  Note that the number of processes used on the command is set to 20 (MAXPR=20).  Choose this setting with care, since AUDIT CONTAINER adds I/O load on the system by performing reads from the container storage pool directories.  It also adds load on the server while it calculates the cryptographic digest (SHA-1) used to validate the deduplicated extents during the audit processing.  If the AUDIT CONTAINER processing is impacting the performance of the IBM Spectrum Protect server, such as by slowing down client backups, consider cancelling and re-invoking the command with a lower MAXPR value. It may also be possible to move the AUDIT CONTAINER processing to a window with lighter load, which might allow a higher MAXPR value and a shorter overall runtime.

       
    4. To monitor the progress of the AUDIT CONTAINER processing for this pool, consider one of the following:

      1. Issue the command: “QUERY CONTAINER F=D”.

        Review those containers with an “Approx. Last Audit Date” prior to the date the audit processing was started.

      2. Issue the command: “SELECT COUNT(*) FROM CONTAINERS WHERE STGPOOL_NAME='stgpool_name_goes_here' AND LASTAUDIT_DATE<='04/17/2019'”.

        This will provide a count of the number of containers that still need to be audited for the referenced storage pool.

      AUDIT CONTAINER processing will mark any deduplicated extents (chunks) that fail to validate as damaged.  These can be viewed with the QUERY DAMAGED command.

      Repeat step 2-3 for other storage type “DIRECTORY” storage pools. For multiple pools, it is possible to do this concurrently assuming there is sufficient CPU resource to handle the operations.  NOTE: Ensure that no more than 40 AUDIT CONTAINER processes are running at a given time on a single server.

      Once all the containers in the referenced directory container storage pool have been audited, proceed to the next step.
    5. Once the AUDIT CONTAINER processing has been completed, revert the defragmentation options to the values recorded in step 2-1.  The following illustrates the syntax; substitute the values you recorded:

      SETOPT DEFRAGFSTRIGGER 95
      SETOPT DEFRAGCNTRTRIGGER 90
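
The manual audit in step 2-3 and the 40-process limit noted in step 2-4 can be planned with a short script. The Python below is a hedged sketch: it only builds command strings in the syntax shown in step 2-3 and groups pools so that the combined MAXPR of pools audited together stays within the limit. The function names and pool names are hypothetical, and submitting the commands to the server is left to the administrator.

```python
from datetime import date

MAX_AUDIT_PROCESSES = 40  # per-server limit stated in this advisory


def audit_command(pool, start, maxpr=20):
    """Build the AUDIT CONTAINER command from step 2-3 for one pool.

    ENDDATE uses the MM/DD/YYYY form shown in the advisory's example.
    """
    return (
        f"AUDIT CONTAINER STGPOOL={pool} ACTION=SCANALL "
        f"MAXPR={maxpr} WAIT=NO ENDDATE={start.strftime('%m/%d/%Y')}"
    )


def batch_pools(pools, maxpr=20, cap=MAX_AUDIT_PROCESSES):
    """Split pools into groups whose combined MAXPR does not exceed cap."""
    if not 1 <= maxpr <= cap:
        raise ValueError("maxpr must be between 1 and the process cap")
    per_batch = cap // maxpr  # how many pools can be audited concurrently
    return [pools[i:i + per_batch] for i in range(0, len(pools), per_batch)]
```

With the recommended MAXPR=20, `batch_pools(["POOL_A", "POOL_B", "POOL_C"])` gives two groups, so the third pool waits until one of the first two audits has finished.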

       
  3. For each storage pool of storage type “CLOUD” identified in step 1, perform the following:

    Contact IBM support.  An audit tool will be provided along with instructions on how to use it.  The audit tool will scan the container objects stored to the object storage (cloud).  It will provide a summary of the containers that have been evaluated.

    For any container objects stored to the object storage which have deduplicated extents affected by the APARs discussed in this advisory, the audit tool will create a shell script containing a list of the affected deduplicated extents.  This shell script must be run against the IBM Spectrum Protect server to mark those deduplicated extents (chunks) as damaged.  Once the script has been run, the affected extents will be reported in QUERY DAMAGED.

     
  4. At this point, all storage type “DIRECTORY” and “CLOUD” storage pools have been through the audit identification processing.  As a result of this audit processing, all the affected deduplicated extents (chunks) have been marked as damaged in the IBM Spectrum Protect server database.  To review the client nodes and a count of damaged data extents, issue the command:  QUERY DAMAGED TYPE=NODE.
     
  5. It may be possible for the IBM Spectrum Protect server to replace the damaged deduplicated data extents during normal ingest processing.  This is done by having the client where the data originated resend the affected data.  When the server receives an extent that matches an existing extent that is damaged, the IBM Spectrum Protect server will automatically store a new copy of that extent from the ingest stream and then update the metadata for all objects referencing the damaged extent.

    While following sub-steps 5-1, 5-2, and 5-3 below, periodically review the QUERY DAMAGED results.  As affected deduplicated extents are replaced, other data reported as damaged may also be corrected because a deduplicated extent may be shared by many different files. 

    In order to utilize this damaged extent replacement capability of the IBM Spectrum Protect server, consider the following actions for each client reported in step 4 above:

     
    1. For backup data, have the client perform a full backup. The effectiveness of this full backup may vary depending on the type of client in use. 
       
    2. For archive data, have the client perform a new archive of the affected data.  This is only possible if the archive data is still available on the client or from some other location outside of IBM Spectrum Protect.  For data where there is no other copy available, the data is lost and not recoverable.
       
    3. For space managed (HSM) data, if another copy of the affected file is available from some other location, copy and replace the damaged file in the HSM filesystem using this other copy of the data.  If no other copy of the affected HSM migrated file exists, then delete the affected file from the HSM filesystem.  In the event that a copy is not available for the affected HSM file, the data is lost and not recoverable. 
       
  6. Once all data that can be re-ingested in step 5 has been resent, the final step is to remove any remaining damaged deduplicated extents.  Those remaining damaged extents can be reviewed using the command QUERY DAMAGED TYPE=INVENTORY, which provides a final list of any objects that were lost as a result of the APARs identified above.  NOTE: This command may be long-running depending on the amount of damage remaining in the environment.

    To remove the remaining damaged deduplicated extents from the IBM Spectrum Protect server, issue the command “AUDIT CONTAINER STGPOOL=<stgpool name goes here> ACTION=REMOVEDAMAGED”.

    This command should be performed for each pool of either storage type “DIRECTORY” or “CLOUD” in step 1 above. 
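
The audit and the damaged-extent replacement described in step 5 can be illustrated with a toy model. The Python below is a conceptual sketch only and is not how the IBM Spectrum Protect server is implemented: the class, its fields, and the file names are hypothetical. It shows extents keyed by SHA-1 digest, an audit that flags extents whose stored bytes no longer match their digest, and a re-ingest that replaces a damaged extent and thereby repairs every object that shares it.

```python
import hashlib


class ToyDedupPool:
    """Toy model of a deduplicated pool (illustration, not server code)."""

    def __init__(self):
        self.extents = {}  # SHA-1 hex digest -> {"data": bytes, "damaged": bool}
        self.objects = {}  # object name -> list of extent digests

    def ingest(self, name, chunks):
        digests = []
        for chunk in chunks:
            digest = hashlib.sha1(chunk).hexdigest()
            extent = self.extents.get(digest)
            if extent is None or extent["damaged"]:
                # Store a new extent, or replace a damaged one from the stream.
                self.extents[digest] = {"data": chunk, "damaged": False}
            digests.append(digest)
        self.objects[name] = digests

    def audit(self):
        # Recompute each digest and mark mismatching extents as damaged.
        for digest, extent in self.extents.items():
            if hashlib.sha1(extent["data"]).hexdigest() != digest:
                extent["damaged"] = True

    def damaged_objects(self):
        return sorted(
            name for name, digests in self.objects.items()
            if any(self.extents[d]["damaged"] for d in digests)
        )


def demo():
    pool = ToyDedupPool()
    pool.ingest("a.txt", [b"shared", b"only-a"])
    pool.ingest("b.txt", [b"shared", b"only-b"])
    # Simulate an incorrect write to the extent shared by both files.
    pool.extents[hashlib.sha1(b"shared").hexdigest()]["data"] = b"corrupt"
    pool.audit()
    before = pool.damaged_objects()
    pool.ingest("a.txt", [b"shared", b"only-a"])  # the client resends one file
    return before, pool.damaged_objects()
```

In this model, resending only `a.txt` also clears the damage reported for `b.txt`, because both files reference the same extent. Anything still listed after every available source has been resent corresponds to the data that step 6 removes.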

     
For any questions or other assistance regarding this advisory, please contact IBM support.

Document information

More support for: IBM Spectrum Protect

Component: Server

Software version: 7.1, 8.1

Operating system(s): AIX, Linux, Windows

Reference #: 0872118

Modified date: 11 June 2019