IBM Support

IBM Spectrum Scale (GPFS) V4.2.3: failures in scanning file system metadata may result in file system data or metadata corruption

Flashes (Alerts)


Abstract

IBM has identified a problem with the GPFS file system metadata scanning function in IBM Spectrum Scale 4.2.3.0 through 4.2.3.3 that may result in silent file system data corruption or metadata corruption when certain failures occur.

Content

Problem Summary: The first step in rebalancing all files in a file system, or in restoring their replication factor (for example, with the mmrestripefs command), is to scan and repair the file system metadata. If an unrecoverable error occurs during this scan, some file system metadata blocks may be incorrectly de-allocated. If these blocks are then reused by other files, corruption may occur in random locations. The consequences range from corruption of file system metadata, through silent file data corruption, to potential loss of the entire file system being processed. When an unrecoverable error occurs, it can be observed in the command output and in mmfs.log. One such unrecoverable error is "Not enough memory to allocate internal data structure", which indicates that the GPFS pagepool is not large enough for the current workload.


Below is an example of a file system metadata scanning failure shown in the command output:

$ mmrestripefs gpfs1 -m
Scanning file system metadata, phase 1 ...
Error processing inodes.
Not enough memory to allocate internal data structure.
mmrestripefs: Command failed. Examine previous error messages to determine cause.

The failed commands are also logged in the mmfs.log files. For example:

[I] Command: mmrestripefs /dev/gpfs1 -m
[E] Command: err 12: mmrestripefs /dev/gpfs1 -m
[E] Not enough memory to allocate internal data structure.

[I] Command: tsdeldisk /dev/fs0 -F /var/mmfs/tmp/diskfile.mmdeldisk.9961542 -r
[E] Command: err 12: tsdeldisk /dev/fs0 -F /var/mmfs/tmp/diskfile.mmdeldisk.9961542 -r
[E] Not enough memory to allocate internal data structure.
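
The failure records in the excerpt above follow a recognizable pattern, so a log search can be scripted. The following is a minimal sketch, not an IBM-supplied tool. It assumes each failure is recorded on one line in the form "[E] Command: err N: command arguments", as in the excerpt. The function name find_failed_commands is hypothetical, and real log lines may carry a timestamp prefix, which the search tolerates.

```python
import re

# Pattern taken from the excerpt above, for example:
#   [E] Command: err 12: mmrestripefs /dev/gpfs1 -m
# Assumption: one failure record per line; any prefix before "[E]" is ignored.
FAILED_COMMAND = re.compile(r"\[E\] Command: err (\d+): (\S+)\s*(.*)")


def find_failed_commands(log_lines):
    """Return (error_code, command, arguments) for each failed-command record."""
    failures = []
    for line in log_lines:
        match = FAILED_COMMAND.search(line)
        if match:
            code, command, arguments = match.groups()
            failures.append((int(code), command, arguments.strip()))
    return failures


if __name__ == "__main__":
    sample = [
        "[I] Command: mmrestripefs /dev/gpfs1 -m",
        "[E] Command: err 12: mmrestripefs /dev/gpfs1 -m",
        "[E] Not enough memory to allocate internal data structure.",
    ]
    for code, command, arguments in find_failed_commands(sample):
        print(f"err {code}: {command} {arguments}")
```

In practice the lines would come from every mmfs.log file retained since the upgrade, read with an ordinary file loop.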

When a metadata scanning failure has caused corruption, the symptoms can vary widely. The following are examples of symptoms that have been observed:

1) After the mmrestripefs command failed with "err 12" (the out-of-pagepool-memory error), it was run a second time on the same file system, and the second run reported an MMFS_FSSTRUCT error. Once the first run has failed, the MMFS_FSSTRUCT error may occur whether the second run succeeds or fails.

The MMFS_FSSTRUCT error is always logged in the syslog. On Linux, the syslog file is /var/log/messages.

Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=7866532:
Invalid disk data structure.  Error code 1108.   Volume fs0
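
To show how such syslog entries might be collected for review, here is a small sketch that extracts the fields of the two-line record above. It assumes the header and detail lines look like the example (possibly with a syslog prefix in front). The function name find_fsstruct_errors and the dictionary keys are illustrative choices, not part of any GPFS interface.

```python
import re

# Both patterns are modeled on the two-line example above.
HEADER = re.compile(r"Error=MMFS_FSSTRUCT, ID=(0x[0-9A-Fa-f]+), Tag=(\d+)")
DETAIL = re.compile(r"Error code (\d+)\.\s+Volume (\S+)")


def find_fsstruct_errors(syslog_lines):
    """Collect MMFS_FSSTRUCT records as dictionaries.

    Assumption: the detail line (error code and volume) follows its
    header line before the next header appears.
    """
    records = []
    pending = None  # header seen, detail line not yet matched
    for line in syslog_lines:
        header = HEADER.search(line)
        if header:
            pending = {
                "id": header.group(1),
                "tag": int(header.group(2)),
                "error_code": None,
                "volume": None,
            }
            records.append(pending)
            continue
        detail = DETAIL.search(line)
        if detail and pending is not None:
            pending["error_code"] = int(detail.group(1))
            pending["volume"] = detail.group(2)
            pending = None
    return records
```

Grouping the resulting records by volume gives a quick view of which file systems have logged structure errors.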


2) In a second instance, after the mmdeldisk command (shown as tsdeldisk in the mmfs.log file) failed with "Not enough memory to allocate internal data structure", offline mmfsck failed in log recovery and the file system could no longer be mounted.

# mmfsck fs0 -xa -nvc -t /log
GPFS: 6027-700 [E] Log recovery failed.
GPFS: 6027-699 [E] Inconsistency in file system metadata.
mmfsck: 6027-1639 Command failed. Examine previous error messages to determine cause.

$ mmmount fs0
6027-1623 mmmount: Mounting file systems ...
mount: Stale file handle
mmmount: 6027-1639 Command failed. Examine previous error messages to determine cause.


Users Affected: This issue affects users meeting all of the following conditions:

1) User is running Spectrum Scale versions 4.2.3.0 to 4.2.3.3.

2) Any of the following commands have been run, and failed, while running the affected Spectrum Scale 4.2.3.0-4.2.3.3 code levels:

- mmdeldisk
- mmrpldisk
- mmrestripefs (except the -c option for replica compare and the -z option for compressing user files)
- mmadddisk -r (mmrestripefs is run automatically after adding the disks)

If none of the commands listed above have been run since upgrading to IBM Spectrum Scale 4.2.3, or if all such commands have been run successfully, your file system has not been affected by this problem.

Note: If not all of the mmfs.log and syslog files written since the upgrade to 4.2.3 are still available, checking the remaining files may not be sufficient. System administrators need to stay alert for unusual error reports in logs or from users.
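
The two conditions above can be expressed as a short check. This is a sketch under stated assumptions: the helper names are hypothetical, the version is supplied as a dotted string such as "4.2.3.2", a missing fourth component is treated as 0, and command arguments are supplied as a list of tokens. The names tsdeldisk and tsrpldisk are included because the logs record mmdeldisk and mmrpldisk under those names.

```python
AFFECTED_MIN = (4, 2, 3, 0)
AFFECTED_MAX = (4, 2, 3, 3)


def parse_version(version):
    """Turn '4.2.3.2' into (4, 2, 3, 2); pad missing components with 0 (assumption)."""
    parts = [int(p) for p in version.strip().split(".")]
    parts += [0] * (4 - len(parts))
    return tuple(parts[:4])


def is_affected_version(version):
    """True for code levels 4.2.3.0 through 4.2.3.3."""
    return AFFECTED_MIN <= parse_version(version) <= AFFECTED_MAX


def is_risky_command(command, args):
    """True if a failure of this command falls under the listed conditions."""
    if command in ("mmdeldisk", "tsdeldisk", "mmrpldisk", "tsrpldisk"):
        return True
    if command == "mmrestripefs":
        # Replica compare (-c) and user file compression (-z) are excluded.
        return "-c" not in args and "-z" not in args
    if command == "mmadddisk":
        # Only -r triggers the automatic mmrestripefs run.
        return "-r" in args
    return False


def is_exposed(version, command, args, failed):
    """All conditions together: affected level, listed command, and a failed run."""
    return is_affected_version(version) and failed and is_risky_command(command, args)
```

For example, is_exposed("4.2.3.2", "mmrestripefs", ["gpfs1", "-m"], failed=True) evaluates to True, while the same call with ["gpfs1", "-c"] evaluates to False.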

Recommendations:

1. Any customer running IBM Spectrum Scale 4.2.3.0 to 4.2.3.3 should apply IBM Spectrum Scale V4.2.3.4, available from Fix Central at: https://www-945.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Spectrum+Scale&release=4.2.3&platform=All&function=all

or contact IBM Service to obtain and apply the efix for this issue (APAR IV98609).


2. Until the efix has been applied, avoid running any of the following commands:

- mmdeldisk
- mmrpldisk
- mmrestripefs (except the -c option for replica compare and the -z option for compressing user files)
- mmadddisk -r (mmrestripefs is run automatically after adding the disks)

3. Evaluate whether your file system has been affected:

Check the mmfs.log files and the syslog files for any failed mmdeldisk, mmrpldisk, or mmrestripefs commands and for any MMFS_FSSTRUCT error messages. To find the execution records of these commands in the mmfs.log files, search for "mmrestripefs", "tsdeldisk", and "tsrpldisk". If not all of the mmfs.log and syslog files written since the upgrade to 4.2.3 are still available, checking the remaining files may not be sufficient. System administrators need to stay alert for unusual error reports in logs or from users.

4. If you suspect that a file system is already affected by this problem, please run offline mmfsck to confirm.

5. If the offline mmfsck reports anything but a clean file system, please contact IBM Service for further guidance and assistance.

Note: A clean file system would be indicated by the following mmfsck output: "File system is clean".
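
Steps 4 and 5 reduce to one decision on the mmfsck output. The sketch below assumes the output has been captured as text and treats only the exact phrase quoted in the note as evidence of a clean file system. The function name mmfsck_reports_clean is hypothetical.

```python
CLEAN_MARKER = "File system is clean"


def mmfsck_reports_clean(output):
    """Return True only if the captured mmfsck output contains the clean marker.

    Anything else, including empty output or a failed run, is treated as
    "not clean", which per step 5 means contacting IBM Service.
    """
    return any(CLEAN_MARKER in line for line in output.splitlines())


if __name__ == "__main__":
    failed_run = (
        "GPFS: 6027-700 [E] Log recovery failed.\n"
        "GPFS: 6027-699 [E] Inconsistency in file system metadata.\n"
    )
    if not mmfsck_reports_clean(failed_run):
        print("Not clean: contact IBM Service")
```

Treating every other outcome as "not clean" keeps the check conservative, which matches the guidance in step 5.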


Document Information

Modified date:
26 September 2022

UID

ssg1S1010487