IBM Support

IJ49543: SPECTRUM SCALE ERASURE CODE EDITION BACK-END STORAGE MANAGEMENT DEGRADATION WITH INCOMPLETE BUG FIX CAN CAUSE NODE EXPULSION.

APAR status

  • Closed as program error.

Error description

  • Spectrum Scale Erasure Code Edition interacts with third-party
    software/hardware APIs for internal disk enclosure management.
    If the management interface becomes degraded and commands start
    to hang in the kernel, the hang may also block
    communication-handling threads. The node then fails to renew
    its lease and is fenced off (expelled) from the rest of the
    cluster, which may lead to additional outages. A previous APAR
    was issued for this in 5.1.4, but that fix was incomplete.
    

Local fix

  • The node with hardware problems will show waiters with the
    reason 'Until NSPDServer discovery completes'. If those GPFS
    waiters exceed 2 minutes and the node is also being expelled,
    rebooting the node is recommended.
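
The waiter check described above can be scripted. The following is a minimal sketch, not an IBM-supplied tool: it assumes the waiter listing (for example, from `mmdiag --waiters`) has been captured as text, and that each waiter line contains either "Waiting <n> sec" or "waiting <n> seconds" along with the reason string. That line format is an assumption and should be checked against the output of the installed release. The function name `long_discovery_waiters` is hypothetical.

```python
import re

# Assumed line shapes (not confirmed by this APAR), for example:
#   Waiting 130.1234 sec since 10:12:34, ... reason 'Until NSPDServer discovery completes'
#   0x7F... waiting 130.123 seconds, ... reason 'Until NSPDServer discovery completes'
WAIT_RE = re.compile(r"waiting\s+([0-9]+(?:\.[0-9]+)?)\s+sec", re.IGNORECASE)

# Reason string quoted in the local fix.
REASON = "Until NSPDServer discovery completes"

# "Exceeding 2 minutes" per the local fix.
THRESHOLD_SECONDS = 120.0


def long_discovery_waiters(waiters_text, threshold=THRESHOLD_SECONDS):
    """Return (seconds, line) pairs for discovery waiters above the threshold."""
    hits = []
    for line in waiters_text.splitlines():
        if REASON not in line:
            continue  # only the waiter named in this APAR is of interest
        match = WAIT_RE.search(line)
        if match is None:
            continue  # line does not follow the assumed format
        seconds = float(match.group(1))
        if seconds > threshold:
            hits.append((seconds, line.strip()))
    return hits
```

A non-empty result only flags candidate nodes. The local fix additionally requires that the node is being expelled before a reboot is considered, and this sketch does not check for that.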
    

Problem summary

  • Spectrum Scale Erasure Code Edition interacts with third-party
    software/hardware APIs for internal disk enclosure management.
    If the management interface becomes degraded and commands start
    to hang in the kernel, the hang may also block
    communication-handling threads. The node then fails to renew
    its lease and is fenced off (expelled) from the rest of the
    cluster, which may lead to additional outages. A previous APAR
    was issued for this in 5.1.4, but that fix was incomplete.
    

Problem conclusion

  • This problem is fixed in 5.1.2.15.
    To see all Spectrum Scale APARs and their respective
    fix solutions, refer to:
    https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
    
    Benefits of the solution:
    Code was further reworked to break a lock ordering dependency
    that tightly coupled the RPC handling mechanism to the storage
    backend management software. Degradation of back-end storage
    management no longer causes node expels.
    
    Work Around:
    The node with hardware problems will show waiters with the
    reason 'Until NSPDServer discovery completes'. If those GPFS
    waiters exceed 2 minutes and the node is also being expelled,
    rebooting the node is recommended.
    
    Problem trigger:
    Degradation in back-end storage management that causes commands
    to hang in the kernel.
    
    Symptom:
    Hang/Deadlock/Unresponsiveness/Long Waiters
    
    Platforms affected:
    Linux Only
    
    Functional Area affected:
    ESS/GNR
    
    Customer Impact:
    High Importance
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ49543

  • Reported component name

    SPEC SCALE STD

  • Reported component ID

    5737F33AP

  • Reported release

    512

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2023-12-14

  • Closed date

    2023-12-14

  • Last modified date

    2023-12-14

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SPEC SCALE STD

  • Fixed component ID

    5737F33AP

Applicable component levels


Document Information

Modified date:
15 December 2023