IBM Support

IJ44691: SPECTRUM SCALE ERASURE CODE EDITION BACK-END STORAGE MANAGEMENT DEGRADATION CAN CAUSE NODE EXPULSION

Subscribe to this APAR

By subscribing, you receive periodic emails alerting you to the status of the APAR, along with a link to the fix after it becomes available. You can track this item individually or track all items by product.

Notify me when this APAR changes.

Notify me when an APAR for this component changes.

 

APAR status

  • Closed as program error.

Error description

  • Spectrum Scale Erasure code edition interacts with third party
    software/hardware APIs for internal disk enclosure management.
    If the management interface becomes degraded and starts to hang
    commands in the kernel, the hang may also block communication
    handling threads. This causes a node to fail to renew its lease,
    causing it to be fenced off from the rest of the cluster.
    This may lead to additional outages
    

Local fix

  • The node with hardware problems will show waiters 'Until
    NSPDServer discovery completes.'
    It is recommended to reboot nodes with those GPFS waiters
    exceeding 2 minutes if this node is also being expelled.
    

Problem summary

  • Spectrum Scale Erasure code edition interacts with third party
     software/hardware APIs for internal disk enclosure management.
    If the management interface becomes degraded and starts to hang
    commands in the kernel, the hang may also block communication
    handling threads. This causes a node to fail to renew its lease,
    causing it to be fenced off from the rest of the cluster.
    This may lead to additional outages.
    

Problem conclusion

  • This problem is fixed in 5.1.2.9
    To see all Spectrum Scale APARs and their respective
    Fix solutions refer to page:
    https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_
    apars.html
    
    Benefits of the solution:
    Break the dependency between back-end storage management
    and communication handling threads.
    Degradation of back-end storage management no longer causes
    node expels.
    
    Work Around:
    The node with hardware problems will show waiters 'Until
    NSPDServer discovery completes.'
    It is recommended to reboot nodes with those GPFS waiters
    exceeding 2 minutes if this node is also being expelled.
    
    Problem trigger:
    Degradation in back-end storage management that causes
    commands to hang in the kernel.
    
    Symptom:
    Hang/Deadlock/Unresponsiveness/Long Waiters
    
    Platforms affected:
    Linux Only
    
    Functional Area affected:
    ESS/GNR
    
    Customer Impact:
    High Importance
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ44691

  • Reported component name

    SPEC SCALE STD

  • Reported component ID

    5737F33AP

  • Reported release

    512

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2023-01-04

  • Closed date

    2023-01-19

  • Last modified date

    2023-01-19

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SPEC SCALE STD

  • Fixed component ID

    5737F33AP

Applicable component levels

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"STXKQY"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"512","Line of Business":{"code":"LOB26","label":"Storage"}}]

Document Information

Modified date:
19 January 2023