IBM Support

PI50551: Repeated network failures cause primary shard demotions and TargetNotAvailableExceptions


APAR status

  • Closed as program error.

Error description

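  • After repeated network failures and recovery, stale placement
    work can occur and cause incorrect shard movements, including
    the demotion or recycling of a primary shard. This can lead to
    TargetNotAvailableException or loss of data.
    
    Symptoms of this problem include:
    
    A CWOBJ1524 listing "Replica was disconnected from primary on
    containerName for an unknown length of time and must be
    reregistered to restart replication" as the reason to
    re-register on a shard that is primary. The CWOBJ1524 is
    caused by a stale request from a primary shard running on the
    server experiencing network problems.
    
    Or a primary is demoted and not promoted to either a replica.
    The demotion is requested by a stale primary on the server
    experiencing network problems. In the following example,
    container1 would be the container experiencing intermittent
    network problems:
    
    [10/1/15 11:05:29:560 EDT] 000000d4 PrimaryShardI I   CWOBJ1547I: PLATFORM:PLATFORM_MAPSET:9 (demoting primary to inactive) in transition.
    [10/1/15 11:05:29:560 EDT] 000000d4 PrimaryShardI I   CWOBJ1575I: Request to demote primary (PLATFORM:PLATFORM_MAPSET:9) originated from container container1.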
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED:  WebSphere eXtreme Scale users experiencing  *
    *                  frequent network failures in a short amount *
    *                  of time where the next failure occurs       *
    *                  before placement and replication completes  *
    *                  from the prior recovery.                    *
    ****************************************************************
    * PROBLEM DESCRIPTION: Repeated network failures cause primary *
    *                      shard demotions and                     *
    *                      TargetNotAvailableExceptions            *
    ****************************************************************
    * RECOMMENDATION:                                              *
    ****************************************************************
    After repeated network failures and recovery, stale placement
    work can occur and cause incorrect shard movements, including
    the demotion or recycling of a primary shard. This can lead to
    TargetNotAvailableException or loss of data.
    Symptoms of this problem include:
    A CWOBJ1524 listing "Replica was disconnected from primary on
    containerName for an unknown length of time and must be
    reregistered to restart replication" as the reason to
    re-register on a shard that is primary. The CWOBJ1524 is
    caused by a stale request from a primary shard running on the
    server experiencing network problems.
    Or a primary is demoted and not promoted to either a replica.
    The demotion is requested by a stale primary on the server
    experiencing network problems. In the following example,
    container1 would be the container experiencing intermittent
    network problems:
    [10/1/15 11:05:29:560 EDT] 000000d4 PrimaryShardI I   CWOBJ1547I: PLATFORM:PLATFORM_MAPSET:9 (demoting primary to inactive) in transition.
    [10/1/15 11:05:29:560 EDT] 000000d4 PrimaryShardI I   CWOBJ1575I: Request to demote primary (PLATFORM:PLATFORM_MAPSET:9) originated from container container1.
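
The demotion symptom can be checked for mechanically by scanning container logs for CWOBJ1575I entries and noting which container each demotion request came from. The following is a minimal Python sketch, not an IBM-provided tool. The function name, the sample constant, and the regular expression are illustrative assumptions based only on the two log lines quoted in this APAR; real logs may format the message differently.

```python
import re

# Assumed message layout, taken from the CWOBJ1575I line quoted in this APAR.
DEMOTE_REQUEST = re.compile(
    r"CWOBJ1575I: Request to demote primary\s+\((?P<shard>[^)]+)\)\s+"
    r"originated from container\s+(?P<container>\S+?)\.?\s*$"
)

# Sample text copied from the APAR, including its line wrapping.
SAMPLE_LOG = """
[10/1/15 11:05:29:560 EDT] 000000d4 PrimaryShardI I
CWOBJ1547I: PLATFORM:PLATFORM_MAPSET:9 (demoting primary to
inactive) in transition.
[10/1/15 11:05:29:560 EDT] 000000d4 PrimaryShardI I
CWOBJ1575I: Request to demote primary
(PLATFORM:PLATFORM_MAPSET:9) originated from container
container1.
"""


def find_demotion_requests(log_text):
    """Return (shard, originating_container) pairs for CWOBJ1575I entries."""
    # Rebuild logical entries: a line starting with '[' (a timestamp) opens
    # a new entry; any other non-empty line continues the previous one.
    entries = []
    for line in log_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("[") or not entries:
            entries.append(stripped)
        else:
            entries[-1] += " " + stripped

    results = []
    for entry in entries:
        match = DEMOTE_REQUEST.search(entry)
        if match:
            results.append((match.group("shard"), match.group("container")))
    return results


if __name__ == "__main__":
    for shard, container in find_demotion_requests(SAMPLE_LOG):
        print(f"{shard} demotion requested by {container}")
```

If the same container repeatedly appears as the originator and is also the one with intermittent network problems, that matches the pattern described in this APAR.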
    

Problem conclusion

  • Stale placement work is now blocked. If the network fails and
    recovers repeatedly, and more quickly than placement and
    replication can complete, extra shards can remain after
    recovery. If the extra shards persist, they can be removed
    with the xscmd triggerPlacement command and its -removeExtra
    option.
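
As a rough illustration of the cleanup step, the Python sketch below only assembles the command line and prints it; it does not run anything. The triggerPlacement command and the -removeExtra option come from this APAR. The `xscmd.sh` script name, the `-c` command selector, and the helper function are assumptions to verify against the xscmd documentation for your installation, which may also require connection or security arguments not shown here.

```python
import shlex


def build_remove_extra_command(xscmd="xscmd.sh", extra_args=()):
    """Assemble an argument list for triggerPlacement with -removeExtra.

    xscmd and the "-c" selector are assumptions; extra_args is a
    hypothetical hook for site-specific arguments.
    """
    return [xscmd, "-c", "triggerPlacement", "-removeExtra", *extra_args]


if __name__ == "__main__":
    command = build_remove_extra_command()
    # Print the command for review instead of executing it.
    print(shlex.join(command))
```

Returning a list rather than a single string keeps the arguments unambiguous if the command is later passed to `subprocess.run`.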
    

Temporary fix

Comments

APAR Information

  • APAR number

    PI50551

  • Reported component name

    WS EXTREME SCAL

  • Reported component ID

    5724X6702

  • Reported release

    860

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2015-10-14

  • Closed date

    2015-11-05

  • Last modified date

    2015-11-05

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    WS EXTREME SCAL

  • Fixed component ID

    5724X6702

Applicable component levels

  • R860 PSY

       UP


Document Information

Modified date:
05 November 2015