IBM Support

PI51087: Brief network losses during primary shard promotion can cause data loss


APAR status

  • Closed as program error.

Error description

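  • If there are brief network losses causing timeouts on remote
    calls (but the network issues do not last long enough to cause
    container failure) during a primary shard promotion, data loss
    can occur by promoting an empty or partially populated shard.
    
    The following is an example of the failed transition. The same
    sequence is a valid transition if the prior primary shard (in
    this example, container2) failed or experienced a long network
    issue and was marked as down by the catalog server. If this
    occurs, placement work will arrive on container2 and end
    without logging any additional changes.
    
    [10/13/15 14:30:43:027 EDT] 000000a7 ObjectGridCon I
    CWOBJ7509I: Placement work, workId 168, from catalog server for
    partition SimpleGrid:mapSet:2  intended for container
    container2_C-4 was received.
    [10/13/15 14:30:43:794 EDT] 000000b7 SynchronousRe I
    CWOBJ1511I: SimpleGrid:mapSet:2 (temporary synchronous replica)
    is open for business.
    [10/13/15 14:37:13:811 EDT] 00000043 FutureImpl    W
    CWOBJ7851W: Received a timeout while waiting for a response to a
    com.ibm.ws.xs.xio.protobuf.ContainerReplicationProtos$QueryRevisionRequestContext/reqID=14164
    message from endpoint 9.42.123.165:6805. The current timeout is
    30 seconds. When the message was added, the queue size was 2.
    [10/13/15 14:37:35:042 EDT] 000000a7 ReplicatedPar I
    CWOBJ1573I: As part of becoming primary for SimpleGrid:mapSet:2,
    this container was unable to retrieve the necessary data from
    the container container1_C-2.  As such, the catalog service is
    going to be notified to promote an existing replica if one
    exists.  The container is not going to be the host for the
    primary shard for this partition
    ([FAILED_PRI_TRANS_FROM_INACTIVE_PROMOTE_EXISTING_PRI]).
    [10/13/15 14:37:35:056 EDT] 000000a7 ReplicatedPar I
    CWOBJ1531I: SimpleGrid:mapSet:2 (synchronous replica) stopped on
    this server.
    [10/13/15 14:37:35:586 EDT] 000000a7 ObjectGridCon I
    CWOBJ7508I: Placement work, workId 168, for partition
    SimpleGrid:mapSet:2  intended for container container1_C-4
    successfully completed.
    
    If a shard tries and fails to become primary on the first try,
    but the placement work tries to make the same transition again
    (the original primary shard is still available), the shard is
    prevented from taking over as primary due to stale data
    remaining on the original primary.
    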

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED:  WebSphere eXtreme Scale users experiencing  *
    *                  brief network failures, not long enough to  *
    *                  induce a server failure.                    *
    ****************************************************************
    * PROBLEM DESCRIPTION: Brief network losses during primary     *
    *                      shard promotion can cause data loss.    *
    ****************************************************************
    * RECOMMENDATION:                                              *
    ****************************************************************
    If there are brief network losses causing timeouts on remote
    calls (but the network issues do not last long enough to cause
    container failure) during a primary shard promotion, data loss
    can occur by promoting an empty or partially populated shard.
    The following is an example of the failed transition. The same
    sequence is a valid transition if the prior primary shard (in
    this example, container2) failed or experienced a long network
    issue and was marked as down by the catalog server. If this
    occurs, placement work will arrive on container2 and end
    without logging any additional changes.
    [10/13/15 14:30:43:027 EDT] 000000a7 ObjectGridCon I
    CWOBJ7509I: Placement work, workId 168, from catalog server for
    partition SimpleGrid:mapSet:2  intended for container
    container2_C-4 was received.
    [10/13/15 14:30:43:794 EDT] 000000b7 SynchronousRe I
    CWOBJ1511I: SimpleGrid:mapSet:2 (temporary synchronous replica)
    is open for business.
    [10/13/15 14:37:13:811 EDT] 00000043 FutureImpl    W
    CWOBJ7851W: Received a timeout while waiting for a response to a
    com.ibm.ws.xs.xio.protobuf.ContainerReplicationProtos$QueryRevisionRequestContext/reqID=14164
    message from endpoint 9.42.123.165:6805. The current timeout is
    30 seconds. When the message was added, the queue size was 2.
    [10/13/15 14:37:35:042 EDT] 000000a7 ReplicatedPar I
    CWOBJ1573I: As part of becoming primary for SimpleGrid:mapSet:2,
    this container was unable to retrieve the necessary data from
    the container container1_C-2.  As such, the catalog service is
    going to be notified to promote an existing replica if one
    exists.  The container is not going to be the host for the
    primary shard for this partition
    ([FAILED_PRI_TRANS_FROM_INACTIVE_PROMOTE_EXISTING_PRI]).
    [10/13/15 14:37:35:056 EDT] 000000a7 ReplicatedPar I
    CWOBJ1531I: SimpleGrid:mapSet:2 (synchronous replica) stopped on
    this server.
    [10/13/15 14:37:35:586 EDT] 000000a7 ObjectGridCon I
    CWOBJ7508I: Placement work, workId 168, for partition
    SimpleGrid:mapSet:2  intended for container container1_C-4
    successfully completed.
    If a shard tries and fails to become primary on the first try,
    but the placement work tries to make the same transition again
    (the original primary shard is still available), the shard is
    prevented from taking over as primary due to stale data
    remaining on the original primary.
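
The log sequence above can be searched for mechanically. The following Python sketch is one possible way to flag it, not an IBM-provided tool: it joins wrapped log lines into entries, finds each CWOBJ1573I message, and reports whether a CWOBJ7851W timeout warning was logged shortly before it. The function names, the abbreviated SAMPLE text, and the 60-second window are assumptions made for illustration. Treating this pair of messages as a sign of the problem is an inference from the excerpt, since the same messages also appear in a valid transition when the prior primary really was marked as down.

```python
import re
from datetime import datetime, timedelta

# Entry header as it appears in the excerpt, for example:
# [10/13/15 14:37:13:811 EDT] 00000043 FutureImpl    W
HEADER = re.compile(
    r"\[(\d{1,2}/\d{1,2}/\d{2} \d{1,2}:\d{2}:\d{2}:\d{3}) \w+\]\s+"
    r"[0-9a-f]{8}\s+\S+\s+[A-Z]\b"
)
MESSAGE_ID = re.compile(r"CWOBJ\d{4}[A-Z]")
PARTITION = re.compile(r"\w+:\w+:\d+")  # for example SimpleGrid:mapSet:2


def parse_entries(text):
    """Return (timestamp, message_id, body) per entry, joining wrapped lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.search(line)
        if header:
            stamp = datetime.strptime(header.group(1), "%m/%d/%y %H:%M:%S:%f")
            entries.append((stamp, [line[header.end():].strip()]))
        elif entries and line:
            # Continuation of the previous entry's wrapped message text.
            entries[-1][1].append(line)
    parsed = []
    for stamp, parts in entries:
        body = " ".join(part for part in parts if part)
        found = MESSAGE_ID.search(body)
        parsed.append((stamp, found.group(0) if found else None, body))
    return parsed


def find_failed_promotions(text, window=timedelta(seconds=60)):
    """List CWOBJ1573I entries and whether a CWOBJ7851W preceded each one.

    The window length is a hypothetical choice, not a documented value.
    """
    entries = parse_entries(text)
    timeouts = [stamp for stamp, msg_id, _ in entries if msg_id == "CWOBJ7851W"]
    findings = []
    for stamp, msg_id, body in entries:
        if msg_id != "CWOBJ1573I":
            continue
        partition = PARTITION.search(body)
        findings.append({
            "time": stamp,
            "partition": partition.group(0) if partition else None,
            "timeout_before": any(
                timedelta(0) <= stamp - t <= window for t in timeouts
            ),
        })
    return findings


# Abbreviated from the excerpt above; message text is shortened.
SAMPLE = """\
[10/13/15 14:37:13:811 EDT] 00000043 FutureImpl    W
CWOBJ7851W: Received a timeout while waiting for a response
from endpoint 9.42.123.165:6805.
[10/13/15 14:37:35:042 EDT] 000000a7 ReplicatedPar I
CWOBJ1573I: As part of becoming primary for SimpleGrid:mapSet:2,
this container was unable to retrieve the necessary data from
the container container1_C-2.
"""

for finding in find_failed_promotions(SAMPLE):
    print(finding["partition"], finding["time"], finding["timeout_before"])
```

A hit from this scan only narrows down where to look. Whether the prior primary was actually marked as down by the catalog server still has to be confirmed from the catalog server logs.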
    

Problem conclusion

  • The checks on whether to continue a primary promotion were
    improved.
    

Temporary fix

Comments

APAR Information

  • APAR number

    PI51087

  • Reported component name

    WS EXTREME SCAL

  • Reported component ID

    5724X6702

  • Reported release

    860

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2015-10-23

  • Closed date

    2015-11-05

  • Last modified date

    2015-11-05

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    WS EXTREME SCAL

  • Fixed component ID

    5724X6702

Applicable component levels

  • R860 PSY

       UP

Document Information

Modified date:
05 November 2015