APAR status
Closed as program error.
Error description
If there are brief network losses causing timeouts on remote cal
The following are an example of the failed transition. The follo
[10/13/15 14:30:43:027 EDT] 000000a7 ObjectGridCon I CWOBJ7509
[10/13/15 14:30:43:794 EDT] 000000b7 SynchronousRe I CWOBJ1511
[10/13/15 14:37:13:811 EDT] 00000043 FutureImpl W CWOBJ7851
[10/13/15 14:37:35:042 EDT] 000000a7 ReplicatedPar I CWOBJ1573
[10/13/15 14:37:35:056 EDT] 000000a7 ReplicatedPar I CWOBJ1531
1074 110:[10/13/15 14:37:35:586 EDT] 000000a7 ObjectGridCon I
If a shard tries and fails to become primary on the first try, b
Local fix
Problem summary
****************************************************************
* USERS AFFECTED: WebSphere eXtreme Scale users experiencing   *
*                 brief network failures, not long enough to   *
*                 induce a server failure.                     *
****************************************************************
* PROBLEM DESCRIPTION: Brief network losses during primary     *
*                      shard promotion can cause data loss.    *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************
If brief network losses cause timeouts on remote calls during a primary shard promotion, but do not last long enough to cause a container failure, data loss can occur because an empty or partially populated shard is promoted.

The following is an example of the failed transition. The same sequence is a valid transition if the prior primary shard (in this example, container2) failed or experienced a long network issue and was marked as down by the catalog server. If this occurs, placement work arrives on container2 and ends without logging any additional changes.

[10/13/15 14:30:43:027 EDT] 000000a7 ObjectGridCon I CWOBJ7509I: Placement work, workId 168, from catalog server for partition SimpleGrid:mapSet:2 intended for container container2_C-4 was received.
[10/13/15 14:30:43:794 EDT] 000000b7 SynchronousRe I CWOBJ1511I: SimpleGrid:mapSet:2 (temporary synchronous replica) is open for business.
[10/13/15 14:37:13:811 EDT] 00000043 FutureImpl W CWOBJ7851W: Received a timeout while waiting for a response to a com.ibm.ws.xs.xio.protobuf.ContainerReplicationProtos$QueryRevisionRequestContext/reqID=14164 message from endpoint 9.42.123.165:6805. The current timeout is 30 seconds. When the message was added, the queue size was 2.
[10/13/15 14:37:35:042 EDT] 000000a7 ReplicatedPar I CWOBJ1573I: As part of becoming primary for SimpleGrid:mapSet:2, this container was unable to retrieve the necessary data from the container container1_C-2. As such, the catalog service is going to be notified to promote an existing replica if one exists. The container is not going to be the host for the primary shard for this partition ([FAILED_PRI_TRANS_FROM_INACTIVE_PROMOTE_EXISTING_PRI]).
[10/13/15 14:37:35:056 EDT] 000000a7 ReplicatedPar I CWOBJ1531I: SimpleGrid:mapSet:2 (synchronous replica) stopped on this server.
1074 110:[10/13/15 14:37:35:586 EDT] 000000a7 ObjectGridCon I CWOBJ7508I: Placement work, workId 168, for partition SimpleGrid:mapSet:2 intended for container container1_C-4 successfully completed.

If a shard tries and fails to become primary on the first attempt, but the placement work tries to make the same transition again (the original primary shard is still available), the shard is prevented from taking over as primary because stale data remains on the original primary.
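To check whether a container log contains the sequence shown above, the entries can be scanned for a CWOBJ1573I message and the most recent CWOBJ7851W timeout before it. The following Python sketch is illustrative only and is not part of the product or the fix. The function name, the regular expressions, and the result fields are assumptions made for this example, and the entry layout is inferred solely from the log lines quoted in this APAR.

```python
import re

# Assumed entry layout, inferred from the log lines quoted in this APAR:
# [timestamp] threadId component level messageId: text
ENTRY = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+(?P<thread>[0-9a-f]{8})\s+(?P<comp>\S+)\s+"
    r"(?P<level>[A-Z])\s+(?P<msgid>CWOBJ\d{4}[A-Z]):\s*(?P<text>.*)"
)
# The partition name follows "becoming primary for" in the CWOBJ1573I text.
PARTITION = re.compile(r"becoming primary for (?P<partition>[^,\s]+)")


def find_failed_promotions(lines):
    """Return one record per CWOBJ1573I entry (hypothetical helper).

    Each record holds the partition name, the timestamp of the failed
    promotion, and the timestamp of the most recent CWOBJ7851W timeout
    seen earlier in the same input (None if there was none).
    """
    findings = []
    last_timeout = None
    for line in lines:
        match = ENTRY.search(line)
        if match is None:
            continue  # not a CWOBJ log entry in the assumed layout
        if match["msgid"] == "CWOBJ7851W":
            last_timeout = match["ts"]
        elif match["msgid"] == "CWOBJ1573I":
            part = PARTITION.search(match["text"])
            findings.append({
                "partition": part["partition"] if part else None,
                "failed_at": match["ts"],
                "preceding_timeout": last_timeout,
            })
    return findings


# Two entries taken from the example above (message text shortened).
SAMPLE = [
    "[10/13/15 14:37:13:811 EDT] 00000043 FutureImpl W CWOBJ7851W: "
    "Received a timeout while waiting for a response to a "
    "com.ibm.ws.xs.xio.protobuf.ContainerReplicationProtos"
    "$QueryRevisionRequestContext/reqID=14164 message from endpoint "
    "9.42.123.165:6805.",
    "[10/13/15 14:37:35:042 EDT] 000000a7 ReplicatedPar I CWOBJ1573I: "
    "As part of becoming primary for SimpleGrid:mapSet:2, this container "
    "was unable to retrieve the necessary data from the container "
    "container1_C-2.",
]

if __name__ == "__main__":
    for finding in find_failed_promotions(SAMPLE):
        print(finding)
```

A timeout that precedes the failed promotion in the log is only a hint, not proof that the two are related. The surrounding placement messages (CWOBJ7509I and CWOBJ7508I) still need to be read to confirm which container was involved.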
Problem conclusion
The checks on whether to continue a primary promotion were improved.
Temporary fix
Comments
APAR Information
APAR number
PI51087
Reported component name
WS EXTREME SCAL
Reported component ID
5724X6702
Reported release
860
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt
Submitted date
2015-10-23
Closed date
2015-11-05
Last modified date
2015-11-05
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
WS EXTREME SCAL
Fixed component ID
5724X6702
Applicable component levels
R860 PSY
UP
Document Information
Modified date:
05 November 2015