APAR status
Closed as program error.
Error description
Primary shard promotion or new replicas are delayed due to hung threads or MessageTimeOutException; for example, you can see the following log activity:
[4/20/15 13:57:01:449 JST] 000000a1 XSThreadPool W CWOBJ7853W: Detected a hung thread named "XIOPrimaryPool : 0" TID:c2 WAITING. Executing since 4/20/2015 13:56:36:169 +0900.
Stack Trace:
  sun.misc.Unsafe.park(Native Method)
  java.util.concurrent.locks.LockSupport.park(LockSupport.java:197)
  java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2054)
  com.ibm.ws.xs.xio.actor.impl.FutureImpl.await(FutureImpl.java:278)
  com.ibm.ws.xs.xio.actor.impl.FutureImpl.get(FutureImpl.java:310)
  com.ibm.ws.objectgrid.container.xio.XIORemoteObjectGridContainerImpl._non_existent(XIORemoteObjectGridContainerImpl.java:139)
  com.ibm.ws.objectgrid.replication.PrimaryShardImpl.updateMasterContainerRefs(PrimaryShardImpl.java:6546)
  com.ibm.ws.objectgrid.replication.XIOIDLReplicatedPartition.processContainerRefs(XIOIDLReplicatedPartition.java:306)
  com.ibm.ws.objectgrid.server.container.ContainerActor.doWorkReceive(ContainerActor.java:303)
  com.ibm.ws.objectgrid.server.container.ContainerActor.receive(ContainerActor.java:180)
  com.ibm.ws.xs.xio.actor.impl.XIOReferableImpl.dispatch(XIOReferableImpl.java:110)
  com.ibm.ws.xsspi.xio.actor.XIORegistry.sendToTarget(XIORegistry.java:981)
  com.ibm.ws.xs.xio.transport.channel.XIORegistryRunnable.run(XIORegistryRunnable.java:88)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1176)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
  com.ibm.ws.objectgrid.thread.XSThreadPool$Worker.run(XSThreadPool.java:309)
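To check whether a container log shows this symptom, the CWOBJ7853W warning and its stack trace can be matched mechanically. The following is a minimal sketch, not an IBM-provided tool: the class name, the method names, and the regular expression are hypothetical, and the message layout is assumed from the single sample in this APAR, so other releases might format the line differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; the message layout is assumed from the one sample in this APAR. */
class HungThreadLogSketch {

    // Captures the thread name, the thread ID, and the thread state from a CWOBJ7853W line.
    private static final Pattern HUNG_THREAD = Pattern.compile(
            "CWOBJ7853W: Detected a hung thread named \"([^\"]+)\" TID:(\\w+) (\\w+)\\.");

    /** Returns {threadName, tid, state} if the line contains a CWOBJ7853W warning. */
    static Optional<String[]> parse(String logLine) {
        Matcher m = HUNG_THREAD.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new String[] {m.group(1), m.group(2), m.group(3)});
    }

    /**
     * Returns true if the stack trace contains both frames that characterize
     * this APAR: placement work being received, and a blocking remote ping.
     */
    static boolean looksLikePlacementPing(String stackTrace) {
        return stackTrace.contains("ContainerActor.doWorkReceive")
                && stackTrace.contains("XIORemoteObjectGridContainerImpl._non_existent");
    }

    public static void main(String[] args) {
        String line = "[4/20/15 13:57:01:449 JST] 000000a1 XSThreadPool W CWOBJ7853W: "
                + "Detected a hung thread named \"XIOPrimaryPool : 0\" TID:c2 WAITING. "
                + "Executing since 4/20/2015 13:56:36:169 +0900.";
        parse(line).ifPresent(hit ->
                System.out.println(hit[0] + " / " + hit[1] + " / " + hit[2]));
    }
}
```

A hung-thread warning alone is not specific to this APAR; the combination of the doWorkReceive and _non_existent frames in the same trace is what the sketch treats as the signature.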
Local fix
Problem summary
USERS AFFECTED: WebSphere eXtreme Scale users running XIO.
PROBLEM DESCRIPTION: Primary shard promotion or new replicas are delayed due to hung threads or MessageTimeOutExceptions.
RECOMMENDATION:
Primary shard promotion or new replicas are delayed due to hung threads or MessageTimeOutExceptions because the incoming placement work (ContainerActor.doWorkReceive) proactively tries to ping the remote XIO references contained in the work. After a recent failure (such as a network issue or a machine failure), the remote call can time out. While the call waits, other incoming placement work cannot complete, which might delay shard promotions or the addition of new replicas; for example:
[4/20/15 13:57:01:449 JST] 000000a1 XSThreadPool W CWOBJ7853W: Detected a hung thread named "XIOPrimaryPool : 0" TID:c2 WAITING. Executing since 4/20/2015 13:56:36:169 +0900.
Stack Trace:
  sun.misc.Unsafe.park(Native Method)
  java.util.concurrent.locks.LockSupport.park(LockSupport.java:197)
  java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2054)
  com.ibm.ws.xs.xio.actor.impl.FutureImpl.await(FutureImpl.java:278)
  com.ibm.ws.xs.xio.actor.impl.FutureImpl.get(FutureImpl.java:310)
  com.ibm.ws.objectgrid.container.xio.XIORemoteObjectGridContainerImpl._non_existent(XIORemoteObjectGridContainerImpl.java:139)
  com.ibm.ws.objectgrid.replication.PrimaryShardImpl.updateMasterContainerRefs(PrimaryShardImpl.java:6546)
  com.ibm.ws.objectgrid.replication.XIOIDLReplicatedPartition.processContainerRefs(XIOIDLReplicatedPartition.java:306)
  com.ibm.ws.objectgrid.server.container.ContainerActor.doWorkReceive(ContainerActor.java:303)
  com.ibm.ws.objectgrid.server.container.ContainerActor.receive(ContainerActor.java:180)
  com.ibm.ws.xs.xio.actor.impl.XIOReferableImpl.dispatch(XIOReferableImpl.java:110)
  com.ibm.ws.xsspi.xio.actor.XIORegistry.sendToTarget(XIORegistry.java:981)
  com.ibm.ws.xs.xio.transport.channel.XIORegistryRunnable.run(XIORegistryRunnable.java:88)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1176)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
  com.ibm.ws.objectgrid.thread.XSThreadPool$Worker.run(XSThreadPool.java:309)
Problem conclusion
The proactive ping was removed. Each placement work item now handles any failures individually, which avoids the bottleneck.
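The effect of the proactive ping can be illustrated with a simplified model. The following sketch is not eXtreme Scale code: the class, the methods, and the single-threaded pool are hypothetical stand-ins, and the blocking ping is simulated with a sleep. It assumes that placement work is processed serially by one worker, which might not match the actual sizing of the XIOPrimaryPool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of placement work on a single worker thread; not eXtreme Scale code. */
class PlacementBottleneckSketch {

    /** Stand-in for a blocking remote ping that times out after pingMillis. */
    static void pingUnreachableContainer(long pingMillis) {
        try {
            Thread.sleep(pingMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Submits workItems placement tasks to one worker thread and returns the
     * elapsed milliseconds until the last task completes. When proactivePing
     * is true, every task first pings an unreachable container.
     */
    static long runPlacementWork(int workItems, boolean proactivePing, long pingMillis)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            long start = System.nanoTime();
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < workItems; i++) {
                pending.add(pool.submit(() -> {
                    if (proactivePing) {
                        pingUnreachableContainer(pingMillis);
                    }
                    // The placement step itself (promotion, new replica) would run here.
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // Wait for each task so that the total delay is measured.
            }
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("with proactive ping:    " + runPlacementWork(3, true, 200) + " ms");
        System.out.println("without proactive ping: " + runPlacementWork(3, false, 200) + " ms");
    }
}
```

In this model the ping timeouts add up because each queued task waits behind the previous one, which mirrors the delayed promotions described in the problem summary. Without the ping, a task is only slowed by a failure that it actually encounters.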
Temporary fix
Comments
APAR Information
APAR number
PI40223
Reported component name
WS EXTREME SCAL
Reported component ID
5724X6702
Reported release
860
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt
Submitted date
2015-05-01
Closed date
2015-05-28
Last modified date
2015-05-28
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
WS EXTREME SCAL
Fixed component ID
5724X6702
Applicable component levels
R860 PSY
UP
Document Information
Modified date:
28 May 2015