PM71892: COMPUTE GRID JOB SCHEDULER FAILS TO RESUME DISPATCHING JOBS TO AN ENDPOINT SERVER THAT WAS QUIESCED BEFORE BEING RECYCLED.

A fix is available

8.5.0.2: WebSphere Application Server V8.5 Fix Pack 2

APAR status

Closed as program error.

Error description

Compute Grid V8.0 jobs stopped running after the customer
recycled their "batch" cluster because of database
configuration issues. In addition, Compute Grid V8.0 tends to
wipe out job log messages stating that a given job cannot be
dispatched.

Local fix

Problem summary

****************************************************************
* USERS AFFECTED:  Users of the Java batch function in IBM     *
*                  WebSphere Application Server V8.5           *
****************************************************************
* PROBLEM DESCRIPTION: Communication between the Compute       *
*                      Grid/batch scheduler server and         *
*                      endpoint server(s) is not always        *
*                      re-established after a scheduler or     *
*                      endpoint is recycled, for example       *
*                      after an endpoint is quiesced. Jobs     *
*                      are then not dispatched (stuck in       *
*                      submitted state).                       *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************
The Java batch architecture uses a single scheduler server to
dispatch work to a number of endpoint servers hosting the
batch application. The scheduler establishes communication
with the endpoint servers using a "heartbeat" mechanism.
There was a timing window in which communication with a
particular endpoint server was not reestablished after the
endpoint was recycled, or after the scheduler was recycled
while the endpoint(s) remained active.
There was also a bug such that if an endpoint was quiesced
and then recycled, communication with that endpoint was not
reestablished correctly when the endpoint came back up.
In both cases, jobs can appear to be "stuck in submitted
state", that is, they are not dispatched to the appropriate
endpoint. It is also possible that a given cluster member
does not get any jobs dispatched to it while the other
cluster member(s) receive the job dispatches.
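
The quiesce-then-recycle symptom described above can be illustrated
with a small model. The sketch below is hypothetical and is not the
product's code: the class EndpointRegistry, its methods, and the
resetOnRestart flag are names invented for illustration. It assumes
the scheduler keeps a per-endpoint state that heartbeats update, and
shows how a quiesce flag that survives a recycle would leave one
cluster member without any dispatched jobs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a scheduler's endpoint table; not the product's code. */
final class EndpointRegistry {
    enum State { ACTIVE, QUIESCED }

    private final Map<String, State> endpoints = new LinkedHashMap<>();
    private final boolean resetOnRestart; // true models the corrected behavior

    EndpointRegistry(boolean resetOnRestart) {
        this.resetOnRestart = resetOnRestart;
    }

    /** Marks an endpoint as closed to new work. */
    void quiesce(String endpoint) {
        endpoints.put(endpoint, State.QUIESCED);
    }

    /**
     * Records a heartbeat. freshStart is true for the first heartbeat
     * after the endpoint server has been recycled.
     */
    void heartbeat(String endpoint, boolean freshStart) {
        boolean quiesced = endpoints.get(endpoint) == State.QUIESCED;
        if (quiesced && !(freshStart && resetOnRestart)) {
            return; // quiesce flag is kept, so the endpoint stays ineligible
        }
        endpoints.put(endpoint, State.ACTIVE);
    }

    /** Endpoints that may currently receive jobs. */
    List<String> dispatchTargets() {
        List<String> targets = new ArrayList<>();
        for (Map.Entry<String, State> e : endpoints.entrySet()) {
            if (e.getValue() == State.ACTIVE) {
                targets.add(e.getKey());
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        for (boolean fixed : new boolean[] {false, true}) {
            EndpointRegistry registry = new EndpointRegistry(fixed);
            registry.heartbeat("ep1", true);
            registry.heartbeat("ep2", true);
            registry.quiesce("ep1");
            registry.heartbeat("ep1", true); // ep1 comes back after a recycle
            System.out.println("resetOnRestart=" + fixed + " -> " + registry.dispatchTargets());
        }
    }
}
```

With resetOnRestart set to false, only ep2 remains a dispatch target
after ep1 is quiesced and recycled. With it set to true, both
endpoints are eligible again.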

Problem conclusion

The quiesce bug was fixed and the timing window closed so
that endpoint and scheduler servers can be recycled with
dispatch resuming normally once both are up and running.

The fix for this APAR is currently targeted for inclusion in
fix pack 8.5.0.2. Please refer to the Recommended Updates page
for delivery information:
http://www.ibm.com/support/docview.wss?rs=180&uid=swg27004980
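
To check whether an installation is at or above the fix pack named
above, a dotted version comparison is enough. The following is a
minimal sketch: the class FixPackCheck and its methods are
hypothetical helpers, and it assumes that fix packs are cumulative
and that the installed version string is obtained separately (for
example, from the product's versionInfo command).

```java
/** Hypothetical helper; compares dotted numeric version strings. */
final class FixPackCheck {
    /** Fix pack that this APAR names as the delivery vehicle. */
    static final String FIX_LEVEL = "8.5.0.2";

    /** Returns a negative, zero, or positive value, like Comparable.compareTo. */
    static int compareVersions(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        int n = Math.max(x.length, y.length);
        for (int i = 0; i < n; i++) {
            // Missing trailing components are treated as zero.
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** True if the installed level is at or above the fix level (assumes cumulative fix packs). */
    static boolean includesFix(String installedVersion) {
        return compareVersions(installedVersion, FIX_LEVEL) >= 0;
    }

    public static void main(String[] args) {
        for (String v : new String[] {"8.5.0.1", "8.5.0.2"}) {
            System.out.println(v + " includes fix: " + includesFix(v));
        }
    }
}
```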

Temporary fix

Comments

APAR Information

APAR number: PM71892
Reported component name: WEBS APP SERV N
Reported component ID: 5724H8800
Reported release: 850
Status: CLOSED PER
PE: NoPE
HIPER: NoHIPER
Special Attention: NoSpecatt
Submitted date: 2012-08-30
Closed date: 2012-12-18
Last modified date: 2012-12-18

APAR is sysrouted FROM one or more of the following:

PM69782

APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name: WEBS APP SERV N
Fixed component ID: 5724H8800

Applicable component levels

R850 PSY   UP

Document Information

Modified date: 01 November 2021
