IBM Support

IT21112: CLUSTER SENDER CHANNELS FAIL TO RESTART FOLLOWING CPU FAILURE ON 2 CPU SYSTEM


APAR status

  • Closed as program error.

Error description

  • The issue occurs only when the primary instances of the channel
    server and the execution controller are both on the failing CPU.
    The root cause is a race condition during takeover between
    MQCHANSVR and MQECSVR when no new backup process can be
    created. Usually this happens only when a CPU goes down on a
    two-CPU system.
    

Local fix

  • Once the second CPU is back up and the backup processes are
    running again, the problem, if present, can be resolved by:
    1. manually stopping the primary MQCHSVR, using its CPU,PID
    2. manually stopping the MCA, using the PID shown in the channel status:
     mqsc> display chs(CHANNEL) all
    Additionally, the problem can be avoided on two-CPU systems by
    placing the primary MQCHSVR on the CPU running the backup MQECSVR.
    

Problem summary

  • The root cause is a race condition between the failover procedures
    of the MQECSVR and the MQCHSVR when both primaries fail
    simultaneously and no new backup processes can be created.
    This usually happens only when a CPU goes down on a two-CPU system.
    Because of the race condition, the MQCHSVR uses the process handle
    of the failed (lost) primary MQECSVR during the EC health check,
    which causes initialization problems and results in the
    reported issue.
    

Problem conclusion

  • Added a step to the EC health check: the current process handle
    for the known process name of MQECSVR is now retrieved using OS
    APIs before any health check operation that requires the
    MQECSVR process handle is performed.
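    The difference between the old and new health check behaviour can be
    sketched as follows. This is a simplified model, not the product code:
    ProcessRegistry, ProcessHandle and both health_check_* functions are
    hypothetical stand-ins for the OS name-to-handle lookup, and the string
    "MQECSVR" is used as a placeholder for the real, configuration-dependent
    process name. The point is that a handle captured before takeover goes
    stale, while a handle re-resolved by name refers to the process that
    actually took over.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ProcessHandle:
    """Hypothetical process handle: a name plus the CPU,PID it runs as."""
    name: str
    cpu: int
    pid: int


class ProcessRegistry:
    """Hypothetical stand-in for the OS lookup of process handles by name."""

    def __init__(self) -> None:
        self._by_name: Dict[str, ProcessHandle] = {}

    def register(self, handle: ProcessHandle) -> None:
        # A takeover registers the new primary under the same process name.
        self._by_name[handle.name] = handle

    def lookup(self, name: str) -> Optional[ProcessHandle]:
        return self._by_name.get(name)

    def is_alive(self, handle: ProcessHandle) -> bool:
        # A handle is valid only if it matches the currently running process.
        return self._by_name.get(handle.name) == handle


def health_check_cached(registry: ProcessRegistry, cached: ProcessHandle) -> bool:
    """Pre-fix behaviour: trusts a handle captured before the takeover."""
    return registry.is_alive(cached)


def health_check_resolved(registry: ProcessRegistry, name: str) -> bool:
    """Post-fix behaviour: re-resolves the handle by name before checking."""
    current = registry.lookup(name)
    return current is not None and registry.is_alive(current)


registry = ProcessRegistry()
old_primary = ProcessHandle("MQECSVR", cpu=0, pid=100)
registry.register(old_primary)
# CPU 0 fails; the backup on CPU 1 takes over under the same process name.
registry.register(ProcessHandle("MQECSVR", cpu=1, pid=200))

print(health_check_cached(registry, old_primary))   # False: stale handle
print(health_check_resolved(registry, "MQECSVR"))   # True: current handle
```

    In this model the cached check fails even though MQECSVR is running
    again, which mirrors the failed health check described in the problem
    summary.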
    

Temporary fix

  • Configure the MQ Pathway environment on two-CPU systems to start
    the primary MQCHSVR on the CPU hosting the backup MQECSVR.
    This avoids the problem for at least a single CPU failure.
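    Why this placement helps can be checked with a small model. The sketch
    below is an illustration under stated assumptions, not a Pathway
    configuration: each server is mapped to a hypothetical (primary CPU,
    backup CPU) pair, and a single CPU failure is treated as risky when it
    takes down both primaries at once and too few CPUs survive to create new
    backups, which is the condition named in the problem summary.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: server name -> (primary CPU, backup CPU).
Placement = Dict[str, Tuple[int, int]]


def risky_cpu_failures(placement: Placement, cpus: List[int]) -> List[int]:
    """Return the CPUs whose single failure reproduces the APAR condition."""
    risky = []
    for failed in cpus:
        survivors = [c for c in cpus if c != failed]
        # Both primaries were running on the CPU that failed.
        both_primaries_lost = all(
            placement[server][0] == failed for server in ("MQECSVR", "MQCHSVR")
        )
        # A new backup needs a surviving CPU other than the new primary's.
        no_new_backup = len(survivors) < 2
        if both_primaries_lost and no_new_backup:
            risky.append(failed)
    return risky


# Both primaries on CPU 0 (the problem layout).
default = {"MQECSVR": (0, 1), "MQCHSVR": (0, 1)}
# Primary MQCHSVR on the CPU hosting the backup MQECSVR (the temporary fix).
workaround = {"MQECSVR": (0, 1), "MQCHSVR": (1, 0)}

print(risky_cpu_failures(default, [0, 1]))     # [0]
print(risky_cpu_failures(workaround, [0, 1]))  # []
```

    With the workaround layout, no single CPU failure removes both primaries,
    so the race between the two failover procedures cannot arise. With three
    or more CPUs the default layout is not flagged either, because a new
    backup can still be created, which matches the observation that the
    problem is usually limited to two-CPU systems.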
    

Comments

APAR Information

  • APAR number

    IT21112

  • Reported component name

    WEBS MQ NSS ITA

  • Reported component ID

    5724A3902

  • Reported release

    531

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2017-06-21

  • Closed date

    2017-06-28

  • Last modified date

    2017-06-28

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    WEBS MQ NSS ITA

  • Fixed component ID

    5724A3902

Applicable component levels

  • R531 PSY UP


Document Information

Modified date:
28 September 2021