
IT24388: MQ FOR NONSTOP SERVER, REINITIALISATION OF MULTIPLE SLAVE REPOSITORY MANAGERS IN PARALLEL RESULTS IN DEADLOCK.


APAR status

  • Closed as program error.

Error description

  • MQ for NonStop Server: reinitialisation of multiple slave
    repository managers in parallel results in a deadlock. FDC
    RM220005 is produced and AMQ9511 and AMQ9448 are logged.
    
    When the cache state of a slave repository manager is
    inconsistent with that of the master, the slave performs an
    automatic full reinitialisation. During this reinitialisation
    the repository managers perform a handshake with each other.
    
    When multiple slave repository managers reinitialise in
    parallel, this handshake results in a deadlock, which prevents
    the slave repository managers from processing further updates.
    
    Restarting a particular repository manager invalidates the
    cluster cache on its CPU. The cluster cache then remains
    unavailable, because the freshly started repository manager
    also hangs in the same deadlock.
    

Local fix

  • The deadlock can be resolved by identifying the repository
    managers participating in the deadlock using pstate open
    information:
       26 \CS3.$X12PC:6937755593                  Process       0     0
               Current operations: Writeread
               Sync depth at open time was 0.
               Options at open time was x4000.
               Access mode is Read/Write Shared
    
    Usually all slaves but one are waiting for a Writeread on a
    particular slave, which in turn is waiting for one of the
    others.
    
    To resolve the deadlock, stop the slaves that are waiting on
    the same slave one by one, stopping the slave that the others
    are waiting for last, as illustrated in the sketch below.
    
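    The stop order above can be pictured as a small wait-for graph
    exercise. The following Python sketch is purely illustrative: the
    process names and the wait_for mapping are hypothetical and would
    in practice be taken from the pstate open information shown above.
    
      # Illustrative sketch only, not part of the product.
      def stop_order(wait_for):
          """wait_for maps each deadlocked slave to the slave it is
          blocked on. Returns the order in which to stop them: all
          waiters first, the slave the others are waiting for last."""
          targets = list(wait_for.values())
          last = max(set(targets), key=targets.count)  # most-waited-on slave
          return sorted(p for p in wait_for if p != last) + [last]
      
      # Hypothetical example: three slaves block on $X12D, which in
      # turn blocks on $X12A.
      print(stop_order({"$X12A": "$X12D", "$X12B": "$X12D",
                        "$X12C": "$X12D", "$X12D": "$X12A"}))
      # -> ['$X12A', '$X12B', '$X12C', '$X12D']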

Problem summary

  • Every slave amqrrmfa process introduces itself on
    initialization to the master and to the other slave amqrrmfa
    processes. When multiple slave instances reinitialize in
    parallel, each one tries to introduce itself to all of the
    others. Because the other slaves are also initializing, they
    do not process these requests and do not reply until they have
    completed their own initialization, as modelled in the sketch
    below.
    
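    As a rough illustration of why parallel initialization blocks, the
    following Python sketch models slaves that only answer incoming
    introductions after their own initialization has finished. The
    names and mechanics are hypothetical, not MQ internals.
    
      # Illustrative sketch only, not MQ code.
      import queue
      import threading
      
      N = 3
      inboxes = [queue.Queue() for _ in range(N)]   # incoming introductions
      replies = [queue.Queue() for _ in range(N)]   # replies to our own
      initialized = [False] * N
      
      def slave(i):
          # Introduce ourselves to every other slave.
          for j in range(N):
              if j != i:
                  inboxes[j].put(i)
          # Wait for every peer to acknowledge. The real processes block
          # indefinitely here; the timeout only lets this sketch finish
          # and show that no reply ever arrives.
          acks = 0
          while acks < N - 1:
              try:
                  replies[i].get(timeout=2)
                  acks += 1
              except queue.Empty:
                  print(f"slave {i}: no peer answered the handshake")
                  return
          initialized[i] = True
          # Only a fully initialized slave starts answering its inbox.
          while not inboxes[i].empty():
              replies[inboxes[i].get()].put(i)
      
      threads = [threading.Thread(target=slave, args=(i,)) for i in range(N)]
      for t in threads:
          t.start()
      for t in threads:
          t.join()
      print("initialized:", initialized)   # [False, False, False]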

Problem conclusion

  • Code was changed to implement a timeout when handshaking
    with another slave running on a CPU with a larger ordinal
    number than that of the process itself, as illustrated in the
    sketch below.
    
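    The effect of the change can be sketched as follows. The function
    name, the timeout value and the receive() callback are hypothetical
    and only illustrate the ordinal-based asymmetry; they are not the
    actual MQ code.
    
      # Illustrative sketch only, not the actual fix.
      HANDSHAKE_TIMEOUT_SECS = 30   # hypothetical value
      
      def wait_for_handshake_reply(my_cpu, peer_cpu, receive):
          """receive(timeout) returns the peer's reply or raises
          TimeoutError when no reply arrives in time."""
          if peer_cpu > my_cpu:
              # Peer runs on a CPU with a larger ordinal number: wait
              # only for a bounded time. In any cycle of mutual waits at
              # least one slave waits on a higher-ordinal peer, so the
              # cycle cannot persist once that wait times out.
              try:
                  return receive(timeout=HANDSHAKE_TIMEOUT_SECS)
              except TimeoutError:
                  return None   # give up for now; continue initializing
          # Peer on a lower-ordinal CPU: wait without a timeout, as before.
          return receive(timeout=None)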

Temporary fix

Comments

APAR Information

  • APAR number

    IT24388

  • Reported component name

    WEBS MQ NSS ITA

  • Reported component ID

    5724A3902

  • Reported release

    531

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2018-03-15

  • Closed date

    2018-07-13

  • Last modified date

    2018-07-13

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    WEBS MQ NSS ITA

  • Fixed component ID

    5724A3902

Applicable component levels


Document Information

Modified date:
31 March 2023