A fix is available
APAR status
Closed as program error.
Error description
MQ for NonStop Server: reinitialization of multiple slave repository managers in parallel results in a deadlock, and FDC RM220005 is logged with AMQ9511 and AMQ9448. When the cache state of a slave repository manager is inconsistent with that of the master, the slave performs an automatic full reinitialization. During initialization the repository managers (repmans) perform a handshake with each other. When multiple slave repository managers reinitialize in parallel, this handshake results in a deadlock that prevents the slave repository managers from processing further updates. Restarting a particular repman invalidates the cluster cache on its CPU. The cluster cache then becomes unavailable, because the freshly started repman also hangs in the deadlock.
Local fix
The deadlock can be resolved by identifying the repmans participating in it, using pstate open information:

    26 \CS3.$X12PC:6937755593
    Process 0 0
    Current operations: Writeread
    Sync depth at open time was 0. Options at open time was x4000.
    Access mode is Read/Write Shared

Usually all slaves but one are waiting on a writeread to one particular slave, which in turn waits for one of the others. To resolve the deadlock, stop the slaves that are waiting for that same slave one by one, and stop the slave the others are waiting for last.
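The stop order described above can be sketched as a small helper. This is only an illustration, not an IBM-supplied tool: the function name stop_order, the waits_on mapping, and the process names are hypothetical, and the mapping is assumed to have been collected by hand from the pstate output of each repman.

```python
from collections import Counter


def stop_order(waits_on):
    """Return the order in which to stop deadlocked slave repmans.

    waits_on is a hypothetical mapping from each blocked slave to the
    slave it is waiting on (the target of its pending writeread).
    """
    if not waits_on:
        return []
    # The slave that most of the others are blocked on.
    hub, _ = Counter(waits_on.values()).most_common(1)[0]
    # Every slave waiting on the hub is stopped first, one by one.
    waiters = sorted(s for s, target in waits_on.items()
                     if target == hub and s != hub)
    # The slave the others are waiting for is stopped last.
    return waiters + [hub]


# Hypothetical process names, for illustration only.
example = {"$RM0": "$RM2", "$RM1": "$RM2", "$RM3": "$RM2", "$RM2": "$RM0"}
order = stop_order(example)
```

Here order lists the three slaves waiting on $RM2 first and $RM2 itself last, matching the manual procedure.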
Problem summary
Every slave amqrrmfa process introduces itself on initialization to the master and to the other slave amqrrmfa processes. When multiple slave instances reinitialize in parallel, each one tries to introduce itself to all the others. Because the other slaves are also initializing, they do not process these requests and do not reply until they have completed initialization.
Problem conclusion
The code was changed to implement a timeout that applies when a slave tries to handshake with another slave running on a CPU with a larger ordinal number than its own.
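The rule above can be illustrated with a minimal sketch. It is not the product code: the function names, the reply queue, and the timeout value are all assumptions, since the APAR does not state how the timeout is implemented or how long it is. One plausible reading of why the rule helps is that waits toward higher-numbered CPUs always expire, so the remaining unbounded waits all point toward lower-numbered CPUs and cannot form a cycle.

```python
import queue

# Assumed value for illustration; the APAR does not state the real timeout.
HANDSHAKE_TIMEOUT_S = 0.05


def handshake_timeout(my_cpu, peer_cpu, timeout_s=HANDSHAKE_TIMEOUT_S):
    """Return the wait limit for a handshake reply.

    Bounded when the peer runs on a CPU with a larger ordinal number,
    None (wait indefinitely) otherwise.
    """
    return timeout_s if peer_cpu > my_cpu else None


def handshake(my_cpu, peer_cpu, reply_queue, timeout_s=HANDSHAKE_TIMEOUT_S):
    """Wait for the peer's reply; return False if the bounded wait expires."""
    limit = handshake_timeout(my_cpu, peer_cpu, timeout_s)
    try:
        reply_queue.get(timeout=limit)
        return True
    except queue.Empty:
        # The peer is probably still initializing; give up instead of
        # blocking, so this slave can finish its own initialization.
        return False


# A peer on a higher-numbered CPU that never replies: the wait expires.
silent_peer = queue.Queue()
timed_out = not handshake(1, 3, silent_peer, timeout_s=0.01)

# A peer on a lower-numbered CPU that has already replied.
ready_peer = queue.Queue()
ready_peer.put("ack")
completed = handshake(3, 1, ready_peer)
```

In this sketch a slave that times out simply continues; whether the real repman retries the handshake later is not stated in the APAR.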
Temporary fix
Comments
APAR Information
APAR number
IT24388
Reported component name
WEBS MQ NSS ITA
Reported component ID
5724A3902
Reported release
531
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2018-03-15
Closed date
2018-07-13
Last modified date
2018-07-13
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
WEBS MQ NSS ITA
Fixed component ID
5724A3902
Applicable component levels
Document Information
Modified date:
31 March 2023