LO88214: REPEATED ORDEREDLATCH TIMEOUT ERRORS EVEN AFTER THE CAUSE HAS BEEN FIXED
Closed as program error.
1. Thread A makes a request for an operation that can only be handled by one thread at a time. To do this, it starts an OrderedLatch, which enforces one-at-a-time access.
2. Thread B needs to perform the same operation, so it latches onto the OrderedLatch from #1.
3. Thread C needs to perform the same operation, so it latches onto the OrderedLatch from #1 (but behind Thread B).
4. Thread D needs to perform the same operation, so it latches onto the OrderedLatch from #1 (but behind Thread C).
5. Thread B times out. This should cause both C and D to time out as well, but only C times out.
6. Thread E needs to perform the same operation, so it latches onto the OrderedLatch from #1. D should already be gone, but because it is not, E waits on D instead of on A as it should.
7. Thread A finishes and gives up the latch. At this point, E should have been the only waiting thread, gotten the latch, and executed the code as the holder of the latch. However, E is waiting on D, and D is waiting on C, which no longer exists. Because C no longer exists, it will never signal D that it has finished, so D will time out.

If another thread (Thread F) needs to perform the same operation, it joins the pile-up of waiting threads. Each of them is waiting on something that no longer exists, so timing out is the only possible outcome. If the latches had been linked correctly (if D had gone away with C, and E had latched onto A), E would have run as soon as A was done, F would have followed E as expected, and neither E nor F would have logged the WARNING message about timing out.

Note: This APAR fix only handles timeouts better in these situations. The root cause of the problem is whatever held up A long enough for B to time out in the first place. That root cause should still be investigated, as it is most likely causing non-optimal user sync experiences, but this APAR fix will at least allow Traveler to recover more quickly in cases where the root cause is intermittent.
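The wait chain described in the steps above can be illustrated with a small, single-threaded toy model. This is only a sketch based on the description in this APAR and does not reflect Traveler's actual implementation: the names Waiter, OrderedLatchModel, cascade_fully, stranded and run_scenario are all hypothetical. The assumption is that each waiter holds a reference to the thread ahead of it, and that the faulty behavior removes only the timed-out thread and its immediate successor.

```python
class Waiter:
    """One thread in the wait chain (hypothetical model type)."""

    def __init__(self, name, waits_on=None):
        self.name = name
        self.waits_on = waits_on  # thread this one expects a signal from
        self.timed_out = False


class OrderedLatchModel:
    """Toy model of the chain; an assumption, not Traveler's code."""

    def __init__(self, holder_name, cascade_fully):
        self.chain = [Waiter(holder_name)]  # holder first, then live waiters
        self.cascade_fully = cascade_fully

    def join(self, name):
        # A new arrival latches onto the last entry still in the chain.
        self.chain.append(Waiter(name, waits_on=self.chain[-1]))

    def time_out(self, name):
        start = next(i for i, w in enumerate(self.chain) if w.name == name)
        # Faulty behavior: only the timed-out waiter and its immediate
        # successor leave. Corrected behavior: everyone behind it leaves.
        end = len(self.chain) if self.cascade_fully else start + 2
        for waiter in self.chain[start:end]:
            waiter.timed_out = True
        del self.chain[start:end]

    def stranded(self):
        """Names of waiters whose predecessor chain hits a timed-out thread."""
        names = []
        for waiter in self.chain[1:]:
            pred = waiter.waits_on
            while pred is not None and not pred.timed_out:
                pred = pred.waits_on
            if pred is not None:  # reached a thread that will never signal
                names.append(waiter.name)
        return names


def run_scenario(cascade_fully):
    latch = OrderedLatchModel("A", cascade_fully)
    for name in ("B", "C", "D"):
        latch.join(name)
    latch.time_out("B")   # step 5
    latch.join("E")       # step 6
    latch.join("F")       # the later arrival
    return latch.stranded()
```

In this model, run_scenario(cascade_fully=False) returns ["D", "E", "F"], the threads that can only time out, while run_scenario(cascade_fully=True) returns an empty list because E latches directly onto A and F follows E.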
Restarting Traveler clears the pending latches, but the root cause of the slowness that made the latches time out in the first place will quite possibly cause timeouts again until it is addressed.
Repeated latch timeout errors are reported.
The IBM Traveler server has been updated to handle this scenario correctly when using ordered latches to synchronize threads.
This fix will be included in IBM Traveler 126.96.36.199 and all future releases. For the latest available maintenance release see this technote: http://www.ibm.com/support/docview.wss?uid=swg24019529
Reported component name
LOTUS NOTES TRA
Reported component ID
NoSpecatt / Xsystem
Last modified date
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fixed component name
LOTUS NOTES TRA
Fixed component ID
Applicable component levels