LATEHB and NIM_ERROR_STUCK messages appear in the syslog. What do they mean?
Dec 14 11:45:51 node02 cthats: (Recorded using libct_ffdc.a cv 2):::Error ID: 823....D1CuC/k4r08.6hm....................:::Reference ID: :::Template ID: 0:::Details File: :::Location: rsct,bootstrp.C,220.127.116.11,5386 :::TS_LATEHB_PE Late in sending heartbeat A heartbeat is late by the following number of seconds 14
Dec 14 11:45:51 node02 cthats: (Recorded using libct_ffdc.a cv 2):::Error ID: 822....D1CuC/9X018.6hm....................:::Reference ID: :::Template ID: 0:::Details File: :::Location: rsct,nim_control.C,18.104.22.168,7916 :::TS_NIM_ERROR_STUCK_ER NIM thread blocked Thread which was blocked receive thread Interval in seconds during which process was blocked 16 Interface name eth0
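When scanning a large syslog for these entries, it can help to pull out the label, the reported number of seconds, and (for the NIM message) the thread and interface. The following is a minimal sketch that assumes the message text matches the two sample entries above. The function name parse_entry and the regular expressions are illustrative and not part of RSCT; other RSCT levels may word the messages differently.

```python
import re

# Patterns are derived only from the two sample entries shown above;
# they are assumptions, not a documented RSCT message format.
LATEHB_RE = re.compile(r"TS_LATEHB_PE\b.*?number of seconds\s+(\d+)")
STUCK_RE = re.compile(
    r"TS_NIM_ERROR_STUCK_ER\b.*?Thread which was blocked\s+(.+?)\s+"
    r"Interval in seconds during which process was blocked\s+(\d+)\s+"
    r"Interface name\s+(\S+)"
)


def parse_entry(line):
    """Return a dict describing a LATEHB or NIM 'stuck' entry, or None."""
    m = LATEHB_RE.search(line)
    if m:
        return {"label": "TS_LATEHB_PE", "seconds": int(m.group(1))}
    m = STUCK_RE.search(line)
    if m:
        return {
            "label": "TS_NIM_ERROR_STUCK_ER",
            "thread": m.group(1),
            "seconds": int(m.group(2)),
            "interface": m.group(3),
        }
    return None
```

Feeding each syslog line through parse_entry and keeping the non-None results gives a list that can be sorted by timestamp or grouped by interface.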
The TS_LATEHB_PE error label (and its description, "Late in sending heartbeat") is unfortunately a leftover from the pre-NIM days, when the cthats daemon handled everything, including heartbeating. What this message means today is that the main thread of the daemon thinks it was blocked for the indicated time, because consecutive clock checks showed a time gap that is outside RSCT's "comfort zone." Since this is a clock check, it is not guaranteed that the thread was actually hung: it could have been in a busy loop doing something unusual, or a clock change could even have fooled the check. Most of the time, however, the TS_LATEHB_PE message does indicate a blockage.
Note: From this message alone, it is not possible to know whether any threads of the daemon were actually hung. In fact, since the daemon did log a message, the thread must have been freed and continued to run.
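The idea of a clock check can be illustrated with a small sketch. This is not the cthats implementation. The class name ClockGapMonitor, the expected interval, and the tolerance are all hypothetical, chosen only to show why a gap between consecutive clock readings suggests, but does not prove, that a thread was blocked.

```python
import time


class ClockGapMonitor:
    """Flags a gap between consecutive clock readings (illustrative only)."""

    def __init__(self, expected_interval, tolerance, clock=time.time):
        # expected_interval: seconds normally elapsing between checks (assumed)
        # tolerance: extra seconds allowed before a gap is reported (assumed)
        self.expected_interval = expected_interval
        self.tolerance = tolerance
        self.clock = clock
        self.last = clock()

    def check(self):
        """Return how many seconds late this check is, or None if on time."""
        now = self.clock()
        gap = now - self.last
        self.last = now
        late_by = gap - self.expected_interval
        # A wall-clock jump forward looks exactly like a blocked thread here,
        # which is the weakness described in the text above.
        return late_by if late_by > self.tolerance else None
```

Note that check() can only run, and report a gap, after the thread is running again, which is why the appearance of the message implies the thread was freed.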
The TS_NIM_ERROR_STUCK_ER message indicates that one of the NIM processes was hung for the indicated time period. It is also based simply on a clock check and is thus subject to the same weaknesses mentioned above.
In the case of the above example, it was the "receive" thread that was blocked, so sending of heartbeats would not have been impacted, but receiving would, if the impact lasted long enough to pass the Failure Detection Rate for this network. The main thread (of the NIM, not the daemon) can detect whether the receive thread is blocked and will make allowances for that by increasing the HB limit.
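To judge whether a reported blockage was long enough to matter, the blocked interval can be compared with the failure detection time of the network. The sketch below assumes the detection time is frequency x sensitivity x 2 seconds; treat that formula, and the function names, as assumptions to verify against the Topology Services tunables documented for your RSCT level. It also ignores the allowance the NIM main thread makes by raising the HB limit.

```python
def detection_window(frequency, sensitivity):
    """Seconds without heartbeats before a neighbor is declared down.

    Assumes the frequency * sensitivity * 2 rule; verify for your RSCT level.
    """
    return frequency * sensitivity * 2


def receive_block_exceeds_window(blocked_seconds, frequency, sensitivity):
    """True if a receive-thread blockage outlasted the detection window."""
    return blocked_seconds >= detection_window(frequency, sensitivity)
```

For example, with hypothetical tunables of frequency 1 and sensitivity 4, the window is 8 seconds, so the 16-second blockage in the sample entry would exceed it, while a 5-second blockage would not.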
Because the NIMs are responsible for the actual heartbeating work, all critical threads (send, receive, main, command receive, and netmon -- everything except the logging threads) monitor the clock and issue these "stuck" messages when necessary.
It's common to see these two messages together. On their own, they mean that some blockage appears to have occurred, which might have led the interface in question to declare its neighbor down. If there is only one network connecting the nodes, this might also have led to the remote node being declared down.