Technote (troubleshooting)
Problem(Abstract)
A thread within the cache node process, which runs on every module, may miss several heartbeats when the system clock is changed backward in time. This in turn might cause one cache node in the system to fail and cause the system to go into a rebuild.
The backward clock change could have been made by an NTP server or by an xcli time_set command.
Symptom
A failure of a cache node process will cause the 12 disks in the module to become unavailable, triggering a rebuild.
Because the heartbeat monitoring code will fail only one cache node process, there is no risk of double module failure or data loss.
Cause
A backward time shift at the system level can cause the proc_sync_remote thread within the cache node to sleep for longer than expected and to miss a heartbeat, making the cache node think that it has failed.
- This event will occur for all of the cache nodes, but the manager will allow only one cache node to fail in order to avoid a data loss state.
- This happens only when the thread is dormant. If a copy service sync job is running, the issue will not occur, because the thread is active and does not sleep.
- Changes to the timezone of the machine such as moving between daylight saving and standard times do not affect the underlying system time and will not cause this issue.
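The effect described above can be illustrated with a small simulation. This is only a sketch of the general mechanism, not the actual cache node code: the function name, the one-second tick, and all numeric values are assumptions chosen for illustration. It compares a sleeper whose wake-up deadline is measured against the wall clock with one whose deadline is measured against a monotonic clock, when the wall clock is set back while the sleeper is dormant.

```python
def elapsed_until_wake(interval, step_back, step_at, use_monotonic):
    """Return the real seconds that pass before a dormant sleeper wakes.

    Hypothetical model: the sleeper wants to sleep `interval` seconds.
    At real time `step_at`, the wall clock is set back by `step_back`
    seconds. Time advances in one-second ticks.
    """
    real = 0            # true elapsed time (what a monotonic clock reports)
    offset = 0          # wall clock reading = real + offset
    deadline = interval # wake-up deadline, computed at real == 0
    stepped = False
    while True:
        if not stepped and real >= step_at:
            offset -= step_back  # the backward clock change
            stepped = True
        now = real if use_monotonic else real + offset
        if now >= deadline:
            return real
        real += 1


# Assumed values: a 5-second sleep, with the clock set back 30 seconds
# after 2 seconds of sleeping.
wall = elapsed_until_wake(5, 30, 2, use_monotonic=False)
mono = elapsed_until_wake(5, 30, 2, use_monotonic=True)
print("wall-clock deadline wakes after", wall, "seconds")
print("monotonic deadline wakes after", mono, "seconds")
```

In this model the wall-clock sleeper oversleeps by the size of the backward step, which is long enough to miss any heartbeat due in that window, while the monotonic sleeper is unaffected. A clock change that occurs after the sleeper has already woken has no effect, which matches the observation that an active thread does not hit the issue.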
Environment
10.2.2 & 10.2.2.a
Resolving the problem
A fix is planned for version 10.2.4.