Fixes are available
APAR status
Closed as program error.
Error description
Due to a race condition between the Disk and the "Disk Usage Trends" data collections, the Linux OS agent may crash or hang while executing the threads that retrieve file system information.

Affected Platforms / Versions:
This issue may affect all Linux OS agent versions since 6.22 FP2.

Diagnostics:
At the default ERROR level, look for statfs64 timed out messages occurring exactly every hour:

(Wed Mar 19 07:34:02 2015.0005-2:filestats.cpp,123,"GetFileStats") statfs64 timed out for /nfs/kissapp01
...
(Thu Mar 19 08:34:02 2015.0009-2:filestats.cpp,123,"GetFileStats") statfs64 timed out for /var
(Thu Mar 19 08:34:02 2015.000A-F:filestats.cpp,123,"GetFileStats") statfs64 timed out for /opt
(Thu Mar 19 08:34:02 2015.000B-2:filestats.cpp,123,"GetFileStats") statfs64 timed out for /opt/IBM
(Thu Mar 19 08:34:02 2015.000C-F:filestats.cpp,123,"GetFileStats") statfs64 timed out for /proc/sys/fs/binfmt_misc
(Thu Mar 19 08:34:02 2015.000D-2:filestats.cpp,123,"GetFileStats") statfs64 timed out for /var/lib/nfs/rpc_pipefs
(Thu Mar 19 08:34:02 2015.000E-F:filestats.cpp,123,"GetFileStats") statfs64 timed out for /proc/fs/nfsd
(Thu Mar 19 08:34:02 2015.000F-2:filestats.cpp,123,"GetFileStats") statfs64 timed out for /nfs/kissapp01
...
(Thu Mar 19 09:34:02 2015.0005-F:filestats.cpp,123,"GetFileStats") statfs64 timed out for /
(Thu Mar 19 09:34:02 2015.0006-2:filestats.cpp,123,"GetFileStats") statfs64 timed out for /proc
...
<crash at Thu Mar 19 10:34:02 2015> in:
Program terminated with signal 11, Segmentation fault.
#0 0x000000357220b91b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000505880 in FileStats::executeStatfsInSeparateThread(char const*, statfs64*) ()
#2 0x0000000000504b3d in FileStats::GetFileStats() ()

Initial Impact:
Medium. The event is rare, and the agent is automatically restarted within a few seconds by the watchdog, if enabled.
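When scanning an agent log for the diagnostic messages above, a small helper can pick out the affected mount points. The following is a minimal sketch, not part of the agent: the function name extractTimedOutMount is hypothetical, and it assumes each message sits on its own line in the format shown in the excerpt.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: if `line` contains a "statfs64 timed out for <path>"
// message, store <path> in `mount` and return true; otherwise return false
// and leave `mount` unchanged.
bool extractTimedOutMount(const std::string& line, std::string& mount) {
    static const std::string marker = "statfs64 timed out for ";
    const std::size_t pos = line.find(marker);
    if (pos == std::string::npos) {
        return false;  // not a timeout message
    }
    std::string path = line.substr(pos + marker.size());
    // Drop trailing whitespace or line terminators left over from the log.
    const std::size_t last = path.find_last_not_of(" \t\r\n");
    if (last == std::string::npos) {
        return false;  // marker present but no path followed it
    }
    path.erase(last + 1);
    mount = path;
    return true;
}
```

Feeding every log line through this helper and grouping the matches by timestamp would show whether the timeouts recur at the same minute of each hour, which is the pattern the diagnostics describe.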
Local fix
None. As a mitigation, assign a large value to the environment variable KLZ_DISK_SAMPLE_HRS.
Problem summary
Monitoring Agent for Linux OS randomly crashes during Disk data collection.
The "Linux Disk" and "Disk Usage Trends" attribute groups share the same file system cache, but access to it is not synchronized with the internal thread that collects data for the "Disk Usage Trends" attribute group. As a result, the agent may crash while responding to a query or a situation on the "Linux Disk" attribute group. The event is rare because the "Disk Usage Trends" thread runs only once per hour by default.
Problem conclusion
Introduced mutex control to prevent concurrent execution of the threads for the two attribute groups.
The fix for this APAR will be contained in the following maintenance packages:
| FixPack | 6.3.0-TIV-ITM-FP0006
| InterimFix | 6.3.0.5-TIV-ITM_LINUX-IF0001
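The general idea of the mutex control described in the problem conclusion can be illustrated with a short sketch. This is not the agent's actual code: the class FilesystemCache, its methods, and runCollectors are hypothetical names chosen for illustration, and the sketch only assumes that two threads update one shared cache.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical stand-in for the file system cache shared by the two
// attribute groups; the agent's real data structures are not documented here.
class FilesystemCache {
public:
    // Update path used by the (hypothetical) "Linux Disk" collection.
    void refreshDisk(const std::string& mountPoint, long freeBlocks) {
        std::lock_guard<std::mutex> guard(lock_);  // one writer at a time
        entries_[mountPoint] = freeBlocks;
        ++updates_;
    }

    // Update path used by the (hypothetical) "Disk Usage Trends" collection.
    void refreshTrends(const std::string& mountPoint, long freeBlocks) {
        std::lock_guard<std::mutex> guard(lock_);  // same mutex as above
        entries_[mountPoint] = freeBlocks;
        ++updates_;
    }

    long updates() const {
        std::lock_guard<std::mutex> guard(lock_);
        return updates_;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(lock_);
        return entries_.size();
    }

private:
    mutable std::mutex lock_;              // guards entries_ and updates_
    std::map<std::string, long> entries_;  // mount point -> free blocks
    long updates_ = 0;
};

// Runs both collections concurrently against the same cache and returns
// the total number of updates applied.
long runCollectors(FilesystemCache& cache, int iterations) {
    std::thread disk([&cache, iterations] {
        for (int i = 0; i < iterations; ++i) cache.refreshDisk("/var", i);
    });
    std::thread trends([&cache, iterations] {
        for (int i = 0; i < iterations; ++i) cache.refreshTrends("/opt", i);
    });
    disk.join();
    trends.join();
    return cache.updates();
}
```

Because both update paths take the same mutex, the two threads never modify the shared map at the same time. Without the lock, concurrent writes to the map would be a data race of the kind this APAR describes.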
Temporary fix
Set the environment variable KLZ_DISK_SAMPLE_HRS, which controls how often the "Disk Usage Trends" thread runs, to a very large number of hours to reduce the likelihood of this race condition.
Comments
APAR Information
APAR number
IV71612
Reported component name
ITM AGENT UNIX
Reported component ID
5724C040U
Reported release
630
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt
Submitted date
2015-03-30
Closed date
2015-06-30
Last modified date
2015-12-10
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
ITM AGENT LINUX
Fixed component ID
5724C04LN
Applicable component levels
R623 PSY UP
R630 PSY UP
R610 PSN UP
R620 PSN UP
R621 PSN UP
R622 PSN UP
Document Information
Modified date:
30 December 2022