IBM Support

IV71612: LINUX OS AGENT MAY RANDOMLY CRASH OR HANG DURING DISK DATA COLLECTION

APAR status

  • Closed as program error.

Error description

  • Due to a race condition between the Disk and the "Disk Usage
    Trends" data collections, the Linux OS agent may crash or hang
    while executing the threads that retrieve file system information.
    
    Affected Platforms / Versions:
    
    This issue may affect all Linux OS agent versions since 6.22 FP2.
    
    Diagnostics:
    At the default ERROR level, look for "statfs64 timed out" messages
    that recur exactly every hour:
    
    (Thu Mar 19 07:34:02 2015.0005-2:filestats.cpp,123,"GetFileStats") statfs64 timed out for /nfs/kissapp01
    ...
    (Thu Mar 19 08:34:02 2015.0009-2:filestats.cpp,123,"GetFileStats") statfs64 timed out for /var
    (Thu Mar 19 08:34:02 2015.000A-F:filestats.cpp,123,"GetFileStats") statfs64 timed out for /opt
    (Thu Mar 19 08:34:02 2015.000B-2:filestats.cpp,123,"GetFileStats") statfs64 timed out for /opt/IBM
    (Thu Mar 19 08:34:02 2015.000C-F:filestats.cpp,123,"GetFileStats") statfs64 timed out for /proc/sys/fs/binfmt_misc
    (Thu Mar 19 08:34:02 2015.000D-2:filestats.cpp,123,"GetFileStats") statfs64 timed out for /var/lib/nfs/rpc_pipefs
    (Thu Mar 19 08:34:02 2015.000E-F:filestats.cpp,123,"GetFileStats") statfs64 timed out for /proc/fs/nfsd
    (Thu Mar 19 08:34:02 2015.000F-2:filestats.cpp,123,"GetFileStats") statfs64 timed out for /nfs/kissapp01
    ...
    (Thu Mar 19 09:34:02 2015.0005-F:filestats.cpp,123,"GetFileStats") statfs64 timed out for /
    (Thu Mar 19 09:34:02 2015.0006-2:filestats.cpp,123,"GetFileStats") statfs64 timed out for /proc
    ...
    <crash at Thu Mar 19 10:34:02 2015> in:
    Program terminated with signal 11, Segmentation fault.
    #0  0x000000357220b91b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
    #1  0x0000000000505880 in FileStats::executeStatfsInSeparateThread(char const*, statfs64*) ()
    #2  0x0000000000504b3d in FileStats::GetFileStats() ()
    
    Initial Impact:
    Medium. The event is rare, and the watchdog, if enabled,
    automatically restarts the agent within a few seconds.
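
The hourly pattern described under Diagnostics can be checked mechanically. The following is a minimal sketch, not an IBM-supplied tool: the function names are hypothetical, and the regular expression is inferred only from the excerpt above, so it assumes every entry has the shape `(<weekday> <month> <day> <time> <year>.<seq>-<thread>:filestats.cpp,<line>,"GetFileStats") statfs64 timed out for <mount>`. It tolerates entries wrapped across lines.

```python
import re
from datetime import datetime, timedelta

# Pattern inferred from the excerpt in this APAR (an assumption, not a
# documented log format). \s+ lets an entry span several wrapped lines.
ENTRY = re.compile(
    r'\(\w{3}\s+(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\d{4})\.'
    r'[0-9A-Fa-f]+-[0-9A-Fa-f]+:filestats\.cpp,\d+,"GetFileStats"\)'
    r'\s+statfs64 timed out for (\S+)'
)


def timeout_bursts(log_text):
    """Return the sorted, distinct timestamps that carry a statfs64 timeout."""
    stamps = set()
    for month, day, clock, year, _mount in ENTRY.findall(log_text):
        stamps.add(datetime.strptime(f"{month} {day} {clock} {year}",
                                     "%b %d %H:%M:%S %Y"))
    return sorted(stamps)


def is_hourly(bursts, period=timedelta(hours=1)):
    """True when there are at least two bursts, each exactly one period apart."""
    return len(bursts) >= 2 and all(
        later - earlier == period for earlier, later in zip(bursts, bursts[1:])
    )


# Entries copied from the excerpt above, in their wrapped form.
SAMPLE = '''
(Thu Mar 19 07:34:02
2015.0005-2:filestats.cpp,123,"GetFileStats")
statfs64 timed out for /nfs/kissapp01
(Thu Mar 19 08:34:02
2015.0009-2:filestats.cpp,123,"GetFileStats")
statfs64 timed out for /var
(Thu Mar 19 08:34:02
2015.000A-F:filestats.cpp,123,"GetFileStats")
statfs64 timed out for /opt
(Thu Mar 19 09:34:02
2015.0005-F:filestats.cpp,123,"GetFileStats")
statfs64 timed out for /
'''

if __name__ == "__main__":
    bursts = timeout_bursts(SAMPLE)
    print(len(bursts), "bursts; hourly:", is_hourly(bursts))
```

Timeouts logged at the same second are grouped into one burst, so several mount points failing together count once. A log that shows bursts spaced by exactly the "Disk Usage Trends" sampling interval is consistent with this APAR, but it is not proof of it.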
    

Local fix

  • None. As a mitigation, assign a large value to the variable
    KLZ_DISK_SAMPLE_HRS.
    

Problem summary

  • Monitoring Agent for Linux OS randomly crashes during Disk data
    collection.
    
    Because of a lack of synchronization with the internal thread that
    collects data for the "Disk Usage Trends" attribute group, the agent
    may crash while responding to a query or a situation on the
    "Linux Disk" attribute group. The two groups share the same
    file system cache. The event is rare because the "Disk Usage
    Trends" thread runs only once per hour by default.
    

Problem conclusion

  • Introduced mutex control to prevent concurrent execution of the
    threads for the two attribute groups.
    The fix for this APAR will be contained in the following
    maintenance packages:
    
    | FixPack    | 6.3.0-TIV-ITM-FP0006
    | InterimFix | 6.3.0.5-TIV-ITM_LINUX-IF0001
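
The kind of mutual exclusion described in the conclusion above can be illustrated with a small sketch. This is not the agent's code, which is native and not shown in this APAR. It is a Python analogy under stated assumptions: `FilesystemCache`, `run_collectors`, and both thread functions are hypothetical names, and `threading.Lock` stands in for whatever mutex the fix uses.

```python
import threading


class FilesystemCache:
    """Hypothetical stand-in for the cache shared by the two attribute groups."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def refresh(self, mounts):
        # Writer side (the "Disk Usage Trends" collection in this sketch):
        # rebuild the cache while holding the lock.
        with self._lock:
            self._entries.clear()
            for mount in mounts:
                self._entries[mount] = {"mount": mount}

    def snapshot(self):
        # Reader side (the "Linux Disk" query in this sketch): copy the
        # cache under the same lock, so a half-rebuilt cache is never seen.
        with self._lock:
            return dict(self._entries)


def run_collectors(mounts, rounds=200):
    """Run one writer and one reader concurrently; return the cache sizes seen."""
    cache = FilesystemCache()
    seen_sizes = set()

    def trends_thread():
        for _ in range(rounds):
            cache.refresh(mounts)

    def disk_thread():
        for _ in range(rounds):
            seen_sizes.add(len(cache.snapshot()))

    workers = [threading.Thread(target=trends_thread),
               threading.Thread(target=disk_thread)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return seen_sizes


if __name__ == "__main__":
    print(run_collectors(["/", "/var", "/opt"]))
```

Because both operations take the same lock, the reader only ever observes an empty cache (before the first refresh) or a complete one, never a partially rebuilt one.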
    

Temporary fix

  • Set the environment variable KLZ_DISK_SAMPLE_HRS, which controls
    the frequency of the "Disk Usage Trends" thread, to a very large
    number of hours to reduce the likelihood of this race condition.
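
The effect of the temporary fix can be estimated with simple arithmetic. The sketch below assumes, based only on the description above, that KLZ_DISK_SAMPLE_HRS is the interval in hours between two runs of the "Disk Usage Trends" thread. The function name `trend_runs` is hypothetical, and the values accepted by the agent are not stated in this APAR.

```python
def trend_runs(window_hours, sample_hrs):
    """Count the trend-collection runs that fit in a window, one run every sample_hrs hours.

    Assumption: sample_hrs mirrors KLZ_DISK_SAMPLE_HRS as an interval in hours.
    """
    if sample_hrs <= 0:
        raise ValueError("sample_hrs must be a positive number of hours")
    return window_hours // sample_hrs


WEEK_HOURS = 7 * 24

if __name__ == "__main__":
    for hrs in (1, 24, 168):
        print(hrs, "->", trend_runs(WEEK_HOURS, hrs), "runs per week")
```

Each run is one opportunity for the race with a "Linux Disk" query, so a larger interval lowers the exposure proportionally. It does not remove the race, which only the fix packages listed under Problem conclusion address.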
    

Comments

APAR Information

  • APAR number

    IV71612

  • Reported component name

    ITM AGENT UNIX

  • Reported component ID

    5724C040U

  • Reported release

    630

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2015-03-30

  • Closed date

    2015-06-30

  • Last modified date

    2015-12-10

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    ITM AGENT LINUX

  • Fixed component ID

    5724C04LN

Applicable component levels

  • R623 PSY

       UP

  • R630 PSY

       UP

  • R610 PSN

       UP

  • R620 PSN

       UP

  • R621 PSN

       UP

  • R622 PSN

       UP

Document Information

Modified date:
30 December 2022