Troubleshooting IBM Tivoli Netcool/OMNIbus probe hang issue

Use this information to troubleshoot an IBM Tivoli Netcool/OMNIbus probe hang issue that can occur with any probe that uses a TCP (Transmission Control Protocol) connection to send data to the ObjectServer.

Description of the issue

IBM Tivoli Netcool/OMNIbus probes may hang as a result of network configuration issues that break TCP sessions.

When these TCP communication errors occur, the probe log stops at the point of the hang, while the probe is flushing events to the ObjectServer. For example:

2016-01-29T17:19:23: Debug: D-UNK-000-000: Sending.....
2016-01-29T17:19:23: Debug: D-UNK-000-000: Flushing events to object servers
2016-01-29T17:19:23: Debug: D-UNK-000-000: 1 buffered alerts

The probe log also contains errors associated with network connectivity. For example:

Error: E-UNK-000-000: [ProtocolTDS]: ct_results(): network packet layer: 
internal net library error: Net-Lib protocol driver call to read data failed
OS Error: Socket recv failed - errno 110 Connection timed out
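
Checking a long probe log for this pattern can be scripted. The following Python sketch is illustrative only: the function name find_hang_signature and the matched substrings are assumptions derived from the sample log lines above, and other probes or versions may word these messages differently.

```python
import re

# Substrings taken from the sample log lines above; adjust them for your probe.
FLUSH_MARKER = "Flushing events to object servers"
NETWORK_ERROR = re.compile(
    r"Socket recv failed - errno \d+|Net-Lib protocol driver call"
)


def find_hang_signature(log_lines):
    """Report whether a probe log matches the hang pattern described above."""
    lines = [line.rstrip("\n") for line in log_lines if line.strip()]
    # Only the tail matters here: the hang appears as a flush that never completes.
    tail = lines[-3:]
    ends_in_flush = any(FLUSH_MARKER in line for line in tail)
    # Network errors can appear anywhere earlier in the log.
    network_errors = [line for line in lines if NETWORK_ERROR.search(line)]
    return {
        "ends_in_flush": ends_in_flush,
        "network_errors": network_errors,
        "suspect_tcp_hang": ends_in_flush and bool(network_errors),
    }
```

To use it, pass an open log file (or a list of lines) to find_hang_signature and inspect the suspect_tcp_hang entry of the result. A positive result is only a hint that the network path deserves a closer look.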

If the probe hangs as described above and the log contains errors like these, the likely cause is a TCP communication error. Correcting the underlying network problem allows the probe to function normally.

Note: This problem was seen when a probe connected to an ObjectServer over a VPN that limited the MTU (maximum transmission unit) to 1300 bytes. Path MTU discovery was not working between the two servers because network devices on the path were configured, for security reasons, not to send back ICMP responses. Blocking ICMP in this way is known to break path MTU discovery.
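
The effect of a broken path MTU discovery can be illustrated with some simple arithmetic. The Python sketch below is a simplified model, not a description of any particular TCP stack: it assumes IPv4 and TCP headers of 20 bytes each with no options, assumes packets are sent with fragmentation disallowed, and uses hypothetical function names.

```python
IPV4_HEADER_BYTES = 20  # minimum IPv4 header, no options (assumption)
TCP_HEADER_BYTES = 20   # minimum TCP header, no options (assumption)


def max_tcp_payload(mtu):
    """Largest TCP payload that fits in one IPv4 packet of the given MTU."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


def packet_is_dropped(payload_bytes, sender_mtu, path_mtu):
    """Simplified model of a path MTU black hole.

    The sender sizes its largest packet from its own interface MTU. If that
    packet exceeds the smallest MTU on the path and no ICMP feedback comes
    back, the sender never learns to shrink it and the packet is lost.
    """
    largest_segment = min(payload_bytes, max_tcp_payload(sender_mtu))
    packet_size = largest_segment + IPV4_HEADER_BYTES + TCP_HEADER_BYTES
    return packet_size > path_mtu
```

Under this model, a 200-byte event sent from an interface with a 1500-byte MTU crosses a 1300-byte path without trouble, whereas a 4000-byte event is sent in 1500-byte packets that are dropped. This is consistent with a probe that works for small events and hangs on large ones.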

Workaround to the issue

In the previous example, setting the MTU to 1300 bytes on the ObjectServer network interfaces forced larger events to be split into smaller packets for transmission over the VPN. This prevented the TCP connection timeouts that caused the probe to hang.
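
After changing the MTU, it can be useful to confirm the value that each interface actually reports. The Python sketch below assumes a Linux host, where the kernel exposes the MTU in /sys/class/net/<interface>/mtu; the function names are hypothetical, and other operating systems need a different method.

```python
from pathlib import Path


def read_interface_mtu(interface, sysfs_root="/sys/class/net"):
    """Return the MTU of one interface as reported by Linux sysfs."""
    mtu_file = Path(sysfs_root) / interface / "mtu"
    return int(mtu_file.read_text().strip())


def interfaces_above(limit, sysfs_root="/sys/class/net"):
    """List (interface, mtu) pairs whose MTU is larger than the limit."""
    oversized = []
    for entry in sorted(Path(sysfs_root).iterdir()):
        mtu_file = entry / "mtu"
        # Skip entries that are not interfaces and so have no mtu file.
        if mtu_file.is_file():
            mtu = int(mtu_file.read_text().strip())
            if mtu > limit:
                oversized.append((entry.name, mtu))
    return oversized
```

For example, interfaces_above(1300) lists any interface still configured with an MTU larger than the 1300-byte limit from the example. The list includes the loopback interface, which can be ignored.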

Additional tips for troubleshooting

You can use the SQL interactive interface (the nco_sql utility) to reproduce network issues by inserting suspect records directly into the alerts.status table in the ObjectServer, which simulates the insert that a probe performs. In the previous example, nco_sql was used to confirm that large records present in the probe log were not being inserted successfully into the ObjectServer.

Also consider using Wireshark (an open source packet analyzer) to follow events as they travel across the network from the probe host to the ObjectServer, which can help to identify network-related issues.
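
When reproducing the problem with nco_sql, test inserts of different sizes can help to find the size at which inserts start to fail. The Python sketch below only generates the SQL text. The function name is hypothetical, and the column names (Identifier, Node, Summary, Severity) are assumed from the default alerts.status schema, so verify them against your own schema before use. The ObjectServer may limit the stored length of Summary, so a wider custom column may match the suspect records more closely.

```python
def build_test_insert(identifier, summary_bytes, node="mtu-test"):
    """Build an ObjectServer SQL insert with a Summary padded to a chosen size."""
    if summary_bytes < 1:
        raise ValueError("summary_bytes must be positive")
    if "'" in identifier or "'" in node:
        raise ValueError("identifier and node must not contain single quotes")
    # Padding uses a plain ASCII letter so that no quoting or escaping is needed.
    summary = "X" * summary_bytes
    statement = (
        "insert into alerts.status (Identifier, Node, Summary, Severity) "
        f"values ('{identifier}', '{node}', '{summary}', 1);"
    )
    # nco_sql runs the buffered statement when it reads "go" on its own line.
    return statement + "\ngo\n"
```

The returned text can be pasted into an nco_sql session, first with a small size and then with sizes close to those of the records seen in the probe log. Use a distinct identifier for each test row, and delete the test rows from alerts.status afterward.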