IBM Java for AIX HowTo: Troubleshooting a JVM hang or timeout exception

Question & Answer


Question

Troubleshooting a JVM hang or timeout exception

Answer

If a Java program experiences a timeout when making calls (such as a socket read) or when communicating with an application server, database server, or any other type of server or IP address (e.g. an NFS server), the cause may be network packets failing to reach the target server/IP.
This is even more likely if DNS is being used to resolve hostnames/IPs.
To resolve a DNS issue, you need to find out which DNS server the JVM is using.
Setting the RES_OPTIONS=debug (resolver options) environment variable can assist in this.
It is a useful tool for determining whether the JVM can be ruled out as the root cause of the timeout.

A timeout can occur at many levels on the server where the JVM is running:

1. Application logic
2. JVM level
3. AIX socket
4. AIX kernel
5. Network layer on AIX
6. The network itself

*This document does not focus on troubleshooting the application/database system. However, the same approach can be applied on the application/database system if no problems are found on the system where the JVM is running.



Step 1:

Prepare

A. Determine whether the system is using DNS or a local /etc/hosts file for hostname resolution.
To do this, look at the uncommented section of the /etc/netsvc.conf file, and check whether the 'NSORDER' environment variable has been set:

Examples

A1. The last 5 lines of an /etc/netsvc.conf file from a system using DNS only:

# tail -6 /etc/netsvc.conf
# Example:
# aliases = nis, files
#
#
hosts=bind4


A2. The last 5 lines of an /etc/netsvc.conf file from a system using local /etc/hosts only:

# tail -6 /etc/netsvc.conf
# Example:
# aliases = nis, files
#
#
hosts=local4


A3. The last 5 lines of an /etc/netsvc.conf file from a system that will look at /etc/hosts first, and then query the DNS server(s) if the hostname or IP is not found in the /etc/hosts file:

# tail -6 /etc/netsvc.conf
# Example:
# aliases = nis, files
#
#
hosts=local4,bind4


A4. The last 5 lines of an /etc/netsvc.conf file from a system that will query the DNS server(s) first, and then look at the local /etc/hosts file if the hostname or IP is not found in the DNS server(s):

# tail -6 /etc/netsvc.conf
# Example:
# aliases = nis, files
#
#
hosts=bind4,local4


A5. Use the 'env' and 'echo' commands to check whether the 'NSORDER' environment variable has been set. If it has been set, it overrides the /etc/netsvc.conf file settings. The output below is from a system where NSORDER is set; if it is not set, 'env | grep NSORDER' prints nothing:

# env | grep NSORDER
NSORDER=hosts=local4,bind4

# echo $NSORDER
hosts=local4,bind4
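
The checks in A1 through A5 can be combined into one small helper. This is only a sketch: show_ns_order is a hypothetical function name (not an AIX command), and it assumes, as described above, that a set NSORDER variable takes precedence over the hosts entry in /etc/netsvc.conf.

```shell
# show_ns_order [FILE] - hypothetical helper, not part of AIX.
# Prints the NSORDER value if it is set; otherwise prints the
# uncommented "hosts" entry found in FILE (default: /etc/netsvc.conf).
show_ns_order() {
    conf_file="${1:-/etc/netsvc.conf}"
    if [ -n "${NSORDER:-}" ]; then
        echo "NSORDER (environment): $NSORDER"
        return 0
    fi
    # Drop comments, then keep only the value part of any "hosts = ..." line.
    hosts_lines=$(sed -e 's/#.*//' "$conf_file" 2>/dev/null |
        sed -n 's/^[[:space:]]*hosts[[:space:]]*=[[:space:]]*//p')
    if [ -n "$hosts_lines" ]; then
        echo "$conf_file: $hosts_lines"
    else
        echo "$conf_file: no uncommented hosts entry found"
    fi
}
```

Running 'show_ns_order' with no argument reads /etc/netsvc.conf; pass another path to inspect a copy of the file collected from a different system.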



B. The RES_OPTIONS=debug environment variable makes the resolver print details about each hostname lookup made while reaching the desired target.
The output can be examined to see what nameservers are being called, when they are called, and in what order.

NOTES:

A1. Depending on whether you are using IPv4, IPv6, or both, you may see 'hosts=bind', or 'hosts=bind6'.


A2. Depending on whether you are using IPv4, IPv6, or both, you may see 'hosts=local', or 'hosts=local6'.


A3. Depending on whether you are using IPv4, IPv6, or both, you may see 'hosts=local,bind', or 'hosts=local6,bind6'.


A4. Depending on whether you are using IPv4, IPv6, or both, you may see 'hosts=bind,local', or 'hosts=bind6,local6'.



B. You will need the name or IP of the system the JVM is attempting to communicate with, and knowledge of how your internal network is configured.



Additional notes:

- With the "options rotate" entry in your /etc/resolv.conf file, all of your nameservers will be queried.
This could slow down lookups, unless DNS caching has been set up.

Example:

# cat /etc/resolv.conf
nameserver 192.168.2.200
nameserver 192.168.130.50
domain domain.ibm.com
domain ibm.com
options rotate <--------------
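
To see at a glance which nameservers are configured and whether "options rotate" is present, a short helper along these lines may be useful. It is a sketch only: summarize_resolv_conf is a hypothetical function name, and it simply reads the file without contacting any nameserver.

```shell
# summarize_resolv_conf [FILE] - hypothetical helper, not part of AIX.
# Lists the nameserver entries in file order and reports whether an
# "options" line contains the word "rotate".
summarize_resolv_conf() {
    conf_file="${1:-/etc/resolv.conf}"
    # Number each nameserver in the order it appears in the file.
    awk '$1 == "nameserver" { n++; print "nameserver " n ": " $2 }' "$conf_file"
    # awk exits 0 only if "rotate" was seen on an "options" line.
    if awk '$1 == "options" { for (i = 2; i <= NF; i++) if ($i == "rotate") found = 1 }
            END { exit (found ? 0 : 1) }' "$conf_file"; then
        echo "options rotate: set"
    else
        echo "options rotate: not set"
    fi
}
```

Commented-out entries are ignored because their first field is the comment character rather than "nameserver" or "options".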





*** For the examples in this technote, the application server that is the target of our network packets will be named 'chris.domain.ibm.com' ***

Step 2:

ACTION

The syntax when using the RES_OPTIONS=debug variable with a command is:

# RES_OPTIONS=debug command command_options


For example:

# RES_OPTIONS=debug ssh chris.domain.ibm.com


*Or you can run your own code with the "RES_OPTIONS=debug" environment variable set:

# RES_OPTIONS=debug ./res_nsearch chris.domain.ibm.com
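
When the debug output is long, it can help to save it to a file and show only the nameserver lines. The following is a minimal sketch: run_with_resolver_debug is a hypothetical function name, and because this document does not state whether the resolver messages go to standard output or standard error, both streams are captured. It suits non-interactive commands, since all command output is redirected to the log file.

```shell
# run_with_resolver_debug LOGFILE command [args...] - hypothetical helper.
# Runs one command with RES_OPTIONS=debug, saves everything it prints to
# LOGFILE, then shows only the nameserver and timeout lines.
run_with_resolver_debug() {
    log_file="$1"
    shift
    status=0
    # RES_OPTIONS is set for this command only; stdout and stderr are
    # both captured because the stream used for debug output is not assumed.
    RES_OPTIONS=debug "$@" > "$log_file" 2>&1 || status=$?
    grep -E 'Querying server|timeout' "$log_file" ||
        echo "(no nameserver queries logged)"
    return "$status"
}
```

For example, 'run_with_resolver_debug /tmp/res_debug.log ./res_nsearch chris.domain.ibm.com' keeps the full output in /tmp/res_debug.log (an arbitrary path chosen for illustration) for later review.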

Step 3:

Example
(successful hostname resolution)

Output when successfully resolving a hostname:

# RES_OPTIONS=debug ssh chris.domain.ibm.com

;; res_setoptions("debug", "env")..
;; debug
;; calling process id = 7995714
;; res_nquerydomain(chris.domain.ibm.com, , 1, 1)
;; res_query(chris.domain.ibm.com, 1, 1)
;; res_nmkquery(QUERY, chris.domain.ibm.com, IN, A)
;; res_send()
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50316
;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; chris.domain.ibm.com, type = A, class = IN
;; Querying server (# 1) address = 192.168.2.200 <---- nameserver queried
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50316
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 8, ADDITIONAL: 8

;; chris.domain.ibm.com, type = A, class = IN
chris.domain.ibm.com. 15M IN A 192.168.235.38
ibm.com. 20h17m46s IN NS xxxx.akam.net.
ibm.com. 20h17m46s IN NS xxxx-206.akam.net.
ibm.com. 20h17m46s IN NS xxxx.akam.net.
ibm.com. 20h17m46s IN NS xxxx.akam.net.
ibm.com. 20h17m46s IN NS xxxx.akam.net.
ibm.com. 20h17m46s IN NS xxxx.akam.net.
ibm.com. 20h17m46s IN NS xxxx-99.akam.net.
ibm.com. 20h17m46s IN NS xxxx.akam.net.
xxxx.akam.net. 2h54m19s IN A 192.168.160.64
xxxx.akam.net. 18h23m45s IN A 192.168.61.64
xxxx-99.akam.net. 46m8s IN AAAA 2600:1401:2::63
xxxx-206.akam.net. 4h23m12s IN AAAA 2600:1401:2::ce
xxxx.akam.net. 10h35s IN A 192.168.50.64
xxxx.akam.net. 2h54m19s IN A 192.168.25.64
xxxx.akam.net. 10h36s IN A 192.168.161.64
xxxx.akam.net. 10h39s IN A 192.168.173.64
chris.domain.ibm.com is 192.168.235.38 <---- Successfully resolved
 

*NOTES:

- The lines marked with arrows are the output to focus on, as they display the name you are querying for, as well as the name server(s) your system is set up to query.
 

Step 4:

Example
(failed hostname resolution)

Output when a timeout occurs:

# RES_OPTIONS=debug telnet peters1.domain.ibm.com

;; res_setoptions("debug", "env")..
;; debug
;; calling process id = 7274858
;; res_nquerydomain(peters1.domain.ibm.com, , 1, 1)
;; res_query(peters1.domain.ibm.com, 1, 1)
;; res_nmkquery(QUERY, peters1.domain.ibm.com, IN, A)
;; res_send()
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2349
;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; peters1.domain.ibm.com, type = A, class = IN
;; Querying server (# 1) address = 192.168.130.50 <-------- nameserver queried
;; timeout
;; Querying server (# 2) address = 192.168.2.200 <-------- nameserver queried
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2349
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 8, ADDITIONAL: 8

;; peters1.domain.ibm.com, type = A, class = IN
peters1.domain.ibm.com. 7m15s IN A 192.168.98.21
ibm.com. 2h33m12s IN NS eur2.akam.net.
ibm.com. 2h33m12s IN NS ns1-206.akam.net.
ibm.com. 2h33m12s IN NS usc3.akam.net.
ibm.com. 2h33m12s IN NS usw2.akam.net.
ibm.com. 2h33m12s IN NS ns1-99.akam.net.
ibm.com. 2h33m12s IN NS eur5.akam.net.
ibm.com. 2h33m12s IN NS asia3.akam.net.
ibm.com. 2h33m12s IN NS usc2.akam.net.
usc2.akam.net. 5h41m37s IN A 184.26.160.64
asia3.akam.net. 39m11s IN A 23.211.61.64
ns1-99.akam.net. 8h1m34s IN AAAA 2600:1401:2::63
ns1-206.akam.net. 11h38m42s IN AAAA 2600:1401:2::ce
usc3.akam.net. 17h16m1s IN A 96.7.50.64
eur5.akam.net. 17h16m2s IN A 23.74.25.64
usw2.akam.net. 17h16m2s IN A 184.26.161.64
eur2.akam.net. 17h16m5s IN A 95.100.173.64
Trying...
;; res_nquerydomain(peters1.domain.ibm.com, , 1, 1)
;; res_query(peters1.domain.ibm.com, 1, 1)
;; res_nmkquery(QUERY, peters1.domain.ibm.com, IN, A)
;; res_send()
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45176
;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; peters1.domain.ibm.com, type = A, class = IN
;; Querying server (# 1) address = 192.168.2.200 <----------- nameserver queried
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45176
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 8, ADDITIONAL: 8

;; peters1.domain.ibm.com, type = A, class = IN
peters1.domain.ibm.com. 10m6s IN A 192.168.98.21
ibm.com. 2h36m3s IN NS ns1-99.akam.net.
ibm.com. 2h36m3s IN NS usw2.akam.net.
ibm.com. 2h36m3s IN NS ns1-206.akam.net.
ibm.com. 2h36m3s IN NS asia3.akam.net.
ibm.com. 2h36m3s IN NS eur2.akam.net.
ibm.com. 2h36m3s IN NS usc3.akam.net.
ibm.com. 2h36m3s IN NS eur5.akam.net.
ibm.com. 2h36m3s IN NS usc2.akam.net.
usc2.akam.net. 5h44m28s IN A 184.26.160.64
asia3.akam.net. 42m2s IN A 23.211.61.64
ns1-99.akam.net. 8h4m25s IN AAAA 2600:1401:2::63
ns1-206.akam.net. 11h41m33s IN AAAA 2600:1401:2::ce
usc3.akam.net. 17h18m52s IN A 96.7.50.64
eur5.akam.net. 17h18m53s IN A 23.74.25.64
usw2.akam.net. 17h18m53s IN A 184.26.161.64
eur2.akam.net. 17h18m56s IN A 95.100.173.64
telnet: connect: Connection timed out <----- connection timed out (the first nameserver queried also timed out above)

NOTES:

- The lines marked with arrows are the output to focus on, as they display the name you are querying for, as well as the name server(s) your system is set up to query.


- Notice that two nameservers are queried: the first (192.168.130.50) times out, and the second (192.168.2.200) answers. This is the result of having two name servers, as well as "options rotate", in the /etc/resolv.conf file.
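
For a saved copy of debug output like the example above, the outcome for each nameserver can be extracted with a short awk filter. This is a sketch under the assumption that the log uses the same ";; Querying server", ";; timeout" and ";; got answer" lines shown in this technote; summarize_resolver_log is a hypothetical function name.

```shell
# summarize_resolver_log LOGFILE - hypothetical helper.
# Prints one line per nameserver query: the server address followed by
# "timeout" or "answered", in the order the queries were made.
summarize_resolver_log() {
    awk '
        /Querying server/ {
            # Keep only the first word after "address =".
            sub(/.*address = */, "")
            server = $1
            next
        }
        /^;; timeout/    { if (server != "") print server ": timeout";  server = "" }
        /^;; got answer/ { if (server != "") print server ": answered"; server = "" }
    ' "$1"
}
```

A nameserver that repeatedly shows "timeout" in this summary is the one to investigate first.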

Step 5:

Conclusions

A. If the connection times out, then we know the issue is somewhere outside of the JVM and application logic. The following areas will need to be investigated further:

1. AIX socket
2. AIX kernel
3. Network layer on AIX
4. The network itself


Ensure that there are no issues with the nameserver that your server was not able to contact, or with the network itself (this may require physically inspecting connections, cables, etc.).

If Java and the application logic are not the root cause, and neither are any components of the network or the nameserver, please open a PMR with the AIX networking team to further troubleshoot any AIX components (AIX socket, AIX kernel, etc.).



B. On the other hand, if the output of the RES_OPTIONS testing shows a successful network connection, check whether the problem is in the JVM or the application logic.

If the problem persists, collect data as described at the URL below and upload it:

http://www-01.ibm.com/support/docview.wss?uid=isg3T1022750

Document Type: Instruction
Content Type: Troubleshooting
Hardware: all Power
Operating System: AIX 6 | AIX 7
IBM Java: all Java Versions
Author(s): Christopher C.D. Peters
Reviewer(s): Rama Tenjarla

Document Information

Modified date:
22 June 2020

UID

isg3T1024364