IBM Support

Correlating Linux CPU usage to a particular thread in the JVM

Troubleshooting


Problem

This document contains Information on how to correlate a process ID (PID) from top output to a thread in your javacore file.

Cause

This document describes how to find Java™ threads from an IBM JRE running on Linux that are consuming large amounts of CPU cycles. This document does not describe all of the possible reasons and causes for high CPU problems. It's possible that the high CPU usage is a Java heap problem or an operating system or "machine" problem. Before attempting to find the Java threads that might be consuming CPU, be sure to rule out operating system paging and JRE Heap problems.

Environment

Linux

Resolving The Problem

The following example data was gathered as indicated in the MustGather: Performance, Hang or High CPU Issues on Linux technote when the JVM is consuming high CPU.  The same applies to data captured with "MustGather: Performance, hang, or high CPU issues with Liberty on Linux"

The linperf.sh script takes one required parameter, the process ID (PID) of the WebSphere Java process causing the high CPU problem. In the examples below, we ran the command:
./linperf.sh 59456 
This created linperf_RESULTS.tar.gz that contains the Linux CPU stat data and some other system data.  For high CPU we are most interested in vmstat.out, top.out, topdashH.59456.out.  It also produces multiple javacores for pid 59456 that we need to cross-reference with the data below.

We're using sample data to illustrate what needs to be reviewed.  Always discard the first line/section/capture of all 3 files because it will be skewed having captured that first entry in the middle of a time cycle.
1) vmstat.out - The "us" column is CPU used for the whole system (or VMWare VM if split out):
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0   2048 182608 219620 954208    0    0    14    45   97  109  6  1 89  0  4
 1  0   2048 169288 219676 967328    0    0   230  2260  254 7142 37  1  2  1 59
 2  0   2048 169336 219712 967420    0    0     0    37  264  542 76  1  0  0 23
 1  0   2048 168980 219732 967672    0    0     0    42  382  719 69  1  0  0 30
 1  0   2048 168632 219760 967620    0    0     2    43  269  555 68  1  0  0 31
 2  0   2048 167172 219812 967752    0    0     0    74  282  600 72  1  0  0 27
 1  0   2048 164440 219840 967700    0    0     1    40  267  611 74  1  0  0 25
 1  0   2048 155708 219916 976460    0    0     0  2270  293 7164 71  2  0  0 27

... more lines
** Each line represents a 5 second timeslice by default.
The top.out file is your standard top output. This file needs to be reviewed to see if the PID passed to the linperf.sh script is truly the process consuming the CPU.
Here's an example of the top.out output as collected by the linperf.sh script:
top - 08:25:12 up 4 days,  2:30,  3 users,  load average: 1.05, 0.90, 0.53
Tasks: 194 total,   2 running, 192 sleeping,   0 stopped,   0 zombie
%Cpu(s): 63.9 us,  0.9 sy,  0.0 ni,  0.4 id,  0.1 wa,  0.1 hi,  0.1 si, 34.7 st
KiB Mem:   5132736 total,  4953372 used,   179364 free,   219964 buffers
KiB Swap:  1332484 total,     2048 used,  1330436 free.   952608 cached Mem
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
59456 root      20   0 4607528 2.202g  20040 S 86.18 44.99 253:48.89 /opt/IBM/+
60756 root      20   0    3740   2436   1984 S 0.416 0.047   0:34.09 /usr/bin/+
 3332 root      20   0 1751908 563576  11344 S 0.216 10.98  11:54.58 /opt/IBM/+

... more lines
** Each line represents a 60 second timeslice by default.  From this output, we see that the header portion "%Cpu(s): 63.9 us" roughly correlates to what we see in the vmstat "us" column.
** Note: The "%CPU" of "86.18" can be distorted depending on how many CPUs the system has, but it is the guideline for knowing which pid is consuming the most CPU.
The topdashH.out file consists of series of top commands run over around four minutes of time every 5 seconds. The big difference between the regular top output and the top -H output, is while the regular top output lists all of the processes running on the machine, the top -H output lists all of the threads from one particular process that linperf.sh was run against.  As with default top output, the threads from the process are sorted according to CPU usage with the highest consumers at the top. The topdashH.out file needs to be reviewed from top to bottom to see what thread(s) is(are) consuming the CPU over the four minutes this command ran.
Since we ran linperf.sh against pid 59456, we have topdashH.59456.out that contains sections like:
top - 08:24:22 up 4 days,  2:29,  2 users,  load average: 1.13, 0.88, 0.50
Threads: 252 total,   1 running, 251 sleeping,   0 stopped,   0 zombie
%Cpu(s): 76.2 us,  0.7 sy,  0.0 ni,  0.3 id,  0.0 wa,  0.0 hi,  0.2 si, 22.6 st
KiB Mem:   5132736 total,  4963400 used,   169336 free,   219712 buffers
KiB Swap:  1332484 total,     2048 used,  1330436 free.   967420 cached Mem
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
51830 root      20   0 4607520 2.202g  19432 R 91.22 44.98  17:55.99 WebContai+
51829 root      20   0 4607520 2.202g  19432 S 0.399 44.98  34:11.67 WebContai+
59550 root      20   0 4607520 2.202g  19432 S 0.200 44.98   0:06.39 Non-defer+

... more lines
** Each section represents a 5 second timeslice by default.  As with top.out, the "%Cpu(s): 76.2 us" in the header will line up roughly to what is seen in vmstat.out and the header of top.out.  The "%CPU" of "91.22" can be distorted, but use that as the guideline for what thread(s) is/are consuming the most CPU.  The "PID" is actually the thread ID on pid 59456, but Linux treats threads as processes and assigns a "PID" number.
Since 51830 is consuming the most, convert 51830 to hex (CA76), and search the javacore files for this native thread id.
** Hint, change to: 0XCA76 to find an exact match.
In this example, the javacore list this thread:
"WebContainer : 1" J9VMThread:0x00000000054A9900, omrthread_t:0x00007F5D8005F1B8, java/lang/Thread:0x0000000701DBCFE0, state:R, prio=5
(native thread ID:0xCA76, native priority:0x5, native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000001)
at java/text/FieldPosition$Delegate.formatted(FieldPosition.java:291(Compiled Code))
at java/text/SimpleDateFormat.subFormat(SimpleDateFormat.java:1037(Compiled Code))
at java/text/SimpleDateFormat.format(SimpleDateFormat.java:836(Compiled Code))
at java/text/SimpleDateFormat.format(SimpleDateFormat.java:808(Compiled Code))
at com/ibm/ejs/ras/StreamEvent6.formatTimeAndThread(StreamEvent6.java:400(Compiled Code))
at com/ibm/ejs/ras/StreamEvent6.formatHeaderBasic(StreamEvent6.java:284(Compiled Code))
at com/ibm/ejs/ras/StreamEvent6.writeSelfToStream(StreamEvent6.java:132(Compiled Code))
at com/ibm/ejs/ras/SystemStream.doPrintLine(SystemStream.java:786(Compiled Code))
at com/ibm/ejs/ras/SystemStream.println(SystemStream.java:691(Compiled Code))
at TE01Servlet.p(TE01Servlet.java:106(Compiled Code))
at TE01Servlet.doLoop(TE01Servlet.java:94(Compiled Code))
at TE01Servlet.doPost(TE01Servlet.java:45)
at TE01Servlet.doGet(TE01Servlet.java:18)
at javax/servlet/http/HttpServlet.service(HttpServlet.java:743)
... more lines
For this example, we see that the javacore files show this thread looping somewhere in the java/text/SimpleDateFormat.format code. Review the thread(s) you found in all of the javacore files provided to see how and if the code stack moves through time.

[{"Product":{"code":"SSEQTP","label":"WebSphere Application Server"},"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Component":"Java SDK","Platform":[{"code":"PF016","label":"Linux"}],"Version":"8.5;8.0;7.0;6.1;6.0;5.1","Edition":"Advanced;Base;Express;Network Deployment;Single Server","Line of Business":{"code":"LOB36","label":"IBM Automation"}},{"Product":{"code":"SSNVBF","label":"Runtimes for Java Technology"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Component":"Java SDK","Platform":[{"code":"","label":""}],"Version":"","Edition":"","Line of Business":{"code":"LOB36","label":"IBM Automation"}},{"Product":{"code":"SS7JFU","label":"WebSphere Application Server - Express"},"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Component":"J2EE Development and deployment","Platform":[{"code":"PF016","label":"Linux"}],"Version":"7.0;6.1;6.0.2;6.0.1;6.0;5.1.1;5.1","Edition":"","Line of Business":{"code":"LOB45","label":"Automation"}}]

Document Information

Modified date:
26 October 2021

UID

swg21158192