A device is reporting high load, high CPU, high resource utilization, or slow responsiveness. What data should I collect as part of the MustGather process?
A device may encounter high load, high CPU, and similar symptoms from problems such as recursive style sheets, configurations, heavy use of system variables, or multiple debug settings. This type of behavior can be analyzed to determine the cause of the problem. DataPower Support needs answers to the following questions:
- Has the device's FFDC feature been configured or enabled? See this technote for additional information about First Failure Data Capture.
- Is this occurring on a single device?
- Is the problem behavior something I can recreate?
- Is the cause of the problem valid or potentially self-inflicted? For example, is the load at 100% along with high CPU and is the device slow to respond?
- Is the probe enabled?
- Is debug logging enabled?
- Has a style sheet caused a recursive loop of connections or is it trying to copy a large request or response into several variables during a load test?
The main purpose is to determine if this is a one-time event or a reproducible problem. If the problem can be reproduced, DataPower Support has a significant chance of locating the source of the problem.
If the problem has occurred once on a single device:
- If the issue is still affecting the device, collect multiple instances of the Command Line Interface (CLI) output from the bash script available at the bottom of the page.
- Obtain all logs stored inside the logtemp:// directory.
- If the issue is related to high memory, remove the device from network activity and let it remain idle for at least 15 minutes. Monitor the memory consumption to determine if it returns to a lower value.
- If possible, do not reboot. Instead, leave the device in the bad state and contact IBM DataPower Support.
If you can reproduce the problem, follow these instructions:
DataPower Support can use the following CLI tools to better understand the problem. The output collection described here will begin to isolate the problem, but in some cases the problem may need to be simplified and reproduced further.
Methods to simplify the problem include:
- Reduce the number of active services and/or domains running when the problem occurs.
- If the behavior is a sudden spike, then the traffic hitting the device at the time of the spike would be the best thing to provide. A packet trace to capture the inbound traffic would be ideal.
- There will be times when IBM Support will need to reproduce the problem. Sample client traffic and backend content will need to be simulated. A packet trace can help replicate this.
Keep in mind that the diagnostic outputs are for use by IBM Support only and are not publicly documented.
- You will need to enable tracing after logging into the CLI with the following commands:
Tracing should be enabled immediately after a reboot and before reproducing the problem. For best results, once the device has booted, enable tracing and then reload the configuration. Then reproduce the problem or allow the condition to occur again. If that is not possible, enable tracing and allow the condition or problem to get worse. Then continue with the following capture.
Note: The tracing in this case should have run for at least 30 minutes to 1 hour before collecting the outputs, if possible.
- Reproduce or allow the problem to occur; then collect the output from the following CLI commands:
show mem details
show activity 50
- The CLI outputs need to be collected at 5-minute intervals over a span of 30 minutes. This yields a total of 7 output files, including the initial capture. Save each capture to its own uniquely named file; this helps Support analyze the information.
- A debug log from the default domain and domain where the active service is running should be collected. If it is unclear which other domain the debug log should be collected from, the default domain alone will be a good start.
- Collect a device backup. With this, DataPower Support will have all domains including the default to work from as needed.
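The capture schedule described above (every 5 minutes over 30 minutes, including the initial capture) can be checked with a short calculation. The following is only an illustrative sketch; the function names and the `cli-output-NN.txt` file names are hypothetical and are not part of any IBM tooling.

```shell
#!/usr/bin/env bash
# Sketch: work out how many captures a given interval and span produce.
# Values taken from the instructions above; names are illustrative only.
INTERVAL_MIN=5   # minutes between captures
SPAN_MIN=30      # total collection window in minutes

# One capture at minute 0, then one per interval until the span is covered.
# Usage: capture_count INTERVAL SPAN
capture_count() {
  echo $(( $2 / $1 + 1 ))
}

# Print the planned offset and an example file name for each capture.
# Usage: print_schedule INTERVAL SPAN
print_schedule() {
  total="$(capture_count "$1" "$2")"
  n=1
  while [ "$n" -le "$total" ]; do
    offset=$(( (n - 1) * $1 ))
    printf 'capture %d at +%d min -> cli-output-%02d.txt\n' "$n" "$offset" "$n"
    n=$(( n + 1 ))
  done
}

print_schedule "$INTERVAL_MIN" "$SPAN_MIN"
```

With the values above, the schedule lists seven captures, the first at +0 minutes and the last at +30 minutes.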
When complete you should have the following files:
- CLI output collection
- Device backup
- Packet traces of client or backend traffic going through the device
- Sample client input message(s) that trigger the condition
- Debug logs from the default domain and any other domains that span the same time as the CLI output capture
To help capture the CLI outputs from a device, the following shell script may help. Review the script before use; it is commented and is designed to collect the outputs at a timed interval, writing each capture to a file named in a unique filename + timestamp format. The script will need to be modified to work with your device IP address and login information.
Keep in mind this is only a sample to assist in this collection and is not a supported script. Using the "admin" id to capture data with this script is not recommended; a separate privileged user id is the better choice. If this script is run with the admin id and an incorrect password, you could lock yourself out of the device. The "COUNT" and "sleep" values should be adjusted depending on the problem behavior.
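As a rough illustration of the approach described above, a timed collection loop might look like the following minimal sketch. It is not the IBM sample script and is not supported. The host address, user id, variable names, and output file names are placeholders. It also assumes that the device prompts for the user name and password on the SSH session, so that they can be sent as the first lines of input; confirm this on your firmware level before relying on it.

```shell
#!/usr/bin/env bash
# Sketch only: collect CLI output at a timed interval, one file per capture.
# All names and defaults below are illustrative assumptions.

DP_HOST="${DP_HOST:-192.0.2.10}"       # placeholder device address
DP_USER="${DP_USER:-support-user}"     # a privileged id, not "admin"
DP_PASS="${DP_PASS:-}"                 # supply through the environment
COUNT="${COUNT:-7}"                    # number of captures
SLEEP_SECS="${SLEEP_SECS:-300}"        # seconds between captures
OUT_PREFIX="${OUT_PREFIX:-cli-output}" # output file name prefix

# Emit the text fed to the CLI session.
# Assumption: login prompts come first, so credentials are sent first.
cli_input() {
  printf '%s\n' "$DP_USER" "$DP_PASS" \
    "show mem details" \
    "show activity 50"
}

# Send stdin to the device and print whatever the session returns.
send_to_device() {
  ssh -T "$DP_HOST"
}

# Run COUNT captures, each written to its own index + timestamp file.
collect_outputs() {
  i=1
  while [ "$i" -le "$COUNT" ]; do
    ts="$(date +%Y%m%d-%H%M%S)"
    outfile="${OUT_PREFIX}-${i}-${ts}.txt"
    cli_input | send_to_device > "$outfile" 2>&1
    echo "capture ${i}/${COUNT} written to ${outfile}"
    # No need to wait after the final capture.
    if [ "$i" -lt "$COUNT" ]; then
      sleep "$SLEEP_SECS"
    fi
    i=$(( i + 1 ))
  done
}

# Contact the device only when explicitly asked to, so the functions can
# be reviewed or sourced first.
if [ "${1:-}" = "run" ]; then
  collect_outputs
fi
```

Saved under a name of your choosing, the sketch would be started with the single argument `run`, with `DP_HOST`, `DP_USER`, and `DP_PASS` set in the environment. Keeping the password out of the file and including the capture index in each file name are design choices of this sketch, not requirements from IBM.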
These diagnostic commands are intrusive but necessary. To prevent any known issues from causing additional complications or problems, it is highly recommended that you run the latest firmware. To confirm your firmware will not cause a problem during debugging, always check the release notes for your firmware level accessible from this document.