IBM Support

MustGather: IBM DataPower Gateway Appliance statistics gathering

Question & Answer


Question

A device is reporting high load, high CPU, high resource utilization, or slow responsiveness. What data should I collect as part of the MustGather process?

Cause

A device may encounter high load, high CPU utilization, and similar symptoms from problems such as, but not limited to, recursive stylesheets, configurations, heavy use of system variables, or extensive debugging.

Answer

Section 1. Please answer/establish the following in your support ticket update:
  • Has the device's FFDC feature been configured or enabled? See the Best Practices: Most Detailed Error Report, and ensure that error reports are generated Always on Startup.
  • Is this occurring on a single device or on a group? What are the device names, which devices are impacted, and are there others that are not impacted?
  • Is the problem behavior something that can be replicated?  What service(s) and domain(s) are involved?
  • Is the probe enabled? Is debug logging enabled?
  • Identify any recent changes made to domains and/or services on the device.
  • After collecting the diagnostics below, if the issue persists, try removing all non-management traffic from the device(s) and note whether the issue still occurs.

Section 2. Collect the following data, ideally staged before and leading into the high CPU, high load, or hang:
  • An 'admin' CLI/SSH session to DataPower is required to better understand the problem. The following diagnostics will help establish the root cause of the issue.
  1. Set up DPMon so that it is running before you replicate the problem:
    top
    diag
    dpmon on
    dpmon show
  2. Set up LLDiag so that it is running before you replicate the problem: https://www.ibm.com/support/docview.wss?uid=ibm10719605
  3. Run the following commands before, during, and after the problem is recreated (the more iterations over time, the better). A sample script (sample_cli_script.sh) is included below for periodic captures from a Linux operating system using bash.
    top
    show clock
    show load
    show cpu
    show memory
    show filesystem
    show accepted-connections
    show tcp-table
    show tcp-connections
    show throughput
    show interface
    show link
    show gateway-transactions
    show crypto-engine
    show xml-names
    diag
    show memory details
    show connections
    show handles
    show activity 100
  4. Ideally, include at least 5 sets of CLI output. The collection interval depends on how readily the issue can be recreated. The default is 5-minute intervals, but if the issue occurs over shorter periods, collect the CLI output more frequently. In production, do not collect CLI data more often than once every 30 seconds.
  5. In production environments, if there is concern about impact, omit the 'show tcp-table', 'show gateway-transactions' and 'show handles' commands.
  6. Collect a debug log from the default domain and from the domain where the active service is running. If it is unclear which other domain to collect the debug log from, the default domain alone is a good start.
  7. Save an Error Report, either through the CLI ('co; save error-report') or through the WebGUI (Troubleshooting->Generate Error Report).
  8. Collect a device backup. With this, DataPower Support will have all domains including the default to work from as needed.
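The production guidance in steps 4 and 5 can be sketched as two small bash helpers. This is an illustration only: the function names filter_production_commands and check_interval and the MIN_PROD_INTERVAL variable are hypothetical and are not part of any IBM-supplied script.

```shell
#!/bin/bash
# Sketch only; function and variable names are illustrative assumptions.

MIN_PROD_INTERVAL=30   # seconds, from the production guidance in step 4

# Drop the three commands that step 5 suggests omitting in production.
# Reads one CLI command per line on stdin, writes the kept commands to stdout.
filter_production_commands() {
  grep -v -x -E '[[:space:]]*show (tcp-table|gateway-transactions|handles)[[:space:]]*'
}

# Succeed only if the requested interval is a whole number of seconds
# and is not shorter than the production minimum.
check_interval() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -ge "$MIN_PROD_INTERVAL" ]
}
```

For example, a command list could be piped through filter_production_commands before it is sent to a production device, and check_interval 300 could guard the sleep value used between captures.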
To help capture the CLI outputs from a device, the following shell script may help. Review the script; it is commented and is designed to collect the outputs at a timed interval into files named with a unique filename plus timestamp. The script must be modified to use your device IP address and login information.

sample_cli_script.sh

Keep in mind that this is only a sample to assist with the collection and is not a supported script. The 'admin' user is required because the script needs access to the 'diag' (diagnostics) prompt. If the script is run with the admin ID and an incorrect password, you could lock yourself out of the device. Adjust the "COUNT" and "sleep" values to suit the problem behavior.
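For orientation, a periodic capture of this kind might look like the minimal sketch below. It is not the attached sample_cli_script.sh and is equally unsupported. It assumes that the device prompts for a login and password on the SSH session before accepting CLI commands, and that 'exit' leaves diag mode and then ends the session. The variable names (DP_HOST, SLEEP_SECS, OUT_PREFIX), the function names, the placeholder address, and the output file naming are all hypothetical.

```shell
#!/bin/bash
# Sketch only, not the original sample_cli_script.sh.
# Assumptions: names, file naming, and that the device asks for login and
# password on the SSH session before it accepts CLI commands.

DP_HOST="${DP_HOST:-192.0.2.10}"        # placeholder address; use your device
COUNT="${COUNT:-5}"                     # number of captures
SLEEP_SECS="${SLEEP_SECS:-300}"         # pause between captures, in seconds
OUT_PREFIX="${OUT_PREFIX:-cli_output}"  # start of each output file name

# Commands from Section 2, step 3, in the listed order.
dp_commands() {
  cat <<'EOF'
top
show clock
show load
show cpu
show memory
show filesystem
show accepted-connections
show tcp-table
show tcp-connections
show throughput
show interface
show link
show gateway-transactions
show crypto-engine
show xml-names
diag
show memory details
show connections
show handles
show activity 100
EOF
}

# Unique output name: prefix + capture number + timestamp.
capture_name() {
  printf '%s.%03d.%s.txt\n' "$1" "$2" "$(date +%Y%m%d-%H%M%S)"
}

# Write the session input (credentials, then commands) to a private file.
write_session_input() {
  local infile="$1" user="$2" pass="$3"
  (
    umask 077
    {
      printf '%s\n' "$user" "$pass"
      dp_commands
      printf '%s\n' "exit" "exit"   # assumed: leave diag mode, then log out
    } > "$infile"
  )
}

# Run COUNT captures, one SSH session each, sleeping between them.
run_captures() {
  local infile="$1" i=1 outfile
  while [ "$i" -le "$COUNT" ]; do
    outfile="$(capture_name "$OUT_PREFIX" "$i")"
    ssh -T "$DP_HOST" < "$infile" > "$outfile" 2>&1
    if [ "$i" -lt "$COUNT" ]; then sleep "$SLEEP_SECS"; fi
    i=$((i + 1))
  done
}

# Run only when asked, so the functions can be reviewed or sourced first.
if [ "${1:-}" = "--run" ]; then
  printf 'DataPower admin user: '; read -r DP_USER
  printf 'Password: '; stty -echo; read -r DP_PASS; stty echo; printf '\n'
  INFILE="$(mktemp)"
  trap 'rm -f "$INFILE"' EXIT
  write_session_input "$INFILE" "$DP_USER" "$DP_PASS"
  run_captures "$INFILE"
fi
```

Prompting for the password once and reusing it for every capture avoids storing it in the script, but a mistyped password would then be retried on each iteration, so verify the credentials with a single manual login before starting a long run.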


Section 3. When the issue is replicated, collect the following for IBM Support:
  • Answers to the questions in Section 1.
  • DPMon data directory (check the output from 'top; diag; dpmon show'), by default the temporary:///dpmon directory: dpmon, dpmon.1, dpmon.2, dpmon.x (all iterations), and also dpmon.errlog. Set up via Section 2, Step 1.
  • LLDiag output files (lldiag.txt, lldiag.txt.1, etc.) generated via Section 2, Step 2.
  • CLI output collection generated via Section 2, Step 3.
  • Error report generated via Section 2, Step 7.
  • Device backup generated via Section 2, Step 8.
  • Sample client input messages and/or server response messages that can trigger the condition.
  • All logs stored in the logtemp:// directory.
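
Once the items above have been copied off the device into one local directory, they can be packed into a single archive for upload. The helper below is a hedged sketch: the function name bundle_mustgather, the archive naming, and the idea of labelling the archive with a case number are assumptions, not an IBM requirement.

```shell
#!/bin/bash
# Sketch only; the directory layout and archive name are illustrative.

# Pack everything under a collection directory into one timestamped archive.
# $1: directory holding the gathered files; $2: label such as a case number.
# The archive is written to the current directory, which should not be $1.
bundle_mustgather() {
  local src_dir="$1" label="$2" archive
  [ -d "$src_dir" ] || { echo "no such directory: $src_dir" >&2; return 1; }
  archive="mustgather_${label}_$(date +%Y%m%d-%H%M%S).tar.gz"
  tar -czf "$archive" -C "$src_dir" . && printf '%s\n' "$archive"
}
```

The function prints the archive name on success, so it can be captured in a variable and passed to whatever upload method your support case uses.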

Additional Notes:

Diagnostic commands are not publicly documented and are intended for IBM Support diagnosis only. These diagnostic commands can be intrusive, but they are necessary to diagnose the problem correctly. To prevent known issues from causing additional complications, it is highly recommended that you run the latest firmware. To confirm that your firmware will not cause a problem during debugging, always check the release notes for your firmware level, accessible from this document.

Business Unit: Cloud & Data Platform; Product: IBM DataPower Gateway; Component: General; Platform: Firmware; Version: 2018.4.1; Edition: Edition Independent; Line of Business: Automation

Document Information

Modified date:
10 December 2020

UID

swg21377610