Platform diagnostics (ppc64-diag)

Platform diagnostics report firmware events, provide an automated response mechanism to urgent events, and provide event notifications to system administrators and service frameworks.

The utilities described here are supported in the following Linux® distributions and virtualized environments:
Table 1. Support for ppc64-diag utilities

| Utility | PowerVM® partition on any level of Power® processor |
| --- | --- |
| rtas_errd | All current versions of Red Hat® Enterprise Linux and SUSE Linux Enterprise Server |
| opal_errd | Not applicable |
| opal-elog-parse | Not applicable |
| opal-dump-parse | Not applicable |
| diag_encl | All current versions of Red Hat Enterprise Linux and SUSE Linux Enterprise Server |
| encl_led | All current versions of Red Hat Enterprise Linux and SUSE Linux Enterprise Server |
| usysident | All current versions of Red Hat Enterprise Linux and SUSE Linux Enterprise Server |
| usysattn | All current versions of Red Hat Enterprise Linux and SUSE Linux Enterprise Server |
| ppc64-diag Error Log Analyzer (ELA) | All current versions of Red Hat Enterprise Linux and SUSE Linux Enterprise Server |
| diag_nvme | Red Hat Enterprise Linux 9.2 or any subsequent RHEL 9.x release; SUSE Linux Enterprise Server (SLES) 15 SP5 or any subsequent SLES 15 update; limited to IBM® Power10 processor-based systems or later |

For Linux distributions currently supported on Power systems, see Linux on Power overview.

Platform diagnostics for systems using PowerVM virtualization

The rtas_errd platform diagnostics daemon logs platform events that are detected by firmware to the servicelog database. Platform events are also known as RTAS events. The rtas_errd daemon might also take additional action on certain types of events, such as fan or power supply failures. It is configured to start automatically when Linux boots.
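A quick way to confirm that rtas_errd is enabled at boot and currently running is sketched below. It assumes a systemd-based distribution on which the ppc64-diag package installs a unit named rtas_errd.service; both the unit name and the check_rtas_errd function are assumptions for illustration, not part of ppc64-diag. Source the file, then call the function as root.

```shell
# Hypothetical helper: report whether rtas_errd starts at boot and is running.
# Assumes systemd and a unit named rtas_errd.service (verify on your system).
check_rtas_errd() {
    unit=rtas_errd.service
    # is-enabled/is-active with --quiet print nothing and only set the exit code.
    if systemctl is-enabled --quiet "$unit"; then
        echo "$unit: enabled at boot"
    else
        echo "$unit: not enabled at boot"
    fi
    if systemctl is-active --quiet "$unit"; then
        echo "$unit: running"
    else
        echo "$unit: not running"
    fi
}
```

If the daemon is not running, systemctl enable --now rtas_errd.service would enable and start it, again assuming that unit name.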

Platform diagnostics commands and the rtas_errd daemon are provided by the ppc64-diag package. The package typically includes the following commands:

explain_syslog
Reads a file (or standard input) in the format produced by the syslogd daemon, and prints an explanation for each line that matches a message in the /etc/ppc64-diag/message_catalog message catalog. The explanations include the probable cause and the recommended action. With the -M flag, the command reads the /var/log/messages file. For example:
explain_syslog -M
syslog_to_svclog
Reads a file (or standard input) in the format produced by the syslogd daemon, and logs an event to the servicelog database for each line that matches a message in the /etc/ppc64-diag/message_catalog message catalog. This command is not started automatically when Linux boots. When run in the background with the -M flag, it continuously monitors the /var/log/messages file. For example:
syslog_to_svclog -M &
usysident
Use this utility to turn device identification indicators on or off, or to view and modify system identification indicators. This utility was previously part of the powerpc-utils package; as of SUSE Linux Enterprise Server 11 SP3, it is provided by the ppc64-diag package.
usysattn
If you run the usysattn utility without arguments, it prints a list of all of the attention indicators on the system along with their current status (on or off). This utility was previously part of the powerpc-utils package; as of SUSE Linux Enterprise Server 11 SP3, it is provided by the ppc64-diag package.
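Because explain_syslog also accepts standard input, you can limit it to recent messages instead of the whole log. The sketch below assumes that explain_syslog reads standard input when it is given neither a file nor -M, and that your syslog file is /var/log/messages; both function names are hypothetical. The second function starts syslog_to_svclog under nohup so that the monitor is not stopped by a hangup when you log out.

```shell
# Hypothetical helper: explain only the last N lines of a syslog-format file.
# Defaults (200 lines, /var/log/messages) are assumptions; adjust as needed.
explain_recent() {
    lines=${1:-200}
    log=${2:-/var/log/messages}
    # Feed the tail of the log to explain_syslog on standard input.
    tail -n "$lines" "$log" | explain_syslog
}

# Hypothetical helper: run syslog_to_svclog -M in the background, ignoring
# hangups, and print the process ID of the monitor.
start_svclog_monitor() {
    nohup syslog_to_svclog -M >/dev/null 2>&1 &
    echo "$!"
}
```

For example, explain_recent 500 checks the last 500 lines. Because syslog_to_svclog is not started at boot, start_svclog_monitor has to be run again after each restart.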

Enclosure diagnostics (diag_encl)

As of SUSE Linux Enterprise Server 11 SP3, you can use additional options to diagnose problems on the 5888 PCIe storage enclosure. The diag_encl utility is contained in the ppc64-diag package.

You can run the diag_encl utility as part of a Linux CRON job (recommended) or on its own. For more information about setting up a CRON job that runs the diag_encl utility, see Connecting and configuring the disk drive enclosure in a system running Linux (http://www.ibm.com/support/knowledgecenter/POWER7/p7ham/scsidiskdriveenclosurelinux.htm).

Use the following command to run enclosure diagnostics as part of a CRON job:

/usr/sbin/diag_encl -scl

Options for the diag_encl utility include the following:
  • -h: Print a help message and exit.
  • -s: Generate serviceable events for any failures and write events to the service log.
  • -c: Compare with previous status and report only new failures.
  • -l: Turn on fault LEDs for serviceable events.
  • -v: Verbose output.
  • -V: Print the version of the command and exit.
  • -f path: For testing, read SCSI enclosure services (SES) data from path.pg2 and vital product data (VPD) from path.vpd.
  • <scsi_enclosure>: The SCSI generic (sg) device on which to operate, such as sg7. If you do not specify a device, all such devices are diagnosed.
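The recommended CRON usage can be wrapped in a small script, sketched below with the -s, -c, and -l options described above. The function name run_diag_encl and the idea of calling it from a daily job are assumptions; your ppc64-diag build might already install an equivalent job, so check before adding one.

```shell
# Hypothetical daily wrapper for enclosure diagnostics.
DIAG_ENCL=/usr/sbin/diag_encl

run_diag_encl() {
    # Do nothing if the ppc64-diag package is not installed.
    [ -x "$DIAG_ENCL" ] || return 0
    # -s: log serviceable events; -c: report only new failures;
    # -l: turn on fault LEDs for serviceable events.
    # Extra arguments (for example sg7) restrict the run to those devices;
    # with none, all enclosures are diagnosed.
    "$DIAG_ENCL" -scl "$@"
}
```

To check one enclosure interactively with verbose output instead, you could run /usr/sbin/diag_encl -v sg7, where sg7 is an example device name.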

For more information, see the 5888 PCIe storage enclosure topic (http://www.ibm.com/support/knowledgecenter/POWER7/p7ham/p7ham_5888_kickoff.htm).

Note: You can also use the diag_encl utility on the IBM TotalStorage™ EXP24 Ultra320 SCSI Expandable Storage Disk Enclosure (7031).

NVMe diagnostics (diag_nvme)

The ppc64-diag package contains the diag_nvme utility. Running the diag_nvme utility as part of a Linux CRON job is recommended, but you can also run it independently. When ppc64-diag is installed, a CRON job is created automatically. This CRON job runs the diag_nvme utility daily for all NVMe devices that are detected on the system. The CRON job file is located at:

/etc/cron.daily/run_diag_nvme

To select which detected events are reported to the servicelog database, edit the /etc/ppc64-diag/diag_nvme.config configuration file. By default, all detected events are reported.

The diag_nvme utility supports the following options:
  • -h or --help: Prints a help message and exits.
  • -d or --dump: Dumps SMART data from the specified NVMe device to the specified file.
  • -f or --file: For testing only. Uses SMART data from the specified file instead of reading it from an NVMe device.
  • nvme_devices: The NVMe device or devices to diagnose, such as nvme0. If no NVMe device name is specified, all NVMe devices that are detected on the system are diagnosed.
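To reproduce a result, you can save a device's SMART data with -d and later analyze the saved copy with -f. The sketch assumes that both options take a file path as their argument and that a device name is still given together with -f; confirm the exact syntax with diag_nvme --help. Because -f is intended for testing, consider running the replay on a test system. The function name and default file location are hypothetical.

```shell
# Hypothetical helper: dump SMART data from one NVMe device, then rerun the
# diagnosis against the saved file instead of the live device.
snapshot_and_replay() {
    if [ -z "$1" ]; then
        echo "usage: snapshot_and_replay nvme_device [file]" >&2
        return 2
    fi
    dev=$1
    file=${2:-/tmp/$dev.smart}
    # -d: dump SMART data from the device to $file (assumed argument order).
    diag_nvme -d "$file" "$dev" || return
    # -f: analyze the SMART data saved in $file (testing option).
    diag_nvme -f "$file" "$dev"
}
```

For example, snapshot_and_replay nvme0 /root/nvme0.smart. To run the packaged daily check on demand instead, you can execute /etc/cron.daily/run_diag_nvme as root.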

Light path diagnostics

Light path diagnostics is a system of light-emitting diodes (LEDs) on various external and internal components of the server. When an error occurs, LEDs light up at the relevant locations throughout the server. Use the following utilities to gather information about light path diagnostics:

usysident
Use this utility to view the indicators that identify devices on Power systems and to turn them on or off. This utility was previously part of the powerpc-utils package; as of SUSE Linux Enterprise Server 11 SP3, it is provided by the ppc64-diag package.
usysattn
If you run the usysattn utility without arguments, it prints a list of all of the attention indicators on the system along with their current status (on or off). This utility was previously part of the powerpc-utils package; as of SUSE Linux Enterprise Server 11 SP3, it is provided by the ppc64-diag package.
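A minimal sketch of switching an identify indicator on and off follows. The -l location_code and -s identify|normal options used here are assumptions based on the usysident man page of current releases; check man usysident on your system before relying on them. The function names are hypothetical.

```shell
# Hypothetical helpers around usysident. The -l and -s options are assumed;
# verify them with "man usysident" before use.

# Turn on the identify indicator at a location code so the part is easy to find.
identify_on() {
    usysident -l "$1" -s identify
}

# Return the identify indicator at a location code to its normal state.
identify_off() {
    usysident -l "$1" -s normal
}
```

For example, run usysattn to list attention indicators and their location codes, call identify_on with the code of the part you want to locate, and call identify_off with the same code when you are done.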

Example: Locating a faulty Ethernet card

  1. The service log notifier alerts the light path diagnostics subsystem, lp_diag, that the Ethernet card is not functioning. Typically, the lp_diag utility runs automatically through a script that is registered when the ppc64-diag package is installed.
  2. The lp_diag utility enables an indicator LED.
  3. You notice that one of the LEDs on your system is lit and not flashing. You run the usysattn utility from the command line to get the location code of the LED indicator.
  4. To gather more information about the card, you run the lscfg utility.
  5. You replace the faulty card, and use the log_repair_action utility to reset the LED.
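Steps 3 and 4 can be run from a shell as sketched below. The interface name eth0 is a hypothetical default, and lscfg -vl device (from the lsvpd package) is assumed to print the vital product data for the named device; check man lscfg. log_repair_action is left out because its arguments can differ by release; see its man page for the options that reset the indicator.

```shell
# Hypothetical helper for steps 3 and 4 of the example above.
inspect_faulty_adapter() {
    dev=${1:-eth0}   # hypothetical default interface name
    # Step 3: list attention indicators with their location codes and status.
    echo "== Attention indicators =="
    usysattn
    # Step 4: show vital product data for the suspect adapter (assumed syntax).
    echo "== VPD for $dev =="
    lscfg -vl "$dev"
}
```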

For more information, see the Light path diagnostics topic.

The commands that are provided by this package, and their features and usage, might vary by distribution and release. Consult the man pages on your system for the most accurate description of their features and usage. For more information about how to list and display the man pages for commands that are provided by this package, see Displaying package man pages.

For more information about the ppc64-diag package, see ppc64 Platform Diagnostics.