IBM Support

V5R4 Watcher Jobs SRVMONxxxx for First Failure Data Capture

Troubleshooting


Problem

This document describes the First Failure Data Capture (FFDC) enhancements introduced in V5R4.

Resolving The Problem

Enhanced V5R4 First Failure Data Capture (FFDC) Changes

Q1: What is the difference from the prior FFDC design?

A1: First Failure Data Capture has been a function of the IBM System i products since V2R2.

FFDC is designed to collect information automatically when an IBM program detects an unexpected condition. A symptom string is created with enough information to identify the problem uniquely, and the problem is reported through WRKPRB. The same design is also available for problem reporting through the operating system API QPDLOGER. The electronically reported problem is then sent to the appropriate Software Service provider.

When an IBM program detects an unexpected condition, a Product Activity Log (PAL) entry is logged. If the QSFWERRLOG system value is set to *LOG, a Work with Problems (WRKPRB) entry is created and message CPI93B9 - Software Error is sent to the System Operator. Using WRKPRB, the problem can then be submitted electronically to IBM. On submission, an ECS PMR is created and the known problems (APARs) are searched using the symptom string of the WRKPRB entry. Then one of the following occurs:
o If a fix (PTF) is available for the APAR, the PTF is transmitted back to the user's system.
o If only the APAR is available (there is no PTF available), the APAR is transmitted back to the user's system.
o If no APAR is found, the response is APAR II12302. This APAR contains information to be collected and requests that you contact your Software Service Provider.
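
The three outcomes above amount to a simple decision on the result of the symptom-string search. The following Python sketch is illustrative only: the function name ecs_response, its parameters, and the sample APAR and PTF numbers in the usage lines are hypothetical and are not part of any IBM interface. Only APAR II12302 comes from this document.

```python
from typing import Optional

NO_MATCH_APAR = "II12302"  # informational APAR named in this document


def ecs_response(apar: Optional[str], ptf: Optional[str]) -> str:
    """Describe what is sent back for an electronically submitted problem.

    apar: APAR found by the symptom-string search, or None if none matched.
    ptf:  PTF that fixes that APAR, or None if no fix is available.
    """
    if apar is None:
        # No known problem matched: the data-collection APAR is returned.
        return f"APAR {NO_MATCH_APAR} (collect data, contact Software Service Provider)"
    if ptf is not None:
        # A fix exists: the PTF is transmitted to the user's system.
        return f"PTF {ptf} for APAR {apar}"
    # Known problem without a fix: only the APAR is transmitted.
    return f"APAR {apar} (no PTF available)"


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    print(ecs_response("SE12345", "SI67890"))
    print(ecs_response("SE12345", None))
    print(ecs_response(None, None))
```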

The differences with the V5R4 changes follow:
o Watcher jobs (SRVMONxxxx) collect additional information based on a policy. Each policy runs as a separate SRVMONxxxx job.
o Commands can be run for additional problem determination information or to perform a recovery operation.
o Collected information is transmitted automatically to the IBM TESTCASE or ECuRep FTP sites so that IBM service can analyze the electronically reported PMR.

Q2: What is a policy?

A2: A policy contains a problem symptom and actions to perform. It is used to collect diagnostic data on IBM i5/OS software problems. A policy is created by the IBM Programmer or the Rochester Software Support Center.

The following are some of the fields to define the policy:
o Policy number.
o Condition Type, such as a message ID or a Licensed Internal Code Log (liclog) entry. For a message ID, the test can include where the message occurs (QSYSOPR, a specific message queue, QHST, or a job log). A compare-against field allows finer granularity by to-program, from-program, or message data.
o Report Action: Yes/No (whether to automatically notify IBM using an ECS PMR when the condition occurs).
o Actions:

   o Action: a command to run, for example:
        RTVDSKINF ASPDEV(*SYSBAS)
     or
        CPYF FROMFILE(QSYS/QAEZDISK) +
             TOFILE(QSCXXXXXXX/PROBDATA) FROMMBR(QCURRENT) TOMBR(QCURRENT) MBROPT(*REPLACE) +
             CRTFILE(*NO) FMTOPT(*NOCHK)
   o Action type: User, CL command, Issue message, Submit to batch, or Recover.
   o Send Action Data with report: Yes/No (whether to send data to the IBM TESTCASE or ECuRep FTP sites).
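
To make the shape of a policy easier to picture, the fields listed above can be modeled as a small record. This Python sketch is an assumption-laden illustration, not the actual policy file format, which this document does not describe. The class names, field names, the policy number, the placeholder message ID CPFxxxx, and the Yes/No choices are all hypothetical. Only the two commands are taken from the example above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    command: str      # command text to run when the condition occurs
    action_type: str  # e.g. "CL command", "Submit to batch", "Recover"
    send_data: bool   # "Send Action Data with report": Yes/No


@dataclass
class Policy:
    number: int           # policy number
    condition_type: str   # e.g. "message ID" or "liclog"
    condition_value: str  # e.g. the message ID being watched
    location: str         # e.g. "QSYSOPR", "QHST", or a job log
    report_action: bool   # notify IBM with an ECS PMR: Yes/No
    actions: List[Action] = field(default_factory=list)

    def commands_with_data_to_send(self) -> List[str]:
        """Return the commands whose collected data is sent with the report."""
        return [a.command for a in self.actions if a.send_data]


# Hypothetical policy built from the commands shown in this document.
example = Policy(
    number=1,                   # hypothetical
    condition_type="message ID",
    condition_value="CPFxxxx",  # placeholder, not a real watched message
    location="QSYSOPR",
    report_action=True,
    actions=[
        Action("RTVDSKINF ASPDEV(*SYSBAS)", "CL command", send_data=False),
        Action(
            "CPYF FROMFILE(QSYS/QAEZDISK) TOFILE(QSCXXXXXXX/PROBDATA) "
            "FROMMBR(QCURRENT) TOMBR(QCURRENT) MBROPT(*REPLACE) "
            "CRTFILE(*NO) FMTOPT(*NOCHK)",
            "CL command",
            send_data=True,     # assumed for illustration
        ),
    ],
)
```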

Q3: How is the policy used to detect a problem on the user's system?

A3: Two QSRVMON jobs in the QSYSWRK subsystem perform the Service Monitor functions. These system jobs start one SRVMONxxxx job for each policy and process the notifications received from those jobs. The SRVMONxxxx jobs run in the QUSRWRK subsystem. The QSFWERRLOG system value controls the starting and stopping of the Service Monitor and the SRVMONxxxx jobs.

Q4: Is the system performance affected by the number of SRVMONxxxx jobs running?

A4: Generally, no. Watch jobs sit in a wait state, which means that they use no CPU. When a watched-for event occurs, the corresponding job is activated and the Service Monitor performs the actions defined in the policy, which uses some resources. Because the policies watch for error conditions, they should be activated only rarely.

The system value QSFWERRLOG, when set to *LOG, starts the Service Monitor Watch Jobs (SRVMONxxxx) in the QUSRWRK subsystem. The shipped value of QSFWERRLOG is *LOG.

The command WRKWCH WCH(*SRVMON) can be used to view the active watches that were started using the Service Monitor function of the operating system.

Q5: How does the user obtain a new policy?

A5: Each time Service Agent connects from the user's system to the Service Data Repository (SDR) at IBM, it checks whether a new policy file exists. If so, Service Agent downloads the file and notifies the Service Monitor job. The policy file is also shipped with the operating system, and each Service Monitor PTF includes the latest policy file. The policy file is not delivered as a separate PTF.

Users with Service Agent disabled get updated policies only when they apply a Service Monitor PTF.

Q6: How is the user notified that a condition in the policy occurred?

A6: The message "Service Monitor detected a software problem" is sent to the system operator, and a WRKPRB entry is created.

Q7: What happens if the problem does not have an APAR that documents the problem?

A7: A typical response that is put in the ECS PMR follows:
Thank you for using ECS regarding a problem you are experiencing with your AS/400.

We have had our service specialists review your reported problem description and search our database. There isn't enough information contained in this record for us to tell you what the problem may be.

To pursue your problem further, we need additional information. Rochester Support Center knowledgebase document N1016968: Saving APAR Data for an FFDC Problem and Sending It as an E-Mail Attachment describes how to collect data related to the problem and e-mail it as an attachment.
 
Q8: Does CPI3999 RC8 indicate a problem?

A8: No. There are several scenarios in which the watches end normally with reason code 08 of CPI3999 - Watch session SRVMONxxxx has been ended:
1. A user changes the QSFWERRLOG system value to *NOLOG. The watches will end and stay ended until the user changes the system value back to *LOG.
2. A new policy file is downloaded. The watches will end and then restart with the new policy file.
3. A user uses the Work Watch (WRKWCH) command to end the watch. The watch will stay ended until the next IPL, subsystem restart, or until the user changes the system value to *NOLOG and then back to *LOG.
4. The Service Monitor encounters a recoverable error. Watches will end and then restart.

All of these cases are normal processing, and the user should not be concerned.
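
The four scenarios differ mainly in whether the watches come back on their own. The lookup table below restates them in Python as a quick reference. It is only a sketch: the dictionary name, the key strings, and the helper function are hypothetical, and nothing here reads real system state.

```python
# cause -> (restarts on its own?, what brings the watches back)
RC08_CAUSES = {
    "QSFWERRLOG changed to *NOLOG": (
        False,
        "change QSFWERRLOG back to *LOG",
    ),
    "new policy file downloaded": (
        True,
        "watches restart with the new policy file",
    ),
    "watch ended with WRKWCH": (
        False,
        "next IPL, subsystem restart, or QSFWERRLOG set to *NOLOG and back to *LOG",
    ),
    "recoverable Service Monitor error": (
        True,
        "watches restart after the error",
    ),
}


def restarts_on_its_own(cause: str) -> bool:
    """Return True if the watches restart without any further event or user action."""
    return RC08_CAUSES[cause][0]


if __name__ == "__main__":
    for cause, (automatic, recovery) in RC08_CAUSES.items():
        print(f"{cause}: automatic={automatic}; {recovery}")
```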


Historical Number

456201648

Document Information

Modified date:
21 March 2024

UID

nas8N1014263