PFA_ENQUEUE_REQUEST_RATE

Description:

The PFA_ENQUEUE_REQUEST_RATE check detects damage to an address space or system by using the number of enqueue requests per CPU millisecond as the tracked metric.

If PFA detects that the enqueue request rate is lower than expected, PFA calls Runtime Diagnostics to detect if an address space is hung. If PFA detects that the enqueue request rate is higher than expected, PFA calls Runtime Diagnostics to detect if there is a damaged address space. By detecting these conditions early, you can correct the problem before it causes the system to hang or crash.

The enqueue request rate check issues an exception for the following types of comparisons:

tracked jobs
total system

To perform comparisons, the PFA_ENQUEUE_REQUEST_RATE check requires enough data to exist such that current predictions are available for two time ranges.

After the PFA_ENQUEUE_REQUEST_RATE check issues an exception, it does not perform the next comparison type. To avoid skewing the enqueue request rate, PFA ignores the first hour of enqueue data after IPL and the last hour of enqueue data prior to shutdown. In addition, PFA attempts to track the same persistent address spaces that it tracked prior to IPL or PFA restart if the same persistent address spaces are still active. Read the topic about persistent jobs in PFA_MESSAGE_ARRIVAL_RATE to understand how the PFA_ENQUEUE_REQUEST_RATE check determines the top twenty persistent jobs.
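
The exclusion of the first hour after IPL and the last hour before shutdown can be pictured with a short Python sketch. This is an illustration only, not PFA code: the function name, the sample layout, and the inclusive window boundaries are all assumptions.

```python
from datetime import datetime, timedelta

# From the text: one hour is ignored after IPL and before shutdown.
IGNORED_WINDOW = timedelta(hours=1)

def usable_samples(samples, ipl_time, shutdown_time=None):
    """Keep (timestamp, rate) samples that fall outside the ignored windows.

    Hypothetical helper: boundary handling (inclusive) is an assumption.
    """
    start = ipl_time + IGNORED_WINDOW
    end = shutdown_time - IGNORED_WINDOW if shutdown_time else None
    return [
        (ts, rate)
        for ts, rate in samples
        if ts >= start and (end is None or ts <= end)
    ]
```

With an IPL at 08:00 and a shutdown at 20:00, only samples taken from 09:00 through 19:00 would be kept in this sketch.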

By default, an EXCLUDED_JOBS file containing the address spaces NETVIEW and *MASTER* on all systems is created during installation. Therefore, if you have not made any modifications to the EXCLUDED_JOBS file, these jobs are excluded. See Using and configuring supervised learning for more information.

Guidelines

If you change the maximum number of concurrent ENQ, ISGENQ, RESERVE, GQSCAN and ISGQUERY requests or change system-wide defaults using the SETGRS command or through GRSCNFxx parmlib, delete the files in the PFA_ENQUEUE_REQUEST_RATE/data directory to ensure PFA is collecting relevant information.
PFA never calls Runtime Diagnostics if PFA detects something is too high. When PFA detects that the enqueue request rate is higher than expected, PFA issues an exception indicating that an address space or the system might be damaged.

Note: This check supports supervised learning. See the topic on Using and configuring supervised learning.

Reason for check:

The objective of this check is to determine if an LPAR or address space is damaged or hung by using the number of enqueues per CPU millisecond as the tracked metric.

Best practice:

The best practice is to analyze the message and reports issued by PFA to determine what is causing the increase or decrease in the enqueue request rate.

z/OS® releases the check applies to:

z/OS V1R13 and later.

Type of check:

Remote

Parameters accepted:

Yes, as follows:

Table 1. PFA_ENQUEUE_REQUEST_RATE check parameters
Parameter name	Default value	Minimum Value	Maximum Value	Description
collectint	1 Minute	1	360	This parameter determines how often (in minutes) to run the data collector that retrieves the current enqueue request rate.
modelint	720 Minutes	60	1440	This parameter determines how often (in minutes) the system analyzes the data and constructs a new enqueue request rate model or prediction. By default, PFA analyzes the data and constructs a new model every 720 minutes (12 hours). The model interval must be at least four times larger than the collection interval. Note that, even when you set a value larger than 360, PFA performs the first model at 360 minutes (6 hours).
stddev	10	2	100	This parameter is used to specify how much variance is allowed between the actual enqueue request rate per amount of CPU and the expected enqueue request rate. It determines if the actual enqueue request rate has increased beyond the allowable upper limit and how much variance is allowed across the time range predictions. If you set the STDDEV parameter to a smaller value, an exception is issued when the actual enqueue request rate is closer to the expected enqueue request rate and the predictions across the time ranges are consistent. If you set the STDDEV parameter to a larger value, an exception is issued when the actual enqueue request rate is significantly greater than the expected enqueue request rate even if the predictions across the different time ranges are inconsistent.
collectinactive	1 (on)	0 (off)	1 (on)	Defines whether data is collected and modeled even if the check is not eligible to run, not ACTIVE(ENABLED), in IBM® Health Checker for z/OS.
trackedmin	3	0	1000	This parameter defines the minimum enqueue request rate required for a persistent job in order for it to be considered a top persistent job that should be tracked individually.
exceptionmin	1	0	1000	This parameter is used when determining if an exception should be issued for an unexpectedly high enqueue request rate. For tracked jobs, this parameter defines the minimum enqueue request rate and the minimum predicted enqueue request rate required to cause a too high exception. For the total system comparison, this parameter defines the minimum enqueue request rate required to cause a too high exception.
checklow	1	0	1	Defines whether Runtime Diagnostics is run to validate that a low enqueue request rate is caused by a problem. If this value is off, PFA does not issue exceptions for conditions in which the enqueue request rate is unexpectedly low.
stddevlow	4	2	100	This parameter is used to specify how much variance is allowed between the actual enqueue request rate per amount of CPU, and the expected enqueue request rate, when determining if the actual rate is unexpectedly low. If you set the STDDEVLOW parameter to a smaller value, an exception is issued when the actual enqueue request rate is closer to the expected enqueue request rate. If you set the STDDEVLOW parameter to a larger value, an exception is issued when the actual enqueue request rate is significantly lower than the expected enqueue request rate.
limitlow	3	1	100	This parameter defines the maximum enqueue request rate allowed when issuing an exception for an unexpectedly low number of enqueues.
debug	0 (off)	0 (off)	1 (on)	This parameter (an integer of 0 or 1) is used at the direction of IBM service to generate additional diagnostic information for the IBM Support Center. This debug parameter is used in place of the IBM Health Checker for z/OS policy. The default is off (0).
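
The limits in Table 1 can be checked before an override is submitted. The following Python sketch is a hypothetical helper, not part of PFA or IBM Health Checker for z/OS: the dictionary layout and function name are assumptions, and "at least four times larger" is read here as modelint >= 4 x collectint.

```python
# Minimum and maximum values copied from Table 1.
LIMITS = {
    "collectint": (1, 360),
    "modelint": (60, 1440),
    "stddev": (2, 100),
    "collectinactive": (0, 1),
    "trackedmin": (0, 1000),
    "exceptionmin": (0, 1000),
    "checklow": (0, 1),
    "stddevlow": (2, 100),
    "limitlow": (1, 100),
    "debug": (0, 1),
}

# Default values copied from Table 1.
DEFAULTS = {
    "collectint": 1, "modelint": 720, "stddev": 10, "collectinactive": 1,
    "trackedmin": 3, "exceptionmin": 1, "checklow": 1, "stddevlow": 4,
    "limitlow": 3, "debug": 0,
}

def validate_parms(overrides):
    """Return a list of problems found in a dict of parameter overrides."""
    parms = {**DEFAULTS, **overrides}
    problems = []
    for name, value in parms.items():
        if name not in LIMITS:
            problems.append(f"{name}: unknown parameter")
            continue
        low, high = LIMITS[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}..{high}")
    # Table 1: the model interval must be at least four times the
    # collection interval (interpretation assumed, see lead-in).
    if parms["modelint"] < 4 * parms["collectint"]:
        problems.append("modelint must be at least 4 x collectint")
    return problems
```

For example, validate_parms({"collectint": 30, "modelint": 60}) reports the interval-ratio problem, because 60 is less than 4 x 30.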

To determine the status of the enqueue request rate check, issue f pfa,display,check(PFA_ENQUEUE_REQUEST_RATE),detail. For the command example and more details, see . The following example shows the output written to message AIR018I in SDSF:

AIR018I 02:22:54 PFA CHECK DETAIL

CHECK NAME:  PFA_ENQUEUE_REQUEST_RATE
    ACTIVE                          : YES
    TOTAL COLLECTION COUNT          : 5
    SUCCESSFUL COLLECTION COUNT     : 5
    LAST COLLECTION TIME            : 02/05/2009 10:18:22
    LAST SUCCESSFUL COLLECTION TIME : 02/05/2009 10:18:22
    NEXT COLLECTION TIME            : 02/05/2009 10:19:22
    TOTAL MODEL COUNT               : 1
    SUCCESSFUL MODEL COUNT          : 1 
    LAST MODEL TIME                 : 02/05/2009 10:18:24
    LAST SUCCESSFUL MODEL TIME      : 02/05/2009 10:18:24
    NEXT MODEL TIME                 : 02/05/2009 22:18:24
    CHECK SPECIFIC PARAMETERS:
       COLLECTINT                   : 1
       MODELINT                     : 720
       COLLECTINACTIVE              : 1=ON
       DEBUG                        : 0=OFF
       STDDEV                       : 10
       TRACKEDMIN                   : 3
       EXCEPTIONMIN                 : 1
       CHECKLOW                     : 1=ON
       STDDEVLOW                    : 4
       LIMITLOW                     : 3

User override of IBM values:

The following shows keywords you can use to override check values on either a POLICY statement in the HZSPRMxx parmlib member or on a MODIFY command. This statement can be copied and modified to override the check defaults:

UPDATE CHECK(IBMPFA,PFA_ENQUEUE_REQUEST_RATE)
            ACTIVE
            SEVERITY(MEDIUM)
            INTERVAL(ONETIME)
      PARMS=('COLLECTINT(1)','MODELINT(720)','STDDEV(10)','DEBUG(0)',
            'COLLECTINACTIVE(1)','EXCEPTIONMIN(1)','TRACKEDMIN(3)',
            'CHECKLOW(1)','STDDEVLOW(4)','LIMITLOW(3)')
            DATE(20080330)
      REASON('The enqueue request rate is higher than expected 
              which can indicate a damaged address space.')

The enqueue request rate check is designed to run automatically after every data collection. Do not change the INTERVAL parameter.

Verbose support:

The check provides additional detail in verbose mode. You can put a check into verbose mode using the UPDATE,filters,VERBOSE=ON parameters on either the MODIFY command or in a POLICY statement in an HZSPRMxx parmlib member.

Debug support:

The DEBUG parameter in IBM Health Checker for z/OS is ignored by this check. Rather, the debug parameter is a PFA check specific parameter. For details, see Understanding how to modify PFA checks.

Reference:

For more information about PFA, see the topic on Overview of Predictive Failure Analysis.

Messages:

The output is an enqueue request rate prediction report that corresponds to the message issued. PFA generates one of the following reports:

AIRH190E - Enqueue request rate lower than expected exception
AIRH192E - Enqueue request rate higher than expected exception
AIRH210E - Total system enqueue request rate higher than expected exception
AIRH211E - Total system enqueue request rate lower than expected exception
AIRH216I - Runtime Diagnostics output

For complete message information, see the topics on:

SECLABEL recommended for MLS users:

SYSLOW

Output:

The output is a variation of the enqueue request rate prediction report. The values found in the enqueue request prediction file are as follows:

Tracked jobs exception report for enqueue request rate higher than expected: PFA issues this report when any one or more tracked, persistent jobs cause an exception due to the enqueue request rate being higher than expected. Only the tracked jobs that caused an exception are in the list of jobs on the report.

Figure 1. Prediction report for enqueue request rate higher than expected - total jobs

                 Enqueue Request Rate Prediction Report
Last successful model time      : 01/27/2009 11:08:01
Next model time                 : 01/27/2009 23:08:01
Model interval                  : 720
Last successful collection time : 01/27/2009 17:41:38
Next collection time            : 01/27/2009 17:56:38
Collection interval             : 15
                                                                         
Persistent address spaces with high rates:                               
                                            Predicted Enqueue            
                      Enqueue                Request Rate                
  Job                 Request                                            
  Name     ASID          Rate        1 Hour       24 Hour         7 Day    
  TRACKED1 001D         58.00         23.88         22.82         15.82    
  TRACKED2 0028         11.00          0.34         11.11         12.11    
  TRACKED3 0029         11.00         12.43          2.36          8.36

Tracked jobs exception report for enqueue request rate lower than expected: PFA issues this report when any one or more tracked, persistent jobs cause an exception due to the enqueue request rate being lower than expected. Only the tracked jobs that caused an exception are in the list of jobs on the report.

Figure 2. Prediction report for enqueue request rate lower than expected - total jobs

                 Enqueue Request Rate Prediction Report
Last successful model time      : 10/10/2010 11:08:01
Next model time                 : 10/10/2010 23:08:01
Model interval                  : 720
Last successful collection time : 10/10/2010 17:41:38
Next collection time            : 10/10/2010 17:56:38
Collection interval             : 15
                                                                    
Persistent address spaces with low rates:                           
                                               Predicted Enqueue            
                          Enqueue                 Request Rate              
  Job                     Request                                           
  Name       ASID          Rate         1 Hour       24 Hour         7 Day 
  IBMUSER2   002F          1.17          23.88         22.82         15.82
  IBMUSER1   002E          2.01           8.34         11.11         12.11
                           
Runtime Diagnostics Output:
Runtime Diagnostics detected a problem in job: JOBS4
  EVENT 06: HIGH - HIGHCPU - SYSTEM: SY1 2009/06/12 - 13:28:46
  ASID CPU RATE: 96% ASID: 0027 JOBNAME: JOBS4
  STEPNAME: DAVIDZ PROCSTEP: DAVIDZ JOBID: STC00042 USERID: ++++++++
  JOBSTART: 2009/06/12 - 13:28:35
Error: 
  ADDRESS SPACE USING EXCESSIVE CPU TIME. IT MAY BE LOOPING.
Action: 
  USE YOUR SOFTWARE MONITORS TO INVESTIGATE THE ASID.
----------------------------------------------------------------------
  EVENT 07: HIGH - LOOP - SYSTEM: SY1 2009/06/12 - 13:28:46
  ASID: 0027 JOBNAME: JOBS4 TCB: 004E6850
  STEPNAME: DAVIDZ PROCSTEP: DAVIDZ JOBID: STC00042 USERID: ++++++++
  JOBSTART: 2009/06/12 - 13:28:35
Error: 
  ADDRESS SPACE APPEARS TO BE IN A LOOP.
Action: 
  USE YOUR SOFTWARE MONITORS TO INVESTIGATE THE ASID.
----------------------------------------------------------------------
Runtime Diagnostics detected a problem in job: JOBS5
  EVENT 03: HIGH - HIGHCPU - SYSTEM: SY1 2009/06/12 - 13:28:46
  ASID CPU RATE: 96% ASID: 0027 JOBNAME: JOBS5
  STEPNAME: DAVIDZ PROCSTEP: DAVIDZ JOBID: STC00042 USERID: ++++++++
  JOBSTART: 2009/06/12 - 13:28:35
Error: 
  ADDRESS SPACE USING EXCESSIVE CPU TIME. IT MAY BE LOOPING.
Action: 
  USE YOUR SOFTWARE MONITORS TO INVESTIGATE THE ASID.
----------------------------------------------------------------------
  EVENT 04: HIGH - LOOP - SYSTEM: SY1 2009/06/12 - 13:28:46
  ASID: 0027 JOBNAME: JOBS5 TCB: 004E6850
  STEPNAME: DAVIDZ PROCSTEP: DAVIDZ JOBID: STC00042 USERID: ++++++++
  JOBSTART: 2009/06/12 - 13:28:35
Error: 
  ADDRESS SPACE APPEARS TO BE IN A LOOP.
Action: 
  USE YOUR SOFTWARE MONITORS TO INVESTIGATE THE ASID.
----------------------------------------------------------------------

Total system exception report for enqueue request rate higher than expected: The no problem report and the total system exception report (when the rate is higher than expected) show the totals at the top and the list of the tracked jobs.

Figure 3. Total system exception report: enqueue request rate higher than expected

 Enqueue Request Rate Prediction Report

Last successful model time      :  01/27/2009 17:08:01   
Next model time                 :  01/27/2009 23:08:01   
Model interval                  :  360                   
Last successful collection time :  01/27/2009 17:41:38   
Next collection time            :  01/27/2009 17:56:38   
Collection interval             :  15                    

Enqueue request rate
 at last collection interval        :  83.52
Prediction based on 1 hour of data  :  98.27
Prediction based on 24 hours of data:  85.98
Prediction based on 7 days of data  : 100.22
Top persistent users:                                 
                                                                           
                                            Predicted Enqueue                  
                          Enqueue                Request Rate                  
  Job                 Request                                               
  Name     ASID          Rate        1 Hour       24 Hour         7 Day    
  TRACKED1 001D         58.00         23.88         22.82         15.82    
  TRACKED2 0028         11.00          0.34         11.11         12.11    
  TRACKED3 0029         11.00         12.43          2.36          8.36

Total system exception report for enqueue request rate lower than expected: PFA issues the enqueue request rate exception report when there is a shortage or unusually low rate of enqueue requests. Runtime Diagnostics examines the system and PFA lists all output it receives from Runtime Diagnostics.

Figure 4. Total system exception report: low enqueue request rate

 Enqueue Request Rate Prediction Report

Last successful model time      :  01/27/2009 11:08:01   
Next model time                 :  01/27/2009 23:08:01   
Model interval                  :  720                   
Last successful collection time :  01/27/2009 17:41:38   
Next collection time            :  01/27/2009 17:56:38   
Collection interval             :  15                    
                                                        
Persistent address spaces with low rates:                                 
                                                                           
                                            Predicted ENQ                  
                          ENQ                Request Rate                  
  Job                 Request                                              
  Name     ASID          Rate        1 Hour       24 Hour         7 Day    
  JOBS4    001F          1.17         23.88         22.82         15.82
  JOBS5    002D          2.01          8.34         11.11         12.11

Runtime Diagnostics Output:
---------------------------------------------------------------------- 
EVENT 01: HIGH - ENQ          - SYSTEM: SY1      2010/10/04 - 10:19:53 
ENQ WAITER  - ASID:002F - JOBNAME:IBMUSER2 - SYSTEM:SY1                
ENQ BLOCKER - ASID:002E - JOBNAME:IBMUSER1 - SYSTEM:SY1                
QNAME: TESTENQ                                                         
RNAME: TESTOFAVERYVERYVERYVERYLOOOOOOOOOOOOOOOOOOOOOONGRNAME1234567... 
  ERROR: ADDRESS SPACES MIGHT BE IN ENQ CONTENTION.                    
 ACTION: USE YOUR SOFTWARE MONITORS TO INVESTIGATE BLOCKING JOBS AND   
 ACTION: ASIDS.                                                        
----------------------------------------------------------------------

Note: In accordance with the IBM Health Checker for z/OS messaging guidelines, the largest generated output length for decimal variable values up to 2147483647 (X'7FFFFFFF') is 10 bytes. When any PFA report value is greater than 2147483647, it displays using multiplier notation with a maximum of six characters. For example, if the report value is 2222233333444445555, PFA displays it as 1973P (2222233333444445555 ÷ 1125899906842624) using the following multiplier notation:

Table 2. Multiplier notation used in values for PFA reports
Name	Sym	Size
Kilo	K	1,024
Mega	M	1,048,576
Giga	G	1,073,741,824
Tera	T	1,099,511,627,776
Peta	P	1,125,899,906,842,624
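
The multiplier notation can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the PFA implementation: each multiplier is taken to be a power of 1,024, the largest multiplier that fits is chosen, and the quotient is truncated (which matches the 1973P example). The function name is hypothetical.

```python
# Symbols from Table 2, largest first; sizes are powers of 1,024.
MULTIPLIERS = [
    ("P", 1024 ** 5),
    ("T", 1024 ** 4),
    ("G", 1024 ** 3),
    ("M", 1024 ** 2),
    ("K", 1024),
]
MAX_PLAIN = 2147483647  # X'7FFFFFFF', largest value shown without a multiplier

def format_report_value(value):
    """Format a non-negative integer in the style described by the note."""
    if value <= MAX_PLAIN:
        return str(value)
    for symbol, size in MULTIPLIERS:
        if value >= size:
            # Assumption: the quotient is truncated, not rounded.
            return f"{value // size}{symbol}"
    return str(value)  # not reachable for values above MAX_PLAIN
```

In this sketch, format_report_value(2222233333444445555) returns 1973P, and values up to 2147483647 are returned unchanged.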

The following fields apply to all reports:

Last successful model time: The date and time of the last successful model for this check. The predictions on this report were generated at that time.
Next model time: The date and time of the next model. The next model will recalculate the predictions.
Model interval: The value in the configured MODELINT parameter for this check. If PFA determines new prediction calculations are necessary, modeling can occur earlier.
Last successful collection time: The date and time of the last successful data collection for this check.
Next collection time: The date and time of the next collection.
Collection interval: The value in the configured COLLECTINT parameter for this check.
Enqueue request rate in last collection interval: The actual enqueue request rate in the last collection interval where the rate is defined to be the count returned by the GRS ISGQUERY API normalized by the milliseconds used.
Predicted rates based on…: The enqueue request rates based on one hour, 24 hours, and seven days. If no prediction is available for a given time range, the line is not printed. For example, if the check has been running for 2 days, there is not enough data for the seven-day range, so PFA does not print the "Prediction based on 7 days of data" line. If there is not enough data for a time range, INELIGIBLE is printed for that time range and no comparisons are made.
Runtime Diagnostics Output: Runtime Diagnostics event records to assist you in diagnosing and fixing the problem. See the topic on Runtime Diagnostics symptoms in Runtime Diagnostics.
Job Name: The name of the job that has enqueue arrivals in the last collection interval.
ASID: The ASID for the job that has enqueue arrivals in the last collection interval.
Enqueue request rate: The current enqueue request rate for the system.
Predicted enqueue request rate: The predicted enqueue request rates based on one hour, 24 hours, and seven days of data. If PFA did not previously run on this system or the same jobs previously tracked are not all active, there is not enough data for two prediction time ranges until that amount of time has passed. Also, gaps in the data caused by stopping PFA or by an IPL might cause the time ranges to not have enough data available. After the check collects enough data for two time ranges, predictions are made again for those time ranges. If there is not enough data for two time ranges, INELIGIBLE is printed and comparisons are not made.
Runtime Diagnostics Output: The reports generated by Runtime Diagnostics for this check. These reports contain additional details to help you narrow down the source of the problem and sometimes corrective actions you can take. For complete details about using Runtime Diagnostics, see Runtime Diagnostics.
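
As a rough illustration of two of the fields above, the following Python sketch computes a rate as a request count divided by CPU milliseconds and tests whether predictions exist for two time ranges. Both functions are hypothetical; the handling of zero CPU time and the reading of "two time ranges" as "at least two" are assumptions, not documented PFA behavior.

```python
def enqueue_request_rate(enqueue_count, cpu_milliseconds):
    """Enqueue requests per CPU millisecond for one collection interval."""
    if cpu_milliseconds <= 0:
        # Assumption: with no CPU time consumed, report a rate of 0.0.
        return 0.0
    return enqueue_count / cpu_milliseconds

def comparison_allowed(predictions):
    """Return True when predictions exist for at least two time ranges.

    predictions maps a range label ('1 Hour', '24 Hour', '7 Day') to a
    predicted rate, or to None when the range is INELIGIBLE.
    """
    available = [p for p in predictions.values() if p is not None]
    return len(available) >= 2
```

A check that has run for two days would, in this sketch, have None for the seven-day range and still be eligible for comparison.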

Directories

Note: The content and names for these files and directories are subject to change and cannot be used as programming interfaces; these files are documented only to provide help in diagnosing problems with PFA.

pfa_directory

This directory contains all the PFA checks and is pointed to by the home directory of the started task. The following files only contain data if messages are generated by the JVM:

java.stderr (generated by JVM)
java.stdout (generated by JVM)

pfa_directory/PFA_ENQUEUE_REQUEST_RATE/data

The directory for enqueue request rate that holds data and modeling results. PFA automatically deletes the contents of the PFA_ENQUEUE_REQUEST_RATE/data directory that could lead to skewed predictions in the future.

Guideline: If the use of the z/OS image is radically different after an IPL (for instance, the change from a test system to a production system) or if you modify anything that affects enqueue details, delete the files in the PFA_ENQUEUE_REQUEST_RATE/data directory to ensure the check can collect the most accurate modeling information.

Results files

systemName.1hr.prediction - This file is generated by the modeling code for the predictions made for one hour of historical data. It contains predictions for each of the tracked address spaces and the total system category. It also contains additional information required for PFA processing.
systemName.24hr.prediction - This file is generated by the modeling code for the predictions made for 24 hours of historical data. It contains predictions for each of the tracked address spaces and the total system category. It also contains additional information required for PFA processing.
systemName.7day.prediction - This file is generated by the modeling code for the predictions made for seven days of historical data. It contains predictions for each of the tracked address spaces and the total system category. It also contains additional information required for PFA processing.
systemName.1hr.prediction.html - This file contains an .html report version of the data found in the systemName.1hr.prediction file.
systemName.24hr.prediction.html - This file contains an .html report version of the data found in the systemName.24hr.prediction file.
systemName.7day.prediction.html - This file contains an .html report version of the data found in the systemName.7day.prediction file.
systemName.prediction.stddev - The file generated by the modeling code to list the standard deviation of the predictions across the time ranges for each job.

Data store files:

systemName.OUT - The data collection file.

Intermediate files:

systemName.data - The file is used as input to the modeling to track if enough data is available to model.
systemName.1hr.data - The file used as input to modeling code. It contains one hour of historical data.
systemName.24hr.data - The file used as input to modeling code. It contains 24 hours of historical data.
systemName.7day.data - The file used as input to modeling code. It contains seven days of historical data.
systemName.1hr.holes - The file is used to track gaps in data, caused by stopping PFA or by an IPL, for a one hour period.
systemName.24hr.holes - The file is used to track gaps in the data, caused by stopping PFA or by an IPL, for a 24 hour time period.
systemName.7day.holes - The file is used to track gaps in the data, caused by stopping PFA or by an IPL, for the seven day time period.

This directory holds the following log files. Additional information is written to these log files when DEBUG(1).

systemName.1hr.cart.log - The log file generated by modeling code with details about code execution while one hour of historical data was being modeled.
systemName.24hr.cart.log - The log file generated by modeling code with details about code execution while 24 hours of historical data was being modeled.
systemName.7day.cart.log - The log file generated by modeling code with details about code execution while seven days of historical data was being modeled.
systemName.builder.log - The log file generated by intermediate code that builds the files that are input to modeling with details about code execution.
systemName.launcher.log - The log file generated by launcher code.
systemName.1hr.tree - This file is generated by the modeling code. It contains information about the model tree which was built based on the last one hour of collected data.
systemName.24hr.tree - This file is generated by the modeling code. It contains information about the model tree which was built based on the last 24 hours of collected data.
systemName.7day.tree - This file is generated by the modeling code. It contains information about the model tree which was built based on the last seven days of collected data.
systemNameCONFIG.LOG - The log file containing the configuration history for the last 30 days for this check.
systemNameCOLLECT.LOG - The log file used during data collection.
systemNameMODEL.LOG - The log file used during portions of the modeling phase.
systemNameRUN.LOG - The log file used when the check runs.

pfa_directory/PFA_ENQUEUE_REQUEST_RATE/EXC_timestamp

This directory contains all the relevant data for investigating exceptions issued by this check at the timestamp provided in the directory name. PFA keeps directories only for the last 30 exceptions. Therefore at each exception, if more than 30 exception directories exist, the oldest directory is deleted so that only 30 exceptions remain after the latest exception is added.

systemNameREPORT.LOG - The log file containing the same contents as the IBM Health Checker for z/OS report for this exception as well as other diagnostic information issued during report generation.
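
The 30-directory retention rule described for the EXC_timestamp directories can be sketched in Python as follows. This is a hypothetical helper, not PFA code; it assumes the directory names sort chronologically, because the exact timestamp format is not shown in this topic.

```python
MAX_EXCEPTION_DIRS = 30  # from the text: only the last 30 exceptions are kept

def directories_to_delete(existing, keep=MAX_EXCEPTION_DIRS):
    """Return the oldest directory names that exceed the retention limit.

    Assumption: names sort chronologically (for example, a fixed-width
    timestamp suffix); the real EXC_timestamp format may differ.
    """
    ordered = sorted(existing)
    excess = len(ordered) - keep
    return ordered[:excess] if excess > 0 else []
```

Given 32 existing directories, the sketch returns the two oldest names; given 30 or fewer, it returns an empty list.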

pfa_directory/PFA_ENQUEUE_REQUEST_RATE/config

This directory contains the configuration files for the check.

EXCLUDED_JOBS - The file containing the list of excluded jobs for this check.

Note: When using Runtime Diagnostics, it is possible to see data for jobs previously defined to the excluded jobs list in the "other persistent jobs" and "total system" categories because PFA must return any potential problem activity on the system identified by Runtime Diagnostics.