PFA_COMMON_STORAGE_USAGE

Description:
The check is looking to see if there is a potential for storage to be exhausted in the upcoming predictive failure analysis (PFA) model interval. PFA analyzes the following storage locations:
  • common storage area (CSA)
  • system queue area (SQA)
  • extended common storage area (ECSA)
  • extended system queue area (ESQA)
  • CSA + SQA
  • ECSA + ESQA

The PFA_COMMON_STORAGE_USAGE check detects three classes of common storage exhaustion:

  • Spike
  • Leak
  • Creep
If PFA detects a potential for the exhaustion of common storage, PFA issues exception message AIRH101E and provides a list of suspect tasks in the report. During the analysis, this check writes the common storage usage data at intervals to a z/OS® UNIX System Services file in comma-separated value (.csv) format. The check identifies a list of users of common storage that might be contributing to the exhaustion. If deeper analysis is necessary, PFA also provides files that contain additional diagnostic information that you can examine. See Best practice.

PFA also issues the following informational messages:

  • AIRH102I
  • AIRH103I
  • AIRH132I
Reason for check:
If the system runs out of common storage, jobs and started tasks experience abends.
Best practice:
The best practice is to predict common storage problems before they occur, determine the cause of the problem, and take the appropriate action.
When IBM® Health Checker for z/OS issues exception message AIRH101E, PFA has predicted that the amount of storage allocated to the common storage area is in jeopardy of being exhausted. Use the following steps to determine the appropriate action:
  1. Examine the Common Storage Usage Prediction Report issued with the exception message. This report contains the total current usage and predictions for each of the six storage locations: CSA, SQA, ECSA, ESQA, CSA+SQA, and ECSA+ESQA. It also contains up to ten "users" each of CSA, SQA, ECSA, and ESQA whose usage has changed the most in the last model interval. The cause of the problem is most likely within this list of users. See Output: for the example report.
  2. If the cause of the problem is not obvious from the common storage usage report, you can obtain additional information in the csadata and the csaAlldata files, or from other checks that list the top users of storage such as the checks owned by IBMVSM (VSM_CSA_THRESHOLD and VSM_SQA_THRESHOLD). The files are text files in comma-separated value (.csv) format and contain the historical data on the usage for each interval. You can export the files into any spreadsheet-type program.
  3. Determine which type of common storage problem is occurring by examining the symptoms, and then correct the behavior:
    • Spike: A piece of code uses more and more of the common storage area with usage growing linearly or exponentially over time. If the problem is caused by a spike, the csaAlldata file contains one or more users that are in the last few intervals and that consume a significant and measurable amount of common storage.

      Determine if the job causing the spike can be stopped, canceled, or slowed without affecting the overall system behavior.

    • Leak: A piece of code returns some but not all of the storage, which results in more usage of the common storage area over time. If the problem is caused by a leak, look for the contributor that is on the list multiple times, but not in every interval.

      Determine if the job causing the leak can be stopped, canceled, or slowed down without affecting the overall system behavior.

    • Creep: The common storage area usage grows slowly, reflecting overall system usage, which means no individual user of CSA is responsible for the storage exhaustion. If there is no job or address space that appears to be using an excessive or unusual amount of common storage, the amount of work being done by the LPAR is probably causing common storage usage to creep.

      Determine if the amount of work being sent to this LPAR can be reduced.

    Note: Because of the random variation in common storage usage that typically occurs, and because the PFA check collects and models data at defined intervals, PFA is unable to detect all leaks, spikes, and creeps:
    • PFA is sometimes unable to detect a leak or creep that is less than 750 bytes per second.
    • PFA cannot detect rapid growth that occurs on a machine-time scale, such as within a single collection interval.
    • PFA cannot detect common storage exhaustion caused by fragmentation.
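The 750-bytes-per-second detection floor in the note above can be checked against your own exported history. The following sketch is an illustration only, not PFA's actual modeling algorithm: it fits a least-squares slope to (time, usage) samples such as those you might extract from the .csv data files, and compares the growth rate to the documented floor.

```python
def growth_rate_bytes_per_sec(samples):
    """Least-squares slope of (seconds, bytes_used) samples, in bytes/second."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    num = sum((t - mean_t) * (b - mean_b) for t, b in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

# Eight samples at 15-minute (900-second) collection intervals,
# with usage growing at roughly 1000 bytes per second.
samples = [(i * 900, 10_000_000 + i * 900 * 1000) for i in range(8)]
rate = growth_rate_bytes_per_sec(samples)
print(rate >= 750)  # growth exceeds the documented detection floor
```

A slower growth rate (for example, 500 bytes per second) would fall below the floor, which is one reason PFA cannot promise to detect every leak or creep.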
z/OS releases the check applies to:
z/OS V1R10 and later.
Type of check:
Remote
Restrictions:
Ensure your system is using the following DIAGxx parmlib member options: VSM TRACK CSA(ON) SQA(ON)
Parameters accepted:
Yes, as follows:
Table 1. PFA_COMMON_STORAGE_USAGE check parameters
collectint — Default: 15 minutes; Minimum: 1; Maximum: 360
  This parameter determines the time (in minutes) to run the data collector that determines the amount of common storage being used. The default is 15 minutes (15).
modelint — Default: 720 minutes; Minimum: 4; Maximum: 1440
  This parameter determines how often (in minutes) the system analyzes the data and constructs a new common storage usage model, or prediction. Even when you set a value larger than 360, PFA performs the first model at 360 minutes (6 hours). By default, PFA analyzes the data and constructs a new model every 720 minutes (12 hours). The model interval must be at least four times larger than the collection interval. If necessary, modeling occurs more frequently.
threshold — Default: 2 percent; Minimum: 1; Maximum: 100
  The percentage of the capacity of each area that is used to produce the capacity value for comparisons. The threshold can be used to reduce false positive comparisons. Setting the threshold too high might cause exhaustion problems to go undetected. The default is 2 percent (2).
collectinactive — Default: 1 (on); Minimum: 0 (off); Maximum: 1 (on)
  Defines whether data is collected and modeled even if the check is not eligible to run (is not ACTIVE(ENABLED)) in IBM Health Checker for z/OS.
debug — Default: 0 (off); Minimum: 0 (off); Maximum: 1 (on)
  This parameter (an integer of 0 or 1) is used at the direction of IBM service to generate additional diagnostic information for the IBM Support Center. This debug parameter is used in place of the DEBUG parameter in IBM Health Checker for z/OS. The default is off (0).
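The bounds and the four-times rule described in the parameter table can be expressed as a small validation routine. This sketch is purely illustrative; PFA itself enforces these constraints when the parameters are set.

```python
def validate_pfa_params(collectint=15, modelint=720, threshold=2):
    """Validate PFA_COMMON_STORAGE_USAGE parameters against the documented
    ranges: collectint 1-360, modelint 4-1440 and >= 4x collectint,
    threshold 1-100 percent."""
    if not 1 <= collectint <= 360:
        raise ValueError("collectint must be 1-360 minutes")
    if not 4 <= modelint <= 1440:
        raise ValueError("modelint must be 4-1440 minutes")
    if modelint < 4 * collectint:
        raise ValueError("modelint must be at least four times collectint")
    if not 1 <= threshold <= 100:
        raise ValueError("threshold must be 1-100 percent")
    return True

validate_pfa_params()  # the defaults (15, 720, 2) satisfy every constraint
```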
To determine the status of the common storage usage check, issue f pfa,display,check(pfa_common_storage_usage),detail. The following is an example of the output written to message AIR018I in the SDSF user log (ULOG):
F PFA,DISPLAY,CHECK(PFA_COMMON_STORAGE_USAGE),DETAIL
 AIR018I 16:20:21 PFA CHECK DETAIL 
 CHECK NAME: PFA_COMMON_STORAGE_USAGE
     ACTIVE                            : YES
     TOTAL COLLECTION COUNT            : 5
     SUCCESSFUL COLLECTION COUNT       : 5
     LAST COLLECTION TIME              : 09/01/2008 10:18:22  
     LAST SUCCESSFUL COLLECTION TIME   : 09/01/2008 10:18:22  
     NEXT COLLECTION TIME              : 09/01/2008 10:33:22  
     TOTAL MODEL COUNT                 : 1                    
     SUCCESSFUL MODEL COUNT            : 1                    
     LAST MODEL TIME                   : 09/01/2008 10:18:24  
     LAST SUCCESSFUL MODEL TIME        : 09/01/2008 10:18:24  
     NEXT MODEL TIME                   : 09/01/2008 22:18:24  
     CHECK SPECIFIC PARAMETERS:                               
         COLLECTINT                    : 15                   
         MODELINT                      : 720                  
         COLLECTINACTIVE               : 1=ON                 
         DEBUG                         : 0=OFF                
         THRESHOLD                     : 2                    
User override of IBM values:
The following example shows keywords you can use to override check values either on a POLICY statement in the HZSPRMxx parmlib member or on a MODIFY command. See Managing PFA checks. You can copy and modify this statement to override the check defaults:
UPDATE CHECK(IBMPFA,PFA_COMMON_STORAGE_USAGE)
            ACTIVE
            SEVERITY(MEDIUM)
            INTERVAL(00:01)
      PARMS=('COLLECTINT(15)','MODELINT(720)','THRESHOLD(2)',
             'COLLECTINACTIVE(1)','DEBUG(0)')
            DATE(20071101)
      REASON('Common storage usage is nearing the user defined threshold.')
Verbose support:
The check provides additional details in verbose mode. You can put a check into verbose mode either by using the UPDATE,filters,VERBOSE=ON parameters on the MODIFY command or on a POLICY statement in an HZSPRMxx parmlib member.
Debug support:
The DEBUG parameter in IBM Health Checker for z/OS is ignored by this check. Instead, debug is a PFA check-specific parameter; the IBM Health Checker for z/OS debug commands are not the same as the debug parameter that PFA checks use. For details, see Understanding how to modify PFA checks.
Reference:
For more information about PFA, see the topic on Overview of Predictive Failure Analysis.
Messages:
This check issues the following exception messages:
  • AIRH101E
For additional message information, see the message topics.
SECLABEL recommended for MLS users:
SYSLOW
Output:
The common storage usage output report:
Figure 1. Common storage usage prediction report
Common Storage Usage Prediction Report            
                                                                   
Last successful model time     :  07/09/2009 11:08:44              
Next model time                :  07/09/2009 23:12:44              
Model interval                 :  720                              
Last successful collection time:  07/09/2009 11:10:52              
Next collection time           :  07/09/2009 11:25:52             
Collection interval            :  15                               


                                          Capacity When  Percentage        
Storage     Current Usage  Prediction     Predicted      of Current        
Location    in Kilobytes   in Kilobytes   in Kilobytes   to Capacity     
__________  _____________  _____________  _____________  ____________    
*CSA                 2796           3152           2956           95%  
SQA                   455            455           2460           18%    
CSA+SQA              3251           3771           5116           64%    
ECSA               114922         637703         512700           22%    
ESQA                 8414           9319          13184           64%    
ECSA+ESQA          123336         646007         525884           23%   

 
 Address spaces with the highest increased usage:
                                                                         
 Job            Storage      Current Usage       Predicted Usage       
 Name           Location     in Kilobytes        in Kilobytes          
 __________     ________     _______________     _______________       
 JOB3           *CSA                    1235                1523      
 JOB1           *CSA                     752                 935       
 JOB5           *CSA                     354                 420       
 JOB8           *CSA                     152                 267       
 JOB2           *CSA                      75                  80       
 JOB6           *CSA                      66                  78       
 JOB15          *CSA                      53                  55       
 JOB18          *CSA                      42                  63       
 JOB7           *CSA                      36                  35       
 JOB9           *CSA                      31                  34       

* = Storage locations that caused the exception.
  
Note: In accordance with the IBM Health Checker for z/OS messaging guidelines, the largest generated output length for decimal variable values up to 2147483647 (X'7FFFFFFF') is 10 bytes. When any PFA report value is greater than 2147483647, it displays using multiplier notation with a maximum of six characters. For example, if the report value is 2222233333444445555, PFA displays it as 1973P (2222233333444445555 ÷ 1125899906842624) using the following multiplier notation:
Table 2. Multiplier notation used in values for PFA reports
Name Sym Size
Kilo K 1,024
Mega M 1,048,576
Giga G 1,073,741,824
Tera T 1,099,511,627,776
Peta P 1,125,899,906,842,624
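The multiplier notation can be reproduced with a short routine. This is an approximation of the documented behavior, not PFA's actual code: values that fit in 10 decimal digits print as-is, and larger values are truncated to the smallest unit from Table 2 whose result fits in six characters, which is consistent with the 1973P example above.

```python
# (symbol, size) pairs from Table 2, largest unit first
MULTIPLIERS = [("P", 1 << 50), ("T", 1 << 40), ("G", 1 << 30),
               ("M", 1 << 20), ("K", 1 << 10)]

def pfa_format(value):
    """Format a PFA report value, switching to multiplier notation when it
    exceeds 2147483647 (X'7FFFFFFF'); at most six output characters."""
    if value <= 2147483647:
        return str(value)
    for sym, size in reversed(MULTIPLIERS):   # try the smallest unit first
        scaled = value // size                # truncating division
        if scaled <= 99999:                   # five digits plus the symbol
            return f"{scaled}{sym}"
    return f"{value >> 50}P"                  # very large values stay in Peta

print(pfa_format(2222233333444445555))  # 1973P, matching the example above
```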
  • Last successful model time: The date and time of the last successful model for this check. The predictions on this report were generated at that time.
  • Next model time: The date and time of the next model. The next model will recalculate the predictions.
  • Model interval: The value in the configured MODELINT parameter for this check. If PFA determines new prediction calculations are necessary, modeling can occur earlier.
  • Last successful collection time: The date and time of the last successful data collection for this check.
  • Next collection time: The date and time of the next collection.
  • Collection interval: The value in the configured COLLECTINT parameter for this check.
  • Storage Location: The storage location for the values in the row of the report. The location can be one of the following:
    • CSA
    • SQA
    • ECSA
    • ESQA
    • CSA + SQA
    • ECSA + ESQA
    An asterisk (*) printed prior to the storage location indicates that location is the storage location that caused the exception.

    When storage is expanded from SQA or ESQA to CSA or ECSA, an additional message prints on the report, exceptions for the original location are suppressed, and the storage is included in the CSA and ECSA current usage and predictions appropriately.

  • Current Usage in Kilobytes: The amount of storage used in kilobytes in this storage location when the check was run. The predicted usage for *SYSTEM* jobs is calculated, but no attempt is made to calculate the current usage for *SYSTEM* jobs. Therefore, UNAVAILABLE is printed for the current usage of *SYSTEM* jobs.
  • Predicted Usage in Kilobytes: The prediction of the usage in this storage location for the end of the model interval.
  • Capacity When Predicted in Kilobytes: The total defined capacity for this storage location (both used and unused) at the time the prediction was made.
  • Percentage of Current to Capacity: The percent of storage used in kilobytes in this storage location as compared to the capacity available.
  • Address spaces with the highest increased usage: The address spaces whose storage usage for each individual storage location recently increased the most. The report is sorted by predicted usage within each storage location. This list is only printed if the check issues an exception or the debug parameter is on. The number of jobs printed can vary. An asterisk printed prior to the storage location indicates the storage location that caused the exception. If debug is off, the only storage locations printed are those that caused the exception.
    Note: If the SQA expands into the CSA, the CSA usage and predictions include the storage taken from the CSA as SQA and PFA no longer performs comparisons for the SQA. Similarly, if the ESQA expands into the ECSA, the ECSA usage and predictions include the storage taken from the ECSA as ESQA and PFA no longer performs comparisons for the ESQA.
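The Percentage of Current to Capacity column can be reproduced from the other two columns. A minimal sketch, assuming standard rounding to a whole percent, which reproduces the column in the Figure 1 sample report:

```python
def pct_of_capacity(current_kb, capacity_kb):
    """Percentage of Current to Capacity: current usage over capacity."""
    return round(current_kb / capacity_kb * 100)

# (location, current usage KB, capacity when predicted KB) from Figure 1
rows = [("CSA", 2796, 2956), ("SQA", 455, 2460), ("ECSA+ESQA", 123336, 525884)]
for location, current, capacity in rows:
    print(f"{location}: {pct_of_capacity(current, capacity)}%")
# CSA: 95%  SQA: 18%  ECSA+ESQA: 23%
```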
Directories
When you install PFA_COMMON_STORAGE_USAGE, the shell script creates the following directories that hold the executable program, log, error, data store, intermediate, and results files.
Note: The content and names for these files are subject to change and cannot be used as programming interfaces; these files are documented only to provide help in diagnosing problems with PFA.
pfa_directory
This directory contains all the PFA checks and is pointed to by the home directory of the started task. The following files only contain data if messages are generated by the JVM:
  • java.stderr (generated by JVM)
  • java.stdout (generated by JVM)
pfa_directory/PFA_COMMON_STORAGE_USAGE/data
The directory for common storage usage that holds data and modeling results.
Results files:
  • systemName.prediction - The predictions generated by modeling for the six storage locations. This file is used as input to the code that compares the predicted usage with the amount of current usage. The following example shows the common storage usage prediction report in .csv format, which is written to the systemName.prediction file:
    A/TOTAL,22910,23484   
    B/TOTAL,763,763       
    C/CSA  ,316,316       
    E/ECSA ,14832,14836   
    Q/ESQA ,8078,8644     
    S/SQA  ,447,447 
    Storage location: The first field is the location where the storage was allocated. The possible values are:
    • A/TOTAL: total above-the-line common storage (ECSA+ESQA).
    • B/TOTAL: total below-the-line common storage (CSA+SQA).
    • C/CSA: common storage area (CSA).
    • E/ECSA: extended common storage area (ECSA).
    • Q/ESQA: extended system queue area (ESQA).
    • S/SQA: system queue area (SQA).
    The remaining two fields, using the A/TOTAL row as an example, are:
    • 22910: The current usage, in kilobytes, when the prediction was made.
    • 23484: The prediction in kilobytes.
  • systemName.prediction.html - This file contains an .html report version of the data found in the systemName.prediction file.
  • systemName.diag - The predictions for the address spaces whose common storage usage increased the most since the last model. This file is not updated unless debug is on or an exception occurred. This file is used as input to the code that writes the top predicted users on the report.
  • systemName.diag.html - The file contents for systemName.diag in .html report format as follows:
    • User of Common Storage: This is the identification of the user of common storage. It consists of the address space name, ASID, and PSW.
    • Instance Count: The number of records with this user that were factored into the prediction model.
    • Current Estimated Common Storage Used: The current amount of common storage used by this user in the last collection interval included in this model.
    • Prediction Look Forward Seconds: The number of seconds the prediction should project into the future.
    • Predicted Common Storage Usage: The predicted amount of common storage usage for this user.
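The systemName.prediction rows shown above are plain comma-separated records, so they are straightforward to read into a program. A minimal sketch, assuming the three-field layout shown in the example (as noted above, the real file's content and names are not programming interfaces):

```python
def parse_prediction(lines):
    """Parse systemName.prediction rows: location, current KB, predicted KB.
    int() tolerates the trailing spaces that pad the numeric fields."""
    result = {}
    for line in lines:
        location, current, predicted = line.split(",")
        result[location.strip()] = (int(current), int(predicted))
    return result

rows = ["A/TOTAL,22910,23484", "C/CSA  ,316,316", "S/SQA  ,447,447"]
print(parse_prediction(rows)["A/TOTAL"])  # (22910, 23484)
```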

Data store files:

  • systemName.csaAll.timestamp - The csaAll files contain usage in a collection interval for all address spaces. The usage is categorized by the six locations of common storage tracked by this check.
  • systemName.csaSumAll.timestamp - The csaSumAll files summarize the data in the csaAll files. After five days, the csaAll data is averaged and compressed to one file each day, and then time-stamped with the start of that day. The data then moves to a csaSumAll file and csaAll files are deleted.
  • systemName.csaTotals.timestamp - The csaTotals files contain the usage of common storage in a collection interval for the six storage locations tracked by this check.
  • systemName.csaSumTotals.timestamp - The csaSumTotals files summarize the data in the csaTotals files. After five days, the csaTotals data is averaged and compressed to one file each day, and then time-stamped with the start of that day. The data then moves to a csaSumTotals file and csaTotals files are deleted.
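The five-day compression described above amounts to averaging interval samples into one value per day. The following is a schematic sketch of that idea only; the actual file handling, naming, and averaging granularity used by PFA are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime

def summarize_daily(samples):
    """Average per-interval (timestamp, KB) samples into one value per day,
    mirroring the csaTotals-to-csaSumTotals compression described above."""
    by_day = defaultdict(list)
    for ts, kb in samples:
        by_day[ts.date()].append(kb)
    return {day: sum(vals) // len(vals) for day, vals in sorted(by_day.items())}

samples = [(datetime(2009, 7, 9, 11, 10), 3200),
           (datetime(2009, 7, 9, 11, 25), 3300),
           (datetime(2009, 7, 10, 11, 10), 3400)]
print(summarize_daily(samples))
```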

Intermediate files:

  • systemName.csadata - The input to modeling in CSV format. The csadata file has one entry per location from the csaTotals files for each collection interval.
  • systemName.mapmvs - Convert PSW execution address to module name.
  • systemNameMAPREQF.OUT - Contains the location of the module.
  • systemName.csaAlldata - The input to modeling the address spaces whose usage has increased the most in the model interval. This file is in CSV format.
  • systemName.csaSumAllX.timestamp - This file is used during summarization of the csaAll files.
  • systemName.csaSumTotalsX.timestamp - This file is used during summarization of the csaTotals files.

Log files:

Additional information is written to these log files when DEBUG(1) is set. When an exception occurs, the relevant files are copied from the check's data directory into the exception directory for use when investigating the exception.

  • systemName.cart.log - The log file generated by modeling code that contains the execution details of modeling code.
  • systemNamemapcsa.log - The log file generated by the intermediate code that builds the files that are input to modeling, with details about code execution.
  • systemNameCONFIG.LOG - The log file containing the configuration history for the last 30 days for this check.
  • systemNameCOLLECT.LOG - The log file used during data collection.
  • systemNameMODEL.LOG - The log file used during portions of the modeling phase.
  • systemNameRUN.LOG - The log file used when the check runs.
  • systemName.launcher.log - The log file generated by launcher code.
  • systemName.tree - This file is generated by the modeling code. It contains information about the model tree that is built based on collected common storage usage data.
pfa_directory/PFA_COMMON_STORAGE_USAGE/EXC_timestamp
This directory contains all the relevant data for investigating exceptions issued by this check at the timestamp provided in the directory name. PFA keeps directories only for the last 30 exceptions. Therefore at each exception, if more than 30 exception directories exist, the oldest directory is deleted so that only 30 exceptions remain after the latest exception is added.
  • systemNameREPORT.LOG - The log file containing the same contents as the IBM Health Checker for z/OS report for this exception as well as other diagnostic information issued during report generation.
pfa_directory/PFA_COMMON_STORAGE_USAGE/config
This directory contains the configuration files for the check.