The well-behaved local check routine - recommendations and recovery considerations

Make your check clean up after itself, because the system won't do it for you: IBM Health Checker for z/OS does not perform end-of-task cleanup for your check on a regular basis. Check routines should track resources, such as storage obtained, ENQs, locks, and latches, in the PQE_ChkWork field.

Release resources within the same function code processing: Whenever possible, the check routine should release resources within the same function code processing in which it obtained them. Releasing resources in a different function code call is error-prone, because you cannot assume that the Cleanup function processing will run under the same task as the Check function. If the Cleanup function does not run under the same task as the Check function, it means that the task under which the Check function was running has been terminated.
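The tracking and release pattern described above can be sketched in a language-neutral way. The following Python model is only an illustration and is not z/OS code: `WorkArea` stands in for the check work area (PQE_ChkWork), and `obtain`, `release_all`, `run_function_code`, and `sample_body` are hypothetical names invented for this sketch.

```python
# Language-neutral model of resource tracking; not z/OS code.
# Every name here is hypothetical and exists only for illustration.

class WorkArea:
    """Stands in for the check work area (PQE_ChkWork in the text)."""
    def __init__(self):
        self.held = []  # resources currently held, in acquisition order


def obtain(work, name):
    """Record a resource (storage, ENQ, lock, latch) as soon as it is obtained."""
    work.held.append(name)


def release_all(work):
    """Release resources in reverse order of acquisition and return that order."""
    released = []
    while work.held:
        released.append(work.held.pop())
    return released


def run_function_code(work, body):
    """Process one function code; anything body obtains is released before returning."""
    try:
        return body(work)
    finally:
        release_all(work)


def sample_body(work):
    obtain(work, "storage")
    obtain(work, "latch")
    return "ok"


work = WorkArea()
result = run_function_code(work, sample_body)
held_after = list(work.held)  # empty: nothing is left for a later function code
```

Because the release happens in the same call that obtained the resources, nothing depends on a later Cleanup call running under the same task.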

Have your check stop itself when the environment is inappropriate: If your check routine encounters an environmental condition that will prevent the check from returning useful results, your check routine should stop itself and not run again until environmental conditions change and your code requests it to run. Your check should do the following to respond to an inappropriate environment:

Issue an information message to describe why the check is not running. For example, you might issue the following message to let check users know that the environment is not appropriate for the check, and when the check will run again:
```
The server is down. 
When the server is available, the check will run again.
```
Issue the HZSFMSG service to stop itself:
```
HZSFMSG REQUEST=STOP,REASON=ENVNA
```
Make sure that your product or check includes code that can detect a change in the environment and start running the check again when appropriate. To start running the check, issue the following HZSCHECK service:
```
HZSCHECK REQUEST=RUN,CHECKOWNER=checkowner,CHECKNAME=checkname
```
If the environment is still not appropriate when your code runs the check, it can always stop itself again.
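The stop and rerun cycle described above can be modeled as a small state machine. This Python sketch is an illustration under stated assumptions, not z/OS code: the `Check` class, its `stopped` flag, and `on_environment_change` are hypothetical, and the comments mark where a real check would issue the HZSFMSG and HZSCHECK requests shown above.

```python
# Language-neutral model of the stop / run-again cycle; not z/OS code.
# Class, method, and function names are hypothetical.

class Check:
    def __init__(self):
        self.stopped = False
        self.messages = []

    def run(self, server_up):
        """One check iteration; returns True if the check could produce results."""
        if not server_up:
            # Step 1: explain why the check is not running.
            self.messages.append(
                "The server is down. "
                "When the server is available, the check will run again."
            )
            # Step 2: models HZSFMSG REQUEST=STOP,REASON=ENVNA.
            self.stopped = True
            return False
        self.stopped = False
        return True


def on_environment_change(check, server_up):
    """Product code that detects a change and requests a run.

    Models HZSCHECK REQUEST=RUN; the check can still stop itself again.
    """
    return check.run(server_up)


check = Check()
first = check.run(server_up=False)                          # check stops itself
retry_down = on_environment_change(check, server_up=False)  # stops itself again
retry_up = on_environment_change(check, server_up=True)     # runs normally
```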

Your check should not add itself in an inappropriate environment: If you use an HZSADDCHECK exit routine to add your checks to the system, note that some checks or product code might add or delete checks in response to changes in system environmental conditions. For example, if a check or product detects that the system environment is inappropriate for a check, it might then add only the checks useful in the current environment by invoking the HZSADDCHECK registration exit with an ADDNEW request (from the HZSCHECK service, the F hzsproc command, or the HZSPRMxx parmlib member). You should add similar code to your HZSADDCHECK exit routine to make sure that your checks don't run if they will not return useful results in the current environment. This code might:

Delete checks that do not apply in the current environment
Run a check so that it can check the environment and disable itself if it is inappropriate in the current environment. Consider supporting a check PARM so the installation may indicate the condition is successful and not an error.

If your check can never be valid for the current IPL, consider not even adding it from your HZSADDCHECK exit routine when you detect that situation. For example, if a check is relevant only when in XCF LOCAL mode but the system is not in that mode (and cannot change to that mode), there is no reason even to add the check.
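The decision about which checks to add can be sketched as a simple filter over candidate checks. This Python model is illustrative only and is not an HZSADDCHECK exit routine: the check names, the `applies` predicates, and the `xcf_local` environment key are all hypothetical.

```python
# Language-neutral model of the decision an add-check exit might make; not z/OS code.
# Check names, the 'applies' predicates, and the environment keys are hypothetical.

def checks_to_add(candidates, environment):
    """Return the names of the checks that can be useful in this environment."""
    return [c["name"] for c in candidates if c["applies"](environment)]


candidates = [
    # Relevant only in XCF LOCAL mode, as in the example in the text.
    {"name": "LOCAL_MODE_CHECK", "applies": lambda env: env["xcf_local"]},
    # Relevant in every environment.
    {"name": "GENERAL_CHECK", "applies": lambda env: True},
]

added_when_not_local = checks_to_add(candidates, {"xcf_local": False})
added_when_local = checks_to_add(candidates, {"xcf_local": True})
```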

Have your check stop itself for bad parameters: If your check routine is passed a bad parameter, it should stop itself using the HZSFMSG service:

HZSFMSG REQUEST=STOP,REASON=BADPARM

This request also issues the predefined HZS1001E error message to indicate what the problem is. The check routine will not be called again until it is refreshed or its parameters are changed. REQUEST=STOP prevents the check from running again and sets the results in the PQE_Result field of HZSPQE. The system sets the result field based on the severity value for the check. See Issuing messages in your local check routine with the HZSFMSG macro for examples and complete information.

Plan recovery for abends: Your check routine should be designed to handle abends. If on three consecutive check iterations:

HZSFMSG issues abend X'290'
The check abends and its recovery does not retry

then the system renders the check inactive until the check is refreshed, or parameters for the check are changed. If the check routine has obtained a resource that needs to be released under the same function code processing, but the check routine abends, a recovery routine can release that resource. IBM® suggests that you use either an ESTAEX or IEAARR recovery routine.
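The three-consecutive-iterations rule can be pictured as a counter. This Python sketch is a model only and makes no claim about how the system implements the rule: the `CheckState` class is hypothetical, and the reset on a successful iteration is an assumption inferred from the word "consecutive".

```python
# Language-neutral model of the "three consecutive iterations" rule; not z/OS code.
# The class and its fields are hypothetical; only the limit of three is from the text.

FAILURE_LIMIT = 3


class CheckState:
    def __init__(self):
        self.consecutive_failures = 0
        self.active = True

    def record_iteration(self, failed):
        """Record one iteration; failed means an unrecovered abend occurred."""
        if not self.active:
            return
        if failed:
            self.consecutive_failures += 1
            if self.consecutive_failures >= FAILURE_LIMIT:
                self.active = False
        else:
            # Assumption: a successful iteration breaks the consecutive run.
            self.consecutive_failures = 0

    def refresh(self):
        """Models a refresh or parameter change, which makes the check eligible again."""
        self.consecutive_failures = 0
        self.active = True


state = CheckState()
for failed in (True, True, False, True, True):
    state.record_iteration(failed)
active_after_interrupted_run = state.active   # success in the middle reset the count

state.record_iteration(True)                  # third failure in a row
active_after_three = state.active

state.refresh()
active_after_refresh = state.active
```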

In some cases you may not want your check to be stopped when an abend occurs, because some abend-causing conditions might simply clear with time. For example, if your check abends as a result of getting garbled data from an unserialized resource, such as a data area in the midst of an MVC, your check should provide its own recovery to:

Retry the check a pre-determined number of times.
If the check fails again, the check should stop running, but not stop itself.

This allows the check to try running again at the next specified interval, with every chance of success this time.
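The bounded-retry recovery described above can be sketched as follows. This Python model is an illustration, not z/OS recovery code: `GarbledData` stands in for an abend caused by inconsistent data, and `MAX_RETRIES`, `run_iteration`, and the sample readers are hypothetical.

```python
# Language-neutral model of bounded retry within one check iteration; not z/OS code.
# GarbledData stands in for an abend caused by inconsistent data; all names are hypothetical.

MAX_RETRIES = 2  # pre-determined retry count chosen for this sketch


class GarbledData(Exception):
    pass


def run_iteration(read_data, max_retries=MAX_RETRIES):
    """Try once plus max_retries retries; on repeated failure, skip this iteration only."""
    for _ in range(1 + max_retries):
        try:
            return ("done", read_data())
        except GarbledData:
            continue  # the recovery path retries instead of giving up
    # Stop running for now, but do not stop the check: it stays
    # eligible to run again at the next interval.
    return ("skipped", None)


attempts = {"count": 0}


def flaky_reader():
    """Fails on the first call, then returns consistent data."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise GarbledData()
    return 42


def broken_reader():
    raise GarbledData()


flaky_result = run_iteration(flaky_reader)
broken_result = run_iteration(broken_reader)
```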

Take advantage of verbose and debug modes in your check:

IBM Health Checker for z/OS has support for the following modes:

Debug mode, which tells the system to output extra messages designed to help you debug your check. IBM Health Checker for z/OS outputs some extra messages in debug mode, and some checks do also. When a check runs in debug mode, each message line is prefaced by a message ID, which can be helpful in pinpointing the problem. For example, report messages are not prefaced by message IDs unless a check is running in debug mode.
There are two ways to issue extra messages in debug mode:
- Use conditional logic such that when in debug mode (when field PQE_DEBUG in mapping macro HZSPQE has the value PQE_DEBUG_ON), your check issues additional messages.
- Code debug type messages - see Planning your debug messages
Users can turn on debug mode using the DEBUG=ON parameter in the MODIFY hzsproc command, in HZSPRMxx, or by overtyping the DEBUG field in SDSF to ON.
Verbose mode, which tells the system to output messages with additional detail about non-exception information found by the check. (RACF checks, for example, issue additional detail in verbose mode.) To issue extra messages in verbose mode, use conditional logic such that when in verbose mode (when field PQE_VERBOSE in mapping macro HZSPQE has the value PQE_VERBOSE_YES), your check issues additional messages.
Users can turn on verbose mode using the VERBOSE=YES parameter in the F hzsproc command or in HZSPRMxx.
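The conditional logic for both modes can be sketched as follows. This Python model is illustrative and is not z/OS code: the `debug` and `verbose` arguments stand in for testing PQE_DEBUG and PQE_VERBOSE, and the `build_messages` function, the finding layout, and the message prefixes are hypothetical.

```python
# Language-neutral model of mode-dependent messages; not z/OS code.
# The function, the finding layout, and the message prefixes are hypothetical.

def build_messages(findings, debug=False, verbose=False):
    """Return the message lines that one check iteration would issue."""
    messages = []
    for finding in findings:
        if finding["exception"]:
            messages.append("EXCEPTION: " + finding["text"])
        elif verbose:
            # Verbose mode adds detail about non-exception information.
            messages.append("INFO: " + finding["text"])
    if debug:
        # Debug mode adds messages intended to help debug the check itself.
        messages.append("DEBUG: examined %d item(s)" % len(findings))
    return messages


findings = [
    {"exception": True, "text": "value A is below the recommended minimum"},
    {"exception": False, "text": "value B is within the recommended range"},
]

normal = build_messages(findings)
with_verbose = build_messages(findings, verbose=True)
with_debug = build_messages(findings, debug=True)
```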

Look for logrec error records when you test your check: When testing your check, be sure to look for logrec error records. If the system encounters an error while a message is being issued, it issues abend X'290' and writes a logrec error record, with a description of the problem in the variable recording area (VRA).

Save time, save trouble - test your check with these commands: When you have written your check, test it with the following commands to find some of the most common mistakes people make when writing checks:

F hzsproc,UPDATE,CHECK(check_owner,check_name),DEBUG=ON
F hzsproc,UPDATE,CHECK(check_owner,check_name),PARM=parameter,REASON=reason,DATE=date
F hzsproc,DELETE,CHECK(check_owner,check_name),FORCE=YES
F hzsproc,DISPLAY,CHECK(check_owner,check_name),DETAIL

Avoid disruptive practices in your check routine: The IBM Health Checker for z/OS philosophy is to keep check routines very simple. IBM recommends that checks read but not update system data and try to avoid disruptive behavior such as:

Modifying system control blocks
I/O intensive operations, such as reading a data set
Serialization
Waits (directly or by services you call)
Creating new tasks
Creating new address spaces

We're recommending against these practices because they require more overhead, complicate your check routine, and, more seriously, can affect the performance of other system functions. In addition, these practices can affect the running of other checks, since only 20 local check routines can be in control concurrently. But you'll need to decide what's appropriate on a check-by-check basis. An ENQ, for example, serializing on a control block, can indeed affect the performance of other functions that might need that control block. However, the downside of not serializing is that a check might get information that is not consistent. You must weigh the cost to customers of possibly getting inconsistent data against the cost of using an ENQ in terms of system performance and IBM Health Checker for z/OS processing.