Make your check clean up after itself, because the system won't
do it for you: IBM Health Checker for z/OS does not
perform end-of-task cleanup for your check on a regular basis. Check
routines should track resources, such as storage obtained, ENQs, locks,
and latches, in the PQE_ChkWork field.
Release resources within the same function code processing: Whenever
possible, the check routine should release resources within the same
function code processing in which it obtained them. Releasing resources in
a different function code call is error prone, because you cannot
assume that the Cleanup function processing will run under the same
task as the Check function. If the Cleanup function does not run under
the same task as the Check function, it means that the task under which
the Check function was running has been terminated.
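The following is a minimal sketch of the "obtain and release in the same call" pattern, written in Python purely as a model. It is not Health Checker code: CheckWork, obtain, release_all, and run_check_function are hypothetical names, and CheckWork only stands in for whatever your routine keeps in its work area to track held resources.

```python
from dataclasses import dataclass, field


@dataclass
class CheckWork:
    """Hypothetical stand-in for the work area used to track held resources."""
    held: list = field(default_factory=list)


def obtain(work, name):
    """Record a resource (storage, ENQ, lock, latch) as soon as it is obtained."""
    work.held.append(name)


def release_all(work):
    """Release every tracked resource, most recently obtained first."""
    released = []
    while work.held:
        released.append(work.held.pop())
    return released


def run_check_function(work, body):
    """Run one function-code call; release what it obtained before returning,
    even if the body fails part way through."""
    try:
        return body(work)
    finally:
        release_all(work)
```

Because the release happens in the same call that obtained the resources, nothing depends on a later Cleanup call running under the same task.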
Have your check stop itself when the environment
is inappropriate: If your check routine encounters an environmental
condition that will prevent the check from returning useful results,
your check routine should stop itself and not run again until environmental
conditions change and your code requests it to run. Your check should
do the following to respond to an inappropriate environment:
- Issue an information message to describe why the check is not
running. For example, you might issue the following message to let
check users know that the environment is not appropriate for the check,
and when the check will run again:
The server is down.
When the server is available, the check will run again.
- Issue the HZSFMSG service to stop itself:
HZSFMSG REQUEST=STOP,REASON=ENVNA
- Make sure that your product or check includes code that can detect
a change in the environment and start running the check again when
appropriate. To start running the check, issue the following HZSCHECK
service:
HZSCHECK REQUEST=RUN,CHECKOWNER=checkowner,CHECKNAME=checkname
If the environment is still not appropriate when your code runs the check,
the check can always stop itself again.
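The stop-and-rerun flow in the steps above can be modeled as a small state machine. This Python sketch is an illustration only: CheckState, run_check, and on_environment_change are hypothetical names, the server_up flag is an assumed environment condition, and the comments indicate which documented request each step stands for.

```python
class CheckState:
    """Hypothetical model of a check's stopped/running state."""

    def __init__(self):
        self.stopped_reason = None
        self.messages = []


def run_check(state, server_up):
    """One check iteration: stop with ENVNA if the environment is unsuitable."""
    if not server_up:
        # Tell check users why the check is not running and when it will resume.
        state.messages.append(
            "The server is down. "
            "When the server is available, the check will run again."
        )
        # Models HZSFMSG REQUEST=STOP,REASON=ENVNA.
        state.stopped_reason = "ENVNA"
        return "stopped"
    state.stopped_reason = None
    return "ran"


def on_environment_change(state, server_up):
    """Product code that notices a change and asks for a run.
    Models HZSCHECK REQUEST=RUN; the check may simply stop itself again."""
    if state.stopped_reason == "ENVNA":
        return run_check(state, server_up)
    return "no-op"
```

Note that the code requesting the run does not need to be certain that the environment is now suitable, because the check re-evaluates the environment each time it runs.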
Your check should not add itself in an
inappropriate environment: If you use an HZSADDCHECK exit routine
to add your checks to the system, note that some checks or product
code might add checks to or delete checks from the system in response
to changes in system environmental conditions. For example, if a check
or product detects that a system environment is inappropriate for the check,
it might then add only the checks useful in the current environment
by invoking the HZSADDCHECK registration exit with an ADDNEW request
(from the HZSCHECK service, the F hzsproc command,
or the HZSPRMxx parmlib member). You should add similar code to
your HZSADDCHECK exit routine to make sure that your checks don't
run if they will not return useful results in the current environment.
This code might:
- Delete checks that do not apply in the current environment
- Run a check so that it can check the environment and disable itself
if it is inappropriate in the current environment. Consider supporting
a check PARM so that the installation can indicate that the condition
is expected and should be treated as a success, not an error.
If your check can never be valid for the current IPL, consider
not even adding it from your HZSADDCHECK exit routine when you detect
that situation. For example, if a check is relevant only when in XCF
LOCAL mode but the system is not in that mode (and cannot change to
that mode), there is no reason even to add the check.
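As a rough model of that decision, the following Python sketch filters a list of candidate checks by whether each can ever apply in the current environment. The check names, the xcf_mode key, and the checks_to_add function are all hypothetical; a real HZSADDCHECK exit routine would make the equivalent test before adding each check.

```python
def checks_to_add(candidates, environment):
    """Return the names of the checks worth adding for this IPL.

    candidates is a list of (name, applies) pairs, where applies is a
    function that takes the environment and returns True or False.
    """
    return [name for name, applies in candidates if applies(environment)]


# Hypothetical candidates: one check is relevant only in XCF LOCAL mode.
CANDIDATES = [
    ("XCF_LOCAL_ONLY_CHECK", lambda env: env["xcf_mode"] == "LOCAL"),
    ("ALWAYS_RELEVANT_CHECK", lambda env: True),
]
```

For example, calling checks_to_add(CANDIDATES, {"xcf_mode": "SYSPLEX"}) leaves out the LOCAL-only check, so it is never added.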
Have your check stop itself for bad parameters: If
your check routine is passed a bad parameter, it should stop itself
using the HZSFMSG service:
HZSFMSG REQUEST=STOP,REASON=BADPARM
This request will also issue the predefined HZS1001E error message to indicate
what the problem is. The check routine will not be called again until
it is refreshed or its parameters are changed. REQUEST=STOP prevents
the check from running again and sets the results in the PQE_Result
field of HZSPQE. The system sets the result field based on the severity
value for the check. See
Issuing messages in your local check routine with the HZSFMSG macro for examples
and complete information.
Plan recovery for abends: Your check routine should be designed
to handle abends. If either of the following occurs on three consecutive check iterations:
- HZSFMSG issues abend X'290'
- The check abends and its recovery does not retry
then the system renders the check inactive until the check is
refreshed, or parameters for the check are changed. If the check routine
has obtained a resource that needs to be released under the same function
code processing, but the check routine abends, a recovery routine
can release that resource. IBM® suggests
that you use either an ESTAEX or IEAARR recovery routine.
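To make the three-consecutive-iterations rule concrete, here is a small Python model of the counting behavior described above. It is a sketch of the rule as stated in this section, not of the system's actual implementation; AbendTracker and its methods are hypothetical names.

```python
class AbendTracker:
    """Model of the rule: three consecutive failing iterations make the
    check inactive until it is refreshed or its parameters are changed."""

    LIMIT = 3

    def __init__(self):
        self.consecutive = 0
        self.active = True

    def record(self, abended_without_retry):
        """Record one iteration's outcome and return whether the check is still active."""
        if not self.active:
            return False
        if abended_without_retry:
            self.consecutive += 1
            if self.consecutive >= self.LIMIT:
                self.active = False
        else:
            # A clean iteration breaks the streak.
            self.consecutive = 0
        return self.active

    def refresh(self):
        """Models a refresh or parameter change, which makes the check eligible again."""
        self.consecutive = 0
        self.active = True
```

In this model a single successful iteration resets the count, which is why recovery that retries successfully keeps the check active.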
In some cases you may not want your check to be stopped when an
abend occurs, because some abend-causing conditions might simply clear
with time. For example, if your check abends as a result of getting
garbled data from an unserialized resource, such as a data area in
the midst of an MVC, your check should provide its own recovery to:
- Retry the check a predetermined number of times.
- If the check still fails, end the current iteration without having
the check stop itself.
This allows the check to try running again at the next specified
interval, with every chance of success this time.
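The retry policy above can be sketched as follows. This Python model is illustrative only: GarbledData and run_with_retry are hypothetical names, and the exception stands in for an abend that the check's own recovery routine intercepts and retries.

```python
class GarbledData(Exception):
    """Stands in for an abend caused by reading an unserialized resource mid-update."""


def run_with_retry(read_once, max_attempts=3):
    """Try the read up to max_attempts times.

    Returns ("ok", data) on success. Returns ("skipped", None) when every
    attempt fails, which ends this iteration without stopping the check,
    so the check runs again at its next interval.
    """
    for _ in range(max_attempts):
        try:
            return ("ok", read_once())
        except GarbledData:
            continue  # the condition may clear; try again
    return ("skipped", None)
```

Keeping max_attempts small limits how long a single iteration can run while still giving a transient condition a chance to clear.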
Take advantage of verbose and debug modes in your check:
IBM Health Checker for z/OS has support
for the following modes:
- Debug mode, which tells the system to output extra messages designed
to help you debug your check. IBM Health Checker for z/OS outputs
some extra messages in debug mode, and some checks do also. When a
check runs in debug mode, each message line is prefaced by a message
ID, which can be helpful in pinpointing the problem. For example,
report messages are not prefaced by message IDs unless a check is
running in debug mode.
There are two ways to issue extra messages
in debug mode:
- Use conditional logic such that when in debug mode (when
field PQE_DEBUG in mapping macro HZSPQE has the value PQE_DEBUG_ON),
your check issues additional messages.
- Code debug-type messages. See Planning your debug messages.
Users can turn on debug mode using the DEBUG=ON parameter
in the MODIFY hzsproc command, in HZSPRMxx, or
by overtyping the DEBUG field in SDSF to ON.
- Verbose mode, which tells the system to output messages with additional
detail about non-exception information found by the check. (RACF checks,
for example, issue additional detail in verbose mode.) To issue extra
messages in verbose mode, use conditional logic such
that when in verbose mode (when field PQE_VERBOSE in mapping macro
HZSPQE has the value PQE_VERBOSE_YES), your check issues additional
messages.
Users can turn on verbose mode using the VERBOSE=YES parameter
in the F hzsproc command or in HZSPRMxx.
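The conditional logic for both modes might look like the following Python model. The debug and verbose flags stand in for testing the PQE_DEBUG and PQE_VERBOSE fields; build_messages, the finding layout, and the message prefixes are assumptions made for illustration and do not reflect real message formats.

```python
def build_messages(findings, debug=False, verbose=False):
    """Return the messages one check iteration would issue.

    findings is a list of dicts with an "exception" flag and a "text" string.
    """
    messages = []
    for finding in findings:
        if finding["exception"]:
            # Exceptions are always reported.
            messages.append(f"EXCEPTION: {finding['text']}")
        elif verbose:
            # Verbose mode adds detail about non-exception findings.
            messages.append(f"INFO: {finding['text']}")
    if debug:
        # Debug mode adds messages that help diagnose the check itself.
        messages.append(f"DEBUG: examined {len(findings)} item(s)")
    return messages
```

The two flags are independent in this sketch, so a user can turn on either mode without the other.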
Look for logrec error records when you test your check: When
testing your check, be sure to look for logrec error records. If the
system encounters an error while a message is being issued, it issues
abend X'290' and writes a logrec error record with a description of
the problem in the variable recording area (VRA).
Save time, save trouble - test your check
with these commands: When you have written your check, test it
with the following commands to find some of the most common mistakes
people make in writing checks:
F hzsproc,UPDATE,CHECK(check_owner,check_name),DEBUG=ON
F hzsproc,UPDATE,CHECK(check_owner,check_name),PARM=parameter,REASON=reason,DATE=date
F hzsproc,DELETE,CHECK(check_owner,check_name),FORCE=YES
F hzsproc,DISPLAY,CHECK(check_owner,check_name),DETAIL
Avoid disruptive practices in your check routine: The
IBM Health Checker for z/OS philosophy
is to keep check routines very simple. IBM recommends
that checks read but not update system data and try to avoid disruptive
behavior such as:
- Modifying system control blocks
- I/O intensive operations, such as reading a data set
- Serialization
- Waits (directly or by services you call)
- Creating new tasks
- Creating new address spaces
We're recommending against these practices because they require
more overhead, complicate your check routine, and, more seriously,
can affect the performance of other system functions. In addition,
these practices can affect the running of other checks, since only
20 local check routines can be in control concurrently.
But you'll need to decide what's appropriate on a check by check basis.
An ENQ, for example, serializing on a control block, can indeed affect
the performance of other functions that might need that control block.
However, the downside of not serializing is that a check might get
information that is not consistent. You must weigh the cost to customers
of the chance of getting inconsistent data versus the costs of using
an ENQ in terms of system performance and
IBM Health Checker for z/OS processing.
See also Debugging checks.