|
If the problem is a wait, use the procedure
in Figure 1 to collect the following documentation:
- I/O trace
- Buffer contents trace output
- Session trace data (if using the NetView® program)
- Session awareness data (if using the NetView program)
- Dump of the VTAM® primary
address space including CSA
- List of:
- Waiting process anchor blocks (PABs)
- Waiting request elements (WREs) and associated event IDs (EIDs)
- Waiting request parameter headers (RPHs)
- For problems associated with an application program:
- For problems associated with the network:
- Trace output
- Dump output
- Reports from NetView,
IMR, or EREP (if available)
Note: Use the documentation you have available to isolate
or resolve the problem. If you have to re-create the problem, make
sure the traces listed above are active.
Figure 1. Overview
of the wait procedure
The following procedure describes each step shown in Figure 1.
- Determine the extent of the wait state.
Determine
how extensive the wait state is in the operation of the VTAM network. Determine whether all VTAM processing stopped or only
processing with respect to a single device, application, or something
in between. Also determine what, if any, recovery action was taken
at the time the wait was encountered by the operator or user. Some
information about the activity that immediately preceded the wait
might be available on the system log or in application program transaction
logs.
- Did a logon, logoff, or command fail to complete?
If
so, continue with this step; otherwise, go to step 3.
- If the wait state was actually the failure of a VTAM procedure to complete, use
the DISPLAY ID command to identify the status of VTAM resources at the time of the problem. Note
any status codes that are abnormal.
- Use the VTAM DISPLAY
PENDING, DISPLAY SESSIONS, or MODIFY IOPD commands to identify I/O
requests for which VTAM is
awaiting a response from a network node. Sometimes a network node
appears in a pending state awaiting the completion of activity at
a higher- or lower-level node (for example, PSUB1, PTRM2). The pending
status on the other node is needed in such a case.
- Use the VTAM DISPLAY
BFRUSE command to get information about VTAM buffer
pools. Save the output for use later in this procedure.
- A VTAM operator
might have attempted a recovery action (such as issuing a VARY INACT,FORCE
command). See Using the VARY INACT,FORCE command to determine
whether this command completed. Check the node status to determine
whether the recovery action reset the state of the node for which
the original command was issued.
- If VTAM is waiting
for an I/O response, look at the output of the VTAM buffer contents trace (assuming it is active
when the problem occurs). If the trace shows that VTAM did send a request and is expecting a response,
the problem is probably in another network node.
- You can get additional information about the status of
a command from the VTAM internal
trace (VIT). With the SSCP and PIU options, you can match requests
and responses and determine any requests that are outstanding (that
is, for which responses have not been received). The SMS option supplies
information about resource usage, and the PSS option provides information
about VTAM scheduling of the
dispatching process. (See z/OS Communications Server: SNA Diagnosis Vol
2, FFST Dumps and the VIT for a description of the internal trace entries.)
At this point you might have enough documentation to
report the problem to the Support Center. If so, go to Reporting the problem to IBM. Otherwise, go to step 5.
- Is network traffic stopped through a specific node?
If
so, continue with this step. Otherwise, go to step 4.
- Add the specific node type to your problem documentation.
For example, the node could be a 3705, 3720, 3725, 3745, 3790, or
a 3274. NetView and EREP
facilities show whether errors have been recorded for the node in
question. Session trace data (collected by the NetView program) shows whether the node is
not responding to VTAM, or
whether VTAM is discarding
the responses. Consider using NCP intensive mode recording (IMR) for
recurrent problems of this type.
- Note any messages on the system or NetView command facility log reporting ER-INOP
outages or other failures. Use the VIT trace, or use the I/O trace
with the EVERY operand, to trace the network flow up to the point
of failure. NetView and
LOGREC show the reason for the INOP.
- For NCP-related problems, use the line trace or generalized
PIU trace if the affected node is in an adjacent subarea. Use the
transmission group trace to record intermediate node flows up to the
point where the problem occurred.
- If the problem might be in NCP software or communication
controller hardware, obtain a dump of NCP storage. If the wait affects
only part of the network, use the dynamic NCP dump facility. It allows
the rest of the network to continue operating while the dump is taken.
If the failure requires reactivating the NCP, use the MODIFY DUMP
command. See Network control program (NCP) dump for more information
on NCP dumps.
If the NCP is hung or if the hung resource is attached
to an NCP, see Table 1 to determine
what NCP diagnostic document describes troubleshooting the NCP.
- If the problem is in a channel-attached device or a channel-to-channel
attachment, examine one of the following traces, if available, to
determine the sequence of events preceding the wait. (If no trace
output is available, you have to re-create the problem to get it.)
- VIT trace with the CIO option
- CCWTRACE
To determine what document describes I/O control blocks
for your operating system, see Table 1.
If enough information is available, go to Reporting the problem to IBM. Otherwise, go to step 5.
- Is it a session or application program wait?
If
the wait state appears to be related to a particular VTAM application program, continue with this
step. Otherwise, go to step 5.
- Enter the DISPLAY ID command for the application program,
using the EVERY or SCOPE=ALL operand. If there are any nodes with
status ACT/U, reenter the DISPLAY command. If you are again informed
that the status of a node is ACT/U, issue VARY INACT,FORCE for that
node. If you still have a wait state, continue with the next step.
- If only one application program is waiting while others
continue to communicate with VTAM,
that application program probably contains an error. To determine
what caused the problem, obtain a dump of the application program
and the operating system supervisor at the time of the problem.
- Make sure that the error is not an operating system error. (Use
the diagnostic books for your operating system.)
- If possible, use the dump to determine the reason the application
program is waiting. If the application program is not waiting for VTAM, use the documentation for
the application program to determine the reason for the wait. If the
problem is in TSO/VTAM, see Collecting documentation for TSO/VTAM problems.
- If VTAM still
seems to be the cause of the problem, you need output from the VIT
to obtain a record of activity on the failing session. Because large
amounts of data will wrap around in the internal trace table, you
might want to specify MODE=EXT.
See z/OS Communications Server: SNA Diagnosis Vol
2, FFST Dumps and the VIT for more information on using the internal
trace. You can also use the I/O or buffer contents traces to get information
about all sessions with that application; specify ID=application
program name.
- Using a dump of the problem, find the address of the VTAM ACDEB for the application
program.
You can find an ACDEB associated with an application
by using the VTAMMAP SES formatted dump tool. If VTAMMAP cannot be
run, then find the ACDEB chain pointer in the ATCACDA field of the
ATCVT.
- Use the ACDEB address to locate the ACDEB in the dump.
On
the FMCB RECEIVE ANY queue, ACDRAFQH points to the first FMCB.
On
the RPL RECEIVE ANY queue, ACDRARQ points to the first RPL.
Note:
- If there are FMCBs (ACDRAFQH is not equal to 0), but
no RPLs (ACDRARQ = 0), a problem has prevented the application program
from issuing RECEIVEs.
- If there are RPLs (ACDRARQ is not equal to 0), but
no FMCBs (ACDRAFQH = 0), there might be a problem involving the continue
any/continue specific (CA/CS) state of the session.
- Check for blocked PABs in the process scheduling table (PST). ACDTSKID points to the PST.
See steps 6 and 9 for
additional recommended actions.
- Get the LUCB address (field ACDLUCBA in the ACDEB).
- Get the address of a chain of FMCB extensions (field
LUCFMCBA in the LUCB). Each FMCB extension represents one LU-LU session.
- Each FMCB extension contains a pointer (field TSPFMCBA)
to the address of an associated FMCB. Find the FMCBs associated with
hung sessions.
In those FMCBs, look for:
- The CA/CS indicator (in TSPPSFL1 and TSPPSFL2)
- The data queues (in TSPACCUM, TSPEWAIT, TSPNWAIT, TSPEDATA, TSPNDATA,
TSPTSOP, and TSPTSIP)
- Session state flags (in TSPSESSR, TSPDTSR, TSPCRVSR, and TSPRQRSR)
- Determine whether there are any indications of unusual
conditions. See z/OS Communications Server: SNA Data
Areas Volume 1.
- Make a cross-reference listing of network addresses
and node names to correlate the VIT PIU and I/O trace entries with VTAM session control blocks, such
as the LUCB and FMCB.
See Table 1 to determine
what NCP document contains information on hung sessions.
If
enough information is available, go to Reporting the problem to IBM.
Otherwise, go to step 5.
- Dump and examine the system data areas.
If you have
not already done so, obtain a dump of the VTAM address space, CSA, LSQA, and SQA.
Find
and analyze the task control blocks. Use the VTAMMAP PABSCAN dump
tool to format the output. See PABSCAN for
information on using PABSCAN. See Table 1 to
determine what document contains more information on using dumps and
finding and analyzing task control blocks.
- Check for waiting PABs.
Note: You can use the VTAMMAP VTCVTPAB
formatted dump tool as an alternative to step 6.
Look at
the following PABs in the ATCVT. To determine the offset locations
for these PABs, see z/OS Communications Server: SNA Data
Areas Volume 1.
- ATCCSPAB: Configuration services PAB
- ATCVDPAB: VARY definition DYPAB
- ATCPXPAB: Buffer pool expansion DYPAB
- ATCPUPAB: Physical unit services DYPAB
- ATCPUIOP: Physical unit services I/O DYPAB
- ATCLUSRT: Logical unit services router DYPAB
- ATCNSPAB: TSC no sessions DYPAB
- ATCSSPAB: Session serialization PAB
- ATCSOPAB: Session outage notification PAB
- ATCCNSPB: CNS logon PAB
- ATCTPMPB: Message DYPAB
- ATCTRMPB: Termination subtask DYPAB
Check the contents of the PABWEQP (or the PABVERYA
for very extended PABs) and PABRPHA fields. The field PABWEQP in each
PAB contains the address of a chain of work elements that have not
yet been processed by VTAM.
The field PABVERYA is defined at the same location as PABWEQA and
contains a pointer to an array of WKE queues.
The array pointed
to by the PABVERYA field contains the following information:
- A four-word header containing some control information about the
very extended PAB.
- An array of work element queues in descending priority. For example,
queue 1 is the first queue in the array, and it has the highest priority;
queue 2 is the next queue in the array, and it has the next highest
priority, and so on. Each queue has the following structure:
- (Field PABVFRST) A pointer to the first WKE (head, or oldest)
on this level queue
- (Field PABVLAST) A pointer to the last WKE (tail, or youngest)
on this level queue
- (Field PABVSRVL) Service level
- (Field PABVSRVC) Service count
The field PABRPHA in each PAB contains the address of an RPH
that is either running or waiting. Note: In some PABs, PABRPHA might
contain the address of an RPH, even though the RPH is not running
or waiting.
Note the contents of these fields in each
of the PABs, and have this information available when you contact IBM®.
Figure 2 shows
how to find each PAB. Figure 3 shows the
relative location of fields in a normal, extended, and slightly extended
PAB. Figure 4 shows the layout for a very
extended PAB. The DYPAB begins X'10' bytes before the PAB.
Note: The
PAB pointers shown in Figure 2 are not contiguous
in the ATCVT, but are shown that way for demonstration purposes only.
Figure 3. Normal PABs, extended
PABs, and slightly extended PABs
Figure 4. Very extended PAB
- Is the wait caused by pending I/O?
Use
the Input/Output Problem Determination (IOPD) facility to detect and
report to the operator I/O operations that have been pending longer
than a user-defined time limit.
When a VTAM process is waiting for a response, the
process is represented by a waiting request element (WRE) queued to one or more LQABs within a single I/O
LQAB group.
The WRE points to an event ID (EID), which
indicates the reason for the wait.
Look for the WREs and corresponding
EIDs in a dump by using Figure 5 and Figure 6 and the following steps.
Note: You
can use the VTAMMAP VTWRE formatted dump tool to count or help analyze
WREs. See VTWRE for information
on using VTWRE.
- Find the address of the ATCVT at low-storage address X'408'.
If this low-address location is not available in a dump, use the
pointer in the MVS™ control block
CVT (CVTATCVT) to find the VTAM control
block AVT. Location X'00' in the AVT points to the ATCVT.
The
ATCVT is identified by release level at offset X'00' in the
ATCVT. For z/OS® Communications
Server,
the ATCVT is:
- VE619 (X'E5C5F6F1F9404040').
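The release-level check can be scripted when the dump is examined offline. The following is a minimal sketch, assuming the first 8 bytes of the ATCVT have already been extracted from the dump as raw bytes. The function names are hypothetical and are not part of any VTAM or VTAMMAP tooling. It decodes the EBCDIC identifier at offset X'00' using Python's cp037 code page.

```python
# Hypothetical helpers for checking the ATCVT identifier in an offline dump.
# Only the identifier value (VE619) and its offset (X'00') come from the text.

def atcvt_release_id(atcvt_bytes: bytes) -> str:
    """Decode the 8-byte EBCDIC identifier at offset X'00' of the ATCVT."""
    return atcvt_bytes[0:8].decode("cp037").rstrip()


def looks_like_atcvt(atcvt_bytes: bytes, expected: str = "VE619") -> bool:
    """Return True if the block starts with the expected release identifier."""
    return atcvt_release_id(atcvt_bytes) == expected


if __name__ == "__main__":
    # The hex string is the value quoted in the procedure.
    sample = bytes.fromhex("E5C5F6F1F9404040")
    print(atcvt_release_id(sample), looks_like_atcvt(sample))
```

If the decoded identifier does not match, the address taken from X'408' or from the AVT probably does not point to the ATCVT, and the pointer chain should be rechecked.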
- Get the address of the I/O LQAB-group hash table from
field ATCIOLQB. This hash table contains a number-of-entries field
(LQHENTNM) followed by an array of table entries numbered starting
with 0.
- Use the hash table to find the I/O LQAB groups for active
subareas.
Each entry in the hash table is 4 bytes long and contains
either 0, indicating an empty chain, or the address of the first LQAB
group in a chain of I/O LQAB groups.
Within each I/O LQAB group,
the LQGLINK field (offset X'10') contains the address of
the next LQAB group in the chain. An LQGLINK value of 0 indicates
the end of the chain.
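The hash table walk described above can be sketched as follows. This is an illustration only. `read_u32` is a hypothetical callback that returns the fullword stored at a given dump address. The sketch also assumes that LQHENTNM is a 4-byte field located at the start of the table and that the entry array follows it immediately; verify both against the data areas documentation before relying on them. The 4-byte entry size and the LQGLINK offset (X'10') come from the procedure.

```python
from typing import Callable, List, Tuple

LQGLINK_OFFSET = 0x10   # from the procedure
ENTRY_SIZE = 4          # from the procedure
LQHENTNM_SIZE = 4       # ASSUMPTION: width of the number-of-entries field


def io_lqab_groups(read_u32: Callable[[int], int],
                   table_addr: int) -> List[Tuple[int, int]]:
    """Return (hash table entry number, LQAB group address) pairs."""
    count = read_u32(table_addr)            # LQHENTNM (assumed at offset 0)
    groups = []
    for entry in range(count):              # entries are numbered from 0
        group = read_u32(table_addr + LQHENTNM_SIZE + entry * ENTRY_SIZE)
        seen = set()                        # guards against a looping chain
        while group != 0 and group not in seen:
            seen.add(group)
            groups.append((entry, group))
            group = read_u32(group + LQGLINK_OFFSET)   # next group, 0 = end
    return groups


if __name__ == "__main__":
    # Synthetic memory: two entries, the first with a chain of two groups.
    memory = {0x1000: 2, 0x1004: 0x1100, 0x1008: 0,
              0x1110: 0x1200, 0x1210: 0}
    for entry, addr in io_lqab_groups(memory.__getitem__, 0x1000):
        print(entry, hex(addr))
```

The `seen` set is a defensive choice: a damaged chain in a dump can loop back on itself, and the walk should stop rather than run indefinitely.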
- Find all the WREs chained off of a given I/O LQAB group.
- Each I/O LQAB group contains several different LQABs. Use the
global LQAB (LQGGLOBL) to analyze wait states, because its chain contains
all of the group's WREs. (Chains off of the other LQABs in the group
usually do not contain all of the group's WREs.) You can locate LQGGLOBL
at the beginning of the LQAB group (offset 0).
- The LQAB starts with the LQABFRST field, which contains either
0, indicating an empty chain, or the address of the first (oldest)
WRE for this subarea.
- Within each WRE, the WREGFWD field (offset 4) contains the address
of the next WRE in the chain. The end of the chain is indicated by
a WREGFWD value equal to the LQAB address minus 4.
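The WRE chain for one LQAB group can be followed in the same way. The following sketch uses a hypothetical `read_u32` callback that returns the fullword at a dump address. The offsets (LQGGLOBL at offset 0 of the group, LQABFRST at the start of the LQAB, WREGFWD at offset 4) and the end-of-chain value (LQAB address minus 4) come from the steps above. The function name and the loop guard are illustrative additions.

```python
from typing import Callable, List

WREGFWD_OFFSET = 4   # from the procedure


def wres_for_group(read_u32: Callable[[int], int],
                   group_addr: int) -> List[int]:
    """Return the addresses of all WREs chained off the global LQAB."""
    lqab_addr = group_addr              # LQGGLOBL is at offset 0 of the group
    end_marker = lqab_addr - 4          # WREGFWD value that ends the chain
    wre = read_u32(lqab_addr)           # LQABFRST: 0 means an empty chain
    chain = []
    seen = set()                        # guards against a looping chain
    while wre != 0 and wre != end_marker and wre not in seen:
        seen.add(wre)
        chain.append(wre)
        wre = read_u32(wre + WREGFWD_OFFSET)
    return chain


if __name__ == "__main__":
    # Synthetic memory: an LQAB group at X'2000' with two WREs.
    memory = {0x2000: 0x3000, 0x3004: 0x3100, 0x3104: 0x2000 - 4}
    print([hex(a) for a in wres_for_group(memory.__getitem__, 0x2000)])
```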
- Find the waiting event. Each WRE contains a WREIDCD field
(offset X'32') that identifies the waiting event. The address
and length of the waiting event ID are in the fields WREIDP (offset X'24')
and WREIDL (offset X'30'), respectively.
For additional
information, check the WREDTA field (offset X'2C'). In most
cases, this field contains a CPCB operation code. If so, look in Control point/control block (CPCB) operation codes to determine what function the operation
code represents.
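When many WREs must be examined, the fields named in this step can be pulled out of each WRE programmatically. The following is a minimal sketch, assuming the WRE has been copied out of the dump as raw bytes. The offsets come from the step above. The field widths are assumptions inferred from the spacing of the offsets (a fullword for WREIDP and WREDTA, a halfword for WREIDL, one byte for WREIDCD) and should be confirmed in the data areas documentation. `WaitEvent` and `parse_wre` are hypothetical names.

```python
import struct
from typing import NamedTuple


class WaitEvent(NamedTuple):
    wreidcd: int   # identifies the waiting event
    wreidp: int    # address of the waiting event ID
    wreidl: int    # length of the waiting event ID
    wredta: int    # usually a CPCB operation code for I/O waits


def parse_wre(wre: bytes) -> WaitEvent:
    """Extract the wait-related fields from one WRE (big-endian)."""
    (wreidp,) = struct.unpack_from(">I", wre, 0x24)   # ASSUMED 4 bytes
    (wredta,) = struct.unpack_from(">I", wre, 0x2C)   # ASSUMED 4 bytes
    (wreidl,) = struct.unpack_from(">H", wre, 0x30)   # ASSUMED 2 bytes
    wreidcd = wre[0x32]                               # ASSUMED 1 byte
    return WaitEvent(wreidcd, wreidp, wreidl, wredta)


if __name__ == "__main__":
    # Synthetic WRE with made-up values, for illustration only.
    sample = bytearray(0x40)
    struct.pack_into(">I", sample, 0x24, 0x00ABC000)
    struct.pack_into(">I", sample, 0x2C, 0x00000012)
    struct.pack_into(">H", sample, 0x30, 8)
    sample[0x32] = 0x05
    print(parse_wre(bytes(sample)))
```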
- Is the wait caused by a non-I/O CPWAIT?
When a VTAM process has suspended itself
using a CPWAIT and is waiting for a matching CPPOST or CPPURGE, the
process is represented by a WRE queued to one or more LQABs within
a single non-I/O LQAB group.
Analyze non-I/O CPWAITs using
the steps described for pending I/O in step 7,
with the following exceptions:
- The IOPD facility does not detect and report these non-I/O events.
- No arrays or hash tables are used. Instead, each of the six LQAB
groups is pointed to directly by its own address field in the ATCVT.
These address fields are as follows:
- ATCLUSMQ – logical unit services
- ATCMCQAB – miscellaneous command
- ATCPULQB – physical unit services
- ATCNOSQ – network operator services
- ATCSSLQB – SSCP session services 1
- ATCSSMQB – SSCP session services 2
- WREs for non-I/O events do not contain a CPCB operation code value
in the WREDTA field.
Figure 5. Finding
LQAB groups
Figure 6. Finding waiting request elements for
an LQAB group
- Find waiting RPHs.
The
following steps give instructions for examining two kinds
of wait states: (1) a process waiting for a buffer, and (2) a process
waiting for some other resource. Both kinds of waiting processes are
represented by request parameter header (RPH) control blocks,
but the RPH is found in different locations for each type of wait
state.
- Step 10 explains how to find RPHs
queued from a buffer pool control block. These RPHs show that the
buffer pool cannot supply the required buffers, and as a result, the
process is waiting. Note which buffer pool cannot supply the required
buffers.
- Step 11 explains how to find RPHs
that indicate a waiting process.
- Find RPHs queued from buffer pool control blocks.
A
buffer pool that has no available buffers can cause a wait state.
There are many reasons for running out of buffers (for example, incorrect
allocation in the VTAM start
options, a VTAM programming
problem, or an application programming problem). Use the DISPLAY BFRUSE
output obtained in step 2, if you were
able to get it, to analyze buffer pool usage. Or use the VTAMMAP VTBUF
and STORAGE formatted dump tools. See VTBUF and STORAGE.
Also, follow the chain at offset X'04' into
the RPH to obtain the addresses of other RPHs waiting for the same
pool.
- Find other waiting RPHs.
Waiting
RPHs indicate a VTAM process
that has not been completed. To locate the waiting RPHs, search the
large pageable buffer pool (LPBUF) by hand or use the VTAMMAP VTRPH
formatted dump tool. For more information, see VTRPH. Look at the formatted dump output.
Use
the VTAMMAP VTBASIC formatted dump tool to analyze the request parameter
headers (RPH) in the component recovery area (CRA). This
function formats CRAs that contain RPHs. For more information, see VTBASIC.
- Find RPHs waiting for locks.
- For each waiting RPH, look at the CRALxPTR fields.
If any pointer (PTR) fields are nonzero, check the corresponding bit
in CRALKACT. For example:
- If CRAL1PTR is nonzero, look at the last bit in CRALKACT.
- If CRAL2PTR is nonzero, look at the next-to-last bit in CRALKACT.
- If CRAL3PTR is nonzero, look at the third-from-last bit in CRALKACT.
If the corresponding bit in CRALKACT is off (0), the RPH is waiting
for this lock. If the bit is on (nonzero), the RPH is holding the
lock and might be waiting for another lock. On your list of waiting
RPHs, add the name of the lock being held or waited for. (See Table 1.)
- If you cannot find any RPHs that are waiting for or holding locks by using
step 12.a, scan the LPBUF buffer pool again,
and list all allocated buffers that contain a nonzero value in field
CRALKACT. These buffers indicate which RPHs own locks, if any, and
which locks are held. A CRA can hold several locks. For example, a
value of X'06' indicates two locks being held: the RDTLOCK
(X'04') and the VOCLOCK (X'02'). (See Table 1.)
For each allocated buffer
with a nonzero CRALKACT field, look at the CRALxPTR fields. (The buffer
might contain a resume address.) A nonzero pointer field contains
a lockword address. Find the lockword. The first word of the lockword
shows a queue of RPHs waiting for that lock. Add these RPHs to your
documentation list.
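The bit tests in step 12 can be expressed as a short sketch. It assumes that the "last bit" of CRALKACT is the low-order bit (X'01'), which is consistent with the X'06' example above, and that CRALKACT can be treated as a single byte. Only the two lock names given in the text are included; any other bit is reported by its mask value. The function names are hypothetical.

```python
from typing import Dict, List

# Only these two lock masks are named in the procedure.
KNOWN_LOCKS = {0x04: "RDTLOCK", 0x02: "VOCLOCK"}


def lock_bit(n: int) -> int:
    """Mask for CRALnPTR: CRAL1PTR -> X'01', CRAL2PTR -> X'02', and so on."""
    return 1 << (n - 1)


def classify_locks(cral_ptrs: Dict[int, int], cralkact: int) -> Dict[int, str]:
    """For each nonzero CRALnPTR, report whether the RPH holds or awaits it."""
    result = {}
    for n, ptr in cral_ptrs.items():
        if ptr == 0:
            continue                       # pointer not in use
        held = cralkact & lock_bit(n)
        result[n] = "holding" if held else "waiting"
    return result


def held_locks(cralkact: int) -> List[str]:
    """Name the locks whose bits are on in CRALKACT (ASSUMED one byte)."""
    names = []
    for shift in range(8):
        bit = 1 << shift
        if cralkact & bit:
            names.append(KNOWN_LOCKS.get(bit, "X'%02X'" % bit))
    return names


if __name__ == "__main__":
    # CRAL2PTR and CRAL3PTR nonzero, only the third-from-last bit on.
    print(classify_locks({1: 0, 2: 0x00A1B000, 3: 0x00A1C000}, 0x04))
    print(held_locks(0x06))
```

In the first call, the RPH is reported as waiting for the lock addressed by CRAL2PTR and holding the lock addressed by CRAL3PTR. The second call decodes the X'06' example from the text.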
- Report the problem.
Go to Reporting the problem to IBM.
|