z/OS Communications Server: SNA Diagnosis Vol 1, Techniques and Procedures


Wait

GC27-3667-00

If the problem is a wait, use the procedure in Figure 1 to collect the following documentation:
Note: Use the documentation you have available to isolate or resolve the problem. If you have to re-create the problem, make sure the traces listed above are active.
Figure 1. Overview of the wait procedure
Diagram that shows the overview of the wait procedure.

The following procedure describes each step shown in Figure 1.

  1. Determine the extent of the wait state.

    Determine how much of the VTAM network the wait state affects: all VTAM processing, processing for a single device or application, or something in between. Also determine what recovery action, if any, the operator or user took when the wait was encountered. Some information about the activity that immediately preceded the wait might be available in the system log or in application program transaction logs.

  2. Did a logon, logoff, or command fail to complete?
    If so, continue with this step; otherwise, go to step 3.
    • If the wait state was actually the failure of a VTAM procedure to complete, use the DISPLAY ID command to identify the status of VTAM resources at the time of the problem. Note any status codes that are abnormal.
    • Use the VTAM DISPLAY PENDING, DISPLAY SESSIONS, or MODIFY IOPD commands to identify I/O requests for which VTAM is awaiting a response from a network node. Sometimes a network node appears in a pending state because it is awaiting the completion of activity at a higher- or lower-level node (for example, PSUB1, PTRM2). In such a case, also obtain the pending status of the other node.
    • Use the VTAM DISPLAY BFRUSE command to get information about VTAM buffer pools. Save the output for use later in this procedure.
    • A VTAM operator might have attempted a recovery action (such as issuing a VARY INACT,FORCE command). The topic Using the VARY INACT,FORCE command shows how to determine whether this command completed. Check the node status to determine whether the recovery action reset the state of the node for which the original command was issued.
    • If VTAM is waiting for an I/O response, look at the output of the VTAM buffer contents trace (assuming it is active when the problem occurs). If the trace shows that VTAM did send a request and is expecting a response, the problem is probably in another network node.
    • You can get additional information about the status of a command from the VTAM internal trace (VIT). With the SSCP and PIU options, you can match requests and responses and determine any requests that are outstanding (that is, for which responses have not been received). The SMS option supplies information about resource usage, and the PSS option provides information about VTAM scheduling of the dispatching process. (See z/OS Communications Server: SNA Diagnosis Vol 2, FFST Dumps and the VIT for a description of the internal trace entries.)

    At this point you might have enough documentation to report the problem to the Support Center. If so, go to Reporting the problem to IBM. Otherwise, go to step 5.

  3. Is network traffic stopped through a specific node?
    If so, continue with this step. Otherwise, go to step 4.
    • Add the specific node type to your problem documentation. For example, the node could be a 3705, 3720, 3725, 3745, 3790, or a 3274. NetView and EREP facilities show whether errors have been recorded for the node in question. Session trace data (collected by the NetView program) shows whether the node is not responding to VTAM, or whether VTAM is discarding the responses. Consider using NCP intensive mode recording (IMR) for recurrent problems of this type.
    • Note any messages on the system or NetView command facility log reporting ER-INOP outages or other failures. Use the VIT trace, or use the I/O trace with the EVERY operand, to trace the network flow up to the point of failure. NetView and LOGREC show the reason for the INOP.
    • For NCP-related problems, use the line trace or generalized PIU trace if the affected node is in an adjacent subarea. Use the transmission group trace to record intermediate node flows up to the point where the problem occurred.
    • If the problem might be in NCP software or communication controller hardware, obtain a dump of NCP storage. If the wait affects only part of the network, use the dynamic NCP dump facility. It allows the rest of the network to continue operating while the dump is taken. If the failure requires reactivating the NCP, use the MODIFY DUMP command. See Network control program (NCP) dump for more information on NCP dumps.

      If the NCP is hung or if the hung resource is attached to an NCP, see Table 1 to determine what NCP diagnostic document describes troubleshooting the NCP.

    • If the problem is in a channel-attached device or a channel-to-channel attachment, examine one of the following traces, if available, to determine the sequence of events preceding the wait. (If no trace output is available, you have to re-create the problem to get it.)
      • VIT trace with the CIO option
      • CCWTRACE

      To determine what document describes I/O control blocks for your operating system, see Table 1.

    If enough information is available, go to Reporting the problem to IBM. Otherwise, go to step 5.

  4. Is it a session or application program wait?
    If the wait state appears to be related to a particular VTAM application program, continue with this step. Otherwise, go to step 5.
    • Enter the DISPLAY ID command for the application program, using the EVERY or SCOPE=ALL operand. If there are any nodes with status ACT/U, reenter the DISPLAY command. If you are again informed that the status of a node is ACT/U, issue VARY INACT,FORCE for that node. If you still have a wait state, continue with the next step.
    • If only one application program is waiting while others continue to communicate with VTAM, that application program probably contains an error. To determine what caused the problem, obtain a dump of the application program and the operating system supervisor at the time of the problem.
      • Make sure that the error is not an operating system error. (Use the diagnostic books for your operating system.)
      • If possible, use the dump to determine the reason the application program is waiting. If the application program is not waiting for VTAM, use the documentation for the application program to determine the reason for the wait. If the problem is in TSO/VTAM, see Collecting documentation for TSO/VTAM problems.
    • If VTAM still seems to be the cause of the problem, you need output from the VIT to obtain a record of activity on the failing session. Because large amounts of data will wrap around in the internal trace table, you might want to specify MODE=EXT.

      See z/OS Communications Server: SNA Diagnosis Vol 2, FFST Dumps and the VIT for more information on using the internal trace. You can also use the I/O or buffer contents traces to get information about all sessions with that application; specify ID=application program name.

    • Using a dump of the problem, find the address of the VTAM ACDEB for the application program.

      You can find an ACDEB associated with an application by using the VTAMMAP SES formatted dump tool. If VTAMMAP cannot be run, then find the ACDEB chain pointer in the ATCACDA field of the ATCVT.

      1. Use the ACDEB address to locate the ACDEB in the dump.

        On the FMCB RECEIVE ANY queue, ACDRAFQH points to the first FMCB.

        On the RPL RECEIVE ANY queue, ACDRARQ points to the first RPL.
        Note:
        1. If there are FMCBs (ACDRAFQH is not equal to 0), but no RPLs (ACDRARQ = 0), a problem has prevented the application program from issuing RECEIVEs.
        2. If there are RPLs (ACDRARQ is not equal to 0), but no FMCBs (ACDRAFQH = 0), there might be a problem involving the continue any/continue specific (CA/CS) state of the session.
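        The two notes above amount to a small decision rule on the queue anchors. The following Python sketch restates that rule; the function name and the returned wording are illustrative only, and the two arguments are assumed to be the ACDRAFQH and ACDRARQ values copied by hand from the dump.

```python
def classify_receive_any_queues(acdrafqh: int, acdrarq: int) -> str:
    """Interpret the FMCB and RPL RECEIVE ANY queue anchors of an ACDEB."""
    if acdrafqh != 0 and acdrarq == 0:
        # FMCBs are queued but the application has no RECEIVE outstanding.
        return "FMCBs but no RPLs: application was prevented from issuing RECEIVEs"
    if acdrafqh == 0 and acdrarq != 0:
        # RPLs are queued but no FMCB is on the queue.
        return "RPLs but no FMCBs: possible CA/CS state problem"
    return "no conclusion from these two fields alone"


# Hypothetical anchor values, for illustration only.
print(classify_receive_any_queues(0x7F3A1000, 0))
```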
      2. Check for blocked PABs in the process scheduling table (PST). ACDTSKID points to the PST.
        Look at the following PABs in the PST. To determine the offset locations for these PABs, see z/OS Communications Server: SNA Data Areas Volume 1.
        • PSTRQPAB – Request PAB
        • PSTRSPAB – Response PAB
        • PSTUEPAB – User exit PAB

        See steps 6 and 9 for additional recommended actions.

      3. Get the LUCB address (field ACDLUCBA in the ACDEB).
      4. Get the address of a chain of FMCB extensions (field LUCFMCBA in the LUCB). Each FMCB extension represents one LU-LU session.
      5. Each FMCB extension contains a pointer (field TSPFMCBA) to the address of an associated FMCB. Find the FMCBs associated with hung sessions.
        In those FMCBs, look for:
        • The CA/CS indicator (in TSPPSFL1 and TSPPSFL2)
        • The data queues (in TSPACCUM, TSPEWAIT, TSPNWAIT, TSPEDATA, TSPNDATA, TSPTSOP, and TSPTSIP)
        • Session state flags (in TSPSESSR, TSPDTSR, TSPCRVSR, and TSPRQRSR)
      6. Determine whether there are any indications of unusual conditions. See z/OS Communications Server: SNA Data Areas Volume 1.
      7. Make a cross-reference listing of network addresses and node names to correlate the VIT PIU and I/O trace entries with VTAM session control blocks, such as the LUCB and FMCB.

      See Table 1 to determine what NCP document contains information on hung sessions.

      If enough information is available, go to Reporting the problem to IBM. Otherwise, go to step 5.

  5. Dump and examine the system data areas.

    If you have not already done so, obtain a dump of the VTAM address space, CSA, LSQA, and SQA.

    Find and analyze the task control blocks. Use the VTAMMAP PABSCAN dump tool to format the output. See PABSCAN for information on using PABSCAN. See Table 1 to determine what document contains more information on using dumps and finding and analyzing task control blocks.

  6. Check for waiting PABs.
    Note: You can use the VTAMMAP VTCVTPAB formatted dump tool as an alternative to step 6.
    Look at the following PABs in the ATCVT. To determine the offset locations for these PABs, see z/OS Communications Server: SNA Data Areas Volume 1.
    • ATCCSPAB – Configuration services PAB
    • ATCVDPAB – VARY definition DYPAB
    • ATCPXPAB – Buffer pool expansion DYPAB
    • ATCPUPAB – Physical unit services DYPAB
    • ATCPUIOP – Physical unit services I/O DYPAB
    • ATCLUSRT – Logical unit services router DYPAB
    • ATCNSPAB – TSC no sessions DYPAB
    • ATCSSPAB – Session serialization PAB
    • ATCSOPAB – Session outage notification PAB
    • ATCCNSPB – CNS logon PAB
    • ATCTPMPB – Message DYPAB
    • ATCTRMPB – Termination subtask DYPAB

    Check the contents of the PABWEQA (or the PABVERYA for very extended PABs) and PABRPHA fields. The field PABWEQA in each PAB contains the address of a chain of work elements that have not yet been processed by VTAM. The field PABVERYA is defined at the same location as PABWEQA and contains a pointer to an array of WKE queues.

    The array pointed to by the PABVERYA field contains the following information:
    • A four-word header containing some control information about the very extended PAB.
    • An array of work element queues in descending priority. For example, queue 1 is the first queue in the array, and it has the highest priority; queue 2 is the next queue in the array, and it has the next highest priority, and so on. Each queue has the following structure:
      • (Field PABVFRST) A pointer to the first WKE (head, or oldest) on this level queue
      • (Field PABVLAST) A pointer to the last WKE (tail, or youngest) on this level queue
      • (Field PABVSRVL) Service level
      • (Field PABVSRVC) Service count
    The field PABRPHA in each PAB contains the address of an RPH that is either running or waiting.
    Note: In some PABs, PABRPHA might contain the address of an RPH, even though the RPH is not running or waiting.

    Note the contents of these fields in each of the PABs, and have this information available when you contact IBM®.
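    As an aid to recording this information for a very extended PAB, the following Python sketch lists the work element queues that still hold work, highest priority first. It is only an illustration: the WkeQueue type and the pending_queues function are hypothetical names, the sample values are invented, and treating a PABVFRST of 0 as an empty queue is an assumption. Take the real field widths and offsets from Figure 4 and from z/OS Communications Server: SNA Data Areas Volume 1.

```python
from typing import List, NamedTuple, Tuple


class WkeQueue(NamedTuple):
    first: int          # PABVFRST: oldest WKE on this queue
    last: int           # PABVLAST: youngest WKE on this queue
    service_level: int  # PABVSRVL
    service_count: int  # PABVSRVC


def pending_queues(queues: List[WkeQueue]) -> List[Tuple[int, WkeQueue]]:
    """Return (queue number, queue) pairs that hold work, highest priority first.

    Queue 1 is the first array entry and has the highest priority.
    Assumption: a queue whose PABVFRST is 0 is empty.
    """
    return [(n, q) for n, q in enumerate(queues, start=1) if q.first != 0]


# Hypothetical values copied by hand from a dump.
queues = [
    WkeQueue(0x00000000, 0x00000000, 1, 0),
    WkeQueue(0x7F4A1000, 0x7F4A1800, 2, 5),
    WkeQueue(0x7F4B2000, 0x7F4B2000, 3, 1),
]
for number, q in pending_queues(queues):
    print(f"queue {number}: oldest WKE at {q.first:08X}, youngest at {q.last:08X}")
```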

    Figure 2 shows how to find each PAB. Figure 3 shows the relative location of fields in a normal, extended, and slightly extended PAB. Figure 4 shows the layout for a very extended PAB. The DYPAB begins X'10' bytes before the PAB.

    Note: The PAB pointers shown in Figure 2 are not contiguous in the ATCVT, but are shown that way for demonstration purposes only.
    Figure 2. PAB locations
    Diagram that shows how to find each ATCVT PAB.
    Figure 3. Normal PABs, extended PABs, and slightly extended PABs
    Diagram that shows the relative location of fields in a normal, extended, and slightly extended PAB.
    Figure 4. Very extended PAB
    Diagram that shows the layout for a very extended PAB.
  7. Is the wait caused by pending I/O?

    Use the Input/Output Problem Determination (IOPD) facility to detect and report to the operator I/O operations that have been pending longer than a user-defined time limit.

    When a VTAM process is waiting for a response, the process is represented by a waiting request element (WRE) queued to one or more LQABs within a single I/O LQAB group.

    The WRE points to an event ID (EID), which indicates the reason for the wait.

    Look for the WREs and corresponding EIDs in a dump by using Figure 5 and Figure 6 and the following steps.

    Note: You can use the VTAMMAP VTWRE formatted dump tool to count or help analyze WREs. See VTWRE for information on using VTWRE.
    1. Find the address of the ATCVT at low-storage address X'408'.

      If this low-address location is not available in a dump, use the pointer in the MVS™ control block CVT (CVTATCVT) to find the VTAM control block AVT. Location X'00' in the AVT points to the ATCVT.

      The ATCVT is identified by release level at offset X'00' in the ATCVT. For z/OS® Communications Server, the ATCVT is:
      • VE619 (X'E5C5F6F1F9404040').
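      When you browse a raw dump, the identifier appears as EBCDIC bytes. The following Python sketch shows how the hexadecimal value above maps to the printable identifier. It assumes the bytes are encoded in EBCDIC code page 037 (Python codec cp037); the function name is illustrative.

```python
ATCVT_ID_HEX = "E5C5F6F1F9404040"  # value quoted in the text above


def decode_eyecatcher(hex_bytes: str) -> str:
    """Decode a hexadecimal string of EBCDIC (code page 037) bytes to text."""
    return bytes.fromhex(hex_bytes).decode("cp037")


print(repr(decode_eyecatcher(ATCVT_ID_HEX)))  # 'VE619   ' (padded with X'40' blanks)
```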
    2. Get the address of the I/O LQAB-group hash table from field ATCIOLQB. This hash table contains a number-of-entries field (LQHENTNM) followed by an array of table entries numbered starting with 0.
    3. Use the hash table to find the I/O LQAB groups for active subareas.

      Each entry in the hash table is 4 bytes long and contains either 0, indicating an empty chain, or the address of the first LQAB group in a chain of I/O LQAB groups.

      Within each I/O LQAB group, the LQGLINK field (offset X'10') contains the address of the next LQAB group in the chain. An LQGLINK value of 0 indicates the end of the chain.
      • To find the I/O LQAB group for a specific subarea:
        • Calculate the hash table entry number, N, by dividing the subarea number by LQHENTNM and taking the remainder.
        • Search the chain for hash table entry N to find the LQAB group whose LQGSUBA field (offset X'0C') equals the subarea number.
        Note: I/O LQAB groups are allocated only when needed. Therefore, you do not find an LQAB group for a subarea that has had no I/O traffic.
      • To find all I/O LQAB groups, search the chain for each entry in the hash table.
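      The lookup for a specific subarea can be sketched in Python as follows. This is an illustration, not a supported tool: read_word is a hypothetical callback that returns the fullword at a dump address, the address of the first table entry and the LQHENTNM value are assumed to have been read from the dump already, and LQGSUBA is assumed to be a fullword. The offsets X'0C' and X'10' and the 4-byte entry size come from the text above.

```python
from typing import Callable, Optional

ENTRY_SIZE = 4         # each hash table entry is 4 bytes long
LQGSUBA_OFFSET = 0x0C  # subarea number within an LQAB group
LQGLINK_OFFSET = 0x10  # next LQAB group in the chain, 0 at end of chain


def find_lqab_group(read_word: Callable[[int], int], first_entry_addr: int,
                    lqhentnm: int, subarea: int) -> Optional[int]:
    """Return the address of the I/O LQAB group for a subarea, or None."""
    n = subarea % lqhentnm  # hash table entry number
    group = read_word(first_entry_addr + n * ENTRY_SIZE)
    seen = set()            # guards against a damaged, looping chain
    while group != 0 and group not in seen:
        seen.add(group)
        if read_word(group + LQGSUBA_OFFSET) == subarea:
            return group
        group = read_word(group + LQGLINK_OFFSET)
    return None


# Hypothetical dump contents: entry 3 chains two groups, for subareas 11 and 3.
dump = {
    0x1000 + 3 * ENTRY_SIZE: 0x2000,
    0x2000 + LQGSUBA_OFFSET: 11, 0x2000 + LQGLINK_OFFSET: 0x3000,
    0x3000 + LQGSUBA_OFFSET: 3, 0x3000 + LQGLINK_OFFSET: 0,
}


def read_word(addr: int) -> int:
    return dump.get(addr, 0)


print(hex(find_lqab_group(read_word, 0x1000, 8, 3)))
```

      A result of None corresponds to the note above: no LQAB group exists for a subarea that has had no I/O traffic.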
    4. Find all the WREs chained off of a given I/O LQAB group.
      • Each I/O LQAB group contains several different LQABs. Use the global LQAB (LQGGLOBL) to analyze wait states, because its chain contains all of the group's WREs. (Chains off of the other LQABs in the group usually do not contain all of the group's WREs.) You can locate LQGGLOBL at the beginning of the LQAB group (offset 0).
      • The LQAB starts with the LQABFRST field, which contains either 0, indicating an empty chain, or the address of the first (oldest) WRE for this subarea.
      • Within each WRE, the WREGFWD field (offset 4) contains the address of the next WRE in the chain. The end of the chain is indicated by a WREGFWD value equal to the LQAB address minus 4.
    5. Find the waiting event. Each WRE contains a WREIDCD field (offset X'32') that identifies the waiting event. The address and length of the waiting event ID are in the fields WREIDP (offset X'24') and WREIDL (offset X'30'), respectively.

      For additional information, check the WREDTA field (offset X'2C'). In most cases, this field contains a CPCB operation code. If so, look in Control point/control block (CPCB) operation codes to determine what function the operation code represents.
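      Substeps 4 and 5 can be sketched in Python as follows. The sketch models storage as a small byte array with big-endian fields; the storage image, the put and get helpers, and all sample values are hypothetical. The offsets come from the text above, but the field widths (4 bytes for WREIDP and WREDTA, 2 bytes for WREIDL, 1 byte for WREIDCD) are assumptions that you should check against z/OS Communications Server: SNA Data Areas Volume 1.

```python
WREGFWD_OFFSET = 0x04
WREIDP_OFFSET = 0x24
WREDTA_OFFSET = 0x2C
WREIDL_OFFSET = 0x30
WREIDCD_OFFSET = 0x32

storage = bytearray(0x400)  # hypothetical storage image, addresses 0 to X'3FF'


def put(addr: int, value: int, size: int = 4) -> None:
    storage[addr:addr + size] = value.to_bytes(size, "big")


def get(addr: int, size: int = 4) -> int:
    return int.from_bytes(storage[addr:addr + size], "big")


def walk_wres(lqab_addr: int) -> list:
    """Return the WRE addresses chained off an LQAB, oldest first."""
    end_marker = lqab_addr - 4  # WREGFWD of the last WRE holds this value
    wres = []
    wre = get(lqab_addr)        # LQABFRST: 0 means the chain is empty
    while wre not in (0, end_marker) and wre not in wres:
        wres.append(wre)
        wre = get(wre + WREGFWD_OFFSET)
    return wres


def describe_wre(wre: int) -> dict:
    """Extract the fields that identify the waiting event."""
    return {
        "WREIDCD": get(wre + WREIDCD_OFFSET, 1),
        "WREIDP": get(wre + WREIDP_OFFSET),
        "WREIDL": get(wre + WREIDL_OFFSET, 2),
        "WREDTA": get(wre + WREDTA_OFFSET),
    }


# Build a two-element chain with invented values.
LQAB = 0x100
put(LQAB, 0x200)
put(0x200 + WREGFWD_OFFSET, 0x300)
put(0x300 + WREGFWD_OFFSET, LQAB - 4)
put(0x200 + WREIDP_OFFSET, 0x380)
put(0x200 + WREIDL_OFFSET, 8, size=2)
put(0x200 + WREIDCD_OFFSET, 0x01, size=1)

for wre in walk_wres(LQAB):
    print(f"WRE at {wre:04X}: {describe_wre(wre)}")
```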

  8. Is the wait caused by a non-I/O CPWAIT?

    When a VTAM process has suspended itself using a CPWAIT and is waiting for a matching CPPOST or CPPURGE, the process is represented by a WRE queued to one or more LQABs within a single non-I/O LQAB group.

    Analyze non-I/O CPWAITs using the steps described for pending I/O in step 7, with the following exceptions:
    • The IOPD facility does not detect and report these non-I/O events.
    • No arrays or hash tables are used. Instead, each of the six LQAB groups is pointed to directly by its own address field in the ATCVT. These address fields are as follows:
      • ATCLUSMQ – logical unit services
      • ATCMCQAB – miscellaneous command
      • ATCPULQB – physical unit services
      • ATCNOSQ – network operator services
      • ATCSSLQB – SSCP session services 1
      • ATCSSMQB – SSCP session services 2
    • WREs for non-I/O events do not contain a CPCB operation code value in the WREDTA field.
    Figure 5. Finding LQAB groups
    Diagram that shows how to find LQAB groups.
    Figure 6. Finding waiting request elements for an LQAB group
    Diagram that shows how to find WRE for an LQAB group.
  9. Find waiting RPHs.
    The following steps give instructions for examining two kinds of wait states: (1) a process waiting for a buffer, and (2) a process waiting for some other resource. Both kinds of waiting processes are represented by request parameter header (RPH) control blocks, but the RPH is found in different locations for each type of wait state.
    • Step 10 explains how to find RPHs queued from a buffer pool control block. These RPHs show that the buffer pool cannot supply the required buffers, and as a result, the process is waiting. Note which buffer pool cannot supply the required buffers.
    • Step 11 explains how to find RPHs that indicate a waiting process.
  10. Find RPHs queued from buffer pool control blocks.

    A buffer pool that has no available buffers can cause a wait state. There are many reasons for running out of buffers (for example, incorrect allocation in the VTAM start options, a VTAM programming problem, or an application programming problem). Use the DISPLAY BFRUSE output obtained in step 2, if you were able to get it, to analyze buffer pool usage. Alternatively, use the VTAMMAP VTBUF and STORAGE formatted dump tools. See VTBUF and STORAGE.

    Also, follow the chain at offset X'04' into the RPH to obtain the addresses of other RPHs waiting for the same pool.

  11. Find other waiting RPHs.

    Waiting RPHs indicate a VTAM process that has not been completed. To locate the waiting RPHs, search the large pageable buffer pool (LPBUF) by hand or use the VTAMMAP VTRPH formatted dump tool. For more information, see VTRPH. Look at the formatted dump output.

    Use the VTAMMAP VTBASIC formatted dump tool to analyze the request parameter headers (RPH) in the component recovery area (CRA). This function formats CRAs which contain RPHs. For more information, see VTBASIC.

  12. Find RPHs waiting for locks.
    1. For each waiting RPH, look at the CRALxPTR fields. If any pointer (PTR) fields are nonzero, check the corresponding bit in CRALKACT. For example:
      • If CRAL1PTR is nonzero, look at the last bit in CRALKACT.
      • If CRAL2PTR is nonzero, look at the next-to-last bit in CRALKACT.
      • If CRAL3PTR is nonzero, look at the third-from-last bit in CRALKACT.
      If the corresponding bit in CRALKACT is off (0), the RPH is waiting for this lock. If the bit is on (nonzero), the RPH is holding the lock and might be waiting for another lock. On your list of waiting RPHs, add the name of the lock being held or waited for. (See Table 1.)
    2. If you cannot find any locks waiting or being held using the previous substep, scan the LPBUF buffer pool again, and list all allocated buffers that contain a nonzero value in field CRALKACT. These buffers indicate which RPHs own locks, if any, and which locks are held. A CRA can hold several locks. For example, a value of X'06' indicates two locks being held: the RDTLOCK (X'04') and the VOCLOCK (X'02'). (See Table 1.)

      For each allocated buffer with a nonzero CRALKACT field, look at the CRALxPTR fields. (The buffer might contain a resume address.) A nonzero pointer field contains a lockword address. Find the lockword. The first word of the lockword shows a queue of RPHs waiting for that lock. Add these RPHs to your documentation list.
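      The bit tests in this step can be sketched in Python as follows. The sketch assumes that CRALxPTR number x corresponds to bit value 1 shifted left by x minus 1 (the text gives this pattern only for x = 1, 2, and 3) and names only the two locks quoted above; take the remaining lock names from Table 1. Function names and sample values are illustrative.

```python
from typing import Dict, List

# Only these two masks are quoted in the text; others are listed in Table 1.
KNOWN_LOCKS = {0x04: "RDTLOCK", 0x02: "VOCLOCK"}


def lock_bit(x: int) -> int:
    """Bit in CRALKACT for CRALxPTR: CRAL1PTR is the last bit (X'01')."""
    return 1 << (x - 1)


def classify_locks(cral_ptrs: Dict[int, int], cralkact: int) -> Dict[int, str]:
    """Map x to 'holding' or 'waiting' for every nonzero CRALxPTR."""
    states = {}
    for x, ptr in cral_ptrs.items():
        if ptr == 0:
            continue  # no lockword address, so nothing held or awaited
        states[x] = "holding" if cralkact & lock_bit(x) else "waiting"
    return states


def held_locks(cralkact: int) -> List[str]:
    """List the locks whose bits are on in CRALKACT, lowest bit first."""
    names = []
    bit = 1
    while bit <= cralkact:
        if cralkact & bit:
            names.append(KNOWN_LOCKS.get(bit, f"X'{bit:02X}' (see Table 1)"))
        bit <<= 1
    return names


# Hypothetical pointer values: CRAL1PTR and CRAL2PTR nonzero, CRALKACT = X'02'.
print(classify_locks({1: 0x7F001000, 2: 0x7F002000, 3: 0}, 0x02))
print(held_locks(0x06))
```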

  13. Report the problem. Go to Reporting the problem to IBM.





Copyright IBM Corporation 1990, 2014