Hard failures

If the machine check interruption is for a hard failure, MCH analyzes the information in the model-independent logout area to isolate the error.

Before the records are written, the system inserts the same error identifier into every piece of diagnostic data that pertains to a particular error, so that all pieces can be used together for diagnosis. The error identifier appears in the software record(s), the SVC dump output associated with the error, and the console message that indicates an SVC dump was taken. See SVC summary for information on SVC dumps; see z/OS MVS System Messages, Vol 7 (IEB-IEE) for information on console messages.

The error identifier has the form:
SEQxxxxx   CPUyy   ASIDzzzz   TIMEhh.mm.ss.t
xxxxx
Sequence number.
yy
Logical central processor identifier.
zzzz
Address space identifier (ASID).
hh.mm.ss.t
Time stamp, in hours, minutes, seconds, and tenths of a second.
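
When the error identifier is used to correlate a logrec record with a dump title or console message, it can help to split the printed form into its four fields. The following Python sketch shows one way to do that. It is an illustration only: the function name `parse_error_id`, the `ErrorId` type, and the sample identifier are hypothetical, and the character classes in the pattern are an assumption based on the field widths shown above. The fields are kept as strings because the form does not state whether the values are decimal or hexadecimal.

```python
import re
from typing import NamedTuple, Optional


class ErrorId(NamedTuple):
    """The four fields of a printed error identifier, kept as text."""
    sequence: str   # xxxxx
    cpu: str        # yy
    asid: str       # zzzz
    time: str       # hh.mm.ss.t


# Field widths follow the form SEQxxxxx CPUyy ASIDzzzz TIMEhh.mm.ss.t.
# Allowing letters as well as digits in the first three fields is an assumption.
ERRORID_RE = re.compile(
    r"SEQ(?P<sequence>[0-9A-Za-z]{5})\s+"
    r"CPU(?P<cpu>[0-9A-Za-z]{2})\s+"
    r"ASID(?P<asid>[0-9A-Za-z]{4})\s+"
    r"TIME(?P<time>\d{2}\.\d{2}\.\d{2}\.\d)"
)


def parse_error_id(text: str) -> Optional[ErrorId]:
    """Return the error identifier found in text, or None if there is none."""
    match = ERRORID_RE.search(text)
    if match is None:
        return None
    return ErrorId(**match.groupdict())


# Hypothetical sample line, not taken from a real system.
sample = "SEQ00123   CPU01   ASID00A4   TIME14.05.33.7"
print(parse_error_id(sample))
```

A line that carries no identifier, such as the NO ERRORID ASSOCIATED WITH THIS RECORD message, yields `None`, so the caller can tell the two cases apart.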

With each IPL, the system begins a sequential count of errors. The sequence number is therefore unique for each software error or machine failure since the most recent IPL, and it gives the position of the error in that count. The sequence number remains constant for subsequent software records associated with the same error, although the time stamp may change.

Note: If the logrec data set record has no associated error identifier, the system prints the message NO ERRORID ASSOCIATED WITH THIS RECORD where the error identifier normally would be printed.

If the failure is going to cause the central processor to end and the system has only one central processor, the system collects environmental, model-independent, and model-dependent information to describe the failure. After formatting the information, the system writes it to the logrec data set as an MCH record and issues a message to the operator. Then, before the system enters a wait state, the system writes MCH records to the logrec data set. The LRBMSYST bit at offset 3 of the MCH record format indicates that the failure resulted in system ending.

If, in a multiprocessing system, a failure occurs in one central processor, the system invokes alternate central processor recovery (ACR) on another central processor. The system records the error as a hard failure that does not cause the processor to end.

Note: System damage is recorded as a hard error (offset 33 bit 3) and not an ending error (offset 32 bit 6). See Principles of Operation for a detailed description of the machine check interruption code shown in the MCH record format.
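
The two bit positions named in this note can be tested as shown in the following Python sketch. It is not part of the system. It assumes that `record` holds the raw bytes of one MCH record starting at offset 0, and that bits are numbered from the leftmost (most significant) bit as bit 0, which matches the bit patterns in the record format. The helper names `ibm_bit` and `system_damage_flags` are hypothetical.

```python
def ibm_bit(byte_value: int, bit: int) -> bool:
    """Test one bit of a byte, counting bit 0 as the leftmost bit."""
    return bool(byte_value & (0x80 >> bit))


def system_damage_flags(record: bytes) -> dict:
    """Report how system damage is flagged in the error indication area.

    Offset 32 is the terminal error flag byte and offset 33 is the
    hard machine error switch byte of the MCH record.
    """
    return {
        # Offset 33, bit 3: system damage recorded as a hard error.
        "hard_error": ibm_bit(record[33], 3),
        # Offset 32, bit 6: system damage recorded as an ending error.
        "ending_error": ibm_bit(record[32], 6),
    }


# Hypothetical record with only the hard-error system damage bit set.
example = bytearray(40)
example[33] = 0x10
print(system_damage_flags(bytes(example)))
```

For a record that follows the rule in the note, the hard-error flag is on and the ending-error flag is off.
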
Table 1. Format of the MCH record
Offset Dec (Hex) | Size (bytes) or alignment (bits) | Field name | Description
0 (0) 1 LRBHTYPE Class/Source:
    ...1 ..11 LRBHMCH MCH record recorded in the system environment; type=X'13'.
1 (1) 1 LRBHSYS System/Release level:
    100. ....   OS/VS2.
    bits 3-7    
    0-1F   Release level 0-31.
2 (2) 1 LRBHSW0 Record-independent switches:
    1... ....   More records follow.
    0... ....   Last record.
    .1.. ....   Time-of-day (TOD) clock instruction issued. Used in conjunction with date and time values at displacements 8 and 12.
    ..1. ....   Record truncated. (Not used for MCH record.)
    ...1 .... LRBHEAB Extended addressing hardware.
    .... 1...   TIME macro used.
    .... .xxx   Reserved.
3 (3) 3 LRBHSW1 Record-dependent switches:
    Byte 0    
    1... .... LRBMNOIO IOS (IOSRMCH) informing IGFPTSIG not to perform any I/O.
    .1.. .... LRBMNVF LRB may not be valid.
    ..1. .... LRBMSYST System ended by MCH.
    ...1 .... LRBTRACE Set to 1 by IGFPMCIH before ALTRTRC suspend and set to 0 after.
    .... 1... LRBDAT Set to 1 by IGFPMCIH before loading a DATON PSW to go to IGFPMAIN. Set to 0 when IGFPMAIN receives control.
    .... .1.. LRBMRECV Set to 1 when an error is totally recovered.
    .... ..x.   Reserved.
    .... ...1 LRBMFA Set to 1 after a malfunction alert.
    Byte 1 LRBMACT Buffer contains a record to be recorded on the logrec data set or moved to another buffer.
    Byte 2 LRBMCLB MCH logrec data set record buffer overlaid with another record. If this byte is X'FF', SVC 76 does not record this record on the logrec data set.
6 (6) 1 LRBHCNT Record count:
    bits 0-3   Sequence number of this physical record.
    bits 4-7   Total number of physical records in this logical record.
7 (7) 1   Reserved.
8 (8) 4 LRBHDATE System date of incident.
12 (C) 4 LRBHTIME System time of incident.
16 (10) 1 LRBHCPID Machine version code.
17 (11) 3 LRBHCSER Central processor serial number.
20 (14) 2 LRBHMDL Central processor machine model number.
22 (16) 2 LRBHMCEL Reserved.
        END OF STANDARD HEADER
24 (18) 4 LRBMLNH Length of record for the logrec data set.
28 (1C) 4 LRBMWSC Wait state code.
    1... .... LRBMAMOD If the remaining bits in this byte are non zero, then this bit must be zero; otherwise a program check occurs when a PSW containing this bit in its address part is loaded.
32 (20) 4 LRBMCEIA Machine check error indication area.
    Byte 0 LRBMTERM Terminal error flags:
    1... .... LRBMTIOS IOSRMCH has requested that this processor be ended.
    .x.. ....   Reserved.
    ..1. .... LRMMTTHR Hard error threshold flag.
    ...1 .... LRBMTSEC Secondary error.
    .... 1... LRBMTCKS Check stop.
    .... .1.. LRBMTWRN Power warning.
    .... ..1. LRBMTDMG System damage.
    .... ...1 LRBMTINV Incorrect logout flag; set when LRBMCIC=0 or when a store-status-at-address has failed after a malfunction alert.
    Byte 1 LRBMHARD Hard machine error switches:
    1... .... LRBMHHRD Hard error assumed.
    .1.. .... LRBMHIO IOSRMCH has examined the MCIC and determined that a hard I/O error has occurred.
    ..1. .... LRBMHVS Vector facility source.
    ...1 .... LRBMHSD System damage.
    .... 1... LRBMHINV Register or PSW incorrect.
    .... .1.. LRBMHSTO Hard storage error.
    .... ..1. LRBMHSPF Hard storage protection key error.
    .... ...1 LRBMHIPD Instruction processing damage.
    Byte 2 LRBMINTM Intermediate error switches:
    1... .... LRBMIPSD Primary clock sync facility damage.
    .1.. .... LRBMIAFD ETR attachment facility damage.
    ..1. .... LRBMISWL Switch to local sync.
    ...1 .... LRBMISYC ETR sync check condition.
    .... 1... LRBMITOD Time-of-day (TOD) clock error.
    .... .1.. LRBMICKC Clock comparator error.
    .... ..1. LRBMICTM Central processor timer error.
    .... ...1 LRBMIVTE Vector facility threshold exceeded.
    Byte 3 LRBMSOFT Soft machine error switches:
    1... .... LRBMSSFT Soft error assumed.
    .1.. .... LRBMSSPD Service processor damage.
    ..1. .... LRBMSVF Vector facility failure.
    ...1 .... LRBMDBSE Double bit storage error correction flag.
    .... 1... LRBMSTSL ETR sync check threshold exceeded.
    .... .1.. LRBMSECC ECC corrected storage error.
    .... ..1. LRBMSHIR HIR corrected processor (Central processor) error.
    .... ...1 LRBMSDG Degradation machine check.
36 (24) 1 LRBMPDAR PDAR (program damage assessment and repair) data supplied by RTM:
    xxx. ....   Reserved.
    ...1 .... LRBMINVP Storage reconfigured; page invalidated.
    .... 1... LRBMRSRC Storage reconfiguration status available at displacement 37.
    .... .1.. LRBMRSRF Storage reconfiguration not attempted.
    .... ..xx   Reserved.
37 (25) 2 LRBMRSRS Status returned to IGFPMRTH by IARXMCKS, the status and key error storage routine. The details of the bits are described by IEERSRRB.
39 (27) 1 LRBMPWL Length of checking block used by machine model.
40 (28) 8 LRBMMOSW Machine check old PSW from storage locations 48-55.
48 (30) 8 LRBMCIC Machine check interruption code (from storage locations 232-239) as stored by hardware routines at time of machine check:
    Byte 0    
    1... .... LRBMFSD System damage (SD).
    .1.. .... LRBMFPD Instruction-processing damage (PD).
    ..1. .... LRBMFSR System recovery (SR).
    ...x ....   Reserved.
    .... 1... LRBMFCD Timer-facility damage (CD).
    .... .1.. LRBMFED External damage (ED).
    .... ..1. LRBMFVF Vector facility failure (VF).
    .... ...1 LRBMFDG Degradation (DG).
         
    Byte 1    
    1... .... LRBMFWM Power warning (W).
    .1.. .... LRBMFLP Available CRW is pending (CP).
    ..1. .... LRBMFSPD Service processor damage (SP).
    ...1 .... LRBMFCK Channel subsystem damage (CK).
    .... x...   Reserved.
    .... .1.. LRBMFVS Vector facility source (VS).
    .... ..1. LRBMIBU Backed up indicator (B).
    .... ...x LRBMIDY Reserved.
    Byte 2    
    1... .... LRBMFSE Storage error uncorrected (SE).
    .1.. .... LRBMFSC Storage error corrected (SC).
    ..1. .... LRBMFKE Storage key error uncorrected (KE).
    ...1 .... LRBMDFDS Storage degradation (DS).
    .... 1... LRBMVWP PSW-MWP is valid (WP).
    .... .1.. LRBMVMS PSW masks and key are valid (MS).
    .... ..1. LRBMVPM PSW program masks and condition code are valid (PM).
    .... ...1 LRBMVIA PSW instruction address is valid (IA).
    Byte 3    
    1... .... LRBMVFA Failing storage address is valid (FA).
    .x.. ....   Reserved.
    ..1. .... LRBMVED External damage code is valid (EC).
    ...1 .... LRBMVFP Floating point register is valid (FP).
    .... 1... LRBMVGR General purpose register is valid (GR).
    .... .1.. LRBMVCR Control register is valid (CR).
    .... ..x.   Reserved.
    .... ...1 LRBMVST Storage logical is valid (ST).
    Byte 4    
    x... ....   Indirect storage error (IE).
    .1.. .... LRBMARV Access register is valid.
    ..1. .... LRBMDAE Delayed access exception.
    ...x xxx.   Reserved.
    .... ...1 LRBMSYC ETR sync check.
    Byte 5    
    xxxx .x..   Reserved.
    .... 1... LRBMVAP Ancillary report.
    .... ..1. LRBMVPT Processor timer is valid (CT).
    .... ...1 LRBMVCC Clock comparator is valid (CC).
    Bytes 6, 7   Reserved.
56 (38) 4   240-243 storage data.
60 (3C) 4 LRBMEDCD 244-247 storage data: External damage code.
    Byte 0 LRBMEDC Data from 244.
    Byte 1 LRBMEDC1 Data from 245.
    1... .... LRBMEDXN Extended (expanded) storage not operational.
    .1.. .... LRBMEDXF Extended (expanded) storage control failure.
    Byte 2 LRBMEDC2 Data from 246.
    1... .... LRBMEDPS Primary Sync damage.
    .1.. .... LRBMEDAD ETR attachment damage.
    ..1. .... LRBMEDSL Switch to local.
    ...1 .... LRBMEDSC ETR sync check.
    .... 1... LRBMEDEC Side Control Element/Side Id Change.
    Byte 3   Reserved, x'00'.
64 (40) 4 LRBMFSA 248-251 storage data: Failing storage address.
68 (44) 4   252-255 storage data.
72 (48) 8 LRBSSPSW 256-263 storage data: Store status PSW.
80 (50) 7   264-270 storage data.
87 (57) 1 LRBADRSI 271 storage data: CPU address & site code.
88 (58) 16   272-287 storage data.
104 (68) 64 LRBAREGS 288-351 storage data: Access Registers.
168 (A8) 32   352-383 storage data.
200 (C8) 64 LRBGREGS 384-447 storage data: General Purpose Registers.
264 (108) 64 LRBCREGS 448-511 storage data: Control Registers.
328 (148) 1 LRBMEVIA Event Indicator Area.
329 (149) 63   Reserved.
392 (188) 10 ERRORID Error identifier, consisting of:
  • 2-byte sequence number
  • 2-byte central processor identifier
  • 2-byte ASID
  • 4-byte time stamp
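
The fixed offsets in Table 1 lend themselves to a small decoder. The following Python sketch pulls a few fields out of a raw MCH record: the record type, two of the record-dependent switches, the physical record count, the machine check interruption code, and the error identifier. It is an illustration under stated assumptions, not a supported interface. The function name `decode_mch_fields` and the sample values are hypothetical, how the record bytes are obtained from the logrec data set is out of scope, and multi-byte fields are read as big-endian integers. Table 1 does not describe the encoding of the 4-byte time stamp, so it is returned as a raw integer.

```python
import struct

ERRORID_OFFSET = 392   # ERRORID field, 10 bytes long
MIN_LENGTH = ERRORID_OFFSET + 10


def decode_mch_fields(record: bytes) -> dict:
    """Extract selected fields of an MCH record by their Table 1 offsets."""
    if len(record) < MIN_LENGTH:
        raise ValueError("record is shorter than the MCH layout in Table 1")

    switches = record[3]   # LRBHSW1, byte 0
    count = record[6]      # LRBHCNT
    # ERRORID: 2-byte sequence number, 2-byte CPU id, 2-byte ASID,
    # 4-byte time stamp (encoding not described in Table 1; left raw).
    seq, cpu, asid, stamp = struct.unpack_from(">HHHI", record, ERRORID_OFFSET)

    return {
        "is_mch": record[0] == 0x13,             # LRBHMCH, type=X'13'
        "system_ended": bool(switches & 0x20),   # LRBMSYST  ..1. ....
        "recovered": bool(switches & 0x04),      # LRBMRECV  .... .1..
        "physical_seq": count >> 4,              # LRBHCNT bits 0-3
        "physical_total": count & 0x0F,          # LRBHCNT bits 4-7
        "mcic": record[48:56],                   # LRBMCIC, 8 bytes
        "errorid": (seq, cpu, asid, stamp),
    }


# Hypothetical record built in memory for demonstration only.
sample = bytearray(MIN_LENGTH)
sample[0] = 0x13
sample[3] = 0x20
sample[6] = 0x12
struct.pack_into(">HHHI", sample, ERRORID_OFFSET, 7, 1, 0x00A4, 0x12345678)
fields = decode_mch_fields(bytes(sample))
print(fields["system_ended"], fields["errorid"])
```

Reading by offset keeps the sketch close to the table, so a field can be added by copying its offset and bit pattern from the corresponding row.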