Shared EEH Programming Model
For the shared EEH programming model, the EEH kernel services present the following state machine to the drivers:
- A slot starts out in the NORMAL state.
- When an EEH event happens, the driver receives all F's from an MMIO load. Because all F's might be a legal value for a driver, the driver must call eeh_read_slot_state() to confirm the event.
- If eeh_read_slot_state() finds the slot to be frozen, it broadcasts an EEH_DD_SUSPEND message to all registered drivers, and the slot state moves to SUSPEND. Kernel messages such as this one are broadcast by invoking each registered callback routine in sequence, at INTIODONE priority.
- When the drivers receive the EEH_DD_SUSPEND message, they can do one of the following:
  - Gather some debug data from the adapter and proceed to reset the slot.
    Gathering the debug data is an optional step in the recovery process, in which a driver reads certain registers on the adapter in an attempt to understand what caused the EEH event.
    To gather the debug data, the drivers must enable PIO to the adapter, because PIO is frozen when an EEH event occurs. To enable PIO:
    - The master driver must call eeh_enable_pio().
      The master driver is picked by the EEH kernel services. It has the EEH_MASTER flag set on the callback routine and is the last driver called in the callback chain. This ensures that all other drivers in the shared EEH domain have finished the previous step of the recovery and that the master driver can now proceed to the next step (such as enabling PIO).
      When eeh_enable_pio() is called, an EEH_DD_DEBUG message is sent to the drivers indicating that PIO is enabled, and the slot state moves to DEBUG.
    - The drivers then gather the data.
      eeh_enable_pio() can be called multiple times. Each time it is called, another EEH_DD_DEBUG message is broadcast.
  - Proceed directly to reset the slot.
- When the drivers receive EEH_DD_SUSPEND or EEH_DD_DEBUG messages, they call eeh_slot_error() to create an AIX® error log entry with hardware debug data. This step is required to determine the reason for the EEH event.
- The master driver must call eeh_reset_slot() to reset the slot. Only one driver calls reset because it is not necessary to reset the slot multiple times.
- The reset line on the PCI bus is toggled with a 100 ms delay between activate and deactivate to reset the slot. The delay is hidden from the device drivers and is enforced internally by the eeh_reset_slot() kernel service. The slot internally moves through the ACTIVATE and the DEACTIVATE states.
- If any intermediate bridges are present (such as a bridge on the adapter), EEH kernel services configure the bridge at the end of a successful reset by using the eeh_configure_bridge() service. Kernel services also enforce a delay between the deactivation of the reset line and the configuration of the bridge. The device drivers do not need to call eeh_configure_bridge() directly.
- If everything goes well, the EEH_DD_RESUME message is sent to the drivers indicating that the slot recovery is complete.
- At this point, most drivers have to reinitialize their adapters before starting normal operations again. Reinitialization typically requires a partial restore of the config space (such as the BARs and Cache Line). Which config space registers must be restored depends on the device.

Note: This is the usual recovery sequence. If any of the services fail, the EEH_DD_DEAD message is broadcast asking the drivers to mark their adapters unavailable (for example, the drivers might have to perform some cleanup work and mark their internal states appropriately). The master driver must call eeh_slot_error() to create an AIX® error log and mark the adapter permanently unavailable.
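The recovery sequence above can be modeled as a small transition function. The following is a user-space sketch for illustration only, not AIX kernel code: the names `slot_state`, `slot_event`, `slot_next_state`, and `slot_run` are hypothetical, and the model assumes that the slot returns to NORMAL after EEH_DD_RESUME and that a terminal DEAD state follows EEH_DD_DEAD. Neither assumption is stated explicitly in this section.

```c
#include <assert.h>

/* Hypothetical model of the slot states named in this section. */
enum slot_state {
    SLOT_NORMAL,
    SLOT_SUSPEND,
    SLOT_DEBUG,
    SLOT_ACTIVATE,
    SLOT_DEACTIVATE,
    SLOT_DEAD            /* assumed terminal state after EEH_DD_DEAD */
};

/* Hypothetical events that drive the model. */
enum slot_event {
    EV_FROZEN_CONFIRMED, /* eeh_read_slot_state() found the slot frozen */
    EV_PIO_ENABLED,      /* master driver called eeh_enable_pio() */
    EV_RESET_ACTIVATED,  /* eeh_reset_slot() activated the reset line */
    EV_RESET_DEACTIVATED,/* reset line deactivated after the delay */
    EV_RECOVERED,        /* EEH_DD_RESUME was sent */
    EV_SERVICE_FAILED    /* a service failed, EEH_DD_DEAD was broadcast */
};

/* Return the state that follows `s` when event `e` occurs.
 * Events that do not apply to the current state leave it unchanged. */
enum slot_state slot_next_state(enum slot_state s, enum slot_event e)
{
    if (s == SLOT_DEAD)
        return SLOT_DEAD;               /* no way out in this model */
    if (e == EV_SERVICE_FAILED && s != SLOT_NORMAL)
        return SLOT_DEAD;               /* failure during recovery */

    switch (s) {
    case SLOT_NORMAL:
        return e == EV_FROZEN_CONFIRMED ? SLOT_SUSPEND : s;
    case SLOT_SUSPEND:
    case SLOT_DEBUG:
        if (e == EV_PIO_ENABLED)
            return SLOT_DEBUG;          /* may be repeated */
        if (e == EV_RESET_ACTIVATED)
            return SLOT_ACTIVATE;       /* debug step is optional */
        return s;
    case SLOT_ACTIVATE:
        return e == EV_RESET_DEACTIVATED ? SLOT_DEACTIVATE : s;
    case SLOT_DEACTIVATE:
        return e == EV_RECOVERED ? SLOT_NORMAL : s;
    default:
        return s;
    }
}

/* Apply `count` events in order, starting from state `s`. */
enum slot_state slot_run(enum slot_state s, const enum slot_event *events,
                         int count)
{
    assert(count >= 0);
    for (int i = 0; i < count; i++)
        s = slot_next_state(s, events[i]);
    return s;
}
```

Feeding the model the sequence frozen, PIO enabled, reset activated, reset deactivated, recovered brings it back to `SLOT_NORMAL`. Skipping the PIO step models the "proceed directly to reset" option.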
There are two special scenarios that a driver developer needs to
be aware of:
- If a driver receives either an EEH_DD_SUSPEND or an EEH_DD_DEAD message, it can return an EEH_BUSY return code from its callback routine instead of an EEH_SUCC return code. If EEH kernel services receive an EEH_BUSY return code, they wait for some time and then call the same driver again. This process continues until EEH kernel services receive a different return code. The retry exists because some drivers need more time to clean up before recovery can continue. Cleanup includes activities such as killing a kproc or notifying a user-level application.
- If eeh_enable_dma() or eeh_enable_pio() cannot succeed due to platform state restrictions, the service returns an EEH_FAIL return code followed by an EEH_DD_DEAD message unless you take action. To avoid receiving an EEH_FAIL return code, the driver must supply the EEH_ENABLE_NO_SUPPORT_RC flag when the eeh_init_multifunc() kernel service is called. If the EEH_ENABLE_NO_SUPPORT_RC flag is supplied, eeh_enable_pio() and eeh_enable_dma() return the EEH_NO_SUPPORT return code, which indicates to the drivers that they cannot collect debug data but can continue with the next step in recovery. For more information, see eeh_read_slot_state.
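Both special scenarios can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel's implementation: the `model_` types and functions are hypothetical stand-ins for the EEH return codes and messages named above, their numeric values are arbitrary, and the callback signature is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for EEH_SUCC, EEH_BUSY, EEH_FAIL, EEH_NO_SUPPORT.
 * The numeric values are arbitrary and do not reflect the real header. */
enum model_rc { MODEL_SUCC, MODEL_BUSY, MODEL_FAIL, MODEL_NO_SUPPORT };

/* Hypothetical stand-ins for EEH_DD_SUSPEND and EEH_DD_DEAD. */
enum model_msg { MODEL_DD_SUSPEND, MODEL_DD_DEAD };

/* Invented callback signature, for illustration only. */
typedef enum model_rc (*model_callback)(void *driver_ctx, enum model_msg msg);

/* Scenario 1: call the same driver again for as long as it answers BUSY.
 * `wait_fn` (optional) models the pause between attempts.
 * The number of calls made is stored in *calls_out when it is not NULL. */
enum model_rc model_deliver(model_callback cb, void *driver_ctx,
                            enum model_msg msg, void (*wait_fn)(void),
                            unsigned *calls_out)
{
    unsigned calls = 0;
    enum model_rc rc;

    assert(cb != NULL);
    for (;;) {
        rc = cb(driver_ctx, msg);
        calls++;
        if (rc != MODEL_BUSY)
            break;                  /* any other code ends the retry */
        if (wait_fn != NULL)
            wait_fn();
    }
    if (calls_out != NULL)
        *calls_out = calls;
    return rc;
}

/* Example driver state: number of cleanup steps still outstanding. */
struct cleanup_ctx { unsigned pending; };

/* Example callback: reports BUSY until its cleanup work is done. */
enum model_rc example_callback(void *driver_ctx, enum model_msg msg)
{
    struct cleanup_ctx *ctx = driver_ctx;

    (void)msg;                      /* same handling for both messages */
    if (ctx->pending > 0) {
        ctx->pending--;             /* one unit of cleanup per call */
        return MODEL_BUSY;
    }
    return MODEL_SUCC;
}

/* Scenario 2: outcome of an enable request when the platform state may
 * forbid it. `no_support_rc_flag` models EEH_ENABLE_NO_SUPPORT_RC having
 * been supplied at initialization. */
enum model_rc model_enable(int platform_allows, int no_support_rc_flag)
{
    if (platform_allows)
        return MODEL_SUCC;
    return no_support_rc_flag ? MODEL_NO_SUPPORT : MODEL_FAIL;
}
```

A driver that receives the no-support result in this model would skip debug data collection and continue with the reset step, whereas the fail result corresponds to the path that ends in EEH_DD_DEAD.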
Note: eeh_init() and eeh_init_multifunc() are the only exported kernel services. All other kernel services are called using function pointers in the eeh_handle kernel service.

The EEH kernel services that you can use are listed in the following table:
Kernel Service | Single Function | Shared EEH | Process Environment | Interrupt Environment |
---|---|---|---|---|
eeh_init | Y | N | Y | N |
eeh_init_multifunc | N | Y | Y | N |
eeh_clear | Y | Y | Y | N |
eeh_read_slot_state | Y | Y | Y | Y |
eeh_enable_pio | Y | Y | Y | Y |
eeh_enable_dma | Y | Y | Y | Y |
eeh_enable_slot | Y | N | Y | Y |
eeh_disable_slot | Y | N | Y | Y |
eeh_reset_slot | Y | Y | Y | Y |
eeh_slot_error | Y | Y | Y | Y |
eeh_broadcast | N | Y | Y | Y |
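
The table can also be encoded as data, for example to back a debug-time check that a service is being used in a supported model and environment. The following is a minimal user-space sketch, assuming that Y means "supported" and N means "not supported" in each column. The struct `service_row` and the function `service_allowed` are hypothetical names introduced for this example and are not part of the EEH interface.

```c
#include <assert.h>
#include <string.h>

/* One row of the table above; 1 stands for Y and 0 stands for N. */
struct service_row {
    const char *name;
    int single;      /* usable in the single-function model */
    int shared;      /* usable in the shared EEH model */
    int process;     /* callable from the process environment */
    int interrupt;   /* callable from the interrupt environment */
};

static const struct service_row service_table[] = {
    { "eeh_init",            1, 0, 1, 0 },
    { "eeh_init_multifunc",  0, 1, 1, 0 },
    { "eeh_clear",           1, 1, 1, 0 },
    { "eeh_read_slot_state", 1, 1, 1, 1 },
    { "eeh_enable_pio",      1, 1, 1, 1 },
    { "eeh_enable_dma",      1, 1, 1, 1 },
    { "eeh_enable_slot",     1, 0, 1, 1 },
    { "eeh_disable_slot",    1, 0, 1, 1 },
    { "eeh_reset_slot",      1, 1, 1, 1 },
    { "eeh_slot_error",      1, 1, 1, 1 },
    { "eeh_broadcast",       0, 1, 1, 1 },
};

/* Return 1 if `name` is supported for the given model and environment,
 * 0 if it is not, and -1 if the name is not in the table. */
int service_allowed(const char *name, int shared_model, int interrupt_env)
{
    size_t count = sizeof service_table / sizeof service_table[0];

    assert(name != NULL);
    for (size_t i = 0; i < count; i++) {
        const struct service_row *row = &service_table[i];

        if (strcmp(row->name, name) == 0) {
            int model_ok = shared_model ? row->shared : row->single;
            int env_ok = interrupt_env ? row->interrupt : row->process;
            return model_ok && env_ok;
        }
    }
    return -1;
}
```

For instance, `service_allowed("eeh_clear", 1, 1)` reports that eeh_clear is not supported in the interrupt environment, which matches the N in the last column of its row.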