Shared EEH Programming Model

For the shared EEH programming model, the EEH kernel services present the following state machine to the drivers:

  1. A slot starts out in the NORMAL state.
  2. When an EEH event happens, the driver receives all F's from an MMIO load. Because all F's might be a legal value for a driver, the driver must call eeh_read_slot_state() to confirm the event.
  3. If eeh_read_slot_state() finds the slot to be frozen, it broadcasts an EEH_DD_SUSPEND message to all registered drivers, and the slot state moves to SUSPEND. Kernel messages such as this one are broadcast by invoking each registered driver's callback routine in sequence, at INTIODONE priority.
  4. When the drivers receive the EEH_DD_SUSPEND message, they can do one of the following:
    1. Gather some debug data from the adapter and proceed to reset the slot.

      Gathering the debug data is an optional step in the recovery process: a driver can read certain registers on the adapter to try to determine what caused the EEH event.

      To gather the debug data, the drivers must enable PIO to the adapter. PIO is frozen when an EEH event occurs. To enable PIO:
      1. The master driver must call eeh_enable_pio(). The master driver is picked by the EEH kernel services. It has the EEH_MASTER flag set on the callback routine and is the last driver called in the callback chain. This ordering ensures that all other drivers in the shared EEH domain have finished the previous step of the recovery before the master driver proceeds to the next step (such as enabling PIO).

        When eeh_enable_pio() is called, an EEH_DD_DEBUG message is sent to the drivers indicating that PIO is enabled, and the slot state moves to DEBUG.

      2. The drivers then gather the data.

        eeh_enable_pio() can be called multiple times. Each time it is called, another EEH_DD_DEBUG message is broadcast.

      3. When the drivers receive EEH_DD_SUSPEND or EEH_DD_DEBUG messages, they call eeh_slot_error() to create an AIX® error log entry with hardware debug data. This step is required to determine the cause of the EEH event.
      4. The master driver must call eeh_reset_slot() to reset the slot. Only one driver calls reset because it is not necessary to reset the slot multiple times.
    2. Proceed directly to reset the slot.
  5. To reset the slot, the reset line on the PCI bus is toggled with a 100 ms delay between activation and deactivation. The delay is hidden from the device drivers and is enforced internally by the eeh_reset_slot() kernel service. During the reset, the slot moves through the ACTIVATE and DEACTIVATE states.
  6. If any intermediate bridges are present (such as a bridge on the adapter), EEH kernel services configure them at the end of a successful reset by using the eeh_configure_bridge() service. EEH kernel services also enforce a delay between the deactivation of the reset line and the configuration of the bridge.

    The device drivers do not need to call eeh_configure_bridge() directly.

  7. If everything goes well, the EEH_DD_RESUME message is sent to the drivers indicating that the slot recovery is complete.
  8. At this point, most drivers must reinitialize their adapters before starting normal operations again. Reinitialization typically requires a partial restore of the config space (such as the BARs and Cache Line). Which config space registers must be restored depends on the device.
    Note: This is the usual recovery sequence. If any of the services fails, the EEH_DD_DEAD message is broadcast, asking the drivers to mark their adapters unavailable (for example, the drivers might have to perform some cleanup work and mark their internal states appropriately). The master driver must call eeh_slot_error() to create an AIX® error log entry and mark the adapter permanently unavailable.
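
The recovery sequence above can be modelled as a small transition function. The following is a minimal sketch in C, not the kernel's implementation. The names SLOT_*, EV_*, MSG_* and slot_next() are hypothetical stand-ins for the real state and message constants, and the return to NORMAL after EEH_DD_RESUME is an assumption, because the text says only that recovery is complete.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: the real constants come from the AIX EEH
 * headers and may have different names and values. */
typedef enum {
    SLOT_NORMAL, SLOT_SUSPEND, SLOT_DEBUG,
    SLOT_ACTIVATE, SLOT_DEACTIVATE, SLOT_DEAD
} slot_state;

typedef enum {
    EV_FREEZE_CONFIRMED, /* eeh_read_slot_state() found the slot frozen */
    EV_ENABLE_PIO,       /* master driver called eeh_enable_pio()       */
    EV_RESET_ASSERT,     /* eeh_reset_slot() activated the reset line   */
    EV_RESET_RELEASE,    /* reset line deactivated after the delay      */
    EV_RESET_DONE,       /* bridges configured, recovery complete       */
    EV_SERVICE_FAILED    /* any service in the sequence failed          */
} slot_event;

/* Message broadcast to the drivers as a result of a transition. */
typedef enum { MSG_NONE, MSG_SUSPEND, MSG_DEBUG, MSG_RESUME, MSG_DEAD } slot_msg;

/* Return the next slot state and store the broadcast message in *msg. */
slot_state slot_next(slot_state s, slot_event e, slot_msg *msg)
{
    assert(msg != NULL);
    *msg = MSG_NONE;

    if (s == SLOT_DEAD)
        return SLOT_DEAD;               /* permanently unavailable */
    if (e == EV_SERVICE_FAILED) {
        *msg = MSG_DEAD;                /* models EEH_DD_DEAD */
        return SLOT_DEAD;
    }

    switch (s) {
    case SLOT_NORMAL:
        if (e == EV_FREEZE_CONFIRMED) { *msg = MSG_SUSPEND; return SLOT_SUSPEND; }
        break;
    case SLOT_SUSPEND:
    case SLOT_DEBUG:
        /* Each eeh_enable_pio() call broadcasts another DEBUG message. */
        if (e == EV_ENABLE_PIO)   { *msg = MSG_DEBUG; return SLOT_DEBUG; }
        if (e == EV_RESET_ASSERT) return SLOT_ACTIVATE;
        break;
    case SLOT_ACTIVATE:
        if (e == EV_RESET_RELEASE) return SLOT_DEACTIVATE;
        break;
    case SLOT_DEACTIVATE:
        /* Assumption: the slot returns to NORMAL once RESUME is sent. */
        if (e == EV_RESET_DONE) { *msg = MSG_RESUME; return SLOT_NORMAL; }
        break;
    default:
        break;
    }
    return s;   /* event not meaningful in this state: no transition */
}
```

Feeding the events in the order of steps 2 through 7 walks the model from SLOT_NORMAL back to SLOT_NORMAL, with the optional debug branch reachable only from SUSPEND or DEBUG.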
There are two special scenarios that a driver developer needs to be aware of:
  1. If a driver receives either an EEH_DD_SUSPEND or an EEH_DD_DEAD message, it can return an EEH_BUSY return code from its callback routine instead of an EEH_SUCC return code. If EEH kernel services receive an EEH_BUSY return code, they wait for some time and then call the same driver again. This process continues until EEH kernel services receive a different return code. The retry exists because some drivers need more time to clean up before recovery can continue. Cleanup includes activities such as killing a kproc or notifying a user-level application.
  2. If eeh_enable_dma() or eeh_enable_pio() cannot succeed because of platform state restrictions, the service returns an EEH_FAIL return code, which is followed by an EEH_DD_DEAD message, unless the driver has opted out. To avoid receiving an EEH_FAIL return code, the driver must supply the EEH_ENABLE_NO_SUPPORT_RC flag when it calls the eeh_init_multifunc() kernel service. If the EEH_ENABLE_NO_SUPPORT_RC flag is supplied, eeh_enable_pio() and eeh_enable_dma() instead return the EEH_NO_SUPPORT return code, which tells the drivers that they cannot collect debug data but can continue with the next step in recovery. For more information, see eeh_read_slot_state.
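
The two special scenarios can be illustrated with a small user-space model. This is a sketch under stated assumptions, not AIX code. The RC_* values, struct driver_ctx, driver_callback(), deliver_with_retry(), and after_enable_pio() are all hypothetical. The max_attempts bound is also an assumption added to keep the sketch finite; the text describes the retry as continuing until a different return code is received. The wait between retries is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for EEH_SUCC, EEH_BUSY, EEH_FAIL, EEH_NO_SUPPORT. */
enum { RC_SUCC = 0, RC_BUSY = 1, RC_FAIL = 2, RC_NO_SUPPORT = 3 };

/* Per-driver state: how much cleanup is still outstanding. */
struct driver_ctx {
    int cleanup_steps_left;  /* e.g. kproc not yet terminated */
    int calls;               /* how often the callback ran    */
};

/* Driver side: report busy until cleanup has finished. */
int driver_callback(struct driver_ctx *ctx)
{
    ctx->calls++;
    if (ctx->cleanup_steps_left > 0) {
        ctx->cleanup_steps_left--;   /* make some progress */
        return RC_BUSY;
    }
    return RC_SUCC;
}

/* Kernel side (modelled): call the same driver again while it is busy.
 * The real services wait between calls; that wait is not modelled. */
int deliver_with_retry(int (*cb)(struct driver_ctx *),
                       struct driver_ctx *ctx, int max_attempts)
{
    int rc = RC_BUSY;

    assert(cb != NULL && ctx != NULL);
    for (int i = 0; i < max_attempts && rc == RC_BUSY; i++)
        rc = cb(ctx);
    return rc;
}

typedef enum { STEP_GATHER_DEBUG, STEP_RESET, STEP_MARK_DEAD } next_step;

/* Driver decision after eeh_enable_pio() or eeh_enable_dma() returns. */
next_step after_enable_pio(int rc)
{
    switch (rc) {
    case RC_SUCC:       return STEP_GATHER_DEBUG; /* PIO enabled        */
    case RC_NO_SUPPORT: return STEP_RESET;        /* skip debug data    */
    default:            return STEP_MARK_DEAD;    /* DEAD message follows */
    }
}
```

In this model, a driver with two outstanding cleanup steps returns the busy code twice and succeeds on the third call.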

The EEH kernel services that you can use are listed in the following table:

Note: eeh_init() and eeh_init_multifunc() are the only exported kernel services. All other kernel services are called through function pointers in the eeh_handle structure.
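Kernel Service      | Single Function | Shared EEH | Process Environment | Interrupt Environment
--------------------|-----------------|------------|---------------------|----------------------
eeh_init            | Y               | N          | Y                   | N
eeh_init_multifunc  | N               | Y          | Y                   | N
eeh_clear           | Y               | Y          | Y                   | N
eeh_read_slot_state | Y               | Y          | Y                   | Y
eeh_enable_pio      | Y               | Y          | Y                   | Y
eeh_enable_dma      | Y               | Y          | Y                   | Y
eeh_enable_slot     | Y               | N          | Y                   | Y
eeh_disable_slot    | Y               | N          | Y                   | Y
eeh_reset_slot      | Y               | Y          | Y                   | Y
eeh_slot_error      | Y               | Y          | Y                   | Y
eeh_broadcast       | N               | Y          | Y                   | Y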
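
Because only the two init services are exported, every other service is reached through the handle. The sketch below illustrates that calling pattern with a mock. The struct mock_eeh_handle layout, its field names, and the MOCK_* macros are hypothetical and do not reflect the real handle definition in the AIX headers.

```c
#include <assert.h>
#include <stddef.h>

struct mock_eeh_handle;
typedef long (*mock_service_fn)(struct mock_eeh_handle *h, long flag);

/* Hypothetical handle: function pointers plus some mock slot state. */
struct mock_eeh_handle {
    mock_service_fn read_slot_state;
    mock_service_fn reset_slot;
    long slot_frozen;   /* 1 while the mock slot is frozen */
    long resets;        /* number of resets performed      */
};

static long mock_read_slot_state(struct mock_eeh_handle *h, long flag)
{
    (void)flag;
    return h->slot_frozen;
}

static long mock_reset_slot(struct mock_eeh_handle *h, long flag)
{
    (void)flag;
    h->resets++;
    h->slot_frozen = 0;  /* a successful reset clears the freeze */
    return 0;
}

/* Plays the role of the exported init service: fills in the pointers. */
struct mock_eeh_handle mock_init(long frozen)
{
    struct mock_eeh_handle h;

    h.read_slot_state = mock_read_slot_state;
    h.reset_slot = mock_reset_slot;
    h.slot_frozen = frozen;
    h.resets = 0;
    return h;
}

/* Convenience wrappers: call a service through the handle. */
#define MOCK_READ_SLOT_STATE(h, flag) ((h)->read_slot_state((h), (flag)))
#define MOCK_RESET_SLOT(h, flag)      ((h)->reset_slot((h), (flag)))
```

A driver written this way never links against the individual services; it keeps the handle returned at init time and dispatches every later call through it.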