Using Dynamic Processor Deallocation

Starting with machine type 7044 model 270, the hardware of all systems with more than two processors can detect correctable errors, which are gathered by the firmware. These errors are not fatal and, as long as they remain rare occurrences, can be safely ignored. However, when a pattern of failures seems to be developing on a specific processor, this pattern may indicate that this component is likely to exhibit an unrecoverable failure in the near future. The firmware makes this prediction based on failure rates and threshold analysis.

AIX® implements continuous hardware surveillance and regularly polls the firmware for hardware errors. When the number of processor errors hits a threshold and the firmware recognizes the distinct probability that this system component will fail, the firmware returns an error report to AIX and logs the error in the system error log. In addition, on multiprocessor systems, depending on the type of failure, AIX attempts to stop using the untrustworthy processor and deallocate it. This feature is called dynamic processor deallocation.

At this point, the firmware flags the processor for persistent deallocation for subsequent reboots, until service personnel replace the processor.

Potential impact to applications

Processor deallocation is transparent to the vast majority of applications, including drivers and kernel extensions. However, AIX publishes interfaces that let an application or kernel extension determine whether it is running on a multiprocessor machine, find out how many processors there are, and bind threads to specific processors. Code that uses these interfaces can be affected when a processor is deallocated.

The bindprocessor interface for binding processes or threads to processors uses bind CPU numbers. The bind CPU numbers are in the range [0..N-1] where N is the total number of CPUs. To avoid breaking applications or kernel extensions that assume no "holes" in the CPU numbering, AIX always makes it appear to applications as if the "last" (highest-numbered) bind CPU is the one being deallocated. For instance, on an 8-way SMP, the bind CPU numbers are [0..7]. If one processor is deallocated, the total number of available CPUs becomes 7, and they are numbered [0..6]. Externally, CPU 7 seems to have disappeared, regardless of which physical processor failed.

Note: In the rest of this description, the term CPU is used for the logical entity and the term processor for the physical entity.

Applications or kernel extensions that bind processes or threads could be broken if AIX silently terminated their bound threads or forcibly moved them to another CPU when one of the processors needs to be deallocated. Dynamic processor deallocation provides programming interfaces so that those applications and kernel extensions can be notified that a processor deallocation is about to happen. When these applications and kernel extensions get this notification, they are responsible for moving their bound threads and associated resources (such as timer request blocks) away from the last bind CPU ID and adapting themselves to the new CPU configuration.

If, after notification of applications and kernel extensions, some of the threads are still bound to the last bind CPU ID, the deallocation is aborted. In this case, AIX logs the fact that the deallocation has been aborted in the error log and continues using the ailing processor. When the processor ultimately fails, it causes a total system failure. Thus, it is important for applications or kernel extensions that bind threads to CPUs to receive the notification of an impending processor deallocation and to act on it.

Even in the rare cases where the deallocation cannot go through, dynamic processor deallocation still gives advance warning to system administrators. By recording the error in the error log, it gives them a chance to schedule a maintenance operation on the system to replace the ailing component before a global system failure occurs.

Flow of events for processor deallocation

The typical flow of events for processor deallocation is as follows:

  1. The firmware detects that a recoverable error threshold has been reached by one of the processors.
  2. AIX logs the firmware error report in the system error log, and, when executing on a machine supporting processor deallocation, starts the deallocation process.
  3. AIX notifies non-kernel processes and threads bound to the last bind CPU.
  4. AIX waits for all the bound threads to move away from the last bind CPU. If threads remain bound, AIX eventually times out (after ten minutes) and aborts the deallocation. Otherwise, AIX invokes the previously registered High Availability Event Handlers (HAEHs). An HAEH may return an error that will abort the deallocation. Otherwise, AIX continues with the deallocation process and ultimately stops the failing processor.

In case of failure at any point of the deallocation, AIX logs the failure, indicating the reason why the deallocation was aborted. The system administrator can look at the error log, take corrective action (when possible) and restart the deallocation. For instance, if the deallocation was aborted because at least one application did not unbind its bound threads, the system administrator could stop the application(s), restart the deallocation (which should continue this time) and restart the application.
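The abort conditions in the flow above can be modeled in a few lines of C. This is only a user-space model of the decision logic described in steps 3 and 4, not AIX code; every name in it (model_deallocate, model_haeh_t, the result codes) is hypothetical, and the ten-minute wait is reduced to a count of threads that are still bound when the wait ends.

```c
#include <stddef.h>

/* Possible outcomes of one modeled deallocation attempt. */
typedef enum {
    DEALLOC_DONE,            /* processor stopped */
    ABORT_THREADS_BOUND,     /* threads stayed bound until the timeout */
    ABORT_HANDLER_REFUSED    /* a registered handler returned an error */
} dealloc_result_t;

/* Stand-in for a registered handler: 0 accepts, non-zero refuses. */
typedef int (*model_haeh_t)(void);

int model_accept(void) { return 0; }
int model_refuse(void) { return -1; }

/*
 * threads_still_bound: threads left on the last bind CPU once the wait ends.
 * handlers/n_handlers: handlers invoked only if no bound thread remains.
 */
dealloc_result_t model_deallocate(int threads_still_bound,
                                  const model_haeh_t *handlers,
                                  size_t n_handlers)
{
    size_t i;

    if (threads_still_bound > 0)
        return ABORT_THREADS_BOUND;

    for (i = 0; i < n_handlers; i++) {
        if (handlers[i]() != 0)
            return ABORT_HANDLER_REFUSED;
    }
    return DEALLOC_DONE;
}
```

In this model, an administrator who stops the offending application and restarts the deallocation corresponds to calling model_deallocate again with threads_still_bound set to 0.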

Programming interfaces dealing with individual processors

The following sections describe available programming interfaces:

Interfaces to determine the number of CPUs on a system

sysconf subroutine

The sysconf subroutine returns a processor count for each of the following parameters:
  • _SC_NPROCESSORS_CONF: Number of processors configured
  • _SC_NPROCESSORS_ONLN: Number of processors online

For more information, see sysconf Subroutine in Technical Reference: Base Operating System and Extensions, Volume 2.

The value returned by the sysconf subroutine for _SC_NPROCESSORS_CONF will remain constant between reboots. Uniprocessor (UP) machines are identified by a 1. Values greater than 1 indicate multiprocessor (MP) machines. The value returned for the _SC_NPROCESSORS_ONLN parameter will be the count of active CPUs and will be decremented every time a processor is deallocated.
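As a minimal sketch, the two counts can be read as follows. The helper name query_cpu_counts is hypothetical; the sysconf subroutine and the two parameter names are the ones described above.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Reads the configured and online processor counts.
 * Returns 0 on success, -1 if either query fails.
 */
int query_cpu_counts(long *configured, long *online)
{
    long conf = sysconf(_SC_NPROCESSORS_CONF);
    long onln = sysconf(_SC_NPROCESSORS_ONLN);

    if (conf < 1 || onln < 1)
        return -1;

    *configured = conf;
    *online = onln;
    return 0;
}

/* Prints both counts and whether the machine is UP or MP. */
void print_cpu_counts(void)
{
    long conf, onln;

    if (query_cpu_counts(&conf, &onln) != 0) {
        fprintf(stderr, "could not query processor counts\n");
        return;
    }
    printf("configured: %ld (%s), online: %ld\n",
           conf, conf > 1 ? "MP" : "UP", onln);
}
```

Because the online count can drop while the program runs, code that depends on it should query it again when needed rather than caching the value at startup.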

The _system_configuration.ncpus field identifies the number of CPUs active on a machine. This field is analogous to the _SC_NPROCESSORS_ONLN parameter. For more information, see systemcfg.h File in Files Reference.

For code that must recognize how many processors were originally available at boot time, the _system_configuration table also provides the ncpus_cfg field. The value of this field remains constant between reboots.

The CPUs are identified by bind CPU IDs in the range [0..(ncpus-1)]. The processors also have a physical CPU number that depends on which CPU board they are on, in which order, and so on. The commands and subroutines dealing with CPU numbers always use bind CPU numbers. To ease the transition to varying numbers of CPUs, the bind CPU numbers are contiguous numbers in the range [0..(ncpus-1)]. As a result, from a user point of view, a processor deallocation always looks as if the highest-numbered ("last") bind CPU is disappearing, regardless of which physical processor failed.

Note: To avoid problems, use the ncpus_cfg variable to determine what the highest possible bind CPU number is for a particular system.
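The following sketch shows one way to apply this note when sizing per-CPU data and validating a bind CPU number. The helper names are hypothetical. On AIX the code reads the ncpus_cfg and ncpus fields described above; on other systems it falls back to the corresponding sysconf parameters so that the example still compiles, which is an assumption made only for illustration.

```c
#include <unistd.h>

#ifdef _AIX
#include <sys/systemcfg.h>
#endif

/* Highest bind CPU number the system can ever have until the next reboot. */
int highest_possible_bind_cpu(void)
{
#ifdef _AIX
    return _system_configuration.ncpus_cfg - 1;
#else
    /* Stand-in outside AIX: configured processor count. */
    return (int)sysconf(_SC_NPROCESSORS_CONF) - 1;
#endif
}

/* Highest bind CPU number that is active right now. */
int highest_active_bind_cpu(void)
{
#ifdef _AIX
    return _system_configuration.ncpus - 1;
#else
    /* Stand-in outside AIX: online processor count. */
    return (int)sysconf(_SC_NPROCESSORS_ONLN) - 1;
#endif
}

/* Returns 1 if cpu is currently a valid bind target, 0 otherwise. */
int is_active_bind_cpu(int cpu)
{
    return cpu >= 0 && cpu <= highest_active_bind_cpu();
}
```

Sizing per-CPU tables with highest_possible_bind_cpu() + 1 entries keeps every index valid across deallocations, while is_active_bind_cpu() reflects the current, possibly reduced, range.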

Interfaces to bind threads to a specific processor

The bindprocessor command and the bindprocessor programming interface allow you to bind a thread or a process to a specific CPU, designated by its bind CPU number. Both interfaces allow you to bind threads or processes only to active CPUs. Programs that use the bindprocessor programming interface directly, or that are bound externally by a bindprocessor command, must be able to handle processor deallocation.

The primary problem seen by programs that bind to a processor when a CPU has been deallocated is that requests to bind to a deallocated processor will fail. Code that issues bindprocessor requests should always check the return value from those requests.
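A minimal sketch of a checked bind request follows. The wrapper name bind_self_to_cpu is hypothetical. On AIX it calls the bindprocessor subroutine from sys/processor.h with BINDPROCESS; on other systems a small stand-in with the same argument order is defined so that the example compiles, and that stand-in is an assumption for illustration only, not the real interface.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#ifdef _AIX
#include <sys/processor.h>
#else
/* Stand-in outside AIX: rejects bind CPU numbers that are not online. */
#define BINDPROCESS 1
static int bindprocessor(int what, int who, int where)
{
    (void)what;
    (void)who;
    if (where < 0 || where >= sysconf(_SC_NPROCESSORS_ONLN)) {
        errno = EINVAL;
        return -1;
    }
    return 0;
}
#endif

/*
 * Binds the calling process to the given bind CPU number.
 * Returns 0 on success, -1 if the request fails, for example because
 * the CPU number is no longer valid after a deallocation.
 */
int bind_self_to_cpu(int cpu)
{
    if (bindprocessor(BINDPROCESS, (int)getpid(), cpu) != 0) {
        fprintf(stderr, "bind to CPU %d failed: %s\n", cpu, strerror(errno));
        return -1;
    }
    return 0;
}
```

A caller that receives -1 can query the current number of online CPUs again and retry with a lower bind CPU number, or continue unbound.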

For more information on these interfaces, see bindprocessor Command in Commands Reference, Volume 1 or bindprocessor Subroutine in Technical Reference: Base Operating System and Extensions, Volume 1.

Interfaces for processor deallocation notification

The notification mechanism is different for user-mode applications having threads bound to the last bind CPU than it is for kernel extensions.

Notification in user mode

Each thread of a user mode application that is bound to the last bind CPU is sent the SIGCPUFAIL and SIGRECONFIG signals. These applications need to be modified to catch these signals and dispose of the threads bound to the last bind CPU (either by unbinding them or by binding them to a different CPU).

Notification in kernel mode

The drivers and kernel extensions that must be notified of an impending processor deallocation must register a High-Availability Event Handler (HAEH) routine with the kernel. This routine will be called when a processor deallocation is imminent. An interface is also provided to unregister the HAEH before the kernel extension is unconfigured or unloaded.

Registering a high-availability event handler

The kernel exports a new function to allow notification of the kernel extensions in case of events that affect the availability of the system.

The system call is:
int register_HA_handler(ha_handler_ext_t *)

For more information on this system call, see register_HA_handler in Operating system and device management.

The return value is equal to 0 in case of success. A non-zero value indicates a failure.

The system call argument is a pointer to a structure describing the kernel extension's HAEH. This structure is defined in a header file, named sys/high_avail.h, as follows:
typedef struct _ha_handler_ext_ { 
    int (*_fun)();        /* Function to be invoked */ 
    long long _data;      /* Private data for (*_fun)() */ 
    char        _name[sizeof(long long) + 1]; 
} ha_handler_ext_t;

The private _data field is provided for the use of the kernel extension if it is needed. Whatever value is given in this field at the time of registration is passed as a parameter to the registered function when the function is called due to a CPU predictive failure event.

The _name field is a null-terminated string with a maximum length of 8 characters (not including the null character terminator) which is used to uniquely identify the kernel extension with the kernel. This name must be unique among all the registered kernel extensions. This name is listed in the detailed data area of the CPU_DEALLOC_ABORTED error log entry if the kernel extension returns an error when the HAEH routine is called by the kernel.

Kernel extensions should register their HAEH only once.

Invocation of the high-availability event handler

The HAEH routine is called with the following parameters:
  • The value of the _data field of the ha_handler_ext_t structure passed to register_HA_handler.
  • A pointer to a ha_event_t structure defined in the sys/high_avail.h file as:
    typedef struct {                    /* High-availability related event */
        uint _magic;                    /* Identifies the kind of the event */
    #define HA_CPU_FAIL 0x40505546      /* "CPUF" */
        union {
            struct {                   /* Predictive processor failure */
                cpu_t dealloc_cpu;     /* CPU bind ID of failing processor */
                ushort domain;         /* future extension */
                ushort nodeid;         /* future extension */
                ushort reserved3;      /* future extension */
                uint reserved[4];      /* future extension */
            } _cpu;
            /* ... */                  /* Additional kind of events -- */
            /* future extension */
        } _u;
    } haeh_event_t;

The function returns one of the following codes, also defined in the sys/high_avail.h file:
#define HA_ACCEPTED 0     /* Positive acknowledgement */ 
#define HA_REFUSED -1     /* Negative acknowledgement */

If any of the registered extensions does not return HA_ACCEPTED, the deallocation is aborted. The HAEH routines are called in the process environment and do not need to be pinned.

If a kernel extension depends on the CPU configuration, its HAEH routine must react to the upcoming CPU deallocation. This reaction is highly application-dependent. To allow AIX to proceed with the deconfiguration, the kernel extension must move any threads that are bound to the last bind CPU. Also, if the kernel extension has been using timers started from bound threads, those timers are moved to another CPU as part of the CPU deallocation. If the kernel extension depends on these timers being delivered to a specific CPU, it must take action (for instance, stopping the timers and restarting the timer requests when the threads are bound to a new CPU).
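The handler logic described in this section can be modeled in user space as follows. This is a sketch only: it does not include sys/high_avail.h, and all demo_ names are hypothetical stand-ins that mirror the definitions quoted above (the short type used for the CPU ID is an assumption). It shows a handler that checks the event kind, uses the private data value to reach its own state, and returns an acceptance or refusal code. Accepting unknown event kinds is also an assumption of this sketch.

```c
#include <stdint.h>

/* Stand-ins mirroring the values quoted above. */
#define DEMO_HA_CPU_FAIL 0x40505546u
#define DEMO_HA_ACCEPTED 0
#define DEMO_HA_REFUSED  (-1)

/* Reduced stand-in for the event structure. */
typedef struct {
    unsigned int magic;     /* kind of event */
    short dealloc_cpu;      /* bind ID of the failing processor (width assumed) */
} demo_event_t;

/* Hypothetical state of the extension, reached through the data value. */
typedef struct {
    int threads_on_last_cpu;   /* threads currently bound to the last bind CPU */
    int can_rebind;            /* non-zero if they can be moved elsewhere */
} demo_state_t;

/* Modeled handler: first parameter is the private data, second the event. */
int demo_haeh(long long data, demo_event_t *event)
{
    demo_state_t *st = (demo_state_t *)(intptr_t)data;

    if (event->magic != DEMO_HA_CPU_FAIL)
        return DEMO_HA_ACCEPTED;          /* assumption: ignore other kinds */

    if (st->threads_on_last_cpu > 0 && !st->can_rebind)
        return DEMO_HA_REFUSED;           /* deallocation would be aborted */

    st->threads_on_last_cpu = 0;          /* models moving the bound threads */
    return DEMO_HA_ACCEPTED;
}
```

A real handler would also stop and restart any timer requests that depend on a specific CPU at the point where this model clears its counter.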

Canceling the registration of a high-availability event handler

To keep the system coherent and prevent system crashes, the kernel extensions that register an HAEH must cancel the registration when they are unconfigured and are going to be unloaded. The interface is as follows:
int unregister_HA_handler(ha_handler_ext_t *)

This interface returns 0 in case of success. Any non-zero return value indicates an error.
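The registration and cancellation calls could be paired in the configuration and unconfiguration paths of a kernel extension roughly as sketched below. The names my_haeh, my_ext_configure, my_ext_unconfigure, and the identifier "mydrv" are hypothetical. On AIX the sketch relies on sys/high_avail.h; elsewhere it defines a copy of the structure quoted earlier and stub calls so that it compiles in user space. The stubs, including their refusal of a second registration, are assumptions for illustration.

```c
#include <string.h>

#ifdef _AIX
#include <sys/high_avail.h>
#else
/* Stand-ins outside AIX: structure as quoted earlier, plus stub calls. */
#define HA_ACCEPTED 0
typedef struct _ha_handler_ext_ {
    int (*_fun)();                        /* Function to be invoked */
    long long _data;                      /* Private data for (*_fun)() */
    char _name[sizeof(long long) + 1];
} ha_handler_ext_t;

static int stub_registered = 0;

static int register_HA_handler(ha_handler_ext_t *h)
{
    if (h == NULL || stub_registered)
        return -1;
    stub_registered = 1;
    return 0;
}

static int unregister_HA_handler(ha_handler_ext_t *h)
{
    if (h == NULL || !stub_registered)
        return -1;
    stub_registered = 0;
    return 0;
}
#endif

/* Placeholder handler that accepts every event. */
static int my_haeh(long long data, void *event)
{
    (void)data;
    (void)event;
    return HA_ACCEPTED;
}

static ha_handler_ext_t my_ext;

/* Call once when the extension is configured. Returns 0 on success. */
int my_ext_configure(void)
{
    memset(&my_ext, 0, sizeof my_ext);
    my_ext._fun = (int (*)())my_haeh;
    my_ext._data = 0;
    /* At most 8 characters; the zeroed buffer keeps the terminator. */
    strncpy(my_ext._name, "mydrv", sizeof my_ext._name - 1);
    return register_HA_handler(&my_ext);
}

/* Call before the extension is unloaded. Returns 0 on success. */
int my_ext_unconfigure(void)
{
    return unregister_HA_handler(&my_ext);
}
```

Keeping the structure in static storage lets the same pointer be passed to both calls for the lifetime of the extension.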

For more information on the system call, see unregister_HA_handler in Technical Reference: Kernel and Subsystems, Volume 1.

Deallocating a processor in the test environment

To test any of the modifications made in applications or kernel extensions to support this processor deallocation, use the following command to trigger the deallocation of a CPU designated by its logical CPU number. The syntax is:
cpu_deallocate cpunum

where:

cpunum is a valid logical CPU number.

You must reboot the system to get the target processor back online. Hence, this command is provided for test purposes only and is not intended as a system administration tool.