System maintenance

As you plan for system maintenance updates, review information about the supported paths, the availability of virtual machine instances during the updates, and the time required for the maintenance procedure to complete.

The installation of system updates includes an automated process that applies updates and ensures that virtual machine instances created on the system remain in their original state and available, without requiring user interaction. The automated process applies updates to management software and hardware firmware in a single fix pack delivery. During the fix pack installation, updates are applied to components such as the management components, VMware vCenter, storage firmware, and compute node firmware to ensure that all versions are compatible. However, not every system fix pack includes updates for all components in the system, so the same orchestrated process is not required from release to release.

To review the maintenance paths that are supported for your current product version, see System updates.

Duration of the system update process

The time required to update a system varies based on your usage and requirements, and on the number and duration of stopping points that you set during the automated system updates operation. During the automated process, updates are applied to the Platform System® Managers and, if applicable, to the hardware firmware. The leader Platform System Manager is updated first, and the console is unavailable for approximately 60 minutes while that update runs. Existing instances remain available for use, but new instances cannot be deployed.

During this outage, the status of the update to the leader Platform System Manager can still be monitored from the Upgrade Status page on the non-leader Platform System Manager. After the leader Platform System Manager is updated, the IBM® service representative returns to the console and proceeds with the automated process.

When compute node updates are included in the fix pack, an outage can occur during the compute node updates if an instance evacuation is required and the cloud group does not have enough physical capacity for the evacuation to occur. In that case, the cloud group is not highly available, and the updates cannot complete without affecting the instances. Some instances are stopped and cannot be restarted until the updates are complete and the cloud group resources are fully restored.

Instance availability

Cloud Pak System Software provides high availability within a single system to address failures and keep virtual machine instances running:
  • Redundant hardware, such as networking, storage, and power supplies.
  • No single points of failure for cloud groups with active high availability containing two or more compute nodes.
  • Virtual machine instances remain available during system maintenance updates or hardware failures, leveraging reserved capacity and mobility actions within the system.
  • Additional capacity can be added and used with no service interruption.

For a system to be highly available, all components must be highly available. Currently, the only component whose high availability mode you can control is a cloud group. When a cloud group's high availability is active, physical capacity is reserved to ensure that, even during peak utilization, the overall functionality and state of the system remain healthy while virtual machine instances are evacuated, during both system failures and updates. The amount of reserved physical capacity is determined by the cloud group type: dedicated or average.

The following table shows how a virtual machine's CPU and memory reservations are mapped to hardware resources:
Note: VMware CPU overhead is amortized over each physical CPU. There is a 10% overhead that is reserved on each pCPU for ESX giving you 0.9 of a core for a dedicated cloud group, and 0.1125 for an average cloud group.
Table 1. Mapping of CPU and memory reservations to hardware resources
Type      | CPU count (1 vCPU)   | Virtual memory (1 MB)
Dedicated | 0.9 pCPU per vCPU    | 1 physical MB
Average   | 0.1125 pCPU per vCPU | 1 physical MB
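The mapping in Table 1 can be applied to a single virtual machine with a small calculation. The following is a minimal sketch only: the function and constant names are hypothetical and are not part of any product API, and the example virtual machine size is invented for illustration. Note that 0.1125 equals 0.9 / 8, which is consistent with eight vCPUs sharing the usable portion of one pCPU in an average cloud group, although that ratio is inferred from the arithmetic rather than stated here.

```python
# Illustrative helper; names are hypothetical, values come from Table 1.
PCPU_PER_VCPU = {"dedicated": 0.9, "average": 0.1125}
PHYSICAL_MB_PER_VIRTUAL_MB = 1  # same for both cloud group types


def physical_reservation(cloud_group_type, vcpus, memory_mb):
    """Return (pCPU, physical MB) reserved for one virtual machine."""
    ratio = PCPU_PER_VCPU[cloud_group_type]
    return vcpus * ratio, memory_mb * PHYSICAL_MB_PER_VIRTUAL_MB


# Example: a hypothetical virtual machine with 4 vCPUs and 8192 MB of memory.
for group_type in ("dedicated", "average"):
    pcpu, mem = physical_reservation(group_type, vcpus=4, memory_mb=8192)
    print(f"{group_type}: {pcpu:.4f} pCPU, {mem} MB")
```

With these inputs, the same virtual machine reserves 3.6 pCPU in a dedicated cloud group and 0.45 pCPU in an average cloud group, while the memory reservation is identical.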

Optionally, a cloud group can be set to reserve resources for high availability. This option reserves resources (CPU and memory) within the cloud group equivalent to one compute node. The reserved capacity in a cloud group containing N compute nodes is 1 / N of the resources (CPU and memory) on each compute node.
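To make the 1 / N rule concrete, the following sketch computes the per-node and total reservation for a cloud group. It is an illustration under stated assumptions: the function name is hypothetical, all compute nodes are assumed to be identical, and the node size (32 cores, 262144 MB) is an invented example rather than a product specification.

```python
def reserved_per_node(node_count, node_cpu_cores, node_memory_mb):
    """Return the (CPU cores, memory MB) reserved on each compute node.

    Assumes identical compute nodes; 1 / N of each node is set aside,
    so the total reservation equals the capacity of one compute node.
    """
    if node_count < 1:
        raise ValueError("a cloud group needs at least one compute node")
    fraction = 1 / node_count
    return node_cpu_cores * fraction, node_memory_mb * fraction


# Hypothetical cloud group: 4 nodes, each with 32 cores and 262144 MB.
nodes = 4
cpu, mem = reserved_per_node(nodes, node_cpu_cores=32, node_memory_mb=262144)
print(f"Reserved per node: {cpu} cores, {mem} MB")
print(f"Reserved in total: {cpu * nodes} cores, {mem * nodes} MB")
```

In this example, a quarter of each node is reserved, and the four reservations together add up to one full compute node. Adding compute nodes lowers the fraction reserved on each node while the total stays at one node's worth of capacity.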

If the Reserve resources for availability option is enabled, the evacuation of virtual machine instances from one compute node to another, if required, will always complete successfully without impacting the virtual machine instances because the required resources within the cloud group have been set aside in advance.

If the Reserve resources for availability option is disabled, one of three situations regarding evacuation can occur during the system update process:
  • When an evacuation is not required, the updates can complete without moving virtual machines off their existing compute nodes.
  • When an evacuation is required and the cloud group has enough physical capacity for the evacuation to occur, as determined by the cloud group type, the updates can complete without impacting the running virtual machine instances.
  • When an evacuation is required and the cloud group does not have enough physical capacity for the evacuation to occur, the updates cannot complete without affecting the virtual machine instances.
Note: When you evacuate a virtual machine instance from one compute node to another, the log file can include numerous DUPLICATE IP ADDRESS DETECTED messages. These messages are for informational purposes only and no action is required.
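The evacuation outcomes described in this section can be summarized as a small decision function. This is a simplified sketch, not the product's actual logic: the names are hypothetical, and capacity is modeled as a single number, whereas the real check depends on CPU and memory as determined by the cloud group type.

```python
from enum import Enum


class Outcome(Enum):
    NO_EVACUATION = "updates complete; no virtual machines are moved"
    EVACUATED = "updates complete; running instances are not impacted"
    INSTANCES_AFFECTED = "updates cannot complete without affecting instances"


def update_outcome(reserve_enabled, evacuation_required,
                   free_capacity=0.0, needed_capacity=0.0):
    """Classify the effect of a compute node update on instances.

    free_capacity and needed_capacity are hypothetical single-number
    stand-ins for the physical capacity check.
    """
    if not evacuation_required:
        return Outcome.NO_EVACUATION
    # With resources reserved for availability, capacity was set aside
    # in advance, so a required evacuation always succeeds.
    if reserve_enabled or free_capacity >= needed_capacity:
        return Outcome.EVACUATED
    return Outcome.INSTANCES_AFFECTED


print(update_outcome(False, True, free_capacity=10, needed_capacity=16).name)
```

Only the last branch corresponds to an outage: the option is disabled, an evacuation is required, and the cloud group lacks the capacity to absorb the evacuated instances.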