Availability concepts

Before you plan for the availability of your system, you need to understand the concepts associated with availability.

A business and the IT operations that support it must determine which solutions and technologies address its business needs. For business continuity, this means that detailed requirements must be developed and documented, the solution types must be identified, and the solution choices must be evaluated. This task is challenging, in part because of the complexity of the problem.

Business continuity is the capability of a business to withstand outages, which are times when the system is unavailable, and to operate important services normally and without interruption in accordance with predefined service-level agreements. To achieve a given level of business continuity, a collection of services, software, hardware, and procedures must be selected, described in a documented plan, implemented, and practiced regularly. The business continuity solution must address the data, the operational environment, the applications, the application hosting environment, and the user interface. All must be available to deliver a good, complete business continuity solution. Your business continuity plan includes disaster recovery and high availability (HA).

Disaster recovery addresses a complete outage at the production site of your business, such as an outage caused by a natural disaster. It provides a set of resources, plans, services, and procedures used to recover important applications and to resume normal operations from a remote site. The disaster recovery plan includes a stated disaster recovery goal (for example, resume operations within eight hours) and addresses acceptable levels of degradation.

Another major aspect of business continuity goals for many customers is high availability, which is the ability to withstand all outages (planned, unplanned, and disasters) and to provide continuous processing for all important applications. The ultimate goal is for the outage time to be less than 0.001% of the total service time, that is, availability of more than 99.999%. Compared with disaster recovery, high availability typically has more demanding recovery time objectives (seconds to minutes) and more demanding recovery point objectives (zero user disruption).
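The outage goal above can be turned into a concrete time budget. The following is a minimal sketch, assuming a service period of one non-leap year of continuous operation; the function names and the choice of minutes as the unit are illustrative assumptions, not part of any product interface.

```python
# Sketch: convert a maximum outage percentage into an outage time budget.
# Assumption: the service period is one non-leap year of 24x7 operation.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(max_outage_percent: float,
                            service_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the outage time, in minutes, allowed by a percentage target."""
    if not 0 <= max_outage_percent <= 100:
        raise ValueError("max_outage_percent must be between 0 and 100")
    return service_minutes * max_outage_percent / 100


def availability_percent(max_outage_percent: float) -> float:
    """Return the availability percentage that corresponds to an outage target."""
    return 100 - max_outage_percent


if __name__ == "__main__":
    budget = downtime_budget_minutes(0.001)
    print(f"Target availability: {availability_percent(0.001):.3f}%")
    print(f"Outage budget per year: {budget:.2f} minutes")
```

Under these assumptions, a 0.001% outage target leaves a budget of a little over five minutes per year, and that budget must cover planned outages, unplanned outages, and disasters together.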

Availability is measured in terms of outages, which are periods of time when the system is not available to users. During a planned outage (also called a scheduled outage), you deliberately make your system unavailable to users. You might use a scheduled outage to run batch work, back up your system, or apply fixes.

Your backup window is the amount of time that your system can be unavailable to users while you perform your backup operations. Your backup window is a scheduled outage that typically occurs at night or on a weekend, when your system has less traffic.

An unplanned outage (also called an unscheduled outage) is typically caused by a failure. You can recover from some unplanned outages (such as disk failure, system failure, power failure, program failure, or human error) if you have an adequate backup strategy. However, an unplanned outage that causes a complete system loss, such as a tornado or fire, requires you to have a detailed disaster recovery plan in place in order to recover.
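Because availability is measured in terms of outages, a simple outage log is enough to compute it. The sketch below assumes a hypothetical log in which each entry records a duration in minutes and whether the outage was planned; the `Outage` class and both function names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Outage:
    """One entry in a hypothetical outage log."""
    minutes: float
    planned: bool  # True for a scheduled outage, False for an unscheduled one


def measured_availability(outages: Iterable[Outage], service_minutes: float) -> float:
    """Return the percentage of the service period the system was available."""
    if service_minutes <= 0:
        raise ValueError("service_minutes must be positive")
    down = sum(o.minutes for o in outages)
    if down > service_minutes:
        raise ValueError("total outage time exceeds the service period")
    return 100 * (service_minutes - down) / service_minutes


def outage_minutes_by_type(outages: Iterable[Outage]) -> Tuple[float, float]:
    """Return (planned_minutes, unplanned_minutes) for the log."""
    entries = list(outages)
    planned = sum(o.minutes for o in entries if o.planned)
    unplanned = sum(o.minutes for o in entries if not o.planned)
    return planned, unplanned


if __name__ == "__main__":
    # Illustrative figures only: four one-hour backup windows and one
    # 30-minute failure in a 30-day month.
    log = [Outage(60, True)] * 4 + [Outage(30, False)]
    month = 30 * 24 * 60
    print(outage_minutes_by_type(log))
    print(measured_availability(log, month))
```

Counting planned and unplanned outages in the same total matches the definition of high availability used here, in which every outage reduces availability regardless of its cause.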

High availability solutions provide fully automated failover to a backup system to ensure continuous operation for users and applications. These HA solutions must provide an immediate recovery point and ensure that the recovery time is shorter than that of a non-HA solution.

Unlike disaster recovery, where entire systems experience an outage, high availability solutions can be customized to individual critical resources within a system; for example, a specific application instance. High availability solutions are based on cluster technology. You can use clusters to avoid the impacts of both planned and unplanned outages. Although an individual system still experiences an outage, the business function is not affected by it.

A cluster is a collection of interconnected complete systems used as a single, unified resource. The cluster provides a coordinated, distributed process across the systems to deliver the solution. This results in higher levels of availability, some horizontal growth, and simpler administration across the enterprise.

For a complete solution, you must address the operational environment, the application hosting environment, application resilience, and the user interfaces in addition to providing the data resilience mechanisms. Clusters focus on all aspects of the complete solution. The integrated cluster resource services enable you to define a cluster of systems and the set of resources that should be protected against outages. Cluster resource services detect outage conditions and coordinate the automatic movement of critical resources to a backup system.
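The failover behavior described above can be illustrated with a toy model. This is only a sketch of the general idea, not the cluster resource services interface: the `Cluster` class, its methods, and the system and resource names are all hypothetical, and real clusters also handle detection, data replication, and switchback, which are omitted here.

```python
from typing import Dict, List, Optional


class Cluster:
    """Toy model: systems plus protected resources with a recovery order."""

    def __init__(self, nodes: List[str]) -> None:
        self.active: Dict[str, bool] = {node: True for node in nodes}
        self.recovery_order: Dict[str, List[str]] = {}
        self.host: Dict[str, Optional[str]] = {}

    def protect(self, resource: str, recovery_order: List[str]) -> None:
        """Register a resource; the first system in the list is its primary."""
        if not recovery_order:
            raise ValueError("recovery_order must name at least one system")
        unknown = [n for n in recovery_order if n not in self.active]
        if unknown:
            raise ValueError(f"systems not in cluster: {unknown}")
        self.recovery_order[resource] = list(recovery_order)
        self.host[resource] = recovery_order[0]

    def report_outage(self, node: str) -> Dict[str, str]:
        """Mark a system as down and move its resources to the next active backup.

        Returns a mapping of each moved resource to its new host. A resource
        with no active backup is left without a host (None).
        """
        self.active[node] = False
        moved: Dict[str, str] = {}
        for resource, current in list(self.host.items()):
            if current != node:
                continue
            backup = next(
                (n for n in self.recovery_order[resource] if self.active[n]),
                None,
            )
            self.host[resource] = backup
            if backup is not None:
                moved[resource] = backup
        return moved


if __name__ == "__main__":
    cluster = Cluster(["SYSA", "SYSB"])          # hypothetical system names
    cluster.protect("orders_app", ["SYSA", "SYSB"])
    print(cluster.report_outage("SYSA"))         # orders_app moves to SYSB
```

In this model the outage of one system is real, but the protected resource keeps a host as long as an active backup remains in its recovery order, which is the sense in which the business function is not affected.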