|
With Scheduler improvements in Linux 2.6is it ready for the data center? from CCR2, Issue 2 - 2004
 |
By Pete Marshall
Assistant Vice President, Business Development
Candle Corporation |
Aside perhaps from virtual memory management, nothing interests operating system theorists and practitioners more than task scheduling. The way an operating system handles task scheduling is key to its performance characteristics. How scheduling is implemented determines how an operating system will fare in meeting the conflicting goals of providing fast throughput of jobs and transactions, maximizing hardware utilization and, where appropriate, giving users good interactive response times. Schedulers also determine two other important characteristics of an operating system: how it scales in large multiprocessor hardware configurations and how it performs when resources (especially CPU) are overcommitted.
So when an operating system makes a major change to its scheduler implementation, as Linux has with the new 2.6 kernel, the event is newsworthy and merits some analysis. Much of the motivation to rewrite the Linux scheduler came from concerns (some real, some perceived, some rumored) about scalability in large multiprocessor systems. As we shall see, the new scheduler not only addresses these concerns but also provides a foundation for implementing true enterprise-class workload and server consolidation.
What makes a good scheduler?
The correct response to the question “What makes a good scheduler?” is another question: “What are you trying to schedule?” The designer of a small real-time embedded system will be concerned with scheduling rapid responses to real-world events. The supercomputer implementer, on the other hand, will be concentrating on minimizing interruptions and maximizing CPU utilization.
These two situations represent the extremes of a spectrum. For general IT usage, the tougher problem is not getting one thing right (after all, this is Linux: You are free to write your own specialized scheduler) but getting many things right at the same time. For general IT purposes, the key is to meet three goals simultaneously:
- Provide the fastest response times for transaction-oriented workloads, such as Web, application server or database requests. This is important for providing good response times for applications. Longer-running tasks similarly need to complete in a reasonable and somewhat consistent amount of time.
- For systems with interactive users, provide a “responsive” user experience. Although there is a lot of subjectivity here, responsive usually comes down to subsecond response times for interactive shell sessions and little or no discernable delay in GUI manipulations.
- Maximize the utilization of hardware resources, which in the case of the scheduler means maximizing CPU usage. Assuming that there's sufficient available work to be done, the scheduler should be able to keep all the available processors as busy as possible doing productive work.
Clearly, the first two goals are in conflict. Goal one is best served by an individual transaction being allowed to complete as quickly as possible with the minimum of interruptions (aided, typically, by server strategies, such as caching, that aim to minimize I/O interruptions). Goal two, on the other hand, requires that the system be ready to drop everything at a moment's notice to attend to the needs of the end user(s).
Even if these conflicts can be avoided, goal three provides yet another challenge. Maximizing CPU utilization requires that we have more work “ready to go” in the system to keep the processor(s) fed with useful work. More work means more tasks in the system, which in turn implies that the scheduler has to look at more task information before making a decision about which to schedule next.
Looking at more tasks takes more CPU time in the scheduler. If this CPU overhead grows as the number of tasks increases, it works against maximizing throughput: Scheduling overhead goes up, and scheduling delays mean more tasks waiting on, say, I/O completion before they become available to be scheduled. So the problem only gets worse. Ultimately, this can lead to system instability and, therefore, many installations take the following approach: avoid conflicts between goals one and two by splitting the workload across different systems, and avoid the problems associated with goal three by running at low CPU utilizations (often below 25 percent). This may constitute a sound workload management strategy but it is, clearly, not cost-effective.
The pre-2.6 scheduler
Critics of Linux have, prior to the 2.6 kernel, pointed to the scheduler as a weakness and proof that Linux was not ready for large-scale enterprise implementations. While this debate may have created more heat than light, it is certainly true that the pre-2.6 scheduler has its limitations. These limitations stemmed from the fact that the time taken to run the pre-2.6 kernel scheduling algorithm increased linearly with the number of tasks in the system.
The basic job of any scheduler is to pick the highest-priority task out of all the candidates that could be run and give control of the CPU to that task. To do this, the scheduler maintains a list of all the processes (in a data structure called a runqueue) in the system. Each process is assigned a priority (based on its nice value and run-time behavior) and is allocated a slice of CPU time (called a timeslice), typically around 100 ms. As a process is run, it may well be interrupted because the CPU has to handle something else, such as the completion of a disk I/O for another process or because the process itself needs to wait for something. When these interruptions occur, the process' timeslice is decremented by the amount of CPU time that was consumed. This continues until the timeslice value reaches 0, at which point, the process is marked expired. Each time the scheduler is called, it picks the nonexpired process with the highest priority and dispatches it.
Once all the tasks have been expired, the scheduler then recalculates the priority and timeslice for each task before returning to look for the next candidate to run. This recalculation, which iterates across all processes in the system, takes time and does not scale well. As the number of tasks in the system increases, the number of entries in the runqueue increases, and the amount of work the scheduler needs to perform increases accordingly.
This increase is directly proportional to the number of tasks in the system. Such algorithms are labeled O(n), to indicate that their computational time increases as a linear function on n, the number of tasks. For small systems, an O(n) scheduler works fine, and the single runqueue is a simple data structure to maintain. For large-scale systems with large numbers of tasks (e.g., a busy database or application server), this behavior becomes a limitation: The time spent rescheduling simply grows and grows as the number of processes increases, reducing the amount of CPU time available for “real” work.
The situation is made more complicated in multi-CPU systems. Because the pre-2.6 runqueue is global to the entire system, only one CPU can manipulate it at any given time (otherwise, both CPUs may well decide to schedule the same task simultaneously -- something that is likely to cause severe problems). To avoid such conflicts, the kernel maintains a lock to protect the runqueue. Once one CPU obtains this lock, any other CPU that becomes freed up has to wait until the lock is released. Only one processor can execute the scheduler code at any one time. As the number of processors grows, this lock contention becomes a bigger and bigger problem. While the pre-2.6 kernel could run on eight- and 16-way systems, the performance penalties could be severe.
The final issue with the global nature of the pre-2.6 scheduler in multiprocessor systems is its poor CPU affinity handling. Processes get interrupted from time to time, either because the system has something else to do or the process needs to wait for something like a disk I/O. When a process is resumed after an interruption, it is desirable that it do so on the same processor it was previously running on, as that processor is likely to have memory for the process still in its cache.
The pre-2.6 scheduler, based on picking the best run-candidate from a global list irrespective of the CPU in question, has a tendency to “bounce” processes between processors. This adversely affects cache and, therefore, workload performance.
These three performance constraints: the O(n) nature of the scheduling algorithm, lock contention for the single runqueue and the lack of CPU affinity represent significant barriers to scalability in the pre-2.6 kernel. What was needed -- a fundamental rethinking and rewriting of the scheduler -- happened with the recently released 2.6 kernel.
The new O(1) scheduler
Version 2.6 of the Linux kernel contains many enhancements over the previous 2.4 version, including improvements in virtual memory handling, a redesign of the block I/O layer and the introduction of pre-emption into the kernel (more on this later). V2.6 also contains a completely new scheduler, designed to overcome the problems described above.
From an architecture perspective, the new scheduler does away with the single global runqueue and replaces it with one queue per processor. This change alone addresses two of the biggest issues: multiprocessor scalability and CPU affinity. Having runqueues available on a per-processor basis eliminates the need to serialize one global runqueue and eliminates waiting for the scheduler: Multiple processors can now execute the scheduler concurrently, each on their own local queue.
CPU affinity is naturally improved in this scheme. Because runqueues are local to each processor, an interrupted task will resume on the same processor it was running on before the interruption, and any performance improvement realized by local processor and memory caches is maintained. Of course, strictly maintaining CPU affinity can be counterproductive in cases in which there is an imbalance between the amount of work in each processor's runqueue. To avoid this situation, the scheduler will, when appropriate, migrate processes between CPUs to balance the workload. This migration is now, however, the exception rather than the rule.
Moving to multiple queues does not in itself solve the O(n) scalability issue. The runqueues for each processor still grow linearly with the number of tasks in the system and, even if nothing else has changed, the scheduler still has to scan all n entries, and the O(n) behavior would still occur (albeit now divided across available CPUs). To remedy this, the new scheduler implements a priority-based array of task entries that enables the highest-priority task to be found quickly. It also eliminates the recalculation of priorities and timeslices by maintaining two queues: an active queue and an expired queue.
In this scheme, the scheduler recalculates the timeslice of an expired task before it places it on the expired queue. When all the tasks expire, the scheduler simply needs to swap the active and expired queue pointers and schedule the next task. Long scans of runqueues are, thus, eliminated. This process takes the same amount of processing, irrespective of the number of tasks in the system. It no longer depends on the value of n, but is a fixed constant. In the same notation, this constant behavior is designated as O(1), and the new scheduler is often referred to as the “O(1) scheduler.”
Thus the new scheduler scales well not only as we add processes but as we add processors. Many of the previous concerns about Linux scalability have therefore been addressed. Expect to see more Linux implementations in large Symmetric Multiprocessing (SMP) environments in the near future. Early benchmarks bear out these predictions.
With better SMP performance and shorter scheduling paths, the 2.6 kernel goes a long way to meeting two of the three goals outlined earlier in this article: the ability to deliver fast transaction response times by effectively using multiprocessor systems and the ability to drive to higher levels of CPU utilization by maintaining more active tasks at any given time. But what about our third goal: providing a responsive service to the end-user?
In theory, responsiveness should be guaranteed by running the interactive task at a higher priority than a noninteractive task, and, indeed, this prioritization is honored. The scheduler will pick the highest-priority task to run at any given time, also true in the pre-2.6 world. What has changed in 2.6 is what happens before the scheduler gets to run. Earlier kernels are nonpre-emptible: If a task makes a system call and is running code in the kernel, it cannot be interrupted until that kernel code is completed. If the running task in question is the low-priority task, the nonpre-emptive nature of the pre-2.6 kernel effectively gives this task priority, thus, impeding the responsiveness of the higher-priority task.
In 2.6, the kernel is pre-emptible; our higher priority task can now interrupt the lower-priority task even during the processing of a system call. The high-priority task can, therefore, get control of the CPU faster than was previously possible and deliver more responsive service for time-critical tasks.
Conclusion
There is little doubt that 2.6 is a very significant release in terms of Linux in the data center and that the new scheduler is a key part of realizing large-scale enterprise-class Linux implementations. With 2.6, Linux loses the last vestiges of its desktop development heritage and positions Linux as being truly ready for the data center.
Additional resources
For more information, contact your local Candle account representative or request information from the left-hand menu of this Web page.
|
|
|
 | Continuous file backup without scheduling, tapes or worries! |  |
 |
|
|
 |
|
|
|
|