Exiting Processes

Processes must terminate with the exit system call. This gives the kernel the opportunity to release the resources used by the process back to the system.19 The entry point for this call is the sys_exit function, which requires an error code as its parameter in order to terminate the process. Its definition is architecture-independent and is held in kernel/exit.c. Its implementation is not particularly interesting because it immediately delegates its work to do_exit.
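The delegation itself is just a one-liner. The following excerpt is simplified from kernel/exit.c (the exact form varies slightly between kernel versions): sys_exit masks the error code to its low byte and shifts it into the exit-code format that do_exit and the wait family of system calls expect.

kernel/exit.c

    asmlinkage long sys_exit(int error_code)
    {
            /* Keep only the low byte and shift it into the position
             * in which wait() and friends expect the exit status. */
            do_exit((error_code & 0xff) << 8);
    }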

Suffice it to say that the implementation of this function consists essentially of decrementing reference counters and returning memory areas to memory management once the reference counter has reverted to 0 and the corresponding structure is no longer being used by any process in the system.

2.5 Implementation of the Scheduler

A unique description of each process is held in memory and is linked with other processes by means of several structures. This is the situation facing the scheduler, whose task is to share CPU time between the programs to create the illusion of concurrent execution. As discussed above, this task is split into two different parts — one relating to the scheduling policy and the other to context switching.

19exit can be called explicitly by the programmer. However, the compiler automatically adds a corresponding call to the end of the main function (or to the equivalent entry function of the particular language).

2.5.1 Overview

The kernel must provide a method of sharing CPU time as fairly as possible between the individual processes while at the same time taking into account differing task priorities. There are many ways of doing this, and all have their pros and cons, which we need not discuss here (see [Tan07] for an overview of potential approaches). Our focus is on the solution adopted in the Linux kernel.

The schedule function is the starting point to an understanding of scheduling operations. It is defined in kernel/sched.c and is one of the most frequently invoked functions in the kernel code. The implementation of the scheduler is obscured a little by several factors:

□ On multiprocessor systems, several details (some very subtle) must be observed so that the scheduler does not get in its own way.

□ In addition to priority scheduling, two soft real-time policies required by the POSIX standard are implemented.

□ gotos are used to generate optimal assembly language code. These jump backward and forward in the C code and run counter to all principles of structured programming. However, this feature can be beneficial if it is used with great care, and the scheduler is one example where gotos make sense.

In the following overview, I consider the completely fair scheduler and neglect real-time tasks for now; I come back to them later. An outstanding feature of the Linux scheduler is that it does not require the concept of time slices, at least not in the traditional way. Classical schedulers compute time slices for each process in the system and allow them to run until their time slice is used up; only when all time slices of all processes have been used up are they recalculated. The current scheduler, in contrast, considers only the wait time of a process, that is, how long it has been sitting in the run queue waiting to be executed. The task with the gravest need for CPU time is scheduled.

The general principle of the scheduler is to provide maximum fairness to each task in the system in terms of the computational power it is given. Or, put differently, it tries to ensure that no task is treated unfairly. Now this clearly sounds good, but what do fair and unfair with respect to CPU time mean? Consider an ideal computer that can run an arbitrary number of tasks in parallel: If N processes are present on the system, then each one gets 1/N of the total computational power, and all tasks really do execute physically in parallel. Suppose that a task requires 10 minutes to complete its work. If 5 such tasks are simultaneously present on a perfect CPU, each will get 20 percent of the computational power, which means that it will be running for 50 instead of 10 minutes. However, all 5 tasks will finish their job after exactly this time span, and none of them will have ever been inactive!

This is clearly not achievable on real hardware: If a system has only a single CPU, at most one process can run at a time. Multitasking is only achieved by switching back and forth between the tasks with high frequency. For users, who think considerably more slowly than the switching frequency, this creates the illusion of parallel execution, but in reality, there is none. While more CPUs in the system improve the situation and allow perfect parallel execution of a small number of tasks, there will always be situations in which there are fewer CPUs than runnable processes, and the problem starts anew.

If multitasking is simulated by running one process after another, then the process that is currently running is favored over those waiting to be picked by the scheduler — the poor waiting processes are being treated unfairly. The unfairness is directly proportional to the waiting time.

Every time the scheduler is called, it picks the task with the longest waiting time and gives the CPU to it. If this happens often enough, no large unfairness will accumulate for any task, and the unfairness will be evenly distributed among all tasks in the system.

Figure 2-12 illustrates how the scheduler keeps track of which process has been waiting for how long. Since runnable processes are queued, the structure is known as the run queue.

Figure 2-12: The scheduler keeps track of the waiting time of the available processes by sorting them in a red-black tree. The tasks are time-ordered with respect to their wait time.

All runnable tasks are time-ordered in a red-black tree, essentially with respect to their waiting time. The task that has been waiting for the CPU for the largest amount of time is the leftmost entry and will be considered next by the scheduler. Tasks that have been waiting for a shorter time are placed progressively further to the right in the tree.
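Picking the next task to run thus amounts to grabbing the leftmost element of the tree. The following is a hedged sketch of this lookup using the kernel's generic rbtree helpers rb_first and rb_entry; the structures are heavily simplified, and the function name is invented for illustration rather than taken from the scheduler sources.

    #include <linux/rbtree.h>

    /* Minimal stand-ins for the real scheduler structures; the field
     * names run_node and tasks_timeline loosely follow the sources. */
    struct sched_entity {
            struct rb_node run_node;        /* node in the time-ordered tree */
            /* ... further per-task scheduling data ... */
    };

    struct cfs_rq {
            struct rb_root tasks_timeline;  /* root of the red-black tree */
            /* ... */
    };

    /* Pick the leftmost, that is, most unfairly treated, task. */
    static struct sched_entity *pick_next_sketch(struct cfs_rq *cfs_rq)
    {
            struct rb_node *leftmost = rb_first(&cfs_rq->tasks_timeline);

            if (!leftmost)
                    return NULL;    /* no runnable task on this queue */

            /* Map the rb_node back to the sched_entity embedding it. */
            return rb_entry(leftmost, struct sched_entity, run_node);
    }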

If you are not familiar with red-black trees, suffice it to know here that this data structure allows for efficient management of the entries it contains, and that the time required for lookup, insertion, and deletion operations will only moderately rise with the number of processes present in the tree.20 Red-black trees are available as a standard data structure of the kernel, and Appendix C provides more information about them. Besides, a discussion of such trees can be found in every textbook on data structures.

Besides the red-black tree, a run queue is also equipped with a virtual clock.21 Time passes slower on this clock than in real time, and the exact speed depends on the number of processes that are currently waiting to be picked by the scheduler. Suppose that four processes are on the queue: Then the virtual clock will run at one-quarter of the speed of a real clock. This is the basis to determine how much CPU time a waiting process would have gotten if computational power could be shared in a completely fair manner. Sitting on the run queue for 20 seconds in real time amounts to 5 seconds in virtual time. Four tasks executing for 5 seconds each would keep the CPU occupied for 20 seconds in real time.
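The effect of the virtual clock can be captured in a few lines of plain C. This is only an illustration of the concept with invented names; the kernel emulates the virtual clock differently, as footnote 21 and the implementation discussion explain.

    /* Advance the virtual clock: with nr_running tasks on the queue,
     * virtual time passes at 1/nr_running of the speed of real time. */
    static unsigned long long
    advance_fair_clock(unsigned long long fair_clock,
                       unsigned long long delta_real,
                       unsigned long nr_running)
    {
            if (nr_running)
                    fair_clock += delta_real / nr_running;

            return fair_clock;
    }

With four runnable tasks, 20 seconds of real time amount to advance_fair_clock(0, 20, 4) == 5 virtual seconds, matching the example above.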

20To be precise: Time complexity is O(log n), where n is the number of elements in the tree. This is worse than for the old scheduler, which was famous for being an O(1) scheduler, that is, its run time was independent of the number of processes it had to deal with. However, the slow-down caused by the logarithmic dependency of the new scheduler is negligible unless a huge number of processes is simultaneously runnable. In practice, such a situation does not occur.

21Notice that the kernel really did use the concept of a virtual clock for the scheduling mechanism in kernel 2.6.23, but more recent versions compute the virtual time a little differently. Since the method is easier to understand with virtual clocks, I will stick to this picture for now and discuss how the virtual clock is emulated when I discuss the scheduler implementation.

Suppose that the virtual time of the run queue is given by fair_clock, while the waiting time of a process is stored in wait_runtime. To sort tasks on the red-black tree, the kernel uses the difference fair_clock - wait_runtime. While fair_clock is a measure of the CPU time a task would have gotten if scheduling were completely fair, wait_runtime is a direct measure of the unfairness caused by the imperfection of real systems.

When a task is allowed to run, the interval during which it has been running is subtracted from wait_runtime. This way, it will move rightward in the time-ordered tree at some point, and another process will be the leftmost one, which is consequently selected to run. Notice, however, that the virtual clock in fair_clock will increase while the task is running. This effectively means that the share of CPU time that the task would have received in a perfectly fair system is deducted from the time spent executing on the real CPU. This slows the degradation of unfairness: Decrementing wait_runtime is equivalent to lowering the amount of unfairness received by the task, but the kernel must not forget that some portion of the time used to lower the unfairness would have belonged to the process in a completely fair world anyway. Suppose again that four processes sit on the run queue, and that a process has been waiting for 20 real seconds. Now it is allowed to run for 10 seconds: wait_runtime is afterward 10, but since the process would have gotten 10/4 = 2.5 seconds of this time span anyway, effectively only 7.5 time units account for the potentially new position on the run queue.
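This bookkeeping can be condensed into the following sketch. The names fair_clock and wait_runtime mirror the discussion above, but the structure and helpers are invented for illustration; note also that the toy code uses double for readability, whereas the kernel avoids floating-point arithmetic and works with scaled integers instead.

    /* Simplified per-run-queue state; the kernel keeps this (and much
     * more) in its run queue and per-task scheduling structures. */
    struct toy_rq {
            double fair_clock;              /* virtual time of the queue */
            unsigned long nr_running;       /* number of runnable tasks */
    };

    /* Sort key in the red-black tree: the smaller the key, the further
     * left the task sits, and the sooner it is selected. */
    static double toy_key(const struct toy_rq *rq, double wait_runtime)
    {
            return rq->fair_clock - wait_runtime;
    }

    /* The task has just run for delta time units: its accumulated
     * unfairness shrinks by delta, while the virtual clock advances by
     * the fair share delta/nr_running. The net reduction in unfairness
     * is thus delta - delta/nr_running, i.e., 10 - 2.5 = 7.5 units in
     * the example above. */
    static void toy_account_run(struct toy_rq *rq, double *wait_runtime,
                                double delta)
    {
            *wait_runtime -= delta;
            rq->fair_clock += delta / rq->nr_running;
    }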

Unfortunately, this strategy is complicated by a number of real-world issues:

□ Different priority levels for tasks (i.e., nice values) must be taken into account, and more important processes must get a higher share of CPU time than less important ones.

□ Tasks must not be switched too often because a context switch, that is, changing from one task to another, has a certain overhead. When switching happens too often, too much time is spent exchanging tasks, and this time is then no longer available for useful work.

On the other hand, the time that goes by between task switches must not be too long because large unfairness values could accumulate in this case. Letting tasks run for too long can also lead to larger latencies than desired for multimedia systems.

We will see how the scheduler tackles these problems in the following discussion.

A good way to understand scheduling decisions is to activate scheduler statistics at compile time. This will generate the file /proc/sched_debug, which contains information on all aspects of the current state of the scheduler.
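The statistics are enabled with the configuration option CONFIG_SCHED_DEBUG, and the file can then be read like any other procfs entry, for instance with cat /proc/sched_debug; among other things, the output lists the state of each run queue together with the currently enqueued tasks.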

Finally, note that the Documentation/ directory contains some files that relate to various aspects of the scheduler. Keep in mind, however, that some of them still relate to the old O(1) scheduler and are therefore outdated!
