The Virtual Clock

I discussed in the Introduction that the completely fair scheduling algorithm depends on a virtual clock that measures the amount of time a waiting process would have been allowed to spend on the CPU on a completely fair system. However, no virtual clock can be found anywhere in the data structures! This is because all required information can be inferred from the existing real-time clocks combined with the load weight associated with every process. All calculations related to the virtual clock are performed in update_curr, which is called from various places in the system including the periodic scheduler. The code flow diagram in Figure 2-17 provides an overview of what the function does.

Set rq->exec_start

Figure 2-17: Code flow diagram for update_curr.

Set rq->exec_start

Figure 2-17: Code flow diagram for update_curr.

First of all, the function determines the currently executing process of the run queue and also obtains the real clock value of the main scheduler run queue, which is updated at each scheduler tick (rq_of is an auxiliary function to determine the instance of struct rq that is associated with a CFS run queue):

static void update_curr(struct cfs_rq *cfs_rq) {

struct sched_entity *curr = cfs_rq->curr; u64 now = rq_of(cfs_rq)->clock; unsigned long delta_exec;

if (unlikely(!curr)) return;

If no process is currently executing on the run queue, there is obviously nothing to do. Otherwise, the kernel computes the time difference between the last update of the load statistics and now, and delegates the rest of the work to_update_curr.

kernel/sched_fair.c delta_exec = (unsigned long)(now - curr->exec_start);

_update_curr(cfs_rq, curr, delta_exec);

Based on this information, __update_curr has to update the physical and virtual time that the current process has spent executing on the CPU. This is simple for the physical time. The time difference just needs to be added to the previously accounted time:

kernel/sched_fair.c static inline void

_update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr, unsigned long delta_exec)

unsigned long delta_exec_weighted; u64 vruntime;

curr->sum_exec_runtime += delta_exec;

The interesting thing is how the non-existing virtual clock is emulated using the given information. Once more, the kernel is clever and saves some time in the common case: For processes that run at nice level 0, virtual and physical time are identical by definition. When a different priority is used, the time must be weighted according to the load weight of the process (recall that Section 2.5.3 discussed how process priority and load weight are connected):

kernel/sched_fair.c delta_exec_weighted = delta_exec; if (unlikely(curr->load.weight != NICE_0_LOAD)) {

delta_exec_weighted = calc_delta_fair(delta_exec_weighted,

curr->vruntime += delta_exec_weighted;

Neglecting some rounding and overflow checking, what calc_delta_fair does is to compute the value given by the following formula:


delta_exec_weighted = delta_exec x -


The inverse weight values mentioned above can be brought to good use in this calculation. Recall that more important tasks with higher priorities (i.e., lower nice values) will get larger weights, so the virtual run time accounted to them will be smaller. Figure 2-18 illustrates the connection between real and virtual time for various priorities. One can also see from the formula that the virtual and physical time are identical for nice 0 tasks with priority 120, that is, if current->load.weight is NICE_0_LOAD. Notice that the inset in Figure 2-18 uses a double logarithmic plot to show a wider range of priorities.







100000 - 10000 1000 100 10 1

Nice 0 (prio 120) -Nice +10 (prio 130) Nice +19 (prio 139)'-Nice -10 (prio' 110) Nice,-20 (prio 100)

Nice 0 (prio 120) -Nice +10 (prio 130) Nice +19 (prio 139)'-Nice -10 (prio' 110) Nice,-20 (prio 100)


Nice 0 (prio 120) -Nice +5 (prio 125) Nice -5 (prio 115) -

300 400 500 600 700 Real time interval (milliseconds) Figure 2-18: Relation between real and virtual time for processes depending on their priority/nice level.


Finally, the kernel needs to set min_vruntime. Care is taken to ensure that the value is increasing mono-tonically.


* maintain cfs_rq->min_vruntime to be a monotonically increasing

* value tracking the leftmost vruntime in the tree. */

vruntime = min_vruntime(curr->vruntime,


} else vruntime = curr->vruntime;

cfs_rq->min_vruntime =

max_vruntime(cfs_rq->min_vruntime, vruntime);

first_fair is a helper function that checks if the tree has a leftmost element, that is, if any process is waiting on the tree to be scheduled. If so, the kernel obtains its vruntime, which is the smallest of all elements in the tree. If no leftmost element is in the tree because it is empty, the virtual run time of the current process is used instead. To ensure that the per-queue min_vruntime is monotonic increasing, the kernel sets it to the larger of both values. This means that the per-queue min_vruntime is only updated if it is exceeded by the vruntime of one of the elements on the tree. With this policy, the kernel ensures that min_vrtime can only increase, but never decrease.

One really crucial point of the completely fair scheduler is that sorting processes on the red-black tree is based on the following key:

kernel/sched_fair.c static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se) {

return se->vruntime - cfs_rq->min_vruntime;

Elements with a smaller key will be placed more to the left, and thus be scheduled more quickly. This way, the kernel implements two antagonistic mechanisms:

1. When a process is running, its vruntime will steadily increase, so it will finally move right-ward in the red-black tree.

Because vruntime will increase more slowly for more important processes, they will also move rightward more slowly, so their chance to be scheduled is bigger than for a less important process —just as required.

2. If a process sleeps, its vruntime will remain unchanged. Because the per-queue min_vruntime increases in the meantime (recall that it is monotonic!), the sleeper will be placed more to the left after waking up because the key got smaller.28

In practice, both effects naturally happen simultaneously, but this does not influence the interpretation. Figure 2-19 illustrates the different movement mechanisms on the red-black tree graphically.

min_vruntime f <--f Value increases

T-*' i Z a ^ Value decreases vruntime | vruntime f T

Position in the redblack tree (more to the left is better)

+1 0

Post a comment