The Linux Timekeeping Architecture

Linux must carry on several time-related activities. For instance, the kernel periodically:

• Updates the time elapsed since system startup.

• Updates the time and date.

• Determines, for every CPU, how long the current process has been running, and preempts it if it has exceeded the time allocated to it. The allocation of time slots (also called quanta) is discussed in Chapter 11.

• Updates resource usage statistics.

• Checks whether the interval of time associated with each software timer (see Section 6.6 later in this chapter) has elapsed.

Linux's timekeeping architecture is the set of kernel data structures and functions related to the flow of time. Actually, Intel-based multiprocessor machines have a timekeeping architecture that is slightly different from the timekeeping architecture of uniprocessor machines:

• In a uniprocessor system, all time-keeping activities are triggered by interrupts raised by the Programmable Interval Timer.

• In a multiprocessor system, all general activities (like handling of software timers) are triggered by the interrupts raised by the PIT, while CPU-specific activities (like monitoring the execution time of the currently running process) are triggered by the interrupts raised by the local APIC timers.

Unfortunately, the distinction between the two cases is somewhat blurred. For instance, some early SMP systems based on Intel 80486 processors didn't have local APICs. Even nowadays, there are SMP motherboards so buggy that local timer interrupts are not usable at all. In these cases, the SMP kernel must resort to the UP timekeeping architecture. On the other hand, recent uniprocessor systems have a local APIC and an I/O APIC, so the kernel may use the SMP timekeeping architecture. Another significant case holds when an SMP-enabled kernel is running on a uniprocessor machine. However, to simplify our description, we won't discuss these hybrid cases and will stick to the two "pure" timekeeping architectures.

Linux's timekeeping architecture depends also on the availability of the Time Stamp Counter (TSC). The kernel uses two basic timekeeping functions: one to keep the current time up to date and another to count the number of microseconds that have elapsed within the current second. There are two different ways to get the last value. One method is more precise and is available if the CPU has a Time Stamp Counter; a less-precise method is used in the opposite case (see Section 6.7.1 later in this chapter).

6.2.1 Timekeeping Architecture in Uniprocessor Systems

In a uniprocessor system, all time-related activities are triggered by the interrupts raised by the Programmable Interval Timer on IRQ line 0. As usual, in Linux, some of these activities are executed as soon as possible after the interrupt is raised (in the "top half" of the interrupt handler), while the remaining activities are delayed (in the "bottom half" of the interrupt handler).

6.2.1.1 PIT's interrupt service routine

The time_init( ) function sets up the interrupt gate corresponding to IRQ 0 during kernel setup. Once this is done, the handler field of IRQ 0's irqaction descriptor contains the address of the timer_interrupt( ) function. This function starts running with the interrupts disabled, since the status field of IRQ 0's main descriptor has the SA_INTERRUPT flag set. It performs the following steps:

1. If the CPU has a TSC register, it performs the following substeps:

a. Executes an rdtsc assembly language instruction to store the 32 least-significant bits of the TSC register in the last_tsc_low variable.

b. Reads the state of the 8254 chip's internal oscillator and computes the delay between the timer interrupt occurrence and the execution of the interrupt service routine.[3]

[3] The 8254 oscillator drives a counter that is continuously decremented. When the counter becomes 0, the chip raises an IRQ 0. Thus, reading the counter indicates how much time has elapsed since the interrupt occurred.

2. Stores that delay (in microseconds) in the delay_at_last_interrupt variable; as we shall see in Section 6.7.1, this variable is used to provide the correct time to user processes.

do_timer_interrupt( ), which may be considered the PIT's interrupt service routine common to all 80x86 models, essentially executes the following operations:

1. It invokes the do_timer( ) function, which is fully explained shortly.

2. If the timer interrupt occurred in Kernel Mode, it invokes the x86_do_profile( ) function (see Section 6.5.3 later in this chapter).

3. If an adjtimex( ) system call has been issued, it invokes the set_rtc_mmss( ) function once every 660 seconds (every 11 minutes) to adjust the Real Time Clock. This feature helps systems on a network synchronize their clocks (see Section 6.7.2 later in this chapter).

The do_timer( ) function, which runs with the interrupts disabled, must be executed as quickly as possible. For this reason, it simply updates one fundamental value—the time elapsed from system startup—and checks whether the running process has exhausted its time quantum, while delegating all remaining activities to the TIMER_BH bottom half.

The function is equivalent to:

void do_timer(struct pt_regs * regs)
{
    jiffies++;
    update_process_times(user_mode(regs)); /* UP only */
    mark_bh(TIMER_BH);
    if (TQ_ACTIVE(tq_timer))
        mark_bh(TQUEUE_BH);
}

The jiffies global variable stores the number of elapsed ticks since the system was started. It is set to 0 during kernel initialization and incremented by 1 when a timer interrupt occurs — that is, on every tick. Since jiffies is a 32-bit unsigned integer, it returns to 0 about 497 days after the system has been booted. However, the kernel is smart enough to handle the overflow without getting confused.

The update_process_times( ) function essentially checks how long the current process has been running; it is described in Section 6.3 later in this chapter.

Finally, do_timer( ) activates the TIMER_BH bottom half; if the tq_timer task queue is not empty (see Section 4.7), the function also activates the TQUEUE_BH bottom half.

6.2.1.2 The TIMER_BH bottom half

Each invocation of the "top half" of the PIT's timer interrupt handler marks the TIMER_BH bottom half as active. As soon as the kernel leaves interrupt mode, the timer_bh( ) function, which is associated with TIMER_BH, starts:

void timer_bh(void)
{
    update_times( );
    run_timer_list( );
}

The update_times( ) function updates the system date and time and computes the current system load; these activities are discussed later in Section 6.4 and Section 6.5. The run_timer_list( ) function takes care of the handling of software timers; it is discussed in Section 6.6 later in this chapter.

6.2.2 Timekeeping Architecture in Multiprocessor Systems

In multiprocessor systems, timer interrupts raised by the Programmable Interval Timer still play an important role. Indeed, the corresponding interrupt handler takes care of activities not related to a specific CPU, such as the handling of software timers and keeping the system time up to date. As in the uniprocessor case, the most urgent activities are performed by the "top half" of the interrupt handler (see Section 6.2.1.1 earlier in this chapter), while the remaining activities are delayed until the execution of the TIMER_BH bottom half (see Section 6.2.1.2 earlier in this chapter).

However, the SMP version of the PIT's interrupt service routine differs from the UP version in a few points:

• The timer_interrupt( ) function acquires the xtime_lock read/write spin lock for writing. Although local interrupts are disabled, the kernel must protect the xtime, last_tsc_low, and delay_at_last_interrupt global variables from concurrent read and write accesses performed by other CPUs (see Section 6.4 later in this chapter).

• The do_timer_interrupt( ) function does not invoke the x86_do_profile( ) function because this function performs actions related to a specific CPU.

• The do_timer( ) function does not invoke update_process_times( ) because this function also performs actions related to a specific CPU.

There are two timekeeping activities related to every specific CPU in the system:

• Monitoring how much time the current process has been running on the CPU

• Updating the resource usage statistics of the CPU

To simplify the overall timekeeping architecture, in Linux 2.4, every CPU takes care of these activities in the handler of the local timer interrupt raised by the APIC device embedded in the CPU. In this way, the number of accessed spin locks is minimized, since every CPU tends to access only its own "private" data structures.

6.2.2.1 Initialization of the timekeeping architecture

During kernel initialization, each APIC has to be told how often to generate a local timer interrupt. The setup_APIC_clocks( ) function programs the local APICs of all CPUs to generate interrupts as follows:

void setup_APIC_clocks(void)
{
    __cli( );
    calibration_result = calibrate_APIC_clock( );
    setup_APIC_timer((void *)calibration_result);
    __sti( );
    smp_call_function(setup_APIC_timer, (void *)calibration_result, 1, 1);
}

The calibrate_APIC_clock( ) function computes how many local timer interrupts are generated by the local APIC of the booting CPU during a tick (10 ms). This value is then used to program the local APICs so as to generate one local timer interrupt every tick. This is done by the setup_APIC_timer( ) function, which is invoked directly on the booting CPU, and through a call_function_vector Interprocessor Interrupt (IPI) on the other CPUs (see Section 4.6.2).

All local APIC timers are synchronized because they are based on the common bus clock signal. This means that the value computed by calibrate_APIC_clock( ) for the booting CPU is good also for the other CPUs in the system. However, we don't really want to have all local timer interrupts generated at exactly the same time because this could induce a substantial performance penalty due to waits on spin locks. For the same reason, a local timer interrupt handler should not run on a CPU when a PIT's timer interrupt handler is being executed on another CPU.

Therefore, the setup_APIC_timer( ) function spreads the local timer interrupts inside the tick interval. Figure 6-1 shows an example. In a multiprocessor system with four CPUs, the beginning of the tick is marked by the PIT's timer interrupt. Two milliseconds after the PIT's timer interrupt, the local APIC of CPU 0 raises its local timer interrupt; two milliseconds later, it is the turn of the local APIC of CPU 1, and so on. Two milliseconds after the local timer interrupt of CPU 3, the PIT raises another timer interrupt on the IRQ 0 line and starts a new tick.

Figure 6-1. Spreading local timer interrupts inside a tick


setup_APIC_timer( ) programs the local APIC so as to raise timer interrupts that have vector LOCAL_TIMER_VECTOR (usually 0xef); moreover, the init_IRQ( ) function associates LOCAL_TIMER_VECTOR with the low-level interrupt handler apic_timer_interrupt( ).

6.2.2.2 The local timer interrupt handler

The apic_timer_interrupt( ) assembly language function is equivalent to the following code:

apic_timer_interrupt:
    pushl $(LOCAL_TIMER_VECTOR-256)
    SAVE_ALL
    movl %esp,%eax
    pushl %eax
    call smp_apic_timer_interrupt
    addl $4,%esp
    jmp ret_from_intr

As you can see, the low-level handler is very similar to the other low-level interrupt handlers already described in Chapter 4. The high-level interrupt handler called smp_apic_timer_interrupt( ) executes the following steps:

1. Gets the CPU logical number (say n)

2. Increments the nth entry of the apic_timer_irqs array by 1 (see Section 6.5.4 later in this chapter)

3. Acknowledges the interrupt on the local APIC

4. Calls the irq_enter( ) function to increment the nth entry of the local_irq_count array and to honor the global_irq_lock spin lock (see Chapter 5)

5. Invokes the smp_local_timer_interrupt( ) function

6. Calls the irq_exit( ) function to decrement the nth entry of the local_irq_count array

7. Invokes do_softirq( ) if some softirqs are pending (see Section 4.7.1)

The smp_local_timer_interrupt( ) function executes the per-CPU timekeeping activities. Actually, it performs the following steps:

1. Invokes the x86_do_profile( ) function if the timer interrupt occurred in Kernel Mode (see Section 6.5.3 later in this chapter)

2. Invokes the update_process_times( ) function to check how long the current process has been running (see Section 6.3 later in this chapter)[4]

[4] The system administrator can change the sample frequency of the kernel code profiler. To do this, the kernel changes the frequency at which local timer interrupts are generated. However, the smp_local_timer_interrupt( ) function keeps invoking the update_process_times( ) function exactly once every tick. Unfortunately, changing the frequency of a local timer interrupt destroys the elegant spreading of the local timer interrupts inside a tick interval.

6.3 CPU's Time Sharing

Timer interrupts are essential for time-sharing the CPU among runnable processes (that is, those in the TASK_RUNNING state). As we shall see in Chapter 11, each process is usually allowed a quantum of time of limited duration: if the process has not terminated when its quantum expires, the schedule( ) function selects the new process to run.

The counter field of the process descriptor specifies how many ticks of CPU time are left to the process. The quantum is always a multiple of a tick (that is, of about 10 ms). The value of counter is updated at every tick by update_process_times( ), which is invoked either by the PIT's timer interrupt handler on uniprocessor systems or by the local timer interrupt handler on multiprocessor systems. The code is equivalent to the following:

if (current->pid) {
    --current->counter;
    if (current->counter <= 0) {
        current->counter = 0;
        current->need_resched = 1;
    }
}

The snippet of code starts by making sure the kernel is not handling the process with PID 0 (the swapper process associated with the executing CPU). It must not be time-shared, because it is the process that runs on the CPU when no other TASK_RUNNING processes exist (see Section 3.2.2).

When counter reaches 0, the need_resched field of the process descriptor is set to 1. In this case, the schedule( ) function is invoked before resuming User Mode execution, and other TASK_RUNNING processes will have a chance to resume execution on the CPU.
