Process Switch

To control the execution of processes, the kernel must be able to suspend the execution of the process running on the CPU and resume the execution of some other process previously suspended. This activity goes variously by the names process switch, task switch, or context switch. The next sections describe the elements of process switching in Linux.

3.3.1 Hardware Context

While each process can have its own address space, all processes have to share the CPU registers. So before resuming the execution of a process, the kernel must ensure that each such register is loaded with the value it had when the process was suspended.

The set of data that must be loaded into the registers before the process resumes its execution on the CPU is called the hardware context. The hardware context is a subset of the process execution context, which includes all information needed for the process execution. In Linux, a part of the hardware context of a process is stored in the process descriptor, while the remaining part is saved in the Kernel Mode stack.

In the description that follows, we will assume the prev local variable refers to the process descriptor of the process being switched out and next refers to the one being switched in to replace it. We can thus define a process switch as the activity consisting of saving the hardware context of prev and replacing it with the hardware context of next. Since process switches occur quite often, it is important to minimize the time spent in saving and loading hardware contexts.
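The definition above can be illustrated with a minimal user-space sketch. It is not kernel code: the `cpu_regs` and `task` structures and the `process_switch` function are hypothetical names invented for this illustration, and only a handful of registers are modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register file: a few of the registers that every
 * process must share on a single CPU. */
struct cpu_regs {
    uint32_t esp, eip, ebx, esi, edi, ebp;
};

/* Hypothetical, stripped-down process descriptor. */
struct task {
    const char     *name;
    struct cpu_regs saved;   /* hardware context while the task is suspended */
};

/* A process switch in the sense defined above: save the hardware
 * context of prev, then replace it with the hardware context of next. */
static void process_switch(struct cpu_regs *cpu,
                           struct task *prev, struct task *next)
{
    assert(cpu != 0 && prev != 0 && next != 0);
    prev->saved = *cpu;      /* save the hardware context of prev */
    *cpu = next->saved;      /* load the hardware context of next */
}
```

Switching from A to B and later back to A leaves the registers exactly as A left them, which is the property the kernel must guarantee before resuming a process.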

Old versions of Linux took advantage of the hardware support offered by the Intel architecture and performed a process switch through a far jmp instruction [4] to the selector of the Task State Segment Descriptor of the next process. While executing the instruction, the CPU performs a hardware context switch by automatically saving the old hardware context and loading a new one. But Linux 2.4 uses software to perform a process switch for the following reasons:

[4] far jmp instructions modify both the cs and eip registers, while simple jmp instructions modify only eip.

• Step-by-step switching performed through a sequence of mov instructions allows better control over the validity of the data being loaded. In particular, it is possible to check the values of segmentation registers. This type of checking is not possible when using a single far jmp instruction.

• The amount of time required by the old approach and the new approach is about the same. However, it is not possible to optimize a hardware context switch, while there might be room for improving the current switching code.

Process switching occurs only in Kernel Mode. The contents of all registers used by a process in User Mode have already been saved before performing process switching (see Chapter 4). This includes the contents of the ss and esp pair that specifies the User Mode stack pointer address.

3.3.2 Task State Segment

The 80 x 86 architecture includes a specific segment type called the Task State Segment (TSS), to store hardware contexts. Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system. This is done for two main reasons:

• When an 80 x 86 CPU switches from User Mode to Kernel Mode, it fetches the address of the Kernel Mode stack from the TSS (see Chapter 4).

• When a User Mode process attempts to access an I/O port by means of an in or out instruction, the CPU may need to access an I/O Permission Bitmap stored in the TSS to verify whether the process is allowed to address the port.

More precisely, when a process executes an in or out I/O instruction in User Mode, the control unit performs the following operations:

1. It checks the 2-bit IOPL field in the eflags register. If it is set to 3, the control unit executes the I/O instructions. Otherwise, it performs the next check.

2. It accesses the tr register to determine the current TSS, and thus the proper I/O Permission Bitmap.

3. It checks the bit of the I/O Permission Bitmap corresponding to the I/O port specified in the I/O instruction. If it is cleared, the instruction is executed; otherwise, the control unit raises a "General protection error" exception.
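The three checks above can be sketched as a small C function. This is a software model of the decision only, not what the control unit literally runs. The function name `io_access_allowed`, the 1024-port bitmap size (taken from the 128-byte copy that appears later in this section), and the choice to deny ports beyond the bitmap are assumptions of this sketch; it also models single-byte port accesses only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IO_BITMAP_PORTS 1024   /* assumption: 128-byte bitmap, one bit per port */

/* Returns true if a User Mode in/out on `port` would be executed,
 * false if a general protection exception would be raised. */
static bool io_access_allowed(unsigned iopl, const uint8_t *bitmap,
                              unsigned port)
{
    assert(iopl <= 3);                 /* IOPL is a 2-bit field */

    if (iopl == 3)                     /* check 1: IOPL field of eflags */
        return true;

    /* check 2 is implicit here: `bitmap` stands for the I/O Permission
     * Bitmap of the TSS currently selected by the tr register. */
    if (port >= IO_BITMAP_PORTS)       /* modeling choice: outside the bitmap */
        return false;

    /* check 3: a cleared bit grants access, a set bit denies it */
    return (bitmap[port / 8] & (1u << (port % 8))) == 0;
}
```

Note the inverted sense of the bitmap: a bit that is cleared means the port may be accessed.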

The tss_struct structure describes the format of the TSS. As already mentioned in Chapter 2, the init_tss array stores one TSS for each different CPU on the system. At each process switch, the kernel updates some fields of the TSS so that the corresponding CPU's control unit may safely retrieve the information it needs.

Each TSS has its own 8-byte Task State Segment Descriptor (TSSD). This Descriptor includes a 32-bit Base field that points to the TSS starting address and a 20-bit Limit field. The S flag of a TSSD is cleared to denote the fact that the corresponding TSS is a System Segment.

The Type field is set to either 9 or 11 to denote that the segment is actually a TSS. In Intel's original design, each process in the system should refer to its own TSS; the second least significant bit of the Type field is called the Busy bit; it is set to 1 if the process is being executed by a CPU, and to 0 otherwise. In the Linux design, there is just one TSS for each CPU, so the Busy bit is always set to 1.
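To make the descriptor fields concrete, here is a hedged sketch that decodes an 8-byte TSSD held in a byte array. The byte positions follow the standard 80x86 segment descriptor layout; the `tssd_fields` structure and the `decode_tssd` and `is_tss` functions are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct tssd_fields {
    uint32_t base;     /* 32-bit Base: starting address of the TSS */
    uint32_t limit;    /* 20-bit Limit */
    unsigned type;     /* 4-bit Type: 9 or 11 for a TSS */
    bool     s_flag;   /* cleared for a System Segment */
    bool     busy;     /* second least significant bit of Type */
};

/* Decode the fields discussed in the text from a raw 8-byte descriptor. */
static struct tssd_fields decode_tssd(const uint8_t d[8])
{
    struct tssd_fields f;

    f.limit  = (uint32_t)d[0] | ((uint32_t)d[1] << 8)
             | ((uint32_t)(d[6] & 0x0f) << 16);
    f.base   = (uint32_t)d[2] | ((uint32_t)d[3] << 8)
             | ((uint32_t)d[4] << 16) | ((uint32_t)d[7] << 24);
    f.type   = d[5] & 0x0f;
    f.s_flag = (d[5] & 0x10) != 0;
    f.busy   = (f.type & 0x02) != 0;

    assert(f.limit <= 0xfffff);        /* the Limit field is only 20 bits wide */
    return f;
}

/* A descriptor denotes a TSS if S is cleared and Type is 9 or 11. */
static bool is_tss(const struct tssd_fields *f)
{
    return !f->s_flag && (f->type == 9 || f->type == 11);
}
```

Type 9 (binary 1001) and Type 11 (binary 1011) differ only in the Busy bit, which is why Linux, with one permanently active TSS per CPU, always ends up with Type 11.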

The TSSDs created by Linux are stored in the Global Descriptor Table (GDT), whose base address is stored in the gdtr register of each CPU. The tr register of each CPU contains the TSSD Selector of the corresponding TSS. The register also includes two hidden, nonprogrammable fields: the Base and Limit fields of the TSSD. In this way, the processor can address the TSS directly without having to retrieve the TSS address from the GDT.

The thread field

At every process switch, the hardware context of the process being replaced must be saved somewhere. It cannot be saved on the TSS, as in the original Intel design, because we cannot make assumptions about when the process being replaced will resume execution and what CPU will execute it again.

Thus, each process descriptor includes a field called thread of type thread_struct, in which the kernel saves the hardware context whenever the process is being switched out.

As we shall see later, this data structure includes fields for most of the CPU registers, such as the general-purpose registers, the floating point registers, and so on.

3.3.3 Performing the Process Switch

A process switch may occur at just one well-defined point: the schedule( ) function (discussed at length in Chapter 11). Here, we are only concerned with how the kernel performs a process switch.

Essentially, every process switch consists of two steps:

1. Switching the Page Global Directory to install a new address space; we'll describe this step in Chapter 8.

2. Switching the Kernel Mode stack and the hardware context, which provides all the information needed by the kernel to execute the new process, including the CPU registers.

Again, we assume that prev points to the descriptor of the process being replaced, and next to the descriptor of the process being activated. As we shall see in Chapter 11, prev and next are local variables of the schedule( ) function.

The second step of the process switch is performed by the switch_to macro. It is one of the most hardware-dependent routines of the kernel, and it takes some effort to understand what it does.

First of all, the macro has three parameters called prev, next, and last. The actual invocation of the macro in schedule( ) is:

switch_to(prev, next, prev);

You might easily guess the role of prev and next — they are just placeholders for the local variables prev and next — but what about the third parameter last? Well, the point is that in any process switch, three processes are involved, not just two.

Suppose the kernel decides to switch off process A and to activate process B. In the schedule( ) function, prev points to A's descriptor and next points to B's descriptor. As soon as the switch_to macro deactivates A, the execution flow of A freezes.

Later, when the kernel wants to reactivate A, it must switch off another process C (in general, this is different from B) by executing another switch_to macro with prev pointing to C and next pointing to A. When A resumes its execution flow, it finds its old Kernel Mode stack, so the prev local variable points to A's descriptor and next points to B's descriptor. The kernel, which is now executing on behalf of process A, has lost any reference to C.

The last parameter of the switch_to macro reinserts the address of C's descriptor into the prev local variable. The mechanism exploits the state of registers during function calls. The first prev parameter corresponds to a CPU register, which is loaded with the content of the prev local variable when the macro starts. When the macro ends, it writes the content of the same register in the last parameter — namely, in the prev local variable. However, the CPU register doesn't change across the process switch, so prev receives the address of C's descriptor (as we shall see in Chapter 11, the scheduler checks whether C should be readily executed on another CPU).
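The A, B, C scenario can be replayed with a small user-space model. Nothing here switches real stacks: the `frozen_prev` and `frozen_next` fields stand in for the local variables left on a suspended process's Kernel Mode stack, `cpu_reg` stands in for the register that survives the switch, and all names (`task`, `resumed_view`, `switch_to_model`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical process descriptor. The frozen_* fields model the
 * schedule() locals left on the Kernel Mode stack while suspended. */
struct task {
    const char  *name;
    struct task *frozen_prev;
    struct task *frozen_next;
};

/* What the resumed process sees right after the switch. */
struct resumed_view {
    struct task *stale_prev;   /* prev as found on its old stack */
    struct task *last;         /* value delivered through the register */
};

static struct task *cpu_reg;   /* models the register unchanged by the switch */

static struct resumed_view switch_to_model(struct task *prev,
                                           struct task *next)
{
    struct resumed_view v;

    assert(prev != NULL && next != NULL);

    /* prev freezes: its locals stay behind on its own stack. */
    prev->frozen_prev = prev;
    prev->frozen_next = next;
    cpu_reg = prev;            /* register loaded from the prev local */

    /* The stack switch would happen here. next resumes with whatever
     * locals it froze earlier, but the register still names the
     * process that was just switched out. */
    v.stale_prev = next->frozen_prev;
    v.last = cpu_reg;
    return v;
}
```

Running A to B, then B to C, then C to A shows the point of the third parameter: when A resumes, its stale locals still say prev is A and next is B, while the register delivers C.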

Here is a description of what the switch_to macro does on an 80 x 86 microprocessor:

1. Saves the values of prev and next in the eax and edx registers, respectively:

movl prev,%eax
movl next,%edx

The eax and edx registers correspond to the prev and next parameters of the macro.

2. Saves another copy of prev in the ebx register; ebx corresponds to the last parameter of the macro:

movl %eax,%ebx

3. Saves the contents of the esi, edi, and ebp registers in the prev Kernel Mode stack. They must be saved because the compiler assumes that they will stay unchanged until the end of switch_to:

pushl %esi
pushl %edi
pushl %ebp

4. Saves the content of esp in prev->thread.esp so that the field points to the top of the prev Kernel Mode stack:

movl %esp, 616(%eax)

The 616(%eax) operand identifies the memory cell whose address is the contents of eax plus 616.

5. Loads next->thread.esp in esp. From now on, the kernel operates on the Kernel Mode stack of next, so this instruction performs the actual process switch from prev to next. Since the address of a process descriptor is closely related to that of the Kernel Mode stack (as explained in Section 3.2.2 earlier in this chapter), changing the kernel stack means changing the current process:

movl 616(%edx), %esp

6. Saves the address labeled 1 (shown later in this section) in prev->thread.eip. When the process being replaced resumes its execution, the process executes the instruction labeled as 1:

movl $1f, 612(%eax)

7. On the Kernel Mode stack of next, the macro pushes the next->thread.eip value, which, in most cases, is the address labeled 1:

pushl 612(%edx)

8. Jumps to the __switch_to( ) C function:

jmp __switch_to

This function acts on the prev and next parameters that denote the former process and the new process. This function call is different from the average function call, though, because __switch_to( ) takes the prev and next parameters from the eax and edx registers (where we saw they were stored), not from the stack like most functions. To force the function to go to the registers for its parameters, the kernel uses the __attribute__ and regparm keywords, which are nonstandard extensions of the C language implemented by the gcc compiler. The __switch_to( ) function is declared in the include/asm-i386/system.h header file as follows:

__switch_to(struct task_struct *prev, struct task_struct *next) __attribute__((regparm(3)))

The __switch_to( ) function completes the process switch started by the switch_to macro. It includes extended inline assembly language code that makes for rather complex reading because the code refers to registers by means of special symbols:

a. Executes the code yielded by the unlazy_fpu( ) macro (see Section 3.3.4 later in this chapter) to optionally save the contents of the FPU, MMX, and XMM registers. As we shall see, there is no need to load the corresponding registers of next while performing the context switch:

unlazy_fpu(prev);

b. Loads next->thread.esp0 in the esp0 field of the TSS relative to the current CPU so that any future privilege level change from User Mode to Kernel Mode automatically forces this address into the esp register:

init_tss[smp_processor_id( )].esp0 = next->thread.esp0;

The smp_processor_id( ) macro yields the index of the executing CPU.

c. Stores the contents of the fs and gs segmentation registers in prev->thread.fs and prev->thread.gs, respectively. In the corresponding assembly language instructions, the esi register points to the prev->thread structure.

d. Loads the fs and gs segment registers with the values contained in next->thread.fs and next->thread.gs, respectively. This step logically complements the actions performed in the previous step. In the corresponding assembly language instructions, the ebx register points to the next->thread structure. The code is actually more intricate, as an exception might be raised by the CPU when it detects an invalid segment register value. The code takes this possibility into account by adopting a "fix-up" approach (see Section 9.2.6).

e. Loads six debug registers [5] with the contents of the next->thread.debugreg array.

[5] The 80 x 86 debug registers allow a process to be monitored by the hardware. Up to four breakpoint areas may be defined. Whenever a monitored process issues a linear address included in one of the breakpoint areas, an exception occurs.

This is done only if next was using the debug registers when it was suspended (that is, field next->thread.debugreg[7] is not 0). As we shall see in Chapter 20, these registers are modified only by writing in the TSS, so there is no need to save the corresponding registers of prev:

if (next->thread.debugreg[7]) {
    loaddebug(&next->thread, 0);
    loaddebug(&next->thread, 1);
    loaddebug(&next->thread, 2);
    loaddebug(&next->thread, 3);
    /* no 4 and 5 */
    loaddebug(&next->thread, 6);
    loaddebug(&next->thread, 7);
}

f. Updates the I/O bitmap in the TSS, if necessary. This must be done when either next or prev has its own customized I/O Permission Bitmap:

if (next->thread.ioperm) {
    memcpy(init_tss[smp_processor_id( )].io_bitmap, next->thread.io_bitmap, 128);
    init_tss[smp_processor_id( )].bitmap = 104;
} else if (prev->thread.ioperm)
    init_tss[smp_processor_id( )].bitmap = 0x8000;

The customized I/O Permission Bitmap of a process is stored in a buffer pointed to by the thread.io_bitmap field of the process descriptor. If next has a customized bitmap, it is copied into the io_bitmap field of the TSS. Otherwise, if next doesn't have it, the kernel checks whether prev defined such a bitmap. In this case, the bitmap must be invalidated.

g. Terminates. Like any other function, __switch_to( ) ends by means of a ret assembly language instruction, which loads the eip program counter with the return address stored on the stack. However, the __switch_to( ) function has been invoked simply by jumping into it. Therefore the ret assembly language instruction finds on the stack the address of the instruction shown in the following item and labeled 1, which was pushed by the switch_to macro. If next was never suspended before because it is being executed for the first time, the function finds the starting address of the ret_from_fork( ) function (see Section 3.4.1 later in this chapter).

• Includes a few instructions that restore the contents of the esi, edi, and ebp registers. The first of these three instructions is labeled 1:

1: popl %ebp
   popl %edi
   popl %esi

Notice how these pop instructions refer to the kernel stack of the prev process. They will be executed when the scheduler selects prev as the new process to be executed on the CPU, thus invoking switch_to with prev as the second parameter. Therefore, the esp register points to prev's Kernel Mode stack.

• Copies the content of the ebx register (corresponding to the last parameter of the switch_to macro) into the prev local variable:

movl %ebx,prev

As discussed earlier, the ebx register points to the descriptor of the process that has just been replaced.

3.3.4 Saving the FPU, MMX, and XMM Registers

Starting with the Intel 80486, the arithmetic floating-point unit (FPU) has been integrated into the CPU. The name mathematical coprocessor continues to be used in memory of the days when floating-point computations were executed by an expensive special-purpose chip. To maintain compatibility with older models, however, floating-point arithmetic functions are performed with ESCAPE instructions, which are instructions with a prefix byte ranging between 0xd8 and 0xdf. These instructions act on the set of floating point registers included in the CPU. Clearly, if a process is using ESCAPE instructions, the contents of the floating point registers belong to its hardware context.

In later Pentium models, Intel introduced a new set of assembly language instructions into its microprocessors. They are called MMX instructions and are supposed to speed up the execution of multimedia applications. MMX instructions act on the floating point registers of the FPU. The obvious disadvantage of this architectural choice is that programmers cannot mix floating-point instructions and MMX instructions. The advantage is that operating system designers can ignore the new instruction set, since the same facility of the task-switching code for saving the state of the floating-point unit can also be relied upon to save the MMX state.

MMX instructions speed up multimedia applications because they introduce a single-instruction multiple-data (SIMD) pipeline inside the processor. The Pentium III model extends such SIMD capability: it introduces the SSE extensions (Streaming SIMD Extensions), which add facilities for handling floating-point values contained in eight 128-bit registers (the XMM registers). Such registers do not overlap with the FPU and MMX registers, so SSE and FPU/MMX instructions may be freely mixed. The Pentium 4 model introduces yet another feature: the SSE2 extensions, which are basically an extension of SSE supporting higher-precision floating-point values. SSE2 uses the same set of XMM registers as SSE.

The 80 x 86 microprocessors do not automatically save the FPU, MMX, and XMM registers in the TSS. However, they include some hardware support that enables kernels to save these registers only when needed. The hardware support consists of a TS (Task-Switching) flag in the cr0 register, which obeys the following rules:

• Every time a hardware context switch is performed, the TS flag is set.

• Every time an ESCAPE, MMX, SSE, or SSE2 instruction is executed when the TS flag is set, the control unit raises a "Device not available" exception (see Chapter 4).

The TS flag allows the kernel to save and restore the FPU, MMX, and XMM registers only when really needed. To illustrate how it works, suppose that a process A is using the mathematical coprocessor. When a context switch occurs, the kernel sets the TS flag and saves the floating point registers into the TSS of process A. If the new process B does not use the mathematical coprocessor, the kernel won't need to restore the contents of the floating point registers. But as soon as B tries to execute an ESCAPE or MMX instruction, the CPU raises a "Device not available" exception, and the corresponding handler loads the floating point registers with the values saved in the TSS of process B.

Let's now describe the data structures introduced to handle selective loading of the FPU, MMX, and XMM registers. They are stored in the thread.i387 subfield of the process descriptor, whose format is described by the i387_union union:

union i387_union {
    struct i387_fsave_struct  fsave;
    struct i387_fxsave_struct fxsave;
    struct i387_soft_struct   soft;
};

As you see, the field may store just one of three different types of data structures. The i387_soft_struct type is used by CPU models without a mathematical coprocessor; the Linux kernel still supports these old chips by emulating the coprocessor via software. We don't discuss this legacy case further, however. The i387_fsave_struct type is used by CPU models with a mathematical coprocessor and, optionally, an MMX unit. Finally, the i387_fxsave_struct type is used by CPU models featuring SSE and SSE2 extensions.
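The "just one of three" property is what a C union provides: all members share the same storage, and the union is as large as its largest member. The sketch below uses stand-in types rather than the kernel's real structures. The 108-byte and 512-byte sizes are those of the memory images written by the fnsave and fxsave instructions in 32-bit protected mode; the 128-byte size for the software-emulation case is an arbitrary placeholder, and every identifier (`fpu_state`, `pick_format`, `bytes_used`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the three formats; NOT the kernel's definitions. */
struct fsave_image  { unsigned char bytes[108]; };  /* fnsave image */
struct fxsave_image { unsigned char bytes[512]; };  /* fxsave image */
struct soft_image   { unsigned char bytes[128]; };  /* placeholder size */

union fpu_state {
    struct fsave_image  fsave;
    struct fxsave_image fxsave;
    struct soft_image   soft;
};

enum fpu_format { FMT_SOFT, FMT_FSAVE, FMT_FXSAVE };

/* Choose the format from the CPU features, following the text:
 * no coprocessor, coprocessor (optionally MMX), or SSE/SSE2. */
static enum fpu_format pick_format(int has_fpu, int has_sse)
{
    if (!has_fpu)
        return FMT_SOFT;
    return has_sse ? FMT_FXSAVE : FMT_FSAVE;
}

/* Number of bytes of the union that a given format actually uses. */
static size_t bytes_used(enum fpu_format fmt)
{
    switch (fmt) {
    case FMT_SOFT:   return sizeof(struct soft_image);
    case FMT_FSAVE:  return sizeof(struct fsave_image);
    case FMT_FXSAVE: return sizeof(struct fxsave_image);
    }
    assert(!"unknown format");
    return 0;
}
```

Because the format is fixed by the CPU model, no per-process tag is needed to tell which member is live; a single system-wide feature test is enough.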

The process descriptor includes two additional flags:

• The PF_USEDFPU flag, which is included in the flags field. It specifies whether the process used the FPU, MMX, or XMM registers in the current execution run.

• The used_math field. This flag specifies whether the contents of the thread.i387 subfield are significant. The flag is cleared (not significant) in two cases, shown in the following list.

o When the process starts executing a new program by invoking an execve( ) system call (see Chapter 20). Since control will never return to the former program, the data currently stored in thread.i387 is never used again.

o When a process that was executing a program in User Mode starts executing a signal handler procedure (see Chapter 10). Since signal handlers are asynchronous with respect to the program execution flow, the floating point registers could be meaningless to the signal handler. However, the kernel saves the floating point registers in thread.i387 before starting the handler and restores them after the handler terminates. Therefore, a signal handler is allowed to use the mathematical coprocessor, but it cannot carry on a floating-point computation started during the normal program execution flow.

As stated earlier, the __switch_to( ) function executes the unlazy_fpu macro, passing the process descriptor of the process being replaced as an argument. The macro checks the value of the PF_USEDFPU flag of prev. If the flag is set, prev has used FPU, MMX, SSE, or SSE2 instructions in this run of execution; therefore, the kernel must save the relative hardware context:

if (prev->flags & PF_USEDFPU)
    save_init_fpu(prev);

The save_init_fpu( ) function, in turn, executes the following operations:

1. Dumps the contents of the FPU registers in the process descriptor of prev and then re-initializes the FPU. If the CPU uses SSE/SSE2 extensions, it also dumps the contents of the XMM registers and re-initializes the SSE/SSE2 unit. A couple of powerful assembly language instructions take care of everything, either:

asm volatile( "fxsave %0 ; fnclex"

if the CPU uses SSE/SSE2 extensions, or otherwise:

asm volatile( "fnsave %0 ; fwait"

2. Resets the PF_USEDFPU flag of prev:

prev->flags &= ~PF_USEDFPU;

3. Sets the TS flag of cr0 by means of the stts( ) macro, which in practice yields the following assembly language instructions:

movl %cr0, %eax
orl $8, %eax
movl %eax, %cr0

The contents of the floating point registers are not restored right after a process resumes execution. However, the TS flag of cr0 has been set by unlazy_fpu( ). Thus, the first time the process tries to execute an ESCAPE, MMX, or SSE/SSE2 instruction, the control unit raises a "Device not available" exception, and the kernel (more precisely, the exception handler invoked by the exception) runs the math_state_restore( ) function:

void math_state_restore( )
{
    asm("clts"); /* clear the TS flag of cr0 */
    if (current->used_math) {
        restore_fpu(current);
    } else {
        /* initialize the FPU unit */
        asm("fninit");
        /* and also the SSE/SSE2 unit, if present */
        if ( cpu_has_xmm )
            load_mxcsr(0x1f80);
        current->used_math = 1;
    }
    current->flags |= PF_USEDFPU;
}

Since the process is executing an FPU, MMX, or SSE/SSE2 instruction, this function sets the PF_USEDFPU flag. Moreover, the function clears the TS flag of cr0 so that further FPU, MMX, or SSE/SSE2 instructions executed by the process won't trigger the "Device not available" exception. If the data stored in the thread.i387 field is valid, the restore_fpu( ) function loads the registers with the proper values. To do this, either the fxrstor or the frstor assembly language instruction is used, depending on whether the CPU supports SSE/SSE2 extensions. Otherwise, if the data stored in the thread.i387 field is not valid, the FPU/MMX unit is re-initialized and all its registers are cleared. To re-initialize the SSE/SSE2 unit, it is sufficient to load a value in an XMM register.
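The interplay between unlazy_fpu( ), the TS flag, and math_state_restore( ) can be traced with a user-space model. This is only a sketch: a single double stands in for the whole FPU/MMX/XMM state, the `saves` and `restores` counters exist purely to make the laziness visible, the PF_USEDFPU bit value is made up, and all `_model` names are hypothetical. The model assumes that the TS flag starts out set, so that the very first floating-point instruction traps.

```c
#include <assert.h>
#include <stdbool.h>

#define PF_USEDFPU 0x1               /* hypothetical bit value for this sketch */

struct fpu_regs { double st0; };     /* stand-in for the FPU/MMX/XMM state */

struct task {
    unsigned        flags;           /* holds PF_USEDFPU */
    bool            used_math;       /* is saved_fpu significant? */
    struct fpu_regs saved_fpu;       /* stand-in for thread.i387 */
};

struct cpu {
    bool            ts;              /* TS flag of cr0 */
    struct fpu_regs fpu;             /* live registers */
    unsigned        saves, restores; /* counters for the demonstration */
};

/* Model of unlazy_fpu( ): save only if prev used the FPU in this run. */
static void unlazy_fpu_model(struct cpu *c, struct task *prev)
{
    if (prev->flags & PF_USEDFPU) {
        prev->saved_fpu = c->fpu;    /* dump the registers */
        prev->flags &= ~PF_USEDFPU;
        c->ts = true;                /* next FPU instruction will trap */
        c->saves++;
    }
}

/* Model of math_state_restore( ), run on "Device not available". */
static void math_state_restore_model(struct cpu *c, struct task *cur)
{
    c->ts = false;
    if (cur->used_math) {
        c->fpu = cur->saved_fpu;     /* saved state is significant */
        c->restores++;
    } else {
        c->fpu.st0 = 0.0;            /* fresh, re-initialized unit */
        cur->used_math = true;
    }
    cur->flags |= PF_USEDFPU;
}

/* The current process executes an FPU instruction that stores v. */
static void fpu_store(struct cpu *c, struct task *cur, double v)
{
    assert(c != 0 && cur != 0);
    if (c->ts)                       /* trap: "Device not available" */
        math_state_restore_model(c, cur);
    c->fpu.st0 = v;
}

/* The current process executes an FPU instruction that reads st0. */
static double fpu_load(struct cpu *c, struct task *cur)
{
    assert(c != 0 && cur != 0);
    if (c->ts)
        math_state_restore_model(c, cur);
    return c->fpu.st0;
}
```

In this model, switching away from a process that never touched the FPU costs nothing, and a restore happens only when a process with significant saved state actually executes a floating-point instruction again.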
