Linux Process Manager

Dequeue a signal

The function shown in Figure 18.30, from kernel/signal.c, takes a signal off the queue and returns the information about it to the caller, which is expected to be holding the sigmask_lock. siginfo_t *info) 244 sig = next_signal(current, mask) sig)) 248 if 249 current->sigpending = 0 250 return 0 printk("%d -> %d\n", signal_pending(current), sig) The parameters are a pointer to the blocked mask of the current process and a struct siginfo into which the extra information about the signal being...

Maintaining the time of day

The time of day is maintained in seconds and microseconds in the variable xtime (see Section 15.2.2.2). This is known as wall time. This section will first examine the data structure used to maintain time in this format and then the function that actually updates it. Linux maintains time of day to a granularity of a microsecond. It is not stored in the usual year/month/day format but as the number of seconds that have elapsed since 1 January 1970. Figure 15.10, from <linux/time.h>, shows...
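
As a rough illustration of the seconds-plus-microseconds format described above, the following sketch advances such a value by a number of microseconds and carries any overflow into the seconds field. The type ex_walltime and the function ex_walltime_add_usec() are hypothetical names invented for this example; they are not the kernel's own definitions.

```c
/* Hypothetical stand-in for a seconds + microseconds wall-time value. */
struct ex_walltime {
    long sec;   /* whole seconds since 1 January 1970 */
    long usec;  /* microseconds within the current second, 0..999999 */
};

/* Advance *t by 'usec' microseconds (usec >= 0), keeping
 * 0 <= t->usec < 1000000 by carrying whole seconds into t->sec. */
void ex_walltime_add_usec(struct ex_walltime *t, long usec)
{
    t->usec += usec;
    t->sec += t->usec / 1000000;   /* carry whole seconds */
    t->usec %= 1000000;            /* keep the sub-second remainder */
}
```

Adding 10 000 microseconds to a value of 10 seconds and 995 000 microseconds, for example, gives 11 seconds and 5000 microseconds.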

The 8259A programmable interrupt controller

So far in this chapter we have been dealing with the software side of the interrupt mechanism, but there is a hardware interrupt controller somewhere in the picture, and now we have to turn our attention to that. A number of different interrupt controllers are handled by the Linux process manager. At present these include the 8259 PIC, the PIIX4 internal 8259 PIC, the local APIC, the IO APIC, and SGI's Visual Workstation Cobalt IO APIC. The simplest piece of hardware to understand is the...

The debug exception handler

The function that actually handles the debug exception is shown in Figure 11.7, from arch/i386/kernel/traps.c. It checks for several unusual situations before sending a signal to the current process. 477 asmlinkage void do_debug(struct pt_regs *regs, long error_code) 480 struct task_struct *tsk = current r (condition)) 486 if (condition & (DR_TRAP0 | DR_TRAP1 | DR_TRAP2 | DR_TRAP3)) 491 if (regs->eflags & VM_MASK) 498 if (condition & DR_STEP) 508 if ((tsk->ptrace & (PT_DTRACE...

Nonmaskable interrupt

427 if (!(reason & 0xc0)) unknown_nmi_error(reason, regs) return Figure 11.8 The handler for the nonmaskable interrupt. 421 Although passed an error code by the first-level handler, the function never actually uses it. 423 Input-output (IO) port 0x61, port B in the PC, has bits indicating the source or reason for an nmi (among other things). Bit 2 is for system board parity error checking: 0 means that it is enabled, 1 means that it is reset but disabled. It is a read-write bit. Bit 3 is for...
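
The test on line 427 can be pictured with a small sketch that classifies the byte read from port 0x61. It assumes the conventional PC layout, in which bit 7 reports a memory parity error and bit 6 an IO check error; these bit meanings are an assumption here, and the enum and function names are invented for the example.

```c
/* Hypothetical labels for the NMI sources decoded below. */
enum ex_nmi_source {
    EX_NMI_UNKNOWN    = 0,
    EX_NMI_MEM_PARITY = 1,   /* assumed: bit 7 of port 0x61 */
    EX_NMI_IO_CHECK   = 2    /* assumed: bit 6 of port 0x61 */
};

/* Return a bitmask of ex_nmi_source values for a port 0x61 reading. */
unsigned ex_nmi_sources(unsigned char reason)
{
    unsigned src = EX_NMI_UNKNOWN;

    if (!(reason & 0xc0))          /* neither bit set: unknown source */
        return EX_NMI_UNKNOWN;
    if (reason & 0x80)
        src |= EX_NMI_MEM_PARITY;
    if (reason & 0x40)
        src |= EX_NMI_IO_CHECK;
    return src;
}
```

A reading with neither of the top two bits set is what leads to the unknown_nmi_error() path shown in the figure.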

Parity error on main memory board

If the nonmaskable interrupt was caused by a memory parity error on the main board, the function shown in Figure 11.11, from arch/i386/kernel/traps.c, is called. It prints a warning message and reenables parity error detection. 380 static void mem_parity_error(unsigned char reason, struct 382 printk(Uhhuh. NMI received. Dazed and confused, but 383 printk(You probably have a hardware problem with your RAM 386 reason = (reason & 0xf) | 4 Figure 11.11 Clearing and disabling the memory parity bit...

Setup virtual wire mode

If there is no IO APIC present, then external devices are connected to a PIC, which in turn is connected to a local APIC. In this case the local APIC is set into virtual wire mode, merely providing a connection to the CPU. All arbitration between interrupts and provision of vectors is done by an 8259A PIC. The code shown in Figure 13.9, from arch/i386/kernel/apic.c, sets up a local APIC in this mode. 224 void __init init_bsp_APIC(void) if smp_found_config cpu_has_apic return value apic_read...

Disabling an IO APIC before rebooting

Linux provides a number of functions for clearing one or more entries from the IO APIC registers as well as for disabling the whole APIC. These are typically used by the reboot code. The function shown in Figure 14.26, from arch/i386/kernel/io_apic.c, is used by the reboot code. It clears all registers in all IO APICs before rebooting. 1029 void disable_IO_APIC(void) 1034 This clears all the registers in all IO APICs (see Section 14.2.4.2). 1036 This function reenables the PIC now that the APIC...

Clone flags

The clone flags determine a whole range of properties of the child process. They are defined in <linux/sched.h>, as shown in Figure 8.10. define define define define define define define define define define 0x00000200 0x00000400 0x00000800 0x00001000 0x00002000 0x00004000 0x00008000 0x00010000 #define CLONE_SIGNAL (CLONE_SIGHAND | CLONE_THREAD) 35 The number of the signal to be sent to the parent process when the child exits is encoded in the low-order 8 bits of the clone flag value. This...
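
The encoding described in the comment on line 35 can be sketched as follows. The EX_ constants are local stand-ins for this example; in particular, the assignment of 0x00000800 to the signal-handler flag and 0x00010000 to the thread flag is an assumption about which names belong to the values listed in the figure.

```c
/* Local stand-ins; the pairing of names and values is an assumption. */
#define EX_CSIGNAL       0x000000ffUL   /* low-order 8 bits: exit signal */
#define EX_CLONE_SIGHAND 0x00000800UL
#define EX_CLONE_THREAD  0x00010000UL
#define EX_CLONE_SIGNAL  (EX_CLONE_SIGHAND | EX_CLONE_THREAD)

/* Signal number to send to the parent when the child exits. */
int ex_exit_signal(unsigned long clone_flags)
{
    return (int)(clone_flags & EX_CSIGNAL);
}

/* Non-zero only if both bits of the combined signal flag are set. */
int ex_shares_signals(unsigned long clone_flags)
{
    return (clone_flags & EX_CLONE_SIGNAL) == EX_CLONE_SIGNAL;
}
```

A flag word built as EX_CLONE_SIGNAL | 17 therefore carries signal number 17 in its low byte while still reporting both sharing bits as set.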

Generating first-level interrupt handlers

In Section 12.3.1 an array of pointers to first-level handlers for hardware interrupts was initialised. The handler stubs themselves are built using some ugly macros, which create the low-level assembly routines that save register context and call the second-level handler, do_IRQ(). The do_IRQ() function then does all the operations that are needed to keep the hardware interrupt controller happy. 12.3.2.1 Building all the handler stubs Figure 12.12, from arch/i386/kernel/i8259.c, shows the...

Task state segment

The task state segment (TSS) is specific to the i386 architecture. It is Intel's layout for the volatile environment of a process. The TR register in the CPU always points to the TSS of the current process. Intel intended that each process would have its own TSS and that the volatile environment of a process would be saved there when it was context switched out. Linux does not implement things that way, preferring to save most of the volatile environment on the kernel stack of the process and...

Initialising hardware interrupts

The root of the whole setup is init_IRQ(), from arch/i386/kernel/i8259.c (see Figure 12.19). This function is called at boot time, from line 560 of init/main.c. 447 #ifndef CONFIG_X86_VISWS_APIC int vector = FIRST_EXTERNAL_VECTOR + i if (vector != SYSCALL_VECTOR) set_intr_gate(vector, interrupt[i]) interrupt 0 reschedule_interrupt invalidate_interrupt 483 #ifdef CONFIG_X86_LOCAL_APIC apic_timer_interrupt spurious_interrupt error_interrupt 497 outb_p(LATCH & 0xff, 0x40) 498 outb(LATCH >> 8, 0x40) 508 if...
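
Lines 497 and 498 write a 16-bit count to port 0x40 one byte at a time, low byte first. The sketch below shows only that byte split. The input frequency of 1193180 Hz and the tick rate of 100 per second are assumptions made for the example, as are all the EX_ and ex_ names.

```c
/* Assumed values, for illustration only. */
#define EX_CLOCK_TICK_RATE 1193180UL   /* timer chip input frequency, Hz */
#define EX_HZ              100UL       /* timer interrupts per second */
#define EX_LATCH ((EX_CLOCK_TICK_RATE + EX_HZ / 2) / EX_HZ)

/* Low byte of the count: the first value written to the port. */
unsigned char ex_latch_low(unsigned long latch)
{
    return (unsigned char)(latch & 0xff);
}

/* High byte of the count: the second value written to the port. */
unsigned char ex_latch_high(unsigned long latch)
{
    return (unsigned char)((latch >> 8) & 0xff);
}
```

With the assumed values the count is 11932 (0x2e9c), so the two bytes written would be 0x9c followed by 0x2e.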

Getting the polarity and trigger type of an irq line

The polarity of an irq line indicates whether it is active high or low. Also, a line can be edge triggered or level triggered. This section will examine the standard definitions for the different buses, and various functions provided for determining the polarity and trigger type of a particular irq. The four different bus types have different values for polarity and trigger type, as defined in Figure 14.10, from arch/i386/kernel/io_apic.c. 299 static int __init EISA_ELCR(unsigned int irq) 302...

The Linux system call entry

Before we get into the internals of this, it may be useful to look at how a Linux user program enters the kernel - the system call interface. This involves changing the CPU to run in kernel mode and changing it back to user mode afterwards. Then we will examine the mainline entry code. The section concludes with two branches from the mainline, one taken when a process is being traced, the other when the call specifies an invalid system service. movl4( edx), edx pushl 0x2 7 call* edx addl 4, esp...

Restarting an interrupted system call

The only way control can transfer to the code shown in Figure 18.29 is by breaking out of the infinite loop at line 609 (Figure 18.24) because there were no further signals queued. One final possibility must be considered. When a process calls a system service that blocks, it is put into the TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state. When a signal is posted to a process in the TASK_INTERRUPTIBLE state it is woken up and moved to the runqueue, even though the system service has not...

Writing to a register

The function shown in Figure 22.6, from arch/i386/kernel/ptrace.c, writes to the field in the struct pt_regs on the kernel stack of the traced process, corresponding to a specified hardware register. 73 static int putreg(struct task_struct *child, 74 unsigned long regno, unsigned long value) if (value && (value & 3) != 3) child->thread.fs = value return 0 if (value && (value & 3) != 3) child->thread.gs = value return 0 if (value && (value & 3) != 3) get_stack_long(child,...
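
The test repeated for each segment register in the figure can be isolated in a small sketch. It assumes that the comparison lost from the listing is "not equal", so that a non-zero selector is accepted only when its two low-order bits (the requested privilege level) are both set. The function name is invented for the example.

```c
/* Return 1 if 'value' would be accepted as a segment selector for a
 * traced process, 0 otherwise. A null selector (0) is always accepted;
 * any other selector must request privilege level 3 (user mode). */
int ex_selector_ok(unsigned long value)
{
    if (value && (value & 3) != 3)
        return 0;
    return 1;
}
```

Under this reading a debugger can never plant a selector with kernel privilege in the saved registers of the traced process.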

Setting values in the virtual flags register

Two functions are provided for setting values in the virtual flags register. One sets bits in the 32-bit VEFLAGS; the other in the 16-bit VFLAGS. Both are shown in Figure 24.14, from arch/i386/kernel/vm86.c. 292 static inline void set_vflags_long(unsigned long eflags, flags IF_MASK return flags (VEFLAGS & current->thread.v86mask) set_flags(VEFLAGS, eflags, current->thread.v86mask) set_flags(regs->eflags, eflags, SAFE_MASK) if (eflags & IF_MASK) 300 static inline void set_vflags_short(unsigned...
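
Both functions rely on a masked update: only the bits selected by a mask are copied from the new value, and all other bits keep their old value. The sketch below shows that idea as ordinary functions. It is an assumption about what set_flags() does, not a copy of it, and the 16-bit variant simply restricts the mask to the low-order 16 bits.

```c
/* Copy the bits selected by 'mask' from 'new_bits' into 'old',
 * leaving every other bit of 'old' unchanged. */
unsigned long ex_set_flags(unsigned long old, unsigned long new_bits,
                           unsigned long mask)
{
    return (old & ~mask) | (new_bits & mask);
}

/* 16-bit variant: only the low-order 16 bits may change. */
unsigned long ex_set_flags_short(unsigned long old, unsigned short new_bits,
                                 unsigned long mask)
{
    return ex_set_flags(old, new_bits, mask & 0xffffUL);
}
```

For example, updating 0xff00 with 0x00ff under the mask 0x0ff0 yields 0xf0f0: the middle two hex digits come from the new value and the outer two from the old.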

Trap handling in vm86 mode

return_to_32bit(regs, VM86_TRAP + (trapno ss ptrace & PT_PTRACED) unsigned long flags flags) sigdelset(&current->blocked, SIGTRAP) recalc_sigpending(current) send_sig(SIGTRAP, current, 1) current->thread.trap_no = trapno current->thread.error_code = error_code return 0 Figure 24.25 Trap handling in vm86 mode. It is necessary to be clear on the state of the kernel stack on entry to this function. On top of the stack is the struct pt_regs pushed there by the system call handler on entry to vm86 mode....

Handling a vectored interrupt in vm86 mode

Vectored interrupts, either traps or INTx, will be handled by indexing into the 16-bit interrupt table, or the 32-bit IDT. A special function is provided to check for all sorts of conditions (see Figure 24.29, from arch/i386/kernel/vm86.c). This function is called from handle_vm86_trap() (vm86pus is set) and handle_vm86_fault() (if an INTx). 398 static void do_int(struct kernel_vm86_regs *regs, int i, unsigned char *ssp, unsigned long sp) 400 unsigned long *intr_ptr, segoffs 402 if (regs->cs...

SIMD coprocessor errors

The function that actually handles errors in the SIMD co-processor is shown in Figure 11.25, from arch/i386/kernel/traps.c. 610 void simd_math_error(void *eip) task = current save_init_fpu(task) task->thread.trap_no = 19 task->thread.error_code = 0 info.si_signo = SIGFPE info.si_errno = 0 switch (~((mxcsr & 0x1f80) >> 7) & (mxcsr & 0x3f)) case 0x000 default info.si_code = FPE_FLTINV break case 0x002 case 0x010 info.si_code = FPE_FLTUND break case 0x004 info.si_code = FPE_FLTDIV break case 0x008...
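
The expression in the switch statement picks out those exceptions that are both flagged and not masked. The following sketch separates the two steps. It assumes the usual MXCSR layout, with the six exception flags in bits 0 to 5 and the corresponding mask bits in bits 7 to 12; the function name is invented.

```c
/* Return the exception flags (bits 0..5) that are raised in 'mxcsr'
 * and whose corresponding mask bits (bits 7..12) are clear. */
unsigned ex_simd_unmasked(unsigned mxcsr)
{
    unsigned masks = (mxcsr & 0x1f80) >> 7;   /* mask bits moved to 0..5 */
    unsigned flags = mxcsr & 0x3f;            /* raised exception flags */

    return ~masks & flags;
}
```

With every exception masked (0x1f80) and the divide flag raised (0x04) the result is 0, which corresponds to the default case; clearing the matching mask bit (0x200) gives 0x004, the divide case.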

Manipulating floating point unit registers

Later models of the i386 architecture have more FPU registers than do earlier ones. The streaming SIMD extensions were introduced with the Pentium III. They are of use in areas such as image processing and word recognition. SSE uses 128-bit registers, called XMM registers. There is also an MXCSR register, containing control and status bits for operating the XMM registers. H.10.2.1 Initialising floating point unit registers When the current process uses the FPU for the first time, the function...

The software interrupt kernel thread

The previous section has described how software interrupts are handled in interrupt context on the return path from hardware interrupt handling, but there is also a kernel thread (in fact, one per CPU) dedicated to handling software interrupts. This thread is woken up when the load of software interrupts becomes too great to handle in interrupt context (it would take too many machine cycles from the current process). 16.1.4.1 Spawning kernel threads to handle software interrupts At boot time,...

The hash table for task structures

As all the data structures representing processes (task_struct) are on a doubly linked list, any one can be found by searching the list linearly. This method can be time-consuming, particularly if there are a large number of processes. So, to speed things up, all the structures are also kept on hash lists, hashed on the pid of the process, and linked through the pidhash_next and pidhash_pprev fields of the task_struct. This section examines the hash structure itself, the hash function used, and...
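
A minimal sketch of such a hash list is given below, assuming a table of 1024 entries and a simple shift-and-xor hash function; both are illustrative assumptions. The struct ex_task type and the ex_ functions are hypothetical, although the two link fields are named after the pidhash_next and pidhash_pprev fields mentioned above. Note that pidhash_pprev points at whichever pointer currently points to the entry, so removal needs no special case for the head of a chain.

```c
#include <stddef.h>

#define EX_PIDHASH_SZ 1024   /* assumed table size; a power of two */

struct ex_task {
    int pid;
    struct ex_task *pidhash_next;     /* next entry on the same chain */
    struct ex_task **pidhash_pprev;   /* pointer that points to this entry */
};

struct ex_task *ex_pidhash[EX_PIDHASH_SZ];

/* Illustrative hash: fold the high bits of the pid into the low bits. */
unsigned ex_pid_hashfn(int pid)
{
    unsigned x = (unsigned)pid;
    return ((x >> 8) ^ x) & (EX_PIDHASH_SZ - 1);
}

/* Insert p at the head of its chain. */
void ex_hash_pid(struct ex_task *p)
{
    struct ex_task **head = &ex_pidhash[ex_pid_hashfn(p->pid)];

    p->pidhash_next = *head;
    if (*head)
        (*head)->pidhash_pprev = &p->pidhash_next;
    *head = p;
    p->pidhash_pprev = head;
}

/* Remove p from its chain. */
void ex_unhash_pid(struct ex_task *p)
{
    if (p->pidhash_next)
        p->pidhash_next->pidhash_pprev = p->pidhash_pprev;
    *p->pidhash_pprev = p->pidhash_next;
}

/* Search only the chain that 'pid' hashes to; NULL if not found. */
struct ex_task *ex_find_task_by_pid(int pid)
{
    struct ex_task *p;

    for (p = ex_pidhash[ex_pid_hashfn(pid)]; p; p = p->pidhash_next)
        if (p->pid == pid)
            break;
    return p;
}
```

The lookup touches only the entries on one chain, which is the saving over a linear search of the whole process list.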

Macros for interrupt-safe locks

The extra locking macros are all shown in Figure 12.42, from . There are 19 macros defined in this block of code. These 19 definitions are completely made up of calls to a further 13 macros. Seven of these deal with locking and unlocking spinlocks and read-write locks, which have been examined in great detail in Sections 5.3-5.7. Of the other 6, local_irq_save( ) and local_irq_restore() take care of saving and restoring the value in the EFLAGS register (see Section 12.8.2). Then...

Entering vm86 mode

Entry to vm86 mode is by means of a system call. As a discussion of system calls is outside the scope of this book, only a summary will be given here, in Section 24.4.1. The system call then goes on to call internal kernel functions, which will be described in full in Section 24.4.2. A process can use one of two different kernel entry points to switch to vm86 mode. The only difference between them is that the vm86() system service passes a vm86plus_struct as a parameter, whereas the oldvm86()...

Declaring and initialising wait queue entries

This section will examine the data structure used to represent an individual entry in a wait queue and how such entries are declared and initialised. Creating a new entry for a wait queue is quite a frequent event in the kernel, so there are a number of macros and functions provided for this purpose. One declares new entries; the other fills in fields in an existing entry. To allow more than one process to wait on the same event, a link data structure __wait_queue is used (see Figure 4.1, from...

Kernel statistics

The operating system maintains a significant amount of statistical information about what is going on in the kernel. As much of this is maintained by the scheduler, this is a suitable place to introduce it. 7.2.4.1 Scheduler-specific information Figure 7.9, from kernel/sched.c, shows some data structures used to record scheduling statistics. 106 char __pad[SMP_CACHE_BYTES] 107 aligned_data[NR_CPUS] __cacheline_aligned &init_task, 0 109 #define cpu_curr(cpu) 112 struct kernel_stat kstat Figure 7.9...

Programmable interrupt controller

A hardware interrupt line is an electrical connection between a device and the CPU. The device can put an electrical signal on this line and so get the attention of the CPU. Because devices use these lines to request interrupts they are commonly referred to as an irq line, or just an irq. The Intel 8080 was designed at a time when the number of transistors that could be integrated onto one chip was quite limited. The designers only had space in the CPU to implement one interrupt line. For...

Sanity checks and APIC identification

The first part of the code, as shown in Figure 13.5, from arch/i386/kernel/apic.c, consists of some sanity checks and setting up the ID of the APIC. 263 void __init setup_local_APIC(void) 265 unsigned long value, ver, maxlvt 275 value = apic_read(APIC_LVR) 276 ver = GET_APIC_VERSION(value) 278 if ((SPURIOUS_APIC_VECTOR & 0x0f) != 0x0f) 279 __error_in_apic_c() 285 if( clustered_apic_mode & & & phys_cpu_present_map)) 301 apic_write_around(APIC_DFR, 0xffffffff) 307 value &...

Generic functions for manipulating a local APIC

There are a number of functions that are concerned with manipulating a local APIC itself, as opposed to any irq on that APIC. 13.3.2.1 Getting the number of entries in the local vector table The function shown in Figure 13.12, from arch/i386/kernel/apic.c, returns the maximum number of entries in the local vector table of the APIC. 42 unsigned int v, ver, maxlvt 47 maxlvt = APIC_INTEGRATED(ver) ? GET_APIC_MAXLVT(v) : 2 Figure 13.12 Getting the number of entries in the local vector table (LVT) 44 this...
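
The selection made on line 47 can be sketched on its own, working from a raw version register value. The field positions used by the EX_ macros (version in bits 0 to 7, maximum LVT entry in bits 16 to 23, and a non-zero high nibble of the version marking an integrated APIC) are assumptions made for this example.

```c
/* Assumed field layout of the APIC version register. */
#define EX_GET_APIC_VERSION(x)  ((x) & 0xFF)
#define EX_GET_APIC_MAXLVT(x)   (((x) >> 16) & 0xFF)
#define EX_APIC_INTEGRATED(ver) ((ver) & 0xF0)

/* An integrated APIC reports its own maximum LVT entry; anything
 * else (an external 82489DX) is taken to have a maximum of 2. */
unsigned ex_get_maxlvt(unsigned long lvr)
{
    unsigned long ver = EX_GET_APIC_VERSION(lvr);

    return EX_APIC_INTEGRATED(ver) ? (unsigned)EX_GET_APIC_MAXLVT(lvr) : 2;
}
```

A register value of 0x00040011, for instance, has version 0x11 (integrated) and so yields 4, whereas 0x00050001 has version 0x01 and yields the fixed value 2.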

Low-level functions to send an interprocessor interrupt

Finally, the low-level functions used for sending IPIs, both in the previous section and elsewhere, will be examined here. There are several functions for sending IPIs between CPUs. The destination can be one, some, or all of the CPUs in the system. It is also possible for a CPU to send an IPI to itself. The destination can be specified either physically or logically. In physical mode, the destination processor is specified by the 4-bit hardware-assigned ID (8-bit for Pentium 4 and Xeon) of the...

The reschedule interrupt

This is sent to a specific CPU in order to force the execution of the schedule() function on that CPU. The handler has nothing to do; all the work on the target machine is done automatically on return from interrupt handling. Figure 13.32, from arch/i386/kernel/smp.c, shows the trivial function. 601 asmlinkage void smp_reschedule_interrupt(void) Figure 13.32 Second-level handler for the APIC reschedule interrupt 603 this acknowledges receipt of the...

Freeing an interrupt line

The function shown in Figure 12.26, from arch/i386/kernel/irq.c, deallocates an interrupt line. The handler is removed and, once the interrupt line is no longer in use by any driver, it is disabled and shut down. If the irq was shared, then the caller must ensure that the interrupt is disabled on the card that issues this irq before calling this function. This function may be called from interrupt context, but note that attempting to free an irq in a handler for the same irq hangs the machine. 740...

Error handling on a local APIC

The final block of code, as shown in Figure 13.8, from arch/i386/kernel/apic.c, is relevant only to an integrated local APIC, not an 82489DX. It sets up the error status register (ESR). 393 if (APIC_INTEGRATED(ver) && !esr_disable) 398 printk("ESR value before enabling vector %08lx\n", value) value 408 printk("ESR value after enabling vector %08lx\n", value) 417 printk("Leaving ESR disabled.\n") 419 printk("No ESR for 82489DX.\n") 422 if (nmi_watchdog == NMI_LOCAL_APIC) Figure 13.8 Error handling on a local...

Interrupt command register

This section considers the definitions for the 64-bit interrupt command register, as shown in Figure 13.3, from <asm-i386/apicdef.h>. A CPU sends an interprocessor interrupt (IPI) by writing to this register. APIC_INT_LEVELTRIG APIC_INT_ASSERT APIC_ICR_BUSY APIC_DEST_LOGICAL 0x08000 0x04000 0x01000 0x00800 0x00000 0x00100 APIC_DM_NMI APIC_DM_INIT APIC_DM_STARTUP APIC_DM_EXTINT GET_APIC_DEST_FIELD(x) (((x) >> 24) & 0xFF) SET_APIC_DEST_FIELD(x) ((x) << 24) Figure 13.3 The interrupt command register...
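
The two destination-field macros at the end of the figure are inverses of one another: one places an 8-bit destination in bits 24 to 31 of a register word and the other extracts it again. A sketch of that round trip follows, using local copies of the macros under EX_ names and two invented helper functions.

```c
/* Local copies of the destination-field macros, for illustration. */
#define EX_GET_APIC_DEST_FIELD(x) (((x) >> 24) & 0xFF)
#define EX_SET_APIC_DEST_FIELD(x) ((x) << 24)

/* Build the register word that names 'dest' as the target of an IPI. */
unsigned long ex_icr_dest_word(unsigned long dest)
{
    return EX_SET_APIC_DEST_FIELD(dest & 0xFF);
}

/* Recover the destination from such a register word. */
unsigned long ex_icr_dest_of(unsigned long word)
{
    return EX_GET_APIC_DEST_FIELD(word);
}
```

A destination of 3, for example, is written as 0x03000000 and read back as 3.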

Manipulating the linked list of task structures

This section examines the sequential list. There are three macros defined in <linux/sched.h> that manipulate the various links in a task_struct. One removes a structure, another inserts a structure, and a third follows the links from start to finish. The macro shown in Figure 3.1, from <linux/sched.h>, removes a descriptor p from the process structure, and from lists of siblings. Note that it does nothing about mutual exclusion. Any functions that use this macro have to guarantee...

Conditional interruptible sleep

The significant difference between this section and the previous one is the value in the state field of the process while it is sleeping. As before, there are two macros involved in putting a process to sleep conditionally. The first just checks the condition, whereas the main one actually puts the process to sleep. The macro shown in Figure 4.37, from <linux/sched.h>, puts a process to sleep in the TASK_INTERRUPTIBLE state. It is only a wrapper that tests the condition before ever...

Delayed timer processing

The timer bottom half may not be run for some time after the first-level handler, depending on the load on the machine. In an extreme case, several timer ticks may occur before the bottom-half handler runs. So, first of all, it has to figure out how many timer interrupts have occurred since it last ran. The function to do this is shown in Figure 15.9, from kernel/timer.c. 647 rwlock_t xtime_lock = RW_LOCK_UNLOCKED 649 static inline void update_times(void) 660 ticks = jiffies - wall_jiffies 66 642...
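
The subtraction on line 660 can be sketched as a small function. Because both counters are unsigned, the difference is computed modulo the word size, so the result is still the number of elapsed ticks even if the tick counter has wrapped around since the last run. The function name is invented for the example.

```c
/* Number of timer ticks that have occurred since the wall clock was
 * last brought up to date. Unsigned arithmetic makes the result
 * correct even when 'jiffies' has wrapped past zero. */
unsigned long ex_missed_ticks(unsigned long jiffies,
                              unsigned long wall_jiffies)
{
    return jiffies - wall_jiffies;
}
```

If the wall clock was last updated three ticks before the counter wrapped and the counter now reads 2, the function still returns 5.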

Divide error

The first-level handler for the divide error exception (number 0) is the assembly language routine shown in Figure 10.25, from entry.S. This occurs if the result of a divide instruction is too big to fit into the result operand or if the divisor is 0. The CS and EIP values on the stack point to the instruction that caused the exception. The CPU does not push any error code on the stack corresponding to this exception. 264 pushl $ SYMBOL_NAME(do_divide_error) 269 xorl %eax, %eax 270 pushl %ebp 275...

Enabling and disabling tasklets

It is also possible to mark a tasklet as disabled. Although a tasklet can always be scheduled to run, it will not actually be run until it is in the enabled state. This is indicated by its count field having a value of 0. Two functions are provided for disabling tasklets (see Figure 16.20, from <linux/interrupt.h>). Although disabled, a tasklet may still be scheduled to run, using the functions from Section 16.2.3 or Section 16.2.4, but it will not run until enabled again, by one of the...
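
The interaction between scheduling and the count field can be modelled with a simplified sketch. It assumes that disabling increments the count and enabling decrements it, so that nested disables must each be undone before the tasklet can run. The struct ex_tasklet type and all the function names are hypothetical, and no locking or atomicity is modelled.

```c
/* Simplified, single-threaded model of a tasklet. */
struct ex_tasklet {
    int count;       /* 0 means enabled; each disable adds one */
    int scheduled;   /* non-zero if the tasklet is waiting to run */
    int runs;        /* how many times it has actually run */
};

void ex_tasklet_disable(struct ex_tasklet *t)  { t->count++; }
void ex_tasklet_enable(struct ex_tasklet *t)   { t->count--; }
void ex_tasklet_schedule(struct ex_tasklet *t) { t->scheduled = 1; }

/* Run the tasklet if it is both scheduled and enabled.
 * Returns 1 if it ran, 0 if it was left pending or was not scheduled. */
int ex_tasklet_try_run(struct ex_tasklet *t)
{
    if (!t->scheduled || t->count != 0)
        return 0;
    t->scheduled = 0;
    t->runs++;
    return 1;
}
```

A tasklet that is disabled and then scheduled stays pending across calls to ex_tasklet_try_run() and runs only after the matching enable.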

Edge triggered interrupts on an IO APIC

The discussion begins with the struct hw_interrupt_type and then we go on to look at the individual functions. 14.3.1.1 Controller functions for edge triggered interrupts The struct hw_interrupt_type declared and initialised for an edge triggered irq is shown in Figure 14.30, from arch/i386/kernel/io_apic.c. 1334 static struct hw_interrupt_type ioapic_edge_irq_type Figure 14.30 Controller functions for edge triggered interrupts...

Interrupt handling registers

The next block of definitions is shown in Figure 13.2, from <asm-i386/apicdef.h>. These are concerned with incoming interrupts from all sources. APIC_SPIV_FOCUS_DISABLED (1 << 9) APIC_ESR_RECV_ACC APIC_ESR_SENDILL APIC_ESR_RECVILL APIC_ESR_ILLREGA Figure 13.2 Interrupt handling registers This is the offset for the logical destination register (LDR). The destination of an interrupt can be specified logically, using an 8-bit destination address. Each local APIC is given a unique logical ID...

The local vector table

Each local APIC has a range of registers known as the local vector table (LVT). The definitions for these registers are shown in Figure 13.4, from <asm-i386/apicdef.h>. define define define define define define define define define define define define 108 #define APIC_BASE (fix_to_virt(FIX_APIC_BASE)) 110 #define MAX_IO_APICS 8 Figure 13.4 Constants for the local vector table 73-94 These registers constitute the LVT, which specifies delivery and status information for local interrupts. There...

The local interrupt pins

The next part of the code, as shown in Figure 13.7, from arch/i386/kernel/apic.c, sets up the two local interrupt pins, LINT0 and LINT1. 372 value = apic_read(APIC_LVT0) & APIC_LVT_MASKED 373 if (!smp_processor_id() && (pic_mode || !value)) smp_processor_id() 377 value = APIC_DM_EXTINT | APIC_LVT_MASKED 378 printk("masked ExtINT on CPU %d\n", smp_processor_id()) apic_write_around(APIC_LVT0, value) value = APIC_DM_NMI | APIC_LVT_MASKED if (!APIC_INTEGRATED(ver)) 82489DX value |= APIC_LVT_LEVEL_TRIGGER apic_write_around...

The spurious interrupt

If, at the moment an interrupt is sent to the CPU, it is running at higher priority than the interrupt level, there may be a delay in issuing the INTA cycle. If that interrupt has been masked by software in the meantime, then, when INTA finally does arrive, the local APIC does not issue the vector corresponding to the masked interrupt but the spurious interrupt vector. No bit is set in the ISR corresponding to this, so the handler for this vector does not issue an EOI. The second-level handler...

Updating the time-of-day clock by one tick

The timer interrupt is used, among other things, to update the computer's time-of-day clock. But, owing to small irregularities in the frequency of this interrupt, the clock may run fast or slow over a period of time. Sophisticated algorithms are used in an attempt to offset this. Any attempt to correct a clock relies on access to an external time-source. This source updates kernel variables, which are then read by the functions described in this and the following sections. Before describing...

Saving registers

All interrupt handlers need to save the general purpose CPU registers on the stack. The macro shown in Figure 10.15, from entry.S, does that. Figure 10.15 Macro to save registers 86 This clears the direction flag bit in EFLAGS to 0. After this, string instructions will increment the index registers ESI and EDI (as opposed to decrementing them). 87-95 The registers are pushed one by one. The order conforms to that in a struct pt_regs. Note that the values on the...

Second-level handler for machine check

Figure 11.17, from arch/i386/kernel/bluesmoke.c, shows the generic handler for the machine check exception. It prints debugging information on the console and, depending on the seriousness of the problem, either shuts down or continues. 17 void intel_machine_check(struct pt_regs *regs, long error_code) rdmsr(MSR_IA32_MCG_STATUS, mcgstl, mcgsth) if (mcgstl & (1<<0)) recover = 0 printk(KERN_EMERG "CPU %d Machine Check Exception %08x %08x\n", smp_processor_id(), mcgsth, mcgstl) high) 39...

Setting up the idle thread

Figure 3.21, from arch/i386/kernel/process.c, is the function executed by the idle thread. There is no useful work to be done, so it tries to conserve power, halting the processor, waiting for something to happen. 131 void (*idle)(void) = pm_idle 134 while (!current->need_resched) idle() schedule() check_pgt_cache() 126 This function will be dealt with in Section 1.3.2. It merely initialises some fields specific to the CPU on which this thread is running. 127 This gives the idle thread the...

Event timer data structures

Each timer is represented by a struct timer_list, which specifies the function and when it is to be run. Then Linux uses what at first sight might seem an unusual data structure to keep track of these structures. It has headers for 512 different lists, sorted on the order in which the timer is to expire. These are divided into five different groups, known as vectors. The first one, the root vector, contains headers for 256 different lists. This is the 'ready-use' vector; it maintains timers with...
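
One way of picturing how a timer finds its place among the 512 lists is sketched below. It assumes that the root vector is indexed by the low-order 8 bits of the expiry time and that each of the four remaining vectors has 64 lists indexed by successive 6-bit fields (256 + 4 x 64 = 512). It also assumes a 32-bit tick counter, and that a timer whose expiry time has already passed is placed on the root list for the current tick. All names are invented; this is a sketch under those assumptions, not the kernel's own code.

```c
#include <stdint.h>

#define EX_TVR_BITS 8                          /* root vector: 256 lists */
#define EX_TVN_BITS 6                          /* other vectors: 64 lists */
#define EX_TVR_MASK ((1u << EX_TVR_BITS) - 1)
#define EX_TVN_MASK ((1u << EX_TVN_BITS) - 1)

struct ex_slot {
    int vector;        /* 1 (root) to 5 */
    unsigned index;    /* list within that vector */
};

/* Choose the vector and list for a timer due at 'expires', given the
 * current tick count 'now'. */
struct ex_slot ex_timer_slot(uint32_t expires, uint32_t now)
{
    uint32_t delta = expires - now;
    struct ex_slot s;
    int level;

    if ((int32_t)delta < 0) {                  /* already expired */
        s.vector = 1;
        s.index = now & EX_TVR_MASK;
        return s;
    }
    if (delta < (1u << EX_TVR_BITS)) {         /* due within 256 ticks */
        s.vector = 1;
        s.index = expires & EX_TVR_MASK;
        return s;
    }
    /* Find the first of vectors 2..4 whose range covers delta;
     * anything further away falls through to vector 5. */
    for (level = 1; level < 4; level++)
        if (delta < (1u << (EX_TVR_BITS + level * EX_TVN_BITS)))
            break;
    s.vector = level + 1;
    s.index = (expires >> (EX_TVR_BITS + (level - 1) * EX_TVN_BITS))
              & EX_TVN_MASK;
    return s;
}
```

The further away the expiry time, the coarser the list the timer is placed on; only timers due within the next 256 ticks are sorted to the exact tick.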

Sleeping for a fixed length of time

Sometimes a process may want to put itself to sleep for a fixed length of time. Linux uses a standard timer for that and also provides a skeleton function for the timer to run, which merely wakes up the process. The function shown in Figure 15.36, from kernel/sched.c, puts a process to sleep for the number of ticks specified by the parameter timeout. It is called from many places in the kernel, mostly from drivers. 410 signed long schedule_timeout(signed long timeout) if (timeout < 0)...