Linux Kernel Reference
Atomic Operations
Several assembly language instructions are of type read-modify-write that is, they access a memory location twice, the first time to read the old value and the second time to write a new value. Suppose that two kernel control paths running on two CPUs try to read-modify-write the same memory location at the same time by executing nonatomic operations. At first, both CPUs try to read the same location, but the memory arbiter (a hardware circuit that serializes accesses to the RAM chips) steps in...
Group Descriptor and Bitmap
Each block group has its own group descriptor, an ext2_group_desc structure whose fields are illustrated in Table 17-2. Table 17-2. The fields of the Ext2 group descriptor Table 17-2. The fields of the Ext2 group descriptor Block number of first inode table block The bg free blocks count, bg free inodes count,and bg used dirs count fields are used when allocating new inodes and data blocks. These fields determine the most suitable block in which to allocate each data structure. The bitmaps are...
Creating Inodes
The ext2_new_inode( ) function creates an Ext2 disk inode, returning the address of the corresponding inode object (or null, in case of failure). It acts on two parameters the address dir of the inode object that refers to the directory into which the new inode must be inserted and a mode that indicates the type of inode being created. The latter argument also includes an ms_synchronous flag that requires the current process to be suspended until the inode is allocated. The function performs...
Deleting Inodes
The ext2_free_inode( ) function deletes a disk inode, which is identified by an inode object whose address is passed as the parameter. The kernel should invoke the function after a series of cleanup operations involving internal data structures and the data in the file itself. It should come after the inode object has been removed from the inode hash table, after the last hard link referring to that inode has been deleted from the proper directory and after the file is truncated to 0 length to...
The dentry Cache
Since reading a directory entry from disk and constructing the corresponding dentry object requires considerable time, it makes sense to keep in memory dentry objects that you've finished with but might need later. For instance, people often edit a file and then compile it, or edit and print it, or copy it and then edit the copy. In such cases, the same file needs to be repeatedly accessed. To maximize efficiency in handling dentries, Linux uses a dentry cache, which consists of two kinds of...
Mounting the Root Filesystem
Mounting the root filesystem is a crucial part of system initialization. It is a fairly complex procedure because the Linux kernel allows the root filesystem to be stored in many different places, such as a hard disk partition, a floppy disk, a remote filesystem shared via NFS, or even a fictitious block device kept in RAM. To keep the description simple, let's assume that the root filesystem is stored in a partition of a hard disk (the most common case, after all). While the system boots, the...
Reading and Writing a File
Section 12.6.2, described how the read( ) and write( ) system calls are implemented. The corresponding service routines end up invoking the file object's read and write methods, which may be filesystem-dependent. For disk-based filesystems, these methods locate the physical blocks that contain the data being accessed, and activate the block device driver to start the data transfer. Reading a file is page-based the kernel always transfers whole pages of data at once. If a process issues a read(...
The Buddy System Algorithm
The kernel must establish a robust and efficient strategy for allocating groups of contiguous page frames. In doing so, it must deal with a well-known memory management problem called external fragmentation frequent requests and releases of groups of contiguous page frames of different sizes may lead to a situation in which several small blocks of free page frames are scattered inside blocks of allocated page frames. As a result, it may become impossible to allocate a large block of contiguous...
Mounting a Generic Filesystem
Once the root filesystem is initialized, additional filesystems may be mounted. Each must have its own mount point, which is just an already existing directory in the system's directory tree. The mount( ) system call is used to mount a filesystem its sys_mount( ) service routine acts on the following parameters The pathname of a device file containing the filesystem, or null if it is not required (for instance, when the filesystem to be mounted is network-based) The pathname of the directory on...
Inode Objects
All information needed by the filesystem to handle a file is included in a data structure called an inode. A filename is a casually assigned label that can be changed, but the inode is unique to the file and remains the same as long as the file exists. An inode object in memory consists of an inode structure whose fields are described in Table 12-3. Table 12-3. The fields of the inode object Table 12-3. The fields of the inode object Pointers for the modified buffers list Pointers for the...
Paging in Linux
As we explained earlier in Section 2.4.5, Linux adopted a three-level paging model so paging is feasible on 64-bit architectures. Figure 2-11 shows the model, which defines three types of paging tables. The Page Global Directory includes the addresses of several Page Middle Directories, which in turn include the addresses of several Page Tables. Each Page Table entry points to a page frame. The linear address is thus split into four parts. Figure 2-11 does not show the bit numbers because the...
Segmentation in Linux
Segmentation has been included in 80 x 86 microprocessors to encourage programmers to split their applications into logically related entities, such as subroutines or global and local data areas. However, Linux uses segmentation in a very limited way. In fact, segmentation and paging are somewhat redundant since both can be used to separate the physical address spaces of processes segmentation can assign a different linear address space to each process, while paging can map the same linear...
Read Write Spin Locks
Read write spin locks have been introduced to increase the amount of concurrency inside the kernel. They allow several kernel control paths to simultaneously read the same data structure, as long as no kernel control path modifies it. If a kernel control path wishes to write to the structure, it must acquire the write version of the read write lock, which grants exclusive access to the resource. Of course, allowing concurrent reads on data structures improves system performance. Figure 5-2...
Checking the NMI Watchdogs
In multiprocessor systems, Linux offers yet another feature to kernel developers a watchdog system, which might be quite useful to detect kernel bugs that cause a system freeze. To activate such watchdog, the kernel must be booted with the nmi_watchdog parameter. The watchdog is based on a clever hardware feature of multiprocessor motherboards they can broadcast the PIT's interrupt timer as NMI interrupts to all CPUs. Since NMI interrupts are not masked by the cli assembly language instruction,...
Understanding the Linux Kernel 2nd Edition
The Audience for This Book Organization of the Material Overview of the Book Background Information Conventions in This Book How to Contact Us Acknowledgments Section 1.1. Linux Versus Other Unix-Like Kernels Section 1.2. Hardware Dependency Section 1.4. Basic Operating System Concepts Section 1.5. An Overview of the Unix Filesystem Section 1.6. An Overview of Unix Kernels Section 2.1. Memory Addresses Section 2.2. Segmentation in Hardware Section 2.3. Segmentation in Linux Section 2.4. Paging...
Request descriptors
Each block device request is represented by a request descriptor, which is stored in the request data structure illustrated in Table 13-7. The direction of the data transfer is stored in the cmd field it is either read (from block device to RAM) or write (from RAM to block device). The rq_status field is used to specify the status of the request for most block devices, it is simply set either to rq_inactive (for a request descriptor not in use) or to RQ_active (for a valid request that is to be...
The Common File Model
The key idea behind the VFS consists of introducing a common file model capable of representing all supported filesystems. This model strictly mirrors the file model provided by the traditional Unix filesystem. This is not surprising, since Linux wants to run its native filesystem with minimum overhead. However, each specific filesystem implementation must translate its physical organization into the VFS's common file model. For instance, in the common file model, each directory is regarded as...
Reexecution of System Calls
The request associated with a system call cannot always be immediately satisfied by the kernel when this happens, the process that issued the system call is put in a TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state. If the process is put in a task_interruptible state and some other process sends a signal to it, the kernel puts it in the task_running state without completing the system call (see Section 4.8). When this happens, the system call service routine does not complete its job, but...
Dentry Objects
We mentioned in Section 12.1.1 that the VFS considers each directory a file that contains a list of files and other directories. We shall discuss in Chapter 17 how directories are implemented on a specific filesystem. Once a directory entry is read into memory, however, it is transformed by the VFS into a dentry object based on the dentry structure, whose fields are described in Table 12-5. The kernel creates a dentry object for every component of a pathname that a process looks up the dentry...
Handling the TLB
As a general rule, any process switch implies changing the set of active Page Tables. Local TLB entries relative to the old Page Tables must be flushed this is done automatically when the kernel writes the address of the new Page Global Directory into the cr3 control register. In some cases, however, the kernel succeeds in avoiding TLB flushes, which are listed here When performing a process switch between two regular processes that use the same set of Page Tables (see Section 11.2.2). When...
Routing Data Structures
The most important function of the IP layer consists of ensuring that packets originated by the host or received by the network interface cards are forwarded toward their final destinations. As you might easily guess, this task is really crucial because the routing algorithm should be fast enough to keep up with the highest network loads. The IP routing mechanism is fairly simple. Each 32-bit integer representing an IP address encodes both a network address, which specifies the network the host...
The exec Functions
Unix systems provide a family of functions that replace the execution context of a process with a new context described by an executable file. The names of these functions start with the prefix exec, followed by one or two letters therefore, a generic function in the family is usually referred to as an exec function. The exec functions are listed in Table 20-7 they differ in how the parameters are interpreted. The first parameter of each function denotes the pathname of the file to be executed....
IRQs and Interrupts
Each hardware device controller capable of issuing interrupt requests has an output line designated as an Interrupt ReQuest IRQ . All existing IRQ lines are connected to the input pins of a hardware circuit called the Interrupt Controller, which performs the following actions 1. Monitors the IRQ lines, checking for raised signals. 2. If a raised signal occurs on an IRQ line a. Converts the raised signal received into a corresponding vector. b. Stores the vector in an Interrupt Controller I O...
Direct Memory Access DMA
All PCs include an auxiliary processor called the Direct Memory Access Controller (DMAC), which can be instructed to transfer data between the RAM and an I O device. Once activated by the CPU, the DMAC is able to continue the data transfer on its own when the data transfer is completed, the DMAC issues an interrupt request. The conflicts that occur when both CPU and DMAC need to access the same memory location at the same time are resolved by a hardware circuit called a memory arbiter (see...
The ret from fork Function
The ret_from_fork( ) function is executed by the child process right after its creation through a fork( ), vfork( ), or clone( ) system call (see Section 3.4.1). It is essentially equivalent to the following assembly language code ret from fork pushl ebx call schedule tail addl 4, esp movl 0xffffe000, ebx andl esp, ebx testb 0x02,24( ebx) jne tracesys_exit jmp ret from sys call tracesys_exit Initially, the ebx register stores the address of the parent's process descriptor this value is passed...
TLB mode mechanism it is usually invoked whenever the kernel modifies a Page Table entry relative to the Kernel Mode
When some CPU starts running a kernel thread, the kernel sets it into lazy TLB mode. When requests are issued to clear some TLB entries, each CPU in lazy TLB mode does not flush the corresponding entries however, the CPU remembers that its current process is running on a set of Page Tables whose TLB entries for the User Mode addresses are invalid. As soon as the CPU in lazy TLB mode switches to a regular process with a different set of Page Tables, the hardware automatically flushes the TLB...
The Memory Descriptor
All information related to the process address space is included in a data structure called a memory descriptor. This structure of type mm_struct is referenced by the mm field of the process descriptor. The fields of a memory descriptor are listed in Table 8-2. Table 8-2. The fields of the memory descriptor Table 8-2. The fields of the memory descriptor Pointer to the head of the list of memory region objects Pointer to the root of the red-black tree of memory region objects Pointer to the last...
Reclaiming Page Frame
The virtual memory subsystem of Linux is, without any doubt, the most complex and performance-critical component of the whole kernel. In previous chapters, we explained how the kernel handles dynamic memory by keeping track of free and busy page frames. We have also discussed how every process in User Mode has its own linear address space so that page frames can be assigned to the process at the very last possible moment. Finally, we have also described how dynamic memory is used to cache the...
Process may in turn resume execution of the traced process by means of a sigcont signal
The signal is not blocked by the destination process. The signal is being ignored by the destination process (either because the process explicitly ignored it or because the process did not change the default action of the signal and that action is ignore). Handle the signal, which may require switching the process to a handler function at any point during its execution and restoring the original execution context after the function returns. Moreover, Linux must take into account the different...
System Calls Related to Signal Handling
As stated in the introduction of this chapter, programs running in User Mode are allowed to send and receive signals. This means that a set of system calls must be defined to allow these kinds of operations. Unfortunately, for historical reasons, several system calls exist that serve essentially the same purpose. As a result, some of these system calls are never invoked. For instance, sys_sigaction( ) and sys_rt_sigaction( ) are almost identical, so the sigaction( ) wrapper function included in...
Buffer Head Data Structures
As mentioned in Section 13.4.4, each buffer head is stored in a data structure of type buffer_head. These data structures have their own slab allocator cache called bh_cachep, which should not be confused with the buffer cache itself. The slab allocator cache is a memory cache (see Section 3.2.2) for the buffer head objects, meaning that it has no interaction with disks and is simply a way of managing memory efficiently. In contrast, the buffer cache is a disk cache for the data in the buffers....
Buffer Heads
The buffer head is a descriptor of type buffer_head associated with each buffer. It contains all the information needed by the kernel to know how to handle the buffer thus, before operating on each buffer, the kernel checks its buffer head. The buffer head fields are listed in Table 13-6. The b_data field of each buffer head stores the starting address of the corresponding buffer. Since a page frame may store several buffers, the b_this_page field points to the buffer head of the next buffer in...
The getblk Function
The getblk( ) function is the main service routine for the buffer cache. When the kernel needs to read or write the contents of a block of a physical device, it must check whether the buffer head for the required buffer is already included in the buffer cache. If the buffer is not there, the kernel must create a new entry in the cache. To do this, the kernel invokes getblk( ), specifying as parameters the device identifier, the block number, and the block size. This function returns the address...
Local Interrupt Disabling
Interrupt disabling is one of the key mechanisms used to ensure that a sequence of kernel statements is treated as a critical section. It allows a kernel control path to continue executing even when hardware devices issue IRQ signals, thus providing an effective way to protect data structures that are also accessed by interrupt handlers. By itself, however, local interrupt disabling does not protect against concurrent accesses to data structures by interrupt handlers running on other CPUs, so...
Read Write Semaphores
Read write semaphores are a new feature of Linux 2.4. They are similar to the read write spin locks described earlier in Section 5.3.4, except that waiting processes are suspended until the semaphore becomes open again. Many kernel control paths may concurrently acquire a read write semaphore for reading however, any writer kernel control path must have exclusive access to the protected resource. Therefore, the semaphore can be acquired for writing only if no other kernel control path is...
The Role of the Virtual Filesystem VFS
The Virtual Filesystem (also known as Virtual Filesystem Switch or VFS) is a kernel software layer that handles all system calls related to a standard Unix filesystem. Its main strength is providing a common interface to several kinds of filesystems. For instance, let's assume that a user issues the shell command where floppy is the mount point of an MS-DOS diskette and tmp is a normal Second Extended Filesystem (Ext2) directory. As shown in Figure 12-1(a), the VFS is an abstraction layer...
Inode Semaphore
As we shall see in Chapter 12, Linux stores the information on a disk file in a memory object called an inode. The corresponding data structure includes its own semaphore in the i_sem field. A huge number of race conditions can occur during filesystem handling. Indeed, each file on disk is a resource held in common for all users, since all processes may (potentially) access the file content, change its name or location, destroy or duplicate it, and so on. For example, let's suppose that a...
Activating and Deactivating a Swap Area
Once a swap area is initialized, the superuser (or, more precisely, any user having the cap_sys_admin capability, as described in Section 20.1.1) may use the swapon and swapoff programs to activate and deactivate the swap area, respectively. These programs use the swapon( ) and swapoff( ) system calls we'll briefly sketch out the corresponding service routines. 16.2.3.1 The sys_swapon( ) service routine The sys_swapon( ) service routine receives the following as parameters This parameter points...
Table 205 The ptrace commands
Start execution tracing for the current process Read a 32-bit value from the text segment Read a 32-bit value from the data segment Read the CPU's normal and debug registers Write a 32-bit value into the text segment Write a 32-bit value into the data segment Write the CPU's normal and debug registers Resume execution for a single assembly language instruction Start execution tracing for another process The ptrace( ) system call modifies the p_pptr field in the descriptor of the traced process...
Device Controllers
A complex device may require a device controller to drive it. Essentially, the controller plays two important roles It interprets the high-level commands received from the I O interface and forces the device to execute specific actions by sending proper sequences of electrical signals to it. It converts and properly interprets the electrical signals received from the device and modifies (through the I O interface) the value of the status register. A typical device controller is the disk...
Changing a Signal Action
The sigaction(sig,act,oact) system call allows users to specify an action for a signal of course, if no signal action is defined, the kernel executes the default action associated with the delivered signal. The corresponding sys_sigaction( ) service routine acts on two parameters the sig signal number and the act table of type sigaction that specifies the new action. A third oact optional output parameter may be used to get the previous action associated with the signal. The function checks...
Modifying the Set of Blocked Signals
The sigprocmask( ) system call allows processes to modify the set of blocked signals it applies only to regular (non-real-time) signals. The corresponding sys_sigprocmask( ) service routine acts on three parameters oset Pointer in the process address space to a bit array where the previous bit mask must be stored Pointer in the process address space to the bit array containing the new bit mask how Flag that may have one of the following values SIG_BLOCK The *set bit mask array specifies the...
Suspending the Process
The sigsuspend( ) system call puts the process in the task_interruptible state, after having blocked the standard signals specified by a bit mask array to which the mask parameter points. The process will wake up only when a nonignored, nonblocked signal is sent to it. The corresponding sys_sigsuspend( ) service routine executes these statements mask & (sigmask(SIGKILL) sigmask(SIGSTOP)) spin lock irq(& current-> sigmask lock) siginitset(& current-> blocked, mask) spin unlock...
The read and write System Calls
Let's return to the code in our cp example. The open( ) system calls return two file descriptors, which are stored in the inf and outf variables. Then the program starts a loop at each iteration, a portion of the floppy TEST file is copied into a local buffer (read( ) system call), and then the data in the local buffer is written into the tmp test file (write( ) system call). The read( ) and write( ) system calls are quite similar. Both require three parameters a file descriptor fd, the address...
[7 More precisely the usage counter keeps track of the number of file objects referring to the device file since clone
The open method checks the value of the usage counter before the increment. If the counter is null, the device driver must allocate the resources and enable interrupts and DMA on the hardware device. The release method checks the value of the usage counter after the decrement. If the counter is null, no more processes are using the hardware device. If so, the method disables interrupts and DMA on the I O controller, and then releases the allocated resources. The duration of an I O operation is...
Reading from a Pipe
A process wishing to get data from a pipe issues a read( ) system call, specifying the file descriptor associated with the pipe's reading end. As described in Section 12.6.2, the kernel ends up invoking the read method found in the file operation table associated with the proper file object. In the case of a pipe, the entry for the read method in the read_pipe_fops table points to the pipe_read( ) function. The pipe_read( ) function is quite involved, since the POSIX standard specifies several...
Softirqs Tasklets and Bottom Halves
We mentioned earlier in Section 4.6 that several tasks among those executed by the kernel are not critical they can be deferred for a long period of time, if necessary. Remember that the interrupt service routines of an interrupt handler are serialized, and often there should be no occurrence of an interrupt until the corresponding interrupt handler has terminated. Conversely, the deferrable tasks can execute with all interrupts enabled. Taking them out of the interrupt handler helps keep...
Delivering a Signal
We assume that the kernel noticed the arrival of a signal and invoked one of the functions mentioned in the previous section to prepare the process descriptor of the process that is supposed to receive the signal. But in case that process was not running on the CPU at that moment, the kernel deferred the task of delivering the signal. We now turn to the activities that the kernel performs to ensure that pending signals of a process are handled. As mentioned in Section 4.8, the kernel checks the...
File Descriptor and Inode
Unix makes a clear distinction between the contents of a file and the information about a file. With the exception of device and special files, each file consists of a sequence of characters. The file does not include any control information, such as its length or an End-Of-File (EOF) delimiter. All information needed by the filesystem to handle a file is included in a data structure called an inode. Each file has its own inode, which the filesystem uses to identify the file. While filesystems...
Processes Lightweight Processes and Threads
The term process is often used with several different meanings. In this book, we stick to the usual OS textbook definition a process is an instance of a program in execution. You might think of it as the collection of data structures that fully describes how far the execution of the program has progressed. Processes are like human beings they are generated, they have a more or less significant life, they optionally generate one or more child processes, and eventually they die. A small...
Regular Paging
Starting with the 80386, the paging unit of Intel processors handles 4 KB pages. The 32 bits of a linear address are divided into three fields The translation of linear addresses is accomplished in two steps, each based on a type of translation table. The first translation table is called the Page Directory and the second is called the Page Table. The aim of this two-level scheme is to reduce the amount of RAM required for per-process Page Tables. If a simple one-level Page Table was used, then...
Kernel Mappings of High Memory Page Frames
Page frames above the 896 MB boundary are not mapped in the fourth gigabyte of the kernel linear address spaces, so they cannot be directly accessed by the kernel. This implies that any page allocator function that returns the linear address of the assigned page frame doesn't work for the high memory. For instance, suppose that the kernel invoked__get_free_pages(GFP_HiGHMEM,0) to allocate a page frame in high memory. If the allocator assigned a page frame in high memory,__get_free_pages( )...
Device Drivers
A device driver is a software layer that makes a hardware device respond to a well-defined programming interface. We are already familiar with this kind of interface it consists of the canonical set of VFS functions open, read, lseek, ioctl, and so forth that control a device. The actual implementation of all these functions is delegated to the device driver. Since each device has a unique I O controller, and thus unique commands and unique state information, most I O devices have their own...
Initializing a Block Device Driver
Let's now describe how a block device driver is initialized. We already described how the kernel customizes the methods of the file object when a block device file is opened in Section 13.2.3. Its f_op field is set to the address of the def_blk_fops variable. The contents of this table are shown in Table 13-5. The dentry_open( ) function checks whether the open method is defined this is always true for a block device file, so the blkdev_open( ) function is executed. Table 13-5. The default file...
Sectors Blocks and Buffers
Each data transfer operation for a block device acts on a group of adjacent bytes called a sector. In most disk devices, the size of a sector is 512 bytes, although there are devices that use larger sectors (1,024 and 2,048 bytes). Notice that the sector should be considered the basic unit of data transfer it is never possible to transfer less than a sector, although most disk devices are capable of transferring several adjacent sectors at once. The kernel stores the sector size of each...
An Overview of Block Device Driver Architecture
Although block device drivers are able to transfer a single block at a time, the kernel does not perform an individual I O operation for each block to be accessed on disk this would lead to poor disk performances, since locating the physical position of a block on the disk surface is quite time-consuming. Instead, the kernel tries, whenever possible, to cluster several blocks and handle them as a whole, thus reducing the average number of head movements. When a process, the VFS layer, or any...
Request queue descriptors
Request queues are represented by means of request queue descriptors each of them is a request_queue_t data structure whose fields are listed in Table 13-8. Table 13-8. The fields of a request queue descriptor Table 13-8. The fields of a request queue descriptor read and write free lists of requests Method to append a block to the request Method to insert a block in front of the request Method to attempt merging an enlarged request with the adjacent ones Method that passes a request to a driver...
The llrwblock Function
The ll_rw_block( ) function creates a block device request. It is invoked from several places in the kernel to trigger the I O data transfer of one or more blocks. The function receives the following parameters The type of operation, rw, whose value can be read, write, or reada . The last operation type differs from the former in that the function does not block when a request descriptor is not available. The number, nr, of blocks to be transferred. A bhs array of nr pointers to buffer heads...
Lets look closer at these two cases 13461 Scheduling the activation of the strategy routine
As we saw earlier, it's expedient to delay activation of the strategy routine in order to increase the chances of clustering requests for adjacent blocks. The delay is accomplished through a technique known as device plugging and unplugging. As long as a block device driver is plugged, its strategy routine is not activated even if there are requests to be processed in the driver's queues. If the real device's request queue is empty and the device isn't already plugged, _ _make_request( )...
General Characteristics of Ext2
Unix-like operating systems use several filesystems. Although the files of all such filesystems have a common subset of attributes required by a few POSIX APIs like stat( ), each filesystem is implemented in a different way. The first versions of Linux were based on the Minix filesystem. As Linux matured, the Extended Filesystem (Ext FS) was introduced it included several significant extensions, but offered unsatisfactory performance. The Second Extended Filesystem (Ext2) was introduced in 1994...
Process Switch
To control the execution of processes, the kernel must be able to suspend the execution of the process running on the CPU and resume the execution of some other process previously suspended. This activity goes variously by the names process switch, task switch, or context switch. The next sections describe the elements of process switching in Linux. While each process can have its own address space, all processes have to share the CPU registers. So before resuming the execution of a process,...
The clone fork and vfork System Calls
Lightweight processes are created in Linux by using a function named clone( ), which uses four parameters Specifies a function to be executed by the new process when the function returns, the child terminates. The function returns an integer, which represents the exit code for the child process. Points to data passed to the fn( ) function. Miscellaneous information. The low byte specifies the signal number to be sent to the parent process when the child terminates the sigchld signal is...
Figure 85 The flow diagram of the Page Fault handler
Dues the deists belongToEhe prate addresi space' The identifiers vmalloc_fault, good_area, bad_area, and no_context are labels appearing in do_page_fault( ) that should help you to relate the blocks of the flow diagram to specific lines of code. The identifiers vmalloc_fault, good_area, bad_area, and no_context are labels appearing in do_page_fault( ) that should help you to relate the blocks of the flow diagram to specific lines of code. The do_ page_fault( ) function accepts the following...
Kernel Wrapper Routines
Although system calls are used mainly by User Mode processes, they can also be invoked by kernel threads, which cannot use library functions. To simplify the declarations of the corresponding wrapper routines, Linux defines a set of seven macros called _syscall0 through _syscall6. In the name of each macro, the numbers 0 through 6 correspond to the number of parameters used by the system call (excluding the system call number). The macros are used to declare wrapper routines that are not...
IPC Semaphores
IPC semaphores are quite similar to the kernel semaphores introduced in Chapter 5 they are counters used to provide controlled access to shared data structures for multiple processes. The semaphore value is positive if the protected resource is available, and 0 if the protected resource is currently not available. A process that wants to access the resource tries to decrement the semaphore value the kernel, however, blocks the process until the operation on the semaphore yields a positive...
Figure 191 IPC semaphore data structures
Formally, the array stores pointers to kern_ipc_perm data structures, but each structure is simply the first field of the sem_array data structure. All fields of the sem_array data structure are shown in Table 19-9. Table 19-9. The fields in the sem_array data structure Table 19-9. The fields in the sem_array data structure The sem_base field points to an array of struct sem data structures, one for every IPC primitive semaphore. The latter data structure includes only two fields The value of...
Kernel Page Tables
The kernel maintains a set of Page Tables for its own use, rooted at a so-called master kernel Page Global Directory. After system initialization, this set of Page Tables are never directly used by any process or kernel thread rather, the highest entries of the master kernel Page Global Directory are the reference model for the corresponding entries of the Page Global Directories of every regular process in the system. We explain how the kernel ensures that changes to the master kernel Page...
[8 Linux records the number of minor and major Page Faults for each process This information together with several
Conversely, when handling a read access, the content of the page is irrelevant because the process is addressing it for the first time. It is safer to give a page filled with zeros to the process rather than an old page filled with information written by some other process. Linux goes one step further in the spirit of demand paging. There is no need to assign a new page frame filled with zeros to the process right away, since we might as well give it an existing page called zero page, thus...
Writing Dirty Buffers to Disk
Unix systems allow the deferred writes of dirty buffers into block devices, since this noticeably improves system performance. Several write operations on a buffer could be satisfied by just one slow physical update of the corresponding disk block. Moreover, write operations are less critical than read operations, since a process is usually not suspended because of delayed writings, while it is most often suspended because of delayed reads. Thanks to deferred writes, each physical block device...
Sending Packets to the Network Card
A network card device driver is usually started either when the kernel inserts a packet in its transmit queue (as described in the previous section), or when a packet is received from the communication channel. Let's focus here on packet transmission. As we have seen, the qdisc_run( ) function is invoked whenever the kernel wishes to activate a network card device driver it is also executed by the net_tx_softirq softirq, which is implemented by the net_tx_action( ) function (see Section 4.7)....
Disk Caches
It shows how Linux uses sophisticated techniques to improve system performances by reducing disk accesses as much as possible. As mentioned in Section 12.1.1, a disk cache is a software mechanism that allows the system to keep in RAM some data that is normally stored on a disk, so that further accesses to that data can be satisfied quickly without accessing the disk. Besides the dentry cache, which is used by the VFS to speed up the translation of a file...
Global Interrupt Disabling
Some critical kernel functions can execute on a CPU only if no interrupt handler or deferrable function is running on any other CPU. This synchronization requirement is satisfied by global interrupt disabling. A typical scenario consists of a driver that needs to reset the hardware device. Before fiddling with I O ports, the driver performs global interrupt disabling, ensuring that no other driver will access the same ports. As we shall see in this section, global interrupt disabling...
Parenthood Relationships Among Processes
The pidhash table and chained lists Processes created by a program have a parent child relationship. When a process creates multiple children, these children have sibling relationships. Several fields must be introduced in a process descriptor to represent these relationships. Processes 0 and 1 are created by the kernel as we shall see later in the chapter, process 1 (init) is the ancestor of all other processes. The descriptor of a process P includes the following fields Points to...
A21 Booting Linux from Floppy Disk
The only way to store a Linux kernel on a single floppy disk is to compress the kernel image. As we shall see, compression is done at compile time and decompression is done by the loader. If the Linux kernel is loaded from a floppy disk, the boot loader is quite simple. It is coded in the arch i386 boot bootsect.S assembly language file. When a new kernel image is produced by compiling the kernel source, the executable code yielded by this assembly language file is placed at the beginning of...
Doubly linked lists
The process list is a special doubly linked list. However, as you may have noticed, the Linux kernel uses hundreds of doubly linked lists that store the various kernel data structures. For each list, a set of primitive operations must be implemented initializing the list, inserting and deleting an element, scanning the list, and so on. It would be both a waste of programmers' efforts and a waste of memory to replicate the primitive operations for each Therefore, the Linux kernel defines the...
The current macro
The close association between the process descriptor and the Kernel Mode stack just described offers a key benefit in terms of efficiency the kernel can easily obtain the process descriptor pointer of the process currently running on a CPU from the value of the esp register. In fact, since the memory area is 8 KB (213 bytes) long, all the kernel has to do is mask out the 13 least significant bits of esp to obtain the base address of the process descriptor. This is done by the current macro,...
Verifying the Parameters
All system call parameters must be carefully checked before the kernel attempts to satisfy a user request. The type of check depends both on the system call and on the specific parameter. Let's go back to the write( ) system call introduced before the fd parameter should be a file descriptor that describes a specific file, so sys_write( ) must check whether fd really is a file descriptor of a file previously opened and whether the process is allowed to write into it (see Section 1.5.6). If any...
Creating and Destroying a Pipe
The pipe( ) system call is serviced by the sys_pipe( ) function, which in turn invokes the do_pipe( ) function. To create a new pipe, do_pipe( ) performs the following operations 1. Invokes the get_pipe_inode( ) function, which allocates and initializes an inode object for the pipe in the pipefs filesystem. In particular, this function executes the following actions a. Allocates a pipe_inode_info data structure and stores its address in the i_pipe field of the inode. b. Allocates a page frame...
Direct IO Transfers
As we have seen, in Version 2.4 of Linux, there is no substantial difference between accessing a regular file through the filesystem, accessing it by referencing its blocks on the underlying block device file, or even establishing a file memory mapping. There are, however, some highly sophisticated programs (self-caching applications) that would like to have full control of the whole I O data transfer mechanism. Consider, for example, highperformance database servers most of them implement...
The Slab Allocator
Running a memory area allocation algorithm on top of the buddy algorithm is not particularly efficient. A better algorithm is derived from the slab allocator schema developed in 1994 for the Sun Microsystem Solaris 2.4 operating system. It is based on the following premises The type of data to be stored may affect how memory areas are allocated for instance, when allocating a page frame to a User Mode process, the kernel invokes the get_zeroed_page( ) function, which fills the page with zeros....
Figure 74 Relationship between cache and slab descriptors
Caches are divided into two types general and specific. General caches are used only by the slab allocator for its own purposes, while specific caches are used by the remaining parts of the kernel. A first cache contains the cache descriptors of the remaining caches used by the kernel. The cache_cache variable contains its descriptor. Twenty-six additional caches contain geometrically distributed memory areas. The table, called cache_sizes (whose elements are of type cache_sizes_t), points to...
The trytoswapout Function
The try_to_swap_out function attempts to free a given page frame, either discarding or swapping out its contents. The function returns the value 1 if it succeeds in releasing the page, and 0 otherwise. Remember that by releasing the page, we mean that the references to the page frame are removed from the Page Tables of all processes that share the page. In this case, however, the page frame is not necessarily released to the buddy system for instance, it could be referenced by the swap cache....
IPC Shared Memory
The most useful IPC mechanism is shared memory, which allows two or more processes to access some common data structures by placing them in an IPC shared memory region. Each process that wants to access the data structures included in an IPC shared memory region must add to its address space a new memory region (see Section 8.3), which maps the page frames associated with the IPC shared memory region. Such page frames can then be easily handled by the kernel through demand paging (see Section...
Kernel Control Paths
In Section 4.3, a kernel control path was defined as a sequence of instructions executed by the kernel to handle interrupts of different kinds. Each kernel request is handled by a different kernel control path, which usually executes several different kernel functions. For example, when a User Mode process issues a system call request, the system_call( ) function (see Chapter 9) begins this particular kernel control path and the ret_from_sys_call( ) function ends it (see Section 4.8). As we...
The Physical Address Extension PAE Paging Mechanism
The amount of RAM supported by a processor is limited by the number of address pins connected to the address bus. Older Intel processors from the 80386 to the Pentium used 32bit physical addresses. In theory, up to 4 GB of RAM could be installed on such systems in practice, due to the linear address space requirements of User Mode processes, the kernel cannot directly address more than 1 GB of RAM, as we shall see in the later section Section 2.5. However, some demanding applications running on...
Memory Mapping Data Structures
A memory mapping is represented by a combination of the following data structures The inode object associated with the mapped file The address_space object of the mapped file A file object for each different mapping performed on the file by different processes A vm_area_struct descriptor for each different mapping on the file A page descriptor for each page frame assigned to a memory region that maps the file Figure 15-4 illustrates how the data structures are linked. In the upper-left corner,...
Memory Addresses
Programmers casually refer to a memory address as the way to access the contents of a memory cell. But when dealing with 80 x 86 microprocessors, we have to distinguish three kinds of addresses Included in the machine language instructions to specify the address of an operand or of an instruction. This type of address embodies the well-known 80 x x86 segmented architecture that forces MS-DOS and Windows programmers to divide their programs into segments. Each logical address consists of a...
Appendix C Source Code Structure
To help you to find your way through the files of the source code, we briefly describe the organization of the kernel directory tree. As usual, all pathnames refer to the main directory of the Linux kernel, which is, in most Linux distributions, usr src linux. Linux source code for all supported architectures is contained in about 8750 C and Assembly files stored in about 530 subdirectories it consists of about 4 million lines of code, which occupy more than 144 megabytes of disk space. The...
System Call Handler and Service Routines
When a User Mode process invokes a system call, the CPU switches to Kernel Mode and starts the execution of a kernel function. In Linux a system call must be invoked by executing the int 0x8 0 assembly language instruction, which raises the programmed exception that has vector 128 (see Section 4.4.1 and Section 4.2.4, both in Chapter 4). Since the kernel implements many different system calls, the process must pass a parameter called the system call number to identify the required system call...
Synchronization Primitives
Chapter 1 introduced the concepts of race condition and critical region for processes. The same definitions apply to kernel control paths. In this chapter, a race condition can occur when the outcome of a computation depends on how two or more interleaved kernel control paths are nested. A critical region is any section of code that must be completely executed by any kernel control path that enters it before another kernel control path can enter it. We now examine how kernel control paths can...
Choosing Among Spin Locks Semaphores and Interrupt Disabling
Unfortunately, access patterns to most kernel data structures are a lot more complex than the simple examples just shown, and kernel developers are forced to use semaphores, spin locks, interrupts, and softirq disabling. Generally speaking, choosing the synchronization primitives depends on what kinds of kernel control paths access the data structure, as shown in Table 5-6. Table 5-6. Protection required by data structures accessed by kernel control paths Table 5-6. Protection required by data...
Extended Paging
Starting with the Pentium model, 80 x 86 microprocessors introduce extended paging, which allows page frames to be 4 MB instead of 4 KB in size (see Figure 2-7). As mentioned in the previous section, extended paging is enabled by setting the Page Size flag of a Page Directory entry. In this case, the paging unit divides the 32 bits of a linear address into two fields Page Directory entries for extended paging are the same as for normal paging, except that The Page Size flag must be set. Only...
Reserved Page Frames
The kernel's code and data structures are stored in a group of reserved page frames. A page contained in one of these page frames can never be dynamically assigned or swapped to disk. As a general rule, the Linux kernel is installed in RAM starting from the physical address 0x00100000 i.e., from the second megabyte. The total number of page frames required depends on how the kernel is configured. A typical configuration yields a kernel that can be loaded in less than 2 MBs of RAM. Why isn't the...
Memory Barriers
When using optimizing compilers, you should never take for granted that instructions will be performed in the exact order in which they appear in the source code. For example, a compiler might reorder the assembly language instructions in such a way to optimize how registers are used. Moreover, modern CPUs usually execute several instructions in parallel, and might reorder memory accesses. These kinds of reordering can greatly speed up the program. When dealing with synchronization, however,...
The page hash table
When a process reads a large file, the page cache may become filled with pages related to that file. In such cases, scanning a long list of page descriptors to find the page that maps the required file portion could become a time-consuming operation. For this reason, Linux uses a hash table of page descriptor pointers named page_hash_table. Its size depends on the amount of available RAM for example, for systems having 128 MB of RAM, page_hash_table is stored in 32 page frames and includes...
Catching the Signal
If a handler has been established for the signal, the do_signal( ) function must enforce its execution. It does this by invoking handle_signal( ) handle signal(signr, ka, & info, oldset, regs) return 1 Notice how do_signal( ) returns after having handled a single signal. Other pending signals won't be considered until the next invocation of do_signal( ). This approach ensures that real-time signals will be dealt with in the proper order. Executing a signal handler is a rather complex task...
File Objects
A file object describes how a process interacts with a file it has opened. The object is created when the file is opened and consists of a file structure, whose fields are described in Table 12-4. Notice that file objects have no corresponding image on disk, and hence no dirty field is included in the file structure to specify that the file object has been modified. Table 12-4. The fields of the file object Table 12-4. The fields of the file object Pointers for generic file object list dentry...
Buffer Pages
Although the page cache and the buffer cache are different disk caches, in Version 2.4 of Linux, they are somewhat intertwined. In fact, for reasons of efficiency, buffers are not allocated as single memory objects instead, buffers are stored in dedicated pages called buffer pages. All the buffers within a single buffer page must have the same size hence, on the 80 x 86 architecture, a buffer page can include from one to eight buffers, depending on the block size. A stronger constraint,...
Parameter Passing
Like ordinary functions, system calls often require some input output parameters, which may consist of actual values (i.e., numbers), addresses of variables in the address space of the User Mode process, or even addresses of data structures including pointers to User Mode functions (see Section 10.4). Since the system_call( ) function is the common entry point for all system calls in Linux, each of them has at least one parameter the system call number passed in the eax register. For instance,...
Creating a Memory Mapping
To create a new memory mapping, a process issues an mmap( ) system call, passing the following parameters to it A file descriptor identifying the file to be mapped. An offset inside the file specifying the first character of the file portion to be mapped. The length of the file portion to be mapped. A set of flags. The process must explicitly set either the map_shared flag or the map_private flag to specify the kind of memory mapping requested.-4 41 The process could also set the map_anonymous...
Final kernel Page Table when RAM size is less than 896 MB
The final mapping provided by the kernel Page Tables must transform linear addresses starting from 0xc000000 0 into physical addresses starting from 0. The _ pa macro is used to convert a linear address starting from PAGE_OFFSET to the corresponding physical address, while the _va macro does the reverse. The kernel master Page Global Directory is still stored in swapper_pg_dir. It is initialized by the paging_init( ) function, which does the following 1. Invokes pagetable_init( ) to set up the...
Fix Mapped Linear Addresses
We saw that the initial part of the fourth gigabyte of kernel linear addresses maps the physical memory of the system. However, at least 128 MB of linear addresses are always left available because the kernel uses them to implement noncontiguous memory allocation and fix-mapped linear addresses. Noncontiguous memory allocation is just a special way to dynamically allocate and release pages of memory, and is described in Section 7.3. In this section, we focus on fix-mapped linear addresses....
Posix Apis and System Calls
Let's start by stressing the difference between an application programmer interface (API) and a system call. The former is a function definition that specifies how to obtain a given service, while the latter is an explicit request to the kernel made via a software interrupt. Unix systems include several libraries of functions that provide APIs to programmers. Some of the APIs defined by the libc standard C library refer to wrapper routines (routines whose only purpose is to issue a system...
The Role of Signals
A signal is a very short message that may be sent to a process or a group of processes. The only information given to the process is usually a number identifying the signal there is no room in standard signals for arguments, a message, or other accompanying information. A set of macros whose names start with the prefix sig is used to identify signals we have already made a few references to them in previous chapters. For instance, the sigchld macro was mentioned in Section 3.4.1. This macro,...
Linux Kernel Block Fifo
Although pipes are a simple, flexible, and efficient communication mechanism, they have one main drawback namely, that there is no way to open an already existing pipe. This makes it impossible for two arbitrary processes to share the same pipe, unless the pipe was created by a common ancestor process. This drawback is substantial for many application programs. Consider, for instance, a database engine server, which continuously polls client processes wishing to issue some queries and which...
File such as block and character device files FIFOs and even regular files it cannot create directories or sockets
Once created, a FIFO can be accessed through the usual open( ), read( ), write( ), and close( ) system calls, but the VFS handles it in a special way because the FIFO inode and file operations are customized and do not depend on the filesystems in which the FIFO is stored. The POSIX standard specifies the behavior of the open( ) system call on FIFOs the behavior depends essentially on the requested access type, the kind of I O operation (blocking or nonblocking), and the presence of other...
Pathname Lookup
In this section, we illustrate how the VFS derives an inode from the corresponding file pathname. When a process must identify a file, it passes its file pathname to some VFS system call, such as open( ) , mkdir( ) , rename( ) , or stat( ) . The standard procedure for performing this task consists of analyzing the pathname and breaking it into a sequence of filenames. All filenames except the last must identify directories. If the first character of the pathname is , the pathname is absolute,...
Description Pmw
Create the file if it does not exist With o creat, fail if the file already exists Never consider the file as a controlling terminal Truncate the file (remove all existing contents) No system calls will block on the file Synchronous write (block until physical write terminates) Asynchronous I O notification via signals Direct I O transfer (no kernel buffering) Do not follow a trailing symbolic link in pathname Let's describe the operation of the sys_open( ) function. It performs the following...
Final kernel Page Table when RAM size is between 896 MB and 4096 MB
In this case, the RAM cannot be mapped entirely into the kernel linear address space. The best Linux can do during the initialization phase is to map a RAM window having size of 896 MB into the kernel linear address space. If a program needs to address other parts of the existing RAM, some other linear address interval must be mapped to the required RAM. This implies changing the value of some Page Table entries. We'll defer discussing how this kind of dynamic remapping is done in Chapter 7. To...
Linux File Locking
Linux supports all fashions of file locking advisory and mandatory locks, as well as the fcntl( ), flock( ), and the lockf( ) system calls. However, the lockf( ) system call is just a library wrapper routine, and therefore is not discussed here. fcntl( )'s mandatory locks can be enabled and disabled on a per-filesystem basis using the ms_mandlock flag (the mand option) of the mount( ) system call. The default is to switch off mandatory locking. In this case, fcntl( ) creates advisory locks....
Process Resource Limits
Each process has an associated set of resource limits, which specify the amount of system resources it can use. These limits keep a user from overwhelming the system (its CPU, disk space, and so on). Linux recognizes the following resource limits The maximum size of process address space, in bytes. The kernel checks this value when the process uses malloc( ) or a related function to enlarge its address space (see Section 8.1). The maximum core dump file size, in bytes. The kernel checks this...
Completions
Linux 2.4 also makes use of another synchronization primitive similar to semaphores the completions. They have been introduced to solve a subtle race condition that occurs in multiprocessor systems when process A allocates a temporary semaphore variable, initializes it as closed MUTEX, passes its address to process B, and then invokes down( ) on it. Later on, process B running on a different CPU invokes up( ) on the same semaphore. However, the current implementation of up( ) and down( ) also...
Memory Region Handling
Having the basic understanding of data structures and state information that control memory handling, we can look at a group of low-level functions that operate on memory region descriptors. They should be considered auxiliary functions that simplify the implementation of do_mmap( ) and do_munmap( ). Those two functions, which are described in Section 8.3.4 and Section 8.3.5 later in this chapter, enlarge and shrink the address space of a process, respectively. Working at a higher level than...
Synchronizing Accesses to Kernel Data Structures
A shared data structure can be protected against race conditions by using some of the synchronization primitives shown in the previous section. Of course, system performances may vary considerably, depending on the kind of synchronization primitive selected. Usually, the following rule of thumb is adopted by kernel developers always keep the concurrency level as high as possible in the system. In turn, the concurrency level in the system depends on two main factors 1. The number of I O devices...
Handling a Faulty Address Outside the Address Space
If address does not belong to the process address space, do_page_fault( ) proceeds to execute the statements at the label bad_area. If the error occurred in User Mode, it sends a sigsegv signal to current (see Section 10.2) and terminates up read(& tsk-> mm-> mmap sem) if (error code & 4) * User Mode * tsk-> thread.cr2 address tsk-> thread.error code error code tsk-> thread.trap no 14 info.si_signo SIGSEGV info.si errno 0 info.si addr (void *) address force_sig_info(SIGSEGV,...
The Global Kernel Lock
In earlier Linux kernel versions, a global kernel lock (also known as big kernel lock, or BKL) was widely used. In Version 2.0, this lock was a relatively crude spin lock, ensuring that only one processor at a time could run in Kernel Mode. The 2.2 kernel was considerably more flexible and no longer relied on a single spin lock rather, a large number of kernel data structures were protected by specialized spin locks. The global kernel lock, on other hand, was still present because splitting a...
Page IO operations
Block devices transfer information one block at a time, while process address spaces (or to be more precise, memory regions allocated to the process) are defined as sets of pages. This mismatch can be hidden to some extent by using page I O operations. They may be activated in the following cases A process issues a read( ) or write( ) system call on a file (see Chapter 15). A process reads a location of a page that maps a file in memory (see Chapter 15). The kernel flushes some dirty pages...
Character Device Drivers
Handling a character device is relatively easy, since usually sophisticated buffering strategies are not needed and disk caches are not involved. Of course, character devices differ in their requirements some of them must implement a sophisticated communication protocol to drive the hardware device, while others just have to read a few values from a couple of I O ports of the hardware devices. For instance, the device driver of a multiport serial card device (a hardware device offering many...
Program Segments and Process Memory Regions
The linear address space of a Unix program is traditionally partitioned, from a logical point of view, in several linear address intervals called segments 141 41 The word segment has historical roots, since the first Unix systems implemented each linear address interval with a different segment register. Linux, however, does not rely on the segmentation mechanism of the 80 x 86 microprocessors to implement program segments. Contains the initialized data that is, the static variables and the...
Interrupt Descriptor Table
A system table called Interrupt Descriptor Table IDT associates each interrupt or exception vector with the address of the corresponding interrupt or exception handler. The IDT must be properly initialized before the kernel enables interrupts. The IDT format is similar to that of the GDT and the LDTs examined in Chapter 2. Each entry corresponds to an interrupt or an exception vector and consists of an 8-byte descriptor. Thus, a maximum of 256 x 8 2048 bytes are required to store the IDT. The...
The time ftime and gettimeofday System Calls
Processes in User Mode can get the current time and date by means of several system calls Returns the number of elapsed seconds since midnight at the start of January 1, 1970 (UTC). Returns, in a data structure of type timeb, the number of elapsed seconds since midnight of January 1, 1970 (UTC) and the number of elapsed milliseconds in the last second. Returns, in a data structure named timeval, the number of elapsed seconds since midnight of January 1, 1970 (UTC) (a second data structure named...
Lookup of Symbolic Links
Recall that a symbolic link is a regular file that stores a pathname of another file. A pathname may include symbolic links, and they must be resolved by the kernel. For example, if foo bar is a symbolic link pointing to (containing the pathname) dir, the pathname foo bar file must be resolved by the kernel as a reference to the file dir file. In this example, the kernel must perform two different lookup operations. The first one resolves foo bar when the kernel discovers that bar is the name...
Managing the Heap
Each Unix process owns a specific memory region called heap, which is used to satisfy the process's dynamic memory requests. The start_brk and brk fields of the memory descriptor delimit the starting and ending addresses, respectively, of that region. The following C library functions can be used by the process to request and release dynamic memory Requests size bytes of dynamic memory if the allocation succeeds, it returns the linear address of the first memory location. Requests an array...
Accessing the Process Address Space
System call service routines often need to read or write data contained in the process's address space. Linux includes a set of macros that make this access easier. We'll describe two of them, called get_user( ) and put_user( ). The first can be used to read 1, 2, or 4 consecutive bytes from an address, while the second can be used to write data of those sizes into an address. Each function accepts two arguments, a value x to transfer and a variable ptr. The second variable also determines how...
The Destination Cache
As we shall see in the later section Section 18.2.2, processes usually assign names to sockets that is, they specify the remote IP address and port number of the host that should receive the data written onto the socket. The kernel shall also make available to the processes reading the sockets every packet received from the remote host carrying the proper port number. Actually, the kernel has to keep in memory a bunch of data about the remote host identified by an in-use socket. To speed up the...
How Journaling Works
Let's try to explain how journaling works with an example the Ext3 filesystem layer receives a request to write some data blocks of a regular file. As you might easily guess, we are not going to describe in detail every single operation of the Ext3 filesystem layer and of the JBD layer. There would be far too many issues to be covered However, we describe the essential actions 1. The service routine of the write( ) system call triggers the write method of the file object associated with the...
Dynamic Address Checking The Fixup Code
As seen previously, verify_area( ), access_ok, and _ _addr_ok make only a coarse check on the validity of linear addresses passed as parameters of a system call. Since they do not ensure that these addresses are included in the process address space, a process could cause a Page Fault exception by passing a wrong address. Before describing how the kernel detects this type of error, let's specify the three cases in which Page Fault exceptions may occur in Kernel Mode. These cases must be...
The getpriority and setpriority System Calls
The nice( ) system call affects only the process that invokes it. Two other system calls, denoted as getpriority( ) and setpriority( ), act on the base priorities of all processes in a given group. getpriority( ) returns 20 minus the lowest nice field value among all processes in a given group that is, the highest priority among that processes setpriority( ) sets the base priority of all processes in a given group to a given value. The kernel implements these system calls by means of the...
Executing the Default Action for the Signal
If ka-> sa.sa_handler is equal to SIG_DFL, do_signal( ) must perform the default action of the signal. The only exception comes when the receiving process is init, in which case the signal is discarded as described in the earlier section Section 10.1.1 For other processes, since the default action depends on the type of signal, the function executes a switch statement based on the value of signr. The signals whose default action is ignore are easily handled case SIGCONT case SIGCHLD case...
Initializing the Interrupt Descriptor Table
Now that you understand what the Intel processor does with interrupts and exceptions at the hardware level, we can move on to describe how the Interrupt Descriptor Table is initialized. Remember that before the kernel enables the interrupts, it must load the initial address of the IDT table into the idtr register and initialize all the entries of that table. This activity is done while initializing the system (see Appendix A). The int instruction allows a User Mode process to issue an interrupt...
Swapped Out Page Identifier
A swapped-out page is uniquely identified quite simply by specifying the index of the swap area in the swap_info array and the page slot index inside the swap area. Since the first page with index 0 of the swap area is reserved for the swap_header union discussed earlier, the first useful page slot has index 1. The format of a swapped-out page identifier is illustrated in Figure 16-2. Figure 16-2. Swapped-out page identifier Figure 16-2. Swapped-out page identifier The SWP_ENTRY type,offset...
[9 As well see the message queue is implemented by means of
Since messages can be retrieved in an order different from first in, first out, the name message queue is not appropriate. However, new messages are always put at the end of the linked list. To send a message, a process invokes the msgsnd( ) function, passing the following as parameters The IPC identifier of the destination message queue The size of the message text The address of a User Mode buffer that contains the message type immediately followed by the message text To...
Now the empty foo file on the floppy filesystem can be accessed both as flpfoo and flpmntfoo
It is also possible to stack multiple mounts on a single mount point. Each new mount on the same mount point hides the previously mounted filesystem, although processes already using the files and directories under the old mount can continue to do so. When the topmost mounting is removed, then the next lower mount is once more made visible. As you can imagine, keeping track of mounted filesystems can quickly become a nightmare. For each mount operation, the kernel must save in memory the mount...
Process Preemption
As mentioned in the first chapter, Linux processes are preemptive. If a process enters the task_running state, the kernel checks whether its dynamic priority is greater than the priority of the currently running process. If it is, the execution of current is interrupted and the scheduler is invoked to select another process to run (usually the process that just became runnable). Of course, a process may also be preempted when its time quantum expires. As mentioned in Section 6.3, when this...
T2 The exact number depends on the processor model
The following list gives the vector, the name, the type, and a brief description of the exceptions found in 80 x 86 processors. Additional information may be found in the Intel technical documentation. 0 - Divide error fault Raised when a program issues an integer division by 0. 1- Debug trap or fault Raised when the T flag of eflags is set quite useful to implement step-by-step execution of a debugged program or when the address of an instruction or operand falls within the range of an active...
Time Stamp Counter
All 80 x 86 microprocessors include a CLK input pin, which receives the clock signal of an external oscillator. Starting with the Pentium, many recent 80 x 86 microprocessors include a 64-bit Time Stamp Counter (TSC ) register that can be read by means of the rdtsc assembly language instruction. This register is a counter that is incremented at each clock signal if, for instance, the clock ticks at 400 MHz, the Time Stamp Counter is incremented once every 2.5 nanoseconds. Linux takes advantage...
Requesting and Releasing Page Frames
After having seen how the kernel allocates and initializes the data structures for page frame handling, we now look at how page frames are allocated and released. Page frames can be requested by using six slightly different functions and macros. Unless otherwise stated, they return the linear address of the first allocated page, or return null if the allocation failed. Function used to request 2order contiguous page frames. It returns the address of the descriptor of the first allocated page...
Block IO operations
The bread( ) function reads a single block from a block device and stores it in a buffer. It receives as parameters the device identifier, the block number, and the block size, and returns a pointer to the buffer head of the buffer containing the block. The function performs the following operations 1. Invokes the getblk( ) function to search for the block in a software cache called the buffer cache (see Section 14.2.4). If the block is not included in the cache, getblk( ) allocates a new...
Slab Coloring
We know from Chapter 2 that the same hardware cache line maps many different blocks of RAM. In this chapter, we have also seen that objects of the same size tend to be stored at the same offset within a cache. Objects that have the same offset within different slabs will, with a relatively high probability, end up mapped in the same cache line. The cache hardware might therefore waste memory cycles transferring two objects from the same cache line back and forth to different RAM locations,...
Description
Number of processes sharing this table Read write spin lock for the table fields Current maximum number of file objects Current maximum number of file descriptors Maximum file descriptors ever allocated plus 1 Pointer to array of file object pointers Pointer to file descriptors to be closed on exec( ) Initial set of file descriptors to be closed on Initial array of file object pointers The fd field points to an array of pointers to file objects. The size of the array is stored in the max_fds...
The Big Reader Lock
Read write spin locks are useful for data structures that are accessed by many readers and a few writers. They are more convenient than normal spin locks because they allow several readers to concurrently access the protected data structure. However, any time a CPU acquires a read write spin lock, the counter in rwlock_t must be updated. A further access to the rwlock_t data structure performed by another CPU incurs in a significant performance penalty because the hardware caches of the two...
Dynamic Timers
Dynamic timers may be dynamically created and destroyed. No limit is placed on the number of currently active dynamic timers. A dynamic timer is stored in the following timer_list structure struct list_head list unsigned long expires The function field contains the address of the function to be executed when the timer expires. The data field specifies a parameter to be passed to this timer function. Thanks to the data field, it is possible to define a single general-purpose function that...
Figure 62 The groups of lists associated with dynamic timers
The tv1 structure is of type struct timer_vec_root, which includes an index field and a vec array of 256 list_head elements that is, lists of dynamic timers. It contains all dynamic timers that will decay within the next 255 ticks. The index field specifies the currently scanned list it is initialized to 0 and incremented by 1 (modulo 256) at every tick. The list referenced by index contains all dynamic timers that expired during the current tick the next list contains all dynamic timers that...
Filesystem Type Registration
Often, the user configures Linux to recognize all the filesystems needed when compiling the kernel for her system. But the code for a filesystem actually may either be included in the kernel image or dynamically loaded as a module (see Appendix B). The VFS must keep track of all filesystem types whose code is currently included in the kernel. It does this by performing filesystem type registration. Each registered filesystem is represented as a file_system_type object whose fields are...
Swap Area Descriptor
Each active swap area has its own swap_info_struct descriptor in memory. The fields of the descriptor are illustrated in Table 16-1. Table 16-1. Fields of a swap area descriptor Table 16-1. Fields of a swap area descriptor Device number of the swap disk partition Mounted filesystem descriptor of the file or device file Pointer to array of counters, one for each swap area page slot First page slot to be scanned when searching for a free one Last page slot to be scanned when searching for a free...
Old Style Device Files
Old-style device files have been in use since the early versions of the Unix operating system. An old-style device file is a real file stored in a filesystem. Its inode, however, doesn't address blocks of data on the disk. Instead, the inode includes an identifier of a hardware device. Besides its name and type (either character or block, as already mentioned), each device file has two main attributes A number ranging from 1 to 254 that identifies the device type. Usually, all device files that...
Figure 192 IPC message queue data structures
The msg_queue data structure Table 19-11. The msg_queue data structure List of processes receiving messages The most important field is q_messages, which represents the head (i.e., the first dummy element) of a doubly linked circular list containing all messages currently in the queue. Each message is broken in one or more pages, which are dynamically allocated. The beginning of the first page stores the message header, which is a data structure of type msg_msg its fields are...
Kernel Architecture
As stated before, most Unix kernels are monolithic each kernel layer is integrated into the whole kernel program and runs in Kernel Mode on behalf of the current process. In contrast, microkernel operating systems demand a very small set of functions from the kernel, generally including a few synchronization primitives, a simple scheduler, and an interprocess communication mechanism. Several system processes that run on top of the microkernel implement other operating system-layer functions,...
Devfs Device Files
Identifying I O devices by means of major and minor numbers has some limitations 1. Most of the devices present in a dev directory don't exist the device files have been included so that the system administrator doesn't need to create a device file before installing a new I O driver. However, a typical dev directory, which includes over 1,800 device files, increases the time taken to look up an inode when first referenced. 2. The major and minor numbers are 8-bit long. Nowadays, this is a...
The process list
To allow an efficient search through processes of a given type (for instance, all processes in a runnable state), the kernel creates several lists of processes. Each list consists of pointers to process descriptors. A list pointer (that is, the field that each process uses to point to the next process) is embedded right in the process descriptor's data structure. When you look at the C-language declaration of the task_struct structure, the descriptors may seem to turn in on themselves in a...
Ext2 Disk Data Structures
The first block in any Ext2 partition is never managed by the Ext2 filesystem, since it is reserved for the partition boot sector (see Appendix A). The rest of the Ext2 partition is split into block groups, each of which has the layout shown in Figure 17-1. As you will notice from the figure, some data structures must fit in exactly one block, while others may require more than one block. All the block groups in the filesystem have the same size and are stored sequentially, thus the kernel can...
Interrupts and Exceptions
An interrupt is usually defined as an event that alters the sequence of instructions executed by a processor. Such events correspond to electrical signals generated by hardware circuits both inside and outside the CPU chip. Interrupts are often divided into synchronous and asynchronous interrupts Synchronous interrupts are produced by the CPU control unit while executing instructions and are called synchronous because the control unit issues them only after terminating the execution of an...














































