Linux Kernel Reference

Reclaiming Page Frames

The virtual memory subsystem of Linux is, without any doubt, the most complex and performance-critical component of the whole kernel. In previous chapters, we explained how the kernel handles dynamic memory by keeping track of free and busy page frames. We have also discussed how every process in User Mode has its own linear address space so that page frames can be assigned to the process at the very last possible moment. Finally, we have also described how dynamic memory is used to cache the...

The Buddy System Algorithm


The kernel must establish a robust and efficient strategy for allocating groups of contiguous page frames. In doing so, it must deal with a well-known memory management problem called external fragmentation: frequent requests and releases of groups of contiguous page frames of different sizes may lead to a situation in which several small blocks of free page frames are scattered inside blocks of allocated page frames. As a result, it may become impossible to allocate a large block of contiguous...

The Physical Address Extension (PAE) Paging Mechanism

The amount of RAM supported by a processor is limited by the number of address pins connected to the address bus. Older Intel processors from the 80386 to the Pentium used 32-bit physical addresses. In theory, up to 4 GB of RAM could be installed on such systems; in practice, due to the linear address space requirements of User Mode processes, the kernel cannot directly address more than 1 GB of RAM, as we shall see in the later Section 2.5. However, some demanding applications running on...

Global Interrupt Disabling

Some critical kernel functions can execute on a CPU only if no interrupt handler or deferrable function is running on any other CPU. This synchronization requirement is satisfied by global interrupt disabling. A typical scenario consists of a driver that needs to reset the hardware device. Before fiddling with I/O ports, the driver performs global interrupt disabling, ensuring that no other driver will access the same ports. As we shall see in this section, global interrupt disabling...

Local Interrupt Disabling

Interrupt disabling is one of the key mechanisms used to ensure that a sequence of kernel statements is treated as a critical section. It allows a kernel control path to continue executing even when hardware devices issue IRQ signals, thus providing an effective way to protect data structures that are also accessed by interrupt handlers. By itself, however, local interrupt disabling does not protect against concurrent accesses to data structures by interrupt handlers running on other CPUs, so...

Buffer Pages

Although the page cache and the buffer cache are different disk caches, in Version 2.4 of Linux, they are somewhat intertwined. In fact, for reasons of efficiency, buffers are not allocated as single memory objects; instead, buffers are stored in dedicated pages called buffer pages. All the buffers within a single buffer page must have the same size; hence, on the 80 x 86 architecture, a buffer page can include from one to eight buffers, depending on the block size. A stronger constraint,...

The dentry Cache

Since reading a directory entry from disk and constructing the corresponding dentry object requires considerable time, it makes sense to keep in memory dentry objects that you've finished with but might need later. For instance, people often edit a file and then compile it, or edit and print it, or copy it and then edit the copy. In such cases, the same file needs to be repeatedly accessed. To maximize efficiency in handling dentries, Linux uses a dentry cache, which consists of two kinds of...

Writing Dirty Buffers to Disk

Unix systems allow the deferred writes of dirty buffers into block devices, since this noticeably improves system performance. Several write operations on a buffer could be satisfied by just one slow physical update of the corresponding disk block. Moreover, write operations are less critical than read operations, since a process is usually not suspended because of delayed writings, while it is most often suspended because of delayed reads. Thanks to deferred writes, each physical block device...

Memory Mapping Data Structures


A memory mapping is represented by a combination of the following data structures:
The inode object associated with the mapped file
The address_space object of the mapped file
A file object for each different mapping performed on the file by different processes
A vm_area_struct descriptor for each different mapping on the file
A page descriptor for each page frame assigned to a memory region that maps the file
Figure 15-4 illustrates how the data structures are linked. In the upper-left corner,...

Mounting the Root Filesystem

Mounting the root filesystem is a crucial part of system initialization. It is a fairly complex procedure because the Linux kernel allows the root filesystem to be stored in many different places, such as a hard disk partition, a floppy disk, a remote filesystem shared via NFS, or even a fictitious block device kept in RAM. To keep the description simple, let's assume that the root filesystem is stored in a partition of a hard disk (the most common case, after all). While the system boots, the...

Process Switch

To control the execution of processes, the kernel must be able to suspend the execution of the process running on the CPU and resume the execution of some other process previously suspended. This activity goes variously by the names process switch, task switch, or context switch. The next sections describe the elements of process switching in Linux. While each process can have its own address space, all processes have to share the CPU registers. So before resuming the execution of a process,...

Dentry Objects

We mentioned in Section 12.1.1 that the VFS considers each directory a file that contains a list of files and other directories. We shall discuss in Chapter 17 how directories are implemented on a specific filesystem. Once a directory entry is read into memory, however, it is transformed by the VFS into a dentry object based on the dentry structure, whose fields are described in Table 12-5. The kernel creates a dentry object for every component of a pathname that a process looks up; the dentry...

Segmentation in Linux


Segmentation has been included in 80 x 86 microprocessors to encourage programmers to split their applications into logically related entities, such as subroutines or global and local data areas. However, Linux uses segmentation in a very limited way. In fact, segmentation and paging are somewhat redundant, since both can be used to separate the physical address spaces of processes: segmentation can assign a different linear address space to each process, while paging can map the same linear...

Sectors, Blocks, and Buffers

Each data transfer operation for a block device acts on a group of adjacent bytes called a sector. In most disk devices, the size of a sector is 512 bytes, although there are devices that use larger sectors (1,024 and 2,048 bytes). Notice that the sector should be considered the basic unit of data transfer: it is never possible to transfer less than a sector, although most disk devices are capable of transferring several adjacent sectors at once. The kernel stores the sector size of each...

Direct Memory Access (DMA)

All PCs include an auxiliary processor called the Direct Memory Access Controller (DMAC), which can be instructed to transfer data between the RAM and an I/O device. Once activated by the CPU, the DMAC is able to continue the data transfer on its own; when the data transfer is completed, the DMAC issues an interrupt request. The conflicts that occur when both CPU and DMAC need to access the same memory location at the same time are resolved by a hardware circuit called a memory arbiter (see...

Extended Paging


Starting with the Pentium model, 80 x 86 microprocessors introduce extended paging, which allows page frames to be 4 MB instead of 4 KB in size (see Figure 2-7). As mentioned in the previous section, extended paging is enabled by setting the Page Size flag of a Page Directory entry. In this case, the paging unit divides the 32 bits of a linear address into two fields. Page Directory entries for extended paging are the same as for normal paging, except that: the Page Size flag must be set; only...
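
To make the two-field split concrete, here is a small user-space sketch (plain C, not kernel code) that computes the physical address a 4 MB page frame access would resolve to; the linear address and the Page Directory entry values are invented for illustration.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t linear = 0x0812abcd;          /* arbitrary linear address          */
        uint32_t pde    = 0x00c00083;          /* sample PDE: 4 MB frame, PS flag=1 */

        uint32_t dir    = linear >> 22;        /* 10-bit Directory field            */
        uint32_t offset = linear & 0x3fffff;   /* 22-bit Offset field               */

        /* With extended paging the PDE directly yields the 4 MB frame base. */
        uint32_t phys = (pde & 0xffc00000) | offset;

        printf("dir=%u offset=0x%x phys=0x%08x\n", dir, offset, phys);
        return 0;
    }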

Regular Paging


Starting with the 80386, the paging unit of Intel processors handles 4 KB pages. The 32 bits of a linear address are divided into three fields. The translation of linear addresses is accomplished in two steps, each based on a type of translation table. The first translation table is called the Page Directory and the second is called the Page Table. The aim of this two-level scheme is to reduce the amount of RAM required for per-process Page Tables. If a simple one-level Page Table were used, then...
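
As a quick illustration of the three-field split (10-bit Directory, 10-bit Table, 12-bit Offset), the following user-space sketch decomposes a linear address the way the i386 paging unit does; the sample address is arbitrary.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t linear = 0xc0102abc;               /* arbitrary linear address */

        uint32_t dir    = (linear >> 22) & 0x3ff;   /* Directory: bits 31-22 */
        uint32_t table  = (linear >> 12) & 0x3ff;   /* Table:     bits 21-12 */
        uint32_t offset = linear & 0xfff;           /* Offset:    bits 11-0  */

        /* The paging unit uses dir to select a Page Directory entry, table to
           select an entry in the referenced Page Table, and offset within the
           4 KB page frame found there. */
        printf("dir=%u table=%u offset=0x%x\n", dir, table, offset);
        return 0;
    }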

Parenthood Relationships Among Processes

Processes created by a program have a parent/child relationship. When a process creates multiple children, these children have sibling relationships. Several fields must be introduced in a process descriptor to represent these relationships. Processes 0 and 1 are created by the kernel; as we shall see later in the chapter, process 1 (init) is the ancestor of all other processes. The descriptor of a process P includes the following fields: Points to...
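
As a rough sketch of how these fields could be traversed, a kernel function might walk all children of a process as shown below; the 2.4-style field names p_cptr (youngest child) and p_osptr (older sibling) are assumed here, so check them against the kernel sources you are using.

    /* Sketch only: assumes the 2.4-style p_cptr and p_osptr fields
     * of struct task_struct. */
    #include <linux/kernel.h>
    #include <linux/sched.h>

    static void print_children(struct task_struct *parent)
    {
        struct task_struct *p;

        /* Start from the youngest child and follow the older-sibling chain. */
        for (p = parent->p_cptr; p; p = p->p_osptr)
            printk("child pid %d\n", p->pid);
    }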

Slab Coloring

We know from Chapter 2 that the same hardware cache line maps many different blocks of RAM. In this chapter, we have also seen that objects of the same size tend to be stored at the same offset within a cache. Objects that have the same offset within different slabs will, with a relatively high probability, end up mapped in the same cache line. The cache hardware might therefore waste memory cycles transferring two objects from the same cache line back and forth to different RAM locations,...

The getblk Function

The getblk( ) function is the main service routine for the buffer cache. When the kernel needs to read or write the contents of a block of a physical device, it must check whether the buffer head for the required buffer is already included in the buffer cache. If the buffer is not there, the kernel must create a new entry in the cache. To do this, the kernel invokes getblk( ), specifying as parameters the device identifier, the block number, and the block size. This function returns the address...

The Role of the Virtual Filesystem (VFS)

The Virtual Filesystem (also known as Virtual Filesystem Switch or VFS) is a kernel software layer that handles all system calls related to a standard Unix filesystem. Its main strength is providing a common interface to several kinds of filesystems. For instance, let's assume that a user issues a shell command such as cp /floppy/TEST /tmp/test, where /floppy is the mount point of an MS-DOS diskette and /tmp is a normal Second Extended Filesystem (Ext2) directory. As shown in Figure 12-1(a), the VFS is an abstraction layer...

Appendix C Source Code Structure

To help you to find your way through the files of the source code, we briefly describe the organization of the kernel directory tree. As usual, all pathnames refer to the main directory of the Linux kernel, which is, in most Linux distributions, /usr/src/linux. Linux source code for all supported architectures is contained in about 8750 C and Assembly files stored in about 530 subdirectories; it consists of about 4 million lines of code, which occupy more than 144 megabytes of disk space. The...

The Common File Model


The key idea behind the VFS consists of introducing a common file model capable of representing all supported filesystems. This model strictly mirrors the file model provided by the traditional Unix filesystem. This is not surprising, since Linux wants to run its native filesystem with minimum overhead. However, each specific filesystem implementation must translate its physical organization into the VFS's common file model. For instance, in the common file model, each directory is regarded as...

Handling wait queues

The add_wait_queue( ) function inserts a nonexclusive process in the first position of a wait queue list. The add_wait_queue_exclusive( ) function inserts an exclusive process in the last position of a wait queue list. The remove_wait_queue( ) function removes a process from a wait queue list. The waitqueue_active( ) function checks whether a given wait queue list is empty. A new wait queue may be defined by using the DECLARE_WAIT_QUEUE_HEAD(name) macro, which statically declares and...
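
Putting these primitives together, a minimal sketch of the usual 2.4-style wait pattern might look as follows; the wait queue head, the condition flag, and the two functions are invented names.

    #include <linux/wait.h>
    #include <linux/sched.h>

    static DECLARE_WAIT_QUEUE_HEAD(my_queue);   /* statically initialized head  */
    static volatile int event_arrived;          /* hypothetical condition flag  */

    /* Sleep until event_arrived becomes nonzero. */
    static void wait_for_event(void)
    {
        wait_queue_t wait;

        init_waitqueue_entry(&wait, current);
        add_wait_queue(&my_queue, &wait);       /* nonexclusive process         */
        for (;;) {
            set_current_state(TASK_INTERRUPTIBLE);
            if (event_arrived)
                break;
            schedule();                         /* yield the CPU until woken up */
        }
        set_current_state(TASK_RUNNING);
        remove_wait_queue(&my_queue, &wait);
    }

    /* Called by whoever produces the event. */
    static void signal_event(void)
    {
        event_arrived = 1;
        wake_up(&my_queue);                     /* wake sleepers on the queue   */
    }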

Routing Data Structures


The most important function of the IP layer consists of ensuring that packets originated by the host or received by the network interface cards are forwarded toward their final destinations. As you might easily guess, this task is really crucial because the routing algorithm should be fast enough to keep up with the highest network loads. The IP routing mechanism is fairly simple. Each 32-bit integer representing an IP address encodes both a network address, which specifies the network the host...

Execution Domains

As mentioned in Chapter 1, a neat feature of Linux is its ability to execute files compiled for other operating systems. Of course, this is possible only if the files include machine code for the same computer architecture on which the kernel is running. Two kinds of support are offered for these foreign programs: emulated execution, necessary to execute programs that include system calls that are not POSIX-compliant; and native execution, valid for programs whose system calls are totally...

The Scheduling Algorithm

The Linux scheduling algorithm works by dividing the CPU time into epochs. In a single epoch, every process has a specified time quantum whose duration is computed when the epoch begins. In general, different processes have different time quantum durations. The time quantum value is the maximum CPU time portion assigned to the process in that epoch. When a process has exhausted its time quantum, it is preempted and replaced by another runnable process. Of course, a process can be selected...

System Call Handler and Service Routines


When a User Mode process invokes a system call, the CPU switches to Kernel Mode and starts the execution of a kernel function. In Linux a system call must be invoked by executing the int $0x80 assembly language instruction, which raises the programmed exception that has vector 128 (see Section 4.4.1 and Section 4.2.4, both in Chapter 4). Since the kernel implements many different system calls, the process must pass a parameter called the system call number to identify the required system call...
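
For illustration only, a user-space program can issue the getpid( ) system call directly through int $0x80, placing the system call number in eax; this sketch bypasses the C library and assumes an i386 system whose <asm/unistd.h> defines __NR_getpid.

    #include <stdio.h>
    #include <asm/unistd.h>      /* __NR_getpid */

    int main(void)
    {
        long pid;

        /* System call number in eax; the return value comes back in eax. */
        asm volatile("int $0x80"
                     : "=a" (pid)
                     : "a" (__NR_getpid));

        printf("my pid is %ld\n", pid);
        return 0;
    }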

Inode Objects

All information needed by the filesystem to handle a file is included in a data structure called an inode. A filename is a casually assigned label that can be changed, but the inode is unique to the file and remains the same as long as the file exists. An inode object in memory consists of an inode structure whose fields are described in Table 12-3. Table 12-3. The fields of the inode object: Pointers for the modified buffers list; Pointers for the...

Read/Write Semaphores

Read/write semaphores are a new feature of Linux 2.4. They are similar to the read/write spin locks described earlier in Section 5.3.4, except that waiting processes are suspended until the semaphore becomes open again. Many kernel control paths may concurrently acquire a read/write semaphore for reading; however, any writer kernel control path must have exclusive access to the protected resource. Therefore, the semaphore can be acquired for writing only if no other kernel control path is...
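
A minimal sketch of how a read/write semaphore is typically used in 2.4 kernel code follows; the semaphore name and the protected data are invented for the example.

    #include <linux/rwsem.h>

    static DECLARE_RWSEM(my_list_sem);      /* protects a hypothetical list  */

    static void reader_path(void)
    {
        down_read(&my_list_sem);            /* many readers may hold this    */
        /* ... read the protected data structure ... */
        up_read(&my_list_sem);
    }

    static void writer_path(void)
    {
        down_write(&my_list_sem);           /* exclusive access for writers  */
        /* ... modify the protected data structure ... */
        up_write(&my_list_sem);
    }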

Understanding the Linux Kernel, 2nd Edition

The Audience for This Book
Organization of the Material
Overview of the Book
Background Information
Conventions in This Book
How to Contact Us
Acknowledgments
Section 1.1. Linux Versus Other Unix-Like Kernels
Section 1.2. Hardware Dependency
Section 1.4. Basic Operating System Concepts
Section 1.5. An Overview of the Unix Filesystem
Section 1.6. An Overview of Unix Kernels
Section 2.1. Memory Addresses
Section 2.2. Segmentation in Hardware
Section 2.3. Segmentation in Linux
Section 2.4. Paging...

...TLB mode mechanism; it is usually invoked whenever the kernel modifies a Page Table entry relative to the Kernel Mode...

When some CPU starts running a kernel thread, the kernel sets it into lazy TLB mode. When requests are issued to clear some TLB entries, each CPU in lazy TLB mode does not flush the corresponding entries; however, the CPU remembers that its current process is running on a set of Page Tables whose TLB entries for the User Mode addresses are invalid. As soon as the CPU in lazy TLB mode switches to a regular process with a different set of Page Tables, the hardware automatically flushes the TLB...

Memory Barriers

When using optimizing compilers, you should never take for granted that instructions will be performed in the exact order in which they appear in the source code. For example, a compiler might reorder the assembly language instructions in such a way to optimize how registers are used. Moreover, modern CPUs usually execute several instructions in parallel, and might reorder memory accesses. These kinds of reordering can greatly speed up the program. When dealing with synchronization, however,...

Special Filesystems

While network and disk-based filesystems enable the user to handle information stored outside the kernel, special filesystems may provide an easy way for system programs and administrators to manipulate the data structures of the kernel and to implement special features of the operating system. Table 12-8 lists the most common special filesystems used in Linux; for each of them, the table reports its mount point and a short description. Notice that a few filesystems have no fixed mount point...

The Memory Descriptor

All information related to the process address space is included in a data structure called a memory descriptor. This structure of type mm_struct is referenced by the mm field of the process descriptor. The fields of a memory descriptor are listed in Table 8-2. Table 8-2. The fields of the memory descriptor: Pointer to the head of the list of memory region objects; Pointer to the root of the red-black tree of memory region objects; Pointer to the last...

FIFOs

Although pipes are a simple, flexible, and efficient communication mechanism, they have one main drawback: there is no way to open an already existing pipe. This makes it impossible for two arbitrary processes to share the same pipe, unless the pipe was created by a common ancestor process. This drawback is substantial for many application programs. Consider, for instance, a database engine server, which continuously polls client processes wishing to issue some queries and which...

The Destination Cache

As we shall see in the later Section 18.2.2, processes usually assign names to sockets; that is, they specify the remote IP address and port number of the host that should receive the data written onto the socket. The kernel must also make available to the processes reading the sockets every packet received from the remote host carrying the proper port number. Actually, the kernel has to keep in memory a bunch of data about the remote host identified by an in-use socket. To speed up the...

...is still able to act on the file. In this particular case, the inode is removed from the hash table even if it still...

The methods associated with an inode object are also called inode operations. They are described by an inode_operations structure, whose address is included in the i_op field. Here are the inode operations in the order they appear in the inode_operations table: Creates a new disk inode for a regular file associated with a dentry object in some directory. Searches a directory for an inode corresponding to the filename included in a dentry object. Creates a new hard link that refers to the file...

File Objects

A file object describes how a process interacts with a file it has opened. The object is created when the file is opened and consists of a file structure, whose fields are described in Table 12-4. Notice that file objects have no corresponding image on disk, and hence no dirty field is included in the file structure to specify that the file object has been modified. Table 12-4. The fields of the file object: Pointers for generic file object list; dentry...

The fields of the files_struct structure

Number of processes sharing this table; Read/write spin lock for the table fields; Current maximum number of file objects; Current maximum number of file descriptors; Maximum file descriptors ever allocated, plus 1; Pointer to array of file object pointers; Pointer to file descriptors to be closed on exec( ); Initial set of file descriptors to be closed on exec( ); Initial array of file object pointers. The fd field points to an array of pointers to file objects. The size of the array is stored in the max_fds...

Now the empty foo file on the floppy filesystem can be accessed both as /flp/foo and /flp/mnt/foo.

It is also possible to stack multiple mounts on a single mount point. Each new mount on the same mount point hides the previously mounted filesystem, although processes already using the files and directories under the old mount can continue to do so. When the topmost mounting is removed, then the next lower mount is once more made visible. As you can imagine, keeping track of mounted filesystems can quickly become a nightmare. For each mount operation, the kernel must save in memory the mount...

Mounting a Generic Filesystem

Once the root filesystem is initialized, additional filesystems may be mounted. Each must have its own mount point, which is just an already existing directory in the system's directory tree. The mount( ) system call is used to mount a filesystem; its sys_mount( ) service routine acts on the following parameters: The pathname of a device file containing the filesystem, or null if it is not required (for instance, when the filesystem to be mounted is network-based); The pathname of the directory on...
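
From user space, the mount(2) library wrapper passes the same information to sys_mount( ); for instance, a hypothetical MS-DOS floppy could be mounted read-only as sketched below (the device and mount point names are made up, and the call requires the appropriate capability).

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* source device, mount point, filesystem type, flags, fs-specific data */
        if (mount("/dev/fd0", "/mnt/floppy", "msdos", MS_RDONLY, NULL) < 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }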

Pathname Lookup

In this section, we illustrate how the VFS derives an inode from the corresponding file pathname. When a process must identify a file, it passes its file pathname to some VFS system call, such as open( ), mkdir( ), rename( ), or stat( ). The standard procedure for performing this task consists of analyzing the pathname and breaking it into a sequence of filenames. All filenames except the last must identify directories. If the first character of the pathname is /, the pathname is absolute,...

Standard Pathname Lookup

When the lookup_parent flag is cleared, link_path_walk( ) performs the following steps.
1. Initializes the lookup_flags local variable with nd->flags.
2. Skips any leading slash (/) before the first component of the pathname.
3. If the remaining pathname is empty, returns the value 0. In the nameidata data structure, the dentry and mnt fields point to the object relative to the last resolved component of the original pathname.
4. If the link_count field in the descriptor of the current...

Parent Pathname Lookup

In many cases, the real target of a lookup operation is not the last component of the pathname, but the next-to-last one. For example, when a file is created, the last component denotes the filename of the not yet existing file, and the rest of the pathname specifies the directory in which the new link must be inserted. Therefore, the lookup operation should fetch the dentry object of the next-to-last component. For another example, unlinking a file identified by the pathname /foo/bar consists...

Lookup of Symbolic Links

Recall that a symbolic link is a regular file that stores a pathname of another file. A pathname may include symbolic links, and they must be resolved by the kernel. For example, if /foo/bar is a symbolic link pointing to (containing the pathname) /dir, the pathname /foo/bar/file must be resolved by the kernel as a reference to the file /dir/file. In this example, the kernel must perform two different lookup operations. The first one resolves /foo/bar when the kernel discovers that bar is the name...

Activating and Deactivating a Swap Area

Once a swap area is initialized, the superuser (or, more precisely, any user having the CAP_SYS_ADMIN capability, as described in Section 20.1.1) may use the swapon and swapoff programs to activate and deactivate the swap area, respectively. These programs use the swapon( ) and swapoff( ) system calls; we'll briefly sketch out the corresponding service routines. 16.2.3.1 The sys_swapon( ) service routine The sys_swapon( ) service routine receives the following as parameters: This parameter points...

The socket System Call

The socket( ) system call creates a new endpoint for a communication between two or more processes. In our example program, it is invoked in this way: sockfd = socket(PF_INET, SOCK_DGRAM, 0); The socket( ) system call returns a file descriptor. In fact, a socket is similar to an opened file because it is possible to read and write data on it by means of the usual read( ) and write( ) system calls. The first parameter of the socket( ) system call represents the network architecture that will be used...
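
For completeness, here is a self-contained sketch of that example: it creates a UDP socket and fills in a destination address; the IP address and port number are placeholders.

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int sockfd;
        struct sockaddr_in addr;

        sockfd = socket(PF_INET, SOCK_DGRAM, 0);     /* UDP endpoint */
        if (sockfd < 0) {
            perror("socket");
            return 1;
        }

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(9999);                       /* placeholder port */
        addr.sin_addr.s_addr = inet_addr("192.168.0.1");   /* placeholder host */

        /* A later sendto( ) or connect( )/write( ) would use this address. */
        if (connect(sockfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            perror("connect");
        return 0;
    }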

Creating and Destroying a Pipe

The pipe( ) system call is serviced by the sys_pipe( ) function, which in turn invokes the do_pipe( ) function. To create a new pipe, do_pipe( ) performs the following operations: 1. Invokes the get_pipe_inode( ) function, which allocates and initializes an inode object for the pipe in the pipefs filesystem. In particular, this function executes the following actions: a. Allocates a pipe_inode_info data structure and stores its address in the i_pipe field of the inode. b. Allocates a page frame...

IPC Shared Memory


The most useful IPC mechanism is shared memory, which allows two or more processes to access some common data structures by placing them in an IPC shared memory region. Each process that wants to access the data structures included in an IPC shared memory region must add to its address space a new memory region (see Section 8.3), which maps the page frames associated with the IPC shared memory region. Such page frames can then be easily handled by the kernel through demand paging (see Section...
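
From user space, such a region is handled through the System V shmget( )/shmat( )/shmdt( ) calls; a minimal sketch follows, with the IPC key and the region size chosen arbitrarily.

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        int shmid;
        char *p;

        /* Create (or look up) a 4 KB IPC shared memory region. */
        shmid = shmget((key_t)0x1234, 4096, IPC_CREAT | 0600);
        if (shmid < 0) {
            perror("shmget");
            return 1;
        }

        p = shmat(shmid, NULL, 0);       /* map it into our address space */
        if (p == (char *)-1) {
            perror("shmat");
            return 1;
        }

        strcpy(p, "hello");              /* pages are faulted in on demand */
        shmdt(p);                        /* detach the region */
        return 0;
    }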

The exec Functions

Unix systems provide a family of functions that replace the execution context of a process with a new context described by an executable file. The names of these functions start with the prefix exec, followed by one or two letters; therefore, a generic function in the family is usually referred to as an exec function. The exec functions are listed in Table 20-7; they differ in how the parameters are interpreted. The first parameter of each function denotes the pathname of the file to be executed....

The ret_from_fork( ) Function

The ret_from_fork( ) function is executed by the child process right after its creation through a fork( ), vfork( ), or clone( ) system call (see Section 3.4.1). It is essentially equivalent to the following assembly language code:

    ret_from_fork:
        pushl %ebx
        call schedule_tail
        addl $4, %esp
        movl $0xffffe000, %ebx
        andl %esp, %ebx
        testb $0x02, 24(%ebx)
        jne tracesys_exit
        jmp ret_from_sys_call
    tracesys_exit:
        ...

Initially, the ebx register stores the address of the parent's process descriptor; this value is passed...

Kernel Wrapper Routines

Although system calls are used mainly by User Mode processes, they can also be invoked by kernel threads, which cannot use library functions. To simplify the declarations of the corresponding wrapper routines, Linux defines a set of seven macros called _syscall0 through _syscall6. In the name of each macro, the numbers 0 through 6 correspond to the number of parameters used by the system call (excluding the system call number). The macros are used to declare wrapper routines that are not...
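
For instance, on a 2.4-era system whose <linux/unistd.h> still provides these macros, a wrapper for getpid( ) could be declared as sketched below; the expanded macro loads __NR_getpid into eax, executes int $0x80, and converts a negative return value into errno.

    #include <sys/types.h>
    #include <errno.h>           /* the generated wrapper sets errno on failure */
    #include <linux/unistd.h>    /* __NR_getpid and the _syscallN macros */

    /* Expands to: pid_t getpid(void) { ... int $0x80 with eax = __NR_getpid ... } */
    _syscall0(pid_t, getpid)

    int main(void)
    {
        return getpid() > 0 ? 0 : 1;
    }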

Nested Execution of Exception and Interrupt Handlers

When handling an interrupt or an exception, the kernel begins a new kernel control path, or separate sequence of instructions. When a process issues a system call request, for instance, the first instructions of the corresponding kernel control path are those that save the content of the registers in the Kernel Mode stack, while the last instructions are those that restore the content of the registers and put the CPU back into User Mode. Linux design does not allow process switching while the...

[6] However, this case should never happen, since the kernel does not assign privileged page frames to the processes.

If the memory region access rights match the access type that caused the exception, the handle_mm_fault( ) function is invoked to allocate a new page frame:

    ret = handle_mm_fault(tsk->mm, vma, address, write);
    if (ret == 1 || ret == 2) {
        if (ret == 1)
            tsk->min_flt++;
        else
            tsk->maj_flt++;
        up_read(&tsk->mm->mmap_sem);
        ...
    }

The handle_mm_fault( ) function returns 1 or 2 if it succeeded in allocating a new page frame for the process. The value 1 indicates that the Page Fault has been handled without...

Writing Packets to a Socket

Finally, our example program is ready to send messages to the remote host; it simply writes the data onto the socket: write(sockfd, mesg, strlen(mesg)+1); The write( ) system call triggers the write method of the file object associated with the sockfd file descriptor. For socket files, this method is implemented by the sock_write( ) function, which performs the following actions: 1. Determines the address of the socket object embedded in the file's inode. 2. Allocates and initializes a message...

Parameter Passing

Like ordinary functions, system calls often require some input/output parameters, which may consist of actual values (i.e., numbers), addresses of variables in the address space of the User Mode process, or even addresses of data structures including pointers to User Mode functions (see Section 10.4). Since the system_call( ) function is the common entry point for all system calls in Linux, each of them has at least one parameter: the system call number passed in the eax register. For instance,...
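
To see the register convention in action, the following user-space fragment issues the write( ) system call directly, with the system call number in eax and the three parameters in ebx, ecx, and edx; it is i386-specific and shown purely for illustration.

    #include <asm/unistd.h>      /* __NR_write */

    int main(void)
    {
        const char msg[] = "hello\n";
        long ret;

        /* eax = system call number; ebx, ecx, edx = first three parameters. */
        asm volatile("int $0x80"
                     : "=a" (ret)
                     : "a" (__NR_write), "b" (1), "c" (msg), "d" (sizeof(msg) - 1));

        return ret < 0;
    }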

IPC Semaphores

IPC semaphores are quite similar to the kernel semaphores introduced in Chapter 5; they are counters used to provide controlled access to shared data structures for multiple processes. The semaphore value is positive if the protected resource is available, and 0 if the protected resource is currently not available. A process that wants to access the resource tries to decrement the semaphore value; the kernel, however, blocks the process until the operation on the semaphore yields a positive...

Figure 19-1. IPC semaphore data structures


Formally, the array stores pointers to kern_ipc_perm data structures, but each structure is simply the first field of the sem_array data structure. All fields of the sem_array data structure are shown in Table 19-9 (The fields in the sem_array data structure). The sem_base field points to an array of struct sem data structures, one for every IPC primitive semaphore. The latter data structure includes only two fields: The value of...

Linux File Locking

Linux supports all fashions of file locking: advisory and mandatory locks, as well as the fcntl( ), flock( ), and lockf( ) system calls. However, the lockf( ) system call is just a library wrapper routine, and therefore is not discussed here. Mandatory locks set by fcntl( ) can be enabled and disabled on a per-filesystem basis using the MS_MANDLOCK flag (the mand option) of the mount( ) system call. The default is to switch off mandatory locking. In this case, fcntl( ) creates advisory locks....
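
As a quick user-space illustration of an advisory fcntl( ) lock, a process can request an exclusive write lock on a whole file and release it afterward; the file name below is arbitrary.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        struct flock fl;
        int fd = open("/tmp/lockdemo", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        fl.l_type = F_WRLCK;       /* exclusive (write) lock         */
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;              /* 0 means: lock the whole file   */

        if (fcntl(fd, F_SETLKW, &fl) < 0) {   /* wait until the lock is granted */
            perror("fcntl");
            return 1;
        }

        /* ... critical section on the file ... */

        fl.l_type = F_UNLCK;                  /* release the lock */
        fcntl(fd, F_SETLK, &fl);
        close(fd);
        return 0;
    }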

[10] Synchronization problems have been fully described in other works; we refer the interested reader to books on the...

These problems occur not only among kernel control paths, but also among processes sharing common data. Several synchronization techniques have been adopted. The following section concentrates on how to synchronize kernel control paths. In search of a drastically simple solution to synchronization problems, most traditional Unix kernels are nonpreemptive: when a process executes in Kernel Mode, it cannot be arbitrarily suspended and substituted with another process. Therefore, on a uniprocessor...

Wait queues

Wait queues have several uses in the kernel, particularly for interrupt handling, process synchronization, and timing. Because these topics are discussed in later chapters, we'll just say here that a process must often wait for some event to occur, such as for a disk operation to terminate, a system resource to be released, or a fixed interval of time to elapse. Wait queues implement conditional waits on events: a process wishing to wait for a specific event places itself in the proper wait...

Synchronization Primitives

Chapter 1 introduced the concepts of race condition and critical region for processes. The same definitions apply to kernel control paths. In this chapter, a race condition can occur when the outcome of a computation depends on how two or more interleaved kernel control paths are nested. A critical region is any section of code that must be completely executed by any kernel control path that enters it before another kernel control path can enter it. We now examine how kernel control paths can...

Read/Write Spin Locks


Read/write spin locks have been introduced to increase the amount of concurrency inside the kernel. They allow several kernel control paths to simultaneously read the same data structure, as long as no kernel control path modifies it. If a kernel control path wishes to write to the structure, it must acquire the write version of the read/write lock, which grants exclusive access to the resource. Of course, allowing concurrent reads on data structures improves system performance. Figure 5-2...
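
A minimal sketch of the corresponding 2.4 kernel API follows; the lock and the protected counter are invented names.

    #include <linux/spinlock.h>

    static rwlock_t my_lock = RW_LOCK_UNLOCKED;   /* read/write spin lock */
    static int shared_counter;

    static int read_counter(void)
    {
        int v;

        read_lock(&my_lock);        /* several readers may enter together */
        v = shared_counter;
        read_unlock(&my_lock);
        return v;
    }

    static void bump_counter(void)
    {
        write_lock(&my_lock);       /* writers get exclusive access */
        shared_counter++;
        write_unlock(&my_lock);
    }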

The Big Reader Lock

Read/write spin locks are useful for data structures that are accessed by many readers and a few writers. They are more convenient than normal spin locks because they allow several readers to concurrently access the protected data structure. However, any time a CPU acquires a read/write spin lock, the counter in rwlock_t must be updated. A further access to the rwlock_t data structure performed by another CPU incurs a significant performance penalty because the hardware caches of the two...

Completions

Linux 2.4 also makes use of another synchronization primitive similar to semaphores: completions. They have been introduced to solve a subtle race condition that occurs in multiprocessor systems when process A allocates a temporary semaphore variable, initializes it as a closed MUTEX, passes its address to process B, and then invokes down( ) on it. Later on, process B running on a different CPU invokes up( ) on the same semaphore. However, the current implementation of up( ) and down( ) also...
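
A minimal sketch of the completion API follows, assuming a 2.4 kernel recent enough to provide <linux/completion.h>; the names are invented.

    #include <linux/completion.h>

    static DECLARE_COMPLETION(setup_done);   /* statically initialized completion */

    /* Process A: wait until the other side signals the completion. */
    static void waiter(void)
    {
        wait_for_completion(&setup_done);
    }

    /* Process B: signal that the awaited event has occurred. */
    static void signaler(void)
    {
        complete(&setup_done);
    }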

Disabling Deferrable Functions

In Section 4.7.1, we explained that deferrable functions can be executed at unpredictable times (essentially, on termination of hardware interrupt handlers). Therefore, data structures accessed by deferrable functions must be protected against race conditions. A trivial way to forbid deferrable functions execution on a CPU is to disable interrupts on that CPU. Since no interrupt handler can be activated, softirq actions cannot be started asynchronously. Globally disabling interrupts on all CPUs...

Synchronizing Accesses to Kernel Data Structures

A shared data structure can be protected against race conditions by using some of the synchronization primitives shown in the previous section. Of course, system performance may vary considerably, depending on the kind of synchronization primitive selected. Usually, the following rule of thumb is adopted by kernel developers: always keep the concurrency level as high as possible in the system. In turn, the concurrency level in the system depends on two main factors: 1. The number of I/O devices...

Choosing Among Spin Locks, Semaphores, and Interrupt Disabling

Unfortunately, access patterns to most kernel data structures are a lot more complex than the simple examples just shown, and kernel developers are forced to use semaphores, spin locks, interrupts, and softirq disabling. Generally speaking, choosing the synchronization primitives depends on what kinds of kernel control paths access the data structure, as shown in Table 5-6. Table 5-6. Protection required by data...

The Global Kernel Lock

In earlier Linux kernel versions, a global kernel lock (also known as big kernel lock, or BKL) was widely used. In Version 2.0, this lock was a relatively crude spin lock, ensuring that only one processor at a time could run in Kernel Mode. The 2.2 kernel was considerably more flexible and no longer relied on a single spin lock; rather, a large number of kernel data structures were protected by specialized spin locks. The global kernel lock, on the other hand, was still present because splitting a...

Updating the Time and Date

User programs get the current time and date from the xtime variable of type struct timeval. The kernel also occasionally refers to it, for instance, when updating inode timestamps (see Section 1.5.4). In particular, xtime.tv_sec stores the number of seconds that have elapsed since midnight of January 1, 1970 (UTC), while xtime.tv_usec stores the number of microseconds that have elapsed within the last second (its value ranges between 0 and 999999). During kernel initialization, the time_init( )...

Dynamic Timers

Dynamic timers may be dynamically created and destroyed. No limit is placed on the number of currently active dynamic timers. A dynamic timer is stored in the following timer_list structure:

    struct timer_list {
        struct list_head list;
        unsigned long expires;
        unsigned long data;
        void (*function)(unsigned long);
    };

The function field contains the address of the function to be executed when the timer expires. The data field specifies a parameter to be passed to this timer function. Thanks to the data field, it is possible to define a single general-purpose function that...
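
A sketch of how such a timer is typically set up in 2.4 code follows; the handler, the delay, and the data value are made up for the example.

    #include <linux/timer.h>
    #include <linux/sched.h>     /* jiffies, HZ */

    static struct timer_list my_timer;

    static void my_timer_fn(unsigned long data)
    {
        /* Runs when the timer expires; data is whatever was stored
         * in my_timer.data. */
    }

    static void start_my_timer(void)
    {
        init_timer(&my_timer);
        my_timer.expires  = jiffies + 2 * HZ;      /* about two seconds from now */
        my_timer.data     = 0;                     /* parameter for the handler  */
        my_timer.function = my_timer_fn;
        add_timer(&my_timer);                      /* insert into the timer lists */
    }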

Figure 6-2. The groups of lists associated with dynamic timers

The tv1 structure is of type struct timer_vec_root, which includes an index field and a vec array of 256 list_head elements, that is, lists of dynamic timers. It contains all dynamic timers that will decay within the next 255 ticks. The index field specifies the currently scanned list; it is initialized to 0 and incremented by 1 (modulo 256) at every tick. The list referenced by index contains all dynamic timers that expired during the current tick; the next list contains all dynamic timers that...

The time( ), ftime( ), and gettimeofday( ) System Calls

Processes in User Mode can get the current time and date by means of several system calls: time( ) returns the number of elapsed seconds since midnight at the start of January 1, 1970 (UTC); ftime( ) returns, in a data structure of type timeb, the number of elapsed seconds since midnight of January 1, 1970 (UTC) and the number of elapsed milliseconds in the last second; gettimeofday( ) returns, in a data structure named timeval, the number of elapsed seconds since midnight of January 1, 1970 (UTC) (a second data structure named...
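
For instance, a user program can read both the seconds and the microseconds fields through gettimeofday( ):

    #include <stdio.h>
    #include <sys/time.h>

    int main(void)
    {
        struct timeval tv;

        if (gettimeofday(&tv, NULL) < 0) {
            perror("gettimeofday");
            return 1;
        }
        /* Seconds since midnight of January 1, 1970 (UTC), plus microseconds. */
        printf("%ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
        return 0;
    }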

Kernel Mappings of High Memory Page Frames

Page frames above the 896 MB boundary are not mapped in the fourth gigabyte of the kernel linear address space, so they cannot be directly accessed by the kernel. This implies that any page allocator function that returns the linear address of the assigned page frame doesn't work for high memory. For instance, suppose that the kernel invoked __get_free_pages(GFP_HIGHMEM, 0) to allocate a page frame in high memory. If the allocator assigned a page frame in high memory, __get_free_pages( )...
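
A typical way around this limitation, sketched here with the 2.4 highmem interface, is to allocate the page descriptor with alloc_page( ) and then establish a temporary kernel mapping with kmap( ):

    #include <linux/mm.h>
    #include <linux/highmem.h>
    #include <linux/string.h>

    static void touch_high_page(void)
    {
        struct page *page;
        char *vaddr;

        page = alloc_page(GFP_HIGHMEM);    /* may return a page above 896 MB */
        if (!page)
            return;

        vaddr = kmap(page);                /* create a kernel mapping for it */
        memset(vaddr, 0, PAGE_SIZE);
        kunmap(page);                      /* release the mapping            */

        __free_page(page);
    }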

Figure 7-3. The slab allocator components

The slab allocator never releases the page frames of an empty slab on its own. It would not know when free memory is needed, and there is no benefit to releasing objects when there is still plenty of free memory for new objects. Therefore, releases occur only when the kernel is looking for additional free page frames (see Section 7.2.13 later in this chapter and Section 16.7). Each cache is described by a table of type struct kmem_cache_s (which is equivalent to the type kmem_cache_t). The most...

Figure 7-6. Slab with color col and alignment aln

Coloring works only when free is large enough. Clearly, if no alignment is required for the objects or if the number of unused bytes inside the slab is smaller than the required alignment (free < aln), the only possible slab coloring is the one that has the color 0, the one that assigns a zero offset to the first object. The various colors are distributed equally among slabs of a given object type by storing the current color in a field of the cache descriptor called colour_next. The...

The multiprocessor case

kmem_cache_alloc( ) first disables the local interrupts; it then looks for a free object in the cache's local array associated with the running CPU:

    cc = cachep->cpudata[smp_processor_id()];
    if (cc->avail)
        objp = ((void **)(cc+1))[--cc->avail];
    else
        objp = kmem_cache_alloc_batch(cachep, flags);
    if (objp)
        local_irq_restore(save_flags);
    return objp;

The cc local variable contains the address of the local array descriptor; thus, (cc+1) yields the address of the first local array element. If the avail field of...

Releasing an Object from a Cache

The kmem_cache_free( ) function releases an object previously obtained by the slab allocator. Its parameters are cachep (the address of the cache descriptor) and objp (the address of the object to be released). As with kmem_cache_alloc( ), we discuss the uniprocessor case separately from the multiprocessor case. The function starts by disabling the local interrupts and then determines the address of the descriptor of the slab containing the object. It uses the list.prev subfield of the...

Linear Addresses of Noncontiguous Memory Areas

To find a free range of linear addresses, we can look in the area starting from PAGE_OFFSET (usually 0xc0000000, the beginning of the fourth gigabyte). Figure 7-7 shows how the fourth gigabyte linear addresses are used: The beginning of the area includes the linear addresses that map the first 896 MB of RAM (see Section 2.5.4); the linear address that corresponds to the end of the directly mapped physical memory is stored in the high_memory variable. The end of the area contains the fix-mapped...

Finding a Free Interval: arch_get_unmapped_area( )

The arch_get_unmapped_area( ) function searches the process address space to find an available linear address interval. The len parameter specifies the interval length, while the addr parameter may specify the address from which the search is started. If the search is successful, the function returns the initial address of the new interval; otherwise, it returns the error code -ENOMEM:

    if (len > TASK_SIZE)
        return -ENOMEM;
    addr = (addr + 0xfff) & 0xfffff000;
    if (addr && addr + len < TASK_SIZE) {
        vma ...

...process may in turn resume execution of the traced process by means of a SIGCONT signal.

The signal is not blocked by the destination process. The signal is being ignored by the destination process (either because the process explicitly ignored it or because the process did not change the default action of the signal and that action is ignore). Handle the signal, which may require switching the process to a handler function at any point during its execution and restoring the original execution context after the function returns. Moreover, Linux must take into account the different...

Figure 10-3. Frame on the User Mode stack

[Figure 10-3 shows the contents of the frame, including the signal handler's return address and signal number parameter, the process hardware context (with the floating-point registers), the blocked real-time signals, and the code that invokes sigreturn( ).] The setup_frame( ) function starts by invoking get_sigframe( ) to compute the first memory location of the frame. That memory location is usually in the User Mode stack, so the function returns the value... Linux allows processes to specify an alternate stack for their signal handlers by invoking the sigaltstack( ) system call; this...

Changing a Signal Action

The sigaction(sig,act,oact) system call allows users to specify an action for a signal; of course, if no signal action is defined, the kernel executes the default action associated with the delivered signal. The corresponding sys_sigaction( ) service routine acts on two parameters: the sig signal number and the act table of type sigaction that specifies the new action. A third oact optional output parameter may be used to get the previous action associated with the signal. The function checks...

Suspending the Process

The sigsuspend( ) system call puts the process in the TASK_INTERRUPTIBLE state, after having blocked the standard signals specified by a bit mask array to which the mask parameter points. The process will wake up only when a nonignored, nonblocked signal is sent to it. The corresponding sys_sigsuspend( ) service routine executes these statements:

    mask &= ~(sigmask(SIGKILL) | sigmask(SIGSTOP));
    spin_lock_irq(&current->sigmask_lock);
    siginitset(&current->blocked, mask);
    spin_unlock_irq(&current->sigmask_lock);
    ...
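
From user space, sigsuspend( ) is typically paired with sigprocmask( ) to close the classic check-then-wait race; a minimal sketch follows, in which the flag is set by a hypothetical SIGUSR1 handler.

    #include <signal.h>

    static volatile sig_atomic_t got_signal;

    static void handler(int sig)
    {
        got_signal = 1;
    }

    int main(void)
    {
        sigset_t block, old;

        signal(SIGUSR1, handler);

        sigemptyset(&block);
        sigaddset(&block, SIGUSR1);
        sigprocmask(SIG_BLOCK, &block, &old);   /* block SIGUSR1 while testing  */

        while (!got_signal)
            sigsuspend(&old);                   /* atomically unblock and sleep */

        sigprocmask(SIG_SETMASK, &old, NULL);   /* restore the original mask    */
        return 0;
    }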

Files Associated with a Process

We mentioned in Section 1.5 that each process has its own current working directory and its own root directory. These are just two examples of data that must be maintained by the kernel to represent the interactions between a process and a filesystem. A whole data structure of type fs_struct is used for that purpose (see Table 12-6), and each process descriptor has an fs field that points to the process fs_struct structure. Table 12-6. The fields of the fs_struct structure...

Filesystem Type Registration

Often, the user configures Linux to recognize all the filesystems needed when compiling the kernel for her system. But the code for a filesystem actually may either be included in the kernel image or dynamically loaded as a module (see Appendix B). The VFS must keep track of all filesystem types whose code is currently included in the kernel. It does this by performing filesystem type registration. Each registered filesystem is represented as a file_system_type object whose fields are...

[7] More precisely, the usage counter keeps track of the number of file objects referring to the device file, since clone...

The open method checks the value of the usage counter before the increment. If the counter is null, the device driver must allocate the resources and enable interrupts and DMA on the hardware device. The release method checks the value of the usage counter after the decrement. If the counter is null, no more processes are using the hardware device. If so, the method disables interrupts and DMA on the I/O controller, and then releases the allocated resources. The duration of an I/O operation is...

Page I/O Operations

Block devices transfer information one block at a time, while process address spaces (or to be more precise, memory regions allocated to the process) are defined as sets of pages. This mismatch can be hidden to some extent by using page I/O operations. They may be activated in the following cases: A process issues a read( ) or write( ) system call on a file (see Chapter 15). A process reads a location of a page that maps a file in memory (see Chapter 15). The kernel flushes some dirty pages...

The address_space Object

Table 14-1 suggests that the Linux page cache speeds up several different kinds of I/O operations. In fact, the page cache may include the following types of pages: Pages containing data of regular files and directories of disk-based filesystems (in Chapter 15, we describe how the kernel handles read and write operations on them); Pages containing data of a memory-mapped file (see Chapter 15 for details); Pages containing data directly read from block device files (skipping the filesystem layer) as...

Page Cache Handling Functions

The high-level functions that use the page cache involve finding, adding, and removing a page. The find_get_page macro receives as parameters the address of an address_space object and an offset value. It uses the page_hash macro to derive the address of the hash table entry corresponding to the values of the parameters, and invokes the __find_get_page( ) function to search for the requested page descriptor in the proper collision list. In turn, __find_get_page( ) acquires the pagecache_lock...

Buffer Head Data Structures

As mentioned in Section 13.4.4, each buffer head is stored in a data structure of type buffer_head. These data structures have their own slab allocator cache called bh_cachep, which should not be confused with the buffer cache itself. The slab allocator cache is a memory cache (see Section 3.2.2) for the buffer head objects, meaning that it has no interaction with disks and is simply a way of managing memory efficiently. In contrast, the buffer cache is a disk cache for the data in the buffers....

[2] The name of the array derives from the abbreviation for Least Recently Used. In earlier versions of Linux, these lists...

The mark_buffer_dirty( ) and mark_buffer_clean( ) functions set and clear, respectively, the BH_Dirty flag of a buffer head. To keep the number of dirty buffers in the system bounded, mark_buffer_dirty( ) invokes the balance_dirty( ) function (see the later Section 14.2.4). Both functions also invoke the refile_buffer( ) function, which moves the buffer head into the proper list according to the value of the BH_Dirty and BH_Lock flags. Besides the buf_dirty list, the kernel manages two...

[4] The bdf_prm table also includes several other unused fields.

Table 14-4. Buffer cache tuning parameters: Threshold percentage of dirty buffers for waking up bdflush; Threshold percentage of dirty buffers for waking up bdflush in blocking mode; Time-out in ticks of a dirty buffer for being written to disk; Delay in ticks between kupdate activations. The most typical cases that cause the kernel thread to be woken up are: The balance_dirty( ) function verifies that the number of buffer pages in the buf_dirty and...

Swap Area Descriptor

Each active swap area has its own swap_info_struct descriptor in memory. The fields of the descriptor are illustrated in Table 16-1. Table 16-1. Fields of a swap area descriptor: Device number of the swap disk partition; Mounted filesystem descriptor of the file or device file; Pointer to array of counters, one for each swap area page slot; First page slot to be scanned when searching for a free one; Last page slot to be scanned when searching for a free...

The do_swap_page( ) Function

The do_swap_page( ) function acts on the following parameters: the memory descriptor address of the process that caused the Page Fault exception; the memory region descriptor address of the region that includes address; address, the linear address that causes the exception; the address of the Page Table entry that maps address; orig_pte, the content of the Page Table entry that maps address; and write_access, a flag denoting whether the attempted access was a read or a write. Contrary to other functions, do_swap_page( ) never...

Using an IPC Resource

IPC resources are created by invoking the semget( ), msgget( ), or shmget( ) functions, depending on whether the new resource is a semaphore, a message queue, or a shared memory region. The main objective of each of these three functions is to derive from the IPC key (passed as the first parameter) the corresponding IPC identifier, which is then used by the process for accessing the resource. If there is no IPC resource already associated with the IPC key, a new resource is created. If...
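
For example, a user program can derive an IPC identifier for a semaphore set from an IPC key as sketched below; the path and project id passed to ftok( ) are placeholders.

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    int main(void)
    {
        key_t key;
        int semid;

        key = ftok("/tmp/myapp", 'A');               /* derive an IPC key */
        if (key == (key_t)-1) {
            perror("ftok");
            return 1;
        }

        /* Get the identifier; create one semaphore if none exists for this key. */
        semid = semget(key, 1, IPC_CREAT | 0600);
        if (semid < 0) {
            perror("semget");
            return 1;
        }

        printf("IPC identifier: %d\n", semid);
        return 0;
    }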

The Role of Interrupt Signals

As the name suggests, interrupt signals provide a way to divert the processor to code outside the normal flow of control. When an interrupt signal arrives, the CPU must stop what it's currently doing and switch to a new activity it does this by saving the current value of the program counter (i.e., the content of the eip and cs registers) in the Kernel Mode stack and by placing an address related to the interrupt type into the program counter. There are some things in this chapter that will...

[2] The exact number depends on the processor model.

The following list gives the vector, the name, the type, and a brief description of the exceptions found in 80 x 86 processors. Additional information may be found in the Intel technical documentation.
0 - Divide error (fault): Raised when a program issues an integer division by 0.
1 - Debug (trap or fault): Raised when the TF flag of eflags is set (quite useful to implement step-by-step execution of a debugged program) or when the address of an instruction or operand falls within the range of an active...

The Role of Signals

A signal is a very short message that may be sent to a process or a group of processes. The only information given to the process is usually a number identifying the signal; there is no room in standard signals for arguments, a message, or other accompanying information. A set of macros whose names start with the prefix SIG is used to identify signals; we have already made a few references to them in previous chapters. For instance, the SIGCHLD macro was mentioned in Section 3.4.1. This macro,...