FIFOs

Although pipes are a simple, flexible, and efficient communication mechanism, they have one main drawback namely, that there is no way to open an already existing pipe. This makes it impossible for two arbitrary processes to share the same pipe, unless the pipe was created by a common ancestor process. This drawback is substantial for many application programs. Consider, for instance, a database engine server, which continuously polls client processes wishing to issue some queries and which...

Figure 62 The groups of lists associated with dynamic timers

The tv1 structure is of type struct timer_vec_root, which includes an index field and a vec array of 256 list_head elements that is, lists of dynamic timers. It contains all dynamic timers that will decay within the next 255 ticks. The index field specifies the currently scanned list it is initialized to 0 and incremented by 1 (modulo 256) at every tick. The list referenced by index contains all dynamic timers that expired during the current tick the next list contains all dynamic timers that...

Process State

As its name implies, the state field of the process descriptor describes what is currently happening to the process. It consists of an array of flags, each of which describes a possible process state. In the current Linux version, these states are mutually exclusive, and hence exactly one flag of state is set the remaining flags are cleared. The following are the possible process states The process is either executing on a CPU or waiting to be executed. The process is suspended (sleeping) until...

IPC Messages

Processes can communicate with one another by means of IPC messages. Each message generated by a process is sent to an IPC message queue, where it stays until another process reads it. A message is composed of a fixed-size header and a variable-length text it can be labeled with an integer value (the message type), which allows a process to selectively retrieve messages from its message queue.19- Once a process has read a message from an IPC message queue, the kernel destroys the message...

Profiling the Kernel Code

Linux includes a minimalist code profiler used by Linux developers to discover where the kernel spends its time in Kernel Mode. The profiler identifies the hot spots of the kernel the most frequently executed fragments of kernel code. Identifying the kernel hot spots is very important because they may point out kernel functions that should be further optimized. The profiler is based on a very simple Monte Carlo algorithm at every timer interrupt occurrence, the kernel determines whether the...

Aligning Objects in Memory

The objects managed by the slab allocator are aligned in memory that is, they are stored in memory cells whose initial physical addresses are multiples of a given constant, which is usually a power of 2. This constant is called the alignment factor. The largest alignment factor allowed by the slab allocator is 4,096 the page frame size. This means that objects can be aligned by referring to either their physical addresses or their linear addresses. In both cases, only the 12 least significant...

Hardware Clocks

On the 80 x 86 architecture, the kernel must explicitly interact with four kinds of clocks the Real Time Clock, the Time Stamp Counter, the Programmable Interval Timer, and the timer of the local APICs in SMP systems. The first two hardware devices allow the kernel to keep track of the current time of day. The PIC device and the timers of the local APICs are programmed by the kernel so that they issue interrupts at a fixed, predefined frequency such periodic interrupts are crucial for...

The ret fromintr Function

The ret_from_intr( ) function is essentially equivalent to ret_from_exception( ) Before invoking ret_from_exception( ) , ret_from_intr( ) loads in the ebx register the address of the current's process descriptor (see Section 3.2.2). This is necessary because the ret_from_sys_call( ) function, which can be invoked by ret_from_exception( ), expects to find that address in ebx. On the other hand, when ret_from_exception( ) starts, the ebx register has already been loaded with current's address by...

Process Preemption

As mentioned in the first chapter, Linux processes are preemptive. If a process enters the task_running state, the kernel checks whether its dynamic priority is greater than the priority of the currently running process. If it is, the execution of current is interrupted and the scheduler is invoked to select another process to run (usually the process that just became runnable). Of course, a process may also be preempted when its time quantum expires. As mentioned in Section 6.3, when this...

Transactions

For reasons of efficiency, the JBD layer manages the journal by grouping the log records that belong to several atomic operation handles into a single transaction. Furthermore, all log records relative to a handle must be included in the same transaction. All log records of a transaction are stored in consecutive blocks of the journal. The JBD layer handles each transaction as a whole. For instance, it reclaims the blocks used by a transaction only after all data included in its log records is...

Journaling Filesystems

As disks became larger, one design choice of traditional Unix filesystems (like Ext2) turns out to be inappropriate. As we know from Chapter 14, updates to filesystem blocks might be kept in dynamic memory for long period of time before being flushed to disk. A dramatic event like a power-down failure or a system crash might thus leave the filesystem in an inconsistent state. To overcome this problem, each traditional Unix filesystem is checked before being mounted if it has not been properly...

A21 Booting Linux from Floppy Disk

The only way to store a Linux kernel on a single floppy disk is to compress the kernel image. As we shall see, compression is done at compile time and decompression is done by the loader. If the Linux kernel is loaded from a floppy disk, the boot loader is quite simple. It is coded in the arch i386 boot bootsect.S assembly language file. When a new kernel image is produced by compiling the kernel source, the executable code yielded by this assembly language file is placed at the beginning of...

Finding a free interval archgetunmappedarea

The arch_get_unmapped_area( ) function searches the process address space to find an available linear address interval. The len parameter specifies the interval length, while the addr parameter may specify the address from which the search is started. If the search is successful, the function returns the initial address of the new interval otherwise, it returns the error code - return -ENOMEM addr (addr + Oxfff) & 0xfffff000 if (addr & & addr + len < TASK_SIZE) vma...

Description

Create the file if it does not exist With o creat, fail if the file already exists Never consider the file as a controlling terminal Truncate the file (remove all existing contents) No system calls will block on the file Synchronous write (block until physical write terminates) Asynchronous I O notification via signals Direct I O transfer (no kernel buffering) Do not follow a trailing symbolic link in pathname Let's describe the operation of the sys_open( ) function. It performs the following...

Kernel Architecture

As stated before, most Unix kernels are monolithic each kernel layer is integrated into the whole kernel program and runs in Kernel Mode on behalf of the current process. In contrast, microkernel operating systems demand a very small set of functions from the kernel, generally including a few synchronization primitives, a simple scheduler, and an interprocess communication mechanism. Several system processes that run on top of the microkernel implement other operating system-layer functions,...

Msssm

As we explained earlier, most exceptions are handled simply by sending a Unix signal to the process that caused the exception. The action to be taken is thus deferred until the process receives the signal as a result, the kernel is able to process the exception quickly. This approach does not hold for interrupts because they frequently arrive long after the process to which they are related (for instance, a process that requested a data transfer) has been suspended and a completely unrelated...

File such as block and character device files FIFOs and even regular files it cannot create directories or sockets

Once created, a FIFO can be accessed through the usual open( ), read( ), write( ), and close( ) system calls, but the VFS handles it in a special way because the FIFO inode and file operations are customized and do not depend on the filesystems in which the FIFO is stored. The POSIX standard specifies the behavior of the open( ) system call on FIFOs the behavior depends essentially on the requested access type, the kind of I O operation (blocking or nonblocking), and the presence of other...

Linux File Locking

Linux supports all fashions of file locking advisory and mandatory locks, as well as the fcntl( ), flock( ), and the lockf( ) system calls. However, the lockf( ) system call is just a library wrapper routine, and therefore is not discussed here. fcntl( )'s mandatory locks can be enabled and disabled on a per-filesystem basis using the ms_mandlock flag (the mand option) of the mount( ) system call. The default is to switch off mandatory locking. In this case, fcntl( ) creates advisory locks....

Semaphores

We have already introduced semaphores in Section 1.6.5. Essentially, they implement a locking primitive that allows waiters to sleep until the desired resource becomes free. Actually, Linux offers two kinds of semaphores Kernel semaphores, which are used by kernel control paths System V IPC semaphores, which are used by User Mode processes In this section, we focus on kernel semaphores, while IPC semaphores are described in Chapter 19. A kernel semaphore is similar to a spin lock, in that it...

A4 Renaissance The startup32 Functions

There are two different startup_32( ) functions the one we refer to here is coded in the arch i386 boot compressed head.S file. After setup( ) terminates, the function has been moved either to physical address 0x00100000 or to physical address 0x00001000, depending on whether the kernel image was loaded high or low in RAM. This function performs the following operations 1. Initializes the segmentation registers and a provisional stack. 2. Fills the area of uninitialized data of the kernel...

The rwswap page Function

The rw_swap_page( ) function is used to swap in or swap out a page. It receives the following parameters A flag specifying the direction of data transfer read for swapping in, write for swapping out. Before invoking the function, the caller must ensure that the page is included in the swap cache and lock the page to prevent race conditions due to concurrent accesses to the page frame or to the page slot in the swap area, as described in the previous section. To be on the safe side, the...

Block and Page IO Operations

We'll discuss in the forthcoming chapters how the kernel uses the block device drivers. We'll see that there are a number of cases in which the kernel activates disk I O data transfers. However, let's describe here the two fundamental kinds of I O data transfer for block devices Here the I O operation transfers a single block of data, so the transferred data can be kept in a single RAM buffer. The disk address consists of a device number and a block number. The buffer is associated with a...

Returning from Interrupts and Exceptions

We will finish the chapter by examining the termination phase of interrupt and exception handlers. Although the main objective is clear namely, to resume execution of some program several issues must be considered before doing it Number of kernel control paths being concurrently executed If there is just one, the CPU must switch back to User Mode. If there is any request, the kernel must perform process scheduling otherwise, control is returned to the current process. If a signal is sent to the...

The list of Taskrunning processes

When looking for a new process to run on the CPU, the kernel has to consider only the runnable processes (that is, the processes in the task_running state). Since it is rather inefficient to scan the whole process list, a doubly linked circular list of task_running processes called runqueue has been introduced. This list is implemented through the run_list field of type list_head in the process descriptor. As in the previous case, the init_task process descriptor plays the role of list header....

Deleting a Process Address Space

When a process terminates, the kernel invokes the exit_mm( ) function to release the address space owned by that process atomic inc(& tsk-> mm-> mm count) mm tsk-> mm tsk-> mm NULL enter lazy tlb(mm, current, smp processor id()) mmput(mm) The mm_release( ) function wakes up any process sleeping in the tsk-> vfork_done completion (see Section 5.3.8). Typically, the corresponding wait queue is nonempty only if the exiting process was created by means of the vfork( ) system call (see...

Transferring Swap Pages

Transferring swap pages wouldn't be so complicated if there weren't so many race conditions and other potential hazards to guard against. Here are some of the things that have to be checked regularly The process that owns a page may terminate while the page is being swapped in or out. Another process may be in the middle of swapping in a page that the current one is trying to swap out (or vice versa). Like any other disk access type, I O data transfers for swap pages are blocking operations....

Figure 73 The slab allocator components

The slab allocator never releases the page frames of an empty slab on its own. It would not know when free memory is needed, and there is no benefit to releasing objects when there is still plenty of free memory for new objects. Therefore, releases occur only when the kernel is looking for additional free page frames (see Section 7.2.13 later in this chapter and Section 16.7). Each cache is described by a table of type struct kmem_cache_s (which is equivalent to the type kmem_cache_t). The most...

How Processes Are Organized

The runqueue list groups all processes in a task_running state. When it comes to grouping processes in other states, the various states call for different types of treatment, with Linux opting for one of the choices shown in the following list. Processes in a task_stopped or in a task_zombie state are not linked in specific lists. There is no need to group processes in either of these two states, since stopped and zombie processes are accessed only via PID or via linked lists of the child...

Disabling Deferrable Functions

In Section 4.7.1, we explained that deferrable functions can be executed at unpredictable times (essentially, on termination of hardware interrupt handlers). Therefore, data structures accessed by deferrable functions must be protected against race conditions. A trivial way to forbid deferrable functions execution on a CPU is to disable interrupts on that CPU. Since no interrupt handler can be activated, softirq actions cannot be started asynchronously. Globally disabling interrupts on all CPUs...

New program

Although the program interpreter runs in User Mode, we briefly sketch out here how it operates. Its first job is to set up a basic execution context for itself, starting from the information stored by the kernel in the User Mode stack between the array of pointers to environment strings and arg_start. Then the program interpreter must examine the program to be executed to identify which shared libraries must be loaded and which functions in each shared library are effectively requested. Next,...

The lists of page descriptors in the addressspace object

As we have seen, the address_space object includes three lists of page descriptors, whose heads are stored in the clean_pages, dirty_pages, and locked_pages fields. The lists allow the kernel to quickly find all pages of a file (or whatever) in a specific state Includes the pages not locked and not dirty (the PG_locked and PG_dirty flags in the page descriptor are equal to 0). The PG_uptodate flag indicates whether the data in the pages is up to date. Typically, a page is not up to date when...

General Purpose Objects

As stated earlier in Section 7.1.7, infrequent requests for memory areas are handled through a group of general caches whose objects have geometrically distributed sizes ranging from a minimum of 32 to a maximum of 131,072 bytes. Objects of this type are obtained by invoking the kmalloc( ) function void * kmalloc(size_t size, int flags) cache_sizes_t *csizep cache_sizes kmem_cache_t * cachep for ( csizep-> cs_size csizep++) if (size > csizep-> cs_size) cachep csizep-> cs_cachep return _...

As for vmalloc the kernel modifies the entries of the master kernel Page Global

Directory and its child Page Tables (see Section 2.5.5), but it leaves unchanged the entries of the process Page Tables mapping the fourth gigabyte. This is fine because the kernel never reclaims Page Middle Directories and Page Tables rooted at the master kernel Page Global Directory. For instance, suppose that a process in Kernel Mode accessed a noncontiguous memory area that later got released. The process's Page Global Directory entries are equal to the corresponding entries of the master...

General Characteristics of Ext2

Unix-like operating systems use several filesystems. Although the files of all such filesystems have a common subset of attributes required by a few POSIX APIs like stat( ), each filesystem is implemented in a different way. The first versions of Linux were based on the Minix filesystem. As Linux matured, the Extended Filesystem (Ext FS) was introduced it included several significant extensions, but offered unsatisfactory performance. The Second Extended Filesystem (Ext2) was introduced in 1994...

Process Credentials and Capabilities

Traditionally, Unix systems associate with each process some credentials, which bind the process to a specific user and a specific user group. Credentials are important on multiuser systems because they determine what each process can or cannot do, thus preserving both the integrity of each user's personal data and the stability of the system as a whole. The use of credentials requires support both in the process data structure and in the resources being protected. One obvious resource is a...

Local Interrupt Disabling

Interrupt disabling is one of the key mechanisms used to ensure that a sequence of kernel statements is treated as a critical section. It allows a kernel control path to continue executing even when hardware devices issue IRQ signals, thus providing an effective way to protect data structures that are also accessed by interrupt handlers. By itself, however, local interrupt disabling does not protect against concurrent accesses to data structures by interrupt handlers running on other CPUs, so...

Suspending the Process

The sigsuspend( ) system call puts the process in the task_interruptible state, after having blocked the standard signals specified by a bit mask array to which the mask parameter points. The process will wake up only when a nonignored, nonblocked signal is sent to it. The corresponding sys_sigsuspend( ) service routine executes these statements mask & (sigmask(SIGKILL) sigmask(SIGSTOP)) spin lock irq(& current-> sigmask lock) siginitset(& current-> blocked, mask) spin unlock...

Inode Semaphore

As we shall see in Chapter 12, Linux stores the information on a disk file in a memory object called an inode. The corresponding data structure includes its own semaphore in the i_sem field. A huge number of race conditions can occur during filesystem handling. Indeed, each file on disk is a resource held in common for all users, since all processes may (potentially) access the file content, change its name or location, destroy or duplicate it, and so on. For example, let's suppose that a...

B2 Module Implementation

Modules are stored in the filesystem as ELF object files and are linked to the kernel by executing the insmod program (see the later section, Section B.3). For each module, the kernel allocates a memory area containing the following data A null-terminated string that represents the name of the module (all modules should have unique names) The code that implements the functions of the module The module object describes a module its fields are shown in Table B-1. A simply linked list collects all...

Swap Cache Helper Functions

The kernel uses several functions to handle the swap cache they are based mainly on those discussed in Section 14.1. We show later how these relatively low-level functions are invoked by higher-level functions to swap pages in and out as needed. The main functions that handle the swap cache are Finds a page in the swap cache through its swapped-out page identifier passed as a parameter and returns the page address. It returns 0 if the page is not present in the cache. It invokes find_ get_page(...

Superblock

An Ext2 disk superblock is stored in an ext2_super_block structure, whose fields are listed in Table 17-1. The _ _u8, _ _ul6, and _ _u32 data types denote unsigned numbers of length 8, 16, and 32 bits respectively, while the__s8,__s16,__s32 data types denote signed numbers of length 8, 16, and 32 bits. Table 17-1. The fields of the Ext2 superblock Table 17-1. The fields of the Ext2 superblock Number of first useful block (always 1) Number of mount operations before check Block group number of...

The adjtimex System Call

Although clock drift ensures that all systems eventually move away from the correct time, changing the time abruptly is both an administrative nuisance and risky behavior. Imagine, for instance, programmers trying to build a large program and depending on filetime stamps to make sure that out-of-date object files are recompiled. A large change in the system's time could confuse the make program and lead to an incorrect build. Keeping the clocks tuned is also important when implementing a...

Doubly linked lists

The process list is a special doubly linked list. However, as you may have noticed, the Linux kernel uses hundreds of doubly linked lists that store the various kernel data structures. For each list, a set of primitive operations must be implemented initializing the list, inserting and deleting an element, scanning the list, and so on. It would be both a waste of programmers' efforts and a waste of memory to replicate the primitive operations for each Therefore, the Linux kernel defines the...

Synchronization Primitives

Chapter 1 introduced the concepts of race condition and critical region for processes. The same definitions apply to kernel control paths. In this chapter, a race condition can occur when the outcome of a computation depends on how two or more interleaved kernel control paths are nested. A critical region is any section of code that must be completely executed by any kernel control path that enters it before another kernel control path can enter it. We now examine how kernel control paths can...

The Swap Cache

The swap cache is crucial to avoid race conditions among processes trying to access pages that are being swapped. If a page is owned by a single process (or better, if the page belongs to an address space that is owned by one or more clone processes), there is just one race condition to be considered the process attempts to address a page that is being swapped out. An array of semaphores, one per each page slot, could be used to block the process until the I O data transfer completes. In many...

Devfs Device Files

Identifying I O devices by means of major and minor numbers has some limitations 1. Most of the devices present in a dev directory don't exist the device files have been included so that the system administrator doesn't need to create a device file before installing a new I O driver. However, a typical dev directory, which includes over 1,800 device files, increases the time taken to look up an inode when first referenced. 2. The major and minor numbers are 8-bit long. Nowadays, this is a...

Linear Addresses of Noncontiguous Memory Areas

To find a free range of linear addresses, we can look in the area starting from page_offset (usually 0xc0000000, the beginning of the fourth gigabyte). Figure 7-7 shows how the fourth gigabyte linear addresses are used The beginning of the area includes the linear addresses that map the first 896 MB of RAM (see Section 2.5.4) the linear address that corresponds to the end of the directly mapped physical memory is stored in the high memory variable. The end of the area contains the fix-mapped...

Nested Execution of Exception and Interrupt Handlers

When handling an interrupt or an exception, the kernel begins a new kernel control path, or separate sequence of instructions. When a process issues a system call request, for instance, the first instructions of the corresponding kernel control path are those that save the content of the registers in the Kernel Mode stack, while the last instructions are those that restore the content of the registers and put the CPU back into User Mode. Linux design does not allow process switching while the...

The Page Cache

To avoid unnecessary disk accesses, the kernel never tries to read a page from disk without looking into the page cache and verifying that it does not already include the requested data. To take the maximum advantage from the page cache, searching into it should be a very fast operation. The unit of information kept in the page cache is, of course, a whole page of data. A page does not necessarily contain physically adjacent disk blocks, so it cannot be identified by a device number and a block...

Files Associated with a Process

We mentioned in Section 1.5 that each process has its own current working directory and its own root directory. These are just two examples of data that must be maintained by the kernel to represent the interactions between a process and a filesystem. A whole data structure of type fs_struct is used for that purpose (see Table 12-6) and each process descriptor has an fs field that points to the process fs_struct structure. Table 12-6. The fields of the fs struct structure Table 12-6. The fields...

Memory Mapping

As already mentioned in Section 8.3, a memory region can be associated with some portion of either a regular file in a disk-based filesystem or a block device file. This means that an access to a byte within a page of the memory region is translated by the kernel into an operation on the corresponding byte of the file. This technique is called memory mapping. Any write operation on the pages of the memory region changes the file on disk moreover, if a process writes into a page of a shared...

The Ext3 Journaling Filesystem

The idea behind Ext3 journaling is to perform any high-level change to the filesystem in two steps. First, a copy of the blocks to be written is stored in the journal then, when the I O data transfer to the journal is completed (in short, data is committed to the journal), the blocks are written in the filesystem. When the I O data transfer to the filesystem terminates (data is committed to the filesystem), the copies of the blocks in the journal are discarded. While recovering after a system...

Flflock Locks

An fl_flock lock is always associated with a file object and is thus maintained by a particular process (or clone processes sharing the same opened file). When a lock is requested and granted, the kernel replaces any other lock that the process is holding on the same file object. This happens only when a process wants to change an already owned read lock into a write one, or vice versa. Moreover, when a file object is being freed by the fput( ) function, all fl_flock locks that refer to the...

[4 The bdfprm table also includes several other unused fields

Buffer cache tuning parameters Table 14-4. Buffer cache tuning parameters Threshold percentage of dirty buffers for waking up bdflush Threshold percentage of dirty buffers for waking up bdflush in blocking mode Time-out in ticks of a dirty buffer for being written to disk Delay in ticks between kupdate activations The most typical cases that cause the kernel thread to be woken up are The balance_dirty( ) function verifies that the number of buffer pages in the buf_dirty and...

Releasing a Slab from a Cache

As stated previously, the slab allocator never releases the page frames of an empty slab on its own. In fact, a slab is released only if both of the following conditions are true The buddy system is unable to satisfy a new request for a group of page frames (a zone is low on memory). The slab is empty all the objects included in it are unused. When the kernel looks for additional free page frames, it calls try_to_free_pages( ) this function, in turn, may invoke kmem_cache_reap( ) , which...

Understanding the Linux Kernel 2nd Edition

The Audience for This Book Organization of the Material Overview of the Book Background Information Conventions in This Book How to Contact Us Acknowledgments Section 1.1. Linux Versus Other Unix-Like Kernels Section 1.2. Hardware Dependency Section 1.4. Basic Operating System Concepts Section 1.5. An Overview of the Unix Filesystem Section 1.6. An Overview of Unix Kernels Section 2.1. Memory Addresses Section 2.2. Segmentation in Hardware Section 2.3. Segmentation in Linux Section 2.4. Paging...

Execution Tracing

Execution tracing is a technique that allows a program to monitor the execution of another program. The traced program can be executed step by step, until a signal is received, or until a system call is invoked. Execution tracing is widely used by debuggers, together with other techniques like the insertion of breakpoints in the debugged program and run-time access to its variables. We focus on how the kernel supports execution tracing rather than discussing how debuggers work. In Linux,...

Translation Lookaside Buffers TLB

Besides general-purpose hardware caches, 80 x 86 processors include other caches called Translation Lookaside Buffers (TLB) to speed up linear address translation. When a linear address is used for the first time, the corresponding physical address is computed through slow accesses to the Page Tables in RAM. The physical address is then stored in a TLB entry so that further references to the same linear address can be quickly translated. In a multiprocessor system, each CPU has its own TLB,...

Access Rights and File Mode

The potential users of a file fall into three classes The user who is the owner of the file The users who belong to the same group as the file, not including the owner All remaining users (others) There are three types of access rights Read, Write, and Execute for each of these three classes. Thus, the set of access rights associated with a file consists of nine different binary flags. Three additional flags, called suid (Set User ID), sgid (Set Group ID), and sticky, define the file mode....

Dynamic Timers

Dynamic timers may be dynamically created and destroyed. No limit is placed on the number of currently active dynamic timers. A dynamic timer is stored in the following timer_list structure struct list_head list unsigned long expires The function field contains the address of the function to be executed when the timer expires. The data field specifies a parameter to be passed to this timer function. Thanks to the data field, it is possible to define a single general-purpose function that...

The Linear Address Fields

The following macros simplify Page Table handling Specifies the length in bits of the Offset field when applied to 80 x 86 processors, it yields the value 12. Since all the addresses in a page must fit in the Offset field, the size of a page on 80 x 86 systems is 212 or the familiar 4,096 bytes the page_shift of 12 can thus be considered the logarithm base 2 of the total page size. This macro is used by page_size to return the size of the page. Finally, the page_mask macro yields the value...

Swap Area

The pages swapped out from memory are stored in a swap area, which may be implemented either as a disk partition of its own or as a file included in a larger partition. Several different swap areas may be defined, up to a maximum number specified by the max_swapfiles macro (usually set to 32). Having multiple swap areas allows a system administrator to spread a lot of swap space among several disks so that the hardware can act on them concurrently it also lets swap space be increased at runtime...

Figure 192 IPC message queue data structures

The msg_queue data structure Table 19-11. The msg_queue data structure List of processes receiving messages The most important field is q_messages, which represents the head (i.e., the first dummy element) of a doubly linked circular list containing all messages currently in the queue. Each message is broken in one or more pages, which are dynamically allocated. The beginning of the first page stores the message header, which is a data structure of type msg_msg its fields are...

Initializing a Block Device Driver

Let's now describe how a block device driver is initialized. We already described how the kernel customizes the methods of the file object when a block device file is opened in Section 13.2.3. Its f_op field is set to the address of the def_blk_fops variable. The contents of this table are shown in Table 13-5. The dentry_open( ) function checks whether the open method is defined this is always true for a block device file, so the blkdev_open( ) function is executed. Table 13-5. The default file...

Filesystem Type Registration

Often, the user configures Linux to recognize all the filesystems needed when compiling the kernel for her system. But the code for a filesystem actually may either be included in the kernel image or dynamically loaded as a module (see Appendix B). The VFS must keep track of all filesystem types whose code is currently included in the kernel. It does this by performing filesystem type registration. Each registered filesystem is represented as a file_system_type object whose fields are...

File Handling System Calls

When a user accesses the contents of either a regular file or a directory, he actually accesses some data stored in a hardware block device. In this sense, a filesystem is a userlevel view of the physical organization of a hard disk partition. Since a process in User Mode cannot directly interact with the low-level hardware components, each actual file operation must be performed in Kernel Mode. Therefore, the Unix operating system defines several system calls related to file handling. All Unix...

Page Cache Handling Functions

The high-level functions that use the page cache involve finding, adding, and removing a page. The find_get_page macro receives as parameters the address of an address_space object and an offset value. It uses the page_hash macro to derive the address of the hash table entry corresponding to the values of the parameters, and invokes the _ _find_get_page( ) function to search for the requested page descriptor in the proper collision list. In turn, _ _find_get_page( ) acquires the pagecache_lock...

Read Write Semaphores

Read write semaphores are a new feature of Linux 2.4. They are similar to the read write spin locks described earlier in Section 5.3.4, except that waiting processes are suspended until the semaphore becomes open again. Many kernel control paths may concurrently acquire a read write semaphore for reading however, any writer kernel control path must have exclusive access to the protected resource. Therefore, the semaphore can be acquired for writing only if no other kernel control path is...

IPC Semaphores

IPC semaphores are quite similar to the kernel semaphores introduced in Chapter 5 they are counters used to provide controlled access to shared data structures for multiple processes. The semaphore value is positive if the protected resource is available, and 0 if the protected resource is currently not available. A process that wants to access the resource tries to decrement the semaphore value the kernel, however, blocks the process until the operation on the semaphore yields a positive...

Figure 74 Relationship between cache and slab descriptors

Caches are divided into two types general and specific. General caches are used only by the slab allocator for its own purposes, while specific caches are used by the remaining parts of the kernel. A first cache contains the cache descriptors of the remaining caches used by the kernel. The cache_cache variable contains its descriptor. Twenty-six additional caches contain geometrically distributed memory areas. The table, called cache_sizes (whose elements are of type cache_sizes_t), points to...

Memory Regions

Linux implements a memory region by means of an object of type vm_area_struct its fields are shown in Table 8-3. Table 8-3. The fields of the memory region object Table 8-3. The fields of the memory region object Pointer to the memory descriptor that owns the region First linear address inside the region First linear address after the region Access permissions for the page frames of the region Data for the red-black tree (see later in this chapter) Pointer to the next element in the file memory...

Special Filesystems

While network and disk-based filesystems enable the user to handle information stored outside the kernel, special filesystems may provide an easy way for system programs and administrators to manipulate the data structures of the kernel and to implement special features of the operating system. Table 12-8 lists the most common special filesystems used in Linux for each of them, the table reports its mount point and a short description. Notice that a few filesystems have no fixed mount point...

System Calls Related to Signal Handling

As stated in the introduction of this chapter, programs running in User Mode are allowed to send and receive signals. This means that a set of system calls must be defined to allow these kinds of operations. Unfortunately, for historical reasons, several system calls exist that serve essentially the same purpose. As a result, some of these system calls are never invoked. For instance, sys_sigaction( ) and sys_rt_sigaction( ) are almost identical, so the sigaction( ) wrapper function included in...

Bitmap Caches

Ext2 Bitmap

When the kernel mounts an Ext2 filesystem, it allocates a buffer for the Ext2 disk superblock and reads its contents from disk. The buffer is released only when the Ext2 filesystem is unmounted. When the kernel must modify a field in the Ext2 superblock, it simply writes the new value in the proper position of the corresponding buffer and then marks the buffer as dirty. Unfortunately, this approach cannot be adopted for all Ext2 disk data structures. The tenfold increase in disk capacity...

Wait queues

Wait queues have several uses in the kernel, particularly for interrupt handling, process synchronization, and timing. Because these topics are discussed in later chapters, we'll just say here that a process must often wait for some event to occur, such as for a disk operation to terminate, a system resource to be released, or a fixed interval of time to elapse. Wait queues implement conditional waits on events a process wishing to wait for a specific event places itself in the proper wait...

Clock tick multiplied by 232 613 Programmable Interval Timer

Besides the Real Time Clock and the Time Stamp Counter, IBM-compatible PCs include another type of time-measuring device called Programmable Interval Timer (PIT ). The role of a PIT is similar to the alarm clock of a microwave oven it makes the user aware that the cooking time interval has elapsed. Instead of ringing a bell, this device issues a special interrupt called timer interrupt, which notifies the kernel that one more time interval has elapsed. 12 Another difference from the alarm clock...

CPU Local Timers

The local APIC present in recent Intel processors (see Section 4.2) provides yet another time-measuring device the CPU local timer. The CPU local timer is a device that can issue one-shot or periodic interrupts, which is similar to the Programmable Interval Timer just described. There are, however, a few differences The APIC's timer counter is 32-bits long, while the PIT's timer counter is 16-bits long therefore, the local timer can be programmed to issue interrupts at very low frequencies (the...

Slab Coloring

We know from Chapter 2 that the same hardware cache line maps many different blocks of RAM. In this chapter, we have also seen that objects of the same size tend to be stored at the same offset within a cache. Objects that have the same offset within different slabs will, with a relatively high probability, end up mapped in the same cache line. The cache hardware might therefore waste memory cycles transferring two objects from the same cache line back and forth to different RAM locations,...

Tasklets

Tasklets are the preferred way to implement deferrable functions in I O drivers. As already explained, tasklets are built on top of two softirqs named hi_softirq and tasklet_softirq. Several tasklets may be associated with the same softirq, each tasklet carrying its own function. There is no real difference between the two softirqs, except that do_softirq( ) executes hi_softirq's tasklets before tasklet_softirq's tasklets. Tasklets and high-priority tasklets are stored in the tasklet_vec and...

The actual names are vestiges of old kernel versions

A__ksoftirqd_task field that stores the process descriptor address of a ksoftirqd_CPUn kernel thread, which is devoted to the execution of deferrable functions. (There is one such thread per CPU, and the n in ksoftiqd_CPUn represents the CPU index, as described later in Section 4.7.1.1.) This field can be accessed through the ksoftirqd_task macro. The open_softirq( ) function takes care of softirq initialization. It uses three parameters the softirq index, a pointer to the softirq function to...

The Global Kernel Lock

In earlier Linux kernel versions, a global kernel lock (also known as big kernel lock, or BKL) was widely used. In Version 2.0, this lock was a relatively crude spin lock, ensuring that only one processor at a time could run in Kernel Mode. The 2.2 kernel was considerably more flexible and no longer relied on a single spin lock rather, a large number of kernel data structures were protected by specialized spin locks. The global kernel lock, on other hand, was still present because splitting a...

The ipc System Call

All IPC functions must be implemented through suitable Linux system calls. Actually, in the 80 x 86 architecture, there is just one IPC system call named ipc( ). When a process invokes an IPC function, let's say msgget( ), it really invokes a wrapper function in the C library. This in turn invokes the ipc( ) system call by passing to it all the parameters of msgget( ) plus a proper subcommand code in this case, msgget. The sys_ipc( ) service routine examines the subcommand code and invokes the...

Paging in Linux

As we explained earlier in Section 2.4.5, Linux adopted a three-level paging model so paging is feasible on 64-bit architectures. Figure 2-11 shows the model, which defines three types of paging tables. The Page Global Directory includes the addresses of several Page Middle Directories, which in turn include the addresses of several Page Tables. Each Page Table entry points to a page frame. The linear address is thus split into four parts. Figure 2-11 does not show the bit numbers because the...

Low Level Request Handling

We have now reached the lowest level in Linux's block device handling. This level is implemented by the strategy routine, which interacts with the physical block device to satisfy the requests collected in the queue. As mentioned earlier, the strategy routine is usually started after inserting a new request in an empty request queue. Once activated, the low-level driver should handle all requests in the queue and terminate when the queue is empty. A naive implementation of the strategy routine...

Processor descriptors handling

Processes are dynamic entities whose lifetimes range from a few milliseconds to months. Thus, the kernel must be able to handle many processes at the same time, and process descriptors are stored in dynamic memory rather than in the memory area permanently assigned to the kernel. Linux stores two different data structures for each process in a single 8 KB memory area the process descriptor and the Kernel Mode process stack. In Section 2.3, we learned that a process in Kernel Mode accesses a...

The Physical Address Extension PAE Paging Mechanism

The amount of RAM supported by a processor is limited by the number of address pins connected to the address bus. Older Intel processors from the 80386 to the Pentium used 32bit physical addresses. In theory, up to 4 GB of RAM could be installed on such systems in practice, due to the linear address space requirements of User Mode processes, the kernel cannot directly address more than 1 GB of RAM, as we shall see in the later section Section 2.5. However, some demanding applications running on...

[8 Linux records the number of minor and major Page Faults for each process This information together with several

Conversely, when handling a read access, the content of the page is irrelevant because the process is addressing it for the first time. It is safer to give a page filled with zeros to the process rather than an old page filled with information written by some other process. Linux goes one step further in the spirit of demand paging. There is no need to assign a new page frame filled with zeros to the process right away, since we might as well give it an existing page called zero page, thus...

Lookup of Symbolic Links

Recall that a symbolic link is a regular file that stores a pathname of another file. A pathname may include symbolic links, and they must be resolved by the kernel. For example, if foo bar is a symbolic link pointing to (containing the pathname) dir, the pathname foo bar file must be resolved by the kernel as a reference to the file dir file. In this example, the kernel must perform two different lookup operations. The first one resolves foo bar when the kernel discovers that bar is the name...

Handling a Faulty Address Outside the Address Space

If address does not belong to the process address space, do_page_fault( ) proceeds to execute the statements at the label bad_area. If the error occurred in User Mode, it sends a sigsegv signal to current (see Section 10.2) and terminates up read(& tsk-> mm-> mmap sem) if (error code & 4) * User Mode * tsk-> thread.cr2 address tsk-> thread.error code error code tsk-> thread.trap no 14 info.si_signo SIGSEGV info.si errno 0 info.si addr (void *) address force_sig_info(SIGSEGV,...

Is still able to act on the file In this particular case the inode is removed from the hash table even if it still

The methods associated with an inode object are also called inode operations. They are described by an inode_operations structure, whose address is included in the i op field. Here are the inode operations in the order they appear in the inode_operations table Creates a new disk inode for a regular file associated with a dentry object in some directory. Searches a directory for an inode corresponding to the filename included in a dentry object. Creates a new hard link that refers to the file...

The Role of Signals

A signal is a very short message that may be sent to a process or a group of processes. The only information given to the process is usually a number identifying the signal there is no room in standard signals for arguments, a message, or other accompanying information. A set of macros whose names start with the prefix sig is used to identify signals we have already made a few references to them in previous chapters. For instance, the sigchld macro was mentioned in Section 3.4.1. This macro,...

Page IO operations

Block devices transfer information one block at a time, while process address spaces (or to be more precise, memory regions allocated to the process) are defined as sets of pages. This mismatch can be hidden to some extent by using page I O operations. They may be activated in the following cases A process issues a read( ) or write( ) system call on a file (see Chapter 15). A process reads a location of a page that maps a file in memory (see Chapter 15). The kernel flushes some dirty pages...

Process may in turn resume execution of the traced process by means of a sigcont signal

The signal is not blocked by the destination process. The signal is being ignored by the destination process (either because the process explicitly ignored it or because the process did not change the default action of the signal and that action is ignore). Handle the signal, which may require switching the process to a handler function at any point during its execution and restoring the original execution context after the function returns. Moreover, Linux must take into account the different...

What Is Swapping

To expand the address space that is effectively usable by a process To expand the amount of dynamic RAM (i.e., what is left of the RAM once the kernel code and static data structures have been initialized) to load processes Let's give a few examples of how swapping benefits the user. The simplest is when a program's data structures take up more space than the size of the available RAM. A swap area will allow this program to be loaded without any problem, and thus to run correctly. A more...

Writing Packets to a Socket

Finally, our example program is ready to send messages to the remote host it simply writes the data onto the socket write(sockfd, mesg, strlen(mesg)+1) The write( ) system call triggers the write method of the file object associated with the sockfd file descriptor. For socket files, this method is implemented by the sock_write( ) function, which performs the following actions 1. Determines the address of the socket object embedded in the file's inode. 2. Allocates and initializes a message...

Time Stamp Counter

All 80 x 86 microprocessors include a CLK input pin, which receives the clock signal of an external oscillator. Starting with the Pentium, many recent 80 x 86 microprocessors include a 64-bit Time Stamp Counter (TSC ) register that can be read by means of the rdtsc assembly language instruction. This register is a counter that is incremented at each clock signal if, for instance, the clock ticks at 400 MHz, the Time Stamp Counter is incremented once every 2.5 nanoseconds. Linux takes advantage...

Creating a Process Address Space

In Section 3.4.1, we mentioned that the kernel invokes the copy_mm( ) function while creating a new process. This function creates the process address space by setting up all Page Tables and memory descriptors of the new process. Each process usually has its own address space, but lightweight processes can be created by calling clone( ) with the clone_vm flag set. These processes share the same address space that is, they are allowed to address the same set of pages. Following the COW approach...

Second phase updating the Page Tables

A while cycle is used to scan the list of memory regions built in the first phase, starting with the memory region descriptor that free points to. In each iteration, the mpnt local variable points to the descriptor of a memory region in the list. The map_count field of the current-> mm memory descriptor is decremented (since the region has been removed in the first phase from the list of regions owned by the process) and a check is made (by means of two question-mark conditional expressions)...

Read Write Spin Locks

Spin Read Write

Read write spin locks have been introduced to increase the amount of concurrency inside the kernel. They allow several kernel control paths to simultaneously read the same data structure, as long as no kernel control path modifies it. If a kernel control path wishes to write to the structure, it must acquire the write version of the read write lock, which grants exclusive access to the resource. Of course, allowing concurrent reads on data structures improves system performance. Figure 5-2...

Memory Mapping Data Structures

A memory mapping is represented by a combination of the following data structures The inode object associated with the mapped file The address_space object of the mapped file A file object for each different mapping performed on the file by different processes A vm_area_struct descriptor for each different mapping on the file A page descriptor for each page frame assigned to a memory region that maps the file Figure 15-4 illustrates how the data structures are linked. In the upper-left corner,...