Linux Kernel Architecture

Copy on Write

The kernel uses the copy-on-write (COW) technique to prevent all data of the parent process from being copied when fork is executed. This technique exploits the fact that processes normally use only a fraction of their pages in memory. When fork is called, the kernel would usually create an identical copy of each memory page of the parent process for the child process. This has two very negative effects: 1. A large amount of RAM, a scarce resource, is used. 2. The copy operation takes a long...
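The semantics that COW must preserve can be seen from userspace with a minimal test program (purely illustrative, not from the kernel sources): both processes behave as if they had private copies of the data, while the kernel defers the physical copy until the first write.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        int *value = malloc(sizeof(int));
        pid_t pid;

        *value = 42;
        pid = fork();
        if (pid == 0) {
                /* Child: the write faults, and the kernel copies the shared page. */
                *value = 1000;
                printf("child sees %d\n", *value);
                exit(0);
        }
        waitpid(pid, NULL, 0);
        /* Parent: still sees 42 because the child wrote to its own copy. */
        printf("parent sees %d\n", *value);
        return 0;
}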

Dead Code Elimination

On first reading, the term "dead code elimination" sounds quite violent. On second reading, it seems somewhat contradictory: after all, how can code that is already dead be eliminated? Only when the term is examined for a third time does it become apparent that it refers to an optimization feature in which program sections that can never execute are eliminated from code generation to reduce the size of the assembler code. How does dead code accumulate in a program? It would be normal to...
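A trivial example (not taken from the kernel sources) shows the kind of construct the compiler can drop: because the condition is known to be false at compile time, no machine code is generated for the enclosed block.

#define DEBUG 0

int compute(int x)
{
        if (DEBUG) {
                /* Never reachable when DEBUG is 0; the compiler emits
                 * no assembler code for this block. */
                x += 1;
        }
        return x * 2;
}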

Process Life Cycle

A process is not always ready to run. Occasionally, it has to wait for events from external sources beyond its control, such as keyboard input in a text editor. Until the event occurs, the process cannot run. The scheduler must know the status of every process in the system when switching between tasks; it obviously doesn't make sense to assign CPU time to processes that have nothing to do. Of equal importance are the transitions between individual process states. For example, if a...

Copying Data between Kernel and Userspace

The kernel often needs to copy data from userspace to kernel space; for example, when lengthy data structures are passed indirectly in system calls by means of pointers. There is a similar need to write data in the reverse direction, from kernel space to userspace. This cannot be done simply by passing and de-referencing pointers, for two reasons: first, userspace programs must not access kernel addresses, and second, there is no guarantee that a virtual page belonging to a pointer from userspace...
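The transfers are handled by helper routines such as copy_from_user and copy_to_user. The following is only a minimal sketch of how a system call implementation might pull a structure out of userspace; struct sample_args and sys_sample are hypothetical names used purely for illustration.

#include <linux/errno.h>
#include <linux/uaccess.h>

struct sample_args {
        int flags;
        unsigned long size;
};

long sys_sample(struct sample_args __user *uargs)
{
        struct sample_args args;

        /* copy_from_user returns the number of bytes that could *not* be
         * copied; a non-zero result therefore indicates a bad pointer. */
        if (copy_from_user(&args, uargs, sizeof(args)))
                return -EFAULT;

        /* ... work with the kernel-space copy in args ... */
        return 0;
}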

Layout of the Process Address Space

The virtual address space is populated by a number of regions. How they are distributed is architecture-specific, but all approaches have the following elements in common: the binary code of the code currently running (this code is normally referred to as text, and the area of virtual memory in which it is located as a text segment[1]), and the code of dynamic libraries used by the program. [1] This is not the same as a hardware segment, which is featured in some architectures and acts as a separate...

Professional Linux Kernel Architecture

10475 Crosspoint Boulevard, Indianapolis, IN 46256, www.wiley.com. Copyright 2008 by Wolfgang Mauerer. Published by Wiley Publishing, Inc., Indianapolis, Indiana. Manufactured in the United States of America. 10 9 8 7 6 5 4 3 2 1. Library of Congress Cataloging-in-Publication Data: Professional Linux kernel architecture / Wolfgang Mauerer. p. cm. Includes index. ISBN 978-0-470-34343-2 (pbk.) 1. Linux. 2. Computer architecture. 3. Application software. I. Title. QA76.9.A73M38 2008 005.4'32--dc22. No part of this...

High Resolution Timers in High Resolution Mode

Hrtimer Interrupt

Let us first assume that a high-resolution clock is up and running, and that the transition to high-resolution mode is completely finished. The general situation is depicted in Figure 15-13. When the clock event device responsible for high-resolution timers raises an interrupt, hrtimer_interrupt is called as the event handler. The function is responsible for selecting all timers that have expired and either moving them to the expiration list (if they may be processed in softIRQ context) or calling the...

D.2.1 Subsystem Initialization


Figure D-1 shows a code flow diagram to briefly illustrate the function's tasks and goals. The first step is to output the version message. The message text is held in the linux_banner global variable defined in init/version.c. This is followed by a further architecture-specific initialization step, which no longer deals with lower-level processor details but is written in C and which, on most systems, has the primary task of setting the framework for initialization of high-level memory...

Processor Cache and TLB Control

Caches are crucial in terms of overall system performance, which is why the kernel tries to exploit them as effectively as possible. It does this primarily by skillfully aligning kernel data in memory. A judicious mix of normal functions, inline definitions, and macros also helps extract greater performance from the processor. The compiler optimizations discussed in Appendix C also make their contribution. However, the above aspects affect the cache only indirectly. Use of the correct alignment...

Types of Timers


The timing subsystem of the kernel has grown tremendously during the development of 2.6. For the initial releases, the timer subsystem consisted solely of what are now known as low-resolution timers. Essentially, low-resolution timers are centered around a periodic tick that occurs at regular intervals. Events can be scheduled to be activated at one of these intervals. Pressure to extend this comparatively simple framework came predominantly from two sources: devices with limited power (i.e.,...

Level Triggered Interrupts


Level-triggered interrupts are a little easier to process than their edge-triggered relatives. This is also reflected in the code flow diagram of the flow handler handle_level_irq, which is depicted in Figure 14-6. Figure 14-6: Code flow diagram for handle_level_irq. Note that level-triggered interrupts must be masked when they are processed, so the first thing that needs to be done is to call mask_ack_irq. This auxiliary function masks and...

Swapping Pages in


You already know from Chapter 4 that page faults as a result of accessing a swapped-out page are handled by do_swap_page from mm/memory.c. As the associated code flow diagram in Figure 18-19 shows, it is much easier to swap a page in than to swap it out, but it still involves more than just a simple read operation. Figure 18-19: Code flow diagram for do_swap_page. The kernel must not only check whether the requested page is still or already in the...

Selection According to PID

Let us turn our attention to how the process-specific information is selected by PID. If a PID is passed to proc_pid_lookup instead of self, the course of the lookup operation is as shown in the code flow diagram in Figure 10-4. Because filenames are always processed in the form of strings but PIDs are integer numbers, the former must be converted accordingly. The kernel provides the name_to_int auxiliary function to convert strings consisting of digits into an integer. The information obtained...

C.2.10 Radix Trees

The second tree implementation provided in library form in the kernel makes use of radix trees to organize data in memory. Radix trees differ from other trees because it is not necessary to compare the entire key at every branch, but only part of the key with the stored value of the node when performing search operations. This results in slightly different worst-case and average-case behavior than in other implementations, which are described in detail in the corresponding textbooks on...

Networking Namespaces

Recall from Chapter 1 that many parts of the kernel are contained in namespaces. These allow for building multiple virtual viewpoints of the system that are separated and segregated from each other. Every instance looks like a single machine running Linux, but, in fact, many such instances can operate simultaneously on a single physical machine. During the development of 2.6.24, the kernel started to adopt namespaces also for the networking subsystem. This adds some extra complexity to the...

Dynamic Ticks

Periodic ticks have provided a notion of time to the Linux kernel for many years. The approach is simple and effective, but shows one particular deficiency on systems where power consumption does matter: the periodic tick requires that the system is in an active state at a certain frequency. Longer periods of rest are impossible because of this. Dynamic ticks remedy this problem. The periodic tick is only activated when some tasks actually do need to be performed. Otherwise, it is temporarily...

The Migration Thread

The migration thread serves two purposes: it must fulfill migration requests originating from the scheduler, and it is used to implement active balancing. This is handled in a kernel thread that executes migration_thread. The code flow diagram for the function is shown in Figure 2-27. Figure 2-27: Code flow diagram for migration_thread. migration_thread runs an infinite loop and sleeps when there is nothing to do. First of all, the function...

Registering Drivers

PCI drivers can be registered by means of pci_register_driver. The function is quite primitive. Its prime task is to fill a few remaining fields of a pci_driver instance to which relevant functions have already been assigned. This instance is passed to the generic device layer using driver_register, whose mode of operation was discussed above. More interesting than the registration process is the filling of the pci_driver structure in the individual drivers, as this involves not only defining...
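As a rough sketch (all sample_* names and the vendor/device IDs are hypothetical, and error handling is omitted), a driver fills a pci_driver instance and hands it to pci_register_driver as follows:

#include <linux/module.h>
#include <linux/pci.h>

static int sample_probe(struct pci_dev *dev, const struct pci_device_id *id)
{
        /* enable the device, map its resources, register it with the
         * appropriate subsystem ... */
        return 0;
}

static void sample_remove(struct pci_dev *dev)
{
        /* undo everything that was done in sample_probe */
}

static struct pci_device_id sample_ids[] = {
        { PCI_DEVICE(0x1234, 0x5678) },         /* hypothetical IDs */
        { 0, }
};

static struct pci_driver sample_driver = {
        .name     = "sample_pci",
        .id_table = sample_ids,
        .probe    = sample_probe,
        .remove   = sample_remove,
};

static int __init sample_init(void)
{
        return pci_register_driver(&sample_driver);
}
module_init(sample_init);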

Registering Network Devices

Each network device is registered in a two-step process: 1. alloc_netdev allocates a new instance of struct net_device, and a protocol-specific function fills the structure with typical values. For Ethernet devices, this function is ether_setup. Other protocols (not considered in detail) use XXX_setup, where possible values for XXX include fddi (fiber distributed data), tr (token ring), ltalk (localtalk), hippi (high-performance parallel interface), or fc (fiber channel). Some in-kernel...

Network Layer


The network access layer is still quite strongly influenced by the properties of the transmission medium and the device drivers of the associated adapters. The network layer (and therefore specifically IP, the Internet protocol) is almost totally divorced from the hardware properties of the network adapters. Why only almost? As you will see shortly, the layer is responsible not only for sending and receiving data, but also for forwarding and routing packets between systems not directly connected...

Elements in the Task Structure

There are several scheduling-relevant elements in the task structure of each process.

<sched.h>
int prio, static_prio, normal_prio;
unsigned int rt_priority;
const struct sched_class *sched_class;
unsigned int policy;
cpumask_t cpus_allowed;
unsigned int time_slice;

Not all processes on a system are equally important: less urgent tasks should receive less attention, while important work should be done as quickly as possible. To determine the importance of a particular task, it is equipped with a...

Structure of the Buffer Cache


A page-oriented method has not always been used in the Linux kernel to bear the main caching burden. Earlier versions included only the buffer cache to speed up file operations and to enhance system performance. This was a legacy of other Unix look-alikes with the same structure. Blocks from the underlying block devices were kept in main memory buffers to make read and write operations faster. The implementation is contained in fs/buffer.c. In contrast to pages in memory, blocks are not only...

Figure 15-12 Overview of the data structures used to implement high-resolution timers

A clock base is given by the following data structure:

struct hrtimer_clock_base {
        struct hrtimer_cpu_base *cpu_base;
        clockid_t               index;
        struct rb_root          active;
        struct rb_node          *first;
        ktime_t                 resolution;
        ktime_t                 (*get_time)(void);
        ktime_t                 softirq_time;
#ifdef CONFIG_HIGH_RES_TIMERS
        int                     (*reprogram)(struct hrtimer *t,
                                             struct hrtimer_clock_base *b,
                                             ktime_t n);
#endif
};

The meaning of the fields is as follows: hrtimer_cpu_base points to the per-CPU basis to which the clock base belongs. index distinguishes between CLOCK_MONOTONIC and CLOCK_REALTIME. active (an rb_root) is the root of a red-black tree on which all active timers are...

Setting Timers

Setting a new timer is a two-step process: 1. hrtimer_init is used to initialize a hrtimer instance.

<hrtimer.h>
void hrtimer_init(struct hrtimer *timer, clockid_t which_clock,
                  enum hrtimer_mode mode);

timer denotes the affected high-resolution timer, which_clock is the clock to bind the timer to, and mode specifies if absolute or relative time values (relative to the current time) are used. Two constants are available for selection:

<hrtimer.h>
HRTIMER_MODE_ABS    /* Time value is absolute */
HRTIMER_MODE_REL    /* Time value is...
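Putting both steps together, a typical (hypothetical) usage sequence initializes the timer, sets its callback, and arms it with hrtimer_start; the 100 ms interval and the sample_* names are illustrative only.

#include <linux/hrtimer.h>
#include <linux/ktime.h>

static struct hrtimer sample_timer;

static enum hrtimer_restart sample_timer_fn(struct hrtimer *timer)
{
        /* invoked when the timer expires */
        return HRTIMER_NORESTART;
}

static void sample_setup_timer(void)
{
        hrtimer_init(&sample_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        sample_timer.function = sample_timer_fn;

        /* expire 100 ms from now; ktime_set(sec, nsec) builds the interval */
        hrtimer_start(&sample_timer, ktime_set(0, 100 * 1000 * 1000),
                      HRTIMER_MODE_REL);
}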

Clock Event Devices

Clock event devices are defined by the following data structure:

<clockchips.h>
struct clock_event_device {
        const char              *name;
        unsigned int            features;
        unsigned long           max_delta_ns;
        unsigned long           min_delta_ns;
        unsigned long           mult;
        int                     shift;
        int                     rating;
        int                     irq;
        int                     (*set_next_event)(unsigned long evt,
                                                  struct clock_event_device *);
        void                    (*set_mode)(enum clock_event_mode mode,
                                            struct clock_event_device *);
        void                    (*event_handler)(struct clock_event_device *);
        void                    (*broadcast)(cpumask_t mask);
        struct list_head        ...

The Main Scheduler

The main scheduler function (schedule) is invoked directly at many points in the kernel to allocate the CPU to a process other than the currently active one. After returning from system calls, the kernel also checks whether the reschedule flag TIF_NEED_RESCHED of the current process is set; the flag is set by scheduler_tick, for example, as mentioned above. If it is, the kernel invokes schedule. The function then assumes that the currently active task is definitely to be replaced with another...

Implementing Handlers

Recall that the prototype of ISR functions is specified by irq_handler_t. I have not shown the actual definition of this typedef, but do so now:

typedef irqreturn_t (*irq_handler_t)(int, void *);

irq specifies the IRQ number, and dev_id is the device ID passed when the handler is registered. irqreturn_t is another typedef to a simple integer. Note that the prototype of ISRs was changed during the development of 2.6.19: before, the arguments of the handler routine also included a pointer to the...
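A minimal handler matching this prototype could look as follows; the device structure and the pending-check are hypothetical stand-ins for real hardware access. The return value tells the kernel whether the device actually raised the interrupt.

#include <linux/interrupt.h>

/* Hypothetical per-device data and hardware check, for illustration only. */
struct sample_device {
        void __iomem *regs;
};

static int sample_irq_pending(struct sample_device *dev)
{
        return 1;
}

static irqreturn_t sample_isr(int irq, void *dev_id)
{
        struct sample_device *dev = dev_id;

        if (!sample_irq_pending(dev))
                return IRQ_NONE;        /* interrupt was not raised by our device */

        /* acknowledge the interrupt in the hardware and process it ... */
        return IRQ_HANDLED;
}

Such a handler would be registered with request_irq(irq, sample_isr, IRQF_SHARED, "sample", dev).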

Time Representation

The generic time framework uses the data type ktime_t to represent time values. Regardless of the underlying architecture, the type always resolves to a 64-bit quantity. This makes the structure convenient to work with on 64-bit architectures as only simple integer operations are required for time-related operations. To reduce the effort on 32-bit machines, the definition ensures that the two 32-bit values are ordered such that they can be directly interpreted as a 64-bit quantity without...
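In 2.6-era sources the definition looks roughly like the sketch below: on 64-bit machines only the scalar tv64 is used, while on 32-bit machines the endianness-dependent ordering of sec and nsec ensures that the pair can be read as a single 64-bit value.

union ktime {
        s64     tv64;
#if BITS_PER_LONG != 64 && !defined(CONFIG_KTIME_SCALAR)
        struct {
# ifdef __BIG_ENDIAN
        s32     sec, nsec;
# else
        s32     nsec, sec;
# endif
        } tv;
#endif
};
typedef union ktime ktime_t;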

Cache Organization

The dentry structures not only make working with filesystem structures easier, but are also crucial to system performance. They accelerate work with the VFS by keeping communication with the underlying filesystem implementations to a minimum. Each request forwarded to the underlying implementations by the VFS leads to the creation of a new dentry object to hold the request results. These objects are held in a cache so that they can be accessed faster the next time they are needed and operations...

Standard Functions

Several auxiliary functions that ease handling of dentry objects are provided by the kernel. Their implementation is mostly an exercise in list management and data structure handling, so I won't bother to discuss their code. It is, however, important to show their prototypes and describe their effect, since we will come across them frequently when discussing the implementation of VFS operations. The following auxiliary functions require a pointer to struct dentry as a parameter. Each performs one simple...

Isolating LRU Pages and Lumpy Reclaim

Both the active and inactive pages of a zone are kept on lists that need to be protected by a spinlock, to be precise by zone->lru_lock. To simplify matters, I have ignored this lock until now because it was not essential for our purposes. Now we need to consider it, though. When operations with the LRU lists are performed, they need to be locked, and one problem arises: the page reclaim code belongs to the hottest and most important paths in the kernel for many workloads, and lock contention...

The pdflush Mechanism

The pdflush mechanism is implemented in a single file, mm/pdflush.c. This contrasts with the fragmented implementation of the synchronization mechanisms in earlier versions. pdflush is started with the usual kernel thread mechanisms:

mm/pdflush.c
static void start_one_pdflush_thread(void)
{
        kthread_run(pdflush, NULL, "pdflush");
}

start_one_pdflush_thread starts a single pdflush thread; however, the kernel generally uses several threads at the same time, as you will see below. It should be noted that a...

Freeing Objects

When an allocated object is no longer required, it must be returned to the slab allocator using kmem_cache_free. Figure 3-52 shows the code flow diagram of this function. kmem_cache_free immediately invokes __cache_free and forwards its arguments unchanged. (Again the reason is to prevent code duplication in the implementation of kfree, as discussed in Section 3.6.5.) As with allocation, there are two alternative courses of action depending on the state of the per-CPU cache. If the number of...
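For reference, the complete life cycle of a slab object as seen by a cache user is sketched below; the cache name and the item structure are hypothetical.

#include <linux/errno.h>
#include <linux/list.h>
#include <linux/slab.h>

struct sample_item {
        int id;
        struct list_head link;
};

static struct kmem_cache *sample_cache;

static int sample_use_cache(void)
{
        struct sample_item *item;

        sample_cache = kmem_cache_create("sample_item",
                                         sizeof(struct sample_item),
                                         0, 0, NULL);
        if (!sample_cache)
                return -ENOMEM;

        item = kmem_cache_alloc(sample_cache, GFP_KERNEL);
        if (item)
                kmem_cache_free(sample_cache, item);    /* hand the object back */

        kmem_cache_destroy(sample_cache);
        return 0;
}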

C.1.1 From Source Code to Machine Program

The work of compilers can be divided into several phases, as the following overview demonstrates. Preprocessing: All preprocessor actions are performed in this phase. Depending on the compiler version, this phase is supported by an external utility (cpp) or by special library functions; both are initiated automatically by the compiler. On completion of preprocessing, there is only one (large) input file, generated from the source file and all header files included by means of the #include directive....

Loop Optimization

Code in loops may be executed repeatedly and therefore merits thorough optimization because speed gains are particularly noticeable. If a loop is iterated 1,000 times and the runtime of the loop body is shortened by one thousandth of a second as a result of optimization, total program runtime is one second shorter. One second may not seem to be much. However, the benefits are best assessed by considering very time-critical kernel actions such as task switching or long-running programs such as...

Allocation of Physical Memory

When it allocates RAM, the kernel must keep track of which pages have already been allocated and which are still free in order to prevent two processes from using the same areas in RAM. Because memory allocation and release are very frequent tasks, the kernel must also ensure that they are completed as quickly as possible. The kernel can allocate only whole page frames. Dividing memory into smaller portions is delegated to the standard library in userspace. This library splits the page frames...

Representation of Regions

Each region is represented by an instance of vm_area_struct, which is defined in simplified form as follows:

struct vm_area_struct {
        struct mm_struct *vm_mm;        /* The address space we belong to. */
        unsigned long vm_start;         /* Our start address within vm_mm. */
        unsigned long vm_end;           /* The first byte after our end address */

        /* linked list of VM areas per task, sorted by address */
        struct vm_area_struct *vm_next;

        pgprot_t vm_page_prot;          /* Access permissions of this VMA. */
        unsigned long vm_flags;         /* Flags, listed below. */

        /* For areas with an address space and backing...

Handling of Page Faults

The association between virtual and physical memory is not established until the data of an area are actually needed. If a process accesses a part of the virtual address space not yet associated with a page in memory, the processor automatically raises a page fault that must be handled by the kernel. This is one of the most important and complex aspects of memory management, simply because a myriad of details must be taken into account. For example, the kernel must ascertain the following: Was the...

Kernel Page Faults

When kernel address space is accessed, page faults can be triggered by the various conditions described below.

- A programming error in the kernel has caused an incorrect address to be accessed; this is a genuine bug. Of course, this should never happen in stable versions, but it does occur occasionally in developer versions.
- The kernel accesses an invalid address passed as a system call parameter from userspace.
- The page fault was triggered by access to an area allocated using vmalloc.

The first two...

Page Reclaim and Swapping in the Linux Kernel

This section summarizes the design decisions of the Linux page reclaim subsystem before considering the technical aspects of implementation and examining how requirements are met. Swapping out pages and all related actions do not appear to be very complicated when the situation is viewed from the high ground without taking development details into consideration. Unfortunately, the very opposite is the case. Hardly any part of the kernel entails as many technical difficulties as the virtual...

Handling Page Faults

All architectures on which Linux runs support the concept of page faults that are triggered when a page in virtual address space is accessed but is not present in physical memory. The page fault instructs the kernel to read the missing data from the swap area or from another backing store, possibly by first deleting other pages to make space for the new data. Page fault handling is in two parts. First, strongly processor-dependent (assembler) code must be used to intercept the page fault and...

E.2.5 Relocation Entries

Relocation is the process by which undefined symbols in ELF files are associated with valid values. In the standard example (test.o), this means that undefined references to printf and add must be replaced with the addresses at which the appropriate machine code is located in the virtual address space of the process. Replacement must be performed at all points in the object file where one of the symbols is used. The kernel is not involved in symbol replacement in userspace programs because all...

This area may be larger than actually needed because it is not clear how big packets will be when they are synthesized

data and tail point to the start and end of the protocol data area. Figure 12-6: Link between socket buffer and network packet data. mac_header points to the start of the MAC header, while network_header and transport_header point to the header data of the network and transport layer, respectively. On systems with a 32-bit word length, the data type sk_buff_data_t that is used for the various data components is a simple pointer...

Management Data of Socket Buffers

The socket buffer structure contains not only the above pointers, but also other elements that are used to handle the associated data and to manage the socket buffer itself. The less common elements are discussed in this chapter when they are needed. The most important elements are listed below. tstamp stores the time the packet arrived. dev specifies the network device on which the packet is processed. dev may change in the course of processing the packet, for instance, when it will leave the...

Delivery to the Transport Layer

After packet defragmentation, the netfilter hook NF_IP_LOCAL_IN is called to resume processing in ip_local_deliver_finish. There the packet is passed to a transport layer function that must first be determined by reference to the protocol identifier. All protocols based on the IP layer have an instance of the structure net_protocol, defined as sketched below. handler is the protocol routine to which the packets are passed (in the form of socket buffers) for...
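As a rough sketch of the 2.6-era layout (fields differ slightly between kernel versions), net_protocol essentially bundles the receive handler with an error callback:

struct net_protocol {
        int  (*handler)(struct sk_buff *skb);
        void (*err_handler)(struct sk_buff *skb, u32 info);
        /* further members, e.g. for GSO support, omitted here */
        unsigned int no_policy:1;
};

The TCP instance, for example, sets .handler to tcp_v4_rcv and is made known to the IP layer with inet_add_protocol(&tcp_protocol, IPPROTO_TCP).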

Page Frames

Page frames represent the smallest unit of system memory, and an instance of struct page is created for each page in RAM. Kernel programmers take care to keep this structure as small as possible because the memory of systems even with a moderate RAM configuration is broken down into a very large number of pages. For instance, an IA-32 system working with a standard page size of 4 KiB has around 100,000 pages given a main memory size of 384 MiB. Although this memory size is certainly not...

Data Structures

The kernel uses lean data structures to minimize management overhead for reverse mappings. The page structure (discussed in Section 3.2.2) contains a single element to implement reverse mapping:

atomic_t _mapcount;     /* Count of ptes mapped in mms,
                         * & limit reverse map searches. */

_mapcount indicates at how many points the page is shared. The original value of the counter is -1. It is assigned the value 0 when the page is inserted in the reverse mapping data structures and is incremented by 1...

Creating a Reverse Mapping

When a reverse mapping is created, it is necessary to distinguish between two alternatives anonymous pages and pages with file-based mappings. This is understandable because the data structures used to manage both alternatives also differ. The information below only covers working with page instances to be inserted into the reverse mapping scheme. Other parts of the kernel are responsible for adding the relevant vm_area_structs to the data structures discussed above (priority tree and anon...

Using Reverse Mapping

The real benefits of reverse mapping do not become clear until Chapter 18, which examines the implementation of swapping. There we will see that the kernel defines the try_to_unmap function, which is invoked to delete a specific physical page from the page tables of all processes by which the page is used. It is apparent that this is only possible with the data structures just described. Nevertheless, the implementation is influenced by many details of the swap layer, and this is why I won't go...

Demand Allocation/Paging


Allocation of pages on demand is delegated to do_linear_fault, which is defined in mm/memory.c. After some parameter conversion, the work is delegated to __do_fault, and the code flow diagram of this function guides the following steps. First of all, the kernel has to make sure that the required data are read into the faulting page. How this is handled depends on the file that is mapped into the faulting address space, and therefore a file-specific method is invoked to obtain the data. Usually, it is stored in vm->vm_ops->...

The Swap Cache

Now that I have described the layout of the swap subsystem by reference to its data structures, let us turn our focus in the following sections to the techniques employed by the kernel to write pages from memory into a swap area and to read pages from the swap area back into memory. The kernel makes use of a further cache, known as the swap cache, that acts as a liaison between the operation that selects the pages to be swapped out and the mechanism that actually performs swapping. At first sight,...

Determining Page Activity

The kernel must track not only whether a page is actually used by one or more processes, but also how often it is accessed in order to assess its importance. As only very few architectures support a direct access counter for memory pages, the kernel must resort to other means and has therefore introduced two page flags named referenced and active. The corresponding bit values are PG_referenced and PG_active, and the usual set of macros as discussed in Section 3.2.2 is available to set or...

Using System V Semaphores

The System V interface for using semaphores is anything but intuitive because the concept of a semaphore has been expanded well beyond its actual definition. Semaphores are no longer treated as simple variables to support atomic execution of predefined operations. Instead, a System V semaphore now refers to a whole set of semaphores, which allows not just one but several operations to be performed at the same time (although they appear to be atomic to users). It is, of course, possible to...
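From the userspace perspective, the interface is illustrated by the following sketch (not part of the kernel sources): it creates a set containing a single semaphore, initializes it to 1, and performs a down/up pair around a critical section.

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* The caller must define union semun itself; see the semctl(2) manual page. */
union semun {
        int val;
        struct semid_ds *buf;
        unsigned short *array;
};

int main(void)
{
        struct sembuf op;
        union semun arg;
        int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

        if (semid < 0) {
                perror("semget");
                return 1;
        }

        /* Initialize semaphore 0 of the one-element set to 1. */
        arg.val = 1;
        semctl(semid, 0, SETVAL, arg);

        /* Down: decrement by 1; blocks if the value would become negative. */
        op.sem_num = 0;
        op.sem_op = -1;
        op.sem_flg = 0;
        semop(semid, &op, 1);

        /* ... critical section ... */

        /* Up: increment by 1 again. */
        op.sem_op = 1;
        semop(semid, &op, 1);

        /* Remove the semaphore set. */
        semctl(semid, 0, IPC_RMID);
        return 0;
}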

Exiting Processes

Processes must terminate with the exit system call. This gives the kernel the opportunity to free the resources used by the process back to the system. The entry point for this call is the sys_exit function that requires an error code as its parameter in order to exit the process. Its definition is architecture-independent and is held in kernel/exit.c. Its implementation is not particularly interesting because it immediately delegates its work to do_exit. Suffice it to say that the...

Figure 3-45 Fine structure of a slab cache

Besides management data (such as the number of used and free objects or flag registers), the cache structure includes two elements of special significance: a pointer to an array in which the last freed objects can be kept for each specific CPU, and three list heads per memory node under which slabs can be listed. The first list contains full slabs, the second partially free slabs, and the third free slabs. The cache structure points to an array that contains as many entries as there are CPUs in the...

Device Control via Buses

Not all devices are addressed directly by I/O statements, but via a bus system. How this is done varies according to the bus and devices used. Rather than going into specific details, I describe the basic differences between the various approaches here. Not all device classes can be attached to all bus systems. For example, it is possible to connect hard disks and CD writers, but not graphics cards, to a SCSI interface. However, the latter can be housed in PCI slots. In contrast, hard disks must...

Address Spaces and Privilege Levels

Before we start to discuss virtual address spaces, there are some notational conventions to fix. Throughout this book I use the abbreviations KiB, MiB, and GiB as units of size. The conventional units KB, MB, and GB are not really suitable in information technology because they represent decimal powers (10^3, 10^6, and 10^9) although the binary system is the ubiquitous basis in computing. Accordingly KiB stands for 2^10, MiB for 2^20, and GiB for 2^30 bytes. Because memory areas are addressed by...

High Resolution Timers

After having discussed the generic time framework, we are now ready to take the next step and dive into the implementation of high-resolution timers. Two fundamental differences distinguish these timers from low-resolution timers 1. High-resolution (high-res) timers are time-ordered on a red-black tree. 2. They are independent of periodic ticks. They do not use a time specification based on jiffies, but employ nanosecond time stamps. Merging the high-resolution timer mechanism into the kernel...

Creating the Layout

The address space of a task is laid out when an ELF binary is loaded with load_elf_binary; recall that the function is used by the exec system call. Loading an ELF file is cluttered with numerous technical details that are not interesting for our purposes, so the code flow diagram in Figure 4-3 concentrates on the steps required to set up the virtual memory region. Figure 4-3: Code flow diagram for load_elf_binary. Address space randomization is enabled if the global variable randomize_va_space...

Module Representation

Before looking more closely at the implementation of the module-related functions, it is necessary to explain how modules (and their properties) are represented in the kernel. As usual, a set of data structures is defined to do this. Not surprisingly, the name of the most important structure is module; an instance of this structure is allocated for each module resident in the kernel. It is defined as follows:

struct module {
        /* Member of list of modules */
        struct list_head list;
        /* Unique handle for this module */
        ...

Initialization and Cleanup Functions

The module initialization and clean-up functions are stored in the module instance in the .gnu.linkonce.module section. The instance is located in the abovementioned autogenerated extra file for each module. It is defined as follows:

        .name = KBUILD_MODNAME,
        .init = init_module,
#ifdef CONFIG_MODULE_UNLOAD
        ...

KBUILD_MODNAME contains the name of the module and is only defined if the code is compiled as a module. If the code is to be permanently bound into the kernel, no __this_module object is generated...

Driver and Subsystem Makefiles

The Makefiles in the driver and subsystem directories are used to compile the correct files in accordance with the configuration in .config and to direct the compilation flow to the required subdirectories. The Kbuild framework makes the creation of such Makefiles relatively easy. Only a single line of the form shown below is needed to generate an object file for permanent compilation into the kernel (regardless of the configuration). By reference to the filename, Kbuild automatically detects that the source file...
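In the usual Kbuild idiom, such a line looks like the sketch below; foo.c and CONFIG_FOO are hypothetical names, and the configurable variant is shown only for comparison.

obj-y += foo.o                  # foo.o is always built into the kernel
obj-$(CONFIG_FOO) += foo.o      # built in, built as a module, or skipped,
                                # depending on the value of CONFIG_FOO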

Architecture Specific Setup

The initialization of memory management on IA-32 systems is in some aspects a very subtle undertaking that must overcome a few historical obstacles associated with the processor architecture. These include, for example, switching the processor from normal mode to protected mode to grant the CPU access to the 32-bit world, a legacy from the days when compatibility with 16-bit 8086 processors was important. Similarly, paging is not enabled by default and must be activated manually, which, for...

Correction of Userspace Page Faults

Once the architecture-specific analysis of the page fault has been concluded and it has been established that the fault was triggered at a permitted address, the kernel must decide on the appropriate method to read the required data into RAM memory. This task is delegated to handle_mm_fault, which is no longer dependent on the underlying architecture but is implemented system-independently within the memory management framework. The function ensures that page table entries for all directory...

Automatic Loading

Generally, module loading is initiated from userspace, either by the user or by means of automated scripts. To achieve greater flexibility in the handling of modules and to improve transparency, the kernel itself is also able to request modules. Where is the catch? It is not difficult for the kernel to insert the binary code once it has access to it. However, it cannot do this without further help from userspace. The binary file must be localized in the filesystem, and dependencies must be...

Shared Subtrees

The mechanisms I have discussed so far covered the standard mount cases that are available on any Unix system. However, Linux supports some more advanced possibilities that allow for gaining more power from namespaces. Since they were only introduced during the development of 2.6 (2.6.16, to be precise), their use is still somewhat limited, so I will briefly review their basic principle before I discuss the implementation. For specific details on potential applications and a precise description...

Mounting and Unmounting

Recall from Chapter 8 that the kernel requires a further structure to hold mount and unmount information when working with filesystems; the information is not provided in any of the structures discussed above. The file_system_type structure is used for this purpose and is defined for the Second Extended File System as static struct file_system_type ext2_fs_type (the initializer is sketched below). Chapter 8 explained that the mount system call invokes the function in get_sb to read the superblock of a filesystem. The Second...
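In the ext2 sources of 2.6-era kernels, the initializer looks essentially like this sketch, which connects the filesystem name with the get_sb and kill_sb operations introduced in Chapter 8:

static struct file_system_type ext2_fs_type = {
        .owner    = THIS_MODULE,
        .name     = "ext2",
        .get_sb   = ext2_get_sb,
        .kill_sb  = kill_block_super,
        .fs_flags = FS_REQUIRES_DEV,
};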

Lock Contention and Fine Grained Locking

After having discussed the numerous locking primitives provided by the kernel, let us briefly address some of the problems related to locking and kernel scalability. While multiprocessor systems were nearly completely unknown to the average user only a decade ago, they are present on nearly every desktop today. Scalability of Linux on systems with more than a single CPU has therefore become a very important goal. This needs especially to be taken into account when locking rules are designed for...

Implementing System Calls

All operations on semaphores are performed using a single system call named ipc. This call is used not only for semaphores but also to manipulate message queues and shared memory. Its first parameter delegates the actual multiplex work to other functions. These functions are as follows for semaphores: semctl performs a semaphore operation and is implemented in sys_semctl; semget reads the semaphore identifier, and sys_semget is the associated implementation; semop and semtimedop are responsible for...

F.2.3 Online Resources

There are numerous web sites devoted to Linux kernel development which provide useful information. Because of the web's rapidly changing structure, it does not really make sense to present a comprehensive survey here, since most links will tend to become outdated rather quickly. However, relying only on your favorite search engine to grab useful links about kernel development is also not the easiest path to success, especially when it comes to judging relevance and quality of the results....

Working with LXR

LXR provides the following functions for viewing kernel components: directories of the source tree can be traversed and files can be selected by name using source navigation; kernel source files can be displayed in hypertext representation using the file view; positions at which symbols are defined or used can be found with an identifier search. Figure B-5 shows what the output of the vfs_follow_link function looks like.

Software Interrupts

Software interrupts enable the kernel to defer tasks. Because they function in a similar way to the interrupts described above but are implemented fully in software, they are logically enough known as software interrupts or softIRQs. The kernel is informed of an anomalous condition by means of a software interrupt, and the situation is resolved at some later time by special handler routines. As already noted, the kernel services all pending software interrupts at the end of do_IRQ so that...

Arrangement of the Kernel in Memory

Before discussing the individual memory initialization operations, we need to examine the situation in RAM after the boot loader has copied the kernel into memory and the assembler part of the initialization routines has completed. I concentrate on the default case in which the kernel is loaded to a fixed position in physical RAM that is determined at compile time. It is also possible to configure the initial position of the kernel binary in physical RAM if the crash dump mechanism is enabled....

General System Information

Not only the subdirectories of proc contain information, but also the directory itself. General information relating to no specific kernel subsystem (or shared by several subsystems) resides in files in proc. Some of these files were mentioned in earlier chapters. For example, iomem and ioports provide information on memory addresses and ports used to communicate with devices, as discussed in Chapter 6. Both files contain lists in text form: wolfgang@meitner> c 00000000-0009dbff...

Group Descriptor

As Figure 9-2 shows, each block group has a collection of group descriptors arranged directly after the superblock. The information they hold reflects the contents of each block group of the filesystem and therefore relates not only to the data blocks associated with the local block group but also to the data and inode blocks of other block groups. The data structure used to define a single group descriptor is much shorter than the superblock structure, as the following section of kernel source...

Implementation of Ioctls

Ioctls permit the use of special device-specific functions that cannot be accessed by normal read and write operations. This form of support is implemented by means of the ioctl system call that can be used with regular files (detailed descriptions of how it is used are provided in the many system programming manuals). As expected, the system call is implemented in sys_ioctl, but the main work is done in vfs_ioctl. Figure 6-19 shows the associated code flow diagram. Figure 6-19 Code flow...

Figure 2-25 Time flow for initiation of load balancing on SMP systems

Run queues are CPU-specific, so cpu denotes the processor to which the run queue belongs. The kernel provides one migration thread per run queue to which migration requests can be posted; they are kept on the list migration_queue. Such requests usually originate from the scheduler itself, but can also become necessary when a process is restricted to a certain set of CPUs and must no longer run on the one on which it is currently executing. The kernel tries to balance run queues periodically, but if...

These are not the same regions as in the NUMA concept, but are areas occupied by system ROM, for example, or by ACPI.

When the system is booted, the regions found are displayed by the kernel function print_memory_map.

wolfgang@meitner> dmesg
 BIOS-e820: 0000000000000000 - 000000000009e800 ...
 BIOS-e820: 000000000009e800 - 00000000000a0000 ...
 BIOS-e820: 00000000000c0000 - 00000000000cc000 ...
 BIOS-e820: 00000000000d8000 - 0000000000100000 ...
 BIOS-e820: 0000000000100000 - ...
 BIOS-e820: 0000000017cf0000 - ...
 BIOS-e820: 0000000017cff000 - ...
 BIOS-e820: 0000000017d00000 - ...
 BIOS-e820: 0000000017e80000 - ...
 BIOS-e820: 00000000ff800000 - ...
 BIOS-e820: 00000000fff00000 - ...

Initialization of Paging

paging_init is responsible for setting up the page tables that can be used only by the kernel and are not accessible in userspace. This has far-reaching effects on the way in which access to memory is regulated between normal applications and the kernel itself. It is therefore important to explain the purpose of the function before looking closely at its implementation. As noted in Chapter 1, on IA-32 systems the kernel typically divides the total available virtual address space of 4 GiB in a...

The figure shows the structure of the page table entries used to manage the fourth gigabyte of virtual address space.

The first section of the address space is used to map all physical pages of the system into the virtual address space of the kernel. Because this address space begins at an offset of 0xC0000000 (the frequently mentioned 3 GiB), each virtual address x corresponds to the physical address x - 0xC0000000, and the mapping is therefore a simple linear shift. As the figure shows, the direct mapping area extends from 0xC0000000 to the high_memory address, whose exact value I discuss shortly. As intimated in Chapter 1,...

Alternative Division

Dividing virtual address space in a 3:1 ratio is not the only option. Relatively little effort is needed to select a different division because all bounds are defined by constants in the sources. For some purposes it may be better to split the address space symmetrically: 2 GiB for user address space and 2 GiB for kernel address space. __PAGE_OFFSET must then be set to 0x80000000 instead of the typical default of 0xC0000000. This division is useful when the system performs tasks that require a...

Figure 3-18 Possible virtual versus implemented physical address space on AMD64 machines

The first 47 bits of a virtual address, that is, bits [0, 46], can be arbitrarily set. Bits in the range [47, 63], however, always need to have the same value: either all are 0, or all are 1. Such addresses are called canonical. They divide the total address space into three parts: a lower half, a higher half, and a forbidden region in between. Together both portions form an address space that spans exactly 2^48 bytes. The address space for the lower half is [0x0, 0x0000 7FFF FFFF FFFF], while the subset for...

Allocator API

As far as the interface to the buddy system is concerned, it makes no difference whether a NUMA or a UMA architecture is used, as the call syntax is the same for both. Common to all functions is the fact that pages can only be allocated in integer powers of 2. For this reason, the desired memory size is not specified as a parameter, as it would be in the malloc function of the C standard library or in the bootmem allocator. Instead, the order of the allocation must be specified, and this causes the...
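A brief usage sketch (the function name is hypothetical): a request for 2^order contiguous page frames is made with alloc_pages, and the memory is handed back with __free_pages; order 2 yields four contiguous pages.

#include <linux/gfp.h>
#include <linux/mm.h>

static void sample_buddy_alloc(void)
{
        /* Request 2^2 = 4 physically contiguous page frames. */
        struct page *pages = alloc_pages(GFP_KERNEL, 2);

        if (!pages)
                return;

        /* ... use the memory, e.g. via page_address(pages) ... */

        __free_pages(pages, 2);         /* the order must match the allocation */
}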

Opening Device Files

chrdev_open from fs/char_dev.c is the generic function for opening character devices. Figure 6-6 shows the associated code flow diagram. Figure 6-6: Code flow diagram for chrdev_open. Assume that the inode that represents the device file has not been opened before. Given the device number, kobject_lookup queries the character device database introduced in Section 6.2.5 and returns the kobject instance associated...

System Call Implementation

Figure 7-4 shows the code flow diagram for sys_init_module. Figure 7-4 Code flow diagram for sys_init_module. The binary data are transferred into the kernel address space using load_module. All required relocations are performed, and all references are resolved. The arguments are converted into a form that is easy to analyze (a table of kernel_param instances), and an instance of the module data structure is created with all the necessary information on the module. Once the module instance...

A.4 Memory Pages

Memory pages are 4 KiB in size on many, but not all, architectures. More modern processors also support sizes up to several MiB. The following macros must be defined in the architecture-specific file asm-arch/page.h to indicate the page size used: PAGE_SHIFT specifies the binary logarithm of the page size. (The kernel implicitly assumes that the page size can be represented as a power of 2, as is true on all architectures supported.) PAGE_SIZE specifies the size of a memory page in bytes....
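On IA-32, for instance, the definitions boil down to the following lines (simplified sketch of the architecture header):

#define PAGE_SHIFT      12
#define PAGE_SIZE       (1UL << PAGE_SHIFT)
#define PAGE_MASK       (~(PAGE_SIZE - 1))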

Dispatching and Parameter Passing

System calls are uniquely identified by a number assigned by the kernel. This is done for practical reasons that become clear when system calls are activated. All calls are handled by a single central piece of code that uses the number to dispatch a specific function by reference to a static table. The parameters passed are also handled by the central code so that parameter passing is implemented independently of the actual system call. Switching from user to kernel mode and therefore to...

Available System Calls

Before going into the technical details of system call implementation by the kernel (and by the userspace library), it is useful to take a brief look at the actual functions made available by the kernel in the form of system calls. Each system call is identified by means of a symbolic constant whose platform-dependent definition is specified in <asm-arch/unistd.h>. Since not all system calls are supported on all architectures (some combinations are meaningless), the number of available...

Idle and init Thread

The last two actions of start_kernel are as follows: 1. rest_init starts a new thread that, after performing a few more initialization operations as described in the next step, ultimately calls the userspace initialization program /sbin/init. 2. The first, and formerly only, kernel thread becomes the idle thread that is called when the system has nothing else to do. rest_init is essentially implemented in just a few lines of code:

init/main.c
kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);
...

Representation of Block Devices

Block devices have a set of characteristic properties managed by the kernel. The kernel uses what is known as request queue management to support communication with devices of this kind as effectively as possible. This facility enables requests to read and write data blocks to be buffered and rearranged. The results of requests are likewise kept in caches so that the data can be read and reread in a very efficient way. This is useful when a process repeatedly accesses the same part of a file or...

Request Queues

Read and write requests to block devices are placed on a queue known as a request queue. The gendisk structure includes a pointer to the device-specific queue, which is represented by the following data type:

struct request_queue
{
        /*
         * Together with queue_head for cacheline sharing
         */
        struct list_head        queue_head;
        ...

        struct request_list     rq;     /* queue request freelists */

        prep_rq_fn              *prep_rq_fn;
        unplug_fn               *unplug_fn;
        merge_bvec_fn           *merge_bvec_fn;
        prepare_flush_fn        *prepare_flush_fn;
        softirq_done_fn         *softirq_done_fn;
        ...

        struct timer_list       unplug_timer;
        int                     unplug_thresh;
        unsigned long           unplug_delay;
        struct work_struct      unplug_work;
        ...

After...

Request Structure

The kernel provides its own data structure to describe a request to a block device.

        sector_t sector;                /* next sector to submit */
        sector_t hard_sector;           /* next sector to complete */
        unsigned long nr_sectors;       /* no. of sectors left to submit */
        unsigned long hard_nr_sectors;  /* no. of sectors left to complete */
        /* no. of sectors left to submit in the current segment */
        unsigned int current_nr_sectors;
        /* no. of sectors left to complete in the current segment */
        unsigned int hard_cur_sectors;

        struct bio *bio;
        ...

Creating Requests

submit_bio is the key function that creates a new request based on a passed bio instance and finally places it on the request queue of the driver using make_request_fn. Figure 6-16 shows the associated code flow diagram. Let's consider a simplified version first, but come back later to address some problems that can arise in certain cases, and how the kernel solves them with a little trick. The function is invoked at various places in the kernel to initiate physical data transfers. submit_bio...

Figure 6-17 BIO lists evolving under recursive calls to generic_make_request

current->bio_tail is initialized to NULL, so we can skip the first conditional block. One bio instance is submitted, and the data structures look as in Figure 6-17(a): bio points to the submitted BIO, while current->bio_list is NULL and current->bio_tail points to the address of current->bio_list. Note that the following pictures in the figure will always consider the local bio variable of the first function call to generic_make_request, not any variables in later stack frames. Now suppose...

Executing Requests

The device-specific request_fn function is invoked when the requests in the request queue are ready to be processed. This is a very hardware-specific task, so the kernel does not provide a default implementation. Instead, it always uses the method passed when the queue was registered with blk_dev_init. Nevertheless, the structure of the request function described below is similar in most device drivers. I assume a situation in which there are several requests in the request queue. elv_insert...

Data Structures in Memory

To dispense with the need to constantly read administration structures from slow hard disks, Linux saves the most important information that these structures contain in special data structures that reside permanently in RAM. Access is considerably faster, and less interaction with the hard disk is required. Then why aren't all filesystem management data held in RAM (with writeback of changes to disk at regular intervals)? Although theoretically this would be possible, it does not work in...

Echo Client

The source code for the echo client is as follows:

/* Host and port number of the echo server */
char *echo_host = "192.168.1.20";
int echo_port = 7;

int sockfd;
struct sockaddr_in *server = (struct sockaddr_in*)malloc(sizeof(struct sockaddr_in));

/* Set address of server to be connected */
server->sin_family = AF_INET;
server->sin_port = htons(echo_port);            /* Note network byte order */
server->sin_addr.s_addr = inet_addr(echo_host);

/* Create a socket (Internet address family, stream socket and default protocol) */
sockfd = socket(AF_INET, SOCK_STREAM, 0);
...

Mounting the Filesystem

Once all kernel-internal data that describe the structure and contents of the proc filesystem have been initialized, the next step is to mount the filesystem in the directory tree. In the view of the system administrator in userspace, mounting proc is almost the same as mounting a non-virtual filesystem. The only difference is that an arbitrary keyword (usually proc or none) is specified as the source instead of a device file:

root@meitner # mount -t proc proc /proc

The VFS-internal processes...

Creating and Registering Entries

New entries are added to the proc filesystem in two steps. First, a new instance of proc_dir_entry is created together with all information needed to describe the entry. This instance is then registered in the data structures of proc so that it is visible to the outside. Because the two steps are never carried out independently of each other, the kernel makes auxiliary functions available to combine both actions so that new entries can be generated quickly and easily. The most frequently used...
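As a rough sketch, a kernel component might add a simple read-only entry with the create_proc_entry helper of 2.6-era kernels as follows; the entry name and the sample_* functions are hypothetical.

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/proc_fs.h>
#include <linux/stat.h>

static int sample_read_proc(char *page, char **start, off_t off,
                            int count, int *eof, void *data)
{
        int len = sprintf(page, "hello from proc\n");

        *eof = 1;
        return len;
}

static int __init sample_proc_init(void)
{
        struct proc_dir_entry *entry = create_proc_entry("sample", S_IRUGO, NULL);

        if (!entry)
                return -ENOMEM;

        entry->read_proc = sample_read_proc;    /* invoked when the file is read */
        return 0;
}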

The Slab Allocator

Every C programmer is familiar with malloc and all its related functions in the standard library they are frequently invoked by most programs to reserve a few bytes of memory. The kernel must also frequently allocate memory but cannot resort to the standard library functions. The buddy system resources described above support the allocation of memory in pages, but this unit is much too big. If space is needed for a string with 10 characters, reserving a full page with 4 KiB or more is not only...
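For such small requests the kernel offers kmalloc and kfree on top of the slab allocator; a minimal sketch of the usage pattern for the 10-character string mentioned above (the helper name is hypothetical):

#include <linux/slab.h>
#include <linux/string.h>

static char *sample_copy_string(const char *src)
{
        /* Reserve just enough bytes from the slab allocator instead of a full page. */
        char *buf = kmalloc(strlen(src) + 1, GFP_KERNEL);

        if (buf)
                strcpy(buf, src);
        return buf;     /* the caller releases the memory with kfree(buf) */
}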