Page Reclaim and Swapping in the Linux Kernel

This section summarizes the design decisions of the Linux page reclaim subsystem before considering the technical aspects of implementation and examining how requirements are met. Swapping out pages and all related actions do not appear to be very complicated when the situation is viewed from the high ground without taking development details into consideration. Unfortunately, the very opposite is the case. Hardly any part of the kernel entails as many technical difficulties as the virtual...

Data Structures

As an exemplary citizen, Ext3 starts with some good advice on coding efficiency and employs the generic implementation presented above. A number of handler functions are provided, and the following map makes it possible to access handler functions by their identification number and not by their string identifier this simplifies many operations and allows a more efficient use of disk space because rather than the prefix string, only a simple number needs to be stored xattr_handler...

Generating a Checksum

The genksym tool that comes with the kernel sources and is automatically created at compilation time is used to generate a function checksum. To demonstrate how it works, let's use the following header file, which contains an exported function definition int accelerate_task(long speedup, struct task_struct *task) EXPORT_SYMBOL(accelerate_task) The function definition contains a compound structure as a parameter, and this makes the work of genksyms more difficult. When the definition of the...

Removing the Selected Pages

Two things remain to be done once the kernel has found a suitable zone with sufficient free pages for the allocation. First, it must be checked whether the pages are contiguous (up to now it only knows how many free pages there are). And second, the pages must be removed from the free_lists in the buddy fashion, and this may make it necessary to break up and rearrange memory regions. The kernel delegates this work to buffered_rmqueue as discussed in the previous section. Figure 3-32 shows the...

Executing Requests

The device-specific request_fn function is invoked when the requests in the request queue are ready to be processed. This is a very hardware-specific task so the kernel does not provide a default implementation. Instead, it always uses the method passed when the queue was registered with blk_dev_init. Nevertheless, the structure of the request function described below is similar in most device drivers. I assume a situation in which there are several requests in the request queue. 14elv_insert...

Creating and Manipulating Entries

Table 3-4 lists all functions for creating new page table entries. Table 3-4 Functions for Creating New Page Table Entries Table 3-4 Functions for Creating New Page Table Entries Creates a pte entry a page instance and the desired page access permissions must be passed as parameters. Yields the address of the page instance belonging to the page described by the page table entry. pgd_alloc pud_alloc pmd_alloc pte_alloc Reserve and initialize memory to hold a complete page table (not just a...

Switching to High Resolution Timers

High-resolution timers are not enabled from the very beginning, but can only be activated when suitable high-resolution clock sources have been initialized and added to the generic clock framework. Low-resolution ticks, however, are provided (nearly) from the very beginning. In the following, I discuss how the kernel switches from low- to high-resolution mode. The high-resolution queue is processed by hrtimer_run_queue when low-resolution timers are active. Before the queues are run, the...

Broadcast Mode

On some architectures, clock event devices will go to sleep when certain power-saving modes are active. Thankfully, systems do not have only a single clock event device, so another device that still works can replace the stopped devices. The global variable tick_broadcast_device defined in kernel tick tick-broadcast.c contains the tick_device instance for the broadcast device. An overview of broadcast mode is given in Figure 15-21.

Address Spaces

Memory mappings of files can be regarded as mappings between two different address spaces to simplify the work of (system) programmers. One address space is the virtual memory address space of the user process, the other is the address space spanned by the filesystem. When the kernel creates a mapping, it must create a link between the address spaces to support communication between the two in the form of read and write requests. The vm_operations_struct structure with which we are familiar...

Processing Interrupts

Once the CPU has been informed of an interrupt, it delegates further handling to a software routine that corrects the fault, provides special handling, or informs a user process of an external event. Because each interrupt and each exception has a unique number, the kernel uses an array containing pointers to handler functions. The associated interrupt number is found by referring to the array position, as shown in Figure 14-1. n n+1 n+2 n+3 n+4 n+5 n+6 n+7 Figure 14-1 Managing interrupt...

Figure 1211 Code flow diagram for netrxaction

After a few preparatory tasks, work is passed to process_backlog, which performs the following steps in a loop. To simplify matters, assume that the loop iterates until all pending packets have been processed and is not interrupted by any other condition. 1. _skb_dequeue removes a socket buffer that is managing a received packet from the wait 2. The packet type is analyzed by the netif_receive_skb function so that it can be delivered to the receive function of the network layer (i.e., to a...

Representation of Devices

The driver model features a special data structure to represent the generic device properties of practically all bus types.19 This structure is embedded directly in the bus-specific data structures and not by means of a reference as is the case with the kobjects introduced above. Its (simplified) definition is as follows klist_node knode_parent * node in sibling The klist and klist_node data structures used are enhanced versions of the familiar list_head data structures to which locking and...

Registering Buses

Before devices and their drivers can be registered, a bus is required. Thus we start with bus_register, which adds a new bus to the system. First of all, the new bus is added to the bus subsystem via the embedded subsys kset int bus_register(struct bus_type * bus) retval s, bus-> name) bus-> subsys.kobj.kset & bus_subsys retval The bus wants to know all about both its devices and their drivers, so the bus registers kernel sets for them. Both have the bus as parent and drivers ksets...

Info

Atomic_t d_count unsigned int d_flags spinlock_t d_lock struct inode *d_inode protected by d_lock * per dentry lock * Where the name belongs to negative * * The next three fields are touched by d lookup. * so they all fit in a cache line. struct list_head d_lru * LRU list * struct list_head d_child struct rcu_head d_rcu struct dentry_operations *d_op struct super_block *d_sb * our children * inode alias list * used by d_revalidate * The root of the dentry tree * fs-specific data * unsigned char...

References

BBD+01 Michael Beck, Harald B hme, Mirko Dziadzka, Ulrich Kunitz, Robert Magnus, and Dirk Verworrner. Linux-Kernelprogrammierung. Addison-Wesley, 2001. BC05 Daniel P. Bovet and Marco Cesati. Understanding the Linux Kernel. O'Reilly, 3rd edition, 2005. Ben05 Christian Benvenuti. Understanding Linux Network Internals. O'Reilly, 2005. BH01 Thomas Beierlein and Olaf Hagenbruch, editors. Taschenbuch Mikroprozessortechnik. Fachbuchverlag Leipzig, 2001. Bon94 Jeff Bonwick. The slab allocator An...

The Migration Thread

The migration thread serves two purposes It must fulfill migration requests originating from the scheduler, and it is used to implement active balancing. This is handled in a kernel thread that executes migration_thread. The code flow diagram for the function is shown in Figure 2-27. Figure 2-27 Code flow diagram for migration_thread. Figure 2-27 Code flow diagram for migration_thread. migration_thread runs an infinite loop and sleeps when there is nothing to do. First of all, the function...

The fault Mechanism

Memory mappings normally invoke the filemap_fault standard routine provided by the VFS layer to read pages not held in cache. Figure 8-16 shows the code flow diagram of this function. Figure 8-16 Code flow diagram for filemap_fault. Figure 8-16 Code flow diagram for filemap_fault. As the diagram illustrates, the implementation exhibits several parallels with the generic_file_read mechanism just discussed. First, the function checks if the virtual memory area to which the page belongs contains a...

Low Latency

Naturally, the kernel is interested in providing good latency times even if kernel preemption is not enabled. This can, for instance, be important in network servers. While the overhead introduced by kernel preemption is not desired in such an environment, the kernel should nevertheless respond to important events with reasonable speed. If, for example, a network request comes in that needs to be serviced by a daemon, then this should not be overly long delayed by some database doing heavy I O...

Packet Forwarding

IP packets may be delivered locally as described above, or they may leave the IP layer for forwarding to another computer without having come into local contact with the higher protocol instances. There are two categories of packet destinations 1. Target computers in one of the local networks to which the sending computer is attached. 2. Geographically remote computers not attached to the local network and accessible only via gateways. The second scenario is rather more complicated. The first...

Performing Page Reclaim

Shrink_page_list takes a list of pages selected for reclaim and attempts to write back to the appropriate backing store. This is the last step performed by the policy algorithm everything else is the responsibility of the technical part of swapping. The shrink_page_list function forms the interface between the kernel's two subsystems. The associated code flow diagram is shown in Figure 18-17. Some of the many corner cases this function has to deal with are ignored so that inessential details do...

The Netlink Mechanism

Netlink is a networking-based mechanism that allows for communication within the kernel as well as between kernel and userland. The formal definition can be found in RFC 3549. The idea to use the networking framework to communicate between kernel and userland stems from BSD's networking sockets. Netlink sockets, however, extend the possible uses much further. The mechanism is not only used for networking purposes. By now, one of the most important users is the generic object model, which uses...

Figure 34G Code flow diagram for vunmap

It is not necessary to explicitly state the size of the area to be freed because this can be derived from the information in vmlist. The first task of_vunmap is therefore to scan this list in_remove_vm_area (invoked by remove_vm_area after completion of locking) in order to find the associated entry. The vm_area instance found is used by unmap_vm_area to remove the entries no longer needed from the page tables. In the same way as when memory is reserved, the function works its way through the...

Audit Rules

How is it possible to place constraints on the types of events for which an audit log record can be generated This is done with the help of audit rules, and this chapter discusses their format and purpose. However, you should also consult the manual pages that accompany the audit framework especially 8(auditctl) for more information. In general, an audit rule consists of the following components The basic information is given by a filter value pair. The filter denotes the kind of event to which...

Central Control

The key periodic flushing component is the wb_kupdate procedure defined in mm page-writeback.c. It is responsible for dispatching lower-level routines to find the dirty pages in memory and synchronize them with the underlying block device. As usual, our description is based on a code flow diagram as shown in Figure 17-4. The superblocks are synchronized right at the start of the function because this is essential to ensure filesystem integrity. Incorrect superblock data result in consistency...

Client Server

Wolfgang meitner> , stream_server Connect to 192.168.1.20 Wait for connection on port 7777 Client 192.168.1.10 3505 Send 'HelloWorld' Numeric 3232235786 Bytes received 11 Text 'HelloWorld' 6Under Linux (and all other Unix flavors), all ports between 1 and 1,024 are referred to as reserved ports and may be used only by processes with root rights. For this reason, we use the free port number 7,777. A 4-tuple notation (192.168.1.20 7777,192.168.1.10 350 6) is used to uniquely identify a...

High Resolution Timers in Low Resolution Mode

What if no high-resolution clocks are available In this case, expiring high resolution timers is initiated from the hrtimer_run_queues, which is called by the high-resolution timer softIRQ HRTIMER_SOFTIRQ (since softIRQ processing is based on low-resolution timers in this case, the mechanism does not provide any high-resolution capabilities naturally). The code flow diagram is depicted in Figure 15-15. Note that this is a simplified version. In reality, the function is more involved because...

The Dynamic Tick Handier

The new tick handler tick_nohz_handler needs to assume two responsibilities 1. Perform all actions required for the tick mechanism. 2. Reprogram the tick device such that the next tick expires at the right time. The code to satisfy these requirements looks as follows. Some initialization work is required to obtain the per-CPU instance of struct tick_sched and the current time static void tick_nohz_handler(struct clock_event_device *dev) struct tick_sched *ts & _get_cpu_var(tick_cpu_sched)...

Interpreting Command Line Arguments

Parse_args is invoked by parse_early_param in start_kernel and assumes responsibility for interpreting the command-line parameters passed to the kernel at boot time. The same inherent problem is encountered as in userspace a string containing key value pairs in the form key1 val1 key2 val2 must be broken down into its constituent parts. The options set must be saved in the kernel or specific responses must be triggered. The kernel is faced with this parameter problem not only at boot time but...

Address Space Operations

In Section 9.2.4, the address space operations associated with the Ext2 filesystem are discussed. For the most part, functions whose names are prefixed with ext2_ are assigned to the individual function pointers. At first glance, it could therefore be assumed that they are all special implementations for the Second Extended Filesystem. However, this is not the case. Most of the functions make use of standard implementations of the virtual filesystem, which uses the function discussed in Section...

The Mount System Call

The point of entry for the mount system call is the sys_mount function defined in fs namespace.c. Figure 8-5 shows the associated code flow diagram. Figure 8-5 Code flow diagram for sys_mount. Figure 8-5 Code flow diagram for sys_mount. The approach described here is used only to mount a new filesystem in an existing root filesystem. A modified version of the above algorithm mounts the root filesystem itself, but is not sufficiently interesting to merit a separate description (its code can be...

Inode Lists

Each inode has a i_list list head element to store the inode on a list. Depending on the state of the inode, three main cases are possible 1. The inode exists in memory but is not linked to any file and is not in active use. 2. The inode structure is in memory and is being used for one or more tasks, usually to represent a file. The value of both counters (i_count and i_nlink) must be greater than 0. The file contents and the inode metadata are identical to the information on the underlying...

Block Devices

The core properties of a block device are represented by the name says it all struct block_device. Let us discuss this structure first and afterward examine how it fits into a network with various other structures. dev_t bd_dev * not a kdev_t - it's a search key * struct inode * bd_inode * will die * int bd_openers struct list_head bd_inodes void * bd_holder struct block_device * bd_contains unsigned bd_block_size struct hd_struct * bd_part unsigned bd_part_count int bd_invalidated struct...

Caching Swap Pages

In contrast to add_to_swap, add_to_swap_cache adds a page to the swap cache, but requires that a page slot has already been allocated for the page. If a page already has a swap cache entry, why is it added to the swap cache This is required when pages are swapped in. Suppose that a page that was shared among many processes has been swapped out. When the page is swapped in again, it is necessary to retain the data in the cache after the first page-in until all processes have successively...

Figure 349 Code flow diagram for kmemcachecreate

The kernel also takes account of the fact that some architectures require a minimum boundary for the alignment of data as defined by arch_slab_minalign the alignment required by users is also accepted. * 2) arch mandated alignment * if (ralign < ARCH_SLAB_MINALIGN) * 3) caller mandated alignment * if (ralign < align) A new instance of struct kmem_cache is allocated once the data alignment has been calculated (a separate slab cache named cache_cache is provided to perform allocation). The...

Disabling the Bootmem Allocator

The bootmem allocator must be disabled once system initialization has progressed so far that the buddy system allocator can assume responsibility for memory management after all, memory cannot be managed by two allocators at the same time. Disabling is done by free_all_bootmem on UMA systems and by free_all_bootmem_node on NUMA systems. Both need to be invoked by the architecture-specific initialization code after the buddy system has been set up. The page bitmap of the bootmem allocator is...

Unified Context Diffs

The following example illustrates the format used by diff to record the differences between two versions of a file. It reflects a change made to the scheduler during the development of kernel 2.6.24. diff -up a include linux sched.h b include linux sched.h +++ b include linux sched.h -908,6 +908,7 struct sched_entity u64 sum_exec_runtime u64 vruntime u64 prev_sum_exec_runtime + u64 last_min_vruntime ifdef CONFIG_SCHEDSTATS u64 wait_start diff -up a kernel sched.c b kernel sched.c -1615,6...

Implementation ofprocfilewrite

Writing to proc files is also a simple matter at least from the perspective of the filesystem. The code of proc_file_write is very compact and thus is reproduced in full below. proc_file_write(struct file * file, const char _user *buffer, struct inode *inode if ( dp-> write_proc) return -EIO return dp-> write_proc(file, buffer, count, dp-> data) The PDE function needed to obtain the required proc_dir_entry instance from the VFS inode using the container mechanism is very simple. All it...

Receiving TCP Data

All TCP actions (connection setup and shutdown, data transmission) are performed by sending data packets with specific properties and various flags. Before discussing state transitions, I must first establish how the TCP data are passed to the transport layer and at what point the information in the header is analyzed. tcp_v4_rcv is the entry point into the TCP layer once a packet has been processed by the IP layer. The code flow diagram for tcp_v4_rcv is shown in Figure 12-27. Figure 12-27...

Access Vector Cache Auditing

A very prominent example where auditing is a rather crucial requirement is the SELinux access vector cache. Granting or denying permissions is performed by the function avc_audit, which is called from avc_has_perm, that is, whenever a permission query is passed to the security server. First, the function needs to check if auditing is required for the current case (i.e., granting or denial is supposed to be audited or not) as follows struct av_decision *avd, int result, struct avc_audit_data *a)...

Fsbufferc

_getblk_slow(struct block_device *bdev, sector_t block, int size) bh _find_get_block(bdev, block, size) ret grow_buffers(bdev, block, size) if (ret < 0) Surprisingly, the first thing_getblk_slow does is to invoke_find_get_block the function that has just failed. If a buffer head is found, it is returned by the function. Of course, the function only succeeds if another CPU has installed the desired buffer and created the corresponding data structures in memory in the meantime. Although this is...

IO Ports

I O ports are a popular way of communicating with devices and buses, above all in the IA-32 world. As with I O memory, the required region must first be registered before it can be accessed by a driver in good faith unfortunately, the processor is again unable to check whether this has been done. ioport_resource from kernel resource.c acts as the root element of a resource tree. The ioports file in the proc filesystem reveals the reserved port addresses. wolfgang meitner> cat proc ioports...

Deleting Inodes

Both directory inodes and file inodes can be deleted, and, from the perspective of the filesystem, both actions are much simpler than allocating inodes. Let us first look at how directories are deleted. After the appropriate system call (rmdir) has been invoked, the code meanders through the kernel and finally arrives at the rmdir function pointer of the inode_operations structure, which, for the Ext2 filesystem, contains ext2_rmdir from fs ext2 namei.c. 18In terms of compatibility with old...

Relocation Types

The ELF standard defines a large number of relocation types, and there is a separate set for each supported architecture. Most of the types are used when dynamic libraries or location-independent code are generated. On some platforms particularly on IA-32 platforms it is also necessary to compensate for many design errors or historical ballast. Fortunately, the kernel, which is interested only in the relocation of modules, makes do with just the following two relocation types PC-relative...

Representation of Block Devices

Block devices have a set of characteristic properties managed by the kernel. The kernel uses what is known as request queue management to support communication with devices of this kind as effectively as possible. This facility enables requests to read and write data blocks to be buffered and rearranged. The results of requests are likewise kept in caches so that the data can be read and reread in a very efficient way. This is useful when a process repeatedly accesses the same part of a file or...

Data Structures and Usage

Spinlocks are implemented by means of the spinlock_t data structure, which is manipulated essentially using spin_lock and spin_unlock. There are a few other spinlock operations spin_lock_irqsave not only acquires the spinlock but also disables the interrupts on the local CPU, and spin_lock_bh also disables softIRQs (see Chapter 14). Spinlocks acquired with these operations must be released by means of their counterpart spin_unlock_bh and spin_unlock_irqsave, respectively. Once again,...

Implementing System Calls

All operations on semaphores are performed using a single system call named ipc. This call is used not only for semaphores but also to manipulate message queues and shared memory. Its first parameter delegates the actual multiplex work to other functions. These functions are as follows for semaphores semctl performs a semaphore operation and is implemented in sys_semctl. semget reads the semaphore identifier sys_semget is the associated implementation. semop and semtimedop are responsible for...

E25 Relocation Entries

Relocation is the process by which undefined symbols in ELF files are associated with valid values. In the standard example (test.o), this means that undefined references to printf and add must be replaced with the addresses at which the appropriate machine code is located in the virtual address space of the process. Replacement must be performed at all points in the object file where one of the symbols is used. The kernel is not involved in symbol replacement in userspace programs because all...

Network Cards and Other Devices

Character and block devices are not the only device categories managed by the kernel. Network cards occupy a special position in the kernel because they do not fit into this category scheme (Chapter 12 deals extensively with why this is so). This is revealed by the fact that there are no device files for network cards. Instead, user programs must use sockets to communicate with network cards, the sockets being an abstraction layer to provide an abstract view of all network cards. Access takes...

System Start

Figure 3-8 shows a code flow diagram for start_kernel. It includes only the system initialization functions associated with memory management. Figure 3-8 Kernel initialization in the view of memory management. Let's take a closer look at the functions invoked in the sections below after first summarizing their tasks as follows setup_arch is an architecture-specific set-up function responsible for, among other things, initialization of the boot allocator. On SMP systems, setup_per_cpu_areas...

Figure 319 Organization of the virtual address space on AMD64 systems The image is not drawn to scale naturally

Note that _ac is used to mark a given constant with a suffix. _ac(17,ul) becomes (17UL), for instance, which makes the constant an unsigned long. This can be handy in C code, but is not allowed in assembler code, where the _ac macro directly resolves to the given value without postfix. Another guard hole is placed between the identity-mapped region and the area for vmalloc area, which lies between vmalloc_start and vmalloc_end define VMALLOC_START _AC(0xffffc20000000000, UL) define VMALLOC_END...

Approximate PerCPU Counters

Counters can become a bottleneck if a system is equipped with a large number of CPUs Only one CPU at a time may modify the value all other CPUs need to wait for the operation to finish until they can access the counter again. If a counter is frequently visited, this has a severe impact on system performance. For some counters, it is not necessary to know the exact value at all times. An approximation of the value serves quite as well as the proper value would do. This insight can be used to...

Merging Regions

When a new region is added to the address space of a process, the kernel checks whether it can be merged with one or more existing regions as shown in Figure 4-10. vm_merge merges a new region with the surrounding regions if this is possible. It requires numerous parameters. struct vm_area_struct *vma_merge(struct mm_struct *mm, struct vm_area_struct *prev, unsigned long addr, unsigned long end, unsigned long vm_flags, struct anon_vma *anon_vma, struct file *file, pgoff_t pgoff, struct...

File and Directory Operations

Since sysfs exposes its data structures in a filesystem, most interesting operations can be triggered with standard filesystem operations. The functions that implement the filesystem operations thus serve as a glue layer between sysfs and the internal data structures. As for every filesystem, the methods used for operations on files are collected in an instance of struct file_operations. For sysfs, the following selection is available const struct file_operations sysfs_file_operations .read...

A75 Alpha

As classic RISC machines, Alpha CPUs employ a large register set whose members are identified mainly by numbers according to an orderly principle. As already noted, the Alpha architecture uses the PAL (privileged architecture level) code to perform system tasks. This code is also used in the implementation of system calls. The C or assembler code of the kernel need not save all registers listed in pt_regs because some of them are automatically placed on the stack by PAL code routines. Those...

Initializing Central Data Structures and Caches

As a quick glance at the following kernel sources shows, the most substantial task of start_kernel is to invoke subroutines to initialize almost all important kernel subsystems asmlinkage void _init start_kernel(void) early_boot_irqs_on() local_irq_enable() * HACK ALERT This is early. We're enabling the console before * we've done PCI setups etc, and console_init() must be aware of * this. But we do want output early, in case something goes wrong. * * rootfs populating might need page-writeback...

Implementation of dolookup

Do_lookup starts from a path component and the nameidata instance with the data of the initial directory and returns the associated inode. The kernel first attempts to find the inode in the dentry cache using the_d_lookup function described in Section 8.3.5. Even if a matching element is found, this does not necessarily mean that it is current the d_revalidate function in the dentry_operations of the underlying filesystem must be invoked to check whether it is still valid. If so, it is returned...

Even though the name allocbootmempages suggests that the required size is specified in page units pages refers only to

alloc_bootmem_low and alloc_bootmem_low_pages operate in the same ways as the above functions but take the area zone_dma that is suitable for DMA operations as their source. Consequently, the functions should only be used when DMA memory is required. Basically the same API applies for NUMA systems, but the suffix _node is appended to the function name. As compared with the UMA functions, an additional parameter is required to specify which node is used for memory reservation. These functions...

The alarm and setitimer System Calls

Alarm installs timers of the itimer_real type (real-time timers), while setitimer is used to install not only real-time, but also virtual and profiling timers. The system calls all end up in do_setitimer. The implementation of both system calls rests on a common mechanism that is defined in kernel itimer.c. The implementation is centered around struct hrtimer, so if high-resolution support is available, the corresponding advantages are automatically transferred into userland and not only...

System Control Mechanism

Kernel behavior can be modified at run time by means of system controls. Parameters can be transferred from userspace into the kernel without having to reboot. The classic method of manipulating the kernel is the sysctl system call. However, for a variety of reasons, this is not always the most elegant option one reason being that it is necessary to write a program to read arguments and pass them to the kernel using sysctl. Unfortunately, this method does not allow users to obtain a quick...

Memory and Optimization Barriers

Modern compilers and processors will try to squeeze every bit of performance out of code, and readers will certainly agree that this is a good thing. However, as with every good thing, there is also a drawback to consider (maybe you've heard this before in this chapter). One particular technique to achieve better performance is to reorder instructions. This can be perfectly fine, as long as the result is identical. However, it can be hard to decide for a compiler or processor if the result of a...

Procsys File Operations

The implementations for proc_sys_read and proc_sys_write are very similar. Both require three easy steps 1. do_proc_sys_lookup finds the sysctl table entry that is associated with the file in proc sys. 2. It is not guaranteed that all rights on sysctl entries are granted even to the root user. Some entries can, for instance, be only read, but are not allowed to be changed, that is, written to. Thus an extra permission check with sysctl_perm is required. While proc_sys_read needs read...

Exiting Processes

Processes must terminate with the exit system call. This gives the kernel the opportunity to free the resources used by the processes to the system.19 The entry point for this call is the sys_exit function that requires an error code as its parameter in order to exit the process. Its definition is architecture-independent and is held in kernel exit.c. Its implementation is not particularly interesting because it immediately delegates its work to do_exit. Suffice it to say that the...

Releasing Initialization Data

Many kernel code chunks and data tables are needed only during the system initialization phase. For example, it is not necessary to keep data structure initialization routines in kernel memory for permanently linked drivers. They are no longer needed once the structures have been set up. Similarly, hardware databases that drivers need to detect their devices are no longer required once the associated devices have been identified.17 The kernel provides two attributes (_init and_initcall) to...

Dispatching and Parameter Passing

System calls are uniquely identified by a number assigned by the kernel. This is done for practical reasons that become clear when system calls are activated. All calls are handled by a single central piece of code that uses the number to dispatch a specific function by reference to a static table. The parameters passed are also handled by the central code so that parameter passing is implemented independently of the actual system call. Switching from user to kernel mode and therefore to...

Initialization of Paging

Paging_init is responsible for setting up the page tables that can be used only by the kernel and are not accessible in userspace. This has far-reaching effects on the way in which access to memory is regulated between normal applications and the kernel itself. It is therefore important to explain the purpose of the function before looking closely at its implementation. As noted in Chapter 1, on IA-32 systems the kernel typically divides the total available virtual address space of 4 GiB in a...

Version Control Functions

Above I noted that the kernel uses the auxiliary function check_version to determine whether the symbol versions required by a module match the versions made available by the kernel. This function requires several parameters a pointer to the section header of the (sechdrs) module, the index of the_version section, the name of the processed symbol (symname), a pointer to the module data structure (mod), and a pointer to the checksum (crc) that the kernel provides for the symbol and that is...

Dynamic Ticks

Periodic ticks have provided a notion of time to the Linux kernel for many of years. The approach is simple and effective, but shows one particular deficiency on systems where power consumption does matter The periodic tick requires that the system is in an active state at a certain frequency. Longer periods of rest are impossible because of this. Dynamic ticks mend this problem. The periodic tick is only activated when some tasks actually do need to be performed. Otherwise, it is temporarily...

Interaction with fork

Whenever a new process is created using the fork system call or one of its variants, the scheduler gets a chance to hook into the process with the sched_fork function. On a single-processor system, the function performs essentially three actions Initialize the scheduling-related fields of the new process, set up data structures (this is rather straightforward), and determine the dynamic priority of the process void sched_fork(struct task_struct *p, int clone_flags) * Make sure we do not leak PI...

C24 Bit Arithmetic

Bit operations are part of the standard kernel repertoire and are frequently used in all subsystems. In the past, operations of this kind were often employed in userspace programs because some things could be done faster than with standard C resources. Now that the optimization mechanisms of compilers have become more sophisticated, bit operations are hardly ever needed. However, as an extremely performance-critical program, the kernel is an exception. Similarly, bit operations are able to...

Figure 65 Device database to keep track of all block and character devices

A global array bdev_map for block and cdev_map for character devices is used to implement a hash table, which employs the device major number as hash key. Both cdev_map and bdev_map are instances of the same data structure, struct kobj_map. The hashing method is quite simple major 255. This works well since currently only a very limited number of devices has major numbers larger than 255, so hash collisions are rare. The definition of struct kobj_map also includes the definition of the hash...

Implementation of loctls

Ioctls permit the use of special device-specific functions that cannot be accessed by normal read and write operations. This form of support is implemented by means of the ioctl system call that can be used with regular files (detailed descriptions of how it is used are provided in the many system programming manuals). As expected, the system call is implemented in sys_ioctl, but the main work is done in vfs_ioctl. Figure 6-19 shows the associated code flow diagram. Figure 6-19 Code flow...

Preemptive Multitasking

The structure of Linux process management requires two further process state options user mode and kernel mode. These reflect the fact that all modern CPUs have (at least) two different execution modes, one of which has unlimited rights while the other is subject to various restrictions for example, access to certain memory areas can be prohibited. This distinction is an important prerequisite for creating locked ''cages,'' which hold existing processes and prevent them from interfering with...

Architecture Specific Setup

The initialization of memory management on IA-32 systems is in some aspects a very subtle undertaking that must overcome a few historical obstacles associated with the processor architecture. These include, for example, switching the processor from normal mode to protected mode to grant the CPU access to the 32-bit world a legacy from the days when compatibility with 16-bit 8086 processors was important. Similarly, paging is not enabled by default and must be activated manually, which, for...

Flow Handling

Before discussing how flow handlers are implemented, we need to introduce the type used for them. irq_flow_handler_t specifies the signature of IRQ flow handler functions typedef void fastcall (*irq_flow_handler_t)(unsigned int irq, Flow handlers get both the IRQ number and a pointer to the irq_handler instance that is responsible for the interrupt. This information can then be used to implement proper flow handling. Recall that different hardware requires different approaches to flow handling...

Global Variables and Auxiliary Functions

While page mobility grouping is always compiled into the kernel, it only makes sense if enough memory that can be distributed across multiple migrate lists is present in a system. Since on each migrate list a suitable amount of memory should be present, the kernel needs a notion of ''suitable.'' This is provided by the two global variables pageblock_order and pageblock_nr_pages. The first denotes an allocation order that is considered to be ''large,'' and pageblock_nr_pages denotes the...

Figure 328 Code flow diagram for freeareainitnodes

Free_area_init_nodes first has to analyze and rewrite the information provided by the architecture-specific code. Among others, the numbers of the lowest and highest page frames that can be used in contrast to the principal boundaries specified in zone_max_pfn and zone_min_pfn need to be obtained for each zone. Two global arrays are used to store the information static unsigned long _meminitdata static unsigned long _meminitdata First of all, however, free_area_init_nodes sorts the entries in...

Supported Standards

System calls are of special significance in all Unix look-alikes. Their scope and speed and their efficient implementation play a major role in system performance. System calls are implemented extremely efficiently in Linux, as demonstrated in Section 13.3. Of equal importance are the versatility and choice of available routines to make the lives of programmers (of applications and of standard library functions) easier and to facilitate program portability between the various Unix derivatives...

Restarting Ticks

Tick_nohz_restart_sched_tick is used to restart ticks. The code flow diagram is given by Figure 15-20. Figure 15-20 Code flow diagram for tick_nohz_restart_sched_tick. Figure 15-20 Code flow diagram for tick_nohz_restart_sched_tick. Again, the implementation is complicated by various technical details, but the general principle is rather simple. Our old acquaintance tick_do_updates_jiffies64 is called first. After correctly accounting the idle time, tick_sched-> tick_stopped is set to 0...

Stopping Ticks

Essentially, tick_nohz_stop_sched_tick needs to perform three tasks 1. Check if the next timer wheel event is more than one tick away. 2. If this is the case, reprogram the tick device to omit the next tick only when it is necessary again. This automatically omits all ticks that are not required. 3. Update the statistical information in tick_sched. Since many details require much attention to corner cases, the actual implementation of tick_nohz_stop_sched_tick is rather bulky, so I consider a...

A7 Thread Representation

The state of a running process is defined primarily by the contents of the processor registers. Processes that are not currently running must keep this data in corresponding data structures from which the data can be read and moved to the appropriate registers when the process is next activated by the scheduler. The structures needed to do this are defined in the following files < asm-arch ptrace.h> provides the pt_regs structure to hold all registers that are placed on the kernel stack...

Module Representation

Before looking more closely at the implementation of the module-related functions, it is necessary to explain how modules (and their properties) are represented in the kernel. As usual, a set of data structures is defined to do this. Not surprisingly, the name of the most important structure is module an instance of this structure is allocated for each module resident in the kernel. It is defined as follows * Member of list of modules * struct list_head list * Unique handle for this module *...

Binary Structure of Modules

Modules use the ELF binary format, which features several additional sections not present in normal programs or libraries. In addition to a few compiler-generated sections that are not relevant for our purposes (mainly relocation sections), modules consist of the following ELF sections11 The_ksymtab,_ksymtab_gpl, and_ksymtab_gpl_future sections contain a symbol table with all symbols exported by the module. Whereas the symbols in the first-named section can be used by all kernel parts...

The Main Makefile

The main Makefile is key to kernel compilation. It defines the call paths for the C compiler, linker, and so on. The following distinction must be made between two toolchain alternatives A toolchain to generate local programs that execute on the host that compiles the kernel. Examples of such programs are the menuconfig binaries or tools for analyzing module symbols. A toolchain to generate the kernel itself. The toolchains are usually identical. Differences arise only when a kernel is...

Kernel Preemption

As described above, the scheduler is invoked before returning to user mode after system calls or at certain designated points in the kernel. This ensures that the kernel, unlike user processes, cannot be interrupted unless it explicitly wants to be. This behavior can be problematic if the kernel is in the middle of a relatively long operation this may well be the case with filesystem, or memory-management-related tasks. The kernel is executing on behalf of a specific process for a long amount...

Filesystem Actions

As demonstrated in Chapter 8, the association between the virtual filesystem and specific implementations is established in the main by three structures that include a series of function pointers this association must be implemented by all filesystems. Operations for manipulating the contents of a file are stored in file_operations. Operations for processing the file objects themselves are held in inode_operations Operations with generalized address spaces are stored in...

E15 String Tables

Figure E-2 shows how string tables are implemented to manage strings for ELF files. 7 The version numbering seems rather strange but is correct. Libc4 and Libc5 were special C standard libraries for Linux Glibc 2.0 was the first cross-system variant of the library that replaced the old versions. Figure E-2 String table for ELF files. Figure E-2 String table for ELF files. The first byte of the table is a null byte, followed by strings separated by null bytes. To reference a string, a position...

High Resolution Timers

After having discussed the generic time framework, we are now ready to take the next step and dive into the implementation of high-resolution timers. Two fundamental differences distinguish these timers from low-resolution timers 1. High-resolution (high-res) timers are time-ordered on a red-black tree. 2. They are independent of periodic ticks. They do not use a time specification based on jiffies, but employ nanosecond time stamps. Merging the high-resolution timer mechanism into the kernel...

System Call Tracing

The following short sample program illustrates the use of ptrace. ptrace attaches itself to a process and checks system call usage as such, it is a minimal replacement for strace. * Simple replacement for strace(1) * include< asm ptrace.h> * for ORIG_EAX * static long pid int upeek(int pid, long off, long *res) long valval ptrace(PTRACE_PEEKUSER, pid, off, 0) res ptrace(PTRACE_SYSCALL, pid, (char*) 1, 0) if (res < 0) printf(Failed to execute until next syscall d n, res) void...

Working with Clock Sources

How can a clock be used First of all, it must be registered with the kernel. The function clocksource_register is responsible for this. The source is only added to the global clocksource_list (defined in kernel time clocksource.c), which sorts all available clock sources by their rating. select_clocksource is called to select the best clock source. Normally this will pick the clock with the best rating, but it is also possible to specify a preference from userland via which is used by the...

A72 Ia64

In the design of IA-64, the designated successor of the aging IA-32 architecture, Intel has kept abreast of the times and has given the processor a much bigger register set (with a more systematic name). * The following registers are saved by SAVE_MIN * * The following registers are saved by SAVE_MIN * interrupted task's instruction pointer * interrupted task's function state if bit 63 is cleared, it * contains syscall's ar.pfs.pfm unsigned long ar_unat * interrupted task's NaT register...

Representation of the Device Tree

A further data structure is needed to represent the USB device tree and the various device characteristics in the kernel. enum usb_device_state enum usb_device_speed 16 * Use in messages port port * state * configured, not attached, etc * speed * high full low (or error) * unsigned int toggle 2 , * one bit for each endpoint struct usb_device *parent, * our hub, unless we're the root * struct usb_bus *bus, * Bus we're part of * struct device dev * Generic device interface * struct...

Figure 329 Layout of a GFP mask

In contrast to the zone modifiers, the additional flags do not limit the RAM segments from which memory can be allocated, but they do alter the behavior of the allocator. For example, they modify how aggressively a search is made for free memory. The following flags are defined in the kernel sources * Action modifiers - doesn't change the zoning * * Action modifiers - doesn't change the zoning * Suppress page allocation failure warning * Enforce hardwall cpuset memory allocs define...

Thermqueue Helper Function

The kernel uses the_rmqueue function (whose purpose is evident from the preceding description), which acts as a gatekeeper to penetrate into the innermost core of the buddy system static struct page *_rmqueue(struct zone *zone, unsigned int order, 22Huge-TLB pages are created at boot time and kept in a special cache. The kernel parameter hugepages allows for specifying how many huge-TLB pages are to be created, and applications can request them via the special filesystem hugetlbfs. The library...

Section Header

The section header table is implemented by means of an array in which each entry contains a section. The individual sections form the contents of the segments defined in the program header table. The following data structure represents a section typedef Elf32_ Elf32_ Elf32_ Elf32_ Elf32_ Elf32_ Elf32_ Elf32_ Elf32_ Elf32_ Elf32_ The elements have the following meanings sh_name specifies the name of the section. Its value is not the string itself but an index into the sh_type specifies the...

Writing Back Single Inodes

As noted above, the kernel delegates synchronization of the data associated with an inode to _writeback_single_inode. The corresponding code flow diagram is shown in Figure 17-7. Figure 17-7 Code flow diagram for_writeback_single_inode. Figure 17-7 Code flow diagram for_writeback_single_inode. The function is essentially a dispatcher for_sync_single_inode, but is charged with the important task of distinguishing whether a data integrity (WB_SYNC_ALL) or regular writeback is performed. This...

Initiating Memory Reclaim

In the implementation overview at the beginning of this chapter, I demonstrated that the page selection and swap-out routines discussed so far are controlled by a further layer that decides when and how many pages are reclaimed. This decision is redirected to two places first to the kswapd daemon that attempts to maintain optimal memory balance in the system when no too-memory-intensive applications are running and second to an emergency mechanism that kicks in when the kernel thinks it is...

Socket Buffers

(*destructor)(struct sk_buff *skb) iif sk_buff_data_t sk_buff_data_t sk buff data t * These elements must be at the end, see alloc_skb() for details. * sk_buff_data_t tail Socket buffers are used to exchange data between the network implementation levels without having to copy packet data to and fro this delivers considerable speed gains. The socket structure is one of the cornerstones of the network layer because it is processed on all levels both when packets are analyzed and generated.

The name of the structure is misleading A sysctl table is an array of sysctl structures whereas a single instance of

The meanings of the structure elements are as follows ctl_name is an ID, that must be unique only on the given hierarchy level of the entry but not in the entire table. < sysctl.h> contains countless enums that define sysctl identifiers for various purposes. The identifiers for the base categories are defined by the following enumeration CTL_KERN 1, * General kernel info and control * CTL_ABI 9, * Binary emulation * CTL_CPU 10 * CPU stuff (speed scaling, etc) * Below ctl_dev, there are...

Shared Subtrees

The mechanisms I have discussed so far covered the standard mount cases that are available on any Unix system. However, Linux supports some more advanced possibilities that allow for gaining more power from namespaces. Since they were only introduced during the development of 2.6 (2.6.16, to be precise), their use is still somewhat limited, so I will briefly review their basic principle before I discuss the implementation. For specific details on potential applications and a precise description...

Level Triggered Interrupts

Level-triggered interrupts are a little easier to process than their edge-triggered relatives. This is also reflected in the code flow diagram of the flow handler handle_level_irq, which is depicted in Figure 14-6. Figure 14-6 Code flow diagram for handle_level_irq. Figure 14-6 Code flow diagram for handle_level_irq. Note that level-triggered interrupts must be masked when they are processed, so the first thing that needs to be done is to call mask_ack_irq. This auxiliary function masks and...