Linux Kernel Architecture

Copy on Write

The kernel uses the copy-on-write technique (COW) to prevent all data of the parent process from being copied when fork is executed. This technique exploits the fact that processes normally use only a fraction of their pages in memory.8 When fork is called, the kernel would usually create an identical copy of each memory page of the parent process for the child process. This has two very negative effects 1. A large amount of RAM, a scarce resource, is used. 2. The copy operation takes a long...

Representation of Block Devices

Block devices have a set of characteristic properties managed by the kernel. The kernel uses what is known as request queue management to support communication with devices of this kind as effectively as possible. This facility enables requests to read and write data blocks to be buffered and rearranged. The results of requests are likewise kept in caches so that the data can be read and reread in a very efficient way. This is useful when a process repeatedly accesses the same part of a file or...

Applying Patches

A patch is a collection of diffs that reside in a common file. For example, patches that contain the differences between two kernel versions are made available at www.kernel.org for updating purposes. It is then no longer necessary to download the entire source tree, thus saving time and bandwidth. 6If two files are totally different, diff creates a single large hunk that covers the whole file. 7Patches are referred to as orthogonal if they do not mutually influence each other that is, if a...

Lock Contention and Fine Grained Locking

After having discussed the numerous locking primitives provided by the kernel, let us briefly address some of the problems related to locking and kernel scalability. While multiprocessor systems were nearly completely unknown to the average user only a decade ago, they are present on nearly every desktop today. Scalability of Linux on systems with more than a single CPU has therefore become a very important goal. This needs especially to be taken into account when locking rules are designed for...

Edge Triggered Interrupts

Handle Edge Irq

Edge-triggered interrupts are most common on the majority of today's hardware, so I consider this type first. The default handler is implemented in handle_edge_irq. The code flow diagram is shown in Figure 14-5. Edge-triggered IRQs are not masked when they are processed in contrast to level-triggered IRQs, there is no need to do so. This has one important implication for SMP systems When an IRQ is handled on one CPU, another IRQ with the same number can appear on another CPU that we denote as...

Level Triggered Interrupts

Interrupt Linux Kernel

Level-triggered interrupts are a little easier to process than their edge-triggered relatives. This is also reflected in the code flow diagram of the flow handler handle_level_irq, which is depicted in Figure 14-6. Figure 14-6 Code flow diagram for handle_level_irq. Figure 14-6 Code flow diagram for handle_level_irq. Note that level-triggered interrupts must be masked when they are processed, so the first thing that needs to be done is to call mask_ack_irq. This auxiliary function masks and...

The pdflush Mechanism

The pdflush mechanism is implemented in a single file mm pdflush.c. This contrasts with the fragmented implementation of the synchronization mechanisms in earlier versions. pdflush is started with the usual kernel thread mechanisms mm pdflush.c static void start_one_pdflush_thread(void) kthread_run(pdflush, NULL, pdflush) start_one_pdflush starts a single pdflush thread however, the kernel uses several threads at the same time in general, as you will see below. It should be noted that a...

Idle and init Thread

Kernel Architecture Images

The last two actions of start_kernel are as follows 1. rest_init starts a new thread that, after performing a few more initialization operations as described the next step, ultimately calls the userspace initialization program sbin init. 2. The first, and formerly only, kernel thread becomes the idle thread that is called when the system has nothing else to do. rest_init is essentially implemented in just a few lines of code init main.c kernel_thread(kernel_init, NULL, CLONE_FS CLONE_SIGHAND)...

Page Frames

Page frames represent the smallest unit of system memory, and an instance of struct page is created for each page in RAM. Kernel programmers take care to keep this structure as small as possible because the memory of systems even with a moderate RAM configuration is broken down into a very large number of pages. For instance, an IA-32 system working with a standard page size of 4 KiB has around 100,000 pages given a main memory size of 384 MiB. Although this memory size is certainly not...

Managing the Heap

Managing the heap the memory area of a process used to dynamically allocate variables and data is not directly visible to application programmers because it relies on various helper functions of the standard library the most important of which is malloc to reserve memory areas of any size. The classic interface between malloc and the kernel is the brk system call that expands shrinks the heap. Recent malloc implementations such as those of the GNU standard library now use a combined approach...

Types of Timers

Linux Kernel Architectiure

The timing subsystem of the kernel has grown tremendously during the development of 2.6. For the initial releases, the timer subsystem consisted solely of what are now known as low-resolution timers. Essentially, low-resolution timers are centered around a periodic tick which happens at regular intervals. Events can be scheduled to be activated at one of these intervals. Pressure to extend this comparatively simple framework came predominantly from two sources Devices with limited power (i.e.,...

Swapping Pages in

Irq Line Diagram Linux

You already know from Chapter 4 that page faults as a result of accessing a swapped-out page are handled by do_swap_page from mm memory.c. As the associated code flow diagram in Figure 18-19 shows, it is much easier to swap a page in than to swap it out, but it still involves more than just a simple read operation. Figure 18-19 Code flow diagram for do_swap_page. Figure 18-19 Code flow diagram for do_swap_page. The kernel must not only check whether the requested page is still or already in the...

Request Queues

Read and write requests to block devices are placed on a queue known as a request queue. The gendisk structure includes a pointer to the device-specific queue, which is represented by the following data type. Together with queue_head for cacheline sharing struct request_list rq Queue request freelists prep_rq_fn prep_rq_fn unplug_fn unplug_fn merge_bvec_fn merge_bvec_fn prepare_flush_fn prepare_flush_fn softirq_done_fn softirq_done_fn unplug_timer unplug_thresh unplug_delay unplug_work After...

Software Interrupts

Software interrupts enable the kernel to defer tasks. Because they function in a similar way to the interrupts described above but are implemented fully in the software, they are logically enough known as software interrupts or softlRQs. The kernel is informed of an anomalous condition by means of a software interrupt, and the situation is resolved at some later time by special handler routines. As already noted, the kernel services all pending software interrupts at the end of do_IRQ so that...

The SoftIRQ Daemon

The task of the softIRQ daemon is to execute softIRQs asynchronously to remaining kernel code. To this end, each system processor is allocated its own daemon named ksoftirqd. wakeup_softirqd is invoked at two points in the kernel to wake up the daemon In do_softirq, as just mentioned. At the end of raise_softirq_irqoff. This funtion is called by raise_softirq internally, and can also be used directly if the kernel has interrupts turned off at the moment. The wake-up function itself can be dealt...

The Main Scheduler

The main scheduler function (schedule) is invoked directly at many points in the kernel to allocate the CPU to a process other than the currently active one. After returning from system calls, the kernel also checks whether the reschedule flag tif_need_resched of the current process is set for example, the flag is set by scheduler_tick as mentioned above. If it is, the kernel invokes schedule. The function then assumes that the currently active task is definitely to be replaced with another...

Creating the Layout

The address space of a task is laid out when an ELF binary is loaded with load_elf_binary recall that the function is used by the exec system call. Loading an ELF file is cluttered with numerous technical details that are not interesting for our purposes, so the code flow diagram in Figure 4-3 concentrates on the steps required to set up the virtual memory region. Figure 4-3 Code flow diagram for load_elf_binary. Address space randomization is enabled if the global variable randomize_va_space...

System Call Tracing

The strace tool developed to trace the system calls of processes using the ptrace system call is described in Section 13.1.1. Implementation of the sys_ptrace handler routine is architecture-specific and is defined in arch arch kernel ptrace.c. Fortunately, there are only minor differences between the code of the individual versions. I therefore provide a generalized description of how the routine works without going into architecture-specific details. Before examining the flow of the system...

Three Way Handshake

A connection must be established explicitly between a client and a host before a TCP link can be used. As already noted, a distinction is made between active and passive connection setup. The kernel (i.e., the kernel of both machines involved in the connection) sees the following situation immediately prior to connection establishment the state of the client process socket is CLOSED, that of the server socket is LISTEN. A TCP connection is set up by means of a procedure that involves the...

Handling of Page Faults

The association between virtual and physical memory is not established until the data of an area are actually needed. If a process accesses a part of virtual address space not yet associated with a page in memory, the processor automatically raises a page fault that must be handled by the kernel. This is one of the most important and complex aspects of memory management simply because a myriad of details must be taken into account. For example, the kernel must ascertain the following Was the...

Overview Dfw

There is a clear relationship between flushing, swapping, and releasing pages. Not only the state of memory pages but also the size of free memory needs checking regularly. When this is done, unused or seldom used pages are swapped out automatically but not before the data they hold have been synchronized with the backing store to prevent data loss. In the case of dynamically generated pages, the system swap areas act as the backing stores.

Using System V Semaphores

The System V interface for using semaphores is anything but intuitive because the concept of a semaphore has been expanded well beyond its actual definition. Semaphores are no longer treated as simple variables to support atomic execution of predefined operations. Instead, a System V semaphore now refers to a whole set of semaphores, which allows not just one but several operations to be performed at the same time (although they appear to be atomic to users). It is, of course, possible to...

The Migration Thread

Linux Thread Construct Diagram

The migration thread serves two purposes It must fulfill migration requests originating from the scheduler, and it is used to implement active balancing. This is handled in a kernel thread that executes migration_thread. The code flow diagram for the function is shown in Figure 2-27. Figure 2-27 Code flow diagram for migration_thread. Figure 2-27 Code flow diagram for migration_thread. migration_thread runs an infinite loop and sleeps when there is nothing to do. First of all, the function...

Dynamic Ticks for High Resolution Systems

Since clock event devices run in one-shot mode anyway if the kernel uses high timer resolution, support for dynamic ticks is much easier to implement than in the low-resolution case. Recall that the periodic tick is emulated by tick_sched_timer as discussed above. The function is also used to implement dynamic ticks. In the discussion in Section 15.4.4,1 omitted two elements required for dynamic ticks 1. Since CPUs can drop global tick duties, the handler needs to check if this has been the...

Standard Functions

Several auxiliary functions that ease handling dentry objects are provided by the kernel. Their implementation is mostly an exercise in list management and data structure handling, so I won't bother to discuss their code. It is, however, important to show their prototypes and describe their effect since we will come across them frequently in discussing implementation of VFS operations. The following auxiliary functions require a pointer to struct dentry as parameter. Each performs one simple...

Socket Buffers

(*destructor)(struct sk_buff *skb) iif sk_buff_data_t sk_buff_data_t sk buff data t * These elements must be at the end, see alloc_skb() for details. * sk_buff_data_t tail Socket buffers are used to exchange data between the network implementation levels without having to copy packet data to and fro this delivers considerable speed gains. The socket structure is one of the cornerstones of the network layer because it is processed on all levels both when packets are analyzed and generated.

This area may be larger than actually needed because it is not clear how big packets will be when they are synthesized

data and tail point to the start and end of the protocol data area. Figure 12-6 Link between socket buffer and network packet data. Figure 12-6 Link between socket buffer and network packet data. mac_header points to the start of the MAC header, while network_header and transport_header point to the header data of the network and transport layer, respectively. On systems with 32-bit word length, the data type sk_buff_data_t that is used for the various data components is a simple pointer...

The offset is given as a byte offset into the file not as a page offset into the page cache This allows filesystem code

The most important providers of this value are, however, do_generic_mapping_read and filemap_fault. The routine ondemand_readahead is responsible to implement readahead policy, that is, decide how many pages will be read in before they are actually required. As Figure 16-5 shows, both page_cache_sync_readahead and page_cache_async_readahead rely on this function. After deciding on the size of the readahead window, ra_submit is called to delegate the technical aspects to...

Elements in the Task Structure

There are several scheduling-relevant elements in the task structure of each process. < sched.h> int prio, static_prio, normal_prio unsigned int rt_priority const struct sched_class *sched_class unsigned int policy cpumask_t cpus_allowed unsigned int time_slice Not all processes on a system are equally important Less urgent tasks should receive less attention, while important work should be done as quickly as possible. To determine the importance of a particular task, it is equipped with a...

Allocation of Physical Memory

When it allocates RAM, the kernel must keep track of which pages have already been allocated and which are still free in order to prevent two processes from using the same areas in RAM. Because memory allocation and release are very frequent tasks, the kernel must also ensure that they are completed as quickly as possible. The kernel can allocate only whole page frames. Dividing memory into smaller portions is delegated to the standard library in userspace. This library splits the page frames...

Memory Zones

The kernel uses the zones structure to describe a zone. It is defined as follows * Fields commonly accessed by the page allocator * unsigned long pages_min, pages_low, pages_high unsigned struct struct * Fields commonly accessed by the page reclaim scanner * struct list_head inactive_list unsigned long nr_scan_inactive unsigned long pages_scanned * since last reclaim * unsigned long flags * zone flags, see below * * Rarely used or read-mostly fields * wait_queue_head_t * wait_table unsigned *...

Node and Zone Initialization

Build_all_zonelists builds the data structures required to manage nodes and their zones. Interestingly, it can be implemented by the macros and abstraction mechanisms introduced above regardless of whether it runs on a NUMA or UMA system. This works because the executed functions are available in two flavors one for NUMA systems and one for UMA systems. Since this little trick is often used by the kernel, I will briefly discuss it. Suppose that a certain task must be performed differently...

Creating Data Structures for Each Node

Once the zone boundaries have been determined, free_area_init_nodes creates the data structures for the individual zones iteratively by calling free_area_init_node. Several helper functions are required for this purpose. calculate_node_totalpages first calculates the total number of pages in the node by summing up the pages in the individual zones. In the case of contiguous memory, this could be done in zones_size_init, but calculate_zone_totalpages also takes holes in the zone into account....

Helper Functions

First, we need to define some flags used by the functions to control behavior when various watermarks are reached. The first flags indicate which watermark applies when the decision is made as to whether pages can be taken or not. By default (that is, there is no absolute need for more memory because of pressure exerted by other factors), pages are taken only when the zone still contains at least zone-> pages_high pages. This corresponds to the alloc_wmark_high flag. alloc_wmark_min or _low...

Allocation Control

As mentioned above,_alloc_pages is the main function of the buddy system. Now that we have dealt with all preparatory work and described all possible flags, we turn our attention to the relatively complex implementation of the function that is one of the lengthier parts of the kernel. Complexity arises above all when too little memory is available to satisfy a request or when available memory is slowly running out. If sufficient memory is available, the necessary work is quickly done as the...

Alternative Allocators

Although the slab allocator works well for many possible workloads, there are naturally situations in which it fails to provide optimal performance. Problems arise when slab allocation is used on machines that range on the borders of the current hardware scale tiny embedded systems and large, massively parallel systems equipped with huge amounts of RAM. In the second case, the large amount of metadata required by the slab allocator can become a problem developers have reported that many...

Memory Management in the Kernel

The general allocation and freeing functions of the kernel have similar names to their equivalents in the C standard library and are employed in exactly the same way. kmalloc(size, flags) reserves a memory area of size bytes and returns a void pointer to the start of the area. If insufficient memory is available (a very improbable situation in the kernel but one that must always be catered for), a null pointer is the result. The flags argument specifies the area from which memory is to be...

Initialization

At first sight, initialization of the slab system does not appear to be especially complicated because the buddy system is already fully enabled and no other particular restrictions are imposed on the kernel. Nevertheless, there is a chicken-and-egg problem34 because of the structure of the slab allocator. To initialize the slab data structures, the kernel needs memory blocks that are much smaller than a complete page and are therefore best allocated by kmalloc. And here's the crux kmalloc only...

Periodic Reclaim with kswapd

Kswapd is a kernel daemon that is activated by kswap_init each time the system is started and continues to execute for as long as the machine is running pg_data_t *pgdat NODE_DATA(nid) int ret 0 pgdat-> kswapd kthread_run(kswapd, pgdat, kswapd d, nid) return ret for_each_node_state(nid, N_HIGH_MEMORY) kswapd_run(nid) The code shows that a separate instance of kswapd is activated for each NUMA zone. On some machines, this serves to enhance system performance as different speeds of access to...

Kernel Page Faults

When kernel address space is accessed, page faults can be triggered by various conditions as described below. A programming error in the kernel has caused an incorrect address to be accessed this is a genuine bug. Of course, this should never happen in stable versions20 but does occur occasionally in developer versions. The kernel accesses an invalid address passed as a system call parameter from userspace. The page fault was triggered by access to an area allocated using vmalloc. The first two...

Networking Namespaces

Recall from Chapter 1 that many parts of the kernel are contained in namespaces. These allow for building multiple virtual viewpoints of the system that are separated and segregated from each other. Every instance looks like a single machine running Linux, but, in fact, many such instances can operate simultaneously on a single physical machine. During the development of 2.6.24, the kernel started to adopt namespaces also for the networking subsystem. This adds some extra complexity to the...

Support for High Speed Interfaces

The previously discussed old approach to transferring packets from the network device into higher layers of the kernel works well if the devices do not support too high transmission rates. Each time a frame arrives, an IRQ is used to signalize this to the kernel. This implies a notion of ''fast'' and ''slow.'' For slow devices, servicing the IRQ is usually finished before the next packet arrives. Since the next packet is also signaled by an IRQ, failing to fulfill this condition as is often the...

Timer Activation and Process Accounting

As the time base for timers, the kernel uses the timer interrupt of the processor or any other suitable periodic source. On IA-32 and AMD64 systems, the programmable interrupt timer (PIT) or the High Precision Event Timer (HPET) can be employed for this purpose. Nearly all modestly modern systems of this type are equipped with an HPET, and if one is available, it is preferred to the PIT.3 The interrupt occurs at regular intervals exactly HZ times per second. HZ is defined by an...

Implementing Timer Handling

Handling of all timers is initiated in update_process_times by invoking the run_local_timers function. This limits itself to using raise_softirq TIMER_SOFTIRQ to activate the timer management softIRQ, which is executed at the next opportunity.8 run_timer_softirq is used as the handler function of the softIRQ it selects the CPU-specific instance of struct tvec_t_base_s and invokes_run_timers. _run_timers implements the algorithm described above. However, nowhere in the data structures shown is...

High Resolution Timers in High Resolution Mode

Hrtimer Interrupt

Let us first assume that a high-resolution clock is up and running, and that the transition to highresolution mode is completely finished. The general situation is depicted in Figure 15-13. When the clock event device responsible for high-resolution timers raises an interrupt, hrtimer_interrupt is called as event handler. The function is responsible to select all timers that have expired and either move them to the expiration list (if they may be processed in softIRQ context) or call the...

Initializing Central Data Structures and Caches

As a quick glance at the following kernel sources shows, the most substantial task of start_kernel is to invoke subroutines to initialize almost all important kernel subsystems asmlinkage void _init start_kernel(void) early_boot_irqs_on() local_irq_enable() * HACK ALERT This is early. We're enabling the console before * we've done PCI setups etc, and console_init() must be aware of * this. But we do want output early, in case something goes wrong. * * rootfs populating might need page-writeback...

Arrangement of the Kernel in Memory

Before discussing the individual memory initialization operations, we need to examine the situation in RAM after the boot loader has copied the kernel into memory and the assembler part of the initialization routines has completed. I concentrate on the default case in which the kernel is loaded to a fixed position in physical RAM that is determined at compile time. It is also possible to configure the initial position of the kernel binary in physical RAM if the crash dump mechanism is enabled....

Structure of the Buffer Cache

Structure Buffer Cache

A page-oriented method has not always been used in the Linux kernel to bear the main caching burden. Earlier versions included only the buffer cache to speed file operations and to enhance system performance. This was a legacy of other Unix look-alikes with the same structure. Blocks from the underlying block devices were kept in main memory buffers to make read and write operations faster. The implementation is contained in fs buffers.c. In contrast to pages in memory, blocks are not only...

The Swap Cache

Now that I have described the layout of the swap subsystem by reference to its data structures, let us turn our focus in the following sections on the techniques employed by the kernel to write pages from memory into a swap area and to read pages from the swap area back into memory. The kernel makes use of a further cache, known as a swap cache, that acts as a liaison between the operation to select the pages to be swapped out and the mechanism that actually performs swapping. At first sight,...

Professional Linux Kernel Architecture

10475 Crosspoint Boulevard Indianapolis, IN 46256 www.wiley.com Copyright 2008 by Wolfgang Mauerer Published by Wiley Publishing, Inc., Indianapolis, Indiana Manufactured in the United States of America 10 987654321 Library of Congress Cataloging-in-Publication Data Professional Linux kernel architecture Wolfgang Mauerer. p. cm. Includes index. ISBN 978-0-470-34343-2 (pbk.) 1. Linux. 2. Computer architecture. 3. Application software. I. Title. QA76.9.A73M38 2008 005.4'32--dc22 No part of this...

Device File Elements in I nodes

Each file in the virtual file system is associated with just one inode that manages the file properties. Since the inode data structure is very lengthy, I don't reproduce it in full here but only include the elements relevant to device drivers. struct file_operations *i_fop union 5When register_chrdev is used, no handling of struct cdev is necessary since this is automatically managed. The reason is simple The cdev abstraction was not available in the kernel at the time register_chrdev was...

Representation of proc Entries

Each entry in the proc filesystem is described by an instance of proc_dir_entry whose (abbreviated) definition is as follows unsigned int low_ino unsigned short namelen const char *name mode_t mode nlink_t nlink uid_t uid gid_t gid loff_t size struct inode_operations * proc_iops const struct file_operations * proc_fops get_info_t *get_info struct module *owner struct proc_dir_entry *next, *parent, *subdir void *data read_proc_t *read_proc write_proc_t *write_proc Because each entry is given a...

Creating and Registering Entries

New entries are added to the proc filesystem in two steps. First, a new instance of proc_dir_entry is created together with all information needed to describe the entry. This instance is then registered in the data structures of proc so that it is visible to the outside. Because the two steps are never carried out independently of each other, the kernel makes auxiliary functions available to combine both actions so that new entries can be generated quickly and easily. The most frequently used...

Selection According to PID

Let us turn our attention to how the process-specific information is selected by PID. If a PID is passed to proc_pid_lookup instead of self, the course of the lookup operation is as shown in the code flow diagram in Figure 10-4. Because filenames are always processed in the form of strings but PIDs are integer numbers, the former must be converted accordingly. The kernel provides the name_to_int auxiliary function to convert strings consisting of digits into an integer. The information obtained...

Writing Sequential File Handlers

Basically, an instance of struct file_operations that provides pointers to some seq_ routines must be implemented to benefit from the sequential file standard implementation. The kprobes subsystem does this as follows static struct file_operations debugfs_kprobes_operations .open kprobes_open, This instance of file_operations can be associated with a file by the methods discussed in Chapter 8. In the case of kprobes, the file will be created in the debugging filesystem see Section 10.2.3. The...

File and Directory Operations

Since sysfs exposes its data structures in a filesystem, most interesting operations can be triggered with standard filesystem operations. The functions that implement the filesystem operations thus serve as a glue layer between sysfs and the internal data structures. As for every filesystem, the methods used for operations on files are collected in an instance of struct file_operations. For sysfs, the following selection is available const struct file_operations sysfs_file_operations .read...

Nonlinear Mappings

As just demonstrated, normal mappings map a continuous section from a file into a likewise continuous section of virtual memory. If various parts of a file are mapped in a different sequence into an otherwise contiguous area of virtual memory, it is generally necessary to use several mappings, which is more costly in terms of resources (particularly in vm_area_structs). A simpler way of achieving the same result11 is to use nonlinear mappings as introduced during the development of 2.5. The...

Registering Network Devices

Kernel Architecture Images

Each network device is registered in a two-step process 1. alloc_netdev allocates a new instance of struct net_device, and a protocol-specific function fills the structure with typical values. For Ethernet devices, this function is ether_setup. Other protocols (not considerd in detail) use XXX_setup, where possible values for XXX include fddi (fiber distributed data), tr (token ring), ltalk (localtalk), hippi (high-performance parallel interface), or fc (fiber channel). Some in-kernel...

Closing the Audit Log

After all necessary log messages have been written to an audit buffer, audit_log_end needs to be called to ensure that the audit log is sent to the userspace daemon. The code flow diagram for the function can be found in Figure 19-6. Figure 19-6 Code flow diagram for audit_log_end. After performing another rate check (if messages have been submitted too frequently, then the present message is lost and a ''rate limit exceeded'' message is sent to the daemon instead), the socket buffer associated...

Dispatching and Parameter Passing

System calls are uniquely identified by a number assigned by the kernel. This is done for practical reasons that become clear when system calls are activated. All calls are handled by a single central piece of code that uses the number to dispatch a specific function by reference to a static table. The parameters passed are also handled by the central code so that parameter passing is implemented independently of the actual system call. Switching from user to kernel mode and therefore to...

Wakeup Preemption

When tasks are woken up in try_to_wake_up and wake_up_new_task, the kernel uses check_preempt_curr to see if the new task can preempt the currently running one. Notice that the core scheduler is not involved in this process For completely fair handled tasks, the function check_ preempt_wakeup performs the desired check. The newly woken task need not necessarily be handled by the completely fair scheduler. If the new task is a real-time task, rescheduling is immediately requested because...

Object Management and Reference Counting

All over the kernel, the need to keep track of instances of C structures arises. Despite the fact that these objects will be used in the most varying forms, some operations are very similar across subsystems just consider reference counting. This leads to code duplication. Since this is a bad thing, the kernel has adopted generic methods to manage kernel objects during the development of 2.5. The framework is, however, not just required to prevent code duplication. It also allows for providing...

Directory Entries

Directory entries are represented by struct sysfs_dirent as defined in . It is the main data structure of sysfs the whole implementation is centered around it. Each sysfs node is represented by a single instance of sysfs_dirent. The definition is as follows atomic_t s_count atomic_t s_active struct sysfs_dirent *s_parent struct sysfs_dirent *s_sibling const char *s_name struct sysfs_elem_dir s_dir struct sysfs_elem_symlink s_symlink struct sysfs_elem_attr s_attr struct sysfs_elem_bin_attr...

Echo Client

The source code for the echo client is as follows5 * Host and port number of the echo server * char* echo_host 192.168.1.20 int echo_port 7 int sockfd (struct sockaddr_in*)malloc(sizeof(struct sockaddr_in)) * Set address of server to be connected * server- sin_family AF_INET server- sin_port htons(echo_port) Note network byte order server- sin_addr.s_addr inet_addr(echo_host) * Create a socket (Internet address family, stream socket and default protocol) * sockfd socket(AF_INET, SOCK_STREAM, 0)...

Determining Page Activity

The kernel must track not only whether a page is actually used by one or more processes, but also how often it is accessed in order to assess its importance. As only very few architectures support a direct access counter for memory pages, the kernel must resort to other means and has therefore introduced two page flags named referenced and active. The corresponding bit values are PG_referenced and PG_active, and the usual set of macros as discussed in Section 3.2.2 is available to set or...

Automatic Loading

Generally, module loading is initiated from userspace, either by the user or by means of automated scripts. To achieve greater flexibility in the handling of modules and to improve transparency, the kernel itself is also able to request modules. Where is the catch It is not difficult for the kernel to insert the binary code once it has access to it. However, it cannot do this without further help from userspace. The binary file must be localized in the filesystem, and dependencies must be...

The Mount System Call

Function The Kernel Flowchart

The point of entry for the mount system call is the sys_mount function defined in fs namespace.c. Figure 8-5 shows the associated code flow diagram. Figure 8-5 Code flow diagram for sys_mount. Figure 8-5 Code flow diagram for sys_mount. The approach described here is used only to mount a new filesystem in an existing root filesystem. A modified version of the above algorithm mounts the root filesystem itself, but is not sufficiently interesting to merit a separate description (its code can be...

Inode Operations

The kernel provides a large number of functions for performing operations on inodes. A set of function pointers is defined to abstract the operations because data are manipulated by the specific filesystem implementation. The call interface always remains the same, although the actual work is carried out by implementation-specific functions. The inode structure has two pointers (i_op and i_fop) to arrays that implement the above abstraction. One relates to the inode-specific operations, and the...

Data Structures Rto

Since the structure of an extended attribute is very simple, the kernel does not provide a specific data structure to encapsulate the name value pairs instead, a simple string is used to represent the name, while a void-pointer denotes the area in memory where the value resides. Nevertheless, there need to be methods that set, retrieve, remove, and list the extended attributes. Since these operations are mode-specific, they are integrated into struct inode_operations int (*setxattr) (struct...

Sending Packets Fws

As seen from the TCP layer, the sending of TCP packets begins with the invocation of the tcp_sendmsg function by higher network instances. Figure 12-31 shows the associated code flow diagram. Figure 12-31 Code flow diagram for tcp_sendmsg. Figure 12-31 Code flow diagram for tcp_sendmsg. Naturally, the state of the socket used must be TCP_ESTABLISHED before data transmission can begin. If this is not the case, the kernel waits (with the help of wait_for_tcp_connect) until a connection has been...

Calculation of Zone Watermarks

Before calculating the various watermarks, the kernel first determines the minimum memory space that must remain free for critical allocations. This value scales nonlinearly with the size of the available RAM. It is stored in the global variable min_free_kbytes. Figure 3-4 provides an overview of the scaling behavior, and the inset which does not use a logarithmic scale for the main memory size in contrast to the main graph shows a magnification of the region up to 4 GiB. Some exemplary values...

Deleting Inodes

Both directory inodes and file inodes can be deleted, and, from the perspective of the filesystem, both actions are much simpler than allocating inodes. Let us first look at how directories are deleted. After the appropriate system call (rmdir) has been invoked, the code meanders through the kernel and finally arrives at the rmdir function pointer of the inode_operations structure, which, for the Ext2 filesystem, contains ext2_rmdir from fs ext2 namei.c. 18In terms of compatibility with old...

Atomic Operations on Integers

The kernel defines the atomic_t data type (in ) as the basis for atomic operations with integer counters. In the view of the kernel, these are operations that are performed as if they consisted of a single assembly language instruction. A short example, in which a counter is incremented by 1, is sufficient to demonstrate the need for operations of this kind. On the assembly language level, incrementation is usually performed in three steps 1. The counter value is copied from memory into a...

Core Scheduler Changes

Besides the additions discussed above, some changes to the existing methods are required in the core scheduler on SMP systems. While numerous small details change all over the place, the most important differences as compared to uniprocessor systems are the following When a new process is started with the exec system call, a good opportunity for the scheduler to move the task across CPUs arises. Naturally, it has not been running yet, so there cannot be any negative effects on the CPU cache by...

Low Latency

Naturally, the kernel is interested in providing good latency times even if kernel preemption is not enabled. This can, for instance, be important in network servers. While the overhead introduced by kernel preemption is not desired in such an environment, the kernel should nevertheless respond to important events with reasonable speed. If, for example, a network request comes in that needs to be serviced by a daemon, then this should not be overly long delayed by some database doing heavy I O...

Layout of the Process Address Space

The virtual address space is populated by a number of regions. How they are distributed is architecture-specific, but all approaches have the following elements in common The binary code of the code currently running. This code is normally referred to as text and the area of virtual memory in which it is located as a text segment.1 The code of dynamic libraries used by the program. 1 This is not the same as a hardware segment, which is featured in some architectures and acts as a separate...

Querying the Module License

Technically insignificant but important from a legal point of view the module license can now be read from the .modinfo section and placed in the module data structure 16This is not quite correct because the kernel also defines a specific order for the various sections on the basis of their flags. However, I need not discuss this here. set_license(mod, get_modinfo(sechdrs, infoindex, license )) set_license checks whether the license used is GPL-compatible (by comparing its name with the string...

Audit Rules

How is it possible to place constraints on the types of events for which an audit log record can be generated This is done with the help of audit rules, and this chapter discusses their format and purpose. However, you should also consult the manual pages that accompany the audit framework especially 8(auditctl) for more information. In general, an audit rule consists of the following components The basic information is given by a filter value pair. The filter denotes the kind of event to which...

Extensions to taskstruct

Every process in the system is represented by an instance of struct task_struct, as discussed in Chapter 2. A pointer member of the structure is used to equip a process with an audit context as follows struct audit_context *audit_context Figure 19-2 Data structures used by the audit mechanism. Note that audit_context may well be a null pointer. This is because an instance of audit_context is allocated only if system call auditing is requested for a specific process. If no auditing is to be...

Representation of Regions

Each region is represented by an instance of vm_area_struct, which is defined in simplified form as follows struct mm_struct vm_mm The address space we belong to. unsigned long vm_start Our start address within vm_mm. unsigned long vm_end The first byte after our end address linked list of VM areas per task, sorted by address struct vm_area_struct vm_next pgprot_t vm_page_prot Access permissions of this VMA. unsigned long vm_flags Flags, listed below. For areas with an address space and backing...

Audit Context Allocation

First of all, you need to consider under which circumstances such contexts are allocated. Since this is an expensive operation, it is only performed if system call auditing was explicitely enabled. If this is the case, copy_process (i.e., originating from the fork system call) is the place where audit_alloc is called to allocate a new instance of struct audit_context. Figure 19-7 shows the code flow diagram for audit_context. Figure 19-7 Code flow diagram for audit_context. Figure 19-7 Code...

Creating a Reverse Mapping

When a reverse mapping is created, it is necessary to distinguish between two alternatives anonymous pages and pages with file-based mappings. This is understandable because the data structures used to manage both alternatives also differ. The information below only covers working with page instances to be inserted into the reverse mapping scheme. Other parts of the kernel are responsible for adding the relevant vm_area_structs to the data structures discussed above (priority tree and anon...

Using Reverse Mapping

The real benefits of reverse mapping do not become clear until Chapter 18, which examines the implementation of swapping. There we will see that the kernel defines the try_to_unmap function, which is invoked to delete a specific physical page from the page tables of all processes by which the page is used. It is apparent that this is only possible with the data structures just described. Nevertheless, the implementation is influenced by many details of the swap layer, and this is why I won't go...

D21 Subsystem Initialization

Linux Subsystem Diagram

Figure D-1 shows a code flow diagram to briefly illustrate the function's tasks and goals. The first step is to output the version message. The message text is held in the linux_banner global variable defined in init version.c. This is followed by a further architecture-specific initialization step, which no longer deals with lower-level processor details but is written in C and which, on most systems, has the primary task of setting the framework for initialization of high-level memory...

Figure 316 Code flow diagram for paginginit

Pagetable_init first initializes the page tables of the system using swapper_pg_dir as a basic (this variable was previously used to hold the provisional data). Two extensions available on all modern IA-32 variants are then enabled (only a few very old Pentium implementations do not support these). Support for large memory pages. The size of specially marked pages is 4 MiB instead of the usual 4 KiB. This option is used for kernel pages because they are never swapped out. Increasing the page...

Processor Cache and TLB Control

Caches are crucial in terms of overall system performance, which is why the kernel tries to exploit them as effectively as possible. It does this primarily by skillfully aligning kernel data in memory. A judicious mix of normal functions, inline definitions, and macros also helps extract greater performance from the processor. The compiler optimizations discussed in Appendix C also make their contribution. However, the above aspects affect the cache only indirectly. Use of the correct alignment...

Network Layer

Layers Linux

The network access layer is still quite strongly influenced by the properties of the transmission medium and the device drivers of the associated adapters. The network layer (and therefore specifically the IP Internet protocol) is almost totally divorced from the hardware properties of the network adapters. Why only almost As you will see shortly, the layer is responsible not only for sending and receiving data, but also for forwarding and routing packets between systems not directly connected...

A7 Thread Representation

The state of a running process is defined primarily by the contents of the processor registers. Processes that are not currently running must keep this data in corresponding data structures from which the data can be read and moved to the appropriate registers when the process is next activated by the scheduler. The structures needed to do this are defined in the following files < asm-arch ptrace.h> provides the pt_regs structure to hold all registers that are placed on the kernel stack...

A103 Finding the Current Process

The purpose of the current macro is to find a pointer to the task structure of the currently running process. It must be declared by each architecture in < asm-arch current.h> . The pointer is held in a separate processor register that can be queried directly or indirectly using current. The registers reserved by the individual architectures are listed in Table A-1. Table A-1 Registers Holding Pointers to the Current task_struct or thread_info Instance Note that the register is used for...

Block Device Operations

Block devices account for the second large group of peripherals supported via the VFS interface of the kernel. Unfortunately, the situation faced by block device drivers is more complicated than that for character devices. This is caused by a range of circumstances, above all by the need for continuous speed adjustment occasioned by the design of the block device layer, by the way in which block devices work, and by the historical development of the block device layer. Block devices differ...

Setting Timers

Setting a new timer is a two-step process 1. hrtimer_init is used to initialize a hrtimer instance. < hrtimer.h> void hrtimer_init(struct hrtimer *timer, clockid_t which_clock, enum hrtimer_mode mode) timer denotes the affected high-resolution timer, clock is the clock to bind the timer to, and mode specifies if absolute or relative time values (relative to the current time) are used. Two constants are available for selection < hrtimer.h> * Time value is absolute * * Time value is...

Updating Jiffies

The global tick device calls tick_do_update_jiffies64 to update the global jiffies_64 variable, the basis of low-resolution timer handling. When periodic ticks are in use, this is comparatively simple because the function is called whenever a jiffy has passed. When dynamic ticks are enabled, the situation can arise in which all CPUs of the system are idle and none provides global ticks. This needs to be taken into account by tick_do_update_jiffies64. Let's go directly to the code to see how...

Working with LXR

LXR provides the following functions for viewing kernel components Directories of the source tree can be traversed and files can be selected by name using source navigation. Kernel source files can be displayed in hypertext representation using file view. Positions at which symbols are defined or used can be found with an identifier search. Figure B-5 shows what the output of the vfs_follow_link function looks like. Fils Edil Ew HiTlory BooVmarki TbcI* Help - lit3 j* e wn5U5E 'ettin.g Stoitad...

Generic Hard Disks and Partitions

While struct block_device represents a block device toward the device driver level, another abstraction emphasizes the connection with generic kernel data structures. From this point of view, the block devices by themselves are not interesting. Instead, the notion of a hard disk, possibly with subpartitions, is more useful. The partition information on a device is independent of the block_device instances that represent the partitions. Indeed, when a disk is added to the system, the kernel...

Opening Block Device Files

When a user application opens the device file of a block device, the virtual filesystem invokes the open function of the file_operations structure and arrives at blkdev_open. Figure 6-13 shows the associated code flow diagram. 9Note that this implies that reading data from the block device is already required to work when the device is registered with disk_add Figure 6-13 Code flow diagram for blkdev_open. Figure 6-13 Code flow diagram for blkdev_open. bd_acquire first finds the matching...

Request Structure

The kernel provides its own data structure to describe a request to a block device. sector_t sector * next sector to submit * sector_t hard_sector * next sector to complete * unsigned long nr_sectors * no. of sectors left to submit * unsigned long hard_nr_sectors * no. of sectors left to complete * * no. of sectors left to submit in the current segment * unsigned int current_nr_sectors * no. of sectors left to complete in the current segment * unsigned int hard_cur_sectors struct bio *bio...

Implementation of loctls

Linux Kernel Architecture

Ioctls permit the use of special device-specific functions that cannot be accessed by normal read and write operations. This form of support is implemented by means of the ioctl system call that can be used with regular files (detailed descriptions of how it is used are provided in the many system programming manuals). As expected, the system call is implemented in sys_ioctl, but the main work is done in vfs_ioctl. Figure 6-19 shows the associated code flow diagram. Figure 6-19 Code flow...

C210 Radix Trees

The second tree implementation provided in library form in the kernel makes use of radix trees to organize data in memory. Radix trees differ from other trees because it is not necessary to compare the entire key at every branch, but only part of the key with the stored value of the node when performing search operations. This results in slightly different worst-case and average-case behavior than in other implementations, which are described in detail in the corresponding textbooks on...

Clock Sources

First of all, consider how time values are acquired from the various sources present in a machine. The kernel defines the abstraction of a clock source for this purpose cycle_t (*read)(void) cycle_t mask u32 mult u32 shift A human-readable name for the source is given in name, and list is a standard list element that connect all available clock sources on a standard kernel list. Not all clocks are of the same quality, and the kernel obviously wants to select the best possible one. Thus, every...

Kernel Preemption

As described above, the scheduler is invoked before returning to user mode after system calls or at certain designated points in the kernel. This ensures that the kernel, unlike user processes, cannot be interrupted unless it explicitly wants to be. This behavior can be problematic if the kernel is in the middle of a relatively long operation this may well be the case with filesystem, or memory-management-related tasks. The kernel is executing on behalf of a specific process for a long amount...