Linux Kernel Architecture

Copy on Write

Tue, 04 Jul 2023 | Linux Kernel Architecture | 1 comment

The kernel uses the copy-on-write technique (COW) to prevent all data of the parent process from being copied when fork is executed. This technique exploits the fact that processes normally use only a fraction of their pages in memory.8 When fork is called, the kernel would usually create an identical copy of each memory page of the parent process for the child process. This has two very negative effects 1. A large amount of RAM, a scarce resource, is used. 2. The copy operation takes a long...

Representation of Block Devices

Fri, 30 Jun 2023 | Linux Kernel Architecture | 1 comment

Block devices have a set of characteristic properties managed by the kernel. The kernel uses what is known as request queue management to support communication with devices of this kind as effectively as possible. This facility enables requests to read and write data blocks to be buffered and rearranged. The results of requests are likewise kept in caches so that the data can be read and reread in a very efficient way. This is useful when a process repeatedly accesses the same part of a file or...

Applying Patches

Thu, 29 Jun 2023 | Linux Kernel Architecture

A patch is a collection of diffs that reside in a common file. For example, patches that contain the differences between two kernel versions are made available at www.kernel.org for updating purposes. It is then no longer necessary to download the entire source tree, thus saving time and bandwidth. 6If two files are totally different, diff creates a single large hunk that covers the whole file. 7Patches are referred to as orthogonal if they do not mutually influence each other that is, if a...

Lock Contention and Fine Grained Locking

Thu, 29 Jun 2023 | Linux Kernel Architecture

After having discussed the numerous locking primitives provided by the kernel, let us briefly address some of the problems related to locking and kernel scalability. While multiprocessor systems were nearly completely unknown to the average user only a decade ago, they are present on nearly every desktop today. Scalability of Linux on systems with more than a single CPU has therefore become a very important goal. This needs especially to be taken into account when locking rules are designed for...

Edge Triggered Interrupts

Thu, 29 Jun 2023 | Linux Kernel Architecture | 2 comments

Edge-triggered interrupts are most common on the majority of today's hardware, so I consider this type first. The default handler is implemented in handle_edge_irq. The code flow diagram is shown in Figure 14-5. Edge-triggered IRQs are not masked when they are processed in contrast to level-triggered IRQs, there is no need to do so. This has one important implication for SMP systems When an IRQ is handled on one CPU, another IRQ with the same number can appear on another CPU that we denote as...

Level Triggered Interrupts

Thu, 29 Jun 2023 | Linux Kernel Architecture | 1 comment

Level-triggered interrupts are a little easier to process than their edge-triggered relatives. This is also reflected in the code flow diagram of the flow handler handle_level_irq, which is depicted in Figure 14-6. Figure 14-6 Code flow diagram for handle_level_irq. Figure 14-6 Code flow diagram for handle_level_irq. Note that level-triggered interrupts must be masked when they are processed, so the first thing that needs to be done is to call mask_ack_irq. This auxiliary function masks and...

The pdflush Mechanism

Thu, 29 Jun 2023 | Linux Kernel Architecture

The pdflush mechanism is implemented in a single file mm pdflush.c. This contrasts with the fragmented implementation of the synchronization mechanisms in earlier versions. pdflush is started with the usual kernel thread mechanisms mm pdflush.c static void start_one_pdflush_thread(void) kthread_run(pdflush, NULL, pdflush) start_one_pdflush starts a single pdflush thread however, the kernel uses several threads at the same time in general, as you will see below. It should be noted that a...

Idle and init Thread

Thu, 29 Jun 2023 | Linux Kernel Architecture

The last two actions of start_kernel are as follows 1. rest_init starts a new thread that, after performing a few more initialization operations as described the next step, ultimately calls the userspace initialization program sbin init. 2. The first, and formerly only, kernel thread becomes the idle thread that is called when the system has nothing else to do. rest_init is essentially implemented in just a few lines of code init main.c kernel_thread(kernel_init, NULL, CLONE_FS CLONE_SIGHAND)...

Page Frames

Mon, 19 Jun 2023 | Linux Kernel Architecture

Page frames represent the smallest unit of system memory, and an instance of struct page is created for each page in RAM. Kernel programmers take care to keep this structure as small as possible because the memory of systems even with a moderate RAM configuration is broken down into a very large number of pages. For instance, an IA-32 system working with a standard page size of 4 KiB has around 100,000 pages given a main memory size of 384 MiB. Although this memory size is certainly not...

Managing the Heap

Fri, 16 Jun 2023 | Linux Kernel Architecture

Managing the heap the memory area of a process used to dynamically allocate variables and data is not directly visible to application programmers because it relies on various helper functions of the standard library the most important of which is malloc to reserve memory areas of any size. The classic interface between malloc and the kernel is the brk system call that expands shrinks the heap. Recent malloc implementations such as those of the GNU standard library now use a combined approach...

Types of Timers

Tue, 13 Jun 2023 | Linux Kernel Architecture | 2 comments

The timing subsystem of the kernel has grown tremendously during the development of 2.6. For the initial releases, the timer subsystem consisted solely of what are now known as low-resolution timers. Essentially, low-resolution timers are centered around a periodic tick which happens at regular intervals. Events can be scheduled to be activated at one of these intervals. Pressure to extend this comparatively simple framework came predominantly from two sources Devices with limited power (i.e.,...

Swapping Pages in

Mon, 05 Jun 2023 | Linux Kernel Architecture

You already know from Chapter 4 that page faults as a result of accessing a swapped-out page are handled by do_swap_page from mm memory.c. As the associated code flow diagram in Figure 18-19 shows, it is much easier to swap a page in than to swap it out, but it still involves more than just a simple read operation. Figure 18-19 Code flow diagram for do_swap_page. Figure 18-19 Code flow diagram for do_swap_page. The kernel must not only check whether the requested page is still or already in the...

Request Queues

Fri, 26 May 2023 | Linux Kernel Architecture

Read and write requests to block devices are placed on a queue known as a request queue. The gendisk structure includes a pointer to the device-specific queue, which is represented by the following data type. Together with queue_head for cacheline sharing struct request_list rq Queue request freelists prep_rq_fn prep_rq_fn unplug_fn unplug_fn merge_bvec_fn merge_bvec_fn prepare_flush_fn prepare_flush_fn softirq_done_fn softirq_done_fn unplug_timer unplug_thresh unplug_delay unplug_work After...

Software Interrupts

Fri, 26 May 2023 | Linux Kernel Architecture | 2 comments

Software interrupts enable the kernel to defer tasks. Because they function in a similar way to the interrupts described above but are implemented fully in the software, they are logically enough known as software interrupts or softlRQs. The kernel is informed of an anomalous condition by means of a software interrupt, and the situation is resolved at some later time by special handler routines. As already noted, the kernel services all pending software interrupts at the end of do_IRQ so that...

The SoftIRQ Daemon

Fri, 26 May 2023 | Linux Kernel Architecture

The task of the softIRQ daemon is to execute softIRQs asynchronously to remaining kernel code. To this end, each system processor is allocated its own daemon named ksoftirqd. wakeup_softirqd is invoked at two points in the kernel to wake up the daemon In do_softirq, as just mentioned. At the end of raise_softirq_irqoff. This funtion is called by raise_softirq internally, and can also be used directly if the kernel has interrupts turned off at the moment. The wake-up function itself can be dealt...

The Main Scheduler

Wed, 24 May 2023 | Linux Kernel Architecture

The main scheduler function (schedule) is invoked directly at many points in the kernel to allocate the CPU to a process other than the currently active one. After returning from system calls, the kernel also checks whether the reschedule flag tif_need_resched of the current process is set for example, the flag is set by scheduler_tick as mentioned above. If it is, the kernel invokes schedule. The function then assumes that the currently active task is definitely to be replaced with another...

Creating the Layout

Tue, 02 May 2023 | Linux Kernel Architecture

The address space of a task is laid out when an ELF binary is loaded with load_elf_binary recall that the function is used by the exec system call. Loading an ELF file is cluttered with numerous technical details that are not interesting for our purposes, so the code flow diagram in Figure 4-3 concentrates on the steps required to set up the virtual memory region. Figure 4-3 Code flow diagram for load_elf_binary. Address space randomization is enabled if the global variable randomize_va_space...

System Call Tracing

Tue, 02 May 2023 | Linux Kernel Architecture

The strace tool developed to trace the system calls of processes using the ptrace system call is described in Section 13.1.1. Implementation of the sys_ptrace handler routine is architecture-specific and is defined in arch arch kernel ptrace.c. Fortunately, there are only minor differences between the code of the individual versions. I therefore provide a generalized description of how the routine works without going into architecture-specific details. Before examining the flow of the system...

Three Way Handshake

Tue, 02 May 2023 | Linux Kernel Architecture | 4 comments

A connection must be established explicitly between a client and a host before a TCP link can be used. As already noted, a distinction is made between active and passive connection setup. The kernel (i.e., the kernel of both machines involved in the connection) sees the following situation immediately prior to connection establishment the state of the client process socket is CLOSED, that of the server socket is LISTEN. A TCP connection is set up by means of a procedure that involves the...

Handling of Page Faults

Tue, 02 May 2023 | Linux Kernel Architecture

The association between virtual and physical memory is not established until the data of an area are actually needed. If a process accesses a part of virtual address space not yet associated with a page in memory, the processor automatically raises a page fault that must be handled by the kernel. This is one of the most important and complex aspects of memory management simply because a myriad of details must be taken into account. For example, the kernel must ascertain the following Was the...

Overview Dfw

Tue, 02 May 2023 | Linux Kernel Architecture

There is a clear relationship between flushing, swapping, and releasing pages. Not only the state of memory pages but also the size of free memory needs checking regularly. When this is done, unused or seldom used pages are swapped out automatically but not before the data they hold have been synchronized with the backing store to prevent data loss. In the case of dynamically generated pages, the system swap areas act as the backing stores.

Using System V Semaphores

Tue, 02 May 2023 | Linux Kernel Architecture

The System V interface for using semaphores is anything but intuitive because the concept of a semaphore has been expanded well beyond its actual definition. Semaphores are no longer treated as simple variables to support atomic execution of predefined operations. Instead, a System V semaphore now refers to a whole set of semaphores, which allows not just one but several operations to be performed at the same time (although they appear to be atomic to users). It is, of course, possible to...

The Migration Thread

Tue, 02 May 2023 | Linux Kernel Architecture

The migration thread serves two purposes It must fulfill migration requests originating from the scheduler, and it is used to implement active balancing. This is handled in a kernel thread that executes migration_thread. The code flow diagram for the function is shown in Figure 2-27. Figure 2-27 Code flow diagram for migration_thread. Figure 2-27 Code flow diagram for migration_thread. migration_thread runs an infinite loop and sleeps when there is nothing to do. First of all, the function...

Dynamic Ticks for High Resolution Systems

Tue, 02 May 2023 | Linux Kernel Architecture

Since clock event devices run in one-shot mode anyway if the kernel uses high timer resolution, support for dynamic ticks is much easier to implement than in the low-resolution case. Recall that the periodic tick is emulated by tick_sched_timer as discussed above. The function is also used to implement dynamic ticks. In the discussion in Section 15.4.4,1 omitted two elements required for dynamic ticks 1. Since CPUs can drop global tick duties, the handler needs to check if this has been the...

Standard Functions

Fri, 28 Apr 2023 | Linux Kernel Architecture

Several auxiliary functions that ease handling dentry objects are provided by the kernel. Their implementation is mostly an exercise in list management and data structure handling, so I won't bother to discuss their code. It is, however, important to show their prototypes and describe their effect since we will come across them frequently in discussing implementation of VFS operations. The following auxiliary functions require a pointer to struct dentry as parameter. Each performs one simple...

Socket Buffers

Tue, 04 Apr 2023 | Linux Kernel Architecture

(*destructor)(struct sk_buff *skb) iif sk_buff_data_t sk_buff_data_t sk buff data t * These elements must be at the end, see alloc_skb() for details. * sk_buff_data_t tail Socket buffers are used to exchange data between the network implementation levels without having to copy packet data to and fro this delivers considerable speed gains. The socket structure is one of the cornerstones of the network layer because it is processed on all levels both when packets are analyzed and generated.

This area may be larger than actually needed because it is not clear how big packets will be when they are synthesized

Tue, 04 Apr 2023 | Linux Kernel Architecture

data and tail point to the start and end of the protocol data area. Figure 12-6 Link between socket buffer and network packet data. Figure 12-6 Link between socket buffer and network packet data. mac_header points to the start of the MAC header, while network_header and transport_header point to the header data of the network and transport layer, respectively. On systems with 32-bit word length, the data type sk_buff_data_t that is used for the various data components is a simple pointer...

The offset is given as a byte offset into the file not as a page offset into the page cache This allows filesystem code

Wed, 29 Mar 2023 | Linux Kernel Architecture

The most important providers of this value are, however, do_generic_mapping_read and filemap_fault. The routine ondemand_readahead is responsible to implement readahead policy, that is, decide how many pages will be read in before they are actually required. As Figure 16-5 shows, both page_cache_sync_readahead and page_cache_async_readahead rely on this function. After deciding on the size of the readahead window, ra_submit is called to delegate the technical aspects to...

Elements in the Task Structure

Tue, 28 Feb 2023 | Linux Kernel Architecture

There are several scheduling-relevant elements in the task structure of each process. < sched.h> int prio, static_prio, normal_prio unsigned int rt_priority const struct sched_class *sched_class unsigned int policy cpumask_t cpus_allowed unsigned int time_slice Not all processes on a system are equally important Less urgent tasks should receive less attention, while important work should be done as quickly as possible. To determine the importance of a particular task, it is equipped with a...

Allocation of Physical Memory

Sun, 19 Feb 2023 | Linux Kernel Architecture | 1 comment

When it allocates RAM, the kernel must keep track of which pages have already been allocated and which are still free in order to prevent two processes from using the same areas in RAM. Because memory allocation and release are very frequent tasks, the kernel must also ensure that they are completed as quickly as possible. The kernel can allocate only whole page frames. Dividing memory into smaller portions is delegated to the standard library in userspace. This library splits the page frames...

Memory Zones

Sun, 19 Feb 2023 | Linux Kernel Architecture

The kernel uses the zones structure to describe a zone. It is defined as follows * Fields commonly accessed by the page allocator * unsigned long pages_min, pages_low, pages_high unsigned struct struct * Fields commonly accessed by the page reclaim scanner * struct list_head inactive_list unsigned long nr_scan_inactive unsigned long pages_scanned * since last reclaim * unsigned long flags * zone flags, see below * * Rarely used or read-mostly fields * wait_queue_head_t * wait_table unsigned *...

Node and Zone Initialization

Sun, 19 Feb 2023 | Linux Kernel Architecture

Build_all_zonelists builds the data structures required to manage nodes and their zones. Interestingly, it can be implemented by the macros and abstraction mechanisms introduced above regardless of whether it runs on a NUMA or UMA system. This works because the executed functions are available in two flavors one for NUMA systems and one for UMA systems. Since this little trick is often used by the kernel, I will briefly discuss it. Suppose that a certain task must be performed differently...

Creating Data Structures for Each Node

Sun, 19 Feb 2023 | Linux Kernel Architecture

Once the zone boundaries have been determined, free_area_init_nodes creates the data structures for the individual zones iteratively by calling free_area_init_node. Several helper functions are required for this purpose. calculate_node_totalpages first calculates the total number of pages in the node by summing up the pages in the individual zones. In the case of contiguous memory, this could be done in zones_size_init, but calculate_zone_totalpages also takes holes in the zone into account....

Helper Functions

Sun, 19 Feb 2023 | Linux Kernel Architecture

First, we need to define some flags used by the functions to control behavior when various watermarks are reached. The first flags indicate which watermark applies when the decision is made as to whether pages can be taken or not. By default (that is, there is no absolute need for more memory because of pressure exerted by other factors), pages are taken only when the zone still contains at least zone-> pages_high pages. This corresponds to the alloc_wmark_high flag. alloc_wmark_min or _low...

Allocation Control

Sun, 19 Feb 2023 | Linux Kernel Architecture

As mentioned above,_alloc_pages is the main function of the buddy system. Now that we have dealt with all preparatory work and described all possible flags, we turn our attention to the relatively complex implementation of the function that is one of the lengthier parts of the kernel. Complexity arises above all when too little memory is available to satisfy a request or when available memory is slowly running out. If sufficient memory is available, the necessary work is quickly done as the...

Alternative Allocators

Sun, 19 Feb 2023 | Linux Kernel Architecture

Although the slab allocator works well for many possible workloads, there are naturally situations in which it fails to provide optimal performance. Problems arise when slab allocation is used on machines that range on the borders of the current hardware scale tiny embedded systems and large, massively parallel systems equipped with huge amounts of RAM. In the second case, the large amount of metadata required by the slab allocator can become a problem developers have reported that many...

Memory Management in the Kernel

Sun, 19 Feb 2023 | Linux Kernel Architecture

The general allocation and freeing functions of the kernel have similar names to their equivalents in the C standard library and are employed in exactly the same way. kmalloc(size, flags) reserves a memory area of size bytes and returns a void pointer to the start of the area. If insufficient memory is available (a very improbable situation in the kernel but one that must always be catered for), a null pointer is the result. The flags argument specifies the area from which memory is to be...

Initialization

Sun, 19 Feb 2023 | Linux Kernel Architecture

At first sight, initialization of the slab system does not appear to be especially complicated because the buddy system is already fully enabled and no other particular restrictions are imposed on the kernel. Nevertheless, there is a chicken-and-egg problem34 because of the structure of the slab allocator. To initialize the slab data structures, the kernel needs memory blocks that are much smaller than a complete page and are therefore best allocated by kmalloc. And here's the crux kmalloc only...

Periodic Reclaim with kswapd

Sun, 19 Feb 2023 | Linux Kernel Architecture

Kswapd is a kernel daemon that is activated by kswap_init each time the system is started and continues to execute for as long as the machine is running pg_data_t *pgdat NODE_DATA(nid) int ret 0 pgdat-> kswapd kthread_run(kswapd, pgdat, kswapd d, nid) return ret for_each_node_state(nid, N_HIGH_MEMORY) kswapd_run(nid) The code shows that a separate instance of kswapd is activated for each NUMA zone. On some machines, this serves to enhance system performance as different speeds of access to...

Kernel Page Faults

Sun, 19 Feb 2023 | Linux Kernel Architecture

When kernel address space is accessed, page faults can be triggered by various conditions as described below. A programming error in the kernel has caused an incorrect address to be accessed this is a genuine bug. Of course, this should never happen in stable versions20 but does occur occasionally in developer versions. The kernel accesses an invalid address passed as a system call parameter from userspace. The page fault was triggered by access to an area allocated using vmalloc. The first two...

Networking Namespaces

Tue, 07 Feb 2023 | Linux Kernel Architecture

Recall from Chapter 1 that many parts of the kernel are contained in namespaces. These allow for building multiple virtual viewpoints of the system that are separated and segregated from each other. Every instance looks like a single machine running Linux, but, in fact, many such instances can operate simultaneously on a single physical machine. During the development of 2.6.24, the kernel started to adopt namespaces also for the networking subsystem. This adds some extra complexity to the...

Support for High Speed Interfaces

Mon, 02 Jan 2023 | Linux Kernel Architecture

The previously discussed old approach to transferring packets from the network device into higher layers of the kernel works well if the devices do not support too high transmission rates. Each time a frame arrives, an IRQ is used to signalize this to the kernel. This implies a notion of ''fast'' and ''slow.'' For slow devices, servicing the IRQ is usually finished before the next packet arrives. Since the next packet is also signaled by an IRQ, failing to fulfill this condition as is often the...

Timer Activation and Process Accounting

Mon, 02 Jan 2023 | Linux Kernel Architecture

As the time base for timers, the kernel uses the timer interrupt of the processor or any other suitable periodic source. On IA-32 and AMD64 systems, the programmable interrupt timer (PIT) or the High Precision Event Timer (HPET) can be employed for this purpose. Nearly all modestly modern systems of this type are equipped with an HPET, and if one is available, it is preferred to the PIT.3 The interrupt occurs at regular intervals exactly HZ times per second. HZ is defined by an...

Implementing Timer Handling

Mon, 02 Jan 2023 | Linux Kernel Architecture

Handling of all timers is initiated in update_process_times by invoking the run_local_timers function. This limits itself to using raise_softirq TIMER_SOFTIRQ to activate the timer management softIRQ, which is executed at the next opportunity.8 run_timer_softirq is used as the handler function of the softIRQ it selects the CPU-specific instance of struct tvec_t_base_s and invokes_run_timers. _run_timers implements the algorithm described above. However, nowhere in the data structures shown is...

High Resolution Timers in High Resolution Mode

Mon, 02 Jan 2023 | Linux Kernel Architecture

Let us first assume that a high-resolution clock is up and running, and that the transition to highresolution mode is completely finished. The general situation is depicted in Figure 15-13. When the clock event device responsible for high-resolution timers raises an interrupt, hrtimer_interrupt is called as event handler. The function is responsible to select all timers that have expired and either move them to the expiration list (if they may be processed in softIRQ context) or call the...

Initializing Central Data Structures and Caches

Mon, 02 Jan 2023 | Linux Kernel Architecture

As a quick glance at the following kernel sources shows, the most substantial task of start_kernel is to invoke subroutines to initialize almost all important kernel subsystems asmlinkage void _init start_kernel(void) early_boot_irqs_on() local_irq_enable() * HACK ALERT This is early. We're enabling the console before * we've done PCI setups etc, and console_init() must be aware of * this. But we do want output early, in case something goes wrong. * * rootfs populating might need page-writeback...

Arrangement of the Kernel in Memory

Sun, 25 Dec 2022 | Linux Kernel Architecture | 1 comment

Before discussing the individual memory initialization operations, we need to examine the situation in RAM after the boot loader has copied the kernel into memory and the assembler part of the initialization routines has completed. I concentrate on the default case in which the kernel is loaded to a fixed position in physical RAM that is determined at compile time. It is also possible to configure the initial position of the kernel binary in physical RAM if the crash dump mechanism is enabled....

Structure of the Buffer Cache

Sun, 25 Dec 2022 | Linux Kernel Architecture | 2 comments

A page-oriented method has not always been used in the Linux kernel to bear the main caching burden. Earlier versions included only the buffer cache to speed file operations and to enhance system performance. This was a legacy of other Unix look-alikes with the same structure. Blocks from the underlying block devices were kept in main memory buffers to make read and write operations faster. The implementation is contained in fs buffers.c. In contrast to pages in memory, blocks are not only...

The Swap Cache

Thu, 22 Dec 2022 | Linux Kernel Architecture | 1 comment

Now that I have described the layout of the swap subsystem by reference to its data structures, let us turn our focus in the following sections on the techniques employed by the kernel to write pages from memory into a swap area and to read pages from the swap area back into memory. The kernel makes use of a further cache, known as a swap cache, that acts as a liaison between the operation to select the pages to be swapped out and the mechanism that actually performs swapping. At first sight,...

Professional Linux Kernel Architecture

Fri, 02 Dec 2022 | Linux Kernel Architecture

10475 Crosspoint Boulevard Indianapolis, IN 46256 www.wiley.com Copyright 2008 by Wolfgang Mauerer Published by Wiley Publishing, Inc., Indianapolis, Indiana Manufactured in the United States of America 10 987654321 Library of Congress Cataloging-in-Publication Data Professional Linux kernel architecture Wolfgang Mauerer. p. cm. Includes index. ISBN 978-0-470-34343-2 (pbk.) 1. Linux. 2. Computer architecture. 3. Application software. I. Title. QA76.9.A73M38 2008 005.4'32--dc22 No part of this...

Device File Elements in I nodes

Fri, 02 Dec 2022 | Linux Kernel Architecture

Each file in the virtual file system is associated with just one inode that manages the file properties. Since the inode data structure is very lengthy, I don't reproduce it in full here but only include the elements relevant to device drivers. struct file_operations *i_fop union 5When register_chrdev is used, no handling of struct cdev is necessary since this is automatically managed. The reason is simple The cdev abstraction was not available in the kernel at the time register_chrdev was...

Representation of proc Entries

Fri, 02 Dec 2022 | Linux Kernel Architecture

Each entry in the proc filesystem is described by an instance of proc_dir_entry whose (abbreviated) definition is as follows unsigned int low_ino unsigned short namelen const char *name mode_t mode nlink_t nlink uid_t uid gid_t gid loff_t size struct inode_operations * proc_iops const struct file_operations * proc_fops get_info_t *get_info struct module *owner struct proc_dir_entry *next, *parent, *subdir void *data read_proc_t *read_proc write_proc_t *write_proc Because each entry is given a...

Creating and Registering Entries

Fri, 02 Dec 2022 | Linux Kernel Architecture

New entries are added to the proc filesystem in two steps. First, a new instance of proc_dir_entry is created together with all information needed to describe the entry. This instance is then registered in the data structures of proc so that it is visible to the outside. Because the two steps are never carried out independently of each other, the kernel makes auxiliary functions available to combine both actions so that new entries can be generated quickly and easily. The most frequently used...

Selection According to PID

Fri, 02 Dec 2022 | Linux Kernel Architecture

Let us turn our attention to how the process-specific information is selected by PID. If a PID is passed to proc_pid_lookup instead of self, the course of the lookup operation is as shown in the code flow diagram in Figure 10-4. Because filenames are always processed in the form of strings but PIDs are integer numbers, the former must be converted accordingly. The kernel provides the name_to_int auxiliary function to convert strings consisting of digits into an integer. The information obtained...

Writing Sequential File Handlers

Fri, 02 Dec 2022 | Linux Kernel Architecture

Basically, an instance of struct file_operations that provides pointers to some seq_ routines must be implemented to benefit from the sequential file standard implementation. The kprobes subsystem does this as follows static struct file_operations debugfs_kprobes_operations .open kprobes_open, This instance of file_operations can be associated with a file by the methods discussed in Chapter 8. In the case of kprobes, the file will be created in the debugging filesystem see Section 10.2.3. The...

File and Directory Operations

Fri, 02 Dec 2022 | Linux Kernel Architecture

Since sysfs exposes its data structures in a filesystem, most interesting operations can be triggered with standard filesystem operations. The functions that implement the filesystem operations thus serve as a glue layer between sysfs and the internal data structures. As for every filesystem, the methods used for operations on files are collected in an instance of struct file_operations. For sysfs, the following selection is available const struct file_operations sysfs_file_operations .read...

Nonlinear Mappings

Thu, 03 Nov 2022 | Linux Kernel Architecture

As just demonstrated, normal mappings map a continuous section from a file into a likewise continuous section of virtual memory. If various parts of a file are mapped in a different sequence into an otherwise contiguous area of virtual memory, it is generally necessary to use several mappings, which is more costly in terms of resources (particularly in vm_area_structs). A simpler way of achieving the same result11 is to use nonlinear mappings as introduced during the development of 2.5. The...

Registering Network Devices

Mon, 31 Oct 2022 | Linux Kernel Architecture

Each network device is registered in a two-step process 1. alloc_netdev allocates a new instance of struct net_device, and a protocol-specific function fills the structure with typical values. For Ethernet devices, this function is ether_setup. Other protocols (not considerd in detail) use XXX_setup, where possible values for XXX include fddi (fiber distributed data), tr (token ring), ltalk (localtalk), hippi (high-performance parallel interface), or fc (fiber channel). Some in-kernel...

Closing the Audit Log

Wed, 05 Oct 2022 | Linux Kernel Architecture

After all necessary log messages have been written to an audit buffer, audit_log_end needs to be called to ensure that the audit log is sent to the userspace daemon. The code flow diagram for the function can be found in Figure 19-6. Figure 19-6 Code flow diagram for audit_log_end. After performing another rate check (if messages have been submitted too frequently, then the present message is lost and a ''rate limit exceeded'' message is sent to the daemon instead), the socket buffer associated...

Dispatching and Parameter Passing

Wed, 05 Oct 2022 | Linux Kernel Architecture

System calls are uniquely identified by a number assigned by the kernel. This is done for practical reasons that become clear when system calls are activated. All calls are handled by a single central piece of code that uses the number to dispatch a specific function by reference to a static table. The parameters passed are also handled by the central code so that parameter passing is implemented independently of the actual system call. Switching from user to kernel mode and therefore to...

Wakeup Preemption

Wed, 28 Sep 2022 | Linux Kernel Architecture

When tasks are woken up in try_to_wake_up and wake_up_new_task, the kernel uses check_preempt_curr to see if the new task can preempt the currently running one. Notice that the core scheduler is not involved in this process For completely fair handled tasks, the function check_ preempt_wakeup performs the desired check. The newly woken task need not necessarily be handled by the completely fair scheduler. If the new task is a real-time task, rescheduling is immediately requested because...

Object Management and Reference Counting

Mon, 05 Sep 2022 | Linux Kernel Architecture

All over the kernel, the need to keep track of instances of C structures arises. Despite the fact that these objects will be used in the most varying forms, some operations are very similar across subsystems just consider reference counting. This leads to code duplication. Since this is a bad thing, the kernel has adopted generic methods to manage kernel objects during the development of 2.5. The framework is, however, not just required to prevent code duplication. It also allows for providing...

Directory Entries

Mon, 05 Sep 2022 | Linux Kernel Architecture

Directory entries are represented by struct sysfs_dirent as defined in . It is the main data structure of sysfs the whole implementation is centered around it. Each sysfs node is represented by a single instance of sysfs_dirent. The definition is as follows atomic_t s_count atomic_t s_active struct sysfs_dirent *s_parent struct sysfs_dirent *s_sibling const char *s_name struct sysfs_elem_dir s_dir struct sysfs_elem_symlink s_symlink struct sysfs_elem_attr s_attr struct sysfs_elem_bin_attr...

Echo Client

Mon, 05 Sep 2022 | Linux Kernel Architecture

The source code for the echo client is as follows5 * Host and port number of the echo server * char* echo_host 192.168.1.20 int echo_port 7 int sockfd (struct sockaddr_in*)malloc(sizeof(struct sockaddr_in)) * Set address of server to be connected * server- sin_family AF_INET server- sin_port htons(echo_port) Note network byte order server- sin_addr.s_addr inet_addr(echo_host) * Create a socket (Internet address family, stream socket and default protocol) * sockfd socket(AF_INET, SOCK_STREAM, 0)...

Determining Page Activity

Tue, 30 Aug 2022 | Linux Kernel Architecture

The kernel must track not only whether a page is actually used by one or more processes, but also how often it is accessed in order to assess its importance. As only very few architectures support a direct access counter for memory pages, the kernel must resort to other means and has therefore introduced two page flags named referenced and active. The corresponding bit values are PG_referenced and PG_active, and the usual set of macros as discussed in Section 3.2.2 is available to set or...

Automatic Loading

Sat, 13 Aug 2022 | Linux Kernel Architecture

Generally, module loading is initiated from userspace, either by the user or by means of automated scripts. To achieve greater flexibility in the handling of modules and to improve transparency, the kernel itself is also able to request modules. Where is the catch It is not difficult for the kernel to insert the binary code once it has access to it. However, it cannot do this without further help from userspace. The binary file must be localized in the filesystem, and dependencies must be...

The Mount System Call

Sat, 13 Aug 2022 | Linux Kernel Architecture

The point of entry for the mount system call is the sys_mount function defined in fs namespace.c. Figure 8-5 shows the associated code flow diagram. Figure 8-5 Code flow diagram for sys_mount. Figure 8-5 Code flow diagram for sys_mount. The approach described here is used only to mount a new filesystem in an existing root filesystem. A modified version of the above algorithm mounts the root filesystem itself, but is not sufficiently interesting to merit a separate description (its code can be...

Inode Operations

Sat, 13 Aug 2022 | Linux Kernel Architecture

The kernel provides a large number of functions for performing operations on inodes. A set of function pointers is defined to abstract the operations because data are manipulated by the specific filesystem implementation. The call interface always remains the same, although the actual work is carried out by implementation-specific functions. The inode structure has two pointers (i_op and i_fop) to arrays that implement the above abstraction. One relates to the inode-specific operations, and the...

Data Structures Rto

Sat, 13 Aug 2022 | Linux Kernel Architecture

Since the structure of an extended attribute is very simple, the kernel does not provide a specific data structure to encapsulate the name value pairs instead, a simple string is used to represent the name, while a void-pointer denotes the area in memory where the value resides. Nevertheless, there need to be methods that set, retrieve, remove, and list the extended attributes. Since these operations are mode-specific, they are integrated into struct inode_operations int (*setxattr) (struct...

Sending Packets Fws

Sat, 13 Aug 2022 | Linux Kernel Architecture

As seen from the TCP layer, the sending of TCP packets begins with the invocation of the tcp_sendmsg function by higher network instances. Figure 12-31 shows the associated code flow diagram. Figure 12-31 Code flow diagram for tcp_sendmsg. Figure 12-31 Code flow diagram for tcp_sendmsg. Naturally, the state of the socket used must be TCP_ESTABLISHED before data transmission can begin. If this is not the case, the kernel waits (with the help of wait_for_tcp_connect) until a connection has been...

Calculation of Zone Watermarks

Mon, 08 Aug 2022 | Linux Kernel Architecture

Before calculating the various watermarks, the kernel first determines the minimum memory space that must remain free for critical allocations. This value scales nonlinearly with the size of the available RAM. It is stored in the global variable min_free_kbytes. Figure 3-4 provides an overview of the scaling behavior, and the inset which does not use a logarithmic scale for the main memory size in contrast to the main graph shows a magnification of the region up to 4 GiB. Some exemplary values...

Deleting Inodes

Mon, 04 Jul 2022 | Linux Kernel Architecture

Both directory inodes and file inodes can be deleted, and, from the perspective of the filesystem, both actions are much simpler than allocating inodes. Let us first look at how directories are deleted. After the appropriate system call (rmdir) has been invoked, the code meanders through the kernel and finally arrives at the rmdir function pointer of the inode_operations structure, which, for the Ext2 filesystem, contains ext2_rmdir from fs ext2 namei.c. 18In terms of compatibility with old...

Atomic Operations on Integers

Mon, 04 Jul 2022 | Linux Kernel Architecture

The kernel defines the atomic_t data type (in ) as the basis for atomic operations with integer counters. In the view of the kernel, these are operations that are performed as if they consisted of a single assembly language instruction. A short example, in which a counter is incremented by 1, is sufficient to demonstrate the need for operations of this kind. On the assembly language level, incrementation is usually performed in three steps 1. The counter value is copied from memory into a...

Core Scheduler Changes

Mon, 04 Jul 2022 | Linux Kernel Architecture

Besides the additions discussed above, some changes to the existing methods are required in the core scheduler on SMP systems. While numerous small details change all over the place, the most important differences as compared to uniprocessor systems are the following When a new process is started with the exec system call, a good opportunity for the scheduler to move the task across CPUs arises. Naturally, it has not been running yet, so there cannot be any negative effects on the CPU cache by...