Memory Region Data Structures

All the regions owned by a process are linked in a simple list. Regions appear in the list in ascending order by memory address; however, successive regions can be separated by an area of unused memory addresses. The vm_next field of each vm_area_struct element points to the next element in the list. The kernel finds the memory regions through the mmap field of the process memory descriptor, which points to the first memory region descriptor in the list.

The map_count field of the memory descriptor contains the number of regions owned by the process. A process may own up to max_map_count different memory regions (this value is usually set to 65,536).

Figure 8-2 illustrates the relationships among the address space of a process, its memory descriptor, and the list of memory regions.

Figure 8-2. Descriptors related to the address space of a process

Figure 8-2. Descriptors related to the address space of a process

A frequent operation performed by the kernel is to search the memory region that includes a specific linear address. Since the list is sorted, the search can terminate as soon as a memory region that ends after the specific linear address is found.

However, using the list is convenient only if the process has very few memory regions—let's say less than a few tens of them. Searching, inserting elements, and deleting elements in the list involve a number of operations whose times are linearly proportional to the list length.

Although most Linux processes use very few memory regions, there are some large applications, such as object-oriented databases, that one might consider "pathological" because they have many hundreds or even thousands of regions. In such cases, the memory region list management becomes very inefficient, hence the performance of the memory-related system calls degrades to an intolerable point.

Therefore, Linux 2.4 stores memory descriptors in data structures called red-black trees.

In an red-black tree, each element (or node) usually has two children: a left child and a right child. The elements in the tree are sorted. For each node N, all elements of the subtree rooted at the left child of N precede N, while, conversely, all elements of the subtree rooted at the right child of N follow N (see Figure 8-3(a); the key of the node is written inside the node itself.

[3] Up to Version 2.4.9, the Linux kernel used another type of balanced search tree called AVL tree.

Moreover, a red-black tree must satisfy four additional rules:

1. Every node must be either red or black.

2. The root of the tree must be black.

3. The children of a red node must be black.

4. Every path from a node to a descendant leaf must contain the same number of black nodes. When counting the number of black nodes, null pointers are counted as black nodes.

Figure 8-3. Example of red-black trees

Figure 8-3. Example of red-black trees

fadnwie la) Q Stack node (b)

These four rules ensure that any red-black tree with n internal nodes has a height of at most 2log(n + 1).

Searching an element in a red-black tree is thus very efficient because it requires operations whose execution time is linearly proportional to the logarithm of the tree size. In other words, doubling the number of memory regions adds just one more iteration to the operation.

Inserting and deleting an element in a red-black tree is also efficient because the algorithm can quickly traverse the tree to locate the position at which the element will be inserted or from which it will be removed. Any new node must be inserted as a leaf and colored red. If the operation breaks the rules, a few nodes of the tree must be moved or recolored.

For instance, suppose that an element having the value 4 must be inserted in the red-black tree shown in Figure 8-3(a). Its proper position is the right child of the node that has key 3, but once it is inserted, the red node that has the value 3 has a red child, thus breaking rule 3. To satisfy the rule, the color of nodes that have the values 3, 4, and 7 is changed. This operation, however, breaks rule 4, thus the algorithm performs a "rotation" on the subtree rooted at the node that has the key 19, producing the new red-black tree shown in Figure 8-3(b). This looks complicated, but inserting or deleting an element in a red-black tree requires a small number of operations—a number linearly proportional to the logarithm of the tree size.

Therefore, to store the memory regions of a process, Linux uses both a linked list and a red-black tree. Both data structures contain pointers to the same memory region descriptors, When inserting or removing a memory region descriptor, the kernel searches the previous and next elements through the red-black tree and uses them to quickly update the list without scanning it.

The head of the linked list is referenced by the mmap field of the memory descriptor. Any memory region object stores the pointer to the next element of the list in the vm_next field. The head of the red-black tree is referred by the mm_rb field of the memory descriptor. Any memory region object stores the color of the node, as well as the pointers to the parent, the left child, and the right child, into the vm_rb field of type rb_node_t.

In general, the red-black tree is used to locate a region including a specific address, while the linked list is mostly useful when scanning the whole set of regions.

8.3.2 Memory Region Access Rights

Before moving on, we should clarify the relation between a page and a memory region. As mentioned in Chapter 2, we use the term "page" to refer both to a set of linear addresses and to the data contained in this group of addresses. In particular, we denote the linear address interval ranging between 0 and 4,095 as page 0, the linear address interval ranging between 4,096 and 8,191 as page 1, and so forth. Each memory region therefore consists of a set of pages that have consecutive page numbers.

We have already discussed two kinds of flags associated with a page:

• A few flags such as Read/Write, Present, or User/Supervisor stored in each Page Table entry (see Section 2.4.1).

• A set of flags stored in the flags field of each page descriptor (see Section 7.1).

The first kind of flag is used by the 80 x 86 hardware to check whether the requested kind of addressing can be performed; the second kind is used by Linux for many different purposes (see Table 7-2).

We now introduce a third kind of flags: those associated with the pages of a memory region. They are stored in the vm_flags field of the vm_area_struct descriptor (see Table 8-4). Some flags offer the kernel information about all the pages of the memory region, such as what they contain and what rights the process has to access each page. Other flags describe the region itself, such as how it can grow.

Table 8-4. The memory region flags

Flag name

Description

VM READ

Pages can be read.

VM WRITE

Pages can be written.

VM EXEC

Pages can be executed.

VM SHARED

Pages can be shared by several processes.

VM MAYREAD

vm read flag may be set.

VM MAYWRITE

vm write flag may be set.

VM MAYEXEC

vm exec flag may be set.

VM MAYSHARE

vm share flag may be set.

VM GROWSDOWN

The region can expand toward lower addresses.

VM GROWSUP

The region can expand toward higher addresses.

VM SHM

The region is used for IPC's shared memory.

VM DENYWRITE

The region maps a file that cannot be opened for writing.

VM EXECUTABLE

The region maps an executable file.

VM LOCKED

Pages in the region are locked and cannot be swapped out.

VM IO

The region maps the I/O address space of a device.

VM SEQ READ

The application accesses the pages sequentially.

VM RAND READ

The application accesses the pages in a truly random order.

VM DONTCOPY

Does not copy the region when forking a new process.

VM DONTEXPAND

Forbids region expansion through mremap( ) system call.

VM RESERVED

Does not swap out the region.

Page access rights included in a memory region descriptor may be combined arbitrarily. It is

Page access rights included in a memory region descriptor may be combined arbitrarily. It is possible, for instance, to allow the pages of a region to be executed but not read. To implement this protection scheme efficiently, the read, write, and execute access rights associated with the pages of a memory region must be duplicated in all the corresponding Page Table entries so that checks can be directly performed by the Paging Unit circuitry. In other words, the page access rights dictate what kinds of access should generate a Page Fault exception. As we shall see shortly, the job of figuring out what caused the Page Fault is delegated by Linux to the Page Fault handler, which implements several page-handling strategies.

The initial values of the Page Table flags (which must be the same for all pages in the memory region, as we have seen) are stored in the vm_ page_ prot field of the vm_area_struct descriptor. When adding a page, the kernel sets the flags in the corresponding Page Table entry according to the value of the vm_ page_ prot field.

However, translating the memory region's access rights into the page protection bits is not straightforward for the following reasons:

• In some cases, a page access should generate a Page Fault exception even when its access type is granted by the page access rights specified in the vm_flags field of the corresponding memory region. For instance, as we shall see in Section 8.4.4 later in this chapter, the kernel may wish to store two identical, writable private pages (whose vm_share flags are cleared) belonging to two different processes into the same page frame; in this case, an exception should be generated when either one of the processes tries to modify the page.

• 80 x 86 processors's Page Tables have just two protection bits, namely the Read/Write and User/Supervisor flags. Moreover, the User/Supervisor flag of any page included in a memory region must always be set, since the page must always be accessible by User Mode processes.

To overcome the hardware limitation of the 80 x 86 microprocessors, Linux adopts the following rules:

• The read access right always implies the execute access right.

• The write access right always implies the read access right.

Moreover, to correctly defer the allocation of page frames through the Section 8.4.4 technique (see later in this chapter), the page frame is write-protected whenever the corresponding page must not be shared by several processes. Therefore, the 16 possible combinations of the read, write, execute, and share access rights are scaled down to the following three:

• If the page has both write and share access rights, the Read/Write bit is set.

• If the page has the read or execute access right but does not have either the write or the share access right, the Read/Write bit is cleared.

• If the page does not have any access rights, the Present bit is cleared so that each access generates a Page Fault exception. However, to distinguish this condition from the real page-not-present case, Linux also sets the Page size bit to 1. 14]

[4] You might consider this use of the Page size bit to be a dirty trick, since the bit was meant to indicate the real page size. But Linux can get away with the deception because the 80 x 86 chip checks the Page size bit in Page Directory entries, but not in Page Table entries.

The downscaled protection bits corresponding to each combination of access rights are stored in the protection_map array.

Was this article helpful?

0 0

Post a comment