7.1 Page Frame Management

We saw in Section 2.4 how the Intel Pentium processor can use two different page frame sizes: 4 KB and 4 MB (or 2 MB if PAE is enabled—see Section 2.4.6). Linux adopts the smaller 4 KB page frame size as the standard memory allocation unit. This makes things simpler for two reasons:

• The Page Fault exceptions issued by the paging circuitry are easily interpreted. Either the page requested exists but the process is not allowed to address it, or the page does not exist. In the second case, the memory allocator must find a free 4 KB page frame and assign it to the process.

• The 4 KB size is a multiple of most disk block sizes, so transfers of data between main memory and disks are more efficient. Yet this smaller size is much more manageable than the 4 MB size.

7.1.1 Page Descriptors

The kernel must keep track of the current status of each page frame. For instance, it must be able to distinguish the page frames that are used to contain pages that belong to processes from those that contain kernel code or kernel data structures. Similarly, it must be able to determine whether a page frame in dynamic memory is free. A page frame in dynamic memory is free if it does not contain any useful data. It is not free when the page frame contains data of a User Mode process, data of a software cache, dynamically allocated kernel data structures, buffered data of a device driver, code of a kernel module, and so on.

State information of a page frame is kept in a page descriptor of type struct page, whose fields are shown in Table 7-1. All page descriptors are stored in the mem_map array. Since each descriptor is less than 64 bytes long, mem_map requires about four page frames for each megabyte of RAM.
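The four-page-frames-per-megabyte estimate follows from simple arithmetic. The sketch below reproduces it in user-space C. The 64-byte value is the upper bound quoted above, not the exact descriptor size, and the macro and function names are invented for this illustration.

```c
/* Illustrative arithmetic only; these names are hypothetical, not kernel symbols. */
#define FRAME_SIZE      4096UL  /* 4 KB page frame, the unit adopted by Linux */
#define DESC_SIZE_BOUND   64UL  /* upper bound on the descriptor size quoted in the text */

/* Number of 4 KB page frames contained in ram_bytes of RAM. */
static unsigned long frames_in(unsigned long ram_bytes)
{
    return ram_bytes / FRAME_SIZE;
}

/* Page frames needed to store one descriptor per page frame, rounded up. */
static unsigned long mem_map_frames(unsigned long ram_bytes)
{
    unsigned long bytes = frames_in(ram_bytes) * DESC_SIZE_BOUND;
    return (bytes + FRAME_SIZE - 1) / FRAME_SIZE;
}
```

For 1 MB of RAM, frames_in( ) yields 256 page frames and mem_map_frames( ) yields 4, so the descriptors take at most about 1.6% of RAM.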

Table 7-1. The fields of the page descriptor

Type                      Name         Description
struct list_head          list         Contains pointers to next and previous items in a doubly linked list of page descriptors
struct address_space *    mapping      Used when the page is inserted into the page cache (see Section 14.1)
unsigned long             index        Either the position of the data stored in the page within the page's disk image (see Chapter 14) or a swapped-out page identifier (see Chapter 16)
struct page *             next_hash    Contains pointer to next item in a doubly linked circular list of the page cache hash table
atomic_t                  count        Page's reference counter
unsigned long             flags        Array of flags (see Table 7-2)
struct list_head          lru          Contains pointers to the least recently used doubly linked list of pages
wait_queue_head_t         wait         Page's wait queue
struct page **            pprev_hash   Contains pointer to previous item in a doubly linked circular list of the page cache hash table
struct buffer_head *      buffers      Used when the page stores buffers (see Section
void *                    virtual      Linear address of the page frame in the fourth gigabyte (see Section 7.1.5 later in this chapter)
struct zone_struct *      zone         The zone to which the page frame belongs (see Section 7.1.2)

You don't have to fully understand the role of all fields in the page descriptor right now. In the following chapters, we often come back to the fields of the page descriptor. Moreover, several fields have different meanings, according to whether the page frame is free and what kernel component is using the page frame.

Let's describe in greater detail two of the fields:

count
A usage reference counter for the page. If it is set to 0, the corresponding page frame is free and can be assigned to any process or to the kernel itself. If it is set to a value greater than 0, the page frame is assigned to one or more processes or is used to store some kernel data structures.

flags
Includes up to 32 flags (see Table 7-2) that describe the status of the page frame. For each PG_xyz flag, the kernel defines some macros that manipulate its value. Usually, the PageXyz macro returns the value of the flag, while the SetPageXyz and ClearPageXyz macros set and clear the corresponding bit, respectively.
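The naming convention for the flag macros can be illustrated with a small user-space model. This is only a sketch: the structure, the bit positions, and the page_frame_is_free( ) helper are assumptions made for the example, and the kernel itself relies on atomic bit operations rather than plain shifts.

```c
/* User-space model; the struct layout and bit positions are illustrative,
 * not the kernel's actual definitions. */
struct page_model {
    int           count;  /* usage reference counter (atomic_t in the kernel) */
    unsigned long flags;  /* one bit per PG_xyz flag */
};

enum { PG_locked_bit = 0, PG_reserved_bit = 1 };  /* hypothetical positions */

/* PageXyz returns the flag; SetPageXyz and ClearPageXyz set and clear it. */
#define PageReserved(p)      (((p)->flags >> PG_reserved_bit) & 1UL)
#define SetPageReserved(p)   ((p)->flags |=  (1UL << PG_reserved_bit))
#define ClearPageReserved(p) ((p)->flags &= ~(1UL << PG_reserved_bit))

/* A page frame is free when its usage counter is 0. */
static int page_frame_is_free(const struct page_model *p)
{
    return p->count == 0;
}
```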

Table 7-2. Flags describing the status of a page frame

Flag name       Description
PG_locked       The page is involved in a disk I/O operation.
PG_error        An I/O error occurred while transferring the page.
PG_referenced   The page has been recently accessed for a disk I/O operation.
PG_uptodate     The flag is set after completing a read operation, unless a disk I/O error happened.
PG_dirty        The page has been modified (see Section 16.5.1).
PG_lru          The page is in the active or inactive page list (see Section 16.7.2).
PG_active       The page is in the active page list (see Section 16.7.2).
PG_slab         The page frame is included in a slab (see Section 7.2 later in this chapter).
PG_skip         Not used.
PG_highmem      The page frame belongs to the zone_highmem zone (see Section 7.1.2).
PG_checked      Flag used by the Ext2 filesystem (see Chapter 17).
PG_arch_1       Not used on the 80 x 86 architecture.
PG_reserved     The page frame is reserved to kernel code or is unusable.
PG_launder      The page is involved in an I/O operation triggered by shrink_cache( ) (see Section 16.7.5).

7.1.2 Memory Zones

In an ideal computer architecture, a page frame is a memory storage unit that can be used for anything: storing kernel and user data, buffering disk data, and so on. Any kind of page of data can be stored in any page frame, without limitations.

However, real computer architectures have hardware constraints that may limit the way page frames can be used. In particular, the Linux kernel must deal with two hardware constraints of the 80 x 86 architecture:

• The Direct Memory Access (DMA) processors for ISA buses have a strong limitation: they are able to address only the first 16 MB of RAM.

• In modern 32-bit computers with lots of RAM, the CPU cannot directly access all physical memory because the linear address space is too small.

To cope with these two limitations, Linux partitions the physical memory into three zones:

zone_dma
Contains pages of memory below 16 MB

zone_normal
Contains pages of memory at and above 16 MB and below 896 MB

zone_highmem
Contains pages of memory at and above 896 MB

The zone_dma zone includes memory pages that can be used by old ISA-based devices by means of the DMA. (Section 13.1.4 gives further details on DMA.)

The zone_dma and zone_normal zones include the "normal" pages of memory that can be directly accessed by the kernel through the linear mapping in the fourth gigabyte of the linear address space (see Section 2.5.5). Conversely, the zone_highmem zone includes pages of memory that cannot be directly accessed by the kernel through the linear mapping in the fourth gigabyte of linear address space (see Section 7.1.6 later in this chapter). The zone_highmem zone is not used on 64-bit architectures.
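The zone boundaries amount to two comparisons on a physical address. The following sketch classifies an address using the 16 MB and 896 MB limits given above; the enum and function names are hypothetical and serve only this example.

```c
/* Zone boundaries taken from the text; the enum and function names are hypothetical. */
enum zone_kind { ZK_DMA, ZK_NORMAL, ZK_HIGHMEM };

#define DMA_LIMIT     0x01000000UL  /* 16 MB: upper limit for ISA DMA */
#define HIGHMEM_START 0x38000000UL  /* 896 MB: end of the direct linear mapping */

/* Return the zone that contains the page frame at physical address paddr. */
static enum zone_kind zone_of_paddr(unsigned long paddr)
{
    if (paddr < DMA_LIMIT)
        return ZK_DMA;
    if (paddr < HIGHMEM_START)
        return ZK_NORMAL;
    return ZK_HIGHMEM;
}
```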

Each memory zone has its own descriptor of type struct zone_struct (or equivalently, zone_t). Its fields are shown in Table 7-3.

Table 7-3. The fields of the zone descriptor

Type                    Name               Description
char *                  name               Contains a pointer to the conventional name of the zone: "DMA," "Normal," or "HighMem"
unsigned long           size               Number of pages in the zone
spinlock_t              lock               Spin lock protecting the descriptor
unsigned long           free_pages         Number of free pages in the zone
unsigned long           pages_min          Minimum number of pages of the zone that should remain free (see Section 16.7)
unsigned long           pages_low          Lower threshold value for the zone's page balancing algorithm (see Section 16.7)
unsigned long           pages_high         Upper threshold value for the zone's page balancing algorithm (see Section 16.7)
int                     need_balance       Flag indicating that the zone's page balancing algorithm should be activated (see Section 16.7)
free_area_t [ ]         free_area          Used by the buddy system page allocator (see Section 7.1.7 later in this chapter)
struct pglist_data *    zone_pgdat         Pointer to the descriptor of the node to which this zone belongs
struct page *           zone_mem_map       Array of page descriptors of the zone (see Section 7.1.7 later in this chapter)
unsigned long           zone_start_paddr   First physical address of the zone
unsigned long           zone_start_mapnr   First page descriptor index of the zone

The zone field in the page descriptor points to the descriptor of the zone to which the corresponding page frame belongs.

The zone_names array stores the canonical names of the three zones: "DMA," "Normal," and "HighMem."

When the kernel invokes a memory allocation function, it must specify the zones from which the requested page frames may be taken, usually as a list of acceptable zones in order of preference. For instance, if a page frame must be directly mapped in the fourth gigabyte of linear addresses but it is not going to be used for ISA DMA transfers, then the kernel requests a page frame either in zone_normal or in zone_dma. Of course, the page frame should be obtained from zone_dma only if zone_normal does not have free page frames. To specify the preferred zones in a memory allocation request, the kernel uses the struct zonelist_struct data structure (or equivalently zonelist_t), which is an array of zone descriptor pointers.
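The fallback from zone_normal to zone_dma can be pictured as a walk over an ordered, NULL-terminated array of zone pointers. The sketch below is a simplified user-space model with hypothetical type and function names; it only checks whether a zone has any free page frame and ignores the thresholds that the real allocator also consults.

```c
#include <stddef.h>

#define MODEL_MAX_ZONES 3

/* Reduced zone descriptor: only the fields this example needs. */
struct zone_model {
    const char   *name;
    unsigned long free_pages;
};

/* Zones in order of preference, terminated by a NULL pointer. */
struct zonelist_model {
    struct zone_model *zones[MODEL_MAX_ZONES + 1];
};

/* Return the first zone in the list that still has a free page frame,
 * or NULL if every listed zone is exhausted. */
static struct zone_model *first_zone_with_free_pages(const struct zonelist_model *zl)
{
    size_t i;

    for (i = 0; zl->zones[i] != NULL; i++)
        if (zl->zones[i]->free_pages > 0)
            return zl->zones[i];
    return NULL;
}
```

A list built as { &normal, &dma, NULL } expresses the request described above: zone_dma is used only when zone_normal has no free page frames.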

7.1.3 Non-Uniform Memory Access (NUMA)

We are used to thinking of the computer's memory as a homogeneous, shared resource. Disregarding the role of the hardware caches, we expect the time required for a CPU to access a memory location to be essentially the same, regardless of the location's physical address and the CPU. Unfortunately, this assumption is not true in some architectures. For instance, it is not true for some multiprocessor Alpha or MIPS computers.

Linux 2.4 supports the Non-Uniform Memory Access (NUMA) model, in which the access times for different memory locations from a given CPU may vary. The physical memory of the system is partitioned in several nodes. The time needed by any given CPU to access pages within a single node is the same. However, this time might not be the same for two different CPUs. For every CPU, the kernel tries to minimize the number of accesses to costly nodes by carefully selecting where the kernel data structures that are most often referenced by the CPU are stored.

The physical memory inside each node can be split into several zones, as we saw in the previous section. Each node has a descriptor of type pg_data_t, whose fields are shown in Table 7-4. All node descriptors are stored in a singly linked list, whose first element is pointed to by the pgdat_list variable.
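Walking the node list is an ordinary linked-list traversal. The sketch below uses a reduced, hypothetical node structure that keeps only the node_size and node_next fields, and sums the sizes of all nodes reachable from the list head.

```c
#include <stddef.h>

/* Reduced node descriptor: only the two fields the walk needs. */
struct node_model {
    unsigned long      node_size;  /* size of the node, in pages */
    struct node_model *node_next;  /* next item in the node list, NULL at the end */
};

/* Sum the sizes (in pages) of all nodes reachable from the list head. */
static unsigned long total_node_pages(const struct node_model *pgdat_list)
{
    unsigned long total = 0;
    const struct node_model *n;

    for (n = pgdat_list; n != NULL; n = n->node_next)
        total += n->node_size;
    return total;
}
```

With a single node, as on the 80 x 86 architecture, the loop body runs exactly once.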

Table 7-4. The fields of the node descriptor

Type                     Name                Description
zone_t [ ]               node_zones          Array of zone descriptors of the node
zonelist_t [ ]           node_zonelists      Array of zonelist_t data structures used by the page allocator (see Section 7.1.5 later in this chapter)
int                      nr_zones            Number of zones in the node
struct page *            node_mem_map        Array of page descriptors of the node
unsigned long *          valid_addr_bitmap   Bitmap of usable physical addresses for the node
struct bootmem_data *    bdata               Used in the kernel initialization phase
unsigned long            node_start_paddr    First physical address of the node
unsigned long            node_start_mapnr    First page descriptor index of the node
unsigned long            node_size           Size of the node (in pages)
int                      node_id             Identifier of the node
pg_data_t *              node_next           Next item in the node list

As usual, we are mostly concerned with the 80 x 86 architecture. IBM-compatible PCs use the Uniform Memory Access (UMA) model, thus the NUMA support is not really required. However, even if NUMA support is not compiled in the kernel, Linux makes use of a single node that includes all system physical memory; the corresponding descriptor is stored in the contig_page_data variable.

On the 80 x 86 architecture, grouping the physical memory in a single node might appear useless; however, this approach makes the memory handling code more portable, because the kernel may assume that the physical memory is partitioned in one or more nodes in all architectures.[1]

[1] We have another example of this kind of design choice: Linux uses three levels of Page Tables even when the hardware architecture defines just two levels (see Section 2.5).

7.1.4 Initialization of the Memory Handling Data Structures

Dynamic memory and the values used to refer to it are illustrated in Figure 7-1. The zones of memory are not drawn to scale; zone_normal is usually larger than zone_dma, and, if present, zone_highmem is usually larger than zone_normal. Notice that zone_highmem starts from physical address 0x38000000, which corresponds to 896 MB.

Figure 7-1. Memory layout

Section 2.5.5 already described how the paging_init( ) function initializes the kernel Page Tables according to the amount of RAM in the system. Besides Page Tables, the paging_init( ) function also initializes other memory handling data structures. It invokes kmap_init( ), which essentially sets up the kmap_pte variable to create "windows" of linear addresses that allow the kernel to address the zone_highmem zone (see Section later in this chapter). Then, paging_init( ) invokes the free_area_init( ) function, passing an array storing the sizes of the three memory zones to it.

The free_area_init( ) function sets up both the zone descriptors and the page descriptors. The function receives the zones_size array (size of each memory zone) as its parameter, and executes the following operations:[2]

[2] In NUMA architectures, these operations must be performed separately on every node. However, we are focusing on the 80 x 86 architecture, which has just one node.

1. Computes the total number of page frames in RAM by adding the values in zones_size, and stores the result in the totalpages local variable.

2. Initializes the active_list and inactive_list lists of page descriptors (see Chapter 16).

3. Allocates space for the mem_map array of page descriptors. The space needed is the product of totalpages by the page descriptor size.

4. Initializes some fields of the node descriptor contig_page_data:

    contig_page_data.node_size = totalpages;
    contig_page_data.node_start_paddr = 0x00000000;
    contig_page_data.node_start_mapnr = 0;

5. Initializes some fields of all page descriptors. All page frames are marked as reserved, but later, the PG_reserved flag of the page frames in dynamic memory will be cleared:

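    for (p = mem_map; p < mem_map + totalpages; p++) {
        p->count = 0;                            /* no users yet */
        SetPageReserved(p);                      /* cleared later for dynamic memory */
        init_waitqueue_head(&p->wait);
        p->list.next = p->list.prev = &p->list;  /* empty list: points to itself */
    }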

6. Stores the address of the memory zone descriptor in the zone local variable and for each element of the zone_names array (index j between 0 and 2), performs the following steps:

a. Initializes some fields of the descriptor:

    zone->name = zone_names[j];
    zone->size = zones_size[j];
    zone->lock = SPIN_LOCK_UNLOCKED;
    zone->zone_pgdat = &contig_page_data;
    zone->free_pages = 0;
    zone->need_balance = 0;

b. If the zone is empty (that is, it does not include any page frame), the function goes back to the beginning of Step 6 and continues with the next zone.

c. Otherwise, the zone includes at least one page frame and the function initializes the pages_min, pages_low, and pages_high fields of the zone descriptor (see Chapter 16).

d. Sets up the zone_mem_map field of the zone descriptor to the address of the first page descriptor in the zone.

e. Sets up the zone_start_mapnr field of the zone descriptor to the index of the first page descriptor in the zone.

f. Sets up the zone_start_paddr field of the zone descriptor to the physical address of the first page frame in the zone.

g. Stores the address of the zone descriptor in the zone field of the page descriptor for each page frame of the zone.

h. If the zone is either zone_dma or zone_normal, stores the linear address in the fourth gigabyte that maps the page frame into the virtual field of every page descriptor of the zone.

i. Initializes the free_area_t structures in the free_area array of the zone descriptor (see Section 7.1.7 later in this chapter).

7. Initializes the node_zonelists array of the contig_page_data node descriptor. The array includes 16 elements; each element corresponds to a different type of memory request and specifies the zones (in order of preference) from where the page frames could be retrieved. See Section 7.1.5 later in this chapter for further details.
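The bookkeeping in Steps 1 and 6 can be sketched in user-space C. This is a simplified model, not the kernel function: it assumes the three zones are contiguous and start at physical address 0, it fills in every zone rather than skipping empty ones, and all type, macro, and function names are hypothetical. The lowmem_virtual( ) helper shows the linear mapping used in Step 6h, assuming the fourth gigabyte starts at 0xC0000000.

```c
#define MODEL_NR_ZONES    3
#define MODEL_FRAME_SIZE  4096UL        /* 4 KB page frames */
#define MODEL_PAGE_OFFSET 0xC0000000UL  /* start of the fourth gigabyte */

/* The per-zone values computed in Steps 6a, 6e, and 6f. */
struct zone_layout {
    unsigned long size;              /* number of page frames in the zone */
    unsigned long zone_start_mapnr;  /* index of the first page descriptor */
    unsigned long zone_start_paddr;  /* physical address of the first page frame */
};

/* Fill layout[] from zones_size[] and return the total number of page frames. */
static unsigned long layout_zones(const unsigned long zones_size[MODEL_NR_ZONES],
                                  struct zone_layout layout[MODEL_NR_ZONES])
{
    unsigned long totalpages = 0;
    int j;

    for (j = 0; j < MODEL_NR_ZONES; j++) {
        layout[j].size = zones_size[j];
        layout[j].zone_start_mapnr = totalpages;
        layout[j].zone_start_paddr = totalpages * MODEL_FRAME_SIZE;
        totalpages += zones_size[j];
    }
    return totalpages;
}

/* Linear address that maps a zone_dma or zone_normal page frame (Step 6h). */
static unsigned long lowmem_virtual(unsigned long paddr)
{
    return MODEL_PAGE_OFFSET + paddr;
}
```

With 1 GB of RAM, the zone sizes are 4,096, 225,280, and 32,768 page frames, and the third zone starts at physical address 0x38000000, matching the 896 MB boundary mentioned earlier.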

When the paging_init( ) function terminates, dynamic memory is not yet usable because the PG_reserved flag of all pages is set. Memory initialization continues in the mem_init( ) function, which is invoked after paging_init( ).

Essentially, the mem_init( ) function initializes the value of num_physpages, the total number of page frames present in the system. It then scans all page frames associated with the dynamic memory; for each of them, the function sets the count field of the corresponding descriptor to 1, resets the PG_reserved flag, sets the PG_highmem flag if the page belongs to zone_highmem, and calls the __free_page( ) function on it. Besides releasing the page frame (see Section 7.1.7 later in this chapter), __free_page( ) also increments the value of the free_pages field of the memory zone descriptor that owns the page frame. The free_pages fields of all zone descriptors are used by the nr_free_pages( ) function to compute the total number of free page frames in the dynamic memory.
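The reason for setting count to 1 just before releasing a page frame becomes clearer in a small model. The sketch below assumes that the release routine drops one reference and treats the page frame as free when the counter reaches 0; the structures and function names are hypothetical, the PG_reserved flag is reduced to a plain integer, and the PG_highmem handling is omitted.

```c
#include <stddef.h>

struct zone_m { unsigned long free_pages; };

struct page_m {
    int            count;     /* usage reference counter */
    int            reserved;  /* stands in for the PG_reserved flag */
    struct zone_m *zone;      /* zone that owns the page frame */
};

/* Drop one reference; when the counter reaches 0 the page frame is free
 * and the owning zone's free_pages counter is incremented. */
static void model_free_page(struct page_m *p)
{
    if (--p->count == 0)
        p->zone->free_pages++;
}

/* Hand every page frame of dynamic memory over to its zone. */
static void model_mem_init(struct page_m *pages, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        pages[i].count = 1;     /* one reference, dropped by the release below */
        pages[i].reserved = 0;  /* the page frame becomes usable */
        model_free_page(&pages[i]);
    }
}

/* Total number of free page frames: the sum of free_pages over all zones. */
static unsigned long model_nr_free_pages(const struct zone_m *zones, size_t nzones)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 0; i < nzones; i++)
        sum += zones[i].free_pages;
    return sum;
}
```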

The mem_init( ) function also counts the number of page frames that are not associated with dynamic memory. Several symbols produced while compiling the kernel (some are described in Section 2.5.3) enable the function to count the number of page frames reserved for the hardware, kernel code, and kernel data, and the number of page frames used during kernel initialization that can be subsequently released.
