Page Frames

Page frames represent the smallest unit of system memory, and an instance of struct page is created for each page in RAM. Kernel programmers take care to keep this structure as small as possible because the memory of systems even with a moderate RAM configuration is broken down into a very large number of pages. For instance, an IA-32 system working with a standard page size of 4 KiB has around 100,000 pages (98,304, to be exact) given a main memory size of 384 MiB. Although this memory size is certainly not excessively large by today's standards, the number of pages is already considerable.

This is why the kernel makes great efforts to keep struct page as small as possible. The sheer number of pages in a typical system causes even small changes in the structure to lead to a large increase in the amount of physical memory required to keep all page instances.
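The arithmetic behind this concern can be sketched in a few lines of C. The helper names page_frame_count and per_page_overhead are made up for this illustration and do not exist in the kernel; the sketch only multiplies the number of page frames by a hypothetical number of bytes added to each struct page instance.

```c
#include <assert.h>
#include <stdint.h>

/* Number of page frames needed to cover ram_bytes of memory. */
static uint64_t page_frame_count(uint64_t ram_bytes, uint64_t page_size)
{
    assert(page_size != 0);
    return ram_bytes / page_size;
}

/* Total memory taken up when every page frame carries bytes_per_page
 * bytes of bookkeeping data (hypothetical helper for illustration). */
static uint64_t per_page_overhead(uint64_t ram_bytes, uint64_t page_size,
                                  uint64_t bytes_per_page)
{
    return page_frame_count(ram_bytes, page_size) * bytes_per_page;
}
```

For the 384 MiB example with 4 KiB pages, page_frame_count yields 98,304 frames, so adding a single 4-byte field to the structure would cost 393,216 bytes (384 KiB) of physical memory.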

Keeping the structure small is not exactly simplified by the ubiquity of pages: They are used in many parts of memory management, and for varying applications. While one part of the kernel absolutely depends on a specific piece of information being available in struct page, this could be useless for another part, which itself depends on a different piece of information, which could again be completely useless for the first part, and so on.

A C union lends itself naturally as a remedy for this problem, even if clarity of struct page is not increased at first. Consider an example: A physical page can be mapped into the virtual address space via page tables from multiple places, and the kernel wants to keep track of how many places map the page. To this end, a counter in struct page counts the number of mappings. If a page is used by the slub allocator (a means to subdivide complete pages into smaller portions, see Section 3.6.1), then it is guaranteed to be used only by the kernel and not from somewhere else, so the map count information is superfluous. Instead, the kernel can reinterpret the field to denote how many of the small memory objects into which a page is subdivided are in use. The double interpretation looks as follows in the data structure definition:

    struct page {
    ...
        union {
            atomic_t _mapcount;    /* Count of ptes mapped in mms,
                                    * to show when page is mapped
                                    * & limit reverse map searches.
                                    */
            unsigned int inuse;
        };
    ...
    };

Note that atomic_t and unsigned int are two different data types: the first allows for changing values atomically, that is, safe against concurrent access, while the second is a classical integer. atomic_t provides 32 bits,6 and an integer also provides this many bits on each architecture supported by Linux. Now it could be tempting to "simplify" the definition as follows:

    struct page {
    ...
        atomic_t counter;
    ...
    };

This would be bad style, though, and is completely unacceptable to the kernel developers. The slub code does not need atomicity to access its object counter, and this should also be reflected in the data type. And, most importantly, readability of the code would suffer in both subsystems. While _mapcount and inuse provide a clear and concise description of what the element is about, counter could mean almost anything.
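The double interpretation can be tried out in user space. The following is only a sketch under stated assumptions: struct page_sketch and all helper functions are hypothetical names, and C11 atomic_int stands in for the kernel's atomic_t, which is not available outside the kernel.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for the relevant part of struct page (not the kernel type). */
struct page_sketch {
    union {
        atomic_int   _mapcount; /* mapped page: number of mappings */
        unsigned int inuse;     /* slub page: number of objects in use */
    };
};

/* Mapping side: several contexts may update the counter concurrently,
 * so every access goes through an atomic operation. */
static void page_add_mapping(struct page_sketch *p)
{
    atomic_fetch_add(&p->_mapcount, 1);
}

static int page_mapping_count(struct page_sketch *p)
{
    return atomic_load(&p->_mapcount);
}

/* Slub side: a plain integer suffices, and the data type says so. */
static void slub_take_object(struct page_sketch *p)
{
    p->inuse++;
}

static void slub_return_object(struct page_sketch *p)
{
    assert(p->inuse > 0);
    p->inuse--;
}
```

Each helper touches only the union member that matches its subsystem, so the intent stays visible in the code even though both counters share the same storage.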

Definition of page

The structure is defined as follows:

    struct page {
        unsigned long flags;        /* Atomic flags, some possibly
                                     * updated asynchronously */
        atomic_t _count;            /* Usage count, see below. */
        union {
            atomic_t _mapcount;     /* Count of ptes mapped in mms,
                                     * to show when page is mapped
                                     * & limit reverse map searches.
                                     */
            unsigned int inuse;     /* SLUB: Nr of objects */
        };
        union {
            struct {
                unsigned long private;          /* Mapping-private opaque data:
                                                 * usually used for buffer_heads
                                                 * if PagePrivate set; used for
                                                 * swp_entry_t if PageSwapCache;
                                                 * indicates order in the buddy
                                                 * system if PG_buddy is set.
                                                 */
                struct address_space *mapping;  /* If low bit clear, points to
                                                 * inode address_space, or NULL.
                                                 * If page mapped as anonymous
                                                 * memory, low bit is set, and
                                                 * it points to anon_vma object:
                                                 * see PAGE_MAPPING_ANON below.
                                                 */
            };
            ...
            struct kmem_cache *slab;    /* SLUB: Pointer to slab */
            struct page *first_page;    /* Compound tail pages */
        };
        union {
            pgoff_t index;          /* Our offset within mapping. */
            void *freelist;         /* SLUB: freelist req. slab lock */
        };
        struct list_head lru;       /* Pageout list, eg. active_list
                                     * protected by zone->lru_lock !
                                     */
    #if defined(WANT_PAGE_VIRTUAL)
        void *virtual;              /* Kernel virtual address (NULL if
                                       not kmapped, ie. highmem) */
    #endif /* WANT_PAGE_VIRTUAL */
    };

6 Before kernel 2.6.3, this was not true. The Sparc architecture could only provide 24 bits for atomic manipulation, so the generic code for all architectures needed to stick to this limit. Luckily, this problem has been resolved now by improvements in the Sparc-specific code.

The elements slab, freelist, and inuse are used by the slub allocator. We do not need to be concerned with these special arrangements, and they are not used if support for the slub allocator is not compiled into the kernel, so I omit them in the following discussion to simplify matters.

Each page frame is described by this structure in an architecture-independent format that does not depend on the CPU type used. Besides the slub elements, the page structure includes several other elements that can only be explained accurately in the context of kernel subsystems discussed elsewhere. I shall nevertheless provide an overview of the contents of the structure, even though this means referencing later chapters.

□ flags stores architecture-independent flags to describe page attributes. I discuss the different flag options below.

□ _count is a usage count indicating the number of references to this page in the kernel. When its value reaches 0, the kernel knows that the page instance is not currently in use and can therefore be removed. If its value is greater than 0, the instance should on no account be removed from memory. If you are not familiar with reference counters, you should consult Appendix C for further information.

□ _mapcount indicates how many entries in the page table point to the page.

□ lru is a list head used to keep the page on various lists that allow grouping the pages into different categories, most importantly active and inactive pages. Especially the discussion in Chapter 18 will come back to these lists.

□ The kernel allows for combining multiple adjacent pages into a larger compound page. The first page in the cluster is called the head page, while all other pages are called tail pages. All tail pages have first_page set to point to the head page.

□ mapping specifies the address space in which a page frame is located. index is the offset within the mapping. Address spaces are a very general concept used, for example, when reading a file into memory. An address space is used to associate the file contents (data) with the areas in memory into which the contents are read. By means of a small trick,7 mapping is able to hold not only a pointer, but also information on whether a page belongs to an anonymous memory area that is not associated with an address space. If the bit with numeric value 1 is set in mapping, the pointer does not point to an instance of address_space but to another data structure (anon_vma) that is important in the implementation of reverse mapping for anonymous pages; this structure is discussed in Section 4.11.2. Double use of the pointer is possible because address_space instances are always aligned with sizeof(long); the least significant bit of a pointer to this instance is therefore 0 on all machines supported by Linux.

The pointer can be used directly if it points normally to an instance of address_space. If the trick involving setting the least significant bit to 1 is used, the kernel can restore the pointer by means of the following operation:

anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON)

□ private is a pointer to "private" data ignored by virtual memory management. The pointer can be employed in different ways depending on page usage. It is mostly used to associate the page with data buffers as described in the following chapters.

□ virtual is used for pages in the highmem area, in other words, for pages that cannot be directly mapped into kernel memory. virtual then holds the virtual address of the page.

As the pre-processor statement #if defined(WANT_PAGE_VIRTUAL) shows, the virtual element is only part of struct page if the corresponding pre-processor constant is defined. Currently, this is only the case for a few architectures, namely, Motorola m68k, FRV, and Xtensa.

All other architectures adopt a different scheme of addressing virtual pages. At the heart of this is a hash table used to find the address of all highmem pages. Section 3.5.8 deals with the appropriate techniques in more detail. Handling the hash table requires some mathematical operations that are slow on the aforementioned machines, so they chose the direct approach.
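The low-bit trick used for the mapping element, as described in the list above, can be illustrated with a small user-space sketch. Everything here is an assumption made for illustration: the two structure types are empty stand-ins, struct page_sketch and the helper functions are hypothetical, and mapping is stored as uintptr_t so that the bit arithmetic is well defined in portable C. Only the constant name PAGE_MAPPING_ANON and the subtraction used to restore the pointer are taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_MAPPING_ANON ((uintptr_t)1)  /* bit with numeric value 1 */

/* Stand-ins for the kernel types; only their alignment matters here. */
struct address_space { long placeholder; };
struct anon_vma      { long placeholder; };

struct page_sketch {
    uintptr_t mapping;  /* pointer value, possibly with the low bit set */
};

static void page_set_file_mapping(struct page_sketch *page,
                                  struct address_space *as)
{
    page->mapping = (uintptr_t)as;
}

static void page_set_anon_mapping(struct page_sketch *page,
                                  struct anon_vma *av)
{
    uintptr_t raw = (uintptr_t)av;

    /* The trick only works if the low bit of the pointer is unused. */
    assert((raw & PAGE_MAPPING_ANON) == 0);
    page->mapping = raw + PAGE_MAPPING_ANON;
}

static int page_is_anon(const struct page_sketch *page)
{
    return (page->mapping & PAGE_MAPPING_ANON) != 0;
}

/* Restore the anon_vma pointer by removing the tag bit again. */
static struct anon_vma *page_anon_vma_of(const struct page_sketch *page)
{
    if (!page_is_anon(page))
        return NULL;
    return (struct anon_vma *)(page->mapping - PAGE_MAPPING_ANON);
}

static struct address_space *page_file_mapping_of(const struct page_sketch *page)
{
    if (page_is_anon(page))
        return NULL;
    return (struct address_space *)page->mapping;
}
```

The sketch saves a separate flag or second pointer per page, at the price that every reader of mapping has to check the low bit before using the value.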

Architecture-Independent Page Flags

The different attributes of a page are described by a series of page flags stored as bits in the flags element of struct page. The flags are independent of the architecture used and cannot therefore provide CPU- or machine-specific information (this information is held in the page table itself as is shown below).

7 The trick borders on the unscrupulous but helps save space in one of the most frequently needed kernel structures.

Not only are the individual flags defined with the help of the pre-processor in page-flags.h, but also macros are generated to set, delete, and query the flags. In doing so, the kernel conforms to a universal naming scheme; for example, the PG_locked constant defines the bit position in flags to specify whether a page is locked or not. The following macros are available to manipulate the bit:

□ PageLocked queries whether the bit is set.

□ SetPageLocked sets the PG_locked bit, regardless of its previous state.

□ TestSetPageLocked sets the bit, but also returns its old value.

□ ClearPageLocked deletes the bit regardless of its previous state.

□ TestClearPageLocked deletes the bit and returns its old value.

There is an identical set of macros to perform the operations shown on the appropriate bit for the other page flags. The macros are implemented atomically. Although some of them are made up of several statements, special processor commands are used to ensure that they act as if they were a single statement; that is, they cannot be interrupted as this would result in race conditions. (Chapter 14 describes how race conditions arise and how they can be prevented.)
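What "setting a bit and returning its old value as a single step" means can be sketched with C11 atomics in user space. This is not the kernel's implementation, which relies on architecture-specific bit operations; the function names and the use of atomic_ulong are assumptions for illustration only.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Build the mask for a bit position within one unsigned long. */
static unsigned long flag_mask(unsigned int bit)
{
    assert(bit < sizeof(unsigned long) * CHAR_BIT);
    return 1UL << bit;
}

/* Set the bit and report whether it was set before. The read and the
 * write happen in one atomic read-modify-write operation, so no other
 * thread can slip in between them. */
static bool flag_test_and_set(atomic_ulong *flags, unsigned int bit)
{
    unsigned long mask = flag_mask(bit);
    return (atomic_fetch_or(flags, mask) & mask) != 0;
}

/* Clear the bit and report whether it was set before. */
static bool flag_test_and_clear(atomic_ulong *flags, unsigned int bit)
{
    unsigned long mask = flag_mask(bit);
    return (atomic_fetch_and(flags, ~mask) & mask) != 0;
}

/* Query the bit without modifying it. */
static bool flag_test(atomic_ulong *flags, unsigned int bit)
{
    return (atomic_load(flags) & flag_mask(bit)) != 0;
}
```

If two threads call flag_test_and_set on the same cleared bit at the same time, exactly one of them sees false as the old value, which is the property a lock bit needs.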

Which page flags are available? The following list includes the most important flags (again, their meanings become clear in later chapters):

□ PG_locked specifies whether a page is locked. If the bit is set, other parts of the kernel are not allowed to access the page. This prevents race conditions in memory management, for example, when reading data from hard disk into a page frame.

□ PG_error is set if an error occurs during an I/O operation involving the page.

□ PG_referenced and PG_active control how actively a page is used by the system. This information is important when the swapping subsystem has to select which page to swap out. The interaction of the two flags is explained in Chapter 18.

□ PG_uptodate indicates that the data of a page have been read without error from a block device.

□ PG_dirty is set when the contents of the page have changed as compared to the data on hard disk. For reasons of performance, pages are not written back immediately after each change. The kernel therefore uses this flag to note which pages have been changed so that they can be flushed later.

Pages for which this flag has been set are referred to as dirty (generally, this means that the data in RAM and the data on a secondary storage medium such as a hard disk have not been synchronized).

□ PG_lru helps implement page reclaim and swapping. The kernel uses two least recently used lists8 to distinguish between active and inactive pages. The bit is set if the page is held on one of these lists. There is also a PG_active flag that is set if the page is on the list of active pages. Chapter 18 discusses this important mechanism in detail.

□ PG_highmem indicates that a page is in high memory because it cannot be mapped permanently into kernel memory.

□ PG_private must be set if the value of the private element in the page structure is non-NULL. Pages that are used for I/O use this field to subdivide the page into buffers (see Chapter 16 for more information), but other parts of the kernel find different uses to attach private data to a page.

8 Frequently used entries are automatically in the foremost positions on this type of list, whereas inactive entries are always moved toward the end of the list.

□ PG_writeback is set for pages whose contents are in the process of being written back to a block device.

□ PG_slab is set for pages that are part of the slab allocator discussed in Section 3.6.

□ PG_swapcache is set if the page is in the swap cache; in this case, private contains an entry of type swp_entry_t (further details are provided in Chapter 18).

□ When the available amount of memory gets smaller, the kernel tries to periodically reclaim pages, that is, get rid of inactive, unused pages. Chapter 18 discusses the details. Once the kernel has decided to reclaim a specific page, this is announced by setting the PG_reclaim flag.

□ PG_buddy is set if the page is free and contained on the lists of the buddy system, that is, the core of the page allocation mechanism.

□ PG_compound denotes that the page is part of a larger compound page consisting of multiple adjacent regular pages.

A number of standard macros are defined to check whether a page has a specific bit set, or to manipulate a bit. Their names follow a certain pattern:

□ PageXXX(page) checks if a page has the PG_XXX bit set. For instance, PageDirty checks for the PG_dirty bit, while PageActive checks for PG_active, and so on.

□ SetPageXXX sets a bit regardless of its previous state. Where a TestSetPageXXX variant is provided, it sets the bit and also returns the previous value.

□ ClearPageXXX unconditionally deletes a specific bit.

□ TestClearPageXXX clears a bit if it is set, but also returns the previously active value.

Notice that these operations are implemented atomically. Chapter 5 discusses what this means in more detail.
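The naming pattern lends itself to generation by the pre-processor. The following user-space sketch shows one way such a family of accessors could be stamped out with token pasting; it is not the definition found in page-flags.h. The macro DEFINE_PAGEFLAG_SKETCH, the type struct page_sketch, and the bit positions in the enumeration are all hypothetical, and C11 atomics stand in for the kernel's bit operations.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative bit positions; the kernel's actual numbering may differ. */
enum pageflags_sketch { PG_locked, PG_dirty, PG_active, NR_FLAGS_SKETCH };

static_assert(NR_FLAGS_SKETCH <= sizeof(unsigned long) * CHAR_BIT,
              "all flags must fit into one unsigned long");

struct page_sketch {
    atomic_ulong flags;
};

/* Generate PageXXX, SetPageXXX, ClearPageXXX and TestClearPageXXX
 * for one flag. Name is pasted into the function names. */
#define DEFINE_PAGEFLAG_SKETCH(Name, bit)                               \
static inline bool Page##Name(struct page_sketch *p)                    \
{                                                                       \
    return (atomic_load(&p->flags) & (1UL << (bit))) != 0;              \
}                                                                       \
static inline void SetPage##Name(struct page_sketch *p)                 \
{                                                                       \
    atomic_fetch_or(&p->flags, 1UL << (bit));                           \
}                                                                       \
static inline void ClearPage##Name(struct page_sketch *p)               \
{                                                                       \
    atomic_fetch_and(&p->flags, ~(1UL << (bit)));                       \
}                                                                       \
static inline bool TestClearPage##Name(struct page_sketch *p)           \
{                                                                       \
    unsigned long old = atomic_fetch_and(&p->flags, ~(1UL << (bit)));   \
    return (old & (1UL << (bit))) != 0;                                 \
}

DEFINE_PAGEFLAG_SKETCH(Locked, PG_locked)
DEFINE_PAGEFLAG_SKETCH(Dirty, PG_dirty)
DEFINE_PAGEFLAG_SKETCH(Active, PG_active)
```

With these definitions, code can call PageDirty(page) or TestClearPageDirty(page) without ever spelling out a bit mask, which is the main benefit of the uniform naming scheme.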

Often it is necessary to wait until the state of a page changes, and then resume work. Two auxiliary functions provided by the kernel are of particular interest for us:

<pagemap.h>

    void wait_on_page_locked(struct page *page);
    void wait_on_page_writeback(struct page *page);

Assume that one part of the kernel wants to wait until a locked page has been unlocked. wait_on_page_locked allows for doing this. While how this is technically done is discussed in Chapter 14, it suffices to know here that after calling the function, the kernel will go to sleep if the page is locked. Once the page becomes unlocked, the sleeper is automatically woken up and can continue its work.

wait_on_page_writeback works similarly, but waits until any pending writeback operations in which the data contained in the page are synchronized with a block device — a hard disk, for instance — have been finished.
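The sleep-until-unlocked behavior can be imitated in user space. The sketch below swaps in a POSIX mutex and condition variable for the kernel's own sleeping mechanism, so it illustrates only the observable behavior, not how the kernel implements it. struct page_sketch and the functions with the _sketch suffix are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* User-space stand-in for a page with a lock bit. */
struct page_sketch {
    pthread_mutex_t mutex;     /* protects locked */
    pthread_cond_t  unlocked;  /* signalled when locked becomes false */
    bool            locked;
};

static void page_sketch_init(struct page_sketch *p, bool locked)
{
    pthread_mutex_init(&p->mutex, NULL);
    pthread_cond_init(&p->unlocked, NULL);
    p->locked = locked;
}

/* Return immediately if the page is not locked; otherwise sleep
 * until another thread unlocks it. */
static void wait_on_page_locked_sketch(struct page_sketch *p)
{
    pthread_mutex_lock(&p->mutex);
    while (p->locked)
        pthread_cond_wait(&p->unlocked, &p->mutex);
    pthread_mutex_unlock(&p->mutex);
}

/* Unlock the page and wake up all sleepers. */
static void unlock_page_sketch(struct page_sketch *p)
{
    pthread_mutex_lock(&p->mutex);
    assert(p->locked);
    p->locked = false;
    pthread_cond_broadcast(&p->unlocked);
    pthread_mutex_unlock(&p->mutex);
}
```

The loop around pthread_cond_wait rechecks the condition after every wakeup, so a waiter continues only once the page really is unlocked.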
