Memory Zones

The kernel uses the zone structure to describe a zone. It is defined as follows:

struct zone {
        /* Fields commonly accessed by the page allocator */
        unsigned long pages_min, pages_low, pages_high;

        unsigned long lowmem_reserve[MAX_NR_ZONES];

        struct per_cpu_pageset pageset[NR_CPUS];

        /*
         * free areas of different sizes
         */
        spinlock_t lock;
        struct free_area free_area[MAX_ORDER];

        ZONE_PADDING(_pad1_)

        /* Fields commonly accessed by the page reclaim scanner */
        spinlock_t lru_lock;
        struct list_head active_list;
        struct list_head inactive_list;
        unsigned long nr_scan_active;
        unsigned long nr_scan_inactive;
        unsigned long pages_scanned;     /* since last reclaim */
        unsigned long flags;             /* zone flags, see below */

        atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
        int prev_priority;

        ZONE_PADDING(_pad2_)

        /* Rarely used or read-mostly fields */
        wait_queue_head_t *wait_table;
        unsigned long wait_table_hash_nr_entries;
        unsigned long wait_table_bits;

        /* Discontig memory support fields. */
        struct pglist_data *zone_pgdat;
        unsigned long zone_start_pfn;

        unsigned long spanned_pages;     /* total size, including holes */
        unsigned long present_pages;     /* amount of memory (excluding holes) */

        /*
         * rarely used fields:
         */
        char *name;
} ____cacheline_maxaligned_in_smp;

The striking aspect of this structure is that it is divided into several sections separated by ZONE_PADDING. This is because zone structures are accessed very frequently. On multiprocessor systems, it commonly occurs that different CPUs try to access structure elements at the same time. Locks (examined in Chapter 5) are therefore used to prevent them from interfering with each other and giving rise to errors and inconsistencies. The two spinlocks of the structure, zone->lock and zone->lru_lock, are often acquired because the kernel very frequently accesses the structure.1

Data are processed faster when they are held in a CPU cache. Caches are divided into lines, and each line is responsible for various memory areas. The kernel invokes the ZONE_PADDING macro to generate "padding" that is added to the structure to ensure that each lock is in its own cache line.

The compiler keyword ____cacheline_maxaligned_in_smp is also used to achieve optimal cache alignment.
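For reference, ZONE_PADDING is defined in <linux/mmzone.h> along the following lines (the exact alignment attribute varies between kernel versions; on uniprocessor builds the macro expands to nothing, because false sharing between CPUs cannot occur there):

/* <linux/mmzone.h>, sketch; details differ slightly between versions */
#if defined(CONFIG_SMP)
struct zone_padding {
        char x[0];                      /* zero-size member: no data, only alignment */
} ____cacheline_maxaligned_in_smp;      /* start the next field on a fresh cache line */
#define ZONE_PADDING(name)      struct zone_padding name;
#else
#define ZONE_PADDING(name)              /* no padding needed on UP systems */
#endif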

The last two sections of the structure are also separated from each other by padding. As neither includes a lock, the primary aim is to keep the data in a cache line for quick access and thus to dispense with the need for loading the data from RAM, which is a slow process. The increase in size due to the padding structures is negligible, particularly as there are relatively few instances of zone structures in kernel memory.

What is the meaning of the structure elements? Since memory management is a complex and comprehensive part of the kernel, it is not possible to cover the exact meaning of all elements at this point — a good part of this and of following chapters will be devoted to understanding the associated data structures and mechanisms. What I can provide, however, is an overview that gives a taste of the problems I am about to discuss. A large number of forward references is nevertheless unavoidable.

1The locks are therefore known as hotspots. In Chapter 17, some tricks that are used by the kernel to reduce the pressure on these hotspots are discussed.

□ pages_min, pages_high, and pages_low are "watermarks" used when pages are swapped out. The kernel can write pages to hard disk if insufficient RAM is available. These three elements influence the behavior of the swapping daemon.

□ If more than pages_high pages are free, the state of the zone is ideal.

□ If the number of free pages falls below pages_low, the kernel begins to swap pages out onto the hard disk.

□ If the number of free pages falls below pages_min, the pressure to reclaim pages is increased because free pages are urgently needed in the zone. Chapter 18 discusses the various means by which the kernel finds relief.

The importance of these watermarks will mainly show in Chapter 18, but they also come into play in Section 3.5.5.

□ The lowmem_reserve array specifies, for each memory zone, a number of pages reserved for critical allocations that must not fail under any circumstances. Each zone contributes according to its importance. The algorithm that calculates the individual contributions is discussed in Section 3.2.2.
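To give a taste of how the watermarks and lowmem_reserve interact, here is a much simplified sketch of the check performed by zone_watermark_ok in mm/page_alloc.c. The real function additionally takes the allocation order and various allocation flags into account; the _sketch suffix marks the helper name as mine, not the kernel's:

/* Simplified sketch of the zone_watermark_ok() logic (mm/page_alloc.c).
 * mark is one of zone->pages_min, pages_low, or pages_high; classzone_idx
 * identifies the zone for which the allocation was originally intended.
 */
int zone_watermark_ok_sketch(struct zone *z, unsigned long mark, int classzone_idx)
{
        unsigned long free_pages = zone_page_state(z, NR_FREE_PAGES);

        /* The zone must keep 'mark' pages free, plus the reserve that
         * protects it against fallback allocations from higher zones. */
        if (free_pages <= mark + z->lowmem_reserve[classzone_idx])
                return 0;       /* watermark not met */
        return 1;               /* enough free pages available */
}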

□ pageset is an array used to implement per-CPU hot-n-cold page lists. The kernel uses these lists to store fresh pages that can be used to satisfy allocation requests. However, the pages are distinguished by their cache status: pages that are most likely still cache-hot, and can therefore be accessed quickly, are kept apart from cache-cold pages. The next section discusses the struct per_cpu_pageset data structure used to realize this behavior.

□ free_area is an array of data structures of the same name used to implement the buddy system. Each array element stands for contiguous memory areas of a fixed size. Management of free memory pages contained in each area is performed starting from free_area.

The employed data structures merit a discussion of their own, and Section 3.5.5 covers the implementation details of the buddy system in depth.
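To anticipate just the shape of the data structure: in slightly simplified form (more recent kernels keep one free list per migrate type in each element), a free_area instance looks as follows:

/* <linux/mmzone.h>, simplified */
struct free_area {
        struct list_head free_list;     /* blocks consisting of 2^order free pages */
        unsigned long nr_free;          /* number of free blocks of this size */
};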

□ The elements of the second section are responsible for cataloging the pages used in the zone according to activity. A page is regarded as active by the kernel if it is accessed frequently; an inactive page is obviously the opposite. This distinction is important when pages need to be swapped out. If possible, frequently used pages should be left intact, but superfluous inactive pages can be swapped out with impunity.

The following elements are involved:

□ active_list collects the active pages, and inactive_list the inactive pages (page instances).

□ nr_scan_active and nr_scan_inactive specify how many active and inactive pages are to be scanned when reclaiming memory.

□ pages_scanned specifies how many pages were unsuccessfully scanned since the last time a page was swapped out.

□ flags describes the current status of the zone. The following flags are allowed:

typedef enum {
        ZONE_ALL_UNRECLAIMABLE,  /* all pages pinned */
        ZONE_RECLAIM_LOCKED,     /* prevents concurrent reclaim */
        ZONE_OOM_LOCKED,         /* zone is in OOM killer zonelist */
} zone_flags_t;

It is also possible that none of these flags is set. This is the normal state of the zone. zone_all_unreclaimable is a state that can occur when the kernel tries to reuse some pages of the zone (page reclaim, see Chapter 18), but this is not possible at all because all pages are pinned. For instance, a userspace application could have used the mlock system call to instruct the kernel that pages must not be removed from physical memory, for example, by swapping them out. Such a page is said to be pinned. If all pages in a zone suffer this fate, the zone is unreclaimable, and the flag is set. To waste no time, the swapping daemon scans zones of this kind very briefly when it is looking for pages to reclaim.2
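From the userspace side, pinning pages is straightforward. A minimal example, with error handling kept to the essentials (note that mlock is subject to the RLIMIT_MEMLOCK resource limit):

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 4096;
        char *buf = malloc(len);

        if (buf == NULL)
                return 1;

        /* Pin the buffer: the kernel must not swap these pages out. */
        if (mlock(buf, len) != 0)
                perror("mlock");

        /* ... work with the pinned memory ... */

        munlock(buf, len);      /* release the pinning again */
        free(buf);
        return 0;
}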

On SMP systems, multiple CPUs could be tempted to reclaim a zone concurrently. The flag zone_reclaim_locked prevents this: if a CPU is reclaiming a zone, it sets the flag, and other CPUs will not try.

zone_oom_locked is reserved for an unfortunate situation: if processes use up so much memory that essential operations cannot be completed anymore, the kernel tries to select the worst memory eater and kill it to obtain more free pages. The flag prevents multiple CPUs from getting into each other's way in this case.

The kernel provides three auxiliary functions to test and set zone flags:

void zone_set_flag(struct zone *zone, zone_flags_t flag)

int zone_test_and_set_flag(struct zone *zone, zone_flags_t flag)

void zone_clear_flag(struct zone *zone, zone_flags_t flag)

zone_set_flag and zone_clear_flag set and clear a certain flag, respectively. zone_test_and_set_flag first tests whether a given flag is set and sets it if not; the old state of the flag is returned to the caller.
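The helpers are thin wrappers around the kernel's atomic bit operations; their implementation in <linux/mmzone.h> is essentially the following:

/* <linux/mmzone.h> */
static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
{
        set_bit(flag, &zone->flags);
}

static inline int zone_test_and_set_flag(struct zone *zone, zone_flags_t flag)
{
        return test_and_set_bit(flag, &zone->flags);
}

static inline void zone_clear_flag(struct zone *zone, zone_flags_t flag)
{
        clear_bit(flag, &zone->flags);
}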

□ vm_stat keeps a plethora of statistical information about the zone. Since most of the information kept in there will not make much sense at the moment, a detailed discussion is deferred to Section 17.7.1. For now, it suffices to know that the information is updated from places all over the kernel. The auxiliary function zone_page_state allows for reading the information in vm_stat:

static inline unsigned long zone_page_state(struct zone *zone, enum zone_stat_item item)

item can, for instance, be nr_active or nr_inactive to query the number of active and inactive pages stored on active_list and inactive_list discussed above. The number of free pages in the zone is obtained with nr_free_pages.
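The reader function itself is just an atomic read of the per-zone counter array; its definition in <linux/vmstat.h> looks roughly like this:

/* <linux/vmstat.h> */
static inline unsigned long zone_page_state(struct zone *zone,
                                            enum zone_stat_item item)
{
        long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
        if (x < 0)
                x = 0;  /* per-CPU deltas can make the sum transiently negative */
#endif
        return x;
}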

□ prev_priority stores the priority with which the zone was scanned in the last scan operation until sufficient page frames were freed in try_to_free_pages (see Chapter 17). As you shall also see in Chapter 17, the decision as to whether mapped pages are swapped out depends on this value.

2However, scanning cannot be totally dispensed with because the zone may contain reclaimable pages again at some time in the future. If so, the flag is removed and the kswapd daemon treats the zone again like any other zone.

□ wait_table, wait_table_bits, and wait_table_hash_nr_entries implement a wait queue for processes waiting for a page to become available. While the details of this mechanism are shown in Chapter 14, the intuitive notion holds pretty well: Processes queue up in a line to wait for some condition. When this condition becomes true, they are notified by the kernel and can resume their work.
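To see the three fields in action: the kernel finds the wait queue responsible for a given page by hashing the page pointer into the table, roughly as done by page_waitqueue in mm/filemap.c:

/* mm/filemap.c, slightly abridged */
static wait_queue_head_t *page_waitqueue(struct page *page)
{
        const struct zone *zone = page_zone(page);

        /* hash the page pointer into one of the zone's wait queues */
        return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}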

□ The association between a zone and the parent node is established by zone_pgdat, which points to the corresponding instance of pglist_data.

□ zone_start_pfn is the index of the first page frame of the zone.

□ The remaining three fields are rarely used, so they've been placed at the end of the data structure.

name is a string that holds a conventional name for the zone. Three options are available at present: Normal, DMA, and HighMem.

spanned_pages specifies the total number of pages in the zone. However, not all of them need be usable, since there may be small holes in the zone, as already mentioned. A further counter (present_pages) therefore indicates the number of pages that are actually usable. Generally, the value of this counter is the same as that of spanned_pages.
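Taken together, the last three numeric fields describe the page-frame window of the zone. The following helper is hypothetical and serves only as illustration (later kernels provide an equivalent named zone_end_pfn):

/* Hypothetical helper for illustration: the page frames of a zone span the
 * half-open interval [zone_start_pfn, zone_start_pfn + spanned_pages).
 * Only present_pages of these frames are backed by usable memory.
 */
static inline unsigned long zone_end_pfn_sketch(const struct zone *zone)
{
        return zone->zone_start_pfn + zone->spanned_pages;
}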
