Reclaiming Page Frames

The virtual memory subsystem of Linux is, without any doubt, the most complex and performance-critical component of the whole kernel.

In previous chapters, we explained how the kernel handles dynamic memory by keeping track of free and busy page frames. We have also discussed how every process in User Mode has its own linear address space so that page frames can be assigned to the process at the very last possible moment. Finally, we have also described how dynamic memory is used to cache the data of the slow block devices.

In this chapter, we complete our description of the virtual memory subsystem by discussing page frame reclaiming. As we saw in Chapter 14, the cache systems grab more and more page frames but never release any of them. This is reasonable because cache systems don't know if and when processes will reuse some of the cached data and are therefore unable to identify the portions of cache that should be released. Moreover, thanks to the demand paging mechanism described in Chapter 8, User Mode processes get page frames as long as they proceed with their execution; however, demand paging has no way to force processes to release the page frames whenever they are no longer used. Page frame reclaiming is a remedy for this problem.

The kernel developers' worst nightmare is to encounter a situation in which no free page frame exists. When this happens, the kernel might be easily trapped in a deadly chain of memory requests—to free a page frame, the kernel must write its data to disk. However, to accomplish this operation, the kernel requires another page frame (for instance, to allocate the buffer heads for the I/O data transfer). Since no free page frame exists, no page frame can be freed. In this situation, there is just one solution: kill a victim User Mode process to reclaim the page frames it was using. Of course, even if this solution avoids a system crash, it is not very satisfying for the end users.

The goal of page frame reclaiming is to conserve a minimal pool of free page frames so that the kernel may safely recover from "low on memory" conditions. To do this, it must neither trash the disk caches nor penalize User Mode processes too much; otherwise, system performance will be greatly reduced. As a matter of fact, the hardest job of a developer working on the virtual memory subsystem consists of finding an algorithm that ensures acceptable performance both on desktop machines (on which memory requests are quite limited) and on high-end machines like large database servers (on which memory requests tend to be huge).

Unfortunately, finding a good page frame reclaiming algorithm is a rather empirical job, with very little support from theory. The situation is somewhat similar to evaluating the factors that determine the dynamic priority of a process: the main objective is to tune the parameters that achieve good system performance, without asking too many questions about why it works well. Often, it's just a matter of "let's try this approach and see what happens..." An unpleasant side effect of this empirical approach is that the code changes quickly, even in the even-numbered versions of Linux, which are supposed to be stable. The description that follows refers to Linux 2.4.18.

16.7.1 Outline of the Page Frame Reclaiming Algorithm

Before plunging into details, let's give a brief overview of Linux page frame reclaiming. (Looking too closely at the trees' leaves might lead us to miss the whole forest!) Essentially, the kernel reclaims page frames in two ways:

• By reclaiming an unused page frame within a cache (either a memory cache or a disk cache)

• By reclaiming a page that belongs to a memory region of a process or to an IPC shared memory region (see Section 19.3.5)

Of course, the algorithm should take into consideration the various kinds of page frames. For instance, it is preferable to reclaim page frames from a memory cache rather than from a disk cache, because the latter pages include precious data obtained by costly accesses to disk block devices.

Moreover, the algorithm should keep track of the number of accesses to every page frame. If a page has not been accessed for a long time, the probability that it will be accessed in the near future is low; on the other hand, if a page has been accessed recently, the probability that it will continue to be accessed is high. This is just another application of the locality principle mentioned in Section 2.4.7.

Therefore, the page frame reclaiming algorithm is a blend of several heuristics:

• Careful selection of the order in which caches are examined

• Ordering of pages based on ageing (least recently used pages should be freed before pages accessed recently)

• Distinction of pages based on the page state (for example, nondirty pages are better candidates than dirty pages for swapping out because they don't have to be written to disk)

The main function that triggers page frame reclaiming is try_to_free_pages( ). It is invoked every time the kernel fails in allocating memory. For instance:

• When the grow_buffers( ) function fails to allocate a new buffer page, or the create_buffers( ) function fails to allocate the buffer heads for a buffer page (see Section 14.2.2 and Section 14.2.3). In these cases, the kernel executes free_more_memory( ), which in turn invokes try_to_free_pages( ).

• When the alloc_pages( ) function fails to allocate a group of page frames in a given list of memory zones (see Section 7.1.7). Recall that every memory zone descriptor includes the pages_min watermark, which specifies the number of page frames that should remain free to cope with the "low on memory" emergencies. If no zone in the list has enough free memory to satisfy the request while preserving the minimal pool of free page frames, the kernel invokes the balance_classzone( ) function, which in turn invokes try_to_free_pages( ).

• When the kswapd kernel thread discovers that the number of free page frames in some memory zone falls below the pages_low watermark (see the later section Section 16.7.7).

The core of the try_to_free_pages( ) function is the shrink_caches( ) function: it receives as a parameter a "goal"—namely, a given number of page frames to be reclaimed—and it terminates as soon as it has reached the goal, if possible.

To help shrink_caches( ) do its job, all pages in dynamic memory are grouped into two lists called the "active list" and the "inactive list"; they are also collectively denoted as LRU lists. The former list tends to include the pages that have been accessed recently, while the latter tends to include the pages that have not been accessed for some time. Clearly, pages should be stolen from the inactive list, although some percolation between the two lists is performed from time to time.

The shrink_caches( ) function invokes, in turn, the following functions:

kmem_cache_reap( )

Removes empty slabs from the slab allocator caches.

refill_inactive( )

Moves pages from the active list to the inactive list, and vice versa.

shrink_cache( )

Tries to free page frames by writing to disk inactive pages included in the page cache.

shrink_dcache_memory( )

Removes entries from the dentry cache.

shrink_icache_memory( )

Removes entries from the inode cache.

Let's now discuss in greater detail the various components of the page frame reclaiming algorithm.

16.7.2 The Least Recently Used (LRU) Lists

The active list and the inactive list of pages are the core data structures of the page frame reclaiming algorithm. The heads of these two doubly linked lists are stored, respectively, in the active_list and inactive_list variables. The nr_active_pages and nr_inactive_pages variables store the number of pages in the two lists. The pagemap_lru_lock spin lock protects the two lists against concurrent accesses in SMP systems.

If a page belongs to an LRU list, its PG_lru flag in the page descriptor is set. Moreover, if the page belongs to the active list, the PG_active flag is set, while if it belongs to the inactive list, the PG_active flag is cleared. The lru field of the page descriptor stores the pointers to the next and previous elements in the LRU list.
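The flag and list bookkeeping just described can be sketched in a small user-space model. The structures below are deliberately simplified stand-ins for the kernel's page descriptor and list heads; only the names of the flags, variables, and helper functions mirror the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of the LRU bookkeeping: PG_lru marks membership in
   either list, PG_active selects which one. */
#define PG_lru    (1u << 0)
#define PG_active (1u << 1)

struct page {
    unsigned int flags;
    struct page *lru_next, *lru_prev;  /* stands in for the kernel's lru field */
};

static struct page *active_list, *inactive_list;
static int nr_active_pages, nr_inactive_pages;

static void push(struct page **head, struct page *p)
{
    p->lru_next = *head;
    p->lru_prev = NULL;
    if (*head)
        (*head)->lru_prev = p;
    *head = p;
}

void add_page_to_active_list(struct page *p)
{
    p->flags |= PG_active;           /* page now belongs to the active list */
    push(&active_list, p);
    nr_active_pages++;
}

void add_page_to_inactive_list(struct page *p)
{
    push(&inactive_list, p);         /* PG_active stays clear */
    nr_inactive_pages++;
}

void lru_cache_add(struct page *p)
{
    if (!(p->flags & PG_lru)) {      /* only pages outside the LRU lists */
        p->flags |= PG_lru;
        add_page_to_inactive_list(p); /* new pages start out inactive */
    }
}
```

Note the invariant the model preserves: a page with PG_lru set is on exactly one of the two lists, and PG_active tells which one.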

Several auxiliary functions and macros are available to handle the LRU lists:

add_page_to_active_list

Sets the PG_active flag, adds the page to the head of the active list, and increases nr_active_pages.

add_page_to_inactive_list

Adds the page to the head of the inactive list and increases nr_inactive_pages.

del_page_from_active_list

Removes the page from the active list, clears the PG_active flag, and decreases nr_active_pages.

del_page_from_inactive_list

Removes the page from the inactive list and decreases nr_inactive_pages.

activate_page_nolock( ) and activate_page( )

If the page is in the inactive list, moves it into the active list by executing del_page_from_inactive_list and then add_page_to_active_list. The activate_page( ) function also acquires the pagemap_lru_lock spin lock before moving the page.

lru_cache_add( )

If the page is not included in an LRU list, sets the PG_lru flag, acquires the pagemap_lru_lock spin lock, and executes add_page_to_inactive_list to insert the page in the inactive list.

__lru_cache_del( ) and lru_cache_del( )

If the page is included in an LRU list, clears the PG_lru flag and executes either del_page_from_active_list or del_page_from_inactive_list, according to the value of the PG_active flag. The lru_cache_del( ) function also acquires the pagemap_lru_lock spin lock before removing the page.

16.7.2.1 Moving pages across the LRU lists

The kernel collects the pages that were recently accessed in the active list so that it will not scan them when looking for a page frame to reclaim. Conversely, the kernel collects the pages that have not been accessed for a long time in the inactive list. Of course, pages should move from the inactive list to the active list and back, according to whether they are being accessed.

Clearly, two page states ("active" and "inactive") are not sufficient to describe all possible access patterns. For instance, suppose a logger process writes some data in a page once every hour. Although the page is "inactive" for most of the time, the access makes it "active," thus denying the reclaiming of the corresponding page frame, even if it is not going to be accessed for an entire hour. Of course, there is no general solution to this problem because the kernel has no way to predict the behavior of User Mode processes; however, it seems reasonable that pages should not change their status on every single access.

The PG_referenced flag in the page descriptor is used to double the number of accesses required to move a page from the inactive list to the active list; it is also used to double the number of "missing accesses" required to move a page from the active list to the inactive list (see below). For instance, suppose that a page in the inactive list has the PG_referenced flag set to 0. The first page access sets the value of the flag to 1, but the page remains in the inactive list. The second page access finds the flag set and causes the page to be moved in the active list. If, however, the second access does not occur within a given time interval after the first one, the page frame reclaiming algorithm may reset the PG_referenced flag.

As shown in Figure 16-4, the kernel uses the mark_page_accessed( ) and refill_inactive( ) functions to move the pages across the LRU lists. In the figure, the LRU list including the page is specified by the status of the PG_active flag.

Figure 16-4. Moving pages across the LRU lists


Whenever the kernel must mark a page as accessed, it invokes the mark_page_accessed( ) function. This happens every time the kernel determines that a page is being referenced either by a User Mode process, a filesystem layer, or a device driver. For instance, mark_page_accessed( ) is invoked in the following cases:

• When loading an anonymous page of a process on demand (performed by the do_anonymous_page( ) function in Section 8.4.3).

• When reading a block from disk (performed by the bread( ) function in Section 13.4.8).

• When loading on demand a page of a memory mapped file (performed by the filemap_nopage( ) function in Section 15.2.4).

• When reading a page of data from a file (performed by the do_generic_file_read( ) function in Section 15.1.1).

• When swapping in a page (see the earlier section Section 16.6.1).

• When the kernel finds the Accessed flag set in the Page Table entry while searching for a page to be swapped out (see the earlier section Section 16.5.1).

• When the kernel reads a page of data from a disk device (performed by the ext2_get_page( ) function in Chapter 17).

The mark_page_accessed( ) function executes the following code fragment:

if (PageActive(page) || !PageReferenced(page))
    SetPageReferenced(page);
else {
    activate_page(page);
    ClearPageReferenced(page);
}

As shown in Figure 16-4, the function moves the page from the inactive list to the active list only if the PG_referenced flag is set before the invocation.

The kernel periodically checks the status of the pages in the active list by executing the refill_inactive( ) function. Starting from the bottom of the active list (the older pages in the list), the function checks whether the PG_referenced flag of each page is set. If it is, the function clears the flag and moves the page into the first position of the active list; if it isn't, the function moves the page into the first position of the inactive list. The logic in the function is as follows:

if (PageReferenced(page)) {
    ClearPageReferenced(page);
    list_del(&page->lru);
    list_add(&page->lru, &active_list);
} else {
    del_page_from_active_list(page);
    add_page_to_inactive_list(page);
    SetPageReferenced(page);
}

The refill_inactive( ) function does not scan the pages in the inactive list; hence, the PG_referenced flag of a page is never cleared as long as the page remains in the inactive list.

16.7.3 The try_to_free_pages( ) Function

The try_to_free_pages( ) function is the main function that triggers the reclaiming of page frames. It receives the following parameters:

classzone

The memory zone containing the page frames to be reclaimed

gfp_mask

A set of flags whose meaning is exactly the same as the corresponding parameter of the alloc_pages( ) function (see Section 7.1.5)

order

Not used

The goal of the function is to free swap_cluster_max page frames (usually, 32) by repeatedly invoking the shrink_caches( ) function, each time with a higher priority than the previous invocation. The try_to_free_pages( ) function is thus essentially equivalent to the following code fragment:

int priority = DEF_PRIORITY;
int nr_pages = SWAP_CLUSTER_MAX;

if (current->flags & PF_NOIO)
    gfp_mask &= ~(__GFP_IO | __GFP_HIGHIO | __GFP_FS);
do {
    nr_pages = shrink_caches(classzone, priority, gfp_mask, nr_pages);
    if (nr_pages <= 0)
        return 1;
} while (--priority);
out_of_memory();

try_to_free_pages( ) clears the __GFP_IO, __GFP_HIGHIO, and __GFP_FS bits in the gfp_mask parameter if the current process has the PF_NOIO flag set. This flag is set whenever the kernel must ensure that the page frame reclaiming algorithm never triggers I/O data transfers; it is currently used only by kernel threads of the loop device driver, which allows User Mode processes to handle regular files as if they were disk block partitions.
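The mask filtering described above is plain bit clearing. The sketch below uses illustrative bit positions rather than the kernel's actual flag values, and the function name filter_gfp is ours, not the kernel's:

```c
#include <assert.h>

/* Illustrative bit positions — not the kernel's actual values. */
#define __GFP_IO     (1u << 0)
#define __GFP_HIGHIO (1u << 1)
#define __GFP_FS     (1u << 2)
#define PF_NOIO      (1u << 3)

unsigned int filter_gfp(unsigned int task_flags, unsigned int gfp_mask)
{
    /* A PF_NOIO task must never trigger I/O while reclaiming memory,
       so strip every flag that would allow it. */
    if (task_flags & PF_NOIO)
        gfp_mask &= ~(__GFP_IO | __GFP_HIGHIO | __GFP_FS);
    return gfp_mask;
}
```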

The loop is repeated at most DEF_PRIORITY times (usually six times). The value of the decreasing priority loop index is passed to the shrink_caches( ) function; each time, shrink_caches( ) tries harder to reclaim page frames, because lower values correspond to higher priorities.

If DEF_PRIORITY iterations are not enough to reclaim SWAP_CLUSTER_MAX page frames, the kernel is in serious trouble. It has just one last resort: killing a User Mode process to free all its page frames. This operation is performed by the out_of_memory( ) function. Loosely speaking, the victim process is selected from those that have the smallest runtimes, ruling out those that have superuser privileges and those that perform direct I/O operations (see Section 15.3).

16.7.4 The shrink_caches( ) Function

The shrink_caches( ) function invokes several auxiliary functions in a fixed order to reclaim page frames in different memory subsystems. One of these functions is called shrink_cache( ) and should not be confused with the parent function shrink_caches( ).

The function shrink_caches( ) acts on the following parameters:

classzone

The memory zone that contains the page frames to be freed.

priority

The "priority" of this trial: it tells how drastic page frame reclaiming must be.

gfp_mask

Memory allocator flags, specifying the type of page frames to be freed, as well as what the kernel is allowed to do in pursuit of its goal (block the current process, trigger I/O transfers, and so on).

nr_pages

The goal—i.e., the number of page frames to be freed.

The function returns the difference between nr_pages and the number of page frames that have been effectively reclaimed; if more than nr_pages page frames have been freed, the function returns 0. It executes the following actions:

1. Invokes kmem_cache_reap( ) to reclaim page frames from the slab allocator caches (see Section 7.2.7).

2. If kmem_cache_reap( ) succeeds in freeing at least nr_pages page frames, returns 0.

3. Invokes the refill_inactive( ) function to move some pages from the active list to the inactive list. As described in the earlier section Section 16.7.2, refill_inactive( ) clears the PG_referenced flags of the pages at the bottom of the active list, and moves the pages that have not been accessed since the previous execution of shrink_caches( ). The number of pages to be moved is passed as a parameter to refill_inactive( ); it is computed as:

ratio = nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);

The rationale behind the formula is to keep the size of the active list roughly equal to two-thirds of the page cache size (that's another rule of thumb, of course).

4. Invokes shrink_cache( ) to try to reclaim nr_pages page frames from the inactive list (see the next section). If the function succeeds in freeing all the required page frames, it returns 0 (all requested page frames have been freed).

5. At this point, shrink_caches( ) has lost any hope of reaching its target in the current execution. However, it tries to free small objects in several disk caches so that the future invocations of the function will likely succeed in releasing the page frames storing them. Thus, the function invokes shrink_dcache_memory( ) to remove dentry objects from the dentry cache (see the later section Section 16.7.6).

6. Invokes shrink_icache_memory( ) to remove inode objects from the inode cache (see Section 16.7.6 later in this chapter)

7. If the kernel has support for disk quota, the function invokes shrink_dqcache_memory( ) to remove objects from the disk quota cache. (We won't discuss disk quota for lack of space.)

8. Returns the number of page frames still to be freed.

16.7.5 The shrink_cache( ) Function

The shrink_cache( ) function acts on the same parameters as shrink_caches( ): nr_pages, classzone, gfp_mask, and priority. It looks for page frames to be reclaimed in the inactive list. Since the last inserted elements are placed near the head of the list, the function scans the list backward, starting from the tail, so that the oldest pages are considered first.

The function achieves its goal by:

• Freeing to the buddy system the page frames that do not belong to any process

• Swapping out pages belonging to processes if there are too many of them in the scanned portion of the inactive list

The priority parameter controls the size of the portion of the inactive list to be scanned in this invocation of shrink_cache( ). If priority is equal to 6 (DEF_PRIORITY, the lowest priority), the function scans at most one-sixth of the list; as priority decreases, the function scans more and more of the list until it reaches 1 (the highest priority), whereupon it scans the whole list.
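The rule just stated — one-sixth of the list at priority 6, the whole list at priority 1 — amounts to a simple integer division. The function name max_scan below is our own label for this bound, not necessarily the kernel's variable name:

```c
#include <assert.h>

#define DEF_PRIORITY 6

/* Upper bound on the number of inactive pages examined in one invocation
   of shrink_cache(), as described in the text: nr_inactive_pages / priority. */
unsigned long max_scan(unsigned long nr_inactive_pages, int priority)
{
    return nr_inactive_pages / priority;
}
```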

The function starts by acquiring the pagemap_lru_lock spin lock and then loops backward over the pages of the inactive list until either the portion of the list selected by priority has been completely scanned or the requested number of page frames has been reclaimed. For every scanned page, the function performs the following actions:

1. If the need_resched field of the current process is set, temporarily relinquishes the CPU by releasing the pagemap_lru_lock spin lock and invoking schedule( ). When executing again, the function reacquires the spin lock and continues.

2. Moves the page from the tail to the head of the inactive list. This ensures that inactive pages are considered in a round-robin fashion in successive invocations of shrink_cache( ).

3. Checks whether the usage counter of the page is null; if so, it continues with the next page at the tail of the list. Ideally, any page with a null usage counter should belong to the buddy system; however, to free a page frame, first its usage counter is decremented and then the page frame is released to the buddy system. Therefore, there is a small time window in which the page frame reclaiming algorithm may see the page freed.

4. Checks whether the page belongs to the memory zone specified by the classzone parameter; if not, the function stops working on this page and continues with the next page at the tail of the list.

5. Checks whether the page is not a buffer page and whether its usage counter is greater than one or the page doesn't have an image on disk (mapping field set to null). In this case, the page frame cannot be released to the buddy system because the usage counter indicates that there are processes that still reference the page. The function performs the following substeps:

a. Increments a local counter storing the number of scanned pages that cannot be freed.

b. If the counter exceeds a threshold value, releases the pagemap_lru_lock spin lock, invokes swap_out( ) to swap out some process pages (see the earlier section Section 16.5), and returns the number of page frames yet to be freed. The threshold value is equal to the smaller of two values: one-tenth of the number of pages to be scanned and 2^(10-priority) times the number of page frames to be freed.

c. Otherwise, if the counter does not exceed the threshold value, the function continues with the next page at the tail of the inactive list.

6. If the function reaches this point, the page frame can be released to the buddy system. The function tries to lock the page. If the PG_locked flag of the page is already set, it executes the following substeps:

a. If both the PG_launder flag of the page and the __GFP_FS bit in the gfp_mask parameter are set, invokes wait_on_page( ) to sleep until the page is unlocked. The PG_launder flag is set whenever the page is involved in an I/O data transfer triggered by the shrink_cache( ) function itself.

b. Continues with the next page at the tail of the inactive list.

7. Now the page is locked. Checks whether the page is dirty (PG_dirty flag set), the page has an image on disk (mapping field not null), and it is owned only by the page cache—that is, whether the underlying page frame is effectively freeable. If all these conditions hold, and moreover the __GFP_FS bit in the gfp_mask parameter is set, the function updates the disk image by performing the following actions:

a. Clears the PG_dirty flag.

b. Sets the PG_launder flag so that a future invocation of shrink_cache( ) waits for the completion of the I/O data transfer.

c. Increments the page usage counter (a fail-safe mechanism) and releases the pagemap_lru_lock spin lock.

d. Invokes the writepage method of the address_space object of the page. As described in Section 15.2.5, the method activates the I/O data transfer of the page contents to the disk.

e. Decrements the page usage counter and acquires the pagemap_lru_lock spin lock again.

f. Continues with the next page at the tail of the inactive list.

8. If the page is a buffer page (the buffers field is not null), the function tries to free the buffers contained in the page. In particular, it performs the following substeps:

a. Releases the pagemap_lru_lock spin lock and increments the page usage counter (a fail-safe mechanism).

b. Invokes the try_to_release_page( ) function. In turn, this function:

1. Executes the releasepage method of the corresponding address_space object, if it is defined, to release the metadata

2. Invokes the try_to_free_buffers( ) function to free the buffers in the page, provided they are referenced only by the buffer cache (see Section 14.2.2).

c. If try_to_release_page( ) failed in releasing all the buffers in the page, the function unlocks the page, decrements its usage counter to undo the increment done in Step 8a, and continues with the next page at the tail of the inactive list.

d. Otherwise, if try_to_release_page( ) succeeded in releasing all the buffers in the page, the function tries to release the page frame itself. In particular, if the page is anonymous (i.e., has no image on disk), the function acquires the pagemap_lru_lock spin lock, unlocks the page, removes the page from the inactive list, and releases the page frame to the buddy system. If the function has released the number of page frames that was its goal, it also releases the spin lock and returns the value 0; otherwise, it continues with Step 9.

e. The buffer page is included in the page cache because it has an image on disk. It decrements its usage counter to undo the increment in Step 8a and acquires the pagemap_lru_lock spin lock. It then continues with the following step.

9. Acquires the pagecache_lock spin lock.

10. If the page has no image on disk or if the page is still referenced by some process, the function releases the pagecache_lock spin lock, unlocks the page, and jumps back to Step 5a.

11. If the function reaches this point, the page has an image on disk, no process references it, and it no longer includes any buffers. Checks whether the page is dirty (i.e., the PG_dirty flag is set); in this case, the page frame cannot be freed, because its data would be lost. The function releases the pagecache_lock spin lock, unlocks the page, and continues with the next page at the tail of the inactive list.

12. If the function reaches this point, the page has an image on disk, is freeable, and is clean, so the page frame can effectively be reclaimed. If the page belongs to the swap cache, the function gets the swapped-out page identifier from the index field, invokes delete_from_swap_cache( ) to remove the page descriptor from the swap cache, releases the pagecache_lock spin lock, and then invokes swap_free( ) to decrement the usage counter of the page slot.

13. Otherwise, it checks whether the page belongs to the swap cache. If not, it invokes remove_inode_page( ) to remove it from the page cache (see Section 14.1.3) and releases the pagecache_lock spin lock.

14. Invokes __lru_cache_del( ) to remove the page from the inactive list.

15. Unlocks the page.

16. Releases the page frame to the buddy system.

17. If the goal on the number of page frames to be freed is reached, the function releases the spin lock and returns the value 0; otherwise, it continues with the next page at the tail of the inactive list.
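The threshold mentioned in Step 5b — the point at which shrink_cache( ) gives up scanning and invokes swap_out( ) — can be computed as a worked example. The function name swapout_threshold is ours; the formula follows the text's description: the smaller of one-tenth of the pages to be scanned and 2^(10−priority) times the goal.

```c
#include <assert.h>

static long min_l(long a, long b) { return a < b ? a : b; }

/* Threshold on unfreeable scanned pages beyond which swap_out() is
   triggered, per Step 5b: min(max_scan / 10, nr_pages << (10 - priority)). */
long swapout_threshold(long max_scan, long nr_pages, int priority)
{
    return min_l(max_scan / 10, nr_pages << (10 - priority));
}
```

At low priority the second term dominates, so the one-tenth bound applies; as priority rises toward 1, the shift term grows and the scan bound still usually wins.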

16.7.6 Reclaiming Page Frames from the Dentry and Inode Caches

Dentry and inode objects themselves aren't big, but freeing one of them has a cascading effect that can ultimately free a lot of memory by releasing several data structures. For this reason, the shrink_caches( ) function invokes two special-purpose functions to reclaim page frames from the dentry and inode caches.

16.7.6.1 Reclaiming page frames from the dentry cache

The shrink_dcache_memory( ) function is invoked to remove dentry objects from the dentry cache. Clearly, only dentry objects not referenced by any process (defined as unused dentries in Section 12.2.4) can be removed.

Since the dentry cache objects are allocated through the slab allocator, the shrink_dcache_memory( ) function may lead some slabs to become free, causing some page frames to be consequently reclaimed by kmem_cache_reap( ). Moreover, the dentry cache acts as a controller of the inode cache. Therefore, when a dentry object is released, the buffer storing the corresponding inode becomes unused and the shrink_mmap( ) function may release the corresponding buffer page.

The shrink_dcache_memory( ) function, which acts on the two well-known parameters priority and gfp_mask, performs the following steps:

1. Returns 0 if the kernel is not allowed to trigger operations on filesystem on-disk data structures (the __GFP_IO bit is cleared in the gfp_mask parameter).

2. Otherwise, invokes prune_dcache( ), passing it a parameter that is the ratio between the number of unused dentries and the value of priority.

3. Invokes kmem_cache_shrink( ) on the dentry cache to release frames that contained objects freed in the previous step.

The prune_dcache( ) function receives a parameter that specifies the number of objects to free. It scans the list of unused dentries until it reaches the requested number of freed objects or until the whole list is scanned. On each object that wasn't recently referenced, the function invokes prune_one_dentry( ).

The prune_one_dentry( ) function, in turn, performs the following operations.

1. Removes the dentry object from the dentry hash table, from the list of dentry objects in its parent directory, and from the list of dentry objects of the owner inode.

2. Decrements the usage counter of the dentry's inode by invoking the d_iput dentry method, if defined, or the iput( ) function otherwise.

3. Invokes the d_release method of the dentry object, if defined.

4. Invokes kmem_cache_free( ) to release the object to the slab allocator (see Section 7.2.13).

5. Decrements the usage counter of the parent directory.

16.7.6.2 Reclaiming page frames from the inode cache

The shrink_icache_memory( ) function is invoked to remove inode objects from the inode cache. It is very similar to the shrink_dcache_memory( ) function just described. It checks the __GFP_FS bit in the gfp_mask parameter, and then invokes prune_icache( ), passing to it the number of inodes to be freed—namely, the ratio between the number of unused inodes and the value of the priority parameter. Finally, it invokes the kmem_cache_shrink( ) function to release to the buddy system each page frame that has been completely freed by prune_icache( ).

The prune_icache( ) function, in turn, scans the inode_unused list (see Section 12.2.2), looking for inodes that can be freed. Basically, a good inode should be clean, it should have a null usage counter, its i_freeing, i_clear, and i_lock flags should be cleared, and it should not include any buffer head in its lists of dirty buffers. Any such inode is freed by invoking the clear_inode( ) and kmem_cache_free( ) functions.
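The "good inode" test just described is a simple conjunction of conditions. The sketch below models it with a simplified stand-in struct (inode_model and its fields are our own illustrative names, not the kernel's inode layout):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the inode object fields that prune_icache()
   examines when deciding whether an inode is freeable. */
struct inode_model {
    int  i_count;           /* usage counter: must be zero */
    bool dirty;             /* inode must be clean */
    bool i_freeing, i_clear, i_lock;  /* state flags: must all be clear */
    bool has_dirty_buffers; /* no buffer heads on its dirty-buffer lists */
};

bool inode_is_freeable(const struct inode_model *i)
{
    return i->i_count == 0 && !i->dirty &&
           !i->i_freeing && !i->i_clear && !i->i_lock &&
           !i->has_dirty_buffers;
}
```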

If prune_icache( ) fails in releasing the requested number of inode objects, it schedules the execution of the try_to_sync_unused_inodes( ) function, which flushes some unused inodes to disk. Since this function might block, it is executed by the keventd kernel thread (see Section 3.4.2).

16.7.7 The kswapd Kernel Thread

The kswapd kernel thread is another kernel mechanism that activates the reclamation of memory. Why is it necessary? Is it not sufficient to invoke try_to_free_pages( ) when free memory becomes really scarce and another memory allocation request is issued?

Unfortunately, this is not the case. Some memory allocation requests are performed by interrupt and exception handlers, which cannot block the current process waiting for a page frame to be freed; moreover, some memory allocation requests are done by kernel control paths that have already acquired exclusive access to critical resources and that, therefore, cannot activate I/O data transfers. In the infrequent case in which all memory allocation requests are done by such sorts of kernel control paths, the kernel is never able to free memory.

kswapd also has a beneficial effect on system performance by keeping memory free during what would otherwise be idle time for the machine; processes can thus get their page frames much faster.

The kswapd kernel thread is activated when a zone includes a number of free page frames that is below a "warning" threshold. Essentially, the kernel starts to reclaim some page frames in order to avoid much more dramatic "low on memory" conditions.

The "warning" threshold is stored into the pages_low field of the memory zone descriptor. Recall from Section 7.1.2 that this descriptor also includes the pages_min field (a threshold that specifies the minimum number of free page frames that should always be preserved) and the pages_high field (a threshold that specifies the "safe" number of free page frames above which page frame reclaiming should be stopped). Usually, the pages_min field is set to the ratio between the size of the memory zone in pages and the number 128, the pages_low field is set to twice pages_min, and the pages_high field is set to three times the value of pages_min.[8]

[8] A system administrator may set new values for the pages_min fields, and hence also new values for the pages_low and pages_high fields, by passing ad-hoc ratios for every memory zone with the memfrac kernel boot option. However, the minimal allowed value for pages_min is 20 and the maximum allowed value is 255.
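The rules of thumb above can be turned into a small worked example. The sketch below applies the zone-size/128 rule and the 20..255 clamp that the footnote mentions for pages_min (here we apply the clamp uniformly for illustration; the struct and function names are ours):

```c
#include <assert.h>

/* Illustrative model of the per-zone watermarks described in the text. */
struct zone_watermarks {
    unsigned long pages_min, pages_low, pages_high;
};

struct zone_watermarks compute_watermarks(unsigned long zone_size_in_pages)
{
    unsigned long min = zone_size_in_pages / 128;  /* rule of thumb */
    if (min < 20)  min = 20;    /* clamp from the footnote */
    if (min > 255) min = 255;
    struct zone_watermarks w = { min, 2 * min, 3 * min };
    return w;
}
```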

The kswapd kernel thread is usually inserted into the kswapd_wait wait queue. As mentioned in Section 7.1.5, if the alloc_pages( ) function fails to find a memory zone that has more than pages_low page frames, it sets the need_balance field of the first checked memory zone and wakes up kswapd.

The kswapd kernel thread executes the kswapd( ) function, which at each activation performs the following operations:

1. Sets the state of current to TASK_RUNNING and removes it from the kswapd_wait wait queue.

2. Invokes kswapd_balance( ) (see below).

3. Invokes run_task_queue( ) on the tq_disk task queue to activate the strategy routines of the block device drivers (see Section 13.4.6). This relieves pressure on memory by starting scheduled I/O operations and thus eventually allowing the kernel to free asynchronous buffer heads and pages in the page cache.

4. Sets the state of current to TASK_INTERRUPTIBLE and adds it to the kswapd_wait wait queue.

5. Checks the need_balance flags of every memory zone descriptor (see Section 7.1.2). If none are set, invokes schedule( ) to put the kswapd kernel thread to sleep. When executed again, it jumps to Step 1.

The kswapd_balance( ) function checks the need_balance flag of all existing memory zones. For any memory zone having the flag set, the function invokes try_to_free_pages( ) to start reclaiming page frames. As we know, the latter function might not succeed in freeing swap_cluster_max page frames; in this case, a process is killed. When this happens, kswapd_balance( ) suspends itself for one second, thus letting the kernel reclaim the page frames owned by the victim.

The kswapd_balance( ) function keeps invoking try_to_free_pages( ) on a memory zone until the number of free page frames in the zone (or in one of the other zones in the node) rises above the threshold stored in the pages_high field of the memory zone descriptor.
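The balancing policy above — keep reclaiming from a flagged zone until its free pages rise above pages_high — can be sketched as a small loop. The struct and function names below are illustrative stand-ins, and the per-pass reclaim count stands in for what try_to_free_pages( ) actually frees:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of a memory zone's balancing state. */
struct zone_model {
    bool need_balance;   /* set by alloc_pages() when free < pages_low */
    long free_pages;
    long pages_high;     /* "safe" level at which reclaiming stops */
};

/* Sketch of the kswapd_balance() policy: reclaim until the zone's free
   pages exceed pages_high, then clear the flag. reclaimed_per_pass must
   be positive; it stands in for one try_to_free_pages() invocation. */
void balance_zone(struct zone_model *z, long reclaimed_per_pass)
{
    if (!z->need_balance)
        return;
    while (z->free_pages <= z->pages_high)
        z->free_pages += reclaimed_per_pass;
    z->need_balance = false;
}
```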
