The Buddy System Algorithm

The kernel must establish a robust and efficient strategy for allocating groups of contiguous page frames. In doing so, it must deal with a well-known memory management problem called external fragmentation: frequent requests and releases of groups of contiguous page frames of different sizes may lead to a situation in which several small blocks of free page frames are "scattered" inside blocks of allocated page frames. As a result, it may become impossible to allocate a large block of contiguous page frames, even if there are enough free pages to satisfy the request.

There are essentially two ways to avoid external fragmentation:

• Use the paging circuitry to map groups of noncontiguous free page frames into intervals of contiguous linear addresses.

• Develop a suitable technique to keep track of the existing blocks of free contiguous page frames, avoiding as much as possible the need to split up a large free block to satisfy a request for a smaller one.

The second approach is preferred by the kernel for three good reasons:

• In some cases, contiguous page frames are really necessary, since contiguous linear addresses are not sufficient to satisfy the request. A typical example is a memory request for buffers to be assigned to a DMA processor (see Chapter 13). Since the DMA ignores the paging circuitry and accesses the address bus directly while transferring several disk sectors in a single I/O operation, the buffers requested must be located in contiguous page frames.

• Even if contiguous page frame allocation is not strictly necessary, it offers the big advantage of leaving the kernel paging tables unchanged. What's wrong with modifying the Page Tables? As we know from Chapter 2, frequent Page Table modifications lead to higher average memory access times, since they make the CPU flush the contents of the translation lookaside buffers.

• Large chunks of contiguous physical memory can be accessed by the kernel through 4 MB pages. The reduction of translation lookaside buffers misses, in comparison to the use of 4 KB pages, significantly speeds up the average memory access time (see Section 2.4.8).

The technique adopted by Linux to solve the external fragmentation problem is based on the well-known buddy system algorithm. All free page frames are grouped into 10 lists of blocks that contain groups of 1, 2, 4, 8, 16, 32, 64, 128, 256, and 512 contiguous page frames, respectively. The physical address of the first page frame of a block is a multiple of the group size—for example, the initial address of a 16-page-frame block is a multiple of 16 x 212 (212 = 4,096, which is the regular page size).

We'll show how the algorithm works through a simple example.

Assume there is a request for a group of 128 contiguous page frames (i.e., a half-megabyte). The algorithm checks first to see whether a free block in the 128-page-frame list exists. If there is no such block, the algorithm looks for the next larger block—a free block in the 256-page-frame list. If such a block exists, the kernel allocates 128 of the 256 page frames to satisfy the request and inserts the remaining 128 page frames into the list of free 128-page-frame blocks. If there is no free 256-page block, the kernel then looks for the next larger block (i.e., a free 512-page-frame block). If such a block exists, it allocates 128 of the 512 page frames to satisfy the request, inserts the first 256 of the remaining 384 page frames into the list of free 256-page-frame blocks, and inserts the last 128 of the remaining 384 page frames into the list of free 128-page-frame blocks. If the list of 512-page-frame blocks is empty, the algorithm gives up and signals an error condition.

The reverse operation, releasing blocks of page frames, gives rise to the name of this algorithm. The kernel attempts to merge pairs of free buddy blocks of size b together into a single block of size 2b. Two blocks are considered buddies if:

• They are located in contiguous physical addresses.

• The physical address of the first page frame of the first block is a multiple of 2 x b x 212.

The algorithm is iterative; if it succeeds in merging released blocks, it doubles b and tries again so as to create even bigger blocks. Data structures

Linux uses a different buddy system for each zone. Thus, in the 80 x 86 architecture, there are three buddy systems: the first handles the page frames suitable for ISA DMA, the second handles the "normal" page frames, and the third handles the high-memory page frames. Each buddy system relies on the following main data structures:

• The mem_map array introduced previously. Actually, each zone is concerned with a subset of the mem_map elements. The first element in the subset and its number of elements are specified, respectively, by the zone_mem_map and size fields of the zone descriptor.

• An array having 10 elements of type free_area_t, one element for each group size. The array is stored in the free_area field of the zone descriptor.

• Ten binary arrays named bitmaps, one for each group size. Each buddy system has its own set of bitmaps, which it uses to keep track of the blocks it allocates.

The free_area_t (or equivalently, struct free_area_struct) data structure is defined as follows:

typedef struct free_area_struct { struct list_head free_list; unsigned long *map; } free_area_t;

The k th element of the free_area array in the zone descriptor is associated with a doubly linked circular list of blocks of size 2k; each member of such a list is the descriptor of the first page frame of a block. The list is implemented through the list field of the page descriptor.

The map field points to a bitmap whose size depends on the number of page frames in the zone. Each bit of the bitmap of the k th entry of the free_area array describes the status of two buddy blocks of size 2k page frames. If a bit of the bitmap is equal to 0, either both buddy blocks of the pair are free or both are busy; if it is equal to 1, exactly one of the blocks is busy. When both buddies are free, the kernel treats them as a single free block of size 2k+1.

Let's consider, for sake of illustration, a zone including 128 MB of RAM. The 128 MB can be divided into 32,768 single pages, 16,384 groups of 2 pages each, or 8,192 groups of 4 pages each, and so on up to 64 groups of 512 pages each. So the bitmap corresponding to free_area[0] consists of 16,384 bits, one for each pair of the 32,768 existing page frames; the bitmap corresponding to free_area[1] consists of 8,192 bits, one for each pair of blocks of two consecutive page frames; the last bitmap corresponding to free_area[9] consists of 32 bits, one for each pair of blocks of 512 contiguous page frames.

Figure 7-2 illustrates the use of the data structures introduced by the buddy system algorithm. The array zone_mem_map contains nine free page frames grouped in one block of one (a single page frame) at the top and two blocks of four further down. The double arrows denote doubly linked circular lists implemented by the free_list field. Notice that the bitmaps are not drawn to scale.

Figure 7-2. Data structures used by the buddy system

Figure 7-2. Data structures used by the buddy system Allocating a block

The alloc_pages( ) function is the core of the buddy system allocation algorithm. Any other allocator function or macro listed in the earlier section Section 7.1.5 ends up invoking alloc_pages( ).

The function considers the list of the contig_page_data.node_zonelists array corresponding to the zone modifiers specified in the argument gfp_mask. Starting with the first zone descriptor in the list, it compares the number of free page frames in the zone (stored in the free_pages field of the zone descriptor), the number of requested page frames (argument order of alloc_pages( )), and the threshold value stored in the pages_low field of the zone descriptor. If free_pages - 2 order is smaller than or equal to pages_low, the function skips the zone and considers the next zone in the list. If no zone has enough free page frames, alloc_pages( ) restarts the loop, this time looking for a zone that has at least pages_min free page frames. If such a zone doesn't exist and if the current process is allowed to wait, the function invokes balance classzone( ) , which in turn invokes try to free pages( ) to reclaim enough page frames to satisfy the memory request (see Section 16.7).

When alloc_pages( ) finds a zone with a suitable number of free page frames, it invokes the rmqueue( ) function to allocate a block in that zone. The function takes two arguments: the address of the zone descriptor, and order, which denotes the logarithm of the size of the requested block of free pages (0 for a one-page block, 1 for a two-page block, and so forth). If the page frames are successfully allocated, the rmqueue( ) function returns the address of the page descriptor of the first allocated page frame; that address is also returned by alloc_pages( ). Otherwise, rmqueue( ) returns NULL, and alloc pages( ) consider the next zone in the list.

The rmqueue( ) function is equivalent to the following fragments. First, a few local variables are declared and initialized:

free_area_t * area = &(zone->free_area[order]);

unsigned int curr_order = order;

struct list_head *head, *curr;

struct page *page;

unsigned long flags;

unsigned int index;

unsigned long size;

The function disables interrupts and acquires the spin lock of the zone descriptor because it will alter its fields to allocate a block. Then it performs a cyclic search through each list for an available block (denoted by an entry that doesn't point to the entry itself), starting with the list for the requested order and continuing if necessary to larger orders. This is equivalent to the following fragment:

spin_lock_irqsave(&zone->lock, flags); do {

head = &area->free_list; curr = head->next; if (curr != head)

goto block_found; curr_order++; area++; } while (curr_order < 10);

spin_unlock_irqrestore(&zone->lock, flags); return NULL;

If the loop terminates, no suitable free block has been found, so rmqueue( ) returns a null value. Otherwise, a suitable free block has been found; in this case, the descriptor of its first page frame is removed from the list, the corresponding bitmap is updated, and the value of free_ pages in the zone descriptor is decreased:


page = list_entry(curr, struct page, list); list_del(curr);

index = page - zone->zone_mem_map; if (curr_order != 9)

change_bit(index>>(1+curr_order), area->map); zone->free_pages -= 1UL << order;

If the block found comes from a list of size curr_order greater than the requested size order, a while cycle is executed. The rationale behind these lines of codes is as follows: when it becomes necessary to use a block of 2k page frames to satisfy a request for 2h page frames (h < k), the program allocates the last 2h page frames and iteratively reassigns the first 2k - 2h page frames to the free_area lists that have indexes between h and k:

size = 1 << curr_order; while (curr_order > order) { area-- ; curr_order-- ; size >>= 1;

/* insert *page as first element in the list and update the bitmap */ list_add(&page->list, &area->free_list); change_bit(index >> (1+curr_order), area->map);

/* now take care of the second half of the free block starting at *page */

Finally, rmqueue( ) releases the spin lock, updates the count field of the page descriptor associated with the selected block, and executes the return instruction:

spin_unlock_irqrestore(&zone->lock, flags); atomic_set(&page->count, 1); return page;

As a result, the alloc_pages( ) function returns the address of the page descriptor of the first page frame allocated. Freeing a block

The__free_pages_ok( ) function implements the buddy system strategy for freeing page frames. It uses two input parameters:


The address of the descriptor of the first page frame included in the block to be released order

The logarithmic size of the block

The__free_pages_ok( ) function usually inserts the block of page frames in the buddy system data structures so they can be used in subsequent allocation requests. One case is an exception: if the current process is moving pages across memory zones to rebalance them, the function does not free the page frames, but inserts the block in a special list of the process. To decide what to do with the block of page frames,__free_pages_ok( ) checks the pf_free_pages flag of the process. It is set only if the process is rebalancing the memory zones. Anyway, we discuss this special case in Chapter 16; we assume here that the pf_free_pages flag of current is not set, so__free_pages_ok( ) inserts the block in the buddy system data structures.

The function starts by declaring and initializing a few local variables:

unsigned long flags;

free_area_t * area = &zone->free_area[order];

unsigned long page_idx = page - base;

unsigned long index = page_idx >> (1 + order);

struct page * buddy;

The page_idx local variable contains the index of the first page frame in the block with respect to the first page frame of the zone. The index local variable contains the bit number corresponding to the block in the bitmap.

The function clears the PG_referenced and PG_dirty flags of the first page frame, then it acquires the zone spin lock and disables interrupts:

page->flags &= ~((1<<PG_referenced) | (1<<PG_dirty)); spin_lock_irqsave(&zone->lock, flags);

The mask local variable contains the two's complement of 2order. It is used to increment the counter of free page frames in the zone:

The function now starts a cycle executed at most 9 - order times, once for each possibility for merging a block with its buddy. The function starts with the smallest-sized block and moves up to the top size. The condition driving the while loop is:

In the body of the loop, the bits set in mask are shifted to the left at each iteration and the buddy block of the block that has the number page_idx is checked:

if (!test_and_change_bit(index, area->map)) break;

If the buddy block is not free, the function breaks out of the cycle; if it is free, the function detaches it from the corresponding list of free blocks. The block number of the buddy is derived from page_idx by switching a single bit:

buddy = &base[page_idx A -mask]; list_del(&buddy->list);

At the end of each iteration, the function updates the mask, area, index, and page_idx local variables:

mask <<= 1; area++; index >>= 1; page_idx &= mask;

The function then continues the next iteration, trying to merge free blocks twice as large as the ones considered in the previous cycle. When the cycle is finished, the free block obtained cannot be merged further with other free blocks. It is then inserted in the proper list:

list_add(&(base[page_idx].list), &area->free_list);

Finally, the function releases the zone spin lock and returns:

spin_unlock_irqrestore(&zone->lock, flags) return;

I [email protected] RuBoard

^ previous

Was this article helpful?

+2 -2


  • ritva
    What is buddy heap algorithmn?
    1 year ago

Post a comment