Activating and Deactivating a Swap Area

Last Updated on Wed, 13 Jan 2021 | Linux Kernel Reference

Once a swap area is initialized, the superuser (or, more precisely, any user having the cap_sys_admin capability, as described in Section 20.1.1) may use the swapon and swapoff programs to activate and deactivate the swap area, respectively. These programs use the swapon( ) and swapoff( ) system calls; we'll briefly sketch out the corresponding service routines.

16.2.3.1 The sys_swapon( ) service routine

The sys_swapon( ) service routine receives the following as parameters:

specialfile

This parameter points to the pathname (in the User Mode address space) of the device file (partition) or plain file used to implement the swap area.

swap_flags

This parameter consists of a single swap_flag_prefer bit plus 15 bits of priority of the swap area (these bits are significant only if the swap_flag_prefer bit is on).
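As a minimal sketch of how such a flags word can be packed and unpacked, the following fragment uses the constant names and values that Linux exposes for swapon( ) (SWAP_FLAG_PREFER, SWAP_FLAG_PRIO_MASK, SWAP_FLAG_PRIO_SHIFT), defined locally so the example compiles on its own. The two helper functions, make_swap_flags( ) and swap_flags_priority( ), are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Values used by Linux for the swapon() flags word, repeated here as
 * local definitions so the sketch does not depend on <sys/swap.h>. */
#define SWAP_FLAG_PREFER     0x8000  /* the priority bits are significant */
#define SWAP_FLAG_PRIO_MASK  0x7fff  /* 15 bits of priority */
#define SWAP_FLAG_PRIO_SHIFT 0

/* Hypothetical helper: build a flags word that requests an explicit
 * priority in the range 0..32767. */
int make_swap_flags(int prio)
{
    assert(prio >= 0 && prio <= SWAP_FLAG_PRIO_MASK);
    return SWAP_FLAG_PREFER |
           ((prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);
}

/* Hypothetical helper: if SWAP_FLAG_PREFER is set, store the requested
 * priority in *prio and return true; otherwise the priority bits are
 * ignored and the function returns false. */
bool swap_flags_priority(int swap_flags, int *prio)
{
    if (!(swap_flags & SWAP_FLAG_PREFER))
        return false;
    *prio = (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
    return true;
}
```

A caller that passes 0 as the flags word therefore leaves the choice of priority to the kernel.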

The function checks the fields of the swap_header union that was put in the first slot when the swap area was created. The function performs these main steps:

1. Checks that the current process has the cap_sys_admin capability.

2. Searches for the first element in the swap_info array of swap area descriptors that has the swp_used flag cleared, meaning that the corresponding swap area is inactive. If there is none, there are already max_swapfiles active swap areas, so the function returns an error code.

3. A descriptor for the swap area has been found. The function initializes the descriptor's fields (setting flags to swp_used, setting lowest_bit and highest_bit to 0, and so on). Moreover, if the descriptor's index is greater than nr_swapfiles, the function updates that variable.

4. If the swap_flags parameter specifies a priority for the new swap area, the function sets the prio field of the descriptor. Otherwise, it initializes the field with the lowest priority among all active swap areas minus 1 (thus assuming that the last activated swap area is on the slowest block device). If no other swap areas are already active, the function assigns the value -1.

5. Copies the string pointed to by the specialfile parameter from the User Mode address space.

6. Invokes path_init( ) and path_walk( ) to perform a pathname lookup on the string copied from the User Mode address space (see Section 12.5).

7. Stores the addresses of the dentry object and of the mounted filesystem descriptor returned by path_walk( ) in the swap_file and swap_vfsmnt fields of the swap area descriptor, respectively.

8. If the specialfile parameter identifies a block device file, the function performs the following substeps:

a. Stores the device number in the swap_device field of the descriptor.

b. Sets the block size of the device to 4 KB—that is, sets its blksize_size entry to page_size.

c. Initializes the block device driver by invoking the bd_acquire( ) and do_open( ) functions, described in Section 13.4.2.

9. Checks to make sure that the swap area was not already activated by looking at the address_space objects of the other active swap areas in swap_info (given an address q of a swap area descriptor, the corresponding address_space object is obtained by q->swap_file->d_inode->i_mapping). If the swap area is already active, it returns an error code.

10. Allocates a page frame and invokes rw_swap_page_nolock( ) (see Section 16.4 later in this chapter) to fill it with the swap_header union stored in the first page of the swap area.

11. Checks that the magic string in the last ten characters of the first page in the swap area is equal to SWAP-SPACE or to SWAPSPACE2 (there are two slightly different versions of the swapping algorithm). If not, the specialfile parameter does not specify an already initialized swap area, so the function returns an error code. For the sake of brevity, we'll suppose that the swap area has the SWAPSPACE2 magic string.

12. Initializes the lowest_bit and highest_bit fields of the swap area descriptor according to the size of the swap area stored in the info.last_page field of the swap_header union.

13. Invokes vmalloc( ) to create the array of counters associated with the new swap area and store its address in the swap_map field of the swap descriptor. Initializes the elements of the array to 0 or to swap_map_bad, according to the list of defective page slots stored in the info.bad_pages field of the swap_header union.

14. Computes the number of useful page slots by accessing the info.last_page and info.nr_badpages fields in the first page slot.

15. Sets the flags field of the swap descriptor to swp_writeok, sets the pages field to the number of useful page slots, and updates the nr_swap_pages and total_swap_pages variables.

16. Inserts the new swap area descriptor in the list to which the swap_list variable points.

17. Releases the page frame that contains the data of the first page of the swap area and returns 0 (success).
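The signature check of Step 11 and the counter setup of Steps 13 and 14 can be sketched in user space as follows. This is a simplified model, not the kernel code: the page size, the numeric value of the defective-slot marker, the treatment of slot 0 (which holds the header) as unusable, and the function names swap_magic_version( ) and build_swap_map( ) are all assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096    /* assumed page size */
#define SWAP_MAP_BAD    0x8000  /* assumed marker for an unusable slot */

/* Step 11: the signature occupies the last ten bytes of the first page.
 * Returns 2 for SWAPSPACE2, 1 for SWAP-SPACE, 0 if neither is present. */
int swap_magic_version(const unsigned char *first_page)
{
    const unsigned char *magic;

    assert(first_page != NULL);
    magic = first_page + MODEL_PAGE_SIZE - 10;
    if (memcmp(magic, "SWAPSPACE2", 10) == 0)
        return 2;
    if (memcmp(magic, "SWAP-SPACE", 10) == 0)
        return 1;
    return 0;
}

/* Steps 13 and 14: allocate one counter per page slot, mark slot 0 and
 * every listed defective slot as unusable, and report how many useful
 * slots remain. Returns NULL if the allocation fails or if a defective
 * slot index lies outside the area. The caller frees the array. */
unsigned short *build_swap_map(unsigned nr_slots, const unsigned *bad,
                               unsigned nr_bad, unsigned *useful)
{
    unsigned short *map;
    unsigned i, good;

    if (nr_slots < 2)
        return NULL;
    map = calloc(nr_slots, sizeof *map);   /* all counters start at 0 */
    if (map == NULL)
        return NULL;

    map[0] = SWAP_MAP_BAD;                 /* slot 0 stores the header */
    good = nr_slots - 1;
    for (i = 0; i < nr_bad; i++) {
        if (bad[i] == 0 || bad[i] >= nr_slots) {
            free(map);
            return NULL;
        }
        if (map[bad[i]] != SWAP_MAP_BAD) { /* count duplicates only once */
            map[bad[i]] = SWAP_MAP_BAD;
            good--;
        }
    }
    *useful = good;
    return map;
}
```

For an area of 16 slots with two defective slots, the model leaves 13 useful slots: 16 minus the header slot minus the two defective ones.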

16.2.3.2 The sys_swapoff( ) service routine

The sys_swapoff( ) service routine deactivates a swap area identified by the parameter specialfile. It is much more complex and time-consuming than sys_swapon( ), since the partition to be deactivated might still contain pages that belong to several processes. The function is thus forced to scan the swap area and to swap in all existing pages. Since each swap in requires a new page frame, it might fail if there are no free page frames left. In this case, the function returns an error code. All this is achieved by performing the following major steps:

1. Checks that the current process has the cap_sys_admin capability.

2. Copies the string pointed to by specialfile, and invokes path_init( ) and path_walk( ) to perform a pathname lookup.

3. Scans the list to which swap_list points and locates the descriptor whose swap_file field points to the dentry object found by the pathname lookup. If no such descriptor exists, an invalid parameter was passed to the function, so it returns an error code.

4. Otherwise, if the descriptor exists, checks that its swp_write flag is set; if not, returns an error code because the swap area is already being deactivated by another process.

5. Removes the descriptor from the list and sets its flags field to swp_used so the kernel doesn't store more pages in the swap area before this function deactivates it.

6. Subtracts the swap area size stored in the pages field of the swap area descriptor from the values of nr_swap_pages and total_swap_pages.

7. Invokes the try_to_unuse( ) function (see below) to successively force all pages left in the swap area into RAM and to correspondingly update the Page Tables of the processes that use these pages.

8. If try_to_unuse( ) fails in allocating all requested page frames, the swap area cannot be deactivated. Therefore, the function executes the following substeps:

a. Reinserts the swap area descriptor in the swap_list list and sets its flags field to swp_writeok (see Step 5).

b. Adds the content of the pages field to the nr_swap_pages and total_swap_pages variables (see Step 6).

c. Invokes path_release( ) to release the VFS objects allocated by path_walk( ) in Step 2.

d. Finally, returns an error code.

9. Otherwise, all used page slots have been successfully transferred to RAM. Therefore, the function executes the following substeps:

a. If specialfile identifies a block device file, releases the corresponding block device driver.

b. Invokes path_release( ) to release the VFS objects allocated by path_walk( ) in Step 2.

d. Invokes path_release( ) again because the VFS objects that refer to specialfile have been allocated by the path_walk( ) function invoked by sys_swapon( ) (see Step 6 in the previous section).

e. Returns 0 (success).

16.2.3.3 The try_to_unuse( ) function
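Before looking inside try_to_unuse( ), it may help to see how sys_swapoff( ) brackets the call (Steps 4 through 8 of the previous section). The following is a user-space sketch under stated assumptions: the structure layouts, the numeric flag values, the callback type, and the two stub functions are hypothetical stand-ins, and clearing the flags on success is our assumption about how the descriptor becomes reusable.

```c
#include <assert.h>

#define SWP_USED    1   /* assumed flag values */
#define SWP_WRITEOK 3

struct swap_area_model {
    unsigned flags;
    long pages;     /* number of useful page slots */
    int on_list;    /* 1 while linked in the swap list */
};

struct swap_totals {
    long nr_swap_pages;
    long total_swap_pages;
};

/* Hypothetical stand-in for try_to_unuse(): returns 0 on success. */
typedef int (*unuse_fn)(struct swap_area_model *area);

int unuse_always_ok(struct swap_area_model *area)     { (void)area; return 0; }
int unuse_out_of_memory(struct swap_area_model *area) { (void)area; return -1; }

int swapoff_model(struct swap_area_model *area, struct swap_totals *t,
                  unuse_fn try_to_unuse)
{
    assert(area && t && try_to_unuse);

    if ((area->flags & SWP_WRITEOK) != SWP_WRITEOK)
        return -1;                        /* Step 4: not writable */

    area->on_list = 0;                    /* Step 5 */
    area->flags = SWP_USED;
    t->nr_swap_pages -= area->pages;      /* Step 6 */
    t->total_swap_pages -= area->pages;

    if (try_to_unuse(area) != 0) {        /* Steps 7 and 8 */
        area->on_list = 1;                /* undo Step 5 */
        area->flags = SWP_WRITEOK;
        t->nr_swap_pages += area->pages;  /* undo Step 6 */
        t->total_swap_pages += area->pages;
        return -1;
    }

    area->flags = 0;   /* assumption: the descriptor is free again */
    return 0;
}
```

The point of the ordering is that the area stops accepting new pages before the long swap-in phase begins, and that a failure restores exactly the state changed in Steps 5 and 6.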

As stated previously, the try_to_unuse( ) function swaps in pages and updates all the Page Tables of processes that have swapped out pages. To that end, the function visits the address spaces of all kernel threads and processes, starting with the init_mm memory descriptor that is used as a marker. It is a time-consuming function that runs mostly with the interrupts enabled. Synchronization with other processes is therefore critical.

The try_to_unuse( ) function scans the swap_map array of the swap area. When the function finds an in-use page slot, it first swaps in the page, and then starts looking for the processes that reference the page. The ordering of these two operations is crucial to avoid race conditions. While the I/O data transfer is ongoing, the page is locked, so no process can access it. Once the I/O data transfer completes, the page is locked again by try_to_unuse( ), so it cannot be swapped out again by another kernel control path. Race conditions are also avoided because each process looks up the page cache before starting a swap-in or swap-out operation (see Section 16.3 later in this chapter). Finally, the swap area considered by try_to_unuse( ) is marked as nonwritable (swp_write flag is not set), so no process can perform a swap out on a page slot of this area.

However, try_to_unuse( ) might be forced to scan the swap_map array of usage counters of the swap area several times. This is because memory regions that contain references to swapped-out pages might disappear during one scan and later reappear in the process lists.

For instance, recall the description of the do_munmap( ) function (in Section 8.3.5): whenever a process releases an interval of linear addresses, do_munmap( ) removes from the process list all memory regions that include the affected linear addresses; later, the function reinserts the memory regions that have been only partially unmapped in the process list. do_munmap( ) takes care of freeing the swapped-out pages that belong to the interval of released linear addresses; however, it commendably doesn't free the swapped-out pages that belong to the memory regions that have to be reinserted in the process list.

Hence, try_to_unuse( ) might fail in finding a process that references a given page slot because the corresponding memory region is temporarily not included in the process list. To cope with this fact, try_to_unuse( ) keeps scanning the swap_map array until all reference counters are null. Eventually, the ghost memory regions referencing the swapped-out pages will reappear in the process lists, so try_to_unuse( ) will succeed in freeing all page slots.

Let's describe now the major operations executed by try_to_unuse( ). It executes a continuous loop on the reference counters in the swap_map array of the swap area passed as its parameter. For each reference counter, the function performs the following steps:

1. If the counter is equal to 0 (no page is stored there) or to swap_map_bad, it continues with the next page slot.

2. Otherwise, it invokes the read_swap_cache_async( ) function (see Section 16.4 later in this chapter) to swap in the page. This consists of allocating, if necessary, a new page frame, filling it with the data stored in the page slot, and putting the page in the swap cache.

3. Waits until the new page has been properly updated from disk and locks it.

4. While the function was executing the previous step, the process could have been suspended. Therefore, it checks again whether the reference counter of the page slot is null; if so, it continues with the next page slot (this swap page has been freed by another kernel control path).

5. Invokes unuse_process( ) on every memory descriptor in the doubly linked list whose head is init_mm (see Section 8.2). This time-consuming function scans all Page Table entries of the process that owns the memory descriptor, and replaces each occurrence of the swapped-out page identifier with the physical address of the page frame. To reflect this move, the function also decrements the page slot counter in the swap_map array (unless it is equal to swap_map_max) and increments the usage counter of the page frame.

6. Invokes shmem_unuse( ) to check whether the swapped-out page is used for an IPC shared memory resource and to properly handle that case (see Section 19.3.5).

7. Checks the value of the reference counter of the page. If it is equal to swap_map_max, the page slot is "permanent." To free it, it forces the value 1 into the reference counter.

8. The swap cache might own the page as well (it contributes to the value of the reference counter). If the page belongs to the swap cache, it invokes the rw_swap_page( ) function to flush its contents on disk (if the page is dirty), invokes delete_from_swap_cache( ) to remove the page from the swap cache, and decrements its reference counter.

9. Sets the PG_dirty flag of the page descriptor and unlocks the page.

10. Checks the need_resched field of the current process; if it is set, it invokes schedule( ) to relinquish the CPU. Deactivating a swap area is a long job, and the kernel must ensure that the other processes in the system still continue to execute. The try_to_unuse( ) function continues from this step whenever the process is selected again by the scheduler.

11. Proceeds with the next page slot, starting at Step 1.
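The repeated scanning can be illustrated with a small user-space model. It covers only Steps 1 and 5 and the outer "scan until every counter is null" loop; the swap-in, page locking, swap cache, and scheduling steps are left out. The pte_model structure, its hidden_scans field (which imitates a memory region that is temporarily missing from the process list), the max_scans guard, and the value of the defective-slot marker are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define SWAP_MAP_BAD 0x8000  /* assumed marker for a defective slot */

/* One Page Table entry that still holds a swapped-out page identifier. */
struct pte_model {
    int slot;          /* referenced page slot, or -1 once the page is in RAM */
    int hidden_scans;  /* scans during which the owning region is off the list */
};

/* Returns the number of full scans that were needed, or -1 if some
 * counter was still positive after max_scans scans. */
int try_to_unuse_model(unsigned short *swap_map, size_t nr_slots,
                       struct pte_model *ptes, size_t nr_ptes, int max_scans)
{
    int scans;

    assert(swap_map && ptes);
    for (scans = 1; scans <= max_scans; scans++) {
        size_t remaining = 0, i, j;

        for (i = 0; i < nr_slots; i++) {
            if (swap_map[i] == 0 || swap_map[i] == SWAP_MAP_BAD)
                continue;                      /* Step 1 */

            for (j = 0; j < nr_ptes; j++) {    /* Step 5, simplified */
                if (ptes[j].slot == (int)i && ptes[j].hidden_scans == 0 &&
                    swap_map[i] > 0) {
                    ptes[j].slot = -1;         /* entry now maps the frame */
                    swap_map[i]--;             /* one reference fewer */
                }
            }
            if (swap_map[i] != 0)
                remaining++;                   /* a hidden user is left */
        }
        if (remaining == 0)
            return scans;

        /* Regions removed from the process list reappear over time. */
        for (j = 0; j < nr_ptes; j++)
            if (ptes[j].hidden_scans > 0)
                ptes[j].hidden_scans--;
    }
    return -1;
}
```

With one reference hidden during the first scan, the model needs a second scan to bring every counter to zero.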

The function continues until every reference counter in the swap_map array is null. Recall that even if the function starts examining the next page slot, the reference counter of the previous page slot could still be positive. In fact, a "ghost" process could still reference the page, typically because some memory regions have been temporarily removed from the process list scanned in Step 5. Eventually, try_to_unuse( ) catches every reference. In the meantime, however, the page is no longer in the swap cache, it is unlocked, and a copy is still included in the page slot of the swap area being deactivated.

One might expect that this situation could lead to data loss. For instance, suppose that some "ghost" process accesses the page slot and starts swapping the page in. Since the page is no longer in the swap cache, the process fills a new page frame with the data read from disk. However, this page frame would be different from the page frames owned by the processes that are supposed to share the page with the "ghost" process.

This problem does not arise when deactivating a swap area because interference from a ghost process could happen only if a swapped-out page belongs to a private anonymous memory mapping.[2] In this case, the page frame is handled by means of the Copy on Write mechanism described in Chapter 8, so it is perfectly legal to assign different page frames to the processes that reference the page. However, the try_to_unuse( ) function marks the page as "dirty" (Step 9); otherwise, the try_to_swap_out( ) function might later drop the page from the Page Table of some process without saving it in another swap area (see Section 16.5 later in this chapter).
