Page Cache Readahead

Predicting the future is generally accepted to be a rather hard problem, but from time to time, the kernel cannot resist making a try nevertheless. Actually, there are situations where it is not too hard to say what will happen next, namely, when a process is reading data from a file.

Usually, pages are read sequentially — an assumption also made by most filesystems. Recall from Chapter 9 that the extended filesystem family makes a great effort to allocate adjacent blocks for a file such that the head of a block device needs to move as little as possible when data are read and written.

Consider the situation in which a process has read a file linearly from position A to position B. Then this practice will usually continue for a while. It therefore makes sense to read ahead of B (say, until position C) such that when requests for pages between B and C are issued from the process, the data are already contained in the page cache.

Naturally, readahead cannot be tackled by the page cache alone; support from the VFS and memory management layers is required. In fact, the readahead mechanism was discussed in Sections 8.5.2 and 8.5.1. Recall that readahead is controlled from three places as far as the kernel is directly concerned:6

1. do_generic_mapping_read, a generic read routine in which most filesystems that rely on the standard kernel routines to read data end up at some point.

2. The page fault handler filemap_fault, which is responsible for reading in missing pages for memory mappings.

3. generic_file_splice_read, a routine invoked to support the splice system call, which allows data to be passed between two file descriptors directly in kernel space, without the need to involve userspace.7

The temporal flow of the readahead routines at the source code level was discussed in Chapter 8, but it is also instructive to observe the behavior from a higher level. Such a viewpoint is provided in Figure 16-4. For the sake of simplicity, I restrict my consideration to do_generic_mapping_read in the following.

Figure 16-4: Overview of the readahead mechanism and the required interplay between VFS and page cache.

Suppose that a process has opened a file and wants to read in the first page. The page is not yet contained in the page cache. Since typical users will not read in just a single page, but multiple sequential pages, the kernel employs page_cache_sync_readahead to read in 8 pages in a row — the number is just an example and does not reflect the actual value. The first page is immediately available for do_generic_mapping_read.8 Pages selected to be read in before they are actually required are said to be in a readahead window.

6 These are at least the places covered in this book. Readahead can also be influenced from userland with the madvise, fadvise, and readahead system calls, but I will not discuss them any further.

7 I do not discuss this system call in more detail anywhere, but refer you to the manual page splice(2) for more information.

The process now continues to read in pages and behaves linearly as expected. When the sixth page is accessed (notice that the page was already contained in the page cache before the process issued a corresponding request), do_generic_mapping_read notices that the page was equipped with the PG_Readahead bit in the synchronous read pass.9 This triggers an asynchronous operation that reads in a number of pages in the background. Since two more pages are left in the page cache, there is no need to hurry, so a synchronous operation is not required. However, the I/O performed in the background will ensure that the pages are present when the process makes further progress in the file. If the kernel did not adopt this scheme, readahead could only start after a process had experienced a page fault. While the required page (and some more pages for readahead) could then be brought into the page cache synchronously, this would introduce delays, which are clearly undesirable.

This scheme is now repeated further. Since page_cache_async_readahead — which is responsible for issuing the asynchronous read request — has again marked a page in the readahead window with the PG_Readahead bit, the kernel will start asynchronous readahead again when the process comes to this page, and so on.

So much for do_generic_mapping_read. The differences in how filemap_fault handles things are twofold: Asynchronous, adaptive readahead is only performed if a sequential read hint is set. If no readahead hint is given, then do_page_cache_readahead performs a single-shot readahead without setting PG_Readahead, and also without updating the file's readahead state tracking information.

Several functions are used to implement the readahead mechanism. Figure 16-5 shows how they are connected with each other.

Figure 16-5: Functions used to implement readahead. Note that the figure shows the connections between the functions, but is not a proper code flow diagram.

8 Actually, the term synchronous as adopted by the kernel is a bit misleading here. No effort is made to wait for completion of the read operation submitted by page_cache_sync_readahead, so it is not synchronous in the usual sense of the word. However, since reading in one page is fast, chances are very good that the page will already have arrived when page_cache_sync_readahead returns to the caller. Nevertheless, the caller must take precautions for the case in which the page is not yet available.

9 Since the readahead state for each file is tracked separately, the kernel would essentially not require this special flag because the corresponding information could also be obtained by other means. However, the flag is required when multiple concurrent readers act on a file.

Reading pages into the page cache before they are actually required is simple from a technical point of view and can easily be achieved with the framework introduced so far in this chapter. The challenge lies in predicting the optimal size of the readahead window. For this purpose, the kernel keeps track of the last setting for each file. The following data structure is associated with every file instance:
