Executing Requests

The device-specific request_fn function is invoked when the requests in the request queue are ready to be processed. This is a very hardware-specific task so the kernel does not provide a default implementation. Instead, it always uses the method passed when the queue was registered with blk_dev_init.

Nevertheless, the structure of the request function described below is similar in most device drivers. I assume a situation in which there are several requests in the request queue.

14elv_insert is an internal function of the elevator implementation and is called at various points in the kernel.

sample_request is a hardware-independent sample routine for request_fn that illustrates the basic steps performed by all drivers.

void sample_request (request_queue_t *q) int status; struct request *req;

while ((req = elv_next_request(q)) != NULL) if (!blk_fs_request(req)) end_request(req, 0); continue;

status = perform_sample_transfer(req); end_request(req, status);

The basic layout of the strategy function is simple; elv_next_request — embedded in a while loop — is used to read the requests sequentially from the queue. Transfer itself is carried out by perform_sample_transfer. end_request is a standard kernel function to delete the request from the request queue, to update the kernel statistics, and to execute any completions (see Chapter 5) waiting in request->completion. Also invoked is the BIO-specific bi_end_io function to which the kernel can assign a cleanup that depends on the purpose of the BIO.

As BIOs can be used to transfer not only data but also diagnostics information, the driver must invoke blk_fs_request to check whether, in fact, data are to be transferred — for the sake of simplicity, I ignore all other types of transfer.

The hardware-specific actions in genuine drivers are typically segregated into separate functions to keep code concise. I have adopted the same approach in our sample strategy function. The hardware-specific functions that would be found in a genuine driver are replaced with comments in perform_sample_transfer.

int perform_transfer(request *req) switch(req->cmd) case READ:

/* Perform hardware-specific reading of data */ break; case WRITE:

/* Perform hardware-specific writing of data */ break; default:

return -EFAULT;

The cmd field is referenced to establish whether the request is for a read or write operation. The appropriate actions are then taken to transfer the data between the system and the hardware.

6.5.8 I/O Scheduling

The various algorithms employed in the kernel for scheduling and reordering I/O operations are known as I/O schedulers (in contrast to normal process schedulers or packet schedulers for traffic shaping in networks). Traditionally, I/O schedulers are also called elevators. They are represented by a collection of functions grouped together in the following data structure15:

15The kernel also defines typedef struct elevator_s elevator_t.

<elevator.h>

struct elevator_ops {

elevator_merge_fn *elevator_merge_fn; elevator_merged_fn *elevator_merged_fn; elevator_merge_req_fn *elevator_merge_req_fn;

elevator_dispatch_fn *elevator_dispatch_fn; elevator_add_req_fn *elevator_add_req_fn; elevator_activate_req_fn *elevator_activate_req_fn; elevator_deactivate_req_fn *elevator_deactivate_req_fn;

elevator_queue_empty_fn *elevator_queue_empty_fn; elevator_completed_req_fn *elevator_completed_req_fn;

elevator_request_list_fn *elevator_former_req_fn; elevator_request_list_fn *elevator_latter_req_fn;

elevator_set_req_fn *elevator_set_req_fn; elevator_put_req_fn *elevator_put_req_fn;

elevator_may_queue_fn *elevator_may_queue_fn;

elevator_init_fn *elevator_init_fn; elevator_exit_fn *elevator_exit_fn;

The I/O scheduler is not only responsible for request reordering but also for the complete management of the request queue.

□ elevator_merge_fn checks whether a new request can be coalesced with an existing request as described above. It also specifies the position at which a request is inserted in the request queue.

□ elevator_merge_req_fn coalesces two requests into a single request; elevator_merged_fn is invoked after two requests have been merged (it performs clean-up work and returns management data of the I/O scheduler that are no longer needed because of the merge to the system).

□ elevator_dispatch_fn selects which request from a given request queue should be dispatched next.

□ elevator_add_req_fn and elevator_remove_req_fn add and remove a request to/from the request queue.

□ elevator_queue_empty_fn checks whether the queue contains requests ready for processing.

□ elevator_former_req_fn and elevator_latter_req_fn find the predecessor and successor request of a given request; this is useful when performing merging.

□ elevator_set_req_fn and elevator_put_req_fn are invoked when a new request is instantiated and returned to memory management (at this point in time the requests are not yet or no longer associated with any queue or have been satisfied). The functions give the I/O scheduler the opportunity to allocate, initialize, and return management data structures.

□ elevator_init_fn and elevator_exit_fn are invoked when a queue is initialized and returned; their effect is the same as that of a constructor or destructor.

Each elevator is encapsulated in the following data structure that holds further management information for the kernel:

<elevator.h>

struct elevator_type

struct list_head list; struct elevator_ops ops; struct elv_fs_entry *elevator_attrs; char elevator_name[ELV_NAME_MAX]; struct module *elevator_owner;

The kernel keeps all elevators in a doubly linked standard list implemented by means of the list element (the list head is represented by the global variable elv_list). Each elevator is also given a human-readable name that can be used to select the elevator from userspace. Attributes that will appear in sysfs are kept in elevator_attrs. They can be used to tune the elevator behavior on a per-disk basis.

The kernel implements a whole series of I/O schedulers. However, device drivers may overwrite specific functions of the schedulers for their own purposes or even implement their own schedulers. The elevators have the following properties.

□ elevator_noop is a very simple I/O scheduler that adds incoming requests to the queue one after the other for processing on a ''first come, first served'' basis. Requests are merged but not reordered. The noop I/O scheduler (no operation) is only a good choice for intelligent hardware that can reorder requests by itself. It is also reported to be a good scheduler for devices where there are no moving parts and thus no seek times — flash disks, for instance.

□ iosched_deadline serves two purposes: it tries to minimize the number of disk seeks (i.e., movement of the read/write heads) and also does its best to ensure that requests are processed within a certain time. In the latter case, the kernel's timer mechanism is used to implement an ''expiry time'' for the individual requests. In the former case, lengthy data structures (red-black trees and linked lists) are used to analyze requests so that they can be reordered with the minimum of delay, thus reducing the number of disk seeks.

□ iosched_as implements the anticipatory scheduler, which — as its name suggests — anticipates process behavior as far as possible. Naturally, this is not an easy goal, but the scheduler tries to achieve it by assuming that read requests are not totally independent of each other. When an application submits a read request to the kernel, the assumption is then made that a second related request will be submitted within a certain period. This is important if the read request is submitted in a period during which the disk is busy with write operations. To ensure satisfactory interaction, the write operations are deferred, and preference is given to the read operations. If writing is resumed immediately, a disk seek operation is required but is negated by a new read request arriving shortly afterward. In this case, it is better to leave the disk head in its position after the first read request and to wait briefly for the next read request — if a second read request does not arrive, the kernel is free to resume write operations.

□ iosched_cfq provides complete fairness queuing. It is centered around several queues into which all requests are sorted. Requests from a given process always end up on the same queue. Time slices are allocated for each of the queues, and a round robin algorithm is used to process the queue. This ensures that the I/O bandwidth is shared in a fair manner between the different queues. If the number of queues is bigger than or equal to the number of processes doing simultaneous I/O, this implies that the bandwidth is also fairly distributed between the processes. Some practical problems (like multiple processes being mapped to identical queues, varying request sizes, different I/O priorities, etc.) make the distribution not completely fair, but basically, the method achieves its goal to a good extent.

The deadline scheduler was the default scheduler almost up to the end of development of 2.5 but was replaced with the anticipatory scheduler until 2.6.17. From 2.6.18 onward, the Completely Fair Queuing scheduler is the default choice.

For reasons of space, I shall not deal with the implementation details of each scheduler. It should be noted, however, that the deadline scheduler is much less complicated than the anticipatory scheduler but delivers practically identical performance in most situations.

Continue reading here: Implementation of loctls

Was this article helpful?

0 0