[4] The bdf_prm table also includes several other unused fields.

Table 14-4. Buffer cache tuning parameters

Parameter     Default   Min   Max         Description
nfract        40        0     100         Threshold percentage of dirty buffers for waking up bdflush
nfract_sync   60        0     100         Threshold percentage of dirty buffers for waking up bdflush in blocking mode
age_buffer    3000      100   600,000     Time-out in ticks of a dirty buffer for being written to disk
interval      500       0     1,000,000   Delay in ticks between kupdate activations

The most typical cases that cause the kernel thread to be woken up are:

• The balance_dirty( ) function verifies that the number of buffer pages in the buf_dirty and buf_locked lists exceeds the threshold:

    P × bdf_prm.b_un.nfract_sync / 100

where P represents the number of pages in the system that can be used as buffer pages (essentially, this is all the pages in the "DMA" and "Normal" memory zones; see Section 7.1.2). Actually, the computation is done by the balance_dirty_state( ) helper function, which returns -1 if the number of dirty or locked buffers is below the nfract threshold, 0 if it is between nfract and nfract_sync, and 1 if it is above nfract_sync. The balance_dirty( ) function is usually invoked whenever a buffer is marked as "dirty" and the function moves its buffer head into the buf_dirty list.

• When the try_to_free_buffers( ) function fails to release the buffer heads of some buffer page (see the earlier Section 14.2.2.1).

• When the grow_buffers( ) function fails to allocate a new buffer page, or the create_buffers( ) function fails to allocate a new buffer head (see the earlier Section 14.2.2.1).

• When a user presses some specific combinations of keys on the console (usually ALT+SysRq+u and ALT+SysRq+s). These key combinations, which are enabled only if the Linux kernel has been compiled with the Magic SysRq Key option, allow Linux hackers to have some explicit control over kernel behavior.
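The three-way result of balance_dirty_state( ) described in the first case above can be sketched in user space as follows. This is only an illustrative model, not the kernel's code: the flush_params structure, the dirty_state( ) function, and its arguments are hypothetical names, and the treatment of values exactly equal to a threshold is an assumption.

```c
#include <assert.h>

/* Hypothetical user-space model of the three-way check described in the
 * text. All names are illustrative; this is not the kernel's code. */
struct flush_params {
    int nfract;       /* percentage that wakes up bdflush */
    int nfract_sync;  /* percentage that triggers blocking mode */
};

/* Returns -1 below the nfract threshold, 0 between nfract and
 * nfract_sync, and 1 above nfract_sync. */
static int dirty_state(unsigned long dirty_pages, unsigned long usable_pages,
                       const struct flush_params *p)
{
    assert(p->nfract >= 0 && p->nfract <= p->nfract_sync &&
           p->nfract_sync <= 100);

    /* Compare dirty/usable against a percentage without dividing:
     * dirty * 100 versus usable * percent. */
    unsigned long scaled = dirty_pages * 100UL;

    if (scaled <= usable_pages * (unsigned long)p->nfract)
        return -1;
    if (scaled <= usable_pages * (unsigned long)p->nfract_sync)
        return 0;
    return 1;
}
```

With the default values from Table 14-4 (40 and 60) and 1,000 usable pages, 300 dirty pages fall below the first threshold, 500 fall between the two, and 700 exceed the second.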

To wake up bdflush, the kernel invokes the wakeup_bdflush( ) function, which simply executes:

wake_up_interruptible(&bdflush_wait);

to wake up the process suspended in the bdflush_wait wait queue. There is just one process in this wait queue, namely bdflush itself.

The core of the bdflush( ) function is the following endless loop:

    for (;;) {
        if (emergency_sync_scheduled)   /* Only if the kernel has been compiled */
            do_emergency_sync( );       /* with Magic SysRq Key support */
        spin_lock(&lru_list_lock);
        if (!write_some_buffers(0) || balance_dirty_state( ) < 0) {
            wait_for_some_buffers(0);
            interruptible_sleep_on(&bdflush_wait);
        }
    }

If the Linux kernel has been compiled with the Magic SysRq Key option, bdflush( ) checks whether the user has requested an emergency sync. If so, the function invokes do_emergency_sync( ) to execute fsync_dev( ) on all existing block devices, flushing all dirty buffers (see the later Section 14.2.4.3).

Next, the function acquires the lru_list_lock spin lock and invokes the write_some_buffers( ) function, which tries to activate block I/O write operations for up to 32 unlocked dirty buffers. Once the write operations have been activated, write_some_buffers( ) releases the lru_list_lock spin lock and returns 0 if fewer than 32 unlocked dirty buffers have been found; it returns a negative value otherwise.

If write_some_buffers( ) didn't find 32 buffers to flush, or the number of dirty or locked buffers falls below the percentage threshold given by bdflush's nfract parameter, the bdflush kernel thread goes to sleep. To do this, it first invokes the wait_for_some_buffers( ) function so that it sleeps until all I/O data transfers of the buffers in the buf_locked list terminate. During this time interval, the kernel thread is not woken up even if the kernel executes the wakeup_bdflush( ) function. Once data transfers terminate, the bdflush( ) function invokes interruptible_sleep_on( ) on the bdflush_wait wait queue to sleep until the next wakeup_bdflush( ) invocation.

14.2.4.2 The kupdate kernel thread

Since the bdflush kernel thread is usually activated only when there are too many dirty buffers or when more buffers are needed and available memory is scarce, some dirty buffers might stay in RAM for an arbitrarily long time before being flushed to disk. The kupdate kernel thread is thus introduced to flush the older dirty buffers. [5]

[5] In an earlier version of Linux 2.2, the same task was achieved by means of the bdflush( ) system call, which was invoked every five seconds by a User Mode system process launched at system startup and which executed the /sbin/update program. In more recent kernel versions, the bdflush( ) system call is used only to allow users to modify the system parameters in the bdf_prm table.

As shown in Table 14-4, age_buffer is a time-out parameter that specifies the time for buffers to age before kupdate writes them to disk (usually 30 seconds), while the interval field of the bdf_prm table stores the delay in ticks between two activations of the kupdate kernel thread (usually five seconds). If this field is null, the kernel thread is normally stopped, and is activated only when it receives a SIGCONT signal.

When the kernel modifies the contents of some buffer, it sets the b_flushtime field of the corresponding buffer head to the time (in jiffies) when it should later be flushed to disk. The kupdate kernel thread selects only the dirty buffers whose b_flushtime field is smaller than the current value of jiffies.
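The aging rule can be illustrated with a small user-space sketch. It assumes a tick rate of 100 ticks per second, which is what makes 3000 ticks equal 30 seconds and 500 ticks equal five seconds; the buf structure and the three helper functions are hypothetical names, and counter wraparound is ignored for simplicity.

```c
#include <assert.h>
#include <stddef.h>

#define HZ 100UL  /* assumed tick rate: 100 ticks per second */

/* Hypothetical stand-in for a buffer head; only the fields needed here. */
struct buf {
    unsigned long flushtime;  /* tick at which the buffer becomes "old" */
    int dirty;
};

/* Called when the buffer is modified at tick `now`. */
static void mark_dirty(struct buf *b, unsigned long now,
                       unsigned long age_buffer)
{
    assert(b != NULL);
    b->dirty = 1;
    b->flushtime = now + age_buffer;
}

/* kupdate-style selection: dirty, and flush time strictly in the past.
 * Wraparound of the tick counter is not handled in this sketch. */
static int is_old(const struct buf *b, unsigned long now)
{
    return b->dirty && b->flushtime < now;
}

/* Converts a tick count to whole seconds under the assumed HZ. */
static unsigned long ticks_to_seconds(unsigned long ticks)
{
    return ticks / HZ;
}
```

A buffer dirtied at tick 1000 with age_buffer set to 3000 gets a flush time of 4000, so it is selected only from tick 4001 onward.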

The kupdate kernel thread runs the kupdate( ) function; it keeps executing the following endless loop:

    for (;;) {
        wait_for_some_buffers(0);
        if (bdf_prm.b_un.interval) {
            tsk->state = TASK_INTERRUPTIBLE;
            schedule_timeout(bdf_prm.b_un.interval);
        } else {
            tsk->state = TASK_STOPPED;
            schedule( );    /* wait for SIGCONT */
        }
        sync_old_buffers( );
    }

First of all, the kernel thread suspends itself until the I/O data transfers have been completed for all buffers in the buf_locked list. Then, if bdf_prm.b_un.interval is not null, the thread goes to sleep for the specified number of ticks (see Section 6.6.2); otherwise, the thread stops itself until a SIGCONT signal is received (see Section 10.1).

The core of the kupdate( ) function consists of the sync_old_buffers( ) function. The operations to be performed are very simple for standard filesystems used with Unix; all the function has to do is write dirty buffers to disk. However, some nonnative filesystems introduce complexities because they store their superblock or inode information in complicated ways. sync_old_buffers( ) executes the following steps:

1. Acquires the big kernel lock.

2. Invokes sync_unlocked_inodes( ), which scans the superblocks of all currently mounted filesystems and, for each superblock, the list of dirty inodes to which the s_dirty field of the superblock object points. For each inode, the function flushes the dirty pages that belong to memory mappings of the corresponding file (see Section 15.2.5), then invokes the write_inode superblock operation if it is defined. (The write_inode method is defined only by non-Unix filesystems that do not store all the inode data inside a single disk block — for instance, the MS-DOS filesystem).

3. Invokes sync_supers( ), which takes care of superblocks used by filesystems that do not store all the superblock data in a single disk block (an example is Apple Macintosh's HFS). The function accesses the superblocks list of all currently mounted filesystems (see Section 12.4). It then invokes, for each superblock, the corresponding write_super superblock operation, if one is defined (see Section 12.2.1). The write_super method is not defined for any Unix filesystem.

4. Releases the big kernel lock.

5. Starts a loop consisting of the following steps:

a. Gets the lru_list_lock spin lock.

b. Gets the bh pointer to the first buffer head in the buf_dirty list.

c. If the pointer is null or if the b_flushtime buffer head field has a value greater than jiffies (young buffer), releases the lru_list_lock spin lock and terminates.

d. Invokes write_some_buffers( ), which tries to activate block I/O write operations for up to 32 unlocked dirty buffers in the buf_dirty list. Once the write activations have been performed, write_some_buffers( ) releases the lru_list_lock spin lock and returns 0 if fewer than 32 unlocked dirty buffers have been found; it returns a negative value otherwise.

e. If write_some_buffers( ) flushed to disk exactly 32 unlocked dirty buffers, jumps to Step 5a; otherwise, terminates the execution.
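The loop in step 5 can be modeled in user space as shown here. This is a simplified sketch under stated assumptions: the dirty_list structure, write_some( ), and flush_old( ) are hypothetical names, the list is represented as an array ordered from head to tail, and locking and real I/O are left out.

```c
#include <assert.h>

#define BATCH 32  /* buffers submitted per write_some() call */

/* Hypothetical model of the dirty list: flush times, head first. */
struct dirty_list {
    const unsigned long *flushtime;
    int count;  /* total entries */
    int head;   /* index of the first buffer not yet written */
};

/* Follows the return convention described in the text: submit up to
 * BATCH buffers, return 0 if fewer than BATCH were found, -1 otherwise. */
static int write_some(struct dirty_list *l)
{
    assert(l->head <= l->count);
    int n = l->count - l->head;
    if (n > BATCH)
        n = BATCH;
    l->head += n;
    return n < BATCH ? 0 : -1;
}

/* Steps 5a-5e: stop when the list is empty, when its head is still
 * young, or when a batch comes back short. Returns buffers written. */
static int flush_old(struct dirty_list *l, unsigned long now)
{
    int start = l->head;

    for (;;) {
        if (l->head == l->count || l->flushtime[l->head] > now)
            break;              /* step 5c */
        if (write_some(l) == 0)
            break;              /* step 5e: short batch */
    }
    return l->head - start;
}
```

In this model, as in the steps above, only the buffer at the head of the list is checked for age. Once the head is old, a whole batch is submitted, so a few young buffers may be written along with the old ones.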

14.2.4.3 The sync( ), fsync( ), and fdatasync( ) system calls

Three different system calls are available to user applications to flush dirty buffers to disk:

sync( )
    Usually issued before a shutdown, since it flushes all dirty buffers to disk

fsync( )
    Allows a process to flush all blocks that belong to a specific open file to disk

fdatasync( )
    Very similar to fsync( ), but doesn't flush the inode block of the file

The core of the sync( ) system call is the fsync_dev( ) function, which performs the following actions:

1. Invokes sync_buffers( ), which essentially executes the following code:

    do {
        spin_lock(&lru_list_lock);
    } while (write_some_buffers(0));
    run_task_queue(&tq_disk);

As you see, the function keeps invoking the write_some_buffers( ) function as long as it succeeds in finding 32 unlocked, dirty buffers; the loop ends when a call finds fewer than 32. Then, the block device drivers are unplugged to start real I/O data transfers (see Section 13.4.6.2).

2. Acquires the big kernel lock.

3. Invokes sync_inodes( ), which is quite similar to the sync_unlocked_inodes( ) function discussed in the previous section.

4. Invokes sync_supers( ) to write the dirty superblocks to disk, if necessary, by using the write_super methods (see earlier in this section).

5. Releases the big kernel lock.

6. Invokes sync_buffers( ) once again. This time, it waits until all locked buffers have been transferred.
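The do/while loop in step 1 runs once per batch and stops at the first short batch. The following sketch counts how many calls that takes; write_some( ) and drain( ) are hypothetical stand-ins that track only a counter of dirty buffers and follow the return convention of write_some_buffers( ) described earlier.

```c
#include <assert.h>

#define BATCH 32

/* Hypothetical stand-in: takes up to BATCH buffers off a counter and
 * returns 0 if fewer than BATCH were available, -1 otherwise. */
static int write_some(int *dirty)
{
    assert(*dirty >= 0);
    int n = *dirty < BATCH ? *dirty : BATCH;
    *dirty -= n;
    return n < BATCH ? 0 : -1;
}

/* Same shape as the loop in step 1; returns the number of calls made. */
static int drain(int *dirty)
{
    int calls = 0;

    do {
        calls++;
    } while (write_some(dirty));
    return calls;
}
```

Under these assumptions, 100 dirty buffers need four calls (32, 32, 32, and 4), while exactly 64 need three, because the loop only learns that the list is empty from a final call that finds nothing.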

The fsync( ) system call forces the kernel to write to disk all dirty buffers that belong to the file specified by the fd file descriptor parameter (including the buffer containing its inode, if necessary). The system service routine derives the address of the file object and then invokes the fsync method. Usually, this method simply invokes the fsync_inode_buffers( ) function, which scans the two lists of dirty buffers of the inode object (see the earlier Section 14.2.1.3), and invokes ll_rw_block( ) on each element present in the lists. The function then suspends the calling process until all dirty buffers of the file have been written to disk by invoking wait_on_buffer( ) on each locked buffer. Moreover, the service routine of the fsync( ) system call flushes the dirty pages that belong to the memory mapping of the file, if any (see Section 15.2.5).

The fdatasync( ) system call is very similar to fsync( ), but writes to disk only the buffers that contain the file's data, not those that contain inode information. Since Linux 2.4 does not have a specific file method for fdatasync( ), this system call uses the fsync method and is thus identical to fsync( ).
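From the point of view of a user application, these calls are used as in the following sketch, which writes a string to a file and asks the kernel to flush it before returning. The write_durably( ) helper and its data_only flag are hypothetical names introduced for illustration; open( ), write( ), fsync( ), fdatasync( ), and close( ) are the standard POSIX calls.

```c
#define _POSIX_C_SOURCE 200809L  /* for the fdatasync() declaration */

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Writes `data` to `path` and flushes it to disk before returning.
 * If data_only is nonzero, fdatasync() is used instead of fsync().
 * Returns 0 on success, -1 on any failure. */
static int write_durably(const char *path, const char *data, int data_only)
{
    assert(path != NULL && data != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(data);
    ssize_t n = write(fd, data, len);
    int ok = (n == (ssize_t)len);  /* a short write counts as failure here */

    if (ok && (data_only ? fdatasync(fd) : fsync(fd)) != 0)
        ok = 0;
    if (close(fd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A caller might invoke write_durably("/tmp/demo.txt", "hello\n", 0) to use fsync( ), or pass 1 as the last argument to use fdatasync( ) instead.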
