Figure 17-10: Code flow diagram for sys_sync

As the code flow diagram shows, sync_inodes and sync_filesystems are invoked twice, first with the parameter 0 and then 1. The parameter specifies whether the functions are to wait until the write operations are finished (1) or whether they are to execute asynchronously (0). Splitting the operation into two passes allows the write operations to be initiated in the first pass. This triggers the synchronization of dirty pages associated with inodes, and also uses write_inode to synchronize the metadata. However, a filesystem implementation may choose just to dirty the buffers or pages that contain the metadata, but not send an actual write request to the block device. Since sync_inodes iterates over all dirty inodes, the small contributions from the individual metadata changes will pile up to a comparatively large amount of dirty data.

The second pass is therefore required for two reasons:

1. The dirtied pages resulting from the calls to write_inode are written to disk (synchronization of raw block devices ensures this). Since metadata changes need not be processed on a piece-by-piece basis, the approach improves write performance.

2. The kernel now explicitly waits until all write operations that have been triggered are complete. This is ensured because WB_SYNC_ALL is set in the second pass.
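The two passes can be illustrated with a small userspace sketch. It is not the kernel's code: sync_inodes and sync_filesystems are replaced by stubs that only record how they were called, the driver two_pass_sync and the helper log_call are hypothetical names, and the ordering of the calls within each pass is an assumption made for illustration. Every other step of the system call is omitted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Records the sequence of calls so the two-pass structure can be inspected. */
char call_log[128];

/* Hypothetical helper: append "name(wait);" to the log. */
void log_call(const char *name, int wait)
{
    char entry[32];

    snprintf(entry, sizeof(entry), "%s(%d);", name, wait);
    strncat(call_log, entry, sizeof(call_log) - strlen(call_log) - 1);
}

/* Userspace stand-ins: the real functions perform I/O, these only log. */
void sync_inodes(int wait)      { log_call("sync_inodes", wait); }
void sync_filesystems(int wait) { log_call("sync_filesystems", wait); }

/* Hypothetical driver that mirrors the two passes described in the text. */
void two_pass_sync(void)
{
    call_log[0] = '\0';

    /* Pass 1 (wait == 0): initiate the write operations, do not wait. */
    sync_inodes(0);
    sync_filesystems(0);

    /* Pass 2 (wait == 1): write what pass 1 dirtied and wait for completion. */
    sync_filesystems(1);
    sync_inodes(1);
}
```

After calling two_pass_sync(), call_log holds the four calls in order, which makes it easy to see that each function runs once without waiting and once with waiting.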

The two-pass behavior requires one change to sync_sb_inodes that I have not discussed yet. The second pass wants to wait for all pages that have been submitted, including those submitted during the first pass. Recall from our previous considerations (the overview in Figure 17-1 might be helpful here) that the corresponding wait operations are issued in __sync_single_inode. However, the function only sees inodes that are present on one of the superblock lists s_dirty, s_io, or s_more_io when sync_sb_inodes is called. If sync_sb_inodes were called with WB_SYNC_NONE in the first pass, the inodes would no longer be on any of these lists, and waiting could not be performed!

For this purpose, the special writeback mode WB_SYNC_HOLD is introduced. It is nearly identical to WB_SYNC_NONE. The important difference is that inodes that have been synchronized are not simply removed from s_io in sync_sb_inodes, but are placed back onto the s_dirty list. Thus they are still visible in the second pass and can be waited for. The block layer can, nevertheless, start to write out data in between the passes.
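The effect of the hold mode can be made concrete with a toy model. The sketch below is a userspace simplification, not kernel code: the types toy_inode and toy_list and the functions toy_sync_sb_inodes and toy_two_pass are hypothetical, the lists are reduced to a per-inode tag, and s_more_io is left out. The only behavior it models is the one described above, namely which inodes are still visible when the waiting pass runs.

```c
#include <assert.h>
#include <stddef.h>

/* Mode names follow the text; this enum is a local definition for the sketch. */
enum writeback_sync_modes { WB_SYNC_NONE, WB_SYNC_ALL, WB_SYNC_HOLD };

/* Toy model: which superblock list an inode currently sits on. */
enum toy_list { ON_NO_LIST, ON_S_DIRTY, ON_S_IO };

struct toy_inode {
    enum toy_list list;
    int write_submitted;   /* write-out has been started */
    int waited;            /* completion has been waited for */
};

#define TOY_NR_INODES 3

/*
 * Hypothetical stand-in for sync_sb_inodes. It only sees inodes that are
 * on s_dirty and returns how many inodes it waited for.
 */
int toy_sync_sb_inodes(struct toy_inode *inodes, size_t n,
                       enum writeback_sync_modes mode)
{
    int waited = 0;

    for (size_t i = 0; i < n; i++) {
        struct toy_inode *inode = &inodes[i];

        if (inode->list != ON_S_DIRTY)
            continue;                  /* not on a list: invisible here */

        inode->list = ON_S_IO;         /* queue the inode for I/O */
        inode->write_submitted = 1;    /* start the write-out */

        if (mode == WB_SYNC_ALL) {
            inode->waited = 1;         /* wait for the submitted pages */
            waited++;
            inode->list = ON_NO_LIST;
        } else if (mode == WB_SYNC_HOLD) {
            inode->list = ON_S_DIRTY;  /* keep it visible for the next pass */
        } else {
            inode->list = ON_NO_LIST;  /* WB_SYNC_NONE: drop it from the lists */
        }
    }
    return waited;
}

/* Run a first pass in the given mode, then a waiting pass with WB_SYNC_ALL. */
int toy_two_pass(enum writeback_sync_modes first_pass_mode)
{
    struct toy_inode inodes[TOY_NR_INODES];

    for (size_t i = 0; i < TOY_NR_INODES; i++) {
        inodes[i].list = ON_S_DIRTY;
        inodes[i].write_submitted = 0;
        inodes[i].waited = 0;
    }

    toy_sync_sb_inodes(inodes, TOY_NR_INODES, first_pass_mode);
    return toy_sync_sb_inodes(inodes, TOY_NR_INODES, WB_SYNC_ALL);
}
```

In this model, toy_two_pass(WB_SYNC_NONE) waits for no inode in the second pass because none is left on a list, whereas toy_two_pass(WB_SYNC_HOLD) waits for all three.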

The additional CPU time consumed by the redundant invocation of functions during the sync system call is negligible compared to the time needed for the slow I/O operations and is therefore totally acceptable.
