Handling of Page Faults

The association between virtual and physical memory is not established until the data of an area are actually needed. If a process accesses a part of virtual address space not yet associated with a page in memory, the processor automatically raises a page fault that must be handled by the kernel. This is one of the most important and complex aspects of memory management simply because a myriad of details must be taken into account. For example, the kernel must ascertain the following:

□ Was the page fault caused by access to a valid address from the user address space, or did the application try to access the protected area of the kernel?

□ Does a mapping exist for the desired address?

□ Which mechanism must be used to obtain the data for the area?

Figure 4-17 shows an initial overview of the potential paths the kernel may follow when handling page faults.

[Figure 4-17: Potential options for handling page faults. The decision flow first distinguishes between kernel and userspace addresses. A kernel-mode access to a kernel address is resolved by synchronizing with the reference page table, while a user-mode access to the kernel area results in a segmentation fault. For user addresses, the kernel checks whether a mapping exists and whether the privileges are sufficient, and either raises a segmentation fault or handles the request by demand paging/allocation, swapping, or COW.]
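To make this decision flow concrete before diving into the implementation, the following sketch restates it in C. It is purely illustrative: the helper functions (fault_in_kernel_space, mapping_exists, access_allowed, and so on) are invented stand-ins for this overview and do not exist under these names in the kernel.

/* Illustrative restatement of the decision flow in Figure 4-17.
 * All helpers below are invented stand-ins, not kernel functions. */
#include <stdbool.h>

extern bool fault_in_kernel_space(unsigned long address);
extern bool mapping_exists(unsigned long address);
extern bool access_allowed(unsigned long address);
extern void sync_with_reference_page_table(unsigned long address);
extern void handle_request(unsigned long address);  /* paging, swap-in, COW */
extern void segmentation_fault(void);

static void page_fault_overview(unsigned long address, bool user_mode)
{
        if (fault_in_kernel_space(address)) {
                if (!user_mode) {
                        /* Kernel-mode access to a kernel address: synchronize
                         * with the reference page table and return. */
                        sync_with_reference_page_table(address);
                        return;
                }
                /* Userspace tried to access the protected kernel area. */
                segmentation_fault();
                return;
        }
        if (!mapping_exists(address) || !access_allowed(address)) {
                segmentation_fault();
                return;
        }
        /* Valid fault: demand paging/allocation, swap-in, or COW. */
        handle_request(address);
}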

As demonstrated below, the individual actions are much more complicated because the kernel must not only guard against malicious access from userspace but must also take note of many minor details; on top of this, it must not allow the page handling operations to degrade system performance unnecessarily.

The implementation of page fault handling varies from processor to processor. Because the CPUs employ different memory management concepts, the details of page fault generation also differ. Consequently, the handler routines in the kernel are located in the architecture-specific source code segments.

We confine ourselves below to a detailed description of the approach adopted on the IA-32 architecture. Implementation on most other CPUs is at least similar.

An assembler routine in arch/x86/kernel/entry_32.S serves as the entry point for page faults but immediately invokes the C routine do_page_fault from arch/x86/mm/fault_32.c. (A routine of the same name is present in the architecture-specific sources of most CPUs.16,17) Figure 4-18 shows the code flow diagram of this extensive routine.

[Figure 4-18: Code flow diagram for do_page_fault on IA-32 processors. do_page_fault first saves the faulting address. Faults raised in an interrupt handler or without a user context are routed to fixup_exception. Kernel-mode faults at addresses above TASK_SIZE that are not protection faults are handled by the vmalloc handler. Otherwise find_vma looks up the matching vm_area_struct; stack areas are enlarged with expand_stack if necessary, the access permissions are checked, and valid faults are passed to handle_mm_fault, while invalid accesses end in a segmentation fault (for user-mode accesses) or in fixup_exception.]

This situation is complex, so it is necessary to examine the implementation of do_page_fault very closely.

Two parameters are passed to the routine — the register set active at the time of the fault, and an error code (long error_code) that supplies information on the cause of the fault. Currently, only the first three bits (0, 1, and 2) of error_code are used; their meanings are given in Table 4-1.

arch/x86/mm/fault_32.c
fastcall void __kprobes do_page_fault(struct pt_regs *regs,
                                      unsigned long error_code)
{
        struct task_struct *tsk;
        struct mm_struct *mm;
        struct vm_area_struct * vma;
        unsigned long address;
        unsigned long page;
        int write, si_code;
        int fault;
...

16 As usual, Sparc processors are the odd man out. There the name of the function is do_sparc_fault (Sparc32), do_sun4c_fault (Sparc32 sun4c), or do_sparc64_fault (UltraSparc). ia64_do_page_fault is used on IA-64 systems.

17 Note that the code for IA-32 and AMD64 will be unified in kernel 2.6.25, which was still under development when this book was written. The remarks given here also apply for the AMD64 architecture.

Table 4-1: Meaning of Page Fault Error Codes on IA-32

Bit   Set (1)                                             Not set (0)
0     Protection fault (insufficient access permission)   No page present in RAM
1     Write access                                        Read access
2     Fault occurred in user mode                         Fault occurred in privileged kernel mode
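The fault handler tests these bits with simple bitwise operations on error_code, as the listings below show: error_code & 4 selects the user/kernel bit, and error_code & 3 combines the write and present bits. As a purely illustrative restatement of the table (this helper is not part of the kernel sources), the decoding could be written as:

#include <linux/kernel.h>

/* Illustrative only: restates Table 4-1, not part of the kernel sources. */
static void decode_page_fault_error_code(unsigned long error_code)
{
        int protection_fault = error_code & 1;  /* bit 0: page present, access denied */
        int write_access     = error_code & 2;  /* bit 1: write access (otherwise read) */
        int user_mode        = error_code & 4;  /* bit 2: fault raised in user mode */

        printk(KERN_DEBUG "page fault: %s during %s access in %s mode\n",
               protection_fault ? "protection fault" : "missing page",
               write_access ? "write" : "read",
               user_mode ? "user" : "kernel");
}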

Once a large number of variables have been declared for subsequent use, the kernel stores the address of the location that triggered the fault in address.18

arch/x86/mm/fault_32.c
        tsk = current;

        si_code = SEGV_MAPERR;

        /*
         * We fault-in kernel-space virtual memory on-demand. The
         * 'reference' page table is init_mm.pgd.
         *
         * NOTE! We MUST NOT take any locks for this case. We may
         * be in an interrupt or a critical region, and should
         * only copy the information from the master page table,
         * nothing more.
         *
         * This verifies that the fault happens in kernel space
         * (error_code & 4) == 0, and that the fault was not a
         * protection error (error_code & 9) == 0.
         */
        if (unlikely(address >= TASK_SIZE)) {
                if (!(error_code & 0x0000000d) && vmalloc_fault(address) >= 0)
                        return;
                /*
                 * Don't take the mm semaphore here. If we fixup a prefetch
                 * fault we could otherwise deadlock.
                 */
                goto bad_area_nosemaphore;
        }

18 On IA-32 processors, the address is held in register CR2, whose contents are copied to address by read_cr2. The processor-specific details are of no interest.

A vmalloc fault is indicated if the address is outside user address space. The page tables of the process must therefore be synchronized with the information in the kernel's master page table. Naturally, this is only permitted if access took place in kernel mode and the fault was not triggered by a protection error; in other words, neither bit 2 nor bits 3 and 0 of the error code may be set.19

The kernel uses the auxiliary function vmalloc_fault to synchronize the page tables. I won't show the code in detail because all it does is copy the relevant entry from the page table of init — this is the kernel master table on IA-32 systems — into the current page table. If no matching entry is found there, the kernel invokes fixup_exception in a final attempt to recover the fault; I discuss this shortly.
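The essence of this synchronization can be sketched as follows. Note that this is a simplified illustration of the idea only, not the kernel's actual vmalloc_fault, which walks all page table levels and also checks that the PTE itself is present:

/* Simplified sketch of the idea behind vmalloc_fault (not the real code):
 * copy the top-level entry for address from the kernel's reference page
 * table (init_mm.pgd) into the page table that is currently active. */
static inline int vmalloc_fault_sketch(unsigned long address)
{
        unsigned long pgd_paddr = read_cr3();              /* active page table */
        pgd_t *pgd   = (pgd_t *)__va(pgd_paddr) + pgd_index(address);
        pgd_t *pgd_k = init_mm.pgd + pgd_index(address);   /* master entry */

        if (!pgd_present(*pgd_k))
                return -1;      /* not mapped in the reference table either */

        set_pgd(pgd, *pgd_k);   /* synchronize the process page table */
        return 0;
}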

The kernel jumps to the bad_area_nosemaphore label if the fault was triggered during an interrupt (see Chapter 14) or in a kernel thread (see Chapter 14) that does not have its own context and therefore no separate instance of mm_struct.

arch/x86/mm/fault_32.c
        /*
         * If we're in an interrupt, have no user context or are running in an
         * atomic region then we must not take the fault..
         */
        if (in_atomic() || !mm)
                goto bad_area_nosemaphore;

arch/x86/mm/fault_32.c
bad_area_nosemaphore:
        /* User mode accesses just cause a SIGSEGV */
        if (error_code & 4) {
                ...
                force_sig_info_fault(SIGSEGV, si_code, address, tsk);
                return;
        }

no_context:
        /* Are we prepared to handle this kernel fault? */
        if (fixup_exception(regs))
                return;

A segmentation fault is output if the fault originates from userspace, indicated by the fact that bit 2 is set in error_code (the check error_code & 4). If, however, the fault originates from kernel space, fixup_exception is invoked. I describe this function below.
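For orientation, the principle behind the fixup can be sketched roughly as follows: the kernel looks up the faulting instruction pointer in the exception tables and, if a fixup entry is registered, resumes execution at the fixup address. This is a hedged sketch of the mechanism, not the verbatim implementation of fixup_exception:

/* Rough sketch of the exception fixup principle (not the verbatim kernel
 * code): search the exception tables for the faulting instruction pointer
 * and, if a fixup entry exists, continue execution at the fixup address. */
static int fixup_exception_sketch(struct pt_regs *regs)
{
        const struct exception_table_entry *fixup;

        fixup = search_exception_tables(regs->eip);
        if (fixup) {
                regs->eip = fixup->fixup;   /* resume at the fixup code */
                return 1;
        }
        return 0;                           /* no fixup known: fault is fatal */
}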

If the fault does not occur in an interrupt or in a context-less kernel thread, the kernel checks whether the address space of the process contains a region in which the fault address lies. To do this, it invokes the find_vma function, which we know from Section 4.5.1.

19 This is checked by !(error_code & 0x0000000d). Because 2^0 + 2^2 + 2^3 = 13 = 0xd, neither bit 0, bit 2, nor bit 3 may be set.

arch/x86/mm/fault_32.c
        vma = find_vma(mm, address);
        if (!vma)
                goto bad_area;
        if (vma->vm_start <= address)
                goto good_area;
        if (!(vma->vm_flags & VM_GROWSDOWN))
                goto bad_area;
        if (expand_stack(vma, address))
                goto bad_area;

good_area and bad_area are labels to which the kernel jumps once it has discovered whether the address is valid or invalid.

The search can yield various results:

□ No region is found whose end address is after address, in which case access is invalid.

□ The fault address is within the region found, in which case access is valid and the page fault is corrected by the kernel.

□ A region is found whose end address is after the fault address but the fault address is not within the region. There may be two reasons for this:

1. The vm_growsdown flag of the region is set; this means that the region is a stack that grows from top to bottom. expand_stack is then invoked to enlarge the stack accordingly. If it succeeds, 0 is returned as the result, and the kernel resumes execution at good_area. Otherwise, access is interpreted as invalid.

2. The region found is not a stack, so access is invalid.

good_area follows on immediately after the above code.

arch/x86/mm/fault_32.c
good_area:
        si_code = SEGV_ACCERR;
        write = 0;
        switch (error_code & 3) {
                default:        /* 3: write, present */
                                /* fall through */
                case 2:         /* write, not present */
                        if (!(vma->vm_flags & VM_WRITE))
                                goto bad_area;
                        write++;
                        break;
                case 1:         /* read, present */
                        goto bad_area;
                case 0:         /* read, not present */
                        if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
                                goto bad_area;
        }

The presence of a mapping for the fault address does not necessarily mean that access is actually permitted. The kernel must check the access permissions, which it does by examining bits 0 and 1 in the switch statement above (error_code & 3, because 2^0 + 2^1 = 3). The following situations may apply:

□ vm_write must be set in the event of a write access (bit 1 set, cases 3 and 2). Otherwise, access is invalid, and execution resumes at bad_area.

□ In the event of a read access to an existing page (Case 1), the fault must be a permission fault detected by the hardware. Execution then resumes at bad_area.

□ If a read access is made to a page that doesn't exist, the kernel must check whether vm_read or vm_exec is set, in which case access is valid. Otherwise, read access is denied, and the kernel jumps to bad_area.

If the kernel does not explicitly jump to bad_area, it works its way down through the case statement and arrives at the handle_mm_fault call that immediately follows; this function is responsible for correcting the page fault (i.e., reading the required data).

arch/x86/mm/fault_32.c
survive:
        /*
         * If for any reason at all we couldn't handle the fault,
         * make sure we exit gracefully rather than endlessly redo
         * the fault.
         */
        fault = handle_mm_fault(mm, vma, address, write);
        if (unlikely(fault & VM_FAULT_ERROR)) {
                if (fault & VM_FAULT_OOM)
                        goto out_of_memory;
                else if (fault & VM_FAULT_SIGBUS)
                        goto do_sigbus;
        }
        ...
        return;

handle_mm_fault is an architecture-independent routine for selecting the appropriate fault correction method (demand paging, swap-in, etc.) and for applying the method selected (we take a close look at the implementation and the various options of handle_mm_fault in Section 4.11).

If the page is created successfully, the routine returns either vm_fault_minor (the data were already in memory) or vm_fault_major (the data had to be read from a block device). The kernel then updates the process statistics.
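A minimal sketch of this statistics update, assuming that the VM_FAULT_MAJOR flag in the return value marks faults that required reading from a block device:

        /* Account the fault in the task's statistics: a major fault needed
         * I/O from a block device, a minor fault was satisfied from memory
         * that was already present. */
        if (fault & VM_FAULT_MAJOR)
                tsk->maj_flt++;
        else
                tsk->min_flt++;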

However, faults may also occur when a page is created. If there is insufficient physical memory to load the page, the kernel forces termination of the process to at least keep the system running. If a permitted access to data fails for whatever reason — for instance, if a mapping is accessed but has been shrunk by another process in the meantime and is no longer present at the given address — the sigbus signal is sent to the process.
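The two labels referred to in this paragraph can be sketched as follows. This is a simplified illustration of the behavior just described, not the verbatim code of fault_32.c, which additionally takes care of locking, the init task, and further bookkeeping:

out_of_memory:
        /* Not enough physical memory to create the page: kill the faulting
         * process, unless the fault came from kernel mode (then no_context
         * applies). */
        printk("VM: killing process %s\n", tsk->comm);
        if (error_code & 4)                     /* fault raised in user mode */
                do_group_exit(SIGKILL);
        goto no_context;

do_sigbus:
        /* A permitted access failed nevertheless (e.g., the mapping has been
         * shrunk in the meantime): notify the process with SIGBUS. */
        if (!(error_code & 4))                  /* kernel mode: try a fixup */
                goto no_context;
        force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);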
