Race Conditions

Let us consider a system that reads data from an external device via two interfaces. Independent data packets arrive via both interfaces at irregular intervals and are saved in separate files. To log the order of arrival of the data packets, a number is added at the end of each filename to indicate the "serial number" of the packet. A typical sequence of filenames would be act1.fil, act2.fil, act3.fil, and so on. To simplify the work of the two processes (one per interface), a separate variable is held in a memory page shared by both; it specifies the next unused serial number (for the sake of simplicity, I refer to this variable as counter below).

When a packet arrives, the process must perform a few actions to save the data correctly:

1. It reads the data from the interface.

2. It opens a file with the serial number held in counter.

3. It increments counter by 1.

4. It writes the data to the file and closes it.
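The steps above can be sketched as follows. This is only an illustration under stated assumptions: the dictionary shared stands in for the shared memory page, the function name save_packet is hypothetical, and step 1 is assumed to have already delivered the packet as the data argument.

```python
import os
import tempfile

# Hypothetical stand-in for the shared memory page that holds counter.
shared = {"counter": 1}

def save_packet(data: bytes, directory: str) -> str:
    """Save one packet following steps 2 to 4.

    Step 1 (reading from the interface) is assumed to have produced `data`.
    """
    # Step 2: open a file named after the current serial number.
    path = os.path.join(directory, f"act{shared['counter']}.fil")
    with open(path, "wb") as f:
        # Step 3: increment the serial number.
        shared["counter"] += 1
        # Step 4: write the data; the file is closed on leaving the block.
        f.write(data)
    return path

# Usage: with a single caller, the procedure behaves as expected.
workdir = tempfile.mkdtemp()
first = save_packet(b"packet A", workdir)
second = save_packet(b"packet B", workdir)
```

As long as only one process runs this code, each packet gets its own file and counter always names the next unused serial number.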

Why should errors occur with this system? If each process strictly observes the above procedure and increments counter at the appropriate place, the procedure should obviously function correctly not just with two but with any number of processes.

As a matter of fact, it will function correctly in most cases — and this is where the real difficulty lies with distributed programming — but it won't in certain circumstances. Let us set a trap by calling the processes that read data from the interfaces process 1 and process 2:

Our scenario begins with a number of files to which a serial number has been added, say, 12 files in all. The value of counter is therefore 13. Obviously a bad omen ...

Process 1 receives data from the interface as a new block has just arrived. Dutifully it opens a file with the serial number 13 just at the moment when the scheduler is activated and decides that the process has had enough CPU time and must be replaced with another process — in this case, process 2. Note that at this time, process 1 has read but not yet incremented the value of counter.

Once process 2 has started to run, it too receives data from its interface and begins to perform the necessary actions to save these data. It reads the value of counter, increments it to 14, opens a file with serial number 13, writes the data to the file, and terminates.

Soon it's the turn of process 1 again. It resumes where it left off and increments the value of counter by 1, from 14 to 15. Then it writes its data to the previously opened file with serial number 13 — and, in doing so, overwrites the existing data of process 2.

This is a double mishap — a data record is lost, and serial number 14 is not used.
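The scenario can be replayed deterministically in a small simulation. The sketch below is not real multiprocess code: each process is modeled as a Python generator, every yield marks a point at which the scheduler might switch, and a dictionary named files stands in for the disk. All names are assumptions made for illustration.

```python
# Shared state: the next unused serial number, and a dict standing in
# for the disk (serial number -> file contents).
state = {"counter": 13}
files = {}

def handler(name):
    """One process saving one packet; each yield marks a point at which
    the scheduler could switch to the other process."""
    data = f"data from {name}"     # step 1: read from the interface
    serial = state["counter"]      # step 2: open the file with that serial
    files[serial] = ""             #         (opening creates or truncates it)
    yield
    state["counter"] += 1          # step 3: increment counter
    yield
    files[serial] = data           # step 4: write the data and close

p1 = handler("process 1")
p2 = handler("process 2")

next(p1)        # process 1 opens file 13 and is then preempted
for _ in p2:    # process 2 runs to completion: counter goes from 13 to 14
    pass
for _ in p1:    # process 1 resumes: counter goes from 14 to 15
    pass

print(state["counter"])   # 15
print(files)              # {13: 'data from process 1'}
```

Two packets arrived, yet only file 13 exists and it holds the data of process 1; serial number 14 was skipped.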

The program sequence could be modified to prevent this error by changing the individual steps after data have been received. For example, processes could increment the value of counter immediately after reading its value and before opening a file. However, closer examination of suggestions of this kind quickly leads to the conclusion that it is always possible to devise situations that result in a fatal error. Our own suggestion is no exception: an inconsistency arises if the scheduler is invoked between reading counter and incrementing its value.
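The remaining window in the reordered procedure can be shown with the same kind of simulation. Again, this is a sketch under assumptions: processes are modeled as generators, the single yield marks the one switch point of interest (between reading and incrementing counter), and files is a stand-in for the disk.

```python
state = {"counter": 13}
files = {}

def handler(name):
    """Reordered procedure: counter is incremented right after it is read."""
    data = f"data from {name}"      # read from the interface
    serial = state["counter"]       # read counter
    yield                           # the scheduler may still strike here
    state["counter"] = serial + 1   # store the incremented value
    files[serial] = data            # open, write, and close the file

p1 = handler("process 1")
p2 = handler("process 2")

next(p1)        # process 1 reads 13 and is preempted before incrementing
for _ in p2:    # process 2 reads 13, stores 14, writes file 13
    pass
for _ in p1:    # process 1 stores 14 again and overwrites file 13
    pass

print(state["counter"])   # 14
print(files)              # {13: 'data from process 1'}
```

Here both processes use serial number 13, counter advances only once for two packets, and the record written by process 2 is lost.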

Situations in which several processes interfere with each other when accessing resources are generally referred to as race conditions. Such conditions are a central problem in the programming of distributed applications because they cannot usually be detected by systematic trial and error. Instead, a thorough study of source code (coupled with intimate knowledge of the various paths that code can take) and a generous supply of intuition are needed to find and eliminate them.

Situations leading to race conditions are few and far between, which raises the question of whether it is worth making the sometimes very considerable effort to protect code against their occurrence.

In some environments (electronic aircraft control, monitoring of vital machinery, or dangerous equipment), race conditions may prove to be fatal in the literal sense of the word. But even in routine software projects, protection against potential race conditions is an important contribution to program quality and user satisfaction. As part of improved multiprocessor support in the Linux kernel, much effort has been invested in pinpointing areas where dangers lurk and in providing suitable protection. Unexpected system crashes and mysterious errors owing to lack of protection are simply unacceptable.
