## M

[Figure 3-1: Transistor switches and memory cells. Labels recovered from the diagram: Output, Memory Cell.]

In real life, a tiny handful of other components (typically diodes and capacitors) are necessary to make things work smoothly in a computer memory context. These are not necessarily little gizmos connected by wires to the outside of the transistor (although in early transistorized computers they were), but are now cut from the same silicon crystal the transistor itself is cut from, and occupy almost no space at all. Taken together, the transistor switch and its support components are called a memory cell. I've hidden the electrical complexity of the memory cell within an appropriate black-box symbol in Figure 3-1.

A memory cell keeps the current flow through it to a minimum, because electrical current flow produces heat, and heat is the enemy of electrical components. The memory cell's circuit is arranged so that if you put a tiny voltage on its input pin and a similar voltage on its select pin, a voltage will appear and remain on its output pin. That output voltage remains in its set state until you remove the voltage from the cell as a whole, or remove the voltage from the input pin while putting a voltage on the select pin.

The "on" voltage being applied to all of these pins is kept at a consistent level (except, of course, when it is removed entirely). In other words, you don't put 12 volts on the input pin and then change that to 6 volts or 17 volts. The computer designers pick a voltage and stick with it. The pattern is binary in nature: you either put a voltage on the input pin or you take the voltage away entirely. The output pin echoes that: it either holds a fixed voltage or no voltage at all.

We apply a code to that state of affairs: The presence of voltage indicates a binary 1, and the lack of voltage indicates a binary 0. This code is arbitrary. We could as well have said that the lack of voltage indicates a binary 1 and vice versa (and computers have been built this way for obscure reasons), but the choice is up to us. Having the presence of something indicate a binary 1 is more natural, and that is the way things have evolved in the computing mainstream.

A single computer memory cell, such as the transistor-based one we're speaking of here, holds one binary digit, either a 1 or a 0. This is called a bit. A bit is the indivisible atom of information. There is no half-a-bit, and no bit-and-a-half.

A bit is a single binary digit, either 1 or 0.
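The behavior of the memory cell described above can be sketched in a few lines of Python. This is only a toy model under my own assumptions: the class name `MemoryCell` and its method names are invented for illustration, and 1 and 0 stand in for "voltage" and "no voltage" on the pins.

```python
class MemoryCell:
    """Toy model of a one-bit memory cell (names are illustrative only)."""

    def __init__(self):
        self.output = 0  # no voltage on the output pin yet

    def apply(self, input_pin, select_pin):
        # The cell only pays attention to its input pin while select is raised.
        if select_pin:
            self.output = 1 if input_pin else 0
        return self.output

    def power_off(self):
        # Removing voltage from the cell as a whole loses the stored bit.
        self.output = 0


cell = MemoryCell()
cell.apply(input_pin=1, select_pin=1)   # input and select both raised: bit is set
cell.apply(input_pin=0, select_pin=0)   # not selected: the bit is held
held = cell.output
cell.apply(input_pin=0, select_pin=1)   # selected with no input voltage: bit is cleared
cleared = cell.output
print(held, cleared)
```

The point of the sketch is the middle step: once set, the output stays put until the cell is selected again or loses power.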

### The Incredible Shrinking Bit

One bit doesn't tell us much. To be useful, we need to bring a lot of memory cells together. Transistors started out fairly small (the originals from the 1950s looked a lot like stovepipe hats for tin soldiers) and went down from there. The first transistors were created from little chips of germanium or silicon crystal about one-eighth of an inch square. The size of the crystal chip hasn't changed outrageously since then, but the transistors themselves have shrunk almost incredibly.

Whereas in the beginning one chip held one transistor, in time semiconductor designers crisscrossed the chip into four equal areas and made each area an independent transistor. From there it was an easy jump to add the other minuscule components needed to turn a transistor into a computer memory cell.

The chip of silicon was a tiny and fragile thing, and was encased in an oblong, molded-plastic housing, like a small stick of gum with metal legs for the electrical connections.

What we had now was a sort of electrical egg carton: four little cubbyholes, each of which could contain a single binary bit. Then the shrinking process began. First 8 bits, then 16, then multiples of 8 and 16, all on the same tiny silicon chip. By the late 1960s, 256 memory cells could occupy one chip of silicon, usually in an array of 8 cells by 32. In 1976, my COSMAC ELF computer contained two memory chips. On each chip was an array of memory cells 4 wide and 256 long. (Picture a really long egg carton.) Each chip could thus hold 1,024 bits.

This was a pretty typical memory chip capacity at that time. We called them "1K RAM chips" because they held roughly 1,000 bits of random-access memory (RAM). The K comes from kilobit, that is, one thousand bits. We'll get back to the notion of what random access means shortly.

Toward the mid-1970s, the great memory-shrinking act was kicking into high gear. One-kilobit chips were crisscross divided into 4K chips containing 4,096 bits of memory. The 4K chips were almost immediately divided into 16K chips (16,384 bits of memory). These 16K chips were the standard when the IBM PC first appeared in 1981. By 1982, the chips had been divided once again, and 16K became 64K, with 65,536 bits inside that same little gum stick. Keep in mind that we're talking more than 65,000 transistors (plus other odd components) formed on a square of silicon about a quarter-inch on a side.

Come 1985 and the 64K chip had been pushed aside by its drawn-and-quartered child, the 256K chip (262,144 bits). Chips almost always increase in capacity by a factor of 4 simply because the current-generation chip is divided into 4 equal areas, onto each of which is then placed the same number of transistors that the previous generation of chip had held over the whole silicon chip.

By 1990, the 256K chip was history, and the 1-megabit chip was state of the art (the prefix mega denotes a million). By 1992, the 4-megabit chip had taken over. The critter had a grand total of 4,194,304 bits inside it, still no larger than that stick of cinnamon gum. About that time, the chips themselves grew small and fragile enough that four or eight of them were soldered to tiny printed circuit boards so that they would survive handling by clumsy human beings.
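The factor-of-4 progression is easy to check with a little arithmetic. The loop below is just a sketch: it starts from the 1,024-bit chip and quadruples the capacity once per generation. The variable names are mine.

```python
bits = 1024            # the 1K chip: 1,024 bits
generations = []
for _ in range(7):     # 1K, 4K, 16K, 64K, 256K, 1 megabit, 4 megabit
    generations.append(bits)
    bits *= 4          # each generation quadruples the previous one

for capacity in generations:
    print(f"{capacity:>9,} bits")
```

The last value printed is 4,194,304, the bit count quoted for the 4-megabit chip.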

The game has continued apace, and currently you can purchase these little plug-in circuit board memory modules with as much as two gigabytes in them—which is over sixteen billion bits.

Will it stop here? Unlikely. More is better, and we're bringing some staggeringly powerful technology to bear on the creation of ever-denser memory systems. Some physicists warn that the laws of physics may soon call a time-out in the game, as the transistors are now so small that it gets hard pushing more than one electron at a time through them. At that point, some truly ugly limitations of life called quantum mechanics begin to get in the way. We'll find a way around these limitations (we always do), but in the process the whole nature of computer memory may change.

### Random Access

Newcomers sometimes find "random" a perplexing and disturbing word with respect to memory, as random often connotes chaos or unpredictability. What the word really means here is "at random," indicating that you can reach into a random-access memory chip and pick out any of the bits it contains without disturbing any of the others, just as you might select one book at random from your public library's many shelves of thousands of books without sifting through them in order or disturbing the places of other books on the shelves.

Memory didn't always work this way. Before memory was placed on silicon chips, it was stored on electromagnetic machines of some kind, usually rotating magnetic drums or disks distantly related to the hard drives we use today. Rotating magnetic memory sends a circular collection of bits beneath a magnetic sensor. The bits pass beneath the sensor one at a time, and if you miss the one you want, like a Chicago bus in January, you simply have to wait for it to come by again. These are serial-access devices. They present their bits to you serially, in a fixed order, one at a time, and you have to wait for the one you want to come up in its order.

There's no need to remember that; we've long since abandoned serial-access devices for main computer memory. We still use such systems for mass storage, as I describe a bit later. Your hard drive is at its heart a serial-access device.

Random access works like this: inside the chip, each bit is stored in its own memory cell, identical to the memory cell diagrammed in Figure 3-1. Each of the however-many memory cells has a unique number. This number is a cell's (and hence a bit's) address. It's like the addresses on a street: the bit on the corner is number 0 Silicon Alley, and the bit next door is number 1, and so on. You don't have to knock on the door of bit 0 and ask which bit it is, and then go to the next door and ask there too, until you find the bit you want. If you have the address, you can zip right down the street and park square in front of the bit you intend to visit.

Each chip has a number of pins coming out of it. The bulk of these pins are called address pins. One pin is called a data pin (see Figure 3-2). The address pins are electrical leads that carry a binary address code. This address is a binary number, expressed in 1s and 0s only. You apply this address to the address pins by encoding a binary 1 as (let's say) 5 volts, and a binary 0 as 0 volts. Many other voltages have been used and are still used in computer hardware. What matters is that we all agree that a certain voltage on a pin represents a binary 1. Special circuits inside the RAM chip decode this address to one of the select inputs of the numerous memory cells inside the chip. For any given address applied to the address pins, only one select input will be raised to five volts, thereby selecting that memory cell.

Depending on whether you intend to read a bit or write a bit, the data pin is switched between the memory cells' inputs or outputs, as shown in Figure 3-2.

Figure 3-2: A RAM chip

That's all done internally to the chip. As far as you, on the outside, are concerned, once you've applied the address to the address pins, voila! The data pin will contain a voltage representing the value of the bit you requested. If that bit contained a binary 1, the data pin will contain a 5-volt signal; otherwise, the binary 0 bit will be represented by 0 volts.
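To make the address-decoding idea concrete, here is a minimal Python sketch of a one-bit-wide RAM chip. Everything about it is an assumption made for illustration: the class name `RamChip`, the four address pins, and the use of a text string such as `"0101"` to stand for the voltages on the address pins.

```python
class RamChip:
    """Toy one-bit-wide RAM chip (hypothetical; four address pins in the demo)."""

    def __init__(self, address_pins):
        self.address_pins = address_pins
        self.cells = [0] * (2 ** address_pins)   # one memory cell per address

    def decode(self, address_bits):
        # address_bits holds one character per address pin, e.g. "0101".
        if len(address_bits) != self.address_pins:
            raise ValueError("wrong number of address bits")
        selected = int(address_bits, 2)
        # Exactly one select line is raised; all the others stay at 0.
        return [1 if i == selected else 0 for i in range(len(self.cells))]

    def write(self, address_bits, bit):
        select_lines = self.decode(address_bits)
        self.cells[select_lines.index(1)] = bit

    def read(self, address_bits):
        select_lines = self.decode(address_bits)
        return self.cells[select_lines.index(1)]


chip = RamChip(address_pins=4)       # 16 cells, addresses 0000 through 1111
chip.write("0101", 1)
select = chip.decode("0101")
print(select.index(1), chip.read("0101"), chip.read("0100"))
```

A real chip does the decoding with dedicated circuitry rather than a list search, of course. The sketch only shows that one address raises exactly one select line.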

### Memory Access Time

Chips are graded by how long it takes for the data to appear on the data pin after you've applied the address to the address pins. Obviously, the faster the better, but some chips (for electrical reasons that again are difficult to explain) are faster than others.

The time values are so small as to seem almost insignificant: 30 nanoseconds is a typical memory chip access time. A nanosecond is a billionth of a second, so 30 nanoseconds is significantly less than one 10-millionth of a second. Great stuff—but to accomplish anything useful, a computer needs to access memory hundreds of thousands, millions, or (in most cases) billions of times. Those nanoseconds add up. If you become an expert assembly language programmer, you will jump through hoops to shave the number of memory accesses your program needs to perform, because memory access is the ultimate limiting factor in a computer's performance. Assembly language expert Michael Abrash, in fact, has published several books on doing exactly that, mostly in the realm of high-speed graphics programming. The gist of his advice can be (badly) summarized in just a few words: Stay out of memory whenever you can! (You'll soon discover just how difficult this is.)
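Here is a rough back-of-the-envelope calculation of how those nanoseconds add up. It assumes, purely for illustration, that every access takes the full 30 nanoseconds and that accesses happen one after another. Real systems overlap and shortcut memory accesses in many ways, so treat the numbers as an upper-bound sketch.

```python
ACCESS_TIME_NS = 30   # the typical access time quoted above

def total_seconds(accesses, access_time_ns=ACCESS_TIME_NS):
    # One second is 1,000,000,000 nanoseconds.
    return accesses * access_time_ns / 1_000_000_000

for accesses in (100_000, 1_000_000, 1_000_000_000):
    print(f"{accesses:>13,} accesses: {total_seconds(accesses):.3f} s")
```

Under these assumptions, a billion accesses cost 30 seconds of waiting on memory.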

### Bytes, Words, Double Words, and Quad Words

The days are long gone (decades gone, in fact) when a serious computer could be made with only one memory chip. My poor 1976 COSMAC ELF needed at least two. Today's computers need many, irrespective of the fact that today's memory chips can hold a billion bits or more, rather than the ELF's meager 2,048 bits. Understanding how a computer gathers its memory chips together into a coherent memory system is critical when you wish to write efficient assembly language programs. Although there are infinite ways to hook memory chips together, the system I describe here is that of the Intel-based PC-compatible computer, which has ruled the world of desktop computing since 1982.

Our memory system must store our information. How we organize a memory system out of a hatful of memory chips will be dictated largely by how we organize our information.

The answer begins with this thing called a byte. The fact that the granddaddy of all computer magazines took this word for its title indicates its importance in the computer scheme of things. (Alas, Byte magazine ceased publishing late in 1998.) From a functional perspective, memory is measured in bytes. A byte is eight bits. Two bytes side by side are called a word, and two words side by side are called a double word. A quad word, as you might imagine, consists of two double words, for four words or eight bytes in all. Going in the other direction, some people refer to a group of four bits as a nybble—a nybble being somewhat smaller than a byte. (This term is now rare and becoming rarer.)

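Here's the quick tour:

■ A bit is a single binary digit, 0 or 1.

■ A byte is 8 bits side by side.

■ A word is 2 bytes side by side.

■ A double word is 2 words side by side.

■ A quad word is 2 double words side by side.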
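The units described above can be tabulated with a few lines of Python. The dictionary and its names are my own. The sizes come straight from the definitions in the text, and the largest unsigned value for each unit is simply 2 raised to the number of bits, minus 1.

```python
BITS_PER_BYTE = 8

sizes_in_bytes = {
    "byte": 1,
    "word": 2,          # 2 bytes
    "double word": 4,   # 2 words
    "quad word": 8,     # 2 double words
}

for name, size in sizes_in_bytes.items():
    bits = size * BITS_PER_BYTE
    print(f"{name:<12} {size} byte(s) = {bits:>2} bits, largest value {2**bits - 1:,}")
```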

Computers were designed to store and manipulate human information. The basic elements of human discourse are built from a set of symbols consisting of letters of the alphabet (two of each, for uppercase and lowercase), numbers, and symbols, including commas, colons, periods, and exclamation marks. Add to these the various international variations on letters such as ä and ö plus the more arcane mathematical symbols, and you'll find that human information requires a symbol set of well over 200 symbols. (The symbol set used in all PC-style computers is provided in Appendix B.)

Bytes are central to the scheme because one symbol out of that symbol set can be neatly expressed in one byte. A byte is 8 bits, and 2⁸ (two to the eighth power) is 256. This means that a binary number 8 bits in size can be one of 256 different values, numbered from 0 to 255. Because we use these symbols so much, most of what we do in computer programs is done in byte-size chunks. In fact, except for the very odd and specialized kind of computers we are now building into intelligent food processors, no computer processes information in chunks smaller than 1 byte. Most computers today, in fact, process information one double word (four bytes, or 32 bits) at a time. Since 2003, PC-compatible computers have been available that process information one quad word (64 bits) at a time.
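A quick sketch shows the 256-value arithmetic and how one symbol maps onto one byte. Note the assumption: Python's `ord()` returns a Unicode code point, which for a plain letter such as "A" matches the familiar ASCII code, but I have not checked it against the full PC symbol set in Appendix B.

```python
values = 2 ** 8                  # number of distinct 8-bit patterns
letter = "A"
code = ord(letter)               # numeric code for the symbol
pattern = format(code, "08b")    # the same code written as eight bits

print(values, code, pattern)
```

Every value from 0 through 255 fits in those eight bits, and no value beyond 255 does.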

### Pretty Chips All in a Row

One of the more perplexing things for beginners to understand is that a single RAM chip does not even contain 1 byte, though it might contain half a billion bits. The bulk of the individual RAM chips that we use today have no more than four data pins, and some only one data pin. Whole memory systems are created by combining individual memory chips in clever ways.

A simple example will help illustrate this. Consider Figure 3-3. I've drawn a memory system that distributes a single stored byte across eight separate RAM chips. Each of the black rectangles represents a RAM chip like the one shown in Figure 3-2. There is one bit from the byte stored within each of the eight chips, at the same address across all eight chips. The 20 address pins for all eight chips are connected together, "in parallel" as an electrician might say. When the computer applies a memory address to the 20 address lines, the address appears simultaneously on the address pins of all eight memory chips in the memory system. Each chip then delivers its one bit, so all eight bits of the byte arrive together on the eight data lines.
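The eight-chip arrangement can be sketched in Python as follows. This is a toy model with invented names: each "chip" is a dictionary holding one bit per address, and the choice of which chip holds which bit of the byte is my assumption, since the text doesn't specify one.

```python
NUM_CHIPS = 8

# Each "chip" is modeled as a dict mapping an address to one stored bit.
chips = [dict() for _ in range(NUM_CHIPS)]

def write_byte(address, value):
    # The same address goes to every chip; chip n stores bit n of the byte.
    for n, chip in enumerate(chips):
        chip[address] = (value >> n) & 1

def read_byte(address):
    # Reassemble the byte from the eight data lines, one bit per chip.
    value = 0
    for n, chip in enumerate(chips):
        value |= chip.get(address, 0) << n
    return value

write_byte(0x12345, 0xB6)          # a 20-bit address and a one-byte value
print([chip[0x12345] for chip in chips], hex(read_byte(0x12345)))
```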

In the real world, such simple memory systems no longer exist, and there are many different ways of distributing chips (and their stored bits) across a memory system. Most memory chips today do in fact store more than one bit at each address. Chips storing 1, 2, 3, 4, or 8 bits per address are relatively common. How to design a fast and efficient computer memory system is an entire subdiscipline within electrical engineering, and as our memory chips are improved to contain more and more memory cells, the "best" way to design a physical memory system changes.

It's been a long time, after all, since we've had to plug individual memory chips into our computers. Today, memory chips are nearly always gathered together into plug-in Dual Inline Memory Modules (DIMMs) of various capacities. These modules are little green-colored circuit boards about 5 inches long and 1 inch high. As of 2009, all desktop PC-compatible computers use such modules, generally in pairs. Each module typically stores 32 bits at each memory address (often, but not always, in eight individual memory chips, each chip storing four bits at each memory address) and a pair of modules acting together stores 64 bits at each memory address. The number of memory locations within each module varies, but the capacity is commonly 512 megabytes (MB), or 1 or 2 gigabytes (GB). (I will use the abbreviations MB and GB from now on.)

It's important to note that the way memory chips are combined into a memory system does not affect the way your programs operate. When a program that you've written accesses a byte of memory at a particular address, the computer takes care of fetching it from the appropriate place in that jungle of chips and circuit boards. One memory system arranged a certain way might bring the data back from memory faster than another memory system arranged a different way, but the addresses are the same, and the data is the same. From the point of view of your program, there is no functional difference.

To summarize: electrically, your computer's memory consists of one or more rows of memory chips, with each chip containing a large number of memory cells made out of transistors and other minuscule electrical components. Most of the time, to avoid confusion it's just as useful to forget about the transistors and even the rows of physical chips. (My high school computer science teacher was not entirely wrong but he was right for the wrong reasons.)

Over the years, memory systems have been accessed in different ways. Eight-bit computers (now ancient and almost extinct) accessed memory 8 bits (one byte) at a time. Sixteen-bit computers access memory 16 bits (one word) at a time. Today's mainstream 32-bit computers access memory 32 bits (one double word) at a time. Upscale computers based on newer 64-bit processors access memory 64 bits (one quad word) at a time. This can be confusing, so it's better in most cases to envision a very long row of byte-size containers, each with its own unique address. Don't assume that in computers that process information one word at a time, only words have addresses. It's a convention within the PC architecture that every byte has its own unique numeric address, irrespective of how many bytes are pulled from memory in one operation.

Every byte of memory in the computer has its own unique address, even in computers that process 2, 4, or 8 bytes of information at a time.

If this seems counterintuitive, yet another metaphor will help: when you go to the library to take out the three volumes of Tolkien's massive fantasy The Lord of the Rings, each of the three volumes has its own catalog number (essentially that volume's address in the library) but you take all three down at once and process them as a single entity. If you really want to, you can check only one of the books out of the library at a time, but doing so will require two more trips to the library later to get the other two volumes, which is a waste of your time and effort.

So it is with 32-bit or 64-bit computers. Every byte has its own address, but when a 32-bit computer accesses a byte, it actually reads 4 bytes starting at the address of the requested byte. You can use the remaining 3 bytes or ignore them if you don't need them—but if you later decide that you do need the other three bytes, you'll have to access memory again to get them. Best to save time and get it all at one swoop.
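A short sketch may help here. It models memory as a row of byte-size containers, each with its own address, and shows a four-byte read that starts at a requested byte's address. The function names and the sample contents are mine, chosen only so that the values are easy to recognize.

```python
memory = bytes(range(0x10, 0x20))   # 16 bytes holding 10H, 11H, ... 1FH

def read_byte(address):
    # Every byte has its own address.
    return memory[address]

def read_double_word(address):
    # Four consecutive bytes, starting at the requested address.
    return memory[address:address + 4]

chunk = read_double_word(4)
print(hex(read_byte(4)), chunk.hex())
```

The first byte of the chunk is the byte that was asked for. The other three come along for the ride and are there if you need them.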

### The Shop Foreman and the Assembly Line

All of this talk about reading things from memory and writing things to memory has thus far carefully skirted the question of who is doing the reading and writing. The who is almost always a single chip, and a remarkable chip it is, too: the central processing unit, or CPU. If you are the president and CEO of your personal computer, the CPU is your shop foreman, who sees that your orders are carried out down among the chips, where the work gets done.

Some would say that the CPU is what actually does the work, but while largely true, it's an oversimplification. Plenty of real work is done in the memory system, and in what are called peripherals, such as video display boards, USB and network ports, and so on. So, while the CPU does do a good deal of the work, it also parcels out quite a bit to other components within the computer, largely to enable itself to do a lot more quickly what it does best. Like any good manager, the foreman delegates to other computer subsystems whatever it can.

Most of the CPU chips used in the machines we lump together as a group and call PCs were designed by a company called Intel, which pretty much invented the single-chip CPU way back in the early 1970s. Intel CPUs have evolved briskly since then, as I'll describe a little later in this chapter. There have been many changes in the details over the years, but from a height, what any Intel or Intel-compatible CPU does is largely the same.

### Talking to Memory

The CPU chip's most important job is to communicate with the computer's memory system. Like a memory chip, a CPU chip is a small square of silicon onto which a great many transistors—today, hundreds of millions of them!—have been placed. The fragile silicon chip is encased in a plastic or ceramic housing with a large number of electrical connection pins protruding from it. Like the pins of memory chips, the CPU's pins transfer information encoded as voltage levels, typically 3 to 5 volts. Five volts on a pin indicate a binary 1, and 0 volts on a pin indicate a binary 0.

Like memory chips, the CPU chip has a number of pins devoted to memory addresses, and these pins are connected to the computer's system of memory chips. I've drawn this in Figure 3-4, and the memory system to the left of the CPU chip is the same one that appears in Figure 3-3, just tipped on its side. When the CPU needs to read a byte (or a word, double word, or quad word) from memory, it places the memory address of the byte to be read on its address pins, encoded as a binary number. Some few nanoseconds later, the requested byte appears (also as a binary number) on the data pins of the memory chips. The CPU chip also has data pins, and it slurps up the byte presented by the memory chips through its own data pins.

The process, of course, also works in reverse: to write a byte into memory, the CPU first places the memory address where it wants to write onto its address pins. Some number of nanoseconds later (which varies from system to system depending on general system speed and how memory is arranged) the CPU places the byte it wants to write into memory on its data pins. The memory system obediently stores the byte inside itself at the requested address.

[Figure 3-4: The CPU and memory. Labels recovered from the diagram: CPU Chip, Data Lines.]

Figure 3-4 is, of course, purely conceptual. Modern memory systems are a great deal more complex than what is shown, but in essence they all work the same way: the CPU passes an address to the memory system, and the memory system either accepts data from the CPU for storage at that address or places the data found at that address on the computer's data bus for the CPU to process.

### Riding the Data Bus

This give-and-take between the CPU and the memory system represents the bulk of what happens inside your computer. Information flows from memory into the CPU and back again. Information flows in other paths as well. Your computer contains additional devices called peripherals that are either sources or destinations (or both) for information.

Video display boards, disk drives, USB ports, and network ports are the most common peripherals in PC-type computers. Like the CPU and memory, they are all ultimately electrical devices. Most modern peripherals consist of one or two large chips and perhaps a couple of smaller chips that support the larger chips. Like both the CPU chip and memory chips, these peripheral devices have both address pins and data pins. Some peripherals, graphics boards in particular, have their own memory chips, and these days their own dedicated CPUs. (Your modern high-performance graphics board is a high-powered computer in its own right, albeit one with a very specific and limited mission.)

Peripherals "talk" to the CPU (that is, they pass the CPU data or take data from the CPU) and sometimes to one another. These conversations take place across the electrical connections linking the address pins and data pins that all devices in the computer have in common. These electrical lines are called a data bus and they form a sort of party line linking the CPU with all other parts of the computer. An elaborate system of electrical arbitration determines when and in what order the different devices can use this party line to talk with one another, but it happens in generally the same way: an address is placed on the bus, followed by some data. (How much data moves at once depends on the peripherals involved.) Special signals go out on the bus with the address to indicate whether the address represents a location in memory or one of the peripherals attached to the data bus. The address of a peripheral is called an I/O address to differentiate between it and a memory address such as those we've been discussing all along.
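The distinction between a memory address and an I/O address can be sketched like this. The `Bus` class, its method names, and the `is_io` flag are all hypothetical. The flag stands in for the special signal that travels with the address, and the addresses and values are arbitrary.

```python
class Bus:
    """Toy bus: one set of address and data lines shared by memory and peripherals."""

    def __init__(self):
        self.memory = {}   # memory address -> byte
        self.ports = {}    # I/O address -> byte

    def write(self, address, value, is_io=False):
        # is_io models the signal saying "this address names a peripheral".
        target = self.ports if is_io else self.memory
        target[address] = value & 0xFF

    def read(self, address, is_io=False):
        target = self.ports if is_io else self.memory
        return target.get(address, 0)


bus = Bus()
bus.write(0x60, 0x41)                # memory address 60H
bus.write(0x60, 0x1C, is_io=True)    # I/O address 60H: a different destination
print(hex(bus.read(0x60)), hex(bus.read(0x60, is_io=True)))
```

The same number on the address lines reaches two different places, depending only on the accompanying signal.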

The data bus is the major element in the expansion slots present in most PC-type computers, and many peripherals (especially graphics adapters) are printed circuit boards that plug into these slots. The peripherals talk to the CPU and to memory through the data bus connections implemented as electrical pins in the expansion slots.

As convenient as expansion slots are, they introduce delays into a computer system. Increasingly, as time passes, peripherals are simply a couple of chips on one corner of the main circuit board (the motherboard) inside the computer.

### The Foreman's Pockets

Every CPU contains a very few data storage cubbyholes called registers. These registers are at once the foreman's pockets and the foreman's workbench. When the CPU needs a place to tuck something away for a short while, an empty register is just the place. The CPU could always store the data out in memory, but that takes considerably more time than tucking the data in a register. Because the registers are actually inside the CPU, placing data in a register or reading it back again from a register is fast.

More important, registers are the foreman's workbench. When the CPU needs to add two numbers, the easiest and fastest way is to place the numbers in two registers and add the two registers together. The sum (in usual CPU practice) replaces one of the two original numbers that were added, but after that the sum could then be placed in yet another register, or added to still another number in another register, or stored out in memory, or take part in any of a multitude of other operations.

The CPU's immediate work-in-progress is held in temporary storage containers called registers.

Work involving registers is always fast, because the registers are within the CPU and are specially connected to one another and to the CPU's internal machinery. Very little movement of data is necessary—and what data does move doesn't have to move very far.
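As a rough illustration of registers as a workbench, the sketch below adds the contents of two registers and leaves the sum in one of them. It is a toy model only: the registers are entries in a Python dictionary, EAX and EBX are used simply as two familiar x86 register names, and the `add` function is my own stand-in for what the CPU does in hardware.

```python
registers = {"EAX": 0, "EBX": 0}

def add(dest, src):
    # The sum replaces the destination operand; the source is left unchanged.
    # Results wrap at 32 bits, because these registers are 32 bits wide.
    registers[dest] = (registers[dest] + registers[src]) & 0xFFFFFFFF

registers["EAX"] = 5
registers["EBX"] = 7
add("EAX", "EBX")
print(registers)
```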

Like memory cells and, indeed, like the entire CPU, registers are made out of transistors; but rather than having numeric addresses, registers have individual names such as EAX or EDI. To make matters even more complicated, while all CPU registers have certain common properties, some registers have unique special powers not shared by other registers. Understanding the behaviors and the limitations of CPU registers is something like following the Middle East peace process: There are partnerships, alliances, and always a bewildering array of secret agendas that each register follows. There's no general system describing such things; like irregular verbs in Spanish, you simply have to memorize them.

Most peripherals also have registers, and peripheral registers are even more limited in scope than CPU registers. Their agendas are quite explicit and in no wise secret. This does not prevent them from being confusing, as anyone who has tried programming a graphics board at the register level will attest. Fortunately, these days nearly all communication with peripheral devices is handled by the operating system, as I'll explain in the next chapter.

### The Assembly Line

If the CPU is the shop foreman, then the peripherals are the assembly-line workers, and the data bus is the assembly line itself. (Unlike most assembly lines, however, the foreman works the line much harder than the rest of his crew!)

As an example: information enters the computer through a network port peripheral, which assembles bits received from a computer network cable into bytes of data representing characters and numbers. The network port then places the assembled byte onto the data bus, from which the CPU picks it up, tallies it or processes it in other ways, and then places it back on the data bus. The display board then retrieves the byte from the data bus and writes it into video memory so that you can see it on your screen.

This is a severely simplified description, but obviously a lot is going on inside the box. Continuous furious communication along the data bus between CPU, memory, and peripherals is what accomplishes the work that the computer does. The question then arises: who tells the foreman and crew what to do? You do. How do you do that? You write a program. Where is the program? It's in memory, along with all the rest of the data stored in memory. In fact, the program is data, and that is the heart of the whole idea of programming as we know it.

### The Box That Follows a Plan

Finally, we come to the essence of computing: the nature of programs and how they direct the CPU to control the computer and get your work done.

We've seen how memory can be used to store bytes of information. These bytes are all binary codes, patterns of 1 and 0 bits stored as minute electrical voltage levels and collectively making up binary numbers. We've also spoken of symbols, and how certain binary codes may be interpreted as meaning something to us human beings, things such as letters, digits, punctuation, and so on.

Just as the alphabet and the numeric digits represent a set of codes and symbols that mean something to us humans, there is a set of codes that mean something to the CPU. These codes are called machine instructions, and their name is evocative of what they actually are: instructions to the CPU. When the CPU is executing a program, it picks a sequence of numbers off the data bus, one at a time. Each number tells the CPU to do something. The CPU knows how. When it completes executing one instruction, it picks the next one up and executes that. It continues doing so until something (a command in the program, or electrical signals such as a reset button) tells it to stop.

Let's take an example or two that are common to all modern IA-32 CPU chips from Intel. The 8-bit binary code 01000000 (40H) means something to the CPU. It is an order: Add 1 to register AX and put the sum back in AX. That's about as simple as they get. Most machine instructions occupy more than a single byte. Many are 2 bytes in length, and very many more are 4 bytes in length. The binary codes 10110110 01110011 (0B6H 073H) comprise another order: Load the value 73H into register DH. On the other end of the spectrum, the binary codes 11110011 10100100 (0F3H 0A4H) direct the CPU to do the following (take a deep breath): Begin moving the number of bytes specified in register CX from the 32-bit address stored in registers DS and SI to the 32-bit address stored in registers ES and DI, updating the address in both SI and DI after moving each byte, and decreasing CX by one each time, and finally stopping when CX becomes zero.
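If you'd like to check the correspondence between the hexadecimal and binary forms of these instruction bytes, a few lines of Python will do it. The dictionary keys are just my shorthand descriptions of the three instructions, not official names.

```python
instructions = {
    "add 1 to AX": bytes.fromhex("40"),
    "load 73H into DH": bytes.fromhex("B673"),
    "repeated byte move": bytes.fromhex("F3A4"),
}

def as_binary(code):
    # Write each byte as eight binary digits, separated by spaces.
    return " ".join(format(byte, "08b") for byte in code)

for meaning, code in instructions.items():
    print(f"{code.hex().upper():<6} {as_binary(code):<20} {meaning}")
```

To the computer, each of these is nothing more than one or two bytes sitting in memory. The meaning lies entirely in how the CPU treats them.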

You don't have to remember all the details of those particular instructions right now; I'll come back to machine instructions in later chapters. The rest of the several hundred instructions understood by the Intel IA-32 CPUs fall somewhere in between these extremes in terms of complication and power. There are instructions that perform arithmetic operations (addition, subtraction, multiplication, and division) and logical operations (AND, OR, XOR, and so on), and instructions that move information around memory. Some instructions serve to ''steer'' the path that program execution takes within the logic of the program being executed. Some instructions have highly arcane functions and don't turn up very often outside of operating system internals. The important thing to remember right now is that each instruction tells the CPU to perform one generally small and limited task. Many instructions handed to the CPU in sequence direct the CPU to perform far more complicated tasks. Writing that sequence of instructions is what assembly language programming actually is.

### Fetch and Execute

A computer program is nothing more than a table of these machine instructions stored in memory. There's nothing special about the table, nor about where it is positioned in memory. It could be almost anywhere, and the bytes in the table are nothing more than binary numbers.

The binary numbers comprising a computer program are special only in the way that the CPU treats them. When a modern 32-bit CPU begins running, it fetches a double word from an agreed-upon address in memory. (How this starting address is agreed upon doesn't matter right now.) This double word, consisting of 4 bytes in a row, is read from memory and loaded into the CPU. The CPU examines the pattern of binary bits contained in the double word, and then begins performing the task that the fetched machine instruction directs it to do.

Ancient 8088-based 8-bit machines such as the original IBM PC only fetched one byte at a time, rather than the four bytes that 32-bit Pentium-class machines fetch. Because most machine instructions are more than a single byte in size, the 8088 CPU had to return to memory to fetch a second (or a third or a fourth) byte to complete the machine instruction before it could actually begin to obey the instruction and begin performing the task it represented.

As soon as it finishes executing an instruction, the CPU goes out to memory and fetches the next machine instruction in sequence. Inside the CPU is a special register called the instruction pointer that quite literally contains the address of the next instruction to be fetched from memory and executed. Each time an instruction is completed, the instruction pointer is updated to point to the next instruction in memory. (There is some silicon magic afoot inside modern CPUs that ''guesses'' what's to be fetched next and keeps it on a side shelf so it will be there when fetched, only much more quickly—but the process as I've described it is true in terms of the outcome.)

All of this is done literally like clockwork. The computer has an electrical subsystem called a system clock, which is actually an oscillator that emits square-wave pulses at very precise intervals. The immense number of microscopic transistor switches inside the CPU coordinate their actions according to the pulses generated by the system clock. In years past, it often took several clock cycles (basically, pulses from the clock) to execute a single instruction. As computers became faster, the majority of machine instructions executed in a single clock cycle. Modern CPUs can execute instructions in parallel, so multiple instructions can often execute in a single clock cycle.

So the process goes: fetch and execute; fetch and execute. The CPU works its way through memory, with the instruction pointer register leading the way. As it goes, it works: moving data around in memory, moving values around in registers, passing data to peripherals, crunching data in arithmetic or logical operations.

Computer programs are lists of binary machine instructions stored in memory. They are no different from any other list of data bytes stored in memory except in how they are interpreted when fetched by the CPU.
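The fetch-and-execute cycle is easy to mimic in a few lines of Python. What follows is a toy, not a model of a real CPU: it understands only three opcodes, treats AX as a 16-bit register, and ignores everything else an IA-32 chip does. The opcodes 40H and 0B6H come from the examples earlier in this chapter; 0F4H (the halt instruction on real x86 chips) is my addition so that the loop has a way to stop. The function name `run` is hypothetical.

```python
def run(memory: bytes, start: int = 0) -> dict:
    """Fetch and execute a tiny subset of x86 machine instructions."""
    regs = {"AX": 0, "DH": 0}
    ip = start                          # the instruction pointer
    while True:
        opcode = memory[ip]             # fetch the next instruction byte
        if opcode == 0x40:              # INC AX: add 1 to AX
            regs["AX"] = (regs["AX"] + 1) & 0xFFFF
            ip += 1
        elif opcode == 0xB6:            # MOV DH, imm8: operand byte follows
            regs["DH"] = memory[ip + 1]
            ip += 2
        elif opcode == 0xF4:            # HLT: stop (assumption, see above)
            break
        else:
            raise ValueError(f"opcode {opcode:02X}H not in this toy")
    regs["IP"] = ip
    return regs


# The "program" is nothing but a table of bytes in memory.
program = bytes([0x40, 0x40, 0xB6, 0x73, 0x40, 0xF4])
print(run(program))   # {'AX': 3, 'DH': 115, 'IP': 5}
```

Notice that the byte 73H in the middle of the table is data to one instruction, yet nothing marks it as such. Only its position relative to the instruction pointer decides how it is interpreted.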

### The Foreman's Innards

I made the point earlier that machine instructions are binary codes. This is something we often gloss over, yet to understand the true nature of the CPU, we have to step away from the persistent image of machine instructions as numbers. They are not numbers. They are binary patterns designed to throw electrical switches.

Inside the CPU are a very large number of transistors. (The Intel Core 2 Quad that I have on my desk contains 582 million transistors, and CPU chips with over a billion transistors are now in limited use.) Some small number of those transistors go into making up the foreman's pockets: machine registers for holding information. A significant number of transistors go into making up short-term storage called cache that I'll describe later. (For now, think of cache as a small set of storage shelves always right there at the foreman's elbow, making it unnecessary for the foreman to cross the room to get more materials.) The vast majority of those transistors, however, are switches connected to other switches, which are connected to still more switches in a mind-numbingly complex network.

The extremely simple machine instruction 01000000 (40H) directs the CPU to add 1 to the value stored in register AX, with the sum placed back in AX. When considering the true nature of computers, it's very instructive to think about the execution of machine instruction 01000000 in this way.

The CPU fetches a byte from memory. This byte contains the binary code 01000000. Once the byte is fully within the CPU, the CPU in essence lets the machine instruction byte push eight transistor switches. The lone 1 digit pushes its switch ''up'' electrically; the rest of the digits, all 0s, push their switches ''down.''

In a chain reaction, those eight switches flip the states of first dozens, then hundreds, then thousands, and in some cases tens of thousands of tiny transistor switches within the CPU. It isn't random: this furious nanomoment of electrical activity within the CPU operates utterly according to patterns etched into the silicon of the CPU by Intel's teams of engineers. Ultimately, perhaps after many thousands of individual switch throws, the value contained in register AX is suddenly one greater than it was before.

How this happens is difficult to explain, but you must remember that any number within the CPU can also be looked upon as a binary code, including values stored in registers. Also, most switches within the CPU contain more than one handle. These switches, called gates, work according to the rules of logic. Perhaps two, or three, or even more ''up'' switch throws have to arrive at a particular gate at the same time in order for one ''down'' switch throw to pass through that gate.

These gates are used to build complex internal machinery within the CPU. Collections of gates can add two numbers in a device called an adder, which again is nothing more than a crew of dozens of little switches working together first as gates and then as gates working together to form an adder.

As part of the cavalcade of switch throws kicked off by the binary code 01000000, the value in register AX was dumped trapdoor-style into an adder, while at the same time the number 1 was fed into the other end of the adder. Finally, rising on a wave of switch throws, the new sum emerges from the adder and ascends back into register AX—and the job is done.
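To make the idea of gates cooperating as an adder a little less abstract, here is a sketch in Python. It builds a one-bit full adder from AND, OR, and XOR operations and chains sixteen of them together, which is the textbook ripple-carry arrangement. I am not claiming this is how Intel wires its adders (real designs use faster schemes), and the function names are my own.

```python
def AND(a: int, b: int) -> int:
    return a & b


def OR(a: int, b: int) -> int:
    return a | b


def XOR(a: int, b: int) -> int:
    return a ^ b


def full_adder(a: int, b: int, carry_in: int) -> tuple:
    """Add three single bits using only gates; return (sum_bit, carry_out)."""
    partial = XOR(a, b)
    sum_bit = XOR(partial, carry_in)
    carry_out = OR(AND(a, b), AND(partial, carry_in))
    return sum_bit, carry_out


def add16(x: int, y: int) -> int:
    """Ripple-carry addition of two 16-bit values, one bit position at a time."""
    result = 0
    carry = 0
    for i in range(16):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result          # the carry out of the top bit is discarded


print(add16(41, 1))        # 42
print(add16(0xFFFF, 1))    # 0: a 16-bit register wraps around
```

Feeding the value of AX into one input and the number 1 into the other is all that "add 1 to AX" amounts to at this level.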

The foreman of your computer, then, is made of switches—just like all the other parts of the computer. It contains a mind-boggling number of such switches, interconnected in even more mind-boggling ways. The important thing is that whether you are boggled or (like me on off-days) merely jaded by it all, the CPU, and ultimately the computer, does exactly what we tell it to do. We set up a list of machine instructions as a table in memory, and then, by golly, that mute silicon brick comes alive and starts earning its keep.

### Changing Course

The first piece of genuine magic in the nature of computers is that a string of binary codes in memory tells the computer what to do, step by step. The second piece of that magic is really the jewel in the crown: There are machine instructions that change the order in which machine instructions are fetched and executed.

In other words, once the CPU has executed a machine instruction that does something useful, the next machine instruction may tell the CPU to go back and play it again—and again, and again, as many times as necessary. The CPU can keep count of the number of times that it has executed that particular instruction or list of instructions and keep repeating them until a prearranged count has been met. Alternately, it can arrange to skip certain sequences of machine instructions entirely if they don't need to be executed at all.

What this means is that the list of machine instructions in memory does not necessarily begin at the top and run without deviation to the bottom. The CPU can execute the first fifty or a hundred or a thousand instructions, then jump to the end of the program—or jump back to the start and begin again. It can skip and bounce up and down the list smoothly and at great speed. It can execute a few instructions up here, then zip down somewhere else and execute a few more instructions, then zip back and pick up where it left off, all without missing a beat or even wasting too much time.

How is this done? Recall that the CPU includes a special register that always contains the address of the next instruction to be executed. This register, the instruction pointer, is not essentially different from any of the other registers in the CPU. Just as a machine instruction can add one to register AX, another machine instruction can add some number to, or subtract some number from, the address stored in the instruction pointer. Add 100 to the instruction pointer, and the CPU will instantly skip 100 bytes down the list of machine instructions before it continues. Subtract 100 from the address stored in the instruction pointer, and the CPU will instantly jump back 100 bytes up the machine instruction list.
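Here is a small Python sketch of that instruction pointer arithmetic. It records only which addresses get executed and ignores what the instructions do to the registers. The byte 0EBH is the real x86 short jump opcode, followed by a signed one-byte displacement that the CPU adds to the instruction pointer after the pointer has already moved past the jump itself. The halt opcode 0F4H and the `max_steps` safety limit are my own additions so that the toy always stops.

```python
def signed8(byte: int) -> int:
    """Interpret a byte as a signed value in the range -128 to 127."""
    return byte - 256 if byte >= 128 else byte


def trace(memory: bytes, max_steps: int = 20) -> list:
    """Return the addresses of the instructions executed, in order."""
    ip = 0
    visited = []
    for _ in range(max_steps):
        visited.append(ip)
        opcode = memory[ip]
        if opcode == 0x40:                  # INC AX: one byte long
            ip += 1
        elif opcode == 0xEB:                # JMP: opcode plus displacement
            displacement = signed8(memory[ip + 1])
            ip = ip + 2 + displacement      # step past the jump, then adjust
        elif opcode == 0xF4:                # HLT (assumption): stop
            break
        else:
            raise ValueError(f"opcode {opcode:02X}H not in this toy")
    return visited


# Jump forward: the two INC instructions at addresses 3 and 4 are skipped.
print(trace(bytes([0x40, 0xEB, 0x02, 0x40, 0x40, 0xF4])))   # [0, 1, 5]

# Jump backward by 3: the same two instructions repeat until the limit.
print(trace(bytes([0x40, 0xEB, 0xFD]), max_steps=6))         # [0, 1, 0, 1, 0, 1]
```

The backward jump never ends on its own, which is exactly why the CPU also needs a way to decide whether to jump.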

Finally, the Third Whammy: The CPU can change its course of execution based on the work it has been doing. The CPU can decide whether to execute a given instruction or group of instructions, based on values stored in memory, or based on the individual state of several special one-bit CPU registers called flags. The CPU can count how many times it needs to do something, and then do that something that number of times. Or it can do something, and then do it again, and again, and again, checking each time (by looking at some data somewhere) to determine whether it's done yet, or whether it has to take another run through the task.

So, not only can you tell the CPU what to do, you can tell it where to go. Better, you can sometimes let the CPU, like a faithful bloodhound, sniff out the best course forward in the interest of getting the work done in the quickest possible way.

In Chapter 1, I described a computer program as a sequence of steps and tests. Most of the machine instructions understood by the CPU are steps, but others are tests. The tests are always two-way tests, and in fact the choice of what to do is always the same: jump or don't jump. That's all. You can test for any of numerous different conditions within the CPU, but the choice is always either jump to another place in the program or just keep truckin' along.
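The jump-or-don't-jump test can be sketched the same way. The toy program below uses three real x86 opcodes: 40H (add 1 to AX), 49H (subtract 1 from CX), and 75H (jump if the zero flag is not set), plus 0F4H to halt. Everything else is simplification on my part: the registers are treated as 16 bits wide, only the zero flag is modeled, and the starting count is handed in as a Python argument rather than loaded by an instruction.

```python
def run_loop(count: int) -> dict:
    """Run a three-instruction counting loop and return the final registers."""
    # addr 0: 40      add 1 to AX        (the "step" being repeated)
    # addr 1: 49      subtract 1 from CX (the counter)
    # addr 2: 75 FC   jump back 4 bytes if the result was not zero
    # addr 4: F4      halt
    memory = bytes([0x40, 0x49, 0x75, 0xFC, 0xF4])
    regs = {"AX": 0, "CX": count & 0xFFFF}
    zero_flag = False
    ip = 0
    while True:
        opcode = memory[ip]
        if opcode == 0x40:
            regs["AX"] = (regs["AX"] + 1) & 0xFFFF
            zero_flag = regs["AX"] == 0
            ip += 1
        elif opcode == 0x49:
            regs["CX"] = (regs["CX"] - 1) & 0xFFFF
            zero_flag = regs["CX"] == 0
            ip += 1
        elif opcode == 0x75:
            displacement = memory[ip + 1]
            if displacement >= 128:         # treat the byte as signed
                displacement -= 256
            ip += 2                         # "don't jump": just keep going
            if not zero_flag:
                ip += displacement          # "jump": adjust the pointer
        elif opcode == 0xF4:
            return regs
        else:
            raise ValueError(f"opcode {opcode:02X}H not in this toy")


print(run_loop(3))   # {'AX': 3, 'CX': 0}
```

The test itself does no work. It only looks at a flag left behind by an earlier step and chooses between two values for the instruction pointer.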

### What vs. How: Architecture and Microarchitecture

This book is really about programming in assembly language for Intel's 32-bit x86 CPUs, and those 32-bit CPUs made by other companies to be compatible with Intel's. There are a lot of different Intel and Intel-compatible x86 CPU chips. A full list would include the 8086, 8088, 80286, 80386, 80486, the Pentium, Pentium Pro, Pentium MMX, Pentium II, Pentium D, Pentium III, Pentium 4, Pentium Xeon, Pentium II Xeon, Pentium Core, Celeron, and literally dozens of others, many of them special-purpose, obscure, and short-lived. (Quick, have you ever heard of the 80376?) Furthermore, those are only the CPU chips designed and sold by Intel. Other companies (primarily AMD) have designed their own Intel-compatible CPU chips, which adds dozens more to the full list; and within a single CPU type are often another three or four variants, with exotic names such as Coppermine, Katmai, Conroe, and so on. Still worse, there can be a Pentium III Coppermine and a Celeron Coppermine.

How does anybody keep track of all this?

Quick answer: Nobody really does. Why? For nearly all purposes, the great mass of details doesn't matter. The soul of a CPU is pretty cleanly divided into two parts: what the CPU does and how the CPU does it. We, as programmers, see it from the outside: what the CPU does. Electrical engineers and systems designers who create computer motherboards and other hardware systems incorporating Intel processors need to know some of the rest, but they are a small and hardy crew, and they know who they are.

### Evolving Architectures

Our programmer's view from the outside includes the CPU registers, the set of machine instructions that the CPU understands, and special-purpose subsystems such as fast math processors, which may include instructions and registers of their own. All of these things are defined at length by Intel, and published online and in largish books so that programmers can study and understand them. Taken together, these definitions are called the CPU's architecture.

A CPU architecture evolves over time, as vendors add new instructions, registers, and other features to the product line. Ideally, this is done with an eye toward backward compatibility, which means that the new features do not generally replace, disable, or change the outward effects of older features. Intel has been very good about backward compatibility within its primary product line, which began in 1978 with the 8086 CPU and now goes by the catchall term ''x86.'' Within certain limitations, even programs written for the ancient 8086 will run on a modern Pentium Core 2 Quad CPU. (Incompatibilities that arise are more often related to different operating systems than the details of the CPU itself.)

The reverse, of course, is not true. New machine instructions creep slowly into Intel's x86 product line over the years. A new machine instruction first introduced in 1996 will not be recognized by a CPU designed, say, in 1993; but a machine instruction first introduced in 1993 will almost always be present and operate identically in newer CPUs.

In addition to periodic additions to the instruction set, architectures occasionally make quantum leaps. Such quantum leaps typically involve a change in the ''width'' of the CPU. In 1985, Intel's 16-bit architecture expanded to 32 bits with the introduction of the 80386 CPU, which added numerous instructions and operational modes, and doubled the width of the CPU registers. In 2003, the x86 architecture expanded again, this time to 64 bits, again with new instructions, modes of operation, and expanded registers. However, CPUs that adhere to the expanded 64-bit architecture will still run software written for the older 32-bit architecture.

Intel's 32-bit architecture is called IA-32, and in this book that's what I'll be describing. The newer 64-bit architecture is called x86-64 for peculiar reasons, chief of which is that Intel did not originate it. Intel's major competitor, AMD, created a backward-compatible 64-bit x86 architecture in the early 2000s, and it was so well done that Intel had to swallow its pride and adopt it. (Intel's own 64-bit architecture, called IA-64 Itanium, was not backward compatible with IA-32 and was roundly rejected by the market.)

With only minor glitches, the newer 64-bit Intel architecture includes the IA-32 architecture, which in turn includes the still older 16-bit x86 architecture. It's useful to know which CPUs have added what instructions to the architecture, keeping in mind that when you use a ''new'' instruction, your code will not run on CPU chips made before that new instruction appeared. This is a solvable problem, however. There are ways for a program to ask a CPU how new it is, and limit itself to features present in that CPU. In the meantime, there are other things that it is not useful to know.
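As a rough illustration of asking what a CPU can do, here is a Python sketch. Native code asks the chip directly through the CPUID machine instruction; on Linux running on an x86 CPU, the kernel publishes the answers as a `flags` line in the text file `/proc/cpuinfo`, and that text is what this sketch parses. The function name `cpu_flags` and the sample text are hypothetical, made up so the example runs anywhere.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Collect feature names from the first 'flags' line of cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        name, separator, value = line.partition(":")
        if separator and name.strip() == "flags":
            return set(value.split())
    return set()


# Hypothetical sample in the layout Linux uses; not taken from a real machine.
SAMPLE = """\
processor   : 0
model name  : Hypothetical x86 CPU
flags       : fpu tsc cmov mmx sse sse2
"""

flags = cpu_flags(SAMPLE)
print("sse2" in flags)   # True: safe to use SSE2 instructions
print("avx" in flags)    # False: fall back to older instructions
```

On a Linux machine you could pass in the real thing with `cpu_flags(open("/proc/cpuinfo").read())`. A program that checks before using a ''new'' instruction can choose an older code path when the feature is missing.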

### The Secret Machinery in the Basement

Because of the backward compatibility issue, CPU designers do not add new instructions or registers to an architecture without very good reason. There are other, better ways to improve a family of CPUs. The most important of these is increased processor throughput, which is not a mere increase in CPU clocking rates. The other is reduced power consumption. This is not even mostly a ''green'' issue. A certain amount of the power used by a CPU is wasted as heat; and waste heat, if not minimized, can cook a CPU chip and damage surrounding components. Designers are thus always looking for ways to reduce the power required to perform the same tasks.

Increasing processor throughput means increasing the number of instructions that the CPU executes over time. A lot of arcane tricks are associated with increasing throughput, with names like prefetching, L1 and L2 cache, branch prediction, hyper-pipelining, and macro-ops fusion, along with plenty of others. Some of these techniques were created to reduce or eliminate bottlenecks within the CPU so that the CPU and the memory system can remain busy nearly all the time. Other techniques stretch the ability of the CPU to process multiple instructions at once.

Taken together, all of the electrical mechanisms by which the CPU does what its instructions tell it to do are called the CPU's microarchitecture. It's the machinery in the basement that you can't see. The metaphor of the shop foreman breaks down a little here. Let me offer you another one.

Suppose that you own a company that manufactures automatic transmission parts for Ford. You have two separate plants. One is 40 years old, and one has just been built. Both plants make precisely the same parts. They have to, because Ford puts them into its transmissions without knowing or caring which of your two plants manufactured them. A cam or a housing is thus identical within a ten-thousandth of an inch, whether it was made in your old plant or your new plant.

Your old plant has been around for a while, but your new plant was designed and built based on everything you've learned while operating the old plant for 40 years. It has a more logical layout, better lighting, and modern automated tooling that requires fewer people to operate and works longer without adjustment.

The upshot is that your new plant can manufacture those cams and housings much more quickly and efficiently, wasting less power and raw materials, and requiring fewer people to do it. The day will come when you'll build an even more efficient third plant based on what you've learned running the second plant, and you'll shut the first plant down.

Nonetheless, the cams and housings are the same, no matter where they were made. Precisely how they were made is no concern of Ford's or anyone else's. As long as the cams are built to the same measurements at the same tolerance, the ''how'' doesn't matter.

All of the tooling, the assembly line layouts, and the general structure of each plant may be considered that plant's microarchitecture. Each time you build a new plant, the new plant's microarchitecture is more efficient at doing what the older plants have been doing all along.

So it is with CPUs. Intel and AMD are constantly redesigning their CPU microarchitectures to make them more efficient. Driving these efforts are improved silicon fabrication techniques that enable more and more transistors to be placed on a single CPU die. More transistors mean more switches and more potential solutions to the same old problems of throughput and power efficiency.

The prime directive in improving microarchitectures, of course, is not to ''break'' existing programs by changing the way machine instructions or registers operate. That's why it's the secret machinery in the basement. CPU designers go to great lengths to maintain that line between what the CPU does and how those tasks are actually accomplished in the forest of those half-billion transistors.

All the exotic code names like Conroe, Katmai, or Yonah actually indicate tweaks in the microarchitecture. Major changes in the microarchitecture also have names: P6, NetBurst, Core, and so on. These are described in great detail online, but don't feel bad if you don't quite follow it all. Most of the time I'm hanging on by my fingernails too.

I say all this so that you, as a newly minted programmer, don't make more of Intel microarchitecture differences than you should. It is extremely rare (like, almost never) for a difference in microarchitecture detail to give you an exploitable advantage in how you code your programs. Microarchitecture is not a mystery (much about it is available online), but for the sake of your sanity you should probably treat it as one for the time being. We have many more important things to learn right now.

### Enter the Plant Manager

What I've described so far is less ''a computer'' than ''computation.'' A CPU executing a program does not a computer make. The COSMAC ELF device that I built in 1976 was an experiment, and at best a sort of educational toy.

It was a CPU with some memory, and just enough electrical support (through switches and LED digits) that I could enter machine code and see what was happening inside the memory chips. It was in no sense of the word useful.

My first useful computer came along a couple of years later. It had a keyboard, a CRT display (though not one capable of graphics), a pair of 8-inch floppy disk drives, and a printer. It was definitely useful, and I wrote numerous magazine articles and my first three books with it. I had a number of simple application programs for it, like the primordial WordStar word processor; but what made it useful was something else: an operating system.

### Operating Systems: The Corner Office

An operating system is a program that manages the operation of a computer system. It's like any other program in that it consists of a sequence of machine instructions executed by the CPU. Operating systems are different in that they have special powers not generally given to word processors and spreadsheet programs. If we continue the metaphor of the CPU as the shop foreman, then the operating system is the plant manager. The entire physical plant is under its control. It oversees the bringing in of raw materials to the plant. It supervises the work that goes on inside the plant (including the work done by the shop foreman) and packages up the finished products for shipment to customers.

In truth, our early microcomputer operating systems weren't very powerful and didn't do much. They ''spun the disks'' and handled the storage of data to the disk drives, and brought data back from disks when requested. They picked up keystrokes from the keyboard, and sent characters to the video display. With some fiddling, they could send characters to a printer. That was about it.

The CP/M operating system was ''state of the art'' for desktop microcomputers in 1979. If you entered the name of a program at the keyboard, CP/M would go out to disk, load the program from a disk file into memory, and then literally hand over all power over the machine to the loaded program. When WordStar ran, it overwrote the operating system in memory, because memory was extremely expensive and there wasn't very much of it. Quite literally, only one program could run at a time. CP/M didn't come back until WordStar exited. Then CP/M would be reloaded from the floppy disk, and would simply wait for another command from the keyboard.

### BIOS: Software, Just Not as Soft

So what brought CP/M back into memory, if it wasn't there when WordStar exited? Easy: WordStar rebooted the computer. In fact, every time a piece of software ran, CP/M went away. Every time that software exited, it rebooted the machine, and CP/M came back. There was so little to CP/M that rebooting it from a floppy disk took less than two seconds.

As our computer systems grew faster, and memory cheaper, our operating systems improved right along with our word processors and spreadsheets. When the IBM PC appeared, PC DOS quickly replaced CP/M. The PC had enough memory that DOS didn't go away when a program loaded, but rather remained in its place in memory while application software loaded above it. DOS could do a lot more than CP/M, and wasn't a great deal larger. This was possible because DOS had help.

IBM had taken the program code that handled the keyboard, the display, and the disk drives and burned it into a special kind of memory chip called read-only memory (ROM). Ordinary random-access memory goes blank when power to it is turned off. ROM retains its data whether it has power or not. Thus, thousands of machine instructions did not have to be loaded from disk, because they were always there in a ROM chip soldered to the motherboard. The software on the ROM was called the Basic Input/Output System (BIOS) because it handled computer inputs (such as the keyboard) and computer outputs (such as the display and printer).

Somewhere along the way, software like the BIOS, which existed on ''non-volatile'' ROM chips, was nicknamed firmware, because although it was still software, it was not quite as, well, soft as software stored in memory or on disk. All modern computers have a firmware BIOS, though the BIOS software does different things now than it did in 1981.

DOS had a long reign. The first versions of Windows were not really whole new operating systems, but simply file managers and program launchers drawn on the screen in graphics mode. Down in the basement under the icons, DOS was still there, doing what it had always done.

It wasn't until 1995 that things changed radically. In that year, Microsoft released Windows 95, which had a brand-new graphical user interface, but something far more radical down in the basement. Windows 95 operated in 32-bit protected mode, and required at least an 80386 CPU to run. (I'll explain in detail what ''protected mode'' means in the next chapter.) For the moment, think of protected mode as allowing the operating system to definitely be The Boss, and no longer merely a peer of word processors and spreadsheets. Windows 95 did not make full use of protected mode, because it still had DOS and DOS applications to deal with, and such ''legacy'' software was written long before protected mode was an option. Windows 95 did, however, have something not seen previously in the PC world: preemptive multitasking.

Memory had gotten cheap enough by 1995 that it was possible to have not just one or two but several programs in memory at the same time. In an elaborate partnership with the CPU, Windows 95 created the convincing illusion that all of the programs in memory were running at once. This was done by giving each program loaded into memory a short slice of the CPU's time. A program would begin running on the CPU, and some number of its machine instructions would execute.

However, after a set period of time (usually a small fraction of a second) Windows 95 would ''preempt'' that first program, and give control of the CPU to the second program on the list. That program would execute instructions for a few milliseconds until it too was preempted. Windows 95 would go down the list, letting each program run for a little while. When it reached the bottom of the list, it would start again at the top and continue running through the list, round-robin fashion, letting each program run for a little while. The CPU was fast enough that the user sitting in front of the display would think that all the programs were running simultaneously.

Figure 3-5 may make this clearer. Imagine a rotary switch, in which a rotor turns continuously and touches each of several contacts in sequence, once per revolution. Each time it touches the contact for one of the programs, that program is allowed to run. When the rotor moves to the next contact, the previous program stops in its tracks, and the next program gets a little time to run.

Figure 3-5: The idea of multitasking

The operating system can define a priority for each program on the list, so that some get more time to run than others. High-priority tasks get more clock cycles to execute, whereas low-priority tasks get fewer.
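The rotary-switch idea, priorities included, can be sketched in a few lines of Python. This is a simulation of the bookkeeping only. A real operating system preempts a running program with help from the hardware, whereas here each "program" is just a count of work units left to do. The function name `round_robin`, the program names, and the idea of expressing priority as work units per turn are all assumptions of mine.

```python
from collections import deque


def round_robin(work: dict, slices: dict = None) -> list:
    """Simulate time slicing.

    work maps a program name to the units of work it still needs.
    slices maps a program name to the units it may run per turn (default 1),
    which stands in for priority.
    Returns the timeline as a list of (name, units_run) tuples.
    """
    slices = slices or {}
    queue = deque(work.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()     # the rotor touches a contact
        ran = min(slices.get(name, 1), remaining)
        timeline.append((name, ran))
        remaining -= ran
        if remaining > 0:                     # preempted: back of the line
            queue.append((name, remaining))
    return timeline


jobs = {"editor": 2, "sheet": 3, "mail": 1}
print(round_robin(jobs))
# [('editor', 1), ('sheet', 1), ('mail', 1), ('editor', 1), ('sheet', 1), ('sheet', 1)]

# Give the spreadsheet a higher priority: two units per turn instead of one.
print(round_robin(jobs, slices={"sheet": 2}))
# [('editor', 1), ('sheet', 2), ('mail', 1), ('editor', 1), ('sheet', 1)]
```

With slices measured in milliseconds rather than whole work units, the switching happens far too quickly for a person at the keyboard to notice.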

### Promotion to Kernel

Much was made of Windows 95's ability to multitask, but in 1995 few people had heard of a Unix-like operating system called Linux, which a young Finn named Linus Torvalds had written almost as a lark, and released in 1991.

Linux did not have the elaborate graphical user interface that Windows 95 did, but it could handle multitasking, and had a much more powerful structure internally. The core of Linux was a block of code called the kernel, which took full advantage of IA-32 protected mode. The Linux kernel was entirely separate from the user interface, and it was protected from damage due to malfunctioning programs elsewhere in the system. System memory was tagged as either kernel space or user space, and nothing running in user space could write to (nor generally read from) anything stored in kernel space. Communication between kernel space and user space was handled through strictly controlled system calls (more on this later in the book).

Direct access to physical hardware, including memory, video, and peripherals, was limited to software running in kernel space. Programs wishing to make use of system peripherals could only get access through kernel-mode device drivers.
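You can watch a user-space program ask the kernel for service without writing any assembly language at all. In the Python sketch below, `os.pipe`, `os.write`, and `os.read` are real functions in Python's standard `os` module, and on Unix-like systems such as Linux each one is a thin wrapper around the system call of the same name. The pipe's buffer is held by the kernel, so the only way for the program to put bytes in or get them back out is to ask. The function name `send_through_kernel` is my own.

```python
import os


def send_through_kernel(message: bytes) -> bytes:
    """Write bytes into a kernel-managed pipe and read them back out."""
    read_fd, write_fd = os.pipe()        # ask the kernel to create a pipe
    try:
        os.write(write_fd, message)      # hand the bytes to the kernel
        return os.read(read_fd, len(message))   # ask for them back
    finally:
        os.close(read_fd)                # give both ends back to the kernel
        os.close(write_fd)


print(send_through_kernel(b"hello from user space"))
```

The two numbers returned by `os.pipe` are file descriptors: small integers that name something the kernel is keeping track of on the program's behalf. The program never sees the kernel's own data structures. The sketch assumes a short message that fits in the pipe's buffer; a very large one would make the write wait for a reader.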

Microsoft released its own Unix-inspired operating system in 1993. Windows NT had an internal structure a great deal like Linux, with kernel and device drivers running in kernel space, and everything else running in user space. This basic design is still in use, for both Linux and Windows NT's successors, such as Windows 2000, Windows XP, Windows Vista, and Windows 7. The general design for true protected-mode operating systems is shown schematically in Figure 3-6.

### The Core Explosion

In the early 2000s, desktop PCs began to be sold with two CPU sockets. Windows 2000/XP/Vista/7 and Linux all support the use of multiple CPU chips in a single system, through a mechanism called symmetric multiprocessing (SMP). Mult