The most visible and discussed aspects of cluster computing systems are their physical components and organization. These deliver the raw capabilities of the system, take up considerable room on the machine room floor, and yield their excellent price/performance. The two principal subsystems of a Beowulf cluster are its constituent compute nodes and its interconnection network that integrates the nodes into a single system. These are discussed briefly below.
The compute or processing nodes incorporate all hardware devices and mechanisms responsible for program execution, including performing the basic operations, holding the working data, providing persistent storage, and enabling external communications of intermediate results and user command interface. Five key components make up the compute node of a Beowulf cluster: the microprocessor, main memory, the motherboard, secondary storage, and packaging.
The microprocessor provides the computing power of the node with its peak performance measured in Mips (millions of instructions per second) and Mflops (millions of floating-point operations per second). Although Beowulfs have been implemented with almost every conceivable microprocessor family, the two most prevalent today are the 32-bit Intel Pentium 3 and Pentium 4 microprocessors and the 64-bit Compaq Alpha 21264 family. We note that the AMD devices (including the Athlon), which are binary compatible with the Intel Pentium instruction set, have also found significant application in clusters. In addition to the basic floating-point and integer arithmetic logic units, the register banks, and execution pipeline and control logic, the modern microprocessor, comprising on the order of 20 to 50 million transistors, includes a substantial amount of on-chip high-speed memory called cache for rapid access of data. Cache is organized in a hierarchy, usually with two or three layers, the closest to the processor being the fastest but smallest and the most distant being relatively slower but with much more capacity. These caches buffer data and instructions from main memory and, where data reuse or spatial locality of access is high, can deliver a substantial percentage of peak performance. The microprocessor usually interfaces with the remainder of the node through two external buses: one specifically optimized as a high-bandwidth interface to main memory, and the other in support of data I/O.
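The effect of a cache hierarchy on access speed can be sketched with a simple average-access-time model. The following Python fragment is only an illustration: the function name, the hit rates, and the latencies are assumptions chosen for the example, not measurements of any particular processor.

```python
def average_access_time(levels, memory_ns):
    """Estimate the mean access time (ns) of a cache hierarchy.

    levels    -- list of (lookup_time_ns, hit_rate), nearest cache first
    memory_ns -- time to reach main memory after missing every cache
    Assumed model: every access that reaches a level pays its lookup
    time; misses continue to the next level.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for lookup_time_ns, hit_rate in levels:
        total += reach * lookup_time_ns
        reach *= 1.0 - hit_rate
    return total + reach * memory_ns


# Hypothetical two-level hierarchy: small fast L1, larger slower L2.
with_cache = average_access_time([(1.0, 0.90), (10.0, 0.80)], memory_ns=100.0)
without_cache = average_access_time([], memory_ns=100.0)
print(with_cache, without_cache)
```

With these assumed numbers, high hit rates bring the average access time far below the main memory latency, which is why data reuse and locality matter so much to delivered performance.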
Main memory stores the working dataset and programs used by the microprocessor during job execution. Main memory is based on DRAM technology, in which a single bit is stored as a charge on a small capacitor accessed through a dedicated switching transistor. As a result, read and write operations to main memory can be significantly slower than those to cache. However, recent advances in main memory design have improved memory access speed and have substantially increased memory bandwidth. These improvements have been facilitated by advances in memory bus design such as Rambus.
The motherboard is the medium of integration that combines all the components of a node into a single operational system. Far more than just a large printed circuit board, the motherboard incorporates a sophisticated chip set almost as complicated as the microprocessor itself. This chip set manages all the interfaces between components and controls the bus protocols. One important bus is PCI, the primary interface between the microprocessor and most high-speed external devices. Initially a 32-bit bus operating at 33 MHz, the most recent variation operates at 66 MHz on 64-bit data, thus quadrupling its potential throughput. Most system area network interface controllers are connected to the node by means of the PCI bus. The motherboard also includes a substantial read-only memory (which can be updated) containing the system's BIOS (basic input/output system), a set of low-level services, primarily related to the function of the I/O and basic bootstrap tasks, that defines the logical interface between the higher-level operating system software and the node hardware. Motherboards also support several other input/output ports such as the user's keyboard/mouse/video monitor and the now-ubiquitous universal serial bus (USB) port that is replacing several earlier distinct interface types. Nonetheless, the vestigial parallel printer port can still be found, whose specification goes back to the days of the earliest PCs more than twenty years ago.
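The quadrupling of PCI throughput follows directly from doubling both the bus width and the clock rate. A minimal Python sketch of the arithmetic, assuming one transfer per clock cycle and using the round clock figures quoted above (the function name is illustrative):

```python
def bus_peak_mbytes_per_s(width_bits, clock_mhz):
    """Peak bus throughput in MBytes/s, assuming one transfer per clock."""
    return width_bits / 8 * clock_mhz


original_pci = bus_peak_mbytes_per_s(32, 33)  # 32-bit bus at 33 MHz
newer_pci = bus_peak_mbytes_per_s(64, 66)     # 64-bit bus at 66 MHz
print(original_pci, newer_pci, newer_pci / original_pci)
```

These are peak figures; sustained throughput on a real bus is lower because of arbitration and protocol overhead.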
Secondary storage provides high-capacity persistent storage. While main memory loses all its contents when the system is powered off, secondary storage fully retains its data in the powered-down state. While many standalone PCs include several classes of secondary storage, some Beowulf systems may have nodes that keep only what is needed to hold a boot image for initial startup, all other data being downloaded from an external host or master node. Minimizing secondary storage in this way can go a long way to improving reliability and reducing per-node cost. However, it misses the opportunity for low-cost, high-bandwidth mass storage. Depending on how the system ultimately is used, either choice may be optimal. The primary medium for secondary storage is the hard disk, based on a magnetic medium little different from an audio cassette tape. This technology, almost as old as digital computing itself, continues to expand in capacity at an exponential rate, although access speed and bandwidths have improved only gradually. Two primary contenders, SCSI (small computer system interface) and EIDE (enhanced integrated drive electronics), are differentiated by somewhat higher speed and capacity in the first case, and lower cost in the second case. Today, a gigabyte of EIDE disk storage costs the user a few dollars, while the list price for SCSI in a RAID (redundant array of independent disks) configuration can be as high as $100 per gigabyte (the extra cost does buy more speed, density, and reliability). Most workstations use SCSI, and most PCs employ EIDE drives, which can be as large as 100 GBytes per drive. Two other forms of secondary storage are the venerable floppy disk and the optical disk. The modern 3.5-inch floppy (they don't actually flop anymore, since they now come in a hard rather than a soft case), also more than twenty years old, holds only 1.4 MBytes of data and should have been retired long ago.
Because of its ubiquity, however, it continues to hang on and is ideal as a boot medium for Beowulf nodes. Largely replacing floppies are the optical CD (compact disk), CD-RW (compact disk-read/write), and DVD (digital versatile disk). The first two hold approximately 600 MBytes of data, with access times of a few milliseconds. (The basic CD is read only, but the CD-RW disks are writable, although at a far slower rate.) Most commercial software and data are now distributed on CDs because they are very cheap to create (actually cheaper than a glossy one-page double-sided commercial flyer). DVD technology also runs on current-generation PCs, providing direct access to movies.
Packaging for PCs originally was in the form of "pizza boxes": low, flat units, usually placed on the desk with a fat monitor sitting on top. Some small early Beowulfs were configured with such packages, usually with as many as eight of these boxes stacked one on top of another. But by the time the first Beowulfs were implemented in 1994, tower cases (vertical components that stand on the floor or sometimes on the desk next to the video monitor) were replacing pizza boxes because of their greater flexibility in configuration and their extensibility (with several heights available). Several generations of Beowulf clusters still are implemented using this low-cost, robust packaging scheme, leading to such expressions as "pile of PCs" and "lots of boxes on shelves" (LOBOS). But the principal limitation of this strategy was its low density (only about two dozen boxes could be stored on a floor-to-ceiling set of shelves) and the resulting large footprint of medium- to large-scale Beowulfs. Once the industry recognized the market potential of Beowulf clusters, a new generation of rack-mounted packages was devised and standardized (e.g., 1U, 2U, 3U, and 4U, with 1U boxes having a height of 1.75 inches) so that it is possible to install a single floor-standing rack with as many as 42 processors, coming close to doubling the processing density of such systems. Vendors providing complete turnkey systems as well as hardware system integrators ("bring-your-own software") are almost universally taking this approach. Yet for small systems where cost is critical and simplicity a feature, towers will pervade small labs, offices, and even homes for a long time. (And why not? On those cold winter days, they make great space heaters.)
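The rack density figure can be checked with a little arithmetic. The sketch below assumes a rack with 42U of mounting space (an assumption consistent with the 42-processor figure above, not something stated in the text) and single-processor nodes; the function names are illustrative.

```python
RACK_UNIT_INCHES = 1.75  # height of 1U, as given in the text


def nodes_per_rack(rack_units, node_units):
    """Number of whole nodes of a given height that fit in a rack."""
    return rack_units // node_units


def mounting_height_inches(rack_units):
    """Vertical mounting space of a rack in inches."""
    return rack_units * RACK_UNIT_INCHES


for node_units in (1, 2, 3, 4):
    print(f"{node_units}U nodes in an assumed 42U rack:",
          nodes_per_rack(42, node_units))
print("Mounting height (inches):", mounting_height_inches(42))
```

In practice some rack units are usually given over to switches and power distribution, so the usable node count is somewhat lower.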
Beowulf cluster nodes (i.e., PCs) have seen enormous, even explosive, growth over the past seven years since Beowulfs were first introduced in 1994. We note that the entry date for Beowulf was not arbitrary: the level of hardware and software technologies based on the mass market had just (within the previous six months) reached the point that ensembles of them could compete for certain niche applications with the then-well-entrenched MPPs and provide price/performance benefits (in the very best cases) of almost 50 to 1. The new Intel 100 MHz 80486 made it possible to achieve as much as 5 Mflops per node for select computationally intense problems, and the cost of 10 Mbps Ethernet network controllers and network hubs had become low enough that they could be employed as dedicated system area networks. Equally important were the availability of the inchoate Linux operating system, with the all-important attribute of being free and open source, and the availability of a good implementation of the PVM message-passing library. Of course, the Beowulf project had to fill in a lot of the gaps, including writing most of the Ethernet drivers distributed with Linux and other simple tools, such as channel bonding, that facilitated the management of these early modest systems. Since then, the delivered floating-point performance per processor has grown by more than two orders of magnitude while memory capacity has grown by more than a factor of ten. Disk capacities have expanded by as much as 1000X. Thus, Beowulf compute nodes have witnessed an extraordinary evolution in capability. By the end of this decade, node floating-point performance, main memory size, and disk capacity all are expected to grow by another two orders of magnitude.
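To put these growth figures in perspective, a total growth factor over a period can be converted into an equivalent annual rate and doubling time. The Python sketch below assumes smooth compound growth, which is a simplification, and uses illustrative function names.

```python
import math


def annual_growth_factor(total_factor, years):
    """Equivalent yearly multiplier for a total growth over a period."""
    return total_factor ** (1.0 / years)


def doubling_time_years(total_factor, years):
    """Time to double, assuming steady compound growth."""
    return years * math.log(2) / math.log(total_factor)


# Figures from the text: 100x floating-point performance, 10x memory,
# and 1000x disk capacity over roughly seven years.
for name, factor in (("performance", 100), ("memory", 10), ("disk", 1000)):
    print(name,
          round(annual_growth_factor(factor, 7), 2),
          round(doubling_time_years(factor, 7), 2))
```

Under this model, two orders of magnitude in seven years corresponds to performance nearly doubling every year.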
One aspect of node structure not yet discussed is symmetric multiprocessing. Modern microprocessor design includes mechanisms that permit more than one processor to be combined, sharing the same main memory while retaining full coherence across separate processor caches, thus giving all processors a consistent view of shared data in spite of their local copies in dedicated caches. While large industrial-grade servers may incorporate as many as 512 processors in a single SMP unit, a typical configuration for PC-based SMPs is two or four processors per unit. The ability to share memory with uniform access times should be a source of improved performance at lower cost. But both design and pricing are highly complicated, and the choice is not always obvious. Sometimes the added complexity of SMP design offsets the apparent advantage of sharing many of the node's resources. Also, performance benefits from tight coupling of the processors may be outweighed by the contention for main memory and possible cache thrashing. An added difficulty is attempting to program at the two levels: message passing between nodes and shared memory between processors of the same node. Most users don't bother, choosing to remain with a uniform message-passing model even between processors within the same SMP node.
Without the availability of moderate-cost short-haul network technology, Beowulf cluster computing would never have happened. Interestingly, the two leaders in cluster-dedicated networks were derived from very different precedent technologies. Ethernet was developed as a local area network for interconnecting distributed single-user and community computing resources with shared peripherals and file servers. Myrinet was developed from a base of experience with very tightly coupled processors in MPPs such as the Intel Paragon. Together, Fast and Gigabit Ethernet and Myrinet provide the basis for the majority of Beowulf-class clusters.
A network is a combination of physical transport and control mechanisms associated with a layered hierarchy of message encapsulation. The core concept is the "message." A message is a collection of information organized in a format (order and type) that both the sending and the receiving processes understand and can correctly interpret. One can think of a message as a movable record. It can be as short as a few bytes (not including the header information) or as long as many thousands of bytes. Ordinarily, the sending user application process calls a library routine that manages the interface between the application and the network. Performing a high-level send operation causes the user message to be packaged with additional header information and presented to the network kernel driver software. Additional routing information is added and additional conversions are performed prior to actually sending the message. The lowest-level hardware then drives the communication channel's lines with the signal, and the network switches route the message appropriately in accordance with the routing information encoded in bits at the header of the message packet. Upon receipt at the receiving node, the process is reversed and the message is eventually loaded into the user application name space to be interpreted by the application code.
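The idea of packaging a user message with header information, and reversing the process on receipt, can be sketched in a few lines of Python. The header layout here (source node, destination node, payload length) is purely hypothetical and does not correspond to Ethernet, Myrinet, or any real protocol; it only illustrates encapsulation.

```python
import struct

# Hypothetical header: source node, destination node, payload length,
# packed in network (big-endian) byte order without padding.
HEADER_FORMAT = "!HHI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def pack_message(src, dst, payload):
    """Sender side: prepend the header to the user payload."""
    return struct.pack(HEADER_FORMAT, src, dst, len(payload)) + payload


def unpack_message(packet):
    """Receiver side: strip the header and recover the payload."""
    src, dst, length = struct.unpack(HEADER_FORMAT, packet[:HEADER_SIZE])
    return src, dst, packet[HEADER_SIZE:HEADER_SIZE + length]


packet = pack_message(3, 7, b"partial sum = 42")
print(len(packet), unpack_message(packet))
```

A real network stack repeats this step at each layer, with every layer treating the packet produced by the layer above as its own payload.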
The network is characterized primarily in terms of its bandwidth and its latency. Bandwidth is the rate at which the message bits are transferred, usually cited in terms of peak throughput as bits per second. Latency is the length of time required to send the message. Perhaps a fairer measure is the time from the sending to the receiving application process, taking into consideration all of the layers of translation, conversions, and copying involved. But vendors often quote the shorter time between their network interface controllers. To complicate matters, both bandwidth and latency are sensitive to message length and message traffic. Longer messages make better use of network resources and deliver improved network throughput. Shorter messages reduce transmit, receive, and copy times to provide an overall lower transfer latency but cause lower effective bandwidth. Higher total network traffic (i.e., number of messages per unit time) increases overall network throughput, but the resulting contention and the delays it incurs result in longer effective message transfer latency.
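The dependence of effective bandwidth on message length can be illustrated with a common first-order model in which transfer time is a fixed latency plus the message size divided by the peak bandwidth. This model is an assumption introduced here for illustration; it ignores contention and protocol details. The example values (7 microseconds, 1 Gbps) are of the order of the cLAN figures quoted in this section.

```python
def transfer_time_s(n_bytes, latency_s, peak_bps):
    """Time to deliver a message: fixed latency plus serialization time."""
    return latency_s + n_bytes * 8 / peak_bps


def effective_bandwidth(n_bytes, latency_s, peak_bps):
    """Delivered bits per second for a message of n_bytes."""
    return n_bytes * 8 / transfer_time_s(n_bytes, latency_s, peak_bps)


def half_performance_length(latency_s, peak_bps):
    """Message size (bytes) at which half of peak bandwidth is achieved."""
    return latency_s * peak_bps / 8


LATENCY_S = 7e-6  # illustrative latency
PEAK_BPS = 1e9    # illustrative peak bandwidth

for size in (64, 1024, 65536):
    mbps = effective_bandwidth(size, LATENCY_S, PEAK_BPS) / 1e6
    print(f"{size:6d} bytes -> {mbps:7.1f} Mbps effective")
print("half-performance length:",
      half_performance_length(LATENCY_S, PEAK_BPS), "bytes")
```

Under this model, short messages are dominated by latency and achieve only a small fraction of peak bandwidth, while long messages approach the peak.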
More recently, an industrial consortium has developed a new networking model known as VIA (Virtual Interface Architecture). The goal of this network class is to support a zero-copy protocol, avoiding the intermediate copying of the message in the operating system space and permitting direct application-to-application message transfers. The result is significantly reduced latency of message transfer. Emulex has developed the cLAN network product, which provides a peak bandwidth in excess of 1 Gbps and for short messages exhibits a transfer latency on the order of 7 microseconds.