Taxonomy of Parallel Computing

The goal of achieving performance through the exploitation of parallelism is as old as electronic digital computing itself, which emerged from the World War II era. Many different approaches and consequent paradigms and structures have been devised, with many commercial or experimental versions being implemented over the years. Few, however, have survived the harsh rigors of the data processing marketplace. Here we look briefly at many of these strategies, to better appreciate where commodity cluster computers and Beowulf systems fit and the tradeoffs and compromises they represent.

A first-tier decomposition of the space of parallel computing architectures may be codified in terms of coupling: the typical latencies involved in performing and exploiting parallel operations. This may range from the most tightly coupled fine-grained systems of the systolic class, where the parallel algorithm is actually hardwired into a special-purpose ultra-fine-grained hardware computer logic structure with latencies measured in the nanosecond range, to the other extreme, often referred to as distributed computing, which engages widely separated computing resources potentially across a continent or around the world and has latencies on the order of a hundred milliseconds. Thus the realm of parallel computing structures encompasses a range of 10^8 (from roughly a nanosecond to roughly a hundred milliseconds), when measured by degree of coupling and, by implication, granularity of parallelism. In the following list, the set of major classes in order of tightness of coupling is briefly described. We note that any such taxonomy is subjective, rarely orthogonal, and subject to debate. It is offered only as an illustration of the richness of choices and the general space into which cluster computing fits.
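The size of that range follows directly from the two latencies quoted above. The short Python sketch below is only an illustration of the arithmetic; the constant names and the exact values (1 ns and 100 ms) are assumptions chosen to represent the orders of magnitude given in the text.

```python
import math

# Representative latencies (assumed values, orders of magnitude only).
SYSTOLIC_LATENCY_S = 1e-9       # about one nanosecond
DISTRIBUTED_LATENCY_S = 100e-3  # about a hundred milliseconds


def coupling_range(tight_s: float, loose_s: float) -> float:
    """Return the ratio of the loosest to the tightest coupling latency."""
    return loose_s / tight_s


ratio = coupling_range(SYSTOLIC_LATENCY_S, DISTRIBUTED_LATENCY_S)
orders = round(math.log10(ratio))  # number of decimal orders of magnitude
print(f"coupling range is about 10^{orders}")
```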

Systolic computers are usually special-purpose hardwired implementations of fine-grained parallel algorithms exploiting one-, two-, or three-dimensional pipelining. Often used for real-time postsensor processors, digital signal processing, image processing, and graphics generation, systolic computing is experiencing a revival through adaptive computing, exploiting the versatile FPGA (field-programmable gate array) technology that allows different systolic algorithms to be programmed into the same FPGA medium at different times.

Vector computers exploit fine-grained vector operations through heavy pipelining of memory bank accesses and arithmetic logic unit (ALU) structure, hardware support for gather-scatter operations, and amortizing instruction fetch/execute cycle overhead over many basic operations within the vector operation. The basis for the original supercomputers (e.g., Cray), vector processing is still a formidable strategy in certain Japanese high end systems.
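As a rough illustration of the gather-scatter operations mentioned above, the following Python sketch gathers scattered memory elements into a dense vector, applies one operation to the whole vector, and scatters the results back. The function names `gather` and `scatter` and the sample data are hypothetical; real vector hardware performs these steps in pipelined memory and ALU units, not in an interpreter loop.

```python
def gather(memory, indices):
    """Collect elements at arbitrary addresses into a dense vector."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Write a dense vector back to arbitrary addresses."""
    for i, v in zip(indices, values):
        memory[i] = v


memory = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
indices = [5, 0, 3]                     # non-contiguous addresses

vec = gather(memory, indices)           # dense operand vector
vec = [2.0 * x for x in vec]            # one operation over the whole vector
scatter(memory, indices, vec)           # results return to their original slots
```

One instruction's worth of control overhead is spent on the whole vector here, which is the amortization the paragraph describes.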

SIMD (single instruction, multiple data) architecture exploits fine-grained data parallelism by having many (potentially thousands) of simple processors performing the same operation in lock step but on different data. A single control processor issues the global commands to all slaved compute processors simultaneously through a broadcast mechanism. Such systems (e.g., MasPar-2, CM-2) incorporated large communications networks to facilitate massive data movement across the system in a few cycles. No longer an active commercial area, SIMD structures continue to find special-purpose application for postsensor processing.
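The control structure described above can be sketched in a few lines of Python. This is a sequential simulation under assumed names (`ProcessingElement`, `broadcast`), not a model of any particular machine: on real SIMD hardware every processing element would execute the broadcast instruction in the same cycle.

```python
import operator


class ProcessingElement:
    """A simple processor that holds one local data value."""

    def __init__(self, value):
        self.value = value

    def execute(self, op, operand):
        # Apply the broadcast instruction to this element's local data.
        self.value = op(self.value, operand)


def broadcast(elements, op, operand):
    """Control processor: issue the same instruction to every element."""
    for pe in elements:  # simulated one at a time; lock step on real hardware
        pe.execute(op, operand)


elements = [ProcessingElement(v) for v in [1, 2, 3, 4]]
program = [(operator.mul, 3), (operator.add, 1)]  # a single instruction stream

for op, operand in program:
    broadcast(elements, op, operand)

result = [pe.value for pe in elements]
```

Each element ends up with 3x + 1 for its own x: one instruction stream, many data items.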

Dataflow models employed fine-grained asynchronous flow control that depended only on data precedence constraints, thus exploiting a greater degree of parallelism and providing a dynamic adaptive scheduling mechanism in response to resource loading. Because they suffered from severe overhead degradation, however, dataflow computers were never competitive and failed to find market presence. Nonetheless, many of the concepts reflected by the dataflow paradigm have had a strong influence on modern compiler analysis and optimization, reservation stations in out-of-order instruction completion ALU designs, and multithreaded architectures.
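To make the idea of scheduling by data precedence concrete, here is a minimal Python sketch in which an operation fires as soon as all of its operands are available, regardless of the order in which the operations were written. The graph encoding and the function name `run_dataflow` are assumptions made for illustration; they do not describe any specific dataflow machine.

```python
import operator


def run_dataflow(graph, inputs):
    """Fire each node once all of its operands exist.

    graph maps a node name to (function, list of operand names).
    Returns the computed values and the waves of nodes that fired together.
    """
    values = dict(inputs)
    pending = dict(graph)
    waves = []
    while pending:
        # Nodes whose data precedence constraints are already satisfied.
        ready = [name for name, (_, deps) in pending.items()
                 if all(d in values for d in deps)]
        if not ready:
            raise ValueError("cycle or missing input in dataflow graph")
        for name in ready:  # nodes in one wave are mutually independent
            fn, deps = pending.pop(name)
            values[name] = fn(*(values[d] for d in deps))
        waves.append(sorted(ready))
    return values, waves


graph = {
    "prod": (operator.mul, ["sum", "diff"]),  # listed first, fires last
    "sum": (operator.add, ["a", "b"]),
    "diff": (operator.sub, ["a", "b"]),
}
values, waves = run_dataflow(graph, {"a": 5, "b": 3})
```

Here `sum` and `diff` fire in the same wave because neither depends on the other, which is the parallelism the model exposes. The bookkeeping needed to find ready nodes hints at the overhead that the paragraph above identifies.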

PIM (processor-in-memory) architectures are only just emerging as a possible force in high-end system structures, merging memory (DRAM or SRAM) with processing logic on the same integrated circuit die to expose high on-chip memory bandwidth and low latency to memory for many data-oriented operations. Diverse structures are being pursued, including system on a chip, which places DRAM banks and a conventional processor core on the same chip; SMP on a chip, which places multiple conventional processor cores and a three-level coherent cache hierarchical structure on a single chip; and Smart Memory, which puts logic at the sense amps of the DRAM memory for in-place data manipulation. PIMs can be used as standalone systems, in arrays of like devices, or as a smart layer of a larger conventional multiprocessor.

MPPs (massively parallel processors) constitute a broad class of multiprocessor architectures that exploit off-the-shelf microprocessors and memory chips in custom designs of node boards, memory hierarchies, and global system area networks. Ironically, "MPP" was first used in the context of SIMD rather than MIMD (multiple instruction, multiple data) machines. MPPs range from distributed-memory machines such as the Intel Paragon, through shared-memory machines without coherent caches such as the BBN Butterfly and CRI T3E, to true CC-NUMA (cache-coherent non-uniform memory access) machines such as the HP Exemplar and the SGI Origin2000.

A cluster is an ensemble of off-the-shelf computers integrated by an interconnection network and operating within a single administrative domain, usually within a single machine room. Commodity clusters employ commercially available networks (e.g., Ethernet, Myrinet) as opposed to custom networks (e.g., IBM SP-2). Beowulf-class clusters incorporate mass-market PC technology for their compute nodes to achieve the best price/performance.

Distributed computing, once referred to as "metacomputing", combines the processing capabilities of numerous, widely separated computer systems via the Internet. Whether accomplished by special arrangement among the participants, by means of disciplines referred to as Grid computing, or by agreements of myriad workstation and PC owners with some commercial (e.g., DSI, Entropia) or philanthropic (e.g., SETI@home) coordinating host organization, this class of parallel computing exploits available cycles on existing computers and PCs, thereby getting something for almost nothing.

In this book, we are interested in commodity clusters and, in particular, those employing PCs for best price/performance, specifically, Beowulf-class cluster systems. Commodity clusters may be subdivided into four classes, which are briefly discussed here.

Workstation clusters: ensembles of workstations (e.g., Sun, SGI) integrated by a system area network. They tend to be vendor specific in hardware and software. While they exhibit superior price/performance over MPPs for many problems, they can cost a factor of 2.5 to 4 more than comparable PC-based clusters.

Beowulf-class systems: ensembles of PCs (e.g., Intel Pentium 4) integrated with COTS (commercial off-the-shelf) local area networks (e.g., Fast Ethernet) or system area networks (e.g., Myrinet), running widely available low-cost or no-cost software for managing system resources and coordinating parallel execution. Such systems exhibit exceptional price/performance for many applications.

Cluster farms: existing local area networks of PCs and workstations, serving either as dedicated user stations or as servers, that can be employed when idle to perform pending work from outside users. Software systems (e.g., Condor) have been devised to exploit job stream parallelism by distributing queued work to idle machines while keeping that work from intruding on user resources when they are required. These systems are of lower performance and effectiveness because of the shared network integrating the resources, as opposed to the dedicated networks incorporated by workstation clusters and Beowulfs.

Superclusters: clusters of clusters, still within a local area such as a shared machine room or separate buildings on the same industrial or academic campus, usually integrated by the institution's infrastructure backbone wide area network. Although usually within the same internet domain, the clusters may be under separate ownership and administrative responsibilities. Nonetheless, organizations are striving to determine ways to enjoy the potential opportunities of partnering multiple local clusters to realize very large scale computing at least part of the time.
