Indepth

Picking the RapidMind

RapidMind has a mind to make advances in the use of multicore programming rapidly available.

Nicholas Petreley

Writing applications to support multiple CPU cores is not an easy task, and it can be even harder to take a huge existing application and adapt it for multiple cores. So I figured the real breakthrough was likely to be years away. It seems, however, that RapidMind has a solution for this problem that doesn't require a massive overhaul of an existing application, and its solution is already available.

We invited RapidMind's President and CEO Ray DePaul and Founder and Chief Scientist Michael McCool to talk about RapidMind's approach to exploiting the power of multicore systems.

We deemed it important to look at RapidMind, because it seems as if we're finally entering the age of parallel processing on the desktop as chip manufacturers bump up against the practical limits of Moore's Law. Everything from graphics cards to PlayStation 3 consoles exploits parallel processing these days. I have an Intel quad-core processor in my workstation. Although I'm happy with it, I find that the only time I truly appreciate having this multicore chip is when I run multiple applications simultaneously or run multiple processes, such as with the command make -j 5. If anything, single-threaded applications run slower on this chip than on the single-core CPU I used to run, because each core in the Intel chip is significantly slower (2GHz vs. 3GHz).

So how does RapidMind bridge the gap between existing software and the changing computational model?

LJ: Could you give us a brief description of RapidMind, and the problem it is designed to solve?

DePaul: RapidMind is a multicore software platform that allows software organizations to leverage the performance of multicore processors and accelerators to gain a real competitive advantage in their industry. With RapidMind, you can develop parallel applications with minimal impact on your development lifecycle, costs and timelines. And, we allow you to accomplish this without the need for multithreading. You leverage existing skills, existing compilers and IDEs and take advantage of all key multicore architectures without constantly porting your application.

LJ: So is it accurate to say RapidMind is actually a library of common C/C++ operations, where the exploitation of multiple cores is largely transparent to the programmer?

McCool: RapidMind is much more than a simple library of "canned functions". In fact, it is possible to use the API to the RapidMind platform to specify an arbitrary computation, and for that computation to execute in parallel with a very high level of performance. We provide a sophisticated multicore software platform that can leverage many levels of parallelization, but at the same time allows developers to express their own computations in a very familiar, single-threaded way.

LJ: How much, if anything, does the programmer need to know about parallel processing programming techniques in order to use RapidMind?

McCool: We believe that developers are the application experts and should have some involvement in moving their applications into the parallel world. The key is to let developers leverage what they already know, rather than force them down an unfamiliar and frustrating path. RapidMind is built upon concepts already familiar to all developers: arrays and functions. It is not necessary for a developer to work directly with threads, vectorization, cores or synchronization. Fundamentally, a developer can apply functions to arrays, and this automatically invokes parallel execution. A RapidMind-enabled program is a single-threaded sequence of parallel operations and is much easier to understand, code and test than the multithreaded model of parallel programming.

LJ: Can you give us a simple code example (the includes and declaration statements that would start a typical program)?

McCool: First, you include the platform header file and optionally activate the RapidMind namespace:

#include <rapidmind/platform.hpp>
using namespace rapidmind;

Next, you can declare variables using RapidMind types for numbers and arrays:

The Value1f type is basically equivalent to a float, and the Array types are used to manage large collections of data. These can be declared anywhere you would normally declare C++ variables: as members of classes or as local or global variables.

A Program object is the RapidMind representation of a function and is created by enclosing a sequence of operations on RapidMind types between RM_BEGIN and RM_END. The operations will then be stored in the Program object. For example, suppose we want to add a value f, represented using a global variable, to every element of an array. We would create a program object prog as follows:

Program prog = RM_BEGIN {
    In<Value1f> c;
    Out<Value1f> d;
    d = c + f;
} RM_END;

Note that although the program may run on a co-processor, we can just refer to external values like f in the same way we would from a function definition. It is not necessary to write any other code to set up the communication between the host processor and any co-processors.

To apply this operation to array a and put the result in array b, invoking a parallel computation, we just use the program object like a function:

b = prog(a);

Of course, in real applications, program objects can contain a large number of operations, and a sequence of program objects and collective operations on arrays (such as scatter, gather and reduce) would be used.
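To make the semantics concrete, here is what the example program computes, written as an ordinary sequential C loop. This is only a sketch for comparison: it is not RapidMind code, and the function name add_scalar is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential equivalent of the example program: out[i] = in[i] + f.
 * Plain C, not the RapidMind API; add_scalar is a hypothetical name.
 * No iteration depends on any other, which is the property that lets
 * a platform spread the same operation across many cores. */
void add_scalar(const float *in, float *out, size_t n, float f)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] + f;
}
```

Calling add_scalar(a, b, n, f) on two float arrays of length n fills b with the same values the interview's example describes for array b, one element at a time.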

LJ: How do you avoid the common pitfalls of parallel processing, such as deadlocks or other synchronization issues?

McCool: The semantics of the RapidMind interface does not involve explicit locking or synchronization by the developer. The platform itself automatically takes care of these issues when necessary at a lower level in the runtime platform. The developer cannot specify programs that deadlock or that have race conditions, in the same way that a Java developer cannot specify programs that have memory leaks.

LJ: I see Hewlett-Packard software ran 32.2 times faster after the software was adapted to use RapidMind. How long did it take to modify the software to use RapidMind?

McCool: Our collaboration with HP was a great test of our platform. Roughly the same amount of time was taken to RapidMind-enable the application as was taken by HP to tune its single-core baseline version. The tuning by HP sped up its version by a factor of 4, whereas RapidMind running on an NVIDIA 7900 GPU outperformed that by a factor of more than 32. More recently, we have run the same code on an NVIDIA 8800 GPU and sped it up by an additional factor of 5, and we also have run the RapidMind version on our multicore CPU quad-core product and achieved a speedup of 8 over HP's version.

So the benefit to the software organization is quite startling. For the same effort, you can use RapidMind not only to get significantly higher performance on the same multicore processors you're already targeting, but you can leverage the additional performance of accelerators as well. The RapidMind version also will scale automatically to future processors with more cores.

LJ: Is the speed increase in the HP software typical or "best case"? What software is most likely to see speed increases? Database server software? Complex queries on data warehousing? Spam filtering? Web browsers? Something else?

McCool: We have seen large speedups on a wide range of applications, including database operations, image and video processing, financial modeling, pattern matching and analysis, many different kinds of scientific computation; the list goes on and on. The RapidMind platform supports a general-purpose programming model and can be applied to any kind of computation. The HP test was compute-bound, and it could take advantage of the high compute performance of GPUs. However, in memory-bound applications, we have also seen a significant benefit, over an order of magnitude, from running the application on RapidMind. RapidMind not only manages parallel execution, it also manages data flow and so can also directly address the memory bottleneck. As a software platform company, we are constantly surprised by the variety of applications that developers are RapidMind-enabling. Prior to the launch of our v2.0 product in May 2007, we had more than 1,000 developers from many different industries in our Beta program. The problem is industry-wide, and we have developed a platform that has very broad applicability.

LJ: Shouldn't this kind of adaptation to multiple cores take place in something more fundamental like the GNU C Library? Is it only a matter of time before such libraries catch up?

McCool: Simply parallelizing the standard library functions would not have the same benefit, because they do not, individually, do enough work. RapidMind programs, in contrast, can do an arbitrary amount of user-specified parallel computation.

Although RapidMind looks like a library to the developer, it's important to realize that most of the work is done by the runtime platform. The challenge facing multicore developers is not one that can be solved solely with libraries. Developers need a system that efficiently takes care of the complexities of multicore: processor-specific optimization, data management, dynamic load balancing, scaling for additional cores and multiple levels of parallelization. The RapidMind platform performs all of these functions.

LJ: You support multiple platforms on different levels. For example, you can exploit the processors on NVIDIA and ATI graphics cards, the Cell processor, as well as multicore CPUs. In addition, you support both Linux and Windows, correct?

DePaul: The processor vendors are delivering some exciting and disruptive innovations. Software companies are faced with some tough choices—which vendors and which architectures should they support. By leveraging RapidMind, they get to benefit from all of the hardware innovations and deliver better products to their customers within their current development cycles and timelines. RapidMind will continue to provide portable performance across a range of both processors and operating systems. We will support future multicore and many-core processors, so applications written with RapidMind today are future-proofed and can automatically take advantage of new architectures that will likely arise, such as increases in the number of cores.

LJ: Can you tell us more about your recently demonstrated support for Intel and AMD multicore CPUs?

DePaul: It's difficult to overstate the value we bring to software companies targeting Intel and AMD multicore CPUs. For example, at SIGGRAPH in San Diego, we demonstrated a 10x performance improvement on an application running on eight CPU cores. RapidMind-enabled applications will scale to any number of cores, even across multiple processors, and will be tuned for both Intel and AMD architectures. Software organizations can now target multicore CPUs, as well as accelerators, such as ATI and NVIDIA GPUs and the Cell processor, all with the same source code.

LJ: Is there anything else you'd like to tell our readers?

DePaul: It's becoming clear that software organizations' plans for multicore processors and accelerators will be one of the most important initiatives they take this year. Companies that choose to do nothing will quickly find themselves behind the performance curve. Companies that embark on large complex multithreading projects will be frustrated with the costs and timelines, and in the end, largely disappointed with the outcome. We are fortunate to be partnering with a group of software organizations that see an opportunity to deliver substantial performance improvements to their customers without a devastating impact on their software development cycles.

Nicholas Petreley is Editor in Chief of Linux Journal and a former programmer, teacher, analyst and consultant who has been working with and writing about Linux for more than ten years.


High-Performance Network Programming in C

Programming techniques to get the best performance from your TCP applications.

Girish Venkatachalam

TCP/IP network programming in C on Linux is good fun. All the advanced features of the stack are at your disposal, and you can do a lot of interesting things in user space without getting into kernel programming.

Performance enhancement is as much an art as it is a science. It is an iterative process, akin to an artist gingerly stroking a painting with a fine brush, looking at the work from multiple angles at different distances until satisfied with the result.

The analogy to this artistic touch is the rich set of tools that Linux provides in order to measure network throughput and performance. Based on this, programmers tweak certain parameters or sometimes even re-engineer their solutions to achieve the expected results.

I won't dwell further upon the artistic side of high-performance programming. In this article, I focus on certain generic mechanisms that are guaranteed to provide a noticeable improvement. Based on this, you should be able to make the final touch with the help of the right tools.

I deal mostly with TCP, because the kernel does the bandwidth management and flow control for us. Of course, we no longer have to worry about reliability either. If you are interested in performance and high-volume traffic, you will arrive at TCP anyway.

What Is Bandwidth?

Once we answer that question, we can ask ourselves another useful question, "How can we get the best out of the available bandwidth?"

Bandwidth, as defined by Wikipedia, is the difference between the higher and lower cutoff frequencies of a communication channel. Cutoff frequencies are determined by basic laws of physics—nothing much we can do there.

But, there is a lot we can do elsewhere. According to Claude Shannon, the practically achievable bandwidth is determined by the level of noise in the channel, the data encoding used and so on. Taking a cue from Shannon's idea, we should "encode" our data in such a way that the protocol overhead is minimal and most of the bits are used to carry useful payload data.
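As a rough illustration of protocol overhead, the sketch below computes what fraction of each packet is payload. It assumes IPv4 and TCP headers of 20 bytes each with no options, and it ignores link-layer framing entirely; payload_efficiency is a hypothetical helper name, not something from a library.

```c
#include <assert.h>

/* Header sizes without options: IPv4 = 20 bytes, TCP = 20 bytes.
 * Link-layer framing (Ethernet and so on) is deliberately left out. */
enum { IPV4_HEADER = 20, TCP_HEADER = 20 };

/* Returns the percentage of each packet that is useful payload.
 * payload_efficiency is a hypothetical helper for illustration. */
double payload_efficiency(int payload_bytes)
{
    assert(payload_bytes > 0);
    return 100.0 * payload_bytes /
           (payload_bytes + IPV4_HEADER + TCP_HEADER);
}
```

Under these assumptions, a 1,460-byte payload uses a little over 97% of the packet for data, whereas a 1-byte payload uses under 3%, which is why many tiny writes waste bandwidth.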

TCP/IP packets work in a packet-switched environment. We have to contend with other nodes on the network. There is no concept of dedicated bandwidth in the LAN environment where your product is most likely to reside. This is something we can control with a bit of programming.

Non-Blocking TCP

Here's one way to maximize throughput if the bottleneck is your local LAN (this might also be the case in certain crowded ADSL deployments). Simply use multiple TCP connections. That way, you can ensure that you get all the attention at the expense of the other nodes in the LAN. This is the secret of download accelerators. They open multiple TCP connections to FTP and HTTP servers, download a file in pieces starting at multiple offsets and then reassemble it. This is not playing nicely, though.
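The piece-wise download idea comes down to simple offset arithmetic. The following sketch shows one way a file could be divided among connections; the struct and function names are invented for illustration, and it says nothing about how any particular download accelerator actually does it.

```c
#include <assert.h>

/* Inclusive byte range [start, end] assigned to one connection.
 * Both names below are hypothetical. */
struct byte_range { long start; long end; };

/* Splits file_size bytes into n_conn nearly equal pieces and returns
 * the piece for connection idx (0-based). The first
 * file_size % n_conn pieces carry one extra byte each. */
struct byte_range piece_for(long file_size, int n_conn, int idx)
{
    assert(n_conn > 0 && file_size >= n_conn);
    assert(idx >= 0 && idx < n_conn);
    long base = file_size / n_conn;
    long extra = file_size % n_conn;
    long start = idx * base + (idx < extra ? idx : extra);
    long len = base + (idx < extra ? 1 : 0);
    struct byte_range r = { start, start + len - 1 };
    return r;
}
```

Each connection would then request only its own range (over HTTP, for instance, with a Range request header) and write the data at the matching offset in the output file.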

We want to be well-behaved citizens, which is where non-blocking I/O comes in. The traditional approach of blocking reads and writes on the network is very easy to program, but if you are interested in filling the pipe available to you by pumping packets, you must use non-blocking TCP sockets. Listing 1 shows a simple code fragment using non-blocking sockets for network read and write.

Note that you should use fcntl(2) instead of setsockopt(2) for setting the socket file descriptor to non-blocking mode. Use poll(2) or select(2) to figure out when the socket is ready to read or write. With select(2), readiness for writing is reported only for descriptors placed in the write set (its third argument), so watch out for this.
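A minimal sketch of these two steps follows. It is not the article's Listing 1; set_nonblocking and write_all are hypothetical helper names, and the error handling is deliberately simple. It uses only fcntl(2), poll(2) and write(2), and works on any descriptor that supports O_NONBLOCK, including TCP sockets.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Puts fd into non-blocking mode, preserving its other status flags.
 * Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Writes all len bytes to a non-blocking fd. When the kernel buffer is
 * full (EAGAIN), waits with poll(2) until fd is writable again.
 * Returns 0 on success, -1 on error or timeout. */
int write_all(int fd, const char *buf, size_t len, int timeout_ms)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n > 0) {
            done += (size_t)n;            /* partial writes are normal */
        } else if (n == -1 && errno == EINTR) {
            continue;                     /* interrupted, just retry */
        } else if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            int rc = poll(&pfd, 1, timeout_ms);
            if (rc == 0)
                return -1;                /* timed out */
            if (rc == -1 && errno != EINTR)
                return -1;
        } else {
            return -1;
        }
    }
    return 0;
}
```

A real application would normally poll several descriptors at once and do other work while waiting; this version waits on a single descriptor only to keep the control flow visible.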

How does non-blocking I/O provide better throughput? The OS schedules the user process differently in the case of blocking and non-blocking I/O. When you block, the process "sleeps", which

Figure 1. Possibilities in Non-Blocking Write with Scatter/Gather I/O

Listing 1. nonblock.c
