2014-01-17

This is a technical article so skip it if you read my blog for pithy tech commentary and wisdom.

So I’m working with a beta version of CUDA 6.0 from nvidia that enables a new kind of GPU memory management called Unified Memory or “Managed Memory”, which makes it much easier to migrate complex C++ class objects with deep data structures to the GPU. This of course is a GREAT feature for enabling much more generalized use of the GPU. Since I’m on the bleeding edge of adoption here I thought I would write an article about my first approach and experiences with porting C++ objects to CUDA 6.0, for the benefit of those who will come after me with the same challenges.

As a little context, prior to this moment in GPU computing history, writing GPU code was largely constrained to one of two approaches.

1) Writing individual GPU functions that could be called from the CPU code to accelerate a specific function call with massive parallelism.

2) Using technologies like AMP, OpenACC or OpenMP to enable the compiler to automatically parallelize for-loops on the GPU.

Both approaches enabled developers to kick a narrow range of common compute-intensive operations up to the GPU for massive acceleration, but still largely relied on the primary code base existing and running on the CPU. The introduction of Unified Memory in CUDA makes it practical, for the first time, to move huge bodies of general C++ code up to the GPU and to write and run entire complex code systems on the GPU with minimal CPU governance. In theory a big leap, but not without some new challenges. In this article I will present my first experience and approach to accomplishing this, with the caveat that I do not wish to present myself as an authority on this subject. It’s new and I’m just learning it myself. Nonetheless it’s very exciting and I wanted to pass on my early adopter wisdom, such as it is.

To keep this article concise I’m not going to attempt to explain the entire CUDA memory architecture here; the documentation does a great job of that. What I’m going to focus on is the specific approach I’ve taken to porting C++ class objects and deep data structures to the GPU. As a starting point I began my effort with Mark Harris’s great blog article on Unified Memory:

http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/#more-2221

The “magic” of porting C++ objects to Unified Memory is captured here with the simple overloading of a class object’s new and delete operators to allocate Unified Memory for an object INSTEAD of CPU memory:
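What follows is a minimal sketch of that pattern, essentially the Managed class from Mark’s article (the snippet in his post may differ slightly in detail):

    #include <cuda_runtime.h>

    class Managed {
    public:
        void *operator new(size_t len) {
            void *ptr;
            // Allocate the object in Unified Memory instead of the CPU heap
            cudaMallocManaged(&ptr, len);
            cudaDeviceSynchronize();
            return ptr;
        }

        void operator delete(void *ptr) {
            // Make sure no kernel is still using the object before freeing it
            cudaDeviceSynchronize();
            cudaFree(ptr);
        }
    };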

Simple, very elegant. It worked great. Obviously, to port an existing object, I would need an easy way to test it against the functionality of its pure CPU implementation, which I accomplished by modifying Mark’s code as follows:
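Roughly like this; CUDA_MANAGED here is a compile-time flag of my own, and the exact mechanics in my code differ only in detail:

    // Compile-time toggle between Unified Memory and ordinary CPU allocation
    #define CUDA_MANAGED true

    class Managed {
    public:
        void *operator new(size_t len) {
            void *ptr;
            if (CUDA_MANAGED) {
                cudaMallocManaged(&ptr, len);   // shared CPU/GPU memory
                cudaDeviceSynchronize();
            } else {
                ptr = ::operator new(len);      // plain CPU heap allocation
            }
            return ptr;
        }

        void operator delete(void *ptr) {
            if (CUDA_MANAGED) {
                cudaDeviceSynchronize();
                cudaFree(ptr);
            } else {
                ::operator delete(ptr);
            }
        }
    };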

Now I can just toggle between ordinary CPU memory management and UM memory by setting the CUDA_MANAGED variable to true or false. When you instantiate a C++ object that inherits its new and delete operators from the Managed class, it gets created in shared CPU/GPU memory automatically! Deep data structure copies also happen automatically, as long as those structures ALSO inherit the UM new and delete operators. However there is obviously little benefit to doing this IF the member functions themselves only work on the CPU! Unfortunately the state of the art in GPU programming does not yet automatically turn my C++ code into GPU code… hopefully that will come one day soon, but in the meantime I need a strategy to convert each of my class member functions into GPU kernels. How is this accomplished?
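Before getting to that, here’s a quick illustration of the deep-data-structure point above (Inner and Outer are hypothetical stand-ins, not classes from my actual code):

    // Both the containing object and its nested member derive from Managed,
    // so the entire structure lands in Unified Memory with no explicit copies
    class Inner : public Managed {
    public:
        int data[256];
    };

    class Outer : public Managed {
    public:
        Inner *child;
        Outer()  { child = new Inner(); }   // child is also allocated in UM
        ~Outer() { delete child; }
    };

    Outer *obj = new Outer();   // usable from both CPU code and GPU kernels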

Before we begin, I should mention that most nvidia documentation on the use of CUDA places a huge emphasis on code optimization, which often obscures how simple basic porting can be. Although CUDA optimization is often essential to achieving the incredible performance potential of the GPU, it also often overcomplicates the initial steps required to port CPU code and verify that it is working properly FIRST. Preserving the CPU-only functionality of the class not only increases its portability and generality, it provides an essential testing template for verifying that the GPU version of the code is working correctly, especially when you begin optimizing it for the GPU.

To accomplish this I structure my initial C++ code as follows, inheriting from my Managed class:
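Something along these lines; MyObject and its foo() are just simplified stand-ins for my real classes:

    class MyObject : public Managed {
    public:
        int value;

        // The existing CPU-only member function I want to port to the GPU
        int foo() {
            return value * 2;
        }
    };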

In this simple example I want to “port” foo() to a GPU kernel but preserve its utility as a CPU-only function. To accomplish this task we first convert foo() into a CUDA __host__ __device__ function as follows:
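Sticking with the hypothetical MyObject example, the converted function looks roughly like this:

    class MyObject : public Managed {
    public:
        int value;

        int foo();   // becomes a CPU/GPU redirection call, shown below

        // Dual-purpose implementation, compiled for both the CPU and the GPU
        __host__ __device__ void d_foo(int *result) {
            *result = value * 2;
        }
    };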

The result parameter is added because CUDA functions can’t return values directly; you have to pass a pointer to the memory you want the result returned in. This function is dual purpose: when CUDA_MANAGED == false it will be compiled as CPU code and work normally as a CPU function, but when CUDA_MANAGED == true it will also compile as a GPU function. Now I need to modify my original foo() function to differentiate between the CPU and GPU implementations.
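Here is a sketch of the redirected foo() for the hypothetical MyObject; the result is staged in a small managed allocation so the kernel can write to it:

    __global__ void g_foo(MyObject *obj, int *result);   // kernel, defined below

    int MyObject::foo() {
        int result = 0;
        if (CUDA_MANAGED) {
            int *gpuResult;
            cudaMallocManaged(&gpuResult, sizeof(int));   // kernel-visible result
            g_foo<<<1, 1>>>(this, gpuResult);             // single-thread launch
            cudaDeviceSynchronize();
            result = *gpuResult;
            cudaFree(gpuResult);
        } else {
            d_foo(&result);                               // plain CPU call
        }
        return result;
    }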

Now the old foo() function has been converted into a redirection call that chooses between the CPU and GPU implementations depending on whether or not the code was compiled to use Unified Memory. As ordinary CPU code, the d_foo() function gets called and behaves just as the original foo() function did. If, however, CUDA_MANAGED == true, it calls a kernel I have not created yet called g_foo() in a single CUDA thread. Because device functions CANNOT be called directly from the CPU, we need a CUDA kernel function to call the device version of d_foo() for us. This may seem like a lot of indirection, but the nvidia compiler will actually inline the d_foo() device function into the g_foo() redirection kernel at compile time.

Here’s what g_foo() needs to look like:
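Again in terms of the hypothetical MyObject:

    // Trivial kernel whose only job is to call the device version of foo()
    __global__ void g_foo(MyObject *obj, int *result) {
        obj->d_foo(result);
    }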

Nothing to it… just a trivial kernel that calls d_foo() on the GPU. The important “porting” feature of this approach is here in foo():
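Namely this launch, shown with my hypothetical names:

    g_foo<<<1, 1>>>(this, gpuResult);   // one block, one thread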

I’m calling the g_foo() kernel as a single CUDA thread. Stupid, not parallel, silly implementation BUT very handy for testing and verifying that the function is working in CUDA as expected. That’s it! Using this approach I ported several thousand lines of C++ code and half a dozen objects to CUDA 6.0 in a couple of days. Having accomplished this, I am now in the process of writing unit tests that leverage the CPU code to verify that the GPU code is working correctly (a sketch of the idea appears below), which will be essential when I begin actually parallelizing the GPU code and CUDA optimizing it. I’ll write a subsequent article on how that goes, once I’ve figured it out myself.
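As a flavor of what those tests look like, here’s a minimal sanity check for the hypothetical MyObject, comparing the direct CPU call against the round trip through the g_foo() kernel:

    #include <cassert>

    void test_foo() {
        MyObject *obj = new MyObject();   // allocated in Unified Memory
        obj->value = 21;

        int cpuResult = 0;
        obj->d_foo(&cpuResult);           // runs on the CPU

        int gpuResult = obj->foo();       // with CUDA_MANAGED == true, runs on the GPU

        assert(cpuResult == gpuResult);
        delete obj;
    }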
