2015-10-29

OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing applications incrementally. The OpenACC API describes a collection of compiler directives to specify loops and regions of code in standard C, C++, and Fortran to be offloaded from a host CPU to an attached accelerator. OpenACC is designed for portability across operating systems, host CPUs, and a wide range of accelerators, including APUs, GPUs, and many-core coprocessors.

And starting today, with the PGI Compiler 15.10 release, OpenACC enables performance portability between accelerators and multicore CPUs. For the first time, the new PGI Fortran, C, and C++ compilers allow OpenACC-enabled source code to be compiled for parallel execution on either a multicore CPU or a GPU accelerator. This capability provides tremendous flexibility for programmers, enabling applications to take advantage of multiple system architectures with a single version of the source code.

“Our goal is to enable HPC developers to easily port applications across all major CPU and accelerator platforms with uniformly high performance using a common source code base,” said Douglas Miles, director of PGI Compilers & Tools at NVIDIA. “This capability will be particularly important in the race towards exascale computing in which there will be a variety of system architectures requiring a more flexible application programming approach.”

As the chart above shows, performance on multicore CPUs for HPC apps using MPI + OpenACC is equivalent to that of the same apps using MPI + OpenMP. Compiling and running the same code on a Tesla K80 GPU can provide large speedups.

Key benefits of running OpenACC on multicore CPUs include:

Effective utilization of all cores of a multicore CPU or multi-socket server for parallel execution

Common programming model across CPUs and GPUs in Fortran, C, and C++

Rapid exploitation of existing multicore parallelism in a program using the KERNELS directive, which enables incremental optimization for parallel execution

Scalable performance across multicore CPUs and GPUs

PGI’s compiler roadmap, shown below, includes plans to support all of the compute processors that are likely to be viable building blocks for exascale systems.

How to Compile OpenACC Applications for Multicore CPUs

Passing the flag -ta=multicore on the PGI compiler (pgcc, pgc++ or pgfortran) command line tells the compiler to generate parallel multicore code for OpenACC compute regions, instead of the default of generating parallel GPU kernels. The parallel multicore code will execute in much the same fashion as if you had used OpenMP omp parallel directives instead of OpenACC compute regions.
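As a minimal sketch of the kind of loop this flag applies to, the following C function marks a SAXPY-style loop as an OpenACC compute region. The function name saxpy and the file name in the comment are illustrative assumptions, not taken from the PGI release; the flag in the comment is the one described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] for i in [0, n).
 * Assumed build line, using the flag described above:
 *   pgcc -ta=multicore saxpy.c
 * A compiler without OpenACC support ignores the pragma and runs the
 * loop sequentially, with the same result. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);

    /* Iterations are independent, so they can be divided among cores.
     * The data clauses matter for a GPU target with its own memory. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The same source file can be built for a GPU by dropping -ta=multicore, since generating GPU kernels is the default.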

Adding -Minfo or -Minfo=accel will enable compiler feedback messages, giving details about the parallel code generated, as in the following.

You can control how many threads the program will use to run the parallel compute regions with the environment variable ACC_NUM_CORES. The default is to use all available cores on the system. For Linux targets, the runtime will launch as many threads as physical cores (not hyper-threaded logical cores). OpenACC gang-parallel loops run in parallel across the threads. If you have an OpenACC parallel construct with a num_gangs(200) clause, the runtime will take the minimum of the num_gangs argument and the number of cores on the system, and launch that many threads. That avoids the problem of launching hundreds or thousands of gangs, which makes sense on a GPU but which would overload a multicore CPU.
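The thread-count rule above can be sketched in C as follows. The helper effective_threads is hypothetical and only models the rule as stated; it is not part of the OpenACC API or the PGI runtime. The function sum_array is likewise an illustrative name, showing a gang-parallel loop that requests 200 gangs.

```c
#include <assert.h>

/* Hypothetical helper, not part of any OpenACC or PGI API: models the rule
 * described above, where the thread count is the smaller of the requested
 * gang count and the number of cores on the system. */
int effective_threads(int num_gangs, int cores)
{
    assert(num_gangs > 0 && cores > 0);
    return num_gangs < cores ? num_gangs : cores;
}

/* Sum n values with a gang-parallel reduction. With num_gangs(200) on a
 * 16-core host, the rule above implies 16 threads rather than 200. */
double sum_array(int n, const double *v)
{
    double total = 0.0;

    #pragma acc parallel loop gang num_gangs(200) copyin(v[0:n]) reduction(+:total)
    for (int i = 0; i < n; ++i) {
        total += v[i];
    }
    return total;
}
```

Under these assumptions, effective_threads(200, 16) gives 16 and effective_threads(4, 16) gives 4, so a small gang count still limits the number of threads.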

Single Programming Model, Portable High Performance

The goal of OpenACC is to have a single programming model that allows developers to write a single program that runs with high performance in parallel across a wide range of target systems. For the last few years, PGI has been developing and delivering OpenACC compilers targeting NVIDIA Tesla and AMD Radeon GPUs. But performance portability requires being able to run the same program with high performance in parallel on non-GPU targets, and in particular on multicore and manycore CPUs. So, the first reason to use OpenACC with -ta=multicore is if you have an application that you want to run both on systems with GPUs and on systems without GPUs but with multicore CPUs. This allows you to develop your program once, without compile-time conditionals (ifdefs) or special modules for each target, and without the increased development and maintenance cost they bring.

Even if you are only interested in GPU-accelerated targets, you can do parallel OpenACC code development and testing on your multicore laptop or workstation without a GPU. This can separate algorithm development from GPU performance tuning. Debugging is often easier on the host than with a heterogeneous binary with both host and GPU code.

Working Through an Example: Please Do Try This at Home!

To demonstrate the performance shown in the chart above, you can download the version of miniGhost used to generate the performance numbers from the PGI website.

To build the OpenMP version for execution on multicore, issue the following commands.

This example uses the PGI OpenMP compiler, but the OpenMP time in the chart above was measured with the Intel OpenMP compiler. You’ll see about the same execution time using either of these two OpenMP compilers.

To build the OpenACC version for multicore using PGI, issue the following commands.

Finally, to build the OpenACC version for execution on an NVIDIA GPU using PGI, issue the following commands.

OpenACC Data Clauses on Multicore

In the OpenACC execution model, the multicore CPU is treated like an accelerator device that shares memory with the initial host thread. With a shared-memory device, most of the OpenACC data clauses (copy, copyin, copyout, create) are ignored, and the accelerator device (the parallel multicore) uses the same data as the initial host thread. Similarly, update directives and most OpenACC data API routines will not generate data allocation or movement. Other data clauses are still honored, such as private and reduction, which may require some dynamic memory allocation and data movement, but no more than the corresponding OpenMP data clauses.
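To make the treatment of data clauses concrete, here is a minimal sketch in C. The function name scale_and_sum is an illustrative assumption. The copy clause is one of those described above as ignored on a shared-memory target, while the reduction clause is one that is still honored.

```c
#include <assert.h>

/* Hypothetical example: scale a vector in place and return the sum of the
 * scaled values. On a GPU target the data region allocates device memory
 * and copies v in and out. On a shared-memory multicore target, as
 * described above, the copy clause is ignored and the threads work on v
 * directly. The reduction clause is honored on both targets. */
double scale_and_sum(int n, double factor, double *v)
{
    double total = 0.0;
    assert(n >= 0);

    #pragma acc data copy(v[0:n])
    {
        #pragma acc parallel loop present(v[0:n]) reduction(+:total)
        for (int i = 0; i < n; ++i) {
            v[i] *= factor;
            total += v[i];
        }
    }
    return total;
}
```

Because the data clauses cost nothing on multicore, they can stay in the source for the benefit of GPU builds.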

When using OpenACC with a GPU, data gets copied from the system memory to device memory (and back). The user is responsible for keeping the two copies of data coherent, as needed. When using OpenACC on a multicore, there is only one copy of the data, so there is no coherence problem. However, the GPU OpenACC program can produce different results than a multicore OpenACC program if the program depends on the parallel compute regions updating a different copy of the data than the sequential initial host thread regions.
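The difference described above can be sketched with a small C function whose result depends on whether the device shares memory with the host. The function name read_inside_data_region is hypothetical, and the code is an illustration of the pitfall rather than a pattern to copy.

```c
#include <assert.h>

/* Hypothetical example of code whose result depends on the target.
 * The compute region doubles v inside a data region, and the host then
 * reads v[0] before the data region ends and without an update directive. */
double read_inside_data_region(int n, double *v)
{
    double seen = 0.0;
    assert(n > 0);

    #pragma acc data copy(v[0:n])
    {
        #pragma acc parallel loop present(v[0:n])
        for (int i = 0; i < n; ++i) {
            v[i] *= 2.0;
        }

        /* Shared memory (multicore): the host sees the doubled value.
         * Separate device memory (GPU): the host copy is still stale here.
         * Adding "#pragma acc update host(v[0:1])" before this line would
         * make both targets agree. */
        seen = v[0];
    }
    return seen;
}
```

On a multicore build, a call with v[0] equal to 3.0 returns 6.0; on a GPU build without the update directive, the host would be expected to read the original 3.0 instead.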

Developer.nvidia.com

Performance Portability from GPUs to CPUs with OpenACC