Compute Constructs and Parallelizing Loops
Compute Constructs¶
In our first exercise, we will look at how to offload computation to the device (GPU), since the main aim of OpenACC is to make such offloading easy through compiler directives and its API. OpenACC provides two ways to offload computation to the GPU, explained below:
- OpenACC provides two compute constructs to parallelize computation
- The first one is `parallel`, and the second is `kernels`
- Both compute constructs achieve more or less the same result
- However, with `kernels` the compiler has more control over the parallel region
- Therefore, as a programmer, if you are very familiar with what you are doing in the parallel region, you may use `parallel`; otherwise, it is better to use `kernels`, because the compiler will take care of safe parallelization under the `kernels` construct
At the same time, in order to use OpenACC constructs, clauses, runtime routines, environment variables, etc., we need to include the OpenACC header file as follows:
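For C/C++ this is the `openacc.h` header; a Fortran code would instead use the `openacc` module:

#include <openacc.h>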
To create a parallel region in OpenACC, we use the following compute constructs:
Parallel Construct
Available clauses for parallel
async [ ( int-expr ) ]
wait [ ( int-expr-list ) ]
num_gangs( int-expr )
num_workers( int-expr )
vector_length( int-expr )
device_type( device-type-list )
if( condition )
self [ ( condition ) ]
reduction( operator : var-list )
copy( var-list )
copyin( [ readonly: ] var-list )
copyout( [ zero: ] var-list )
create( [ zero: ] var-list )
no_create( var-list )
present( var-list )
deviceptr( var-list )
attach( var-list )
private( var-list )
firstprivate( var-list )
default( none | present )
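As a sketch of how the construct is written in C (the clause used here is just an illustrative choice, not a requirement):

#include <stdio.h>

int main()
{
  // The structured block below is offloaded to the GPU; without a
  // loop directive, each gang executes the block redundantly.
  #pragma acc parallel num_gangs(1)
  {
    printf("Hello from an OpenACC parallel region!\n");
  }
  return 0;
}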
Kernels Construct
Available clauses for kernels
async [ ( int-expr ) ]
wait [ ( int-expr-list ) ]
num_gangs( int-expr )
num_workers( int-expr )
vector_length( int-expr )
device_type( device-type-list )
if( condition )
self [ ( condition ) ]
copy( var-list )
copyin( [ readonly: ] var-list )
copyout( [ zero: ] var-list )
create( [ zero: ] var-list )
no_create( var-list )
present( var-list )
deviceptr( var-list )
attach( var-list )
default( none | present )
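For comparison, a minimal `kernels` region might look like the sketch below; the compiler analyzes the enclosed code and generates GPU kernels for whatever it can prove safe to parallelize:

#include <stdio.h>

int main()
{
  int a[100];
  // The compiler decides how to map this region onto the GPU,
  // including the implicit data movement for the array a.
  #pragma acc kernels
  {
    for (int i = 0; i < 100; i++)
      a[i] = 2 * i;
  }
  printf("a[99] = %d\n", a[99]);
  return 0;
}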
Compilers¶
The following compilers support the OpenACC programming model:

- GNU - It is open source and can be used for Nvidia and AMD GPUs
- Nvidia HPC SDK - It is from Nvidia and works very well for Nvidia GPUs
- HPE - Presently it supports only Fortran (not C/C++)
Examples (GNU, Nvidia HPC SDK and HPE): Compilation
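The exact flags depend on the compiler version and the target GPU, but typical compile lines look roughly like this (cc80 targets an Nvidia A100; the file names are illustrative):

// GNU (GCC built with offloading support)
$ gcc -fopenacc -foffload=nvptx-none hello.c -o hello
// Nvidia HPC SDK
$ nvc -fast -acc=gpu -gpu=cc80 -Minfo=accel hello.c -o hello
// HPE Cray (Fortran only)
$ ftn -hacc hello.f90 -o hello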
Questions and Solutions¶
Examples: Hello World
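A minimal version of such a program might look like this sketch:

#include <stdio.h>

int main()
{
  // Offload a single print statement to the GPU.
  #pragma acc parallel num_gangs(1)
  {
    printf("Hello World from GPU!\n");
  }
  return 0;
}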
Compilation and Output
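With the Nvidia HPC SDK, compiling and running the sketch above would look roughly as follows (file name illustrative):

// compilation
$ nvc -fast -acc=gpu -gpu=cc80 -Minfo=accel Hello-world.c -o Hello-World-GPU
// execution
$ ./Hello-World-GPU
// output
Hello World from GPU!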
Loop¶
Our second exercise is to work on how to parallelize a loop. Most of the time, the intense computation happens inside loops, and in such situations it is more efficient to parallelize those loops. To start with a simple example, we will extend our previous example to print `Hello World from GPU` multiple times. Note that just adding `#pragma acc parallel` or `#pragma acc kernels` would not parallelize your computation; it would only ensure that the computation is executed on the device. To distribute the loop iterations across the device's parallelism, we additionally use the `loop` directive.
Available clauses for loop
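For reference, the OpenACC specification defines the following clauses for loop:

collapse( n )
gang [ ( gang-arg-list ) ]
worker [ ( [ num: ] int-expr ) ]
vector [ ( [ length: ] int-expr ) ]
seq
auto
independent
tile( size-expr-list )
device_type( device-type-list )
private( var-list )
reduction( operator : var-list )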
Questions and Solutions¶
Examples: Loop (Hello World)
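Minimal sketches of the two variants are shown below; the file names match the compile lines that follow, while the exact line numbers reported by the compiler depend on the file layout.

// Hello-world-parallel-loop.c (sketch)
#include <stdio.h>

int main()
{
  // parallel starts the region; loop distributes the iterations.
  #pragma acc parallel loop
  for (int i = 0; i < 5; i++)
  {
    printf("Hello World from GPU!\n");
  }
  return 0;
}

// Hello-world-kernels-loop.c (sketch)
#include <stdio.h>

int main()
{
  // Under kernels, the compiler checks that the loop is parallelizable
  // and chooses the mapping (gang/vector) itself.
  #pragma acc kernels
  for (int i = 0; i < 5; i++)
  {
    printf("Hello World from GPU!\n");
  }
  return 0;
}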
Compilation and Output
// compilation
$ nvc -fast -acc=gpu -gpu=cc80 -Minfo=accel Hello-world-parallel-loop.c -o Hello-World-GPU
main:
5, Generating NVIDIA GPU code
7, #pragma acc loop gang /* blockIdx.x */
// execution
$ ./Hello-World-GPU
// output
Hello World from GPU!
Hello World from GPU!
Hello World from GPU!
Hello World from GPU!
Hello World from GPU!
// compilation
$ nvc -fast -acc=gpu -gpu=cc80 -Minfo=accel Hello-world-kernels-loop.c -o Hello-World-GPU
main:
7, Loop is parallelizable
Generating NVIDIA GPU code
7, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
// execution
$ ./Hello-World-GPU
// output
Hello World from GPU!
Hello World from GPU!
Hello World from GPU!
Hello World from GPU!
Hello World from GPU!