
Compute Constructs and Parallelizing Loops

Compute Constructs

In our first exercise, we will look at how to offload computation to the device (GPU), since the main aim of OpenACC is to facilitate such offloading through its directives and API. OpenACC provides two variants for offloading computation to the GPU, explained as follows:

  • OpenACC provides two compute constructs to parallelize the computation
  • The first one is parallel, and the second is kernels
  • Both of these compute constructs perform more or less the same
  • However, under kernels the compiler has more control over the parallel region
  • Therefore, as a programmer, if you know exactly what you are doing in the parallel region, you may use parallel; otherwise, it is better to use kernels
  • This is because the compiler takes care of safe parallelization under the kernels construct

At the same time, in order to use OpenACC constructs, clauses, runtime routines, environment variables, etc., we need to include the OpenACC header (C/C++) or module (Fortran) as follows:

OpenACC library

#include <openacc.h>  /* C/C++ */
use openacc           ! Fortran
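
Besides the directives, the OpenACC header/module also gives access to the OpenACC runtime API. As a minimal illustration (not part of the exercises), the sketch below queries how many Nvidia devices are available with the standard runtime routine acc_get_num_devices:

#include <stdio.h>
#include <openacc.h>
int main()
{
  // query the number of available Nvidia GPUs via the OpenACC runtime API
  int num = acc_get_num_devices(acc_device_nvidia);
  printf("Number of Nvidia devices: %d\n", num);
  return 0;
}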

To create a parallel region in OpenACC, we use the following compute constructs:

Parallel Constructs

#pragma acc parallel [clause-list] new-line
   structured block
!$acc parallel [ clause-list ]
    structured block
!$acc end parallel
Available clauses for parallel
    async [ ( int-expr ) ]
    wait [ ( int-expr-list ) ]
    num_gangs( int-expr )
    num_workers( int-expr )
    vector_length( int-expr )
    device_type( device-type-list )
    if( condition )
    self [ ( condition ) ]
    reduction( operator : var-list )
    copy( var-list )
    copyin( [ readonly: ] var-list )
    copyout( [ zero: ] var-list )
    create( [ zero: ] var-list )
    no_create( var-list )
    present( var-list )
    deviceptr( var-list )
    attach( var-list )
    private( var-list )
    firstprivate( var-list )
    default( none | present )
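
As an illustration of how these clauses combine, here is a minimal sketch (not part of the exercise code) of a parallel construct that uses num_gangs as a tuning hint and copyin/copyout for data movement; the loop directive inside it is introduced later in this section:

#include <openacc.h>
#define N 1000
int main()
{
  float a[N], b[N];
  for (int i = 0; i < N; i++) a[i] = (float)i;

#pragma acc parallel num_gangs(32) copyin(a[0:N]) copyout(b[0:N])
  {
    // distribute the iterations across gangs (loop construct, see below)
#pragma acc loop
    for (int i = 0; i < N; i++)
      b[i] = 2.0f * a[i];
  }
  return 0;
}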

Kernels Constructs

#pragma acc kernels [ clause-list ] new-line
   structured block
!$acc kernels [ clause-list ]
   structured block
!$acc end kernels
Available clauses for kernels
async [ ( int-expr ) ]
wait [ ( int-expr-list ) ]
num_gangs( int-expr )
num_workers( int-expr )
vector_length( int-expr )
device_type( device-type-list )
if( condition )
self [ ( condition ) ]
copy( var-list )
copyin( [ readonly: ] var-list )
copyout( [ zero: ] var-list )
create( [ zero: ] var-list )
no_create( var-list )
present( var-list )
deviceptr( var-list )
attach( var-list )
default( none | present )
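
For comparison, here is a minimal kernels sketch (again, only an illustration and not part of the exercise code): the compiler analyzes the loops inside the region and decides how to parallelize them, guided by the data clauses:

#include <openacc.h>
#define N 1000
int main()
{
  float x[N], y[N];

#pragma acc kernels copyout(x[0:N], y[0:N])
  {
    // the compiler may turn each of these loops into a separate GPU kernel
    for (int i = 0; i < N; i++) x[i] = 1.0f;
    for (int i = 0; i < N; i++) y[i] = 2.0f * x[i];
  }
  return 0;
}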

Compilers

The following compilers support the OpenACC programming model.

  • GNU - Open source; can target both Nvidia and AMD GPUs
  • Nvidia HPC SDK - Provided by Nvidia; works very well for Nvidia GPUs
  • HPE - Presently supports Fortran only (not C/C++)

Examples (GNU, Nvidia HPC SDK and HPE): Compilation

$ nvc -fast -acc=gpu -gpu=cc80 -Minfo=accel test.c 
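
The line above uses the Nvidia HPC SDK (nvc). For GNU, a roughly equivalent invocation looks like the following; this assumes a GCC build configured with nvptx offloading support, so the exact offload target may differ on your system:

$ gcc -fopenacc -foffload=nvptx-none test.c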

Questions and Solutions

Examples: Hello World
//Hello-world-CPU.c 
#include<stdio.h>
int main()
{
  printf("Hello World from CPU!\n");        
  return 0;
}
//Hello-world-parallel.c    
#include<stdio.h>
#include<openacc.h>     
int main()
{ 
#pragma acc parallel                                                             
  printf("Hello World from GPU!\n");
  return 0;
}
//Hello-world-kernels.c 
#include<stdio.h>
#include<openacc.h>     
int main()
{
#pragma acc kernels                            
  printf("Hello World from GPU!\n");
  return 0;
}
Compilation and Output
// compilation
$ gcc Hello-world-CPU.c -o Hello-World-CPU

// execution 
$ ./Hello-World-CPU

// output
$ Hello World from CPU!
// compilation
$ nvc -fast -acc=gpu -gpu=cc80 -Minfo=accel Hello-world-parallel.c -o Hello-World-GPU
main:
7, Generating NVIDIA GPU code

// execution
$ ./Hello-World-GPU

// output
$ Hello World from GPU!
// compilation
$ nvc -fast -acc=gpu -gpu=cc80 -Minfo=accel Hello-world-kernels.c -o Hello-World-GPU
main:
7, Accelerator serial kernel generated
   Generating NVIDIA GPU code

// execution
$ ./Hello-World-GPU

// output
$ Hello World from GPU!

Loop

Our second exercise is about how to parallelize a loop. Most of the time, the intensive computation happens inside loops, so in such cases it is more efficient to parallelize those loops. To start with a simple example, we extend our previous example by printing Hello World from GPU multiple times. Note that just adding #pragma acc parallel or #pragma acc kernels does not by itself parallelize your computation; it only ensures that the computation is executed on the device. To distribute the loop iterations, we use the loop construct described below.

Loop Constructs

#pragma acc loop [clause-list] new-line
   for loop
!$acc loop [clause-list]
   do loop
Available clauses for loop
collapse( n )
gang [( gang-arg-list )]
worker [( [num:]int-expr )]
vector [( [length:]int-expr )]
seq
independent
auto
tile( size-expr-list )
device_type( device-type-list )
private( var-list )
reduction( operator:var-list )
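
As a small illustration of the reduction clause on a loop (a sketch, not one of the exercises), the following sums an array; each gang/vector lane gets a private copy of sum, and the partial results are combined at the end of the loop:

#include <stdio.h>
#include <openacc.h>
#define N 1000
int main()
{
  float a[N], sum = 0.0f;
  for (int i = 0; i < N; i++) a[i] = 1.0f;

  // reduction(+:sum) combines the per-iteration partial sums safely
#pragma acc parallel loop reduction(+:sum) copyin(a[0:N])
  for (int i = 0; i < N; i++)
    sum += a[i];

  printf("sum = %f\n", sum);  // expected output: sum = 1000.000000
  return 0;
}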

Questions and Solutions

Examples: Loop (Hello World)
//Hello-world-CPU-loop.c    
#include<stdio.h>
int main()
{
  for(int i = 0; i < 5; i++)
    {         
      printf("Hello World from CPU!\n");
    }       
  return 0;
}
//Hello-world-parallel-loop.c   
#include<stdio.h>
#include<openacc.h>     
int main()
{
#pragma acc parallel loop
  for(int i = 0; i < 5; i++)
    {                                
      printf("Hello World from GPU!\n");
    }
  return 0;
}
//Hello-world-kernels-loop.c    
#include<stdio.h>
#include<openacc.h>     
int main()
{
#pragma acc kernels loop
  for(int i = 0; i < 5; i++)
    {                                
      printf("Hello World from GPU!\n");
    }
  return 0;
}
Compilation and Output
// compilation
$ gcc Hello-world-CPU-loop.c -o Hello-World-CPU

// execution 
$ ./Hello-World-CPU

// output
$ Hello World from CPU!
$ Hello World from CPU!
$ Hello World from CPU!
$ Hello World from CPU!
$ Hello World from CPU!                                
// compilation
$ nvc -fast -acc=gpu -gpu=cc80 -Minfo=accel Hello-world-parallel-loop.c -o Hello-World-GPU
main:
5, Generating NVIDIA GPU code
  7, #pragma acc loop gang /* blockIdx.x */

// execution
$ ./Hello-World-GPU

// output
$ Hello World from GPU!
$ Hello World from GPU!
$ Hello World from GPU!
$ Hello World from GPU!
$ Hello World from GPU!                                
// compilation
$ nvc -fast -acc=gpu -gpu=cc80 -Minfo=accel Hello-world-kernels-loop.c -o Hello-World-GPU
main:
7, Loop is parallelizable
   Generating NVIDIA GPU code
    7, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */

// execution
$ ./Hello-World-GPU

// output
$ Hello World from GPU!
$ Hello World from GPU!
$ Hello World from GPU!
$ Hello World from GPU!
$ Hello World from GPU!