
Compute Constructs and Parallelizing Loops

Compute Constructs

In our first exercise, we will look at how to offload computation to the device (GPU), since the main aim of OpenACC is to make such offloading easy through its directives and API. OpenACC provides two variants for offloading computation to the GPU, explained as follows:

  • OpenACC provides two compute constructs to parallelize computation
  • The first is parallel, and the second is kernels
  • Both constructs serve broadly the same purpose: marking a region of code for execution on the device
  • However, under kernels the compiler keeps more control over the region: it analyzes the enclosed loops and parallelizes only what it can prove safe
  • Therefore, if you know exactly what happens in the region and that it is safe to parallelize, you may use parallel; otherwise, kernels is the safer choice (a short sketch contrasting the two follows this list)
  • This is because the compiler takes care of safe parallelization under the kernels construct
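To make the difference concrete, here is a minimal sketch (not part of the exercises; the function and array names are illustrative) that writes the same loop once with parallel loop and once with kernels:

//parallel-vs-kernels.c (illustrative sketch)
#include<openacc.h>

// With parallel, the programmer asserts that the loop is safe to parallelize.
void scale_parallel(float *restrict a, int n)
{
#pragma acc parallel loop
  for(int i = 0; i < n; i++)
    a[i] = 2.0f * a[i];
}

// With kernels, the compiler analyzes the region and parallelizes only
// what it can prove safe.
void scale_kernels(float *restrict a, int n)
{
#pragma acc kernels
  for(int i = 0; i < n; i++)
    a[i] = 2.0f * a[i];
}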

In addition, to use the OpenACC runtime API (its routines, data types, and device queries), we include the OpenACC header in C/C++ or the module in Fortran:

OpenACC library

#include<openacc.h>   // C/C++
use openacc           ! Fortran
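The directives themselves compile without this header; it is the runtime API routines and types that require it. A minimal sketch follows, assuming an NVIDIA target (acc_device_nvidia is an implementation-specific device-type value provided by NVIDIA's and GCC's openacc.h):

//query-devices.c (illustrative sketch)
#include<stdio.h>
#include<openacc.h>
int main()
{
  // acc_get_num_devices and acc_device_nvidia come from the OpenACC runtime API
  int ndev = acc_get_num_devices(acc_device_nvidia);
  printf("Number of NVIDIA devices: %d\n", ndev);
  return 0;
}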

To create a parallel region in OpenACC, we use the following compute constructs:

Parallel Constructs

#pragma acc parallel [clause-list] new-line
   structured block
!$acc parallel [ clause-list ]
    structured block
!$acc end parallel
Available clauses for parallel
    async [ ( int-expr ) ]
    wait [ ( int-expr-list ) ]
    num_gangs( int-expr )
    num_workers( int-expr )
    vector_length( int-expr )
    device_type( device-type-list )
    if( condition )
    self [ ( condition ) ]
    reduction( operator : var-list )
    copy( var-list )
    copyin( [ readonly: ] var-list )
    copyout( [ zero: ] var-list )
    create( [ zero: ] var-list )
    no_create( var-list )
    present( var-list )
    deviceptr( var-list )
    attach( var-list )
    private( var-list )
    firstprivate( var-list )
    default( none | present )
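As an illustration of how several of these clauses combine on one directive, the sketch below computes a dot product with reduction, copyin, num_gangs, and vector_length. The names a, b, n and the gang/vector sizes are illustrative choices, not requirements:

//dot-parallel.c (illustrative sketch)
#include<openacc.h>
double dot(const double *a, const double *b, int n)
{
  double sum = 0.0;
  // reduction combines the per-thread partial sums; copyin moves the inputs to the device
#pragma acc parallel loop reduction(+:sum) copyin(a[0:n], b[0:n]) num_gangs(256) vector_length(128)
  for(int i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}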

Kernels Constructs

#pragma acc kernels [ clause-list ] new-line
   structured block
!$acc kernels [ clause-list ]
   structured block
!$acc end kernels
Available clauses for kernels
async [ ( int-expr ) ]
wait [ ( int-expr-list ) ]
num_gangs( int-expr )
num_workers( int-expr )
vector_length( int-expr )
device_type( device-type-list )
if( condition )
self [ ( condition ) ]
copy( var-list )
copyin( [ readonly: ] var-list )
copyout( [ zero: ] var-list )
create( [ zero: ] var-list )
no_create( var-list )
present( var-list )
deviceptr( var-list )
attach( var-list )
default( none | present )
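A comparable sketch for kernels, wrapping a vector addition with explicit data clauses (again, the names and sizes are illustrative):

//vecadd-kernels.c (illustrative sketch)
#include<openacc.h>
void vecadd(const float *restrict a, const float *restrict b,
            float *restrict c, int n)
{
  // copyin moves the inputs to the device, copyout brings the result back
#pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n])
  for(int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}

Here restrict tells the compiler that the pointers do not alias, which helps kernels prove that the loop is parallelizable.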

Compilers

The following compilers support the OpenACC programming model.

  • GNU - Open source; it can target both Nvidia and AMD GPUs
  • Nvidia HPC SDK - Provided by Nvidia; it works very well for Nvidia GPUs
  • HPE - Presently it supports only Fortran (not C/C++)

Examples (GNU, Nvidia HPC SDK and HPE): Compilation

$ nvc -fast -acc=gpu -gpu=cc80 -Minfo=accel test.c 
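The line above uses nvc from the Nvidia HPC SDK. For GNU and HPE the flags differ; the lines below are sketches only, since the exact offload target and flag spelling depend on the compiler version and the local installation:

// GNU (OpenACC is enabled with -fopenacc; the offload target shown is an example)
$ gcc -fopenacc -foffload=nvptx-none test.c -o test

// HPE/Cray CCE (Fortran only; -h acc enables OpenACC in many CCE versions)
$ ftn -h acc test.f90 -o test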

Questions and Solutions

Examples: Hello World
//Hello-world-CPU.c 
#include<stdio.h>
int main()
{
  printf("Hello World from CPU!\n");        
  return 0;
}
//Hello-world-parallel.c    
#include<stdio.h>
#include<openacc.h>     
int main()
{ 
#pragma acc parallel                                                             
  printf("Hello World from GPU!\n");
  return 0;
}
//Hello-world-kernels.c 
#include<stdio.h>
#include<openacc.h>     
int main()
{
#pragma acc kernels                            
  printf("Hello World from GPU!\n");
  return 0;
}
Compilation and Output
// compilation
$ gcc Hello-world-CPU.c -o Hello-World-CPU

// execution 
$ ./Hello-World-CPU

// output
$ Hello World from CPU!
// compilation
$ nvc -fast -acc=gpu -gpu=cc80 -Minfo=accel Hello-world-parallel.c -o Hello-World-GPU
main:
7, Generating NVIDIA GPU code

// execution
$ ./Hello-World-GPU

// output
$ Hello World from GPU!
// compilation
$ nvc -fast -acc=gpu -gpu=cc80 -Minfo=accel Hello-world-kernels.c -o Hello-World-GPU
main:
7, Accelerator serial kernel generated
   Generating NVIDIA GPU code

// execution
$ ./Hello-World-GPU

// output
$ Hello World from GPU!

Loop

Our second exercise is parallelizing loops. Most of the time, the intense computation happens inside loops, so that is where parallelization pays off. To start with a simple example, we extend the previous exercise by printing Hello World from the GPU multiple times. Note that merely adding #pragma acc parallel or #pragma acc kernels does not parallelize the loop; it only ensures that the computation is executed on the device. To distribute the loop iterations, we combine these constructs with the loop construct.

Loop Constructs

#pragma acc loop [clause-list] new-line
   for loop
!$acc loop [clause-list]
   do loop
Available clauses for loop
collapse( n )
gang [( gang-arg-list )]
worker [( [num:]int-expr )]
vector [( [length:]int-expr )]
seq
independent
auto
tile( size-expr-list )
device_type( device-type-list )
private( var-list )
reduction( operator:var-list )
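As a short sketch of the loop clauses in combination, the example below collapses a tightly nested loop pair and reduces into a scalar. The flattened array layout and the names a, n, m are illustrative:

//sum2d-loop.c (illustrative sketch)
#include<openacc.h>
double sum2d(const double *a, int n, int m)
{
  double sum = 0.0;
  // collapse(2) merges both loop levels into one iteration space for the device
#pragma acc parallel loop collapse(2) reduction(+:sum) copyin(a[0:n*m])
  for(int i = 0; i < n; i++)
    for(int j = 0; j < m; j++)
      sum += a[i*m + j];
  return sum;
}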

Questions and Solutions

Examples: Loop (Hello World)
//Hello-world-CPU-loop.c    
#include<stdio.h>
int main()
{
  for(int i = 0; i < 5; i++)
    {         
      printf("Hello World from CPU!\n");
    }       
  return 0;
}
//Hello-world-parallel-loop.c   
#include<stdio.h>
#include<openacc.h>     
int main()
{
#pragma acc parallel loop
  for(int i = 0; i < 5; i++)
    {                                
      printf("Hello World from GPU!\n");
    }
return 0;
}
//Hello-world-kernels-loop.c    
#include<stdio.h>
#include<openacc.h>     
int main()
{
#pragma acc kernels loop
  for(int i = 0; i < 5; i++)
    {                                
      printf("Hello World from GPU!\n");
    }
return 0;
}
Compilation and Output
// compilation
$ gcc Hello-world-CPU-loop.c -o Hello-World-CPU

// execution 
$ ./Hello-World-CPU

// output
$ Hello World from CPU!
$ Hello World from CPU!
$ Hello World from CPU!
$ Hello World from CPU!
$ Hello World from CPU!                                
// compilation
$ nvc -fast -acc=gpu -gpu=cc80 -Minfo=accel Hello-world-parallel-loop.c -o Hello-World-GPU
main:
5, Generating NVIDIA GPU code
  7, #pragma acc loop gang /* blockIdx.x */

// execution
$ ./Hello-World-GPU

// output
$ Hello World from GPU!
$ Hello World from GPU!
$ Hello World from GPU!
$ Hello World from GPU!
$ Hello World from GPU!                                
// compilation
$ nvc -fast -acc=gpu -gpu=cc80 -Minfo=accel Hello-world-kernels-loop.c -o Hello-World-GPU
main:
7, Loop is parallelizable
   Generating NVIDIA GPU code
    7, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */

// execution
$ ./Hello-World-GPU

// output
$ Hello World from GPU!
$ Hello World from GPU!
$ Hello World from GPU!
$ Hello World from GPU!
$ Hello World from GPU!                                
