
Guide into OpenMP: Easy multithreading programming for C++

By Joel Yliluoma, September 2007; last update in June 2016 for OpenMP 4.5

Abstract


    Preface: Importance of multithreading

    As CPU speeds no longer improve as significantly as they once did, multicore systems are becoming more popular. To harness that power, it is becoming important for programmers to be knowledgeable in parallel programming, that is, making a program execute multiple things simultaneously.
    This document attempts to give a quick introduction to OpenMP, a simple C/C++/Fortran compiler extension that makes it possible to add parallelism to existing source code without having to rewrite it entirely.

    Support in different compilers

    • GCC (GNU Compiler Collection) supports OpenMP 4.5 since version 6.1, OpenMP 4.0 since version 4.9, OpenMP 3.1 since version 4.7, OpenMP 3.0 since version 4.4, and OpenMP 2.5 since version 4.2. Add the commandline option -fopenmp to enable it. OpenMP offloading is supported for Intel MIC targets only (Intel Xeon Phi KNL + emulation) since version 5.1, and to NVidia (NVPTX) targets since version 7 or so.
    • Clang++ supports OpenMP 4.5 since version 3.9 (without offloading), OpenMP 4.0 since version 3.8 (for some parts), and OpenMP 3.1 since version 3.7. Add the commandline option -fopenmp to enable it.
    • Solaris Studio supports OpenMP 4.0 since version 12.4, and OpenMP 3.1 since version 12.3. Add the commandline option -xopenmp to enable it.
    • Intel C Compiler (icc) supports OpenMP 4.5 since version 17.0, OpenMP 4.0 since version 15.0, OpenMP 3.1 since version 12.1, OpenMP 3.0 since version 11.0, and OpenMP 2.5 since version 10.1. Add the commandline option -openmp to enable it. Add the -openmp-stubs option instead to enable the library without actual parallel execution.
    • Microsoft Visual C++ (cl) supports OpenMP 2.0 since version 2005. Add the commandline option /openmp to enable it.
    Note: If your GCC complains that "-fopenmp" is valid for D but not for C++ when you try to use it, or does not recognize the option at all, your GCC version is too old. If your linker complains about missing GOMP functions, you forgot to specify "-fopenmp" when linking.
    More information: http://openmp.org/wp/openmp-compilers/

    Introduction to OpenMP in C++

    OpenMP consists of a set of compiler #pragmas that control how the program works. The pragmas are designed so that even if the compiler does not support them, the program will still yield correct behavior, but without any parallelism. Here are a few simple example programs demonstrating OpenMP.
    You can compile them like this:
      g++ tmp.cpp -fopenmp
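    Based on the options listed in the compiler support section above, the equivalent commands for the other compilers would be, for example:
      clang++ tmp.cpp -fopenmp
      icc tmp.cpp -openmp
      cl tmp.cpp /openmp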
    

    Example: Initializing a table in parallel (multiple threads)

    This code divides the table initialization into multiple threads, which are run simultaneously. Each thread initializes a portion of the table.
      #include <cmath>
      int main()
      {
        const int size = 256;
        double sinTable[size];
        
        #pragma omp parallel for
        for(int n=0; n<size; ++n)
          sinTable[n] = std::sin(2 * M_PI * n / size);
      
        // the table is now initialized
      }
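    As a side note, "parallel for" is a combined construct: the same loop can also be written with the thread team made explicit. The following sketch is equivalent to the code above:
      #include <cmath>
      int main()
      {
        const int size = 256;
        double sinTable[size];
        
        #pragma omp parallel   // spawn a team of threads
        {
          #pragma omp for      // split the loop iterations among the threads
          for(int n=0; n<size; ++n)
            sinTable[n] = std::sin(2 * M_PI * n / size);
        }
      
        // the table is now initialized
      }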

    Example: Initializing a table in parallel (single thread, SIMD)

    This version requires compiler support for at least OpenMP 4.0, and the use of a parallel floating point library such as AMD ACML or Intel SVML (which can be used in GCC with e.g. -mveclibabi=svml).
      #include <cmath>
      int main()
      {
        const int size = 256;
        double sinTable[size];
        
        #pragma omp simd
        for(int n=0; n<size; ++n)
          sinTable[n] = std::sin(2 * M_PI * n / size);
      
        // the table is now initialized
      }
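    The two approaches also compose: since OpenMP 4.0 the constructs can be combined, so that the loop is split among threads and each thread vectorizes its own share. A sketch of the loop from above:
      #pragma omp parallel for simd
      for(int n=0; n<size; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / size);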

    Example: Initializing a table in parallel (multiple threads on another device)

    OpenMP 4.0 added support for offloading code to different devices, such as a GPU. There can therefore be three layers of parallelism in a single program: a single thread processing multiple data (SIMD), multiple threads running simultaneously, and multiple devices running the same program simultaneously.
      #include <cmath>
      int main()
      {
        const int size = 256;
        double sinTable[size];
        
        #pragma omp target teams distribute parallel for map(from:sinTable[0:256])
        for(int n=0; n<size; ++n)
          sinTable[n] = std::sin(2 * M_PI * n / size);
    
        // the table is now initialized
      }
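    The combined construct above already exercises two of those layers on the device (teams and threads); adding simd engages the third. A sketch of the same loop using all three:
      #pragma omp target teams distribute parallel for simd map(from:sinTable[0:256])
      for(int n=0; n<size; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / size);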

    Example: Calculating the Mandelbrot fractal in parallel (host computer)

    This program calculates the classic Mandelbrot fractal at a low resolution and renders it with ASCII characters, calculating multiple pixels in parallel.
     #include <complex>
     #include <cstdio>
     
     typedef std::complex<double> complex;
     
     int MandelbrotCalculate(complex c, int maxiter)
     {
         // iterates z = z*z + c until |z| >= 2 or maxiter is reached,
         // returns the number of iterations.
         complex z = c;
         int n=0;
         for(; n<maxiter; ++n)
         {
             if( std::abs(z) >= 2.0) break;
             z = z*z + c;
         }
         return n;
     }
     int main()
     {
         const int width = 78, height = 44, num_pixels = width*height;
         
         const complex center(-.7, 0), span(2.7, -(4/3.0)*2.7*height/width);
         const complex begin = center-span/2.0;//, end = center+span/2.0;
         const int maxiter = 100000;
       
       #pragma omp parallel for ordered schedule(dynamic)
         for(int pix=0; pix<num_pixels; ++pix)
         {
             const int x = pix%width, y = pix/width;
             
             complex c = begin + complex(x * span.real() / (width +1.0),
                                         y * span.imag() / (height+1.0));
             
             int n = MandelbrotCalculate(c, maxiter);
             if(n == maxiter) n = 0;
             
           #pragma omp ordered
             {
               char c = ' ';
               if(n > 0)
               {
                   static const char charset[] = ".,c8M@jawrpogOQEPGJ";
                   c = charset[n % (sizeof(charset)-1)];
               }
               std::putchar(c);
               if(x+1 == width) std::puts("|");
             }
         }
     }
    This program can be improved in many different ways, but it is left simple for the sake of an introductory example.
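    For instance, one possible restructuring (a sketch, not part of the original program) is to render into a buffer in parallel and print it afterwards, which removes the need for the ordered construct. It reuses MandelbrotCalculate and the constants from above, and additionally needs #include <vector>:
     int main()
     {
         const int width = 78, height = 44, num_pixels = width*height;
         const complex center(-.7, 0), span(2.7, -(4/3.0)*2.7*height/width);
         const complex begin = center-span/2.0;
         const int maxiter = 100000;
         static const char charset[] = ".,c8M@jawrpogOQEPGJ";
         
         std::vector<char> buffer(num_pixels);
         
       #pragma omp parallel for schedule(dynamic)
         for(int pix=0; pix<num_pixels; ++pix)
         {
             const int x = pix%width, y = pix/width;
             complex c = begin + complex(x * span.real() / (width +1.0),
                                         y * span.imag() / (height+1.0));
             int n = MandelbrotCalculate(c, maxiter);
             // each thread writes to its own buffer slots, so no
             // synchronization or ordering is needed here
             buffer[pix] = (n > 0 && n < maxiter)
                         ? charset[n % (sizeof(charset)-1)] : ' ';
         }
         // print sequentially once every pixel has been computed
         for(int pix=0; pix<num_pixels; ++pix)
         {
             std::putchar(buffer[pix]);
             if(pix % width == width-1) std::puts("|");
         }
     }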

    Discussion

    As you can see, there is very little in the program that indicates that it runs in parallel. If you remove the #pragma lines, the result is still a valid C++ program that runs and does the expected thing. Only when the compiler interprets those #pragma lines does it become a parallel program. It really does calculate N values simultaneously, where N is the number of threads. In GCC, libgomp determines that from the number of processors.
    By C and C++ standards, if the compiler encounters a #pragma that it does not support, it will ignore it. So adding the OMP statements can be done safely[1] without breaking compatibility with legacy compilers.
    There is also a runtime library that can be accessed through omp.h, but it is less often needed. If you need it, you can check for the #define _OPENMP and use conditional compilation to stay compatible with compilers that don't support OpenMP.
    [1]: Within the usual parallel programming issues (concurrency, mutual exclusion) of course.
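    As an illustration of the runtime library mentioned above, here is a minimal sketch of that conditional use (omp_get_max_threads reports how many threads the implementation will use by default):
      #include <cstdio>
      #ifdef _OPENMP
       #include <omp.h>  // only available when OpenMP is enabled
      #endif
      int main()
      {
       #ifdef _OPENMP
          std::printf("OpenMP enabled, up to %d threads\n", omp_get_max_threads());
       #else
          std::printf("compiled without OpenMP support\n");
       #endif
      }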
