Friday, July 5, 2019

Parallel programming: OpenMP 4

Thread-safety (i.e. mutual exclusion)

There is a wide array of concurrency and mutual exclusion problems related to multithreaded programs. I won't explain them here in detail; there are many good books dealing with the issue (for example, Multithreaded, Parallel, and Distributed Programming by Gregory R. Andrews).
Instead, I will explain the tools that OpenMP provides to handle mutual exclusion correctly.

Atomicity

Atomicity means that something is inseparable; an event either happens completely or it does not happen at all, and another thread cannot intervene during the execution of the event.
 #pragma omp atomic
 counter += value;
The atomic keyword in OpenMP specifies that the denoted action happens atomically. It is commonly used to update counters and other simple variables that are accessed by multiple threads simultaneously.
See also reduction.
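As a minimal sketch of the counter use case, the function below counts negative entries with one shared counter. The name count_negative is my own invention for illustration; it is not part of OpenMP or of the original text.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (name chosen for this sketch): counts how many
// entries of `values` are negative, using one shared counter.
int count_negative(const std::vector<int>& values)
{
  int counter = 0; // shared between all threads of the loop
  const int n = static_cast<int>(values.size());

  #pragma omp parallel for
  for(int i = 0; i < n; ++i)
  {
    if(values[i] < 0)
    {
      // Without the atomic directive, two threads could read the same
      // old value of counter, and one of the increments would be lost.
      #pragma omp atomic
      counter += 1;
    }
  }
  assert(counter <= n); // sanity check: cannot count more than n entries
  return counter;
}
```

If the compiler does not support OpenMP, the pragmas are ignored and the loop simply runs sequentially with the same result.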
There are four different types of atomic expressions (since OpenMP 3.1):

Atomic read expressions

 #pragma omp atomic read
 var = x;
Here the reading of x is guaranteed to happen atomically, but nothing is guaranteed about var. Note that var may not access the memory location designated for x.

Atomic write expressions

 #pragma omp atomic write
 x = expr;
Here the writing of x is guaranteed to happen atomically, but nothing is guaranteed about expr. Note that expr may not access the memory location designated for x.
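A sketch of how atomic read and atomic write might be combined: a shared flag is set by whichever iteration finds a match, and the other iterations read it to skip needless work. The function name contains is hypothetical, and the sketch assumes a compiler with OpenMP 3.1 support.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: returns true if `needle` occurs in `haystack`.
bool contains(const std::vector<int>& haystack, int needle)
{
  int found = 0; // shared flag, accessed only through atomic read/write
  const int n = static_cast<int>(haystack.size());

  #pragma omp parallel for
  for(int i = 0; i < n; ++i)
  {
    int seen;
    #pragma omp atomic read
    seen = found;
    if(seen) continue; // some iteration already found it; skip the comparison

    if(haystack[i] == needle)
    {
      #pragma omp atomic write
      found = 1;
    }
  }
  assert(found == 0 || found == 1); // the flag is only ever written with 1
  return found != 0;
}
```

Note that continue is allowed inside an OpenMP loop because the iteration still completes normally; it is break and return that are not.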

Atomic update expressions

 #pragma omp atomic update // The word "update" is optional
 // One of these:
 ++x; --x; x++; x--;
 x += expr;  x -= expr;  x *= expr;   x /= expr;  x &= expr;
 x = x+expr; x = x-expr; x = x*expr;  x = x/expr; x = x&expr;
 x = expr+x; x = expr-x; x = expr*x;  x = expr/x; x = expr&x;
 x |= expr;  x ^= expr;  x <<= expr;  x >>= expr;
 x = x|expr; x = x^expr; x = x<<expr; x = x>>expr;
 x = expr|x; x = expr^x; x = expr<<x; x = expr>>x;
Here the updating of x is guaranteed to happen atomically, but nothing is guaranteed about expr. Note that expr may not access the memory location designated for x.

Atomic capture expressions

Capture expressions combine the read and update features.
 #pragma omp atomic capture
 // One of these:
 var = x++;  /* Or any other of the update expressions listed above */
 { var = x; x++; /* Or any other of the update expressions listed above */ }
 { x++; /* Or any other of the update expressions listed above */; var = x; }
 { var = x; x = expr; }
Note that neither var nor expr may access the memory location designated for x.
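One place where a capture expression seems useful is handing out unique slots in a preallocated array. The following is only a sketch under that assumption; collect_even is a made-up name, and the final sort is there only to make the result deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: returns the even numbers of `input`, sorted.
std::vector<int> collect_even(const std::vector<int>& input)
{
  std::vector<int> out(input.size()); // preallocated: no reallocation in the loop
  int next = 0;                       // index of the next free slot (shared)
  const int n = static_cast<int>(input.size());

  #pragma omp parallel for
  for(int i = 0; i < n; ++i)
  {
    if(input[i] % 2 != 0) continue;

    int slot;
    // Read the old value of next and increment it in one indivisible
    // step, so that no two iterations can receive the same slot.
    #pragma omp atomic capture
    slot = next++;

    out[slot] = input[i]; // each slot is written by exactly one iteration
  }
  assert(next <= n);
  out.resize(next);                   // drop the unused tail
  std::sort(out.begin(), out.end());  // the arrival order is unspecified
  return out;
}
```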

The critical construct

The critical construct restricts the execution of the associated statement / block to a single thread at a time.
The critical construct may optionally contain a global name that identifies the type of the critical construct. No two threads can execute a critical construct of the same name at the same time.
If the name is omitted, a default name is assumed.

 #pragma omp critical(dataupdate)
 {
   datastructure.reorganize();
 }
 ...
 #pragma omp critical(dataupdate)
 {
   datastructure.reorganize_again();
 }
In this example, only one of the critical sections named "dataupdate" may be executed at any given time, and only one thread may be executing it at that time. I.e. the functions "reorganize" and "reorganize_again" cannot be invoked at the same time, and two calls to the same function cannot be active at the same time. (Except if other calls exist elsewhere, unprotected by the critical construct.)
Note: The critical section names are global to the entire program (regardless of module boundaries). So if you have a critical section by the same name in multiple modules, no two of them can be executed at the same time.
If you need something like a local mutex, see below.
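For a complete (if contrived) illustration of a named critical section, here is a sketch that guards push_back on a shared std::vector. The function name squares and the section name squares_result are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: returns the squares of 0..n-1 in ascending order.
std::vector<int> squares(int n)
{
  assert(n >= 0);
  std::vector<int> result;

  #pragma omp parallel for
  for(int i = 0; i < n; ++i)
  {
    int sq = i * i; // computed outside the critical section, in parallel

    // push_back may reallocate the vector, so only one thread
    // at a time is allowed to call it.
    #pragma omp critical(squares_result)
    {
      result.push_back(sq);
    }
  }
  std::sort(result.begin(), result.end()); // the arrival order is unspecified
  return result;
}
```

Keeping the computation outside the critical block matters: everything inside it is serialized.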

Locks

The OpenMP runtime library provides a lock type, omp_lock_t, in its include file, omp.h.
The lock type has five manipulator functions:

  • omp_init_lock initializes the lock. After the call, the lock is unset.
  • omp_destroy_lock destroys the lock. The lock must be unset before this call.
  • omp_set_lock attempts to set the lock. If the lock is already set by another thread, it will wait until the lock is no longer set, and then sets it.
  • omp_unset_lock unsets the lock. It should only be called by the same thread that set the lock; the consequences of doing otherwise are undefined.
  • omp_test_lock attempts to set the lock. If the lock is already set by another thread, it returns 0; if it managed to set the lock, it returns 1.
Here is an example of a wrapper around std::set<> that provides per-instance mutual exclusion while still working even if the compiler does not support OpenMP.
You can maintain backward compatibility with non-OpenMP-supporting compilers by enclosing the library references in #ifdef _OPENMP ... #endif blocks.

 #ifdef _OPENMP
 # include <omp.h>
 #endif
 #include <set>
 
 class data
 {
 private:
   std::set<int> flags;
 #ifdef _OPENMP
   omp_lock_t lock;
 #endif
 public:
   data() : flags()
   {
 #ifdef _OPENMP
     omp_init_lock(&lock);
 #endif
   }
   ~data()
   {
 #ifdef _OPENMP
     omp_destroy_lock(&lock);
 #endif
   }
   
   bool set_get(int c)
   {
   #ifdef _OPENMP
     omp_set_lock(&lock);
   #endif
     bool found = flags.find(c) != flags.end();
     if(!found) flags.insert(c);
   #ifdef _OPENMP
     omp_unset_lock(&lock);
   #endif
     return found;
   }
 };
Of course, you would really rather wrap the lock into a custom container to avoid littering the code with #ifdefs and also for providing exception-safety:
 #ifdef _OPENMP
 # include <omp.h>
 struct MutexType
 {
   MutexType() { omp_init_lock(&lock); }
   ~MutexType() { omp_destroy_lock(&lock); }
   void Lock() { omp_set_lock(&lock); }
   void Unlock() { omp_unset_lock(&lock); }
   
   MutexType(const MutexType& ) { omp_init_lock(&lock); }
   MutexType& operator= (const MutexType& ) { return *this; }
 public:
   omp_lock_t lock;
 };
 #else
 /* A dummy mutex that doesn't actually exclude anything,
  * but as there is no parallelism either, no worries. */
 struct MutexType
 {
   void Lock() {}
   void Unlock() {}
 };
 #endif
 
 /* An exception-safe scoped lock-keeper. */
 struct ScopedLock
 {
   explicit ScopedLock(MutexType& m) : mut(m), locked(true) { mut.Lock(); }
   ~ScopedLock() { Unlock(); }
   void Unlock() { if(!locked) return; locked=false; mut.Unlock(); }
   void LockAgain() { if(locked) return; mut.Lock(); locked=true; }
 private:
   MutexType& mut;
   bool locked;
 private: // prevent copying the scoped lock.
   void operator=(const ScopedLock&);
   ScopedLock(const ScopedLock&);
 };
This way, the example above becomes a lot simpler, and also exception-safe:
 #include <set>
 
 class data
 {
 private:
   std::set<int> flags;
   MutexType lock;
 public:
   bool set_get(int c)
   {
     ScopedLock lck(lock); // locks the mutex
     
     if(flags.find(c) != flags.end()) return true; // was found
     flags.insert(c);
     return false; // was not found
   } // automatically releases the lock when lck goes out of scope.
 };
There is also a lock type that supports nesting, omp_nest_lock_t. I will not cover it here.

The flush directive

Even when variables used by threads are supposed to be shared, the compiler may take liberties and optimize them as register variables. This can skew concurrent observations of the variable. The flush directive can be used to ensure that the value observed in one thread is also the value observed by other threads.
This example comes from the OpenMP specification.

          /* presumption: int a = 0, b = 0; */
                        
    /* First thread */                /* Second thread */
    b = 1;                            a = 1;
    #pragma omp flush(a,b)            #pragma omp flush(a,b)
    if(a == 0)                        if(b == 0)
    {                                 {
      /* Critical section */            /* Critical section */
    }                                 }
In this example, it is enforced that at the time either of a or b is accessed, the other is also up-to-date, practically ensuring that the two threads cannot both enter the critical section. (Note: It is still possible that neither of them can enter it.)
You need the flush directive when you have writes to and reads from the same data in different threads.
If the program appears to work correctly without the flush directive, it does not mean that the flush directive is not required. It may just be that your compiler is not utilizing all the freedoms the standard allows it. You need the flush directive whenever you access shared data in multiple threads: after a write, before a read.
However, I do not know these:
  • Is flush needed if the shared variable is declared volatile?
  • Is flush needed if all access to the shared variable is atomic or restricted by critical sections?

Controlling which data to share between threads

In the parallel section, it is possible to specify which variables are shared between the different threads and which are not. By default, all variables are shared except those declared within the parallel block.

The private, firstprivate and shared clauses

 int a, b=0;
 #pragma omp parallel for private(a) shared(b)
 for(a=0; a<50; ++a)
 {
   #pragma omp atomic
   b += a;
 }
This example explicitly specifies that a is private (each thread has their own copy of it) and that b is shared (each thread accesses the same variable).

The difference between private and firstprivate

Note that a private copy is an uninitialized variable by the same name and same type as the original variable; it does not copy the value of the variable that was in the surrounding context.
Example:

 #include <string>
 #include <iostream>
 
 int main()
 {
     std::string a = "x", b = "y";
     int c = 3;
     
     #pragma omp parallel private(a,c) shared(b) num_threads(2)
     {
         a += "k";
         c += 7;
         std::cout << "A becomes (" << a << "), b is (" << b << ")\n";
     }
 }
Here a is printed as "k", not "xk". At the entrance of the block, a becomes a new instance of std::string that is initialized with the default constructor; it is not initialized with the copy constructor.
Internally, the program becomes like this:
 int main()
 {
     std::string a = "x", b = "y";
     int c = 3;
     
     OpenMP_thread_fork(2);
     {                  // Start new scope
         std::string a; // Note: It is a new local variable.
         int c;         // This too.
         a += "k";
         c += 7;
         std::cout << "A becomes (" << a << "), b is (" << b << ")\n";
     }                  // End of scope for the local variables
     OpenMP_join();
 }
In the case of primitive (POD) datatypes (int, float, char* etc.), the private variable is uninitialized, just like any declared but not initialized local variable. It does not contain the value of the variable from the surrounding context. Therefore, the increment of c is moot here; the value of the variable is still undefined. (If you are using a GCC version earlier than 4.4, you do not even get a warning about the use of an uninitialized value in situations like this.)
If you actually need a copy of the original value, use the firstprivate clause instead.

 #include <string>
 #include <iostream>
 
 int main()
 {
     std::string a = "x", b = "y";
     int c = 3;
     
     #pragma omp parallel firstprivate(a,c) shared(b) num_threads(2)
     {
         a += "k";
         c += 7;
         std::cout << "A becomes (" << a << "), b is (" << b << ")\n";
     }
 }
Now the output becomes "A becomes (xk), b is (y)".

The lastprivate clause

The lastprivate clause defines a variable private as in firstprivate or private, but causes the value from the last task to be copied back to the original value after the end of the loop/sections construct.
  • In a loop construct (for construct), the last value is the value assigned by the thread that handles the last iteration of the loop. Values assigned during other iterations are ignored.
  • In a sections construct (sections construct), the last value is the value assigned in the last section denoted by the section construct. Values assigned in other sections are ignored.
Example:
 #include <stdio.h>
 int main()
 {
    int done = 4, done2 = 5;
    
     #pragma omp parallel for lastprivate(done, done2) num_threads(2) schedule(static)
     for(int a=0; a<8; ++a)
     {
       if(a==2) done=done2=0;
       if(a==3) done=done2=1;
     }
     printf("%d,%d\n", done,done2);
 }
This program outputs garbage such as "4196224,-348582208", because internally, this program became like this:
 #include <stdio.h>
 int main()
 {
    int done = 4, done2 = 5;
    OpenMP_thread_fork(2);
    {
        int this_thread = omp_get_thread_num(), num_threads = 2;
        int my_start = (this_thread  ) * 8 / num_threads;
        int my_end   = (this_thread+1) * 8 / num_threads;

        int priv_done, priv_done2; // not initialized, because firstprivate was not used

        for(int a=my_start; a<my_end; ++a)
        {
            if(a==2) priv_done=priv_done2=0;
            if(a==3) priv_done=priv_done2=1;
        }
        if(my_end == 8)
        {
           // assign the values back, because this was the last iteration
           done  = priv_done;
           done2 = priv_done2;
        }
    }
    OpenMP_join();
 }
As one can observe, the values of priv_done and priv_done2 are not assigned even once during the course of the loop that iterates through 4...7. As such, the values that are assigned back are completely bogus.
Therefore, lastprivate cannot be used to e.g. fetch the value of a flag assigned randomly during a loop. Use reduction for that, instead.
Where this behavior can be utilized though, is in situations like this (from OpenMP manual):

 void loop()
 {
   int i;
   #pragma omp for lastprivate(i)
   for(i=0; i<get_loop_count(); ++i) // note: get_loop_count() must be a pure function.
       { ... }
   
   printf("%d\n", i); // this shows the number of loop iterations done.
 }
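A self-contained variant of the same idea, as a sketch: every iteration assigns the lastprivate variable, so the value copied back is well defined. The names LoopResult and run_loop are hypothetical, and the sketch assumes at least one iteration.

```cpp
#include <cassert>

// Hypothetical result type for this sketch.
struct LoopResult { int iterations; int last_square; };

LoopResult run_loop(int count)
{
  assert(count > 0); // with zero iterations there is no "last" iteration

  int i, sq = 0;
  #pragma omp parallel for lastprivate(i, sq)
  for(i = 0; i < count; ++i)
  {
    sq = i * i; // assigned in every iteration, including the last one
  }
  // i now holds the value it would have after a sequential loop (count),
  // and sq holds the value assigned by the iteration i == count-1.
  LoopResult r = { i, sq };
  return r;
}
```

For example, run_loop(5) should give iterations == 5 and last_square == 16, regardless of how many threads took part.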

The default clause

The most useful purpose of the default clause is to check whether you have remembered to consider all variables for the private/shared question, using the default(none) setting.
 int a, b=0;
 // This code won't compile: It requires explicitly
 // specifying whether a is shared or private.
 #pragma omp parallel default(none) shared(b)
 {
   b += a;
 }
The default clause can also be used to set that all variables are shared by default (default(shared)).
Note: Because different compilers have different ideas about which variables are implicitly private or shared, and for which it is an error to explicitly state the private/shared status, it is recommended to use the default(none) setting only during development, and drop it in production/distribution code.
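For comparison, here is a sketch of a loop that does compile under default(none), because every variable referenced inside the construct has been given an explicit status. The function name sum_to is an assumption for this example.

```cpp
#include <cassert>

// Hypothetical helper: returns 1 + 2 + ... + n.
int sum_to(int n)
{
  assert(n >= 0);
  int total = 0;

  // Both total and n are referenced inside the construct, so both must be
  // listed. The loop variable i is declared inside the loop and is
  // therefore private without being listed.
  #pragma omp parallel for default(none) shared(total, n)
  for(int i = 1; i <= n; ++i)
  {
    #pragma omp atomic
    total += i;
  }
  return total;
}
```

Removing n from the shared list should make an OpenMP-enabled compiler reject the code, which is exactly the check that default(none) is meant to provide.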

The reduction clause

The reduction clause is a mix between the private, shared, and atomic clauses.
It allows accumulating into a shared variable without the atomic clause, but the type of accumulation must be specified. It will often produce faster executing code than using the atomic clause.
This example calculates factorial using threads:
 int factorial(int number)
 {
   int fac = 1;
   #pragma omp parallel for reduction(*:fac)
   for(int n=2; n<=number; ++n)
     fac *= n;
   return fac;
 }
  • At the beginning of the parallel block, a private copy is made of the variable and preinitialized to a certain value.
  • At the end of the parallel block, the private copy is atomically merged into the shared variable using the defined operator.
(The private copy is actually just a new local variable by the same name and type; the original variable is not accessed to create the copy.)
The syntax of the clause is:
  reduction(operator:list)
where list is the list of variables where the operator will be applied to, and operator is one of these:
  Operator          Initialization value
  +, -, |, ^, ||    0
  *, &&             1
  &                 ~0
  min               largest representable number
  max               smallest representable number
To write the factorial function (shown above) without reduction, it probably would look like this:
 int factorial(int number)
 {
   int fac = 1;
   #pragma omp parallel for
   for(int n=2; n<=number; ++n)
   {
     #pragma omp atomic
     fac *= n;
   }
   return fac;
 }
However, this code would be less optimal than the one with reduction: it misses the opportunity to use a local (possibly register) variable for the accumulation, and needlessly places load/synchronization demands on the shared memory variable. In fact, due to the bottleneck of that atomic variable (only one thread may access it at a time), it would completely nullify any gains of parallelism in that loop.
The version with reduction is equivalent to this code (illustration only):
 int factorial(int number)
 {
   int fac = 1;
   #pragma omp parallel
   {
     int omp_priv = 1; /* This value comes from the table shown above */
     #pragma omp for nowait
     for(int n=2; n<=number; ++n)
       omp_priv *= n;
     #pragma omp atomic
     fac *= omp_priv;
   }
   return fac;
 }
Note how it moves the atomic operation out from the loop.
The restrictions in reduction and atomic are very similar: both can only be done on POD types; neither allows overloaded operators, and both have the same set of supported operators.
As an example of how the reduction clause can be used to produce semantically different code when OpenMP is enabled and when it is disabled, this example prints the number of threads that executed the parallel block:
 int a = 0;
 #pragma omp parallel reduction (+:a)
 {
   a = 1; // Assigns a value to the private copy.
   // Each thread increments the value by 1.
 }
 printf("%d\n", a);
If you preinitialized "a" to 4, it would print a number >= 5 if OpenMP was enabled, and 1 if OpenMP was disabled.
Note: If you really need to detect whether OpenMP is enabled, use the _OPENMP #define instead. To get the number of threads, use omp_get_num_threads() instead.
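The min and max operators from the table can be sketched like this. The function name largest is made up for the example, and the sketch assumes a compiler that supports OpenMP 3.1 or later, where these operators became available in C and C++.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Hypothetical helper: returns the largest element of a non-empty vector.
int largest(const std::vector<int>& values)
{
  assert(!values.empty());
  int best = INT_MIN;
  const int n = static_cast<int>(values.size());

  // Each thread gets a private copy of best, preinitialized to the
  // smallest representable int. At the end the private copies are
  // combined with the original variable using max.
  #pragma omp parallel for reduction(max:best)
  for(int i = 0; i < n; ++i)
  {
    if(values[i] > best) best = values[i];
  }
  return best;
}
```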

The declare reduction directive (OpenMP 4.0+)

The declare reduction directive generalizes the reductions to include user-defined reductions.
The syntax of the declaration is one of these two:
#pragma omp declare reduction(name:type:expression)
#pragma omp declare reduction(name:type:expression) initializer(expression)

  • The name is the name you want to give to the reduction method.
  • The type is the type of your reduction result.
  • Within the reduction expression, the special variables omp_in and omp_out are implicitly declared, and they stand for the input and output expressions respectively.
  • Within the initializer expression, the special variable omp_priv is implicitly declared and stands for the initial value of the reduction result.
An example use case is when you are running a data compressor with different parameters, and you want to find the set of parameters that results in best compression. Below is an example of such code:

  #include <cstdio>

  int compress(int param1, int param2)
  {
    return (param1+13)^param2; // Placeholder for a compression algorithm
  }

  int main(int argc, char** argv)
  {
    struct BestInfo { unsigned size, param1, param2; };

    #pragma omp declare reduction(isbetter:BestInfo: \
                                  omp_in.size<omp_out.size ? omp_out=omp_in : omp_out \
                    ) initializer(omp_priv = BestInfo{~0u,~0u,~0u})

    BestInfo result{~0u,~0u,~0u};
    #pragma omp parallel for collapse(2) reduction(isbetter:result)
    for(unsigned p1=0; p1<10; ++p1)
    for(unsigned p2=0; p2<10; ++p2)
    {
      unsigned size = compress(p1,p2);
      if(size < result.size) result = BestInfo{size,p1,p2};
    }
    std::printf("Best compression (%u bytes) with params %u,%u\n",
      result.size, result.param1, result.param2);
  }
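Another use of declare reduction that I find easy to follow is collecting results into a container: each thread fills its own private std::vector, and the reduction appends the private vectors together at the end. This is only a sketch; the reduction name merge and the function multiples_of_three are assumptions for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// User-defined reduction: append the partial result omp_in to omp_out.
// Each private copy starts out as an empty vector.
#pragma omp declare reduction(merge : std::vector<int> : \
    omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end())) \
    initializer(omp_priv = std::vector<int>())

// Hypothetical helper: returns all multiples of three in [0, limit), sorted.
std::vector<int> multiples_of_three(int limit)
{
  assert(limit >= 0);
  std::vector<int> found;

  #pragma omp parallel for reduction(merge:found)
  for(int i = 0; i < limit; ++i)
  {
    if(i % 3 == 0) found.push_back(i); // operates on the private copy, no lock needed
  }
  std::sort(found.begin(), found.end()); // the merge order is unspecified
  return found;
}
```

Compared with guarding push_back by a critical section, the threads here do not contend for anything inside the loop.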

Thread affinity (proc_bind)

The thread affinity of the parallel construct can be controlled with a proc_bind clause. It takes one of the following three forms:
  • #pragma omp parallel proc_bind(master)
  • #pragma omp parallel proc_bind(close)
  • #pragma omp parallel proc_bind(spread)
For more information, read the OpenMP specification.

Execution synchronization

The barrier directive and the nowait clause

The barrier directive causes threads encountering the barrier to wait until all the other threads in the same team have encountered the barrier.
 #pragma omp parallel
 {
   /* All threads execute this. */
   SomeCode();
   
   #pragma omp barrier
   
   /* All threads execute this, but not before
    * all threads have finished executing SomeCode().
    */
   SomeMoreCode();
 }
Note: There is an implicit barrier at the end of each parallel block, and at the end of each sections, for and single construct, unless the nowait clause is used.
Example:
 #pragma omp parallel
 {
   #pragma omp for
   for(int n=0; n<10; ++n) Work();
   
   // This line is not reached before the for-loop is completely finished
   SomeMoreCode();
 }

 // This line is reached only after all threads from
 // the previous parallel block are finished.
 CodeContinues();

 #pragma omp parallel
 {
   #pragma omp for nowait
   for(int n=0; n<10; ++n) Work();
   
   // This line may be reached while some threads are still executing the for-loop.
   SomeMoreCode();
 }

 // This line is reached only after all threads from
 // the previous parallel block are finished.
 CodeContinues();
The nowait clause can only be attached to sections, for and single. It cannot be attached to the within-loop ordered construct, for example.

The single and master constructs

The single construct specifies that the given statement/block is executed by only one thread. It is unspecified which thread. Other threads skip the statement/block and wait at an implicit barrier at the end of the construct.
 #pragma omp parallel
 {
   Work1();
   #pragma omp single
   {
     Work2();
   }
   Work3();
 }
In a 2-cpu system, this will run Work1() twice, Work2() once and Work3() twice. There is an implied barrier at the end of the single construct, but not at the beginning of it.
Note: Do not assume that the single block is executed by whichever thread gets there first. According to the standard, the decision of which thread executes the block is implementation-defined, and therefore making assumptions on it is non-conforming.
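The call counts described above can be checked with a small sketch in which counters stand in for Work1(), Work2() and Work3(). The names CallCounts and run_single_demo are hypothetical.

```cpp
#include <cassert>

// Hypothetical result type: how many times each "work" step ran.
struct CallCounts { int work1, work2, work3; };

CallCounts run_single_demo()
{
  int work1 = 0, work2 = 0, work3 = 0;

  #pragma omp parallel
  {
    #pragma omp atomic
    work1 += 1;     // stands in for Work1(): runs once per thread

    #pragma omp single
    {
      work2 += 1;   // stands in for Work2(): runs once per team
    }               // the other threads wait here (implicit barrier)

    #pragma omp atomic
    work3 += 1;     // stands in for Work3(): runs once per thread
  }

  assert(work1 == work3); // every thread ran both per-thread steps
  CallCounts counts = { work1, work2, work3 };
  return counts;
}
```

Whatever the team size, work2 should come out as 1, while work1 and work3 equal the number of threads.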
The master construct is similar, except that the statement/block is run by the master thread, and there is no implied barrier; other threads skip the construct without waiting.

 #pragma omp parallel
 {
   Work1();
   
   // This...
   #pragma omp master
   {
     Work2();
   }
   
   // ...is practically identical to this:
   if(omp_get_thread_num() == 0)
   {
     Work2();
   }
   
   Work3();
 }
Unless you use the threadprivate clause, the only important difference between single nowait and master is that if you have multiple master blocks in a parallel section, you are guaranteed that they are executed by the same thread every time, and hence, the values of private (thread-local) variables are the same.

Thread cancellation (OpenMP 4.0+)

Suppose that we want to optimize this function with parallel processing:
 /* Returns any position from the haystack where the needle can
  * be found, or NULL if no such position exists. It is not guaranteed
  * to find the first matching position; it only guarantees to find
  * _a_ matching position if one exists.
  */
 const char* FindAnyNeedle(const char* haystack, size_t size, char needle)
 {
   for(size_t p = 0; p < size; ++p)
     if(haystack[p] == needle)
     {
       /* This breaks out of the loop. */
       return haystack+p;
     }
   return NULL;
 }
Our first attempt might be to simply tack a #pragma omp parallel for before the for loop, but that doesn't work: OpenMP requires that a loop construct processes each iteration. Breaking out of the loop (using return, goto, break, throw or other means) is not allowed.
To solve this problem, OpenMP 4.0 added a mechanism called cancellation points, and a cancel construct. Cancellation points are implicitly inserted at the following positions:
  • Implicit barriers
  • barrier regions
  • cancel regions
  • cancellation point regions
Cancellation can be used to solve finder problems where N threads search for a solution and once a solution is found by any thread, all threads end their search.
Because there is a performance overhead in checking for cancellations, it is only enabled if the library-internal global variable OMP_CANCELLATION is set. The value of this variable can be checked with the omp_get_cancellation() function, but there is no way to modify it from inside the program. It can only be set from the environment when the program is launched.
In this example program, once a thread finds the "needle", it signals cancellation for all threads of the current team processing the innermost for loop. Threads check for cancellation only once per loop iteration. The program also checks whether OMP_CANCELLATION is set, and if not, sets it and reruns itself.

  #include <stdio.h>  // For printf
  #include <string.h> // For strlen
  #include <stdlib.h> // For putenv
  #include <unistd.h> // For execv
  #include <omp.h>    // For omp_get_cancellation, omp_get_thread_num()

  static const char* FindAnyNeedle(const char* haystack, size_t size, char needle)
  {
    const char* result = haystack+size;
    #pragma omp parallel
    {
      unsigned num_iterations=0;
      #pragma omp for
      for(size_t p = 0; p < size; ++p)
      {
        ++num_iterations;
        if(haystack[p] == needle)
        {
          #pragma omp atomic write
          result = haystack+p;
          // Signal cancellation.
          #pragma omp cancel for
        }
        // Check for cancellations signalled by other threads:
        #pragma omp cancellation point for
      }
      // All threads reach here eventually; sooner if the cancellation was signalled.
      printf("Thread %d: %u iterations completed\n", omp_get_thread_num(), num_iterations);
    }
    return result;
  }

  int main(int argc, char** argv)
  {
    if(!omp_get_cancellation())
    {
      printf("Cancellations were not enabled, enabling cancellation and rerunning program\n");
      putenv("OMP_CANCELLATION=true");
      execv(argv[0], argv);
    }
    // The field width for %*s must be an int, so the pointer difference is cast.
    printf("%s\n%*s\n", argv[1], (int)(FindAnyNeedle(argv[1],strlen(argv[1]),argv[2][0])-argv[1]+1), "^");
  }
Example output:
   ./a.out "OpenMP cancellations can only be performed synchronously at cancellation points." "l"
   Cancellations were not enabled, enabling cancellation and rerunning program
   Thread 0: 10 iterations completed
   Thread 1: 3 iterations completed
   Thread 7: 10 iterations completed
   Thread 3: 10 iterations completed
   Thread 4: 10 iterations completed
   Thread 2: 8 iterations completed
   Thread 5: 5 iterations completed
   Thread 6: 6 iterations completed
   OpenMP cancellations can only be performed synchronously at cancellation points.
                              ^
The keyword at the end of the #pragma omp cancellation point construct is the name of the most closely nested OpenMP construct that you want to cancel. In the example code above, it is the for construct, and this is why the line says #pragma omp cancellation point for.
OpenMP cancellations can only be performed synchronously at cancellation points. GNU pthreads also permits asynchronous cancellations. This is rarely used, and requires special setup, because there are several resource leak risks involved in it. An example of such code can be found here: http://bisqwit.iki.fi/jutut/kuvat/openmphowto/pthread_cancel_demo.cpp

Loop nesting

The problem

A beginner at OpenMP will quickly find out that this code will not do the expected thing:
 #pragma omp parallel for
 for(int y=0; y<25; ++y)
 {
   #pragma omp parallel for
   for(int x=0; x<80; ++x)
   {
     tick(x,y);
   }
 }
The beginner expects there to be N tick() calls active at the same time (where N = number of processors). Although that is true, the inner loop is not actually parallelised. Only the outer loop is. The inner loop runs in a pure sequence, as if the whole inner #pragma was omitted.
At the entrance of the inner parallel construct, the OpenMP runtime library (libgomp in case of GCC) detects that there already exists a team, and instead of a new team of N threads, it will create a team consisting of only the calling thread.
Rewriting the code like this won't work:
 #pragma omp parallel for
 for(int y=0; y<25; ++y)
 {
   #pragma omp for // ERROR, nesting like this is not allowed.
   for(int x=0; x<80; ++x)
   {
     tick(x,y);
   }
 }
This code is erroneous and will cause the program to malfunction. See the restrictions chapter below for details.

Solution in OpenMP 3.0

In OpenMP 3.0, the loop nesting problem can be solved by using the collapse clause in the for construct.
Example:
 #pragma omp parallel for collapse(2)
 for(int y=0; y<25; ++y)
   for(int x=0; x<80; ++x)
   {
     tick(x,y);
   }
The number specified in the collapse clauses is the number of nested loops that are subject to the work-sharing semantics of the OpenMP for construct.
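To convince yourself that collapse(2) still visits every (x,y) pair exactly once, a sketch like the following can be used. Here visit_grid is a hypothetical stand-in for the tick() calls: it records one visit per cell instead.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: returns a width*height grid of visit counts.
std::vector<int> visit_grid(int width, int height)
{
  assert(width > 0 && height > 0);
  std::vector<int> visits(width * height, 0);

  // Both loops are fused into one iteration space of width*height
  // iterations, which is then divided between the threads.
  #pragma omp parallel for collapse(2)
  for(int y = 0; y < height; ++y)
    for(int x = 0; x < width; ++x)
    {
      visits[y * width + x] += 1; // each cell belongs to exactly one iteration
    }
  return visits;
}
```

Note that collapse requires the loops to be perfectly nested: there may be no code between the two for statements.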

Restrictions

There are restrictions to which clauses can be nested under which constructs. The restrictions are listed in the OpenMP official specification.

Performance

Compared to a naive use of C++11 threads, OpenMP threads are often more efficient. This is because many implementations of OpenMP use a thread pool. A thread pool means that new operating system threads are only created once. When the threads are done with their work, they return to a “dock” waiting for new work to do.

Shortcomings

OpenMP and fork()

It is worth mentioning that using OpenMP in a program that calls fork() requires special consideration.
This problem only affects GCC; ICC is not affected.
If your program intends to become a background process using daemonize() or other similar means, you must not use the OpenMP features before the fork. After OpenMP features are utilized, a fork is only allowed if the child process does not use OpenMP features, or it does so as a completely new process (such as after exec()).
This is an example of an erroneous program:
  #include <stdio.h>
  #include <sys/wait.h>
  #include <unistd.h>

  void a()
  {
    #pragma omp parallel num_threads(2)
    {
      puts("para_a"); // output twice
    }
    puts("a ended"); // output once
  }
  void b()
  {
    #pragma omp parallel num_threads(2)
    {
      puts("para_b");
    }
    puts("b ended");
  }

  int main() {
   a();   // Invokes OpenMP features (parent process)
   int p = fork();
   if(!p)
   {
     b(); // ERROR: Uses OpenMP again, but in child process
     _exit(0);
   }
   wait(NULL);
   return 0;
  }
When run, this program hangs, never reaching the line that outputs "b ended".
There is currently no workaround; the libgomp API does not specify functions that can be used to prepare for a call to fork().

Missing in this article

  • The depend clause (added in OpenMP 4.0)
  • The nowait clause in target construct (added in OpenMP 4.5)
  • The taskgroup construct (added in OpenMP 4.0)
  • The taskyield construct (added in OpenMP 3.1)
  • The final, mergeable, and priority clauses in task (added in OpenMP 3.1 through 4.5)
  • The threadprivate, copyprivate and copyin clauses
  • The ref, val, and uval modifiers in linear clause (added in OpenMP 4.5)
  • The hint clause in critical construct (added in OpenMP 4.5)
  • The defaultmap clause (added in OpenMP 4.5)

Some specific gotchas

C++
  • STL is not thread-safe. If you use STL containers in a parallel context, you must exclude concurrent access using locks or other mechanisms. Const-access is usually fine, as long as non-const access does not occur at the same time.
  • Exceptions may not be thrown and caught across omp constructs. That is, if a code inside an omp for throws an exception, the exception must be caught before the end of the loop iteration; and an exception thrown inside a parallel section must be caught by the same thread before the end of the parallel section.
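The exception rule can be sketched as follows: the try/catch sits inside the loop body, so an exception never leaves the iteration that threw it. Both checked_double and count_failures are hypothetical names made up for this illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical work function: rejects negative input by throwing.
static int checked_double(int v)
{
  if(v < 0) throw std::invalid_argument("negative input");
  return v * 2;
}

// Hypothetical helper: counts how many inputs made the work function throw.
int count_failures(const std::vector<int>& inputs)
{
  int failures = 0;
  const int n = static_cast<int>(inputs.size());

  #pragma omp parallel for reduction(+:failures)
  for(int i = 0; i < n; ++i)
  {
    try
    {
      checked_double(inputs[i]);
    }
    catch(const std::invalid_argument&)
    {
      // Caught within the same iteration, by the same thread that threw.
      ++failures;
    }
  }
  assert(failures <= n);
  return failures;
}
```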
GCC

  • fork() is troublesome when used together with OpenMP. See the chapter "OpenMP and fork()" above for details.




Last edited at: 2018-02-10T21:10:44+02:00
