Friday, July 5, 2019

Programare paralela openMP 2

The syntax

All OpenMP constructs in C and C++ are indicated with a #pragma omp followed by parameters, ending in a newline. The pragma usually applies only into the statement immediately following it, except for the barrier and flush commands, which do not have associated statements.

The parallel construct

The parallel construct starts a parallel block. It creates a team of N threads (where N is determined at runtime, usually from the number of CPU cores, but may be affected by a few things), all of which execute the next statement (or the next block, if the statement is a {…} -enclosure). After the statement, the threads join back into one.
  #pragma omp parallel
  {
    // Code inside this region runs in parallel.
    printf("Hello!\n");
  }
This code creates a team of threads, and each thread executes the same code. It prints the text "Hello!" followed by a newline, as many times as there are threads in the team created. For a dual-core system, it will output the text twice. (Note: It may also output something like "HeHlellolo", depending on system, because the printing happens in parallel.) At the }, the threads are joined back into one, as if in non-threaded program.Internally, GCC implements this by creating a magic function and moving the associated code into that function, so that all the variables declared within that block become local variables of that function (and thus, locals to each thread).
ICC, on the other hand, uses a mechanism resembling fork(), and does not create a magic function. Both implementations are, of course, valid, and semantically identical.
Variables shared from the context are handled transparently, sometimes by passing a reference and sometimes by using register variables which are flushed at the end of the parallel block (or whenever a flush is executed).

Parallelism conditionality clause: if

The parallelism can be made conditional by including a if clause in the parallel command, such as:
  extern int parallelism_enabled;
  #pragma omp parallel for if(parallelism_enabled)
  for(int c=0; c<n; ++c)
    handle(c);
In this case, if parallelism_enabled evaluates to a zero value, the number of threads in the team that processes the for loop will always be exactly one.

Loop construct: for

The for construct splits the for-loop so that each thread in the current team handles a different portion of the loop.
 #pragma omp for
 for(int n=0; n<10; ++n)
 {
   printf(" %d", n);
 }
 printf(".\n");
This loop will output each number from 0…9 once. However, it may do it in arbitrary order. It may output, for example:
  0 5 6 7 1 8 2 3 4 9.
Internally, the above loop becomes into code equivalent to this:
  int this_thread = omp_get_thread_num(), num_threads = omp_get_num_threads();
  int my_start = (this_thread  ) * 10 / num_threads;
  int my_end   = (this_thread+1) * 10 / num_threads;
  for(int n=my_start; n<my_end; ++n)
    printf(" %d", n);
So each thread gets a different section of the loop, and they execute their own sections in parallel.Note: #pragma omp for only delegates portions of the loop for different threads in the current team. A team is the group of threads executing the program. At program start, the team consists only of a single member: the master thread that runs the program.
To create a new team of threads, you need to specify the parallel keyword. It can be specified in the surrounding context:
 #pragma omp parallel
 {
  #pragma omp for
  for(int n=0; n<10; ++n) printf(" %d", n);
 }
 printf(".\n");
Equivalent shorthand is to specify it in the pragma itself, as #pragma omp parallel for:
 #pragma omp parallel for
 for(int n=0; n<10; ++n) printf(" %d", n);
 printf(".\n");
You can explicitly specify the number of threads to be created in the team, using the num_threads attribute:
 #pragma omp parallel num_threads(3)
 {
   // This code will be executed by three threads.
   
   // Chunks of this loop will be divided amongst
   // the (three) threads of the current team.
   #pragma omp for
   for(int n=0; n<10; ++n) printf(" %d", n);
 }
Note that OpenMP also works for C. However, in C, you need to set explicitly the loop variable as private, because C does not allow declaring it in the loop body:
 int n;
 #pragma omp for private(n)
 for(n=0; n<10; ++n) printf(" %d", n);
 printf(".\n");
See the "private and shared clauses" section for details.In OpenMP 2.5, the iteration variable in for must be a signed integer variable type. In OpenMP 3.0, it may also be an unsigned integer variable type, a pointer type or a constant-time random access iterator type. In the latter case, std::distance() will be used to determine the number of loop iterations.

What are: parallelfor and a team

The difference between parallelparallel for and for is as follows:
  • A team is the group of threads that execute currently.
    • At the program beginning, the team consists of a single thread.
    • parallel construct splits the current thread into a new team of threads for the duration of the next block/statement, after which the team merges back into one.
  • for divides the work of the for-loop among the threads of the current team. It does not create threads, it only divides the work amongst the threads of the currently executing team.
  • parallel for is a shorthand for two commands at once: parallel and for. Parallel creates a new team, and for splits that team to handle different portions of the loop.
If your program never contains a parallel construct, there is never more than one thread; the master thread that starts the program and runs it, as in non-threading programs.

Scheduling

The scheduling algorithm for the for-loop can explicitly controlled.
 #pragma omp for schedule(static)
 for(int n=0; n<10; ++n) printf(" %d", n);
 printf(".\n");
There are five scheduling types: staticdynamicguidedauto, and (since OpenMP 4.0) runtime. In addition, there are three scheduling modifiers (since OpenMP 4.5): monotonicnonmonotonic, and simd.static is the default schedule as shown above. Upon entering the loop, each thread independently decides which chunk of the loop they will process.
There is also the dynamic schedule:
 #pragma omp for schedule(dynamic)
 for(int n=0; n<10; ++n) printf(" %d", n);
 printf(".\n");
In the dynamic schedule, there is no predictable order in which the loop items are assigned to different threads. Each thread asks the OpenMP runtime library for an iteration number, then handles it, then asks for next, and so on. This is most useful when used in conjunction with the ordered clause, or when the different iterations in the loop may take different time to execute.The chunk size can also be specified to lessen the number of calls to the runtime library:
 #pragma omp for schedule(dynamic, 3)
 for(int n=0; n<10; ++n) printf(" %d", n);
 printf(".\n");
In this example, each thread asks for an iteration number, executes 3 iterations of the loop, then asks for another, and so on. The last chunk may be smaller than 3, though.Internally, the loop above becomes into code equivalent to this (illustration only, do not write code like this):
  int a,b;
  if(GOMP_loop_dynamic_start(0,10,1, 3, &a,&b))
  {
    do {
      for(int n=a; n<b; ++n) printf(" %d", n);
    } while(GOMP_loop_dynamic_next(&a,&b));
  }
The guided schedule appears to have behavior of static with the shortcomings of static fixed with dynamic-like traits. It is difficult to explain — this example program maybe explains it better than words do. (Requires libSDL to compile.)The "runtime" option means the runtime library chooses one of the scheduling options at runtime at the compiler library's discretion.
A scheduling modifier can be added to the clause, e.g.: #pragma omp for schedule(nonmonotonic:dynamic
The modifiers are:
  • monotonic: Each thread executes chunks in an increasing iteration order.
  • nonmonotonic: Each thread executes chunks in an unspecified order.
  • simd: If the loop is a simd loop, this controls the chunk size for scheduling in a manner that is optimal for the hardware limitations according to how the compiler decides. This modifier is ignored for non-SIMD loops.

The ordered clause

The order in which the loop iterations are executed is unspecified, and depends on runtime conditions.However, it is possible to force that certain events within the loop happen in a predicted order, using the ordered clause.

 #pragma omp for ordered schedule(dynamic)
 for(int n=0; n<100; ++n)
 {
   files[n].compress();

   #pragma omp ordered
   send(files[n]);
 }
This loop "compresses" 100 files with some files being compressed in parallel, but ensures that the files are "sent" in a strictly sequential order.If the thread assigned to compress file 7 is done but the file 6 has not yet been sent, the thread will wait before sending, and before starting to compress another file. The ordered clause in the loop guarantees that there always exists one thread that is handling the lowest-numbered unhandled task.
Each file is compressed and sent exactly once, but the compression may happen in parallel.
There may only be one ordered block per an ordered loop, no less and no more. In addition, the enclosing for construct must contain the ordered clause.
OpenMP 4.5 added some modifiers and clauses to the ordered construct.

  • #pragma omp ordered threads means the same as #pragma omp ordered. It means the threads executing the loop execute the ordered regions sequentially in the order of loop iterations.
  • #pragma omp ordered simd can only be used in a for simd loop.
  • #pragma omp ordered depend(source) and #pragma omp ordered depend(vectorvariable) also exist.

The collapse clause

When you have nested loops, you can use the collapse clause to apply the threading to multiple nested iterations.Example:
 #pragma omp parallel for collapse(2)
 for(int y=0; y<25; ++y)
   for(int x=0; x<80; ++x)
   {
     tick(x,y);
   }

The reduction clause

The reduction clause is a special directive that instructs the compiler to generate code that accumulates values from different loop iterations together in a certain manner. It is discussed in a separate chapter later in this article. Example:
 int sum=0;
 #pragma omp parallel for reduction(+:sum)
 for(int n=0; n<1000; ++n) sum += table[n];

Sections

Sometimes it is handy to indicate that "this and this can run in parallel". The sections setting is just for that.
 #pragma omp sections
 {
   { Work1(); }
   #pragma omp section
   { Work2();
     Work3(); }
   #pragma omp section
   { Work4(); }
 }
This code indicates that any of the tasks Work1Work2 + Work3 and Work4 may run in parallel, but that Work2 and Work3 must be run in sequence. Each work is done exactly once.As usual, if the compiler ignores the pragmas, the result is still a correctly running program.
Internally, GCC implements this as a combination of the parallel for and a switch-case construct. Other compilers may implement it differently.
Note: #pragma omp sections only delegates the sections for different threads in the current team. To create a team, you need to specify the parallel keyword either in the surrounding context or in the pragma, as #pragma omp parallel sections.
Example:

 #pragma omp parallel sections // starts a new team
 {
   { Work1(); }
   #pragma omp section
   { Work2();
     Work3(); }
   #pragma omp section
   { Work4(); }
 }
or
 #pragma omp parallel // starts a new team
 {
   //Work0(); // this function would be run by all threads.
   
   #pragma omp sections // divides the team into sections
   { 
     // everything herein is run only once.
     { Work1(); }
     #pragma omp section
     { Work2();
       Work3(); }
     #pragma omp section
     { Work4(); }
   }
   
   //Work5(); // this function would be run by all threads.
 }

The simd construct (OpenMP 4.0+)

OpenMP 4.0 added explicit SIMD parallelism (Single-Instruction, Multiple-Data). SIMD means that multiple calculations will be performed simultaneously by the processor, using special instructions that perform the same calculation to multiple values at once. This is often more efficient than regular instructions that operate on single data values. This is also sometimes called vector parallelism or vector operations (and is in fact the preferred term in OpenACC).There are two use cases for the simd construct.
Firstly, #pragma omp simd can be used to declare that a loop will be utilizing SIMD.
 float a[8], b[8];
 ...
 #pragma omp simd
 for(int n=0; n<8; ++n) a[n] += b[n];
Secondly, #pragma omp declare simd can be used to indicate a function or procedure that is explicitly designed to take advantage of SIMD parallelism. The compiler may create multiple versions of the same function that use different parameter passing conventions for different CPU capabilities for SIMD processing.
  #pragma omp declare simd aligned(a,b:16)
  void add_arrays(float *__restrict__ a, float *__restrict__ b)
  {
    #pragma omp simd aligned(a,b:16)
    for(int n=0; n<8; ++n) a[n] += b[n];
  } 
Without the pragma, the function will use the default non-SIMD-aware ABI, even though the function itself may do calculation using SIMD.Since compilers of today attempt to do SIMD regardless of OpenMP simd directives, the simd directive can be thought essentially as a directive to the compiler, saying: “Try harder”.

The collapse clause

The collapse clause can be added to bind the SIMDness into multiple nested loops. The example code below will direct the compiler into attempting to generate instructions that calculate 16 values simultaneously, if at all possible.
  #pragma omp simd collapse(2)
  for(int i=0; i<4; ++i)
    for(int j=0; j<4; ++j)
      a[j*4+i] += b[i*4+j];

The reduction clause

The reduction clause can be used with SIMD just like with parallel loops.
 int sum=0;
 #pragma omp simd reduction(+:sum)
 for(int n=0; n<1000; ++n) sum += table[n];

The aligned clause

The aligned attribute hints the compiler that each element listed is aligned to the given number of bytes. Use this attribute if you are sure that the alignment is guaranteed, and it will increase the performance of the code and make it shorter.The attribute can be used in both the function declaration, and in the individual SIMD statements.

  #pragma omp declare simd aligned(a,b:16)
  void add_arrays(float *__restrict__ a, float *__restrict__ b)
  {
    #pragma omp simd aligned(a,b:16)
    for(int n=0; n<8; ++n) a[n] += b[n];
  } 

The safelen clause

While the restrict keyword in C tells the compiler that it can assume that two pointers will not address the same data (and thus it is safe to change the ordering of reads and writes), the safelen clause in OpenMP provides much fine-grained control over pointer aliasing.In the example code below, the compiler is informed that a[x] and b[y] are independent as long as the difference between x and y is smaller than 4. In reality, the clause controls the upper limit of concurrent loop iterations. It means that only 4 items can be processed concurrently at most. The actual concurrency may be smaller, and depends on the compiler implementation and hardware limits.

  #pragma omp declare simd
  void add_arrays(float* a, float* b)
  {
    #pragma omp simd aligned(a,b:16) safelen(4)
    for(int n=0; n<8; ++n) a[n] += b[n];
  } 

The simdlen clause (OpenMP 4.5+)

The simdlen clause can be added to a declare simd construct to limit how many elements of an array are passed in SIMD registers instead of using the normal parameter passing convention.

The uniform clause

The uniform clause declares one or more arguments to have an invariant value for all concurrent invocations of the function in the execution of a single SIMD loop.

The linear clause (OpenMP 4.5+)

The linear clause is similar to the firstprivate clause discussed later in this article.Consider this example code:

  #include <stdio.h>

  int b = 10;
  int main()
  {
    int array[8];

    #pragma omp simd linear(b:2)
    for(int n=0; n<8; ++n) array[n] = b;

    for(int n=0; n<8; ++n) printf("%d\n", array[n]);
  }
What does this code print? If we ignore the SIMD constructs, we can see it should print the sequence 10,10,10,10,10,10,10,10.But, if we enable the OpenMP SIMD construct, the program should now print 10,12,14,16,18,20,22,24. This is because the linear clause tells the compiler, that the value of b inside each iteration of the loop should be a copy of the original value of b before the SIMD construct, plus the loop iteration number, times the linear scale, which is 2 in this case.
In essence, it should be equivalent to the following code:
  int b_original = b;
  for(int n=0; n<8; ++n) array[n] = b_original + n*2;
However, as of GCC version 6.1.0, the linear clause does not seem to be implemented correctly, at least according to my understanding of the specification, so I cannot do more experimentation.

The inbranch and notinbranch clauses

The inbranch clause specifies that the function will always be called from inside a conditional statement of a SIMD loop. The notinbranch clause specifies that the function will never be called from inside a conditional statement of a SIMD loop.The compiler may use this knowledge to optimize the code.

The for simd construct (OpenMP 4.0+)

The for and simd constructs can be combined, to divide the execution of a loop into multiple threads, and then execute those loop slices in parallel using SIMD.
  float sum(float* table)
  {
    float result=0;
    #pragma omp parallel for simd reduction(+:result)
    for(int n=0; n<1000; ++n) result += table[n];
    return result;
  }

The task construct (OpenMP 3.0+)

When for and sections are too cumbersome, the task construct can be used. This is only supported in OpenMP 3.0 and later.These examples are from the OpenMP 3.0 manual:

struct node { node *left, *right; };
extern void process(node* );
void traverse(node* p)
{
    if (p->left)
        #pragma omp task // p is firstprivate by default
        traverse(p->left);
    if (p->right)
        #pragma omp task // p is firstprivate by default
        traverse(p->right);
    process(p);
}
In the next example, we force a postorder traversal of the tree by adding a taskwait directive. Now, we can safely assume that the left and right sons have been executed before we process the current node.
struct node { node *left, *right; };
extern void process(node* );
void postorder_traverse(node* p)
{
    if (p->left)
        #pragma omp task // p is firstprivate by default
        postorder_traverse(p->left);
    if (p->right)
        #pragma omp task // p is firstprivate by default
        postorder_traverse(p->right);
    #pragma omp taskwait
    process(p);
}
The following example demonstrates how to use the task construct to process elements of a linked list in parallel. The pointer p is firstprivate by default on the task construct so it is not necessary to specify it in a firstprivate clause.
struct node { int data; node* next; };
extern void process(node* );
void increment_list_items(node* head)
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            for(node* p = head; p; p = p->next)
            {
             #pragma omp task
             process(p); // p is firstprivate by default
            }
        }
    }
}

No comments:

Post a Comment