Friday, July 5, 2019

Parallel programming with OpenMP, part 3

Offloading support

Offloading means that parts of the program can be executed not only on the CPU of the computer itself, but also in other hardware attached to it, such as on the graphics card.

The declare target and end declare target directives

The declare target and end declare target directives delimit a section of the source code wherein all declarations, whether they are variables or functions/subroutines, are compiled for a device. Example:
#pragma omp declare target
int x;
void murmur() { x+=5; }
#pragma omp end declare target
This creates one or more versions of "x" and "murmur": a set that exists on the host computer, and a separate set that exists and can be run on a device. The host and device versions are distinct objects and may hold different values from each other.
Variables declared in this manner can be accessed by the device code without separate map clauses.
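To illustrate, here is a minimal sketch (assuming an offloading device is available) showing that the host and device copies of x evolve independently:

  #include <stdio.h>

  #pragma omp declare target
  int x;
  void murmur() { x += 5; }
  #pragma omp end declare target

  int main(void)
  {
    murmur();                 /* increments the host copy of x to 5    */

    #pragma omp target
    { murmur(); murmur(); }   /* increments the device copy of x to 10 */

    /* Prints x=5: the device copy (now 10) is a separate object and is
       not visible here unless explicitly transferred, e.g. with the
       target update construct. */
    printf("x=%d\n", x);
    return 0;
  }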
OpenACC differences
In OpenACC, device functions are declared by prefixing each function with #pragma acc routine. Its data model is more complicated and has no direct translation from/to OpenMP.

The target and target data constructs

The target data construct creates a device data environment. The target construct executes the enclosed code on a device (and also has the target data features).
These two constructs are identical in effect:

#pragma omp target // device()... map()... if()...
{
  <<statements...>>
}
And:
#pragma omp target data // device()... map()... if()...
{
  #pragma omp target
  {
    <<statements...>>
  }
}
IMPORTANT: The target construct does not add any parallelism to the program by itself. It only transfers the execution to another device, and executes the code there in a single thread. To utilize parallelism on the device, you have to engage a teams construct inside the target construct. Example:

  #include <stdio.h>

  long long r = 1;

  int main(void)
  {
    r=10;
    #pragma omp target teams distribute parallel for reduction(+:r) map(tofrom:r)
    for(unsigned long long n=0; n<0x800000000ull; ++n)
        r += n;
    printf("r=%llX\n", r);
    return 0;
  }
See the teams keyword below for details.

The if clause

If an if clause is added to the target region, the attached expression is evaluated. If the expression evaluates to false, the code is only executed on the host. Otherwise, or if the if clause is not used, the code is executed on the device, and the task will wait until the device is done with the processing. Example:

  #include <stdlib.h>
  #include <stdio.h>

  int main(int argc, char** argv)
  {
    int r=0;

    /* Use the first command-line argument to decide whether to offload;
       default to 0 (host execution) if no argument was given. */
    int use_device = (argc > 1) ? atoi(argv[1]) : 0;

    #pragma omp target if(use_device) map(tofrom:r)
    r += 4;

    printf("r=%d\n", r);
    return 0;
  }

The device clause

Specifies the particular device that is to execute the code.
  int device_number = ...;
  #pragma omp target device(device_number)
  {
    //...
  }
You can acquire device numbers by using the <omp.h> library functions, such as omp_set_default_device, omp_get_default_device, omp_get_num_devices, and omp_is_initial_device. If the device clause is not used, the code is executed on the default device. The default device number is controlled by the omp_set_default_device function, or the OMP_DEFAULT_DEVICE environment variable.
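A short sketch of how these functions can be used to enumerate devices and check where the code actually ran (the loop-over-all-devices policy is just an illustration):

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
    int n = omp_get_num_devices();
    printf("%d offloading device(s), default device: %d\n",
           n, omp_get_default_device());

    /* Try each device in turn. */
    for(int d = 0; d < n; ++d)
    {
      int on_host = 1;
      #pragma omp target device(d) map(tofrom:on_host)
      on_host = omp_is_initial_device();

      printf("device %d: %s\n", d,
             on_host ? "fell back to the host" : "ran on the device");
    }
    return 0;
  }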

The map clause

The map clause controls how data is transferred between the host and the device. There are four different types of mappings:
  • map(alloc:variables) specifies that at entry to the block, the specified variables have uninitialized values.
  • map(to:variables) specifies that at entry to the block, the specified variables are initialized with copies of their original values on the host.
  • map(from:variables) specifies that at exit from the block, the values of these variables will be copied back to the host.
  • map(tofrom:variables) is a combination of to and from. This is the default mapping.
Variables are initialized and assigned through bitwise copy, i.e. constructors / operators are not called.
The mapping items can be entire variables or array sections.
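A minimal sketch that exercises the different mapping types (the variable names are illustrative):

  #include <stdio.h>

  int main(void)
  {
    int input = 10, output = 0, scratch = 0;

    #pragma omp target map(to:input) map(from:output) map(alloc:scratch)
    {
      scratch = input * 2;   /* device-only temporary; host value never read */
      output  = scratch + 1; /* copied back to the host at exit              */
    }

    printf("output=%d\n", output);  /* prints output=21 */
    return 0;
  }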

Array sections (OpenMP 4.0+)

The variables in map and depend clauses can also specify array sections. An array section is defined using one of the following syntaxes:
  • [lowerbound:length]
  • [lowerbound:]
  • [:length]
  • [:]
Array sections can only be specified in the map and depend clauses. They are invalid in e.g. private.
An example of a valid array subscript mapping:

  void foo (int *p)
  {
    int i;
    #pragma omp parallel
    #pragma omp single
    #pragma omp target teams distribute parallel for map(p[0:24])
    for (i = 0; i < 24; i++)
      p[i] = p[i] + 1;
  }

The target enter data and target exit data constructs (OpenMP 4.5+)

While the map clauses within a target data construct can be used to allocate data in the device memory and automatically deallocate it at the end of the construct, the target enter data and target exit data constructs can be used to store data in the device memory in a more persistent manner. Examples:
  • #pragma omp target enter data map(to:var)
  • #pragma omp target exit data map(from:var)
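For example, an array can be kept resident on the device across several target regions (a minimal sketch, assuming an offloading device is available):

  #include <stdio.h>

  int main(void)
  {
    int data[1000] = {0};

    /* Copy the array to the device; it stays there afterwards. */
    #pragma omp target enter data map(to:data)

    /* Both regions operate on the device copy without retransfers. */
    #pragma omp target teams distribute parallel for
    for(int i = 0; i < 1000; ++i) data[i] += 1;

    #pragma omp target teams distribute parallel for
    for(int i = 0; i < 1000; ++i) data[i] *= 2;

    /* Copy the result back and deallocate the device copy. */
    #pragma omp target exit data map(from:data)

    printf("data[0]=%d\n", data[0]);  /* prints data[0]=2 */
    return 0;
  }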

The target update construct

The target update construct can be used to synchronize data between the device memory and the host memory without deallocating it.
  • #pragma omp target update from(c)
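The construct supports both directions: from(...) copies device values to the host, and to(...) copies host values to the device. A sketch (note the explicit map on the target construct, since scalars are firstprivate there by default in OpenMP 4.5+):

  #include <stdio.h>

  int main(void)
  {
    int c = 1;
    #pragma omp target enter data map(to:c)

    #pragma omp target map(tofrom:c)   /* c is already present; its device copy is used */
    c *= 10;

    #pragma omp target update from(c)  /* host copy becomes 10 as well */
    printf("c=%d\n", c);

    c = 7;
    #pragma omp target update to(c)    /* push the new host value to the device */

    #pragma omp target exit data map(release:c)
    return 0;
  }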

Teams

While the parallel construct creates a team of threads, the teams construct creates a league of teams. This directive can only be used directly inside a target construct. The optional num_teams clause can be used to specify the maximum number of teams created; the actual number of teams may be smaller than this number. The master thread of each team will execute the code inside that team.
The example code below may print the message multiple times.
  #include <stdio.h>

  int main(void)
  {
    #pragma omp target teams
    {
      printf("test\n");
    }
    return 0;
  }
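A sketch combining num_teams with the omp_get_team_num and omp_get_num_teams library functions (the team count of 4 is arbitrary):

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
    #pragma omp target teams num_teams(4)
    {
      /* Executed once by the master thread of each team. */
      printf("team %d of %d\n", omp_get_team_num(), omp_get_num_teams());
    }
    return 0;
  }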
OpenACC differences
OpenACC calls teams and threads gangs and workers respectively. In OpenACC, a set of new teams is launched on the device with #pragma acc parallel, with the optional attribute num_gangs(n). This combines the behavior of #pragma omp target and #pragma omp teams.

The distribute construct

The distribute construct can be used to distribute a for loop across the master threads of all teams of the current teams region. For example, if there are 20 teams, the loop will be distributed across 20 master threads.

  #include <stdio.h>

  int main(void)
  {
    int r=0;

    #pragma omp target teams distribute reduction(+:r)
    for(int n=0; n<10000; ++n)
      r += n;

    printf("r=%d\n", r);
    return 0;
  }
OpenACC differences
In OpenACC this behavior is achieved by adding the word gang to existing worksharing constructs like #pragma acc parallel and #pragma acc kernels.

The distribute simd construct

Adding the simd clause to the distribute construct will combine the effects of simd and distribute, meaning that the loop will be divided across the master threads of all teams of the current teams region, and therein divided according to the same principles that are in effect in #pragma omp simd constructs.
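For example (a minimal sketch; the array and its size are illustrative):

  #include <stdio.h>

  int main(void)
  {
    float a[10000];

    /* Iterations are first divided across the teams' master threads,
       then executed with SIMD instructions where possible. */
    #pragma omp target teams distribute simd map(from:a)
    for(int n=0; n<10000; ++n)
      a[n] = n * 0.5f;

    printf("a[123]=%f\n", a[123]);
    return 0;
  }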

The dist_schedule clause

Much like the scheduling of a for loop can be controlled with the schedule clause, the scheduling in distribute can be controlled with the dist_schedule clause. Currently the only possible schedule kind for dist_schedule is static.
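For example, with an explicit chunk size (the chunk-size argument is optional; 512 is arbitrary):

  #include <stdio.h>

  int main(void)
  {
    int r = 0;

    /* Each team receives contiguous chunks of 512 iterations. */
    #pragma omp target teams distribute dist_schedule(static, 512) reduction(+:r) map(tofrom:r)
    for(int n=0; n<10000; ++n)
      r += n;

    printf("r=%d\n", r);
    return 0;
  }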

The distribute parallel for construct

The distribute parallel for construct can be used to distribute a for loop across all threads of all teams of the current teams region. For example, if there are 20 teams, and each team consists of 256 threads, the loop will be distributed across 5120 threads.

  #include <stdio.h>

  int main(void)
  {
    int r=0;

    #pragma omp target teams distribute parallel for reduction(+:r)
    for(int n=0; n<10000; ++n)
      r += n;

    printf("r=%d\n", r);
    return 0;
  }
The number of threads created in each team is implementation defined, but can be explicitly set with the num_threads clause. The simd clause can be added once again to the loop to add SIMD execution, if possible.
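A sketch combining these clauses (the team and thread counts are illustrative upper limits):

  #include <stdio.h>

  int main(void)
  {
    int r=0;

    /* Up to 20 teams of up to 256 threads each, with SIMD execution
       inside each thread where possible. */
    #pragma omp target teams distribute parallel for simd \
            num_teams(20) num_threads(256) reduction(+:r) map(tofrom:r)
    for(int n=0; n<10000; ++n)
      r += n;

    printf("r=%d\n", r);
    return 0;
  }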
OpenACC differences
In OpenACC this behavior is achieved by adding the word worker to existing worksharing constructs like #pragma acc parallel and #pragma acc kernels. Additionally the word vector can be added to achieve SIMD parallelism as well.

