Chapter 26 introduces a trio of useful new pseudocode keywords: spawn, sync, and parallel for. We have seen a few examples of how these simple constructs allow us to form powerful parallel implementations of common algorithms.
Here is the pseudocode for the parallelized version of merge sort from chapter 26.
MERGE-SORT'(A,p,r)
    if p < r
        q = (p+r)/2
        spawn MERGE-SORT'(A,p,q)
        MERGE-SORT'(A,q+1,r)
        sync
        P-MERGE(A,p,q,r)

P-MERGE(T,p1,r1,p2,r2,A,p3)
    n1 = r1 - p1 + 1
    n2 = r2 - p2 + 1
    if n1 < n2
        exchange p1 with p2
        exchange r1 with r2
        exchange n1 with n2
    if n1 == 0
        return
    else
        q1 = (p1+r1)/2
        q2 = BINARY-SEARCH(T[q1],T,p2,r2)
        q3 = p3 + (q1 - p1) + (q2 - p2)
        A[q3] = T[q1]
        spawn P-MERGE(T,p1,q1-1,p2,q2-1,A,p3)
        P-MERGE(T,q1+1,r1,q2,r2,A,q3+1)
        sync
In the programming assignment below I am going to ask you to write a C++ program that implements another parallel algorithm. The notes below cover the basics of parallel programming in C++.
OpenMP is a system for writing parallel programs in C++. OpenMP is a combination of a set of compiler extensions and an OpenMP runtime library. The compiler extensions allow you to add parallelism to your program through the use of compiler directives, specifically pragmas.
Here is source code for a C++ version of the parallel merge sort algorithm.
The OpenMP pragmas embedded in this code are fairly easy to use.
To prepare for using parallelism in a program we use the omp parallel construct:

#pragma omp parallel
{
    // Parallel code goes here
}
This construct sets up a group of worker threads that are available to run tasks.
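As a quick illustration, here is a minimal self-contained sketch of the construct in action. Each thread in the team runs the body of the parallel region and reports its own id using the omp.h functions omp_get_thread_num() and omp_get_num_threads().

#include <cstdio>
#include <omp.h>

int main() {
    // Every thread in the team runs the body of the parallel region.
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}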
The default behavior of the omp parallel construct is to have every thread in the team run the code in the parallel section. To prevent this behavior we typically use the omp single construct:

#pragma omp parallel
{
    #pragma omp single
    doSomething();
}
This sets up an arrangement in which only one of the worker threads will run the code following the single directive. The code in the doSomething() function then has the option of spawning work to run in the other available worker threads.
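Putting the two pragmas together, here is a small self-contained sketch; doSomething() is just a stand-in for whatever work you want the single thread to start.

#include <cstdio>
#include <omp.h>

// Stand-in work function: only one thread calls this, but it could
// go on to hand tasks off to the rest of the team.
void doSomething() {
    printf("Running in thread %d only\n", omp_get_thread_num());
}

int main() {
    #pragma omp parallel
    {
        // Every thread reaches this point, but only one executes doSomething().
        #pragma omp single
        doSomething();
    }
    return 0;
}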
To set up a spawn/sync arrangement analogous to

x = spawn f(i)
y = f(i-1)
sync

we use the OpenMP task and taskwait constructs:

#pragma omp task
x = f(i);
y = f(i-1);
#pragma omp taskwait
The statement or block after the task pragma will be run as a separate task in one of the available worker threads. While that thread is running the task, we continue running any code that appears after the task. To set up a barrier that forces us to wait until both our code and the task's code are complete, we then use the taskwait pragma.
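To connect this back to the pseudocode above, here is a rough sketch of how the spawn/sync structure of MERGE-SORT' might be expressed with task and taskwait. This is only an illustration, not the full P-MERGE-based implementation: to keep it short it merges the two halves with std::inplace_merge instead of a parallel merge.

#include <algorithm>  // std::inplace_merge
#include <vector>

// Sketch only: the spawn/sync lines of MERGE-SORT' map onto task/taskwait.
void mergeSort(std::vector<int>& A, int p, int r) {
    if (p < r) {
        int q = (p + r) / 2;
        // "spawn MERGE-SORT'(A, p, q)": run the first half as a separate task.
        #pragma omp task shared(A)
        mergeSort(A, p, q);
        // The current thread handles the second half itself.
        mergeSort(A, q + 1, r);
        // "sync": wait here until the spawned task has finished.
        #pragma omp taskwait
        // Sequential merge stands in for P-MERGE in this sketch.
        std::inplace_merge(A.begin() + p, A.begin() + q + 1, A.begin() + r + 1);
    }
}

int main() {
    std::vector<int> A = {5, 2, 9, 1, 7, 3};
    #pragma omp parallel
    {
        #pragma omp single
        mergeSort(A, 0, (int)A.size() - 1);
    }
    return 0;
}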
To set up a parallel for loop we use a construct like this:
#pragma omp taskloop
for(i = 0; i < n; i++)
    A[i] = 0;
The iterations of the loop will automatically be divided up among whatever worker threads are available to do the work. For example, if four of the worker threads are currently idle, the loop's iterations may be broken into four chunks, with each chunk handled by one of those threads.
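One detail the fragment above leaves implicit is that taskloop creates tasks, so it needs to appear inside a parallel region (typically under a single directive) for the worker threads to be available. A minimal sketch:

#include <cstdio>
#include <vector>

int main() {
    const int n = 1000;
    std::vector<int> A(n, -1);

    #pragma omp parallel
    {
        #pragma omp single
        {
            // The iterations are packaged into tasks and spread across
            // the threads in the team; there is an implicit wait at the
            // end of the taskloop.
            #pragma omp taskloop
            for (int i = 0; i < n; i++)
                A[i] = 0;
        }
    }

    printf("A[0] = %d, A[n-1] = %d\n", A[0], A[n - 1]);
    return 0;
}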
Many modern development environments offer support for OpenMP.
Windows: Here are instructions on how to use the most up-to-date version of OpenMP in Visual Studio.
Mac: The C++ compiler that Xcode uses does not support OpenMP. As an alternative, we can use the Homebrew package manager to install the gcc compiler tools and then use gcc with Visual Studio Code to build and debug our OpenMP programs. Here are instructions on how to do that:
brew install gcc
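Once the Homebrew gcc is installed, you build with its versioned command name and the -fopenmp flag. The exact name (g++-13, g++-14, and so on) depends on which GCC version Homebrew installed, and the file names here are just placeholders:

# Adjust g++-14 to match the version Homebrew installed;
# lcs.cpp and lcs are placeholder names.
g++-14 -fopenmp -O2 -o lcs lcs.cpp
./lcs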
In chapter 14 we saw the dynamic programming solution to the longest common subsequence, or LCS, problem. Like all dynamic programming solutions, the LCS solution involves filling in a couple of tables. In this programming exercise we are going to use some parallel programming techniques to fill in the required tables more quickly. The basic idea is going to be to fill in the table by using a pair of nested loops. The outer loop is going to be a conventional, sequential for loop, while the inner loop is going to be a parallel for loop.
Here is what you should do:
Very important requirement: for version two of your LCS solution you should not use the strategy of having an outer loop that iterates down the rows of the table and an inner loop that iterates across the columns of the current row. A parallel for loop will try to fill in the cells in a row in parallel by having different threads work on different parts of the row. The problem with this is that very often one of the threads will try to look up a value to the left of the cell it is working on before another thread has had a chance to fill in that value.
What you should do instead is to fill in the table along diagonals. If you set this up correctly, every cell along a given diagonal will depend only on values that appear in diagonals that have already been computed. This makes it safe to fill in the cells along a diagonal in parallel using multiple threads to work on the diagonal at the same time.
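To make the idea concrete, here is one possible way to arrange the loops; treat it as a sketch of the loop structure rather than a finished solution. The outer loop walks the anti-diagonals sequentially, and the inner taskloop fills in the cells of the current diagonal in parallel. The function name lcsLength and the exact bounds arithmetic are my own choices for this sketch.

#include <algorithm>
#include <string>
#include <vector>

// Sketch of the diagonal fill for the LCS length table c[0..m][0..n].
// Cells on diagonal d (where i + j == d) depend only on diagonals d-1 and d-2,
// so they can be filled in parallel once the earlier diagonals are done.
std::vector<std::vector<int>> lcsLength(const std::string& x, const std::string& y) {
    int m = (int)x.size(), n = (int)y.size();
    std::vector<std::vector<int>> c(m + 1, std::vector<int>(n + 1, 0));

    #pragma omp parallel
    {
        #pragma omp single
        for (int d = 2; d <= m + n; d++) {       // sequential outer loop over diagonals
            int lo = std::max(1, d - n);
            int hi = std::min(m, d - 1);
            #pragma omp taskloop shared(c)
            for (int i = lo; i <= hi; i++) {     // parallel inner loop along one diagonal
                int j = d - i;
                if (x[i - 1] == y[j - 1])
                    c[i][j] = c[i - 1][j - 1] + 1;
                else
                    c[i][j] = std::max(c[i - 1][j], c[i][j - 1]);
            }
            // The taskloop waits for all of its tasks before the next diagonal starts.
        }
    }
    return c;
}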