Chapter 26 introduces a trio of useful new pseudocode keywords: spawn, sync, and parallel for. We have seen a few examples of how these simple constructs allow us to form powerful parallel implementations of common algorithms.
Here is the pseudocode for the parallelized version of merge sort from chapter 26.
MERGE-SORT'(A,p,r)
    if p < r
        q = (p+r)/2
        spawn MERGE-SORT'(A,p,q)
        MERGE-SORT'(A,q+1,r)
        sync
        P-MERGE(A,p,q,r)

P-MERGE(T,p1,r1,p2,r2,A,p3)
    n1 = r1 - p1 + 1
    n2 = r2 - p2 + 1
    if n1 < n2
        exchange p1 with p2
        exchange r1 with r2
        exchange n1 with n2
    if n1 == 0
        return
    else
        q1 = (p1+r1)/2
        q2 = BINARY-SEARCH(T[q1],T,p2,r2)
        q3 = p3 + (q1 - p1) + (q2 - p2)
        A[q3] = T[q1]
        spawn P-MERGE(T,p1,q1-1,p2,q2-1,A,p3)
        P-MERGE(T,q1+1,r1,q2,r2,A,q3+1)
        sync
In the programming assignment below I am going to ask you to write a C++ program that implements another parallel algorithm. The notes below cover the basics of parallel programming in C++.
OpenMP is a system for writing parallel programs in C++. OpenMP is a combination of a set of compiler extensions and an OpenMP runtime library. The compiler extensions allow you to add parallelism to your program through the use of compiler directives, specifically pragmas.
Here is source code for a C++ version of the parallel merge sort algorithm.
The OpenMP pragmas embedded in this code are fairly easy to use.
To prepare for using parallelism in a program we use the omp parallel construct:

#pragma omp parallel
{
    // Parallel code goes here
}
This construct sets up a group of worker threads that are available to run tasks.
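As a quick illustration, here is a minimal self-contained sketch of the construct in action. Each thread in the team runs the body of the parallel region and reports its own id using the omp.h functions omp_get_thread_num() and omp_get_num_threads().

#include <cstdio>
#include <omp.h>

int main() {
    // Every thread in the team runs the body of the parallel region.
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}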
The default behavior of the omp parallel construct is to have every thread in the team run the code in the parallel section. To prevent this behavior we typically use the omp single construct:

#pragma omp parallel
{
    #pragma omp single
    doSomething();
}
This sets up an arrangement in which only one of the worker threads will run the code following the single directive. The code in the doSomething() function then has the option of spawning work to run in the other available worker threads.
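Putting the two pragmas together, here is a small self-contained sketch; doSomething() is just a stand-in for whatever work you want the single thread to start.

#include <cstdio>
#include <omp.h>

// Stand-in work function: only one thread calls this, but it could
// go on to hand tasks off to the rest of the team.
void doSomething() {
    printf("Running in thread %d only\n", omp_get_thread_num());
}

int main() {
    #pragma omp parallel
    {
        // Every thread reaches this point, but only one executes doSomething().
        #pragma omp single
        doSomething();
    }
    return 0;
}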
To set up a spawn/sync arrangement analogous to

x = spawn f(i)
y = f(i-1)
sync

we use the OpenMP task and taskwait constructs:

#pragma omp task
x = f(i);
y = f(i-1);
#pragma omp taskwait
The statement or block after the task pragma will be run as a separate task in one of the available worker threads. While that thread is running the task, we continue running any code that appears after the task. To set up a barrier that forces us to wait until both our code and the task's code are complete, we then use the taskwait pragma.
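To connect this back to the pseudocode above, here is a rough sketch of how the spawn/sync structure of MERGE-SORT' might be expressed with task and taskwait. This is only an illustration, not the full P-MERGE-based implementation: to keep it short it merges the two halves with std::inplace_merge instead of a parallel merge.

#include <algorithm>  // std::inplace_merge
#include <vector>

// Sketch only: the spawn/sync lines of MERGE-SORT' map onto task/taskwait.
void mergeSort(std::vector<int>& A, int p, int r) {
    if (p < r) {
        int q = (p + r) / 2;
        // "spawn MERGE-SORT'(A, p, q)": run the first half as a separate task.
        #pragma omp task shared(A)
        mergeSort(A, p, q);
        // The current thread handles the second half itself.
        mergeSort(A, q + 1, r);
        // "sync": wait here until the spawned task has finished.
        #pragma omp taskwait
        // Sequential merge stands in for P-MERGE in this sketch.
        std::inplace_merge(A.begin() + p, A.begin() + q + 1, A.begin() + r + 1);
    }
}

int main() {
    std::vector<int> A = {5, 2, 9, 1, 7, 3};
    #pragma omp parallel
    {
        #pragma omp single
        mergeSort(A, 0, (int)A.size() - 1);
    }
    return 0;
}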
To set up a parallel for loop we use a construct like this:
#pragma omp taskloop
for(i = 0; i < n; i++)
    A[i] = 0;
The iterations of the loop will automatically be divided up among whatever worker threads are available to do the work. For example, if four of the worker threads are currently idle, the loop's iterations may be broken into four chunks, with each chunk handled by one of those threads.
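One detail the fragment above leaves implicit is that taskloop creates tasks, so it needs to appear inside a parallel region (typically under a single directive) for the worker threads to be available. A minimal sketch:

#include <cstdio>
#include <vector>

int main() {
    const int n = 1000;
    std::vector<int> A(n, -1);

    #pragma omp parallel
    {
        #pragma omp single
        {
            // The iterations are packaged into tasks and spread across
            // the threads in the team; there is an implicit wait at the
            // end of the taskloop.
            #pragma omp taskloop
            for (int i = 0; i < n; i++)
                A[i] = 0;
        }
    }

    printf("A[0] = %d, A[n-1] = %d\n", A[0], A[n - 1]);
    return 0;
}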
Many modern development environments offer support for OpenMP.
Windows: Here are instructions on how to use the most up-to-date version of OpenMP in Visual Studio.
Mac: The C++ compiler that Xcode uses does not support OpenMP. As an alternative, we can use the Homebrew package manager to install the gcc compiler tools and then use gcc with Visual Studio Code to build and debug our OpenMP programs. Here are instructions on how to do that:
brew install gcc
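Once the Homebrew gcc is installed, you build with its versioned command name and the -fopenmp flag. The exact name (g++-13, g++-14, and so on) depends on which GCC version Homebrew installed, and the file names here are just placeholders:

# Adjust g++-14 to match the version Homebrew installed;
# lcs.cpp and lcs are placeholder names.
g++-14 -fopenmp -O2 -o lcs lcs.cpp
./lcs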
In chapter 14 we saw the dynamic programming solution to the longest common subsequence, or LCS, problem. Like all dynamic programming solutions, the LCS solution involves filling in a couple of tables. In this programming exercise we are going to use some parallel programming techniques to fill in the required tables more quickly. The basic idea is going to be to fill in the table by using a pair of nested loops. The outer loop is going to be a conventional, sequential for loop, while the inner loop is going to be a parallel for loop.
Here is what you should do:
Very important requirement: for version two of your LCS solution you should not use the strategy of having an outer loop that iterates down the rows of the table and an inner loop that iterates across the columns of the current row. A parallel for loop will try to fill in the cells in a row in parallel by having different threads work on different parts of the row. The problem with this is that very often one of the threads will try to look up a value to the left of the cell it is working on before another thread has had a chance to fill in that value.
What you should do instead is to fill in the table along diagonals. If you set this up correctly, every cell along a given diagonal will depend only on values that appear in diagonals that have already been computed. This makes it safe to fill in the cells along a diagonal in parallel using multiple threads to work on the diagonal at the same time.
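To make the idea concrete, here is one possible way to arrange the loops; treat it as a sketch of the loop structure rather than a finished solution. The outer loop walks the anti-diagonals sequentially, and the inner taskloop fills in the cells of the current diagonal in parallel. The function name lcsLength and the exact bounds arithmetic are my own choices for this sketch.

#include <algorithm>
#include <string>
#include <vector>

// Sketch of the diagonal fill for the LCS length table c[0..m][0..n].
// Cells on diagonal d (where i + j == d) depend only on diagonals d-1 and d-2,
// so they can be filled in parallel once the earlier diagonals are done.
std::vector<std::vector<int>> lcsLength(const std::string& x, const std::string& y) {
    int m = (int)x.size(), n = (int)y.size();
    std::vector<std::vector<int>> c(m + 1, std::vector<int>(n + 1, 0));

    #pragma omp parallel
    {
        #pragma omp single
        for (int d = 2; d <= m + n; d++) {       // sequential outer loop over diagonals
            int lo = std::max(1, d - n);
            int hi = std::min(m, d - 1);
            #pragma omp taskloop shared(c)
            for (int i = lo; i <= hi; i++) {     // parallel inner loop along one diagonal
                int j = d - i;
                if (x[i - 1] == y[j - 1])
                    c[i][j] = c[i - 1][j - 1] + 1;
                else
                    c[i][j] = std::max(c[i - 1][j], c[i][j - 1]);
            }
            // The taskloop waits for all of its tasks before the next diagonal starts.
        }
    }
    return c;
}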