Assigned: May 11, 2021
Due: May 18, 2021
Lecture 14
You may also find Tim Mattson’s OpenMP tutorial informative.
Complete information on OpenMP is available at openmp.org.
As with the previous assignment, this assignment extracts into a directory ps6. The tar file is at this link.
A software development theme that we have been emphasizing throughout the course is: don’t repeat yourself (DRY). We are going to apply this in an important way to our code organization starting with this assignment.
We have been evolving some of our files throughout the course, while many of them have remained the same (or settled down after a period of change) – yet we have created new copies of these files assignment after assignment. Moreover, we have been accumulating a number of different projects within the course, but we have been keeping all of the files together in a single directory.
For this assignment (and the following), the files for the assignment are organized into separate subdirectories, as follows:
ps6/
├── Questions.rst
├── README.rst
├── data/
├── hello_omp/
│ ├── Makefile
│ ├── getenv.cpp
│ ├── hello_omp_2_threads.cpp
│ ├── hello_omp_critical.cpp
│ ├── hello_omp_num_threads.cpp
│ ├── hello_omp_parallel.cpp
│ ├── hello_omp_parallel_critical.cpp
│ └── ompi_info.cpp
├── include/
│ ├── AOSMatrix.hpp
│ ├── COOMatrix.hpp
│ ├── CSCMatrix.hpp
│ ├── CSRMatrix.hpp
│ ├── Make.inc
│ ├── Matrix.hpp
│ ├── Timer.hpp
│ ├── Vector.hpp
│ ├── amath583.hpp
│ ├── amath583IO.hpp
│ ├── amath583sparse.hpp
│ ├── catch.hpp
│ ├── getenv.hpp
│ ├── norm_utils.hpp
│ ├── norms.hpp
│ └── pagerank.hpp
├── matvec/
│ ├── Makefile
│ ├── matmat.cpp
│ ├── matvec.cpp
│ ├── pmatmat.cpp
│ └── pmatvec.cpp
├── norm/
│ ├── Makefile
│ ├── norm_block_critical.cpp
│ ├── norm_block_reduction.cpp
│ ├── norm_cyclic_critical.cpp
│ ├── norm_cyclic_reduction.cpp
│ ├── norm_parfor.cpp
│ └── norm_seq.cpp
├── pagerank/
│ ├── Makefile
│ └── pagerank.cpp
├── src/
│ ├── amath583.cpp
│ ├── amath583IO.cpp
│ └── amath583sparse.cpp
└── test/
├── Makefile
├── matmat_test.cpp
├── matvec_test.cpp
├── norms_test.cpp
├── pagerank_test.cpp
├── pmatmat_test.cpp
└── pmatvec_test.cpp
The main organizing principle is that the common files (those included by source files) are maintained in the include subdirectory. The other subdirectories contain the source files for testing and for particular projects / exercises; for instance, all of the norm-related main source files are in a single subdirectory. The convention is also to build the various projects in their source directories. The Makefiles have been amended accordingly. (Note, though, that the large amount of code that used to be repeated across Makefiles has all been moved to include/Make.inc, which is included by the other Makefiles. The individual Makefiles are now quite spare.)
Questions.rst is at the top level of the subdirectory.
Note
make verify
You may notice that there are no longer verify.bash scripts for checking your code. Instead, the Makefile (defined in Make.inc) has a target verify
that will build and run all of the executables specified in the Makefile. So from any of the subdirectories
$ make verify
will do the verification.
$ make all
will build all of the targets for that subdirectory.
(This is another example of DRY – we aren’t specifying targets in any location but one – as part of a macro in each subdirectory’s Makefile. Previously we had been specifying the files to verify in both the Makefile and the verify scripts.)
You may need to update some settings in vscode so that the built-in error-checker can find the include files in their new location. Otherwise you may see the red squiggles under the include statements for the files in include
.
(Before proceeding with this step, you may want to do the next step first and install OpenMP. If it isn’t installed, vscode won’t be able to find omp.h
because it doesn’t yet exist – no settings will make the “not found” squiggles go away if something can’t actually be found.)
First see if you need to update the settings at all. If there are no red squiggles under, e.g., #include "Vector.hpp"
you are all set.
If there are red squiggles, first try resetting Intellisense. Open the command palette and run “C/C++ Reset Intellisense Database” (start typing “reset” into the command palette and it will show you that as a menu option).
If there are still red squiggles, open Settings and select “Workspace” (as opposed to “User”) in the upper left. Then search for “includePath”. Select “Edit in settings.json” under C_Cpp › Default: Include Path. Add the path to the include directory:
"C_Cpp.default.includePath": [
"${workspaceFolder}/ps6/include/"
],
This assumes the “root” of what you are doing in vscode is one level above ps6. It should work in most cases, but you can also provide an absolute path
"C_Cpp.default.includePath": [
"/Users/al75/amath583/assignments/ps6/include/"
],
If you look in the Explorer column at the left-hand side of vscode, you can see what the root is. There may even be a .vscode subdirectory visible – the change indicated above would appear in the file .vscode/settings.json if you want to edit that file directly.
To compile programs with OpenMP directives (or, more specifically, to parallelize programs with OpenMP directives), you may need to do some additional setup to your development environment.
The default compiler in Mac OS will support OpenMP but it has to be invoked in a special way and you will need to install the OpenMP libraries.
The typical way (with g++ and with clang on Linux) to invoke the compiler with OpenMP is to add the flag
-fopenmp
. If you just use the flag that way with Mac OS clang it will complain with an error message
clang: error: unsupported option '-fopenmp'
To get clang to accept the option, you need to add the
-Xpreprocessor
flag, e.g.,
$ c++ -Xpreprocessor -fopenmp -c hello_omp.cpp -o hello_omp.o
That will be enough to get your program to compile into a .o file, but at link time you also need to specify the OpenMP library:
$ c++ -Xpreprocessor -fopenmp hello_omp.cpp -o hello_omp.exe -lomp
Since Mac OS does not readily support OpenMP compilation, it also
doesn’t include the necessary libraries for OpenMP. To install those,
you need to invoke brew
(or your favorite package manager) to
install them.
$ brew install libomp
Once you have done that, you should be able to compile your OpenMP programs.
The Makefile associated with ps6 has some logic in it to detect which compiler you are using and will choose the appropriate flags to correctly (hopefully) use OpenMP.
Another alternative for Mac OS X is to install g++-9 and use that.
$ brew install gcc@9
You may need to also install the OpenMP libraries (though some versions of g++ may have them automagically).
$ brew install libomp
Note that if you want to use g++-9 to compile, you will need to change the macro at the top of include/Make.inc to use g++-9 rather than the built-in c++.
CXX := g++-9
I have tested out compiling for OpenMP with Ubuntu 18.04 – native and
under WSL. Both clang and g++ support the -fopenmp
flag. If you get
an error message at link time complaining about undefined symbols (which
symbols look OpenMP related), then you will also need to install the
OpenMP libraries. Under Ubuntu you can do this with:
$ sudo apt-get -y install libomp5
You may also want to check with apt search
to see what version of
OpenMP libraries are available to install. But the version 5 worked for
me (at least for a few tests).
The Makefile included with ps6 includes logic that should choose the right flags for OpenMP compilation.
For command-line programs (like the ones we have been writing in this
course), the user of a program can pass parameters to it on the command
line and they are “automagically” put into the argv
array. This is
obviously a very useful way for a user of a program to state their
intentions to the program. Often however, the program itself needs to
find out information about the environment in which it is executing –
and this may be information that the user might not know or care about
or that would be inconvenient to pass on the command line every time a
program is run.
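As a quick reminder of how the command-line mechanism looks in code, here is a minimal stand-alone example (not part of ps6) that simply echoes its arguments back:

#include <iostream>

int main(int argc, char* argv[]) {
  // argc is the number of command-line tokens; argv[0] is the program name.
  for (int i = 0; i < argc; ++i) {
    std::cout << "argv[" << i << "] = " << argv[i] << std::endl;
  }
  return 0;
}

Running it as, say, ./a.out one two would print three lines: one for the program name and one for each argument.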
To enable a program to obtain information about its environment, most
operating systems have a simple look up mechanism that associates
variable names with values (though these values are also just strings).
One variable that we have discussed previously in the class was the status variable (i.e., the return value from a main() function).
From the bash
command line, if we want to retrieve the value
associated with an environment variable (if we want to evaluate it) we
prepend a $
sign to it. In bash
the status variable was
retrieved with $?
, though in the csh
shell, it is retrieved with
$status
. As with program variables, evaluating a variable is not the
same as printing it out. If you just type
$ $?
at the command line, you will get the cryptic response
0: Command not found
because the shell evaluated $?, which evaluated to 0. This in turn was evaluated by bash just as if we had typed in
$ 0
and since 0
is (probably) not an executable command, the shell
prints its error message. If you want to examine the value of an
environment variable in the shell, you need to pass that value to a
program that will just print its argument back out. In most Unix like
environments, this command is echo
:
$ echo $?
0
$ echo $SHELL
/bin/bash
Now, there are actually two overlapping namespaces in most command-line
shells: the environment variables and the shell variables. These
variables are accessed in the same way, with the $
prefix – but only
the environment variables are available to running programs. The shell
variables are used only by the shell program itself.
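For example (a hypothetical session; the variable name MYVAR is made up for illustration), a plain assignment creates only a shell variable, while export promotes it to an environment variable that child programs such as env can see:

$ MYVAR=583
$ env | grep MYVAR
$ export MYVAR
$ env | grep MYVAR
MYVAR=583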
To see a list of all of the environment variables, issue the command:
$ env
Note that this is an executable program, not a built-in shell command, so it will retrieve and print the list of environment variables that are accessible to all programs. Try this yourself. You should see a couple dozen lines that look like this (I am just showing a few):
HOSTNAME=259e82e56952
SHELL=/bin/bash
TERM=xterm
HOME=/home/amath583
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/home/amath583
SHLVL=1
These variables in the environment are not special in any particular
way. Again, the environment just maintains an association between
variable names (e.g., HOME
) and a value (e.g., /home/amath583
).
There are conventions established about how certain environment variables are used by different programs. And some programs will update the environment. For instance, if you query the value of PWD:
$ cd /
$ echo $PWD
/
$ cd
$ echo $PWD
/home/amath583
so the name of the current working directory is updated whenever you change directories.
Another important environment variable is the path (PATH
). This
contains a colon separated list of directories. When you enter a command
on the command line, the shell will search for a program that matches
the name you typed in each of those directories. If it finds a match, it
will execute that program (using fork()
and exec()
as we have
discussed in lecture). If it does not find a match it will return an
error message (and set the status to a non-zero value).
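You can watch this lookup in action with the which command, which reports the first directory in PATH containing a match (the exact paths will vary from system to system):

$ which env
/usr/bin/env
$ which c++
/usr/bin/c++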
But how does the bash shell – which itself is just a program – get the value of the environment variables? The C++ standard library provides a function for looking up environment variables.
Disclaimer. As with main()
when passed arguments, the function
for querying the environment deals with character pointers. I will first
explain the function itself and then give you a wrapper function that
you can use so that you don’t have to actually deal with pointers.
The function for querying the environment is std::getenv(), prototyped as follows:
char* std::getenv(const char* str);
The function will look for a variable whose name matches the character
string passed in as str
.
If there is a match between the string pointed to by str and a variable in the environment, getenv will return a pointer to the corresponding value in the environment.
If a match is not found – and this is another one of the appalling characteristics of pointers – it will indicate an error, but it indicates that error by setting the value of the returned pointer to a special value, namely NULL (or zero). NULL pointers are no end of trouble and have probably set the software community back a generation. And they are a problem (this is a general problem with pointers – but NULL is the most common case) because when a programmer has a pointer they want to de-reference it. If for some reason the pointer is NULL or some other invalid value, the program will throw an exception – a segmentation violation – and it will abort. Since pointers can be manipulated just like any other variable in C, it is easy to give one an invalid value. Moreover, as with getenv, the special (and dangerous) value NULL is often used to indicate an error condition for function calls that return pointers. If that value is not checked and is later used, then on the rare occasion that the call fails the result will be a segmentation violation.
If you are curious, try this program for yourself:
int main() {
char* ptr = 0;
return *ptr;
}
Or, if you want possibly the shortest compiles-but-segfaults example:
int main() {
return *(volatile int*)0;
}
More details on the memory management issues behind segmentation violations can be found in the Excursus on Memory Management.
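The defensive version of a getenv call, of course, checks the pointer before using it – which is exactly what the wrapper provided below does for you. A small stand-alone sketch (not part of ps6; the variable name is arbitrary):

#include <cstdlib>
#include <iostream>

int main() {
  // std::getenv returns NULL if the variable is not found -- check before use.
  char* gotten = std::getenv("SOME_VARIABLE_THAT_MAY_NOT_EXIST");
  if (NULL == gotten) {
    std::cout << "not found" << std::endl;
  } else {
    std::cout << gotten << std::endl;
  }
  return 0;
}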
So that you can avoid dealing with pointers (and you should), I provide the following overload to getenv that uses
std::string
rather than char*
(included in the file getenv.hpp).
#include <cstdlib>
#include <iostream>
#include <string>

std::string _getenv(const char* in) {
  char* gotten = std::getenv(in);
  if (NULL == gotten) {
    return std::string("");
  } else {
    return std::string(gotten);
  }
}

std::string getenv(const std::string& in) {
  if (NULL == in.c_str()) {
    return std::string("");
  }
  return _getenv(in.c_str());
}
Just pass in a C string (e.g., a string literal or other char pointer) or a std::string and it will return a std::string. If an environment variable is not found, this function will return an empty string rather than a null pointer.
A simple driver program is in hello_omp/getenv.cpp which you can compile to getenv.exe.
Experiment with executing getenv.exe
and examining some of the variables
individually. You can also use the env
command from your shell.
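If you want a rough idea of what such a driver looks like before opening the file, a minimal sketch (not necessarily identical to the getenv.cpp shipped with ps6) might be:

#include <iostream>
#include <string>

#include "getenv.hpp"

int main() {
  // Print a few common environment variables using the std::string overload.
  for (std::string name : {"HOME", "PATH", "SHELL", "OMP_NUM_THREADS"}) {
    std::cout << name << " = " << getenv(name) << std::endl;
  }
  return 0;
}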
Create some new variables and examine their values. In a separate shell window, you might try the following (as an example of what to be careful about):

- What do you get when you issue the command echo $PATH?
- What happens when you issue the command export PATH=""?
- What happens when you issue the command echo $PATH following the command above?

^D (which exits the shell) may come in handy after the above.
The main source files for the warm up are in ps6/hello_omp subdirectory.
As we have seen in lecture, OpenMP provides a compiler directive based
mechanism for parallelizing a job for shared-memory execution. One of
the principles upon which OpenMP is based is that the program source
code does not need to be changed – the compiler, as controlled by OpenMP
directives (#pragma omp
statements), manages all concurrency. At the
same time, there are parameters about the running parallel program that
we would like the parallel program to be aware of. For example, we might
want to specify the number of threads that a program uses to execute. Or, we might want to partition data (or a loop) ourselves.
Now, with the OpenMP philosophy, the program itself can be both
sequential and parallel – and it should largely be “unaware” of its
parallelization. (This should be true for any parallel program –
concerns should be well-separated.) Rather than using command-line
arguments as a way to pass information to OpenMP, OpenMP uses
environment variables. Environment variables also have the advantage
(compared to command-line arguments) that they can be accessed from any
arbitrary part of the program – including from a library. You can find a
list of environment variables that OpenMP uses on the OpenMP web site or
the OpenMP quick reference (http://bit.ly/2pCSigX). The most important
environment variable is OMP_NUM_THREADS
, which sets the (maximum)
number of threads that OpenMP will use at run-time.
Note
Environment variables
It is
important to remember that environment variables don’t actually do
anything. Setting the value of OMP_NUM_THREADS
to some value doesn’t
automatically parallelize your program or cause OpenMP to have that
many threads. Rather it is a variable whose value any program can read –
it is up to the program what action to take (if any) in response to the
value it reads.
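To make this concrete, here is a small sketch (not taken from the ps6 code) of a program that reads OMP_NUM_THREADS itself and decides what to do with the value, using the standard OpenMP call omp_set_num_threads():

#include <cstdlib>
#include <iostream>

#include <omp.h>

int main() {
  // Read the variable ourselves; nothing happens unless we act on it.
  const char* p = std::getenv("OMP_NUM_THREADS");
  int requested = (p == nullptr) ? 0 : std::atoi(p);

  if (requested > 0) {
    omp_set_num_threads(requested);   // choose to honor the request
  } else {
    std::cout << "OMP_NUM_THREADS not set; using OpenMP defaults" << std::endl;
  }
  std::cout << "omp_get_max_threads() = " << omp_get_max_threads() << std::endl;

  return 0;
}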
Note
clang and cmath
There may be an incompatibility between clang and the cmath
include file – this problem will only show up if you compile with the
-Ofast
compilation flag in addition to -fopenmp
. (The conflict
has to do with recently introduced SIMD commands in OpenMP 4.0.) If you
get a page full of errors for what should be a correct program, change
-Ofast
to -O3
.
When a program is compiled with OpenMP, it can be run just as with any other program – execute it on the shell command line.
When running a parallel program, there are several pieces of information we would like to have. For example, in the last assignment, we partitioned our work into pieces and let a thread or task execute each piece (in parallel). For that assignment, we specified the number of threads / tasks to use so that we could experiment with different numbers of threads. In general, however, we don’t want the user to just arbitrarily set the number of threads that a job runs with. We might not want to run with more threads than there are cores available, for example. Or, we might want to set some resource limit on how many threads are used (which might be more or less than the number of cores). That is, we want to let the runtime system have some control over how our parallel program runs (and doing this automatically is in the spirit of OpenMP).
Consider the following program (hello_omp/ompi_info.cpp) that inspects the OpenMP environment:
#include <iostream>
#include <string>
#include <thread>

#include <omp.h>

#include "getenv.hpp"

int main() {
  std::string envName = "OMP_NUM_THREADS";
  std::cout << envName << " = " << getenv(envName) << std::endl;
  std::cout << "hardware_concurrency() = " << std::thread::hardware_concurrency() << std::endl;
  std::cout << "omp_get_max_threads() = " << omp_get_max_threads() << std::endl;
  std::cout << "omp_get_num_threads() = " << omp_get_num_threads() << std::endl;

  return 0;
}
You can build this with “make ompi_info.exe”.
$ make ompi_info.exe
Note that we are using the overloaded getenv()
function we presented earlier.
When I run this in my shell I get the following output (the numbers might vary depending on your hardware):
$ ./ompi_info.exe
OMP_NUM_THREADS =
hardware_concurrency() = 4
omp_get_max_threads() = 4
omp_get_num_threads() = 1
In this case, OMP_NUM_THREADS
is not set so OpenMP uses default
values. However, we have to be careful how we interpret this. The first
line shows that OMP_NUM_THREADS
is not set. The second line is
obtained with a call to the C++ standard library (outside of OpenMP) and
shows that the available hardware concurrency (number of cores) is equal
to four. The next two lines were obtained with calls to functions in the
OpenMP library. They show that the maximum available number of OpenMP
threads is equal to four. The final line, however, may seem curious – it
says that the number of threads is equal to one.
If we take a step back, though, a thread count of one makes sense in the context where the call is made. Recall from lecture that OpenMP is based on the fork-join model: execution is sequential between OpenMP directives, and the code blocks following those directives are run in parallel. In main(), where the call to omp_get_num_threads() is made, there is only one running thread – there are no other threads running concurrently with it. But how many threads actually run at one time with OpenMP? And how can we tell?
Let’s add a simple function to the above:
int omp_thread_count() {
  int n = 0;
#pragma omp parallel reduction(+:n)
  n += 1;
  return n;
}
This function will create the default number of threads and each one of
those will increment the variable n
by one – the final value of
n
will be the number of threads that executed in parallel. If we add
a call to omp_thread_count()
to the above program we obtain as
output:
$ ./ompi_info.exe
OMP_NUM_THREADS =
hardware_concurrency() = 4
omp_get_max_threads() = 4
omp_get_num_threads() = 1
omp_thread_count() = 4
And, indeed, the number of threads that were run in parallel is four.
Another approach would be to call omp_get_num_threads()
from inside
a parallel scope:
int omp_thread_count() {
  int n = 0;
#pragma omp parallel
  {
    int tid = omp_get_thread_num();
    if (0 == tid) {
      n = omp_get_num_threads();
    }
  }
  return n;
}
One advantage for the programmer of using OpenMP is that parallelization can be done inline. When we parallelized programs explicitly using threads and tasks in C++, we had to bundle the work we wanted done into a separate function and launch a thread or task to execute that function. With OpenMP we can parallelize directly inline. For example, we can create a parallel “Hello World” as follows:
#include <iostream>
#include <thread>

#include <omp.h>

int main() {
  std::cout << "omp_get_max_threads() = " << omp_get_max_threads() << std::endl;
  std::cout << "omp_get_num_threads() = " << omp_get_num_threads() << std::endl;

#pragma omp parallel
  {
    std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " << omp_get_num_threads() << std::endl;
    std::cout << "My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
  }

  return 0;
}
There is no helper function needed.
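For contrast, here is a rough sketch of what the same “Hello” program looks like with explicit C++ threads, along the lines of the earlier assignments (the details here are illustrative, not code from a previous problem set). Note the separate helper function and the explicit thread management:

#include <iostream>
#include <thread>
#include <vector>

// With std::thread, the parallel work must be bundled into its own function.
void say_hello(unsigned tid, unsigned num_threads) {
  std::cout << "Hello! I am thread " << tid << " of " << num_threads << std::endl;
}

int main() {
  unsigned num_threads = std::thread::hardware_concurrency();

  std::vector<std::thread> threads;
  for (unsigned tid = 0; tid < num_threads; ++tid) {
    threads.push_back(std::thread(say_hello, tid, num_threads));
  }
  for (auto& t : threads) {
    t.join();
  }

  return 0;
}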
But, OpenMP is not a panacea. Underneath the covers, OpenMP still runs threads. And it doesn’t do anything special to fix the various issues we have seen with threads. This is the output from one run of the above program:
$ ./a.out
omp_get_max_threads() = 4
omp_get_num_threads() = 1
Hello! I am thread Hello! I am thread 0 of 4
Hello! I am thread 2 of 4
My C++ std::thread id is 140325565663104
My C++ std::thread id is 140325592336192
3 of 4
My C++ std::thread id is 140325561464768
Hello! I am thread 1 of 4
My C++ std::thread id is 140325569861440
The first thing to notice is that we can get the thread id for the current thread with C++ standard library mechanisms. Problematically though, note that the strings that are being printed are interleaved with each other. We have seen this before in multithreading – when we had race conditions.
Just as with the explicit threading case, OpenMP can have race
conditions when shared data is being written. However, as with the
mechanisms for parallelization, the OpenMP mechanisms for protecting
data are also expressed as #pragma omp
directives. We can fix this
race by adding a #pragma omp critical
. Note that this is more
general than the mutex approach we had before. We are just telling
OpenMP that there is a race – not how to protect it.
We explore a few variants of omp parallelization below.
The code for the first example is contained in the file hello_omp_parallel.cpp
and can be compiled to hello_omp_parallel.exe
:
$ make hello_omp_parallel.exe
The following examples can build .exe
files from the corresponding .cpp
in a similar manner.
In the first example, we just tell the compiler to execute the following block in parallel
1 2 3 4 5 | #pragma omp parallel
{
std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " << omp_get_num_threads() << std::endl;
std::cout << "My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
}
|
Experiment with this program a little bit until you get a feel for how OpenMP is managing threads and for how the OpenMP notion of a thread maps onto what you already know about threads. For example, try setting different values of OMP_NUM_THREADS in your environment. You should see different numbers of “Hello” messages being printed. What happens when you have more threads than hardware concurrency? Are the thread IDs unique?
The code for this example is contained in the file hello_omp_2_threads.cpp
.
In this version, we tell the compiler to execute the following block in parallel, using 2 threads.
#pragma omp parallel num_threads(2)
  {
    std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " << omp_get_num_threads() << std::endl;
    std::cout << "My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
  }
What effect does setting the environment variable OMP_NUM_THREADS
have on this program?
The code for this example is contained in the file hello_omp_num_threads.cpp.
Now, OpenMP is a little sneaky. Although it appears to operate in the form of preprocessor directives – and removing those preprocessor directives from the program will disable OpenMP – the directives can still interoperate with the program. The following program can be built to hello_omp_num_threads.exe:
#pragma omp parallel num_threads(nthreads)
  {
    std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " << omp_get_num_threads() << std::endl;
    std::cout << "My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
  }
Does the program compile when using a program variable in the directive?
(Hint: Yes.) Run your program, passing in different (integer) values on
the command line. What has precedence, the variable or
OMP_NUM_THREADS
regarding how many threads are actually run?
The code for this example is contained in the file hello_omp_critical.cpp.
#pragma omp critical
  {
    std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " << omp_get_num_threads() << std::endl;
    std::cout << "My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
  }
How many threads get run when you execute
this example? Does setting OMP_NUM_THREADS
have any effect?
The code for this example is contained in the file hello_omp_parallel_critical.cpp.
#pragma omp parallel
  {
    std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " << omp_get_num_threads() << std::endl;
    std::cout << "My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
#pragma omp critical
    {
      std::cout << " ----------------------------------------------------------------" << std::endl;
      std::cout << " #pragma omp critical" << std::endl;
      std::cout << " Hello! I am thread " << omp_get_thread_num() << " of " << omp_get_num_threads() << std::endl;
      std::cout << " My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
      std::cout << " ----------------------------------------------------------------" << std::endl;
    }
  }
The program has a parallel region which prints the “Hello World” information as before, as well as a region preceded by #pragma omp critical
.
How many threads get run when you execute
this example? Does setting OMP_NUM_THREADS
have any effect? Is there any difference in how the text for the critical and non-critical portions gets printed?
In problem set 5 we explored parallelizing some functions for computing the Euclidean norm of a vector. In this assignment we are going to look at how to use OpenMP to parallelize that same computation.
The main source files for these exercises are in the norm subdirectory. There is a separate driver program for each exercise. The actual norm functions that we will be parallelizing are in the (single) file norms.hpp in the include subdirectory.
There are five different functions that you will parallelize: norm_block_reduction(), norm_block_critical(), norm_cyclic_reduction(), norm_cyclic_critical(), and norm_parfor().
The drivers are named according to the function they test and can be made into the corresponding executable.
The drivers all operate in the same way (dispatching to run
, which is contained in include/norm_utils.hpp
). A driver can accept up to three arguments:
$ ./norm_block_critical.exe [ lo ] [ hi ] [ max_threads ]
where lo
is the smallest size vector to be tested, hi
is the upper bound of vector size to be tested, and max_threads
is the upper bound of threads to test.
As with the norm drivers in ps5, these drivers will run test problems from lo
to hi
, increasing in factors of two at each test, and also test with one to max_threads
, again in powers of two. The programs will print measured GFlops for each test.
Look through the code for run()
in norm_utils.hpp
. How are we setting the number of threads for OpenMP to use?
Using OpenMP, parallelize the norm()
functions in the file
norms.hpp
, as indicated below:
norm_block_reduction()
– the function should be parallelized in
block fashion according to the number of available OpenMP threads.
The update to
the shared variable sum should be protected using an OpenMP reduction
directive.
Some of the scaffolding for this function has been provided. The basic ideas should be familiar: determining the block size, assigning work to each worker. The difference is that rather than specifying how to partition the problem and then determining the number of threads to use, we are specifying how many threads to use and then determining how to partition the problem. Also, we don’t need to create a helper function.
norm_block_critical()
– the function should be parallelized in
block fashion according to the number of available OpenMP threads.
The update to
the shared variable sum should be protected using an OpenMP critical
directive.
norm_cyclic_reduction() – the function should be parallelized in cyclic fashion according to the number of available OpenMP threads. The update to the shared variable sum should be protected using an OpenMP reduction directive.
norm_cyclic_critical() – the function should be parallelized in cyclic fashion according to the number of available OpenMP threads. The update to the shared variable sum should be protected using an OpenMP critical directive.
norm_parfor() – the function should be parallelized using the OpenMP parallel for directive.
The function as delivered is a sequential version of two_norm
. You should not need to change anything in the source code (i.e., the function body) to effect parallelization – just add the appropriate directive(s) – and make sure to protect any shared variables.
These functions are tested by the program test/norms_test.cpp
.
Which version of norm
provides the best parallel performance? How do the results compare to the parallelized versions of norm
from ps5?
Which version of norm
provides the best parallel performance for larger problems (i.e., problems at the top end of the default sizes in the drivers or larger)? How do the results compare to the parallelized versions of norm
from ps5?
Which version of norm provides the best parallel performance for small problems (i.e., problems smaller than the low end of the default sizes in the drivers)? How do the results compare to the parallelized versions of norm from ps5?
For this part of the assignment, we are going to investigate OpenMP parallelization of sparse matrix-vector product.
The driver program for this problem is matvec/pmatvec.cpp
which
generates a sparse matrix corresponding to a discretized 2D laplacian
(for various problem sizes). It is essentially the same as the matvec.cpp program from the midterm except that it has an outer loop over how many threads we want to run.
How does pmatvec.cpp
set the number of OpenMP threads to use?
The program runs just like matvec.cpp
– it runs sparse matvec and sparse transpose matvec over each type of sparse matrix (we do not include AOS) for a range of problem sizes, printing out GFlops/sec for each. There is an outer loop that repeats this process for 1, 2, 4, and 8 threads. (You can adjust problem size and max number of threads to try by passing in arguments on the command line.)
There are six total multiply algorithms that we want to accelerate – a matvec and a t_matvec member function for each of COO, CSR, and CSC. Since the calls to mult ultimately boil down to a call to the member function, we can parallelize mult by parallelizing the member function.
Note that we are not going to make any kind of special member functions just for the parallelized version – in keeping with the spirit of OpenMP we are simply going to parallelize the member functions that are currently sequential (if we can).
Now, we can drop a #pragma omp parallel for around any of the for loops in the matvec and t_matvec functions. However, it is again important to emphasize that OpenMP is still just threads, and all of the same rules and limitations apply.
The test program for the matvec implementations is in test/pmatvec_test.cpp, which builds to pmatvec_test.exe.
You should experiment with parallelizing any of the algorithms you like and then running pmatvec.exe
as well as pmatvec_test.exe
. Note that I chose a matrix for the test program that was likely to trigger a failure.
(For discussion on Piazza.)
What characteristics of a matrix would make it more or less likely to exhibit an error
if improperly parallelized? Meaning, if, say, you parallelized CSCMatrix::matvec
with just basic columnwise partitioning – there would be potential races with the same locations in y
being read and written by multiple threads. But what characteristics of the matrix give rise to that kind of problem? Are there ways to maybe work around / fix that if we knew some things in advance about the (sparse) matrix?
Implement the safe-to-implement parallel algorithms as the appropriate overloaded member functions in CSCMatrix.hpp and CSRMatrix.hpp. You should just add an appropriate OpenMP directive to the functions you are parallelizing. Make sure to protect any shared variables that need protecting.
Which methods did you parallelize? What directives did you use? How much parallel speedup did you see for 1, 2, 4, and 8 threads?
In this problem we want to parallelize sparse matrix by dense matrix product.
The driver program for this problem is in the file matvec/pmatmat.cpp
. The driver is exactly the same as matmat.cpp
(also provided) except it takes an additional argument to specify maximum number of threads.
Implement the safe-to-implement parallel algorithms as the appropriate overloaded member functions in CSCMatrix.hpp and CSRMatrix.hpp, using your previously-written matmat methods. You should just add an appropriate OpenMP directive to the functions you are parallelizing. Make sure to protect any shared variables that need protecting. Note that you have some freedom in choosing loop orders as well as which loop (or loops – OpenMP supports nested parallelism) to use.
Which methods did you parallelize? What directives did you use? How much parallel speedup did you see for 1, 2, 4, and 8 threads? How does the parallel speedup compare to sparse matrix by vector product?
The pagerank subdirectory has a driver program pagerank.cpp
. This is the same raw program as from ps5, with the addition of functionality to set the number of OpenMP threads. Notably the call to mult()
does not have a partitioning argument – and we aren’t going to add one.
Rather, as with the previous assignment (and perhaps using the same conclusions), you should choose the right data structure (and/or algorithm) so that you can take advantage of the parallelized versions of the functions that you just wrote.
“Implement” a parallel pagerank.
Describe any changes you made to pagerank.cpp to get parallel speedup. How much parallel speedup did you get for 1, 2, 4, and 8 threads?
Parallelize as many of the other functions used by pagerank as you can, by adding appropriate directives in amath583.cpp
and/or amath583sparse.cpp
.
Which functions did you parallelize? How much additional speedup did you achieve?
In class and in our assignments up to now, we have taken the immediately obvious approach to partitioning work (or data) for parallelization: we divided the loop indices (or the range of Vector indices) into equal sizes and gave each piece of work to a thread.
There is an underlying assumption here, however. To get maximum parallel speedup, each thread needs to start and finish at the same time – meaning, each thread needs to have the same amount of work to do. In partitioning our work by loop index or by Vector index, we are assuming that these partitionings will result in equal amounts of work for each thread.
This assumption will hold, for example, if we are adding two vectors together (or performing numerical quadrature over a fixed interval). But, suppose we are doing a matrix vector product and partition the problem such that each thread gets an equal number of rows to process in the matvec. The amount of work per thread depends now on the number of nonzeros in the portion of the matrix it is processing. If the matrix has (approximately) an equivalent number of nonzeros per row, each thread will get (approximately) the same amount of work. In scientific computing where the matrix is derived from a discretized PDE, each row will have the same number of elements since the discretized differential operator (which gives rise to the nonzero pattern in the sparse matrix) is uniformly applied everywhere.
However, discretized PDEs aren’t the only source of sparse matrices. Consider the PageRank algorithm. Each incoming link to a page creates a non-zero in the corresponding row of the sparse matrix used to represent the page-link graph. The number of links per page in the web (or a social network, say) is highly non-uniform. Some pages have many links to them, others very few. A well-known celebrity or politician may have millions or tens of millions of followers. On the other hand, almost all of those millions of followers will have many many fewer followers of their own.
If we partition a sparse matrix derived from a social network based simply on assigning an equal number of rows from the sparse matrix, we can easily end up in the situation where one thread is processing orders of magnitude more work than another, with the resultant loss in parallel speedup.
When we are partitioning a loop or a data structure ourselves, we can also use what we know about the problem and develop an appropriate partitioning strategy. For example, we can partition a sparse matrix so that there are approximately the same number of nonzeros per partition.
Fortunately OpenMP provides some mechanisms for deciding how to assign work to threads,
namely the schedule
option to parallel for
. We can use that option as part
of a pragma:
#pragma omp parallel for schedule(static)
Alternatively, we can make a library call:
omp_set_schedule(omp_sched_static, 0);
A good description of the different options for OpenMP scheduling can be found in the OpenMP documentation.
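As a concrete, generic illustration of both forms (the schedule clause on the directive, and the library call combined with schedule(runtime)), here is a small self-contained sketch; the loop body is just a stand-in and is not code from ps6:

#include <cmath>
#include <iostream>
#include <vector>

#include <omp.h>

// Stand-in for per-iteration work whose cost may vary from iteration to iteration.
double work(int i) { return std::sqrt(static_cast<double>(i % 1000) + 1.0); }

int main() {
  const int n = 1 << 20;
  std::vector<double> results(n);

  // Static schedule: iterations are divided into (roughly) equal chunks up front.
#pragma omp parallel for schedule(static)
  for (int i = 0; i < n; ++i) {
    results[i] = work(i);
  }

  // Dynamic schedule with chunks of 64: threads grab chunks as they finish,
  // which helps when iterations have very different costs.
#pragma omp parallel for schedule(dynamic, 64)
  for (int i = 0; i < n; ++i) {
    results[i] = work(i);
  }

  // Library-call form: set the schedule at run time and use schedule(runtime)
  // on the loop so the run-time setting takes effect.
  omp_set_schedule(omp_sched_guided, 0);
#pragma omp parallel for schedule(runtime)
  for (int i = 0; i < n; ++i) {
    results[i] = work(i);
  }

  std::cout << results[n / 2] << std::endl;
  return 0;
}

In pagerank.cpp you would typically place the omp_set_schedule call near the top of main, before the parallel loops it should affect.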
Add a call to omp_set_schedule
to main
in pagerank.cpp
. Experiment with different
options for scheduling. Many of the sparse matrices in the google drive are from social networks and
so should have poor balancing.
What scheduling options did you experiment with? Are there any choices for scheduling that make an improvement in the parallel performance (most importantly, scalability) of pagerank?
In the previous problem set and this one, you may have noticed that with the -Ofast compiler option enabled, the sequential implementation of two_norm seemed to be much faster than the threaded version run on one thread. This behavior was not seen when just -O3 optimization was used. What is happening is that the compiler is introducing parallelization even though we didn’t explicitly call for it – namely, SIMD parallelism (vectorization).
We discussed SIMD parallelism and vector instructions in Lecture 8 (cf slides 30-43), where we noted that if we had a sufficient number of operands and operations that could be done at once, the AVX instructions enabled that.
In addition to being able to perform operations simultaneously on multiple operands (i.e., in parallel), AVX also provides vector load and store instructions, meaning multiple operands can be moved from memory (equivalently cached) into the vector registers with a single instruction (this is how we typically achieve the highest bandwidths available in the roofline model).
The operands being moved from memory must be contiguous (index offset by 1). The compiler looks for loop expressions that have unit stride (incrementing the index by 1) as opportunities to generate vector instructions. Consider two_norm:
double norm_seq(const Vector& x) {
double sum = 0;
for (size_t i = 0; i < x.num_rows(); ++i) {
sum += x(i) * x(i);
}
return std::sqrt(sum);
}
In looking at this code, the compiler can recognize that a set of consecutive operands from memory is being accessed. In fact, one (intermediate) optimization is to unroll the loop.
double norm_seq(const Vector& x) {
double sum = 0;
for (size_t i = 0; i < x.num_rows(); i += 4) {
sum += x(i + 0) * x(i + 0);
sum += x(i + 1) * x(i + 1);
sum += x(i + 2) * x(i + 2);
sum += x(i + 3) * x(i + 3);
}
return std::sqrt(sum);
}
But now we can easily see that there are four operands, all in contiguous memory, with the same operation (multiply) being applied to each. The compiler can turn this into a vector load to get four operands at once, a vector multiply to do four multiplies at once, and a vector reduction to add the four products into sum.
Now, doing these operations in parallel will change the numeric answer for floating point compared to the sequential version, so automatic vectorization is only enabled with -Ofast, not -O3 (you can also explicitly ask for it with, e.g., -mavx2). You can get diagnostics from the compiler about what it did for vectorization with the -Rpass=loop-vectorize compiler option. If we use that and compile two_norm we would see the following:
../src/amath583.cpp:57:3: remark: vectorized loop (vectorization width: 4, interleaved count: 4) [-Rpass=loop-vectorize]
for (size_t i = 0; i < x.num_rows(); ++i) {
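For example, with clang you could add the flag to a compile line like the following (the exact source file and include path here depend on your setup):

$ c++ -Ofast -Rpass=loop-vectorize -I../include -c ../src/amath583.cpp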
In addition to directives for multithreading the work in loops, OpenMP also provides a directive for vectorizing a loop: simd
. To vectorize the for loop in two_norm
above
double norm_seq(const Vector& x) {
  double sum = 0;
#pragma omp simd reduction(+:sum)
  for (size_t i = 0; i < x.num_rows(); ++i) {
    sum += x(i) * x(i);
  }
  return std::sqrt(sum);
}
Using OpenMP, experiment with applying simd parallelization to one or more of the norm_block functions or norm_cyclic functions in the norm subdirectory. If you keep in mind what the compiler is looking for you should be able to narrow down your options to functions that are most likely to be amenable to vectorization. Compiler optimization should be kept to -O3 so that vectorization is only a result of your OpenMP directives.
OpenMP vectorized version of at least one function, obtaining noticeable speedup for all threading levels.
Which function did you vectorize with OpenMP? How much speedup were you able to obtain over the non-vectorized (sequential) version?
You may also try compiling with -Ofast and compare the OpenMP vectorized version with one thread to the sequential version.
Answer the following questions (append to Questions.rst): a) The most important thing I learned from this assignment was… b) One thing I am still not clear on is…
Submit three files to Gradescope’s “ps6 – coding” assignment:
CSRMatrix.hpp
CSCMatrix.hpp
norms.hpp
Starting from this assignment, the autograder knows whether you are in 483 or 583, so it should automatically give you credit for the tests that are optional for your section.
Next, submit Questions.pdf to “ps6 – written”. If some questions are optional for your section, please indicate this in the respective parts of the write-up.
If you relied on any external references, list them in a file refs.txt and include that as well.