Problem Set #6

Assigned: May 11, 2021

Due: May 18, 2021

Background

Introduction

As with the previous assignment, this assignment extracts into a directory ps6. The tar file is at this link.


Don’t Repeat Yourself

A software development theme that we have been emphasizing throughout the course is: don’t repeat yourself (DRY). We are going to apply this in an important way to our code organization starting with this assignment.

We have been evolving some of our files throughout the course, while many others have remained the same (or settled down after a period of change) – yet we have created new copies of these files assignment after assignment. Moreover, we have been accumulating a number of different projects within the course, but we have been keeping all of the files together in a single directory.

For this assignment (and the following), the files for the assignment are organized into separate subdirectories, as follows:

ps6/
├── Questions.rst
├── README.rst
├── data/
├── hello_omp/
│   ├── Makefile
│   ├── getenv.cpp
│   ├── hello_omp_2_threads.cpp
│   ├── hello_omp_critical.cpp
│   ├── hello_omp_num_threads.cpp
│   ├── hello_omp_parallel.cpp
│   ├── hello_omp_parallel_critical.cpp
│   └── ompi_info.cpp
├── include/
│   ├── AOSMatrix.hpp
│   ├── COOMatrix.hpp
│   ├── CSCMatrix.hpp
│   ├── CSRMatrix.hpp
│   ├── Make.inc
│   ├── Matrix.hpp
│   ├── Timer.hpp
│   ├── Vector.hpp
│   ├── amath583.hpp
│   ├── amath583IO.hpp
│   ├── amath583sparse.hpp
│   ├── catch.hpp
│   ├── getenv.hpp
│   ├── norm_utils.hpp
│   ├── norms.hpp
│   └── pagerank.hpp
├── matvec/
│   ├── Makefile
│   ├── matmat.cpp
│   ├── matvec.cpp
│   ├── pmatmat.cpp
│   └── pmatvec.cpp
├── norm/
│   ├── Makefile
│   ├── norm_block_critical.cpp
│   ├── norm_block_reduction.cpp
│   ├── norm_cyclic_critical.cpp
│   ├── norm_cyclic_reduction.cpp
│   ├── norm_parfor.cpp
│   └── norm_seq.cpp
├── pagerank/
│   ├── Makefile
│   └── pagerank.cpp
├── src/
│   ├── amath583.cpp
│   ├── amath583IO.cpp
│   └── amath583sparse.cpp
└── test/
    ├── Makefile
    ├── matmat_test.cpp
    ├── matvec_test.cpp
    ├── norms_test.cpp
    ├── pagerank_test.cpp
    ├── pmatmat_test.cpp
    └── pmatvec_test.cpp

The main organizing principle is that the common files (included by source files) are maintained in the include subdirectory. The other subdirectories contain the source files for testing and for particular projects / exercises; for instance, all of the norm-related main source files are in a single subdirectory. The convention is also to build the various projects in their own source directories. The Makefiles have been amended accordingly. (Note, though, that the large amount of code that used to be repeated across the Makefiles has all been moved to include/Make.inc, which is included by the other Makefiles. The individual Makefiles are now quite spare.)

Questions.rst is at the top level of the ps6 directory.

Note

make verify

You may notice that there are no longer verify.bash scripts for checking your code. Instead, the Makefile (defined in Make.inc) has a target verify that will build and run all of the executables specified in the Makefile. So from any of the subdirectories

$ make verify

will do the verification.

$ make all

will build all of the targets for that subdirectory.

(This is another example of DRY – we aren’t specifying targets in any location but one – as part of a macro in each subdirectory’s Makefile. Previously we had been specifying the files to verify in both the Makefile and the verify scripts.)

vscode

You may need to update some settings in vscode so that the built-in error-checker can find the include files in their new location. Otherwise you may see the red squiggles under the include statements for the files in include.

(Before proceeding with this step, you may want to do the next step first and install OpenMP. If it isn’t installed, vscode won’t be able to find omp.h because it doesn’t yet exist – no settings will make the “not found” squiggles go away if something can’t actually be found.)

First see if you need to update the settings at all. If there are no red squiggles under, e.g., #include "Vector.hpp" you are all set.

If there are red squiggles, first try resetting Intellisense. Open the command palette and run “C/C++ Reset Intellisense Database” (start typing “reset” into the command palette and it will show you that as a menu option).

If there are still red squiggles, open Settings and select “Workspace” (as opposed to “User”) in the upper left. Then search for “includePath”. Select “Edit in settings.json” under C_Cpp > Default: Include Path. Add the path to the include directory:

"C_Cpp.default.includePath": [
     "${workspaceFolder}/ps6/include/"
 ],

This assumes the “root” of what you are doing in vscode is one level above ps6. It should work in most cases, but you can also provide an absolute path

"C_Cpp.default.includePath": [
     "/Users/al75/amath583/assignments/ps6/include/"
 ],

If you look in the Explorer column at the left-hand of vscode, you can see what the root is. There may even be a .vscode subdirectory visible – the change indicated above would appear in the file .vscode/settings.json if you want to edit that file directly.

Setting up OpenMP

To compile programs with OpenMP directives (or, more specifically, to parallelize programs with OpenMP directives), you may need to do some additional setup to your development environment.

Mac OS: clang

The default compiler in Mac OS will support OpenMP but it has to be invoked in a special way and you will need to install the OpenMP libraries.

The typical way (with g++ and with clang on Linux) to invoke the compiler with OpenMP is to add the flag -fopenmp. If you just use the flag that way with Mac OS clang it will complain with an error message

clang: error: unsupported option '-fopenmp'

To get clang to accept the option, you need to add the -Xpreprocessor flag, e.g.,

$ c++ -Xpreprocessor -fopenmp -c hello_omp.cpp -o hello_omp.o

That will be enough to get your program to compile into a .o file, but you also need to specify the OpenMP libraries.

$ c++ -Xpreprocessor -fopenmp hello_omp.cpp -o hello_omp.exe -lomp

Since Mac OS does not readily support OpenMP compilation, it also doesn’t include the necessary libraries for OpenMP. To install those, you need to invoke brew (or your favorite package manager) to install them.

$ brew install libomp

Once you have done that, you should be able to compile your OpenMP programs.

The Makefile associated with ps6 has some logic in it to detect which compiler you are using and will choose the appropriate flags to correctly (hopefully) use OpenMP.

Mac OS: g++-9

Another alternative for Mac OS X is to install g++-9 and use that.

$ brew install gcc@9

You may need to also install the OpenMP libraries (though some versions of g++ may have them automagically).

$ brew install libomp

Note that if you want to use g++-9 to compile, you will need to change the macro at the top of include/Make.inc to use g++-9 rather than the built-in c++.

CXX := g++-9

Linux and WSL

I have tested out compiling for OpenMP with Ubuntu 18.04 – native and under WSL. Both clang and g++ support the -fopenmp flag. If you get an error message at link time complaining about undefined symbols (where the symbols look OpenMP-related), then you will also need to install the OpenMP libraries. Under Ubuntu you can do this with:

$ sudo apt-get -y install libomp5

You may also want to check with apt search to see what versions of the OpenMP libraries are available to install. But version 5 worked for me (at least for a few tests).

The Makefile included with ps6 includes logic that should choose the right flags for OpenMP compilation.

Preliminaries

Environment Variables

For command-line programs (like the ones we have been writing in this course), the user of a program can pass parameters to it on the command line and they are “automagically” put into the argv array. This is obviously a very useful way for a user of a program to state their intentions to the program. Often however, the program itself needs to find out information about the environment in which it is executing – and this may be information that the user might not know or care about or that would be inconvenient to pass on the command line every time a program is run.

To enable a program to obtain information about its environment, most operating systems provide a simple look-up mechanism that associates variable names with values (though these values are also just strings). One such variable that we have discussed previously in class is the status variable (i.e., the return value from a main() function).

From the bash command line, if we want to retrieve the value associated with an environment variable (if we want to evaluate it) we prepend a $ sign to it. In bash the status variable was retrieved with $?, though in the csh shell, it is retrieved with $status. As with program variables, evaluating a variable is not the same as printing it out. If you just type

$ $?

at the command line, you will get the cryptic response

0: Command not found

because the shell evaluated $?, which evaluated to 0. This in turn was evaluated by bash just as if we had typed in

$ 0

and since 0 is (probably) not an executable command, the shell prints its error message. If you want to examine the value of an environment variable in the shell, you need to pass that value to a program that will just print its argument back out. In most Unix-like environments, this command is echo:

$ echo $?
0

$ echo $SHELL
/bin/bash

Now, there are actually two overlapping namespaces in most command-line shells: the environment variables and the shell variables. These variables are accessed in the same way, with the $ prefix – but only the environment variables are available to running programs. The shell variables are used only by the shell program itself.

To see a list of all of the environment variables, issue the command:

$ env

Note that this is an executable program, not a built-in shell command, so it will retrieve and print the list of environment variables that are accessible to all programs. Try this yourself. You should see a couple dozen lines that look like this (I am just showing a few):

HOSTNAME=259e82e56952
SHELL=/bin/bash
TERM=xterm
HOME=/home/amath583
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/home/amath583
SHLVL=1

These variables in the environment are not special in any particular way. Again, the environment just maintains an association between variable names (e.g., HOME) and values (e.g., /home/amath583). There are established conventions about how certain environment variables are used by different programs. And some programs will update the environment. For instance, if you query the value of PWD:

$ cd /
$ echo $PWD
/
$ cd
$ echo $PWD
/home/amath583

so the name of the current working directory is updated whenever you change directories.

Another important environment variable is the path (PATH). This contains a colon-separated list of directories. When you enter a command on the command line, the shell will search each of those directories for a program that matches the name you typed. If it finds a match, it will execute that program (using fork() and exec() as we have discussed in lecture). If it does not find a match, it will print an error message (and set the status to a non-zero value).

But how does the bash shell – which itself is just a program – get the value of the environment variables? The C++ standard library provides a function for looking up environment variables.

Disclaimer. As with main() when passed arguments, the function for querying the environment deals with character pointers. I will first explain the function itself and then give you a wrapper function that you can use so that you don’t have to actually deal with pointers.

The function for querying the environment is std::getenv(), prototyped as follows:

char* std::getenv(const char* str);

The function will look for a variable whose name matches the character string passed in as str.

If there is a match between the string pointed to by str and a variable in the environment, getenv will return a pointer to the corresponding value in the environment.

If a match is not found – and this is another one of the appalling characteristics of pointers – getenv will indicate the error by setting the returned pointer to a special value, namely NULL (or zero). NULL pointers are no end of trouble and have probably set the software community back a generation. They are a problem (this is a general problem with pointers – but NULL is the most common case) because when a programmer has a pointer they want to de-reference it. If for some reason the pointer is NULL or some other invalid value, the program will crash with a segmentation violation and abort. Since pointers can be manipulated just like any other variable in C, it is easy to give one an invalid value. Moreover, as with getenv, the special (and dangerous) value NULL is often used to indicate an error condition for function calls that return pointers. If that value is not checked, and is later used, the rare occasion on which the call fails will result in a segmentation violation.
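To make the required checking concrete, here is a minimal sketch (not one of the assignment files) of the boilerplate that a raw call to std::getenv forces on you:

#include <cstdlib>
#include <iostream>

int main() {
  // std::getenv returns a null pointer when the variable is not set,
  // so the result must be checked before it is used
  const char* home = std::getenv("HOME");
  if (NULL == home) {
    std::cout << "HOME is not set" << std::endl;
  } else {
    std::cout << "HOME = " << home << std::endl;
  }
  return 0;
}

Forgetting the check and streaming home directly into std::cout when the variable is unset is exactly the kind of null-pointer use described above.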

If you are curious, try this program for yourself:

int main() {
  char* ptr = 0;
  return *ptr;
}

Or, if you want possibly the shortest compiles-but-segfaults example:

int main() {
  return *(volatile int*)0;
}

More details on the memory management issues behind segmentation violations can be found in the Excursus on Memory Management.

So that you can avoid dealing with pointers (and you should), I provide the following overload of getenv that uses std::string rather than char* (included in the file getenv.hpp).

#include <cstdlib>
#include <iostream>
#include <string>

std::string _getenv(const char* in) {
  char* gotten = std::getenv(in);
  if (NULL == gotten) {
    return std::string("");
  } else {
    return std::string(gotten);
  }
}

std::string getenv(const std::string& in) {
  if (NULL == in.c_str()) {
    return std::string("");
  }
  return _getenv(in.c_str());
}

Just pass in a C string literal (a const char*) or a std::string and it will return a std::string. If an environment variable is not found, this function will return an empty string rather than a null pointer.
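For example, the following hypothetical snippet (not the provided driver) never sees a pointer at all:

#include <iostream>

#include "getenv.hpp"

int main() {
  // Returns the value of HOME, or an empty string if HOME is not set
  std::string home  = getenv(std::string("HOME"));
  std::string bogus = getenv(std::string("NO_SUCH_VARIABLE"));

  std::cout << "HOME             = " << home << std::endl;
  std::cout << "NO_SUCH_VARIABLE = '" << bogus << "'" << std::endl;

  return 0;
}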

A simple driver program is in hello_omp/getenv.cpp which you can compile to getenv.exe. Experiment with executing getenv.exe and examining some of the variables individually. You can also use the env command from your shell.

Create some new variables and examine their values. In a separate shell window, you might try the following (as an example of what to be careful about):

  • What do you get when you issue the command echo $PATH?

  • What happens when you issue the command export PATH=""?

  • What happens when you issue the command echo $PATH following the command above?

  • ^D may come in handy after the above.

Warm Up

The main source files for the warm up are in ps6/hello_omp subdirectory.

As we have seen in lecture, OpenMP provides a compiler directive based mechanism for parallelizing a job for shared-memory execution. One of the principles upon which OpenMP is based is that the program source code does not need to be changed – the compiler, as controlled by OpenMP directives (#pragma omp statements), manages all concurrency. At the same time, there are parameters about the running parallel program that we would like the parallel program to be aware of. For example, we might want to specify the number of threads that a program uses to execute. Or, we might want to partition data (or a loop) ourselves.

Now, with the OpenMP philosophy, the program itself can be both sequential and parallel – and it should largely be “unaware” of its parallelization. (This should be true for any parallel program – concerns should be well-separated.) Rather than using command-line arguments as a way to pass information to OpenMP, OpenMP uses environment variables. Environment variables also have the advantage (compared to command-line arguments) that they can be accessed from any arbitrary part of the program – including from a library. You can find a list of environment variables that OpenMP uses on the OpenMP web site or the OpenMP quick reference (http://bit.ly/2pCSigX). The most important environment variable is OMP_NUM_THREADS, which sets the (maximum) number of threads that OpenMP will use at run-time.

Note

Environment variables

It is important to remember that environment variables don’t actually do anything. Setting the value of OMP_NUM_THREADS to some value doesn’t automatically parallelize your program or cause OpenMP to have that many threads. Rather it is a variable whose value any program can read – it is up to the program what action to take (if any) in response to the value it reads.

Note

clang and cmath

There may be an incompatibility between clang and the cmath include file – this problem will only show up if you compile with the -Ofast compilation flag in addition to -fopenmp. (The conflict has to do with recently introduced SIMD commands in OpenMP 4.0.) If you get a page full of errors for what should be a correct program, change -Ofast to -O3.

When a program is compiled with OpenMP, it can be run just as with any other program – execute it on the shell command line.

ompi_info

When running a parallel program, there are several pieces of information we would like to have. For example, in the last assignment, we partitioned our work into pieces and let a thread or task execute each piece (in parallel). For that assignment, we specified the number of threads / tasks to use so that we could experiment with different numbers of threads. In general, however, we don’t want the user to just arbitrarily set the number of threads that a job runs with. We might not want to run with more threads than there are cores available, for example. Or, we might want to set some resource limit on how many threads are used (which might be more or less than the number of cores). That is, we want to let the runtime system have some control over how our parallel program runs (and doing this automatically is in the spirit of OpenMP).

Consider the following program (hello_omp/ompi_info.cpp) that inspects the OpenMP environment:

#include <omp.h>

#include <thread>    // for std::thread::hardware_concurrency()

#include "getenv.hpp"

int main() {

  std::string envName = "OMP_NUM_THREADS";
  std::cout << envName << "        = " << getenv(envName) << std::endl;

  std::cout << "hardware_concurrency() = " << std::thread::hardware_concurrency() << std::endl;
  std::cout << "omp_get_max_threads()  = " << omp_get_max_threads() << std::endl;
  std::cout << "omp_get_num_threads()  = " << omp_get_num_threads() << std::endl;

  return 0;
}

You can build this with “make ompi_info.exe”.

$ make ompi_info.exe

Note that we are using the overloaded getenv() function we presented earlier.

When I run this in my shell I get the following output (the numbers might vary depending on your hardware):

$ ./ompi_info.exe
OMP_NUM_THREADS        =
hardware_concurrency() = 4
omp_get_max_threads()  = 4
omp_get_num_threads()  = 1

In this case, OMP_NUM_THREADS is not set so OpenMP uses default values. However, we have to be careful how we interpret this. The first line shows that OMP_NUM_THREADS is not set. The second line is obtained with a call to the C++ standard library (outside of OpenMP) and shows that the available hardware concurrency (number of cores) is equal to four. The next two lines were obtained with calls to functions in the OpenMP library. They show that the maximum available number of OpenMP threads is equal to four. The final line, however, may seem curious – it says that the number of threads is equal to one.

If we take a step back, though, a thread count of one makes sense in the context where it is called. Recall from lecture that OpenMP is based on the fork-join model: execution is sequential between OpenMP directives, and the code blocks that follow the directives are run in parallel. In main(), where the call to omp_get_num_threads() is made, there is only one running thread – there are no other threads running concurrently with it. But how many threads actually run at one time with OpenMP? And how can we tell?

Let’s add a simple function to the above:

int omp_thread_count() {
  int n = 0;
#pragma omp parallel reduction(+:n)
  n += 1;
  return n;
}

This function will create the default number of threads and each one of those will increment the variable n by one – the final value of n will be the number of threads that executed in parallel. If we add a call to omp_thread_count() to the above program we obtain as output:

$ ./ompi_info.exe
OMP_NUM_THREADS        =
hardware_concurrency() = 4
omp_get_max_threads()  = 4
omp_get_num_threads()  = 1
omp_thread_count()     = 4

And, indeed, the number of threads that were run in parallel is four.

Another approach would be to call omp_get_num_threads() from inside a parallel scope:

int omp_thread_count() {
  int n = 0;
#pragma omp parallel
  {
    int tid = omp_get_thread_num();
    if (0 == tid) {
      n = omp_get_num_threads();
    }
  }
  return n;
}

OpenMP is Still Threads

One advantage of using OpenMP as a programmer is that parallelization can be done in line. When we parallelized programs explicitly using threads and tasks in C++, we had to bundle the work we wanted done into a separate function and launch a thread or task to execute that function. With OpenMP we can parallelize directly in line. For example, we can create a parallel “Hello World” as follows:

int main() {
  std::cout << "omp_get_max_threads()  = " <<
      omp_get_max_threads() << std::endl;
  std::cout << "omp_get_num_threads()  = " <<
      omp_get_num_threads() << std::endl;

#pragma omp parallel
  {
    std::cout << "Hello! I am thread " << omp_get_thread_num() <<
        " of " << omp_get_num_threads() << std::endl;
    std::cout << "My C++ std::thread id is " << std::this_thread::get_id() <<
        std::endl;
  }

  return 0;
}

There is no helper function needed.
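For contrast, here is a rough sketch (not one of the provided files) of what the same hello-world requires with explicit C++ threads, where the work has to be bundled into a helper function:

#include <iostream>
#include <thread>
#include <vector>

// With explicit std::thread, the parallel work must be packaged
// into a separate callable and handed to each thread
void say_hello(unsigned tid, unsigned nthreads) {
  std::cout << "Hello! I am thread " << tid << " of " << nthreads << std::endl;
}

int main() {
  unsigned nthreads = std::thread::hardware_concurrency();
  if (0 == nthreads) nthreads = 4;   // hardware_concurrency() may report 0

  std::vector<std::thread> workers;
  for (unsigned tid = 0; tid < nthreads; ++tid) {
    workers.emplace_back(say_hello, tid, nthreads);
  }
  for (auto& t : workers) {
    t.join();
  }

  return 0;
}

(Its output can interleave too, as we are about to see with the OpenMP version.)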

But, OpenMP is not a panacea. Underneath the covers, OpenMP still runs threads. And it doesn’t do anything special to fix the various issues we have seen with threads. This is the output from one run of the above program:

$ ./a.out
omp_get_max_threads()  = 4
omp_get_num_threads()  = 1
Hello! I am thread Hello! I am thread 0 of 4
Hello! I am thread 2 of 4
My C++ std::thread id is 140325565663104
My C++ std::thread id is 140325592336192
3 of 4
My C++ std::thread id is 140325561464768
Hello! I am thread 1 of 4
My C++ std::thread id is 140325569861440

The first thing to notice is that we can get the thread id for the current thread with C++ standard library mechanisms. Problematically though, note that the strings that are being printed are interleaved with each other. We have seen this before in multithreading – when we had race conditions.

Just as with the explicit threading case, OpenMP can have race conditions when shared data is being written. However, as with the mechanisms for parallelization, the OpenMP mechanisms for protecting data are also expressed as #pragma omp directives. We can fix this race by adding a #pragma omp critical. Note that this is more general than the mutex approach we had before. We are just telling OpenMP that there is a race – not how to protect it.

Examples

We explore a few variants of omp parallelization below.

The code for the first example is contained in the file hello_omp_parallel.cpp and can be compiled to hello_omp_parallel.exe:

$ make hello_omp_parallel.exe

The remaining examples can be built into .exe files from their corresponding .cpp files in the same manner.

#pragma omp parallel

In the first example, we just tell the compiler to execute the following block in parallel

#pragma omp parallel
  {
    std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " << omp_get_num_threads() << std::endl;
    std::cout << "My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
  }

Experiment with this program a little until you get a feel for how OpenMP manages threads and for how the OpenMP concepts of threads, etc., map to what you already know about threads.

For example, try setting different values of OMP_NUM_THREADS in your environment. You should see different numbers of “Hello” messages being printed. What happens when you have more threads than hardware concurrency? Are the thread IDs unique?
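From a bash-like shell you can set the variable for a single run, or export it for the rest of the session:

$ OMP_NUM_THREADS=2 ./hello_omp_parallel.exe
$ export OMP_NUM_THREADS=16
$ ./hello_omp_parallel.exe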

#pragma omp parallel num_threads(2)

The code for this example is contained in the file hello_omp_2_threads.cpp. In this version, we tell the compiler to execute the following block in parallel, using 2 threads.

#pragma omp parallel num_threads(2)
  {
    std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " << omp_get_num_threads() << std::endl;
    std::cout << "My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
  }

What effect does setting the environment variable OMP_NUM_THREADS have on this program?

#pragma omp parallel num_threads(n)

The code for this example is contained in the file hello_omp_num_threads.cpp.

Now, OpenMP is a little sneaky. Although it appears to operate in the form of preprocessor directives – and removing those preprocessor directives from the program will disable OpenMP – the directives can still interoperate with the program. The following program can be built to hello_omp_num_threads.exe:

#pragma omp parallel num_threads(nthreads)
  {
    std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " << omp_get_num_threads() << std::endl;
    std::cout << "My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
  }
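As a minimal sketch (the provided hello_omp_num_threads.cpp may differ in its details), a value typed on the command line can end up in the num_threads clause like this:

#include <cstdlib>
#include <iostream>
#include <omp.h>

int main(int argc, char* argv[]) {
  // Read the requested thread count from the command line, defaulting to 4
  int nthreads = (argc > 1) ? std::atoi(argv[1]) : 4;
  if (nthreads < 1) nthreads = 1;

#pragma omp parallel num_threads(nthreads)
  {
    std::cout << "Hello! I am thread " << omp_get_thread_num()
              << " of " << omp_get_num_threads() << std::endl;
  }

  return 0;
}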

Does the program compile when a program variable is used in the directive? (Hint: Yes.) Run your program, passing in different (integer) values on the command line. Which takes precedence in determining how many threads are actually run: the program variable or OMP_NUM_THREADS?

#pragma omp critical

The code for this example is contained in the file hello_omp_critical.cpp

#pragma omp critical
  {
    std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " << omp_get_num_threads() << std::endl;
    std::cout << "My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
  }

How many threads get run when you execute this example? Does setting OMP_NUM_THREADS have any effect?

#pragma omp parallel with critical

The code for this example is contained in the file hello_omp_parallel_critical.cpp

#pragma omp parallel
  {
    std::cout << "Hello! I am thread " << omp_get_thread_num() << " of " << omp_get_num_threads() << std::endl;
    std::cout << "My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
    #pragma omp critical
    {
      std::cout << "  ----------------------------------------------------------------" << std::endl;
      std::cout << "  #pragma omp critical" << std::endl;
      std::cout << "    Hello! I am thread " << omp_get_thread_num() << " of " << omp_get_num_threads() << std::endl;
      std::cout << "    My C++ std::thread id is " << std::this_thread::get_id() << std::endl;
      std::cout << "  ----------------------------------------------------------------" << std::endl;
    }
  }

The program has a parallel region which prints the “Hello World” information as before, as well as a region preceded by #pragma omp critical. How many threads get run when you execute this example? Does setting OMP_NUM_THREADS have any effect? Is there any difference in how the text for the critical and non-critical portions gets printed?

Exercises

Norm

In problem set 5 we explored parallelizing some functions for computing the Euclidean norm of a vector. In this assignment we are going to look at how to use OpenMP to parallelize that same computation.

The main source files for these exercises are in the norm subdirectory. There is a separate driver program for each exercise. The actual norm functions that we will be parallelizing are in the (single) file norms.hpp in the include subdirectory.

There are five different functions that you will parallelize: norm_block_reduction(), norm_block_critical(), norm_cyclic_reduction(), norm_cyclic_critical(), and norm_parfor().

The drivers are named according to the function they test and can be made into the corresponding executable.

The drivers all operate in the same way (dispatching to run, which is contained in include/norm_utils.hpp). A driver can accept up to three arguments:

$ ./norm_block_critical.exe [ lo ] [ hi ] [ max_threads ]

where lo is the smallest size vector to be tested, hi is the upper bound of vector size to be tested, and max_threads is the upper bound of threads to test.

As with the norm drivers in ps5, these drivers will run test problems from lo to hi, increasing in factors of two at each test, and also test with one to max_threads, again in powers of two. The programs will print measured GFlops for each test.

Look through the code for run() in norm_utils.hpp. How are we setting the number of threads for OpenMP to use?

Using OpenMP, parallelize the norm() functions in the file norms.hpp, as indicated below:

  • norm_block_reduction() – the function should be parallelized in block fashion according to the number of available OpenMP threads. The update to the shared variable sum should be protected using an OpenMP reduction directive.

    Some of the scaffolding for this function has been provided. The basic ideas should be familiar: determining the block size, assigning work to each worker. The difference is that rather than specifying how to partition the problem and then determining the number of threads to use, we are specifying how many threads to use and then determining how to partition the problem. Also, we don’t need to create a helper function.

  • norm_block_critical() – the function should be parallelized in block fashion according to the number of available OpenMP threads. The update to the shared variable sum should be protected using an OpenMP critical directive.

  • norm_cyclic_reduction() – the function should be parallelized in cyclic fashion according to the number of available OpenMP threads. The update to the shared variable sum should be protected using an OpenMP reduction directive.

  • norm_cyclic_critical() – the function should be parallelized in cyclic fashion according to the number of available OpenMP threads. The update to the shared variable sum should be protected using an OpenMP critical directive.

  • norm_parfor() – the function should be parallelized using an OpenMP parallel for directive. The function as delivered is a sequential version of two_norm. You should not need to change anything in the source code (i.e., the function body) to effect parallelization – just add the appropriate directive(s) – and make sure to protect any shared variables.

These functions are tested by the program test/norms_test.cpp.

Which version of norm provides the best parallel performance? How do the results compare to the parallelized versions of norm from ps5?

Which version of norm provides the best parallel performance for larger problems (i.e., problems at the top end of the default sizes in the drivers or larger)? How do the results compare to the parallelized versions of norm from ps5?

Which version of norm provides the best parallel performance for small problems (i.e., problems smaller than the low end of the default sizes in the drivers)? How do the results compare to the parallelized versions of norm from ps5?

Sparse Matrix-Vector Product

For this part of the assignment, we are going to investigate OpenMP parallelization of sparse matrix-vector product.

The driver program for this problem is matvec/pmatvec.cpp, which generates a sparse matrix corresponding to a discretized 2D Laplacian (for various problem sizes). It is essentially the same as the matvec.cpp program from the midterm except that it has an outer loop over how many threads we want to run.

How does pmatvec.cpp set the number of OpenMP threads to use?

The program runs just like matvec.cpp – it runs sparse matvec and sparse transpose matvec over each type of sparse matrix (we do not include AOS) for a range of problem sizes, printing out GFlops/sec for each. There is an outer loop that repeats this process for 1, 2, 4, and 8 threads. (You can adjust problem size and max number of threads to try by passing in arguments on the command line.)

There are six total multiply algorithms that we want to accelerate – a matvec and a t_matvec member function for each of COO, CSR, and CSC. Since the calls to mult ultimately boil down to a call to the member function, we can parallelize mult by parallelizing the member function.

Note that we are not going to make any kind of special member functions just for the parallelized version – in keeping with the spirit of OpenMP we are simply going to parallelize the member functions that are currently sequential (if we can).

Now, we can drop a #pragma omp parallel for around any of the for loops in the matvec and t_matvec functions. However, again it is important to emphasize that OpenMP is still just threads and all of the same rules and limitations apply.

The test programs for the matvec implementations are in the test subdirectory (e.g., pmatvec_test.cpp, which builds to pmatvec_test.exe).

You should experiment with parallelizing any of the algorithms you like and then running pmatvec.exe as well as pmatvec_test.exe. Note that I chose a matrix for the test program that was likely to trigger a failure.

(For discussion on Piazza.) What characteristics of a matrix would make it more or less likely to exhibit an error if improperly parallelized? Meaning, if, say, you parallelized CSCMatrix::matvec with just basic columnwise partitioning – there would be potential races with the same locations in y being read and written by multiple threads. But what characteristics of the matrix give rise to that kind of problem? Are there ways to maybe work around / fix that if we knew some things in advance about the (sparse) matrix?

Implement the safe-to-implement parallel algorithms as the appropriate overloaded member functions in CSCMatrix.hpp and CSRMatrix.hpp. You should just add an appropriate OpenMP directive to the functions you are parallelizing. Make sure to protect any shared variables that need protecting.

Which methods did you parallelize? What directives did you use? How much parallel speedup did you see for 1, 2, 4, and 8 threads?

Sparse Matrix Dense Matrix Product (AMATH583 Only)

In this problem we want to parallelize sparse matrix by dense matrix product. The driver program for this problem is in the file matvec/pmatmat.cpp. The driver is exactly the same as matmat.cpp (also provided) except it takes an additional argument to specify maximum number of threads.

Implement the safe-to-implement parallel algorithms as the appropriate overloaded member functions in CSCMatrix.hpp and CSRMatrix.hpp, using your previously-written matmat methods. You should just add an appropriate OpenMP directive to the functions you are parallelizing. Make sure to protect any shared variables that need protecting. Note that you have some freedom in choosing loop orders as well as which loop (or loops – OpenMP supports nested parallelism) to use.

Which methods did you parallelize? What directives did you use? How much parallel speedup did you see for 1, 2, 4, and 8 threads? How does the parallel speedup compare to sparse matrix by vector product?

PageRank Reprise

The pagerank subdirectory has a driver program pagerank.cpp. This is the same raw program as from ps5, with the addition of functionality to set the number of OpenMP threads. Notably the call to mult() does not have a partitioning argument – and we aren’t going to add one.

Rather, as with the previous assignment (and perhaps using the same conclusions), you should choose the right data structure (and/or algorithm) so that you can take advantage of the parallelized versions of the functions that you just wrote.

“Implement” a parallel pagerank.

Describe any changes you made to pagerank.cpp to get parallel speedup. How much parallel speedup did you get for 1, 2, 4, and 8 threads?

Extra Credit

Parallelize as many of the other functions used by pagerank as you can, by adding appropriate directives in amath583.cpp and/or amath583sparse.cpp.

Which functions did you parallelize? How much additional speedup did you achieve?

Partitioning and Load Balancing

In class and in our assignments up to now, we have taken the immediately obvious approach to partitioning work (or data) for parallelization: we divided the loop indices (or the range of Vector indices) into equal sizes and gave each piece of work to a thread.

There is an underlying assumption here, however. To get maximum parallel speedup, each thread needs to start and finish at the same time – meaning, each thread needs to have the same amount of work to do. In partitioning our work by loop index or by Vector index, we are assuming that these partitionings will result in equal amounts of work for each thread.

This assumption will hold, for example, if we are adding two vectors together (or performing numerical quadrature over a fixed interval). But, suppose we are doing a matrix vector product and partition the problem such that each thread gets an equal number of rows to process in the matvec. The amount of work per thread depends now on the number of nonzeros in the portion of the matrix it is processing. If the matrix has (approximately) an equivalent number of nonzeros per row, each thread will get (approximately) the same amount of work. In scientific computing where the matrix is derived from a discretized PDE, each row will have the same number of elements since the discretized differential operator (which gives rise to the nonzero pattern in the sparse matrix) is uniformly applied everywhere.

However, discretized PDEs aren’t the only source of sparse matrices. Consider the PageRank algorithm. Each incoming link to a page creates a non-zero in the corresponding row of the sparse matrix used to represent the page-link graph. The number of links per page in the web (or a social network, say) is highly non-uniform. Some pages have many links to them, others very few. A well-known celebrity or politician may have millions or tens of millions of followers. On the other hand, almost all of those millions of followers will have many many fewer followers of their own.

If we partition a sparse matrix derived from a social network based simply on assigning an equal number of rows from the sparse matrix, we can easily end up in the situation where one thread is processing orders of magnitude more work than another, with the resultant loss in parallel speedup.

When we are partitioning a loop or a data structure ourselves, we can also use what we know about the problem and develop an appropriate partitioning strategy. For example, we can partition a sparse matrix so that there are approximately the same number of nonzeros per partition.
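As a sketch of that idea (using illustrative names, not functions from the assignment code), one could choose row boundaries from CSR-style row offsets so that each piece holds roughly the same number of nonzeros:

#include <cstddef>
#include <vector>

// Illustrative sketch: given CSR-style row offsets (row_ptr[0] == 0,
// row_ptr[nrows] == nnz), pick row boundaries so that each of nparts
// pieces contains roughly the same number of nonzeros.
std::vector<std::size_t> balanced_row_partition(const std::vector<std::size_t>& row_ptr,
                                                std::size_t nparts) {
  std::size_t nrows = row_ptr.size() - 1;
  std::size_t nnz   = row_ptr[nrows];

  std::vector<std::size_t> bounds(nparts + 1, nrows);
  bounds[0] = 0;

  std::size_t row = 0;
  for (std::size_t p = 1; p < nparts; ++p) {
    std::size_t target = (p * nnz) / nparts;   // nonzeros that should precede piece p
    while (row < nrows && row_ptr[row] < target) {
      ++row;
    }
    bounds[p] = row;   // piece p - 1 covers rows [bounds[p - 1], bounds[p])
  }
  return bounds;
}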

Load Balanced Partitioning with OpenMP

Fortunately OpenMP provides some mechanisms for deciding how to assign work to threads, namely the schedule option to parallel for. We can use that option as part of a pragma:

#pragma omp parallel for schedule(static)

Alternatively, we can make a library call

omp_set_schedule(omp_sched_static, 0);

A good description of the different options for OpenMP scheduling can be found here.
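As a concrete sketch (illustrative, not from the assignment files), there are two ways the schedule can be requested; note that the library call only affects loops that are marked schedule(runtime):

#include <cmath>
#include <cstddef>
#include <vector>

#include <omp.h>

int main() {
  std::vector<double> work(1 << 20, 1.0);
  std::size_t n = work.size();

  // 1. Give the schedule directly as a clause on the directive
#pragma omp parallel for schedule(dynamic, 16)
  for (std::size_t i = 0; i < n; ++i) {
    work[i] = std::sqrt(work[i]);
  }

  // 2. Choose the schedule at run time with the library call; loops
  //    marked schedule(runtime) pick this setting up
  omp_set_schedule(omp_sched_dynamic, 16);
#pragma omp parallel for schedule(runtime)
  for (std::size_t i = 0; i < n; ++i) {
    work[i] = std::sqrt(work[i]);
  }

  return 0;
}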

Add a call to omp_set_schedule to main in pagerank.cpp. Experiment with different options for scheduling. Many of the sparse matrices in the google drive are from social networks and so should have poor balancing.

What scheduling options did you experiment with? Are there any choices for scheduling that make an improvement in the parallel performance (most importantly, scalability) of pagerank?

OpenMP is not only about threading

In the previous problem set and this one, you may have noticed that with the -Ofast compiler option enabled, the sequential implementation of two_norm seemed to be much faster than the threaded version run on one thread. This behavior was not seen when just -O3 optimization was used. What is happening is that the compiler is introducing parallelization even though we didn’t explicitly call for it – namely, SIMD parallelism (vectorization).

We discussed SIMD parallelism and vector instructions in Lecture 8 (cf slides 30-43), where we noted that if we had a sufficient number of operands and operations that could be done at once, the AVX instructions enabled that.

[Figure: the AVX ymm vector registers (assignments/ps6/images/ymm.png)]

In addition to being able to perform operations simultaneously on multiple operands (i.e., in parallel), AVX also provides vector load and store instructions, meaning multiple operands can be moved between memory (or cache) and the vector registers with a single instruction (this is how we typically achieve the highest bandwidths available in the roofline model).

[Figure: a vector load from contiguous memory (assignments/ps6/images/vector_load.pdf)]

It is required that the operands being moved from memory be contiguous (index offset by 1).

The compiler looks for loop expressions that have unit stride (incrementing index by 1) for opportunities to generate vector instructions. Consider two_norm

 double norm_seq(const Vector& x) {
   double sum = 0;

   for (size_t i = 0; i < x.num_rows(); ++i) {
     sum += x(i) * x(i);
   }
   return std::sqrt(sum);
}

In looking at this code, the compiler can recognize that a set of consecutive operands from memory is being accessed. In fact, one (intermediate) optimization is to unroll the loop.

 double norm_seq(const Vector& x) {
   double sum = 0;

   for (size_t i = 0; i < x.num_rows(); i += 4) {
     sum += x(i + 0) * x(i + 0);
     sum += x(i + 1) * x(i + 1);
     sum += x(i + 2) * x(i + 2);
     sum += x(i + 3) * x(i + 3);
   }
   return std::sqrt(sum);
}

But now we can easily see there are four operands, all in contiguous memory, with the same operation (multiply) being applied to them. The compiler can turn that into a vector load to get four operands at once, then a vector multiply to do four multiplies at once (and a vector reduction to add the four products into sum).

Now, doing these operations in parallel will change the numeric answer for floating point, compared to the sequential version, so automatic vectorization is only enabled with -Ofast, not -O3 (you can also explicitly ask for it, with, e.g., -mavx2). You can get diagnostics from the compiler about what it did for vectorization with the -Rpass=loop-vectorize compiler option. If we use that and compile two_norm we would see the following:

../src/amath583.cpp:57:3: remark: vectorized loop (vectorization width: 4, interleaved count: 4) [-Rpass=loop-vectorize]
  for (size_t i = 0; i < x.num_rows(); ++i) {

OpenMP simd

In addition to directives for multithreading the work in loops, OpenMP also provides a directive for vectorizing a loop: simd. To vectorize the for loop in two_norm above (note that, just as with the threaded versions, the accumulation into sum needs a reduction clause):

 double norm_seq(const Vector& x) {
   double sum = 0;

 #pragma omp simd reduction(+:sum)
   for (size_t i = 0; i < x.num_rows(); ++i) {
     sum += x(i) * x(i);
   }
   return std::sqrt(sum);
}

Using OpenMP, experiment with applying simd parallelization to one or more of the norm_block functions or norm_cyclic functions in the norm subdirectory. If you keep in mind what the compiler is looking for you should be able to narrow down your options to functions that are most likely to be amenable to vectorization. Compiler optimization should be kept to -O3 so that vectorization is only a result of your OpenMP directives.

Provide an OpenMP-vectorized version of at least one function, obtaining noticeable speedup for all threading levels.

Which function did you vectorize with OpenMP? How much speedup were you able to obtain over the non-vectorized (sequential) version?

You may also try compiling with -Ofast and compare the OpenMP vectorized version with one thread to the sequential version.

Submitting Your Work

Answer the following questions (append to Questions.rst):

  a) The most important thing I learned from this assignment was…

  b) One thing I am still not clear on is…

Submit three files to Gradescope’s “ps6 – coding” assignment:

  • CSRMatrix.hpp

  • CSCMatrix.hpp

  • norms.hpp

Starting from this assignment, the autograder knows whether you are in 483 or 583, so it should automatically give you credit for any tests that are optional for your section.

Next, submit Questions.pdf to “ps6 – written”. If some questions are optional for your section, please indicate that in the respective parts of the write-up.

If you relied on any external references, list them in a file refs.txt and include that as well.