Assigned: May 25, 2021

Due: June 1, 2021

Lectures 17, 18, 19

This assignment is extracted as with previous assignment from a tar file,
located is at `this link`

.

For your convenience, the tar file is also available at

```
/gscratch/niac/amath583sp21/archives/ps8.tar.gz
```

And the extracted files are available at

/gscratch/niac/amath583sp21/assignments/ps8

When you extract the files, you will have the following directory structure:

```
ps8/
├── Questions.rst
├── grid/
│ ├── Makefile
│ ├── grid.bash
│ ├── grid.cpp
│ ├── plot.py
│ ├── strong.bash
│ └── weak.bash
├── hello/
│ ├── Makefile
│ ├── hello.bash
│ └── hello.cpp
├── include/
│ ├── AOSMatrix.hpp
│ ├── COOMatrix.hpp
│ ├── CSCMatrix.hpp
│ ├── CSRMatrix.hpp
│ ├── Grid.hpp
│ ├── Make_mpi.inc
│ ├── Matrix.hpp
│ ├── Stencil.hpp
│ ├── Vector.hpp
│ ├── amath583.hpp
│ ├── amath583IO.hpp
│ ├── amath583sparse.hpp
│ ├── cg.hpp
│ ├── ir.hpp
│ ├── jacobi.hpp
│ ├── mpiMath.hpp
│ ├── mpiStencil.hpp
│ ├── norms.hpp
│ └── norm_utils.hpp
├── norm/
│ ├── Makefile
│ ├── README.md
│ ├── mpi_norm.bash
│ ├── mpi_norm.cpp
│ ├── omp_norm.cpp
│ ├── plot.py
│ ├── sequential_norm.cpp
│ ├── strong.bash
│ └── weak.bash
├── pi/
│ ├── Makefile
│ ├── pi-block.cpp
│ ├── pi-cpp.cpp
│ └── pi-cyclic.cpp
├── pingpong/
│ ├── Makefile
│ ├── pingpong.bash
│ ├── pingpong.cpp
│ ├── ring.bash
│ ├── ring.cpp
│ └── verify.bash
└── warmup/
├── hello_script.bash
├── hello_script_m.bash
└── hello_script_p.bash
```

Important

For this assignment, in addition to gcc/10.2.0 you will need to load the intel/oneAPI module

```
$ module load gcc/10.2.0
$ module load intel/oneAPI
```

If you get an error message to the effect of “mpicxx: Command not found”, you have probably not loaded intel/oneAPI.

At the core of most (if not almost all) scientific computing programs is a linear system solver. That is, to solve a system of PDEs, we need to discretize the problem in space to get a system of ODEs. The system of ODEs is discretized in time to get a sequence of nonlinear algebraic problems. The nonlinear problems are solved iteratively, using a linear approximation at each iteration. It is the linear problem that we can solve computationally – and there is an enormous literature on how to do that.

A linear system is canonically posed as \(Ax = b\) where our task is to find an \(x\) that makes the equation true. The classical method for solving a linear system is to use Gaussian elimination (equivalently, LU factorization). Unfortunately, LU factorization is an \(O(N^3)\) computation, with a structure similar to that of matrix-matrix product (in fact, LU factorization can be reduced to a series of matrix-matrix products, and high-performance implementations of matmat are used to make high-performance versions of linear system solvers).

One motivation that we discussed for using sparse matrix representations (and sparse matrix computations) is that many linear systems – notably those derived from PDEs using the pattern above – only involve a few variables in every equation. It is generally much more efficient in terms of computation and memory to solve linear systems iteratively using sparse matrix methods rather than using LU factorization. The most commonly used iterative methods (Krylov subspace methods) have as their core operation a sparse matrix-vector product rather than a dense matrix-matrix product. Besides being more efficient, sparse matrix-vector product is much more straightforward to parallelize than dense matrix-matrix product.

A prototypical problem for linear system solvers is Laplace’s equation:
\(\nabla^2 \phi = 0\). As illustrated in lecture (e.g., Lecture 17),
one interpretation of the discretized Laplace’s equation is that the
solution satisfies the property that the value the solution `phi(i,j)`

at each grid point is equal to the average of its neighbors. We made
two, seemingly distinct, observations based on that property.

First, we noted that we could express that property as a system of
linear equations. If we map the 2D discretization `phi(i,j)`

to a 1D
vector `x(k)`

such that

```
phi(i,j) == x(i*N + j)
```

(where `N`

is the number of discrete points in one dimension of the
discretized domain) then we can express the discretized Laplace’s
equation in the form \(Ax = b\). In that case, each row of the
matrix represents a linear equation of the form:

\[x_i - (x_{i-1} + x_{i+1} + x_{i-N} + x_{i+N})/4 = 0 .\]

Second, we remarked that one could solve the discretized Laplace’s equation iteratively such that at each iteration \(k\) we compute the updated approximate value of each grid point to be the average of that grid point’s neighbors:

\[\phi^{k+1}_{i,j} = (\phi^{k}_{i-1,j} + \phi^{k}_{i+1,j} + \phi^{k}_{i,j-1} + \phi^{k}_{i,j+1}) / 4.\]

Note that once \(\phi^{k+1}_{i,j} = \phi^{k}_{i,j}\) for all \(i\) and \(j\), then we also have

\[\phi^{k}_{i,j} - (\phi^{k}_{i-1,j} + \phi^{k}_{i+1,j} + \phi^{k}_{i,j-1} + \phi^{k}_{i,j+1}) / 4 = 0,\]

which means \(\phi^{k}\) in that case solves the discretized Laplace’s equation.

Now, given the mapping between \(x\) and \(\phi\), the equations above are saying the same thing. Or, in other words, iterating through the grid and updating values of the grid variables according to a defined stencil is equivalent to the product of matrix by a vector. In the former case, we are representing our unknowns on a 2D grid – in the latter case we are representing them in a 1D vector. But, given the direct mapping between the 2D and 1D representations, we have the same values on the grid as in the vector (and we know how to get from one to the other). And, most importantly, we know that we can compute the result of the matrix by vector product by iterating through the equivalent grid and applying a stencil.

That turns out to be a profound realization for two reasons. First, we
don’t have to create a 1D vector to represent knowns or unknowns in our
problem. Our problem is described on 2D grids, we can keep it on 2D
grids. Second, and here is the really significant part, *we don’t need
to form the matrix* \(A\) *to solve* \(Ax=b\). Read that last
sentence again. We can solve \(Ax=b\) without ever forming
\(A\).

So how do we do that? Again, with the class of Krylov solvers, we only need to form the product of \(A\) times \(x\) as a core operation to solve \(Ax = b\). In the context of the solver, we don’t need the matrix \(A\), we just need the result of multiplying \(A\) times \(x\) – which, as we have just seen, we can do with a stencil.

In lecture we noted that we could express the “in-place” iterative update of the grid variables this way:

```
int jacobi(Grid& Phi0, size_t max_iters) {
Grid Phi1(Phi0);
for (size_t iter = 0; iter < max_iters; ++iter) {
for (size_t i = 1; i < Phi0.numX()-1; ++i) {
for (size_t j = 1; j < Phi.numY()-1; ++j) {
Phi1(i, j) = (Phi0(i-1, j) + Phi0(i+1, j) + Phi0(i, j-1) + Phi0(i, j+1))/4.0;
}
}
swap(Phi0, Phi1);
}
return max_iters;
}
```

An equivalent expression would be the following:

```
int jacobi(Grid& Phi, size_t max_iters) {
Grid R(Phi);
for (size_t iter = 0; iter < max_iters; ++iter) {
for (size_t i = 1; i < Phi.numX()-1; ++i) {
for (size_t j = 1; j < Phi.numY()-1; ++j) {
R(i, j) = Phi(i,j) - (Phi(i-1, j) + Phi(i+1, j) + Phi(i, j-1) + Phi(i, j+1))/4.0;
}
}
Phi = Phi - R;
}
return max_iters;
}
```

But notice what the two inner loops are: we are computing `R`

`Phi`

as an explicit stencil operation to effect \(R = A \times \phi\)

Let’s refactor and rewrite `jacobi`

in a slightly more reusable and
modular fashion. First, let’s define an operator between a stencil and a
grid that applies the Laplacian.

```
Grid operator*(const Stencil& A, const Grid& x) {
Grid y(x);
for (size_t i = 1; i < x.numX()-1; ++i) {
for (size_t j = 1; j < x.numY()-1; ++j) {
y(i, j) = x(i,j) - (x(i-1, j) + x(i+1, j) + x(i, j-1) + x(i, j+1))/4.0;
}
}
return y;
}
```

In general, we might want to carry this iteration with the stencil, and just dispatch to it from here (and use subtype or parametric polymorphism), but we are going to just use the Laplacian stencil here.

Now we can define a simple iterative solver (“ir” for iterative refinement):

```
size_t ir(const Stencil& A, Grid& x, const Grid& b, size_t max_iter) {
for (size_t iter = 0; iter < max_iter; ++iter) {
Grid r = b - A*x;
x += r;
}
return max_iter;
}
```

Question for the reader. What do we need the `Stencil`

class to provide in this
case? How would we define it? (Answer this question for yourself
*before* you look into `Stencil.hpp`

.

Second question for the reader. What functions do you need to define so
that you can compile and run the code just as it is above – meaning,
using matrix and vector notation? Again, answer this to yourself before
looking in `Grid.cpp`

.

(NB: Using the `ir`

routine above is only equivalent to `jacobi`

when we
have unit diagonals in the matrix, otherwise we would need the
additional step of doing that division (the general practice of which is
known as *preconditioning*). But we’ve defined things here so that that
they will work without needing to worry about preconditioning for now –
even though preconditioning turns out to be a very important issue in
real solvers.)

For this assignment (and for the rest of the quarter), we will be using the UW Hyak compute cluster.

As a reminder, the Hyak cluster is a shared compute resource, which sharing includes the compute resources themselves, as well as the file system and the front-end (login) node. When you are logged in to the head node, other students will also be logged in at the same time, so be considerate of your usage. In particular,

Use the cluster only for this assignment (no mining bitcoins or other activities for which a cluster might seem useful),

Do not run anything compute-intensive on the head node (all compute-intensive jobs should be batch scheduled),

Do not put any really large files on the cluster, and

Do not attempt to look at other students’ work.

Use the –time=5:00 argument to all jobs you submit.

One program that is interesting to run across the cluster is the
`hostname`

command because it will do something different on different
nodes (namely, print out the unique hostname of that machine). This can
be very useful for quickly diagnosing issues with a cluster – and for
just getting a feeling for how the queuing system works.

As an example, try the following:

```
% srun hostname
```

This will run the hostname command on one of the cluster nodes and return the results to your screen. You should see a different node name than the node you are logged into.

To run the same thing on multiple different nodes in the cluster, try the following:

```
% srun -A niac --tasks=8 hostname
```

The “–tasks=8” argument indicates that `srun`

should run the indicated
program – in this case, `hostname`

, as 8 tasks.

Based on what is printed as a result, are the tasks run on one host or multiple hosts?

Specifying the number of tasks is different from specifying the number of nodes:

```
% srun --nodes=8 hostname
```

The “–nodes=8” argument indicates that `srun`

should run the indicated
program on 8 nodes.

Based on what is printed as a result, are the tasks run on one host or multiple hosts?

How many tasks are launched, and how many nodes are used, when you combine `tasks`

and `nodes`

?

```
% srun --nodes=8 --tasks=2 hostname
```

The nodes in Hyak have multiple CPU cores per nodes (how many?) and we can specify tasks per node to make best use of the cores (and other resources). Note also that we can still specify cpus per task for multi-threaded tasks. (You might try this with ompi_info.exe):

```
% srun --nodes=4 --tasks=2 --cpus-per-task=8 ./ompi_info.exe
```

Try running srun with hostname a few times. Experiment with various numbers of nodes, tasks, cpus.

Note

The niac partition will allow a total of 80 cores to be allocated. If you request more total cores (cpus) within the niac account, slurm will reject your request. You can however, submit larger jobs to the ckpt partition.

```
$ srun -p ckpt --nodes=8 --tasks=4 --cpus-per-task=4 ./ompi_info.exe
```

This will require waiting until the requested number of resources become available in the ckpt pool, however.

In the last assignment we looked at basic use of srun and sbatch and ran the following:

```
$ sbatch hello_script.bash
```

Look at the contents of the output file from this run. How many instances of the script were run (how many “hello” messages were printed)?

We can implement multiple instances with sbatch:

```
$ sbatch --nodes=4 hello_script.bash
```

How many instances were run?

This is different than what you might have expected. Run the script with
`srun`

```
% srun --nodes=4 hello_script.bash
```

How many “hello” messages were printed?

The difference between the sbatch and the `srun`

command is that `sbatch`

just puts your job into the queuing system and requests resources to
potentially run a parallel job (the `--nodes`

option). But it doesn’t run the
job in parallel. Running a job in parallel is the performed by `srun`

(or, from a job in the queue, with `mpirun`

).

Consider the following script (`hello_script_p.bash`

), slightly modified from `hello_script.bash`

:

```
#!/bin/bash
echo "Hello from Slurm"
srun hostname
echo "Goodbye"
```

Here, we run `hostname`

with `srun`

– note that we have not
specified how many nodes we want. Launch a few trials of the second
script

```
% sbatch hello_script_p.bash
% sbatch --nodes=2 hello_script_p.bash
% sbatch --nodes=4 hello_script_p.bash
```

Look at the .out files for each job. How many instances of hostname were run in each case? How many “hello” messages were printed?

We can also use `mpirun`

to launch parallel jobs (and is what we will
need to run our MPI jobs):

```
#!/bin/bash
echo "Hello from Slurm"
mpirun hostname
echo "Goodbye"
```

(This script is `hello_script_m.bash`

)

Again, run this script a few different ways and see if the results are what you expect.

```
% sbatch hello_script_m.bash
% sbatch --nodes=2 hello_script_m.bash
% sbatch --nodes=4 hello_script_m.bash
```

Note

If you launch `mpirun`

directly on the front-end node – it will run –
but will launch all of your jobs on the front-end node, where everyone
else will be working. This is not a way to win friends. Or influence
people. You also won’t get speed-up since you are using only one task on
one node (the front-end node).

Warning

Don’t ever run your MPI programs by invoking `mpirun`

manually.

As we have emphasized in lecture, MPI programs are not special programs in any way, other than that they make calls to functions in an MPI software library. An MPI program is a plain old sequential program, regardless of the calls to MPI functions. (This is in contrast to, say, an OpenMP program where there are special directives for compilation into a parallel program.)

Given that, we can compile an MPI program into an executable with the same C++ complier that we would use for a sequential program without MPI function calls. However, using a third-party library (not part of the standard library) requires specifying the location of include files, the location of library archives, which archives to link to, and so forth. Since MPI itself only specifies its API, implementations vary in terms of where their headers and archives are kept. Moreover, it is often the case that a parallel programmer may want to experiment with different MPI implementations with the same program.

It is of course possible (though not always straightforward) to determine the various directories and archives needed for compiling an MPI program. Most (if not all) implementations make this process transparent to the user by providing a “wrapper compiler” – a script (or perhaps a C/C++ program) that properly establishes the compilation environment for its associated MPI implementation. Using the wrapper compiler, as user does not have to specify the MPI environment. The wrapper compiler is simply invoked as any other compiler – and the MPI-associated bits are handled automagically.

The wrapper compilers associated with most MPI implementations are named
along the lines of “mpicc” of “mpicxx” for compiling C and C++,
respectively. These take all of the same options as their underlying
(wrapped) compiler. The default wrapped compiler is specified when the
MPI implementation is built, but it can usually be over-ridden with
appropriate command-line and/or environment settings. For MPI
assignments in this course, the compiler is named `mpicxx`

.

In one of the scripts above, we used `mpirun`

to launch a program that
didn’t have any MPI function calls in it. Again, in some sense, there is
no such thing as “an MPI program.” There are only processes, some of
which may use MPI library functions to communicate with each other. But
`mpirun`

does not in any way inspect what the processes actually do,
so it will launch programs without MPI just as well as programs that do
call MPI.

What it does do is set up a communication runtime system that enables programs that do make MPI function calls to be able to execute those function calls.

For your first non-trivial (or not completely trivial) MPI program, try
the hello world program we presented in lecture, provided as
`hello.cpp`

in your ps8/hello subdirectory.

The basic command to compile this program is

```
$ mpicxx hello.cpp
```

Note however, that to use all of the other arguments that we have been
using in this course with C++ (such as -std=c++11, -Wall, etc), we need
to pass those in too – so we rely on `make`

to automate this for us.

```
$ make hello.exe
```

which will create an executable hello.exe. Once you have hello.cpp compiled, run it!

```
$ srun --mpi=pmi2 --tasks=4 hello.exe
$ srun --mpi=pmi2 --nodes=4 hello.exe
$ srun --mpi=pmi2 --nodes=4 --tasks=2 hello.exe
```

Important

To launch mpi jobs directly with srun, you must specify the –mpi=pmi2 option.

There is also a script in your directory that you can experiment with

```
$ sbatch --tasks=4 hello.bash
$ sbatch --nodes=4 hello.bash
$ sbatch --nodes=4 --tasks=2 hello.bash
```

Experiment with different numbers of processes, being aware of the resource limitation of the niac allocation.

In general, you should prefer `sbatch`

and scripts to `srun`

.

The point to point example we showed in class used MPI to communicate – to bounce a message back and forth. Rather than including all of the code for that example in the text here, I refer you to the included program ps8/pingpong.cpp.

Take a few minutes to read through pingpong.cpp and determine how the arguments get passed to it and so forth.

Build the executable program `pingpong.exe`

. In this case the program
can take a number of “rounds” – how many times to bounce the token back
and forth. Note also that the token gets incremented on each “volley”.
Launch this program with . Note that we are passing command line
information in to this program. Note also that we are not testing the
rank of the process for the statement

```
if (argc >= 2) rounds = std::stol(argv[1]);
```

What are we assuming in that case about how arguments get passed? Is that a valid assumption? How would you modify the program to either “do the right thing” or to recognize an error if that assumption were not actually valid?

Does this program still work if we launch it on more than two processes? Why or why not?

In the MPI ping-pong example we sent a token back and forth between two processes, which, while actually quite amazing, is still only limited to two processes.

Complete the code in `ring.cpp`

so that instead of simply sending a
token from process 0 to 1 and then back again, the token is sent from
process 0 to 1, then from 1 to 2, then from 2 to 3 – etc until it comes
back to 0. As with the ping pong example above, increment the value of
the token every time it is passed. Copy and paste relevant code lines
that contain your edits to your report. Provide comments in the code
near your edits to explain your approach.

Hint: The strategy is that you want to receive from `myrank-1`

and
send to `myrank+1`

. It is important to note that this will break down
if `myrank`

is zero or if it is `mysize-1`

because there is no rank
-1 or mysize. These cases could be addressed explicitly. (Or, since with
everything else in MPI, there may be capabilities in the library for
handling this situation, since it is not uncommon.)

We couldn’t visit another high-performance computing paradigm without using it to compute the Euclidean norm of a vector. Computing the Euclidean norm with MPI has two parts, and the difficult part isn’t the one you would expect.

The statement for this part of the problem is: Write a function that takes a vector and returns its (Euclidean) norm.

That is easy to state, but we have to think for a moment about what it means in the context of CSP. Again, this function is going to run on multiple separate processes. So, first of all, there is not really such a thing as “a vector or “its norm”. Rather, each process will have its own vector and each process can compute the norm of that vector (or, more usefully, it can compute the sum of squares of that vector because we have to then take a global sum of all of those pieces to compute the final norm). Together, the different vectors on each process can be considered to be one distributed vector – but the distributed vector as an actual “thing” only exists in the mind of the programmer.

But now what do we mean by “its norm”? If we consider the local vectors on each process to be part of a larger global vector, then what we have to do is compute the local sum of squares, add all of those up, and then take the square root of the resulting global sum. This immediately raises another question though. We have some number of separate processes running. How (and where) are the individual sums “added up”? Where (and how) is the square root taken? Which process gets (or which processes get) the result?

Typically, we write individual MPI programs working with local data as if they were a single program working on the global data. That can be useful, but it is paramount to keep in mind what is actually happening.

With this in mind, generally, the local vectors are parts of a larger vector and we do a global reduction of the sum of squares and then either broadcast out the sum of squares (in which case all of the local processes can take the square root), or we reduce the sum of squares to one process, take the square root, and then broadcast the result out. Since the former can be done efficiently with one operation (all-reduce), that is the typical approach.

Now here is the harder question that I alluded to above. In previous
assignments you have been asked to test and time your implementations of
norm. A random vector is usually created to compare the sequential
result to the parallel result (or to differently ordered sequential
results). But how do we generate a distributed random vector across
multiple independent processes, each of which will be calling its own
version of `randomize()`

? More specifically, how do we make a
distributed vector that is the “same” as a sequential one (with the same
total number of elements)? We need to create the “same” vector if we are going to, say, check it for correctness.

We don’t want to have every node separately call `randomize`

.
– in that case,
each node would fill in the same values (random numbers on a computer
are algorithmically generated – each node runs the same algorithm from
the same starting point – hence generating the same values). Instead, we
want a single random vector partitioned and distributed across the
nodes.

To properly compute the norm of a vector in parallel, first, let’s get a
truly distributed random vector (so that each process has a piece of a
larger vector, not the same copy of a smaller vector). One approach to
dealing with issues such as this (the same kind of problem shows up
when, say, reading data from a file) is to get all the data needed on
one node (typically node 0) and then scatter it to the other nodes using `MPI_Scatter.`

Second, we need to get the final computed norm to all of the processes rather than just to a single process.

Finally, for testing, we want to check that all of the compute nodes have the same value. We can make this check on one node, using, say, the inverse operation of scatter (gather).

Complete the program `mpi_norm`

.

The program takes in the (global) size of a vector (\(log_2\) of its size). It should then scatter the vector out to all of the other processes (assume everything is a power of two) using an MPI collective operation.

The compute nodes should compute their local contributions to the global norm. Use another MPI collective operation to combine those results and send the results to all of the compute nodes so that a consistent value of the norm can be computed on all compute nodes. To realize this, complete the function

`mpi_norm`

in`mpi_norm.cpp`

.

In addition to testing, we also have been evaluating our programs for how they scale with respect to problem size and to number of threads/processes. We will continue this process for this problem. In distributed memory, however, the issue of timing becomes a little touchy, because there is no real notion of time across the different processes – but we need one number to measure to determine whether we are scaling or not.

What changes did you make for `mpi_norm`

? Copy and paste relevant code lines
that contain your edits to your report. Provide comments in the code
near your edits to explain your approach.

Run the script `strong.bash`

. This will run `mpi_norm.exe`

for a set
of problem sizes for different numbers of processes and plot the
resulting times. Include the generated pdf file to your report.

Run the script `weak.bash`

. This will run `mpi_norm.exe`

for
different sizes and different numbers of processors – keeping the local
problem size constant. Include the generated pdf file to your report.

Per our discussions in lectures past about weak vs strong scaling, do the plots look like what you would expect? Describe any (significant) differences (if any).

For strong scaling, at what problem size (and what number of nodes) does parallelization stop being useful? Explain.

The files that you will edit for this part of the assignment are in the
`grid`

subdirectory. In the `include`

subdirectory you will find the
following files:

`Grid.hpp`

implements a rectangular grid`Stencil.hpp`

iplements a stencil operator (in particular, it implements the`operator*`

between a stencil and a grid)`ir.hpp`

that implements iterative refinement as described above`jacobi.hpp`

that implements the Jacobi iteration to solve Laplace’s equation`cg.hpp`

that implements the conjugate gradient algorithm

*Your assignment is to parallelize – with MPI – these algorithms.*

Your work will be carried out in the following files:

`mpiMath.hpp`

: contains the implementation of`jacobi`

, which implements the Jacobi iteration. To parallelize this, we need to do two things. First, we need to correctly compute the global value of`rho`

– which needs to be available on all ranks after its computation. Second, we need to properly exchange boundary values for the`x`

Grid.**This problem is the focus of this assignment.**`mpiStencil.hpp`

: contains the implementation of the`mult`

. This function also needs to exchange boundary values of`x`

. You can use the same scheme as you used in`jacobi`

. For extra credit, try a different scheme.`mpiMath.hpp`

contains an implementation of the`mpi_dot()`

function. Think very carefully about this one. In this operation the distinction between real boundaries in your problem and the “pseudo-boundaries” realized by the ghost cells does become important.`mpiMath.hpp`

: contains an implementation of the ir(…) function. This is an overload of the ir(…) function in`ir.cpp`

– it just takes an mpiStencil rather than a plain Stencil. In looking at the different functions called in the sequential version of`ir()`

– what needs to change here to make it a parallel`ir()`

? We have discussed everything you need for this in lecture. You may want to change the names of some functions that are called here but you probably do not need any MPI code in there directly. There is a Big Hint in the function that is defined at the top of mpiMath.hpp.**583 only**Repeat 4 but for the`cg()`

algorithm also included in mpiMath.hpp.

There is a single driver program in the grid subdirectory to run the sequential and parallel versions of ir, jacobi, and cg.

This program takes a number of arguments:

one of -j, -i, or -c to select jacobi, ir, or cg solver, respectively

-x and -y specify the grid dimensions for the discretized Laplace’s equation

-m specifies the maximum number of iterations to run the algorithm

-t specifies the convergence tolerance for the iterations

-d turns on debugging output

-p specifies that the program should be run in parallel

You can run the program sequentially on the front end machine (the one you logged into). It will print the norm of the residual at each iteration of the algorithm.

The output from the sequential algorithm should match the corresponding parallel algorithm. E.g.., running parallel ir should give the same results as running sequential ir.

For running the parallel version of the program with sbatch, use the grid.bash script, e.g.,

```
$ sbatch --ntasks=8 --nodes=4 grid.bash -m 128
```

The grid.bash script automatically passes the -p option to grid.exe so you don’t need to specify that.

Note that as with many of our benchmarking programs, we are assuming things divide nicely by powers of two (that is, sequential equivalence will only be guaranteed and/or expected when running on tasks and nodes that are powers of two).

Although the amount of code to get this working is not large, this is probably the most challenging thing we’ve done in the course. To verify that a particular algorithm is working, you should compare its results with the sequential version.

For example, for testing the jacobi algorithm, we can establish the sequential answer at the end of 128 iterations:

```
$ ./grid.exe -j
# elapsed time [seqgrid]: 1 ms
# 128 iterations
# ||r|| = 0.0490717
```

When you run the parallel version on one node

```
$ sbatch grid.bash -j
```

slurm-<number>.out will contain

```
Starting parallel grid
# elapsed time [mpigrid]: 5 ms
# 128 iterations
# ||r|| = 0.0490717
Done with parallel grid
```

The final residual norm matches

Next, run on two nodes

```
$ sbatch --nodes=2 grid.bash -j
```

The residual norm reported in the slurm output file should match. Repeat with 4 nodes. And so on.

You can see iteration-by-iteration residuals (which may also help debug) by passing in the `-d`

flag.

```
$ sbatch --nodes=2 grid.bash -j -d
```

Making sure the first residual matches is a good check that your global value for `rho`

is correctly computed.

Once you have `jacobi`

working, you can repeat this process for `ir`

, using the `-i`

option instead of `-j`

and then for `cg`

, using the `-c`

option instead of `-j`

.

What changes did you make for halo exchange in `jacobi`

? Copy and paste relevant code lines
that contain your edits to your report. Provide comments in the code
near your edits to explain your approach.

If you used a different scheme for extra credit in `mult`

, show that as well.

What changes did you make for `mpi_dot`

? Copy and paste relevant code lines
that contain your edits to your report. Provide comments in the code
near your edits to explain your approach.

What changes did you make for `ir`

in `mpiMath.hpp`

? Copy and paste relevant code lines
that contain your edits to your report. Provide comments in the code
near your edits to explain your approach.

(583 only) What changes did you make for `cg`

in `mpiMath.hpp`

? Copy and paste relevant code lines
that contain your edits to your report. Provide comments in the code
near your edits to explain your approach.

Finally, once you have your iterative solvers working to your
satisfaction, run the scripts `weak.bash`

and `strong.bash`

. This will
perform runs to create weak and strong scaling plots for the solvers.
Please include these plots into your report.

Per our discussions in lectures past about weak vs strong scaling, do the plots look like what you would expect? Describe any (significant) differences (if any).

For strong scaling, at what problem size (and what number of nodes) does parallelization stop being useful? Explain.

Please submit your report as a PDF-file to “ps8 – written” on Gradescope. For 583, please submit the related questions to “ps8 – written (583 only)”.