University of Washington | ||
Department of Applied Mathematics | ||
AMATH 483/583 | Assigned: {{ page.handout_date | date: ‘%B %d, %Y’ }} | |
Final Exam | Due: {{ page.due_date | date: ‘%B %d, %Y’ }} |
{% capture now %}{{‘now’ | date: ‘%s’ | plus: 0 }}{% endcapture %} {% capture handout %}{{ page.handout_date | date: ‘%s’ | plus: 0 }}{% endcapture %}
{% if handout > now %}
Problem Set {{ page.title }} will be available on {{ page.handout_date | date: ‘%B %d, %Y’ }}.
{% else %}
This assignment is extracted as with previous assignments into a directory {{ page.psnum }}. In that directory is a file Questions.md with short answer questions that you will be asked in different parts of this assignment. To fill in your answers, open the file in vscode and Insert your answers directly into the file, using the format indicated.
The files for this assignment have been copied to your AWS directory in the subdirectory final-dist. There is also a copy in /nfs/home/public
For your reference, the tar ball for this assignment can be downloaded with this link.
At the core of most (if not almost all) scientific computing programs is a linear system solver. That is, to solve a system of PDEs, we need to discretize the problem in space to get a system of ODEs. The system of ODEs is discretized in time to get a sequence of nonlinear algebraic problems. The nonlinear problems are solved iteratively, using a linear approximation at each iteration. It is the linear problem that we can solve computationally – and there is an enormous literature on how to do that.
A linear system is canonically posed as $ Ax = b $ where our task is to find an \(x\) that makes the equation true. The classical method for solving a linear system is to use Gaussian elimination (equivalently, LU factorization). Unfortunately, LU factorization is an \(O(N^3)\) computation, with a structure similar to that of matrix-matrix product (in fact, LU factorization can be reduced to a series of matrix-matrix products, and high-performance implementations of matmat are used to make high-performance versions of linear system solvers).
One motivation that we discussed for using sparse matrix representations (and sparse matrix computations) is that many linear systems – notably those derived from PDEs using the pattern above – only involve a few variables in every equation. It is generally much more efficient in terms of computation and memory to solve linear systems iteratively using sparse matrix methods rather than using LU factorization. The most commonly used iterative methods (Krylov subspace methods) have as their core operation a sparse matrix-vector product rather than a dense matrix-matrix product. Besides being more efficient, sparse matrix-vector product is much more straightforward to parallelize than dense matrix-matrix product.
A prototypical problem for linear system solvers is Laplace’s equation:
\(\nabla^2 \phi = 0\). As illustrated in lecture (e.g., Lecture 17),
one interpretation of the discretized Laplace’s equation is that the
solution satisfies the property that the value the solution phi(i,j)
at each grid point is equal to the average of its neighbors. We made
two, seemingly distinct, observations based on that property.
First, we noted that we could express that property as a system of
linear equations. If we map the 2D discretization phi(i,j)
to a 1D
vector x(k)
such that
phi(i,j) == x(i*N + j)
(where N
is the number of discrete points in one dimension of the
discretized domain) then we can express the discretized Laplace’s
equation in the form \(Ax = b\). In that case, each row of the
matrix represents a linear equation of the form:
Second, we remarked that one could solve the discretized Laplace’s equation iteratively such that at each iteration \(k\) we compute the updated approximate value of each grid point to be the average of that grid point’s neighbors:
Note that once \(\phi^{k+1}_{i,j} = \phi^{k}_{i,j}\) for all \(i\) and \(j\), then we also have
which means \(\phi^{k}\) in that case solves the discretized Laplace’s equation.
Now, given the mapping between \(x\) and \(\phi\), the equations above are saying the same thing. Or, in other words, iterating through the grid and updating values of the grid variables according to a defined stencil is equivalent to the product of matrix by a vector. In the former case, we are representing our unknowns on a 2D grid – in the latter case we are representing them in a 1D vector. But, given the direct mapping between the 2D and 1D representations, we have the same values on the grid as in the vector (and we know how to get from one to the other). And, most importantly, we know that we can compute the result of the matrix by vector product by iterating through the equivalent grid and applying a stencil.
That turns out to be a profound realization for two reasons. First, we don’t have to create a 1D vector to represent knowns or unknowns in our problem. Our problem is described on 2D grids, we can keep it on 2D grids. Second, and here is the really significant part, we don’t need to form the matrix :math:`A` to solve :math:`Ax=b`. Read that last sentence again. We can solve \(Ax=b\) without every forming \(A\).
So how do we do that? Again, with the class of Krylov solvers, we only need to form the product of \(A\) times \(x\) as a core operation to solve \(Ax = b\). In the context of the solver, we don’t need the matrix \(A\), we just need the result of multiplying \(A\) times \(x\) – which we can do with a stencil.
In lecture we noted that we could express the “in-place” iterative update of the grid variables this way:
int jacobi(Grid& Phi0, size_t max_iters) {
Grid Phi1(Phi0);
for (size_t iter = 0; iter < max_iters; ++iter) {
for (size_t i = 1; i < Phi0.numX()-1; ++i) {
for (size_t j = 1; j < Phi.numY()-1; ++j) {
Phi1(i, j) = (Phi0(i-1, j) + Phi0(i+1, j) + Phi0(i, j-1) + Phi0(i, j+1))/4.0;
}
}
swap(Phi0, Phi1);
}
return max_iters;
}
An equivalent expression would be the following:
int jacobi(Grid& Phi, size_t max_iters) {
Grid R(Phi);
for (size_t iter = 0; iter < max_iters; ++iter) {
for (size_t i = 1; i < Phi.numX()-1; ++i) {
for (size_t j = 1; j < Phi.numY()-1; ++j) {
R(i, j) = Phi(i,j) - (Phi(i-1, j) + Phi(i+1, j) + Phi(i, j-1) + Phi(i, j+1))/4.0;
}
}
Phi = Phi - R;
}
return max_iters;
}
But notice what the two inner loops are: we are computing R
Phi
as an explicit stencil operation to effect \(R = A \times \phi\)
Let’s refactor and rewrite jacobi
in a slightly more reusable and
modular fashion. First, let’s define an operator between a stencil and a
grid that applies the Laplacian.
Grid operator*(const Stencil& A, const Grid& x) {
Grid y(x);
for (size_t i = 1; i < x.numX()-1; ++i) {
for (size_t j = 1; j < x.numY()-1; ++j) {
y(i, j) = x(i,j) - (x(i-1, j) + x(i+1, j) + x(i, j-1) + x(i, j+1))/4.0;
}
}
return y;
}
In general, we might want to carry this iteration with the stencil, and just dispatch to it from here (and use subtype or parametric polymorphism), but we are going to just use the Laplacian stencil here.
Now we can define a simple iterative solver (“ir” for iterative refinement):
size_t ir(const Stencil& A, Grid& x, const Grid& b, size_t max_iter) {
for (size_t iter = 0; iter < max_iter; ++iter) {
Grid r = b - A*x;
x += r;
}
return max_iter;
}
Question for the reader. What do we need stencil to provide in this
case? How would we define it? (Answer this question for yourself
before you look into Stencil.hpp
.
Second question for the reader. What functions do you need to define so
that you can compile and run the code just as it is above – meaning,
using matrix and vector notation? Again, answer this to yourself before
looking in Grid.cpp
.
(NB: Using the ir routine above is only equivalent to jacobi when we have unit diagonals in the matrix, otherwise we would need the additional step of doing that division (the general practice of which is known as preconditioning). But we’ve defined things here so that that they will work without needing to worry about preconditioning for now – even though preconditioning turns out to be a very important issue in real solvers.)
The compute cluster for this assignment is set up in the Amazon Web Services (AWS) cloud just for this course and will be taken down once the course is completed. (NB: If you are interested in experimenting with parallel computing on your own or for a research project after this course, please let me know directly and I will try to see if we can make some arrangements to keep some of the clusters running a bit longer.)
Using a cluster is somewhat different than what we have done so far in this course to use parallel computing resources. A compute cluster typically consists of some number of front-end nodes and then some (much larger) number of compute nodes. Development and compilation are typically done on the front-end nodes, whereas execution of the job is done on the compute nodes.
A cluster is a shared system. When you ran parallel programs on your own computer, you didn’t have to worry about the fact that someone else might want to run a parallel program with your computer. Or that other programs running on your computer would interfere with the performance of your jobs. To make efficient use of the resources – and to allocate them fairly, once a program is compiled and ready to execute, it is passed off to the compute nodes via a queueing system. The queueing system makes sure that the compute nodes are fairly used and that the resources that you need to run your parallel job are available.
Our AWS cluster consists of one head node and an elastic number of compute nodes. The intent of this cluster is to allow you to run distributed memory jobs over a non-trivial number of nodes and observe scaling. The IP address that you will use to connect to the head node of this cluster is ``35.163.68.152`` (NB: The head node has an eleastic IP address. If for some reason we have to rebuild the entire cluster, we will have to create a new instance for the head node and the IP address will change. I will announce any such changes to Piazza and to Canvas.)
If it turns out we need to add more capacity (or larger compute nodes), we can set that up fairly quickly. In fact, the fleet of compute nodes will grow and shrink dynamically in response to demand. The resizing is not immediate – if you are the first person to submit a job larger than the current size of the fleet it will take about five minutes for the additional compute nodes to come up.
A login has been created for you on the cluster, with the same name as for the GPU node – and with the same public-private key pair. You should be able to log in to the head node of the cluster just as you did the head node of the GPU machine. Note that the GPU machine and the cluster are separate, so even though your login credentials are the same, the home directories aren’t shared. (This is doable, but did not get a chance to set it up.) You should also be able to clone your git repository as with the GPU machine.
The compute cluster is a shared compute resource, which sharing includes the compute resources themselves, as well as the file system. When you are logged in to the head node, other students will also be logged in at the same time, so be considerate of your usage. In particular,
Use the cluster only for this assignment (no mining bitcoins or other activities for which a cluster might seem useful),
Do not run anything compute-intensive on the head node,
Do not run anything using mpirun
on the head node,
Do not put any really large files on the cluster, and
Do not attempt to look at other students’ work.
Once you are connected to the head node, there are a few quick sanity checks you can do to check out (and learn more about) the cluster environment.
To find out more about the head node itself, try
% more /proc/cpuinfo
This will return copious information about the cores on the head node. Of interest are how many cores there are.
You should also double check the C++ compiler, the mpic++ compiler, and mpirun.
% c++ --version
% mpic++ --version
These should both print information about the installed version of clang. And, finally, issuing
% mpirun --version
should print out a message indicating that it is Open MPI.
We will be using the Slurm batch queuing system for this assignment, which we will discuss in more detail below. To get some information about it, you can issue the command
% sinfo
This will print a summary of the compute nodes along with some basic details about their capabilities.
If you pass the -N argument to sinfo you will get the info printed on a per-node basis. Passing the -l (ell) flag will provide additional information.
You may see a list that looks something like this
% sinfo -N -l
Sat Jun 8 05:04:34 2019
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
ip-10-11-10-44 1 compute* idle 2 2:1:1 3711 16810 1 (null) none
ip-10-11-10-197 1 compute* idle 2 2:1:1 3711 16810 1 (null) none
ip-10-11-10-201 1 compute* idle 2 2:1:1 3711 16810 1 (null) none
ip-10-11-10-219 1 compute* idle 2 2:1:1 3711 16810 1 (null) none
(“man sinfo” to access documentation to explain the different columns.)
The command “squeue” on the other hand will provide information about the jobs that are currently running (or waiting to be run) on the nodes.
% squeue
% squeue -arl
(“man squeue” to access information about the squeue command.)
The development environment you will be using on the cluster is similar to the Linux environment that is set up on the GPU machine, although some of the tools are not as recent a version. The cluster is launched and administered with AWS Parallel Cluster Toolkit – which only supports Ubuntu 14.04 and 16.04. Accordingly, the OS is ubuntu 16.04, the compiler is clang, the MPI is Open MPI.
There are a number of HPC queueing systems used in production clusters,
all of which provide essentially the same functionality. We will be
using “slurm” but other available systems include SGE and torque. The
basic paradigm in using a queuing system is that you request the
queueing system to run something for you. The thing that is run can be a
program, or it can be a command script (with special commands / comments
to control the queueing system itself). Command line options (or
commands / directives in the script file) are used to specify the
configuration of the environment for the job to run, as well as to tell
the queueing system some things about the job (how much time it will
need to run for instance). When the job has run, the queueing system
will also provide its output in some files for you – the output from
cout
and the output from cerr
.
A reasonable tutorial on the basics of using slurm can be found here:
https://www.slurm.schedmd.com/tutorials.html . The most important
commands we have already seen above: sbatch
, sinfo
, and
squeue
. The most important bits to pay attention to in that tutorial
are about those commands.
When you are debugging and testing small and short running jobs, you
may want to get your results back on your screen rather than in a file.
To do this you can use the srun
command instead of sbatch
.
Please do not use srun
to actually log in to nodes on the cluster as
you will then be blocking others from using those nodes and we really
need to use those only under direction of the queueing system so that we
can make most efficient use of them.
One program that is interesting to run across the cluster is the
hostname
command because it will do something different on different
nodes (namely, print out the unique hostname of that machine). This can
be very useful for quickly diagnosing issues with a cluster – and for
just getting a feeling for how the queuing system works.
As an example, try the following:
% srun hostname
This will run the hostname command on one of the cluster nodes and return the results to your screen. You should see a different node name than the node you are logged into.
To run the same thing on multiple different nodes in the cluster, try the following:
% srun -n 8 hostname
The “-n 8” argument indicates that srun
should run the indicated
program – in this case, hostname
, as 8 tasks.
This is different than using -N (capital N):
% srun -N 8 hostname
The “-N 8” argument indicates that srun
should run the indicated
program on 8 nodes.
The nodes in our cluster have two CPUs per nodes – so running 8 tasks would require only 4 nodes. However, specifying only four nodes will run 1 task per node. Inspect the output from running the two version (-n 8 vs -N 8) and see if the output is consistent with two tasks per node.
NB: When launching MPI programs, you may see a warning like the following in your output file:
--------------------------------------------------------------------------
[[56201,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: ip-10-11-15-142
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
This is just a warning that Open MPI could not find an Infiniband network module (which is not available on AWS). It can be safely ignored for this assignment.
Try running srun with hostname a few times. Experiment with various
numbers of nodes. The fleet will currently only expand to 36 compute
nodes, so if you request more nodes than that your job will not run (or
may be rejected by srun
).
Since srun
basically just takes what you give it and runs that, it
is usually much more productive in the long run to specify what you want
to do inside of a script – and to submit the script to the queuing
system.
For example, with the above, you could implement it as follows:
#!/bin/bash
echo "Hello from Slurm"
hostname
echo "Goodbye"
This file is in your repo as final/warmup/hello_script.bash. Submit it to the queuing system with the following:
% sbatch hello_script.bash
NB Files that are submitted to sbatch
must be actual shell
scripts – meaning they must specify a shell in addition to having
executable commands. The shell is specified at the top of every script,
starting with #!#
. A bash script therefore has #!/bin/bash
at
the top of the file.
If you submit a file that is not a script, you will get an error like the following:
sbatch: error: This does not look like a batch script. The first
sbatch: error: line must start with #! followed by the path to an interpreter.
sbatch: error: For instance: #!/bin/sh
In addition, scripts that are submitted to srun
must be executable –
that is, their execute mode must be set. To enable this, make sure that
you see the x
flag in the long listing for the script files.
$ ls -l hello_script.bash
-rwxr-xr-x 1 al75 al75 62 Jun 8 06:24 hello_script.bash
If you do not see the “x” – notably the very first one, issue the command
$ chmod +x hello_script.bash
If the script is not executable, you will get an error from srun similar to
slurmstepd: error: execve(): hello_script.bash: Permission denied
The standard out (cout) and standard error (cerr) from the program will
be saved in a file named slurm-xx.out
, where xx is the job number
that was run. in the directory where you launched sbatch
.
Try the following
% sbatch hello_script.bash
Now execute multiple instances
% sbatch -N 4 hello_script.bash
How many instances were run?
This is different than what you might have expected. Run the script with
srun
% srun -N 4 hello_script.bash
How many “hello” messages were printed?
The difference between the sbatch and the srun command is that sbatch just puts your job into the queuing system and requests resources to potentially run a parallel job (the -N option). But it doesn’t run the job in parallel. Running a job in parallel is the performed by srun.
Consider the following script, slightly modified from hello_script.bash:
#!/bin/bash
echo "Hello from Slurm"
srun hostname
echo "Goodbye"
Here, we run hostname
with srun
– note that we have not
specified how many nodes we want. Launch a few trials of the second
script
% sbatch hello_script_p.bash
% sbatch -N 2 hello_script_p.bash
% sbatch -N 4 hello_script_p.bash
Look at the .out files for each job. How many instances of hostname were run in each case? How many “hello” messages were printed?
We can also use mpirun
to launch parallel jobs (and is what we will
need to run our MPI jobs):
#!/bin/bash
echo "Hello from Slurm"
mpirun hostname
echo "Goodbye"
Again, run this script a few different ways and see if the results are what you expect.
% sbatch hello_script_m.bash
% sbatch -N 2 hello_script_m.bash
% sbatch -N 4 hello_script_m.bash
sbatch
and srun
accept the same command-line arguments – so,
again, tehre is a difference between -n and -N.
NB: If you run mpirun directly on the front-end node – it will run – but will launch all of your jobs on the front-end node, where everyone else will be working. This is not a way to win friends. Or influence people. You also won’t get speed-up since you are using only one node (the front-end node).
Don’t ever run your MPI programs using mpirun directly.
A final note on using the AWS cluster. It is currently configured to “autoscale” – that is, when there is not a large demand for nodes it shuts them down (maintaining a minimum of 4 up at all times). If you submit a job with a value of -N that requires more nodes than are currently available, AWS will spin some more up. However, this will take a few minutes to complete. Your job will wait in the queue in the meantime.
If you want to see what the cluster is doing (how many nodes are up, how loaded they are, etc), you can visit the page http://52.25.191.99/ganglia/ . This shows a number of statistics from the cluster, monitored in real time.
As we have emphasized in lecture, MPI programs are not special programs in any way, other than that they make calls to functions in an MPI software library. An MPI program is a plain old sequential program, regardless of the calls to MPI functions. (This is in contrast to, say, an OpenMP program where there are special directives for compilation into a parallel program.)
Given that, we can compile an MPI program into an executable with the same C++ complier that we would use for a sequential program without MPI function calls. However, using a third-party library (not part of the standard library) requires specifying the location of include files, the location of library archives, which archives to link to, and so forth. Since MPI itself only specifies its API, implementations vary in terms of where their headers and archives are kept. Moreover, it is often the case that a parallel programmer may want to experiment with different MPI implementations with the same program.
It is of course possible (though not always straightforward) to determine the various directories and archives needed for compiling an MPI program. Most (if not all) implementations make this process transparent to the user by providing a “wrapper compiler” – a script (or perhaps a C/C++ program) that properly establishes the compilation environment for its associated MPI implementation. Using the wrapper compiler, as user does not have to specify the MPI environment. The wrapper compiler is simply invoked as any other compiler – and the MPI-associated bits are handled automagically.
The wrapper compilers associated with most MPI implementations are named
along the lines of “mpicc” of “mpic++” for compiling C and C++,
respectively. These take all of the same options as their underlying
(wrapped) compiler. The default wrapped compiler is specified when the
MPI implementation is built, but it can usually be over-ridden with
appropriate command-line and/or environment settings. For MPI
assignments in this course, the compiler is named mpic++
– the same
name is used by both MPICH and Open MPI.
In one of the scripts above, we used mpirun
to launch a program that
didn’t have any MPI function calls in it. Again, in some sense, there is
no such thing as “an MPI program.” There are only processes, some of
which may use MPI library functions to communicate with each other. But
mpirun
does not in any way inspect what the processes actually do,
so it will launch programs without MPI just as well as programs that do
call MPI.
For your first non-trivial (or not completely trivial) MPI program, try
the hello world program we presented in lecture, provided as
hello.cpp
in your final/hello subdirectory.
The basic command to compile this program is
% mpic++ hello.cpp
Note however, that to use all of the other arguments that we have been
using in this course with C++ (such as -std=c++11, -Wall, etc), we need
to pass those in too – so we rely on make
to automate this for us.
% make hello.exe
which will create an executable hello.exe. Once you have hello.cpp compiled, run it!
% srun -n 4 hello.exe
There is also a script in your directory that you can experiment with
% sbatch -n 4 hello.bash
Experiment with different numbers of processes, being aware of the resource limitation of the compute fleet.
In general, you should prefer sbatch
and scripts to srun
.
The point to point example we showed in class used and to communicate – to bounce a message back and forth. Rather than including all of the code for that example in the text here, I refer you to the included program final/pingpong.cpp.
Take a few minutes to read through pingpong.cpp and determine how the arguments get passed to it and so forth.
Build the executable program pingpong.exe
. In this case the program
can take a number of “rounds” – how many times to bounce the token back
and forth. Note also that the token gets incremented on each “volley”.
Launch this program with . Note that we are passing command line
information in to this program. Note also that we are not testing the
rank of the process for the statement
if (argc >= 2) rounds = std::stol(argv[1]);
What are we assuming in that case about how arguments get passed? Is that a valid assumption? How would you modify the program to either “do the right thing” or to recognize an error if that assumption were not actually valid?
Does this program still work if we launch it on more than two processes? Why or why not?
Because of the trouble we have been having with the cluster for PS7, this problem has been completed for you. You should read through it and run it with different numbers of nodes.
In the MPI ping-pong example we sent a token back and forth between two processes, which, while actually quite amazing, is still only limited to two processes.
Complete the code in ring.cpp
so that instead of simply sending a
token from process 0 to 1 and then back again, the token is sent from
process 0 to 1, then from 1 to 2, then from 2 to 3 – etc until it comes
back to 0. As with the ping pong example above, increment the value of
the token every time it is passed.
Hint: The strategy is that you want to receive from myrank-1
and
send to myrank+1
. It is important to note that this will break down
if myrank
is zero or if it is mysize-1
because there is no rank
-1 or mysize. These cases could be addressed explicitly. (Or, since with
everything else in MPI, there may be capabilities in the library for
handling this situation, since it is not uncommon.)
A completed ring.cpp
program.
We have had a running example of taking the Euclidean norm of a vector throughout this course. It seems only fitting that we now develop an MPI version. Computing the Euclidean norm with MPI has two parts, and the difficult part isn’t the one you would expect.
The statement for this part of the problem is: Write a function that takes a vector and returns its (Euclidean) norm.
That is easy to state, but we have to think for a moment about what it means in the context of CSP. Again, this function is going to run on multiple separate processes. So, first of all, there is not really such a thing as “a vector or “its norm”. Rather, each process will have its own vector and each process can compute the norm of that vector (or, more usefully, it can compute the sum of squares of that vector). Together, the different vectors on each process can be considered to be one distributed vector – but that is only in the mind of the programmer.
But now what do we mean by “its norm”? If we consider the local vectors on each process to be part of a larger global vector, then what we have to do is compute the local sum of squares, add all of those up, and then take the square root of the resulting global sum. This immediately raises another question though. We have some number of separate processes running. How (and where) are the individual sums “added up”? Where (and how) is the square root taken? Which process gets (or which processes get) the result?
Typically, we write individual MPI programs working with local data as if they were a single program working on the global data. That can be useful, but it is paramount to keep in mind what is actually happening.
With this in mind, generally, the local vectors are parts of a larger vector and we do a global reduction of the sum of squares and then either broadcast out the sum of squares (in which case all of the local processes can take the square root), or we reduce the sum of squares to one process, take the square root, and then broadcast the result out. Since the former can be done efficiently with one operation (all-reduce), that is the typical approach.
Now here is the harder question that I alluded to above. In previous
assignments you have been asked to test and time your implementations of
norm. A random vector is usually created to compare the sequential
result to the parallel result (or to differently ordered sequential
results). But how do we generate a distributed random vector across
multiple independent processes, each of which will be calling its own
version of randomize()
? More specifically, how do we make a
distributed vector that is the “same” as a sequential one (with the same
total number of elements)?
We don’t want the same vector on every node, which is what would happen
if we created a vector on every node and called randomize()
on it –
each node would fill in the same values (random numbers on a computer
are algorithmically generated – each node runs the same algorithm from
the same starting point – hence generating the same values). Instead, we
want a single random vector partitioned and distributed across the
nodes.
To properly compute the norm of a vector in parallel, first, let’s get a truly distributed random vector (so that each process has a piece of a larger vector, not the same copy of a smaller vector). One approach to dealing with issues such as this (the same kind of problem shows up when, say, reading data from a file) is to get all the data needed on one node (typically node 0) and then scatter it to the other nodes using .
Second, we need to get the final computed norm to all of the processes rather than just to a single process.
Finally, for testing, we want to check that all of the compute nodes have the same value. We can make this check on one node, using, say, the inverse operation of scatter.
Complete the program mpi_norm
.
The program takes in the (global) size of a vector ($log_24 of its size). It should then scatter the vector out to all of the other processes (assume everything is a power of two) using an MPI collective operation. This has been completed for you.
The compute nodes should compute their local contributions to the
global norm. Use another MPI collective operation to combine those
results and send the results to all of the compute nodes so that a
consistent value of the norm can be computed on all compute nodes. To
realize this, complete the function mpi_norm
in mpi_norm.cpp
.
In addition to testing, we also have been evaluating our programs for how they scale with respect to problem size and to number of threads/processes. We will continue this process for this problem. In distributed memory, however, the issue of timing becomes a little touchy, because there is no real notion of time across the different processes – but we need one number to measure to determine whether we are scaling or not.
Run the script strong.bash
. This will run mpi_norm.exe
for a set
of problem sizes for different numbers of processes and plot the
resulting times. Turn in the generated pdf file.
Run the script weak.bash
. This will run mpi_norm.exe
for
different sizes and different numbers of processors – keeping the local
problem size constant. Turn in the generated pdf file.
Per our discussions in lectures past about weak vs strong scaling, do the plots look like what you would expect? Describe any (significant) differences (if any).
For strong scaling, at what problem size (and what number of nodes) does parallelization stop being useful?
The files that you will edit for this part of the assignment are in the
grid
subdirectory. In the include
subdirectory you will find the
following files:
Grid.cpp
and Grid.hpp
that implement a rectangular grid
Stencil.cpp
and Stencil.hpp
that implement a stencil operator
– in particular the operator*
between a stencil and grid is
implemented
ir.cpp
and ir.hpp
that implement iterative refinement as
described above
jacobi.cpp
and jacobi.hpp
that implement the Jacobi iteration
to solve Laplace’s equation
cg.cpp
and cg.hpp
that implement the conjugate gradient
algorithm
Your assignment is to parallelize – with MPI – these algorithms.
Your work will be carried out in the following files:
mpiStencil.cpp
: contains implementation of the operator*
function declared in mpiStencil.hpp
. This is basically the same
as the sequential version except there is a call to
update_halo()
. You will need to fill in the body of this
function. It takes a Grid x
as an argument and should update the
boundary values of x
with the appropriate values from its
neighbors. This problem is the focus of this assignment.
Final.cpp
: contains an implementation of the ir(…) function
declared in the header file Final.hpp
. This is an overload of the
ir(…) function in ir.cpp
– it just takes an mpiStencil rather
than a plain Stencil. In looking at the different functions called in
the sequential version of ir()
– what needs to change here to
make it a parallel ir()
? We have discussed everything you need
for this in lecture. You may want to change the names of some
functions that are called here but you probably do not need any MPI
code in there directly. There is a Big Hint in the function that is
defined at the top of Final.cpp (and that you will be finishing in
the next question).
Final.cpp
contains an implementation of the mpiDot()
function declared in the header file Final.hpp
.Allreduce()
.583 only Repeat 2 but for the cg()
algorithm also included in
Final.cpp.
There is a single driver program in the grid subdirectory to run the sequential and parallel versions of ir, jacobi, and cg.
This program takes a number of arguments:
one of -j, -i, or -c to select jacobi, ir, or cg solver, respectively
-x and -y specify the grid dimensions for the discretized Laplace’s equation
-m specifies the maximum number of iterations to run the algorithm
-t specifies the convergence tolerance for the iterations
-d turns on debugging output
-p specifies that the program should be run in parallel
For running the parallel version of the program with sbatch, use the grid.bash script, e.g.,
$ sbatch --ntasks 8 -N 4 grid.bash -m 128
The grid.bash script automatically passes the -p option to grid.exe so you don’t need to specify that.
Finally, once you have your iterative solvers working to your satisfaction, run the scripts weak.bash and strong.bash. This will perform runs to create weak and strong scaling plots for the solvers.
Per our discussions in lectures past about weak vs strong scaling, do the plots look like what you would expect? Describe any (significant) differences (if any).
For strong scaling, at what problem size (and what number of nodes) does parallelization stop being useful?
Do a make clean in your {{ page.psnum }} directory and then create a tar ball {{ page.psnum }}.tar.gz containing all of the files in that directory. From within the {{ page.psnum }} directory:
$ cd {{ page.psnum }}
$ make clean
$ cd ..
$ tar zcvf {{ page.psnum }}-submit.tar.gz --exclude "./{{ page.psnum }}/.vscode" ./{{ page.psnum }}
If you relied on any external references, list them in a file refs.txt and include that in your tarball as well (in the {{ page.psnum }} subdirectory).
{% endif %}