Assigned: May 31, 2022
Due: June 10, 2022
Lecture 19 (eigenfaces)
As with previous assignments, this assignment is extracted from a tar file, final.tar.gz.
For your convenience, the tar file is also available at
/gscratch/amath/amath583/final.tar.gz
You can copy it to your own directory using cp:
cp /gscratch/amath/amath583/final.tar.gz ~
When you extract the files, you will have the following directory structure:
final/
├── Questions.rst
├── eigenfaces/
│ ├── Makefile
│ ├── common.bash
│ ├── eigenfaces_block.cpp
│ ├── eigenfaces_dgemm.cpp
│ ├── eigenfaces_mpi.cpp
│ ├── eigenfaces_opt.cpp
│ ├── eigenfaces_par.cpp
│ ├── eigenfaces_seq.cpp
│ ├── main.hpp
│ ├── max.bash
│ ├── test.bash
│ ├── test.cpp
│ └── verify.bash
├── include/
│ ├── AOSMatrix.hpp
│ ├── CImg.h
│ ├── COOMatrix.hpp
│ ├── CSCMatrix.hpp
│ ├── CSRMatrix.hpp
│ ├── Make_mpi.inc
│ ├── Matrix.hpp
│ ├── Vector.hpp
│ ├── amath583.hpp
│ ├── amath583IO.hpp
│ ├── amath583sparse.hpp
│ └── catch.hpp
└── src/
├── amath583.cpp
├── amath583IO.cpp
└── amath583sparse.cpp
The rules for the final exam are the same as for the other assignments, with one exception: as with previous assignments on Hyak, the coding part of the final will not be evaluated with the Gradescope autograder.
However, we do want you to include your source code in the submission, and we will evaluate it by hand.
As with previous assignments, there are tests supplied with the final that you can use to verify your solutions.
We haven’t really had any requirements on commenting your code, either with // or with the /* */ pair.
Continuing in this regard, general code comments will not be required for the final, but are highly suggested. Any clarifying comments may help the graders award partial credit if your code doesn’t exhibit the correct behavior.
Important
For this assignment, in addition to gcc/10.2.0, you will need to load the intel/oneAPI module for the MPI compiler, mpicxx.
$ module load gcc/10.2.0
$ module load intel/oneAPI
In the rest of the assignment, I will omit the account option and the partition option in the example shell commands. Please remember to include the corresponding options when you run these commands.
The major steps in the eigenfaces application are:
Read in face images
Compute mean face image
Subtract mean from all face images
Compute covariance matrix
Compute eigendecomposition of covariance matrix
Save eigenface images
The implementation of eigenfaces that you have been provided with is instrumented to print out timing information about these steps (except for saving eigenface images).
The eigenfaces application programs take the following arguments:
-i bin_file: (required) specifies the face image data file to read.
-n num: specifies the number of faces to read from the image data file. Default: 500.
-f num: specifies the number of eigenface images to save. If the value is zero, the eigendecomposition will not be performed. Default: 0.
-z size_spec: specifies the size of the image data. Acceptable options are “55x45”, “109x89”, or “218x178”. This should be selected to match the choice of image data file. Default: “55x45”.
If we invoke the base eigenfaces program (eigenfaces_seq.exe) in the following way
$ srun eigenfaces_seq.exe -i /gscratch/amath/amath583/data/celebA/celebA_gray_lowres.262781_55_45_32.bin -n 500 -f 8
we get the following output
# Face size 55 x 45
# Reading /gscratch/amath/amath583/data/celebA/celebA_gray_lowres.262781_55_45_32.bin
# elapsed time [read_file]: 10 ms
# elapsed time [compute_mean]: 0 ms
# elapsed time [compute_shifted]: 0 ms
# elapsed time [compute_covariance]: 10420 ms
# Gflops/s = 0.587872
# elapsed time [compute_svd]: 8436 ms
Since only 500 images were read, the times to read the file, compute the mean, and subtract the mean are quite small. The time to compute the covariance, on the other hand, is quite large.
Given a set of \(N\) vectors \(\{ f_i \}\), the formula for computing the covariance matrix is

\[
C = \frac{1}{N-1} \sum_{i=1}^{N} \left( f_i - \bar{f} \right) \left( f_i - \bar{f} \right)^{T},
\qquad \bar{f} = \frac{1}{N} \sum_{i=1}^{N} f_i .
\]
The time to compute the covariance is proportional to the square of the size of the vectors and to the number of vectors.
Note
The vectors in the eigenfaces application are “raveled” images. That is, their length is the total number of pixels in the image, i.e., height * width, with elements corresponding to the pixel values of the image. (Essentially, we are undoing the one-d to two-d mapping that we did for the Matrix class.)
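As a concrete illustration of this raveling, a pixel at position (row, col) of a height x width image lands at a single offset in the raveled vector. The helper below is hypothetical (it is not part of the provided code) and assumes the same row-major layout used by the Matrix class:

#include <cstddef>

// Hypothetical helper, for illustration only: maps a 2-D pixel coordinate to
// its 1-D position in the raveled image vector (row-major assumption).
std::size_t ravel_index(std::size_t row, std::size_t col, std::size_t width) {
  return row * width + col;   // e.g., for a 55 x 45 image, (1, 0) -> 45
}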
Given that forming the covariance matrix with 500 images takes approximately 10 seconds, how long (approximately) would forming the covariance matrix take if we used all 262,781 images in the data file?
In the remainder of this final you will be applying successive optimizations / parallelizations to improve the performance of the covariance computation.
There are three data files that you can use in this assignment (small, medium, and large, respectively):
/gscratch/amath/amath583/data/celebA/celebA_gray_lowres.262781_55_45_32.bin
/gscratch/amath/amath583/data/celebA/celebA_gray_lowres.202599_109_89_32.bin
/gscratch/amath/amath583/data/celebA/celebA_gray_lowres.202599_218_178_32.bin
These are collections of greyscale images of celebrity faces. The first (small) dataset contains 262781 images with height equal to 55 and width equal to 45. The second (medium) dataset contains 202599 images with height equal to 109 and width equal to 89. The third (large) dataset contains 202599 images with height equal to 218 and width equal to 178. In all of the datasets, the images are stored with 32bit floating point for each pixel.
Although these seem like small images, the amount of data they represent is non-trivial. The small, medium, and large datasets are 2.5G, 7.4G, and 30G in size, respectively. Moreover, even the smallest image (55 by 45) ravels to a vector of length 2,475, meaning the (dense) covariance matrix is 2,475 by 2,475.
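As a quick sanity check on those sizes (each pixel is a 4-byte float):

\[
262{,}781 \times 55 \times 45 \times 4 \approx 2.6 \times 10^{9} \ \text{bytes}, \quad
202{,}599 \times 109 \times 89 \times 4 \approx 7.9 \times 10^{9} \ \text{bytes}, \quad
202{,}599 \times 218 \times 178 \times 4 \approx 3.1 \times 10^{10} \ \text{bytes},
\]

which is roughly consistent with the quoted sizes.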
As seen in the example above, the input data file is specified to the eigenface program with the -i
option and the number of images to use is specified with the -n
option.
Typing out the full path of the data file can be arduous to do repeatedly. To make the typing easier, you can create a “symbolic link” to the data files within your eigenfaces
directory:
$ cd eigenfaces/
$ ln -s /gscratch/amath/amath583/data/celebA/celebA_gray_lowres.262781_55_45_32.bin small.bin
$ ln -s /gscratch/amath/amath583/data/celebA/celebA_gray_lowres.202599_109_89_32.bin med.bin
$ ln -s /gscratch/amath/amath583/data/celebA/celebA_gray_lowres.202599_218_178_32.bin large.bin
Check the links
$ ls -l small.bin med.bin large.bin
You should see the symbolic links displayed, something like
small.bin -> /gscratch/amath/amath583/data/celebA/celebA_gray_lowres.262781_55_45_32.bin
med.bin -> /gscratch/amath/amath583/data/celebA/celebA_gray_lowres.202599_109_89_32.bin
large.bin -> /gscratch/amath/amath583/data/celebA/celebA_gray_lowres.202599_218_178_32.bin
Once you have done that, you can simply invoke eigenfaces with small.bin (or med.bin or large.bin)
$ srun --time=1:00 ./eigenfaces_seq.exe -i small.bin -n 500
You may also consider creating your own scripts for saving repetitive typing.
Your code comes with two testing/verification scripts: test.bash and verify.bash. The test.bash script compiles and runs the program test.cpp, which in turn tests each of the gen_covariance
functions in your files for correctness. If you execute test.exe
by itself, it will run all of the tests – test.bash runs the tests one by one. The MPI program is not tested in this script.
The script verify.bash runs the eigenfaces programs, saves their output in a form suitable for submission, and checks the saved images for correctness.
There are two sections that are commented out in verify.bash – one section for the 583 portion of the final (eigenfaces_block) and one section for the extra credit portion (eigenfaces_dgemm).
Running the verify.bash script will save the results from the run in verify.txt and in
slurm-verify.out, as well as create a tar file of images verify-images.tar.gz
.
The base case program executable is eigenfaces_seq.exe. The source code is in eigenfaces_seq.cpp. You can make the executable with
$ srun make eigenfaces_seq.exe
and execute it with srun
$ srun --time=1:00 ./eigenfaces_seq.exe -i small.bin -n 500
(Here I assume you have made the symbolic links as shown above.)
My results for a recent run of eigenfaces_seq.exe are (you should see something similar)
# Face size 55 x 45
# Reading small.bin
# elapsed time [read_file]: 23 ms
# elapsed time [compute_mean]: 0 ms
# elapsed time [compute_shifted]: 0 ms
# elapsed time [compute_covariance]: 1265 ms
# Gflops/s = 0.48424
If you specify the “-f” option to any of the programs, the program will perform an eigendecomposition of the covariance matrix and save the indicated number of eigenfaces, as well as the mean (average) face.
The mean face will be named mean_face_<method>_<num_images>_<height>_<width>.bmp
. The eigenfaces will be named
u_<method>_<component>_<num_images>_<height>_<width>.bmp
.
The mean face (as should be computed by all programs) looks like the following:
And the first eight eigenfaces:
The program executable for this problem is eigenfaces_opt.exe. The source code is in eigenfaces_opt.cpp. You can make the executable with
$ srun make eigenfaces_opt.exe
Note that this program uses the same main code (contained in main.hpp) as all of the non-MPI programs.
The first optimization you are to make is to apply any and all sequential optimizations in order to speed up gen_covariance
and/or speed up outer
. (Recall the optimizations we learned about in, say, ps4.)
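For illustration only, here is a sketch of one generic sequential idea (not necessarily either of the two optimizations referred to below): since the covariance matrix is symmetric, the accumulation only needs to touch the upper triangle, with the lower triangle filled in once at the end. The Vector/Matrix interfaces (operator(), num_rows()) are assumed to match the provided headers.

#include "Matrix.hpp"   // provided course headers (interfaces assumed)
#include "Vector.hpp"

// Sketch only: accumulate just the upper triangle of C with the outer product of x.
void outer_upper(const Vector& x, Matrix& C) {
  for (size_t i = 0; i < x.num_rows(); ++i) {
    for (size_t j = i; j < x.num_rows(); ++j) {
      C(i, j) += x(i) * x(j);
    }
  }
}

// After all faces have been accumulated, mirror the upper triangle into the lower.
void fill_lower(Matrix& C) {
  for (size_t i = 0; i < C.num_rows(); ++i) {
    for (size_t j = 0; j < i; ++j) {
      C(i, j) = C(j, i);
    }
  }
}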
By applying two fairly basic optimizations, I was able to achieve the following
$ srun ./eigenfaces_opt.exe -i small.bin -n 500
# Face size 55 x 45
# Reading small.bin
# elapsed time [read_file]: 11 ms
# elapsed time [compute_mean]: 0 ms
# elapsed time [compute_shifted]: 0 ms
# elapsed time [compute_covariance]: 1799 ms
# Gflops/s = 3.40502
File eigenfaces_opt.cpp with your optimized function(s).
What optimizations did you apply to improve eigenfaces_opt? Explain why these optimizations would result in better performance.
The program executable for this problem is eigenfaces_par.exe. The source code is in eigenfaces_par.cpp. You can make the executable with
$ srun make eigenfaces_par.exe
Note that this program uses the same main code (contained in main.hpp) as all of the non-MPI programs.
The next optimization to be applied is shared memory parallelization. Although the stub code that is initially contained in eigenfaces_par.cpp is the sequential base case code, the code that you should parallelize is the code you developed for eigenfaces_opt. You may use any method you like to parallelize it for shared memory execution. (Recall the approaches we learned about in ps5 and ps6.)
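As a generic illustration of one possible shared-memory strategy (a sketch under assumed interfaces, not the approach you are required to use): give each thread a disjoint range of rows of C, so that no two threads ever write the same element and no locking is needed. The std::vector<Vector> face container and the Matrix/Vector interfaces below are assumptions made for the sake of the sketch.

#include <algorithm>
#include <functional>
#include <thread>
#include <vector>
#include "Matrix.hpp"   // provided course headers (interfaces assumed)
#include "Vector.hpp"

// Sketch only: accumulate outer-product contributions from every face into the
// rows [row_begin, row_end) of C.
void accumulate_rows(const std::vector<Vector>& faces, Matrix& C,
                     size_t row_begin, size_t row_end) {
  for (const Vector& f : faces) {
    for (size_t i = row_begin; i < row_end; ++i) {
      for (size_t j = 0; j < f.num_rows(); ++j) {
        C(i, j) += f(i) * f(j);
      }
    }
  }
}

// Sketch only: carve the rows of C into one chunk per thread and join.
void parallel_covariance(const std::vector<Vector>& faces, Matrix& C, size_t num_threads) {
  std::vector<std::thread> workers;
  size_t rows  = C.num_rows();
  size_t chunk = (rows + num_threads - 1) / num_threads;
  for (size_t t = 0; t < num_threads; ++t) {
    size_t lo = t * chunk;
    size_t hi = std::min(rows, lo + chunk);
    workers.emplace_back(accumulate_rows, std::cref(faces), std::ref(C), lo, hi);
  }
  for (std::thread& w : workers) {
    w.join();
  }
}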
By applying one parallelization approach, I was able to achieve the following
$ srun --time=1:00 --cpus-per-task=10 ./eigenfaces_par.exe -i small.bin -n 5000
# Face size 55 x 45
# Reading small.bin
# elapsed time [read_file]: 138 ms
# elapsed time [compute_mean]: 10 ms
# elapsed time [compute_shifted]: 8 ms
# elapsed time [compute_covariance]: 3446 ms
# Gflops/s = 17.77604
File eigenfaces_par.cpp with your parallelized function(s). This should include your sequential optimizations as well.
How did you parallelize your code? On 5000 faces, how much speedup were you able to get on 2, 4, 10, and 20 cores (cpus-per-task)? Does the speedup improve if you load 15000 faces rather than 5000 faces?
The program executable for this problem is eigenfaces_block.exe. The source code is in eigenfaces_block.cpp. You can make the executable with
$ srun make eigenfaces_block.exe
Note that this program uses the same main code (contained in main.hpp) as all of the non-MPI programs.
In the way that we have separated gen_covariance
and outer
in the code
as supplied, we are updating the entire matrix with the outer product of a single
vector on each call to outer
. But even with the smallest images, we have a
covariance matrix that is 2,475 by 2,475 – which is over 6M total elements or over
48M bytes of memory (with double precision).
Another approach might be to first update just a portion of matrix (a block), then another portion, then another portion, and so on, using all of the vectors. (We could obtain some reuse in the vectors by only updating with some of them as well.)
The next optimization to be applied is blocking for cache. Although the stub code that is initially contained in eigenfaces_block.cpp is the sequential base case code, the code that you should block is the code you developed for eigenfaces_par. (Recall the approaches we learned about in lecture.)
By blocking the updates to C, I was able to achieve the following
$ srun --time=1:00 --cpus-per-task=10 ./eigenfaces_block.exe -i small.bin -n 5000
# Face size 55 x 45
# Reading small.bin
# elapsed time [read_file]: 117 ms
# elapsed time [compute_mean]: 10 ms
# elapsed time [compute_shifted]: 8 ms
# elapsed time [compute_covariance]: 952 ms
# Gflops/s = 64.3448
Hint: For your implementation, you may want to pass begin and end markers to
outer
, or you may simply absorb the code for outer
into gen_covariance
.
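As a sketch of the first suggestion in the hint (the signature below is hypothetical, not a required interface): outer could take row and column bounds so that each call only touches a block of C small enough to stay in cache, while gen_covariance streams every face through one block before moving on to the next.

#include "Matrix.hpp"   // provided course headers (interfaces assumed)
#include "Vector.hpp"

// Sketch only: update just the block of C with row indices [row_begin, row_end)
// and column indices [col_begin, col_end) with the outer product of x.
void outer(const Vector& x, Matrix& C,
           size_t row_begin, size_t row_end,
           size_t col_begin, size_t col_end) {
  for (size_t i = row_begin; i < row_end; ++i) {
    for (size_t j = col_begin; j < col_end; ++j) {
      C(i, j) += x(i) * x(j);
    }
  }
}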
File eigenfaces_block.cpp with your cache blocking optimizations. This should include your parallelization and sequential optimizations as well.
Explain your blocking approach. On 5000 faces and 10 (or 20) cores, say, how much speedup were you able to get over the non-blocked case? How much speedup were you able to get over the non-blocked case for 50,000 faces?
The program executable for this problem is eigenfaces_dgemm.exe. The source code is in eigenfaces_dgemm.cpp. You can make the executable with
$ srun make eigenfaces_dgemm.exe
Note that this program uses the same main code (contained in main.hpp) as all of the non-MPI programs.
If we think about all of the faces collected together in a set \(\{ f_i \}\), we can define a matrix \(F\) such that the ith column of \(F\) is a (mean-subtracted) face vector \(f_i\). In that case, it can be shown that

\[
C = \frac{1}{N-1} F F^{T} .
\]
Or in other words, the covariance matrix can be written as a matrix-matrix product. Although we have spent significant time in the course looking at matrix-matrix product, we always assumed powers of two and omitted the (tedious) issues of fringe cleanup. So, we are not going to put our own matrix-matrix product code in here.
Rather, we are going to use the BLAS “DGEMM” routine – in particular the C-BLAS function
cblas_dgemm
.
The prototype for cblas_dgemm
is the following
void cblas_dgemm(const CBLAS_LAYOUT Layout, const CBLAS_TRANSPOSE TransA,
const CBLAS_TRANSPOSE TransB, const int M, const int N,
const int K, const double alpha, const double *A,
const int lda, const double *B, const int ldb,
const double beta, double *C, const int ldc) NOTHROW;
Where CBLAS_LAYOUT
is one of CblasRowMajor
or CblasColMajor
and CBLAS_TRANSPOSE
is one of
CblasTrans
or CblasNoTrans
.
(You will need the header file mkl_cblas.h
included – which should already be done for you.)
Documentation on cblas_dgemm can be found here
To call cblas_dgemm
, you will need to copy the image vectors to an
appropriately-sized matrix and then invoke cblas_dgemm
with the appropriate
arguments. Note that a matrix is described with two pieces of information in this
interface: a pointer to the beginning of the matrix data and the “leading dimension”,
i.e., the number of elements between successive rows (for row oriented) or columns
(for column oriented). We use the leading dimension when computing the
one-dimensional index from two-dimensional indices for a matrix, which is why we have
to pass it in to this routine. Thus the matrix A is represented with const double *A
and int lda
, and so forth.
Finally, the function is very general: it computes
C = alpha * op(A) * op(B) + beta * C
where op is either transpose or no transpose (as specified by the TransA and TransB arguments).
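For reference, here is a minimal, self-contained sketch of calling cblas_dgemm (a toy example computing C = A * A^T with dummy all-ones data; the sizes are arbitrary and this is not the assignment solution):

#include <mkl_cblas.h>
#include <cstdio>
#include <vector>

int main() {
  const int M = 3, K = 4;                  // A is M x K (row major), C is M x M
  std::vector<double> A(M * K, 1.0);       // dummy data
  std::vector<double> C(M * M, 0.0);

  // C = 1.0 * A * A^T + 0.0 * C
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
              M, M, K,
              1.0, A.data(), K,            // lda = K (A stored as M x K)
              A.data(), K,                 // ldb = K (same array, used transposed)
              0.0, C.data(), M);           // ldc = M (C stored as M x M)

  std::printf("C(0,0) = %g\n", C[0]);      // expect 4 for the all-ones A
  return 0;
}

(The supplied Makefile presumably already provides the MKL include and link flags; if you compile a standalone test like this yourself, you will need to add them.)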
By using cblas_dgemm
to compute C, I was able to achieve the following
$ srun --cpus-per-task=10 ./eigenfaces_dgemm.exe -i small.bin -n 5000
# Face size 55 x 45
# Reading small.bin
# elapsed time [read_file]: 104 ms
# elapsed time [compute_mean]: 10 ms
# elapsed time [compute_shifted]: 8 ms
# elapsed time [compute_covariance]: 498 ms
# Gflops/s = 123.0046
File eigenfaces_dgemm.cpp that implements gen_covariance
with a call to
cblas_dgemm
.
What single core performance do you get with eigenfaces_dgemm? How does your performance scale for 2, 5, 10, 20 cores? (You may even try 40 cores.)
The program executable for this problem is eigenfaces_mpi.exe. The source code is in eigenfaces_mpi.cpp. You can make the executable with
$ srun make eigenfaces_mpi.exe
Note that this program uses its own main
(though it is very similar to the main
in main.hpp).
The scheme of this program is that it operates in SPMD mode. The root node (rank 0) reads in the data files and then sends a subset of the images to all of the other nodes (keeping a subset for itself). Each node then proceeds more or less like the sequential program: compute the mean, subtract the mean from every image, compute the covariance matrix, and do the eigendecomposition.
However, there are a few places where we need to communicate to get global information. For instance, we can’t simply compute the mean on every node; that would just be the mean of the local faces. Instead, we need to compute a partial sum on every node, do an all-reduce to get the global sum on every node, and then divide the sum by the number of images to get the mean on every node. We can then use this globally computed mean locally to update our local images.
Similarly with covariance, each node can compute its contribution to the covariance, but we have to reduce those contributions to a global covariance matrix (and then normalize the global covariance).
The eigendecomposition is just done on rank 0.
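To illustrate the all-reduce pattern described above in isolation (a generic MPI sketch with dummy data, not the code you need to write in eigenfaces_mpi.cpp):

#include <mpi.h>
#include <vector>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);

  // Each rank's partial sum of its local (raveled) face vectors (dummy data here).
  std::vector<double> partial_sum(2475, 1.0);
  std::vector<double> global_sum(partial_sum.size());

  // After this call, every rank holds the element-wise sum over all ranks.
  MPI_Allreduce(partial_sum.data(), global_sum.data(),
                static_cast<int>(partial_sum.size()),
                MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  // Dividing global_sum by the total number of images (on every rank) would then
  // give the globally computed mean described above.

  MPI_Finalize();
  return 0;
}

The covariance contributions can be combined in the analogous way with MPI_Reduce onto rank 0, since the final covariance is only needed there.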
The file eigenfaces_mpi.cpp has a number of locations where you will need to fill in the MPI calls to make the program work (search for “Write me”). In particular, you will need to fill in the sends and receives for sending the image data from the root to the other nodes, fill in the all-reduce for the global sum, and then add up (reduce) all of the covariance matrices.
Two things to note about the final covariance matrix: first, it is only needed on the root node. Second, it should not be normalized until it has been summed together on the root. Your gen_covariance function should not normalize it.
You should replace the gen_covariance and outer stubs with your best-performing shared-memory parallel code (par, block, or dgemm, depending), omitting normalization, as mentioned above.
By parallelizing my eigenfaces_dgemm program I was able to achieve the following
$ MKL_NUM_THREADS=10 sbatch --time 5:00 --nodes=4 --cpus-per-task=10 relay.bash mpirun ./eigenfaces_mpi.exe -i small.bin -n 50000
# Face size 55 x 45
# elapsed time [read_file]: 1164 ms
# elapsed time [scattering]: 3872 ms
# elapsed time [compute_mean]: 35 ms
# elapsed time [compute_shifted]: 25 ms
# elapsed time [reduce]: 55 ms
# elapsed time [compute_covariance]: 541 ms
# Gflops/s = 1132.28
Note that, for some reason, if you are combining Intel MKL routines with MPI, you need to specify the number of threads that MKL should use in the environment; otherwise it will default to one thread. You can do this by putting MKL_NUM_THREADS=<number> on the same line as sbatch (just before the invocation of sbatch). Without specifying that, I would get the following:
$ sbatch --time 5:00 --nodes=4 --cpus-per-task=10 relay.bash mpirun ./eigenfaces_mpi.exe -i small.bin -n 50000
# Face size 55 x 45
# elapsed time [read_file]: 854 ms
# elapsed time [scattering]: 364 ms
# elapsed time [compute_mean]: 32 ms
# elapsed time [compute_shifted]: 22 ms
# elapsed time [reduce]: 226 ms
# elapsed time [compute_covariance]: 2105 ms
# Gflops/s = 291.004
File eigenfaces_mpi.cpp that implements gen_covariance
with your best version and has MPI functionality completed.
How does the performance of your eigenfaces_mpi.exe scale for 1, 2, 4, 8 nodes? (With 10 cores on each node.)
You are encouraged to try to scale up your runs with eigenfaces_mpi.exe. You have been provided a script, max.bash, that will submit a job for you to the ckpt partition. You can edit the script to try the medium or even the large images, or to try 8 or 16 nodes, though you should limit the total cores of your job to 80 (the maximum QOSGrpCpuLimit allowed for the ckpt partition): for example, 4 nodes with 20 cpus-per-task, 8 nodes with 10 cpus-per-task, or 16 nodes with 5 cpus-per-task.
What configuration of nodes and cores per node gives you the very best performance? (And what was your best performance?)
I encourage you to see if you can achieve 1,000 GFlops (or 2,000 or 4,000 or more).
Running the max.bash script will save the results from the run in max.txt and in slurm-max.out.
Your final solutions will be uploaded to Gradescope.
The files you will upload include results from your most recent run of verify.bash and max.bash, so you should do the uploads after you have made your last satisfactory runs of both.
Use the scp command to copy the output files from Hyak back to your laptop for uploading.
Upload the following files:
eigenfaces_block.cpp
eigenfaces_dgemm.cpp
eigenfaces_mpi.cpp
eigenfaces_opt.cpp
eigenfaces_par.cpp
eigenfaces_seq.cpp
verify.txt
slurm-verify.out
max.bash
max.txt
slurm-max.out
verify-images.tar.gz
Questions.rst
If you relied on any external references, list them at the end of Questions.rst.