Final Exam

Assigned: June 1, 2021

Due: June 11, 2021

Background

  • Lecture on Eigenfaces and SVD

  • Final Walkthrough

Introduction

As with previous assignments, this assignment is extracted from a tar file, located at this link.


For your convenience, the tar file is also available at

/gscratch/niac/amath583sp21/archives/final.tar.gz

And the extracted files are available at

/gscratch/niac/amath583sp21/assignments/final

When you extract the files, you will have the following directory structure:

final/
├── Questions.rst
├── eigenfaces/
│   ├── Makefile
│   ├── common.bash
│   ├── eigenfaces_block.cpp
│   ├── eigenfaces_dgemm.cpp
│   ├── eigenfaces_mpi.cpp
│   ├── eigenfaces_opt.cpp
│   ├── eigenfaces_par.cpp
│   ├── eigenfaces_seq.cpp
│   ├── main.hpp
│   ├── max.bash
│   ├── test.bash
│   ├── test.cpp
│   └── verify.bash
├── include/
│   ├── AOSMatrix.hpp
│   ├── CImg.h
│   ├── COOMatrix.hpp
│   ├── CSCMatrix.hpp
│   ├── CSRMatrix.hpp
│   ├── Make_mpi.inc
│   ├── Matrix.hpp
│   ├── Vector.hpp
│   ├── amath583.hpp
│   ├── amath583IO.hpp
│   ├── amath583sparse.hpp
│   └── catch.hpp
└── src/
    ├── amath583.cpp
    ├── amath583IO.cpp
    └── amath583sparse.cpp

Special Instructions for Final

The rules for the final exam are the same as for the other assignments, except

  1. The exam may not be discussed with anyone except the course instructors.
    Do not discuss it with your classmates, with your friends, with your enemies. You may refer to your previous assignments, AMATH 483/583 slides, and lectures. You may contact the instructors via private messages on Piazza to clarify questions. The same policies and penalties for plagiarism apply for the final as for the regular assignments.

The final will be evaluated by hand rather than with Gradescope. As with previous assignments, there are tests supplied with the final that you can use to verify your solutions. We haven't really had any requirements on commenting your code, either with // or the /* */ pair. Continuing in this regard, general code comments will not be required for the final, but they are highly suggested. Any clarifying comments may help the graders award partial credit if your code doesn't exhibit the correct behavior.

Eigenfaces

The major steps in the eigenfaces application are:

  • Read in face images

  • Compute mean face image

  • Subtract mean from all face images

  • Compute covariance matrix

  • Compute eigendecomposition of covariance matrix

  • Save eigenface images

The implementation of eigenfaces that you have been provided with is instrumented to print out timing information about these steps (except for saving eigenface images).

The eigenfaces application programs take the following arguments:

  • -i bin_file (required), specifies the face image data file to read.

  • -n num, specifies the number of faces to read from the image data file. Default: 500.

  • -f num, specifies the number of eigenface images to save. If the value is zero, the eigendecomposition will not be performed. Default: 0.

  • -z size_spec, specifies the size of the image data. Acceptable options are “55x45”, “109x89”, or “218x178”. This should be selected to match the choice of image data file. Default: “55x45”.

If we invoke the base eigenfaces program (eigenfaces_seq.exe) in the following way

$ srun eigenfaces_seq.exe -i /gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.262781_55_45_32.bin -n 500 -f 8

we get the following output

# Face size 55 x 45
# Reading /gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.262781_55_45_32.bin
# elapsed time [read_file]: 10 ms
# elapsed time [compute_mean]: 0 ms
# elapsed time [compute_shifted]: 0 ms
# elapsed time [compute_covariance]: 10420 ms
# Gflops/s = 0.587872
# elapsed time [compute_svd]: 8436 ms

Since only 500 images were read, the times to read the file, compute the mean, and subtract the mean are quite small. The time to compute the covariance, on the other hand, is quite large.

Given a set of \(N\) vectors \(\{ f_i \}\), the formula for computing the covariance matrix is

\[C = \frac{1}{N} \sum_{i=0}^{N-1} f_i f_i^\top\]

The time to compute the covariance is proportional to the square of the size of the vectors and to the number of vectors.

Note

The vectors in the eigenfaces application are “raveled” images. That is, their length is the total number of pixels in the image, i.e., height * width, with elements corresponding to the pixel values of the image. (Essentially, we are undoing the one-d to two-d mapping that we did for the Matrix class.)
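To make the computation concrete, here is a minimal sketch of the sum-of-outer-products formulation, written with plain std::vector types rather than the Matrix/Vector classes actually used in the provided code (the names here are illustrative only, not the exact provided functions):

#include <vector>

// Sketch: accumulate C += f * f^T for each raveled face f, then divide by N.
// The provided gen_covariance/outer functions follow the same pattern but use
// the course Matrix/Vector classes.
void gen_covariance_sketch(const std::vector<std::vector<double>>& faces,
                           std::vector<std::vector<double>>& C) {
  size_t sz = faces[0].size();          // number of pixels per image
  for (const auto& f : faces) {         // one outer-product update per face
    for (size_t i = 0; i < sz; ++i) {
      for (size_t j = 0; j < sz; ++j) {
        C[i][j] += f[i] * f[j];
      }
    }
  }
  for (size_t i = 0; i < sz; ++i)       // normalize by the number of faces
    for (size_t j = 0; j < sz; ++j)
      C[i][j] /= static_cast<double>(faces.size());
}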

Given that forming the covariance matrix with 500 images takes approximately 10 seconds, how long (approximately) would forming the covariance matrix take if we used all 262,781 images in the data file?

In the remainder of this final you will be applying successive optimizations / parallelizations to improve the performance of the covariance computation.

Data Files

There are three data files that you can use in this assignment (small, medium, and large, respectively):

/gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.262781_55_45_32.bin
/gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.202599_109_89_32.bin
/gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.202599_218_178_32.bin

These are collections of greyscale images of celebrity faces. The first (small) dataset contains 262,781 images with height equal to 55 and width equal to 45. The second (medium) dataset contains 202,599 images with height equal to 109 and width equal to 89. The third (large) dataset contains 202,599 images with height equal to 218 and width equal to 178. In all of the datasets, the images are stored with a 32-bit floating point value for each pixel.

Although these seem like small images, the amount of data they represent is non-trivial. The small, medium, and large datasets are 2.5G, 7.4G, and 30G in size, respectively. Moreover, even the smallest image (55 by 45) ravels to a vector of length 2,475, meaning the (dense) covariance matrix is 2,475 by 2,475.

As seen in the example above, the input data file is specified to the eigenface program with the -i option and the number of images to use is specified with the -n option.

Typing out the full path of the data file can be arduous to do repeatedly. To make the typing easier, you can create a “symbolic link” to the data files:

$ ln -s /gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.262781_55_45_32.bin small.bin
$ ln -s /gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.202599_109_89_32.bin med.bin
$ ln -s /gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.202599_218_178_32.bin large.bin

Check the links

$ ls -l small.bin med.bin large.bin

You should see the symbolic links displayed, something like this:

small.bin -> /gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.262781_55_45_32.bin
med.bin -> /gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.202599_109_89_32.bin
large.bin -> /gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.202599_218_178_32.bin

Once you have done that, you can simply invoke eigenfaces with small.bin (or med.bin or large.bin)

$ srun --time=1:00 ./eigenfaces_seq.exe -i small.bin -n 500

You may also consider creating your own scripts for saving repetitive typing.

test.bash and verify.bash

Your code comes with two testing/verification scripts: test.bash and verify.bash. The test.bash script compiles and runs the program test.cpp, which in turn tests each of the gen_covariance functions in your files for correctness. If you execute test.exe by itself, it will run all of the tests – test.bash runs the tests one by one. The MPI program is not tested in this script.

The script verify.bash runs the eigenfaces programs, saves their output in a form suitable for submission, and checks the saved images for correctness.

There are two sections that are commented out in verify.bash – one section for the 583 portion of the final (eigenfaces_block) and one section for the extra credit portion (eigenfaces_dgemm).

Running the verify.bash script will save the results from the run in verify.txt and in slurm-verify.out, as well as create a tar file of images verify-images.tar.gz.

Base Case: eigenfaces_seq

The base case program executable is eigenfaces_seq.exe. The source code is in eigenfaces_seq.cpp. You can make the executable with

$ make eigenfaces_seq.exe

and execute it with srun

$ srun --time=1:00 ./eigenfaces_seq.exe -i small.bin -n 500

(Here I assume you have made the symbolic links as shown above.)

My results for a recent run of eigenfaces_seq.exe are (you should see something similar)

# Face size 55 x 45
# Reading small.bin
# elapsed time [read_file]: 23 ms
# elapsed time [compute_mean]: 0 ms
# elapsed time [compute_shifted]: 0 ms
# elapsed time [compute_covariance]: 1265 ms
# Gflops/s = 0.48424

If you specify the “-f” option to any of the programs, the program will perform an eigendecomposition of the covariance matrix and save the indicated number of eigenfaces, as well as the mean (average) face.

The mean face will be named mean_face_<method>_<num_images>_<height>_<width>.bmp. The eigenfaces will be named u_<method>_<component>_<num_images>_<height>_<width>.bmp.

The mean face (as should be computed by all programs) looks like the following:

[Image: mean_face_mpi_50000_55_45.bmp]

And the first eight eigenfaces:

[Images: u_mpi_0_50000_55_45.bmp through u_mpi_7_50000_55_45.bmp]

Sequential Optimization: eigenfaces_opt

The program executable for this problem is eigenfaces_opt.exe. The source code is in eigenfaces_opt.cpp. You can make the executable with

$ make eigenfaces_opt.exe

Note that this program uses the same main code (contained in main.hpp) as all of the non-MPI programs.

The first optimization you are to make is to apply any and all sequential optimizations in order to speed up gen_covariance (and/or speed up outer). (Recall the optimizations we learned about in, say, ps4.)
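As one hedged illustration (not necessarily the two optimizations referred to below), the symmetry of the outer product f f^T can be exploited so that only the lower triangle of C is updated, with the invariant element hoisted out of the inner loop; the triangle can then be mirrored once at the end. This sketch uses the same illustrative std::vector types as above:

#include <vector>

// Sketch: update only the lower triangle of C (roughly half the work) and
// hoist f[i] out of the inner loop so the compiler can vectorize the update.
void outer_lower_sketch(const std::vector<double>& f,
                        std::vector<std::vector<double>>& C) {
  size_t sz = f.size();
  for (size_t i = 0; i < sz; ++i) {
    double fi = f[i];
    for (size_t j = 0; j <= i; ++j) {
      C[i][j] += fi * f[j];
    }
  }
}

// Sketch: copy the lower triangle into the upper triangle once, after all
// outer-product updates have been applied.
void symmetrize_sketch(std::vector<std::vector<double>>& C) {
  for (size_t i = 0; i < C.size(); ++i)
    for (size_t j = 0; j < i; ++j)
      C[j][i] = C[i][j];
}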

By applying two fairly basic optimizations, I was able to achieve the following

$ srun ./eigenfaces_opt.exe -i small.bin -n 500
# Face size 55 x 45
# Reading small.bin
# elapsed time [read_file]: 11 ms
# elapsed time [compute_mean]: 0 ms
# elapsed time [compute_shifted]: 0 ms
# elapsed time [compute_covariance]: 1799 ms
# Gflops/s = 3.40502

File eigenfaces_opt.cpp with your optimized function(s).

What optimizations did you apply to improve eigenfaces_opt? Explain why these optimizations would result in better performance.

Parallelization

The program executable for this problem is eigenfaces_par.exe. The source code is in eigenfaces_par.cpp. You can make the executable with

$ make eigenfaces_par.exe

Note that this program uses the same main code (contained in main.hpp) as all of the non-MPI programs.

The next optimization to be applied is shared memory parallelization. Although the stub code that is initially contained in eigenfaces_par.cpp is the sequential base case code, the code that you should parallelize is the code you developed for eigenfaces_opt. You may use any method you like to parallelize it for shared memory execution. (Recall the approaches we learned about in ps5 and ps6.)
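One possible shared-memory approach (a sketch only, using the same illustrative std::vector types as above; ps5 and ps6 offer several equally valid alternatives) is to partition the rows of C across threads, so each thread owns a disjoint block of rows and no synchronization is needed during the updates:

#include <algorithm>
#include <thread>
#include <vector>

// Sketch: each thread updates a contiguous block of rows of C using all of
// the faces; since the row blocks are disjoint, no locking is required.
void gen_covariance_par_sketch(const std::vector<std::vector<double>>& faces,
                               std::vector<std::vector<double>>& C,
                               size_t num_threads) {
  size_t sz    = faces[0].size();
  size_t chunk = (sz + num_threads - 1) / num_threads;

  std::vector<std::thread> workers;
  for (size_t t = 0; t < num_threads; ++t) {
    size_t row_begin = t * chunk;
    size_t row_end   = std::min(sz, row_begin + chunk);
    workers.emplace_back([&, row_begin, row_end]() {
      for (const auto& f : faces)
        for (size_t i = row_begin; i < row_end; ++i)
          for (size_t j = 0; j < sz; ++j)
            C[i][j] += f[i] * f[j];
    });
  }
  for (auto& w : workers) w.join();
}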

By applying one parallelization approach, I was able to achieve the following

$ srun --time=1:00 --cpus-per-task=10  ./eigenfaces_par.exe -i small.bin -n 5000
# Face size 55 x 45
# Reading small.bin
# elapsed time [read_file]: 138 ms
# elapsed time [compute_mean]: 10 ms
# elapsed time [compute_shifted]: 8 ms
# elapsed time [compute_covariance]: 3446 ms
# Gflops/s = 17.77604

File eigenfaces_par.cpp with your parallelized function(s). This should include your sequential optimizations as well.

How did you parallelize your code? On 5000 faces, how much speedup were you able to get on 2, 4, 10, and 20 cores (cpus-per-task)? Does the speedup improve if you load 15000 faces rather than 5000 faces?

Blocking for Cache (AMATH 583)

The program executable for this problem is eigenfaces_block.exe. The source code is in eigenfaces_block.cpp. You can make the executable with

$ make eigenfaces_block.exe

Note that this program uses the same main code (contained in main.hpp) as all of the non-MPI programs.

In the way that we have separated gen_covariance and outer in the code as supplied, we are updating the entire matrix with the outer product of a single vector on each call to outer. But even with the smallest images, we have a covariance matrix that is 2,475 by 2,475 – which is over 6M total elements or over 48M bytes of memory (with double precision).

Another approach might be to first update just a portion of the matrix (a block), then another portion, then another portion, and so on, using all of the vectors. (We could also obtain some reuse of the vectors by updating with only some of them at a time.)

The next optimization to be applied is to block for cache. Although the stub code that is initially contained in eigenfaces_block.cpp is the sequential base case code, the code that you should block is the code you developed for eigenfaces_par. (Recall the approaches we learned about in lecture.)
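One way to block the update, sketched below with the same illustrative types as above, is to pick a block of columns of C, stream through all of the faces while that block stays resident in cache, and then move on to the next block (the block size would need to be tuned, and the hint below about absorbing outer into gen_covariance applies):

#include <algorithm>
#include <vector>

// Sketch: for each block of columns [jb, j_end) of C, apply the updates from
// every face before moving to the next block, so the block of C is reused
// from cache across all of the faces.
void gen_covariance_block_sketch(const std::vector<std::vector<double>>& faces,
                                 std::vector<std::vector<double>>& C,
                                 size_t block_size) {
  size_t sz = faces[0].size();
  for (size_t jb = 0; jb < sz; jb += block_size) {
    size_t j_end = std::min(sz, jb + block_size);
    for (const auto& f : faces) {
      for (size_t i = 0; i < sz; ++i)
        for (size_t j = jb; j < j_end; ++j)
          C[i][j] += f[i] * f[j];
    }
  }
}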

By blocking the updates to C, I was able to achieve the following

$ srun --time=1:00 --cpus-per-task=10  ./eigenfaces_block.exe -i small.bin -n 5000
# Face size 55 x 45
# Reading small.bin
# elapsed time [read_file]: 117 ms
# elapsed time [compute_mean]: 10 ms
# elapsed time [compute_shifted]: 8 ms
# elapsed time [compute_covariance]: 952 ms
# Gflops/s = 64.3448

Hint: For your implementation, you may want to pass begin and end markers to outer, or you may simply absorb the code for outer into gen_covariance.

File eigenfaces_block.cpp with your cache blocking optimizations. This should include your parallelization and sequential optimizations as well.

Explain your blocking approach. On 5000 faces and 10 (or 20) cores (say), how much speedup were you able to get over the non-blocked case? How much speedup were you able to get over the non-blocked case for 50,000 faces?

Using dgemm library call (Extra Credit)

The program executable for this problem is eigenfaces_dgemm.exe. The source code is in eigenfaces_dgemm.cpp. You can make the executable with

$ make eigenfaces_dgemm.exe

Note that this program uses the same main code (contained in main.hpp) as all of the non-MPI programs.

If we think about all of the faces collected together in a set \(\{ f_i \}\), we can define a matrix \(F\) such that the ith column of \(F\) is a face (vector) \(f_i\). In that case, it can be shown

\[C = \frac{1}{N} F F^\top\]

Or in other words, the covariance matrix can be written as a matrix-matrix product. Although we have spent significant time in the course looking at matrix-matrix product, we always assumed powers of two and omitted the (tedious) issues of fringe cleanup. So, we are not going to put our own matrix-matrix product code in here.

Rather, we are going to use the BLAS “DGEMM” routine – in particular the C-BLAS function cblas_dgemm.

The prototype for cblas_dgemm is the following

void cblas_dgemm(const CBLAS_LAYOUT Layout, const CBLAS_TRANSPOSE TransA,
                 const CBLAS_TRANSPOSE TransB, const int M, const int N,
                 const int K, const double alpha, const double *A,
                 const int lda, const double *B, const int ldb,
                 const double beta, double *C, const int ldc) NOTHROW;

Where CBLAS_LAYOUT is one of CblasRowMajor or CblasColMajor and CBLAS_TRANSPOSE is one of CblasTrans or CblasNoTrans.

(You will need the file mkl_cblas.h included – which should already be done for you.)

Documentation on cblas_dgemm can be found here

To call cblas_dgemm, you will need to copy the image vectors to an appropriately-sized matrix and then invoke cblas_dgemm with the appropriate arguments. Note that a matrix is described with two pieces of information in this interface: a pointer to the beginning of the matrix data and the “leading dimension”, i.e., the number of elements between successive rows (for row oriented) or columns (for column oriented). We use the leading dimension when computing the one-dimensional index from two-dimensional indices for a matrix, which is why we have to pass it in to this routine. Thus the matrix A is represented with const double *A and int lda, and so forth.
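Concretely, for a row-major matrix the leading dimension is the number of entries you skip to move down one row. As an illustrative snippet (the function name here is made up):

// Row-major indexing with a leading dimension: element (i, j) of A lives at
// A[i * lda + j].  For a column-major matrix it would be A[i + j * lda].
inline double get_element(const double* A, int lda, int i, int j) {
  return A[i * lda + j];
}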

Finally, the function is very general: it computes

C = alpha * op(A) * op(B) + beta * C

where op is either transpose or no transpose (as specified by the TransA and TransB arguments).
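For instance, if the shifted face vectors are copied into the rows of an N-by-sz row-major array F (so that F^T F accumulates the sum of outer products), one plausible call looks like the sketch below; the function and variable names are illustrative, and you should check the cblas_dgemm documentation for the exact argument meanings:

#include <mkl_cblas.h>
#include <vector>

// Sketch: C (sz x sz) = F^T * F, where row k of the row-major N x sz array F
// holds the k-th shifted face vector.  alpha could be set to 1.0 / N to fold
// in the normalization, or the scaling can be applied separately.
void covariance_via_dgemm_sketch(const std::vector<double>& F,  // N * sz, row-major
                                 std::vector<double>& C,        // sz * sz, row-major
                                 int N, int sz) {
  cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
              /* M = */ sz, /* N = */ sz, /* K = */ N,
              /* alpha = */ 1.0,
              F.data(), /* lda = */ sz,
              F.data(), /* ldb = */ sz,
              /* beta = */ 0.0,
              C.data(), /* ldc = */ sz);
}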

By using cblas_dgemm to compute C, I was able to achieve the following

$ srun --cpus-per-task=10 ./eigenfaces_dgemm.exe -i small.bin -n 5000
# Face size 55 x 45
# Reading small.bin
# elapsed time [read_file]: 104 ms
# elapsed time [compute_mean]: 10 ms
# elapsed time [compute_shifted]: 8 ms
# elapsed time [compute_covariance]: 498 ms
# Gflops/s = 123.0046

File eigenfaces_dgemm.cpp that implements gen_covariance with a call to cblas_dgemm.

What single core performance do you get with eigenfaces_dgemm? How does your performance scale for 2, 5, 10, 20 cores? (You may even try 40 cores.)

MPI: eigenfaces_mpi

The program executable for this problem is eigenfaces_mpi.exe. The source code is in eigenfaces_mpi.cpp. You can make the executable with

$ make eigenfaces_mpi.exe

Note that this program uses its own main (though it is very similar to the main in main.hpp).

The scheme of this program is that it operates in SPMD mode. The root node (rank 0) reads in the data files and then sends a subset of the images to each of the other nodes (keeping a subset for itself). Each node then proceeds more or less like the sequential program: compute the mean, subtract the mean from every image, compute the covariance matrix, and do the eigendecomposition.

However, there are a few places where we need to communicate to get global information. For instance, we can't compute the mean independently on every node; that would just be the mean of the local faces. Instead, we compute a partial sum on every node, do an all-reduce to get the global sum on every node, and then divide the sum by the total number of images to get the mean on every node. We can then use this globally-computed mean locally to update our local images.

Similarly with covariance, each node can compute its contribution to the covariance, but we have to reduce those contributions to a global covariance matrix (and then normalize the global covariance).
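As a hedged sketch of the kinds of collective calls involved (the buffer names here are made up, and the actual insertion points are marked in eigenfaces_mpi.cpp, as described below):

#include <mpi.h>
#include <vector>

// Sketch: all-reduce the per-node sum of face vectors (every node needs the
// mean), but only reduce the covariance contributions to the root.
void reduce_sketch(std::vector<double>& local_sum,   // per-node sum of faces
                   std::vector<double>& covariance,  // per-node, unnormalized
                   int root = 0) {
  MPI_Allreduce(MPI_IN_PLACE, local_sum.data(),
                static_cast<int>(local_sum.size()),
                MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == root) {
    MPI_Reduce(MPI_IN_PLACE, covariance.data(),
               static_cast<int>(covariance.size()),
               MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);
  } else {
    MPI_Reduce(covariance.data(), nullptr,
               static_cast<int>(covariance.size()),
               MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);
  }
  // The division by the total number of images happens on the root only
  // after this reduction, as noted below.
}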

The eigendecomposition is just done on rank 0.

The file eigenfaces_mpi.cpp has a number of locations where you will need to fill in the MPI calls to make the program work (search for “Write me”). In particular, you will need to fill in the sends and receives for sending the image data from the root to the other nodes, fill in the all-reduce for the global sum, and then add up (reduce) all of the covariance matrices.

Two things to note about the final covariance matrix: first, it is only needed on the root node. Second, it should not be normalized until it has been summed together on the root. Your gen_covariance function should not normalize it.

You should replace the gen_covariance and outer stubs with your best-performing shared-memory parallel code (par, block, or dgemm, depending), omitting normalization, as mentioned above.

By parallelizing my eigenfaces_dgemm program, I was able to achieve the following

$ MKL_NUM_THREADS=10 srun --mpi=pmi2 --nodes=4 --cpus-per-task=10 ./eigenfaces_mpi.exe -i small.bin -n 50000
# Face size 55 x 45
# elapsed time [read_file]: 879 ms
# elapsed time [scattering]: 322 ms
# elapsed time [compute_mean]: 31 ms
# elapsed time [compute_shifted]: 21 ms
# elapsed time [reduce]: 350 ms
# elapsed time [compute_covariance]: 729 ms
# Gflops/s = 840.278

Note that, for some reason, if you are combining Intel MKL routines with MPI you need to specify the number of threads that MKL should use in the environment; otherwise it will default to one thread. You can do this by indicating MKL_NUM_THREADS=<number> on the same line as srun (just before the invocation of srun). Without specifying that, I would get the following:

$ srun --mpi=pmi2 --nodes=4 --cpus-per-task=10 ./eigenfaces_mpi.exe -i small.bin -n 50000
# Face size 55 x 45
# elapsed time [read_file]: 854 ms
# elapsed time [scattering]: 364 ms
# elapsed time [compute_mean]: 32 ms
# elapsed time [compute_shifted]: 22 ms
# elapsed time [reduce]: 226 ms
# elapsed time [compute_covariance]: 2105 ms
# Gflops/s = 291.004

File eigenfaces_mpi.cpp that implements gen_covariance with your best version and has MPI functionality completed.

How does the performance of your eigenfaces_mpi.exe scale for 1, 2, 4, 8 nodes? (With 10 cores.)

max.bash

You are encouraged to try to scale up your runs with eigenfaces_mpi.exe. You have been provided a script max.bash that will submit a job for you to the ckpt queue. You are encouraged to edit the script and try the medium images, try 8, 16, or 32 nodes – though you should limit the total cores to 320 or fewer, i.e., 8 nodes with 40 cpus-per-task, 16 nodes with 20, 32 nodes with 10.

What configuration of nodes and cores per node gives you the very best performance? (And what was your best performance?)

I encourage you to see if you can achieve 1,000 GFlops (or 2,000 or 4,000 or more).

Running the max.bash script will save the results from the run in max.txt and in slurm-max.out.

Submitting Your Work

Your final solutions will be uploaded to Gradescope.

The files you will upload include results from your most recent run of verify.bash and max.bash, so you should do the uploads after you have made your last satisfactory runs of both.

Upload the following files:

eigenfaces_block.cpp
eigenfaces_dgemm.cpp
eigenfaces_mpi.cpp
eigenfaces_opt.cpp
eigenfaces_par.cpp
eigenfaces_seq.cpp
verify.txt
slurm-verify.out
max.bash
max.txt
slurm-max.out
verify-images.tar.gz
Questions.rst

If you relied on any external references, list them at the end of Questions.rst.