Assigned: June 1, 2021

Due: June 11, 2021

Lecture on Eigenfaces and SVD

Final Walkthrough

As with previous assignments, this assignment is extracted from a tar file, located at `this link`.

For your convenience, the tar file is also available at

```
/gscratch/niac/amath583sp21/archives/final.tar.gz
```

And the extracted files are available at

```
/gscratch/niac/amath583sp21/assignments/final
```

When you extract the files, you will have the following directory structure:

```
final/
├── Questions.rst
├── eigenfaces/
│   ├── Makefile
│   ├── common.bash
│   ├── eigenfaces_block.cpp
│   ├── eigenfaces_dgemm.cpp
│   ├── eigenfaces_mpi.cpp
│   ├── eigenfaces_opt.cpp
│   ├── eigenfaces_par.cpp
│   ├── eigenfaces_seq.cpp
│   ├── main.hpp
│   ├── max.bash
│   ├── test.bash
│   ├── test.cpp
│   └── verify.bash
├── include/
│   ├── AOSMatrix.hpp
│   ├── CImg.h
│   ├── COOMatrix.hpp
│   ├── CSCMatrix.hpp
│   ├── CSRMatrix.hpp
│   ├── Make_mpi.inc
│   ├── Matrix.hpp
│   ├── Vector.hpp
│   ├── amath583.hpp
│   ├── amath583IO.hpp
│   ├── amath583sparse.hpp
│   └── catch.hpp
└── src/
    ├── amath583.cpp
    ├── amath583IO.cpp
    └── amath583sparse.cpp
```

The rules for the final exam are the same as for the other assignments, *except*:

**The exam may not be discussed with anyone except course instructors.** Do not discuss it with your classmates, with your friends, or with your enemies. You *may* refer to your previous assignments and to AMATH 483/583 slides and lectures. You may contact the instructors via *private* messages on Piazza to clarify questions. The same policies and penalties for plagiarism apply for the final as for the regular assignments.

The final will be evaluated by hand rather than with Gradescope.
As with previous assignments, there are tests supplied with the final that you can use to verify your solutions.
We haven’t really had any requirements on commenting your code, either with `//` or with the `/* */` pair. Continuing in this regard, general code comments will not be required for the final, but they are *highly suggested*. Any clarifying comments may help the graders award partial credit if your code doesn’t exhibit the correct behavior.

The major steps in the eigenfaces application are:

1. Read in face images
2. Compute mean face image
3. Subtract mean from all face images
4. Compute covariance matrix
5. Compute eigendecomposition of covariance matrix
6. Save eigenface images

The implementation of eigenfaces that you have been provided with is instrumented to print out timing information about these steps (except for saving eigenface images).

The eigenfaces application programs take the following arguments:

`-i bin_file` (required), specifies the face image data file to read.

`-n num`, specifies the number of faces to read from the image data file. Default: 500.

`-f num`, specifies the number of eigenface images to save. If the value is zero, the eigendecomposition will not be performed. Default: 0.

`-z size_spec`, specifies the size of the image data. Acceptable options are “55x45”, “109x89”, or “218x178”. This should be selected to match the choice of image data file. Default: “55x45”.

If we invoke the base eigenfaces program (eigenfaces_seq.exe) in the following way

```
$ srun eigenfaces_seq.exe -i /gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.262781_55_45_32.bin -n 500 -f 8
```

we get the following output

```
# Face size 55 x 45
# Reading /gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.262781_55_45_32.bin
# elapsed time [read_file]: 10 ms
# elapsed time [compute_mean]: 0 ms
# elapsed time [compute_shifted]: 0 ms
# elapsed time [compute_covariance]: 10420 ms
# Gflops/s = 0.587872
# elapsed time [compute_svd]: 8436 ms
```

Since only 500 images were read, the times to read the file, compute the mean, and subtract the mean are all quite small. Computing the covariance, on the other hand, takes quite a long time.

Given a set of \(N\) vectors \(\{ f_i \}\), the formula for computing the covariance matrix is

\[C = \frac{1}{N} \sum_{i=0}^{N-1} f_i f_i^\top\]

The time to compute the covariance is proportional to the square of the size of the vectors and to the number of vectors.

Note

The vectors in the eigenfaces application are “raveled” images. That is, their length is the total number of pixels in the image, i.e., height * width, with elements corresponding to the pixel values of the image. (Essentially, we are undoing the one-d to two-d mapping that we did for the Matrix class.)
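Putting the covariance formula and the note above together, the computation can be sketched as follows. This is a minimal illustration using plain `std::vector` for the raveled images; the assignment’s actual code uses its own Matrix/Vector classes and an `outer` helper, so names here are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// C = (1/N) * sum_i f_i f_i^T, accumulated one outer product at a time.
// Each f is a raveled image of length height * width.
Mat gen_covariance(const std::vector<Vec>& faces) {
  std::size_t sz = faces[0].size();        // pixels per image
  Mat C(sz, Vec(sz, 0.0));
  for (const Vec& f : faces)               // one outer product per face
    for (std::size_t i = 0; i < sz; ++i)
      for (std::size_t j = 0; j < sz; ++j)
        C[i][j] += f[i] * f[j];
  for (std::size_t i = 0; i < sz; ++i)     // normalize by the face count
    for (std::size_t j = 0; j < sz; ++j)
      C[i][j] /= double(faces.size());
  return C;
}
```

Note the triple loop: sz² work per face, times N faces, which is why the covariance step dominates the timings above.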

Given that forming the covariance matrix with 500 images takes approximately 10 seconds, how long (approximately) would forming the covariance matrix take if we used all 262,781 images in the data file?

In the remainder of this final you will be applying successive optimizations / parallelizations to improve the performance of the covariance computation.

There are three data files that you can use in this assignment (small, medium, and large, respectively):

```
/gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.262781_55_45_32.bin
/gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.202599_109_89_32.bin
/gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.202599_218_178_32.bin
```

These are collections of greyscale images of celebrity faces. The first (small) dataset contains 262,781 images with height equal to 55 and width equal to 45. The second (medium) dataset contains 202,599 images with height equal to 109 and width equal to 89. The third (large) dataset contains 202,599 images with height equal to 218 and width equal to 178. In all of the datasets, the images are stored with a 32-bit floating point value for each pixel.

Although these seem like small images, the amount of data they represent is non-trivial. The small, medium, and large datasets are 2.5G, 7.4G, and 30G in size, respectively. Moreover, even the smallest image (55 by 45) ravels to a vector of length 2,475, meaning the (dense) covariance matrix is 2,475 by 2,475.

As seen in the example above, the input data file is specified to the eigenfaces program with the `-i` option, and the number of images to use is specified with the `-n` option.

Typing out the full path of the data file repeatedly can be arduous. To make the typing easier, you can create a “symbolic link” to each of the data files:

```
$ ln -s /gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.262781_55_45_32.bin small.bin
$ ln -s /gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.202599_109_89_32.bin med.bin
$ ln -s /gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.202599_218_178_32.bin large.bin
```

Check the links

```
$ ls -l small.bin med.bin large.bin
```

You should see the symbolic links displayed, something like:

```
small.bin -> /gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.262781_55_45_32.bin
med.bin -> /gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.202599_109_89_32.bin
large.bin -> /gscratch/niac/amath583sp21/data/celebA/celebA_gray_lowres.202599_218_178_32.bin
```

Once you have done that, you can simply invoke eigenfaces with small.bin (or med.bin or large.bin)

```
$ srun --time=1:00 ./eigenfaces_seq.exe -i small.bin -n 500
```

You may also consider creating your own scripts for saving repetitive typing.
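For example, a small wrapper script along these lines (the name `run_seq.sh` and its defaults are purely illustrative) saves retyping:

```shell
#!/bin/bash
# run_seq.sh -- hypothetical convenience wrapper; assumes the
# small.bin/med.bin/large.bin symbolic links from above exist.
DATA=${1:-small.bin}   # first argument: data file (default: small.bin)
NUM=${2:-500}          # second argument: number of faces (default: 500)
CMD="srun --time=1:00 ./eigenfaces_seq.exe -i $DATA -n $NUM"
echo "$CMD"            # preview; change `echo "$CMD"` to `$CMD` to actually run
```

Invoked as `./run_seq.sh med.bin 5000`, it substitutes the arguments into the `srun` command.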

Your code comes with two testing/verification scripts: test.bash and verify.bash. The test.bash script compiles and runs the program test.cpp, which in turn tests each of the `gen_covariance` functions in your files for correctness. If you execute `test.exe` by itself, it will run all of the tests – test.bash runs the tests one by one. The MPI program is not tested in this script.

The script verify.bash runs the eigenfaces programs, saves their output in a form suitable for submission, and checks the saved images for correctness.

There are two sections that are commented out in verify.bash – one section for the 583 portion of the final (eigenfaces_block) and one section for the extra credit portion (eigenfaces_dgemm).

Running the verify.bash script will save the results from the run in verify.txt and in slurm-verify.out, as well as create a tar file of images, `verify-images.tar.gz`.

The base case program executable is eigenfaces_seq.exe. The source code is in eigenfaces_seq.cpp. You can make the executable with

```
$ make eigenfaces_seq.exe
```

and execute it with srun

```
$ srun --time=1:00 ./eigenfaces_seq.exe -i small.bin -n 500
```

(Here I assume you have made the symbolic links as shown above.)

My results for a recent run of eigenfaces_seq.exe are (you should see something similar)

```
# Face size 55 x 45
# Reading small.bin
# elapsed time [read_file]: 23 ms
# elapsed time [compute_mean]: 0 ms
# elapsed time [compute_shifted]: 0 ms
# elapsed time [compute_covariance]: 1265 ms
# Gflops/s = 0.48424
```

If you specify the “-f” option to any of the programs, the program will perform an eigendecomposition of the covariance matrix and save the indicated number of eigenfaces, as well as the mean (average) face.

The mean face will be named `mean_face_<method>_<num_images>_<height>_<width>.bmp`. The eigenfaces will be named `u_<method>_<component>_<num_images>_<height>_<width>.bmp`.

The mean face (as should be computed by all programs) looks like the following:

And the first eight eigenfaces:

The program executable for this problem is eigenfaces_opt.exe. The source code is in eigenfaces_opt.cpp. You can make the executable with

```
$ make eigenfaces_opt.exe
```

Note that this program uses the same main code (contained in main.hpp) as all of the non-MPI programs.

The first optimization you are to make is to apply any and all sequential optimizations in order to speed up `gen_covariance` (and/or to speed up `outer`). (Recall the optimizations we learned about in, say, ps4.)
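As one hypothetical illustration of the kind of sequential optimization that applies here (your choices may differ, and other optimizations are certainly possible), the symmetry of C can be exploited and a loop-invariant load hoisted:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Accumulate one outer product, exploiting symmetry: only j >= i is touched,
// roughly halving the flops, and f[i] is hoisted out of the inner loop.
void outer_opt(const Vec& f, Mat& C) {
  std::size_t sz = f.size();
  for (std::size_t i = 0; i < sz; ++i) {
    double fi = f[i];                      // loop-invariant load
    for (std::size_t j = i; j < sz; ++j)   // upper triangle only
      C[i][j] += fi * f[j];
  }
}

// After all faces have been accumulated, mirror the upper triangle down.
void symmetrize(Mat& C) {
  for (std::size_t i = 0; i < C.size(); ++i)
    for (std::size_t j = i + 1; j < C.size(); ++j)
      C[j][i] = C[i][j];
}
```

Since the symmetrization is done once, rather than per face, its cost is amortized away for large face counts.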

By applying two fairly basic optimizations, I was able to achieve the following

```
$ srun ./eigenfaces_opt.exe -i small.bin -n 500
```

```
# Face size 55 x 45
# Reading small.bin
# elapsed time [read_file]: 11 ms
# elapsed time [compute_mean]: 0 ms
# elapsed time [compute_shifted]: 0 ms
# elapsed time [compute_covariance]: 1799 ms
# Gflops/s = 3.40502
```

File eigenfaces_opt.cpp with your optimized function(s).

What optimizations did you apply to improve eigenfaces_opt? Explain why these optimizations would result in better performance.

The program executable for this problem is eigenfaces_par.exe. The source code is in eigenfaces_par.cpp. You can make the executable with

```
$ make eigenfaces_par.exe
```

Note that this program uses the same main code (contained in main.hpp) as all of the non-MPI programs.

The next optimization to be applied is shared memory parallelization. Although the stub code that is initially contained in eigenfaces_par.cpp is the sequential base case code, the code that you should parallelize is the code you developed for eigenfaces_opt. You may use any method you like to parallelize it for shared memory execution. (Recall the approaches we learned about in ps5 and ps6.)
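One possible shared-memory strategy (a sketch only; any approach from ps5/ps6 is acceptable, and the names here are illustrative) is to partition the rows of C among threads. Each thread loops over every face but updates only its own rows, so no two threads write the same element and no locking is needed:

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

void gen_covariance_par(const std::vector<Vec>& faces, Mat& C,
                        std::size_t nthreads) {
  std::size_t sz = faces[0].size();
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < nthreads; ++t)
    workers.emplace_back([&, t] {
      // Rows [lo, hi) belong exclusively to thread t -- no races.
      std::size_t lo = t * sz / nthreads, hi = (t + 1) * sz / nthreads;
      for (const Vec& f : faces)
        for (std::size_t i = lo; i < hi; ++i)
          for (std::size_t j = 0; j < sz; ++j)
            C[i][j] += f[i] * f[j];
    });
  for (auto& w : workers) w.join();
  for (auto& row : C)                      // normalize by the face count
    for (double& x : row) x /= double(faces.size());
}
```

An alternative is to partition the faces and give each thread a private partial C, reducing at the end; that trades extra memory for better locality per thread.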

By applying one parallelization approach, I was able to achieve the following

```
$ srun --time=1:00 --cpus-per-task=10 ./eigenfaces_par.exe -i small.bin -n 5000
```

```
# Face size 55 x 45
# Reading small.bin
# elapsed time [read_file]: 138 ms
# elapsed time [compute_mean]: 10 ms
# elapsed time [compute_shifted]: 8 ms
# elapsed time [compute_covariance]: 3446 ms
# Gflops/s = 17.77604
```

File eigenfaces_par.cpp with your parallelized function(s). This should include your sequential optimizations as well.

How did you parallelize your code? On 5000 faces, how much speedup were you able to get on 2, 4, 10, and 20 cores (cpus-per-task)? Does the speedup improve if you load 15000 faces rather than 5000 faces?

The program executable for this problem is eigenfaces_block.exe. The source code is in eigenfaces_block.cpp. You can make the executable with

```
$ make eigenfaces_block.exe
```

Note that this program uses the same main code (contained in main.hpp) as all of the non-MPI programs.

In the way that we have separated `gen_covariance` and `outer` in the code as supplied, we are updating the entire matrix with the outer product of a single vector on each call to `outer`. But even with the smallest images, we have a covariance matrix that is 2,475 by 2,475 – which is over 6M total elements, or over 48M bytes of memory (with double precision).

Another approach might be to first update just a portion of the matrix (a block), then another portion, then another, and so on, using all of the vectors. (We could also obtain some reuse of the vectors by updating with only some of them at a time.)

The next optimization to be applied is to block for cache. Although the stub code that is initially contained in eigenfaces_block.cpp is the sequential base case code, the code that you should block is the code you developed for eigenfaces_par. (Recall the approaches we learned about in lecture.)
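The blocking idea described above might be sketched like this (illustrative only; the panel width `block` would be tuned to the cache, and the real code would also keep the symmetry and threading optimizations from the previous steps):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Process C in column panels of width `block`. For each panel, stream through
// all of the faces before moving on, so that panel of C stays resident in
// cache for the whole pass.
void gen_covariance_block(const std::vector<Vec>& faces, Mat& C,
                          std::size_t block) {
  std::size_t sz = faces[0].size();
  for (std::size_t jb = 0; jb < sz; jb += block) {
    std::size_t jend = std::min(jb + block, sz);  // handle the fringe panel
    for (const Vec& f : faces)                    // all faces update this panel
      for (std::size_t i = 0; i < sz; ++i)
        for (std::size_t j = jb; j < jend; ++j)
          C[i][j] += f[i] * f[j];
  }
  for (auto& row : C)                             // normalize by the face count
    for (double& x : row) x /= double(faces.size());
}
```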

By blocking the updates to C, I was able to achieve the following

```
$ srun --time=1:00 --cpus-per-task=10 ./eigenfaces_block.exe -i small.bin -n 5000
```

```
# Face size 55 x 45
# Reading small.bin
# elapsed time [read_file]: 117 ms
# elapsed time [compute_mean]: 10 ms
# elapsed time [compute_shifted]: 8 ms
# elapsed time [compute_covariance]: 952 ms
# Gflops/s = 64.3448
```

Hint: For your implementation, you may want to pass begin and end markers to `outer`, or you may simply absorb the code for `outer` into `gen_covariance`.

File eigenfaces_block.cpp with your cache blocking optimizations. This should include your parallelization and sequential optimizations as well.

Explain your blocking approach. On 5000 faces and 10 (or 20) cores (say), how much speedup were you able to get over the non-blocked case? How much speedup were you able to get over the non-blocked case for 50,000 faces?

The program executable for this problem is eigenfaces_dgemm.exe. The source code is in eigenfaces_dgemm.cpp. You can make the executable with `make eigenfaces_dgemm.exe`, as before.

If we think about all of the faces collected together in a set \(\{ f_i \}\), we can define a matrix \(F\) such that the \(i\)th column of \(F\) is the face (vector) \(f_i\). In that case, it can be shown that

\[C = \frac{1}{N} F F^\top\]

Or in other words, the covariance matrix can be written as a matrix-matrix product. Although we have spent significant time in the course looking at matrix-matrix product, we always assumed powers of two and omitted the (tedious) issues of fringe cleanup. So, we are not going to put our own matrix-matrix product code in here.

Rather, we are going to use the BLAS “DGEMM” routine – in particular, the C-BLAS function `cblas_dgemm`.

The prototype for `cblas_dgemm` is the following:

```
void cblas_dgemm(const CBLAS_LAYOUT Layout, const CBLAS_TRANSPOSE TransA,
const CBLAS_TRANSPOSE TransB, const int M, const int N,
const int K, const double alpha, const double *A,
const int lda, const double *B, const int ldb,
const double beta, double *C, const int ldc) NOTHROW;
```

Where `CBLAS_LAYOUT` is one of `CblasRowMajor` or `CblasColMajor`, and `CBLAS_TRANSPOSE` is one of `CblasTrans` or `CblasNoTrans`.

(You will need the file `mkl_cblas.h` included – which should already be done for you.)

Documentation on `cblas_dgemm` can be found here.

To call `cblas_dgemm`, you will need to copy the image vectors to an appropriately-sized matrix and then invoke `cblas_dgemm` with the appropriate arguments. Note that a matrix is described with two pieces of information in this interface: a pointer to the beginning of the matrix data and the “leading dimension”, i.e., the number of elements between successive rows (for row oriented) or columns (for column oriented). We use the leading dimension when computing the one-dimensional index from two-dimensional indices for a matrix, which is why we have to pass it in to this routine. Thus the matrix A is represented with `const double *A` and `int lda`, and so forth.
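Concretely, the leading dimension enters the one-dimensional index computation like this (a small illustrative sketch, not code from the assignment):

```cpp
#include <cassert>
#include <cstddef>

// Row-major: rows are `lda` elements apart, so (i, j) lands at i*lda + j.
inline std::size_t idx_row_major(std::size_t i, std::size_t j, std::size_t lda) {
  return i * lda + j;
}

// Column-major: columns are `lda` elements apart, so (i, j) lands at i + j*lda.
inline std::size_t idx_col_major(std::size_t i, std::size_t j, std::size_t lda) {
  return i + j * lda;
}
```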

Finally, the function is very general; it computes

```
C = alpha * op(A) * op(B) + beta * C
```

where `op` is either transpose or no-transpose (as specified by the `TransA` and `TransB` arguments).
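As an illustration of the argument mapping (a sketch, not verified against your build): to compute \(C = \frac{1}{N} F F^\top\) with \(F\) stored column-major as `sz` rows by `num_faces` columns, the call would look roughly like the comment below. The tiny reference routine implements the same semantics so the mapping can be checked on a toy case:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical call (names illustrative):
//
//   cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
//               /* M */ sz, /* N */ sz, /* K */ num_faces,
//               /* alpha */ 1.0 / num_faces, F.data(), /* lda */ sz,
//               F.data(), /* ldb */ sz, /* beta */ 0.0, C.data(), /* ldc */ sz);
//
// Reference for those semantics: C (M x M, col-major) = alpha * A * A^T,
// with A an M x K column-major matrix.
void ref_dgemm_nt(std::size_t M, std::size_t K, double alpha,
                  const double* A, std::size_t lda, double* C, std::size_t ldc) {
  for (std::size_t j = 0; j < M; ++j)
    for (std::size_t i = 0; i < M; ++i) {
      double sum = 0.0;
      for (std::size_t k = 0; k < K; ++k)
        sum += A[i + k * lda] * A[j + k * lda];   // (A)_{ik} (A^T)_{kj}
      C[i + j * ldc] = alpha * sum;
    }
}
```

Note that passing `alpha = 1.0 / num_faces` folds the normalization into the product itself.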

By using `cblas_dgemm` to compute C, I was able to achieve the following:

```
$ srun --cpus-per-task=10 ./eigenfaces_dgemm.exe -i small.bin -n 5000
```

```
# Face size 55 x 45
# Reading small.bin
# elapsed time [read_file]: 104 ms
# elapsed time [compute_mean]: 10 ms
# elapsed time [compute_shifted]: 8 ms
# elapsed time [compute_covariance]: 498 ms
# Gflops/s = 123.0046
```

File eigenfaces_dgemm.cpp that implements `gen_covariance` with a call to `cblas_dgemm`.

What single core performance do you get with eigenfaces_dgemm? How does your performance scale for 2, 5, 10, 20 cores? (You may even try 40 cores.)

The program executable for this problem is eigenfaces_mpi.exe. The source code is in eigenfaces_mpi.cpp. You can make the executable with `make eigenfaces_mpi.exe`.

Note that this program uses its own `main` (though it is very similar to the `main` in main.hpp).

The scheme of this program is that it operates in SPMD mode. The root node (rank 0) reads in the data files and then sends a subset of the images to each of the other nodes (keeping a subset for itself). Each node then proceeds more or less like the sequential program: compute the mean, subtract the mean from every image, compute the covariance matrix, do the eigendecomposition.

However, there are a few places where we need to communicate to get global information. For instance, we can’t simply compute the mean on every node – that would just be the mean of the local faces. Instead, we need to compute a partial sum on every node, do an all-reduce to get the global sum on every node, and then divide the sum by the total number of images to get the mean on every node. We can then use this globally-computed mean locally to update our local images.

Similarly with covariance, each node can compute its contribution to the covariance, but we have to reduce those contributions to a global covariance matrix (and then normalize the global covariance).
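The global-mean protocol above can be sketched as follows, with the ranks simulated by a plain loop so the arithmetic is visible. In eigenfaces_mpi.cpp, the summation across ranks is exactly what becomes the all-reduce (e.g., `MPI_Allreduce` with `MPI_SUM`); the function name here is illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Step 1: every "rank" forms a local partial sum of its faces.
// Step 2: the partial sums are summed across ranks (the all-reduce).
// Step 3: every rank divides by the *global* face count, not its local count.
Vec global_mean(const std::vector<std::vector<Vec>>& per_rank_faces,
                std::size_t global_count) {
  std::size_t sz = per_rank_faces[0][0].size();
  Vec sum(sz, 0.0);
  for (const auto& local : per_rank_faces) {   // stands in for the all-reduce
    Vec partial(sz, 0.0);
    for (const Vec& f : local)
      for (std::size_t i = 0; i < sz; ++i) partial[i] += f[i];
    for (std::size_t i = 0; i < sz; ++i) sum[i] += partial[i];
  }
  for (double& x : sum) x /= double(global_count);
  return sum;
}
```

The covariance reduction is analogous, except that a rooted reduce suffices (the result is only needed on rank 0) and the normalization happens once, after the sum.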

The eigendecomposition is just done on rank 0.

The file eigenfaces_mpi.cpp has a number of locations where you will need to fill in the MPI calls to make the program work (search for “Write me”). In particular, you will need to fill in the sends and receives for sending the image data from the root to the other nodes, fill in the all-reduce for the global sum, and then add up (reduce) all of the covariance matrices.

Two things to note about the final covariance matrix: first, it is only needed on the root node. Second, it should not be normalized until it has been summed together on the root. Your `gen_covariance` function should not normalize it.

You should replace the `gen_covariance` and `outer` stubs with your best-performing shared-memory parallel code (par, block, or dgemm, depending), omitting normalization, as mentioned above.

By parallelizing my eigenfaces_dgemm program with MPI, I was able to achieve the following:

```
$ MKL_NUM_THREADS=10 srun --mpi=pmi2 --nodes=4 --cpus-per-task=10 ./eigenfaces_mpi.exe -i small.bin -n 50000
```

```
# Face size 55 x 45
# elapsed time [read_file]: 879 ms
# elapsed time [scattering]: 322 ms
# elapsed time [compute_mean]: 31 ms
# elapsed time [compute_shifted]: 21 ms
# elapsed time [reduce]: 350 ms
# elapsed time [compute_covariance]: 729 ms
# Gflops/s = 840.278
```

Note that, for some reason, if you are combining Intel MKL routines with MPI, you need to specify the number of threads that MKL should use in the environment; otherwise it will default to one thread. You can do this by setting MKL_NUM_THREADS=<number> on the same line as srun (just before the invocation of srun). Without specifying that, I would get the following:

```
$ srun --mpi=pmi2 --nodes=4 --cpus-per-task=10 ./eigenfaces_mpi.exe -i small.bin -n 50000
```

```
# Face size 55 x 45
# elapsed time [read_file]: 854 ms
# elapsed time [scattering]: 364 ms
# elapsed time [compute_mean]: 32 ms
# elapsed time [compute_shifted]: 22 ms
# elapsed time [reduce]: 226 ms
# elapsed time [compute_covariance]: 2105 ms
# Gflops/s = 291.004
```

File eigenfaces_mpi.cpp that implements `gen_covariance` with your best version and has the MPI functionality completed.

How does the performance of your eigenfaces_mpi.exe scale for 1, 2, 4, 8 nodes? (With 10 cores.)

You are encouraged to try to scale up your runs with eigenfaces_mpi.exe. You have been provided a script max.bash that will submit a job for you to the ckpt queue. You are encouraged to edit the script and try the medium images, try 8, 16, or 32 nodes – though you should limit the total cores to 320 or fewer, i.e., 8 nodes with 40 cpus-per-task, 16 nodes with 20, 32 nodes with 10.

What configuration of nodes and cores per node gives you the very best performance? (And what was your best performance?)

I encourage you to see if you can achieve 1,000 GFlops (or 2,000 or 4,000 or more).

Running the max.bash script will save the results from the run in max.txt and in slurm-max.out.

Your final solutions will be uploaded to Gradescope.

The files you will upload include results from your most recent run of verify.bash and max.bash, so you should do the uploads after you have made your last satisfactory runs of both.

Upload the following files:

```
eigenfaces_block.cpp
eigenfaces_dgemm.cpp
eigenfaces_mpi.cpp
eigenfaces_opt.cpp
eigenfaces_par.cpp
eigenfaces_seq.cpp
verify.txt
slurm-verify.out
max.bash
max.txt
slurm-max.out
verify-images.tar.gz
Questions.rst
```

If you relied on any external references, list them at the end of Questions.rst.