Problem Set #3

Assigned: Apr 16, 2020

Due: Apr 23, 2020

Background

View the final part of lecture 5 (which starts the CPU model we will be using for the course), lecture 6, and lecture 7.

Introduction

As with the previous assignment, the files for this assignment should be extracted into a directory named ps3.

Extract the homework files

Create a new subdirectory named “ps3” to hold your files for this assignment. The files for this assignment are in a “tar ball” that can be downloaded with this link.


To extract the files for this assignment, download the tar ball and copy it (or move it) to your ps3 subdirectory. From a shell program, change your working directory to be the ps3 subdirectory and execute the command

$ tar zxvf ps3.tar.gz

The files that should be extracted from the tape archive file are

  • CImg.h - header file for CImg image library, used by mnist.cpp

  • Makefile

  • Matrix.hpp – contains the declaration of the Matrix class as developed in lecture 5

  • Questions.rst – text file for short-answer questions

  • Timer.hpp - contains declaration of a simple Timer class

  • Vector.hpp – contains the declaration of the Vector class as developed in lecture 4

  • amath583.cpp – contains assorted operations on Vector and Matrix objects

  • amath583.hpp – contains declarations for the functions defined in amath583.cpp

  • catch.hpp – header file for the “Catch2” testing framework

  • efficiency.cpp - program to demonstrate optimization behavior

  • float_vs_double.cpp - program to experiment with performance of single precision vs double precision numbers

  • hello.cpp, goodbye.cpp, main.cpp - simple C++ programs

  • mmtime.cpp - simple program to measure matrix-matrix product time

  • mnist.cpp/mnist.hpp - driver program for computing all-pairs products of image data

  • repeat.cpp - program to demonstrate use of argc and argv

  • test_*.cpp – files to test the programs you will be writing

  • timing.cpp - program to demonstrate use of Timer

  • verify.bash - bash script to build and run all test programs

Preliminaries

main()

By convention (and long since standardized), the function that is first invoked when a C++ program is run is the function named main(). In lecture and in previous assignments, we have seen a very basic use of main()

#include <iostream>

int main() {
  std::cout << "Hello World!" << std::endl;
  return 0;
}

Returning from main()

We have noted a few important things about main().

First, main() is a function just like any other; the only thing that makes it special is its name – the function with that name is the one that will be called when your program is run. As “just a function”, main has a return type (in this case, int) and so must also return a value of that type. As with any other function, a return value is returned to the calling function. In the case of main(), there isn’t another function in your program that called main(), but the value is still returned. In particular, it is returned to the program that invoked your program – typically the shell (bash).

But where does that value go and how do you access it? If you examine the test script verify.bash, you will notice lines that look like this:

if [ $? == 0 ] ; then
    echo "+++ $1 Passed +++"
else
    echo "--- $1 Failed ---"
fi

These lines follow the execution of a program. The return value from the program – the return value from that program’s main() function – is accessed with the special shell macro $?. (Other shells have different macros – if you use csh or tcsh, the macro is called $status.)

In your ps3 directory there are two files: hello.cpp and goodbye.cpp. As supplied, they both return 0 from main(). There are rules to build hello.exe and goodbye.exe in your Makefile. Build both of them:

$ make hello.exe goodbye.exe

Now run each of them and inspect the return value:

$ ./hello.exe
$ echo $?
$ ./goodbye.exe
$ echo $?

What is printed in response to the echo statements for each case?

Next, change goodbye.cpp so that main() returns a non-zero value (say, 1 or -1). Recompile goodbye.exe and re-run the two programs again:

$ make goodbye.exe
$ ./hello.exe
$ echo $?
$ ./goodbye.exe
$ echo $?

What is returned in this case?

In the command-line world of Linux/Unix/etc., programs are often invoked by other programs (e.g., bash scripts), and the return values from executing programs are used to signal successful execution (return 0) or unsuccessful execution (return a non-zero error code). Actual program output is generally written as ASCII strings to the standard output stream (in C++, std::cout) and read from the standard input stream (std::cin). Error messages are sent to a different stream from the standard output (std::cerr). Both std::cout and std::cerr will by default print to your terminal screen, but they can be redirected separately when you connect the inputs and outputs of programs together.

Calling main() with arguments

Command-line programs can also take parameters on the command line. For example, if you want to list the files in the directory “/usr”, you would invoke

$ ls /usr

Here, ls is simply a program (by convention, Unix/Linux programs do not have a .exe extension). If you look in the /usr/bin directory, in fact, you will see the executable file for that command

$ ls /usr/bin/ls

Shell programs like bash operate in a ‘read-eval-print’ loop (REPL). The shell takes all of the characters typed on the line before you hit enter, parses them, executes the command given (the first token on the line), and prints the result. To execute the command, the shell searches the directories listed in the $PATH environment variable and runs the program if it is found.

You can see what your path is by printing the $PATH variable:

$ echo $PATH

Now, all of the programs that you might run (do an ls on, say, /usr/bin or any other directory in your path) have a main() function (assuming they are C or C++ programs, which most programs in Linux/Unix are). So they will return 0 or non-zero values (you can verify this by running some programs and checking $?)

$ ls /usr/bin
$ echo $?
$ ls /nosuchdirectory
$ echo $?
$ nosuchcommand
$ echo $?

But, these programs are doing more than what we have been doing so far: they take arguments. That is,

$ ls /usr/bin

lists the contents of the directory /usr/bin. So, somehow, the argument “/usr/bin” is being passed into the program – more precisely, somehow, “/usr/bin” is being passed to the main() function in the ls program.

The main() function we have seen to this point doesn’t take any arguments, so it isn’t obvious how argument passing would be done. However, the language allows main() to be defined with either of two signatures, and whichever one you define will be called when your program is run. (You can only have a single main() – you must use one signature or the other, but you can’t use both.)

The version of main() that takes arguments is declared as follows:

int main(int argc, char *argv[]);

The first parameter, traditionally named argc, contains the number of arguments passed to your program (including the program name itself). The second, traditionally named argv, is an array of strings, each of which is one of those arguments.

Note

Here is where we are forced to make a concession to the use of pointers. In C/C++, a pointer holds the address in memory of a datum and is denoted with the * prefix. Technically, argv is an array of pointers to char – each entry is a memory address, and the data it points to is interpreted as character data when accessed through that pointer. In C++, char pointers can be used more or less interchangeably with std::string, and so we will be using the contents of argv[] in that way.

If we have our arguments in an array of strings and we know how many entries are in the array, we can loop through the array and read and interpret each entry. In your ps3 directory is a file repeat.cpp:

#include <iostream>

int main(int argc, char *argv[]) {
  for (int i = 0; i < argc; ++i) {
    std::cout << "argv[" << i << "]: " << argv[i] << std::endl;
  }

  return 0;
}

Make repeat.exe and try the following. First, execute the program with no arguments at all:

$ ./repeat.exe

What is printed?

Now execute the program with whatever arguments you like, e.g.,

$ ./repeat.exe whatever arguments you like

What is printed?

Finally, note that “.” refers to your current directory. We specify “./” in front of the programs we compile locally because the shell will not find them by searching its path. That is,

$ repeat.exe

will not find repeat.exe.

But, the directory where repeat.exe exists is part of the file system directory structure on your computer and that location can be specified other than with “.”. If you run the command

$ pwd

bash will print your current working directory. The current working directory is also stored in the environment variable PWD. So, for instance, in my own case, echo might print this

$ echo $PWD
/mnt/c/Users/al75/amath583/ps3

(your working directory may be different).

Instead of using “.” (which gives a relative path), use the value of $PWD (which is an absolute path)

$ /mnt/c/Users/al75/amath583/ps3/repeat.exe

$ /mnt/c/Users/al75/amath583/ps3/repeat.exe some arguments or other

(You will need to specify the correct path for your system.)

What is printed?

Warm Up

main()

As we develop more sophisticated programs through the rest of this quarter, we will need to pass arguments to main() – and to interpret those arguments in different ways. For example, if we want to pass in the size of a problem to run, we would put a number on the command line

$ ./program.exe 1024

Here, “1024” would be passed to the main() function in this program. But note that it is the string “1024”, not the number 1024. For example, what happens if you try to compile the following program (main.cpp in your ps3 directory):

#include <iostream>

int main(int argc, char* argv[])  {

  int size = argv[1];

  std::cout << "size is " << size << std::endl;

  return  0;
}

To set the value of size to be a number, we have to convert (“parse”) the string in argv[1] into an integer and assign the resulting value to size.

Modify main.cpp so that it correctly assigns an integer value to size. You may find the functions atoi(), atol(), stol(), or similar to be useful. If you do end up using a library function to do the conversion from a character string to a number, you will also need to include the appropriate header file (the header file containing the declaration for that function) as well as invoke it with the proper namespace (e.g., std).

Note

In the rest of this assignment, we will be converting arguments from the command line from strings into the appropriate type (integer, etc). Accordingly, the code samples for the rest of this assignment don’t give away how to do that but rather have a FIXME comment.

There is another problem with main.cpp. Once you have made your modification, what happens if you don’t pass any argument at all to your program?

$ ./main.exe

We need to do a little bit of error checking. One can do arbitrarily sophisticated argument checking and error handling, but a few simple things will serve us well for this course. The simplest, which is what we will do for this problem set, is to make sure we have the right number of arguments:

#include <iostream>

int main(int argc, char* argv[])  {
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " size" << std::endl;
    return -1;
  }
  int size = argv[1];  // FIXME
  std::cout << "size is " << size << std::endl;

  return  0;
}

Until we introduce some more sophisticated approaches, you should use an approach like this for programs you write that take command-line arguments. Grading scripts for such programs will check that you handle the wrong number of command-line arguments correctly. A typical approach, as shown above, is to issue a “usage” message and exit with a non-zero exit code.

For now we will assume that the arguments themselves are well-formed (that is, you won’t be required to check that an argument that is supposed to be a number is actually a well-formed number).

Command line argument parsing is a common and important enough task in programming that numerous libraries exist to help with that process. A very basic one is the getopt function, part of the POSIX standard library. One that I have found to be useful, without too much overhead in either learning or use, is called “docopt”.

Questions

Answer the following questions in Questions.rst

  1. What does argv[0] always contain?

  2. Which entry of argv holds the first argument passed to the program?

  3. Which entry of argv holds the second argument passed to the program?

  4. How would you print just the last argument passed to a program?

Timing

As discussed in lecture – if we are going to achieve “high performance computing” we need to be able to measure performance. Performance is the ratio of how much work is done in how much time – and so we need to measure (or calculate) both work and time to quantitatively characterize performance.

To measure time, in lecture we also introduced a class (also available as Timer.hpp in the example code directory).

#include <iostream>
#include <chrono>
#include <string>

class Timer {
private:
  typedef std::chrono::time_point<std::chrono::system_clock> time_t;

public:
  Timer() = default;

  time_t start()   { return (startTime = std::chrono::system_clock::now()); }
  time_t stop()    { return (stopTime  = std::chrono::system_clock::now()); }
  double elapsed() { return std::chrono::duration_cast<std::chrono::milliseconds>(stopTime-startTime).count(); }

private:
  time_t startTime, stopTime;
};

To use this timer, you just need to include it (with the appropriate preprocessor directive) in any file where you want to use a Timer object. To start timing, invoke the start() member function; to stop the timer, invoke the stop() member function. The elapsed time (in milliseconds) between the start and stop is reported with elapsed().

To practice using the timer class, compile the program timing.cpp:

#include <iostream>
#include "Timer.hpp"

int main(int argc, char* argv[]) {
  if (argc != 2) {
    std::cout << "Usage: " << argv[0] << " N" << std::endl;
    return -1;
  }
  size_t loops = std::stol(argv[1]);
  Timer t;
  t.start();
  for (size_t i = 0; i < loops; ++i)
    ;  // empty statement: do nothing
  t.stop();
  std::cout << "elapsed: " << t.elapsed() << " ms" << std::endl;
  return 0;
}

First, to get a baseline, compile it with no optimization at all – the target is “timing0.exe” in your Makefile. How many loops do you need to run to get an elapsed time of (about) 1 second? How fast is each loop being executed? Note that the empty statement “;” in the loop body just means “do nothing.”

Second, let’s look at how much optimizing this program will help. Compile the same program as before, but this time use the -O3 option (timing3.exe). How much did your program speed up? If it executes too quickly, again, try increasing the loop count. How much time per iteration is spent now? Try enough cases to see if your answer makes sense.

If you are unsure about the answer you are getting here, start a discussion on Piazza. Try to have this discussion sooner rather than later, as you will need some of the information gained for later in this assignment.

Abstraction Penalty and Efficiency

One question that arises as we continue to optimize, e.g., matrix multiply is: how much performance is available? The performance gains we saw in class were impressive, but are we doing well in an absolute sense? To flip the question around, and perhaps make it more specific: We are using a fairly deep set of abstractions to give ourselves notational convenience. That is, rather than computing linear offsets from a pointer directly to access memory, we are invoking a member function of a class (recall that operator() is just a function). Then from that function we are invoking another function in the underlying std::vector<double> class. And there may even be more levels of indirection underneath that function – we don’t know what the vector’s operator[] function does under the hood.

Calling a function requires a number of operations to be performed by the CPU: saving return addresses on the stack, saving parameters on the stack, jumping to a new program location – and then unwinding all of that when the function call has completed. When we were analyzing matrix-matrix product in lecture, we were assuming that the inner loop just involved a small number of memory accesses and floating point operations. We didn’t consider the cost we might pay for having all of those function calls – calls we could be making at every iteration of the multiply function. If we were making those – or doing anything extraneous – we would also be measuring those when timing the multiply function. And, obviously, we would be giving up performance. The performance loss due to the use of programing abstractions is called the abstraction penalty.

One can measure the difference between achieved performance vs maximum possible performance as a ratio – as efficiency. Efficiency is simply

\[\frac{\text{Achieved performance}}{\text{Maximum performance}}\]

Measuring maximum performance

Let’s write a short program to measure maximum performance – or at least measure performance that has the multiply-add in matrix-matrix product, but without abstractions in the way.

size_t loops = argv[1]; // FIXME
double a = 3.14, b = 3.14159, c = 0.0;
Timer T;
T.start();
for (size_t i = 0; i < loops; ++i) {
  c += a * b;
}
T.stop();

To save space, only the timing portion of the program is shown. In this loop we are multiplying two doubles and adding the result to a third. And we are doing this loops times – standing in for the \(N^3\) multiply-adds we would do in a matrix-matrix multiply.

Time this loop with and without optimization (efficiency0.exe and efficiency3.exe). What happens? How can this be fixed? By which I mean we would like to have a number printed out that helps us estimate the performance of the computer for the loop in question when optimization is enabled. Again, this is a question I would like the class to discuss as a whole and arrive at a solution.

Bonus

Consider the program above, but now with double replaced by long. What happens when you time this loop with and without optimization? What happens when you apply the same fix as for the double example above? Again, discuss as a class.

size_t loops = argv[1]; // FIXME
long a = 3, b = 14, c = 0;
Timer T;
T.start();
for (size_t i = 0; i < loops; ++i) {
  c += a * b;
}
T.stop();

Exercises

Floats and Doubles

Most microprocessors today support (at least) two kinds of floating point numbers: single precision (float) and double precision (double). In the (distant) past, arithmetic operations using float were more efficient than those using double. Moreover, the amount of chip space that had to be devoted to computing with float was smaller than with double. Today these differences in computation time are much less pronounced (often non-existent); an arithmetic operation with a float takes essentially the same amount of time as one with a double.

The IEEE 754 floating point standard specifies how many bits are used for representing floats and doubles, how they are each represented, what the semantics are for various operations, and what to do when exceptions occur (divide by zero, overflow, denormalization, etc.). This standard, and the hardware that supports it (notably the original 8087 chip), is one of the most significant achievements in numerical computing. (William Kahan won the Turing Award for his work on floating point arithmetic.)

Since a double has so much better precision than a float, and since the computation costs are a wash, why would one ever use a float these days? Interestingly, 16-bit floating point numbers are starting to be available on many architectures in order to support machine learning. Again, if the computational cost is the same, why would one ever give up that much precision?

In this exercise we are going to investigate some operations with floats and doubles and see what may or may not still be true (and hypothesize why).

The program float_vs_double.cpp has two sections: one for single precision and one for double precision. The functionality of each section is the same; only the datatypes are different. Each section times two things: the time it takes to allocate and initialize three arrays, and the time it takes to multiply two of them and set the third to the result.

The section for double looks like this, for instance (with the version for float being very similar):

{
  size_t dim = argv[1]; // FIXME
  Timer t;    t.start();
  std::vector<double> x(dim, 3.14);
  std::vector<double> y(dim, 27.0);
  std::vector<double> z(dim, 0.0);
  t.stop();
  std::cout << "Construction time: " << t.elapsed() << std::end;

  t.start();
  for (size_t i = 0; i < dim; ++i)
    z[i] = x[i] * y[i];
  t.stop();
  std::cout << "Multiplication time: " << t.elapsed() << std::endl;
}

For this exercise, you are asked to generate four sets of numbers by running float_vs_double0.exe and float_vs_double3.exe: Single precision with optimization off, double precision with optimization off, single precision with optimization on and double precision with optimization on. Run the programs with a variety of problem sizes that give you times between 100ms and 1s (more or less).

Your explanations should refer to the simple machine model for single core CPUs with hierarchical memory that we developed in lecture 6.

Efficiency

For this problem we want to investigate how to properly measure a loop with just a multiply and an add in the inner loop. In particular, we want to see if that can tell us the maximum FLOP rate you can get on one core. We would like to relate that to the clock rate of the CPU on your computer and estimate how many floating point operations are being done on each clock cycle. See the paragraph marked “Measuring maximum performance” above.

Once you have a meaningful answer for the maximum floating point rate of your computer, we can use this result as the denominator in computing efficiency (fraction of peak).

Modify the program in float_vs_double.cpp so that the timing gives you a more sensible answer. (By “sensible” I mean something useful for measuring the performance of your computer running a long series of arithmetic operations.)

Note that since we are interested in maximum efficiency, we are interested in the rate possible with fully optimized code. NB: Give some thought to the numbers that you generate and make sure they make sense to you. Use double precision for your experiments.

Questions

Answer the following questions in Questions.rst.

5. What is the difference (ratio) in execution times between single and double precision for construction with no optimization? Explain.

6. What is the difference (ratio) in execution times between single and double precision for multiplication with no optimization? Explain.

7. What is the difference (ratio) in execution times between single and double precision for construction with optimization? Explain.

8. What is the difference (ratio) in execution times between single and double precision for multiplication with optimization? Explain.

9. What is the difference (ratio) in execution times for double precision multiplication with and without optimization? Explain.

Some utility functions

To run our test programs for this problem set we need to be able to compare two matrices. More specifically, we need to be able to compare the result of doing two equivalent computations – multiplying two matrices with two different algorithms. In the ideal mathematical world of real numbers, two equivalent ways of computing a given result will give exactly the same answer (that is what is meant by equivalent) – meaning, the difference between the two results is exactly zero.

Unfortunately, in the world of finite precision, we don’t necessary have that nice property. Two answers computed in two different ways should be close, but it is very difficult to ensure exact equality. (That is why, in general, you should never test even individual floating point numbers against each other for equality, much less entire matrices or vectors.)

To check whether two computations produce the same result for matrix computations, instead of looking for exact equality, we are going to look for something “small”. In particular, we will assume two non-zero matrices are “close enough” if

\[|| A - B || < \epsilon || A ||\]

where \(\epsilon\) is small (1e-12, say).

In your repository is a program test_mult.cpp which compiles to test_mult.exe (there is already a make rule for it in your Makefile). This program runs the four different versions of matrix-matrix product contained in amath583.cpp and tests to make sure they all compute the “same” answer, using an expression like:

REQUIRE(f_norm(A-B) < 1.e-12 * f_norm(A));

Notice the nice natural syntax in this expression for computing the difference between two Matrix objects.

As was discussed in lecture, this functionality is provided by overloading a function with a particular name, namely operator-(), which takes two Matrix objects (by const reference) as arguments and returns a new Matrix object

Matrix operator-(const Matrix& A, const Matrix& B);

Question for discussion on Piazza. We saw that it is a “good thing” to pass large objects by reference (or by const reference). Should the same apply to returning objects? Note that the return type of operator-() is a Matrix, not a Matrix& or a const Matrix&. Why are we not returning the result by reference?

The version of amath583.cpp that you have is incomplete. In particular, the body of operator-() is not filled in and the body of f_norm() is not filled in.

Complete the definition of f_norm() (copied from your previous assignment) and complete the definition of operator-(). You may find the definition of operator+() in that same file to be helpful.

Once your definitions are complete, the test program test_operator.exe should pass, as should the test program test_mult.exe. We are testing these functions because we will be using them later in testing other functions.

Matrix-matrix product (AMATH583 ONLY)

Your ps3 directory has a simple matrix-matrix product driver program in mmtime.cpp. You can build and run it using

$ make mmtime.exe
$ ./mmtime.exe 128

The program multiplies two matrices together and reports the time. You pass in the matrix size on the command line.

As given to you, the program calls mult_0 – found in amath583.cpp:

void mult_0(const Matrix& A, const Matrix& B, Matrix& C) {
  for (size_t i = 0; i < C.num_rows(); ++i) {
    for (size_t j = 0; j < C.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}

This is just our basic matrix-matrix product. Run mmtime.exe for a few problem sizes, say powers of two from 128 to 1024.

Replace the call to mult_0 with a call to mult_3 and repeat for the same problem sizes. You may also need to try some larger problem sizes to get useful times.

Questions

Answer the following questions in Questions.rst.

  1. If you double the problem size for matrix-matrix product, how many more operations will matrix-matrix product perform?

  2. If the time to perform a given operation is constant, when you double the problem size for matrix-matrix product, how much more time will be required to compute the answer?

  3. What ratio did you see when doubling the problem size when mmtime called mult_0? (Hint, it may be larger than what pure operation count would predict.) Explain.

  4. What ratio did you see when doubling the problem size when mmtime called mult_3? Was this the same as for mult_0? Referring to the function in amath583.cpp, what optimizations are implemented and what kinds of performance benefits might they provide?

  5. (Extra credit.) Try also with mult_1 and mult_2.

All Pairs Computations

A number of interesting problems in data analytics involve comparing each element in a data set against every other element in that same data set – i.e., involve all pairs computations. If we arrange those comparisons in an array, we can assign a given element \((i,j)\) in the array the value resulting from comparing element \(i\) with element \(j\).

Consider the case of a set of vectors, where we take the comparison between two vectors to be their inner product, defined as

\[\langle \vec{x}, \vec{y} \rangle = \sum_k x_k y_k\]

Suppose we have a set of vectors \(V = \{ v_0 , v_1 , \ldots , v_{m-1} \}\) , then the comparison (let’s refer to it as the similarity measure) between vector \(v_i\) and \(v_j\) is

\[S_{ij} = \sum_k v_{ik} v_{jk}\]

This should look somewhat familiar – it’s almost matrix-matrix product. Recall that matrix-matrix product is defined as

\[C_{ij} = \sum_k A_{ik} B_{kj}\]

If we arrange our collection of vectors into a matrix so that each row of the matrix is a vector, then we can define \(C = V V^T\) as

\[C_{ij} = \sum_k V_{ik} V^T_{kj} = \sum_k V_{ik} V_{jk}\]

We have already looked at how to optimize matrix-matrix product – and saw that we can achieve impressive levels of performance. One option to get good performance for computing a similarity matrix would be to simply invoke the matrix-matrix product we already have:

Matrix A(num_vectors, vector_size);
Matrix B = transpose(A);
Matrix S(size, size);
mult(A, B, S);

(Aside: What are the dimensions of S?)

Another option is to write a function that directly does the matrix transpose multiply.

Matrix A(num_vectors, vector_size);
Matrix C = A;
Matrix S(size, size);
mult_trans(A, C, S);  // compute S = A * C^T

But, since in our case we are multiplying A against itself, we don’t really need the copy in C, so we could instead invoke

Matrix A(num_vectors, vector_size);
Matrix S(size, size);
mult_trans(A, A, S);  // compute S = A * A^T

Finally, we could also make a specialized function that only takes A as the operand

Matrix A(num_vectors, vector_size);
Matrix S(size, size);
mult_trans_self(A, S);  // compute S = A * A^T

For this problem, we are going to write a few of the functions in question and examine the performance of these different ways of computing similarity.

The driver program for this problem is in the file mnist.cpp and the functions for carrying out the computations are declared and defined in amath583.hpp and amath583.cpp, respectively.

As delivered to you, the driver program will compile, and will report timing results, but they won’t be very meaningful as some of the functions are empty (waiting to be filled in by you). The driver program will also save an image of the similarity, as computed by the final algorithm in the driver program.

The program expects as input the name of a file containing “MNIST” images, which can be downloaded here (and which you are welcome to use and modify).


This file contains 60k images of handwritten digits, each image being 28 by 28 pixels in size.

You can build and run the program as follows

$ make mnist.exe
$ ./mnist.exe train-images.idx3-ubyte 2048

The program output looks something like the following (you may need to widen your terminal screen when you run the program so that everything lines up nicely):

#images mult_0(A,B) mult_1(A,B) mult_2(A,B) mult_3(A,B)  m_t_0(A,C)  m_t_1(A,C)  m_t_2(A,C)  m_t_3(A,C)  m_t_0(A,A)  m_t_1(A,A)  m_t_2(A,A)  m_t_3(A,A)    m_t_4(A)
    128     1.28451     1.16773     4.28169     6.42253     1.51118     12.8451     25.6901     25.6901     1.51118     12.8451     25.6901     25.6901     25.6901
    256          -1          -1     3.80594     4.67093          -1     8.56337     20.5521     12.8451          -1     9.34186     20.5521     14.6801     14.6801

Each row represents the Gflop/s measured for the given number of inputs and each column represents a different algorithm. The first four (mult_0 etc) are increasingly optimized versions of matrix-matrix product, and we are invoking them by passing in a matrix that is the transpose of A. The next four algorithms are increasingly optimized versions of the explicit matrix-transpose product algorithm, where we are passing in A and another matrix that is a copy of A. The next four are the same as the previous, but we are just passing in A for both arguments. Finally, the last column is the most optimized version of the matrix-transpose product, specialized to multiply A against itself.

Note: The unoptimized versions of the code can be very slow, so the program skips the larger problem sizes for an algorithm once it detects that the algorithm is taking too long. In that case a “-1” is printed for that entry.

Transpose

The first function that you need to write is for actually computing the matrix transpose. The function skeleton for this is in amath583.cpp. The function takes as an argument the matrix A. It needs to allocate an appropriately sized matrix B, fill it with the transpose of A, and then return B. The program test_transpose.cpp will exercise a few test cases.
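The operation itself is just a doubly nested copy loop. Below is a minimal sketch of one way to write it; the small `Matrix` class here is a hypothetical stand-in with the same interface (`num_rows()`, `num_cols()`, `operator()`) as the one in Matrix.hpp, which you should use instead in your actual solution.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the course's Matrix class (same interface).
class Matrix {
public:
  Matrix(size_t M, size_t N) : num_rows_(M), num_cols_(N), storage_(M * N) {}
  double&       operator()(size_t i, size_t j)       { return storage_[i * num_cols_ + j]; }
  const double& operator()(size_t i, size_t j) const { return storage_[i * num_cols_ + j]; }
  size_t num_rows() const { return num_rows_; }
  size_t num_cols() const { return num_cols_; }
private:
  size_t num_rows_, num_cols_;
  std::vector<double> storage_;
};

// transpose(): allocate B with the dimensions of A swapped, then copy
// element (i, j) of A into element (j, i) of B.
Matrix transpose(const Matrix& A) {
  Matrix B(A.num_cols(), A.num_rows());
  for (size_t i = 0; i < A.num_rows(); ++i) {
    for (size_t j = 0; j < A.num_cols(); ++j) {
      B(j, i) = A(i, j);
    }
  }
  return B;
}
```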

Matrix-transpose product

Complete the implementations of mult_trans_0 and mult_trans_1.

Complete the implementations of mult_trans_2 and mult_trans_3.

The program test_mult_transpose.cpp will test your implementations. You will need a working transpose function for the test cases in this file to pass. Each of the algorithms is intended to correspond to the non-transpose variant also contained in amath583.cpp, and those implementations should be a useful reference once you have the basic idea of the algorithm down from implementing mult_trans_0.
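As a starting point, the naive matrix-transpose product is a triply nested loop in which the innermost index runs along a row of each operand. The sketch below assumes (consistent with the sample driver output, though you should check the declarations in amath583.hpp) that `mult_trans(A, B)` accumulates C = A·Bᵀ into a pre-zeroed C; the compact `Matrix` class is again a hypothetical stand-in for the one in Matrix.hpp.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the course's Matrix class (same interface).
class Matrix {
public:
  Matrix(size_t M, size_t N) : num_rows_(M), num_cols_(N), storage_(M * N) {}
  double&       operator()(size_t i, size_t j)       { return storage_[i * num_cols_ + j]; }
  const double& operator()(size_t i, size_t j) const { return storage_[i * num_cols_ + j]; }
  size_t num_rows() const { return num_rows_; }
  size_t num_cols() const { return num_cols_; }
private:
  size_t num_rows_, num_cols_;
  std::vector<double> storage_;
};

// Naive matrix-transpose product: C += A * B^T.  Note that the innermost
// loop reads A(i, k) and B(j, k) -- both sweep along a row, which is
// contiguous in row-major storage, unlike the plain product's B(k, j).
void mult_trans_0(const Matrix& A, const Matrix& B, Matrix& C) {
  for (size_t i = 0; i < C.num_rows(); ++i) {
    for (size_t j = 0; j < C.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(j, k);
      }
    }
  }
}
```

The row-contiguous access pattern in the inner loop is one reason the transpose-product variants can outperform the plain product on this problem.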

(Extra Credit) Complete the implementation of mult_trans_4.

All-Pairs

Once you have your algorithms completed (or as many as you are able), run the mnist.exe driver program with a maximum image count of, say, 2048. For your viewing pleasure, the program saves an image of the similarity matrix at each problem size as mnist_similarity_NNN.bmp, where NNN is the problem size.

Questions
  1. What do you observe about the different approaches to doing the similarity computation? Which algorithms (optimizations) are most effective? Does it pay to make a transpose of A, vs a copy of A, vs just passing in A itself? What about passing in A twice vs passing it in once (mult_trans_3 vs mult_trans_4)?

  2. What is the best performance over all the algorithms that you observed for the case of 1024 images? What would the execution time be for 10 times as many images? For 60 times as many images? (Hint: the answer is not cubic, but still potentially a problem.) What if we wanted to use, say, 56 by 56 images instead of 28 by 28?

The last question is a leading one. Though modern hardware can provide astonishing levels of performance, we can easily overwhelm a single CPU by just trying to solve a bigger problem. And, often, the resource requirements grow much more rapidly than the problem size.

There are various things we can do to address this issue. First, we can try to be more clever and use better algorithms (see the extra credit problem next). Second, we can implement our algorithms in such a way that we can use more hardware resources (and/or more specialized hardware) to solve the same problem. We will be looking at being more clever for the next two assignments, and then we will take up parallelization and GPUs.

Extra Credit

There is an optimization that you can make to the last algorithm that can significantly cut the running time. Implement that optimization as a new function mult_trans_5 and report its performance (you will need to add a call to it to the benchmark driver). (Hint: This optimization involves doing less work – not doing the same work faster.)

Turning in the Exercises

Do a make clean in your ps3 directory and then create a tar ball ps3-submit.tar.gz containing all of the files in that directory. From the directory containing ps3:

$ cd ps3
$ make clean
$ cd ..
$ tar zcvf ps3-submit.tar.gz ./ps3

If you relied on any external references, list them in a file ref3.txt and include that in your tarball as well (in the ps3 subdirectory).