Warm Up
main()
As we develop more sophisticated programs through the rest of this
quarter, we will need to pass arguments to main()
– and to interpret
those arguments in different ways. For example, if we want to pass in
the size of a problem to run, we would put a number on the command line
Here, “1024” would be passed to the main()
function in this
program. But note that it is the string “1024”, not the number 1024.
For example, what happens if you try to compile the following program
(main.cpp in your ps3 directory):
1#include <iostream>
2
3int main(int argc, char* argv[]) {
4
5 int size = argv[1];
6
7 std::cout << "size is " << size << std::endl;
8
9 return 0;
10}
To set the value of size
to be a number, we have to convert
(“parse”) the string in argv[1]
into an integer and assign the
resulting value to size
.
Modify main.cpp
so that it correctly assigns an integer value to
size
. You may find the functions atoi()
or atol
or stol
or similar to be useful. If you do end up using a library function to do
the conversion from a character string to a number, you will also need
to include the appropriate header file (the header file containing the declaration for that function) as well as invoke it with the proper namespace (e.g., std
).
Note
In the rest of this assignment, we will be converting arguments from the command line from strings into the appropriate type (integer, etc). Accordingly, the code samples for the rest of this assignment don’t give away how to do that but rather have a FIXME
comment.
There is another problem with main.cpp
. Once you have made your
modification, what happens if you don’t pass any argument at all to your
program?
We need to do a little bit of error checking. One can do arbitrarily sophisticated
argument checking and error handling, but we can do a few simple things that
will serve uw well for this course. The simplest, which is what we will do
for this problem set, is to simply make sure we have the right number of
arguments:
1#include <iostream>
2
3int main(int argc, char* argv[]) {
4 if (argc != 2) {
5 std::cerr << "Usage: " << argv[0] << " size" << std::endl;
6 return -1;
7 }
8 int size = argv[1]; // FIXME
9 std::cout << "size is " << size << std::endl;
10
11 return 0;
12}
Until we introduce some more sophisticated approaches, you should use an
approach like this for programs you write that take command-line
arguments. Grading scripts for such programs will check that you handle
the wrong number of command-line arguments correctly. A typical approach, as shown above, is to issue a “usage” message and exit with a non-zero exit code.
For now we will assume
that the arguments themselves are well-formed (that is, you won’t be required to check that an argument that is supposed to be a number is actually a well-formed number).
Command line argument parsing is a common and important enough task in programming
that numerous libraries exist to help that process. A very basic one is part of
the C standard library – the getopt
function. One that I have found to be useful
but without too much overhead in either learning or in use is called “docopt”.
Questions
Answer the following questions in Questions.rst
What does argv[0]
always contain?
Which entry of argv
holds the first argument passed to the program?
Which entry of argv
holds the second argument passed to the program?
How would you print just the last argument passed to a program?
Timing
As discussed in lecture – if we are going to achieve “high performance
computing” we need to be able to measure performance. Performance is the
ratio of how much work is done in how much time – and so we need to
measure (or calculate) both work and time to quantitatively characterize performance.
To measure time, in lecture we also introduced a class (also available
as Timer.hpp
in the example code directory).
1#include <iostream>
2#include <chrono>
3#include <string>
4
5class Timer {
6private:
7 typedef std::chrono::time_point<std::chrono::system_clock> time_t;
8
9public:
10 Timer() = default;
11
12 time_t start() { return (startTime = std::chrono::system_clock::now()); }
13 time_t stop() { return (stopTime = std::chrono::system_clock::now()); }
14 double elapsed() { return std::chrono::duration_cast<std::chrono::milliseconds>(stopTime-startTime).count(); }
15
16private:
17 time_t startTime, stopTime;
18};
To use this timer, you just need to include it (with the
appropriate preprocessor directive)
in any file where you want to use a Timer
object.
To start timing invoke the start()
member function, to stop the
timer invoke the stop()
member function. The elapsed time (in
milliseconds) between the start and stop is reported with elapsed()
..
To practice using the timer class, compile the program timing.cpp:
1#include <iostream>
2#include "Timer.hpp"
3
4int main(int argc, char* argv[]) {
5 if (argc != 2) {
6 std::cout << "Usage: " << argv[0] << " N" << std::endl;
7 }
8 size_t loops = std::stol(argv[1]);
9
10 Timer T;
11 T.start();
12 for (size_t i = 0; i < loops; ++i)
13 ;
14 T.stop();
15
16 std::cout << loops << " loops took " << T.elapsed() << " milliseconds" << std::endl;
17
18 return 0;
19}
First, to get a baseline, compile it with no optimization at all using the -O0
option – the
target is “timing0.exe” in your Makefile. How many loops do you need
to run to get an elapsed time of (about) 1 second? How fast is each
loop being executed?
Note that the empty statement “;” in the loop body just means “do
nothing.”
Second, let’s look at how much optimizing this program will help.
Compile the same program as before, but this time use the -O3
option
(timing3.exe). How much did your program speed up? If it executes too
quickly, again, try increasing the loop count. How much time per
iteration is spent now? How about with -Ofast
option?
Try enough cases to see if your answer makes sense.
If you are unsure about the answer you are getting here, start a
discussion on Piazza. Try to have this discussion sooner rather than
later, as you will need some of the information gained for later in this
assignment.
Abstraction Penalty and Efficiency
One question that arises as we continue to optimize, e.g., matrix
multiply is: how much performance is available? The performance gains we
saw in class were impressive, but are we doing well in an absolute
sense? To flip the question around, and perhaps make it more specific:
We are using a fairly deep set of abstractions to give ourselves
notational convenience. That is, rather than computing linear offsets
from a pointer directly to access memory, we are invoking a member
function of a class (recall that operator()
)
is just a function. Then from that function
we are invoking another function in the
underlying std::vector<double>
class. And there may even be
more levels of indirection underneath that function – we don’t know what
the vector
’s operator[]
function does under the hood.
Calling a function
requires
a number of operations to be performed by the CPU:
saving return addresses on the stack,
saving parameters on the stack, jumping to a new program location – and
then unwinding all of that when the function call has completed. When we
were analyzing matrix-matrix product in lecture, we were assuming that
the inner loop just involved a small number of memory accesses and
floating point operations. We didn’t consider the cost we might pay for
having all of those function calls – calls we could be making at every
iteration of the multiply function. If we were making those – or doing
anything extraneous – we would also be measuring those when timing the
multiply function. And, obviously, we would be giving up performance.
The performance loss due to the use of programing abstractions is called
the abstraction penalty.
One can measure the difference between achieved performance vs maximum
possible performance as a ratio – as efficiency. Efficiency is simply
\[\frac{\text{Achieved performance}}{\text{Maximum performance}}\]
Bonus
Consider the program above, but now with double
replaced by long
.
What happens when you time this loop with and without optimization?
What happens when you apply the same fix as for the double example above?
Again, discuss as a class.
size_t loops = argv[1]; // FIXME
long a = 3, b = 14, c = 0;
Timer T;
T.start();
for (size_t i = 0; i < loops; ++i) {
c += a * b;
}
T.stop();
Exercises
Floats and Doubles
Most microprocessors today support (at least) two kinds of floating
point numbers: single precision (float
) and double precision
(double
). In the (distant) past, arithmetic operations using
float
was more efficient than with double
. Moreover, the amount
of chip space that would have to be devoted to computing with float
was smaller than with double
. Today these differences in computation
time are less pronounced (non-existent); an arithmetic operation with a
float
takes the same amount of time as an arithmetic operation with
a double
.
The IEEE 754 floating point standard specifies how many bits are used
for representing float
s and double
s, how the are each
represented, what the semantics are for various operations, and what to
do when exceptions occur (divide by zero, overflow, denormalization,
etc.). This standard, and the hardware that supports it (notably the
original 8087 chip) is one of the most significant achievements in
numerical computing. (William Kahan won the Turing award for his work on
floating point numbers.)
Since a double
has so much better precision than a float
, and
since the computation costs are a wash, why would one ever use a
float
these days? Interestingly, 16-bit floating point numbers are
starting to be available on many architectures in order to support
machine learning. Again, if the computational cost is the same, why
would one ever give up that much precision?
In this exercise we are going to investigate some operations with and
see what may or may not still be true (and hypothesize why).
The program float_vs_double.cpp
has two sections one for single
precision and one for double precision. The functionality of each
section is the same, only the datatypes are different. Each section
times two things: the time it takes to allocate and initialize three
arrays and the time it takes to multiply two of them and set the third
to the result.
The section for double looks like this, for instance (with the versions
for float being very similar:
{
size_t dim = argv[1]; // FIXME
Timer t; t.start();
std::vector<double> x(dim, 3.14);
std::vector<double> y(dim, 27.0);
std::vector<double> z(dim, 0.0);
t.stop();
std::cout << "Construction time: " << t.elapsed() << std::end;
t.start();
for (size_t i = 0; i < dim; ++i)
z[i] = x[i] * y[i];
t.stop();
std::cout << "Multiplication time: " << t.elapsed() << std::endl;
}
For this exercise, you are asked to
generate four sets of numbers by running
float_vs_double0.exe
and float_vs_double3.exe
: Single precision
with optimization off, double precision with optimization off, single
precision with optimization on and double precision with optimization
on. Run the programs with a variety of problem sizes that give you times
between 100ms and 1s (more or less).
Your explanations should refer to the simple machine model for single
core CPUs with hierarchical memory that we developed in lecture 6.
Efficiency
For this problem we want to investigate how to properly measure a loop
with just a multiply and an add in the inner loop. In particular, we
want to see if that can tell us the maximum FLOPS count you can get on
one core. We would like to relate that to the clock rate of the CPU on
your computer and estimate how many floating point operations being done
on each clock cycle. See the paragraph marked “Measuring maximum
performance” above.
Once you have a meaningful answer for the maximum floating point rate of
your computer, we can use this result as the denominator in computing
efficiency (fraction of peak).
Modify the program in float_vs_double.cpp
so that the timing gives you a
more sensible answer. (By “sensible” I mean something useful for
measuring the performance of your computer running a long series of
arithmetic operations.)
Note that since we are interested in maximum efficiency, we are
interested in the rate possible with fully optimized code. NB: Give
some thought to the numbers that you generate and make sure they make
sense to you. Use double precision for your experiments.
Questions
Answer the following questions in Questions.rst.
1. What is the difference (ratio) in execution times
between single and double precision for construction with no optimization? Explain.
6. What is the difference (ratio) in execution times
between single and double precision for multiplication with no optimization? Explain.
7. What is the difference (ratio) in execution times
between single and double precision for construction with optimization? Explain.
8. What is the difference (ratio) in execution times
between single and double precision for multiplication with optimization? Explain.
9. What is the difference (ratio) in execution times
for double precision multiplication with and without optimization? Explain.
Some utility functions
To run our test programs for this problem set we need to be able to compare two matrices. More specifically, we need to be able to compare the result of doing two equivalent computations – multiplying two matrices with two different algorithms. In the ideal mathematical world of real numbers, two equivalent ways of computing a given result will give exactly the same answer (that is what is meant by equivalent) – meaning, the difference between the two results is exactly zero.
Unfortunately, in the world of finite precision, we don’t necessary have that nice property. Two answers computed in two different ways should be close, but it is very difficult to ensure exact equality. (That is why, in general, you should never test even individual floating point numbers against each other for equality, much less entire matrices or vectors.)
To check whether two computations produce the same result for matrix computations, instead of looking for exact equality, we are going to look for something “small”. In particular, we will assume two non-zero matrices are “close enough” if
\[|| A - B || < \epsilon || A ||\]
Where \(\epsilon\) is small (1e-12, say).
In your repository is a program test_mult.cpp
which compiles to test_mult.exe
(there is already a make rule for it in your Makefile). This program runs the four different versions of matrix-matrix product contained in amath583.cpp
and tests to make sure they all compute the “same” answer, using an expression like:
REQUIRE(f_norm(A-B) < 1.e-12 * f_norm(A));
Notice the nice natural syntax in this expression for computing the difference between two Matrix
objects.
As was discussed in lecture, this functionality is provided by overloading a function with a particular name, namely operator()
, which takes two Matrix objects (by const reference) as arguments and returns a new Matrix object.
Matrix operator+(const Matrix& A, const Matrix& B);
Matrix operator-(const Matrix& A, const Matrix& B);
Question for discussion on Piazza. We saw that it is a “good thing” to pass large objects by reference (or by const reference), since we don’t pay the overhead of copying. Should the same apply to returning objects? Note that the return type of operator-()
is a Matrix
, not a Matrix&
or a const Matrix&
. Why are we not returning Matrix C
by reference?
The version of amath583.cpp
that you have is incomplete. In particular, the body of operator-()
, operator+()
are not filled in and the body of f_norm()
is not filled in.
Complete the definitions of f_norm()
(copied from your previous assignment), complete the definition of operator+()
, and complete the definition of operator-()
. If you did’t implement f_norm()
in previous assignment, you can turn help from your classmates on Piazza. But remember to cite your source.
Once your definitions are complete, the test program test_operator.exe
should pass, as should the test program test_mult.exe
. We are testing these functions because we will be using them later in testing other functions.
Matrix-matrix product (AMATH583 ONLY)
Your ps3 directory has a simple matrix-matrix product driver program in mmtime.cpp. You can build and run it using
$ make mmtime.exe
$ ./mmtime.exe 128
The program multiplies two matrices together and reports the time. You pass in the matrix size on the command line.
As given to you, the program calls mult_0
– found in amath583.cpp
:
void mult_0(const Matrix& A, const Matrix& B, Matrix& C) {
for (size_t i = 0; i < C.num_rows(); ++i) {
for (size_t j = 0; j < C.num_cols(); ++j) {
for (size_t k = 0; k < A.num_cols(); ++k) {
C(i, j) += A(i, k) * B(k, j);
}
}
}
}
This is just our basic matrix-matrix product. Run mmtime.exe for a few problem sizes, say powers of two from 128 to 1024.
Replace the call to mult_0
with a call to mult_3
and repeat for the same problem sizes. You may also need to try some larger problem sizes to get useful times.
Questions
Answer the following questions in Questions.rst.
If you double the problem size for matrix-matrix product, how many more operations will matrix-matrix product perform?
If the time to perform a given operation is constant, when you double the problem size for matrix-matrix product, how much more time will be required to compute the answer?
What ratio did you see when doubling the problem size when mmtime called mult_0
? (Hint, it may be larger than what pure operation count would predict.) Explain.
What ratio did you see when doubling the problem size when mmtime called mult_3
? Was this the same for mult_0
? Referring to the function in amath583.cpp, what optimizations are implemented and what kinds of performance benefits might they provide?
(Extra credit.) Try also with mult_1
and mult_2
.
All Pairs Computations
A number of interesting problems in data analytics involve comparing each element in a data set against every other element in that same data set – i.e., involve
all pairs computations. If we arrange those comparisons in an array, we can assign a given element \((i,j)\) in the array the value resulting from comparing element \(i\) with element \(j\).
Consider the case of a set of vectors, where we take the comparison to be the inner product between the two vectors, where inner-product is defined as
\[\langle \vec{x}, \vec{y} \rangle = \sum_k x_k y_k\]
Suppose we have a set of vectors \(V = \{ v_0 , v_1 , \ldots , v_{m-1} \}\) , then the comparison (let’s refer to it as the similarity measure) between vector \(v_i\) and \(v_j\) is
\[S_{ij} = \sum_k v_{ik} v_{jk}\]
This should look somewhat familiar – it’s almost matrix-matrix product. Recall that matrix-matrix product is defined as
\[C_{ij} = \sum_k A_{ik} B_{kj}\]
If we arrange our collection of vectors into a matrix so that each row of the matrix is a vector, then we can define \(C = V V^T\) as
\[C_{ij} = \sum_k V_{ik} V^T_{kj} = \sum_k V_{ik} V_{jk}\]
We have already looked at how to optimize matrix-matrix product – and saw that we can achieve impressive levels of performance. One option to get good performance for computing a
similarity matrix would be to simply invoke the matrix-matrix product we already have:
Matrix A(num_vectors, vector_size);
Matrix B = transpose(A);
Matrix S(size, size);
mult(A, B, S);
(Aside: What are the dimensions of S?)
Another option is to write a function that directly does the matrix transpose multiply.
Matrix A(num_vectors, vector_size);
Matrix C = A;
Matrix S(size, size);
mult_trans(A, C, S); // compute S = A * C^T
But, since in our case we are multiplyin A against itself, we don’t really need the copy in C, so we could instead invoke
Matrix A(num_vectors, vector_size);
Matrix S(size, size);
mult_trans(A, A, S); // compute S = A * A^T
Finally, we could also make a specialized function that only takes A as the operand
Matrix A(num_vectors, vector_size);
Matrix S(size, size);
mult_trans_self(A, S); // compute S = A * A^T
For this problem, we are going to write a few of the functions in question and examine the performance of these different ways of computing similarity.
The driver program for this problem is in the file mnist.cpp
and the functions for carrying out the computations are declared and defined in
amath583.hpp
and amath583.cpp
, respectively.
As delivered to you, the driver program will compile, and will report timing results, but they won’t be very meaningful as some of the functions are empty (waiting to be filled in by you). The driver program will also save an image of the similarity, as computed by the final algorithm in the driver program.
The program expects as input the name of a file containing “MNIST” images.
Download it here
, which you are welcome
to use (and modify). This file contains 60k images of handwritten digits, each image being 28 by 28 pixels in size.
You can build and run the program as follows
$ make mnist.exe
$ ./mnist.exe train-images.idx3-ubyte 2048
The program output looks something like the following (you may need to widen your terminal screen when you run the program so that everything lines up nicely):
#images mult_0(A,B) mult_1(A,B) mult_2(A,B) mult_3(A,B) m_t_0(A,C) m_t_1(A,C) m_t_2(A,C) m_t_3(A,C) m_t_0(A,A) m_t_1(A,A) m_t_2(A,A) m_t_3(A,A) m_t_4(A)
128 1.28451 1.16773 4.28169 6.42253 1.51118 12.8451 25.6901 25.6901 1.51118 12.8451 25.6901 25.6901 25.6901
256 -1 -1 3.80594 4.67093 -1 8.56337 20.5521 12.8451 -1 9.34186 20.5521 14.6801 14.6801
Each row represents the Gflop/s measured for the given number of inputs and each column represents a different algorithm. The first four (mult_0
etc) are increasingly optimized versions of matrix-matrix product, and we are invoking them by passing in a matrix that is the transpose of A. The next four algorithms are increasingly optimized versions of the explicit matrix-transpose product algorithm, where we are passing in A and another matrix that is a copy of A. The next four are the same as the previous, but we are just passing in A for both arguments. Finally, the last column is the most optimized version of the matrix-transpose product, specialized to multiply A against itself.
Note: This program can get really slow for the unoptimized versions of the code so the program will decide not to run certain problem sizes if it senses itself getting too slow for a particular algorithm. In that case a “-1” is printed.
Transpose
The first function that you need to write is for actually computing the matrix transpose. The function skeleton for this is in amath583.cpp. The function takes as an argument the matrix A. It needs to allocate an appropriately sized matrix B, fill it with the transpose of A, and then return B. The program test_transpose.cpp
will exercise a few test cases.
Matrix-transpose product
Complete the implementations of mult_trans_0
and mult_trans_1
.
Complete the implementations of mult_trans_2
and mult_trans_3
.
The program test_mult_transpose.cpp will test your implementations. You will need to have your transpose function working for the test cases in this file to work properly. Each of the algorithms is intended to correspond to the non-transpose variant also contained in amath583.cpp
and those implementations should be a useful reference once you have the basic idea of the algorithm down from implementing mult_trans_0
.
(Extra Credit)
Complete the implementations of mult_trans_4
.
All-Pairs
Once you have your algorithms completed (or as many as you are able), run the mnist.exe driver program up to a maximum number of images of 2048 (say). For your viewing pleasure, the program saves an image of the similarity matrix at each problem size as mnist_similarity_NNN.bmp
where NNN
is the problem size.
Questions
What do you observe about the different approaches to doing the similarity computation? Which algorithm (optimizations) are most effective? Does it pay to make a transpose of A vs a copy of A vs just passing in A itself. What about passing in A twice vs passing it in once (mult_trans_3 vs mult_trans_4)?
What is the best performance over all the algorithms that you observed for the case of 1024 images? What would the execution time be for 10 times as many images? For 60 times as many images? (Hint: the answer is not cubic but still potentially a problem.) What if we wanted to do, say 56 by 56 images instead of 28 by 28?
The last question is a leading one. Though modern hardware can provide astonishing levels of performance, we can easily overwhelm a single CPU by just trying to solve a bigger problem. And, often, the resource requirements grow much more rapidly than the problem size.
There are various things we can do to address this issue. First, we can try to be more clever and use better algorithms (see extra credit problem next). And, we can implement our algorithms in such a way that we can use more hardware resources (and/or more specialized hardware) to solve the same problem. We will be looking at being more clever for the next two assignments and then we will take up using parallelization and using GPUs.
Turning in the Exercises
Submit your files to Gradescope. To log in to Gradescope, point your browser to gradescope.com and click the login button. On the form that pops up select “School Credentials” and then “University of Washington netid”. You will be asked to authenticate with the UW netid login page. Once you have authenticated you will be brought to a page with your courses. Select amath583sp22.
There you will find two assignments for 483 and 583 respectively: “ps3 - coding” and “ps3 - written”. For the coding part, you only need to submit the files that you edited:
The autograder should provide you a feedback within 10 minutes after the submission. Also, please keep in mind that this is the first time that we use Gradescope for this class, so you might encounter bugs in the autograder. Please contact TAs if you see “Autograder failed to execute correctly”. Do not attach the code – they will have access to your submission on Gradescope.
For the written part, you will need to submit your Questions.rst as a PDF file. You can use any modern text processor (Word, Pages, TextEdit, etc), or console tools like pandoc or rst2pdf, to convert it to a PDF file. Please make sure to match the relevant pages of your pdf with questions on Gradescope.
In this assignment, questions 10 – 14 are for AMATH583 only, if you are a 583 student, please submit your answers for those questions to the separate assignment “ps3 – written (for 583 only, Qs 10 – 14)” on Gradescope. Note that you could have your work in a single pdf file which you could submit for both written assignments, however, you would need to make sure that correct pages are selected for each question.
If you relied on any external resources, include the references to your document as well.