Assigned: Apr 21, 2022
Due: Apr 28, 2022
Lecture 8
First few minutes of Lecture 9 (Strassen’s algorithm)
As with the previous assignment, the files for this assignment are extracted into a directory named ps4.
Create a new subdirectory named “ps4” to hold your files for this assignment. The files for this assignment are in a “tar ball” that can be downloaded as ps4.tar.gz.
To extract the files for this assignment, download the tar ball and copy (or move) it to your ps4 subdirectory. From a shell program, change your working directory to the ps4 subdirectory and execute the command
$ tar zxvf ps4.tar.gz
The files that should be extracted from the tape archive file are
Makefile:
Matrix.hpp: Declarations for Matrix class
Questions.rst: File containing questions to be answered for this assignment
Timer.hpp: Simple timer class
Vector.hpp: Declarations for Vector class
amath583.cpp: Definitions (implementations) of Matrix and Vector functions
amath583.hpp: Declarations of Matrix and Vector functions
bandwidth.cpp: Simple program to (attempt to) measure CPU bandwidth
blur.hpp: Definitions of blurring kernels
catch.hpp: Catch2 testing framework header
cpuinfo583.cpp: Simple program to read CPU flags
kernels.hpp: Implementations of kernels with different numerical intensity
mmult.cpp: Driver program for benchmarking matrix-matrix product with six different loop orderings
mmult_ps3.cpp: Driver program for benchmarking matrix-matrix products from PS3
mmult_ps3_test.cpp: Test program for verifying matrix-matrix products from PS3
mmult_test.cpp: Test program for verifying matrix-matrix product with six different loop orderings
mmult_trans.cpp: Driver program for benchmarking matrix-matrix transpose product with six different loop orderings
mmult_trans_test.cpp: Test program for verifying matrix-matrix product transpose with six different loop orderings
roofline.cpp: Simple program to (attempt to) measure CPU roofline parameters
soa_image.hpp, aos_image.hpp, tensor_image.hpp: Definitions for different color image format structures
soa_test.hpp, aos_test.hpp, tensor_test.hpp: Test programs for blurring kernels
verify-all.bash: Meta bash script for running other verifiers
verify-make.bash: Bash script that attempts to make all matrix programs for PS4
verify-opts.bash: Bash script that sets options for other scripts
verify-run.bash: Bash script that attempts to make and run all matrix timing programs for PS4
verify-test.bash: Bash script that attempts to make and run all matrix test programs for PS4
Although we can use macros in a Makefile to help make the content of our rules consistent, if you read through the Makefiles in the previous assignment, you will quickly notice that there is still a huge amount of repetition in each Makefile.
Note
Don’t Repeat Yourself (DRY)
Although we haven’t introduced this important principle (and witty TLA) explicitly yet in this course, you should always keep it in mind, especially for software development. It’s also not bad advice for life.
You will find that the software you develop evolves and grows organically over time. Whenever you repeat yourself, you suddenly have two (or more) sequences of code that you need to keep synchronized. As tempting as it is (and I still fall prey to this temptation) to cut and paste, you should always try to take the time to factor your code into non-repetitive pieces, parameterizing the aspects that do need to change from one part to the next.
We have already touched on this topic in developing procedural abstractions (functions) in C++. We will come back to it.
One place where there is a lot of repetition is in the rule for building a .o file from a .cpp file. If you look at the Makefile from the previous assignment, you will see a sequence of rules that look like the following (this is just a few of the rules from that file)
hello.exe : hello.cpp
$(CXX) $(CXXFLAGS) hello.cpp -o hello.exe
goodbye.exe : goodbye.cpp
$(CXX) $(CXXFLAGS) goodbye.cpp -o goodbye.exe
mmtime.o : mmtime.cpp Matrix.hpp amath583.hpp
$(CXX) $(CXXFLAGS) -c mmtime.cpp -o mmtime.o
mnist.o : mnist.cpp Matrix.hpp amath583.hpp
$(CXX) $(CXXFLAGS) -c mnist.cpp -o mnist.o
mmtime.exe : mmtime.o amath583.o
$(CXX) $(CXXFLAGS) mmtime.o amath583.o -o mmtime.exe
mnist.exe : mnist.o amath583.o
$(CXX) $(CXXFLAGS) mnist.o amath583.o -o mnist.exe
But we can immediately see three patterns: a rule to create “something.exe” from “something.cpp”, a rule to create “something.o” from “something.cpp”, and a rule to create “something.exe” from “something.o” (plus other .o files).
If you want to add a new program to this project, you can write a new set of rules for it – but they will be exactly the same as a previous program, so you may end up cutting and pasting to save time and effort. In either case, there is lots of repetition – and if you make a change to the rule itself, you have to change it everywhere.
Let’s take a closer look at one of the patterns – the pattern for making a .o from a .cpp. All of those rules look like this:
<something>.o: <something>.cpp
$(CXX) $(CXXFLAGS) -c <something>.cpp -o <something>.o
The dependency is that something.o depends on something.cpp (and maybe other files; we’ll come back to that later). If the dependency needs to be satisfied, the compilation command (already parameterized with CXX and CXXFLAGS) is specified on the following line.
It would be really nice if we could create just one rule and apply it to anything matching a specified pattern.
Fortunately, to automate pattern-based production rules, make has some “magic” macros that pattern match for you – namely $< and $@. The macro $< represents the file that is the dependency (the .cpp file) and $@ represents the target (the .o file).
A generic compilation rule would therefore be
<something>.o : <something>.cpp
$(CXX) -c $(CXXFLAGS) $< -o $@
That is, given a dependency rule that has a source file and a dependent target file (which is what all dependency rules have), this generic rule will substitute $@ with the target file and $< with the dependency.
That is, where we previously had the rules
mmtime.o : mmtime.cpp Matrix.hpp amath583.hpp
$(CXX) $(CXXFLAGS) -c mmtime.cpp -o mmtime.o
mnist.o : mnist.cpp Matrix.hpp amath583.hpp
$(CXX) $(CXXFLAGS) -c mnist.cpp -o mnist.o
We can now say
mmtime.o : mmtime.cpp Matrix.hpp amath583.hpp
$(CXX) -c $(CXXFLAGS) $< -o $@
mnist.o : mnist.cpp Matrix.hpp amath583.hpp
$(CXX) -c $(CXXFLAGS) $< -o $@
(And similarly for all the other rules for producing a .o).
That’s a good first step, but there are a few more things we need to clean up.
First, there is an issue with the $< macro: namely, which dependency is the one that we will compile to make the target? Most .o files have a large number of header files that they depend on. Above, for example, mnist.o depends not just on mnist.cpp but also on Matrix.hpp and amath583.hpp. If a recompile gets triggered because one of those changes, we want to compile the .cpp file that includes the .hpp file, not the .hpp file itself. To disambiguate this situation, if we give a rule a list of dependencies, make will pick the first file as the one to pass to $<, so as long as we put the .cpp file first in the list, we will be okay.
Another approach, and the one we will be using for reasons that will become clear, is to note that dependencies are accumulated. You don’t need to specify all of them on one line. For instance, the following are three equivalent ways of expressing the same dependencies:
# First way
norm.o: norm.cpp Vector.hpp amath583.hpp
$(CXX) -c $(CXXFLAGS) $< -o $@
# Second way
norm.o: Vector.hpp amath583.hpp
norm.o: norm.cpp
$(CXX) -c $(CXXFLAGS) $< -o $@
# Third way
norm.o: norm.cpp
$(CXX) -c $(CXXFLAGS) $< -o $@
norm.o: Vector.hpp
norm.o: amath583.hpp
Note that in the second and third ways the production rule for the .o is more obvious. The line that carries the production command lists just the one dependence (the .cpp file), so even though the production rule fires whenever any of the accumulated dependencies is out of date, and since there is just the one binding, it is more obvious what will get bound to $<.
There is one last thing to clean up here. We still have a repetitive pattern – all we did was substitute the magic macros in. Every production rule still has:
something.o : something.cpp
$(CXX) -c $(CXXFLAGS) $< -o $@
We haven’t really saved any lines of Makefile, or eliminated any actual repetition – we just made one of the repetitions (the production rule) exactly the same everywhere.
But the fact that the rule is exactly the same everywhere cries out even more for automation.
In fact, there is one last piece of pattern matching that make has that will let us roll all of those rules up into just one: implicit rules. We express an implicit rule like this:
%.o : %.cpp
$(CXX) -c $(CXXFLAGS) $< -o $@
This is basically the rule with “something” above, where the % stands in for “something”. We only need to write this rule once and it covers all cases where something.o depends on something.cpp. We still have the other dependencies (on headers) to account for – but this can also be automated (more on that in the next assignment).
At any rate, the Makefile we started with now looks like this:
CXX = c++
CXXFLAGS = -Wall -g -std=c++11
# Implicit rule for building .o from .cpp
%.o : %.cpp
$(CXX) -c $(CXXFLAGS) $< -o $@
# Dependencies
amath583.o: amath583.hpp
amath583.o: Vector.hpp
norm.o: amath583.hpp
norm.o: Vector.hpp
This will handle all cases where something.o depends on something.cpp. Using these magic macros and implicit rules greatly simplifies what we have to put in a Makefile, and it also allows us to specify dependencies separately from production rules. (Note also that the dependency of, say, amath583.o on amath583.cpp is covered by the implicit rule, so we only need to explicitly specify the other dependencies, the header files.)
NB: We will still supply Makefiles for most of your assignments, but we will gradually take some of the scaffolding away in future problem sets, and you may have to add rules for your own files. You can add those rules however you like; you do not have to use any of the advanced features of make in your assignments as long as your programs build correctly for the test script(s). However, these features were created to make the lives of programmers easier, so you will probably find it useful to familiarize yourself with them so that you can take advantage of them.
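For instance, if you wanted to add a program of your own under this scheme (the name newprog below is purely a hypothetical example, not one of the provided files), the implicit rule already covers the compile step; you would only need to add the extra header dependencies and a link rule:
# Hypothetical example: a new program newprog.exe built from newprog.cpp
# and amath583.o.  The implicit %.o : %.cpp rule handles building newprog.o.
# (Remember that the command line of a rule must begin with a tab.)
newprog.o: Matrix.hpp amath583.hpp
newprog.exe: newprog.o amath583.o
$(CXX) $(CXXFLAGS) newprog.o amath583.o -o newprog.exe
(GNU make also has the automatic variable $^, which expands to all of a rule’s prerequisites, so the link command could be shortened to $(CXX) $(CXXFLAGS) $^ -o $@; but, as noted above, none of this is required for your submissions.)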
As discussed in lecture, the SIMD/vector instruction sets for Intel architectures have been evolving over time. Although modern compilers can generate machine code for any of these architectures, the chip that a program is running on must support the instructions generated by the compiler. If the CPU fetches an instruction that it cannot execute, it will throw an illegal instruction error.
The machine instruction cpuid can be used to query the microprocessor about its capabilities (cf. CPUID). How to issue these queries and how to interpret the results is explained (e.g.) in the Wikipedia entry for cpuid. A small program for querying your machine’s CPU in this way is provided in “cpuinfo583.cpp”. The program interacts directly with the CPU using assembly language instructions (in particular the cpuid instruction).
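For a sense of what such a query looks like, here is a minimal sketch (not the provided cpuinfo583.cpp) that uses the <cpuid.h> header shipped with recent GCC and Clang on x86; the feature bits checked are the standard ones, but the program itself is only an illustration.
#include <cpuid.h>    // __get_cpuid, __get_cpuid_count (GCC/Clang, x86 only)
#include <iostream>

int main() {
  unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;

  if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {            // leaf 1: basic features
    std::cout << "SSE     " << ((edx >> 25) & 1) << "\n";  // EDX bit 25
    std::cout << "SSE2    " << ((edx >> 26) & 1) << "\n";  // EDX bit 26
    std::cout << "AVX     " << ((ecx >> 28) & 1) << "\n";  // ECX bit 28
  }
  if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {   // leaf 7, subleaf 0
    std::cout << "AVX2    " << ((ebx >> 5) & 1) << "\n";   // EBX bit 5
    std::cout << "AVX512F " << ((ebx >> 16) & 1) << "\n";  // EBX bit 16
  }
  return 0;
}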
Compile and run cpuinfo583.exe (there is a rule for it in your Makefile).
$ make cpuinfo583.exe
$ ./cpuinfo583.exe
Note
Unfortunately, cpuinfo583.cpp does not work on Macs with the Apple M1 chip, since the M1 uses a completely different architecture and instruction set. You will have to try it on another machine that does not use an Apple M1 chip, or team up with a classmate whose laptop does not use one. Remember to credit your classmate.
You will get a listing that shows a selection of available capabilities on your own machine. Run this command and check the output. The particular macros to look for are anything with “SSE”, “AVX”, “AVX2”, or “AVX512”. These support 128-, 256-, 256-, and 512-bit operands, respectively.
What level of SIMD/vector support does the CPU in your computer provide?
What is the maximum operand size that your computer will support?
What is the minimum operand size that your computer will support?
What is the clock speed of your CPU? You may need to look this up via “About this Mac” on macOS. If your MacBook uses the Apple M1 chip, try “sudo powermetrics” in the terminal and look for “CPU frequency”. On Windows, look at the “Performance” tab in Task Manager; on Linux, try “lscpu” in the terminal.
In lecture 8 we presented the roofline model, which provides an upper bound on computational performance, taking into account both the potential floating point performance and the ability of the system to supply enough operands to the CPU to do its computations (bandwidth). A roofline model can be plotted on a graph of performance (in GFlops/sec) as a function of “numerical intensity” (flops/byte). The limitations on computational performance (in GFlops/sec) are horizontal lines on the graph, while bandwidth (bytes/sec) is a sloped line. For example (measured on my laptop):
As we have also discussed in lecture, your computer doesn’t have a single bandwidth. The bandwidth between L1 cache and the CPU is significantly faster than the bandwidth between main memory and the CPU. The performance of your program therefore depends also on the problem size as well as the numerical intensity.
For this part of the assignment I want you to continue developing an intuition for what affects (and effects) performance on your computer (or any computer). There isn’t a “right answer” for this part. Every computer system is different – and, as we have seen, trying to suss out different aspects of your computer’s performance is complicated – made even more so by compilers (which are subtle and quick to anger).
What I want you to do for this part is to create a roofline model for your computer – again, there isn’t a right answer.
For the roofline model you will need to characterize a few things. First is the peak available performance for your computer (for now we will just be looking at peak performance available with a single core). The second is the bandwidth – for L1, L2 (if present), L3 (if present), L4 (if present), and main memory (DRAM).
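For reference, these two constraints are usually summarized in the roofline form: the attainable performance \(P\) at numerical intensity \(I\) (flops/byte) is bounded by \(P(I) \le \min\left(P_{\text{peak}},\; \beta \cdot I\right)\), where \(P_{\text{peak}}\) is the peak compute rate (GFlops/sec, the horizontal line) and \(\beta\) is the relevant bandwidth (bytes/sec, the sloped line). There is one sloped line for each level of the memory hierarchy.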
Below are some experiments to do to help determine these parameters for your computer. For each of bandwidth and for roofline there are two approaches: “homegrown” and “docker”. The former uses a C++ program that you build and run on your computer. The second builds and runs a more sophisticated set of experiments within a docker container. Which approach you want to use is up to you (and you can do both).
To set up a homegrown development environment, you can follow the material in Set Up Your Development Environment (Without Docker) in the “excursus” section of the course web site.
Your ps4 directory contains a program “bandwidth.cpp” which you can compile to “bandwidth.exe”. This program runs some simple loops to try to measure bandwidth for reading, writing, and copying (reading and writing). The functions that carry out the loops are parameterized by the size of the data to work on as well as by the number of trials to run within the timing loop:
// Reading
for (size_t j = 0; j < ntrials; ++j) {
for (size_t i = 0; i < A.size(); ++i) {
a = A[i];
}
}
// Writing
for (size_t j = 0; j < ntrials; ++j) {
for (size_t i = 0; i < A.size(); ++i) {
A[i] = a;
}
}
// Copying
for (size_t j = 0; j < ntrials; ++j) {
for (size_t i = 0; i < A.size(); ++i) {
B[i] = A[i];
}
}
The main program will in turn invoke the timing loops for problem sizes ranging from 128 bytes up to 128MB (you can specify a different upper limit on the command line).
Build this program and run it with no arguments (or with something larger than 128M)
$ make bandwidth.exe
$ ./bandwidth.exe
$ ./bandwidth.exe 268435456
Multiple columns of data will scroll by. The first set are for reading, the second for writing, the third for copying. The main number of interest is in the final column – bytes/second – aka your bandwidth.
Note
Compilers again
One of the techniques for preventing the compiler from doing dead-code elimination is to return a value computed in the loop. If you look at the routines for doing the loops, I use one of those techniques – I return a value computed in the loop. However, at least on my laptop, compiler, and OS, the read loop still gets optimized out (the for loop is simply ignored) – even though the data type is double and even though we are not using the returned value in the main function.
Discuss: What is the compiler doing this time? How might it be fixed?
(I don’t actually know the answer to the second question – at least not at the level of C++ source code. I have one idea but it’s not ideal, so I look forward to the discussion!)
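Purely as a starting point for that discussion (and not necessarily the fix the note above has in mind), one widely used, compiler-specific trick is an empty inline-asm statement that forces the compiler to treat a value as used; benchmarking libraries such as Google Benchmark use this idea (DoNotOptimize). A sketch, assuming GCC or Clang:
#include <cstddef>
#include <vector>

// Tell the optimizer that `value` is observed, without generating any code.
// GCC/Clang-specific inline asm; illustration only.
template <class T>
inline void do_not_optimize(T const& value) {
  asm volatile("" : : "r,m"(value) : "memory");
}

double read_loop(const std::vector<double>& A, size_t ntrials) {
  double a = 0.0;
  for (size_t j = 0; j < ntrials; ++j) {
    for (size_t i = 0; i < A.size(); ++i) {
      a = A[i];
      do_not_optimize(a);   // the load can no longer be treated as dead
    }
  }
  return a;
}
Note that putting such a barrier inside the timed loop can itself perturb what is being measured, which is part of what makes the question above interesting.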
In the numbers in the last column, you should see some plateaus, representing bandwidth from L1 cache (first plateau), L2 cache (second plateau), and so on. You may also see that the demarcation between successive plateaus becomes less sharp.
For example, a few selected lines from running this program on my laptop look like this:
bytes/elt #elts res_bytes ntrials usecs ttl_bytes bytes/sec
8 256 4096 2097156 45000 8589950976 1.90888e+11
8 512 8192 1048580 41000 8589967360 2.09511e+11
8 1024 16384 524292 38000 8590000128 2.26053e+11
8 2048 32768 131074 20000 4295032832 2.14752e+11
8 4096 65536 65538 69000 4295098368 6.22478e+10
8 8192 131072 32770 69000 4295229440 6.22497e+10
There is a major fall-off between 32KB and 64KB – indicating that the L1 cache is (probably) 32KB. It also looks like the L1 cache bandwidth is (probably) about 226GB/s.
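As a sanity check on how the last column is computed (taking the column headers at face value), the reported bytes/sec is just the total bytes moved divided by the elapsed time. For the first row above, \(8589950976 \text{ bytes} / (45000 \times 10^{-6} \text{ s}) \approx 1.909 \times 10^{11}\) bytes/sec, i.e. roughly 191 GB/s, which matches the printed value.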
Based on the output from bandwidth.exe on your computer, what do you expect L1 cache and L2 cache (if present) sizes to be? What are the corresponding bandwidths? How do the cache sizes compare to what “about this mac” (or equivalent) tells you about your CPU? (There is no “right” answer for this question – but I do want you to do the experiment.)
Use docker to pull the image amath583/bandwidth (it has been put on Docker Hub). When run, this image will execute a profiling program and put its results into a subdirectory called “BW”. So, before running this image, you will need to first create a subdirectory named BW in the directory that you have mapped to in docker. That is:
$ cd <your working directory>
$ mkdir BW
$ docker pull amath583/bandwidth
$ docker run -v <your working directory>:/home/amath583/work amath583/bandwidth
Here <your working directory> stands for the location on your host machine where you have been doing your work. The bandwidth program may run for quite some time – 10-20 minutes – during which time your computer should have as little load on it as possible (other than the bandwidth program itself).
When the program completes, there will be two files in the BW directory – bandwidth.bmp and bandwidth.csv. The file bandwidth.bmp is an image of a graph showing data transfer rates for various types of operations and various amounts of data. The file bandwidth.csv contains the raw data that is plotted in bandwidth.bmp, as comma separated values. You can use that data to do your own investigations about your computer.
An example of bandwidth.bmp from my laptop is the following (your results will most likely vary):
Note
Something has gone wrong if you get a warning from docker. For example, if I run the docker image amath583/bandwidth, which was built for the linux/amd64 architecture, on my Mac mini with an M1 chip (linux/arm64/v8 architecture), I get the following warning from docker:
WARNING: The requested image's platform (linux/amd64) does not match the detected
host platform (linux/arm64/v8) and no specific platform was requested.
What’s the performance penalty when you see this warning after running a docker image that was built for one architecture on a different architecture? (Hint: why are we using the -march=native compilation option in our Makefile? If we compile our code on one architecture with -march=native, but run the executable on two different architectures, do we get the same performance?) If you are unsure about the answer, start a discussion on Piazza.
In looking at the bandwidth graph you can see one or more places where the transfer rate drops suddenly. For small amounts of data, the data can fit in cache closer to the CPU – and hence data transfer rates will be higher. The data transfer rates will fall off fairly quickly once that data no longer completely fits into cache.
Same questions as for the “homegrown” approach:
Based on the output from running this image on your computer, what do you expect L1 cache and L2 cache (if present) sizes to be? What are the corresponding bandwidths? How do the cache sizes compare to what “about this mac” (or equivalent) tells you about your CPU? (There is no “right” answer for this question – but I do want you to do the experiment.)
(You only need to do one of homegrown or docker.)
Note
The bandwidth program attempts to determine what AVX features your CPU has and then attempts to use only those for doing its benchmarking.
In some cases, we have found that it doesn’t quite get this right, and will attempt to issue instructions that the host CPU does not in fact support.
In that case, your program may exit with an “Illegal Instruction” message. If that happens, run the amath583/bandwidth.noavx image instead. That image may underestimate the maximum available bandwidth, but you will still be able to see the difference between performance for different levels of cache.
In your ps4 directory is a program “roofline.cpp” which you can compile to “roofline.exe” (there should be a make rule for it in your Makefile). This program does something similar to “bandwidth.cpp”, but rather than varying the problem size, it varies the arithmetic intensity for a fixed problem size (as specified by the command-line argument).
On my laptop when I run the program (e.g.) I see something like this:
$ ./roofline.exe 128
kernel sz res_bytes ntrials usecs ttl_bytes ttl_flops intensity flops/sec bytes/sec
2 128 33554433 114000 8589934848 1073741856 0.125 9.41879e+09 7.53503e+10
4 128 16777217 80000 4294967552 1073741888 0.25 1.34218e+10 5.36871e+10
8 128 8388609 63000 2147483904 1073741952 0.5 1.70435e+10 3.4087e+10
16 128 4194305 55000 1073742080 1073742080 1 1.95226e+10 1.95226e+10
32 128 2097153 52000 536871168 1073742336 2 2.06489e+10 1.03244e+10
64 128 1048577 54000 268435712 1073742848 4 1.98841e+10 4.97103e+09
Compile and run this program, choosing a few problem sizes in the middle of each cache size (about half the cache size) and perhaps also try +- a factor of two. As with the bandwidth program above, you should see some plateaus for performance at different problem sizes.
At a problem size around half the L1 cache, you should see fairly large numbers for floating point performance at the middle values of arithmetic intensity. The largest number that you see for performance (at any level of arithmetic intensity and any problem size) is the horizontal line for the roofline characterization of your computer. The sloped lines that are also part of the model can also be gleaned from this program, but may be more obvious from the bandwidth program.
What is the (potential) maximum compute performance of your computer? (The horizontal line.) What are the L1, L2 (if present), and DRAM bandwidths? How do those bandwidths correspond to what was measured above with the bandwidth program?
Based on the clock speed of your CPU and its maximum GFlop rate, what is the (potential) maximum number of double precision floating point operations that can be done per clock cycle? (Hint: GFlops/sec \(\div\) GHz = flops/cycle.) There are several hardware capabilities that can contribute to supporting more than one operation per cycle: fused multiply-add (FMA) and wide (SSE/AVX) registers. Assuming FMA contributes a factor of two, SSE contributes a factor of two, AVX/AVX2 contribute a factor of four, and AVX-512 contributes a factor of eight, what is the expected maximum number of floating point operations your CPU could perform per cycle, based on the capabilities your CPU advertises via cpuinfo (equiv. lscpu)? Would your answer change for single precision (would any of the previous assumptions change)?
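As a purely illustrative example of the arithmetic (these numbers are made up, not measurements): a CPU with a 2.5 GHz clock and a measured peak of 20 GFlops/sec is doing \(20 \div 2.5 = 8\) flops per cycle, which would be consistent with, say, AVX/AVX2 registers holding four doubles combined with FMA contributing a factor of two (\(4 \times 2 = 8\)).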
Not required if you use the homegrown approach, but I nevertheless encourage you to develop an approximate roofline model for your computer and plot it (similar to the plot above and shown in lecture).
A docker image that will profile your computer and create a roofline plot has been put on Docker Hub. One image is called amath583/ert, which will create a roofline plot based on a single CPU core of your machine. When running, this image will put its results into a subdirectory called “ERT”. So, before running this image, first create a subdirectory named ERT in the directory you are mapping to docker.
To run the amath583/ert image:
$ cd <your working directory>
$ mkdir ERT
$ docker pull amath583/ert
$ docker run -v <your working directory>:/home/amath583/work amath583/ert
Here <your working directory> stands for the location on your host that will be mapped to a directory inside of the container. This is not an interactive image, so you don’t need the “-it” option.
The roofline characterization will run for a few minutes, but probably not as long as the bandwidth characterization.
Once the image finishes, you will find results in a subdirectory named Sequential, under which there will be one or more subdirectories with names like Run.001. Directly in the Run.001 subdirectory will be a file roofline.pdf, which contains a graph of the roofline model for your computer.
NB: If you do multiple runs of the roofline profiler, delete the subdirectory Run.001 each time so that the profiler has a fresh directory to write to and there won’t be old or incomplete results to confuse you or the profiler.
What is the maximum compute performance of your computer? (The horizontal line.) What are the L1, L2 (if present), and DRAM bandwidths? How do those bandwidths correspond to what was measured above?
In our last exciting episode we saw that it could be extremely beneficial to compute \(C=A\times B^T\) by directly implementing a mult_trans operation rather than explicitly forming the transpose of \(B\) and calling mult.
In this exercise we are going to explore a few more details about optimization.
Your ps4 directory contains a benchmark driver mmult_ps3.cpp that exercises the matrix multiply and matrix multiply transpose operations we wrote in the last assignment. It loops over a set of matrix sizes (8 to 256) and measures the performance of each function at that size. For each size it prints a row of performance numbers in GFlops/sec – each column is the performance for one of the functions.
The functions mult_0 through mult_3 were provided for you last time and are included again. The functions mult_trans_0 through mult_trans_3 are empty – copy over the versions that you wrote last time. If you are not happy with the performance you got and would like to use an “approved” solution instead, contact one of the TAs or ask a classmate. It is okay to copy those functions for this assignment only.
(Why am I not just providing those, you may ask? All of ps1 through the final may ultimately be available at the same time. For those who may want to reuse this material (including my future self), I don’t want solutions for a given assignment to simply be available from a future assignment.)
Compile and run mmult_ps3.exe (the default size used with no command line argument should be fine). There is a make rule that invokes the compiler with optimizations enabled so you should just be able to
$ make mmult_ps3.exe
$ ./mmult_ps3.exe
If your computer and compiler are more or less like mine, you should see something printed that looks like this:
N mult_0 mult_1 mult_2 mult_3 mul_t_0 mul_t_1 mul_t_2 mul_t_3
8 3.00795 2.90768 6.23075 5.81536 3.6346 3.35502 7.93004 7.26921
16 2.35758 2.4572 6.23075 6.12144 2.50123 6.58343 11.0769 11.8279
32 1.98706 2.01804 6.61387 6.56595 2.29975 12.0813 15.8965 17.425
64 1.70172 1.94222 6.08737 6.62057 2.05673 13.9541 20.156 17.7847
128 1.57708 1.72746 6.05582 6.35951 1.79506 14.7662 21.136 16.457
256 1.28681 1.46371 4.87796 5.92137 1.62958 11.073 19.4263 15.7063
Huh. Although we applied all of the same optimizations to the mult and mult_trans functions, the mult_trans functions are significantly faster – even for the very simplest cases. That is fine when, as in the last assignment, you need to compute \(C = A\times B^T\). But we also need to be able to compute \(C = A\times B\), which is a much more common operation, and we would like to do it just as fast as \(C = A\times B^T\).
Save the output from a run of mmult_ps3.exe into a file mmult_ps3.log.
The mathematical formula for the matrix-matrix product is
\(C_{ij} = \sum_{k} A_{ik} B_{kj},\)
which we realized using our Matrix class as three nested loops over i, j, and k (in that nesting order). This expression says “for each \(i\) and \(j\), sum each of the \(A_{ik} B_{kj}\) products.”
In the function(s) we have been presenting, we have been using the “ijk” ordering, where i is the loop variable for the outer loop, j is the loop variable for the middle loop, and k is the loop variable for the inner loop:
void mult(const Matrix& A, const Matrix& B, Matrix& C) {
for (size_t i = 0; i < C.num_rows(); ++i) {
for (size_t j = 0; j < C.num_cols(); ++j) {
for (size_t k = 0; k < A.num_cols(); ++k) {
C(i, j) += A(i, k) * B(k, j);
}
}
}
}
But the summation is valid regardless of the order in which we add the k-indexed terms to form each \(C_{ij}\). Computationally, if we look at the inner term C(i,j) = C(i,j) + A(i,k)*B(k,j), there is no dependence of i on j or k (nor of any of the loop indices on any of the other loop indices); we can do the inner summation in any order. That is, we can nest the i, j, k loops in any order, giving us six variants of the matrix-matrix product algorithm. For instance, an “ikj” ordering would have i in the outer loop, k in the middle loop, and j in the inner loop.
void mult_ikj(const Matrix& A, const Matrix& B, Matrix& C) {
for (size_t i = 0; i < C.num_rows(); ++i) {
for (size_t k = 0; k < A.num_cols(); ++k) {
for (size_t j = 0; j < C.num_cols(); ++j) {
C(i, j) += A(i, k) * B(k, j);
}
}
}
}
The program mmult.cpp in your ps4 directory is a benchmark driver that calls six different matrix multiply functions: mult_ijk, mult_ikj, mult_jik, mult_jki, mult_kij, and mult_kji. The corresponding implementations are included in your amath583.cpp file. Right now those implementations are completely unoptimized – they are each just three loops (in the indicated order, with C(i,j) += A(i,k) * B(k,j) in the inner loop). The program runs each of these six functions over a set of problem sizes (N by N matrices) from 8 to 256, in powers of two. The maximum problem size can also be passed in on the command line as an argument.
The program prints the GFlops/sec that it measures for each of the six functions, for each of the sizes it runs.
Compile and run mmult.exe (the default arguments should be fine). There is a make rule that invokes the compiler with optimizations enabled so you should just be able to
$ make mmult.exe
$ ./mmult.exe
Look carefully at the performance numbers that are printed. You should see two of the columns with not so good performance, two columns with okay performance, and two columns with the best performance. The columns within each pair have something in common – what is it?
Depending on your CPU and your compiler, some of the differences will be more or less pronounced, but they should be visible as the problem size increases.
Note
Here is a discussion question for you on Piazza. Explain what you are seeing in terms of data accesses at each inner loop iteration.
Save the output from a run of mmult.exe to a file with “_0” in its name to indicate it is a run before any optimizations have been introduced.
I claim that certain loop orderings give good performance because the memory access pattern is favorable. Is that consistent with what you are seeing in these (unoptimized) functions?
To guide your discussion, consider the following diagram of matrices A, B, and C.
Here, each box is a matrix entry. Recall that for row-ordered matrices (such as we are using in this course), the elements in each row are stored next to each other in memory.
Now, let’s think about what happens when we are executing matrix multiply with the “ijk” ordering. From one iteration of the inner loop to the next (from one step of the algorithm to the next), what are the next data elements to be accessed? For “ijk”, here is what happens next:
What happens in the case of each of the six orderings for mult – what are the next elements to be accessed in each case? Does this help explain the similarities between certain orderings – as well as why there are good, bad, and ugly (er, medium) pairs? Refer again to how data are stored for three matrices A, B, and C.
Referring to the figures about how data are stored in memory, what is it about the best performing pair of loops that is so advantageous?
If we look at the output from mmult_ps3.exe, the biggest jump in performance comes between mult_trans_0 and mult_trans_1. The difference between the code in the two functions is quite small though – we just “hoisted” the reference to C(i,j) out of the inner loop.
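For reference, here is a sketch of what that kind of hoisting looks like for mult_trans in the “ijk” ordering (illustrative only; the provided mult_trans_1 may differ in its details):
void mult_trans_hoisted(const Matrix& A, const Matrix& B, Matrix& C) {
  for (size_t i = 0; i < C.num_rows(); ++i) {
    for (size_t j = 0; j < C.num_cols(); ++j) {
      double t = C(i, j);                // hoist: read C(i,j) once, before the k loop
      for (size_t k = 0; k < A.num_cols(); ++k) {
        t += A(i, k) * B(j, k);          // B is indexed as B(j,k) since C = A * B^T
      }
      C(i, j) = t;                       // write C(i,j) once, after the k loop
    }
  }
}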
What will the data access pattern be when we are executing mult_trans in i,j,k order? What data are accessed in each of the matrices at step (i,j,k), and what data are accessed at step (i,j,k+1)? Are these accesses advantageous in any way?
Referring again to how data are stored in memory, explain why hoisting C(i,j) out of the inner loop is so beneficial in mult_trans with the “ijk” loop ordering.
Modify mult_ijk so that it achieves performance on par with mult_trans_1 (or mult_trans_2 or mult_trans_3). (Hint: Be lazy: mult_trans_1 et al. are already written. Don’t repeat yourself. Don’t even copy and paste.)
Compile and run mmult.exe with your modified version of mult_ijk. Save the output to a file mult_1.log.
Note
Sometimes when you are developing code, you might want to keep older code around to refer to, or to go back to if an exploration you are making doesn’t pan out. One common way of saving things like this is to use the preprocessor and “#if 0” the code you don’t want to be compiled:
#if 0
Stuff I want to save but not compile
#endif
You may (but are not required to) “#if 0” the initial version of mult_ijk.
Modify mult_3 so that it achieves performance on par with mult_trans_1 (or mult_trans_2 or mult_trans_3). (This may not be quite as straightforward as for mult_ijk.)
Compile and run mmult_ps3.exe with your modified version of mult_3 with problem sizes up to 1024 (say). Save the output to a file mult_2.log.
What optimization is applied in going from mult_2 to mult_3?
How does your maximum achieved performance for mult (any version) compare to what bandwidth and roofline predicted? Show your analysis.
To store the “color” of a particular pixel, a color image stores the values of the three additive primary colors (red, green, and blue) necessary to reconstruct that color. A color image is thus a three-dimensional structure: 3 color planes, each of which is the same height by width as the image.
A question arises then about what is the best way to store / access that data in memory? We could, for instance, store each color plane separately, so that the image is a “structure of arrays”:
Or, we could store the three color values of each pixel together, so that the image is an “array of structures”:
Or, we could store just a single large array, treating the image as a “tensor” and using a lookup formula to determine where the data are stored.
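For example, with 3 color planes each of size \(M \times N\), a plane-major layout stores pixel \((i, j)\) of plane \(k\) at offset \(k \cdot M \cdot N + i \cdot N + j\) in the single array; this is exactly the formula used by the Tensor_Image accessor shown below.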
The struct of arrays data structure could be implemented like this:
class SOA_Image {
public:
SOA_Image(size_t M, size_t N)
: nrows_(M), ncols_(N), storage_{{std::vector<double>(M * N), std::vector<double>(M * N), std::vector<double>(M * N)}} {}
size_t num_rows() const { return nrows_; }
size_t num_cols() const { return ncols_; }
const double& operator()(size_t i, size_t j, size_t k) const { return storage_[k][i * ncols_ + j]; }
double& operator()(size_t i, size_t j, size_t k) { return storage_[k][i * ncols_ + j]; }
private:
size_t nrows_;
size_t ncols_;
std::array<std::vector<double>, 3> storage_;
};
Again, the terminology “struct of arrays” is generic, referring to how data are organized in memory. Our implementation (in storage_) of the “struct of arrays” is as an array of vector. Note also that the accessor operator() accesses pixels with (i, j, k), where i and j are pixel coordinates and k is the color plane.
Similarly, the array of structs data structure could be implemented this way:
class AOS_Image {
public:
AOS_Image(size_t M, size_t N) : nrows_(M), ncols_(N), storage_(M * N) {}
const double& operator()(size_t i, size_t j, size_t k) const { return storage_[i * ncols_ + j][k]; }
double& operator()(size_t i, size_t j, size_t k) { return storage_[i * ncols_ + j][k]; }
size_t num_rows() const { return nrows_; }
size_t num_cols() const { return ncols_; }
private:
size_t nrows_;
size_t ncols_;
std::vector<std::array<double, 3>> storage_;
};
Again, the terminology “array of structs” is generic. Our implementation (in storage_) of the “array of structs” is as a vector of array. Note also that the accessor operator() accesses pixels with (i, j, k), where i and j are pixel coordinates and k is the color plane.
Finally, a “tensor” based storage could be implemented as:
class Tensor_Image {
public:
Tensor_Image(size_t M, size_t N) : nrows_(M), ncols_(N), storage_(3 * M * N) {}
const double& operator()(size_t i, size_t j, size_t k) const { return storage_[k * nrows_ * ncols_ + i * ncols_ + j]; }
double& operator()(size_t i, size_t j, size_t k) { return storage_[k * nrows_ * ncols_ + i * ncols_ + j]; }
size_t num_rows() const { return nrows_; }
size_t num_cols() const { return ncols_; }
private:
size_t nrows_;
size_t ncols_;
std::vector<double> storage_;
};
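Since all three classes expose the same (i, j, k) accessor, code written against one of them works unchanged with the others. For example (the sizes and values here are arbitrary):
// The three image types are interchangeable from the caller's point of view.
SOA_Image    soa(480, 640);
AOS_Image    aos(480, 640);
Tensor_Image ten(480, 640);

soa(10, 20, 0) = 1.0;    // red component of pixel (10, 20)
aos(10, 20, 1) = 0.5;    // green component
ten(10, 20, 2) = 0.25;   // blue component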
Although blurring an image sounds like something you would not want to do, small amounts of blur can be useful in image processing for, say, removing high frequency noise from an image. One simple blurring function is a box blur, which computes an output pixel based on the average of nine neighboring input pixels.
So, for instance, we could blur an image with the following:
// Iterate through the image
for (size_t i = 1; i < img.num_rows() - 1; ++i) {
for (size_t j = 1; j < img.num_cols() - 1; ++j) {
// Iterate through the color planes
for (size_t k = 0; k < 3; ++k) {
// Apply blurring kernel
blurred(i, j, k) = (img(i + 0, j + 0, k) +
img(i + 0, j + 1, k) +
img(i + 0, j - 1, k) +
img(i + 1, j + 0, k) +
img(i + 1, j + 1, k) +
img(i + 1, j - 1, k) +
img(i - 1, j + 0, k) +
img(i - 1, j + 1, k) +
img(i - 1, j - 1, k)) / 9.0;
}
}
}
Note that since we have defined the accessor generically, the inner blurring kernel will be the same regardless of which representation we use for the image. We do have a choice, however, in how we access the data: do we iterate over the color planes in the outer loop or in the inner loop?
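For concreteness, here is a sketch of the other nesting, with the color planes on the outer loop (the provided blurring functions may differ in naming and details; the kernel in the body is unchanged):
// Iterate through the color planes in the outer loop
for (size_t k = 0; k < 3; ++k) {
  // Iterate through the image
  for (size_t i = 1; i < img.num_rows() - 1; ++i) {
    for (size_t j = 1; j < img.num_cols() - 1; ++j) {
      // Apply the same blurring kernel
      blurred(i, j, k) = (img(i + 0, j + 0, k) + img(i + 0, j + 1, k) +
                          img(i + 0, j - 1, k) + img(i + 1, j + 0, k) +
                          img(i + 1, j + 1, k) + img(i + 1, j - 1, k) +
                          img(i - 1, j + 0, k) + img(i - 1, j + 1, k) +
                          img(i - 1, j - 1, k)) / 9.0;
    }
  }
}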
In your source code for this assignment is the file soa_vs_aos.cpp. From it you can make soa_vs_aos.exe. When you run that executable on an image, it will apply six different blurring operations: one for each of the three different formats (aos, soa, tensor), with the image planes iterated on the outer loop and on the inner loop. Read the source code to familiarize yourself with which blurring kernel does which. Note, as mentioned above, that the inner loop of each function is the same, and that the functions doing the color plane on the outer loop are the same as each other (similarly for the inner loop). (If it seems like we are violating DRY, we are; we will see how to fix this in a few more assignments.)
Your source code has one Julia image: julia.bmp. You can generate your own Julia images based on your previous assignment. Run the program on them. The program will print out execution times for the six implementation variants of the blurring function and will save the blurred output for each variant. (If you don’t want to see these, or spend the time computing them, you can comment out the lines in the code that save the output images.)
Which variant of blurring function between struct of arrays and array of structs gives the better performance? Explain what about the data layout and access pattern would result in better performance.
Which variant of the blurring function has the best performance overall? Explain, taking into account not only the data layout and access pattern but also the accessor function.
Please use the content of Questions.rst and the relevant parts of the performance logs to form a report in PDF, and submit it to Gradescope. For the written part, you will need to submit your Questions.rst as a PDF file. You can use any modern text processor (Word, Pages, TextEdit, etc.), or console tools like pandoc or rst2pdf, to convert it to a PDF file. Please make sure to match the relevant pages of your PDF with the questions on Gradescope. To log in to Gradescope, point your browser to gradescope.com and click the login button. On the form that pops up, select “School Credentials” and then “University of Washington netid”. You will be asked to authenticate with the UW netid login page. Once you have authenticated you will be brought to a page with your courses. Select amath583sp22.
We don’t auto-grade the coding part this time, only your written report.
Please make sure to include references to any external sources you used.