Assigned: June 3, 2020
Due: June 16, 2020
This assignment is a continuation of Problem Set #7 (Part I)
As with the previous assignment, this assignment is extracted from a tar file, located at this link.
Note
You should extract this tar file into the same subdirectory where you extracted ps7a.
The new files and subdirectories included are:
ps7/
├── Questions.rst
├── cu_axpy/
│   ├── Makefile
│   ├── cu_axpy_0.cu
│   ├── cu_axpy_1.cu
│   ├── cu_axpy_2.cu
│   ├── cu_axpy_3.cu
│   ├── cu_axpy_t.cu
│   ├── nvprof.bash
│   ├── omp_axpy.cpp
│   ├── plot.py
│   ├── script.bash
│   └── seq_axpy.cpp
├── include/
│   └── Make_cu.inc
├── norm_cuda/
│   ├── Makefile
│   ├── norm_parfor.cpp
│   ├── norm_seq.cpp
│   └── norm_thrust.cu
└── reduction/
    ├── Makefile
    ├── NsightEclipse.xml
    ├── nvprof.bash
    ├── readme.txt
    ├── reduction.cpp
    ├── reduction.h
    └── reduction_kernel.cu
The new subdirectories will sit alongside the existing ones from ps7a. One new file will be copied into an existing subdirectory: Make_cu.inc will be created in include.
After successful extraction, the complete list of subdirectories should look like this:
$ ls
axpy_cuda include norm pagerank README.rst test
hello_omp matvec norm_cuda Questions.rst reduction warmup
The existing Questions.rst file (from ps7a) will be overwritten (it was actually mistakenly copied from ps6). The correct Questions.rst file is in ps7b.tar.gz.
The problems for this assignment will be done on an AWS GPU-enabled “cluster”. You can connect to the GPU cluster in the same way as the CPU instance from part I.
Note
Cluster access Update
A small cluster with GPUs is now available. The head node is at IP 18.236.206.176. A sample ~/.ssh/config entry (substitute your own username and identity file):
Host ps7
HostName 18.236.206.176
User al75
IdentityFile /Users/lums658/.ssh/al75_id_rsa
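With an entry like this in place (and your own key set up), connecting should be as simple as:
$ ssh ps7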
Unless your laptop has an Nvidia graphics card AND you have CUDA installed and configured correctly, it won't be possible to do this assignment on your laptop.
Once you are logged on to the cluster head node, you can query the GPU using the command nvidia-smi. The head node and the compute node(s) are both GPU enabled. To find out information about the head node, you can run nvidia-smi locally:
$ nvidia-smi
Note that the GPU is actually on the compute node(s), so you will need to invoke nvidia-smi using srun:
$ srun nvidia-smi
On either machine, nvidia-smi should print out information similar to the following:
% nvidia-smi
Fri May 31 07:14:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 58C P0 73W / 149W | 0MiB / 11441MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This indicates that the GPU is of the Kepler family (a Tesla K80), that it has about 11 GiB of memory, that version 418.67 of the drivers is installed, and that version 10.1 of CUDA is installed.
The Nvidia CUDA toolkit contains a large number of substantive examples. With the latest version of the toolkit, everything related to CUDA is installed in /usr/local/cuda; the samples can be found in /usr/local/cuda/samples. I encourage you to look through some of them – the 6_Advanced subdirectory in particular has some interesting and non-trivial examples.
You can also find out more detailed information about the particular GPU being used by running the deviceQuery program. It is a sample program that comes with CUDA and can be invoked (locally and remotely) as:
$ /usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery
$ srun /usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery
The cu_axpy subdirectory of your repo contains a set of basic examples: several "cu_axpy" programs, similar to those that were presented in lecture. The executables can be compiled by issuing "make". Each program takes one argument – the log (base 2) of the problem size to run. That is, if you pass in a value of, say, 20, the problem size that is run is \(2^{20}\). The default problem size is \(2^{16}\).
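For example, assuming the executables are built with the .exe suffix used later in this assignment:
$ ./cu_axpy_1.exe 20
runs cu_axpy_1 with a problem size of \(2^{20}\).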
Note
Again, the size argument in these examples is the \(\text{log}_2\) of the size.
The programs print some timing numbers for memory allocation (etc.), as well as Gflops/sec for the axpy computation. Each program only runs the single problem size specified.
The cu_axpy subdirectory also contains a script script.bash which runs all of the axpy programs over a range of problem sizes and uses Python with matplotlib to plot the results. In addition to the basic CUDA examples, there is a sequential implementation, an OpenMP implementation, and a thrust implementation. (Python and matplotlib are already installed on the AWS instances.)
Run the file script.bash and examine the output plot. Review the slides from lecture and make sure you understand why versions 1, 2, and 3 of cu_axpy give the results that they do. Note also the performance of the sequential, OpenMP, and Thrust cases.
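On the AWS instance this should be as simple as the following (check the script itself if it expects any arguments):
$ bash script.bash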
How many more threads are run in version 2 compared to version 1? How much speedup might you expect as a result? How much speedup do you see in your plot?
How many more threads are run in version 3 compared to version 2? How much speedup might you expect as a result? How much speedup do you see in your plot? (Hint: Is the speedup a function of the number of threads launched or the number of available cores, or both?)
(AMATH 583) The cu_axpy_t program also accepts the size of the thread blocks to be used as a second command-line argument. Experiment with different block sizes with a few different problem sizes (around \(2^{24}\), plus or minus). What block size seems to give the best performance? Are there any aspects of the GPU, as reported in deviceQuery, that might point to why this would make sense?
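For example, assuming the block size is given as the second argument as described above:
$ ./cu_axpy_t.exe 24 128
would run a problem size of \(2^{24}\) with 128 threads per block.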
Nvidia has an interactive profiling tool for analyzing CUDA applications: the Nvidia visual profiler (nvvp). Unfortunately it is a graphical tool, and to use it on a remote node one must run it over a forwarded X connection, which isn't really usable due to high latencies. However, there is a command-line program (nvprof) that provides fairly good diagnostics, but you have to know what to look for.
You can see a list of the events and metrics that nvprof can provide by executing:
$ nvprof --query-events
$ nvprof --query-metrics
Complete information about CUDA programming (including a best practices guide) can be found at https://docs.nvidia.com/cuda/.
Your cu_axpy subdirectory also contains a script nvprof.bash that invokes nvprof with a few metrics of interest. Run nvprof.bash on the four cu_axpy executables:
$ bash nvprof.bash ./cu_axpy_1.exe
$ bash nvprof.bash ./cu_axpy_2.exe
$ bash nvprof.bash ./cu_axpy_3.exe
$ bash nvprof.bash ./cu_axpy_t.exe
Quantities that you might examine in order to tune performance are "achieved occupancy" (how efficiently your program is using the GPU cores) and memory throughput. Compare the achieved occupancy for the four cu_axpy programs above. The occupancy should correlate well (though not perfectly) with the performance that the programs exhibit.
Looking at some of the metrics reported by nvprof, how do metrics such as occupancy and efficiency compare to the ratio of threads launched between versions 1, 2, and 3?
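If you want to look at an individual metric outside of nvprof.bash, you can ask nvprof for it directly. For example (achieved_occupancy and gld_efficiency are standard nvprof metric names; the metrics chosen in nvprof.bash may be different):
$ nvprof --metrics achieved_occupancy,gld_efficiency ./cu_axpy_2.exe 20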
The reduction example that we looked at in class is in the subdirectory reduction. The files are copied from the CUDA set of samples and slightly modified so that the program would compile in the course repo. There are seven different kernels that can be run from the program; kernels are selected by passing in the --kernel=<#> option, where <#> can be from 0 to 6. The kernels are progressively more optimized, as we showed in lecture. The comments in the code describe the optimization applied for each successive kernel.
Compile and run the reduction program and use nvprof to see whether the different optimizations (as described in lecture and in the Nvidia documentation) actually have the effect that is claimed.
$ nvprof ./reduction --kernel=0
$ nvprof ./reduction --kernel=1
$ # etc ...
Documentation on the optimization process can also be found at: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
The subdirectory norm_cuda contains a skeleton for a thrust-based algorithm for computing the two-norm of a vector.
Documentation for thrust can be found at https://docs.nvidia.com/cuda/thrust/index.html
Complete the implementation of the norm_thrust function in norm_thrust.cu.
You may find it helpful to refer to the axpy examples for the rest of this problem. In cu_axpy_t, we used thrust::transform to combine two vectors. To compute the norm of a single vector, we want an analogous operation – transforming over a single vector. We also want to reduce (accumulate) over the transformed values (and we know how to do that efficiently – per the reduction exercise – and there is a reduce function in thrust).
We could do this in two steps – transform the vector into a vector containing the square of each value and then reduce over that. That approach is enormously inefficient, since it materializes an intermediate vector and makes two full passes over memory. To do the transformation and reduction in a single pass, thrust includes the function transform_reduce, which is suggested for use in this problem. You may use any technique you like to implement the operators that get passed in – a function object, a helper function, a lambda, or built-in functions.
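For orientation, here is a minimal sketch of the single-pass pattern using thrust::transform_reduce. The free function two_norm and the thrust::device_vector argument are illustrative assumptions only; the actual skeleton in norm_thrust.cu (and the course Vector type used by the driver) will look different.

#include <cmath>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>

// Illustrative sketch only: two-norm of a device vector in a single pass.
// thrust::square<T>() squares each element (the transform) and
// thrust::plus<T>() sums the squares (the reduce); the sqrt runs on the host.
template <typename T>
T two_norm(const thrust::device_vector<T>& x) {
  T sum_of_squares = thrust::transform_reduce(x.begin(), x.end(),
                                              thrust::square<T>(), T(0),
                                              thrust::plus<T>());
  return std::sqrt(sum_of_squares);
}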
The norm_cuda subdirectory also contains norm_parfor and norm_seq, which you can run to compare against your thrust version.
Note that norm_thrust is a function template. The driver that calls it for timing invokes two versions of it – one for single precision and one for double precision.
As with our previous norm drivers, this program will run timing tests over a range of problem sizes (in powers of two) and you can pass the upper and lower bounds in on the command line. The program does two timing runs for each of single and double and prints the results.
When you run your working norm_thrust, you may get an output that looks like the following:
$./norm_thrust.exe
Float
N Sequential First Second First Second
1048576 1.7798 6.09445 6.07604 2.7595e-08 2.7595e-08
2097152 1.68752 11.108 11.2314 4.51044e-08 4.51044e-08
4194304 1.41257 18.6075 19.1292 2.15465e-08 2.15465e-08
8388608 1.38701 27.5941 29.5374 3.51608e-08 3.51608e-08
16777216 1.38567 36.9667 45.4383 4.24208e-08 4.24208e-08
33554432 1.37679 41.2072 57.2881 8.84264e-09 8.84264e-09
67108864 1.38583 29.4984 65.4721 2.65331e-09 2.65331e-09
Double
N Sequential First Second First Second
1048576 1.78137 6.38466 6.44605 1.36909e-11 1.36909e-11
2097152 1.69176 11.0473 11.0473 2.58195e-11 2.58195e-11
4194304 1.41943 17.2002 17.3459 1.87745e-12 1.87745e-12
8388608 1.39531 24.1052 24.1052 8.97525e-12 8.97525e-12
16777216 1.39186 29.8772 29.8772 6.11285e-13 6.11285e-13
33554432 1.38983 33.5544 34.0407 2.9634e-12 2.9634e-12
67108864 1.38298 36.2751 36.2751 1.04758e-11 1.04758e-11
Consider just the Second column for single and double. Why might there be a difference in performance between the two?
Consider just the First and Second columns for single precision. Why might there be a difference in performance between the two?
(Hint: What data structure are we starting with in the driver? Our own Vector type. What are its element types as compared to what we are sending to the GPU?)
Extra Credit. Use nvprof or some other means of testing your hypothesis for your answer above.
In the previous problem sets we considered block and strided approaches to parallelizing norm. We found that the strided approach had unfavorable access patterns to memory and gave much lower performance than the blocked approach.
But, consider one of the kernels we launched for axpy (this is from cu_axpy_2):
__global__ void madd(int n, float a, float* x, float* y) {
  int index  = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = a * x[i] + y[i];
}
It is strided!
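For comparison, the multi-block kernels (in the spirit of cu_axpy_3, though the exact course code may differ) use a grid-stride loop, which is also strided, just with a stride equal to the total number of threads launched:

__global__ void madd_grid_stride(int n, float a, float* x, float* y) {
  // Each thread starts at its global index and hops by the total thread count.
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = a * x[i] + y[i];
}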
Compare and contrast strided partitioning for task-based parallelism (e.g., OpenMP or C++ tasks) with strided partitioning for GPU. Why is it bad in the former case but good (if it is) in the latter case?
Based on one or both of norm_cuda and cu_axpy_t, implement a program that uses thrust to reduce over a vector (array) of floating point numbers. Compare the performance to the hand-optimized versions in the reduction subdirectory.
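A minimal sketch of the thrust side might look like the following (the problem size, fill value, and program structure are illustrative; the timing and comparison harness against the reduction/ kernels is up to you):

#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

int main() {
  size_t n = 1 << 24;                       // illustrative problem size
  thrust::device_vector<float> x(n, 1.0f);  // filled with ones, so the expected sum is n
  // Sum-reduce on the device; thrust chooses and launches the kernel internally.
  float sum = thrust::reduce(x.begin(), x.end(), 0.0f, thrust::plus<float>());
  std::printf("sum = %f (expected %zu)\n", sum, n);
  return 0;
}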
Do a make clean in your ps7 directory and then create a tarball ps7-submit.tar.gz containing all of the files in that directory. From the directory containing ps7:
$ cd ps7
$ make clean
$ cd ..
$ tar zcvf ps7-submit.tar.gz --exclude "./ps7/.vscode" ./ps7
If you relied on any external references, list them in a file refs.txt and include that in your tarball as well (in the ps7 subdirectory).
It’s okay if the directory (and tar file) are named ps7a instead of ps7.