Assigned: May 18, 2021
Due: May 25, 2021
Lectures 15, 16 (CUDA)
Walkthrough of connecting to Hyak using VS Code
As with previous assignments, this assignment is extracted from a tar file, located at this link.
The files and subdirectories included are:
ps7/
├── Questions.rst
├── warmup/
│ └── hello_script.bash
├── axpy_cuda/
│ ├── Makefile
│ ├── cu_axpy_0.cu
│ ├── cu_axpy_1.cu
│ ├── cu_axpy_2.cu
│ ├── cu_axpy_3.cu
│ ├── cu_axpy_t.cu
│ ├── cu_batch.bash
│ ├── nvprof.bash
│ ├── omp_axpy.cpp
│ ├── plot.bash
│ ├── plot.py
│ ├── script.bash
│ └── seq_axpy.cpp
├── include/
│ ├── AOSMatrix.hpp
│ ├── COOMatrix.hpp
│ ├── CSCMatrix.hpp
│ ├── CSRMatrix.hpp
│ ├── Make.inc
│ ├── Make_cu.inc
│ ├── Matrix.hpp
│ ├── Timer.hpp
│ ├── Vector.hpp
│ ├── amath583.hpp
│ ├── amath583IO.hpp
│ ├── amath583sparse.hpp
│ ├── catch.hpp
│ ├── getenv.hpp
│ ├── norm_utils.hpp
│ ├── norms.hpp
│ └── pagerank.hpp
├── norm_cuda/
│ ├── Makefile
│ ├── cu_norm_0.cu
│ ├── cu_norm_1.cu
│ ├── cu_norm_2.cu
│ ├── cu_norm_3.cu
│ ├── cu_norm_4.cu
│ ├── cu_batch.bash
│ ├── norm_parfor.cpp
│ ├── norm_seq.cpp
│ ├── norm_thrust.cu
│ ├── plot.bash
│ ├── plot.py
│ └── script.bash
└── src/
├── amath583.cpp
├── amath583IO.cpp
└── amath583sparse.cpp
For this problem set, the warm-up consists of revisiting ps6 and executing some of its problems on the multicore nodes of Hyak.
Different development environments on Hyak are supported via the modules system. There are two modules we need to load for this assignment: gcc/10.2.0 and cuda/11.2.2. To load these modules, issue the commands
$ module load gcc/10.2.0
$ module load cuda/11.2.2
You will need to load these every time you connect to Hyak in order to use the right version of gcc and to use CUDA for this assignment. You can add these statements to the end of your .bashrc file so that they are executed automatically whenever you log in.
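For example, appending these two lines (without the $ prompt) to the end of your ~/.bashrc will load both modules automatically at each login:

module load gcc/10.2.0
module load cuda/11.2.2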
Use rsync to copy your ps6 directory from your laptop to Hyak.
From the directory above ps6
$ rsync -avuzb ps6 klone.hyak.uw.edu:
(You will have to go through the 2FA process, just as if you were logging in.)
Connect to Hyak with VS Code and verify that ps6 has been copied. The copied ps6 directory should be visible in your Hyak home directory.
From the hello_omp directory, build ompi_info.exe
$ make ompi_info.exe
Now we are ready to run our first job on Hyak.
$ srun --time 5:00 -A niac ./ompi_info.exe
You should get back output similar to the following:
OMP_NUM_THREADS =
hardware_concurrency() = 40
omp_get_max_threads() = 1
omp_get_num_threads() = 1
Note what this is telling us about the environment into which ompi_info.exe was launched. Although there are 40 cores available, the maximum number of OpenMP threads that will be available is just 1, which is not very much potential parallelism.
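For reference, a program that reports these values needs only a few calls. The following C++/OpenMP sketch is illustrative only (it is not necessarily the actual source of ompi_info.exe in your repo) and would be compiled with -fopenmp:

#include <cstdlib>
#include <iostream>
#include <thread>
#include <omp.h>

int main() {
  const char* p = std::getenv("OMP_NUM_THREADS");   // empty unless set in the environment
  std::cout << "OMP_NUM_THREADS        = " << (p ? p : "") << std::endl;
  std::cout << "hardware_concurrency() = " << std::thread::hardware_concurrency() << std::endl;
  std::cout << "omp_get_max_threads()  = " << omp_get_max_threads() << std::endl;
  // outside of a parallel region this always reports 1
  std::cout << "omp_get_num_threads()  = " << omp_get_num_threads() << std::endl;
  return 0;
}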
Fortunately, srun provides options for enabling more concurrency.
Try the following
$ srun --time 5:00 -A niac --cpus-per-task 2 ./ompi_info.exe
How many omp threads are reported as being available? Try increasing the number of cpus-per-task. Do you always get a corresponding number of omp threads? Is there a limit to how many omp threads you can request?
What is the reported hardware concurrency and available omp threads if you execute ompi_info.exe on the login node?
Note
We are explicitly specifying the maximum time allowed for each job as 5 minutes, using the --time option. The default is higher, but since the niac allocation is being shared by the entire class, we want to protect against any jobs accidentally running amok. We will be setting time limits on the command line whenever we use srun and within the batch files when we use sbatch.
Now let’s revisit some of the computational tasks we parallelized in previous assignments. Before we run these programs we want to compile them for the compute nodes they will be running on.
Recall that one of the arguments we have been passing to the compiler for maximum optimization effort is "-march=native", which tells the compiler to use whatever instructions are available on the machine doing the compiling, on the assumption that the executable will be run on that same machine. To make sure that assumption holds, we need to compile on the cluster compute nodes as well as execute on them.
Let’s build and run norm_parfor.exe
$ srun --time 5:00 -A niac make norm_parfor.exe
To see that there is an architectural difference between the compute nodes and the login node, try
$ ./norm_parfor.exe
You should get an error about an illegal instruction, which is generally a sign that the code you are trying to run was built for a more advanced architecture than the one you are trying to run it on.
To run norm_parfor.exe, first try the following:
$ srun --time 5:00 -A niac ./norm_parfor.exe
How much speedup do you get?
We have seen above how to request parallel resources. To launch norm_parfor.exe with 8 cores available, run:
$ srun --time 5:00 -A niac --cpus-per-task 8 ./norm_parfor.exe
What are the max Gflop/s reported when you run norm_parfor.exe with 8 cores? How much speedup is that over 1 core? How does that compare to what you had achieved with your laptop?
Build and run pmatvec.exe:
$ srun -A niac --time 5:00 make pmatvec.exe
$ srun -A niac --time 5:00 --cpus-per-task 8 ./pmatvec.exe
$ srun -A niac --time 5:00 --cpus-per-task 16 ./pmatvec.exe 2048 16
What are the max Gflop/s reported when you run pmatvec.exe with 16 cores? How does that compare to what you had achieved with your laptop?
What happens when you “oversubscribe”?
$ srun -A niac --time 5:00 --cpus-per-task 16 ./pmatvec.exe 2048 32
Finally, build and run pagerank.exe:
$ srun -A niac --time 5:00 make pagerank.exe
There are a number of data files that you can use in the shared directory /gscratch/niac/amath583sp21/data. One reasonably sized one is as-Skitter.
Try running pagerank with different numbers of cores.
$ srun -A niac --time 5:00 ./pagerank.exe /gscratch/niac/amath583sp21/data/as-Skitter.mtx
$ srun -A niac --time 5:00 --cpus-per-task 2 ./pagerank.exe -n 2 /gscratch/niac/amath583sp21/data/as-Skitter.mtx
$ srun -A niac --time 5:00 --cpus-per-task 4 ./pagerank.exe -n 4 /gscratch/niac/amath583sp21/data/as-Skitter.mtx
$ srun -A niac --time 5:00 --cpus-per-task 8 ./pagerank.exe -n 8 /gscratch/niac/amath583sp21/data/as-Skitter.mtx
The output of pagerank looks something like
# elapsed time [read]: 8059 ms
Converged in 46 iterations
# elapsed time [pagerank]: 3334 ms
# elapsed time [rank]: 129 ms
How much speedup (ratio of elapsed time for pagerank) do you get when running on 8 cores?
Use rsync to copy your ps7 directory from your laptop to Hyak. All subsequent work on ps7 will be done in the copy on Hyak.
From the directory above ps7:
$ rsync -avuzb ps7 klone.hyak.uw.edu:
Once you are logged on to the cluster head node, you can query GPU resources using the command nvidia-smi. Note that the head node does not have any GPU resources, so trying to run nvidia-smi locally will not work. The GPUs are on the compute node(s), so you will need to invoke nvidia-smi using srun (requesting a node with a GPU, of course).
$ srun -p gpu-2080ti -A niac --gres=gpu:2080ti nvidia-smi
This should print information similar to the following:
Mon May 17 21:48:47 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:86:00.0 Off | N/A |
| 38% 43C P0 37W / 250W | 0MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This indicates that the GPU is of the GeForce RTX family and has 11 GiB of memory, and that version 460.32.03 of the drivers and version 11.2 of CUDA are installed.
The Nvidia CUDA toolkit contains a large number of substantive examples. With the latest version of the toolkit, everything related to CUDA is installed in /sw/cuda/11.2.2; the samples can be found in /sw/cuda/11.2.2/samples. I encourage you to look through some of them; the 6_Advanced subdirectory in particular has some interesting and non-trivial examples.
The axpy_cuda subdirectory of your repo contains a set of basic examples: several "cu_axpy" programs, similar to those that were presented in lecture. The executables can be compiled by issuing "make". Each program takes one argument: the log (base 2) of the problem size to run. That is, if you pass in a value of, say, 20, the problem size that is run is \(2^{20}\). The default problem size is \(2^{16}\).
Note
Again, the size argument in these examples is the \(\text{log}_2\) of the size.
The programs print some timing numbers for memory allocation (etc.), as well as Gflop/s for the axpy computation. Each program only runs the single problem size specified.
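As a reminder of where the Gflop/s numbers come from: axpy performs two floating point operations (one multiply and one add) per element, so for a problem of size \(n\) completed in \(t\) seconds the rate is \(2n / (10^9 \, t)\) Gflop/s (assuming the programs count flops in the usual way).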
The axpy_cuda subdirectory also contains a script script.bash, which runs all of the axpy programs over a range of problem sizes; a script cu_batch.bash, which is used to submit script.bash to the queueing system; and plot.bash, which plots the results of the runs to a PDF file, axpy.pdf.
In addition to the basic CUDA examples, there is a sequential implementation, an OpenMP implementation, and a Thrust implementation. (Python and matplotlib are installed on the login node.)
The examples cu_axpy_0 through cu_axpy_3 follow the development of CUDA kernels as shown in slides 57-64 in lecture.
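As a point of reference, a basic CUDA axpy kernel and launch look roughly like the sketch below. This is illustrative only and not necessarily identical to any of the provided cu_axpy files; d_x and d_y are assumed to be device pointers that have already been allocated and filled.

// Each thread computes one element; the grid is sized to cover all n elements.
__global__ void axpy(int n, float a, const float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a * x[i] + y[i];
}

// Launch with enough 256-thread blocks to cover n elements, then wait for completion.
int threads = 256;
int blocks  = (n + threads - 1) / threads;
axpy<<<blocks, threads>>>(n, 3.14f, d_x, d_y);
cudaDeviceSynchronize();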
To build and run one of the CUDA examples:
$ make cu_axpy_1.exe
$ srun -p gpu-2080ti -A niac --time 5:00 --gres=gpu:2080ti ./cu_axpy_1.exe 20
You can compare to the sequential and OpenMP versions with seq_axpy.exe and omp_axpy.exe:
$ srun -A niac --time 5:00 ./seq_axpy.exe 20
$ srun -A niac --time 5:00 --cpus-per-task 8 ./omp_axpy.exe 20
To generate the scaling plots for this problem, first submit cu_batch.bash
$ sbatch cu_batch.bash
The system will print something like
Submitted batch job 147081
You can check on the status of the job by referring to the job id or the job name
$ squeue -n cu_batch.bash
$ squeue --job 147081
When the job starts running, the log file slurm-147081.out will be created. When the job is finished it will no longer appear in the queue.
Once the batch job is finished, there will be a number of .txt files to be processed by plot.py
$ python3 plot.py
Submit cu_batch.bash to the queue, run the plotting script, and examine the output plot. Review the slides from lecture and make sure you understand why the versions 1, 2, and 3 of cu_axpy give the results that they do. Note also the performance of the sequential, OpenMP, and Thrust cases. Make sure you can explain the difference between version 1 and 2 (partitioning).
How many more threads are run in version 2 compared to version 1? How much speedup might you expect as a result? How much speedup do you see in your plot?
How many more threads are run in version 3 compared to version 2? How much speedup might you expect as a result? How much speedup do you see in your plot? (Hint: Is the speedup a function of the number of threads launched or the number of available cores, or both?)
(AMATH 583) The cu_axpy_t program also accepts a second command-line argument: the size of the blocks to be used. Experiment with different block sizes and a few different problem sizes (around \(2^{24}\), plus or minus). What block size seems to give the best performance? Are there any aspects of the GPU as reported in deviceQuery that might point to why this would make sense?
Nvidia has an interactive profiling tool for analyzing CUDA applications: the Nvidia visual profiler (nvvp). Unfortunately, it is a graphical tool, and to use it on a remote node one must run it over a forwarded X connection, which isn't really usable due to high latencies. However, there is a command-line program (nvprof) that provides fairly good diagnostics, but you have to know what to look for.
You can see a list of the events and metrics that nvprof can provide by executing:
$ srun -p gpu-2080ti --time 5:00 --gres=gpu:2080ti nvprof --query-events
$ srun -p gpu-2080ti --time 5:00 --gres=gpu:2080ti nvprof --query-metrics
Complete information about CUDA programming (including a best practices guide) can be found at https://docs.nvidia.com/cuda/.
Your axpy_cuda subdirectory also contains a script nvprof.bash that invokes nvprof with a few metrics of interest. Run nvprof.bash on the four cu_axpy executables:
$ srun -p gpu-2080ti --time 5:00 --gres=gpu:2080ti bash nvprof.bash ./cu_axpy_1.exe
$ srun -p gpu-2080ti --time 5:00 --gres=gpu:2080ti bash nvprof.bash ./cu_axpy_2.exe
$ srun -p gpu-2080ti --time 5:00 --gres=gpu:2080ti bash nvprof.bash ./cu_axpy_3.exe
$ srun -p gpu-2080ti --time 5:00 --gres=gpu:2080ti bash nvprof.bash ./cu_axpy_t.exe
Quantities that you might examine in order to tune performance are "achieved occupancy" (how efficiently your program is using the GPU cores) and memory throughput. Compare the achieved occupancy for the four cu_axpy programs above. The occupancy should correlate well (though not perfectly) with the performance that the programs exhibit.
Looking at some of the metrics reported by nvprof, how do metrics such as occupancy and efficiency compare to the ratio of threads launched between versions 1, 2, and 3?
In the previous problem sets we considered block and strided approaches to parallelizing norm. We found that the strided approach had unfavorable access patterns to memory and gave much lower performance than the blocked approach.
But, consider one of the kernels we launched for axpy (this is from cu_axpy_2):
__global__ void dot0(int n, float a, float* x, float* y) {
  int index  = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = a * x[i] + y[i];
}
It is strided!
Think about how strided partitioning for task-based parallelism (e.g., OpenMP or C++ tasks) compares with strided partitioning on the GPU. Why is it bad in the former case but good (if it is) in the latter case?
In this part of the assignment we want to work through the evolution of reduction patterns that were presented in lecture (slides 73-77), but in the context of norm rather than plain reduction. (In fact, we are going to generalize slightly and actually do a dot product.)
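Concretely, the kernels in this part compute \(\text{dot}(x, y) = \sum_{i=0}^{n-1} x_i y_i\); the two-norm then follows as \(\|x\|_2 = \sqrt{\text{dot}(x, x)}\), with the final square root taken on the host.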
Consider the implementation of the dot0 kernel in cu_norm_0:
__global__
void dot0(int n, float* a, float* x, float* y) {
  extern __shared__ float sdata[];
  int tid    = threadIdx.x;
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  sdata[tid] = 0.0;
  for (int i = index; i < n; i += stride)
    sdata[tid] += x[i] * y[i];
  __syncthreads();
  if (tid == 0) {
    a[blockIdx.x] = 0.0;
    for (int i = 0; i < blockDim.x; ++i) {
      a[blockIdx.x] += sdata[i];
    }
  }
}
There are two phases in the kernel of this dot product. In the first phase, each thread computes the partial sums for its partition of the input data and saves the results in a shared memory array. This phase is followed by a barrier (__syncthreads()). Then the partial sums for all of the threads in each block are added together by the zeroth thread, still leaving some partial sums in the a array (one per block), which are finally added together by the CPU.
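The host side of this scheme might look roughly like the sketch below; the names (cuda_dot, d_a, partial) and the block/thread counts are illustrative assumptions, not the actual contents of cu_norm_0.cu:

#include <vector>
#include <cuda_runtime.h>

// Hypothetical driver: d_x and d_y are device pointers of length n.
float cuda_dot(int n, float* d_x, float* d_y) {
  int threads = 256, blocks = 128;            // illustrative launch configuration
  float* d_a;                                 // one partial sum per block
  cudaMalloc(&d_a, blocks * sizeof(float));

  // third launch parameter is the dynamic shared memory used by sdata[]
  dot0<<<blocks, threads, threads * sizeof(float)>>>(n, d_a, d_x, d_y);

  std::vector<float> partial(blocks);
  cudaMemcpy(partial.data(), d_a, blocks * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_a);

  float dot = 0.0f;                           // final reduction of per-block sums on the CPU
  for (int b = 0; b < blocks; ++b)
    dot += partial[b];
  return dot;                                 // the norm is then std::sqrt(cuda_dot(n, d_x, d_x))
}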
As we have seen, a single GPU thread is not very powerful, and having a single GPU thread add up the partial sums is quite wasteful. A tree-based approach is much more efficient.
A simple tree-based approach, from cu_norm_1.cu, is:
__global__
void dot0(int n, float* a, float* x, float* y) {
  extern __shared__ float sdata[];
  int tid    = threadIdx.x;
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  sdata[tid] = 0.0;
  for (int i = index; i < n; i += stride)
    sdata[tid] += x[i] * y[i];
  __syncthreads();
  for (size_t s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2*s) == 0) {
      sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
  }
  if (tid == 0) {
    a[blockIdx.x] = sdata[0];
  }
}
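To see what the loop is doing, take a block with blockDim.x = 8: at s = 1, threads 0, 2, 4, and 6 each add in their right-hand neighbor (sdata[tid + 1]); at s = 2, threads 0 and 4 add in sdata[tid + 2]; at s = 4, thread 0 adds in sdata[4]. After \(\log_2 8 = 3\) rounds, sdata[0] holds the sum of all eight entries for the block.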
Implement the reduction schemes from slides 75, 76, and 77 in cu_norm_2.cu, cu_norm_3.cu, and cu_norm_4.cu, respectively.
You can run individual programs, e.g., as
$ srun -p gpu-2080ti -A niac --time 5:00 --gres=gpu:2080ti ./cu_norm_1.exe 20
You can compare to the sequential and OpenMP versions with norm_seq.exe and norm_parfor.exe.
First copy your version of norms.hpp from ps6/include into ps7/include. Then build and run:
$ srun -A niac --time 5:00 ./norm_seq.exe 20
$ srun -A niac --time 5:00 --cpus-per-task 8 ./norm_parfor.exe 20
(You should make these on a compute node as with axpy above.)
Note
The files in the ps7 include subdirectory are not parallelized. You will need to copy your version of norms.hpp from ps6 into the ps7 include subdirectory.
The norm_cuda subdirectory has script files for queue submission and plotting, similar to those in the axpy_cuda subdirectory. When you have your dot products working, submit the batch script to the queue and plot the results.
What is the max number of Gflop/s that you were able to achieve from the GPU? Overall?
Submit your files to Gradescope.
For ps7, you will find two assignments on Gradescope: “ps7 – written (for both 483 and 583)” and “ps7 – written (for 583 only, Q8-9)”.
Please submit your answers to the Questions as a PDF file. You will also need to submit the plots you get in To-Do 2 (the axpy section of ps7) and in To-Do 5 (the norm_cuda section of ps7). Please append those plots to the end of your Questions.pdf file and upload only the resulting PDF file as your submission. (Separate Questions.pdf, axpy.pdf, and norm_cuda.pdf files can be combined into a single PDF using any suitable online tool.)
In this assignment, Question 8 and Question 9 are for AMATH 583 only. If you are a 583 student, please submit your answers for those to the assignment "ps7 – written (for 583 only, Q8-9)" on Gradescope. Note that you can keep your work in a single PDF file and submit it for both written assignments; however, you will need to make sure that the correct pages are selected for each question. If you are a 483 student, please do not submit any files to the 583-only assignment on Gradescope.
Please make sure to accurately match the relevant pages of your pdf with questions on Gradescope (otherwise you could lose points if we do not find your work when grading).
If you relied on any external resources, include the references in your document as well.