Assigned: May 17, 2022
Due: May 26, 2022
Lectures 15, 16 (CUDA)
Walkthrough of connecting to Hyak using VS Code (see below)
As with the previous assignments, this assignment is extracted from a tar file, located at ps7.tar.gz.
The files and subdirectories included are:
ps7/
├── Questions.rst
├── warmup/
│ └── hello_script.bash
├── axpy_cuda/
│ ├── Makefile
│ ├── cu_axpy_0.cu
│ ├── cu_axpy_1.cu
│ ├── cu_axpy_2.cu
│ ├── cu_axpy_3.cu
│ ├── cu_axpy_t.cu
│ ├── cu_batch.bash
│ ├── nvprof.bash
│ ├── omp_axpy.cpp
│ ├── plot.bash
│ ├── plot.py
│ ├── script.bash
│ └── seq_axpy.cpp
├── include/
│ ├── AOSMatrix.hpp
│ ├── COOMatrix.hpp
│ ├── CSCMatrix.hpp
│ ├── CSRMatrix.hpp
│ ├── Make.inc
│ ├── Make_cu.inc
│ ├── Matrix.hpp
│ ├── Timer.hpp
│ ├── Vector.hpp
│ ├── amath583.hpp
│ ├── amath583IO.hpp
│ ├── amath583sparse.hpp
│ ├── catch.hpp
│ ├── getenv.hpp
│ ├── norm_utils.hpp
│ ├── norms.hpp
│ └── pagerank.hpp
├── norm_cuda/
│ ├── Makefile
│ ├── cu_norm_0.cu
│ ├── cu_norm_1.cu
│ ├── cu_norm_2.cu
│ ├── cu_norm_3.cu
│ ├── cu_norm_4.cu
│ ├── cu_batch.bash
│ ├── norm_parfor.cpp
│ ├── norm_seq.cpp
│ ├── norm_thrust.cu
│ ├── plot.bash
│ ├── plot.py
│ └── script.bash
└── src/
├── amath583.cpp
├── amath583IO.cpp
└── amath583sparse.cpp
For this problem set, the warm-up consists of revisiting ps6 and executing some of the problems we did before, this time on multicore nodes of Hyak.
Different development environments on Hyak are supported via the modules system. There are two modules we need to load for this assignment: gcc/11.2.0 and cuda/11.6.2. To load these modules, issue the commands
$ module load gcc/11.2.0
$ module load cuda/11.6.2
You will need to load these every time you connect to Hyak in order to use the right version of gcc and to use CUDA for this assignment. You can add these statements to the end of your .bashrc file so that they are executed automatically whenever you log in. The .bashrc file is located in your home directory on the head node of Hyak; you can open it with VS Code and paste these statements at the end of it.
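For example, the last two lines of your .bashrc would then be (using the module names above):
module load gcc/11.2.0
module load cuda/11.6.2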
To query what modules are available to you on Hyak, issue this command:
$ module avail
As you can see, many labs provide tools and software on Hyak; these are shared among all users. To query what modules you currently have loaded on Hyak, issue this command:
$ module list
Currently Loaded Modules:
1) gcc/11.2.0 2) cuda/11.6.2
One way to copy files (especially a large number of files) is to use the file-copying tool rsync to copy your ps6 directory from your laptop to Hyak. We want to copy the whole directory and preserve its original folder hierarchy.
From the directory above ps6, copy the files to your home directory (~) on Hyak:
$ rsync -avuzb /home/tony/amath583/ps6/ klone.hyak.uw.edu:~
(You will have to go through the 2FA process, just as if you were logging in.)
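If you want to preview what would be transferred before copying anything, rsync accepts a dry-run flag, -n; the command below is the same as above except that nothing is actually copied:
$ rsync -avuzbn /home/tony/amath583/ps6/ klone.hyak.uw.edu:~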
Connect to Hyak with VS code and verify that ps6 has been copied. The copied ps6 directory should be visible in your Hyak home directory.
Another way to copy your ps6 directory from your laptop to Hyak is to use secure copy, scp (the remote file copy program).
Here too we want to copy ps6 recursively so that every file and the whole folder hierarchy is copied to Hyak.
To do so, issue this command:
$ scp -r /home/tony/amath583/ps6/ <your NetID>@klone.hyak.uw.edu:~
You will have to go through the 2FA process, just as if you were logging in. This will copy the whole ps6 directory to your home directory on Hyak. You can specify a different destination directory with klone.hyak.uw.edu:<Your path>.
Note
scp will overwrite (delete the existing directory and create a new one) the ps6 directory if a ps6 directory already exists at the same path. If you have source code in ps6 on Hyak, scp will overwrite the whole ps6 directory. Be cautious when using the scp command.
Say we want to copy the whole ps6 directory that we put in our home directory on Hyak back to our laptop. You can also use scp to copy files from Hyak back to your laptop. Issue this command:
$ scp -r <your NetID>@klone.hyak.uw.edu:~/ps6/ /home/tony/amath583/ps6/
(You will have to go through the 2FA process, just as if you were logging in.)
From the hello_omp directory, build ompi_info.exe
$ make ompi_info.exe
You may notice that compilation fails with an error when you try to build on the head node of Hyak.
Note
For this assignment, as well as future assignments, NEVER compile on the head node of Hyak. Instead, build through srun on a compute node.
Remember to add the module load statements to the end of your .bashrc file so that they are executed automatically whenever you log in or run/build a program. To verify that the modules are loaded, issue
$ module list
Currently Loaded Modules:
1) gcc/11.2.0 2) cuda/11.6.2
Now we are ready to build. From the hello_omp directory, build ompi_info.exe.
$ srun --time 5:00 -A amath -p gpu-rtx6k make ompi_info.exe
Note
Slurm on Hyak requires the user to specify both the account and the partition explicitly in the submitted jobs.
The option -A is for the account you are associated with. The option -p is for the partition you are associated with.
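If you want to inspect the partition itself (its nodes, their state, and time limits), the standard Slurm query command sinfo can be used, e.g.:
$ sinfo -p gpu-rtx6k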
After a successful compilation, we are ready to run our first job on Hyak.
$ srun --time 5:00 -A amath -p gpu-rtx6k ./ompi_info.exe
You should get back an output similar to the following
OMP_NUM_THREADS =
hardware_concurrency() = 40
omp_get_max_threads() = 1
omp_get_num_threads() = 1
Note what this is telling us about the environment into which ompi_info.exe was launched. Although there are 40 cores available on our compute node, the maximum number of OpenMP threads that will be available is just 1 – not very much potential parallelism.
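For reference, the values being reported come from the OpenMP runtime and the C++ standard library. A minimal sketch of a program making these queries might look like the following (an illustration only, not the actual source of ompi_info.exe):
#include <cstdlib>
#include <iostream>
#include <thread>
#include <omp.h>

int main() {
  // Environment variable controlling the OpenMP thread pool (may be unset)
  const char* env = std::getenv("OMP_NUM_THREADS");
  std::cout << "OMP_NUM_THREADS        = " << (env ? env : "") << std::endl;
  // Number of hardware threads the C++ runtime can see
  std::cout << "hardware_concurrency() = " << std::thread::hardware_concurrency() << std::endl;
  // Upper bound on the threads OpenMP will use for a parallel region
  std::cout << "omp_get_max_threads()  = " << omp_get_max_threads() << std::endl;
  // Outside of a parallel region this is always 1
  std::cout << "omp_get_num_threads()  = " << omp_get_num_threads() << std::endl;
  return 0;
}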
Fortunately, srun provides options for enabling more concurrency.
Try the following
$ srun --time 5:00 -A amath -p gpu-rtx6k --cpus-per-task 2 ./ompi_info.exe
How many omp threads are reported as being available? Try increasing the number of cpus-per-task. Do you always get a corresponding number of omp threads? Is there a limit to how many omp threads you can request?
What is the reported hardware concurrency and available omp threads if you execute ompi_info.exe on the login node?
Note
We are explicitly specifying the maximum time allowed for each job as 5 minutes, using the --time
option. The default maximum time is longer than 5 minutes, but since the amath allocation is being shared by the entire class, we want to protect against any jobs accidentally running amok. We will be setting time limits on the command line whenever we use srun and within the batch files when we use sbatch. Please be considerate when specifying the maximum time.
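In a batch script, the same limits are given as #SBATCH directives at the top of the file; for example, mirroring the srun options above:
#SBATCH --time=5:00
#SBATCH -A amath
#SBATCH -p gpu-rtx6k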
Now let’s revisit some of the computational tasks we parallelized in previous assignments. Before we run these programs we want to compile them for the compute nodes they will be running on.
Recall that one of the arguments we have been passing to the compiler for maximum optimization effort is "-march=native", which tells the compiler to use all of the instructions available on the machine doing the compiling, on the assumption that the executable will be run on the same kind of machine it was compiled on. To make sure this holds, we need to compile on the cluster compute nodes as well as execute on them.
Let’s build and run norm_parfor.exe
$ srun --time 5:00 -A amath -p gpu-rtx6k make norm_parfor.exe
To see that there is an architectural difference between the compute node and the login node, try
$ ./norm_parfor.exe
You should get an error about an illegal instruction – generally a sign that the code you are trying to run was built for a more advanced architecture than the one you are trying to run on.
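You can also compare the two CPUs directly. Assuming the standard lscpu utility is available on both nodes, something like the following shows the difference:
$ lscpu | grep 'Model name'
$ srun --time 5:00 -A amath -p gpu-rtx6k lscpu | grep 'Model name'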
To run norm_parfor.exe, try the following first
$ srun --time 5:00 -A amath -p gpu-rtx6k ./norm_parfor.exe
How much speedup do you get?
We have seen above how to request parallel resources; to launch norm_parfor.exe with 8 cores available, run:
$ srun --time 5:00 -A amath -p gpu-rtx6k --cpus-per-task 8 ./norm_parfor.exe
What are the max Gflop/s reported when you run norm_parfor.exe with 8 cores? How much speedup is that over 1 core? How does that compare to what you had achieved with your laptop?
Build and run pmatvec.exe:
$ srun -A amath -p gpu-rtx6k --time 5:00 make pmatvec.exe
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 8 ./pmatvec.exe
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 16 ./pmatvec.exe 2048 16
What are the max Gflop/s reported when you run pmatvec.exe with 16 cores? How does that compare to what you had achieved with your laptop?
What happens when you “oversubscribe”?
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 16 ./pmatvec.exe 2048 32
Finally, build and run pagerank.exe:
$ srun -A amath -p gpu-rtx6k --time 5:00 make pagerank.exe
There are a number of data files that you can use in the shared directory /gscratch/amath/amath583/data. One reasonably sized one is as-Skitter.
Try running pagerank with different numbers of cores.
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 1 ./pagerank.exe -n 1 /gscratch/amath/amath583/data/as-Skitter.mtx
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 2 ./pagerank.exe -n 2 /gscratch/amath/amath583/data/as-Skitter.mtx
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 4 ./pagerank.exe -n 4 /gscratch/amath/amath583/data/as-Skitter.mtx
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 8 ./pagerank.exe -n 8 /gscratch/amath/amath583/data/as-Skitter.mtx
The output of pagerank looks something like
# elapsed time [read]: 8059 ms
Converged in 46 iterations
# elapsed time [pagerank]: 3334 ms
# elapsed time [rank]: 129 ms
How much speedup (ratio of elapsed time for pagerank comparing 1 core with 8 cores) do you get when running on 8 cores?
Use rsync to copy your ps7 directory from your laptop to Hyak. All subsequent work on ps7 will be done in the copy on Hyak.
From the directory above ps7:
$ rsync -avuzb /home/tony/amath583/ps7/ klone.hyak.uw.edu:~
Once you are logged on to the cluster head node, you can query GPU resources using the command nvidia-smi. Note that the head node does not have any GPU resources, so trying to run nvidia-smi locally there will not work.
The GPU is actually on the compute node(s), so you will need to invoke nvidia-smi using srun (requesting a node with a gpu, of course).
$ srun -p gpu-rtx6k -A amath --gres=gpu:rtx6k nvidia-smi
This should print information similar to the following:
Sun May 15 10:38:26 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 6000 Off | 00000000:21:00.0 Off | Off |
| 29% 34C P0 24W / 260W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This indicates that the GPU is an Nvidia Quadro RTX 6000 with 24576 MiB of memory, that version 510.47.03 of the drivers is installed, and that version 11.6 of CUDA is installed.
The Nvidia CUDA toolkit contains a large number of substantive examples. With the latest version of the toolkit, everything related to CUDA is installed in /sw/cuda/11.2.2; the samples can be found in /sw/cuda/11.2.2/samples. I encourage you to look through some of them – the 6_Advanced subdirectory in particular has some interesting and non-trivial examples.
The cu_axpy subdirectory of your repo contains a set of basic examples: several "cu_axpy" programs, similar to those that were presented in lecture. The executables can be compiled by issuing "make". Each program takes one argument – the log (base 2) of the problem size to run. That is, if you pass in a value of, say, 20, the problem size that is run is \(2^{20}\). The default problem size is \(2^{16}\).
Note
Again, the size argument in these examples is the \(\log_2\) of the size.
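The argument handling presumably looks something like the following sketch (illustrative only; the actual sources may differ):
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: interpret argv[1] as log2 of the problem size
size_t problem_size(int argc, char* argv[]) {
  size_t log_n = (argc > 1) ? std::atoi(argv[1]) : 16;  // default is 2^16
  return 1UL << log_n;                                  // e.g. 20 gives 2^20 elements
}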
The programs print some timing numbers for memory allocation (etc.), as well as Gflop/s for the axpy computation. Each program only runs the single problem size specified.
The axpy subdirectory also contains a script script.bash which runs all of the axpy programs over a range of problem sizes, a script cu_batch.bash which is used to submit script.bash to the queueing system, and plot.bash, which plots the results of the runs to a pdf axpy.pdf.
In addition to the basic cuda examples, there is a sequential implementation, an OpenMP implementation, and a thrust implementation. (Python and matplotlib are installed on the login node.)
The examples cu_axpy_0 through cu_axpy_3 follow the development of CUDA kernels as shown in slides 62-69 in lecture.
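As a reminder of where that development ends up, a grid-stride axpy kernel of the kind built up in those slides looks roughly like the following (a sketch; it is not necessarily line-for-line identical to the provided cu_axpy_*.cu files):
__global__ void axpy(int n, float a, float* x, float* y) {
  int index  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  int stride = blockDim.x * gridDim.x;                 // total threads in the grid
  for (int i = index; i < n; i += stride)
    y[i] = a * x[i] + y[i];
}

// A typical launch: enough 256-thread blocks to cover n elements
// axpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);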
To build and run one of the cuda examples
$ srun -A amath -p gpu-rtx6k --time 5:00 make cu_axpy_1.exe
$ srun -A amath -p gpu-rtx6k --time 5:00 --gres=gpu:rtx6k ./cu_axpy_1.exe 20
You can compare to sequential and omp versions with seq_axpy and omp_axpy:
$ srun -A amath -p gpu-rtx6k --time 5:00 ./seq_axpy.exe 20
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 8 ./omp_axpy.exe 20
To generate the scaling plots for this problem, first submit cu_batch.bash
$ sbatch cu_batch.bash
The system will print something like
Submitted batch job 147081
You can check on the status of the job by referring to the job id or the job name
$ squeue -n cu_batch.bash
$ squeue --job 147081
When the job starts running, the log file slurm-147081.out will be created. When the job is finished it will no longer appear in the queue.
Once the batch job is finished, there will be a number of .txt files to be processed by plot.py
$ python3 plot.py
This Python script parses the performance numbers from the generated .txt files and produces the pdf plot (axpy.pdf here; the analogous script in the norm_cuda subdirectory produces norm_cuda.pdf).
You can use scp to copy it back to your laptop.
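For example, run from your laptop (adjust the path if your ps7 copy lives somewhere other than your home directory on Hyak):
$ scp <your NetID>@klone.hyak.uw.edu:~/ps7/axpy_cuda/axpy.pdf .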
Submit cu_batch.bash to the queue, run the plotting script, and examine the output plot. Review the slides from lecture and make sure you understand why the versions 1, 2, and 3 of cu_axpy give the results that they do. Note also the performance of the sequential, OpenMP, and Thrust cases. Make sure you can explain the difference between version 1 and 2 (partitioning).
How many more threads are run in version 2 compared to version 1? How much speedup might you expect as a result? How much speedup do you see in your plot?
How many more threads are run in version 3 compared to version 2? How much speedup might you expect as a result? How much speedup do you see in your plot? (Hint: Is the speedup a function of the number of threads launched or the number of available cores, or both?)
(AMATH 583, updated) The cu_axpy_3 also accepts as a second command line argument the size of the blocks to be used. Experiment with different block sizes, with a few different problem sizes (around \(2^{24}\), plus or minus). What block size seems to give the best performance? Are there any aspects of the GPU as reported in deviceQuery that might point to why this would make sense?
Nvidia has an interactive profiling tool for analyzing cuda applications: the Nvidia visual profiler (nvvp). Unfortunately it is a graphical tool, and to use it on a remote node one must run it over a forwarded X connection, which isn't really usable due to high latencies. However, there is a command-line program (nvprof) that provides fairly good diagnostics, but you have to know what to look for. Nvidia has since deprecated nvprof and introduced a replacement, Nsight Systems (its command-line tool is nsys).
Run nsys nvprof on the four cu_axpy executables:
$ srun -A amath -p gpu-rtx6k --time 5:00 --gres=gpu:rtx6k nsys nvprof ./cu_axpy_1.exe
$ srun -A amath -p gpu-rtx6k --time 5:00 --gres=gpu:rtx6k nsys nvprof ./cu_axpy_2.exe
$ srun -A amath -p gpu-rtx6k --time 5:00 --gres=gpu:rtx6k nsys nvprof ./cu_axpy_3.exe
$ srun -A amath -p gpu-rtx6k --time 5:00 --gres=gpu:rtx6k nsys nvprof ./cu_axpy_t.exe
Quantities that you might examine in order to tune performance are "Memory Operation Statistics", "Kernel Statistics", and "API Statistics". Compare the overhead, such as data movement, for the four cu_axpy programs above.
(TL: you can ignore this question.) Looking at some of the metrics reported by nvprof, how do metrics such as occupancy and efficiency compare to the ratio of threads launched between versions 1, 2, and 3?
In the previous problem sets we considered block and strided approaches to parallelizing norm. We found that the strided approach had unfavorable access patterns to memory and gave much lower performance than the blocked approach.
But, consider one of the kernels we launched for axpy (this is from cu_axpy_2):
__global__ void dot0(int n, float a, float* x, float* y) {
  int index = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = a * x[i] + y[i];
}
It is strided!
Compare how we do strided partitioning for task-based parallelism (e.g., OpenMP or C++ tasks) with strided partitioning on the GPU. Why is it bad in the former case but good (if it is) in the latter case?
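For reference, the two CPU partitionings considered in the earlier assignments look roughly like the following (an illustrative sketch; the function names and signatures are hypothetical, not taken from the ps6/ps7 sources):
#include <cstddef>

// Blocked: each worker gets one contiguous chunk of the vectors
void axpy_blocked(size_t tid, size_t nthreads, size_t n,
                  float a, const float* x, float* y) {
  size_t chunk = n / nthreads;
  size_t begin = tid * chunk;
  size_t end   = (tid == nthreads - 1) ? n : begin + chunk;
  for (size_t i = begin; i < end; ++i)
    y[i] = a * x[i] + y[i];
}

// Strided: each worker touches every nthreads-th element
void axpy_strided(size_t tid, size_t nthreads, size_t n,
                  float a, const float* x, float* y) {
  for (size_t i = tid; i < n; i += nthreads)
    y[i] = a * x[i] + y[i];
}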
In this part of the assignment we want to work through the evolution of reduction patterns that were presented in lecture (slides 78-82) but in the context of norm rather than simply reduction. (In fact, we are going to generalize slightly and actually do dot product).
Consider the implementation of the dot0 kernel in cu_norm_0:
__global__
void dot0(int n, float* a, float* x, float* y) {
  extern __shared__ float sdata[];
  int tid    = threadIdx.x;
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;

  // Phase 1: each thread accumulates a partial sum into shared memory
  sdata[tid] = 0.0;
  for (int i = index; i < n; i += stride)
    sdata[tid] += x[i] * y[i];
  __syncthreads();

  // Phase 2: thread 0 of each block adds up that block's partial sums
  if (tid == 0) {
    a[blockIdx.x] = 0.0;
    for (int i = 0; i < blockDim.x; ++i) {
      a[blockIdx.x] += sdata[i];
    }
  }
}
There are two phases in the kernel of this dot product. In the first phase, each thread computes the partial sums for its partition of the input data and saves the results in a shared memory array. This phase is followed by a barrier (__syncthreads()). Then the partial sums for all of the threads in each block are added together by the zeroth thread, leaving some partial sums in the a array (one partial sum for each block), which are then finally added together by the cpu.
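That final CPU step might look roughly like this (a hypothetical sketch of the host-side code; the variable names are illustrative, not from the actual driver in cu_norm_0.cu):
#include <cmath>

// a_host holds the per-block partial sums copied back from the device array a
float sum = 0.0f;
for (int block = 0; block < num_blocks; ++block)
  sum += a_host[block];
// For norm, x == y, so the dot product is ||x||^2 and the norm is its square root
float norm = std::sqrt(sum);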
As we have seen, a single gpu thread is not very powerful, and having a single gpu thread adding up the partial sums is quite wasteful. A tree-based approach is much more efficient.
A simple tree-based approach, cu_norm_1.cu, is:
__global__
void dot0(int n, float* a, float* x, float* y) {
  extern __shared__ float sdata[];
  int tid    = threadIdx.x;
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;

  sdata[tid] = 0.0;
  for (int i = index; i < n; i += stride)
    sdata[tid] += x[i] * y[i];
  __syncthreads();

  // Tree-based reduction of this block's partial sums in shared memory
  for (size_t s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2 * s) == 0) {
      sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
  }
  if (tid == 0) {
    a[blockIdx.x] = sdata[0];
  }
}
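One detail worth noting for all of these kernels: because sdata is declared extern __shared__, its size must be supplied as the third parameter of the kernel launch configuration. A launch might therefore look like the following (a sketch; the provided driver code may use different names):
// Hypothetical launch: one float of shared memory per thread in the block
dot0<<<num_blocks, threads_per_block, threads_per_block * sizeof(float)>>>(n, d_a, d_x, d_y);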
Implement the reduction schemes from slides 80, 81, and 82 in cu_norm_2.cu, cu_norm_3.cu, and cu_norm_4.cu, respectively.
You can run individual programs, e.g., as
$ srun -p gpu-rtx6k -A amath --time 5:00 --gres=gpu:rtx6k ./cu_norm_1.exe 20
You can compare to sequential and omp versions with norm_seq.exe and norm_parfor.exe.
First copy your version of norms.hpp from ps6/include into ps7/include. Then build and run:
$ srun -A amath -p gpu-rtx6k --time 5:00 ./norm_seq.exe 20
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 8 ./norm_parfor.exe 20
(You should make these on a compute node as with axpy above.)
Note
The files in the ps7 include subdirectory are not parallelized. You will need to copy your version of norms.hpp from ps6 into the ps7 include subdirectory. You can use scp to copy norms.hpp from your laptop to Hyak.
This subdirectory has script files for queue submission and plotting, similar to those in the axpy_cuda subdirectory. When you have your dot products working, submit the batch script to the queue and plot the results.
What is the max number of Gflop/s that you were able to achieve from the GPU? Overall?
Answer the following questions (append to Questions.rst): a) The most important thing I learned from this assignment was… b) One thing I am still not clear on is…
Submit your files to Gradescope. To log in to Gradescope, point your browser to gradescope.com and click the login button. On the form that pops up select “School Credentials” and then “University of Washington netid”. You will be asked to authenticate with the UW netid login page. Once you have authenticated you will be brought to a page with your courses. Select amath583sp22.
For ps7, you will find two assignments on Gradescope: "ps7 – written (for 483)" and "ps7 – written (for 583)". Starting from this assignment, we will not grade your source code with the autograder, so there is no need to submit your source code to Gradescope.
Please submit your answers to the Questions as a PDF file. You will also need to submit the plots you get in To-Do 2 (the axpy section of ps7) and in To-Do 5 (the norm_cuda section of ps7). Please append those plots to the end of your Questions.pdf file and upload only the resulting pdf file as your submission. (Separate Questions.pdf, axpy.pdf, and norm_cuda.pdf files can be combined into a single pdf using any suitable online tool.)
In this assignment, Question 8 and Question 9 are for AMATH 583 only. If you are a 583 student, please submit your answers for those to the assignment "ps7 – written (for 583)" on Gradescope. Note that you can keep all of your work in a single pdf file and submit it for both written assignments; however, you will need to make sure that the correct pages are selected for each question. If you are a 483 student, please do not submit any files to the 583-only assignment on Gradescope.
Please make sure to accurately match the relevant pages of your pdf with questions on Gradescope (otherwise you could lose points if we do not find your work when grading).
If you relied on any external resources, include the references in your document as well.