Problem Set #7

Assigned: May 18, 2021

Due: May 25, 2021



This assignment is extracted as with previous assignment from a tar file, located is at this link.


The files and subdirectories included are:

├── Questions.rst
├── warmup/
│   └── hello_script.bash
├── axpy_cuda/
│   ├── Makefile
│   ├──
│   ├──
│   ├──
│   ├──
│   ├──
│   ├── cu_batch.bash
│   ├── nvprof.bash
│   ├── omp_axpy.cpp
│   ├── plot.bash
│   ├──
│   ├── script.bash
│   └── seq_axpy.cpp
├── include/
│   ├── AOSMatrix.hpp
│   ├── COOMatrix.hpp
│   ├── CSCMatrix.hpp
│   ├── CSRMatrix.hpp
│   ├──
│   ├──
│   ├── Matrix.hpp
│   ├── Timer.hpp
│   ├── Vector.hpp
│   ├── amath583.hpp
│   ├── amath583IO.hpp
│   ├── amath583sparse.hpp
│   ├── catch.hpp
│   ├── getenv.hpp
│   ├── norm_utils.hpp
│   ├── norms.hpp
│   └── pagerank.hpp
├── norm_cuda/
│   ├── Makefile
│   ├──
│   ├──
│   ├──
│   ├──
│   ├──
│   ├── cu_batch.bash
│   ├── norm_parfor.cpp
│   ├── norm_seq.cpp
│   ├──
│   ├── plot.bash
│   ├──
│   └── script.bash
└── src/
    ├── amath583.cpp
    ├── amath583IO.cpp
    └── amath583sparse.cpp


Shared Computing Resources and the Hyak Compute Cluster

For this assignment (and for the rest of the quarter), we will be using the UW Hyak compute cluster.

Compute clsuters are a fairly standard operational model for high-performance computing, typicallys consisting of a set of front-end resources (aka front-end nodes aka login nodes) and then a (much larger) numer of dedicated back-end resources (aka compute nodes). The front-end resources are shared – multiple users can login and use them. The compute resources, where the real computing is done, however are batch scheduled via a queueing system. Only one job will be run on a particular resource at a time.

Compute resources are often heterogeneous, as is the case with Hyak, with different capabilities being available on different installed hardware. When a user wishes to use a particular class of resource, that resource is requested specifically when the job is submitted to the queueing system.

Administratively, Hyak is a “condo cluster”, meaning different resources have been purchased under the auspices of different research projects, but the day-to-day operationa and maintenance is amortized by collecting them all into a single administrative domain. The resources associated with a particular project (or research group) are available for that project (group) on demand. However, idle resources are put into a pool for general use.

For this course we have 80 CPU cores and 370GB of RAM as dedicated resources (plus two V100 GPUs, hopefully soon). Students at UW are also eligible to use resources that are part of the student technology fund by joining the Research Computing Club. I encourage everyone to join the RCC. The RCC web pages have additional information on how to access and use Hyak.

If you already have access to Hyak via your research group, you can indicate which group’s resources to use when submitting your job. You can find the resources available to you at any time by issuing the command “hyakalloc”.

We will be using Hyak in different ways for this and the next assignments. For this assignment we will be using single compute nodes to provide GPU and multicore resources. For the next assignment we will be using larger numbers of distributed memory nodes. The process for submitting jobs to the batch scheduler is the same in each case.

Note that the main login node for hyak is When we say “connect to Hyak” we mean connect to (log in to)

Account Activation

An account has been created for you on Hyak for this course. To activate your account, including setting up two-factor authentication (2FA), follow the instructions here . (See also the RCC instructions).

Connect with VS Code

The easiest (and best) way to work on Hyak is to connect with VS Code. This will provide a fairly seamless transition—you will be able to edit your files and work with the command line in pretty much the same way you have been so far.

A Panopto walkthrough on how to connect with VS Code has been posted to Canvas.

The basic steps are:

  • Install the Remote SSH extension for vs code

  • Use the Remote SSH extension to connect to

You will be able to edit files on Hyak directly on your laptop and interact with Hyak through the vs code terminal.

Some Rules

The Hyak cluster is a shared compute resource, which sharing includes the compute resources themselves, as well as the file system and the front-end (login) node. When you are logged in to the head node, other students will also be logged in at the same time, so be considerate of your usage. In particular,

  • Use the cluster only for this assignment (no mining bitcoins or other activities for which a cluster might seem useful),

  • Do not run anything compute-intensive on the head node (all compute-intensive jobs should be batch scheduled),

  • Do not put any really large files on the cluster, and

  • Do not attempt to look at other students’ work.

Sanity Check

Once you are connected to the head node, there are a few quick sanity checks you can do to check out (and learn more about) the cluster environment.

To find out more about the head node itself, try

$ uname -a
$ lscpu
$ more /proc/cpuinfo

These will return various (and copious) information about the OS and cores on the head node. Of interest are how many cores there are.

You can see the resources that are available to you on Hyak with the hyakalloc command

$ hyakalloc

You should get a response back similar to

Account-Partition - Resource            Total           Used           Free
----------------------------            -----           ----           ----
   - CPU cores                             80              0             80
   - Memory (GB)                          370              0            370

Account-Partition - Resource            Total           Used           Free
----------------------------            -----           ----           ----
   - CPU cores                             40              0             40
   - Memory (GB)                          184              0            184
   - GPUs                                   8              0              8

(“niac” is the account group that you are assigned to as part of this course and niac-gpu-2080ti is account-partition that we are able to access for GPUs. If you belong to other groups you may see their resources as well.)

Batch Resources

We will be using the “SLURM” batch queuing system for this assignment, which we will discuss in more detail below. To get some information about it, you can issue the command

$ sinfo

You can get an organized summary of nodes in the cluster with the command

$ sinfo -o "%20N  %10c  %10m  %25f  %10G"

This will print a summary of the compute nodes along with some basic details about their capabilities.

You should see a list that looks something like this

NODELIST              CPUS        MEMORY      AVAIL_FEATURES             GRES
n[3008-3011,3064,306  40          386048      bigmem,cascadelake         (null)
z3000                 40          191488      2080ti,cascadelake,gpu,gp  gpu:2080ti
n[3000-3007,3016-302  40          773120      cascadelake,hugemem        (null)
z3001                 40          191488      2080ti,cascadelake,gpu,gp  gpu:2080ti
n[3012-3015,3024-306  40          176128+     cascadelake                (null)

(Use “man sinfo” to access documentation to explain the different columns.) Currently we only have one compute node configured – with 16 CPUS in order to explore multicore performance.

The command squeue on the other hand will provide information about the jobs that are currently running (or waiting to be run) on the nodes.

$ squeue
$ squeue -arl

(“man squeue” to access information about the squeue command.)

The basic paradigm in using a queuing system is that you request the queueing system to run something for you. The thing that is run can be a program, or it can be a command script (with special commands / comments to control the queueing system itself). Command line options (or commands / directives in the script file) are used to specify the configuration of the environment for the job to run, as well as to tell the queueing system some things about the job (how much time it will need to run for instance). When the job has run, the queueing system will also provide its output in some files for you – the output from cout and the output from cerr.

A reasonable tutorial on the basics of using slurm can be found here: . The most important commands are: srun, sbatch, sinfo, and squeue. The most important bits to pay attention to in that tutorial are about those commands.

A brief overview of slurm on Hyak can be found here.


There are two modes of getting SLURM to run your jobs for you – pseudo-interactive (using srun) and batch (using sbatch). The former (srun) should only be used for short-running jobs (testing, debugging, sanity checks). Using it for verifying your programs for the first part of this assignment is fine. Note that you are still using a slot in the queue even with srun – so if your job hangs, it will block other jobs from running (at least until your timeslot expires).

The “look and feel” of srun is alot like just running a job locally. Instead of executing the job locally with ./norm_parfor.exe (for example), you simply use srun:

$ srun ./norm_parfor.exe
        N  Sequential    1 thread   2 threads   4 threads   8 threads      1 thread     2 threads     4 threads     8 threads
  1048576     5.76266     5.72983     11.4924     22.8542     50.2792             0   5.19496e-15   5.00255e-15   4.04052e-15
  2097152     5.47874     5.46393      10.811     21.7382     42.1178             0   6.80033e-16   1.08805e-15   2.58412e-15
  4194304     4.05311     4.06923     7.84222     15.6246     31.9816             0   4.80808e-15   5.57737e-15   3.84646e-15
  8388608     3.29741      3.3026     6.39376     12.9454     24.6724             0   1.11505e-14   9.92669e-15   9.51874e-15

(Note that the performance is significantly better than what you may have been seeing so far on your laptop.)

Try the following with srun

$ hostname
$ srun hostname

The former will give you the name of the front-end node while the latter should give you the hostname of the compute node where SLURM ran the job – and they should be different. (Had better be different!)

Note that srun is also the command we will use in upcoming assignments to launch distributed-memory (MPI) jobs. You can pass it an argument to tell it how many separate jobs to run.

$ srun -n 4 hostname

You shouldn’t use this option until we start doing MPI programs.

The srun command takes numerous other options to specify how it should be run (specific resources required, time limits, what account to use, etc.) We will cover these as needed but you can


srun basically just takes what you give it and runs it, in a pseudo-interactive mode – results are printed back to your screen. However, jobs invoked with srun are still put into the queue – if there are jobs ahead of you, you will have to wait until the job runs to see your output.

It is often more productive to use a queueing system asynchronously – especially if there are many long-running jobs that have to be scheduled. In this mode of operation, you tell SLURM what you want to do inside of a script – and to submit the script to the queuing system.

For example, you could create a script as follows:

echo "Hello from Slurm"
echo "Goodbye"

This file is in your repo as warmup/hello_script.bash. Submit it to the queuing system with the following:

% sbatch hello_script.bash



Files that are submitted to sbatch must be actual shell scripts – meaning they must specify a shell in addition to having executable commands. The shell is specified at the top of every script, starting with #!. A bash script therefore has #!/bin/bash at the top of the file.

If you submit a file that is not a script, you will get an error like the following:

sbatch: error: This does not look like a batch script.  The first
sbatch: error: line must start with #! followed by the path to an interpreter.
sbatch: error: For instance: #!/bin/sh

#! is pronounced in various ways – I usually pronounce it as “crunch-bang”, though “pound-bang” or “hash-bang” are also used. #! serves as a magic number (really that is the technical term) to indicates scripts in Unix / Linux systems.

Output from your batch submission will be captured in a file slurm-<some number>.out. The script above would produce something like

Hello from SLURM!

Job options can be incorporated as part of your batch script (instead of being put on the command line) by prefixing ach option with #SBATCH.

For example the following are typical options to use in this course


#SBATCH --account=niac
#SBATCH --partition=gpu-2080ti
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00
#SBATCH --gres=gpu:2080ti

Here, we request from the niac account and gpu-2080ti partition, 1 core on 1 node, with one 2080ti gpu, and a time limit of 5 minutes. (Another partition that can be used is the ckpt partition, which is the resource pool of available idle resources.)


There is a reasonable chance that we will be able to use different / better GPU resources during the course of this assignment. If so, we will make an announcement and instructions on use will be updated accordingly.

If you see an error such as

slurmstepd: error: execve(): hello_script.bash: Permission denied

you may need to set the executable bit for the script.

$ chmod +x hello_script.bash

You can verify that the executable bit is set by doing a long listing of the file:

$ ls -l hello_script.bash
-rw-r--r-- 1 al75 al75 62 May 30 17:25 hello_script.bash

The script is not executable. So we issue use chmod +x

$ chmod +x hello_script.bash
$ ls -l hello_script.bash
-rwxr-xr-x 1 al75 al75 62 May 30 17:25 hello_script.bash

Program Output

When you use sbatch, rather than sending your output to a screen (since sbatch is not interactive), standard out (cout) and standard error (cerr) from the program will instead be saved in a file named slurm-xx.out, where xx is the job number that was run. The file will be saved in the directory where you launched sbatch.

Try the following

$ sbatch hello_script.bash

You should see a new file named slurm-<some number>.out in your directory.

Launching Parallel Jobs

There is a subtle difference between what srun and sbatch do. In some sense sbatch just keeps your place in the queue and launches what you would have done with srun, whereas srun actually does the launch. That is, sbatch won’t launch your script in parallel. We will come back to this with the MPI assignment.

Warm Up: PS6 Reprise

For this problem set warm-up will consist of revisiting ps6 and executing some of the problems we did on multicore nodes of Hyak.

Setting up your environment

Different development environments on Hyak are supported via the modules system. There are two modules we need to load for this asignment: gcc/10.2.0 and cuda/11.2.2. To load these modules issue the commands

$ module load gcc/10.2.0
$ module load cuda/11.2.2

You will need to load these every time you connect to Hyak in order to use the right version of gcc and to use CUDA for this assignment. You can add these statements to the end of your .bashrc file so that they are executed automatically whenever you login.

Copy PS6 to Hyak

Use rsync to copy your ps6 directory from your laptop to Hyak.

From the directory above ps6

$ rsync -avuzb ps6

(You will have to go through the 2FA process, just as if you were logging in.)

Connect to Hyak with vs code and verify that ps6 has been copied. The copied ps6 directory should be visible in your hyak home directory.


From the hello_omp directory, build ompi_info.exe

$ make ompi_info.exe

Now we are ready to run our first job on Hyak.

$ srun --time 5:00 -A niac ./ompi_info.exe

You should get back an output similar to the following

hardware_concurrency() = 40
omp_get_max_threads()  = 1
omp_get_num_threads()  = 1

Note what this is telling us about the environment into which ompi_info.exe was launched. Although there are 40 cores available, te maximum number of Open MP threads that will be available is just 1 – not very much potential parallelism.

Fortunately, srun provides options for enabling more concurrency.

Try the following

$ srun --time 5:00 -A niac --cpus-per-task 2 ./ompi_info.exe

How many omp threads are reported as being available? Try increasing the number of cpus-per-task. Do you always get a corresponding number of omp threads? Is there a limit to how many omp threads you can request?

What is the reported hardware concurrency and available omp threads if you execute ompi_info.exe on the login node?


We are explicitly specifying the maximum time allowed for each job as 5 minutes, using the –time option. The default is higher, but since the niac allocation is being shared by the entire class, we want to protect against any jobs accidentally running amok. We will be setting time limits on the command line whenever we use srun and within the batch files when we use sbatch.


Now let’s revisit some of the computational tasks we parallelized in previous assignments. Before we run these programs we want to compile them for the compute nodes they will be running on.

Recall that one of the arguments we have been passing to the compiler for maximum optimization effort has been “-march=native”, that means to use as many of the available instructions that might be available, assuming that the executable will be run on the same machine as it is compiled on. To make sure we do this, we need to compile on the cluster nodes as well as execute on them.

Let’s build and run norm_parfor.exe

$ srun --time 5:00 -A niac make norm_parfor.exe

To see that there is a different architectural difference between the compute node and the login node, try

$ ./norm_parfor.exe

You should get an error about an illegal instruction – generally a sign that the code you are trying to run was built for a more advanced architecture than the one you are trying to run on.

To run norm_parfor.exe, try the following first

$ srun --time 5:00 -A niac norm_parfor.exe

How much speedup do you get?

We have seen how to get parallel resources above, to launch norm_parfor.exe with 8 cores available, run:

$ srun --time 5:00 -A niac --cpus-per-task 8  ./norm_parfor.exe

What are the max Gflop/s reported when you run norm_parfor.exe with 8 cores? How much speedup is that over 1 core? How does that compare to what you had achieved with your laptop?


Build and run pmatvec.exe:

$ srun -A niac --time 5:00 make pmatvec.exe
$ srun -A niac --time 5:00 --cpus-per-task 8  ./pmatvec.exe
$ srun -A niac --time 5:00 --cpus-per-task 16 ./pmatvec.exe 2048 16

What are the max Gflop/s reported when you run pmatvec.exe with 16 cores? How does that compare to what you had achieved with your laptop?

What happens when you “oversubscribe”?

$ srun -A niac --time 5:00 --cpus-per-task 16 ./pmatvec.exe 2048 32


Finally, build and run pagerank.exe:

$ srun -A niac --time 5:00 make pagerank.exe

There are a number of data files that you can use in the shared directory /gscratch/niac/amath583sp21/data . One reasonably sized one is as-Skitter.

Try running pagerank with different numbers of cores.

$ srun -A niac --time 5:00 ./pagerank.exe /gscratch/niac/amath583sp21/data/as-Skitter.mtx
$ srun -A niac --time 5:00 --cpus-per-task 2 ./pagerank.exe -n 2 /gscratch/niac/amath583sp21/data/as-Skitter.mtx
$ srun -A niac --time 5:00 --cpus-per-task 4 ./pagerank.exe -n 4 /gscratch/niac/amath583sp21/data/as-Skitter.mtx
$ srun -A niac --time 5:00 --cpus-per-task 8 ./pagerank.exe -n 8 /gscratch/niac/amath583sp21/data/as-Skitter.mtx

The output of pagerank looks something like

# elapsed time [read]: 8059 ms
Converged in 46 iterations
# elapsed time [pagerank]: 3334 ms
# elapsed time [rank]: 129 ms

How much speedup (ratio of elapsed time for pagerank) do you get when running on 8 cores?


Use rsync to copy your ps7 directory from your laptop to Hyak. All subsequent work on ps7 will be done in the copy on Hyak.

From the directory above ps7:

$ rsync -avuzb ps7

Finding out about the GPU


Once you are logged on to the cluster head node, you can query GPU resources using the command nvidia-smi. Note that the head node does not have any GPU resources, trying to run nvidia-smi locally will not work.

The GPU is actually on the compute node(s) so you will need to invoke nvidia-smi using srun (requesting a node with a gpu, of course).

$  srun -p gpu-2080ti  -A niac --gres=gpu:2080ti nvidia-smi

This should print information similar to the following:

Mon May 17 21:48:47 2021
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce RTX 208...  Off  | 00000000:86:00.0 Off |                  N/A |
| 38%   43C    P0    37W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|  No running processes found                                                 |

This indicates the GPU is of the GeForce RTX family, has 11GiB of memory, we have version 460.32.03 of the drivers installed, and version 11.2 of CUDA installed.

The Nvidia CUDA toolkit contains a large number of substantive examples. With the latest version of the toolkit, everything related to CUDA is installed in /sw/cuda/11.2.2; the samples can be found in /sw/cuda/11.2.2/samples. I encourage you to look through some of them – the 6_Advanced subdirectory in particular has some interesting and non-trivial examples.


The cu_axpy subdirectory of your repo contains a set of basic examples: several “cu_axpy” programs, similar to those that were presented in lecture. The executables can be compiled by issuing “make”. Each program takes one argument – the log (base 2) of problem size to run. That is, if you pass in a value of, say, 20, the problem size that is run is \(2^{20}\). The default problem size is \(2^{16}\).


Again, the size argument in these examples is the \(\text{log}_2\) of the size.

The programs print some timing numbers for memory allocation (etc), as well a Gflops/sec for the axpy computation. Each program only runs the single problem size specified.

The axpy subdirectory also contains a script script.bash which runs all of the axpy programs over a range of problem sizes, a script cu_batch.bash which is used to submit script.bash to the queueing system, and plot.bash, which plots the results of the runs to a pdf axpy.pdf.

In addition to the basic cuda examples, there is a sequential implementation, an OpenMP implementation, and a thrust implementation. (Python and matplotlib are installed on the login node.)

The examples cu_axpy_0 through cu_axpy_3 follow the development of CUDA kernels as shown in slides 57-64 in lecture.

To build and run one of the cuda examples

$ make cu_axpy_1.exe
$ srun -p gpu-2080ti -A niac --time 5:00 --gres=gpu:2080ti ./cu_axpy_1.exe 20

You can compare to sequential and omp versions with seq_axpy and omp_axpy:

$ srun -A niac --time 5:00 ./seq_axpy.exe 20
$ srun -A niac --time 5:00 --cpus-per-task  8 ./omp_axpy.exe 20

To generate the scaling plots for this problem, first submit cu_batch.bash

$ sbatch cu_batch.bash

The system will print something like

Submitted batch job 147081

You can check on the status of the job by referring to the job id or the job name

$ squeue -n cu_batch.bash
$ squeue --job 147081

When the job starts running, the log file slurm-147081.out will be created. When the job is finished it will no longer appear in the queue.

Once the batch job is finished, there will be a number of .txt files to be processed by

$ python3

Submit cu_batch.bash to the queue, run the plotting script, and examine the output plot. Review the slides from lecture and make sure you understand why the versions 1, 2, and 3 of cu_axpy give the results that they do. Note also the performance of the sequential, OpenMP, and Thrust cases. Make sure you can explain the difference between version 1 and 2 (partitioning).

How many more threads are run in version 2 compared to version 1? How much speedup might you expect as a result? How much speedup do you see in your plot?

How many more threads are run in version 3 compared to version 2? How much speedup might you expect as a result? How much speedup do you see in your plot? (Hint: Is the speedup a function of the number of threads launched or the number of available cores, or both?)

(AMATH 583) The cu_axpy_t also accepts as a second command line argument the size of the blocks to be used. Experiment with different block sizes with, a few different problem sizes (around \(2^{24}\) plus or minus). What block size seems to give the best performance? Are there any aspects of the GPU as reported in deviceQuery that might point to why this would make sense?

nvprof (AMATH 583)

Nvidia has an interactive profiling tool for analyzing cuda applications: The Nvida visual profiler (nvvp). Unfortunately it is a graphical tool and to use it on a remote node, one must run it over a forwarded X connection, which isn’t really usable due to high latencies. However, there is a command-line program (nvprof) that provides fairly good diagnostics, but you have to know what to look for.

You can see a list of the events and metrics that nvprof can provide by executing:

$ srun -p gpu-2080ti --time 5:00 --gres=gpu:2080ti nvprof --query-events
$ srun -p gpu-2080ti --time 5:00 --gres=gpu:2080ti nvprof --query-metrics

Complete information about cuda programming (including a best practices guide) can be found at

You axpy subdirectory also contains a script nvprof.bash that invokes nvprof with a few metrics of interest.

Run nvprof.bash on the four cu_axpy executables

$ srun -p gpu-2080ti --time 5:00 --gres=gpu:2080ti bash nvprof.bash ./cu_axpy_1.exe
$ srun -p gpu-2080ti --time 5:00 --gres=gpu:2080ti bash nvprof.bash ./cu_axpy_2.exe
$ srun -p gpu-2080ti --time 5:00 --gres=gpu:2080ti bash nvprof.bash ./cu_axpy_3.exe
$ srun -p gpu-2080ti --time 5:00 --gres=gpu:2080ti bash nvprof.bash ./cu_axpy_t.exe

Quantities that you might examine in order to tune performance are “achieved occupancy” (how efficiently is your program using the GPU cores) and memory throughput. Compare the achieved occupancy for the four cu_axpy programs above. The occupancy should correlate well (though not perfectly) with the performance that the programs exhibit.

Looking at some of the metrics reported by nvprof, how do metrics such as occupancy and efficiency compare to the ratio of threads launched between versions 1, 2, and 3?


In the previous problem sets we considered block and strided approaches to parallelizing norm. We found that the strided approach had unfavorable access patterns to memory and gave much lower performance than the blocked approach.

But, consider one of the kernels we launched for axpy (this is from cu_axpy_2):

__global__ void dot0(int n, float a, float* x, float* y) {
  int index  = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = a * x[i] + y[i];

It is strided!

Think about how we do strided partitioning for task-based parallelism (e.g., OpenMP or C++ tasks) with strided partitioning for GPU. Why is it bad in the former case but good (if it is) in the latter case?


In this part of the assignment we want to work through the evolution of reduction patterns that were presented in lecture (slides 73-77) but in the context of norm rather than simply reduction. (In fact, we are going to generalize slightly and actually do dot product).

Consider the implementation of the dot0 kernel in cu_norm_0

void dot0(int n, float* a, float* x, float* y) {
  extern __shared__ float sdata[];

  int tid    = threadIdx.x;
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;

  sdata[tid] = 0.0;
  for (int i = index; i < n; i += stride)
    sdata[tid] += x[i] * y[i];


  if (tid == 0) {
    a[blockIdx.x] = 0.0;
    for (int i = 0; i < blockDim.x; ++i) {
      a[blockIdx.x] += sdata[i];

There are two phases in the kernel of this dot product. In the first phase, each thread computes the partial sums for its partition of the input data and saves the results in a shared memory array. This phase is followed by a barrier (__syncthreads()). Then, the partial sums are added together for all of the threads in each block by the zeroth thread, leaving still some partial sums in the a array (one partial sum for each block), which are then finally added together by the cpu.

As we have seen, a single gpu thread is not very powerful, and having a single gpu thread adding up the partial sums is quite wasteful. A tree-based approach is much more efficient.

A simple tree-based approach is

void dot0(int n, float* a, float* x, float* y) {
  extern __shared__ float sdata[];

  int tid    = threadIdx.x;
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;

  sdata[tid] = 0.0;
  for (int i = index; i < n; i += stride)
    sdata[tid] += x[i] * y[i];


  for (size_t s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2*s) == 0) {
      sdata[tid] += sdata[tid + s];

  if (tid == 0) {
    a[blockIdx.x] = sdata[0];

Implement the reduction schemes from slides 75, 76, and 77 in,, and, respectively.

You can run individual programs, e.g., as

$ srun -p gpu-2080ti -A niac --time 5:00 --gres=gpu:2080ti ./cu_norm_1.exe 20

You can compare to sequential and omp versions with norm_seq.exe and norm_parfor.exe. First copy your version of norms.hpp from ps6/include into ps7/include. Then build and run:

$ srun -A niac --time 5:00 ./norm_seq.exe 20
$ srun -A niac --time 5:00 --cpus-per-task  8 ./norm_parfor.exe 20

(You should make these on a compute node as with axpy above.)


The files in the ps7 include subdirectory are not parallelized. You will need to copy your version of norms.hpp from ps6 into the ps7 include subdirectory.

This subdirectory has script files for queue submission and plotting, similar to those in the axpy_cuda subdirectory. When you have your dot products working, submit the batch script to the queue and plot the results.

What is the max number of Gflop/s that you were able to achieve from the GPU? Overall?

Submitting Your Work

Submit your files to Gradescope.

For ps7, you will find two assignments on Gradescope: “ps7 – written (for both 483 and 583)” and “ps7 – written (for 583 only, Q8-9)”.

Please submit your answers to the Questions as a PDF file. You would also need to submit the plots you get in To-Do 2 (axpy section of ps7) and in To-Do 5 (norm_cuda section of ps7). Please, append those plots to the end of your Questions.pdf file and upload only the resulting pdf file as your submission. (Separate Questions.pdf, axpy.pdf, norm_cuda.pdf could be combined into a single pdf using any suitable online tool).

In this assignment, Question 8 and Question 9 are for AMATH 583 only. If you are a 583 student, please submit your answers for those to the assignment “ps7 – written (for 583 only, Q8-9)” on Gradescope. Note that you could have your work in a single pdf file which you could submit for both written assignments, however, you would need to make sure that correct pages are selected for each question. If you are a 483 student, please do not submit any files to the 583-only assignment on Gradescope.

Please make sure to accurately match the relevant pages of your pdf with questions on Gradescope (otherwise you could lose points if we do not find your work when grading).

If you relied on any external resources, include the references in your document as well.