Assigned: May 17, 2022
Due: May 26, 2022
Lectures 15, 16 (CUDA)
Walkthrough of connecting to Hyak using VS Code (see below)
As with the previous assignments, this assignment is extracted from a tar file, located at ps7.tar.gz.
The files and subdirectories included are:
ps7/
├── Questions.rst
├── warmup/
│ └── hello_script.bash
├── axpy_cuda/
│ ├── Makefile
│ ├── cu_axpy_0.cu
│ ├── cu_axpy_1.cu
│ ├── cu_axpy_2.cu
│ ├── cu_axpy_3.cu
│ ├── cu_axpy_t.cu
│ ├── cu_batch.bash
│ ├── nvprof.bash
│ ├── omp_axpy.cpp
│ ├── plot.bash
│ ├── plot.py
│ ├── script.bash
│ └── seq_axpy.cpp
├── include/
│ ├── AOSMatrix.hpp
│ ├── COOMatrix.hpp
│ ├── CSCMatrix.hpp
│ ├── CSRMatrix.hpp
│ ├── Make.inc
│ ├── Make_cu.inc
│ ├── Matrix.hpp
│ ├── Timer.hpp
│ ├── Vector.hpp
│ ├── amath583.hpp
│ ├── amath583IO.hpp
│ ├── amath583sparse.hpp
│ ├── catch.hpp
│ ├── getenv.hpp
│ ├── norm_utils.hpp
│ ├── norms.hpp
│ └── pagerank.hpp
├── norm_cuda/
│ ├── Makefile
│ ├── cu_norm_0.cu
│ ├── cu_norm_1.cu
│ ├── cu_norm_2.cu
│ ├── cu_norm_3.cu
│ ├── cu_norm_4.cu
│ ├── cu_batch.bash
│ ├── norm_parfor.cpp
│ ├── norm_seq.cpp
│ ├── norm_thrust.cu
│ ├── plot.bash
│ ├── plot.py
│ └── script.bash
└── src/
├── amath583.cpp
├── amath583IO.cpp
└── amath583sparse.cpp
For this problem set, the warm-up consists of revisiting ps6 and executing some of the problems we did before, this time on multicore nodes of Hyak.
Different development environments on Hyak are supported via the modules system. There are two modules we need to load for this assignment: gcc/11.2.0 and cuda/11.6.2. To load these modules, issue the commands
$ module load gcc/11.2.0
$ module load cuda/11.6.2
You will need to load these every time you connect to Hyak in order to use the right version of gcc and to use CUDA for this assignment. You can add these statements to the end of your .bashrc file so that they are executed automatically whenever you log in. The .bashrc file is located in your home directory on the head node of Hyak; you can open it with VS Code and paste these statements at the end of it.
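For example, the last two lines of your .bashrc would then be (using the module names above):
module load gcc/11.2.0
module load cuda/11.6.2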
To query what modules are available to you on Hyak, issue this command:
$ module avail
As you can see, many labs provide tools and software on Hyak; these are shared among all users. To query what modules you currently have loaded on Hyak, issue this command:
$ module list
Currently Loaded Modules:
1) gcc/11.2.0 2) cuda/11.6.2
One way to copy files (especially a large number of files) is to use the file-copying tool rsync to copy your ps6 directory from your laptop to Hyak. We want to copy the whole directory and preserve its original folder hierarchy.
From the directory above ps6, copy the files to your home directory (~) on Hyak:
$ rsync -avuzb /home/tony/amath583/ps6/ klone.hyak.uw.edu:~
(You will have to go through the 2FA process, just as if you were logging in.)
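If you want to preview what would be transferred before copying anything, rsync accepts a dry-run flag, -n; the command below is the same as above except that nothing is actually copied:
$ rsync -avuzbn /home/tony/amath583/ps6/ klone.hyak.uw.edu:~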
Connect to Hyak with VS code and verify that ps6 has been copied. The copied ps6 directory should be visible in your Hyak home directory.
Another way to copy your ps6 directory from your laptop to Hyak is to use secure copy, scp (the remote file copy program).
Here too we want to copy ps6 recursively so that every file and the whole folder hierarchy is copied to Hyak.
To do so, issue this command:
$ scp -r /home/tony/amath583/ps6/ <your NetID>@klone.hyak.uw.edu:~
You will have to go through the 2FA process, just as if you were logging in. This will copy the whole ps6 directory to your home directory on Hyak. You can specify a different destination directory with klone.hyak.uw.edu:<Your path>.
Note
scp will overwrite (delete the existing directory and create a new one) the ps6 directory if a ps6 directory already exists at the same path. If you have source code in ps6 on Hyak, scp will overwrite the whole ps6 directory. Be cautious when using the scp command.
Say we want to copy the whole ps6 directory that we put in our home directory on Hyak back to our laptop. You can also use scp to copy files from Hyak back to your laptop. Issue this command:
$ scp -r <your NetID>@klone.hyak.uw.edu:~/ps6/ /home/tony/amath583/ps6/
(You will have to go through the 2FA process, just as if you were logging in.)
From the hello_omp directory, build ompi_info.exe
$ make ompi_info.exe
You may notice that compilation fails with an error when you try to build on the head node of Hyak.
Note
For this assignment, as well as future assignments, NEVER compile on the head node of Hyak. Instead, build through srun on a compute node.
Remember to add the module load statements to the end of your .bashrc file so that they are executed automatically whenever you log in or run/build a program. To verify that the modules are loaded, issue
$ module list
Currently Loaded Modules:
1) gcc/11.2.0 2) cuda/11.6.2
Now we are ready to build. From the hello_omp directory, build ompi_info.exe.
$ srun --time 5:00 -A amath -p gpu-rtx6k make ompi_info.exe
Note
Slurm on Hyak requires the user to specify both the account and the partition explicitly in the submitted jobs.
The option -A is for the account you are associated with. The option -p is for the partition you are associated with.
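If you want to inspect the partition itself (its nodes, their state, and time limits), the standard Slurm query command sinfo can be used, e.g.:
$ sinfo -p gpu-rtx6k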
After a successful compilation, we are ready to run our first job on Hyak.
$ srun --time 5:00 -A amath -p gpu-rtx6k ./ompi_info.exe
You should get back an output similar to the following
OMP_NUM_THREADS =
hardware_concurrency() = 40
omp_get_max_threads() = 1
omp_get_num_threads() = 1
Note what this is telling us about the environment into which ompi_info.exe was launched. Although there are 40 cores available on our compute node, the maximum number of OpenMP threads that will be available is just 1 – not very much potential parallelism.
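For reference, the values being reported come from the OpenMP runtime and the C++ standard library. A minimal sketch of a program making these queries might look like the following (an illustration only, not the actual source of ompi_info.exe):
#include <cstdlib>
#include <iostream>
#include <thread>
#include <omp.h>

int main() {
  // Environment variable controlling the OpenMP thread pool (may be unset)
  const char* env = std::getenv("OMP_NUM_THREADS");
  std::cout << "OMP_NUM_THREADS        = " << (env ? env : "") << std::endl;
  // Number of hardware threads the C++ runtime can see
  std::cout << "hardware_concurrency() = " << std::thread::hardware_concurrency() << std::endl;
  // Upper bound on the threads OpenMP will use for a parallel region
  std::cout << "omp_get_max_threads()  = " << omp_get_max_threads() << std::endl;
  // Outside of a parallel region this is always 1
  std::cout << "omp_get_num_threads()  = " << omp_get_num_threads() << std::endl;
  return 0;
}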
Fortunately, srun provides options for enabling more concurrency.
Try the following
$ srun --time 5:00 -A amath -p gpu-rtx6k --cpus-per-task 2 ./ompi_info.exe
How many omp threads are reported as being available? Try increasing the number of cpus-per-task. Do you always get a corresponding number of omp threads? Is there a limit to how many omp threads you can request?
What is the reported hardware concurrency and available omp threads if you execute ompi_info.exe on the login node?
Note
We are explicitly specifying the maximum time allowed for each job as 5 minutes, using the --time
option. The default maximum time is longer than 5 minutes, but since the amath allocation is being shared by the entire class, we want to protect against any jobs accidentally running amok. We will be setting time limits on the command line whenever we use srun and within the batch files when we use sbatch. Please be considerate when specifying the maximum time.
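In a batch script, the same limits are given as #SBATCH directives at the top of the file; for example, mirroring the srun options above:
#SBATCH --time=5:00
#SBATCH -A amath
#SBATCH -p gpu-rtx6k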
Now let’s revisit some of the computational tasks we parallelized in previous assignments. Before we run these programs we want to compile them for the compute nodes they will be running on.
Recall that one of the arguments we have been passing to the compiler for maximum optimization effort is "-march=native", which tells the compiler to use all of the instructions available on the machine doing the compiling, on the assumption that the executable will be run on the same kind of machine it was compiled on. To make sure this holds, we need to compile on the cluster compute nodes as well as execute on them.
Let’s build and run norm_parfor.exe
$ srun --time 5:00 -A amath -p gpu-rtx6k make norm_parfor.exe
To see that there is an architectural difference between the compute node and the login node, try
$ ./norm_parfor.exe
You should get an error about an illegal instruction – generally a sign that the code you are trying to run was built for a more advanced architecture than the one you are trying to run on.
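You can also compare the two CPUs directly. Assuming the standard lscpu utility is available on both nodes, something like the following shows the difference:
$ lscpu | grep 'Model name'
$ srun --time 5:00 -A amath -p gpu-rtx6k lscpu | grep 'Model name'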
To run norm_parfor.exe, try the following first
$ srun --time 5:00 -A amath -p gpu-rtx6k ./norm_parfor.exe
How much speedup do you get?
We have seen above how to request parallel resources; to launch norm_parfor.exe with 8 cores available, run:
$ srun --time 5:00 -A amath -p gpu-rtx6k --cpus-per-task 8 ./norm_parfor.exe
What are the max Gflop/s reported when you run norm_parfor.exe with 8 cores? How much speedup is that over 1 core? How does that compare to what you had achieved with your laptop?
Build and run pmatvec.exe:
$ srun -A amath -p gpu-rtx6k --time 5:00 make pmatvec.exe
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 8 ./pmatvec.exe
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 16 ./pmatvec.exe 2048 16
What are the max Gflop/s reported when you run pmatvec.exe with 16 cores? How does that compare to what you had achieved with your laptop?
What happens when you “oversubscribe”?
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 16 ./pmatvec.exe 2048 32
Finally, build and run pagerank.exe:
$ srun -A amath -p gpu-rtx6k --time 5:00 make pagerank.exe
There are a number of data files that you can use in the shared directory /gscratch/amath/amath583/data. One reasonably sized one is as-Skitter.
Try running pagerank with different numbers of cores.
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 1 ./pagerank.exe -n 1 /gscratch/amath/amath583/data/as-Skitter.mtx
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 2 ./pagerank.exe -n 2 /gscratch/amath/amath583/data/as-Skitter.mtx
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 4 ./pagerank.exe -n 4 /gscratch/amath/amath583/data/as-Skitter.mtx
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 8 ./pagerank.exe -n 8 /gscratch/amath/amath583/data/as-Skitter.mtx
The output of pagerank looks something like
# elapsed time [read]: 8059 ms
Converged in 46 iterations
# elapsed time [pagerank]: 3334 ms
# elapsed time [rank]: 129 ms
How much speedup (ratio of elapsed time for pagerank comparing 1 core with 8 cores) do you get when running on 8 cores?
Use rsync to copy your ps7 directory from your laptop to Hyak. All subsequent work on ps7 will be done in the copy on Hyak.
From the directory above ps7:
$ rsync -avuzb /home/tony/amath583/ps7/ klone.hyak.uw.edu:~
Once you are logged on to the cluster head node, you can query GPU resources using the command nvidia-smi. Note that the head node does not have any GPU resources, so trying to run nvidia-smi locally there will not work.
The GPU is actually on the compute node(s), so you will need to invoke nvidia-smi using srun (requesting a node with a gpu, of course).
$ srun -p gpu-rtx6k -A amath --gres=gpu:rtx6k nvidia-smi
This should print information similar to the following:
Sun May 15 10:38:26 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 6000 Off | 00000000:21:00.0 Off | Off |
| 29% 34C P0 24W / 260W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This indicates that the GPU is an Nvidia Quadro RTX 6000 with 24576 MiB of memory, that version 510.47.03 of the drivers is installed, and that version 11.6 of CUDA is installed.
The Nvidia CUDA toolkit contains a large number of substantive examples. With the latest version of the toolkit, everything related to CUDA is installed in /sw/cuda/11.2.2; the samples can be found in /sw/cuda/11.2.2/samples. I encourage you to look through some of them – the 6_Advanced subdirectory in particular has some interesting and non-trivial examples.
The cu_axpy subdirectory of your repo contains a set of basic examples: several "cu_axpy" programs, similar to those that were presented in lecture. The executables can be compiled by issuing "make". Each program takes one argument – the log (base 2) of the problem size to run. That is, if you pass in a value of, say, 20, the problem size that is run is \(2^{20}\). The default problem size is \(2^{16}\).
Note
Again, the size argument in these examples is the \(\log_2\) of the size.
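The argument handling presumably looks something like the following sketch (illustrative only; the actual sources may differ):
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: interpret argv[1] as log2 of the problem size
size_t problem_size(int argc, char* argv[]) {
  size_t log_n = (argc > 1) ? std::atoi(argv[1]) : 16;  // default is 2^16
  return 1UL << log_n;                                  // e.g. 20 gives 2^20 elements
}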
The programs print some timing numbers for memory allocation (etc.), as well as Gflop/s for the axpy computation. Each program only runs the single problem size specified.
The axpy subdirectory also contains a script script.bash which runs all of the axpy programs over a range of problem sizes, a script cu_batch.bash which is used to submit script.bash to the queueing system, and plot.bash, which plots the results of the runs to a pdf axpy.pdf.
In addition to the basic cuda examples, there is a sequential implementation, an OpenMP implementation, and a thrust implementation. (Python and matplotlib are installed on the login node.)
The examples cu_axpy_0 through cu_axpy_3 follow the development of CUDA kernels as shown in slides 62-69 in lecture.
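As a reminder of where that development ends up, a grid-stride axpy kernel of the kind built up in those slides looks roughly like the following (a sketch; it is not necessarily line-for-line identical to the provided cu_axpy_*.cu files):
__global__ void axpy(int n, float a, float* x, float* y) {
  int index  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  int stride = blockDim.x * gridDim.x;                 // total threads in the grid
  for (int i = index; i < n; i += stride)
    y[i] = a * x[i] + y[i];
}

// A typical launch: enough 256-thread blocks to cover n elements
// axpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);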
To build and run one of the cuda examples
$ srun -A amath -p gpu-rtx6k --time 5:00 make cu_axpy_1.exe
$ srun -A amath -p gpu-rtx6k --time 5:00 --gres=gpu:rtx6k ./cu_axpy_1.exe 20
You can compare to sequential and omp versions with seq_axpy and omp_axpy:
$ srun -A amath -p gpu-rtx6k --time 5:00 ./seq_axpy.exe 20
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 8 ./omp_axpy.exe 20
To generate the scaling plots for this problem, first submit cu_batch.bash
$ sbatch cu_batch.bash
The system will print something like
Submitted batch job 147081
You can check on the status of the job by referring to the job id or the job name
$ squeue -n cu_batch.bash
$ squeue --job 147081
When the job starts running, the log file slurm-147081.out will be created. When the job is finished it will no longer appear in the queue.
Once the batch job is finished, there will be a number of .txt files to be processed by plot.py
$ python3 plot.py
This Python script parses the performance numbers from the generated .txt files and produces the pdf plot (axpy.pdf here; the analogous script in the norm_cuda subdirectory produces norm_cuda.pdf).
You can use scp to copy it back to your laptop.
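For example, run from your laptop (adjust the path if your ps7 copy lives somewhere other than your home directory on Hyak):
$ scp <your NetID>@klone.hyak.uw.edu:~/ps7/axpy_cuda/axpy.pdf .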
Submit cu_batch.bash to the queue, run the plotting script, and examine the output plot. Review the slides from lecture and make sure you understand why the versions 1, 2, and 3 of cu_axpy give the results that they do. Note also the performance of the sequential, OpenMP, and Thrust cases. Make sure you can explain the difference between version 1 and 2 (partitioning).
How many more threads are run in version 2 compared to version 1? How much speedup might you expect as a result? How much speedup do you see in your plot?
How many more threads are run in version 3 compared to version 2? How much speedup might you expect as a result? How much speedup do you see in your plot? (Hint: Is the speedup a function of the number of threads launched or the number of available cores, or both?)
(AMATH 583, updated) The cu_axpy_3 also accepts as a second command line argument the size of the blocks to be used. Experiment with different block sizes, with a few different problem sizes (around \(2^{24}\), plus or minus). What block size seems to give the best performance? Are there any aspects of the GPU as reported in deviceQuery that might point to why this would make sense?
Nvidia has an interactive profiling tool for analyzing cuda applications: the Nvidia visual profiler (nvvp). Unfortunately it is a graphical tool, and to use it on a remote node one must run it over a forwarded X connection, which isn't really usable due to high latencies. However, there is a command-line program (nvprof) that provides fairly good diagnostics, but you have to know what to look for. Nvidia has since deprecated nvprof and introduced a replacement, Nsight Systems (its command-line tool is nsys).
Run nsys nvprof on the four cu_axpy executables:
$ srun -A amath -p gpu-rtx6k --time 5:00 --gres=gpu:rtx6k nsys nvprof ./cu_axpy_1.exe
$ srun -A amath -p gpu-rtx6k --time 5:00 --gres=gpu:rtx6k nsys nvprof ./cu_axpy_2.exe
$ srun -A amath -p gpu-rtx6k --time 5:00 --gres=gpu:rtx6k nsys nvprof ./cu_axpy_3.exe
$ srun -A amath -p gpu-rtx6k --time 5:00 --gres=gpu:rtx6k nsys nvprof ./cu_axpy_t.exe
Quantities that you might examine in order to tune performance are "Memory Operation Statistics", "Kernel Statistics", and "API Statistics". Compare the overhead, such as data movement, for the four cu_axpy programs above.
(TL: you can ignore this question.) Looking at some of the metrics reported by nvprof, how do metrics such as occupancy and efficiency compare to the ratio of threads launched between versions 1, 2, and 3?
In the previous problem sets we considered block and strided approaches to parallelizing norm. We found that the strided approach had unfavorable access patterns to memory and gave much lower performance than the blocked approach.
But, consider one of the kernels we launched for axpy (this is from cu_axpy_2):
__global__ void dot0(int n, float a, float* x, float* y) {
  int index = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = a * x[i] + y[i];
}
It is strided!
Compare how we do strided partitioning for task-based parallelism (e.g., OpenMP or C++ tasks) with strided partitioning on the GPU. Why is it bad in the former case but good (if it is) in the latter case?
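For reference, the two CPU partitionings considered in the earlier assignments look roughly like the following (an illustrative sketch; the function names and signatures are hypothetical, not taken from the ps6/ps7 sources):
#include <cstddef>

// Blocked: each worker gets one contiguous chunk of the vectors
void axpy_blocked(size_t tid, size_t nthreads, size_t n,
                  float a, const float* x, float* y) {
  size_t chunk = n / nthreads;
  size_t begin = tid * chunk;
  size_t end   = (tid == nthreads - 1) ? n : begin + chunk;
  for (size_t i = begin; i < end; ++i)
    y[i] = a * x[i] + y[i];
}

// Strided: each worker touches every nthreads-th element
void axpy_strided(size_t tid, size_t nthreads, size_t n,
                  float a, const float* x, float* y) {
  for (size_t i = tid; i < n; i += nthreads)
    y[i] = a * x[i] + y[i];
}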
In this part of the assignment we want to work through the evolution of reduction patterns that were presented in lecture (slides 78-82) but in the context of norm rather than simply reduction. (In fact, we are going to generalize slightly and actually do dot product).
Consider the implementation of the dot0 kernel in cu_norm_0:
__global__
void dot0(int n, float* a, float* x, float* y) {
  extern __shared__ float sdata[];
  int tid    = threadIdx.x;
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;

  // Phase 1: each thread accumulates a partial sum into shared memory
  sdata[tid] = 0.0;
  for (int i = index; i < n; i += stride)
    sdata[tid] += x[i] * y[i];
  __syncthreads();

  // Phase 2: thread 0 of each block adds up that block's partial sums
  if (tid == 0) {
    a[blockIdx.x] = 0.0;
    for (int i = 0; i < blockDim.x; ++i) {
      a[blockIdx.x] += sdata[i];
    }
  }
}
There are two phases in the kernel of this dot product. In the first phase, each thread computes the partial sums for its partition of the input data and saves the results in a shared memory array. This phase is followed by a barrier (__syncthreads()). Then the partial sums for all of the threads in each block are added together by the zeroth thread, leaving some partial sums in the a array (one partial sum for each block), which are then finally added together by the cpu.
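That final CPU step might look roughly like this (a hypothetical sketch of the host-side code; the variable names are illustrative, not from the actual driver in cu_norm_0.cu):
#include <cmath>

// a_host holds the per-block partial sums copied back from the device array a
float sum = 0.0f;
for (int block = 0; block < num_blocks; ++block)
  sum += a_host[block];
// For norm, x == y, so the dot product is ||x||^2 and the norm is its square root
float norm = std::sqrt(sum);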
As we have seen, a single gpu thread is not very powerful, and having a single gpu thread adding up the partial sums is quite wasteful. A tree-based approach is much more efficient.
A simple tree-based approach, cu_norm_1.cu, is:
__global__
void dot0(int n, float* a, float* x, float* y) {
  extern __shared__ float sdata[];
  int tid    = threadIdx.x;
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;

  sdata[tid] = 0.0;
  for (int i = index; i < n; i += stride)
    sdata[tid] += x[i] * y[i];
  __syncthreads();

  // Tree-based reduction of this block's partial sums in shared memory
  for (size_t s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2 * s) == 0) {
      sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
  }
  if (tid == 0) {
    a[blockIdx.x] = sdata[0];
  }
}
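One detail worth noting for all of these kernels: because sdata is declared extern __shared__, its size must be supplied as the third parameter of the kernel launch configuration. A launch might therefore look like the following (a sketch; the provided driver code may use different names):
// Hypothetical launch: one float of shared memory per thread in the block
dot0<<<num_blocks, threads_per_block, threads_per_block * sizeof(float)>>>(n, d_a, d_x, d_y);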
Implement the reduction schemes from slides 80, 81, and 82 in cu_norm_2.cu, cu_norm_3.cu, and cu_norm_4.cu, respectively.
You can run individual programs, e.g., as
$ srun -p gpu-rtx6k -A amath --time 5:00 --gres=gpu:rtx6k ./cu_norm_1.exe 20
You can compare to sequential and omp versions with norm_seq.exe and norm_parfor.exe.
First copy your version of norms.hpp from ps6/include into ps7/include. Then build and run:
$ srun -A amath -p gpu-rtx6k --time 5:00 ./norm_seq.exe 20
$ srun -A amath -p gpu-rtx6k --time 5:00 --cpus-per-task 8 ./norm_parfor.exe 20
(You should make these on a compute node as with axpy above.)
Note
The files in the ps7 include subdirectory are not parallelized. You will need to copy your version of norms.hpp from ps6 into the ps7 include subdirectory. You can use scp to copy norms.hpp from your laptop to Hyak.
This subdirectory has script files for queue submission and plotting, similar to those in the axpy_cuda subdirectory. When you have your dot products working, submit the batch script to the queue and plot the results.
What is the max number of Gflop/s that you were able to achieve from the GPU? Overall?
Answer the following questions (append to Questions.rst): a) The most important thing I learned from this assignment was… b) One thing I am still not clear on is…
Submit your files to Gradescope. To log in to Gradescope, point your browser to gradescope.com and click the login button. On the form that pops up select “School Credentials” and then “University of Washington netid”. You will be asked to authenticate with the UW netid login page. Once you have authenticated you will be brought to a page with your courses. Select amath583sp22.
For ps7, you will find two assignments on Gradescope: "ps7 – written (for 483)" and "ps7 – written (for 583)". Starting from this assignment, we will not grade your source code with the autograder, so there is no need to submit your source code to Gradescope.
Please submit your answers to the Questions as a PDF file. You will also need to submit the plots you get in To-Do 2 (the axpy section of ps7) and in To-Do 5 (the norm_cuda section of ps7). Please append those plots to the end of your Questions.pdf file and upload only the resulting pdf file as your submission. (Separate Questions.pdf, axpy.pdf, and norm_cuda.pdf files can be combined into a single pdf using any suitable online tool.)
In this assignment, Question 8 and Question 9 are for AMATH 583 only. If you are a 583 student, please submit your answers for those to the assignment "ps7 – written (for 583)" on Gradescope. Note that you can keep all of your work in a single pdf file and submit it for both written assignments; however, you will need to make sure that the correct pages are selected for each question. If you are a 483 student, please do not submit any files to the 583-only assignment on Gradescope.
Please make sure to accurately match the relevant pages of your pdf with questions on Gradescope (otherwise you could lose points if we do not find your work when grading).
If you relied on any external resources, include the references in your document as well.