Assigned: June 3, 2020
Due: June 16, 2020
This assignment is a continuation of Problem Set #7 (Part I)
As with the previous assignment, this assignment is extracted from a tar file, located at this link.
Note
You should extract this tar file into the same subdirectory where you extracted ps7a.
The new files and subdirectories included are:
ps7/
├── Questions.rst
├── cu_axpy/
│   ├── Makefile
│   ├── cu_axpy_0.cu
│   ├── cu_axpy_1.cu
│   ├── cu_axpy_2.cu
│   ├── cu_axpy_3.cu
│   ├── cu_axpy_t.cu
│   ├── nvprof.bash
│   ├── omp_axpy.cpp
│   ├── plot.py
│   ├── script.bash
│   └── seq_axpy.cpp
├── include/
│   └── Make_cu.inc
├── norm_cuda/
│   ├── Makefile
│   ├── norm_parfor.cpp
│   ├── norm_seq.cpp
│   └── norm_thrust.cu
└── reduction/
    ├── Makefile
    ├── NsightEclipse.xml
    ├── nvprof.bash
    ├── readme.txt
    ├── reduction.cpp
    ├── reduction.h
    └── reduction_kernel.cu
The new subdirectories will sit alongside the existing ones from ps7a. One new file will be copied into an existing subdirectory: Make_cu.inc will be created in include.
After successful extraction, the complete list of subdirectories should look like this:
$ ls
axpy_cuda include norm pagerank README.rst test
hello_omp matvec norm_cuda Questions.rst reduction warmup
The existing Questions.rst file (from ps7a) will be overwritten (it was actually mistakenly copied from ps6). The correct Questions.rst file is in ps7b.tar.gz.
The problems for this assignment will be done on an AWS GPU-enabled “cluster”. You can connect to the GPU cluster in the same way as the CPU instance from part I.
Note
Cluster access Update
A small cluster with GPUs is now available. The head node is at IP 18.236.206.176. A sample ~/.ssh/config entry (substitute your own username and identity file):
Host ps7
HostName 18.236.206.176
User al75
IdentityFile /Users/lums658/.ssh/al75_id_rsa
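With an entry like this in place (and your own key set up), connecting should be as simple as:
$ ssh ps7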
Unless your laptop has an Nvidia graphics card AND you have CUDA installed and configured correctly, it won't be possible to do this assignment on your laptop.
Once you are logged on to the cluster head node, you can query the GPU using the command nvidia-smi. The head node and the compute node(s) are both GPU enabled. To find out information about the head node, you can run nvidia-smi locally:
$ nvidia-smi
Note that the GPU is actually on the compute node(s), so you will need to invoke nvidia-smi using srun:
$ srun nvidia-smi
On either machine, nvidia-smi should print out information similar to the following:
% nvidia-smi
Fri May 31 07:14:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 58C P0 73W / 149W | 0MiB / 11441MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This indicates that the GPU is of the Kepler family (a Tesla K80), that it has about 11 GiB of memory, that version 418.67 of the drivers is installed, and that version 10.1 of CUDA is installed.
The Nvidia CUDA toolkit contains a large number of substantive examples. With the latest version of the toolkit, everything related to CUDA is installed in /usr/local/cuda; the samples can be found in /usr/local/cuda/samples. I encourage you to look through some of them – the 6_Advanced subdirectory in particular has some interesting and non-trivial examples.
You can also find out more detailed information about the particular GPU being used by running the deviceQuery program. It is a sample program that comes with CUDA and can be invoked (locally and remotely) as:
$ /usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery
$ srun /usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery
The cu_axpy subdirectory of your repo contains a set of basic examples: several "cu_axpy" programs, similar to those that were presented in lecture. The executables can be compiled by issuing "make". Each program takes one argument – the log (base 2) of the problem size to run. That is, if you pass in a value of, say, 20, the problem size that is run is \(2^{20}\). The default problem size is \(2^{16}\).
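For example, assuming the executables are built with the .exe suffix used later in this assignment:
$ ./cu_axpy_1.exe 20
runs cu_axpy_1 with a problem size of \(2^{20}\).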
Note
Again, the size argument in these examples is the \(\text{log}_2\) of the size.
The programs print some timing numbers for memory allocation (etc.), as well as Gflops/sec for the axpy computation. Each program only runs the single problem size specified.
The cu_axpy subdirectory also contains a script script.bash which runs all of the axpy programs over a range of problem sizes and uses Python with matplotlib to plot the results. In addition to the basic CUDA examples, there is a sequential implementation, an OpenMP implementation, and a thrust implementation. (Python and matplotlib are already installed on the AWS instances.)
Run the file script.bash and examine the output plot. Review the slides from lecture and make sure you understand why versions 1, 2, and 3 of cu_axpy give the results that they do. Note also the performance of the sequential, OpenMP, and Thrust cases.
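On the AWS instance this should be as simple as the following (check the script itself if it expects any arguments):
$ bash script.bash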
How many more threads are run in version 2 compared to version 1? How much speedup might you expect as a result? How much speedup do you see in your plot?
How many more threads are run in version 3 compared to version 2? How much speedup might you expect as a result? How much speedup do you see in your plot? (Hint: Is the speedup a function of the number of threads launched or the number of available cores, or both?)
(AMATH 583) The cu_axpy_t program also accepts the size of the thread blocks to be used as a second command-line argument. Experiment with different block sizes with a few different problem sizes (around \(2^{24}\), plus or minus). What block size seems to give the best performance? Are there any aspects of the GPU, as reported in deviceQuery, that might point to why this would make sense?
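For example, assuming the block size is given as the second argument as described above:
$ ./cu_axpy_t.exe 24 128
would run a problem size of \(2^{24}\) with 128 threads per block.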
Nvidia has an interactive profiling tool for analyzing CUDA applications: the Nvidia visual profiler (nvvp). Unfortunately it is a graphical tool, and to use it on a remote node one must run it over a forwarded X connection, which isn't really usable due to high latencies. However, there is a command-line program (nvprof) that provides fairly good diagnostics, but you have to know what to look for.
You can see a list of the events and metrics that nvprof can provide by executing:
$ nvprof --query-events
$ nvprof --query-metrics
Complete information about CUDA programming (including a best practices guide) can be found at https://docs.nvidia.com/cuda/.
Your cu_axpy subdirectory also contains a script nvprof.bash that invokes nvprof with a few metrics of interest. Run nvprof.bash on the four cu_axpy executables:
$ bash nvprof.bash ./cu_axpy_1.exe
$ bash nvprof.bash ./cu_axpy_2.exe
$ bash nvprof.bash ./cu_axpy_3.exe
$ bash nvprof.bash ./cu_axpy_t.exe
Quantities that you might examine in order to tune performance are "achieved occupancy" (how efficiently your program is using the GPU cores) and memory throughput. Compare the achieved occupancy for the four cu_axpy programs above. The occupancy should correlate well (though not perfectly) with the performance that the programs exhibit.
Looking at some of the metrics reported by nvprof, how do metrics such as occupancy and efficiency compare to the ratio of threads launched between versions 1, 2, and 3?
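If you want to look at an individual metric outside of nvprof.bash, you can ask nvprof for it directly. For example (achieved_occupancy and gld_efficiency are standard nvprof metric names; the metrics chosen in nvprof.bash may be different):
$ nvprof --metrics achieved_occupancy,gld_efficiency ./cu_axpy_2.exe 20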
The reduction example that we looked at in class is in the subdirectory reduction. The files are copied from the CUDA set of samples and slightly modified so that the program would compile in the course repo. There are seven different kernels that can be run from the program; kernels are selected by passing in the --kernel=<#> option, where <#> can be from 0 to 6. The kernels are progressively more optimized, as we showed in lecture. The comments in the code describe the optimization applied for each successive kernel.
Compile and run the reduction program and use nvprof to see whether the different optimizations (as described in lecture and in the Nvidia documentation) actually have the effect that is claimed.
$ nvprof ./reduction --kernel=0
$ nvprof ./reduction --kernel=1
$ # etc ...
Documentation on the optimization process can also be found at: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
The subdirectory norm_cuda contains a skeleton for a thrust-based algorithm for computing the two-norm of a vector.
Documentation for thrust can be found at https://docs.nvidia.com/cuda/thrust/index.html
Complete the implementation of the norm_thrust function in norm_thrust.cu.
You may find it helpful to refer to the axpy examples for the rest of this problem. In cu_axpy_t, we used thrust::transform to combine two vectors. To compute the norm of a single vector, we want an analogous operation – transforming over a single vector. We also want to reduce (accumulate) over the transformed values (and we know how to do that efficiently – per the reduction exercise – and there is a reduce function in thrust).
We could do this in two steps – transform the vector into a vector containing the square of each value and then reduce over that. That approach is enormously inefficient, since it materializes an intermediate vector and makes two full passes over memory. To do the transformation and reduction in a single pass, thrust includes the function transform_reduce, which is suggested for use in this problem. You may use any technique you like to implement the operators that get passed in – a function object, a helper function, a lambda, or built-in functions.
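For orientation, here is a minimal sketch of the single-pass pattern using thrust::transform_reduce. The free function two_norm and the thrust::device_vector argument are illustrative assumptions only; the actual skeleton in norm_thrust.cu (and the course Vector type used by the driver) will look different.

#include <cmath>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>

// Illustrative sketch only: two-norm of a device vector in a single pass.
// thrust::square<T>() squares each element (the transform) and
// thrust::plus<T>() sums the squares (the reduce); the sqrt runs on the host.
template <typename T>
T two_norm(const thrust::device_vector<T>& x) {
  T sum_of_squares = thrust::transform_reduce(x.begin(), x.end(),
                                              thrust::square<T>(), T(0),
                                              thrust::plus<T>());
  return std::sqrt(sum_of_squares);
}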
The norm_cuda subdirectory also contains norm_parfor and norm_seq, which you can run to compare against your thrust version.
Note that norm_thrust is a function template. The driver that calls it for timing invokes two versions of it – one for single precision and one for double precision.
As with our previous norm drivers, this program will run timing tests over a range of problem sizes (in powers of two) and you can pass the upper and lower bounds in on the command line. The program does two timing runs for each of single and double and prints the results.
When you run your working norm_thrust, you may get an output that looks like the following:
$./norm_thrust.exe
Float
N Sequential First Second First Second
1048576 1.7798 6.09445 6.07604 2.7595e-08 2.7595e-08
2097152 1.68752 11.108 11.2314 4.51044e-08 4.51044e-08
4194304 1.41257 18.6075 19.1292 2.15465e-08 2.15465e-08
8388608 1.38701 27.5941 29.5374 3.51608e-08 3.51608e-08
16777216 1.38567 36.9667 45.4383 4.24208e-08 4.24208e-08
33554432 1.37679 41.2072 57.2881 8.84264e-09 8.84264e-09
67108864 1.38583 29.4984 65.4721 2.65331e-09 2.65331e-09
Double
N Sequential First Second First Second
1048576 1.78137 6.38466 6.44605 1.36909e-11 1.36909e-11
2097152 1.69176 11.0473 11.0473 2.58195e-11 2.58195e-11
4194304 1.41943 17.2002 17.3459 1.87745e-12 1.87745e-12
8388608 1.39531 24.1052 24.1052 8.97525e-12 8.97525e-12
16777216 1.39186 29.8772 29.8772 6.11285e-13 6.11285e-13
33554432 1.38983 33.5544 34.0407 2.9634e-12 2.9634e-12
67108864 1.38298 36.2751 36.2751 1.04758e-11 1.04758e-11
Consider just the Second column for single and double. Why might there be a difference in performance between the two?
Consider just the First and Second columns for single precision. Why might there be a difference in performance between the two?
(Hint: What data structure are we starting with in the driver? Our own Vector type. What are its element types as compared to what we are sending to the GPU?)
Extra Credit. Use nvprof or some other means of testing your hypothesis for your answer above.
In the previous problem sets we considered block and strided approaches to parallelizing norm. We found that the strided approach had unfavorable access patterns to memory and gave much lower performance than the blocked approach.
But, consider one of the kernels we launched for axpy (this is from cu_axpy_2):
__global__ void madd(int n, float a, float* x, float* y) {
  int index  = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = a * x[i] + y[i];
}
It is strided!
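For comparison, the multi-block kernels (in the spirit of cu_axpy_3, though the exact course code may differ) use a grid-stride loop, which is also strided, just with a stride equal to the total number of threads launched:

__global__ void madd_grid_stride(int n, float a, float* x, float* y) {
  // Each thread starts at its global index and hops by the total thread count.
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = a * x[i] + y[i];
}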
Compare and contrast strided partitioning for task-based parallelism (e.g., OpenMP or C++ tasks) with strided partitioning for GPU. Why is it bad in the former case but good (if it is) in the latter case?
Based on one or both of norm_cuda and cu_axpy_t, implement a program that uses thrust to reduce over a vector (array) of floating point numbers. Compare the performance to the hand-optimized versions in the reduction subdirectory.
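A minimal sketch of the thrust side might look like the following (the problem size, fill value, and program structure are illustrative; the timing and comparison harness against the reduction/ kernels is up to you):

#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

int main() {
  size_t n = 1 << 24;                       // illustrative problem size
  thrust::device_vector<float> x(n, 1.0f);  // filled with ones, so the expected sum is n
  // Sum-reduce on the device; thrust chooses and launches the kernel internally.
  float sum = thrust::reduce(x.begin(), x.end(), 0.0f, thrust::plus<float>());
  std::printf("sum = %f (expected %zu)\n", sum, n);
  return 0;
}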
Do a make clean in your ps7 directory and then create a tarball ps7-submit.tar.gz containing all of the files in that directory. From the directory containing ps7:
$ cd ps7
$ make clean
$ cd ..
$ tar zcvf ps7-submit.tar.gz --exclude "./ps7/.vscode" ./ps7
If you relied on any external references, list them in a file refs.txt and include that in your tarball as well (in the ps7 subdirectory).
It’s okay if the directory (and tar file) are named ps7a instead of ps7.