The subdirectory norm contains a skeleton for a thrust-based algorithm for computing the two-norm of a vector.
Documentation for thrust can be found at https://docs.nvidia.com/cuda/thrust/index.html.
|d_|
Complete the implementation of the norm_thrust function in norm_thrust.cu.
|_d|
You may find it helpful to refer to the axpy examples for the rest of
this problem. In cu_axpy_t, we used thrust::transform
to combine two vectors. To compute the norm of a single vector, we want an analogous operation: transforming over a single vector. We also want to reduce (accumulate) over the transformed values; we know how to do that efficiently from the reduction exercise, and thrust provides a reduce function.
We could do this in two steps: transform the vector into a vector holding the square of each value, and then reduce over that. That approach is inefficient, however, since it materializes an intermediate vector and makes two passes over the data. To do the transformation and reduction in a single pass, thrust provides the function transform_reduce, which is the suggested approach for this problem. You may use any technique you like to implement the operators that get passed in: a function object, a helper function, a lambda, or built-in functions.
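As a rough sketch only (this is not the provided skeleton; it assumes the data is already in a thrust::device_vector and uses a function object for the squaring step), a transform_reduce-based two-norm could look like:

#include <cmath>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>

// Unary operator for the transform step: square each element.
template <typename T>
struct square_op {
  __host__ __device__ T operator()(T x) const { return x * x; }
};

// Single-pass two-norm sketch: square each element and sum the squares
// on the device, then take the square root on the host.
template <typename T>
T norm_sketch(const thrust::device_vector<T>& v) {
  T sum_of_squares = thrust::transform_reduce(
      v.begin(), v.end(), square_op<T>(), T(0), thrust::plus<T>());
  return std::sqrt(sum_of_squares);
}

A lambda or thrust::square (if available in your thrust version) would work equally well for the squaring operator.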
The norm_cuda subdirectory also contains norm_parfor and norm_seq, which you can run to compare to norm_cuda.
Note that norm_thrust is a function template. The driver that calls it for timing invokes two versions of it, one for single precision and one for double precision.
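For illustration only (the variable names here are hypothetical; the actual driver is provided in the skeleton), the two invocations amount to something like:

  float  nf = norm_thrust(x_float);    // instantiates norm_thrust<float>
  double nd = norm_thrust(x_double);   // instantiates norm_thrust<double>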
As with our previous norm drivers, this program runs timing tests over a range of problem sizes (in powers of two), and you can pass the lower and upper bounds on the command line. The program performs two timing runs each for single and double precision and prints the results.
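For example (the argument format here is an assumption; check the provided driver for its exact convention), a run covering 2^20 through 2^26 elements might be launched with the base-2 exponents as bounds:

$ ./norm_thrust.exe 20 26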
When you run your working norm_thrust, you may get an output that looks like the following:
$./norm_thrust.exe
Float
N Sequential First Second First Second
1048576 1.7798 6.09445 6.07604 2.7595e-08 2.7595e-08
2097152 1.68752 11.108 11.2314 4.51044e-08 4.51044e-08
4194304 1.41257 18.6075 19.1292 2.15465e-08 2.15465e-08
8388608 1.38701 27.5941 29.5374 3.51608e-08 3.51608e-08
16777216 1.38567 36.9667 45.4383 4.24208e-08 4.24208e-08
33554432 1.37679 41.2072 57.2881 8.84264e-09 8.84264e-09
67108864 1.38583 29.4984 65.4721 2.65331e-09 2.65331e-09
Double
N Sequential First Second First Second
1048576 1.78137 6.38466 6.44605 1.36909e-11 1.36909e-11
2097152 1.69176 11.0473 11.0473 2.58195e-11 2.58195e-11
4194304 1.41943 17.2002 17.3459 1.87745e-12 1.87745e-12
8388608 1.39531 24.1052 24.1052 8.97525e-12 8.97525e-12
16777216 1.39186 29.8772 29.8772 6.11285e-13 6.11285e-13
33554432 1.38983 33.5544 34.0407 2.9634e-12 2.9634e-12
67108864 1.38298 36.2751 36.2751 1.04758e-11 1.04758e-11
|q_| Consider just the Second column for single and double. Why might there be a difference in performance between the two? |_q|
|q_|
Consider just the First and Second columns for single precision. Why might there be a difference in performance between the two?
(Hint: What data structure are we starting with in the driver? Our own Vector
type. What are its element types as compared to what we are sending to the GPU?)
|_q|
|t_| Extra Credit. Use nvprof or some other means to test the hypothesis in your answer above. |_t|
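For example, a minimal nvprof invocation might look like the following (nvprof ships with older CUDA toolkits; on newer ones you may need Nsight Systems/Compute instead):

$ nvprof ./norm_thrust.exe
$ nvprof --print-gpu-trace ./norm_thrust.exe   # per-kernel and per-memcpy timings

The GPU trace lists each kernel launch and memory copy along with its duration.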
For this problem set, the warm-up will consist of revisiting ps6 and running some of those problems on the multicore nodes of Hyak.
Different development environments on Hyak are supported via the modules system.
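For example, the usual modules commands apply (the specific module names available on Hyak are not listed here; use module avail to see them):

$ module avail            # list the modules available on the cluster
$ module load <module>    # load a particular compiler/toolchain module
$ module list             # show what is currently loaded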
Use rsync to copy your ps6 directory from your laptop to Hyak.
From the directory above ps6:
$ rsync -avuzb ps6 klone.hyak.uw.edu:
(You will have to go through the 2FA process, just as if you were logging in.)
Connect to Hyak with VS Code and verify that ps6 has been copied.
|q_| How does the performance of the sparse matrix methods (in GFLOP/s) compare to your results for dense matrix-matrix product in previous assignments? There should be a fairly large difference in GFLOP/s between sparse and dense matrix methods. Give some reasons for such a big difference. |_q|
(Include your answer in Questions.rst.)
|q_| How does the performance of CSC compare to the performance of CSR and of COO? Explain why (or why not) there are any significant differences. |_q|
(Include your answer in Questions.rst.)