The subdirectory norm contains a skeleton for a thrust-based algorithm for computing the two-norm of a vector.

Documentation for thrust can be found at

|d_| Complete the implementation of the norm_thrust function in |_d|

You may find it helpful to refer to the axpy examples for the rest of this problem. In cu_axpy_t, we used thrust::transform to combine two vectors. To compute the norm of a single vector, we want an analogous operation: a transform over a single vector. We also want to reduce (accumulate) over the transformed values. We know how to do that efficiently from the reduction exercise, and thrust provides a reduce function.

We could do this in two steps: transform the vector into a vector holding the square of each value, and then reduce over that. That approach is inefficient, since it writes an intermediate vector to memory and then reads it back. To do the transformation and reduction in a single pass, thrust includes the function transform_reduce, which is suggested for use in this problem. You may use any technique you like to implement the operators that get passed in: a function object, a helper function, a lambda, or built-in functions.
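As a host-side illustration of the difference between the two-step and single-pass approaches, here is a minimal sketch in standard C++17. It uses std::transform_reduce from the standard library rather than thrust, and the function names two_norm_two_pass and two_norm_fused are hypothetical. Note that thrust::transform_reduce takes its arguments in a different order (first, last, unary operator, initial value, binary operator), so this is an analogy and not a drop-in body for norm_thrust.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <numeric>
#include <vector>

// Two-step version: materialize the squares, then reduce over them.
// The temporary vector costs an extra write and an extra read of N elements.
template <typename T>
T two_norm_two_pass(const std::vector<T>& x) {
  std::vector<T> squares(x.size());
  std::transform(x.begin(), x.end(), squares.begin(),
                 [](T v) { return v * v; });
  T sum = std::reduce(squares.begin(), squares.end(), T(0), std::plus<T>());
  return std::sqrt(sum);
}

// Single-pass version: each element is squared and accumulated as it is read,
// so no intermediate vector is created.
// Argument order here is the standard library's: (first, last, init, reduce, transform).
template <typename T>
T two_norm_fused(const std::vector<T>& x) {
  T sum = std::transform_reduce(x.begin(), x.end(), T(0), std::plus<T>(),
                                [](T v) { return v * v; });
  return std::sqrt(sum);
}
```

For the vector {3, 4}, both functions return 5. Because both are templates, the same code serves float and double.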

The norm_cuda subdirectory also contains norm_parfor and norm_seq, which you can run to compare to norm_cuda.

Note that norm_thrust is a function template. The driver that calls it for timing invokes two versions of it: one for single precision and one for double precision.

As with our previous norm drivers, this program will run timing tests over a range of problem sizes (in powers of two) and you can pass the upper and lower bounds in on the command line. The program does two timing runs for each of single and double and prints the results.
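The shape of such a driver loop can be sketched as follows. This is only an illustration under stated assumptions: the helper names problem_sizes and seconds_per_call are hypothetical and are not taken from the course driver, which may organize its timing differently.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: every size lo, 2*lo, 4*lo, ... that does not exceed hi.
// Assumes lo is a nonzero power of two, so every entry is a power of two.
std::vector<std::size_t> problem_sizes(std::size_t lo, std::size_t hi) {
  std::vector<std::size_t> sizes;
  // The n != 0 check stops the loop if lo is zero or if doubling wraps around.
  for (std::size_t n = lo; n != 0 && n <= hi; n *= 2) {
    sizes.push_back(n);
  }
  return sizes;
}

// Hypothetical helper: average wall-clock seconds per call of f.
// Assumes trips > 0.
template <typename F>
double seconds_per_call(F&& f, std::size_t trips) {
  auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < trips; ++i) {
    f();
  }
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count() / static_cast<double>(trips);
}
```

With bounds 1048576 and 67108864, problem_sizes yields seven sizes, one per power of two in that range.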

That’s Funny

When you run your working norm_thrust, you may get an output that looks like the following:


        N  Sequential       First      Second       First      Second
  1048576      1.7798     6.09445     6.07604    2.7595e-08    2.7595e-08
  2097152     1.68752      11.108     11.2314   4.51044e-08   4.51044e-08
  4194304     1.41257     18.6075     19.1292   2.15465e-08   2.15465e-08
  8388608     1.38701     27.5941     29.5374   3.51608e-08   3.51608e-08
 16777216     1.38567     36.9667     45.4383   4.24208e-08   4.24208e-08
 33554432     1.37679     41.2072     57.2881   8.84264e-09   8.84264e-09
 67108864     1.38583     29.4984     65.4721   2.65331e-09   2.65331e-09

        N  Sequential       First      Second       First      Second
  1048576     1.78137     6.38466     6.44605   1.36909e-11   1.36909e-11
  2097152     1.69176     11.0473     11.0473   2.58195e-11   2.58195e-11
  4194304     1.41943     17.2002     17.3459   1.87745e-12   1.87745e-12
  8388608     1.39531     24.1052     24.1052   8.97525e-12   8.97525e-12
 16777216     1.39186     29.8772     29.8772   6.11285e-13   6.11285e-13
 33554432     1.38983     33.5544     34.0407    2.9634e-12    2.9634e-12
 67108864     1.38298     36.2751     36.2751   1.04758e-11   1.04758e-11

|q_| Consider just the Second column for single and double. Why might there be a difference in performance between the two? |_q|

|q_| Consider just the First and Second columns for single precision. Why might there be a difference in performance between the two? (Hint: What data structure are we starting with in the driver? Our own Vector type. What are its element types as compared to what we are sending to the GPU?) |_q|

|t_| Extra Credit. Use nvprof or some other means to test your hypothesis for your answer above. |_t|

Warm Up: PS6 Reprise

For this problem set, the warm-up will consist of revisiting ps6 and running some of its problems on multicore nodes of Hyak.

Setting up your environment

Different development environments on Hyak are supported via the modules system.

Copy PS6 to Hyak

Use rsync to copy your ps6 directory from your laptop to Hyak.

From the directory above ps6

$ rsync -avuzb ps6

(You will have to go through the 2FA process, just as if you were logging in.)

Connect to Hyak with VS Code and verify that ps6 has been copied.

|q_| How does the performance (in GFLOP/s) of the sparse matrix methods compare to your results for dense matrix-matrix product in previous assignments? There should be a fairly large difference in GFLOP/s between sparse and dense matrix methods. Give some reasons for such a big difference. |_q|

(Include your answer in Questions.rst.)

How does the performance of CSC compare to the performance of CSR and of COO? Explain why (or why not) there are any significant differences.
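When comparing the formats, it can help to look at the loops side by side. The following is a minimal sketch of matrix-vector product for CSR and CSC storage. The struct layouts and the name matvec are hypothetical and deliberately simplified; the course's own sparse matrix classes may differ.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal compressed sparse row layout.
struct CSR {
  std::size_t num_rows;
  std::vector<std::size_t> row_ptr;  // size num_rows + 1
  std::vector<std::size_t> col_idx;  // one column index per stored value
  std::vector<double> values;
};

// Hypothetical minimal compressed sparse column layout.
struct CSC {
  std::size_t num_cols;
  std::vector<std::size_t> col_ptr;  // size num_cols + 1
  std::vector<std::size_t> row_idx;  // one row index per stored value
  std::vector<double> values;
};

// y += A * x for CSR: each row accumulates into a single y[i],
// while the reads of x go through col_idx (indirect reads).
void matvec(const CSR& A, const std::vector<double>& x, std::vector<double>& y) {
  for (std::size_t i = 0; i < A.num_rows; ++i) {
    double sum = y[i];
    for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
      sum += A.values[k] * x[A.col_idx[k]];
    }
    y[i] = sum;
  }
}

// y += A * x for CSC: each column reads a single x[j],
// while the updates of y go through row_idx (indirect writes).
void matvec(const CSC& A, const std::vector<double>& x, std::vector<double>& y) {
  for (std::size_t j = 0; j < A.num_cols; ++j) {
    const double xj = x[j];
    for (std::size_t k = A.col_ptr[j]; k < A.col_ptr[j + 1]; ++k) {
      y[A.row_idx[k]] += A.values[k] * xj;
    }
  }
}
```

Both loops perform one multiply and one add per stored value; they differ in which vector is accessed indirectly.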

Sparse-Matrix Transpose-Matrix Product