Assigned: May 30, 2020
Due: Jun 5, 2020
Lectures 15, 16 (CUDA)
Lectures 21, 22 (C++ templates and generic programming)
Walk through of connecting to AWS using VS Code
As with the previous assignment, this assignment extracts into a directory ps7. The tar file is at this link.
When you extract the files, you will see the following directory structure, which is similar to, but not exactly the same as the previous assignment:
ps7/
├── Questions.rst
├── README.rst
├── data/
├── hello_omp/
│   ├── Makefile
│   ├── getenv.cpp
│   ├── hello_omp_2_threads.cpp
│   ├── hello_omp_critical.cpp
│   ├── hello_omp_num_threads.cpp
│   ├── hello_omp_parallel.cpp
│   ├── hello_omp_parallel_critical.cpp
│   └── ompi_info.cpp
├── include/
│   ├── AOSMatrix.hpp
│   ├── COOMatrix.hpp
│   ├── CSCMatrix.hpp
│   ├── CSRMatrix.hpp
│   ├── Make.inc
│   ├── Matrix.hpp
│   ├── Timer.hpp
│   ├── Vector.hpp
│   ├── amath583.hpp      /* Everything is in here now */
│   ├── amath583IO.hpp
│   ├── catch.hpp
│   ├── getenv.hpp
│   ├── norm_utils.hpp
│   ├── norms.hpp
│   ├── pagerank.hpp
│   └── util.hpp          /* NEW */
├── matvec/
│   ├── Makefile
│   ├── matmat.cpp
│   ├── matvec.cpp
│   ├── pmatmat.cpp
│   └── pmatvec.cpp
├── norm/
│   ├── Makefile
│   ├── norm_block_critical.cpp
│   ├── norm_block_reduction.cpp
│   ├── norm_cyclic_critical.cpp
│   ├── norm_cyclic_reduction.cpp
│   ├── norm_parfor.cpp
│   └── norm_seq.cpp
├── pagerank/
│   ├── Makefile
│   └── pagerank.cpp
├── test/
│   ├── Makefile
│   ├── matmat_test.cpp
│   ├── matvec_test.cpp
│   ├── norms_test.cpp
│   ├── pagerank_test.cpp
│   ├── pmatmat_test.cpp
│   └── pmatvec_test.cpp
└── warmup/
    ├── Makefile
    ├── hello_script.bash
    ├── nonsense.cpp
    ├── overloads.cpp
    └── specialization.cpp
Note that:
There is one new file: include/util.hpp
There is no longer a src
subdirectory (nor the library source files)
There is no longer include/amath583sparse.hpp
Most (all?) of the functions have been completed with a reference (my own) implementation.
In addition to the setup for connecting to AWS, you should update the setting “C_Cpp > Default: Cpp Standard” to c++17.
Per the walkthrough video, connect to AWS:
Copy or move your public and private keys to your .ssh subdirectory.
Use VS Code (or an editor of your choice, if you feel comfortable doing so) to create (or modify) the config file .ssh/config so that it contains an entry for the AWS instance.
Copy your existing ps6 subdirectory there using rsync (not ps7)
Rerun the timing programs in matvec, norm, and pagerank. (You may also find running those in hello_omp to be interesting). These aren’t “official” timings, but are just to give you a feel for a platform with a few more cores than you may have already been using (we will be using even more a bit further down).
Copy ps7 to AWS – you can either copy the whole directory with rsync after you have extracted it on your laptop or copy the tar file (using rsync or scp) and extract the tar files there. I suggest keeping your ps7 files in just one place – AWS – so that you don’t get into the problem of making some edits in one place and some in another and not keeping different locations consistent. There are tools to manage that process (source code control, e.g., git) but those take a while to get up to speed with and can cause more trouble than they save in the meantime.
As a reminder, an entry in .ssh/config
should look like the following:
# Read more about SSH config files: https://linux.die.net/man/5/ssh_config
Host ps6
    HostName 54.184.35.40
    User al75
    IdentityFile /Users/lums658/.ssh/al75_id_rsa
Replace “al75” with your UW net ID, and change the IdentityFile path to point at the key in your own .ssh subdirectory.
Note
EFS
The home directories on our AWS instance are stored in Amazon’s EFS (elastic file system). We will be using different machines for multicore, GPU, and cluster assignments, but they will all connect to the same file system – the same files will be on any machine we use.
Most (all?) modern C++ libraries are expected to compose with other libraries – libraries written by other authors, at other times, in other places. Being able to safely and reliably (and confidently) compose independently developed libraries is of utmost importance for constructing large software systems from smaller software components.
In C++ this mechanism is provided by type parameterization (“templates”).
In the pagerank problems in previous assignments, we found that certain data structure choices offered better performance in different situations – we might want to use CSRMatrix
in some situations or CSCMatrix
in others.
But consider the implementations of pagerank
if we wanted to support both:
Vector pagerank(const CSRMatrix& P, double alpha, double tol, size_t max_iters) {
  Vector x(P.num_rows(), 1.0 / static_cast<double>(P.num_rows()));
  for (size_t i = 0; i < max_iters; ++i) {
    Vector y = mult(x, P);
    y *= alpha;
    y += (1.0 - alpha) / static_cast<double>(x.num_rows());
    if (two_norm(x - y) < tol * two_norm(x)) {
      return y;
    }
    std::swap(x, y);
  }
  return x;
}
Vector pagerank(const CSCMatrix& P, double alpha, double tol, size_t max_iters) {
  Vector x(P.num_rows(), 1.0 / static_cast<double>(P.num_rows()));
  for (size_t i = 0; i < max_iters; ++i) {
    Vector y = mult(x, P);
    y *= alpha;
    y += (1.0 - alpha) / static_cast<double>(x.num_rows());
    if (two_norm(x - y) < tol * two_norm(x)) {
      return y;
    }
    std::swap(x, y);
  }
  return x;
}
There is literally only one character that differs between the two functions. But, we need both if we are going to support both CSCMatrix
and CSRMatrix
(and we would need a new one for COOMatrix
and one for AOSMatrix
and one for any new sparse matrix type).
Similarly, in the body of the pagerank
function, we have a call to mult
, for which we then also need multiple almost-identical-except-for-one-character implementations (one for every matrix type):
void mult(const Vector& x, const CSCMatrix& A, Vector& y) {
  A.t_matvec(x, y);
}

void mult(const Vector& x, const CSRMatrix& A, Vector& y) {
  A.t_matvec(x, y);
}
Notice that in each of these cases, nothing about the functions depends on implementation details of the matrix type (which is why we have identical function bodies for both). This is a fundamental principle of generic programming: functions should be written in terms of what they need from their types in order to operate correctly and efficiently. Types then need to be written to adhere to those requirements. (Generic programming is sometimes – rightly – characterized as “algorithm-oriented” programming.)
To have a slightly richer function body, let’s consider the case of dense row-oriented and column-oriented matrices – the RowMatrix and ColMatrix classes, respectively (they have been lurking in Matrix.hpp all quarter). The matrix multiply algorithms for each class would be very similar:
void multiply(const RowMatrix& A, const RowMatrix& B, RowMatrix& C) {
  for (size_t i = 0; i < A.num_rows(); ++i) {
    for (size_t j = 0; j < B.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}
void multiply(const ColMatrix& A, const ColMatrix& B, ColMatrix& C) {
  for (size_t i = 0; i < A.num_rows(); ++i) {
    for (size_t j = 0; j < B.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}
In fact, the function bodies would be the same as for the dense Matrix
class
we have been using all quarter. Moreover,
the functions (which are overloads) are invoked identically:
RowMatrix A(size), B(size), C(size);
ColMatrix D(size), E(size), F(size);
multiply(A, B, C);
multiply(D, E, F);
Notice that RowMatrix
and ColMatrix
have the same interface based primarily on operator()
–
and deliberately so. A matrix has particular semantics and both
RowMatrix
and ColMatrix
should present those. The difference
between the two types has nothing to do with how they are used – the
difference is simply how they internally represent the matrix.
Now, despite having the same interfaces and same semantics, since
RowMatrix
and ColMatrix
are different types, we have to have
different multiply
functions for them. That is, although the syntax for accessing elements is the same in each case – a call to operator()
– these are actually different functions and in the code that is actually running, the operator()
function belonging to the class being used is the one that needs to be called. That is, if we are multiplying two RowMatrix
matrices together, the operator()
inside multiply
needs to be the function belonging to RowMatrix
(and similarly for ColMatrix
). So even though the syntax of the function bodies is exactly the same, they are in fact doing different things (calling completely different operator()
functions) – and so we actually need two different functions. If we have additional matrix types, we need functions for those as well.
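To make that concrete, here is a rough sketch of how the two classes might be laid out internally (this is illustrative; the actual definitions live in Matrix.hpp and may differ):

#include <cstddef>
#include <vector>

// Hypothetical sketches (not the actual Matrix.hpp definitions) of row-major
// and column-major storage.  The interface is identical; only the indexing
// inside operator() differs.
class RowMatrix {
public:
  RowMatrix(size_t M, size_t N) : num_rows_(M), num_cols_(N), storage_(M * N) {}
  size_t num_rows() const { return num_rows_; }
  size_t num_cols() const { return num_cols_; }
  double&       operator()(size_t i, size_t j)       { return storage_[i * num_cols_ + j]; }  // row-major
  const double& operator()(size_t i, size_t j) const { return storage_[i * num_cols_ + j]; }
private:
  size_t num_rows_, num_cols_;
  std::vector<double> storage_;
};

class ColMatrix {
public:
  ColMatrix(size_t M, size_t N) : num_rows_(M), num_cols_(N), storage_(M * N) {}
  size_t num_rows() const { return num_rows_; }
  size_t num_cols() const { return num_cols_; }
  double&       operator()(size_t i, size_t j)       { return storage_[i + j * num_rows_]; }  // column-major
  const double& operator()(size_t i, size_t j) const { return storage_[i + j * num_rows_]; }
private:
  size_t num_rows_, num_cols_;
  std::vector<double> storage_;
};

With definitions like these, C(i, j) += A(i, k) * B(k, j) compiles against either class, but the operator() that actually runs (and hence the memory access pattern) is different for each.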
To really confuse matters, suppose you wanted to use ColMatrix
and RowMatrix
together? You would need eight multiply
functions (I just show the prototypes here, the function bodies are all the same):
void multiply(const ColMatrix& A, const ColMatrix& B, ColMatrix& C);
void multiply(const ColMatrix& A, const ColMatrix& B, RowMatrix& C);
void multiply(const ColMatrix& A, const RowMatrix& B, ColMatrix& C);
void multiply(const ColMatrix& A, const RowMatrix& B, RowMatrix& C);
void multiply(const RowMatrix& A, const ColMatrix& B, ColMatrix& C);
void multiply(const RowMatrix& A, const ColMatrix& B, RowMatrix& C);
void multiply(const RowMatrix& A, const RowMatrix& B, ColMatrix& C);
void multiply(const RowMatrix& A, const RowMatrix& B, RowMatrix& C);
If we had three matrix types we would need 27 versions. And so on.
And yet, the function bodies are all the same. If you had to maintain a library with a combinatorially large number of functions, all identical except for the types, you might start thinking about automatically generating the functions. That is, if you knew you needed a multiply for a RowMatrix with a RowMatrix into a ColMatrix – or a FooMatrix with a BarMatrix into a BazMatrix – you might write a script that takes the three types and generates the necessary function:
def gen_multiply(m1, m2, m3):
    print('''
void multiply(const {first}& A, const {second}& B, {third}& C) {{
  for (size_t i = 0; i < A.num_rows(); ++i) {{
    for (size_t j = 0; j < B.num_cols(); ++j) {{
      for (size_t k = 0; k < A.num_cols(); ++k) {{
        C(i, j) += A(i, k) * B(k, j);
      }}
    }}
  }}
}}
'''.format(first=m1, second=m2, third=m3))
When we invoke this script:
gen_multiply("RowMatrix", "RowMatrix", "ColMatrix")
We get the specific function we want:
void multiply(const RowMatrix& A, const RowMatrix& B, ColMatrix& C) {
  for (size_t i = 0; i < A.num_rows(); ++i) {
    for (size_t j = 0; j < B.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}
Let’s think for a second about what our script is doing. It is a “template” if you like (pun intended) for the function we really want. Notice that everywhere we want to stamp something in we have a formatting placeholder ({first}
, {second}
, {third}
). We could spread those placeholders out wherever we like and still generate a correct function (assuming the function is correct when the type name is substituted for the placeholder).
So if we had a matvec
generating script for the two-argument matvec
(the one that returns the result):
def gen_matvec(m1, v1):
    print('''
{second} matvec(const {first}& A, const {second}& x) {{
  {second} y(x.num_rows());
  matvec(A, x, y);
  return y;
}}
'''.format(first=m1, second=v1))
We could invoke it as
gen_matvec("RowMatrix", "Vector")
to obtain
Vector matvec(const RowMatrix& A, const Vector& x) {
  Vector y(x.num_rows());
  matvec(A, x, y);
  return y;
}
Basically, everywhere we would have a type, we can put in a placeholder and then stamp out the function we want.
Of course, this isn’t how we are really going to do things. Although it saves all of the duplication of code – we only need the “template” that is in the script – the process for using this for real programs would be quite unwieldy (i.e., impossible to really do). You have two different software systems coupled to each other – we have to feed the types we want to Python which would generate the C++ functions we want, which we would have to feed back to C++.
But the essential capability – write the function as a template in terms of placeholders that are filled in with types when you need a particular version of the function – would seem to be, well, essential for the types of situations above (these situations are ubiquitous in real code).
C++ does in fact have such a template mechanism. It lets you write function templates (and, in fact, class templates) that are then stamped out on demand to create type-specific functions as you need them (and then compiled). It is important to keep in mind that that is what is really happening – which is why I illustrated this first in terms of an external script. The difference with it all happening in C++ is that you will not see the generated function (or class) as such – but C++ is stamping them out and compiling them just as if we had generated them manually.
Consider a multiply
function template:
template <class Matrix_1_t, class Matrix_2_t, class Matrix_3_t>
void multiply(const Matrix_1_t& A, const Matrix_2_t& B, Matrix_3_t& C) {
  for (size_t i = 0; i < A.num_rows(); ++i) {
    for (size_t j = 0; j < B.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}
and compare this to the script we wrote to do the same thing. We are telling C++ that we are defining a function template that has three placeholders and then writing a function in terms of those placeholders.
If we now want to multiply a RowMatrix
by a RowMatrix
into a ColMatrix
, we just need to invoke the function template:
RowMatrix A(10, 10);
RowMatrix B(10, 10);
ColMatrix C(10, 10);
multiply<RowMatrix, RowMatrix, ColMatrix>(A, B, C);
When the compiler runs across the call to multiply<RowMatrix, RowMatrix, ColMatrix>
, it generates the necessary function (unless it has already generated that function from a previous use with the same types). Then, since this is a function call, it makes a call to that generated function.
Now, we have to be careful: just as we can’t call a function defined for certain types with arguments of other types, we have to make sure that the types of the arguments match the types passed to the placeholders. This would fail to compile, for example:
RowMatrix A(10, 10);
RowMatrix B(10, 10);
ColMatrix C(10, 10);
multiply<RowMatrix, RowMatrix, RowMatrix>(A, B, C);
We are defining a multiply for three RowMatrix matrices but passing in a ColMatrix as the third argument. Fortunately, in many cases, C++ can infer the types you want for the placeholders based on the arguments you pass it (template argument deduction). For multiply, we just need to invoke it as:
RowMatrix A(10, 10);
RowMatrix B(10, 10);
ColMatrix C(10, 10);
multiply(A, B, C);
and C++ can match the type of A to the placeholder for A, the type of B for the placeholder for B, and the type of C for the placeholder of C. Sometimes C++ may not be able to do this, in which case you may need to pass in explicit types to create the function you want from the function template.
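To see where this leaves our motivating example, the two pagerank overloads from the beginning of this discussion can now collapse into one function template. Below is a sketch, assuming the Vector, mult, and two_norm operations used earlier (the version in ps7a may differ in its details):

template <class SparseMatrix_t>
Vector pagerank(const SparseMatrix_t& P, double alpha, double tol, size_t max_iters) {
  Vector x(P.num_rows(), 1.0 / static_cast<double>(P.num_rows()));
  for (size_t i = 0; i < max_iters; ++i) {
    Vector y = mult(x, P);     // dispatches to whichever mult matches P's type
    y *= alpha;
    y += (1.0 - alpha) / static_cast<double>(x.num_rows());
    if (two_norm(x - y) < tol * two_norm(x)) {
      return y;
    }
    std::swap(x, y);
  }
  return x;
}

The same template serves CSRMatrix, CSCMatrix, COOMatrix, AOSMatrix, or any future sparse matrix type, as long as the type provides what the function body needs.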
Templates are an extraordinarily powerful mechanism in C++. In fact, the template mechanism in C++ is its own Turing-complete language – one can perform arbitrary compile-time operations using it. (C++ has more recently introduced the constexpr mechanism for compile-time computation, which should generally be used instead of template metaprogramming.)
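As a small taste of constexpr (a generic illustration, not code from the assignment):

#include <cstddef>

// Evaluated entirely by the compiler; fibonacci(10) is a compile-time constant.
constexpr size_t fibonacci(size_t n) {
  return (n < 2) ? n : fibonacci(n - 1) + fibonacci(n - 2);
}

static_assert(fibonacci(10) == 55, "checked at compile time, not at run time");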
There are a few things to be aware of when using function templates.
1. Function templates are not functions. As the name implies, function templates are, well, templates. They are not actual functions and are not compiled – by themselves they don’t have enough information to be compiled. The type parameter needs to be bound to an actual type – then the compiler has enough information to compile the function. One refers to the creation of a function when the type parameter is bound as “instantiation.”
2. The model for using function templates is different than for concrete functions. To use a concrete function in a program we need its declaration. The function itself can exist elsewhere and be compiled separately from where it is being called. We can’t compile a function template separately – we don’t even know what it will be until we compose it with a particular type. When the function template is composed with a particular type, then the compiler can create an actual function based on this composition (an instantiation) and compile it. Accordingly, we have
3. Function templates are included completely in header files. The full text of the function template needs to be present when it is instantiated, so the full text of the function template is included as its declaration and definition in the header file.
4. Errors in function templates only become exposed upon composition. Since the function is compiled only when it is composed with a particular type, we won’t see any errors in the function (or with the composition) until the composition occurs and the compiler attempts to compile the composed function. Although error messages are getting better (and C++ will be including methods for better checking of templates, i.e., concepts), error messages from errors in function templates are notoriously verbose.
5. Instantiated function templates perform exactly as the equivalent
concrete function. That is, the function template multiply
will
give exactly the same performance when composed with RowMatrix
or
ColMatrix
as the earlier functions specifically written for those
types.
“Aha, not so fast!” you might be saying. “We don’t necessarily want to use the same function for RowMatrix and ColMatrix. Different loop orderings can provide different performance, depending on the matrix orientation.”
This is true. And the C++ template mechanism provides a feature just for handling this issue – that is, the issue of needing selectively different functionality (to get better performance, for example) for a given type.
The file specialization.cpp
in the warmup
subdirectory shows how
this is done. The generic function template is this
template <class Matrix_1_t, class Matrix_2_t, class Matrix_3_t>
void multiply(const Matrix_1_t& A, const Matrix_2_t& B, Matrix_3_t& C) {
  std::cout << "Generic multiply" << std::endl;
  for (size_t i = 0; i < C.num_rows(); ++i) {
    for (size_t j = 0; j < C.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}
(Note we’ve added a diagnostic print statement just for this example.)
We can create specializations of multiply for particular values of the placeholders; e.g., the specialization for ColMatrix is this:
template <>
void multiply(const ColMatrix& A, const ColMatrix& B, ColMatrix& C) {
  std::cout << "Specialized ColMatrix multiply" << std::endl;
  for (size_t j = 0; j < C.num_cols(); ++j) {
    for (size_t k = 0; k < A.num_cols(); ++k) {
      for (size_t i = 0; i < C.num_rows(); ++i) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}
If we compile this program
int main() {
  RowMatrix A(10, 10);
  RowMatrix B(10, 10);
  RowMatrix C(10, 10);

  ColMatrix D(10, 10);
  ColMatrix E(10, 10);
  ColMatrix F(10, 10);

  multiply(A, B, C);
  multiply(D, E, F);
  multiply(A, B, D);

  return 0;
}
We get
$ make specialization.exe
$ ./specialization.exe
Specialized RowMatrix multiply
Specialized ColMatrix multiply
Generic multiply
When deciding how to instantiate a function template, the compiler examines the types of the arguments being passed in and attempts to match them against any specializations; if none match, it uses the base template.
The C++ template mechanism is exceedingly powerful – we’ve only barely scratched the surface here. In fact, the template system is “Turing complete” – meaning you can write arbitrary programs with it, using a technique known as “template metaprogramming”. These are programs that are executed by the compiler itself at compile time and that manipulate the program being compiled (the syntax for template metaprogramming is Byzantine, to say the least, however).
A number of techniques have been developed over the years to try to rein in some of the excesses of template metaprogramming, as well as some of the deficiencies of templates themselves. The most important of these is concepts.
The basic idea behind concepts is this: despite that power (or maybe because of it), there are some big problems with the fundamental notion of a function template as a stencil that we can just stamp types into. For example, with our multiply template we can invoke it as:
std::string A = "ten";
int B = 10;
Matrix C(10, 10);
multiply(A, B, C);
This is nonsense of course, but the compiler doesn’t know that. The compiler does for these types what it does for any types: it substitutes the indicated types for the placeholders and then tries to compile the result. You will get errors from the resulting compilation process – but the error won’t be “this is nonsense, you can’t pass a string when you said you wanted a matrix”; rather, it may be errors from deep inside the function(s) being compiled. For the example above:
$ make nonsense.exe
nonsense.cpp: In instantiation of 'void multiply(const Matrix_1_t&, const Matrix_2_t&, Matrix_3_t&) [with Matrix_1_t = std::__cxx11::basic_string<char>; Matrix_2_t = int; Matrix_3_t = Matrix]':
nonsense.cpp:24:19: required from here
nonsense.cpp:12:32: error: 'const class std::__cxx11::basic_string<char>' has no member named 'num_cols'
12 | for (size_t k = 0; k < A.num_cols(); ++k) {
| ~~^~~~~~~~
nonsense.cpp:13:21: error: no match for call to '(const std::__cxx11::basic_string<char>) (size_t&, size_t&)'
13 | C(i, j) += A(i, k) * B(k, j);
| ~^~~~~~
nonsense.cpp:13:31: error: expression cannot be used as a function
13 | C(i, j) += A(i, k) * B(k, j);
| ~^~~~~~
(“Instantiation” is the technical term C++ uses for stamping a type into a template).
This happens because (as of C++17) the placeholders are just placeholders, with no semantic meaning. The upcoming “concepts” feature in C++20 will provide a formal mechanism for attaching meaning to placeholders, so that the compiler will be able to tell you “what you are asking is nonsense”.
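As a preview, a concept-constrained multiply might look something like the sketch below (illustrative only; the requirement list shown is an assumption and this is not part of the assignment code):

#include <concepts>
#include <cstddef>

// A hypothetical concept stating what multiply needs from a dense matrix type.
template <class M>
concept DenseMatrix = requires(const M& m, size_t i, size_t j) {
  { m.num_rows() } -> std::convertible_to<size_t>;
  { m.num_cols() } -> std::convertible_to<size_t>;
  { m(i, j) }      -> std::convertible_to<double>;
};

// Calling this with a std::string would be rejected with "constraint not
// satisfied" at the call site, rather than with errors from inside the body.
template <DenseMatrix Matrix_1_t, DenseMatrix Matrix_2_t, DenseMatrix Matrix_3_t>
void multiply(const Matrix_1_t& A, const Matrix_2_t& B, Matrix_3_t& C);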
There is a more problematic difficulty that also comes up due to not having any semantic information attached to placeholders. Over the quarter we have worked with two types of matrix-matrix product, one for dense matrices and one for sparse matrices. Now, of course, for this course (ha!), we want to have both versions of multiply
available – and we have. But now, we want to have them available as function templates.
So let’s define a multiply
template for sparse matrices.
template <class SparseMatrix_t, class Matrix_1_t, class Matrix_2_t>
void multiply(const SparseMatrix_t& A, const Matrix_1_t& B, Matrix_2_t& C) {
  A.matmat(B, C);
}
If we try to compile this we run into a problem:
$ make overloads.exe
overloads.cpp:28:6: error: redefinition of 'template<class SparseMatrix_t, class Matrix_1_t, class Matrix_2_t> void multiply(const SparseMatrix_t&, const Matrix_1_t&, Matrix_2_t&)'
28 | void multiply(const SparseMatrix_t& A, const Matrix_1_t& B, Matrix_2_t& C) {
| ^~~~~~~~
overloads.cpp:17:6: note: 'template<class Matrix_1_t, class Matrix_2_t, class Matrix_3_t> void multiply(const Matrix_1_t&, const Matrix_2_t&, Matrix_3_t&)' previously declared here
17 | void multiply(const Matrix_1_t& A, const Matrix_2_t& B, Matrix_3_t& C) {
| ^~~~~~~~
overloads.cpp: In instantiation of 'void multiply(const Matrix_1_t&, const Matrix_2_t&, Matrix_3_t&) [with Matrix_1_t = CSRMatrix; Matrix_2_t = Matrix; Matrix_3_t = Matrix]':
overloads.cpp:38:19: required from here
overloads.cpp:21:21: error: no match for call to '(const CSRMatrix) (size_t&, size_t&)'
21 | C(i, j) += A(i, k) * B(k, j);
| ~^~~~~~
make: *** [overloads.o] Error 1
The problem is that we have defined two templates named multiply. We need two such templates, one for sparse matrices and one for dense – but to the compiler, the placeholders are just placeholders; there isn’t any way (without using concepts) to directly tell the compiler to use one version if the first placeholder gets associated with a dense matrix and a different version if it is associated with a sparse matrix.
Warning
Magic Ahead
While it is true that we can’t directly use concepts (yet), we can trick the compiler into at least selecting the right overload based on arbitrary properties of the types that we pass to the templates. Among all of the potential (valid) overloads available to the compiler, all that is needed is that there be a single best valid one. Candidates for which substituting the types produces an error are not generated and compiled – they are simply removed from consideration (a property known as “substitution failure is not an error”, or SFINAE).
There are various ways we can leverage this property to prune templates from being considered. What we are going to do is the following:
template <class Matrix_1_t, class Matrix_2_t, class Matrix_3_t,
          std::enable_if_t<   is_dense_matrix<Matrix_1_t>::value
                           && is_dense_matrix<Matrix_2_t>::value
                           && is_dense_matrix<Matrix_3_t>::value>* = nullptr>
void multiply(const Matrix_1_t& A, const Matrix_2_t& B, Matrix_3_t& C) {
  for (size_t i = 0; i < C.num_rows(); ++i) {
    for (size_t j = 0; j < C.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}
template <class SparseMatrix_t, class Matrix_1_t, class Matrix_2_t,
          std::enable_if_t<   is_sparse_matrix<SparseMatrix_t>::value
                           && is_dense_matrix<Matrix_1_t>::value
                           && is_dense_matrix<Matrix_2_t>::value>* = nullptr>
void multiply(const SparseMatrix_t& A, const Matrix_1_t& B, Matrix_2_t& C) {
  A.matmat(B, C);
}
Now, this can look extremely intimidating, but let’s break down what is happening, say, in the first multiply
. If we ignore the template mechanisms, what is left is just what we have been writing all quarter for multiply
(written in terms of placeholders rather than concrete types).
template <class Matrix_1_t, class Matrix_2_t, class Matrix_3_t,
          std::enable_if_t<   is_dense_matrix<Matrix_1_t>::value
                           && is_dense_matrix<Matrix_2_t>::value
                           && is_dense_matrix<Matrix_3_t>::value>* = nullptr>
void multiply(const Matrix_1_t& A, const Matrix_2_t& B, Matrix_3_t& C) {
  for (size_t i = 0; i < C.num_rows(); ++i) {
    for (size_t j = 0; j < C.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}
Let’s look again at the concept-overloaded multiply
functions, looking at the template statements.
template <class Matrix_1_t, class Matrix_2_t, class Matrix_3_t,
          std::enable_if_t<   is_dense_matrix<Matrix_1_t>::value
                           && is_dense_matrix<Matrix_2_t>::value
                           && is_dense_matrix<Matrix_3_t>::value>* = nullptr>
void multiply(const Matrix_1_t& A, const Matrix_2_t& B, Matrix_3_t& C) {
  for (size_t i = 0; i < C.num_rows(); ++i) {
    for (size_t j = 0; j < C.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}
template <class SparseMatrix_t, class Matrix_1_t, class Matrix_2_t,
          std::enable_if_t<   is_sparse_matrix<SparseMatrix_t>::value
                           && is_dense_matrix<Matrix_1_t>::value
                           && is_dense_matrix<Matrix_2_t>::value>* = nullptr>
void multiply(const SparseMatrix_t& A, const Matrix_1_t& B, Matrix_2_t& C) {
  A.matmat(B, C);
}
Again, for each, we have the function bodies for multiply
– those should be very familiar by now (and shouldn’t be intimidating). Above each function definition (and highlighted) is a template
statement – indicating, just as before, that the function being defined is a function template. And, just as before, we have placeholder arguments for the matrix arguments being passed to each variant of multiply
.
Now, each multiply function template has one more placeholder argument – and this is where the magic with SFINAE happens. The enable_if template is one that will basically “blow up” if its argument evaluates to false, or will produce something valid if its argument evaluates to true. If it blows up, the compiler will not even consider using that function template to stamp types into. If it does not blow up, the compiler will consider it. In addition, your course repo now has a file util.hpp with some other (magic) class templates that (again using SFINAE) evaluate to true if the type argument is a dense matrix (i.e., it has an operator() that takes two index arguments), a sparse matrix, or a dense vector. The concepts mechanism in C++20 will replace all of this machinery by essentially elevating placeholders to be “types of types.”
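To give a rough idea of how such a trait can be built, here is a hypothetical sketch using the C++17 detection idiom (the actual definitions in util.hpp may well be different):

#include <cstddef>
#include <type_traits>
#include <utility>

// Primary template: by default, a type is not considered a dense matrix.
template <class T, class = void>
struct is_dense_matrix : std::false_type {};

// Partial specialization: chosen only if T has an operator() taking two indices.
// If the expression inside void_t is ill-formed, this specialization is silently
// discarded (SFINAE) and the primary template is used instead.
template <class T>
struct is_dense_matrix<T, std::void_t<decltype(std::declval<const T&>()(size_t(0), size_t(0)))>>
    : std::true_type {};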
You will not be required to write any code like this in the course (or even to be particularly facile with it). But if you do want to be able to read it, focus first on the function body, then the first template arguments, then the enable_if. Here we can see from the enable_if that the first version of multiply takes three types that are each a dense matrix, whereas the second multiply expects the first argument to be a sparse matrix and the next two to be dense matrices.
I will be more than happy to arrange extra office hours for those who would like questions answered about this (or just additional explanation).
The code in ps7a is basically the code that was in ps6, but with template versions of all of the math functions. The invocations of those (now) function templates have not changed. If you look in your repo you will notice we no longer have the various .cpp files that we had in the past. All of those functions have been converted to function templates and are now included completely in header files. If you browse through some of those files you will see how this is done. Notice that a lot of the previous redundancy has been eliminated.
Note also that since the math functionality we are using (in amath583.hpp) is generic, it no longer depends on Matrix.hpp nor any of the other headers that define specific classes, so the include statements for those files have been removed.
The files in ps7a can be considered a reference implementation (solution?) for many of the problems we did in the last few assignments. You should spend a little time browsing the class files as well as the algorithms.
Repeat what you did above with ps6, but using ps7a and using srun to execute your programs (you can also experiment with using srun and your programs from ps6). Again, you are encouraged to browse some of the driver code to convince yourself that there is no difference in using the code between the concrete versions and the generic versions.
Note in particular that you can switch between CSRMatrix
and CSCMatrix
just by changing one line now in pagerank.cpp. Verify that you can do this (and that the right thing happens).
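For example, the relevant line might look something like this (illustrative; the exact line in pagerank.cpp may differ):

CSRMatrix A = read_csrmatrix(input_file);       // the CSR version
// CSCMatrix A = read_cscmatrix(input_file);    // the one line to change for the CSC version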
Note
auto
In fact, you can switch between CSRMatrix
and CSCMatrix
by just changing one character.
There is an additional feature in C++ that works hand-in-glove with templates (particularly class templates): auto
. In many instances it is possible for C++ to determine the type of a variable based on how the variable is being initialized (recall that we should initialize variables when they are created, if at all possible). To let C++ determine the type of the thing you are creating, you use the auto
keyword instead of a type. When we create the matrix A
in pagerank.cpp
we can say
auto A = read_csrmatrix(input_file);
The “Right Thing” (tm) will happen – A
will be a CSRMatrix
. Knowing explicitly what types are is important for a programmer when the programmer is responsible for matching up types and functions, but since a generic function can take any type, it is less important that you explicitly define types (and you really don’t want to when you are working with placeholders in generic functions).
You can create a CSCMatrix
by changing just one character:
auto A = read_cscmatrix(input_file);
I don’t recommend relying on auto
(in fact I recommend against it) until you are comfortable with the strong typing in C++ – and with the basic operation of generic functions and data types.
Next up: GPUs.