(Originally recorded 2019-04-30)
So far we have been looking at how to extract maximum performance (in terms of floating point operations per second) from a CPU. However, in the larger context of scientific computing and problem solving, what we ultimately want is to compute the answer in the shortest amount of time.
In this lecture we look at a couple of ways to work smarter (and to work less) in order to reduce time to solution.
We first look at a remarkable algorithm discovered by Volker Strassen (and which now bears his name): Strassen’s algorithm for computing the matrix-matrix product with computational complexity lower than \(O(N^3)\).
Next we develop a representation for matrix storage that stores only the information we need for representing a problem and that also results in significant improvements in time to solution.
“High-Performance Scientific Computing” covers a large (and growing) intellectual space. A core theme in scientific computing is using computer algorithms (which are necessarily discrete, at least in digital computers) to model physical phenomena (which are necessarily continuous). The physical phenomena are also typically modeled mathematically as systems of nonlinear partial differential equations, which require discretization and linearization to reduce them (within the context of a larger iterative solution process) to systems of linear equations, aka \(\mat{A}\vec{x}=\vec{b}\).
The matrices at the bottom of this sequence of steps have an important characteristic: Almost all of the entries are zero. Computations using such a matrix thus involve adding and multiplying zero elements, operations which essentially have no effect and could be profitably eliminated. To do so, we need to develop special data structures—sparse matrix data structures—that only store the non-zero entries of the matrix, but allow us to perform operations such as matrix-vector product.
Laplace’s equation on a unit square is a prototypical PDE that we can use to derive sparse matrix data structures.
If we discretize the domain, we turn our continuous differential equation into an algebraic equation by using a finite-difference approximation to the differential operator.
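As a concrete sketch (assuming a uniform grid with spacing \(h\) and the standard five-point stencil), second-order centered differences turn Laplace's equation into an algebraic relation at each interior grid point:

\[
\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0
\quad\longrightarrow\quad
\frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{h^2} + \frac{u_{i,j+1} - 2u_{i,j} + u_{i,j-1}}{h^2} = 0 ,
\]

which rearranges to \(4u_{i,j} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1} = 0\). Each equation couples a grid point to only its four nearest neighbors.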
We can collect the equations for each grid point in the discretized domain into a collection of simultaneous equations, i.e., a linear system of equations. The coefficients and variables can be represented in matrix form \(\mat{A}\vec{x}=\vec{b}\).
Important
Make sure you can explain how we go from the stencil to \(\mat{A}\vec{x}=\vec{b}\) (cf slides 24 and 25).
Each of the equations that comprise the matrix system has only five coefficients, meaning that there are only five non-zero elements in each row of the matrix, regardless of the matrix size. For a domain that is discretized with 1,000 points in each dimension, the matrix will be 1M by 1M, but with only five elements per row. Clearly, a representation that stores only the non-zero elements (of which there are about 5 million) is much more efficient than one that stores all of the elements (of which there are \(10^{12}\)). Sparse matrix techniques include both the data structures and the algorithms that exploit this non-zero structure.
To do a computation such as matrix-vector product (the canonical sparse matrix operation) we need the row index, column index, and value of each matrix entry. In the case of dense matrix-vector product, the row and column are given by our loop variables as we iterate through the rows and columns of the matrix.
For a sparse matrix, we no longer store all of the matrix entries, but only some of them. In the sparse matrix-vector product, we iterate just over the non-zero entries of the matrix. But the algorithm still needs the row and column index corresponding to each non-zero entry. Since we can no longer infer the row and column indices implicitly from our loop indices, we instead need to store them explicitly.
The most basic sparse matrix storage format is coordinate storage, which, for each matrix non-zero entry, stores the entry’s value, row index, and column index.
One possible implementation for coordinate storage as a C++ class might look like:
#include <cstddef>
#include <vector>

class COOMatrix {
public:
  COOMatrix(size_t M, size_t N) : num_rows_(M), num_cols_(N) {}

  size_t num_rows() const { return num_rows_; }
  size_t num_cols() const { return num_cols_; }

private:
  size_t num_rows_, num_cols_;       // matrix dimensions
  std::vector<size_t> row_indices_;  // row index of each non-zero
  std::vector<size_t> col_indices_;  // column index of each non-zero
  std::vector<double> storage_;      // value of each non-zero
};
In this storage format we use std::vector to store the value, row index, and column index information. These vectors will all be the same size. For a given index value k, which will be less than the size of the vectors, the quantities storage_[k], row_indices_[k], and col_indices_[k] represent the corresponding value, row index, and column index of the kth non-zero in the matrix.
There is a crucial thing to notice about this format (a consequence of only storing non-zero entries): there is no random access to elements. That is, there is no operator() to access entries by their row and column number. With dense matrices, we used operator() to access the internal data of the matrix. Without operator(), we need a different approach for expressing algorithms. In particular, we will need to implement algorithms such as matrix-vector product as member functions of the sparse matrix structure.
To develop a matrix-vector product for coordinate format, consider dense matrix-vector product:
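Here is a sketch, assuming a dense Matrix class with num_rows(), num_cols(), and operator() in the style of earlier lectures:

void matvec(const Matrix& A, const std::vector<double>& x, std::vector<double>& y) {
  for (size_t i = 0; i < A.num_rows(); ++i) {
    for (size_t j = 0; j < A.num_cols(); ++j) {
      y[i] += A(i, j) * x[j];   // accumulate A(i,j) * x[j] into y[i]
    }
  }
}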
Consider what we are doing abstractly. We are going through every element in the matrix. At each element, we index into y with the row index and into x with the column index, multiply the entry's value against the retrieved value of x, and accumulate that product into y.
We can do exactly this abstract operation with our coordinate storage. That is, we can go through every non-zero element in the matrix, index into y with the stored row index, index into x with the stored column index, and obtain the matrix value to multiply against. This is expressed differently for the coordinate storage format:
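A sketch, written as a hypothetical matvec member function of COOMatrix (the member arrays are those defined above):

// Computes y += A * x by visiting each stored non-zero exactly once.
void matvec(const std::vector<double>& x, std::vector<double>& y) const {
  for (size_t k = 0; k < storage_.size(); ++k) {
    y[row_indices_[k]] += storage_[k] * x[col_indices_[k]];
  }
}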
Important
Make sure you can explain how coordinate matrix-vector product and dense matrix-vector product are performing the same abstract operation.
We note that the numerical intensity of sparse matrix-vector product is quite a bit lower than for dense matrices (cf slide 52). Nonetheless, sparse matrix computation can provide enormous performance benefits. See the excursus on Sparse vs Dense Computation (Or, How to be Faster while Being Slower).
The coordinate format is a basic and easy-to-use format. However, there are ways to make sparse storage more compact and more efficient. We achieve this by compressing one of the index axes (either the rows or the columns), thereby removing redundancy and improving locality. Our steps for compression will be to first sort the matrix entries according to one of the index axes, then run-length encode the resulting index array, then perform a prefix sum over the run lengths.
The example we will look at will be sorting and compressing along the row index array and will result in the compressed sparse row (CSR) format. A similar process can be followed for compressing along the column index array to arrive at the compressed sparse column (CSC) format. CSR is probably the most common sparse matrix format used in scientific computing today.
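As a small worked example (a hypothetical 4-by-4 matrix with six non-zeros, values a through f), sorting the COO arrays by row index gives:

row_indices_: { 0, 0, 1, 2, 2, 3 }
col_indices_: { 0, 1, 1, 0, 2, 3 }
storage_:     { a, b, c, d, e, f }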
Next, we note that there are repeated entries for each row value. We can reduce the size of this array by applying run-length encoding: each row number is stored once, together with a count of how many non-zeros it contains. Performing a prefix sum over those counts then converts them into offsets, so that the ith entry of the compressed array points to the first non-zero of row i, and the final entry marks one past the last non-zero.
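Continuing the worked example above, run-length encoding the sorted row index array gives counts { 2, 1, 2, 1 } for rows 0 through 3, and the prefix sum of those counts produces the offset array:

row_indices_: { 0, 2, 3, 5, 6 }   // row i's non-zeros occupy positions [row_indices_[i], row_indices_[i+1])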
The data necessary to represent a matrix in Compressed Sparse Row (CSR) format are thus the compressed array of row offsets (with num_rows() + 1 entries), together with the column index and value arrays, each holding one entry per non-zero.
An implementation of CSR as a C++ class might look like:
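Here is one possible sketch; the member names mirror the COOMatrix class above and are assumptions:

#include <cstddef>
#include <vector>

class CSRMatrix {
public:
  CSRMatrix(size_t M, size_t N)
      : num_rows_(M), num_cols_(N), row_indices_(M + 1, 0) {}

  size_t num_rows() const { return num_rows_; }
  size_t num_cols() const { return num_cols_; }

private:
  size_t num_rows_, num_cols_;       // matrix dimensions
  std::vector<size_t> row_indices_;  // num_rows_ + 1 offsets into the arrays below
  std::vector<size_t> col_indices_;  // column index of each non-zero
  std::vector<double> storage_;      // value of each non-zero
};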
An implementation of matrix-vector product for CSR might look like:
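As with COO, this is a sketch written as a hypothetical member function of CSRMatrix:

// Computes y += A * x one row at a time; row_indices_[i] and
// row_indices_[i + 1] bracket the non-zeros belonging to row i.
void matvec(const std::vector<double>& x, std::vector<double>& y) const {
  for (size_t i = 0; i < num_rows_; ++i) {
    for (size_t k = row_indices_[i]; k < row_indices_[i + 1]; ++k) {
      y[i] += storage_[k] * x[col_indices_[k]];
    }
  }
}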
Again, the matrix-vector product follows the abstract matrix-vector operation that we presented above. However, due to the way we are storing matrix entries, we use a two-level loop structure. That is, we iterate through the rows and then iterate over the elements in each row. (Another way to think about CSR is that we have grouped all of the elements of each row together, and that neighboring values in the row_indices_ array point to the beginning and end of the elements in that row, as stored in the col_indices_ and storage_ arrays.)
Important
Make sure you can explain how to derive CSR from COO.
Important
Make sure you can explain how CSR is performing the abstract matrix-vector product operation.