Homework 5: Sparse Matrices

Sparse matrix reprsentation, built using hash tables, is an important application in areas as disparate as weather forecasting and the google search engine.

Revision: 03/27/2014 alpha release
Revision: 03/28/2014 Narrative upgraded. Procedural Requirements completed.
Revision: 03/29/2014 beta release - report bugs to the discussion forum.
Revision: 04/03/2014 final release - some typos corrected, should be good to go.

Educational Objectives: After completing this assignment, the student should be able to accomplish the following:

Operational Objectives: Complete the implementation of fsu::HashTable<K,D,H> and fsu::HashTable<K,D,H>::Iterator in the file hashtbl.h Define and implement fsu::SparseVector<N> and fsu::SparseMatrix<N> as class templates. Implement the matrix*vector product between a SparseMatrix and a SparseVector. For extra credit, also implement the matrix*matrix product between two SparseMatrix objects. ... that provide bidirectional inorder iterators and forward preorder and levelorder iterators for BSTs.

Deliverables: Three files:

hashtbl.h    # fsu::HashTable<K,D,H> and associated Iterator class
sparse.h     # fsu::SparseVector<N>, fsu::SparseMatrix<N>, and the product operators
log.txt      # your project work log

Procedural Requirements

  1. The official development | testing | assessment environment is gnu g++47 -std=c++11 -Wall -Wextra on the linprog machines.

  2. Create and work within a separate subdirectory cop4530/hw5.

  3. Do your own work. Variations of this project have been used in previous courses. You are not permitted to seek help from former students or their work products. For this and all other projects, it is a violation of course ethics and the student honor code to use, or attempt to use, code from any source other than that explicitly distributed in the course code library, or to give or receive help on this project from anyone other than the course instruction staff. See Introduction/Work Rules.

  4. Begin by copying the entire directory LIB/hw5 into your hw5 directory. At this point you should see these files in your directory:

    hashtbl.start  # copy to hashtbl.h and complete the implementations
    sparse.start   # copy to sparse.h and supply code implementing M*V [M*M for extra credit]
    sparse_util.h  # utilities - READ THIS
    fhtbl.cpp      # test harness for fsu::HashTable
    rantable.cpp   # creates random  files for testing tables
    matrix.h       # fsu::Matrix - READ THIS
    matrix_util.h  # matrix utilities - READ
    mxv.cpp        # Matrix * Vector
    mxm.cpp        # Matrix * Matrix
    smxsv.cpp      # SparseMatrix * SparseVector
    smxsm.cpp      # SparseMatrix * SparseMatrix
    sm2m.cpp       # converts sparse matrix file to matrix file
    m2sm.cpp       # converts matrix file to sparse matrix file
    makefile       # builds executables
    filespec.txt   # describes four file types: .mat .vec .sm .sv
    
    Note that all code files are complete, you only need to copy the two start files to header files and complete their code. The other files are provided for your reading and use in testing.

    Then copy these relevant executables:

    LIB/area51/fhtbl.x
    LIB/area51/rantable.x
    LIB/area51/mxm.x
    LIB/area51/mxv.x
    LIB/area51/smxsm.x
    LIB/area51/smxsm_v.x # verbose version, displays all element updates
    LIB/area51/smxsv.x
    LIB/area51/v2sv.x
    LIB/area51/sv2v.x
    LIB/area51/m2sm.x
    LIB/area51/sm2m.x
    

  5. Copy hashtbl.start to hashtbl.h. Complete the implementation of HashTable<K,D,H> and its associated iterator types in hashtbl.h.

  6. Copy sparse.start to sparse.h. Complete the implementation of SparseVector<N> and SparseMatrix<N> in hashtbl.h.

  7. Test your code thoroughly, and be sure to document your testing in the log.txt file. There should be a brief narrative describing your test plan, and then documentation of the results of carrying out that plan.

  8. Submit the assignment using the script hw5submit.sh.

    Warning: Submit scripts do not work on the program and linprog servers. Use shell.cs.fsu.edu to submit assignments. If you do not receive the second confirmation with the contents of your assignment, there has been a malfunction.

Code Requirements and Specifications

  1. There are missing or incomplete implementations for several member functions of fsu::HashTable<K,D,H> and fsu::HashTableIterator<K,D,H>. All of these should be implemented and tested to ensure correct behavior.

  2. The global product operator SparseMatrix * SparseVector requires implementation. The global product operator SparseMatrix * SparseMatrix may be implemented (correctly) for extra credit.

  3. The implementation of the product operators is required to be space conservative, meaning that no unnecessary new elements of SparseVectors or SparseMatrices are created by the product operators.

  4. The implementation of SparseMatrix * SparseVector should have runtime <= O(n), where n is the number of (non-zero) elements of the SparseMatrix.

  5. Provide an informal argument showing that your implementation of SparseMatrix * SparseVector has runtime <= O(n), where n is the number of (non-zero) elements of the SparseMatrix.

  6. If you choose to implement the SparseMatrix * SparseMatrix product operator, provide an informal analysis of the runtime of your implementation.

Introduction to Sparse Matrices

A good place to start talking about sparse matrices is with an example - not contrived, but real - you could even say a billion dollars real: The google view of the world wide web.

Most web pages link to other web pages. Perhaps the key idea kicking off the google search engine is to evaluate web pages by the number of links to the page - something not directly controllable by the page owner, and presumably determined by other pages concluding that the page is important enough to route its visitors on to that page. But there is a lot more to the "page rank" than just counting incoming links.

Think of a random web surfer sitting at a page P. Suppose that P has 10 outgoing links. Then the random surfer has probability 1 out of 10, or 0.10, of taking one of those links out. Construct a matrix M with all web pages along the row and column indexes, for each web page P and other web page Q to which P links, make the matrix entry at that location be this probability p = 1/k, where k is the number of out links from P. If P has index i and Q has index j, then M(i,j) = 1/k, where k is the number of out links from P. There is a technical detail we are omitting here having to do with the possibility the random surfer might start a new search, but we don't need to go into that for this example.

Now we have a matrix M whose entries are "transition" probabilities. There is a nice theory applying to such situations, which states that:

If we start with a vector v with equal components, the successive applications of matrix-vector multiplication, that is, v, M*v, M*(M*v) = M2*v, M*(M2*v) = M3*v, M*(M3*v) = M4*v, ... , converges to a vector w that satisfies w = M*w: w is a vector that is invariant under multiplication by the transition matrix! This vector w is (with the modifications mentioned above) the page rank vector discovered by Brin and Page that became the cornerstone of the google search engine.

The obvious problem is:

  1. The web is big
  2. The matrix is (big)2
  3. Matrix multiplication is (big)3

To make the google search engine remain current, the page rank vector must be re-calculated regularly to account for changes in the web - pages and links added or (in fewer caes) subtracted. Assume this is done daily.

Let's explore for a moment the resources that might be required to calculate the page rank vector. A generally accepted estimate of the number of web pages is currently about 10 billion, or n = 1010. The transition matrix M, being nxn, has n2 = 1020 entries. Assuming 8 bytes to store a floating point number, the amount of storage to keep one copy of M is 8*1020 bytes, or 8*1020 / 1012 = 8*108 = 800,000,000 terabytes.

Moreover, matrix-vector multiplication requires n2 individual floating-point multiplies. In our example, (1010)2 = 1020 flops, or 1020/1012 = 100,000,000 teraflops of calculation, just to perform one M*v multiplication. Assuming IBM Blue, the current record holder for floating point speed at 596 teraflops per second, this requires roughly two days of CPU time. This estimate doesn't account for the disk access time implied by the storage of such a large amount of data. (The M*V algorithm can exploit parallelism to shorten the runtime to Θ(n2/p), where p is the number of processors, but even so this is a large computational burden and is already largely accomplished by Blue.)

Thus we have a space problem and a time problem that, using today's technologies, seem intractable. Enter Sparse Matrix technology.

Note that web pages have comparatively few out links. We could even assume that they have no more than 10, because the buried links are unlikely to be clicked. Whatever the number, it is much smaller than the entire web. Sticking with 10 as our estimate, this means that each row of the matrix M has all but 10 of its entries equal to zero! If n is the size of the WWW, then we have about 10n non-zero entries in M, whereas there are n2 entries. To put this in perspective, take as an estimate n = 1010 web pages, as above. Then there are 10n = 100,000,000,000 non-zero entries in M representing WWW links, which leaves n2 - 10n = 100,000,000,000,000,000,000 - 100,000,000,000 = 99,999,999,900,000,000,000 zero entries in M. That is a lot of memory used to remember nothing.

Sparse matrix technology uses associative arrays to reduce the memory required for an nxn matrix from Θ(n2) to Θ(m), where m is the number of non-zero entries, and the runtime of M*V is also reduced to Θ(m). In the example above, both storage and runtime are reduced by a factor of n/10 = 109, requiring space of roughly 8*103 = 8,000 terabytes and time on the order of 1,000 teraflops, large but manageable numbers.

The concept behind Sparse Matrix technology is simple: explicitly remember only the non-zero entries of a matrix, with the convention that "missing" entries are assumed to have the value zero. Hash tables (or ordered maps, for that matter) can be used in two different ways to accomplish this:

  1. Define SparseMatrix to be a hash table with KeyType = Pair<size_t,size_t> and DataType = double. Then store non-zero entries M(i,j) in the table using (i,j) as the key. In this design, the pair (i,j) is a key, but neither i nor j have any meaning by themselves, so if you need a concept of rows and columns, the second approach gives you that.
  2. Define a SparseVector to be a hash table with KeyType = size_t and DataType = double. Then define a SparseMatrix to be a hash table with KeyType = size_t and DataType = SparseVector. Here M(i,j) is a little more complicated to access, but we have the advantage that we can iterate through both rows and columns of the matrix.

We will use the second approach.

Examples

An example to drive home the point about saving space and time with sparse technology can be viewed as follows, in steps:

  1. Create a file named m800x800.sm with this content:

    1   1   1
    300 100 2
    300 200 3
    300 300 4
    475 200 5
    475 300 6
    799 300 7
    799 799 8
    

    This represents a sparse matrix with 8 non-zero entries.

  2. Run the converter sm2m.x to convert to a matrix file:

    command: sm2m.x m800x800.sm m800x800.mat
    

    Now "more" both files to get a sense of the relative sizes. You can also open the matrix file with emacs and search for the (non-zero) entries 1,2,3,4,5,6,7,8 to convince yourself that they are there, hard to find with your eyes in the blizzard of zeros.

  3. Square the matrix, first as a sparse matrix and then as a regular matrix:

    smxsm.x m800x800.sm m800x800.sm square.sm
    mxm.x m800x800.mat m800x800.mat square.mat
    

    (In the matrix product case I have a "ticker" displayed to reassure you that the program is running while it is preforming the necessary 8003 = 512,000,000 floating point multiplies.)

  4. Convert the square from matrix to sparse, and use diff to compare the results:

    m2sm.x square.mat s.sm
    diff square.sm s.sm
    

    If there are differences, convince yourself they are only differences in the order of the rows, and hence represent the same sparse matrix.

  5. As a "double check", convert the square from sparse to matrix, and use diff to compare the results:

    sm2m.x square.sm s.mat
    diff square.mat s.mat
    

    There should be no differences.

This process may give ideas on how to test your SparseMatrix multiply operators. There is one place where you need to be careful, because the converter sm2m creates a matrix that is "just big enough", not necessarily square. That is why I put the element at (799,799), to "anchor" the nominal size at 800x800.

class fsu::HashTable<K,D,H> and associated Iterator classes

The underpinning technologies for this assignment are discussed in the lecture notes:
Hash Functions
Hash Tables & Associative Arrays

class fsu::Matrix<N>

The mathematical notion of Vector is used in C++ and the fsu::Vector<T> class template provides a by now familier implementation of the concept. A similar mathematical notion of Matrix can also be implemented usefully as a "vector of vectors" class template named fsu::Matrix<T>. The class template is defined as follows:

namespace fsu
{

  template <typename T>
  class Matrix
  {
  public:
    typedef T ValueType;

    // used by graph algorithms when using adjacency matrix rep - omitted, see code file

    Matrix ();
    Matrix (size_t numrows, size_t numcols);
    Matrix (size_t numrows, size_t numcols, T t);
    Matrix (const Matrix& m);
    virtual ~Matrix();
    Matrix<T>& operator=  (const Matrix<T>& m);

    Vector <T>& operator[]        (size_t rowindex);
    const Vector <T>& operator[]  (size_t rowindex) const;
    ValueType& operator()         (size_t i, size_t j); // see note 1
    const ValueType& operator()   (size_t i, size_t j) const;  // see note 2

    void SetSize (size_t numrows, size_t numcols);
    void SetSize (size_t numrows, size_t numcols, T t);
    size_t NumberOfRows () const;
    size_t NumberOfCols () const;

    void Clear();

  protected:
    fsu::Vector < fsu::Vector < T > > row_;
  } ;

  template <typename T>
  T& Matrix<T>::operator () (size_t i, size_t j)
  {
    bool resize = 0;
    size_t rows= NumberOfRows(), cols = NumberOfCols();
    if (i >= rows)
    {
      rows = i + 1;
      resize = 1;
    }
    if (j >= cols)
    {
      cols = j + 1;
      resize = 1;
    }
    if (resize)
      SetSize(rows,cols,T());
    return (*this)[i][j];
  }

  template <typename T>
  const T& Matrix<T>::operator () (size_t i, size_t j) const
  {
    return (*this)[i][j];
  }

} // namespace fsu

Notes.

  1. Anticipating the details of SparseMatrix, we introduce here the doubly-indexed element operator taking two size_t arguments, the row and column indices of a value in the matrix. Client code can look something like this:

    fsu::Matrix<int> m;
    int val;
    ...
    m(i,j) = val; // m(i,j) is an Lvalue
    

    Even more useful (and dangerously so) is that this operator has "insert semantics" just like the associative array bracket operator: if (i,j) is not a pair of coordinates for which m is currently defined, the matrix is expanded just enough so that the call is defined. The implementation above color coded red shows how the automatic expansion takes place.

  2. There is a const version of the element operator which of course would not be permitted to change the state of m (coded blue). An error will be thrown if the const version is called for indices for which the matrix is not defined. A place where the analogies between normal and sparse break down is that the Vector bracket operator[] does not have insert-if-not-found semantics, whereas SparseVector does. The addition of these element operators for Matrix ensures that at the matrix level the behavior of normal and sparse are analogous.

Global Matrix*Vector and Matrix*Matrix Product Operators

You are familier with the fact that matrices and vectors have a kind of "algebra" and that key operations for this algebra are matrix-vector multiplication and matrix-matrix multiplication. When multiplying M*V the rows of the matrix M must have the same number of elements as the vector V, and then the product is defined to be the vector W such that W[i] = ΣjM(i,j)V[j]. The product A*B of matrices A and B is defined similarly. The product is the matrix X such that X(i,j) = ΣkA(i,k)*B(k,j). (The colmuns of A must be the same size as the rows of B for this to work.) These two algorithms are captured for fsu::Vector and fsu::Matrix in the following code.

template < typename N >
fsu::Vector<N> operator * (const fsu::Matrix<N>& A, const fsu::Vector<N>& V) // see note 3
{
  fsu::Vector <N> X(0);
  size_t J = A.NumberOfCols();
  if (J != V.Size())
  {
    std::cerr << " ** M*V Multiplication Error: rows of M and V not same size\n";
    return X;
  }
  size_t I = A.NumberOfRows();
  X.SetSize(I,N());
  for (size_t i = 0; i < I; ++i)
  {
    for (size_t j = 0; j < J; ++j)
    {
      X[i] += A(i,j)*V[j];
    }
  }
  return X;
}

template < typename N >
fsu::Matrix<N> operator * (const fsu::Matrix<N>& A, const fsu::Matrix<N>& V)
{
  fsu::Matrix <N> X(0,0);
  size_t K = A.NumberOfCols();
  if (K != B.NumberOfRows())
  {
    std::cerr << " ** A*B Multiplication Error: rows of A and cols of B not same size\n";
    return X;
  }
  size_t I = A.NumberOfRows(), J = B.NumberOfCols();
  X.SetSize(I,J,N());
  for (size_t i = 0; i < I; ++i) // row
  {
    for (size_t j = 0; j < J; ++j) // col
    {
      for (size_t k = 0; k < K; ++k) // x(i,j) = (row i of A) dot (col j of B)
      {
        X(i,j) += A(i,k)*B(k,j);
      }
    }
  }
  return X;
}

Notes.

  1. In many cases it is more practical to avoid the copying of large objects implied by the return-by-value product operators. This is accomplished by defining a function Product whose signature is similar to that of operator* with a third argument receiving a reference to the product object. The algorithm operates on this referenced object and returns a reference to the modified product object. This technique leaves the client program in charge of object creation and copying.

  2. It is worth noting here that Matrix*Vector has runtime Θ(m*n) [where M has dimensions mxn) and Matrix*Matrix has runtime Θ(m*n*p) [where A has dimension mxn and B has dimension nxp]. These facts are immediately verified by inspection of the implementation code which consists of nested fixed-size loops.

One of your challenges in this assignment is to implement these two operators for fsu::SparseVector and fsu::SparseMatrix.

class fsu::SparseVector<N>

Before defining SparseMatrix we need a "sparse" anaolg of Vector. SparseVector is defined using a HashTable with KeyType = size_t and DataType = N, so that we can have SparseVector elements only for the specific indicies we need without being forced to have all of the other indices preceding it in the number system. For example, we can have client code like this:

fsu::SparseVector<int> v;
v[3000] = 15;
v[5000] = 25;

which results in a SparseVector with two elements, <3000,15> and <5000,25>, but not necessarly any elements at indices 0,1, ... ,2999 or 3001, ...,4999. Here is our class template definition:

namespace fsu { template < typename N > class SparseVector { public: typedef N ValueType; ValueType& operator [] (size_t i) { return val_[i]; } // see note 6 const ValueType& operator [] (size_t i) const { return val_[i]; } // see note 7 // terminology support, mostly used to define TableType typedef size_t KeyType; typedef N DataType; typedef hashclass::KISS < size_t > HashType; typedef fsu::Entry < KeyType , DataType > EntryType; typedef fsu::List < EntryType > BucketType; typedef fsu::HashTable < KeyType , DataType , HashType > TableType; typedef typename TableType::Iterator Iterator; // Iterator support Iterator Begin () const { return val_.Begin(); } Iterator End () const { return val_.End(); } bool Retrieve (size_t i, N& n) const { return val_.Retrieve(i,n); } // see note 8 size_t NumEntries () const { return val_.Size(); } size_t MaxIndex () const; // proper type explicit SparseVector (size_t size = 100) : val_(size) {} SparseVector (const SparseVector& v) : val_(v.val_) {} virtual ~SparseVector () {} SparseVector<N>& operator= (const SparseVector<N>& v) { val_ = v.val_; return *this; } void Clear () { val_.Clear(); } void Rehash ( size_t size = 0 ) { val_.Rehash(size); } private: TableType val_; }; template < typename N > size_t SparseVector<N>::MaxIndex () const { size_t max = 0; for (Iterator i = Begin(); i != End(); ++i) if ((*i).key_ > max) max = (*i).key_; return max; } } // namespace fsu

Notes.

  1. Except for the function MaxIndex() const, all of the methods are implemented in-line. It is nevertheless important to understand these one-liners.

  2. Here we see the SparseVector bracket operator, implemented as the bracket operator of the underlying hash table. This version has "insert-if-not-found" semantics.

  3. The const version of the bracket operator is not permitted to change the state of the calling SparseVector object. Be sure to check the fsu::HashTable code to fully grasp what is going on here.

  4. The Retrieve method is extremely useful, both here and in SparseMatrix. It is used to prevent calls to the element operators that might expand the size of the object unnecessarily.

class fsu::SparseMatrix<N>

Just as Matrix is defined in terms of Vector, we define SparseMatrix in terms of SparseVector:

namespace fsu
{
  template < typename N >
  class SparseMatrix
  {

  public:
    typedef N                   ValueType;
    typedef SparseVector<N>     RowType;

    // element access
    RowType&       operator   [] (size_t i)                 { return row_[i]; }
    const RowType& operator   [] (size_t i) const           { return row_[i]; }
    ValueType& operator       () (size_t i, size_t j)       { return (*this)[i][j]; } // see note 9
    const ValueType& operator () (size_t i, size_t j) const { return (*this)[i][j]; } // see note 9

    // proper type
    explicit SparseMatrix       (size_t numrows = 100)  : row_(numrows) {}
             SparseMatrix       (const SparseMatrix& m) : row_(m.row_) {}
    virtual  ~SparseMatrix      () { Clear(); }
    void     Clear              () { row_.Clear(); }
    SparseMatrix<N>& operator=  (const SparseMatrix<N>& m) { row_ = m.row_; return *this; }

    // other terminology support
    typedef size_t                                             KeyType;
    typedef RowType                                            DataType;
    typedef hashclass::KISS < size_t >                         HashType;
    typedef fsu::Entry      < KeyType , DataType >             EntryType;
    typedef fsu::HashTable  < KeyType , DataType , HashType >  TableType;
    typedef typename TableType::Iterator Iterator;

    // iterator support
    Iterator Begin () const { return row_.Begin(); }
    Iterator End   () const { return row_.End(); }

    // informational
    size_t NumEntries() const;
    fsu::Pair<size_t, size_t> MaxIndices() const; // see note 10
    bool Retrieve (size_t i, size_t j, N& n) const; // see note 11

    // improve structural efficiency
    void Rehash ( size_t size = 0 );

  private: 
    fsu::HashTable < size_t , SparseVector<N> , hashclass::KISS < size_t > > row_;
  };
  template < typename N >
  bool SparseMatrix<N>::Retrieve (size_t i, size_t j, N& n) const
  {
    RowType r;                 // local variable assists in finding entry value n
    if (!row_.Retrieve(i,r))   // r is the "ith row" of the sparsematrix
      return 0;
    if (!r.Retrieve(j,n))      // n is the "jth entry" of the sparsevector r
      return 0;
    return 1;                  // if both tests are passed, n is a usable reference
  }

  template < typename N >
  size_t SparseMatrix<N>::NumEntries() const
  {
    size_t count(0);
    for (Iterator i = row_.Begin(); i != row_.End(); ++i)
    {
      count += (*i).data_.NumEntries();
    }
    return count;
  }

  template < typename N >
  fsu::Pair<size_t, size_t> SparseMatrix<N>::MaxIndices() const
  {
    fsu::Pair<size_t,size_t> p(0,0);
    for (Iterator i = row_.Begin(); i != row_.End(); ++i)
    {
      if (p.first_ < (*i).key_) p.first_ = (*i).key_;
      for (typename RowType::Iterator j = (*i).data_.Begin(); j != (*i).data_.End(); ++j)
        if (p.second_ < (*j).key_) p.second_ = (*j).key_;
    }
    return p;
  }
    
  template < typename N >
  void SparseMatrix<N>::Rehash ( size_t size )
  {
    for (Iterator i = row_.Begin(); i != row_.End(); ++i)
    {
      (*i).data_.Rehash();
    }
    row_.Rehash(size);
  }
} // namespace fsu

Notes.

  1. The modifying and const versions of the element operator are in this instance "syntactic sugar" for the more cumbersome double bracket operator m[][].

  2. Pair<> MaxIndicies() const returns the largest row and largest col index in the SparseMatrix as a pair of unsigned ints. These are useful in utilities such as Convert( Matrix& , const SparseMatrix& ) where the size of a matrix needs to be adjusted to fit a sparse matrix.

  3. The bool Retrieve(i,j,value) const method is extremely useful, both here and in SparseVector. It is used to guard calls to the element operators that might otherwise expand the size of the object unnecessarily.

Global SparseMatrix*SparseVector and SparseMatrix*SparseMatrix Product Operators

Implementation of these operators is one of the deliverables in the assignment.

template < typename N >
fsu::SparseVector<N> operator* (const fsu::SparseMatrix<N>& a, const fsu::SparseVector<N>& v)
{
  fsu::SparseVector <N> w;
  ... 
  return w;
}

template < typename N >
fsu::SparseMatrix<N> operator* (const fsu::SparseMatrix<N>& a, const fsu::SparseMatrix<N>& b)
{
  fsu::SparseMatrix <N> x;
  ...
  return x;
}

Hints