COT 4401 Top 10 Algorithms
Chris Lacher
Google and PageRank

Background

Brin, S., and Page, L., "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Computer Networks and ISDN Systems 30 (1998), 107-117.
PageRank

The PageRank idea begins by asking: what is the probability that a given page is reached by a surfer starting from a random location in the WWW? The outgoing links on a page are assumed to be equally likely to be selected, so, for example, if page A has 5 outgoing links then there is probability 1/5 = 0.20 of choosing any particular one of these 5 links. Note that this is a conditional probability: given that we are at page A, the probability that we navigate to one particular page linked from A is 1 divided by the number of outgoing links, 0.20 in this example.

Recall that conditional probabilities multiply when navigating a path from page A to, say, page C: the probability of taking that path from A to C is the product of the conditional probabilities along the path.

Similarly, to get the total probability of navigating from A to C we would add the path probabilities, since the distinct paths from A to C are mutually exclusive ways to make the navigation.
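
For example, suppose the links from A to C pass through page B: if the link from A to B is chosen with probability 0.20 and the link from B to C with probability 0.50, then the path A → B → C has probability 0.20 × 0.50 = 0.10. If a second, disjoint path from A to C has probability 0.05, the total probability of navigating from A to C is 0.10 + 0.05 = 0.15.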

If we put all the conditional probabilities in a matrix M = (m(i,j)) where

m(i,j) = 1/(number of outgoing links from page i), if page i links to page j
m(i,j) = 0, if page i does not link to page j

we have a matrix that, for each pair i,j, gives the probability of navigating directly from page i to page j. This is called the transition matrix.
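
For example, in a hypothetical 3-page web where page 1 links to pages 2 and 3, page 2 links only to page 3, and page 3 links only to page 1, the transition matrix is

        ( 0    1/2  1/2 )
    M = ( 0    0    1   )
        ( 1    0    0   )

so, for instance, m(1,2) = 1/2 is the probability of navigating directly from page 1 to page 2.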

The rank of page i for the WWW is defined as the probability that a surfer who starts at a randomly selected page will navigate to page i. As explained in the background references, this can be phrased as a random walk problem in the directed graph whose transition matrix is M, and the rank can be calculated by starting with a probability vector v with equal components and iterating the product:

v, Mv, M(Mv) = M^2 v, M(M^2 v) = M^3 v, M(M^3 v) = M^4 v, ...

which is known to converge to the unique principal eigenvector w of M with length 1:

Mw = w

Here is an intuition for why we expect the successive products to converge to a principal eigenvector: informally, writing M^∞ v for the limiting product, applying M one more time changes nothing:

M(M^∞ v) = M(...MMM)v = M^∞ v

so taking w = M^∞ v we have

Mw = M(M^∞ v) = M^∞ v = w

The argument can be made rigorous by examining limits.

This eigenvector w is the page rank vector - well, almost. The calculation is modified to take into account the possibility of dead ends (sinks in the web digraph) and unreachable pages (pages C for which no directed path from A to C exists). The fix by Brin & Page is straightforward: modify the transition matrix to

W = (1 - q)M + (q/n){1}

where {1} is the n×n matrix with all entries equal to 1 and n is the number of pages; the factor q/n keeps W a transition matrix. (Note this is similar to the Cornell wording, and the reverse of the Brin/Page wording. However, the latter seem to have meant it this way, looking at their section 2.1.2.) Here, q is the probability that a random web surfer decides to quit following the links described by the transition matrix and start a new search at a random page. This possibility is real, and explains how surfers get out of dead ends as well as jump from one connected component to another. q = 0.15 is the value suggested by Brin/Page.
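
For example, with q = 0.15 and n = 4 pages, an entry of M equal to 1/2 becomes (0.85)(0.5) + 0.15/4 = 0.4625 in W, and an entry equal to 0 becomes 0.15/4 = 0.0375: every page retains a small chance of being jumped to directly.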

W satisfies all the requirements needed to conclude that

v, Wv, W^2 v, W^3 v, W^4 v, ...

converges to the unique principal eigenvector p of W with length 1:

Wp = p

The i-th component p(i) is the rank of page i. The problem is to actually calculate page rank.


Calculating Page Rank

Note that the product Mx of an n×n matrix M and an n-dimensional vector x is the vector y given by:

y[i] = Σ_j m[i][j] * x[j]

(using bracket notation instead of subscripts). This requires Θ(n^2) storage for M and Θ(n^2) time to calculate y.
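
For concreteness, here is the dense product in C++; the notes do not prescribe a dense representation, so the use of std::vector here is an assumption:

#include <cstddef>
#include <iostream>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;  // n x n, every entry stored explicitly

// y = M*x : Theta(n^2) time, on top of Theta(n^2) storage for M
Vector Multiply(const Matrix& m, const Vector& x)
{
  std::size_t n = x.size();
  Vector y(n, 0.0);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      y[i] += m[i][j] * x[j];
  return y;
}

int main()
{
  Matrix m = { {1, 0, 2}, {0, 3, 0}, {4, 0, 5} };  // the 3x3 example used later
  Vector x = { 1, 0, 3 };
  for (double yi : Multiply(m, x))
    std::cout << yi << '\n';  // prints 7 0 19
}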

At the time Brin & Page wrote their paper cited above, the web size was approaching 10^9 (1 billion) documents. That number has increased to 10^11. Thus the transition matrix has between 10^18 and 10^22 entries - much too large for even today's computer systems to manipulate. So a big question is: how to represent the transition matrix in such a way that the page rank can be calculated?

The key observation for getting this done is that most web pages have only a few outgoing links, and for the few pages that have a great many links, we can ignore all but the first 10 or so, because users will do the same thing. Therefore the transition matrix is sparse: for each i, there are only about 10 non-zero entries in row i. We can use hash table technology to store and manipulate matrices of gigantic size, as long as they are sparse. The following is pseudo code.

typedef HashTable < size_t, double >       SparseVector; 
typedef HashTable < size_t, SparseVector > SparseMatrix;

SparseMatrix m;    // m = WWW transition matrix
SparseVector x,y;  // x = given vector, y = result of product m*x
SparseMatrix::Iterator iter;
SparseVector::Iterator jter;

size_t i;
size_t j;

for (iter = m.Begin(); iter != m.End(); ++iter)
{
  i = (*iter).key_;
  y[i] = 0;  // this is an insert operation initializing the ith component of y to zero
  for (jter = m[i].Begin(); jter != m[i].End(); ++jter)
  {
    j = (*jter).key_;
    y[i] += m[i][j] * x[j]; // these are retrieval operations
  }
}
// the call to x[j] can be prevented from being an insert operation by
// substituting the value 0 whenever the key j is not in the table x
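
For comparison, here is the same computation written as real, compilable C++, with std::unordered_map standing in for the course HashTable (an assumption; any hash table with find and insert would do). The find() call implements the zero-substitution mentioned in the comment above.

#include <cstddef>
#include <iostream>
#include <unordered_map>

using SparseVector = std::unordered_map<std::size_t, double>;
using SparseMatrix = std::unordered_map<std::size_t, SparseVector>;

// y = m*x, touching only the stored (non-zero) entries of m
SparseVector Multiply(const SparseMatrix& m, const SparseVector& x)
{
  SparseVector y;
  for (const auto& [i, row] : m)       // each stored row i of m
  {
    double sum = 0.0;
    for (const auto& [j, mij] : row)   // each stored entry m(i,j)
    {
      auto it = x.find(j);             // find() avoids inserting zeros into x
      if (it != x.end())
        sum += mij * it->second;
    }
    if (sum != 0.0)
      y[i] = sum;                      // store only non-zero results
  }
  return y;
}

int main()
{
  // the 3x3 example from the file-format discussion below, indexed from 0
  SparseMatrix m = { {0, {{0,1.0},{2,2.0}}},
                     {1, {{1,3.0}}},
                     {2, {{0,4.0},{2,5.0}}} };
  SparseVector x = { {0,1.0}, {2,3.0} };  // dense form: (1 0 3)
  for (const auto& [i, yi] : Multiply(m, x))
    std::cout << "y[" << i << "] = " << yi << '\n';
}

Run on that example, this prints y[0] = 7 and y[2] = 19, and stores nothing at all for the zero component y[1].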

The basic matrix-vector multiplication is the same, but we only have to consider the places where the matrix has non-zero entries, which, for sparse matrices, reduces the problem to one of manageable size, both in storage (we no longer need all n^2 entries of the matrix in primary memory) and in time.
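
Putting the pieces together, here is a self-contained sketch of the rank iteration itself, again with std::unordered_map standing in for the course HashTable. The damping term from W = (1 - q)M + (q/n){1} is folded into each pass, the vector is renormalized after each pass, and the iteration follows the notes' convention of multiplying v by M on the left; the function names and the tolerance are illustrative assumptions, not part of the original notes.

#include <cmath>
#include <cstddef>
#include <iostream>
#include <unordered_map>
#include <utility>

using SparseVector = std::unordered_map<std::size_t, double>;
using SparseMatrix = std::unordered_map<std::size_t, SparseVector>;

// one damped pass: w = (1-q)*M*v + q/n in every coordinate,
// which is W*v when the components of v sum to 1
SparseVector Step(const SparseMatrix& m, const SparseVector& v,
                  std::size_t n, double q)
{
  SparseVector w;
  for (std::size_t i = 0; i < n; ++i)
    w[i] = q / n;                          // the random-restart term
  for (const auto& [i, row] : m)
    for (const auto& [j, mij] : row)
    {
      auto it = v.find(j);                 // skip zero components of v
      if (it != v.end())
        w[i] += (1.0 - q) * mij * it->second;
    }
  return w;
}

SparseVector PageRank(const SparseMatrix& m, std::size_t n,
                      double q = 0.15, double tol = 1.0e-10)
{
  SparseVector v;
  for (std::size_t i = 0; i < n; ++i)
    v[i] = 1.0 / n;                        // equal components to start
  for (;;)
  {
    SparseVector w = Step(m, v, n, q);
    double total = 0.0;
    for (const auto& [i, wi] : w) total += wi;
    double change = 0.0;
    for (auto& [i, wi] : w)
    {
      wi /= total;                         // renormalize each pass
      change += std::fabs(wi - v[i]);
    }
    v = std::move(w);
    if (change < tol) return v;            // converged: v (nearly) satisfies Wv = v
  }
}

int main()
{
  // hypothetical 3-page web, pages 0,1,2: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0
  SparseMatrix m = { {0, {{1,0.5},{2,0.5}}},
                     {1, {{2,1.0}}},
                     {2, {{0,1.0}}} };
  for (const auto& [i, ri] : PageRank(m, 3))
    std::cout << "rank of page " << i << " = " << ri << '\n';
}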


Indexing the Web

The Google enterprise is a tour-de-force of data structures, algorithms, and operating system optimization. When a user enters a keyword, Google must:

  1. Find all web pages with that word
  2. Rank these pages
  3. Sort the pages by rank
  4. Present the user with the top so-many pages

all within a time that does not bore the user into starting a different activity. Finding the web pages requires that the pages be stored in a manner that allows them to be quickly searched for matching strings. That process itself requires a sophisticated index and reverse index of the pages and a sort of the content that permits binary search. Ranking the pages uses the PageRank algorithm, plus other information that may relate to known user preferences, previous search behavior of that user, and other proprietary (and dynamic) inputs. Sorting by rank requires, obviously, a sort.

The WWW transition matrix must be continuously updated. This is done with a system of "web crawlers", processes that wander the web and send link information back to Google. (Google is not the only organization operating web crawlers, of course.) Because the web evolves more or less continuously, with new pages and links added virtually every second of every day, the transition matrix needs to be updated, at least daily, using the calculation methodology described above.

Here is a remarkable thing: virtually all of the technology used by Google is covered, in one form or another, in our curriculum: Data Structures, Algorithms, and Operating Systems.


File Formats

File specs for matrix, sparsematrix, vector, and sparsevector: files for (dense) matrix and vector data begin with the dimension(s), followed by the data. For example,

3 3
1 0 2
0 3 0
4 0 5

represents a 3×3 matrix, and

3
1 0 3

represents a 3-dimensional vector.

Sparse matrices and vectors are captured in files using a "mapping" concept, because position in the file alone cannot determine where a value is intended to go. For example,

0 0 1
0 2 2
1 1 3
2 0 4
2 2 5

is a sparse representation of the matrix above. The first two entries on each line are the row and column indices, and the third is the value at that position. In other words, the file represents this mapping:

M(0,0) = 1
M(0,2) = 2
M(1,1) = 3
M(2,0) = 4
M(2,2) = 5

The sparse interpretation is that any index pair not mentioned in the file is intended to be a zero entry. The logical completion of the mapping is then:

M(0,0) = 1
M(0,1) = 0 
M(0,2) = 2
M(1,0) = 0 
M(1,1) = 3
M(1,2) = 0 
M(2,0) = 4
M(2,1) = 0 
M(2,2) = 5

The sparse representation simply ignores the zero entries, and sparse matrix algebra assumes that any entries not explicitly stated in the sparse representation are zero entries.
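
As an illustration, here is one way such a sparse matrix file could be read into the hash-table representation used above (a sketch using std::unordered_map; the file name is hypothetical):

#include <cstddef>
#include <fstream>
#include <iostream>
#include <unordered_map>

using SparseVector = std::unordered_map<std::size_t, double>;
using SparseMatrix = std::unordered_map<std::size_t, SparseVector>;

// read "row col value" triples until end of file; absent entries are zero
SparseMatrix ReadSparseMatrix(const char* filename)
{
  SparseMatrix m;
  std::ifstream in(filename);
  std::size_t i, j;
  double value;
  while (in >> i >> j >> value)
    m[i][j] = value;   // insert only the stated (non-zero) entries
  return m;
}

int main()
{
  SparseMatrix m = ReadSparseMatrix("m.mat");  // hypothetical file name
  for (const auto& [i, row] : m)
    for (const auto& [j, value] : row)
      std::cout << "M(" << i << "," << j << ") = " << value << '\n';
}

A sparse vector file would be read the same way, with "index value" pairs in place of triples.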

Exercise 1. Consider the 10×10 matrix M defined by this file "m10x10.mat":

10 10
2 1 0 0 0 0 0 0 0 0 
1 2 1 0 0 0 0 0 0 0 
0 1 2 1 0 0 0 0 0 0
0 0 1 2 1 0 0 0 0 0 
0 0 0 1 2 1 0 0 0 0 
0 0 0 0 1 2 1 0 0 0 
0 0 0 0 0 1 2 1 0 0 
0 0 0 0 0 0 1 2 1 0 
0 0 0 0 0 0 0 1 2 1 
0 0 0 0 0 0 0 0 1 2 

Note as a matter of interest that this matrix is symmetric and has all of its non-zero entries on or near the diagonal, making it "sparse" in the sense that most entries are zero.

Consider also the dimension 10 vector V defined by this file "v10.vec":

10
0 0 0 0 1 1 0 0 0 0

Compute the following items:

  1. The vector W = M×V
  2. The sparse matrix representation m of M
  3. The sparse vector representation v of V
  4. The sparse vector representation w of W
  5. The sparse product x = m×v

Then verify that the sparse vectors w and x are equal. (Show all work.)

Exercise 2. Provide estimates of the following numbers:

  1. The number n of live pages on the open web on Jan 1 2018.
  2. The expected increase in n per week in 2018.

Document your findings.
