COT 4401 Top 10 Algorithms
Chris Lacher
Kevin Bacon

Resources

Graphs 1 [Lacher Notes]
Undirected Graphs [Sedgewick Slides]
Lecture Video
Linux executables: kb.x fkb.x
Movie DB: movies.txt (required by executables)
Stanford Network Analysis Project: SNAP (includes tools and large-scale graph data)

The KB Game

Here is the "Kevin Bacon [KB]" Game.

Define a relation "co-actor", denoted by μ, among actors by AμB iff A and B are actors in some movie together. Clearly μ is reflexive [AμA for all A] and symmetric [AμB implies BμA for all A,B]. But μ is not transitive, and the KB game is really about the transitive closure of μ.

Continuing with the rules of the KB game, define a "co-actor chain" to be a finite sequence A₁, A₂, ..., Aₙ such that A_k μ A_{k+1} for k = 1, 2, ..., n-1. Finally, define the KB number of an actor A to be the number of links, n-1, in a shortest co-actor chain with A = A₁ and KB = Aₙ. The game is then to:

Make an Assertion: Give the KB number of A (or at least an upper bound on the KB number of A), and
Provide a Proof: Give a sequence of movies and actors defining the co-actor chain connecting A to KB.

Note that this is a famous game that began as an informal "parlor" or "movie trivia" game. It is now implemented by Google. Note also that this game can be generalized, first to other actors, then to other domains, such as:

The "Erdos Number" of a mathematician - how many co-author pairs take X back to Paul Erdos?
[Lacher's Erdos number is 3: Author chain is Lacher-Kuratowski-Ulam-Erdos.]
Social Media Analysis - how many friend relations separate me from, say, a Kardashian? [probably a lot.] And more interestingly, what does the "friend map" look like? Who are the "super friends"? These are interesting questions useful to social scientists, marketers, and others.

What we will explore in this segment is: what are the algorithms needed to calculate KB? To get started, you need to brush up on your graph theory. First take a look at the appropriate chapter(s) in your Discrete Math text. Then begin to consider how graphs may be implemented, by looking in the algorithms texts, the multitude of online references, and the resources given above.

There are also some excellent lecture videos around, but I find these somewhat time consuming. You may of course read or watch any material you want, and if you find a good one, please post the link for the rest of us.

KB Algorithms

Here is a synopsis of how to set up the KB game.

Create a graph whose vertices are either actors or movies and whose edges represent an actor in a movie. Note that this is a bipartite graph - color actor vertices red and movie vertices blue. Every edge has one blue end and one red end.
Invoke breadth-first search from the Kevin Bacon vertex, and record the resulting BFS tree as a collection of parent vertices for each "black" vertex. Note that the white vertices are unreachable from Kevin Bacon.
To find KB(actor A): if (A is colored white) {KB = infinity} else {KB = L/2} where L = length of the path from A to Kevin Bacon in the BFS tree.

Notes:

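We can divide by 2 because every path from one actor to another has an even number of edges: the vertices along a path alternate between actors and movies, so each step from one actor to the next passes through exactly one movie and uses two edges. (This is a fact about bipartite graphs.)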
The documentation can also be provided in the form of the actual path from A back to Kevin Bacon: [A M₁ X₁ M₂ X₂ M₃ X₃ ... X_{k-1} M_k KB], where M_i are movies and X_i is an actor in both M_i and M_{i+1}.
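As a rough sketch of how the proof path and the KB number can be read off the BFS tree, the code below walks parent links from an actor back to the root. It is not the code behind kb.x. The function names PathToRoot, KBNumber, and FormatProof, the sentinel kNoParent, and the use of std::vector are all assumptions made for illustration, and the parent vector is assumed to come from a BFS rooted at the Kevin Bacon vertex.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sentinel meaning "this vertex has no parent in the BFS tree".
constexpr std::size_t kNoParent = static_cast<std::size_t>(-1);

// Follows parent links from vertex a up to the BFS root and returns the
// vertices in order [a, ..., root]. Returns an empty vector if a was not
// reached by the search (a "white" vertex).
std::vector<std::size_t> PathToRoot(const std::vector<std::size_t>& parent,
                                    std::size_t a, std::size_t root)
{
  std::vector<std::size_t> path;
  for (std::size_t v = a; ; v = parent[v])
  {
    path.push_back(v);
    if (v == root) return path;            // reached Kevin Bacon
    if (parent[v] == kNoParent) return {}; // dead end: a is unreachable
  }
}

// KB number = L / 2, where L = number of edges on the path = path.size() - 1.
// Precondition: path is non-empty (the actor is reachable).
std::size_t KBNumber(const std::vector<std::size_t>& path)
{
  return (path.size() - 1) / 2;
}

// Renders the path one vertex per line. Even positions hold actors and odd
// positions hold movies, because the graph is bipartite.
std::string FormatProof(const std::vector<std::size_t>& path,
                        const std::vector<std::string>& name)
{
  std::string out;
  for (std::size_t k = 0; k < path.size(); ++k)
  {
    out += (k % 2 == 0 ? " " : "   | ");
    out += name[path[k]];
    out += '\n';
  }
  return out;
}
```

An empty result from PathToRoot corresponds to KB = infinity, so a caller would check for that case before calling KBNumber.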

So the primary building blocks are:

Movie database
Symbol Graph representation
BFS algorithm/survey
Fact about bipartite graphs

Graph Representation

Adjacency Matrix

Vertices are represented by integers 0,1,...,n-1. Edges are represented by an n × n matrix AM where AM[i][j] = 1 if there is an edge from i to j and AM[i][j] = 0 otherwise.

This is a convenient representation, especially for small or dense graphs or in a mathematical (as opposed to computational) setting. One has direct access to edge information with such a representation.

The biggest disadvantage of the adjacency matrix representation shows up precisely where many real-world applications lie: computing with graphs that have many vertices and relatively few edges.

Note that there are potentially Θ(n²) edges in a graph, just as there are n² entries in the adjacency matrix. If there are only, say, an average of 100 edges adjacent to each vertex in a very large graph, then most of the "slots" for edges in the adjacency matrix are wasted storage.
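To put hypothetical numbers on this: with n = 1,000,000 vertices and an average of 100 edges per vertex, the matrix has 10¹² entries, of which only about 10⁸ (0.01%) are nonzero. The following is a minimal sketch of an adjacency matrix for an undirected graph, assuming std::vector for storage. The class MatrixGraph and its method names are invented for illustration and are not part of the course library.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration type for an undirected graph on vertices 0..n-1.
class MatrixGraph
{
public:
  explicit MatrixGraph(std::size_t n)
    : n_(n), am_(n, std::vector<bool>(n, false)) {}

  // Undirected edge: set both symmetric entries.
  void AddEdge(std::size_t i, std::size_t j)
  {
    am_[i][j] = true;
    am_[j][i] = true;
  }

  // Direct access to edge information: a single lookup.
  bool HasEdge(std::size_t i, std::size_t j) const { return am_[i][j]; }

  // Number of matrix entries allocated, n * n, regardless of edge count.
  std::size_t Slots() const { return n_ * n_; }

  // Number of entries that actually record an edge (scans every entry).
  std::size_t UsedSlots() const
  {
    std::size_t used = 0;
    for (const auto& row : am_)
      for (bool bit : row)
        if (bit) ++used;
    return used;
  }

private:
  std::size_t n_;
  std::vector<std::vector<bool>> am_;
};
```

Comparing UsedSlots() with Slots() on a sparse graph makes the wasted storage visible: each undirected edge accounts for only two entries.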

The basic test of efficiency of representation is "how much time is required to touch every component (vertex and edge) of the graph?" (This is a lower bound on the time required for any traversal of the graph.)

Exercise 1a Touch Time. Explain why, for the adjacency matrix representation, the "touch time" is Θ(n²).

Adjacency List

Vertices are represented by integers 0,1,...,n-1, just as with the adjacency matrix. A vector AL is used to store a list of vertices that are adjacent to vertex i: AL[i] = list of vertices adjacent to i.

Answering the question "is there an edge from v to w" requires a search of the adjacency list of v. This search is necessarily sequential, so a successful search examines about half of the list on average, and an unsuccessful search examines all of it. This is not as detrimental as it may seem at first glance, because the adjacency lists of a sparse graph are short on average. In fact, it is shown in discrete math that

Σ_v deg(v) = 2e

for undirected graphs, where e is the number of edges and deg(v) is the number of edges touching v, the same as the size of the adjacency list AL[v]. The average list size is therefore 2e/n. (A similar result holds for directed graphs - see [Lacher, Graphs1].)
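Here is a small sketch of the adjacency list representation that also lets us check the degree-sum fact (the sum of the degrees equals twice the number of edges) on a concrete graph. The class ListGraph and its method names are assumptions for illustration only, std::vector and std::list stand in for the course containers, and the sketch assumes no edge is added twice.

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical illustration type for an undirected graph on vertices 0..n-1.
class ListGraph
{
public:
  explicit ListGraph(std::size_t n) : al_(n), e_(0) {}

  // Undirected edge: each endpoint is stored in the other's list.
  void AddEdge(std::size_t v, std::size_t w)
  {
    al_[v].push_back(w);
    al_[w].push_back(v);
    ++e_;
  }

  // Sequential search of the adjacency list of v.
  bool HasEdge(std::size_t v, std::size_t w) const
  {
    for (std::size_t x : al_[v])
      if (x == w) return true;
    return false;
  }

  // deg(v) is the size of the adjacency list AL[v].
  std::size_t Degree(std::size_t v) const { return al_[v].size(); }

  std::size_t EdgeCount() const { return e_; }

  // Sum of deg(v) over all vertices v.
  std::size_t DegreeSum() const
  {
    std::size_t sum = 0;
    for (const auto& neighbors : al_) sum += neighbors.size();
    return sum;
  }

private:
  std::vector<std::list<std::size_t>> al_;  // al_[v] = vertices adjacent to v
  std::size_t e_;                           // number of edges added
};
```

For a graph with edges 0-1, 0-2, 1-2, and 2-3, the degrees are 2, 2, 3, and 1, which sum to 8 = 2 × 4.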

Exercise 1b Touch Time. Argue that the "touch time" using the adjacency list representation is Θ(n + e), where n is the number of vertices and e is the number of edges.

For "sparse" graphs, for example satisfying e = O(n), the touch time is Θ(n), a significant improvement over the adjacency matrix representation.

Symbol Graphs

Let us refer to a graph with vertices 0,1, ..., n-1 as an abstract graph. Abstract graphs provide a convenient and efficient platform on which to build various algorithms such as depth- and breadth-first search. In practical applications, the graph components represent real-world entities and hence we need a way to provide names for the components. Let's concentrate on the problem of naming vertices, or associating vertices with symbols (such as string objects).

Suppose that S is a type (such as std::string). To associate instances of S with graph vertices we need a map s2i taking these S objects to unsigned integers representing graph vertices, along with an inverse map i2s taking vertices to S objects, and these need to be mutual inverses:

s2i[i2s[v]] == v for any vertex v of the graph
i2s[s2i[x]] == x for any symbol x being represented by a vertex

It so happens we already have the technology for these mappings: s2i is an associative array with KeyType = S and DataType = unsigned long, and i2s is an ordinary vector with ValueType = S. It takes a little management to maintain these two mappings, but the design and implementation is straightforward. The payoff is that we can now build a graph whose vertices are "symbols" from the type S. A similar technique can be used to associate symbols with edges, if needed.
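The "little management" might look something like the sketch below, which keeps the two mappings in step by assigning the next unused vertex number whenever a new symbol arrives. This is only an illustration: the class SymbolTable and its method names are invented here, and std::unordered_map and std::vector are assumed as stand-ins for the associative array and vector.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical illustration type pairing s2i (symbol -> vertex)
// with its inverse i2s (vertex -> symbol).
class SymbolTable
{
public:
  // Returns the vertex for symbol s. A symbol seen for the first time is
  // assigned the next unused vertex number and recorded in both maps.
  std::size_t Insert(const std::string& s)
  {
    auto it = s2i_.find(s);
    if (it != s2i_.end()) return it->second;
    const std::size_t v = i2s_.size();
    s2i_.emplace(s, v);
    i2s_.push_back(s);
    return v;
  }

  bool Contains(const std::string& s) const { return s2i_.count(s) != 0; }

  // Symbol -> vertex; throws std::out_of_range if s is not in the table.
  std::size_t VertexOf(const std::string& s) const { return s2i_.at(s); }

  // Vertex -> symbol; throws std::out_of_range if v is not a valid vertex.
  const std::string& SymbolOf(std::size_t v) const { return i2s_.at(v); }

  std::size_t Size() const { return i2s_.size(); }

private:
  std::unordered_map<std::string, std::size_t> s2i_;
  std::vector<std::string> i2s_;
};
```

Because both containers are updated only inside Insert, the mutual-inverse property holds by construction: SymbolOf(VertexOf(x)) == x for every stored symbol x, and VertexOf(SymbolOf(v)) == v for every vertex v.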

Breadth First Search

We repeat here the BFS algorithm - breadth-first search from a given vertex - please see the references, particularly [Lacher, Graphs1], for more details.

BFS is set up as a class, with a double-ended control queue:

class BFSurvey
{
public:

  typedef uint32_t Vertex;

         BFSurvey ( const Graph& g );
  void   Search   ( );
  void   Search   ( Vertex v );
  void   Reset    ( );

  Vector < Vertex >  distance;  // distance of vertex from root of BFS tree (origin of search)
  Vector < Vertex >  parent;    // parent in BFS tree
  Vector < Color >   color;     // state of vertex at any point during search/survey

private:
  const Graph&      g_;
  Vector < bool >   visited_ ;
  Deque  < Vertex > conQ_  ;
};

The class contains a reference to a graph object on which the survey is performed, private data used in the algorithm control, and public variables to house three results of the survey - distance, parent, and color for each vertex in the graph. (These could be privatized with accessors and other trimmings for data security.) These data are instantiated by the survey and have the following interpretation when the survey is completed:

code          description
distance[x]   number of edges from v to x
parent[x]     the parent in the search tree - the vertex from which x was discovered
color[x]      white, grey, or black: white = undiscovered, grey = being processed, black = finished

During the course of Search, when a vertex is pushed onto the control queue in FIFO order, it is colored grey and assigned distance one more than its parent at the front of the queue. The vertex is colored black when popped from the queue. At any given time during Search, the grey vertices are precisely those in the FIFO control queue. The 1-argument constructor initializes the Graph reference and sets all the various data to the initial/reset state:

BFSurvey::BFSurvey (const Graph& g)
  : distance(g.vSize, g.eSize + 1), parent(g.vSize, null),
    color(g.vSize, white), g_(g),
    visited_(g.vSize, false)
{}

The search begins at a vertex:

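void BFSurvey::Search( Vertex v )
{
  conQ_.PushBack(v);           // the origin is the first vertex discovered
  visited_[v] = true;
  distance[v] = 0;
  color[v]    = grey;
  while (!conQ_.Empty())
  {
    Vertex f = conQ_.Front();  // oldest grey vertex
    if (n = unvisited adjacent from f in g_)  // pseudocode: next unvisited neighbor n of f, if any
    {
      conQ_.PushBack(n);  // PushFIFO
      visited_[n] = true;
      distance[n] = distance[f] + 1;
      parent[n]   = f;    // f is the vertex from which n was discovered
      color[n]    = grey;
    }
    else                  // every neighbor of f has been visited
    {
      conQ_.PopFront();
      color[f] = black;
    }
  }
}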

The no-argument Search method repeatedly calls Search(v), once for each vertex v that is still white, thus ensuring that the survey considers the entire graph. Often there are relatively few vertices not reached on the first call, but nevertheless Search() perseveres until every vertex has been discovered.

void BFSurvey::Search()
{
  Reset();
  for (each vertex v of g_)
  {
    if (color[v] == white) Search(v);
  }
}

void BFSurvey::Reset()
{
  for (each vertex v of g_)
  {
    visited_[v] = false;
    distance[v] = g_.eSize + 1; // impossibly large
    parent[v] = null;
    color[v] = white;
  }
}
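For readers who want something they can compile and run, here is a compact variant of the same search written against the standard library. It is a sketch, not the BFSurvey implementation: it pops the front vertex first and scans all of its neighbors in one pass, rather than peeking at the front and asking for the next unvisited neighbor. The names BFSResult, BreadthFirstSearch, and kNone are assumptions, the graph is assumed to be a vector of adjacency lists, and unreached vertices are marked with kNone instead of eSize + 1.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

enum class Color { white, grey, black };

// Hypothetical sentinel for "no parent" and "unreachable distance".
constexpr std::size_t kNone = static_cast<std::size_t>(-1);

// Hypothetical result bundle mirroring the three public vectors of the survey.
struct BFSResult
{
  std::vector<std::size_t> distance;  // edges from the root, or kNone
  std::vector<std::size_t> parent;    // parent in the BFS tree, or kNone
  std::vector<Color>       color;     // white = never discovered
};

// Breadth-first search from root over an adjacency-list graph.
// Precondition: root < adj.size().
BFSResult BreadthFirstSearch(const std::vector<std::vector<std::size_t>>& adj,
                             std::size_t root)
{
  const std::size_t n = adj.size();
  BFSResult r{ std::vector<std::size_t>(n, kNone),
               std::vector<std::size_t>(n, kNone),
               std::vector<Color>(n, Color::white) };

  std::queue<std::size_t> conQ;  // FIFO control queue
  r.distance[root] = 0;
  r.color[root]    = Color::grey;
  conQ.push(root);

  while (!conQ.empty())
  {
    const std::size_t f = conQ.front();
    conQ.pop();
    for (std::size_t w : adj[f])
    {
      if (r.color[w] != Color::white) continue;  // already discovered
      r.distance[w] = r.distance[f] + 1;
      r.parent[w]   = f;
      r.color[w]    = Color::grey;
      conQ.push(w);
    }
    r.color[f] = Color::black;  // all neighbors of f have been examined
  }
  return r;
}
```

Each vertex enters the queue at most once and each adjacency list is scanned once, so this variant visits every reachable vertex and edge a bounded number of times.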

Now we can complete the KB game implementation:

Create a symbol graph whose vertices are strings representing either actors or movies. Insert an edge between an actor vertex A and a movie vertex M whenever A is an actor in M.
Invoke BFSurvey::Search(Kevin Bacon). The BFS tree rooted at Kevin Bacon is our search space and is recorded in the parent vector. Note that the white vertices are unreachable from Kevin Bacon.
To find KB(actor A): if (A is colored white) {KB = infinity} else {KB = L/2} where L = length of the path from A to Kevin Bacon in the BFS tree.

Exercise 2 Exploring KB.
Get copies of "kb.x" and "movies.txt" into your programming space on linprog and experiment:

What is the largest KB(actor) you can find (other than infinity)?
Explain why KB numbers 1,2,3 seem to be the most common.
Can you find an actor with KB(actor) = infinity?
What does KB(actor) = infinity say about the movie/actor graph?

Software notes:

Enter "kb.x" to see what is expected. The base actor name should be in single quotes.
Typical startup: kb.x movies.txt 'Bacon, Kevin' <Enter>

When playing the game, if you enter an actor name that is not found in the DB, a hint is provided in the form of a listing of DB entries near the name you entered. If you find the correct name in the hint, you can just copy/paste it into the interface.

Here is a sample session:

kb.x movies.txt 'Bacon, Kevin'

 Loading database movies.txt (first read) ...(second read) ... done. 
 4188 movies and 115241 actors read from movies.txt
 Load time: 3.90 sec

Welcome to MovieMatch ( Bacon, Kevin )
Enter actor name ('0' to quit): Wayne, John
 Name 'Wayne, John' not in DB 'movies.txt'
 Here are some similar name possibilities:
...
Wayne, Donald
Wayne, Fredd
Wayne, Geoff
Wayne, George
Wayne, Greg (I)
Wayne, Gus
Wayne, Harte
Wayne, Jesse (I)
Wayne, John (I)
Wayne, Keith
Wayne, Ken (I)
Wayne, Kevin
Wayne, Marion
...
Enter actor name ('0' to quit): Wayne, John (I)
 The KB Number of 'Wayne, John (I)' is: 2
  Do you want proof? y
   A connecting path is:

 Wayne, John (I)
   | El Dorado (1966)
 Asner, Edward
   | JFK (1991)
 Bacon, Kevin

   The path is minimal because it was found with BFS [ref graph theory].
Enter actor name ('0' to quit):

Exercise 3 Generalizing KB.
There are numerous ways that the KB game technology [possibly slightly modified] can be applied in other settings. For each of the following settings, explain (A) how a KB-like model would be used and any modifications that would be needed, and (B) what the model results mean in the new context and how they would be useful.

Amazon book recommendations [hint: we have books purchased by customers].
Facebook friend analysis
Twitter follower analysis
Cell phone metadata [calls made, but no content captured]
Jury seating analysis - for example, finding jurors for the Boston Marathon bombing trial who do not have personal experiences with one of the victims, first- or second- hand.
Erdos number of a mathematician.

Exercise 4 Theorizing about degrees of separation.
Try to come up with a quantitative way of explaining why "degrees of separation" are often surprisingly low.
