Project 8: Degrees of Separation

Implementing the Kevin Bacon game

Revision dated 01/05/18

Educational Objectives: After completing this assignment, the student should be able to accomplish the following:

  • Describe and implement SymbolGraph
  • Define bipartite graphs
  • Explain the basic conclusions about path lengths in bipartite graphs
  • Describe the back-end design for the Kevin Bacon game solver
  • Propose other environments where a similar path game might be played

Operational Objectives: Design and implement the class MovieMatch

Deliverables: Files:

moviematch.h
log.txt

Movie Distance and Kevin Bacon

The Kevin Bacon game is this: given an actor by name, what is his/her Kevin Bacon number?

To solve this we first need a clear definition of the Kevin Bacon number for an actor, or more generally, the movie distance between two actors. The definition is much like the path distance between two vertices in a graph, except using movie chains instead of edges.

A movie chain from actor x to actor y is a sequence of movies m1 m2 ... mk such that

  1. mj and mj+1 have an actor in common for 0 < j < k
  2. x is in movie m1
  3. y is in movie mk

The movie distance md(x,y) is defined to be the number of movies in a shortest movie chain from x to y. If there is no movie chain from x to y, we define md(x,y) = infinity.

The Kevin Bacon number of an actor x is the movie distance from x to Kevin Bacon.

Some consequences are:

  1. Kevin Bacon has Kevin Bacon number 0.
  2. In general, md(x,x) = 0 for any actor x.
  3. All other actors have Kevin Bacon number at least 1.
  4. In general, if x != y and x and y are actors in the same movie, then md(x,y) = 1
  5. Movie distance satisfies the triangle inequality: md(x,z) <= md(x,y) + md(y,z)

The actor-movie graph

To solve the Kevin Bacon game (or any other similar game based on another actor) we use graphs. Specifically, create a graph in which both actors and movies are vertices, and insert an edge whenever an actor is in a movie. Thus each edge has an actor for one vertex and a movie for the other.

A graph is said to be bipartite if the vertices can be colored with two colors, say red and blue, such that each edge has different colored vertices, that is, each edge goes between a blue vertex and a red vertex. Clearly the movie-actor graph is bipartite, with actors colored blue and movies colored red.

The following result is proved in discrete math courses and most books on graph theory:

Theorem. In a bipartite graph, a path whose ends have the same color has an even number of edges.

As a consequence, any path from one actor to another in the movie-actor graph has an even number of edges. If P is such a path, with length n, then n is even and n/2 is the number of movies passed through by P. If P is a shortest path from actor x to actor y, then n/2 is the movie distance from x to y. (Note, in passing, that the path P has an odd number of vertices.)
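The parity claim can be checked with a short counting argument. The notation below (a path v_0, ..., v_n and a color function c with values 0 and 1) is introduced here only for illustration:

```latex
% Colors alternate along any path in a bipartite graph.
\begin{align*}
  c(v_i) &\equiv c(v_{i-1}) + 1 \pmod 2 \quad (1 \le i \le n)
    && \text{each edge joins different colors} \\
  c(v_n) &\equiv c(v_0) + n \pmod 2
    && \text{apply the first line $n$ times} \\
  c(v_n) = c(v_0) &\implies n \equiv 0 \pmod 2
    && \text{same-colored ends force even length}
\end{align*}
% For an actor-to-actor path with n = 2k edges, the movies are
% v_1, v_3, \dots, v_{2k-1}: exactly k = n/2 of them, among n + 1 vertices.
```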

Thus to solve the Kevin Bacon game, we perform a breadth-first search from Kevin Bacon. The breadth-first search tree rooted at Kevin Bacon consists of shortest paths from Kevin Bacon to all other actors who have a finite Kevin Bacon number. Dividing the length of such a path by 2 yields the Kevin Bacon number for the actor at the other end of the path.

In practical terms, we start at an actor x and follow the parent vertices of the BFS tree back to Kevin Bacon, counting the steps. Then divide this count by 2 to get the number.

Note that the path itself provides documentation in the form of a list starting with x and then listing movie | actor in pairs until we are back to Kevin Bacon.
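The search-then-halve procedure described above can be sketched with standard containers. This is only an illustration under stated assumptions: it uses std::vector and std::queue rather than the course library's ALUGraph and BFSurvey, and the names AdjList, BfsParents, and MovieDistanceSketch are hypothetical.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Adjacency-list graph: adj[v] lists the neighbors of vertex v.
// In the actor-movie graph every edge joins an actor vertex and a movie vertex.
using AdjList = std::vector<std::vector<std::size_t>>;

// Returns parent[v] for the BFS tree rooted at root. Unreached vertices
// (and the root itself) keep the sentinel value adj.size().
std::vector<std::size_t> BfsParents(const AdjList& adj, std::size_t root)
{
  const std::size_t none = adj.size();
  std::vector<std::size_t> parent(adj.size(), none);
  std::vector<bool> seen(adj.size(), false);
  std::queue<std::size_t> q;
  seen[root] = true;
  q.push(root);
  while (!q.empty())
  {
    std::size_t v = q.front();
    q.pop();
    for (std::size_t w : adj[v])
    {
      if (!seen[w])
      {
        seen[w] = true;
        parent[w] = v;
        q.push(w);
      }
    }
  }
  return parent;
}

// Counts edges from actor back to root along parent links, then halves.
// Returns -1 if actor was not reached by the search.
long MovieDistanceSketch(const std::vector<std::size_t>& parent,
                         std::size_t root, std::size_t actor)
{
  const std::size_t none = parent.size();
  long edges = 0;
  for (std::size_t v = actor; v != root; v = parent[v])
  {
    if (parent[v] == none) return -1;  // no path back to the root
    ++edges;
  }
  return edges / 2;  // two edges (actor-movie, movie-actor) per movie
}
```

For a chain Bacon, movie, actor X, movie, actor Y (vertices 0 through 4), the sketch gives distance 1 for X and 2 for Y, and -1 for any vertex with no path to vertex 0.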

Procedural Requirements

  1. The official development | testing | assessment environment is given in the course organizer. Code should compile without error or warning.

  2. Maintain your work log in the text file log.txt as documentation of effort, testing results, and development history. This file may also be used to report on any relevant issues encountered during project development.

  3. Copy all files from LIB/proj8, including:

    kb.cpp          # client program plays Kevin Bacon game
    line.cpp        # contains implementation of Line()
    movies.txt      # movie DB
    movies_abbreviated.txt # smaller version for debugging and optimizing
    deliverables.sh # submit configuration file
    
  4. When logged in to shell or quake, submit the project by executing "submit.sh deliverables.sh". Read the screen and watch for processing errors.

    Warning: The submit process does not work on the program and linprog servers. Use shell or quake to submit projects. If you do not receive the second confirmation with the contents of your project, there has been a malfunction.

Code Requirements and Specifications - MovieMatch

  1. MovieMatch should, at a minimum, provide services required by kb.cpp. This will require the following (partial) class definition:

    // types used
    typedef uint32_t                           Vertex;
    typedef fsu::String                        Name;
    typedef fsu::ALUGraph <Vertex>             Graph;
    typedef fsu::BFSurvey <Graph>              BFS;
    typedef hashclass::KISS<Name>              Hash;
    typedef fsu::HashTable<Name,Vertex,Hash>   AA; // associative array
    typedef fsu::Vector<Name>                  Vector;
    
    class MovieMatch
    {
    public:
           MovieMatch    ();
      bool Load          (const char* filename);
      bool Init          (const char* actor);
      void Shuffle       ();
      long MovieDistance (const char* actor);
      void ShowPath      (std::ostream& os) const;
      void ShowStar      (Name name, std::ostream& os) const;
      void Hint          (Name name, std::ostream& os) const;
      void Dump          (std::ostream& os) const;
      ...
    };
    

  2. The underlying graph should be built from the "database" provided in the text file movies.txt. Each line of this file represents a movie and the actors in the movie. Forward slash '/' is used to delimit the strings representing movie titles and actor names in each line.

  3. The following helper function makes reading a movie DB file somewhat straightforward:

    private:
      static void  Line      (std::istream& is, fsu::Vector<Name>& movie);
    ...
    

    This function consumes a line of text from the stream and instantiates the vector "movie" (passed by reference) whose elements are the names that are delimited by '/' in that line of the file. (Recall that each line of a movies file represents one movie and the actors in that movie, delimited by '/'.) The first element of movie is then a movie title and all other elements are actors in that movie. Note that you are not required to use this - it can be optimized away - but it is very helpful in a draft to postpone read issues until the main functionality is built. An implementation is distributed in the file line.cpp.
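    The splitting behavior described for Line() can be illustrated with standard library types. This is a hypothetical stand-in, not the implementation distributed in line.cpp: it assumes std::string and std::vector in place of fsu::String and fsu::Vector, and the name SplitMovieLine is invented for this sketch.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads one line from is and splits it on '/'.
// fields[0] is the movie title; the remaining entries are actor names.
// fields is left empty if no line could be read.
void SplitMovieLine(std::istream& is, std::vector<std::string>& fields)
{
  fields.clear();
  std::string line;
  if (!std::getline(is, line)) return;
  std::istringstream iss(line);
  std::string field;
  while (std::getline(iss, field, '/'))
    fields.push_back(field);
}
```

    Calling the sketch repeatedly on a stream walks through the file one movie at a time; an empty result signals that no further line was available.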

  4. bool Load (const char* filename)
    This method uses the data in the file to build the underlying symbol graph for the game. The symbol graph consists of these private members:

    private:
      ...
      Graph  g_;
      Vector name_;
      AA     vrtx_;
      ...
    

    name_ is a mapping: {vertices} -> {names}, and vrtx_ is a mapping: {names} -> {vertices}. Even though Vector and AA are very different structurally, they perform as mappings in the abstract, each using its bracket operator as function evaluation. These mappings are required to be mutual inverses: For any vertex v, vrtx_[name_[v]] == v and for any name n, name_[vrtx_[n]] == n.

    Load must look at each name encountered (movie or actor) and, if and only if that name has not already been encountered, record it as a new vertex. Then Load must add an edge [a,m] to g_ whenever a is an actor in movie m.

    It is advisable to allow Load to read the file twice: First to establish the vertices and the two mappings vrtx_ and name_; and second to insert all of the edges. Function Line() will be handy for these steps.
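    A minimal sketch of the two-pass idea, assuming standard containers in place of fsu::Vector, fsu::HashTable, and ALUGraph, and in-memory lines in place of a file. The names SymbolGraphSketch, Fields, and BuildSketch are hypothetical.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

struct SymbolGraphSketch
{
  std::vector<std::string>                     name;  // vertex -> name
  std::unordered_map<std::string, std::size_t> vrtx;  // name -> vertex
  std::vector<std::vector<std::size_t>>        adj;   // adjacency lists
};

// Splits one '/'-delimited line into its fields.
std::vector<std::string> Fields(const std::string& line)
{
  std::vector<std::string> out;
  std::istringstream iss(line);
  std::string f;
  while (std::getline(iss, f, '/')) out.push_back(f);
  return out;
}

// Builds the sketch from in-memory lines (a stand-in for reading a file twice).
SymbolGraphSketch BuildSketch(const std::vector<std::string>& lines)
{
  SymbolGraphSketch sg;
  // Pass 1: give each previously unseen name the next vertex number.
  for (const std::string& line : lines)
    for (const std::string& n : Fields(line))
      if (sg.vrtx.find(n) == sg.vrtx.end())
      {
        sg.vrtx.emplace(n, sg.name.size());
        sg.name.push_back(n);
      }
  // Pass 2: the vertex count is now known, so size the graph and add edges.
  sg.adj.resize(sg.name.size());
  for (const std::string& line : lines)
  {
    std::vector<std::string> f = Fields(line);
    if (f.empty()) continue;
    std::size_t m = sg.vrtx.at(f[0]);        // movie vertex
    for (std::size_t i = 1; i < f.size(); ++i)
    {
      std::size_t a = sg.vrtx.at(f[i]);      // actor vertex
      sg.adj[m].push_back(a);                // undirected edge [a,m]
      sg.adj[a].push_back(m);
    }
  }
  return sg;
}
```

    Because a name is appended to name at the same moment its index is stored in vrtx, the two mappings are inverses by construction. Splitting the work into two passes means the number of vertices is known before any edge is inserted.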

  5. bool Init (const char* actor)
    This method establishes actor as the base actor in the game (i.e., the "Kevin Bacon") and performs a BFS from the base actor in the graph. This BFS searches only from the base actor vertex (not a full survey) and records all the parent info for use later during game play. The BFS survey data is thus required to be persistent, so it is maintained as a BFSurvey object:

    private:
      ...
      Name   baseActor_;
      BFS    bfs_;
      ...
    

  6. void Shuffle ()
    This method "shuffles" the vertex stars (aka adjacency lists) pseudo-randomly, which makes the game more interesting. Shuffle is implemented by (1) calling Graph::Shuffle() and then re-computing the search paths with (2) bfs_.Reset() and (3) bfs_.Search(vrtx_[baseActor_]).

    Graph::Shuffle will also need to be added to ALUGraph as a public member function. It is implemented by calling List::Shuffle for each adjacency list:

    template < typename N >
    void ALUGraph<N>::Shuffle()
    {
      for (Vertex v = 0; v < VrtxSize(); ++v) al_[v].Shuffle();
    }
    

    List::Shuffle is already implemented in fsu::List. It does a fairly simple card-shuffle-like permutation of the list (and could certainly be improved to better pseudo-randomness).

    Still, the effect of Shuffle is to change some of the "proof paths" of MovieMatch in ways that are not easy to predict. For example, the proof that Boniface, Isabel has KB number 3 is

    Boniface, Isabel
      | True Grit (1969)
    Duvall, Robert (I)
      | Eagle Has Landed, The (1976)
    Sutherland, Donald (I)
      | Animal House (1978)
    Bacon, Kevin
    

    and after Shuffle is

    Boniface, Isabel
      | True Grit (1969)
    Corey, Jeff
      | Beethoven's 2nd (1993)
    Chaykin, Maury
      | Where the Truth Lies (2005)
    Bacon, Kevin
    

    and after a second Shuffle is

    Boniface, Isabel
      | True Grit (1969)
    Hopper, Dennis
      | True Romance (1993)
    Pitt, Brad
      | Sleepers (1996)
    Bacon, Kevin
    

    At some point presumably the path will go through Wayne, John (I). It is an excellent thought exercise to explain how Shuffle affects the proof paths.
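    One way to experiment with this question is a small stand-alone model. The sketch below assumes std::vector adjacency lists and std::shuffle as a stand-in for List::Shuffle (whose actual permutation differs); Bfs, BfsResult, and ShuffleStars are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <random>
#include <vector>

using AdjList = std::vector<std::vector<std::size_t>>;

struct BfsResult
{
  std::vector<long>        dist;    // edges from root, -1 if unreached
  std::vector<std::size_t> parent;  // tree parent; root and unreached map to themselves
};

BfsResult Bfs(const AdjList& adj, std::size_t root)
{
  BfsResult r;
  r.dist.assign(adj.size(), -1);
  r.parent.resize(adj.size());
  for (std::size_t v = 0; v < adj.size(); ++v) r.parent[v] = v;
  std::queue<std::size_t> q;
  r.dist[root] = 0;
  q.push(root);
  while (!q.empty())
  {
    std::size_t v = q.front();
    q.pop();
    for (std::size_t w : adj[v])    // neighbors are tried in list order
      if (r.dist[w] < 0)
      {
        r.dist[w] = r.dist[v] + 1;
        r.parent[w] = v;            // first discoverer becomes the parent
        q.push(w);
      }
  }
  return r;
}

// Permutes every adjacency list; the edge set itself is unchanged.
void ShuffleStars(AdjList& adj, std::mt19937& rng)
{
  for (std::vector<std::size_t>& star : adj)
    std::shuffle(star.begin(), star.end(), rng);
}
```

    In this model, reordering an adjacency list can change which neighbor discovers a vertex first, and therefore its recorded parent, while the distances stay fixed because the set of edges is the same.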

  7. long MovieDistance (const char* actor)
    This method uses the pre-computed BFS tree to (1) determine whether actor is in the DB and retrieve its vertex if so (if (vrtx_.Retrieve(actor,v))), and (2) determine whether actor is reachable from the base actor (if (bfs_.Color()[v] == 'b')). If both tests succeed, it (3) computes the path from actor to the base actor (storing the path as it goes) and returns the movie distance. The path is stored in the class member

    private:
      ...
      fsu::List<Vertex>  path_;
      ...
    

    MovieDistance returns -3 when the entered name is not in the DB, -2 when the name is not reachable from the base actor, -1 when the name entered is a movie (not an actor), and otherwise the movie distance between actor and the base actor.
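    The return-code logic can be sketched as follows. This is an assumption-laden illustration, not the required implementation: it uses standard containers, takes precomputed dist and parent arrays instead of a BFSurvey object, and detects movies by the parity of their BFS depth (one possible approach, relying on the graph being bipartite). SearchData and MovieDistanceSketch are hypothetical names.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

struct SearchData
{
  std::unordered_map<std::string, std::size_t> vrtx;  // name -> vertex
  std::vector<long>        dist;    // edges from the base actor, -1 if unreached
  std::vector<std::size_t> parent;  // BFS tree parent of each reached vertex
};

// Returns -3 (unknown name), -2 (unreachable), -1 (name is a movie),
// or the movie distance. On success path holds the vertices from the
// named actor back to the base actor.
long MovieDistanceSketch(const SearchData& s, const std::string& name,
                         std::list<std::size_t>& path)
{
  path.clear();
  auto it = s.vrtx.find(name);
  if (it == s.vrtx.end()) return -3;
  std::size_t v = it->second;
  if (s.dist[v] < 0) return -2;
  if (s.dist[v] % 2 != 0) return -1;   // odd depth: a movie vertex
  for (long steps = s.dist[v]; steps > 0; --steps)
  {
    path.push_back(v);
    v = s.parent[v];
  }
  path.push_back(v);                   // v is now the base actor
  return s.dist[it->second] / 2;
}
```

    Note that the parity test only makes sense for reached vertices, which is consistent with checking reachability before checking for a movie.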

  8. void ShowPath (std::ostream& os) const
    This method outputs the entire path as an actor-movie chain connecting actor_ to baseActor_. This is used to document the movie distance number. See area51/kb_i.x for suggested behavior.

  9. void ShowStar (Name name, std::ostream& os) const
    This method outputs name (which might be a movie...) followed by the names of all vertices that are adjacent to name in the graph. This is implemented using an AdjIterator:

    typename fsu::ALUGraph<Vertex>::AdjIterator i;
    

    Note that if name is an actor the star is a list of all movies in which the actor appears. If name is a movie, the star is a list of all actors in that movie. See area51/kb_i.x for suggested behavior.

  10. void Hint (Name name, std::ostream& os) const
    This method provides hints intended to be helpful when a name is not found in the DB. See area51/kb_i.x for one idea on behavior.

  11. void Dump (std::ostream& os) const
    This method, as expected, depicts the internal structure of the MovieMatch objects. The demonstration program area51/kb_i.x uses this implementation:

      void Dump(std::ostream& os) const
      {
        ShowAL(g_,os);
        WriteData(bfs_,os);
        vrtx_.Dump(os);
        for (size_t i = 0; i < name_.Size(); ++i)
        {
          os << "name_[" << i << "] = " << name_[i] << '\t';
          os << "vrtx_[" << name_[i] << "] = " << vrtx_[name_[i]] << '\n';
        }
        vrtx_.Analysis(std::cout);
      }
    

    ShowAL and WriteData are in graph_util.h and survey_util.h, respectively. vrtx_.Dump() and vrtx_.Analysis() are calls to the HashTable API. The for loop shows the two mappings. Every one of these has proved helpful tracking down a bug!

  12. kb.cpp
    This client program is supplied. Note that it utilizes the entire API discussed above. The program #includes source code for all helpers in the library, so it can be compiled with one call to g++.

  13. Identical Output
    Output from your project should be identical to that produced by the area51 examples.

Hints

  • It is highly recommended to construct some tiny fake movies files. Spend a few minutes creating these to model specific cases of graph structure, accessibility, and redundancy. Keep a hand drawing of the symbol graphs for these examples so that the Dump output can be hand-traced. Note that Dump is called by kb.cpp when there is a third command line argument:

    kb.x m_test.1 name   # runs kb.x with DB = m_test.1 and base actor = name 
    kb.x m_test.1 name y # same as above, with a call to Dump after Load and Init
    

    It is also advisable to read the source code kb.cpp to understand what it is asking your MovieMatch object to do.

  • When you need a string with blanks in it to be read as a single command line argument, enclose it in single quotes:

    kb.x movies.txt 'Bacon, Kevin'  # runs kb.x with base actor = 'Bacon, Kevin' 
    

  • Here is a graphic created by former student Rachel Rados that illustrates most of what is going on with data in MovieMatch using a tiny movies database: movies_tiny.txt

  • Here is a partial list of technologies used in this project:

    graphs
    graph search & survey
    path computation in graphs
    associative arrays [hash tables]
    generic sort algorithms
    generic binary search
    

    Generic sorts are used to order the vertex star prior to output and to prepare the hint vector of all names. These both use a CaseInsensitiveLessThan predicate. Generic lower and upper bound are used to isolate a range in the hint vector that is sized to be useful.

  • Be careful to keep in mind the dual personality of the AA bracket operator: aa[key] behaves as "insert key" when key is not in the table. In a const environment you are protected. The const bracket operator will be called and fail if you accidentally use it in insert mode. You can use the const method HashTable::Retrieve to probe whether a name is already a key in the AA. Otherwise, it is advised to use the AA bracket operator for readability.
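    The insert-on-lookup hazard can be seen in miniature with std::unordered_map, used here only as a stand-in for fsu::HashTable. One difference to keep in mind: std::unordered_map has no const bracket operator at all, so the const-protection behavior described above is not modeled. The free functions Retrieve and VertexFor are hypothetical helpers written for this sketch.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

using Table = std::unordered_map<std::string, std::size_t>;

// Probes without modifying the table; returns true and sets v if key is present.
bool Retrieve(const Table& t, const std::string& key, std::size_t& v)
{
  Table::const_iterator it = t.find(key);
  if (it == t.end()) return false;
  v = it->second;
  return true;
}

// Returns the vertex for key, assigning the next unused number if key is new.
std::size_t VertexFor(Table& t, const std::string& key)
{
  std::size_t v = 0;
  if (Retrieve(t, key, v)) return v;
  v = t.size();
  t[key] = v;   // bracket operator used deliberately in insert mode
  return v;
}
```

    In this stand-in, a stray t["Wayne, John"] on a non-const table silently adds that key with value 0, which would then collide with whichever name legitimately owns vertex 0.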

  • Load time can be an issue. kb.cpp has a built-in timer for the load operation, and we'll run an informal contest on this measure. In designing your Load function, be aware of runtime in every step of the plan. There are a lot of places where choosing one direction over another can have a dramatic effect on Load time.

    The supplied executable area51/kb_i.x requires about 0.14 seconds to load movies_abbreviated.txt (190 movies and 10,190 actors) and 1.60 seconds to load movies.txt (4,188 movies and 115,241 actors). Note that the two ratios 1.60/0.14 and (4188 + 115241)/(190 + 10190) are approximately equal, informally indicating linear runtime growth.

    Nevertheless there are aspects to the Load process (as implemented for kb_i.x) that can be further optimized to reduce the load time.
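    The ratio comparison quoted above can be checked directly from the stated figures:

```latex
\begin{align*}
  \frac{1.60}{0.14} &\approx 11.43
    && \text{ratio of load times} \\
  \frac{4188 + 115241}{190 + 10190} = \frac{119429}{10380} &\approx 11.51
    && \text{ratio of vertex counts}
\end{align*}
```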

  • The implementation of Hint that is illustrated in area51/kb_i.x uses yet another item we have worked on: a generic sort algorithm to sort a vector hint_ which is built during the first read loop and consists of all names (actor and movie names). This sort is done after the graph has been established. (We do the sort as part of Init so it doesn't add to Load runtime.)

    Once hint_ is sorted, the generic binary search algorithms can be used to locate small ranges in the vector surrounding an input name.

    Hint() is needed because it is difficult to recall the exact name of an actor. For example, "Wayne, John" is not found in the DB ... Huh? ... ok, the hint shows us he is officially "Wayne, John (I)".

    Note BTW that you can mouse-select an entire line of Hint output on screen and "paste selection" will pipe the selection directly into input for a running kb.x.
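    The sorted-vector-plus-binary-search idea behind Hint can be sketched with the standard algorithms. This is an illustration only: it assumes std::sort and std::lower_bound in place of the course's generic algorithms, handles ASCII case folding only, and the names CaseInsensitiveLess and HintRange (and the radius parameter) are hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Case-insensitive "less than" on strings (ASCII letters only).
bool CaseInsensitiveLess(const std::string& a, const std::string& b)
{
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(),
      [](unsigned char x, unsigned char y)
      { return std::tolower(x) < std::tolower(y); });
}

// Returns up to radius names on each side of the position where query
// would sit in sorted, which must already be ordered by CaseInsensitiveLess.
std::vector<std::string> HintRange(const std::vector<std::string>& sorted,
                                   const std::string& query, std::size_t radius)
{
  using Diff = std::vector<std::string>::difference_type;
  const Diff r = static_cast<Diff>(radius);
  auto it = std::lower_bound(sorted.begin(), sorted.end(), query,
                             CaseInsensitiveLess);
  auto lo = (it - sorted.begin() > r) ? it - r : sorted.begin();
  auto hi = (sorted.end() - it > r) ? it + r : sorted.end();
  return std::vector<std::string>(lo, hi);
}
```

    With a sorted list containing "Wayans, Damon" and "Wayne, John (I)", a query for "Wayne, John" with radius 1 lands between them and returns both, which is the kind of nearby-name range a hint needs.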

  • Aside from bragging rights for "best load time" (self-reported on Discussion thread "Project 8 Load Time"), style points can be awarded for "intuitive hints" (AI anyone?, self-report to Discussion thread "Project 8 Hint").

  • The Kevin Bacon number of an actor using movies_abbreviated.txt is not necessarily the actual Kevin Bacon number. (Explain this.)

  • There is an analytical version of kb.cpp under the name "fkb". fkb plays the game just like kb, but has the added functionality of switching to menu-driven access to an extended MovieMatch API that has some analytic functionality that is not a required part of the project. You may find it helpful to use fkb_i.x as well as looking at the source code fkb.cpp. You could also create and build an abbreviated version for your use by commenting out the calls you have not implemented, giving you direct access to your MovieMatch API.