Project 8: Degrees of Separation
Implementing the Kevin Bacon game
Revision dated 07/13/18
Educational Objectives:
After completing this assignment, the student should be able to accomplish the
following:
- Describe and implement SymbolGraph
- Define bipartite graphs
- Explain the basic conclusions about path lengths in bipartite graphs
- Describe the back-end design for the Kevin Bacon game solver
- Propose other environments where a similar path game might be played
Operational Objectives:
Design and implement the class MovieMatch
Deliverables: Files:
moviematch.h
log.txt
Rubric used in Assessment
=========================================================================
rubric
-------------------------------------------------------------------------
tests:
fkb.x movies_test.1 zzz < fkb.com1 [0...20]: xx (basic functionality - KB numbers)
fkb.x movies_test.1 zzz < fkb.com2 [0...20]: xx (basic functionality - Star, Hint, Path)
fkb.x movies_abbreviated.txt 'Bacon, Kevin' < fkb.com3 [0...20]: xx (acceptable Hint & Star)
fkb.x movies.txt 'Wayne, John (I)' < fkb.com4 [0...20]: xx (Load + Init time < 3 sec)
fkb.x movies_abbreviated.txt 'Bacon, Kevin' < fkb.com5 [0...20]: xx (Shuffle + basic)
subjective eval:
SE [-50.. 0]: ( x)
log.txt testing diary [-50.. 0]: ( x)
requirements [-50.. 0]: ( x)
dated submissions deduction [4 pts each]: ( x)
--------------------
total: xxx
=========================================================================
Notes:
1. Code category includes technical expertise, including Load
implementation and reasonable load times.
2. Creativity includes extreme load times (assuming not tailored to prior
knowledge of the DB) as well as Hint presentation to the user.
=========================================================================
Subjective point deductions (each instance)
A: movies.txt load time > 3 seconds (-5)
B: declaring new variables when the old ones break (-5)
C: using PushBack to set up vectors during the first read (-5)
(instead of one call to SetSize between reads)
D: avoidable calls to vrtx_.Size() (-5)
E: using advance knowledge to pre-set the number of buckets for vrtx_ (-10)
F: deviations from requirements (-10)
G: inadequate test diary in log (-5)
=========================================================================
Movie Distance and Kevin Bacon
The Kevin Bacon game is this: given an actor by name, what is his/her Kevin
Bacon number?
To solve this we first need a clear definition of the Kevin Bacon number for an
actor, or more generally, the movie distance between two actors. The
definition is much like the path distance between two vertices in a graph,
except using movie chains instead of edges.
A movie chain from actor x to actor y is a sequence of movies
m1, m2, ..., mk such that
- mj and m(j+1) have an actor in common for 0 < j < k
- x is in movie m1
- y is in movie mk
The movie distance md(x,y) is defined to be the number of movies
in a shortest movie chain from x to y. If there is no movie chain
from x to y, we define md(x,y) = infinity.
The Kevin Bacon number of
an actor x is the movie distance from x to Kevin Bacon.
Some consequences are:
- Kevin Bacon has Kevin Bacon number 0.
- In general, md(x,x) = 0 for any actor x.
- All other actors have Kevin Bacon number at least 1.
- In general, if x != y and x and y are actors in the same movie, then md(x,y) = 1.
- Movie distance satisfies the triangle inequality: md(x,z) <= md(x,y) + md(y,z)
The actor-movie graph
To solve the Kevin Bacon game (or any other similar game based on another actor)
we use graphs. Specifically, create a graph in which both actors and movies are
vertices, and insert an edge whenever an actor is in a movie. Thus each edge has
an actor for one vertex and a movie for the other.
A graph is said to be bipartite if the vertices can be colored
with two colors, say red and blue, such that each edge has different colored
vertices, that is, each edge goes between a blue vertex and a red
vertex. Clearly the movie-actor graph is bipartite, with actors colored blue and
movies colored red.
The following result is proved in discrete math courses and most books on graph theory:
Theorem. In a bipartite graph, a path whose ends have the same color has
an even number of edges.
As a consequence, any path from one actor to another in the movie-actor graph
has an even number of edges. If P is such a path, with length n,
then n is even and n/2 is the number of movies passed through by
P. If P is a shortest path from actor x to actor
y, then n/2 is the movie distance from x to y.
(Note, in passing, that the path P has an odd number of vertices.)
Thus to solve the Kevin Bacon game, we perform a Breadth-First survey from Kevin
Bacon. The Breadth First Search Tree rooted at Kevin Bacon consists of shortest paths
from Kevin Bacon to all other actors who have a finite Kevin Bacon
number. Dividing the length of such a path by 2 yields the Kevin Bacon number
for the actor at the other end of the path.
In practical terms, we start at an actor x and follow the parent vertices
of the BFS tree back to Kevin Bacon, counting the steps. Then divide this count
by 2 to get the number.
Note that the path itself provides documentation in the form of a list starting
with x and then listing movie | actor in pairs until we are back to
Kevin Bacon.
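The counting argument above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses std::vector adjacency lists and std::queue in place of the course classes (ALUGraph, BFSurvey), and the names BfsTree, BreadthFirst, MovieDistanceSketch, and TinyGraph are hypothetical, not part of the project API.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical stand-in for the course graph classes: adjacency lists over
// vertices 0..n-1, with actors and movies both stored as vertices.
using Vertex   = std::size_t;
using AdjLists = std::vector<std::vector<Vertex>>;

const Vertex NO_PARENT = static_cast<Vertex>(-1);

struct BfsTree
{
  std::vector<Vertex> parent;   // parent[v] in the BFS tree (NO_PARENT for root or unreached)
  std::vector<bool>   reached;  // reached[v] is true when v is reachable from the root
};

// Breadth-first search from a single root vertex, recording parents.
BfsTree BreadthFirst(const AdjLists& g, Vertex base)
{
  assert(base < g.size());
  BfsTree t{std::vector<Vertex>(g.size(), NO_PARENT), std::vector<bool>(g.size(), false)};
  std::queue<Vertex> q;
  t.reached[base] = true;
  q.push(base);
  while (!q.empty())
  {
    Vertex v = q.front();
    q.pop();
    for (Vertex w : g[v])
    {
      if (!t.reached[w])
      {
        t.reached[w] = true;
        t.parent[w]  = v;
        q.push(w);
      }
    }
  }
  return t;
}

// Counts edges on the tree path from x back to the root, then halves the count.
// Returns -1 when x is unreachable (a simplification of the project's codes).
long MovieDistanceSketch(const BfsTree& t, Vertex x)
{
  assert(x < t.parent.size());
  if (!t.reached[x]) return -1;
  long edges = 0;
  for (Vertex v = x; t.parent[v] != NO_PARENT; v = t.parent[v]) ++edges;
  return edges / 2;
}

// Tiny made-up graph: actors 0 (base), 1, 2, 5; movies 3, 4.
// Movie 3 has actors 0 and 1; movie 4 has actors 1 and 2; actor 5 is isolated.
AdjLists TinyGraph()
{
  return AdjLists{ {3}, {3, 4}, {4}, {0, 1}, {1, 2}, {} };
}
```

In the tiny graph, the tree path from actor 2 back to actor 0 is 2, 4, 1, 3, 0: four edges, hence movie distance 2, and the path has an odd number of vertices (five), as noted above.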
Procedural Requirements
The official development | testing | assessment environment is given in the
course organizer. Code should compile without error or warning.
Maintain your work log in the text file log.txt as
documentation of effort, testing results, and development history. This file may
also be used to report on any relevant issues encountered during project
development.
Copy all files from LIB/proj8, including:
kb.cpp # client program plays Kevin Bacon game
line.cpp # contains implementation of Line()
movies.txt # movie DB
movies_abbreviated.txt # smaller version for debugging and optimizing
deliverables.sh # submit configuration file
When logged in to shell or quake, submit the project by executing "submit.sh
deliverables.sh". Read the screen and watch for processing errors.
Warning: The submit process does not work on the program and
linprog servers. Use shell or quake to submit projects. If you do
not receive the second confirmation with the contents of your project, there has
been a malfunction.
Code Requirements and Specifications - MovieMatch
MovieMatch should, at a minimum, provide services required by kb.cpp. This
will require the following (partial) class definition:
// types used
typedef uint32_t Vertex;
typedef fsu::String Name;
typedef fsu::ALUGraph <Vertex> Graph;
typedef fsu::BFSurvey <Graph> BFS;
typedef hashclass::KISS<Name> Hash;
typedef fsu::HashTable<Name,Vertex,Hash> AA; // associative array
typedef fsu::Vector<Name> Vector;
class MovieMatch
{
public:
MovieMatch ();
bool Load (const char* filename);
bool Init (const char* actor);
void Shuffle ();
long MovieDistance (const char* actor);
void ShowPath (std::ostream& os) const;
void ShowStar (Name name, std::ostream& os) const;
void Hint (Name name, std::ostream& os) const;
void Dump (std::ostream& os) const;
...
};
The underlying graph should be built from the
"database" provided in the text file movies.txt. Each line of this file
represents a movie and the actors in the movie. Forward slash '/' is
used to delimit the strings representing movie titles and actor names in each
line.
The following helper function makes reading a movie DB file
somewhat straightforward:
private:
static void Line (std::istream& is, fsu::Vector<Name>& movie);
...
This function consumes a line of text from the stream and instantiates the
vector "movie" (passed by reference) whose
elements are the names that are delimited by '/' in that line of the file. (Recall that each
line of a movies file represents one movie and the actors in that movie,
delimited by '/'.) The first element
of movie is then a movie title
and all other elements are actors in that movie. Note that you are not required
to use this - it can be optimized away - but it is very helpful in a draft to
postpone read issues until the main functionality is built. An implementation is
distributed in the file line.cpp.
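For orientation, here is a minimal sketch of the splitting behavior described above. It is not the distributed line.cpp: it uses std::string and std::vector instead of fsu::String and fsu::Vector, and the names SplitLine and SplitLineSelfCheck are hypothetical. The sample line in the self-check is made up for illustration.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads one line from `is` and splits it on '/'.
// movie[0] is the title; the remaining elements are actor names.
// Returns false when no line could be read.
bool SplitLine(std::istream& is, std::vector<std::string>& movie)
{
  movie.clear();
  std::string line;
  if (!std::getline(is, line)) return false;
  std::istringstream fields(line);
  std::string field;
  while (std::getline(fields, field, '/'))
    movie.push_back(field);
  return true;
}

// Small self-check on one made-up database line.
void SplitLineSelfCheck()
{
  std::istringstream in("Animal House (1978)/Bacon, Kevin/Sutherland, Donald (I)\n");
  std::vector<std::string> movie;
  assert(SplitLine(in, movie) && movie.size() == 3);
  assert(movie[0] == "Animal House (1978)");  // title comes first
  assert(movie[1] == "Bacon, Kevin");         // then the actors
}
```

Returning a bool here is a choice made for the sketch so that a read loop can stop at end of file; the declared Line() returns void, so a loop around it would test the stream state instead.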
bool Load (const char* filename)
This method uses the data in the file to build the underlying symbol graph for
the game. The symbol graph consists of these private members:
private:
...
Graph g_;
Vector name_;
AA vrtx_;
...
name_ is a mapping: {vertices} -> {names}, and vrtx_ is a mapping: {names} ->
{vertices}. Even though Vector and AA are very different structurally, they
perform as mappings in the abstract, each using its bracket operator as function
evaluation. These mappings are required to be mutual inverses:
For any vertex v, vrtx_[name_[v]] == v and for any name n, name_[vrtx_[n]] == n.
Load must look at each name encountered (movie or actor) and, if and only if
that name has not already been encountered, record it as a new vertex. Then Load must
add an edge [a,m] to g_ whenever a is an actor in movie m.
It is advisable to allow Load to read the file twice: First to establish the
vertices and the two mappings vrtx_ and name_; and second to insert all of the
edges. Function Line() will be handy for these steps.
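The two-read plan can be sketched as follows. This is an illustration only, with loud assumptions: std::unordered_map stands in for the AA, std::vector for fsu::Vector and for the graph, the "file" is a string so the sketch is self-contained, and the names SymbolGraphSketch, Split, and LoadSketch are hypothetical. Note that the vertex count is tracked with a local counter and the vertex-indexed containers are sized once between the reads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using Vertex = std::uint32_t;

struct SymbolGraphSketch
{
  std::vector<std::vector<Vertex>>        adj;   // plays the role of g_
  std::vector<std::string>                name;  // plays the role of name_
  std::unordered_map<std::string, Vertex> vrtx;  // plays the role of vrtx_
};

// Splits one line on '/' (title first, then actors).
static std::vector<std::string> Split(const std::string& line)
{
  std::vector<std::string> out;
  std::istringstream fields(line);
  std::string field;
  while (std::getline(fields, field, '/')) out.push_back(field);
  return out;
}

// Two-pass build from the text of a movie file.
SymbolGraphSketch LoadSketch(const std::string& text)
{
  SymbolGraphSketch sg;
  std::string line;
  Vertex next = 0;  // number of distinct names seen so far

  // Pass 1: assign a vertex number to each name the first time it is seen.
  std::istringstream first(text);
  while (std::getline(first, line))
    for (const std::string& n : Split(line))
      if (sg.vrtx.find(n) == sg.vrtx.end())
        sg.vrtx.emplace(n, next++);
  assert(next == sg.vrtx.size());

  // Between passes: size the vertex-indexed containers once, then fill name.
  sg.name.resize(next);
  sg.adj.resize(next);
  for (const auto& kv : sg.vrtx) sg.name[kv.second] = kv.first;

  // Pass 2: add an undirected edge between each movie and each of its actors.
  std::istringstream second(text);
  while (std::getline(second, line))
  {
    std::vector<std::string> movie = Split(line);
    if (movie.empty()) continue;
    Vertex m = sg.vrtx[movie[0]];
    for (std::size_t i = 1; i < movie.size(); ++i)
    {
      Vertex a = sg.vrtx[movie[i]];
      sg.adj[m].push_back(a);
      sg.adj[a].push_back(m);
    }
  }
  return sg;
}
```

After LoadSketch returns, the inverse-mapping requirement can be checked directly: for every v, sg.vrtx.at(sg.name[v]) == v.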
bool Init (const char* actor)
This method establishes actor as the base actor in the game (i.e., the "Kevin
Bacon") and performs a BFS from the base actor in the graph. This BFS
searches only from the base actor vertex (not a full survey) and records
all the parent info for use later during game play. The BFS survey data is
thus required to be persistent, so it is maintained as a BFSurvey object:
private:
...
Name baseActor_;
BFS bfs_;
...
void Shuffle ()
This method "shuffles" the vertex stars (aka adjacency lists) pseudo-randomly,
which makes the game more interesting. Shuffle is implemented by (1) calling
Graph::Shuffle() and then re-computing the search
paths: (2) bfs_.Reset(), (3) bfs_.Search(vrtx_[baseActor_]).
Graph::Shuffle will also need to be added to ALUGraph as a public member
function. It is implemented by calling List::Shuffle for each adjacency list:
template < typename N >
void ALUGraph<N>::Shuffle()
{
for (Vertex v = 0; v < VrtxSize(); ++v) al_[v].Shuffle();
}
List::Shuffle is already implemented in fsu::List. It does a fairly simple
card-shuffle-like permutation of the list (its pseudo-randomness could certainly
be improved).
Still, the effect of Shuffle is to change some of the "proof paths" of
MovieMatch in ways that are not easy to predict. For example, the proof that
Boniface, Isabel has KB number 3 is
Boniface, Isabel
| True Grit (1969)
Duvall, Robert (I)
| Eagle Has Landed, The (1976)
Sutherland, Donald (I)
| Animal House (1978)
Bacon, Kevin
and after Shuffle is
Boniface, Isabel
| True Grit (1969)
Corey, Jeff
| Beethoven's 2nd (1993)
Chaykin, Maury
| Where the Truth Lies (2005)
Bacon, Kevin
and after a second Shuffle is
Boniface, Isabel
| True Grit (1969)
Hopper, Dennis
| True Romance (1993)
Pitt, Brad
| Sleepers (1996)
Bacon, Kevin
At some point presumably the path will go through Wayne, John (I). It is an
excellent thought journey to explain how Shuffle affects the proof paths.
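As a starting point for that thought journey, the sketch below shows the mechanism on a made-up graph with two equally short chains. It is an illustration under assumptions: std::vector adjacency lists replace ALUGraph, a deterministic reversal of each star stands in for the pseudo-random Shuffle, and the names Survey, Bfs, ReverseStars, and DiamondGraph are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

using Vertex   = std::size_t;
using AdjLists = std::vector<std::vector<Vertex>>;

struct Survey
{
  std::vector<Vertex> parent;  // parent[v] in the BFS tree; g.size() means "none"
  std::vector<long>   dist;    // edges from the root; -1 when unreached
};

Survey Bfs(const AdjLists& g, Vertex root)
{
  assert(root < g.size());
  Survey s{std::vector<Vertex>(g.size(), g.size()), std::vector<long>(g.size(), -1)};
  std::queue<Vertex> q;
  s.dist[root] = 0;
  q.push(root);
  while (!q.empty())
  {
    Vertex v = q.front();
    q.pop();
    for (Vertex w : g[v])  // each star is scanned in its stored order
    {
      if (s.dist[w] < 0)
      {
        s.dist[w]   = s.dist[v] + 1;
        s.parent[w] = v;
        q.push(w);
      }
    }
  }
  return s;
}

// Deterministic stand-in for Shuffle: reverse every adjacency list.
void ReverseStars(AdjLists& g)
{
  for (std::vector<Vertex>& star : g) std::reverse(star.begin(), star.end());
}

// Actors 0 (base), 1, 2, 3; movies 4..7 with casts
// 4 = {0,1}, 5 = {0,2}, 6 = {1,3}, 7 = {2,3}.
AdjLists DiamondGraph()
{
  return AdjLists{ {4, 5}, {4, 6}, {5, 7}, {6, 7}, {0, 1}, {0, 2}, {1, 3}, {2, 3} };
}
```

Searching from actor 0 before the reversal gives actor 3 the parent movie 6; after the reversal the parent is movie 7. The distance (four edges, movie distance 2) is the same in both cases: the order of the stars decides which shortest path is recorded, never how long it is.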
long MovieDistance (const char* actor)
This method uses the pre-computed BFS tree to (1) determine whether actor is in
the DB and, if so, retrieve its vertex v (if (vrtx_.Retrieve(actor,v))), and
(2) determine whether actor is reachable from the base actor
(if (bfs_.Color()[v] == 'b')). In that case, it (3) computes
the path from actor to base actor (storing the path as it goes) and returns
the movie distance. The path is stored in the class member
private:
...
fsu::List<Vertex> path_;
...
MovieDistance returns -3 when the entered name is not in the DB, -2 when the
name is not reachable from the base actor, -1 when the name entered is a movie
(not an actor), and otherwise the movie distance between actor and the base
actor (baseActor_).
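The return codes can be sketched as below. This is an illustration, not the required implementation: std::unordered_map and std::list replace the course containers, the search results are passed in as plain vectors, and the names TreeSketch and MovieDistanceSketch are hypothetical. Detecting a movie by the parity of the path length is an assumption of this sketch (it follows from the bipartite theorem), and the order of the checks means an unreachable movie reports -2.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

using Vertex = std::size_t;

// Hypothetical summary of a finished search from the base actor.
struct TreeSketch
{
  std::vector<Vertex> parent;   // parent[v] in the BFS tree (the base is its own parent)
  std::vector<bool>   reached;  // reached[v] is true when v was discovered
};

long MovieDistanceSketch(const std::unordered_map<std::string, Vertex>& vrtx,
                         const TreeSketch& t, Vertex base,
                         const std::string& name, std::list<Vertex>& path)
{
  path.clear();
  auto it = vrtx.find(name);              // probe without inserting
  if (it == vrtx.end()) return -3;        // name is not in the DB
  Vertex v = it->second;
  assert(v < t.parent.size());
  if (!t.reached[v]) return -2;           // not reachable from the base actor

  // Follow parents back to the base, storing the path as we go.
  for (; v != base; v = t.parent[v]) path.push_back(v);
  path.push_back(base);

  long edges = static_cast<long>(path.size()) - 1;
  if (edges % 2 != 0)                     // odd length: the name is a movie
  {
    path.clear();
    return -1;
  }
  return edges / 2;                       // movies on the shortest chain
}
```

A successful call leaves the path running from the named actor to the base actor, alternating actor and movie vertices, which is the order a ShowPath-style listing would print.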
void ShowPath (std::ostream& os) const
This method outputs the entire path as an actor-movie chain connecting actor_ to
baseActor_. This is used to document the movie distance number. See area51/kb_i.x
for suggested behavior.
void ShowStar (Name name, std::ostream& os) const
This method outputs name (which might be a movie...)
followed by the names of all vertices that are adjacent to name in the
graph. This is implemented using an AdjIterator:
typename fsu::ALUGraph<Vertex>::AdjIterator i;
Note that if name is an actor the star is a list of all movies in which
the actor appears. If name is a movie, the star is a list of all actors in that
movie. See area51/kb_i.x for suggested behavior.
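A minimal sketch of the idea, assuming plain std::vector adjacency lists in place of ALUGraph and its AdjIterator. The names CaseInsensitiveLess (as written here) and ShowStarSketch are illustrative, and the output layout is an assumption; area51/kb_i.x remains the reference for actual behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Orders names without regard to letter case.
bool CaseInsensitiveLess(const std::string& a, const std::string& b)
{
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(),
      [](unsigned char x, unsigned char y) { return std::tolower(x) < std::tolower(y); });
}

// Writes name[v], then the sorted names of all vertices adjacent to v.
void ShowStarSketch(const std::vector<std::vector<std::size_t>>& adj,
                    const std::vector<std::string>& name,
                    std::size_t v, std::ostream& os)
{
  assert(v < adj.size() && adj.size() == name.size());
  std::vector<std::string> star;
  star.reserve(adj[v].size());
  for (std::size_t w : adj[v]) star.push_back(name[w]);
  std::sort(star.begin(), star.end(), CaseInsensitiveLess);
  os << name[v] << '\n';
  for (const std::string& n : star) os << " | " << n << '\n';
}
```

With a case-sensitive sort, "Zeta, Al" would precede "alpha, Bo" because uppercase letters compare lower than lowercase ones; the case-insensitive predicate puts "alpha, Bo" first.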
void Hint (Name name, std::ostream& os) const
This method provides hints intended to be helpful when a name is not found in
the DB. See area51/kb_i.x for one idea on behavior.
void Dump (std::ostream& os) const
This method, as expected, depicts the internal structure of the MovieMatch
objects. The demonstration program area51/kb_i.x uses this implementation:
void Dump(std::ostream& os) const
{
ShowAL(g_,os);
WriteData(bfs_,os);
vrtx_.Dump(os);
for (size_t i = 0; i < name_.Size(); ++i)
{
os << "name_[" << i << "] = " << name_[i] << '\t';
os << "vrtx_[" << name_[i] << "] = " << vrtx_[name_[i]] << '\n';
}
vrtx_.Analysis(os);
}
ShowAL and WriteData are in graph_util.h and survey_util.h,
respectively. vrtx_.Dump() and vrtx_.Analysis() are calls to the HashTable
API. The for loop shows the two mappings. Every one of these has proved helpful
in tracking down a bug!
kb.cpp
This client program is supplied. Note that it utilizes the entire API discussed
above. The program #includes source code for all helpers in the library, so
it can be compiled with one call to g++.
Identical Output
Output from your project should be identical to that produced by the area51
examples, with two exceptions: (1) Hint and (2) the timing data.
Hints
It is highly recommended to construct some tiny fake movies files. Spend a few
minutes creating these to model specific cases of graph structure, accessibility,
and redundancy. Keep a hand drawing of the symbol graphs for these examples so
that the Dump output can be hand-traced. Note that Dump is called by kb.cpp when
there is a third command line argument:
kb.x m_test.1 name # runs kb.x with DB = m_test.1 and base actor = name
kb.x m_test.1 name y # same as above, with a call to Dump after Load and Init
It is also advisable to read the source code kb.cpp to understand what
it is asking your MovieMatch object to do.
When you need a string with blanks in it to be read as a single command
line argument, enclose it in single quotes:
kb.x movies.txt 'Bacon, Kevin' # runs kb.x with base actor = 'Bacon, Kevin'
Here is a graphic created by former student Rachel Rados that illustrates most of what is
going on with data in MovieMatch using a tiny movies database:
movies_tiny.txt
Here is a partial list of technologies used in this project:
graphs
graph search & survey
path computation in graphs
associative arrays [hash tables]
generic sort algorithms
generic binary search
Generic sorts are used to order the vertex star prior to output and to prepare
the hint vector of all names. These both use a CaseInsensitiveLessThan
predicate. Generic lower and upper bound are used to isolate a range in the hint
vector that is sized to be useful.
Be careful to keep in mind the dual personality of the AA bracket operator: aa[key]
behaves as "insert key" when key is not in the table. In a const environment you
are protected: the const bracket operator is called, and it fails rather than
inserting if you accidentally use it in insert mode. You can use the const method
HashTable::Retrieve to probe whether a name is already a key in the
AA. Otherwise, it is advised to use the AA bracket operator for readability.
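The same dual personality exists in the standard library, which makes it easy to demonstrate without the course headers. The sketch below uses std::unordered_map as an analogy only (it is not fsu::HashTable, and unlike the AA described above it has no const bracket operator at all); the names Table, Probe, and SizeAfterBracketLookup are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

using Table = std::unordered_map<std::string, std::uint32_t>;

// Probe without inserting (analogous in spirit to a Retrieve-style lookup).
// On success, copies the stored value into `value`; otherwise leaves it alone.
bool Probe(const Table& t, const std::string& key, std::uint32_t& value)
{
  auto it = t.find(key);
  if (it == t.end()) return false;
  value = it->second;
  return true;
}

// Shows that a bracket "lookup" on a missing key silently grows the table.
// Works on a copy so the caller's table is untouched.
std::size_t SizeAfterBracketLookup(Table t, const std::string& key)
{
  (void)t[key];               // inserts key with value 0 if it was absent
  assert(t.count(key) == 1);  // the key is in the table now either way
  return t.size();
}
```

A misspelled name passed through the bracket operator during game play would add a phantom vertex mapping in this way, which is why probing first matters wherever a name comes from user input.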
Load time can be an issue. kb.cpp has a built-in timer for the load operation, and
we'll run an informal contest on this measure. In designing your Load function,
be aware of runtime in every step of the plan. There are a lot of places
where choosing one direction over another can have a dramatic effect on Load time.
The supplied executable area51/kb_i.x requires about 0.14 seconds to load
movies_abbreviated.txt (190 movies and 10,190 actors) and 1.60 seconds
to load movies.txt (4,188 movies and 115,241 actors). Note that the two ratios
1.60/0.14 and (4188 + 115241)/(190 + 10190) are approximately equal, informally
indicating linear runtime growth.
Nevertheless there are aspects to the Load process (as implemented for kb_i.x)
that can be further optimized to reduce the load time.
The implementation of Hint that is illustrated in area51/kb_i.x uses yet
another item we have worked on: a generic sort algorithm to sort a vector hint_
which is built during the first read loop and
consists of all names (actor and movie names). This sort is done after the
graph has been established. (We do the sort as part of Init so it doesn't add to
Load runtime.)
Once hint_ is sorted, the generic binary search algorithms can be used to
locate small ranges in the vector surrounding an input name.
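One possible way to isolate such a range is sketched below. It is not the kb_i.x implementation: it uses std::vector, std::sort, std::lower_bound, and std::upper_bound in place of the course generics, it treats the input as a case-insensitive prefix (an assumption of this sketch), and the names Lower, SortForHints, and HintSketch are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Lower-cases a copy of s.
std::string Lower(std::string s)
{
  for (char& c : s) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  return s;
}

bool CaseInsensitiveLess(const std::string& a, const std::string& b)
{
  return Lower(a) < Lower(b);
}

// Done once, after all names have been collected.
void SortForHints(std::vector<std::string>& hint)
{
  std::sort(hint.begin(), hint.end(), CaseInsensitiveLess);
}

// Returns the entries of the sorted hint vector that begin with `input`,
// ignoring case. Both bounds are found by binary search.
std::vector<std::string> HintSketch(const std::vector<std::string>& hint,
                                    const std::string& input)
{
  assert(std::is_sorted(hint.begin(), hint.end(), CaseInsensitiveLess));
  const std::string key = Lower(input);
  // First entry that is not less than the key.
  auto lo = std::lower_bound(hint.begin(), hint.end(), key,
      [](const std::string& elem, const std::string& k) { return Lower(elem) < k; });
  // First entry whose leading characters are greater than the key.
  auto hi = std::upper_bound(lo, hint.end(), key,
      [](const std::string& k, const std::string& elem)
      { return k < Lower(elem).substr(0, k.size()); });
  return std::vector<std::string>(lo, hi);
}
```

With a hint vector containing "Wayne, John (I)", the input "Wayne, John" yields exactly that entry. A shorter input such as "WAYNE" yields every entry beginning with "wayne", so a real Hint would also want to cap or widen the range to keep it a useful size.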
Hint() is needed because it is difficult to recall the exact name of an
actor. For example, "Wayne, John" is not found in the DB ... Huh? ... ok, the
hint shows us he is officially "Wayne, John (I)".
Note BTW that you can mouse-select an entire line of Hint output on screen and
"paste selection" will pipe the selection directly into input for a running kb.x.
Aside from bragging rights for "best load time" (self-reported on Discussion
thread "Project 8 Load Time"),
style points can be awarded for "intuitive hints" (AI anyone?, self-report to Discussion
thread "Project 8 Hint").
The Kevin Bacon number of an actor using movies_abbreviated.txt is not
necessarily the actual Kevin Bacon number. (Explain this.)
There is an analytical version of kb.cpp under the name "fkb".
fkb plays the game just like kb, but has the added functionality of switching to
a menu-driven access to an extended MovieMatch API that has some analytic functionality
that is not a required part of the project. You may find it helpful to use
fkb_i.x as well as looking at the source code fkb.cpp. You could also create and
build an abbreviated version for your use by commenting out the calls you have
not implemented, giving you direct access to your MovieMatch API.