Project: Random Graphs

Generating and analyzing random graphs using Partition

Version 11/15/17

Educational Objectives: After completing this assignment, the student should be able to accomplish the following:

Describe and elucidate the API for the Partition classes (1 and 2)
State and give intuitive arguments for the asymptotic runtime of Partition operations
Describe the supporting data structure for Partition (1 and 2)
Explain how Union is implemented
Explain how Find is implemented without path compression
Explain how Find is implemented with path compression
Define Graph
Define and explain the notion of Graph Component
Use Partition as a bookkeeping device on components of randomly generated graphs

Operational Objectives: Implement the ComponentSizeDistribution function defined on Partition objects. Write programs that generate random graphs and analyze their component and degree distributions, using various structural constraints on the graphs. Use the software to detect giant component phase transitions.

Deliverables:

partition_util.h       # contains void ComponentSizeDistribution ( ... )
rangraph_bipartite.cpp # generates and analyzes random bipartite graphs
rangraph_geo.cpp       # generates and analyzes random graphs with geometrically distributed vertex degrees

makefile               # builds all project object code and executables
manual.txt             # operating instructions for software [team document]
report.txt             # overview of team and project [team document]
log.txt                # personal log for team member [individual document]

Background

The study of random graphs began in 1960 with the publication of a remarkable paper by Paul Erdös and Alfréd Rényi that elucidated their discovery of a phase transition in the number of components of a random graph as the expected vertex degree passes through the value 1.0. This result was astonishing and unexpected. [See Footnote 1.]

It is not possible to convey the breadth, depth, and importance of the study of large-scale graphs in a few paragraphs. An entire book would be needed just for a complete bibliography on the subject. Nevertheless some intuition can be obtained by thinking about the following:

Many important systems and phenomena are representable as graphs, including:
1. Road networks [vertices represent intersections, edges represent roads between intersections]
2. Airline schedules [vertices represent cities, edges represent direct flights between cities]
3. Social networks [vertices represent Facebook users, edges represent friend relationships]
4. Contagion networks [vertices represent individuals, edges represent contacts]
5. Customer/product networks [vertices = Amazon book titles and book customers, edges = book is purchased by customer]
6. Movie/actor graph [vertices = actors and movies, edges = actor is in that movie]
7. Human Sexual Contacts [vertices = humans, edges = ... (well, you know)]. This example was the subject of a 2001 paper in the journal Nature.
8. WWW
Properties of graph representations yield information about the context. For example:
1. An A-B path is a route. Routes can be optimized on any quality; many route planners provide at least three: shortest by distance, shortest by driving time, and most scenic.
2. Travel itineraries can be minimized on travel time, number of stopovers, or cost.
3. Large degree vertices may be good viral marketers.
4. A contagion may be an infectious disease among birds, knowledge of a chimpanzee predator, or a juicy piece of gossip. Graph properties will determine how fast it spreads.
5. If I bought the same book as you, and you later buy another book, Amazon can recommend your new book to me. More generally, clusters, components, and node degrees are important features for marketing.
6. Well ... we'll explore this one in detail in another project.
7. Who are the vertices with high degree? Who is an isolated vertex? How close is the graph to being bipartite? What is the significance of long paths? Connected components? What is the average path distance between vertices?
8. Around 2000, grad students Sergey Brin and Larry Page came up with a way to exploit properties of the web graph to optimize searches. Google is now one of the richest and most influential companies ever created.
Some of the most interesting graphs are too large and/or too dynamic to study directly. Features of these graphs can be built into randomly generated abstract graph models whose properties reflect their real counterparts.

We are going to dip our pinky toes into this research area by writing some random graph generators with certain properties, and then we will analyze some of their features. One of the things we will want to do is keep up with the component structure of the graphs as we generate them, because that is so much more tractable than doing component analysis after the graph is created. For this we need the union-find algorithm and a special analytic tool that gives the component size breakdown in descending order by size. We also want to analyze the degree sequence structure of these graphs.
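The bookkeeping idea described above can be sketched with a minimal union-find structure. This is an illustration only: the class name ToyPartition and all of its members are hypothetical and are not the API of the course Partition classes. It uses union by size and a two-pass Find with path compression.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical minimal union-find; NOT the course Partition class.
class ToyPartition
{
public:
  explicit ToyPartition(std::size_t n) : parent_(n), size_(n, 1), components_(n)
  {
    // initially every element is the root of its own singleton component
    std::iota(parent_.begin(), parent_.end(), std::size_t(0));
  }

  // Find with path compression: locate the root, then point every
  // element on the search path directly at that root.
  std::size_t Find(std::size_t x)
  {
    assert(x < parent_.size());
    std::size_t root = x;
    while (parent_[root] != root) root = parent_[root];
    while (parent_[x] != root)
    {
      std::size_t next = parent_[x];
      parent_[x] = root;
      x = next;
    }
    return root;
  }

  // Union by size: attach the smaller tree under the larger one.
  // Returns true if two distinct components were merged.
  bool Union(std::size_t x, std::size_t y)
  {
    std::size_t rx = Find(x), ry = Find(y);
    if (rx == ry) return false;
    if (size_[rx] < size_[ry]) std::swap(rx, ry);
    parent_[ry] = rx;
    size_[rx] += size_[ry];
    --components_;
    return true;
  }

  std::size_t Components() const { return components_; }
  std::size_t SizeOf(std::size_t x) { return size_[Find(x)]; }

private:
  std::vector<std::size_t> parent_;
  std::vector<std::size_t> size_;
  std::size_t components_;
};
```

Calling Union(u, v) each time an edge {u, v} is added keeps the component count and component sizes current while the graph is being generated, so no separate component search is needed afterward.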

Our starting point consists of:

The Partition class that implements the union-find algorithms
The Graph class and associated search/survey algorithms already studied in the previous project
Two example random graph generators that implement the classic cases studied by Erdös and Rényi.

Let's abbreviate Erdös-Rényi to ER. ER studied two families of random graphs called G[n,e] and G(n,p) in the first reference below. A member of the G[n,e] family is obtained by starting with n vertices and repetitively adding edges between randomly drawn vertex pairs until e edges have been added, while ensuring that the graph remains simple (i.e., there are no self-loops and at most one edge between any two vertices). Members of G(n,p) are obtained in a slightly different manner: for each vertex, add an edge to every other vertex with probability p, again ensuring the graph remains simple. The two ER families are very similar when we take

p = 2e/(n(n - 1)).

The formula above is obtained from the observation that the expected degree of a vertex v in G(n,p) is [d] = p(n - 1) and the "degree theorem" which states

Σ_v d(v) = 2e.

Substituting expected values yields

2[e] = Σ_v [d(v)] = Σ_v p(n - 1) = n × p(n - 1) = pn(n - 1)

(taking [x] to mean the expected value of x). The subtle difference in the way the families are generated is that in G[n,e] there is a single random generator associated with the graph, used to pick the vertex pair at random, whereas in G(n,p) each vertex has its own independent Bernoulli generator. While the two families have very similar properties, the second one is more cumbersome to implement but is also amenable to generalizations in which the individual generators associated with vertices may vary in their properties.
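The relationship between the two families can be checked numerically. The helper functions below are a small sketch of the formulas above; their names are hypothetical and do not come from the distributed code.

```cpp
#include <cassert>
#include <cstddef>

// p = 2e / (n(n - 1)): the G(n,p) probability that corresponds to G[n,e]
double MatchingProbability(std::size_t n, std::size_t e)
{
  assert(n >= 2);
  // number of distinct vertex pairs, n(n - 1)/2
  const double pairs = static_cast<double>(n) * static_cast<double>(n - 1) / 2.0;
  assert(static_cast<double>(e) <= pairs); // a simple graph cannot have more edges
  return static_cast<double>(e) / pairs;
}

// [d] = p(n - 1): expected vertex degree in G(n,p)
double ExpectedDegree(std::size_t n, double p)
{
  assert(n >= 2 && p >= 0.0 && p <= 1.0);
  return p * static_cast<double>(n - 1);
}

// [e] = pn(n - 1)/2: expected number of edges in G(n,p)
double ExpectedEdges(std::size_t n, double p)
{
  assert(n >= 2 && p >= 0.0 && p <= 1.0);
  return p * static_cast<double>(n) * static_cast<double>(n - 1) / 2.0;
}
```

For example, n = 1000 and e = 500 give p = 1/999 and an expected degree of 1.0, which is the critical value discussed in the Background.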

References

Joel Spencer, The Giant Component - Golden Anniversary, Notices of the AMS 57:6, 720-724 (2010).

Tom Britton, Maria Deijfen, and Anders Martin-Loef, Generating simple random graphs with prescribed degree distribution, Journal of Statistical Physics 124:6, 1377-1397 (2006) [arXiv.org > math > arXiv:1509.06985 23 Sep 2015]

Jure Leskovec, SNAP: The Stanford Network Analysis Project, Stanford University, 2009 - present.

Procedural Requirements

The official development, testing, and assessment environment is g++ -std=c++11 -Wall -Wextra on the linprog machines. Code should compile without error or warning.
Maintain your work log in the text file log.txt as documentation of effort, testing results, and development history. This file may also be used to report on any relevant issues encountered during project development.
Begin by copying all files from LIB/proj8RG into your proj8RG directory. Familiarize yourself with the code in all of these files, in conjunction with reading the lecture notes.

In addition you will want to copy the following executables:
```
LIB/area51/rangraph.x
LIB/area51/rangraph_ER.x
LIB/area51/rangraph_BP.x
LIB/area51/rangraph_geo.x
LIB/area51/fpartition1.x
LIB/area51/fpartition2.x
```
After completing the project, you should be able to create these using the distributed makefile. Use all of these executables to help you understand:
- Partition
- Random graph generators and analyzers
Create the file partition_util.h by copying the "stub" version and completing the implementation of ComponentSizeDistribution(). Test your function using the supplied fpartition1.cpp and fpartition2.cpp.
Create the files rangraph_bipartite.cpp and rangraph_geo.cpp. Test thoroughly and complete the experimental investigation discussed. Be sure to put your results in log.txt.

When logged in to shell or quake, submit the project by executing "submit.sh deliverables.sh". Read the screen and watch for processing errors.

Warning: The submit process does not work on the program and linprog servers. Use shell or quake to submit projects. If you do not receive the second confirmation with the contents of your project, there has been a malfunction.

Code Requirements and Specifications

The stand-alone function template
```
template < class P >
void ComponentSizeDistribution ( const P& p , size_t maxToDisplay , std::ostream& os = std::cout )
```
takes three arguments:
1. const P& p is the Partition object under analysis, passed by const reference.
2. size_t maxToDisplay is the number of component sizes to display.
3. std::ostream& os is the output stream through which to display.
The items to display are the sizes of the components of the partition object, ranked in descending order by size. For example, suppose there are 25 components, component 3 has size 4, component 5 has size 6, component 22 has size 3, and all other components have size 1. Then the display would be:
```
rank      size
----      ----
1         6
2         4
3         3
4         1
*         1 (the remaining 21 components have size 1)
```
The display may be cut short by the "maxToDisplay" argument. All of the display boilerplate is supplied in the stub file. You only have to come up with the algorithm to calculate the distribution.
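The ranking in the example display can be illustrated independently of the Partition API. The sketch below assumes a hypothetical input, a vector giving a component label for each element, and returns the component sizes in descending order. It is one possible approach, not the stub's interface or the required algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <unordered_map>
#include <vector>

// label[i] is a hypothetical component identifier for element i.
// Returns the sizes of all components, largest first.
std::vector<std::size_t> SortedComponentSizes(const std::vector<std::size_t>& label)
{
  // count how many elements carry each label
  std::unordered_map<std::size_t, std::size_t> count;
  for (std::size_t id : label) ++count[id];

  // collect the counts and rank them in descending order
  std::vector<std::size_t> sizes;
  sizes.reserve(count.size());
  for (const auto& kv : count) sizes.push_back(kv.second);
  std::sort(sizes.begin(), sizes.end(), std::greater<std::size_t>());

  // sanity check: the component sizes must account for every element
  assert(std::accumulate(sizes.begin(), sizes.end(), std::size_t(0)) == label.size());
  return sizes;
}
```

Applied to the example above (25 components, with sizes 6, 4, and 3 for three of them and size 1 for the rest), the returned vector begins 6, 4, 3, 1 and has 25 entries.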
The random graph generator rangraph_bipartite.cpp should generate random bipartite graphs with the inputs (1) name of file to store graph, (2) number of red vertices, (3) number of blue vertices, and (4) number of edges. An optional fifth argument determines the length of the tail of the component distribution to display. A good starting point for this code is rangraph.cpp which generates graphs in the family G[n,e]. These generators are simple enough that they can get by with the random number generator in fsu::xran.
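The edge-drawing step of a bipartite generator might be sketched as follows. This sketch makes several assumptions that are not in the specification: red vertices are numbered 0 .. red-1 and blue vertices red .. red+blue-1, the function name is hypothetical, and the standard <random> library is swapped in for fsu::xran.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<std::size_t, std::size_t>; // (red vertex, blue vertex)

// Draws e distinct red-blue edges uniformly at random (rejection sampling).
// Assumed numbering: red = 0 .. red-1, blue = red .. red+blue-1.
std::vector<Edge> RandomBipartiteEdges(std::size_t red, std::size_t blue,
                                       std::size_t e, unsigned seed)
{
  assert(red > 0 && blue > 0);
  assert(e <= red * blue); // cannot exceed the complete bipartite graph
  std::mt19937 gen(seed);
  std::uniform_int_distribution<std::size_t> pickRed(0, red - 1);
  std::uniform_int_distribution<std::size_t> pickBlue(red, red + blue - 1);

  std::set<Edge> seen; // rejects repeated edges so the graph stays simple
  std::vector<Edge> edges;
  edges.reserve(e);
  while (edges.size() < e)
  {
    Edge candidate(pickRed(gen), pickBlue(gen));
    if (seen.insert(candidate).second) edges.push_back(candidate);
  }
  return edges;
}
```

Because every edge joins one red and one blue endpoint, self-loops cannot occur. Rejection sampling slows down as e approaches red × blue, which is acceptable for the sparse graphs of interest here.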
The random graph generator rangraph_geo.cpp should generate random graphs with expected vertex degrees geometrically distributed. The inputs are (1) name of file to store graph, (2) the number of vertices, and (3) the expected vertex degree. An optional fourth argument determines the length of the tail of the component distribution to display. A good starting point for this code is rangraph_ER.cpp which generates graphs in the family G(n,p). The generators rangraph_ER and rangraph_geo require the use of the C++ <random> library.

The generator rangraph_ER can be thought of as having a Bernoulli generator with probability p = d/(n - 1) at each vertex. In rangraph_geo, the probabilities associated with the Bernoulli generators at the vertices must instead be geometrically distributed over the vertices, with mean d/(n - 1).
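One possible reading of this requirement is sketched below. It is an interpretation, not the reference implementation: each vertex draws a target degree k from std::geometric_distribution with mean d, and its Bernoulli probability is set to k/(n - 1), capped at 1. The function name is hypothetical, and how the per-vertex probabilities are then used to draw edges is left to rangraph_ER.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Returns one Bernoulli probability per vertex, with mean approximately d/(n - 1).
std::vector<double> GeometricVertexProbabilities(std::size_t n, double d, unsigned seed)
{
  assert(n >= 2 && d > 0.0);
  std::mt19937 gen(seed);
  // std::geometric_distribution(q) has P(k) = q(1 - q)^k for k = 0, 1, 2, ...
  // with mean (1 - q)/q, so q = 1/(1 + d) gives mean d.
  std::geometric_distribution<int> targetDegree(1.0 / (1.0 + d));

  std::vector<double> prob(n);
  for (double& p : prob)
  {
    // scale the target degree to a probability and cap it at 1
    p = std::min(1.0, targetDegree(gen) / static_cast<double>(n - 1));
  }
  return prob;
}
```

The cap at 1 only matters when a drawn target degree exceeds n - 1, which is very unlikely when d is small relative to n.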

Experiment

Once the code is written and working correctly, the team needs to conduct experiments as follows:

Use rangraph and rangraph_ER to tease out the phase change behavior as the expected degree passes the value 1.0. This serves to confirm the classical result of Erdös and Rényi.

Please also observe the degree distributions and conclude ??
Is there an ER-like phase transition in bipartite graphs? If so, what is the critical value of expected degree?

Please also observe the degree distributions and conclude ??
Is there an ER-like phase transition in graphs with geometrically distributed degree sequences? If so, what is the critical value of expected degree?

Please also observe the degree distributions and conclude ??

Discuss all experimental results in report.txt. Succinctly please! But backed up with computational results.
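The first experiment above can be prototyped in a few lines before running the full tools. The sketch below is not rangraph.cpp: it uses a hypothetical function name, the standard <random> library, and an inline union-find with path halving (a simpler variant of path compression). It builds one member of G[n,e] with e chosen so that 2e/n is approximately d, and reports the fraction of vertices in the largest component.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <set>
#include <utility>
#include <vector>

// Fraction of vertices in the largest component of one random G[n,e] graph,
// where e is rounded from d*n/2 so the expected degree is about d.
double LargestComponentFraction(std::size_t n, double d, unsigned seed)
{
  assert(n >= 2 && d >= 0.0 && d <= static_cast<double>(n - 1));
  const std::size_t e = static_cast<std::size_t>(d * static_cast<double>(n) / 2.0 + 0.5);

  std::vector<std::size_t> parent(n), size(n, 1);
  std::iota(parent.begin(), parent.end(), std::size_t(0));
  auto find = [&parent](std::size_t x) -> std::size_t {
    while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; } // path halving
    return x;
  };

  std::mt19937 gen(seed);
  std::uniform_int_distribution<std::size_t> pick(0, n - 1);
  std::set<std::pair<std::size_t, std::size_t>> seen; // keeps the graph simple
  while (seen.size() < e)
  {
    std::size_t u = pick(gen), v = pick(gen);
    if (u == v) continue;                      // no self-loops
    if (u > v) std::swap(u, v);
    if (!seen.insert({u, v}).second) continue; // no repeated edges
    std::size_t ru = find(u), rv = find(v);
    if (ru == rv) continue;                    // edge lies inside one component
    if (size[ru] < size[rv]) std::swap(ru, rv);
    parent[rv] = ru;                           // union by size
    size[ru] += size[rv];
  }

  std::size_t largest = 0;
  for (std::size_t v = 0; v < n; ++v)
    if (parent[v] == v) largest = std::max(largest, size[v]);
  return static_cast<double>(largest) / static_cast<double>(n);
}
```

Evaluating this function for a range of d values on either side of 1.0, with several seeds and increasing n, is one way to see where the largest component starts to occupy a substantial fraction of the graph.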

Hints

See the cpp references pages regarding <random>.
Also see the distributed program "random_demo.cpp" for practical use.
Example executables are in LIB/area51. Always consult these when in doubt about expected behavior.
Use the distributed makefile to ensure compatibility with our code during assessment.

Footnote 1

Paul Erdös is the mathematician of the so-called "Erdös Number", which is the smallest number of co-authorships connecting a published mathematician to Erdös. (This is the math-nerd analog of the Kevin Bacon number which is famous in movie-nerd circles, defined as the smallest number of co-actors connecting an actor back to Kevin Bacon.) Lacher has Erdös number 3:

Kuratowski, K.; Lacher, R.C. (1969), "A theorem on the space of monotone mappings", Bull. Pol. Acad. Sci. 12 (1969) 797--800.
Kuratowski, K.; Ulam, St. (1932), "Quelques propriétés topologiques du produit combinatoire", Fundamenta Mathematicae, Institute of Mathematics Polish Academy of Sciences, 19 (1): 247--251.
Erdös, P; Ulam, St. (1968), "On equations with sets as unknowns", Proceedings of the National Academy of Sciences of the United States of America 60: 1189-95.

Erdös, Kuratowski, and Ulam are each incredibly famous. Kuratowski is one of the founders of both set-theoretic topology and graph theory - the notations K(n) for the complete graph on n vertices and K(p,q) for the complete bipartite graph use "K" in his honor. He was also Ulam's major professor. Among many other things, Ulam discovered the way to calculate how to start a mass particle chain reaction. This work as part of the Manhattan Project led to hydrogen fusion, hydrogen bombs, the threat of global destruction, and ultimately the end of the Cold War. Without Ulam's discovery, we could well be living in a Stalinist dictatorship.