Version 07/01/2019

4 Search Trees

Recall several definitions from discrete mathematics. An undirected graph G = (V,E) is connected iff for any two vertices x, y ∈ V there is a path in G from x to y. A component of G is a maximal connected subgraph of G. G is a tree iff G is connected and contains no cycles. G is a forest if each component of G is a tree.

Theorem 4 (Characterization of Trees). The following are equivalent statements about a graph G= (V,E):

  1. G is a tree. (I.e., G is connected and acyclic.)
  2. Any two vertices of G are connected by a unique simple path.
  3. G is connected, but if any edge is removed from E the resulting graph is not connected.
  4. G is connected and |V| = 1 + |E|.
  5. G is acyclic, but if any edge is added to E the resulting graph has a cycle.
  6. G is acyclic and |V| = 1 + |E|.

(See your discrete math text for proofs.)
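Statement 4 of the theorem gives a cheap computational test for treeness: compare the vertex and edge counts, then check connectivity. The following is a minimal sketch of that test, not part of the course library. The name IsTree, the edge-list input, and the use of std:: containers in place of the fsu:: ones are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper: decides whether the undirected graph on vertices
// 0..n-1 with the given edge list is a tree, using statement 4 of
// Theorem 4: G is connected and |V| = 1 + |E|.
bool IsTree(std::size_t n,
            const std::vector<std::pair<std::size_t, std::size_t>>& edges)
{
  // count test; the empty graph fails because 0 != 1 + 0
  if (n != 1 + edges.size()) return false;

  // build adjacency lists from the edge list
  std::vector<std::vector<std::size_t>> adj(n);
  for (const auto& e : edges)
  {
    assert(e.first < n && e.second < n);  // edges must name valid vertices
    adj[e.first].push_back(e.second);
    adj[e.second].push_back(e.first);
  }

  // breadth-first search from vertex 0 to test connectivity
  std::vector<bool> seen(n, false);
  std::queue<std::size_t> q;
  seen[0] = true;
  q.push(0);
  std::size_t reached = 1;
  while (!q.empty())
  {
    std::size_t x = q.front();
    q.pop();
    for (std::size_t y : adj[x])
    {
      if (!seen[y])
      {
        seen[y] = true;
        ++reached;
        q.push(y);
      }
    }
  }
  return reached == n;  // connected iff every vertex was reached
}
```

Note that the count test alone is not sufficient: a triangle together with one isolated vertex has |V| = 1 + |E| but is neither connected nor acyclic.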

4.1 BFS and DFS Trees

Recall that both BFSurvey and DFSurvey collect parent data during the course of the algorithm, storing that information in the vector parent_[]. Assume either DFS or BFS context, and suppose we have run Reset() and then a single call Search(v) for some vertex v. Let's also shorten the name parent_ to p, and define:

B(v) = { black vertices }, that is, vertices discovered during the search starting at v
P(v) = { (p(x),x) | p(x) ≠ null }, that is, all the edges connecting the search parent of x to x after Search(v)

Lemma 1 (Tree Lemma). (B(v),P(v)) is a tree with root v.

Proof. First note that v is the unique vertex in B(v) with null parent. Also note that for any black vertex x other than v, (p(x),x) is an edge in the graph (directed from p(x) to x). Following the parent vertices until a null parent is reached defines a (directed) path from v to x.

Now count the vertices and edges: for each black vertex x other than v, the edge (p(x),x) is distinct from any other (p(y),y) because x ≠ y. Therefore we have a 1-1 correspondence between black vertices not equal to v and edges. Thus (B(v),P(v)) is a connected graph with vertexSize = 1 + edgeSize. Such a graph must be a tree, by statement (4) of the Tree Characterization Theorem. ∎

We call (B(v),P(v)) the search tree generated by the search starting at v.

Now assume we have done a full survey with a call Search(), and define

P = { (p(x),x) | p(x) ≠ null }, that is, all the edges in all the search trees

Lemma 2 (Forest Lemma). (V,P) is a forest whose trees are all the search trees generated during the survey.

Proof. By the Tree Lemma, T(v) = (B(v),P(v)) is a tree for each starting vertex v. Suppose some edge in P connects two of these trees, say T(v1) and T(v2). The edge must necessarily be of the form (p(x), x) for some x, where x ∈ B(v1) and p(x) ∈ B(v2). But then the parent-path from x will pass through p(x) to v2, which means that v2 should have been discovered by XFSurvey::Search(v1). The contradiction means that no edge connects T(v1) and T(v2). Therefore the search trees T(v) represent the components of (V,P), the definition of forest. ∎

We call (V,P) the search forest generated by the survey.

4.2 Interpreting BFSurvey

We have alluded to the shortest path property of BFS in previous sections. It is time to make full contact with a proof, and we devote Section 4.2 to doing that. We follow the proof in [Cormen et al 3e]. For any two vertices x,y in G, define the shortest-path-distance from x to y to be

δ(x,y) = the length of the shortest path from x to y in G; or
δ(x,y) = 1 + |E| if y is not reachable from x. (This is distance "infinity".)

Lemma 3. For distinct vertices x and y, δ(x,y) = 1 iff x and y are connected by an edge e ∈ E.

Proof. Since x ≠ y, the distance must be at least 1. The edge e is a path with length 1, so the shortest path distance is no greater than 1. Conversely, if the shortest path distance is 1 then such a path consists of a single edge connecting the two vertices. ∎

Lemma 4 (Triangle Inequality). Let G=(V,E) be a directed or undirected graph and x,y,z ∈ V. If y is reachable from x and z is reachable from y then

δ(x,z) ≤ δ(x,y) + δ(y,z)

Proof. First note that z is reachable from x by concatenating shortest paths from x to y and y to z. This path from x through y to z has length exactly δ(x,y) + δ(y,z). The shortest path from x to z can be no longer than this path through y. Therefore δ(x,z) ≤ δ(x,y) + δ(y,z). ∎

Assumptions. For the remainder of this section, let G=(V,E) be a directed or undirected graph and suppose BFSurvey::Reset() and BFSurvey::Search(v) have been called for some starting vertex v ∈ V. Let d(x) denote the calculated value distance_[x] for each x ∈ V.

Lemma 5. The path in the search tree subgraph (B(v),P(v)) from v to x has length d(x).

Proof. In the search tree there is only one path from v to x. Clearly the path from v to itself has length 0 = d(v), verifying a base case. Assume the lemma is true for d(x) ≤ k and let x be a vertex with d(x) = k + 1. The unique path to x consists of a path of length k plus one extra edge of the form (p(x),x). By the induction hypothesis, the path from v to p(x) has length k = d(p(x)). By inspection of the algorithm, d(x) = d(p(x)) + 1 = k + 1, verifying the inductive step. Therefore by the principle of mathematical induction the result is proved. ∎

Corollary. For each vertex x, d(x) ≥ δ(v,x).

Lemma 6. At any point in the run of the algorithm, consider the gray vertices, that is, the vertices in the control queue, and the front vertex f. The values d(x) are non-decreasing in queue order for the gray vertices, and moreover are either constant (equal to d(f)) or take the two values d(f) and d(f) + 1.

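Proof. Examine the code to see that when x is pushed onto the control queue, d(x) = d(p(x)) + 1 (and at that time p(x) is at the front of the queue). Then show by mathematical induction on the number of queue operations that the d values in the queue are non-decreasing in queue order and take at most the two values d(f) and d(f) + 1: a push appends a vertex with value d(f) + 1, which is at least as large as every value already in the queue, and a pop removes the front vertex, leaving a new front whose value is d(f) or d(f) + 1. The d values are never changed once a vertex is pushed, so the property is preserved by every operation. ∎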

Corollary. If x and y are both gray vertices (i.e., in the control queue) with x colored gray before y (i.e., x pushed before y), then d(x) ≤ d(y) ≤ d(x) + 1.

Lemma 7. If x and y are reachable from v and x is discovered before y, then d(x) ≤ d(y).

Proof. Examine the code to see that when x is pushed onto the control queue, d(x) = d(p(x)) + 1 (and at that time p(x) is the front of the queue). Show by mathematical induction that d values are non-decreasing for all vertices in the queue. Because d values are never changed once a vertex is pushed, if x is pushed before y then d(x) ≤ d(y). ∎

Lemma 8. d(x) = δ(v,x) for all reachable x.

Proof. Suppose that the result fails. Let δ be the smallest shortest-path-distance for which the result fails, and let y be a vertex for which d(y) > δ(v,y) = δ. Let x be the next-to-last vertex on a shortest path from v to y. Then δ(v,y) = 1 + δ(v,x) and, because of the minimality of δ = δ(v,y), d(x) = δ(v,x). We summarize what we know so far:

d(y) > δ(v,y) = 1 + δ(v,x) = 1 + d(x)

Now consider the three possible colors of y at the time x is at the front of the control queue. If y is white, then y will be pushed onto conQ while x is at the front, making d(y) = d(x) + 1, a contradiction. If y is black, it has been popped and d(y) ≤ d(x) by Lemma 7, again a contradiction. If y is gray, then d(y) ≤ d(x) + 1 by Lemma 6, a contradiction yet again. Therefore under all possibilities our original assumption of failure is false. ∎

Putting these facts together we have:

Theorem 5 (Breadth-First Tree Theorem). Suppose BFSurvey::Search(s) has been called for the graph or digraph G=(V,E). Then for each vertex x that is reachable from s, the "parent path" from s to x in the breadth-first tree is a shortest path from s to x.
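To make the theorem concrete, here is a minimal standalone sketch of a breadth-first search that records d(x) and p(x) and then reads off the parent path. It is not the BFSurvey class itself: the names BFSResult, BFSearch and ParentPath, the adjacency-list typedef, and the use of |V| as the null vertex are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

typedef std::vector<std::vector<std::size_t>> AdjList;  // assumed representation

struct BFSResult  // hypothetical stand-in for the survey's public data
{
  std::vector<std::size_t> distance;  // d(x); equals infinity if x is unreachable
  std::vector<std::size_t> parent;    // p(x); equals null for s and unreachable x
  std::size_t infinity;               // 1 + number of adjacency entries
  std::size_t null;                   // |V| serves as the null vertex
};

BFSResult BFSearch(const AdjList& adj, std::size_t s)
{
  assert(s < adj.size());
  std::size_t entries = 0;
  for (const auto& list : adj) entries += list.size();

  BFSResult r;
  r.infinity = 1 + entries;  // for a digraph this is 1 + |E|
  r.null = adj.size();
  r.distance.assign(adj.size(), r.infinity);
  r.parent.assign(adj.size(), r.null);

  std::vector<bool> discovered(adj.size(), false);
  std::queue<std::size_t> conQ;  // control queue holds the gray vertices
  discovered[s] = true;
  r.distance[s] = 0;
  conQ.push(s);
  while (!conQ.empty())
  {
    std::size_t f = conQ.front();
    conQ.pop();
    for (std::size_t y : adj[f])
    {
      if (!discovered[y])
      {
        discovered[y] = true;
        r.distance[y] = r.distance[f] + 1;  // d(y) = d(p(y)) + 1
        r.parent[y] = f;
        conQ.push(y);
      }
    }
  }
  return r;
}

// Follows parent links from x back to the start vertex and returns the
// path in forward order; returns an empty path if x was not reached.
std::vector<std::size_t> ParentPath(const BFSResult& r, std::size_t x)
{
  std::vector<std::size_t> path;
  if (r.distance[x] == r.infinity) return path;
  for (std::size_t v = x; v != r.null; v = r.parent[v]) path.push_back(v);
  std::reverse(path.begin(), path.end());
  return path;
}
```

For a reachable vertex x the returned path has d(x) edges, hence d(x) + 1 vertices, which by Lemma 8 is the shortest possible.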

4.3 Interpreting DFSurvey

We have already remarked that where BFS focuses on distance, DFS is more about time. We also took care to ensure that the time stamps on vertices during a DFSurvey::Search() are unique, so that one time stamp is used for each change of color of a vertex. These time stamps provide a way to codify the effects of LIFO order in the control system for DFS.

We will make use of the more compact mathematical notations

td(x) = dtime[x] = discovery time of x and
tf(x) = ftime[x] = finishing time of x

for each vertex x. Inspection of the DFS algorithm shows that discovery occurs before finishing:

Lemma 9. For each vertex x, td(x) < tf(x).

Therefore the interval [td(x),tf(x)] represents the time values for which x is in the control LIFO, that is, the times when x has color gray. Prior to td(x), x is white, and after tf(x), x is black.

Theorem 6 (Parenthesis Theorem). Assume G = (V,E) is a (directed or undirected) graph and that DFSurvey::Search() has been run on G. Then for any two distinct vertices x and y, exactly one of the following three conditions holds:

  1. The time intervals [td(x),tf(x)] and [td(y),tf(y)] are disjoint, and neither x nor y is a descendant of the other in the DFS forest.
  2. [td(x),tf(x)] is a subset of [td(y),tf(y)], and x is a descendant of y in the forest.
  3. [td(x),tf(x)] is a superset of [td(y),tf(y)], and x is an ancestor of y in the forest.

Proof. Time stamps are unique, so we may assume without loss of generality that td(x) < td(y). The other case is symmetric, with the roles of x and y exchanged, and yields condition (1) or (2).

First suppose td(y) < tf(x), so that y is discovered while x is gray. Then y is pushed onto the control stack above x and, by inspection of the algorithm, y is a descendant of x. Due to the LIFO order of processing, y is colored black before x. Therefore tf(y) < tf(x). That is, [td(y),tf(y)] is a subset of [td(x),tf(x)], and condition (3) holds.

Suppose on the other hand that tf(x) < td(y). By Lemma 9, td(x) < tf(x) < td(y) < tf(y), so the two intervals are disjoint. Neither vertex is discovered while the other is gray, so neither is a descendant of the other, and condition (1) holds. In particular this is the case whenever x and y belong to different trees in the DFS forest, since x is then colored gray and then black during one call Search(v) and y during a different call Search(w), where v ≠ w. ∎
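The interval structure is easy to observe experimentally. The sketch below is a simplified recursive depth-first search, not the DFSurvey class: the names DFSTimes, Visit, DFSearch, IsDescendant and Disjoint are hypothetical, and the call stack plays the role of the control stack. It stamps one unique time per color change and then tests descendancy by interval containment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<std::size_t>> AdjList;  // assumed representation

struct DFSTimes  // hypothetical stand-in for dtime[] and ftime[]
{
  std::vector<std::size_t> dtime;  // td(x)
  std::vector<std::size_t> ftime;  // tf(x)
};

// Recursive visit: stamps td(x) on discovery and tf(x) on finishing.
void Visit(const AdjList& adj, std::size_t x, std::vector<bool>& discovered,
           std::size_t& time, DFSTimes& t)
{
  discovered[x] = true;
  t.dtime[x] = time++;  // x turns gray
  for (std::size_t y : adj[x])
    if (!discovered[y]) Visit(adj, y, discovered, time, t);
  t.ftime[x] = time++;  // x turns black
}

// Full survey: restarts the search at every vertex still undiscovered.
DFSTimes DFSearch(const AdjList& adj)
{
  DFSTimes t;
  t.dtime.assign(adj.size(), 0);
  t.ftime.assign(adj.size(), 0);
  std::vector<bool> discovered(adj.size(), false);
  std::size_t time = 0;
  for (std::size_t v = 0; v < adj.size(); ++v)
    if (!discovered[v]) Visit(adj, v, discovered, time, t);
  return t;
}

// True iff [td(y),tf(y)] is contained in [td(x),tf(x)], that is,
// iff y is a descendant of x (every vertex is its own descendant).
bool IsDescendant(const DFSTimes& t, std::size_t y, std::size_t x)
{
  assert(x < t.dtime.size() && y < t.dtime.size());
  return t.dtime[x] <= t.dtime[y] && t.ftime[y] <= t.ftime[x];
}

// True iff the two time intervals do not overlap at all.
bool Disjoint(const DFSTimes& t, std::size_t x, std::size_t y)
{
  assert(x < t.dtime.size() && y < t.dtime.size());
  return t.ftime[x] < t.dtime[y] || t.ftime[y] < t.dtime[x];
}
```

For every pair of vertices, one of IsDescendant(t,x,y), IsDescendant(t,y,x) or Disjoint(t,x,y) holds: the intervals nest like matched parentheses and never partially overlap.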

Theorem 7 (White Path Theorem). In a depth-first forest of a directed or undirected graph G=(V,E), vertex y is a descendant of vertex x iff at the discovery time td(x) there is a path from x to y consisting entirely of white vertices.

Proof. First note that discovery time td(x) = dtime[x] is stamped prior to any processing of x in the DFSurvey::Search algorithm.

Suppose z is a descendant of x. If z = x then {x} is a white path. If z ≠ x then td(x) < td(z) by the Parenthesis Theorem, so z is white at time td(x). Applying the observation to any y in the DFS tree path from x to z shows that the DFS tree path from x to z consists of white vertices.

Conversely, suppose at time td(x) there is a path from x to z consisting entirely of white vertices. If some vertex in this path is not a descendant of x, let y be the one closest to x with this property. Then the predecessor p on the path is a descendant of x. At time td(p), y is white and an unvisited adjacent of p, so y will be discovered and p(y) = p. That is, y is a descendant of p, and hence of x, contradicting the assumption that y is not a descendant of x. Therefore every vertex on the white path is a descendant of x. ∎

4.4 Classification of Edges

The surveys can be used to classify edges of a graph or directed graph. We will use DFSurvey for this purpose. Given an edge (x,y), there are four possibilities: (1) it is in the DFS forest; (2) it goes from x to an ancestor of x in the same tree; (3) it goes from x to a descendant of x in the same tree; or (4) it goes to a vertex that is neither ancestor nor descendant, whether in the same or a different tree.

  1. Tree edges are edges in the depth-first forest.
  2. Back edges are edges (x,y) connecting a vertex x in the DFS forest to an ancestor y in the same tree of the forest.
  3. Forward edges are edges (x,y) connecting a vertex x in the DFS forest to a descendant y in the same tree of the forest.
  4. Cross edges are any other edges. These might go to another vertex in the same tree or a vertex in a different tree.

For an undirected graph, this classification is based on the first encounter of the edge in the DFSurvey.

Note these observations relating the color of the terminal vertex of an edge to the edge classification. Suppose e = (x,y) is an edge of G, and consider the instant in algorithmic time when e is explored. Then:

  1. If y is white then e is a tree edge.
  2. If y is gray then e is a back edge.
  3. If y is black then e is a forward or cross edge.
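The color observations translate directly into code for a directed graph. The following sketch is a simplified recursive search, not the DFSurvey class, and the names Classifier, EdgeKind and kind are hypothetical. It labels each edge at the instant it is explored. To separate the two black cases it uses discovery times: a black y discovered after x is a descendant of x (forward edge), while a black y discovered before x is not (cross edge).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<std::size_t>> AdjList;  // assumed representation
enum Color { white, gray, black };
enum EdgeKind { treeEdge, backEdge, forwardEdge, crossEdge };

// Hypothetical helper: runs a full depth-first survey of a digraph on
// construction and records the classification of every edge.
struct Classifier
{
  const AdjList& adj;
  std::vector<Color> color;
  std::vector<std::size_t> dtime;            // discovery times td(x)
  std::size_t time;
  std::vector<std::vector<EdgeKind>> kind;   // kind[x][k] classifies (x, adj[x][k])

  explicit Classifier(const AdjList& g)
    : adj(g), color(g.size(), white), dtime(g.size(), 0), time(0), kind(g.size())
  {
    for (std::size_t v = 0; v < adj.size(); ++v)
      if (color[v] == white) Visit(v);
  }

  void Visit(std::size_t x)
  {
    color[x] = gray;
    dtime[x] = time++;
    for (std::size_t y : adj[x])
    {
      assert(y < adj.size());
      if (color[y] == white)           // y becomes a child of x
      {
        kind[x].push_back(treeEdge);
        Visit(y);
      }
      else if (color[y] == gray)       // y is an ancestor of x
        kind[x].push_back(backEdge);
      else if (dtime[x] < dtime[y])    // black, discovered after x: descendant
        kind[x].push_back(forwardEdge);
      else                             // black, discovered before x
        kind[x].push_back(crossEdge);
    }
    color[x] = black;
  }
};
```

The Classifier keeps a reference to the adjacency lists, so the graph passed to the constructor must outlive the Classifier object.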

Theorem 8. In a depth-first survey of an undirected graph G, every edge is either a tree edge or a back edge.

Proof. Let e = (x,y) be an edge of G. Since G is undirected, e is as well, so we can assume that x is discovered before y. At time td(x), y is white. Suppose e is first explored from x. Then y is white at the time, and hence e becomes a tree edge. If e is first explored from y, then x is gray at the time, and e is a back edge. ∎

Theorem 9. A directed graph D contains no directed cycles iff a depth-first search of D yields no back edges.

Proof. If DFS produces a back edge (x,y), adding that edge to the DFS tree path from x to y creates a cycle.

If D has a (directed) cycle C, let y be the first vertex discovered in C, and let (x,y) be the preceding edge in C. At time td(y), the vertices of C form a white path from y to x. By the White Path Theorem, x is a descendant of y, so (x,y) is a back edge. ∎

5 Spinoffs from BFS and DFS

If theorems have corollaries, do algorithms have cororithms? Maybe, but that is difficult to say. "Spinoff" is a very informal term meaning an extra outcome or simple modification of the algorithm that requires little or no extra verification or analysis.

5.1 Components of a Graph

Suppose G = (V,E) is an undirected graph. G is called connected iff for every pair x,y ∈ V of vertices there is a path in G from x to y. A component of G is a graph C such that

  1. C is a subgraph of G,
  2. C is connected, and
  3. C is maximal with respect to the first two properties.

The technology developed in Sections 3 and 4 shows that the following instantiation of the DFS algorithm produces a Vector<N> component such that component[x] is the component containing x for each vertex x of G. All that is needed is to declare the component vector and make a small post-processing adjustment to DFSurvey::Search():

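void DFSurvey::Search()
{
  components = 0;  // component counter, declared alongside the component vector
  for (each vertex v of g_)
    if (color[v] == white)
    {
      components += 1;  // v is the root of a new search tree, hence a new component
      Search(v);        // post-processing of each finished vertex f in Search(v):
                        //   component[f] = components;
    }
}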

Recall that we know the DFS forest is a collection of trees, each tree generated by a call to Search(v). The DFS trees are in 1-1 correspondence to the components of G. The algorithm above counts the components and assigns each vertex its component number as it is processed.

This is an algorithm that runs in time Θ(|V| + |E|) and results in a mechanism for constant-time lookup of the component of any vertex.
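For readers who want to experiment outside the DFSurvey class, the same idea can be sketched as a standalone function. The name Components and the adjacency-list typedef are assumptions for illustration. One deliberate simplification: this version assigns the component number when a vertex is discovered rather than in post-processing, which yields the same labels because every vertex reached during a single Search(v) lies in the same component.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<std::size_t>> AdjList;  // assumed representation

// Hypothetical helper: labels the components of an undirected graph.
// On return component[x] is in 1..count, and the count is returned.
// The value 0 doubles as the color white (not yet discovered).
std::size_t Components(const AdjList& adj, std::vector<std::size_t>& component)
{
  component.assign(adj.size(), 0);
  std::size_t components = 0;
  std::vector<std::size_t> conStack;  // control stack
  for (std::size_t v = 0; v < adj.size(); ++v)
  {
    if (component[v] != 0) continue;  // already reached from an earlier start
    components += 1;                  // v starts a new search tree
    component[v] = components;
    conStack.push_back(v);
    while (!conStack.empty())
    {
      std::size_t x = conStack.back();
      conStack.pop_back();
      for (std::size_t y : adj[x])
      {
        assert(y < adj.size());
        if (component[y] == 0)
        {
          component[y] = components;  // label on discovery
          conStack.push_back(y);
        }
      }
    }
  }
  return components;
}
```

Each vertex is pushed once and each adjacency list is scanned once, so the sketch keeps the Θ(|V| + |E|) runtime, and afterwards two vertices x and y lie in the same component exactly when component[x] == component[y].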

5.2 Topological Sort

A directed graph is acyclic if it has no (directed) cycles. A directed acyclic graph is called a DAG for short. DAGs occur naturally in various places, such as:

vertices                   directed edge
cells in a spreadsheet     calculated value uses other cells as input
targets in a makefile      dependency list of other targets
courses in a curriculum    course prerequisite

In these and other models, it is important to know in what order to consider the vertices. For example, courses need to be taken respecting the prerequisite structure, make needs to build the targets in some order constrained by the dependencies, and a spreadsheet cell should be calculated only after the cells on which it depends have been calculated.

A topological sort of a directed graph G is an ordering of its vertices in such a way that all edges go from lower to higher vertices in the ordering: for each edge (x,y) in G, x < y in the ordering.

Theorem 10. A directed graph G has a topological sort if and only if G has no directed cycles.

Proof. Suppose G has a topological sort. If G had a (directed) cycle { x1, x2, ..., xk = x1 } then we would have x1 < x2 < ... < xk = x1, that is, x1 < x1, an impossibility.

If on the other hand G is acyclic, either of the two algorithms below constructs a topological sort for G. ∎

Theorem 11. Suppose G is a directed graph and that a complete depth-first survey is performed on G. Then the reversed post-ordering of the vertices is a topological sort of G if and only if G has no (directed) cycles.

Proof. Note that a postorder is the finishing order of the vertices.

First suppose G has a topological sort. Then, because all edges point forward in the sort order, there can be no cycle.

Next suppose G is a DAG, let e = (x,y) be an edge in G, and consider the moment that e is explored during DFSurvey: x is at the top of the control stack with color gray. Look at the cases:

  1. y is white: y will become a descendant of x, so y comes before x in postorder.
  2. y is gray: y would be an ancestor of x, hence e would be a back edge, ruled out by Theorem 9.
  3. y is black: y is finished but x is still being processed. Again y precedes x in postorder.

Thus in all cases y is finished before x and hence precedes x in postorder. In reverse postorder, y > x. ∎

Here is the program described in Theorem 11 (the first of the two algorithms referred to in the proof of Theorem 10).


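template <class DigraphType, class ResultType>
void TopSort (const DigraphType& g, ResultType& outQueue)
{
  typedef typename DigraphType::Vertex  Vertex;
  fsu::DFSurvey <DigraphType>           dfs(g);
  fsu::List<Vertex>                     postorder;
  typename fsu::List<Vertex>::Iterator  i;
  dfs.Search();                // full depth-first survey of g
  PostOrder(dfs,postorder);    // vertices in order of finishing time
  // traverse the postorder list backward to obtain reverse postorder
  for (i = postorder.rBegin(); i != postorder.rEnd(); --i)
  {
    outQueue.Push(*i);
  }
}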

To detect whether there is a cycle in G, modify DFSurvey to detect back edges during Search(v) and return false if one is found.
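A standalone sketch of that modification follows. It is not the DFSurvey code: the names Visit and TopSortChecked and the adjacency-list typedef are hypothetical. The search reports failure as soon as it explores an edge to a gray vertex (a back edge, by the color observations of Section 4.4), and otherwise returns the reverse postorder.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<std::size_t>> AdjList;  // assumed representation
enum Color { white, gray, black };

// Visits x; appends vertices to postorder as they finish.
// Returns false as soon as a back edge (an edge to a gray vertex) is found.
bool Visit(const AdjList& adj, std::size_t x, std::vector<Color>& color,
           std::vector<std::size_t>& postorder)
{
  color[x] = gray;
  for (std::size_t y : adj[x])
  {
    assert(y < adj.size());
    if (color[y] == gray) return false;  // back edge: directed cycle
    if (color[y] == white && !Visit(adj, y, color, postorder)) return false;
  }
  color[x] = black;
  postorder.push_back(x);
  return true;
}

// On success, order holds a topological sort (reverse postorder) and the
// function returns true. If the digraph has a directed cycle, order is
// left empty and the function returns false.
bool TopSortChecked(const AdjList& adj, std::vector<std::size_t>& order)
{
  std::vector<Color> color(adj.size(), white);
  order.clear();
  for (std::size_t v = 0; v < adj.size(); ++v)
  {
    if (color[v] == white && !Visit(adj, v, color, order))
    {
      order.clear();
      return false;
    }
  }
  std::reverse(order.begin(), order.end());
  return true;
}
```

By Theorem 9 the early return happens exactly when the digraph has a directed cycle, so a true result certifies both that the input is a DAG and that order is a topological sort.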

Exercises

  1. Conversions between directed and undirected graphs.
    1. Consider an undirected graph G=(V,E) represented by either an adjacency matrix or an adjacency list. What changes to the representations are made when G is converted to a directed graph? Explain.
    2. Consider a directed graph D = (V,E) represented by either an adjacency matrix or an adjacency list. What changes to the representations are made when D is converted to an undirected graph? Explain.
  2. Find the appropriate places in the Graph hierarchy to re-define each of the virtual methods named in the Graph base class, and provide the implementations.
  3. The way vertices are stored in adjacency lists has an arbitrary effect on the order in which they are processed by DFS and BFS.
    1. Explain these effects.
    2. How might the graph edge insertion operations be modified to enforce encountering vertices in numerical order?
  4. Prove: During BFSurvey::Search() on a graph G, if x and y are both gray vertices with x colored gray before y, then d(x) ≤ d(y) ≤ d(x) + 1. (This is the Corollary to Lemma 6 above.)
  5. Describe 3 other ways to find the components of a graph (other than the algorithm in Section 5.1): (a) Directly from BFS or DFS survey data, (b) Using a BFS or DFS forest, and (c) using a traversal of the graph edge set and Partition / Union-Find.
  6. Consider an alternative topological sort algorithm offered first by Donald Knuth. The idea is attractively intuitive: keep removing source vertices and their edges from the graph until nothing is left. The order in which vertices are removed is a topological sort:
    template <class DigraphType, class ResultType>
    bool TopSort2 (const DigraphType& diGraph, ResultType& outQueue)
    {
      typedef typename DigraphType::Vertex      Vertex;
      typedef typename DigraphType::AdjIterator AdjIterator;
    
      fsu::Queue < Vertex >   conQueue;
      // conQueue stores current source vertices prior to processing
    
      fsu::Vector < Vertex >  inDegree(diGraph.VrtxSize(),0);
      // current in-degree of each vertex
    
      // preprocess to get all InDegrees (more efficient than n calls to InDegree)
      for (Vertex v = 0; v < diGraph.VrtxSize(); ++v)
      {
        for (AdjIterator i = diGraph.Begin(v); i != diGraph.End(v); ++i)
        {
          ++inDegree[(size_t)*i];
        }
      }
    
      // initialize conQueue
      for (Vertex v = 0; v < diGraph.VrtxSize(); ++v)
      {
        if (inDegree[v] == 0)
        {
          conQueue.Push(v);
        }
      }
    
      // main algorithm
      while (!conQueue.Empty())
      {
        Vertex v = conQueue.Front();
        conQueue.Pop();
        outQueue.Push(v);
        for (AdjIterator i = diGraph.Begin(v); i != diGraph.End(v); ++i)
        {
          --inDegree[*i];
          if (inDegree[*i] == 0) conQueue.Push(*i);
        }
      } // end while
    
      // report result
      if (outQueue.Size() != diGraph.VrtxSize())
        return false;  // some vertex was never a source: the digraph has a cycle
      return true;
    } // TopSort2
    

    1. Show that a DAG G must have at least one source and at least one sink. (A source is a vertex v in G with InDegree(v) = 0. A sink is a vertex v in G with OutDegree(v) = 0.) Hint: Let P be a maximal length (directed) path in G. Show that the first vertex in P is a source and that the last vertex in P is a sink.
    2. Show that TopSort2 produces a topological sort iff G is acyclic.
    3. What would be the effect of using fsu::Stack instead of fsu::Queue for the internal control queue? Will either of these choices ensure that TopSort2 produces the same topological sort as TopSort?
    4. Use aggregate analysis to derive and verify the worst case runtime for TopSort2.

Software Engineering Projects

  1. Develop the graph class hierarchy as outlined in Section 1.4. Be sure to provide adjacency iterators facilitating BFS and DFS implementations.
  2. Implement BFSurvey and DFSurvey operating on graphs and digraphs via the API provided by the hierarchy above.
  3. Develop two classes BFSIterator and DFSIterator that may be used by UnGraphList and DiGraphList. The goal is that these traversal loops are defined for the graph/digraph g:
    BFSIterator i;
    for (i.Initialize(g,v); !i.Finished(); ++i)
    {
      std::cout << *i;
    }
    DFSIterator j;
    for (j.Initialize(g,v); !j.Finished(); ++j)
    {
      std::cout << *j;
    }
    

    and accomplish BFSurvey::Search(v) and DFSurvey::Search(v), respectively, and output the vertex number in discovery order. Of course, the traversals defined with iterators may be stopped, or paused and restarted, in the client program. The iterators should provide access to all of the public survey information.