Balanced Binary Search Trees - Overview

This chapter is devoted to various specialized types of binary search tree, including:

  1. AVL Trees [Adelson-Velskii and Landis, 1962]
  2. Classic Red-Black Trees [Guibas and Sedgewick, 1978]
  3. Left-Leaning Red-Black Trees [Sedgewick, 2007]

All of these are Binary Search Trees, so BST Search works unchanged. What distinguishes them are enhanced Insert and Remove operations that ensure enough branching in the tree that

h ≤ O(log n)

where h = tree height and n = tree size. This in turn ensures that the insert and search operations have worst-case runtime ≤ O(log n), thus completing one of the goals in the development of Ordered Set and Ordered Table containers.

In fact, all const methods for BST carry over to the various balanced tree structures completely unchanged. Therefore we concentrate here on the mutating operations Insert, Put, and Get. Remove will be addressed for all BSTs at the end of this chapter.

AVL Trees

An AVL tree is a binary tree in which the heights of the left and right subtrees of each node differ by at most 1. Note this definition is equivalent to the recursive version: an AVL tree is a binary tree in which the heights of the left and right subtrees of the root differ by at most 1 and in which the left and right subtrees are again AVL trees. The name "AVL" derives from the names of the two inventors of the technology, G.M. Adelson-Velskii and E.M. Landis [An algorithm for the organization of information, 1962.]
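The definition translates directly into a checking function. Here is a minimal sketch, assuming a bare node type with lchild_ and rchild_ pointers (the member names used by the code later in this chapter):

  // Minimal AVL checker - a sketch, not part of the course library.
  struct Node { Node * lchild_; Node * rchild_; };

  // Returns the height of the subtree at n (empty tree = -1),
  // or -2 if the height-balance condition fails anywhere below.
  int CheckAVL(const Node * n)
  {
    if (n == 0) return -1;
    int hl = CheckAVL(n->lchild_);
    int hr = CheckAVL(n->rchild_);
    if (hl == -2 || hr == -2) return -2;
    if (hl - hr > 1 || hr - hl > 1) return -2; // subtree heights differ by > 1
    return 1 + (hl > hr ? hl : hr);
  }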

Because an AVL tree (AVLT) is a binary search tree (BST), there is a well defined binary search algorithm in an AVLT that follows descending paths. An important feature of AVLTs is that they have height bounded above by 1.5 log2 (n + 1), where n is the number of nodes in the AVLT, so an AVLT must be fairly "bushy". (In stark contrast, a BST can be severely "leggy", with height n - 1. We can define bushy and leggy asymptotically as having height O(log n) and Ω(n), respectively. Note that "sparse" is a synonym for "leggy" and "dense" is a synonym for "bushy". "Bushy" and "leggy" are terms from gardening. "Dense" and "sparse" are from graph theory.)

Theorem AVL 1. Suppose an AVL tree has n vertices and height H. Then:

log2 n - 1 < H ≤ A log2 n + B

for constants A = 1/(log2 φ) ≈ 1.44 and B = logφ SQRT(5) ≈ 1.67, where φ = (1 + SQRT(5))/2 is the golden ratio.

Proof. The first inequality is true for any binary tree: the maximum number of vertices a binary tree of height H can have is the sum of 2^k over all layers k = 0 ... H. This sum evaluates to 2^(H+1) - 1. Therefore n ≤ 2^(H+1) - 1 < 2^(H+1). Taking log2 of both sides yields the first result.

We concentrate now on the second claim. Let n(H) be the minimum number of vertices an AVL tree of height H can have. Clearly n(0) = 1, since a tree of height 0 consists exactly of the root vertex. Also n(1) = 2, by looking at all cases. As an inductive step, note that an AVL tree of height H with minimal vertex count must have one subtree of height H - 1 and another subtree of height H - 2. Thus n(H) = n(H-1) + n(H-2) + 1. In summary, we have the following recurrence relation:

n(0) = 1
n(1) = 2
n(H) = n(H-1) + n(H-2) + 1

Consider the Fibonacci recursion given by:

f(0) = 0
f(1) = 1
f(H) = f(H-1) + f(H-2)

Assertion: n(H) > f(H+2) - 1

Proof:

Base cases:
n(0) = 1, f(2) - 1 = 1 - 1 = 0, so n(0) > f(2) - 1
n(1) = 2, f(3) - 1 = 2 - 1 = 1, so n(1) > f(3) - 1

Inductive case:
n(H + 1)
       = n(H) + n(H-1) + 1 # definition of n
       > (f(H+2) - 1) + (f(H+1) - 1) + 1 # inductive hypothesis
       = f(H+2) + f(H+1) - 1
       = f(H+3) - 1

Because both sides of the inequality are integers, we can rephrase the previous assertion as:

Assertion: n(H) ≥ f(H+2)
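The two recurrences are easy to tabulate side by side; here is a quick standalone spot-check of the assertion for small H (plain C++, not part of the course code):

  #include <iostream>

  // Tabulate n(H) and f(H+2) side by side to spot-check n(H) >= f(H+2).
  int main()
  {
    const int MAX = 20;
    long n[MAX], f[MAX + 2];
    f[0] = 0; f[1] = 1;
    for (int i = 2; i < MAX + 2; ++i) f[i] = f[i-1] + f[i-2];
    n[0] = 1; n[1] = 2;
    for (int h = 2; h < MAX; ++h) n[h] = n[h-1] + n[h-2] + 1;
    for (int h = 0; h < MAX; ++h)
      std::cout << "H = " << h << ": n(H) = " << n[h]
                << ", f(H+2) = " << f[h+2] << '\n';
    return 0;
  }
  // first lines of output:  H = 0: n(H) = 1, f(H+2) = 1
  //                         H = 1: n(H) = 2, f(H+2) = 2
  //                         H = 2: n(H) = 4, f(H+2) = 3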

A standard factoid on the Fibonacci numbers is that f(H+2) ≥ φ^H / SQRT(5), where φ = (1 + SQRT(5))/2, the golden ratio. (See, for example, Cormen et al, Exercise 4-5.) Whence we obtain:

Assertion: n(H) ≥ φ^H / SQRT(5)

It follows that

n(H) ≥ φ^H / SQRT(5)
SQRT(5) n(H) ≥ φ^H
logφ(SQRT(5) n(H)) ≥ H
logφ SQRT(5) + logφ n(H) ≥ H

Noting that logφ x = (log2 x)/(log2 φ), we have proved the theorem with A = 1/(log2 φ) and B = (log2 SQRT(5))/(log2 φ).
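For a sense of scale: with n = 10^6 we get log2 n ≈ 19.93, so H ≤ (1.44)(19.93) + 1.67 ≈ 30.4. An AVL tree on a million nodes has height at most 30, compared to the optimal height 19 of a perfect binary tree of that size.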

Theorem AVL 2. BST search in an AVLT has worst case run time ≤ O(log n), where n is the number of nodes.

The challenge is to make sure that the AVLT properties are maintained as we insert and remove elements. It turns out that the AVLT properties do not necessarily hold after an ordinary BST insert or remove operation, but that there are "repair" algorithms that bring the resulting BST back into compliance with the AVLT definition. These algorithms restructure the BST by pruning and re-hanging subtrees and are called rotations.

Rotations are constant time algorithms, and they are combined into repair algorithms that iterate along a descending path in the AVLT. It follows that BST insert or remove, followed by AVLT repair, has run time O(log n). Consequently

Theorem AVL 3. AVLT insert and remove have worst case run time ≤ O(log n).

Detailed specifications of the AVLT algorithms are found in the text.

Red-Black Trees

A red-black tree is a binary search tree whose nodes have a color attribute and which satisfies the following additional properties:

  1. Every node color is either red or black
  2. The root is black
  3. If a node is red, then all its children are black
  4. All root-null paths in the tree have the same number of black nodes

(A root-null path is a descending path from the root to a node with at least one null child.)
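These properties can be verified in a single recursive pass. Here is a minimal sketch, assuming a bare node type with lchild_/rchild_ pointers and an IsRed() query (matching the member names used in the code later in this chapter):

  // One-pass checker for properties RB 3 and RB 4 - a sketch.
  struct Node
  {
    Node * lchild_; Node * rchild_;
    bool red_;
    bool IsRed() const { return red_; }
  };

  // Returns the number of black nodes on any path from n down to null
  // (counting n itself), or -1 if property 3 or 4 fails in the subtree.
  int CheckRB(const Node * n)
  {
    if (n == 0) return 0;
    int bl = CheckRB(n->lchild_);
    int br = CheckRB(n->rchild_);
    if (bl == -1 || br == -1) return -1;
    if (bl != br) return -1;                 // property 4: black counts disagree
    if (n->IsRed())
    {
      if ((n->lchild_ && n->lchild_->IsRed()) || (n->rchild_ && n->rchild_->IsRed()))
        return -1;                           // property 3: red node with red child
      return bl;                             // red nodes do not add black height
    }
    return bl + 1;                           // black node adds one
  }
  // Properties 1 and 2 remain: colors are 2-valued by construction,
  // and the caller checks separately that the root is black.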

The key observation in analyzing the height of a RB tree also carries over to the left-leaning case discussed below. For any tree satisfying property RB 4 above, define the black height of a node x to be

b(x) = number of black nodes between x and a null descendant of x

Lemma (Black Height Lemma). For any binary tree satisfying property RB 4 and any node x, the subtree rooted at x contains at least 2^b(x) - 1 nodes, not counting x itself.

Thus the entire RB tree contains at least 2^b nodes, including the root, where b is the constant number of black nodes in a root-null path.

Corollary. 2^b ≤ n, and hence b ≤ log2 n.

Proof. (by induction on height h(x)).

Base Case: Assume h(x) = 0

Observe that b(x) ≤ h(x), so b(x) = 0 and 2^b(x) - 1 = 2^0 - 1 = 0.

Inductive Step: Assume true for h(x) < h and deduce true for h(x) = h.

Let x be a node with height h > 0. If x has 2 children, each child has black height either b(x) or b(x) - 1, depending on whether the child is itself black. By the inductive hypothesis, each of the two child subtrees must have at least 2^(b-1) - 1 nodes. The number of nodes in the subtree at x is therefore at least the sum

1 + (2^(b-1) - 1) + (2^(b-1) - 1) = 2^b - 1

which proves the result. If x has only one child, then b(x) = 0 and (as in the height 0 case) the subtree clearly has at least 0 = 2^0 - 1 nodes, proving the result in that case as well.

Because a red-black tree (RBT) is a binary search tree (BST), there is a well defined binary search algorithm in an RBT that follows descending paths. An important feature of RBTs is that they have height bounded above by 2 log2 (n + 1), where n is the number of nodes in the RBT, so an RBT must be fairly "bushy". (In stark contrast, a BST can be severely "leggy", with height n - 1. We can define "bushy" and "leggy" asymptotically as having height O(log n) and Ω(n), respectively.)

Theorem RB 1. In an RB tree, h ≤ 2 log2 n.

Proof. Suppose x is a leaf node in the RB tree, and denote by b the constant number of black nodes in the descending path from root to x (property 4). This descending path from root to x contains L(x) = b + r(x) nodes, where r(x) is the number of red nodes in the path. Because the path begins with a black node (property 2), and a red node is always followed by a black node (property 3), the number of red nodes in the path cannot exceed the number of black nodes in the path, and we have:

L(x) = b + r(x) ≤ 2b

Applying the Black Height Lemma we have

L(x) ≤ 2b ≤ 2 log2 n

which completes the proof. An immediate consequence is

Theorem RB 2. BST search in an RB tree has worst case run time O(log n), where n is the number of nodes.

The challenge is to make sure that the RBT properties are maintained as we insert and remove elements. It turns out that the RBT properties do not necessarily hold after an ordinary BST insert or remove operation, but that there are "repair" algorithms that bring the resulting BST back into compliance with the RBT rules. These algorithms restructure the BST by pruning and re-hanging subtrees and are called rotations.

Rotations are constant time algorithms, and they are combined into repair algorithms that iterate along a descending path in the RBT. It follows that BST insert or remove, followed by RBT repair, has run time O(log n). Consequently

Theorem RB 3. RB tree insert and remove have worst case run time O(log n).

Detailed specifications of the RB tree algorithms are found in the text.

Left-Leaning Red-Black Trees

The story of Left-Leaning Red-Black trees is marvelous. These gadgets were discovered and developed by Robert Sedgewick just a few years ago. Because Sedgewick is one of the co-discoverers of Red-Black trees, it might be assumed he had long since moved on, and of course that is true - he has a long career in discovering, teaching, and applying algorithms. What is great is that a Professor at Princeton, a Director of Adobe Systems, still has the incentive (through teaching algorithms) and the curiosity (through persistent attention to research) to find what must, surely, be the simplest possible scheme to implement O(log n) search time binary search trees. This is inspirational stuff. We will adopt left-leaning red-black trees as our implementation of choice for ordered Sets and Tables.

A left-leaning red-black tree [RBLL tree] is a red-black tree with the additional property that all links to red nodes "lean left":

  1. Every node color is either red or black
  2. The root is black
  3. If a node is red, then all its children are black
  4. All root-null paths in the tree have the same number of black nodes
  5. If a node is red, it must be a left child.

An RBLL tree is an RB tree, so it shares all properties of RB trees, including the most important property that descending paths in the tree have length bounded by 2 log2 (n + 1). The left-leaning constraint (property 5) serves to simplify the code implementing RBLL trees by eliminating about half of the possible RB configurations that can represent a given set.

These properties may seem to be almost a rabbit-out-of-the-hat trick, but in reality they are carefully stated to align with properties of 2-3 trees: Sedgewick sets up a one-to-one correspondence between 2-3 trees and a subset of 2-colored binary search trees, and RBLL trees turn out to be exactly that subset. The most remarkable outcome is not so much that RBLL trees are height-balanced, but that the maintenance algorithms are much simpler than those of either AVL trees or classic RB trees.
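The encoding is easy to picture: a 2-node of the 2-3 tree becomes an ordinary black node, and a 3-node becomes a black node with a red left child (the red link is drawn doubled here, since color does not survive in plain text):

    2-3 tree 3-node [a b]:            RBLL encoding:

          [a  b]                            b        (black)
         /   |   \                        //  \
        x    y    z                      a     z     (a is red, leaning left)
                                        / \
                                       x   y

The three subtrees x < a < y < b < z land in exactly the BST positions required, so searches see the same keys in the same order in both representations.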

Theorem RBLL 1. In an RBLL tree, h ≤ 2 log2 n.

Proof. An RBLL tree is a red-black tree, so the results for red-black trees apply directly. An immediate consequence is

Theorem RBLL 2. BST search in an RBLL tree has worst case run time O(log n), where n is the number of nodes.

The RBLLT Insert and Remove algorithms use the same constant-time rotation algorithms as AVL and RB trees, in fact in simpler ways. (RBLL trees do not need "double rotations" - a consequence of the left-leaning property.) Therefore, as with the others, we have:

Theorem RBLL 3. RBLL tree insert and remove have worst case run time O(log n).

Detailed specifications of the RBLL tree algorithms are not yet found in any text (although it seems likely they will appear in the next edition of Sedgewick's Algorithms). We will give details here.

RBLL Examples


     4                       4                             4
   3   5                   2   5                         3   5
  2                         3                         2

  OK                       not OK - leans right       not OK - too many consecutive reds

Typical output from demo:

  6
  4 10
  2  5  8 12
  1  3  -  -  7  9 11  -

Re-drawn as tree:                                       Root-null paths (black count = 3):

            6                         6  4  2  1
      4           10                  6  4  2  3
   2     5     8    12                6  4  5
  1 3   - -   7 9  11 -               6 10  8  7
                                      6 10  8  9
                                      6 10 12 11
                                      6 10 12
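Note how the example bears out the theory: b = 3 and n = 12, so 2^b = 8 ≤ 12, as the Black Height Lemma and its corollary promise; and the longest root-null path has 4 nodes, comfortably under the bound 2b = 6 of Theorem RB 1.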

RBLL Demos


Find and run the functionality test / demo program frbllt.x. Note: Like most class "f-tests" this will accept files of commands as a command line argument. Ending a command file with 'x' switches to interactive mode.

Example com file com.1:

11 12 13 14 15 16 17 18 19 110 111 112 113 114 115 116 117 118 119
120 121 122 123 124 125 126 127 128 129 130 131
x

There is also a Load command to insert files of data into the set.

Example Session 1

frbllt.x
L uint.sorted.63 # Inserts 1 ... 63 into RBLL tree
d3               # calls "Dump(std::cout, cw = 3, fill = '-')"

Output to screen:

 32
 16 49
  8 24 40 57
  4 12 20 28 36 44 53 61
  2  6 10 14 18 22 26 30 34 38 42 47 51 55 59 63
  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 46 48 50 52 54 56 58 60 62 65

The picture of a perfect BST - no red nodes!

Example Session 2

frbllt.x
L uint.63 # Inserts 63 random ints into RBLL tree
d1        # calls "Dump(std::cout)"

Output to screen:

 *
 **
 ****
 ********
 ****************
 ********--------**********--***-
 ****--*-*-----*-----------------**------*-*---------------------

The tree height is 1 more than the optimal value for a tree with 63 elements:

n = 63
optimal height = log2 (n + 1) - 1 = log2 64 - 1 = 5
H = 6 = 1 + 5 < 7 = 1 + log2 (n + 1)


RBLL Insert

  public:
    void Insert(const T& tval)
    {
      root_ = RInsert(root_, tval);
      root_->SetBlack();
    }

  private:
    Node * RInsert(Node* nptr, const T& tval)
    { ... }

The public Insert operation is implemented with a call to a private recursive version whose implementation we discuss after looking at two helper methods: RotateLeft and RotateRight.

Rotations

  private:
    static Node * RotateLeft(Node * n)
    {
      if (0 == n || 0 == n->rchild_)
        return n;
      Node * p = n->rchild_;
      n->rchild_ = p->lchild_;
      p->lchild_ = n;  
      return p;
    }

RotateLeft returns "replacement" pointer
to re-attach subtree

    Example: left rotation about n->75
    Links undergoing change shown in color

    Before:
          n->    75
              /      \
            60        90   <-p
           /  \      /  \
         55    65  80    99
         /\    /\  /\    /\
         ..    ..  ..    ..

    After:

         p->    90
             /      \
    n->    75        99
          /  \      /\
        60    80     ..
       /  \   /\
     55    65 ..
     /\    /\
     ..    ..

  private:
    static Node * RotateRight(Node * n)
    {
      if (n == 0 || n->lchild_ ==0)
        return n;
      Node * p = n->lchild_;
      n->lchild_ = p->rchild_;
      p->rchild_ = n;  
      return p;
    }

RotateRight returns "replacement" pointer
to re-attach subtree

    Example: right rotation about n->90
    Links undergoing change shown in color

    Before:

          n->    90
              /      \
      p->   75        99
           /  \       /\
         60    80     ..
        /  \   /\
      55    65 ..
      /\    /\
      ..    .. 

    After:

                 75    <-p
              /      \
            60        90   <-n
           /  \      /  \
         55    65  80    99
         /\    /\  /\    /\
         ..    ..  ..    ..


RBLL RInsert

  Node * RInsert(Node* nptr, const T& tval)
  {
    // invariant: number of black nodes in root->null paths has not changed
    // This means the only place the black node count goes up is at the top: 
    // if the node returned by RInsert is red, its color changes to black in Insert.
    if (nptr == 0)    // add new node at bottom of tree
    {
      return NewNode(tval, RED);
    }
    if (pred_(tval,nptr->value_))       // left subtree
    {
      nptr->lchild_ = RInsert(nptr->lchild_, tval);
    }
    else if (pred_(nptr->value_,tval))  // right subtree
    {
      nptr->rchild_ = RInsert(nptr->rchild_, tval);
    }
    else     // equality: node exists - set location 
    {
      nptr->value_ = tval;
    }
    // repair RBLL properties on way up
    if (nptr->RightChildIsRed() && !nptr->LeftChildIsRed())
      nptr = RotateLeft(nptr);
    if (nptr->LeftChildIsRed() && nptr->lchild_->LeftChildIsRed())
      nptr = RotateRight(nptr);
    if (nptr->LeftChildIsRed() && nptr->RightChildIsRed())
    { // swap parent/child colors
      nptr->lchild_->SetBlack();
      nptr->rchild_->SetBlack();
      nptr->SetRed();
    }
    // some color changes moved to RotateLeft and RotateRight
    return nptr;
  }
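A short trace shows the repair steps in action (using the color-aware rotations described under "Rotations Revisited" below). Insert 1, 2, 3 in order into an empty tree:

  Insert 1: RInsert returns a new red node; Insert colors the root black.

        1(B)

  Insert 2: 2 attaches as a red right child of 1. On the way up the first
  repair test fires (right child red, left child not), so RotateLeft makes
  2 the subtree root, keeping it black, with 1 as its red left child:

        1(B)                        2(B)
           \       RotateLeft      /
           2(R)    -------->     1(R)

  Insert 3: 3 attaches as a red right child of 2. Now both children of 2
  are red, so the color flip makes 1 and 3 black and 2 red; finally Insert
  re-colors the root black:

        2(B)                        2(R)                    2(B)
       /    \      color flip      /    \      SetBlack    /    \
     1(R)   3(R)   -------->     1(B)   3(B)   ------->  1(B)   3(B)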


Rotations Revisited

private:
  static Node* RotateLeft(Node* n)
  {
    if (0 == n || n->rchild_ == 0) return n;
    Require(n->rchild_->IsRed());
    Node * p = n->rchild_;
    n->rchild_ = p->lchild_;
    p->lchild_ = n;  

    // color changes added:
    n->IsRed()? p->SetRed() : p->SetBlack(); // p.color = n.color
    n->SetRed();                             // n.color = RED
    return p;
  }
  • p takes on old color of n ["blue" in graphic]
  • n is colored RED ["red" in graphic]
  • preserves "black height" in tree
    Example: left rotation about n->75

    Before:

          n->    75
              /      \
            60        90   <-p
           /  \      /  \
         55    65  80    99
         /\    /\  /\    /\
         ..    ..  ..    ..
    After:

         p->    90
             /      \
    n->    75        99
          /  \      /\
        60    80     ..
       /  \   /\
     55    65 ..
     /\    /\
     ..    ..
private:
  static Node * RotateRight(Node * n)
  {
    if (n == 0 || n->lchild_ == 0) return n;
    Require(n->lchild_->IsRed());
    Node * p = n->lchild_;
    n->lchild_ = p->rchild_;
    p->rchild_ = n;  

    // color changes
    n->IsRed()? p->SetRed() : p->SetBlack();  // p.color = n.color
    n->SetRed();                   // n.color = RED

    return p;
  }
  • p takes on old color of n ["blue" in graphic]
  • n is colored RED ["red" in graphic]
  • preserves "black height" in tree
    Example: right rotation about n->90

    Before:

          n->    90
              /      \
      p->   75        99
           /  \       /\
         60    80     ..
        /  \   /\
      55    65 ..
      /\    /\
      ..    ..

    After:

                 75    <-p
              /      \
            60        90   <-n
           /  \      /  \
         55    65  80    99
         /\    /\  /\    /\
         ..    ..  ..    ..


RBLL Lite (w/o Iterators)

Again we are faced with the problem that the Ordered Set / BST API requires iterators to take full advantage of the container, even when a straightforward store/retrieve use case is in play. As with BSTs, we turn to the "lite" case of associative-array-like API:

    void Put (const T& t)
    {
      Get(t) = t;
    }

    T& Get (const T& t)
    {
      Node * location;
      root_ = RGet(root_,t,location);
      root_->SetBlack();
      return location->value_;
    }

The extra argument location for RGet is the key to making Get effective without the use of iterators. Note that this argument is a Node pointer passed by reference, so that it can effectively serve as a return value that is set during the RGet call. Following the argument location in the implementation clarifies one of the two subtle differences between RInsert and RGet:

Node * RGet(Node* nptr, const T& tval, Node*& location)
{
  if (nptr == 0)    // add new node at bottom of tree
  {
    location = BST_ADT<T,P>::NewNode(tval, BST_ADT<T,P>::RED);
    return location;
  }
  if (this->pred_(tval,nptr->value_))       // left subtree
  {
    nptr->lchild_ = RGet(nptr->lchild_, tval, location);
  }
  else if (this->pred_(nptr->value_,tval))  // right subtree
  {
    nptr->rchild_ = RGet(nptr->rchild_, tval, location);
  }
  else     // equality: node exists - set location 
  {
    // nptr->value_ = tval;
    location = nptr;
  }

  // repair RBLL properties on way up (same code as Insert)
  if (nptr->RightChildIsRed() && !nptr->LeftChildIsRed())
    nptr = RotateLeft(nptr);
  if (nptr->LeftChildIsRed() && nptr->lchild_->LeftChildIsRed())
    nptr = RotateRight(nptr);
  if (nptr->LeftChildIsRed() && nptr->RightChildIsRed())
  {
    nptr->lchild_->SetBlack();
    nptr->rchild_->SetBlack();
    nptr->SetRed();
  }
  return nptr;
}

We mentioned two differences between the implementations of RInsert and RGet. The first is the addition of the return argument location. The second is that RGet does NOT update the value where RInsert does. This ensures that when t is in the set, Get(t) returns (a reference to) the value already stored, whereas Insert(t) updates the value. Get is a combination of Insert and Includes, in that Get ensures that the element is in the set (behaving like Insert), but retrieves the existing value when it is found (behaving like Includes).
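A usage sketch (the class name RBLLT_Lite is hypothetical, standing in for whatever class exposes the Put/Get API above):

    RBLLT_Lite<int> s;
    s.Put(42);           // ensures 42 is in the set, overwriting any stored copy
    int& r = s.Get(42);  // found: returns a reference to the stored value
    int& q = s.Get(7);   // not found: 7 is inserted, reference to it returned

For a type like int the Get/Insert distinction is invisible; it matters when T carries data beyond the comparison key, as with the pair-based tables discussed next.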

Tables, Maps, and Associative Arrays

All of the binary search tree technologies apply almost verbatim to Tables and Associative Arrays. The only distinction is that in a Set, the element is both the search key and the data, whereas in a Table, the search key and the data are separated into distinct pieces. Some people find this a more natural way to deal with retrieval systems, and it is probably the more common use. On the other hand, a "table" application can always be made with a Set of Pairs (as we have seen in assignments), but some storable entities do not naturally split as a pair.

A table or associative array or map class definition is distinguished from a set class by having template parameters for both KeyType and DataType where the Set would have only one for ValueType, and search uses the key to guide the search but retrieves the data associated with that key.
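As a sketch of how the pieces fit (the names Entry, LessThanKey, and RBLLT_Lite here are illustrative, not from the course library), a table can be assembled from the set machinery by comparing on keys only:

  // A table entry: ordered by key_ only, so Get can retrieve or create
  // the entry for a key while leaving its data_ free to vary.
  template <typename K, typename D>
  struct Entry
  {
    K key_;
    D data_;
    Entry() : key_(), data_() {}
    Entry(const K& k, const D& d = D()) : key_(k), data_(d) {}
  };

  template <typename K, typename D>
  struct LessThanKey // predicate comparing keys only
  {
    bool operator()(const Entry<K,D>& e1, const Entry<K,D>& e2) const
    { return e1.key_ < e2.key_; }
  };

  // Then RBLLT_Lite< Entry<K,D>, LessThanKey<K,D> > behaves as a map,
  // and the associative-array access t[k] is spelled:
  //   t.Get(Entry<K,D>(k)).data_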

Set and Table Runtime Analysis

These tables summarize what we have already discussed in great detail. It is interesting to see the results collected together. Note how much work we have done to ensure logarithmic insert time in Sets and Tables! If Insert time is not an issue, the OVector implementation of Set or Table is extremely time efficient and, as we see in the next slide, has a scrupulously small memory footprint as well.
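In outline, the worst-case runtimes assembled in those tables are as follows (a summary reconstructed from the preceding discussion; n = number of elements, h = tree height):

  operation   OVector     OList    BST     RBLLT
  ---------   -------     -----    ---     -----
  Search      O(log n)    O(n)     O(h)    O(log n)
  Insert      O(n)        O(n)     O(h)    O(log n)
  Remove      O(n)        O(n)     O(h)    O(log n)

where h can be as large as n - 1 for an unbalanced BST.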

We have not discussed Erase for OList and OVector, but the student should be able to describe (1) how these are implemented and (2) how the runtime conclusions in the tables are derived.

We will explore how to improve the runtime of Set and Table operations further, but we will have to give up the constraint of ordered traversal in order to do so.

Set and Table Runspace Analysis

This slide summarizes aspects we haven't made explicit up to now: run space requirements. Runspace is usually stated in the form "+something", which means a measure of the additional space required on top of the space needed to store the input data. The estimates include space overhead for the container itself.

For example, in the OList column, "+2n pointers" is the container space overhead, acknowledging that the list elements are stored in links and each link carries two pointers in addition to its data. The OList algorithms each use only a small number of fixed-size variables, a constant amount of additional space.

OVector is very space efficient as a container and as a collection of algorithms - none need more than a small constant amount of space within which to work (except for ReHash, which must allocate a new memory footprint prior to de-allocating the existing footprint).

The ReHash operation for OList and OVector can take advantage of the ordered property in the underlying container to copy the "alive" elements into newly allocated space sequentially with no search involved.

BST estimates are the same as for OList, and for the same reasons.

RBLLT space overhead is "+(2n pointers + n bytes)", acknowledging the extra byte in each node that contains the color and other structural flags. The RBLLT Insert algorithm uses runtime stack space in proportion to the number of recursive calls. That number is limited by the height of the tree, which we have seen is O(log n).