Optimal binary search trees

(useful as a static dictionary)

Given an ordered set S = a₁ < a₂ < ... < a_n, we wish to process sequences of MEMBER queries. We also know the probability of each kind of request occurring:

p_i = Prob[ MEMBER(a_i,S) is asked], for i = 1...n
q_i = Prob[ MEMBER(x,S) is asked with a_i < x < a_{i+1} ], for i = 0...n
where a₀ = -∞ and a_{n+1} = +∞

To help analyze the time complexity, we add leaves to the binary search tree wherever we have a null link. Leaf i, for i = 0...n, stands for the unsuccessful searches with a_i < x < a_{i+1}.

If x is the label of node v then cost( MEMBER(x,S) ) = 1 + depth(v).

If x is not in S and a_i < x < a_{i+1}, then cost( MEMBER(x,S) ) = depth(leaf i).

The average time complexity for this tree can be found by summing, over all nodes and leaves, the cost of each access multiplied by the probability of that access.

cost(binary search tree T) = ∑_{i = 1 to n} ( p_i [1 + depth(a_i)] ) + ∑_{i = 0 to n} ( q_i depth(leaf i) )
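
As a rough illustration of the cost formula, the following Python sketch evaluates cost(T) for a tree that is given explicitly. The Node class, the name tree_cost, and the convention that p[0] is unused padding (so that p[i] lines up with a_i) are assumptions made for this example; they are not part of the notes. The sketch also assumes the tree is a valid search tree on the keys a_1...a_n.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Internal node holding key a_k (k is 1-based); a None child is a leaf."""
    k: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def tree_cost(root: Optional[Node], p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of p_i * (1 + depth(a_i)) plus sum of q_i * depth(leaf i).

    p[1..n] are the hit probabilities (p[0] is unused padding);
    q[0..n] are the miss probabilities for the gaps between keys.
    """
    def walk(node: Optional[Node], depth: int, lo: int) -> float:
        # `lo` is the index of the leftmost gap (leaf) below this subtree.
        if node is None:
            return q[lo] * depth          # leaf `lo` sits at this depth
        left_cost = walk(node.left, depth + 1, lo)
        # Everything to the right of a_k starts at gap k.
        right_cost = walk(node.right, depth + 1, node.k)
        return left_cost + p[node.k] * (1 + depth) + right_cost

    return walk(root, 0, 0)
```

For instance, with p = [0.0, 0.3, 0.2] and q = [0.1, 0.15, 0.25], the tree with a_2 at the root and a_1 as its left son costs about 1.55, while the tree with a_1 at the root and a_2 as its right son costs about 1.6.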

Problem: Given the p's and q's, find T to minimize cost.

The divide-and-conquer approach suggests determining which element belongs at the root and then determining what each of the subtrees looks like. There seems to be no easy way of determining what the root should be, so we would have to solve 2n subproblems: each of the n elements could be at the root, and for each choice we must solve for the left and right subtrees. (As an exercise, determine the time complexity of this recursive approach. Start by giving an explicit recurrence.) Plain recursion solves the same subproblems over and over, which is too much work, so we use dynamic programming.

For 0 ≤ i < j ≤ n, let
T_{i, j} = min cost tree for problem {a_{i+1}, ..., a_j}
c_{i, j} = cost(T_{i, j})
r_{i, j} = root(T_{i, j})
and define weight w_{i, j} = q_i + (p_{i+1} + q_{i+1}) + ... + (p_j + q_j)
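
One possible way to evaluate the weights, sketched in Python: with prefix sums of (p_t + q_t), any w_{i, j} is available in constant time. The name make_weight and the padding convention (p[0] unused, so p[t] lines up with a_t) are assumptions made for this example.

```python
from itertools import accumulate
from typing import Callable, Sequence


def make_weight(p: Sequence[float], q: Sequence[float]) -> Callable[[int, int], float]:
    """Return a function w(i, j) = q_i + (p_{i+1} + q_{i+1}) + ... + (p_j + q_j).

    p[1..n] are the hit probabilities (p[0] is unused padding); q[0..n] the misses.
    """
    n = len(q) - 1
    # prefix[j] = (p_1 + q_1) + ... + (p_j + q_j), with prefix[0] = 0
    prefix = [0.0] + list(accumulate(p[t] + q[t] for t in range(1, n + 1)))

    def w(i: int, j: int) -> float:
        return q[i] + prefix[j] - prefix[i]

    return w
```

Note that w(i, i) comes out as q_i, matching the boundary condition for the empty tree.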

T_{i, j} consists of a root containing a_k, for some k with i < k ≤ j, and left and right subtrees of the root, with the left subtree being an optimal (min cost) tree T_{i, k-1} and the right subtree being T_{k, j}.

Also, boundary conditions:
T_{i, i} = the empty tree
w_{i, i} = q_i
c_{i, i} = 0

In T_{i, j}, the depth of all vertices in the subtrees is precisely 1 more than what the depths were in subtrees T_{i, k-1} and T_{k, j}. Therefore,
c_{i, j} = (c_{i, k-1} + w_{i, k-1}) + p_k + (c_{k, j} + w_{k, j})
= w_{i, j} + c_{i, k-1} + c_{k, j}, for some k with i < k ≤ j,
since w_{i, k-1} + p_k + w_{k, j} = w_{i, j}.

The optimal T_{i, j} will have root a_k that minimizes the sum c_{i, k-1} + c_{k, j}.
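
A small worked instance of the recurrence may help. The numbers below are made up for illustration (n = 2, with p_1 = 0.30, p_2 = 0.20, q_0 = 0.10, q_1 = 0.15, q_2 = 0.25); they do not come from the notes.

```latex
\begin{align*}
w_{0,1} &= q_0 + p_1 + q_1 = 0.55, \qquad c_{0,1} = w_{0,1} + c_{0,0} + c_{1,1} = 0.55 \\
w_{1,2} &= q_1 + p_2 + q_2 = 0.60, \qquad c_{1,2} = w_{1,2} + c_{1,1} + c_{2,2} = 0.60 \\
w_{0,2} &= w_{0,1} + p_2 + q_2 = 1.00 \\
k = 1:&\quad c_{0,0} + c_{1,2} = 0.60 \\
k = 2:&\quad c_{0,1} + c_{2,2} = 0.55 \\
c_{0,2} &= w_{0,2} + 0.55 = 1.55, \qquad r_{0,2} = 2
\end{align*}
```

So under these assumed probabilities the optimal tree has a_2 at the root with a_1 as its left son. Evaluating the cost formula directly on that tree gives 0.20(1) + 0.30(2) + 0.10(2) + 0.15(2) + 0.25(1) = 1.55, which agrees.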

Construction of optimal binary search tree

    for i := 0 to n do
       w_i,i := q_i
       c_i,i := 0
       r_i,i := 0
    for length := 1 to n do
       for i := 0 to n-length do
          j := i + length
          w_i,j := w_i,j-1 + p_j + q_j
          m := value of k (with i < k ≤ j) which minimizes (c_i,k-1+c_k,j)
          c_i,j := w_i,j + c_i,m-1 + c_m,j
          r_i,j := m
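          { the sons belong to the root of T_i,j, not to the key a_m alone: }
          { the same key may be the root of several T_i,j with different sons; 0 means empty }
          Leftson(T_i,j) := r_i,m-1
          Rightson(T_i,j) := r_m,j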

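The time complexity of this algorithm is O(n³): there are O(n²) pairs (i, j), and finding the minimizing k for each pair takes O(n) time. The tables use O(n²) space.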
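
The construction above might be transcribed into Python roughly as follows. This is a sketch under stated assumptions rather than a reference implementation: the names optimal_bst and build_tree, the padding convention (p[0] unused, so p[k] lines up with a_k), and the nested-tuple representation of the tree are all choices made for this example.

```python
from typing import List, Optional, Sequence, Tuple


def optimal_bst(p: Sequence[float], q: Sequence[float]) -> Tuple[List[List[float]], List[List[int]]]:
    """Fill the cost table c and root table r for keys a_1..a_n in O(n^3) time.

    p[1..n] are the hit probabilities (p[0] is unused padding); q[0..n] the misses.
    """
    n = len(q) - 1
    w = [[0.0] * (n + 1) for _ in range(n + 1)]
    c = [[0.0] * (n + 1) for _ in range(n + 1)]
    r = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        w[i][i] = q[i]                      # empty tree: c = 0, r = 0
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            w[i][j] = w[i][j - 1] + p[j] + q[j]
            # Try every root a_k with i < k <= j; ties go to the smallest k.
            m = min(range(i + 1, j + 1), key=lambda k: c[i][k - 1] + c[k][j])
            c[i][j] = w[i][j] + c[i][m - 1] + c[m][j]
            r[i][j] = m
    return c, r


def build_tree(r: List[List[int]], i: int, j: int) -> Optional[tuple]:
    """Read T_{i,j} out of the root table as (left, k, right); None is the empty tree."""
    if i == j:
        return None
    k = r[i][j]
    return (build_tree(r, i, k - 1), k, build_tree(r, k, j))
```

A call such as c, r = optimal_bst(p, q) followed by build_tree(r, 0, n) yields the whole tree, with c[0][n] as its cost. The tree is read from r after the tables are complete, rather than by storing son fields per key during the fill, because the same key can be the root of several subproblems with different sons.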

Making a slight change will reduce the complexity to O(n²). (See, for example, Knuth, vol. 3, 2nd ed., pp. 436-9 and p. 456 #27.)

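Modify the range of considered values of k by replacing the line that computes m with the following. An optimal root can always be found with r_{i, j-1} ≤ k ≤ r_{i+1, j}, and for a fixed length these ranges overlap only at their endpoints as i increases, so the total work per length is O(n):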

          if length=1 then
             m := j
          else
             m := value of k (with r_i,j-1 ≤ k ≤ r_i+1,j) which minimizes (c_i,k-1+c_k,j)
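
With the restricted range, the table fill might look like the following Python sketch. As before, the name optimal_bst_fast and the padding convention (p[0] unused) are assumptions made for this example; the weights are taken to be non-negative.

```python
from typing import List, Sequence, Tuple


def optimal_bst_fast(p: Sequence[float], q: Sequence[float]) -> Tuple[List[List[float]], List[List[int]]]:
    """Fill c and r in O(n^2) time by searching only r[i][j-1] <= k <= r[i+1][j].

    p[1..n] are the hit weights (p[0] is unused padding); q[0..n] the miss weights.
    """
    n = len(q) - 1
    w = [[0.0] * (n + 1) for _ in range(n + 1)]
    c = [[0.0] * (n + 1) for _ in range(n + 1)]
    r = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        w[i][i] = q[i]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            w[i][j] = w[i][j - 1] + p[j] + q[j]
            if length == 1:
                m = j                       # only one key, so it is the root
            else:
                lo, hi = r[i][j - 1], r[i + 1][j]
                # By construction lo <= hi, so the range is never empty.
                m = min(range(lo, hi + 1), key=lambda k: c[i][k - 1] + c[k][j])
            c[i][j] = w[i][j] + c[i][m - 1] + c[m][j]
            r[i][j] = m
    return c, r
```

The only difference from the O(n³) version is the range passed to the minimization; the tables and the way the tree is read back from r are unchanged.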

Dan Hirschberg

Computer Science Department
University of California, Irvine, CA 92697-3435

dan (at) ics.uci.edu
Last modified: Oct 28, 2003