## Optimal binary search trees

(useful as a static dictionary)

Given an ordered set $S = \{a_1 < a_2 < \dots < a_n\}$, we wish to process sequences of MEMBER queries. We also know the probability of each kind of request:

$p_i$ = Prob[ MEMBER($a_i$, S) is asked ], for $i = 1, \dots, n$
$q_i$ = Prob[ MEMBER($x$, S) is asked with $a_i < x < a_{i+1}$ ], for $i = 0, \dots, n$
where $a_0 = -\infty$ and $a_{n+1} = +\infty$

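To help analyze the time complexity, we add a leaf to the binary search tree wherever there is a null link. Numbering the leaves 0 through n from left to right, leaf i is where an unsuccessful search for x with $a_i < x < a_{i+1}$ ends.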

If $x$ is the label of node $v$, then cost( MEMBER($x$, S) ) = 1 + depth($v$), where the root has depth 0.

If $x$ is not in $S$ and $a_i < x < a_{i+1}$, then cost( MEMBER($x$, S) ) = depth(leaf $i$).

The average time complexity for this tree is found by summing, over all possible accesses, the cost of the access multiplied by the probability of that access.

$$\mathrm{cost}(T) = \sum_{i=1}^{n} p_i \, [1 + \mathrm{depth}(a_i)] + \sum_{i=0}^{n} q_i \, \mathrm{depth}(\text{leaf } i)$$
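The cost formula can be checked mechanically on a small tree. The sketch below is illustrative only: the nested-tuple tree encoding, the function name `tree_cost`, and the padding entry `p[0]` are assumptions made for this example, not part of the notes. Integer frequencies stand in for probabilities so that the arithmetic is exact.

```python
def tree_cost(tree, p, q):
    """Cost of a given binary search tree under the formula above.

    tree: None for a leaf (null link), or (left, i, right) where i is the
          1-based index of the key a_i stored at that node (assumed encoding).
    p:    p[1..n] are the key weights; p[0] is unused padding (assumption).
    q:    q[0..n] are the gap weights.
    """
    total = 0
    next_leaf = 0  # an in-order walk meets the leaves in the order 0, 1, ..., n

    def walk(node, depth):
        nonlocal total, next_leaf
        if node is None:
            total += q[next_leaf] * depth      # unsuccessful search ends here
            next_leaf += 1
            return
        left, i, right = node
        walk(left, depth + 1)
        total += p[i] * (1 + depth)            # successful search for a_i
        walk(right, depth + 1)

    walk(tree, 0)
    return total


# Two keys: a_1 at the root, a_2 as its right child.
example = (None, 1, (None, 2, None))
print(tree_cost(example, [0, 3, 2], [1, 2, 2]))  # 16
```

Dividing the result by the total frequency (10 in this example) gives the expected cost per query.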

Problem: Given the p's and q's, find a tree T that minimizes cost(T).

The divide-and-conquer approach suggests determining which element belongs at the root and then determining what each of the subtrees looks like. There seems to be no easy way of determining what the root should be, so we would have to solve 2n subproblems: each of the n elements could be at the root, and for each choice we must solve the left and right subtrees. (As an exercise, determine the time complexity of this recursive approach. Start by giving an explicit recurrence.) That is too much work for plain recursion, so we use dynamic programming.

For $0 \le i < j \le n$, let
$T_{i,j}$ = min cost tree for problem $\{a_{i+1}, \dots, a_j\}$
$c_{i,j}$ = cost($T_{i,j}$)
$r_{i,j}$ = root($T_{i,j}$)
and define weight $w_{i,j} = q_i + (p_{i+1} + q_{i+1}) + \dots + (p_j + q_j)$
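As a small illustration of the weight definition, the following sketch sums the terms directly. The function name `weight` and the convention that `p[0]` is unused padding are assumptions of this example. When i = j the sum over keys is empty and only the gap term remains.

```python
def weight(p, q, i, j):
    """Total weight of the keys a_{i+1}..a_j and the gaps i..j (0 <= i <= j <= n)."""
    return q[i] + sum(p[k] + q[k] for k in range(i + 1, j + 1))


p = [0, 3, 2]   # p[1], p[2]; p[0] is padding (assumption of this sketch)
q = [1, 2, 2]   # q[0], q[1], q[2]
print(weight(p, q, 0, 2))  # 10
print(weight(p, q, 1, 1))  # 2
```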

$T_{i,j}$ consists of a root containing $a_k$, for some $k$ with $i < k \le j$, together with the left and right subtrees of the root: the left subtree is an optimal (min cost) tree $T_{i,k-1}$ and the right subtree is $T_{k,j}$.

Also, boundary conditions:
$T_{i,i}$ = the empty tree
$w_{i,i} = q_i$
$c_{i,i} = 0$

In $T_{i,j}$, the depth of every vertex in the two subtrees is precisely 1 more than its depth in $T_{i,k-1}$ or $T_{k,j}$. Therefore,

$$c_{i,j} = (c_{i,k-1} + w_{i,k-1}) + p_k + (c_{k,j} + w_{k,j}) = w_{i,j} + c_{i,k-1} + c_{k,j} \quad \text{for some } k,$$

where the second equality uses $w_{i,k-1} + p_k + w_{k,j} = w_{i,j}$.

The optimal $T_{i,j}$ will have the root $a_k$ that minimizes the sum $c_{i,k-1} + c_{k,j}$.
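Read literally, the recurrence gives a recursive procedure. The sketch below is a direct transcription, under the assumptions that `p[0]` is unused padding and that weights are recomputed by summation; the name `min_cost` is hypothetical. It solves the same subranges over and over, which is what the table-based construction avoids.

```python
def min_cost(p, q, i, j):
    """Minimum cost of a search tree on keys a_{i+1}..a_j, by plain recursion."""
    if i == j:
        return 0  # boundary condition: the empty tree costs nothing
    # Weight of the whole range: q_i + (p_{i+1} + q_{i+1}) + ... + (p_j + q_j).
    w = q[i] + sum(p[k] + q[k] for k in range(i + 1, j + 1))
    # Try every key a_k (i < k <= j) as the root and keep the cheapest split.
    best = min(min_cost(p, q, i, k - 1) + min_cost(p, q, k, j)
               for k in range(i + 1, j + 1))
    return w + best


print(min_cost([0, 3, 2], [1, 2, 2], 0, 2))  # 16
```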

Construction of optimal binary search tree

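```
for i := 0 to n do
    w[i,i] := q[i]
    c[i,i] := 0
    r[i,i] := 0
for length := 1 to n do
    for i := 0 to n-length do
        j := i + length
        w[i,j] := w[i,j-1] + p[j] + q[j]
        m := value of k (with i < k ≤ j) which minimizes (c[i,k-1] + c[k,j])
        c[i,j] := w[i,j] + c[i,m-1] + c[m,j]
        r[i,j] := m
        { children of the root of T[i,j]; the same key can be the root of
          several T[i,j], so read the final tree off r starting at r[0,n] }
        Leftson(r[i,j]) := r[i,m-1]
        Rightson(r[i,j]) := r[m,j]
```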
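The pseudocode above can be sketched in Python as follows. This is one possible rendering, not a definitive implementation: the function names `optimal_bst` and `build_tree`, the padding entry `p[0]`, and the nested-tuple tree encoding are assumptions of this example. Ties between equally good roots are broken toward the smallest k. Instead of setting child pointers inside the loop, the sketch recovers the tree from the root table afterwards, because the same key may be the root of several ranges.

```python
def optimal_bst(p, q):
    """Fill the cost and root tables for all ranges of keys.

    p[1..n] are the key weights (p[0] is unused padding); q[0..n] are the
    gap weights. Returns (c, r), where c[i][j] is the minimum cost for keys
    a_{i+1}..a_j and r[i][j] is the index of the chosen root (0 if empty).
    """
    n = len(q) - 1
    w = [[0] * (n + 1) for _ in range(n + 1)]
    c = [[0] * (n + 1) for _ in range(n + 1)]
    r = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        w[i][i] = q[i]                      # empty range: only the gap weight
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            w[i][j] = w[i][j - 1] + p[j] + q[j]
            # Scan every candidate root k with i < k <= j.
            m = min(range(i + 1, j + 1),
                    key=lambda k: c[i][k - 1] + c[k][j])
            c[i][j] = w[i][j] + c[i][m - 1] + c[m][j]
            r[i][j] = m
    return c, r


def build_tree(r, i, j):
    """Recover the tree for keys a_{i+1}..a_j as nested (left, k, right) tuples."""
    if i == j:
        return None                         # empty tree, i.e. a leaf
    m = r[i][j]
    return (build_tree(r, i, m - 1), m, build_tree(r, m, j))


# Integer frequencies stand in for probabilities to keep the arithmetic exact.
c, r = optimal_bst([0, 15, 10, 5, 10, 20], [5, 10, 5, 5, 5, 10])
print(c[0][5], r[0][5])  # 235 2
```

Calling `build_tree(r, 0, n)` on the returned root table yields the optimal tree for the whole key set.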

The time complexity of this algorithm is $O(n^3)$: there are $O(n^2)$ pairs (i, j), and finding m for each pair examines up to n values of k.

Making a slight change will reduce the complexity to $O(n^2)$. (See, for example, Knuth v.3, 2nd ed., pp. 436-9 and p. 456 #27.)

Modify the range of considered values of k:

```
if length = 1 then
    m := j
else
    m := value of k (with r[i,j-1] ≤ k ≤ r[i+1,j]) which minimizes (c[i,k-1] + c[k,j])
```
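A Python sketch of the construction with this narrower range for k is given below. As before, the function name `optimal_bst_fast` and the padding entry `p[0]` are assumptions of this example, ties are broken toward the smallest k, and the weights are assumed to be nonnegative.

```python
def optimal_bst_fast(p, q):
    """Same tables as the cubic construction, with the restricted range for k.

    p[1..n] are the key weights (p[0] is unused padding); q[0..n] are the
    gap weights. Returns (c, r) as cost and root tables.
    """
    n = len(q) - 1
    w = [[0] * (n + 1) for _ in range(n + 1)]
    c = [[0] * (n + 1) for _ in range(n + 1)]
    r = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        w[i][i] = q[i]
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            w[i][j] = w[i][j - 1] + p[j] + q[j]
            if length == 1:
                m = j                       # a single key is its own root
            else:
                # Only roots between those of the two shorter ranges are tried.
                lo, hi = r[i][j - 1], r[i + 1][j]
                m = min(range(lo, hi + 1),
                        key=lambda k: c[i][k - 1] + c[k][j])
            c[i][j] = w[i][j] + c[i][m - 1] + c[m][j]
            r[i][j] = m
    return c, r


c, r = optimal_bst_fast([0, 15, 10, 5, 10, 20], [5, 10, 5, 5, 5, 10])
print(c[0][5], r[0][5])  # 235 2
```

For a fixed length, the ranges scanned for successive values of i overlap only at their endpoints, so the total work per length is linear in n.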

Dan Hirschberg
Computer Science Department
University of California, Irvine, CA 92697-3435
dan (at) ics.uci.edu