ICS 161: Design and Analysis of Algorithms
Lecture notes for January 11, 1996

Sequential and Binary Search

Example: looking up a topic in Baase. Suppose after Tuesday's application of matrix multiplication to Fibonacci numbers, that you wanted to know what she says about matrix multiplication.

You could look it up in the index but it will give you many different pages to look at, some of which are only somewhat relevant. Or you could read through the table of contents until you find relevant looking titles (6.2 and 7.3).

Sequential search

The second method (reading through the table of contents) is an example of sequential search. Similar sorts of problems show up all the time in programming (e.g. operating systems have to look up file names in a directory, and the Unix system usually does it with sequential search).

Very abstractly:

    sequential search(list L,item x)
    {
    for (each item y in the list)
        if (y matches x)
        return y
    return no match
    }

This has many variants -- do you stop once you've found one match or do you keep going until you've found all of them? do you represent the list using pointers, linked lists, or what? how do you indicate that there was no match?

So we want to analyse this...

To really understand the running time, we have to know how quick the "y matches x" part is -- everything else is straightforward. The way I've written it in pseudocode, that part still needs to be filled in. But we can still analyse the algorithm! We just measure the time in terms of the number of comparisons.

Examples: does 8 appear in the list of the first 10 Fib. numbers? Does 9? Note for 9, algorithm has to go through whole list.

So the time seems to depend on both L and x. We want to be able to predict the time easily without running the algorithm, so saying

    comparisons(x,L) = position of x in L

is true but not very informative.

Methods of analysis

To be able to predict the time without having to look at the details of the input, we measure it as a function of the length of the input. Here x has basically constant length (depending on what an item is) and the length of L is just the number of items in it.

So given a list L with n items, how many comparisons does the algorithm take? Answer: it depends.

We want an answer that doesn't depend. There are various ways of getting one, by combining the times for different inputs with the same length.

Worst case analysis -- what is the most comparisons we could ever see no matter how perverse the input is?
```
    time_wc(n) =        max         time(I)
         (input I of size n) 
```
Best case analysis -- what is the fewest comparisons the algorithm could take if the input is well behaved?
```
    time_wc(n) =        min         time(I)
         (input I of size n) 
```
Average case analysis -- how much time would the algorithm take on "typical" input?
We assume that each input I of size n has a probability P[I] of being the actual input and use these proabilities to find a weighted average:
```
    time_avg(n) = sum P[I] time(I)
```

These distinctions didn't make sense with Fibonacci numbers because the time there was always a function of n, but here they can give different answers (we'll see with sequential search).

Average case is probably the most important in general, but is problematic in terms of what is a typical input? You have to make some assumption about the probabilities, and your analysis will only be as accurate as the validity of your assumptions. Also note that it's possible to have an algorithm for which no input takes the "average" time -- e.g. if it takes either 1 step or 100 steps, the average may be around 50 even though no input actually takes 50 steps.

Worst case is what we usually do, it's easier than average case analysis and it's useful because you can guarantee that the algorithm will not ever take longer than its worst case bound. It's also true that the average case is at most the worst case, no matter what probabilities you choose, so you can use worst case analysis to get some information about the average case without having to make assumptions about what a "typical" input looks like.

Best case is fun but not very useful.

Analysis of sequential search

The best case for sequential search is that it does one comparison, and matches X right away.

In the worst case, sequential search does n comparisons, and either matches the last item in the list or doesn't match anything.

The average case is harder to do. We know that the number of comparisons is the position of x in the list. But what is typical position of x?

One reasonable assumption: If x is in the list, it's equally likely to be anywhere in it. so P[pos] = 1/n.

    average number of comparisons

     n   1
     =  sum  - . i
    i=1  n

    1  n
     =  - sum  i
    n i=1

     = (n+1)/2.

But if x is not in the list, the number of comparisons is always n.

So finding something takes half as long as not finding it, on average, with this definition of "typical".

We can define a stronger version of "typical": suppose for any list, any permutation of the list is equally likely. Then we can average over all possible permutations:

    average number of comparisons

     n!  1
     =  sum  -  . (position of x in permutation i)
    i=1  n!

     n  1
     =  sum - . p . (number of permutations with x in position p)
    p=1 n!

     n  1
     =  sum - . p . (n-1)!
    p=1 n!

     n  1
     =  sum - . p
    p=1 n

     = (n+1)/2.

So this assumption ends up giving the same analysis.

A second point to be made about average case analysis: sometimes it makes sense to analyse different cases separately. The analysis above assumes x is always in the list; if x is not in the list, you always get n comparisons. You could make up a probability p that x is in or out of the list and combine the two numbers above to get a total average number comparisons equal to pn + (1-p)(n+1)/2 but it makes more sense to just report both numbers separately.

Randomized algorithms

Sometimes it's useful to pay a little bit to reduce the uncertainty in the world -- e.g. insurance, you know you'll pay a fixed amount instead of either paying nothing (if you stay healthy) or a lot (if you get appendicitis).

the same concept applies to computer programs -- if the worst case is much larger than the average case, we might prefer to have a slightly more complicated program that reduces the worst case as long as it doesn't increase the average case too much. For instance if you're programming the computer controlling a car, and you want to tell if you're in a crash and should activate the air bags, you don't want to be running some algorithm that usually takes half a second but maybe sometimes takes as much as five minutes.

Random numbers are very useful in this respect. they're also useful in making "average case" analysis apply even when the input itself is not random at all, or when we don't know a good definition for a "typical" input. The idea is to "scramble" the input so that it looks typical. We say that an algorithm is randomized if it uses random numbers. An algorithm that is not randomized is called deterministic.

The "expected time" analysis of a random algorithm is measured in terms of time(input,sequence of random numbers). For some particular input I, the expected time of the algorithm is just the average over different sequences of random numbers:

        sum    Prob(R) . time(I,R)
    (random sequence R)

The expected time of the algorithm on (worst case) inputs of length n is then computed by combining this formula with the previous formula for worst case analysis:

        max             sum        Prob(R) . time(I,R)
    (input I of size n) (random sequence R)

This looks complicated, but isn't usually much harder than average case analysis. Here it is for sequential search.

We want to scramble (x,L) so that position of x in L is random. Idea: pick a random permutation of L then do the sequential search.

    randomized search(list L,item x)
    {
    randomly permute L
    for (each item y in L)
        if (y matches x)
        return y
    return no match
    }

This slows down the algorithm somewhat (because you have to take time to do the permutation) but may speed up the searching part. If you're just searching for a number in a list of numbers, this would be a pretty bad method, because the time for doing the random permutation would probably be more than the worst case for the original deterministic sequential search algorithm. However if comparisons are very slow, much slower than the other steps in the algorithm, the total number of comparisons will dominate the overall time and this algorithm could be an improvement.

Let's plug this algorithm into our formula for expected times:

    time = max(x,L) sum(permutation p) probability(p) time(x,p(L))

Note there are n! permutations. of those, there are (n-1)! such that x is in some given position i.

    time = max    sum  sum                 prob(perm) . time(x,L')
        (x,L)  i    (perm w/x at pos i)

     = max    sum  #(perms w/x at pos i) 1/n! . i
        (x,L)  i   

     = max    sum  (n-1)!/n! . i
        (x,L)  i   

     = max(x,L) sum(i) i/n

     = (n+1)/2

so the number of comparisons is exactly the same as the average case but now it doesn't matter what the list is.

We'll see that same idea of using random permutation to avoid the worst case later, in the quicksort and quickselect algorithms. For both of these algorithms, the use of randomization decreases the running time enormously, from O(n^2) to O(n log n) or O(n).

It is also sometimes possible to make stronger forms of analysis about random algorithms than just their expected time, for instance we could compute the variance of the running time, or prove statements such as that with very high probability, an algorithm uses time close to its expectation. This is important if one wants to be sure that the slow possibilities are very rare, but is usually much more complicated, so we won't do much of that sort of analysis in this class.

Binary search

Let's go back to the original example -- finding matrix multiplication in Baase. I talked about looking it up in the table of contents (by sequential search) but also about looking it up in the index.

The index of Baase and most other books has the useful property that it's alphabetized, so we can be smarter about our search. For instance, we could stop the sequential search whenever we found a y>x, and this would speed up the time for x not in L. But we can be much better, and this is basically what people do in alphabetized lists.

    binary search(x,L)
    {
    let n = length of L, i=n/2.
    if (n = 0) return no match
    else if (L[i] matches x) return L[i]
    else if (L[i] > x) binary search(x,L[1..i-1])
    else binary search(x,L[i+1..n])
    }

Recursion is not really necessary:

    alternate search(x,L)
    {
    let n = length of L
    let a = 1, b = n
    while (L[i = (a+b)/2] doesn't match)
        if (L[i] > x) b = i-1
        else a = i+1        
        if a>b return no match
    return L[i]
    }

Analysis: T(n) = O(1) + T(n/2) = O(log n)

More precisely in the worst case, T(n) = 2 + T(ceiling((n-1)/2)) which solves to approximately 2 log n (logarithm to base 2).

So binary search is fast, but in order to use it we need to somehow get the list to be in sorted order -- this problem is known as sorting, and we'll see it in much detail next week.

ICS 161 -- Dept. Information & Computer Science -- UC Irvine
Last update: