Median selection

More generally: select the kth smallest element from a set of n elements (the median is the case k = ceil( n/2 )).

Straightforward approaches

Linear-time on average, quadratic worst-case-time algorithm

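Similar to quicksort: to select the kth smallest element of a set S, partition S around a randomly chosen element and recurse only into the part that contains the answer.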

SELECT( k, S)
    if |S| = 1 then return the single element of S
    choose a random element a in S
    let S1, S2, S3 be the sets of elements of S that are <, =, > a, respectively
    if |S1| ≥ k then return SELECT(k, S1)
    else if |S1| + |S2| ≥ k then return a
    else return SELECT( k-|S1|-|S2|, S3)
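
The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function name select, the 1-indexed k, the optional rng parameter, and the use of a loop in place of the tail recursion are all assumptions made for the example.

```python
import random

def select(k, items, rng=random):
    """Return the k-th smallest element (1-indexed) of a non-empty sequence."""
    items = list(items)
    if not 1 <= k <= len(items):
        raise ValueError("k must satisfy 1 <= k <= len(items)")
    while True:
        if len(items) == 1:
            return items[0]
        a = rng.choice(items)                    # random partitioning element
        s1 = [x for x in items if x < a]         # elements smaller than a
        s2 = [x for x in items if x == a]        # elements equal to a
        s3 = [x for x in items if x > a]         # elements larger than a
        if len(s1) >= k:
            items = s1                           # answer lies in S1
        elif len(s1) + len(s2) >= k:
            return a                             # answer is a itself
        else:
            k -= len(s1) + len(s2)               # answer lies in S3
            items = s3
```

For example, select(3, [9, 1, 7, 4]) returns 7. The loop replaces the recursive calls one for one, so the number of partitioning passes is the same as in the recursive formulation.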

Worst-case time complexity

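The worst-case time complexity is O(n^2): if each partition removes only a constant number of elements from S, there are on the order of n partitioning passes, each taking time linear in the current size of S.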
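
To make the quadratic behaviour concrete, the sketch below replaces the random choice with a deliberately bad rule (always pick the minimum) and asks for the largest element. The function name worst_case_work and the cost measure (one unit per element scanned in each partitioning pass) are assumptions for illustration; the time spent finding the minimum is not counted.

```python
def worst_case_work(n):
    """Count elements scanned when the pivot is always the minimum and k = n."""
    items = list(range(n))
    k = n
    work = 0
    while len(items) > 1:
        a = min(items)                 # stand-in for an unlucky random choice
        work += len(items)             # one partitioning pass over the current set
        s1 = [x for x in items if x < a]
        s2 = [x for x in items if x == a]
        s3 = [x for x in items if x > a]
        if len(s1) >= k:
            items = s1
        elif len(s1) + len(s2) >= k:
            return work
        else:
            k -= len(s1) + len(s2)
            items = s3
    return work
```

Each pass removes a single element, so the passes scan n, n-1, ..., 2 elements, for a total of n(n+1)/2 - 1.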

Average-case time complexity

Let T(n) be the expected running time for |S| = n, and assume that S is a set of distinct elements.

Let a be the ith smallest element of S.  Then,
    i > k → call SELECT on |S'| = i-1
    i < k → call SELECT on |S'| = n-i
    i = k → return a (no recursive call)

The expected cost of the recursion is
    (1/n) [ ∑_{i=1}^{k-1} T(n-i) + ∑_{i=k+1}^{n} T(i-1) ]
    = (1/n) [ ∑_{i=n-k+1}^{n-1} T(i) + ∑_{i=k}^{n-1} T(i) ]

The rest of the procedure requires < cn time.
Therefore, for n ≥ 2, T(n) ≤ cn + max_k { (1/n) [ ∑_{i=n-k+1}^{n-1} T(i) + ∑_{i=k}^{n-1} T(i) ] }

It is easy to show (by induction) that if T(1) ≤ c then, for all n ≥ 2, T(n) ≤ 4cn.

Basis:   n = 1.   T(1) ≤ c ≤ 4cn.

Inductive step:   We need to prove that T(n) ≤ 4cn (if n > 1), assuming that, for all k < n, T(k) ≤ 4ck.

T(n) ≤ cn + max_k { (1/n) [ ∑_{i=n-k+1}^{n-1} T(i) + ∑_{i=k}^{n-1} T(i) ] }
        ≤ cn + (4c/n) max_k { ∑_{i=n-k+1}^{n-1} i + ∑_{i=k}^{n-1} i }
        ≤ cn + (4c/n) max_k { n^2/2 - 3n/2 + k(n-k+1) }
        ( which is maximized at k = (n+1)/2 )
        ≤ cn + (4c/n) ( ¾ n^2 - n + ¼ )
        ≤ cn + 3cn + ( -4c + c/n )
        (the parenthesized expression is < 0 since n ≥ 2 )
        ≤ 4cn
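
As a sanity check on the induction, the recurrence can be tabulated numerically. The sketch below assumes the recurrence holds with equality, sets T(1) = c, and uses exact rational arithmetic; the function name expected_cost_bound and the choice c = 1 are illustrative assumptions.

```python
from fractions import Fraction

def expected_cost_bound(n_max, c=1):
    """Tabulate T(n) when the recurrence is taken with equality and T(1) = c."""
    T = [Fraction(0), Fraction(c)]          # T[0] is unused padding
    prefix = [Fraction(0), Fraction(c)]     # prefix[m] = T(1) + ... + T(m)
    for n in range(2, n_max + 1):
        best = Fraction(0)
        for k in range(1, n + 1):
            first = prefix[n - 1] - prefix[n - k]    # sum of T(i), i = n-k+1 .. n-1
            second = prefix[n - 1] - prefix[k - 1]   # sum of T(i), i = k .. n-1
            best = max(best, first + second)
        T.append(c * n + best / n)
        prefix.append(prefix[-1] + T[-1])
    return T
```

With c = 1, the table gives T(2) = 5/2, and every entry stays at or below 4n, in agreement with the bound T(n) ≤ 4cn.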

Alternate analysis of average-case time complexity

Let a be the ith smallest, and call i lucky if i is in the middle half   (¼ n   ≤   i   ≤   ¾ n).
i is lucky implies that the new subproblem has size at most  ¾ n.
i is lucky half the time.

Lemma: On average, 2 splits are needed before getting a lucky split.
Proof: Let E be the expected number of splits needed. We always need at least 1 split.
Half the time we're done, and half the time we must continue, needing E more splits on average.
Thus, E = 1 + ½E, and therefore E = 2.
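
The lemma can also be checked empirically. The simulation below is only a sketch: it draws the rank i uniformly from 1..n and counts draws until i falls in the middle half. The function name splits_until_lucky, the value n = 1000, and the number of trials are assumptions chosen for illustration.

```python
import random

def splits_until_lucky(n, rng):
    """Draw pivot ranks uniformly from 1..n until one lands in the middle half."""
    count = 0
    while True:
        count += 1
        i = rng.randint(1, n)              # rank of the randomly chosen element
        if n / 4 <= i <= 3 * n / 4:        # lucky: i is in the middle half
            return count

rng = random.Random(0)                     # fixed seed so runs are repeatable
trials = [splits_until_lucky(1000, rng) for _ in range(20000)]
average = sum(trials) / len(trials)
```

The observed average should be close to E = 2, with small deviations due to sampling noise and to the middle half containing 501 of the 1000 ranks rather than exactly half.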

So, after 2 splits on average, the subproblem is at most  ¾ the original size.
T(n) ≤ T(¾ n) + avg time to reduce problem size to at most  ¾ n
T(n) ≤ T(¾ n) + 2cn
T(n) ≤ 8cn satisfies this constraint, since 8c(¾ n) + 2cn = 6cn + 2cn = 8cn.

Linear-worst-case-time algorithm

Similar to previous algorithm.   The difference is that we expend a little effort in picking the element used to partition set S.   We pick an element in such a way as to guarantee that S will be split reasonably evenly.

PICK( a, S)
    Divide S into floor( |S|/5 ) sets of 5 elements each
                  plus 1 "leftover" set having between 0 and 4 elements
    Sort each 5-element set
    Let M be the set of medians of the 5-element sets
    a := SELECT( ceil( |M|/2 ), M)   i.e., the median of M

Also, if |S| < 50 then SELECT will sort S and return the kth smallest (in the previous algorithm, this occurred only if |S| = 1)
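
PICK and the modified SELECT can be sketched together in Python. This is an illustrative sketch under stated assumptions, not a tuned implementation: the names pick and select_linear are hypothetical, k is 1-indexed and assumed to satisfy 1 ≤ k ≤ |S|, and the leftover set of 0 to 4 elements is ignored when forming M, as in the description above.

```python
def pick(items):
    """Return the median of the medians of the full 5-element groups of items."""
    full = len(items) - len(items) % 5            # ignore the 0..4 leftover elements
    medians = [sorted(items[j:j + 5])[2] for j in range(0, full, 5)]
    return select_linear((len(medians) + 1) // 2, medians)   # ceil(|M|/2)

def select_linear(k, items):
    """Return the k-th smallest element (1-indexed); assumes 1 <= k <= len(items)."""
    items = list(items)
    if len(items) < 50:
        return sorted(items)[k - 1]               # small case: sort directly
    a = pick(items)
    s1 = [x for x in items if x < a]
    s2 = [x for x in items if x == a]
    s3 = [x for x in items if x > a]
    if len(s1) >= k:
        return select_linear(k, s1)
    if len(s1) + len(s2) >= k:
        return a
    return select_linear(k - len(s1) - len(s2), s3)
```

The recursion depth is only logarithmic in |S|, because each recursive call works on at most a constant fraction of the elements, so plain recursion is adequate here.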

Time analysis

T(n) = time to sort 5-element sets and get M [ = cn/5 ]
        + evaluate a = T(|M|) [ = T(n/5) ]
        + partition S [ = n ]
        + recursive call [ = T(|S1|) or T(|S3|) ]

The method of picking a ensures that more than ¼ of the "in-play" elements of S are ≤ a and also that more than ¼ of the "in-play" elements of S are ≥ a.

More formally, since |M| = floor( n/5 ), at least floor( n/10 ) of the elements of M are ≥ a and, for each of these, its 5-element set contains 2 additional distinct elements of S that are at least as large.   Therefore, |S1| < n - 3 floor(n/10) which, for n ≥ 50, is < 3n/4.   The same can be shown for S3.
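
The role of the threshold 50 in this bound can be checked directly. The short sketch below tests the inequality n - 3 floor(n/10) < 3n/4 in integer arithmetic; the helper names bound_on_s1 and holds are assumptions for illustration.

```python
def bound_on_s1(n):
    """The upper bound n - 3*floor(n/10) on |S1| used in the analysis."""
    return n - 3 * (n // 10)

def holds(n):
    """True when n - 3*floor(n/10) < 3n/4, compared without fractions."""
    return 4 * bound_on_s1(n) < 3 * n

# Largest n below 1000 for which the inequality fails.
largest_failure = max(n for n in range(1, 1000) if not holds(n))
```

Writing n = 10q + r with 0 ≤ r ≤ 9, the inequality reduces to r < 2q, which always holds once q ≥ 5. The largest n for which it fails is 49, which is consistent with sorting directly when |S| < 50.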

And so, each of T(|S1|) and T(|S3|) is ≤ T(3n/4).

Therefore, there exists a constant c such that
T(n) ≤ cn, for n < 50 and
T(n) ≤ T(n/5) + T(3n/4) + cn, for n ≥ 50.

It is easy to show by induction that T(n) ≤ 20cn = O(n).

Proof by strong induction on n.

Basis:   for n < 50, T(n) ≤ cn ≤ 20cn.

Inductive step:   We need to prove that T(n) ≤ 20cn (if n ≥ 50), assuming that, for all k < n, T(k) ≤ 20ck.

T(n) ≤ T(n/5) + T(3n/4) + cn   (given)
        ≤ 20c(n/5) + 20c(3n/4) + cn
        ≤ 4cn + 15cn + cn
        ≤ 20cn
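
The bound can also be checked numerically. The sketch below assumes the recurrence holds with equality, rounds the subproblem sizes n/5 and 3n/4 down to integers, and fixes c = 1; the names T and C are illustrative assumptions.

```python
from functools import lru_cache

C = 1  # the constant c from the recurrence, fixed to 1 for illustration

@lru_cache(maxsize=None)
def T(n):
    """Cost from T(n) = T(n/5) + T(3n/4) + cn, with T(n) = cn for n < 50."""
    if n < 50:
        return C * n
    return T(n // 5) + T(3 * n // 4) + C * n
```

For instance, T(50) = T(10) + T(37) + 50 = 97, well under 20 · 50 = 1000. The slack in the bound comes from 1/5 + 3/4 = 19/20 being strictly less than 1.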


Dan Hirschberg
Last modified: Jan 24, 2007