David MacKay

Information Theory, Inference, and Learning Algorithms




Randomized Algorithms

Consider the task of sorting N objects (e.g., numbers). A possible randomized algorithm works as follows.

  1. Pick one of the objects at random, and pretend that it is the median of the set. Compare all (N-1) others with it, thus dividing the others into two sets of size A and B. Certainly A+B = N-1; if we got lucky, A ~= N/2. Maybe, typically (in some sense), A ~= N/3 and B ~= 2N/3, or vice versa.
  2. Repeat step 1 on each of the two sets, recursing until every set has size 1 or 0.
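The two steps above can be sketched in Python (a minimal illustration of mine, not code from the text; the function name and the (sorted list, comparison count) return convention are my own):

```python
import random

def randomized_sort(items):
    """Sort with the randomized algorithm above.
    Returns (sorted list, number of comparisons used)."""
    if len(items) <= 1:
        return list(items), 0
    i = random.randrange(len(items))        # step 1: pick a random "pseudo-median"
    pivot = items[i]
    rest = items[:i] + items[i + 1:]
    left = [x for x in rest if x < pivot]   # the N-1 comparisons with the pivot
    right = [x for x in rest if x >= pivot] # (counted once per element, as in the text)
    sorted_left, c_left = randomized_sort(left)
    sorted_right, c_right = randomized_sort(right)
    return sorted_left + [pivot] + sorted_right, len(rest) + c_left + c_right

data = random.sample(range(100000), 1000)
out, comparisons = randomized_sort(data)
assert out == sorted(data)
```

For N = 1000 distinct items the comparison count typically lands between the balanced-tree minimum N log2 N and the asymptotic cost derived below.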

If we get lucky at every step (i.e., we hit the true median every time) then the number of comparisons of objects with each other will be pretty accurately

Cmin = N log2 N.
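As a quick sanity check (a sketch of mine, not from the text), we can count comparisons in the lucky case where every pivot really is the median:

```python
import math

def balanced_comparisons(N):
    """Comparisons used when every pivot is the true median:
    N-1 comparisons to split, then recurse on the two halves."""
    if N <= 1:
        return 0
    A = (N - 1) // 2
    B = (N - 1) - A
    return (N - 1) + balanced_comparisons(A) + balanced_comparisons(B)

N = 1023
print(balanced_comparisons(N), N * math.log2(N))
```

The count agrees with N log2 N to leading order; the next-order correction is proportional to N.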

Failing to hit the median means that our binary tree will be lopsided, and more comparisons will be needed. How many comparisons?

We still expect something that scales as N ln N, but the base of the logarithm will be different. The "2" in the "log2" in

Cmin(N) = N log2 N
is there because the perfect median choice makes a balanced binary tree, and log2 N is the depth of that tree. The intuition is: When we make an unbalanced tree, the "2" changes to a number smaller than 2; as we'll see, it turns out to be roughly 1.6.
Crandomized ~= N log1.6 N

There are two rather different ways to prove this result.

Recurrence relation proof

Let the average cost, in comparisons, of sorting N items by the randomized algorithm be T(N). Then, considering the first step, with the pseudo-median splitting the set into fractions u and (1-u), where u is uniformly distributed between 0 and 1,

T(N) ~= N + < T(uN) > + < T((1-u)N) >
where < T((1-u)N) > denotes the average of T((1-u)N), and we've neglected the distinction between N and N-1.

We need to solve this recurrence relation, with boundary condition T(1)=T(0)=0. Let's guess that

T(N) = a N ln N
and solve for a. Substituting the guess into the recurrence, and using T(uN) = a uN (ln u + ln N), gives
T(N) ~= N + a N ln N + a N < u ln u + (1-u) ln(1-u) > = N + T(N) - a N < H_2^(e)(u) >,
where H_2^(e)(u) = -u ln u - (1-u) ln(1-u) is the binary entropy measured in nats. For consistency this gives
1 = a < H_2^(e)(u) >.
Carrying out the integral gives < H_2^(e)(u) > = 1/2, so we find the neat result:
a = 2.
So the base of the logarithm - the number formerly known as `1.6' - is in fact sqrt(e).
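The integral < H_2^(e)(u) > = Int_0^1 -[u ln u + (1-u) ln(1-u)] du can be checked numerically (a sketch of mine, using a simple midpoint rule):

```python
import math

def entropy_nats(u):
    """Binary entropy H_2^(e)(u), measured in nats."""
    return -(u * math.log(u) + (1 - u) * math.log(1 - u))

# Midpoint-rule estimate of < H_2^(e)(u) > for u uniform on (0, 1)
n = 100_000
mean_H = sum(entropy_nats((k + 0.5) / n) for k in range(n)) / n

a = 1 / mean_H            # the consistency condition 1 = a < H_2^(e)(u) >
base = math.exp(1 / a)    # a N ln N = N log_base N, with base = e^(1/a)
print(mean_H, a, base)    # ≈ 0.5, 2.0, sqrt(e) ≈ 1.6487
```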

`Means add' proof

Another neat derivation of the cost of the randomized sort algorithm, which gives little insight into the correct answer, but generates it quite easily, runs as follows.

Imagine that the numbers are already sorted, and consider two numbers whose ranks in the list of size $N$ are $m$ and $n$, with $m < n$. When we apply the algorithm, what is the chance $c_{mn}$ that $m$ will, at some point along the way, get compared with $n$? Well, if any of the $n-m-1$ numbers between $m$ and $n$ gets selected before both $m$ and $n$, then the subsequent comparisons will separate $m$ and $n$ into separate sets, so that they never get compared with each other. On the other hand, if one of $m$ and $n$ is selected before any of those $n-m-1$ numbers, then $m$ and $n$ will instantly be compared with each other in the next comparison step. Thus

$c_{mn} = \frac{2}{n-m+1}$;
and the expected number of comparisons is the sum over all pairs,
$C(N) = \sum_{m,n:\, m<n} \frac{2}{n-m+1}$.
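This double sum can be evaluated directly by grouping pairs by their rank separation d = n - m, since there are N - d pairs at separation d (a small check of mine):

```python
import math

def expected_comparisons(N):
    """C(N) = sum over pairs m<n of 2/(n-m+1), grouped by d = n-m:
    there are N-d pairs at each separation d = 1, ..., N-1."""
    return sum((N - d) * 2.0 / (d + 1) for d in range(1, N))

N = 1000
print(expected_comparisons(N), 2 * N * math.log(N))
```

The sum works out to 2(N+1)H_N - 4N, where H_N is the Nth harmonic number; to leading order this is 2 N ln N, consistent with a = 2 from the recurrence relation proof.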


Site last modified Sun Aug 31 18:51:08 BST 2014