Binary search


Binary search is the term used in computer science for the discrete version of the bisection method, in which, given a monotone function f over a discrete interval I and a target value c, one wishes to find x \in I for which f(x) \approx c. It is most often used to locate a value in a sorted array. The term may also be used for the continuous version.

Informal introduction: the guessing game

In the number guessing game, there are two players, Alice and Bob. Alice announces that she has thought of an integer between 1 and 100 (inclusive), and challenges Bob to guess it. Bob may guess any number from 1 to 100, and Alice will tell him whether his guess is correct; if it is not correct, she will tell him (truthfully) either that her number is higher than his guess, or that it is lower.

One possible strategy that Bob can employ that guarantees he will eventually win is simply guessing every integer from 1 to 100 in turn. First Bob says "1" and (for example) Alice replies "Higher". Then Bob says "2" and again Alice replies "Higher", and so on. Eventually Bob says "40" and Alice says "Correct", and the game is over. While this strategy is simple and correct, it may require up to 100 guesses to finish the game.

However, there is a much cleverer strategy. Initially, Bob knows the number lies somewhere between 1 and 100. He guesses a number close to the middle of this range; say, 50. Alice replies, "Lower". Now Bob knows the number lies in the range 1 to 49, so again he picks a number close to the middle of the range, say, 25. Alice replies, "Higher", and the range has now been narrowed to 26 to 49. In this way, in every step, Bob will narrow the range to half its previous size. Eventually, either Bob will guess the number correctly, or there will only be one number left in the range, which must be Alice's number. Furthermore, this strategy will take at most 7 steps, because 100 halved 7 times is less than one. In general, the number of steps it takes Bob to find an integer between 1 and N is at most about \log_2 N, whereas the naive "linear search" strategy introduced earlier may require up to N steps. We see that the bisection strategy, or binary search, is a vast improvement over the linear search.
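
Bob's strategy translates directly into a loop. Below is a minimal C sketch; the variable secret stands in for Alice's hidden number, and comparisons against it play the role of her replies (these names are illustrative, not part of the game itself):

#include <stdio.h>

/* Simulate Bob's halving strategy; returns the number of guesses used. */
int play(int secret)
{
    int lo = 1, hi = 100, guesses = 0;
    while (lo < hi)
    {
        int mid = (lo + hi) / 2;    /* Guess near the middle of the range. */
        guesses++;
        if (mid == secret)          /* "Correct" */
            return guesses;
        else if (mid < secret)      /* "Higher": discard the lower half. */
            lo = mid + 1;
        else                        /* "Lower": discard the upper half. */
            hi = mid - 1;
    }
    return guesses + 1;             /* Only one candidate left: guess it. */
}

int main(void)
{
    printf("%d\n", play(40));       /* Finds 40 in at most 7 guesses. */
    return 0;
}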

Furthermore, binary search is the best possible strategy at Bob's disposal. This is because Bob asks a series of questions and gets a reply of "Higher" or "Lower" to each one except the last (which is always "Correct"), and based on this information alone must determine the number; so the number of distinct answers Bob can distinguish is at most the number of such reply sequences, which is about 2^k if he is limited to k questions. Hence no strategy that asks fewer than \log_2 N questions can guarantee identifying all N possible answers; with N = 100, for example, at least \lceil \log_2 100 \rceil = 7 questions are needed in the worst case.

Finding an element in an array

In the most commonly encountered variation of binary search, we are given a zero-indexed array A of size n that is sorted in ascending order, along with some value c. There are two slightly different variations on what we wish to do:

  • Lower bound: Here are two equivalent formulations:
  1. Find the first index into A where we could insert a new element of value c without violating the ascending order.
  2. If there is an element in A with value greater than or equal to c, return the index of the first such element; otherwise, return n. In other words, return \min\{i\,|\,A_i \geq c\}, or n if that set is empty.
  • Upper bound:
  1. Find the last index into A where we could insert a new element of value c without violating the ascending order.
  2. If there is an element in A with value strictly greater than c, return the index of the first such element; otherwise, return n. In other words, return \min\{i\,|\,A_i > c\}, or n if that set is empty.

The following properties then hold:

  • If there is no i such that A_i = c, then the lower bound and upper bound will be the same; they will both point to the first element in A that is greater than c (or both will equal n, if there is no such element).
  • If c is larger than all elements in A, then the lower and upper bounds will both be n.
  • If there is at least one element of A with value c, then the lower bound will be the index of the first such element, and the upper bound will be the index of the element just after the last such element. Hence [lower bound, upper bound) is a half-open interval that represents the range of indices into A that refer to elements with value c.
  • The element immediately preceding the lower bound is the last element with value less than c, if this exists, or the hypothetical element at index -1, otherwise.
  • The element immediately preceding the upper bound is the last element with value less than or equal to c, if this exists, or the hypothetical element at index -1, otherwise.
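
For example, if A = \{1, 3, 3, 5\} (so n = 4), then for c = 3 the lower bound is 1 and the upper bound is 3; for c = 4, both bounds are 3, the index of the first element greater than c; and for c = 6, both bounds are n = 4.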

Of course, sometimes you only care whether the exact value c you are seeking can be found in the array at all; but this is just a special case of lower bound.

Both lower bound and upper bound can easily be found by iterating through the array from beginning to end, which takes O(n) comparisons and hence O(n) time if the elements of the array are simple scalar variables that take O(1) time to compare. However, because the array is sorted, we can do better than this. We begin with the knowledge that the index we seek might be anywhere in the array, or it might be "just past the end". Now, examine an element near the centre of the array. If it is less than c, the index we seek must lie in the second half of the array; if it is greater than or equal to c (strictly greater, in the case of the upper bound), the index we seek must lie in the first half, up to and including the examined position. Thus, by making one comparison, we cut the range under consideration in half. It follows that after O(\log n) comparisons, we will be able to locate either the lower bound or the upper bound, as desired.

Implementation

However, implementation tends to be tricky, and even experienced programmers can introduce bugs when writing binary search. Correct reference implementations are given below in C:

/* Returns min{i | A[i] >= c}, or n if no such index exists.
   n is the size of the int[] array A, which must be sorted ascending. */
int lower_bound(int A[], int n, int c)
{
    int l = 0;
    int r = n;               /* The answer always lies in [l, r]. */
    while (l < r)
    {
        int m = (l+r)/2;     /* Rounds toward zero. */
        if (A[m] < c)
            l = m+1;         /* The answer lies in [m+1, r]. */
        else
            r = m;           /* The answer lies in [l, m]. */
    }
    return l;
}
 
/* Returns min{i | A[i] > c}, or n if no such index exists. */
int upper_bound(int A[], int n, int c)
{
    int l = 0;
    int r = n;
    while (l < r)
    {
        int m = (l+r)/2;
        if (A[m] <= c)       /* Note use of <= instead of <. */
            l = m+1;
        else
            r = m;
    }
    return l;
}
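
As a quick sanity check, the following driver (an illustrative addition, not part of the reference code) exercises both functions on a small sorted array:

#include <stdio.h>

int main(void)
{
    int A[] = {1, 3, 3, 5};
    int n = 4;
    printf("%d %d\n", lower_bound(A, n, 3), upper_bound(A, n, 3)); /* 1 3 */
    printf("%d %d\n", lower_bound(A, n, 4), upper_bound(A, n, 4)); /* 3 3 */
    printf("%d %d\n", lower_bound(A, n, 6), upper_bound(A, n, 6)); /* 4 4 */
    return 0;
}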

Proof of correctness

We prove that lower_bound above is correct. The proof for upper_bound is very similar.

We claim the invariant that, immediately before and after each iteration of the while loop:

  1. The index we are seeking is at least l.
  2. The index we are seeking is at most r.

This is certainly true immediately before the first iteration of the loop, where l = 0 and r = n. Now we claim that if it is true before a given iteration of the loop, it will be true after that iteration, which, by induction, implies our claim. First, the value m will be exactly the arithmetic mean of l and r, if their sum is even, or it will be 0.5 less than the mean, if their sum is odd, since m is computed as the arithmetic mean rounded down. This implies that, in any given iteration of the loop, m's initial value will be at least l, and strictly less than r. Now:

  • Case 1: A[m] < c. Because the array is sorted, no element with index between l and m, inclusive, can be greater than or equal to c. But we know that the index we seek lies in the current range delineated by indices l and r (possibly denoting the hypothetical element just past the end of the array), so we conclude that it must be in the second half of the range, with indices from m+1 to r, inclusive. In this case we set l = m+1 and at the end of the loop the invariant is again satisfied. This is guaranteed to increase the value of l, since m starts out with value at least l. It also does not increase l beyond r, since m is initially strictly less than r.
  • Case 2: A[m] >= c. This means that there exists an element with index between l and m, inclusive, which is at least c. This implies that the first such element is somewhere in this half of the range, because the array is sorted. Therefore, after setting r = m, the range of indices between l and r, inclusive, will still contain the index we are seeking, and hence the invariant is again satisfied. This is guaranteed to decrease the value of r, since m starts out strictly less than r; it also will not decrease it beyond l, since m starts out at least as large as l.

Since each iteration of the loop either increases l or decreases r, the loop will eventually terminate with l = r. Since the invariant still holds, this unique common value must be the solution. _{\blacksquare}

Carefully proving the O(\log n) time bound and showing the correctness of upper_bound are left as exercises to the reader.

Possible bugs

As the list below suggests, the programmer has at least six opportunities to err.

  1. Setting r = n-1 initially. Doing this will prevent the value n from ever being returned, which renders the implementation correct whenever the answer is less than n, but incorrect whenever it is n. (It will simply return n-1 in this case.)
  2. Using while (l <= r) instead of while (l < r). Once l and r become equal, the loop continues to execute: if l = r = n, the nonexistent element A[n] is accessed; otherwise, the invariant guarantees A[l] \geq c, so the else branch is taken, r is set to m = l, and the loop repeats forever.
  3. Comparing A[m] with c in the wrong direction, that is, writing A[m] > c, A[m] >= c, c < A[m], or c <= A[m], or reversing the order of the two branches. This would not correspond to the theory, and would almost always give an incorrect result. Note that simultaneously logically negating the comparison and reversing the order of the branches, however, will not give an error, but will instead give logically equivalent code.
  4. Computing m as the ceiling of the arithmetic mean, instead of the floor. The obvious error condition here is that when l == n-1 and r == n, the value of m computed will be n, and we will then attempt to examine a nonexistent element in the array, which could cause the program to crash. It can also cause an infinite loop: whenever r == l+1, we get m == r, and if the else branch is then taken, setting r = m makes no progress.
  5. Setting l = m or l = m-1 instead of l = m+1. Either variation lends itself to an error condition in which m comes out to be equal to either l or l+1, respectively, and the first branch is taken, which causes an infinite loop in which the first branch is always taken.
  6. Setting r = m+1 or r = m-1 instead of r = m. In the former case, we could have an infinite loop in which the second branch is always taken, if the difference between l and r is 1 or 2. In the latter case, it is possible for r to become less than l; worse, the element at index m is excluded from the range even though it might be the very index we seek, violating the invariant.

And, of course, one can easily mix up the upper_bound and lower_bound implementations, giving a seventh opportunity to err.

Additionally, if the array can be very large, then m should be computed as l + (r-l)/2, instead of (l+r)/2, in order to avoid possible arithmetic overflow.
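
For instance, the midpoint line in lower_bound would become:

    int m = l + (r-l)/2;     /* Same value as (l+r)/2 when 0 <= l <= r,
                                but the sum can no longer overflow. */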

Inverse function computation

Both the number guessing game and the array search can be expressed as specific cases of the more general problem of discrete inverse function computation, hinted at in the lead section of this article.

We suppose:

  1. We are given some (straightforward to compute) function f from a discrete totally ordered domain (such as the integers from the interval [0, n)) to any totally ordered codomain (such as the real numbers);
  2. f is a monotonically increasing function, so that a > b implies f(a) \geq f(b);
  3. We are given some value c in f's codomain. (On the other hand, c may or may not be in f's range.)

Here,

  • The lower bound is the smallest argument x such that f(x) \geq c, if any exists.
  • The upper bound is the smallest argument x such that f(x) > c, if any exists.

Sometimes there is exactly one x such that f(x) = c. This value can properly be denoted x = f^{-1}(c); finding the value x is computing the inverse of f at c. Other times, there will be multiple such values, or there will be no exact solution at all, only an x such that f(x) \approx c; in any case, binary search will find these values for us.

The approach mirrors that of the previous two sections. Initially we only know that x lies somewhere in the domain of f, so we guess a value that roughly splits this range in half; by evaluating f at this point and comparing the result with c, we can discard half the range and repeat as many times as necessary (about O(\log |D|) times, where D is the domain).
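
To make this concrete, here is a C sketch of the lower-bound search applied directly to a monotone function rather than an array; the function-pointer interface and the name lower_bound_fn are our own illustrative choices:

/* Returns the smallest x in [0, n) with f(x) >= c, or n if there is none.
   f must be monotonically increasing on [0, n). */
long long lower_bound_fn(long long (*f)(long long), long long n, long long c)
{
    long long l = 0;
    long long r = n;
    while (l < r)
    {
        long long m = l + (r-l)/2;
        if (f(m) < c)
            l = m+1;
        else
            r = m;
    }
    return l;
}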

The number guessing game is a special case of this version of the problem. One possible way to formulate it is that f:[0, n) \to \mathbb{Z} is defined as f(x) = 0 for some x (the secret number), f(x) = -1 for all smaller numbers, and f(x) = 1 for all larger numbers; finding the secret number is just a matter of doing a binary search on c = 0.

The array search problem is also a special case. Here we simply define f(x) = A_x. In fact, the two problems are in some sense equivalent, in that a function with a discrete totally ordered domain is isomorphic to an array; the distinction will be lost in any programming language that implements both lazy evaluation and memoization.

More generally, binary search is often very useful for problems that read something like "find the maximum x such that P(x) holds". For example, in Trucking Troubles, the problem essentially comes down to finding the maximum x such that the edges of the given graph of weight at least x span the graph. This problem can be solved using Prim's algorithm to construct the "maximum spanning tree" (really, the minimum spanning tree using negated edge weights), but binary search provides a simpler approach. We know that x lies somewhere between the smallest edge weight in the graph and the largest, so we start by guessing a value roughly halfway in between. Then we use depth-first search to test whether the graph is connected using only the edges of weight greater than or equal to our guess. If it is, then the true value of x is greater than or equal to our guess; if not, then it is less. This is repeated as many times as necessary to determine the correct value of x, as in the sketch below.
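
The following C sketch illustrates the idea under some assumptions of our own that are not taken from the problem statement: the graph is stored in an adjacency matrix w with positive integer weights (0 meaning no edge), and connectivity is tested with a recursive depth-first search.

#define MAXN 1000

int nv;                     /* Number of vertices. */
int w[MAXN][MAXN];          /* w[u][v] = edge weight; 0 means no edge. */
int visited[MAXN];

/* Depth-first search using only edges of weight at least x. */
void dfs(int u, int x)
{
    int v;
    visited[u] = 1;
    for (v = 0; v < nv; v++)
        if (!visited[v] && w[u][v] >= x)
            dfs(v, x);
}

/* Do the edges of weight at least x span the graph? */
int spans(int x)
{
    int v;
    for (v = 0; v < nv; v++)
        visited[v] = 0;
    dfs(0, x);
    for (v = 0; v < nv; v++)
        if (!visited[v])
            return 0;
    return 1;
}

/* Binary search for the maximum x in [lo, hi] such that spans(x), where
   lo and hi are the smallest and largest edge weights in the graph. */
int max_weight(int lo, int hi)
{
    while (lo < hi)
    {
        int m = lo + (hi-lo+1)/2;   /* Round up so that m > lo. */
        if (spans(m))
            lo = m;                 /* x is at least m. */
        else
            hi = m-1;               /* x is less than m. */
    }
    return lo;
}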

Binary search can also be used to find approximate solutions to equations. For example, suppose we have c blocks, and we wish to use as many of them as possible to construct a square-based pyramid. A square-based pyramid has an apex consisting of a single block. The next layer under it is a 2 block by 2 block square (so 4 blocks in all), then there is a layer with 9 blocks, and so on down. How many layers can we construct using the blocks given?

It can be shown that a square-based pyramid with n levels requires f(n) = \frac{1}{6}(2n^3 + 3n^2 + n) blocks. This polynomial is increasing on [0, \infty). But of course, as a trivial upper bound, we cannot construct more than c levels using c blocks. So, by using binary search, we can find the maximum x \in [0, c] such that \frac{1}{6}(2x^3 + 3x^2 + x) \leq c. (More efficient solutions are accessible with higher mathematics.)
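
A direct C sketch of this search follows; the names f and max_layers are ours, and for simplicity we assume c is small enough that 2x^3 + 3x^2 + x does not overflow a long long.

/* Number of blocks in a pyramid with n levels. */
long long f(long long n)
{
    return (2*n*n*n + 3*n*n + n) / 6;
}

/* The maximum number of complete levels buildable from c blocks. */
long long max_layers(long long c)
{
    long long lo = 0;
    long long hi = c;                    /* Trivial upper bound. */
    while (lo < hi)
    {
        long long m = lo + (hi-lo+1)/2;  /* Round up so that m > lo. */
        if (f(m) <= c)
            lo = m;
        else
            hi = m-1;
    }
    return lo;
}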

Continuous

The continuous version of the problem is like the number guessing game applied to real numbers: Alice thinks of a real number between 0 and 1, and Bob has to guess it. Bob begins by guessing 0.5, and so on. Clearly, in this case, no matter how many questions Bob asks, there will still be an infinite number of possibilities, though he can get greater and greater accuracy by asking more and more questions. (For example, after 10 questions, he might have narrowed down the range to (0.1015625, 0.1025390625)). However, in computer science, we are often satisfied if we can approximate the correct answer to arbitrary precision, rather than determine it exactly outright. (After all, the native real types of most microprocessors (see IEEE 754) can only store a fixed number of bits of precision, anyway.) Here, if, for example, Bob wants to determine Alice's number to an accuracy of one part in 10^9, then he has to ask about 30 questions, because the length of the original interval is 1, and this must be reduced by a factor of 10^9, which is about 2^{30}. In general, the time taken will be approximately \log_2\frac{\ell}{\epsilon_a}, where \ell is the length of the original interval and \epsilon_a is the desired precision (expressed as absolute error).

More formally, we have a monotonically increasing function f:A \to B where A is usually an interval of the real line and B is some totally ordered set, and we know that there is some x_0 \in A such that f(x_0) = c \in B. (Such a guarantee can be provided by the intermediate value theorem, if f is a continuous function.) Our objective is to find x such that |x - x_0| < \epsilon, that is, to approximate f^{-1}(c) to some desired precision.
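
As a sketch, the continuous search might look as follows in C; the function invert and the cube-root example are illustrations of ours, not drawn from the text above.

/* Approximate the x in [lo, hi] with f(x) = c, assuming f is monotonically
   increasing on [lo, hi] and takes the value c somewhere in it. */
double invert(double (*f)(double), double lo, double hi, double c, double eps)
{
    while (hi - lo > eps)
    {
        double mid = (lo + hi) / 2;
        if (f(mid) < c)
            lo = mid;       /* The solution lies in [mid, hi]. */
        else
            hi = mid;       /* The solution lies in [lo, mid]. */
    }
    return (lo + hi) / 2;
}

double cube(double x) { return x*x*x; }

/* invert(cube, 0, 2, 2, 1e-9) approximates the cube root of 2. */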

This is a very general method for computing inverses of monotone functions. More specific (and faster) methods exist for particular classes of functions (such as Newton's method, the Lagrange inversion formula, or integral representations that can be computed using Gaussian quadrature); however, binary search is expedient and efficient when the precision required is not too high, and it can be used for nearly all functions, because chances are they will be monotone on some interval around f^{-1}(c). The exceptions occur when the function oscillates wildly around that point (such functions are usually "pathological", and occur only in discussions of mathematical theory and not in practice), or when the function has a local extremum at the point sought (here, ternary search may be used instead).

An example problem of a different type is Secret Room. This can be solved by repeatedly guessing some radius r and trying to figure out whether a circle of radius r can be placed somewhere on the grid without crossing any lasers; if so, then the maximum radius is at least r, and otherwise, it is less. (However, determining whether placing the circle is possible is a nontrivial problem in computational geometry.)

Problems

For some of these problems, binary search might not yield an optimal solution, or even get all the points; but the solution it gives far outperforms the naive algorithm.