Asymptotic analysis

In theoretical computer science, asymptotic analysis is the most frequently used technique to quantify the performance of an algorithm. Its name refers to the fact that this form of analysis neglects the exact amount of time or memory that the algorithm uses on specific cases, but is concerned only with the algorithm's asymptotic behaviour—that is, how the algorithm performs in the limit of infinite problem size.

Irrelevance of constant factor

It does this by neglecting the so-called invisible constant factor. The reasoning is as follows. Suppose Alice and Bob each write a program to solve the same task. Whenever Alice's program and Bob's program are executed with the same input on a given machine (which we'll call ALOPEX), Alice's program is always about 25% faster than Bob's program. The table below shows the number of lines of input in the first column, the execution time of Alice's program in the second, and the execution time of Bob's program in the third:

N Alice Bob
10^4 0.026 0.032
10^5 0.36 0.45
10^6 3.8 4.7

Practically speaking, Alice's program is faster. In theory, however, the two programs have equally efficient algorithms. This is because an algorithm is an abstraction independent of the machine on which it runs, and we could make Bob's program just as fast as Alice's by running it on a computer that is 25% faster than the original one.

If this argument sounds unconvincing, consider the following argument. Programs, as they run on real machines, are built up from a series of very simple instructions, each of which is executed in a period of time on the order of a nanosecond on a typical microcomputer. However, some instructions happen to be faster than others.

Now, suppose we have the following information:

  • On ALOPEX, the MULTIPLY instruction takes 1 nanosecond to run whereas the DIVIDE instruction takes 2 nanoseconds to run.
  • We have another machine, which we'll call KITSUNE, on which these two instructions both take 1.5 nanoseconds to run.
  • When Bob's program is given 150000 lines of input, it uses 1.0×10^8 multiplications and 3.0×10^8 divisions, whereas Alice's program uses 4.0×10^8 multiplications and 0.8×10^8 divisions on the same input.

Then we see that:

  1. On ALOPEX, Bob's program will require (1 ns)×(1.0×10^8) + (2 ns)×(3.0×10^8) = 0.70 s to complete all these instructions, whereas Alice's program will require (1 ns)×(4.0×10^8) + (2 ns)×(0.8×10^8) = 0.56 s; Alice's program is faster by 25%.
  2. On KITSUNE, Bob's program will require (1.5 ns)×(1.0×10^8) + (1.5 ns)×(3.0×10^8) = 0.60 s, whereas Alice's will require (1.5 ns)×(4.0×10^8) + (1.5 ns)×(0.8×10^8) = 0.72 s. Now Bob's program is 20% faster than Alice's.

Thus, we see that a program that is faster on one machine is not necessarily faster on another machine, so that comparing exact running times in order to decide which algorithm is faster is not meaningful.

The same can be said about memory usage. On a 32-bit machine, an int (integer data type in the C programming language) and an int* (a pointer to an int) each occupy 4 bytes of memory, whereas on a 64-bit machine, an int uses 4 bytes and an int* uses 8 bytes; so that an algorithm that uses more pointers than integers may use more memory on a 64-bit machine and less on a 32-bit machine when compared to a competing algorithm that uses more integers than pointers.
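
As a concrete (and platform-dependent) illustration, a minimal C++ program can report these sizes directly; this is only a sketch for checking the sizes discussed above:

    #include <cstdio>

    int main() {
        // On a typical 32-bit target both of these print 4; on a typical
        // 64-bit target an int is still 4 bytes while an int* is 8 bytes.
        std::printf("sizeof(int)  = %zu bytes\n", sizeof(int));
        std::printf("sizeof(int*) = %zu bytes\n", sizeof(int *));
        return 0;
    }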

Why the machine is now irrelevant

The preceding section should convince you that if two programs' running times differ by at most a constant factor, then their algorithms should be considered equally efficient, as we can make one or the other faster in practice just by designing our machine accordingly. Let's now consider a case in which one algorithm can actually be faster than another.

Suppose that Alice and Bob are now given a new problem to solve and their algorithms perform as follows on ALOPEX:

N Alice Bob
10^3 0.040 0.0013
10^4 0.38 0.13
10^5 3.7 12
10^6 37 1500

Observe that when we multiply the problem size by 10, Alice's program takes about 10 times longer, but Bob's program takes about 100 times longer. Thus, even though Bob's program is faster than Alice's program for small inputs (10000 lines or fewer), in the limit of infinite problem size, Alice's program will always be faster. Even if we run Bob's program on a supercomputer and Alice's program on a 386, for sufficiently large input, Alice's program will always finish first, since no matter how much of a disadvantage Alice's program has, it catches up more and more as we increase the problem size; it scales better than Bob's algorithm.

Also observe that Alice's algorithm has running time approximately proportional to the size of its input, whereas Bob's algorithm has running time approximately proportional to the square of the size of its input. We write that Alice's algorithm has running time \Theta(N) whereas Bob's has running time \Theta(N^2). Notice that we don't bother to write the constant factor, since this varies from machine to machine; we write only the part that tells us how the algorithm scales. Since N^2 grows much more quickly than N as N increases, we see that Bob's algorithm is slower than Alice's.
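
As an illustration of these two growth rates (and only an illustration; these are not Alice's or Bob's actual programs), compare a single loop with a pair of nested loops in C++; the function names are made up for this sketch:

    #include <vector>
    #include <cstddef>

    // Theta(N): one pass over the data, constant work per element.
    long long sumValues(const std::vector<int>& a) {
        long long s = 0;
        for (std::size_t i = 0; i < a.size(); i++) s += a[i];
        return s;
    }

    // Theta(N^2): every pair of elements is examined once.
    long long countEqualPairs(const std::vector<int>& a) {
        long long count = 0;
        for (std::size_t i = 0; i < a.size(); i++)
            for (std::size_t j = i + 1; j < a.size(); j++)
                if (a[i] == a[j]) count++;
        return count;
    }

Multiplying the input size by 10 multiplies the work done by sumValues by about 10, and the work done by countEqualPairs by about 100, matching the scaling pattern in the table above.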

Asymptotic optimality

An algorithm is said to be asymptotically optimal when it is the best-performing algorithm possible for a given problem, usually in terms of running time. There may be multiple asymptotically optimal algorithms for a given problem; for example, Tarjan's algorithm, Gabow's algorithm, and Kosaraju's algorithm are all asymptotically optimal for finding strongly connected components in a graph.

A program that uses an asymptotically optimal algorithm may not be the fastest (or least memory-hungry, etc.) program possible. For example, a program that uses Kosaraju's algorithm might be outperformed by a program that uses Tarjan's algorithm. However, it is possible to beat the performance of a program that uses an asymptotically optimal algorithm by at most a constant factor. Furthermore, if an algorithm is not asymptotically optimal, then, given large enough input size, it will always be outperformed by an asymptotically optimal algorithm.

Asymptotic complexity

The performance of an algorithm in the limit of infinite problem size is called its asymptotic complexity, or just complexity. This may be subdivided into space complexity, time complexity, and so on. The complexity of a problem is defined to be the complexity of the optimal algorithm that solves it. Problems may be divided into complexity classes based on their complexities.

Determining a problem's asymptotic complexity is important in theoretical computer science; there are many problems (such as matrix multiplication) whose complexities remain unknown. Often, a computer scientist will discover an algorithm for a problem and then show that it is asymptotically optimal; this is usually considered a landmark result in the study of that problem. More rarely, one will prove that an algorithm exists for a given problem, and show that it is asymptotically optimal, without actually giving the algorithm (nonconstructive proof).

Notation

We are now ready to give an explanation of the \Theta-notation used in the previous section. (See section Big theta notation below.) We suppose that functions f and g map the positive integers to the positive real numbers. The interpretation is that f(n) represents the amount of time (in some units) that it takes a certain algorithm to process an input of size n when run on a certain machine.

Big O notation

We say that f(n) \in O(g(n)) when g(n) is asymptotically at least as large as f(n). This is read f is big-O of g. Mathematically, this is defined to mean that there are some constants n_0, c such that c\cdot g(n) \geq f(n) for all n \geq n_0; to read it idiomatically, for sufficiently large n, f(n) is bounded by a constant times g(n).

This means that if we plotted f and g on the same set of axes, with n along the positive x-axis, then we can scale the curve g such that it always lies above the curve f.

For example, if we plot f(n) = 3n+1 and g(n) = 2n+1 on the same set of axes, then we see that by scaling g by a factor of \frac{3}{2}, we can make the curve of g lie above the curve of f. Therefore, 3n+1 \in O(2n+1).

Similarly, if we plot f(n) = 5n and g(n) = n^2 on the same set of axes, and we scale g by a factor of 5, we will make the curve of g lie above the curve of f. So 5n \in O(n^2).

On the other hand, if f(n) = n^2 and g(n) = 5n, no matter how much we scale g, we cannot make it lie above f. We could for example try scaling g by a factor of 100. Then, we will have f(n) \leq g(n) for all n up to 500, but for n > 500, f will be larger. If we try scaling g by a factor of 1000, then f will still catch up at n > 5000, and so on. So n^2 \notin O(5n).
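
The same conclusion follows directly from the definition. Given any proposed constants c and n_0, take n = \max(n_0, 5c + 1); then

n^2 = n \cdot n \geq (5c + 1)n > c \cdot 5n,

so no choice of constants makes c \cdot 5n an upper bound for all sufficiently large n, and hence n^2 \notin O(5n).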

Big O notation is the most frequently used notation for describing an algorithm's running time, since it provides an upper bound. If, for example, we write that an algorithm's running time is O(n^2), then this is a guarantee that, as n grows larger, the relative increase in the algorithm's running time is at most that of n^2, so that when we double n, our algorithm's running time will be at most quadrupled, and so on. This characterization is machine-independent; if an algorithm is O(n^2) on one machine then it will be O(n^2) on any other machine as well. (This holds true under the assumption that they are both classical von Neumann machines, which is the case for all computing machines on Earth right now.)

An algorithm that is O(n^k) for some constant k is said to be polynomial-time; its running time is bounded above by a polynomial in its input size. Polynomial-time algorithms are considered "efficient", because they provide the guarantee that as we double the problem size, the running time increases by a factor of at most 2^k. If an algorithm is not polynomial-time, it could be that doubling the problem size from 1000 to 2000 increases the running time by a factor of 10, but doubling the problem size from 10000 to 20000 increases the running time by a factor of 100, and the factor becomes worse and worse as the problem size increases; these algorithms are considered inefficient. The first polynomial-time algorithm discovered that solves a particular problem is usually considered a breakthrough.
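
The factor of 2^k can be seen directly from the bound: if the running time is at most c n^k for large n, then doubling the input size gives at most

c (2n)^k = 2^k \cdot c n^k,

that is, at most 2^k times the original bound.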

Big omega notation

Big omega notation is the opposite of big O notation. The statement that f(n) \in O(g(n)) is equivalent to the statement that g(n) \in \Omega(f(n)). Conceptually, it means that f(n) is asymptotically at least as large as g(n). Therefore, big omega notation provides a lower bound, instead of an upper bound. For example, if an algorithm's running time is \Omega(k^n) for some k > 1, then when we increase the problem size from n to n+c, the running time of our algorithm will increase by a factor of at least k^c. Approximately doubling the input size will then approximately square the running time. This scaling behaviour, called exponential time, is considered extremely inefficient.
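
The squaring claim follows from the form of the bound: if the running time grows like c k^n, then replacing n by 2n gives

c k^{2n} = c (k^n)^2,

which is the square of the old running time, up to a constant factor.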

Big theta notation

If f(n) \in O(g(n)) and f(n) \in \Omega(g(n)), then f and g differ by only a constant factor from each other, like the running time of Alice and Bob's algorithms from the first section. In this case, we write f(n) \in \Theta(g(n)) or equivalently g(n) \in \Theta(f(n)). Whereas big O notation gives us an upper bound on a function, such as an algorithm's running time as a function of its input size, and big omega notation gives us a lower bound, big theta notation gives us both simultaneously—it gives us an expectation of that function. This is often expressed as f is on the order of g.

Thus, if an algorithm's running time is \Theta(n^2), and we double the input size n, then we have an expectation that the running time of the algorithm will be nearly exactly four times what it was before, whereas big O notation would only tell us that it is not more than four times greater, and big omega notation that it is at least four times greater.

Little O notation

Whereas the big O notation f(n) \in O(g(n)) expresses that g grows at least as quickly as f in an asymptotic, constant-factor-oblivious sense, little O notation, as in f(n) \in o(g(n)), expresses that g grows strictly more quickly than f in the limit of infinite n. It means that \lim_{n\to\infty} \frac{f(n)}{g(n)} = 0. This notation is not used very often in computer science (simply because it is rarely useful).

Little omega notation

Little omega notation is the opposite of little O notation; the statement that f(n) \in o(g(n)) is equivalent to the statement that g(n) \in \omega(f(n)). Little omega notation, like little O notation, is rarely used in computer science.

Proper use of notation

In the five notations above, note the use of the \in symbol. The expression f(n) \in O(g(n)) is sometimes written as f(n) = O(g(n)). This is incorrect. The symbol O(g(n)) is not a function, but a set of functions; it is the set of all functions whose growth is bounded by a constant factor of g(n) in the infinite limit, as defined above. Use of the equal sign would suggest that we could flip it around to get O(g(n)) = f(n), but this would not make any sense.

It is fine to read f(n) \in O(g(n)) as "f of n is big-O of g of n", but it is important to remember that the word "is" does not denote equality.

Tilde/swung dash

The notation f(n) \sim g(n) means that \lim_{n\to\infty} \frac{f(n)}{g(n)} = 1. This is usually not used when discussing running times of algorithms, because it does not ignore the constant factor the way that the other notations do; hence n \in \Theta(2n) but n \nsim 2n. On the other hand, n \sim n+1. However, this notation may be useful when we care about the constant factor because we are quantifying something that depends on only the algorithm itself instead of the machine it runs on. For example, the amount of time an algorithm uses to sort its input depends on what machine it runs on, but the number of comparisons that the algorithm uses for a given input does not depend on the machine. In this case we might say that one algorithm requires \sim N \log_2 N comparisons, whereas another requires \sim 2 N \log_2 N comparisons. Both algorithms are \Theta(N \log N), but the tilde notation distinguishes their performance.

Mathematical discussion

The following identities hold. The proofs are left as an exercise to the reader.

Relationships between notations

  • f(n) \in O(g(n)) if and only if g(n) \in \Omega(f(n))
  • f(n) \in o(g(n)) if and only if g(n) \in \omega(f(n))
  • f(n) \in \Theta(g(n)) if and only if f(n) \in O(g(n)) and f(n) \in \Omega(g(n))
  • f(n) \in o(g(n)) implies f(n) \in O(g(n))
  • f(n) \in \omega(g(n)) implies f(n) \in \Omega(g(n))
  • f(n) \in O(g(n)) implies f(n) \notin \omega(g(n))
  • f(n) \in \Omega(g(n)) implies f(n) \notin o(g(n))
  • f(n) \sim g(n) implies f(n) \in \Theta(g(n))
  • f(n) \sim g(n) is equivalent to f(n) \in g(n) + o(g(n))

Reflexivity, symmetry, and transitivity

Reflexivity

  • f(n) \in O(f(n)).
  • f(n) \in \Omega(f(n)).
  • f(n) \in \Theta(f(n)).
  • f(n) \sim f(n).
  • f(n) \notin o(f(n)).
  • f(n) \notin \omega(f(n)).

Symmetry

  • f(n) \in \Theta(g(n)) is equivalent to g(n) \in \Theta(f(n)).
  • f(n) \sim g(n) is equivalent to g(n) \sim f(n).
  • f(n) \in o(g(n)) implies g(n) \notin o(f(n)).
  • f(n) \in \omega(g(n)) implies g(n) \notin \omega(f(n)).

Transitivity

  • If f(n) \in O(g(n)) and g(n) \in O(h(n)) then f(n) \in O(h(n)).
  • If f(n) \in \Omega(g(n)) and g(n) \in \Omega(h(n)) then f(n) \in \Omega(h(n)).
  • If f(n) \in \Theta(g(n)) and g(n) \in \Theta(h(n)), then f(n) \in \Theta(h(n)).
  • If f(n) \in o(g(n)) and g(n) \in o(h(n)) then f(n) \in o(h(n)).
  • If f(n) \in \omega(g(n)) and g(n) \in \omega(h(n)) then f(n) \in \omega(h(n)).
  • If f(n) \sim g(n) and g(n) \sim h(n) then f(n) \sim h(n).

Mixed transitivity

  • If f(n) \in \Theta(g(n)) and g(n) \in O(h(n)), then f(n) \in O(h(n)).
  • If f(n) \in \Theta(g(n)) and g(n) \in \Omega(h(n)), then f(n) \in \Omega(h(n)).
  • If f(n) \in \Theta(g(n)) and g(n) \in o(h(n)), then f(n) \in o(h(n)).
  • If f(n) \in \Theta(g(n)) and g(n) \in \omega(h(n)), then f(n) \in \omega(h(n)).
  • If f(n) \in O(g(n)) and g(n) \in o(h(n)), then f(n) \in o(h(n)).
  • If f(n) \in \Omega(g(n)) and g(n) \in \omega(h(n)), then f(n) \in \omega(h(n)).

Composition

  • TODO

Hierarchy of functions in one variable

From the reflexivity, symmetry, and transitivity properties above, it can be seen that \Theta defines an equivalence relation on functions from \mathbb{N} to \mathbb{R}^+. Thus f(n) = n^2 and g(n) = (n+2)(2n+1) belong to the same equivalence class, but neither belongs to the same equivalence class as h(n) = \log n. The equivalence class a function belongs to in this context is called its order. Notably, f(n) \in \Theta(c \cdot g(n)) for any c > 0, so scaling a function does not change its order.

These orders may in turn be ranked. If f(n) \in o(g(n)), then g, in a sense, increases strictly more quickly than f, and furthermore every function of g's order increases strictly more quickly than every function of f's order, because of the mixed transitivity relations given above.

(Note that in the below, the term "order" will be used in two different senses.)

The orders may then be partially ordered, where we define [f] \leq [g] if and only if f(n) \in O(g(n)) (and [f] < [g] if and only if f(n) \in o(g(n))). They cannot be totally ordered, because it is possible to define functions f and g such that f \notin O(g) and f \notin \Omega(g), so that [f] \nleq [g] and [g] \nleq [f]. How to do this is left as an exercise to the reader. The vast majority of orders encountered in the analysis of algorithms, however, belong to a chain (are totally ordered). Here is a list of orders of functions, with the more quickly growing functions toward the end. Names are given for common orders of complexity, so that, for example, an algorithm requiring about cn steps (with c constant) to process an input of size n is said to require linear time. The term linear space is defined analogously.

  • Constant: f(n) = c (Example: Time required to add a node to the front of a linked list)
  • Inverse Ackermann function: f(n) = \alpha(n) (Example: Amortized time required to perform a disjoint set operation)
  • Iterated logarithm: f(n) = \log^* n
  • Doubly logarithmic: f(n) = \log \log n (Example: Average running time of interpolation search)
  • Logarithmic: f(n) = \log n (Example: Time required by binary search)
  • Note that f(n) = \log_a(n) belongs to this class regardless of the value of a > 1. That is, \log_a(n) \in \Theta(\log_b(n)) for any a, b > 1. This is because of the relation \log_a(n) = \frac{\log_b(n)}{\log_b(a)}, where the denominator is a constant. This is why one sees "the running time of such-and-such algorithm is O(\log n)" without any base stated, or that it is O(\ln n) even though the analysis clearly suggests about O(\log_2 n) steps.
  • Polylogarithmic: f(n) = \log^c(n)
  • There is a separate order for each value of c; \log^2(n) \notin \Theta(\log^3(n)).
  • f(n) = n^c for 0 < c < 1
  • There is a separate order for each value of c; n^{1/3} \notin \Theta(n^{1/2}).
  • Linear: f(n) = n (Example: Time required to search for an item in an unsorted list)
  • f(n) = n \log n (Example: Time required to sort an array of n items using only comparisons and swaps)
  • The name "linearithmic" has been proposed for this complexity. Notably, a former PEG leader has dismissed it as ridiculous and fake-sounding. "Log-linear" is also a possible name.
  • The function \log(n!) \approx n \log n - n also belongs to this class.
  • Quadratic: f(n) = n^2 (Example: Worst-case time required to sort an array of n items using bubble sort)
  • Every quadratic polynomial belongs to this class, regardless of its coefficients.
  • Cubic: f(n) = n^3 (Example: Time required by the Floyd–Warshall algorithm on a graph with n vertices)
  • Every cubic polynomial belongs to this class, regardless of its coefficients.
  • Exponential: f(n) = c^n, for c > 1 (Example: Naive brute-force solution to subset sum problem)
  • There is a separate order for each value of c; 2^n \notin \Theta(3^n).
  • Factorial: f(n) = n!
  • f(n) = n^n
  • Doubly exponential: f(n) = c^{c^n} (usually 2^{2^n})
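
As a rough numerical illustration of how quickly the later entries in this hierarchy outgrow the earlier ones (constant factors all taken to be 1; the program below is only a sketch), the following C++ program prints a few of these functions side by side:

    #include <cstdio>
    #include <cmath>

    int main() {
        // Compare the growth of log n, n log n, n^2 and 2^n; constant
        // factors are ignored, exactly as asymptotic analysis does.
        const double sizes[] = {10, 20, 40, 80};
        for (double n : sizes) {
            std::printf("n = %3.0f   log2 n = %5.1f   n log2 n = %7.1f   n^2 = %6.0f   2^n = %.3g\n",
                        n, std::log2(n), n * std::log2(n), n * n, std::pow(2.0, n));
        }
        return 0;
    }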

Revisiting identities

Some of the identities given above, especially the mixed transitivity identities, are much simplified if we consider:

  • f(n) \in O(g(n)) to be equivalent to [f] \leq [g]
  • f(n) \in \Omega(g(n)) to be equivalent to [f] \geq [g]
  • f(n) \in o(g(n)) to be equivalent to [f] < [g]
  • f(n) \in \omega(g(n)) to be equivalent to [f] > [g]
  • f(n) \in \Theta(g(n)) to be equivalent to [f] = [g]

For example, the arithmetic identity that a \leq b and b < c imply a < c suggests the identity that if f(n) \in O(g(n)) and g(n) \in o(h(n)), then f(n) \in o(h(n)).

Multiple variables

When the running time or space usage of an algorithm depends on more than one parameter (which is especially common in graph theory, where both the number of vertices V and the number of edges E are important), the use of big O notation and its brethren becomes tricky. For example, if a geometric algorithm involving n points in d-dimensional space had running time \Theta(2^d n), then one might sometimes express this as simply O(n) and call it linear (because the reader is supposed to understand that d is being held constant). However, this does not adequately capture the essence of the algorithm's time complexity, which is linear in n but exponential in d.

It would, in this case, be advisable to avoid causing confusion by always writing it as \Theta(2^d n) or O(2^d n), rather than as "O(n)" or "linear", unless it is very clear from context that d is being held constant.

With two variables, the notation f(n, m) \in O(g(n, m)) means that there exist c, n_0, m_0 such that c g(n, m) \geq f(n, m) whenever n \geq n_0 and m \geq m_0. Big omega and big theta notation on multiple variables are defined analogously. (The other three notations are trickier to define in this case, since they involve limiting processes, and multivariate limits are tricky.) Here, we should write fully f(n, m) \in O(g(n, m)) instead of simply f \in O(g), so we always know what the relevant variables are.

How to analyze an algorithm

In both theory and practice, it is important to be able to determine the complexity of an algorithm.

In theory, much of computer science is the quest for better and better complexities of algorithms for solving various problems. This is often true even when the algorithm that is better in theory is worse in practice. For example, Strassen multiplication has a complexity of O(n^{2.807}), which is better than the naive method's O(n^3), but only outperforms the naive method in practice for very large matrices.

In practice, an upper bound on an algorithm's time or memory usage in terms of big O notation is an indication of how well it scales. An algorithm that takes O(n) time to search for an article in a database of n articles, for example, would probably work fine for this wiki, but would be horrendously inefficient for Wikipedia's purposes; O(\log n) scales much better in the sense that doubling the number of articles would only add a constant to the amount of time required per search.

Complexities in contests

Around the turn of the century, the International Olympiad in Informatics underwent a massive increase in difficulty as well as a fundamental shift in problem style. In the nineties, IOI problems often required recursive backtracking searches that had to be cleverly optimized by pruning in order to find an optimal solution within the time limit. After the shift, however, most problems required knowledge of algorithms typically taught in college computer science courses, which, when carefully composed, would give an asymptotically optimal algorithm for the overall problem.

With the older problems, it is not easy to predict theoretically what the running time should be, so one must resort to simply coding a solution, testing it on large test cases, and then optimizing it further if it does not pass the largest possible test cases in time. With the newer problems, on the other hand, one can often guess at what an acceptable time complexity would be simply by examining the bounds given.

As a rule of thumb for olympiad-style contests:

  • An exponential-time algorithm (O(2^n)) suffices to tackle test cases with n \leq 25, so a bound this small suggests an exponential-time algorithm. Dynamic programming is often required to achieve this time bound, or a polynomial multiple of it, rather than \Omega(n!), which is too slow.
  • A cubic-time algorithm will suffice for n up to about 300.
  • A quadratic-time algorithm will suffice for n up to about 5000.
  • An O(n \log n) time algorithm will suffice for n up to about 200000. The number 100000 is a very common bound on input size in olympiad problems, and almost always suggests that a contestant should aim to invent an algorithm with time complexity of around O(n \log n). (This aspect of olympiad contests has been criticized.)[1] O(n^2) is almost always too slow in this case.
  • A linear-time algorithm is typically required if the input can be even larger than this, usually 300000 or more lines.
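
These thresholds come from a rough operation count, under the (hedged) assumption that a contest machine performs on the order of 10^8 simple operations per second:

  • For n = 5000, n^2 = 2.5×10^7 operations, which is comfortably fast; for n = 100000, n^2 = 10^10, which is far too many.
  • For n = 100000, n \log_2 n is about 1.7×10^6 operations; even n = 200000 gives only about 3.5×10^6.
  • For n = 25, 2^n is about 3.4×10^7, which is feasible; for n = 40, 2^n is about 10^12, which is hopeless without a better algorithm.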

Best case, average case, and worst case

The running time of an algorithm usually depends not only on the size of the input (number of characters in a string, length of a number's binary representation, number of edges in a graph, etc.) but also on its details. For example, bubble sort requires only linear time to sort an array that is already sorted, but quadratic time to sort an array that is sorted in reverse order initially. These are called the best case and worst case, respectively, for bubble sort, since they represent the best and worst that bubble sort can perform on an input of a given size.

Theoreticians usually focus on optimizing the worst-case running time of an algorithm. In practice, the worst case is important, but the average case is also important. The average case running time is defined as the expected running time for the algorithm assuming that its input is of a given size and drawn randomly and uniformly from the set of all possible inputs of that size. The average case tends to be more difficult to mathematically analyze than the worst case. As an example, quicksort is O(n \log n) in both the average case and the best case, but O(n^2) in the worst case, so it is often used in practice (such as in the std::sort implementation in libstdc++) with a fallback to heapsort to handle particularly bad cases. (This combination is called introsort.)

In contests, the worst case is more important than the average case, because it is almost guaranteed that the problem setters will include contrived input—input that is intended to be the worst case for a typical "easy" algorithm, in order to separate contestants who implement this algorithm from cleverer contestants who implement an algorithm with better worst-case performance.

In time-critical applications (with an obvious and very cliché example being the software that controls space shuttles), the worst case is the only case that matters; catastrophic failure could occur if an algorithm with poor worst-case performance were to hit the worst case "on the job".

The best-case complexity is usually not considered very important, whether in theory or in practice.

The worst-case and average-case complexity of an algorithm are often similar, or even of exactly the same order. The best-case complexity is usually significantly better. Bubble sort is given above as an example; its average case complexity is quadratic, the same as its worst-case complexity. Another example is searching for a key in a balanced binary search tree, which is logarithmic time in the average and worst case but constant time in the best case.

Time versus space

It is usually very easy to analyze the space complexity of an algorithm, in contrast to its time complexity, which is usually trickier. An interesting example is that of the disjoint sets data structure. It obviously requires linear space (simply an array for each item that stores the ID of its parent item in its tree), but the analysis of its time complexity is beyond the scope of a typical undergraduate computer science course.
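
A minimal sketch of that linear-space representation in C++ (deliberately ignoring union by rank and path compression, the refinements that make the time analysis hard; the struct and member names are illustrative only):

    #include <vector>

    // Disjoint-set forest: parent[i] is the ID of i's parent item; a root is
    // its own parent. Space is one integer per item, i.e. linear.
    struct DisjointSets {
        std::vector<int> parent;

        explicit DisjointSets(int n) : parent(n) {
            for (int i = 0; i < n; i++) parent[i] = i;   // each item starts alone
        }

        int find(int x) {                 // walk up to the root of x's tree
            while (parent[x] != x) x = parent[x];
            return x;
        }

        void merge(int a, int b) {        // naive union of the two trees
            parent[find(a)] = find(b);
        }
    };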

We will focus on computing time complexity for the remainder of this article. If an algorithm or data structure's space complexity is difficult to analyze, it may be specially treated on its own page.

Instruction counting: the basic idea

The most basic way to analyze the time complexity of an algorithm is by supposing that certain instructions take constant time to execute and then counting the number of such instructions that the algorithm uses. Most commonly:

  • A scalar variable (a variable that requires a constant amount of memory to store), which is generally a character, an integer, a floating point number, a Boolean variable, or a pointer, takes constant time to read from input, write to output, or copy from one memory location to another.
  • Dynamically allocating or disposing of enough memory to store one scalar variable takes constant time.
  • Comparing two scalar variables (to determine whether the first is less than, equal to, or greater than the second) takes constant time.
  • A Boolean function (AND, OR, XOR, NOT) requires constant time to compute.
  • Every arithmetic operation takes constant time, as does every bitwise operation (such as a left shift).
  • Every standard mathematical operation then also takes constant time, since we only need to compute its value to limited precision, and can thus approximate it using a constant number of arithmetic operations.
  • Jumping from one place in the code to another takes constant time; this includes exiting a loop or subroutine, or choosing a branch of an if statement after the condition has already been checked.
  • Dereferencing a pointer takes constant time.
  • Accessing the ith element of an array, given i, takes constant time (random access machine model).

We don't have to count exactly how many instructions the CPU will execute in performing our algorithm; we just have to get it right to within a constant factor.

Simple examples

It then follows that:

  • A block of code that involves at most a constant number of constant-time instructions takes constant time to execute.

If the number of instructions required by an algorithm is some polynomial in its input size n, then only the leading term is significant (and its coefficient doesn't matter). For example, if an algorithm requires 3n^3 + 6n^2 + 3n + 2 constant-time instructions, then it requires O(n^3) time. This is a simple consequence of the fact that 6n^2 + 3n + 2 \in o(3n^3), so that 3n^3 + (6n^2 + 3n + 2) \sim 3n^3, thus 3n^3 + 6n^2 + 3n + 2 \in \Theta(3n^3) = \Theta(n^3). This is true even of functions that are linear combinations of powers of n but are not polynomials (because the exponents are negative or fractional).

  • A loop that runs n times, whose body contains only a constant number of constant-time instructions and whose termination condition takes constant time to check, requires O(n) time. This is because the body requires linear time in total across all iterations, checking the condition also requires linear time, and exiting the loop via a jump requires constant time; and the constant term is not significant compared to the linear term.
  • Copying an array of n elements or copying a string of n characters takes O(n) time.
  • Lexicographically comparing two strings takes constant time in the best or average case, and linear time in the worst case.
  • Adding or subtracting two large integers, each of size n, takes O(n) time, using the traditional carrying and borrowing-based algorithms.
  • When we say that an integer is of size n, we mean that it is so large that it can't fit in a single scalar variable, but instead requires n scalar variables to contain it. In Java, for example, an int is a scalar variable, whereas a BigInteger is not. See big numbers for more information.
  • A loop whose body contains only a constant number of constant-time instructions, and runs \log n times, requires O(\log n) time to complete. This is because constant terms are insignificant compared to logarithmic terms, as 1 \in o(\log n). It follows that binary search on an array of scalar variables takes logarithmic time.
  • If we have two nested loops, one of which runs f(n) times and the other of which runs g(n) times, and the condition checks and loop bodies each take only constant time to evaluate once, then this nested loop structure takes O(f(n)g(n)) time. This result easily extends to a nested structure of more than two loops. (A sketch illustrating this and some of the other items above appears after this list.)
  • Multiplying together two large integers of size n using the "longhand" method taught in elementary school requires O(n^2) time.
  • The Floyd–Warshall algorithm, which consists of three nested loops, each of which runs V times, and whose inner body contains only a constant number of constant-time instructions, takes O(V^3) time.
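
Here is that sketch: a minimal C++ illustration of a few of the patterns above (big-number addition, two nested loops, and binary search). It is only an illustration under the assumptions stated in the list; the function names are made up:

    #include <vector>
    #include <cstddef>

    // O(n): adding two big numbers stored as digit arrays (least significant
    // digit first), using the schoolbook carrying method described above.
    std::vector<int> addBig(const std::vector<int>& a, const std::vector<int>& b) {
        std::vector<int> result;
        int carry = 0;
        for (std::size_t i = 0; i < a.size() || i < b.size() || carry; i++) {
            int digit = carry;
            if (i < a.size()) digit += a[i];
            if (i < b.size()) digit += b[i];
            result.push_back(digit % 10);
            carry = digit / 10;
        }
        return result;
    }

    // O(f(n)g(n)) for two nested loops; here both run about n times, so O(n^2).
    long long countInversions(const std::vector<int>& a) {
        long long inv = 0;
        for (std::size_t i = 0; i < a.size(); i++)
            for (std::size_t j = i + 1; j < a.size(); j++)
                if (a[i] > a[j]) inv++;
        return inv;
    }

    // O(log n): the search interval halves on every iteration, so the loop
    // runs about log2 n times. Returns the index of key in sorted, or -1.
    int binarySearch(const std::vector<int>& sorted, int key) {
        int lo = 0, hi = (int)sorted.size() - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (sorted[mid] == key) return mid;
            if (sorted[mid] < key) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;
    }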

Recursive functions

Many algorithms work by breaking the input down into smaller pieces, recursively solving each smaller piece as though it were an independent problem, and then combining the results to obtain a solution to the original instance. This algorithmic paradigm is known as divide and conquer, and its running time is usually analyzed slightly differently from that of the simpler algorithms from the previous section.

Examples:

  • Merge sort, which sorts a list by dividing it into two sub-lists, sorting each sub-list, and then merging the two sorted sub-lists to obtain the original list, sorted. The merge step takes O(n) time if the original list had n elements. The overall running time is O(n \log n). (A code sketch of this recursion appears after this list.)
  • Karatsuba multiplication, which multiplies two n-digit numbers (see Big numbers) by dividing each number into two halves and recursively multiplying three pairs of halves, along with some additions and subtractions which take O(n) time in total. The overall running time is O(n^{\log_2 3}).
  • One solution to the closest pair of points problem finds the closest pair by dividing the point set with a half plane to obtain two sets of points of approximately half the size, finding the closest pair of points within each set, and then sorting points close to the dividing line and sweeping to find the closest pair across the division, which takes O(n \log n) time (for the sorting). The overall running time is then O(n \log^2 n). (NB: This can be improved to O(n \log n) if we concomitantly run merge sort, rather than re-sorting along the dividing line in each step.)
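
Here is the merge sort sketch referred to above: a minimal C++ illustration of the divide-and-conquer structure (two recursive calls on halves, followed by an O(n) merge), not a tuned implementation:

    #include <vector>

    // Sort a[lo..hi) by recursively sorting the two halves, then merging.
    void mergeSort(std::vector<int>& a, int lo, int hi) {
        if (hi - lo <= 1) return;          // base case: 0 or 1 element
        int mid = lo + (hi - lo) / 2;
        mergeSort(a, lo, mid);             // two subinstances of half the size
        mergeSort(a, mid, hi);
        std::vector<int> merged;           // O(n) "conquer" step: merge
        merged.reserve(hi - lo);
        int i = lo, j = mid;
        while (i < mid || j < hi) {
            if (j >= hi || (i < mid && a[i] <= a[j])) merged.push_back(a[i++]);
            else merged.push_back(a[j++]);
        }
        for (int k = lo; k < hi; k++) a[k] = merged[k - lo];
    }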

How do we analyze the running time of such algorithms? We can reason as follows:

  • In the case of merge sort, we start out with one instance (the original list of size n), which gives rise to two subinstances of size n/2; these in turn give rise to a total of four subinstances of size n/4, and so on down to subinstances of size 1 or 2. There are about \log_2 n "levels" to this scheme, and at each level, we have about 2^k subinstances of size n/2^k, and a linear amount of work is required in the "conquer" step of each; so that is about cn steps at each level, or about cn \log n steps in total, giving O(n \log n).
  • In Karatsuba multiplication, there will again be about \log_2 n levels. On the topmost level, we have one instance with size n. On the next level, we have three instances with size n/2, and then nine instances with size n/4, and so on down. A linear amount of additional work is required for the "conquer" step for each instance, so that is about cn work on the topmost level, about \frac{3}{2}cn on the next level, and so on down. So the total running time will be about cn \sum_{k=0}^{\log_2 n} \left(\frac{3}{2}\right)^k = cn\left(\frac{\left(\frac{3}{2}\right)^{\log_2 n + 1}-1}{\frac{3}{2}-1}\right) = O\left(n\left(\frac{3}{2}\right)^{\log_2 n}\right) = O(n^{\log_2 3}). (Several steps were omitted.)
  • In the third algorithm listed above, there are again about \log_2 n levels; on the topmost level, the work done is cn \log n; then 2c\left(\frac{n}{2}\log\left(\frac{n}{2}\right)\right) on the next level, since we will have two subinstances of size \frac{n}{2}, which will each require O\left(\frac{n}{2}\log\left(\frac{n}{2}\right)\right) work; then on the next level, 4c\left(\frac{n}{4}\log\left(\frac{n}{4}\right)\right), and so on down; the total work done is about O\left(\sum_{k=0}^{\log_2 n} 2^k \frac{n}{2^k}\log\left(\frac{n}{2^k}\right)\right) = n\cdot O\left(\sum_{k=0}^{\log_2 n}\log\left(\frac{n}{2^k}\right)\right). This gives us an arithmetic series, with first term \log n (here k = 0) and last term about 0 (for when \frac{n}{2^k} \approx \frac{n}{n} = 1 since k \approx \log_2 n) and a total of about \log_2 n terms; its sum is then about \log n \log_2 n giving O(n \log^2 n) overall.

We would like to have some way to tackle the general case, without so much messy mathematical reasoning in each individual case. By general case, we mean that we assume that to solve an instance of size n using some algorithm, we divide it into chunks of size about \frac{n}{b}, and solve a subproblems using those chunks, along with extra work in the amount of f(n), unless n < c for some constant c, which represents a base case that can be solved in O(1) time. For example, in the first of the three examples above, we have a = 2, b = 2, f(n) = n; in the second, a = 3, b = 2, f(n) = n, and in the third, a = 2, b = 2, f(n) = n \log n.
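
Stated as a recurrence (a restatement of the previous paragraph, with T(n) denoting the running time on an instance of size n):

T(n) = a T\left(\frac{n}{b}\right) + f(n) for n \geq c, and T(n) = O(1) for n < c.

Thus merge sort corresponds to T(n) = 2T(n/2) + n, Karatsuba multiplication to T(n) = 3T(n/2) + n, and the closest pair algorithm to T(n) = 2T(n/2) + n \log n.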

The master theorem

The authoritative statement of the master theorem is from CLRS[2]. Suppose that a \geq 1 and b > 1 above, and f is positive. Then, there are three cases:

  1. If f(n) \in O(n^{\log_b a - \epsilon}) for some \epsilon > 0, then the overall running time is \Theta(n^{\log_b a}).
  2. If f(n) \in \Theta(n^{\log_b a} \log^k n), for some k \geq 0, then the overall running time is \Theta(n^{\log_b a} \log^{k+1} n).
  3. If f(n) \in \Omega(n^{\log_b a + \epsilon}) for some \epsilon > 0, and, furthermore, af\left(\frac{n}{b}\right) \leq cf(n) for some c < 1 and all sufficiently large n, then the overall running time is \Theta(f(n)).

For example, Karatsuba multiplication falls into the first case, since f(n) = n and \log_2 3 \approx 1.58. The theorem then tells us that the overall running time is O(n^{\log_2 3}). What is happening here is that as we proceed from one level to the next, the amount of work done on the current level increases exponentially or faster. In this particular case, each level entails 1.5 times as much work as the previous level. Intuitively, this means most of the work is done on the lowest level, on which we have n subproblems, each with size 1. The earlier analysis then applies. The reason why we need f(n) \in O(n^{\log_b a - \epsilon}) is precisely to ensure this exponential growth. If f(n) is exactly n^{\log_b a}, then we will have af\left(\frac{n}{b}\right) = a\frac{n^{\log_b a}}{b^{\log_b a}} = n^{\log_b a} = f(n), that is, the amount of work done on each level will be the same. But if f(n) grows more slowly than this, then f(n) will be less than af\left(\frac{n}{b}\right). If it grows polynomially more slowly (that is, by a factor of n^\epsilon or more), then the desired exponential growth will be attained.

Merge sort and the closest pair of points algorithm are covered by case 2. For merge sort, we have k = 0 (there is no log factor at all). Here the time spent on each level is the same, since f(n) = af\left(\frac{n}{b}\right) if f(n) = n^{\log_b a}, so we simply multiply f(n) by the number of levels, that is, \log_b n \in \Theta(\log n). For the closest pair of points algorithm, we have k = 1. When k > 0, there is a decrease in the amount of work done as we go further down, but it is slower than an exponential decay. In each case the average amount of work done per level is on the order of the amount of work done on the topmost level, so in each case we obtain \Theta(f(n) \log n) time.
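
As another quick check of case 2, consider binary search, written as a recurrence in the form above: it makes one recursive call on half the array and does constant extra work, so a = 1, b = 2, f(n) \in \Theta(1). Then n^{\log_b a} = n^0 = 1, so f(n) \in \Theta(n^{\log_b a} \log^0 n), i.e. k = 0, and the theorem gives \Theta(n^{\log_b a} \log n) = \Theta(\log n), matching the logarithmic running time given earlier for binary search.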

Case 3 is not often encountered in practice. When f(n) \geq \frac{1}{c}af\left(\frac{n}{b}\right) (with \frac{1}{c} > 1), there is an exponential (or faster) decay as we proceed from the top to the lower levels, so most of the work is done at the top, and hence the \Theta(f(n)) behaviour is obtained. The condition that f(n) \in \Omega(n^{\log_b a + \epsilon}) is not actually needed in the statement of Case 3, since it is implied by the condition that af\left(\frac{n}{b}\right) \leq cf(n). The reason why the theorem is stated as it is above is that if f(n) is actually of the form cn^{\log_b a + \epsilon}\log^k n, which covers most of the cases, then the other condition is implied, and we don't need to explicitly check it, but can instead just look at the form of f(n). However, there are some possibilities for f(n) that are not "well-behaved", and will not satisfy the condition af\left(\frac{n}{b}\right) \leq cf(n).

An actual proof of the master theorem is given in CLRS.

Akra–Bazzi theorem

TODO

Amortized analysis

TODO

References

  1. Ribeiro, P. and Guerreiro, P. Improving the Automatic Evaluation of Problem Solutions in Programming Contests. Olympiads in Informatics, 2009, Vol. 3, 132–143. Retrieved from http://www.mii.lt/olympiads_in_informatics/pdf/INFOL048.pdf
  2. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Sections 4.3 (The master method) and 4.4 (Proof of the master theorem), pp. 73–90.