Difference between revisions of "Knuth–Morris–Pratt algorithm"

Revision as of 08:06, 23 October 2011

The Knuth–Morris–Pratt (KMP) algorithm is a linear time solution to the single-pattern string search problem. It is based on the observation that a partial match gives useful information about whether or not the needle may partially match subsequent positions in the haystack. This is because a partial match indicates that some part of the haystack is the same as some part of the needle, so that if we have preprocessed the needle in the right way, we will be able to draw some conclusions about the contents of the haystack (because of the partial match) without having to go back and re-examine characters already matched. In particular, this means that, in a certain sense, we will want to precompute how the needle matches itself. The algorithm thus "never looks back" and makes a single pass over the haystack. Together with linear time preprocessing of the needle, this gives a linear time algorithm overall.

Motivation

The motivation behind KMP is best illustrated using a few simple examples.

Example 1

In this example, we are searching for the string $S$ = aaa in the string $T$ = aaaaaaaaa (in which it occurs seven times). The naive algorithm would begin by comparing $S_1$ with $T_1$ , $S_2$ with $T_2$ , and $S_3$ with $T_3$ , and thus find a match for $S$ at position 1 of $T$ . Then it would proceed to compare $S_1$ with $T_2$ , $S_2$ with $T_3$ , and $S_3$ with $T_4$ , and thus find a match at position 2 of $T$ , and so on, until it finds all the matches. But we can do better than this, if we preprocess $S$ and note that $S_1$ and $S_2$ are the same, and $S_2$ and $S_3$ are the same. That is, the prefix of length 2 in $S$ matches the substring of length 2 starting at position 2 in $S$ ; $S$ partially matches itself. Now, after finding that $S_1, S_2, S_3$ match $T_1, T_2, T_3$ , respectively, we no longer care about $T_1$ , since we are trying to find a match at position 2 now, but we still know that $S_2, S_3$ match $T_2, T_3$ respectively. Since we already know $S_1 = S_2, S_2 = S_3$ , we now know that $S_1, S_2$ match $T_2, T_3$ respectively; there is no need to examine $T_2$ and $T_3$ again, as the naive algorithm would do. If we now check that $S_3$ matches $T_4$ , then, after finding $S$ at position 1 in $T$ , we only need to do one more comparison (not three) to conclude that $S$ also occurs at position 2 in $T$ . So now we know that $S_1, S_2, S_3$ match $T_2, T_3, T_4$ , respectively, which allows us to conclude that $S_1, S_2$ match $T_3, T_4$ . Then we compare $S_3$ with $T_5$ , and find another match, and so on. Whereas the naive algorithm needs three comparisons to find each occurrence of $S$ in $T$ , our technique only needs three comparisons to find the first occurrence, and only one for each after that, and doesn't go back to examine previous characters of $T$ again. (This is how a human would probably do this search, too.)

Example 2

Now let's search for the string $S$ = aaa in the string $T$ = aabaabaaa. Again, we start out the same way as in the naive algorithm, hence, we compare $S_1$ with $T_1$ , $S_2$ with $T_2$ , and $S_3$ with $T_3$ . Here we find a mismatch between $S$ and $T$ , so $S$ does not occur at position 1 in $T$ . Now, the naive algorithm would continue by comparing $S_1$ with $T_2$ and $S_2$ with $T_3$ , and would find a mismatch; then it would compare $S_1$ with $T_3$ , and find a mismatch, and so on. But a human would notice that after the first mismatch, the possibilities of finding $S$ at positions 2 and 3 in $T$ are extinguished. This is because, as noted in Example 1, $S_2$ is the same as $S_3$ , and since $S_3 \neq T_3$ , $S_2 \neq T_3$ also (so we will not find $S$ at position 2 of $T$ ). And, likewise, since $S_1 \neq S_2$ , and $S_2 \neq T_3$ , it is also true that $S_1 \neq T_3$ , so it is pointless looking for a match at the third position of $T$ . Thus, it would make sense to start comparing again at the fourth position of $T$ (i.e., $S_1, S_2, S_3$ with $T_4, T_5, T_6$ , respectively). Again finding a mismatch, we use similar reasoning to rule out the fifth and sixth positions in $T$ , and begin matching again at $T_7$ (where we finally find a match.) Again, notice that the characters of $T$ were examined strictly in order.

Example 3

As a more complex example, imagine searching for the string $S$ = tartan in the string $T$ = tartaric_acid. We make the observation that the prefix of length 2 in $S$ matches the substring of length 2 in $S$ starting from position 4. Now, we start by comparing $S_1, S_2, S_3, S_4, S_5, S_6$ with $T_1, T_2, T_3, T_4, T_5, T_6$ , respectively. We find that $S_6$ does not match $T_6$ , so there is no match at position 1. At this point, we note that since $S_1 \neq S_2$ and $S_1 \neq S_3$ , and $S_2 = T_2, S_3 = T_3$ , obviously, $S_1 \neq T_2$ and $S_1 \neq T_3$ , so there cannot be a match at position 2 or position 3. Now, recall that $S_1 = S_4$ and $S_2 = S_5$ , and that $S_4 = T_4, S_5 = T_5$ . We can translate this to $S_1 = T_4, S_2 = T_5$ . So we proceed to compare $S_3$ with $T_6$ . In this way, we have ruled out two possible positions, and we have restarted comparing not at the beginning of $S$ but in the middle, avoiding re-examining $T_4$ and $T_5$ .

Concept

Let the prefix of length $i$ of string $S$ be denoted $S^i$ .

The examples above show that the KMP algorithm relies on noticing that certain substrings of the needle match or do not match other substrings of the needle, but it is probably not clear what the unifying organizational principle for all this match information is. Here it is:

At each position $i$ of $S$ , find the longest proper suffix of $S^i$ that is also a prefix of $S$ .

We shall denote the length of this substring by $\pi_i$ , following ^[1]. We can also state the definition of $\pi_i$ equivalently as the maximum $j$ such that $S_j \sqsupset S_i$ .

The table $\pi$ , called the prefix function, occupies linear space, and, as we shall see, can be computed in linear time. It contains all the information we need in order to execute the "smart" searching techniques described in the examples. In particular, in examples 1 and 2, we used the fact that $\pi_3 = 2$ , that is, the prefix aa matches the suffix aa. In example 3, we used the facts that $\pi_5 = 2$ . This tells us that the prefix ta matches the substring ta ending at the fifth position. In general, the table $\pi$ tells us, after either a successful match or a mismatch, what the next position is that we should check in the haystack. Comparison proceeds from where it was left off, never revisiting a character of the haystack after we have examined the next one.

Computation of the prefix function

To compute the prefix function, we shall first make the following observation:

Prefix function iteration lemma^[1]: The sequence

\pi^*_i = i, \pi_i, \pi_{\pi_i}, \pi_{\pi_{\pi_i}}, \ldots, 0

contains exactly those values

j

such that

S^j \sqsupset S^i

.

That is, we can enumerate all suffixes of $S^i$ that are also prefixes of $S$ by starting with $i$ , looking it up in the table $\pi$ , looking up the result, looking up the result, and so on, giving a strictly decreasing sequence, terminating with zero.

Proof: We first show by induction that if

j

appears in the sequence

\pi^*_i

then

S^j \sqsupset S^i

, i.e,

j

indeed belongs in the sequence

\pi^*_i

. Suppose

j

is the first entry in

\pi^*_i

. Then

j = i

and it is trivially true that

S^j \sqsupset S^i

. Now suppose

j

is not the first entry, but is preceded by the entry

k

which is valid. That is,

\pi_k = j

. By definition,

S^j \sqsupset S^k

. But

S^k \sqsupset S^i

by assumption. Since

\sqsupset

is transitive,

S^j \sqsupset S^i

.

We now show by contradiction that if

S^j \sqsupset S^i

, then

j \in \pi^*_i

. Assume

j

does not appear in the sequence. Clearly

0 < j < i

since 0 and

i

both appear. Since

\pi^*_i

is strictly decreasing, we can find exactly one

k \in \pi^*_i

such that

k > j

and

\pi_k < j

; that is, we can find exactly one

k

after which

j

"should" appear (to keep the sequence decreasing). We know from the first part of the proof that

S^k \sqsupset S^i

. Since the suffix of

S^i

of length

j

is a suffix of the suffix of

S^i

of length

k

, it follows that the suffix of

S^i

of length

j

matches the suffix of length

j

of

S^k

. But the suffix of

S^i

of length

j

also matches

S^j

, so

S^j

matches the suffix of

S^k

of length

j

. We therefore conclude that

\pi_k \geq j

. But

j > \pi_k

, a contradiction.

_\blacksquare

With this in mind, we can design an algorithm to compute the table $\pi$ . For each $i$ , we will first try to find some $j > 0$ such that $S_j \sqsupset S_i$ . If we fail to do so, we will conclude that $\pi_i = 0$ (clearly this is the case when $i = 1$ .) Observe that if we do find such $j > 0$ , then by removing the last character from this suffix, we obtain a suffix of $S^{i-1}$ that is also a prefix of $S$ , i.e., $S^{j-1} \sqsupset S^{i-1}$ . Therefore, we first enumerate all nonempty proper suffixes of $S^{i-1}$ that are also prefixes of $S$ . If we find such a suffix of length $k$ which also satisfies $S_k = S_i$ , then $S^{k+1} \sqsupset S^i$ , and $k+1$ is a possible value of $j$ . So we will let $k = \pi_{i-1}$ and keep iterating through the sequence $\pi_k, \pi_{\pi_k}, \ldots$ . We stop if we reach element $j$ in this sequence such that $S_{j+1} = S_i$ , and declare $\pi_i = j+1$ ; this always gives an optimal solution since the sequence $\pi^*_{i-1}$ is decreasing and since it contains all possible valid $k$ 's. If we exhaust the sequence, then $\pi_i = 0$ .

Here is the pseudocode:

π[1] ← 0
for i ∈ [2..m]
    k ← π[i-1]
    while k > 0 and S[k+1] ≠ S[i]
        k ← π[k]
    if k = 0
        π[i] ← 0
    else
        π[i] ← k+1

With a little bit of thought, this can be re-written as follows:

π[1] ← 0
k ← 0
for i ∈ [2..m]
    while k > 0 and S[k+1] ≠ S[i]
        k ← π[k]
    if k ≠ 0
        k ← k+1
    π[i] ← k

This algorithm takes $O(m)$ time to execute. To see this, notice that the value of k is never negative; therefore it cannot decrease more than it increases. It only increases in the line k ← k+1, which can only be executed up to $m-1$ times. Therefore k can be decreased at most k times. But k is decreased in each iteration of the while loop, so the while loop can only run a linear number of times overall. All the other instructions in the for loop take constant time per iteration, so linear time overall.

Matching

Suppose that we have already computed the table $\pi$ . Here's where we apply this hard-won information. Suppose that so far we are testing position $j$ for a match, and that the first $k$ characters of $S$ have been successfully matched, that is, $S_1 = T_j, S_2 = T_{j+1}, S_3 = T_{j+2}, \ldots S_k = T_{j+k-1}$ . There are two possibilities: either we just continue going along $S$ and $T$ and comparing pairs of characters, or we decide we want to try out a new position in $T$ . This occurs because either $k = m$ (i.e, we've successfully located $S$ at position $i$ in $T$ , and now we want to check out other positions) or because $S_{k+1} \neq T_{j+k}$ (so we can rule out the current position).

Given that $S_1, \ldots, S_k = T_j, \ldots, T_{j+k-1}$ , what positions in $T$ can we rule out? Here is the result at the core of the KMP algorithm:

Theorem: If

k > 0

then

p = k - \pi_k

is the least

p > 0

such that

S_1, \ldots, S_{k-p}

match

T_{j+p}, \ldots, T_{j+k-1}

, respectively. (If

k = 0

, then

p = 1

vacuously.)

Think carefully about what this means. If $p > 0$ does not satisfy the statement that $S_1, \ldots, S_{k-p}$ match $T_{j+p}, \ldots, T_{j+k-1}$ , then the needle $S$ does not match $T$ at position $j+p$ , i.e., we can rule out the position $j+p$ . On the other hand, if $p > 0$ does satisfy this statement, then $S$ might match $T$ at position $j+p$ , and, in fact, all the characters up to but not including $T_{j+k}$ have already been verified to match the corresponding characters in $S$ , so we can proceed by comparing $S_{k-p+1}$ with $T_{j+k}$ , and, as promised, never need to look back.

Proof: Let

0 \leq q < k

. If

S^q \sqsupset S^k

, then by definition we have

S_1, \ldots, S_q = S_{k-q+1}, \ldots, S_k

. But since

S_1, \ldots, S_k = T_j, \ldots, S_{j+k-1}

, it is also true that

S_{k-q+1}, \ldots, S_k = T_{j+k-q}, \ldots, T_{j+k-1}

. Therefore

S_1, \ldots, S_q = T_{j+k-q}, \ldots, T_{j+k-1}

. If, on the other hand, it is not true that

S^q \sqsupset S^k

, then it is not true that

S_1, \ldots, S_q = S_{k-q+1}, \ldots, S_k

, so it is not true that

S_{k-q+1}, \ldots, S_k = T_{j+k-q}, \ldots, T_{j+k-1}

, so it is not true that

S_1, \ldots, S_q = T_{j+k-q}, \ldots, T_{j+k-1}

. Therefore

k-q

is a possible value of

p

if and only if

S^q \sqsupset S^k

. Since the maximum possible value of

q

is

\pi_k

, the minimum possible value of

p

is given by

k-\pi_k

.

_\blacksquare

Thus, here is the matching algorithm in pseudocode:

j ← 1
k ← 0
while j+m-1 ≤ n
    while k ≤ m and S[k+1] = T[j+k]
        k ← k+1
    if k = m
        print "Match at position " j
    if k = 0
        j ← j+1
    else
        j ← j+k-π[k]
        k ← π[k]

Thus, we scan the text one character at a time; the current character being examined is located at position $j+k$ . When there is a mismatch, we use the $\pi$ table to look up the next possible position at which the match might occur, and try to proceed.

The fact that the algorithm scans one character at a time without looking back is more obvious when the code is cast into this equivalent form:^[1]

k ← 0
for i ∈ [1..n]
    while k > 0 and S[k+1] ≠ T[i]
        k ← π[k]
    if S[k+1] = T[i]
        k ← k+1
    if k = m
        print "Match at position " i-m+1
        k ← π[k]

Here, i is identified with j+k as above. Each iteration of the inner loop in one of these two segments corresponds to an iteration of the outer loop in the other. In this second form, we can also prove that the algorithm takes $O(n)$ time; each time the inner while loop is executed, the value of k decreases, but it cannot decrease more than $n$ times because it starts as zero, is never negative, and is increased at most once per iteration of the outer loop (i.e., at most $n$ times in total), hence the inner loop is only executed up to $n$ times. All other operations in the outer loop take constant time.

References

↑ ^1.0 ^1.1 ^1.2 Thomas H. Cormen; Charles E. Leiserson, Ronald L. Rivest, Clifford Stein (2001). "Section 32.4: The Knuth-Morris-Pratt algorithm". Introduction to Algorithms (Second ed.). MIT Press and McGraw-Hill. pp. 923–931. ISBN 978-0-262-03293-3.

External links

A Needle in the Haystack

[CLRS-1] 1.0 ^1.1 ^1.2 Thomas H. Cormen; Charles E. Leiserson, Ronald L. Rivest, Clifford Stein (2001). "Section 32.4: The Knuth-Morris-Pratt algorithm". Introduction to Algorithms (Second ed.). MIT Press and McGraw-Hill. pp. 923–931. ISBN 978-0-262-03293-3.

[1]

@@ Line 5: / Line 5: @@
 ===Example 1===
-In this example, we are searching for the string <math>S</math> = '''aaa''' in the string <math>T</math> = '''aaaaaaaaa''' (in which it occurs seven times). The naive algorithm would begin by comparing <math>S_1</math> with <math>T_1</math>, <math>S_2</math> with <math>T_2</math>, and <math>S_3</math> with <math>T_3</math>, and thus find a match for <math>S</math> at position 1 of <math>T</math>. Then it would proceed to compare <math>S_1</math> with <math>T_2</math>, <math>S_2</math> with <math>T_3</math>, and <math>S_3</math> with <math>T_4</math>, and thus find a match at position 2 of <math>T</math>, and so on, until it finds all the matches. But we can do better than this, if we preprocess <math>S</math> and note that <math>S_1</math> and <math>S_2</math> are the same, and <math>S_2</math> and <math>S_3</math> are the same. That is, the '''prefix''' of length 2 in <math>S</math> matches the '''substring''' of length 2 starting at position 2 in <math>S</math>; ''<math>S</math> partially matches itself''. Now, after finding that <math>S_1, S_2, S_3</math> match <math>T_1, T_2, T_3</math>, respectively, we no longer care about <math>T_1</math>, since we are trying to find a match at position 2 now, but we still know that <math>S_2, S_3</math> match <math>T_2, T_3</math> respectively. Since we already know <math>S_1 = S_2, S_2 = S_3</math>, we now know that <math>S_1, S_2</math> match <math>T_2, T_3</math> respectively; there is no need to examine <math>T_2</math> and <math>T_3</math> again, as the naive algorithm would do. If we now check that <math>S_3</math> matches <math>T_4</math>, then, after finding <math>S</math> at position 1 in <math>T</math>, we only need to do ''one'' more comparison (not three) to conclude that <math>S</math> also occurs at position 2 in <math>T</math>. So now we know that <math>S_1, S_2, S_3</math> match <math>T_2, T_3, T_4</math>, respectively, which allows us to conclude that <math>S_1, S_2</math> match <math>T_3, T_4</math>. Then we compare <math>S_3</math> with <math>T_5</math>, and find another match, and so on. Whereas the naive algorithm needs three comparisons to find each occurrence of <math>S</math> in <math>T</math>, our technique only needs three comparisons to find the ''first'' occurrence, and only one for each after that, and doesn't go back to examine previous characters of <math>T</math> again. (This is how a human would probably do this search, too.
+In this example, we are searching for the string <math>S</math> = '''aaa''' in the string <math>T</math> = '''aaaaaaaaa''' (in which it occurs seven times). The naive algorithm would begin by comparing <math>S_1</math> with <math>T_1</math>, <math>S_2</math> with <math>T_2</math>, and <math>S_3</math> with <math>T_3</math>, and thus find a match for <math>S</math> at position 1 of <math>T</math>. Then it would proceed to compare <math>S_1</math> with <math>T_2</math>, <math>S_2</math> with <math>T_3</math>, and <math>S_3</math> with <math>T_4</math>, and thus find a match at position 2 of <math>T</math>, and so on, until it finds all the matches. But we can do better than this, if we preprocess <math>S</math> and note that <math>S_1</math> and <math>S_2</math> are the same, and <math>S_2</math> and <math>S_3</math> are the same. That is, the '''prefix''' of length 2 in <math>S</math> matches the '''substring''' of length 2 starting at position 2 in <math>S</math>; ''<math>S</math> partially matches itself''. Now, after finding that <math>S_1, S_2, S_3</math> match <math>T_1, T_2, T_3</math>, respectively, we no longer care about <math>T_1</math>, since we are trying to find a match at position 2 now, but we still know that <math>S_2, S_3</math> match <math>T_2, T_3</math> respectively. Since we already know <math>S_1 = S_2, S_2 = S_3</math>, we now know that <math>S_1, S_2</math> match <math>T_2, T_3</math> respectively; there is no need to examine <math>T_2</math> and <math>T_3</math> again, as the naive algorithm would do. If we now check that <math>S_3</math> matches <math>T_4</math>, then, after finding <math>S</math> at position 1 in <math>T</math>, we only need to do ''one'' more comparison (not three) to conclude that <math>S</math> also occurs at position 2 in <math>T</math>. So now we know that <math>S_1, S_2, S_3</math> match <math>T_2, T_3, T_4</math>, respectively, which allows us to conclude that <math>S_1, S_2</math> match <math>T_3, T_4</math>. Then we compare <math>S_3</math> with <math>T_5</math>, and find another match, and so on. Whereas the naive algorithm needs three comparisons to find each occurrence of <math>S</math> in <math>T</math>, our technique only needs three comparisons to find the ''first'' occurrence, and only one for each after that, and doesn't go back to examine previous characters of <math>T</math> again. (This is how a human would probably do this search, too.)
 ===Example 2===
 Now let's search for the string <math>S</math> = '''aaa''' in the string <math>T</math> = '''aabaabaaa'''. Again, we start out the same way as in the naive algorithm, hence, we compare <math>S_1</math> with <math>T_1</math>, <math>S_2</math> with <math>T_2</math>, and <math>S_3</math> with <math>T_3</math>. Here we find a mismatch between <math>S</math> and <math>T</math>, so <math>S</math> does ''not'' occur at position 1 in <math>T</math>. Now, the naive algorithm would continue by comparing <math>S_1</math> with <math>T_2</math> and <math>S_2</math> with <math>T_3</math>, and would find a mismatch; then it would compare <math>S_1</math> with <math>T_3</math>, and find a mismatch, and so on. But a human would notice that after the first mismatch, the possibilities of finding <math>S</math> at positions 2 and 3 in <math>T</math> are extinguished. This is because, as noted in Example 1, <math>S_2</math> is the same as <math>S_3</math>, and since <math>S_3 \neq T_3</math>, <math>S_2 \neq T_3</math> also (so we will not find <math>S</math> at position 2 of <math>T</math>). And, likewise, since <math>S_1 \neq S_2</math>, and <math>S_2 \neq T_3</math>, it is also true that <math>S_1 \neq T_3</math>, so it is pointless looking for a match at the third position of <math>T</math>. Thus, it would make sense to start comparing again at the fourth position of <math>T</math> (''i.e.'', <math>S_1, S_2, S_3</math> with <math>T_4, T_5, T_6</math>, respectively). Again finding a mismatch, we use similar reasoning to rule out the fifth and sixth positions in <math>T</math>, and begin matching again at <math>T_7</math> (where we finally find a match.) Again, notice that the characters of <math>T</math> were examined strictly in order.

Difference between revisions of "Knuth–Morris–Pratt algorithm"

Revision as of 08:06, 23 October 2011

Contents

Motivation

Example 1

Example 2

Example 3

Concept

Computation of the prefix function

Matching

References

External links

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools