Knuth–Morris–Pratt algorithm

From PEGWiki
 
Revision as of 22:34, 3 April 2011

The Knuth–Morris–Pratt (KMP) algorithm is a linear time solution to the single-pattern string search problem. It is based on the observation that a partial match gives useful information about whether or not the needle may partially match subsequent positions in the haystack. This is because a partial match indicates that some part of the haystack is the same as some part of the needle, so that if we have preprocessed the needle in the right way, we will be able to draw some conclusions about the contents of the haystack (because of the partial match) without having to go back and re-examine characters already matched. In particular, this means that, in a certain sense, we will want to precompute how the needle matches itself. The algorithm thus "never looks back" and makes a single pass over the haystack. Together with linear time preprocessing of the needle, this gives a linear time algorithm overall.

==Motivation==

The motivation behind KMP is best illustrated using a few simple examples.

===Example 1===

In this example, we are searching for the string S = aaa in the string T = aaaaaaaaa (in which it occurs seven times). The naive algorithm would begin by comparing S_1 with T_1, S_2 with T_2, and S_3 with T_3, and thus find a match for S at position 1 of T. Then it would proceed to compare S_1 with T_2, S_2 with T_3, and S_3 with T_4, and thus find a match at position 2 of T, and so on, until it finds all the matches. But we can do better than this, if we preprocess S and note that S_1 and S_2 are the same, and S_2 and S_3 are the same. That is, the prefix of length 2 in S matches the substring of length 2 starting at position 2 in S; S partially matches itself. Now, after finding that S_1, S_2, S_3 match T_1, T_2, T_3, respectively, we no longer care about T_1, since we are trying to find a match at position 2 now, but we still know that S_2, S_3 match T_2, T_3 respectively. Since we already know S_1 = S_2, S_2 = S_3, we now know that S_1, S_2 match T_2, T_3 respectively; there is no need to examine T_2 and T_3 again, as the naive algorithm would do. If we now check that S_3 matches T_4, then, after finding S at position 1 in T, we only need to do one more comparison (not three) to conclude that S also occurs at position 2 in T. So now we know that S_1, S_2, S_3 match T_2, T_3, T_4, respectively, which allows us to conclude that S_1, S_2 match T_3, T_4. Then we compare S_3 with T_5, and find another match, and so on. Whereas the naive algorithm needs three comparisons to find each occurrence of S in T, our technique only needs three comparisons to find the first occurrence, and only one for each after that, and doesn't go back to examine previous characters of T again. (This is how a human would probably do this search, too.)
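To make the comparison counts concrete, here is a small Python sketch (not part of the original article; the function name is illustrative) of the naive algorithm with a comparison counter. On this example it spends three comparisons per starting position, 21 in total, whereas the technique just described needs only 3 + 6 = 9.

```python
def naive_count(S, T):
    """Naive search: at each start position, compare until mismatch or full match.
    Returns (1-indexed match positions, total character comparisons performed)."""
    matches, comparisons = [], 0
    for start in range(len(T) - len(S) + 1):
        j = 0
        while j < len(S):
            comparisons += 1
            if T[start + j] != S[j]:
                break
            j += 1
        if j == len(S):
            matches.append(start + 1)  # report positions 1-indexed, as in the text
    return matches, comparisons

print(naive_count("aaa", "aaaaaaaaa"))  # ([1, 2, 3, 4, 5, 6, 7], 21)
```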

===Example 2===

Now let's search for the string S = aaa in the string T = aabaabaaa. Again, we start out the same way as in the naive algorithm, hence, we compare S_1 with T_1, S_2 with T_2, and S_3 with T_3. Here we find a mismatch between S and T, so S does not occur at position 1 in T. Now, the naive algorithm would continue by comparing S_1 with T_2 and S_2 with T_3, and would find a mismatch; then it would compare S_1 with T_3, and find a mismatch, and so on. But a human would notice that after the first mismatch, the possibilities of finding S at positions 2 and 3 in T are extinguished. This is because, as noted in Example 1, S_2 is the same as S_3, and since S_3 \neq T_3, S_2 \neq T_3 also (so we will not find S at position 2 of T). And, likewise, since S_1 = S_2, and S_2 \neq T_3, it is also true that S_1 \neq T_3, so it is pointless looking for a match at the third position of T. Thus, it would make sense to start comparing again at the fourth position of T (i.e., S_1, S_2, S_3 with T_4, T_5, T_6, respectively). Again finding a mismatch, we use similar reasoning to rule out the fifth and sixth positions in T, and begin matching again at T_7 (where we finally find a match). Again, notice that the characters of T were examined strictly in order.

===Example 3===

As a more complex example, imagine searching for the string S = tartan in the string T = tartaric_acid. We make the observation that the prefix of length 2 in S matches the substring of length 2 in S starting from position 4. Now, we start by comparing S_1, S_2, S_3, S_4, S_5, S_6 with T_1, T_2, T_3, T_4, T_5, T_6, respectively. We find that S_6 does not match T_6, so there is no match at position 1. At this point, we note that since S_1 \neq S_2 and S_1 \neq S_3, and S_2 = T_2, S_3 = T_3, obviously, S_1 \neq T_2 and S_1 \neq T_3, so there cannot be a match at position 2 or position 3. Now, recall that S_1 = S_4 and S_2 = S_5, and that S_4 = T_4, S_5 = T_5. We can translate this to S_1 = T_4, S_2 = T_5. So we proceed to compare S_3 with T_6. In this way, we have ruled out two possible positions, and we have restarted comparing not at the beginning of S but in the middle, avoiding re-examining T_4 and T_5.

==Concept==

The examples above show that the KMP algorithm relies on noticing that certain substrings of the needle match or do not match other substrings of the needle, but it is probably not clear what the unifying organizational principle for all this match information is. Here it is:

At each position i of S, find the longest substring of S ending at S_i that is also a prefix of S.

We shall denote the length of this substring by \pi_i, following [1]. We can also state the definition of \pi_i equivalently as the length of the longest proper suffix of the prefix of S of length i that is also a prefix of S.
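The definition can be checked directly by brute force before any clever algorithm is introduced. The Python sketch below (illustrative, not from the article) returns \pi as a 0-based list whose entry pi[i-1] is \pi_i; it is cubic-time and serves only to pin down the definition.

```python
def pi_by_definition(S):
    """pi_i = length of the longest proper suffix of S[1..i] that is also a prefix of S."""
    pi = []
    for i in range(1, len(S) + 1):
        best = 0
        for j in range(1, i):  # proper suffixes only, so j < i
            if S[:j] == S[i - j:i]:  # prefix of length j vs. substring of length j ending at S_i
                best = j
        pi.append(best)
    return pi

print(pi_by_definition("aaa"))     # [0, 1, 2]
print(pi_by_definition("tartan"))  # [0, 0, 0, 1, 2, 0]
```

Note that \pi_3 = 2 for aaa and \pi_5 = 2 for tartan, matching the facts used in the examples above.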

The table \pi, called the prefix function, occupies linear space and, as we shall see, can be computed in linear time. It contains all the information we need in order to execute the "smart" searching techniques described in the examples. In particular, in examples 1 and 2, we used the fact that \pi_3 = 2, that is, the prefix aa matches the suffix aa. In example 3, we used the fact that \pi_5 = 2, which tells us that the prefix ta matches the substring ta ending at the fifth position. In general, the table \pi tells us, after either a successful match or a mismatch, what the next position is that we should check in the haystack. Comparison proceeds from where it left off, never revisiting a character of the haystack after we have examined the next one.
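This revision stops short of giving the search procedure itself, so the following Python sketch (all names illustrative; the fallback loop follows the standard formulation in the CLRS reference cited below) shows one way the table can drive the scan: a counter j of currently matched pattern characters advances on a match and drops back through \pi on a mismatch, so no character of the haystack is ever revisited. The brute-force \pi computation is included only to keep the sketch self-contained.

```python
def pi_table(S):
    # Brute force, straight from the definition: pi[i-1] is pi_i (1-indexed in the text).
    return [max([j for j in range(1, i) if S[:j] == S[i - j:i]], default=0)
            for i in range(1, len(S) + 1)]

def kmp_search(S, T):
    """Return the 1-indexed positions at which S occurs in T, scanning T once."""
    pi = pi_table(S)
    matches = []
    j = 0  # number of pattern characters currently matched
    for i, c in enumerate(T):
        while j > 0 and S[j] != c:
            j = pi[j - 1]  # fall back to the next-longest prefix that still matches
        if S[j] == c:
            j += 1
        if j == len(S):
            matches.append(i - j + 2)  # convert 0-indexed end to 1-indexed start
            j = pi[j - 1]  # a full match is treated like a fallback; keep scanning
    return matches

print(kmp_search("aaa", "aaaaaaaaa"))  # [1, 2, 3, 4, 5, 6, 7] -- Example 1
print(kmp_search("aaa", "aabaabaaa"))  # [7] -- Example 2
```

On Example 2, after the mismatch at T_3 the counter falls from 2 to 0 via the table, exactly as argued above; no position of T is examined twice.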

==Computation of the prefix function==

To compute the prefix function, we shall first make the following observation:

Prefix function iteration lemma[1]: The sequence \pi^*_i = i, \pi_i, \pi_{\pi_i}, \pi_{\pi_{\pi_i}}, \ldots, 0 contains exactly those values j such that the prefix of S of length j matches the substring of length j ending at S_i.

That is, we can enumerate all substrings ending at S_i that are prefixes of S by starting with i, looking it up in the table \pi, looking up the result, looking up the result, and so on, giving a strictly decreasing sequence, terminating with zero.
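The enumeration the lemma describes is easy to sketch (illustrative Python; the brute-force \pi computation is inlined only for self-containment):

```python
def pi_star(S, i):
    """Return the sequence pi*_i = i, pi_i, pi_{pi_i}, ..., 0 for the pattern S."""
    pi = [max([j for j in range(1, k) if S[:j] == S[k - j:k]], default=0)
          for k in range(1, len(S) + 1)]  # pi by definition; pi[j-1] == pi_j
    seq = [i]
    while seq[-1] > 0:
        seq.append(pi[seq[-1] - 1])  # repeatedly look the last value up in the table
    return seq

print(pi_star("tartan", 5))  # [5, 2, 0]: "tarta", "ta", "" are the prefixes ending at position 5
print(pi_star("aaa", 3))     # [3, 2, 1, 0]
```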

Proof: We first show by induction that if j appears in the sequence \pi^*_i then the prefix S_1, S_2, \ldots, S_j matches the substring S_{i-j+1}, S_{i-j+2}, \ldots, S_i, i.e., j indeed belongs in the sequence \pi^*_i. Suppose j is the first entry in \pi^*_i. Then j = i and it trivially belongs (i.e., the prefix of length i matches itself). Now suppose j is not the first entry, but is preceded by an entry k which is valid. This means that the prefix of length j is a suffix of the prefix of length k. But the prefix of length k matches the substring of length k ending at S_i by assumption. So the prefix of length j is a suffix of the substring of length k ending at S_i. Therefore the prefix of length j matches the substring of length j ending at S_i (since j < k).
We now show by contradiction that if the prefix of length j matches the suffix of length j of the prefix of length i, i.e., if j "belongs" in the sequence \pi^*_i, then it appears in this sequence. Assume j does not appear in the sequence. Clearly 0 < j < i since 0 and i both appear. Since \pi^*_i is strictly decreasing, we can find exactly one k \in \pi^*_i such that k > j and \pi_k < j; that is, we can find exactly one k after which j "should" appear (to keep the sequence decreasing). But k appears in \pi^*_i, so by the first part the prefix of length k matches the substring of length k ending at S_i; and since j < k and j belongs, the prefix of length j is a proper suffix of the prefix of length k that is also a prefix of S. By the definition of \pi, this means \pi_k \geq j, contradicting \pi_k < j. Hence j must appear in the sequence after all.
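This revision does not reach the computation itself; following the lemma (and the standard treatment in the CLRS reference cited above), a linear-time sketch would maintain k = \pi_{i-1} while scanning the pattern, iterating the fallback \pi_k, \pi_{\pi_k}, \ldots until the matched prefix can be extended:

```python
def prefix_function(S):
    """Linear-time computation of pi; pi[i-1] here is pi_i in the article's notation."""
    n = len(S)
    pi = [0] * n
    k = 0  # length of the longest proper suffix-prefix of the prefix processed so far
    for i in range(1, n):
        while k > 0 and S[k] != S[i]:
            k = pi[k - 1]  # iterate the lemma: fall back to the next candidate length
        if S[k] == S[i]:
            k += 1  # the candidate prefix extends by one character
        pi[i] = k
    return pi

print(prefix_function("aaa"))     # [0, 1, 2]
print(prefix_function("tartan"))  # [0, 0, 0, 1, 2, 0]
```

Each fallback in the while loop strictly decreases k, and k increases by at most one per character, so the total work is linear in the length of S.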

==References==

  1. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein (2001). "Section 32.4: The Knuth-Morris-Pratt algorithm". Introduction to Algorithms (Second ed.). MIT Press and McGraw-Hill. pp. 923–931. ISBN 978-0-262-03293-3.