Editing Knuth–Morris–Pratt algorithm

Jump to: navigation, search

Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.

The edit can be undone. Please check the comparison below to verify that this is what you want to do, and then save the changes below to finish undoing the edit.
Latest revision Your text
Line 5: Line 5:
  
 
===Example 1===
 
===Example 1===
In this example, we are searching for the string <math>S</math> = '''aaa''' in the string <math>T</math> = '''aaaaaaaaa''' (in which it occurs seven times). The naive algorithm would begin by comparing <math>S_1</math> with <math>T_1</math>, <math>S_2</math> with <math>T_2</math>, and <math>S_3</math> with <math>T_3</math>, and thus find a match for <math>S</math> at position 1 of <math>T</math>. Then it would proceed to compare <math>S_1</math> with <math>T_2</math>, <math>S_2</math> with <math>T_3</math>, and <math>S_3</math> with <math>T_4</math>, and thus find a match at position 2 of <math>T</math>, and so on, until it finds all the matches. But we can do better than this, if we preprocess <math>S</math> and note that <math>S_1</math> and <math>S_2</math> are the same, and <math>S_2</math> and <math>S_3</math> are the same. That is, the '''prefix''' of length 2 in <math>S</math> matches the '''substring''' of length 2 starting at position 2 in <math>S</math>; ''<math>S</math> partially matches itself''. Now, after finding that <math>S_1, S_2, S_3</math> match <math>T_1, T_2, T_3</math>, respectively, we no longer care about <math>T_1</math>, since we are trying to find a match at position 2 now, but we still know that <math>S_2, S_3</math> match <math>T_2, T_3</math> respectively. Since we already know <math>S_1 = S_2, S_2 = S_3</math>, we now know that <math>S_1, S_2</math> match <math>T_2, T_3</math> respectively; there is no need to examine <math>T_2</math> and <math>T_3</math> again, as the naive algorithm would do. If we now check that <math>S_3</math> matches <math>T_4</math>, then, after finding <math>S</math> at position 1 in <math>T</math>, we only need to do ''one'' more comparison (not three) to conclude that <math>S</math> also occurs at position 2 in <math>T</math>. So now we know that <math>S_1, S_2, S_3</math> match <math>T_2, T_3, T_4</math>, respectively, which allows us to conclude that <math>S_1, S_2</math> match <math>T_3, T_4</math>. Then we compare <math>S_3</math> with <math>T_5</math>, and find another match, and so on. Whereas the naive algorithm needs three comparisons to find each occurrence of <math>S</math> in <math>T</math>, our technique only needs three comparisons to find the ''first'' occurrence, and only one for each after that, and doesn't go back to examine previous characters of <math>T</math> again. (This is how a human would probably do this search, too.)
+
In this example, we are searching for the string <math>S</math> = '''aaa''' in the string <math>T</math> = '''aaaaaaaaa''' (in which it occurs seven times). The naive algorithm would begin by comparing <math>S_1</math> with <math>T_1</math>, <math>S_2</math> with <math>T_2</math>, and <math>S_3</math> with <math>T_3</math>, and thus find a match for <math>S</math> at position 1 of <math>T</math>. Then it would proceed to compare <math>S_1</math> with <math>T_2</math>, <math>S_2</math> with <math>T_3</math>, and <math>S_3</math> with <math>T_4</math>, and thus find a match at position 2 of <math>T</math>, and so on, until it finds all the matches. But we can do better than this, if we preprocess <math>S</math> and note that <math>S_1</math> and <math>S_2</math> are the same, and <math>S_2</math> and <math>S_3</math> are the same. That is, the '''prefix''' of length 2 in <math>S</math> matches the '''substring''' of length 2 starting at position 2 in <math>S</math>; ''<math>S</math> partially matches itself''. Now, after finding that <math>S_1, S_2, S_3</math> match <math>T_1, T_2, T_3</math>, respectively, we no longer care about <math>T_1</math>, since we are trying to find a match at position 2 now, but we still know that <math>S_2, S_3</math> match <math>T_2, T_3</math> respectively. Since we already know <math>S_1 = S_2, S_2 = S_3</math>, we now know that <math>S_1, S_2</math> match <math>T_2, T_3</math> respectively; there is no need to examine <math>T_2</math> and <math>T_3</math> again, as the naive algorithm would do. If we now check that <math>S_3</math> matches <math>T_4</math>, then, after finding <math>S</math> at position 1 in <math>T</math>, we only need to do ''one'' more comparison (not three) to conclude that <math>S</math> also occurs at position 2 in <math>T</math>. So now we know that <math>S_1, S_2, S_3</math> match <math>T_2, T_3, T_4</math>, respectively, which allows us to conclude that <math>S_1, S_2</math> match <math>T_3, T_4</math>. Then we compare <math>S_3</math> with <math>T_5</math>, and find another match, and so on. Whereas the naive algorithm needs three comparisons to find each occurrence of <math>S</math> in <math>T</math>, our technique only needs three comparisons to find the ''first'' occurrence, and only one for each after that, and doesn't go back to examine previous characters of <math>T</math> again. (This is how a human would probably do this search, too.
 
+
 
===Example 2===
 
===Example 2===
 
Now let's search for the string <math>S</math> = '''aaa''' in the string <math>T</math> = '''aabaabaaa'''. Again, we start out the same way as in the naive algorithm, hence, we compare <math>S_1</math> with <math>T_1</math>, <math>S_2</math> with <math>T_2</math>, and <math>S_3</math> with <math>T_3</math>. Here we find a mismatch between <math>S</math> and <math>T</math>, so <math>S</math> does ''not'' occur at position 1 in <math>T</math>. Now, the naive algorithm would continue by comparing <math>S_1</math> with <math>T_2</math> and <math>S_2</math> with <math>T_3</math>, and would find a mismatch; then it would compare <math>S_1</math> with <math>T_3</math>, and find a mismatch, and so on. But a human would notice that after the first mismatch, the possibilities of finding <math>S</math> at positions 2 and 3 in <math>T</math> are extinguished. This is because, as noted in Example 1, <math>S_2</math> is the same as <math>S_3</math>, and since <math>S_3 \neq T_3</math>, <math>S_2 \neq T_3</math> also (so we will not find <math>S</math> at position 2 of <math>T</math>). And, likewise, since <math>S_1 \neq S_2</math>, and <math>S_2 \neq T_3</math>, it is also true that <math>S_1 \neq T_3</math>, so it is pointless looking for a match at the third position of <math>T</math>. Thus, it would make sense to start comparing again at the fourth position of <math>T</math> (''i.e.'', <math>S_1, S_2, S_3</math> with <math>T_4, T_5, T_6</math>, respectively). Again finding a mismatch, we use similar reasoning to rule out the fifth and sixth positions in <math>T</math>, and begin matching again at <math>T_7</math> (where we finally find a match.) Again, notice that the characters of <math>T</math> were examined strictly in order.
 
Now let's search for the string <math>S</math> = '''aaa''' in the string <math>T</math> = '''aabaabaaa'''. Again, we start out the same way as in the naive algorithm, hence, we compare <math>S_1</math> with <math>T_1</math>, <math>S_2</math> with <math>T_2</math>, and <math>S_3</math> with <math>T_3</math>. Here we find a mismatch between <math>S</math> and <math>T</math>, so <math>S</math> does ''not'' occur at position 1 in <math>T</math>. Now, the naive algorithm would continue by comparing <math>S_1</math> with <math>T_2</math> and <math>S_2</math> with <math>T_3</math>, and would find a mismatch; then it would compare <math>S_1</math> with <math>T_3</math>, and find a mismatch, and so on. But a human would notice that after the first mismatch, the possibilities of finding <math>S</math> at positions 2 and 3 in <math>T</math> are extinguished. This is because, as noted in Example 1, <math>S_2</math> is the same as <math>S_3</math>, and since <math>S_3 \neq T_3</math>, <math>S_2 \neq T_3</math> also (so we will not find <math>S</math> at position 2 of <math>T</math>). And, likewise, since <math>S_1 \neq S_2</math>, and <math>S_2 \neq T_3</math>, it is also true that <math>S_1 \neq T_3</math>, so it is pointless looking for a match at the third position of <math>T</math>. Thus, it would make sense to start comparing again at the fourth position of <math>T</math> (''i.e.'', <math>S_1, S_2, S_3</math> with <math>T_4, T_5, T_6</math>, respectively). Again finding a mismatch, we use similar reasoning to rule out the fifth and sixth positions in <math>T</math>, and begin matching again at <math>T_7</math> (where we finally find a match.) Again, notice that the characters of <math>T</math> were examined strictly in order.
Line 31: Line 30:
  
 
Here is the pseudocode:
 
Here is the pseudocode:
<pre>
 
&pi;[1] &larr; 0
 
for i &isin; [2..m]
 
    k &larr; &pi;[i-1]
 
    while k &gt; 0 and S[k+1] &ne; S[i]
 
        k &larr; &pi;[k]
 
    if S[k+1] = S[i]
 
        &pi;[i] &larr; 0
 
    else
 
        &pi;[i] &larr; k+1
 
</pre>
 
 
With a little bit of thought, this can be re-written as follows:
 
 
<pre>
 
<pre>
 
&pi;[1] &larr; 0
 
&pi;[1] &larr; 0
Line 50: Line 36:
 
     while k &gt; 0 and S[k+1] &ne; S[i]
 
     while k &gt; 0 and S[k+1] &ne; S[i]
 
         k &larr; &pi;[k]
 
         k &larr; &pi;[k]
     if S[k+1] = S[i]
+
     if k &ne; 0
 
         k &larr; k+1
 
         k &larr; k+1
 
     &pi;[i] &larr; k
 
     &pi;[i] &larr; k
Line 63: Line 49:
 
Think carefully about what this means. If <math>p > 0</math> does not satisfy the statement that <math>S_1, \ldots, S_{k-p}</math> match <math>T_{j+p}, \ldots, T_{j+k-1}</math>, then the needle <math>S</math> ''does not match'' <math>T</math> at position <math>j+p</math>, ''i.e.'', we can ''rule out'' the position <math>j+p</math>. On the other hand, if <math>p > 0</math> ''does'' satisfy this statement, then <math>S</math> ''might'' match <math>T</math> at position <math>j+p</math>, and, in fact, all the characters up to but not including <math>T_{j+k}</math> have already been verified to match the corresponding characters in <math>S</math>, so we can proceed by comparing <math>S_{k-p+1}</math> with <math>T_{j+k}</math>, and, as promised, ''never need to look back''.
 
Think carefully about what this means. If <math>p > 0</math> does not satisfy the statement that <math>S_1, \ldots, S_{k-p}</math> match <math>T_{j+p}, \ldots, T_{j+k-1}</math>, then the needle <math>S</math> ''does not match'' <math>T</math> at position <math>j+p</math>, ''i.e.'', we can ''rule out'' the position <math>j+p</math>. On the other hand, if <math>p > 0</math> ''does'' satisfy this statement, then <math>S</math> ''might'' match <math>T</math> at position <math>j+p</math>, and, in fact, all the characters up to but not including <math>T_{j+k}</math> have already been verified to match the corresponding characters in <math>S</math>, so we can proceed by comparing <math>S_{k-p+1}</math> with <math>T_{j+k}</math>, and, as promised, ''never need to look back''.
 
:''Proof'': Let <math>0 \leq q < k</math>. If <math>S^q \sqsupset S^k</math>, then by definition we have <math>S_1, \ldots, S_q = S_{k-q+1}, \ldots, S_k</math>. But since <math>S_1, \ldots, S_k = T_j, \ldots, S_{j+k-1}</math>, it is also true that <math>S_{k-q+1}, \ldots, S_k = T_{j+k-q}, \ldots, T_{j+k-1}</math>. Therefore <math>S_1, \ldots, S_q = T_{j+k-q}, \ldots, T_{j+k-1}</math>. If, on the other hand, it is not true that <math>S^q \sqsupset S^k</math>, then it is not true that <math>S_1, \ldots, S_q = S_{k-q+1}, \ldots, S_k</math>, so it is not true that <math>S_{k-q+1}, \ldots, S_k = T_{j+k-q}, \ldots, T_{j+k-1}</math>, so it is not true that <math>S_1, \ldots, S_q = T_{j+k-q}, \ldots, T_{j+k-1}</math>. Therefore <math>k-q</math> is a possible value of <math>p</math> if and only if <math>S^q \sqsupset S^k</math>. Since the maximum possible value of <math>q</math> is <math>\pi_k</math>, the minimum possible value of <math>p</math> is given by <math>k-\pi_k</math>. <math>_\blacksquare</math>
 
:''Proof'': Let <math>0 \leq q < k</math>. If <math>S^q \sqsupset S^k</math>, then by definition we have <math>S_1, \ldots, S_q = S_{k-q+1}, \ldots, S_k</math>. But since <math>S_1, \ldots, S_k = T_j, \ldots, S_{j+k-1}</math>, it is also true that <math>S_{k-q+1}, \ldots, S_k = T_{j+k-q}, \ldots, T_{j+k-1}</math>. Therefore <math>S_1, \ldots, S_q = T_{j+k-q}, \ldots, T_{j+k-1}</math>. If, on the other hand, it is not true that <math>S^q \sqsupset S^k</math>, then it is not true that <math>S_1, \ldots, S_q = S_{k-q+1}, \ldots, S_k</math>, so it is not true that <math>S_{k-q+1}, \ldots, S_k = T_{j+k-q}, \ldots, T_{j+k-1}</math>, so it is not true that <math>S_1, \ldots, S_q = T_{j+k-q}, \ldots, T_{j+k-1}</math>. Therefore <math>k-q</math> is a possible value of <math>p</math> if and only if <math>S^q \sqsupset S^k</math>. Since the maximum possible value of <math>q</math> is <math>\pi_k</math>, the minimum possible value of <math>p</math> is given by <math>k-\pi_k</math>. <math>_\blacksquare</math>
Thus, here is the matching algorithm in pseudocode:
 
<pre>
 
j &larr; 1
 
k &larr; 0
 
while j+m-1 &le; n
 
    while k &le; m and S[k+1] = T[j+k]
 
        k &larr; k+1
 
    if k = m
 
        print "Match at position " j
 
    if k = 0
 
        j &larr; j+1
 
    else
 
        j &larr; j+k-&pi;[k]
 
        k &larr; &pi;[k]
 
</pre>
 
Thus, we scan the text one character at a time; the current character being examined is located at position <math>j+k</math>. When there is a mismatch, we use the <math>\pi</math> table to look up the next possible position at which the match might occur, and try to proceed.
 
 
The fact that the algorithm scans one character at a time without looking back is more obvious when the code is cast into this equivalent form:<ref name="CLRS"></ref>
 
<pre>
 
k &larr; 0
 
for i &isin; [1..n]
 
    while k &gt; 0 and S[k+1] &ne; T[i]
 
        k &larr; &pi;[k]
 
    if S[k+1] = T[i]
 
        k &larr; k+1
 
    if k = m
 
        print "Match at position " i-m+1
 
        k &larr; &pi;[k]
 
</pre>
 
Here, <code>i</code> is identified with <code>j+k</code> as above. Each iteration of the inner loop in one of these two segments corresponds to an iteration of the outer loop in the other. In this second form, we can also prove that the algorithm takes <math>O(n)</math> time; each time the inner while loop is executed, the value of <code>k</code> decreases, but it cannot decrease more than <math>n</math> times because it starts as zero, is never negative, and is increased at most once per iteration of the outer loop (''i.e.'', at most <math>n</math> times in total), hence the inner loop is only executed up to <math>n</math> times. All other operations in the outer loop take constant time.
 
  
 
==References==
 
==References==
 
<references/>
 
<references/>
 
==External links==
 
* {{SPOJ|NHAY|A Needle in the Haystack}}
 

Please note that all contributions to PEGWiki are considered to be released under the Attribution 3.0 Unported (see PEGWiki:Copyrights for details). If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource. Do not submit copyrighted work without permission!

Cancel | Editing help (opens in new window)

Template used on this page: