Editing Knuth–Morris–Pratt algorithm
Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.
The edit can be undone.
Please check the comparison below to verify that this is what you want to do, and then save the changes below to finish undoing the edit.
Latest revision | Your text | ||
Line 5: | Line 5: | ||
===Example 1=== | ===Example 1=== | ||
− | In this example, we are searching for the string <math>S</math> = '''aaa''' in the string <math>T</math> = '''aaaaaaaaa''' (in which it occurs seven times). The naive algorithm would begin by comparing <math>S_1</math> with <math>T_1</math>, <math>S_2</math> with <math>T_2</math>, and <math>S_3</math> with <math>T_3</math>, and thus find a match for <math>S</math> at position 1 of <math>T</math>. Then it would proceed to compare <math>S_1</math> with <math>T_2</math>, <math>S_2</math> with <math>T_3</math>, and <math>S_3</math> with <math>T_4</math>, and thus find a match at position 2 of <math>T</math>, and so on, until it finds all the matches. But we can do better than this, if we preprocess <math>S</math> and note that <math>S_1</math> and <math>S_2</math> are the same, and <math>S_2</math> and <math>S_3</math> are the same. That is, the '''prefix''' of length 2 in <math>S</math> matches the '''substring''' of length 2 starting at position 2 in <math>S</math>; ''<math>S</math> partially matches itself''. Now, after finding that <math>S_1, S_2, S_3</math> match <math>T_1, T_2, T_3</math>, respectively, we no longer care about <math>T_1</math>, since we are trying to find a match at position 2 now, but we still know that <math>S_2, S_3</math> match <math>T_2, T_3</math> respectively. Since we already know <math>S_1 = S_2, S_2 = S_3</math>, we now know that <math>S_1, S_2</math> match <math>T_2, T_3</math> respectively; there is no need to examine <math>T_2</math> and <math>T_3</math> again, as the naive algorithm would do. If we now check that <math>S_3</math> matches <math>T_4</math>, then, after finding <math>S</math> at position 1 in <math>T</math>, we only need to do ''one'' more comparison (not three) to conclude that <math>S</math> also occurs at position 2 in <math>T</math>. So now we know that <math>S_1, S_2, S_3</math> match <math>T_2, T_3, T_4</math>, respectively, which allows us to conclude that <math>S_1, S_2</math> match <math>T_3, T_4</math>. Then we compare <math>S_3</math> with <math>T_5</math>, and find another match, and so on. Whereas the naive algorithm needs three comparisons to find each occurrence of <math>S</math> in <math>T</math>, our technique only needs three comparisons to find the ''first'' occurrence, and only one for each after that, and doesn't go back to examine previous characters of <math>T</math> again. (This is how a human would probably do this search, too. | + | In this example, we are searching for the string <math>S</math> = '''aaa''' in the string <math>T</math> = '''aaaaaaaaa''' (in which it occurs seven times). The naive algorithm would begin by comparing <math>S_1</math> with <math>T_1</math>, <math>S_2</math> with <math>T_2</math>, and <math>S_3</math> with <math>T_3</math>, and thus find a match for <math>S</math> at position 1 of <math>T</math>. Then it would proceed to compare <math>S_1</math> with <math>T_2</math>, <math>S_2</math> with <math>T_3</math>, and <math>S_3</math> with <math>T_4</math>, and thus find a match at position 2 of <math>T</math>, and so on, until it finds all the matches. But we can do better than this, if we preprocess <math>S</math> and note that <math>S_1</math> and <math>S_2</math> are the same, and <math>S_2</math> and <math>S_3</math> are the same. That is, the '''prefix''' of length 2 in <math>S</math> matches the '''substring''' of length 2 starting at position 2 in <math>S</math>; ''<math>S</math> partially matches itself''. Now, after finding that <math>S_1, S_2, S_3</math> match <math>T_1, T_2, T_3</math>, respectively, we no longer care about <math>T_1</math>, since we are trying to find a match at position 2 now, but we still know that <math>S_2, S_3</math> match <math>T_2, T_3</math> respectively. Since we already know <math>S_1 = S_2, S_2 = S_3</math>, we now know that <math>S_1, S_2</math> match <math>T_2, T_3</math> respectively; there is no need to examine <math>T_2</math> and <math>T_3</math> again, as the naive algorithm would do. If we now check that <math>S_3</math> matches <math>T_4</math>, then, after finding <math>S</math> at position 1 in <math>T</math>, we only need to do ''one'' more comparison (not three) to conclude that <math>S</math> also occurs at position 2 in <math>T</math>. So now we know that <math>S_1, S_2, S_3</math> match <math>T_2, T_3, T_4</math>, respectively, which allows us to conclude that <math>S_1, S_2</math> match <math>T_3, T_4</math>. Then we compare <math>S_3</math> with <math>T_5</math>, and find another match, and so on. Whereas the naive algorithm needs three comparisons to find each occurrence of <math>S</math> in <math>T</math>, our technique only needs three comparisons to find the ''first'' occurrence, and only one for each after that, and doesn't go back to examine previous characters of <math>T</math> again. (This is how a human would probably do this search, too. |
− | + | ||
===Example 2=== | ===Example 2=== | ||
Now let's search for the string <math>S</math> = '''aaa''' in the string <math>T</math> = '''aabaabaaa'''. Again, we start out the same way as in the naive algorithm, hence, we compare <math>S_1</math> with <math>T_1</math>, <math>S_2</math> with <math>T_2</math>, and <math>S_3</math> with <math>T_3</math>. Here we find a mismatch between <math>S</math> and <math>T</math>, so <math>S</math> does ''not'' occur at position 1 in <math>T</math>. Now, the naive algorithm would continue by comparing <math>S_1</math> with <math>T_2</math> and <math>S_2</math> with <math>T_3</math>, and would find a mismatch; then it would compare <math>S_1</math> with <math>T_3</math>, and find a mismatch, and so on. But a human would notice that after the first mismatch, the possibilities of finding <math>S</math> at positions 2 and 3 in <math>T</math> are extinguished. This is because, as noted in Example 1, <math>S_2</math> is the same as <math>S_3</math>, and since <math>S_3 \neq T_3</math>, <math>S_2 \neq T_3</math> also (so we will not find <math>S</math> at position 2 of <math>T</math>). And, likewise, since <math>S_1 \neq S_2</math>, and <math>S_2 \neq T_3</math>, it is also true that <math>S_1 \neq T_3</math>, so it is pointless looking for a match at the third position of <math>T</math>. Thus, it would make sense to start comparing again at the fourth position of <math>T</math> (''i.e.'', <math>S_1, S_2, S_3</math> with <math>T_4, T_5, T_6</math>, respectively). Again finding a mismatch, we use similar reasoning to rule out the fifth and sixth positions in <math>T</math>, and begin matching again at <math>T_7</math> (where we finally find a match.) Again, notice that the characters of <math>T</math> were examined strictly in order. | Now let's search for the string <math>S</math> = '''aaa''' in the string <math>T</math> = '''aabaabaaa'''. Again, we start out the same way as in the naive algorithm, hence, we compare <math>S_1</math> with <math>T_1</math>, <math>S_2</math> with <math>T_2</math>, and <math>S_3</math> with <math>T_3</math>. Here we find a mismatch between <math>S</math> and <math>T</math>, so <math>S</math> does ''not'' occur at position 1 in <math>T</math>. Now, the naive algorithm would continue by comparing <math>S_1</math> with <math>T_2</math> and <math>S_2</math> with <math>T_3</math>, and would find a mismatch; then it would compare <math>S_1</math> with <math>T_3</math>, and find a mismatch, and so on. But a human would notice that after the first mismatch, the possibilities of finding <math>S</math> at positions 2 and 3 in <math>T</math> are extinguished. This is because, as noted in Example 1, <math>S_2</math> is the same as <math>S_3</math>, and since <math>S_3 \neq T_3</math>, <math>S_2 \neq T_3</math> also (so we will not find <math>S</math> at position 2 of <math>T</math>). And, likewise, since <math>S_1 \neq S_2</math>, and <math>S_2 \neq T_3</math>, it is also true that <math>S_1 \neq T_3</math>, so it is pointless looking for a match at the third position of <math>T</math>. Thus, it would make sense to start comparing again at the fourth position of <math>T</math> (''i.e.'', <math>S_1, S_2, S_3</math> with <math>T_4, T_5, T_6</math>, respectively). Again finding a mismatch, we use similar reasoning to rule out the fifth and sixth positions in <math>T</math>, and begin matching again at <math>T_7</math> (where we finally find a match.) Again, notice that the characters of <math>T</math> were examined strictly in order. | ||
Line 28: | Line 27: | ||
:''Proof'': We first show by induction that if <math>j</math> appears in the sequence <math>\pi^*_i</math> then <math>S^j \sqsupset S^i</math>, ''i.e'', <math>j</math> indeed belongs in the sequence <math>\pi^*_i</math>. Suppose <math>j</math> is the first entry in <math>\pi^*_i</math>. Then <math>j = i</math> and it is trivially true that <math>S^j \sqsupset S^i</math>. Now suppose <math>j</math> is not the first entry, but is preceded by the entry <math>k</math> which is valid. That is, <math>\pi_k = j</math>. By definition, <math>S^j \sqsupset S^k</math>. But <math>S^k \sqsupset S^i</math> by assumption. Since <math>\sqsupset</math> is [[Partial order|transitive]], <math>S^j \sqsupset S^i</math>. | :''Proof'': We first show by induction that if <math>j</math> appears in the sequence <math>\pi^*_i</math> then <math>S^j \sqsupset S^i</math>, ''i.e'', <math>j</math> indeed belongs in the sequence <math>\pi^*_i</math>. Suppose <math>j</math> is the first entry in <math>\pi^*_i</math>. Then <math>j = i</math> and it is trivially true that <math>S^j \sqsupset S^i</math>. Now suppose <math>j</math> is not the first entry, but is preceded by the entry <math>k</math> which is valid. That is, <math>\pi_k = j</math>. By definition, <math>S^j \sqsupset S^k</math>. But <math>S^k \sqsupset S^i</math> by assumption. Since <math>\sqsupset</math> is [[Partial order|transitive]], <math>S^j \sqsupset S^i</math>. | ||
:We now show by contradiction that if <math>S^j \sqsupset S^i</math>, then <math>j \in \pi^*_i</math>. Assume <math>j</math> does not appear in the sequence. Clearly <math>0 < j < i</math> since 0 and <math>i</math> both appear. Since <math>\pi^*_i</math> is strictly decreasing, we can find exactly one <math>k \in \pi^*_i</math> such that <math>k > j</math> and <math>\pi_k < j</math>; that is, we can find exactly one <math>k</math> after which <math>j</math> "should" appear (to keep the sequence decreasing). We know from the first part of the proof that <math>S^k \sqsupset S^i</math>. Since the suffix of <math>S^i</math> of length <math>j</math> is a suffix of the suffix of <math>S^i</math> of length <math>k</math>, it follows that the suffix of <math>S^i</math> of length <math>j</math> matches the suffix of length <math>j</math> of <math>S^k</math>. But the suffix of <math>S^i</math> of length <math>j</math> also matches <math>S^j</math>, so <math>S^j</math> matches the suffix of <math>S^k</math> of length <math>j</math>. We therefore conclude that <math>\pi_k \geq j</math>. But <math>j > \pi_k</math>, a contradiction. <math>_\blacksquare</math> | :We now show by contradiction that if <math>S^j \sqsupset S^i</math>, then <math>j \in \pi^*_i</math>. Assume <math>j</math> does not appear in the sequence. Clearly <math>0 < j < i</math> since 0 and <math>i</math> both appear. Since <math>\pi^*_i</math> is strictly decreasing, we can find exactly one <math>k \in \pi^*_i</math> such that <math>k > j</math> and <math>\pi_k < j</math>; that is, we can find exactly one <math>k</math> after which <math>j</math> "should" appear (to keep the sequence decreasing). We know from the first part of the proof that <math>S^k \sqsupset S^i</math>. Since the suffix of <math>S^i</math> of length <math>j</math> is a suffix of the suffix of <math>S^i</math> of length <math>k</math>, it follows that the suffix of <math>S^i</math> of length <math>j</math> matches the suffix of length <math>j</math> of <math>S^k</math>. But the suffix of <math>S^i</math> of length <math>j</math> also matches <math>S^j</math>, so <math>S^j</math> matches the suffix of <math>S^k</math> of length <math>j</math>. We therefore conclude that <math>\pi_k \geq j</math>. But <math>j > \pi_k</math>, a contradiction. <math>_\blacksquare</math> | ||
− | With this in mind, we can design an algorithm to compute the table <math>\pi</math>. For each <math>i</math>, we will first try to find | + | With this in mind, we can design an algorithm to compute the table <math>\pi</math>. For each <math>i</math>, we will first try to find a nonempty proper suffix of the prefix of length <math>i</math> that is also a prefix of <math>S</math>. If we fail to do so, we will conclude that <math>\pi_i = 0</math> (clearly this is the case when <math>i = 1</math>.) Observe that if we do find a nonempty proper suffix of <math>S_i</math> that is also a prefix of <math>S</math>, and that this suffix has length <math>k</math>, then, by removing the last character from this suffix, we obtain a nonempty proper suffix of the prefix of length <math>i-1</math> that is also a prefix of <math>S</math>. Therefore, if we enumerate ''all'' nonempty proper suffixes of the prefix of length <math>i-1</math> that are also prefixes of <math>S</math>, the ''longest'' of these ''that can be extended'' (because the character after this prefix equals the character <math>S_i</math>, so that it can be added to the end of both the prefix and the suffix to give a suffix of the prefix of length <math>i</math> that is also a prefix of <math>S</math>) gives the value of <math>\pi_i</math>. So we will let <math>k = \pi_{i-1}</math> and keep iterating through the sequence <math>\pi_k, \pi_{\pi_k}, \ldots</math>. We stop if we reach element <math>j</math> in this sequence such that <math>S_{j+1} = S_i</math>, and declare <math>\pi_i = j+1</math>; this always gives an optimal solution since the sequence <math>\pi^*_i</math> is decreasing. If we exhaust the sequence, then <math>\pi_i = 0</math>. |
Here is the pseudocode: | Here is the pseudocode: | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
<pre> | <pre> | ||
π[1] ← 0 | π[1] ← 0 | ||
Line 50: | Line 36: | ||
while k > 0 and S[k+1] ≠ S[i] | while k > 0 and S[k+1] ≠ S[i] | ||
k ← π[k] | k ← π[k] | ||
− | if | + | if k ≠ 0 |
k ← k+1 | k ← k+1 | ||
π[i] ← k | π[i] ← k | ||
</pre> | </pre> | ||
This algorithm takes <math>O(m)</math> time to execute. To see this, notice that the value of <code>k</code> is never negative; therefore it cannot decrease more than it increases. It only increases in the line <code>k ← k+1</code>, which can only be executed up to <math>m-1</math> times. Therefore <code>k</code> can be decreased at most <code>k</code> times. But <code>k</code> is decreased in each iteration of the while loop, so the while loop can only run a linear number of times overall. All the other instructions in the for loop take constant time per iteration, so linear time overall. | This algorithm takes <math>O(m)</math> time to execute. To see this, notice that the value of <code>k</code> is never negative; therefore it cannot decrease more than it increases. It only increases in the line <code>k ← k+1</code>, which can only be executed up to <math>m-1</math> times. Therefore <code>k</code> can be decreased at most <code>k</code> times. But <code>k</code> is decreased in each iteration of the while loop, so the while loop can only run a linear number of times overall. All the other instructions in the for loop take constant time per iteration, so linear time overall. | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
==References== | ==References== | ||
<references/> | <references/> | ||
− | |||
− | |||
− |