Editing String searching

Jump to: navigation, search

Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.

The edit can be undone. Please check the comparison below to verify that this is what you want to do, and then save the changes below to finish undoing the edit.
Latest revision Your text
Line 51: Line 51:
  
 
==Suffix data structures==
 
==Suffix data structures==
The final solution to multiple-pattern searching is given by data structures that index the suffixes of a string, namely, the [[suffix tree]] and the [[suffix array]]. The suffix tree is a compressed trie of all the suffixes of a given string. We can search for a substring simply by walking down starting from the root; if a corresponding branch ultimately does not exist then the needle is not found, and if it does, then all matches are to be found in the subtree rooted at the node we end up at. The suffix tree can be constructed in linear time, although the algorithm to do so is very complex; but constructing a suffix tree of the haystack allows a needle to be searched for in time proportional to its length (plus the number of matches, if we wish to return all of them). The suffix array is a sorted list of the suffixes of the string; if the needle is a prefix of any of the suffixes of the haystack then it is found at the positions at which those suffixes start, and clearly the matches will occur in a contiguous segment of the suffix array which can be determined using [[binary search]].
+
The final solution to multiple-pattern searching is given by data structures that index the suffixes of a string, namely, the [[suffix tree]] and the [[suffix array]]. The suffix tree is a compressed trie of all the suffixes of a given string. We can search for a substring simply by walking down starting from the root; if a corresponding branch ultimately does not exist then the needle is not found, and if it does, then all matches are to be found in the subtree rooted at the node we end up at. The suffix tree can be constructed in linear time, although the algorithm to do so is very complex; but it allows a needle to be searched for in time proportional to its length (plus the number of matches, if we wish to return all of them). The suffix array is a sorted list of the suffixes of the string; if the needle is a prefix of any of these then it is found at the positions at which those suffixes start, and clearly the matches will occur in a contiguous segment of the suffix array which can be determined using [[binary search]].
  
 
==Boyer–Moore algorithm==
 
==Boyer–Moore algorithm==
 
The [[Boyer–Moore algorithm]] is, on average, the fastest algorithm for single-pattern search. In the worst case it runs in linear time, but in the average case it examines only a fraction of the characters in the haystack. It does so by preprocessing the needle and then attempting to match the needle in ''reverse order'' while still considering possible match positions in the haystack in forward order. This allows it to rule out potential matches as soon as possible. For example, recall the example of looking for the needle '''AAAAAAA''' in the haystack '''AAAAAABAAAAAAA'''. Using KMP, we first match the first six characters of the needle to the first six characters of the haystack, at which point we discover that the seventh characters do not match. This allows us to rule out six other positions at which the needle might be found, so that we can resume processing the haystack at the first position ''after'' the '''B'''. The Boyer–Moore algorithm, on the other hand, would ''start'' by comparing AAAAAA'''A''' with AAAAAA'''B'''AAAAAAA, that is, by trying to match the ''last'' character of the needle. (If they match, then it would proceed to compare the second-last characters, and so on.) It would then find the mismatch immediately, and, using the information determined in the preprocessing stage, conclude the same information as KMP — that the search should resume starting at the eighth position of the haystack, at which point it would start matching from the end again, examining the fourteenth character first and then ultimately returning to the eighth character of the haystack and the first character of the needle, at which point a match is confirmed. Notice that Boyer–Moore did not even bother to examine the first six characters of the haystack. On average, then, the Boyer–Moore algorithm tends to run in time linear in the length of the needle plus sublinear in the length of the haystack.
 
The [[Boyer–Moore algorithm]] is, on average, the fastest algorithm for single-pattern search. In the worst case it runs in linear time, but in the average case it examines only a fraction of the characters in the haystack. It does so by preprocessing the needle and then attempting to match the needle in ''reverse order'' while still considering possible match positions in the haystack in forward order. This allows it to rule out potential matches as soon as possible. For example, recall the example of looking for the needle '''AAAAAAA''' in the haystack '''AAAAAABAAAAAAA'''. Using KMP, we first match the first six characters of the needle to the first six characters of the haystack, at which point we discover that the seventh characters do not match. This allows us to rule out six other positions at which the needle might be found, so that we can resume processing the haystack at the first position ''after'' the '''B'''. The Boyer–Moore algorithm, on the other hand, would ''start'' by comparing AAAAAA'''A''' with AAAAAA'''B'''AAAAAAA, that is, by trying to match the ''last'' character of the needle. (If they match, then it would proceed to compare the second-last characters, and so on.) It would then find the mismatch immediately, and, using the information determined in the preprocessing stage, conclude the same information as KMP — that the search should resume starting at the eighth position of the haystack, at which point it would start matching from the end again, examining the fourteenth character first and then ultimately returning to the eighth character of the haystack and the first character of the needle, at which point a match is confirmed. Notice that Boyer–Moore did not even bother to examine the first six characters of the haystack. On average, then, the Boyer–Moore algorithm tends to run in time linear in the length of the needle plus sublinear in the length of the haystack.

Please note that all contributions to PEGWiki are considered to be released under the Attribution 3.0 Unported (see PEGWiki:Copyrights for details). If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource. Do not submit copyrighted work without permission!

Cancel | Editing help (opens in new window)