Editing Longest common subsequence

{{Distinguish|Longest common substring}}
 
 
 
The '''longest common subsequence''' (LCS) problem is the problem of finding a sequence of maximal length that is a [[subsequence]] of two finite input sequences. Formally, given two sequences <math>x = [x_1, x_2, ..., x_m]</math> and <math>y = [y_1, y_2, ..., y_n]</math>, we would like to find two sets of indices <math>i_1 < i_2 < ... < i_k</math> and <math>j_1 < j_2 < ... < j_k</math> such that <math>x_{i_p} = y_{j_p}</math> for all <math>1 \leq p \leq k</math> and <math>k</math> is maximized.
 
 
The function <math>\operatorname{LCS\_len}</math> defined above takes constant time to transition from one state to another, and there are <math>(m+1)(n+1)</math> states, so the time taken overall is <math>\Theta(mn)</math>.
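The counting argument above can be made concrete with a short Python sketch that fills the table bottom-up. It assumes the standard recurrence for <code>LCS_len</code> (a match extends the diagonal entry; otherwise the larger of the two neighbouring entries is taken); the function name <code>lcs_len_table</code> is chosen here for illustration only.

```python
def lcs_len_table(x, y):
    """Return the (m+1) x (n+1) table of LCS lengths of all prefix pairs."""
    m, n = len(x), len(y)
    # table[i][j] = length of an LCS of x[:i] and y[:j]; row 0 and column 0 stay 0.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                # Constant-time transition: extend the LCS of the shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Constant-time transition: drop the last element of x or of y.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


table = lcs_len_table([9, 2, 3, 6, 1], [2, 0, 6, 1, 3])
print(table[-1][-1])  # 3
```

Each of the <math>(m+1)(n+1)</math> entries is written once using a constant number of operations, which is where the <math>\Theta(mn)</math> bound comes from.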
 
The function <math>\operatorname{LCS}</math> as written above may take longer to compute if we naively store an array of strings where each entry is the longest common subsequence of some subinstance, because each transition then copies a string, and copying a string of length <math>k</math> takes <math>\Theta(k)</math> rather than constant time. If we actually wish to reconstruct a longest common subsequence, we may compute the <math>\operatorname{LCS\_len}</math> table first, recording for each entry which subinstance's solution was used to compute it, and then backtrack from <math>\operatorname{LCS\_len}(m,n)</math> once the table is complete. Filling the table still takes <math>\Theta(mn)</math> time and the backtracking adds only <math>O(m+n)</math> steps, so the total remains <math>\Theta(mn)</math>.
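A minimal Python sketch of this reconstruction follows. Instead of storing an explicit note for every entry, it re-derives the choice from the neighbouring table values during backtracking, which is an equivalent variant chosen here for brevity. The standard recurrence is assumed, and the name <code>lcs</code> is illustrative.

```python
def lcs(x, y):
    """Return one longest common subsequence of x and y as a list."""
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Backtrack from (m, n); each step decreases i, j, or both,
    # so at most m + n steps are taken.
    result = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            result.append(x[i - 1])   # this element belongs to the LCS
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1                    # the value came from the row above
        else:
            j -= 1                    # the value came from the column to the left
    result.reverse()                  # elements were collected back to front
    return result


print(lcs([9, 2, 3, 6, 1], [2, 0, 6, 1, 3]))  # [2, 6, 1]
```

No strings are copied while the table is being filled; the subsequence is assembled only once, at the end.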
 
Using the recursive definition of <math>\operatorname{LCS\_len}</math> as written may be faster (provided memoization is used), because we may not have to compute all values that we would compute in the dynamic solution. The runtime is then <math>O(mn)</math> (no improvement is seen in the worst case).
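The following Python sketch illustrates the point using <code>functools.lru_cache</code> for memoization. The recurrence is again assumed to be the standard one, and <code>lcs_len_memo</code> is a name made up for this example; it returns both the length and the number of distinct states that were actually evaluated.

```python
from functools import lru_cache


def lcs_len_memo(x, y):
    """Return (LCS length, number of distinct (i, j) states evaluated)."""

    @lru_cache(maxsize=None)
    def go(i, j):
        # go(i, j) = length of an LCS of x[:i] and y[:j]
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:
            # On a match only the diagonal subinstance is explored.
            return go(i - 1, j - 1) + 1
        return max(go(i - 1, j), go(i, j - 1))

    length = go(len(x), len(y))
    return length, go.cache_info().currsize


# Identical inputs: only the diagonal is visited, 5 states instead of 25.
print(lcs_len_memo("abcd", "abcd"))  # (4, 5)
```

For long inputs the recursion depth can approach <math>m+n</math>, which may exceed Python's default recursion limit; this is a limitation of the sketch rather than of the method.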
 
 
In the dynamic solution, if we desire only the length of the LCS, we notice that we only need to keep two rows of the table at any given time, since we never look back at <math>p</math>-values smaller than the current <math>p</math> minus one. Keeping only two rows may also be more cache-friendly, and hence yield a constant-factor speedup.
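A sketch of the two-row variant in Python, under the same assumed recurrence (the name <code>lcs_len_two_rows</code> is illustrative). It swaps the inputs so that the rows have the length of the shorter sequence, which is an extra refinement not required by the argument above.

```python
def lcs_len_two_rows(x, y):
    """Return the LCS length of x and y, keeping only two table rows."""
    if len(y) > len(x):
        x, y = y, x                  # make the rows as short as possible
    prev = [0] * (len(y) + 1)        # row for the previous prefix of x
    for xi in x:
        curr = [0]                   # column 0 is always 0
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr                  # the older row is no longer needed
    return prev[-1]


print(lcs_len_two_rows([9, 2, 3, 6, 1], [2, 0, 6, 1, 3]))  # 3
```

Because the earlier rows are discarded, this version cannot be used on its own to reconstruct the subsequence by backtracking.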
 
 
====Memory====
 
 
The memory usage is <math>\Theta(mn)</math> for the dynamic solution. The recursive solution may use less memory if hashing techniques are used to implement memoization.
 
 
==Faster methods==
 
 
The dynamic algorithm presented above always takes the same amount of time to run (to within a constant factor) and it does not depend on the size of the set from which the elements of the sequences are taken. Faster algorithms exist for cases where this set is small and finite (making the sequences strings over a relatively small alphabet), one of the strings is much longer than the other, or the two strings are very similar.<ref name="BHR00">L. Bergroth and H. Hakonen and T. Raita (2000). "A Survey of Longest Common Subsequence Algorithms". ''SPIRE'' (IEEE Computer Society) '''00''': 39–48. doi:[http://dx.doi.org/10.1109%2FSPIRE.2000.878178 10.1109/SPIRE.2000.878178]. ISBN 0-7695-0746-8.</ref> These algorithms almost always outperform the naive algorithm in practice (otherwise the <code>diff</code> utility would be unacceptably slow). However, it is surprisingly difficult to improve over the <math>O(mn)</math> bound of this algorithm in the general case.
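To illustrate the case of two very similar strings, here is a Python sketch of one well-known method, Myers' greedy algorithm, which runs in <math>O((m+n)D)</math> time, where <math>D</math> is the number of insertions and deletions separating the inputs. This particular algorithm is our choice of example and is not named in the text above; the function name <code>lcs_len_myers</code> is hypothetical. It computes <math>D</math> and then uses the identity that the LCS length is <math>(m+n-D)/2</math>.

```python
def lcs_len_myers(a, b):
    """Return the LCS length of a and b via Myers' greedy O((m+n)D) search."""
    n, m = len(a), len(b)
    # v[k] = furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    for d in range(n + m + 1):           # d = number of insertions/deletions used
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]             # step down: take an element of b
            else:
                x = v[k - 1] + 1         # step right: skip an element of a
            y = x - k
            # Follow the diagonal while elements match (these are LCS elements).
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return (n + m - d) // 2  # d is the insert/delete distance
    return 0  # unreachable: d = n + m always suffices


print(lcs_len_myers("ABCABBA", "CBABAC"))  # 4
```

When the inputs differ in only a few places, <math>D</math> is small and the outer loop stops after a few iterations, regardless of how long the sequences are.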
 
 
==Reductions==
 
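* A [[longest palindromic subsequence]] may be found by determining the longest common subsequence of a sequence and its reverse (the two lengths are equal, although a particular common subsequence need not itself be a palindrome).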
 
* A [[longest increasing subsequence]] may be found by determining the longest common subsequence of a sequence and a sorted copy of itself (with duplicates removed from the sorted copy if the subsequence must be strictly increasing), although a faster algorithm exists.
 
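* The edit distance between two strings in which only insertions and deletions are allowed is given by the sum of the lengths of the strings minus twice the length of the longest common subsequence. (The [[Levenshtein distance]], which also allows substitutions, can be smaller.)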
 
* To find a shortest common supersequence of two sequences, start with a longest common subsequence, and then insert the remaining elements in their appropriate positions. For example, [9,'''2''',3,'''6''','''1'''] and ['''2''',0,'''6''','''1''',3] give initially [2,6,1]. We know the 9 precedes the 2, the 3 and the 0 lie in between the 2 and the 6, and the 3 follows the 1, so we construct [9,2,3,0,6,1,3] as a possible shortest common supersequence. (The order of the 3 and 0 is irrelevant in this example; [9,2,0,3,6,1,3] works just as well.)
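
The reductions listed above can be sketched in Python as follows. All function names are illustrative, the helper <code>lcs</code> assumes the standard recurrence, and the increasing-subsequence reduction is shown in its strictly increasing form (the sorted copy has duplicates removed).

```python
def lcs(x, y):
    """Return one longest common subsequence of x and y as a list."""
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    result, i, j = [], m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            result.append(x[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


def longest_palindromic_subsequence_length(s):
    # LCS of the sequence and its reverse (length only).
    return len(lcs(s, s[::-1]))


def longest_increasing_subsequence(s):
    # LCS of the sequence and its sorted, duplicate-free copy.
    return lcs(s, sorted(set(s)))


def insert_delete_distance(x, y):
    # Edit distance when only insertions and deletions are allowed.
    return len(x) + len(y) - 2 * len(lcs(x, y))


def shortest_common_supersequence(x, y):
    # Start from an LCS and insert the remaining elements of x and y
    # before the common element they precede.
    out, i, j = [], 0, 0
    for c in lcs(x, y):
        while x[i] != c:
            out.append(x[i])
            i += 1
        while y[j] != c:
            out.append(y[j])
            j += 1
        out.append(c)
        i, j = i + 1, j + 1
    out.extend(x[i:])
    out.extend(y[j:])
    return out


print(shortest_common_supersequence([9, 2, 3, 6, 1], [2, 0, 6, 1, 3]))
# [9, 2, 3, 0, 6, 1, 3]
print(insert_delete_distance("kitten", "sitting"))  # 5
```

The second result shows the difference from the Levenshtein distance, which is 3 for the same pair of strings because substitutions are allowed there.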
 
 
==Applications==
 
* The <code>diff</code> utility on UNIX-based systems computes the longest common subsequence of two files (each regarded as a sequence of lines). It also prints out instructions on how to convert the first file into the second.
 
* Suppose Alice starts editing an article on a wiki and, while she is still editing, Bob starts editing ''the same article''. Then suppose that Alice saves her changes and subsequently Bob saves his changes. If the wiki software simply wrote the entire edited article that Bob submitted back into the database, then Alice's changes would be obliterated, even if she was working on a different part of the article. A more sophisticated approach would be to use longest common subsequences to determine the actual ''changes'' that Alice made and the actual changes that Bob made. Making these changes in succession would allow both Alice's and Bob's edits to go through.
 
* Biologists frequently wish to know how related (in terms of evolution) two given species are. This can be accomplished using the ''molecular clock''. Since DNA mutates at a roughly constant rate, by observing how different two species' genomes are, one can estimate how long ago their evolutionary lineages diverged from a common ancestor. Determining how different the genomes are can be accomplished by taking the longest common subsequence of the sequences of base pairs.
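
As a rough illustration of the first application, the following Python sketch derives line-by-line editing instructions from the LCS table. It is a toy version only: real <code>diff</code> implementations use more refined algorithms and output formats, and the function name <code>line_diff</code> and the <code>"  "</code>, <code>"- "</code>, <code>"+ "</code> prefixes are conventions chosen for this example.

```python
def line_diff(old, new):
    """Return instructions turning the list of lines `old` into `new`."""
    m, n = len(old), len(new)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if old[i - 1] == new[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Backtrack from the end; lines on the LCS are kept, the rest are
    # reported as deletions from `old` or insertions from `new`.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and old[i - 1] == new[j - 1]:
            ops.append("  " + old[i - 1])      # unchanged line
            i, j = i - 1, j - 1
        elif j > 0 and (i == 0 or table[i][j - 1] >= table[i - 1][j]):
            ops.append("+ " + new[j - 1])      # line added in `new`
            j -= 1
        else:
            ops.append("- " + old[i - 1])      # line removed from `old`
            i -= 1
    ops.reverse()
    return ops


for line in line_diff(["a", "b", "c"], ["a", "c", "d"]):
    print(line)
```

Applied separately to Alice's and Bob's versions against the original article, instructions of this kind identify which lines each of them actually changed.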
 
  
 
==References==
 
     border-style: solid;
     border-width: 1px;
     width: 20px;
     height: 20px;
     text-align: center;
   }
   .lcs_table td
     border-style: solid;
     border-width: 1px;
     width: 20px;
     height: 20px;
     text-align: center;
   }
}}
 
