Longest common subsequence

{{Distinguish|Longest common substring}}
 
 
 
The '''longest common subsequence''' (LCS) problem is the problem of finding a sequence of maximal length that is a [[subsequence]] of each of two finite input sequences. Formally, given two sequences <math>x = [x_1, x_2, \ldots, x_m]</math> and <math>y = [y_1, y_2, \ldots, y_n]</math>, we would like to find two increasing sequences of indices <math>i_1 < i_2 < \cdots < i_k</math> and <math>j_1 < j_2 < \cdots < j_k</math> such that <math>x_{i_p} = y_{j_p}</math> for all <math>1 \leq p \leq k</math> and <math>k</math> is maximized.
  
 
The function <math>\operatorname{LCS\_len}</math> defined above takes constant time to transition from one state to another, and there are <math>(m+1)(n+1)</math> states, so the time taken overall is <math>\Theta(mn)</math>.
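The state count above can be made concrete with a minimal bottom-up sketch. The names <code>lcs_len_table</code> and <code>lcs_len</code> are chosen here for illustration, and the recurrence is assumed to be the usual one (extend the diagonal on a match, otherwise take the better of the two neighbouring states). Each of the <math>(m+1)(n+1)</math> entries is filled in constant time.

```python
def lcs_len_table(x, y):
    """Return table T where T[p][q] is the LCS length of x[:p] and y[:q]."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay zero: an empty prefix has an empty LCS.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for p in range(1, m + 1):
        for q in range(1, n + 1):
            if x[p - 1] == y[q - 1]:
                # Matching last elements extend the LCS of the shorter prefixes.
                table[p][q] = table[p - 1][q - 1] + 1
            else:
                # Otherwise drop the last element of one prefix or the other.
                table[p][q] = max(table[p - 1][q], table[p][q - 1])
    return table


def lcs_len(x, y):
    """Length of a longest common subsequence of x and y."""
    return lcs_len_table(x, y)[len(x)][len(y)]
```

For instance, <code>lcs_len([9, 2, 3, 6, 1], [2, 0, 6, 1, 3])</code> evaluates to 3.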
  
The function <math>\operatorname{LCS}</math> as written above may take longer to compute if we naively store an array of strings where each entry is the longest common subsequence of some subinstance, because then a lot of string copying occurs during transitions, and copying a string takes time proportional to its length rather than constant time. If we actually wish to reconstruct a longest common subsequence, we may compute the <math>\operatorname{LCS\_len}</math> table first, making a note of "what happened" when each value was computed (''i.e.'', which subinstance's solution was used) and then backtrack once <math>\operatorname{LCS\_len}(m,n)</math> is known. The backtracking step takes only <math>O(m+n)</math> time, so the total remains <math>\Theta(mn)</math>.
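A minimal sketch of this reconstruction follows. As a variation on the description above, it does not store a separate note per entry; it re-derives "what happened" by comparing neighbouring table values during the backtrack. The function name <code>lcs</code> and the tie-breaking rule are assumptions of this sketch.

```python
def lcs(x, y):
    """Return one longest common subsequence of x and y as a list."""
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for p in range(1, m + 1):
        for q in range(1, n + 1):
            if x[p - 1] == y[q - 1]:
                table[p][q] = table[p - 1][q - 1] + 1
            else:
                table[p][q] = max(table[p - 1][q], table[p][q - 1])

    # Walk back from (m, n); every step decreases p or q, so at most m + n steps.
    result = []
    p, q = m, n
    while p > 0 and q > 0:
        if x[p - 1] == y[q - 1]:
            result.append(x[p - 1])  # this element is part of the LCS
            p -= 1
            q -= 1
        elif table[p - 1][q] >= table[p][q - 1]:
            p -= 1  # the value came from the state above
        else:
            q -= 1  # the value came from the state to the left
    result.reverse()  # elements were collected from last to first
    return result
```

When several longest common subsequences exist, which one is returned depends on the tie-breaking in the <code>elif</code> branch.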
  
 
Using the recursive definition of <math>\operatorname{LCS\_len}</math> as written may be faster (provided memoization is used), because we may not have to compute all values that we would compute in the dynamic solution. The runtime is then <math>O(mn)</math> (no improvement is seen in the worst case).
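The following sketch illustrates the memoized recursion, assuming the usual recurrence. The name <code>lcs_len_memo</code> is hypothetical, and the second return value (the number of memoized states) is included only to show how few states may be visited. Python's default recursion limit restricts this version to fairly short inputs.

```python
def lcs_len_memo(x, y):
    """Return (LCS length, number of states stored in the memo)."""
    memo = {}  # hash table keyed by (p, q); only visited states are stored

    def go(p, q):
        if p == 0 or q == 0:
            return 0  # base case: an empty prefix
        if (p, q) not in memo:
            if x[p - 1] == y[q - 1]:
                memo[(p, q)] = go(p - 1, q - 1) + 1
            else:
                memo[(p, q)] = max(go(p - 1, q), go(p, q - 1))
        return memo[(p, q)]

    return go(len(x), len(y)), len(memo)
```

When the two inputs are identical, only the diagonal states are reached: <code>lcs_len_memo("abcde", "abcde")</code> returns <code>(5, 5)</code>, whereas the full table has 36 entries.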
  
 
In the dynamic solution, if we desire only the length of the LCS, we notice that we only need to keep two rows of the table at any given time, since we will never be looking back at <math>p</math>-values less than the current <math>p</math> minus one. In fact, keeping only two rows at any given time may be more cache-optimal, and hence result in a constant speedup.
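A sketch of the two-row idea, with the hypothetical name <code>lcs_len_two_rows</code>: only the previous and current rows are kept, so the extra memory is proportional to the length of <code>y</code> rather than to <math>mn</math>. Any cache benefit is not measured here.

```python
def lcs_len_two_rows(x, y):
    """LCS length of x and y, keeping only two rows of the table."""
    n = len(y)
    prev = [0] * (n + 1)  # row p - 1 of the full table
    for p in range(1, len(x) + 1):
        curr = [0] * (n + 1)  # row p of the full table
        for q in range(1, n + 1):
            if x[p - 1] == y[q - 1]:
                curr[q] = prev[q - 1] + 1
            else:
                curr[q] = max(prev[q], curr[q - 1])
        prev = curr  # the older row is no longer needed
    return prev[n]
```

Passing the shorter sequence as <code>y</code> keeps the rows as small as possible.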
 
 
====Memory====
 
The memory usage is <math>\Theta(mn)</math> for the dynamic solution when the whole table is stored. The recursive solution may use less memory if memoization is implemented with a hash table, since only the states actually reached are stored.
 
* The [[Levenshtein distance|edit distance]] between two strings, when only insertions and deletions are allowed (no substitutions), is given by the sum of the lengths of the strings minus twice the length of the longest common subsequence.
 
* To find a shortest common supersequence of two sequences, start with a longest common subsequence, and then insert the remaining elements in their appropriate positions. For example, [9,'''2''',3,'''6''','''1'''] and ['''2''',0,'''6''','''1''',3] give initially [2,6,1]. We know the 9 precedes the 2, the 3 and the 0 lie in between the 2 and the 6, and the 3 follows the 1, so we construct [9,2,3,0,6,1,3] as a possible shortest common supersequence. (The order of the 3 and 0 is irrelevant in this example; [9,2,0,3,6,1,3] works just as well.)
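Both items above can be sketched on top of a reconstructed LCS. The names <code>indel_distance</code> and <code>shortest_common_supersequence</code> are chosen for illustration; the distance counts insertions and deletions only, and the supersequence is one of possibly several valid answers.

```python
def lcs(x, y):
    """Return one longest common subsequence of x and y as a list."""
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for p in range(1, m + 1):
        for q in range(1, n + 1):
            if x[p - 1] == y[q - 1]:
                table[p][q] = table[p - 1][q - 1] + 1
            else:
                table[p][q] = max(table[p - 1][q], table[p][q - 1])
    result, p, q = [], m, n
    while p > 0 and q > 0:
        if x[p - 1] == y[q - 1]:
            result.append(x[p - 1])
            p, q = p - 1, q - 1
        elif table[p - 1][q] >= table[p][q - 1]:
            p -= 1
        else:
            q -= 1
    return result[::-1]


def indel_distance(x, y):
    """Edit distance with insertions and deletions only: m + n - 2 * |LCS|."""
    return len(x) + len(y) - 2 * len(lcs(x, y))


def shortest_common_supersequence(x, y):
    """Weave the elements outside the LCS around the LCS elements."""
    result = []
    i = j = 0
    for c in lcs(x, y):
        # Emit the elements of x, then of y, that precede the next LCS element.
        while x[i] != c:
            result.append(x[i])
            i += 1
        while y[j] != c:
            result.append(y[j])
            j += 1
        result.append(c)  # the shared element is emitted once
        i += 1
        j += 1
    result.extend(x[i:])  # leftovers after the last LCS element
    result.extend(y[j:])
    return result
```

On the example from the text, <code>shortest_common_supersequence([9, 2, 3, 6, 1], [2, 0, 6, 1, 3])</code> gives <code>[9, 2, 3, 0, 6, 1, 3]</code>, and <code>indel_distance</code> on the same inputs gives 4.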
 
==Applications==
 
* The <code>diff</code> utility on UNIX-based systems computes the longest common subsequence of two files (each regarded as a sequence of lines). It also prints out instructions on how to convert the first file into the second.
 
* Suppose Alice starts editing an article on a wiki and, while she is still editing, Bob starts editing ''the same article''. Then suppose that Alice saves her changes and subsequently Bob saves his changes. If the wiki software simply wrote the entire edited article that Bob submitted back into the database, then Alice's changes would be obliterated, even if she was working on a different part of the article. A more sophisticated approach would be to use longest common subsequences to determine the actual ''changes'' that Alice made and the actual changes that Bob made. Making these changes in succession would allow both Alice's and Bob's edits to go through.
 
* Biologists frequently wish to know how related (in terms of evolution) two given species are. This can be accomplished using the ''molecular clock''. Since DNA mutates at a roughly constant rate, by observing how different two species' genomes are, one can estimate how long ago their evolutionary lineages diverged from a common ancestor. Determining how different the genomes are can be accomplished by taking the longest common subsequence of the sequences of base pairs.
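The first application above can be illustrated with a toy line-based diff. This is only a sketch of the idea, not a description of how any particular <code>diff</code> implementation works; the function name <code>line_diff</code> and the output prefixes (<code>"  "</code> for kept, <code>"- "</code> for deleted, <code>"+ "</code> for inserted lines) are assumptions.

```python
def line_diff(old, new):
    """Return edit instructions turning the list of lines old into new."""
    m, n = len(old), len(new)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for p in range(1, m + 1):
        for q in range(1, n + 1):
            if old[p - 1] == new[q - 1]:
                table[p][q] = table[p - 1][q - 1] + 1
            else:
                table[p][q] = max(table[p - 1][q], table[p][q - 1])

    # Backtrack: lines on the LCS are kept, all others are deleted or inserted.
    ops = []
    p, q = m, n
    while p > 0 or q > 0:
        if p > 0 and q > 0 and old[p - 1] == new[q - 1]:
            ops.append("  " + old[p - 1])
            p, q = p - 1, q - 1
        elif q > 0 and (p == 0 or table[p][q - 1] >= table[p - 1][q]):
            ops.append("+ " + new[q - 1])
            q -= 1
        else:
            ops.append("- " + old[p - 1])
            p -= 1
    ops.reverse()  # instructions were collected from the end backwards
    return ops
```

For example, <code>line_diff(["a", "b", "c"], ["a", "c", "d"])</code> yields <code>["  a", "- b", "  c", "+ d"]</code>.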
 
  
 
==References==
