The '''longest common subsequence''' (LCS) problem is the problem of finding a sequence of maximal length that is a [[subsequence]] of two finite input sequences. Formally, given two sequences <math>x = [x_1, x_2, ..., x_m]</math> and <math>y = [y_1, y_2, ..., y_n]</math>, we would like to find two sets of indices <math>i_1 < i_2 < ... < i_k</math> and <math>j_1 < j_2 < ... < j_k</math> such that <math>x_{i_p} = y_{j_p}</math> for all <math>1 \leq p \leq k</math> and <math>k</math> is maximized.
The function <math>\operatorname{LCS\_len}</math> defined above takes constant time to transition from one state to another, and there are <math>(m+1)(n+1)</math> states, so the time taken overall is <math>\Theta(mn)</math>.
The function <math>\operatorname{LCS}</math> as written above may take longer to compute if we naively store an array of strings in which each entry is the longest common subsequence of some subinstance, because then a lot of string copying occurs during transitions, and copying a string takes time proportional to its length rather than constant time. If we actually wish to reconstruct a longest common subsequence, we may compute the <math>\operatorname{LCS\_len}</math> table first, making a note of "what happened" when each value was computed (''i.e.'', which subinstance's solution was used), and then backtrack once <math>\operatorname{LCS\_len}(m,n)</math> is known. This also takes <math>\Theta(mn)</math> time.
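For concreteness, the table-filling and backtracking steps might be sketched in Python as follows. This is only an illustrative sketch: the function names <code>lcs_table</code> and <code>lcs_backtrack</code> are arbitrary, and instead of storing "what happened" during the forward pass, the choices are re-derived from the finished table, which gives the same <math>\Theta(mn)</math> bound.

<syntaxhighlight lang="python">
def lcs_table(x, y):
    """Fill the LCS_len table bottom-up; table[i][j] is the length of an
    LCS of the prefixes x[:i] and y[:j]."""
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table

def lcs_backtrack(x, y):
    """Recover one longest common subsequence by walking the finished table
    backwards, re-deriving which subinstance's solution was used at each cell."""
    table = lcs_table(x, y)
    i, j = len(x), len(y)
    out = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])            # this element belongs to the LCS
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1                          # the value came from the row above
        else:
            j -= 1                          # the value came from the column to the left
    return out[::-1]
</syntaxhighlight>

For example, <code>lcs_backtrack("XMJYAUZ", "MZJAWXU")</code> should return <code>['M', 'J', 'A', 'U']</code>, an LCS of length 4.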
Using the recursive definition of <math>\operatorname{LCS\_len}</math> as written may be faster (provided memoization is used), because we may not have to compute all values that we would compute in the dynamic solution. The runtime is then <math>O(mn)</math> (no improvement is seen in the worst case).
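A top-down version of the same recurrence might look like the sketch below; it only evaluates the states actually reachable from <math>(m, n)</math>, which can be fewer than <math>(m+1)(n+1)</math>. Using Python's <code>functools.lru_cache</code> as the memo table is an assumption made for brevity; any memoization scheme works.

<syntaxhighlight lang="python">
from functools import lru_cache

def lcs_len(x, y):
    """Top-down LCS_len with memoization: only states reachable from
    (len(x), len(y)) are computed."""
    @lru_cache(maxsize=None)
    def rec(i, j):
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:
            return rec(i - 1, j - 1) + 1
        return max(rec(i - 1, j), rec(i, j - 1))
    return rec(len(x), len(y))
</syntaxhighlight>

One practical caveat of this formulation is that for long sequences the recursion depth can exceed Python's default limit, which is one reason the bottom-up table is often preferred in practice.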
In the dynamic solution, if we desire only the length of the LCS, we notice that we only need to keep two rows of the table at any given time, since we will never be looking back at <math>p</math>-values less than the current <math>p</math> minus one. In fact, keeping only two rows at any given time may be more cache-optimal, and hence result in a constant speedup.
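The two-row, length-only variant might be sketched as follows (again an illustrative Python version, not code from the original article):

<syntaxhighlight lang="python">
def lcs_len_two_rows(x, y):
    """Length of an LCS keeping only two rows of the table at a time."""
    if len(y) > len(x):
        x, y = y, x                         # let the shorter sequence index the columns
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr                         # the row just computed becomes the "previous" row
    return prev[len(y)]
</syntaxhighlight>

This matches the observation above: only the previous row is ever consulted, so the memory drops from <math>\Theta(mn)</math> to <math>\Theta(\min(m, n))</math> while the running time remains <math>\Theta(mn)</math>.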
====Memory====
The memory usage is <math>\Theta(mn)</math> for the dynamic solution. The recursive solution may use less memory if hashing techniques are used to implement memoization.
==Faster methods==
The dynamic algorithm presented above always takes the same amount of time to run (to within a constant factor), and its running time does not depend on the size of the set from which the elements of the sequences are taken. Faster algorithms exist for cases where this set is small and finite (making the sequences strings over a relatively small alphabet), one of the strings is much longer than the other, or the two strings are very similar.<ref name="BHR00">{{cite journal | author = L. Bergroth and H. Hakonen and T. Raita | title = A Survey of Longest Common Subsequence Algorithms | journal = SPIRE | volume = 00 | year = 2000 | isbn = 0-7695-0746-8 | pages = 39–48 | doi = 10.1109/SPIRE.2000.878178 | publisher = IEEE Computer Society | location = Los Alamitos, CA, U.S.}}</ref> These algorithms almost always outperform the naive algorithm in practice (otherwise the <code>diff</code> utility would be unacceptably slow). However, it is surprisingly difficult to improve over the <math>O(mn)</math> bound of this algorithm in the general case.
==References==
{{reflist}}
[[Category:Dynamic programming]]
border-style: solid;
border-width: 1px;
}
.lcs_table td
border-style: solid;
border-width: 1px;
}