Longest common subsequence
The longest common subsequence (LCS) problem is the problem of finding a sequence of maximal length that is a subsequence of two finite input sequences. Formally, given two sequences and , we would like to find two sets of indices and such that for all and is maximized.
Contents
Discussion
Two strings may have a potentially exponential number of longest common subsequences. An example (with proof) is given here (take the string given and its reverse as the input sequences). Therefore, it is clear that any algorithm to return all longest common subsequences of a pair of input sequences (i.e., all possible sequences and of indices as previously defined) cannot run in polynomial time.
If, however, we restrict ourselves to either reporting simply the maximum value of or reporting only one possible pair of sequences for which the maximum is achieved, then the problem may be solved in polynomial time using dynamic programming.
Algorithm
Optimal substructure
Consider the structure of an optimal solution. (That is, one of possibly many optimal solutions; we don't care which one we're looking for.) We want to show that this problem has the optimal substructure required for a dynamic solution. To do this, we first notice that our optimal solution either uses or does not use the last element of , and either uses or does not use the last element of . And then we make these two critical observations:
- If a longest common subsequence of and does not use the last term of or does not use the last term of (or both), then we can drop the unused terms from the end of and and our longest common subsequence will be a longest common subsequence of the shortened sequences. That is, the optimal solution contains an optimal solution to a smaller problem.
- Why is this true? Clearly the longest common subsequence of and is at least some common subsequence of the shortened sequences, because we didn't cut off any terms we used. So the longest common subsequence of the shortened sequences is at least as long as the longest common subsequence of and . Could it be longer? No, because then we could just add back in those terms we chopped off from the end, and get a common subsequence of and longer than what we originally assumed was the maximum length --- a contradiction.
- If a longest common subsequence of and uses the last term of and the last term of , then we can drop the last term from both and as well as the last term from our common subsequence and we will be left with a common subsequence of the shortened and .
- Similar reasoning applies here: if the shortened common subsequence were not optimal, adding back in the common last term would give a better-than-optimal solution to the original problem.
It follows that if a longest common subsequence of and uses the last element of both and , then its length is one more than that of some longest common subsequence of lacking its last element and lacking its last element. If it does not, then its length is exactly the same as that of some longest common subsequence of the shortened and (one or both may be shortened).
Here is an example. A longest common subsequence of the sequences and is . ( would also work.) This common subsequence uses the last element of the second sequence but it does not use the last element of the first sequence. Therefore, we can drop the last element of the first sequence, and conclude that is also a longest common subsequence of and . Notice that in this subproblem, the last elements of both sequences are used. Now we know that we can drop the last elements of both sequences as well as the last element of our longest common subsequence; must be a longest common subsequence of and , and indeed it is.
Since an optimal solution to a general instance of the longest common subsequence problem contains optimal solutions to subinstances, we can reconstruct the optimal solution to the original instance once we know the optimal solutions to all subinstances, i.e., work forward, thus:
- If the last element of and the last element of are the same, then some longest common subsequence of the two uses the last element of both sequences, and, furthermore, any longest common subsequence of and (i.e., and with their last elements removed) can be turned into a longest common subsequence of and .
- Why? It would be absurd to use the last element of neither sequence, as we could just append that last element onto the end of a supposedly optimal common subsequence to obtain a longer one. For example, the sequence cannot possibly be a longest common subsequence of and because it fails to use the 6; clearly is a longer common subsequence. Furthermore, if we use the last element of only one of the sequences, and it is matched to some element other than the last in the other sequence, we can substitute it with the last instead. For example, it may be true that selecting the nine and the first six from and selecting the nine and six from gives a longest common subsequence of and , but it is also true that we could have just as well selected the second six from , hence using the last element of both sequences.
- If the last element of and the last element of differ, then some longest common subsequence of the two is either a longest common subsequence of and or of and , and, furthermore, it doesn't matter which longest common subsequences of and and of and we know; one of them will be a longest common subsequence for and .
- Since the last elements of and differ, obviously there is no longest common subsequence that uses both of them. Thus we can always remove either the last element of or the last element of . For example, some longest common subsequence of and should be found if we consider a longest common subsequence of and and of and . This is indeed the case; is a longest common subsequence of the first pair and is a longest common subsequence of the second pair, and , the longer of the two, is also a longest common subsequence of and .
Overlapping subproblems and state
As explained in the preceding section, the solution to an instance of the longest common subsequence problem depends on the solutions to subinstances, which are formed by removing the last element from either or both of the input sequences. In order to obtain an efficient dynamic programming solution, we want the total number of possible subinstances, subinstances of subinstances, and so on down to be bounded polynomially. Is this true?
Luckily, it is. The only subinstances we have to consider are those in which the two input sequences are both prefixes of the two given sequences. There are prefixes of and prefixes of , so there are only subinstances to consider in total.
We see that our dynamic state only needs to consist of two values and , used to uniquely identify the subinstance consisting of the first elements of and the first elements of .
Base case
A simple base case is that if one of the input sequences is empty, then the longest common subsequence is also empty (as the empty sequence is the only subsequence of the empty sequence). This base case complements our optimal substructure nicely, since it accounts for when it is no longer possible to remove the last element.
Putting it together
Here, then, is a recursive function to compute the longest common subsequence:
Notice that in the second case, we are concatenating the common last element to the end of a longest common subsequence of the prefixes shortened by one element, and in the third case, if there is a tie in length, we don't care which one we take.
Since the value of this recursive function depends only on the values of the function called with arguments that are less than or equal to the original arguments, we can compute all values of bottom-up by looping from 0 to and looping from 0 to .
If only the lengths are desired:
Finally, the solution to the original instance is either or , whichever is desired; here we are solving the problem for the entire input sequences.
Analysis
Time
The function defined above takes constant time to transition from one state to another, and there are states, so the time taken overall is .
The function as written above may take longer to compute if we naively store an array of strings where each entry is the longest common subsequence of some subinstance, because then a lot of string copying occurs during transitions, and this presumably takes more than constant time. If we actually wish to reconstruct a longest common subsequence, we may compute the table first, making a note of "what happened" when each value was computed (\emph{i.e.}, which subinstance's solution was used) and then backtrack once is known. This then also takes time.
Using the recursive definition of as written may be faster (provided memoization is used), because we may not have to compute all values that we would compute in the dynamic solution. The runtime is then (no improvement is seen in the worst case).
In the dynamic solution, if we desire only the length of the LCS, we notice that we only need to keep two rows of the table at any given time, since we will never be looking back at -values less than the current minus one. In fact, keeping only two rows at any given time may be more cache-optimal, and hence result in a constant speedup.
Memory
The memory usage is for the dynamic solution. The recursive solution may use less memory if hashing techniques are used to implement memoization.
Alternative formulation
Here is another way of thinking about the longest common subsequence problem. Draw a table and place the elements of along the left and the elements of across the top of the table, so that there is one entry in the table for every pair of elements from and . Place a checkmark in any box where the row element and the column element match. For example:
9 | 2 | 3 | 6 | 1 | |
---|---|---|---|---|---|
2 | ✓ | ||||
0 | |||||
6 | ✓ | ||||
1 | ✓ | ||||
3 | ✓ |
Any sequence of checkmarks such that each one is strictly below and to the right of the previous one is a common subsequence. We want as long a sequence as possible; this will be a longest common subsequence. It should be clear from looking at the table above that is the desired longest common subsequence. Our DP table can be understood as follows: is the best we can do given that we only use checkmarks from the first rows and the first columns. The condition corresponds to the case in which there is a checkmark in the box at .
Faster methods
The dynamic algorithm presented above always takes the same amount of time to run (to within a constant factor) and it does not depend on the size of the set from which the elements of the sequences are taken. Faster algorithms exist for cases where this set is small and finite (making the sequences strings over a relatively small alphabet), one of the strings is much longer than the other, or the two strings are very similar.[1] These algorithms almost always outperform the naive algorithm in practice (otherwise the diff
utility would be unacceptably slow). However, it is surprisingly difficult to improve over the bound of this algorithm in the general case.
References
- ↑ L. Bergroth and H. Hakonen and T. Raita (2000). "A Survey of Longest Common Subsequence Algorithms". SPIRE (IEEE Computer Society) 00: 39–48. doi:10.1109/SPIRE.2000.878178. ISBN 0-7695-0746-8.