Difference between revisions of "Longest palindromic subsequence"

From PEGWiki
(removing confusing statement from proof)
 
(11 intermediate revisions by 5 users not shown)
Line 1: Line 1:
 
{{Distinguish|Longest palindromic substring}}
 
{{Distinguish|Longest palindromic substring}}
  
The '''longest palindromic subsequence''' problem is the problem of finding the longest subsequence of a string (a subsequence is obtained by deleting some of the characters from a string without reordering the remaining characters) which is also a palindrome. In general, the longest palindromic subsequence is not unique. For example, the string '''alfalfa''' has two palindromic subsequences of length 5: '''alala''' and '''afafa'''. However, it does not have any palindromic subsequences longer than five characters. Therefore '''alala''' and '''afafa''' are both considred longest palindromic subsequences of '''alfalfa'''.
+
The '''longest palindromic subsequence''' (LPS) problem is the problem of finding the longest [[subsequence]] of a string (a subsequence is obtained by deleting some of the characters from a string without reordering the remaining characters) which is also a palindrome. In general, the longest palindromic subsequence is not unique. For example, the string '''alfalfa''' has four palindromic subsequences of length 5: '''alala''', '''afafa''', '''alfla''', and '''aflfa'''. However, it does not have any palindromic subsequences longer than five characters. Therefore all four are considered longest palindromic subsequences of '''alfalfa'''.
  
 
==Precise statement==
 
==Precise statement==
Line 11: Line 11:
 
'''Theorem''': Returning all longest palindromic subsequences cannot be accomplished in worst-case polynomial time.
 
'''Theorem''': Returning all longest palindromic subsequences cannot be accomplished in worst-case polynomial time.
  
'''Proof'''<ref name="schneider"/>: Consider a string made up of <math>N/2</math> ones, followed by <math>N/4</math> zeroes, and finally <math>N/4</math> ones. (Assume <math>N</math> is a multiple of 4, although it does not really matter.) Any palindromic substring either does not contain any zeroes, in which case its length is only up to <math>3N/4</math>, or it contains at least one zero. If it contains at least one zero, it must be of the form <math>1^a0^b1^c</math>, but <math>a</math> and <math>c</math> must be equal. (This is because the middle of the palindrome must lie somewhere within the zeroes, otherwise there would be no zeroes on one side of it and at least one zero on the other side; but as long as the middle lies within the zeroes, there must be an equal number of ones on each side.) But <math>c</math> can only be up to <math>N/4</math>, and likewise with <math>b</math>, so again the palindrome cannot be longer than <math>3N/4</math> characters. However, there are <math>\binom{N/2}{N/4}+1</math> palindromic substrings of length <math>3N/4</math>; we can either take all the ones, or we can take all <math>N/4</math> zeroes, all <math>N/4</math> terminal ones, and <math>N/4</math> out of the <math>N/2</math> initial ones. Thus the output size is not polynomial in <math>N</math>, and then neither can the algorithm be in the worst case. <math>_\blacksquare</math>
+
'''Proof'''<ref name="schneider">Jonathan T. Schneider (2010). Personal communication.</ref>: Consider a string made up of <math>N/2</math> ones, followed by <math>N/4</math> zeroes, and finally <math>N/4</math> ones. Any palindromic subsequence either does not contain any zeroes, in which case its length is only up to <math>3N/4</math>, or it contains at least one zero. If it contains at least one zero, it must be of the form <math>1^a0^b1^c</math>, but <math>a</math> and <math>c</math> must be equal. (This is because the middle of the palindrome must lie somewhere within the zeroes, otherwise there would be no zeroes on one side of it and at least one zero on the other side; but as long as the middle lies within the zeroes, there must be an equal number of ones on each side.) But <math>c</math> can only be up to <math>N/4</math>, and likewise with <math>b</math>, so again the palindrome cannot be longer than <math>3N/4</math> characters. However, there are <math>\binom{N/2}{N/4}+1</math> palindromic subsequences of length <math>3N/4</math>; we can either take all the ones, or we can take all <math>N/4</math> zeroes, all <math>N/4</math> terminal ones, and <math>N/4</math> out of the <math>N/2</math> initial ones. Thus the output size is not polynomial in <math>N</math>, and then neither can the algorithm be in the worst case. <math>_\blacksquare</math>
  
 
However, this does not rule out the existence of a polynomial-time algorithm for the first two variations on the problem. We now present such an algorithm.
 
However, this does not rule out the existence of a polynomial-time algorithm for the first two variations on the problem. We now present such an algorithm.
  
==Theoretical background==
+
==Algorithm==
(Note: these Lemmas are "obvious" and their proofs will probably not help you intuitively understand how the algorithm works, so skip them if they are too heavy in mathematical notation for you.)
+
===LCS-based approach===
 +
The standard algorithm for computing a longest palindromic subsequence of a given string <var>S</var> involves first computing a [[longest common subsequence]] (LCS) of <var>S</var> and its reverse <var>S'</var>. Often, this gives a correct LPS right away. For example, the reader may verify that '''alala''', '''afafa''', '''aflfa''' and '''alfla''' are all LCSes of '''alfalfa''' and its reverse '''aflafla'''. However, this is not ''always'' the case; for example, '''afala''' and '''alafa''' are ''also'' LCSes of '''alfalfa''' and its reverse, yet neither is palindromic. While it is clear that any LPS of a string is an LCS of the string and its reverse, the converse is false.
  
''Lemma 1'': Any palindromic subsequence <math>s</math> of a string <math>S</math> is a common subsequence of <math>S</math> and its reverse <math>S'</math>.
+
However, with a slight modification, this algorithm can be made to work. We will use the example string '''ABCDEBCA''', which has six LPSes: '''ABCBA''', '''ABDBA''', '''ABEBA''', '''ACDCA''', '''ACECA''' and '''ACBCA''' (and our objective is to find one of them). Suppose that we find an LCS which is not one of these, such as <var>L</var>&nbsp;=&nbsp;'''ABDCA''':
 +
: '''AB'''C'''D'''EB'''CA'''
 +
: '''A'''C'''B'''E'''DC'''B'''A'''
 +
Now consider the central character <var>C</var>&nbsp;=&nbsp;'''D''' of <var>L</var>. It splits <var>S</var> up into two parts, the part before and the part after: <var>S</var><sub>1</sub>&nbsp;=&nbsp;'''ABC''' and <var>S</var><sub>2</sub>&nbsp;=&nbsp;'''EBCA'''. We know that the first half of <var>L</var> (<var>L</var><sub>1</sub>&nbsp;=&nbsp;'''AB''') is a subsequence of <var>S</var><sub>1</sub>, and that the second half of <var>L</var> (<var>L</var><sub>2</sub>&nbsp;=&nbsp;'''CA''') is a subsequence of <var>S</var><sub>2</sub>. Notice that <var>S'</var> is ''also'' split up into two parts: <var>S'</var><sub>1</sub>&nbsp;=&nbsp;'''ACBE'''&nbsp;=&nbsp;(<var>S</var><sub>2</sub>)' and <var>S'</var><sub>2</sub>&nbsp;=&nbsp;'''CBA'''&nbsp;=&nbsp;(<var>S</var><sub>1</sub>)'. Since <var>L</var> is also a subsequence of <var>S</var>', we see that <var>L</var><sub>1</sub> is a subsequence of <var>S'</var><sub>1</sub>, and <var>L</var><sub>2</sub> is a subsequence of <var>S'</var><sub>2</sub>. But the fact that <var>L</var><sub>1</sub> is a subsequence of <var>S'</var><sub>1</sub> implies (by reversing both strings) that (<var>L</var><sub>1</sub>)'&nbsp;=&nbsp;'''BA''' is a subsequence of <var>S</var><sub>2</sub>&nbsp;=&nbsp;'''EBCA'''. Since <var>L</var><sub>1</sub> is a subsequence of <var>S</var><sub>1</sub> and (<var>L</var><sub>1</sub>)' is a subsequence of <var>S</var><sub>2</sub>, we conclude that <var>L</var><sub>1</sub><var>C</var>(<var>L</var><sub>1</sub>)'&nbsp;=&nbsp;'''ABDBA''' is a subsequence of <var>S</var><sub>1</sub><var>C</var><var>S</var><sub>2</sub>&nbsp;=&nbsp;<var>S</var>. So we have obtained the desired result, a longest palindromic subsequence.
  
''Proof'': Since <math>s</math> is a subsequence of <math>S</math>, its reverse <math>s'</math> is a subsequence of <math>S'</math> But <math>s = s'</math> since <math>s</math> is a palindrome, so <math>s</math> is a subsequence of <math>S'</math>, and hence a common subsequence. <math>_\blacksquare</math>
+
Extrapolating to the general case, we can always obtain a LPS by first taking the LCS of <var>S</var> and <var>S'</var> and then "reflecting" the first half of the result onto the second half; that is, if <var>L</var> has <var>k</var> characters, then we replace the last &lfloor;<var>k</var>/2&rfloor; characters of <var>L</var> by the reverse of the first &lfloor;<var>k</var>/2&rfloor; characters of <var>L</var> to obtain a palindromic subsequence of <var>S</var>. This is obviously a palindrome by the foregoing analysis; it is guaranteed to be a subsequence by the foregoing analysis; and it is guaranteed to be as long as possible because it is of the same length as any LCS, yet the length of an LCS is also the upper bound on the length of an LPS since every LPS must trivially be an LCS. The odd case is handled as above, and the even case very similarly (lacking only the central character <var>C</var>).
  
''Lemma 2'': If there exists a common subsequence <math>s</math> of length <math>L</math> of <math>S</math> and its reverse <math>S'</math>, then there exists a palindromic subsequence <math>s^*</math> of <math>S</math> of length greater than or equal to <math>L</math> which is a supersequence of <math>s</math>.
+
The complexity of this algorithm is the same as the complexity of the LCS algorithm used. The textbook [[dynamic programming]] algorithm for LCS runs in <math>O(mn)</math> time, so the time taken to find the LPS is <math>O(n^2)</math>, where <math>n</math> is the length of <var>S</var>.
  
''Proof'': Let <math>s</math> denote the subsequence in <math>S</math> and <math>s'</math> denote the subsequence in <math>S'</math>. Let <math>{s^*}'</math> denote a supersequence of <math>s'</math>. Walk through the string <math>S</math> from left to right. that is, consider <math>S_i</math> as <math>i</math> goes from 0 to <math>N-1</math>. Let <math>i'</math> denote <math>N-i-1</math>, so that <math>S_i = S'_{i'}</math> at all times. For each value of <math>i</math>:
+
===Direct solution===
* If <math>S_i</math> is in <math>s</math> then <math>S_i</math> is in <math>s^*</math> and <math>S'_{i'}</math> is in <math>{s^*}'</math>.
+
There is also a "direct" <math>O(n^2)</math> solution, also based on dynamic programming; it is very similar in principle to the approach based on the textbook LCS algorithm. We define <math>f(i,j)</math> to be the length of the longest palindromic subsequence of the substring <var>S</var>[<var>i</var>,<var>j</var>) (see [[half-open interval]]). Assuming indexing starting from zero, the objective is to compute <math>f(0,n)</math>.
* If <math>S_i</math> is not in <math>s</math> but <math>S'_{i'}</math> is in <math>s'</math>, then, again, <math>S_i</math> is in <math>s^*</math> and <math>S'_{i'}</math> is in <math>{s^*}'</math>.
+
* Otherwise, <math>S_i</math> is not in <math>s^*</math> and <math>S'_{i'}</math> is not in <math>{s^*}'</math>.
+
After this has completed, <math>s^*</math> is clearly a supersequence of <math>s</math> and a subsequence of <math>S</math>, and likewise <math>{s^*}'</math> is a supersequence of <math>s'</math> and a subsequence of <math>S'</math>.
+
  
Furthermore, <math>s^*</math> and <math>{s^*}'</math> are reverses of each other, because whenever a character <math>S_i</math> is added to the end of <math>s^*</math>, the identical character <math>S'_{i'}</math> is added to the beginning of <math>{s^*}'</math>, and ''vice versa''.
+
This is easy when the substring is empty, or when it consists of only a single character; the value of <math>f</math> will be 0 or 1, respectively. Otherwise,
 
+
* when the substring's first and last characters are equal, add ''both'' of them to the longest palindromic subsequence of the characters in between to get a palindromic subsequence two characters longer;
Now consider the <math>i</math><sup>th</sup> character in <math>s^*</math>. This is <math>S_j</math> where <math>j</math> is the <math>i</math><sup>th</sup> smallest index for which either <math>S_j</math> is in <math>s</math> or <math>S'_{j'}</math> is in <math>s'</math>. This means that <math>j'</math> is the <math>i</math><sup>th</sup> largest index for which either <math>S'_{j'}</math> is in <math>s'</math> or <math>S_j</math> is in <math>s</math>, since <math>S</math> and <math>S'</math> are reverses of each other. Therefore, <math>S'_{j'}</math> is the <math>i</math><sup>th</sup> character in <math>{s*}'</math> (characters near the beginning of <math>{s*}'</math> originate from near the beginning of <math>S'</math> or the end of <math>S</math>). But the <math>i</math><sup>th</sup> character in <math>{s*}'</math> is the <math>(n-i-1)</math><sup>st</sup> character in <math>s*</math>, because <math>s*</math> and <math>{s*}'</math> are reverses of each other. Therefore <math>s*</math> is palindromic. <math>_\blacksquare</math>
+
* when they are unequal, then it's clearly not possible to use both of them to form a palindromic subsequence of the given substring, so the answer will be the same as if one of them were ignored.
 
+
Thus:
'''Theorem''': Any [[longest common subsequence]] <math>s</math> of <math>S</math> and its reverse <math>S'</math> is a longest palindromic subsequence of <math>S</math>.
+
:<math>f(i,j) = \begin{cases}
 
+
0 & \text{if } j-i = 0 \\
'''Proof''': Suppose <math>s</math> is not palindromic. By Lemma 2, we know we can obtain a palindrome <math>s*</math> that is a supersequence of <math>s</math> and a subsequence of <math>S</math>. This cannot be <math>s</math> itself since <math>s</math> is not palindromic. So <math>s*</math> must be longer than <math>s</math>. By Lemma 1, <math>s*</math> is a common subsequence of <math>S</math> and <math>S'</math>. However, as <math>s*</math> is longer than <math>s</math>, this contradicts <math>s</math> having been a longest common subsequence of <math>S</math> and <math>S'</math>.
+
1 & \text{if } j-i = 1 \\
 
+
\max(f(i+1,j),f(i,j-1)) & \text{if } j-i > 1 \text{ and } S_i \neq S_{j-1} \\
Likewise, suppose <math>s</math> is a longest common subsequence of <math>S</math> and <math>S'</math> and palindromic but it is not a longest palindromic subsequence of <math>S</math>. Then there again exists a longer palindromic subsequence of <math>S</math>, which gives a longer common subsequence of <math>S</math> and <math>S'</math>, a contradiction. <math>_\blacksquare</math>
+
2 + f(i+1,j-1) & \text{if } j-i > 1 \text{ and } S_i = S_{j-1}
 
+
\end{cases}</math>
==Algorithm==
+
A corollary of the Theorem is that a longest palindromic subsequence of <math>S</math> can be found in <math>O(|S|^2)</math> time simply by finding the longest common subsequence of <math>S</math> and its reverse.
+
  
Note that there exist more efficient algorithms for finding longest common subsequences, which also give more efficient means of computing longest palindromic subsequences.
+
To see calculate the cost, consider that the worst case occurs when every invocation to <math>f</math> requires a call on two different substrings. This can happen for invocations with strings of length <math>n</math> to <math>2</math>. So the total number of invocations is <math>2^{(n-1)}</math>.  
  
==Shortest palindromic supersequence==
+
However, a lot of these invocations are repeats. If we store previously computed solutions, then the cost goes down. The above worst case happens when no matches are ever made, so this requires us to try out eliminating all strings that start at 0 and those that start at n-1, such that their combined length is <math>\leq n-1</math>. This is just the number of ways to partition a given number into two non-negative integers, summed up over <math>1 ,..., n-1</math>. The number of ways to partition <math>k</math> into two non-negative integers is <math>k + 1</math>. Hence the overall cost is <math>O(n^2)</math>.
It can also be shown that the shortest palindromic supersequence of a string <math>S</math> can be found by taking the [[Longest_common_subsequence#Shortest_common_supersequence|shortest common supersequence]] of <math>S</math> and its reverse. The proof is left as an exercise to the reader.
+
  
 
==References==
 
==References==
<references>
 
<ref name="schneider">Jonathan T. Schneider (2010). Personal communication.</ref>
 
 
<references/>
 
<references/>
  

Latest revision as of 01:10, 3 May 2022

Not to be confused with Longest palindromic substring.

The longest palindromic subsequence (LPS) problem is the problem of finding the longest subsequence of a string (a subsequence is obtained by deleting some of the characters from a string without reordering the remaining characters) which is also a palindrome. In general, the longest palindromic subsequence is not unique. For example, the string alfalfa has four palindromic subsequences of length 5: alala, afafa, alfla, and aflfa. However, it does not have any palindromic subsequences longer than five characters. Therefore all four are considered longest palindromic subsequences of alfalfa.
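The alfalfa example can be checked by brute force. The following is a minimal sketch, not an efficient algorithm: it tries every subsequence, longest first, and collects the distinct palindromes of the first length at which any exist. The function name is an assumption made for illustration.

```python
from itertools import combinations

def longest_palindromic_subsequences(s):
    """Return the set of distinct longest palindromic subsequences of s.

    Exponential-time brute force, suitable only for short strings.
    """
    for length in range(len(s), 0, -1):
        # combinations() keeps characters in their original order,
        # so each tuple is a subsequence of s.
        found = {"".join(c) for c in combinations(s, length) if c == c[::-1]}
        if found:
            return found
    return {""}  # the empty string is the only subsequence of ""

print(sorted(longest_palindromic_subsequences("alfalfa")))
```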

Precise statement

Three variations of this problem may be distinguished:

  • Find the maximum possible length for a palindromic subsequence.
  • Find some palindromic subsequence of maximal length.
  • Find all longest palindromic subsequences.

Theorem: Returning all longest palindromic subsequences cannot be accomplished in worst-case polynomial time.

Proof[1]: Consider a string made up of N/2 ones, followed by N/4 zeroes, and finally N/4 ones, where N is a multiple of 4. Any palindromic subsequence either does not contain any zeroes, in which case its length is at most 3N/4, or it contains at least one zero. If it contains at least one zero, it must be of the form 1^a 0^b 1^c, with a = c. (This is because the middle of the palindrome must lie somewhere within the zeroes, otherwise there would be no zeroes on one side of it and at least one zero on the other side; and since the middle lies within the zeroes, there must be an equal number of ones on each side.) But c is at most N/4, and likewise b, so again the palindrome cannot be longer than 3N/4 characters. However, counting subsequences by the positions they occupy in the string, there are \binom{N/2}{N/4}+1 palindromic subsequences of length 3N/4: we can either take all the ones, or we can take all N/4 zeroes, all N/4 terminal ones, and any N/4 of the N/2 initial ones. This count grows exponentially in N, so the output size is not polynomial in N, and hence neither is the worst-case running time of any algorithm that produces it. ∎
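The count in the proof can be checked on small cases. The sketch below is a brute-force illustration with assumed function names: it counts selections of positions (not distinct strings) that spell a palindrome of maximal length, and compares the result with \binom{N/2}{N/4}+1.

```python
from itertools import combinations
from math import comb

def count_longest_palindromic_selections(s):
    """Return (length, count): the maximal palindromic subsequence length
    and the number of position sets of that size that spell a palindrome."""
    n = len(s)
    for length in range(n, 0, -1):
        count = sum(
            1
            for idx in combinations(range(n), length)
            if all(s[idx[k]] == s[idx[-1 - k]] for k in range(length // 2))
        )
        if count:
            return length, count
    return 0, 1  # only the empty selection

def proof_string(n):
    """N/2 ones, N/4 zeroes, N/4 ones; n is assumed to be a multiple of 4."""
    return "1" * (n // 2) + "0" * (n // 4) + "1" * (n // 4)

for n in (8, 12):
    length, count = count_longest_palindromic_selections(proof_string(n))
    print(n, length, count, comb(n // 2, n // 4) + 1)
```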

However, this does not rule out the existence of a polynomial-time algorithm for the first two variations on the problem. We now present such an algorithm.

Algorithm

LCS-based approach

The standard algorithm for computing a longest palindromic subsequence of a given string S involves first computing a longest common subsequence (LCS) of S and its reverse S'. Often, this gives a correct LPS right away. For example, the reader may verify that alala, afafa, aflfa and alfla are all LCSes of alfalfa and its reverse aflafla. However, this is not always the case; for example, afala and alafa are also LCSes of alfalfa and its reverse, yet neither is palindromic. While it is clear that any LPS of a string is an LCS of the string and its reverse, the converse is false.
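The claims about alfalfa in this paragraph can be verified mechanically. A small sketch, with a hypothetical helper is_subsequence, confirms that afala and alafa are common subsequences of the string and its reverse without being palindromes:

```python
def is_subsequence(t, s):
    """True if t can be obtained from s by deleting characters."""
    it = iter(s)
    # 'ch in it' advances the iterator, so matches must occur in order.
    return all(ch in it for ch in t)

s = "alfalfa"
r = s[::-1]  # "aflafla"

for candidate in ("alala", "afafa", "aflfa", "alfla", "afala", "alafa"):
    common = is_subsequence(candidate, s) and is_subsequence(candidate, r)
    palindromic = candidate == candidate[::-1]
    print(candidate, common, palindromic)
```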

However, with a slight modification, this algorithm can be made to work. We will use the example string ABCDEBCA, which has six LPSes: ABCBA, ABDBA, ABEBA, ACDCA, ACECA and ACBCA (and our objective is to find one of them). Suppose that we find an LCS which is not one of these, such as L = ABDCA:

ABCDEBCA
ACBEDCBA

Now consider the central character C = D of L. It splits S up into two parts, the part before and the part after: S1 = ABC and S2 = EBCA. We know that the first half of L (L1 = AB) is a subsequence of S1, and that the second half of L (L2 = CA) is a subsequence of S2. Notice that S' is also split up into two parts: S'1 = ACBE = (S2)' and S'2 = CBA = (S1)'. Since L is also a subsequence of S', we see that L1 is a subsequence of S'1, and L2 is a subsequence of S'2. But the fact that L1 is a subsequence of S'1 implies (by reversing both strings) that (L1)' = BA is a subsequence of S2 = EBCA. Since L1 is a subsequence of S1 and (L1)' is a subsequence of S2, we conclude that L1C(L1)' = ABDBA is a subsequence of S1CS2 = S. So we have obtained the desired result, a longest palindromic subsequence.

Extrapolating to the general case, we can always obtain an LPS by first taking an LCS of S and S' and then "reflecting" the first half of the result onto the second half; that is, if L has k characters, then we replace the last ⌊k/2⌋ characters of L by the reverse of the first ⌊k/2⌋ characters of L to obtain a palindromic subsequence of S. The result is a palindrome by construction; it is a subsequence of S by the foregoing analysis; and it is as long as possible because it has the same length as an LCS, and the length of an LCS is an upper bound on the length of an LPS, since every palindromic subsequence of S equals its own reverse and is therefore a common subsequence of S and S'. The odd case is handled as above, and the even case very similarly (lacking only the central character C).
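The reflection step can be sketched as follows, assuming the textbook quadratic LCS table; the function names are illustrative, and which LCS is recovered depends on how ties are broken during reconstruction.

```python
def lcs(a, b):
    """One longest common subsequence of a and b (textbook O(mn) DP)."""
    m, n = len(a), len(b)
    # t[i][j] = LCS length of the suffixes a[i:] and b[j:]
    t = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                t[i][j] = 1 + t[i + 1][j + 1]
            else:
                t[i][j] = max(t[i + 1][j], t[i][j + 1])
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif t[i + 1][j] >= t[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)

def lps_via_lcs(s):
    """Take an LCS of s and its reverse, then reflect its first half."""
    l = lcs(s, s[::-1])
    half = len(l) // 2
    # Keep the first half (and the central character when the length is
    # odd), then append the reverse of the first half.
    return l[: len(l) - half] + l[:half][::-1]

print(lps_via_lcs("ABCDEBCA"))
```

The output is one of the six LPSes of ABCDEBCA; a different tie-breaking rule in the reconstruction loop may select a different one.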

The complexity of this algorithm is the same as the complexity of the LCS algorithm used. The textbook dynamic programming algorithm for LCS runs in O(mn) time on strings of lengths m and n; here both strings have length n, so the time taken to find the LPS is O(n^2), where n is the length of S.

Direct solution

There is also a "direct" O(n^2) solution, also based on dynamic programming; it is very similar in principle to the approach based on the textbook LCS algorithm. We define f(i,j) to be the length of the longest palindromic subsequence of the substring S[i,j) (see half-open interval). Assuming indexing starting from zero, the objective is to compute f(0,n).

This is easy when the substring is empty, or when it consists of only a single character; the value of f will be 0 or 1, respectively. Otherwise,

  • when the substring's first and last characters are equal, add both of them to the longest palindromic subsequence of the characters in between to get a palindromic subsequence two characters longer;
  • when they are unequal, then it's clearly not possible to use both of them to form a palindromic subsequence of the given substring, so the answer will be the same as if one of them were ignored.

Thus:

f(i,j) = \begin{cases}
0 & \text{if } j-i = 0 \\
1 & \text{if } j-i = 1 \\
\max(f(i+1,j),f(i,j-1)) & \text{if } j-i > 1 \text{ and } S_i \neq S_{j-1} \\
2 + f(i+1,j-1) & \text{if } j-i > 1 \text{ and } S_i = S_{j-1}
\end{cases}
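The recurrence translates directly into a bottom-up table. The following is a minimal sketch under the same conventions (zero-based indexing, half-open intervals); the name lps_length is an assumption made for illustration.

```python
def lps_length(s):
    """Length of a longest palindromic subsequence of s, i.e. f(0, n)."""
    n = len(s)
    # f[i][j] = LPS length of the substring s[i:j]; empty substrings stay 0.
    f = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        f[i][i + 1] = 1  # single characters
    # Fill in order of increasing substring length so that every
    # subproblem is available when it is needed.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            if s[i] == s[j - 1]:
                f[i][j] = 2 + f[i + 1][j - 1]
            else:
                f[i][j] = max(f[i + 1][j], f[i][j - 1])
    return f[0][n]

print(lps_length("alfalfa"), lps_length("ABCDEBCA"))
```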

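To calculate the cost of evaluating this recurrence naively, consider that the worst case occurs when every invocation of f requires calls on two different substrings. This can happen for every invocation on a substring of length n down to 2, so the number of invocations doubles each time the length decreases by one: there are 2^{n-1} invocations on substrings of length 1 alone, and 2^n - 1 invocations in total.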

However, a lot of these invocations are repeats. If we store previously computed solutions, then the cost goes down. The above worst case happens when no matches are ever made, so every substring obtained by removing some characters from the front of S and some from the back, with at most n-1 characters removed in total, must be examined. The number of such substrings is the number of ways to write k as an ordered sum of two non-negative integers, summed over k = 0, ..., n-1. The number of ways to write k in this way is k + 1, so there are n(n+1)/2 substrings, each of which takes constant time once its subproblems are known. Hence the overall cost is O(n^2).
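The effect of storing solutions can be observed directly. This sketch, with assumed function names, counts invocations of the naive recursion and the number of distinct subproblems evaluated when results are cached, on a string with no repeated characters (the worst case described above).

```python
from functools import lru_cache

def count_naive_calls(s):
    """Number of invocations of f made by the plain recursion."""
    calls = 0

    def f(i, j):
        nonlocal calls
        calls += 1
        if j - i <= 1:
            return j - i
        if s[i] == s[j - 1]:
            return 2 + f(i + 1, j - 1)
        return max(f(i + 1, j), f(i, j - 1))

    f(0, len(s))
    return calls

def count_memoized_evaluations(s):
    """Number of distinct (i, j) pairs actually computed with caching."""

    @lru_cache(maxsize=None)
    def f(i, j):
        if j - i <= 1:
            return j - i
        if s[i] == s[j - 1]:
            return 2 + f(i + 1, j - 1)
        return max(f(i + 1, j), f(i, j - 1))

    f(0, len(s))
    return f.cache_info().misses  # each miss is one computed subproblem

s = "abcdefghij"  # n = 10, no two characters match
print(count_naive_calls(s), count_memoized_evaluations(s))
```

For this input the naive recursion makes 2^10 - 1 = 1023 invocations, while caching computes only 10 * 11 / 2 = 55 subproblems.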

References

  1. Jonathan T. Schneider (2010). Personal communication.
