Editing Longest palindromic substring

Jump to: navigation, search

Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.

The edit can be undone. Please check the comparison below to verify that this is what you want to do, and then save the changes below to finish undoing the edit.
Latest revision Your text
Line 1: Line 1:
{{Distinguish|Longest palindromic subsequence}}
+
The '''longest palindromic substring''' problem is exactly as it sounds: the problem of finding a maximal-length substring of a given string that is also a palindrome. For example, the longest palindromic substring of '''bananas''' is '''anana'''. The longest palindromic substring is not guaranteed to be unique; for example, in the string '''abracadabra''', there is no palindromic substring with length greater than three, but there are two palindromic substrings with length three, namely, '''aca''' and '''ada'''. In some applications it may be necessary to return ''all'' maximal-length palindromic substrings, in some, only one, and in some, only the maximum length itself. This article discusses algorithms for solving these problems.
  
The '''longest palindromic substring''' problem is exactly as it sounds: the problem of finding a maximal-length [[substring]] of a given string that is also a palindrome. For example, the longest palindromic substring of '''bananas''' is '''anana'''. The longest palindromic substring is not guaranteed to be unique; for example, in the string '''abracadabra''', there is no palindromic substring with length greater than three, but there are two palindromic substrings with length three, namely, '''aca''' and '''ada'''. In some applications it may be necessary to return ''all'' maximal-length palindromic substrings, in some, only one, and in some, only the maximum length itself. This article discusses algorithms for solving these problems.
+
==Theoretical discussion==
 +
===Lower bound===
 +
The worst-case running time of any correct algorithm must be at least <math>O(N)</math>, where <math>N</math> is the length of the string. This is because the algorithm cannot be correct unless it can verify whether the entire string is a palindrome (because in that case it is itself the longest palindromic substring), and it is not possible to conclude that a string is palindromic unless every character is examined.
 +
 
 +
One might ask whether the size of the output gives a better lower bound on the worst-case running time in the case in which it is necessary to return all maximal-length palindromic substrings (rather than simply reporting the maximum length). It turns out that it does not, because there can only be up to linearly many substrings of any given length. Specifically, there cannot be more than <math>N</math> different palindromic substrings of maximal length, because there cannot be more than <math>N</math> substrings of any given length in the first place.
 +
 
 +
===Upper bound===
 +
Given a substring of length <math>l</math>, we can determine whether it is a palindrome in <math>O(l)</math> time by iterating through it and checking whether the character at zero-based offset <math>i</math> always matches that at <math>l-i-1</math>. Furthermore, our original string, of length <math>N</math>, has <math>\Theta(N^2)</math> substrings each of length <math>O(N)</math>. Therefore, we conclude, by enumerating all substrings of our original string, checking whether each is a palindrome, and finding the longest one which is, we obtain an <math>O(N^3)</math> solution. Thus <math>O(N^3)</math> is an upper bound on the efficiency of the best algorithm, and we have established that the longest palindromic substring problem can be solved in polynomial time.
 +
 
 +
==Observation==
 +
Every palindromic substring has a ''centre''. For an odd-length substring, this is its middle character; for an even-length substring, this is the imaginary space between its two middle characters. Call each character and each space in the original string a ''position''. Then there are <math>2N-1</math> positions in any string of length <math>N</math>, and any longest palindromic substring must have one of these as its centre.
 +
 
 +
Therefore, if we iterate through all positions of the string, and for each one determine the longest palindromic substring ''centered on that position'', then one of these is the longest palindromic substring of the whole string. For example, in the string '''opposes''', the longest palindromic substrings centered around each position are ['''o''', (empty), '''p''', '''oppo''', '''p''', (empty), '''o''', (empty), '''s''', (empty), '''ses''', (empty), '''s'''], and one of these --- '''oppo''' --- is the longest palindromic substring of the original string.
 +
 
 +
Being a bit clever in determining the longest palindromic substring centered about a given string improves the running time over the naive solution dramatically. We begin by noting that there definitely is a palindromic substring of length zero (the empty string) centered around the current center, if it is a space, or a palindromic substring of length one (the character itself) centered around the current center, if it is a character. Then, we repeatedly attempt to make a longer palindromic substring centered at our current location by adding the characters just before and just after the current longest palindromic substring. We can do this if and only if these characters match. If they do not match, there is no hope of any longer palindromic substring centered here existing, so we are done.
 +
 
 +
For example, suppose our current center is between the two '''b''''s in the word '''scabbards'''. We already know there exists a palindromic substring of length zero (the empty string) centered around our current location. To determine whether we can make it longer, we check the first character on either side, that is, the two '''b''''s. They match, so we know that '''bb''' is a palindromic substring centered at our current location. Then we try to extend it again, by checking the first character on either side. Now we find two '''a''''s, so we now know we have '''abba''', a palindromic substring of length 4, centered around our current location. Again we try to extend it, by checking one character on each side. We find the '''c''' and the '''r''', which do ''not'' match. Therefore we know that '''abba''' is the longest palindromic substring centered our current location --- substrings centered at our current location that are longer than '''cabbar''', such as '''scabbard''', have no hope of being palindromic.
 +
 
 +
Notice that once we have fixed the centre, we will not examine any character of the string more than once in determining the longest palindromic substring centered there. That is, determining the longest palindromic substring centered at any given location takes only <math>O(N)</math> time, and by iterating over all possible centres, an <math>O(N^2)</math> algorithm is obtained.
  
 
==Toward a better bound==
 
==Toward a better bound==
Line 31: Line 49:
  
 
==A simple asymptotically optimal solution==
 
==A simple asymptotically optimal solution==
Although we have already established that the <math>O(N)</math> solution using suffix trees and LCA queries is asymptotically optimal, we are driven to seek a better solution for many reasons. The suffix tree based solution:
 
* requires much code to implement
 
* has a high constant factor on runtime
 
* consumes a great deal of memory, and
 
* is still <math>O(N \log N)</math> with general alphabets.
 
However, there exists a simpler, shorter, faster, less memory-hungry algorithm<ref name="akalin"/> which accesses the string only by comparing its characters for equality (and hence has running time independent of alphabet size). This is known as ''Manacher's algorithm''.<ref name="manacher"/><ref name="galil"/> Again we will compute the longest palindromic substring centered around each possible position (so that returning all longest palindromic substrings is as easy as simply finding the maximum possible length for a palindromic substring). Our algorithm will return a ''longest palindrome array'': an array in which the <math>i</math><sup>th</sup> element contains the length of the longest palindrome centered around the <math>i</math><sup>th</sup> position from the left. (Note that the parities of elements in the longest palindrome array alternate.)
 
  
The algorithm processes the string from left to right, and fills in the longest palindrome array in the same order, so that when any given position is being considered, the longest palindromes centered around all positions to the left are already known. The algorithm distinguishes two kinds of positions: "interesting" positions and "uninteresting" positions. The algorithm "hops" from each interesting position to the one immediately to the right. It spends amortized constant time on each interesting position. Interesting positions are so named because any maximal-length palindromic substring overall must be centered at one of them. Uninteresting positions' entries in the longest palindrome array can be computed in constant time "in passing" from one interesting position to the next.
 
 
We will proceed through a series of observations to derive the algorithm. First, the designation of interesting and uninteresting positions is not unique; in particular, any uninteresting position can be relabelled interesting without affecting correctness or consistency. That being said, we begin by denoting the first position --- the one immediately to the left of the first character --- interesting, and going from there.
 
 
''Lemma'' (Second observation): If <math>s</math> is the longest palindrome centered at the current position, and <math>t</math> is the longest palindromic proper suffix of <math>s</math>, then all positions between the current position and the centre of <math>t</math> may safely be denoted uninteresting.
 
 
''Proof'': Consider some position <math>i</math> between the centres of <math>s</math> and <math>t</math>. Suppose there exists a palindromic substring, <math>u</math>, of length greater than or equal to that of <math>s</math>, centered at <math>i</math>. Certainly the last character of <math>u</math> lies to the right of the last character of <math>s</math>, since <math>u</math> is centered further to the right and is longer. So we repeatedly remove one character from each end of <math>u</math> until <math>u</math> and <math>s</math> end at the same position. Since <math>u</math> is now shorter than <math>s</math> (since its centre is to the right but both <math>s</math> and <math>u</math> extend equally far on the right), it is a palindromic proper suffix of <math>s</math>. Since the centre of <math>u</math> is to the left of the centre of <math>t</math>, <math>u</math> is longer than <math>t</math>, a contradiction as <math>t</math> is maximal by hypothesis.
 
 
To find the longest palindromic proper suffix, we make yet another observation: the longest palindromic proper suffix and the longest palindromic proper prefix are reflections of each other about the current centre. So we scan backward from the current position. For each position <math>i</math> to the left of the current position, we know the longest palindromic substring <math>u</math> centered there. If the first character of <math>u</math> is to the right of the first character of <math>s</math>, then the longest palindromic substring centered at the position obtained by reflecting <math>u</math> about the current position is identical to <math>s</math>. (If it were possible to extend it, it would also be possible to extend <math>s</math>.) This means we can fill in the corresponding position in the longest palindromic array immediately, and also that there is no palindromic prefix of <math>s</math> centered at <math>i</math> as <math>u</math>, and hence no corresponding palindromic suffix at the position obtained by reflecting <math>i</math> across the current centre. Otherwise, on the other hand, we know there ''is'' a palindromic (proper) prefix centered at <math>i</math>, which means there is a palindromic (proper) suffix centered at the position obtained by reflecting <math>i</math> across the current centre. Furthermore, since we are scanning backward from the current centre, the ''first'' such <math>i</math> we encounter is the centre of the ''longest'' palindromic proper prefix, so we have also found the ''longest'' palindromic proper suffix.
 
 
These facts give us enough information to construct an algorithm:
 
# (Initialization) Fill in the first entry of the longest palindrome array; it is zero, for this corresponds to the position just before the beginning of the string. Start at the position right on the first character of the string. This is the first interesting position. Also, we need to keep track of the length of the currently longest known palindrome centered at the current position. This starts out as one (the character itself).
 
# (Test interesting position) Check whether the palindrome we have so far centered at the current position can be extended (that is, whether there exist identical characters immediately to the left and right of it). If so, extend it. Do this repeatedly until it cannot be extended anymore, either because it is a prefix/suffix of the entire string, or because the next characters on the left and right do not match.
 
# We now know the length of the longest palindrome centered at the current location. Fill this in.
 
# (Find next interesting position) Let <math>i</math> scan backward through the longest palindrome array and <math>j</math> scan forward, so that <math>i</math> and <math>j</math> are reflections across the current position. Note that the entry for <math>i</math> is always known beforehand.
 
#* If <math>j</math> is after the last position in the string, we have computed all entries in the longest palindrome array, and the algorithm halts. (This always happens eventually.)
 
#* Otherwise, if the longest palindrome centered at <math>i</math> has its first character to the right of the first character of the longest palindrome centered at our current position, then its length <math>l</math> is also the length of the longest palindrome centered at <math>j</math>. Fill this in.
 
#* Otherwise, the longest palindrome centered at <math>j</math> is ''at least as long'' as the longest palindrome centered at <math>i</math>. <math>j</math> is the next interesting position. Go back to step 2, setting the current position to <math>j</math> and the length of the currently longest known palindrome centered there to <math>l</math>.
 
 
It should now be clear that this algorithm is correct, but it is not necessarily clear that it is efficient.
 
 
'''Theorem''': Manacher's algorithm runs in linear time.
 
 
'''Proof''':
 
* The outer loop runs <math>O(N)</math> times.
 
* Step 1 takes constant time and is executed once.
 
* Step 2 contains a loop. Each iteration of the loop takes constant time and examines two characters of the string, one on the left of the current palindrome and one on the right. Call the pointer to the next character on the right to examine <math>r</math>. After each iteration of the inner loop, either it runs again immediately, moving <math>r</math> one character to the right or step 3 and step 4 follow, which either terminate the algorithm or move the current position to the right, keeping <math>r</math> fixed. (This is because we always move to a suffix in step 4.) So after each iteration either <math>r</math> or the current position is advanced, and thus total number of iterations of the loop in step 2 is no more than the sum of <math>N</math>, the number of characters, and <math>2N+1</math>, the number of positions. So the total time spent in step 2 is <math>O(N)</math>.
 
* Step 3 takes constant time and is executed once per iteration of the outer loop.
 
* Step 4 consists of some constant-time parts executed once per iteration of the outer loop, as well as an inner loop. The inner loop takes constant time per iteration can only run <math>O(N)</math> times in total, because each iteration fills in a previously unknown entry in the longest palindrome array, which has <math>O(N)</math> entries.
 
Overall, each step contributes <math>O(N)</math> to the running time, and so Manacher's algorithm is <math>O(N)</math> overall.
 
  
 
==References==
 
==References==
Line 75: Line 57:
 
<ref name="ukkonen">E. Ukkonen. (1995). On-line construction of suffix trees. ''Algorithmica'' '''14'''(3):249-260. [http://www.cs.helsinki.fi/u/ukkonen/SuffixT1.pdf PDF] | [http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf PDF with figures]</ref>
 
<ref name="ukkonen">E. Ukkonen. (1995). On-line construction of suffix trees. ''Algorithmica'' '''14'''(3):249-260. [http://www.cs.helsinki.fi/u/ukkonen/SuffixT1.pdf PDF] | [http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf PDF with figures]</ref>
 
<ref name="tarjan">Gabow, H. N.; Tarjan, R. E. (1983), "A linear-time algorithm for a special case of disjoint set union", <i>Proceedings of the 15th ACM Symposium on Theory of Computing (STOC)</i>, pp.&#160;246–251, doi:[http://dx.doi.org/10.1145%2F800061.808753 10.1145/800061.808753]</ref>
 
<ref name="tarjan">Gabow, H. N.; Tarjan, R. E. (1983), "A linear-time algorithm for a special case of disjoint set union", <i>Proceedings of the 15th ACM Symposium on Theory of Computing (STOC)</i>, pp.&#160;246–251, doi:[http://dx.doi.org/10.1145%2F800061.808753 10.1145/800061.808753]</ref>
<ref name="akalin">Fred Akalin. Finding the longest palindromic substring in linear time. Retrieved 2016-10-01 from [https://www.akalin.com/longest-palindrome-linear-time https://www.akalin.com/longest-palindrome-linear-time]. ''(This source contains a Python implementation of the simple linear-time algorithm.)''</ref>
+
<ref name="akalin">Frederick Akalin. Finding the longest palindromic substring in linear time. [http://www.akalin.cx/2007/11/28/finding-the-longest-palindromic-substring-in-linear-time/]</ref>
<ref name="manacher">Glenn Manacher. 1975. A New Linear-Time "On-Line" Algorithm for Finding the Smallest Initial Palindrome of a String. J. ACM 22, 3 (July 1975), 346-351. DOI=10.1145/321892.321896 http://doi.acm.org/10.1145/321892.321896 </ref>
+
<ref name="galil">Alberto Apostolico, Dany Breslauer, and Zvi Galil (1994), "Parallel Detection of all Palindromes in a String", Comput. Sci.</ref>
+
 
</references>
 
</references>
 
==Code==
 
[[Longest_palindromic_substring/lps.cpp|C++ implementation of Manacher's algorithm]]
 
 
==External links==
 
* SPOJ problems:
 
** {{SPOJ|PLD|Palindromes}}
 
** {{SPOJ|PALIM|Yet Another Longest Palindrome Problem}}
 
** {{SPOJ|PALDR|Even Palindrome}}
 
* The problem Calf Flac from [http://train.usaco.org/ the USACO training pages] can be solved using the naive algorithm. ilove youuuuuuuuuuuuu
 

Please note that all contributions to PEGWiki are considered to be released under the Attribution 3.0 Unported (see PEGWiki:Copyrights for details). If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource. Do not submit copyrighted work without permission!

Cancel | Editing help (opens in new window)

Templates used on this page: