# Prefix sum array and difference array

Given an array of numbers, we can construct a new array by replacing each element by the difference between itself and the previous element, except for the first element, which we simply ignore. This is called the difference array, because it contains the first differences of the original array. We will denote the difference array of array $A$ by $D(A)$. For example, the difference array of $A = [9, 2, 6, 3, 1, 5, 0, 7]$ is $D(A) = [2-9, 6-2, 3-6, 1-3, 5-1, 0-5, 7-0]$, or $[-7, 4, -3, -2, 4, -5, 7]$.

We see that the difference array can be computed in linear time from the original array, and is shorter than the original array by one element. Here are implementations in C and Haskell. (Note that the Haskell implementation actually takes a list, not an array, and will work fine for infinite lists.)

```c
// D must have enough space for n-1 ints
void difference_array(int* A, int n, int* D)
{
    for (int i = 0; i < n-1; i++)
        D[i] = A[i+1] - A[i];
}
```

```haskell
d :: [Int] -> [Int] -- NB: the more general type signature would be (Num a) => [a] -> [a]
d a = zipWith (-) (tail a) a
```

The prefix sum array is the opposite of the difference array. Given an array of numbers $A$ and an arbitrary constant $c$, we first prepend $c$ to the array, and then replace each element with the sum of itself and all the elements preceding it. For example, if we start with $A = [9, 2, 6, 3, 1, 5, 0, 7]$ and choose to prepend the arbitrary value $-8$, we obtain $P(-8, A) = [-8, -8+9, -8+9+2, -8+9+2+6, ..., -8+9+2+6+3+1+5+0+7]$, or $[-8, 1, 3, 9, 12, 13, 18, 18, 25]$. Computing the prefix sum array can also be done in linear time, and the prefix sum array is longer than the original array by one element:

```c
// P must have enough space for n+1 ints
void prefix_sum_array(int c, int* A, int n, int* P)
{
    P[0] = c;
    for (int i = 0; i < n; i++)
        P[i+1] = P[i] + A[i];
}
```

```haskell
p :: Int -> [Int] -> [Int] -- NB: the more general type signature would be (Num a) => a -> [a] -> [a]
p = scanl (+)              -- NB: this is the same as p c a = scanl (+) c a
```

Note that every array has an infinite number of possible prefix sum arrays, since we can choose whatever value we want for $c$. For convenience, we usually choose $c = 0$. Changing the value of $c$ merely shifts every element of $P(c,A)$ by the same constant. For example, $P(15, A) = [15, 24, 26, 32, 35, 36, 41, 41, 48]$; each element of $P(15, A)$ is exactly 23 more than the corresponding element of $P(-8, A)$.

The functions $D$ and $P$ carry out reverse processes. Given a nonempty zero-indexed array $A$:

1. $D(P(c, A)) = A$ for any $c$. For example, taking the difference array of $P(-8, A) = [-8, 1, 3, 9, 12, 13, 18, 18, 25]$ gives $[9, 2, 6, 3, 1, 5, 0, 7]$, that is, it restores the original array $A$.
2. $P(A_0, D(A)) = A$. Thus, taking $D(A) = [-7, 4, -3, -2, 4, -5, 7]$ and $A_0 = 9$ (initial element of $A$), we have $P(A_0, D(A)) = [9, 2, 6, 3, 1, 5, 0, 7]$, again restoring the original array $A$.

## Analogy with calculus

These two processes—computing the difference array, and computing a prefix sum array—are the discrete equivalents of differentiation and integration in calculus, which operate on continuous domains. An entry in an array is like the value of a function at a particular point.

- Reverse processes:
  - $D(P(c, A)) = A$ for any $c$. Likewise, $\frac{d}{dx} \int_c^x f(t)\, dt = f(x)$ for any $c$.
  - $P(A_0, D(A)) = A$. Likewise, $f(a) + \int_a^x \frac{df}{dt}\, dt = f(x)$.
- Uniqueness:
  - A differentiable function $f(x)$ has only one derivative, $\frac{df}{dx}$. An array $A$ has only one difference array, $D(A)$.
  - A continuous function $f(x)$ has an infinite number of antiderivatives, $F_c(x) = \int_c^x f(t)\, dt$, where $c$ can be any number in its domain, but they differ only by a constant (their graphs are vertical translations of each other). An array $A$ has an infinite number of prefix sum arrays $P(c,A)$, but they differ only by a constant (at each entry).
  - Given a function $f:[a,b]\to\mathbb{R}$ and the fact that $F$, an antiderivative of $f$, satisfies $F(a) = y_0$, we can uniquely reconstruct $F$. That is, even though $f$ has an infinite number of antiderivatives, we can pin down one of them once we are told the value it must attain on the left edge of $f$'s domain. Likewise, given an array $A$ and the fact that $P$, a prefix sum array of $A$, satisfies $P_0 = c$, we can uniquely reconstruct $P$.
- Effect on length:
  - $D(A)$ is shorter than $A$ by one element. Differentiating $f:[a,b] \to \mathbb{R}$ gives a function $f':(a,b) \to \mathbb{R}$ (shortening the closed interval to an open one).
  - $P(c,A)$ is longer than $A$ by one element. Integrating $f:(a,b) \to \mathbb{R}$ gives a function $F:[a,b] \to \mathbb{R}$ (lengthening the open interval to a closed one).

Because of these similarities, we will speak simply of differentiating and integrating arrays. An array can be differentiated multiple times, but eventually it will shrink to length 0. An array can be integrated any number of times.

## Use of prefix sum array

The Fundamental Theorem of Calculus also has an analogue, which is why the prefix sum array is so useful. To compute an integral $\int_a^b f(t)\, dt$, which is like a continuous kind of sum of an infinite number of function values $f(a), f(a+\epsilon), f(a+2\epsilon), ..., f(b)$, we take any antiderivative $F$ and compute $F(b) - F(a)$. Likewise, to compute the sum of values $A_i, A_{i+1}, A_{i+2}, ..., A_{j-1}$, we take any prefix sum array $P(c,A)$ and compute $P_j - P_i$. Notice that just as we can use any antiderivative $F$, because the constant cancels out, we can use any prefix sum array, because the initial value cancels out. (Note that the range of summation is the half-open interval $[i, j)$.)

Proof: $P_j = c + \sum_{k=0}^{j-1} A_k$ and $P_i = c + \sum_{k=0}^{i-1} A_k$. Subtracting gives $P_j - P_i = \sum_{k=0}^{j-1} A_k - \sum_{k=0}^{i-1} A_k = \sum_{k=i}^{j-1} A_k$ as desired. $\blacksquare$

This is best illustrated via example. Let $A = [9,2,6,3,1,5,0,7]$ as before. Take $P(0,A) = [0, 9, 11, 17, 20, 21, 26, 26, 33]$. Then, suppose we want $A_2 + A_3 + A_4 + A_5 = 6 + 3 + 1 + 5 = 15$. We can compute this by taking $P_6 - P_2 = 26 - 11 = 15$. This is because $P_6 - P_2 = (0 + A_0 + A_1 + A_2 + A_3 + A_4 + A_5) - (0 + A_0 + A_1) = A_2 + A_3 + A_4 + A_5$.

### Example: Counting Subsequences (SPOJ)

Computing the prefix sum array is rarely the most difficult part of a problem. Instead, the prefix sum array is kept on hand because the algorithm to solve the problem makes frequent reference to range sums.

We will consider the problem Counting Subsequences from IPSC 2006. Here we are given an array of integers $S$ and asked to find the number of contiguous subsequences of the array that sum to 47.

To solve this, we will first transform array $S$ into its prefix sum array $P(0,S)$. Notice that the sum of each contiguous subsequence $S_i + S_{i+1} + S_{i+2} + ... + S_{j-1}$ corresponds to the difference of two elements of $P$, that is, $P_j - P_i$. So what we want to find is the number of pairs $(i,j)$ with $P_j - P_i = 47$ and $i < j$. (Note that if $i > j$, we will instead get a subsequence with sum -47.)

However, this is quite easy to do. We sweep through $P$ from left to right, keeping a map of all elements of $P$ we have seen so far, along with their frequencies. For each element $P_j$, we look up the value $P_j - 47$ in our map and count the number of times it has appeared so far; this tells us how many contiguous subsequences ending at $S_{j-1}$ have sum 47. Finally, adding up the number of contiguous subsequences with sum 47 ending at each entry of $S$ gives the total number of such subsequences in the array. The total time taken is $O(N)$ if we use a hash table implementation of the map.