Cartesian tree

From PEGWiki
Jump to: navigation, search
A sequence and its corresponding Cartesian tree

Given a sequence of numbers (or any totally ordered objects), there exists a binary min-heap whose inorder traversal is that sequence. This is known as the Cartesian tree for that sequence.

Construction[edit]

To see what such a tree would look like, imagine starting with the example sequence shown on the right. Because 1 is the smallest element in the sequence, it must occur at the root of the Cartesian tree (which is min-heap-ordered). Now, in an inorder traversal of the Cartesian tree, all the elements in the root's left subtree will be visited before the root, and all the elements in the right subtree after; so we can readily conclude that the elements 9, 3, and 7 occur in the left subtree, whereas 8, 12, 10, 20, 15, 18, and 5 occur in the right subtree.

To actually construct the left subtree, we apply this reasoning recursively; the left subtree of the root is itself min-heap-ordered, so its root must be the element 3, as it is the smallest of 9, 3, and 7. Likewise, the root of the right subtree must be 5, which is the smallest of all the elements to the right of the 1; and so on.

It is clear from this discussion that there always exists at least one valid Cartesian tree for any given sequence.

Uniqueness[edit]

Based on the technique given above to construct a Cartesian tree, and the underlying reasoning, we see that the Cartesian tree of a sequence of distinct numbers is always unique. If there are duplicates in the sequence, there may be multiple possible Cartesian trees; for example, the sequence [5, 5] has two possible Cartesian trees (one in which the root has a left child, and one in which it has a right child). On the other hand, [5, 2, 5] has only one possible Cartesian tree, even though it has duplicate elements. The precise criterion is that the Cartesian tree is unique if and only if every segment of the sequence has a unique minimum element.

We can guarantee uniqueness even with duplicates by insisting that if an element appears multiple times in a sequence then the occurrence furthest to the left is always considered to be the smallest, or by finding any other way to break ties.

Naive method[edit]

The naive method described above may take up to O(N^2) time, since it requires scanning the sequence to find its minimum, then scanning the range to the left of the minimum, and scanning the range to the right, and so on down; this algorithm then has the same performance dynamics as quicksort (average O(N \log N)).

Left-to-right[edit]

The Cartesian tree may be constructed on one pass over the sequence. We proceed from left to right and add each value we encounter as a new node to the tree; after having processed the first i elements of the sequence, we will have a valid Cartesian tree for these initial elements. After we have processed the entire sequence (i = N), the Cartesian tree will be complete.

Thus, we begin by constructing a singleton tree containing only the first element a_1 in the sequence; this is a valid Cartesian tree for the segment of the sequence containing only the first element. Then, we repeatedly examine the next element a_{i+1}. At each step, we keep track of v_i, that is, the node that has label a_i (the last value inserted):

  • If a_{i+1} > a_i, then we just insert a new node v_{i+1} as the right child of v_i, and label this new node with a_{i+1}.
  • Otherwise, we consider node P(v_i), the parent of v_i. If the label of P(v_i) is less than or equal to a_{i+1}, we stop. Otherwise, we look to its parent, P(P(v_i)), and see whether its label is less than or equal to a_{i+1} and so on. If we stop at node u, with right child w, then we insert the new node v_{i+1} so that it is the right child of u, and make w the left child of v_{i+1}.
  • If we trace all the way up to the root of the tree so far, and still find no node with label less than or equal to a_{i+1}, then a_{i+1} is the smallest element seen so far; so make the new node v_{i+1} the root of the tree, and make the old root v_{i+1}'s left child.

Pseudocode[edit]

input sequence A with length N indexed from 1
// label the root node 1, with left child = right child = 0 (nonexistent)
root ← 1
left[1] ← 0
right[1] ← 0
// the root's label is the first item in the sequence
label[1] ← A[1]
for i ∈ [2..N]
    last ← i-1               // node of A[i-1]
    // create new node with label A[i]
    label[i] ← A[i]
    right[i] ← 0
    while label[last] > A[i] and last ≠ root
        last ← parent[last]
    if label[last] > A[i]    // A[i] is the smallest element yet; make it new root
        parent[root] ← i
        left[i] ← root
        root ← i
    else if right[last] = 0       // just insert it
        right[last] ← i
        parent[i] ← last
        left[i] ← 0
    else                     // reconfigure links
        parent[right[last]] ← i
        left[i] ← right[last]
        right[last] ← i
        parent[i] ← last

Proof of correctness[edit]

We define our invariant is as follows: After the loop in the above algorithm has run m times, we will have a valid Cartesian tree for the first m+1 elements of A.

This is true just before the loop runs for the first time (m = 0), because we will have a singleton tree that contains just the first element of A, and clearly this is a valid Cartesian tree for just that first element.

Now assume the loop has completed m iterations, but has not yet begun iteration m+1. Then, we have a valid Cartesian tree of the first m+1 elements of A, and we wish to insert A_{m+2}. We first observe the following:

  • The last node inserted, which we will call v, has no right child. This is because v must come last in the inorder traversal of the tree so far, as that tree is a valid Cartesian tree of all the elements added so far; but if v has a right child, then \mathrm{right}[v] will follow v in the inorder traversal.
  • v is the right child of \mathrm{parent}[v], if the latter exists. \mathrm{parent}[v] is furthermore the right child of \mathrm{parent}[\mathrm{parent}[v]], and so on all the way up to the root. If this were not the case, then v would lie in the left subtree of some node u, and therefore u would follow v in the inorder traversal, a contradiction.

Then consider the three cases covered in the code:

  1. If A_{m+2} is less than the label of the tree's root, then after we make it the new root (with the old root as its left child), A_{m+2} will occur at the very end of the inorder traversal, since everything in the left subtree will be visited before it; but the order in which the previously inserted elements are visited will be unchanged. Also, since A_{m+2} is the smallest element in the tree, and it goes at the root, the tree maintains its min-heap-ordering property. Hence the invariant is preserved.
  2. If A_{m+2} \geq A_{m+1}, then after we have inserted a new node as the right child of v, this node will follow v in the inorder traversal, but the order in which all other elements are visited will be unchanged, so the effect is to append the new node to the end of the inorder traversal of the old tree. The new tree will also maintain its min=heap-ordering, since the new node inserted is greater than or equal to its parent. Therefore, the invariant is preserved.
  3. If we find some proper ancestor u of the most recently inserted node, whose label is less than or equal to A_{m+2}, but whose right child's label is greater than A_{m+2}, then u is visited immediately before all the elements in its right subtree, and those elements occur at the very end of the inorder traversal (since u is descended from the root of the tree along only right edges); furthermore all the elements in u's right subtree are greater than or equal to the label of \mathrm{right}[u]. Now then, when we insert a new node as u's right child, and make the original right child of u the left child of the new root, we see that all the elements formerly in u's right subtree remain in u's right subtree, that they are visited after u in the inorder traversal and all consecutively, and that the new node is visited at the very end; we conclude that the inorder traversal of the new tree is just the inorder traversal of the old tree with the new node visited at the very end. Also, since A_{m+2} is greater than or equal to u's label, the min-heap-ordering is preserved at u; and since it is less than the label of u's original right child, the min-heap-ordering is preserved here as well; so finally we conclude that the invariant is preserved. _{\blacksquare}

Analysis of running time[edit]

Observe that the loop in the algorithm given above performs at most a constant number of operations per iteration, plus some variable number of operations possibly spent ascending the tree looking for a place to insert the new element. Now define the active point of the tree to be the node pointed to by the last variable. Every time we ascend the tree by one node, we decrease the active point's depth by one. Then, either Case 1, Case 2, or Case 3 happens. In Case 1, the active point's depth is left unchanged, as it goes from being the old tree's root to being the new tree's root. In Cases 2 and 3, the active point's depth increases by one, since the new node inserted is one level deeper than its parent (the previous active point), and this new node becomes the new active point. So in total the depth of the active node cannot increase more than N-1 times, once for each iteration of the loop. This means that it cannot decrease more than N-1 times in total, either, because it starts at 0 and must finish as a non-negative integer. Hence the total number of times we ascend the tree is O(N) across all iterations of the loop, and all the other loop operations take O(N) time total, so the construction is completed in O(N) time. (This analysis also shows that each of the iterations of the loop uses amortized constant time.)

Using all nearest smaller values[edit]

Another construction method relies on finding all nearest smaller values of the sequence, which can be done in linear time. That is, for each element A_i in the sequence, we find the maximum j < i such that A_j \leq A_i; this is the nearest smaller value on the left (LNSV) . We also find the minimum k > i such that A_k < A_i, and call this the nearest smaller value on the right (RNSV).

Theorem: Consider the unique Cartesian tree of A obtained by considering the first of equal elements to be the smallest. (In such a tree, if node with label x has a left child with label l, then l < x, and if it has a right child with label r, then r \leq x.) Then, given an element A_i of the sequence, excepting that it is the root of the tree, it will be either the LNSV's right child or the RNSV's left child, as determined by the following rules:

  1. If the LNSV and RNSV both exist, the larger of the two is the parent of A_i in the tree. In case of a tie, the RNSV wins.
  2. If only the LSNV exists, it is A_i's parent. If only the RNSV exists, it is A_i's parent.
  3. If neither the LNSV nor the RNSV exists, A_i is the root.

Proof: Suppose that node labelled A_i is the right child of its parent. Then its parent must be labelled A_j where j < i, because A_i occurs after A_j in the inorder traversal. Also, A_i \geq A_j. Now suppose there exists k such that j < k < i and A_i \geq A_k (that is, A_j is not the LNSV to A_i). Then, A_k must occur after A_j in the inorder traversal, but before A_i. This is only possible if A_k occurs in the left subtree of the node labelled A_i. However, this implies that A_k > A_i, a contradiction. A similar analysis shows that if A_i's parent occurs after A_i in the sequence A, it must be the RNSV of A_i. Now, pursuant to the three cases in the statement of the Theorem:

  1. Since the LNSV and RNSV exist, node labelled A_i is not the root. Therefore this node has a parent. The foregoing analysis shows that if node labelled A_i is the right child of its parent, then the parent is labelled with the LSNV, and if it is the left child, the parent is labelled with the RNSV. Suppose the former is true and A_i's parent is its LNSV. Then the RNSV follows A_i and A_j in the inorder traversal. However, it cannot be in the right subtree of node labelled A_i, since its value is strictly less than A_i. It follows that it cannot be anywhere in the right subtree of A_j either (or anywhere in the subtree rooted at A_j). Now let u be the lowest common ancestor of the nodes labelled with A_j and the RNSV. If u is the node labelled with the RNSV, then node labelled A_j is in the RNSV's left subtree, which means A_j (the LNSV) is strictly greater than the RNSV, as desired. If not, then since the RNSV occurs after A_j in the inorder traversal, it must be that the RNSV is in u's right subtree and node labelled A_i is in u's left subtree. But then u occurs after A_i in the inorder traversal and before the RNSV, and u's label is strictly less than A_i, which contradicts the "nearestness" of the supposed RNSV (so this case can never occur at all). A similar analysis shows that if node labelled A_i is the left child of its parent, then the parent is labelled with the RNSV and the RNSV is greater than or equal to the LNSV, as desired.
  2. If only the LSNV exists, then node labelled A_i still cannot be the root, but all the elements following A_i are greater than or equal to it, so node labelled A_i cannot be the left child of any node. Therefore it is the right child of its parent, and therefore its parent is the LNSV. Likewise, if only the RNSV exists, we see that node labelled A_i is the left child of its parent, and thus the parent must be the RNSV.
  3. If neither the LSNV nor the RNSV exists, then A_i is the leftmost occurrence of the minimum element in the sequence, and hence must be at the root of the tree. _{\blacksquare}

Using the preceding Theorem, after computing all nearest smaller values on both sides of each element (which requires linear time), we can determine each node's parent in constant time, so we obtain overall linear time construction.

Applications[edit]

  • A range minimum query on a sequence is equivalent to a lowest common ancestor query on the sequence's Cartesian tree. Hence, RMQ may be reduced to LCA using the sequence's Cartesian tree.
  • The treap, a balanced binary search tree structure, is a Cartesian tree of (key,priority) pairs; it is heap-ordered according to the priority values, and an inorder traversal gives the keys in sorted order.
  • The suffix tree of a string may be constructed from the suffix array and the longest common prefix array. The first step is to compute the Cartesian tree of the longest common prefix array. Details in the Suffix tree article.