Shunting yard algorithm

From PEGWiki
Revision as of 19:03, 31 July 2012 by Brian (Talk | contribs)

The shunting yard algorithm is a simple technique for parsing infix expressions containing binary operators of varying precedence. It was first described by Edsger Dijkstra in 1961. In general, the algorithm assigns to each operator its correct operands, taking into account the order of precedence. It can therefore be used to evaluate the expression immediately, to convert it into postfix, or to construct the corresponding syntax tree.

Motivation

To understand how the algorithm works, consider the two infix expressions 1*2+3 and 1*2↑3+4 (where ↑ represents exponentiation). In the first, the operand 2 belongs to the * operator, whereas in the second, it does not (instead, it belongs to the higher-precedence ↑ operator). So we may formulate the general rule that an operand belongs to whichever of the two operators next to it has higher precedence. If it is between two operators of equal precedence, then it belongs to the operator on the left if it is left-associative (like + or *), or to the operator on the right if it is right-associative (like ↑).

The basic algorithm

Initial version: without parentheses

We describe the version that produces postfix output, as it is the simplest conceptually. The algorithm parses the expression from left to right. It also constructs the postfix notation from left to right. Every time an operand is encountered, it is immediately transferred to the output stream. However, operators might have to be temporarily held on an auxiliary stack before they are moved to the output stream, in order to ensure that they are output in the correct order.

Specifically, in postfix notation, an operator is supposed to be output immediately after both of its operands have been output. When we encounter an operator in the input, we have not yet seen both of its operands (since its right operand is always to its right, and we parse the input from left to right). So we have to push it onto the stack; it is not ready to be output yet. Thus, after parsing the first three tokens of 1*2+3 or 1*2↑3 we'll have * on the operator stack and 1 2 in the output stream.

What happens next is of critical importance. When parsing 1*2+3, the final output is supposed to be 1 2 * 3 +, but when parsing 1*2↑3+4, the final output is supposed to be 1 2 3 ↑ * 4 +. In the former, once we encounter the next operator, +, we can conclude that the preceding 2 belongs to the * operator, since * is of higher precedence than +. Therefore, we pop * off the stack and into the output stream.

However, in the latter, we know the 2 does not belong to the *, since ↑ is of higher precedence; instead, the right operand of the * is the entire expression 2↑3. Therefore, we can't pop * off the stack and into the output stream yet, but must instead push ↑ onto the operator stack as well. Then, we encounter the operand 3, and move it to the output stream. Now, on encountering the operator +, since it is of lower precedence than the previous operator ↑, we know that the 3 belongs to the ↑, and thus pop ↑ off into the output stream. But the * now on the top of the stack is still of higher precedence than the +, which means we should pop off the * too, because whatever is between the * and the + in the input belongs to the *, not to the +. The current output stream is therefore 1 2 3 ↑ *.

So, in general, whenever we encounter an operator, we should add it to the stack only after we have popped off everything that is of higher precedence. (If the operator is left-associative, then we pop off operators of equal precedence as well; if it is right-associative, then we do not.)

Upon reaching the end of the input, all remaining operators are popped off the stack and transferred to the output stream. This makes sense because operators are supposed to appear in the postfix output immediately after their operands, and all operands have already been output; and since each of the operators left on the stack is waiting for its right operand, the operators should be output from right to left (that is, their order should be the reverse of the order in which they appear in the input). Thus, for example, 1+2*3 would become 1 2 3 * +.
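
The parenthesis-free version described so far can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes well-formed input made of single-digit operands, uses ^ in place of ↑, and the names PRECEDENCE, RIGHT_ASSOCIATIVE and to_postfix_basic are chosen here for the example.

```python
# Precedence table (an assumption of this sketch); '^' stands in for ↑.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOCIATIVE = {"^"}


def to_postfix_basic(expression):
    """Convert a parenthesis-free infix string of single digits to postfix."""
    output = []  # the output stream
    stack = []   # the auxiliary operator stack
    for token in expression.replace(" ", ""):
        if token.isdigit():
            output.append(token)  # operands go straight to the output
            continue
        # Pop every operator of higher precedence, and also those of equal
        # precedence when the incoming operator is left-associative.
        while stack:
            top = stack[-1]
            higher = PRECEDENCE[top] > PRECEDENCE[token]
            equal = PRECEDENCE[top] == PRECEDENCE[token]
            if higher or (equal and token not in RIGHT_ASSOCIATIVE):
                output.append(stack.pop())
            else:
                break
        stack.append(token)
    # End of input: flush the remaining operators, top of the stack first.
    while stack:
        output.append(stack.pop())
    return " ".join(output)


print(to_postfix_basic("1*2+3"))    # 1 2 * 3 +
print(to_postfix_basic("1*2^3+4"))  # 1 2 3 ^ * 4 +
```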

How to handle parentheses

Now we consider what happens when the expression may contain parentheses. A pair of matching parentheses is equivalent to a compound operand that must itself be parsed as though it is an expression in its own right. It therefore stands to reason that:

  • an opening parenthesis behaves like the beginning of an expression, in the sense that none of the operands after it belong to any of the operators before it; thus, when we encounter one, we push it onto the stack, and, until the corresponding closing parenthesis is encountered, we cannot examine any of the operators beneath it. So in the expression 1*2+3, we would normally pop off the * after encountering the +; but if we have 1*(2+3) instead, then, upon reaching the +, the stack will have ( on the top and * below it; so here we do not pop off anything at all.
  • a closing parenthesis behaves like the end of an expression, in the sense that once we reach it, we pop off all remaining operators that have accumulated on the stack since the corresponding opening parenthesis was encountered, and transfer them to the output stream. Finally, we pop off the (. However, we do not add either parenthesis to the output stream, since parentheses do not appear in postfix expressions.

Actually, if we add an opening parenthesis at the beginning of the input, and a closing one at the end, then we can eliminate the need to pop off all remaining operators at the end (as they will be popped when the final parenthesis is encountered), as well as the need to check whether the stack ever becomes empty (as it will always contain at least the initial parenthesis).
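
The sketch below illustrates the parenthesis rules together with the trick of wrapping the whole input in one extra pair of parentheses. It assumes balanced parentheses and single-digit operands; the function name to_postfix_parens and the precedence table are illustrative choices, with ^ standing in for ↑.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOCIATIVE = {"^"}


def to_postfix_parens(expression):
    """Postfix conversion with parentheses, using a sentinel pair around the input."""
    output = []
    stack = []
    for token in "(" + expression.replace(" ", "") + ")":
        if token.isdigit():
            output.append(token)
        elif token == "(":
            stack.append(token)
        elif token == ")":
            # Flush the operators accumulated since the matching '('.
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()  # discard the '(' itself; it is never output
        else:
            # The sentinel '(' guarantees the stack is non-empty here, and
            # nothing beneath a '(' is ever examined.
            while stack[-1] != "(" and (
                PRECEDENCE[stack[-1]] > PRECEDENCE[token]
                or (PRECEDENCE[stack[-1]] == PRECEDENCE[token]
                    and token not in RIGHT_ASSOCIATIVE)
            ):
                output.append(stack.pop())
            stack.append(token)
    # No final flush is needed: the closing sentinel already emptied the stack.
    return " ".join(output)


print(to_postfix_parens("1*(2+3)"))  # 1 2 3 + *
```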

Summary

Start off with an empty output stream and an empty stack. Repeatedly read a symbol from the input.

  • If it is part of a number (i.e., a digit or a decimal separator), then keep reading characters until an operator or parenthesis is encountered, convert the entire string just read into a number, and transfer the number to the output stream.
  • If it is a left-associative operator, then repeatedly pop from the stack into the output stream until either the stack becomes empty or the top of the stack is a parenthesis or a lower-precedence operator. Then push the operator onto the stack.
  • If it is a right-associative operator, then repeatedly pop from the stack into the output stream until either the stack becomes empty or the top of the stack is a parenthesis or an operator of lower or equal precedence. Then push the operator onto the stack.
  • If it is an opening parenthesis, push it onto the stack.
  • If it is a closing parenthesis, repeatedly pop operators from the stack into the output stream until an opening parenthesis is encountered. Pop the opening parenthesis off the stack, but do not emit it into the output stream.

Upon reaching the end of the input stream, pop all remaining operators off the stack and into the output stream.
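
The steps above can be put together as in the following sketch, which also reads multi-digit and decimal numbers. The helper names tokenize, must_pop and to_postfix, the error messages, and the use of ^ for ↑ are assumptions made for this example.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOCIATIVE = {"^"}


def tokenize(expression):
    """Split the input into number strings, operators and parentheses."""
    tokens = []
    i = 0
    while i < len(expression):
        ch = expression[i]
        if ch.isdigit() or ch == ".":
            # Keep reading until something that is not part of a number.
            j = i
            while j < len(expression) and (expression[j].isdigit() or expression[j] == "."):
                j += 1
            tokens.append(expression[i:j])
            i = j
        elif ch in PRECEDENCE or ch in "()":
            tokens.append(ch)
            i += 1
        elif ch.isspace():
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return tokens


def must_pop(top, token):
    """True if operator `top` on the stack must be output before `token` is pushed."""
    if PRECEDENCE[top] > PRECEDENCE[token]:
        return True
    return PRECEDENCE[top] == PRECEDENCE[token] and token not in RIGHT_ASSOCIATIVE


def to_postfix(expression):
    output, stack = [], []
    for token in tokenize(expression):
        if token in PRECEDENCE:
            while stack and stack[-1] != "(" and must_pop(stack[-1], token):
                output.append(stack.pop())
            stack.append(token)
        elif token == "(":
            stack.append(token)
        elif token == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unmatched closing parenthesis")
            stack.pop()  # drop the '(' without emitting it
        else:
            output.append(token)  # a number
    while stack:
        operator_or_paren = stack.pop()
        if operator_or_paren == "(":
            raise ValueError("unmatched opening parenthesis")
        output.append(operator_or_paren)
    return output


print(to_postfix("12.5*(3+4)^2"))  # ['12.5', '3', '4', '+', '2', '^', '*']
```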

It is easy to see that this algorithm will run in linear time, as each symbol is read once from the input stream, written at most once into the output stream, pushed onto the stack at most once, and popped off the stack at most once. (Also, operators at the top of the stack can be examined without being popped; but this occurs at most once for every operator in the input, since popping stops once this happens; so this only makes another linear contribution to the running time.) The space requirement is also obviously linear.

Extensions

Unary operators

Unary minus signs (and possibly unary plus signs) often appear in infix expressions. There are three modifications that have to be made to the algorithm in order to handle these:

  • A minus sign is always binary if it immediately follows an operand or a right parenthesis, and it is always unary if it immediately follows another operator or a left parenthesis, or if it occurs at the very beginning of the input. The algorithm must be modified in order to distinguish between the two.
  • A unary minus sign does not cause any operators to be popped from the stack. This is because, in the postfix output, the unary minus sign will always immediately follow its operand (whereas it always immediately precedes it in the infix), so no other operators can be popped before it at this point.
  • Unary minus signs and binary minus signs must be distinguished in the output in order to avoid ambiguity. Because postfix expressions are intended to be evaluated from left to right, we have a problem with an expression like 1 2 - 3 + if the minus sign is allowed to be unary; upon reaching it, we cannot determine whether it is unary or binary. If it is binary, then both of the preceding operands belong to it, and if it is unary, then only one of the two belongs to it, but perhaps the other belongs to some following operator. It is advisable to have some separate symbol for unary and binary minus signs, as is common in handheld scientific calculators. Also, this symbolic distinction must be made before the operator is pushed onto the stack, because once it is on the stack, we lose the ability to retrospectively determine whether it was supposed to be unary or binary.

It should also be pointed out that the unary minus sign is usually treated as though it has higher precedence than * and /. For example, the expression 10/-1*-2 usually evaluates to 20 rather than 5.
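
A possible way to apply these three modifications is sketched below. Several choices here are assumptions of the example rather than part of the algorithm as described: the symbol ~ marks unary minus in the output, ^ stands for ↑ and is ranked above unary minus (so that -2^2 parses as -(2^2)), and unary plus is not handled.

```python
import re

# '~' is a made-up output symbol for unary minus, ranked above '*' and '/'.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "~": 3, "^": 4}
RIGHT_ASSOCIATIVE = {"^"}


def must_pop(top, token):
    """True if operator `top` on the stack must be output before `token` is pushed."""
    if PRECEDENCE[top] > PRECEDENCE[token]:
        return True
    return PRECEDENCE[top] == PRECEDENCE[token] and token not in RIGHT_ASSOCIATIVE


def to_postfix_unary(expression):
    tokens = re.findall(r"\d+(?:\.\d+)?|[-+*/^()]", expression)
    output, stack = [], []
    previous = None  # the raw input token seen just before the current one
    for token in tokens:
        # A minus is unary at the start, after '(' or after another operator.
        unary_position = previous is None or previous == "(" or previous in PRECEDENCE
        if token == "-" and unary_position:
            stack.append("~")  # tagged before pushing; pops nothing
        elif token in PRECEDENCE:
            while stack and stack[-1] != "(" and must_pop(stack[-1], token):
                output.append(stack.pop())
            stack.append(token)
        elif token == "(":
            stack.append(token)
        elif token == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()
        else:
            output.append(token)
        previous = token
    while stack:
        output.append(stack.pop())
    return output


print(to_postfix_unary("10/-1*-2"))  # ['10', '1', '~', '/', '2', '~', '*']
```

Read from left to right, this output computes 10/(-1) first and then multiplies by -2, giving 20.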

Functions

Infix expressions often contain not only basic operators and parentheses but also more complex mathematical functions such as gcd(20,12). These are not hard to parse, either. When a function name like gcd is encountered, it is pushed onto the stack; immediately afterward, the following opening parenthesis will also be pushed onto the stack. Each comma behaves like a closing parenthesis, because it completes a subexpression, except that it does not cause the opening parenthesis to be popped. When the closing parenthesis is finally encountered, the opening parenthesis is popped, and if the top of the stack is now a function name, it too is popped, and transferred to the output stream.
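
The function-call rules might look like this in code. The sketch assumes that every identifier in the input is a function name (there are no variables) and that the input is well formed; the name to_postfix_functions and the token pattern are illustrative.

```python
import re

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOCIATIVE = {"^"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|[-+*/^(),]")


def is_name(token):
    return token[0].isalpha() or token[0] == "_"


def must_pop(top, token):
    """True if operator `top` on the stack must be output before `token` is pushed."""
    if PRECEDENCE[top] > PRECEDENCE[token]:
        return True
    return PRECEDENCE[top] == PRECEDENCE[token] and token not in RIGHT_ASSOCIATIVE


def to_postfix_functions(expression):
    output, stack = [], []
    for token in TOKEN.findall(expression):
        if is_name(token):
            stack.append(token)  # the function name waits on the stack
        elif token == ",":
            # Like ')': finish the current argument, but keep the '('.
            while stack[-1] != "(":
                output.append(stack.pop())
        elif token == "(":
            stack.append(token)
        elif token == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()
            if stack and is_name(stack[-1]):
                output.append(stack.pop())  # the call is complete
        elif token in PRECEDENCE:
            while stack and stack[-1] in PRECEDENCE and must_pop(stack[-1], token):
                output.append(stack.pop())
            stack.append(token)
        else:
            output.append(token)
    while stack:
        output.append(stack.pop())
    return output


print(to_postfix_functions("gcd(20,12)"))  # ['20', '12', 'gcd']
```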

Variadic functions

Variadic functions (that is, functions that do not have a fixed arity, but can take varying numbers of arguments) present an especial difficulty. Like the unary and binary minus signs, these may cause ambiguity once converted into postfix, so we must find some way to tag such functions in the postfix output so that the evaluator can determine how many of the preceding arguments belong to them. Note that, unlike in the case of the unary and binary minus signs, we cannot determine in advance while scanning the input how many arguments the function is taking. The easiest way to handle this is to maintain a second stack, which we might call the arity stack. Every time we encounter a function name, we push the number 1 onto the arity stack. Every time we encounter a comma, we increment the number on the top of the arity stack, since the comma indicates another argument to the function. Finally, when it comes time to pop off the function name from the operator stack, we also pop the number off the top of the arity stack; this tells us the arity of the function.
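
The arity stack can be sketched as follows. The output tag name/arity (for example max/3) is a notation invented for this example, and all identifiers are assumed to be function names. Note that, with the scheme exactly as described, a call with no arguments such as f() would be reported with arity 1, so a real implementation would need an extra check for that case.

```python
import re

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOCIATIVE = {"^"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|[-+*/^(),]")


def is_name(token):
    return token[0].isalpha() or token[0] == "_"


def must_pop(top, token):
    """True if operator `top` on the stack must be output before `token` is pushed."""
    if PRECEDENCE[top] > PRECEDENCE[token]:
        return True
    return PRECEDENCE[top] == PRECEDENCE[token] and token not in RIGHT_ASSOCIATIVE


def to_postfix_variadic(expression):
    output, stack, arity = [], [], []
    for token in TOKEN.findall(expression):
        if is_name(token):
            stack.append(token)
            arity.append(1)  # start counting this call's arguments
        elif token == ",":
            while stack[-1] != "(":
                output.append(stack.pop())
            arity[-1] += 1   # one more argument for the innermost call
        elif token == "(":
            stack.append(token)
        elif token == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()
            if stack and is_name(stack[-1]):
                name = stack.pop()
                output.append(f"{name}/{arity.pop()}")  # tag with its arity
        elif token in PRECEDENCE:
            while stack and stack[-1] in PRECEDENCE and must_pop(stack[-1], token):
                output.append(stack.pop())
            stack.append(token)
        else:
            output.append(token)
    while stack:
        output.append(stack.pop())
    return output


print(to_postfix_variadic("max(1,2+3,4)"))  # ['1', '2', '3', '+', '4', 'max/3']
```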

Variations

As mentioned previously, the shunting yard algorithm can also be used to directly evaluate the input expression, or to convert it into a syntax tree.

Evaluation

It is a simple matter to obtain the postfix form of the input expression and then evaluate it, as postfix can be easily evaluated from left to right; it requires no knowledge of operator precedence, and contains no parentheses. However, we can also directly evaluate the input, without converting it into postfix. To do this, we replace the output stream with an output stack. Every time we encounter an operand, we push it onto the output stack (where previously we would have transferred it to the output stream). Whenever we would normally transfer an operator into the output stream, we instead perform an evaluation; we pop off the correct number of operands from the output stack, apply the operator, and then push the result back onto the output stack. At the end of the input, as before, we process all remaining operators; and we should be left with a single value in the output stack, which is the value of the entire expression. So, for example, whenever we pop a * from the operator stack, we pop the top operand off the output stack; call this y; and we pop the next operand off the output stack too; call this x; and we push x*y onto the output stack. (Note that the order of the operands is the reverse of the order in which we pop them off; this is because a stack is a last-in, first-out structure.)
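
A minimal sketch of direct evaluation with two stacks is given below. It handles binary operators and parentheses only, treats every number as a float, and assumes well-formed input; the names evaluate, reduce_top and APPLY are illustrative, and ^ stands for ↑.

```python
import operator
import re

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOCIATIVE = {"^"}
APPLY = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "^": operator.pow,
}


def must_pop(top, token):
    """True if operator `top` on the stack must be applied before `token` is pushed."""
    if PRECEDENCE[top] > PRECEDENCE[token]:
        return True
    return PRECEDENCE[top] == PRECEDENCE[token] and token not in RIGHT_ASSOCIATIVE


def evaluate(expression):
    tokens = re.findall(r"\d+(?:\.\d+)?|[-+*/^()]", expression)
    values, operators = [], []  # output stack and operator stack

    def reduce_top():
        # Replaces "transfer the operator to the output stream".
        op = operators.pop()
        y = values.pop()  # right operand comes off first
        x = values.pop()  # then the left operand
        values.append(APPLY[op](x, y))

    for token in tokens:
        if token in PRECEDENCE:
            while operators and operators[-1] != "(" and must_pop(operators[-1], token):
                reduce_top()
            operators.append(token)
        elif token == "(":
            operators.append(token)
        elif token == ")":
            while operators[-1] != "(":
                reduce_top()
            operators.pop()
        else:
            values.append(float(token))
    while operators:
        reduce_top()
    return values.pop()  # the single remaining value


print(evaluate("1*2^3+4"))  # 12.0
```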

Conversion into syntax tree

Direct conversion into a syntax tree follows the same principle as direct evaluation (the previous section). Here, we keep an output stack of subtrees. Every operand gets pushed onto the output stack immediately as a singleton tree, and, instead of transferring an operator to the output stream, we pop off the appropriate number of subtrees from the output stack and make them the children of a new subtree (whose root is the operator), which we then push back onto the output stack. After the algorithm has finished, the output stack should contain a single tree: the syntax tree for the entire expression.
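
The same idea can be sketched for tree construction. In this example a leaf is represented by its token string and an internal node by a tuple (operator, left, right); that representation, the name to_tree, and the restriction to binary operators are assumptions of the sketch.

```python
import re

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOCIATIVE = {"^"}


def must_pop(top, token):
    """True if operator `top` on the stack must become a node before `token` is pushed."""
    if PRECEDENCE[top] > PRECEDENCE[token]:
        return True
    return PRECEDENCE[top] == PRECEDENCE[token] and token not in RIGHT_ASSOCIATIVE


def to_tree(expression):
    tokens = re.findall(r"\d+(?:\.\d+)?|[-+*/^()]", expression)
    trees, operators = [], []  # output stack of subtrees, operator stack

    def build_top():
        op = operators.pop()
        right = trees.pop()  # popped first, so it is the right child
        left = trees.pop()
        trees.append((op, left, right))

    for token in tokens:
        if token in PRECEDENCE:
            while operators and operators[-1] != "(" and must_pop(operators[-1], token):
                build_top()
            operators.append(token)
        elif token == "(":
            operators.append(token)
        elif token == ")":
            while operators[-1] != "(":
                build_top()
            operators.pop()
        else:
            trees.append(token)  # an operand is a singleton tree
    while operators:
        build_top()
    return trees.pop()


print(to_tree("1*2+3"))    # ('+', ('*', '1', '2'), '3')
print(to_tree("1*(2+3)"))  # ('*', '1', ('+', '2', '3'))
```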

The use of a syntax tree representation relieves us of the responsibility of finding some way to tag unary minus signs and variadic functions in the output (although we must still identify unary minus signs before pushing them onto the operator stack); the correct number of operands of an operator is simply the number of children the operator's node has in the final syntax tree. A variation of the algorithm, in which function arguments are added directly as children of the function token still waiting on the operator stack as soon as their construction is finished, also eliminates the need for the arity stack; details are left as an exercise for the reader.

Code

References

  1. Dijkstra, E. W. (1961). ALGOL-60 Translation. Retrieved from http://www.cs.utexas.edu/~EWD/MCReps/MR35.PDF
  2. Wikipedia