Earley Parsing Explained — Semantic Actions

(Source code for the impatient).

We now know how to transform unstructured input into a genuine parse tree. There’s only one slight problem: that parse tree is ugly. More specifically, that parse tree is entirely determined by the shape of the grammar, offering us very little flexibility. Let us review our arithmetic example.

The Grammar:

Sum     -> Sum     [+-] Product
Sum     -> Product
Product -> Product [*/] Factor
Product -> Factor
Factor  -> '(' Sum ')'
Factor  -> Number
Number  -> [0-9]

The input we want to parse: 1+(2*3+4)

The resulting parse tree:

         ┌─────┐
         │ Sum │
         └┬─┬─┬┘
   ┌──────┘ │ └────────┐
┌──┴──┐   ┌─┴─┐   ┌────┴────┐
│ Sum │   │ + │   │ Product │
└──┬──┘   └───┘   └────┬────┘
┌──┴──────┐        ┌───┴────┐
│ Product │        │ Factor │
└──┬──────┘        └─┬─┬─┬──┘
┌──┴─────┐      ┌────┘ │ └────┐
│ Factor │    ┌─┴─┐ ┌──┴──┐ ┌─┴─┐
└──┬─────┘    │ ( │ │ Sum │ │ ) │
┌──┴─────┐    └───┘ └┬─┬─┬┘ └───┘
│ Number │    ┌──────┘ │ └────────┐
└──┬─────┘ ┌──┴──┐   ┌─┴─┐   ┌────┴────┐
 ┌─┴─┐     │ Sum │   │ - │   │ Product │
 │ 1 │     └──┬──┘   └───┘   └────┬────┘
 └───┘   ┌────┴────┐          ┌───┴────┐
         │ Product │          │ Factor │
         └──┬─┬─┬──┘          └───┬────┘
     ┌──────┘ │ └─────┐       ┌───┴────┐
┌────┴────┐ ┌─┴─┐ ┌───┴────┐  │ Number │
│ Product │ │ * │ │ Factor │  └───┬────┘
└────┬────┘ └───┘ └───┬────┘    ┌─┴─┐
┌────┴───┐        ┌───┴────┐    │ 4 │
│ Factor │        │ Number │    └───┘
└────┬───┘        └───┬────┘
┌────┴───┐          ┌─┴─┐
│ Number │          │ 3 │
└────┬───┘          └───┘
   ┌─┴─┐
   │ 2 │
   └───┘

Ugh. There are too many useless nodes here. I’d rather have something like this:

   ┌───┐
   │ + │
   └┬─┬┘
  ┌─┘ └─┐
┌─┴─┐ ┌─┴─┐
│ 1 │ │ - │
└───┘ └┬─┬┘
     ┌─┘ └─┐
   ┌─┴─┐ ┌─┴─┐
   │ * │ │ 4 │
   └┬─┬┘ └───┘
  ┌─┘ └─┐
┌─┴─┐ ┌─┴─┐
│ 2 │ │ 3 │
└───┘ └───┘

There are several ways to collapse the former into the latter. One way would be to write an ad-hoc recursive algorithm over the parse tree. Another way would be using semantic actions.

The structure of the parse tree

Our ugly parse tree has a very straightforward structure. Its nodes are either a token, or the aggregation of a grammar rule and a list of sub nodes. In Ocaml, we write this (the grammar rule is represented with an index.)

type 'a parse_tree = Token of 'a
                   | Node  of int * 'a parse_tree list

So, each non-leaf node is associated with exactly one grammar rule. Actually, that grammar rule is directly responsible for the creation of this node! Here are a couple examples (search for those nodes in the tree above).

┌─────┐
│ Sum │        Sum -> Sum [+-] Product
└┬─┬─┬┘

┌───┴────┐
│ Factor │     Factor  -> '(' Sum ')'
└─┬─┬─┬──┘

┌──┴──────┐
│ Product │    Product -> Product [*/] Factor
└──┬──────┘

┌─┴─┐
│ 2 │          nothing, this is a leaf node
└───┘

This is important, because the number and nature of the sub nodes is entirely determined by the grammar rule involved. For instance, this grammar rule:

Sum -> Sum [+-] Product

will always yield a node with 3 sub-nodes, one of which is either “+” or “-”. Indeed, you can see that the top node above has 3 sub nodes, one of which is “+”.

Walking down the parse tree

Those invariants are extremely convenient, because they allow potent simplifying assumptions. When we analyse a node, we can use one specialised (and simple) piece of code, provided we know which grammar rule was involved in the first place. So we need as many pieces of code as there are grammar rules.

Those pieces of code are the semantic actions.

Conceptually, a semantic action is a function: it takes a list of values as input, and returns a value. The input comes from the sub nodes. There is just the special case of tokens, which must have a semantic action of their own. (Possibly none, if you consider tokens to be values already.)

In a dynamic language, this would be very easy to describe. In Ocaml, we kinda have to fight the language. To keep things simple, I have decided to emulate a dynamic type system. So, values shall be s-expressions:

type sexpr = Nil
           | Int    of int
           | Char   of char
           | String of string
           | List   of sexpr list

Naturally, semantic actions are functions of this type:

type semantic_action = sexpr list -> sexpr

Now we just need to walk down the parse tree:

let act (token_handler : 'a -> sexpr         )
        (actions       : semantic_action DA.t)
        (ast           : 'a option parse_tree)
    : sexpr =
  let rec aux = function
    | Token (Some t)    -> token_handler t
    | Token None        -> Nil
    | Node (rule, subs) -> (actions >: rule) (List.map aux subs)
  in aux ast

We have 3 parameters here.

The token_handler, which turns a token into an s-expression (typically a char or a string). We use the same token handler for every terminal node.
The actions are an array of semantic actions. There’s one semantic action for each grammar rule (no more, no less).
The ast is the parse tree, represented by its top node.

There are two base cases, depending on whether there is a meaningful token or not. The recursive case computes the values of each sub node (List.map aux subs calls aux recursively for each sub node), then gives that list to the relevant semantic action (actions >: rule denotes array indexing).

Examples of semantic actions

Now we can write our semantic actions. I’ll use pseudo-code for clarity (the Ocaml code is quite ugly). I have 3 examples of semantic actions, all of which work on our trusty arithmetic expression grammar. I’ll write the grammar on the left side, as a reminder. In the actual code, semantic actions are stored in a separate array.

Interpreter

            grammar             ||    semantic actions
--------------------------------||-----------------------------
Sum     -> Sum     [+-] Product || (l, op, r) -> op(l, r)
Sum     -> Product              || (p       ) -> p
Product -> Product [*/] Factor  || (l, op, r) -> op(l, r)
Product -> Factor               || (f       ) -> f
Factor  -> '(' Sum ')'          || (_, s , _) -> s
Factor  -> Number               || (n       ) -> n
Number  -> [0-9]                || (n       ) -> n

These semantic actions directly “interpret” the input. Considering the start symbol is Sum, they will yield a number. On the input 1+(2*3+4), the result is 11.

Abstract syntax tree

We’ll be using s-expressions to denote the AST.

            grammar             ||    semantic actions
--------------------------------||-----------------------------
Sum     -> Sum     [+-] Product || (l, op, r) -> List (l, op, r)
Sum     -> Product              || (p       ) -> p
Product -> Product [*/] Factor  || (l, op, r) -> List (l, op, r)
Product -> Factor               || (f       ) -> f
Factor  -> '(' Sum ')'          || (_, s , _) -> s
Factor  -> Number               || (n       ) -> n
Number  -> [0-9]                || (n       ) -> n

That is how you collapse the parse tree: with some nodes, you just pass the result directly to the parent node, without encapsulating it. on the input 1+(2*3+4), the result looks like this:

      ┌─────┐
      │List │
      └┬─┬─┬┘
  ┌────┘ │ └────┐
┌─┴─┐  ┌─┴─┐  ┌─┴───┐
│ 1 │  │ + │  │List │
└───┘  └───┘  └┬─┬─┬┘
         ┌─────┘ │ └────┐
      ┌──┴──┐  ┌─┴─┐  ┌─┴─┐
      │List │  │ - │  │ 4 │
      └┬─┬─┬┘  └───┘  └───┘
   ┌───┘ │ └───┐
 ┌─┴─┐ ┌─┴─┐ ┌─┴─┐
 │ 2 │ │ * │ │ 3 │
 └───┘ └───┘ └───┘

Postfix notation

This one is special: it relies on side effects. Semantic actions can be evaluated in a predictable order. If the host language uses strict evaluation, this is very easy. In our case, the recursive calls are such that the semantic actions are triggered node by node, from left to right, starting with the sub nodes. When you evaluate a given semantic action, you know the semantic actions for all the sub-nodes have been evaluated as well. From left to right, I might add.

This is a straightforward evaluation strategy, applicable to many situations. We don’t have to stick to it, however. We could give control to the semantic actions themselves: instead of giving them a list of values, you give them a list of closures, which, when evaluated (if at all), will yield the value you would have had otherwise… and perform its side effects, if any.

This is the strategy used by Schorre’s MetaII. The difference is, instead of parsing some input in a top down fashion, we’re walking down a fully formed tree. But I digress.

            grammar             ||    semantic actions
--------------------------------||-----------------------------
Sum     -> Sum     [+-] Product || (_, op, _) -> print op; print ' '
Sum     -> Product              || (p       ) ->
Product -> Product [*/] Factor  || (l, op, r) -> print op; print ' '
Product -> Factor               || (f       ) ->
Factor  -> '(' Sum ')'          || (_, s , _) ->
Factor  -> Number               || (n       ) ->
Number  -> [0-9]                || (d       ) -> print d; print ' '

Those semantic actions don’t return a meaningful result. They just print to the standard output. Thanks to their natural order of evaluation, with the input 1+(2*3+4), they will print this (trailing space not shown):

1 2 3 * 4 + +

This is a postfix notation, amenable to stack based evaluation:

[ 1 ] Push 1 on the top of the stack.
[ 2 ] Push 2 on the top of the stack.
[ 3 ] Push 3 on the top of the stack.
[ * ] Pop the top 2 elements on the stack, multiply them, push the result back on the top of the stack.
[ 4 ] Push 4 on the top of the stack.
[ + ] Pop the top 2 elements on the stack, multiply them, push the result back on the top of the stack.
[ + ] Pop the top 2 elements on the stack, multiply them, push the result back on the top of the stack.

If you follow these instructions, the stack will evolve like this:

        1     2     3     *     4     +     +
                  ┌───┐       ┌───┐
                  │ 3 │       │ 4 │
            ┌───┐ ├───┤ ┌───┐ ├───┤ ┌───┐
            │ 2 │ │ 2 │ │ 6 │ │ 6 │ │10 │
      ┌───┐ ├───┤ ├───┤ ├───┤ ├───┤ ├───┤ ┌───┐
      │ 1 │ │ 1 │ │ 1 │ │ 1 │ │ 1 │ │ 1 │ │11 │
┌───┐ ├───┤ ├───┤ ├───┤ ├───┤ ├───┤ ├───┤ ├───┤
│...│ │...│ │...│ │...│ │...│ │...│ │...│ │...│

This strategy is not limited to arithmetic expressions. It can be used to generate all kinds of stacked based code.

A few words on efficiency

My implementation of semantic actions is simple and modular, but also inefficient. Semantic actions have a well defined structure, which makes them easy to optimise.

First, we don’t need the parse tree to perform the semantic actions. It can be deforested away. The way to do this is simple: instead of constructing a node of that tree, just call the semantic actions directly. Second, I deferred as many decisions as I could to run time. This means an absurd amount of dynamic dispatch, which can be virtually eliminated with a bit of static analysis and code generation. Recall how we construct the parse tree:

We start from a completed Earley item. We have a start position, an end position, and a grammar rule.
We match each non-terminal node of this rule to a completed item (we also test the terminal nodes). Now we have a list of completed Earley items. They can be seen as the “children” of the item we had in step (1).
For each item from (2), we (recursively) go back to step (1). This gives us a list of sub-trees (one for each item). (If we call the semantic actions directly, we get a list of values instead)
We combine those sub-trees (or values) in a node, that we return to the caller.

The details of those operations are highly dependent on the particular grammar rule involved. Remember the depth first search we perform in step (2)? That search is exactly as deep as the number of symbols in the grammar rule. Moreover, the symbol involved only depends on the depth of the current node. So when the number and nature of the symbol is known in advance, our life is much simpler:

We don’t need the full power of recursion. Nested loops (one per non-terminal symbol) are enough.
We don’t have to test for the end of the search: it is implicit in the code. A success in the inner loop means we’re done.
We don’t have to look up the symbols in the grammar rules: we can just “hard-code” them instead.

Specialised depth-first searches can be generated for each grammar rule. From there, the only significant dispatch lies in step (3): the recursive call to the relevant rule. We can just use an indirect call, or we can be clever and switch over the possible cases to help the branch predictor of the CPU: not every rule matches any given symbol.

And so, we have optimised step (2). Now let’s take a look at (3) and (4).

In addition to the previous optimisations, code generation also enables static typing for the semantic actions themselves. Originally, I needed the semantic actions to all have the same type, effectively reverting back to dynamic typing, and all the inefficiencies it entails. (There are other possibilities, but I won’t go there.)

The values from step (3) have a type that depends on the symbol involved. I mean, it wouldn’t make sense for 2 rules with the same left hand side to return values of different types. Since those types are known in advance, we don’t have to go through generic semantic actions. For instance, if a semantic actions needs an integer, we can guarantee it will have just that —no need for any run time test.

But there’s more. Sometimes, a semantic action doesn’t need all the values it could get from below. With a generic approach, short of using lazy evaluation, we still dig to the bottom no matter what. The specialised approach can instead omit the parts of step (3) that are not needed. Depending on the particular grammar and semantic actions involved this can be huge: these are recursive calls. Behind them lie an entire sub-trees worth of computation.

Just one word of caution: if you’re counting on side effects performed by the very semantic actions you could omit that way, it might want to give some explicit control to the semantic action writers.

And so, we have optimised steps (2), (3), and (4).

The current source code doesn’t perform those optimisations. That would obscure the essence of semantic actions. Just keep them in mind, in case you end up writing a production-quality parsing framework (the whole point of this series, really).