# Earley Parsing Explained — The Parser

We got ourselves a nice recogniser, but most of the time we want more than a binary answer. We need a parse tree.

So, how do we turn our recogniser into a parser? Let’s see… Earley himself devised a method, but he was proven wrong by Tomita. Tomita’s method doesn’t terminate on some grammars. Elizabeth Scott described a method that works, but I totally failed to comprehend it.

Turns out, I don’t need to. It all boils down to a simple insight:
the Earley items produced by the recogniser are enough. *They*
are the parse forest we need. And it is quite simple to extract a parse
tree from that forest. (Again, this insight is not mine. I believe I
picked it up from Ian Piumarta.)

## The solution

We need to perform nested depth-first searches over the completed
items. When you look at those items the right way, you can see the
search graphs. It is not easy to see if your mind has been polluted by
talks of back pointers or some such. It certainly took *me* a
long time.

Now the question is, *how the hell* are we supposed to look at
our Earley items?

## Analysing the Earley states

This is best shown by example. Let’s conjure up our trusty arithmetic expression grammar:

```
Sum -> Sum [+-] Product
Sum -> Product
Product -> Product [*/] Factor
Product -> Factor
Factor -> '(' Sum ')'
Factor -> Number
Number -> [0-9] Number
Number -> [0-9]
```

When we use it on this input:

`1+(2*3-4)`

We get those Earley items.

```
=== 0 ===
Sum -> • Sum [+-] Product (0)
Sum -> • Product (0)
Product -> • Product [*/] Factor (0)
Product -> • Factor (0)
Factor -> • '(' Sum ')' (0)
Factor -> • Number (0)
Number -> • [0-9] Number (0)
Number -> • [0-9] (0)
=== 1 ===
Number -> [0-9] • Number (0)
Number -> [0-9] • (0)
Number -> • [0-9] Number (1)
Number -> • [0-9] (1)
Factor -> Number • (0)
Product -> Factor • (0)
Sum -> Product • (0)
Product -> Product • [*/] Factor (0)
Sum -> Sum • [+-] Product (0)
=== 2 ===
Sum -> Sum [+-] • Product (0)
Product -> • Product [*/] Factor (2)
Product -> • Factor (2)
Factor -> • '(' Sum ')' (2)
Factor -> • Number (2)
Number -> • [0-9] Number (2)
Number -> • [0-9] (2)
=== 3 ===
Factor -> '(' • Sum ')' (2)
Sum -> • Sum [+-] Product (3)
Sum -> • Product (3)
Product -> • Product [*/] Factor (3)
Product -> • Factor (3)
Factor -> • '(' Sum ')' (3)
Factor -> • Number (3)
Number -> • [0-9] Number (3)
Number -> • [0-9] (3)
=== 4 ===
Number -> [0-9] • Number (3)
Number -> [0-9] • (3)
Number -> • [0-9] Number (4)
Number -> • [0-9] (4)
Factor -> Number • (3)
Product -> Factor • (3)
Sum -> Product • (3)
Product -> Product • [*/] Factor (3)
Factor -> '(' Sum • ')' (2)
Sum -> Sum • [+-] Product (3)
=== 5 ===
Product -> Product [*/] • Factor (3)
Factor -> • '(' Sum ')' (5)
Factor -> • Number (5)
Number -> • [0-9] Number (5)
Number -> • [0-9] (5)
=== 6 ===
Number -> [0-9] • Number (5)
Number -> [0-9] • (5)
Number -> • [0-9] Number (6)
Number -> • [0-9] (6)
Factor -> Number • (5)
Product -> Product [*/] Factor • (3)
Sum -> Product • (3)
Product -> Product • [*/] Factor (3)
Factor -> '(' Sum • ')' (2)
Sum -> Sum • [+-] Product (3)
=== 7 ===
Sum -> Sum [+-] • Product (3)
Product -> • Product [*/] Factor (7)
Product -> • Factor (7)
Factor -> • '(' Sum ')' (7)
Factor -> • Number (7)
Number -> • [0-9] Number (7)
Number -> • [0-9] (7)
=== 8 ===
Number -> [0-9] • Number (7)
Number -> [0-9] • (7)
Number -> • [0-9] Number (8)
Number -> • [0-9] (8)
Factor -> Number • (7)
Product -> Factor • (7)
Sum -> Sum [+-] Product • (3)
Product -> Product • [*/] Factor (7)
Factor -> '(' Sum • ')' (2)
Sum -> Sum • [+-] Product (3)
=== 9 ===
Factor -> '(' Sum ')' • (2)
Product -> Factor • (2)
Sum -> Sum [+-] Product • (0)
Product -> Product • [*/] Factor (2)
Sum -> Sum • [+-] Product (0)
```

Okay, let’s start from the beginning. We know the parse was successful: the recogniser said so, by showing us this item:

```
=== 9 ===
Sum -> Sum [+-] Product • (0)
```

There is a dot at the end, so this is a completed item. It starts at
`(0)`

(the beginning), and stops at `(9)`

(the
very end). There’s only one way Earley’s algorithm could possibly
produce such an item: the whole input is a `Sum`

. We can
visualise this item in a chart:

```
┌───────────────────────┐
┌──────────────┤Sum -> Sum [+-] Product├──────────────┐
│ └───────────────────────┘ │
│ │
0 1 2 3 4 5 6 7 8 9
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
```

Indeed, this is a `Sum`

, composed of 3 parts: the left
operand, `"1"`

, the operator, `"+"`

, and the right
operand, `"(2*3-4)"`

. Anyway, that’s how we know the parse
succeeded as a whole.

So, Earley’s algorithm works. Duh. But that’s not what I’m after. I
need you to recall how this state was created. More specifically,
*what from*. Let me pull back what I said in the recogniser
section:

Imagine we have an item like this (‘a’, ‘b’, and ‘c’ are symbols, and ‘i’ is an integer):

`Rule -> a b c • (i)`

The fact that this item even

existmeans the following items also exist somewhere:`Foo -> a b • c (i) Foo -> a • b c (i) Foo -> • a b c (i)`

In our current example this means we can find those items:

```
Sum -> Sum [+-] • Product (0)
Sum -> Sum • [+-] Product (0)
Sum -> • Sum [+-] Product (0)
```

But that’s not the end of it. To advance an item one step, you need
*two* things: an un-advanced version of the item (which we have
here), and a completed *something*: either a completed state, or
a successful scan. This has several implications:

- There is another completed
`Sum`

somewhere. It starts at`(0)`

, and finishes at… well… let’s say`(x)`

. - There is a successful scan between
`(x)`

and`(x+1)`

. Meaning, the input at`x`

matches`[+-]`

. - There is a completed
`Product`

somewhere. It starts at`(x+1)`

, and finishes at… wait a minute this is the last one! it’s got to finish wherever the overall`Sum`

finishes! That would be the end of the input, or`(9)`

.

## Performing the search

The problem now is to search for those states, and determine the
value of `(x)`

. Given how Earley items are stored in the
state sets, we need to start at the end: let’s look for the
`Product`

there:

```
=== 9 ===
Factor -> '(' Sum ')' • (2)
Product -> Factor • (2)
Sum -> Sum [+-] Product • (0)
Product -> Product • [*/] Factor (2)
Sum -> Sum • [+-] Product (0)
```

Okay, so there is only one that’s completed:

```
=== 9 ===
Product -> Factor • (2)
```

And it started at `(2)`

. Okay, so `(x+1)`

equals `(2)`

! Which means, there should be a successful scan
between `(1)`

and `(2)`

. Let’s check the second
character of the input:

`1+(2*3-4)`

That would be `"+"`

, which definitely matches
`[+-]`

. All is well so far. Now, we’re looking for a
completed `Sum`

in `(x)`

—that is,
`(1)`

. Let’s look…

```
=== 1 ===
Number -> [0-9] • Number (0)
Number -> [0-9] • (0)
Number -> • [0-9] Number (1)
Number -> • [0-9] (1)
Factor -> Number • (0)
Product -> Factor • (0)
Sum -> Product • (0)
Product -> Product • [*/] Factor (0)
Sum -> Sum • [+-] Product (0)
```

Okay, we can find only one:

```
=== 1 ===
Sum -> Product • (0)
```

Now to sum it up, the whole parse is composed of 3 parts:

```
=== 1 === Sum -> Product • (0)
=== 2 === [+-]
=== 9 === Product -> Factor • (2)
```

Again, we can visualise this as a chart. Here, completed items become
*edges* of a *graph*. The nodes of the graph are the
numbers I put between each input token (0-9). In some way, those nodes
represent the state sets.

```
┌───────────────────────┐
┌─────────────────────┤Sum -> Sum [+-] Product├───────────────────┐
│ └───────────────────────┘ │
│┌──────────────┐ ┌────┐ ┌─────────────────┐ │
┌┤Sum -> Product├┐│[+-]│┌──────────┤Product -> Factor├────────────┐
│└──────────────┘│└────┘│ └─────────────────┘ │
│ │ │ │
0 1 2 3 4 5 6 7 8 9
┌────────────────┬──────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└────────────────┴──────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
```

Granted, this case was easy, because there was only one solution. You
can thank the grammar for not being ambiguous. In any case, there
*had* to be a solution: if there wasn’t one, the parse would have
failed in the first place. The `Sum`

that spans the whole
input wouldn’t be there.

## Nested searches!

So, we know exactly where the big `Sum`

comes from. We
know that the scan succeeded, and why. But what about the little
`Sum`

and `Product`

? Well, we just have to perform
the same work, going down one level. And then another. And another. Here
is the final result, minus 2 gaps for space. Filling them should be
easy.

```
┌───────────────────────┐
┌──────────────────────────┤Sum -> Sum [+-] Product├─────────────────────────┐
│ └───────────────────────┘ │
│┌──────────────────┐ ┌────┐ ┌─────────────────┐ │
┌┤Sum -> Product├┐│[+-]│┌─────────────┤Product -> Factor├────────────────┐
│└──────────────────┘│└────┘│ └─────────────────┘ │
│┌──────────────────┐│ │ ┌──────────────────────┐ │
┌┤Product -> Factor ├┐ ┌──────────┤Factor -> '(' Sum ')'├──────────────┐
│└──────────────────┘│ │ └──────────────────────┘ │
│┌──────────────────┐│ │┌───┐ ┌───────────────────────┐ ┌───┐│
┌┤Factor -> Number ├┐ ││'('│┌────┤Sum -> Sum [+-] Product├───────┐│')'││
│└──────────────────┘│ │└───┘│ └───────────────────────┘ │└───┘│
│┌──────────────────┐│ │ │┌──────────────┐ ┌────┐ ┌─────────┐│ │
┌┤Number -> [0-9] ├┐ │ ┌┤Sum -> Product├─┐│[+-]│┌┤Product ├┐ │
│└──────────────────┘│ │ │└──────────────┘ │└────┘││-> Factor││ │
│ ┌─────┐ │ │ │ │ │└─────────┘│ │
│ │[0-9]│ │ │ │ etc. │ │ etc. │ │
0 └─────┘ 1 2 3 4 5 6 7 8 9
┌────────────────────┬──────┬─────┬─────┬─────┬─────┬──────┬───────────┬─────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└────────────────────┴──────┴─────┴─────┴─────┴─────┴──────┴───────────┴─────┘
```

It helps to do those things manually at least once, to get a feel of the algorithm, and the data we need.

Speaking of which (you probably have noticed this by now), we don’t need all of the Earley items. Only the completed ones. This makes sense since we’re after successful parses, not aborted attempts. Here is a list of all completed items: Note the correspondence with the chart above: it’s basically the same data, visualised differently. Only the scans aren’t represented.

```
=== 0 ===
=== 1 ===
Number -> [0-9] (0)
Factor -> Number (0)
Product -> Factor (0)
Sum -> Product (0)
=== 2 ===
=== 3 ===
=== 4 ===
Number -> [0-9] (3)
Factor -> Number (3)
Product -> Factor (3)
Sum -> Product (3)
=== 5 ===
=== 6 ===
Number -> [0-9] (5)
Factor -> Number (5)
Product -> Product [*/] Factor (3)
Sum -> Product (3)
=== 7 ===
=== 8 ===
Number -> [0-9] (7)
Factor -> Number (7)
Product -> Factor (7)
Sum -> Sum [+-] Product (3)
=== 9 ===
Factor -> '(' Sum ')' (2)
Product -> Factor (2)
Sum -> Sum [+-] Product (0)
```

Much more manageable.

## Searching from the wrong end

We still have a problem however. Remember how we found out how the
big `Sum`

was parsed? We searched from the *end*, then
worked up to the beginning. In this case it didn’t matter, because our
grammar has no ambiguity. If we found one however, we would have to
resolve it from the *end*, which is not very intuitive for the
user: can you imagine a longest-match rule that maximises the length of
the *last* rule first? This would look like a *shortest*
match in most cases.

That won’t do. We need to start at the beginning. The trick is to see the completed items as edges:

- A beginning.
- An end.
- A grammar rule.

A completed item only stores its beginning and its rule. Its end is implicit: it’s the Earley set it is stored on. We can reverse that. Instead of having this:

```
=== 9 ===
Product -> Factor (2)
```

We could have the *beginning* be implicit, and store the end.
Like that:

```
=== 2 ===
Product -> Factor (9)
```

It is basically the same thing, but now we can perform searches from the beginning. Here is the whole chart, dully reversed:

```
=== 0 ===
Sum -> Sum [+-] Product (9)
Sum -> Product (1)
Product -> Factor (1)
Factor -> Number (1)
Number -> [0-9] (1)
=== 1 ===
=== 2 ===
Product -> Factor (9)
Factor -> '(' Sum ')' (9)
=== 3 ===
Sum -> Sum [+-] Product (8)
Sum -> Product (6)
Sum -> Product (4)
Product -> Product [*/] Factor (6)
Product -> Factor (4)
Factor -> Number (4)
Number -> [0-9] (4)
=== 4 ===
=== 5 ===
Factor -> Number (6)
Number -> [0-9] (6)
=== 6 ===
=== 7 ===
Product -> Factor (8)
Factor -> Number (8)
Number -> [0-9] (8)
=== 8 ===
=== 9 ===
```

Now we can start again. The top rule, the one that says we parsed
everything, is now at the *beginning*:

```
=== 0 ===
Sum -> Sum [+-] Product (9)
```

It is a `Sum`

that starts at `(0)`

and finishes
at `(9)`

. Again, this confirms the input has been completely
and correctly parsed. Now we can search how this `Sum`

has
been parsed. From this input, we know the same things we knew
before:

- There is a completed
`Sum`

between`(0)`

and`(x)`

. - There is a “+” or a “-” character between
`(x)`

and`(x+1)`

. - There is a
`Product`

between`(x+1)`

and`(9)`

Now however, because the edges are stored at their beginning
location, we can search this chart at `(0)`

:

```
=== 0 ===
Sum -> Sum [+-] Product (9)
Sum -> Product (1)
Product -> Factor (1)
Factor -> Number (1)
Number -> [0-9] (1)
```

Ah, there are *Two* sums that match. Let’s try the first:

```
=== 0 ===
Sum -> Sum [+-] Product (9)
```

If this is the one, `(x)`

should now be `(9)`

.
Let’s see if there is a character between `(9)`

and
`(10)`

… argh, this is the end of the input! It doesn’t match.
Okay, let’s keep calm and backtrack…

```
=== 0 ===
Sum -> Sum [+-] Product (9) !! Failed !!
Sum -> Product (1)
Product -> Factor (1)
Factor -> Number (1)
Number -> [0-9] (1)
```

…so we can try the second `Sum`

:

```
=== 0 ===
Sum -> Product (1)
```

If this is the one, `(x)`

is now `(1)`

. Let’s
test the input between `(1)`

and `(2)`

… it is a
“+” character, which matches `[+-]`

. We’re still good. Let’s
try the `Product`

now. It should start at
`(2)`

:

```
=== 2 ===
Product -> Factor (9)
Factor -> '(' Sum ')' (9)
```

Good, there is one, and it finishes at `(9)`

. This was the
last of them, and we’re now at `(9)`

. We win!

## A real depth-first search

We have performed a search. You may now be able to implement it in code. Still, a few clarifications should help you. It may not be obvious, but what we just did was really a depth-first search through a graph.

### Aside: depth first search.

This one’s simple: you have a directed graph, with nodes and edges. You are at some node.

If the current node is a

*leaf*, you win!Otherwise, make a list of all edges starting from the current node. That will give you a list of

*children*nodes. Go to the first child, then back to step 1. This will yield one of 2 results:- You eventually found a leaf. Congratulations, you win!
- The search stopped before you found a leaf. Try the next child instead.

If there are no more child to try, the search stops.

More details can be found in the Wikipedia.

### Our search graph

We know how to search a graph, but we’re not sure *what* graph
exactly. More specifically, what are the nodes and the branches of this
graph? Where are the leaves? Where do we start?

The *nodes* are the state sets. In our example we have 10,
from `(0)`

to `(9)`

. One of them is the
*root* (the search starts there), and one of them is the
*leaf* (the search stops there). When we decompose a rule such as
this:

```
=== x ===
A -> a b c (y)
```

the root is `(x)`

, and the leaf is `(y)`

.

The *edges* are the items themselves. They start from a node
(indeed, they are *stored* in a node, and they point to another
node (wherever they finish). There’s one subtlety however. Some edges
aren’t represented in the state stets: the scans. But we don’t really
need them: first, when we search for a scan, we know it starts at the
current node, and stops at the next: scans always span exactly one
token. Second, we can test scans directly from the input. Third, the
rule we are decomposing tell us which test we must perform.

Another important notion here is the *depth* of the current
node. It tells us which symbol we should look for in the rule we are
decomposing.

Finally, we should remember the whole *path* that lead us to
the final node. That is, the list of the edges that lead us from the
root to the leaf.

Now, I should just show you the code. It’s not Lua any more, sorry. I
couldn’t turn my vague ideas into code without static typing to guide
me. (Static typing offer a *much* tighter feedback loop than
dynamic typing in many cases: failure happens sooner, and closer to the
root cause. Invaluable for exploratory programming.)

Anyway:

```
let df_search (edges : int -> 'node -> 'edge DA.t)
(child : int -> 'edge -> 'node )
(pred : int -> 'node -> bool )
(root : 'node )
: ('node * 'edge) list option =
let rec aux depth root =
if pred depth root
then Some []
else opt_find_mem (child depth |- aux (depth + 1))
(edges depth root)
>>= (fun (edge, path) -> Some ((root, edge) :: path))
in aux 0 root
```

This is my generalised depth first search routine. It takes 4 arguments:

`edges`

: a function that given a depth and a node, will return a list of edges.`child`

: a function that given a depth and an edge, will return a node.`pred`

: a function that given a depth and a node, will tell you if the node is a leaf.`root`

: A node. This will be the starting point of the search.

This search then returns a list, where each element is a node and an edge that starts from this node. I have done it this way because the edges I use later don’t store their start node —only the grammar rule and their end point.

Ah, and of course, the search could fail, in which case it returns
nothing (hence the `option`

in the return type). I also use a
couple important helper functions:

```
let (>>=) x f = match x with None -> None | Some r -> f r
let opt_find f = DA.foldl (function None -> f | e -> const e) None
let opt_find_mem f = opt_find (fun a -> f a >>= (fun out -> Some (a, out)))
```

The first operator let me use the `option`

type as a
monad. Less cumbersome than pattern matching. `opt_find`

is a
clever function that applies a function to each element of a list (here
implemented as a dynamic array), and returns the first non-null result.
`opt_find_mem`

is the same, except it also returns the
element of the list that produced the result we wanted.

Now the actual search function. It decomposes one edge into a list of sub-edges, using the depth first search above:

```
let top_list (grammar : 'a grammar )
(input : 'a input )
(chart : edge DA.t DA.t )
(start : int )
({finish; rule} : edge )
: (int * edge) list =
let symbols = rule_body grammar rule in
let bottom = DA.length symbols in
let pred depth start = depth = bottom && start = finish in
let child depth edge = edge.finish in
let edges depth start =
if depth >= DA.length symbols
then DA.make_empty () (* Going past the maximum depth fails the search *)
else match symbols >: depth with
| Terminal (t, _) -> if t (input start)
then DA.make 1 {finish = start + 1; rule = -1}
else DA.make_empty () (* Non-matching token fail
the search *)
| Non_term name -> (chart >: start) // (fun {finish; rule} ->
rule_name grammar rule = name)
in
match df_search edges child pred start with
| None -> failwith "there's always a solution"
| Some path -> path
```

This function takes 5 arguments. The first three form a general context:

- The grammar over which we operate. I need it to retrieve the rules referenced by the edges. (My edges only store an index to their rule.)
- The input. I need it to check the terminal symbols.
- The chart. All the completed items produced by the recogniser. Dully reversed, so the search can be done from the beginning. This is where I pick my edges from.

The last two form what I call an “edge”:

`start`

: the starting point of the edge. Not technically part of my edges, but important nonetheless (I need them for further decomposition).`finish`

: where the edge ends.`rule`

: an integer that denotes a grammar rule. I need the body this rule to perform the decomposition.

You will note the presence of 2 helper operators: the first
(`>:`

) is array indexing. The second (`//`

) is
a filter.

Now you can see why the depth is so important: it lets me know which symbol I am processing, and determines the valid edges that start from the current node. There are two cases when producing those edges:

*Terminal symbols.*Those aren’t represented as edges in the chart, so I have to make one up. This edge references no rule (hence the -1), and it finishes exactly one token later.*Non-terminal symbols.*Those I just search for by name. If the current symbol denotes a`Product`

, that’s what I will search for.

One last thing: this function *cannot fail*. It will always
produce a meaningful result: every edge that is on the chart can be
decomposed. Terminal symbols are different, but we don’t decompose them
anyway. So, if this ever fail, you have a bug somewhere.

## Building the whole parse tree

Now we can decompose *one* edge into its sub edges. The only
thing left to do is apply this recursively to have the whole tree. This
part is relatively obvious:

```
type 'a parse_tree = Token of 'a
| Node of string * 'a parse_tree list
let parse_tree (grammar : 'a grammar )
(input : 'a input )
(chart : edge DA.t DA.t)
: 'a option parse_tree =
let start = 0 in
let finish = DA.length chart - 1 in
let name = grammar.start_symbol in
let rule_name {finish; rule} = rule_name grammar rule in
let rec aux (start, edge) =
if edge.rule = -1
then Token (input start)
else Node (rule_name edge,
List.map aux (top_list grammar input chart start edge))
in
match DA.find (fun edge -> edge.finish = finish && rule_name edge = name)
(chart >: start)
>>= fun edge -> Some (aux (start, edge))
with
| None -> failwith "Are you sure this parse succeeded?"
| Some node -> node
type 'a parse_tree = Token of 'a
| Node of string * 'a parse_tree list
```

So, a parse tree is either a token (directly taken from the input) or a node (the name of the grammar rule involved, and the list of the sub-tree that compose it). Well, a parse tree.

The function that constructs the parse tree take the same context than the function that decomposes one node: a grammar, the input, and a chart. We then use that context (start symbol of the grammar, size of the chart…) to determine where to start the search. Again, if called correctly on a parse that succeeded, this function cannot fail.

Anyway, this will get us our parse tree:

```
┌─────┐
│ Sum │
└┬─┬─┬┘
┌────────┘ │ └─────────┐
┌──┴──┐ ┌─┴─┐ ┌────┴────┐
│ Sum │ │ + │ │ Product │
└──┬──┘ └───┘ └────┬────┘
┌──┴──────┐ ┌───┴────┐
│ Product │ │ Factor │
└──┬──────┘ └─┬─┬─┬──┘
┌──┴─────┐ ┌────┘ │ └────┐
│ Factor │ ┌─┴─┐ ┌──┴──┐ ┌─┴─┐
└──┬─────┘ │ ( │ │ Sum │ │ ) │
┌──┴─────┐ └───┘ └┬─┬─┬┘ └───┘
│ Number │ ┌──────┘ │ └────────┐
└──┬─────┘ ┌──┴──┐ ┌─┴─┐ ┌────┴────┐
┌─┴─┐ │ Sum │ │ - │ │ Product │
│ 1 │ └──┬──┘ └───┘ └────┬────┘
└───┘ ┌────┴────┐ ┌───┴────┐
│ Product │ │ Factor │
└──┬─┬─┬──┘ └───┬────┘
┌──────┘ │ └─────┐ ┌───┴────┐
┌────┴────┐ ┌─┴─┐ ┌───┴────┐ │ Number │
│ Product │ │ * │ │ Factor │ └───┬────┘
└────┬────┘ └───┘ └───┬────┘ ┌─┴─┐
┌────┴───┐ ┌───┴────┐ │ 4 │
│ Factor │ │ Number │ └───┘
└────┬───┘ └───┬────┘
┌────┴───┐ ┌─┴─┐
│ Number │ │ 3 │
└────┬───┘ └───┘
┌─┴─┐
│ 2 │
└───┘
```

Hmm… Not as neat and clean as we might want it to be. It’s workable for now, but there are ways to make this simpler. We’ll get to that in a later post.

## Resolving ambiguities

I mentioned ambiguities before, yet avoided to confront the subject.
The only thing we can do right now is find a parse tree. *Any*
parse tree. The problem is, if there are several possibilities, just
taking one of them isn’t going to solve our problem: we want the output
to be *predictable*.

You might think the problem would go away if you simply ban ambiguous
grammars. Now you have *two* problems:

- The ambiguity of context free grammars is undecidable. You will either let some ambiguous grammars slip, or reject some unambiguous ones. Reliability or generality, pick one.
- Second, many ambiguous grammars are actually useful. Some context
free languages are
*inherently*ambiguous, and can only be parsed with ambiguous grammars. Others are simply easier to process with ambiguous grammars.

So we should accept ambiguous grammar. Let’s see an example: the famous “dangling else”:

```
if
if {}
else {}
```

There are two ways to parse this. Either the `else`

branch
belongs to the outer `if`

:

```
if
if {}
else {}
```

Or, it belong to the inner `if`

:

```
if
if {}
else {}
```

Let’s try and write a grammar for this little language:

```
Block -> "{}"
Block -> If
If -> "if" Block
If -> "if" Block "else" Block
```

Here are the two possible syntax trees:

```
┌───────┐
│ Block │
└───┬───┘
┌─┴──┐
│ If │
└┬┬┬┬┘
┌────────┘││└──────────────────┐
│ │└───────┐ │
┌──┴───┐ ┌───┴───┐ ┌──┴─────┐ ┌───┴───┐
│ "if" │ │ Block │ │ "else" │ │ Block │
└──────┘ └───┬───┘ └────────┘ └───┬───┘
┌─┴──┐ ┌───┴──┐
│ If │ │ "{}" │
└─┬┬─┘ └──────┘
┌────┘└────┐
┌──┴───┐ ┌───┴───┐
│ "if" │ │ Block │
└──────┘ └───┬───┘
┌───┴──┐
│ "{}" │
└──────┘
────────────────────────────────────────
┌───────┐
│ Block │
└───┬───┘
┌─┴──┐
│ If │
└─┬┬─┘
┌────┘└────┐
┌──┴───┐ ┌───┴───┐
│ "if" │ │ Block │
└──────┘ └───┬───┘
┌─┴──┐
│ If │
└┬┬┬┬┘
┌────────┘││└──────────────────┐
│ │└───────┐ │
┌──┴───┐ ┌───┴───┐ ┌──┴─────┐ ┌───┴───┐
│ "if" │ │ Block │ │ "else" │ │ Block │
└──────┘ └───┬───┘ └────────┘ └───┬───┘
┌───┴──┐ ┌───┴──┐
│ "{}" │ │ "{}" │
└──────┘ └──────┘
```

### What is an ambiguity?

Recall that Earley items are unique: the algorithm make sure there is no duplicate. Here, we’re only interested in completed items (or edges). So, two different edges will have at least one difference:

- In their start point.
- In their end point.
- Or in their grammar rule.

Obviously. Now recall our depth first search: for a given start
point, we gather every edge whose rule name match the current symbol.
Yup, there might be several. *That*, is a possible ambiguity.

I must insist on *possible*. As you have seen in our
arithmetic expression example above, some branches may abort, leaving
only one valid solution. Actual ambiguity only arises when the search
could have yielded different results, depending on which edge was picked
first.

### Resolution

So, when we perform our search, we’re only looking at one start
point: the current node. The edges we find there can only be different
in 2 ways: by their end point, or by their grammar rule. For instance,
in our dangling else example, when you’re searching for an
`If`

rule from 0 to 5 (the whole input), you will find 2
edges:

```
=== 0 ===
If -> "if" Block (5)
If -> "if" Block "else" Block (5)
```

Those edges end at the same point, but differ by their rule. Written
as it is, our search algorithm will just pick the first one. Which is
precisely what we will exploit here: when we construct the chart, we can
do more than just flipping items. We can *sort* them.

My current choice is a blend of prioritised choice and longest match. The order of the rule in the grammar is significant. If there are more than one possibility, the rule that appear first will have precedence. When that does not suffice, I just put the longest edge first. This way:

```
-- Grammar --
Foo -> a b
Foo -> c d
-- Chart --
=== x ===
Foo -> a b (y)
Foo -> c d (y)
Foo -> c d (z) -- z < y
```

I believe this gives acceptable control. Granted, there are more fine grained ways of doing this, but the algorithms involved would be significantly more complex. I think it is simpler to just decide at the top level, and let the consequences of those decisions flow downward.

That said, nothing stops you from trying other disambiguation mechanisms: by nature, they’re limited to the search of edges and the construction of the parse tree. Whatever complexity lurks in there won’t affect the rest of the program.

Back to our example, if we pick this rule first…

`If -> "if" Block (5)`

…then the outer `If`

will have no `else`

clause. The `else`

close will therefore belong to the inner
`If`

, just like it would in most programming languages. If
you pick the other rule first, you get the other possibility.

Just by flipping the order of the rules in the grammar, you can
decide where the `else`

clause will go.

## One last trap

I forgot to tell you: some grammars will blow up in your face:

```
A -> A
A ->
```

The problem is, this grammar generates an infinite amount of parse trees:

```
┌───┐ ┌───┐ ┌───┐
│ A │ │ A │ │ A │ ...
└───┘ └─┬─┘ └─┬─┘
┌─┴─┐ ┌─┴─┐
│ A │ │ A │
└───┘ └─┬─┘
┌─┴─┐
│ A │
└───┘
```

So, when I’m constructing my parse tree, I could be unlucky enough to fall into one of those bottomless pits until I overflow the stack. We need to do something about it. We could try and avoid the infinite paths, but I don’t like it. It would complicate the construction of the parse tree, and frankly, this is complex enough already.

My solution is more brutal: such bottomless grammars are bogus, and should be rejected. Fortunately, unlike unlike ambiguous grammars, they can be detected with a little bit of static analysis.

First we compile the nullable rules: they are the rules which contain only nullable terminal symbols. (And we have learned to detect those earlier).

For each nullable symbol, we compile the symbols which appear in the right hand side of the corresponding nullable rules. Imagine the following grammar:

`A -> A C A -> B A -> B -> A C -> 'x'`

This grammar has 2 nullable symbols, A and B. The nullable rules are those:

`A -> B A -> B -> A`

In this simple case, the only child of “A” is “B”, and vice versa.

Finally, we look for duplicates in the transitive closure of children. If a path includes a duplicate, then we have found a loop, and must reject the grammar. For instance, with he above grammar, When I examine the symbol “A”, I find only one child, “B”. So far so good. Then I examine “B” and find the grandchild “A”. We have a duplicate, so this grammar is bottomless.

That’s basically it. For reference, here is my code:

```
let infinite_loop : 'a grammar -> bool = fun grammar ->
let null_set = nullable_symbols grammar in
let rules symbol = grammar.rules // (fun {lhs} -> lhs = symbol)
// is_nullable null_set
/@ (fun {rhs} -> rhs) in
let add_rule = DA.foldl (fun set -> function
| Terminal _ -> failwith "impossible"
| Non_term sym -> String_set.add sym set) in
let children symbol = DA.foldl add_rule
String_set.empty
(rules symbol) in
let rec aux path symbol =
if List.mem symbol path
then true
else String_set.exists (aux (symbol::path))
(children symbol)
in String_set.exists (aux []) null_set
```

It is worth breaking up a bit. First, we get the null symbols. I reuse the previous nullable_symbols detector directly.

`let null_set = nullable_symbols grammar`

Then I need to detect the nullable rules that correspond to a given
symbol. I only take the grammars rules I want: their name must be the
symbol I seek (obviously), and they must be nullable. Finally, I only
take the right hand side of each rule. (The `//`

operator is
a filter, and the `/@`

operator is a map.)

```
let rules symbol = grammar.rules // (fun {lhs} -> lhs = symbol)
// is_nullable null_set
/@ (fun {rhs} -> rhs)
```

Once I have my rules for a given symbol, I need to collect the children symbols. Remember that nullable rules don’t have terminal symbols.

```
let add_rule = DA.foldl (fun set -> function
| Terminal _ -> failwith "impossible"
| Non_term sym -> String_set.add sym set)
let children symbol = DA.foldl add_rule
String_set.empty
(rules symbol)
```

Finally, I perform the search for duplicates. During the search, I
keep track of the symbols of the current path. If the current symbol is
in this path, we have hit a loop, and return `true`

to say
so. (Production code should return the path itself, for a more friendly
error message.) Otherwise we search deeper, if possible.

```
let rec aux path symbol =
if List.mem symbol path
then true
else String_set.exists (aux (symbol::path))
(children symbol)
```

Of course, We need to perform this search for every nullable symbol:

`String_set.exists (aux []) null_set`

## Conclusion

Now you should be able to build a fully fledged Earley parser. As always, my code is available for study. Again, sorry about using Ocaml instead of sticking with Lua. Dynamic typing just required too much brainpower. If you find my code hard to read, this other tutorial may help you.