Earley Parsing Explained — The Parser
We got ourselves a nice recogniser, but most of the time we want more than a binary answer. We need a parse tree.
So, how do we turn our recogniser into a parser? Let’s see… Earley himself devised a method, but Tomita later showed it was flawed: it returns spurious parses on some grammars. Elizabeth Scott described a method that works, but I totally failed to comprehend it.
Turns out, I don’t need to. It all boils down to a simple insight: the Earley items produced by the recogniser are enough. They are the parse forest we need. And it is quite simple to extract a parse tree from that forest. (Again, this insight is not mine. I believe I picked it up from Ian Piumarta.)
The solution
We need to perform nested depth-first searches over the completed items. When you look at those items the right way, you can see the search graphs. It is not easy to see if your mind has been polluted by talk of back pointers or some such. It certainly took me a long time.
Now the question is, how the hell are we supposed to look at our Earley items?
Analysing the Earley states
This is best shown by example. Let’s conjure up our trusty arithmetic expression grammar:
Sum -> Sum [+-] Product
Sum -> Product
Product -> Product [*/] Factor
Product -> Factor
Factor -> '(' Sum ')'
Factor -> Number
Number -> [0-9] Number
Number -> [0-9]
When we use it on this input:
1+(2*3-4)
We get these Earley items:
=== 0 ===
Sum -> • Sum [+-] Product (0)
Sum -> • Product (0)
Product -> • Product [*/] Factor (0)
Product -> • Factor (0)
Factor -> • '(' Sum ')' (0)
Factor -> • Number (0)
Number -> • [0-9] Number (0)
Number -> • [0-9] (0)
=== 1 ===
Number -> [0-9] • Number (0)
Number -> [0-9] • (0)
Number -> • [0-9] Number (1)
Number -> • [0-9] (1)
Factor -> Number • (0)
Product -> Factor • (0)
Sum -> Product • (0)
Product -> Product • [*/] Factor (0)
Sum -> Sum • [+-] Product (0)
=== 2 ===
Sum -> Sum [+-] • Product (0)
Product -> • Product [*/] Factor (2)
Product -> • Factor (2)
Factor -> • '(' Sum ')' (2)
Factor -> • Number (2)
Number -> • [0-9] Number (2)
Number -> • [0-9] (2)
=== 3 ===
Factor -> '(' • Sum ')' (2)
Sum -> • Sum [+-] Product (3)
Sum -> • Product (3)
Product -> • Product [*/] Factor (3)
Product -> • Factor (3)
Factor -> • '(' Sum ')' (3)
Factor -> • Number (3)
Number -> • [0-9] Number (3)
Number -> • [0-9] (3)
=== 4 ===
Number -> [0-9] • Number (3)
Number -> [0-9] • (3)
Number -> • [0-9] Number (4)
Number -> • [0-9] (4)
Factor -> Number • (3)
Product -> Factor • (3)
Sum -> Product • (3)
Product -> Product • [*/] Factor (3)
Factor -> '(' Sum • ')' (2)
Sum -> Sum • [+-] Product (3)
=== 5 ===
Product -> Product [*/] • Factor (3)
Factor -> • '(' Sum ')' (5)
Factor -> • Number (5)
Number -> • [0-9] Number (5)
Number -> • [0-9] (5)
=== 6 ===
Number -> [0-9] • Number (5)
Number -> [0-9] • (5)
Number -> • [0-9] Number (6)
Number -> • [0-9] (6)
Factor -> Number • (5)
Product -> Product [*/] Factor • (3)
Sum -> Product • (3)
Product -> Product • [*/] Factor (3)
Factor -> '(' Sum • ')' (2)
Sum -> Sum • [+-] Product (3)
=== 7 ===
Sum -> Sum [+-] • Product (3)
Product -> • Product [*/] Factor (7)
Product -> • Factor (7)
Factor -> • '(' Sum ')' (7)
Factor -> • Number (7)
Number -> • [0-9] Number (7)
Number -> • [0-9] (7)
=== 8 ===
Number -> [0-9] • Number (7)
Number -> [0-9] • (7)
Number -> • [0-9] Number (8)
Number -> • [0-9] (8)
Factor -> Number • (7)
Product -> Factor • (7)
Sum -> Sum [+-] Product • (3)
Product -> Product • [*/] Factor (7)
Factor -> '(' Sum • ')' (2)
Sum -> Sum • [+-] Product (3)
=== 9 ===
Factor -> '(' Sum ')' • (2)
Product -> Factor • (2)
Sum -> Sum [+-] Product • (0)
Product -> Product • [*/] Factor (2)
Sum -> Sum • [+-] Product (0)
Okay, let’s start from the beginning. We know the parse was successful: the recogniser said so, by showing us this item:
=== 9 ===
Sum -> Sum [+-] Product • (0)
There is a dot at the end, so this is a completed item. It starts at (0) (the beginning), and stops at (9) (the very end). There’s only one way Earley’s algorithm could possibly produce such an item: the whole input is a Sum. We can visualise this item in a chart:
┌───────────────────────┐
┌──────────────┤Sum -> Sum [+-] Product├──────────────┐
│ └───────────────────────┘ │
│ │
0 1 2 3 4 5 6 7 8 9
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
Indeed, this is a Sum, composed of three parts: the left operand, "1", the operator, "+", and the right operand, "(2*3-4)". Anyway, that’s how we know the parse succeeded as a whole.
So, Earley’s algorithm works. Duh. But that’s not what I’m after. I need you to recall how this state was created. More specifically, what it was created from. Let me recall what I said in the recogniser section:
Imagine we have an item like this (‘a’, ‘b’, and ‘c’ are symbols, and ‘i’ is an integer):
Foo -> a b c • (i)
The fact that this item even exists means the following items also exist somewhere:
Foo -> a b • c (i)
Foo -> a • b c (i)
Foo -> • a b c (i)
In our current example this means we can find those items:
Sum -> Sum [+-] • Product (0)
Sum -> Sum • [+-] Product (0)
Sum -> • Sum [+-] Product (0)
But that’s not the end of it. To advance an item one step, you need two things: an un-advanced version of the item (which we have here), and a completed something: either a completed state, or a successful scan. This has several implications:
- There is another completed Sum somewhere. It starts at (0), and finishes at… well… let’s say (x).
- There is a successful scan between (x) and (x+1). Meaning, the input at x matches [+-].
- There is a completed Product somewhere. It starts at (x+1), and finishes at… wait a minute, this is the last one! It’s got to finish wherever the overall Sum finishes! That would be the end of the input, or (9).
Performing the search
The problem now is to search for those states, and determine the value of (x). Given how Earley items are stored in the state sets, we need to start at the end: let’s look for the Product there:
=== 9 ===
Factor -> '(' Sum ')' • (2)
Product -> Factor • (2)
Sum -> Sum [+-] Product • (0)
Product -> Product • [*/] Factor (2)
Sum -> Sum • [+-] Product (0)
Okay, so there is only one that’s completed:
=== 9 ===
Product -> Factor • (2)
And it started at (2). Okay, so (x+1) equals (2)! Which means, there should be a successful scan between (1) and (2). Let’s check the second character of the input:
1+(2*3-4)
That would be "+", which definitely matches [+-]. All is well so far. Now, we’re looking for a completed Sum in (x), that is, (1). Let’s look…
=== 1 ===
Number -> [0-9] • Number (0)
Number -> [0-9] • (0)
Number -> • [0-9] Number (1)
Number -> • [0-9] (1)
Factor -> Number • (0)
Product -> Factor • (0)
Sum -> Product • (0)
Product -> Product • [*/] Factor (0)
Sum -> Sum • [+-] Product (0)
Okay, we can find only one:
=== 1 ===
Sum -> Product • (0)
Now, to sum it up, the whole parse is composed of three parts:
=== 1 === Sum -> Product • (0)
=== 2 === [+-]
=== 9 === Product -> Factor • (2)
Again, we can visualise this as a chart. Here, completed items become edges of a graph. The nodes of the graph are the numbers I put between each input token (0-9). In some way, those nodes represent the state sets.
┌───────────────────────┐
┌─────────────────────┤Sum -> Sum [+-] Product├───────────────────┐
│ └───────────────────────┘ │
│┌──────────────┐ ┌────┐ ┌─────────────────┐ │
┌┤Sum -> Product├┐│[+-]│┌──────────┤Product -> Factor├────────────┐
│└──────────────┘│└────┘│ └─────────────────┘ │
│ │ │ │
0 1 2 3 4 5 6 7 8 9
┌────────────────┬──────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└────────────────┴──────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
Granted, this case was easy, because there was only one solution. You can thank the grammar for not being ambiguous. In any case, there had to be a solution: if there wasn’t one, the parse would have failed in the first place. The Sum that spans the whole input wouldn’t be there.
Nested searches!
So, we know exactly where the big Sum comes from. We know that the scan succeeded, and why. But what about the little Sum and Product? Well, we just have to perform the same work, going down one level. And then another. And another. Here is the final result, minus two gaps for space. Filling them should be easy.
┌───────────────────────┐
┌──────────────────────────┤Sum -> Sum [+-] Product├─────────────────────────┐
│ └───────────────────────┘ │
│┌──────────────────┐ ┌────┐ ┌─────────────────┐ │
┌┤Sum -> Product├┐│[+-]│┌─────────────┤Product -> Factor├────────────────┐
│└──────────────────┘│└────┘│ └─────────────────┘ │
│┌──────────────────┐│ │ ┌──────────────────────┐ │
┌┤Product -> Factor ├┐ ┌──────────┤Factor -> '(' Sum ')'├──────────────┐
│└──────────────────┘│ │ └──────────────────────┘ │
│┌──────────────────┐│ │┌───┐ ┌───────────────────────┐ ┌───┐│
┌┤Factor -> Number ├┐ ││'('│┌────┤Sum -> Sum [+-] Product├───────┐│')'││
│└──────────────────┘│ │└───┘│ └───────────────────────┘ │└───┘│
│┌──────────────────┐│ │ │┌──────────────┐ ┌────┐ ┌─────────┐│ │
┌┤Number -> [0-9] ├┐ │ ┌┤Sum -> Product├─┐│[+-]│┌┤Product ├┐ │
│└──────────────────┘│ │ │└──────────────┘ │└────┘││-> Factor││ │
│ ┌─────┐ │ │ │ │ │└─────────┘│ │
│ │[0-9]│ │ │ │ etc. │ │ etc. │ │
0 └─────┘ 1 2 3 4 5 6 7 8 9
┌────────────────────┬──────┬─────┬─────┬─────┬─────┬──────┬───────────┬─────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└────────────────────┴──────┴─────┴─────┴─────┴─────┴──────┴───────────┴─────┘
It helps to do those things manually at least once, to get a feel of the algorithm, and the data we need.
Speaking of which (you probably have noticed this by now), we don’t need all of the Earley items, only the completed ones. This makes sense, since we’re after successful parses, not aborted attempts. Here is a list of all completed items. Note the correspondence with the chart above: it’s basically the same data, visualised differently. Only the scans aren’t represented.
=== 0 ===
=== 1 ===
Number -> [0-9] (0)
Factor -> Number (0)
Product -> Factor (0)
Sum -> Product (0)
=== 2 ===
=== 3 ===
=== 4 ===
Number -> [0-9] (3)
Factor -> Number (3)
Product -> Factor (3)
Sum -> Product (3)
=== 5 ===
=== 6 ===
Number -> [0-9] (5)
Factor -> Number (5)
Product -> Product [*/] Factor (3)
Sum -> Product (3)
=== 7 ===
=== 8 ===
Number -> [0-9] (7)
Factor -> Number (7)
Product -> Factor (7)
Sum -> Sum [+-] Product (3)
=== 9 ===
Factor -> '(' Sum ')' (2)
Product -> Factor (2)
Sum -> Sum [+-] Product (0)
Much more manageable.
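Filtering out everything but the completed items is mechanical. Here is a sketch in Python, under assumed representations (not my actual code): each Earley set is a list of (rule, dot, start) triples, and body_len gives the length of a rule’s right hand side.

```python
def completed_items(state_sets, body_len):
    # Keep only the items whose dot has reached the end of the rule body,
    # dropping the dot in the process (it carries no information any more).
    return [[(rule, start)
             for (rule, dot, start) in state_set
             if dot == body_len(rule)]
            for state_set in state_sets]
```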
Searching from the wrong end
We still have a problem, however. Remember how we found out how the big Sum was parsed? We searched from the end, then worked up to the beginning. In this case it didn’t matter, because our grammar has no ambiguity. If we found one, however, we would have to resolve it from the end, which is not very intuitive for the user: can you imagine a longest-match rule that maximises the length of the last rule first? It would look like a shortest match in most cases.
That won’t do. We need to start at the beginning. The trick is to see the completed items as edges, with three components:
- A beginning.
- An end.
- A grammar rule.
A completed item only stores its beginning and its rule. Its end is implicit: it’s the Earley set it is stored in. We can reverse that. Instead of having this:
=== 9 ===
Product -> Factor (2)
We could have the beginning be implicit, and store the end. Like that:
=== 2 ===
Product -> Factor (9)
It is basically the same thing, but now we can perform searches from the beginning. Here is the whole chart, duly reversed:
=== 0 ===
Sum -> Sum [+-] Product (9)
Sum -> Product (1)
Product -> Factor (1)
Factor -> Number (1)
Number -> [0-9] (1)
=== 1 ===
=== 2 ===
Product -> Factor (9)
Factor -> '(' Sum ')' (9)
=== 3 ===
Sum -> Sum [+-] Product (8)
Sum -> Product (6)
Sum -> Product (4)
Product -> Product [*/] Factor (6)
Product -> Factor (4)
Factor -> Number (4)
Number -> [0-9] (4)
=== 4 ===
=== 5 ===
Factor -> Number (6)
Number -> [0-9] (6)
=== 6 ===
=== 7 ===
Product -> Factor (8)
Factor -> Number (8)
Number -> [0-9] (8)
=== 8 ===
=== 9 ===
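The reversal itself takes just a few lines. A Python sketch, assuming the completed chart maps each end position to a list of (rule, start) pairs (my names, not those of the post’s OCaml code):

```python
def reverse_chart(completed):
    # completed[end] holds (rule, start) pairs; build chart[start]
    # holding (rule, end) pairs instead, so searches go left to right.
    reversed_chart = [[] for _ in completed]
    for end, items in enumerate(completed):
        for (rule, start) in items:
            reversed_chart[start].append((rule, end))
    return reversed_chart
```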
Now we can start again. The top rule, the one that says we parsed everything, is now at the beginning:
=== 0 ===
Sum -> Sum [+-] Product (9)
It is a Sum that starts at (0) and finishes at (9). Again, this confirms the input has been completely and correctly parsed. Now we can search how this Sum has been parsed. From this item, we know the same things we knew before:
- There is a completed Sum between (0) and (x).
- There is a “+” or a “-” character between (x) and (x+1).
- There is a Product between (x+1) and (9).
Now however, because the edges are stored at their beginning location, we can search this chart at (0):
=== 0 ===
Sum -> Sum [+-] Product (9)
Sum -> Product (1)
Product -> Factor (1)
Factor -> Number (1)
Number -> [0-9] (1)
Ah, there are two Sums that match. Let’s try the first:
=== 0 ===
Sum -> Sum [+-] Product (9)
If this is the one, (x) should now be (9). Let’s see if there is a character between (9) and (10)… argh, this is the end of the input! It doesn’t match. Okay, let’s keep calm and backtrack…
=== 0 ===
Sum -> Sum [+-] Product (9) !! Failed !!
Sum -> Product (1)
Product -> Factor (1)
Factor -> Number (1)
Number -> [0-9] (1)
…so we can try the second Sum:
=== 0 ===
Sum -> Product (1)
If this is the one, (x) is now (1). Let’s test the input between (1) and (2)… it is a “+” character, which matches [+-]. We’re still good. Let’s try the Product now. It should start at (2):
=== 2 ===
Product -> Factor (9)
Factor -> '(' Sum ')' (9)
Good, there is one, and it finishes at (9). This was the last of them, and we’re now at (9). We win!
A real depth-first search
We have performed a search. You may now be able to implement it in code. Still, a few clarifications should help you. It may not be obvious, but what we just did was really a depth-first search through a graph.
Aside: depth first search.
This one’s simple: you have a directed graph, with nodes and edges. You are at some node.
If the current node is a leaf, you win!
Otherwise, make a list of all edges starting from the current node. That will give you a list of child nodes. Go to the first child, then back to step 1. This will yield one of two results:
- You eventually found a leaf. Congratulations, you win!
- The search stopped before you found a leaf. Try the next child instead.
If there are no more children to try, the search stops.
More details can be found on Wikipedia.
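As a warm-up, the aside above can be rendered as a short path-returning search. This is an illustrative Python sketch (my own rendering, not the post’s code), shaped the same way as the OCaml routine that follows: it returns the list of (node, edge) pairs leading from the root to a leaf, or None on failure.

```python
def df_search(edges, child, pred, root):
    # edges(depth, node) -> candidate edges; child(depth, edge) -> next node;
    # pred(depth, node) -> is this node a leaf?
    def aux(depth, node):
        if pred(depth, node):
            return []                      # found a leaf: empty path suffix
        for edge in edges(depth, node):
            path = aux(depth + 1, child(depth, edge))
            if path is not None:
                return [(node, edge)] + path
        return None                        # every child failed: backtrack
    return aux(0, root)
```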
Our search graph
We know how to search a graph, but we’re not sure what graph, exactly. More specifically, what are the nodes and edges of this graph? Where are the leaves? Where do we start?
The nodes are the state sets. In our example we have 10, from (0) to (9). One of them is the root (the search starts there), and one of them is the leaf (the search stops there). When we decompose a rule such as this:
=== x ===
A -> a b c (y)
the root is (x), and the leaf is (y).
The edges are the items themselves. They start from a node (indeed, they are stored in a node), and they point to another node (wherever they finish). There’s one subtlety, however. Some edges aren’t represented in the state sets: the scans. But we don’t really need them: first, when we search for a scan, we know it starts at the current node and stops at the next: scans always span exactly one token. Second, we can test scans directly from the input. Third, the rule we are decomposing tells us which test we must perform.
Another important notion here is the depth of the current node. It tells us which symbol we should look for in the rule we are decomposing.
Finally, we should remember the whole path that led us to the final node. That is, the list of the edges that lead from the root to the leaf.
Now, I should just show you the code. It’s not Lua any more, sorry. I couldn’t turn my vague ideas into code without static typing to guide me. (Static typing offers a much tighter feedback loop than dynamic typing in many cases: failure happens sooner, and closer to the root cause. Invaluable for exploratory programming.)
Anyway:
let df_search (edges : int -> 'node -> 'edge DA.t)
(child : int -> 'edge -> 'node )
(pred : int -> 'node -> bool )
(root : 'node )
: ('node * 'edge) list option =
let rec aux depth root =
if pred depth root
then Some []
else opt_find_mem (child depth |- aux (depth + 1))
(edges depth root)
>>= (fun (edge, path) -> Some ((root, edge) :: path))
in aux 0 root
This is my generalised depth-first search routine. It takes 4 arguments:
- edges: a function that, given a depth and a node, returns a list of edges.
- child: a function that, given a depth and an edge, returns a node.
- pred: a function that, given a depth and a node, tells you if the node is a leaf.
- root: a node. This will be the starting point of the search.
This search then returns a list, where each element is a node and an edge that starts from this node. I have done it this way because the edges I use later don’t store their start node, only the grammar rule and their end point.
Ah, and of course, the search could fail, in which case it returns nothing (hence the option in the return type). I also use a couple of important helper functions:
let (>>=) x f = match x with None -> None | Some r -> f r
let opt_find f = DA.foldl (function None -> f | e -> const e) None
let opt_find_mem f = opt_find (fun a -> f a >>= (fun out -> Some (a, out)))
The first operator lets me use the option type as a monad. Less cumbersome than pattern matching. opt_find is a clever function that applies a function to each element of a list (here implemented as a dynamic array), and returns the first non-None result. opt_find_mem is the same, except it also returns the element of the list that produced the result we wanted.
Now the actual search function. It decomposes one edge into a list of sub-edges, using the depth first search above:
let top_list (grammar : 'a grammar )
(input : 'a input )
(chart : edge DA.t DA.t )
(start : int )
({finish; rule} : edge )
: (int * edge) list =
let symbols = rule_body grammar rule in
let bottom = DA.length symbols in
let pred depth start = depth = bottom && start = finish in
let child depth edge = edge.finish in
let edges depth start =
if depth >= DA.length symbols
then DA.make_empty () (* Going past the maximum depth fails the search *)
else match symbols >: depth with
| Terminal (t, _) -> if t (input start)
then DA.make 1 {finish = start + 1; rule = -1}
                         else DA.make_empty () (* Non-matching tokens fail the search *)
| Non_term name -> (chart >: start) // (fun {finish; rule} ->
rule_name grammar rule = name)
in
match df_search edges child pred start with
| None -> failwith "there's always a solution"
| Some path -> path
This function takes 5 arguments. The first three form a general context:
- The grammar over which we operate. I need it to retrieve the rules referenced by the edges. (My edges only store an index to their rule.)
- The input. I need it to check the terminal symbols.
- The chart. All the completed items produced by the recogniser. Duly reversed, so the search can be done from the beginning. This is where I pick my edges from.
The last two form what I call an “edge”:
- start: the starting point of the edge. Not technically part of my edges, but important nonetheless (I need it for further decomposition).
- finish: where the edge ends.
- rule: an integer that denotes a grammar rule. I need the body of this rule to perform the decomposition.
You will note the presence of 2 helper operators: the first (>:) is array indexing. The second (//) is a filter.
Now you can see why the depth is so important: it lets me know which symbol I am processing, and determines the valid edges that start from the current node. There are two cases when producing those edges:
- Terminal symbols. Those aren’t represented as edges in the chart, so I have to make one up. This edge references no rule (hence the -1), and it finishes exactly one token later.
- Non-terminal symbols. Those I just search for by name. If the current symbol denotes a Product, that’s what I will search for.
One last thing: this function cannot fail. It will always produce a meaningful result: every edge that is on the chart can be decomposed. Terminal symbols are different, but we don’t decompose them anyway. So, if this ever fails, you have a bug somewhere.
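For comparison, here is how the same decomposition might look in Python, under the illustrative representations used earlier (rules as (name, body) pairs with predicates for terminals, the reversed chart mapping start positions to (rule, finish) edges). A sketch, not my actual code:

```python
SCAN = -1  # pseudo-rule index marking a terminal match (like the -1 above)

def df_search(edges, child, pred, root):
    def aux(depth, node):
        if pred(depth, node):
            return []
        for edge in edges(depth, node):
            path = aux(depth + 1, child(depth, edge))
            if path is not None:
                return [(node, edge)] + path
        return None
    return aux(0, root)

def top_list(rules, text, chart, start, edge):
    rule, finish = edge
    body = rules[rule][1]
    pred  = lambda depth, node: depth == len(body) and node == finish
    child = lambda depth, e: e[1]              # an edge's end point
    def edges(depth, node):
        if depth >= len(body):
            return []                          # past maximum depth: fail
        sym = body[depth]
        if callable(sym):                      # terminal: test the input directly
            ok = node < len(text) and sym(text[node])
            return [(SCAN, node + 1)] if ok else []
        # non-terminal: every edge at this node whose rule name matches
        return [e for e in chart[node] if rules[e[0]][0] == sym]
    return df_search(edges, child, pred, start)
```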
Building the whole parse tree
Now we can decompose one edge into its sub edges. The only thing left to do is apply this recursively to have the whole tree. This part is relatively obvious:
type 'a parse_tree = Token of 'a
| Node of string * 'a parse_tree list
let parse_tree (grammar : 'a grammar )
(input : 'a input )
(chart : edge DA.t DA.t)
: 'a option parse_tree =
let start = 0 in
let finish = DA.length chart - 1 in
let name = grammar.start_symbol in
let rule_name {finish; rule} = rule_name grammar rule in
let rec aux (start, edge) =
if edge.rule = -1
then Token (input start)
else Node (rule_name edge,
List.map aux (top_list grammar input chart start edge))
in
match DA.find (fun edge -> edge.finish = finish && rule_name edge = name)
(chart >: start)
>>= fun edge -> Some (aux (start, edge))
with
| None -> failwith "Are you sure this parse succeeded?"
| Some node -> node
type 'a parse_tree = Token of 'a
| Node of string * 'a parse_tree list
So, a parse tree is either a token (directly taken from the input) or a node (the name of the grammar rule involved, and the list of the sub-trees that compose it). Well, a parse tree.
The function that constructs the parse tree takes the same context as the function that decomposes one node: a grammar, the input, and a chart. We then use that context (start symbol of the grammar, size of the chart…) to determine where to start the search. Again, if called correctly on a parse that succeeded, this function cannot fail.
Anyway, this will get us our parse tree:
┌─────┐
│ Sum │
└┬─┬─┬┘
┌────────┘ │ └─────────┐
┌──┴──┐ ┌─┴─┐ ┌────┴────┐
│ Sum │ │ + │ │ Product │
└──┬──┘ └───┘ └────┬────┘
┌──┴──────┐ ┌───┴────┐
│ Product │ │ Factor │
└──┬──────┘ └─┬─┬─┬──┘
┌──┴─────┐ ┌────┘ │ └────┐
│ Factor │ ┌─┴─┐ ┌──┴──┐ ┌─┴─┐
└──┬─────┘ │ ( │ │ Sum │ │ ) │
┌──┴─────┐ └───┘ └┬─┬─┬┘ └───┘
│ Number │ ┌──────┘ │ └────────┐
└──┬─────┘ ┌──┴──┐ ┌─┴─┐ ┌────┴────┐
┌─┴─┐ │ Sum │ │ - │ │ Product │
│ 1 │ └──┬──┘ └───┘ └────┬────┘
└───┘ ┌────┴────┐ ┌───┴────┐
│ Product │ │ Factor │
└──┬─┬─┬──┘ └───┬────┘
┌──────┘ │ └─────┐ ┌───┴────┐
┌────┴────┐ ┌─┴─┐ ┌───┴────┐ │ Number │
│ Product │ │ * │ │ Factor │ └───┬────┘
└────┬────┘ └───┘ └───┬────┘ ┌─┴─┐
┌────┴───┐ ┌───┴────┐ │ 4 │
│ Factor │ │ Number │ └───┘
└────┬───┘ └───┬────┘
┌────┴───┐ ┌─┴─┐
│ Number │ │ 3 │
└────┬───┘ └───┘
┌─┴─┐
│ 2 │
└───┘
Hmm… Not as neat and clean as we might want it to be. It’s workable for now, but there are ways to make this simpler. We’ll get to that in a later post.
Resolving ambiguities
I mentioned ambiguities before, yet avoided confronting the subject. The only thing we can do right now is find a parse tree. Any parse tree. The problem is, if there are several possibilities, just taking one of them isn’t going to solve our problem: we want the output to be predictable.
You might think the problem would go away if you simply banned ambiguous grammars. Now you have two problems:
- The ambiguity of context-free grammars is undecidable. You will either let some ambiguous grammars slip, or reject some unambiguous ones. Reliability or generality, pick one.
- Many ambiguous grammars are actually useful. Some context-free languages are inherently ambiguous, and can only be parsed with ambiguous grammars. Others are simply easier to process with ambiguous grammars.
So we should accept ambiguous grammars. Let’s see an example: the famous “dangling else”:
if
if {}
else {}
There are two ways to parse this. Either the else branch belongs to the outer if:
if
    if {}
else {}
Or, it belongs to the inner if:
if
    if {}
    else {}
Let’s try and write a grammar for this little language:
Block -> "{}"
Block -> If
If -> "if" Block
If -> "if" Block "else" Block
Here are the two possible syntax trees:
┌───────┐
│ Block │
└───┬───┘
┌─┴──┐
│ If │
└┬┬┬┬┘
┌────────┘││└──────────────────┐
│ │└───────┐ │
┌──┴───┐ ┌───┴───┐ ┌──┴─────┐ ┌───┴───┐
│ "if" │ │ Block │ │ "else" │ │ Block │
└──────┘ └───┬───┘ └────────┘ └───┬───┘
┌─┴──┐ ┌───┴──┐
│ If │ │ "{}" │
└─┬┬─┘ └──────┘
┌────┘└────┐
┌──┴───┐ ┌───┴───┐
│ "if" │ │ Block │
└──────┘ └───┬───┘
┌───┴──┐
│ "{}" │
└──────┘
────────────────────────────────────────
┌───────┐
│ Block │
└───┬───┘
┌─┴──┐
│ If │
└─┬┬─┘
┌────┘└────┐
┌──┴───┐ ┌───┴───┐
│ "if" │ │ Block │
└──────┘ └───┬───┘
┌─┴──┐
│ If │
└┬┬┬┬┘
┌────────┘││└──────────────────┐
│ │└───────┐ │
┌──┴───┐ ┌───┴───┐ ┌──┴─────┐ ┌───┴───┐
│ "if" │ │ Block │ │ "else" │ │ Block │
└──────┘ └───┬───┘ └────────┘ └───┬───┘
┌───┴──┐ ┌───┴──┐
│ "{}" │ │ "{}" │
└──────┘ └──────┘
What is an ambiguity?
Recall that Earley items are unique: the algorithm makes sure there are no duplicates. Here, we’re only interested in completed items (or edges). So, two different edges will have at least one difference:
- In their start point.
- In their end point.
- Or in their grammar rule.
Obviously. Now recall our depth-first search: for a given start point, we gather every edge whose rule name matches the current symbol. Yup, there might be several. That is a possible ambiguity.
I must insist on possible. As you have seen in our arithmetic expression example above, some branches may abort, leaving only one valid solution. Actual ambiguity only arises when the search could have yielded different results, depending on which edge was picked first.
Resolution
So, when we perform our search, we’re only looking at one start point: the current node. The edges we find there can only be different in 2 ways: by their end point, or by their grammar rule. For instance, in our dangling else example, when you’re searching for an If rule from 0 to 5 (the whole input), you will find 2 edges:
=== 0 ===
If -> "if" Block (5)
If -> "if" Block "else" Block (5)
Those edges end at the same point, but differ by their rule. Written as it is, our search algorithm will just pick the first one. Which is precisely what we will exploit here: when we construct the chart, we can do more than just flipping items. We can sort them.
My current choice is a blend of prioritised choice and longest match. The order of the rules in the grammar is significant. If there is more than one possibility, the rule that appears first has precedence. When that does not suffice, I just put the longest edge first. This way:
-- Grammar --
Foo -> a b
Foo -> c d
-- Chart --
=== x ===
Foo -> a b (y)
Foo -> c d (y)
Foo -> c d (z) -- z < y
I believe this gives acceptable control. Granted, there are more fine grained ways of doing this, but the algorithms involved would be significantly more complex. I think it is simpler to just decide at the top level, and let the consequences of those decisions flow downward.
That said, nothing stops you from trying other disambiguation mechanisms: by nature, they’re limited to the search of edges and the construction of the parse tree. Whatever complexity lurks in there won’t affect the rest of the program.
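This ordering is easy to impose while building the reversed chart. A sketch, assuming edges are (rule_index, finish) pairs where a smaller rule index means the rule appears earlier in the grammar (my representation, not that of the OCaml code above):

```python
def sort_edges(edges):
    # Prioritised choice first (smaller rule index wins), then
    # longest match (larger end point first).
    return sorted(edges, key=lambda e: (e[0], -e[1]))
```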
Back to our example, if we pick this rule first…
If -> "if" Block (5)
…then the outer If will have no else clause. The else clause will therefore belong to the inner If, just like it would in most programming languages. If you pick the other rule first, you get the other possibility.
Just by flipping the order of the rules in the grammar, you can decide where the else clause will go.
One last trap
I forgot to tell you: some grammars will blow up in your face:
A -> A
A ->
The problem is, this grammar generates an infinite amount of parse trees:
┌───┐ ┌───┐ ┌───┐
│ A │ │ A │ │ A │ ...
└───┘ └─┬─┘ └─┬─┘
┌─┴─┐ ┌─┴─┐
│ A │ │ A │
└───┘ └─┬─┘
┌─┴─┐
│ A │
└───┘
So, when I’m constructing my parse tree, I could be unlucky enough to fall into one of those bottomless pits until I overflow the stack. We need to do something about it. We could try and avoid the infinite paths, but I don’t like it. It would complicate the construction of the parse tree, and frankly, this is complex enough already.
My solution is more brutal: such bottomless grammars are bogus, and should be rejected. Fortunately, unlike ambiguous grammars, they can be detected with a little bit of static analysis.
First, we compile the nullable rules: they are the rules whose right hand side contains only nullable non-terminal symbols. (And we have learned to detect those earlier.)
For each nullable symbol, we compile the symbols which appear in the right hand side of the corresponding nullable rules. Imagine the following grammar:
A -> A C
A -> B
A ->
B -> A
C -> 'x'
This grammar has 2 nullable symbols, A and B. The nullable rules are those:
A -> B
A ->
B -> A
In this simple case, the only child of “A” is “B”, and vice versa.
Finally, we look for duplicates in the transitive closure of children. If a path includes a duplicate, then we have found a loop, and must reject the grammar. For instance, with the above grammar, when I examine the symbol “A”, I find only one child, “B”. So far so good. Then I examine “B” and find the grandchild “A”. We have a duplicate, so this grammar is bottomless.
That’s basically it. For reference, here is my code:
let infinite_loop : 'a grammar -> bool = fun grammar ->
let null_set = nullable_symbols grammar in
let rules symbol = grammar.rules // (fun {lhs} -> lhs = symbol)
// is_nullable null_set
/@ (fun {rhs} -> rhs) in
let add_rule = DA.foldl (fun set -> function
| Terminal _ -> failwith "impossible"
| Non_term sym -> String_set.add sym set) in
let children symbol = DA.foldl add_rule
String_set.empty
(rules symbol) in
let rec aux path symbol =
if List.mem symbol path
then true
else String_set.exists (aux (symbol::path))
(children symbol)
in String_set.exists (aux []) null_set
It is worth breaking up a bit. First, we get the nullable symbols. I reuse the previous nullable_symbols detector directly.
let null_set = nullable_symbols grammar
Then I need to detect the nullable rules that correspond to a given symbol. I only take the grammar rules I want: their name must be the symbol I seek (obviously), and they must be nullable. Finally, I only take the right hand side of each rule. (The // operator is a filter, and the /@ operator is a map.)
let rules symbol = grammar.rules // (fun {lhs} -> lhs = symbol)
// is_nullable null_set
/@ (fun {rhs} -> rhs)
Once I have my rules for a given symbol, I need to collect the children symbols. Remember that nullable rules don’t have terminal symbols.
let add_rule = DA.foldl (fun set -> function
| Terminal _ -> failwith "impossible"
| Non_term sym -> String_set.add sym set)
let children symbol = DA.foldl add_rule
String_set.empty
(rules symbol)
Finally, I perform the search for duplicates. During the search, I keep track of the symbols of the current path. If the current symbol is in this path, we have hit a loop, and return true to say so. (Production code should return the path itself, for a friendlier error message.) Otherwise we search deeper, if possible.
let rec aux path symbol =
if List.mem symbol path
then true
else String_set.exists (aux (symbol::path))
(children symbol)
Of course, we need to perform this search for every nullable symbol:
String_set.exists (aux []) null_set
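The whole check also fits in a few lines of Python, assuming the nullable symbols and their children have already been collected into a dict (an illustrative sketch, not my OCaml code):

```python
def has_bottomless_loop(children):
    # children maps each nullable symbol to the set of symbols that
    # appear in its nullable rules; look for a cycle among them.
    def aux(path, symbol):
        if symbol in path:
            return True                    # duplicate on the path: a loop
        return any(aux(path | {symbol}, c) for c in children.get(symbol, ()))
    return any(aux(set(), s) for s in children)
```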
Conclusion
Now you should be able to build a fully fledged Earley parser. As always, my code is available for study. Again, sorry about using OCaml instead of sticking with Lua. Dynamic typing just required too much brainpower. If you find my code hard to read, this other tutorial may help you.