Earley Parsing Explained — The Recogniser
(The impatient may download the source code.)
We will use this grammar:
Sum = Sum [+-] Product
| Product
Product = Product [*/] Factor
| Factor
Factor = '(' Sum ')'
| Number
Number = [0-9]+
However, Earley parsers don’t recognise EBNF grammars directly. We need to use this restricted syntax:
Sum -> Sum [+-] Product
Sum -> Product
Product -> Product [*/] Factor
Product -> Factor
Factor -> '(' Sum ')'
Factor -> Number
Number -> [0-9] Number
Number -> [0-9]
Don’t worry, this notation is just as powerful as the full EBNF. We will automate the translation process later.
Vocabulary
Grammars, rules, and symbols
For the purposes of Earley parsing, a grammar is a set of rules. Here is a rule:
Product -> Product [*/] Factor
The left hand side of a rule (Product
in this example)
is called a non-terminal symbol. The right hand side
(Product [*/] Factor
in this example) is a list of symbols:
non-terminal symbols (Product
and Factor
in
this example), and terminal symbols ([*/]
in this
example).
A terminal symbol is a function from characters to booleans. It tells
you if a given character matches. Popular short hands for such functions
look like 'x'
(matches the character ‘x’ only), or
[+-*/]
(matches ‘+’, ‘-’, ’*‘, and’/’), or
[0-9]
(matches all numbers).
Earley items
I said previously that chart parsers construct lists of partial parses. Earley parsers are no exception. They just call those partial parses Earley items. Here is an Earley item:
Sum -> Sum • [+-] Product (0)
It is just like a grammar rule, with a couple more things: a fat dot somewhere in the right hand side, and a number at the far right. And of course, it represents a partial parse. Here is another way to view this item:
┌────────────│──────────────┐
┌┤ Sum -> Sum │ [+-] Product ├──>?
│└────────────│──────────────┘
│ V
┌─────────────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└─────────────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
^
└─ current position
- The number between the parentheses represents where the item
starts.
(0)
means it starts at the beginning of the input. - The fat dot represent how much has been parsed so far.
Whatever lies on its left is done. Whatever lies on its right
has yet to be tested. In this example, the very next thing to test is
the terminal token
[+-]
.
The astute reader may have noticed a key piece of information that I have represented in the chart, yet is conspicuously absent from the item itself: the current position. This information is crucial, but Earley’s algorithm doesn’t need the items to store their own current position. That’s because all the items with the same current position are stored together in a state set. (I don’t know why nobody calls them “item sets” instead. Whatever.) That state set is itself stored in an array. Its position in that array denotes the current position of the Earley items.
If you are able to look at a particular item, you know its current position, because that’s how you found it in the first place.
State sets
State sets are kind of weird. They are sets all right, in the sense that they don’t contain multiple copies of the same item. On the other hand, for the purpose of Earley’s algorithm, it is easier to think of them as dynamic arrays.
Those state sets are stored in a big dynamic array, generally simply called ‘S’. Here is a possible representation.
┌──────┬──────┬──────┐
│ S(0) │ S(1) │ S(2) │ <── the outer array, called 'S'
└──┬───┴──┬───┴──┬───┘
│ │ │
│ │ V
│ │ ┌─────┬─────┬─────┬─────┐
│ │ │ ... │ ... │ ... │ ... │ <── a state set, S(2)
│ V └─────┴─────┴─────┴─────┘
│ ┌─────┬─────┬─────┐
│ │ ... │ ... │ ... │ <── another state set, S(1)
V └─────┴─────┴──│──┘
┌─────┬─────┬─────┐ │
│ ... │ ... │ ... │ └── an Earley item
└─────┴─────┴─────┘
(We can do nice optimisations about storing the state sets contiguously in memory, but let’s ignore that for now. New state set, new array.)
The big array and the input
Earley parsers are incremental parsers: as soon as they have read a token, they gather as much information as they can about the partial parses that make sense. Concretely, this means the big array, S, is constructed piece by piece, with one element per input token (plus one initial element).
Here is what S looks like when we have read the first two tokens.
┌──────┬──────┬──────┐
│ S(0) │ S(1) │ S(2) │
└──┬───┴──┬───┴──┬───┘
V V V
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
The arrows here indicate the current position of each state set in S.
Those positions are between tokens. Which makes sense,
considering a character is either read, or not read. When
S(2)
is created, it means we have read the first 2
characters.
S(0)
is created before we read anything. That’s the
initial state. When we will be finished with this input, if all goes
well, we will have created 10 state sets, up to S(9)
. This
last state set is created when we have read everything.
Start up
The parsing starts by initialising the first state set. We typically
chose the rules by name. In our “hello world”, that would be every rule
named Sum
.
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┐
│ S(0) │ <── Our big array, S, contains only one state set.
└──┬───┘
│ ┌── Our only state set, S(0), contains 2 Earley items.
V V
┌───────────────────────────────┐
│ Sum -> • Sum [+-] Product (0) │
├───────────────────────────────┤
│ Sum -> • Product (0) │
└───────────────────────────────┘
Note how I produced the items:
- We start at the beginning, and have read no input yet. The current
position is therefore
(0)
. I put the item inS(0)
. - The items are brand new So, their start position is the same as the
current position:
(0)
. Which you can see at the right of both items. - As brand new items, nothing has been parsed yet. Everything is yet to be done (or tested). We put the fat dot at the very beginning of the item.
Main loop
Now that we have initialised our data, we can loop over it. There are two loops to consider: an inner loop, which loops over a single state set, and an outer loop, which loops over the main array of state set.
Inner loop
Basically, we go over each Earley item in the current state set. When
we begin the algorithm, the current state set is S(0)
(we’ll do the rest later).
While going over the items, we may add even more items to
either the current state, or the next one. (For the first inner loop,
this means S(0)
and S(1)
). We continue until
there are no more items in the current state set to process. This can go
on for a while: whenever we add an item to the current state set, it
will need to be processed just like the others.
This looks like this:
current
item
│
V
current ┌─────┬─────┐
state set ─> │ ... │ ... │
└─────┴─────┘
Nothing to add, let’s try the next Earley item:
current
item
│
V
current ┌─────┬─────┐
state set ─> │ ... │ ! │
└─────┴─────┘
Ah, apparently, we need to add an item:
current
item
│ ┌──The item we just added
V V
current ┌─────┬─────┬─────┐
state set ─> │ ... │ ! │ ... │
└─────┴─────┴─────┘
Done, now let’s go to the next item (there wasn’t one, but now there is).
current
item
│
V
current ┌─────┬─────┬─────┐
state set ─> │ ... │ ! │ !!! │
└─────┴─────┴─────┘
Ah, this one’s different. We need to add an item to the next state set! We create this state set if necessary, then add an Earley item to it.
current
item
│
V
current ┌─────┬─────┬─────┐
state set ─> │ ... │ ! │ !!! │
└─────┴─────┴─────┘
next ┌─────┐
state set ─> │ ... │
└─────┘
^
└──The item we just added
We have finished processing the current state, we can now process the next one (there wasn’t one originally, but now there is), and so on.
Outer loop
The outer loop works similarly: Run the inner loop over
S(0)
. Repeat the process with the next state set (which
hopefully has just been created), and so on until there are no more
state sets to process. Given a grammar and an input, we will start with
something like this:
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┐
│ S(0) │
└──┬───┘
│
V
┌─────┬─────┐
│ ... │ ... │
└─────┴─────┘
Then we run the inner loop over S(0)
, generating some
more Earley items and one state set in the process:
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬──────┐
│ S(0) │ S(1) │
└──┬───┴──┬───┘
│ │
│ V
│ ┌─────┐
│ │ ... │
V └─────┘
┌─────┬─────┬─────┐
│ ... │ ... │ ... │
└─────┴─────┴─────┘
Now we run the inner loop over S(1)
, and generate some
more data:
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬──────┬──────┐
│ S(0) │ S(1) │ S(2) │
└──┬───┴──┬───┴──┬───┘
│ │ │
│ │ V
│ │ ┌─────┬─────┬─────┬─────┐
│ │ │ ... │ ... │ ... │ ... │
│ V └─────┴─────┴─────┴─────┘
│ ┌─────┬─────┬─────┐
│ │ ... │ ... │ ... │
V └─────┴─────┴─────┘
┌─────┬─────┬─────┐
│ ... │ ... │ ... │
└─────┴─────┴─────┘
And so on until we can’t generate more state sets (either because we reached the end of the input, or because the input “stopped making sense”). I didn’t represent all state sets, but you get the idea:
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ S(0) │ S(1) │ S(2) │ S(3) │ S(4) │ S(5) │ S(6) │ S(7) │ S(8) │ S(9) │
└──┬───┴──┬───┴──┬───┴──┬───┴──┬───┴──┬───┴──┬───┴──┬───┴──┬───┴──┬───┘
│ │ │ │ │ │ │ │ │ │
│ │ │ V V V V V V V
│ │ │ ... ... ... ... ... ... ...
│ │ V
│ │ ┌─────┬─────┬─────┬─────┐
│ │ │ ... │ ... │ ... │ ... │
│ V └─────┴─────┴─────┴─────┘
│ ┌─────┬─────┬─────┐
│ │ ... │ ... │ ... │
V └─────┴─────┴─────┘
┌─────┬─────┬─────┐
│ ... │ ... │ ... │
└─────┴─────┴─────┘
Prediction, Scan, and Completion
There are three specific ways of adding new Earley items to our data. Which way we choose depends on the item we are currently examining.
Prediction. The symbol at the right of the fat dot is non-terminal. We add the the corresponding rules to the current state set.
Scan. The symbol at the right of the fat dot is terminal. We check if the input matches this symbol. If it does, we add this item (advanced one step) to the next state set.
Completion. There is nothing at the right of the fat dot. This means we have a successful partial parse. We look for the parent items, and add them (advanced one step) to this state set.
Prediction
Let’s go back to our initialised data. We must now examine the first Earley Item.
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┐
│ S(0) │
└──┬───┘
│
V
┌───────────────────────────────┐
│ Sum -> • Sum [+-] Product (0) │ <─ prediction
├───────────────────────────────┤
│ Sum -> • Product (0) │
└───────────────────────────────┘
At the right of the fat dot, there is Sum
, a
non-terminal symbol. We must perform a prediction. Why? Because we’re
trying to parse the grammar rule embedded in this item. What’s just
after the fat dot? A Sum
. Does this Sum
actually begins by a Sum
? Well, we have to try if we want
to know. So we look up the Grammar, and see 2 rules named “Sum”:
Sum -> Sum [+-] Product
Sum -> Product
This means we must add the following two items to
S(0)
:
Sum -> • Sum [+-] Product (0)
Sum -> • Product (0)
│ │
│ └ The item is predicted in S(0), so
│ we set the starting point at (0).
│
└ We put the dot at the beginning of the rule.
Now we can add those items at the end of S(0)
.
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┐
│ S(0) │
└──┬───┘
│
V
┌───────────────────────────────┐
│ Sum -> • Sum [+-] Product (0) │ <─ prediction
├───────────────────────────────┤
│ Sum -> • Product (0) │
├───────────────────────────────┤─┐
┆ Sum -> • Sum [+-] Product (0) ┆ │
├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤ ├─> These 2 are redundant
┆ Sum -> • Product (0) ┆ │
└┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┘─┘
Now we have a problem: this redundancy will drop us in an infinite loop: since the first symbol of a rule named “Sum” is also “Sum”, we will keep predicting sums until we run out of memory. This is a classic problem with most top-down parsers, and the reason why they can’t handle left-recursive grammars.
So we take one little precaution: no duplicate. The same items give the same results anyway, so why waste work? That way we avoid infinite loops —all of them. So don’t worry about left-recursive grammars, right-recursive grammars, magical grammars… they Just Work™.
(Okay, that was a lie. Some grammars still won’t work. We need another little precaution, which cannot be explained at this point.)
With this in mind, we go examine the next Earley item. We must still
perform a prediction, this time over Product
. (Basically,
we’re trying to see if this Sum
begins with a
Product
.)
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┐
│ S(0) │
└──┬───┘
│
V
┌───────────────────────────────────────┐
│ Sum -> • Sum [+-] Product (0) │
├───────────────────────────────────────┤
│ Sum -> • Product (0) │ <─ prediction
├───────────────────────────────────────┤─┐
┆ Product -> • Product [*/] Factor (0) ┆ │
├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤ ├─> These 2 are new
┆ Product -> • Factor (0) ┆ │
└┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┘─┘
Since those are not duplicates we actually add them to
S(0)
.
Scan
We keep looping over the items, predicting everything until we reach this Earley item:
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┐
│ S(0) │
└──┬───┘
│
V
┆ ... ┆
├───────────────────────────────────────┤
│ Factor -> • '(' Sum ')' (0) │ <─ scan
├───────────────────────────────────────┤
┆ ... ┆
Now we can’t predict any more: at the right of the fat dot, there is a terminal symbol. How do we try a terminal symbol? We scan it. First we see if the terminal symbol (an open parens) applies to the current character (the digit ‘1’). Since ‘1’ is not an open parens, the scan fails. We continue the loop without doing anything.
The next scan looks like this:
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┐
│ S(0) │
└──┬───┘
│
V
┆ ... ┆
├───────────────────────────────────────┤
│ Number -> • [0-9] Number (0) │ <─ scan
├───────────────────────────────────────┤
┆ ... ┆
Here, the non-terminal symbol at the right of the fat dot is
[0-9]
, and the digit ‘1’ does match that terminal symbol.
Test successful! So, we copy the Earley item we’re examining over to the
next state set, with a slight modification:
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬┄┄┄┄┄┄┐
│ S(0) │ S(1) ┆
└──┬───┴┄┄┄┄┄┄┘
│ │
│ V
│ ┌┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┐
│ ┆ Number -> [0-9] • Number (0) ┆
│ └┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄│┄┄┄┄┄┄┄┄┄│┄┄┘
│ │ │
│ │ └ The starting point
│ │ is unchanged.
│ │
│ └ The fat dot is shifted
V one step to the right.
┆ ... ┆
├──────────────────────────────┤
│ Number -> • [0-9] Number (0) │ <─ scan
├──────────────────────────────┤
┆ ... ┆
This time, we’ve done some significant work: we have tried many things, and this is our first success! At last, we managed to move a fat dot one step to the right. Now this is not a confirmed success, since the fat dot is not completely off to the right yet. But that’s a start.
A confirmed success will come soon though: there’s another item that can be scanned:
Number -> • [0-9] Number (0)
Which ultimately gives us:
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬┄┄┄┄┄┄┐
│ S(0) │ S(1) ┆
└──┬───┴┄┄┄┄┄┄┘
│ │
│ V
│ ┌──────────────────────────────┐
│ │ Number -> [0-9] • Number (0) │
│ ├──────────────────────────────┤
│ ┆ Number -> [0-9] • (0) ┆
│ └┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄│┄┄┄┄┄┄┄┄┄┄┄┄┘
│ │
V └ Dot at the end of the item
┆ ... ┆
├──────────────────────────────┤
│ Number -> • [0-9] Number (0) │ <─ scan
├──────────────────────────────┤
│ Number -> • [0-9] (0) │ <─ scan
├──────────────────────────────┤
┆ ... ┆
Completion
We continue our main loop, finish S(0)
, and process
S(1)
. At some point, we get to this:
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬──────┐
│ S(0) │ S(1) │
└──┬───┴──────┘
│ │
│ V
│ ┆ ... ┆
│ ├──────────────────────────────────────┤
│ │ Number -> [0-9] • (0) │ <─ completion
│ ├──────────────────────────────────────┤
│ ┆ ... ┆
V
┌───────────────────────────────────────┐
│ ... │
├───────────────────────────────────────┤
┆ ... ┆
Here there is no symbol at the right side of the fat dot. At this
point, things get real interesting: we have a successful
partial parse. An actual, confirmed success: between the starting point
of the item (0), and its current position (1), the string
"1"
is a Number
.
We do not stop there however: This Number
may be a
fluke, but it may also be part of something bigger. Actually, Earley’s
algorithm pretty much guarantees it will be part of something
bigger. This item doesn’t come from nowhere: there’s a whole chain of
predictions, scans, and completions to get there.
Imagine we have an item like this (‘a’, ‘b’, and ‘c’ are symbols, and ‘i’ is an integer):
Rule -> a b c • (i)
The fact that this item even exist means the following items also exist somewhere:
Foo -> a b • c (i)
Foo -> a • b c (i)
Foo -> • a b c (i)
And the last of them can only exist in S(i)
: the state
set where it was predicted. We’re not interested in those items however.
What we really want is the items that might have triggered the
original prediction. Might have. Only one item actually
triggered the original prediction, but there’s often more than one path,
and we want to capture them all. Anyway, those rules are the rules in
S(i)
that look like this:
Bar -> a b • Foo c d (j)
That is, any rule in S(i)
that has the rule
Foo
right next to the fat dot. Those are the
“something” bigger Foo
is a part of. Now the purpose of
starting points should be obvious. With them, we can jump straight to
the interesting state set, and retrieve the sets we want.
Enough with the side quest. We were triggering a completion with this item:
┆ ... ┆
├──────────────────────────────────────┤
┆ Number -> [0-9] • (0) ┆
├──────────────────────────────────────┤
┆ ... ┆
So, we search S(0)
for any item that has
Number
right next to the dot:
┌─────────┤we must search here├─────────┐
│ │
┌───V──┬──────┐ │
│ S(0) │ S(1) │ │
└──┬───┴──────┘ │
│ │ │
│ V │
│ ┆ ... │ ┆
│ ├───────────────────────────────────│──┤
│ │ Number -> [0-9] • (0) │ <─ completion
│ ├──────────────────────────────────────┤
│ ┆ ... ┆
V
┌───────────────────────────────────────┐
│ Sum -> • Sum [+-] Product (0) │ Nope
├───────────────────────────────────────┤
│ Sum -> • Product (0) │ Nope
├───────────────────────────────────────┤
│ Product -> • Product [*/] Factor (0) │ Nope
├───────────────────────────────────────┤
│ Product -> • Factor (0) │ Nope
├───────────────────────────────────────┤
│ Factor -> • '(' Sum ')' (0) │ Nope
├───────────────────────────────────────┤
│ Factor -> • Number (0) │ FOUND ONE!
├───────────────────────────────────────┤
│ Number -> • [0-9] Number (0) │ Nope
├───────────────────────────────────────┤
│ Number -> • [0-9] (0) │ Nope
└───────────────────────────────────────┘
Now we just have to complete the items we have found. This is very similar to a successful scan: we copy all of the matching rules to the current state. The only difference is, we moved the fat dot over a non-terminal symbol:
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬──────┐
│ S(0) │ S(1) │
└──┬───┴──────┘
│ │
│ V
│ ┌─────────────────────────────────────┐
│ │ Number -> [0-9] • Number (0) │
│ ├─────────────────────────────────────┤
│ │ Number -> [0-9] • (0) │ <─ completion
│ ├─────────────────────────────────────┤ trigger
│ │ Number -> • [0-9] Number (1) │
│ ├─────────────────────────────────────┤
│ │ Number -> • [0-9] (1) │
│ ├─────────────────────────────────────┤
│ ┆ Factor -> Number • (0) ┆ <─ completed item
│ └┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄│┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄│┄┄┘
│ │ │
│ │ └ The starting point
│ │ is unchanged.
│ │
│ └ The fat dot is shifted
V one step to the right.
┌───────────────────────────────────────┐
│ Sum -> • Sum [+-] Product (0) │
├───────────────────────────────────────┤
│ Sum -> • Product (0) │
├───────────────────────────────────────┤
│ Product -> • Product [*/] Factor (0) │
├───────────────────────────────────────┤
│ Product -> • Factor (0) │
├───────────────────────────────────────┤
│ Factor -> • '(' Sum ')' (0) │
├───────────────────────────────────────┤
│ Factor -> • Number (0) │ <─ Item to complete
├───────────────────────────────────────┤
│ Number -> • [0-9] Number (0) │
├───────────────────────────────────────┤
│ Number -> • [0-9] (0) │
└───────────────────────────────────────┘
Note that there may be more than one item to complete (or even none at all, see the starting state). Here for instance, the item we just completed is also a completion trigger:
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬──────┐
│ S(0) │ S(1) │
└──┬───┴──────┘
│ │
│ V
│ ┌──────────────────────────────────────┐
│ │ Number -> [0-9] • Number (0) │
│ ├──────────────────────────────────────┤
│ │ Number -> [0-9] • (0) │
│ ├──────────────────────────────────────┤
│ │ Number -> • [0-9] Number (1) │
│ ├──────────────────────────────────────┤
│ │ Number -> • [0-9] (1) │
│ ├──────────────────────────────────────┤
│ │ Factor -> Number • (0) │ <─ completion
│ ├──────────────────────────────────────┤ trigger
│ ┆ Product -> Factor • (0) ┆ <─ completed item
│ └┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄─┄┄┄┄┄┘
V
┌───────────────────────────────────────┐
│ Sum -> • Sum [+-] Product (0) │
├───────────────────────────────────────┤
│ Sum -> • Product (0) │
├───────────────────────────────────────┤
│ Product -> • Product [*/] Factor (0) │
├───────────────────────────────────────┤
│ Product -> • Factor (0) │ <─ Item to complete
├───────────────────────────────────────┤
│ Factor -> • '(' Sum ')' (0) │
├───────────────────────────────────────┤
│ Factor -> • Number (0) │
├───────────────────────────────────────┤
│ Number -> • [0-9] Number (0) │
├───────────────────────────────────────┤
│ Number -> • [0-9] (0) │
└───────────────────────────────────────┘
(And yes, Product -> Factor • (0)
is a completion
trigger too.)
After the end
Eventually, the algorithm will stop, and you will end up with a whole
slew of Earley items. for the input 1+(2*3-4)
, it will be
those:
=== 0 ===
Sum -> • Sum [+-] Product (0)
Sum -> • Product (0)
Product -> • Product [*/] Factor (0)
Product -> • Factor (0)
Factor -> • '(' Sum ')' (0)
Factor -> • Number (0)
Number -> • [0-9] Number (0)
Number -> • [0-9] (0)
=== 1 ===
Number -> [0-9] • Number (0)
Number -> [0-9] • (0)
Number -> • [0-9] Number (1)
Number -> • [0-9] (1)
Factor -> Number • (0)
Product -> Factor • (0)
Sum -> Product • (0)
Product -> Product • [*/] Factor (0)
Sum -> Sum • [+-] Product (0)
=== 2 ===
Sum -> Sum [+-] • Product (0)
Product -> • Product [*/] Factor (2)
Product -> • Factor (2)
Factor -> • '(' Sum ')' (2)
Factor -> • Number (2)
Number -> • [0-9] Number (2)
Number -> • [0-9] (2)
=== 3 ===
Factor -> '(' • Sum ')' (2)
Sum -> • Sum [+-] Product (3)
Sum -> • Product (3)
Product -> • Product [*/] Factor (3)
Product -> • Factor (3)
Factor -> • '(' Sum ')' (3)
Factor -> • Number (3)
Number -> • [0-9] Number (3)
Number -> • [0-9] (3)
=== 4 ===
Number -> [0-9] • Number (3)
Number -> [0-9] • (3)
Number -> • [0-9] Number (4)
Number -> • [0-9] (4)
Factor -> Number • (3)
Product -> Factor • (3)
Sum -> Product • (3)
Product -> Product • [*/] Factor (3)
Factor -> '(' Sum • ')' (2)
Sum -> Sum • [+-] Product (3)
=== 5 ===
Product -> Product [*/] • Factor (3)
Factor -> • '(' Sum ')' (5)
Factor -> • Number (5)
Number -> • [0-9] Number (5)
Number -> • [0-9] (5)
=== 6 ===
Number -> [0-9] • Number (5)
Number -> [0-9] • (5)
Number -> • [0-9] Number (6)
Number -> • [0-9] (6)
Factor -> Number • (5)
Product -> Product [*/] Factor • (3)
Sum -> Product • (3)
Product -> Product • [*/] Factor (3)
Factor -> '(' Sum • ')' (2)
Sum -> Sum • [+-] Product (3)
=== 7 ===
Sum -> Sum [+-] • Product (3)
Product -> • Product [*/] Factor (7)
Product -> • Factor (7)
Factor -> • '(' Sum ')' (7)
Factor -> • Number (7)
Number -> • [0-9] Number (7)
Number -> • [0-9] (7)
=== 8 ===
Number -> [0-9] • Number (7)
Number -> [0-9] • (7)
Number -> • [0-9] Number (8)
Number -> • [0-9] (8)
Factor -> Number • (7)
Product -> Factor • (7)
Sum -> Sum [+-] Product • (3)
Product -> Product • [*/] Factor (7)
Factor -> '(' Sum • ')' (2)
Sum -> Sum • [+-] Product (3)
=== 9 ===
Factor -> '(' Sum ')' • (2)
Product -> Factor • (2)
Sum -> Sum [+-] Product • (0)
Product -> Product • [*/] Factor (2)
Sum -> Sum • [+-] Product (0)
Quite a mess. Now the big question is, have we made it? To know that, we must look at the last state set:
=== 9 ===
Factor -> '(' Sum ')' • (2)
Product -> Factor • (2)
Sum -> Sum [+-] Product • (0)
Product -> Product • [*/] Factor (2)
Sum -> Sum • [+-] Product (0)
First this is the set number 9 (or the 10th set). The input has precisely 9 characters. So, first good news, the whole input made sense.
This is not enough however. We are looking for items that:
- are complete (the fat dot is at the end),
- have started at the beginning (state set 0),
- have the same name that has been chosen at the beginning (“Sum”).
There is one such item in S(9):
Sum -> Sum [+-] Product • (0)
We Win!
Failure modes
There are 2 ways to fail:
We may fail to parse the whole input. This is because an invalid token caused it to stop making sense at this point. This is what happens when you try “
1+%
”: the parse stops at the second character, fails at the third, then stops.We may parse the whole input, but fail to make a complete parse out of it. This is what happens when you try “
1+
”. This is a valid beginning of aSum
, but the last part is missing, and we can’t finish the parse.
In both cases, you get the following Earley items:
=== 0 ===
Sum -> • Sum [+-] Product (0)
Sum -> • Product (0)
Product -> • Product [*/] Factor (0)
Product -> • Factor (0)
Factor -> • '(' Sum ')' (0)
Factor -> • Number (0)
Number -> • [0-9] Number (0)
Number -> • [0-9] (0)
=== 1 ===
Number -> [0-9] • Number (0)
Number -> [0-9] • (0)
Number -> • [0-9] Number (1)
Number -> • [0-9] (1)
Factor -> Number • (0)
Product -> Factor • (0)
Sum -> Product • (0)
Product -> Product • [*/] Factor (0)
Sum -> Sum • [+-] Product (0)
=== 2 ===
Sum -> Sum [+-] • Product (0)
Product -> • Product [*/] Factor (2)
Product -> • Factor (2)
Factor -> • '(' Sum ')' (2)
Factor -> • Number (2)
Number -> • [0-9] Number (2)
Number -> • [0-9] (2)
(We get the same Earley items because this grammar makes no difference between an input that stops making sense, and an input that just stops. To make that difference, we need a rule that uses an “end of input” token explicitly.)
Note that in both examples, there is a valid parse: the
number 1
, which is a Sum
by itself. This is
evidenced by the presence of this Earley item in S(1)
:
Sum -> Product • (0)
I say this because in some cases, you just want to parse a header. Of course the input will stop making sense at some point.
Let’s do this!
You should now be able to implement a Lua recogniser. It will work for any grammar, except for one important limitation: It handles empty rules badly. We’ll correct this bug next time.
In the mean time, here is a recogniser in Lua. To use it, just run the lua interpreter on it in the command line, like this:
$ lua recogniser.lua
It has been tested with lua 5.2.3 on a Debian machine. It should work on 5.1 as well, on any machine.
If you don’t know Lua, don’t worry, it mostly looks like pseudocode. Use it at your own peril though: there is still that bug about empty rules.
Source code
You should be able to read it even if you don’t know Lua (I did my best to avoid Lua specific idioms). The only tricky bit lie in the tables.
First, an example grammar:
local Grammar = {
start_rule_name = 'Sum',
{ name = 'Sum' , 'Sum' , class('+-'), 'Product'},
{ name = 'Sum' , 'Product' , },
{ name = 'Product', 'Product' , class('*/'), 'Factor' },
{ name = 'Product', 'Factor' , },
{ name = 'Factor' , char('(') , 'Sum', char(')') },
{ name = 'Factor' , 'Number' },
{ name = 'Number' , range('09'), 'Number' },
{ name = 'Number' , range('09'), },
}
then how to use it:
local input = "1+(2*3+4)"
local S = build_items(Grammar, input)
print_S(S, Grammar) -- print all the internal state
diagnose(S, Grammar, input) -- tell if the input is OK or not
Now the main loop:
local function build_items(grammar, input)
-- Earley sets
local S = {{}}
-- put start item(s) in S[1]
for i = 1, #grammar do
if grammar[i].name == grammar.start_rule_name then
unsafe_append(S[1], { rule = i,
start = 1,
next = 1 })
end
end
-- populate the rest of S[i]
local i = 1
while i <= #S do
local j = 1
while j <= #S[i] do
local symbol = next_symbol(grammar, S[i][j])
if type(symbol) == "nil" then complete(S, i, j, grammar)
elseif type(symbol) == "function" then scan (S, i, j, symbol, input)
elseif type(symbol) == "string" then predict (S, i, symbol, grammar)
else error("illegal rule")
end
j = j + 1
end
i = i + 1
end
return S
end
The prediction step:
local function predict(S, i, symbol, grammar)
for rule_index, rule in ipairs(grammar) do
if rule.name == symbol then
append(S[i], { rule = rule_index,
next = 1 ,
start = i })
end
end
end
The scan step:
local function scan(S, i, j, symbol, input)
local item = S[i][j]
if symbol(input, i) then -- terminal symbols are predicates
if S[i+1] == nil then S[i+1] = {} end
unsafe_append(S[i+1], { rule = item.rule,
next = item.next + 1,
start = item.start })
end
end
The completion step:
local function complete(S, i, j, grammar)
local item = S[i][j]
for old_item_index, old_item in ipairs(S[item.start]) do
if next_symbol(grammar, old_item) == name(grammar, item) then
append(S[i], { rule = old_item.rule,
next = old_item.next + 1,
start = old_item.start })
end
end
end
Some utility functions:
-- next element in the rule of this item
local function next_symbol(grammar, item)
return grammar[item.rule][item.next]
end
-- gets the name of the rule pointed by the item
local function name(grammar, item)
return grammar[item.rule].name
end
-- compares two items for equality (needed for safe append)
local function equal(item1, item2)
return item1.rule == item2.rule
and item1.start == item2.start
and item1.next == item2.next
end
-- Adds an item at the end of the Earley set
local function unsafe_append(set, item)
set[#set+1] = item
end
-- Adds an item at the end of the Earley set, **unless already present**
local function append(set, item)
for i = 1, #set do
if equal(item, set[i]) then
return
end
end
unsafe_append(set, item) -- the item wasn't already there, we add it
end
That should be enough. In any case, the full source code is still available.