@ Loup's Impossible? Like that would stop me.

Earley Parsing Explained — The Recogniser

(The impatient may download the source code.)

We will use this grammar:

Sum     = Sum     [+-] Product
        | Product
Product = Product [*/] Factor
        | Factor
Factor  = '(' Sum ')'
        | Number
Number  = [0-9]+

However, Earley parsers don’t recognise EBNF grammars directly. We need to use this restricted syntax:

Sum     -> Sum     [+-] Product
Sum     -> Product
Product -> Product [*/] Factor
Product -> Factor
Factor  -> '(' Sum ')'
Factor  -> Number
Number  -> [0-9] Number
Number  -> [0-9]

Don’t worry, this notation is just as powerful as the full EBNF. We will automate the translation process later.

Vocabulary

Grammars, rules, and symbols

For the purposes of Earley parsing, a grammar is a set of rules. Here is a rule:

Product -> Product [*/] Factor

The left hand side of a rule (Product in this example) is called a non-terminal symbol. The right hand side (Product [*/] Factor in this example) is a list of symbols: non-terminal symbols (Product and Factor in this example), and terminal symbols ([*/] in this example).

A terminal symbol is a function from characters to booleans. It tells you if a given character matches. Popular short hands for such functions look like 'x' (matches the character ‘x’ only), or [+-*/] (matches ‘+’, ‘-’, ’*‘, and’/’), or [0-9] (matches all numbers).

Earley items

I said previously that chart parsers construct lists of partial parses. Earley parsers are no exception. They just call those partial parses Earley items. Here is an Earley item:

Sum -> Sum • [+-] Product  (0)

It is just like a grammar rule, with a couple more things: a fat dot somewhere in the right hand side, and a number at the far right. And of course, it represents a partial parse. Here is another way to view this item:

 ┌────────────│──────────────┐
┌┤ Sum -> Sum │ [+-] Product ├──>?
│└────────────│──────────────┘
│             V
┌─────────────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│      1      │  +  │  (  │  2  │  *  │  3  │  -  │  4  │  )  │
└─────────────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
              ^
              └─ current position

The astute reader may have noticed a key piece of information that I have represented in the chart, yet is conspicuously absent from the item itself: the current position. This information is crucial, but Earley’s algorithm doesn’t need the items to store their own current position. That’s because all the items with the same current position are stored together in a state set. (I don’t know why nobody calls them “item sets” instead. Whatever.) That state set is itself stored in an array. Its position in that array denotes the current position of the Earley items.

If you are able to look at a particular item, you know its current position, because that’s how you found it in the first place.

State sets

State sets are kind of weird. They are sets all right, in the sense that they don’t contain multiple copies of the same item. On the other hand, for the purpose of Earley’s algorithm, it is easier to think of them as dynamic arrays.

Those state sets are stored in a big dynamic array, generally simply called ‘S’. Here is a possible representation.

┌──────┬──────┬──────┐
│ S(0) │ S(1) │ S(2) │ <── the outer array, called 'S'
└──┬───┴──┬───┴──┬───┘
   │      │      │
   │      │      V
   │      │   ┌─────┬─────┬─────┬─────┐
   │      │   │ ... │ ... │ ... │ ... │ <── a state set, S(2)
   │      V   └─────┴─────┴─────┴─────┘
   │   ┌─────┬─────┬─────┐
   │   │ ... │ ... │ ... │ <── another state set, S(1)
   V   └─────┴─────┴──│──┘
┌─────┬─────┬─────┐   │
│ ... │ ... │ ... │   └── an Earley item
└─────┴─────┴─────┘

(We can do nice optimisations about storing the state sets contiguously in memory, but let’s ignore that for now. New state set, new array.)

The big array and the input

Earley parsers are incremental parsers: as soon as they have read a token, they gather as much information as they can about the partial parses that make sense. Concretely, this means the big array, S, is constructed piece by piece, with one element per input token (plus one initial element).

Here is what S looks like when we have read the first two tokens.

┌──────┬──────┬──────┐
│ S(0) │ S(1) │ S(2) │
└──┬───┴──┬───┴──┬───┘
   V      V      V
   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │  1   │  +   │  (   │  2   │  *   │  3   │  -   │  4   │  )   │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘

The arrows here indicate the current position of each state set in S. Those positions are between tokens. Which makes sense, considering a character is either read, or not read. When S(2) is created, it means we have read the first 2 characters.

S(0) is created before we read anything. That’s the initial state. When we will be finished with this input, if all goes well, we will have created 10 state sets, up to S(9). This last state set is created when we have read everything.

Start up

The parsing starts by initialising the first state set. We typically chose the rules by name. In our “hello world”, that would be every rule named Sum.

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │  1   │  +   │  (   │  2   │  *   │  3   │  -   │  4   │  )   │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┐
│ S(0) │ <── Our big array, S, contains only one state set.
└──┬───┘
   │     ┌── Our only state set, S(0), contains 2 Earley items.
   V     V
┌───────────────────────────────┐
│ Sum -> • Sum [+-] Product (0) │
├───────────────────────────────┤
│ Sum -> • Product          (0) │
└───────────────────────────────┘

Note how I produced the items:

Main loop

Now that we have initialised our data, we can loop over it. There are two loops to consider: an inner loop, which loops over a single state set, and an outer loop, which loops over the main array of state set.

Inner loop

Basically, we go over each Earley item in the current state set. When we begin the algorithm, the current state set is S(0) (we’ll do the rest later).

While going over the items, we may add even more items to either the current state, or the next one. (For the first inner loop, this means S(0) and S(1)). We continue until there are no more items in the current state set to process. This can go on for a while: whenever we add an item to the current state set, it will need to be processed just like the others.

This looks like this:

             current
              item
                │
                V
current      ┌─────┬─────┐
state set ─> │ ... │ ... │
             └─────┴─────┘

Nothing to add, let’s try the next Earley item:

                   current
                    item
                      │
                      V
current      ┌─────┬─────┐
state set ─> │ ... │  !  │
             └─────┴─────┘

Ah, apparently, we need to add an item:

                   current
                    item
                      │     ┌──The item we just added
                      V     V
current      ┌─────┬─────┬─────┐
state set ─> │ ... │  !  │ ... │
             └─────┴─────┴─────┘

Done, now let’s go to the next item (there wasn’t one, but now there is).

                          current
                           item
                            │
                            V
current      ┌─────┬─────┬─────┐
state set ─> │ ... │  !  │ !!! │
             └─────┴─────┴─────┘

Ah, this one’s different. We need to add an item to the next state set! We create this state set if necessary, then add an Earley item to it.

                          current
                           item
                            │
                            V
current      ┌─────┬─────┬─────┐
state set ─> │ ... │  !  │ !!! │
             └─────┴─────┴─────┘

next         ┌─────┐
state set ─> │ ... │
             └─────┘
                ^
                └──The item we just added

We have finished processing the current state, we can now process the next one (there wasn’t one originally, but now there is), and so on.

Outer loop

The outer loop works similarly: Run the inner loop over S(0). Repeat the process with the next state set (which hopefully has just been created), and so on until there are no more state sets to process. Given a grammar and an input, we will start with something like this:

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │  1   │  +   │  (   │  2   │  *   │  3   │  -   │  4   │  )   │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┐
│ S(0) │
└──┬───┘
   │
   V
┌─────┬─────┐
│ ... │ ... │
└─────┴─────┘

Then we run the inner loop over S(0), generating some more Earley items and one state set in the process:

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │  1   │  +   │  (   │  2   │  *   │  3   │  -   │  4   │  )   │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬──────┐
│ S(0) │ S(1) │
└──┬───┴──┬───┘
   │      │
   │      V
   │   ┌─────┐
   │   │ ... │
   V   └─────┘
┌─────┬─────┬─────┐
│ ... │ ... │ ... │
└─────┴─────┴─────┘

Now we run the inner loop over S(1), and generate some more data:

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │  1   │  +   │  (   │  2   │  *   │  3   │  -   │  4   │  )   │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬──────┬──────┐
│ S(0) │ S(1) │ S(2) │
└──┬───┴──┬───┴──┬───┘
   │      │      │
   │      │      V
   │      │   ┌─────┬─────┬─────┬─────┐
   │      │   │ ... │ ... │ ... │ ... │
   │      V   └─────┴─────┴─────┴─────┘
   │   ┌─────┬─────┬─────┐
   │   │ ... │ ... │ ... │
   V   └─────┴─────┴─────┘
┌─────┬─────┬─────┐
│ ... │ ... │ ... │
└─────┴─────┴─────┘

And so on until we can’t generate more state sets (either because we reached the end of the input, or because the input “stopped making sense”). I didn’t represent all state sets, but you get the idea:

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │  1   │  +   │  (   │  2   │  *   │  3   │  -   │  4   │  )   │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ S(0) │ S(1) │ S(2) │ S(3) │ S(4) │ S(5) │ S(6) │ S(7) │ S(8) │ S(9) │
└──┬───┴──┬───┴──┬───┴──┬───┴──┬───┴──┬───┴──┬───┴──┬───┴──┬───┴──┬───┘
   │      │      │      │      │      │      │      │      │      │
   │      │      │      V      V      V      V      V      V      V
   │      │      │     ...    ...    ...    ...    ...    ...    ...
   │      │      V
   │      │   ┌─────┬─────┬─────┬─────┐
   │      │   │ ... │ ... │ ... │ ... │
   │      V   └─────┴─────┴─────┴─────┘
   │   ┌─────┬─────┬─────┐
   │   │ ... │ ... │ ... │
   V   └─────┴─────┴─────┘
┌─────┬─────┬─────┐
│ ... │ ... │ ... │
└─────┴─────┴─────┘

Prediction, Scan, and Completion

There are three specific ways of adding new Earley items to our data. Which way we choose depends on the item we are currently examining.

Prediction

Let’s go back to our initialised data. We must now examine the first Earley Item.

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │  1   │  +   │  (   │  2   │  *   │  3   │  -   │  4   │  )   │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┐
│ S(0) │
└──┬───┘
   │
   V
┌───────────────────────────────┐
│ Sum -> • Sum [+-] Product (0) │ <─ prediction
├───────────────────────────────┤
│ Sum -> • Product          (0) │
└───────────────────────────────┘

At the right of the fat dot, there is Sum, a non-terminal symbol. We must perform a prediction. Why? Because we’re trying to parse the grammar rule embedded in this item. What’s just after the fat dot? A Sum. Does this Sum actually begins by a Sum? Well, we have to try if we want to know. So we look up the Grammar, and see 2 rules named “Sum”:

Sum -> Sum [+-] Product
Sum -> Product

This means we must add the following two items to S(0):

Sum -> • Sum [+-] Product (0)
Sum -> • Product          (0)
       │                   │
       │                   └ The item is predicted in S(0), so
       │                     we set the starting point at (0).
       │
       └ We put the dot at the beginning of the rule.

Now we can add those items at the end of S(0).

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │  1   │  +   │  (   │  2   │  *   │  3   │  -   │  4   │  )   │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┐
│ S(0) │
└──┬───┘
   │
   V
┌───────────────────────────────┐
│ Sum -> • Sum [+-] Product (0) │ <─ prediction
├───────────────────────────────┤
│ Sum -> • Product          (0) │
├───────────────────────────────┤─┐
┆ Sum -> • Sum [+-] Product (0) ┆ │
├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤ ├─> These 2 are redundant
┆ Sum -> • Product          (0) ┆ │
└┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┘─┘

Now we have a problem: this redundancy will drop us in an infinite loop: since the first symbol of a rule named “Sum” is also “Sum”, we will keep predicting sums until we run out of memory. This is a classic problem with most top-down parsers, and the reason why they can’t handle left-recursive grammars.

So we take one little precaution: no duplicate. The same items give the same results anyway, so why waste work? That way we avoid infinite loops —all of them. So don’t worry about left-recursive grammars, right-recursive grammars, magical grammars… they Just Work™.

(Okay, that was a lie. Some grammars still won’t work. We need another little precaution, which cannot be explained at this point.)

With this in mind, we go examine the next Earley item. We must still perform a prediction, this time over Product. (Basically, we’re trying to see if this Sum begins with a Product.)

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │  1   │  +   │  (   │  2   │  *   │  3   │  -   │  4   │  )   │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┐
│ S(0) │
└──┬───┘
   │
   V
┌───────────────────────────────────────┐
│ Sum     -> • Sum     [+-] Product (0) │
├───────────────────────────────────────┤
│ Sum     -> • Product              (0) │ <─ prediction
├───────────────────────────────────────┤─┐
┆ Product -> • Product [*/] Factor  (0) ┆ │
├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤ ├─> These 2 are new
┆ Product -> • Factor               (0) ┆ │
└┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┘─┘

Since those are not duplicates we actually add them to S(0).

Scan

We keep looping over the items, predicting everything until we reach this Earley item:

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │  1   │  +   │  (   │  2   │  *   │  3   │  -   │  4   │  )   │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┐
│ S(0) │
└──┬───┘
   │
   V
┆ ...                                   ┆
├───────────────────────────────────────┤
│ Factor  -> • '(' Sum ')'          (0) │ <─ scan
├───────────────────────────────────────┤
┆ ...                                   ┆

Now we can’t predict any more: at the right of the fat dot, there is a terminal symbol. How do we try a terminal symbol? We scan it. First we see if the terminal symbol (an open parens) applies to the current character (the digit ‘1’). Since ‘1’ is not an open parens, the scan fails. We continue the loop without doing anything.

The next scan looks like this:

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │  1   │  +   │  (   │  2   │  *   │  3   │  -   │  4   │  )   │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┐
│ S(0) │
└──┬───┘
   │
   V
┆ ...                                   ┆
├───────────────────────────────────────┤
│ Number  -> • [0-9] Number         (0) │ <─ scan
├───────────────────────────────────────┤
┆ ...                                   ┆

Here, the non-terminal symbol at the right of the fat dot is [0-9], and the digit ‘1’ does match that terminal symbol. Test successful! So, we copy the Earley item we’re examining over to the next state set, with a slight modification:

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │  1   │  +   │  (   │  2   │  *   │  3   │  -   │  4   │  )   │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬┄┄┄┄┄┄┐
│ S(0) │ S(1) ┆
└──┬───┴┄┄┄┄┄┄┘
   │      │
   │      V
   │    ┌┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┐
   │    ┆ Number -> [0-9] • Number (0) ┆
   │    └┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄│┄┄┄┄┄┄┄┄┄│┄┄┘
   │                      │         │
   │                      │         └ The starting point
   │                      │           is unchanged.
   │                      │
   │                      └ The fat dot is shifted
   V                        one step to the right.
┆ ...                          ┆
├──────────────────────────────┤
│ Number -> • [0-9] Number (0) │ <─ scan
├──────────────────────────────┤
┆ ...                          ┆

This time, we’ve done some significant work: we have tried many things, and this is our first success! At last, we managed to move a fat dot one step to the right. Now this is not a confirmed success, since the fat dot is not completely off to the right yet. But that’s a start.

A confirmed success will come soon though: there’s another item that can be scanned:

Number -> • [0-9] Number (0)

Which ultimately gives us:

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │  1   │  +   │  (   │  2   │  *   │  3   │  -   │  4   │  )   │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬┄┄┄┄┄┄┐
│ S(0) │ S(1) ┆
└──┬───┴┄┄┄┄┄┄┘
   │      │
   │      V
   │    ┌──────────────────────────────┐
   │    │ Number -> [0-9] • Number (0) │
   │    ├──────────────────────────────┤
   │    ┆ Number -> [0-9] •        (0) ┆
   │    └┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄│┄┄┄┄┄┄┄┄┄┄┄┄┘
   │                      │
   V                      └ Dot at the end of the item
┆ ...                          ┆
├──────────────────────────────┤
│ Number -> • [0-9] Number (0) │ <─ scan
├──────────────────────────────┤
│ Number -> • [0-9]        (0) │ <─ scan
├──────────────────────────────┤
┆ ...                          ┆

Completion

We continue our main loop, finish S(0), and process S(1). At some point, we get to this:

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │  1   │  +   │  (   │  2   │  *   │  3   │  -   │  4   │  )   │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬──────┐
│ S(0) │ S(1) │
└──┬───┴──────┘
   │      │
   │      V
   │    ┆ ...                                  ┆
   │    ├──────────────────────────────────────┤
   │    │ Number -> [0-9] •                (0) │ <─ completion
   │    ├──────────────────────────────────────┤
   │    ┆ ...                                  ┆
   V
┌───────────────────────────────────────┐
│ ...                                   │
├───────────────────────────────────────┤
┆ ...                                   ┆

Here there is no symbol at the right side of the fat dot. At this point, things get real interesting: we have a successful partial parse. An actual, confirmed success: between the starting point of the item (0), and its current position (1), the string "1" is a Number.

We do not stop there however: This Number may be a fluke, but it may also be part of something bigger. Actually, Earley’s algorithm pretty much guarantees it will be part of something bigger. This item doesn’t come from nowhere: there’s a whole chain of predictions, scans, and completions to get there.

Imagine we have an item like this (‘a’, ‘b’, and ‘c’ are symbols, and ‘i’ is an integer):

Rule -> a b c •  (i)

The fact that this item even exist means the following items also exist somewhere:

Foo ->   a   b • c  (i)
Foo ->   a • b   c  (i)
Foo -> • a   b   c  (i)

And the last of them can only exist in S(i): the state set where it was predicted. We’re not interested in those items however. What we really want is the items that might have triggered the original prediction. Might have. Only one item actually triggered the original prediction, but there’s often more than one path, and we want to capture them all. Anyway, those rules are the rules in S(i) that look like this:

Bar -> a b • Foo c d  (j)

That is, any rule in S(i) that has the rule Foo right next to the fat dot. Those are the “something” bigger Foo is a part of. Now the purpose of starting points should be obvious. With them, we can jump straight to the interesting state set, and retrieve the sets we want.

Enough with the side quest. We were triggering a completion with this item:

┆ ...                                  ┆
├──────────────────────────────────────┤
┆ Number -> [0-9] •                (0) ┆
├──────────────────────────────────────┤
┆ ...                                  ┆

So, we search S(0) for any item that has Number right next to the dot:

    ┌─────────┤we must search here├─────────┐
    │                                       │
┌───V──┬──────┐                             │
│ S(0) │ S(1) │                             │
└──┬───┴──────┘                             │
   │      │                                 │
   │      V                                 │
   │    ┆ ...                               │  ┆
   │    ├───────────────────────────────────│──┤
   │    │ Number -> [0-9] •                (0) │ <─ completion
   │    ├──────────────────────────────────────┤
   │    ┆ ...                                  ┆
   V
┌───────────────────────────────────────┐
│ Sum     -> • Sum     [+-] Product (0) │ Nope
├───────────────────────────────────────┤
│ Sum     -> • Product              (0) │ Nope
├───────────────────────────────────────┤
│ Product -> • Product [*/] Factor  (0) │ Nope
├───────────────────────────────────────┤
│ Product -> • Factor               (0) │ Nope
├───────────────────────────────────────┤
│ Factor  -> • '(' Sum ')'          (0) │ Nope
├───────────────────────────────────────┤
│ Factor  -> • Number               (0) │ FOUND ONE!
├───────────────────────────────────────┤
│ Number  -> • [0-9] Number         (0) │ Nope
├───────────────────────────────────────┤
│ Number  -> • [0-9]                (0) │ Nope
└───────────────────────────────────────┘

Now we just have to complete the items we have found. This is very similar to a successful scan: we copy all of the matching rules to the current state. The only difference is, we moved the fat dot over a non-terminal symbol:

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │  1   │  +   │  (   │  2   │  *   │  3   │  -   │  4   │  )   │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬──────┐
│ S(0) │ S(1) │
└──┬───┴──────┘
   │      │
   │      V
   │    ┌─────────────────────────────────────┐
   │    │ Number -> [0-9] • Number        (0) │
   │    ├─────────────────────────────────────┤
   │    │ Number -> [0-9] •               (0) │ <─ completion
   │    ├─────────────────────────────────────┤    trigger
   │    │ Number -> • [0-9] Number        (1) │
   │    ├─────────────────────────────────────┤
   │    │ Number -> • [0-9]               (1) │
   │    ├─────────────────────────────────────┤
   │    ┆ Factor -> Number •              (0) ┆ <─ completed item
   │    └┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄│┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄│┄┄┘
   │                       │               │
   │                       │               └ The starting point
   │                       │                 is unchanged.
   │                       │
   │                       └ The fat dot is shifted
   V                         one step to the right.
┌───────────────────────────────────────┐
│ Sum     -> • Sum     [+-] Product (0) │
├───────────────────────────────────────┤
│ Sum     -> • Product              (0) │
├───────────────────────────────────────┤
│ Product -> • Product [*/] Factor  (0) │
├───────────────────────────────────────┤
│ Product -> • Factor               (0) │
├───────────────────────────────────────┤
│ Factor  -> • '(' Sum ')'          (0) │
├───────────────────────────────────────┤
│ Factor  -> • Number               (0) │ <─ Item to complete
├───────────────────────────────────────┤
│ Number  -> • [0-9] Number         (0) │
├───────────────────────────────────────┤
│ Number  -> • [0-9]                (0) │
└───────────────────────────────────────┘

Note that there may be more than one item to complete (or even none at all, see the starting state). Here for instance, the item we just completed is also a completion trigger:

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │  1   │  +   │  (   │  2   │  *   │  3   │  -   │  4   │  )   │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
┌──────┬──────┐
│ S(0) │ S(1) │
└──┬───┴──────┘
   │      │
   │      V
   │    ┌──────────────────────────────────────┐
   │    │ Number  -> [0-9] • Number        (0) │
   │    ├──────────────────────────────────────┤
   │    │ Number  -> [0-9] •               (0) │
   │    ├──────────────────────────────────────┤
   │    │ Number  -> • [0-9] Number        (1) │
   │    ├──────────────────────────────────────┤
   │    │ Number  -> • [0-9]               (1) │
   │    ├──────────────────────────────────────┤
   │    │ Factor  -> Number •              (0) │ <─ completion
   │    ├──────────────────────────────────────┤    trigger
   │    ┆ Product -> Factor •              (0) ┆ <─ completed item
   │    └┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄─┄┄┄┄┄┘
   V
┌───────────────────────────────────────┐
│ Sum     -> • Sum     [+-] Product (0) │
├───────────────────────────────────────┤
│ Sum     -> • Product              (0) │
├───────────────────────────────────────┤
│ Product -> • Product [*/] Factor  (0) │
├───────────────────────────────────────┤
│ Product -> • Factor               (0) │ <─ Item to complete
├───────────────────────────────────────┤
│ Factor  -> • '(' Sum ')'          (0) │
├───────────────────────────────────────┤
│ Factor  -> • Number               (0) │
├───────────────────────────────────────┤
│ Number  -> • [0-9] Number         (0) │
├───────────────────────────────────────┤
│ Number  -> • [0-9]                (0) │
└───────────────────────────────────────┘

(And yes, Product -> Factor • (0) is a completion trigger too.)

After the end

Eventually, the algorithm will stop, and you will end up with a whole slew of Earley items. for the input 1+(2*3-4), it will be those:

=== 0 ===
Sum     -> • Sum [+-] Product     (0)
Sum     -> • Product              (0)
Product -> • Product [*/] Factor  (0)
Product -> • Factor               (0)
Factor  -> • '(' Sum ')'          (0)
Factor  -> • Number               (0)
Number  -> • [0-9] Number         (0)
Number  -> • [0-9]                (0)

=== 1 ===
Number  -> [0-9] • Number         (0)
Number  -> [0-9] •                (0)
Number  -> • [0-9] Number         (1)
Number  -> • [0-9]                (1)
Factor  -> Number •               (0)
Product -> Factor •               (0)
Sum     -> Product •              (0)
Product -> Product • [*/] Factor  (0)
Sum     -> Sum • [+-] Product     (0)

=== 2 ===
Sum     -> Sum [+-] • Product     (0)
Product -> • Product [*/] Factor  (2)
Product -> • Factor               (2)
Factor  -> • '(' Sum ')'          (2)
Factor  -> • Number               (2)
Number  -> • [0-9] Number         (2)
Number  -> • [0-9]                (2)

=== 3 ===
Factor  -> '(' • Sum ')'          (2)
Sum     -> • Sum [+-] Product     (3)
Sum     -> • Product              (3)
Product -> • Product [*/] Factor  (3)
Product -> • Factor               (3)
Factor  -> • '(' Sum ')'          (3)
Factor  -> • Number               (3)
Number  -> • [0-9] Number         (3)
Number  -> • [0-9]                (3)

=== 4 ===
Number  -> [0-9] • Number         (3)
Number  -> [0-9] •                (3)
Number  -> • [0-9] Number         (4)
Number  -> • [0-9]                (4)
Factor  -> Number •               (3)
Product -> Factor •               (3)
Sum     -> Product •              (3)
Product -> Product • [*/] Factor  (3)
Factor  -> '(' Sum • ')'          (2)
Sum     -> Sum • [+-] Product     (3)

=== 5 ===
Product -> Product [*/] • Factor  (3)
Factor  -> • '(' Sum ')'          (5)
Factor  -> • Number               (5)
Number  -> • [0-9] Number         (5)
Number  -> • [0-9]                (5)

=== 6 ===
Number  -> [0-9] • Number         (5)
Number  -> [0-9] •                (5)
Number  -> • [0-9] Number         (6)
Number  -> • [0-9]                (6)
Factor  -> Number •               (5)
Product -> Product [*/] Factor •  (3)
Sum     -> Product •              (3)
Product -> Product • [*/] Factor  (3)
Factor  -> '(' Sum • ')'          (2)
Sum     -> Sum • [+-] Product     (3)

=== 7 ===
Sum     -> Sum [+-] • Product     (3)
Product -> • Product [*/] Factor  (7)
Product -> • Factor               (7)
Factor  -> • '(' Sum ')'          (7)
Factor  -> • Number               (7)
Number  -> • [0-9] Number         (7)
Number  -> • [0-9]                (7)

=== 8 ===
Number  -> [0-9] • Number         (7)
Number  -> [0-9] •                (7)
Number  -> • [0-9] Number         (8)
Number  -> • [0-9]                (8)
Factor  -> Number •               (7)
Product -> Factor •               (7)
Sum     -> Sum [+-] Product •     (3)
Product -> Product • [*/] Factor  (7)
Factor  -> '(' Sum • ')'          (2)
Sum     -> Sum • [+-] Product     (3)

=== 9 ===
Factor  -> '(' Sum ')' •          (2)
Product -> Factor •               (2)
Sum     -> Sum [+-] Product •     (0)
Product -> Product • [*/] Factor  (2)
Sum     -> Sum • [+-] Product     (0)

Quite a mess. Now the big question is, have we made it? To know that, we must look at the last state set:

=== 9 ===
Factor  -> '(' Sum ')' •          (2)
Product -> Factor •               (2)
Sum     -> Sum [+-] Product •     (0)
Product -> Product • [*/] Factor  (2)
Sum     -> Sum • [+-] Product     (0)

First this is the set number 9 (or the 10th set). The input has precisely 9 characters. So, first good news, the whole input made sense.

This is not enough however. We are looking for items that:

There is one such item in S(9):

Sum -> Sum [+-] Product •  (0)

We Win!

Failure modes

There are 2 ways to fail:

In both cases, you get the following Earley items:

=== 0 ===
Sum     -> • Sum [+-] Product     (0)
Sum     -> • Product              (0)
Product -> • Product [*/] Factor  (0)
Product -> • Factor               (0)
Factor  -> • '(' Sum ')'          (0)
Factor  -> • Number               (0)
Number  -> • [0-9] Number         (0)
Number  -> • [0-9]                (0)

=== 1 ===
Number  -> [0-9] • Number         (0)
Number  -> [0-9] •                (0)
Number  -> • [0-9] Number         (1)
Number  -> • [0-9]                (1)
Factor  -> Number •               (0)
Product -> Factor •               (0)
Sum     -> Product •              (0)
Product -> Product • [*/] Factor  (0)
Sum     -> Sum • [+-] Product     (0)

=== 2 ===
Sum     -> Sum [+-] • Product     (0)
Product -> • Product [*/] Factor  (2)
Product -> • Factor               (2)
Factor  -> • '(' Sum ')'          (2)
Factor  -> • Number               (2)
Number  -> • [0-9] Number         (2)
Number  -> • [0-9]                (2)

(We get the same Earley items because this grammar makes no difference between an input that stops making sense, and an input that just stops. To make that difference, we need a rule that uses an “end of input” token explicitly.)

Note that in both examples, there is a valid parse: the number 1, which is a Sum by itself. This is evidenced by the presence of this Earley item in S(1):

Sum -> Product •  (0)

I say this because in some cases, you just want to parse a header. Of course the input will stop making sense at some point.

Let’s do this!

You should now be able to implement a Lua recogniser. It will work for any grammar, except for one important limitation: It handles empty rules badly. We’ll correct this bug next time.

In the mean time, here is a recogniser in Lua. To use it, just run the lua interpreter on it in the command line, like this:

$ lua recogniser.lua

It has been tested with lua 5.2.3 on a Debian machine. It should work on 5.1 as well, on any machine.

If you don’t know Lua, don’t worry, it mostly looks like pseudocode. Use it at your own peril though: there is still that bug about empty rules.

Source code

You should be able to read it even if you don’t know Lua (I did my best to avoid Lua specific idioms). The only tricky bit lie in the tables.

First, an example grammar:

local Grammar = {
   start_rule_name = 'Sum',
   { name = 'Sum'    , 'Sum'      , class('+-'), 'Product'},
   { name = 'Sum'    , 'Product'  ,                       },
   { name = 'Product', 'Product'  , class('*/'), 'Factor' },
   { name = 'Product', 'Factor'   ,                       },
   { name = 'Factor' , char('(')  , 'Sum', char(')')      },
   { name = 'Factor' , 'Number'                           },
   { name = 'Number' , range('09'), 'Number'              },
   { name = 'Number' , range('09'),                       },
}

then how to use it:

local input = "1+(2*3+4)"
local S     = build_items(Grammar, input)
print_S(S, Grammar)              -- print all the internal state
diagnose(S, Grammar, input)      -- tell if the input is OK or not

Now the main loop:

local function build_items(grammar, input)
   -- Earley sets
   local S = {{}}
   -- put start item(s) in S[1]
   for i = 1, #grammar do
      if grammar[i].name == grammar.start_rule_name then
         unsafe_append(S[1], { rule  = i,
                               start = 1,
                               next  = 1 })
      end
   end
   -- populate the rest of S[i]
   local i = 1
   while i <= #S do
      local j = 1
      while j <= #S[i] do
         local symbol = next_symbol(grammar, S[i][j])
         if     type(symbol) == "nil"      then complete(S, i, j, grammar)
         elseif type(symbol) == "function" then scan    (S, i, j, symbol, input)
         elseif type(symbol) == "string"   then predict (S, i,    symbol, grammar)
         else error("illegal rule")
         end
         j = j + 1
      end
      i = i + 1
   end
   return S
end

The prediction step:

local function predict(S, i, symbol, grammar)
   for rule_index, rule in ipairs(grammar) do
      if rule.name == symbol then
         append(S[i], { rule  = rule_index,
                        next  = 1 ,
                        start = i })
      end
   end
end

The scan step:

local function scan(S, i, j, symbol, input)
   local item = S[i][j]
   if symbol(input, i) then -- terminal symbols are predicates
      if S[i+1] == nil then S[i+1]  = {} end
      unsafe_append(S[i+1], { rule  = item.rule,
                              next  = item.next + 1,
                              start = item.start })
   end
end

The completion step:

local function complete(S, i, j, grammar)
   local item = S[i][j]
   for old_item_index, old_item in ipairs(S[item.start]) do
      if next_symbol(grammar, old_item) == name(grammar, item) then
         append(S[i], { rule  = old_item.rule,
                        next  = old_item.next + 1,
                        start = old_item.start })
      end
   end
end

Some utility functions:

-- next element in the rule of this item
local function next_symbol(grammar, item)
   return grammar[item.rule][item.next]
end

-- gets the name of the rule pointed by the item
local function name(grammar, item)
   return grammar[item.rule].name
end

-- compares two items for equality (needed for safe append)
local function equal(item1, item2)
   return item1.rule  == item2.rule
      and item1.start == item2.start
      and item1.next  == item2.next
end

-- Adds an item at the end of the Earley set
local function unsafe_append(set, item)
   set[#set+1] = item
end

-- Adds an item at the end of the Earley set, **unless already present**
local function append(set, item)
   for i = 1, #set do
      if equal(item, set[i]) then
         return
      end
   end
   unsafe_append(set, item) -- the item wasn't already there, we add it
end

That should be enough. In any case, the full source code is still available.