Earley Parsing Explained — Chart Parsing

We can’t parse the whole input all at once and magically get a full parse tree from the input. We have to work it out bit by bit somehow. One way to do it is to rely on induction to construct the tree directly. That’s how recursive descent parsers work.

Another way to do it is to construct a list of partial parses. Here’s one for instance:

 ┌────────┐
┌┤ Number ├┐
│└────────┘│
│          │
┌──────────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    1     │  +  │  (  │  2  │  *  │  3  │  -  │  4  │  )  │
└──────────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘

Here we just parsed the first character, a number. Here’s more:

 ┌────────┐                     ┌────────┐
┌┤ Number ├┐     ┌──────────────┤ Factor ├─────────────────┐
│└────────┘│     │              └────────┘                 │
│          │     │                                         │
┌──────────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    1     │  +  │  (  │  2  │  *  │  3  │  -  │  4  │  )  │
└──────────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
                                   │                 │
                                   │     ┌─────┐     │
                                   └─────┤ Sum ├─────┘
                                         └─────┘

Now with a bit more work, we can construct a complete parse. Recall the following grammar:

Sum     = Sum     [+-] Product
        | Product
Product = Product [*/] Factor
        | Factor
Factor  = '(' Sum ')'
        | Number
Number  = [0-9]+

We can use it to find all the partial parses we need, including one that spans the whole input. Here are some:

                            ┌─────┐
┌───────────────────────────┤ Sum ├─────────────────────────┐
│                           └─────┘                         │
│  ┌─────┐                                                  │
┌──┤ Sum ├──┐                                               │
│  └─────┘  │                                               │
│┌─────────┐│                                               │
┌┤ Product ├┐                                               │
│└─────────┘│                                               │
│┌────────┐ │                    ┌─────────┐                │
┌┤ Factor ├─┐     ┌──────────────┤ Product ├────────────────┐
│└────────┘ │     │              └─────────┘                │
│┌────────┐ │     │              ┌────────┐                 │
┌┤ Number ├─┐     ┌──────────────┤ Factor ├─────────────────┐
│└────────┘ │     │              └────────┘                 │
│           │     │                                         │
┌───────────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    1      │  +  │  (  │  2  │  *  │  3  │  -  │  4  │  )  │
└───────────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘

Note that I take the Factor at the lower right for granted. A real parser would of course dig deeper before determining that this sub-string is indeed a “factor”. Now let’s remove some useless fluff for ease of reading:

                            ┌─────┐
┌───────────────────────────┤ Sum ├─────────────────────────┐
│                           └─────┘                         │
│┌────────┐                      ┌────────┐                 │
┌┤ Number ├─┐     ┌──────────────┤ Factor ├─────────────────┐
│└────────┘ │     │              └────────┘                 │
│           │     │                                         │
┌───────────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    1      │  +  │  (  │  2  │  *  │  3  │  -  │  4  │  )  │
└───────────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘

When you think about it, this is just another way of representing the following parse tree:

          ┌─────┐
          │ Sum │
          └─┬┬┬─┘
    ┌───────┘│└───────┐
┌───┴────┐ ┌─┴─┐  ┌───┴────┐
│ Number │ │ + │  │ Factor │
└───┬────┘ └───┘  └───┬────┘
  ┌─┴─┐        ┌──────┴────────┐
  │ 1 │        │ ( 2 * 3 - 4 ) │ <- taken for granted
  └───┘        └───────────────┘

Useless work

All partial parses are not useful. Remember this one I showed earlier?

┌──────────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    1     │  +  │  (  │  2  │  *  │  3  │  -  │  4  │  )  │
└──────────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
                                   │                 │
                                   │     ┌─────┐     │
                                   └─────┤ Sum ├─────┘
                                         └─────┘

Taken in isolation, the sub-string 3-4 is indeed a Sum (a difference counts as a Sum in my grammar). But when you add a bit of context you can see this is a dead end: 2*3-4 reads (2×3)−4, not 2×(3−4). So, as correct as it is, this partial parse is useless. In practice many are, and some parsers waste too much time on them.

When you think about it, it doesn’t make sense to even try to parse 3-4 as a Sum: it follows a star (*), and the grammar says that stars are followed by factors, not by sums.

Here lies Earley’s genius: his parsers avoid most unnecessary work by not even trying a whole slew of hopeless partial parses like this one. This makes it faster than many alternatives.