What is Earley Parsing, and Why should you care?
What
Like most parsers, Earley parsers work with a specification…
Sum = Sum [+-] Product
| Product
Product = Product [*/] Factor
| Factor
Factor = '(' Sum ')'
| Number
Number = [0-9]+
…to turn some input…
┌───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ 1 │ + │ ( │ 2 │ * │ 3 │ - │ 4 │ ) │
└───┴───┴───┴───┴───┴───┴───┴───┴───┘
…into a nicely structured tree.
┌─────┐
│ Sum │
└─┬┬┬─┘
┌───────┘│└───────┐
┌───┴────┐ ┌─┴─┐ ┌───┴────┐
│ Number │ │ + │ │ Factor │
└───┬────┘ └───┘ └───┬────┘
┌─┴─┐ ┌──┴──┐
│ 1 │ │ Sum │
└───┘ └─┬┬┬─┘
┌───────┘│└───────┐
┌────┴────┐ ┌─┴─┐ ┌────┴───┐
│ Product │ │ - │ │ Number │
└───┬┬┬───┘ └───┘ └──────┬─┘
┌───────┘│└───────┐ │
┌───┴────┐ ┌─┴─┐ ┌───┴────┐ ┌─┴─┐
│ Number │ │ * │ │ Number │ │ 4 │
└───┬────┘ └───┘ └───┬────┘ └───┘
┌─┴─┐ ┌─┴─┐
│ 2 │ │ 3 │
└───┘ └───┘
So far, nothing special about them.
Why
The biggest advantage of Earley Parsing is its accessibility. Most other tools such as parser generators, parsing expression grammars, or combinator libraries feature restrictions that often make them hard to use. Use the wrong kind of grammar, and your PEG will enter an infinite loop. Use another wrong kind of grammar, and most parser generators will fail. To a beginner, these restrictions feel most arbitrary: it looks like it should work, but it doesn’t. There are workarounds of course, but they make these tools more complex.
Earley parsing Just Works™.
On the flip side, to get this generality we must sacrifice some speed. Earley parsing cannot compete with speed demons such as Flex/Bison in terms of raw speed. It’s not that bad, however:
- Earley parsing is cubic in the worst cases, which is the state of the art (and possibly the best we can do). The speed demons often don’t work at all for those worst cases. Other parsers are prone to exponential combinatorial explosion.
- Most simple grammars can be parsed in linear time.
- Even the worst unambiguous grammars can be parsed in quadratic time.
My advice would be to use Earley parsing by default, and only revert to more specific methods if performance is an issue…
…which is what I would like to say. For now we lack the tools. My goal is to get you to write them.