Earley Parsing Explained
Earley parsers are among the most general parsers out there. They can parse any context free language without restriction, and can even be extended towards context sensitivity. They are reasonably fast on most practical grammars, and are easy to implement (the core algorithms take less than 200 lines of code).
On the other hand, most of the information I found about them is encoded in cryptic academese. Deciphering it is hard for non-experts (it was certainly hard for me).
This tutorial is mostly aimed at the curious and the implementer. It tries to convey an intuitive understanding of Earley parsing. Hard core theorists seeking deep math should read the papers referenced in this tutorial. Here, I just want to help you write your own Earley parsing framework.
Prerequisites
Unfortunately, I can’t assume zero knowledge. To understand this tutorial, you will need a good grasp of the vocabulary around formal grammars and the Chomksky hierarchy. Notions of automata theory can also be useful.
At the very least, you should be able to implement a recursive descent parser for a simple language, such as arithmetic expressions. If you have never written one, do so now.
Table of contents
What is Earley Parsing, and Why should you care? This is the part where I tell you Earley parsing is the magical panacea to all software engineering problems.
Chart Parsing. Earley parser are what we call “chart parsers”, or “table parsers”. They store information about partial parses in tables.
The Recogniser. The first step: determining if the input is valid at all. Beware, I left a nasty bug in there.
Empty Rules. Time to fix that bug. Our recogniser is now fully general.
Optimising Right Recursion. Our recogniser is quadratic on right-recursive grammars. Compared to current LR parsers, which are linear, this is a performance bug. This optimisation fixes it.
The Parser. Now we can construct the abstract syntax tree. We will also learn to deal with ambiguity, mostly through prioritised choice.
Semantic Actions. The shape of our syntax trees is tied to the form of the grammar. Semantic actions give us a way to bypass this limitation. This gets us the convenience of parsing expression grammars.
Context Sensitivity. Context free grammars don’t solve everything. Some problems, like significant indentation, requires some context sensitivity. Now we have the power of parsing expression grammars.
Handling Backus Naur Form. Until now, we worked on a restricted syntax for grammars. It is as powerful as the BNF notation, just less convenient. Here, we will learn to compile the full BNF notation into the restricted notation.