@ Loup's

Impossible? Like that would stop me.

Earley Parsing Explained — Empty Rules

As before, the impatient and the expert can read the source code before the manual.

The bug

I told you last time there was a bug. Look at the following grammar (let us assume the start symbol is A):

A ->
A -> B
B -> A

It's an uninteresting, trivial grammar that parses the empty string, but that's enough for our purposes. You will notice that this grammar is ambiguous: there are several syntax trees we could derive from an empty input. Well, a countable infinity of them, in fact:

┌───┐     ┌───┐     ┌───┐
│ A │     │ A │     │ A │     ...
└───┘     └─┬─┘     └─┬─┘
          ┌─┴─┐     ┌─┴─┐
          │ B │     │ B │
          └─┬─┘     └─┬─┘
          ┌─┴─┐     ┌─┴─┐
          │ A │     │ A │
          └───┘     └─┬─┘
                    │ B │
                    │ A │

But, when you use our Earley recogniser on this grammar, you get the following items:

=== 0 ===
A -> •    (0)
A -> • B  (0)
B -> • A  (0)

Here, we can see only one completed item:

A -> • (0)

Which means, the subsequent parser will only see one possible parse:

│ A │

So we have a bug.

The problem

The problem comes from the poor timing around the handling of empty rules. Let's take a look at the items one more time:

=== 0 ===
A -> •    (0)
A -> • B  (0)
B -> • A  (0)

See the last item? It calls for completion: A has been completed earlier, with item A -> • (0). So, B -> • A (0) should be advanced one step. We should add this item:

B -> A • (0)

And this one as well, while we're at it:

A -> B • (0)

But as I said, there's a timing problem. When the completion step is triggered, the state set looks like this:

=== 0 ===
A -> •    (0)  <- completion to perform
A -> • B  (0)

Nice, a completion! Let's look at all the items of the form:

R -> α • A β (j)

When you think about it, there is one:

B -> • A  (0)

But guess what, it doesn't exist yet. It will be predicted later, by the second item of the set. So we can't advance it, and miss another completion in the process.

Ordinarily, it is not possible for a completed item to be created before the items it is supposed to advance one step: they belong to a later Earley set: by the time a later Earley set i being processed, the previous ones are done. Unfortunately, empty rules break that assumption: now we look for items in a set that is not fully built yet! Of course we could miss something.

The solution

One obvious solution was given by Grune and Jacob (1990):

The easiest way to handle this mare’s nest is to stay calm and keep running the Predictor and Completer in turn until neither has anything more to add

Ah, but I don't like this at all. One thing I liked about our algorithm, was the ability to process the Earley sets in one pass. With this method, we have to loop at least twice, add another nesting to our loops… Yuck.

The real solution came from Aycock and Horspool (2002). They noticed we could advance some items without waiting for a completion to trigger that advancement. More precisely, some symbols are nullable: they can parse the empty string. Hence the trick:

When performing a prediction, if the predicted symbol is nullable, then advance the predictor one step.

Let's say for instance we're processing the second item:

=== 0 ===
A -> •    (0)
A -> • B  (0)  <- needs a prediction.

So we need to predict the symbol B. B happens to be nullable. So, we can advance A -> • B (0) without waiting for something like B -> α • (0) to appear. It will come up eventually, so we might as well anticipate the completion. So there's two things happening in the predictor: the actual prediction, and a magical completion:

=== 0 ===
A -> •    (0)
A -> • B  (0)  <- predictor
B -> • A  (0)  <- predicted from the predictor
A -> B •  (0)  <- magical completion of the predictor

Now, we can keep processing the Earley sets in one pass. Don't worry, Aycock and Horspool have proven this trick works. There's just one last hurdle…

Detecting nullable symbols

A nullable symbols is a symbol that name at least one nullable rule. And a nullable rule is only composed of nullable symbols. (Possibly zero, which makes the rule empty.) Terminal symbols aren't nullable.

There are several ways to do this. I personally have chosen to just scan every rule, and maintain a list of nullable symbols in the process. If a rule is empty, or contains only symbols from the nullable set, then I add its name to the nullable set. And I keep scanning all the rules over and over, until the nullable set no longer grows.

This is simple, but slow. Quadratic with respect to the size of the grammar, in fact. We can beat this. In his Marpa parser, Jeffrey Kegler has devised an alternative which should work in linear time. If you intend to process large grammars (eventually you will), you should take a look.

Anyway, now that I have a nullable set, I can just use it to test if a symbol is nullable when I make a prediction.

Lua code

The code is very simple. I only have made 2 additions to the recogniser. First, the detection of of the nullable set:

-- Detecting nullable rules -- Nullable symbols sets are named "nss".
local function nullable_nss()
   return { size = 0 }

local function add_nullable_rule(rule_name, nss)
   if nss[rule_name] then return end  -- Do nothing for known nullable rules.
   nss[rule_name] = true              -- The others are added,
   nss.size = nss.size + 1            -- and the size is adjusted.

-- Returns true if it can say for sure the rule is nullable.
-- Returns false otherwise
local function is_nullable(rule, nss)
   for i = 1, #rule do
      if not nss[rule[i]] then
         return false
   return true

-- Adds nullable rules to the nss, by examining them in one pass.
local function update_nss(nss, grammar)
   for i = 1, #grammar do                     -- For each rule,
      if is_nullable(grammar[i], nss) then       -- if the rule is nullable for sure,
         add_nullable_rule(grammar[i].name, nss) -- add it to the nss.

local function nullable_rules(grammar)
   local nss = nullable_nss()
   repeat                      -- Keep...
      local old_size = nss.size
      update_nss(nss, grammar) -- ...updating the nss,
   until old_size == nss.size  -- as long as it keeps growing.
   return nss                  -- An nss that stopped growing is complete.

Then, the addition of the magic completion in the prediction step:

local function predict(S, i, j, symbol, grammar, nss)
   for rule_index, rule in ipairs(grammar) do
      if rule.name == symbol then
         append(S[i], { rule  = rule_index,
                        next  = 1 ,
                        start = i })
         if nss[rule.name] then -- magical completion
            append(S[i], { rule  = S[i][j].rule,
                           next  = S[i][j].next + 1,
                           start = S[i][j].start})

And that's basically it. As before, the full source code is available.