2015-09-14

Around four years ago, when I had first decided to start at Mozilla
research, I had planned to write an LR(1) parser generator. It seemed
like a good way to get to know Rust. However, I found that newborns
actually occupy somewhat more time than anticipated (read: I was lucky
to squeeze in a shower), and hence that never came to pass.

Well, I’m happy to say that, four years later, I’ve finally rectified
that. For a few months now I’ve been working on a side project while I
have my morning coffee: LALRPOP (pronounced like some sort of
strangely accented version of lollypop). LALRPOP is an LR(1)
parser generator that emits Rust code. It is designed for ease of use,
so it includes a number of features that many parser generators are
missing:

Regular-expression-like notation, so you can write Id* for any
number of `Id` or Id? for an optional `Id`.

User-defined macros, so you can make a macro like Comma

that
means comma separated list of `Id` with optional trailing comma.

Conditional macros, so you can easily generate a subset of your
grammar for some particular context (like, all expressions that
don’t end in {).

Support for synthesizing tokenizers (currently somewhat limited, but
sufficient for many uses) as well as external tokenizers (very
flexible). If you’re using an external tokenizer, you don’t even
need to be parsing input strings at all really, any iterator of
matchable values will do.

Easy to get access to positional information.

Easy to write fallible rules where the action code can generate a
parse error.

If you’d like to learn more about LALRPOP, I recently started a
tutorial that introduces LALRPOP in more depth and walks through
most of its features. The tutorial doesn’t cover everything yet, but
I’ll try to close the gaps.

Why LR(1)? After all, aren’t LR(1) generators kind of annoying,
what with those weird shift/reduce errors? Well, after teaching
compiler design for so many years, I think I may have developed
Stockholm syndrome – I kind of enjoy diagnosing and solving
shift/reduce failures. ;) But more seriously, I personally like that
once I get my grammar working with an LR(1) generator, I know that it
is unambiguous and will basically work. When I’ve used PEG generators,
I usually find that they work great in the beginning, but once in a
while they will just mysteriously fail to parse something, and
figuring out why is a horrible pain. This is why with LALRPOP I’ve
tried to take the approach of adding tools to make handling
shift/reduce errors relatively easy – basically automating the
workarounds that one typically has to do by hand.

That said, eventually I would like LALRPOP to support a bunch of
algorithms. In particular, I plan to add something that can handle
universal CFGs, though other deterministic techniques, like LL(k),
would be nice as well.

Performance. Another advantage of LR(1), of course, it that it
offers determinstic, linear performance. That said, I’ve found that in
practice, parsing based on a parsing table is not particularly
speedy. If you think about it, it’s more-or-less interpreting your
grammar – you’ve basically got a small loop that’s loading data from
a table and then doing an indirect jump based on the results, which
happen to be the two operations that CPUs like least. In my
experience, rewriting to use a recursive descent parser is often much
faster.

LALRPOP takes a different approach. The idea is that instead of a
parsing table, we generate a function for every state. This ought to
be quite speedy; it also plays very nicely with Rust’s type system,
since types in Rust don’t have uniform size, and using distinct
functions lets us store the stack in local variables, rather than
using a Vec. At first, I thought maybe I had invented something new
with this style of parsing, but of course I should have known better:
a little research revealed that this technique is called
recursive ascent.

Now, as expected, recursive ascent is supposed to be quite fast. In
fact, I was hoping to unveil some fantastic performance numbers with
this post, but I’ve not had time to try to create a fair benchmark, so
I can’t – since I haven’t done any measurements, LALRPOP’s generated
code may in fact be quite slow. I just don’t know. Hopefully I’ll find
some time to rectify that in the near future.

100% stable Rust. It’s probably worth pointing out that LALRPOP is
100% stable Rust, and I’m committed to keeping it that way.

Other parser generators. Should LALRPOP or LR(1) not be too your
fancy, I just want to point out that the Rust ecosystem has grown
quite a number of parser combinator and PEG libraries: nom, oak,
peg, nailgun, peggler, [combine], parser-combinators, and of
course my own rusty-peg. I’m not aware of any other LR(1) (or GLL, GLR, etc) generators
for Rust, but there may well be some.

Future plans. I’ve got some plans for LALRPOP. There are a host of
new features I’d like to add, with the aim of making eliminating more
boilerplate. I’d also like to explore adding new parser algorithms,
particularly universal algorithms that can handle any CFG grammar,
such as GLL, GLR, or LL(*). Finally, I’m really interesting in
exploring the area of error recovery, and in particular techniques to
find the minimum damaged area of a parse tree if there is an incorrect
parse. (There is of course tons of existing work here.)

Plea for help. Of course, if I wind up doing all of this myself,
it might take quite some time. So if you’re interested in working on
LALRPOP, I’d love to hear from you! I’d also love to hear some other
suggestions for things to do with LALRPOP. One of the things I plan to
do over the next few weeks is also spend some more time writing up
plans in LALRPOP’s issue database, as well as filling out its wiki.

Show more