2013-11-27



Parsing is…complicated. And yet, we wouldn’t have many facets of the modern world without it. Our ability to get behavior out of a computer is limited by the expressiveness of our programming languages. And the programming languages available to us are limited by our parsing techniques.

The construction of a programming language consists of at least three components:

lexer

parser

interpreter / compiler

The lexer, or scanner as it is often known, is generally a simple routine that understands the input as a stream of characters. The lexer converts the character stream into a token stream that it sends to the parser.

The parser consists of rules that determine whether a stream of tokens is arranged in an acceptable order. In addition, the parser is responsible for assembling a graph of nodes, in many cases an Abstract Syntax Tree (AST), that represents the input in a meaningful way.

The syntax graph generated by the parser is generally converted into bytecode that can run on a virtual machine or compiled into machine code. For a small DSL, it can be handled by a simple runtime implemented in the host language.

Writing these components by hand can be a challenge, particularly for the newcomer. There have been many tools to assist in their creation, initially with lexer and parser generators, and recently with code generators like LLVM.

The Parslet gem provides a DSL for creating programming languages in Ruby. In this article we will be looking at some of the history of parsing, where Parslet fits in, and how you can use it to create your own programming languages.

The source code for this tutorial is available on GitHub.

Time and Space

In 1965, Donald Knuth invented the LR parser. LR is short for Left-to-right, Rightmost derivation. An LR parser can recognize any deterministic context-free grammar in linear time. Because they maintain large state-transition tables, LR parsers were fast but had inordinate memory requirements (for the 1960s), and were thus dismissed as impractical.

In 1969, Frank DeRemer proposed simplified versions of the LR parser: Look-Ahead LR (LALR) and Simple LR (SLR). These parsers reduced the memory requirements at the cost of recognizing fewer grammars. The most popular variation became LALR(1): LALR with one token of look-ahead.

In 1970, Stephen C. Johnson created YACC (Yet Another Compiler-Compiler) while working on the B language at AT&T. He and Al Aho initially tried implementing an LALR table by hand, but it took two days just to cover 30 grammar rules and produced quite a few errors, so Johnson decided to invest his time in the LALR(1) parser generator that we now know as YACC.

Initially, YACC was rather slow, but after years of work Johnson was able to speed it up by several orders of magnitude and make it the leading tool for parser generation. By the time computers had improved significantly and alternative parsing algorithms were available, YACC and LALR(1) had become entrenched in computer science culture.

Writing a usable grammar for YACC isn’t easy. The gotchas of LALR(1) (namely shift/reduce and reduce/reduce conflicts) present a high barrier to entry for anyone interested in parsing a non-trivial language. As a result, the YACCs of the world increasingly compete with alternative approaches, including writing parsers from scratch.

The New Era

Today, YACC lives on in GNU Bison (and other variants) and remains quite popular due to its speed and history. However, the consistent computer science theme of performance taking a back seat to maintainability has been playing itself out.

Initially, the GCC project used Bison to generate parsers for C, C++, and Objective-C, but over versions 3.4 to 4.1 it gradually shifted to hand-written recursive descent parsers. The rationale was that LALR(1) parsers required too many hacks to conform to any of the C-family languages.

The Clang maintainers have stuck with a similar approach, with the goal of eventually having a unified recursive descent parser for the C language family and its many dialects.

The ANTLR project generates recursive descent parsers using LL(*) grammars. In v4, ANTLR’s creator Terence Parr decided to focus on grammar flexibility rather than performance, aptly naming the release “honey badger”.

Recursive descent isn’t the only modern answer to parsing demanding languages. With more RAM on the table (no pun intended) and faster processors, there has been renewed interest in some of the more expensive algorithms proposed since the mid-20th century, including the CYK algorithm and the Earley parser.

Bison now supports Generalized LR (GLR) in addition to LALR(1). GLR works similarly to LR except that it forks parse stacks upon encountering a conflict. Each parse stack reads tokens until it finishes the input or reaches an invalid state. This enables GLR to avoid shift/reduce and reduce/reduce conflicts at the cost of increased memory consumption.

Parsing Expression Grammar

In order to understand why Parsing Expression Grammar is interesting, it helps to know the alternative. In the 1950s, Noam Chomsky proposed his now-famous Chomsky hierarchy, which consists of formal grammars, the kinds of languages they can generate, and the kind of machine (automaton) needed to recognize each language.

Most parser generators expect a language to be described by a context-free grammar (CFG), the second simplest of the four grammar types in Chomsky’s hierarchy. The problem with Chomsky’s grammars is that they are all generative: they were designed to express ambiguity, because Chomsky was most interested in applying them to natural languages. Historically, the ambiguity that CFGs can express has not meshed well with the parsing algorithms behind parser generators (like LALR(1)).

In 2004, Bryan Ford published a paper on parsing expression grammars. He argued that the role of context-free grammars as language generators was misaligned with the role of parsers as language recognizers. In the paper, he proposed an alternative, recognition-based formalism that we now call PEG.

Parsing Expression Grammar (PEG) brings two important innovations:

lexical analysis and parsing are the same activity (rather than requiring separate tools like lex and yacc)

ambiguity is replaced with greed

The choice operator in CFGs is unordered (commutative). This has the pitfall of admitting multiple potential parse trees (when there can be only one). The choice operator in PEGs, however, is greedy: if the first option is valid, the second is ignored, just like in most programming languages. It should be noted that this is not a magical fix, and it is quite possible to get yourself into trouble with it.
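For example (a hypothetical rule in Parslet syntax, which is introduced below), ordered choice means the order of alternatives matters:

    # Greedy choice pitfall: on the input "ab", str('a') succeeds first,
    # so the str('ab') branch is never tried and the leftover 'b' causes
    # a parse failure. Putting the longer alternative first fixes it.
    rule(:letters) { str('a') | str('ab') }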

Unlike CFGs, PEGs are not particularly suited to parsing natural language. However, they are much more suited to parsing programming languages. Whereas a CFG is like a specification for generating strings in a language, a PEG is like a collection of rules for creating a recursive descent parser.

Parslet

Parslet provides a DSL for defining parsers with a Parsing Expression Grammar.

First, install the parslet gem.
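    $ gem install parslet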

This is what a simple parser in Parslet looks like.
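A minimal sketch (the original listing isn’t reproduced here, so the names are illustrative):

    require 'parslet'

    class SimpleParser < Parslet::Parser
      # A single rule: one or more lowercase letters and spaces.
      rule(:words) { match('[a-z ]').repeat(1) }
      root(:words)
    end

    SimpleParser.new.parse('hello world')  # => "hello world"@0

Calling #parse either returns the matched input (as a Parslet::Slice, annotated with its byte offset) or raises Parslet::ParseFailed.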

Rules

A Parslet parser is built from a set of named rules, each with a definition block.

The definition block contains the expected syntax for that rule, including:

– the order in which elements are expected to arrive

– the number of times any element is expected to appear

– any choices between elements

Here is how you would expect one or more digits using #repeat.
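    # inside a Parslet::Parser subclass
    rule(:integer) { match('[0-9]').repeat(1) }

repeat(1) means one or more; a bare repeat (or repeat(0)) means zero or more.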

Rules can be referenced in definition blocks by their names.
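For example, a float rule can be built on top of an integer rule (names illustrative):

    rule(:integer) { match('[0-9]').repeat(1) }
    rule(:float)   { integer >> str('.') >> integer }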

Every parser needs a root rule. The root is the first rule that the parser will expect.
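    class NumberParser < Parslet::Parser
      rule(:integer) { match('[0-9]').repeat(1) }
      rule(:float)   { integer >> str('.') >> integer }
      # float must come first: PEG choice is greedy, and integer alone
      # would match the leading digits of a float.
      rule(:number)  { float | integer }
      root(:number)
    end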

Atoms

Parslet refers to the elements of language syntax as atoms or parslets (interestingly, ‘atom’ is the term usually used in the documentation).

Atoms come in the following varieties:

Regular expressions – ex: match('[a-zA-Z0-9]') (matches a single character)

Strings – ex: str('foo')

Any character – corresponds to the regexp /./; ex: any.repeat(0)

The behavior of the parser is driven by relationships between its atoms and rules.

atom1 >> atom2 – atom2 is expected after atom1

atom.repeat(x[,y]) – atom is expected at least x times (and at most y times, if y is given)

atom.maybe – equivalent to atom.repeat(0,1)

atom1 | atom2 – either atom is expected, but if atom1 is seen then atom2 is ignored

The Parslet documentation doesn’t quite make this clear, but atoms appear to be purely terminals (so rules, being non-terminals, don’t strictly count as atoms). In practice, however, anything you can apply to an atom you can also apply to a rule.

Expect ‘a’ or rule b:
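A sketch:

    rule(:b)      { str('b') }
    rule(:a_or_b) { str('a') | b }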

Expect a word followed by a colon:
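Another sketch:

    rule(:word)       { match('[a-zA-Z]').repeat(1) }
    rule(:word_colon) { word >> str(':') }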

Tricky Whitespace

It’s very tempting to use #repeat(0) to make your parser flexible, but care must be taken. When dealing with whitespace, it’s generally a good idea to have rules like these:
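    # a common Parslet idiom: required vs. optional whitespace
    rule(:space)  { match('\s').repeat(1) }
    rule(:space?) { space.maybe }

This way, each rule that touches whitespace states explicitly whether it is required (space) or merely allowed (space?).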

If you just use one all-encompassing whitespace rule, you are at risk of getting entangled in ambiguity. While playing around with whitespace, I managed to create a parser that hangs.

Error Reporting

Parslet provides excellent error reporting if you take advantage of it. If you just let Parslet fail, it will tell you what sequence it failed to match. However, if you rescue Parslet::ParseFailed, you can get a failure tree.
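A sketch, reusing the SimpleParser from above (in current Parslet releases the failure tree is exposed via #parse_failure_cause):

    begin
      SimpleParser.new.parse('HELLO')   # uppercase letters don't match [a-z ]
    rescue Parslet::ParseFailed => error
      puts error.parse_failure_cause.ascii_tree
    end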

This prints an ASCII tree showing each rule that failed and where in the input the match broke down.

Parslet provides a convenience method so that you don’t need to write out the exception handling every time.
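That convenience method appears to be #parse_with_debug, which lives in parslet/convenience and prints the failure tree for you (returning nil instead of raising):

    require 'parslet/convenience'

    SimpleParser.new.parse_with_debug('HELLO')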

Generating a Tree

By default, Parslet just checks the input. It doesn’t generate a syntax tree. The #as method creates a node on the tree.
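For example, tagging the digits matched by an integer rule (a sketch):

    rule(:integer) { match('[0-9]').repeat(1).as(:int) }

Parsing '42' with this rule produces {:int=>"42"@0} instead of a bare slice.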

Here is a more complete example.
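The original listing isn’t reproduced here; a reconstruction that yields the output below might look like this:

    # assignment_parser.rb
    require 'parslet'

    class AssignmentParser < Parslet::Parser
      rule(:identifier) { match('[a-z]').repeat(1) }
      rule(:integer)    { match('[0-9]').repeat(1) }
      rule(:newline)    { str("\n") }
      # each assignment becomes a node tagged :assignment,
      # with :left and :right children
      rule(:assignment) do
        (identifier.as(:left) >> str('=') >> integer.as(:right)).as(:assignment)
      end
      rule(:statements) { (assignment >> newline.maybe).repeat(1) }
      root(:statements)
    end

    p AssignmentParser.new.parse("a=23\nb=56")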

If you run assignment_parser.rb, instead of getting the input back, you get an array of assignment hashes.

[{:assignment=>{:left=>"a"@0, :right=>"23"@2}}, {:assignment=>{:left=>"b"@5, :right=>"56"@7}}]

Transformation

Parslet provides Parslet::Transform for making sense of syntax graphs produced by Parslet::Parser. You could use it to make an interpreter or bytecode generator.

Like Parslet::Parser, a Parslet::Transform is made of rules and blocks.

The capture method determines what the rule will match and is one of:

simple – any object but hashes or arrays

sequence – arrays of simple objects (but not nested arrays or hashes)

subtree – everything

The result of the action block will replace the node pattern specified in the rule. If only the leaf node is specified, only the leaf node will be replaced.
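As an illustration (a reconstructed sketch; the original listings aren’t reproduced here), consider a tree of the form {:sentence => {:words => 'hello world'}}:

    require 'parslet'

    tree = { :sentence => { :words => 'hello world' } }

    # matches the innermost {:words => ...} node and replaces just that node
    class LeafTransform < Parslet::Transform
      rule(:words => simple(:words)) { words.upcase }
    end

    # matches the whole {:sentence => ...} tree and replaces all of it
    class TreeTransform < Parslet::Transform
      rule(:sentence => subtree(:sentence)) { "sentence: #{sentence[:words]}" }
    end

    p LeafTransform.new.apply(tree)  # => {:sentence=>"HELLO WORLD"}
    p TreeTransform.new.apply(tree)  # => "sentence: hello world"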

Here, LeafTransform just replaces {:words => “hello world”}, while TreeTransform replaces the entire tree.

Let’s transform the tree produced by the assignment parser (Note: the Parslet “@” decorations are removed).
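A reconstructed sketch (Assignment here is a plain Struct; the original class isn’t shown):

    Assignment = Struct.new(:left, :right)

    class AssignmentTransform < Parslet::Transform
      rule(:assignment => { :left => simple(:left), :right => simple(:right) }) do
        # calling #to_s on the slices drops the "@offset" decorations
        Assignment.new(left.to_s, right.to_s)
      end
    end

    tree = AssignmentParser.new.parse("a=23\nb=56")
    p AssignmentTransform.new.apply(tree)
    # => [#<struct Assignment left="a", right="23">, #<struct Assignment left="b", right="56">]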

The #apply method returns the transformed tree with Assignment objects instead of hashes.

For more on Parslet::Transform, check out the documentation.

Conclusion

If you are new to creating programming languages, Parslet is an excellent way to get started. It provides more grammatical flexibility than the typical LALR(1) generator, and, since grammars are ordinary Ruby classes, you can bring your own Ruby code to bear. However, Parslet is not the only PEG game in town for Ruby. Here are some others:

treetop

citrus

rsec

Here are some additional resources if you would like to learn more about creating languages with Parslet:

Parsing TOML with Parslet

Using Parslet to Write Your Own Language

Writing DSLs with Parslet – Wicked Good Ruby 2013
