Parsing is where the story of compilation begins. While it may mean something wonderful to us, our source code is merely a stream of character data. It’s the role of a parser to take this data and translate it to something the compiler understands. This involves pulling out symbols and operators, creating a parse tree, and converting that to an abstract syntax tree. In this article I look a bit at how that happens.
A simple example
The first phase is taking the raw code and producing a parse tree. Consider this basic expression below.
The logical first step is to create a token stream of this data. Using regular expressions, scanning the text manually, or using a special lexer, the text is tokenized. The above yields these tokens: (, (, 1, +, 3, ), *, 23, ), /, (, -, 1.4, ).
This tokenization is usually not done as a distinct step. Instead the tokens are usually read one at a time on demand. This allows the tokens to be context dependent: the extraction can differ based on what logical part of the code is being processed.
From the token stream a tree needs to be constructed. One that resolves the order in which all the operations occur. It’s fairly easy in the above case, since each subexpression is bracketed. The tree looks like this.
Of course that expression will not likely occur like that in code; it’d more likely be typed with fewer parenthesis and spaces.
Even though this yields a different token stream the same tree must be derived as a result of this expression. This requires resolving operator precedence, making the parser slightly more complicated. Fortunately this type of expression parsing is old and there are a few tools for this. The one I use in Leaf is called the “Shunting-yard algorithm”. The result of this algorithm on the above is the postfix expression 1 3 + 2 * -4 /.
Actually the - should also be an operator in the postfix notation. It could be the binary addition operator or the unary negative operator. It’s hard to show them both in postfix notation but with names it would look like 1 3 add 2 mul 4 neg div.
Though basic, the shunting-yard algorithm is a good starting point for a custom parser. In Leaf I retain the basic stack setup it has, but have some significant modifications, or rather extensions, to how it works.
Statements
A programming language is made up of more than just simple expressions. The next most common structures are statements, function calls, and assignments.
The only special part here is the var keyword. The entire a = pine( 13 ) component is still an expression. The whole thing still creates a tree, but now it may have multiple children.
We would likely see a lot of divergence in compilers at this point. The in-memory form could be a tree, a list, or special node structures for each operation already. In the Leaf parser I stay strictly in this abstract tree form during this initial phase.
On top of statements come blocks, functions, and whatever else a language provides.
Just for fun here’s the parse tree generated for that code in Leaf.
How does it parse
I’m glossing over a lot of details. Obviously something more than the shunting yard is required to parse full language structures.
A common approach for compilers is to use a custom recursive descent parser. This essentially follows the structure of the language. The parser will have functions like parse_block, parse_tuple, parse_expression or parse_function and will simply call them as it encounters each language construct. Individual tokens are matched directly or using a regular expression library.
Tools called parser generators also exist and promise to do most of this work. In my experience however I’ve found parser generators somewhat lacking. From what I’ve read a lot of compilers use hand-written recursive descent parsers instead of these generators.
Syntax Tree
The previous steps were to generate a parse tree. What we really want from the parser is an abstract syntax tree. Something where the various tokens have been properly translated into high-level language constructs and any vestiges of the textual syntax can been eliminated.
Not all compilers have this strict separation between a parse tree and the syntax tree. It’s also possible to do a more linear assembly of the text into a syntax tree, processing statements and blocks as they occur.
The syntax tree is a compiler specific representation of the code in memory. It uses types that model the language, such as function, variable, statement, or block. Whereas the parse tree is very generic, the syntax tree is highly specific. This is required for the compiler to actually understand the code.
Creation of the parse tree has already done most of the hard work. Conversion to a syntax tree feels mostly like grunt work in comparison. It’s mainly a serialization problem, sort of like converting a complex JSON object tree into a typed structure. How well the two trees match determines how much shuffling and balancing occurs at this stage.
It’s a bit difficult to show these in-memory structures in a generic fashion without just looking like a modified parse tree. The class hierarchy may also be harder to understand when “flattened” into a tree. In Leaf I dump this structure basically as an expanded, and slightly modified, form of Leaf itself.
A dump of the previous code after parsing.
Onwards
Unfortunately not all languages are designed to allow this distinct parsing stage. Languages like C++, or even C#, have a few constructs that require the parser to do a bit more than parsing. Some of the parsing, like constructors and variable declarations, require the parser have at least a basic understanding of symbols and semantics. Nonetheless, they still do parsing and produce a syntax tree in the end.
Having a syntax tree is where the parser stops. It’s done it’s job and can pass it off to the next stages of compilation.
If you’d like to learn more about compilers then follow me on Twitter. I have many more things to write on this topic. If there’s something special you’d like to hear about, or want to arrange a presentation feel free to contact me.