2014-05-21

One of our clients sent us a bug report, which consisted of a 700-line Haskell program with a complaint that “it deadlocks”. After studying the code I concluded that, for the sake of the bug report, it could be summarized as

print a bunch of stuff, then crash with a stack overflow

So I wanted to replace the original code with code that did just that: print a bunch of stuff, then crash with a stack overflow. No big deal:
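
A minimal sketch of such a program looks like this; it assumes a CPP macro THRESHOLD supplied at compile time (for example via -DTHRESHOLD=1000), and the details are illustrative rather than a verbatim copy of the original:

    {-# LANGUAGE CPP #-}
    module Main (main) where

    -- Print a dot every THRESHOLD recursive calls; the addition after the
    -- recursive call (which can never actually run) keeps go from being
    -- tail recursive, so every call leaves something on the stack.
    go :: Int -> IO Int
    go n = do
      if n `rem` THRESHOLD == 0
        then putChar '.'
        else return ()
      n' <- go (n + 1)
      return (n' + 1)

    main :: IO ()
    main = go 0 >> return ()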

The function go prints a dot every THRESHOLD recursive calls; we have a (dummy, since it can never be reached) addition after the recursive call to ensure that go is not tail recursive and will eventually run out of stack space. The THRESHOLD macro variable is there so that we can tweak how quickly the program runs out of stack space, or in other words, how many dots it prints before it crashes.

For example, with one value of THRESHOLD the program prints 22,126 dots before crashing; with another it prints 19,483 dots before crashing. But as I was experimenting to find the right value for this parameter, I noticed something very strange. With a THRESHOLD value of 24 it never crashes! It just prints, and prints, and prints (with GHC 7.6.3). This blog post documents my attempt to understand this behaviour, and explains some aspects of ghc’s runtime, and in particular, its stack, as it goes.

Checking core

After type checking, ghc translates Haskell to an intermediate language called core, which is the “real life” version of the more “academic” language System FC. We can ask ghc to output the (optimized) core version of our program with -ddump-simpl.

This is useful, because it allows us to verify that even with a THRESHOLD value of 24 the optimizer did not somehow manage to make the function tail recursive:

This is a bit difficult to read; don’t worry about the details for now, we will start dealing with low level details soon enough. For now, it’s enough to note that $wa is the translation of go, and that the recursive calls to $wa are not in tail position; the optimizer did not somehow manage to make the function tail recursive. Ok. Then what? Why is this function not crashing with a stack overflow?

Simplifying the problem

If we cannot understand the behaviour of the code by looking at core then we need to drop all the way down to assembly language. However, if we want to have any hope of being able to step through the assembly language we need to simplify that call to putChar. Sadly, replacing it with a simpler way of printing the character made the problem go away: the program now always crashed. So the strange behaviour of our program had something to do with the implementation of putChar. The real putChar is more involved than it might seem; it deals with buffering, character encodings, concurrent access to Handles, etc.

Unfortunately, since I had no idea what aspect of the implementation of putChar was causing the behaviour I was seeing, I could think of no other approach than to inline putChar and the functions it calls, and start simplifying it bit by bit until the strange behaviour disappeared.

Many, many hours later I ended up with this code:
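
In outline it is something like the following sketch, written directly against the primops, with stdout modelled as () (the identifiers and details here are illustrative):

    {-# LANGUAGE MagicHash, UnboxedTuples #-}
    module Main (main) where

    import GHC.Exts (State#, RealWorld, maskAsyncExceptions#)
    import GHC.Types (IO(..))

    -- "stdout" modelled simply as (); all that matters is that hPutChar
    -- performs a case analysis before returning
    stdout :: ()
    stdout = ()

    hPutChar :: State# RealWorld -> (# State# RealWorld, () #)
    hPutChar w = case stdout of () -> (# w, () #)

    -- run hPutChar with asynchronous exceptions masked, then recurse;
    -- the outer pattern match keeps the recursive call out of tail position
    go :: State# RealWorld -> (# State# RealWorld, () #)
    go w0 =
      case maskAsyncExceptions# hPutChar w0 of
        (# w1, () #) ->
          case go w1 of
            (# w2, () #) -> (# w2, () #)

    main :: IO ()
    main = IO go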

In order to understand this code, you first have to realize that the IO monad is a state monad with the RealWorld as the state:
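
Concretely, ghc defines IO (in GHC.Types) as a function from one world state to the next, paired with the result:

    newtype IO a = IO (State# RealWorld -> (# State# RealWorld, a #))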

I have written the code with explicit state passing, instead of using the monad, so that the Haskell code is as close as possible to the generated core and beyond.

That out of the way, let’s discuss the actual code. First of all, the real putChar is simply hPutChar stdout, where stdout has type Handle. As it turns out, there are only two aspects of hPutChar that are relevant:

hPutChar does a case analysis on something (actually, of course, it does case analyses on lots of things); as it turns out, all that matters for our example is that it does a case analysis at all, so we have modelled stdout simply by ().

This case analysis happens while asynchronous exceptions are masked, because a Handle contains an MVar and most of the I/O operation takes place while we have taken the value from this MVar (see withHandle' in GHC.IO.Handle.Internals).
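
The shape of that idiom is roughly the familiar take-the-MVar-under-mask pattern; here is a hedged sketch (withLocked is an illustrative stand-in, not the real withHandle'):

    import Control.Concurrent.MVar (MVar, takeMVar, putMVar)
    import Control.Exception (mask_, onException)

    -- run an action on the state stored in an MVar, with asynchronous
    -- exceptions masked so the MVar is never left empty
    withLocked :: MVar s -> (s -> IO (s, a)) -> IO a
    withLocked var act = mask_ $ do
      s <- takeMVar var
      (s', a) <- act s `onException` putMVar var s
      putMVar var s'
      return a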

At this point we can simplify no further. Although we are not actually printing any dots anymore, if we run this code it will keep running until we interrupt it manually; but if we remove either the case analysis from hPutChar or the call to maskAsyncExceptions# from go, the program exits with a stack overflow almost immediately. What gives?

Understanding hPutChar

When ghc compiles your program, after lexical analysis and syntax analysis it first removes syntactic sugar and does scope analysis; then, after type checking, it translates your code to core, as we saw. The core gets optimized and then translated to stg (for “Spineless Tagless G-machine”); stg code is very similar to core, but with a handful of additional restrictions; most importantly, all constructor applications must be saturated (and if not, an explicit lambda must be constructed). Finally, the stg code gets translated to C--, which is a portable assembly language similar in intent to llvm, and the C-- code is then translated to assembly language.

We wrote our code in such a low level way that the core and stg translations are not very interesting; they all look very much alike. The C-- code, however, is a big step down in level. Recall how we simplified hPutChar.
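
In the sketch above it was just a single case analysis:

    hPutChar :: State# RealWorld -> (# State# RealWorld, () #)
    hPutChar w = case stdout of () -> (# w, () #)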

The C-- code for this “hPutChar” is (-ddump-opt-cmm):

This might look rather frightening, especially when you are used to Haskell, so let’s take it step by step. We will be pushing a value to the stack, so we first check if there is enough room on the stack to do so:

If not, we will call a function from the GHC runtime called __stg_gc_fun, which will extend the stack, if possible. We will come back to this in detail later.

In order to do case analysis on stdout we first have to make sure that it is in weak head normal form (in this case it always will be, but it might not be in general). To do this, we need to call its definition; but before we can do that, we need to do two more things.

First, in general stdout might be a thunk, with free variables in the closure, and hence it needs to know where it can find the values of those free variables. The convention is therefore that when we call a closure, the address of the closure can always be found in register R1:

Secondly, when stdout completes, it needs to know what to do next. It does this by looking at the stack: when it finishes, it loads a “continuation address” from the top of the stack and calls it. Hence we need to push this address to the stack so that stdout will be able to find it later:

At this point we could call stdout, but actually we can do slightly better. ghc implements an optimization called pointer tagging. Since addresses are always word aligned, the lower 3 bits (or, on 32-bit machines, the lower 2 bits) of addresses are always 0. For datatype constructors ghc uses these bits to encode, as part of the pointer itself, which constructor of the datatype it is (a non-zero value), or that the closure is not yet known to be in weak head normal form (zero). So we can check, by looking at the pointer alone, whether there is any point in calling stdout at all:

If the closure is already in weak head normal form, we call the continuation directly (note that s19G_info is the same address that we pushed to the stack as the continuation address for stdout). Finally, if it turns out that we do still need to evaluate stdout, we call it:

In the continuation we now know that stdout is in weak head normal form; technically speaking, we should now pattern match on it to find out which constructor it is, but of course () only has a single constructor, so we can skip that step. This means that all that is left to do is to call our continuation with the result: an unboxed tuple containing the real world token and ().

In general, unboxed tuples are represented as a pair of pointers, either as two consecutive memory locations or, ideally, in two registers. However, real world tokens disappear from generated code, so all we have to return is (). By convention the first few arguments are passed in registers; in this case, that means that we need to load the address of () into register R1 before calling the continuation:

The + 1 part is pointer tagging at work: () is already in weak head normal form, and it is the first (and only) constructor of the () type.
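
As an aside, the constructor index that the tag encodes can also be observed from Haskell with the dataToTag# primop; note that dataToTag# is 0-based, whereas the pointer tag is that index plus one (this small example is only an illustration, unrelated to the code we are stepping through):

    {-# LANGUAGE MagicHash #-}
    import GHC.Exts (Int (I#), dataToTag#)

    main :: IO ()
    main = do
      print (I# (dataToTag# False))   -- 0: first constructor of Bool
      print (I# (dataToTag# True))    -- 1: second constructor of Bool
      print (I# (dataToTag# ()))      -- 0: the only constructor of ()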

Understanding go

At this point you might realize why I said at the start that it would be completely unfeasible to step through the real hPutChar; our simplified version just does a case analysis (on something of unit value, no less) and then returns unit, and it is already complicated enough! What about go?

Thankfully, the code was carefully written to make the translation to lower level code as simple as possible; the C-- translation of go does not introduce any concepts beyond those we already used in hPutChar:

The structure is very similar to before, except that we now have two (actually, three) case statements.

First we do a stack overflow check again, as before.

We need to evaluate the scrutinee of the first case statement: the call to the primop maskAsyncExceptions# (the equivalent of mask_ in Haskell-land), which expects its argument in R1. In this case, the argument is hPutChar, except that as before we use pointer tagging (in this case, to indicate that the function has been evaluated and that its arity is 1).

The continuation after we are finished with the scrutinee is s19O_info; this is a second case statement, so we push a second continuation (s19N_info) onto the stack and recursively call go.

The second continuation—which is never reached—makes sure that go indeed returned unit, and then returns unit; if you look at the stg translation of go (-ddump-stg) you will notice that go is actually a triply nested case statement:
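
Reusing the definitions from the sketch above, the shape is roughly this; the unboxed-tuple pattern in go desugars into two separate case expressions in stg, giving three in total:

    go w0 =
      case maskAsyncExceptions# hPutChar w0 of   -- scrutinise the masked call
        (# w1, _ #) ->
          case go w1 of                          -- scrutinise the recursive call
            (# w2, r #) ->
              case r of                          -- scrutinise the unit result
                () -> (# w2, () #)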

Three case statements, hence three continuations (s19O_info, s19N_info, s19M_info).

Running the code

All this is pure theory, so far. Let’s confirm it by actually running our code. We can load the code into a debugger; make sure to compile the code with -debug to link it against the version of the Haskell RTS that has debugging symbols. Since I am working on OS X Mavericks I will be using lldb; gdb will work in a very similar way.

One of the abstractions that C-- offers over real assembly language is virtual registers. When we translate C-- to assembly these are mapped to machine registers by a register allocator in ghc’s native code generator. On my machine, Sp and SpLim are mapped to rbp and r15, and R1 is mapped to rbx.

Let’s load up our code, set a breakpoint in go, ask it to print Sp and SpLim whenever we hit a breakpoint, and then start the program (which, during my debugging, I called “weird”, which is why you will see many references to “weird” below).

lldb started the program, stopped at the breakpoint, and told us the value of Sp and SpLim, as asked. It also showed us a disassembly of the code. The C-- code started with

which, if you recall, was to check if we might run into a stack overflow. This translates into

in Intel assembly language. The details don’t matter very much, and are beyond the scope of this blog post. If you haven’t seen Intel assembly language before: leaq is “load effective address”, cmpq is “compare”, and jb is “jump if below”. We will not explain such details any further; hopefully you will be able to squint a bit and see that it resembles the C-- rather closely.

Before we do anything else, let’s see what’s on the stack before we start:

So, the stack frame immediately above us when we start is a catch frame. Catch frames are pushed onto the stack when you use catch, and are used by the runtime to find which exception handler to run when an exception occurs. The catch frame has two additional fields in its payload: the handler to run (in this case, the default top-level handler), and a mask which indicates if asynchronous exceptions have been blocked or not (in this case, they haven’t).
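
For instance, any use of catch pushes such a frame for the duration of the protected action (a small illustration, unrelated to the code we are stepping through):

    {-# LANGUAGE ScopedTypeVariables #-}
    import Control.Exception (SomeException, catch, evaluate)

    main :: IO ()
    main = do
      -- while the first argument to catch runs, a catch frame pointing at
      -- this handler sits on the Haskell stack
      r <- evaluate (1 `div` 0 :: Int) `catch` \(_ :: SomeException) -> return 0
      print r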

We can use the step command from lldb to start stepping through the execution of the code. We check for stack overflow, find that there is none, and then set things up for the first case statement in go: we push a continuation address to the stack, put the argument to maskAsyncExceptions# in rbx (the machine equivalent of R1), and then call maskAsyncExceptions. At this point the stack therefore looks like

and we can confirm that rbx contains the tagged address of hPutChar

You can find the implementation of maskAsyncExceptions# as stg_maskAsyncExceptionszh in Exception.cmm in the rts/ directory of the ghc source (“zh” is the z-encoding of “#”). The details are not so important, however. It masks async exceptions by setting a flag in a register, pushes a frame onto the stack to unmask exceptions when we are done, and then calls the function (whose address is in R1). This means that we will end up in hPutChar, at which point the stack looks like
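
At the Haskell level you can observe this save-and-restore behaviour with getMaskingState (again just an illustration):

    import Control.Exception (getMaskingState, mask_)

    main :: IO ()
    main = do
      print =<< getMaskingState           -- Unmasked
      mask_ (print =<< getMaskingState)   -- MaskedInterruptible
      print =<< getMaskingState           -- Unmasked again: the frame pushed by
                                          -- maskAsyncExceptions# restored the state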

This continues on. hPutChar executes, completes, and runs the continuation, which happens to be stg_unmaskAsyncExceptionszh_ret_info; this unmasks asynchronous exceptions, and continues with the continuation above it, which in this case is the original continuation from go. And after some more steps we end up back in go for the next recursive call, and the whole process repeats. However, the stack now looks like

Remember that go is not tail recursive, so once this recursive call completes, we still need to call the continuation from the previous invocation. And this repeats; if we run until the breakpoint again (using cont) and print the stack, we will see

and so on. This is precisely the behaviour that we were expecting: since the function is not tail recursive, we are using up more and more stack space and should eventually run out, and crash with a stack overflow. So why don’t we?

Stack overflow

After precisely 84 recursive calls, the stack looks like

with 84 s19N_info pointers in total. Moreover, the top of the stack (virtual register Sp, real register rbp) is now dangerously close to the stack limit (virtual register SpLim, real register r15):

That is, we have 16 bytes left, or space for two addresses. This means that we have just enough stack space to make it to the entry point for hPutChar; we get there in the same way as before, and the stack now looks like

Note that we pushed two additional addresses onto the stack, which is therefore now full. Remember that hPutChar starts with a check for stack overflow:

Actually, if we look at the unoptimized C-- instead (-ddump-cmm) we will see

instead, which is a lot clearer: we load the address of hPutChar into register R1 and then run the garbage collector. In the optimized C-- code we find the address of the garbage collector somewhere else, but we can verify in lldb that it’s the same thing:

Either way, the garbage collector runs. It notices that we are out of stack space (as opposed to out of heap space; we carefully avoided any heap allocation in this test code), creates a new, bigger stack, copies the old stack over, and then calls back into our function. In other words, we end up back at the start of hPutChar, but now the stack looks like:

which is the same stack as before, except at a different location; and, crucially, we have space on the stack again:

so we can continue recursing.

Running out of stack space

Of course, we cannot continue increasing the stack forever; eventually we should reach the maximum stack size (8 MB by default in ghc 7.6). So why don’t we? Well, now that we understand the problem at this level of detail a quick Google search will reveal that this is due to a “bug” in ghc (depending on your definition of bug, I suppose) prior to version 7.8: when a stack overflow happens while asynchronous exceptions are blocked, the stack will be grown no matter what the limitation on the stack size is.
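
If that is indeed what is going on, the essence can be illustrated with nothing more than mask_ around a non-tail-recursive loop: on an affected ghc (prior to 7.8) every overflow in the sketch below is detected while asynchronous exceptions are masked, so the stack keeps growing past the limit (a hedged reconstruction of the behaviour described here, not a program from the original report):

    import Control.Exception (mask_)

    -- not tail recursive: the addition happens after the recursive call returns
    deep :: Int -> IO Int
    deep n = do
      n' <- deep (n + 1)
      return (n' + 1)

    main :: IO ()
    main = mask_ (deep 0) >>= print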

In this example, it just so happens that we run out of stack space when we try to push the continuation address in hPutChar, at which point asynchronous exceptions are indeed masked. Hence, this program will continue growing the stack unpunished until we run out of machine memory completely. If we removed the case statement from hPutChar (or indeed, if we modified the program in any of a myriad other, minor ways) we would detect the stack overflow outside of the maskAsyncExceptions# call, and hence would crash with a stack overflow exception when we reached the maximum stack size.

In the original program that we started with, the value of THRESHOLD determines how many continuation addresses for go are on the stack when we call (the real) putChar. Depending on that value we run out of stack space either while asynchronous exceptions are masked (somewhere deep inside the bowels of the real hPutChar) or while they are not, and hence the program either runs forever or crashes almost immediately, respectively. This is clearly not what one would expect, and in GHC 7.8 the problem has been resolved: a stack overflow is now treated like any other asynchronous exception, and the program will crash with a stack overflow as soon as asynchronous exceptions are no longer masked.

Further Reading

If you want to understand ghc’s runtime execution model, How to make a fast curry: push/enter vs eval/apply is essential reading, although you can ignore the comparison with push/enter, which is not used in ghc. Faster Laziness Using Dynamic Pointer Tagging explains the pointer tagging trick we discussed above. Apart from these two papers, there are various other references that might be helpful. Be aware, however, that ghc is constantly evolving and many of these references may no longer precisely match what is in ghc right now.

System F with Type Equality Coercions describes System FC, the idealized version of core; Chapter 25 (Profiling and Optimization) of Real World Haskell discusses reading core and using it for optimization purposes.

I know kung fu: learning STG by example on the ghc wiki describes the compilation from Haskell down to assembly language in quite some detail.

Edward Z. Yang has a number of blog posts that discuss the execution of Haskell code at a low level of detail; they are all worth reading: Unraveling the mystery of the IO monad, Tracing the compilation of Hello Factorial!, and Interrupting GHC.

The blog post Thunks and lazy blackholes: an introduction to GHC at runtime by Keegan McAllister says more about the heap, which we did not cover at all here.

For C-- it is unfortunately hard to find definitive references; ghc’s version of C-- (cmm) differs significantly from the official standard. The description on the ghc wiki is probably a good starting point; the comment at the top of CmmParse.y in the ghc sources is also very helpful. The paper C--: a portable assembly language that supports garbage collection describes some of the underlying motivations but is somewhat outdated. Hoopl: A Modular, Reusable Library for Dataflow Analysis and Transformation describes a library that is used to optimize C--.

Runtime Support for Multicore Haskell describes how Haskell runs on multiple cores; Section 4 contains some information about the GHC runtime, and in particular, how it relates to foreign calls (calls into the C world). Extending the Haskell Foreign Function Interface with Concurrency is an older paper that describes this in some detail, too.

Debugging GHC-compiled code with gdb on the ghc wiki contains some useful information about using gdb on compiled Haskell code; GDB and LLDB Command Examples is a nice reference for lldb.

Finally, David Terei has a presentation about ghc where he gives a bird’s eye view of the entire thing, and his bachelor’s thesis Low Level Virtual Machine for Glasgow Haskell Compiler talks about the llvm backend for ghc.

Postscript

One of the many variations that I played with while tracking this bug down looked like

When you run this program with GHC 7.8 it crashes with

This turns out to be due to an unrelated bug in 7.8.2; see #9045 (and #8866). It has already been fixed and should be released as part of 7.8.3.
