2017-01-06

Many common programming languages today eschew manual memory
management in favor of garbage collection. While manual management
certainly has its place in some use cases, I believe the majority
of application development today occurs in garbage collected
languages, and for good reason. Garbage collection moves
responsibility for many memory issues from the developer into the
language's runtime, mostly removing the possibility of entire classes
of bugs while reducing cognitive overhead. In other words, it
separates out a concern. That's not to say that garbage collection is
perfect or appropriate in all cases, but in many common cases it
greatly simplifies code.

Languages like Haskell, Erlang, and Go provide a similar
separation of concerns in the form of green threads. Instead of
requiring the developer to manually deal with asynchronous (or
non-blocking) I/O calls, the runtime system takes responsibility for
this. Like garbage collection, it's not appropriate for all use cases,
but for many common cases it's a great fit and reduces code
complexity. This post will discuss what green threads are, the
problems they solve, some cases where they aren't the best fit, and
how to get started with green thread based concurrency easily.

If you want to jump to that last point, you can
download the Stack build tool
and start with
the async library tutorial.

Blocking and non-blocking I/O

Suppose you're writing a web server. A naive approach would be to
write something like the following pseudo-code (the function names
here are illustrative):
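
    function webServer():
        listenSocket = bindSocket(8080)
        while True:
            socket = accept(listenSocket)        # wait for a connection
            forkThread(handleConnection(socket)) # handle it on another thread

    function handleConnection(socket):
        while True:
            bytes = read(socket)          # blocking read
            request = parseRequest(bytes)
            response = renderResponse(request)
            write(socket, response)       # blocking write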

The read and write calls appear to perform blocking I/O, which
means that the entire system thread running them will be blocked on
the kernel until data is available. Depending on what our forkThread
call did, this could mean one of two things:

If forkThread forks a new system thread, then performing blocking
I/O isn't a problem: that thread has nothing to do until read and
write complete. However, forking a new system thread for each
connection is expensive, and does not scale well to hundreds of
thousands of concurrent requests.

If forkThread forks a new thread within your runtime (sometimes
called a fiber), then multiple fibers will all be running on a
single system thread, and each time you make a blocking I/O call,
the entire thread will be blocked, preventing any progress on other
connections. You've essentially reduced your application to handling
one connection at a time.

Neither of these approaches is very attractive for writing robust
concurrent applications (though the former is certainly better than
the latter). Another approach is to use non-blocking I/O. In this
case, instead of making a call to read or write which blocks until
data is available, you make a call and provide a callback function
or continuation that says what to do with the result data. Here's a
sketch of what our web server above might look like in this style:
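
    function webServer():
        listenSocket = bindSocket(8080)
        acceptLoop(listenSocket)

    function acceptLoop(listenSocket):
        # no blocking: pass a continuation to run when a connection arrives
        acceptAsync(listenSocket, function(socket):
            handleConnection(socket)
            acceptLoop(listenSocket))       # recurse instead of looping

    function handleConnection(socket):
        readAsync(socket, function(bytes):
            request = parseRequest(bytes)
            response = renderResponse(request)
            writeAsync(socket, response, function():
                handleConnection(socket)))  # recurse for the next request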

Let's note some differences:

We're no longer performing any forking. All actions are occurring in
one single thread, removing the possibility of overhead from
spawning a system thread (or even a fiber).

Instead of capturing the output of read in a variable bytes, we
provide readAsync a callback function, and that callback function
is provided the bytes value when it's available. We sometimes call
these callbacks continuations, since they tell us how to continue
processing from where we left off.

The loops are gone. Instead, our callbacks recursively call
functions to create the necessary infinite looping, while allowing
for proper asynchronous behavior.

This approach solves the problems listed above with blocking I/O: no
performance overhead of spawning threads or fibers, and multiple
requests can be processed concurrently without being blocked by each
other's I/O calls.

The downsides of non-blocking I/O

Unfortunately, this new style does not get away scot-free.

Subjectively, the callback style of coding is not as elegant as the
blocking style. There are workarounds for this with techniques like
promises.

Error/exception handling isn't as straightforward in the callback
setup as with the blocking code. It's certainly a solvable problem,
but often involves techniques like passing in an extra callback
function for the error case (see the sketch after this list). Many
languages out there today use runtime exceptions, and those don't
translate well to callbacks.

Our code is limited to running on one CPU core, at least in the
simplest case. You can work around this with techniques like
prefork, but it's not automatically handled by the callback
approach. If our goal is to maximize requests per second, using
every available CPU core for processing is definitely desired.

It's still possible for the handling of multiple requests to block
each other. For example, if our parseRequest or renderResponse
functions perform any blocking I/O, or use a significant amount of
CPU time, other requests will need to wait until that processing
finishes before they can resume.
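
To illustrate the error-handling point above: an error-aware
readAsync typically ends up taking two callbacks, one for success
and one for failure (handleBytes and logError are illustrative
names):

    readAsync(socket,
        function(bytes):          # success continuation
            handleBytes(bytes),
        function(err):            # error continuation: nothing propagates
            logError(err)         # automatically; we must handle it here
            closeSocket(socket))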

For those interested, our
previous blog post on Concurrency and Node
delves more deeply into these points.

Making non-blocking a runtime system concern

Let's deliver on our promise from the introduction: turning
non-blocking I/O into a runtime system concern. The theory behind this
is:

Blocking I/O calls are conceptually easier to think about and work
with

While spawning lightweight threads/fibers does entail some overhead,
that overhead is small enough to be generally acceptable

If we can limit the amount of CPU time spent in running each fiber,
we won't need to worry about high-CPU processing blocking other
requests

The runtime system can handle the scheduling of lightweight threads
onto separate CPU cores (via separate system threads), which will
result in the ability to saturate the entire CPU without techniques
like prefork

That may sound like a lot to deliver on, but green threads are up to
the challenge. They are very similar to the fibers that we described
above, with one major difference: seemingly blocking I/O calls
actually use non-blocking I/O under the surface. Let's see how this
would work with our web server example from above (copied here for
convenience):
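
    function webServer():
        listenSocket = bindSocket(8080)
        while True:
            socket = accept(listenSocket)        # wait for a connection
            forkThread(handleConnection(socket)) # handle it on another thread

    function handleConnection(socket):
        while True:
            bytes = read(socket)          # blocking read
            request = parseRequest(bytes)
            response = renderResponse(request)
            write(socket, response)       # blocking write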

Starting from after the bindSocket call:

Our main green thread calls accept. The runtime system
essentially rewrites this to an acceptAsync call, like in our
callback version, puts the main green thread to sleep, and has its
event loop schedule a wakeup when a new connection is available on
the listenSocket.

When a new connection is available, the runtime system wakes up the
main green thread with the new socket value filled in. Our thread
then forks a new green thread (let's call it worker 1) running the
handleConnection call.

Inside worker 1, our read call is similarly rewritten to
readAsync with a callback, and the runtime system puts worker 1
to sleep and schedules a wakeup when data is available on socket.

The runtime system then goes back to the list of green threads that
have work to do and finds the main thread. The main thread
continues from the forkThread call, and iterates on the while
loop. It arrives back at the accept call and, like before, the
runtime system puts the main thread to sleep and schedules a wakeup
when there's a connection available on listenSocket.

Importantly, at this point, the entire application is able to
simply wait for some data. We're waiting for the operating system
to tell us that either listenSocket or socket has some data
available. When the operating system tells us (via a system call
like epoll or select) that data is available, the runtime
system can wake the relevant thread up, and do some more work until
the next I/O call that puts it to sleep.

Since the runtime system is handling the scheduling of threads, it
is free to determine that a thread has been active for too long and
pause execution in favor of a different thread, solving the long
CPU processing problem mentioned above. (This is also the
difference between cooperative and preemptive multithreading.)

Instead of needing a separate error handling mechanism for
callbacks, the runtime system can reuse existing exception throwing
mechanisms from the language. For example, if an error occurs
during the read call, the runtime system can throw an exception
from that point in the code.
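
To make the rewriting step concrete, here's a sketch, in our earlier
pseudo-code, of what the runtime system effectively turns a blocking
read call into; currentThread, wakeUp, wakeUpWithException, and
suspendCurrentThread are illustrative names for scheduler operations:

    function read(socket):
        me = currentThread()
        readAsync(socket,
            function(bytes): wakeUp(me, bytes),            # resume with the result
            function(err):   wakeUpWithException(me, err)) # rethrow at the call site
        return suspendCurrentThread()  # yield to the scheduler until woken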

I'd argue that this green thread approach - for the most part - gives
us the best of both worlds: the high-level, robust code that is easy
to read and write, as comes from using threads, with the high
performance of the non-blocking callback code.

Downsides of green threads

Like garbage collection, green threads have downsides as
well. While not a comprehensive list, here are some such downsides I'm
aware of:

By passing control of scheduling to the runtime system, we lose
control of when exactly a context switch will occur, which can
result in performance degradation. (Note that this is a downside
only relative to the callback-based approach; the other thread-based
approaches have the same context switch issue.) In my experience,
this is rarely a problem, though certain performance-sensitive code
bases may be affected by it. In a distributed computation project at
FP Complete, for example, we ultimately
went the route of creating our own event loop.

As mentioned, spawning a green thread is cheap, but not free. An
event loop can bypass this overhead. Again, with most green thread
systems, the overhead is small enough so as not to be prohibitive,
but if the highest performance possible is your goal, green threads
may ultimately be your bottleneck.

As you can see, like garbage collection, the main downside is that for
specific performance cases, green threads may be an impediment. But
also like garbage collection, there is a wide array of cases where the
gains in code simplicity and lower bug rates more than pay for the
slight performance overhead.

Experiment with green threads

Let's go ahead and get started with a short example right now. The
only tools you'll need are
the Stack build tool and a
text editor. If you're on a Mac or Linux system, you can get Stack by
running curl -sSL https://get.haskellstack.org/ | sh. On Windows,
you probably want the
64-bit installer.

Once you have Stack installed, save the following code into a file
called echo.hs. (This listing is a minimal sketch built on the
network package: an echo server that replies to each line with the
line itself plus a naively computed Fibonacci number based on the
line's length.)
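
    #!/usr/bin/env stack
    -- stack --resolver lts-7.14 --install-ghc runghc --package network
    import Control.Concurrent (forkIO)
    import Control.Monad      (forever, void)
    import Network            (PortID (PortNumber), accept, listenOn, withSocketsDo)
    import System.IO

    -- Deliberately naive Fibonacci: long input lines translate into a
    -- lot of CPU work, letting us observe the runtime's scheduling.
    fib :: Int -> Integer
    fib n
        | n < 2     = 1
        | otherwise = fib (n - 1) + fib (n - 2)

    main :: IO ()
    main = withSocketsDo $ do
        lsock <- listenOn $ PortNumber 4500
        putStrLn "Listening on port 4500"
        forever $ do
            (h, _host, _port) <- accept lsock
            hSetBuffering h LineBuffering
            -- a cheap green thread per connection
            void $ forkIO $ echoLoop h

    echoLoop :: Handle -> IO ()
    echoLoop h = do
        eof <- hIsEOF h
        if eof
            then hClose h
            else do
                line <- filter (/= '\r') <$> hGetLine h
                hPutStrLn h $ line ++ " fib: " ++ show (fib (length line))
                echoLoop h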

Now you can run this program with stack echo.hs. The first run will
take a bit of time as it downloads a compiler and a number of
libraries. This will only happen on the first run; subsequent runs
will reuse the previously downloaded and installed tools and
libraries. Once you've got this running, you can connect to it with
telnet to play around:
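
    telnet localhost 4500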

Go ahead and send it some short lines, and confirm that it
responds. With the sketch above, a session might look like this:
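
    $ telnet localhost 4500
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    hello
    hello fib: 8
    bye
    bye fib: 3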

Now, to prove our claim that the system remains responsive in the
presence of heavy CPU usage, try entering a long string, which will
require a lot of CPU time to calculate the Fibonacci value, e.g.:
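
    this is a long line that will take a very long time to compute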

As you might expect, further interactions on this connection will have
no effect while the server computes its response. But go ahead and
open up a new telnet session in a different terminal. You should be
able to continue interacting with the echo server and get results,
thanks to the scheduling behavior of the runtime system. Notice how we
get this behavior without any explicit work on our part to break up
the expensive CPU operation into smaller bits!

Learn more

There are great libraries in Haskell to take advantage of green
threads for easy and beautiful concurrency. My all time favorite is
the async package. Paired
with powerful abstractions like
Software Transactional Memory,
you can quickly whip up high-performance, concurrent network services
while avoiding common pitfalls like deadlocks and race conditions.
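
As a taste, here's a sketch of using the async package's
concurrently function to run two actions on separate green threads
and collect both results:

    #!/usr/bin/env stack
    -- stack --resolver lts-7.14 --install-ghc runghc --package async
    import Control.Concurrent       (threadDelay)
    import Control.Concurrent.Async (concurrently)

    main :: IO ()
    main = do
        -- Each action runs in its own green thread; concurrently waits
        -- for both results, and rethrows if either thread throws.
        (a, b) <- concurrently
            (threadDelay 1000000 >> return "waited one second")
            (threadDelay 2000000 >> return "waited two seconds")
        print (a, b)

Despite sleeping for three seconds in total, the program finishes in
about two, since the two delays run concurrently.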

If you or your team are interested in learning more about functional
programming and Haskell, check out our
training options and our
Haskell syllabus. If you'd like to learn how
FP Complete can help with your server application needs,
contact us about our consulting.
