2013-06-04

This is a companion post to my discussion with Craig Andera on Relevance Podcast Episode 32 and my Clojure/West talk Clojure in the Large. I've talked about various bits and pieces of this workflow at other times, too, but I'll try to bring it all together here in the hopes that others will find it useful.

One of the great pleasures of working with a dynamic language is being able to build a system while simultaneously interacting with it. To make this possible, first you need the ability to redefine parts of the program while it is running: Clojure provides this capability admirably. However, some aspects of Clojure's runtime are not quite as late-binding as one might wish for interactive development. For example, the effect of a changed macro definition will not be seen until code which uses the macro has been recompiled. Changes to methods of a defrecord or deftype will not have any effect on existing instances of that type.

The facilities that Clojure provides for loading code from files are not sufficient to deal with these issues. I wrote the second version of tools.namespace to make a "smarter" require that recognizes dependencies between namespaces and reloads them appropriately.

But tools.namespace is only part of the story. To really get the benefit of interactive development, I want to ensure that the version of the application I am currently interacting with is congruent with the source files I'm editing. That means not only that the application must be running the most up-to-date version of the code, but also that any state in the application was produced by that same code. It is dangerously easy, when changing and reloading code at the REPL, to get an application into a state which could not have been reached by the code it is currently running.

Therefore, after every significant code change, I want to restart the application from scratch. But I don't want to restart the JVM and reload all my Clojure code in order to do it: that takes too long and is too disruptive to my workflow. Instead, I want to design my application in such a way that I can quickly shut it down, discard any transient state it might have built up, start it again, and return to a similar state. And when I say quickly, I mean that the whole process should take less than a second.

To achieve this goal, I make the application itself into a transient object. Instead of the application being a singleton tied to a JVM process, I write code to construct instances of my application, possibly many of them within one JVM. Each time I make a change, I discard the old instance and construct a new one. The technique is similar to dealing with virtual machines in a cloud environment: rather than try to transition a VM from an old state to a new state, we simply discard the old one and spin up a new one.

Designing applications this way requires discipline. First and foremost, all state must be local. Any global state, anywhere, breaks the whole model. Second, all resources acquired by the application instance must be carefully managed so that they can be released when the instance is destroyed.

Enough talk. Here's how it works.

The System Constructor

In some "main" namespace, I provide a constructor function for the application. I usually call it system because it represents the whole system I am working on.

The system constructor can optionally take parameters which specify its configuration.

Creating a system is not the same as starting it and should not have side effects. Usually the system constructor will create instances of other components it depends on and return a data structure such as a map or defrecord which contains them. My system instance might look something like this:
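As a rough sketch of the idea, a constructor along these lines only builds data and touches nothing outside itself. The namespace name, the keys in the map, and the default port are placeholders chosen for illustration, not taken from a real project.

```clojure
(ns com.example.my-project.system)

(defn system
  "Returns a new, unstarted instance of the whole application.
  Only builds data; performs no side effects."
  ([] (system {}))
  ([config]
   {:config   (merge {:port 8080} config) ; hypothetical default setting
    :cache    (atom {})                   ; state local to this instance
    :server   nil                         ; to be filled in when the system starts
    :running? false}))
```

Because each call returns a fresh map with its own atom, two instances built this way share no state.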

Sometimes I have different versions of the constructor that produce different systems for interactive development, testing, and production.

Notice that some things which are "global" from the point of view of the application, such as my web server and scheduled thread pool, become "local" instances in this data structure. Any function which needs one of these components has to take it as a parameter. This isn't as burdensome as it might seem: each function gets, at most, one extra argument providing the "context" in which it operates. That context could be the entire system object, but more often will be some subset. With judicious use of lexical closures, the extra arguments disappear from most code. In addition to enabling more interactive development, this approach makes testing easier. See my post On the Perils of Dynamic Scope for more background, as well as the Clojure/West talk when it comes out.
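To illustrate the point about closures, here is a small sketch in which a request handler closes over the one component it needs, so callers never pass the context explicitly. The names make-handler and :cache, and the shape of the response map, are assumptions made up for this example.

```clojure
(defn make-handler
  "Returns a handler function that closes over the cache component
  taken from the given context map."
  [{:keys [cache]}]
  (fn [request]
    ;; swap! returns the new value of the atom, so the count can be read from it.
    (let [hits (:hits (swap! cache update-in [:hits] (fnil inc 0)))]
      {:status 200 :body (str "hits: " hits)})))

;; The context is supplied once, when the handler is constructed.
(def handler (make-handler {:cache (atom {})}))

(handler {:uri "/"})
;; => {:status 200, :body "hits: 1"}
```

In a test, the same constructor can be handed a fresh atom, which is what makes this style easy to test in isolation.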

Start and Stop

Next, I have functions to start and stop the system. Ideally, these behave like real functions, in that they return a new value representing the "started" or "stopped" system, but they also have to perform side effects along the way, such as opening a connection to a database or starting a web server.
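A minimal sketch of such a pair of functions might look like this. It assumes a system map with a [:config :port] entry and uses a plain java.net.ServerSocket as a stand-in for whatever resource the real application would acquire.

```clojure
(defn start
  "Acquires the system's resources and returns the started system."
  [system]
  (if (:socket system)
    system ; already started, nothing to do
    (let [port (int (get-in system [:config :port] 0))] ; 0 means any free port
      (assoc system :socket (java.net.ServerSocket. port)))))

(defn stop
  "Releases the system's resources and returns the stopped system."
  [system]
  (when-let [socket (:socket system)]
    (.close ^java.net.ServerSocket socket))
  (assoc system :socket nil))
```

Both functions return a value, so the caller always holds a handle on whatever was acquired and can pass it to stop later.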

These functions can call similar start/stop functions on sub-systems in turn. In the past, I've talked about a "Lifecycle" protocol containing start and stop methods. It's not necessary, but is sometimes useful to ensure that all components of the system can be started and stopped in a consistent way.
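One way such a protocol could be written is sketched below. The Cache record and the start-all and stop-all helpers are hypothetical names used only to show components being started in order and stopped in reverse order.

```clojure
(defprotocol Lifecycle
  (start [component] "Begins operation and returns the started component.")
  (stop  [component] "Ceases operation and returns the stopped component."))

;; A hypothetical component whose only resource is an atom.
(defrecord Cache [state]
  Lifecycle
  (start [this] (assoc this :state (atom {})))
  (stop  [this] (assoc this :state nil)))

(defn start-all
  "Starts the components under the given keys, in order."
  [system ks]
  (reduce (fn [sys k] (update-in sys [k] start)) system ks))

(defn stop-all
  "Stops the components under the given keys, in reverse order."
  [system ks]
  (reduce (fn [sys k] (update-in sys [k] stop)) system (reverse ks)))
```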

There's usually a bit of trial-and-error while I get the start/stop functions working correctly. If something in start/stop throws an exception, I could easily end up in a state where a sub-system has acquired a resource — such as a socket connection — but I do not have any handle on that sub-system with which to shut it down and release the resource. In that situation, there's nothing for it but to restart the JVM.

Dev Profile and user.clj

You probably know that the Clojure REPL starts by default in the user namespace. In addition, if there is a file named user.clj at the root of the Java classpath, Clojure will load that file automatically when it starts.

You probably don't want user.clj to be loaded in a deployed production app or library release, but by using Leiningen 2 profiles we can ensure that it is only loaded during development.

In my Leiningen project.clj file, I create a :dev profile with an extra :source-paths directory, plus whatever dependencies I want to use during development. tools.namespace has to be there, and I frequently add testing/development tools such as java.classpath or Criterium.
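For illustration, a project.clj with such a profile could look roughly like the following. The project name and the Clojure version are placeholders, and "dev" is simply the directory name assumed for the extra source path.

```clojure
(defproject com.example/my-project "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]]
  :profiles {:dev {;; extra directory that holds user.clj
                   :source-paths ["dev"]
                   ;; development-only dependencies go here
                   :dependencies [[org.clojure/tools.namespace "0.2.3"]]}})
```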

Leiningen activates the :dev profile by default for development tasks such as repl and test, but leaves it out when packaging the project with jar or uberjar. This means that any source files I put in the dev directory will be excluded from the packaged application.

I create a user.clj file in the dev directory which defines a normal namespace called user and refers a bunch of symbols I commonly use during development, as well as the symbols to construct, start, and stop the system.
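A namespace declaration in that spirit might look like the sketch below. The particular set of required libraries is only an example, and com.example.my-project.system is a hypothetical name for the application's "main" namespace; tools.namespace must be on the classpath for this to load.

```clojure
(ns user
  (:require [clojure.java.io :as io]
            [clojure.string :as str]
            [clojure.pprint :refer [pprint]]
            [clojure.repl :refer :all]
            [clojure.test :as test]
            [clojure.tools.namespace.repl :refer [refresh refresh-all]]
            ;; hypothetical application namespace providing the
            ;; system constructor and its start/stop functions
            [com.example.my-project.system :as system]))
```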

Also in user.clj, I have a few things that I will only use during development, starting with a global Var to hold the system itself:
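Such a Var needs nothing more than a definition with an empty initial value, for example:

```clojure
(def system
  "A Var holding the current application instance.
  Used only for interactive development."
  nil)
```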

Now wait a minute, you might say, isn't that the global state you told us to avoid? It would be, if it were part of the application. Instead it's a container in which I can put the current instance of the application. I'm only going to use it for interactive development.

The system Var is manipulated by the following functions:
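The sketch below shows one possible shape for these functions. So that it runs on its own, it defines three stand-ins (new-system, start-system, stop-system) in place of the real application's constructor and start/stop functions; those names are assumptions, and in a real user.clj they would come from the application's namespace.

```clojure
;; Hypothetical stand-ins for the application's own functions.
(defn new-system [] {:running? false})
(defn start-system [sys] (assoc sys :running? true))
(defn stop-system [sys] (assoc sys :running? false))

;; The development-only Var that holds the current instance.
(def system nil)

(defn init
  "Constructs the current development system."
  []
  (alter-var-root #'system (constantly (new-system))))

(defn start
  "Starts the current development system."
  []
  (alter-var-root #'system start-system))

(defn stop
  "Shuts down the current development system, if there is one."
  []
  (alter-var-root #'system (fn [sys] (when sys (stop-system sys)))))

(defn go
  "Initializes the current development system and starts it running."
  []
  (init)
  (start))
```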

The exact division of these functions isn't important. Sometimes I omit init and start and just have go. The important thing is to have one function that creates and starts the system, and another function that tears it down.

Finally, the heart of my workflow: the reset function. This is one function which I can call at the REPL to 1) stop the current application instance; 2) reload any source files that have changed; and 3) create and start a new application instance.

The real work of reloading files is handled by the clojure.tools.namespace.repl/refresh function. It takes my go function as an argument, but go has to be passed as a namespace-qualified symbol so that it can be resolved after the user namespace has been reloaded. (This is a trick that refresh knows how to do.)
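A sketch of such a reset function follows. It assumes tools.namespace is on the classpath and that zero-argument stop and go functions already exist in the user namespace; the declare form is there only so the snippet compiles by itself.

```clojure
(require '[clojure.tools.namespace.repl :refer [refresh]])

;; Assumed to be defined elsewhere in user.clj.
(declare stop go)

(defn reset
  "Stops the current system, reloads changed source files,
  then creates and starts a new system."
  []
  (stop)
  ;; 'user/go is passed as a symbol so that it is resolved
  ;; only after the namespaces have been reloaded.
  (refresh :after 'user/go))
```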

Workflow

I do all my Clojure development in Emacs using nREPL.el, but nothing about this workflow is Emacs-specific. It should work with any environment that provides a REPL, as long as it doesn't try to do any code-reloading of its own. (For example, the reload-on-every-request functionality of ring-devel is incompatible with tools.namespace.) The fact that I use Emacs as my REPL is one reason I use user.clj instead of :repl-options in Leiningen's project.clj: those options have no effect on remote nREPL sessions.

The first thing I do when I start work is launch an nREPL session and call reset. Now my application is running and I can start working on it. Every time I make a change, I save the file and call reset at the REPL. (I have an Elisp helper function that I can bind to a keystroke.) Presto! My application is running again in a clean state with all the new code.

Rather than switching the REPL among several namespaces, I generally stay in user, where I have all my development tools like clojure.pprint and clojure.repl. I use the REPL itself for examining the application's state and testing individual functions. I frequently define little helper functions to examine the state of the application as I work on it; all of that state is reachable by navigating the system object.

Anything I want to hang on to, such as a snippet of test data, I define in the user.clj file, because tools.namespace will destroy any Vars I created with def at the REPL.

Snags

This process isn't perfect by any means. One of the more irritating aspects is that any syntactic errors in a source file prevent all the code from being loaded, including user.clj. If a file fails to compile during the tools.namespace reloading process, any namespaces which depend on it no longer exist. So the reset function isn't available to call, nor are any of my aliases or referred symbols in the user namespace.

As a work-around, in tools.namespace 0.2.3 I added a feature to recover aliases and referred symbols in the current REPL namespace after a failed reload. This isn't perfect: the reset function still doesn't exist. But at least I can call the refresh function from tools.namespace without typing out its fully-qualified name clojure.tools.namespace.repl/refresh. Once I have successfully reloaded all the source files with refresh, I can call reset again to start the app.

A slightly worse problem occurs when starting a new REPL process: if there are any compilation errors in something loaded by user.clj, then the REPL will not start at all. I try to avoid this by starting the REPL from a known working commit, then only changing code after it's running. I also try not to commit any code which does not compile, but sometimes it happens. :)

Occasionally I do get my application into a state that I cannot recover from. Usually this happens when something in a start or stop function throws an exception. At that point, some component of the application may be in a broken state but I don't have a reference to it that I can use to shut it down. If that component acquired external resources which I need to release before restarting it, e.g. socket connections, then there's basically nothing I can do but restart the JVM. Fortunately, these situations usually only occur while I'm writing the start/stop functions themselves, so after a few development cycles to get them working I don't have to worry about it.

Entry Points

The central thrust of this approach is to design your application so that you can construct multiple instances of it within a single JVM process. That's ideal for development, but what about production?

If you control the entry point to your application process, it's easy. Just write a -main function that creates a single instance of your application and starts it.
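As a sketch, such an entry point can be very small. The system and start functions here are hypothetical stand-ins for the application's real constructor and start function, and reading the port from the first command-line argument is just one possible way to pass configuration.

```clojure
;; Hypothetical stand-ins for the application's constructor and start function.
(defn system [config] {:config config :running? false})
(defn start [sys] (assoc sys :running? true))

(defn -main
  "Process entry point: constructs exactly one system and starts it."
  [& args]
  (let [port (Integer/parseInt (or (first args) "8080"))]
    (start (system {:port port}))))
```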

But often we deploy apps to environments where we do not control the -main function. For example, Ring web apps deployed to a Servlet container have no -main. Furthermore, they expect a static reference to a Var which contains the root web handler function. If that handler is meant to be a closure over some contextual state, there's no place to construct it.

There are a couple of ways to work around this. One is to have a separate namespace in a "production" profile that constructs a single instance of the application and assigns it to a global Var. Alternatively, if the framework provides an "initialization" hook (as lein-ring does), you can use that to create the application instance and store it in a global Var. The root web handler function, created exclusively for production deployment, can pass the system object to functions that need it.
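The following sketch shows the second option under some assumptions: new-system and app-handler are made-up stand-ins for the application's constructor and its context-taking handler, and init is the function that an initialization hook would be configured to call once at startup.

```clojure
;; Hypothetical stand-ins for the application's own functions.
(defn new-system [] {:greeting "hello"})
(defn app-handler [system request]
  {:status 200 :body (:greeting system)})

;; Production-only global Var holding the single application instance.
(def system nil)

(defn init
  "Called once by the container's initialization hook."
  []
  (alter-var-root #'system (constantly (new-system))))

(defn handler
  "Static root handler that passes the system to the code that needs it."
  [request]
  (app-handler system request))
```

The rest of the application still receives the system as an argument, so only this one production namespace knows about the global Var.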

Epilogue

I'm continually tweaking this process, looking for improvements, but overall I'm pretty happy with it. It has enabled me to work rapidly on some fairly large applications. Best of all, it's agnostic with regard to development tools. You can adapt this workflow to any build tool that can substitute different CLASSPATHs for different circumstances.

Some of my Relevance coworkers like this approach; others find it too constraining. The Pedestal team uses pieces of this technique, such as the :dev profile, but without tools.namespace. They were annoyed that compiler errors prevented them from starting a new REPL, so they came up with a variation that uses a function in user.clj to load another file called dev.clj.

I wanted to write about this to clarify and expand upon things I've presented elsewhere. Check out the podcast to hear me talk more about this approach specifically, and the Clojure/West video (when it is released) for more background.

UPDATE: Fixed name of clojure.string in example code.

UPDATE: More comments on Hacker News
