2014-04-17

In February, I had the pleasure of speaking with the London Atlassian User Group (AUG) about some of our experiences with continuous delivery and deployment at Atlassian. The slides for this are available online, but the talk generated a lot of discussion at the time and I’d like to recap some of it here.

To give a bit of context, I work in the business platform team; we’re responsible for developing the tools that allow the business to interact with the customer. In particular, we develop the order management tools such as HAMS (hosted account management system) and MAC (my.atlassian.com); if you’ve ever bought or evaluated an Atlassian product, these are the systems that have been doing the work in the background. As we grow, these systems need to undergo a lot of changes, so we need to speed up our delivery of improvements and features. To do this, we’re moving to a continuous delivery and deployment model for all of our business tools, which will allow us to get improvements out to our customers (both internal and external) faster.

Integration vs. deployment vs. delivery

Before we start, I should probably clarify what I mean by continuous integration, delivery, and deployment. The terms are often used interchangeably, but they have different meanings for our purposes:

Continuous integration

The process of automatically building and testing your software on a regular basis. How regularly this occurs varies; in the early days of agile it meant daily builds, but with the rise of tools like Bamboo it can be as often as every commit. In the business systems team we build and run full unit and integration tests against every commit to every branch using Bamboo’s branch-build feature.

Continuous delivery

A logical step forward from continuous integration. If your tests are run constantly, and you trust your tests to provide a guarantee of quality, then it becomes possible to release your software at any point in time. Note that continuous delivery does not always entail actually delivering, as your customers may not need or want constant updates. Instead, it represents a philosophy and a commitment to ensuring that your code is always in a release-ready state.

Continuous deployment

The ultimate culmination of this process: the actual delivery of features and fixes to the customer as soon as they are ready. It usually refers to cloud and SaaS products, as these are the most amenable to being silently updated in the background, but some desktop software offers this in the form of optional beta and nightly updates, such as Mozilla’s Beta and Aurora Firefox channels.

In practice there’s a continuous spectrum of options between these techniques, ranging from just running tests regularly, to a completely automated deployment pipeline from commit to the customer. The constant theme through all of them, however, is a commitment to constant QA and testing, and a level of test coverage that imparts confidence in the readiness of your software for delivery.

So why would you bother?

When you bring up the subject of adopting continuous deployment, there will inevitably (and rightly) be questions about the benefits of the model. For the business platforms team the main drivers were:

We want to move to feature-based releases rather than a weekly “whatever happens to be ready” release. This allows faster and finer-grained upgrades, and assists debugging and regression detection by only changing one thing at a time.

The weekly release process was only semi-automated; while we have build tools such as Maven to perform the steps, the actual process of cutting a new release was driven by developers following instructions on a Confluence page. By automating every step of the process we make it self-documenting and repeatable.

Similarly, the actual process of getting the software onto our servers was semi-automated; we had detailed scripts to upgrade servers, but running them required coordination with the sysadmin team, who already had quite enough work to be going on with. By making the deployment to the servers fully automated and driven by Bamboo rather than by humans, we created a repeatable deployment process and freed the sysadmins from busywork.

By automating the release and deployment process, we can constantly release ongoing work to staging and QA servers, giving visibility of the state of development.

But these benefits are driven by our internal processes; when advocating for adopting continuous deployment, you’ll often receive requests for justification from other stakeholders in your company. Continuous deployment brings benefits to them, too:

To customers: Because features are released when they’re ready rather than waiting for a fixed upgrade window, customers get them faster. And because ongoing work is constantly released to a staging server, customers have visibility of the changes and can be part of the development process.

To management: When we release more often, managers will see the result of work sooner and progress will be visible.

To developers: This removes the weekly/monthly/whatever mad dash to get changes into the release window. If a developer needs a few more hours to make sure a feature is fully working, then the feature will go out a few hours later, not when the next release window opens.

To sysadmins: Not only are sysadmins freed from performing the releases themselves, but the change to small, discrete feature releases makes it much easier to detect which change affected the system adversely.

The last point should probably be expanded upon, as it is such a powerful concept. If your release process bundles multiple features together, it becomes much harder to nail down exactly what caused regressions in the system. Consider the following scenario: On a Monday, a release is performed with the changes that were made the previous week. Shortly afterwards, however, the sysadmin team notices a large increase in load on the database server. While triaging the issue they notice the following changes were included:

Developer A added a new AJAX endpoint for the order process.

Developer B added a column to a table in the database.

Developer C upgraded the version of the database driver in the application.

Identifying which one is the cause would require investigation, possibly to the point of reverting changes or running git bisect. However, if each change is released separately, the start of the performance regression can be easily correlated with the release of each feature, especially if your release process tags each release in your monitoring system.

So how do you actually, you know, do it?

Continuous deployment guides frequently focus on the culture and adoption aspects (as indeed this one has so far). What’s covered less often are the practical nuts-and-bolts issues. For the rest of this post I’ll attempt to address some of the practical hurdles we had to jump while transitioning to continuous deployment.

It’s worth noting that everything here should be treated as a starting point rather than a set of rules; much of what I cover is still actively being changed as we discover new ways of working and new requirements. In particular, the development process is still being tweaked as we find out what works best for us. Processes should serve the goals, not the other way around.

While we’re using Atlassian tools wherever possible, none of this is necessarily Atlassian-specific; however, our tools do have some nice integrations, and I’ll point these out where appropriate.

Development workflow

Continuous deployment implies a clearly defined development process, with your main release branch always in a releasable state. There are any number of methodologies that allow this and I won’t cover them all here, but if you want a deep dive into them I would recommend the book Continuous Delivery by Jez Humble and David Farley. As mentioned above, the model we follow is “release by feature”; every distinct change we wish to make results in a separate release, containing only that change. What I’ll outline briefly here is our current workflow and the tools we’re using to enable it.

#1: Use an issue tracker for everything

For every bug, feature request, or other change, we create a unique ticket. If this is part of an ongoing project, we’ll use a parent epic, and for smaller chunks of work, we’ll use a sub-task. Obviously all this is best practice anyway, but in our case we use the issue IDs generated to track the change from concept to deployment. This is important as having a single reference point makes it easier to track the state of work, and enables some of the tool integrations I’ll describe below. As you may have guessed, we use JIRA for this purpose.

#2: Create a separate branch for this work, tagged with the issue number

In your version control system, create a branch which contains the issue number and a short description of the change; e.g. “BIZPLAT-64951-remove-smtp-logging”. It should hopefully go without saying at this point that you should be using a DVCS for this work. Older version control systems such as Subversion make branching and merging difficult, and without the ability to separate work into streams, keeping the main branch pristine rapidly becomes unwieldy.

We use Git for version control, and Stash to manage our repositories. JIRA has a useful integration point here; from within the tracking issue we created in #1 we can just press a button and Stash will create a branch for us. SourceTree can also pick up and check out these new branches. JIRA can also query Stash and list all branches associated with a given issue in the issue itself, along with the number of commits on each branch. This turns your JIRA ticket into a convenient dashboard for the state of your development.

#3: Develop on this branch, and continuously test and integrate it

Git allows you to easily make many commits on a branch, and then only merge when ready – but while working on the branch you should be constantly running your tests against it. We run our full integration test suite against every commit on all active branches. We use Bamboo to do this; in particular, we use its plan branches feature to automatically create the necessary build configuration.

Again, there are useful integration points here. One is that, as with the branches and commits, JIRA can display the state of any branch plans associated with tickets, creating an at-a-glance view of the feature development. But a more powerful one (in my opinion) is that Bamboo can also inform Stash of the state of builds for a branch. Why that is so important becomes evident when we come to pull requests in step #4.

#3.1: Optional: Push your changes to a rush box

Rush boxes are used as staging servers for ongoing work that is not yet ready for QA or staging proper. This step is particularly useful when you have a customer involved in the development of a feature: by pushing work in progress out to a viewable staging environment, you give them visibility of the changes and allow them to review.

A variation on this is to push to the rush box automatically, using the same infrastructure tooling that performs the deployments described below. In extreme cases, you can even create dedicated staging infrastructure (e.g. at AWS) at the same time that you create the feature branch. We haven’t got to this stage yet, but it’s definitely on my “cool stuff I’d like to do” list.

#4: When ready, create a pull request for the branch

Pull requests are the DVCS world’s version of code review. Inexperienced developers tend to dislike the idea, but many experienced developers love them, as they provide a safety net when working on critical code and infrastructure.

On our team we have a “no direct merge/commit to master” rule that we enforce through Stash access controls. This means that all merges must come through a pull request. On top of this we define a couple of extra quality rules:

All pull requests must have approval from at least one reviewer.

The Bamboo tests for this branch must pass.

Both of these are enforced by Stash; in particular, the second one is enforced by the Bamboo to Stash pass/fail notification mentioned in #3.

#5: Merge and release

Once the pull request has passed these checks, the merge to the release branch can be performed. At this point we perform a full release of the software; we use a separate, dedicated Bamboo build plan for this, which runs the full test suite before bumping the version and pushing to our build repository.

#6: Deploy to staging

For us, this stage is also fully automated. The build of a release version of the software triggers its automatic deployment to our pre-production staging servers. This allows additional QA to be performed on it, and possible review by customers or other interested parties.

The actual nuts and bolts of how we perform these deployments will be covered below.

#7: Promote to production

One of our rules is that we never push builds directly out to production. The binaries that go to production are the exact same ones that have been through QA on our staging servers. Thus we don’t so much release to production as promote once we’re happy with the quality. Again, the mechanics of this will be covered in more detail below, but suffice to say that we use Bamboo to manage which builds are deployed to where, and Bamboo communicates this information back to JIRA where it can be displayed against the original feature request.

Some full disclosure about integration

While I’ve pointed out a number of places where Atlassian products integrate with each other, the spirit of open company, no bullshit compels me to point out that these integration points are not just available to Atlassian tools. This interoperability is enabled by REST APIs that are documented online, so it’s perfectly possible to enable these features with a little work, possibly via curl. For example, if you’re using Stash for Git management but are still using Jenkins for CI, you can still enable the build-status integration mentioned in step #3 by calling out to the Stash build status API.
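To make that concrete, a Jenkins post-build step could report a commit’s result back to Stash with a few lines of Python (or an equivalent curl call). This is only a sketch: the server URL, credentials, and build key below are illustrative, so check the Stash build-status REST documentation for the exact contract.

    # Rough sketch only: report a build result for a commit back to Stash.
    # The server URL, credentials, and build details are illustrative.
    import json
    import requests  # third-party HTTP library (pip install requests)

    STASH_URL = "https://stash.example.com"

    def notify_build_result(commit_hash, passed, build_url):
        """Tell Stash whether the build for a given commit passed or failed."""
        payload = {
            "state": "SUCCESSFUL" if passed else "FAILED",
            "key": "BIZPLAT-BRANCH-BUILD",   # stable identifier for this type of build
            "name": "BIZPLAT branch build",  # human-readable label shown in Stash
            "url": build_url,                # link back to the CI build
            "description": "Branch build run by our CI server",
        }
        response = requests.post(
            "%s/rest/build-status/1.0/commits/%s" % (STASH_URL, commit_hash),
            data=json.dumps(payload),
            headers={"Content-Type": "application/json"},
            auth=("ci-user", "secret"),      # a service account owned by the CI server
        )
        response.raise_for_status()

    if __name__ == "__main__":
        notify_build_result("7549846524f8aed2bd1c0249993ae1bf9d3c9998", True,
                            "https://ci.example.com/job/bizplat-branch/123/")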

However, the spirit of keeping marketing off my back and wanting to get paid compels me to point out that this is all a lot easier using our products together.

Segue: Continuous downtime?

It’s worth noting at this point that with continuous deployment also comes the possibility of continuous downtime. If you’re releasing infrequently, you can often get away with the occasional outage for upgrades, but if you get to the point where you’re releasing features several times a day, this quickly becomes unacceptable. Thus continuous deployment goes hand-in-hand with a need for clustering, failover, and other high-availability infrastructure.

We went through this process, too. Earlier versions of our order systems ran as a single instance for largely historical reasons, and prior to moving to the new development model we had been following a weekly release cycle, with the deployment happening on a Monday morning. This infrequency, along with the fact that the deployment happened during Australian business hours, meant that it didn’t affect customers unduly. Not that we were proud of this downtime, but it was never a sufficient pain point to justify the work necessary to turn the system into a truly clustered one. The move to continuous deployment, however, meant we had to address this deficiency, splitting out the stateful parts of the order system (such as batch processing) and clustering the critical parts.

One additional point to note when clustering systems is that it’s important to select components of your network stack that will work with your deployment automation tools. For example, our original HA (high-availability) stack used Apache with mod_proxy_balancer. While this can theoretically be automated via HTTP calls to the management front-end, in practice it was never built with this in mind and proved unreliable. In the end we moved our critical services behind an HAProxy cluster, which provides a reliable (albeit Unix-socket based) API for managing cluster members.
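To give a flavour of what that looks like, the admin socket accepts simple text commands that are easy to drive from automation code. The sketch below is illustrative only (the socket path and backend/server names are made up); in our case the equivalent calls are made from the deployment tooling described in the next section rather than from standalone scripts.

    # Minimal sketch of driving HAProxy's admin socket from automation code.
    # Assumes HAProxy is configured with an admin-level stats socket, e.g.
    #   stats socket /var/run/haproxy.sock level admin
    # The socket path and backend/server names are illustrative.
    import socket

    HAPROXY_SOCKET = "/var/run/haproxy.sock"

    def haproxy_command(command):
        """Send one command to the HAProxy admin socket and return the reply."""
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(HAPROXY_SOCKET)
        sock.sendall((command.strip() + "\n").encode("ascii"))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        sock.close()
        return b"".join(chunks).decode("utf-8")

    # Drain a node before upgrading it, then put it back into rotation afterwards.
    haproxy_command("disable server order-app/app-node-01")
    # ... upgrade and health-check the node here ...
    haproxy_command("enable server order-app/app-node-01")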

The last mile problem

Another issue I seldom see addressed is how to actually get the software onto your servers; it’s assumed that if you’re doing continuous deployment you already have the “deployment” part taken care of. But even if that’s the case, your existing tools may have been built around a fixed-window schedule rather than a fully automated deployment pipeline. I refer to this as the ‘last mile’ problem. On the business platforms team we had exactly this issue: we had a set of Python tools, in use for several years, that performed upgrades on our hosts. However, they were not automated end-to-end and had no concept of cross-host coordination; rather than rewrite these scripts, we decided to look around for newer options.

One option that may work in simple single-server cases is to use your Bamboo agents themselves as the last-mile system. By placing an agent on the target server and binding it to that server’s deployment environment, you can script the deployment steps as agent tasks.

When consulting with your sysadmin team, the immediate temptation will probably be to use a software configuration management tool such as Puppet or Chef to perform the upgrades. These didn’t work for us, however, as they operate on a model of eventual convergence; i.e. they will eventually get to the state you desire, but not necessarily immediately. In a clustered system you usually need events to happen in the correct order and be coordinated across hosts, e.g.:

Take server A out of the balancer pool

Upgrade A

Check A is operational

Put A back into the balancer pool

Take B out of the balancer pool

And so on…

This is difficult to do with configuration management tools, as you need something more directed; luckily a lot of work has been done in this area by the DevOps community over the last few years. Going into a full review of the options would make this already too-long post even longer, but suffice to say that we trialled a few options and eventually settled on Ansible for automation. The main deciding factors were that it has explicit support for a number of tools we already use, including HipChat and Nagios (and look for JIRA support coming in 1.6), and that it was explicitly designed to perform rolling upgrades across clusters of servers.
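To make it concrete why we needed something more directed, here is the shape of that per-host sequence sketched in plain Python. Everything in it is a stand-in (the host names, the upgrade command, the health check); in practice we don’t hand-roll this loop at all, because Ansible’s rolling-update support (serial: 1 with pre- and post-tasks) expresses it for us.

    # Purely illustrative: the shape of the coordination a clustered upgrade needs.
    # Host names and commands are stand-ins; Ansible expresses this declaratively.
    import subprocess
    import urllib2  # Python 2; use urllib.request on Python 3

    CLUSTER = ["app-node-01", "app-node-02", "app-node-03"]

    def set_balancer_membership(host, enabled):
        # Stand-in for the HAProxy admin-socket calls shown earlier.
        action = "enable" if enabled else "disable"
        print("%s server order-app/%s" % (action, host))

    def upgrade_application(host, version):
        # Stand-in for the actual upgrade steps (stop, deploy artifact, restart).
        subprocess.check_call(["ssh", host, "sudo", "upgrade-order-app", version])

    def is_healthy(host):
        # A node only goes back into the pool once its status page responds again.
        try:
            return urllib2.urlopen("http://%s/status" % host, timeout=10).getcode() == 200
        except Exception:
            return False

    def rolling_upgrade(version):
        for host in CLUSTER:                        # one node at a time, in order
            set_balancer_membership(host, False)    # drain traffic from the node
            upgrade_application(host, version)      # push the new build
            if not is_healthy(host):                # never re-enable a broken node
                raise RuntimeError("upgrade of %s failed, aborting rollout" % host)
            set_balancer_membership(host, True)     # back into the pool, on to the next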

Ansible uses YAML playbooks to describe sequences of operations to perform across hosts. We maintain ours in their own Git repository, which we pull as a task in the deployment environment. In total, our deployment environment task list looks like this:

Download the build artifacts

Fetch Ansible from git

Fetch Ansible playbooks from git

Run the necessary playbook
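To sketch what that last step can look like (all of the names here are placeholders, and it assumes ansible-playbook is on the PATH), the deployment environment just needs to tell the playbook which environment and which build it is rolling out:

    # Illustrative sketch of the final "run the necessary playbook" step.
    # The playbook name, inventory layout, and variable names are placeholders.
    import subprocess

    def run_playbook(environment, version):
        subprocess.check_call([
            "ansible-playbook",
            "-i", "playbooks/inventory/%s" % environment,  # e.g. staging or production hosts
            "playbooks/deploy-order-system.yml",
            "--extra-vars", "app_version=%s" % version,    # which build artifact to roll out
        ])

    if __name__ == "__main__":
        run_playbook("staging", "1.42.0")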

However, we use Puppet for the base-line configuration of the hosts, including the filesystem permissions and SSH keys that Ansible uses. Personally I’ve never understood the Puppet/Chef vs. Ansible/Salt/Rundeck dichotomy; I see them as entirely complementary. For example, I use Ansible to create clusters of AWS servers, place them in DNS and behind load balancers, and then bootstrap Puppet on each host to manage the OS-level configuration. Once that’s complete, Ansible is then used to manage the deployment and upgrade of the application-level software.

Managing deployments

So now you have your release being built and deployed to development, staging, QA, and production environments – but how do you know which versions of the software are deployed where? How do you arrange the promotion of QA/staging builds up to production? And how do you ensure only certain users can perform these promotions?

Earlier spikes of the business-platform automated deployments relied on Bamboo’s child plans to separate the build and deployment stages of the pipeline and to allow the setting of permissions. However, this was unwieldy, and it quickly became hard to track what was deployed where. But with the release of Bamboo 5 we rapidly replaced these with the new deployment environments. These look a lot like standard build plans (and so can use the same task plugins available to the build jobs), but they support concepts necessary for effective deployments. In particular, they support the idea that a single build may be deployed to multiple locations (e.g. QA and staging), and then migrated onto other locations (e.g. production). They also have explicit integration with build plans, but have their own permissions, separate from those of the builds.

There’s a lot to deployment environments, and going into their features would take too long. However I will note that, like the other stages of the build pipeline, deployment environments have their own integration and feedback points; most notably the original JIRA ticket can display when, for example, a feature has been deployed to the staging environment.

Procedural issues

On the subject of management, there are some additional questions you may need to take into account. What these are will depend a lot on your organization, but I’ll cover a couple of the issues we had to address along the way:

Deployment security

Most of our systems share a common Bamboo server, with a massive array of build agents, managed by a dedicated team. Our open company, no bullshit philosophy means that we extend a lot of trust internally and assume all employees are acting in the best interests of the company. This means we grant developers quite high privileges to create and administer their own build plans. However, the company is growing, and regulatory bodies tend to take a less stoic view of such things. In particular, the software we produce modifies the company financials, and so must have strict access controls associated with it.

Although we investigated methods of remaining open but secure, in the end we decided to err on the side of strict compliance. To do this we identified all the systems that could be considered as falling under SOX (Sarbanes-Oxley) scope and placed them into a separate build environment, isolated from the more liberal policies of the master Bamboo server.

Separation of duties

Historically, the business platforms team has done a lot of its own operations; in some ways, we were doing DevOps before the idea came to wider recognition. However, SOX contains some strict rules about separation of duties; in particular, the implementers of a software change must have oversight and cannot themselves sign off on the deployment to production.

Our solution was to hand off deployment to production to the business analysts. The analysts are almost always involved in the triage and specification of software changes, and are therefore in a good position to judge whether features are ready to be deployed. And the use of automated deployment, in particular Bamboo’s deployment environments, makes this role available to them rather than just to sysadmins.

In practice, we use the deployment environment permission system to restrict production deployments to a specific group (“bizplat-releasers”); we can then add and remove members as required.

Conclusion

I’ve covered a lot here, and have only really scratched the surface of the practice of continuous deployment. But I hope this has helped to clarify some of the issues and solutions involved in adopting this powerful model into your organization.


