2015-09-19

My present company has been following an immutable systems topology for many of our stateless servers. As I’ve written on this topic before, I figured I’d pass along some of the things I’ve learned from our setup, which is awesome and something I can’t take credit for. I’ll keep this very high level.

To me, this is all supremely interesting. I wrote Ansible thinking about normal servers, and increasingly, CM tools are used to describe templates for immutable servers. In doing so, many moving parts are eliminated, and it really makes some things cleaner.

Why Immutable Systems?

It all kinda started with masterless deployment - setups using things like Puppet attached to git, or even Ansible’s ansible-pull utility, have been popular for a long time. The downside of this approach is that it requires production nodes to have access to git checkouts, which may reveal more than you wish to reveal. This is kind of a half-step towards immutable systems.

Even with this approach, while a potentially troublesome network dependency and speed bump is out of the way, runtime configuration can still result in occasional errors or inconsistency, and nodes will come online slower than with an image-based setup.

With immutable images, you’ll get very reliable deploys and near-infinite scale, with no central server requirements. It also enables CI/CD models that keep a very tight link between deployment configuration and testing/development. For the purposes of this article, I’ll assume Amazon terminology.

The CM tool is still in play, because it gives you a few things bash scripts don’t easily provide - easier reuse/composability and a templating engine.

First off, Repos

Have a repo for each service, plus one “base” repo containing the configuration content that sets up an up-to-date OS, configures SELinux, and all of that. This describes the base AMI that all of your services derive from.

Base Repo

The repo contains a “deploy” directory, which holds two Ansible playbooks. While I’m using Ansible terminology (since I wrote it), this setup could just as easily be used with a masterless Puppet or Chef setup if that’s what you prefer.
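To make the layout concrete, a base repo might look along these lines - the exact filenames here are my own guesses, not prescribed:

```
base/
└── deploy/
    ├── packer.json      # Packer template describing the base AMI
    ├── install.yml      # playbook one: run at bake time by Packer
    └── configure.yml    # playbook two: run at boot time from userdata
```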

Playbook one is “install”, which is run directly by the Packer Ansible provisioner. If you prefer, you can also try Strider, which I wrote, which has some different features (though it’s AWS-specific) and may expand a bit more over time.
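A minimal sketch of what that Packer template might look like - the region, AMI ID, and names are placeholders, and I’m assuming the ansible-local provisioner here:

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-xxxxxxxx",
    "instance_type": "t2.micro",
    "ssh_username": "ec2-user",
    "ami_name": "base-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "ansible-local",
    "playbook_file": "deploy/install.yml"
  }]
}
```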

Playbook two is “configure”, which is executed by the userdata of the service images.
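A sketch of what “configure” might contain - the task content is illustrative only, not prescribed:

```yaml
# configure.yml - run against localhost from userdata at boot.
- hosts: localhost
  connection: local
  tasks:
    - name: render runtime app config
      template:
        src: app.conf.j2
        dest: /etc/app/app.conf
```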

Service Repos

The repo contains the code for your application, plus a “deploy” directory. This directory also contains two playbooks. The packer.json file here, though, derives from the base AMI built earlier, and then runs its own “setup” playbook to layer the application on top of the base operating system policy.
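The interesting difference in the service repo’s template is the builder’s source_ami, which points at the most recent base AMI rather than a stock image - IDs and names here are placeholders:

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "source_ami": "ami-base1234",
    "instance_type": "t2.micro",
    "ssh_username": "ec2-user",
    "ami_name": "myservice-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "ansible-local",
    "playbook_file": "deploy/setup.yml"
  }]
}
```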

There is also a “setup” playbook. The image’s EC2 userdata performs the following tasks:

* run the configure playbook that the base repo copied over

* run the configure playbook that the service repo copied over
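In shell form, the userdata might look something like this - the paths are assumptions about where the bake step copied the playbooks:

```sh
#!/bin/bash
# EC2 userdata - runs when the service instance boots.
set -e

# Run the configure playbook the base repo copied over at bake time.
ansible-playbook -i localhost, -c local /opt/deploy/base/configure.yml

# Then the configure playbook the service repo copied over.
ansible-playbook -i localhost, -c local /opt/deploy/service/configure.yml
```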

Service Discovery

Unless you’re entirely a single-service web application, your services will probably need a way to find each other. I’ll admit this is not my area of expertise, as I haven’t tried all of the options. You could look into things like etcd, Eureka, or Consul, or even basic things you could code up with Dynamo - all with varying costs to support that infrastructure. Your “setup.yml” from the base playbook and your service playbook will run from userdata every time the instance boots, so one of those steps could be to download the latest service information. I would think that over time Amazon will come out with strong options here that don’t require something external.
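As one hedged example, if you went with Consul, a boot-time task could ask the local agent for healthy endpoints via its standard HTTP health API - the service name “db” and the local-agent address are made up for illustration:

```yaml
- hosts: localhost
  connection: local
  tasks:
    - name: look up healthy db endpoints from the local Consul agent
      uri:
        url: "http://127.0.0.1:8500/v1/health/service/db?passing"
        return_content: yes
      register: db_nodes
```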

Jenkins

Any change to the base repo rebuilds a new base AMI. Any change to a service repo rebuilds the dependent AMIs. Owners of the various services are required to periodically update their repo to point at the latest base AMI on a regular schedule.

Logging

Any AMI playbook errors will be recorded in your Jenkins logs. For runtime configuration - like the “setup” playbooks run by EC2 userdata - log to a logging provider; for example, a custom Ansible callback plugin can send events to Sumo Logic. This approach is nice because you can ensure the logs and events are highly structured. Alternatively, you could just let your logging provider directly consume output redirected to disk.
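The “redirect to disk” option is the least work - in the userdata you’d just capture the playbook’s output somewhere your logging agent already watches (the paths here are my own):

```sh
ansible-playbook -i localhost, -c local /opt/deploy/service/configure.yml \
  2>&1 | tee -a /var/log/userdata-configure.log
```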

CI/CD and Testing

Application unit tests are called before Packer in the build system.

If you were using my quick hack Strider instead of Packer, you could optionally use the “post bake” steps of the provisioner to run tests on the Packer nodes after the image is baked. Regardless, you’ll probably want some kind of automation to deploy your EC2 images into a test/stage environment and run a more comprehensive battery against the deployed fleet - likely not just one AMI, but many, working in concert. This part is still up to you, so invest in good automated QA - it’s one of the most important parts of your organization.

How Do I Describe My Stack?

Lately I’ve been learning that Cloud Formation exposes LOTS of knobs. While it’s not 100% friendly, I think that’s a valid reason to consider it over many other options.

I’m conceptually still very interested in Terraform, particularly due to some declarative gaps in Cloud Formation, but Cloud Formation exposes various knobs and features I’d want access to, so I think it matters how many of those knobs and features you actually need.

I think there’s enough complexity that abstractions over CF like Troposphere may *not* be a good fit in some situations, though I’d be strongly tempted to render the Cloud Formation JSON from YAML, so it could be coded up more quickly and with comments added, which JSON does not allow.
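A minimal sketch of that YAML-to-JSON rendering step, assuming PyYAML is installed - the function name is mine, not from any tool:

```python
import json

import yaml  # PyYAML - an assumption; not part of the stdlib


def render(yaml_text):
    """Render a YAML-authored template body to Cloud Formation-ready JSON.

    Comments in the YAML source are dropped by the parser, which is the
    point: they live in source control, Cloud Formation never sees them.
    """
    return json.dumps(yaml.safe_load(yaml_text), indent=2)
```

You’d run this over each template at build time and hand the resulting JSON to Cloud Formation.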

Reducing Stateful Services

The above topology isn’t going to eliminate live management for stateful services, but you should still consider baking a base AMI for them. Where possible, use things like RDS and Dynamo so you can avoid having a data layer to maintain, which may completely eliminate the need to deploy configurations to stateful services for many shops. If you cannot completely eliminate them, various other options are available, including managing them the way you did prior to immutable approaches.

However, for cloud based shops, the need for stateful servers is rapidly reducing with providers offering more and more data-layer features every day.
