2015-08-13

This June I had the pleasure of presenting a talk on continuous
deployment at Devoxx UK. The video for
this talk
is now online. As
the information in this talk may be of use to some people, a full
transcription of the talk and questions is below...



Steve Smith: Hello everyone. My name's Steve Smith. I'm from
Atlassian. I want to talk a bit about practical continuous deployment,
the practical part being quite important in this. There's been a lot
of talk about continuous deployment, and there's a lot of discussion
about why you should be doing it. However, there is not as much
discussion about how you actually get it done as I would ideally like.

So I'm going to talk about how we got continuous deployment done at
Atlassian, as in one particular case, and just basically talk you
through some of the tools, techniques, the problems that we hit, and
so on.



Before we start, I should probably just back up a little bit and say a
little bit about what continuous deployment is. I assume everyone here
basically knows, but the core of the idea is continuously releasing
our code out into the wild, be it behind the firewall or in the
Cloud, where it's obviously particularly effective.

One of the interesting ideas about this is that we think we're pretty
clever doing this. People who are doing continuous deployment are
like, "Yeah, we did like 10 releases last week, and it's really
good."

That's pretty impressive for a software organization, but I was
talking to my uncle, and he's a maître d' at a couple of London
restaurants. "How many meals you do a day?" How many covers they call
it. Even a mid-ranged, mid-sized restaurant in London will generally
about 150 to 200 covers a day.

That is basically people on the line in the kitchen continually
building, creating, and getting dishes out the door in complicated
configurations, with special requests on every single one, and getting
whole tables to come out at the same time. They do that 150 to 200
times a day, and we think we are pretty clever if we manage to get two
releases out in a day.

This is possible, and this is something that's actually always been
done out in the wider world, something like Amazon or any company that
has to deliver things. "We'll build you a bespoke computer and deliver
it to your door." These people do it continuously, so it's actually
quite possible, but we need to just take some of the techniques from
the real world in with us.

I'm not going to delve into those too much right now; it's just an
idea to pin in your head. This is being done all the time now. We are
a little bit behind the times.

A bit of background about me. Who am I to talk about this kind of
thing?

The answer is I'm now a DevOps advocate, which is something like an
evangelist. If you ever find out exactly what a DevOps advocate is,
please let me know, because I haven't found out yet.

I work at Atlassian. I originally joined Atlassian about eight and a
half years ago as the original sysadmin, though my background was
originally as a developer. Since then, I moved into what was then called
the internal systems team. We now call it the business platforms team.

What we did was build the systems that the sales are made through. We
processed the money. We were the hole the money came through, if you
like. A billion dollars has passed through some of the code that I
wrote, not necessarily all in one direction. Some of it went back the
other way, but there's a certain amount of pressure there.

We wanted to remove some bottlenecks to getting fixes and improvements
out the door. I was tasked with the job of turning our horrible
monolithic dinosaur piece of code into something we could continuously
deploy.

I'll talk a little bit about how I'm doing that. That's slightly off
topic, but ultimately it was driven by the fact that, and it's no big
secret, Atlassian is moving towards an IPO at some point. God knows
when; if you find out, please let me know as well.

We certainly had a bunch of requirements around SOX compliance and
so on. We had to have separation of duties; we had to literally
firewall off our production and development systems, and all the
rest.

How do we do that and still continuously deploy in a way that keeps
us flexible and agile once I bring all these improvements in?

This is how I got into this situation. This is what I've been up to
for a while, and it took about six to nine months to do this. Now,
theoretically, any smart developer and sysadmin can get together, and
go, "All right, bang it out. Let's do this in an afternoon." You could
get this sort of working.

The reality is to do it properly, you have to make a bunch of
decisions and you have to get other people involved in the
decisions. You need to convince them that this is a sane thing to do,
and that's one of the immediate bottlenecks you'll find. I will also
discuss some of the human aspects of this, and how to sell it inside
your organization.

Continuous Integration, Continuous Delivery, Continuous Deployment?

To back up even further, let's just ask the simple question, what am I
even talking about? I assume most people have heard about, or are
vaguely familiar with continuous integration, continuous delivery,
continuous deployment. What do they actually mean, and how do they
relate to each other?



The first one is continuous integration. That's theoretically been
around for a while. It's part of the Agile manifesto: you build all
the time. At the time it was a big idea. "We'll build once a day."

Now, of course, we build on every single commit, and if you're
hardcore and using some of the continuous integration tools like
Jenkins, or Bamboo, or any of the others out there, you can do it on a
per-branch basis.

Every feature has its own branch, and every branch is getting built
constantly. That's great. We're now doing it, literally, hundreds of
times a day. From there, you say, "If we can build this 100 times a
day, theoretically, could we make it deliverable hundreds of times a
day?"

It doesn't mean you're actually doing delivery, but it's a commitment to
say, "Our software should always be in a releasable state." We should
always be able to go, "What the hell. Just throw it over the wall."

In extreme cases, you'll see this with behind-the-firewall
applications like Firefox. They have the nightly builds
that appear every day if you want to keep up to
date. This is about delivery. It's not necessarily about the actual
deployment. It's about saying, "We want to keep our software always
ready to release."

As well as integration builds, we have extensive integration
testing. We have acceptance testing, and all the rest. Those run
constantly, and therefore, we have a high degree of confidence we're
not going to break anything at any given time. Every build should
theoretically be releasable even if it's not necessarily what you want
to release.

Once you get to that point, you may go, "If we can do
that, why don't we?" This is the more extreme version. This says that
we are going to release every time we have something that's
releasable. That could be, in extreme cases, every merge onto the
master branch.

That's actually quite close to what we ended up doing, and we release
that constantly. "If it's ready, it's been through QA, it's been
tested, everything, it's ready to go out." Why don't you just go put
it out there? Why wait?

Now, this obviously works best in the Cloud, because you have that
ability to modify the entire world. It even happens a little bit in
behind-the-firewall software. If you are following one of what we call
EAPs, early access programs, you can get onto those, and get beta
builds and alpha builds.

If you can have something like Firefox nightly, you will get constant
updates every time they do a new build of features, or things that
merge onto the master branch.

But at its core, end to end, it's about quality. It's about
commitment to say, "We are going to trust our testing, and our
integration so much that we are prepared to back this up with action."

You don't necessarily have to go the whole way. How much risk you're
prepared to take on is up to you, but the further you go down
that line, the more you are saying, "We have a good testing
infrastructure in place. We have QA in place. We have people who can
make that call, and know when we feel that our software is ready."

Why?

Your next question will obviously be, "Why do this? OK, we can do it,
but should we do it?" The answer is that it's not necessarily always
worth it, but I can only speak to what was important to us.

The main thing with us is we were releasing approximately once a
week. This was because we had downtime with our systems. We did it on a
Monday morning. I was living in Australia at the time, so Monday
morning was pretty much downtime for the rest of the world. We would
just do a dump of whatever was ready on that Monday. That was a pain
in the ass.

We wanted to release individual features. Not just whatever happened
to be done. This is a very powerful idea, and I'll explain why
it's powerful. It's actually one of the biggest selling points, in
particular to some of the people who are going to be most hostile to
continuous deployment in your organization.

The idea of releasing what we need to release, not what we have to
release in that time window was very, very important to us,
particularly when you're worrying about, "Am I going to screw the
company for a couple of million dollars?" The CEO is breathing down
your neck, you don't want that. You want to sleep at night.

The other one is automation. Our releases at the time were
semi-automated, but SOX compliance ultimately required that we give
the code to somebody else, and somebody else had to install it on the
system, in particular the sysadmins. They have quite enough
shit to deal with without us dumping more on them.

We wanted to automate this away. Continuous deployment makes you
automate things. It supports the idea of the Agile world of saying,
"If something's hard to do, do it often, and you'll make it easier."
This applies to continuous deployment.

If you're going to have to do this 10 times a day, you're going to
make sure it's a one button press, and you're going to make sure it's
robust. We also wanted to remove the bottlenecks. Again, this was a
problem with the sys admins. We would have a hard time getting time
from the sys admins to do the deployment.

When we originally started the team, it was much smaller, literally
two people writing this code at the start. That was fine. We had
root on our machines, we could do it ourselves.

As we got bigger, and the changes got bigger, and the releases got
bigger, and these requirements and separation of duties got bigger, we
needed to start handing it to somebody else. Who are you going to give
it to? The answer is the sysadmins. The sysadmins were not happy with
that. We wanted to make it a fully automated system that the
sysadmins had very little to do with.

Then we could distribute the decision making outside the sys admin
team to the people who really needed to make the decision. In our
case, we used the business analysts who were the final arbiters of
what went out the door, and then they would just press a button as
part of the automation.

In your organization that might be the manager, it might be the QA,
whatever. However you are structured. It's that simple idea of it
should be easy, and it should be repeatable, and it should be possible
to do by anyone with the privileges. It makes it a lot easier to get
stuff out there.

Stakeholders

The first thing that's going to happen is your stakeholders are also
going to ask the question "Why?". Stakeholders here are anyone and
everyone in the organization. The more people who are concerned about
this, the better; that's actually a good thing. That's the core of
DevOps: not saying,
"This is ours. We're going to keep this to ourselves. We'll make that
decision."

No, that's a silo. If you want to have an open organization, you
should query the people who are going to be affected by
this. Honestly, if you're doing anything of any importance, it's
probably going to be pretty much everyone. How are you going to sell
this to people who may not be all that amenable to the idea of you
running around chucking code over the wall every couple of hours?

The first people who are going to ask are the users. Users obviously
can be customers, they can be internal users, they can be you yourself
in some cases.

What's in it for them? The answer is they get their features
faster. If there's a bug fix, say they've found a typo on the website,
the customer processing isn't working quite right, it's calculating
tax wrong, we want to tweak some of the tax handling, or you want to
change the fonts, whatever, they are going to see that much faster.
They also get visibility of the change process.

Continuous deployment gives anyone who is concerned about what
changes are being made much greater visibility, because you're
automating stuff. I'll show you how to do that nearer the end.

The next one is managers. They're actually pretty easy. I think one of
the reasons why Agile became so popular particularly amongst managers
is because managers are complete and utter control freaks. They want
to know what's happening constantly. They want to know how badly
they are screwed. How far are we over time or under time? What's going
on?

Agile gives you that. By having the shorter iterations, you have the
ability to measure much more discretely. You're not going back to the
waterfall, where you start your project, gather your requirements,
and three years later find that it's three years behind. You want
to know, "It's a week behind, but we have six weeks left. We can do
something with that."

Developers, again this is a tricky one, but ultimately, they'll get on
board when they realize the same benefits of Agile apply here. One of
the real problems with getting your changes into a long-term release,
into one of those windows that come up once a month, or even once a
week, is that you end up with a death-march to get your code in
there. It might be a mini death-march, but you're still there.

If it's going out on Friday morning, you're still there late
on Thursday night, because you've got to get that in. However, if the
release goes out whenever your code is ready, it's driven not by the
whims of release windows, or time zones, or anything like that.

If you can just say, "I'll release it when I'm done," then you can
spend that extra 20 minutes getting that last test to work reliably,
and then it'll go out when it's done, not when somebody else tells you
it has to be ready. That's a really powerful idea. Again, this comes with
the Agile manifesto as well. This idea that you don't have
death-marches, you're iterating very, very quickly.

Finally, the sysadmins; they push back hard, because there's a certain
amount of work involved for them, and they don't trust the developers,
and the developers don't trust the sysadmins. I've been on both sides
of that battleground, and let's just say, I have a mild case of
schizophrenia. There is this need to convince the admins. However,
there's one argument you can make to the sysadmins, and they will
love this immediately.

If you are releasing one feature at a time, you know who screwed the
systems, and you can blame them for it. Say you release once a
week. In that week, three changes were made. Dave modified the schema
of the database. Alice updated the database driver. James added a new
REST endpoint to query, to get information for a more interactive
order form.

That release goes out on Monday, and the database usage spikes through
the roof. Which one screwed the system? We don't know. Any of them
could have messed that up. Debugging that is going to be a pain. It's
going to come down to reproducing it and then getting out git
bisect, and splitting it out.

Or you could release each feature separately, and if you're being
smart, your release process pokes a marker into your monitoring
systems, on your graphs, to show you which release went out at that
given point. You see a marker, and you see the database load spike. Go
look at that commit hash, go choke whoever did that.
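As a concrete illustration, that deploy marker can be as small as one HTTP call at the end of the release job. This is only a sketch, assuming a Graphite-style events endpoint; the URL, environment variables and payload here are hypothetical placeholders, and the exact API depends on your monitoring stack (Graphite events, Grafana annotations, and so on).

```
#!/bin/sh
# Sketch: push a deploy marker to the monitoring system so the release shows
# up as a vertical line on the graphs. The endpoint URL and environment
# variables are hypothetical placeholders.
RELEASE="${RELEASE_TAG:-unknown-release}"   # e.g. the ticket ID or build number
COMMIT="${GIT_COMMIT:-unknown-commit}"      # the commit hash that was deployed

curl -s -X POST "https://graphite.example.com/events/" \
  -H "Content-Type: application/json" \
  -d "{\"what\": \"deploy ${RELEASE}\", \"tags\": \"deploy\", \"data\": \"${COMMIT}\"}"
```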

Sysadmins love that idea. The idea you can finally blame someone for
something is great, so the first time you have that, it's awesome. I
remember the first time I caught someone using BitTorrent on the
network. It's one of the little moments of joy in my life. It's like,
"Someone I can actually get for once."

How?

Anyway, enough of that, how do you actually do this? A lot of the
discussion about continuous deployment tends to focus on the
benefits, like what I have been talking about so far. But how do you
actually get this done?

I'm talking about how you do this as a developer through to how you
do it as a sysadmin. Because we're talking DevOps, you need to
cooperate on this. It's one smooth line, and you can't say, "It's been
deployed. It's not my problem anymore."

If you are going to be releasing stuff frequently, you need to work in
concert with your sysadmin. You need to work in concert with your
Ops. In some cases, and this is really quite a good idea, the Ops
people are actually the final arbiters of what goes out the door.

The developers have to go to the Ops people and say, "We want to do a
release." You do a little huddle and go through what's in there,
because they are the ones who are ultimately going to have to take
responsibility for it, but at least they have the phone number of the
person who came and talked to them. That's always great for them.

First, one of the things you're going to have to do, is you are going
to have to modify your development workflow. There's a bunch of bullet
points here I don't want to go into too much. I'll drag you through
them; most of it's pretty obvious.

It's pretty much best practice now, whether or not people are doing
best practice is another matter entirely. It's like test driven
development, it's like teenage sex, everyone says they're doing it,
but no one is actually doing it.

I'll run you quickly through what we consider the best practice for a
continuously deployed workflow, and for a workflow generally. The
first thing is you need to track your requests, everything. Everything
has to go into your bug tracker.

Obviously, I work for Atlassian. We use JIRA. I don't care. I'm not
going to diss anyone for using something else. It works pretty well
for us, and I can only talk from my own experience.

I'll introduce some of the cool features you can do with that a little
bit later, but ultimately, everything has to be a request. In extreme
cases, Mozilla and people like Ubuntu put everything in their bug
trackers. In Mozilla, one of the first things they did when
they produced Bugzilla was put a bug into Bugzilla saying, "We're
going to have a party when Mozilla 1 is released," things like that.

Ubuntu did that with their bug number one: destroy Microsoft, or don't
allow Microsoft to have the biggest operating system in the world. We
put everything in there. Feature requests, bugs, the whole works. I
think that's pretty much best practice nowadays. You've got to track
it, but you hear stories.

This is where we get into the fact that you've got to be modern. How many people are
using some form of DVCS? Some sort of distributed revision control
system? Actually, maybe I had better ask the other way around. Who's
not using a distributed version control system? One. Oh, there's a
couple.

It's worthwhile. This thing really works with a distributed version
control system. In particular, I'm talking about Git, but it doesn't
have to be that. We also support Mercurial, at least on some of our
systems, and you can use whatever you want. You can go back and use
Darcs or Bzr or whatever, but ultimately, it has to have efficient
branching and merging for this to work.

Every feature goes on a branch. The master is pristine. There's a lot
of different ways you can do this. Gitflow and things like
this. That's up to you, but ultimately you have to have a branch that
says, "This is what we're releasing," and if it lands on this branch
it's going out there.

We branch off master; we don't overcomplicate things. We create the
branch, and we actually tag the branch with the ID of that
ticket. That ticket is the universal ID, across your code base, for the
feature being released. This is very powerful. I'll talk about how
you use this later to keep people happy, particularly your
users. You isolate your feature work from everything else, so each
feature is contained in a branch, and then you test.
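A minimal sketch of that step in plain git, assuming a hypothetical JIRA key of ORDER-1234 and a feature/<ticket> naming convention; JIRA and Stash can create the branch for you from a button on the ticket, but by hand it is just:

```
# Create a feature branch named after the ticket (ORDER-1234 is a made-up key).
git checkout master
git pull origin master                 # start from the current, pristine master
git checkout -b feature/ORDER-1234     # the ticket ID becomes the branch name
git push -u origin feature/ORDER-1234  # CI notices the new branch and clones the tests
```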

In our case, our Bamboo instance -- we use Bamboo, you can use
whatever you like, it's not important -- is set up so that every time
a branch is created Bamboo is notified, and it immediately clones all
the tests. Everything that runs against the master branch, the release
tests, is run against the feature branch too. This means you need a
shitload of infrastructure. We run hundreds of machines, and then we
overflow into AWS, so you can basically fire up AWS instances on the
fly.

We poured a lot of resources into this. That's great because we're
back to quality again: you've got to trust your testing. If your
testing is good,
you're pretty sure that when you merge back to that master branch,
it's not going to break anything. You don't just merge though. In
fact, in the case of our team, and I actually think most teams at
Atlassian, we deliberately block direct merges to master.

We do that by basically saying you can never push to master. You can
only push feature branches. Any merges occur via, in our case, Stash,
which is our Git system; again, there are other ones out there. It
requires a pull request. You can use Bitbucket. You can use GitHub, or
whatever, but you never push to master. You only push to feature
branches, and
when you go to finally say, "My code's ready to go out there," you do
it via pull request.
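In Stash, Bitbucket and GitHub this is just a branch-permission setting in the UI. If you were enforcing it yourself on a bare Git server, a server-side hook along these lines is one way to do it; this is a sketch of the idea, not our actual setup:

```
#!/bin/sh
# pre-receive hook sketch: reject direct pushes to master so that all changes
# arrive via pull requests. Git feeds "old-sha new-sha ref-name" lines on stdin.
# (In practice you would whitelist whichever service user performs the
# pull-request merges.)
while read -r oldrev newrev refname; do
  if [ "$refname" = "refs/heads/master" ]; then
    echo "Direct pushes to master are blocked; raise a pull request instead." >&2
    exit 1
  fi
done
exit 0
```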

We set up rules around this. We have certain rules that say, all your
builds must be clean. In fact, you must have a certain number of green
builds before you can do the merge. You must have a minimum of two
reviewers. The reviewers are important. Now, I've said you must trust
your tests and your test pipeline, and that's very, very powerful.

But ultimately, software is written by humans, for humans. And
ultimately, humans must be involved in it. So, we tend to require that
other developers are pulled into the final acceptance before the code
gets pushed from the feature branch into the main branch. This is a
very powerful tool as well.

It's one of those things where we actually adopted it quite
organically, because initially everyone hates the idea of code
reviews. But it's good, particularly when you are working on a system
that has such high risk, with this billion dollars passing through it.

The idea that you can actually pull somebody else in and go, "Look,
can you just check this?" In particular, I'm a technical person. I was
never strong at the business logic side of things. Ernest and Alison,
on the other hand, who work on the same team, they had a pretty
natural talent for this stuff. In particular, Ernest could just spot
corner cases in the business logic that no one else would spot.

When modifying the business end of things, I would always go to Ernest and
Alison and say, "Can you review this? I want you to be sure." They
might pick up something. They might say, "No, it looks good to me."
Your tests are great, but ultimately it's software for humans, so things
like layouts and images are always going to require a bit of human
input.

At that point, you do a pull request and you pull people in. At that
point, assuming it's been approved, it gets automatically merged in, and
then it gets automatically released. So, what happens is whenever a
merge occurs on the master branch, we run a full release, every single
time.

What you're doing here is what I was talking about with releasing
features. You create a branch, and you separate the feature out to only
that work which you're doing, very discrete. The work is relevant to a
particular change, tagged with the original request for that change, and
then when you merge it, that gets pushed out, and only that.

Now, that gets pushed up automatically, in our case straight up to our
staging servers. At that point, that's when you look at pulling some
other people in. Maybe the original customer, the original person that
made the bug report or the feature request. You might say to them, "Go
have a look at the staging server. Does it look right to you?"

They will get involved, QA might get involved. It depends a lot on how
your organization is structured. But having done that release to
staging, everyone signs off on it. Ideally you will have some
system in place for an automated sign-off, where people can just tick
it off, and when you get enough ticks, you can then push it out.

You then push up to production, and what you push to production is
not a fresh build. It's exactly what is on staging. Now, you might set
some different flags in production; we do that. We switch off some of
the things like mail and so on.

Ultimately, when you promote up to production, what's in production is
exactly what was in staging. What you just put up to production, what
you just pushed out, is that one feature that you produced. That's
pretty much the development process: this one discrete feature goes
end-to-end and pops out onto your production servers in a discrete
manner.

Last Mile

Now, how do you get it on to the servers? Everyone's talking about,
"It's great. We push out a hundred times a day." How do you actually
get this code onto the production servers, or even onto your staging
servers? They might be on different continents.

I did investigate a number of options there. I call it the last
mile problem. This comes from the telecommunications industry. You can
shove terabytes of data across the Atlantic and that is trivial to
do. You want to get that last little chunk of Game of Thrones from the
routers down to someone's house, that's a pain in the ass. We're
talking about tens of millions of dollars of copper in the ground just
to get Game of Thrones, or fiber-optic if you're in a really
progressive country, just to get Game of Thrones up to the door. That
last mile is a bitch. It's the same problem.

I'll rush through this quite quickly; I'm assuming most of the people
here are developers rather than sysadmins. In an earlier version of
this talk I had more time to delve into this. But say you talk to your
sysadmins; they'll probably go "Chef!" or "Puppet!", depending on
whatever they're using to handle their infrastructure.

That is a wrong answer. Not that I have anything against Puppet and
Chef, but they are not appropriate to this problem. They're very
declarative. You say to Puppet or Chef, "This is how I want my
service to look." How it gets to that state is entirely up to
Puppet and Chef. You don't want that.

In particular, you want to say, "Well, I need you to upgrade the
database first and only then do you push out this new version of the
software," or, "I need database schema changes. I need you to bring
things in and out of clusters, et cetera." Puppet and Chef are not
appropriate to this. It's not a problem with Puppet and Chef; it's just
the wrong problem domain.

There's another way of doing this which is orchestration agents. Now,
if your sysadmins happen to be already using these, definitely go
ahead and use them. The problem with these is you have to install stuff
onto the servers.

Imagine you have your cluster of servers out there in AWS or your own
data center or whatever. Now, orchestration agents are a powerful way
of performing operations on groups of machines. This sounds perfect
for rolling out software.

The problem is if you have an orchestration agent-based system, you
need to get orchestration agents out there first. If your sysadmins just
happen to not already be doing this, then they're going to have to roll
out a whole new infrastructure just for you to do your deployments.

You're going to get some push back on that, unless you've got a really
progressive sysadmin who goes, "Oh, I always wanted to do that," in
which case you're good to go, which is great.

There are a bunch of them. SaltStack is one. Never used it myself. I
have heard good things about it. It's on my long to-do list of things
I need to look at.

Puppet Labs also do MCollective. This shows you exactly how different
these two paradigms are: Puppet Labs have Puppet, obviously, but they
also have MCollective, which does sort of what Puppet does but
does it in a completely different way. This gives you an
idea. MCollective, we've used it internally at Atlassian, and it has a
number of issues in its implementation. That means that we have a
cron job that goes around restarting the MCollective daemon on a
regular basis.

Let's reiterate. These are agents that sit on a machine and do things
for you. You call out to them and say, "I need all of you to
clean the temp directory," or whatever, and they will do it. It's a daemon
that sits there.

Func is another one. I think it's pretty much obsolete. It's been
largely superseded by Ansible. The same guy did it.

Bamboo itself is an agent-based system. If you are using Bamboo, if
you're feeling very excited, you can use Bamboo agents to do
deployments for you. That has a similar problem. If you want to do
global orchestration, well, it's really a build system. You'd be
stretching the limits of what a Bamboo agent is appropriate for. We do
use agents for
deployment. We tend to use them as an endpoint that starts a
deployment then uses other tools to spread the work out through
various different machines.

The other way of doing it is use whatever you already have around. You
don't have to put a dependency on your servers. You put a dependency
on the code which is doing the pushing which you're already modifying
anyway.

In particular, you're going to have to set up your build plans to do
continuous deployment and so on. Why not just put all the hard work
into the build plans, where you can control it, rather than having to go
out into your infrastructure and get your sysadmins to install extra
software?

In particular, your sysadmins are almost certainly running SSH on the
servers in production. If they're not, I fully recommend
monster.co.uk as a good source for a different job, because you're
probably going to want one pretty soon if your sysadmins are not using
SSH.

SSH is almost certainly out there. We can actually use that as a
transport layer to do the automation, and have all the management at
the backend. There are a number of tools out there for doing this. If
you're really excited, you can actually do it yourself. You just have
to use /bin/bash and knock up a script. That's fine.

Fabric is one, I think the BitBucket guys use that quite a lot. I
never used it myself. I did talk to them about it. Capistrano is
another one. I know our operations guys are quite big on that. And
there's Ansible, which is the new hotness, or it was five minutes
ago. I think we might have moved on to something else by now. Who can
keep track?

The core idea is that each of these uses SSH, and you just say, "I need
you to SSH into a bunch of machines and perform a set of operations."
Now, you can do that with a very, very simple script, or with some of
these higher level tools that are available to automate it.
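A deliberately simple sketch of the do-it-yourself version, assuming a hypothetical artifact path, host list and service name:

```
#!/bin/bash
# "Knock up a script" approach: copy the build artifact to each host over SSH
# and restart the service. Hosts, paths and the service name are placeholders.
set -euo pipefail

ARTIFACT="target/orders-app.jar"
HOSTS="app1.example.com app2.example.com app3.example.com"

for host in $HOSTS; do
  echo "Deploying ${ARTIFACT} to ${host}..."
  scp "${ARTIFACT}" "deploy@${host}:/opt/orders/orders-app.jar"
  ssh "deploy@${host}" "sudo systemctl restart orders-app"
done
```

Fabric, Capistrano and Ansible are essentially this loop with better error handling, parallelism and inventory management on top.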

Ultimately, this is the solution we came up with. This is not
necessarily what's best for you; this is what worked for us. We have
Bamboo. Bamboo runs our builds for us and it does that release that I
talked about when a merge is made on to the master branch.

That pushes out to one of our agents. We have dedicated agents that
only do deployments and Bamboo has first class support for this. I
don't want to turn this into an advertisement, but I'm just talking
about what I know myself, but Bamboo has first class support for the
idea of deployment.

In particular, if you're going to have an agent that sits inside your
production environment, you can set lockdown rules that say, "No
build can use that agent except the deployment build." That could be
set by the sysadmins.

That addresses the big problem with using your build agents to do
deployment: one of your builds might run on one of your production
agents, and now your build is running inside your production
infrastructure. If you use this idea, you need to separate them out. Bamboo
has first class support for that.

Inside that agent, the deployment runs. It uses Ansible and it
connects to multiple hosts and will do deployments. In particular,
what it does is it will first connect to the load balancer. It will
pull node number one out of the cluster.

It will then perform an upgrade of that particular node, run a bunch
of tests against it to ensure that that node is up and running
correctly. Only then will it put it back into the cluster, and then it
will go on to the next node, take it out of the cluster, et cetera, et
cetera. These are just rolling upgrades, and Ansible has first class
support for this.

It has the idea that before an action happens, another action must
happen first and it might be occurring on a different host. It has the
idea that you can actually upgrade certain ones in parallel, but no
more than five at a time, say. If you get a certain number of
failures, you abandon the entire roll out.

Ansible is explicitly created with the idea of this rolling upgrade
system at large, large scale.
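Expressed as a plain shell sketch, that flow looks roughly like this. Ansible does the same thing declaratively with its rolling-update support; the host names, load-balancer helpers and health check URL below are hypothetical placeholders, not our actual playbook.

```
#!/bin/bash
# Rolling upgrade sketch: drain a node, upgrade it, health-check it, put it
# back, then move on. Everything named here is a placeholder.
set -euo pipefail

HOSTS="node1.example.com node2.example.com node3.example.com"

# Placeholders: in reality these would call your load balancer's API or CLI.
lb_remove() { echo "(placeholder) draining $1 from the load balancer"; }
lb_add()    { echo "(placeholder) adding $1 back into the load balancer"; }

for host in $HOSTS; do
  echo "=== Upgrading ${host} ==="
  lb_remove "${host}"
  scp target/orders-app.jar "deploy@${host}:/opt/orders/orders-app.jar"
  ssh "deploy@${host}" "sudo systemctl restart orders-app"

  # Smoke-test the node before it takes traffic again; abandon the whole
  # rollout if it does not come back healthy.
  if ! curl -sf "http://${host}:8080/health" > /dev/null; then
    echo "Health check failed on ${host}; aborting rollout." >&2
    exit 1
  fi
  lb_add "${host}"
done
```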

We use SSH. We connect straight over SSH, then we do the
deployment. It's over there connecting, and rolling across all of our
infrastructure, doing upgrades and testing and testing and
testing. That's what works for us. What works for you will probably be
something entirely different. This is just a guideline for what works
for us.

In the case of Ansible, I looked at all of the available options out
there, put them into a grid and went, "OK, that's good enough." If I
did it again now, I might find something completely different. Please
do evaluate, but Ansible is working quite well for us. The sysadmins
really quite like it now; they found it quite intuitive to work with
because they're building some of the scripts. It's been working pretty
well for us.

So, I've been mentioning clusters a lot. If you are constantly
upgrading, are you not constantly having downtime? This is probably
one of the single biggest reasons why it took like six to nine months
to get this out there.

We had a monolithic app that was stateful. I basically reengineered
the whole thing so that you could actually run multiple copies of it
and do rolling upgrades, then slapped a proxy in front, and there were
these arguments with the sysadmins about what the correct proxy was. "Do we
want to get a Zeus Server?" "No, we do not want to get a Zeus Server."
"Let's use HAproxy." "No, no, no." It was back and forth, and back and
forth.

Ultimately, if you're going to do this quite a lot, you're going to
have to have robust software, if it's customer facing. There is
actually one particular application in our system, and we'd actually
like more of them to be like this, where we just went, "Run a single
version, bring it down whenever you want," because it's in the backend.

Its job was to take our invoices and push them up to our accounting
system. Those invoices didn't go over REST. They went over a queuing
system. We use HornetQ, but you can use whatever queuing system you
think is appropriate.

If it was down, those invoices just backed up and backed up and backed
up. When it came back online, it started pulling invoices to process,
process, process, process, process. That's a great pattern. If you can
actually have as much of your system as possible separated out like
that, where downtime is really not critical anymore, that's
awesome. If you can do that, do that.

Ultimately, you're going to have to address this. I would like to go
into this in more detail. It's a huge subject in and of itself. I love
talking about it because it's largely one of those things that's
great but ultimately unsolvable, almost at the computer science level:
making a really, truly robust clustered system.

It's fascinating. The world is getting better for this kind of
thing. It depends a lot on your architecture. I can't really help you
here. You're going to have to take a look around and bear it in
mind. It's fun to do. When you get those rolling upgrades going,
nothing happens and no one notices. The first time I did a continuous
deployment and no one in the company noticed, I was like, "Yes, this
is what it's all about," because really I need a hobby, don't I?

Management

I've got 15 minutes left, cool. Let's get towards the end. What you're
doing is continuous deployments: deploy and deploy and deploy. What
the hell is on your servers now? You don't know unless you've been
diligent about tracking everything from the start.

In particular, what we do is use the original bug or feature
request, your bug tracker ID, to run this from end to end. Then, at
the backend, we have what we call deployment
environments.

Again, this is Bamboo specific, Atlassian specific. This is just
because it's what I know. I never used Jenkins or any of the other
solutions. I can't really talk to them. I'm just talking about what I
know here.

With this idea of deployment environments, Bamboo has a first class
idea of deployment. All it is, really, is much like a normal
build task. You can just say, "I need you to run this and then run
this and then package yourself and then send this down the line."

The idea is there are multiple places that this package could go to. In
particular, you can have any number of environments. We've got a
development sandbox. We have a staging environment and a production
environment. What's running on each one? Well, Bamboo will track that
for you. You really need some way of tracking this. What is where?
What is required to get it to the next state? Do we want to roll back,
et cetera, et cetera?

We use deployment environments. They are basically just Bamboo
tasks. They look an awful lot like builds but they can push code out
into an environment, be it development, testing or production. It
tracks them. In particular, you can see here in the dev sandbox we've
got release number 82. On staging, though, we have release number
81 and release 80.

What you can do is you can actually promote these releases between the
different environments. You can actually see which environment has
which version running. It tracks back to the original tickets because
we're using ticket-based management. This is what I was saying about
tying it all together and using the
ticket ID as the universal ID through the whole build process.

JIRA creates this ticket. It will then notify Stash or it can notify
Bitbucket or it can notify GitHub, whatever is appropriate. It says,
"I need a new branch." A ticket has been created so inside JIRA you
press a button and it will create a branch for you. It will
automatically do it for you. You could do it by hand but if you're
inside that ticket you can press a button and it goes off your GitHub
repository and creates the branch for you in the correct format.

Stash will notify Bamboo. Bamboo will pick up on that branch, start
testing it, and push that information back to Stash. Stash will
then enforce rules. We've set a policy that says, "You cannot merge a pull
request until your builds are green."

How do you know the build's green? Because Bamboo is telling you. The
pull request is deeply integrated into your branching
environment. Your testing environment and your source control are
quite loosely but critically coupled, if that makes sense.

At the same time, Bamboo is talking back to JIRA and Stash is talking
back to JIRA. In particular, Bamboo is saying whether the build for
this ticket is passing or failing. Stash is saying, "Oh, a bunch of
commits have been made. A pull request has been raised. A pull
request has been approved. A pull request has been merged."

This means that back in the original ticket (and this is a real ticket
from our ordering system), what you can see, if you look at the
green box that's highlighted, is that in development a
branch was created, three commits have been
made, and one pull request has been raised and has been merged.

You can see here, if we press a button further down, the
information coming from Bamboo. In your development sandbox, for this
particular ticket, the code that was produced as part of fixing this bug
or creating this feature has been merged out onto the development
sandbox. You can go and have a look at it. It has gone on
to staging, but it has not yet gone into production.

The person who raised this ticket has full visibility of the state of
the work that is going on around it. This is the admin in me
talking. I will tell you now, if you want to keep your users happy
give them as much information as you can, show them how to get it, and
they will leave you alone. People hate being kept in the dark. They
hate being fed bullshit, but even bad news is good news if they feel
like they're being included in the process. They will leave you alone.

This is great. You just give them the tools and you can carry on
working. They will be happy. They just love this stuff. No one hates
anything more than feeling like they're having one pulled over
on them. Show them that work is occurring. Show them the things that
are happening. Open up your monitoring systems and let them point
things out. Let them reason about things themselves.

People want to get their jobs done in a quiet way and they hate being
kept in the dark.

The key point that we want to take away here is you need to have a
clear motivation for why you're doing this. This is a big
undertaking. It's not without its risks. It certainly requires
organizational change. You have to know why you're doing it
first. That's very important.

You have to have a branching, feature-based workflow. That's really
important. Most importantly, you have to have cooperative tools and
people. Your tools must enable cooperation between your people. This
is the heart of DevOps, talking to people.

This is the heart of what agile was supposed to be but turned into a
bunch of tools and it turned into a bunch of consultancies. It was
about sitting down with your customers and working with them, not
against them and not keeping them in the dark. These tools will help
you and your operations people, your sysadmins, and your customers all
have visibility of what's going on. They can all work together.

To bring it back to the original point that I made about food, which I
really haven't referenced back to at all, and I'm really [inaudible 25:00]
about that: it's actually possible to do this. People generate change
all day, every day, and they do it with confidence because they've
done it a thousand times before and they all communicate.

You go into a restaurant kitchen. They're all shouting at each
other. The chef is shouting at the sous chef. The waiters are shouting
at everyone. The maître d' is basically screaming, but ultimately
things happen because the people are communicating. Even angry
communication or complaints are a good thing if it means the
communication is happening.

Ultimately what happens is food goes out on time. People are happy. At
the front end, they never see any of this. It's all about getting some
good food out there. It's very possible to do.

That's the sort of summary. This is the URL for the questions. I'd
love to take some questions. I'll also be hanging around, of course,
today and tomorrow. I don't know that we're actually using the
questions app, so if anyone has any questions about this feel free
to...we've got a few minutes to talk now or feel free to grab me
later.

Hi.

Female Audience Member: I want to know why you're using branches
because doesn't using branches go against the continuous deployment
models, generally?

Steve: The question was why do we use branches? Doesn't that go
against the continuous deployment model?

No, absolutely not. We used to use Subversion. We could never have
done this with Subversion. On Subversion, we...if any of you have ever
tried it...most people have worked with Subversion at some point. Everyone? Yep.

There's a poor guy down there who's probably, "No! It's just me."

Branching and merging in Subversion just does not work. When we were
using Subversion we would constantly be merging onto master, and we'd
have to hope that by the time we got to the stage where we had to do a
release it was going to be OK.

We had testing and stuff, but ultimately you had a whole bunch of
stuff that was going in and then you'd throw out what was
important. You'd throw out whatever was done at the end.

By separating work onto a branch and merging that branch, you only
have one change go out at a time. You can separate your work
out. Continuous deployment is not about continuously throwing whatever
code is there out. It's about continuously throwing features out there.

You separate your features out, work on them until they're complete
and only then do you throw them out. Continuous deployment is not
continuously deploying all work. It's continuously deploying
completed work. You can only do that if you split the work out into
separate streams. If everyone's working on the same branch, it all
rapidly gets mixed up together.

Audience Member: There's an assumption that you can actually build
and test release branches with the amount of resources you have for a
branch release strategy. If it takes you an hour and a half to run the
unit tests for your app and seven hours to run the integration tests
and you have teams with huge numbers of developers then the branching
strategy is just not possible with even multiple teams.

Steve: Right, absolutely. The point is that it's absolutely implicit
that you have massive resources to do this. The answer is, really,
ultimately you do have to have those resources.

We had a similar problem. We still do, to a degree. Our unit tests took
several minutes. It wasn't too bad, but our integration tests, when I
originally hacked them in there...when I joined the team there were no
integration tests. I added them.

I thought, "We need integration tests. This is great. Add more and
more and more." We're up to a couple hours for the integration tests
so we started to parallelize them, because ultimately you need to break
it down. We actually managed to split them up across multiple resources.

The thing is, computers are cheap now. If you build a huge team but
you don't have the resources you've got a big imbalance
there. Computers are cheap. Human time is very, very expensive.

On a job I was working on for the military in Australia, a military
contract, we had app performance problems; it was taking quite a long
time to run all of this radar tracking stuff that we were doing. The
answer was, well...

We used SGI machines, these big, bench sized machines. Cost half a
million. The answer was if it was running too slow, buy a new SGI
machine. Now it will run fast enough, because optimizing this code
would take more developer time than the cost of this really expensive
hardware.

If you're spending a lot of money on developers but not actually
giving the developers the tools to do efficient testing then you're
wasting developer time. That's seven hours of developer time that's
theoretically been wasted by not throwing resources at it. Wouldn't it
have been better to throw some machines at the problem and have the
machines take the brunt of it?

You're probably talking about hundreds of thousands, or millions if
you're a large team. You're talking about millions of pounds a year
for their salaries. You can't spin up a few AWS instances? You've got
a mismatch in resources, I think.

Audience Member: We do. We're heavy Bamboo users. We have loads of
machines but there's still a problem in parallelizing. If it takes too
long, then there's a problem in the feedback loop because an hour and
a half on a parallelized system would be too long. Seven hours is too
long to know that you have a problem.

And so, given that strategy for many developers it's a natural fit to
have our modules be updated to the latest version. It's just a
feedback situation.

Steve: We can actually talk about this afterwards. I'd be interested
in hearing your story and helping out. I'll answer a question from
someone else first, sorry.

Audience Member: At one point you said that you can configure, for
instance, Bamboo or whatever you have to say, "You're free to branch."
First of all, you can't push to master. You're not going to be able to
do a pull request unless you have feature branches that are
passing, et cetera.

Is there any kind of check to say, "Well, actually, you're free to
branch, but since you branched off, master has changed"? Is there any
way to force your feature branch to be updated from master so whenever
you do a pull request you have an up-to-date branch?

Steve: Bamboo actually has a feature to do that. I don't know about
other systems. I don't want to turn it into a sales pitch for Bamboo
but Bamboo has a feature to do that. You can basically say, "OK, you
do your pull, your feature branch has started its test. At the end of
those tests it does a dummy merge into master and then runs all the
tests again.

So yes, you can absolutely do that and double check.
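The same idea is easy to approximate by hand in git: test the feature branch as it would look merged with today's master, on a throwaway branch. This is a sketch of the concept, not Bamboo's implementation, and the branch name and test script are hypothetical placeholders.

```
git fetch origin
git checkout -B ci-merge-check origin/feature/ORDER-1234   # throwaway branch
git merge --no-edit origin/master                          # fails here if there are conflicts
./run-tests.sh                                             # whatever runs your test suite
git checkout master                                        # then discard the throwaway branch
git branch -D ci-merge-check
```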

Audience Member: Let's say you've just deployed a new version and
now there's something wrong with it. You need to go back to the
previous version; you need to back out. How do you do that? Is it like
another merge request, or another process you use?

Steve: There are as many ways to do that as there are
developers. There are actually twice as many, because it's developers
and operations. Every single person has a separate idea. It depends on
a lot of things. Bamboo, again, I can only talk from my own
experience. Bamboo has the ability to do that, particularly in
the deployment environment. You have a list of all the releases that
went out. You can just press a button, and roll back to it, and it'll
redeploy a previous version for you. Now this is where we get into the
nasty, nasty, clustering world.

What happens if you have modified the database, and the previous
version is no longer compatible with the database? That itself is a
talk on its own, but ultimately yes. What I would do in that
particular case, I would roll back to a previous version, and then
examine the bad release. The other thing you can do, of course, because
you've got a merge request, is revert master back to
the previous state before you did the merge.
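The git end of that is straightforward: reverting the merge commit on master produces a new commit that undoes the feature. A sketch, with a placeholder commit hash:

```
git checkout master
git pull origin master
git revert -m 1 --no-edit <merge-commit-sha>   # -m 1 picks master as the mainline parent
git push origin master   # or raise it as a pull request if direct pushes to master are blocked
```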

There's lots and lots of ways to do it. That's the least hard
part. The tricky bit is what happens if the state of your production
systems has changed. What do you do then? Question?

Audience Member: What about if you had...you fixed your branch,
which is consuming an application, and it's dependent on that feature?

Steve: This is really coming down to the idea...that gets even more
tricky when you're talking about changes across systems as well. You
can ultimately branch off a feature branch as well. Ultimately, if
there are two of you, or another person you need to cooperate with,
well, this is agile. You need to cooperate. You probably need to pair
on this. The other thing is, if these two changes are so
interdependent, why are they two separate feature requests? Maybe they
should be one.

Audience Member: If you release, your link in the data with the same
number that's one who knows it has to pointing to that...?

Steve: ...No, I don't think so. This is getting into the nitty
gritty. It probably ultimately comes down to a human problem: you'd
want to communicate, or possibly you want to merge your two feature
requests into a single feature request. If your code is that
interdependent, then those two changes are really one change. It's
like if you've got two big systems, and you're changing the API and the
client at both ends. It's the same problem as that. You ultimately
need to coordinate your changes.

Actually, we're technically out of time. Are we allowed to carry on?
No, no. I'm being told, sorry, we're going to have to wrap up
now. Thanks everyone for coming. I will be hanging around today and
tomorrow. Please come up and talk to me. This is all interesting
stuff, and good questions. Thanks.
