This morning we deployed a new version of PythonAnywhere -- we've blogged about the new
stuff that you can see in it, and this post is a run-down on why it took longer than we
were expecting.
Our normal system updates tend to take between 15 and 20 minutes. Today's was meant to
take about 40 minutes -- more about why later -- but it wound up taking over an hour and
forty minutes. That definitely warrants an explanation.
Situation normal
It's worth starting by explaining how a normal system update works. There are three
main classes of servers that make up a PythonAnywhere cluster:
File storage servers -- two types of server, one of which stores the files in your private
file space, and one that stores and processes system logs (like the ones you can access from
the "Web" tab).
Database servers -- for MySQL and Postgres.
"Ephemeral" servers -- basically, everything else. Loadbalancers, web servers, console server,
servers for IPython/Jupyter notebooks, a "housekeeping" server that does admin stuff like sending
out emails, and so on.
In a normal system update, we replace just the ephemeral servers. The file storage and database
servers are pretty simple -- they run the minimum code required to do what they do, and rarely
need updating.
Our process for that is:
The day before, start up a complete new cluster of ephemeral servers, hooked up to the existing
file and database storage servers, but without anything being routed to it.
When we get in early for the deploy, we start up a server that just responds to all requests with
a "system down for maintenance" response, then switch all of our incoming traffic over to it.
Then we stop the old ephemeral servers, and run any database migrations we need to apply to the Django code that runs the main PythonAnywhere website.
We set things up so that when we access PythonAnywhere from our own office network, we bypass the "system
maintenance" server and see the new cluster, and then we sanity-check to make sure everything's OK.
If it is, then we reroute all traffic to the new cluster, and we're done.
All of that tends to take between 15 and 20 minutes.
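To make that a bit more concrete, here's a very rough sketch of the sequence. The classes and helper functions are invented for this post -- our real deploy tooling is rather more involved:

    # Purely illustrative outline of a normal deploy; these classes and
    # functions are made up for this post, not the real tooling.

    class Cluster:
        def __init__(self, name):
            self.name = name

        def stop_ephemeral_servers(self):
            print("stopping ephemeral servers in {}".format(self.name))

        def run_django_migrations(self):
            print("running Django migrations against the main site's database")

        def passes_sanity_checks(self):
            print("checking {} via the office-network bypass".format(self.name))
            return True


    def switch_traffic_to(target_name):
        print("routing all incoming traffic to {}".format(target_name))


    def deploy(old_cluster, new_cluster):
        switch_traffic_to("site-down server")    # everyone sees the maintenance page
        old_cluster.stop_ephemeral_servers()     # old web/console/etc. servers go away
        new_cluster.run_django_migrations()      # bring the Django database schema up to date
        if new_cluster.passes_sanity_checks():   # checked from our own office network
            switch_traffic_to(new_cluster.name)  # go live


    deploy(Cluster("old cluster"), Cluster("new cluster"))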
Sometimes, we need to update the file storage servers. That's a slightly more complicated
process.
When we start up the new cluster the day before, we create completely new storage servers as
well as ephemeral servers. But we don't hook the new storage servers up to the storage itself.
The storage lives on Amazon Elastic Block Store (EBS) volumes, which can be moved from machine
to machine but can only be attached to one machine at any given time -- so we leave them attached
to the old storage servers initially.
When we do the deploy, after we've stopped the old ephemeral servers, we need to:
disconnect the old EBS devices from the old storage servers
connect them to the new storage servers
start up the various system services that turn those block devices into a usable filesystem
run some system checks to make sure everything's OK and the new ephemeral servers can see the storage on the new storage servers.
Once all that is done, then we check out the new system end-to-end, and, if all is well, go live.
Because all of these steps are scripted, the extra work doesn't normally take all that long; maybe half an
hour in total. Worst case, 40 minutes.
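For each volume, the EBS shuffle itself boils down to something like this -- sketched here with boto3, the current AWS SDK for Python (at the time we were scripting this with the older boto library), and with made-up volume and instance IDs:

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")  # region chosen for illustration

    VOLUME_ID = "vol-0123456789abcdef0"   # made-up ID for one storage volume
    OLD_INSTANCE = "i-0aaaaaaaaaaaaaaaa"  # old file storage server
    NEW_INSTANCE = "i-0bbbbbbbbbbbbbbbb"  # new file storage server

    # Detach the volume from the old storage server and wait until AWS
    # reports it as available again...
    ec2.detach_volume(VolumeId=VOLUME_ID, InstanceId=OLD_INSTANCE)
    ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])

    # ...then attach it to the new storage server (an EBS volume can only be
    # attached to one instance at a time, hence the detach-then-attach dance).
    ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=NEW_INSTANCE, Device="/dev/sdf")
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[VOLUME_ID])

    # After that, the new server still has to mount the filesystem and start
    # the services that make it visible to the rest of the cluster.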
Today's system update was even larger than that, however.
Virtual Private Clouds
We've recently started hitting some limits to what we can do on AWS. When we started using the service,
they had just one kind of server virtualization -- for those who are familiar with this stuff, they
only supported paravirtualization (PV). But now they support a new kind, Hardware Virtual Machines (HVM). This
new kind has a large number of benefits in terms of the kinds of servers supported and their performance
characteristics in an environment like PythonAnywhere.
The problem is, in order to use HVM, we had to move PythonAnywhere out of what AWS call "EC2-classic" and into what
they call a Virtual Private Cloud. (Use of a VPC is not normally technically required for HVM, but there was a nice
little cascade of dependencies that meant that in our case, it was.)
We'd been testing PythonAnywhere-in-a-VPC in our development and integration environments for
some time, and by the time of this system update we felt we were as ready as we ever would be to go ahead with it.
We decided we'd move all of the servers into the VPC in one go. There's a thing called "ClassicLink"
which could have allowed us to leave some servers in EC2-classic and move others to the VPC, but using it
would have been complex, easy to get wrong, and would only have stored up trouble for the future. So we
decided we'd do one of our updates where we'd replace the file storage servers as well as the ephemeral ones.
The night before, we'd start a full set of new servers inside a VPC, and then when we switched over to them,
we'd have moved into it.
But there was a wrinkle -- as well as moving the file storage servers, we'd need to move the database servers.
We (rightly) didn't think moving Postgres would be a huge deal, because we manage our own Postgres infrastructure.
But moving the MySQL databases would require us to convert them over to the VPC using Amazon's interface.
We timed this with a number of test database servers, and found it took between five and ten minutes. We checked
with Amazon to make sure that this was a typical time, and they confirmed.
So -- half an hour for a full cluster replacement, plus ten minutes at the worst case for the MySQL move, forty
minutes in total.
What could possibly go wrong?
What went wrong
There were a few delays in today's update, but nothing too major. The code to migrate the EBS storage for the
storage servers wasn't quite right, due to a change we'd made but hadn't tested enough, but we have detailed
checklists for doing that process (after all, keeping people's data safe is the top priority in anything like this)
so we were able to work around the problem. Everything was looking OK, and we were ready to go live after about
50 minutes -- a bit of a delay, but not too bad.
It was when we ran the script to make the new cluster live that we hit the real problem.
Earlier on I glossed over how we route traffic away from the old cluster and over to the site-down server, and
then back again when we go live with the new one. Now's the time for an explanation. AWS has a great feature
called an "Elastic IP address", or EIP. An EIP is an IP address that's associated with an AWS account, which
can be associated with any server that you're running. Our load-balancer endpoints all use EIPs. So to
switch into maintenance mode, we simply change all of our EIPs so that instead of being associated with the old
cluster, they're associated with the site-down system. To switch back, we move them from the site-down system
to the new cluster.
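In EC2-classic terms, that switch is no more complicated than something like this -- again sketched with boto3 rather than the boto we actually use, and with made-up addresses and instance IDs:

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")  # region chosen for illustration

    # Made-up mapping from our public EIPs to the load balancers in the new cluster.
    EIP_TO_NEW_LOADBALANCER = {
        "203.0.113.10": "i-0ccccccccccccccc1",
        "203.0.113.11": "i-0ccccccccccccccc2",
    }

    # Re-point each EIP at its new-cluster load balancer instead of the
    # site-down server it's currently associated with.
    for public_ip, instance_id in EIP_TO_NEW_LOADBALANCER.items():
        ec2.associate_address(PublicIp=public_ip, InstanceId=instance_id)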
So we ran the script to switch from the site-down system to the new cluster, and got a swathe of errors -- the
same one for each EIP.
A bit of frantic googling, and we discovered something: EIPs are either associated with the EC2-classic system
or with the VPC system. You can't use an EC2-classic EIP in a VPC, or vice versa.
There is a way to convert an EC2-classic EIP to a VPC one, however, so we started looking into that. Weirdly,
the boto API that we normally use to script our interactions with AWS doesn't support that particular API call.
And there's no way to do it in the AWS web interface. However, we found that the AWS command-line tool (which
we'd never used before) does have a way to do it. So, a quick pip install aws-cli, then we started
IPython and ran a script to convert all of our EIPs to VPC ones.
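Roughly speaking, it was a loop along these lines -- a reconstruction for this post rather than the exact code, shelling out to the CLI's move-address-to-vpc subcommand with placeholder addresses:

    import subprocess

    # Placeholder addresses -- the real list was the couple of dozen EIPs
    # associated with our account.
    CLASSIC_EIPS = ["203.0.113.10", "203.0.113.11", "203.0.113.12"]

    for public_ip in CLASSIC_EIPS:
        # move-address-to-vpc converts an EC2-classic EIP into a VPC one.
        subprocess.check_call([
            "aws", "ec2", "move-address-to-vpc",
            "--public-ip", public_ip,
        ])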
Uh-oh. Part-way through, the conversions started failing. We have a couple of dozen EIPs associated with our
account -- we had a vague recollection of having had to increase the limit on those in the past. But it turned
out (as we later confirmed with AWS) that for some reason there's a completely separate limit on EIPs for
VPCs -- the one we'd previously had increased only applied to EC2-classic.
AWS tech support to the rescue
The advantage of spending thousands of dollars a month on AWS, and paying extra for their premium support
package, is that when stuff like this goes wrong, there's always someone you can reach out to. We logged a
support case with "production systems down" priority, and were on a chat with a kindly support team member called
George within five minutes. He was able to confirm that our VPC EIP limit was just five, and bumped it up enough
to cover the EIPs we were moving across.
Unfortunately, limit increases like that take some time to propagate across the AWS system, so while that
was happening, we took another look at the details of the error we'd got originally, when trying to associate
the EIPs with the new cluster.
The boto API call that we were using took two parameters -- the IP address we wanted to move, and the
instance ID of the server we wanted to move it to. Looking at the EIPs we had successfully moved
into the VPC world, we saw that each one now had a new identifier associated with it -- an "Allocation ID". And it
appeared that, for an EIP inside a VPC, boto needed this allocation ID as well as the EIP's actual IP address when
asked to associate it with an instance.
So we reworked the code so that it could do that, and waited for the limit increase to come through.
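With boto3, the SDK we'd use today, the reworked association looks roughly like this -- look up each address's allocation ID, then associate using that (the IDs and addresses below are made up):

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")  # region chosen for illustration

    # Made-up mapping from our public EIPs to the new cluster's load balancers.
    EIP_TO_NEW_LOADBALANCER = {
        "203.0.113.10": "i-0ccccccccccccccc1",
        "203.0.113.11": "i-0ccccccccccccccc2",
    }

    # Once an address has been moved into the VPC world it has an allocation ID,
    # and it's that ID the associate call needs.
    response = ec2.describe_addresses(PublicIps=list(EIP_TO_NEW_LOADBALANCER))
    allocation_ids = {
        address["PublicIp"]: address["AllocationId"]
        for address in response["Addresses"]
    }

    for public_ip, instance_id in EIP_TO_NEW_LOADBALANCER.items():
        ec2.associate_address(
            AllocationId=allocation_ids[public_ip],
            InstanceId=instance_id,
        )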
Finally, it did, so we reran our little IPython script. All of the EIPs moved across. A bit of further
scripting and we had allocation IDs ready for all of the EIPs, and could re-run the script to switch everything
over and make the new cluster live. And everything worked.
Phew
Lessons learned
It's hard to draw much in the way of lessons from this. One obvious problem is that we didn't know that
EIPs have to be moved into the VPC environment, and we didn't know that because our testing environment doesn't
use EIPs. That's clearly something we need to fix. But it's unlikely that we would have spotted the fact that
there was a separate limit on inside-VPC EIPs for our account -- and because we had to move the
EIPs over while the system was down, we wouldn't have discovered that until it was too late anyway. At best,
this morning's update would have taken an hour and a half rather than an hour and forty minutes, which isn't a huge improvement.
I suppose the most important lesson is that large system updates sometimes overrun -- so we should schedule
them so that any overrun still falls within the quiet part of the day. The quietest time of day across our systems as a whole is from about 5am to 8am UTC.
Things ramp up pretty quickly after that, as people come in to work across western Europe. For a normal
update, taking half an hour, 7am is a perfectly OK time to do it. But for future big changes, we'll start off
at 5 or 6.