2013-07-31

When Dropbox launched to the public in 2008, the service grew seemingly overnight from just a few thousand users to hundreds of thousands — and then millions.

That’s great for a new startup, but it presents some engineering problems, especially when you start to think about scaling your infrastructure to meet the demands of your users.

In the case of Dropbox, the company faced an even more difficult challenge because it was billing itself as a real-time storage and syncing solution. For Dropbox to be successful, users had to be able to trust that the service was fast, reliable and safe.

So how did Dropbox manage to scale? Rajiv Eranki was Dropbox’s head of server engineering from 2008 to 2011. He was the second engineer hired, and from the beginning his job was focused on helping the product scale. During his tenure at Dropbox, Eranki watched the company grow from 2,000 to 40 million users.

Eranki shared some of his experiences scaling Dropbox on his blog and at the RAMP Conference earlier this month.

Here are some key takeaways:

Choosing Python Was a Good Decision

Eranki explained in his RAMP Conference presentation that the Dropbox team used Python for virtually everything. This was beneficial because it meant that the entire platform “could get to 40 million users without having to write thousands of lines of C code.”

This was echoed by Rian Hunter at PyCon 2011, a conference dedicated to the Python language, where he gave a presentation titled “How Dropbox Did It and How Python Helped.”

The advantage of Python was that it allowed the team to scale much more quickly than if it had used another language or group of languages as its base.

In the early days, when only two engineers were focused on scaling, limiting complexity was an important part of keeping the project growing.

In a similar vein, by using popular software stacks, including MySQL and Amazon’s S3 and EC2 infrastructure, the team was able to ensure that, at least in the early days, it wasn’t the biggest or most active user of a technology.

Test Your Potential Fail Points

One of the points that Eranki makes repeatedly in his presentation is that any system that can fail should be tested. Eranki said the team would frequently hard reboot servers to see what would happen. Does the failover strategy work? Does the process automatically restart itself?

Figuring out how something fails and testing those systems when things are running right makes actual failures manageable.

Eranki writes, “Maybe it sounds stupid to run fire drills on the live site, but testing environments are not sufficient and this is really good insurance.”
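To make the "does the process automatically restart itself?" question concrete, here is a minimal sketch of such a drill in Python. It is not Dropbox's tooling: the `Supervisor` class and `fire_drill` function are hypothetical names invented for illustration, and a real deployment would more likely rely on an existing process manager. The idea is simply to kill a child process on purpose and then confirm that the supervising code brings it back.

```python
import subprocess
import sys


class Supervisor:
    """Hypothetical supervisor that keeps one child process running."""

    def __init__(self, command):
        self.command = command
        self.restarts = 0
        self.child = subprocess.Popen(command)

    def ensure_running(self):
        """Restart the child if it has exited; return True if a restart happened."""
        if self.child.poll() is None:
            return False  # still alive, nothing to do
        self.child = subprocess.Popen(self.command)
        self.restarts += 1
        return True

    def stop(self):
        """Terminate the child and wait for it to exit."""
        self.child.kill()
        self.child.wait()


def fire_drill(supervisor):
    """Kill the child abruptly, then check that the supervisor recovers it."""
    supervisor.child.kill()  # simulate a hard failure
    supervisor.child.wait()  # reap the process so poll() reports the exit
    restarted = supervisor.ensure_running()
    return restarted and supervisor.child.poll() is None
```

A drill might then look like `sup = Supervisor([sys.executable, "-c", "import time; time.sleep(30)"])` followed by `fire_drill(sup)`, which returns `True` only if the worker was actually brought back after being killed.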

Keep Hardware Consistent

Much of Dropbox’s scaling required buying new hardware. Rather than relying on a bunch of different server configurations and hardware types, the team kept to a small number of machine types with consistent configurations.

That limited the amount of “capacity planning,” as Eranki put it, and it also made it easier to figure out whether a problem was specific to a piece of hardware.

Use UTC

Using UTC on every server saved Dropbox from the potential problems of one server or system being in one time zone and another being in a different one. Dropbox even goes as far as not converting times to the user’s time zone until the last second, in the browser (or file manager).

The Dropbox team even kept their wall clock set to UTC, just so everyone was on the same time as their servers.

This might sound silly, but when a big part of your business depends on reliable file synchronization, a time zone mismatch could mean that files are synced incorrectly.
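As a rough illustration of the "store UTC, convert at the last second" approach, here is a minimal Python sketch. It is not Dropbox's code: the function names and the fixed-offset display helper are assumptions made for this example, and a real application would more likely use named time zones than raw hour offsets.

```python
from datetime import datetime, timedelta, timezone


def now_utc() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


def to_storage(ts: datetime) -> str:
    """Normalize an aware datetime to UTC and serialize it as ISO 8601."""
    if ts.tzinfo is None:
        # A naive datetime does not say which zone it is in, so refuse it.
        raise ValueError("naive datetimes are ambiguous; attach a timezone first")
    return ts.astimezone(timezone.utc).isoformat()


def for_display(stored: str, user_offset_hours: int) -> str:
    """Convert a stored UTC timestamp to a user's offset, for display only."""
    ts = datetime.fromisoformat(stored)
    local = ts.astimezone(timezone(timedelta(hours=user_offset_hours)))
    return local.strftime("%Y-%m-%d %H:%M")
```

With this arrangement, two servers in different zones that record the same instant store the same string, so comparisons between timestamps never depend on where a server happens to be. Only `for_display` knows anything about the user's local time.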

Release Often

One of Dropbox’s mantras was, and is, to release updates frequently. In Dropbox’s early days, code was often released the same day it was written. This meant that results were instantly available and that potential improvements immediately helped users.

Even today, Dropbox still releases beta channel updates to its Mac, Windows and Linux clients. These releases often introduce new features before they hit the main line, for users who are explicitly willing to test the newest stuff while understanding there could be bugs.

As important as early releases are, however, Dropbox has also had to make sure that only stable code is pushed to its clients. After all, a corrupted directory and lost work are among the worst things that can happen with a storage and syncing service.

What do you think of the way Dropbox has scaled? Let us know in the comments.
