2013-01-31

2012 brought us some of the worst website outages and downtime in recent memory. Here’s the list that made our top 15.

15. Google App Engine

When: Friday, October 26th

Cause: Traffic Spike

For four hours, from 10:30AM to 2:30PM EST on October 26th, Google App Engine failed to deliver about 50% of its requests. Because the service is used by hundreds of thousands of developers to build applications, the outage was felt heavily across the web. The downtime was caused by increased load on traffic routers.

14. Tumblr

When: Thursday, October 18th

Cause: Network Issue

Starting at 8:30AM EST, Tumblr experienced an outage due to "network problems following an issue with one of [their] uplink providers." The problem persisted for nearly six hours until service was finally restored around 2:15PM EST.

13. Salesforce

When: Tuesday, July 10th

Cause: Power Outage

Salesforce underwent a significant outage in the early morning that affected six of the company's regions. The outage was traced to a power failure at an Equinix data center in Silicon Valley. Though the power failure lasted only one minute, it took over nine hours to fully restore service. The outage came just weeks after a smaller incident.

12. Twitter

When: Thursday, June 21st

Cause: Cascaded Bug

Twitter, notorious for severe outages, went down at around noon on June 21st. The disruption lasted for three hours, and Twitter later identified the problem as a "cascaded bug in one of our infrastructure components." The outage was so severe that even the infamous "Fail Whale" error page couldn't load - the site simply timed out. It marked the longest and worst crash for Twitter in eight months.

11. GitHub

When: Tuesday, October 16th - Thursday, October 18th

Cause: Distributed Denial of Service (DDoS) Attack

On Tuesday and Wednesday, GitHub experienced partial outages of 26 minutes due to a network issue and 24 minutes due to errors in its search service, respectively. Then, on Thursday, GitHub suffered a DDoS attack that lasted for 5 hours. Developers at companies and startups across the world were brought to a standstill, unable to pull or push any of their code. Overall, it was a rough week for GitHub.

10. Kohl's

When: Thursday, November 22nd

Cause: Traffic Spike

Kohl's ran a massive online special for Black Friday shoppers, offering over 500 early bird specials, 20% off sale prices, and free shipping for orders over $50. The bargains started the day before Thanksgiving and ran until 3pm on Black Friday. However, given the surge in traffic, the Kohl's website experienced an outage for several hours on Thanksgiving evening. During the heaviest online traffic week of the year, a few hours of downtime can be incredibly costly for online retailers.

9. Super Bowl (Coke, Acura, Act of Valor)

When: Sunday, February 5th

Cause: Traffic Spike

The Super Bowl is the largest advertising event of the year. Advertisers spend millions on precious seconds to capture the eyeballs of millions watching the big game. Some advertisers' websites, however, buckle under the massive influx of traffic their ads generate. The websites for Coke, Acura, and Act of Valor all experienced severe outages directly after their ads aired during the Super Bowl.

8. Facebook

When: Friday, June 1st - Saturday, June 2nd

Cause: The Like Button

Facebook slowed down or was completely unavailable for most users for three hours between June 1st and June 2nd. With over 1 billion users worldwide, an outage of any kind is detrimental to a web property the size of Facebook. What's worse, though, is that Facebook's problems affected thousands of retail and content sites on the web as well. How? The Like button. Third party widgets, such as the Like button, rely upon the servers and performance of that third party (third party widgets are one of the biggest culprits of poor performance). So when Facebook experienced problems, websites that had the Like button embedded on their pages saw load times spike by between 5 and 20 seconds!
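One common way to limit this kind of dependency is to give a third party widget a time budget and ship the page without it if the budget runs out. The sketch below only simulates that idea in Python with a worker thread; `fetch_widget`, `render_page`, the placeholder markup, and the timing values are all hypothetical names and numbers chosen for illustration, not how the Like button or any affected site actually works.

```python
import concurrent.futures
import time


def fetch_widget(delay_s):
    """Hypothetical stand-in for a request to a third party's servers."""
    time.sleep(delay_s)  # simulate the third party responding slowly
    return "<like-button/>"


def render_page(widget_delay_s, budget_s):
    """Build the page, waiting at most budget_s seconds for the widget."""
    start = time.monotonic()
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch_widget, widget_delay_s)
    try:
        widget = future.result(timeout=budget_s)
    except concurrent.futures.TimeoutError:
        widget = "<!-- widget skipped -->"  # page ships without the widget
    pool.shutdown(wait=False)  # do not block the page on the slow request
    html = "<html>" + widget + "</html>"
    return html, time.monotonic() - start


if __name__ == "__main__":
    # A widget that takes 1 second, on a page that will only wait 0.2 seconds.
    html, elapsed = render_page(widget_delay_s=1.0, budget_s=0.2)
    print(html, round(elapsed, 1))
```

With a budget in place, the page's own load time is bounded by the budget rather than by however long the third party takes to respond.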

7. Bank of America

When: Friday, September 14th - Wednesday, September 19th

Cause: Service Upgrade / Traffic Spike

On September 14th, problems started with Bank of America's website with the message "some of our pages are temporarily unavailable" on the homepage. The issues were sporadic on Saturday, but were prominent again on Monday with unavailable webpages. Starting at 10AM on Tuesday, the majority of users were unable to connect to Bank of America's website due to slowness and time-out failures. The website placed the message "We're sorry, our site is running slowly" on their homepage. The problems were not resolved until Wednesday morning. Some speculated the issues were caused by a DDoS attack, but Bank of America denied the claims. They attributed the outages to end of the month traffic along with a code release which migrated older customers to their new platform.

6. Hosting.com

When: Friday, July 27th

Cause: Power Outage

Hosting.com suffered an outage in the early morning that caused more than 1,100 customer websites to experience downtime for as many as five hours. According to Hosting.com CEO Art Zeile, the outage was caused by human error: an engineer performing maintenance on servers mistakenly cut the power to the facility. The power loss lasted only a couple of minutes, but all of the servers needed to restart, which prolonged website downtime for customers. The majority of website owners did not have backup hosting and were not prepared for such an outage, leaving them at the mercy of a single provider's recovery.

5. Hurricane Sandy

When: Monday, October 29th - Monday, November 5th

Cause: Natural Disaster

When Hurricane Sandy hit the East Coast, it took down some major data centers in New York and New Jersey that host popular websites such as Gawker Media, Huffington Post, and BuzzFeed. The hurricane caused sporadic outages for an entire week before the data centers were able to restore power and reboot.

We have to give major props to Squarespace for literally carrying fuel up 17 floors for 3 days -- all to provide 100% uptime to over 1 million websites. That's dedication.

4. Leap Second Bug

When: Sunday, July 1st

Cause: Additional second added to Coordinated Universal Time (UTC) to account for the Earth's slowing rotation

The Leap Second Bug caused outages for many popular services such as Reddit, LinkedIn, Yelp, Gawker Media, Foursquare, StumbleUpon, Mozilla, and Microsoft Windows Azure. What is the Leap Second Bug? Leap seconds are added to Coordinated Universal Time (UTC) at irregular intervals, roughly every 18 months on average, to keep our clocks aligned with the Earth's slowing rotation. This one was the 25th leap second added since 1972! One small second threw Java and digital certificates for a loop with an unexpected timestamp and thus caused problems for these services. Google, however, was prepared for the leap second. They gradually added milliseconds to their clocks ahead of time, so the extra second had already been absorbed by the time the leap second arrived.
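Google's gradual adjustment is often called a "leap smear." The sketch below shows the general idea using a simple linear smear: the extra second is spread evenly over a window leading up to the leap second. The 24-hour window, the linear shape, and the function names are assumptions made for illustration; this is not Google's actual implementation.

```python
SMEAR_WINDOW_S = 86_400.0  # assumed window: the 24 hours before the leap second


def smear_offset(seconds_until_leap, window_s=SMEAR_WINDOW_S):
    """Return how much of the extra second (0.0 to 1.0) has been absorbed."""
    if seconds_until_leap >= window_s:
        return 0.0  # smear has not started yet
    if seconds_until_leap <= 0:
        return 1.0  # the full second has been absorbed
    return 1.0 - seconds_until_leap / window_s


def stretch_per_second(window_s=SMEAR_WINDOW_S):
    """Extra time added to each second inside the window (linear smear)."""
    return 1.0 / window_s


if __name__ == "__main__":
    for hours_left in (24, 12, 6, 0):
        print(hours_left, smear_offset(hours_left * 3600.0))
```

Under these assumptions each second in the window is lengthened by roughly 11.6 microseconds, so no clock ever has to repeat or skip a second.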

3. Royal Bank of Scotland

When: Tuesday, June 19th - Thursday, August 2nd

Cause: Batch processing backlog

The IT staff was responsible for system failures that affected 17 million customers of RBS, NatWest and Ulster Bank. The problem occurred during system maintenance, which caused an error in the banks' automated batch scheduler and processor. This prevented millions of customers from receiving or making payments, and lasted for more than a week! The outage cost RBS a whopping £125 million!

2. GoDaddy

When: Monday, September 10th

Cause: Domain Name System (DNS) Failure

Around 11AM PST, GoDaddy announced they were experiencing intermittent outages and later attributed the issue to a DNS failure. The infamous hacker group Anonymous originally took credit for the outage by way of a DDoS attack, but later rescinded this claim. GoDaddy hosts more than 5 million websites, so thousands - and possibly millions - of websites experienced downtime due to this issue. Service was restored for the majority of users by 8PM PST, but the sheer magnitude and scale of GoDaddy's reach online made this one of the biggest and most publicized outages of the year.

1. Amazon Web Services (AWS)

When: Friday, June 29th / Monday, October 22nd / Monday, December 24th

Cause: Natural Disaster / Memory Leak / Elastic Load Balancing Failure

AWS had a rough year for uptime, as it experienced three major outages. The first, caused by a major storm on June 29th, took down popular services such as Instagram, Pinterest, and Netflix until the following day. On October 22nd, a memory leak and a failed monitoring system caused Reddit, Foursquare, Minecraft, Airbnb, Heroku, GitHub, imgur, Pocket, HipChat, Coursera and a number of others to go down. That outage lasted for six hours until service was restored. Finally, on Christmas Eve, Netflix went down until Christmas morning due to an elastic load balancing failure in AWS.

What are the biggest outages you remember from last year? Do you think anyone on this list will turn into a repeat offender in 2013?
