2014-10-13

Hello there!

What you're reading is a blog post. Where is it from? It's from https://blog.haskell.org. What's it doing? It's cataloging the thoughts of the people who run Haskell.org.

That's right. This is our new adventure in communicating with you. We wanted some place to put more long-form posts, invite guests, and generally keep people up to date about improvements (and faults) in our infrastructure, now and in the future. Twitter and a short-form status site aren't great for that, and individual people posting scattered things on various sites and lists isn't very helpful for keeping a record.

So for an introduction post, we've really got a lot to talk about...

A quick recap on recent events

Haskell.org has had some rough times lately.

About a month and a half ago, we had an extended outage, roughly around the weekend of ICFP 2014. This was due to a very large amount of I/O getting backed up on our host machine, rock. rock is a single-tenant, bare-metal machine from Hetzner that we used to host the several VMs comprising the old server set, including the main website, the GHC Trac and git repositories, and Hackage. We alleviated a lot of the load by turning off the Hackage server and migrating one of the VMs to a new hosting provider.

Then, about a week and a half ago, we had another Hackage outage, this one the result of a more mundane concern: disk space. Much to my chagrin, this was due in part to an absence of log rotation over the past year, which resulted in a hefty 15GB of text sitting around (in a single file, no less). Oops.

This caused a small bump in the road: the Hackage server hit an error while committing some database transactions when it ran out of disk. We recovered from this (thanks to @duncan for the analysis) and restarted it. (We also had point-in-time backups, but in this case it was easier to fix the database than to roll the whole thing back.)

But we'd had several other availability issues before that too, including faulty RAM and inconsistent performance. So we're setting out to fix it. And in the process we figured, hey, you'd probably like to hear us babble about a lot of other stuff too, because why not?

New things

OK, so enough sad news about what happened. Now you're wondering what's going to happen. Most of these happening-things will be good, I hope.

There are a bunch of new things we've done for Haskell.org over the past year or so, so it's best to summarize them a bit. These aren't in any particular order; most of them are pretty new, while some are a bit older and have been quietly churning along on the servers for a while. But I imagine much of this will be new to y'all.

A new blog, right here.

And it's a fancy one at that (powered by Phabricator). Like I said, we'll be posting news updates here that we think are relevant to the Haskell.org community at large - but most of the content will focus on the administrative side.

A new hosting provider: Rackspace

As I mentioned earlier this year around the GHC 7.8 release, Rackspace has graciously donated resources to Haskell.org for GHC, particularly for buildbots. At that time we had begun using Rackspace to host some Haskell.org services; over the past year we've done so more and more, to the point where we've decided to move all of Haskell.org there. It became clear we could offer much higher reliability and greatly improved services for users with these resources.

Jesse Noller was my contact at Rackspace, and he has set Haskell.org up for its second year running with free Rackspace-powered machines, storage, and services. That's right: free (up to a point, the USD value of which I won't disclose here). With this we can provide more redundant services, both technically and geographically, along with better performance, better features, and better management. And we have their awesome Fanatical Support.

So far, things have been going pretty well. We've migrated several machines to Rackspace, including:

https://ghc.haskell.org & https://git.haskell.org

https://darcs.haskell.org

http://deb.haskell.org

https://planet.haskell.org

https://phabricator.haskell.org, and our GHC build bot

https://new-www.haskell.org (soon not to be so new-)

https://hackage.haskell.org (and its build bot)

We're still moving more servers, including:

https://www.haskell.org, to become https://wiki.haskell.org most likely

http://code.haskell.org / http://community.haskell.org / http://projects.haskell.org

Many thanks to Rackspace. We owe them greatly.

Technology upgrades, increased security, etc etc

We've done several overhauls of the way Haskell.org is managed, including security, our underlying service organization, and more.

A better webserver: All of our web instances are now served with nginx, where we used Apache before. A large motivation for this was administrative headache, since most of us are much more familiar with nginx than with our old Apache setup. On top of that we get increased speed and a more flexible configuration language (IMO). It does mean nginx now needs separate backend servers to proxy to, but systems like php-fpm or gunicorn tend to have much better performance and flexibility than things like mod_php anyway.
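
For the curious, the setup looks roughly like this - a minimal nginx sketch (the server name, document root, and php-fpm socket path are made-up examples, not our actual configuration) that serves static files directly and hands PHP off to a php-fpm pool:

    # Minimal sketch only; server name, paths, and socket are hypothetical.
    server {
        listen 80;
        server_name example.haskell.org;
        root /srv/www/example;

        # Static files are served directly by nginx.
        location / {
            try_files $uri $uri/ =404;
        }

        # PHP requests are handed off to a separate php-fpm pool
        # over a unix socket.
        location ~ \.php$ {
            include fastcgi_params;
            fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
            fastcgi_pass unix:/var/run/php5-fpm.sock;
        }
    }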

Ubuntu LTS: Almost all of our new servers are running Ubuntu 14.04 LTS. Previously we ran Debian stable; the biggest motivation for switching was that, before Debian announced their LTS project for Squeeze, Ubuntu LTS releases typically had a much longer supported lifespan.

IPv6 all the way: All of our new servers have IPv6, natively.

HTTPS: We've rolled out HTTPS for the large majority of Haskell.org. Our servers sport TLS 1.2, ECDHE key exchange, and SPDY v3 with strong cipher suites. We've also enabled HSTS on several of our services (including Phabricator), and will likely continue enabling it for every site we have.
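
Concretely, the relevant nginx settings look something like the following simplified sketch (the certificate paths and cipher list are placeholders, not our exact production values):

    # Simplified sketch; certificate paths and the cipher string are placeholders.
    server {
        listen 443 ssl spdy;
        server_name example.haskell.org;

        ssl_certificate     /etc/ssl/example.haskell.org.crt;
        ssl_certificate_key /etc/ssl/example.haskell.org.key;
        ssl_protocols       TLSv1 TLSv1.1 TLSv1.2;
        ssl_ciphers         ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:!aNULL:!MD5;
        ssl_prefer_server_ciphers on;

        # HSTS: tell browsers to only ever talk to us over HTTPS.
        add_header Strict-Transport-Security "max-age=31536000";
    }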

Reorganized: We've done a massive reorganization of the server architecture, and we've generally split services up to be more modular, with servers separated by both geographic location and responsibility where possible.

Consolidation: We've consolidated several of our services too. The biggest change is that we now have a single, consolidated MariaDB 10.0 server powering our database infrastructure. All communications to this server are encrypted with spiped for high security. Phabricator, the wiki, some other minor things (like a blog), and likely future applications will use it for storage where possible too.

Improved hardware: Every server now has dedicated networking, and servers that are linked together (like buildbots or databases) are privately networked. All networking operations are secured with spiped where possible.
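
To illustrate the spiped setup, here's roughly how an encrypted tunnel between an application server and the database host works; the hostname, ports, and key path below are hypothetical examples, not our real topology:

    # Hypothetical sketch of an spiped tunnel; hosts, ports, and key paths
    # are examples only.

    # On the application server: accept local connections on 3306 and
    # forward them, encrypted, to the database host's spiped endpoint.
    spiped -e -s '[127.0.0.1]:3306' -t db.haskell.org:8025 -k /etc/spiped/db.key

    # On the database server: decrypt incoming traffic and hand it to the
    # local MariaDB instance.
    spiped -d -s '[0.0.0.0]:8025' -t '[127.0.0.1]:3306' -k /etc/spiped/db.key

    # Applications then just connect to 127.0.0.1:3306 as if the database
    # were local, and everything on the wire is encrypted.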

Interlude: A new Hackage server

While we're on the subject, here's an example of what the new Hackage Server will be sporting:

Old server:

8GB RAM, probably 60%+ of all RAM taken by disk cache.

Combined with the hackage-builder process.

1 core.

Shared ethernet link amongst multiple VMs (no dedicated QoS per VM, AFAIK). No IPv6.

1x 100GB virtual KVM block device, backed by a 2x 2TB SATA RAID1 setup on the host.

New server:

4x cores.

4GB RAM, which should fit comfortably with nginx in front as a caching (reverse) proxy.

Hackage builder has its own server (removing much of the RAM needs).

Dedicated 800Mb/s uplink, IPv6 enabled.

Two dedicated 500GB block devices in a RAID1 configuration (backed by dedicated RAID10 shared storage).

So, Hackage should hopefully be OK for a long time. And, the doc builder is now working again, and should hopefully stay that way too.

Automation: it's a thing

Like many other sites, Haskell.org is big, complicated, intimidating, and there are occasionally points where you find a Grue, and it eats you mercilessly.

As a result, automation is an important part of our setup, since it means if one of us is hit by a bus, people can conceivably still understand, maintain and continue to improve Haskell.org in the future. We don't want knowledge of the servers locked up in anyone's head.

In The Past, Long ago in a Galaxy unsurprisingly similar to this one at this very moment, Haskell.org did not really have any automation. At all, not even to create users. Some of Haskell.org still does not have automation, and some parts of it are, in fact, still a mystery to all, waiting to be discovered. That's obviously not a good thing.

Today, Haskell.org has two projects dedicated to automation purposes. These are:

Ansible, available in rA, which is a set of Ansible playbooks for automating various aspects of the existing servers.

Auron, available in rAUR, is a new, Next Gen™ automation framework, based on NixOS.

We eventually hope to phase out Ansible in favor of Auron. While Auron is still very preliminary, several services have been ported over, and the setup does work on existing providers. Auron is also much more philosophically aligned with our goals for automation: reproducibility, binary determinism, security features, and more.
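
To give a flavour of the NixOS approach (a generic illustration, not an actual module from the Auron repository), a machine is described by a declarative expression that Nix turns into the same system every time it's built:

    # Generic NixOS sketch for illustration; not taken from Auron.
    { config, pkgs, ... }:
    {
      # Declare the services a machine should run...
      services.nginx.enable = true;
      services.openssh.enable = true;

      # ...and its basic policy, e.g. which ports are open.
      networking.firewall.allowedTCPPorts = [ 22 80 443 ];

      # Rebuilding from this file yields the same system every time,
      # which is the reproducibility we're after.
    }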

More code in the open

In our quest to automate our tiny part of the internet, we've begun naturally writing a bit of code. What's the best thing to do with code? Open source it!

The new haskell-infra organization on GitHub hosts our code, including:

Our copies of Phabricator (which this server runs) and our extension libraries for it.

The New Haskell.org homepage

Our scripts for building GHC with Phabricator

And maybe other fun stuff.

Most of our repositories are hosted on GitHub, and we use our Phabricator for code review and changes between ourselves. (We still accept GitHub pull requests though!) So it's pretty easy to contribute in whatever way you want.

Better DNS and content delivery: CloudFlare & Fastly

We've very recently begun using CloudFlare for Haskell.org's DNS management, DDoS mitigation, and analytics. After a bit of deliberation, we decided that after moving off Hetzner we'd look to a third-party provider, as opposed to running our own servers.

We chose CloudFlare mostly because, aside from a nice DNS management interface and great features like AnyCast, we also get analytics and security features, including immediate SSL delivery. And, of course, we get a nice CDN on top for all HTTP content. The primary benefits from CloudFlare are the security and caching features (in that order, IMO). The DNS interface is still particularly useful, however: the nameservers are redundant, and because CloudFlare acts as a reverse proxy, DNS changes are quick and effectively instant.

Unfortunately, while CloudFlare is great, it's only a web content proxy. That means endpoints which need things like SSH access can't (yet) be reliably proxied, which is one of its major downsides. As a result, not all of Haskell.org will be magically DDoS/spam resistant, but a much bigger chunk of it will be. The bigger problem, though, is that we have a lot of non-web content!

In particular, none of our Hackage downloads can be proxied: Hackage, like most package repositories, merely uses HTTP as a transport layer for packages. In theory you could use a binary protocol, but HTTP has a number of advantages (like firewalls being nice to it). Using a service like CloudFlare for such content is - at the least - a complete violation of the spirit of their service, and just a step beyond that, a total violation of their ToS (Section 10). But Hackage pushes a few TB a month in traffic, so we have to pay line rate for that, by the bit. And Hackage's data can't be usefully mirrored to CDN edges: all traffic has to hop through to the Rackspace DCs, meaning users suffer from latency and slower downloads.

But that's where Fastly came to the rescue. Fastly also recently stepped up to provide Haskell.org with an Open Source Software discount - meaning we get their awesome CDN for free, for custom services! Hooray!

Since Fastly is a dedicated CDN service, you can realistically proxy whatever you want with it, including our package downloads. With the help of a new friend of ours (@davean), we'll be moving Fastly in front of Hackage soon. Hopefully this just means your downloads and responsiveness will get faster, and we'll use less bandwidth. Everyone wins.

Finally, we're rolling out CloudFlare gradually to the new servers, to test them and make sure they're ready. In particular, we hope not to disturb any automation as a result of the switch (particularly the move to new SSL certificates), and we want to make sure we don't unfairly impact other people, such as Tor users (Tor and CloudFlare have a contentious relationship - lots of nasty traffic comes from Tor endpoints, but so does a ton of legitimate traffic). Let us know if anything goes wrong.

Better server monitoring: DataDog & Nagios

Server monitoring is a crucial part of managing a set of servers, and Haskell.org was unfortunately quite bad at it before. But not anymore! We've done a lot to improve the situation. Before my time, as far as I know, we pretty much only had some lame MRTG graphs of server metrics. We really needed something more than that, because it's impossible to run modern infrastructure on that alone.

Enter DataDog. I played with their product last year, and I casually approached them and asked if they would provide an account for Haskell.org - and they did!

Datadog provides real-time analytics for servers, along with a lot of integrations with services like MySQL, nginx, and others. We can monitor load and networking, and correlate them with things like database or webserver connection counts, with events flowing in from all over Haskell.org. On top of that, Datadog serves as a real-time dashboard for us to organize and comment on events as they happen.

But metrics aren't all we need. There are really two things we need: metrics (point-in-time data), and resource monitoring (logging, daemon watchdogs, health and resource checks, etc etc).

This is where Nagios comes in - we have it running and monitoring all our servers: daemons, health checks, endpoint checks for connectivity, and more. Datadog helpfully plugs into Nagios and reports its events (including errors), as well as sending us weekly summaries of Nagios reports. This means we can use the Datadog dashboard as a consolidated piece of infrastructure for metrics and events.
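
For a rough idea of what those checks look like, here's a generic Nagios sketch; the host and service definitions below are invented examples built on the stock templates and the standard check_http plugin, not our actual configuration:

    # Generic sketch; names and templates follow the stock Nagios samples.
    define host {
        use        linux-server
        host_name  hackage
        address    hackage.haskell.org
    }

    # Periodically check that the HTTP endpoint on that host responds.
    define service {
        use                  generic-service
        host_name            hackage
        service_description  HTTP
        check_command        check_http
    }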

As a result, Haskell.org is being monitored much more closely from here on out, we hope.

Better log analysis: ElasticSearch

We've (very recently) also begun rolling out another part of the equation: log management. Log management is essential for tracking down big issues over time, and in the past several years ElasticSearch has become incredibly popular. We have a new ElasticSearch instance running alongside Logstash, which several of our servers now report to (via the logstash-forwarder service, which is lightweight even on smaller servers). Kibana sits in front, on a separate server, so we can query the logs and watch the systems live.

Furthermore, our ElasticSearch deployment is, like the rest of our infrastructure, 100% encrypted - Kibana proxies backend ElasticSearch queries through HTTPS and over spiped. Servers dump messages into LogStash over SSL. I would have liked to use spiped for the LogStash connection as well, but SSL is unfortunately mandatory at this time (perhaps for the best).

We're slowly rolling out logstash-forwarder over our new machines, and tweaking our LogStash filters so they can get juicy information. Hopefully our log index will become a core tool in the future.
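
For reference, a logstash-forwarder configuration is just a small JSON file along these lines (the server name, certificate path, and log paths are examples, not our actual deployment):

    {
      "network": {
        "servers": [ "logs.haskell.org:5043" ],
        "ssl ca": "/etc/logstash-forwarder/ca.crt",
        "timeout": 15
      },
      "files": [
        {
          "paths": [ "/var/log/nginx/*.log" ],
          "fields": { "type": "nginx-access" }
        },
        {
          "paths": [ "/var/log/syslog" ],
          "fields": { "type": "syslog" }
        }
      ]
    }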

A new status site

As I'm sure some of you might be aware, we now have a fancy new site, https://status.haskell.org, that we'll be using to post updates about the infrastructure, maintenance windows, and expected (or unexpected!) downtimes. And again, someone came to help us - https://status.io gave us this for free!

Better server backups

Rackspace also fully supports their backup agents, which provide compressed, deduplicated backups of our servers; our previous situation on Hetzner was a lot more limited in terms of storage and cost. Our backups are stored privately on Cloud Files - the same infrastructure that hosts our static content.

Of course, backup on Rackspace is only one level of redundancy. That's why we're thinking about rolling out Tarsnap soon too. Either way, our setup is far more reliable and robust, and a lot of us are sleeping easier (our previous backups were space-hungry and becoming difficult to maintain by hand).
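
If we do go down the Tarsnap route, the extra layer is pleasantly simple - roughly a nightly cron job like this hypothetical sketch (the key, cache, and backup paths are invented for illustration):

    # Hypothetical nightly Tarsnap job; key/cache locations and the set of
    # directories to back up are examples only.
    tarsnap --keyfile /root/tarsnap.key \
            --cachedir /var/cache/tarsnap \
            -c -f "$(hostname)-$(date +%Y-%m-%d)" \
            /etc /srv /var/backups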

GHC: Better build bots, better code review

GHC has long had an open infrastructure request: the ability to build patches users submit, and even patches we write ourselves, to ensure they don't cause regressions on any of our machines. Developers don't necessarily have access to every platform (cross compilers, Windows, some obscure old Linux machine), so having infrastructure for this is crucial.

We also needed more stringent code review. I (Austin) review most of the patches, but ideally we want more people reviewing lots of patches, submitting patches, and testing patches. And we really need ways to test all that - I can't be the bottleneck to test a foreign patch on every machine.

At the same time, we've long had nightly build infrastructure, but it has often hobbled along on custom code (bad for maintenance), and the bots are undirected and off to the side, so it's easy to miss build reports from them.

Enter Harbormaster, our Phabricator-powered buildbot for continuous integration and patch submissions!

Harbormaster is a part of Phabricator, and it runs builds on all incoming patches and commits to GHC. How?

First, when a patch or commit for GHC comes in, it triggers an event through a Herald rule. Herald is a Phab application that lets you get notified or perform actions when events arrive; when a GHC commit or patch arrives, our rule fires and kicks off a build.

Our Herald rule runs a build plan, which is a dependency-based sequence of actions to run.

The first thing our plan does is allocate a machine resource - a buildbot. It does this by taking a lease on the resource to acquire (non-exclusive) access to it, and then moves forward. Machine management is handled by a separate application, Drydock.

After leasing a machine, we SSH into it.

We then run a build, using our phab-ghc-builder code.

Harbormaster tracks all the stdout output and the test results.

It then reports back on the code review (or the commit in question) and emails the author.
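
The build step itself boils down to something like the following on the leased machine - a simplified, hypothetical sketch; the real logic lives in the phab-ghc-builder repository:

    # Simplified, hypothetical sketch of a build step; the actual commands
    # live in the phab-ghc-builder code.

    git clone --recursive git://git.haskell.org/ghc.git
    cd ghc

    # For a code review build, apply the submitted diff first
    # (DIFF_ID here stands in for the value Harbormaster passes along).
    arc patch --diff "$DIFF_ID"

    # validate is GHC's standard build-and-test script; its output and the
    # test results are what Harbormaster captures and reports back.
    ./validate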

This has already led to a rather large change in development for most GHC developers, and Phabricator is building our patches regularly now - yes, even committers use it!

Harbormaster will get more powerful in the future: our build plans will lease more resources, including Windows, Mac, and different varieties of Linux machines, and it will run more general build plans for cross compilers and other things. It's solved a real problem for us, and the latest infrastructure has been relatively reliable. In fact I just get lazy and submit diffs to GHC without testing them - I let the machines do it. Viva la code review!

(See the GHC wiki for more on our Phabricator process; there's a lot written there for GHC developers.)

Phabricator: Documentation, an official wiki, and a better issue tracker

That's right, there's now documentation about the Haskell.org infrastructure, hosted on our new official wiki. And now you can report bugs through Maniphest to us. Both of these applications are powered by Phabricator, just like our blog.

In a previous life, Haskell.org used Request Tracker (RT) for support management. Our old RT instance is still running, but it's filled with old garbage tickets and some spam, it has its own dedicated PostgreSQL instance (quite wasteful), and it generally has not seen active use in years. We've decided to phase it out soon and instead use our Phabricator instance to manage problems, tickets, and discussions. We've already started importing and rewriting content into our wiki and modernizing things.

Hopefully these docs will help keep people up to date about the happenings here.

But also, our Phabricator installation has become an umbrella installation for several Haskell.org projects (even the Haskell.org Committee may try to use it for book-keeping). In addition, we've been taking the time to extend and contribute to Phab where possible to improve the experience for users.

Along the way, we've also authored several Phab extensions:

libphutil-haskell in rPHUH, which extends Phabricator with custom support for GHC and other things.

libphutil-rackspace in rPHUR, which extends Phabricator with support for Rackspace, including Cloud Files for storage needs, and build-machine allocation for Harbormaster.

libphutil-scrypt (deprecated; soon to be upstream) in rPHUSC, which extends Phabricator with password hashing support for the scrypt algorithm.

Future work

Of course, we're not done. That would be silly. Maintaining Haskell.org and providing better services to the community is a real necessity for anything to work at all (and remember: computers are the worst).

We've got a lot further to go. Some sneak peeks...

We'll probably attempt to roll out HHVM for our MediaWiki instance to improve performance and reduce load.

We'll be creating more GHC buildbots, including a fabled Windows build bot, and on-demand servers for distributed build load.

We'll be looking at ways of making it easier to donate to Haskell.org (on the homepage, with a nice embedded donation link).

Moar security. I (Austin) in particular am looking into deploying a setup like grsecurity for new Haskell.org servers to harden them automatically.

We'll roll out a new server, https://downloads.haskell.org, that will serve as a powerful, scalable file hosting solution for things like the Haskell Platform or GHC. This will hopefully alleviate administration overhead, reduce bandwidth, and make things quicker (thanks again, Fastly!)

[REDACTED: TOP SECRET]

And, of course, we'd appreciate all the help we can get!

el fin

This post was long. This is the ending. You probably won't read it. But we're done now! And I think that's all the time we have for today.
