2014-02-04

Any of you keeping score should know by now my admiration for Sensu. But what's been on the radar for a while, and something of a sore point, is the ability to build a clustered and resilient Sensu architecture in the cloud, specifically Amazon's EC2-Classic. It turns out that Sensu itself is gloriously attuned to clustering; it's those pesky dependencies that cause the issue: Redis and RabbitMQ. Today I'll be describing an architecture that offers fault tolerance and failover for Sensu in a virtualised environment.

be still my beating heart

Reading through the discussions on HA in Sensu, it mostly comes down to heartbeat daemons and load balancing. There's an excellent article on the subject by Jean-Francois Theroux identifying the HA complexities of the two main problem components.

RabbitMQ works best when using HAProxy to handle failover (an approach endorsed by Sean Porter, the father of Sensu), and a similar setup is required for Redis, though with extra complexity owing to heartbeat generation. But neither of these can be trivially implemented in AWS (EC2-Classic). See, the crux of HAProxy failover involves shifting IP addresses within the HAProxy cluster, and in EC2-Classic that means additional machinery to move Elastic IPs between HAProxy nodes in the event of a failure. Elastic IP migration is not instantaneous, so you'll end up with a gap in service. None of this really appealed, so we decided on an approach with as little complexity as possible.

elastic caching

One policy we had was that unless there was a very good reason, we should try not to rely on the various SaaS products that Amazon produce. It goes against our flexible ethos. But, as it happens, Amazon recently updated their ElastiCache service to support Redis. This removes a significant headache from our Sensu Architecture, so we felt thoroughly justified in exploring the service.

ElastiCache allows you to build managed Redis clusters with near-zero configuration, and for our use case with Sensu it's very friendly. ElastiCache even handles leader election for you, putting a cluster DNS address in front of the nodes; you only ever need a CNAME to that cluster address to reach your Redis cluster. Our quick and dirty performance tests showed that the IO performance of Redis in ElastiCache was better than running the service locally on the Sensu server, under both synthetic and real-world load. Your mileage may vary of course, depending on the size of the infrastructure you're monitoring and your budget/performance envelope, but we're very happy with the performance of our setup.
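If you want to sanity-check the endpoint before pointing Sensu at it, a quick smoke test along these lines does the job. This is only a sketch: the hostname is a placeholder for your own cluster endpoint, and it assumes the redis-py client is installed and your security groups allow access on port 6379.

```python
# Quick smoke test against the ElastiCache cluster endpoint.
# The hostname below is a placeholder; substitute your own cluster address.
import redis

ELASTICACHE_ENDPOINT = "sensu-redis.abc123.0001.euw1.cache.amazonaws.com"  # hypothetical

r = redis.StrictRedis(host=ELASTICACHE_ENDPOINT, port=6379, db=0)
r.set("sensu:healthcheck", "ok", ex=60)   # write a key with a 60s TTL
print(r.get("sensu:healthcheck"))         # expect 'ok' back
print(r.info("replication").get("role"))  # expect 'master' behind the cluster CNAME
```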

the importance of timing

This left one more piece of the puzzle: how do you handle RabbitMQ failover? Ruling out HAProxy means we'd have to look into some form of distributed RabbitMQ setup to cope with the potential split-brain issues of running across Availability Zones in AWS. That started bringing up lots more complexity, which was against our agenda, and none of the solutions looked particularly robust. Instead we went back to Sean Porter's earlier advice about RabbitMQ:

As Sensu data is time sensitive, the data in queues isn't necessarily worth mirroring

This, to me, suggests that there's not a great deal of value in the data persisted in RabbitMQ at the point of failover. The probability of a false positive or false negative is pretty low, so it makes more sense to simply run independent RabbitMQ services on each Sensu server node in our infrastructure. In the event of failover we have a gap in the data, but the frequency of check submission means the monitoring infrastructure will catch up within a few minutes, and the important persistent state, which lives in Redis, is unaffected.
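Running independent brokers is mostly a matter of giving each node an identically bootstrapped vhost and user, so either broker can serve Sensu. A minimal sketch of that bootstrap might look like the following; the vhost name, username and password are assumptions, so match them to your own Sensu configuration.

```python
# Bootstrap an identical Sensu vhost/user on each node's local RabbitMQ,
# so either broker can serve the Sensu services.
# Vhost, username and password are placeholders.
import subprocess

def rabbitmqctl(*args):
    subprocess.check_call(["rabbitmqctl"] + list(args))

rabbitmqctl("add_vhost", "/sensu")                                  # fails if it already exists
rabbitmqctl("add_user", "sensu", "CHANGEME")
rabbitmqctl("set_permissions", "-p", "/sensu", "sensu", ".*", ".*", ".*")
```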

a delicate balancing act

None of what I've discussed in the RabbitMQ proposal is worth a damn if we can't drum up a way of failing over to a secondary server in the event of a host or service failure. So, how do we implement this in EC2-Classic without HAProxy or an Elastic Load Balancer? (In EC2-Classic an ELB exposes critical services to the outside world on a public IP, which is not a good fit for Sensu's queue!)

It kinda comes back to the delicate work Jean-Francois discussed for making Redis HA: you need a good-quality service check performed locally that a remote service can interrogate to determine health. In our case, the remote service is Amazon Route53. Route53 health checks can hit a TCP or HTTP port and alter the health of a DNS record based on the response. If you can open up a random port on your Sensu/RabbitMQ server, run a very stripped-down web service listening on an obfuscated URL, and serve up a basic file, then you've got yourself a health check. The finesse comes from a script that produces that file by checking your services for good health, plus a second process that checks the first is still working and removes the file if it has gone stale.
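To make that concrete, here's a rough sketch of both halves in Python. The port, URL path and file location are hypothetical, the health probe is just a TCP connect to the local broker, and the staleness check in the HTTP handler stands in for the separate reaper process described above.

```python
# Sketch of the local health endpoint for Route53 to interrogate.
import os
import socket
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

HEALTH_FILE = "/var/run/sensu-health/ok"   # hypothetical flag file
OBSCURE_PATH = "/2f8c1e-sensu-health"      # hypothetical obfuscated URL
LISTEN_PORT = 8418                         # hypothetical "random" port
MAX_AGE = 30                               # seconds before the flag counts as stale

def local_rabbitmq_up(host="127.0.0.1", port=5672, timeout=2):
    """Cheap health probe: can we open a TCP connection to the local broker?"""
    try:
        socket.create_connection((host, port), timeout).close()
        return True
    except OSError:
        return False

def write_health_flag():
    """Run from cron or a loop: the flag only stays fresh while the broker is healthy."""
    if local_rabbitmq_up():
        with open(HEALTH_FILE, "w") as f:
            f.write(str(time.time()))
    elif os.path.exists(HEALTH_FILE):
        os.remove(HEALTH_FILE)

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve 200 only on the obscure path, and only while the flag file is fresh.
        fresh = (
            self.path == OBSCURE_PATH
            and os.path.exists(HEALTH_FILE)
            and time.time() - os.path.getmtime(HEALTH_FILE) < MAX_AGE
        )
        self.send_response(200 if fresh else 503)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", LISTEN_PORT), HealthHandler).serve_forever()
```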

here's one I made earlier



Putting it all together, we have the architecture above. With some Route53 magic you can have a primary/failover record, so all of your clients can be configured to point at a single failover address, giving you service resilience when a server has a fault. Sensu's server architecture allows multiple server daemons to run alongside one another, so you get the added benefit of increased throughput as well as resilience. Both servers point to the same failover record for the queue, and the same goes for the API and dashboard, so the services fail over in step. I've also included Graphite in the architecture, as that used to be housed on a single server with Sensu too.
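For illustration, the primary/failover record pair for the queue might be created with something like the following (shown here with boto3 purely as an example; the hosted zone ID, health check IDs and hostnames are all placeholders, and the health checks are the TCP/HTTP checks described earlier).

```python
# Sketch: create a primary/failover CNAME pair for the Sensu queue in Route53.
# All IDs and hostnames are placeholders.
import boto3

route53 = boto3.client("route53")

def failover_record(identifier, role, target, health_check_id):
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "rabbitmq.monitoring.example.internal.",  # the name clients and servers use
            "Type": "CNAME",
            "TTL": 60,
            "SetIdentifier": identifier,
            "Failover": role,                  # "PRIMARY" or "SECONDARY"
            "HealthCheckId": health_check_id,  # the stripped-down web service check
            "ResourceRecords": [{"Value": target}],
        },
    }

route53.change_resource_record_sets(
    HostedZoneId="Z_EXAMPLE",
    ChangeBatch={
        "Changes": [
            failover_record("sensu-1", "PRIMARY", "sensu-server-1.example.internal", "hc-primary-id"),
            failover_record("sensu-2", "SECONDARY", "sensu-server-2.example.internal", "hc-secondary-id"),
        ]
    },
)
```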

there's always a gotcha

Nothing's perfect though; it's more about having the lowest complexity. In our case, triggering a failover is fine, but because the DNS record is primary/failover, the recovery of the primary RabbitMQ is not recognised by the Sensu services, presumably because a session is maintained on the existing connection. Clients and servers will both remain connected to the failover queue, and the only way to trip them back to the correct host is to restart the RabbitMQ service on the failover node, forcing everything to reconnect and resolve the DNS record afresh. If you don't do this, you might end up with a split-brain when new hosts connect to the primary queue. Not quite perfect, but we'll take it.
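If you wanted to automate that nudge, a small watchdog on the failover node could spot the condition and bounce the broker for you. This is only a sketch; the hostnames and the service command are assumptions about your environment.

```python
# Sketch of a watchdog for the failover node: if the failover CNAME has
# flipped back to the primary but this node still holds connections, restart
# the local broker so everything reconnects via DNS.
# Hostnames and the restart command are placeholders.
import socket
import subprocess

FAILOVER_RECORD = "rabbitmq.monitoring.example.internal"  # the Route53 failover name
PRIMARY_HOST = "sensu-server-1.example.internal"          # where it points when healthy

def record_points_at_primary():
    return socket.gethostbyname(FAILOVER_RECORD) == socket.gethostbyname(PRIMARY_HOST)

def local_broker_has_connections():
    # rabbitmqctl prints one line per connection; any output means clients are attached
    out = subprocess.check_output(["rabbitmqctl", "list_connections", "-q"])
    return bool(out.strip())

if record_points_at_primary() and local_broker_has_connections():
    subprocess.check_call(["service", "rabbitmq-server", "restart"])
```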

yeah but what about...

Like I said, this is a solution for EC2-Classic more than anything, working around the complexities of running HAProxy in an IaaS architecture that doesn't lend itself to shared private addresses. VPC, on the other hand, can handle this well: much of the Route53 implementation can be substituted for an internal ELB, giving better resilience and not exposing any ports to the outside world. Once we've migrated our infrastructure to VPC I'm sure we'll be discussing a new, even more glamorous architecture.