2014-04-08

In the next few weeks, I will be making the leap to second line support. What does this mean? What is this support you speak of? All shall be revealed...

weight of the world

Support is a role most semi-technical people at MetaBroadcast towers end up performing at some point in their time here. Naturally, we like to have a safety net for everything from catching infrastructure disasters, through API issues, to answering support emails. Support is our catch-all term for all of that and more. Over the past few years we've gradually honed the system to the point where it meshes quite nicely with both our requirements and the skills of the team.

You may've noted that the selection of possible issues I just listed was, well, broad. Not every person on the support rota is capable/willing to deal with every kind of issue, so we split our needs into three tiers called (obviously enough) first-, second- and third-line support. Each handles a separate chunk of the potential workload, and at any given time there will be one person from each of the tiers on support. Support lasts for a week, then another person from each tier is cycled on, after a quick handover.
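To make the weekly rotation concrete, here's a minimal sketch of how one person per tier could be picked for a given week. The names, the tier membership and the simple round-robin rule are all made up for illustration; in practice the rota is whatever the team agrees on, handovers and holidays included.

```python
from datetime import date

# Hypothetical rota: who is eligible for each tier (names are invented).
ROTA = {
    "first": ["alice", "bob", "carol"],
    "second": ["dave", "erin"],
    "third": ["frank", "grace"],
}


def on_support(day: date, rota: dict = ROTA) -> dict:
    """Return the person on support for each tier on the given day.

    Weeks run Monday to Sunday. Counting weeks from a fixed origin
    (rather than using the ISO week number) keeps the rotation even
    across year boundaries.
    """
    week_index = (day.toordinal() - 1) // 7
    return {tier: people[week_index % len(people)] for tier, people in rota.items()}


if __name__ == "__main__":
    print(on_support(date(2014, 4, 8)))
```

Every day in the same Monday-to-Sunday week gives the same answer, and the following Monday the next person in each list takes over.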

load-balancing

So what are these tiers, and what does each involve? Well, let's have a quick run-through...

first-line

First-line is the first port of call. A first-line support engineer will have a plethora of potential tasks, from answering those important support@ emails, to restarting servers if they seem to have (simple) issues. Much of the first-line responsibility is taking load off the other tiers' plates: answering queries, performing simple tasks that don't require intricate knowledge of Linux, that sort of thing. While first-liners may get alerted about technical problems (more on this later), these are generally less pressing concerns (although they may be issues that, if left unchecked, could go on to cause problems later). First-line is generally also the first tier that people will end up becoming part of, as it requires the least technical know-how and experience. Generally, if a first-line engineer encounters a problem, there'll be some documentation on how best to deal with it, and if they really struggle, they can escalate it to second-line. Speaking of which...

second-line

Second-line skips all the email and gets right down to business. Second-line engineers should generally* be bothered much less often than first-line, but when they get an alert, it's serious. They must deal with intricate issues, as well as major outages and high-impact issues. If one server in an auto-scaling group needs restarting, that's first-line. If all servers crash simultaneously, that's where second-line comes in. Second-line are expected to know the ins and outs of our various systems and how they interact, and be able to diagnose and fix serious problems as soon as they occur.

*we're getting better at this!
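The one-server-versus-all-servers distinction can be sketched as a tiny routing rule. This is only an illustration: the function name is hypothetical, and sending partial outages to first-line is an assumption, since the real boundary depends on the system and the alert.

```python
def tier_for_outage(unhealthy: int, total: int) -> str:
    """Pick the tier to alert for a group of servers.

    Assumption for this sketch: anything short of a total outage goes
    to first-line, and only a group with every server down goes
    straight to second-line.
    """
    if total <= 0 or not 0 <= unhealthy <= total:
        raise ValueError("expected 0 <= unhealthy <= total and total > 0")
    if unhealthy == 0:
        return "none"    # nothing to alert on
    if unhealthy == total:
        return "second"  # every server is down: serious
    return "first"       # some servers need a restart


if __name__ == "__main__":
    print(tier_for_outage(1, 4))
    print(tier_for_outage(4, 4))
```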

third-line

Third-line is the last line of support. In a typical week, the third-line support engineer will not have to deal with anything. They are there for redundancy's sake: if first- and second-line are for some reason unable to get to a laptop in time, then the third-line engineer is there to handle the incident. This means that our support engineers are able to commute in on the Tube without having to worry about putting a dent in our support coverage!

monitor all the things

Of course, all this wouldn't be possible without our support and alerting infrastructure. We have comprehensive alerting on many metrics across all our production systems thanks to Sensu, as set up by the wonderful Mr. Horwich. Those alerts then filter through to PagerDuty, which has rotas for each tier of support, and will alert support engineers by escalating incidents to the appropriate tier, via email (and phone if necessary). Sensu also feeds Graphite, which allows us to track crucial metrics such as API response times, and check patterns of behaviour.
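The escalation idea is easy to picture as a small function: an incident starts at one tier and, if nobody acknowledges it, moves on to the next. The sketch below is plain Python rather than anything from PagerDuty itself, and the 15-minute timeout is an invented figure, not our real setting.

```python
TIERS = ["first", "second", "third"]
ESCALATE_AFTER_MINUTES = 15  # hypothetical timeout between tiers


def notified_tiers(minutes_unacknowledged: int, start_tier: str = "first") -> list:
    """Return the tiers alerted so far for an unacknowledged incident.

    The incident starts at start_tier and reaches one more tier every
    ESCALATE_AFTER_MINUTES until it has reached third-line.
    """
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged must not be negative")
    start = TIERS.index(start_tier)
    hops = minutes_unacknowledged // ESCALATE_AFTER_MINUTES
    return TIERS[start:start + 1 + hops]


if __name__ == "__main__":
    print(notified_tiers(20))            # a first-line alert, ignored for 20 minutes
    print(notified_tiers(20, "second"))  # a serious alert that started at second-line
```

Third-line only ever hears about an incident once the tiers before it have had their chance, which is why a typical week for them is a quiet one.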

We also try where possible to minimise hassle for support engineers. For instance, we use AWS to host all of our infrastructure, which means for some of our products we can use auto-scaling groups to ensure that, for simple load-based issues, we simply spin up a new application server rather than wake someone up at 2am. Our general rule is simple: we do our utmost to avoid having to wake people up.

So we think we're starting to really get our support procedures pretty nicely sorted. We're always tweaking and improving though, to make everyone's life easier. On that note, do you have any ideas how we could make things (even) better? Do leave a comment below to let us know...
