2016-11-04

‘Crashcan’ (think trashcan, but for crashes) is Etsy’s internal application for mobile crash analytics. We use an external partner to collect and store crash and exception data from our mobile apps, which we then ingest via their API. Crashcan is a single-page web app that refines and better exposes the crash data we receive from our external partner.



Crashcan gives us extra analysis of our crashes on top of what our partner offers. We can make less cautious assumptions about our stack traces internally, which allows us to group more of them together. It connects crashes to our user data and internal tooling. It allows us to search for crashes in a range of versions. It’s provided a good balance between building our own solution from scratch and completely relying on an external partner. We get the ability to customize our mobile crash reporting without having to maintain an entire crash reporting infrastructure.

Error Reporting – The Web vs. Mobile

Unfortunately, collecting mobile crashes and exceptions is (of necessity) quite different from most error reporting on the web, in several ways:

On the web (especially the desktop web), we can be fairly confident that a user is online – they’re less prone to flaky or slow connections. Plus, users don’t expect to be able to access the web when they don’t have a connection.

‘Crashes’ mean something different on the web, and many exceptions and errors are less severe. Sure, users may need to refresh a page, but it’s rare that a web page will crash their browser.

We can watch graphs and logs live as we deploy on the web (hooray, continuous deployment!) – and it’s clear if hundreds or thousands of exceptions start pouring in. With our mobile apps, however, we have to wait for users to install new versions after a release makes it to the App Store. Only then do we get to see exceptions.

With mobile, when a crash occurs, we normally can’t send the crash until the app is launched again (with a data connection) – in some cases, this can be days or weeks.

App crashes are costly to the user: a crash loses the user’s state. On the web, even if a page breaks for some reason, the user keeps their browser history.

With the web, there’s one version of Etsy at any point in time. It’s updated continuously, and every user is running the latest version, always. With the apps, we have to deal with multiple versions at once, and we have to wait for users to update.

With these differences, it’s been important to approach the analysis of crashes and exceptions differently than we do on the web.

A Crash Analytics Wish List

Many of the issues mentioned above are handled by our external partner. But while this partner provides a good overview of our mobile crashes, there were still some pieces of functionality that would make analyzing crashes significantly easier for us. The functionality we really wanted included:

An easy way to filter broad ranges of app versions – like being able to specify 4.0-4.4 to find all crashes for versions between 4.0.0.0 and 4.4.999.999 (sketched in code just after this list).

Links between users’ accounts and specific crashes – like “This user reported they experienced this crash… Let’s find it.” This coupling with our user database allows us to better determine who is experiencing a crash – is it just sellers? Just buyers?

Better crash de-duplication, specifically handling different versions of our apps and different versions of mobile operating systems. For example, crash traces may be almost identical, but with different line numbers or method names depending on the app version. But if they originate in the same place, we want to group them all together.

Crash categorization – such as NullPointerExceptions versus OutOfMemory errors on Android – because some types of crashes are fairly easy to fix, while others (like OutOfMemory errors) are often systemic and unactionable.

Custom alerting with various criteria – like when we experience a new crash with this keyword, or when an old version of our app suddenly experiences new API errors.
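To make the version-range idea concrete, here’s a rough sketch. The helper names and the zero/999 padding scheme are illustrative, not Crashcan’s actual code:

```php
<?php
// Hypothetical helper: expand a short range like "4.0-4.4" into inclusive
// four-part bounds, then test app versions against it.
function expandVersionRange(string $range): array
{
    list($low, $high) = explode('-', $range);
    // Pad the lower bound with zeros and the upper bound with 999s,
    // so "4.0-4.4" becomes ["4.0.0.0", "4.4.999.999"].
    $pad = function (string $version, string $filler): string {
        $parts = explode('.', $version);
        return implode('.', array_pad($parts, 4, $filler));
    };
    return [$pad($low, '0'), $pad($high, '999')];
}

function versionInRange(string $version, string $range): bool
{
    list($min, $max) = expandVersionRange($range);
    return version_compare($version, $min, '>=')
        && version_compare($version, $max, '<=');
}

var_dump(versionInRange('4.2.1.0', '4.0-4.4')); // bool(true)
var_dump(versionInRange('4.5.0.0', '4.0-4.4')); // bool(false)
```

Normalizing every version to four parts up front means the same comparison works whether a crash reports itself as 4.2, 4.2.1, or 4.2.1.0.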

It seemed like it’d be fairly straightforward to build our own application, using the data we were already collecting, to implement this functionality. We wanted to augment the data we receive from our external partner with our own data and couple it with our internal tooling. We also wanted to give any interested Etsy Admin a simple way to view the overall health of our apps. So that’s exactly what we chose to do.

Building Crashcan

Crashcan’s structure was a pretty wide-open space. All it really needed to do was ingest crashes from an API, process them a bit, and expose them via a simple user interface (which describes a lot of applications, actually). So while the options for technologies and methodologies were open, we ultimately decided to keep it simple.

By using PHP, Etsy’s primary development language, we keep the barrier to entry for developers at Etsy low. We used as much modern PHP as possible, with Composer handling our dependency management. MySQL handles the data storage, with Doctrine ORM providing the interface to the database.
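Wiring those pieces together takes only a few lines. The snippet below is a minimal, generic Doctrine 2.x bootstrap against MySQL; the paths and credentials are placeholders rather than Crashcan’s real configuration:

```php
<?php
// Minimal Doctrine ORM 2.x bootstrap against MySQL (a sketch; paths and
// credentials are placeholders, not Crashcan's actual configuration).
require_once __DIR__ . '/vendor/autoload.php'; // Composer autoloader

use Doctrine\ORM\Tools\Setup;
use Doctrine\ORM\EntityManager;

$config = Setup::createAnnotationMetadataConfiguration(
    [__DIR__ . '/src/Entities'], // where the annotated entity classes live
    $isDevMode = false
);

$entityManager = EntityManager::create([
    'driver'   => 'pdo_mysql',
    'host'     => 'localhost',
    'dbname'   => 'crashcan',
    'user'     => 'crashcan',
    'password' => 'secret',
], $config);
```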

Data Ingestion

Ingesting the data was the first hurdle. Before handling anything else, we needed to make sure that we could actually (1) get the data we wanted and (2) keep up with the number of crashes we wanted to ingest, without overwhelming our system. After all, if you can’t get the data you want, or can’t get it reliably, there’s really no point.

After analyzing the API endpoints we had at our fingertips (yay, documentation!), we determined that we could get all the data we wanted. The architecture needed to allow us to:

Determine whether we already have a crash (regardless of whether it has been deduplicated on our end)

Keep track of deduplicated crashes, and link them to the originating crash from the external provider

Run complex queries to combine data

Analyze whether crashes are meeting specific thresholds, like whether a new crash has occurred at least n times

Count crashes by category

Filter everything by version range and other criteria

In the end, we developed a schema that allowed us to fulfill all those needs while remaining quick in response to queries:
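The actual schema isn’t reproduced here, but a simplified sketch of how its core tables might be modeled with Doctrine annotations looks something like this (the field names are illustrative, not the production schema):

```php
<?php
// Illustrative sketch (not the production schema): one row per deduplicated
// crash, plus one row per occurrence, linked to an Etsy user id when we can
// match one.
use Doctrine\ORM\Mapping as ORM;

/** @ORM\Entity @ORM\Table(name="crashes") */
class Crash
{
    /** @ORM\Id @ORM\Column(type="integer") @ORM\GeneratedValue */
    private $id;

    /** @ORM\Column(type="string", unique=true) */
    private $externalId;      // id assigned by the external provider

    /** @ORM\ManyToOne(targetEntity="Crash") */
    private $duplicateOf;     // originating crash, if deduplicated on our end

    /** @ORM\Column(type="string") */
    private $category;        // e.g. NullPointerException, OutOfMemory

    /** @ORM\Column(type="string") */
    private $appVersion;      // normalized four-part version for range queries

    /** @ORM\Column(type="integer") */
    private $occurrenceCount; // denormalized count for quick threshold checks
}

/** @ORM\Entity @ORM\Table(name="crash_occurrences") */
class CrashOccurrence
{
    /** @ORM\Id @ORM\Column(type="integer") @ORM\GeneratedValue */
    private $id;

    /** @ORM\ManyToOne(targetEntity="Crash") */
    private $crash;

    /** @ORM\Column(type="integer", nullable=true) */
    private $userId;          // Etsy user id, when we can tie one to the occurrence

    /** @ORM\Column(type="datetime") */
    private $occurredAt;
}
```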



To actually ingest the data from our external provider, we run a cron job every minute that checks for new crashes. This cron runs a simple loop – it loads new crashes from a paginated endpoint, looping through each page and each crash in turn. Each crash is added to a queue so that we can update it asynchronously.
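In rough PHP, the cron boils down to a fetch-and-enqueue loop. The $api, $queue, and $checkpoint objects below are hypothetical wrappers around the provider’s HTTP API and our work queue, not our actual client code:

```php
<?php
// Sketch of the per-minute ingestion cron: page through new crashes from the
// provider and enqueue each one for asynchronous processing.
$lastSeenId = $checkpoint->lastIngestedCrashId(); // hypothetical checkpoint store
$page = 1;

do {
    // Ask the provider only for crashes newer than the last one we ingested.
    $response = $api->getCrashes(['since' => $lastSeenId, 'page' => $page]);

    foreach ($response['crashes'] as $crash) {
        // Enqueue the raw crash so workers can process it asynchronously;
        // the cron itself stays fast and never blocks on the database.
        $queue->push('crashcan.incoming', json_encode($crash));
    }

    $page++;
} while (!empty($response['crashes']));
```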



We run a series of workers that run continuously, monitoring the queue for incoming crashes. Each worker picks a crash off the queue and processes it in several steps: first checking whether we already have the crash, then updating it if we do or creating a new crash if we don’t. We also go through each crash’s occurrences to make sure that we record each one and tie it to an existing user where possible. The flowchart below demonstrates how these workers process crashes.
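A stripped-down version of that worker loop might look like the following; the queue client and repositories are hypothetical stand-ins for our real ones:

```php
<?php
// Rough outline of a queue worker. $queue, $crashRepository, and
// $userRepository are hypothetical wrappers, not Crashcan's actual classes.
while (true) {
    $payload  = $queue->pop('crashcan.incoming'); // blocks until a crash arrives
    $incoming = json_decode($payload, true);

    // Do we already know about this crash (keyed by the provider's id)?
    $crash = $crashRepository->findByExternalId($incoming['id']);
    if ($crash === null) {
        $crash = $crashRepository->createFromProviderData($incoming);
    } else {
        $crashRepository->updateFromProviderData($crash, $incoming);
    }

    // Record every occurrence, tying it to an Etsy user when we can.
    foreach ($incoming['occurrences'] as $occurrence) {
        $user = $userRepository->findByDeviceToken($occurrence['device_token'] ?? null);
        $crashRepository->recordOccurrence($crash, $occurrence, $user);
    }
}
```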

Monitoring & Alerting

After building Crashcan’s initial design and getting crashes ingesting correctly, we quickly realized that we needed utilities to monitor the data ingestion and to alert us when something went wrong. Initially, we had to manually compare crash counts in Crashcan with those our external provider shows in their user interface. Obviously, this was neither convenient nor sustainable, so we began integrating StatsD and Nagios. To check that we were still ingesting all our crashes, we also wrote a script to perform a sort of ‘spot-check’ of our data against our external provider’s – which fails if our data differs too much from theirs.
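The spot-check itself can be as simple as comparing two counts and exiting non-zero when they drift apart. Here’s an illustrative sketch; the client objects and the 5% threshold are made up for the example:

```php
<?php
// Sketch of a spot-check: compare our occurrence count against the provider's
// for a recent window and fail loudly if they diverge too much.
$since = new DateTime('-1 hour');

$ourCount   = $crashRepository->countOccurrencesSince($since); // hypothetical
$theirCount = $api->getOccurrenceCountSince($since);           // hypothetical

$allowedDrift = 0.05; // 5%
if ($theirCount > 0 && abs($ourCount - $theirCount) / $theirCount > $allowedDrift) {
    // Exit non-zero so the Nagios check goes CRITICAL and pages someone.
    fwrite(STDERR, "Crash counts diverged: ours=$ourCount theirs=$theirCount\n");
    exit(2);
}

echo "OK: ours=$ourCount theirs=$theirCount\n";
exit(0);
```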

We created a simple dashboard, linked to StatsD, that allows us to see at-a-glance if the ingestion is going well – or if we’re encountering errors, slowness, or hitting our API limit. While we plan to improve our alerting infrastructure over time, this has been serving us well for now – though before we got our monitoring in a good state, we hit some big snags that kept us from being able to use Crashcan for weeks at a time. There’s an important lesson there: plan for monitoring and alerting from the beginning.
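For the curious, StatsD’s wire format is simple enough that emitting a counter or timer from the ingestion path takes only a few lines over UDP. This is a generic sketch; the metric names and host are placeholders:

```php
<?php
// Minimal StatsD emission over UDP (a sketch). StatsD's wire format is
// "name:value|type", where type is "c" for counters and "ms" for timers.
function statsd(string $metric): void
{
    $socket = socket_create(AF_INET, SOCK_DGRAM, SOL_UDP);
    socket_sendto($socket, $metric, strlen($metric), 0, '127.0.0.1', 8125); // statsd host:port
    socket_close($socket);
}

$start = microtime(true);
// ... ingest a page of crashes ...
statsd('crashcan.ingestion.crashes:25|c'); // counter: crashes ingested this run
statsd(sprintf('crashcan.ingestion.page_time:%d|ms', (microtime(true) - $start) * 1000)); // timer
```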

Application Structure

When deciding on Crashcan’s structure, we decided to focus first on building a stable, straightforward API. This would enable us to expose our crash data to both users and other applications – with one interface for accessing the data. This meant that it was simple for us to build Crashcan’s user interface as a Single Page Application. Very few of the disadvantages of single page applications applied in Crashcan’s case, since our users would only be other Etsy Admin. Building a robust API also enabled us to share the data easily with other applications inside Etsy – most especially with our internal app release platform.

When an Admin accesses Crashcan, we aim to present them with the most broadly applicable information first – the overall health of an app. This is presented through an overview of frequent crashes, common crash categories, and new crashes, along with a graph showing all crashes for the app split out by version. This makes it much easier to spot big new crashes or problematic releases. The Admin then has the option to narrow the scope of their search and view the details of specific crashes.

Ongoing Work

While we’ve finished Crashcan v1 with much of the core functionality and gotten it to a stable enough state that we can depend on its data, there’s still quite a bit we’d like to improve. We haven’t yet begun to implement a couple of the items on our wish list, like custom alerting, and the user interface could use some bug fixes and refinement. Right now, it’s in a mostly usable state that other Etsy Admin can at least tolerate, but it’s not stable or refined enough that we’d be comfortable releasing it to a wider audience.

Additionally, our crash deduplication is still rudimentary. It only performs simple (and expensive) string comparisons to find similar crashes. We’d like to implement more advanced and more efficient crash deduplication using crash signature generation. This would give us a much more reliable way of determining when crashes are related, therefore providing a more accurate picture of how common each crash is.
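As a sketch of what signature generation could look like (an illustration, not what Crashcan does today): normalize the top few stack frames by stripping line numbers and addresses, then hash them so related crashes collapse into a single signature.

```php
<?php
// Illustrative crash signature generation: strip the parts of a stack frame
// that vary between app versions (line numbers, memory addresses), then hash
// the top frames so related crashes share one signature.
function crashSignature(array $stackFrames, int $topFrames = 5): string
{
    $normalized = array_map(function (string $frame): string {
        $frame = preg_replace('/:\d+\)?$/', '', $frame);            // drop trailing line numbers
        $frame = preg_replace('/0x[0-9a-f]+/i', '0xADDR', $frame);  // mask memory addresses
        return trim($frame);
    }, array_slice($stackFrames, 0, $topFrames));

    return sha1(implode("\n", $normalized));
}

// Two occurrences from different app versions, differing only in line numbers,
// end up with the same signature and can be grouped together.
$a = crashSignature(['com.etsy.app.ListingView.load(ListingView.java:120)', '...']);
$b = crashSignature(['com.etsy.app.ListingView.load(ListingView.java:134)', '...']);
var_dump($a === $b); // bool(true)
```

Because the signature is a single indexed string, grouping becomes a cheap equality check rather than the expensive string comparisons we do today.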

Lessons Learned

Most of the pain points in Crashcan’s development weren’t new or especially unexpected, but they serve as a valuable reminder of some important considerations when building new applications.

Build with monitoring and alerting in mind from the beginning. We could’ve avoided a several-week-long lapse in functionality had we focused on building in monitoring from the beginning.

Don’t be afraid to consult with others on structural or technical decisions, and then just make a decision. It’s something that I’ve always struggled with – but getting blocked on making decisions or digging too deep into the minutiae of every decision is a great way to waste time.

Document your assumptions – especially when dealing with APIs – as small assumptions can turn into big deals later on. This is what led to our biggest failure – we mistakenly assumed that crash timestamps were accurate. When a crash said it had occurred 4 days in the future, our app stopped updating, because it was checking only crashes occurring after that crash.
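One possible guard against that failure mode (a sketch, not necessarily how Crashcan handles it now) is to clamp any reported timestamp to the present before advancing the ingestion checkpoint:

```php
<?php
// Illustrative guard: never advance the ingestion checkpoint past "now",
// no matter what timestamp the provider reports for a crash, so a single
// bogus future-dated crash can't freeze ingestion.
function nextCheckpoint(DateTime $lastCheckpoint, DateTime $reportedAt): DateTime
{
    $now = new DateTime();
    $candidate = min($reportedAt, $now);      // clamp future timestamps to now
    return max($candidate, $lastCheckpoint);  // never move the checkpoint backwards
}
```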
