Planetmysql.org

Facebook's Charity Majors Says VividCortex Is Impressive

2015-08-11

A few months ago, we featured Charity Majors, the production engineering manager for Parse at Facebook, on Brainiac Corner. We are featuring Charity and her expertise once again. This time, though, she is reviewing VividCortex: from installation to problem solving to a feature wishlist.

One of our favorite takeaways: “And VividCortex is a DB monitoring system built by database experts. They know what information you are going to need to diagnose problems, whether you know it or not. It’s like having a half a DBA on your team.” And without further ado…

Parse review of VividCortex

Many years ago, when I was but a wee lass trying to upgrade mysql and having a terrible time with performance regressions, Baron and the newly-formed Percona team helped me figure my shit out. The Percona toolset (formerly known as Maatkit) changed my life. It helped me understand what was going on under the hood in my database for the very first time, and basically I’ve been playing with data ever since. (Thanks, I think?)

I’ve been out of the mysql world for a while now, mostly doing Mongo, Redis, and Cassandra these days. So when I heard that Baron’s latest startup VividCortex was entering the NoSQL monitoring space, I was intrigued.

To be perfectly clear, I don’t need VividCortex at the moment, and do not use it for my day-to-day work. Parse was acquired by Facebook two years ago, and the first thing we did was pipeline all of our metrics into the sophisticated Facebook monitoring systems. Facebook’s powerful tools work insanely well for what we need to do. That said, I was eager to take VividCortex for a spin.

Parse workload

First, a little bit of background on Parse. We are a complete framework for building mobile apps. You can use our APIs and SDKs to build beautiful, fully featured apps with core storage, analytics, push notifications, cloud code etc without needing to build your own backend. We currently host over half a million apps , and all mobile application data is stored in MongoDB using the RocksDB storage engine.

We face some particular challenges with our MongoDB storage layer. We have millions of collections and tens of millions of indexes, which is not your traditional Mongo use case. Indexes are intelligently auto-generated for apps based on real query patterns and the cardinality of their data. Parse is a platform, which means we have very little control over the types of queries that enter our systems. We often have to do things like transparently scale or optimize apps that have just been featured on the front page of the iTunes store, or handle spiky events, or figure out complex query planner caching issues.

Basically, Parse is a DBA’s worst nightmare or most delicious fantasy, depending on how you feel about tracking down crazy problems and figuring out how to solve them naively for the entire world.

SO.

VividCortex. I was really curious to see if it could tell me anything new about our systems, given that we have already extensively instrumented them using the sophisticated monitoring platforms at Facebook.

Setup

The setup flow for VividCortex is a delight. It took less than two minutes from generating an account to capturing all metrics for a few machines (the trial period lets you monitor 5 nodes for 14 days). Signup is fun, too: you get a cute little message from the VividCortex team, a tutorial video, and a nudge for how to get live chat support.

I chose to install the agent on each node. You have the option of installing locally or remotely, but you have to install one agent process per monitored node. I sorta wish I could install just one agent, or one per replica set with autodetection for all nodes in the replica set, but as a storage-agnostic monitoring layer this decision makes sense. If I was running this in production, I would probably consider making this part of the chef bootstrap process. It has a supervisor process that restarts the agent if it dies, and the agent polls the VividCortex API to detect any server-side instructions or configuration changes.

I had to input the DB auth credentials, but it automatically detected what type of DB I was running and enabled all the right monitoring plugins — nice touch.

Agent

The agent works by capturing pcaps off the network, reconstructing queries or transactions, and also frequently running “SHOW INNODB STATUS” or “db.serverStatus()” or whatever the equivalent is for that db.

The awesome thing about monitoring over the network is that this gives VividCortex second-level granularity for metrics gathering, and it has less potential impact on your production systems. At Parse we do all our monitoring by tailing logs, reprocessing the logs into a structured format, and aggregating the metrics after that (whether via ganglia or FB systems). This means we have minute-level granularity and often a delay of a couple of minutes before logs are fully processed and stored. On the one hand this means we can use the same unified storage systems for all of our structured logs and metrics, but on the other hand it takes a lot more work upfront to parse the logs, structure the data, and ship it off for storage.

Second-level granularity isn’t a thing that I’ve often longed to have, but it could be that this is just because I’ve never had it before. Also: log files can lie to you. There’s a long-standing bug in MongoDB where the query time logged to disk doesn’t include the time spent waiting to acquire the lock. If you were timing this yourself over the wire, you wouldn’t have this problem. Log files also incur a performance penalty that can be substantial.

Query families

The most impressive feature of VividCortex is really the query family normalization and “top queries” dashboard. As a scrappy startup with limited engineering cycles, this is the most important thing for you to pay attention to. It’s not particularly easy to implement, and every company past a certain scale ends up reinventing the same wheel again. We built something very similar to this at Parse a while back. Before we had it we spent a lot of time tailing and sorting logs, looking for slow queries, running mongotop, sorting by scanned documents and read/write lock time held, and other annoying firefighting techniques.

With the top queries dashboard, you can just watch the list or generate a daily report. Or better yet, train your developers to check it themselves after they ship a change. :)

VividCortex also has a really neat “compare queries” feature, which lets you compare the same query over two different time ranges. This is definitely something we don’t have now, although we can kinda fake it. The “adaptive fault detection” also looks basically like magic, although not yet implemented for MongoDB (it’s a patent-pending method that VividCortex has developed for detecting database stalls).

Live support

Ok, I don’t usually use this kind of thing, but I actually love VividCortex’s built-in live support chat. The techs were incredibly friendly and responsive. We ran into some strange edge cases due to the weirdness of our traffic characteristics, which caused some hiccups getting started. The people manning the chat systems were clearly technical contributors with deep knowledge of the systems, and they were very straightforward about what was happening on the backend, what they had to do to fix it, and when we could expect to get back up and running. Love it.

Things I wish it had

I wish it was easier to group query families, queue lengths and load by replica set, not just by single nodes. If you’re sending a lot of queries to secondaries, you need those aggregate counts. You can get around this by creating an “environment” by hand for each replica set (thanks @dbsmasher!), but that’s gonna get painful if you have more than a few replica sets, and it won’t dynamically adjust the host list for a RS when it changes.

Comments attached to query families. It’s really nice to be able to attach a comment to the query in the application layer, for example with the line number of the code that’s issuing the query.

Some sort of integration with in-house monitoring systems. Like, maybe a REST API that a Nagios check could query for alerting on critical thresholds. This is obviously a pretty complicated request, but my heart longs for it.

Summary

This might be a good time to mention that I’ve always been fairly prejudiced against outsourcing my monitoring and metrics. I hate being paged by multiple systems or having to correlate connected incidents across disparate sources of truth. I still think monitoring sprawl and source-of-truth proliferation is a serious issue for anyone who decides to outsource any or all of their monitoring infrastructure.

But you know what? I’m getting really tired of building monitoring systems over and over again. If I never have to build out another ganglia or graphite system I will be pretty damn happy. Especially since the acquisition, I’ve come to see how wonderful it is when you can let experts do their thing so you don’t have to. And VividCortex is a DB monitoring system built by database experts. They know what information you are going to need to diagnose problems, whether you know it or not. It’s like having a half a DBA on your team.

Monitoring, for me, is starting to cross over that line between “key competency that you should always own in-house” to “commodity service that you should outsource to other companies that are better at it so you can concentrate on your own core product.” In a couple of years, I think we’re all going to look at building our own monitoring pipelines the same way we now look at running our own mail systems and spam filters: mildly insane.

I do still think there’s a lot of efficiencies to aggregating all metrics in the same space. For that reason, I would love to see more crossover and interoperability between deep specialists like VividCortex, and more generalized offerings like Interana and Data Dog, or even on-prem solutions like syncing VividCortex data back to crappy local ganglia instances.

But if I were to go off and do a new startup today? VividCortex would be a really useful tool to have, no question.

Thanks, Charity, for the thoughtful, flattering, and constructive review! See for yourself how VividCortex can revolutionize your monitoring with a free trial.

PlanetMySQL Voting: Vote UP / Vote DOWN