Blog.librato.com

Introducing Composite Metrics

2014-03-26

Today we are excited to announce that the first iteration of our composite metrics feature is now in public beta. Composite metrics allow you to visualize the result of mathematical transformations of your existing metrics. You can think of composite metrics as a domain-specific language for making arbitrarily complex queries against your native metrics. You can compare, transform and summarize native metrics in the same way a data-query language might. How does this help you?

In web operations, the truth can be an elusive quantity, something to be un-riddled from measurements taken in the near past of things like latency, size, and magnitude. Did your recently introduced feature enhance end-user productivity? What is the current rate of churn? How many users across all web servers are using IE6? The truth often hides in the space between our metrics, in ratios, differences, sums, and derivatives; it sometimes requires us to create new knowledge from the bits we already measure.

By mathematically transforming your existing data, composite metrics can help you dig in and analyze it, answering questions like: “what is the current ratio of new user sign-ups to cancellations?”. They can present the same data to different people in different ways, for example displaying the customer cancellations metric as a ratio for engineering and as an absolute count for sales. Composite metrics are useful any time you want to summarize a large body of metrics in a meaningful way. They also help you worry less about choosing a raw data collection methodology, because presentation formatting and transformation can be applied at display time.

Composite metrics can be saved in instruments, and correlated with native metrics in the same instrument. You can even assign properties like y-axis titles and tool-tip aliases to them just like you do to native metrics. They are available now in the correlation or instrument interfaces to our metrics app. Just open up an instrument builder and click the beaker-shaped icon in the upper right (or hit ‘c’ on your keyboard while in an instrument builder). Clicking the beaker icon opens up the composite metrics composer, a text field into which you can type a query that will create a composite metric stream in the graph. Queries are composed of time series wrapped in functions in the general form:

Let's take a quick tour of the DSL used to create and work with composite metrics.

Sets and the Series() Function

Composite metrics are created from sets and functions. A time series is a stream of time-aligned measurements of the sort that is identified by a combination of metric name and source name in our metrics platform. A set is a list of time series. For example, if you have a metric called cpu.load.5min, and three sources (server1, server2 and server3) which emit that metric, then the combination of the metric name and wild-card: cpu.load.5min,server* specifies a set of three time series: cpu.load.5min across each of the three different servers.
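To make the idea of a set concrete, here is a minimal Python sketch of how a metric name plus a wildcarded source could select several time series. It is a model of the concept only, not the composite DSL or our API: the `AVAILABLE` inventory and the `select()` helper are hypothetical names invented for this illustration, and the glob matching is borrowed from Python's `fnmatch` module.

```python
from fnmatch import fnmatchcase

# Hypothetical inventory of (metric name, source name) pairs.
AVAILABLE = [
    ("cpu.load.5min", "server1"),
    ("cpu.load.5min", "server2"),
    ("cpu.load.5min", "server3"),
    ("cpu.load.10min", "server1"),
    ("memory.free", "server1"),
]


def select(metric_pattern, source_pattern):
    """Return every (metric, source) pair matching both glob patterns."""
    return [
        (metric, source)
        for metric, source in AVAILABLE
        if fnmatchcase(metric, metric_pattern)
        and fnmatchcase(source, source_pattern)
    ]


# One metric name and a wildcarded source yield a set of three series.
for metric, source in select("cpu.load.5min", "server*"):
    print(metric, source)
```

The same helper accepts a wildcard in the metric position, which is how a single pattern can pull in several related metrics at once.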

The series() function is used to retrieve sets of measurements from your existing metrics data. It takes three arguments: the first (metric name) and second (source name) are required, and the third is optional (we’ll get to that in a moment). Continuing our system load metric example, this call...

... will retrieve the set of measurements from all three servers, while this call...

... will retrieve only the set of measurements from server1. You can use wildcards in the metric name too. So a series() invocation like this...

... will retrieve a set containing the 5, 10, and 15 minute cpu load metrics from all servers.

Let's familiarize ourselves with the series() function by using it to replicate an existing metric. Continuing the load average example, I created an instrument, and added the collectd.load.load.shortterm metric from several servers. Then, clicking the beaker icon in the upper right to bring up the composite metric composer, I manually retrieved the same data using the series() function.

First, you may have noticed in the screenshot that I just use the shorthand s instead of series. You may type out series, if you want, but the s alias is a handy shortcut because every composite metric contains at least one call to series().

Although both data sets seem to have the same general shape, they are not sharing a Y-axis, so they aren’t overlapping each other.

I can force them to overlap by editing the display properties for the data streams to share a common Y-axis. I can edit the display properties for both metrics simultaneously by checking them both and clicking the gears icon in the lower left, or individually by clicking the gears icons to the right of each metric name.

… And then configure them both to share the same Y-axis.

Now the two sets line up on the graph, but the data I've retrieved using series() is plotting a little bit differently. The two spikes at around 16:10, for example, are connected in the composite set. This is because a different summary period has been configured for each stream.

Ranges, Periods, and Summaries

Think of an instrument as a piece of graph paper that has a box for each data point, where you color in the boxes to depict your data set. The date range of the instrument is the physical length of the paper, and the period is the size of the boxes. When the box size used by the graph paper doesn't match the interval at which you collected your data -- if, for example, you have a data point for every 10-second increment, but the paper you're coloring on only has a box every 60 seconds -- then you have 5 more data points per box than you can fit in the available space, and you'll need to summarize your data to fit it into the period used by the instrument.

When we summarize data, we do so by creating one data point that represents the set of data points that won't fit in the box. We can do this by averaging or adding all of those data points together, or by throwing out all but the smallest, or largest. We could even just tell you how many data points there are, but in the end, we'll have one data-point that accurately sums-up all of the data-points we couldn't draw individually. The difference between the metric streams in the above graph is that one time-series is using a 60-second period, and the other is using a 30-second period, so the yellow time-series is displaying more detail, because it has smaller boxes to work with.
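The graph-paper idea can be sketched in a few lines of Python. This is an illustrative model rather than how our platform implements roll-ups: the `summarize()` helper, the `SUMMARIES` table, and the sample data are all made up for the example, and points are assumed to be `(timestamp, value)` pairs with timestamps in seconds.

```python
from collections import defaultdict

# The summary methods named in this post, modeled as plain functions.
SUMMARIES = {
    "average": lambda values: sum(values) / len(values),
    "min": min,
    "max": max,
    "count": len,
    "sum": sum,
}


def summarize(points, period, function="average"):
    """Roll (timestamp, value) points up into one point per period-second box."""
    boxes = defaultdict(list)
    for timestamp, value in points:
        # Every point lands in the box that starts at a multiple of `period`.
        boxes[timestamp - timestamp % period].append(value)
    reduce_box = SUMMARIES[function]
    return [(start, reduce_box(values)) for start, values in sorted(boxes.items())]


# One made-up measurement every 10 seconds for two minutes.
raw = [(t, float(t // 10)) for t in range(0, 120, 10)]

print(summarize(raw, 60))         # two 60-second boxes, averaged
print(summarize(raw, 30))         # four 30-second boxes, more detail
print(summarize(raw, 60, "max"))  # largest point in each 60-second box
```

With a 60-second period six raw points share each box, while a 30-second period leaves only three per box, which is why the stream with the shorter period shows more detail.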

I can fix this by configuring the period in the display properties of either stream (again using the gears icon to the right of the metric name):

Once both streams are using the same period, they line up.

The third parameter passed to series() is a curly-braced set of options that modify the display-time properties we’ve been speaking about. You can use this to specify the summary method and period that your composite metric uses when the series() function plots it. The Function option sets the summary function to one of average, min, max, count, or sum; it’s set to average by default. In our current example, if we wanted to retrieve all load-related metrics from all sources, and summarize the measurements over a 30-second display period by taking the maximum in each 30-second interval, we would use:

Sets of sets

Finally, we can create sets of sets by square-bracketing series() calls. We can use this to specify multiple series as a single argument to functions like divide(), as we’ll see shortly, or to make more specific typeglobs. For example, to specify the 5 and 15 minute load averages, but not the 10 minute average, we could use:

Dynamic sources

When you embed a metric in an instrument, our user interface supports the concept of dynamic sources, which is a fancy way of saying that you'll specify the source later on when you actually want to view the data. You can use dynamic sources in your composite metric definitions so they will work with your dynamically sourced templates and dashboards by using % in lieu of the source argument to series(). We could make our set of load averages dynamic like so:

Aggregation and Transformation Functions

Composite metrics are created by applying transformation and aggregation functions to the native time-series data returned from series(). Each function takes a set as input and returns a set as output, so functions can be nested ad infinitum. At the moment there are seven functions.

sum()

The sum() function aggregates the input set down to a single series by adding together the measurements at each time interval. You get a single time series consisting of all the input series added together.
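As a rough Python model of that behavior (not the DSL itself), each series below is a dictionary keyed by timestamp, and the set is collapsed by adding the values that share a timestamp. The `aggregate()` helper and the per-server error counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def aggregate(series_set, combine):
    """Collapse a set of series into one by combining values at each timestamp."""
    timestamps = sorted({t for series in series_set for t in series})
    return {
        t: combine([series[t] for series in series_set if t in series])
        for t in timestamps
    }


# Made-up per-server counts of HTTP 4xx responses, keyed by timestamp.
http_4xx = [
    {0: 3, 60: 5, 120: 2},  # server1
    {0: 1, 60: 0, 120: 4},  # server2
    {0: 2, 60: 2, 120: 2},  # server3
]

total_4xx = aggregate(http_4xx, sum)
print(total_4xx)  # {0: 6, 60: 7, 120: 8}
```

Three series go in and one comes out, with one value per time interval.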

If, for example, you had several metrics that tracked the occurrences of HTTP response codes in your logs, you could get a total count of HTTP 400-series errors using the sum function like so:

You could use a set of sets to capture 400 and 500's like so:

subtract()

The subtract() function takes a set of exactly two time series and returns the result of subtracting the second from the first.
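A small Python sketch of the same idea, under the assumption that both series are dictionaries keyed by timestamp and that only timestamps present in both are compared. The `subtract()` helper here is a stand-in written for this example, and the response counts are invented.

```python
def subtract(series_pair):
    """Subtract the second series from the first at each shared timestamp."""
    if len(series_pair) != 2:
        raise ValueError("subtract() needs exactly two series")
    first, second = series_pair
    return {t: first[t] - second[t] for t in sorted(first) if t in second}


# Made-up totals: every response, and the subset that returned HTTP 200.
all_responses = {0: 100, 60: 120, 120: 90}
ok_responses = {0: 97, 60: 111, 120: 90}

non_200 = subtract([all_responses, ok_responses])
print(non_200)  # {0: 3, 60: 9, 120: 0}
```

Order matters: swapping the two inputs flips the sign of every result.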

Continuing from the last example, if you tracked the occurrences of HTTP response codes in your logs, you could get a count of all non-200 responses by combining sum() and subtract() in a set of sets like so:

max()

The max() function aggregates the input set down to a single series by discarding all but the largest measurement at each time interval. You get a single time series in which each point is the largest measurement found across the input series for that interval.

If you wanted to monitor load average site-wide with a single line, you could use the max() function to display the site-wide maximum load average.

min()

The min() function aggregates the input set down to a single series by discarding all but the smallest measurement in each time interval. You get a single time series in which each point is the smallest measurement found across the input series for that interval.

You could add a lower bound to the site-wide load average graph with a call to min() like so...

mean()

The mean() function aggregates the input set down to a single series by computing the mean of all measurements in each time interval. You get a single time series consisting of the mean of the input series.
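Like max() and min(), mean() works across the series in a set, one time interval at a time. The Python sketch below models all three with a single hypothetical `aggregate()` helper. The load values are invented, and the dictionaries keyed by timestamp are just a convenient stand-in for time series.

```python
from statistics import mean


def aggregate(series_set, combine):
    """Collapse a set of series into one by combining values at each timestamp."""
    timestamps = sorted({t for series in series_set for t in series})
    return {
        t: combine([series[t] for series in series_set if t in series])
        for t in timestamps
    }


# Made-up load averages from three servers, keyed by timestamp.
load = [
    {0: 0.5, 60: 1.0, 120: 2.5},
    {0: 1.5, 60: 1.0, 120: 0.5},
    {0: 1.0, 60: 4.0, 120: 3.0},
]

upper = aggregate(load, max)     # busiest server at each interval
lower = aggregate(load, min)     # quietest server at each interval
typical = aggregate(load, mean)  # average across servers at each interval

print(upper)
print(lower)
print(typical)
```

Plotted together, the three results give an upper bound, a lower bound, and a middle line for the whole set.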

Adding the mean to our site-wide load average graph with mean() like so...

... yields a graph like the one below, which represents the CPU load of every host in our entire production infrastructure:

derive()

The derive() function is useful when you want to view the rate of change of a given metric -- how much it increases or decreases from one measurement to the next. It transforms the input set by computing the derivative of each series in the set. Unlike the functions we've discussed thus far, derive() doesn't aggregate or combine the input set; you get as many series out as you put in. In order to avoid certain edge cases involving late-arriving data, we recommend that you push calls to derive() to the lowest, or innermost, possible level, e.g.:

is preferable to:

You can track the rate of customer signups by using derive() to graph the rate of change in a total user count metric.
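The Python sketch below models derive() as the change between consecutive measurements and shows one way late-arriving data can distort the result when derive() is applied outside an aggregation. It rests on assumptions: the counters are invented, the `derive()` and `sum_series()` helpers are stand-ins written for this example, and a measurement that has not arrived yet is simply left out of the sum, which may not match how our platform treats gaps.

```python
def derive(series):
    """Change in value from each measurement to the next."""
    times = sorted(series)
    return {
        later: series[later] - series[earlier]
        for earlier, later in zip(times, times[1:])
    }


def sum_series(series_set):
    """Add together whichever series have a value at each timestamp."""
    timestamps = sorted({t for series in series_set for t in series})
    return {
        t: sum(series[t] for series in series_set if t in series)
        for t in timestamps
    }


# Made-up, always-increasing signup counters from two servers.
server1 = {0: 100, 60: 110, 120: 120}
server2 = {0: 200, 60: 210}  # the measurement for t=120 has not arrived yet

inner = sum_series([derive(server1), derive(server2)])  # derive innermost
outer = derive(sum_series([server1, server2]))          # derive outermost

print(inner)  # {60: 20, 120: 10}
print(outer)  # {60: 20, 120: -200}
```

With derive() innermost, the missing point only lowers the latest total. With derive() outermost, the temporary dip in the summed counter shows up as a large negative rate.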

divide()

It's often useful to calculate the ratio of two metrics. The divide() function takes a set of exactly two time series and returns the result of dividing the first by the second (the first series is the dividend and the second is the divisor). Continuing the last example, if you were tracking sign-ups and churn on a per-server basis, you could track the ratio of new customers to cancellations by first aggregating all of the per-server stats with sum(), and then dividing the results like so:

A real-world use-case

The Collectd server-monitoring daemon reports CPU usage broken out per-core and per-type like this:

Assuming we want to graph the total system CPU usage on a server server-instance-1, we'd start constructing our composite metric by retrieving the native metrics that detail system CPU usage for each core:

Collectd reports its CPU usage as an always-increasing counter of time consumed, so to see the rate of change we'll use the derive() function:

To get a total usage for the server instance, we need to aggregate all of the per-core series. We can do this with the sum() function:

Now we can see how the system CPU usage changes over time, but it's still in rather abstract units of time. What we really want is the system CPU usage as a percentage of the total capacity available on the server. We can take a ratio of the system CPU usage to the total CPU usage (including idle) using divide(). Note that we used the set of sets notation to specify the inputs to divide():

We can make our composite more widely applicable by switching to a dynamic source and more terse by using the s() shorthand for series():

Finally, by clicking the gears icon in the graph-builder, we can add a display transform to change the y-axis magnitude to reflect 0-100 percent (instead of 0-1).
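The steps in this section can be modeled end to end in Python: derive each counter, sum across cores, divide system usage by total usage, and scale to a percentage. This is a sketch of the arithmetic only. The helper functions, the two-core layout, and every counter value are assumptions made up for the example, and the final multiplication stands in for the display transform.

```python
def derive(series):
    """Change in value from each measurement to the next."""
    times = sorted(series)
    return {b: series[b] - series[a] for a, b in zip(times, times[1:])}


def sum_series(series_set):
    """Add together whichever series have a value at each timestamp."""
    timestamps = sorted({t for series in series_set for t in series})
    return {t: sum(s[t] for s in series_set if t in s) for t in timestamps}


def divide(series_pair):
    """Divide the first series by the second, skipping missing or zero divisors."""
    dividend, divisor = series_pair
    return {t: dividend[t] / divisor[t] for t in sorted(dividend) if divisor.get(t)}


# Made-up cumulative CPU counters for two cores, keyed by timestamp.
system = [{0: 1000, 60: 1030, 120: 1050}, {0: 2000, 60: 2020, 120: 2070}]
user = [{0: 500, 60: 550, 120: 580}, {0: 700, 60: 730, 120: 800}]
idle = [{0: 5000, 60: 5070, 120: 5100}, {0: 6000, 60: 6050, 120: 6130}]

# derive() sits innermost, then the per-core rates are summed.
system_rate = sum_series([derive(s) for s in system])
total_rate = sum_series([derive(s) for s in system + user + idle])

ratio = divide([system_rate, total_rate])
percent = {t: round(value * 100, 1) for t, value in ratio.items()}

print(system_rate)  # {60: 50, 120: 70}
print(total_rate)   # {60: 250, 120: 280}
print(percent)      # {60: 20.0, 120: 25.0}
```

Because the ratio is unitless, the abstract time units of the raw counters cancel out, leaving a figure that is easy to compare across servers.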

More to come

We think you’ll find our new composite metrics interface empowering because it lets you leverage your existing data to create fundamentally new knowledge. Going forward we’ll be improving this interface by, for example, adding more transformative functions, each of which will further enhance the power of your existing data. If you like what you see, or have a feature that you’d like to see implemented, we’d appreciate your feedback. Let us know.