2016-08-24

Prometheus is an open source monitoring and alerting system. Its powerful query language, used to retrieve time series data, can also be employed when defining alerts. Alerts actively notify users of irregular system conditions by sending messages to a variety of integrations such as Slack or PagerDuty.

In this post, we walk through some of the metrics exposed by the new etcd v3 distributed key-value store, explain how these statistics reflect the health of a cluster, and show how to define alerts for notable conditions in etcd. Most of the idioms for building alerts can be taken from this example and adapted to write alerting rules for your own or third-party applications that expose Prometheus metrics.

A note on PromCon Berlin

If you are planning to attend PromCon in Berlin, August 25-26, please meet the team at one of these sessions:

On Thursday, August 25 at 4:00 p.m. CEST, Jon Boulle (@baronboulle), Berlin site lead at CoreOS, will show how Prometheus helped us get a 10x performance improvement for Kubernetes.

On Friday, August 26 at 1:45 p.m. CEST, Fabian Reinartz, software engineer at CoreOS, will give an in-depth overview of alerting in the Prometheus universe.

We will also be available throughout both days to answer your questions about CoreOS, Prometheus, Kubernetes, and more.

Prometheus setup

Grab the latest Prometheus and etcd. At the time of this writing, those were v1.0.1 for Prometheus and v3.0.6 for etcd. To make it easier to run and manage these services, we will use goreman.

On Linux, you can fetch the three necessary binaries with these commands:
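One way to do this, assuming the standard GitHub release tarballs for the versions above and a Go toolchain for goreman (the exact URLs, archive layouts, and member paths are assumptions worth double-checking against the release pages):

```
# Prometheus v1.0.1 (linux-amd64)
curl -L -O https://github.com/prometheus/prometheus/releases/download/v1.0.1/prometheus-1.0.1.linux-amd64.tar.gz
tar xzf prometheus-1.0.1.linux-amd64.tar.gz --strip-components=1 prometheus-1.0.1.linux-amd64/prometheus

# etcd v3.0.6 (linux-amd64)
curl -L -O https://github.com/coreos/etcd/releases/download/v3.0.6/etcd-v3.0.6-linux-amd64.tar.gz
tar xzf etcd-v3.0.6-linux-amd64.tar.gz --strip-components=1 etcd-v3.0.6-linux-amd64/etcd

# goreman (assumes a Go toolchain and a set GOPATH; a prebuilt release binary works as well)
go get github.com/mattn/goreman
cp "$GOPATH/bin/goreman" .
```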

On OS X, that would be:
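Roughly the same approach works with the darwin builds; again, the URLs and archive names are assumptions based on the usual release layout (note that the etcd darwin release ships as a zip archive):

```
# Prometheus v1.0.1 (darwin-amd64)
curl -L -O https://github.com/prometheus/prometheus/releases/download/v1.0.1/prometheus-1.0.1.darwin-amd64.tar.gz
tar xzf prometheus-1.0.1.darwin-amd64.tar.gz --strip-components=1 prometheus-1.0.1.darwin-amd64/prometheus

# etcd v3.0.6 (darwin-amd64)
curl -L -O https://github.com/coreos/etcd/releases/download/v3.0.6/etcd-v3.0.6-darwin-amd64.zip
unzip -j etcd-v3.0.6-darwin-amd64.zip etcd-v3.0.6-darwin-amd64/etcd

# goreman (assumes a Go toolchain and a set GOPATH)
go get github.com/mattn/goreman
cp "$GOPATH/bin/goreman" .
```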

Note that for both Linux and OS X, these setup steps are only suitable for quickly getting started, and are not intended to be used in production environments.

After running those commands, we end up with the binaries for the Prometheus server, etcd, and goreman in the current directory. Download the Procfile run by goreman into the same directory.
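If you don't have the referenced Procfile at hand, a minimal sketch could look like the following; the member names, ports, and flags are assumptions to adjust for your environment:

```
# Procfile for goreman: a local three-member etcd cluster plus Prometheus
etcd1: ./etcd --name infra1 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:12380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new
etcd2: ./etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new
etcd3: ./etcd --name infra3 --listen-client-urls http://127.0.0.1:32379 --advertise-client-urls http://127.0.0.1:32379 --listen-peer-urls http://127.0.0.1:32380 --initial-advertise-peer-urls http://127.0.0.1:32380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new
prometheus: ./prometheus -config.file=prometheus-config.yaml
```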

The etcd cluster is already completely configured by the commands in the Procfile. We’ll need to create a Prometheus configuration file called prometheus-config.yaml in the same directory, to which we add the following content:
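A minimal configuration along these lines should do; the job name and client ports here match the Procfile sketch above and are assumptions:

```
global:
  scrape_interval: 10s

scrape_configs:
  - job_name: 'etcd'
    static_configs:
      - targets: ['localhost:2379', 'localhost:22379', 'localhost:32379']

rule_files:
  - 'etcd.rules'
```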

This way Prometheus is configured to scrape the metrics of each individual etcd instance in the cluster. Furthermore, the rule file etcd.rules has already been declared, to which we will add alerting rules later.

We can now start an etcd cluster with three peers, as well as the Prometheus server, with a single command:
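Assuming goreman, the Procfile, and the binaries all live in the current directory:

```
./goreman -f Procfile start
```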

Whenever you want to make Prometheus reload the rule file without stopping all the processes, you can send the SIGHUP signal to the Prometheus process:
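For example (the exact process-lookup command is up to you; this pattern matches any process named prometheus):

```
pkill -HUP prometheus
```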

Now we have a functioning etcd cluster monitored by Prometheus. Let’s continue by looking at the exposed metrics.

Available etcd metrics

Metrics have been largely reworked for the etcd v3 release, giving a good overview of the state of a running cluster. Typically, alerts are grouped into at least two categories, for example: warning and critical. Warnings hint at conditions that could lead to outages but do not need immediate attention, as the availability of the cluster is not immediately affected. Critical alerts indicate that the cluster or single instances are non-functional.

In this post, we will be looking at these metrics in particular:

up

etcd_server_has_leader

etcd_http_successful_duration_seconds_bucket

etcd_http_failed_total and etcd_http_received_total

process_open_fds and process_max_fds

The /metrics endpoint of each instance can be called to view the metrics a specific etcd instance exposes.
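For example, for the first member of the local cluster sketched above (the client port is an assumption):

```
curl http://localhost:2379/metrics
```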

Writing alerting rules

Alerting rules are defined with a custom syntax based on the Prometheus expression language. Expressions written and tested in the Prometheus web UI can be used directly to implement alerts.

An alert definition starts with the ALERT keyword followed by an alert name of your choosing followed by the IF keyword paired with a PromQL vector expression. Those are the minimum requirements for an alerting rule.

The FOR keyword followed by a duration defines how long the PromQL expression has to keep evaluating to true before the alert becomes active.

LABELS are used to label alerts, similar to time series; they can be used to differentiate how to proceed with an alert. For example, a severity label could define whether an alert is only a warning that should simply be displayed on a dashboard, or critical enough to page someone to take care of the problem immediately.

ANNOTATIONS, like LABELS, are simply a set of key/value pairs. Annotations contain additional information about an alert but do not define its identity. Templating can be used to attach meaningful descriptions and more.
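Putting these pieces together, a small example in the Prometheus 1.x rule syntax looks like this (the alert name, duration, and label values are arbitrary choices for illustration):

```
ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS {
    severity = "warning"
  }
  ANNOTATIONS {
    summary = "instance down",
    description = "Instance {{ $labels.instance }} has been down for more than 5 minutes."
  }
```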

When an alert fires, the templated description will then contain the specific instance, making it quick to trace the alert to its origin.

Now that we have covered how to write alerting rules we can jump in and develop a few example alerts based on the metrics named above.

Unsafe number of peers in the cluster

The up metric is a synthetic metric, as it is not directly exposed by etcd; instead, it indicates whether Prometheus was able to scrape the /metrics endpoint of a target. Its value is 0 when the target is down and 1 when it is up. Going to the console in the Prometheus web UI and querying the up metric should give you a good idea of it.



Querying the `up` metric in the console on the Prometheus web UI

As you can see, the up metric has two labels attached to it. The instance label describes which HTTP endpoint was used to scrape this metric. The job label was defined in the prometheus-config.yaml file during the setup.

An etcd cluster must have a majority of members operating to make forward progress. In other words, if more than half the members are healthy, the cluster is still available. We want to alert if the cluster is in a state where it is still functioning, but one more failure would result in unavailability.
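A rule expressing this condition, in the spirit of CoreOS's published etcd alerts, could look like the following sketch (the job label matches the job_name from the configuration above):

```
ALERT InsufficientPeers
  IF count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
  FOR 3m
  LABELS {
    severity = "critical"
  }
  ANNOTATIONS {
    summary = "etcd cluster has insufficient peers",
    description = "If one more etcd peer goes down the cluster will be unavailable"
  }
```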

Let’s go through that. The alert name is “InsufficientPeers.” It alerts if etcd is one lost member away from losing majority.

To be resilient to a single scrape failure, we choose a three-minute interval for which the condition has to hold before the alert actually fires.

Additionally, we add a label to indicate the severity. In this case it is a critical alert, as one more peer failing would result in the cluster being unavailable.

Let’s add the rule to the etcd.rules file in the same directory as all other files and reload the Prometheus configuration.



Healthy InsufficientPeers alert

Since all peers are available, the alert is not active. We can change that by stopping one of our etcd processes:
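With the Procfile sketch above, one way to do that is via goreman itself (the member name etcd2 is an assumption; killing the process directly works just as well):

```
./goreman run stop etcd2
```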

For three minutes, the alert will be “pending.” If the alerting condition remains true for the configured three minutes, it will become “firing.”



Active and firing InsufficientPeers alert

To return your cluster to a stable state that does not produce alerts, you can restart the stopped etcd process.

Leaderless peer

The etcd_server_has_leader metric describes whether this particular instance has a leader or not. This metric is a gauge, which means its value can increase and decrease. It uses the value 1 to indicate the presence of a leader and 0 if not. No leader results in the peer not being able to perform writes, as it can’t submit proposals to the cluster. That situation makes it a perfect fit for use in an alert.

Having no leader is normal behavior during leader election, so a single occurrence shouldn’t immediately trigger an alert. Instead, we will only trigger an alert if the leader has been missing for longer than a minute.

The resulting alerting rule for that behavior might look like this:
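Again a sketch in the 1.x rule syntax; the alert name and severity are choices, not prescriptions:

```
ALERT NoLeader
  IF etcd_server_has_leader{job="etcd"} == 0
  FOR 1m
  LABELS {
    severity = "critical"
  }
  ANNOTATIONS {
    summary = "etcd member has no leader",
    description = "etcd member {{ $labels.instance }} has had no leader for longer than a minute"
  }
```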

etcd HTTP requests responding slowly

etcd exposes the response time of successful requests to its HTTP API through the etcd_http_successful_duration_seconds_bucket metric. It generates buckets labelled with the HTTP method and an upper bucket limit. The bucket limits are represented with the label “le”, which stands for “less than or equal to”; requests counted in one bucket are therefore also counted in all buckets with higher limits.

This metric will only show up once there have actually been HTTP requests to the etcd HTTP API. Let’s insert a key/value pair:
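For example, against the v2 HTTP API of the first member (the port and API version are assumptions based on the setup above):

```
curl -L http://localhost:2379/v2/keys/foo -XPUT -d value=bar
```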

Now we will see the metric showing up on the /metrics endpoint.
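You can verify this by filtering the metrics output, for example:

```
curl -s http://localhost:2379/metrics | grep etcd_http_successful_duration_seconds_bucket
```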

Being a key/value store, etcd is typically a backing service for other applications, so when requests are slow, the applications utilizing it will likely be impacted. To catch slow responses early, you want to alert on etcd’s response time as well. It will also help when troubleshooting why an application built on top of etcd is responding slowly.

Unfortunately, latency is hard to represent as a relative value; it depends highly on the cluster’s environment as well as its usage. The following alert is an example of how these histogram metrics can be used to calculate the 99th percentile of request duration: if that 99th percentile exceeds 150 milliseconds, the alert is triggered.
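A sketch of such an alert; the 150 ms threshold and the 10-minute FOR duration are example values to tune for your environment:

```
ALERT HTTPRequestsSlow
  IF histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15
  FOR 10m
  LABELS {
    severity = "warning"
  }
  ANNOTATIONS {
    summary = "slow HTTP requests",
    description = "on etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method }} are slow"
  }
```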

High number of failed requests

By writing a key into etcd with the command above, we have already increased the counter of HTTP requests received in total. Here, however, we are interested in failed HTTP requests, so we have to generate an error. We will try to retrieve a key that does not exist in etcd:
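For example (the key name is a hypothetical one; any request for a key that does not exist will do):

```
curl http://localhost:2379/v2/keys/this-key-does-not-exist
```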

Now etcd_http_failed_total has increased, and a snippet of the relevant metrics looks something like this:
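The output will be roughly along these lines; the exact counter values and label sets depend on the traffic your cluster has seen:

```
etcd_http_failed_total{code="404",method="GET"} 1
etcd_http_received_total{method="GET"} 1
etcd_http_received_total{method="PUT"} 1
```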

So that alerts do not fire on arbitrary events, it is typically better not to alert directly on a raw sampled value, but to aggregate and define a relative threshold instead of a hardcoded one. For example: send a warning if 1% of the HTTP requests fail, instead of sending a warning if 300 requests failed within the last five minutes. A static value would also require a change whenever your traffic volume changes.

We can declare such a ratio using PromQL:
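One way to express it (the metric names come straight from etcd; the grouping is explained below):

```
sum(rate(etcd_http_failed_total[5m])) by (method)
  / sum(rate(etcd_http_received_total[5m])) by (method)
```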

Let’s examine the statement. Most important is the use of the rate function, which calculates the per-second rate of change over the last five minutes. We then sum those rates by HTTP method and divide them by the rate of HTTP requests received in total. rate is essentially used to smooth the time series here, so that quick spikes do not trigger false-positive alerts.

Note that we used the by clause to group the sums by HTTP method. As described in the documentation, aggregations by default remove all dimensionality; the by or without clauses can be used to explicitly define which dimensions to keep or to drop. In this case we group by the HTTP method, as throwing away that dimension would hide the case where a single method, say PUT, keeps failing.

Using some synthetic failed and successful requests, a graph of this statement looks like this.

Chart of the ratio of failed to received HTTP requests

To alert on a specific ratio, the above PromQL statement can be embedded in an alerting rule:
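For example, firing a warning when more than 1% of requests per method fail (the threshold and FOR duration are, again, choices to adapt):

```
ALERT HighNumberOfFailedHTTPRequests
  IF sum(rate(etcd_http_failed_total[5m])) by (method)
     / sum(rate(etcd_http_received_total[5m])) by (method) > 0.01
  FOR 10m
  LABELS {
    severity = "warning"
  }
  ANNOTATIONS {
    summary = "a high number of HTTP requests are failing",
    description = "{{ $labels.method }} requests to etcd are failing at an elevated rate"
  }
```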

File descriptor exhaustion

Keeping track of used and maximum file descriptors is important when running etcd. File descriptor exhaustion leads to an instance becoming unavailable and requiring a restart to function properly again. Luckily, these metrics are not specific to etcd; they are provided by the official Prometheus Go client library, which makes it easy to use them in your own applications written in Go as well.

The caveat, however, is that we don’t want to alert when the file descriptors are already exhausted, nor when an arbitrary threshold is hit, but when we are going to hit the limit soon. Prometheus has the predict_linear function for this use case. Based on historic samples, it can predict whether the file descriptors will be exhausted within a certain time.

Let’s say we want an alert if the file descriptors will be exhausted within the next four hours:
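A sketch following the pattern in CoreOS's published etcd alerts: first a recording rule for the utilization ratio, then an alert that extrapolates one hour of history four hours ahead:

```
# pre-compute the ratio of open to maximum file descriptors per instance
instance:fd_utilization = process_open_fds / process_max_fds

# alert if the current trend predicts exhaustion within the next four hours
ALERT FdExhaustionClose
  IF predict_linear(instance:fd_utilization[1h], 3600 * 4) > 1
  FOR 10m
  LABELS {
    severity = "warning"
  }
  ANNOTATIONS {
    summary = "file descriptors soon exhausted",
    description = "instance {{ $labels.instance }} will exhaust its file descriptors soon"
  }
```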

Note that we used a recording rule here. As the ratio of open file descriptors to the maximum number of file descriptors is not available as its own time series, we need a recording rule that makes Prometheus pre-calculate it. The [1h] range then makes the predict_linear function base its prediction on the ratio’s values from the last hour.

What’s next?

If you want to develop further alerts for your etcd cluster, the full set of metrics exposed by etcd is covered in the etcd metrics documentation.

We have only covered a fraction of Prometheus’s functionality; to understand and build more alerting rules, have a look at the documentation. Prometheus itself, however, does not aggregate alerts or send them over your favorite notification channel (PagerDuty, Slack, email, etc.); the Prometheus Alertmanager is responsible for that. Stay tuned for posts on how to effectively handle a stream of incoming alerts with Alertmanager!

We have assembled a set of standard alerts for etcd. You’ll want to review the thresholds and alert labels depending on your requirements and environment, but this set is a good base to start from.

If you have questions, meet us at PromCon in Berlin this week, find us on IRC, or tweet to the CoreOS team @CoreOSLinux.
