2013-12-13

In the second post of this series on Preventive Analytics, we looked at the plethora of problems with the Obamacare website launch. We then offered six potential ways IT Operational Analytics could help fix this by using insight into infrastructure performance to identify potential bottlenecks. This would allow developers to eliminate the current issues before the enrollment deadline crush. Fortunately for them, they still have the time and the resources to do that.

However, for most of us, the situation is probably a little different: we have neither the financial resources nor the unlimited supply of Subject Matter Experts that the Obamacare effort enjoys. We need to get it right the first time, make sure our environment is working normally, and continue to make changes as we detect abnormalities so that we can meet our Service Level Agreements.

But what is normal? The modern datacenter is so elastic and dynamic that it can be overwhelming trying to gauge the relevance of what seems at first glance to be an abnormality.

Let’s say the CPU utilization of one of your VMs just jumped to over 80%. You imaged this VM to operate as a web server supporting your online store. But you can’t tell whether this sudden uptick is really a problem. After all, you are using a total of 26 VM-based web servers to support this load, so in theory the load should be balanced across the farm and the utilization mostly unchanged. In what contexts would this unexpected jump be considered “normal?” Big Data analysis of your historical metrics could help, but do you have that data? If you do, you are one of the lucky few.
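To make that concrete, here is a minimal sketch (in Python, with made-up VM names and readings) of how you might check whether one web server really is out of line with the rest of a supposedly load-balanced farm:

```python
# Hypothetical example: is one web server's CPU out of line with its peers?
# VM names, readings, and the z-score cutoff are all illustrative assumptions.
from statistics import mean, stdev

farm_cpu = {
    "web-01": 42.0, "web-02": 39.5, "web-03": 44.1,
    "web-04": 81.3,                     # the suspect VM
    "web-05": 40.7, "web-06": 43.2,     # ...and so on across the 26-VM farm
}

def outliers(readings, z_threshold=1.5):
    """Flag VMs whose CPU utilization sits well above the farm average."""
    values = list(readings.values())
    avg, sd = mean(values), stdev(values)
    return {vm: cpu for vm, cpu in readings.items()
            if sd > 0 and (cpu - avg) / sd > z_threshold}

print(outliers(farm_cpu))   # {'web-04': 81.3} - the jump is not farm-wide
```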

Even with a unified monitoring solution that can monitor across your entire environment, it is hard to figure out what constitutes “normal” for a given service or application. Traffic and volume can vary depending on the time of day, day of week, week of month, and even time of year! When you get an alert, should you jump on it and work to fix it or make a determination that this is the “new normal” and reset your thresholds?

This is where IT Operational Analytics comes in handy. By analyzing your historical data and putting intelligence around it, you can figure out what is normal and use trends and patterns to identify the “new normal.” You can then continue to tune your infrastructure to meet changing needs before your environment starts to experience degradation or even disruptions. This is Preventive Analytics at work!
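As a rough illustration of resetting thresholds around a “new normal,” the sketch below derives an alert threshold from recent history. The two-week window, the 3-sigma band, and the 70% floor are illustrative assumptions, not settings from any particular product:

```python
# Recompute a CPU alert threshold from recent history: baseline plus a
# 3-sigma band, never dropping below an operational floor. All numbers
# here are made up for illustration.
from statistics import mean, stdev

def new_normal_threshold(recent_samples, sigmas=3.0, floor=70.0):
    """Return a threshold reflecting the 'new normal' in recent samples."""
    baseline = mean(recent_samples)
    band = sigmas * stdev(recent_samples)
    return max(baseline + band, floor)

# Two weeks of hourly CPU peaks for one web server (hypothetical numbers)
recent = [34, 38, 41, 36, 44, 52, 47, 39, 43, 55, 61, 58, 49, 46]
print(round(new_normal_threshold(recent)))   # ~71 for this sample - alert only above it
```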

Zenoss Analytics, a fully integrated component of Zenoss’ unified platform Zenoss Service Dynamics, can provide through its ad-hoc views and reports:

Baselines for a given time period

Insight into utilization trends based on the data collected from ZSD Resource Manager and ZSD Service Impact, as well as from ZenPacks

Differences between false positive “noise” and real problems that need to be triaged

Unsure how this plays out in the real world? Let’s look at some examples.

Example 1: Periodic Events

Every Sunday night at 11 p.m. EST, your database backup jobs kick off. This activity floods your network and storage resources and requires additional compute power to avoid impacting performance response times (since online stores are never closed). If for whatever reason (e.g., IT silos leading to lack of communication between DBAs and IT ops folks), you weren’t aware of this process, IT Operational Analytics could give you access to historical performance metrics on Sunday nights over the last 60 days, six months, or whatever length of time you request.

With an average range over previous Sunday nights in hand, you can surmise what normal CPU usage looks like during this window. That information would then give you the confidence to ignore the alerts generated by the CPU jump, place them low on your list of priorities, or perhaps revise your monitoring thresholds for Sundays so that these alerts are not generated at all.

However, if this jump in CPU took place on, say, a Monday night, you could reasonably infer that it is not normal. From there you could dig deeper into the data to figure out the source of the abnormality and, if it turns out to be real, address it.
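Here is an illustrative sketch of the kind of check Analytics automates: comparing a late-night reading against prior readings from the same weekday-and-hour slot. The timestamps and values are hypothetical.

```python
# Compare a CPU spike against what the same weekday/hour looked like before.
# Dates and readings are made-up examples.
from datetime import datetime
from statistics import mean

# Historical 11 p.m. CPU readings pulled for the last several weeks
history = [
    (datetime(2013, 11, 17, 23, 0), 79.0),   # Sunday (backup window)
    (datetime(2013, 11, 18, 23, 0), 33.0),   # Monday
    (datetime(2013, 11, 24, 23, 0), 82.0),   # Sunday
    (datetime(2013, 11, 25, 23, 0), 31.0),   # Monday
    (datetime(2013, 12, 1, 23, 0), 80.0),    # Sunday
    (datetime(2013, 12, 2, 23, 0), 35.0),    # Monday
]

def expected_for(ts):
    """Average of prior readings taken on the same weekday and hour."""
    same_slot = [cpu for t, cpu in history
                 if t.weekday() == ts.weekday() and t.hour == ts.hour]
    return mean(same_slot) if same_slot else None

print(expected_for(datetime(2013, 12, 8, 23, 0)))   # ~80: a Sunday spike matches the backups
print(expected_for(datetime(2013, 12, 9, 23, 0)))   # ~33: the same spike on Monday would stand out
```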

Example 2: Partial Lockdown of Certain Resources

At the end of every fiscal quarter (which can happen any day of the week), your financial databases are locked down so that quarterly bookkeeping processes can run. All this number crunching and report building causes every KPI (Key Performance Indicator) to spike, which may affect the responses of any service that’s tied into these databases.

As an IT ops professional, you’re already aware of this quarterly activity and have set up “blackouts” for these periods to prevent noise alerts from being generated and distracting you from other priorities. The problem with relying solely on these preset blackouts is that crucial business activity may still need to take place, and you don’t want those processes lumped in with the bookkeeping processes. IT Operational Analytics can give you baselines over time to help you figure out the expected activity for the parts of your infrastructure still in production mode and filter out false positives that may be generated because of the quarterly bookkeeping activities. Moreover, with the right Analytics solution, you can see the accepted ranges of activity on the production servers during these times to make sure you are provisioning enough capacity for these key business processes.
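A simplified sketch of that triage logic might look like the following; the device names, the blackout set, and the accepted ranges are assumptions made for the example, not output from any particular product:

```python
# Suppress quarter-close noise from blacked-out databases while still judging
# production servers against their baseline ranges. All names and numbers
# are hypothetical.
blackout_devices = {"fin-db-01", "fin-db-02"}        # locked down for bookkeeping
accepted_range = {"web-04": (20.0, 75.0),            # normal CPU band from baselines
                  "app-02": (15.0, 60.0)}

alerts = [
    {"device": "fin-db-01", "metric": "cpu", "value": 97.0},
    {"device": "web-04",    "metric": "cpu", "value": 81.0},
    {"device": "app-02",    "metric": "cpu", "value": 42.0},
]

def triage(alert):
    """Suppress blacked-out devices; flag production devices only when the
    value falls outside their accepted baseline range."""
    if alert["device"] in blackout_devices:
        return "suppressed (quarter-close blackout)"
    low, high = accepted_range.get(alert["device"], (0.0, 100.0))
    return "within baseline" if low <= alert["value"] <= high else "investigate"

for a in alerts:
    print(a["device"], "->", triage(a))
```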

Example 3: Planning for Increased Capacity

That increase to 80% could be a fluke, a side effect of periodic events or partial lockdowns, or a sign that you need to increase your CPU capacity. IT Operational Analytics can help you determine whether you need to bump up capacity by providing data on a number of things, including:

Month-over-month percentage increases, as well as surges in demand revealed by historical trends in the capacity your services and applications consumed in the period leading up to this unexpected jump in CPU utilization.

Increases in overall utilization over the last several quarters. For example, if your business has expanded dramatically over the last six months, your gross capacity needs may end up being significantly larger than they’ve been in the past.

Changes in the way your customers access your e-commerce site. If a higher percentage of your customers use mobile devices to research and buy your products, you may need to reserve additional capacity for your mobile services.

In other words, you need some understanding of what data to pull in order to generate views and reports that would answer these questions. If your numbers show utilization rates are skyrocketing, you may need to be more aggressive in provisioning supplemental infrastructure and possibly shorten your purchase cycles to handle the expanding load without carrying too much excess capacity.
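As a back-of-the-envelope illustration, a month-over-month growth projection can answer “how long until we hit our planning ceiling?” The utilization figures and the 80% ceiling below are illustrative assumptions:

```python
# Project peak utilization forward using the average month-over-month growth
# rate and report how many months remain before a planning ceiling is crossed.
# The history and ceiling are made-up numbers.
monthly_peak_cpu = [48.0, 52.0, 57.0, 61.0, 67.0, 72.0]   # last six months, % of capacity

def months_until(ceiling, history):
    """Months of headroom left if recent growth continues (capped at 36)."""
    growth = sum(b / a for a, b in zip(history, history[1:])) / (len(history) - 1)
    current, months = history[-1], 0
    while current < ceiling and months < 36:
        current *= growth
        months += 1
    return months

print(months_until(80.0, monthly_peak_cpu))   # 2 months of headroom in this example
```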

Improving Predictions Helps You Take Preventive Action

As I wrote in the first Preventive Analytics post, no one can prevent every outage or performance issue that comes down the pike. Nor can IT Operational Analytics, at least at this stage, tell you how to handle every seemingly abnormal situation. What IT Operational Analytics can do, however, is give you a wealth of contextual information that helps you make the best judgments and take the best action available to you in order to prevent impacts to the businesses you support.  And that ability to assess events based on past measurements will help you keep your infrastructure as close to normal as possible, even in the most complex of IT environments.
