2015-08-14

Here’s a question I get asked all the time by customers and prospects alike: I use “that tool out west” and I can usually get 50 vUsers per load server for my current scripts/tests. How many vUsers can SOASTA get per load server?

My typical response: Do you currently use a load server calibration process?

The dead silence and blank stare I get in return usually means the person either thinks they need a giant hardware lab or has no idea why you’d want to calibrate a load server, much less follow an iterative process for doing it.

Here’s one reason why it’s important, and why I get asked this all the time:

I know of several large financial institutions that do all of their performance testing inside the firewall, which means they own all of their own infrastructure, including dedicated servers used solely for load generation. With load generation requirements that can reach hundreds of thousands of vUsers, you can imagine that even a small optimization of a load server could save a company quite a bit in infrastructure costs for the load server hardware alone. That’s why, as part of our best practices, SOASTA advocates calibrating load servers when using CloudTest.

So, let’s walk through the process!

Load server calibration: What it is and why you need to do it

Load server calibration is an iterative process to accurately determine the appropriate number of virtual users to run from each load server.

Why calibrate? Two reasons:

To identify the maximum number of users a load server can handle (per test clip/script) so computer resources are used in the most efficient way possible. If you are running very large tests where the number of load servers might be limited, you need to get as many users as possible on a load server. (Refer to my intro example as well.)

To eliminate the load server as a potential bottleneck in the test. If the load server is overloaded, the first thing you will see in the results is an increase in average response time. If you don’t notice that the load server is overloaded, you might incorrectly believe the target application is slowing down when in reality it is the load server itself.

So, how do you know when a load server is properly calibrated? Simple. Here are a few things to look for (a rough threshold-check sketch follows the list):

CPU — For cloud providers where instances are on shared hardware, the CPU should peak around 65-75% utilization. Average utilization should be around 65%. For bare-metal load servers, CPU utilization shouldn’t go beyond 90%.

Heap usage — The Java heap usage and garbage collection need to be “healthy” on the load servers. Heap usage should not increase as the test executes.

Network interface — No more than 90% utilization (i.e. roughly 900-950 Mbit/second on a gigabit interface).

Errors — No unusual errors (HTTP timeouts, connection timeouts, odd SSL errors, etc.) are seen during the test.
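
To make these checks concrete, here is a rough, non-CloudTest sketch in Python that applies the thresholds above to monitoring samples you might export from the dashboard. The sample data and field names are hypothetical.

```python
# Rough sketch: apply the calibration health thresholds above to exported
# monitoring samples. Field names and sample values are hypothetical.

def check_calibration(samples, bare_metal=False):
    """samples: list of dicts with cpu_pct, heap_used_mb and net_mbit keys."""
    cpu = [s["cpu_pct"] for s in samples]
    heap = [s["heap_used_mb"] for s in samples]
    net = [s["net_mbit"] for s in samples]

    cpu_peak_limit = 90 if bare_metal else 75   # shared cloud hardware vs. bare metal
    findings = []

    if max(cpu) > cpu_peak_limit:
        findings.append(f"CPU peaked at {max(cpu)}% (guideline ~{cpu_peak_limit}%)")
    avg_cpu = sum(cpu) / len(cpu)
    if avg_cpu > 65:
        findings.append(f"Average CPU {avg_cpu:.0f}% is above the ~65% guideline")

    # Heap should not trend upward across the run (compare first vs. last third).
    third = max(1, len(heap) // 3)
    if sum(heap[-third:]) / third > 1.10 * (sum(heap[:third]) / third):
        findings.append("Heap usage is trending upward; garbage collection may be unhealthy")

    if max(net) > 900:   # roughly 90% of a gigabit interface
        findings.append(f"Network peaked at {max(net)} Mbit/s; the NIC may be the bottleneck")

    return findings or ["Load server looks healthy at this load level"]

# Hypothetical samples taken once a minute during the ramp:
samples = [
    {"cpu_pct": 40, "heap_used_mb": 1200, "net_mbit": 150},
    {"cpu_pct": 62, "heap_used_mb": 1500, "net_mbit": 320},
    {"cpu_pct": 78, "heap_used_mb": 2600, "net_mbit": 610},
]
for finding in check_calibration(samples):
    print(finding)
```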

When should you perform a calibration test?

The calibration process is highly recommended for all tests, but especially in these situations:

Whenever a large test is being executed (i.e. over 10,000 users). This ensures that server resources are being used in the most efficient manner possible.

When your analysis of previous test results leads you to believe that the load servers might be getting in the way of the load test. As an example, maybe you’ve noticed the load servers spiking to high levels of CPU usage in the CloudTest Monitor Dashboard. Or maybe the SOASTA dashboards show an increase in average response time even though the target application doesn’t actually appear to be slowing down.

The calibration process explained

Depending on how many locations you plan to run the test from, how many test clips you have, and how long the testing session will last, the calibration process can take a varying amount of time. However, if done correctly, the output is a fairly precise number of virtual users to apply to each clip, on each instance size, on each cloud provider.

The number of virtual users that a given load server will support is determined by ramping up a test to a specified number of users over a 10-minute period. This relatively slow ramp (in most cases) will help identify when the key metrics on the load servers start to deviate from the norm.

What you need to know before you get started:

How many test clips will run during the test session?

How many hours will the test session last?

What cloud providers will be used for the test, if any?

Which instance sizes with each cloud provider will be used?

Make sure the target application can sustain the number of users you plan to drive at it. As an example, let’s use 1,000 users, as that is likely more than a single load generator can push (given an average test case and a load server comparable to an AWS Large instance). Lower levels will also work, but if you can get clearance for 1,000, that ensures you have headroom to start high and see issues as you ramp up.
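
Before moving on, it’s worth seeing why the calibrated number matters so much for large tests. Here is a tiny back-of-the-envelope sketch (all numbers invented) showing how the per-server figure you get from calibration translates into grid size:

```python
import math

# Hypothetical numbers: total load target for the session, plus an uncalibrated
# guess and a calibrated figure for virtual users per load server.
total_vusers = 200_000
vusers_per_server_guess = 400
vusers_per_server_calibrated = 600

servers_guess = math.ceil(total_vusers / vusers_per_server_guess)
servers_calibrated = math.ceil(total_vusers / vusers_per_server_calibrated)

print(f"Uncalibrated estimate: {servers_guess} load servers")
print(f"Calibrated estimate:   {servers_calibrated} load servers")
print(f"Servers saved:         {servers_guess - servers_calibrated}")
```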

Step 1:  Test clip setup

The first step in the calibration process is making sure you have created all the test clips that will run during the test session. Ensure that all the think times are appropriate to the application (meaning, make sure the test clip takes the correct amount of time to complete). Then, make sure that all memory optimizations are complete in the test clip.

The two most important things are scopes and clearing the responses in scripts. Scopes should be set as ‘private’ unless local/public is needed for scripts. The clearResponse function should be used in any scripts (like validations or extractions) that access the response of a message.

Step 2:  Test environment setup

Start a grid with 1 result server and 1 load server in the location you want to calibrate (for example, Rackspace OpenStack London).

Normally load is generated from large-size instances, but if you have a specific reason to generate load from another instance size (and it needs to be calibrated separately), you will need to run the calibration process below with a separate grid that only has that instance.

Once the grid starts, verify that monitoring of the grid started as well. This monitoring data provides critical information for the calibration. The grid UI will confirm that monitoring has started. You can also confirm it by opening the CloudTest monitor dashboard: if you see the load server and result server listed and capturing data, monitoring was started correctly. The dashboard will look like this:



Where to look during the calibration test

There are five main areas to watch during the calibration test:

Test results. Keep an eye on two main metrics: average response time and errors. If the average response time starts to rise (which it likely will as you ramp up to 1,000 users), start looking at the monitoring data. Try to determine if the increase in response time is due to the load increasing or because the load servers are getting in the way. Also, watch the errors. If you start to see “weird” errors (HTTP timeout errors, connection timeout errors, odd SSL errors, etc.), that might be an indication the load servers are overloaded.

Default monitoring dashboard. This is the monitoring dashboard that shows monitoring data over time. It is accessed from SOASTA Central by clicking on ‘Monitors’, finding the load server you just started, and clicking the ‘View Analytics’ link. An even better option is to create your own monitoring dashboard that shows these same metrics: because you link the monitoring to the composition when you start it, your own dashboard will give you correlated metrics. Either way, this will be your primary monitoring dashboard during the test.

Linux ‘top’. This command gives you deeper information than the default CPU chart provides. Specifically, on EC2, it shows how much CPU is being stolen by the hypervisor (%st). EC2 commonly steals some CPU, so the maximum CPU utilization you see in the SOASTA dashboard might be only ~75%; if top shows ~25% st, the CPU is effectively maxed out. Before the test starts, SSH into the load server (assuming it is Linux-based) and type ‘top’ at the command line. Keep this window up throughout the calibration test.

CloudTest monitor dashboard. This is the dashboard mentioned above. It is less useful in this test since there is only a single load server. This dashboard is most useful during large tests where lots of load servers and result servers are active, because you can see all the metrics on a single dashboard.

Monitoring combined charts. This dashboard is very useful since it shows the different monitoring metrics against Virtual Users and Send Rate. The only thing to be aware of is that you need to add the monitoring to the composition before the test starts by going to Composition Properties -> Monitoring -> Enable Server Monitoring and checking the monitor for the grid whose monitoring data you want to save.

Step 3: Create and execute the test composition

Given that the goal is to identify as closely as possible where the load server starts to get overloaded, the longer the ramp-up time, the better. However, you have to be realistic too. You don’t have days to do this. So, in this example, a 10-minute ramp to 1,000 users seems to be adequate.

Set up the test composition as follows:

On Track 1, put the test clip being calibrated

Set the track to use a ‘dedicated load server’

Set composition to “load” mode

Set 1 load server

Set 1,000 virtual users on that load server

Set ramp-up time on the track to 10 minutes

Set the track with Renew Parallel Repeats

Go to Composition Properties -> Monitoring and check the checkbox for the ‘Test Servers’ monitor (note that this will change on each stop/start of the grid, so confirm this setting if you restart the grid)

Once the test composition is set up correctly, load and play the composition to start the calibration test.

How to know when the load server is overloaded: 5 metrics to watch

Identifying when a server is overloaded is as much an art as it is a science. The metrics discussed below will lead you in the right direction, but how conservative you choose to be has an impact as well. Whenever you think you’ve identified a point in the test where the load server is overloaded, note the number of virtual users. Then let the test run a few minutes more; it’s possible the metric was only temporarily out of line.

Another thing to keep in mind is that the load server might have been overloaded before the test reached the virtual user level you noted as a possible limit. Monitoring might have detected it after the fact.
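
Because the ramp is linear, you can translate the point in time when you spotted trouble into an approximate virtual user count, and then shave off a margin for the monitoring lag just mentioned. A small sketch, with hypothetical numbers:

```python
# Sketch: estimate the virtual user count at the moment the overload likely
# began, then back off a margin for monitoring lag. All numbers are hypothetical.

def vusers_at(elapsed_s, ramp_s=600, target_vusers=1000):
    """Virtual users active at elapsed_s seconds into a linear ramp."""
    return min(target_vusers, int(target_vusers * elapsed_s / ramp_s))

overload_seen_at_s = 7 * 60   # you noticed trouble about 7 minutes into the ramp
monitoring_lag_s = 30         # assume monitoring trails reality by ~30 seconds
safety_margin = 0.10          # back off a further 10% to stay conservative

estimate = vusers_at(overload_seen_at_s - monitoring_lag_s)
per_server = int(estimate * (1 - safety_margin))

print(f"Virtual users when the overload likely began: ~{estimate}")
print(f"Conservative per-server figure to retest:     ~{per_server}")
```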

Here are the five key metrics to watch:

1. Both the default monitoring dashboard and top

If % id in top is close to 0, or the CPU metric in the Default Monitoring dashboard is basically flatlining around 75%-80% CPU, note the number of virtual users the test is at. Likely the server is overloaded at this point.
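
If you’d rather not eyeball top the whole time, you can sample the same counters top reads from /proc/stat. This is a generic Linux sketch, not a CloudTest feature; run it on the load server over SSH.

```python
# Sketch: compute %idle and %steal the way top does, by sampling the aggregate
# "cpu" line in /proc/stat twice. Linux-only; run this on the load server itself.
import time

def cpu_times():
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]  # aggregate "cpu" row

def idle_and_steal(interval=5):
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    delta = [b - a for a, b in zip(before, after)]
    total = sum(delta)
    idle_pct = 100.0 * delta[3] / total                              # 4th field = idle
    steal_pct = 100.0 * delta[7] / total if len(delta) > 7 else 0.0  # 8th field = steal
    return idle_pct, steal_pct

if __name__ == "__main__":
    idle, steal = idle_and_steal()
    print(f"%id: {idle:.1f}  %st: {steal:.1f}")
    if idle < 5 or steal > 20:
        print("CPU is effectively maxed out; the load server is likely overloaded")
```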

2. Heap usage/garbage collection

This metric is just as important to watch as the overall CPU usage; between the two, you can normally figure out when the server is overloaded. The primary metric to watch is the “JVM Heap Usage” widget on the Default Monitoring dashboard. Make sure it goes down (i.e. garbage collection) after periods of increase. If it keeps going up and doesn’t come down much, Java can’t do garbage collection appropriately, which also means the CPU is probably overloaded as it continually tries to collect. Be patient, though: major garbage collections can occur after longer periods of time. CPU utilization will tell you if you have any chance of a decent garbage collection. If CPU is high and garbage collection isn’t happening, you have probably overloaded the load server.
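
One way to sanity-check the sawtooth without staring at the chart is to look only at the troughs, i.e. the heap level right after each collection. If those post-GC lows keep climbing, collections aren’t reclaiming much. A minimal sketch with made-up heap samples:

```python
# Sketch: judge whether the heap "sawtooth" is healthy by checking the low
# points that follow each garbage collection. Sample data below is made up.

def post_gc_lows(heap_mb):
    """Return heap readings that follow a significant drop (i.e. a collection)."""
    return [cur for prev, cur in zip(heap_mb, heap_mb[1:]) if cur < prev * 0.8]

def gc_looks_healthy(heap_mb, tolerance=1.15):
    lows = post_gc_lows(heap_mb)
    if len(lows) < 2:
        return True   # not enough collections to judge yet; be patient
    return lows[-1] <= lows[0] * tolerance

healthy = [800, 1600, 2400, 900, 1700, 2500, 850, 1650]      # troughs stay flat
unhealthy = [800, 1600, 2400, 1500, 2700, 3500, 2300, 3600]  # troughs keep rising

print("healthy series:  ", gc_looks_healthy(healthy))
print("unhealthy series:", gc_looks_healthy(unhealthy))
```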

Below is an example of healthy garbage collection. Note how the heap usage increases and garbage collection returns it to a proper level. The trend should be effectively flat.



Below is an example of unhealthy garbage collection. Note that some minor garbage collection is occurring, but no major garbage collection is occurring. This indicates the load server is overloaded.



3. Network interface

Watch the Default Monitoring dashboard. As the load increases, the amount of bandwidth used will increase. If the bandwidth used approaches 950 Mbit/second, the network interface is becoming a bottleneck in the test and the load server is overloaded. In practice, this is rarely a bottleneck with the load servers.

4. Disk IO

Watch the Default Monitoring dashboard. There are two disk-related metrics that will tell you if heavy disk use is occurring. This is very unlikely to be a bottleneck with the SOASTA load servers since most processing on the servers is done in memory.

5. Test results

Watch error rates and average response time. If any of the metrics above are starting to become a bottleneck, likely the average response time will increase and/or error rates will climb rapidly, in a hockey stick-like fashion.
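
If you want something more repeatable than eyeballing the chart, one simple heuristic is to compare each reading against the average of the early, comfortable part of the ramp and flag the first sample that jumps well above it. A sketch with invented response-time samples:

```python
# Sketch: flag the "hockey stick" point where average response time jumps well
# above its early-ramp baseline. The samples below are invented.

def find_knee(values, baseline_points=5, factor=2.0):
    """Return (index, baseline) for the first value above factor x the early baseline."""
    baseline = sum(values[:baseline_points]) / baseline_points
    for i in range(baseline_points, len(values)):
        if values[i] > factor * baseline:
            return i, baseline
    return None, baseline

# Average response time (ms), sampled once a minute during the 10-minute ramp.
response_ms = [210, 220, 215, 230, 225, 240, 260, 310, 650, 1400]

knee, baseline = find_knee(response_ms)
if knee is not None:
    print(f"Baseline ~{baseline:.0f} ms; response time broke away around minute {knee}")
else:
    print("No obvious knee in the response-time curve")
```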

Important: Memory usage can be a misleading metric

Do not rely on this metric to determine if the load server is overloaded. The JVM will take up all the memory it is given, but that doesn’t mean there is a memory limitation. JVM heap usage is the metric that should be used instead.

What to do when the load server becomes overloaded

When you determine the load server is overloaded, you need to stop the test composition and change the number of virtual users to the level at which you identified the bottleneck. Here’s how:

Let’s say you identified a bottleneck at 600 virtual users. Then set the composition virtual users to 600 users.

Reset the ramp-time to 10 minutes.

Go through the same process as described above.

If you find that over time 600 users is still too high, then lower the number of virtual users on the load server and restart the test.

If the test runs without any bottleneck, then you’ve pretty much confirmed the number of virtual users you can run on the load server.

What to do when you think the virtual user threshold has been identified

Once you find the number of virtual users that seems to work with the test clip, you need to run that test for as long as the test session is scheduled to last. If it is two hours, then run that same test for two hours. This will help flush out any long-term problems with the test. For example, if the JVM heap usage keeps climbing toward the 6GB limit on EC2, the test might die when it reaches that level.

It isn’t always possible to run this longer test. If you have lots of test clips or a limited amount of time before the test, you might not be able to run the test clip for hours. If that’s the case, don’t fret. Take a look at the results you’ve gathered to this point. Do you think that the number of virtual users you’ve identified as your limit is aggressive or conservative? If you think you are on the edge and being too aggressive with the number of virtual users, then cut that number back a bit.
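
One concrete check you can still do with a shorter run is to extrapolate the heap trend and see whether it would reach the ~6GB EC2 ceiling mentioned above before the session ends. A rough sketch with hypothetical numbers:

```python
# Sketch: linear extrapolation of heap growth from a short run, to judge whether
# it would hit the ~6GB heap ceiling before the session ends. Numbers are hypothetical.

observed_minutes = 30
heap_start_mb = 1500
heap_end_mb = 3400
session_minutes = 120
heap_limit_mb = 6 * 1024

growth_per_min = (heap_end_mb - heap_start_mb) / observed_minutes
projected_mb = heap_end_mb + growth_per_min * (session_minutes - observed_minutes)

print(f"Observed heap growth: {growth_per_min:.0f} MB/min")
print(f"Projected heap at {session_minutes} min: {projected_mb:.0f} MB (ceiling {heap_limit_mb} MB)")
if projected_mb > heap_limit_mb:
    print("Likely to hit the heap ceiling; cut the per-server virtual user count back")
```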

Again, this process is a combination of art and science. Go with your gut.

Next steps and considerations

Once you’ve completed this process for a single test clip, you will need to repeat this process for each cloud vendor and instance size you plan to run this test clip on. Additionally, you will need to repeat this process for each test clip.

One important thing to note about cloud providers is that every instance is not the same. Take Amazon EC2, for example. Not every “m1.large” instance is the same. Some have different processor types (Intel vs. AMD). Some have different processor speeds. You don’t have control over whether you get a faster m1.large or a slower m1.large. So it is possible that you might have calibrated your test on one of the faster instances. As a result, when you run the larger test, it is possible that some of your load might come from slower load servers. You won’t know if you randomly get fast or slow servers. Again, this is a time when you have to make a call about how conservative you want to be. Maybe you subtract 100 users from the total number of virtual users to account for this. Or, if you don’t think it will make a big difference, just let it go.

One way to do a simple validation of your virtual user number is to stop the grid and restart it. You will get another random server. If you repeat the calibration test 2-3 times with different load servers and get the same result, you can be confident you have the right number of virtual users.
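
When you do repeat the calibration on a few freshly started (and therefore randomly assigned) instances, a reasonable approach is to take the most pessimistic result and, if you’re worried about the instance variability described above, shave a little more off. A tiny sketch with made-up results:

```python
# Sketch: combine calibration results from repeated runs on different randomly
# assigned instances. The per-run results and margin below are made up.

calibration_runs = [640, 600, 690]   # VUs per server from three separate grids
variability_margin = 50              # cushion for slower instances you may get later

per_server = min(calibration_runs) - variability_margin
print(f"Virtual users per load server to use in the big test: {per_server}")
```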

What NOT to do: An alternative way to get more virtual users out of a given load server is by artificially extending the test clip. In essence, this means increasing the think times during the test. This is not a recommended approach as it artificially decreases the number of HTTP requests being sent to the target application and won’t properly simulate the expected number of virtual users.

Advanced calibration tips

If this is a large test that you are going to do often, and therefore you would like a more precise calibration, you can calibrate with the same variations of load, but do so by increasing the number of load generators in use at each step, rather than by changing the load on a single load generator. Keep each single load server to a low number of users (at a point where you are pretty sure that it is not at capacity). You then track the same response time, errors, and other metrics as you did using the single load generator.

By comparing your two curves, you can get an idea of where the true capacity limits are. For example, if the curves degrade but are identical, then that suggests that the target under test is degrading rapidly, and load generator capacity is hardly even a factor.

On the other hand, if you see degradation in the first step (one load generator), but the second step (multiple load generators) shows no degradation (the charts show a nice linear increase in the factors as load increases), then that suggests the target site is not degrading at all, and all of the degradation you saw in the first step was due to load generator capacity being reached.

You may see something in-between, because often both sides are degrading at different rates as the load increases. In that case, you need to make a judgment call by looking at the two curves.

Let me illustrate

Here’s how this might be done in this case, assuming we’re testing a very robust site and, based on the observations made so far, that we can get a fairly large number of users per load generator. (In this example we’ll start with a lower VU number and work our way up.)

Step one: Run tests on a single load generator, with test runs at 100, 200, 300 … 2000 users, or until you see CPU and memory limitations come into play, or until significant degradation occurs. Plot a graph of the factors like response time, error rate, and throughput. If you see no degradation in the graph (everything is linear and the error rate is constant) until a CPU or memory limit is reached, then you are done and can stop — you have found the capacity point and need go no further. Otherwise, go to step two.

Step two: Run the same tests with a constant 100 users per load generator, with 1, 2, 3, 4… 20 load generators, or until you reach the stopping point of step one. Plot a graph of the factors like response time, error rate, and throughput.

Step three: Compare the two graphs and reach a conclusion as to the capacity of the load generator vs. the capacity of the site under test. If you can, make a judgment of where the capacity of the load generator is reached. (Note that if the site under test is itself degrading rapidly, you might not be able to reach a conclusion about the load generator capacity.) A simple sketch of this comparison follows below.
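
Here is a minimal sketch of that step-three comparison (all numbers invented). It lines up the two result sets by total virtual users and reports how much each curve has degraded relative to its own starting point, so you can see at a glance whether the single-generator curve is falling apart faster than the scaled-out one:

```python
# Sketch: compare degradation of the single-generator ramp (step one) against
# the scaled-out run (step two) at the same total virtual user levels.
# All response-time numbers below are invented for illustration.

def degradation(curve):
    """Response time at each load level relative to the lightest load level."""
    base = curve[0][1]
    return {users: round(ms / base, 2) for users, ms in curve}

# (total virtual users, average response time in ms)
single_generator = [(100, 210), (300, 230), (500, 290), (700, 520), (900, 1100)]
many_generators = [(100, 205), (300, 215), (500, 225), (700, 240), (900, 260)]

one = degradation(single_generator)
many = degradation(many_generators)

print("VUs    1 generator    n generators")
for users in sorted(one):
    print(f"{users:>4}    {one[users]:>9}x    {many[users]:>10}x")

# If the single-generator column climbs while the n-generator column stays flat,
# the degradation in step one came from the load generator, not the site.
```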

Final takeaways

Whether you are driving load from the cloud or from your own internal load generators, calibration of your load generators is considered a SOASTA best practice. Certainly cost savings is one main reason — either for cloud infrastructure costs (pennies) or internal hardware costs (potentially hundreds of thousands of dollars).

(SOASTA CloudTest can drive load from either source, as well as drive load from internal and cloud-based load generators at the same time — in the same test.  But I’ll explain that in an upcoming post.)

Calibration is also key to ensuring that you have the most accurate performance test baseline point for each test run. After all, our goal is to test the website or application, not the load server! Still have questions? Leave a comment below or ping me on Twitter.

