2014-05-09

After discussing the methodology of a benchmark tool suite, this article is about turning it into a real project and testing servers with it.

Why start a new project?

As with every project hosted by eNovance, the Automated Health Check (AHC) project is developed under an open source licence and publicly available on GitHub via the eDeploy repository. Until now, I had never found a project that satisfies the following requirements, which is why I started this one:

fully open source

supports CPU, memory, storage and network benchmarking

runs in a ramfs generated on any Linux distribution you like

generates reports with both the hardware/software configuration and the performance metrics

uploads results to a centralized server

supports a smoke mode to stress the hardware for a given time

conditional success/failure behaviour

To make this tool a reality, the eDeploy project was an obvious choice, as it can build under-control Linux operating systems (Debian, Fedora, RedHat, Ubuntu) with a bootable disk or ramfs as output. Moreover, the bootstrap client used to deploy eDeploy's roles already covers pretty much all of the hardware/software configuration detection, in addition to the central logging system. When the project started, forking this bootstrap code to create a new eDeploy role was the natural path: the AHC project.

What are the main objectives of such a tool?

When benchmarking, you have to choose between two main approaches:

get the most out of the system to report to the world that you have the most powerful system on earth

report a performance level that indicates roughly how powerful your system is, without tracking the "lost percents"

The first approach can be pretty time consuming and is kind of a never-ending story. Every system can always be a little more powerful than yesterday: a slight optimization, a new compiler or a different compilation directive can have an impact on your performance. Tracking these lost percentages could take days, hundreds of runs and a very big test matrix.

I prefer the second one, as my goal is to get a quick answer to this question: "Are my servers running almost normally?" The 'almost' part of the sentence is very important. This tool does not try to check everything but makes a best-effort estimation of your server's capabilities. A quick overview of a system, ensuring the basic features are working well, is usually enough to track down weak systems.

Selecting the appropriate benchmark tools

For every single component to benchmark, selecting a tool is always a trade-off between the available features, the licensing and the liveliness of the project.

CPU & Memory Benchmarking

The Sysbench project offers a single interface to measure both CPU computing power and memory bandwidth. Its main advantages are a lightweight source code, GPL licensing, a threading option and a time-based mode.

This benchmark does not test all the features and instructions the CPU has, nor is that its objective. Sysbench reports a number that represents a global level of performance. This number does not really have a humanly understandable unit; it is much more a relative performance indicator.

The memory module of Sysbench performs IOs of a given block size against the main memory, which is pretty straightforward to understand. The result of this benchmark is a memory bandwidth in MB/sec measured over a fixed period of time.
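As a rough sketch, here is how a wrapper script could pull those two numbers out of the tool's output. The sample strings below mimic the classic sysbench 0.4/0.5 output layout and are made-up captures, not real measurements:

```python
import re

# Abridged, invented samples of classic sysbench output; a real run
# would capture this text from the tool's stdout via subprocess.
CPU_SAMPLE = """\
Test execution summary:
    total time:                          10.0012s
    total number of events:              4127
"""

MEM_SAMPLE = """\
Operations performed: 104857600 (10485132.12 ops/sec)

102400.00 MB transferred (10239.39 MB/sec)
"""

def parse_cpu_events(output):
    """Return the event count: the relative CPU performance indicator."""
    match = re.search(r"total number of events:\s+(\d+)", output)
    return int(match.group(1)) if match else None

def parse_mem_bandwidth(output):
    """Return the memory bandwidth in MB/sec."""
    match = re.search(r"\(([\d.]+) MB/sec\)", output)
    return float(match.group(1)) if match else None

print(parse_cpu_events(CPU_SAMPLE))     # 4127
print(parse_mem_bandwidth(MEM_SAMPLE))  # 10239.39
```

The event count is the unit-less indicator mentioned above; only the MB/sec figure carries a real unit.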

Storage Benchmarking

When thinking about storage benchmarking tools, fio immediately comes to mind. Mainly developed under the GPL license by Jens Axboe (Linux kernel maintainer of the block layer), this tool is by far the most versatile one I'm aware of. As we try to estimate the performance of the hardware itself, removing the filesystem layer is mandatory.

Filesystems are complex beasts with various optimizations and behaviours that are useful for users but can hide defects or introduce undesired latencies. The more software on the data path, the more complex the analysis of the results: running the same test on two different filesystems leads to pretty different results. As we want to be as close as possible to the hardware, it's important to remove this source of possible annoyance.

Fio's ability to perform IOs at the block level is a very interesting feature here. Fio can be scripted to perform the exact IO pattern you need while keeping the run duration under control and ensuring it runs without any cache layer from the Linux kernel (O_DIRECT).
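As an illustration, a minimal fio job description for such a block-level run could look like the following; the device name /dev/sdb and the exact option values are placeholders to adapt to your own setup:

```ini
; Hypothetical fio job: 4K random reads straight to the block device.
[global]
ioengine=libaio
direct=1
runtime=30
time_based=1

[randread-sdb]
filename=/dev/sdb
rw=randread
bs=4k
iodepth=32
```

direct=1 makes fio open the device with O_DIRECT, bypassing the page cache, while runtime and time_based bound the duration of the run.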

Network Benchmarking

The Netperf project, under a BSD-like license, is clearly one of the best-known and most widely used tools in the Linux world. It provides a very simple command line, port-based pairing, TCP and UDP support and up to 20 different scenarios. This tool is used to report the network bandwidth that a set of servers can generate simultaneously. The performance is expressed in Gbit/sec.
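A sketch of extracting the Gbit/sec figure from a TCP_STREAM run; the sample text mimics netperf's classic output layout and the numbers are invented:

```python
NETPERF_SAMPLE = """\
TCP STREAM TEST from 0.0.0.0 () port 0 AF_INET to 10.0.0.2 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    9416.32
"""

def tcp_stream_gbit(output):
    """Take the last field of the last data line (10^6 bits/sec)
    and convert it to Gbit/sec."""
    last = [line for line in output.splitlines() if line.strip()][-1]
    return float(last.split()[-1]) / 1000.0

print(round(tcp_stream_gbit(NETPERF_SAMPLE), 2))  # 9.42
```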

Embedding the benchmarking tools inside a custom and live bootable Linux Operating System

Software configuration changes are a big concern when running performance tests. It's mandatory to remove any possible source of annoyance that could have a positive or negative impact on performance (like a crontab, a change in the benchmark tool itself or a patch on the Linux kernel). As we try to set up a performance indicator to compare a set of servers, keeping the same OS across servers and over time is a key point.

The main idea here is to create a custom operating system that embeds the smallest possible amount of software on top of the Linux distribution of your choice. Ideally, the result is a bootable disk image, or a kernel and ramfs that can be booted over PXE. The main benefit of this approach is being able to boot your servers at any time to run a benchmark series without making any change to your production environment. As a result, the performance metrics are always gathered on the same software environment, leaving the hardware as the only difference between tests and over time.

It is thus possible to perform a differential analysis between install time and any later date if issues occur on a particular server. It can also be used to ensure that a new server performs at least as well as the other servers of a given pool. This is particularly precious when a set of servers is complemented by a new batch coming from the same or a different vendor: this methodology proves that the incoming servers will not degrade the level of service by under-performing compared to the existing ones.

To ease this integration, I added to the eDeploy project a role that selects the main packages required to perform this whole benchmark series.

To build it, you can proceed as follows.

To build a bootable ramfs:

make health-check

To build a bootable disk image:

make health-img

The resulting operating system is strongly versioned, archivable and available at any time. Booting becomes very easy, using a USB key on standalone servers or PXE on an already configured network.

Detecting the hardware and software configuration

Prior to any benchmark, you should grab both the hardware and the software configuration and save them. This is mandatory for several reasons:

the detected hardware could be tied to a particular OS/Kernel

understand which components, and which versions of them (firmware, revisions), were used during the run

To achieve this task, the very well designed lshw tool performs a complete analysis of your host and saves it in an XML file. The description is pretty huge, so only keep the relevant information to get a more synthetic view of your servers.
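A minimal sketch of this filtering step, assuming a drastically simplified sample of what lshw -xml emits (the real document is far larger and richer):

```python
import xml.etree.ElementTree as ET

# Invented, heavily trimmed lshw -xml sample: nested <node> elements
# carrying a class attribute and vendor/product children.
LSHW_SAMPLE = """\
<node id="server" class="system">
  <vendor>Dell Inc.</vendor>
  <product>PowerEdge R720xd</product>
  <node id="cpu:0" class="processor">
    <vendor>Intel Corp.</vendor>
    <product>Xeon E5-2650</product>
  </node>
</node>
"""

def summarize(xml_text):
    """Keep only the vendor/product pair per node class: a synthetic view."""
    root = ET.fromstring(xml_text)
    summary = {}
    for node in root.iter("node"):
        summary[node.get("class")] = (node.findtext("vendor"),
                                      node.findtext("product"))
    return summary

print(summarize(LSHW_SAMPLE)["system"])  # ('Dell Inc.', 'PowerEdge R720xd')
```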

Some complementary information you may need, which is not part of the lshw report, can be grabbed from system commands or through the /sys interface.

In my Automated Health Check tool, the extraction of all hardware/software information looks like this report. It includes the following components:

physical disks, the ones under the raid controller

logical disks seen by Linux, like virtualized RAID arrays or physical disks attached to a SATA controller

RAID array configuration

System properties (model, vendor)

Firmware (bios date & version)

Memory Bank allocation (speed, DIMM type, vendor, part number, location, size and clock)

Total amount of memory detected

Ethernet devices (driver, negotiation, link speed, IP, …)

Infiniband devices

CPU (number of sockets/cores, model & vendor, clocks, features)

OS (Vendor/Version, kernel version/arch/command line)

IPMI (channel to use to reach the open interface)

No human interaction to perform the benchmark

A key factor for reproducible benchmarks is being sure that the same tests will always be run in the same conditions. Scripting is the way to go to avoid any human mistake or typo when starting the tools. The scripting tool should reuse the hardware detection done previously and loop over the detected components to test every single device.

Each processor (the physical socket, not the logical cores) is tested for both computing power and memory bandwidth to ensure it works properly.

For the storage devices, each individual disk is tested with a set of various patterns, and then all the disks are tested at the same time with the same pattern series.

A simple Python script like the one I developed for AHC can be enough to do the job. The script should be written carefully to avoid any mis-synchronization of jobs or missing critical options on the benchmark tools, both of which lead to misleading and wrong performance results.
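A hedged sketch of that looping logic; the detected dictionary and the job tuples are hypothetical stand-ins for what the real detection phase and benchmark runners would provide:

```python
# Hypothetical detected-hardware structure; AHC derives this from its
# hardware-detection phase.
detected = {
    "cpu": ["cpu0", "cpu1"],        # physical sockets
    "disk": ["sda", "sdb", "sdc"],
}

def plan_jobs(hw):
    """One job per individual device, plus one 'all together' job
    per component class when there is more than one device."""
    jobs = []
    for component, devices in hw.items():
        for dev in devices:
            jobs.append((component, [dev]))
        if len(devices) > 1:
            jobs.append((component, list(devices)))
    return jobs

for job in plan_jobs(detected):
    print(job)
```

Always deriving the job list from the detected hardware, rather than typing it by hand, is what removes the human factor from the run.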

Understanding the efficiency of your server

When scripting your benchmark, it's important to keep in mind the behaviour you want to observe. One aspect is the global performance you get from a component; the second is the scalability of your server.

CPU computing power efficiency

When measuring the CPU computing power you have, running a single thread or a single process reports the amount of CPU power per logical core. If you now run the same test on all the available logical cores, using threads or processes, the sum of the individual core performances provides useful information on the global CPU power available. Let's say each logical core provides 100 units of power: a dual-socket server with 4 cores per socket offers 8 logical cores (16 if hyper-threading is enabled), but it will never deliver 800 units (or 1,600 units) of computing power when all cores are used simultaneously. On servers, you can expect an efficiency of 60 to 75%, meaning an average of 60 to 75 units per logical core instead of 100. The scalability factor of your server cannot be 100%, and you have to take this number into account when computing the CPU power you need at full load.
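The arithmetic above can be captured in a small helper; the 560-unit aggregate below is an invented figure chosen to land inside the usual 60-75% range:

```python
def scaling_efficiency(single_core_perf, all_cores_perf, core_count):
    """Ratio between the measured aggregate performance and the ideal
    linear scaling (single-core performance x core count)."""
    return all_cores_perf / (single_core_perf * core_count)

# The example from the text: 100 units per logical core, 8 logical cores.
print(scaling_efficiency(100, 560, 8))  # 0.7
```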

Memory bandwidth efficiency

Scalability issues are even more important for memory bandwidth, which is limited by the number of channels and the distance between the CPU and its memory. The distance and the path may vary from one architecture to another, with the memory controller embedded inside or outside the processor and sometimes attached to another processor, as on a NUMA architecture. The following images (from the National Instruments website) show this in a pretty well done manner.

 



(Image: Embedded versus external memory controller)

(Image: Legacy versus NUMA architecture)

 

From the CPU's point of view, the available memory bandwidth is limited by the path length (the number of nodes to reach the memory) and its width (how much bandwidth a channel can deliver). It's really important to understand that each CPU block shown in the previous images is in reality a physical processor. The memory bandwidth available to a physical socket has to be shared, read divided, by the number of cores requesting memory accesses simultaneously. A 6-core processor with 4.8 GB/sec of memory bandwidth can only provide 800 MB/sec per logical core when all of them request memory IOs at once. While additional cores increase the computing power at a 60-75% efficiency ratio, they also divide the memory bandwidth available to each of them. Please note that if a single logical core requests memory IOs while the other cores are idle, it gets the complete bandwidth, so 4.8 GB/sec in our example.

The memory bandwidth available at a given time can be determined in a couple of seconds.
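The division described above is simple enough to sketch:

```python
def per_core_bandwidth(socket_bw_mb, active_cores):
    """Memory bandwidth each core gets when active_cores of them
    issue memory IOs simultaneously on the same socket."""
    return socket_bw_mb / active_cores

# The 6-core example from the text: 4.8 GB/sec (4800 MB/sec) per socket.
print(per_core_bandwidth(4800, 6))  # 800.0
print(per_core_bandwidth(4800, 1))  # 4800.0 - a lone core gets it all
```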

Storage efficiency

In storage benchmarking, the one-by-one test checks that each disk performs well, while the all-together test checks whether the controller can sustain the global load induced by the attached disks. A good example concerns high-density servers like the Dell R720xd, which features up to 26 disks in the chassis. If you plan to use the 26 disks, it's good to know, and therefore prove, that when all the disks are loaded simultaneously you do not get 26x the performance of a single disk: the controller cannot handle this global load and degrades the overall performance by something around 15 percent.

This phenomenon has to be understood, measured and taken into account when designing storage nodes; otherwise the production server will deliver fewer IOs than expected.
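A small helper makes this kind of sanity check explicit; the IOPS figures below are invented for the sake of the example:

```python
def controller_degradation(single_disk_iops, disk_count, measured_total):
    """How far the measured aggregate falls short of the ideal
    disk_count x single-disk performance."""
    ideal = single_disk_iops * disk_count
    return 1.0 - measured_total / ideal

# 26 disks at 150 IOPS each would ideally give 3900 IOPS; a measured
# 3315 IOPS means a ~15% degradation, as in the R720xd example.
print(round(controller_degradation(150, 26, 3315), 2))  # 0.15
```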

Summary

Comparing individual performance against a full load provides a scaling indicator of your servers, which can vary with each component's technology. The test suite should implement tests that perform those simultaneous runs to compare standalone and simultaneous loads on a given component.

This is what AHC does, using the threading or forking feature of Sysbench for the CPU & memory metrics and the scripting capability of fio for the storage part. Fio uses a stonewall instruction to define the synchronization points between the various storage jobs. Both individual and grouped performances are reported in the results, allowing this kind of analysis.
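For the storage part, a fio script following this individual-then-grouped scheme could be sketched as below; the device names are placeholders and only two disks are shown for brevity:

```ini
; Hypothetical fio script: each disk alone, then all disks together.
; 'stonewall' forces fio to wait for the previous jobs to finish
; before starting the next one.
[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
runtime=30
time_based=1

[alone-sda]
filename=/dev/sda
stonewall

[alone-sdb]
filename=/dev/sdb
stonewall

[together-sda]
filename=/dev/sda
stonewall

[together-sdb]
filename=/dev/sdb
```

The last job carries no stonewall, so it starts together with the previous one: that pair is the grouped run, while the first two jobs are the individual runs.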

 

To be continued

The next articles in this series will be about:

selecting the proper test patterns

aggregating the results

running smoke tests

analyzing the results and detecting potentially under-performing servers

using these tools to also benchmark the scalability of the cloud infrastructure
