2013-07-21

My main mission in my work and in my blog is to help customers successfully virtualize business critical apps without compromising SLA's, while reducing risk, and gaining the economic and operational benefits of virtualization. I very often see customers attempting to virtualize their business critical apps using the exact same approach they took with dev/test and tier 2 and 3 applications. The result is that they often struggle, and sometimes their projects are not successful. Achieving success when you are virtualizing critical apps takes a different approach, and attempting these important projects yourself is often more costly than bringing in the right team that has done it all before. The objective of this article is to bust some of the myths that surround virtualizing business critical apps and give you some ideas on how to approach such projects so that you can increase your chances of success. This is by no means exhaustive, but it will give you a good starting point. I will share some of the secrets to success that I've built up over hundreds of successful vBCA projects, and lessons learned from projects I've seen go off the rails.

Thanks to the wonders of modern technology this article comes to you from 40,088ft somewhere above Australia on a Singapore Airlines A380-800 from Sydney to Singapore. This is the quietest and most comfortable plane I've ever flown in. I'm on my way to Singapore to assist with an Architect Bootcamp, where I'll be training 50 of ASEAN's top architects on the important aspects of virtualizing business critical applications.

Defining Business Critical Applications

Before I start busting myths and revealing some of my secrets, let's define what we mean by business critical apps, and what makes an app critical:

My definition of a Business Critical Application is any application that could have a material impact on your organization's or your customers' reputation, productivity or financial viability if it were to become unavailable or experience severe performance degradation for an extended period of time.

Examples:

Virtual Desktop Environment – If it supports all users

ERP systems and supporting databases and middleware

Manufacturing, Power Grid Management Systems, Process Automation and Control Systems – SCADA

Financial systems, payment processing, online banking

Middleware and ESB systems

Billing systems

Customer facing online systems, e-commerce systems

Medical systems

Security or door access control systems at secure facilities, such as airports or military bases

One of the systems I worked on and helped successfully virtualize was a financial system involved in processing over $70 billion per annum in transactions. If this system became unavailable, or if its performance was severely degraded at the wrong time of year, losses could run to over $100 million per day. Not the sort of thing you want to go wrong.

Myth Busting

It may come as a surprise that there are very few technical constraints to virtualizing business critical apps. Long gone are the days when a virtualized system couldn't meet the performance requirements of the most critical apps. Where prior generations of VMware vSphere only supported 4 or 8 vCPU's per VM, the current generation supports 64 vCPU's per VM. A single VM can support 1TB of RAM, 1 million storage IOPS (4KB at <2ms latency) and 40Gb/s of network throughput. There are very few enterprise applications that require this amount of resources in a single instance where scaling out is not possible. The only things that can't be virtualized these days are applications that require specialised hardware devices, such as X.25 or ISDN cards, or workloads that need more than 10 NIC's or more than 60 SCSI devices per VM. The difference in performance between virtual and physical at 100% utilization is generally between 6% and 10%. But how many systems are designed to run at 100% utilization all the time?

For most applications the slight overhead will not be noticed. In some cases virtual can perform better than physical. Java systems, and WebSphere Application Server in particular, can in certain configurations show a performance improvement of between 4% and 6% for 2 vCPU and 4 vCPU configurations per VM. This means that the same physical host hardware can produce between 4% and 6% more transactions per second. At VMworld 2013 in San Francisco in a couple of weeks, one session will show how an HPC cluster of 2720 CPU's running on vSphere, used for missile defence simulations, produces a 2.2% performance improvement over physical. If this sounds interesting you should register for VMworld and sign up for VAPP5419 - High-Performance Computing (HPC) in the Virtualized Data Center. I'll definitely be there for this session, and I'll be presenting two sessions as well.

For the most part applications on virtual machines need to be designed and configured in the same way as the same application running on a physical machine. In general the same application best practices apply. However there are some slight differences due to virtualization. Here are some of the things you should consider:

Firstly, a virtual CPU (vCPU) has only one thread, whereas a physical core can have two threads when hyperthreading is enabled.

Network access between VM's on the same host that are connected to the same virtual network happens at memory speed. If you have two VM's that interact heavily over the network, it can make sense to group them together on the same host.

Fewer vCPU's can mean better performance, especially where there are many VM's running on a host. This is because a VM with fewer vCPU's has a greater chance of having each vCPU scheduled on a physical processor thread. Oversizing VM's can lead to performance problems. Start smaller; you can always easily add vCPU's later once you've verified performance.

Because modern x86 servers support NUMA (Non-uniform Memory Access) you should aim to size your VM’s if possible to fit within a NUMA node. For example, if your physical CPU sockets have 8 cores your VM’s will be configured optimally if they have 1, 2, 4 or 8 vCPU’s. It’s easy to work out your NUMA node size by dividing the number of physical CPU cores and the total amount of memory by the number of physical CPU sockets. You should aim to size your VM’s memory to be less than the size of the NUMA node, and ideally less than half the size of total memory on the hosts.
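To make that concrete, here is a minimal Python sketch of the NUMA sizing arithmetic described above. The host figures (2 sockets, 16 cores, 256GB RAM) and the VM sizes are hypothetical examples, not measurements from any particular environment:

```python
# Hypothetical host: 2 sockets, 16 physical cores in total, 256GB RAM.
host_sockets = 2
host_cores_total = 16
host_memory_gb = 256

# NUMA node size = cores and memory divided by the number of sockets.
numa_cores = host_cores_total // host_sockets       # 8 cores per NUMA node
numa_memory_gb = host_memory_gb / host_sockets      # 128GB per NUMA node

def fits_in_numa_node(vm_vcpus, vm_memory_gb):
    """True if the VM's vCPU's and memory fit within a single NUMA node."""
    return vm_vcpus <= numa_cores and vm_memory_gb < numa_memory_gb

print(fits_in_numa_node(4, 64))    # True  - stays within one NUMA node
print(fits_in_numa_node(12, 192))  # False - would span NUMA nodes
```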

Due to the multiple layers of abstraction in the storage stack of a virtual machine it is best to use the simplest IO scheduler, so for Linux systems the IO scheduler (elevator) should be set to NOOP. Also, because of those layers of abstraction the data is effectively always fragmented; this is normal and expected, and defragmenting your systems is not a good idea as it would likely cause performance issues.
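As an illustration, this is a minimal sketch of how you might check, and optionally set, the elevator from inside a Linux guest via sysfs. The device name sda is just an example, writing requires root, and on some distributions the change also needs to be made persistent via boot parameters:

```python
# Minimal sketch: inspect (and optionally set) the I/O scheduler for a disk
# via sysfs inside a Linux guest. "sda" is an example device name.
SCHED_PATH = "/sys/block/sda/queue/scheduler"

with open(SCHED_PATH) as f:
    # The active scheduler is shown in brackets, e.g. "noop [deadline] cfq".
    print("Current scheduler:", f.read().strip())

# To switch to NOOP (requires root; uncomment to apply):
# with open(SCHED_PATH, "w") as f:
#     f.write("noop")
```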

Common Mistakes

Here are some common mistakes that I often see and you should try to avoid:

Treating virtualization as a magical black box and expecting any workload to run with half of the resources it needs. Virtualization isn't a magical black box; it is still bound by the laws of physics. If your workload really needs 6 or 8 vCPU's and 96GB RAM at times, then you had better make sure it gets them when it needs them. There are plenty of benefits to virtualizing over and above consolidation ratios. In fact, when virtualizing business critical apps, consolidation ratios are the least important factor. Reducing risk, ensuring availability and performance, and greatly simplifying DR, performance management and capacity planning are generally higher up the list of priorities. Ultimately you need to ensure that the physical hardware and infrastructure underpinning your virtualization solution has the capability to meet your objectives. If you buy cheap, slow hardware, don't expect it to perform like a rocket when you put your virtual machines on it. This is really just common sense.

Failing to baseline or properly evaluate and record the performance and other requirements of the source system, or of the new system that is being developed. Not having objective measures of performance and other SLA's falls into this category. If you haven't got a baseline, and you haven't got clearly documented and agreed objective business and technical metrics and requirements, you will find it almost impossible to achieve success. A gut feeling about whether something is working OK is not sufficient when dealing with business critical apps. This leads us to the next point.

Failing to verify that the performance, availability and other business requirements can actually be met and that you have the right infrastructure to meet them. Each component of the solution needs to be able to meet its performance, availability and other business objectives. Testing per component and as an integrated solution will prove whether the solution works as expected. If each component meets its objectives, then logically so should the integrated whole.

Insufficient planning for risk and disaster scenarios, before, during and after migration.

Making a solution more complicated than it needs to be, or designing a technical solution that is not supported by business requirements. For example, implementing a metro stretched cluster solution when there is no business justification for it and where alternative solutions would be simpler, less costly and still meet the requirements. As a lot of high availability features, such as VMware HA, are already built into the base VMware vSphere platform, you may not need in-guest clustering solutions in some cases. This can greatly simplify your solution and its operation.

You can't take a plug and pray approach with Business Critical Apps. To meet their SLA's predictably and with low risk you need to take a very methodical and disciplined approach to the project and have objective measures to meet that can be verified.

 

Secrets to Successful vBCA Projects

As I said previously, you need a methodical and disciplined approach to virtualizing business critical apps. It is much more of an applications or software development (SDLC) type project than a pure infrastructure project, especially as even the hardware is now software once it's virtualized. This is especially true if the project involves migration from a traditional Unix system, which may involve porting of code or software redevelopment if it's not a standard commercial off-the-shelf (COTS) product. So here are some tips, or secrets, to successfully virtualizing business critical apps:

Clearly document all the important business requirements and when doing the architecture and solution design make sure you have traceability of design decisions back to the business requirements that they support.

Baseline the source environment and record metrics that are objective and a valid representation of system performance and availability as it impacts the end users. How you achieve this is up to you, but it constantly surprises me how many customers don't bother to baseline or evaluate and record the performance, availability and other important requirements of their source systems prior to virtualizing them. Every time I have seen this, without exception, the projects have run into problems. If you don't have objective, accurate and valid metrics of source system performance, how can you verify that the new system will at least achieve, if not exceed, the prior state, which is our goal? Another point here: whether CPU utilization is 40% or 50% might not make a difference, so long as the end user response times and scalability objectives are met. So any evaluation must be in an application metrics context rather than just infrastructure utilization metrics.
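As a simple illustration of recording an application-centric baseline, here is a hypothetical Python sketch that samples the response time of an application health-check URL and appends it to a CSV. The URL, interval and file name are placeholders, and a real baseline would capture many more metrics over a representative business cycle:

```python
# Hypothetical baseline sampler: records end-user response time for an
# application endpoint so there is an objective record to compare against
# after virtualization. URL, interval and output file are examples only.
import csv
import time
from urllib.request import urlopen

APP_URL = "http://erp.example.com/health"   # placeholder application endpoint
SAMPLE_INTERVAL_S = 60                       # sample once per minute

with open("baseline.csv", "a", newline="") as out:
    writer = csv.writer(out)
    while True:
        start = time.time()
        urlopen(APP_URL).read()                      # time a real user-facing request
        response_ms = (time.time() - start) * 1000
        writer.writerow([time.strftime("%Y-%m-%d %H:%M:%S"), round(response_ms, 1)])
        out.flush()
        time.sleep(SAMPLE_INTERVAL_S)
```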

Test your infrastructure! Verify it's meeting all of the business requirements, design criteria, and the performance, availability, security and recoverability metrics (and any other metrics) that are important for your project. If you don't test it, how will you know it'll work when you need it most? I recommend a risk based approach to testing so that you get the most coverage of the most important things for the least amount of effort and time. It will be up to you to decide what and how much to test. I will give you some more ideas on this below.

Test your applications, with both component based and integration based testing, including testing the combined workload of multiple application instances or VM's on a host. You should also test your availability and recoverability methods while the applications are under load.

Prepare for the worst! Testing and verifying everything is not a one time process; it is an ongoing process. You need well defined plans that are tested and verified to work, especially when it comes to DR, and they need to cover every component. You need to test normal operating scenarios as well as scenarios where things go wrong. Getting to know how the system behaves during failures and disasters will give you a lot more confidence. VMware Site Recovery Manager can provide you with an automated recovery process that is auditable and testable without disruption to production. Recovery and failure testing should include security and compliance requirements to ensure that your systems remain secure and compliant even when recovered after a disaster. Also bear in mind that most real disasters in a datacenter are man-made, not the result of a natural phenomenon.

Test your migration methodology and your roll back plans. Before you do a production migration for real you should test your migration methodology. During this process you should be timing it, and although it's a test you should make it as real and valid as possible. I also recommend a pilot or proof of concept in most cases. You should also test and verify your roll back process and have clear criteria for when, how and under what conditions a roll back would be initiated.

Follow VMware's and your vendors' best practices for architecture design and for the applications that you're migrating, at least as a baseline to start from. Best practices are created over numerous projects and are the best place to start in the absence of any special requirements that might cause you to modify them. Some best practices might not be valid for your environment and you may need to create your own, but at a minimum all relevant best practice documents should be reviewed during your project. I often see people having trouble with databases and applications when they have not even bothered to read the best practice documentation that would have prevented the problem in the first place.

Make sure the applications teams and end users are core members of your project team and that they have input into design, testing, and migration methods. This will not only help get their buy-in, it will increase the chances that you cover all important aspects of the application migration, and they will gain an understanding of how the applications will behave once virtualized, even when things are going wrong.

Don't overcommit resources too aggressively without having observed the system's performance and behaviour once it's been virtualized. Consider allocating no more than one vCPU per logical host CPU (thread) to start with, and don't overcommit memory on the hosts in your VMware clusters until you properly understand usage patterns and have real data to base decisions on. You can increase your system utilization safely and get better overall consolidation by grouping systems that need different resources onto the same host. Fortunately most resource scheduling decisions are handled automatically by VMware features such as VMware Distributed Resource Scheduler (DRS). For systems that need high storage performance and low latency, make sure they are configured with enough virtual SCSI controllers and virtual disks, and have access to sufficient physical storage devices, to get the queue depth and parallelism they need.
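The sketch below illustrates the kind of initial sizing sanity check described above, using hypothetical host and VM figures: total vCPU's at or below the host's logical CPU (thread) count, and configured VM memory within physical RAM, until you have real usage data:

```python
# Hypothetical initial-sizing check: no CPU or memory overcommitment until
# real usage data is available. All figures below are example values.
host_logical_cpus = 32    # e.g. 2 sockets x 8 cores x 2 threads (hyperthreading)
host_memory_gb = 256

vms = [                   # (vCPU's, memory GB) for the VM's planned on this host
    (8, 96), (4, 32), (4, 32), (2, 16), (2, 16),
]

total_vcpus = sum(vcpus for vcpus, _ in vms)
total_mem_gb = sum(mem for _, mem in vms)

print(f"vCPU to logical CPU ratio: {total_vcpus / host_logical_cpus:.2f} (start at <= 1.0)")
print(f"Memory configured: {total_mem_gb}GB of {host_memory_gb}GB physical (avoid overcommit initially)")
```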

Make use of all of the features of VMware vSphere to ensure quality of service for your critical applications and to prevent impacts from noisy neighbours. For example, Network IO Control, Load Based Teaming, Storage IO Control, VMware HA and VMware DRS can give you much more predictable performance and improved quality of service, over and above what you could get when the applications were physical.

Plan for risk and have detailed risk mitigation plans that are documented and agreed with all key stakeholders. Identifying and mitigating risks and their impacts will be an ongoing process throughout the project and even after the project is complete and in operation. In some cases the mitigation plans will also need to be tested and verified (depending on impact). Virtualizing without compromising is the name of the game, and reducing risk is one of the most important objectives. There is no point virtualizing if, through a flawed process, a flawed design, or a lack of testing, you introduce more risk that has a severe business impact. This could well outweigh any possible benefits if you don't plan and execute your projects carefully. Done well, of course, the results and benefits are substantial.

Migrate low risk systems first: prove the processes and gain confidence before moving onto higher risk systems. Once you have proved multiple times over that your design, your process and the results are in accordance with your objectives, it will be much easier to migrate higher risk systems. You would normally start with dev systems, then test systems, then pre-production, before finally migrating the actual production system.

 

Testing Coverage

In terms of what you should test and your testing coverage, here are some of the areas I would recommend you consider when planning out your vBCA project testing strategy and plans:

Pilot and Design Verification Testing

Has the design been implemented as expected?

Does the migration process work as expected?

System and Operation Testing

Does the application and the full solution function as expected?

Do the maintenance and operational aspects of the design work as expected?

Availability and Recovery Testing

Do individual infrastructure and application components behave as expected when components fail?

Do the business continuity and availability aspects of the infrastructure and applications work as expected under various disaster scenarios?

Performance and Scalability Testing

Does the solution meet the performance SLA’s for applications and infrastructure?

What is the saturation point and headroom of the design and individual components, and what is the sweet spot for scalability?

You may also want to do application regression testing, integration testing, and of course User Acceptance Testing. I would recommend a risk based approach so that you can cover the most important areas thoroughly without unnecessary effort spent testing areas that are low impact. You will need to decide how much testing, and what testing, is required to give you the desired level of comfort and to verify that your requirements and objectives are being met. You won't know if you've met your objectives unless you've tested and verified them. So test thoroughly and test often.

 

Final Word

Virtualizing Business Critical Applications successfully requires a disciplined and methodical approach that reduces and manages risk, and a higher level of assessment, testing and verification. The reason for this is that the impacts are a lot higher if the applications do not meet their objectives. A revenue generating system that becomes unavailable, or whose performance is severely degraded, will likely have immediate consequences and could even jeopardise the future viability of the business. A medical system or military system going down could mean the difference between life and death. By approaching a vBCA project in the right way, and through thorough planning and testing, you can achieve better SLA's for your applications, with lower risk, higher availability, and very often much higher performance. If you do it wrong you might find yourself updating your CV. vBCA is about virtualizing without compromise. No compromise to SLA's, no compromise to performance, no compromise on risk.

 



This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster +. Copyright © 2013 – IT Solutions 2000 Ltd and Michael Webster +. All rights reserved. Not to be reproduced for commercial purposes without written permission.
