In the last few articles, I discussed the various options for IO with LDoms. Here's a very short summary:
IO Option        Links to previous articles
SR-IOV           SR-IOV
Direct IO        Direct IO
Root Domains     Physical IO
Virtual IO       Improved vDisk Performance for LDoms, Layered Virtual Networking, Sizing the IO Domain, Virtual Networking Explained, A Closer Look at Disk Backend Choices
In this article, I will discuss the pros and cons of each of these options and give some recommendations for their use.
In the case of physical IO, there are several options: Root Domains, Direct IO and SR-IOV. Let's start with SR-IOV. The most recent addition to the LDom IO options, it is by far the most flexible and most sophisticated PCI virtualization option available. Please see the diagram on the right (from the Admin Guide) for an overview. First introduced for Ethernet adapters, Oracle today supports SR-IOV for Ethernet, InfiniBand and Fibre Channel. Note that the exact features depend on the hardware capabilities and built-in support of the individual adapter: SR-IOV is not a feature of the server but rather of an individual IO card in a server platform that supports it. Here are the advantages of this solution:
It is very fine-grained, with between 7 and 63 Virtual Functions per adapter. The exact number depends on adapter capabilities. This means that you can create and use as many as 63 virtual devices in a single PCIe slot!
It provides bare metal performance (especially latency), although hardware resources such as send and receive buffers and MAC slots are divided between the VFs, which might lead to slight performance differences in some cases.
Particularly for Fibre Channel, there are no limitations on what end-point devices (disk, tape, library, etc.) you attach to the fabric. Since this is a virtual HBA, it is administered like one.
Unlike Root Domains and Direct IO, most SR-IOV configuration operations can be performed dynamically, if the adapters support it. This is currently the case for Ethernet and Fibre Channel. This means you can add or remove SR-IOV VFs to and from domains in a dynamic reconfiguration operation, without rebooting the domain (see the short sketch after this list).
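To illustrate this last point, here is a minimal sketch of creating an Ethernet VF and assigning it to a running guest. The physical function and domain names are hypothetical examples; ldm list-io shows the names valid for your system, and dynamic operation requires the adapter and OS support described in the Admin Guide.

    # List buses, slots, physical functions and existing VFs
    ldm list-io

    # Create a virtual function on an Ethernet physical function (example name);
    # depending on the LDoms version, this may require a delayed reconfiguration
    # of the root domain instead of being fully dynamic
    ldm create-vf /SYS/MB/NET0/IOVNET.PF0

    # Assign the new VF to a running guest domain (dynamic, if supported)
    ldm add-io /SYS/MB/NET0/IOVNET.PF0.VF0 mydom

    # Remove it again, also without rebooting the guest
    ldm remove-io /SYS/MB/NET0/IOVNET.PF0.VF0 mydom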
Of course, there are also some drawbacks:
First of all, you have a hard dependency on the domain owning the root complex. Here's a little more detail about this:
As you can see in the diagram, the IO domain owns the physical IO card. The physical Root Complex (pci_0 in the diagram) remains under the control of the root domain (the control domain in this example). This means that if the root domain should reboot for whatever reason, it will reset the root complex as part of that reboot. This reset will cascade down the PCI structures controlled by that root complex and eventually reset the PCI card in the slot and all the VFs given away to the IO domains. Seen from the IO domain, its (virtual) IO card will perform an unexpected reset. The best way for the IO domain to respond to this is with a panic, which is also the most likely consequence. Note that the Admin Guide says that the behaviour of the IO domain is unpredictable, which means that a panic is the best, but not the only possible outcome. Please also take note of the recommended precautions (by configuring domain dependencies) documented in the same section of the Admin Guide; a short sketch of such a configuration follows after this list. Furthermore, you should be aware that this also means that any kind of multi-pathing on top of VFs is counter-productive. While it is possible to create a configuration where one guest uses VFs from two different root domains (and thus from two different physical adapters), this does not increase the availability of the configuration. While this might protect against external failures like link failures of a single adapter, it doubles the likelihood of a failure of the guest, because it now depends on two root domains instead of one. I strongly recommend against any such configurations at this time. (There is work going on to mitigate this dependency.)
Live Migration is not possible for domains that use VFs. In the case of Ethernet, this can be worked around by creating an IPMP failover group consisting of one virtual network port and one Ethernet VF and manually removing the VF before initiating the migration, as described by Raghuram here. Note that this is currently not possible for Fibre Channel or IB.
Since you are actually sharing one adapter between many guests, these guests also share the IO bandwidth of that one adapter. Depending on the adapter, bandwidth management might be available; in any case, the side effects of sharing should be considered.
Not all PCIe adapters support SR-IOV. Please consult MOS DocID 1325454.1 for details.
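To soften the root domain dependency mentioned above, the Admin Guide recommends configuring domain dependencies, so that an IO domain that loses its root domain is brought down in a controlled way rather than being left in an unpredictable state. Here is a minimal sketch, with hypothetical domain names:

    # Declare the root domain (here: primary) as master of the IO domain
    ldm set-domain master=primary mydom

    # Define what happens to dependent domains when the master fails
    # (possible values include ignore, panic, reset and stop)
    ldm set-domain failure-policy=reset primary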
SR-IOV is a very flexible solution, especially if you need a larger number of virtual devices and yet don't want to buy into the slightly higher IO latencies of virtual IO. Due to the limitations mentioned above, I cannot currently recommend SR-IOV or Direct IO for use in domains with the highest availability requirements. In all other situations, and definitely in test and development environments, it is an interesting alternative to virtual IO. The performance gap between SR-IOV and virtual IO has been narrowed considerably with the latest improvements in virtual IO. You will essentially have to weigh the availability, latency and manageability characteristics of SR-IOV against virtual IO to make your decision.
Next in line is Direct IO. As described in an earlier post, you give one full PCI slot to the receiving domain. The hypervisor will create a virtual PCIe infrastructure in the receiving guest and reconfigure the PCIe subsystem accordingly. This is shown in an abstract view in the diagram (from the Admin Guide) at the right. Here are the advantages:
Since Direct IO works on a per-slot basis, it is a more fine-grained solution compared to Root Domains. For example, you have 16 slots in a T5-4, but only 8 root complexes.
The IO domain has full control over the adapter.
Like SR-IOV, it will provide bare-metal performance.
There is no sharing, and thus no cross-influencing from other domains.
It supports all kinds of IO devices, with tape drives and tape libraries being the most popular examples.
The disadvantages of Direct IO are:
There is a hard dependency on the domain owning the root complex. The reason is the same as with SR-IOV, so there's no need to repeat this here. Please make sure you understand this and read the recommendations in the Admin Guide on how to deal with this dependency.
Not all IO cards are supported with Direct IO; in particular, they must not contain their own PCIe switch. A list of supported cards is maintained in MOS DocID 1325454.1.
As with Root Domains, dynamic reconfiguration is not currently supported for Direct IO slots. This means that you will need to reboot both the root domain and the receiving guest domain to change this configuration (see the sketch after this list).
And of course, Live Migration is not possible with Direct IO devices.
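For reference, here is roughly what assigning a slot looks like, including the reboots just mentioned. The slot and domain names are examples; ldm list-io shows the actual names on your system.

    # Identify the slot and remove it from the root domain
    ldm list-io
    ldm start-reconf primary            # delayed reconfiguration on the root domain
    ldm remove-io /SYS/MB/PCIE5 primary
    # reboot the root domain for the removal to take effect

    # Assign the slot to the (stopped) guest domain, then start it
    ldm add-io /SYS/MB/PCIE5 mydom
    ldm start-domain mydom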
Direct IO was introduced in an early release of the LDoms software. At the time, systems like the T2000 only supported two root complexes. The most common use case was to support tape devices in domains other than the control domain. Today, with a much better ratio of slots per root complex, the need for this feature is diminishing, and although it is fully supported, you should consider other alternatives first.
Finally there are Root Domains. Again, a diagram you already know, just as a reminder.
The advantages of Root Domains are:
Highest isolation of all domain types. Since they own and control their own CPU, memory and one or more PCIe root complexes, they are fully isolated from all other domains in the system. This is very similar to the Dynamic System Domains you might know from older SPARC systems, except that we now use a hypervisor instead of a crossbar.
This also means no sharing of any IO resources with other domains, and thus no cross-influence of any kind.
Bare metal performance. Since there's no virtualization of any kind involved, there are no performance penalties anywhere.
Root Domains are fully independent of all other domains in all aspects. The only exception is console access, which is usually provided by the control domain. However, this is not a single point of failure, as the root domain will continue to operate and will be fully available over the network even if the control domain is unavailable.
They allow hot-swapping of IO cards under their control, if the chassis supports it. Today, that means T5-4 and larger systems.
Of course, there are disadvantages, too:
Root Domains are not very flexible. You can not add or remove PCIe root complexes without rebooting the domain.
You are limited in the number of Root Domains, mostly by the number of PCIe root complexes available in the system.
As with all physical IO, Live Migration is not possible.
Use Root Domains whenever you have an application that needs at least one socket's worth of CPU and memory and has high IO requirements, but where you'd prefer to host it on a larger system to allow some flexibility in CPU and memory assignment. Typically, Root Domains have a memory footprint and CPU activity too high to allow sensible live migration. They are typically used for high value applications that are secured with some kind of cluster framework.
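Mechanically, creating a Root Domain follows the same pattern as the Direct IO example above, only at the granularity of a whole PCIe bus. Again a sketch with example names; the bus names (pci_0, pci_1, ...) are shown by ldm list-io.

    # Remove a complete root complex from the primary domain ...
    ldm start-reconf primary
    ldm remove-io pci_1 primary
    # ... reboot primary, then hand the bus to the future root domain
    ldm add-io pci_1 rootdom1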
Having covered all the options for PCI virtualization, there is only virtual IO left to cover. For easier reference, here's the diagram from previous posts that shows this basic setup. This variant is probably the most widely used one. It has been available from the very first version, and its performance has been significantly improved recently. The advantages of this type of IO are mostly obvious:
Virtual IO allows live migration of guests. In fact, a guest can only be live migrated if all of its IO is fully virtualized.
This type of IO is by far the most flexible from a platform point of view. The number of virtual networks and the overall network architecture is only limited by the number of available LDCs (which has recently been increased to 1984 per domain). There is a big choice of disk backends. Providing disk and networking to a great number of guests can be achieved with a minimum of hardware.
Virtual IO fully supports dynamic reconfiguration - the adding and removing of virtual devices (a minimal setup sketch follows after this list).
Virtual IO can be configured with redundant IO service domains, allowing a rolling upgrade of the IO service domains without disrupting the guest domains and without requiring live migration of the guests for this purpose. Especially when running a large number of guests on one platform, this is a huge advantage.
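For completeness, here is a minimal sketch of this basic setup: a virtual switch and a virtual disk service in the service domain, and one vnet and one vdisk in the guest. All device, backend and domain names are examples.

    # In the service domain: a virtual switch on a physical NIC ...
    ldm add-vsw net-dev=net0 primary-vsw0 primary
    # ... and a virtual disk service with one backend (here a ZFS volume)
    ldm add-vds primary-vds0 primary
    ldm add-vdsdev /dev/zvol/dsk/rpool/ldoms/mydom/disk0 mydom-disk0@primary-vds0

    # In the guest: one virtual network device and one virtual disk
    ldm add-vnet vnet0 primary-vsw0 mydom
    ldm add-vdisk vdisk0 mydom-disk0@primary-vds0 mydom

    # Since all of the guest's IO is virtual, it remains a candidate for live migration:
    # ldm migrate-domain mydom root@target-host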
Of course, there are also some drawbacks:
As with all virtual IO, there is a small overhead involved. The LDoms implementation does not limit physical bandwidth, but a small amount of additional latency is added to each data packet as it is processed through the stack. Note that this additional latency, while measurable, is very small and typically not an issue for applications.
LDoms virtual IO currently supports virtual Ethernet and virtual disk. While virtual Ethernet provides the same functionality as a physical Ethernet switch, the virtual disk interface works on a LUN-by-LUN basis. This is different from other solutions that provide a virtual HBA and comes with some overhead in administration, since you have to add each virtual disk individually instead of just a single (virtual) HBA. It also means that other SCSI devices like tapes or tape libraries cannot be connected with virtual IO.
As is natural for virtual IO, the physical devices (and thus their resources) are shared between all consumers. While recent releases of LDoms do support bandwidth limits for network traffic (see the sketch after this list), no such limits can currently be set on virtual disk devices.
You need to configure sufficient CPU and memory resources in the IO service domains. The usual recommendation is one to two cores and 8-16 GB of memory. While this doesn't strictly count as overhead for the CPU resources of the guests, these are still resources that are not directly available to guests.
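If you do want to cap a guest's network bandwidth, recent LDoms versions allow a limit to be set on the virtual network device. A hedged example (the maxbw property and the names below reflect recent releases as I recall them; check the documentation for your version):

    # Limit the guest's virtual network device to about 500 Mbit/s
    ldm set-vnet maxbw=500M vnet0 mydom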
Some recommendations for virtual IO:
In general, use the latest version of LDoms, along with Solaris 11.
Other than general networking considerations, there are no specific tunables for networking, if you are using a recent version of LDoms. Stick to the defaults.
The same is true for disk IO. However, keep in mind what has been true for the last 20 years: more LUNs do more IOPS. Just because you've virtualized your guest doesn't mean that a single 10TB LUN would give you more IOPS than 10 x 1TB LUNs - quite the opposite! In the special case of the Oracle database: make sure the redo logs are on dedicated storage. This has been a recommendation since the "bad old days", and it continues to be true, whether you virtualize or not.
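Applied to virtual disks, this simply means presenting several backends to the guest instead of one big one, and keeping the redo logs on a device of their own. A sketch with hypothetical LUNs and names:

    # Several smaller data LUNs instead of one large one ...
    ldm add-vdsdev /dev/dsk/c0t1d0s2 mydom-data1@primary-vds0
    ldm add-vdisk data1 mydom-data1@primary-vds0 mydom
    # (repeat for data2, data3, ...)

    # ... and a dedicated LUN for the redo logs
    ldm add-vdsdev /dev/dsk/c0t2d0s2 mydom-redo@primary-vds0
    ldm add-vdisk redo mydom-redo@primary-vds0 mydom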
Virtual IO is best used in consolidation scenarios, where you have many smaller systems to host on one chassis. These smaller systems tend to be lightweight in most of their resource consumption, including IO, so they will definitely work well on virtual IO. These are also the workloads that lend themselves best to Live Migration, because of their smaller memory footprint and lower overall activity. This is not to say that domains with moderate IO requirements wouldn't be well suited for virtual IO; they are. However, larger domains with higher overall resource consumption (CPU, memory, IO) tend to benefit less from the advantages of Live Migration and the flexibility of virtual IO.
To finalize this article, here's a tabular overview of the different options and the most important points to consider:

SR-IOV
Pros: Highest granularity of all PCIe-based IO solutions; bare metal performance; supports Ethernet, FC and IB; dynamic reconfiguration.
Cons: Depends on support by the PCIe card; no Live Migration; dependency on the root domain.
When to use: For a larger number of guests that need bare metal latency and can do without Live Migration. When administering a great number of LUNs is a constant burden, consider FC SR-IOV. When availability is not the top priority.

Direct IO
Pros: Dedicated slot, no hardware sharing; bare metal performance; supports Ethernet, FC and IB.
Cons: Granularity limited by the number of PCIe slots in the system; not all PCIe cards supported; no Live Migration; no dynamic reconfiguration; dependency on the root domain.
When to use: If you need a dedicated or special purpose IO card.

Root Domains
Pros: Fully independent domains, similar to dynamic domains; full bare metal performance, dedicated to each domain; all types of IO cards supported.
Cons: Granularity limited by the number of root complexes in the system; no Live Migration; no dynamic reconfiguration.
When to use: High value applications with high CPU, memory and IO requirements, where Live Migration is not a requirement and/or not practical because of domain size and activity.

Virtual IO
Pros: Allows Live Migration; most flexible, including full dynamic reconfiguration; no special hardware requirements; almost no limit to the number of virtual devices; allows fully redundant virtual IO configurations for HA deployments.
Cons: Limited to Ethernet and virtual disk; small performance overhead, mostly visible as additional latency; vDisk administration complexity; sharing of IO hardware may have performance implications.
When to use: Consolidation scenarios with many small guests, or when Live Migration is a requirement.
There are already quite a few links for further reading spread throughout this article. Here is just one more:
White paper about the various virtualization options you have with SPARC and Solaris