2013-01-25



The Linux Plumbers Conf wiki seems to have made the discussion notes for the 2012 conf read-only as well as visible only to people who have logged in.  I suspect this is due to the spam problem, but I’ll put those notes here so that they’re available without needing a login.  The source is here.

These are the notes I took during the virtualization microconference at the 2012 Linux Plumbers Conference.

Virtualization Security Discussion – Paul Moore

Slides

threats to virt system

3 things to worry about

attacks from host – has full access to guest

attacks from other guests on the host

break out from guest, attack host and other guests (esp. in multi-tenant situations)

attacks from the network

traditional mitigation: separate networks, physical devices, etc.

protecting guest against malicious hosts

host has full access to guest resources

host has the ability to modify guest state at will, without the guest knowing it

how to solve?

no real concrete solutions that are perfect

guest needs to be able to verify / attest host state

root of trust

guests need to be able to protect data when offline

(discussion) encrypt guests – internally as well as qcow2 encryption

decompose host

(discussion) don’t run services as root

protect hosts against malicious guests

just assume all guests are going to be malicious

more than just qemu isolation

how?

multi-layer security

restrict guest access to guest-owned resources

h/w passthrough – make sure devices are tied to those guests

limit avl. kernel interfaces

system calls, netlink, /proc, /sys, etc.

if a guest doesn’t need an access, don’t give it!

libvirt+svirt

MAC in host to provide separation, etc.

addresses netlink, /proc, /sys

(discussion) aside: how to use libvirt w/o GUI?

there is ‘virsh’; the documentation could be improved.
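
As an aside to that point: besides virsh, libvirt can also be driven programmatically. A minimal sketch using the libvirt C API to list running domains (the qemu:///system URI is just the usual local-KVM example):

    /* Minimal sketch: driving libvirt without a GUI via its C API.
     * Build with: gcc list-doms.c -lvirt */
    #include <stdio.h>
    #include <libvirt/libvirt.h>

    int main(void)
    {
        virConnectPtr conn = virConnectOpen("qemu:///system");
        if (!conn) {
            fprintf(stderr, "failed to connect to the hypervisor\n");
            return 1;
        }

        int ids[64];
        int n = virConnectListDomains(conn, ids, 64);   /* running domains */
        for (int i = 0; i < n; i++) {
            virDomainPtr dom = virDomainLookupByID(conn, ids[i]);
            if (dom) {
                printf("running: %s\n", virDomainGetName(dom));
                virDomainFree(dom);
            }
        }

        virConnectClose(conn);
        return 0;
    }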

seccomp

allows selectively turning off syscalls; addresses the syscalls in the list above.
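
As an illustration of the mechanism (not qemu’s actual filter, and skipping the architecture check a real filter needs): a minimal seccomp BPF program that allows a small syscall whitelist and kills the task on anything else.

    /* Minimal seccomp-BPF sketch of "turn off syscalls you don't need".
     * Illustration only: a real filter must also check seccomp_data.arch. */
    #include <stddef.h>
    #include <unistd.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>

    #define ALLOW(name) \
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_##name, 0, 1), \
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)

    static struct sock_filter filter[] = {
        /* load the syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        ALLOW(read),
        ALLOW(write),
        ALLOW(exit_group),
        /* everything else: kill the task */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    };

    int main(void)
    {
        struct sock_fprog prog = {
            .len = sizeof(filter) / sizeof(filter[0]),
            .filter = filter,
        };

        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);     /* needed for unprivileged use */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
            return 1;

        const char msg[] = "filter installed\n";
        write(STDOUT_FILENO, msg, sizeof(msg) - 1); /* write() is whitelisted */
        return 0;                                   /* exit_group() is whitelisted */
    }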

priv separation

libvirt handles n/w, file desc. passing, etc.

protecting guest against hostile networks

guests vulnerable directly and indirectly

direct: buggy apache

indirect: host attacked

qos issue on loaded systems

host and guest firewalls can solve a lot of problems

extend guest separation across network

network virt – for multi-tenant solutions

guest ipsec and vpn services on host

(discussion) blue pill vulnerability – how to mitigate?

lot of work being done by trusted computing group – TPM

maintain a solid root of trust

somebody pulling rug beneath you, happens even after boot

you’ll need h/w support?

yes, TPM

UEFI, secure boot

what about post-boot security threats?

let’s say booted securely. other mechanisms you can enable – IMA – extends root of trust higher. signed hashes, binaries.

unfortunately, details beyond scope for a 20-min talk

Storage Virtualization for KVM: Putting the pieces together – Bharata Rao

Slides

Different aspects of storage mgmt with kvm

mainly using glusterfs as storage backend

integrating with libvirt, vdsm

problems

multiple choices for fs and virt mgmt

libvirt, ovirt, etc.

not many fs’es are virt-ready

virt features like snapshots, thin-provisioning, cloning not present as part of fs

some things done in qemu – snapshots, block migration, image format handling – are better handled outside

storage integration

storage device vendor doesn’t have well-defined interfaces

gluster has potential: leverage its capabilities, and solve many of these problems.

intro on glusterfs

userspace distributed fs

aggregates storage resources from multiple nodes and presents a unified fs namespace

glusterfs features

replication, striping, distribution, geo-replication/sync, online volume extension

(discussion) why gluster vs ceph?

gluster is modular; pluggable, flexible.

keeps storage stack clean. only keep those things active which are needed

gluster doesn’t have metadata.

unfortunately, gluster people not around to answer these questions.

by having backend in qemu, qemu can already leverage glusterfs features

(discussion) there is a rados spec in qemu already

yes, this is one more protocol that qemu will now support

glusterfs is modular: details

translators: convert requests from users into requests for storage

open/read/write calls percolate down the translator stack

any plugin can be introduced in the stack

current status: enablement work to integrate gluster-qemu

start by writing a block driver in qemu to support gluster natively
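
Roughly, “natively” here means going through libgfapi instead of a FUSE mount. A sketch of the calls involved (volume name, server and image path are placeholders, and the signatures reflect the gluster API of that era, so check current headers before relying on them):

    /* Sketch of opening a VM image on a gluster volume via libgfapi,
     * the library the qemu gluster block driver builds on.
     * "myvol", "gluster-server" and the image path are placeholders. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <glusterfs/api/glfs.h>

    int main(void)
    {
        glfs_t *fs = glfs_new("myvol");
        if (!fs)
            return 1;

        /* point the client at a volfile server: transport, host, port */
        glfs_set_volfile_server(fs, "tcp", "gluster-server", 24007);
        if (glfs_init(fs)) {
            fprintf(stderr, "glfs_init failed\n");
            return 1;
        }

        glfs_fd_t *fd = glfs_open(fs, "/images/vm1.img", O_RDWR);
        if (!fd) {
            fprintf(stderr, "glfs_open failed\n");
            glfs_fini(fs);
            return 1;
        }

        /* ... I/O would go through glfs_pread()/glfs_pwrite() here ... */

        glfs_close(fd);
        glfs_fini(fs);
        return 0;
    }

On the qemu side this surfaced as an image specification along the lines of gluster://server/volname/image.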

add block device support in gluster itself via block device translator

(discussion) do all features of gluster work with these block devices?

not yet, early stages. Hope is all features will eventually work.

interesting use-case: replace qemu block dev with gluster translators

would you have to re-write qcow2?

don’t need to, many of qcow2 features already exist in glusterfs common code

slide showing perf numbers

future

is it possible to export LUNs to gluster clients?

creating a VM image means creating a LUN

exploit per-vm storage offload – all this using a block device translator

export LUNs as files; also export files as LUNs.

(discussion) why not use raw files directly instead of adding all this overhead? This looks like a perf disaster (ip network, qemu block layer, etc.) – combination of stuff increasing latency, etc.

all of this is experimentation, to go where we haven’t yet thought about – explore new opportunities. this is just the initial work; more interesting stuff can be built upon this platform later.

libvirt, ovirt, vdsm support for glusterfs added – details in slides

(discussion) storage array integration (slide) – question

way vendors could integrate san storage into virt stack.

we should have capability to use array-assisted features to create lun.

from ovirt/vdsm/libvirt/etc.

(discussion) we already have this in scsi. why add another layer? why in userspace?

difficult, as per current understanding: send commands directly to storage – fast copy from lun-to-lun, etc. – not via scsi T10 extensions.

these are out-of-band mechanisms, in mgmt path, not data path.

why would someone want to do that via python etc.?

Next generation interrupt virtualization for KVM – Joerg Roedel

Slides

presenting new h/w tech today that accelerates guests

current state

kvm emulates local apic and io-apic

all reads/writes intercepted

interrupts can be queued from user or kernel

ipi costs high

h/w support

limited

tpr is accelerated by using cr8 register

only used by 64 bit guests

shiny new feature: avic

avic is designed to accelerate the most common interrupt system features

ipi

tpr

interrupts from assigned devs

ideally none of those require intercept anymore

avic virtualizes apic for each vcpu

uses an apic backing page

guest physical apic id table, guest logical apic id table

no x2apic in first version

guest vapic backing page

store local apic contents for one vcpu

writes to accelerated registers won’t intercept

writes to non-accelerated registers cause intercepts

accelerated:

tpr

EOI

ICR low

ICR high

physical apic id table

maps guest physical apic id to host vapic pages

(discussion) what if guest cpu is not running

will be covered later

table maintained by kvm

logical apic id table

maps guest logical apic ids to guest physical apic ids

indexed by guest logical apic id

doorbell mechanism

used to signal avic interrupts between physical cpus

src pcpu figures out physical apic id of the dest.

when dest. vcpu is running, it sends doorbell interrupt to physical cpu

iommu can also send doorbell messages to pcpus

iommu checks if vcpu is running too

for not running vcpus, it sends an event log entry

imp. for assigned devices

msr can also be used to issue doorbell messages by hand – for emulated devices

running and not running vcpus

doorbell only when vcpu running

if the target vcpu is not running, software is notified about a new interrupt for this vcpu

support in iommu

iommu necessary for avic-enabled device pass-through

(discussion) kvm has to maintain? enable/disable on sched-in/sched-out

support can mostly be implemented in the kvm-amd module

some outside support in apic emulation

some changes to lapic emulation

change layout

kvm x86 core code will allocate vapic pages

(discussion) instead of kvm_vcpu_kick(), just run doorbell

vapic page needs to be mapped in nested page table

likely requires changes to kvm softmmu code

open question wrt device passthrough

changes to vfio required

ideally fully transparent to userspace

Reviewing Unused and New Features for Interrupt/APIC Virtualization – Jun Nakajima

Slides

Intel is going to talk about a similar hardware feature

intel: have a bitmap, and they decide whether to exit or not.

amd: hardcoded. apic timer counter, for example.

q to intel: do you have other things to talk about?

yes, coming up later.

a paper on the net showed perf improving from 5-6 Gbit/s to wire speed, using emulation of this tech.

intel have numbers on their slides.

they used sr-iov 10gbe; measured vmexit

interrupt window: when hypervisor wants to inject interrupt, guest may not be running. hyp. has to enter vm. when guest is ready to receive interrupt, it comes back with vmexit. problem: as you need to inject interrupt, more vmexits, guest becomes busier. so: they wanted to eliminate them.

read case: if you have something in advance (apic page), hyp can just point to that instead of this exit dance

more than 50% exits are interrupt-related or apic related.

new features for interrupt/apic virt

reads are redirected to apic page

writes: vmexit after write; not intercepted. no need for emulation.

virt-interrupt delivery

extend tpr virt to other apic registers

eoi – no need for vm exits (using new bitmap)

this looks different from amd

but for eoi behaviour, intel/amd can have common interface.

intel/amd comparing their approaches / features / etc.

most notably, intel have support for x2apic, not for iommu. amd have support for iommu, not for x2apic.

for apic page, approaches mostly similar.

virt api can have common infra, but data structures are totally different. intel spec will be available in a month or so (update: already available now). amd spec should be available in a month too.

they can remove interrupt window, meaning 10% optimization for 6 VM case

net result

eliminate 50% of vmexits

optimization of 10% vmexits.

intel also supports x2apic h/w.

VMFUNC

this can hide info from other vcpus

secure channel between guest and host; can do whatever hypervisor wants.

vcpu executes the vmfunc instruction in a special thread

usecases:

allow hvm guests to share pages/info with hypervisor in secure fashion

(discussion) why not just add to ept table

(discussion) does intel’s interrupt virt have an iommu component too?

doesn’t want to commit.

CoLo – Coarse-grained Lock-stepping VM for non-stop service – Will Auld

Slides

non-stop service with VM replication

client-server

Compare and contrast with Remus – Xen’s solution

xen: remus

buffers responses until the checkpoint to the secondary server completes (once per epoch)

resumes secondary only on failover

failover at anytime

xen: colo

runs two VMs in parallel comparing their responses, checkpoints only on miscompare

resumes after every checkpoint

failover at anytime

CoLo runs VMs on primary and secondary at same time.

both machines respond to requests; they check for similarity. When they agree, one of the responses is sent to the client

diff. between two models:

remus: assumes machine states have to be the same. This is the reason to buffer responses until the checkpoint has completed.

in colo, no such requirement; the only requirement is that the request stream must be the same.

CoLo non-stop service focus on server response, not internal machine state (since multiprocessor environment is inherently nondeterministic)

there’s heartbeat, checkpoint

colo managers on both machines compare requests.

when they’re not same, CoLo does checkpoint.

(discussion) why will response be same?

int. machine state shouldn’t matter for most responses.

some exceptions, like tcp/ip timestamps.

minor modifications to tcp/ip stacks

coarse grain time stamp

highly deterministic ack mechanism

even then, state of machine is dissimilar.

resume of machine on secondary node:

another stumbling block.

(slides) graph on optimizations

how do you handle disk access? network is easier – n/w stack resumes on failover. if you don’t do failover in a state where you know disk is in a consistent state, you can get corruption.

Two solutions

For NAS, do same compares as with responses (this can also trigger checkpoints).

On local disks, buffer the original state of changed pages, revert to the original and then checkpoint with the primary node’s disk writes included. This is equivalent to how the memory image is updated. (This was not described completely enough during the session.)

that sounds dangerous. client may have acked data, etc.

will have to look closer at this. (More complete explanation above counters this)

how often do you get mismatch?

depends on workload. some runs were like 300-400 good packets, then a mismatch.

during that, are you vulnerable to failure?

no, can failover at any point. internal state doesn’t matter. Both VMs provide consistent request streams from their initial state and match responses up to the moment of failover.

NUMA – Dario Faggioli, Andrea Arcangeli

NUMA and Virtualization, the case of Xen – Dario Faggioli

Slides

Intro to NUMA

access costs to memory differ, based on which processor accesses it

remote mem is slower

in context of virt, want to avoid accessing remote memory
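
For a concrete feel of the problem, this is what explicit placement looks like with libnuma on plain Linux – the kind of manual pinning that the work described below tries to automate (node 0 is an arbitrary choice):

    /* Manual NUMA placement with libnuma: allocate memory on node 0 and
     * keep the thread running there, so accesses stay node-local.  This
     * is the hand-tuned equivalent of what automatic placement aims to
     * do for you.  Build with: gcc numa-pin.c -lnuma */
    #include <stdio.h>
    #include <string.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        size_t len = 64 << 20;                 /* 64 MB */
        void *buf = numa_alloc_onnode(len, 0); /* memory from node 0 */
        if (!buf)
            return 1;

        numa_run_on_node(0);                   /* run this thread on node 0 too */
        memset(buf, 0, len);                   /* accesses stay local */

        numa_free(buf, len);
        return 0;
    }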

what we used to have in xen

on creation of VM, memory was allocated on all nodes

to improve: automatic placement

at vm1 creation time, pin vm1 to first node,

at vm2 create time, pin vm2 to second node since node1 already has a vm pinned to it

then they went ahead a bit, because pinning was inflexible

inflexible

lots of idle cpus and memories

what they will have in xen 4.3

node affinity

instead of static cpu pinning, preference to run vms on specific cpus

perf evaluation

specjbb in 3 configs (details in slides)

they get 13-17% improvements with 2 vcpus in each vm

open problems

dynamic memory migration

io numa

take into consideration io devices

guest numa

if vm bigger than 1 node, should guest be aware?

ballooning and sharing

sharing could cause remote access

ballooning causes local pressures

inter-vm dependencies

how to actually benchmark and evaluate perf, to tell whether they’re improving

AutoNUMA – Andrea Arcangeli

Slides

solving similar problem for Linux kernel

implementation details avail in slides, will skip now

components of autonuma design

novel approach

mem and cpu migration tried earlier using diff. approaches, optimistic about this approach.

core design rests on two novel ideas

introduce numa hinting pagefaults

works at thread-level, on thread locality

false sharing / relation detection

autonuma logic

cpu follows memory

memory in b/g slowly follows cpu

actual migration is done by knuma_migrated

all this is async and b/g, doesn’t stall memory channels

benchmarking

developed a new benchmark tool, autonuma-benchmark

generic to measure alternative approaches too

comparing to gravity measurement

put all memory in single node

then drop pinning

then see how memory spreads by autonuma logic

see slides for graphics on convergence

perf numbers

also includes comparison with alternative approach, sched numa.

graphs show autonuma is better than schednuma, which is better than vanilla kernel

criticism

complex

takes more memory

takes 12 bytes per page, Andrea thinks it’s reasonable.

it’s done to try to decrease risk of hitting slowdowns (is faster than vanilla already)

stddev shows autonuma is pretty deterministic

why is autonuma so important?

even 2 sockets show differences and improvements.

but 8 nodes really shows autonuma shines

Discussions

looks like Andrea is focussing on 2 nodes / sockets, not more? looks like it will have a bad impact on bigger systems

points to graph showing 8 nodes

on big nodes, distance is more.

agrees autonuma doesn’t worry about distances

currently worries only about convergence

distance will be taken as future optimisation

same for Xen case

access to 2 node h/w is easier

as Andrea also mentioned, improvement on 2 node is lower bound; so improvements on bigger ones should be bigger too; don’t expect to be worse

not all apps just compute; they do io. and they migrate to the right cpu to where the device is.

are we moving memory to cpu, or cpu to device, etc… what should the heuristic be?

1st step should be to get cpu and mem right – they matter the most.

doing for kvm is great since it’s in linux, and everyone gets the benefit.

later, we might want to take other tunables into account.

crazy things in enterprise world, like storage

for high-perf networking, use tight binding, and then autonuma will not interfere.

this already works.

xen case is similar

this is also something that’s going to be workload-dependent, so custom configs/admin is needed.

did you have a chance to test on AMD Magny-Cours (many more nodes)

hasn’t tried autonuma on specific h/w

more nodes, better perf, since upstream is that much worse.

xen

he did, and at least placement was better.

more benchmarking is planned.

suggestion: do you have a way to measure imbalance / number of accesses going to correct node

to see if it’s moving towards convergence, or not moving towards convergence, maybe via tracepoints

essentially to analyse what the system is doing.

exposing this data so it can be analysed.

using printks right now for development, there’s a lot of info, all the info you have to see why the algo is doing what it’s doing.

good to have in production so that admins can see

what autonuma is doing

how much is it converging

to decide to make it more aggressive, etc.

overall, all such stats can be easily exported; it’s already available via printk, but has to be moved to something more structured and standard.

xen case is same; trying to see how they can use perf counters, etc. for statistical review of what is going on, but not precise enough

tells how many remote memory accesses are happening, but not from where and to where

something more in s/w is needed to enable this information.

One balloon for all – towards unified balloon driver – Daniel Kiper

Slides

wants to integrate various balloon drivers avl. in Linux

currently 3 separate drivers

virtio

xen

vmware

despite impl. differences, their core is similar

feature difference in drivers (xen has selfballooning)

overall lots of duplicate code

do we have an example of a good solution?

yes, generic mem hotplug code

core functionality is h/w independent

arch-specific parts are minimal, most is generic

solution proposed

core should be hypervisor-independent

should co-operate on h/w independent level – e.g mem hotplug, tmem, movable pages to reduce fragmentation

selfballooning ready

support for hugepages

standard api and abi if possible

arch-specific parts should communicate with underlying hypervisor and h/w if needed

crazy idea

replace ballooning with mem hot-unplug support

however, ballooning operates on single pages whereas hotplug/unplug works on groups of pages that are arch-dependent.

not flexible at all

have to use userspace interfaces

can be done via udev scripts, which is a better way

discussion: does acpi hotplug work seamlessly?

on x86 baremetal, hotplug works like this:

hotplug mem

acpi signals to kernel

acpi adds to mem space

this is not visible to processes directly

has to be enabled via sysfs interfaces, by writing an enable command to every section that has to be hotplugged
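
A sketch of that last step – onlining one hotplugged memory section from userspace by writing to its sysfs state file (the section number is made up; udev scripts normally do this for every new section):

    /* Online a hotplugged memory section via sysfs -- the "enable
     * command" mentioned above.  Section number 32 is just an example. */
    #include <stdio.h>

    int main(void)
    {
        const char *path = "/sys/devices/system/memory/memory32/state";
        FILE *f = fopen(path, "w");
        if (!f) {
            perror(path);
            return 1;
        }
        fputs("online", f);   /* kernel adds the section to the page allocator */
        fclose(f);
        return 0;
    }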

is selfballooning desirable?

kvm isn’t looking at it

guest wants to keep mem to itself, it has no interest in making host run faster

you paid for mem, but why not use all of it

if there’s a tradeoff for the guest: you pay less, you get more mem later, etc., guests could be interested.

essentially, what is guest admin’s incentive to give up precious RAM to host?

ARM – Marc Zyngier, Stefano Stabellini

Some other notes are on session etherpad

KVM ARM – Marc Zyngier

Slides

ARM architecture virtualization extensions

recent introduction in arch

new hypervisor mode PL2

traditionally secure state and non-secure state

Hyp mode is in non-secure side

higher privilege than kernel mode

adds second stage translation; adds extra level of indirection between guests and physical mem

tlbs are tagged by VMID (like EPT/NPT)

ability to trap accesses to most system registers

can handle irqs, fiqs, async aborts

e.g. guest doesn’t see interrupts firing

hyp mode: not a superset of SVC

has its own pagetables

only stage 1, not 2

follows LPAE, the new large physical address extensions.

one translation table register

so difficult to run Linux directly in Hyp mode

therefore they use Hyp mode to switch between host and guest modes (unlike x86)

KVM/ARM

uses HYP mode to context switch from host to guest and back

exits guest on physical interrupt firing

access to a few privileged system registers

WFI (wait for interrupt)

(discussion) WFI is trapped and then we exit to host

etc.

on guest exit, control restored to host

no nesting; arch isn’t ready for that.

MM

host in charge of all MM

has no stage2 translation itself (saves tlb entries)

guests are in total control of page tables

becomes easy to map a real device into the guest physical space

for emulated devices, accesses fault, generates exit, and then host takes over

4k pages only

instruction emulation

trap on mmio

most instructions described in HSR

added complexity due to having to handle multiple ISAs (ARM, Thumb)

interrupt handling

redirect all interrupts to hyp mode only while running a guest. This only affects physical interrupts.

leave it pending and return to host

pending interrupt will kick in when it returns to guest mode?

No, it will be handled in host mode. Basically, we use the redirection to HYP mode to exit the guest, but keep the handling on the host.

inject two ways

manipulating arch. pins in the guest?

The architecture defines virtual interrupt pins that can be manipulated (VI→I, VF→F, VA→A). The host can manipulate these pins to inject interrupts or faults into the guest.

using virtual GIC extensions,

booting protocol

if you boot in HYP mode, and if you enter a non-kvm kernel, it gracefully goes back to SVC.

if kvm-enabled kernel is attempted to boot into, automatically goes into HYP mode

If a kvm-enabled kernel is booted in HYP mode, it installs a HYP stub and goes back to SVC. The only goal of this stub is to provide a hook for KVM (or another hypervisor) to install itself.

current status

pending: stable userspace ABI

pending: upstreaming

stuck on reviewing

Xen ARM – Stefano Stabellini

Slides

Why?

arm servers

smartphones

288 cores in a 4U rack – causes a serious maintenance headache

challenges

traditional way: port xen, and port hypercall interface to arm

from Linux side, using PVOPS to modify setpte, etc., is difficult

then, armv7 came.

design goals

exploit h/w as much as possible

limit to one type of guest

(x86: pv, hvm)

no pvops, but pv interfaces for IO

no qemu

lots of code, complicated

no compat code

32-bit, 64-bit, etc., complicated

no shadow pagetables

most difficult code to read ever

NO emulation at all!

one type of guest

like pv guests

boot from a user supplied kernel

no emulated devices

use PV interfaces for IO

like hvm guests

exploit nested paging

same entry point on native and xen

use device tree to discover xen presence

simple device emulation can be done in xen

no need for qemu

exploit h/w

running xen in hyp mode

no pv mmu

hypercall

generic timer

export timer int. to guest

GIC: general interrupt controller

int. controller with virt support

use GIC to inject event notifications into any guest domains with Xen support

x86 taught us this provides a great perf boost (event notifications on multiple vcpus simultaneously)

on x86, we had a pci device to inject event notifications into the guest as a legacy interrupt

hypercall calling convention

hvc (hypercall)

pass params on registers

hvc takes an argument: 0xEA1 – means it’s a xen hypercall.

64-bit ready abi (another lesson from x86)
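
Roughly, the 32-bit convention looks like the sketch below – hypercall number in r12, arguments in r0-r4, hvc with the 0xEA1 immediate. The register details are recalled from Xen’s public arch-arm header, so treat them as an assumption rather than a reference.

    /* Sketch of a one-argument Xen/ARM hypercall (32-bit): number in r12,
     * argument in r0, "hvc #0xEA1" traps into Xen, result back in r0.
     * Must be built for a CPU with the virtualization extensions. */
    static inline long xen_hypercall_1(unsigned long op, unsigned long a1)
    {
        register unsigned long r12 asm("r12") = op;   /* hypercall number */
        register unsigned long r0  asm("r0")  = a1;   /* first argument   */

        asm volatile("hvc #0xEA1"
                     : "+r" (r0)
                     : "r" (r12)
                     : "memory");

        return r0;
    }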

no compat code in xen

2600 fewer lines of code

had to write a 1500-line patch of mechanical substitutions to make all guests work fine on a 32-bit host

status

xen and dom0 boot

vm creation and destruction work

pv console, disk, network work

xen hypervisor patches almost entirely upstream

linux side patches should go in next merge window

open issues

acpi

will have to add acpi parsers, etc. in device table

linux has 110,000 lines – should all be merged

uefi

grub2 on arm: multiboot2

need to virtualise runtime services

so only hypervisor can use them now

client devices

lack ref arch

difficult to support all tablets, etc. in market

uefi secure boot (is required by win8)

windows 8

Discussion

who’s responsible for device tree mgmt for xen?

xen takes the dt from hw, changes it for mem mgmt, then passes it to dom0

at present, have to build the dt binary

at the moment, linux kernel infrastructure doesn’t support interrupt priorities.

needed to prevent a guest doing a DoS on host by just generating interrupts non-stop

xen does support int. priorities in GIC

VFIO – Are we there yet? – Alex Williamson

Slides

are we there yet? almost

what is vfio?

virtual function io

not sr-iov specific

userspace driver interface

kvm/qemu vm is a userspace driver

iommu required

visibility issue with devices in iommu, guaranteeing devices are isolated and safe to use – different from uio.

config space access is done from kernel

adds to safety requirement – can’t have userspace doing bad things on host
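
For reference, the userspace flow looks roughly like this, condensed from the kernel’s VFIO documentation (group number and PCI address are examples; error checking omitted):

    /* Condensed sketch of the VFIO userspace flow: container + group +
     * type1 IOMMU + DMA mapping + device fd. */
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/vfio.h>

    int main(void)
    {
        int container = open("/dev/vfio/vfio", O_RDWR);
        int group     = open("/dev/vfio/26", O_RDWR);   /* the IOMMU group */

        struct vfio_group_status status = { .argsz = sizeof(status) };
        ioctl(group, VFIO_GROUP_GET_STATUS, &status);
        if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE))
            return 1;   /* not all devices in the group are bound to vfio */

        /* attach the group to the container, pick the type1 IOMMU model */
        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
        ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

        /* map 1 MB of anonymous memory at IOVA 0 for device DMA */
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (__u64)(unsigned long)mmap(NULL, 1 << 20,
                                                PROT_READ | PROT_WRITE,
                                                MAP_PRIVATE | MAP_ANONYMOUS,
                                                -1, 0),
            .iova  = 0,
            .size  = 1 << 20,
        };
        ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

        /* finally get a file descriptor for one device in the group */
        int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
        (void)device;
        return 0;
    }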

what’s different from last year?

1st proposal shot down last year, and got revised at last LPC

allow IOMMU driver to define device visibility – not per-device, but the whole group exposed

more modular

what’s different from pci device assignment

x86 only

kvm only

no iommu grouping

relies on pci-sysfs

turns kvm into a device driver

current status

core pci and iommu drivers in 3.6

qemu will be pushed for 1.3

what’s next?

qemu integration

legacy pci interrupts

more of a qemu-kvm problem, since vfio already supports this; but these are unique since they’re level-triggered – the host has to mask the interrupt so it doesn’t cause a DoS until the guest acks the interrupt

would like to bypass qemu entirely – irqfd exists for edge-triggered interrupts; now exposing irqfd for level-triggered ones

(lots of discussion here)

libvirt support

iommu grps changed the way we do device assignment

sysfs entry point; move device to vfio driver
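
The “move device to vfio driver” step is plain sysfs writes; a sketch (the PCI address and vendor/device IDs are placeholders):

    /* Sketch of rebinding a PCI device to vfio-pci via sysfs, mirroring
     * the procedure in the kernel's VFIO documentation.  The address
     * and the "8086 10ca" vendor/device pair are examples only. */
    #include <stdio.h>

    static int write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        fputs(val, f);
        return fclose(f);
    }

    int main(void)
    {
        /* 1. detach the device from its current host driver */
        write_str("/sys/bus/pci/devices/0000:06:0d.0/driver/unbind",
                  "0000:06:0d.0");

        /* 2. tell vfio-pci to claim devices with this vendor/device ID */
        write_str("/sys/bus/pci/drivers/vfio-pci/new_id", "8086 10ca");

        /* /dev/vfio/<group> then appears for the device's IOMMU group */
        return 0;
    }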

do you pass group by file descriptor?

lots of discussion on how to do this

existing method needs name for access to /sys

how can we pass file descriptors from libvirt for groups and containers to work in different security models?

The difficulty is in how qemu assembles the groups and containers. On the qemu command line, we specify an individual device, but that device lives in a group, which is the unit of ownership in vfio and may or may not be connectable to other containers. We need to figure out the details here.

POWER support

already adding

PowerPC

freescale looking at it

one api for x86, ppc was strange

error reporting

better ability to inject AER etc to guest

maybe another ioctl interrupt

If we do get PCIe AER errors to show up at a device, what is the guest going to be able to do (for instance, can it reset links)?

We’re going to have to figure this out and it will factor into how much of the AER registers on the device do we expose and allow the guest to control. Perhaps not all errors are guest serviceable and we’ll need to figure out how to manage those.

better page pinning and mapping

gup issues with ksm running in b/g

PRI support

graphics support

issues with legacy io port space and mmio

can be handled better with vfio

NUMA hinting

Semiassignment: best of both worlds – Alex Graf

Slides

b/g on device assignment

best of both worlds

assigned device during normal operation

emulated during migration

xen solution – prototype

create a bridge in domU

guest sees a pv device and a real device

guest changes needed for bridge

migration is guest-visible, since real device goes away and comes back (hotplug)

security issue if VM doesn’t ack hot-unplug

vmware way

writing a new driver for each n/w device they want to support

this new driver calls into vmxnet

binary blob is mapped into your address space

migration is guest exposed

new blob needed for destination n/w card

alex way

emulate real device in qemu

e.g. expose emulated igbvf if passing through igbvf

need to write migration code for each adapter as well

demo

doesn’t quite work right now

is it a good idea?

how much effort really?

doesn’t think it’s much effort

current choices in datacenters are igbvf and

that’s not true!

easily a dozen adapters avl. now

lots of examples given why this claim isn’t true

no one wants a single-vendor/card dependency across an entire datacenter

non-deterministic network performance

more complicated network configuration

discussion

Another solution suggested by Benjamin Herrenschmidt: use s3; remove ‘live’ from ‘live migration’.

AER approach

General consensus was to just do bonding+failover

KVM performance: vhost scalability – John Fastabend

Slides

current situation: one kernel thread per vhost

if we create a lot of VMs and a lot of virtio-net devices, perf doesn’t scale

not numa aware

Main grouse is it doesn’t scale.

instead of having a thread for every vhost device, create a vhost thread per cpu

add some numa-awareness scheduling – pick best cpu based on load

perf graphs

for 1 VM, as the number of netperf instances increases, per-cpu-vhost doesn’t shine.

another tweak: use 2 threads per cpu: perf is better

for 4 VMs, results are good for 1-thread – much better than 2-thread (2-thread does worse than the current code). With 4 VMs, per-cpu-vhost was nearly equivalent.

on 12 VMs, 1-thread and 2-thread both work better than baseline. Per-cpu vhosts shine here, outperforming the baseline and the 1-thread/2-thread cases.

tried tcp, udp, inter-guest, all netperf tests, etc.

this result is pretty standard for all the tests they’ve done.

RFC

should they continue?

strong objections?

discussion

were you testing with raw qemu or libvirt?

as libvirt creates its own cgroups, and that may interfere.

pinning vs non-pinning

gives similar results

no objections!

in a cgroup – roundrobin the vhost threads – interesting case to check with pinning as well.

transmit and receive interfere with each other – so perf improvement was seen when they pinned transmit side.

try this on bare-metal.

Network overlays – Vivek Kashyap

want to migrate machines from one place to another in a data center

don’t want physical limitations (programming switches, routers, mac addr, etc)

idea is to define a set of tunnels which are overlaid on top of networks

vms migrate within tunnels, completely isolated from physical networks

Scaling at layer 2 is limited by the need to support broadcast/multicast over the network

overlay networks

when migrating across domains (subnets), have to re-number IP addresses

when migrating need to migrate IP and MAC addresses

When migrating across subnets might need to re-number or find another mechanism

solution is to have a set of tunnels

every end-user can view their domain/tunnel as a single virtual network

they only see their own traffic, no one else can see their traffic.

standardization is required

being worked on at IETF

MTU seen by the VM is not the same as what is on the physical network (because of headers added by the extra layers)

vxlan adds udp headers

one option is to have large(er) physical MTU so it takes care of this otherwise there will be fragmentation
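
For illustration, the usual arithmetic behind that (standard header sizes for VXLAN over IPv4, not figures from the talk):

    /* Encapsulation overhead for VXLAN over IPv4, using standard header
     * sizes.  If the physical MTU can't be raised, the guest MTU has to
     * shrink by this much to avoid fragmenting the outer packets. */
    #include <stdio.h>

    enum {
        INNER_ETH  = 14,  /* encapsulated guest Ethernet header   */
        OUTER_IPV4 = 20,  /* outer IPv4 header                    */
        OUTER_UDP  = 8,   /* outer UDP header                     */
        VXLAN_HDR  = 8,   /* VXLAN header carrying the 24-bit VNI */
    };

    /* Largest IP MTU the guest can use without fragmenting on the underlay. */
    static int guest_mtu(int physical_mtu)
    {
        return physical_mtu - (INNER_ETH + OUTER_IPV4 + OUTER_UDP + VXLAN_HDR);
    }

    int main(void)
    {
        /* 1500-byte underlay: guest MTU 1450; counting the new outer
         * Ethernet header too, each frame grows by about 50 bytes. */
        printf("guest MTU: %d\n", guest_mtu(1500));
        return 0;
    }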

Proposal

If the guest does path MTU discovery, let the tunnel end point return the ICMP error to reduce the guest’s view of the MTU.

Even if the guest has not set the DF (don’t fragment) bit, return an ICMP error. The guest will handle the ICMP error and update its view of the MTU on the route.

having the hypervisor co-operate this way means guests do path MTU discovery and things work fine

no guest changes needed, only hypervisor needs small change

(discussion) Cannot assume much about guests; guests may not handle ICMP.

Some way to avoid flooding

extend to support an ‘address resolution module’

Stephen Hemminger supported the proposal

Fragmentation

can’t assume much about guests; they may not like packets getting fragmented if they set DF

fragmentation highly likely since new headers are added

The above comment is wrong, since if DF is set we do path MTU discovery and the packet won’t be fragmented. Also, fragmentation, if done, happens on the tunnel; the VMs don’t see fragmentation, but it is not performant to fragment and reassemble at the end points.

Instead the proposal is to use path MTU discovery to make the VMs send packets that won’t need to be fragmented.

PXE, etc., can be broken

Distributed Overlay Ethernet Network

DOVE module for tunneling support

use 24-bit VNI

patches should be coming to netdev soon enough.

possibly using checksum offload infrastructure for tunneling

question: layer 2 vs layer 3

There is interest in the industry to support overlay solutions for layer 2 and layer 3.

Lightning talks

QEMU disaggregation – Stefano Stabellini

Slides

dom0 is a privileged VM

better model is to split dom0 into multiple service VMs

disk domain, n/w domain, everything else

no bottleneck, better security, simpler

hvm domain needs device model (qemu)

wouldn’t it be nice if one qemu does only disk emulation

second does network emulation

etc.

to do this, they moved pci decoder in xen

traps on all pci requests

hypervisor de-multiplexes to the ‘right’ qemu

open issues

need flexibility in qemu to start w/o devices

modern qemu better

but: always uses PCI host bridge, PIIX3, etc.

one qemu uses all of this; the others have the functionality but don’t use it

multifunction devices

PIIX3

internal dependencies

one component pulls others

vnc pulls keyboard, which pulls usb, etc.

it is in a demoable state

Xenner — Alex Graf

intro

guest kernel module that allows a xen pv kernel to run on top of kvm – messages to xenbus go to qemu

is anyone else interested in this at all?

xen folks last year did show interest for a migration path to get rid of pv code.

xen is still interested, but not in the short term – a few years out.

do you guys want to work together and get it rolling?

no one commits to anything right now
