The Linux Plumbers Conf wiki seems to have made the discussion notes for the 2012 conf read-only as well as visible only to people who have logged in. I suspect this is due to the spam problem, but I’ll put those notes here so that they’re available without needing a login. The source is here.
These are the notes I took during the virtualization microconference at the 2012 Linux Plumbers Conference.
Virtualization Security Discussion – Paul Moore
Slides
threats to virt system
3 things to worry about
attacks from host – has full access to guest
attacks from other guests on the host
break out from guest, attack host and other guests (esp. in multi-tenant situations)
attacks from the network
traditional mitigation: separate networks, physical devices, etc.
protecting guest against malicious hosts
host has full access to guest resources
host has ability to modify guest stuff at will; w/o guest knowing it
how to solve?
no real concrete solutions that are perfect
guest needs to be able to verify / attest host state
root of trust
guests need to be able to protect data when offline
(discussion) encrypt guests – internally as well as qcow2 encryption
decompose host
(discussion) don’t run services as root
protect hosts against malicious guests
just assume all guests are going to be malicious
more than just qemu isolation
how?
multi-layer security
restrict guest access to guest-owned resources
h/w passthrough – make sure devices are tied to those guests
limit avl. kernel interfaces
system calls, netlink, /proc, /sys, etc.
if a guest doesn’t need an access, don’t give it!
libvirt+svirt
MAC in host to provide separation, etc.
addresses netlink, /proc, /sys
(discussion) aside: how to use libvirt w/o GUI?
there is ‘virsh’, documentation can be improved.
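(Not from the session:) besides virsh, libvirt can also be driven directly from its C API without any GUI; a minimal sketch, with the connection URI and domain name as examples:

```c
/* Minimal sketch: connect to libvirtd and start a defined domain by name.
 * The URI and domain name are examples. Compile with: gcc start.c -lvirt */
#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) {
        fprintf(stderr, "failed to connect to libvirtd\n");
        return 1;
    }

    virDomainPtr dom = virDomainLookupByName(conn, "guest1");
    if (dom) {
        if (virDomainCreate(dom) == 0)   /* boots a defined, inactive domain */
            printf("guest1 started\n");
        virDomainFree(dom);
    }

    virConnectClose(conn);
    return 0;
}
```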
seccomp
allows syscalls to be selectively turned off; addresses the syscalls item in the list above.
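(My addition:) a minimal libseccomp sketch of what selectively turning off syscalls looks like in practice; the allowed list below is purely illustrative, not QEMU’s actual policy:

```c
/* Minimal libseccomp sketch: kill the process on any syscall not whitelisted.
 * The allowed list is illustrative only. Compile with: gcc sandbox.c -lseccomp */
#include <seccomp.h>
#include <unistd.h>

int main(void)
{
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);   /* default: kill */
    if (!ctx)
        return 1;

    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    if (seccomp_load(ctx) < 0) {                         /* install the filter */
        seccomp_release(ctx);
        return 1;
    }
    seccomp_release(ctx);

    write(1, "sandboxed\n", 10);                         /* still allowed */
    return 0;                                            /* exit_group allowed */
}
```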
priv separation
libvirt handles n/w, file desc. passing, etc.
protecting guest against hostile networks
guests vulnerable directly and indirectly
direct: buggy apache
indirect: host attacked
qos issue on loaded systems
host and guest firewalls can solve a lot of problems
extend guest separation across network
network virt – for multi-tenant solutions
guest ipsec and vpn services on host
(discussion) blue pill vulnerability – how to mitigate?
lot of work being done by trusted computing group – TPM
maintain a solid root of trust
somebody pulling the rug out from beneath you; can happen even after boot
you’ll need h/w support?
yes, TPM
UEFI, secure boot
what about post-boot security threats?
let’s say we booted securely. other mechanisms you can enable – IMA – extend the root of trust higher: signed hashes, binaries.
unfortunately, details beyond scope for a 20-min talk
Storage Virtualization for KVM: Putting the pieces together – Bharata Rao
Slides
Different aspects of storage mgmt with kvm
mainly using glusterfs as storage backend
integrating with libvirt, vdsm
problems
multiple choices for fs and virt mgmt
libvirt, ovirt, etc.
not many fs’es are virt-ready
virt features like snapshots, thin-provisioning, cloning not present as part of fs
some things done in qemu (snapshots, block mig, img fmt handling) are better handled outside
storage integration
storage device vendors don’t have well-defined interfaces
gluster has potential: leverage its capabilities, and solve many of these problems.
intro on glusterfs
userspace distributed fs
aggregates storage resources from multiple nodes and presents a unified fs namespace
glusterfs features
replication, striping, distribution, geo-replication/sync, online volume extension
(discussion) why gluster vs ceph?
gluster is modular; pluggable, flexible.
keeps the storage stack clean: only those things which are needed are kept active
gluster doesn’t have metadata.
unfortunately, gluster people not around to answer these questions.
by having backend in qemu, qemu can already leverage glusterfs features
(discussion) there is a rados spec in qemu already
yes, this is one more protocol that qemu will now support
glusterfs is modular: details
translators: convert requests from users into requests for storage
open/read/write calls percolate down the translator stack
any plugin can be introduced in the stack
current status: enablement work to integrate gluster-qemu
start by writing a block driver in qemu to support gluster natively
add block device support in gluster itself via block device translator
(discussion) do all features of gluster work with these block devices?
not yet, early stages. Hope is all features will eventually work.
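(For context, my addition:) the native gluster block driver mentioned above talks to the volume through libgfapi rather than a FUSE mount; a rough sketch of that API, with the volume, server and image names as placeholders:

```c
/* Rough libgfapi sketch: open a VM image on a gluster volume directly,
 * bypassing the FUSE mount. Volume, server and file names are placeholders.
 * Compile with: gcc gfapi.c -lgfapi */
#include <stdio.h>
#include <fcntl.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    glfs_t *fs = glfs_new("testvol");                     /* volume name */
    glfs_set_volfile_server(fs, "tcp", "gluster-server", 24007);
    if (glfs_init(fs) != 0) {
        fprintf(stderr, "glfs_init failed\n");
        return 1;
    }

    glfs_fd_t *fd = glfs_open(fs, "vm-image.raw", O_RDWR);
    if (fd) {
        char buf[512];
        glfs_read(fd, buf, sizeof(buf), 0);               /* read first sector */
        glfs_close(fd);
    }

    glfs_fini(fs);
    return 0;
}
```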
interesting use-case: replace qemu block dev with gluster translators
would you have to re-write qcow2?
don’t need to, many of qcow2 features already exist in glusterfs common code
slide showing perf numbers
future
is it possible to export LUNs to gluster clients?
creating a VM image means creating a LUN
exploit per-vm storage offload – all this using a block device translator
export LUNs as files; also export files as LUNs.
(discussion) why not use raw files directly instead of adding all this overhead? This looks like a perf disaster (ip network, qemu block layer, etc.) – combination of stuff increasing latency, etc.
all of this is experimentation, to explore directions we haven’t yet thought about – explore new opportunities. this is just the initial work; more interesting stuff can be built upon this platform later.
libvirt, ovirt, vdsm support for glusterfs added – details in slides
(discussion) storage array integration (slide) – question
a way vendors could integrate SAN storage into the virt stack.
we should have the capability to use array-assisted features to create LUNs.
from ovirt/vdsm/libvirt/etc.
(discussion) we already have this in scsi. why add another layer? why in userspace?
difficult, as per current understanding: send commands directly to storage: fast copy from LUN-to-LUN, etc., not via SCSI T10 extensions.
these are out-of-band mechanisms, in mgmt path, not data path.
why would someone want to do that via python etc.?
Next generation interrupt virtualization for KVM – Joerg Roedel
Slides
presenting new h/w tech today that accelerates guests
current state
kvm emulates local apic and io-apic
all reads/writes intercepted
interrupts can be queued from user or kernel
ipi costs high
h/w support
limited
tpr is accelerated by using cr8 register
only used by 64 bit guests
shiny new feature: avic
avic is designed to accelerate the most common interrupt system features
ipi
tpr
interrupts from assigned devs
ideally none of those require intercepts anymore
avic virtualizes apic for each vcpu
uses an apic backing page
guest physical apic id table, guest logical apic id table
no x2apic in first version
guest vapic backing page
store local apic contents for one vcpu
writes to accelerated registers won’t intercept
writes to non-accelerated registers cause intercepts
accelerated:
tpr
EOI
ICR low
ICR high
physical apic id table
maps guest physical apic id to host vapic pages
(discussion) what if guest cpu is not running
will be covered later
table maintained by kvm
logical apic id table
maps guest logical apic ids to guest physical apic ids
indexed by guest logical apic id
doorbell mechanism
used to signal avic interrupts between physical cpus
src pcpu figures out physical apic id of the dest.
when dest. vcpu is running, it sends doorbell interrupt to physical cpu
iommu can also send doorbell messages to pcpus
iommu checks if vcpu is running too
for non-running vcpus, it sends an event log entry
important for assigned devices
msr can also be used to issue doorbell messages by hand – for emulated devices
running and not running vcpus
doorbell only when vcpu running
if target pcpu is not running, sw notified about a new interrupt for this vcpu
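(My reading of the above, as a toy model – the names and helpers are stand-ins, not kvm-amd code:)

```c
/* Toy model of the delivery decision described above; all names are
 * placeholders and the helpers are stubs, not real kvm-amd code. */
#include <stdio.h>

struct vcpu {
    int is_running;              /* set on sched-in, cleared on sched-out */
    unsigned int host_apic_id;   /* physical APIC id of the pCPU it runs on */
};

/* Stand-ins for the real mechanisms. */
static void doorbell_msr_write(unsigned int host_apic_id)
{
    printf("doorbell -> pCPU with APIC id %u (hw injects, no vmexit)\n",
           host_apic_id);
}

static void kvm_vcpu_kick_stub(struct vcpu *v)
{
    (void)v;
    printf("vCPU not running: software injection + wakeup\n");
}

static void deliver_guest_ipi(struct vcpu *target)
{
    if (target->is_running)
        doorbell_msr_write(target->host_apic_id);  /* fast path: hw doorbell */
    else
        kvm_vcpu_kick_stub(target);                /* slow path: notify sw   */
}

int main(void)
{
    struct vcpu running  = { .is_running = 1, .host_apic_id = 3 };
    struct vcpu sleeping = { .is_running = 0, .host_apic_id = 5 };

    deliver_guest_ipi(&running);
    deliver_guest_ipi(&sleeping);
    return 0;
}
```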
support in iommu
iommu necessary for avic-enabled device pass-through
(discussion) kvm has to maintain this? enable/disable on sched-in/sched-out
support can mostly be implemented in the kvm-amd module
some outside support in apic emulation
some changes to lapic emulation
change layout
kvm x86 core code will allocate vapic pages
(discussion) instead of kvm_vcpu_kick(), just ring the doorbell
vapic page needs to be mapped in nested page table
likely requires changes to kvm softmmu code
open question wrt device passthrough
changes to vfio required
ideally fully transparent to userspace
Reviewing Unused and New Features for Interrupt/APIC Virtualization – Jun Nakajima
Slides
Intel is going to talk about a similar hardware feature
intel: have a bitmap, and they decide whether to exit or not.
amd: hardcoded. apic timer counter, for example.
q to intel: do you have other things to talk about?
yes, coming up later.
a paper on the net showed perf improving from 5-6 Gbit/s to wire speed, using emulation of this tech.
intel have numbers on their slides.
they used sr-iov 10gbe; measured vmexit
interrupt window: when the hypervisor wants to inject an interrupt, the guest may not be ready to accept it. the hypervisor has to enter the vm; when the guest is ready to receive the interrupt, it comes back with a vmexit. problem: the more interrupts you need to inject, the more vmexits, and the busier the guest gets. so: they wanted to eliminate them.
read case: if you have something in advance (apic page), hyp can just point to that instead of this exit dance
more than 50% exits are interrupt-related or apic related.
new features for interrupt/apic virt
reads are redirected to apic page
writes: vmexit after write; not intercepted. no need for emulation.
virt-interrupt delivery
extend tpr virt to other apic registers
eoi – no need for vm exits (using new bitmap)
this looks different from amd
but for eoi behaviour, intel/amd can have common interface.
intel/amd comparing their approaches / features / etc.
most notably, intel have support for x2apic, not for iommu. amd have support for iommu, not for x2apic.
for apic page, approaches mostly similar.
virt api can have common infra, but data structures are totally different. intel spec will be avl. in a month or so (update: already available now). amd spec shd be avl in a month too.
they can remove interrupt window, meaning 10% optimization for 6 VM case
net result
eliminate 50% of vmexits
optimization of 10% vmexits.
intel also supports x2apic h/w.
VMFUNC
this can hide info from other vcpus
secure channel between guest and host; can do whatever hypervisor wants.
vcpu executes vmfunc instruction in special thread
usecases:
allow hvm guests to share pages/info with hypervisor in secure fashion
(discussion) why not just add to ept table
(discussion) does intel’s int. virt. have an iommu component too?
doesn’t want to commit.
CoLo – Coarse-grained Lock-stepping VM for non-stop service – Will Auld
Slides
non-stop service with VM replication
client-server
Compare and contrast with Remus – Xen’s solution
xen: Remus
buffers responses until checkpoint to secondary server completes (once per epoch)
resumes secondary only on failover
failover at anytime
xen: colo
runs two VMs in parallel comparing their responses, checkpoints only on miscompare
resumes after every checkpoint
failover at anytime
CoLo runs VMs on primary and secondary at same time.
both machines respond to requests; they check for similarity. When they agree, one of the responses is sent to the client
diff. between two models:
Remus: assumes machine states have to be the same. This is the reason to buffer responses until the checkpoint has completed.
in colo, no such requirement; the only requirement is that the request stream must be the same.
CoLo non-stop service focus on server response, not internal machine state (since multiprocessor environment is inherently nondeterministic)
there’s heartbeat, checkpoint
colo managers on both machines compare responses.
when they’re not the same, CoLo does a checkpoint.
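(Not from the slides:) a toy model of the compare-and-checkpoint loop described above, with simulated packets:

```c
/* Toy model of CoLo's output comparison: release responses while the two VMs
 * agree, checkpoint on the first miscompare. Packet contents are simulated. */
#include <stdio.h>
#include <string.h>

static void send_to_client(const char *pkt) { printf("release: %s\n", pkt); }
static void checkpoint(void)
{
    printf("miscompare -> checkpoint primary state to secondary\n");
}

int main(void)
{
    /* Simulated outbound packets from primary and secondary VMs. */
    const char *primary[]   = { "HTTP/1.1 200 OK", "body-1", "body-2" };
    const char *secondary[] = { "HTTP/1.1 200 OK", "body-1", "body-X" };
    const int n = 3;

    for (int i = 0; i < n; i++) {
        if (strcmp(primary[i], secondary[i]) == 0) {
            send_to_client(primary[i]);    /* agreement: either copy is fine */
        } else {
            checkpoint();                  /* disagreement: resync secondary */
            send_to_client(primary[i]);    /* primary's response wins        */
        }
    }
    return 0;
}
```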
(discussion) why will response be same?
internal machine state shouldn’t matter for most responses.
some exceptions, like tcp/ip timestamps.
minor modifications to tcp/ip stacks
coarse grain time stamp
highly deterministic ack mechanism
even then, state of machine is dissimilar.
resume of machine on secondary node:
another stumbling block.
(slides) graph on optimizations
how do you handle disk access? network is easier – n/w stack resumes on failover. if you don’t do failover in a state where you know disk is in a consistent state, you can get corruption.
Two solutions
For NAS, do same compares as with responses (this can also trigger checkpoints).
On local disks, buffer the original state of changed pages, revert to the original and then checkpoint with the primary node’s disk writes included. This is equivalent to how the memory image is updated. (This was not described completely enough during the session.)
that sounds dangerous. client may have acked data, etc.
will have to look closer at this. (More complete explanation above counters this)
how often do you get mismatch?
depends on workload. some runs saw 300-400 good packets, then a mismatch.
during that, are you vulnerable to failure?
no, can failover at any point. internal state doesn’t matter. Both VMs provide consistent request streams from their initial state and match responses up to the moment of failover.
NUMA – Dario Faggioli, Andrea Arcangeli
NUMA and Virtualization, the case of Xen – Dario Faggioli
Slides
Intro to NUMA
access cost to memory is different, based on which processor accesses it
remote mem is slower
in context of virt, want to avoid accessing remote memory
what we used to have in xen
on creation of VM, memory was allocated on all nodes
to improve: automatic placement
at vm1 creation time, pin vm1 to first node,
at vm2 create time, pin vm2 to second node since node1 already has a vm pinned to it
then they went ahead a bit, because pinning was inflexible
inflexible
lots of idle cpus and memories
what they will have in xen 4.3
node affinity
instead of static cpu pinning, preference to run vms on specific cpus
perf evaluation
specjbb in 3 configs (details in slides)
they get 13-17% improvements with 2 vcpus in each vm
open problems
dynamic memory migration
io numa
take into consideration io devices
guest numa
if vm bigger than 1 node, should guest be aware?
ballooning and sharing
sharing could cause remote access
ballooning causes local pressures
inter-vm dependencies
how to actually benchmark and evaluate perf to see if they’re improving
AutoNUMA – Andrea Arcangeli
Slides
solving similar problem for Linux kernel
implementation details avail in slides, will skip now
components of autonuma design
novel approach
mem and cpu migration tried earlier using diff. approaches, optimistic about this approach.
core design comes down to two novel ideas
introduce numa hinting pagefaults
works at thread-level, on thread locality
false sharing / relation detection
autonuma logic
cpu follows memory
memory in b/g slowly follows cpu
actual migration is done by knuma_migrated
all this is async and b/g, doesn’t stall memory channels
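(My illustration, not Andrea’s code:) a toy model of what a NUMA hinting fault gives you – which node’s CPU touched a page versus which node currently backs it:

```c
/* Toy illustration of NUMA hinting faults: on each (simulated) fault we learn
 * which node the faulting thread ran on vs. which node backs the page, and
 * queue a migration when they disagree. Not AutoNUMA's real code. */
#include <stdio.h>

struct fault { int page_node; int cpu_node; };

int main(void)
{
    /* Simulated stream of hinting faults from one thread. */
    struct fault faults[] = {
        { .page_node = 0, .cpu_node = 0 },
        { .page_node = 1, .cpu_node = 0 },   /* remote access */
        { .page_node = 1, .cpu_node = 0 },   /* remote access */
    };

    int local = 0, remote = 0;
    for (unsigned i = 0; i < sizeof(faults) / sizeof(faults[0]); i++) {
        if (faults[i].page_node == faults[i].cpu_node) {
            local++;
        } else {
            remote++;
            printf("queue page for migration: node %d -> node %d\n",
                   faults[i].page_node, faults[i].cpu_node);
        }
    }
    printf("locality: %d local, %d remote faults\n", local, remote);
    return 0;
}
```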
benchmarking
developed a new benchmark tool, autonuma-benchmark
generic to measure alternative approaches too
comparing to gravity measurement
put all memory in single node
then drop pinning
then see how memory spreads by autonuma logic
see slides for graphics on convergence
perf numbers
also includes comparison with alternative approach, sched numa.
graphs show autonuma is better than schednuma, which is better than vanilla kernel
criticism
complex
takes more memory
takes 12 bytes per page, Andrea thinks it’s reasonable.
it’s done to try to decrease risk of hitting slowdowns (is faster than vanilla already)
stddev shows autonuma is pretty deterministic
why is autonuma so important?
even 2 sockets show differences and improvements.
but 8 nodes is where autonuma really shines
Discussions
looks like andrea is focusing on 2 nodes / sockets, not more? looks like it will have a bad impact on bigger nodes
points to graph showing 8 nodes
on bigger systems, node distances are greater.
agrees autonuma doesn’t worry about distances
currently worries only about convergence
distance will be taken as future optimisation
same for Xen case
access to 2 node h/w is easier
as Andrea also mentioned, improvement on 2 node is lower bound; so improvements on bigger ones should be bigger too; don’t expect to be worse
not all apps just compute; they do io, and they migrate to the right cpu, close to where the device is.
are we moving memory to cpu, or cpu to device, etc… what should the heuristic be?
1st step should be to get cpu and mem right – they matter the most.
doing for kvm is great since it’s in linux, and everyone gets the benefit.
later, we might want to take other tunables into account.
crazy things in enterprise world, like storage
for high-perf networking, use tight binding, and then autonuma will not interfere.
this already works.
xen case is similar
this is also something that’s going to be workload-dependent, so custom configs/admin is needed.
did you have a chance to test on AMD Magny-Cours (many more nodes)
hasn’t tried autonuma on specific h/w
more nodes, better relative perf, since upstream is that much worse.
xen
he did, and at least placement was better.
more benchmarking is planned.
suggestion: do you have a way to measure imbalance / number of accesses going to correct node
to see if it’s moving towards convergence, or not moving towards convergence, maybe via tracepoints
essentially to analyse what the system is doing.
exposing this data so it can be analysed.
using printks right now for development; there’s a lot of info – all the info you need to see why the algorithm is doing what it’s doing.
good to have in production so that admins can see
what autonuma is doing
how much is it converging
to decide to make it more aggressive, etc.
overall, all such stats can be easily exported; it’s already avl. via printk, but has to be moved to something more structured and standard.
xen case is same; trying to see how they can use perf counters, etc. for statistical review of what is going on, but not precise enough
tells how many remote memory accesses are happening, but not from where and to where
something more in s/w is needed to enable this information.
One balloon for all – towards unified balloon driver – Daniel Kiper
Slides
wants to integrate various balloon drivers avl. in Linux
currently 3 separate drivers
virtio
xen
vmware
despite impl. differences, their core is similar
feature difference in drivers (xen has selfballooning)
overall lots of duplicate code
do we have an example of a good solution?
yes, generic mem hotplug code
core functionality is h/w independent
arch-specific parts are minimal, most is generic
solution proposed
core should be hypervisor-independent
should co-operate at the h/w independent level – e.g. mem hotplug, tmem, movable pages to reduce fragmentation
selfballooning ready
support for hugepages
standard api and abi if possible
arch-specific parts should communicate with underlying hypervisor and h/w if needed
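(A sketch of my own, with invented names:) what a hypervisor-independent balloon core with small per-hypervisor backends might look like:

```c
/* Hypothetical sketch of a unified balloon layout: a generic core drives
 * small hypervisor-specific backends. All names are invented. */
#include <stdio.h>

struct balloon_ops {
    const char *name;
    /* Tell the hypervisor the guest is giving back / reclaiming nr pages. */
    int (*inflate)(unsigned long nr_pages);
    int (*deflate)(unsigned long nr_pages);
};

/* One small backend per hypervisor; only these talk to virtio/xen/vmware. */
static int demo_inflate(unsigned long nr) { printf("backend: release %lu pages\n", nr); return 0; }
static int demo_deflate(unsigned long nr) { printf("backend: reclaim %lu pages\n", nr); return 0; }

static const struct balloon_ops demo_backend = {
    .name = "demo", .inflate = demo_inflate, .deflate = demo_deflate,
};

/* Generic core: page selection, movable-page handling and selfballooning
 * policy would all live here, shared by every hypervisor. */
static void balloon_set_target(const struct balloon_ops *ops, long delta_pages)
{
    if (delta_pages > 0)
        ops->inflate(delta_pages);
    else if (delta_pages < 0)
        ops->deflate(-delta_pages);
}

int main(void)
{
    balloon_set_target(&demo_backend, 256);   /* shrink guest by 256 pages */
    balloon_set_target(&demo_backend, -128);  /* grow guest by 128 pages   */
    return 0;
}
```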
crazy idea
replace ballooning with mem hot-unplug support
however, ballooning operates on single pages whereas hotplug/unplug works on groups of pages that are arch-dependent.
not flexible at all
have to use userspace interfaces
can be done via udev scripts, which is a better way
discussion: does acpi hotplug work seamlessly?
on x86 baremetal, hotplug works like this:
hotplug mem
acpi signals to kernel
acpi adds to mem space
this is not visible to processes directly
has to be enabled via sysfs interfaces, by writing an enable command to every memory section that has to be hotplugged
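(For reference:) the sysfs step looks roughly like this; the section number is just an example, and in practice this is driven from a udev rule:

```c
/* Online one hotplugged memory section via sysfs; "memory42" is an example,
 * real section numbers depend on the system. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/devices/system/memory/memory42/state";
    FILE *f = fopen(path, "w");
    if (!f) {
        perror("open memory section");
        return 1;
    }
    fputs("online", f);   /* kernel adds the section to the page allocator */
    fclose(f);
    return 0;
}
```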
is selfballooning desirable?
kvm isn’t looking at it
guest wants to keep mem to itself, it has no interest in making host run faster
you paid for mem, but why not use all of it
if there’s a tradeoff for the guest: you pay less, you get more mem later, etc., guests could be interested.
essentially, what is guest admin’s incentive to give up precious RAM to host?
ARM – Marc Zyngier, Stefano Stabellini
Some other notes are on session etherpad
KVM ARM – Marc Zyngier
Slides
ARM architecture virtualization extensions
recent introduction in arch
new hypervisor mode PL2
traditionally secure state and non-secure state
Hyp mode is in non-secure side
higher privilege than kernel mode
adds second stage translation; adds extra level of indirection between guests and physical mem
tlbs are tagged by VMID (like EPT/NPT)
ability to trap accesses to most system registers
can handle irqs, fiqs, async aborts
e.g. guest doesn’t see interrupts firing
hyp mode: not a superset of SVC
has its own pagetables
only stage 1, not 2
follows LPAE, the new large physical address extensions.
one translation table register
so difficult to run Linux directly in Hyp mode
therefore they use Hyp mode to switch between host and guest modes (unlike x86)
KVM/ARM
uses HYP mode to context switch from host to guest and back
exits guest on physical interrupt firing
access to a few privileged system registers
WFI (wait for interrupt)
(discussion) WFI is trapped and then we exit to host
etc.
on guest exit, control restored to host
no nesting; arch isn’t ready for that.
MM
host in charge of all MM
has no stage2 translation itself (saves tlb entries)
guests are in total control of page tables
becomes easy to map a real device into the guest physical space
for emulated devices, accesses fault, generates exit, and then host takes over
4k pages only
instruction emulation
trap on mmio
most instructions described in HSR
added complexity due to having to handle multiple ISAs (ARM, Thumb)
interrupt handling
redirect all interrupts to hyp mode only while running a guest. This only affects physical interrupts.
leave it pending and return to host
pending interrupt will kick in when we return to guest mode?
No, it will be handled in host mode. Basically, we use the redirection to HYP mode to exit the guest, but keep the handling on the host.
inject two ways
manipulating arch. pins in the guest?
The architecture defines virtual interrupt pins that can be manipulated (VI→I, VF→F, VA→A). The host can manipulate these pins to inject interrupts or faults into the guest.
using virtual GIC extensions,
booting protocol
if you boot in HYP mode, and if you enter a non-kvm kernel, it gracefully goes back to SVC.
if kvm-enabled kernel is attempted to boot into, automatically goes into HYP mode
If a kvm-enabled kernel is booted in HYP mode, it installs a HYP stub and goes back to SVC. The only goal of this stub is to provide a hook for KVM (or another hypervisor) to install itself.
current status
pending: stable userspace ABI
pending: upstreaming
stuck on reviewing
Xen ARM – Stefano Stabellini
Slides
Why?
arm servers
smartphones
288 cores in a 4U rack – causes a serious maintenance headache
challenges
traditional way: port xen, and port hypercall interface to arm
from Linux side, using PVOPS to modify setpte, etc., is difficult
then, armv7 came.
design goals
exploit h/w as much as possible
limit to one type of guest
(x86: pv, hvm)
no pvops, but pv interfaces for IO
no qemu
lots of code, complicated
no compat code
32-bit, 64-bit, etc., complicated
no shadow pagetables
most difficult code to read ever
NO emulation at all!
one type of guest
like pv guests
boot from a user supplied kernel
no emulated devices
use PV interfaces for IO
like hvm guests
exploit nested paging
same entry point on native and xen
use device tree to discover xen presence
simple device emulation can be done in xen
no need for qemu
exploit h/w
running xen in hyp mode
no pv mmu
hypercall
generic timer
export timer int. to guest
GIC: general interrupt controller
int. controller with virt support
use GIC to inject event notifications into any guest domains with Xen support
x86 taught us this provides a great perf boost (event notifications on multiple vcpus simultaneously)
on x86, we had a pci device to inject event notifications into the guest as a legacy interrupt
hypercall calling convention
hvc (hypercall)
pass params on registers
hvc takes an argument: 0xEA1 – means it’s a xen hypercall (see the sketch after this list).
64-bit ready abi (another lesson from x86)
no compat code in xen
2600 fewer lines of code
had to write a 1500-line patch of mechanical substitutions to make all guests work fine on a 32-bit host
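(My sketch of the calling convention above, for a 32-bit guest; illustrative only, real guests should use the headers shipped with Xen:)

```c
/* Sketch of the hvc-based calling convention described above: hypercall
 * number in r12, arguments in r0..., hvc with the 0xEA1 tag marking a Xen
 * hypercall. Builds only for 32-bit ARM; illustrative, not the real headers. */
static inline long xen_hypercall1(unsigned long op, unsigned long arg0)
{
    register unsigned long r12 __asm__("r12") = op;    /* hypercall number */
    register unsigned long r0  __asm__("r0")  = arg0;  /* first argument   */

    __asm__ volatile("hvc #0xEA1"                      /* Xen hypercall tag */
                     : "+r"(r0)
                     : "r"(r12)
                     : "memory");
    return r0;                                         /* return value in r0 */
}
```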
status
xen and dom0 boot
vm creation and destruction work
pv console, disk, network work
xen hypervisor patches almost entirely upstream
linux side patches should go in next merge window
open issues
acpi
will have to add acpi parsers, etc. in device table
linux has 110,000 lines – should all be merged
uefi
grub2 on arm: multiboot2
need to virtualise runtime services
so only hypervisor can use them now
client devices
lack ref arch
difficult to support all tablets, etc. in market
uefi secure boot (is required by win8)
windows 8
Discussion
who’s responsible for device tree mgmt for xen?
xen takes the dt from hw, changes it for mem mgmt, then passes it to dom0
at present, you have to build the dt binary
at the moment, linux kernel infrastructure doesn’t support interrupt priorities.
needed to prevent a guest doing a DoS on host by just generating interrupts non-stop
xen does support int. priorities in GIC
VFIO – Are we there yet? – Alex Williamson
Slides
are we there yet? almost
what is vfio?
virtual function io
not sr-iov specific
userspace driver interface
kvm/qemu vm is a userspace driver
iommu required
visibility issue with devices in iommu, guaranteeing devices are isolated and safe to use – different from uio.
config space access is done from kernel
adds to safety requirement – can’t have userspace doing bad things on host
what’s different from last year?
1st proposal shot down last year, and got revised at last LPC
allow IOMMU driver to define device visibility – not per-device, but the whole group exposed
more modular
what’s different from pci device assignment
x86 only
kvm only
no iommu grouping
relies on pci-sysfs
turns kvm into a device driver
current status
core pci and iommu drivers in 3.6
qemu will be pushed for 1.3
what’s next?
qemu integration
legacy pci interrupts
more of a qemu-kvm problem, since vfio already supports this, but these are unique since they’re level-triggered; host has to mask interrupt so it doesn’t cause a DoS till guest acks interrupt
would like to bypass qemu directly – irqfd exists for edge-triggered; now exposing irqfd for level
(lots of discussion here)
libvirt support
iommu grps changed the way we do device assignment
sysfs entry point; move device to vfio driver
do you pass group by file descriptor?
lots of discussion on how to do this
existing method needs name for access to /sys
how can we pass file descriptors from libvirt for groups and containers to work in different security models?
The difficulty is in how qemu assembles the groups and containers. On the qemu command line, we specify an individual device, but that device lives in a group, which is the unit of ownership in vfio and may or may not be connectable to other containers. We need to figure out the details here.
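(Condensed from the kernel’s VFIO documentation, not from the talk; the group number and device address are examples:) the group/container dance that libvirt and qemu have to agree on looks roughly like this:

```c
/* Condensed from Documentation/vfio.txt: open a container, check the group is
 * viable, attach it, set the IOMMU model, then get a device fd. The group
 * number and device address are examples. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int main(void)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group     = open("/dev/vfio/26", O_RDWR);        /* group 26: example */
    if (container < 0 || group < 0) {
        perror("open");
        return 1;
    }

    if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
        return 1;

    struct vfio_group_status status = { .argsz = sizeof(status) };
    ioctl(group, VFIO_GROUP_GET_STATUS, &status);
    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE))        /* all devices bound? */
        return 1;

    /* Ownership: whoever holds these fds owns the whole group. */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
    printf("device fd: %d\n", device);
    return 0;
}
```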
POWER support
already adding
PowerPC
freescale looking at it
one api for x86, ppc was strange
error reporting
better ability to inject AER etc to guest
maybe another ioctl interrupt
If we do get PCIe AER errors to show up at a device, what is the guest going to be able to do (for instance, can it reset links)?
We’re going to have to figure this out and it will factor into how much of the AER registers on the device do we expose and allow the guest to control. Perhaps not all errors are guest serviceable and we’ll need to figure out how to manage those.
better page pinning and mapping
gup issues with ksm running in b/g
PRI support
graphics support
issues with legacy io port space and mmio
can be handled better with vfio
NUMA hinting
Semiassignment: best of both worlds – Alex Graf
Slides
b/g on device assignment
best of both worlds
assigned device during normal operation
emulated during migration
xen solution – prototype
create a bridge in domU
guest sees a pv device and a real device
guest changes needed for bridge
migration is guest-visible, since real device goes away and comes back (hotplug)
security issue if VM doesn’t ack hot-unplug
vmware way
writing a new driver for each n/w device they want to support
this new driver calls into vmxnet
binary blob is mapped into your address space
migration is guest exposed
new blob needed for destination n/w card
alex way
emulate real device in qemu
e.g. expose emulated igbvf if passing through igbvf
need to write migration code for each adapter as well
demo
doesn’t quite work right now
is it a good idea?
how much effort really?
doesn’t think it’s much effort
current choices in datacenters are igbvf and
that’s not true!
easily a dozen adapters avl. now
lots of examples given why this claim isn’t true
no one needs single-vendor/card dependency in an entire datacenter
non-deterministic network performance
more complicated network configuration
discussion
Another solution suggested by Benjamin Herrenschmidt: use s3; remove ‘live’ from ‘live migration’.
AER approach
General consensus was to just do bonding+failover
KVM performance: vhost scalability – John Fastabend
Slides
current situation: one kernel thread per vhost
if we create a lot of VMs and a lot of virtio-net devices, perf doesn’t scale
not numa aware
Main grouse is it doesn’t scale.
instead of having a thread for every vhost device, create a vhost thread per cpu
add some numa-aware scheduling – pick the best cpu based on load
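(A toy sketch of my own, not the actual patches:) the proposed selection policy – one worker per host cpu, with new work going to the least-loaded worker on the device’s NUMA node:

```c
/* Toy sketch of a per-cpu vhost worker pool with NUMA-aware selection.
 * All names and numbers are invented for illustration. */
#include <stdio.h>

#define NCPUS 4

struct worker { int cpu; int node; int load; };

static struct worker workers[NCPUS] = {
    { 0, 0, 3 }, { 1, 0, 1 }, { 2, 1, 0 }, { 3, 1, 5 },
};

/* Pick the least-loaded per-cpu worker on the requested NUMA node. */
static struct worker *pick_worker(int node)
{
    struct worker *best = NULL;
    for (int i = 0; i < NCPUS; i++) {
        if (workers[i].node != node)
            continue;
        if (!best || workers[i].load < best->load)
            best = &workers[i];
    }
    return best;
}

int main(void)
{
    struct worker *w = pick_worker(1);   /* device memory lives on node 1 */
    printf("queue work on cpu %d (node %d, load %d)\n",
           w->cpu, w->node, w->load);
    return 0;
}
```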
perf graphs
for 1 VM, as the number of netperf instances increases, per-cpu-vhost doesn’t shine.
another tweak: use 2 threads per cpu: perf is better
for 4 VMs, results are good for 1-thread, much better than 2-thread (2-thread does worse than current). With 4 VMs, per-cpu-vhost was nearly equivalent.
on 12 VMs, 1-thread works better, and 2-thread works better than baseline. Per-cpu vhosts shine here, outperforming baseline and the 1-thread/2-thread cases.
tried tcp, udp, inter-guest, all netperf tests, etc.
this result is pretty standard for all the tests they’ve done.
RFC
should they continue?
strong objections?
discussion
were you testing with raw qemu or libvirt?
as libvirt creates its own cgroups, and that may interfere.
pinning vs non-pinning
gives similar results
no objections!
in a cgroup – roundrobin the vhost threads – interesting case to check with pinning as well.
transmit and receive interfere with each other – so perf improvement was seen when they pinned transmit side.
try this on bare-metal.
Network overlays – Vivek Kashyap
want to migrate machines from one place to another in a data center
don’t want physical limitations (programming switches, routers, mac addr, etc)
idea is to define a set of tunnels which are overlaid on top of networks
vms migrate within tunnels, completely isolated from physical networks
Scaling at layer 2 is limited by the need to support broadcast/multicast over the network
overlay networks
when migrating across domains (subnets), have to re-number IP addresses
when migrating need to migrate IP and MAC addresses
When migrating across subnets might need to re-number or find another mechanism
solution is to have a set of tunnels
every end-user can view their domain/tunnel as a single virtual network
they only see their own traffic, no one else can see their traffic.
standardization is required
being worked on at IETF
the MTU seen by the VM is not the same as what is on the physical network (because headers are added by extra layers)
vxlan adds udp headers
one option is to have a large(r) physical MTU so that it takes care of this; otherwise there will be fragmentation
Proposal
If the guest does path MTU discovery, let the tunnel end point return the ICMP error to reduce the guest’s view of the MTU.
Even if the guest has not set the DF (don’t fragment) bit, return an ICMP error. The guest will handle the ICMP error and update its view of the MTU on the route.
have the hypervisor co-operate so guests do path MTU discovery and things work fine
no guest changes needed, only hypervisor needs small change
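(My numbers, assuming VXLAN over IPv4 and a 1500-byte physical MTU:) the encapsulation overhead works out to 50 bytes, so the guest-visible MTU has to drop to 1450:

```c
/* Worked example of the MTU shrinkage for VXLAN over IPv4. Header sizes are
 * the standard ones; the 1500-byte physical MTU is an assumption. */
#include <stdio.h>

int main(void)
{
    const int physical_mtu = 1500;  /* IP MTU of the underlay network     */
    const int inner_eth    = 14;    /* encapsulated guest Ethernet header */
    const int vxlan_hdr    = 8;     /* VXLAN header (flags + 24-bit VNI)  */
    const int outer_udp    = 8;     /* outer UDP header                   */
    const int outer_ipv4   = 20;    /* outer IPv4 header, no options      */

    int overhead  = inner_eth + vxlan_hdr + outer_udp + outer_ipv4;  /* 50   */
    int guest_mtu = physical_mtu - overhead;                         /* 1450 */

    printf("guest-visible MTU must drop to %d (tunnel overhead: %d bytes)\n",
           guest_mtu, overhead);
    return 0;
}
```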
(discussion) Cannot assume much about guests; guests may not handle ICMP.
Some way to avoid flooding
extend to support an ‘address resolution module’
Stephen Hemminger supported the proposal
Fragmentation
can’t assume much about guests; they may not like packets getting fragmented if they set DF
fragmentation highly likely since new headers are added
The above comment is wrong, since if DF is set we do path MTU discovery and the packet won’t be fragmented. Also, fragmentation, if done, is on the tunnel; the VMs don’t see fragmentation, but it is not performant to fragment and reassemble at the end points.
Instead the proposal is to use path MTU discovery to make the VMs send packets that won’t need to be fragmented.
PXE, etc., can be broken
Distributed Overlay Ethernet Network
DOVE module for tunneling support
use 24-bit VNI
patches should be coming to netdev soon enough.
possibly using checksum offload infrastructure for tunneling
question: layer 2 vs layer 3
There is interest in the industry to support overlay solutions for layer 2 and layer 3.
Lightning talks
QEMU disaggregation – Stefano Stabellini
Slides
dom0 is a privileged VM
better model is to split dom0 into multiple service VMs
disk domain, n/w domain, everything else
no bottleneck, better security, simpler
hvm domain needs device model (qemu)
wouldn’t it be nice if one qemu did only disk emulation
second does network emulation
etc.
to do this, they moved the pci decoder into xen
traps on all pci requests
hypervisor de-multiplexes to the ‘right’ qemu
open issues
need flexibility in qemu to start w/o devices
modern qemu better
but: always uses PCI host bridge, PIIX3, etc.
one qemu uses all of this; the others have the functionality but don’t use it
multifunction devices
PIIX3
internal dependencies
one component pulls others
vnc pulls keyboard, which pulls usb, etc.
it is in demoable state
Xenner — Alex Graf
intro
guest kernel module that allows a xen pv kernel to run on top of kvm – messages to xenbus go to qemu
is anyone else interested in this at all?
xen folks last year did show interest in a migration path to get rid of pv code.
xen is still interested, but not in the short term – a few years out.
do you guys want to work together and get it rolling?
no one commits to anything right now