On Tue, Oct 6, 2020 at 12:06 AM François Ozog francois.ozog@linaro.org wrote:
Thanks for your comments, Arnd. Some responses below.
On Mon, 5 Oct 2020 at 16:41, Arnd Bergmann arnd@linaro.org wrote:
On Sat, Oct 3, 2020 at 12:20 PM François Ozog francois.ozog@linaro.org
wrote:
IO Model
...
- the device is provided an unstructured memory area (or multiple) and rings. The device does memory allocation as it pleases on the unstructured area(s) [this is very common on Arm platform devices]. The structured buffer method is to be understood as computer-driven IO, while the other is device-driven IO.
I still need to read up on this. Can you point to a device driver that implements this so I can see the implications?
For x86 PCI: Chelsio T5+ adapters: the ring is associated with one or more huge pages. Chelsio can run on the two models. I am not sure the upstream driver is offering this possibility; you may want to check the one Chelsio publicly provides.
https://github.com/DPDK/dpdk/blob/main/drivers/net/szedata2/rte_eth_szedata2... (I have not checked the code) is also quite interesting to check out. They have multiple DMA models (more than Chelsio) depending on the use case. They have both open source and closed source stuff. I don't know what is currently open.
For Arm: NXP DPAA2: each ring is associated with a fixed set of arrays (n rings with p arrays: a total of n*p arrays), an array for packets up to 64 bytes, one for up to 128 bytes, one for up to 256 bytes... The ring is not prepopulated with packet pointers, as you don't know the incoming size. The hardware places inbound packets into the array corresponding to the size and updates the ring descriptor accordingly.
Ok, got it now. So these would both be fairly unusual adapters and only used for multi-gigabit networking.
User land IO
I think the use of userland IO (in the backend and/or the frontend VM) may impact the data path in various ways, so this should be factored into the analysis. A key element of performance: use metadata (prepended or appended to the data) to get information, rather than the ring descriptor, which is in uncached memory (we have concrete examples, and the cost of the different strategies may be up to 50% of the base performance).
Can you clarify why any of the descriptors would be uncached? Do you mean just the descriptors of the physical device in case of a noncoherent bus, or also shared memory between the guests?
Two methods I am aware of to know if there is any work to do:
- read an MMIO register
- poll ring descriptors (depends on HW capability)
As far as I have seen so far, the slowest is the MMIO read. On all hardware on which I used busy polling of the ring descriptors, those have been uncached.
Between guests, which is close to the host/VM case: I am not sure what the virtio policy on ring mapping is, but I assume this is also uncached.
No.
The only reason ring descriptors are uncached on low-end Arm SoCs is because they are on a noncoherent bus and uncached mappings can be used as a workaround to maintain a coherent view of the ring between hardware and the OS. In case of virtio, the rings are always cached because both sides are on the CPUs that are connected over a coherent bus. This might be different if the virtio device implementation is on a remote CPU behind a noncoherent bus, but I don't think we support that case at the moment.
The MMIO registers of the virtio device may appear to be mapped as uncached to the guest OS, but they don't actually point to memory here. Instead, the MMIO registers are implemented by trapping into the hypervisor that then handles the side effects of the access, such as forwarding the buffers to a hardware device.
I don't know the differences between virtio 1.0 and virtio 1.1 (so-called packed queues).
[Most of my network activities have been in the context of telecom. In that context, one could argue that a link not 100% used is a link that has a problem. So no need for IRQs, except at startup: I always think of busy polling to get ultra-low latency and low jitter]
This is a bit different in the kernel: polling on a register is generally seen as a waste of time, so the kernel always tries to batch as much work as it can without introducing latency elsewhere, and then let the CPU do useful work elsewhere while a new batch of work piles up. ;-)
Just to clarify, the shared memory buffer in this case would be a fixed host-physical buffer, rather than a fixed guest-physical location with page flipping, right?
From our learnings on mediated devices (vfio stuff), I wouldn't play with physical memory. If the MMU and SMMU don't play a key role in orchestrating address spaces, I am not sure what I describe is feasible.
So you would lean towards what Jean-Philippe explained regarding the use of vIOMMU in the front-end guest to be able to use an unmodified set of virtio drivers on guest pages, while using the dynamic iommu mappings to control which pages are shared with the back-end, right?
I think this can work in general, but it seems to introduce a ton of complexity in the host, in particular with a type-2 hypervisor like KVM, and I'm not convinced that the overhead of trapping into the hypervisor for each iotlb flush as well as manipulating the page tables of a remote guest from it is lower than just copying the data through the CPU cache.
In the case of KVM, I would guess that the backend guest has to map an object for the queue into its guest-physical address space as a vm_area_struct, possibly coming from a memfd or dmabuf fd that was shared by the front-end guest. Performing the dynamic remapping of that VMA from host user space (qemu, kvmtool, ...) would certainly mean a high runtime overhead, while doing it in the host kernel alone may not be acceptable for mainline kernels because of the added complexity in too many sensitive areas of the kernel (memory management / file systems, kvm, vhost/virtio).
inbound traffic:
- traffic is received in device SRAM
- the packet is marshalled to a memory area associated with a VLAN (i.e. a VM)
- the descriptor is updated to point to this packet
- the backend VM kernel handles the queues (IRQ, busy polling...) and creates virtio descriptors pointing to the data as part of the bridging (stripping the underlay network VLAN tag)
- the front end DPDK app reads the descriptor and the packet
- if DNS at the expected IP, the application handles the packet; otherwise it drops it
Would you expect the guests in this scenario to run simultaneously on different CPUs and send IPIs between them, or would each queue always be on a fixed CPU across both guests?
With DPDK in both VMs, I would use vCPU pinning. IPIs are not desired, as they cause way too much delay in very high performance scenarios, since an IPI also corresponds to a VM exit.
In any case, DPDK busy polling makes the IPI unnecessary and results in zero-exit operation.
Wouldn't user space spinning on the state of the queue be a bug in this scenario, if the thing that would add work to the queue is prevented from running because of the busy guest?
Arnd