Thanks for your comments, Arnd. Some responses below.
On Mon, 5 Oct 2020 at 16:41, Arnd Bergmann arnd@linaro.org wrote:
On Sat, Oct 3, 2020 at 12:20 PM François Ozog francois.ozog@linaro.org wrote:
IO Model
...
- the device is provided one or more unstructured memory areas and rings. The device does memory allocation as it pleases within the unstructured area(s) [this is very common on Arm platform devices]. The structured-buffer method can be understood as computer-driven IO, while the other is device-driven IO.
I still need to read up on this. Can you point to a device driver that implements this so I can see the implications?
For x86 PCI: Chelsio T5+ adapters: the ring is associated with one or more huge pages. Chelsio can run on both models; I am not sure the upstream driver offers this possibility, so you may want to check the driver Chelsio provides publicly. https://www.netcope.com/en/products/netcopep4 ( https://github.com/DPDK/dpdk/blob/main/drivers/net/szedata2/rte_eth_szedata2... I have not checked the code) is also quite interesting to check out. They have multiple DMA models (more than Chelsio) depending on the use case, with both open source and closed source parts; I don't know what is currently open.

For Arm: NXP DPAA2: each ring is associated with a fixed set of arrays (n rings with p arrays: a total of n*p arrays), an array for packets up to 64 bytes, up to 128 bytes, up to 256 bytes, and so on. The ring is not prepopulated with packet pointers, as you don't know the incoming size. The hardware places each inbound packet into the array corresponding to its size and updates the ring descriptor accordingly.
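To make the "device driven" model above concrete, here is a minimal C sketch (purely illustrative, not the DPAA2 or Chelsio driver API; all names, the flags bit and the pool layout are made up): the completion ring is not prepopulated with per-packet buffers; the device picks a buffer out of a size-classed pool and records its choice in the descriptor it writes back.

/* Hypothetical sketch of a device-driven RX model. */
#include <stdint.h>
#include <stddef.h>

#define POOL_CLASSES 3                 /* e.g. 64B, 128B, 256B classes    */

struct rx_completion {                 /* written by the device           */
    uint64_t buf_addr;                 /* buffer address chosen by device */
    uint16_t len;                      /* actual frame length             */
    uint8_t  pool_id;                  /* which size class was used       */
    uint8_t  flags;                    /* bit 0: descriptor valid         */
};

struct buf_pool {                      /* one unstructured area per class */
    void    *base;                     /* large (huge-page) backing area  */
    size_t   buf_size;                 /* 64, 128, 256, ...               */
    unsigned nbufs;
};

struct rx_ring {
    struct rx_completion *ring;        /* completion ring, device writes  */
    unsigned size, head;
    struct buf_pool pools[POOL_CLASSES];
};

/* CPU side: consume one completion; the device already did the
 * "allocation" by picking a buffer out of the matching pool. */
static void *rx_ring_next(struct rx_ring *r, uint16_t *len)
{
    struct rx_completion *c = &r->ring[r->head];

    if (!(c->flags & 0x1))             /* nothing new from the device     */
        return NULL;

    *len = c->len;
    r->head = (r->head + 1) % r->size;
    return (void *)(uintptr_t)c->buf_addr;
}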
User land IO
I think the use of userland IO (in the backend and/or the frontend VM) may impact the data path in various ways, so it should be factored into the analysis. A key element of performance: use metadata (prepended or appended to the data) to get information, rather than the ring descriptor, which is in uncached memory (we have concrete examples, and the cost difference between the strategies can be up to 50% of the base performance).
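As an illustration of the metadata idea, a hypothetical layout (field names are invented, not taken from any driver) where per-packet metadata is prepended to the frame in the same cached buffer, so the fast path does not have to re-read the uncached ring descriptor:

#include <stdint.h>

struct pkt_meta {                  /* written by device or backend driver */
    uint16_t len;                  /* frame length                        */
    uint16_t vlan;                 /* underlay VLAN / VM tag              */
    uint32_t flags;                /* checksum ok, tunnel type, ...       */
};                                 /* frame data follows immediately      */

static inline void *pkt_data(struct pkt_meta *m)
{
    return (uint8_t *)m + sizeof(*m);   /* prepend: data right after meta */
}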
Can you clarify why any of the descriptors would be uncached? Do you mean just the descriptors of the physical device in case of a noncoherent bus, or also shared memory between the guests?
There are two methods I am aware of to know whether there is any work to do:
- read an MMIO register
- poll the ring descriptors (depends on HW capability)

As far as I have seen so far, the slowest is the MMIO read. On all the hardware where I used busy polling of the ring descriptors, those descriptors have been uncached. Between guests, which is close to the host/VM case: I am not sure what the virtio policy on ring mapping is, but I assume this is also uncached. I don't know the differences between virtio 1.0 and virtio 1.1 (so-called packed queues) in this respect.

[Most of my network activities have been in the context of telecom. In that context, one could argue that a link not 100% used is a link that has a problem, so there is no need for IRQs except at startup: I always think of busy polling to get ultra low latency and low jitter.]
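A rough sketch of the two methods, with a made-up descriptor layout and register offset (not any specific device):

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

struct desc {
    uint64_t addr;
    uint16_t len;
    uint16_t flags;                 /* bit 0: "valid"/ownership            */
};

/* 1) Poll the descriptor itself: cheap if it lives in cached, coherent
 *    memory; much more expensive if the mapping is uncached. */
static bool desc_ready(const struct desc *d)
{
    /* acquire so the payload is not read before the valid bit */
    return __atomic_load_n(&d->flags, __ATOMIC_ACQUIRE) & 0x1;
}

/* 2) Read an MMIO "producer index" register: every check is an uncached,
 *    strongly ordered device access, hence the slowest option. */
static uint32_t hw_producer_index(volatile void *bar, size_t reg_off)
{
    return *(volatile uint32_t *)((volatile uint8_t *)bar + reg_off);
}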
Virtio
There are significant performance differences between virtio 1.0 and virtio 1.1: virtio 1.0 touches something like 6 cache lines to insert an element in a queue, while it is only one with virtio 1.1 (not sure about the exact numbers, but they should not be far off). As a reference, 6WIND has a VM-to-VM network driver that is not virtio based and can go beyond 100 Gbps per x86 vCPU, so I expect virtio 1.1 to reach that level of performance.
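For reference, a simplified view of the two layouts (following the virtio 1.0/1.1 specifications, with plain C types instead of the spec's little-endian types); the split ring spreads one transfer across the descriptor table, the avail ring and later the used ring, while the packed ring updates a single descriptor in place, which is where the cache line difference comes from:

#include <stdint.h>

/* virtio 1.0 "split" virtqueue: three separate structures */
struct virtq_desc      { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct virtq_avail     { uint16_t flags; uint16_t idx; uint16_t ring[]; };
struct virtq_used_elem { uint32_t id; uint32_t len; };
struct virtq_used      { uint16_t flags; uint16_t idx; struct virtq_used_elem ring[]; };

/* virtio 1.1 "packed" virtqueue: one ring, each descriptor updated in
 * place; flags carry the avail/used wrap-counter bits. */
struct virtq_packed_desc { uint64_t addr; uint32_t len; uint16_t id; uint16_t flags; };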
I tried to get some information about the 6WIND driver but couldn't find it. Do you have a link to their sources, or do you know what they do specifically?
No. Closed source.
Memory allocation backend
Shared memory allocation shall be controlled by the VMM (QEMU, Xen...), but the VMM may further ask FF-A to do so. I think there have been efforts in FF-A to find memory with the proper attributes (coherent between the device and all VMs). I frankly have no clue here, but this may be worth digging into.
I would expect any shared memory between guests to just use the default memory attributes: for incoming data you end up copying each packet twice anyway (from the hw-driver-allocated buffer to the shared memory, and from shared memory to the actual skb); for outbound data over a noncoherent device, the driver needs to take care of proper barriers and cache management.
Some use cases may allow zero-copy in one direction only, or in both directions.
Zero-copy is feasible, but is it desirable? What the actual trustworthiness of the construct is remains to be seen. An skb may or may not be present in the backend if DPDK is used to feed the virtio-net device, and the same may apply on the frontend side. It's not just about sharing memory: some accelerators have limits on "physical address" reach and bus visibility (according to member discussion). FF-A is able to identify memory that can be used between normal-world entities, between the normal and secure worlds, and between the normal world and a device. So when you design a full zero-copy data path, you may need to control the memory with the help of FF-A.
Just to clarify, the shared memory buffer in this case would be a fixed host-physical buffer, rather than a fixed guest-physical location with page flipping, right?
From our learnings on mediated devices (VFIO stuff), I wouldn't play with physical memory. If the MMU and SMMU don't play a key role in orchestrating address spaces, I am not sure what I describe is feasible.
Inbound traffic:
- traffic is received in device SRAM
- the packet is marshalled to a memory region associated with a VLAN (i.e. a VM)
- a descriptor is updated to point to this packet
- the backend VM kernel handles the queues (IRQ, busy polling...) and creates virtio descriptors pointing to the data as part of the bridging (stripping the underlay network VLAN tag)
- the front-end DPDK app reads the descriptor and the packet
- if it is DNS at the expected IP, the application handles the packet; otherwise it is dropped (see the sketch below)
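A sketch of the front-end step under these assumptions (placeholder port/queue numbers and expected IP; handle_dns() is an application-defined hypothetical; API names follow recent DPDK releases; no IP options are assumed):

#include <netinet/in.h>
#include <rte_ethdev.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_udp.h>
#include <rte_mbuf.h>
#include <rte_byteorder.h>

#define PORT_ID     0                          /* virtio-net port (placeholder) */
#define QUEUE_ID    0
#define EXPECTED_IP RTE_IPV4(192, 168, 1, 53)  /* expected DNS IP (placeholder) */
#define BURST       32

static void handle_dns(struct rte_mbuf *m);    /* application-defined           */

static void dns_poll_loop(void)
{
    struct rte_mbuf *pkts[BURST];

    for (;;) {                                  /* busy polling, no IRQ          */
        uint16_t n = rte_eth_rx_burst(PORT_ID, QUEUE_ID, pkts, BURST);

        for (uint16_t i = 0; i < n; i++) {
            struct rte_ether_hdr *eth =
                rte_pktmbuf_mtod(pkts[i], struct rte_ether_hdr *);
            struct rte_ipv4_hdr *ip  = (struct rte_ipv4_hdr *)(eth + 1);
            struct rte_udp_hdr  *udp = (struct rte_udp_hdr *)(ip + 1);

            if (eth->ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4) &&
                ip->next_proto_id == IPPROTO_UDP &&
                ip->dst_addr == rte_cpu_to_be_32(EXPECTED_IP) &&
                udp->dst_port == rte_cpu_to_be_16(53)) {
                handle_dns(pkts[i]);            /* DNS at expected IP: handle it */
            } else {
                rte_pktmbuf_free(pkts[i]);      /* otherwise drop                */
            }
        }
    }
}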
Would you expect the guests in this scenario to run simultaneously on different CPUs and send IPIs between them, or would each queue always be on a fixed CPU across both guests?
With DPDK in both VMs I would use vCPU pinning. IPIs are not desirable, as they cause way too much delay at very high performance levels, since each one also corresponds to a VM exit. In any case, DPDK busy polling makes the IPI unnecessary and results in zero-exit operation.
Arnd