Thanks for your comments, Arnd. Some responses below.
On Mon, 5 Oct 2020 at 16:41, Arnd Bergmann arnd@linaro.org wrote:
On Sat, Oct 3, 2020 at 12:20 PM François Ozog francois.ozog@linaro.org wrote:
IO Model
...
- the device is provided one or more unstructured memory areas and rings. The device does memory allocation as it pleases within the unstructured area(s) [this is very common on Arm platform devices]. The structured-buffer method can be understood as computer-driven IO, while the other is device-driven IO.
I still need to read up on this. Can you point to a device driver that implements this so I can see the implications?
For x86 PCI: Chelsio T5+ adapters: the ring is associated with one or more huge pages. Chelsio can run on both models; I am not sure the upstream driver offers this possibility, so you may want to check the driver Chelsio provides publicly. https://www.netcope.com/en/products/netcopep4 ( https://github.com/DPDK/dpdk/blob/main/drivers/net/szedata2/rte_eth_szedata2... I have not checked the code) is also quite interesting to check out. They have multiple DMA models (more than Chelsio) depending on the use case, with both open source and closed source parts; I don't know what is currently open.

For Arm: NXP DPAA2: each ring is associated with a fixed set of arrays (n rings with p arrays: a total of n*p arrays), an array for packets up to 64 bytes, up to 128 bytes, up to 256 bytes, and so on. The ring is not prepopulated with packet pointers, as you don't know the incoming size. The hardware places each inbound packet into the array corresponding to its size and updates the ring descriptor accordingly.
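To make the "device driven" model above concrete, here is a minimal C sketch (purely illustrative, not the DPAA2 or Chelsio driver API; all names, the flags bit and the pool layout are made up): the completion ring is not prepopulated with per-packet buffers; the device picks a buffer out of a size-classed pool and records its choice in the descriptor it writes back.

/* Hypothetical sketch of a device-driven RX model. */
#include <stdint.h>
#include <stddef.h>

#define POOL_CLASSES 3                 /* e.g. 64B, 128B, 256B classes    */

struct rx_completion {                 /* written by the device           */
    uint64_t buf_addr;                 /* buffer address chosen by device */
    uint16_t len;                      /* actual frame length             */
    uint8_t  pool_id;                  /* which size class was used       */
    uint8_t  flags;                    /* bit 0: descriptor valid         */
};

struct buf_pool {                      /* one unstructured area per class */
    void    *base;                     /* large (huge-page) backing area  */
    size_t   buf_size;                 /* 64, 128, 256, ...               */
    unsigned nbufs;
};

struct rx_ring {
    struct rx_completion *ring;        /* completion ring, device writes  */
    unsigned size, head;
    struct buf_pool pools[POOL_CLASSES];
};

/* CPU side: consume one completion; the device already did the
 * "allocation" by picking a buffer out of the matching pool. */
static void *rx_ring_next(struct rx_ring *r, uint16_t *len)
{
    struct rx_completion *c = &r->ring[r->head];

    if (!(c->flags & 0x1))             /* nothing new from the device     */
        return NULL;

    *len = c->len;
    r->head = (r->head + 1) % r->size;
    return (void *)(uintptr_t)c->buf_addr;
}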
User land IO
I think the use of userland IO (in the backend and/or the frontend VM) may impact the data path in various ways, so it should be factored into the analysis. A key element of performance: use metadata (prepended or appended to the data) to get information, rather than the ring descriptor, which is in uncached memory (we have concrete examples, and the cost difference between the strategies can be up to 50% of the base performance).
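As an illustration of the metadata idea, a hypothetical layout (field names are invented, not taken from any driver) where per-packet metadata is prepended to the frame in the same cached buffer, so the fast path does not have to re-read the uncached ring descriptor:

#include <stdint.h>

struct pkt_meta {                  /* written by device or backend driver */
    uint16_t len;                  /* frame length                        */
    uint16_t vlan;                 /* underlay VLAN / VM tag              */
    uint32_t flags;                /* checksum ok, tunnel type, ...       */
};                                 /* frame data follows immediately      */

static inline void *pkt_data(struct pkt_meta *m)
{
    return (uint8_t *)m + sizeof(*m);   /* prepend: data right after meta */
}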
Can you clarify why any of the descriptors would be uncached? Do you mean just the descriptors of the physical device in case of a noncoherent bus, or also shared memory between the guests?
There are two methods I am aware of to know whether there is any work to do:
- read an MMIO register
- poll the ring descriptors (depends on HW capability)

As far as I have seen so far, the slowest is the MMIO read. On all the hardware where I used busy polling of the ring descriptors, those descriptors have been uncached. Between guests, which is close to the host/VM case: I am not sure what the virtio policy on ring mapping is, but I assume this is also uncached. I don't know the differences between virtio 1.0 and virtio 1.1 (so-called packed queues) in this respect.

[Most of my network activities have been in the context of telecom. In that context, one could argue that a link not 100% used is a link that has a problem, so there is no need for IRQs except at startup: I always think of busy polling to get ultra low latency and low jitter.]
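A rough sketch of the two methods, with a made-up descriptor layout and register offset (not any specific device):

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

struct desc {
    uint64_t addr;
    uint16_t len;
    uint16_t flags;                 /* bit 0: "valid"/ownership            */
};

/* 1) Poll the descriptor itself: cheap if it lives in cached, coherent
 *    memory; much more expensive if the mapping is uncached. */
static bool desc_ready(const struct desc *d)
{
    /* acquire so the payload is not read before the valid bit */
    return __atomic_load_n(&d->flags, __ATOMIC_ACQUIRE) & 0x1;
}

/* 2) Read an MMIO "producer index" register: every check is an uncached,
 *    strongly ordered device access, hence the slowest option. */
static uint32_t hw_producer_index(volatile void *bar, size_t reg_off)
{
    return *(volatile uint32_t *)((volatile uint8_t *)bar + reg_off);
}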
Virtio
There are significant performance differences between virtio 1.0 and virtio 1.1: virtio 1.0 touches something like 6 cache lines to insert an element in a queue, while it is only one with virtio 1.1 (not sure about the exact numbers, but they should not be far off). As a reference, 6WIND has a VM-to-VM network driver that is not virtio based and can go beyond 100 Gbps per x86 vCPU, so I expect virtio 1.1 to reach that level of performance.
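For reference, a simplified view of the two layouts (following the virtio 1.0/1.1 specifications, with plain C types instead of the spec's little-endian types); the split ring spreads one transfer across the descriptor table, the avail ring and later the used ring, while the packed ring updates a single descriptor in place, which is where the cache line difference comes from:

#include <stdint.h>

/* virtio 1.0 "split" virtqueue: three separate structures */
struct virtq_desc      { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct virtq_avail     { uint16_t flags; uint16_t idx; uint16_t ring[]; };
struct virtq_used_elem { uint32_t id; uint32_t len; };
struct virtq_used      { uint16_t flags; uint16_t idx; struct virtq_used_elem ring[]; };

/* virtio 1.1 "packed" virtqueue: one ring, each descriptor updated in
 * place; flags carry the avail/used wrap-counter bits. */
struct virtq_packed_desc { uint64_t addr; uint32_t len; uint16_t id; uint16_t flags; };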
I tried to get some information about the 6WIND driver but couldn't find it. Do you have a link to their sources, or do you know what they do specifically?
No. Closed source.
Memory allocation backend
Shared memory allocation shall be controlled by the VMM (QEMU, Xen...), but the VMM may further ask FF-A to do so. I think there have been efforts in FF-A to find memory with the proper attributes (coherent between the device and all VMs). I frankly have no clue here, but this may be worth digging into.
I would expect any shared memory between guests to just use the default memory attributes: for incoming data you end up copying each packet twice anyway (from the hw-driver-allocated buffer to the shared memory, and from shared memory to the actual skb); for outbound data over a noncoherent device, the driver needs to take care of proper barriers and cache management.
Some use cases may allow zero-copy in one direction only, or in both directions.
Zero-copy is feasible, but is it desirable? What the actual trustworthiness of the construct is remains to be seen. An skb may or may not be present in the backend if DPDK is used to feed the virtio-net device, and the same may apply on the frontend side. It's not just about sharing memory: some accelerators have limits on "physical address" reach and bus visibility (according to member discussion). FF-A is able to identify memory that can be used between normal-world entities, between the normal and secure worlds, and between the normal world and a device. So when you design a full zero-copy data path, you may need to control the memory with the help of FF-A.
Just to clarify, the shared memory buffer in this case would be a fixed host-physical buffer, rather than a fixed guest-physical location with page flipping, right?
From our learnings on mediated devices (VFIO stuff), I wouldn't play with physical memory. If the MMU and SMMU don't play a key role in orchestrating address spaces, I am not sure what I describe is feasible.
Inbound traffic:
- traffic is received in device SRAM
- the packet is marshalled to a memory region associated with a VLAN (i.e. a VM)
- a descriptor is updated to point to this packet
- the backend VM kernel handles the queues (IRQ, busy polling...) and creates virtio descriptors pointing to the data as part of the bridging (stripping the underlay network VLAN tag)
- the front-end DPDK app reads the descriptor and the packet
- if it is DNS at the expected IP, the application handles the packet; otherwise it is dropped (see the sketch below)
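A sketch of the front-end step under these assumptions (placeholder port/queue numbers and expected IP; handle_dns() is an application-defined hypothetical; API names follow recent DPDK releases; no IP options are assumed):

#include <netinet/in.h>
#include <rte_ethdev.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_udp.h>
#include <rte_mbuf.h>
#include <rte_byteorder.h>

#define PORT_ID     0                          /* virtio-net port (placeholder) */
#define QUEUE_ID    0
#define EXPECTED_IP RTE_IPV4(192, 168, 1, 53)  /* expected DNS IP (placeholder) */
#define BURST       32

static void handle_dns(struct rte_mbuf *m);    /* application-defined           */

static void dns_poll_loop(void)
{
    struct rte_mbuf *pkts[BURST];

    for (;;) {                                  /* busy polling, no IRQ          */
        uint16_t n = rte_eth_rx_burst(PORT_ID, QUEUE_ID, pkts, BURST);

        for (uint16_t i = 0; i < n; i++) {
            struct rte_ether_hdr *eth =
                rte_pktmbuf_mtod(pkts[i], struct rte_ether_hdr *);
            struct rte_ipv4_hdr *ip  = (struct rte_ipv4_hdr *)(eth + 1);
            struct rte_udp_hdr  *udp = (struct rte_udp_hdr *)(ip + 1);

            if (eth->ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4) &&
                ip->next_proto_id == IPPROTO_UDP &&
                ip->dst_addr == rte_cpu_to_be_32(EXPECTED_IP) &&
                udp->dst_port == rte_cpu_to_be_16(53)) {
                handle_dns(pkts[i]);            /* DNS at expected IP: handle it */
            } else {
                rte_pktmbuf_free(pkts[i]);      /* otherwise drop                */
            }
        }
    }
}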
Would you expect the guests in this scenario to run simultaneously on different CPUs and send IPIs between them, or would each queue always be on a fixed CPU across both guests?
With DPDK in both VMs I would use vCPU pinning. IPIs are not desirable, as they cause way too much delay at very high performance levels, since each one also corresponds to a VM exit. In any case, DPDK busy polling makes the IPI unnecessary and results in zero-exit operation.
Arnd