On Tue, Oct 6, 2020 at 12:06 AM François Ozog francois.ozog@linaro.org wrote:
Thanks for your comments, Arnd. Some responses below.
On Mon, 5 Oct 2020 at 16:41, Arnd Bergmann arnd@linaro.org wrote:
On Sat, Oct 3, 2020 at 12:20 PM François Ozog francois.ozog@linaro.org
wrote:
IO Model
...
- the device is provided an unstructured memory area (or multiple) and rings. The device does memory allocation as it pleases on the unstructured area(s) [this is very common on Arm platform devices]. The structured buffer method is to be understood as computer-driven IO, while the other is device-driven IO.
I still need to read up on this. Can you point to a device driver that implements this so I can see the implications?
For x86 PCI: Chelsio T5+ adapters: the ring is associated with one or more huge pages. Chelsio can run on the two models. I am not sure the upstream driver is offering this possibility; you may want to check the one Chelsio publicly provides.
https://github.com/DPDK/dpdk/blob/main/drivers/net/szedata2/rte_eth_szedata2... (I have not checked the code) is also quite interesting to check out. They have multiple DMA models (more than Chelsio) depending on the use case. They have both open source and closed source stuff. I don't know what is currently open.
For Arm: NXP DPAA2: each ring is associated with a fixed set of arrays (n rings with p arrays: a total of n*p arrays), an array for packets up to 64 bytes, one for up to 128 bytes, one for up to 256 bytes... The ring is not prepopulated with packet pointers, as you don't know the incoming size. The hardware places inbound packets into the array corresponding to the size and updates the ring descriptor accordingly.
Ok, got it now. So these would both be fairly unusual adapters and only used for multi-gigabit networking.
User land IO
I think the use of userland IO (in the backend and/or the frontend VM) may impact the data path in various ways, so this should be factored into the analysis. A key element of performance: use metadata (prepended or appended to the data) to get information, rather than the ring descriptor, which is in uncached memory (we have concrete examples, and the cost of the different strategies may be up to 50% of the base performance).
Can you clarify why any of the descriptors would be uncached? Do you mean just the descriptors of the physical device in case of a noncoherent bus, or also shared memory between the guests?
Two methods I am aware of to know if there is any work to do:
- read an MMIO register
- poll ring descriptors (depends on HW capability)
As far as I have seen so far, the slowest is the MMIO read. On all hardware on which I used busy polling of the ring descriptors, those have been uncached.
Between guests, which is close to the host/VM case: I am not sure what the virtio policy on ring mapping is, but I assume this is also uncached.
No.
The only reason ring descriptors are uncached on low-end Arm SoCs is because they are on a noncoherent bus and uncached mappings can be used as a workaround to maintain a coherent view of the ring between hardware and the OS. In case of virtio, the rings are always cached because both sides are on the CPUs that are connected over a coherent bus. This might be different if the virtio device implementation is on a remote CPU behind a noncoherent bus, but I don't think we support that case at the moment.
The MMIO registers of the virtio device may appear to be mapped as uncached to the guest OS, but they don't actually point to memory here. Instead, the MMIO registers are implemented by trapping into the hypervisor that then handles the side effects of the access, such as forwarding the buffers to a hardware device.
I don't know the differences between virtio 1.0 and virtio 1.1 (so-called packed queues).
[Most of my network activities have been in the context of telecom. In that context, one could argue that a link not 100% used is a link that has a problem. So no need for IRQs, except at startup: I always think of busy polling to get ultra-low latency and low jitter]
This is a bit different in the kernel: polling on a register is generally seen as a waste of time, so the kernel always tries to batch as much work as it can without introducing latency elsewhere, and then let the CPU do useful work elsewhere while a new batch of work piles up. ;-)
Just to clarify, the shared memory buffer in this case would be a fixed host-physical buffer, rather than a fixed guest-physical location with page flipping, right?
From our learnings on mediated devices (vfio stuff), I wouldn't play with physical memory. If the MMU and SMMU don't play a key role in orchestrating address spaces, I am not sure what I describe is feasible.
So you would lean towards what Jean-Philippe explained regarding the use of vIOMMU in the front-end guest to be able to use an unmodified set of virtio drivers on guest pages, while using the dynamic iommu mappings to control which pages are shared with the back-end, right?
I think this can work in general, but it seems to introduce a ton of complexity in the host, in particular with a type-2 hypervisor like KVM, and I'm not convinced that the overhead of trapping into the hypervisor for each iotlb flush as well as manipulating the page tables of a remote guest from it is lower than just copying the data through the CPU cache.
In the case of KVM, I would guess that the backend guest has to map an object for the queue into its guest-physical address space as a vm_area_struct, possibly coming from a memfd or dmabuf fd that was shared by the front-end guest. Performing the dynamic remapping of that VMA from host user space (qemu, kvmtool, ...) would certainly mean a high runtime overhead, while doing it in the host kernel alone may not be acceptable for mainline kernels because of the added complexity in too many sensitive areas of the kernel (memory management / file systems, kvm, vhost/virtio).
inbound traffic:
- traffic is received in device SRAM
- the packet is marshalled to a memory area associated with a VLAN (i.e. a VM)
- the descriptor is updated to point to this packet
- the backend VM kernel handles the queues (IRQ, busy polling...) and creates virtio descriptors pointing to the data as part of the bridging (stripping the underlay network VLAN tag)
- the front end DPDK app reads the descriptor and the packet
- if DNS at the expected IP, the application handles the packet; otherwise it drops it
Would you expect the guests in this scenario to run simultaneously on different CPUs and send IPIs between them, or would each queue always be on a fixed CPU across both guests?
With DPDK in both VMs, I would use vCPU pinning. IPIs are not desired, as they cause way too much delay in very high performance scenarios, since an IPI also corresponds to a VM exit.
In any case, DPDK busy polling makes the IPI unnecessary and results in zero-exit operation.
Wouldn't user space spinning on the state of the queue be a bug in this scenario, if the thing that would add work to the queue is prevented from running because of the busy guest?
Arnd