On Sat, Oct 3, 2020 at 12:20 PM François Ozog francois.ozog@linaro.org wrote:
IO Model
...
- the device is provided an unstructured memory area (or multiple) and rings. The device does memory allocation as it pleases on the unstructured area(s) [this is very common on Arm platform devices]. The structured buffer method is to be understood as computer-driven IO, while the other is device-driven IO.
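A rough sketch of the contrast, with hypothetical descriptor layouts that are not taken from any real driver, only meant to show who picks the buffer addresses:

/* Illustration only: made-up layouts for the two IO models. */
#include <stdint.h>

/* Computer-driven IO: the driver allocates each buffer and tells the
 * device exactly where to place data through a descriptor ring. */
struct host_driven_desc {
	uint64_t buf_addr;	/* buffer chosen by the driver */
	uint32_t buf_len;
	uint32_t flags;
};

/* Device-driven IO: the driver only hands over a large unstructured
 * region; the device sub-allocates packets inside it and reports each
 * one back as an offset into that region. */
struct device_driven_region {
	uint64_t region_addr;	/* base of the unstructured area */
	uint64_t region_len;
};

struct device_driven_completion {
	uint32_t offset;	/* where the device placed the packet */
	uint32_t len;
};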
I still need to read up on this. Can you point to a device driver that implements this so I can see the implications?
User land IO
I think the use of userland IO (in the backend and/or the frontend VM) may impact the data path in various ways, so this should be factored into the analysis. A key element of performance: use metadata (prepended or appended to the data) to get information, rather than the ring descriptor, which is in uncached memory (we have concrete examples, and the cost of the different strategies may be up to 50% of the base performance).
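A minimal sketch of the metadata idea, assuming a hypothetical layout where per-packet information is prepended to the payload so the fast path only touches the (cached) data buffer rather than the ring entry:

/* Illustration only: invented header, not from an existing driver. */
#include <stdint.h>

struct pkt_meta {
	uint16_t len;		/* payload length */
	uint16_t vlan;		/* pre-parsed VLAN tag */
	uint32_t flags;		/* checksum status etc. */
};

/* The payload follows the prepended metadata, so the consumer reads
 * length/flags from the same cacheline(s) as the data it is about to
 * touch anyway, and the ring entry can shrink to a bare pointer. */
static inline void *pkt_payload(struct pkt_meta *meta)
{
	return (char *)meta + sizeof(*meta);
}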
Can you clarify why any of the descriptors would be uncached? Do you mean just the descriptors of the physical device in case of a noncoherent bus, or also shared memory between the guests?
Virtio
There are significant performance differences between virtio 1.0 and virtio 1.1: virtio 1.0 touches something like six cachelines to insert an element into a queue, while virtio 1.1 touches only one (not sure about the exact numbers, but they should not be far off). As a reference, 6WIND has a VM-to-VM network driver that is not virtio based and can go beyond 100Gbps per x86 vCPU, so I expect virtio 1.1 to reach that level of performance.
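For reference, the descriptor formats (simplified from the virtio spec) show where the difference comes from: with the 1.0 split ring the driver writes the descriptor table, the avail ring and the avail index, which usually live in separate cachelines, while with the 1.1 packed ring a single in-place descriptor update (flipping the avail/used wrap bits in the flags) publishes the buffer.

#include <stdint.h>

/* virtio 1.0 split ring: descriptor table entry; the avail and used
 * rings are separate structures that also have to be written/read. */
struct vring_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* virtio 1.1 packed ring: one descriptor, updated in place; the
 * avail/used wrap bits live in the flags field. */
struct vring_packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};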
I tried to get some information about the 6WIND driver but couldn't find it. Do you have a link to their sources, or do you know what they do specifically?
Memory allocation backend
Shared memory allocation shall be controlled by the VMM (Qemu, Xen...), but the VMM may further ask FF-A to do so. I think there have been efforts in FF-A to find memory with the proper attributes (coherent between the device and all VMs). I frankly have no clue here, but this may be worth digging into.
I would expect any shared memory between guests to just use the default memory attributes: for incoming data you end up copying each packet twice anyway (from the buffer allocated by the hw driver into shared memory, and from shared memory into the actual skb). For outbound data over a noncoherent device, the driver needs to take care of proper barriers and cache management.
Just to clarify, the shared memory buffer in this case would be a fixed host-physical buffer, rather than a fixed guest-physical location with page flipping, right?
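A minimal sketch of the two copies described above; the helper names and buffer management are invented for illustration only:

#include <string.h>
#include <stddef.h>

/* Backend: copy from the buffer the hw driver received into, into the
 * fixed shared-memory window visible to the frontend. On a noncoherent
 * bus the hw driver has already done the cache maintenance for hw_buf. */
void backend_rx_copy(const void *hw_buf, size_t len, void *shm_slot)
{
	memcpy(shm_slot, hw_buf, len);	/* copy 1: hw buffer -> shared memory */
}

/* Frontend: copy out of shared memory into the guest's own packet
 * buffer (an skb in the kernel case, an mbuf for a DPDK frontend). */
void frontend_rx_copy(const void *shm_slot, size_t len, void *pkt_buf)
{
	memcpy(pkt_buf, shm_slot, len);	/* copy 2: shared memory -> skb/mbuf */
}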
inbound traffic:
- traffic is received in device SRAM
- the packet is marshalled to a memory area associated with a VLAN (i.e. a VM)
- the descriptor is updated to point to this packet
- the backend VM kernel handles the queues (IRQ, busy polling...) and creates virtio descriptors pointing to the data as part of the bridging, stripping the underlay network VLAN tag (sketched below)
- the frontend DPDK app reads the descriptor and the packet; if it is DNS at the expected IP, the application handles the packet, otherwise it drops it
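A hedged sketch of the bridging step above: detect and strip the underlay 802.1Q tag before handing the packet to the frontend queue. This is illustration only, not code from an existing bridge, and the function name is hypothetical.

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

#define ETH_ALEN	6
#define ETH_P_8021Q	0x8100

/* Tagged frame layout: dst MAC (6) | src MAC (6) | TPID (2) | TCI (2) |
 * ethertype (2) | payload. The TCI carries the VLAN id identifying the
 * destination (frontend) VM. */

/* Strip the 4-byte tag in place and return the new start of the frame;
 * the backend then points a virtio descriptor at this address. */
uint8_t *strip_vlan(uint8_t *frame)
{
	uint16_t tpid;

	memcpy(&tpid, frame + 2 * ETH_ALEN, sizeof(tpid));
	if (tpid != htons(ETH_P_8021Q))
		return frame;				/* already untagged */

	memmove(frame + 4, frame, 2 * ETH_ALEN);	/* slide MACs over the tag */
	return frame + 4;
}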
Would you expect the guests in this scenario to run simultaneously on different CPUs and send IPIs between them, or would each queue always be on a fixed CPU across both guests?
Arnd