Arnd Bergmann via Stratos-dev <stratos-dev@op-lists.linaro.org> writes:
On Wed, Oct 7, 2020 at 12:37 AM Stefano Stabellini <stefano.stabellini@xilinx.com> wrote:
On Fri, 2 Oct 2020, Arnd Bergmann via Stratos-dev wrote:
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation. But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient. Instead of copying large buffers the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
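For concreteness, the MAP request the guest would issue is small. The sketch below mirrors the shape of the virtio-iommu MAP request from the spec, but uses plain stdint types and an invented struct name rather than the real wire-format header, so treat it as illustrative only:

    /* Illustrative only: mirrors the shape of a virtio-iommu MAP request
     * (see the virtio-iommu spec / Linux UAPI virtio_iommu_req_map).
     * Plain stdint types are used here to keep the sketch self-contained;
     * the real header uses little-endian wire types plus head/tail parts. */
    #include <stdint.h>

    #define VIRTIO_IOMMU_MAP_F_READ   (1u << 0)
    #define VIRTIO_IOMMU_MAP_F_WRITE  (1u << 1)

    struct viommu_map_req {
        uint8_t  type;        /* request type (MAP) */
        uint8_t  reserved[3];
        uint32_t domain;      /* IOMMU domain the endpoint is attached to */
        uint64_t virt_start;  /* first IOVA of the mapping */
        uint64_t virt_end;    /* last IOVA (inclusive) */
        uint64_t phys_start;  /* guest-physical address backing the range */
        uint32_t flags;       /* VIRTIO_IOMMU_MAP_F_* permissions */
    };

    /* The guest driver queues one of these on the vIOMMU request queue for
     * each payload it wants the backend to reach; an UNMAP request revokes
     * the window again once the transfer completes. */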
Agreed, I would think the iommu based approach is much more promising here.
The two approaches are not mutually exclusive. The first approach could be demoed in a couple of weeks, while this approach will require months of work and at least one new virtio interface.
My suggestion would be to hack together a pre-shared memory solution, or a hypervisor-mediated solution with Argo, do some benchmarks to understand the degradation, and figure out if the degradation is bad enough that we need to go down the virtio IOMMU route.
Yes, this makes sense. I'll see if I can come up with a basic design for virtio devices based on pre-shared memory in place of the virtqueue, and then we can see if we can prototype the device side in qemu talking to a modified Linux guest. If that works, the next step would be to share the memory with another guest and have that implement the backend instead of qemu.
That seems like a reasonable set of steps. So the pre-shared region will be the source of memory for the virtqueues as well as the "direct" buffers the virtqueues reference?
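Something along these lines, presumably: the sketch below carves one pre-shared region into a vring area and a bounce-buffer pool. All names, offsets and sizes are placeholders for discussion, not a worked-out layout:

    /* Sketch of carving a pre-shared region into a vring area plus a
     * bounce-buffer pool. Offsets, sizes and names are placeholders; the
     * point is only that every address the backend ever dereferences falls
     * inside the one region both sides agreed on up front. */
    #include <stddef.h>
    #include <stdint.h>

    #define SHM_SIZE        (4u << 20)      /* example: 4 MiB pre-shared region */
    #define VRING_AREA_SIZE (64u << 10)     /* descriptor/avail/used rings */

    struct shm_layout {
        uint8_t *base;                      /* start of the pre-shared region */
        uint8_t *vring;                     /* virtqueue metadata at the start */
        uint8_t *bounce;                    /* remainder is the bounce pool */
        size_t   bounce_size;
    };

    static void shm_layout_init(struct shm_layout *l, uint8_t *base)
    {
        l->base        = base;
        l->vring       = base;
        l->bounce      = base + VRING_AREA_SIZE;
        l->bounce_size = SHM_SIZE - VRING_AREA_SIZE;
        /* Descriptors then point at offsets inside l->bounce, so the guest
         * copies payloads in and out of the pool instead of exposing
         * arbitrary guest pages to the backend. */
    }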
Xen PV drivers have started with the equivalent of a virtio IOMMU in place, which we call "grant table". A virtual machine uses the grant table to share memory explicitly with another virtual machine. Specifically, the frontend uses the grant table to share memory with the backend, otherwise the backend is not allowed to map the memory.
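(For reference, the per-frame record the frontend fills in is tiny; the sketch below follows the shape of the v1 entry in Xen's public grant_table.h, but is simplified and illustrative rather than a copy of the header.)

    /* Illustrative sketch of a Xen v1 grant table entry: the frontend writes
     * one of these per shared frame, naming which backend domain may map it.
     * Simplified from xen/grant_table.h. */
    #include <stdint.h>

    #define GTF_PERMIT_ACCESS  1u         /* grant type: permit foreign access */
    #define GTF_READONLY       (1u << 2)  /* frame may only be mapped read-only */

    struct grant_entry_v1 {
        uint16_t flags;   /* grant type and GTF_* modifier bits */
        uint16_t domid;   /* domain allowed to map this frame (the backend) */
        uint32_t frame;   /* frame number being shared */
    };

    /* The frontend fills in frame and domid first, then sets the
     * permit-access flag last (behind a write barrier) so the backend never
     * sees a half-initialised entry; the grant reference, i.e. the index of
     * this entry, is what gets passed to the backend over the ring. */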
(There is a question on whether we could standardize the grant table interface.)
Speaking from that experience, we ended up switching PV network to use hypervisor-based copy (without Argo; Argo came later and it is a more generic solution) because it was faster than the alternatives. We are still using the grant table for everything else (block, framebuffer, etc.)
My feeling is that the grant table approach is too specific to Xen and wouldn't lend itself to porting to most other hypervisors. The idea of picking virtio seems to be based around the assumption that this is already portable.
Adding a vIOMMU requirement for regular virtio devices seems possible and builds on existing guest drivers, but it certainly adds complexity in all areas (front-end, hypervisor and back-end) without obviously being faster than a simpler approach.
OTOH, we can probably use the existing grant table implementation in Xen for a performance comparison, assuming that a virtio+vIOMMU based approach would generally be slower than that.
With the existing grant tables we should be able to test a DomU vhost-user backend with a pre-shared chunk of memory, if we can find some way to pass signalling events to the DomU guest. The front-end could be in the main Dom0 initially.
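vhost-user normally carries those signalling events as eventfds handed across its Unix control socket, so the question is really how to bridge something equivalent onto a Xen event channel. A minimal sketch of just the POSIX half, to show what would need forwarding (illustrative only, assumes Linux):

    /* Minimal sketch of the eventfd-style signalling vhost-user expects:
     * the front-end writes to a "kick" fd when descriptors are ready and
     * the back-end writes to a "call" fd to raise the guest interrupt.
     * Only the POSIX half is shown; bridging onto Xen event channels is
     * the part that still needs prototyping. */
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/eventfd.h>

    int main(void)
    {
        int kick = eventfd(0, EFD_CLOEXEC);   /* front-end -> back-end */
        if (kick < 0) {
            perror("eventfd");
            return 1;
        }

        uint64_t one = 1;
        if (write(kick, &one, sizeof(one)) != sizeof(one))   /* "descriptors available" */
            perror("write");

        uint64_t n;
        if (read(kick, &n, sizeof(n)) == sizeof(n))          /* back-end drains the counter */
            printf("received %llu kick(s)\n", (unsigned long long)n);

        close(kick);
        return 0;
    }

On the Xen side the back-end would presumably bind an event channel and translate between the two, but that plumbing is exactly what the PoC needs to work out.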
I'm hoping this is something that Akashi can look at once he's up to speed with Xen, although I expect it might lag the KVM-based PoC a bit.
The takeaway is that the results might differ significantly, not just between one protocol and another (net and block), but also between hypervisors (Xen and KVM), and between SoCs. These are difficult waters to navigate.
Definitely - more data points required ;-)
Agreed.
Arnd