On Mon, 12 Oct 2020, Alex Bennée wrote:
On Wed, Oct 7, 2020 at 12:37 AM Stefano Stabellini <stefano.stabellini@xilinx.com> wrote:
On Fri, 2 Oct 2020, Arnd Bergmann via Stratos-dev wrote:
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation. But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient. Instead of copying large buffers the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
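For concreteness, the per-buffer request the guest would queue already exists in virtio-iommu; the MAP request looks roughly like this (field layout as in Linux's include/uapi/linux/virtio_iommu.h, shown purely to illustrate what each map/unmap round trip carries):

/* Roughly the virtio-iommu MAP request (see include/uapi/linux/virtio_iommu.h).
 * The guest queues one of these on the vIOMMU request queue for every buffer
 * it wants the backend to be able to access.
 */
struct virtio_iommu_req_head {
	__u8	type;		/* VIRTIO_IOMMU_T_MAP for a map request */
	__u8	reserved[3];
};

struct virtio_iommu_req_tail {
	__u8	status;		/* filled in by the device */
	__u8	reserved[3];
};

struct virtio_iommu_req_map {
	struct virtio_iommu_req_head	head;
	__le32	domain;		/* IOMMU domain the endpoint is attached to */
	__le64	virt_start;	/* IOVA range the backend will see... */
	__le64	virt_end;
	__le64	phys_start;	/* ...mapped to this guest-physical address */
	__le32	flags;		/* VIRTIO_IOMMU_MAP_F_READ / _WRITE */
	struct virtio_iommu_req_tail	tail;
};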
Agreed, I would think the iommu based approach is much more promising here.
The two approaches are not mutually exclusive. The first approach could be demoed in a couple of weeks, while this approach will require months of work and at least one new virtio interface.
My suggestion would be to hack together a pre-shared memory solution, or a hypervisor-mediated solution with Argo, do some benchmarks to understand the degradation, and figure out if the degradation is bad enough that we need to go down the virtio IOMMU route.
Yes, this makes sense. I'll see if I can come up with a basic design for virtio devices based on pre-shared memory in place of the virtqueue, and then we can see if we can prototype the device side in qemu talking to a modified Linux guest. If that works, the next step would be to share the memory with another guest and have that implement the backend instead of qemu.
Just FYI, Xen has mechanisms to pre-share memory areas statically (from VM creation) between VMs. We could pre-share a memory region between domU1 and domU2 and use it for virtio. We would have to come up with a way to mark the memory as "special virtio memory" and let Linux/QEMU know about it. Maybe we could use a reserved-memory binding for it.
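As a rough sketch of what the Linux side could do with such a binding (the device node and function names below are assumptions for illustration, not an existing binding; only the standard "memory-region" phandle property is real):

/* Sketch only: assumes the pre-shared region is described as a reserved-memory
 * node and referenced from a (hypothetical) virtio-shm device node through the
 * standard "memory-region" property.
 */
#include <linux/device.h>
#include <linux/ioport.h>
#include <linux/of.h>
#include <linux/of_address.h>

static int virtio_shm_probe_region(struct device *dev, struct resource *res)
{
	struct device_node *np;
	int ret;

	np = of_parse_phandle(dev->of_node, "memory-region", 0);
	if (!np)
		return -ENODEV;

	ret = of_address_to_resource(np, 0, res);
	of_node_put(np);
	if (ret)
		return ret;

	dev_info(dev, "pre-shared virtio region: %pR\n", res);
	return 0;
}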
That seems like a reasonable set of steps. So the pre-shared region will be the source of memory for the virtqueues as well as the "direct" buffers the virtqueues reference?
My original thinking was to use the pre-shared region for everything, in a dma_ops swiotlb fashion: the kernel (frontend) would end up picking bounce buffers out of the pre-shared region thanks to a special swiotlb instance made for the purpose. The backend would be told to map the pre-shared region at initialization and only use already-mapped pages from it.
However, I am not sure this is the best way to do it -- you and Arnd might have better ideas on how to integrate the pre-shared region with the rest of the virtio infrastructure.
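To make the swiotlb idea a bit more concrete, the sketch below shows the kind of bounce-buffering involved, hand-rolled over the pre-shared region purely for illustration; in practice this would be a dedicated swiotlb instance rather than new code, and all the names are made up:

/* Illustrative only: a hand-rolled bounce allocator over the pre-shared
 * region. In a real implementation this role would be played by a dedicated
 * swiotlb pool backed by the same region.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SHM_SLOT_SIZE	4096
#define SHM_NR_SLOTS	1024

struct shm_pool {
	void	 *base;			/* frontend mapping of the region */
	uint64_t  dev_base;		/* address the backend uses for it */
	uint8_t	  used[SHM_NR_SLOTS];	/* trivial slot bitmap */
};

/* Copy a payload into the pre-shared region and return the address the
 * backend should be given in the descriptor. Returns 0 when the pool is
 * exhausted or the payload does not fit a slot.
 */
static uint64_t shm_bounce_out(struct shm_pool *p, const void *buf, size_t len)
{
	size_t i;

	if (len > SHM_SLOT_SIZE)
		return 0;

	for (i = 0; i < SHM_NR_SLOTS; i++) {
		if (!p->used[i]) {
			p->used[i] = 1;
			memcpy((char *)p->base + i * SHM_SLOT_SIZE, buf, len);
			return p->dev_base + i * SHM_SLOT_SIZE;
		}
	}
	return 0;
}

/* Copy the result back (for device-to-driver transfers) and free the slot. */
static void shm_bounce_in(struct shm_pool *p, uint64_t dev_addr,
			  void *buf, size_t len)
{
	size_t i = (dev_addr - p->dev_base) / SHM_SLOT_SIZE;

	if (buf && len <= SHM_SLOT_SIZE)
		memcpy(buf, (char *)p->base + i * SHM_SLOT_SIZE, len);
	p->used[i] = 0;
}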
Xen PV drivers have started with the equivalent of a virtio IOMMU in place, which we call "grant table". A virtual machine uses the grant table to share memory explicitly with another virtual machine. Specifically, the frontend uses the grant table to share memory with the backend, otherwise the backend is not allowed to map the memory.
(There is a question on whether we could standardize the grant table interface.)
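For comparison, the per-page operation on a Xen frontend is a single call into the existing grant-table API; a minimal sketch (assuming the backend domid has already been obtained through the usual xenstore handshake, error handling trimmed):

/* Sketch: how a Xen frontend shares one page of a buffer with the backend
 * domain using the existing Linux grant-table API. "backend_domid" is
 * assumed to come from the usual xenstore handshake.
 */
#include <xen/grant_table.h>
#include <xen/page.h>

static int share_page_with_backend(domid_t backend_domid, void *vaddr,
				   grant_ref_t *ref)
{
	int err;

	/* Grant the backend read/write access to this frame. */
	err = gnttab_grant_foreign_access(backend_domid,
					  virt_to_gfn(vaddr), 0 /* rw */);
	if (err < 0)
		return err;

	*ref = err;	/* the grant reference goes into the ring/descriptor */
	return 0;
}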
Speaking from that experience, we ended up switching PV network to use hypervisor-based copy (without Argo; Argo came later and it is a more generic solution) because it was faster than the alternatives. We are still using the grant table for everything else (block, framebuffer, etc.)
My feeling is that the grant table approach is too specific to Xen and wouldn't lend itself to porting to most other hypervisors. The idea of picking virtio seems to be based on the assumption that it is already portable.
Yeah, you are right. I didn't mean the grant table "as is"; I meant making a few changes to it so that it becomes easy to implement in other hypervisors and turn it into a virtio interface. But I don't know, it could be easier to start from scratch.
Adding a vIOMMU requirement for regular virtio devices seems possible and builds on existing guest drivers, but it certainly adds complexity in all areas (front-end, hypervisor and back-end) without being obviously faster than a simpler approach.
OTOH, we can probably use the existing grant table implementation in Xen for a performance comparison, assuming that a virtio+viommu based approach would generally be slower than that.
With the existing grant tables we should be able to test a DomU vhost-user with a pre-shared chunk of memory, if we can find some way to pass signalling events to the DomU guest. The front-end could be in the main Dom0 initially.
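One possible shape for the signalling piece, purely as a sketch: a userspace backend in the DomU could stand in for vhost-user's eventfd kick/call with plain event channels via libxenevtchn (the frontend domid and the port handshake over xenstore are assumed here):

/* Sketch: replacing vhost-user's eventfd kick/call with Xen event channels
 * for a backend running in a DomU. The frontend domid and the agreed port
 * would normally be exchanged over xenstore; both are assumed here.
 */
#include <stdio.h>
#include <xenevtchn.h>

static int wait_for_kick(xenevtchn_handle *xce)
{
	xenevtchn_port_or_error_t port = xenevtchn_pending(xce);

	if (port < 0)
		return -1;
	xenevtchn_unmask(xce, port);	/* re-arm before processing */
	return port;
}

int main(void)
{
	xenevtchn_handle *xce = xenevtchn_open(NULL, 0);
	xenevtchn_port_or_error_t port;

	if (!xce)
		return 1;

	/* Bind an unbound port that the frontend domain (domid 1 here,
	 * purely as an example) can later connect to. */
	port = xenevtchn_bind_unbound_port(xce, 1);
	if (port < 0)
		return 1;

	printf("backend listening on event channel port %d\n", port);

	for (;;) {
		int p = wait_for_kick(xce);
		if (p < 0)
			break;
		/* ... process descriptors in the pre-shared region ... */
		xenevtchn_notify(xce, p);	/* "call" back to the frontend */
	}

	xenevtchn_close(xce);
	return 0;
}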
I'm hoping this is something that Akashi can look at once he's up to speed with Xen although I expect it might lag the KVM based PoC a bit.
Excellent idea! We should definitely be able to use the existing grant table to do measurements and get some useful data points.