Thanks for the discussion. More thoughts on vIOMMU below, in case we want to try this solution in addition to the pre-shared one.
On Tue, Oct 06, 2020 at 02:46:02PM +0200, Arnd Bergmann wrote:
> > From our experience with mediated devices (the vfio stuff), I wouldn't play with physical memory.
> > If the MMU and SMMU don't play a key role in orchestrating address spaces, I am not sure what I describe is feasible.
> So you would lean towards what Jean-Philippe explained regarding the use of a vIOMMU in the front-end guest to be able to use an unmodified set of virtio drivers on guest pages, while using the dynamic IOMMU mappings to control which pages are shared with the back-end, right?
>
> I think this can work in general, but it seems to introduce a ton of complexity in the host, in particular with a type-2 hypervisor like KVM. I'm not convinced that the overhead of trapping into the hypervisor for each IOTLB flush, plus manipulating the page tables of a remote guest from there, is lower than just copying the data through the CPU cache.
>
> In the case of KVM, I would guess that the backend guest has to map an object for the queue into its guest-physical address space as a vm_area_struct, possibly coming from a memfd or dmabuf fd that was shared by the front-end guest. Performing the dynamic remapping of that VMA from host user space (qemu, kvmtool, ...) would certainly mean a high runtime overhead, while doing it in the host kernel alone may not be acceptable for mainline kernels because of the added complexity in too many sensitive areas of the kernel (memory management / file systems, kvm, vhost/virtio).
With a vIOMMU, the frontend wouldn't populate the backend's page tables or send IOTLB flushes, and I don't think the hypervisor would touch the backend's stage-1 page tables either (neither do grant tables, I believe, though I still have some reading to do there). Instead, the frontend would send MAP and UNMAP messages to the host. Roughly:
MAP {
    u32 domain_id;  // context (virtio device)
    u32 flags;      // permissions (read|write)
    u64 iova;       // an arbitrary virtual address
    u64 gpa;        // guest#1 physical address
    u64 size;       // multiple of guest#1 page size
};
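To make this concrete, sharing a single 4k buffer page of guest 1 might look roughly like the following; the struct name, flag encoding and all values are made up for illustration:

#include <stdint.h>

/* Hypothetical C mirror of the MAP message above, with example values. */
struct virt_map_msg {
    uint32_t domain_id;
    uint32_t flags;
    uint64_t iova;
    uint64_t gpa;
    uint64_t size;
};

static const struct virt_map_msg example_map = {
    .domain_id = 1,          /* context of the virtio device */
    .flags     = 0x3,        /* made-up encoding: read | write */
    .iova      = 0x10000,    /* chosen by the frontend's IOVA allocator */
    .gpa       = 0x4005f000, /* guest#1 physical address of the buffer */
    .size      = 0x1000,     /* one guest#1 page */
};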
The hypervisor receives this and finds the physical page (pa) corresponding to g1pa. Then it:

(1) Allocates pages in the backend's guest-physical memory, at g2pa.
(2) Creates the stage-2 mapping g2pa->pa (for example with KVM_SET_USER_MEMORY_REGION in KVM?).
(3) Tells the backend about the iova->g2pa mapping for the virtio device.
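As a rough sketch of what step (2) could look like with KVM, assuming the backend's VMM can reach the page backing g1pa through a shared memfd mapped at hva in its own address space (slot management and error handling omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch only: create the g2pa->pa stage-2 mapping by giving the backend
 * VM a new memslot backed by the same host memory that backs g1pa. */
static int map_into_backend(int vm_fd, uint32_t slot, uint64_t g2pa,
                            void *hva, uint64_t size)
{
    struct kvm_userspace_memory_region region = {
        .slot            = slot,
        .flags           = 0,
        .guest_phys_addr = g2pa,                     /* where guest 2 sees the page */
        .memory_size     = size,                     /* multiple of the page size */
        .userspace_addr  = (uint64_t)(uintptr_t)hva, /* host mapping of the page */
    };

    return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}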
The frontend then builds the virtqueue or publishes buffers, using an IOVA within the range of what has been mapped so far. The backend translates that IOVA to a g2pa, then to a g2va, and accesses the buffer. When done, the frontend unmaps the page:
UNMAP {
    u32 domain_id;
    u64 iova;
    u64 size;
};
The host tells the backend and tears down the stage-2 mapping.
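Continuing the KVM sketch above, the teardown would just delete the memslot again, which KVM does when the region is re-registered with a zero size:

/* Sketch only: remove the g2pa->pa stage-2 mapping created earlier. */
static int unmap_from_backend(int vm_fd, uint32_t slot)
{
    struct kvm_userspace_memory_region region = {
        .slot        = slot,
        .memory_size = 0,   /* zero size deletes the slot */
    };

    return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}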
I would get rid of steps (1) and (3) by having the frontend directly allocate iova = g2pa. This requires the host to reserve a fixed range of guest#2 physical memory upfront (not backed by any physical pages at that point), and to tell the frontend about usable IOVA ranges (already supported by Linux).
That way there wouldn't be any context switch to the backend during MAP/UNMAP, and the backend implementation is the same as with pre-shared memory.
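To show why the backend stays the same, here is a minimal sketch of its buffer translation when iova == g2pa, assuming the reserved guest#2 window is mapped once at window_base in the backend; the base and size constants are placeholders I made up:

#include <stdint.h>
#include <stddef.h>

#define G2PA_BASE 0x800000000ULL /* start of the reserved guest#2 window (placeholder) */
#define G2PA_SIZE 0x040000000ULL /* size of the window (placeholder: 1 GiB) */

/* Translate a descriptor address to a backend virtual address: a bounds
 * check plus a fixed offset, no per-buffer MAP/UNMAP handling needed. */
static void *iova_to_va(void *window_base, uint64_t iova, uint64_t len)
{
    if (iova < G2PA_BASE || len > G2PA_SIZE ||
        iova - G2PA_BASE > G2PA_SIZE - len)
        return NULL;    /* outside the shared window */
    return (char *)window_base + (iova - G2PA_BASE);
}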
Thanks,
Jean
---
A diagram with the notation I'm using: two virtual-to-physical translations pointing to the same physical page. Each guest manages its own stage-1 page tables, and the hypervisor manages stage-2. In the discussion above, stage-2 of guest 1 (frontend) is static, while stage-2 of guest 2 (backend) is modified dynamically.
g1va --[guest 1 stage-1 PT]--> g1pa --[guest 1 stage-2 PT]--> pa
g2va --[guest 2 stage-1 PT]--> g2pa --[guest 2 stage-2 PT]--> pa (same physical page)