On Fri, Oct 9, 2020 at 3:46 PM Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:
On Tue, Oct 06, 2020 at 02:46:02PM +0200, Arnd Bergmann wrote:
From our experience with mediated devices (the vfio work), I wouldn't play with physical memory directly. If the MMU and SMMU don't play a key role in orchestrating the address spaces, I am not sure what I describe is feasible.
So you would lean towards what Jean-Philippe explained regarding the use of a vIOMMU in the front-end guest, so that an unmodified set of virtio drivers can work on guest pages while the dynamic IOMMU mappings control which pages are shared with the back-end, right?
I think this can work in general, but it seems to introduce a ton of complexity in the host, in particular with a type-2 hypervisor like KVM. I'm also not convinced that the overhead of trapping into the hypervisor for each IOTLB flush, plus manipulating the page tables of a remote guest from there, is lower than just copying the data through the CPU cache.
In the case of KVM, I would guess that the backend guest has to map an object for the queue into its guest-physical address space as a vm_area_struct, possibly coming from a memfd or dmabuf fd that was shared by the front-end guest. Performing the dynamic remapping of that VMA from host user space (qemu, kvmtool, ...) would certainly mean a high runtime overhead, while doing it in the host kernel alone may not be acceptable for mainline kernels because of the added complexity in too many sensitive areas of the kernel (memory management / file systems, kvm, vhost/virtio).
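To make the memfd variant a bit more concrete, here is a minimal sketch of how host user space could create and map such a queue object; the function name and the way the fd gets handed to the other VMM process are assumptions for illustration, not part of any existing implementation:

    /* Hedged sketch: host user space (qemu, kvmtool, ...) creates an
     * anonymous memfd for the queue and maps it.  The fd could then be
     * passed to the process running the other guest (e.g. over a unix
     * socket with SCM_RIGHTS) and registered as guest memory on both sides.
     */
    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *create_queue_buffer(size_t size, int *fd_out)
    {
            int fd = memfd_create("virtio-queue", MFD_CLOEXEC);
            void *va;

            if (fd < 0)
                    return NULL;
            if (ftruncate(fd, size) < 0) {
                    close(fd);
                    return NULL;
            }
            va = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (va == MAP_FAILED) {
                    close(fd);
                    return NULL;
            }
            *fd_out = fd;
            return va;
    }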
With a vIOMMU the frontend wouldn't populate the backend's page tables or send IOTLB flushes. And I don't think the hypervisor could touch the backend's stage-1 page tables either. Neither would grant tables require that, I believe, though I still have some reading to do there. The frontend would send MAP and UNMAP messages to the host. Roughly:
    MAP {
            u32 domain_id;  // context (virtio device)
            u32 flags;      // permissions (read|write)
            u64 iova;       // an arbitrary virtual address
            u64 gpa;        // guest#1 physical address
            u64 size;       // multiple of guest#1 page size
    };
The hypervisor receives this and finds the physical page (pa) corresponding to g1pa. Then it:
(1) Allocates pages in the backend's guest-physical memory, at g2pa.
(2) Creates the stage-2 mapping g2pa->pa (for example with
    KVM_SET_USER_MEMORY_REGION in KVM?).
(3) Tells the backend about the iova->g2pa mapping for the virtio device.
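As an illustration of step (2), here is a minimal sketch of what the host side could look like with KVM, assuming the frontend page is already mapped at some host virtual address; the function name and slot handling are placeholders:

    /* Hedged sketch of step (2): install a g2pa->pa mapping for the
     * backend by registering the host mapping of the page(s) as a memory
     * slot.  vm_fd is the KVM VM file descriptor of guest#2, hva the host
     * virtual address backing the frontend page(s).
     */
    #include <linux/kvm.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>

    static int map_into_backend(int vm_fd, __u32 slot, __u64 g2pa,
                                __u64 size, void *hva)
    {
            struct kvm_userspace_memory_region region;

            memset(&region, 0, sizeof(region));
            region.slot = slot;             /* free slot chosen by the VMM */
            region.guest_phys_addr = g2pa;  /* where guest#2 will see it */
            region.memory_size = size;      /* multiple of the page size */
            region.userspace_addr = (__u64)(uintptr_t)hva;

            return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    }

Tearing the mapping down again on UNMAP would be the same ioctl with memory_size set to 0, which deletes the slot.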
Then the frontend builds the virtqueue or publishes buffers, using an IOVA within the range of what has been mapped so far. The backend translates that to a g2pa, then to a g2va, and accesses the buffer. When done, the frontend unmaps the page:
    UNMAP {
            u32 domain_id;
            u64 iova;
            u64 size;
    };
The host tells the backend and tears down the stage-2 mapping.
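Just to make the backend's side of this concrete (this is not part of the proposal, only an illustration with made-up names): given the iova->g2pa mappings announced in step (3), the backend would keep a small translation table and look up each descriptor address before touching the buffer:

    /* Hedged sketch of the backend-side translation iova -> g2pa, based
     * on the mappings announced in step (3).  A flat table is used here
     * for clarity; a real implementation would likely use a tree.
     */
    #include <stdbool.h>
    #include <stdint.h>

    struct iova_mapping {
            uint64_t iova;
            uint64_t g2pa;
            uint64_t size;
    };

    static struct iova_mapping mappings[64];
    static unsigned int nr_mappings;

    static bool iova_to_g2pa(uint64_t iova, uint64_t len, uint64_t *g2pa)
    {
            unsigned int i;

            for (i = 0; i < nr_mappings; i++) {
                    struct iova_mapping *m = &mappings[i];

                    if (iova >= m->iova && len <= m->size &&
                        iova - m->iova <= m->size - len) {
                            *g2pa = m->g2pa + (iova - m->iova);
                            return true;
                    }
            }
            return false;   /* not mapped: reject the descriptor */
    }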
This is what I was trying to explain, but your explanation provides some of the details I wasn't sure about and is much clearer.
I did not mean that any changes to the backend's guest page tables would be required, only to the host page tables.
I think when you refer to the MAP/UNMAP operations, that is what I meant by the iotlb flushes: whichever operation one does that makes the device forget about a previous IOMMU mapping and use the new one instead. Depending on the architecture that would be a CPU instruction, a hypercall, an MMIO access, or a message passed through the virtio-iommu ring.
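For reference, in that last case the message on the virtio-iommu request queue has roughly this shape (paraphrased from memory of include/uapi/linux/virtio_iommu.h; see the header and the spec for the authoritative layout):

    /* Rough shape of a virtio-iommu MAP request as it travels over the
     * request virtqueue (paraphrased; the real definitions live in
     * include/uapi/linux/virtio_iommu.h).
     */
    #include <linux/types.h>

    struct virtio_iommu_req_head {
            __u8    type;           /* e.g. VIRTIO_IOMMU_T_MAP */
            __u8    reserved[3];
    };

    struct virtio_iommu_req_map {
            struct virtio_iommu_req_head    head;
            __le32  domain;         /* endpoint domain, cf. domain_id above */
            __le64  virt_start;     /* first IOVA of the mapping */
            __le64  virt_end;       /* last IOVA (inclusive) */
            __le64  phys_start;     /* guest-physical address */
            __le32  flags;          /* read/write/mmio permissions */
            /* followed by a tail carrying the status written by the device */
    };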
I would get rid of steps (1) and (3) by having the frontend directly allocate iova=g2pa. It requires the host to reserve a fixed range of guest#2 physical memory upfront (but not backed by any physical pages at that point), and tell the guest about usable iova ranges (already supported by Linux).
Right, that would be ideal, but it adds the complexity that guest 1 has to know about the available g2pa range. There is also a requirement that guest 2 needs to check the addresses in the ring to ensure they are within the bounds of that range, to prevent a data leak, but that is probably the case for any scenario.
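A minimal sketch of what that could look like, with an arbitrary example window (the constants and the way the window is advertised are assumptions, e.g. a usable IOVA range or reserved region reported to guest 1):

    /* Hedged sketch: with a fixed guest#2 window reserved upfront, the
     * frontend can pick iova == g2pa and skip steps (1) and (3).  The
     * window below is an arbitrary example.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define SHARED_WINDOW_BASE      0x8000000000ULL  /* example g2pa base */
    #define SHARED_WINDOW_SIZE      (256ULL << 20)   /* example: 256 MiB  */

    /* Frontend (guest#1): trivial bump allocator inside the window. */
    static uint64_t next_iova = SHARED_WINDOW_BASE;

    static uint64_t alloc_iova(uint64_t size)
    {
            uint64_t iova = next_iova;

            if (size > SHARED_WINDOW_BASE + SHARED_WINDOW_SIZE - iova)
                    return 0;       /* window exhausted */
            next_iova += size;
            return iova;            /* == g2pa by construction */
    }

    /* Backend (guest#2): reject ring addresses outside the window. */
    static bool addr_in_window(uint64_t g2pa, uint64_t len)
    {
            return g2pa >= SHARED_WINDOW_BASE &&
                   len <= SHARED_WINDOW_SIZE &&
                   g2pa - SHARED_WINDOW_BASE <= SHARED_WINDOW_SIZE - len;
    }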
That way there wouldn't be any context switch to the backend during MAP/UNMAP, and the backend implementation is the same as with pre-shared memory.
There are still two (direct queue) or four (indirect queue) map/unmap operations per update to the virtio ring, and each of these leads to an update of the host page tables for the secondary guest, plus the communication needed to trigger that update.
Arnd