On Fri, 2 Oct 2020, Jean-Philippe Brucker via Stratos-dev wrote:
Hi,
I've looked in more detail at limited memory sharing (STR-6, STR-8, STR-15), mainly from the Linux guest perspective. Here are a few thoughts.
Problem
We have a primary VM running a guest, and a secondary one running a backend that manages one hardware resource (for example network access). They communicate with virtio (for example virtio-net). The guest implements a virtio driver, the backend a virtio device. The problem is: how do we ensure that the backend and guest only share the memory required for the virtio communication, and that the backend cannot access any other guest memory?
Well described, this is the crux of the issue and also the same thing we tried to capture with STR-14 (please ignore the non-virtio items on STR-14 for now.)
Static shared region
Let's first look at static DMA regions. The hypervisor allocates a subset of memory to be shared between guest and backend, and communicates this per-device DMA restriction to the guest during boot. It could use a firmware property or a discovery protocol. I did start drafting such a protocol in virtio-iommu, but I now think the reserved-memory mechanism in device-tree, below, is preferable. Would we need an equivalent for ACPI, though?
How would we implement this in a Linux guest? The virtqueue of a virtio device has two components: static ring buffers, allocated at boot with dma_alloc_coherent(), and the actual data payloads, mapped with dma_map_page() or dma_map_single(). Linux calls the former "coherent" DMA, and the latter "streaming" DMA.
Coherent DMA can already obtain its pages from a per-device memory pool. dma_init_coherent_memory() defines a range of physical memory usable by a device. Importantly, this region has to be distinct from system RAM and reserved by the platform. It is mapped non-cacheable. If such a pool exists, dma_alloc_coherent() will only get its pages from that region.
Streaming DMA, on the other hand, doesn't allocate memory. The virtio drivers don't control where that memory comes from, since the pages are fed to them by an upper layer of the subsystem, specific to the device type (net, block, video, etc.). Often they are pages from the slab cache that contain other, unrelated objects, a notorious problem for DMA isolation. If a page is not accessible by the device, swiotlb_map() allocates a bounce buffer somewhere more convenient and copies the data as needed.
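To illustrate the two paths from a driver's point of view, here is a rough sketch (the driver function and variable names are made up for the example; only the DMA API calls are real):

#include <linux/dma-mapping.h>

#define RING_SIZE 4096

static int my_setup_dma(struct device *dev, void *payload, size_t len)
{
	void *ring;
	dma_addr_t ring_dma, payload_dma;

	/*
	 * "Coherent" DMA: the ring buffers are allocated here. With a
	 * per-device coherent pool, these pages come from the reserved
	 * region instead of system RAM.
	 */
	ring = dma_alloc_coherent(dev, RING_SIZE, &ring_dma, GFP_KERNEL);
	if (!ring)
		return -ENOMEM;

	/*
	 * "Streaming" DMA: the payload was allocated by an upper layer
	 * (e.g. the network stack). If the page is not accessible by the
	 * device, the DMA API transparently bounces it through swiotlb;
	 * the driver only sees the resulting DMA address.
	 */
	payload_dma = dma_map_single(dev, payload, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, payload_dma)) {
		dma_free_coherent(dev, RING_SIZE, ring, ring_dma);
		return -ENOMEM;
	}

	/* ... put ring_dma/payload_dma into descriptors, kick the device ... */

	dma_unmap_single(dev, payload_dma, len, DMA_TO_DEVICE);
	dma_free_coherent(dev, RING_SIZE, ring, ring_dma);
	return 0;
}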
At the moment the bounce buffers are allocated from a global pool in the low physical pages. However, a recent proposal from Chromium would add support for per-device swiotlb pools:
https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromium...
And quoting Tomasz from the discussion on patch 4:
For this, I'd like to propose a "restricted-dma-region" (feel free to suggest a better name) binding, which is explicitly specified to be the only DMA-able memory for this device and make Linux use the given pool for coherent DMA allocations and bouncing non-coherent DMA.
That seems to be precisely what we need. Even when using the virtio-pci transport, it is still possible to define per-device properties in the device-tree, for example:
/* PCI root complex node */
pcie@10000000 {
	compatible = "pci-host-ecam-generic";

	/* Add properties to endpoint with BDF 00:01.0 */
	ep@0008 {
		reg = <0x00000800 0 0 0 0>;
		restricted-dma-region = <&dma_region_1>;
	};
};

reserved-memory {
	/* Define 64MB reserved region at address 0x50400000 */
	dma_region_1: restricted_dma_region {
		reg = <0x50400000 0x4000000>;
	};
};
Excellent! You should also be aware of a proposal by Qualcomm on LKML about this: https://marc.info/?l=linux-kernel&m=158807398403549
I like this approach in the short term because we should be able to make it work with very little effort, and it doesn't seem to require any changes to the virtio specification, which means it is "backward compatible" or at least easier to backport.
(FYI this approach corresponds to "pre-shared memory" in STR-14.)
Dynamic regions
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation. But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient. Instead of copying large buffers the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
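For reference, a MAP request on the virtio-iommu request queue looks roughly like this; the layout below mirrors the virtio-iommu specification (and Linux's include/uapi/linux/virtio_iommu.h), with plain stdint types standing in for the little-endian wire types:

#include <stdint.h>

struct virtio_iommu_req_head {
	uint8_t type;		/* VIRTIO_IOMMU_T_MAP */
	uint8_t reserved[3];
};

struct virtio_iommu_req_tail {
	uint8_t status;		/* written back by the device */
	uint8_t reserved[3];
};

struct virtio_iommu_req_map {
	struct virtio_iommu_req_head head;
	uint32_t domain;	/* domain the endpoint is attached to */
	uint64_t virt_start;	/* IOVA range to map ... */
	uint64_t virt_end;
	uint64_t phys_start;	/* ... onto this guest-physical address */
	uint32_t flags;		/* VIRTIO_IOMMU_MAP_F_READ/WRITE/MMIO */
	struct virtio_iommu_req_tail tail;
};

On receiving the request, the hypervisor (or the vIOMMU device behind it) would make that physical range accessible to the backend, and a later UNMAP request would revoke it. Each large payload therefore costs a round trip to the hypervisor instead of a copy, which is the trade-off discussed above.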
Since it depends on the device, I guess we'll need a survey of memory access patterns by the different virtio devices that we're considering. In the end a mix of both solutions might be necessary.
Right, and in my experience the outcome of that trade-off differs greatly across architectures; e.g. on x86 bounce buffers are faster, while on ARM TLB flushes are (or used to be?) much faster. You are completely right that it also depends on the protocol. From the measurements I did last time (2+ years ago) this approach was faster on ARM, even for networking. With recent hardware things might have changed. This approach is also the most "expensive" to build, and for Xen it requires an implementation in the hypervisor itself: I think the virtio-iommu backend cannot live in QEMU (or kvmtool); it needs to be in the Xen hypervisor proper.
(FYI this approach corresponds to dynamically shared memory in STR-14.)
Finally, there is one more approach to consider: hypervisor-mediated copy. See this presentation:
https://www.spinics.net/lists/automotive-discussions/attachments/pdfZi7_xDH6...
If you are interested in more details we could ask Christopher to give the presentation again at one of the next Stratos meetings. Argo is an interface to ask Xen to do the copy on behalf of the VM. The advantage is that the hypervisor can validate the input, preventing a class of attacks where a VM changes the buffer while the other VM is still reading it.
(This approach was not considered at the time of writing STR-14.)