Hi,
I've looked in more detail at limited memory sharing (STR-6, STR-8, STR-15), mainly from the Linux guest perspective. Here are a few thoughts.
Problem
-------
We have a primary VM running a guest, and a secondary VM running a backend that manages one hardware resource (for example network access). They communicate with virtio (for example virtio-net): the guest implements a virtio driver, the backend a virtio device. The problem is how to ensure that the backend and the guest only share the memory required for the virtio communication, and that the backend cannot access any other memory from the guest.
Static shared region
--------------------
Let's first look at static DMA regions. The hypervisor allocates a subset of memory to be shared between guest and backend, and communicates this per-device DMA restriction to the guest during boot, either with a firmware property or with a discovery protocol. I did start drafting such a protocol in virtio-iommu, but I now think the reserved-memory mechanism in device-tree, shown below, is preferable. Would we need an equivalent for ACPI, though?
How would we implement this in a Linux guest? The virtqueue of a virtio device has two components: static ring buffers, allocated at boot with dma_alloc_coherent(), and the actual data payload, mapped with dma_map_page() and dma_map_single(). Linux calls the former "coherent" DMA, and the latter "streaming" DMA.
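Roughly, the two paths look like this (a sketch, not actual driver code; the DMA API calls are real, the surrounding functions are made up for illustration):

#include <linux/dma-mapping.h>

/* Coherent DMA: one long-lived, cache-coherent allocation for the
 * vring, done once at setup time. */
static int setup_ring(struct device *dev, size_t ring_size,
                      void **ring, dma_addr_t *ring_dma)
{
        *ring = dma_alloc_coherent(dev, ring_size, ring_dma, GFP_KERNEL);
        return *ring ? 0 : -ENOMEM;
}

/* Streaming DMA: one short-lived mapping per payload. The buffer comes
 * from an upper layer (e.g. an sk_buff for virtio-net); the driver
 * only maps it for the device. */
static dma_addr_t map_payload(struct device *dev, void *buf, size_t len)
{
        return dma_map_single(dev, buf, len, DMA_TO_DEVICE);
}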
Coherent DMA can already obtain its pages from a per-device memory pool: dma_init_coherent_memory() defines a range of physical memory usable by a device. Importantly, this region has to be distinct from system RAM and reserved by the platform, and it is mapped non-cacheable. If such a pool exists, dma_alloc_coherent() will only get its pages from that region.
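For reference, this is usually wired up with a reserved-memory node compatible with "shared-dma-pool", which the driver binds to its device at probe time. A minimal sketch (my_probe() is hypothetical):

#include <linux/dma-mapping.h>
#include <linux/of_reserved_mem.h>
#include <linux/platform_device.h>
#include <linux/sizes.h>

static int my_probe(struct platform_device *pdev)
{
        struct device *dev = &pdev->dev;
        dma_addr_t dma;
        void *vaddr;
        int ret;

        /* Bind the pool referenced by this device's memory-region
         * property to the device. */
        ret = of_reserved_mem_device_init(dev);
        if (ret)
                return ret;

        /* This allocation now comes from the per-device pool only. */
        vaddr = dma_alloc_coherent(dev, SZ_4K, &dma, GFP_KERNEL);
        return vaddr ? 0 : -ENOMEM;
}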
Streaming DMA, on the other hand, doesn't allocate memory. The virtio drivers don't control where that memory comes from, since the pages are fed to them by an upper layer of the subsystem, specific to the device type (net, block, video, etc.). Often they are pages from the slab cache that contain other unrelated objects, a notorious problem for DMA isolation. If a page is not accessible by the device, swiotlb_map() allocates a bounce buffer somewhere more convenient and copies the data when needed.
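Simplified, the decision in the streaming map path looks something like this (a sketch of the logic only; map_or_bounce() is not the actual kernel function):

#include <linux/dma-direct.h>
#include <linux/swiotlb.h>

static dma_addr_t map_or_bounce(struct device *dev, struct page *page,
                                unsigned long offset, size_t size,
                                enum dma_data_direction dir,
                                unsigned long attrs)
{
        phys_addr_t phys = page_to_phys(page) + offset;
        dma_addr_t dma_addr = phys_to_dma(dev, phys);

        /* The device can access the page in place: no copy needed. */
        if (dma_capable(dev, dma_addr, size, true))
                return dma_addr;

        /* Otherwise copy through a swiotlb bounce buffer. */
        return swiotlb_map(dev, phys, size, dir, attrs);
}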
At the moment the bounce buffer is allocated from a global pool in the low physical pages. However, a recent proposal from Chromium would add support for per-device swiotlb pools:
https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromium...
And quoting Tomasz from the discussion on patch 4:
> For this, I'd like to propose a "restricted-dma-region" (feel free to
> suggest a better name) binding, which is explicitly specified to be
> the only DMA-able memory for this device and make Linux use the given
> pool for coherent DMA allocations and bouncing non-coherent DMA.
That seems to be precisely what we need. Even when using the virtio-pci transport, it is still possible to define per-device properties in the device-tree, for example:
/* PCI root complex node */
pcie@10000000 {
        compatible = "pci-host-ecam-generic";

        /* Add properties to the endpoint with BDF 00:01.0 */
        ep@0008 {
                reg = <0x00000800 0 0 0 0>;
                restricted-dma-region = <&dma_region_1>;
        };
};
reserved-memory {
        /* Define a 64MB reserved region at address 0x50400000 */
        dma_region_1: restricted_dma_region {
                reg = <0x50400000 0x4000000>;
        };
};
Dynamic regions
---------------
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic requires copying. Sub-page payloads need bounce buffering anyway, for proper isolation. But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient: instead of copying large buffers, the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
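For reference, virtio-iommu already defines the request that would carry such a mapping (include/uapi/linux/virtio_iommu.h). The guest would send one per buffer, or per batch of buffers:

/* Map the IOVA range [virt_start, virt_end] to guest-physical address
 * phys_start, in the given domain. The hypervisor turns this into a
 * mapping in the backend. */
struct virtio_iommu_req_map {
        struct virtio_iommu_req_head    head;
        __le32                          domain;
        __le64                          virt_start;
        __le64                          virt_end;
        __le64                          phys_start;
        __le32                          flags;  /* READ, WRITE, MMIO */
        struct virtio_iommu_req_tail    tail;
};

Each MAP, plus the matching UNMAP once the buffer is recycled, costs a round-trip to the hypervisor, which is the overhead mentioned above.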
Since it depends on the device, I guess we'll need a survey of memory access patterns by the different virtio devices that we're considering. In the end a mix of both solutions might be necessary.
Thanks,
Jean
[1] https://lists.oasis-open.org/archives/virtio-dev/202006/msg00037.html