On Fri, Oct 02, 2020 at 04:26:10PM +0200, Arnd Bergmann wrote:
On Fri, Oct 2, 2020 at 3:44 PM Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:
Hi,
I've looked in more detail at limited memory sharing (STR-6, STR-8, STR-15), mainly from the Linux guest perspective. Here are a few thoughts.
Thanks for writing this up and for sharing!
At the moment the bounce buffer is allocated from a global pool in the low physical pages. However, a recent proposal by Chromium would add support for per-device swiotlb pools:
https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromium...
And quoting Tomasz from the discussion on patch 4:
For this, I'd like to propose a "restricted-dma-region" (feel free to suggest a better name) binding, which is explicitly specified to be the only DMA-able memory for this device and make Linux use the given pool for coherent DMA allocations and bouncing non-coherent DMA.
Right, I think this can work, but there are very substantial downsides to it:
- it is a fairly substantial departure from the virtio specification, which defines that transfers can be made to any part of the guest physical address space
- for each device that allocates a region, we have to reserve host memory that cannot be used for anything else, and the size is based on the maximum we might be needing at any point of time.
- it removes any kind of performance benefit that we might see from using an iommu, by forcing the guest to do bounce-buffering, in addition to the copies that may have to be done in the host. (you have the same point below as well)
Dynamic regions
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation. But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient. Instead of copying large buffers the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
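As a rough illustration of that trade-off, here is a toy cost model. It is only a sketch: the function names and both cost parameters are hypothetical placeholders rather than measurements, and treating the MAP/UNMAP round trip as independent of payload size is a simplification.

```python
def copy_cost_ns(nbytes, copy_ns_per_byte):
    """Cost of bouncing a payload through a static region (one copy)."""
    return nbytes * copy_ns_per_byte


def map_cost_ns(map_unmap_ns):
    """Cost of a MAP + UNMAP round trip to the hypervisor.

    Assumed constant here; in practice it would also grow with the
    number of pages being mapped.
    """
    return map_unmap_ns


def break_even_bytes(map_unmap_ns, copy_ns_per_byte):
    """Payload size above which mapping is cheaper than copying."""
    return map_unmap_ns / copy_ns_per_byte


# Placeholder numbers, chosen only to exercise the model.
MAP_UNMAP_NS = 10_000.0
COPY_NS_PER_BYTE = 0.5

threshold = break_even_bytes(MAP_UNMAP_NS, COPY_NS_PER_BYTE)
```

With real numbers for a given hypervisor, payloads well below `threshold` would favour copying and payloads well above it would favour MAP requests.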
Agreed, I would think the iommu based approach is much more promising here.
Maybe the partial-page sharing could be part of the iommu dma_ops? There are many scenarios where we already rely on the iommu for isolating the kernel from malicious or accidental DMA, but I have not seen anyone do this on a sub-page granularity. Using bounce buffers for partial page dma_map_* could be something we can do separately from the rest, as both aspects seem useful regardless of one another.
It is about to be added to the dma-iommu module, as part of the consolidation of the IOMMU dma_ops: https://lore.kernel.org/linux-iommu/20200912032200.11489-4-baolu.lu@linux.in...
That will enforce bounce-buffers for any device marked "untrusted" (external devices such as Thunderbolt devices, which could be malicious). I believe it would be a good thing to enable in our case as well.
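To make the sub-page idea concrete, the sketch below splits a buffer into the partial pages at its head and tail, which share a page with unrelated data and would need bouncing, and the whole pages in between, which could be mapped directly. This is only an illustration of the idea and does not claim to mirror the logic of the linked patch; `split_for_dma` and the 4096-byte granule are assumptions.

```python
PAGE_SIZE = 4096  # assumed IOMMU granule


def split_for_dma(addr, length, page_size=PAGE_SIZE):
    """Split [addr, addr + length) into (head, middle, tail) ranges.

    head and tail only partially cover a page, so they would be bounced.
    middle consists of whole pages that could be mapped directly.
    Each range is (start, end) with end exclusive; an empty range has
    start == end.
    """
    end = addr + length
    first_full = -(-addr // page_size) * page_size  # round addr up
    last_full = (end // page_size) * page_size      # round end down
    if first_full >= last_full:
        # No page is fully covered: bounce the whole buffer.
        return (addr, end), (end, end), (end, end)
    return (addr, first_full), (first_full, last_full), (last_full, end)


head, middle, tail = split_for_dma(0x1100, 0x3000)
```

For the buffer above, `head` is `(0x1100, 0x2000)`, `middle` is `(0x2000, 0x4000)` and `tail` is `(0x4000, 0x4100)`.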
Since it depends on the device, I guess we'll need a survey of memory access patterns by the different virtio devices that we're considering. In the end a mix of both solutions might be necessary.
Can you describe a case in which the iommu would clearly be inferior? IOW, what stops us from always using an iommu here?
If the virtio device only transfers small sub-page payloads, for example small packets:
- with static regions we copy each buffer to/from the static region,
- with an IOMMU we copy each buffer to/from a safe page *and* send requests to map+unmap that page.

Even if we didn't use bounce buffers in this case, map+unmap generally has a very high cost due to context switching and could easily be much slower than copying a small buffer.
For this case I see a possible optimization, keeping the bounce buffers mapped for some time so subsequent transfers can reuse them, but it's not implemented at the moment (and I wonder if it opens a vulnerability, though I can't see one right now).
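A minimal sketch of that optimization, as a toy model only: nothing like this exists in the kernel as far as this thread states, and `MappedBouncePool`, its callbacks and `max_idle` are all hypothetical names. It does not address the open security question.

```python
class MappedBouncePool:
    """Toy model of bounce pages that stay mapped between transfers."""

    def __init__(self, map_page, unmap_page, max_idle=4):
        self._map_page = map_page      # hypothetical: sends a MAP request
        self._unmap_page = unmap_page  # hypothetical: sends an UNMAP request
        self._max_idle = max_idle
        self._idle = []                # pages mapped but not in use

    def acquire(self):
        if self._idle:
            return self._idle.pop()    # reuse, no MAP request needed
        return self._map_page()

    def release(self, page):
        if len(self._idle) < self._max_idle:
            self._idle.append(page)    # keep it mapped for later reuse
        else:
            self._unmap_page(page)


# Count the requests that would reach the hypervisor.
map_requests = []
unmap_requests = []


def fake_map():
    page = len(map_requests)
    map_requests.append(page)
    return page


pool = MappedBouncePool(fake_map, unmap_requests.append, max_idle=1)
for _ in range(100):
    page = pool.acquire()
    pool.release(page)
```

In this model, 100 back-to-back small transfers issue a single MAP request and no UNMAP, instead of 100 of each.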
Thanks, Jean