On Fri, 2 Oct 2020, Arnd Bergmann via Stratos-dev wrote:
On Fri, Oct 2, 2020 at 3:44 PM Jean-Philippe Brucker jean-philippe@linaro.org wrote:
Hi,
I've looked in more detail at limited memory sharing (STR-6, STR-8, STR-15), mainly from the Linux guest perspective. Here are a few thoughts.
Thanks for writing this up and for sharing!
At the moment the bounce buffer is allocated from a global pool in the low physical pages. However, a recent proposal by Chromium would add support for per-device swiotlb pools:
https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromium...
And quoting Tomasz from the discussion on patch 4:
For this, I'd like to propose a "restricted-dma-region" (feel free to suggest a better name) binding, which is explicitly specified to be the only DMA-able memory for this device and make Linux use the given pool for coherent DMA allocations and bouncing non-coherent DMA.
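As a rough illustration of what such a binding means for the data path (a sketch only; the pool layout and helper names below are hypothetical and are not taken from the actual Chromium patches), every streaming DMA buffer gets bounced into the one region the device is allowed to touch:

/* Conceptual model of a per-device restricted DMA pool (not the actual
 * Chromium patches): every buffer handed to the device is first bounced
 * into a region that is the only memory the device may access.
 * All names here are hypothetical and only illustrate the data flow. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

struct restricted_pool {
    uint8_t *base;      /* guest-visible backing of the shared region */
    uint64_t dev_addr;  /* address the device uses for the same region */
    size_t   size;
    size_t   next;      /* trivial bump allocator, no freeing */
};

/* "Map" a driver buffer for DMA: copy it into the pool and hand the
 * device an address inside the pool instead of the original pages. */
static uint64_t restricted_dma_map(struct restricted_pool *pool,
                                   const void *buf, size_t len)
{
    if (pool->next + len > pool->size)
        return 0;  /* pool exhausted: real code would fail the request */
    size_t off = pool->next;
    memcpy(pool->base + off, buf, len);
    pool->next += len;
    return pool->dev_addr + off;
}

/* "Unmap" for a device-to-driver transfer: copy the data back out. */
static void restricted_dma_unmap(struct restricted_pool *pool,
                                 uint64_t dev_addr, void *buf, size_t len)
{
    memcpy(buf, pool->base + (dev_addr - pool->dev_addr), len);
}

int main(void)
{
    struct restricted_pool pool = {
        .base = malloc(1 << 20), .dev_addr = 0x80000000ull,
        .size = 1 << 20, .next = 0,
    };
    char payload[] = "packet data";
    uint64_t dev_addr = restricted_dma_map(&pool, payload, sizeof(payload));
    printf("device sees buffer at %#llx\n", (unsigned long long)dev_addr);
    restricted_dma_unmap(&pool, dev_addr, payload, sizeof(payload));
    free(pool.base);
    return 0;
}

The memcpy in restricted_dma_map is exactly the bounce-buffering cost discussed below.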
Right, I think this can work, but there are very substantial downsides to it:
- it is a fairly substantial departure from the virtio specification, which defines that transfers can be made to any part of the guest physical address space;
- for each device that allocates a region, we have to reserve host memory that cannot be used for anything else, and the size has to cover the maximum we might need at any point in time;
- it removes any performance benefit that we might see from using an iommu, by forcing the guest to do bounce-buffering, in addition to the copies that may have to be done in the host (you have the same point below as well).
Dynamic regions
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation. But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient. Instead of copying large buffers the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
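To make that trade-off concrete, here is a toy cost model (the constants are placeholders rather than measurements, and the function name is made up for this sketch):

/* Hypothetical cost model for the copy-vs-map trade-off described above.
 * The constants are placeholders, not measurements: a per-byte copy cost
 * versus a roughly fixed cost per MAP/UNMAP round trip (context switch
 * plus page-table update in the backend's address space). */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define COPY_NS_PER_BYTE        1   /* assumed bounce-copy throughput   */
#define MAP_UNMAP_FIXED_NS   8000   /* assumed vIOMMU map + unmap cost  */

static bool should_map_instead_of_copy(size_t payload_len)
{
    /* Sub-page payloads must be bounced anyway for isolation. */
    if (payload_len < 4096)
        return false;
    return payload_len * COPY_NS_PER_BYTE > MAP_UNMAP_FIXED_NS;
}

int main(void)
{
    printf("1 KiB payload: %s\n",
           should_map_instead_of_copy(1024) ? "map" : "copy");
    printf("64 KiB payload: %s\n",
           should_map_instead_of_copy(65536) ? "map" : "copy");
    return 0;
}

With these placeholder numbers the crossover sits around 8 KB; the real threshold would have to come out of benchmarking on each hypervisor and SoC.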
Agreed, I would think the iommu based approach is much more promising here.
The two approaches are not mutually exclusive. The first approach could be demoed in a couple weeks, while this approach will require months of work and at least one new virtio interface.
My suggestion would be to hack together a pre-shared memory solution, or a hypervisor-mediated solution with Argo, do some benchmarks to understand the degradation, and figure out if the degradation is bad enough that we need to go down the virtio IOMMU route.
From my experience with the Xen grant table and PV drivers, the virtio IOMMU alone might not make things a lot faster without changes to the protocols (i.e. virtio-net/block need specific improvements).
If it turns out that the pre-shared memory causes performance issues, one "crazy" idea would be to make use of DMA engines in Xen to speed up hypervisor-based copies.
Maybe the partial-page sharing could be part of the iommu dma_ops? There are many scenarios where we already rely on the iommu for isolating the kernel from malicious or accidental DMA, but I have not seen anyone do this on a sub-page granularity. Using bounce buffers for partial page dma_map_* could be something we can do separately from the rest, as both aspects seem useful regardless of one another.
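A minimal sketch of that sub-page policy, assuming a 4 KiB page size (a standalone model to show the decision, not an actual dma_ops implementation):

/* Sketch of the sub-page policy suggested above: a streaming mapping is
 * only put through the IOMMU in place when the buffer occupies whole
 * pages; anything that shares a page with unrelated kernel data would be
 * bounced first so the backend never sees the rest of that page. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096ul

static bool needs_bounce(uintptr_t addr, size_t len)
{
    /* Bounce if the buffer does not start on a page boundary or does
     * not end exactly on one, i.e. it shares a page with other data. */
    return (addr % PAGE_SIZE) != 0 || (len % PAGE_SIZE) != 0;
}

int main(void)
{
    assert(!needs_bounce(0x10000, 2 * PAGE_SIZE)); /* whole pages: map  */
    assert(needs_bounce(0x10040, 128));            /* sub-page: bounce  */
    return 0;
}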
Since it depends on the device, I guess we'll need a survey of memory access patterns by the different virtio devices that we're considering. In the end a mix of both solutions might be necessary.
Can you describe a case in which the iommu would clearly be inferior? IOW, what stops us from always using an iommu here?
Xen PV drivers started out with the equivalent of a virtio IOMMU in place, which we call the "grant table". A virtual machine uses the grant table to share memory explicitly with another virtual machine. Specifically, the frontend uses the grant table to share memory with the backend; otherwise the backend is not allowed to map the memory.
(There is a question on whether we could standardize the grant table interface.)
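For those less familiar with it, here is a minimal model of the mechanism (purely illustrative; this is not Xen's actual grant-table API or data layout):

/* Minimal model of the grant-table idea described above: the frontend
 * publishes which of its frames a given backend domain may map, and the
 * hypervisor refuses any mapping that is not in the table. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_GRANTS 64

struct grant_entry {
    bool     in_use;
    uint16_t backend_domid;  /* domain allowed to map the frame */
    uint64_t frame;          /* frontend frame being shared     */
    bool     readonly;
};

static struct grant_entry grant_table[MAX_GRANTS];

/* Frontend side: publish a frame for one specific backend domain. */
static int grant_access(uint16_t backend_domid, uint64_t frame, bool ro)
{
    for (int ref = 0; ref < MAX_GRANTS; ref++) {
        if (!grant_table[ref].in_use) {
            grant_table[ref] = (struct grant_entry){
                true, backend_domid, frame, ro };
            return ref;
        }
    }
    return -1;
}

/* Hypervisor side: a backend map request succeeds only for a valid ref
 * that was granted to that exact domain. */
static bool backend_may_map(uint16_t domid, int ref)
{
    return ref >= 0 && ref < MAX_GRANTS &&
           grant_table[ref].in_use &&
           grant_table[ref].backend_domid == domid;
}

int main(void)
{
    int ref = grant_access(/* backend domid */ 1, /* frame */ 0x1234, false);
    printf("domid 1 may map ref %d: %d\n", ref, backend_may_map(1, ref));
    printf("domid 2 may map ref %d: %d\n", ref, backend_may_map(2, ref));
    return 0;
}

The key property is the last check: the backend can only map what the frontend explicitly granted to that specific domain, which is the same isolation goal as the vIOMMU MAP requests above.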
Speaking from that experience, we ended up switching PV network to use hypervisor-based copy (without Argo; Argo came later and it is a more generic solution) because it was faster than the alternatives. We are still using the grant table for everything else (block, framebuffer, etc.)
For block, we ended up extending the lifetime of certain grant mappings to improve performance; we called them "persistent grants".
The takeaway is that the results might differ significantly, not just between one protocol and another (net vs. block), but also between hypervisors (Xen and KVM), and between SoCs. These are difficult waters to navigate.
On Tue, 6 Oct 2020, Stefano Stabellini wrote:
Speaking from that experience, we ended up switching PV network to use hypervisor-based copy (without Argo; Argo came later and it is a more generic solution) because it was faster than the alternatives. We are still using the grant table for everything else (block, framebuffer, etc.)
Just to further clarify: Xen-based copies were faster *on x86* for PV network. At the time the change was made, Xen on ARM didn't exist yet so I don't have corresponding numbers.
When I wrote in the other email that the last time I measured performance (2+ years ago) the grant table was faster than copies on ARM even for networking, I was referring to a new PV networking protocol called "PV Calls" that I wrote from scratch. PV Calls is entirely based on memory copies; on x86 it was fantastic but on ARM the performance was not great and I suspect using the grant table would have been faster.
A detailed analysis of memory copies vs grant table for PV network has not been done on ARM yet.
On Wed, Oct 7, 2020 at 12:37 AM Stefano Stabellini stefano.stabellini@xilinx.com wrote:
On Fri, 2 Oct 2020, Arnd Bergmann via Stratos-dev wrote:
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation. But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient. Instead of copying large buffers the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
Agreed, I would think the iommu based approach is much more promising here.
The two approaches are not mutually exclusive. The first approach could be demoed in a couple weeks, while this approach will require months of work and at least one new virtio interface.
My suggestion would be to hack together a pre-shared memory solution, or a hypervisor-mediated solution with Argo, do some benchmarks to understand the degradation, and figure out if the degradation is bad enough that we need to go down the virtio IOMMU route.
Yes, this makes sense. I'll see if I can come up with a basic design for virtio devices based on pre-shared memory in place of the virtqueue, and then we can see if we can prototype the device side in qemu talking to a modified Linux guest. If that works, the next step would be to share the memory with another guest and have that implement the backend instead of qemu.
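As a strawman for what "pre-shared memory in place of the virtqueue" could look like (nothing here is an existing virtio structure; names and sizes are made up), the descriptors would carry offsets into a fixed shared window rather than guest-physical addresses:

/* One possible shape of a pre-shared transport region, purely as a
 * strawman: a fixed window holds both the rings and the payload buffers,
 * so the backend never needs to touch other guest memory. Offsets replace
 * the guest-physical addresses a normal virtqueue descriptor would carry. */
#include <stdint.h>

#define SHM_NUM_DESC   256
#define SHM_BUF_SIZE   4096

struct shm_desc {
    uint32_t offset;   /* payload offset inside the shared window */
    uint32_t len;
    uint16_t flags;    /* e.g. device-writable, next-descriptor   */
    uint16_t next;
};

struct shm_transport {
    /* ring bookkeeping lives in the same shared window */
    uint16_t avail_idx;
    uint16_t used_idx;
    struct shm_desc desc[SHM_NUM_DESC];
    /* payload area: drivers copy data in and out of these slots */
    uint8_t buf[SHM_NUM_DESC][SHM_BUF_SIZE];
};

Payloads would always be copied into and out of the window, which is the bounce-buffering cost we would then benchmark against the vIOMMU approach.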
Xen PV drivers started out with the equivalent of a virtio IOMMU in place, which we call the "grant table". A virtual machine uses the grant table to share memory explicitly with another virtual machine. Specifically, the frontend uses the grant table to share memory with the backend; otherwise the backend is not allowed to map the memory.
(There is a question on whether we could standardize the grant table interface.)
Speaking from that experience, we ended up switching PV network to use hypervisor-based copy (without Argo; Argo came later and it is a more generic solution) because it was faster than the alternatives. We are still using the grant table for everything else (block, framebuffer, etc.)
My feeling is that the grant table approach is too specific to Xen and wouldn't lend itself to porting to most other hypervisors. The idea of picking virtio seems to be based on the assumption that it is already portable.
Adding a vIOMMU requirement for regular virtio devices seems possible and builds on existing guest drivers, but it certainly adds complexity in all areas (front-end, hypervisor and back-end) without obviously being faster than a simpler approach.
OTOH, we can probably use the existing grant table implementation in Xen for a performance comparison, assuming that a virtio+viommu based approach would generally be slower than that.
The takeaway is that the results might differ significantly, not just between one protocol and another (net vs. block), but also between hypervisors (Xen and KVM), and between SoCs. These are difficult waters to navigate.
Agreed.
Arnd