On Fri, 2 Oct 2020, Arnd Bergmann via Stratos-dev wrote:
On Fri, Oct 2, 2020 at 3:44 PM Jean-Philippe Brucker jean-philippe@linaro.org wrote:
Hi,
I've looked in more detail at limited memory sharing (STR-6, STR-8, STR-15), mainly from the Linux guest perspective. Here are a few thoughts.
Thanks for writing this up and for sharing!
At the moment the bounce buffer is allocated from a global pool in the low physical pages. However, a recent proposal by Chromium would add support for per-device swiotlb pools:
https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromium...
And quoting Tomasz from the discussion on patch 4:
For this, I'd like to propose a "restricted-dma-region" (feel free to suggest a better name) binding, which is explicitly specified to be the only DMA-able memory for this device and make Linux use the given pool for coherent DMA allocations and bouncing non-coherent DMA.
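As a rough illustration of what such a binding means for the data path (a sketch only; the pool layout and helper names below are hypothetical and are not taken from the actual Chromium patches), every streaming DMA buffer gets bounced into the one region the device is allowed to touch:

/* Conceptual model of a per-device restricted DMA pool (not the actual
 * Chromium patches): every buffer handed to the device is first bounced
 * into a region that is the only memory the device may access.
 * All names here are hypothetical and only illustrate the data flow. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

struct restricted_pool {
    uint8_t *base;      /* guest-visible backing of the shared region */
    uint64_t dev_addr;  /* address the device uses for the same region */
    size_t   size;
    size_t   next;      /* trivial bump allocator, no freeing */
};

/* "Map" a driver buffer for DMA: copy it into the pool and hand the
 * device an address inside the pool instead of the original pages. */
static uint64_t restricted_dma_map(struct restricted_pool *pool,
                                   const void *buf, size_t len)
{
    if (pool->next + len > pool->size)
        return 0;  /* pool exhausted: real code would fail the request */
    size_t off = pool->next;
    memcpy(pool->base + off, buf, len);
    pool->next += len;
    return pool->dev_addr + off;
}

/* "Unmap" for a device-to-driver transfer: copy the data back out. */
static void restricted_dma_unmap(struct restricted_pool *pool,
                                 uint64_t dev_addr, void *buf, size_t len)
{
    memcpy(buf, pool->base + (dev_addr - pool->dev_addr), len);
}

int main(void)
{
    struct restricted_pool pool = {
        .base = malloc(1 << 20), .dev_addr = 0x80000000ull,
        .size = 1 << 20, .next = 0,
    };
    char payload[] = "packet data";
    uint64_t dev_addr = restricted_dma_map(&pool, payload, sizeof(payload));
    printf("device sees buffer at %#llx\n", (unsigned long long)dev_addr);
    restricted_dma_unmap(&pool, dev_addr, payload, sizeof(payload));
    free(pool.base);
    return 0;
}

The memcpy in restricted_dma_map is exactly the bounce-buffering cost discussed below.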
Right, I think this can work, but there are very substantial downsides to it:
- it is a fairly substantial departure from the virtio specification, which defines that transfers can be made to any part of the guest physical address space;
- for each device that allocates a region, we have to reserve host memory that cannot be used for anything else, and the size has to cover the maximum we might need at any point in time;
- it removes any performance benefit that we might see from using an iommu, by forcing the guest to do bounce-buffering, in addition to the copies that may have to be done in the host (you have the same point below as well).
Dynamic regions
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation. But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient. Instead of copying large buffers the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
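To make that trade-off concrete, here is a toy cost model (the constants are placeholders rather than measurements, and the function name is made up for this sketch):

/* Hypothetical cost model for the copy-vs-map trade-off described above.
 * The constants are placeholders, not measurements: a per-byte copy cost
 * versus a roughly fixed cost per MAP/UNMAP round trip (context switch
 * plus page-table update in the backend's address space). */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define COPY_NS_PER_BYTE        1   /* assumed bounce-copy throughput   */
#define MAP_UNMAP_FIXED_NS   8000   /* assumed vIOMMU map + unmap cost  */

static bool should_map_instead_of_copy(size_t payload_len)
{
    /* Sub-page payloads must be bounced anyway for isolation. */
    if (payload_len < 4096)
        return false;
    return payload_len * COPY_NS_PER_BYTE > MAP_UNMAP_FIXED_NS;
}

int main(void)
{
    printf("1 KiB payload: %s\n",
           should_map_instead_of_copy(1024) ? "map" : "copy");
    printf("64 KiB payload: %s\n",
           should_map_instead_of_copy(65536) ? "map" : "copy");
    return 0;
}

With these placeholder numbers the crossover sits around 8 KB; the real threshold would have to come out of benchmarking on each hypervisor and SoC.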
Agreed, I would think the iommu based approach is much more promising here.
The two approaches are not mutually exclusive. The first approach could be demoed in a couple weeks, while this approach will require months of work and at least one new virtio interface.
My suggestion would be to hack together a pre-shared memory solution, or a hypervisor-mediated solution with Argo, do some benchmarks to understand the degradation, and figure out if the degradation is bad enough that we need to go down the virtio IOMMU route.
From my experience with the Xen grant table and PV drivers, the virtio IOMMU alone might not make things a lot faster without changes to the protocols (i.e. virtio-net/block need specific improvements).
If it turns out that the pre-shared memory causes performance issues, one "crazy" idea would be to make use of DMA engines in Xen to speed up hypervisor-based copies.
Maybe the partial-page sharing could be part of the iommu dma_ops? There are many scenarios where we already rely on the iommu for isolating the kernel from malicious or accidental DMA, but I have not seen anyone do this on a sub-page granularity. Using bounce buffers for partial page dma_map_* could be something we can do separately from the rest, as both aspects seem useful regardless of one another.
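A minimal sketch of that sub-page policy, assuming a 4 KiB page size (a standalone model to show the decision, not an actual dma_ops implementation):

/* Sketch of the sub-page policy suggested above: a streaming mapping is
 * only put through the IOMMU in place when the buffer occupies whole
 * pages; anything that shares a page with unrelated kernel data would be
 * bounced first so the backend never sees the rest of that page. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096ul

static bool needs_bounce(uintptr_t addr, size_t len)
{
    /* Bounce if the buffer does not start on a page boundary or does
     * not end exactly on one, i.e. it shares a page with other data. */
    return (addr % PAGE_SIZE) != 0 || (len % PAGE_SIZE) != 0;
}

int main(void)
{
    assert(!needs_bounce(0x10000, 2 * PAGE_SIZE)); /* whole pages: map  */
    assert(needs_bounce(0x10040, 128));            /* sub-page: bounce  */
    return 0;
}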
Since it depends on the device, I guess we'll need a survey of memory access patterns by the different virtio devices that we're considering. In the end a mix of both solutions might be necessary.
Can you describe a case in which the iommu would clearly be inferior? IOW, what stops us from always using an iommu here?
Xen PV drivers started out with the equivalent of a virtio IOMMU in place, which we call the "grant table". A virtual machine uses the grant table to share memory explicitly with another virtual machine. Specifically, the frontend uses the grant table to share memory with the backend; otherwise the backend is not allowed to map the memory.
(There is a question on whether we could standardize the grant table interface.)
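For those less familiar with it, here is a minimal model of the mechanism (purely illustrative; this is not Xen's actual grant-table API or data layout):

/* Minimal model of the grant-table idea described above: the frontend
 * publishes which of its frames a given backend domain may map, and the
 * hypervisor refuses any mapping that is not in the table. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_GRANTS 64

struct grant_entry {
    bool     in_use;
    uint16_t backend_domid;  /* domain allowed to map the frame */
    uint64_t frame;          /* frontend frame being shared     */
    bool     readonly;
};

static struct grant_entry grant_table[MAX_GRANTS];

/* Frontend side: publish a frame for one specific backend domain. */
static int grant_access(uint16_t backend_domid, uint64_t frame, bool ro)
{
    for (int ref = 0; ref < MAX_GRANTS; ref++) {
        if (!grant_table[ref].in_use) {
            grant_table[ref] = (struct grant_entry){
                true, backend_domid, frame, ro };
            return ref;
        }
    }
    return -1;
}

/* Hypervisor side: a backend map request succeeds only for a valid ref
 * that was granted to that exact domain. */
static bool backend_may_map(uint16_t domid, int ref)
{
    return ref >= 0 && ref < MAX_GRANTS &&
           grant_table[ref].in_use &&
           grant_table[ref].backend_domid == domid;
}

int main(void)
{
    int ref = grant_access(/* backend domid */ 1, /* frame */ 0x1234, false);
    printf("domid 1 may map ref %d: %d\n", ref, backend_may_map(1, ref));
    printf("domid 2 may map ref %d: %d\n", ref, backend_may_map(2, ref));
    return 0;
}

The key property is the last check: the backend can only map what the frontend explicitly granted to that specific domain, which is the same isolation goal as the vIOMMU MAP requests above.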
Speaking from that experience, we ended up switching PV network to use hypervisor-based copy (without Argo; Argo came later and it is a more generic solution) because it was faster than the alternatives. We are still using the grant table for everything else (block, framebuffer, etc.)
For block, we ended up extending the lifetime of certain grant mappings to improve performance; we called them "persistent grants".
The takeaway is that the results might differ significantly, not just between one protocol and another (net vs. block), but also between hypervisors (Xen and KVM), and between SoCs. These are difficult waters to navigate.
On Tue, 6 Oct 2020, Stefano Stabellini wrote:
Speaking from that experience, we ended up switching PV network to use hypervisor-based copy (without Argo; Argo came later and it is a more generic solution) because it was faster than the alternatives. We are still using the grant table for everything else (block, framebuffer, etc.)
Just to further clarify: Xen-based copies were faster *on x86* for PV network. At the time the change was made, Xen on ARM didn't exist yet so I don't have corresponding numbers.
When I wrote in the other email that the last time I measured performance (2+ years ago) the grant table was faster than copies on ARM even for networking, I was referring to a new PV networking protocol called "PV Calls" that I wrote from scratch. PV Calls is entirely based on memory copies; on x86 it was fantastic but on ARM the performance was not great and I suspect using the grant table would have been faster.
A detailed analysis of memory copies vs grant table for PV network has not been done on ARM yet.
On Wed, Oct 7, 2020 at 12:37 AM Stefano Stabellini stefano.stabellini@xilinx.com wrote:
On Fri, 2 Oct 2020, Arnd Bergmann via Stratos-dev wrote:
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation. But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient. Instead of copying large buffers the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
Agreed, I would think the iommu based approach is much more promising here.
The two approaches are not mutually exclusive. The first approach could be demoed in a couple weeks, while this approach will require months of work and at least one new virtio interface.
My suggestion would be to hack together a pre-shared memory solution, or a hypervisor-mediated solution with Argo, do some benchmarks to understand the degradation, and figure out if the degradation is bad enough that we need to go down the virtio IOMMU route.
Yes, this makes sense. I'll see if I can come up with a basic design for virtio devices based on pre-shared memory in place of the virtqueue, and then we can see if we can prototype the device side in qemu talking to a modified Linux guest. If that works, the next step would be to share the memory with another guest and have that implement the backend instead of qemu.
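As a strawman for what "pre-shared memory in place of the virtqueue" could look like (nothing here is an existing virtio structure; names and sizes are made up), the descriptors would carry offsets into a fixed shared window rather than guest-physical addresses:

/* One possible shape of a pre-shared transport region, purely as a
 * strawman: a fixed window holds both the rings and the payload buffers,
 * so the backend never needs to touch other guest memory. Offsets replace
 * the guest-physical addresses a normal virtqueue descriptor would carry. */
#include <stdint.h>

#define SHM_NUM_DESC   256
#define SHM_BUF_SIZE   4096

struct shm_desc {
    uint32_t offset;   /* payload offset inside the shared window */
    uint32_t len;
    uint16_t flags;    /* e.g. device-writable, next-descriptor   */
    uint16_t next;
};

struct shm_transport {
    /* ring bookkeeping lives in the same shared window */
    uint16_t avail_idx;
    uint16_t used_idx;
    struct shm_desc desc[SHM_NUM_DESC];
    /* payload area: drivers copy data in and out of these slots */
    uint8_t buf[SHM_NUM_DESC][SHM_BUF_SIZE];
};

Payloads would always be copied into and out of the window, which is the bounce-buffering cost we would then benchmark against the vIOMMU approach.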
Xen PV drivers started out with the equivalent of a virtio IOMMU in place, which we call the "grant table". A virtual machine uses the grant table to share memory explicitly with another virtual machine. Specifically, the frontend uses the grant table to share memory with the backend; otherwise the backend is not allowed to map the memory.
(There is a question on whether we could standardize the grant table interface.)
Speaking from that experience, we ended up switching PV network to use hypervisor-based copy (without Argo; Argo came later and it is a more generic solution) because it was faster than the alternatives. We are still using the grant table for everything else (block, framebuffer, etc.)
My feeling is that the grant table approach is too specific to Xen and wouldn't lend itself to porting to most other hypervisors. The idea of picking virtio seems to be based on the assumption that it is already portable.
Adding a vIOMMU requirement for regular virtio devices seems possible and builds on existing guest drivers, but it certainly adds complexity in all areas (front-end, hypervisor and back-end) without obviously being faster than a simpler approach.
OTOH, we can probably use the existing grant table implementation in Xen for a performance comparison, assuming that a virtio+viommu based approach would generally be slower than that.
The takeaway is that the results might differ significantly, not just between one protocol and another (net vs. block), but also between hypervisors (Xen and KVM), and between SoCs. These are difficult waters to navigate.
Agreed.
Arnd