Arnd Bergmann via Stratos-dev stratos-dev@op-lists.linaro.org writes:
On Wed, Oct 7, 2020 at 12:37 AM Stefano Stabellini stefano.stabellini@xilinx.com wrote:
On Fri, 2 Oct 2020, Arnd Bergmann via Stratos-dev wrote:
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation. But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient. Instead of copying large buffers the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
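To make the MAP request concrete, the information it needs to carry is roughly the following (a sketch loosely modelled on the virtio-iommu MAP request; the field names are illustrative and the authoritative layout is whatever the virtio-iommu spec defines):

    #include <stdint.h>

    /* Illustrative sketch of what a vIOMMU MAP request has to convey so
     * that the hypervisor can map frontend pages into the backend.  Not
     * the actual wire format. */
    struct viommu_map_req {
            uint32_t domain;      /* IOMMU domain the backend device sits in */
            uint64_t virt_start;  /* start of the IOVA range the backend will use */
            uint64_t virt_end;    /* inclusive end of that range */
            uint64_t phys_start;  /* guest-physical address of the shared buffer */
            uint32_t flags;       /* read/write permission bits */
    };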
Agreed, I would think the iommu based approach is much more promising here.
The two approaches are not mutually exclusive. The first approach could be demoed in a couple weeks, while this approach will require months of work and at least one new virtio interface.
My suggestion would be to hack together a pre-shared memory solution, or a hypervisor-mediated solution with Argo, do some benchmarks to understand the degradation, and figure out if the degradation is bad enough that we need to go down the virtio IOMMU route.
Yes, this makes sense. I'll see if I can come up with a basic design for virtio devices based on pre-shared memory in place of the virtqueue, and then we can see if we can prototype the device side in qemu talking to a modified Linux guest. If that works, the next step would be to share the memory with another guest and have that implement the backend instead of qemu.
That seems like a reasonable set of steps. So the pre-shared region will be the source of memory for the virtqueues as well as the "direct" buffers the virtqueues reference?
Xen PV drivers started out with the equivalent of a virtio IOMMU in place, which we call the "grant table". A virtual machine uses the grant table to share memory explicitly with another virtual machine. Specifically, the frontend uses the grant table to share memory with the backend; otherwise the backend is not allowed to map the memory.
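For reference, a version-1 grant table entry is tiny; from memory it looks roughly like this (treat it as a sketch, the authoritative definition is in Xen's public grant_table.h):

    #include <stdint.h>

    typedef uint16_t domid_t;

    /* Sketch of a Xen v1 grant table entry: the frontend fills in one of
     * these to let exactly one foreign domain (the backend) map or copy
     * one of its frames, identified by grant reference rather than by
     * guest-physical address. */
    struct grant_entry_v1 {
            uint16_t flags;   /* GTF_permit_access, GTF_readonly, ... */
            domid_t  domid;   /* domain being granted access (the backend) */
            uint32_t frame;   /* the frontend frame number being shared */
    };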
(There is a question on whether we could standardize the grant table interface.)
Speaking from that experience, we ended up switching PV network to use hypervisor-based copy (without Argo; Argo came later and it is a more generic solution) because it was faster than the alternatives. We are still using the grant table for everything else (block, framebuffer, etc.)
My feeling is that the grant table approach is too specific to Xen and wouldn't lend itself to porting to most other hypervisors. The idea of picking virtio seems to be based on the assumption that this is already portable.
Adding a vIOMMU requirement for regular virtio devices seems possible and builds on existing guest drivers, but it certainly adds complexity in all areas (front-end, hypervisor and back-end) without obviously being faster than a simpler approach.
OTOH, we can probably use the existing grant table implementation in Xen for a performance comparison, assuming that a virtio+viommu based approach would generally be slower than that.
With the existing grant tables we should be able to test a DomU vhost-user backend with a pre-shared chunk of memory, if we can find some way to pass signalling events to the DomU guest. The front-end could be in the main Dom0 initially.
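For the signalling side, the backend in the DomU could presumably just use libxenevtchn, something along these lines (untested sketch, error paths trimmed; how the frontend's domid and port get communicated is exactly the open part, e.g. via xenstore):

    #include <xenevtchn.h>

    /* Untested sketch: bind to an event channel offered by the frontend
     * domain and use it both to receive kicks and to notify completions.
     * frontend_domid and remote_port are assumed to be discovered out of
     * band. */
    static int setup_notify(uint32_t frontend_domid, uint32_t remote_port)
    {
            xenevtchn_handle *xce = xenevtchn_open(NULL, 0);
            if (!xce)
                    return -1;

            /* local port we wait on for "frontend kicked us" */
            int port = xenevtchn_bind_interdomain(xce, frontend_domid, remote_port);
            if (port < 0)
                    return -1;

            /* the same port is used to signal the frontend back */
            return xenevtchn_notify(xce, port);
    }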
I'm hoping this is something that Akashi can look at once he's up to speed with Xen, although I expect it might lag the KVM-based PoC a bit.
The takeaway is that the results might differ significantly, not just between one protocol and another (net and block), but also between hypervisors (Xen and KVM), and between SoCs. These are difficult waters to navigate.
Definitely - more data points required ;-)
Agreed.
Arnd
On Mon, Oct 12, 2020 at 5:17 PM Alex Bennée alex.bennee@linaro.org wrote:
Arnd Bergmann via Stratos-dev stratos-dev@op-lists.linaro.org writes:
Yes, this makes sense. I'll see if I can come up with a basic design for virtio devices based on pre-shared memory in place of the virtqueue, and then we can see if we can prototype the device side in qemu talking to a modified Linux guest. If that works, the next step would be to share the memory with another guest and have that implement the backend instead of qemu.
That seems like a reasonable set of steps. So the pre-shared region will be the source of memory for the virtqueues as well as the "direct" buffers the virtqueues reference?
Yes, my rough idea would be to stay as close as possible to the existing virtio/virtqueue design, but replace the existing descriptors pointing to guest physical memory with TLV headers describing the data in the ring buffer itself. This might not actually be possible, but my hope is that we can get away with making the ring buffer large enough for all data that needs to be in flight at any time, require all data to be processed in order, and use the virtq_desc->next field to point to where the next header starts.
This however limits the total size of a queue to 512KB (16 bytes times 32768 index values), like the current limit of the descriptor array, and that may not be enough for all devices.
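Roughly what I have in mind for the in-ring headers, purely as a strawman (the struct and names are made up, nothing like this is specified anywhere):

    #include <stdint.h>

    /* Strawman only: a TLV-style header stored in the shared ring itself,
     * replacing a virtq_desc that would otherwise point into guest RAM.
     * The payload follows the header inline, and 'next' is an index in
     * 16-byte units (like virtq_desc), which is where the ceiling of
     * 16 bytes times 32768 index values comes from. */
    struct virtq_inline_desc {
            uint32_t len;    /* length of the inline payload that follows */
            uint16_t flags;  /* NEXT/WRITE style flags, as in virtq_desc */
            uint16_t next;   /* 16-bit index of the following header */
            uint8_t  data[]; /* payload, padded to keep headers aligned */
    };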
Arnd
On Mon, 12 Oct 2020, Alex Bennée wrote:
On Wed, Oct 7, 2020 at 12:37 AM Stefano Stabellini stefano.stabellini@xilinx.com wrote:
On Fri, 2 Oct 2020, Arnd Bergmann via Stratos-dev wrote:
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation. But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient. Instead of copying large buffers the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
Agreed, I would think the iommu based approach is much more promising here.
The two approaches are not mutually exclusive. The first approach could be demoed in a couple weeks, while this approach will require months of work and at least one new virtio interface.
My suggestion would be to hack together a pre-shared memory solution, or a hypervisor-mediated solution with Argo, do some benchmarks to understand the degradation, and figure out if the degradation is bad enough that we need to go down the virtio IOMMU route.
Yes, this makes sense. I'll see if I can come up with a basic design for virtio devices based on pre-shared memory in place of the virtqueue, and then we can see if we can prototype the device side in qemu talking to a modified Linux guest. If that works, the next step would be to share the memory with another guest and have that implement the backend instead of qemu.
Just FYI, Xen has mechanisms to pre-share memory areas statically (from VM creation) between VMs. We could pre-share a memory region between domU1 and domU2 and use it for virtio. We would have to come up with a way to mark the memory as "special virtio memory" and let Linux/QEMU know about it. Maybe we could use a reserved-memory binding for it.
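On the Linux side, picking the region up could look like the usual reserved-memory handling, something like this (sketch only; the "memory-region" reference and the binding itself are still to be defined):

    #include <linux/device.h>
    #include <linux/io.h>
    #include <linux/ioport.h>
    #include <linux/of.h>
    #include <linux/of_address.h>

    /* Sketch: a frontend driver locating a statically pre-shared region
     * described by a (hypothetical) reserved-memory node referenced via a
     * "memory-region" phandle, and mapping it for use as virtio buffer
     * space. */
    static void *virtio_shm_map(struct device *dev, resource_size_t *size)
    {
            struct device_node *np;
            struct resource res;
            int ret;

            np = of_parse_phandle(dev->of_node, "memory-region", 0);
            if (!np)
                    return NULL;

            ret = of_address_to_resource(np, 0, &res);
            of_node_put(np);
            if (ret)
                    return NULL;

            *size = resource_size(&res);
            return memremap(res.start, *size, MEMREMAP_WB);
    }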
That seems like a reasonable set of steps. So the pre-shared region will be the source of memory for the virtqueues as well as the "direct" buffers the virtqueues reference?
My original thinking was to use the pre-shared region for everything, in a dma_ops swiotlb fashion: the kernel (frontend) would end up picking bounce buffers out of the pre-shared region thanks to a special swiotlb instance made for the purpose. The backend would be told to map the pre-shared region at initialization and only use already-mapped pages from it.
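To spell out the frontend's bounce step, it boils down to something like this (hand-wavy sketch with a trivial bump allocator; a real implementation would hook into dma_ops/swiotlb and recycle buffers):

    #include <stdint.h>
    #include <string.h>

    /* Sketch of the frontend-side bounce: before handing a buffer to the
     * device, copy it into the pre-shared region and describe it to the
     * backend as an offset into that region rather than a guest-physical
     * address.  The backend only ever touches the pre-mapped region. */
    struct shm_pool {
            uint8_t *base;   /* mapped start of the pre-shared region */
            size_t   size;   /* total size of the region */
            size_t   next;   /* bump allocator cursor, no recycling here */
    };

    static long shm_bounce_out(struct shm_pool *p, const void *buf, size_t len)
    {
            if (p->next + len > p->size)
                    return -1;                  /* region exhausted */

            size_t off = p->next;
            memcpy(p->base + off, buf, len);    /* the actual bounce copy */
            p->next += len;
            return (long)off;                   /* offset handed to the backend */
    }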
However, I am not sure this is the best way to do it -- you and Arnd might have better ideas on how to integrate the pre-shared region with the rest of the virtio infrastructure.
Xen PV drivers started out with the equivalent of a virtio IOMMU in place, which we call the "grant table". A virtual machine uses the grant table to share memory explicitly with another virtual machine. Specifically, the frontend uses the grant table to share memory with the backend; otherwise the backend is not allowed to map the memory.
(There is a question on whether we could standardize the grant table interface.)
Speaking from that experience, we ended up switching PV network to use hypervisor-based copy (without Argo; Argo came later and it is a more generic solution) because it was faster than the alternatives. We are still using the grant table for everything else (block, framebuffer, etc.)
My feeling is that the grant table approach is too specific to Xen and wouldn't lend itself to porting to most other hypervisors. The idea of picking virtio seems to be based on the assumption that this is already portable.
Yeah, you are right. I didn't mean the grant table "as is"; I meant making a few changes to it so that it becomes easy to implement in other hypervisors, and turning it into a virtio interface. But I don't know; it could be easier to start from scratch.
Adding a vIOMMU requirement for regular virtio devices seems possible and builds on existing guest drivers, but it certainly adds complexity in all areas (front-end, hypervisor and back-end) without obviously being faster than a simpler approach.
OTOH, we can probably use the existing grant table implementation in Xen for a performance comparison, assuming that a virtio+viommu based approach would generally be slower than that.
With the existing grant tables we should be able to test a DomU vhost-user backend with a pre-shared chunk of memory, if we can find some way to pass signalling events to the DomU guest. The front-end could be in the main Dom0 initially.
I'm hoping this is something that Akashi can look at once he's up to speed with Xen, although I expect it might lag the KVM-based PoC a bit.
Excellent idea! We should definitely be able to use the existing grant table to do measurements and get some useful data points.
On Tue, Oct 13, 2020 at 2:53 AM Stefano Stabellini stefano.stabellini@xilinx.com wrote:
On Mon, 12 Oct 2020, Alex Bennée wrote:
On Wed, Oct 7, 2020 at 12:37 AM Stefano Stabellini stefano.stabellini@xilinx.com wrote:
On Fri, 2 Oct 2020, Arnd Bergmann via Stratos-dev wrote:
My suggestion would be to hack together a pre-shared memory solution, or a hypervisor-mediated solution with Argo, do some benchmarks to understand the degradation, and figure out if the degradation is bad enough that we need to go down the virtio IOMMU route.
Yes, this makes sense. I'll see if I can come up with a basic design for virtio devices based on pre-shared memory in place of the virtqueue, and then we can see if we can prototype the device side in qemu talking to a modified Linux guest. If that works, the next step would be to share the memory with another guest and have that implement the backend instead of qemu.
Just FYI, Xen has mechanisms to pre-share memory areas statically (from VM creation) between VMs. We could pre-share a memory region between domU1 and domU2 and use it for virtio. We would have to come up with a way to mark the memory as "special virtio memory" and let Linux/QEMU know about it. Maybe we could use a reserved-memory binding for it.
If the host allocates that memory, I'd probably just define it as a device specific area (e.g. a PCI BAR) rather than making it part of system RAM and then marking it as reserved.
What I had in mind would be more like the existing virtio though: allocate a contiguous guest physical memory area at device initialization time and register it in place of the normal virtqueue descriptor table.
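In other words, roughly this at probe time (sketch; the register names are made-up stand-ins for whatever the transport ends up defining):

    #include <linux/dma-mapping.h>
    #include <linux/gfp.h>
    #include <linux/io.h>

    #define REG_SHM_ADDR 0x40   /* made-up register offsets */
    #define REG_SHM_LEN  0x48

    /* Sketch: allocate one physically contiguous area at device init time
     * and program its address into the device, the same way the split
     * virtqueue's descriptor/avail/used addresses are programmed today. */
    static void *virtio_shm_setup(struct device *dev, void __iomem *regs,
                                  size_t len, dma_addr_t *dma)
    {
            void *va = dma_alloc_coherent(dev, len, dma, GFP_KERNEL);

            if (!va)
                    return NULL;

            writeq(*dma, regs + REG_SHM_ADDR);  /* where the region lives */
            writel(len, regs + REG_SHM_LEN);    /* and how big it is */
            return va;
    }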
That seems like a reasonable set of steps. So the pre-shared region will be the source of memory for the virtqueues as well as the "direct" buffers the virtqueues reference?
My original thinking was to use the pre-shared region for everything, in a dma_ops swiotlb fashion: the kernel (frontend) would end up picking bounce buffers out of the pre-shared region thanks to a special swiotlb instance made for the purpose. The backend would be told to map the pre-shared region at initialization and only use already-mapped pages from it.
However, I am not sure this is the best way to do it -- you and Arnd might have better ideas on how to integrate the pre-shared region with the rest of the virtio infrastructure.
The swiotlb is also what Jean-Philippe described earlier. The advantage would be that it would be superficially compatible with the virtio specification, but in practice the implementation would remain incompatible with existing guests since this is not how swiotlb works today. It's also more complicated to implement from scratch and less efficient than just having everything in a single FIFO, because the swiotlb code now has to manage allocations in the address space, go through multiple indirections in the dma-mapping code, and touch two memory areas instead of one.
Arnd
On Tue, 13 Oct 2020, Arnd Bergmann wrote:
On Tue, Oct 13, 2020 at 2:53 AM Stefano Stabellini stefano.stabellini@xilinx.com wrote:
On Mon, 12 Oct 2020, Alex Bennée wrote:
On Wed, Oct 7, 2020 at 12:37 AM Stefano Stabellini stefano.stabellini@xilinx.com wrote:
On Fri, 2 Oct 2020, Arnd Bergmann via Stratos-dev wrote:
My suggestion would be to hack together a pre-shared memory solution, or a hypervisor-mediated solution with Argo, do some benchmarks to understand the degradation, and figure out if the degradation is bad enough that we need to go down the virtio IOMMU route.
Yes, this makes sense. I'll see if I can come up with a basic design for virtio devices based on pre-shared memory in place of the virtqueue, and then we can see if we can prototype the device side in qemu talking to a modified Linux guest. If that works, the next step would be to share the memory with another guest and have that implement the backend instead of qemu.
Just FYI, Xen has mechanisms to pre-share memory areas statically (from VM creation) between VMs. We could pre-share a memory region between domU1 and domU2 and use it for virtio. We would have to come up with a way to mark the memory as "special virtio memory" and let Linux/QEMU know about it. Maybe we could use a reserved-memory binding for it.
If the host allocates that memory, I'd probably just define it as a device specific area (e.g. a PCI BAR) rather than making it part of system RAM and then marking it as reserved.
What I had in mind would be more like the existing virtio though: allocate a contiguous guest physical memory area at device initialization time and register it in place of the normal virtqueue descriptor table.
And the memory allocation would come from the guest with the frontends, right? I think it would be best if the memory allocation was done by the domU with the frontends, rather than the domain with the backends, because otherwise we risk the backend domain running low on memory (many frontend domains connect to a single backend domain).
That seems like a reasonable set of steps. So the pre-shared region will be the source of memory for the virtqueues as well as the "direct" buffers the virtqueues reference?
My original thinking was to use the pre-shared region for everything, in a dma_ops swiotlb fashion: the kernel (frontend) would end up picking bounce buffers out of the pre-shared region thanks to a special swiotlb instance made for the purpose. The backend would be told to map the pre-shared region at initialization and only use already-mapped pages from it.
However, I am not sure this is the best way to do it -- you and Arnd might have better ideas on how to integrate the pre-shared region with the rest of the virtio infrastructure.
The swiotlb is also what Jean-Philippe described earlier. The advantage would be that it would be superficially compatible with the virtio specification, but in practice the implementation would remain incompatible with existing guests since this is not how swiotlb works today. It's also more complicated to implement from scratch and less efficient than just having everything in a single FIFO, because the swiotlb code now has to manage allocations in the address space, go through multiple indirections in the dma-mapping code, and touch two memory areas instead of one.
OK
On Tue, Oct 13, 2020 at 8:05 PM Stefano Stabellini stefano.stabellini@xilinx.com wrote:
On Tue, 13 Oct 2020, Arnd Bergmann wrote:
On Tue, Oct 13, 2020 at 2:53 AM Stefano Stabellini stefano.stabellini@xilinx.com wrote:
On Mon, 12 Oct 2020, Alex Bennée wrote:
Just FYI, Xen has mechanisms to pre-share memory areas statically (from VM creation) between VMs. We could pre-share a memory region between domU1 and domU2 and use it for virtio. We would have to come up with a way to mark the memory as "special virtio memory" and let Linux/QEMU know about it. Maybe we could use a reserved-memory binding for it.
If the host allocates that memory, I'd probably just define it as a device specific area (e.g. a PCI BAR) rather than making it part of system RAM and then marking it as reserved.
What I had in mind would be more like the existing virtio though: allocate a contiguous guest physical memory area at device initialization time and register it in place of the normal virtqueue descriptor table.
And the memory allocation would come from the guest with the frontends, right? I think it would be best if the memory allocation was done by the domU with the frontends, rather than the domain with the backends, because otherwise we risk the backend domain running low on memory (many frontend domains connect to a single backend domain).
I'm not sure I was using the same meaning of frontend vs backend here. I thought the use case here would be one front-end with many backends that each access one hardware device.
The allocation needs to be from the guest that runs the virtio driver as normal, and get exported to whatever implements the device behind it, which could be host kernel, host user space or another guest.
Arnd
On Tue, 13 Oct 2020, Arnd Bergmann wrote:
On Tue, Oct 13, 2020 at 8:05 PM Stefano Stabellini stefano.stabellini@xilinx.com wrote:
On Tue, 13 Oct 2020, Arnd Bergmann wrote:
On Tue, Oct 13, 2020 at 2:53 AM Stefano Stabellini stefano.stabellini@xilinx.com wrote:
On Mon, 12 Oct 2020, Alex Bennée wrote:
Just FYI, Xen has mechanisms to pre-share memory areas statically (from VM creation) between VMs. We could pre-share a memory region between domU1 and domU2 and use it for virtio. We would have to come up with a way to mark the memory as "special virtio memory" and let Linux/QEMU know about it. Maybe we could use a reserved-memory binding for it.
If the host allocates that memory, I'd probably just define it as a device specific area (e.g. a PCI BAR) rather than making it part of system RAM and then marking it as reserved.
What I had in mind would be more like the existing virtio though: allocate a contiguous guest physical memory area at device initialization time and register it in place of the normal virtqueue descriptor table.
And the memory allocation would come from the guest with the frontends, right? I think it would be best if the memory allocation was done by the domU with the frontends, rather than the domain with the backends, because otherwise we risk the backend domain running low on memory (many frontend domains connect to a single backend domain).
I'm not sure I was using the same meaning of frontend vs backend here. I thought the use case here would be one front-end with many backends that each access one hardware device.
The allocation needs to be from the guest that runs the virtio driver as normal, and get exported to whatever implements the device behind it, which could be host kernel, host user space or another guest.
Getting the terminology right is always half of the challenge :-)
I have been calling "frontend" the guest running the virtio driver, typically in the kernel, e.g. drivers/net/virtio_net.c.
I have been calling "backend" the domain with the virtio emulator, e.g. QEMU.
So it looks like we are actually saying the same thing.