Thanks Jean-Philippe for describing the problem in an abstract way.
I would like to share considerations related to establishing data paths between a VM (frontend) and a device through another VM (backend).
Sharing model
---------------
The VM in direct contact with the device may provide driver service to just a single other VM. This simplifies inbound traffic handling as there is no marshalling of data to different VMs. Shortcuts can be implemented and performance can be dramatically different. We thus have two models: passthrough with abstraction (to virtio) vs. a shared model.
IO Model
---------------
There are two main ways to orchestrate IO around memory regions for data and rings of descriptors for outbound requests or input data. The key structural difference is how inbound traffic is handled:
- the device is given a list of buffers, pointed to through pre-populated rings [that is almost the rule on PCI adapters, with a handful of exceptions]
- the device is given one or more unstructured memory areas plus rings, and allocates memory as it pleases within the unstructured area(s) [this is very common on Arm platform devices]
The structured buffer method can be understood as computer-driven IO, the other as device-driven IO. Computer-driven IO allows building zero-copy data paths from device to frontend VM (if other conditions are met), while device-driven IO imposes a bounce buffer for incoming traffic, though it may be zero-copy for outbound traffic. Coherent interconnects such as CCIX and CXL do not fundamentally change the IO model but do change the driver architecture.
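To make the distinction concrete, here is a generic sketch of the two inbound models in C; the structure and field names are made up for illustration and do not come from any particular driver.

  #include <stdint.h>

  /* Computer-driven IO: the driver pre-populates each RX descriptor with a
   * buffer it chose, so the device writes inbound data straight into memory
   * owned by the driver (a prerequisite for a zero-copy path to the
   * frontend VM). */
  struct rx_desc_prepopulated {
          uint64_t buf_addr;      /* buffer address chosen by the driver */
          uint32_t buf_len;
          uint32_t status;        /* written back by the device on completion */
  };

  /* Device-driven IO: the driver only hands the device one or more
   * unstructured areas; the device places packets wherever it likes inside
   * them and reports the location in a completion, which typically forces a
   * bounce copy toward the frontend VM. */
  struct rx_area {
          uint64_t base;          /* unstructured region given to the device */
          uint64_t size;
  };

  struct rx_completion {
          uint32_t area_id;       /* which area the device used */
          uint32_t data_len;
          uint64_t data_offset;   /* offset chosen by the device */
  };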
Memory element
---------------
What is the smallest unit of memory that can be shared between VMs? Naturally this is a page, but the page size may not be controllable: RHEL on arm64, for instance, uses 64KB pages. Using a 64KB page to hold a 64-byte network packet or audio sample is not really efficient. Sub-page allocations are possible but would incur a significant performance penalty, as the VMM would have to validate each sub-page memory access, resulting in a VM exit for every check.
Underlay networking
---------------
In cloud infrastructure, VM network traffic is always encapsulated with protocols such as GENEVE, VXLAN... It is also frequent to have a smartNIC that marshals traffic to VMs and host through distinct pseudo devices. In telecom parlance the raw network is called the underlay network. So there may be big differences in achieving the goal depending on the use case (to be defined and assessed):
- cloud native: there is an underlay network
- smartphone virtualization: there may be an underlay network to differentiate between user traffic on different APNs, as well as operator traffic dedicated to the SIM card and other cases
- virtualization in the car: probably no underlay network
The presence of an underlay network is significant for the use case: with proper hashing of inbound traffic (with or without a pseudo device), traffic for a frontend VM can be associated with a dedicated set of queues. This assignment may help work around IO model constraints with device-controlled IO and provide a way to do zero (data) copy from device to frontend VM.
User land IO
---------------
I think the use of userland IO (in the backend and/or the frontend VM) may impact the data path in various ways, so this should be factored into the analysis. A key element of performance: use metadata (prepended or appended to the data) to convey information rather than the ring descriptor, which is in uncached memory (we have concrete examples, and the cost of the different strategies can amount to up to 50% of the base performance).
Virtio
---------------
There are significant performance differences between virtio 1.0 and virtio 1.1: virtio 1.0 touches something like 6 cache lines to insert an element in a queue, while it is only one with virtio 1.1 (not sure about the exact numbers but they should not be far off). As a reference, 6WIND has a VM-to-VM network driver that is not virtio based and can go beyond 100Gbps per x86 vCPU. So I expect virtio 1.1 to reach that level of performance.
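For reference, the descriptor layouts below paraphrase the virtio 1.0 split ring and virtio 1.1 packed ring formats (the Linux names are vring_desc and vring_packed_desc in include/uapi/linux/virtio_ring.h; the names used here are just for the example). With the split ring the producer writes the descriptor table entry, an avail ring slot and the avail index, and later reads the used ring and used index, spread over several cache lines; with the packed ring a single 16-byte descriptor carries address, length, buffer id and the availability flags, so inserting an element touches essentially one cache line.

  #include <stdint.h>

  /* virtio 1.0 split ring: descriptor table entry; the avail and used rings
   * are separate structures, hence the extra cache lines per insertion. */
  struct desc_split {
          uint64_t addr;
          uint32_t len;
          uint16_t flags;
          uint16_t next;          /* chaining within the descriptor table */
  };

  /* virtio 1.1 packed ring: one ring of descriptors shared by driver and
   * device; AVAIL/USED wrap-count bits in flags replace the avail/used rings. */
  struct desc_packed {
          uint64_t addr;
          uint32_t len;
          uint16_t id;            /* buffer id returned on completion */
          uint16_t flags;
  };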
Memory allocation backend
---------------
Shared memory allocation shall be controlled by the VMM (Qemu, Xen...) but the VMM may further delegate it to FF-A. I think there have been efforts in FF-A to find memory with the proper attributes (coherent between the device and all VMs). I frankly have no clue here but this may be worth digging into.
A zero-copy DPDK based DNS on an Arm platform
---------------
This example is meant to prove a point: all aspects of the target use cases must be detailed with great care, otherwise ideas may not fly... (Please don't tell me I forgot this or that this does not fly; it is just to give an idea of the level of detail required.)
Assumptions:
- multiple frontend VMs, with an underlay network based on VLANs
- the network interface can allocate multiple queues for one VLAN and thus have multiple sets of queues for multiple VLANs
- the IO model can be either device or computer driven, it does not matter here
- data memory can be associated to a VLAN/VM
- DPDK based frontend DNS application
Initialization (this is to highlight that it is entirely possible for a simple data path to be built on an impossible initialization scenario: initialization must be an integral part of the description of the solution):
- orchestration spawns the backend VM with the assigned device
- orchestration spawns the frontend VM, which creates a virtio-net with private memory
- orchestration informs the backend VM to assign the traffic of a given VLAN to a frontend VM; the backend VM creates the queues for the VLAN, the virtio-net with the packet memory wrapped around the memory assigned for the queues, and the "bridge" configuration between this VLAN and the virtio
- the backend VM informs orchestration to bind the newly created virtio-net memory to the frontend memory
- orchestration asks the frontend VMM to wrap its virtio-net to the specified area
- the frontend DPDK application listens on UDP port 53 and is bound to the virtio-net device (DPDK apps are bound to devices, not IP addresses)
- IP configuration of the DPDK app is out of scope
Inbound traffic:
- traffic is received in device SRAM
- the packet is marshalled to a memory area associated with a VLAN (i.e. a VM)
- the descriptor is updated to point to this packet
- the backend VM kernel handles the queues (IRQ, busy polling...) and creates virtio descriptors pointing to the data as part of the bridging (stripping the underlay network VLAN tag)
- the frontend DPDK app reads the descriptor and the packet
- if it is DNS at the expected IP, the application handles the packet, otherwise it drops it
Outbound traffic:
- the DPDK DNS application gets a packet buffer from the device memory (i.e. from a memory pool accessible by the device) and populates the response
- the DPDK lib forms the descriptor in virtio
- the backend VM kernel gets the packet, forms a descriptor in the right queue (adding the underlay network VLAN tag) and rings the doorbell
- the hardware gets the descriptor and retrieves the DNS response (which comes directly from frontend memory)
- the hardware copies the packet into SRAM and it is serialized on the wire
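As an illustration of the frontend side of the data path above, here is a minimal DPDK busy-poll sketch. It assumes the application is bound to the virtio-net port as port 0, queue 0, and ignores VLANs, IPv4 options and checksum handling; build_dns_reply() is a hypothetical helper, not a DPDK API.

  #include <netinet/in.h>
  #include <rte_byteorder.h>
  #include <rte_ethdev.h>
  #include <rte_ether.h>
  #include <rte_ip.h>
  #include <rte_mbuf.h>
  #include <rte_udp.h>

  #define BURST 32

  static void build_dns_reply(struct rte_mbuf *m);        /* hypothetical helper */

  static void dns_poll_loop(uint16_t port)
  {
          struct rte_mbuf *pkts[BURST];

          for (;;) {
                  uint16_t n = rte_eth_rx_burst(port, 0, pkts, BURST);

                  for (uint16_t i = 0; i < n; i++) {
                          struct rte_ether_hdr *eth =
                                  rte_pktmbuf_mtod(pkts[i], struct rte_ether_hdr *);
                          struct rte_ipv4_hdr *ip = (struct rte_ipv4_hdr *)(eth + 1);
                          struct rte_udp_hdr *udp = (struct rte_udp_hdr *)(ip + 1);

                          if (eth->ether_type != rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4) ||
                              ip->next_proto_id != IPPROTO_UDP ||
                              udp->dst_port != rte_cpu_to_be_16(53)) {
                                  rte_pktmbuf_free(pkts[i]);      /* not DNS: drop */
                                  continue;
                          }

                          /* Rewrite the mbuf in place into a DNS response; the data
                           * stays in the device-reachable mbuf pool, so TX can stay
                           * zero-copy. */
                          build_dns_reply(pkts[i]);

                          if (rte_eth_tx_burst(port, 0, &pkts[i], 1) == 0)
                                  rte_pktmbuf_free(pkts[i]);      /* TX queue full */
                  }
          }
  }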
On Fri, 2 Oct 2020 at 15:44, Jean-Philippe Brucker via Stratos-dev stratos-dev@op-lists.linaro.org wrote:
Hi,
I've looked in more details at limited memory sharing (STR-6, STR-8, STR-15), mainly from the Linux guest perspective. Here are a few thoughts.
Problem
We have a primary VM running a guest, and a secondary one running a backend that manages one hardware resource (for example network access). They communicate with virtio (for example virtio-net). The guest implements a virtio driver, the backend a virtio device. Problem is, how to ensure that the backend and guest only share the memory required for the virtio communication, and that the backend cannot access any other memory from the guest?
Static shared region
Let's first look at static DMA regions. The hypervisor allocates a subset of memory to be shared between guest and backend. The hypervisor communicates this per-device DMA restriction to the guest during boot. It could be using a firmware property, or a discovery protocol. I did start drafting such a protocol in virtio-iommu, but I now think the reserved-memory mechanism in device-tree, below, is preferable. Would we need an equivalent for ACPI, though?
How would we implement this in a Linux guest? The virtqueue of a virtio device has two components. Static ring buffers, allocated at boot with dma_alloc_coherent(), and the actual data payload, mapped with dma_map_page() and dma_map_single(). Linux calls the former "coherent" DMA, and the latter "streaming" DMA.
Coherent DMA can already obtain its pages from a per-device memory pool. dma_init_coherent_memory() defines a range of physical memory usable by a device. Importantly this region has to be distinct from system memory RAM and reserved by the platform. It is mapped non-cacheable. If it exists, dma_alloc_coherent() will only get its pages from that region.
On the other hand streaming DMA doesn't allocate memory. The virtio drivers don't control where that memory comes from, since the pages are fed to them by an upper layer of the subsystem, specific to the device type (net, block, video, etc). Often they are pages from the slab cache that contain other unrelated objects, a notorious problem for DMA isolation. If the page is not accessible by the device, swiotlb_map() allocates a bounce buffer somewhere more convenient and copies the data when needed.
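To make the two styles concrete, here is a rough sketch using the standard Linux DMA API, not taken from any existing driver: a coherent allocation for the ring, which can come from a per-device pool, and a streaming mapping for a payload handed down by an upper layer, which swiotlb may transparently bounce if the device cannot reach it.

  #include <linux/device.h>
  #include <linux/dma-mapping.h>
  #include <linux/errno.h>
  #include <linux/gfp.h>

  struct my_queue {
          void            *ring;          /* CPU address of the ring */
          dma_addr_t      ring_dma;       /* device-visible address */
  };

  static int setup_queue(struct device *dev, struct my_queue *q, size_t ring_size)
  {
          /* "Coherent" DMA: may be served from a per-device restricted pool. */
          q->ring = dma_alloc_coherent(dev, ring_size, &q->ring_dma, GFP_KERNEL);
          return q->ring ? 0 : -ENOMEM;
  }

  static dma_addr_t map_payload(struct device *dev, void *buf, size_t len)
  {
          /* "Streaming" DMA: the buffer comes from an upper layer (e.g. an skb);
           * swiotlb may substitute a bounce buffer here.  The caller must check
           * the result with dma_mapping_error(). */
          return dma_map_single(dev, buf, len, DMA_TO_DEVICE);
  }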
At the moment the bounce buffer is allocated from a global pool in the low physical pages. However a recent proposal by Chromium would add support for per-device swiotlb pools:
https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromium...
And quoting Tomasz from the discussion on patch 4:
For this, I'd like to propose a "restricted-dma-region" (feel free to suggest a better name) binding, which is explicitly specified to be the only DMA-able memory for this device and make Linux use the given pool for coherent DMA allocations and bouncing non-coherent DMA.
That seems to be precisely what we need. Even when using the virtio-pci transport, it is still possible to define per-device properties in the device-tree, for example:
	/* PCI root complex node */
	pcie@10000000 {
		compatible = "pci-host-ecam-generic";

		/* Add properties to endpoint with BDF 00:01.0 */
		ep@0008 {
			reg = <0x00000800 0 0 0 0>;
			restricted-dma-region = <&dma_region_1>;
		};
	};

	reserved-memory {
		/* Define 64MB reserved region at address 0x50400000 */
		dma_region_1: restricted_dma_region {
			reg = <0x50400000 0x4000000>;
		};
	};
Dynamic regions
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation. But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient. Instead of copying large buffers the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
Since it depends on the device, I guess we'll need a survey of memory access patterns by the different virtio devices that we're considering. In the end a mix of both solutions might be necessary.
Thanks, Jean
[1] https://lists.oasis-open.org/archives/virtio-dev/202006/msg00037.html
On Sat, Oct 3, 2020 at 12:20 PM François Ozog francois.ozog@linaro.org wrote:
IO Model
...
- the device is provided an unstructured memory area (or multiple) and
rings. The device does memory allocation as it pleases on the unstructured area(s) [this is very common on Arm platform devices] The structured buffer method is to be understood as computer driven IO while the other is device driven IO.
I still need to read up on this. Can you point to a device driver that implements this so I can see the implications?
User land IO
I think the use of userland IO (in the backend and/or the fronted VM) may impact the data path in various ways and thus this should be factored in the analysis. Key elements of performance: use metata data (prepend, postpend to data) to get information rather than ring descriptor which is in uncached memory (we have concrete examples and the cost of the different strategies may be up to 50% of the base performance)
Can you clarify why any of the descriptors would be uncached? Do you mean just the descriptors of the physical device in case of a noncoherent bus, or also shared memory between the guests?
Virtio
There are significant performance different between virtio 1.0 and virtio 1.1: virtio 1.0 touches something like 6 cachelines to insert an element in a queue while it is only one with virtio 1.1 (note sure about the numbers but should not be far). as a reference 6WIND has a VM 2 VM network driver that is not virtio based and can go beyond 100Gbps per x86 vCPU. So I expect virtio 1.1 to reach that level of performance.
I tried to get some information about the 6WIND driver but couldn't find it. Do you have a link to their sources, or do you know what they do specifically?
Memory allocation backend
Shared memory allocation shall be controlled by the VMM (Qemu, Xen...) but the VMM may further ask FF-A to do so. I think there has been efforts in the FF-A to find memory with proper attributes (coherent between device and all VMs). I frankly have no clue here but this may be worth digging into.
I would expect any shared memory between guests to just use the default memory attributes: for incoming data you end up copying each packet twice anyway (from the hw driver allocated buffer to the shared memory, and from shared memory to the actual skb); for outbound data over a noncoherent device, the driver needs to take care of proper barriers and cache management.
Just to clarify, the shared memory buffer in this case would be a fixed host-physical buffer, rather than a fixed guest-physical location with page flipping, right?
inbound traffic: traffic is received in device SRAM packet is marshalled to a memory associated to a VLAN (i.e. VM). descriptor is updated to point to this packet backend VM kernel handle queues (IRQ, busy polling...), create virtio descriptors pointing to data as part of the bridging (stripping the underlay network VLAN tag) front end DPDK app read the descriptor and the packet if DNS at expected IP, application handle the packet otherwise drop
Would you expect the guests in this scenario to run simultaneously on different CPUs and send IPIs between them, or would each queue always be on a fixed CPU across both guests?
Arnd
Thanks for your comments Arnd. Some responses below.
On Mon, 5 Oct 2020 at 16:41, Arnd Bergmann arnd@linaro.org wrote:
On Sat, Oct 3, 2020 at 12:20 PM François Ozog francois.ozog@linaro.org wrote:
IO Model
...
- the device is provided an unstructured memory area (or multiple) and
rings. The device does memory allocation as it pleases on the unstructured area(s) [this is very common on Arm platform devices] The structured buffer method is to be understood as computer driven IO while the other is device driven IO.
I still need to read up on this. Can you point to a device driver that implements this so I can see the implications?
For x86 PCI: Chelsio T5+ adapters: the ring is associated with one or more huge pages. Chelsio can run in the two models. I am not sure the upstream driver offers this possibility, you may want to check the one Chelsio provides publicly. https://www.netcope.com/en/products/netcopep4 ( https://github.com/DPDK/dpdk/blob/main/drivers/net/szedata2/rte_eth_szedata2... not checked the code) is also quite interesting to check out. They have multiple DMA models (more than Chelsio) depending on the use case. They have both open source and closed source stuff. I don't know what is currently open.
For Arm: NXP DPAA2: each ring is associated with a fixed set of arrays (n rings with p arrays: a total of n*p arrays), an array for packets up to 64 bytes, up to 128 bytes, up to 256 bytes... The ring is not pre-populated with packet pointers as you don't know the incoming size. The hardware places inbound packets into the array corresponding to the size and updates the ring descriptor accordingly.
User land IO
I think the use of userland IO (in the backend and/or the fronted VM) may impact the data path in various ways and thus this should be factored in the analysis. Key elements of performance: use metata data (prepend, postpend to data) to get information rather than ring descriptor which is in uncached memory (we have concrete examples and the cost of the different strategies may be up to 50% of the base performance)
Can you clarify why any of the descriptors would be uncached? Do you mean just the descriptors of the physical device in case of a noncoherent bus, or also shared memory between the guests?
Two methods I am aware of to know if there is any work to do:
- read an MMIO register
- poll ring descriptors (depends on HW capability)
As far as I have seen so far, the slowest is the MMIO read. On all hardware on which I used busy polling of the ring descriptors, those have been uncached.
Between guests, which is close to host/VM: I am not sure what the virtio policy on ring mapping is, but I assume this is also uncached. I don't know the differences between virtio 1.0 and virtio 1.1 (so-called packed queues).
[Most of my network activities have been in the context of telecom. In that context, one could argue that a link not 100% used is a link that has a problem. So no need for IRQs, except at startup: I always think of busy polling to get ultra low latency and low jitter.]
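A generic sketch of the two approaches, not tied to any particular NIC; the descriptor layout and the DONE bit are made up for the example:

  #include <stdbool.h>
  #include <stdint.h>

  struct desc {
          uint64_t addr;
          uint32_t len;
          uint32_t flags;                 /* bit 0: completed by the device */
  };
  #define DESC_DONE 0x1u

  /* Method 1: read an MMIO status/producer-index register (uncached, and
   * potentially a trap when done from inside a VM). */
  static bool work_pending_mmio(volatile uint32_t *status_reg, uint32_t last_seen)
  {
          return *status_reg != last_seen;
  }

  /* Method 2: busy-poll the next ring descriptor; the device (or the other
   * side of a virtqueue) updates this memory directly. */
  static bool work_pending_poll(volatile struct desc *ring, uint32_t head)
  {
          return ring[head].flags & DESC_DONE;
  }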
Virtio
There are significant performance different between virtio 1.0 and virtio 1.1: virtio 1.0 touches something like 6 cachelines to insert an element in a queue while it is only one with virtio 1.1 (note sure about the numbers but should not be far). as a reference 6WIND has a VM 2 VM network driver that is not virtio based and can go beyond 100Gbps per x86 vCPU. So I expect virtio 1.1 to reach that level of performance.
I tried to get some information about the 6WIND driver but couldn't find it. Do you have a link to their sources, or do you know what they do specifically?
No. Closed source.
Memory allocation backend
Shared memory allocation shall be controlled by the VMM (Qemu, Xen...) but the VMM may further ask FF-A to do so. I think there has been efforts in the FF-A to find memory with proper attributes (coherent between device and all VMs). I frankly have no clue here but this may be worth digging into.
I would expect any shared memory between guests to just use the default memory attributes: For incoming data you end up copying each packet twice anyway (from the hw driver allocated buffer to the shared memory, and from shared memory to the actual skb; for outbound data over a noncoherent device, the driver needs to take care of proper barriers and cache management).
Some use cases may allow zero-copy in one direction only or two directions.
Zero-copy is feasible, but is it desirable? What the actual trustworthiness of the construct is remains to be seen. An skb may or may not be present in the backend if DPDK is used to feed the virtio-net device, and the same may apply on the frontend side. It's not just about sharing memory. Some accelerators have limits on "physical address" and bus visibility (according to member discussion). FF-A is able to identify memory that can be used between normal world entities, between normal and secure world, and between normal world and device. So when you design a full zero-copy data path you may need to control the memory with the help of FF-A.
Just to clarify, the shared memory buffer in this case would be a fixed host-physical buffer, rather than a fixed guest-physical location with page flipping, right?
From our learnings on mediated devices (vfio stuff), I wouldn't play with
physical memory. If MMU and SMMU don't play a key role in orchestrating address spaces, I am not sure what I describe is feasible.
inbound traffic:
traffic is received in device SRAM packet is marshalled to a memory associated to a VLAN (i.e. VM). descriptor is updated to point to this packet backend VM kernel handle queues (IRQ, busy polling...), create virtio descriptors pointing to data as part of the bridging (stripping the underlay network VLAN tag) front end DPDK app read the descriptor and the packet if DNS at expected IP, application handle the packet otherwise drop
Would you expect the guests in this scenario to run simultaneously on different CPUs and send IPIs between them, or would each queue always be on a fixed CPU across both guests?
With DPDK in both VMs I would use vCPU pinning. IPIs are not desired as they cause way too much delay at very high performance, since an IPI also corresponds to a VM exit. In any case, DPDK busy polling makes the IPI unnecessary and results in zero-exit operation.
Arnd
On Tue, Oct 6, 2020 at 12:06 AM François Ozog francois.ozog@linaro.org wrote:
Thanks for you comments Arnd. Some responses below.
On Mon, 5 Oct 2020 at 16:41, Arnd Bergmann arnd@linaro.org wrote:
On Sat, Oct 3, 2020 at 12:20 PM François Ozog francois.ozog@linaro.org
wrote:
IO Model
...
- the device is provided an unstructured memory area (or multiple) and
rings. The device does memory allocation as it pleases on the unstructured area(s) [this is very common on Arm platform devices] The structured buffer method is to be understood as computer driven IO while the other is device driven IO.
I still need to read up on this. Can you point to a device driver that implements this so I can see the implications?
For x86 PCI: Chelsio T5+ adapters: the ring is associated with one or
more huge pages. Chelsio can run on the two models. I am not sure the upstream driver is offering this possibility, you may want to check the Chelsio pubicly provided one.
https://github.com/DPDK/dpdk/blob/main/drivers/net/szedata2/rte_eth_szedata2... not cheked the code) is also quite interesting to checkout. They have multiple DMA models (more than Chelsio) depending on the usecase. They have both open source and closed source stuff. I don't know what is currently open.
For Arm: NXP DPAA2: each ring is associated with a fixed set of arrays (n
rings with p arrays : total of n*p arrays), an array for 64 packets, up 128 packets, up to 256 packets.... The ring is not prepopulated with packets pointers as you don't know the incoming size. The hardware placed inbound packets into the array corresponding to the size and update the ring descriptor accordingly.
Ok, got it now. So these would both be fairly unusual adapters and only used for multi-gigabit networking.
User land IO
I think the use of userland IO (in the backend and/or the fronted VM) may impact the data path in various ways and thus this should be factored in the analysis. Key elements of performance: use metata data (prepend, postpend to data) to get information rather than ring descriptor which is in uncached memory (we have concrete examples and the cost of the different strategies may be up to 50% of the base performance)
Can you clarify why any of the descriptors would be uncached? Do you mean just the descriptors of the physical device in case of a noncoherent bus, or also shared memory between the guests?
two methods I am aware of to know if there are any work to do:
- read an MMIO register
- poll ring descriptors (depends on HW capability)
As far as I have seen so far, the slowest is the MMIO. On all hardware on which I used busy polling of the ring descriptors,
those have been uncached.
Between guests, which is close to host/VM: I am not sure what is virtio
policy on ring mapping, but I assume this is also uncached.
No.
The only reason ring descriptors are uncached on low-end Arm SoCs is because they are on a noncoherent bus and uncached mappings can be used as a workaround to maintain a coherent view of the ring between hardware and the OS. In case of virtio, the rings are always cached because both sides are on the CPUs that are connected over a coherent bus. This might be different if the virtio device implementation is on a remote CPU behind a noncoherent bus, but I don't think we support that case at the moment.
The MMIO registers of the virtio device may appear to be mapped as uncached to the guest OS, but they don't actually point to memory here. Instead, the MMIO registers are implemented by trapping into the hypervisor that then handles the side effects of the access, such as forwarding the buffers to a hardware device.
Don't know the differences between virtio 1.0 and virtio 1.1 (so called
packed queues)
[Most of my network activities have been in the context of telecom. In
that context, one could argue that a link not 100% used is a link that has a problem. So not need for IRQs, except at startup: I always think of busy polling to get ultra low latency and low jitter]
This is a bit different in the kernel: polling on a register is generally seen as a waste of time, so the kernel always tries to batch as much work as it can without introducing latency elsewhere, and then let the CPU do useful work elsewhere while a new batch of work piles up. ;-)
Just to clarify, the shared memory buffer in this case would be a fixed host-physical buffer, rather than a fixed guest-physical location with page flipping, right?
From our learnings on mediated devices (vfio stuff), I wouldn't play with
physical memory. If MMU and SMMU don't play a key role in orchestrating address spaces, I am not sure what I describe is feasible.
So you would lean towards what Jean-Philippe explained regarding the use of vIOMMU in the front-end guest to be able to use an unmodified set of virtio drivers on guest pages, while using the dynamic iommu mappings to control which pages are shared with the back-end, right?
I think this can work in general, but it seems to introduce a ton of complexity in the host, in particular with a type-2 hypervisor like KVM, and I'm not convinced that the overhead of trapping into the hypervisor for each iotlb flush as well as manipulating the page tables of a remote guest from it is lower than just copying the data through the CPU cache.
In the case of KVM, I would guess that the backend guest has to map an object for the queue into its guest-physical address space as a vm_area_struct, possibly coming from a memfd or dmabuf fd that was shared by the front-end guest. Performing the dynamic remapping of that VMA from host user space (qemu, kvmtool, ...) would certainly mean a high runtime overhead, while doing it in the host kernel alone may not be acceptable for mainline kernels because of the added complexity in too many sensitive areas of the kernel (memory management / file systems, kvm, vhost/virtio).
inbound traffic: traffic is received in device SRAM packet is marshalled to a memory associated to a VLAN (i.e. VM). descriptor is updated to point to this packet backend VM kernel handle queues (IRQ, busy polling...), create virtio descriptors pointing to data as part of the bridging (stripping the underlay network VLAN tag) front end DPDK app read the descriptor and the packet if DNS at expected IP, application handle the packet otherwise drop
Would you expect the guests in this scenario to run simultaneously on different CPUs and send IPIs between them, or would each queue always be on a fixed CPU across both guests?
With DPDK in both VMs I would use vCPU pinning. IPIs are not desired as
it cause way to much delay in very high performance as it also corresponds to a VM exit.
In any case DPDK busy polling make the IPI and results in zero-exit
operations.
Wouldn't user space spinning on the state of the queue be a bug in this scenario, if the thing that would add work to the queue is prevented from running because of the busy guest?
Arnd
Thanks for the discussion. More thoughts on vIOMMU below, in case we want to try this solution in addition to the pre-shared one.
On Tue, Oct 06, 2020 at 02:46:02PM +0200, Arnd Bergmann wrote:
From our learnings on mediated devices (vfio stuff), I wouldn't play with
physical memory. If MMU and SMMU don't play a key role in orchestrating address spaces, I am not sure what I describe is feasible.
So you would lean towards what Jean-Philippe explained regarding the use of vIOMMU in the front-end guest to be able to use an unmodified set of virtio drivers on guest pages, while using the dynamic iommu mappings to control which pages are shared with the back-end, right?
I think this can work in general, but it seems to introduce a ton of complexity in the host, in particular with a type-2 hypervisor like KVM, and I'm not convinced that the overhead of trapping into the hypervisor for each iotlb flush as well as manipulating the page tables of a remote guest from it is lower than just copying the data through the CPU cache.
In the case of KVM, I would guess that the backend guest has to map an object for the queue into its guest-physical address space as a vm_area_struct, possibly coming from a memfd or dmabuf fd that was shared by the front-end guest. Performing the dynamic remapping of that VMA from host user space (qemu, kvmtool, ...) would certainly mean a high runtime overhead, while doing it in the host kernel alone may not be acceptable for mainline kernels because of the added complexity in too many sensitive areas of the kernel (memory management / file systems, kvm, vhost/virtio).
With a vIOMMU the frontend wouldn't populate the backend's page tables and send IOTLB flushes. And I don't think the hypervisor could touch the backend's stage-1 page tables either. Neither do grant tables, I believe, though I still have some reading to do there. The frontend would send MAP and UNMAP messages to the host. Roughly:
MAP {
	u32 domain_id;	// context (virtio device)
	u32 flags;	// permissions (read|write)
	u64 iova;	// an arbitrary virtual address
	u64 gpa;	// guest#1 physical address
	u64 size;	// multiple of guest#1 page size
};
The hypervisor receives this and finds the phys page (pa) corresponding to g1pa. Then:
(1) Allocates pages in the backend's guest-physical memory, at g2pa.
(2) Creates the stage-2 mapping g2pa->pa (for example in KVM KVM_SET_USER_MEMORY_REGION?).
(3) Tells the backend about the iova->g2pa mapping for the virtio device.
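As a rough illustration of step (2) from host user space, assuming a KVM-based backend VMM that already has the frontend page mapped at some host virtual address (host_va and the slot number are assumptions of the example):

  #include <linux/kvm.h>
  #include <stdint.h>
  #include <sys/ioctl.h>

  static int map_into_backend(int vm_fd, uint32_t slot,
                              uint64_t g2pa, void *host_va, uint64_t size)
  {
          struct kvm_userspace_memory_region region = {
                  .slot            = slot,
                  .flags           = 0,
                  .guest_phys_addr = g2pa,        /* backend guest-physical address */
                  .memory_size     = size,        /* multiple of the page size */
                  .userspace_addr  = (uint64_t)(uintptr_t)host_va,
          };

          /* Installs (or updates) the stage-2 mapping g2pa -> pa for guest#2. */
          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
  }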
Then frontend builds the virtqueue or publishes buffers, using an IOVA within the range of what has been mapped so far. The backend translates that to g2pa, then to a g2va and accesses the buffer. When done the frontend unmaps the page:
UNMAP {
	u32 domain_id;
	u64 iova;
	u64 size;
};
The host tells the backend and tears down the stage-2 mapping.
I would get rid of steps (1) and (3) by having the frontend directly allocate iova=g2pa. It requires the host to reserve a fixed range of guest#2 physical memory upfront (but not backed by any physical pages at that point), and tell the guest about usable iova ranges (already supported by Linux).
That way there wouldn't be any context switch to the backend during MAP/UNMAP, and the backend implementation is the same as with pre-shared memory.
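A minimal sketch of that iova = g2pa allocation on the frontend side; the window base and size, and the trivial bitmap allocator, are made-up assumptions rather than an existing interface:

  #include <stdint.h>

  #define IOVA_WINDOW_BASE   0x100000000ull      /* reserved guest#2 physical window */
  #define IOVA_WINDOW_PAGES  4096
  #define FE_PAGE_SIZE       4096ull

  static uint8_t iova_used[IOVA_WINDOW_PAGES];   /* one byte per page, for clarity */

  /* Returns an IOVA that is, by construction, also the backend's g2pa. */
  static uint64_t iova_alloc(void)
  {
          for (uint32_t i = 0; i < IOVA_WINDOW_PAGES; i++) {
                  if (!iova_used[i]) {
                          iova_used[i] = 1;
                          return IOVA_WINDOW_BASE + i * FE_PAGE_SIZE;
                  }
          }
          return 0;       /* window exhausted */
  }

  static void iova_free(uint64_t iova)
  {
          iova_used[(iova - IOVA_WINDOW_BASE) / FE_PAGE_SIZE] = 0;
  }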
Thanks, Jean
--- A diagram with the notations I'm using. We have two virt-phys translations pointing to the same physical page. Each guest manages their stage-1 page tables, and the hypervisor manages stage-2. In the above discussion, stage-2 of guest 1 (frontend) is static, while stage-2 of guest 2 (backend) is modified dynamically.
 guest 1        guest 1                      guest 2        guest 2
 stage-1 PT     stage-2 PT     phys mem      stage-2 PT     stage-1 PT

 g1va --------> g1pa --------> pa <--------- g2pa <--------- g2va
On Fri, 9 Oct 2020 at 15:46, Jean-Philippe Brucker jean-philippe@linaro.org wrote:
Thanks for the discussion. More thoughts on vIOMMU below, in case we want to try this solution in addition to the pre-shared one.
On Tue, Oct 06, 2020 at 02:46:02PM +0200, Arnd Bergmann wrote:
From our learnings on mediated devices (vfio stuff), I wouldn't play
with
physical memory. If MMU and SMMU don't play a key role in orchestrating address spaces, I am not sure what I describe is feasible.
So you would lean towards what Jean-Philippe explained regarding the use of vIOMMU in the front-end guest to be able to use an unmodified set of virtio drivers on guest pages, while using the dynamic iommu
mappings
to control which pages are shared with the back-end, right?
I think this can work in general, but it seems to introduce a ton of
complexity
in the host, in particular with a type-2 hypervisor like KVM, and I'm not convinced that the overhead of trapping into the hypervisor for each iotlb flush as well as manipulating the page tables of a remote guest from it is lower than just copying the data through the CPU cache.
In the case of KVM, I would guess that the backend guest has to map an object for the queue into its guest-physical address space as a vm_area_struct, possibly coming from a memfd or dmabuf fd that was shared by the front-end guest. Performing the dynamic remapping of that VMA from host user space (qemu, kvmtool, ...) would certainly mean a high runtime overhead, while doing it in the host kernel alone may not be acceptable for mainline kernels because of the added complexity in too many sensitive areas of the kernel (memory management / file systems, kvm, vhost/virtio).
With a vIOMMU the frontend wouldn't populate the backend's page tables and send IOTLB flushes. And I don't think the hypervisor could touch the backend's stage-1 page tables either. Neither do grant tables, I believe, though I still have some reading to do there. The frontend would send MAP and UNMAP messages to the host. Roughly:
MAP {
	u32 domain_id;	// context (virtio device)
	u32 flags;	// permissions (read|write)
	u64 iova;	// an arbitrary virtual address
	u64 gpa;	// guest#1 physical address
	u64 size;	// multiple of guest#1 page size
};
The hypervisor receives this and finds the phys page (pa) corresponding to g1pa. Then:
(1) Allocates pages in the backend's guest-physical memory, at g2pa.
(2) Creates the stage-2 mapping g2pa->pa (for example in KVM KVM_SET_USER_MEMORY_REGION?).
(3) Tells the backend about the iova->g2pa mapping for the virtio device.
Then frontend builds the virtqueue or publishes buffers, using an IOVA within the range of what has been mapped so far. The backend translates that to g2pa, then to a g2va and accesses the buffer. When done the frontend unmaps the page:
UNMAP {
	u32 domain_id;
	u64 iova;
	u64 size;
};
we bumped into a complication of iommu group membership management if we want to have devices in a single IOVA. I don’t think it is relevant here but may be worth giving a thought just in case.
The host tells the backend and tears down the stage-2 mapping.
I would get rid of steps (1) and (3) by having the frontend directly allocate iova=g2pa. It requires the host to reserve a fixed range of guest#2 physical memory upfront (but not backed by any physical pages at that point), and tell the guest about usable iova ranges (already supported by Linux).
That way there wouldn't be any context switch to the backend during MAP/UNMAP, and the backend implementation is the same as with pre-shared memory.
Thanks, Jean
A diagram with the notations I'm using. We have two virt-phys translations pointing to the same physical page. Each guest manages their stage-1 page tables, and the hypervisor manages stage-2. In the above discussion, stage-2 of guest 1 (frontend) is static, while stage-2 of guest 2 (backend) is modified dynamically.
 guest 1        guest 1                      guest 2        guest 2
 stage-1 PT     stage-2 PT     phys mem      stage-2 PT     stage-1 PT

 g1va --------> g1pa --------> pa <--------- g2pa <--------- g2va
On Fri, Oct 09, 2020 at 04:00:17PM +0200, François Ozog wrote:
we bumped into a complication of iommu group membership management if we want to have devices in a single IOVA. I don’t think it is relevant here but may be worth giving a thought just in case.
Hmm, we might not be using the same terminology here (to me IOVA is "I/O virtual address"). Do you know which system had this problem? Linux puts two devices in the same IOMMU group when it's not possible to isolate them from each other with the IOMMU. For example when they are on a conventional PCI bus where DMA transactions can be snooped by other devices. Or with hardware bugs, for example when the PCIe hierarchy doesn't properly implement ACS isolation, or the IOMMU receives all DMA with RequesterID 0...
For stratos, IOMMU groups would be a problem when assigning a hardware device to a backend VM. But for virtio devices it's not a problem, the emulated bus won't have those isolation issues.
Thanks, Jean
On Fri, Oct 9, 2020 at 3:46 PM Jean-Philippe Brucker jean-philippe@linaro.org wrote:
On Tue, Oct 06, 2020 at 02:46:02PM +0200, Arnd Bergmann wrote:
From our learnings on mediated devices (vfio stuff), I wouldn't play with
physical memory. If MMU and SMMU don't play a key role in orchestrating address spaces, I am not sure what I describe is feasible.
So you would lean towards what Jean-Philippe explained regarding the use of vIOMMU in the front-end guest to be able to use an unmodified set of virtio drivers on guest pages, while using the dynamic iommu mappings to control which pages are shared with the back-end, right?
I think this can work in general, but it seems to introduce a ton of complexity in the host, in particular with a type-2 hypervisor like KVM, and I'm not convinced that the overhead of trapping into the hypervisor for each iotlb flush as well as manipulating the page tables of a remote guest from it is lower than just copying the data through the CPU cache.
In the case of KVM, I would guess that the backend guest has to map an object for the queue into its guest-physical address space as a vm_area_struct, possibly coming from a memfd or dmabuf fd that was shared by the front-end guest. Performing the dynamic remapping of that VMA from host user space (qemu, kvmtool, ...) would certainly mean a high runtime overhead, while doing it in the host kernel alone may not be acceptable for mainline kernels because of the added complexity in too many sensitive areas of the kernel (memory management / file systems, kvm, vhost/virtio).
With a vIOMMU the frontend wouldn't populate the backend's page tables and send IOTLB flushes. And I don't think the hypervisor could touch the backend's stage-1 page tables either. Neither do grant tables, I believe, though I still have some reading to do there. The frontend would send MAP and UNMAP messages to the host. Roughly:
MAP {
	u32 domain_id;	// context (virtio device)
	u32 flags;	// permissions (read|write)
	u64 iova;	// an arbitrary virtual address
	u64 gpa;	// guest#1 physical address
	u64 size;	// multiple of guest#1 page size
};
The hypervisor receives this and finds the phys page (pa) corresponding to g1pa. Then:
(1) Allocates pages in the backend's guest-physical memory, at g2pa.
(2) Creates the stage-2 mapping g2pa->pa (for example in KVM KVM_SET_USER_MEMORY_REGION?).
(3) Tells the backend about the iova->g2pa mapping for the virtio device.
Then frontend builds the virtqueue or publishes buffers, using an IOVA within the range of what has been mapped so far. The backend translates that to g2pa, then to a g2va and accesses the buffer. When done the frontend unmaps the page:
UNMAP {
	u32 domain_id;
	u64 iova;
	u64 size;
};
The host tells the backend and tears down the stage-2 mapping.
This is what I was trying to explain, but your explanation provides some of the details I wasn't sure about and is much clearer.
I did not mean that any changes to the backend's guest page tables would be required, but clearly the host page tables.
I think when you refer to the MAP/UNMAP operations, that's what I meant with the iotlb flushes: whichever operation one does that makes the device forget about a previous IOMMU mapping and use the new one instead. Depending on the architecture that would be a CPU instruction, hypercall, MMIO access or a message passed through the virtio-iommu ring.
I would get rid of steps (1) and (3) by having the frontend directly allocate iova=g2pa. It requires the host to reserve a fixed range of guest#2 physical memory upfront (but not backed by any physical pages at that point), and tell the guest about usable iova ranges (already supported by Linux).
Right, that would be ideal, but adds the complexity that guest 1 has to know about the available g2pa range. There is also a requirement that guest 2 needs to check the address in the ring to ensure they are within the bounds of that address range to prevent a data leak, but that is probably the case for any scenario.
That way there wouldn't be any context switch to the backend during MAP/UNMAP, and the backend implementation is the same as with pre-shared memory.
There are still two (direct queue) or four (indirect queue) map/unmap operations per update to the virtio ring, and each of these lead to an update of the host page tables for the secondary guest, plus the communication needed to trigger that update.
Arnd