Hi,
One of the goals of Project Stratos is to enable hypervisor-agnostic backends so we can enable as much re-use of code as possible and avoid repeating ourselves. This is the flip side of the front end, where multiple front-end implementations are required - one per OS, assuming you don't just want Linux guests. The resultant guests are trivially movable between hypervisors, modulo any abstracted paravirt-type interfaces.
In my original thumbnail sketch of a solution I envisioned vhost-user daemons running in a broadly POSIX-like environment. The interface to the daemon is fairly simple, requiring only some mapped memory and some sort of signalling for events (on Linux this is eventfd). The idea was that a stub binary would be responsible for any hypervisor-specific setup and then launch a common binary to deal with the actual virtqueue requests themselves.
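As a very rough sketch of that stub/common-binary split (assuming a POSIX-like BE environment and the libc crate; the "virtio-backend" binary name and the --kick-fd flag are invented purely for illustration), the stub might look something like this:

use std::process::Command;

fn main() -> std::io::Result<()> {
    // 1. Hypervisor-specific setup would happen here: map the FE memory and
    //    surface whatever notification primitive the hypervisor offers as a
    //    plain file descriptor. An eventfd stands in for that source here.
    let kick_fd = unsafe { libc::eventfd(0, 0) };
    if kick_fd < 0 {
        return Err(std::io::Error::last_os_error());
    }

    // 2. Launch the generic daemon, handing it nothing but the fd to wait on
    //    (it is inherited across exec because EFD_CLOEXEC is not set).
    let status = Command::new("virtio-backend")          // hypothetical common binary
        .arg(format!("--kick-fd={}", kick_fd))           // hypothetical flag
        .status()?;
    std::process::exit(status.code().unwrap_or(1));
}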
Since that original sketch we've seen an expansion in the sorts of ways backends could be created. There is interest in encapsulating backends in RTOSes or unikernels for solutions like SCMI. The interest in Rust has prompted ideas of using the trait interface to abstract differences away, as well as the idea of bare-metal Rust backends.
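To make the trait idea concrete, here is a minimal, purely illustrative sketch of the kind of trait such a backend could be written against - none of these names come from an existing crate or from the Stratos work itself:

use std::io;

/// What the backend needs from "the platform", whoever provides it (a POSIX
/// stub, a Xen-specific shim, a bare-metal HAL in an RTOS/unikernel, ...).
pub trait BackendPlatform {
    /// Make the frontend's virtqueue/buffer memory visible to the backend.
    fn map_frontend_memory(&mut self, guest_paddr: u64, len: usize) -> io::Result<*mut u8>;
    /// Block until the frontend kicks us (eventfd, event channel, IRQ, ...).
    fn wait_for_kick(&mut self) -> io::Result<()>;
    /// Notify the frontend that used buffers are available.
    fn notify_frontend(&mut self) -> io::Result<()>;
}

/// The hypervisor-agnostic part of the backend only ever sees the trait.
pub fn run_backend<P: BackendPlatform>(platform: &mut P) -> io::Result<()> {
    let _vring = platform.map_frontend_memory(0x4000_0000, 0x1000)?; // placeholder numbers
    loop {
        platform.wait_for_kick()?;
        // ... parse the virtqueue and service requests here ...
        platform.notify_frontend()?;
    }
}

A bare-metal Rust backend would implement the same trait over whatever its environment provides, which is the whole point of the abstraction.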
We have a card (STR-12) called "Hypercall Standardisation" which calls for a description of the APIs needed from the hypervisor side to support VirtIO guests and their backends. However we are some way off from that at the moment as I think we need to at least demonstrate one portable backend before we start codifying requirements. To that end I want to think about what we need for a backend to function.
Configuration
=============
In the type-2 setup this is typically fairly simple because the host system can orchestrate the various modules that make up the complete system. In the type-1 case (or even type-2 with delegated service VMs) we need some sort of mechanism to inform the backend VM about key details about the system:
- where virt queue memory is in its address space
- how it's going to receive (interrupt) and trigger (kick) events
- what (if any) resources the backend needs to connect to
Obviously you can elide configuration issues by having static configurations and baking the assumptions into your guest images, however this isn't scalable in the long term. The obvious solution seems to be extending a subset of Device Tree data to user space, but perhaps there are other approaches?
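For illustration, this is roughly the shape of the configuration record a backend would need to obtain, whether baked in statically or parsed from platform data such as a Device Tree node exported to user space. The field names are invented for this sketch; there is no agreed binding behind them:

/// Illustrative only: the sort of configuration a backend VM needs before it
/// can serve a device. There is no defined Device Tree binding behind this.
#[derive(Debug)]
pub struct BackendConfig {
    /// Where the frontend's virtqueue/shared memory appears in the BE's
    /// guest-physical (or pre-mapped) address space.
    pub virtqueue_base: u64,
    pub virtqueue_size: u64,
    /// How the backend is notified of kicks (IRQ line, event channel, ...).
    pub notify_irq: u32,
    /// How the backend signals the frontend back.
    pub kick_handle: u32,
    /// Any resources the device needs (e.g. a block backing file).
    pub resources: Vec<String>,
}

impl BackendConfig {
    /// In a static setup this could be baked in; a more scalable approach is
    /// to populate it from platform data at start-up.
    pub fn from_static() -> Self {
        BackendConfig {
            virtqueue_base: 0x8000_0000, // placeholder values only
            virtqueue_size: 0x1_0000,
            notify_irq: 42,
            kick_handle: 1,
            resources: vec!["/dev/vda-backing.img".to_string()],
        }
    }
}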
Before any virtio transactions can take place the appropriate memory mappings need to be made between the FE guest and the BE guest. Currently the whole of the FE guest's address space needs to be visible to whatever is serving the virtio requests. I can envision 3 approaches:
* BE guest boots with memory already mapped
This would entail the guest OS knowing which parts of its Guest Physical Address space are already taken up and avoiding clashes. I would assume in this case you would want a standard interface to userspace to then make that address space visible to the backend daemon.
* BE guest boots with a hypervisor handle to memory
The BE guest is then free to map the FE's memory to where it wants in the BE's guest physical address space. Activating the mapping will require some sort of hypercall to the hypervisor. I can see two options at this point:
- expose the handle to userspace for a daemon/helper to trigger the mapping via existing hypercall interfaces. If using a helper you would have a hypervisor-specific one to avoid the daemon having to care too much about the details, or push that complexity into a compile-time option for the daemon, which would result in different binaries although a common source base.
- expose a new kernel ABI to abstract the hypercall differences away in the guest kernel. In this case userspace would essentially ask for an abstract "map guest N memory to userspace ptr" and let the kernel deal with the different hypercall interfaces (a rough sketch of what that could look like from the daemon's side follows below). This of course assumes the majority of BE guests would be Linux kernels and leaves the bare-metal/unikernel approaches to their own devices.
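Nothing in the following sketch exists today: /dev/virtio-be and the ioctl request code are invented purely to illustrate the shape such an ABI could take as seen from the daemon (Rust, using the libc crate):

use std::ffi::CString;
use std::io;

// Invented request code for an invented character device; illustration only.
const VIRTIO_BE_SELECT_GUEST: libc::c_ulong = 0xB300;

fn map_guest_memory(guest_id: u32, len: usize) -> io::Result<*mut libc::c_void> {
    let path = CString::new("/dev/virtio-be").unwrap(); // hypothetical device node
    unsafe {
        let fd = libc::open(path.as_ptr(), libc::O_RDWR);
        if fd < 0 {
            return Err(io::Error::last_os_error());
        }
        // Tell the kernel which frontend guest's memory we want; the kernel
        // would translate this into the right Xen/KVM/other hypercall.
        if libc::ioctl(fd, VIRTIO_BE_SELECT_GUEST, guest_id as libc::c_ulong) < 0 {
            let e = io::Error::last_os_error();
            libc::close(fd);
            return Err(e);
        }
        // mmap() then hands userspace a plain pointer into the FE's memory.
        let ptr = libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_SHARED,
            fd,
            0,
        );
        libc::close(fd);
        if ptr == libc::MAP_FAILED {
            return Err(io::Error::last_os_error());
        }
        Ok(ptr)
    }
}

The attraction of this option is that the daemon side stays identical across hypervisors; all the variation lives behind the kernel interface.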
Operation
=========
The core of the operation of VirtIO is fairly simple. Once the vhost-user feature negotiation is done it's a case of receiving update events and parsing the resultant virt queue for data. The vhost-user specification handles a bunch of setup before that point, mostly to detail where the virt queues are and to set up FDs for memory and event communication. This is where the envisioned stub process would be responsible for getting the daemon up and ready to run. This is currently done inside a big VMM like QEMU, but I suspect a modern approach would be to use the rust-vmm vhost crate. It would then either communicate with the kernel's abstracted ABI or be re-targeted as a build option for the various hypervisors.
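A minimal sketch of that steady-state loop, assuming a Linux environment and eventfd-based signalling as vhost-user uses today (requires the libc crate; the virtqueue parsing itself is elided):

use std::io;

fn backend_loop(kick_fd: libc::c_int, call_fd: libc::c_int) -> io::Result<()> {
    loop {
        // Block until the frontend kicks the queue.
        let mut counter: u64 = 0;
        let n = unsafe {
            libc::read(
                kick_fd,
                &mut counter as *mut u64 as *mut libc::c_void,
                std::mem::size_of::<u64>(),
            )
        };
        if n < 0 {
            return Err(io::Error::last_os_error());
        }

        // ... walk the available ring, service requests, fill the used ring ...

        // Tell the frontend that used buffers are ready.
        let one: u64 = 1;
        let n = unsafe {
            libc::write(
                call_fd,
                &one as *const u64 as *const libc::c_void,
                std::mem::size_of::<u64>(),
            )
        };
        if n < 0 {
            return Err(io::Error::last_os_error());
        }
    }
}

The kick_fd/call_fd pair is essentially what vhost-user's SET_VRING_KICK/SET_VRING_CALL messages hand over today; in practice the rust-vmm crates would wrap most of this, but the shape of the loop stays the same whatever ends up behind those descriptors.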
One question is how to best handle notification and kicks. The existing vhost-user framework uses eventfd to signal the daemon (although QEMU is quite capable of simulating them when you use TCG). Xen has its own IOREQ mechanism. However, latency is an important factor and having events go through the stub would add quite a lot.
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
CCing people working on Xen+VirtIO and IOREQs. Not trimming the original email to let them read the full context.
My comments below are related to a potential Xen implementation, not because it is the only implementation that matters, but because it is the one I know best.
Also, please see this relevant email thread: https://marc.info/?l=xen-devel&m=162373754705233&w=2
On Wed, 4 Aug 2021, Alex Bennée wrote:
Hi,
One of the goals of Project Stratos is to enable hypervisor agnostic backends so we can enable as much re-use of code as possible and avoid repeating ourselves. This is the flip side of the front end where multiple front-end implementations are required - one per OS, assuming you don't just want Linux guests. The resultant guests are trivially movable between hypervisors modulo any abstracted paravirt type interfaces.
In my original thumb nail sketch of a solution I envisioned vhost-user daemons running in a broadly POSIX like environment. The interface to the daemon is fairly simple requiring only some mapped memory and some sort of signalling for events (on Linux this is eventfd). The idea was a stub binary would be responsible for any hypervisor specific setup and then launch a common binary to deal with the actual virtqueue requests themselves.
Since that original sketch we've seen an expansion in the sort of ways backends could be created. There is interest in encapsulating backends in RTOSes or unikernels for solutions like SCMI. There interest in Rust has prompted ideas of using the trait interface to abstract differences away as well as the idea of bare-metal Rust backends.
We have a card (STR-12) called "Hypercall Standardisation" which calls for a description of the APIs needed from the hypervisor side to support VirtIO guests and their backends. However we are some way off from that at the moment as I think we need to at least demonstrate one portable backend before we start codifying requirements. To that end I want to think about what we need for a backend to function.
Configuration
In the type-2 setup this is typically fairly simple because the host system can orchestrate the various modules that make up the complete system. In the type-1 case (or even type-2 with delegated service VMs) we need some sort of mechanism to inform the backend VM about key details about the system:
- where virt queue memory is in it's address space
- how it's going to receive (interrupt) and trigger (kick) events
- what (if any) resources the backend needs to connect to
Obviously you can elide over configuration issues by having static configurations and baking the assumptions into your guest images however this isn't scalable in the long term. The obvious solution seems to be extending a subset of Device Tree data to user space but perhaps there are other approaches?
Before any virtio transactions can take place the appropriate memory mappings need to be made between the FE guest and the BE guest.
Currently the whole of the FE guests address space needs to be visible to whatever is serving the virtio requests. I can envision 3 approaches:
- BE guest boots with memory already mapped
This would entail the guest OS knowing where in it's Guest Physical Address space is already taken up and avoiding clashing. I would assume in this case you would want a standard interface to userspace to then make that address space visible to the backend daemon.
- BE guests boots with a hypervisor handle to memory
The BE guest is then free to map the FE's memory to where it wants in the BE's guest physical address space.
I cannot see how this could work for Xen. There is no "handle" to give to the backend if the backend is not running in dom0. So for Xen I think the memory has to be already mapped and the mapping probably done by the toolstack (also see below.) Or we would have to invent a new Xen hypervisor interface and Xen virtual machine privileges to allow this kind of mapping.
If we run the backend in Dom0 then we have no problems, of course.
To activate the mapping will require some sort of hypercall to the hypervisor. I can see two options at this point:
expose the handle to userspace for daemon/helper to trigger the mapping via existing hypercall interfaces. If using a helper you would have a hypervisor specific one to avoid the daemon having to care too much about the details or push that complexity into a compile time option for the daemon which would result in different binaries although a common source base.
expose a new kernel ABI to abstract the hypercall differences away in the guest kernel. In this case the userspace would essentially ask for an abstract "map guest N memory to userspace ptr" and let the kernel deal with the different hypercall interfaces. This of course assumes the majority of BE guests would be Linux kernels and leaves the bare-metal/unikernel approaches to their own devices.
Operation
The core of the operation of VirtIO is fairly simple. Once the vhost-user feature negotiation is done it's a case of receiving update events and parsing the resultant virt queue for data. The vhost-user specification handles a bunch of setup before that point, mostly to detail where the virt queues are set up FD's for memory and event communication. This is where the envisioned stub process would be responsible for getting the daemon up and ready to run. This is currently done inside a big VMM like QEMU but I suspect a modern approach would be to use the rust-vmm vhost crate. It would then either communicate with the kernel's abstracted ABI or be re-targeted as a build option for the various hypervisors.
One thing I mentioned before to Alex is that Xen doesn't have VMMs the way they are typically envisioned and described in other environments. Instead, Xen has IOREQ servers. Each of them connects independently to Xen via the IOREQ interface. E.g. today multiple QEMUs could be used as emulators for a single Xen VM, each of them connecting to Xen independently via the IOREQ interface.
The component responsible for starting a daemon and/or setting up shared interfaces is the toolstack: the xl command and the libxl/libxc libraries.
Oleksandr and others I CCed have been working on ways for the toolstack to create virtio backends and setup memory mappings. They might be able to provide more info on the subject. I do think we miss a way to provide the configuration to the backend and anything else that the backend might require to start doing its job.
One question is how to best handle notification and kicks. The existing vhost-user framework uses eventfd to signal the daemon (although QEMU is quite capable of simulating them when you use TCG). Xen has it's own IOREQ mechanism. However latency is an important factor and having events go through the stub would add quite a lot.
Yeah I think, regardless of anything else, we want the backends to connect directly to the Xen hypervisor.
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add. The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See: https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ior...
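For reference, what a single IOREQ describes is roughly the following. This is a simplified, illustration-only view, not a binary-compatible mirror of struct ioreq in the header linked above; bitfields are flattened and padding is omitted:

/// Illustration only; see ioreq.h (linked above) for the authoritative layout.
pub struct IoRequest {
    pub addr: u64,        // guest physical address being accessed
    pub data: u64,        // data value, or guest address of the data
    pub count: u32,       // repeat count (rep prefixes)
    pub size: u32,        // access size in bytes
    pub vp_eport: u32,    // event channel used for notification (Xen-specific)
    pub state: u8,        // request lifecycle state
    pub data_is_ptr: bool,
    pub dir_is_read: bool,
    pub io_type: u8,      // MMIO, port I/O, ...
}

The vp_eport field is the Xen-specific notification hook referred to above; the rest is just "an access of this size, in this direction, at this address".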
I don't think that translating IOREQs to eventfd in the kernel is a good idea: it feels like it would be extra complexity and that the kernel shouldn't be involved, as this is a backend-hypervisor interface. Also, eventfd is very Linux-centric and we are trying to design an interface that could work well for RTOSes too. If we want to do something different, both OS-agnostic and hypervisor-agnostic, perhaps we could design a new interface. One that could be implementable in the Xen hypervisor itself (like IOREQ) and of course any other hypervisor too.
There is also another problem. IOREQ is probably not the only interface needed. Have a look at https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need an interface for the backend to inject interrupts into the frontend? And if the backend requires dynamic memory mappings of frontend pages, then we would also need an interface to map/unmap domU pages.
These interfaces are a lot more problematic than IOREQ: IOREQ is tiny and self-contained. It is easy to add anywhere. A new interface to inject interrupts or map pages is more difficult to manage because it would require changes scattered across the various emulators.
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
CCing people working on Xen+VirtIO and IOREQs. Not trimming the original email to let them read the full context.
My comments below are related to a potential Xen implementation, not because it is the only implementation that matters, but because it is the one I know best.
Please note that my proposal (and hence the working prototype)[1] is based on Xen's virtio implementation (i.e. IOREQ) and particularly EPAM's virtio-disk application (backend server). It has been, I believe, well generalized but is still a bit biased toward this original design.
So I hope you like my approach :)
[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000546.html
Let me take this opportunity to explain a bit more about my approach below.
Also, please see this relevant email thread: https://marc.info/?l=xen-devel&m=162373754705233&w=2
On Wed, 4 Aug 2021, Alex Bennée wrote:
Hi,
One of the goals of Project Stratos is to enable hypervisor agnostic backends so we can enable as much re-use of code as possible and avoid repeating ourselves. This is the flip side of the front end where multiple front-end implementations are required - one per OS, assuming you don't just want Linux guests. The resultant guests are trivially movable between hypervisors modulo any abstracted paravirt type interfaces.
In my original thumb nail sketch of a solution I envisioned vhost-user daemons running in a broadly POSIX like environment. The interface to the daemon is fairly simple requiring only some mapped memory and some sort of signalling for events (on Linux this is eventfd). The idea was a stub binary would be responsible for any hypervisor specific setup and then launch a common binary to deal with the actual virtqueue requests themselves.
Since that original sketch we've seen an expansion in the sort of ways backends could be created. There is interest in encapsulating backends in RTOSes or unikernels for solutions like SCMI. There interest in Rust has prompted ideas of using the trait interface to abstract differences away as well as the idea of bare-metal Rust backends.
We have a card (STR-12) called "Hypercall Standardisation" which calls for a description of the APIs needed from the hypervisor side to support VirtIO guests and their backends. However we are some way off from that at the moment as I think we need to at least demonstrate one portable backend before we start codifying requirements. To that end I want to think about what we need for a backend to function.
Configuration
In the type-2 setup this is typically fairly simple because the host system can orchestrate the various modules that make up the complete system. In the type-1 case (or even type-2 with delegated service VMs) we need some sort of mechanism to inform the backend VM about key details about the system:
- where virt queue memory is in it's address space
- how it's going to receive (interrupt) and trigger (kick) events
- what (if any) resources the backend needs to connect to
Obviously you can elide over configuration issues by having static configurations and baking the assumptions into your guest images however this isn't scalable in the long term. The obvious solution seems to be extending a subset of Device Tree data to user space but perhaps there are other approaches?
Before any virtio transactions can take place the appropriate memory mappings need to be made between the FE guest and the BE guest.
Currently the whole of the FE guests address space needs to be visible to whatever is serving the virtio requests. I can envision 3 approaches:
- BE guest boots with memory already mapped
This would entail the guest OS knowing where in it's Guest Physical Address space is already taken up and avoiding clashing. I would assume in this case you would want a standard interface to userspace to then make that address space visible to the backend daemon.
Yet another way here is that we would have well-known "shared memory" between VMs. I think that Jailhouse's ivshmem gives us good insights on this matter and that it can even be an alternative for a hypervisor-agnostic solution.
(Please note memory regions in ivshmem appear as a PCI device and can be mapped locally.)
I want to add this shared memory aspect to my virtio-proxy, but the resultant solution would eventually look similar to ivshmem.
- BE guests boots with a hypervisor handle to memory
The BE guest is then free to map the FE's memory to where it wants in the BE's guest physical address space.
I cannot see how this could work for Xen. There is no "handle" to give to the backend if the backend is not running in dom0. So for Xen I think the memory has to be already mapped
In Xen's IOREQ solution (virtio-blk), the following information is expected to be exposed to the BE via Xenstore (I know that this is a tentative approach though):
- the start address of configuration space
- interrupt number
- file path for backing storage
- read-only flag
And the BE server has to call a particular hypervisor interface to map the configuration space.
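As a sketch only, a backend might hold something like the following once it has read those entries. The key names used in the parser are invented for illustration; the real Xenstore layout is whatever the toolstack defines:

use std::collections::HashMap;

#[derive(Debug)]
struct VirtioDiskBackendInfo {
    config_space_addr: u64,
    irq: u32,
    backing_path: String,
    read_only: bool,
}

/// Build the backend info from key/value pairs (the actual transport being
/// Xenstore); the key names here are purely illustrative.
fn from_keyvals(kv: &HashMap<String, String>) -> Option<VirtioDiskBackendInfo> {
    Some(VirtioDiskBackendInfo {
        config_space_addr: u64::from_str_radix(
            kv.get("base")?.trim_start_matches("0x"),
            16,
        )
        .ok()?,
        irq: kv.get("irq")?.parse().ok()?,
        backing_path: kv.get("params")?.clone(),
        read_only: kv.get("ro")?.as_str() == "1",
    })
}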
In my approach (virtio-proxy), all that Xen-specific (or hypervisor-specific) stuff is contained in virtio-proxy, yet another VM, to hide all the details.
# My point is that a "handle" is not mandatory for performing the mapping.
and the mapping probably done by the toolstack (also see below.) Or we would have to invent a new Xen hypervisor interface and Xen virtual machine privileges to allow this kind of mapping.
If we run the backend in Dom0 that we have no problems of course.
One of the difficulties on Xen that I found in my approach is that calling such hypervisor interfaces (registering IOREQ, mapping memory) is only allowed on BE servers themselves and so we will have to extend those interfaces. This, however, will raise some concerns about security and privilege distribution, as Stefan suggested.
To activate the mapping will require some sort of hypercall to the hypervisor. I can see two options at this point:
expose the handle to userspace for daemon/helper to trigger the mapping via existing hypercall interfaces. If using a helper you would have a hypervisor specific one to avoid the daemon having to care too much about the details or push that complexity into a compile time option for the daemon which would result in different binaries although a common source base.
expose a new kernel ABI to abstract the hypercall differences away in the guest kernel. In this case the userspace would essentially ask for an abstract "map guest N memory to userspace ptr" and let the kernel deal with the different hypercall interfaces. This of course assumes the majority of BE guests would be Linux kernels and leaves the bare-metal/unikernel approaches to their own devices.
Operation
The core of the operation of VirtIO is fairly simple. Once the vhost-user feature negotiation is done it's a case of receiving update events and parsing the resultant virt queue for data. The vhost-user specification handles a bunch of setup before that point, mostly to detail where the virt queues are set up FD's for memory and event communication. This is where the envisioned stub process would be responsible for getting the daemon up and ready to run. This is currently done inside a big VMM like QEMU but I suspect a modern approach would be to use the rust-vmm vhost crate. It would then either communicate with the kernel's abstracted ABI or be re-targeted as a build option for the various hypervisors.
One thing I mentioned before to Alex is that Xen doesn't have VMMs the way they are typically envisioned and described in other environments. Instead, Xen has IOREQ servers. Each of them connects independently to Xen via the IOREQ interface. E.g. today multiple QEMUs could be used as emulators for a single Xen VM, each of them connecting to Xen independently via the IOREQ interface.
The component responsible for starting a daemon and/or setting up shared interfaces is the toolstack: the xl command and the libxl/libxc libraries.
I think that VM configuration management (or orchestration in Stratos jargon?) is a subject to be debated in parallel. Otherwise, is there any good assumption to avoid it right now?
Oleksandr and others I CCed have been working on ways for the toolstack to create virtio backends and setup memory mappings. They might be able to provide more info on the subject. I do think we miss a way to provide the configuration to the backend and anything else that the backend might require to start doing its job.
One question is how to best handle notification and kicks. The existing vhost-user framework uses eventfd to signal the daemon (although QEMU is quite capable of simulating them when you use TCG). Xen has it's own IOREQ mechanism. However latency is an important factor and having events go through the stub would add quite a lot.
Yeah I think, regardless of anything else, we want the backends to connect directly to the Xen hypervisor.
In my approach,
a) BE -> FE: interrupts are triggered by the BE calling a hypervisor interface via virtio-proxy
b) FE -> BE: MMIO to the config space raises events (in event channels), which are converted into a callback to the BE via virtio-proxy (Xen's event channel is internally implemented by interrupts.)
I don't know what "connect directly" means here, but sending interrupts to the opposite side would be best efficient. Ivshmem, I suppose, takes this approach by utilizing PCI's msi-x mechanism.
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add.
As I said above, my proposal does the same thing that you mentioned here :) The difference is that I do call hypervisor interfaces via virtio-proxy.
The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See: https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ior...
I don't think that translating IOREQs to eventfd in the kernel is a good idea: if feels like it would be extra complexity and that the kernel shouldn't be involved as this is a backend-hypervisor interface.
Given that we may want to implement the BE as a bare-metal application, as I did on Zephyr, I don't think that the translation would be a big issue, especially on RTOSes. It will be some kind of abstraction layer for interrupt handling (or nothing but a callback mechanism).
Also, eventfd is very Linux-centric and we are trying to design an interface that could work well for RTOSes too. If we want to do something different, both OS-agnostic and hypervisor-agnostic, perhaps we could design a new interface. One that could be implementable in the Xen hypervisor itself (like IOREQ) and of course any other hypervisor too.
There is also another problem. IOREQ is probably not be the only interface needed. Have a look at https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need an interface for the backend to inject interrupts into the frontend? And if the backend requires dynamic memory mappings of frontend pages, then we would also need an interface to map/unmap domU pages.
My proposal document might help here; all the interfaces required for virtio-proxy (or hypervisor-related interfaces) are listed as RPC protocols :)
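Purely to illustrate the kind of operation set such an RPC interface has to cover - this is not the actual protocol from the proposal document; the message names and fields below are invented here:

/// Requests a backend would send towards the proxy/hypervisor side.
#[derive(Debug)]
pub enum ProxyRequest {
    /// Map a region of the frontend guest's memory for the backend.
    MapGuestMemory { fe_id: u32, guest_paddr: u64, len: u64 },
    UnmapGuestMemory { fe_id: u32, guest_paddr: u64, len: u64 },
    /// Ask the hypervisor (via the proxy) to inject an interrupt into the FE.
    InjectInterrupt { fe_id: u32, irq: u32 },
    /// Register interest in MMIO accesses to the device's config space.
    RegisterIoRegion { fe_id: u32, base: u64, len: u64 },
}

/// Events delivered back to the backend.
#[derive(Debug)]
pub enum ProxyEvent {
    /// The FE accessed a registered region (the IOREQ-like case).
    IoAccess { addr: u64, is_write: bool, size: u8, data: u64 },
    /// A queue kick forwarded from the FE.
    Kick { queue_index: u16 },
}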
These interfaces are a lot more problematic than IOREQ: IOREQ is tiny and self-contained. It is easy to add anywhere. A new interface to inject interrupts or map pages is more difficult to manage because it would require changes scattered across the various emulators.
Exactly. I am not confident yet that my approach will also apply to hypervisors other than Xen. Technically, yes, but whether people can accept it or not is a different matter.
Thanks, -Takahiro Akashi
Hello, all.
Please see some comments below. And sorry for the possible format issues.
On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro takahiro.akashi@linaro.org wrote:
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
CCing people working on Xen+VirtIO and IOREQs. Not trimming the original email to let them read the full context.
My comments below are related to a potential Xen implementation, not because it is the only implementation that matters, but because it is the one I know best.
Please note that my proposal (and hence the working prototype)[1] is based on Xen's virtio implementation (i.e. IOREQ) and particularly EPAM's virtio-disk application (backend server). It has been, I believe, well generalized but is still a bit biased toward this original design.
So I hope you like my approach :)
[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000546.html
Let me take this opportunity to explain a bit more about my approach below.
Also, please see this relevant email thread: https://marc.info/?l=xen-devel&m=162373754705233&w=2
On Wed, 4 Aug 2021, Alex Bennée wrote:
Hi,
One of the goals of Project Stratos is to enable hypervisor agnostic backends so we can enable as much re-use of code as possible and avoid repeating ourselves. This is the flip side of the front end where multiple front-end implementations are required - one per OS, assuming you don't just want Linux guests. The resultant guests are trivially movable between hypervisors modulo any abstracted paravirt type interfaces.
In my original thumb nail sketch of a solution I envisioned vhost-user daemons running in a broadly POSIX like environment. The interface to the daemon is fairly simple requiring only some mapped memory and some sort of signalling for events (on Linux this is eventfd). The idea was a stub binary would be responsible for any hypervisor specific setup and then launch a common binary to deal with the actual virtqueue requests themselves.
Since that original sketch we've seen an expansion in the sort of ways backends could be created. There is interest in encapsulating backends in RTOSes or unikernels for solutions like SCMI. There interest in Rust has prompted ideas of using the trait interface to abstract differences away as well as the idea of bare-metal Rust backends.
We have a card (STR-12) called "Hypercall Standardisation" which calls for a description of the APIs needed from the hypervisor side to support VirtIO guests and their backends. However we are some way off from that at the moment as I think we need to at least demonstrate one portable backend before we start codifying requirements. To that end I want to think about what we need for a backend to function.
Configuration
In the type-2 setup this is typically fairly simple because the host system can orchestrate the various modules that make up the complete system. In the type-1 case (or even type-2 with delegated service VMs) we need some sort of mechanism to inform the backend VM about key details about the system:
- where virt queue memory is in it's address space
- how it's going to receive (interrupt) and trigger (kick) events
- what (if any) resources the backend needs to connect to
Obviously you can elide over configuration issues by having static configurations and baking the assumptions into your guest images however this isn't scalable in the long term. The obvious solution seems to be extending a subset of Device Tree data to user space but perhaps there are other approaches?
Before any virtio transactions can take place the appropriate memory mappings need to be made between the FE guest and the BE guest.
Currently the whole of the FE guests address space needs to be visible to whatever is serving the virtio requests. I can envision 3 approaches:
- BE guest boots with memory already mapped
This would entail the guest OS knowing where in it's Guest Physical Address space is already taken up and avoiding clashing. I would assume in this case you would want a standard interface to userspace to then make that address space visible to the backend daemon.
Yet another way here is that we would have well known "shared memory" between VMs. I think that Jailhouse's ivshmem gives us good insights on this matter and that it can even be an alternative for hypervisor-agnostic solution.
(Please note memory regions in ivshmem appear as a PCI device and can be mapped locally.)
I want to add this shared memory aspect to my virtio-proxy, but the resultant solution would eventually look similar to ivshmem.
- BE guests boots with a hypervisor handle to memory
The BE guest is then free to map the FE's memory to where it wants in the BE's guest physical address space.
I cannot see how this could work for Xen. There is no "handle" to give to the backend if the backend is not running in dom0. So for Xen I think the memory has to be already mapped
In Xen's IOREQ solution (virtio-blk), the following information is expected to be exposed to BE via Xenstore: (I know that this is a tentative approach though.)
- the start address of configuration space
- interrupt number
- file path for backing storage
- read-only flag
And the BE server have to call a particular hypervisor interface to map the configuration space.
Yes, Xenstore was chosen as a simple way to pass configuration info to the backend running in a non-toolstack domain. I remember there was a wish to avoid using Xenstore in the Virtio backend itself if possible, so for a non-toolstack domain this could be done by adjusting devd (the daemon that listens for devices and launches backends) to read the backend configuration from Xenstore anyway and pass it to the backend via command line arguments.
But, if ...
In my approach (virtio-proxy), all those Xen (or hypervisor)-specific stuffs are contained in virtio-proxy, yet another VM, to hide all details.
... the solution for how to overcome that has already been found and proven to work, then even better.
# My point is that a "handle" is not mandatory for executing mapping.
and the mapping probably done by the toolstack (also see below.) Or we would have to invent a new Xen hypervisor interface and Xen virtual machine privileges to allow this kind of mapping.
If we run the backend in Dom0 that we have no problems of course.
One of difficulties on Xen that I found in my approach is that calling such hypervisor intefaces (registering IOREQ, mapping memory) is only allowed on BE servers themselvies and so we will have to extend those interfaces. This, however, will raise some concern on security and privilege distribution as Stefan suggested.
We also faced policy-related issues with the Virtio backend running in a domain other than Dom0 in the "dummy" XSM mode. In our target system we run the backend in a driver domain (we call it DomD) where the underlying H/W resides. We trust it, so we wrote policy rules (to be used in the "flask" XSM mode) to provide it with a little more privilege than a simple DomU has. Now it is permitted to issue device-model, resource and memory mapping calls, etc.
To activate the mapping will require some sort of hypercall to the hypervisor. I can see two options at this point:
expose the handle to userspace for daemon/helper to trigger the mapping via existing hypercall interfaces. If using a helper you would have a hypervisor specific one to avoid the daemon having to care too much about the details or push that complexity into a compile time option for the daemon which would result in different binaries although a common source base.
expose a new kernel ABI to abstract the hypercall differences away in the guest kernel. In this case the userspace would essentially ask for an abstract "map guest N memory to userspace ptr" and let the kernel deal with the different hypercall interfaces. This of course assumes the majority of BE guests would be Linux kernels and leaves the bare-metal/unikernel approaches to their own devices.
Operation
The core of the operation of VirtIO is fairly simple. Once the vhost-user feature negotiation is done it's a case of receiving update events and parsing the resultant virt queue for data. The vhost-user specification handles a bunch of setup before that point, mostly to detail where the virt queues are set up FD's for memory and event communication. This is where the envisioned stub process would be responsible for getting the daemon up and ready to run. This is currently done inside a big VMM like QEMU but I suspect a modern approach would be to use the rust-vmm vhost crate. It would then either communicate with the kernel's abstracted ABI or be re-targeted as a build option for the various hypervisors.
One thing I mentioned before to Alex is that Xen doesn't have VMMs the way they are typically envisioned and described in other environments. Instead, Xen has IOREQ servers. Each of them connects independently to Xen via the IOREQ interface. E.g. today multiple QEMUs could be used as emulators for a single Xen VM, each of them connecting to Xen independently via the IOREQ interface.
The component responsible for starting a daemon and/or setting up shared interfaces is the toolstack: the xl command and the libxl/libxc libraries.
I think that VM configuration management (or orchestration in Startos jargon?) is a subject to debate in parallel. Otherwise, is there any good assumption to avoid it right now?
Oleksandr and others I CCed have been working on ways for the toolstack to create virtio backends and setup memory mappings. They might be able to provide more info on the subject. I do think we miss a way to provide the configuration to the backend and anything else that the backend might require to start doing its job.
Yes, some work has been done for the toolstack to handle Virtio MMIO devices in general and Virtio block devices in particular. However, it has not been upstreamed yet. Updated patches are on review now: https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-email-olekstys...
There is an additional (also important) activity to improve/fix foreign memory mapping on Arm which I am also involved in. The foreign memory mapping is proposed to be used for Virtio backends (device emulators) if there is a need to run the guest OS completely unmodified. Of course, the more secure way would be to use grant memory mapping. Briefly, the main difference between them is that with foreign mapping the backend can map any guest memory it wants to map, but with grant mapping it is allowed to map only what was previously granted by the frontend.
So, there might be a problem if we want to pre-map some guest memory in advance or to cache mappings in the backend in order to improve performance (because mapping/unmapping guest pages on every request requires a lot of back and forth to Xen + P2M updates). In a nutshell, currently, in order to map a guest page into the backend address space we need to steal a real physical page from the backend domain. So, with the said optimizations we might end up with no free memory in the backend domain (see XSA-300). What we are trying to achieve is to not waste real domain memory at all, by providing safe not-yet-allocated (so unused) address space for the foreign (and grant) pages to be mapped into; this enabling work implies Xen and Linux (and likely DTB binding) changes. However, as it turned out, for this to work in a proper and safe way some prerequisite work needs to be done. You can find the related Xen discussion at: https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-olekstys...
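As a toy illustration of the trade-off described above: caching mappings avoids the per-request hypercall and P2M update, but every cached page keeps consuming backend address space/pages, which is exactly the exhaustion concern. map_foreign_page below is a stand-in for whatever foreign/grant mapping primitive the backend uses, not a real Xen library call:

use std::collections::HashMap;

struct MappingCache {
    fe_domid: u16,
    // guest frame number -> pointer to the local mapping
    pages: HashMap<u64, *mut u8>,
}

impl MappingCache {
    fn get(&mut self, gfn: u64) -> std::io::Result<*mut u8> {
        if let Some(&p) = self.pages.get(&gfn) {
            // Hit: no hypercall, no P2M update, but the mapping keeps
            // occupying backend address space/pages (the XSA-300 concern).
            return Ok(p);
        }
        // Miss: take the expensive hypercall path and remember the result.
        let p = map_foreign_page(self.fe_domid, gfn)?;
        self.pages.insert(gfn, p);
        Ok(p)
    }
}

// Stand-in for the real mapping primitive (foreign- or grant-based).
fn map_foreign_page(_domid: u16, _gfn: u64) -> std::io::Result<*mut u8> {
    unimplemented!("illustrative placeholder only")
}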
One question is how to best handle notification and kicks. The existing vhost-user framework uses eventfd to signal the daemon (although QEMU is quite capable of simulating them when you use TCG). Xen has it's own IOREQ mechanism. However latency is an important factor and having events go through the stub would add quite a lot.
Yeah I think, regardless of anything else, we want the backends to connect directly to the Xen hypervisor.
In my approach,
a) BE -> FE: interrupts are triggered by the BE calling a hypervisor interface via virtio-proxy
b) FE -> BE: MMIO to the config space raises events (in event channels), which are converted into a callback to the BE via virtio-proxy (Xen's event channel is internally implemented by interrupts.)
I don't know what "connect directly" means here, but sending interrupts to the opposite side would be best efficient. Ivshmem, I suppose, takes this approach by utilizing PCI's msi-x mechanism.
Agree that MSI would be more efficient than SPI... At the moment, in order to notify the frontend, the backend issues a specific device-model call to ask Xen to inject a corresponding SPI into the guest.
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add.
As I said above, my proposal does the same thing that you mentioned here :) The difference is that I do call hypervisor interfaces via virtio-proxy.
The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See:
https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ior...
I don't think that translating IOREQs to eventfd in the kernel is a good idea: if feels like it would be extra complexity and that the kernel shouldn't be involved as this is a backend-hypervisor interface.
Given that we may want to implement BE as a bare-metal application as I did on Zephyr, I don't think that the translation would not be a big issue, especially on RTOS's. It will be some kind of abstraction layer of interrupt handling (or nothing but a callback mechanism).
Also, eventfd is very Linux-centric and we are trying to design an interface that could work well for RTOSes too. If we want to do something different, both OS-agnostic and hypervisor-agnostic, perhaps we could design a new interface. One that could be implementable in the Xen hypervisor itself (like IOREQ) and of course any other hypervisor too.
There is also another problem. IOREQ is probably not be the only interface needed. Have a look at https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need an interface for the backend to inject interrupts into the frontend? And if the backend requires dynamic memory mappings of frontend pages, then we would also need an interface to map/unmap domU pages.
My proposal document might help here; All the interfaces required for virtio-proxy (or hypervisor-related interfaces) are listed as RPC protocols :)
These interfaces are a lot more problematic than IOREQ: IOREQ is tiny and self-contained. It is easy to add anywhere. A new interface to inject interrupts or map pages is more difficult to manage because it would require changes scattered across the various emulators.
Exactly. I have no confident yet that my approach will also apply to other hypervisors than Xen. Technically, yes, but whether people can accept it or not is a different matter.
Thanks, -Takahiro Akashi
Hi All,
Thanks to Stefano for linking my kvmtool for Xen proposal here. This proposal is still being discussed in the Xen and KVM communities. The main work is to decouple kvmtool from KVM so that other hypervisors can reuse the virtual device implementations.
In this case, we need to introduce an intermediate hypervisor layer for VMM abstraction, which is, I think, very close to Stratos' virtio hypervisor agnosticism work.
From: Oleksandr Tyshchenko olekstysh@gmail.com
Sent: 14 August 2021 23:38
To: AKASHI Takahiro takahiro.akashi@linaro.org; Stefano Stabellini sstabellini@kernel.org
Cc: Alex Bennée alex.bennee@linaro.org; Stratos Mailing List stratos-dev@op-lists.linaro.org; virtio-dev@lists.oasis-open.org; Arnd Bergmann arnd.bergmann@linaro.org; Viresh Kumar viresh.kumar@linaro.org; Stefano Stabellini stefano.stabellini@xilinx.com; stefanha@redhat.com; Jan Kiszka jan.kiszka@siemens.com; Carl van Schaik cvanscha@qti.qualcomm.com; pratikp@quicinc.com; Srivatsa Vaddagiri vatsa@codeaurora.org; Jean-Philippe Brucker jean-philippe@linaro.org; Mathieu Poirier mathieu.poirier@linaro.org; Wei Chen Wei.Chen@arm.com; Oleksandr Tyshchenko Oleksandr_Tyshchenko@epam.com; Bertrand Marquis Bertrand.Marquis@arm.com; Artem Mygaiev Artem_Mygaiev@epam.com; Julien Grall julien@xen.org; Juergen Gross jgross@suse.com; Paul Durrant paul@xen.org; Xen Devel xen-devel@lists.xen.org
Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
Hello, all.
Please see some comments below. And sorry for the possible format issues.
On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro mailto:takahiro.akashi@linaro.org wrote: On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
CCing people working on Xen+VirtIO and IOREQs. Not trimming the original email to let them read the full context.
My comments below are related to a potential Xen implementation, not because it is the only implementation that matters, but because it is the one I know best.
Please note that my proposal (and hence the working prototype)[1] is based on Xen's virtio implementation (i.e. IOREQ) and particularly EPAM's virtio-disk application (backend server). It has been, I believe, well generalized but is still a bit biased toward this original design.
So I hope you like my approach :)
[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000546.html
Let me take this opportunity to explain a bit more about my approach below.
Also, please see this relevant email thread: https://marc.info/?l=xen-devel&m=162373754705233&w=2
On Wed, 4 Aug 2021, Alex Bennée wrote:
Hi,
One of the goals of Project Stratos is to enable hypervisor agnostic backends so we can enable as much re-use of code as possible and avoid repeating ourselves. This is the flip side of the front end where multiple front-end implementations are required - one per OS, assuming you don't just want Linux guests. The resultant guests are trivially movable between hypervisors modulo any abstracted paravirt type interfaces.
In my original thumb nail sketch of a solution I envisioned vhost-user daemons running in a broadly POSIX like environment. The interface to the daemon is fairly simple requiring only some mapped memory and some sort of signalling for events (on Linux this is eventfd). The idea was a stub binary would be responsible for any hypervisor specific setup and then launch a common binary to deal with the actual virtqueue requests themselves.
Since that original sketch we've seen an expansion in the sort of ways backends could be created. There is interest in encapsulating backends in RTOSes or unikernels for solutions like SCMI. There interest in Rust has prompted ideas of using the trait interface to abstract differences away as well as the idea of bare-metal Rust backends.
We have a card (STR-12) called "Hypercall Standardisation" which calls for a description of the APIs needed from the hypervisor side to support VirtIO guests and their backends. However we are some way off from that at the moment as I think we need to at least demonstrate one portable backend before we start codifying requirements. To that end I want to think about what we need for a backend to function.
Configuration
In the type-2 setup this is typically fairly simple because the host system can orchestrate the various modules that make up the complete system. In the type-1 case (or even type-2 with delegated service VMs) we need some sort of mechanism to inform the backend VM about key details about the system:
- where virt queue memory is in it's address space
- how it's going to receive (interrupt) and trigger (kick) events
- what (if any) resources the backend needs to connect to
Obviously you can elide over configuration issues by having static configurations and baking the assumptions into your guest images however this isn't scalable in the long term. The obvious solution seems to be extending a subset of Device Tree data to user space but perhaps there are other approaches?
Before any virtio transactions can take place the appropriate memory mappings need to be made between the FE guest and the BE guest.
Currently the whole of the FE guests address space needs to be visible to whatever is serving the virtio requests. I can envision 3 approaches:
- BE guest boots with memory already mapped
This would entail the guest OS knowing where in it's Guest Physical Address space is already taken up and avoiding clashing. I would assume in this case you would want a standard interface to userspace to then make that address space visible to the backend daemon.
Yet another way here is that we would have well known "shared memory" between VMs. I think that Jailhouse's ivshmem gives us good insights on this matter and that it can even be an alternative for hypervisor-agnostic solution.
(Please note memory regions in ivshmem appear as a PCI device and can be mapped locally.)
I want to add this shared memory aspect to my virtio-proxy, but the resultant solution would eventually look similar to ivshmem.
- BE guests boots with a hypervisor handle to memory
The BE guest is then free to map the FE's memory to where it wants in the BE's guest physical address space.
I cannot see how this could work for Xen. There is no "handle" to give to the backend if the backend is not running in dom0. So for Xen I think the memory has to be already mapped
In Xen's IOREQ solution (virtio-blk), the following information is expected to be exposed to BE via Xenstore: (I know that this is a tentative approach though.)
- the start address of configuration space
- interrupt number
- file path for backing storage
- read-only flag
And the BE server have to call a particular hypervisor interface to map the configuration space.
Yes, Xenstore was chosen as a simple way to pass configuration info to the backend running in a non-toolstack domain. I remember, there was a wish to avoid using Xenstore in Virtio backend itself if possible, so for non-toolstack domain, this could done with adjusting devd (daemon that listens for devices and launches backends) to read backend configuration from the Xenstore anyway and pass it to the backend via command line arguments.
Yes, in the current PoC code we're using Xenstore to pass the device configuration. We also designed a static device configuration parsing method for Dom0less or other scenarios that don't have the Xen toolstack. Yes, it comes from the device model command line or a config file.
But, if ...
In my approach (virtio-proxy), all those Xen (or hypervisor)-specific stuffs are contained in virtio-proxy, yet another VM, to hide all details.
... the solution how to overcome that is already found and proven to work then even better.
# My point is that a "handle" is not mandatory for executing mapping.
and the mapping probably done by the toolstack (also see below.) Or we would have to invent a new Xen hypervisor interface and Xen virtual machine privileges to allow this kind of mapping.
If we run the backend in Dom0 that we have no problems of course.
One of difficulties on Xen that I found in my approach is that calling such hypervisor intefaces (registering IOREQ, mapping memory) is only allowed on BE servers themselvies and so we will have to extend those interfaces. This, however, will raise some concern on security and privilege distribution as Stefan suggested.
We also faced policy related issues with Virtio backend running in other than Dom0 domain in a "dummy" xsm mode. In our target system we run the backend in a driver domain (we call it DomD) where the underlying H/W resides. We trust it, so we wrote policy rules (to be used in "flask" xsm mode) to provide it with a little bit more privileges than a simple DomU had. Now it is permitted to issue device-model, resource and memory mappings, etc calls.
To activate the mapping will require some sort of hypercall to the hypervisor. I can see two options at this point:
expose the handle to userspace for daemon/helper to trigger the mapping via existing hypercall interfaces. If using a helper you would have a hypervisor specific one to avoid the daemon having to care too much about the details or push that complexity into a compile time option for the daemon which would result in different binaries although a common source base.
expose a new kernel ABI to abstract the hypercall differences away in the guest kernel. In this case the userspace would essentially ask for an abstract "map guest N memory to userspace ptr" and let the kernel deal with the different hypercall interfaces. This of course assumes the majority of BE guests would be Linux kernels and leaves the bare-metal/unikernel approaches to their own devices.
Operation
The core of the operation of VirtIO is fairly simple. Once the vhost-user feature negotiation is done it's a case of receiving update events and parsing the resultant virt queue for data. The vhost-user specification handles a bunch of setup before that point, mostly to detail where the virt queues are set up FD's for memory and event communication. This is where the envisioned stub process would be responsible for getting the daemon up and ready to run. This is currently done inside a big VMM like QEMU but I suspect a modern approach would be to use the rust-vmm vhost crate. It would then either communicate with the kernel's abstracted ABI or be re-targeted as a build option for the various hypervisors.
One thing I mentioned before to Alex is that Xen doesn't have VMMs the way they are typically envisioned and described in other environments. Instead, Xen has IOREQ servers. Each of them connects independently to Xen via the IOREQ interface. E.g. today multiple QEMUs could be used as emulators for a single Xen VM, each of them connecting to Xen independently via the IOREQ interface.
The component responsible for starting a daemon and/or setting up shared interfaces is the toolstack: the xl command and the libxl/libxc libraries.
I think that VM configuration management (or orchestration in Startos jargon?) is a subject to debate in parallel. Otherwise, is there any good assumption to avoid it right now?
Oleksandr and others I CCed have been working on ways for the toolstack to create virtio backends and setup memory mappings. They might be able to provide more info on the subject. I do think we miss a way to provide the configuration to the backend and anything else that the backend might require to start doing its job.
Yes, some work has been done for the toolstack to handle Virtio MMIO devices in general and Virtio block devices in particular. However, it has not been upstreaned yet. Updated patches on review now: https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-email-olekstys...
There is an additional (also important) activity to improve/fix foreign memory mapping on Arm which I am also involved in. The foreign memory mapping is proposed to be used for Virtio backends (device emulators) if there is a need to run guest OS completely unmodified. Of course, the more secure way would be to use grant memory mapping. Brietly, the main difference between them is that with foreign mapping the backend can map any guest memory it wants to map, but with grant mapping it is allowed to map only what was previously granted by the frontend.
So, there might be a problem if we want to pre-map some guest memory in advance or to cache mappings in the backend in order to improve performance (because the mapping/unmapping guest pages every request requires a lot of back and forth to Xen + P2M updates). In a nutshell, currently, in order to map a guest page into the backend address space we need to steal a real physical page from the backend domain. So, with the said optimizations we might end up with no free memory in the backend domain (see XSA-300). And what we try to achieve is to not waste a real domain memory at all by providing safe non-allocated-yet (so unused) address space for the foreign (and grant) pages to be mapped into, this enabling work implies Xen and Linux (and likely DTB bindings) changes. However, as it turned out, for this to work in a proper and safe way some prereq work needs to be done. You can find the related Xen discussion at: https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-olekstys...
One question is how to best handle notification and kicks. The existing vhost-user framework uses eventfd to signal the daemon (although QEMU is quite capable of simulating them when you use TCG). Xen has it's own IOREQ mechanism. However latency is an important factor and having events go through the stub would add quite a lot.
Yeah I think, regardless of anything else, we want the backends to connect directly to the Xen hypervisor.
In my approach,
a) BE -> FE: interrupts are triggered by the BE calling a hypervisor interface via virtio-proxy
b) FE -> BE: MMIO to the config space raises events (in event channels), which are converted into a callback to the BE via virtio-proxy (Xen's event channel is internally implemented by interrupts.)
I don't know what "connect directly" means here, but sending interrupts to the opposite side would be best efficient. Ivshmem, I suppose, takes this approach by utilizing PCI's msi-x mechanism.
Agree that MSI would be more efficient than SPI... At the moment, in order to notify the frontend, the backend issues a specific device-model call to query Xen to inject a corresponding SPI to the guest.
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add.
As I said above, my proposal does the same thing that you mentioned here :) The difference is that I do call hypervisor interfaces via virtio-proxy.
The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See: https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ior...
I don't think that translating IOREQs to eventfd in the kernel is a good idea: if feels like it would be extra complexity and that the kernel shouldn't be involved as this is a backend-hypervisor interface.
Given that we may want to implement the BE as a bare-metal application, as I did on Zephyr, I don't think that the translation would be a big issue, especially on RTOSes. It will be some kind of abstraction layer for interrupt handling (or nothing but a callback mechanism).
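To show how thin such a layer could be, here is a hypothetical sketch (all names invented) of a two-hook notification abstraction, with eventfd as just one possible Linux backing:

/* Hypothetical notification shim: the core virtqueue code only ever sees
 * these two hooks, never eventfd, event channels or interrupt controllers. */
#include <stdint.h>

typedef void (*be_notify_cb)(void *opaque);

struct be_notifier_ops {
    /* Block (or hook an IRQ) until the next frontend kick, then call cb. */
    int (*wait_for_kick)(void *ctx, be_notify_cb cb, void *opaque);
    /* Signal the frontend: eventfd write, event-channel notify, SPI, ... */
    int (*kick_frontend)(void *ctx);
};

#ifdef __linux__
#include <sys/eventfd.h>
#include <unistd.h>

/* One possible Linux backing, matching what vhost-user already does. */
static int eventfd_wait_for_kick(void *ctx, be_notify_cb cb, void *opaque)
{
    int efd = *(int *)ctx;   /* an fd created with eventfd(0, 0) */
    uint64_t n;

    if (read(efd, &n, sizeof(n)) != sizeof(n))
        return -1;
    cb(opaque);              /* on an RTOS this call would come from the ISR */
    return 0;
}
#endif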
Also, eventfd is very Linux-centric and we are trying to design an interface that could work well for RTOSes too. If we want to do something different, both OS-agnostic and hypervisor-agnostic, perhaps we could design a new interface, one that could be implementable in the Xen hypervisor itself (like IOREQ) and of course in any other hypervisor too.
There is also another problem. IOREQ is probably not the only interface needed. Have a look at https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need an interface for the backend to inject interrupts into the frontend? And if the backend requires dynamic memory mappings of frontend pages, then we would also need an interface to map/unmap domU pages.
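Purely to enumerate what that fuller interface would have to cover, a hypothetical set of prototypes (names invented for illustration, not an existing API) could look like:

/* Hypothetical prototypes only - they enumerate the operations discussed
 * above, they are not an existing API. */
#include <stddef.h>
#include <stdint.h>

/* IOREQ-style request delivery (the small, self-contained part). */
int   hyp_ioreq_server_register(uint32_t fe_domid,
                                uint64_t mmio_base, uint64_t mmio_size);

/* Backend -> frontend interrupt injection. */
int   hyp_irq_inject(uint32_t fe_domid, uint32_t irq);

/* Dynamic mapping/unmapping of frontend pages into the backend. */
void *hyp_map_fe_pages(uint32_t fe_domid, const uint64_t *gfns, size_t count);
int   hyp_unmap_fe_pages(void *va, size_t count);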
My proposal document might help here; all the interfaces required for virtio-proxy (i.e. the hypervisor-related interfaces) are listed as RPC protocols :)
These interfaces are a lot more problematic than IOREQ: IOREQ is tiny and self-contained. It is easy to add anywhere. A new interface to inject interrupts or map pages is more difficult to manage because it would require changes scattered across the various emulators.
Exactly. I am not confident yet that my approach will also apply to hypervisors other than Xen. Technically, yes, but whether people can accept it or not is a different matter.
Thanks, -Takahiro Akashi
-- Regards,
Oleksandr Tyshchenko
Hi Wei, Oleksandr,
On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
Hi All,
Thanks to Stefano for linking my kvmtool for Xen proposal here. This proposal is still being discussed in the Xen and KVM communities. The main work is to decouple kvmtool from KVM so that other hypervisors can reuse the virtual device implementations.
In this case, we need to introduce an intermediate hypervisor layer for VMM abstraction, which is, I think, very close to Stratos' virtio hypervisor agnosticism work.
# My proposal[1] comes from my own idea and doesn't always represent
# Linaro's view on this subject nor reflect Alex's concerns. Nevertheless,
Your idea and my proposal seem to share the same background. Both have a similar goal and currently start with Xen, and both are based on kvm-tool. (Actually, my work is derived from EPAM's virtio-disk, which is also based on kvm-tool.)
In particular, the abstraction of hypervisor interfaces has the same set of interfaces (your "struct vmm_impl" and my "RPC interfaces"). This is no coincidence, as we both share the same origin as I said above. And so we will also share the same issues. One of them is the way of "sharing/mapping FE's memory". There is some trade-off between portability and performance impact, so we can discuss the topic here in this ML, too. (See Alex's original email, too.)
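Neither header is reproduced here, but to sketch the shape of such an abstraction, an ops-table style interface might look like the following. The names are invented and do not reproduce the actual "struct vmm_impl" from the kvmtool proposal or the RPC list from mine:

/* Illustrative ops table over the hypervisor; all names are invented. */
#include <stddef.h>
#include <stdint.h>

struct hyp_ops {
    int   (*init)(void *ctx, uint32_t fe_domid);
    void *(*map_guest_mem)(void *ctx, uint64_t gpa, size_t len);
    int   (*unmap_guest_mem)(void *ctx, void *va, size_t len);
    int   (*register_mmio)(void *ctx, uint64_t base, uint64_t size);
    int   (*notify_guest)(void *ctx, uint32_t irq);
    int   (*wait_event)(void *ctx);   /* block until the next kick/IOREQ */
};

/* xen_ops, kvm_ops, ... would each fill this in; the virtio device models
 * are written purely against struct hyp_ops (or, in my case, against RPC
 * stubs with the same shape). */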
On the other hand, my approach aims to create a "single-binary" solution in which the same binary of a BE VM could run on any hypervisor. This is somehow similar to your "proposal-#2" in [2], but in my solution all the hypervisor-specific code would be put into another entity (VM), named "virtio-proxy", and the abstracted operations are served via RPC. (In this sense, the BE is hypervisor-agnostic but might have an OS dependency.) But I know that we need to discuss whether this is a requirement even in the Stratos project or not. (Maybe not.)
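The RPC framing is defined in the proposal document rather than here, so the layout below is only a guess to make the idea concrete; nothing in it is taken from the actual protocol:

/* Hypothetical wire format for virtio-proxy RPC - purely illustrative of
 * forwarding hypervisor operations to a helper VM. */
#include <stdint.h>

enum proxy_op {
    PROXY_OP_MAP_GUEST_MEM,   /* map frontend pages on the backend's behalf */
    PROXY_OP_UNMAP_GUEST_MEM,
    PROXY_OP_INJECT_IRQ,      /* BE -> FE notification                      */
    PROXY_OP_WAIT_EVENT,      /* FE -> BE kick, delivered as the reply      */
};

struct proxy_msg {
    uint32_t op;              /* enum proxy_op                              */
    uint32_t fe_domid;
    uint64_t arg0;            /* e.g. gpa or irq number                     */
    uint64_t arg1;            /* e.g. length                                */
    int64_t  ret;             /* filled in by virtio-proxy in the reply     */
} __attribute__((packed));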
Specifically speaking about kvm-tool, I have a concern about its license terms; targeting different hypervisors and different OSes (which I assume includes RTOSes), the resultant library should have a permissive license, and the GPL of kvm-tool might be an issue. Any thoughts?
-Takahiro Akashi
[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000548.html [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
Hi Akashi,
Hi Wei, Oleksandr,
On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
Hi All,
Thanks to Stefano for linking my kvmtool for Xen proposal here. This proposal is still being discussed in the Xen and KVM communities. The main work is to decouple kvmtool from KVM so that other hypervisors can reuse the virtual device implementations.
In this case, we need to introduce an intermediate hypervisor layer for VMM abstraction, which is, I think, very close to Stratos' virtio hypervisor agnosticism work.
# My proposal[1] comes from my own idea and doesn't always represent
# Linaro's view on this subject nor reflect Alex's concerns. Nevertheless,
Your idea and my proposal seem to share the same background. Both have a similar goal and currently start with Xen, and both are based on kvm-tool. (Actually, my work is derived from EPAM's virtio-disk, which is also based on kvm-tool.)
In particular, the abstraction of hypervisor interfaces has the same set of interfaces (your "struct vmm_impl" and my "RPC interfaces"). This is no coincidence, as we both share the same origin as I said above. And so we will also share the same issues. One of them is the way of "sharing/mapping FE's memory". There is some trade-off between portability and performance impact, so we can discuss the topic here in this ML, too. (See Alex's original email, too.)
Yes, I agree.
On the other hand, my approach aims to create a "single-binary" solution in which the same binary of a BE VM could run on any hypervisor. This is somehow similar to your "proposal-#2" in [2], but in my solution all the hypervisor-specific code would be put into another entity (VM), named "virtio-proxy", and the abstracted operations are served via RPC. (In this sense, the BE is hypervisor-agnostic but might have an OS dependency.) But I know that we need to discuss whether this is a requirement even in the Stratos project or not. (Maybe not.)
Sorry, I haven't had time to finish reading your virtio-proxy completely (I will do it ASAP). But from your description, it seems we need a third VM between the FE and the BE? My concern is that, if my assumption is right, this will increase the latency in the data transport path, even if we're using some lightweight guest like an RTOS or unikernel.
Specifically speaking about kvm-tool, I have a concern about its license terms; targeting different hypervisors and different OSes (which I assume includes RTOSes), the resultant library should have a permissive license, and the GPL of kvm-tool might be an issue. Any thoughts?
Yes. If a user wants to implement a FreeBSD device model but the virtio library is GPL, then the GPL would be a problem. If we have another good candidate, I am open to it.
On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
Hi Akashi,
Hi Wei, Oleksandr,
On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
Hi All,
Thanks to Stefano for linking my kvmtool for Xen proposal here. This proposal is still being discussed in the Xen and KVM communities. The main work is to decouple kvmtool from KVM so that other hypervisors can reuse the virtual device implementations.
In this case, we need to introduce an intermediate hypervisor layer for VMM abstraction, which is, I think, very close to Stratos' virtio hypervisor agnosticism work.
# My proposal[1] comes from my own idea and doesn't always represent
# Linaro's view on this subject nor reflect Alex's concerns. Nevertheless,
Your idea and my proposal seem to share the same background. Both have a similar goal and currently start with Xen, and both are based on kvm-tool. (Actually, my work is derived from EPAM's virtio-disk, which is also based on kvm-tool.)
In particular, the abstraction of hypervisor interfaces has the same set of interfaces (your "struct vmm_impl" and my "RPC interfaces"). This is no coincidence, as we both share the same origin as I said above. And so we will also share the same issues. One of them is the way of "sharing/mapping FE's memory". There is some trade-off between portability and performance impact, so we can discuss the topic here in this ML, too. (See Alex's original email, too.)
Yes, I agree.
On the other hand, my approach aims to create a "single-binary" solution in which the same binary of a BE VM could run on any hypervisor. This is somehow similar to your "proposal-#2" in [2], but in my solution all the hypervisor-specific code would be put into another entity (VM), named "virtio-proxy", and the abstracted operations are served via RPC. (In this sense, the BE is hypervisor-agnostic but might have an OS dependency.) But I know that we need to discuss whether this is a requirement even in the Stratos project or not. (Maybe not.)
Sorry, I haven't had time to finish reading your virtio-proxy completely (I will do it ASAP). But from your description, it seems we need a third VM between the FE and the BE? My concern is that, if my assumption is right, this will increase the latency in the data transport path, even if we're using some lightweight guest like an RTOS or unikernel.
Yes, you're right. But I'm afraid that it is a matter of degree. As long as we execute 'mapping' operations at every fetch of the payload, we will see a latency issue (even in your case), and if we have some solution for it, we won't see it in my proposal either :)
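One mitigation we have both hinted at is caching mappings so that the map/unmap does not happen on every payload fetch. A very rough sketch of such a cache is below, with hyp_map_fe_page()/hyp_unmap_fe_page() as hypothetical stand-ins for whatever mapping interface (direct hypercall or virtio-proxy RPC) is used:

/* Rough sketch of a direct-mapped cache of frontend page mappings, keyed by
 * guest frame number. The hyp_* helpers are hypothetical stand-ins. */
#include <stddef.h>
#include <stdint.h>

extern void *hyp_map_fe_page(uint32_t fe_domid, uint64_t gfn);
extern void  hyp_unmap_fe_page(void *va);

#define CACHE_SLOTS 256

struct map_cache_entry {
    uint64_t gfn;
    void    *va;
    int      valid;
};

static struct map_cache_entry cache[CACHE_SLOTS];

void *map_cached(uint32_t fe_domid, uint64_t gfn)
{
    struct map_cache_entry *e = &cache[gfn % CACHE_SLOTS];

    if (e->valid && e->gfn == gfn)
        return e->va;                       /* hit: no hypervisor round trip */

    if (e->valid)
        hyp_unmap_fe_page(e->va);           /* evict the previous occupant   */

    e->va    = hyp_map_fe_page(fe_domid, gfn);  /* miss: one map call        */
    e->gfn   = gfn;
    e->valid = (e->va != NULL);
    return e->va;
}

Of course, long-lived cached mappings are exactly what runs into the backend-memory exhaustion problem Oleksandr described (XSA-300), so the unpopulated-address-space work is a prerequisite for doing this safely.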
Specifically speaking about kvm-tool, I have a concern about its license terms; targeting different hypervisors and different OSes (which I assume includes RTOSes), the resultant library should have a permissive license, and the GPL of kvm-tool might be an issue. Any thoughts?
Yes. If a user wants to implement a FreeBSD device model but the virtio library is GPL, then the GPL would be a problem. If we have another good candidate, I am open to it.
I have some candidates, particularly for vq/vring, in my mind:
- Open-AMP, or
- the corresponding FreeBSD code
-Takahiro Akashi
-Takahiro Akashi
[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021- August/000548.html [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
From: Oleksandr Tyshchenko olekstysh@gmail.com Sent: 2021年8月14日 23:38 To: AKASHI Takahiro takahiro.akashi@linaro.org; Stefano Stabellini
Cc: Alex Benn??e alex.bennee@linaro.org; Stratos Mailing List
stratos-dev@op-lists.linaro.org; virtio-dev@lists.oasis-open.org; Arnd Bergmann arnd.bergmann@linaro.org; Viresh Kumar viresh.kumar@linaro.org; Stefano Stabellini stefano.stabellini@xilinx.com; stefanha@redhat.com; Jan Kiszka jan.kiszka@siemens.com; Carl van Schaik cvanscha@qti.qualcomm.com; pratikp@quicinc.com; Srivatsa Vaddagiri vatsa@codeaurora.org; Jean- Philippe Brucker jean-philippe@linaro.org; Mathieu Poirier mathieu.poirier@linaro.org; Wei Chen Wei.Chen@arm.com; Oleksandr Tyshchenko Oleksandr_Tyshchenko@epam.com; Bertrand Marquis Bertrand.Marquis@arm.com; Artem Mygaiev Artem_Mygaiev@epam.com; Julien Grall julien@xen.org; Juergen Gross jgross@suse.com; Paul Durrant paul@xen.org; Xen Devel xen-devel@lists.xen.org
Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
Hello, all.
Please see some comments below. And sorry for the possible format
issues.
On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
mailto:takahiro.akashi@linaro.org wrote:
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
CCing people working on Xen+VirtIO and IOREQs. Not trimming the
original
email to let them read the full context.
My comments below are related to a potential Xen implementation,
not
because it is the only implementation that matters, but because it
is
the one I know best.
Please note that my proposal (and hence the working prototype)[1] is based on Xen's virtio implementation (i.e. IOREQ) and
particularly
EPAM's virtio-disk application (backend server). It has been, I believe, well generalized but is still a bit biased toward this original design.
So I hope you like my approach :)
August/000546.html
Let me take this opportunity to explain a bit more about my approach
below.
Also, please see this relevant email thread: https://marc.info/?l=xen-devel&m=162373754705233&w=2
On Wed, 4 Aug 2021, Alex Bennée wrote: > Hi, > > One of the goals of Project Stratos is to enable hypervisor
agnostic
> backends so we can enable as much re-use of code as possible and
avoid
> repeating ourselves. This is the flip side of the front end
where
> multiple front-end implementations are required - one per OS,
assuming
> you don't just want Linux guests. The resultant guests are
trivially
> movable between hypervisors modulo any abstracted paravirt type > interfaces. > > In my original thumb nail sketch of a solution I envisioned
vhost-user
> daemons running in a broadly POSIX like environment. The
interface to
> the daemon is fairly simple requiring only some mapped memory
and some
> sort of signalling for events (on Linux this is eventfd). The
idea was a
> stub binary would be responsible for any hypervisor specific
setup and
> then launch a common binary to deal with the actual virtqueue
requests
> themselves. > > Since that original sketch we've seen an expansion in the sort
of ways
> backends could be created. There is interest in encapsulating
backends
> in RTOSes or unikernels for solutions like SCMI. There interest
in Rust
> has prompted ideas of using the trait interface to abstract
differences
> away as well as the idea of bare-metal Rust backends. > > We have a card (STR-12) called "Hypercall Standardisation" which > calls for a description of the APIs needed from the hypervisor
side to
> support VirtIO guests and their backends. However we are some
way off
> from that at the moment as I think we need to at least
demonstrate one
> portable backend before we start codifying requirements. To that
end I
> want to think about what we need for a backend to function. > > Configuration > ============= > > In the type-2 setup this is typically fairly simple because the
host
> system can orchestrate the various modules that make up the
complete
> system. In the type-1 case (or even type-2 with delegated
service VMs)
> we need some sort of mechanism to inform the backend VM about
key
> details about the system: > > - where virt queue memory is in it's address space > - how it's going to receive (interrupt) and trigger (kick)
events
> - what (if any) resources the backend needs to connect to > > Obviously you can elide over configuration issues by having
static
> configurations and baking the assumptions into your guest images
however
> this isn't scalable in the long term. The obvious solution seems
to be
> extending a subset of Device Tree data to user space but perhaps
there
> are other approaches? > > Before any virtio transactions can take place the appropriate
memory
> mappings need to be made between the FE guest and the BE guest.
> Currently the whole of the FE guests address space needs to be
visible
> to whatever is serving the virtio requests. I can envision 3
approaches:
> > * BE guest boots with memory already mapped > > This would entail the guest OS knowing where in it's Guest
Physical
> Address space is already taken up and avoiding clashing. I
would assume
> in this case you would want a standard interface to userspace
to then
> make that address space visible to the backend daemon.
Yet another way here is that we would have well known "shared
memory" between
VMs. I think that Jailhouse's ivshmem gives us good insights on this
matter
and that it can even be an alternative for hypervisor-agnostic
solution.
(Please note memory regions in ivshmem appear as a PCI device and
can be
mapped locally.)
I want to add this shared memory aspect to my virtio-proxy, but the resultant solution would eventually look similar to ivshmem.
> * BE guests boots with a hypervisor handle to memory > > The BE guest is then free to map the FE's memory to where it
wants in
> the BE's guest physical address space.
I cannot see how this could work for Xen. There is no "handle" to
give
to the backend if the backend is not running in dom0. So for Xen I
think
the memory has to be already mapped
In Xen's IOREQ solution (virtio-blk), the following information is
expected
to be exposed to BE via Xenstore: (I know that this is a tentative approach though.)
- the start address of configuration space
- interrupt number
- file path for backing storage
- read-only flag
And the BE server have to call a particular hypervisor interface to map the configuration space.
Yes, Xenstore was chosen as a simple way to pass configuration info to
the backend running in a non-toolstack domain.
I remember, there was a wish to avoid using Xenstore in Virtio backend
itself if possible, so for non-toolstack domain, this could done with adjusting devd (daemon that listens for devices and launches backends)
to read backend configuration from the Xenstore anyway and pass it to
the backend via command line arguments.
Yes, in current PoC code we're using xenstore to pass device
configuration.
We also designed a static device configuration parse method for Dom0less
or
other scenarios don't have xentool. yes, it's from device model command
line
or a config file.
But, if ...
In my approach (virtio-proxy), all those Xen (or hypervisor)-
specific
stuffs are contained in virtio-proxy, yet another VM, to hide all
details.
... the solution how to overcome that is already found and proven to
work then even better.
# My point is that a "handle" is not mandatory for executing mapping.
and the mapping probably done by the toolstack (also see below.) Or we would have to invent a new Xen hypervisor interface and Xen virtual machine privileges to allow
this
kind of mapping.
If we run the backend in Dom0 that we have no problems of course.
One of difficulties on Xen that I found in my approach is that
calling
such hypervisor intefaces (registering IOREQ, mapping memory) is
only
allowed on BE servers themselvies and so we will have to extend
those
interfaces. This, however, will raise some concern on security and privilege
distribution
as Stefan suggested.
We also faced policy related issues with Virtio backend running in
other than Dom0 domain in a "dummy" xsm mode. In our target system we run the backend in a driver
domain (we call it DomD) where the underlying H/W resides. We trust it,
so we wrote policy rules (to be used in "flask" xsm mode) to provide it with a little bit more privileges than a simple DomU had.
Now it is permitted to issue device-model, resource and memory
mappings, etc calls.
> To activate the mapping will > require some sort of hypercall to the hypervisor. I can see two
options
> at this point: > > - expose the handle to userspace for daemon/helper to trigger
the
> mapping via existing hypercall interfaces. If using a helper
you
> would have a hypervisor specific one to avoid the daemon
having to
> care too much about the details or push that complexity into
a
> compile time option for the daemon which would result in
different
> binaries although a common source base. > > - expose a new kernel ABI to abstract the hypercall
differences away
> in the guest kernel. In this case the userspace would
essentially
> ask for an abstract "map guest N memory to userspace ptr"
and let
> the kernel deal with the different hypercall interfaces.
This of
> course assumes the majority of BE guests would be Linux
kernels and
> leaves the bare-metal/unikernel approaches to their own
devices.
> > Operation > ========= > > The core of the operation of VirtIO is fairly simple. Once the > vhost-user feature negotiation is done it's a case of receiving
update
> events and parsing the resultant virt queue for data. The vhost-
user
> specification handles a bunch of setup before that point, mostly
to
> detail where the virt queues are set up FD's for memory and
event
> communication. This is where the envisioned stub process would
be
> responsible for getting the daemon up and ready to run. This is > currently done inside a big VMM like QEMU but I suspect a modern > approach would be to use the rust-vmm vhost crate. It would then
either
> communicate with the kernel's abstracted ABI or be re-targeted
as a
> build option for the various hypervisors.
One thing I mentioned before to Alex is that Xen doesn't have VMMs the way they are typically envisioned and described in other environments. Instead, Xen has IOREQ servers. Each of them connects independently to Xen via the IOREQ interface. E.g. today multiple QEMUs could be used as emulators for a single Xen VM, each of them connecting to Xen independently via the IOREQ interface.
The component responsible for starting a daemon and/or setting up shared interfaces is the toolstack: the xl command and the libxl/libxc libraries.
I think that VM configuration management (or orchestration in Stratos jargon?) is a subject to be debated in parallel. Otherwise, is there any good assumption that lets us avoid it right now?
Oleksandr and others I CCed have been working on ways for the toolstack to create virtio backends and set up memory mappings. They might be able to provide more info on the subject. I do think we are missing a way to provide the configuration to the backend, and anything else that the backend might require to start doing its job.
Yes, some work has been done for the toolstack to handle Virtio MMIO devices in general and Virtio block devices in particular. However, it has not been upstreamed yet. Updated patches are on review now:
https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-email-olekstysh@gmail.com/
There is an additional (also important) activity to improve/fix foreign memory mapping on Arm, which I am also involved in. The foreign memory mapping is proposed to be used for Virtio backends (device emulators) if there is a need to run a guest OS completely unmodified.
Of course, the more secure way would be to use grant memory mapping. Briefly, the main difference between them is that with foreign mapping the backend can map any guest memory it wants to map, but with grant mapping it is allowed to map only what was previously granted by the frontend.
So, there might be a problem if we want to pre-map some guest memory in advance, or to cache mappings in the backend in order to improve performance (because mapping/unmapping guest pages on every request requires a lot of back and forth to Xen, plus P2M updates). In a nutshell, currently, in order to map a guest page into the backend address space we need to steal a real physical page from the backend domain. So, with the said optimizations we might end up with no free memory in the backend domain (see XSA-300). What we try to achieve is not to waste real domain memory at all, by providing safe, not-yet-allocated (so unused) address space for the foreign (and grant) pages to be mapped into. This enabling work implies Xen and Linux (and likely DTB bindings) changes. However, as it turned out, for this to work in a proper and safe way some prerequisite work needs to be done.
You can find the related Xen discussion at:
https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-olekstysh@gmail.com/
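To illustrate the difference, here is a rough C sketch (not from the series above; the domain ID, gfn and grant reference are made up) of the two mapping styles using the stable libxenforeignmemory and libxengnttab libraries. With the foreign interface the backend picks which frontend gfn to map; with the grant interface it can only map a reference that the frontend granted beforehand.

  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <xenctrl.h>           /* xen_pfn_t */
  #include <xenforeignmemory.h>
  #include <xengnttab.h>

  int main(void)
  {
      uint32_t fe_domid = 1;     /* frontend domain id (illustrative) */

      /* Foreign mapping: the backend chooses any frontend gfn. */
      xenforeignmemory_handle *fmem = xenforeignmemory_open(NULL, 0);
      if (!fmem)
          return 1;
      xen_pfn_t gfn = 0x48000;   /* e.g. a virtqueue page (made up)   */
      int err = 0;
      void *va = xenforeignmemory_map(fmem, fe_domid, PROT_READ | PROT_WRITE,
                                      1, &gfn, &err);
      if (va && !err)
          printf("foreign-mapped gfn %#llx at %p\n", (unsigned long long)gfn, va);
      if (va)
          xenforeignmemory_unmap(fmem, va, 1);
      xenforeignmemory_close(fmem);

      /* Grant mapping: only a reference previously granted by the FE works. */
      xengnttab_handle *xgt = xengnttab_open(NULL, 0);
      if (!xgt)
          return 1;
      uint32_t gref = 8;         /* grant reference advertised by the FE */
      void *gva = xengnttab_map_grant_ref(xgt, fe_domid, gref,
                                          PROT_READ | PROT_WRITE);
      if (gva) {
          printf("grant-mapped ref %u at %p\n", gref, gva);
          xengnttab_unmap(xgt, gva, 1);
      }
      xengnttab_close(xgt);
      return 0;
  }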
> One question is how to best handle notification and kicks. The
existing
> vhost-user framework uses eventfd to signal the daemon (although
QEMU
> is quite capable of simulating them when you use TCG). Xen has
it's own
> IOREQ mechanism. However latency is an important factor and
having
> events go through the stub would add quite a lot.
Yeah I think, regardless of anything else, we want the backends to connect directly to the Xen hypervisor.
In my approach,
a) BE -> FE: interrupts are triggered by the BE calling a hypervisor interface via virtio-proxy;
b) FE -> BE: MMIO to the config space raises events (in event channels), which are converted into a callback to the BE via virtio-proxy (Xen's event channels are internally implemented by interrupts).
I don't know what "connect directly" means here, but sending interrupts to the opposite side would be the most efficient. Ivshmem, I suppose, takes this approach by utilizing PCI's MSI-X mechanism.
Agree that MSI would be more efficient than SPI... At the moment, in order to notify the frontend, the backend issues a specific device-model call asking Xen to inject a corresponding SPI into the guest.
> Could we consider the kernel internally converting IOREQ
messages from
> the Xen hypervisor to eventfd events? Would this scale with
other kernel
> hypercall interfaces? > > So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add.
As I said above, my proposal does the same thing that you mentioned here :)
The difference is that I do call hypervisor interfaces via virtio-proxy.
The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See:
https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
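For reference, the sketch below shows the kind of thing a hypervisor-agnostic IOREQ-like message could look like. It is deliberately NOT the Xen struct ioreq from the header linked above (see that file for the real layout); it only illustrates how small such an interface is: "a vCPU accessed <addr> with this width and direction, here is the data", plus a state field for the request/response handshake. All names here are hypothetical.

  #include <stdint.h>

  enum vio_dir   { VIO_READ, VIO_WRITE };
  enum vio_state { VIO_NONE, VIO_READY, VIO_IN_SERVICE, VIO_RESP_READY };

  struct vio_req {                /* hypothetical, per-vCPU request slot          */
      uint64_t addr;              /* guest-physical address of the access         */
      uint64_t data;              /* value written, or to be filled in for a read */
      uint32_t size;              /* access width in bytes: 1, 2, 4 or 8          */
      uint16_t vcpu;              /* which vCPU trapped                           */
      uint8_t  dir;               /* enum vio_dir                                 */
      uint8_t  state;             /* enum vio_state, handshake with the hypervisor */
  };

  /* The backend side of the handshake then reduces to roughly this: */
  static void serve(volatile struct vio_req *slot,
                    uint64_t (*mmio_read)(uint64_t addr, uint32_t size),
                    void (*mmio_write)(uint64_t addr, uint64_t val, uint32_t size))
  {
      if (slot->state != VIO_READY)
          return;                             /* nothing pending in this slot */
      slot->state = VIO_IN_SERVICE;
      if (slot->dir == VIO_READ)
          slot->data = mmio_read(slot->addr, slot->size);
      else
          mmio_write(slot->addr, slot->data, slot->size);
      slot->state = VIO_RESP_READY;           /* hypervisor resumes the vCPU */
  }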
I don't think that translating IOREQs to eventfd in the kernel is a good idea: it feels like it would be extra complexity, and the kernel shouldn't be involved as this is a backend-hypervisor interface.
Given that we may want to implement a BE as a bare-metal application, as I did on Zephyr, I don't think that the translation would be a big issue, especially on RTOS's. It will be some kind of abstraction layer for interrupt handling (or nothing but a callback mechanism).
Also, eventfd is very Linux-centric and we are trying to design an interface that could work well for RTOSes too. If we want to do something different, both OS-agnostic and hypervisor-agnostic, perhaps we could design a new interface: one that could be implementable in the Xen hypervisor itself (like IOREQ) and of course in any other hypervisor too.
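As a small illustration of why this matters, the sketch below contrasts the Linux-specific eventfd kick that vhost-user relies on with the kind of OS-agnostic callback registration an RTOS or bare-metal backend could use instead. Nothing here is a proposed interface; the notifier_ops names are made up.

  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/eventfd.h>

  /* Linux/vhost-user style: a kick arrives as a counter readable on an fd. */
  static void wait_for_kick_eventfd(int kick_fd)
  {
      uint64_t n;
      if (read(kick_fd, &n, sizeof(n)) == sizeof(n))   /* blocks until kicked */
          printf("got %llu kick(s)\n", (unsigned long long)n);
  }

  /* RTOS/bare-metal style: no file descriptors, just a callback registered
   * with whatever delivers the notification (an interrupt, an event-channel
   * upcall, ...). These names are purely illustrative. */
  typedef void (*kick_handler_t)(void *opaque);

  struct notifier_ops {
      int  (*register_kick)(uint16_t queue, kick_handler_t cb, void *opaque);
      void (*signal_call)(uint16_t queue);   /* BE -> FE "used buffer" notify */
  };

  int main(void)
  {
      int kick_fd = eventfd(0, 0);
      uint64_t one = 1;
      if (write(kick_fd, &one, sizeof(one)) != sizeof(one))  /* simulate a kick */
          return 1;
      wait_for_kick_eventfd(kick_fd);
      close(kick_fd);
      return 0;
  }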
There is also another problem. IOREQ is probably not the only interface needed. Have a look at https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need an interface for the backend to inject interrupts into the frontend? And if the backend requires dynamic memory mappings of frontend pages, then we would also need an interface to map/unmap domU pages.
My proposal document might help here; all the interfaces required for virtio-proxy (i.e. the hypervisor-related interfaces) are listed as RPC protocols :)
These interfaces are a lot more problematic than IOREQ: IOREQ is tiny and self-contained. It is easy to add anywhere. A new interface to inject interrupts or map pages is more difficult to manage because it would require changes scattered across the various emulators.
Exactly. I have no confidence yet that my approach will also apply to hypervisors other than Xen. Technically, yes, but whether people can accept it or not is a different matter.
Thanks, -Takahiro Akashi
-- Regards,
Oleksandr Tyshchenko
Hi Akashi,
On Wed, Aug 18, 2021, AKASHI Takahiro wrote:
On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
Hi Akashi,
On Tue, Aug 17, 2021, AKASHI Takahiro wrote:
Hi Wei, Oleksandr,
On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
Hi All,
Thanks to Stefano for linking my kvmtool-for-Xen proposal here. This proposal is still being discussed in the Xen and KVM communities. The main work is to decouple kvmtool from KVM and make it possible for other hypervisors to reuse the virtual device implementations.
In this case, we need to introduce an intermediate hypervisor layer for VMM abstraction, which is, I think, very close to Stratos' virtio hypervisor-agnosticism work.
# My proposal[1] comes from my own idea and doesn't always represent
# Linaro's view on this subject, nor reflect Alex's concerns.
Nevertheless, your idea and my proposal seem to share the same background. Both have a similar goal and currently start, at first, with Xen and are based on kvm-tool. (Actually, my work is derived from EPAM's virtio-disk, which is also based on kvm-tool.)
In particular, the abstraction of hypervisor interfaces ends up with the same set of interfaces (your "struct vmm_impl" and my "RPC interfaces"). This is not a coincidence, as we both share the same origin, as I said above.
And so we will also share the same issues. One of them is the way of sharing/mapping the FE's memory. There is some trade-off between portability and the performance impact, so we can discuss the topic here in this ML, too. (See Alex's original email, too.)
Yes, I agree.
On the other hand, my approach aims to create a "single-binary" solution in which the same BE VM binary could run on any hypervisor. It is somewhat similar to your "proposal-#2" in [2], but in my solution all the hypervisor-specific code is put into another entity (a VM) named "virtio-proxy", and the abstracted operations are served via RPC. (In this sense, the BE is hypervisor-agnostic but might have an OS dependency.)
But I know that we need to discuss whether this is a requirement in the Stratos project or not. (Maybe not.)
Sorry, I haven't had time to finish reading your virtio-proxy completely (I will do it ASAP). But from your description, it seems we need a 3rd VM between the FE and the BE? My concern is that, if my assumption is right, it will increase the latency in the data transport path, even if we're using some lightweight guest like an RTOS or a unikernel.
Yes, you're right. But I'm afraid that it is a matter of degree. As long as we execute 'mapping' operations at every fetch of payload, we will see a latency issue (even in your case), and if we have some solution for it, we won't see it in my proposal either :)
Oleksandr has sent a proposal to the Xen mailing list to reduce this kind of "mapping/unmapping" operation. So the latency caused by this behavior on Xen may eventually be eliminated, and Linux-KVM doesn't have that problem.
Speaking specifically about kvm-tool, I have a concern about its license terms; targeting different hypervisors and different OSs (which I assume includes RTOS's), the resultant library should have a permissive license, and the GPL of kvm-tool might be an issue. Any thoughts?
Yes. If a user wants to implement a FreeBSD device model but the virtio library is GPL, then the GPL would be a problem. If we have another good candidate, I am open to it.
I have some candidates, particularly for vq/vring, in my mind:
- Open-AMP, or
- the corresponding FreeBSD code
Interesting, I will look into them : )
Cheers, Wei Chen
-Takahiro Akashi
[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000548.html
[2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
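As a concrete reference point for the vq/vring layer mentioned above, here is a minimal sketch of the split-virtqueue descriptor layout from the VirtIO specification, which is what any of those candidate libraries would ultimately provide; the chain-walking helper is illustrative only (VirtIO 1.x fields are little-endian; plain integer types are used here for brevity).

  #include <stdint.h>

  struct vring_desc  { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
  struct vring_avail { uint16_t flags; uint16_t idx; uint16_t ring[]; };

  #define VRING_DESC_F_NEXT  1u   /* buffer continues via the 'next' field */
  #define VRING_DESC_F_WRITE 2u   /* buffer is write-only for the device   */

  /* Walk one descriptor chain starting at 'head', summing the buffer lengths. */
  static uint32_t chain_len(const struct vring_desc *desc, uint16_t qsize,
                            uint16_t head)
  {
      uint32_t total = 0;
      uint16_t i = head;
      for (uint16_t hops = 0; hops < qsize; hops++) {  /* bound against bad rings */
          total += desc[i].len;
          if (!(desc[i].flags & VRING_DESC_F_NEXT))
              break;
          i = desc[i].next % qsize;
      }
      return total;
  }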
On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
Hi Akashi,
Oleksandr has sent a proposal to the Xen mailing list to reduce this kind of "mapping/unmapping" operation. So the latency caused by this behavior on Xen may eventually be eliminated, and Linux-KVM doesn't have that problem.
Obviously, I have not yet caught up with that discussion. Which patch specifically?
-Takahiro Akashi
Hi Wei,
On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
Hi Akashi,
-----Original Message----- From: AKASHI Takahiro takahiro.akashi@linaro.org Sent: 2021年8月18日 13:39 To: Wei Chen Wei.Chen@arm.com Cc: Oleksandr Tyshchenko olekstysh@gmail.com; Stefano Stabellini sstabellini@kernel.org; Alex Benn??e alex.bennee@linaro.org; Stratos Mailing List stratos-dev@op-lists.linaro.org; virtio-dev@lists.oasis- open.org; Arnd Bergmann arnd.bergmann@linaro.org; Viresh Kumar viresh.kumar@linaro.org; Stefano Stabellini stefano.stabellini@xilinx.com; stefanha@redhat.com; Jan Kiszka jan.kiszka@siemens.com; Carl van Schaik cvanscha@qti.qualcomm.com; pratikp@quicinc.com; Srivatsa Vaddagiri vatsa@codeaurora.org; Jean- Philippe Brucker jean-philippe@linaro.org; Mathieu Poirier mathieu.poirier@linaro.org; Oleksandr Tyshchenko Oleksandr_Tyshchenko@epam.com; Bertrand Marquis Bertrand.Marquis@arm.com; Artem Mygaiev Artem_Mygaiev@epam.com; Julien Grall julien@xen.org; Juergen Gross jgross@suse.com; Paul Durrant paul@xen.org; Xen Devel xen-devel@lists.xen.org Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
Hi Akashi,
On 2021年8月17日 16:08, AKASHI Takahiro <takahiro.akashi@linaro.org> wrote:
Hi Wei, Oleksandr,
On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
Hi All,
Thanks to Stefano for linking my kvmtool for Xen proposal here. This proposal is still being discussed in the Xen and KVM communities. The main work is to decouple kvmtool from KVM so that other hypervisors can reuse the virtual device implementations.
In this case, we need to introduce an intermediate hypervisor layer for VMM abstraction, which is, I think, very close to Stratos' virtio hypervisor agnosticism work.
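To make the "intermediate hypervisor layer for VMM abstraction" a little more concrete, here is a minimal sketch of what such a layer could look like in C. The name struct vmm_impl is the one used in the proposal; the particular callbacks and the dummy backend below are assumptions for illustration only, not the actual kvmtool patches.

```c
/* Minimal sketch (illustrative only) of a "struct vmm_impl" style
 * abstraction: the device model is written against a table of
 * hypervisor callbacks so the same code can sit on top of Xen, KVM,
 * or something else. The callback set is an assumption. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct vmm_impl {
    const char *name;
    void *(*map_guest)(uint64_t gpa, size_t len);   /* map FE memory    */
    void  (*unmap_guest)(void *va, size_t len);     /* undo the mapping */
    int   (*notify_frontend)(uint32_t irq);         /* inject interrupt */
};

/* Dummy "hypervisor" used only so the sketch runs stand-alone. */
static void *dummy_map(uint64_t gpa, size_t len)
{
    (void)gpa;
    return calloc(1, len);
}
static void dummy_unmap(void *va, size_t len) { (void)len; free(va); }
static int dummy_notify(uint32_t irq)
{
    printf("notify frontend, irq %u\n", irq);
    return 0;
}

static const struct vmm_impl dummy_vmm = {
    .name            = "dummy",
    .map_guest       = dummy_map,
    .unmap_guest     = dummy_unmap,
    .notify_frontend = dummy_notify,
};

int main(void)
{
    /* A virtual device implementation only ever talks to the ops table. */
    void *ring = dummy_vmm.map_guest(0x40000000, 4096);
    memset(ring, 0, 4096);          /* pretend to process a virtqueue */
    dummy_vmm.notify_frontend(33);
    dummy_vmm.unmap_guest(ring, 4096);
    return 0;
}
```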
# My proposal[1] comes from my own idea and doesn't always represent
# Linaro's view on this subject nor reflect Alex's concerns. Nevertheless,
your idea and my proposal seem to share the same background. Both have a similar goal and currently start with Xen and are based on kvm-tool. (Actually, my work is derived from EPAM's virtio-disk, which is also based on kvm-tool.)
In particular, the abstraction of hypervisor interfaces ends up with the same set of interfaces (your "struct vmm_impl" and my "RPC interfaces"). This is no coincidence, as we both share the same origin, as I said above.
And so we will also share the same issues. One of them is the way of "sharing/mapping the FE's memory". There is a trade-off between portability and performance impact, so we can discuss that topic here on this ML, too. (See Alex's original email, too.)
Yes, I agree.
On the other hand, my approach aims to create a "single-binary" solution in which the same BE VM binary could run on any hypervisor. It is somewhat similar to your "proposal-#2" in [2], but in my solution all the hypervisor-specific code would be put into another entity (VM), named "virtio-proxy", and the abstracted operations are served via RPC. (In this sense, the BE is hypervisor-agnostic but might have an OS dependency.)
But I know we still need to discuss whether this is a requirement for the Stratos project or not. (Maybe not.)
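As a rough picture of the virtio-proxy split, the sketch below shows the kind of request a hypervisor-agnostic BE might push across the RPC boundary to the proxy VM, which then performs the hypervisor-specific operation. The message set, field layout and transport are assumptions made for this example; the actual protocol is the one listed in the proposal document [1].

```c
/* Illustrative sketch only: one possible shape for the RPC requests a
 * hypervisor-agnostic backend could send to a "virtio-proxy" VM.
 * The operation names and fields are assumptions for the example. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum proxy_op {
    PROXY_OP_MAP_GUEST_MEM   = 1,   /* map a frontend memory region  */
    PROXY_OP_UNMAP_GUEST_MEM = 2,
    PROXY_OP_INJECT_IRQ      = 3,   /* notify the frontend           */
    PROXY_OP_WAIT_EVENT      = 4,   /* block until the next FE kick  */
};

struct proxy_msg {
    uint32_t op;      /* enum proxy_op                         */
    uint32_t domid;   /* which frontend VM the request targets */
    uint64_t gpa;     /* guest physical address, if relevant   */
    uint64_t len;
    uint32_t irq;
};

/* Serialize a request as it might be pushed onto whatever transport
 * (vsock, a shared ring, ...) connects the BE and the proxy. */
static size_t encode_msg(const struct proxy_msg *m, uint8_t *buf, size_t cap)
{
    if (cap < sizeof(*m))
        return 0;
    memcpy(buf, m, sizeof(*m));   /* a real protocol would fix endianness */
    return sizeof(*m);
}

int main(void)
{
    uint8_t wire[64];
    struct proxy_msg m = { PROXY_OP_INJECT_IRQ, 2, 0, 0, 33 };
    size_t n = encode_msg(&m, wire, sizeof(wire));
    printf("encoded %zu-byte request: inject irq %u into dom%u\n",
           n, m.irq, m.domid);
    return 0;
}
```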
Sorry, I haven't had time to finish reading your virtio-proxy completely (I will do it ASAP). But from your description, it seems we need a 3rd VM between FE and BE? My concern is that, if my assumption is right, will it increase the latency of the data transport path, even if we're using some lightweight guest like an RTOS or unikernel?
Yes, you're right. But I'm afraid that it is a matter of degree. As long as we execute 'mapping' operations at every fetch of payload, we will see a latency issue (even in your case), and if we have some solution for it, we won't see it in my proposal either :)
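One possible "solution for it" is to cache mappings in the backend so that a hypervisor map/unmap round trip is not needed at every payload fetch. The fragment below is a purely illustrative cache keyed by guest frame number; hyp_map_page() and hyp_unmap_page() are hypothetical stand-ins for whatever hypervisor-specific primitive (an IOREQ-side mapping call, a virtio-proxy RPC, and so on) actually performs the mapping.

```c
/* Illustrative only: a tiny mapping cache so that repeated accesses to
 * the same guest page do not trigger a hypervisor round trip on every
 * payload fetch. hyp_map_page()/hyp_unmap_page() are hypothetical
 * stand-ins for the real hypervisor-specific primitives. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SHIFT  12
#define CACHE_SLOTS 16

struct map_slot {
    uint64_t gfn;   /* guest frame number          */
    void    *va;    /* local mapping, NULL = empty */
};

static struct map_slot cache[CACHE_SLOTS];

static void *hyp_map_page(uint64_t gfn)              /* hypothetical */
{
    printf("hypervisor map of gfn %#llx\n", (unsigned long long)gfn);
    return calloc(1, 1u << PAGE_SHIFT);
}
static void hyp_unmap_page(void *va) { free(va); }   /* hypothetical */

/* Return a local pointer for a guest physical address, reusing an
 * existing mapping of the same page whenever possible. */
static void *guest_ptr(uint64_t gpa)
{
    uint64_t gfn = gpa >> PAGE_SHIFT;
    struct map_slot *s = &cache[gfn % CACHE_SLOTS];

    if (s->va && s->gfn != gfn) {       /* evict a colliding entry */
        hyp_unmap_page(s->va);
        s->va = NULL;
    }
    if (!s->va) {
        s->va = hyp_map_page(gfn);
        s->gfn = gfn;
    }
    return (uint8_t *)s->va + (gpa & ((1u << PAGE_SHIFT) - 1));
}

int main(void)
{
    guest_ptr(0x40001000);   /* first access: maps the page  */
    guest_ptr(0x40001080);   /* same page: served from cache */
    return 0;
}
```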
Oleksandr has sent a proposal to the Xen mailing list to reduce this kind of "mapping/unmapping" operation. So the latency caused by this behavior on Xen may eventually be eliminated, and Linux-KVM doesn't have that problem.
Obviously, I have not yet caught up there in the discussion. Which patch specifically?
Can you give me the link to the discussion or patch, please?
Thanks, -Takahiro Akashi
-Takahiro Akashi
Specifically speaking about kvm-tool, I have a concern about its license terms; targeting different hypervisors and different OSs (which I assume includes RTOSes), the resultant library should be permissively licensed, and the GPL of kvm-tool might be an issue. Any thoughts?
Yes. If a user wants to implement a FreeBSD device model but the virtio library is GPL, then the GPL would be a problem. If we have another good candidate, I am open to it.
I have some candidates, particularly for vq/vring, in my mind:
- Open-AMP, or
- corresponding Free-BSD code
Interesting, I will look into them : )
Cheers, Wei Chen
-Takahiro Akashi
-Takahiro Akashi
[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021- August/000548.html [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
> On 2021年8月14日 23:38, Oleksandr Tyshchenko <olekstysh@gmail.com> wrote:
>
> Hello, all.
>
> Please see some comments below. And sorry for the possible format issues.
> > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro
mailto:takahiro.akashi@linaro.org wrote:
> > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini
wrote:
> > > CCing people working on Xen+VirtIO and IOREQs. Not trimming
the
original
> > > email to let them read the full context. > > > > > > My comments below are related to a potential Xen
implementation,
not
> > > because it is the only implementation that matters, but
because it
is
> > > the one I know best. > > > > Please note that my proposal (and hence the working prototype)[1] > > is based on Xen's virtio implementation (i.e. IOREQ) and
particularly
> > EPAM's virtio-disk application (backend server). > > It has been, I believe, well generalized but is still a bit
biased
> > toward this original design. > > > > So I hope you like my approach :) > > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-
August/000546.html
> > > > Let me take this opportunity to explain a bit more about my
approach
below.
> > > > > Also, please see this relevant email thread: > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2 > > > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote: > > > > Hi, > > > > > > > > One of the goals of Project Stratos is to enable hypervisor
agnostic
> > > > backends so we can enable as much re-use of code as possible
and
avoid
> > > > repeating ourselves. This is the flip side of the front end
where
> > > > multiple front-end implementations are required - one per OS,
assuming
> > > > you don't just want Linux guests. The resultant guests are
trivially
> > > > movable between hypervisors modulo any abstracted paravirt
type
> > > > interfaces. > > > > > > > > In my original thumb nail sketch of a solution I envisioned
vhost-user
> > > > daemons running in a broadly POSIX like environment. The
interface to
> > > > the daemon is fairly simple requiring only some mapped
memory
and some
> > > > sort of signalling for events (on Linux this is eventfd).
The
idea was a
> > > > stub binary would be responsible for any hypervisor specific
setup and
> > > > then launch a common binary to deal with the actual
virtqueue
requests
> > > > themselves. > > > > > > > > Since that original sketch we've seen an expansion in the
sort
of ways
> > > > backends could be created. There is interest in
encapsulating
backends
> > > > in RTOSes or unikernels for solutions like SCMI. There
interest
in Rust
> > > > has prompted ideas of using the trait interface to abstract
differences
> > > > away as well as the idea of bare-metal Rust backends. > > > > > > > > We have a card (STR-12) called "Hypercall Standardisation"
which
> > > > calls for a description of the APIs needed from the
hypervisor
side to
> > > > support VirtIO guests and their backends. However we are
some
way off
> > > > from that at the moment as I think we need to at least
demonstrate one
> > > > portable backend before we start codifying requirements. To
that
end I
> > > > want to think about what we need for a backend to function. > > > > > > > > Configuration > > > > ============= > > > > > > > > In the type-2 setup this is typically fairly simple because
the
host
> > > > system can orchestrate the various modules that make up the
complete
> > > > system. In the type-1 case (or even type-2 with delegated
service VMs)
> > > > we need some sort of mechanism to inform the backend VM
about
key
> > > > details about the system: > > > > > > > > - where virt queue memory is in it's address space > > > > - how it's going to receive (interrupt) and trigger (kick)
events
> > > > - what (if any) resources the backend needs to connect to > > > > > > > > Obviously you can elide over configuration issues by having
static
> > > > configurations and baking the assumptions into your guest
images
however
> > > > this isn't scalable in the long term. The obvious solution
seems
to be
> > > > extending a subset of Device Tree data to user space but
perhaps
there
> > > > are other approaches? > > > > > > > > Before any virtio transactions can take place the
appropriate
memory
> > > > mappings need to be made between the FE guest and the BE
guest.
> > > > > > > Currently the whole of the FE guests address space needs to
be
visible
> > > > to whatever is serving the virtio requests. I can envision 3
approaches:
> > > > > > > > * BE guest boots with memory already mapped > > > > > > > > This would entail the guest OS knowing where in it's Guest
Physical
> > > > Address space is already taken up and avoiding clashing. I
would assume
> > > > in this case you would want a standard interface to
userspace
to then
> > > > make that address space visible to the backend daemon. > > > > Yet another way here is that we would have well known "shared
memory" between
> > VMs. I think that Jailhouse's ivshmem gives us good insights on
this
matter
> > and that it can even be an alternative for hypervisor-agnostic
solution.
> > > > (Please note memory regions in ivshmem appear as a PCI device
and
can be
> > mapped locally.) > > > > I want to add this shared memory aspect to my virtio-proxy, but > > the resultant solution would eventually look similar to ivshmem. > > > > > > * BE guests boots with a hypervisor handle to memory > > > > > > > > The BE guest is then free to map the FE's memory to where
it
wants in
> > > > the BE's guest physical address space. > > > > > > I cannot see how this could work for Xen. There is no "handle"
to
give
> > > to the backend if the backend is not running in dom0. So for
Xen I
think
> > > the memory has to be already mapped > > > > In Xen's IOREQ solution (virtio-blk), the following information
is
expected
> > to be exposed to BE via Xenstore: > > (I know that this is a tentative approach though.) > > - the start address of configuration space > > - interrupt number > > - file path for backing storage > > - read-only flag > > And the BE server have to call a particular hypervisor interface
to
> > map the configuration space. > > Yes, Xenstore was chosen as a simple way to pass configuration
info to
the backend running in a non-toolstack domain.
> I remember, there was a wish to avoid using Xenstore in Virtio
backend
itself if possible, so for non-toolstack domain, this could done with adjusting devd (daemon that listens for devices and launches backends)
> to read backend configuration from the Xenstore anyway and pass it
to
the backend via command line arguments.
>
Yes, in current PoC code we're using xenstore to pass device
configuration.
We also designed a static device configuration parse method for
Dom0less
or
other scenarios don't have xentool. yes, it's from device model
command
line
or a config file.
> But, if ... > > > > > In my approach (virtio-proxy), all those Xen (or hypervisor)-
specific
> > stuffs are contained in virtio-proxy, yet another VM, to hide
all
details.
> > ... the solution how to overcome that is already found and proven
to
work then even better.
> > > > > # My point is that a "handle" is not mandatory for executing
mapping.
> > > > > and the mapping probably done by the > > > toolstack (also see below.) Or we would have to invent a new
Xen
> > > hypervisor interface and Xen virtual machine privileges to
allow
this
> > > kind of mapping. > > > > > If we run the backend in Dom0 that we have no problems of
course.
> > > > One of difficulties on Xen that I found in my approach is that
calling
> > such hypervisor intefaces (registering IOREQ, mapping memory) is
only
> > allowed on BE servers themselvies and so we will have to extend
those
> > interfaces. > > This, however, will raise some concern on security and privilege
distribution
> > as Stefan suggested. > > We also faced policy related issues with Virtio backend running in
other than Dom0 domain in a "dummy" xsm mode. In our target system we
run
the backend in a driver
> domain (we call it DomD) where the underlying H/W resides. We
trust it,
so we wrote policy rules (to be used in "flask" xsm mode) to provide
it
with a little bit more privileges than a simple DomU had.
> Now it is permitted to issue device-model, resource and memory
mappings, etc calls.
> > > > > > > > > > > To activate the mapping will > > > > require some sort of hypercall to the hypervisor. I can see
two
options
> > > > at this point: > > > > > > > > - expose the handle to userspace for daemon/helper to
trigger
the
> > > > mapping via existing hypercall interfaces. If using a
helper
you
> > > > would have a hypervisor specific one to avoid the daemon
having to
> > > > care too much about the details or push that complexity
into
a
> > > > compile time option for the daemon which would result in
different
> > > > binaries although a common source base. > > > > > > > > - expose a new kernel ABI to abstract the hypercall
differences away
> > > > in the guest kernel. In this case the userspace would
essentially
> > > > ask for an abstract "map guest N memory to userspace
ptr"
and let
> > > > the kernel deal with the different hypercall interfaces.
This of
> > > > course assumes the majority of BE guests would be Linux
kernels and
> > > > leaves the bare-metal/unikernel approaches to their own
devices.
> > > > > > > > Operation > > > > ========= > > > > > > > > The core of the operation of VirtIO is fairly simple. Once
the
> > > > vhost-user feature negotiation is done it's a case of
receiving
update
> > > > events and parsing the resultant virt queue for data. The
vhost-
user
> > > > specification handles a bunch of setup before that point,
mostly
to
> > > > detail where the virt queues are set up FD's for memory and
event
> > > > communication. This is where the envisioned stub process
would
be
> > > > responsible for getting the daemon up and ready to run. This
is
> > > > currently done inside a big VMM like QEMU but I suspect a
modern
> > > > approach would be to use the rust-vmm vhost crate. It would
then
either
> > > > communicate with the kernel's abstracted ABI or be re-
targeted
as a
> > > > build option for the various hypervisors. > > > > > > One thing I mentioned before to Alex is that Xen doesn't have
VMMs
the
> > > way they are typically envisioned and described in other
environments.
> > > Instead, Xen has IOREQ servers. Each of them connects
independently to
> > > Xen via the IOREQ interface. E.g. today multiple QEMUs could
be
used as
> > > emulators for a single Xen VM, each of them connecting to Xen > > > independently via the IOREQ interface. > > > > > > The component responsible for starting a daemon and/or setting
up
shared
> > > interfaces is the toolstack: the xl command and the
libxl/libxc
> > > libraries. > > > > I think that VM configuration management (or orchestration in
Startos
> > jargon?) is a subject to debate in parallel. > > Otherwise, is there any good assumption to avoid it right now? > > > > > Oleksandr and others I CCed have been working on ways for the
toolstack
> > > to create virtio backends and setup memory mappings. They
might be
able
> > > to provide more info on the subject. I do think we miss a way
to
provide
> > > the configuration to the backend and anything else that the
backend
> > > might require to start doing its job. > > Yes, some work has been done for the toolstack to handle Virtio
MMIO
devices in
> general and Virtio block devices in particular. However, it has
not
been upstreaned yet.
> Updated patches on review now: > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-
email-
olekstysh@gmail.com/
> > There is an additional (also important) activity to improve/fix
foreign memory mapping on Arm which I am also involved in.
> The foreign memory mapping is proposed to be used for Virtio
backends
(device emulators) if there is a need to run guest OS completely unmodified.
> Of course, the more secure way would be to use grant memory
mapping.
Brietly, the main difference between them is that with foreign mapping
the
backend
> can map any guest memory it wants to map, but with grant mapping
it is
allowed to map only what was previously granted by the frontend.
> > So, there might be a problem if we want to pre-map some guest
memory
in advance or to cache mappings in the backend in order to improve performance (because the mapping/unmapping guest pages every request requires a lot of back and forth to Xen + P2M updates). In a nutshell, currently, in order to map a guest page into the backend address space
we
need to steal a real physical page from the backend domain. So, with
the
said optimizations we might end up with no free memory in the backend domain (see XSA-300). And what we try to achieve is to not waste a
real
domain memory at all by providing safe non-allocated-yet (so unused) address space for the foreign (and grant) pages to be mapped into,
this
enabling work implies Xen and Linux (and likely DTB bindings) changes. However, as it turned out, for this to work in a proper and safe way
some
prereq work needs to be done.
> You can find the related Xen discussion at: > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-
email-
olekstysh@gmail.com/
> > > > > > > > > > > > One question is how to best handle notification and kicks.
The
existing
> > > > vhost-user framework uses eventfd to signal the daemon
(although
QEMU
> > > > is quite capable of simulating them when you use TCG). Xen
has
it's own
> > > > IOREQ mechanism. However latency is an important factor and
having
> > > > events go through the stub would add quite a lot. > > > > > > Yeah I think, regardless of anything else, we want the
backends to
> > > connect directly to the Xen hypervisor. > > > > In my approach, > > a) BE -> FE: interrupts triggered by BE calling a hypervisor
interface
> > via virtio-proxy > > b) FE -> BE: MMIO to config raises events (in event channels),
which is
> > converted to a callback to BE via virtio-proxy > > (Xen's event channel is internnally implemented by
interrupts.)
> > > > I don't know what "connect directly" means here, but sending
interrupts
> > to the opposite side would be best efficient. > > Ivshmem, I suppose, takes this approach by utilizing PCI's msi-x
mechanism.
> > Agree that MSI would be more efficient than SPI... > At the moment, in order to notify the frontend, the backend issues
a
specific device-model call to query Xen to inject a corresponding SPI
to
the guest.
> > > > > > > > > > Could we consider the kernel internally converting IOREQ
messages from
> > > > the Xen hypervisor to eventfd events? Would this scale with
other kernel
> > > > hypercall interfaces? > > > > > > > > So any thoughts on what directions are worth experimenting
with?
> > > > > > One option we should consider is for each backend to connect
to
Xen via
> > > the IOREQ interface. We could generalize the IOREQ interface
and
make it
> > > hypervisor agnostic. The interface is really trivial and easy
to
add.
> > > > As I said above, my proposal does the same thing that you
mentioned
here :)
> > The difference is that I do call hypervisor interfaces via
virtio-
proxy.
> > > > > The only Xen-specific part is the notification mechanism,
which is
an
> > > event channel. If we replaced the event channel with something
else the
> > > interface would be generic. See: > > > https://gitlab.com/xen-project/xen/-
/blob/staging/xen/include/public/hvm/ioreq.h#L52
> > > > > > I don't think that translating IOREQs to eventfd in the kernel
is
a
> > > good idea: if feels like it would be extra complexity and that
the
> > > kernel shouldn't be involved as this is a backend-hypervisor
interface.
> > > > Given that we may want to implement BE as a bare-metal
application
> > as I did on Zephyr, I don't think that the translation would not
be
> > a big issue, especially on RTOS's. > > It will be some kind of abstraction layer of interrupt handling > > (or nothing but a callback mechanism). > > > > > Also, eventfd is very Linux-centric and we are trying to
design an
> > > interface that could work well for RTOSes too. If we want to
do
> > > something different, both OS-agnostic and hypervisor-agnostic,
perhaps
> > > we could design a new interface. One that could be
implementable
in the
> > > Xen hypervisor itself (like IOREQ) and of course any other
hypervisor
> > > too. > > > > > > > > > There is also another problem. IOREQ is probably not be the
only
> > > interface needed. Have a look at > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we
also need
> > > an interface for the backend to inject interrupts into the
frontend? And
> > > if the backend requires dynamic memory mappings of frontend
pages,
then
> > > we would also need an interface to map/unmap domU pages. > > > > My proposal document might help here; All the interfaces
required
for
> > virtio-proxy (or hypervisor-related interfaces) are listed as > > RPC protocols :) > > > > > These interfaces are a lot more problematic than IOREQ: IOREQ
is
tiny
> > > and self-contained. It is easy to add anywhere. A new
interface to
> > > inject interrupts or map pages is more difficult to manage
because
it
> > > would require changes scattered across the various emulators. > > > > Exactly. I have no confident yet that my approach will also
apply
> > to other hypervisors than Xen. > > Technically, yes, but whether people can accept it or not is a
different
> > matter.
> >
> > Thanks,
> > -Takahiro Akashi
>
> --
> Regards,
>
> Oleksandr Tyshchenko
Hi Akashi,
On 2021年8月26日 17:41, AKASHI Takahiro <takahiro.akashi@linaro.org> wrote:
Hi Wei,
On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
Hi Akashi,
On 2021年8月18日 13:39, AKASHI Takahiro <takahiro.akashi@linaro.org> wrote:
On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
Hi Akashi,
On 2021年8月17日 16:08, AKASHI Takahiro <takahiro.akashi@linaro.org> wrote:
Hi Wei, Oleksandr,
On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote: > Hi All, > > Thanks for Stefano to link my kvmtool for Xen proposal here. > This proposal is still discussing in Xen and KVM communities. > The main work is to decouple the kvmtool from KVM and make > other hypervisors can reuse the virtual device implementations. > > In this case, we need to introduce an intermediate hypervisor > layer for VMM abstraction, Which is, I think it's very close > to stratos' virtio hypervisor agnosticism work.
# My proposal[1] comes from my own idea and doesn't always represent
# Linaro's view on this subject nor reflect Alex's concerns. Nevertheless,
your idea and my proposal seem to share the same background. Both have a similar goal and currently start with Xen and are based on kvm-tool. (Actually, my work is derived from EPAM's virtio-disk, which is also based on kvm-tool.)
In particular, the abstraction of hypervisor interfaces ends up with the same set of interfaces (your "struct vmm_impl" and my "RPC interfaces"). This is no coincidence, as we both share the same origin, as I said above.
And so we will also share the same issues. One of them is the way of "sharing/mapping the FE's memory". There is a trade-off between portability and performance impact, so we can discuss that topic here on this ML, too. (See Alex's original email, too.)
Yes, I agree.
On the other hand, my approach aims to create a "single-binary" solution in which the same BE VM binary could run on any hypervisor. It is somewhat similar to your "proposal-#2" in [2], but in my solution all the hypervisor-specific code would be put into another entity (VM), named "virtio-proxy", and the abstracted operations are served via RPC. (In this sense, the BE is hypervisor-agnostic but might have an OS dependency.)
But I know we still need to discuss whether this is a requirement for the Stratos project or not. (Maybe not.)
Sorry, I haven't had time to finish reading your virtio-proxy completely (I will do it ASAP). But from your description, it seems we need a 3rd VM between FE and BE? My concern is that, if my assumption is right, will it increase the latency of the data transport path, even if we're using some lightweight guest like an RTOS or unikernel?
Yes, you're right. But I'm afraid that it is a matter of degree. As long as we execute 'mapping' operations at every fetch of payload, we will see a latency issue (even in your case), and if we have some solution for it, we won't see it in my proposal either :)
Oleksandr has sent a proposal to the Xen mailing list to reduce this kind of "mapping/unmapping" operation. So the latency caused by this behavior on Xen may eventually be eliminated, and Linux-KVM doesn't have that problem.
Obviously, I have not yet caught up there in the discussion. Which patch specifically?
Can you give me the link to the discussion or patch, please?
It's an RFC discussion. We have tested this RFC patch internally: https://lists.xenproject.org/archives/html/xen-devel/2021-07/msg01532.html
Thanks, -Takahiro Akashi
-Takahiro Akashi
Specifically speaking about kvm-tool, I have a concern about its license terms; targeting different hypervisors and different OSs (which I assume includes RTOSes), the resultant library should be permissively licensed, and the GPL of kvm-tool might be an issue. Any thoughts?
Yes. If a user wants to implement a FreeBSD device model but the virtio library is GPL, then the GPL would be a problem. If we have another good candidate, I am open to it.
I have some candidates, particularly for vq/vring, in my mind:
- Open-AMP, or
- corresponding Free-BSD code
Interesting, I will look into them : )
Cheers, Wei Chen
-Takahiro Akashi
-Takahiro Akashi
[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021- August/000548.html [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
Apologies for being late to this thread, but I hope to be able to contribute to this discussion in a meaningful way. I am grateful for the level of interest in this topic. I would like to draw your attention to Argo as a suitable technology for development of VirtIO's hypervisor-agnostic interfaces.
* Argo is an interdomain communication mechanism in Xen (on x86 and Arm) that can send and receive hypervisor-mediated notifications and messages between domains (VMs). [1] The hypervisor can enforce Mandatory Access Control over all communication between domains. It is derived from the earlier v4v, which has been deployed on millions of machines with the HP/Bromium uXen hypervisor and with OpenXT.
* Argo has a simple interface with a small number of operations that was designed for ease of integration into OS primitives on both Linux (sockets) and Windows (ReadFile/WriteFile) [2]. - A unikernel example of using it has also been developed for XTF. [3]
* There has been recent discussion and support in the Xen community for making revisions to the Argo interface to make it hypervisor-agnostic, and support implementations of Argo on other hypervisors. This will enable a single interface for an OS kernel binary to use for inter-VM communication that will work on multiple hypervisors -- this applies equally to both backends and frontend implementations. [4]
* Here are the design documents for building VirtIO-over-Argo, to support a hypervisor-agnostic frontend VirtIO transport driver using Argo.
The Development Plan to build VirtIO virtual device support over Argo transport: https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Dev...
A design for using VirtIO over Argo, describing how VirtIO data structures and communication is handled over the Argo transport: https://openxt.atlassian.net/wiki/spaces/DC/pages/1348763698/VirtIO+Argo
Diagram (from the above document) showing how VirtIO rings are synchronized between domains without using shared memory: https://openxt.atlassian.net/46e1c93b-2b87-4cb2-951e-abd4377a1194#media-blob...
Please note that the above design documents show that the existing VirtIO device drivers, and both vring and virtqueue data structures can be preserved while interdomain communication can be performed with no shared memory required for most drivers; (the exceptions where further design is required are those such as virtual framebuffer devices where shared memory regions are intentionally added to the communication structure beyond the vrings and virtqueues).
An analysis of VirtIO and Argo, informing the design: https://openxt.atlassian.net/wiki/spaces/DC/pages/1333428225/Analysis+of+Arg...
* Argo can be used for a communication path for configuration between the backend and the toolstack, avoiding the need for a dependency on XenStore, which is an advantage for any hypervisor-agnostic design. It is also amenable to a notification mechanism that is not based on Xen event channels.
* Argo does not use or require shared memory between VMs and provides an alternative to the use of foreign shared memory mappings. It avoids some of the complexities involved with using grants (eg. XSA-300).
* Argo supports Mandatory Access Control by the hypervisor, satisfying a common certification requirement.
* The Argo headers are BSD-licensed and the Xen hypervisor implementation is GPLv2 but accessible via the hypercall interface. The licensing should not present an obstacle to adoption of Argo in guest software or implementation by other hypervisors.
* Since the interface that Argo presents to a guest VM is similar to DMA, a VirtIO-Argo frontend transport driver should be able to operate with a physical VirtIO-enabled smart-NIC if the toolstack and an Argo-aware backend provide support.
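To give a feel for the programming model described in the points above, the sketch below shows a virtqueue completion travelling as a small self-contained message over an Argo-like channel: the frontend registers a receive ring with the hypervisor, and the backend sends a used-ring update to the frontend's (domain, port) address, with the hypervisor copying the bytes, able to apply access control on the way, and no pages shared between the two VMs. The argo_* wrappers and the message layout are hypothetical stand-ins for illustration, not the real Argo hypercall ABI or the VirtIO-Argo design; see [1], [2] and the design documents above for the actual interfaces.

```c
/* Illustrative sketch only: a virtqueue completion delivered as a
 * hypervisor-mediated message (Argo-style) instead of a write into a
 * shared used ring. The argo_* wrappers and the message layout are
 * hypothetical stand-ins, not the real Argo ABI or VirtIO-Argo design. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct argo_addr {
    uint32_t port;        /* service port within the domain */
    uint16_t domain_id;   /* destination domain             */
};

struct vq_used_msg {
    uint16_t queue;       /* which virtqueue              */
    uint16_t head;        /* descriptor chain head index  */
    uint32_t written;     /* bytes written by the device  */
};

/* Hypothetical wrappers -- the real ones would issue Argo operations. */
static int argo_register_ring(const struct argo_addr *me, size_t ring_bytes)
{
    printf("dom%u registers a %zu-byte receive ring on port %u\n",
           me->domain_id, ring_bytes, me->port);
    return 0;
}

static int argo_send(const struct argo_addr *from, const struct argo_addr *to,
                     const void *buf, size_t len)
{
    (void)buf;
    printf("dom%u:%u -> dom%u:%u, %zu bytes copied by the hypervisor "
           "(no pages shared between the VMs)\n",
           from->domain_id, from->port, to->domain_id, to->port, len);
    return (int)len;
}

int main(void)
{
    struct argo_addr frontend = { .port = 4000, .domain_id = 2 };
    struct argo_addr backend  = { .port = 4001, .domain_id = 1 };
    struct vq_used_msg done   = { .queue = 0, .head = 7, .written = 512 };

    /* Frontend side: tell the hypervisor where it wants to receive. */
    argo_register_ring(&frontend, 4096);

    /* Backend side: report a completed request as a plain message;
     * the frontend driver applies it to its local used ring. */
    argo_send(&backend, &frontend, &done, sizeof(done));
    return 0;
}
```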
The next Xen Community Call is next week and I would be happy to answer questions about Argo and on this topic. I will also be following this thread.
Christopher (Argo maintainer, Xen Community)
-------------------------------------------------------------------------------- [1] An introduction to Argo: https://static.sched.com/hosted_files/xensummit19/92/Argo%20and%20HMX%20-%20... https://www.youtube.com/watch?v=cnC0Tg3jqJQ Xen Wiki page for Argo: https://wiki.xenproject.org/wiki/Argo:_Hypervisor-Mediated_Exchange_(HMX)_fo...
[2] OpenXT Linux Argo driver and userspace library: https://github.com/openxt/linux-xen-argo
Windows V4V at OpenXT wiki: https://openxt.atlassian.net/wiki/spaces/DC/pages/14844007/V4V Windows v4v driver source: https://github.com/OpenXT/xc-windows/tree/master/xenv4v
HP/Bromium uXen V4V driver: https://github.com/uxen-virt/uxen/tree/ascara/windows/uxenv4vlib
[3] v2 of the Argo test unikernel for XTF: https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg02234.html
[4] Argo HMX Transport for VirtIO meeting minutes: https://lists.xenproject.org/archives/html/xen-devel/2021-02/msg01422.html
VirtIO-Argo Development wiki page: https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Dev...
On Thu, Aug 26, 2021 at 5:11 AM Wei Chen Wei.Chen@arm.com wrote:
Hi Akashi,
On 2021年8月26日 17:41, AKASHI Takahiro <takahiro.akashi@linaro.org> wrote:
Hi Wei,
On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
Hi Akashi,
-----Original Message----- From: AKASHI Takahiro takahiro.akashi@linaro.org Sent: 2021年8月18日 13:39 To: Wei Chen Wei.Chen@arm.com Cc: Oleksandr Tyshchenko olekstysh@gmail.com; Stefano Stabellini sstabellini@kernel.org; Alex Benn??e alex.bennee@linaro.org;
Stratos
Mailing List stratos-dev@op-lists.linaro.org; virtio-
dev@lists.oasis-
open.org; Arnd Bergmann arnd.bergmann@linaro.org; Viresh Kumar viresh.kumar@linaro.org; Stefano Stabellini stefano.stabellini@xilinx.com; stefanha@redhat.com; Jan Kiszka jan.kiszka@siemens.com; Carl van Schaik
pratikp@quicinc.com; Srivatsa Vaddagiri vatsa@codeaurora.org;
Jean-
Philippe Brucker jean-philippe@linaro.org; Mathieu Poirier mathieu.poirier@linaro.org; Oleksandr Tyshchenko Oleksandr_Tyshchenko@epam.com; Bertrand Marquis Bertrand.Marquis@arm.com; Artem Mygaiev <Artem_Mygaiev@epam.com
; Julien
Grall julien@xen.org; Juergen Gross jgross@suse.com; Paul
Durrant
paul@xen.org; Xen Devel xen-devel@lists.xen.org Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
Hi Akashi,
> -----Original Message----- > From: AKASHI Takahiro takahiro.akashi@linaro.org > Sent: 2021年8月17日 16:08 > To: Wei Chen Wei.Chen@arm.com > Cc: Oleksandr Tyshchenko olekstysh@gmail.com; Stefano
Stabellini
> sstabellini@kernel.org; Alex Benn??e <alex.bennee@linaro.org
;
Stratos
> Mailing List stratos-dev@op-lists.linaro.org; virtio-
dev@lists.oasis-
> open.org; Arnd Bergmann arnd.bergmann@linaro.org; Viresh
Kumar
> viresh.kumar@linaro.org; Stefano Stabellini > stefano.stabellini@xilinx.com; stefanha@redhat.com; Jan
Kiszka
> jan.kiszka@siemens.com; Carl van Schaik
> pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org
; Jean-
> Philippe Brucker jean-philippe@linaro.org; Mathieu Poirier > mathieu.poirier@linaro.org; Oleksandr Tyshchenko > Oleksandr_Tyshchenko@epam.com; Bertrand Marquis > Bertrand.Marquis@arm.com; Artem Mygaiev
Julien
> Grall julien@xen.org; Juergen Gross jgross@suse.com; Paul
Durrant
> paul@xen.org; Xen Devel xen-devel@lists.xen.org > Subject: Re: Enabling hypervisor agnosticism for VirtIO
backends
> > Hi Wei, Oleksandr, > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote: > > Hi All, > > > > Thanks for Stefano to link my kvmtool for Xen proposal here. > > This proposal is still discussing in Xen and KVM communities. > > The main work is to decouple the kvmtool from KVM and make > > other hypervisors can reuse the virtual device
implementations.
> > > > In this case, we need to introduce an intermediate hypervisor > > layer for VMM abstraction, Which is, I think it's very close > > to stratos' virtio hypervisor agnosticism work. > > # My proposal[1] comes from my own idea and doesn't always
represent
> # Linaro's view on this subject nor reflect Alex's concerns.
Nevertheless,
> > Your idea and my proposal seem to share the same background. > Both have the similar goal and currently start with, at first,
Xen
> and are based on kvm-tool. (Actually, my work is derived from > EPAM's virtio-disk, which is also based on kvm-tool.) > > In particular, the abstraction of hypervisor interfaces has a
same
> set of interfaces (for your "struct vmm_impl" and my "RPC
interfaces").
> This is not co-incident as we both share the same origin as I
said
above.
> And so we will also share the same issues. One of them is a way
of
> "sharing/mapping FE's memory". There is some trade-off between > the portability and the performance impact. > So we can discuss the topic here in this ML, too. > (See Alex's original email, too). > Yes, I agree.
> On the other hand, my approach aims to create a "single-binary"
solution
> in which the same binary of BE vm could run on any hypervisors. > Somehow similar to your "proposal-#2" in [2], but in my
solution,
all
> the hypervisor-specific code would be put into another entity
(VM),
> named "virtio-proxy" and the abstracted operations are served
via RPC.
> (In this sense, BE is hypervisor-agnostic but might have OS
dependency.)
> But I know that we need discuss if this is a requirement even > in Stratos project or not. (Maybe not) >
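As an editorial aside to the comparison above between Wei's "struct vmm_impl" and the "RPC interfaces" served by virtio-proxy: the common surface both are circling around could be captured in Rust roughly as the trait below. This is only a sketch with invented names, not the actual Stratos, kvmtool or virtio-proxy interface; a per-hypervisor implementation, or an RPC client talking to a virtio-proxy VM, would sit behind it.

// Illustrative only: the hypervisor-facing operations a portable VirtIO
// backend keeps needing, regardless of whether they are implemented by
// direct hypercalls, a kernel ABI, or RPC to a proxy VM.
pub trait HypervisorOps {
    type Error;

    // Map a region of the frontend guest's memory into the backend's
    // address space and return a pointer to it.
    fn map_frontend_memory(&mut self, guest_addr: u64, len: usize)
        -> Result<*mut u8, Self::Error>;

    // Release a mapping obtained above.
    fn unmap_frontend_memory(&mut self, ptr: *mut u8, len: usize)
        -> Result<(), Self::Error>;

    // Kick the frontend (interrupt, event channel, MSI, ...).
    fn notify_frontend(&mut self, queue_index: u16) -> Result<(), Self::Error>;

    // Block until the frontend kicks the backend (eventfd, event channel,
    // or a callback delivered over RPC).
    fn wait_for_kick(&mut self) -> Result<u16, Self::Error>;
}

Whether such an interface is satisfied by hypervisor-specific builds or by a single binary whose implementation forwards everything over RPC to a virtio-proxy VM is exactly the trade-off being debated in this part of the thread.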
Sorry, I haven't had time to finish reading your virtio-proxy completely (I will do it ASAP). But from your description, it seems we need a 3rd VM between FE and BE? My concern is that, if my assumption is right, will it increase the latency in data transport path? Even if we're using some lightweight guest like RTOS or Unikernel,

Yes, you're right. But I'm afraid that it is a matter of degree. As far as we execute 'mapping' operations at every fetch of payload, we will see latency issue (even in your case) and if we have some solution for it, we won't see it neither in my proposal :)

Oleksandr has sent a proposal to Xen mailing list to reduce this kind of "mapping/unmapping" operations. So the latency caused by this behavior on Xen may eventually be eliminated, and Linux-KVM doesn't have that problem.

Obviously, I have not yet caught up there in the discussion. Which patch specifically? Can you give me the link to the discussion or patch, please?

It's a RFC discussion. We have tested this RFC patch internally. https://lists.xenproject.org/archives/html/xen-devel/2021-07/msg01532.html

Thanks, -Takahiro Akashi
-Takahiro Akashi
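To illustrate why the "mapping at every fetch of payload" cost discussed above dominates, here is a minimal, purely hypothetical sketch of the obvious mitigation: cache established mappings per guest page and only tear them down explicitly. map_page and unmap_page are placeholders for whatever primitive is actually used (foreign map, grant map, or an RPC to a proxy VM); this sketch is not a description of the RFC patch linked above.

use std::collections::HashMap;

// Placeholder handle for one mapped guest page.
struct Mapping {
    host_ptr: *mut u8,
}

// Stand-ins for the real hypervisor-specific map/unmap primitives.
fn map_page(_guest_pfn: u64) -> Mapping {
    Mapping { host_ptr: std::ptr::null_mut() }
}
fn unmap_page(_mapping: Mapping) {}

// Keep mappings around so requests touching the same guest pages do not pay
// the map/unmap round trip (and the associated P2M churn) every time.
struct MappingCache {
    pages: HashMap<u64, Mapping>,
}

impl MappingCache {
    fn new() -> Self {
        Self { pages: HashMap::new() }
    }

    fn get(&mut self, guest_pfn: u64) -> *mut u8 {
        self.pages
            .entry(guest_pfn)
            .or_insert_with(|| map_page(guest_pfn))
            .host_ptr
    }

    fn flush(&mut self) {
        for (_pfn, mapping) in self.pages.drain() {
            unmap_page(mapping);
        }
    }
}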
> Specifically speaking about kvm-tool, I have a concern about its
> license term; Targeting different hypervisors and different OSs
> (which I assume includes RTOS's), the resultant library should be
> license permissive and GPL for kvm-tool might be an issue.
> Any thoughts?

Yes. If user want to implement a FreeBSD device model, but the virtio library is GPL. Then GPL would be a problem. If we have another good candidate, I am open to it.

I have some candidates, particularly for vq/vring, in my mind:
- Open-AMP, or
- corresponding Free-BSD code

Interesting, I will look into them : )

Cheers, Wei Chen
-Takahiro Akashi
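For readers weighing the vq/vring candidates mentioned above (Open-AMP, the corresponding FreeBSD code): whichever library is chosen, the on-ring data structures themselves are small and fixed by the VirtIO specification, so the licensing question is arguably mostly about the surrounding helper code. For illustration only, the split-ring descriptor layout from the specification rendered in Rust:

// Split-virtqueue descriptor as laid out by the VirtIO specification
// (struct virtq_desc): buffer address, length, flags, and the index of the
// next descriptor when buffers are chained.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct VirtqDesc {
    pub addr: u64,  // guest-physical address of the buffer
    pub len: u32,   // buffer length in bytes
    pub flags: u16, // combination of the VIRTQ_DESC_F_* flags below
    pub next: u16,  // valid only when VIRTQ_DESC_F_NEXT is set
}

pub const VIRTQ_DESC_F_NEXT: u16 = 1;     // descriptor chain continues via `next`
pub const VIRTQ_DESC_F_WRITE: u16 = 2;    // device writes into this buffer
pub const VIRTQ_DESC_F_INDIRECT: u16 = 4; // buffer holds an indirect descriptor table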
> -Takahiro Akashi > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021- > August/000548.html > [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2 > > > > > > From: Oleksandr Tyshchenko olekstysh@gmail.com > > > Sent: 2021年8月14日 23:38 > > > To: AKASHI Takahiro takahiro.akashi@linaro.org; Stefano
Stabellini
> sstabellini@kernel.org > > > Cc: Alex Benn??e alex.bennee@linaro.org; Stratos Mailing
List
> stratos-dev@op-lists.linaro.org; virtio-dev@lists.oasis-
open.org;
Arnd
> Bergmann arnd.bergmann@linaro.org; Viresh Kumar > viresh.kumar@linaro.org; Stefano Stabellini > stefano.stabellini@xilinx.com; stefanha@redhat.com; Jan
Kiszka
> jan.kiszka@siemens.com; Carl van Schaik
> pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org
; Jean-
> Philippe Brucker jean-philippe@linaro.org; Mathieu Poirier > mathieu.poirier@linaro.org; Wei Chen Wei.Chen@arm.com;
Oleksandr
> Tyshchenko Oleksandr_Tyshchenko@epam.com; Bertrand Marquis > Bertrand.Marquis@arm.com; Artem Mygaiev
Julien
> Grall julien@xen.org; Juergen Gross jgross@suse.com; Paul
Durrant
> paul@xen.org; Xen Devel xen-devel@lists.xen.org > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
backends
> > > > > > Hello, all. > > > > > > Please see some comments below. And sorry for the possible
format
> issues. > > > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro > mailto:takahiro.akashi@linaro.org wrote: > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano
Stabellini
wrote:
> > > > > CCing people working on Xen+VirtIO and IOREQs. Not
trimming
the
> original > > > > > email to let them read the full context. > > > > > > > > > > My comments below are related to a potential Xen
implementation,
> not > > > > > because it is the only implementation that matters, but
because it
> is > > > > > the one I know best. > > > > > > > > Please note that my proposal (and hence the working
prototype)[1]
> > > > is based on Xen's virtio implementation (i.e. IOREQ) and > particularly > > > > EPAM's virtio-disk application (backend server). > > > > It has been, I believe, well generalized but is still a
bit
biased
> > > > toward this original design. > > > > > > > > So I hope you like my approach :) > > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-
dev/2021-
> August/000546.html > > > > > > > > Let me take this opportunity to explain a bit more about
my
approach
> below. > > > > > > > > > Also, please see this relevant email thread: > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2 > > > > > > > > > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote: > > > > > > Hi, > > > > > > > > > > > > One of the goals of Project Stratos is to enable
hypervisor
> agnostic > > > > > > backends so we can enable as much re-use of code as
possible
and
> avoid > > > > > > repeating ourselves. This is the flip side of the
front end
> where > > > > > > multiple front-end implementations are required - one
per OS,
> assuming > > > > > > you don't just want Linux guests. The resultant
guests
are
> trivially > > > > > > movable between hypervisors modulo any abstracted
paravirt
type
> > > > > > interfaces. > > > > > > > > > > > > In my original thumb nail sketch of a solution I
envisioned
> vhost-user > > > > > > daemons running in a broadly POSIX like environment.
The
> interface to > > > > > > the daemon is fairly simple requiring only some
mapped
memory
> and some > > > > > > sort of signalling for events (on Linux this is
eventfd).
The
> idea was a > > > > > > stub binary would be responsible for any hypervisor
specific
> setup and > > > > > > then launch a common binary to deal with the actual
virtqueue
> requests > > > > > > themselves. > > > > > > > > > > > > Since that original sketch we've seen an expansion in
the
sort
> of ways > > > > > > backends could be created. There is interest in
encapsulating
> backends > > > > > > in RTOSes or unikernels for solutions like SCMI.
There
interest
> in Rust > > > > > > has prompted ideas of using the trait interface to
abstract
> differences > > > > > > away as well as the idea of bare-metal Rust backends. > > > > > > > > > > > > We have a card (STR-12) called "Hypercall
Standardisation"
which
> > > > > > calls for a description of the APIs needed from the
hypervisor
> side to > > > > > > support VirtIO guests and their backends. However we
are
some
> way off > > > > > > from that at the moment as I think we need to at
least
> demonstrate one > > > > > > portable backend before we start codifying
requirements. To
that
> end I > > > > > > want to think about what we need for a backend to
function.
> > > > > > > > > > > > Configuration > > > > > > ============= > > > > > > > > > > > > In the type-2 setup this is typically fairly simple
because
the
> host > > > > > > system can orchestrate the various modules that make
up the
> complete > > > > > > system. In the type-1 case (or even type-2 with
delegated
> service VMs) > > > > > > we need some sort of mechanism to inform the backend
VM
about
> key > > > > > > details about the system: > > > > > > > > > > > > - where virt queue memory is in it's address space > > > > > > - how it's going to receive (interrupt) and trigger
(kick)
> events > > > > > > - what (if any) resources the backend needs to
connect to
> > > > > > > > > > > > Obviously you can elide over configuration issues by
having
> static > > > > > > configurations and baking the assumptions into your
guest
images
> however > > > > > > this isn't scalable in the long term. The obvious
solution
seems
> to be > > > > > > extending a subset of Device Tree data to user space
but
perhaps
> there > > > > > > are other approaches? > > > > > > > > > > > > Before any virtio transactions can take place the
appropriate
> memory > > > > > > mappings need to be made between the FE guest and the
BE
guest.
> > > > > > > > > > > Currently the whole of the FE guests address space
needs to
be
> visible > > > > > > to whatever is serving the virtio requests. I can
envision 3
> approaches: > > > > > > > > > > > > * BE guest boots with memory already mapped > > > > > > > > > > > > This would entail the guest OS knowing where in it's
Guest
> Physical > > > > > > Address space is already taken up and avoiding
clashing. I
> would assume > > > > > > in this case you would want a standard interface to
userspace
> to then > > > > > > make that address space visible to the backend
daemon.
> > > > > > > > Yet another way here is that we would have well known
"shared
> memory" between > > > > VMs. I think that Jailhouse's ivshmem gives us good
insights on
this
> matter > > > > and that it can even be an alternative for hypervisor-
agnostic
> solution. > > > > > > > > (Please note memory regions in ivshmem appear as a PCI
device
and
> can be > > > > mapped locally.) > > > > > > > > I want to add this shared memory aspect to my
virtio-proxy,
but
> > > > the resultant solution would eventually look similar to
ivshmem.
> > > > > > > > > > * BE guests boots with a hypervisor handle to memory > > > > > > > > > > > > The BE guest is then free to map the FE's memory to
where
it
> wants in > > > > > > the BE's guest physical address space. > > > > > > > > > > I cannot see how this could work for Xen. There is no
"handle"
to
> give > > > > > to the backend if the backend is not running in dom0.
So
for
Xen I
> think > > > > > the memory has to be already mapped > > > > > > > > In Xen's IOREQ solution (virtio-blk), the following
information
is
> expected > > > > to be exposed to BE via Xenstore: > > > > (I know that this is a tentative approach though.) > > > > - the start address of configuration space > > > > - interrupt number > > > > - file path for backing storage > > > > - read-only flag > > > > And the BE server have to call a particular hypervisor
interface
to
> > > > map the configuration space. > > > > > > Yes, Xenstore was chosen as a simple way to pass
configuration
info to
> the backend running in a non-toolstack domain. > > > I remember, there was a wish to avoid using Xenstore in
Virtio
backend
> itself if possible, so for non-toolstack domain, this could
done
with
> adjusting devd (daemon that listens for devices and launches
backends)
> > > to read backend configuration from the Xenstore anyway and
pass it
to
> the backend via command line arguments. > > > > > > > Yes, in current PoC code we're using xenstore to pass device > configuration. > > We also designed a static device configuration parse method
for
Dom0less
> or > > other scenarios don't have xentool. yes, it's from device
model
command
> line > > or a config file. > > > > > But, if ... > > > > > > > > > > > In my approach (virtio-proxy), all those Xen (or
hypervisor)-
> specific > > > > stuffs are contained in virtio-proxy, yet another VM, to
hide
all
> details. > > > > > > ... the solution how to overcome that is already found and
proven
to
> work then even better. > > > > > > > > > > > > > # My point is that a "handle" is not mandatory for
executing
mapping.
> > > > > > > > > and the mapping probably done by the > > > > > toolstack (also see below.) Or we would have to invent
a
new
Xen
> > > > > hypervisor interface and Xen virtual machine privileges
to
allow
> this > > > > > kind of mapping. > > > > > > > > > If we run the backend in Dom0 that we have no problems
of
course.
> > > > > > > > One of difficulties on Xen that I found in my approach is
that
> calling > > > > such hypervisor intefaces (registering IOREQ, mapping
memory) is
> only > > > > allowed on BE servers themselvies and so we will have to
extend
> those > > > > interfaces. > > > > This, however, will raise some concern on security and
privilege
> distribution > > > > as Stefan suggested. > > > > > > We also faced policy related issues with Virtio backend
running in
> other than Dom0 domain in a "dummy" xsm mode. In our target
system we
run
> the backend in a driver > > > domain (we call it DomD) where the underlying H/W resides.
We
trust it,
> so we wrote policy rules (to be used in "flask" xsm mode) to
provide
it
> with a little bit more privileges than a simple DomU had. > > > Now it is permitted to issue device-model, resource and
memory
> mappings, etc calls. > > > > > > > > > > > > > > > > > > > To activate the mapping will > > > > > > require some sort of hypercall to the hypervisor. I
can see
two
> options > > > > > > at this point: > > > > > > > > > > > > - expose the handle to userspace for daemon/helper
to
trigger
> the > > > > > > mapping via existing hypercall interfaces. If
using a
helper
> you > > > > > > would have a hypervisor specific one to avoid the
daemon
> having to > > > > > > care too much about the details or push that
complexity
into
> a > > > > > > compile time option for the daemon which would
result in
> different > > > > > > binaries although a common source base. > > > > > > > > > > > > - expose a new kernel ABI to abstract the hypercall > differences away > > > > > > in the guest kernel. In this case the userspace
would
> essentially > > > > > > ask for an abstract "map guest N memory to
userspace
ptr"
> and let > > > > > > the kernel deal with the different hypercall
interfaces.
> This of > > > > > > course assumes the majority of BE guests would be
Linux
> kernels and > > > > > > leaves the bare-metal/unikernel approaches to
their own
> devices. > > > > > > > > > > > > Operation > > > > > > ========= > > > > > > > > > > > > The core of the operation of VirtIO is fairly simple.
Once
the
> > > > > > vhost-user feature negotiation is done it's a case of
receiving
> update > > > > > > events and parsing the resultant virt queue for data.
The
vhost-
> user > > > > > > specification handles a bunch of setup before that
point,
mostly
> to > > > > > > detail where the virt queues are set up FD's for
memory and
> event > > > > > > communication. This is where the envisioned stub
process
would
> be > > > > > > responsible for getting the daemon up and ready to
run.
This
is
> > > > > > currently done inside a big VMM like QEMU but I
suspect a
modern
> > > > > > approach would be to use the rust-vmm vhost crate. It
would
then
> either > > > > > > communicate with the kernel's abstracted ABI or be
re-
targeted
> as a > > > > > > build option for the various hypervisors. > > > > > > > > > > One thing I mentioned before to Alex is that Xen
doesn't
have
VMMs
> the > > > > > way they are typically envisioned and described in
other
> environments. > > > > > Instead, Xen has IOREQ servers. Each of them connects > independently to > > > > > Xen via the IOREQ interface. E.g. today multiple QEMUs
could
be
> used as > > > > > emulators for a single Xen VM, each of them connecting
to Xen
> > > > > independently via the IOREQ interface. > > > > > > > > > > The component responsible for starting a daemon and/or
setting
up
> shared > > > > > interfaces is the toolstack: the xl command and the
libxl/libxc
> > > > > libraries. > > > > > > > > I think that VM configuration management (or
orchestration
in
> Startos > > > > jargon?) is a subject to debate in parallel. > > > > Otherwise, is there any good assumption to avoid it right
now?
> > > > > > > > > Oleksandr and others I CCed have been working on ways
for the
> toolstack > > > > > to create virtio backends and setup memory mappings.
They
might be
> able > > > > > to provide more info on the subject. I do think we miss
a way
to
> provide > > > > > the configuration to the backend and anything else that
the
> backend > > > > > might require to start doing its job. > > > > > > Yes, some work has been done for the toolstack to handle
Virtio
MMIO
> devices in > > > general and Virtio block devices in particular. However, it
has
not
> been upstreaned yet. > > > Updated patches on review now: > > > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-
send-
email-
> olekstysh@gmail.com/ > > > > > > There is an additional (also important) activity to
improve/fix
> foreign memory mapping on Arm which I am also involved in. > > > The foreign memory mapping is proposed to be used for
Virtio
backends
> (device emulators) if there is a need to run guest OS
completely
> unmodified. > > > Of course, the more secure way would be to use grant memory
mapping.
> Brietly, the main difference between them is that with foreign
mapping
the
> backend > > > can map any guest memory it wants to map, but with grant
mapping
it is
> allowed to map only what was previously granted by the
frontend.
> > > > > > So, there might be a problem if we want to pre-map some
guest
memory
> in advance or to cache mappings in the backend in order to
improve
> performance (because the mapping/unmapping guest pages every
request
> requires a lot of back and forth to Xen + P2M updates). In a
nutshell,
> currently, in order to map a guest page into the backend
address
space
we
> need to steal a real physical page from the backend domain. So,
with
the
> said optimizations we might end up with no free memory in the
backend
> domain (see XSA-300). And what we try to achieve is to not
waste
a
real
> domain memory at all by providing safe non-allocated-yet (so
unused)
> address space for the foreign (and grant) pages to be mapped
into,
this
> enabling work implies Xen and Linux (and likely DTB bindings)
changes.
> However, as it turned out, for this to work in a proper and
safe
way
some
> prereq work needs to be done. > > > You can find the related Xen discussion at: > > > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-
send-
email-
> olekstysh@gmail.com/ > > > > > > > > > > > > > > > > > > > > > > One question is how to best handle notification and
kicks.
The
> existing > > > > > > vhost-user framework uses eventfd to signal the
daemon
(although
> QEMU > > > > > > is quite capable of simulating them when you use
TCG).
Xen
has
> it's own > > > > > > IOREQ mechanism. However latency is an important
factor and
> having > > > > > > events go through the stub would add quite a lot. > > > > > > > > > > Yeah I think, regardless of anything else, we want the
backends to
> > > > > connect directly to the Xen hypervisor. > > > > > > > > In my approach, > > > > a) BE -> FE: interrupts triggered by BE calling a
hypervisor
> interface > > > > via virtio-proxy > > > > b) FE -> BE: MMIO to config raises events (in event
channels),
> which is > > > > converted to a callback to BE via virtio-
proxy
> > > > (Xen's event channel is internnally
implemented by
> interrupts.) > > > > > > > > I don't know what "connect directly" means here, but
sending
> interrupts > > > > to the opposite side would be best efficient. > > > > Ivshmem, I suppose, takes this approach by utilizing
PCI's
msi-x
> mechanism. > > > > > > Agree that MSI would be more efficient than SPI... > > > At the moment, in order to notify the frontend, the backend
issues
a
> specific device-model call to query Xen to inject a
corresponding SPI
to
> the guest. > > > > > > > > > > > > > > > > > > > > Could we consider the kernel internally converting
IOREQ
> messages from > > > > > > the Xen hypervisor to eventfd events? Would this
scale
with
> other kernel > > > > > > hypercall interfaces? > > > > > > > > > > > > So any thoughts on what directions are worth
experimenting
with?
> > > > > > > > > > One option we should consider is for each backend to
connect
to
> Xen via > > > > > the IOREQ interface. We could generalize the IOREQ
interface
and
> make it > > > > > hypervisor agnostic. The interface is really trivial
and
easy
to
> add. > > > > > > > > As I said above, my proposal does the same thing that you
mentioned
> here :) > > > > The difference is that I do call hypervisor interfaces
via
virtio-
> proxy. > > > > > > > > > The only Xen-specific part is the notification
mechanism,
which is
> an > > > > > event channel. If we replaced the event channel with
something
> else the > > > > > interface would be generic. See: > > > > > https://gitlab.com/xen-project/xen/- > /blob/staging/xen/include/public/hvm/ioreq.h#L52 > > > > > > > > > > I don't think that translating IOREQs to eventfd in the
kernel
is
> a > > > > > good idea: if feels like it would be extra complexity
and that
the
> > > > > kernel shouldn't be involved as this is a backend-
hypervisor
> interface. > > > > > > > > Given that we may want to implement BE as a bare-metal
application
> > > > as I did on Zephyr, I don't think that the translation
would not
be
> > > > a big issue, especially on RTOS's. > > > > It will be some kind of abstraction layer of interrupt
handling
> > > > (or nothing but a callback mechanism). > > > > > > > > > Also, eventfd is very Linux-centric and we are trying
to
design an
> > > > > interface that could work well for RTOSes too. If we
want to
do
> > > > > something different, both OS-agnostic and hypervisor-
agnostic,
> perhaps > > > > > we could design a new interface. One that could be
implementable
> in the > > > > > Xen hypervisor itself (like IOREQ) and of course any
other
> hypervisor > > > > > too. > > > > > > > > > > > > > > > There is also another problem. IOREQ is probably not be
the
only
> > > > > interface needed. Have a look at > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2.
Don't we
> also need > > > > > an interface for the backend to inject interrupts into
the
> frontend? And > > > > > if the backend requires dynamic memory mappings of
frontend
pages,
> then > > > > > we would also need an interface to map/unmap domU
pages.
> > > > > > > > My proposal document might help here; All the interfaces
required
> for > > > > virtio-proxy (or hypervisor-related interfaces) are
listed
as
> > > > RPC protocols :) > > > > > > > > > These interfaces are a lot more problematic than IOREQ:
IOREQ
is
> tiny > > > > > and self-contained. It is easy to add anywhere. A new
interface to
> > > > > inject interrupts or map pages is more difficult to
manage
because
> it > > > > > would require changes scattered across the various
emulators.
> > > > > > > > Exactly. I have no confident yet that my approach will
also
apply
> > > > to other hypervisors than Xen. > > > > Technically, yes, but whether people can accept it or not
is a
> different > > > > matter. > > > > Thanks, > > > > -Takahiro Akashi > > > > > > > > -- > > > Regards, > > > > > > Oleksandr Tyshchenko
[ resending message to ensure delivery to the CCd mailing lists post-subscription ]
Apologies for being late to this thread, but I hope to be able to contribute to this discussion in a meaningful way. I am grateful for the level of interest in this topic. I would like to draw your attention to Argo as a suitable technology for development of VirtIO's hypervisor-agnostic interfaces.
* Argo is an interdomain communication mechanism in Xen (on x86 and Arm) that can send and receive hypervisor-mediated notifications and messages between domains (VMs). [1] The hypervisor can enforce Mandatory Access Control over all communication between domains. It is derived from the earlier v4v, which has been deployed on millions of machines with the HP/Bromium uXen hypervisor and with OpenXT.
* Argo has a simple interface with a small number of operations that was designed for ease of integration into OS primitives on both Linux (sockets) and Windows (ReadFile/WriteFile) [2].
- A unikernel example of using it has also been developed for XTF. [3]
* There has been recent discussion and support in the Xen community for making revisions to the Argo interface to make it hypervisor-agnostic, and support implementations of Argo on other hypervisors. This will enable a single interface for an OS kernel binary to use for inter-VM communication that will work on multiple hypervisors -- this applies equally to both backends and frontend implementations. [4]
* Here are the design documents for building VirtIO-over-Argo, to support a hypervisor-agnostic frontend VirtIO transport driver using Argo.
The Development Plan to build VirtIO virtual device support over Argo transport: https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Dev...
A design for using VirtIO over Argo, describing how VirtIO data structures and communication is handled over the Argo transport: https://openxt.atlassian.net/wiki/spaces/DC/pages/1348763698/VirtIO+Argo
Diagram (from the above document) showing how VirtIO rings are synchronized between domains without using shared memory: https://openxt.atlassian.net/46e1c93b-2b87-4cb2-951e-abd4377a1194#media-blob...
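The linked documents define the actual VirtIO-Argo protocol; the fragment below is only a generic, invented illustration of the underlying idea the diagram shows, namely that two domains can each keep a private copy of a ring and keep them consistent by exchanging small hypervisor-mediated messages rather than by sharing pages. HmxChannel and RingUpdate are hypothetical names, not part of Argo or of the design above.

// Illustration only: publish a ring-state change as a message through a
// hypervisor-mediated channel instead of writing it into shared memory.
pub trait HmxChannel {
    fn send(&self, msg: &[u8]);
}

#[derive(Clone, Copy)]
pub struct RingUpdate {
    pub avail_idx: u16, // producer index after the update
    pub desc_addr: u64, // guest-physical address of the described buffer
    pub desc_len: u32,  // length of that buffer
}

pub fn publish_update<C: HmxChannel>(chan: &C, update: RingUpdate) {
    let mut msg = Vec::with_capacity(14);
    msg.extend_from_slice(&update.avail_idx.to_le_bytes());
    msg.extend_from_slice(&update.desc_addr.to_le_bytes());
    msg.extend_from_slice(&update.desc_len.to_le_bytes());
    chan.send(&msg);
}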
Please note that the above design documents show that the existing VirtIO device drivers, and both the vring and virtqueue data structures, can be preserved while interdomain communication is performed with no shared memory for most drivers. (The exceptions, where further design is required, are devices such as virtual framebuffers, where shared memory regions are intentionally added to the communication structure beyond the vrings and virtqueues.)
An analysis of VirtIO and Argo, informing the design: https://openxt.atlassian.net/wiki/spaces/DC/pages/1333428225/Analysis+of+Arg...
* Argo can be used for a communication path for configuration between the backend and the toolstack, avoiding the need for a dependency on XenStore, which is an advantage for any hypervisor-agnostic design. It is also amenable to a notification mechanism that is not based on Xen event channels.
* Argo does not use or require shared memory between VMs and provides an alternative to the use of foreign shared memory mappings. It avoids some of the complexities involved with using grants (eg. XSA-300).
* Argo supports Mandatory Access Control by the hypervisor, satisfying a common certification requirement.
* The Argo headers are BSD-licensed and the Xen hypervisor implementation is GPLv2 but accessible via the hypercall interface. The licensing should not present an obstacle to adoption of Argo in guest software or implementation by other hypervisors.
* Since the interface that Argo presents to a guest VM is similar to DMA, a VirtIO-Argo frontend transport driver should be able to operate with a physical VirtIO-enabled smart-NIC if the toolstack and an Argo-aware backend provide support.
The next Xen Community Call is next week and I would be happy to answer questions about Argo and on this topic. I will also be following this thread.
Christopher (Argo maintainer, Xen Community)
--------------------------------------------------------------------------------
[1] An introduction to Argo:
https://static.sched.com/hosted_files/xensummit19/92/Argo%20and%20HMX%20-%20...
https://www.youtube.com/watch?v=cnC0Tg3jqJQ
Xen Wiki page for Argo:
https://wiki.xenproject.org/wiki/Argo:_Hypervisor-Mediated_Exchange_(HMX)_fo...
[2] OpenXT Linux Argo driver and userspace library: https://github.com/openxt/linux-xen-argo
Windows V4V at OpenXT wiki:
https://openxt.atlassian.net/wiki/spaces/DC/pages/14844007/V4V
Windows v4v driver source:
https://github.com/OpenXT/xc-windows/tree/master/xenv4v
HP/Bromium uXen V4V driver: https://github.com/uxen-virt/uxen/tree/ascara/windows/uxenv4vlib
[3] v2 of the Argo test unikernel for XTF: https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg02234.html
[4] Argo HMX Transport for VirtIO meeting minutes: https://lists.xenproject.org/archives/html/xen-devel/2021-02/msg01422.html
VirtIO-Argo Development wiki page: https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Dev...
On Thu, Aug 26, 2021 at 5:11 AM Wei Chen Wei.Chen@arm.com wrote:
Hi Akashi,
-----Original Message-----
From: AKASHI Takahiro takahiro.akashi@linaro.org
Sent: 26 August 2021 17:41
To: Wei Chen Wei.Chen@arm.com
Cc: Oleksandr Tyshchenko olekstysh@gmail.com; Stefano Stabellini sstabellini@kernel.org; Alex Bennée alex.bennee@linaro.org; Kaly Xin Kaly.Xin@arm.com; Stratos Mailing List stratos-dev@op-lists.linaro.org; virtio-dev@lists.oasis-open.org; Arnd Bergmann arnd.bergmann@linaro.org; Viresh Kumar viresh.kumar@linaro.org; Stefano Stabellini stefano.stabellini@xilinx.com; stefanha@redhat.com; Jan Kiszka jan.kiszka@siemens.com; Carl van Schaik cvanscha@qti.qualcomm.com; pratikp@quicinc.com; Srivatsa Vaddagiri vatsa@codeaurora.org; Jean-Philippe Brucker jean-philippe@linaro.org; Mathieu Poirier mathieu.poirier@linaro.org; Oleksandr Tyshchenko Oleksandr_Tyshchenko@epam.com; Bertrand Marquis Bertrand.Marquis@arm.com; Artem Mygaiev Artem_Mygaiev@epam.com; Julien Grall julien@xen.org; Juergen Gross jgross@suse.com; Paul Durrant paul@xen.org; Xen Devel xen-devel@lists.xen.org
Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
Hi Wei,
On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
Hi Akashi,
-----Original Message----- From: AKASHI Takahiro takahiro.akashi@linaro.org Sent: 2021年8月18日 13:39 To: Wei Chen Wei.Chen@arm.com Cc: Oleksandr Tyshchenko olekstysh@gmail.com; Stefano
Stabellini
sstabellini@kernel.org; Alex Benn??e alex.bennee@linaro.org;
Stratos
Mailing List stratos-dev@op-lists.linaro.org; virtio-
dev@lists.oasis-
open.org; Arnd Bergmann arnd.bergmann@linaro.org; Viresh Kumar viresh.kumar@linaro.org; Stefano Stabellini stefano.stabellini@xilinx.com; stefanha@redhat.com; Jan Kiszka jan.kiszka@siemens.com; Carl van Schaik
pratikp@quicinc.com; Srivatsa Vaddagiri vatsa@codeaurora.org;
Jean-
Philippe Brucker jean-philippe@linaro.org; Mathieu Poirier mathieu.poirier@linaro.org; Oleksandr Tyshchenko Oleksandr_Tyshchenko@epam.com; Bertrand Marquis Bertrand.Marquis@arm.com; Artem Mygaiev <Artem_Mygaiev@epam.com
; Julien
Grall julien@xen.org; Juergen Gross jgross@suse.com; Paul
Durrant
paul@xen.org; Xen Devel xen-devel@lists.xen.org Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote: > Hi Akashi, > > > -----Original Message----- > > From: AKASHI Takahiro takahiro.akashi@linaro.org > > Sent: 2021年8月17日 16:08 > > To: Wei Chen Wei.Chen@arm.com > > Cc: Oleksandr Tyshchenko olekstysh@gmail.com; Stefano
Stabellini
> > sstabellini@kernel.org; Alex Benn??e <
alex.bennee@linaro.org>;
Stratos > > Mailing List stratos-dev@op-lists.linaro.org; virtio- dev@lists.oasis- > > open.org; Arnd Bergmann arnd.bergmann@linaro.org; Viresh
Kumar
> > viresh.kumar@linaro.org; Stefano Stabellini > > stefano.stabellini@xilinx.com; stefanha@redhat.com; Jan
Kiszka
> > jan.kiszka@siemens.com; Carl van Schaik
> > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org
; Jean-
> > Philippe Brucker jean-philippe@linaro.org; Mathieu Poirier > > mathieu.poirier@linaro.org; Oleksandr Tyshchenko > > Oleksandr_Tyshchenko@epam.com; Bertrand Marquis > > Bertrand.Marquis@arm.com; Artem Mygaiev
Julien > > Grall julien@xen.org; Juergen Gross jgross@suse.com; Paul
Durrant
> > paul@xen.org; Xen Devel xen-devel@lists.xen.org > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
backends
> > > > Hi Wei, Oleksandr, > > > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote: > > > Hi All, > > > > > > Thanks for Stefano to link my kvmtool for Xen proposal here. > > > This proposal is still discussing in Xen and KVM
communities.
> > > The main work is to decouple the kvmtool from KVM and make > > > other hypervisors can reuse the virtual device
implementations.
> > > > > > In this case, we need to introduce an intermediate
hypervisor
> > > layer for VMM abstraction, Which is, I think it's very close > > > to stratos' virtio hypervisor agnosticism work. > > > > # My proposal[1] comes from my own idea and doesn't always
represent
> > # Linaro's view on this subject nor reflect Alex's concerns. Nevertheless, > > > > Your idea and my proposal seem to share the same background. > > Both have the similar goal and currently start with, at first,
Xen
> > and are based on kvm-tool. (Actually, my work is derived from > > EPAM's virtio-disk, which is also based on kvm-tool.) > > > > In particular, the abstraction of hypervisor interfaces has a
same
> > set of interfaces (for your "struct vmm_impl" and my "RPC
interfaces").
> > This is not co-incident as we both share the same origin as I
said
above. > > And so we will also share the same issues. One of them is a
way
of
> > "sharing/mapping FE's memory". There is some trade-off between > > the portability and the performance impact. > > So we can discuss the topic here in this ML, too. > > (See Alex's original email, too). > > > Yes, I agree. > > > On the other hand, my approach aims to create a
"single-binary"
solution > > in which the same binary of BE vm could run on any
hypervisors.
> > Somehow similar to your "proposal-#2" in [2], but in my
solution,
all
> > the hypervisor-specific code would be put into another entity
(VM),
> > named "virtio-proxy" and the abstracted operations are served
via RPC.
> > (In this sense, BE is hypervisor-agnostic but might have OS dependency.) > > But I know that we need discuss if this is a requirement even > > in Stratos project or not. (Maybe not) > > > > Sorry, I haven't had time to finish reading your virtio-proxy
completely (I will do it ASAP). But from your description, it seems we need a 3rd VM between FE and BE? My concern is that, if my assumption is right, will it increase the latency in data transport path? Even if we're using some lightweight guest like RTOS or Unikernel,

Yes, you're right. But I'm afraid that it is a matter of degree. As far as we execute 'mapping' operations at every fetch of payload, we will see latency issue (even in your case) and if we have some solution for it, we won't see it neither in my proposal :)

Oleksandr has sent a proposal to Xen mailing list to reduce this kind of "mapping/unmapping" operations. So the latency caused by this behavior on Xen may eventually be eliminated, and Linux-KVM doesn't have that problem.

Obviously, I have not yet caught up there in the discussion. Which patch specifically? Can you give me the link to the discussion or patch, please?

It's a RFC discussion. We have tested this RFC patch internally. https://lists.xenproject.org/archives/html/xen-devel/2021-07/msg01532.html
Thanks, -Takahiro Akashi
-Takahiro Akashi
> > Specifically speaking about kvm-tool, I have a concern about its
> > license term; Targeting different hypervisors and different OSs
> > (which I assume includes RTOS's), the resultant library should be
> > license permissive and GPL for kvm-tool might be an issue.
> > Any thoughts?

> Yes. If user want to implement a FreeBSD device model, but the virtio
> library is GPL. Then GPL would be a problem. If we have another good
> candidate, I am open to it.

I have some candidates, particularly for vq/vring, in my mind:
- Open-AMP, or
- corresponding Free-BSD code

Interesting, I will look into them : )

Cheers, Wei Chen

-Takahiro Akashi
> > -Takahiro Akashi > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021- > > August/000548.html > > [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2 > > > > > > > > > From: Oleksandr Tyshchenko olekstysh@gmail.com > > > > Sent: 2021年8月14日 23:38 > > > > To: AKASHI Takahiro takahiro.akashi@linaro.org; Stefano Stabellini > > sstabellini@kernel.org > > > > Cc: Alex Benn??e alex.bennee@linaro.org; Stratos
Mailing
List
> > stratos-dev@op-lists.linaro.org; virtio-dev@lists.oasis-
open.org;
Arnd > > Bergmann arnd.bergmann@linaro.org; Viresh Kumar > > viresh.kumar@linaro.org; Stefano Stabellini > > stefano.stabellini@xilinx.com; stefanha@redhat.com; Jan
Kiszka
> > jan.kiszka@siemens.com; Carl van Schaik
> > pratikp@quicinc.com; Srivatsa Vaddagiri <vatsa@codeaurora.org
; Jean-
> > Philippe Brucker jean-philippe@linaro.org; Mathieu Poirier > > mathieu.poirier@linaro.org; Wei Chen Wei.Chen@arm.com;
Oleksandr
> > Tyshchenko Oleksandr_Tyshchenko@epam.com; Bertrand Marquis > > Bertrand.Marquis@arm.com; Artem Mygaiev
Julien > > Grall julien@xen.org; Juergen Gross jgross@suse.com; Paul
Durrant
> > paul@xen.org; Xen Devel xen-devel@lists.xen.org > > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
backends
> > > > > > > > Hello, all. > > > > > > > > Please see some comments below. And sorry for the possible
format
> > issues. > > > > > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro > > mailto:takahiro.akashi@linaro.org wrote: > > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano
Stabellini
wrote: > > > > > > CCing people working on Xen+VirtIO and IOREQs. Not
trimming
the > > original > > > > > > email to let them read the full context. > > > > > > > > > > > > My comments below are related to a potential Xen implementation, > > not > > > > > > because it is the only implementation that matters,
but
because it > > is > > > > > > the one I know best. > > > > > > > > > > Please note that my proposal (and hence the working
prototype)[1]
> > > > > is based on Xen's virtio implementation (i.e. IOREQ) and > > particularly > > > > > EPAM's virtio-disk application (backend server). > > > > > It has been, I believe, well generalized but is still a
bit
biased > > > > > toward this original design. > > > > > > > > > > So I hope you like my approach :) > > > > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-
dev/2021-
> > August/000546.html > > > > > > > > > > Let me take this opportunity to explain a bit more about
my
approach > > below. > > > > > > > > > > > Also, please see this relevant email thread: > > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2 > > > > > > > > > > > > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote: > > > > > > > Hi, > > > > > > > > > > > > > > One of the goals of Project Stratos is to enable
hypervisor
> > agnostic > > > > > > > backends so we can enable as much re-use of code as
possible
and > > avoid > > > > > > > repeating ourselves. This is the flip side of the
front end
> > where > > > > > > > multiple front-end implementations are required -
one
per OS,
> > assuming > > > > > > > you don't just want Linux guests. The resultant
guests
are
> > trivially > > > > > > > movable between hypervisors modulo any abstracted
paravirt
type > > > > > > > interfaces. > > > > > > > > > > > > > > In my original thumb nail sketch of a solution I
envisioned
> > vhost-user > > > > > > > daemons running in a broadly POSIX like environment.
The
> > interface to > > > > > > > the daemon is fairly simple requiring only some
mapped
memory > > and some > > > > > > > sort of signalling for events (on Linux this is
eventfd).
The > > idea was a > > > > > > > stub binary would be responsible for any hypervisor
specific
> > setup and > > > > > > > then launch a common binary to deal with the actual virtqueue > > requests > > > > > > > themselves. > > > > > > > > > > > > > > Since that original sketch we've seen an expansion
in
the
sort > > of ways > > > > > > > backends could be created. There is interest in encapsulating > > backends > > > > > > > in RTOSes or unikernels for solutions like SCMI.
There
interest > > in Rust > > > > > > > has prompted ideas of using the trait interface to
abstract
> > differences > > > > > > > away as well as the idea of bare-metal Rust
backends.
> > > > > > > > > > > > > > We have a card (STR-12) called "Hypercall
Standardisation"
which > > > > > > > calls for a description of the APIs needed from the hypervisor > > side to > > > > > > > support VirtIO guests and their backends. However we
are
some > > way off > > > > > > > from that at the moment as I think we need to at
least
> > demonstrate one > > > > > > > portable backend before we start codifying
requirements. To
that > > end I > > > > > > > want to think about what we need for a backend to
function.
> > > > > > > > > > > > > > Configuration > > > > > > > ============= > > > > > > > > > > > > > > In the type-2 setup this is typically fairly simple
because
the > > host > > > > > > > system can orchestrate the various modules that make
up the
> > complete > > > > > > > system. In the type-1 case (or even type-2 with
delegated
> > service VMs) > > > > > > > we need some sort of mechanism to inform the backend
VM
about > > key > > > > > > > details about the system: > > > > > > > > > > > > > > - where virt queue memory is in it's address space > > > > > > > - how it's going to receive (interrupt) and
trigger
(kick)
> > events > > > > > > > - what (if any) resources the backend needs to
connect to
> > > > > > > > > > > > > > Obviously you can elide over configuration issues by
having
> > static > > > > > > > configurations and baking the assumptions into your
guest
images > > however > > > > > > > this isn't scalable in the long term. The obvious
solution
seems > > to be > > > > > > > extending a subset of Device Tree data to user space
but
perhaps > > there > > > > > > > are other approaches? > > > > > > > > > > > > > > Before any virtio transactions can take place the appropriate > > memory > > > > > > > mappings need to be made between the FE guest and
the
BE
guest. > > > > > > > > > > > > > Currently the whole of the FE guests address space
needs to
be > > visible > > > > > > > to whatever is serving the virtio requests. I can
envision 3
> > approaches: > > > > > > > > > > > > > > * BE guest boots with memory already mapped > > > > > > > > > > > > > > This would entail the guest OS knowing where in
it's
Guest
> > Physical > > > > > > > Address space is already taken up and avoiding
clashing. I
> > would assume > > > > > > > in this case you would want a standard interface to userspace > > to then > > > > > > > make that address space visible to the backend
daemon.
> > > > > > > > > > Yet another way here is that we would have well known
"shared
> > memory" between > > > > > VMs. I think that Jailhouse's ivshmem gives us good
insights on
this > > matter > > > > > and that it can even be an alternative for hypervisor-
agnostic
> > solution. > > > > > > > > > > (Please note memory regions in ivshmem appear as a PCI
device
and > > can be > > > > > mapped locally.) > > > > > > > > > > I want to add this shared memory aspect to my
virtio-proxy,
but
> > > > > the resultant solution would eventually look similar to
ivshmem.
> > > > > > > > > > > > * BE guests boots with a hypervisor handle to
memory
> > > > > > > > > > > > > > The BE guest is then free to map the FE's memory to
where
it > > wants in > > > > > > > the BE's guest physical address space. > > > > > > > > > > > > I cannot see how this could work for Xen. There is no
"handle"
to > > give > > > > > > to the backend if the backend is not running in dom0.
So
for
Xen I > > think > > > > > > the memory has to be already mapped > > > > > > > > > > In Xen's IOREQ solution (virtio-blk), the following
information
is > > expected > > > > > to be exposed to BE via Xenstore: > > > > > (I know that this is a tentative approach though.) > > > > > - the start address of configuration space > > > > > - interrupt number > > > > > - file path for backing storage > > > > > - read-only flag > > > > > And the BE server have to call a particular hypervisor
interface
to > > > > > map the configuration space. > > > > > > > > Yes, Xenstore was chosen as a simple way to pass
configuration
info to > > the backend running in a non-toolstack domain. > > > > I remember, there was a wish to avoid using Xenstore in
Virtio
backend > > itself if possible, so for non-toolstack domain, this could
done
with
> > adjusting devd (daemon that listens for devices and launches
backends)
> > > > to read backend configuration from the Xenstore anyway and
pass it
to > > the backend via command line arguments. > > > > > > > > > > Yes, in current PoC code we're using xenstore to pass device > > configuration. > > > We also designed a static device configuration parse method
for
Dom0less > > or > > > other scenarios don't have xentool. yes, it's from device
model
command > > line > > > or a config file. > > > > > > > But, if ... > > > > > > > > > > > > > > In my approach (virtio-proxy), all those Xen (or
hypervisor)-
> > specific > > > > > stuffs are contained in virtio-proxy, yet another VM, to
hide
all > > details. > > > > > > > > ... the solution how to overcome that is already found and
proven
to > > work then even better. > > > > > > > > > > > > > > > > > # My point is that a "handle" is not mandatory for
executing
mapping. > > > > > > > > > > > and the mapping probably done by the > > > > > > toolstack (also see below.) Or we would have to
invent a
new
Xen > > > > > > hypervisor interface and Xen virtual machine
privileges
to
allow > > this > > > > > > kind of mapping. > > > > > > > > > > > If we run the backend in Dom0 that we have no problems
of
course. > > > > > > > > > > One of difficulties on Xen that I found in my approach
is
that
> > calling > > > > > such hypervisor intefaces (registering IOREQ, mapping
memory) is
> > only > > > > > allowed on BE servers themselvies and so we will have to
extend
> > those > > > > > interfaces. > > > > > This, however, will raise some concern on security and
privilege
> > distribution > > > > > as Stefan suggested. > > > > > > > > We also faced policy related issues with Virtio backend
running in
> > other than Dom0 domain in a "dummy" xsm mode. In our target
system we
run > > the backend in a driver > > > > domain (we call it DomD) where the underlying H/W resides.
We
trust it, > > so we wrote policy rules (to be used in "flask" xsm mode) to
provide
it > > with a little bit more privileges than a simple DomU had. > > > > Now it is permitted to issue device-model, resource and
memory
> > mappings, etc calls. > > > > > > > > > > > > > > > > > > > > > > > To activate the mapping will > > > > > > > require some sort of hypercall to the hypervisor. I
can see
two > > options > > > > > > > at this point: > > > > > > > > > > > > > > - expose the handle to userspace for daemon/helper
to
trigger > > the > > > > > > > mapping via existing hypercall interfaces. If
using a
helper > > you > > > > > > > would have a hypervisor specific one to avoid
the
daemon
> > having to > > > > > > > care too much about the details or push that
complexity
into > > a > > > > > > > compile time option for the daemon which would
result in
> > different > > > > > > > binaries although a common source base. > > > > > > > > > > > > > > - expose a new kernel ABI to abstract the
hypercall
> > differences away > > > > > > > in the guest kernel. In this case the userspace
would
> > essentially > > > > > > > ask for an abstract "map guest N memory to
userspace
ptr" > > and let > > > > > > > the kernel deal with the different hypercall
interfaces.
> > This of > > > > > > > course assumes the majority of BE guests would
be
Linux
> > kernels and > > > > > > > leaves the bare-metal/unikernel approaches to
their own
> > devices. > > > > > > > > > > > > > > Operation > > > > > > > ========= > > > > > > > > > > > > > > The core of the operation of VirtIO is fairly
simple.
Once
the > > > > > > > vhost-user feature negotiation is done it's a case
of
receiving > > update > > > > > > > events and parsing the resultant virt queue for
data.
The
vhost- > > user > > > > > > > specification handles a bunch of setup before that
point,
mostly > > to > > > > > > > detail where the virt queues are set up FD's for
memory and
> > event > > > > > > > communication. This is where the envisioned stub
process
would > > be > > > > > > > responsible for getting the daemon up and ready to
run.
This
is > > > > > > > currently done inside a big VMM like QEMU but I
suspect a
modern > > > > > > > approach would be to use the rust-vmm vhost crate.
It
would
then > > either > > > > > > > communicate with the kernel's abstracted ABI or be
re-
targeted > > as a > > > > > > > build option for the various hypervisors. > > > > > > > > > > > > One thing I mentioned before to Alex is that Xen
doesn't
have
VMMs > > the > > > > > > way they are typically envisioned and described in
other
environments. Instead, Xen has IOREQ servers. Each of them connects independently to Xen via the IOREQ interface. E.g. today multiple QEMUs could be used as emulators for a single Xen VM, each of them connecting to Xen independently via the IOREQ interface.

The component responsible for starting a daemon and/or setting up shared interfaces is the toolstack: the xl command and the libxl/libxc libraries.

I think that VM configuration management (or orchestration in Stratos jargon?) is a subject to debate in parallel. Otherwise, is there any good assumption to avoid it right now?

Oleksandr and others I CCed have been working on ways for the toolstack to create virtio backends and set up memory mappings. They might be able to provide more info on the subject. I do think we miss a way to provide the configuration to the backend and anything else that the backend might require to start doing its job.

Yes, some work has been done for the toolstack to handle Virtio MMIO devices in general and Virtio block devices in particular. However, it has not been upstreamed yet. Updated patches are on review now:
https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-email-olekstysh@gmail.com/

There is an additional (also important) activity to improve/fix foreign memory mapping on Arm which I am also involved in. The foreign memory mapping is proposed to be used for Virtio backends (device emulators) if there is a need to run the guest OS completely unmodified. Of course, the more secure way would be to use grant memory mapping. Briefly, the main difference between them is that with foreign mapping the backend can map any guest memory it wants to map, but with grant mapping it is allowed to map only what was previously granted by the frontend.

So, there might be a problem if we want to pre-map some guest memory in advance or to cache mappings in the backend in order to improve performance (because mapping/unmapping guest pages on every request requires a lot of back and forth to Xen plus P2M updates). In a nutshell, currently, in order to map a guest page into the backend address space we need to steal a real physical page from the backend domain. So, with the said optimizations we might end up with no free memory in the backend domain (see XSA-300). What we try to achieve is to not waste real domain memory at all by providing safe, not-yet-allocated (so unused) address space for the foreign (and grant) pages to be mapped into; this enabling work implies Xen and Linux (and likely DTB bindings) changes. However, as it turned out, for this to work in a proper and safe way some prerequisite work needs to be done. You can find the related Xen discussion at:
https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-olekstysh@gmail.com/

One question is how to best handle notification and kicks. The existing vhost-user framework uses eventfd to signal the daemon (although QEMU is quite capable of simulating them when you use TCG). Xen has its own IOREQ mechanism. However latency is an important factor and having events go through the stub would add quite a lot.

Yeah I think, regardless of anything else, we want the backends to connect directly to the Xen hypervisor.

In my approach,
a) BE -> FE: interrupts are triggered by the BE calling a hypervisor interface via virtio-proxy;
b) FE -> BE: MMIO to the config space raises events (in event channels), which are converted to a callback to the BE via virtio-proxy (Xen's event channel is internally implemented by interrupts).
I don't know what "connect directly" means here, but sending interrupts to the opposite side would be the most efficient. Ivshmem, I suppose, takes this approach by utilizing PCI's MSI-X mechanism.

Agree that MSI would be more efficient than SPI... At the moment, in order to notify the frontend, the backend issues a specific device-model call to ask Xen to inject a corresponding SPI into the guest.

Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?

So any thoughts on what directions are worth experimenting with?

One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add.

As I said above, my proposal does the same thing that you mentioned here :) The difference is that I do call hypervisor interfaces via virtio-proxy.

The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See:
https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52

I don't think that translating IOREQs to eventfd in the kernel is a good idea: it feels like it would be extra complexity and that the kernel shouldn't be involved, as this is a backend-hypervisor interface.

Given that we may want to implement the BE as a bare-metal application, as I did on Zephyr, I don't think that the translation would be a big issue, especially on RTOSes. It will be some kind of abstraction layer of interrupt handling (or nothing but a callback mechanism).

Also, eventfd is very Linux-centric and we are trying to design an interface that could work well for RTOSes too. If we want to do something different, both OS-agnostic and hypervisor-agnostic, perhaps we could design a new interface. One that could be implementable in the Xen hypervisor itself (like IOREQ) and of course any other hypervisor too.

There is also another problem. IOREQ is probably not the only interface needed. Have a look at https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need an interface for the backend to inject interrupts into the frontend? And if the backend requires dynamic memory mappings of frontend pages, then we would also need an interface to map/unmap domU pages.

My proposal document might help here; all the interfaces required for virtio-proxy (or hypervisor-related interfaces) are listed as RPC protocols :)

These interfaces are a lot more problematic than IOREQ: IOREQ is tiny and self-contained. It is easy to add anywhere. A new interface to inject interrupts or map pages is more difficult to manage because it would require changes scattered across the various emulators.

Exactly. I have no confidence yet that my approach will also apply to other hypervisors than Xen. Technically, yes, but whether people can accept it or not is a different matter.

Thanks,
-Takahiro Akashi

--
Regards,
Oleksandr Tyshchenko
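Purely as an illustrative aside on the "generalized IOREQ" idea discussed above: the record a hypervisor hands to a backend is essentially an address, a value, a width and a direction, plus some way to learn that a new record is pending. Everything below is a hypothetical sketch with invented names (it is not Xen's public/hvm/ioreq.h); on Linux the pending-event primitive could be an eventfd, while an RTOS would plug in its own interrupt or callback hook.

/* Hypothetical, hypervisor-agnostic "IOREQ-like" record and dispatch loop.
 * All names are invented for illustration only.
 */
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

struct vio_req {                 /* one trapped MMIO/PIO access */
    uint64_t addr;               /* guest physical address of the access */
    uint64_t data;               /* value written, or place to put the value read */
    uint32_t size;               /* access width in bytes */
    uint8_t  is_read;            /* 1 = guest read, 0 = guest write */
};

/* Hypervisor-specific glue would fill these in (Xen IOREQ page, KVM ioeventfd, ...). */
static int fetch_request(struct vio_req *req) { (void)req; return 0; /* none pending */ }
static void complete_request(const struct vio_req *req) { (void)req; }

int main(void)
{
    int kick = eventfd(0, EFD_CLOEXEC);   /* Linux-only stand-in for "event arrived" */
    struct pollfd pfd = { .fd = kick, .events = POLLIN };

    for (int iterations = 0; iterations < 1; iterations++) { /* a real backend loops forever */
        if (poll(&pfd, 1, 100) > 0) {
            uint64_t n;
            ssize_t r = read(kick, &n, sizeof(n));   /* consume the kick counter */
            (void)r;
        }
        struct vio_req req;
        while (fetch_request(&req))
            complete_request(&req);       /* decode virtqueue activity and reply */
    }
    close(kick);
    return 0;
}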
Hi Christopher,
Thank you for your feedback.
On Mon, Aug 30, 2021 at 12:53:00PM -0700, Christopher Clark wrote:
[ resending message to ensure delivery to the CCd mailing lists post-subscription ]
Apologies for being late to this thread, but I hope to be able to contribute to this discussion in a meaningful way. I am grateful for the level of interest in this topic. I would like to draw your attention to Argo as a suitable technology for development of VirtIO's hypervisor-agnostic interfaces.
- Argo is an interdomain communication mechanism in Xen (on x86 and Arm)
that can send and receive hypervisor-mediated notifications and messages between domains (VMs). [1] The hypervisor can enforce Mandatory Access Control over all communication between domains. It is derived from the earlier v4v, which has been deployed on millions of machines with the HP/Bromium uXen hypervisor and with OpenXT.
- Argo has a simple interface with a small number of operations that was designed for ease of integration into OS primitives on both Linux (sockets) and Windows (ReadFile/WriteFile) [2].
- A unikernel example of using it has also been developed for XTF. [3]
- There has been recent discussion and support in the Xen community for
making revisions to the Argo interface to make it hypervisor-agnostic, and support implementations of Argo on other hypervisors. This will enable a single interface for an OS kernel binary to use for inter-VM communication that will work on multiple hypervisors -- this applies equally to both backends and frontend implementations. [4]
Regarding virtio-over-Argo, let me ask a few questions:
(In the figure "Virtual device buffer access: Virtio+Argo" in [4])
1) How is the configuration managed? On either virtio-mmio or virtio-pci, some negotiation always takes place between the FE and BE through the "configuration" space. How can this be done in virtio-over-Argo?
2) Do virtio's available/used vrings and descriptors physically exist, or are they virtually emulated over Argo (rings)?
3) The payload in a request will be copied into the receiver's Argo ring. What does the address in a descriptor mean? Address/offset in a ring buffer?
4) Estimate of performance or latency? It appears that, on the FE side, at least three hypervisor calls (and data copying) need to be invoked at every request, right?
Thanks, -Takahiro Akashi
- Here are the design documents for building VirtIO-over-Argo, to support a hypervisor-agnostic frontend VirtIO transport driver using Argo.
The Development Plan to build VirtIO virtual device support over Argo transport: https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Dev...
A design for using VirtIO over Argo, describing how VirtIO data structures and communication is handled over the Argo transport: https://openxt.atlassian.net/wiki/spaces/DC/pages/1348763698/VirtIO+Argo
Diagram (from the above document) showing how VirtIO rings are synchronized between domains without using shared memory: https://openxt.atlassian.net/46e1c93b-2b87-4cb2-951e-abd4377a1194#media-blob...
Please note that the above design documents show that the existing VirtIO device drivers, and both vring and virtqueue data structures can be preserved while interdomain communication can be performed with no shared memory required for most drivers; (the exceptions where further design is required are those such as virtual framebuffer devices where shared memory regions are intentionally added to the communication structure beyond the vrings and virtqueues).
An analysis of VirtIO and Argo, informing the design: https://openxt.atlassian.net/wiki/spaces/DC/pages/1333428225/Analysis+of+Arg...
- Argo can be used for a communication path for configuration between the
backend and the toolstack, avoiding the need for a dependency on XenStore, which is an advantage for any hypervisor-agnostic design. It is also amenable to a notification mechanism that is not based on Xen event channels.
- Argo does not use or require shared memory between VMs and provides an
alternative to the use of foreign shared memory mappings. It avoids some of the complexities involved with using grants (eg. XSA-300).
- Argo supports Mandatory Access Control by the hypervisor, satisfying a
common certification requirement.
- The Argo headers are BSD-licensed and the Xen hypervisor implementation
is GPLv2 but accessible via the hypercall interface. The licensing should not present an obstacle to adoption of Argo in guest software or implementation by other hypervisors.
- Since the interface that Argo presents to a guest VM is similar to DMA, a
VirtIO-Argo frontend transport driver should be able to operate with a physical VirtIO-enabled smart-NIC if the toolstack and an Argo-aware backend provide support.
The next Xen Community Call is next week and I would be happy to answer questions about Argo and on this topic. I will also be following this thread.
Christopher (Argo maintainer, Xen Community)
[1] An introduction to Argo: https://static.sched.com/hosted_files/xensummit19/92/Argo%20and%20HMX%20-%20... https://www.youtube.com/watch?v=cnC0Tg3jqJQ Xen Wiki page for Argo: https://wiki.xenproject.org/wiki/Argo:_Hypervisor-Mediated_Exchange_(HMX)_fo...
[2] OpenXT Linux Argo driver and userspace library: https://github.com/openxt/linux-xen-argo
Windows V4V at OpenXT wiki: https://openxt.atlassian.net/wiki/spaces/DC/pages/14844007/V4V Windows v4v driver source: https://github.com/OpenXT/xc-windows/tree/master/xenv4v
HP/Bromium uXen V4V driver: https://github.com/uxen-virt/uxen/tree/ascara/windows/uxenv4vlib
[3] v2 of the Argo test unikernel for XTF: https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg02234.html
[4] Argo HMX Transport for VirtIO meeting minutes: https://lists.xenproject.org/archives/html/xen-devel/2021-02/msg01422.html
VirtIO-Argo Development wiki page: https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Dev...
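To make concrete what the earlier bullet about foreign mappings and grants refers to: on Xen today a userspace backend reaches frontend memory through libxenforeignmemory or libxengnttab, roughly as below. This is only a contrast sketch under those assumptions (error handling and unmapping omitted; exact headers and signatures depend on the Xen release in use); Argo's point is that neither mapping is needed.

/* Sketch, for contrast with Argo: the two conventional ways a userspace
 * backend on Xen maps frontend memory today.
 */
#include <stdint.h>
#include <sys/mman.h>            /* PROT_READ, PROT_WRITE */
#include <xenforeignmemory.h>
#include <xengnttab.h>

/* Foreign mapping: the backend picks any guest frame to map (needs privilege). */
void *map_foreign_page(uint32_t domid, xen_pfn_t gfn)
{
    xenforeignmemory_handle *fmem = xenforeignmemory_open(NULL, 0);
    int err = 0;

    return xenforeignmemory_map(fmem, domid, PROT_READ | PROT_WRITE,
                                1 /* nr pages */, &gfn, &err);
}

/* Grant mapping: the backend can map only what the frontend explicitly granted. */
void *map_granted_page(uint32_t domid, grant_ref_t gref)
{
    xengnttab_handle *xgt = xengnttab_open(NULL, 0);

    return xengnttab_map_grant_ref(xgt, domid, gref, PROT_READ | PROT_WRITE);
}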
On Thu, Aug 26, 2021 at 5:11 AM Wei Chen Wei.Chen@arm.com wrote:
Hi Akashi,
-----Original Message-----
From: AKASHI Takahiro takahiro.akashi@linaro.org
Sent: 26 August 2021 17:41
To: Wei Chen Wei.Chen@arm.com
Cc: Oleksandr Tyshchenko olekstysh@gmail.com; Stefano Stabellini sstabellini@kernel.org; Alex Bennée alex.bennee@linaro.org; Kaly Xin Kaly.Xin@arm.com; Stratos Mailing List stratos-dev@op-lists.linaro.org; virtio-dev@lists.oasis-open.org; Arnd Bergmann arnd.bergmann@linaro.org; Viresh Kumar viresh.kumar@linaro.org; Stefano Stabellini stefano.stabellini@xilinx.com; stefanha@redhat.com; Jan Kiszka jan.kiszka@siemens.com; Carl van Schaik cvanscha@qti.qualcomm.com; pratikp@quicinc.com; Srivatsa Vaddagiri vatsa@codeaurora.org; Jean-Philippe Brucker jean-philippe@linaro.org; Mathieu Poirier mathieu.poirier@linaro.org; Oleksandr Tyshchenko Oleksandr_Tyshchenko@epam.com; Bertrand Marquis Bertrand.Marquis@arm.com; Artem Mygaiev Artem_Mygaiev@epam.com; Julien Grall julien@xen.org; Juergen Gross jgross@suse.com; Paul Durrant paul@xen.org; Xen Devel xen-devel@lists.xen.org
Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
Hi Wei,

On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:

Hi Akashi,

Hi Wei, Oleksandr,

On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:

Hi All,

Thanks to Stefano for linking my kvmtool for Xen proposal here. This proposal is still being discussed in the Xen and KVM communities. The main work is to decouple kvmtool from KVM so that other hypervisors can reuse the virtual device implementations. In this case, we need to introduce an intermediate hypervisor layer for VMM abstraction, which is, I think, very close to Stratos' virtio hypervisor agnosticism work.

# My proposal[1] comes from my own idea and doesn't always represent Linaro's view on this subject nor reflect Alex's concerns. Nevertheless,

your idea and my proposal seem to share the same background. Both have a similar goal and currently start with, at first, Xen and are based on kvm-tool. (Actually, my work is derived from EPAM's virtio-disk, which is also based on kvm-tool.) In particular, the abstraction of hypervisor interfaces has the same set of interfaces (for your "struct vmm_impl" and my "RPC interfaces"). This is not a coincidence, as we both share the same origin, as I said above. And so we will also share the same issues. One of them is a way of "sharing/mapping FE's memory". There is some trade-off between the portability and the performance impact. So we can discuss the topic here in this ML, too. (See Alex's original email, too.)

Yes, I agree.

On the other hand, my approach aims to create a "single-binary" solution in which the same binary of the BE VM could run on any hypervisor. Somewhat similar to your "proposal-#2" in [2], but in my solution all the hypervisor-specific code would be put into another entity (VM), named "virtio-proxy", and the abstracted operations are served via RPC. (In this sense, the BE is hypervisor-agnostic but might have an OS dependency.) But I know that we need to discuss whether this is a requirement even in the Stratos project or not. (Maybe not.)

Sorry, I haven't had time to finish reading your virtio-proxy completely (I will do it ASAP). But from your description, it seems we need a 3rd VM between FE and BE? My concern is that, if my assumption is right, will it increase the latency in the data transport path? Even if we're using some lightweight guest like an RTOS or unikernel,

Yes, you're right. But I'm afraid that it is a matter of degree. As long as we execute 'mapping' operations at every fetch of payload, we will see the latency issue (even in your case), and if we have some solution for it, we won't see it in my proposal either :)

Oleksandr has sent a proposal to the Xen mailing list to reduce this kind of "mapping/unmapping" operation. So the latency caused by this behavior on Xen may eventually be eliminated, and Linux-KVM doesn't have that problem.
Obviously, I have not yet caught up there in the discussion. Which patch specifically?
Can you give me the link to the discussion or patch, please?
It's an RFC discussion. We have tested this RFC patch internally. https://lists.xenproject.org/archives/html/xen-devel/2021-07/msg01532.html
Thanks, -Takahiro Akashi
-Takahiro Akashi
Specifically speaking about kvm-tool, I have a concern about its license terms; targeting different hypervisors and different OSs (which I assume includes RTOSes), the resultant library should be license-permissive, and GPL for kvm-tool might be an issue. Any thoughts?

Yes. If a user wants to implement a FreeBSD device model but the virtio library is GPL, then GPL would be a problem. If we have another good candidate, I am open to it.

I have some candidates, particularly for vq/vring, in my mind:
* Open-AMP, or
* the corresponding FreeBSD code
Interesting, I will look into them : )
Cheers, Wei Chen
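As a thought experiment on the "RPC interfaces" mentioned in the exchange above, the message set a virtio-proxy style intermediary would have to carry maps fairly directly onto the operations this thread keeps returning to: mapping and unmapping frontend memory, being told about frontend MMIO, and injecting interrupts. The layout below is invented purely for illustration; Akashi's proposal defines its own RPC protocols.

/* Hypothetical sketch of virtio-proxy style RPC messages.  Message set and
 * layout are illustrative only.
 */
#include <stdint.h>

enum vproxy_op {
    VPROXY_OP_MAP_GUEST_MEM,     /* map a range of FE guest memory into the BE */
    VPROXY_OP_UNMAP_GUEST_MEM,
    VPROXY_OP_REGISTER_MMIO,     /* ask to be notified of FE MMIO to a region */
    VPROXY_OP_INJECT_IRQ,        /* raise the FE's interrupt (used-ring notify) */
};

struct vproxy_msg {
    uint32_t op;                 /* enum vproxy_op */
    uint32_t domid;              /* FE identifier, however the proxy names VMs */
    uint64_t guest_addr;         /* FE guest-physical address of the range */
    uint64_t size;               /* length in bytes */
    uint64_t irq;                /* interrupt number for VPROXY_OP_INJECT_IRQ */
};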
-Takahiro Akashi

[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000548.html
[2] https://marc.info/?l=xen-devel&m=162373754705233&w=2

From: Oleksandr Tyshchenko olekstysh@gmail.com
Sent: 14 August 2021 23:38
Subject: Re: Enabling hypervisor agnosticism for VirtIO backends

Hello, all.

Please see some comments below. And sorry for the possible format issues.

On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro takahiro.akashi@linaro.org wrote:
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:

CCing people working on Xen+VirtIO and IOREQs. Not trimming the original email to let them read the full context.

My comments below are related to a potential Xen implementation, not because it is the only implementation that matters, but because it is the one I know best.

Please note that my proposal (and hence the working prototype)[1] is based on Xen's virtio implementation (i.e. IOREQ) and particularly EPAM's virtio-disk application (backend server). It has been, I believe, well generalized but is still a bit biased toward this original design. So I hope you like my approach :)

[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000546.html

Let me take this opportunity to explain a bit more about my approach below.

Also, please see this relevant email thread:
https://marc.info/?l=xen-devel&m=162373754705233&w=2

(snip)

Yet another way here is that we would have well-known "shared memory" between VMs. I think that Jailhouse's ivshmem gives us good insights on this matter and that it can even be an alternative for a hypervisor-agnostic solution. (Please note memory regions in ivshmem appear as a PCI device and can be mapped locally.) I want to add this shared memory aspect to my virtio-proxy, but the resultant solution would eventually look similar to ivshmem.

I cannot see how this could work for Xen. There is no "handle" to give to the backend if the backend is not running in dom0. So for Xen I think the memory has to be already mapped

In Xen's IOREQ solution (virtio-blk), the following information is expected to be exposed to the BE via Xenstore (I know that this is a tentative approach though):
- the start address of the configuration space
- the interrupt number
- the file path for backing storage
- a read-only flag
And the BE server has to call a particular hypervisor interface to map the configuration space.

Yes, Xenstore was chosen as a simple way to pass configuration info to the backend running in a non-toolstack domain. I remember there was a wish to avoid using Xenstore in the Virtio backend itself if possible, so for a non-toolstack domain this could be done by adjusting devd (the daemon that listens for devices and launches backends) to read the backend configuration from Xenstore anyway and pass it to the backend via command line arguments.

Yes, in the current PoC code we're using Xenstore to pass the device configuration. We also designed a static device configuration parse method for Dom0less or other scenarios that don't have the xl tooling; yes, it's from the device model command line or a config file.

But, if ...

In my approach (virtio-proxy), all those Xen (or hypervisor)-specific stuffs are contained in virtio-proxy, yet another VM, to hide all details.

... the solution for how to overcome that is already found and proven to work, then even better.

# My point is that a "handle" is not mandatory for executing mapping.

and the mapping is probably done by the toolstack (also see below). Or we would have to invent a new Xen hypervisor interface and Xen virtual machine privileges to allow this kind of mapping. If we run the backend in Dom0 then we have no problems of course.

One of the difficulties on Xen that I found in my approach is that calling such hypervisor interfaces (registering IOREQ, mapping memory) is only allowed on BE servers themselves and so we will have to extend those interfaces. This, however, will raise some concern on security and privilege distribution, as Stefan suggested.

We also faced policy-related issues with a Virtio backend running in a domain other than Dom0 in the "dummy" xsm mode. In our target system we run the backend in a driver domain (we call it DomD) where the underlying H/W resides. We trust it, so we wrote policy rules (to be used in the "flask" xsm mode) to provide it with a little bit more privilege than a simple DomU had. Now it is permitted to issue device-model, resource and memory mapping calls, etc.

(snip)

- expose a new kernel ABI to abstract the hypercall differences away in the guest kernel. In this case the userspace would essentially ask for an abstract "map guest N memory to userspace ptr" and let the kernel deal with the different hypercall interfaces. This of course assumes the majority of BE guests would be Linux kernels and leaves the bare-metal/unikernel approaches to their own devices.

Operation
=========

The core of the operation of VirtIO is fairly simple. Once the vhost-user feature negotiation is done it's a case of receiving update events and parsing the resultant virt queue for data. The vhost-user specification handles a bunch of setup before that point, mostly to detail where the virt queues are set up FD's for memory and event communication. This is where the envisioned stub process would be responsible for getting the daemon up and ready to run. This is currently done inside a big VMM like QEMU but I suspect a modern approach would be to use the rust-vmm vhost crate. It would then either communicate with the kernel's abstracted ABI or be re-targeted as a build option for the various hypervisors.

One thing I mentioned before to Alex is that Xen doesn't have VMMs the way they are typically envisioned and described in other environments. Instead, Xen has IOREQ servers.

(snip)
On Thu, Sep 2, 2021 at 12:19 AM AKASHI Takahiro takahiro.akashi@linaro.org wrote:
Hi Christopher,
Thank you for your feedback.
On Mon, Aug 30, 2021 at 12:53:00PM -0700, Christopher Clark wrote:
[ resending message to ensure delivery to the CCd mailing lists post-subscription ]
(snip)
Regarding virtio-over-Argo, let me ask a few questions: (In figure "Virtual device buffer access:Virtio+Argo" in [4])
(for ref, this diagram is from this document: https://openxt.atlassian.net/wiki/spaces/DC/pages/1348763698 )
Takahiro, thanks for reading the Virtio-Argo materials.
Some relevant context before answering your questions below: the Argo request interface from the hypervisor to a guest, which is currently exposed only via a dedicated hypercall op, has been discussed within the Xen community and is open to being changed in order to better enable support for guest VM access to Argo functions in a hypervisor-agnostic way.
The proposal is to allow hypervisors the option to implement and expose any of multiple access mechanisms for Argo, and then enable a guest device driver to probe the hypervisor for methods that it is aware of and able to use. The hypercall op is likely to be retained (in some form), and complemented at least on x86 with another interface via MSRs presented to the guests.
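A hedged sketch of what that probing could look like from the guest driver's point of view, assuming (purely hypothetically) a hypercall-op mechanism and an x86 MSR mechanism as the two candidates; the real probe interface is still being designed in the Xen community.

/* Hypothetical probe for an available Argo access mechanism. */
#include <stdbool.h>

enum argo_access {
    ARGO_ACCESS_NONE,
    ARGO_ACCESS_HYPERCALL,   /* existing dedicated hypercall op */
    ARGO_ACCESS_MSR,         /* possible x86 MSR-based interface */
};

/* Each would poke the corresponding interface and report availability;
 * platform-specific glue goes here. */
static bool probe_hypercall(void) { return false; }
static bool probe_msr(void)       { return false; }

enum argo_access argo_probe(void)
{
    if (probe_hypercall())
        return ARGO_ACCESS_HYPERCALL;
    if (probe_msr())
        return ARGO_ACCESS_MSR;
    return ARGO_ACCESS_NONE;
}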
- How is the configuration managed? On either virtio-mmio or virtio-pci, some negotiation always takes place between the FE and BE through the "configuration" space. How can this be done in virtio-over-Argo?
Just to be clear about my understanding: your question, in the context of a Linux kernel virtio device driver implementation, is about how a virtio-argo transport driver would implement the get_features function of the virtio_config_ops, as a parallel to the work that vp_get_features does for virtio-pci, and vm_get_features does for virtio-mmio.
The design is still open on this and options have been discussed, including:
* an extension to Argo to allow the system toolstack (which is responsible for managing guest VMs and enabling connections from front-to-backends) to manage a table of "implicit destinations", so a guest can transmit Argo messages to eg. "my storage service" port and the hypervisor will deliver it based on a destination table pre-programmed by the toolstack for the VM. [1] (A sketch of such a table follows this list of options.)
  ref: Notes from the December 2019 Xen F2F meeting in Cambridge, UK: [1] https://lists.archive.carbon60.com/xen/devel/577800#577800
So within that feature negotiation function, communication with the backend via that Argo channel will occur.
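For orientation, here is a skeleton of how a hypothetical virtio-argo transport could slot into that flow, modeled loosely on virtio-mmio's vm_get_features(). Only struct virtio_config_ops, .get_features/.finalize_features and vring_transport_features() are real Linux interfaces; everything Argo-flavoured here is an invented placeholder, and this is a kernel-module fragment, not a complete driver.

/* Hypothetical virtio-argo transport skeleton (illustrative only). */
#include <linux/virtio.h>
#include <linux/virtio_config.h>
#include <linux/virtio_ring.h>

struct virtio_argo_device {
    struct virtio_device vdev;
    /* Argo port/ring state for talking to the backend would live here. */
};

static u64 va_get_features(struct virtio_device *vdev)
{
    /* A real driver would send a "get features" request over its Argo
     * ring to the backend and wait for the reply; hard-coded here. */
    return 0;
}

static int va_finalize_features(struct virtio_device *vdev)
{
    vring_transport_features(vdev);   /* keep only transport-legal bits */
    /* ...then push the accepted feature bits to the backend over Argo. */
    return 0;
}

static const struct virtio_config_ops virtio_argo_config_ops = {
    .get_features      = va_get_features,
    .finalize_features = va_finalize_features,
    /* .get, .set, .get_status, .set_status, .reset, .find_vqs, ... */
};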
* IOREQ The Xen IOREQ implementation is not currently appropriate for virtio-argo since it requires the use of foreign memory mappings of frontend memory in the backend guest. However, a new HMX interface from the hypervisor could support a new DMA Device Model Op to allow the backend to request the hypervisor to retrieve specified bytes from the frontend guest, which would enable plumbing for device configuration between an IOREQ server (device model backend implementation) and the guest driver. [2]
Feature negotiation in the front end in this case would look very similar to the virtio-mmio implementation.
ref: Argo HMX Transport for VirtIO meeting minutes, from January 2021: [2] https://lists.xenproject.org/archives/html/xen-devel/2021-02/msg01422.html
* guest ACPI tables that surface the address of a remote Argo endpoint on behalf of the toolstack, and Argo communication can then negotiate features
* emulation of a basic PCI device by the hypervisor (though details not determined)
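Under the first option above (the toolstack-managed "implicit destinations" table), the per-guest data the hypervisor would need is small. The structure below is a purely hypothetical illustration of that idea; no such table format is defined today.

/* Hypothetical "implicit destinations" table, programmed by the toolstack. */
#include <stdint.h>

struct argo_implicit_dest {
    uint32_t alias;          /* well-known service id, e.g. "storage" */
    uint32_t dest_domid;     /* backend domain chosen by the toolstack */
    uint32_t dest_port;      /* Argo port the backend listens on */
};

/* One table per guest, installed at domain creation. */
struct argo_implicit_dest_table {
    uint32_t nr_entries;
    struct argo_implicit_dest entries[16];
};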
- Do virtio's available/used vrings and descriptors physically exist, or are they virtually emulated over Argo (rings)?
In short: the latter.
In the analysis that I did when looking at this, my observation was that each side (front and backend) should be able to accurately maintain their own local copy of the available/used vrings as well as descriptors, and both be kept synchronized by ensuring that updates are transmitted to the other side when they are written to. eg. As part of this, in the Linux front end implementation the virtqueue_notify function uses a function pointer in the virtqueue that is populated by the transport driver, ie. the virtio-argo driver in this case, which can implement the necessary logic to coordinate with the backend.
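A possible, entirely illustrative shape for those synchronization messages, assuming a split-virtqueue layout: whenever the frontend writes a descriptor or publishes a new available index, a small update like this would be sent to the backend so it can apply the change to its shadow copy. The layout is invented for illustration.

/* Hypothetical updates for keeping the backend's shadow virtqueue in sync. */
#include <stdint.h>

struct vq_desc_update {
    uint16_t index;        /* descriptor table slot that changed */
    uint64_t addr;         /* handle/offset meaningful to the data rings */
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

struct vq_avail_update {
    uint16_t avail_idx;    /* new available index after the frontend's add */
    uint16_t head;         /* head descriptor of the chain just made available */
};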
- The payload in a request will be copied into the receiver's Argo ring. What does the address in a descriptor mean? Address/offset in a ring buffer?
Effectively yes. I would treat it as a handle that is used to identify and retrieve data from messages exchanged between frontend transport driver and the backend via Argo rings established for moving data for the data path. In the diagram, those are "Argo ring for reads" and "Argo ring for writes".
- Estimate of performance or latency?
Different access methods to Argo (i.e. related to my answer to your question 1) above) will have different performance characteristics.
Data copying will necessarily be involved for any Hypervisor-Mediated data eXchange (HMX) mechanism[1], such as Argo, where there is no shared memory between guest VMs, but the performance profile on modern CPUs with sizable caches has been demonstrated to be acceptable for the guest virtual device driver use case in the HP/Bromium vSentry uXen product. The VirtIO structure is somewhat different though.
Further performance profiling and measurement will be valuable for enabling tuning of the implementation and development of additional interfaces (eg. such as an asynchronous send primitive) - some of this has been discussed and described on the VirtIO-Argo-Development-Phase-1 wiki page[2].
[1] https://wiki.xenproject.org/wiki/Argo:_Hypervisor-Mediated_Exchange_(HMX)_fo...
[2] https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Dev...
It appears that, on FE side, at least three hypervisor calls (and data copying) need to be invoked at every request, right?
For a write, counting FE sendv ops: 1: the write data payload is sent via the "Argo ring for writes" 2: the descriptor is sent via a sync of the available/descriptor ring -- is there a third one that I am missing?
Christopher
Hi,
I have not covered all your comments below yet. So just one comment:
On Mon, Sep 06, 2021 at 05:57:43PM -0700, Christopher Clark wrote:
On Thu, Sep 2, 2021 at 12:19 AM AKASHI Takahiro takahiro.akashi@linaro.org wrote:
(snip)
It appears that, on FE side, at least three hypervisor calls (and data copying) need to be invoked at every request, right?
For a write, counting FE sendv ops: 1: the write data payload is sent via the "Argo ring for writes" 2: the descriptor is sent via a sync of the available/descriptor ring -- is there a third one that I am missing?
In the picture, I can see
a) Data transmitted by Argo sendv
b) Descriptor written after data sendv
c) VirtIO ring sync'd to back-end via separate sendv
Oops, (b) is not a hypervisor call, is it? (But I guess that you will have to have yet another call for notification since there is no config register of QueueNotify?)
Thanks, -Takahiro Akashi
Christopher
Thanks, -Takahiro Akashi
- Here are the design documents for building VirtIO-over-Argo, to
support a
hypervisor-agnostic frontend VirtIO transport driver using Argo.
The Development Plan to build VirtIO virtual device support over Argo transport:
https://openxt.atlassian.net/wiki/spaces/DC/pages/1696169985/VirtIO-Argo+Dev...
A design for using VirtIO over Argo, describing how VirtIO data
structures
and communication is handled over the Argo transport: https://openxt.atlassian.net/wiki/spaces/DC/pages/1348763698/VirtIO+Argo
Diagram (from the above document) showing how VirtIO rings are
synchronized
between domains without using shared memory:
https://openxt.atlassian.net/46e1c93b-2b87-4cb2-951e-abd4377a1194#media-blob...
Please note that the above design documents show that the existing VirtIO device drivers, and both vring and virtqueue data structures can be preserved while interdomain communication can be performed with no shared memory required for most drivers; (the exceptions where further design is required are
those
such as virtual framebuffer devices where shared memory regions are intentionally added to the communication structure beyond the vrings and virtqueues).
An analysis of VirtIO and Argo, informing the design:
https://openxt.atlassian.net/wiki/spaces/DC/pages/1333428225/Analysis+of+Arg...
- Argo can be used for a communication path for configuration between the
backend and the toolstack, avoiding the need for a dependency on XenStore,
which
is an advantage for any hypervisor-agnostic design. It is also amenable to a notification mechanism that is not based on Xen event channels.
- Argo does not use or require shared memory between VMs and provides an
alternative to the use of foreign shared memory mappings. It avoids some of the complexities involved with using grants (eg. XSA-300).
- Argo supports Mandatory Access Control by the hypervisor, satisfying a
common certification requirement.
- The Argo headers are BSD-licensed and the Xen hypervisor implementation
is GPLv2 but accessible via the hypercall interface. The licensing should not
present
an obstacle to adoption of Argo in guest software or implementation by other hypervisors.
- Since the interface that Argo presents to a guest VM is similar to
DMA, a
VirtIO-Argo frontend transport driver should be able to operate with a physical VirtIO-enabled smart-NIC if the toolstack and an Argo-aware backend provide support.
On Tue, Sep 7, 2021 at 4:55 AM AKASHI Takahiro takahiro.akashi@linaro.org wrote:
Hi,
I have not covered all your comments below yet. So just one comment:
On Mon, Sep 06, 2021 at 05:57:43PM -0700, Christopher Clark wrote:
On Thu, Sep 2, 2021 at 12:19 AM AKASHI Takahiro takahiro.akashi@linaro.org wrote:
(snip)
It appears that, on the FE side, at least three hypervisor calls (and data copies) need to be invoked for every request, right?
For a write, counting FE sendv ops:
1: the write data payload is sent via the "Argo ring for writes"
2: the descriptor is sent via a sync of the available/descriptor ring
-- is there a third one that I am missing?
In the picture, I can see:
a) Data transmitted by Argo sendv
b) Descriptor written after data sendv
c) VirtIO ring sync'd to back-end via separate sendv
Oops, (b) is not a hypervisor call, is it?
That's correct, it is not - the blue arrows in the diagram are not hypercalls, they are intended to show data movement or action in the flow of performing the operation, and (b) is a data write within the guest's address space into the descriptor ring.
(But I guess that you will have to have yet another call for notification since there is no config register of QueueNotify?)
Reasoning about hypercalls necessary for data movement:
VirtIO transport drivers are responsible for instantiating virtqueues (setup_vq) and are able to populate the notify function pointer in the virtqueue that they supply. The virtio-argo transport driver can provide a suitable notify function implementation that will issue the Argo sendv hypercall(s) for sending data from the guest frontend to the backend. By issuing the sendv at the time of the queue notify, rather than as each buffer is added to the virtqueue, the cost of the sendv hypercall can be amortized over multiple buffer additions to the virtqueue.
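To illustrate that amortization, here is a minimal freestanding C sketch. The sketch_vq structure, the iovec-like type and the argo_sendv() wrapper are hypothetical stand-ins, not the real Linux virtio or Xen Argo interfaces; the point is only that one sendv per queue notify covers all buffers queued since the previous notify.

/*
 * Freestanding sketch only: the types and the argo_sendv() wrapper are
 * hypothetical, not the actual virtio-argo transport or Argo hypercall API.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

struct iovec_ish { void *base; size_t len; };

struct sketch_vq {
    struct iovec_ish pending[64];   /* buffers queued since the last notify */
    unsigned int npending;
    uint32_t dst_domain;            /* backend domain id */
    uint32_t dst_port;              /* Argo ring/port used for this queue */
};

/* Hypothetical wrapper around the Argo sendv hypercall. */
int argo_sendv(uint32_t dom, uint32_t port,
               const struct iovec_ish *iov, unsigned int niov);

/*
 * Transport notify hook: called once per QueueNotify, so a single sendv
 * covers every buffer added since the previous notify, amortizing the
 * hypercall cost over multiple buffer additions.
 */
static bool sketch_vq_notify(struct sketch_vq *vq)
{
    if (vq->npending == 0)
        return true;                        /* nothing to send */

    if (argo_sendv(vq->dst_domain, vq->dst_port,
                   vq->pending, vq->npending) < 0)
        return false;                       /* e.g. destination ring full */

    vq->npending = 0;
    return true;
}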
I also understand that there has been some recent work in the Linaro Project Stratos on "Fat Virtqueues", where the data to be transmitted is included within an expanded virtqueue, which could further reduce the number of hypercalls required, since the data can be transmitted inline with the descriptors. Reference here: https://linaro.atlassian.net/wiki/spaces/STR/pages/25626313982/2021-01-21+Pr... https://linaro.atlassian.net/browse/STR-25
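As a purely illustrative sketch of that idea (and not the actual Stratos STR-25 layout), a "fat" descriptor might carry a small payload inline with the descriptor itself; the fixed inline capacity is the obvious trade-off, which comes up again below.

/*
 * Illustrative only: a possible "fat" descriptor where a small payload
 * travels inline instead of being referenced by guest physical address.
 * The field names and the 512-byte capacity are assumptions.
 */
#include <stdint.h>
#include <string.h>

#define FATVQ_INLINE_MAX 512   /* assumed per-descriptor inline capacity */

struct fat_desc {
    uint32_t id;                       /* request identifier */
    uint32_t len;                      /* bytes used in data[] */
    uint16_t flags;
    uint8_t  data[FATVQ_INLINE_MAX];   /* payload copied inline */
};

/* Returns 0 on success, -1 if the payload exceeds the inline capacity and
 * would have to fall back to some other data path. */
static int fat_desc_fill(struct fat_desc *d, uint32_t id,
                         const void *payload, uint32_t len)
{
    if (len > FATVQ_INLINE_MAX)
        return -1;
    d->id = id;
    d->len = len;
    d->flags = 0;
    memcpy(d->data, payload, len);
    return 0;
}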
As a result of the above, I think that a single hypercall could be sufficient for communicating data for multiple requests, and that a two-hypercall-per-request (worst case) upper bound could also be established.
Christopher
Hi Christopher,
On Tue, Sep 07, 2021 at 11:09:34AM -0700, Christopher Clark wrote:
On Tue, Sep 7, 2021 at 4:55 AM AKASHI Takahiro takahiro.akashi@linaro.org wrote:
Hi,
I have not covered all your comments below yet. So just one comment:
On Mon, Sep 06, 2021 at 05:57:43PM -0700, Christopher Clark wrote:
On Thu, Sep 2, 2021 at 12:19 AM AKASHI Takahiro takahiro.akashi@linaro.org wrote:
(snip)
It appears that, on the FE side, at least three hypervisor calls (and data copies) need to be invoked for every request, right?
For a write, counting FE sendv ops:
1: the write data payload is sent via the "Argo ring for writes"
2: the descriptor is sent via a sync of the available/descriptor ring
-- is there a third one that I am missing?
In the picture, I can see:
a) Data transmitted by Argo sendv
b) Descriptor written after data sendv
c) VirtIO ring sync'd to back-end via separate sendv
Oops, (b) is not a hypervisor call, is it?
That's correct, it is not - the blue arrows in the diagram are not hypercalls, they are intended to show data movement or action in the flow of performing the operation, and (b) is a data write within the guest's address space into the descriptor ring.
(But I guess that you will have to have yet another call for notification since there is no config register of QueueNotify?)
Reasoning about hypercalls necessary for data movement:
VirtIO transport drivers are responsible for instantiating virtqueues (setup_vq) and are able to populate the notify function pointer in the virtqueue that they supply. The virtio-argo transport driver can provide a suitable notify function implementation that will issue the Argo sendv hypercall(s) for sending data from the guest frontend to the backend. By issuing the sendv at the time of the queue notify, rather than as each buffer is added to the virtqueue, the cost of the sendv hypercall can be amortized over multiple buffer additions to the virtqueue.
I also understand that there has been some recent work in the Linaro Project Stratos on "Fat Virtqueues", where the data to be transmitted is included within an expanded virtqueue, which could further reduce the number of hypercalls required, since the data can be transmitted inline with the descriptors. Reference here: https://linaro.atlassian.net/wiki/spaces/STR/pages/25626313982/2021-01-21+Pr... https://linaro.atlassian.net/browse/STR-25
Ah, yes. Obviously, "fatvirtqueue" has pros and cons. One of the cons is that it won't be suitable for bigger payloads, given the limited space in the descriptors.
As a result of the above, I think that a single hypercall could be sufficient for communicating data for multiple requests, and that a two-hypercall-per-request (worst case) upper bound could also be established.
When it comes to the payload, or the data plane, "fatvirtqueue" as well as Argo relies on copying. You dub it "DMA operations". A similar approach can also be seen in virtio-over-ivshmem, where a limited amount of memory is shared: the FE allocates some space in this buffer and copies the payload into it. Those allocations are done via the dma_ops of the virtio_ivshmem driver. The BE, on the other hand, fetches the data from the shared memory by using the "offset" described in a descriptor. The shared memory is divided into a couple of different regions; one is read/write for all, the others each have one writer and many readers. (I hope I'm right here :)
Does that look close to Argo? What differs is who is responsible for copying the data, the kernel or the hypervisor. (Yeah, I know that Argo has more crucial aspects, like access control.)
In this sense, ivshmem can also be a candidate for a hypervisor-agnostic framework. Jailhouse doesn't say so explicitly, AFAIK. Jan may have more to say.
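To make the offset-based data path described above concrete, here is a minimal C sketch under assumed structures: the shm_region type and the trivial bump allocator are made up for illustration and are not the virtio-ivshmem driver's actual interfaces.

/*
 * Sketch of an offset-based shared-buffer data path; all names are
 * hypothetical, not the virtio-ivshmem API.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct shm_region {
    uint8_t *base;     /* shared buffer mapped by both FE and BE */
    size_t   size;
    size_t   next;     /* trivial bump-allocator cursor (FE side) */
};

/* FE: copy the payload into the shared buffer and return the offset that
 * would be placed in the descriptor (SIZE_MAX on overflow). */
static size_t fe_stage_payload(struct shm_region *shm,
                               const void *payload, size_t len)
{
    if (len > shm->size - shm->next)
        return SIZE_MAX;
    size_t off = shm->next;
    memcpy(shm->base + off, payload, len);
    shm->next += len;
    return off;
}

/* BE: fetch the payload using the offset carried in the descriptor. */
static const void *be_fetch_payload(const struct shm_region *shm,
                                    size_t off, size_t len)
{
    if (off > shm->size || len > shm->size - off)
        return NULL;   /* descriptor points outside the shared region */
    return shm->base + off;
}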
Thanks, -Takahiro Akashi
Wei,
On Thu, Aug 26, 2021 at 12:10:19PM +0000, Wei Chen wrote:
Hi Akashi,
-----Original Message----- From: AKASHI Takahiro takahiro.akashi@linaro.org Sent: 2021年8月26日 17:41 To: Wei Chen Wei.Chen@arm.com Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
Hi Wei,
On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
Hi Akashi,
-----Original Message----- From: AKASHI Takahiro takahiro.akashi@linaro.org Sent: 2021年8月18日 13:39 To: Wei Chen Wei.Chen@arm.com Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote:
Hi Akashi,
> -----Original Message----- > From: AKASHI Takahiro takahiro.akashi@linaro.org > Sent: 2021年8月17日 16:08 > To: Wei Chen Wei.Chen@arm.com > Cc: Oleksandr Tyshchenko olekstysh@gmail.com; Stefano
Stabellini
> sstabellini@kernel.org; Alex Benn??e alex.bennee@linaro.org;
Stratos
> Mailing List stratos-dev@op-lists.linaro.org; virtio-
dev@lists.oasis-
> open.org; Arnd Bergmann arnd.bergmann@linaro.org; Viresh Kumar > viresh.kumar@linaro.org; Stefano Stabellini > stefano.stabellini@xilinx.com; stefanha@redhat.com; Jan Kiszka > jan.kiszka@siemens.com; Carl van Schaik
> pratikp@quicinc.com; Srivatsa Vaddagiri vatsa@codeaurora.org;
Jean-
> Philippe Brucker jean-philippe@linaro.org; Mathieu Poirier > mathieu.poirier@linaro.org; Oleksandr Tyshchenko > Oleksandr_Tyshchenko@epam.com; Bertrand Marquis > Bertrand.Marquis@arm.com; Artem Mygaiev
Julien
> Grall julien@xen.org; Juergen Gross jgross@suse.com; Paul
Durrant
> paul@xen.org; Xen Devel xen-devel@lists.xen.org > Subject: Re: Enabling hypervisor agnosticism for VirtIO backends > > Hi Wei, Oleksandr, > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote: > > Hi All, > > > > Thanks for Stefano to link my kvmtool for Xen proposal here. > > This proposal is still discussing in Xen and KVM communities. > > The main work is to decouple the kvmtool from KVM and make > > other hypervisors can reuse the virtual device implementations. > > > > In this case, we need to introduce an intermediate hypervisor > > layer for VMM abstraction, Which is, I think it's very close > > to stratos' virtio hypervisor agnosticism work. > > # My proposal[1] comes from my own idea and doesn't always
represent
> # Linaro's view on this subject nor reflect Alex's concerns.
Nevertheless,
> > Your idea and my proposal seem to share the same background. > Both have the similar goal and currently start with, at first,
Xen
> and are based on kvm-tool. (Actually, my work is derived from > EPAM's virtio-disk, which is also based on kvm-tool.) > > In particular, the abstraction of hypervisor interfaces has a
same
> set of interfaces (for your "struct vmm_impl" and my "RPC
interfaces").
> This is not co-incident as we both share the same origin as I
said
above.
> And so we will also share the same issues. One of them is a way
of
> "sharing/mapping FE's memory". There is some trade-off between > the portability and the performance impact. > So we can discuss the topic here in this ML, too. > (See Alex's original email, too). > Yes, I agree.
> On the other hand, my approach aims to create a "single-binary"
solution
> in which the same binary of BE vm could run on any hypervisors. > Somehow similar to your "proposal-#2" in [2], but in my solution,
all
> the hypervisor-specific code would be put into another entity
(VM),
> named "virtio-proxy" and the abstracted operations are served
via RPC.
> (In this sense, BE is hypervisor-agnostic but might have OS
dependency.)
> But I know that we need discuss if this is a requirement even > in Stratos project or not. (Maybe not) >
Sorry, I haven't had time to finish reading your virtio-proxy completely (I will do it ASAP). But from your description, it seems we need a 3rd VM between FE and BE? My concern is that, if my assumption is right, it will increase the latency in the data transport path, even if we're using some lightweight guest like an RTOS or unikernel.
Yes, you're right. But I'm afraid that it is a matter of degree. As long as we execute 'mapping' operations at every fetch of the payload, we will see a latency issue (even in your case), and if we have some solution for it, we won't see it in my proposal either :)
Oleksandr has sent a proposal to the Xen mailing list to reduce this kind of "mapping/unmapping" operation. So the latency caused by this behavior on Xen may eventually be eliminated, and Linux-KVM doesn't have that problem.
Obviously, I have not yet caught up with that discussion. Which patch specifically? Can you give me a link to the discussion or the patch, please?
It's an RFC discussion. We have tested this RFC patch internally: https://lists.xenproject.org/archives/html/xen-devel/2021-07/msg01532.html
I'm afraid I'm missing something here: I don't see why this proposed API would eliminate the 'mmap' needed to access the queued payload on every request.
-Takahiro Akashi
Thanks, -Takahiro Akashi
-Takahiro Akashi
> Specifically speaking about kvm-tool, I have a concern about its > license term; Targeting different hypervisors and different OSs > (which I assume includes RTOS's), the resultant library should
be
> license permissive and GPL for kvm-tool might be an issue. > Any thoughts? >
Yes. If a user wants to implement a FreeBSD device model but the virtio library is GPL, then the GPL would be a problem. If we have another good candidate, I am open to it.
I have some candidates, particularly for vq/vring, in my mind:
- Open-AMP, or
- the corresponding FreeBSD code
Interesting, I will look into them : )
Cheers, Wei Chen
-Takahiro Akashi
> -Takahiro Akashi > > > [1] https://op-lists.linaro.org/pipermail/stratos-dev/2021- > August/000548.html > [2] https://marc.info/?l=xen-devel&m=162373754705233&w=2 > > > > > > From: Oleksandr Tyshchenko olekstysh@gmail.com > > > Sent: 2021年8月14日 23:38 > > > To: AKASHI Takahiro takahiro.akashi@linaro.org; Stefano
Stabellini
> sstabellini@kernel.org > > > Cc: Alex Benn??e alex.bennee@linaro.org; Stratos Mailing
List
> stratos-dev@op-lists.linaro.org; virtio-dev@lists.oasis-
open.org;
Arnd
> Bergmann arnd.bergmann@linaro.org; Viresh Kumar > viresh.kumar@linaro.org; Stefano Stabellini > stefano.stabellini@xilinx.com; stefanha@redhat.com; Jan Kiszka > jan.kiszka@siemens.com; Carl van Schaik
> pratikp@quicinc.com; Srivatsa Vaddagiri vatsa@codeaurora.org;
Jean-
> Philippe Brucker jean-philippe@linaro.org; Mathieu Poirier > mathieu.poirier@linaro.org; Wei Chen Wei.Chen@arm.com;
Oleksandr
> Tyshchenko Oleksandr_Tyshchenko@epam.com; Bertrand Marquis > Bertrand.Marquis@arm.com; Artem Mygaiev
Julien
> Grall julien@xen.org; Juergen Gross jgross@suse.com; Paul
Durrant
> paul@xen.org; Xen Devel xen-devel@lists.xen.org > > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
backends
> > > > > > Hello, all. > > > > > > Please see some comments below. And sorry for the possible
format
> issues. > > > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro > mailto:takahiro.akashi@linaro.org wrote: > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano
Stabellini
wrote:
> > > > > CCing people working on Xen+VirtIO and IOREQs. Not
trimming
the
> original > > > > > email to let them read the full context. > > > > > > > > > > My comments below are related to a potential Xen
implementation,
> not > > > > > because it is the only implementation that matters, but
because it
> is > > > > > the one I know best. > > > > > > > > Please note that my proposal (and hence the working
prototype)[1]
> > > > is based on Xen's virtio implementation (i.e. IOREQ) and > particularly > > > > EPAM's virtio-disk application (backend server). > > > > It has been, I believe, well generalized but is still a
bit
biased
> > > > toward this original design. > > > > > > > > So I hope you like my approach :) > > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-
dev/2021-
> August/000546.html > > > > > > > > Let me take this opportunity to explain a bit more about
my
approach
> below. > > > > > > > > > Also, please see this relevant email thread: > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2 > > > > > > > > > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote: > > > > > > Hi, > > > > > > > > > > > > One of the goals of Project Stratos is to enable
hypervisor
> agnostic > > > > > > backends so we can enable as much re-use of code as
possible
and
> avoid > > > > > > repeating ourselves. This is the flip side of the
front end
> where > > > > > > multiple front-end implementations are required - one
per OS,
> assuming > > > > > > you don't just want Linux guests. The resultant guests
are
> trivially > > > > > > movable between hypervisors modulo any abstracted
paravirt
type
> > > > > > interfaces. > > > > > > > > > > > > In my original thumb nail sketch of a solution I
envisioned
> vhost-user > > > > > > daemons running in a broadly POSIX like environment.
The
> interface to > > > > > > the daemon is fairly simple requiring only some mapped
memory
> and some > > > > > > sort of signalling for events (on Linux this is
eventfd).
The
> idea was a > > > > > > stub binary would be responsible for any hypervisor
specific
> setup and > > > > > > then launch a common binary to deal with the actual
virtqueue
> requests > > > > > > themselves. > > > > > > > > > > > > Since that original sketch we've seen an expansion in
the
sort
> of ways > > > > > > backends could be created. There is interest in
encapsulating
> backends > > > > > > in RTOSes or unikernels for solutions like SCMI. There
interest
> in Rust > > > > > > has prompted ideas of using the trait interface to
abstract
> differences > > > > > > away as well as the idea of bare-metal Rust backends. > > > > > > > > > > > > We have a card (STR-12) called "Hypercall
Standardisation"
which
> > > > > > calls for a description of the APIs needed from the
hypervisor
> side to > > > > > > support VirtIO guests and their backends. However we
are
some
> way off > > > > > > from that at the moment as I think we need to at least > demonstrate one > > > > > > portable backend before we start codifying
requirements. To
that
> end I > > > > > > want to think about what we need for a backend to
function.
> > > > > > > > > > > > Configuration > > > > > > ============= > > > > > > > > > > > > In the type-2 setup this is typically fairly simple
because
the
> host > > > > > > system can orchestrate the various modules that make
up the
> complete > > > > > > system. In the type-1 case (or even type-2 with
delegated
> service VMs) > > > > > > we need some sort of mechanism to inform the backend
VM
about
> key > > > > > > details about the system: > > > > > > > > > > > > - where virt queue memory is in it's address space > > > > > > - how it's going to receive (interrupt) and trigger
(kick)
> events > > > > > > - what (if any) resources the backend needs to
connect to
> > > > > > > > > > > > Obviously you can elide over configuration issues by
having
> static > > > > > > configurations and baking the assumptions into your
guest
images
> however > > > > > > this isn't scalable in the long term. The obvious
solution
seems
> to be > > > > > > extending a subset of Device Tree data to user space
but
perhaps
> there > > > > > > are other approaches? > > > > > > > > > > > > Before any virtio transactions can take place the
appropriate
> memory > > > > > > mappings need to be made between the FE guest and the
BE
guest.
> > > > > > > > > > > Currently the whole of the FE guests address space
needs to
be
> visible > > > > > > to whatever is serving the virtio requests. I can
envision 3
> approaches: > > > > > > > > > > > > * BE guest boots with memory already mapped > > > > > > > > > > > > This would entail the guest OS knowing where in it's
Guest
> Physical > > > > > > Address space is already taken up and avoiding
clashing. I
> would assume > > > > > > in this case you would want a standard interface to
userspace
> to then > > > > > > make that address space visible to the backend daemon. > > > > > > > > Yet another way here is that we would have well known
"shared
> memory" between > > > > VMs. I think that Jailhouse's ivshmem gives us good
insights on
this
> matter > > > > and that it can even be an alternative for hypervisor-
agnostic
> solution. > > > > > > > > (Please note memory regions in ivshmem appear as a PCI
device
and
> can be > > > > mapped locally.) > > > > > > > > I want to add this shared memory aspect to my virtio-proxy,
but
> > > > the resultant solution would eventually look similar to
ivshmem.
> > > > > > > > > > * BE guests boots with a hypervisor handle to memory > > > > > > > > > > > > The BE guest is then free to map the FE's memory to
where
it
> wants in > > > > > > the BE's guest physical address space. > > > > > > > > > > I cannot see how this could work for Xen. There is no
"handle"
to
> give > > > > > to the backend if the backend is not running in dom0. So
for
Xen I
> think > > > > > the memory has to be already mapped > > > > > > > > In Xen's IOREQ solution (virtio-blk), the following
information
is
> expected > > > > to be exposed to BE via Xenstore: > > > > (I know that this is a tentative approach though.) > > > > - the start address of configuration space > > > > - interrupt number > > > > - file path for backing storage > > > > - read-only flag > > > > And the BE server have to call a particular hypervisor
interface
to
> > > > map the configuration space. > > > > > > Yes, Xenstore was chosen as a simple way to pass
configuration
info to
> the backend running in a non-toolstack domain. > > > I remember, there was a wish to avoid using Xenstore in
Virtio
backend
> itself if possible, so for non-toolstack domain, this could done
with
> adjusting devd (daemon that listens for devices and launches
backends)
> > > to read backend configuration from the Xenstore anyway and
pass it
to
> the backend via command line arguments. > > > > > > > Yes, in current PoC code we're using xenstore to pass device > configuration. > > We also designed a static device configuration parse method
for
Dom0less
> or > > other scenarios don't have xentool. yes, it's from device
model
command
> line > > or a config file. > > > > > But, if ... > > > > > > > > > > > In my approach (virtio-proxy), all those Xen (or
hypervisor)-
> specific > > > > stuffs are contained in virtio-proxy, yet another VM, to
hide
all
> details. > > > > > > ... the solution how to overcome that is already found and
proven
to
> work then even better. > > > > > > > > > > > > > # My point is that a "handle" is not mandatory for
executing
mapping.
> > > > > > > > > and the mapping probably done by the > > > > > toolstack (also see below.) Or we would have to invent a
new
Xen
> > > > > hypervisor interface and Xen virtual machine privileges
to
allow
> this > > > > > kind of mapping. > > > > > > > > > If we run the backend in Dom0 that we have no problems
of
course.
> > > > > > > > One of difficulties on Xen that I found in my approach is
that
> calling > > > > such hypervisor intefaces (registering IOREQ, mapping
memory) is
> only > > > > allowed on BE servers themselvies and so we will have to
extend
> those > > > > interfaces. > > > > This, however, will raise some concern on security and
privilege
> distribution > > > > as Stefan suggested. > > > > > > We also faced policy related issues with Virtio backend
running in
> other than Dom0 domain in a "dummy" xsm mode. In our target
system we
run
> the backend in a driver > > > domain (we call it DomD) where the underlying H/W resides.
We
trust it,
> so we wrote policy rules (to be used in "flask" xsm mode) to
provide
it
> with a little bit more privileges than a simple DomU had. > > > Now it is permitted to issue device-model, resource and
memory
> mappings, etc calls. > > > > > > > > > > > > > > > > > > > To activate the mapping will > > > > > > require some sort of hypercall to the hypervisor. I
can see
two
> options > > > > > > at this point: > > > > > > > > > > > > - expose the handle to userspace for daemon/helper
to
trigger
> the > > > > > > mapping via existing hypercall interfaces. If
using a
helper
> you > > > > > > would have a hypervisor specific one to avoid the
daemon
> having to > > > > > > care too much about the details or push that
complexity
into
> a > > > > > > compile time option for the daemon which would
result in
> different > > > > > > binaries although a common source base. > > > > > > > > > > > > - expose a new kernel ABI to abstract the hypercall > differences away > > > > > > in the guest kernel. In this case the userspace
would
> essentially > > > > > > ask for an abstract "map guest N memory to
userspace
ptr"
> and let > > > > > > the kernel deal with the different hypercall
interfaces.
> This of > > > > > > course assumes the majority of BE guests would be
Linux
> kernels and > > > > > > leaves the bare-metal/unikernel approaches to
their own
> devices. > > > > > > > > > > > > Operation > > > > > > ========= > > > > > > > > > > > > The core of the operation of VirtIO is fairly simple.
Once
the
> > > > > > vhost-user feature negotiation is done it's a case of
receiving
> update > > > > > > events and parsing the resultant virt queue for data.
The
vhost-
> user > > > > > > specification handles a bunch of setup before that
point,
mostly
> to > > > > > > detail where the virt queues are set up FD's for
memory and
> event > > > > > > communication. This is where the envisioned stub
process
would
> be > > > > > > responsible for getting the daemon up and ready to run.
This
is
> > > > > > currently done inside a big VMM like QEMU but I
suspect a
modern
> > > > > > approach would be to use the rust-vmm vhost crate. It
would
then
> either > > > > > > communicate with the kernel's abstracted ABI or be re-
targeted
> as a > > > > > > build option for the various hypervisors. > > > > > > > > > > One thing I mentioned before to Alex is that Xen doesn't
have
VMMs
> the > > > > > way they are typically envisioned and described in other > environments. > > > > > Instead, Xen has IOREQ servers. Each of them connects > independently to > > > > > Xen via the IOREQ interface. E.g. today multiple QEMUs
could
be
> used as > > > > > emulators for a single Xen VM, each of them connecting
to Xen
> > > > > independently via the IOREQ interface. > > > > > > > > > > The component responsible for starting a daemon and/or
setting
up
> shared > > > > > interfaces is the toolstack: the xl command and the
libxl/libxc
> > > > > libraries. > > > > > > > > I think that VM configuration management (or orchestration
in
> Startos > > > > jargon?) is a subject to debate in parallel. > > > > Otherwise, is there any good assumption to avoid it right
now?
> > > > > > > > > Oleksandr and others I CCed have been working on ways
for the
> toolstack > > > > > to create virtio backends and setup memory mappings.
They
might be
> able > > > > > to provide more info on the subject. I do think we miss
a way
to
> provide > > > > > the configuration to the backend and anything else that
the
> backend > > > > > might require to start doing its job. > > > > > > Yes, some work has been done for the toolstack to handle
Virtio
MMIO
> devices in > > > general and Virtio block devices in particular. However, it
has
not
> been upstreaned yet. > > > Updated patches on review now: > > > https://lore.kernel.org/xen-devel/1621626361-29076-1-git-
send-
email-
> olekstysh@gmail.com/ > > > > > > There is an additional (also important) activity to
improve/fix
> foreign memory mapping on Arm which I am also involved in. > > > The foreign memory mapping is proposed to be used for Virtio
backends
> (device emulators) if there is a need to run guest OS completely > unmodified. > > > Of course, the more secure way would be to use grant memory
mapping.
> Brietly, the main difference between them is that with foreign
mapping
the
> backend > > > can map any guest memory it wants to map, but with grant
mapping
it is
> allowed to map only what was previously granted by the frontend. > > > > > > So, there might be a problem if we want to pre-map some
guest
memory
> in advance or to cache mappings in the backend in order to
improve
> performance (because the mapping/unmapping guest pages every
request
> requires a lot of back and forth to Xen + P2M updates). In a
nutshell,
> currently, in order to map a guest page into the backend address
space
we
> need to steal a real physical page from the backend domain. So,
with
the
> said optimizations we might end up with no free memory in the
backend
> domain (see XSA-300). And what we try to achieve is to not waste
a
real
> domain memory at all by providing safe non-allocated-yet (so
unused)
> address space for the foreign (and grant) pages to be mapped
into,
this
> enabling work implies Xen and Linux (and likely DTB bindings)
changes.
> However, as it turned out, for this to work in a proper and safe
way
some
> prereq work needs to be done. > > > You can find the related Xen discussion at: > > > https://lore.kernel.org/xen-devel/1627489110-25633-1-git-
send-
email-
> olekstysh@gmail.com/ > > > > > > > > > > > > > > > > > > > > > > One question is how to best handle notification and
kicks.
The
> existing > > > > > > vhost-user framework uses eventfd to signal the daemon
(although
> QEMU > > > > > > is quite capable of simulating them when you use TCG).
Xen
has
> it's own > > > > > > IOREQ mechanism. However latency is an important
factor and
> having > > > > > > events go through the stub would add quite a lot. > > > > > > > > > > Yeah I think, regardless of anything else, we want the
backends to
> > > > > connect directly to the Xen hypervisor. > > > > > > > > In my approach, > > > > a) BE -> FE: interrupts triggered by BE calling a
hypervisor
> interface > > > > via virtio-proxy > > > > b) FE -> BE: MMIO to config raises events (in event
channels),
> which is > > > > converted to a callback to BE via virtio-
proxy
> > > > (Xen's event channel is internnally
implemented by
> interrupts.) > > > > > > > > I don't know what "connect directly" means here, but
sending
> interrupts > > > > to the opposite side would be best efficient. > > > > Ivshmem, I suppose, takes this approach by utilizing PCI's
msi-x
> mechanism. > > > > > > Agree that MSI would be more efficient than SPI... > > > At the moment, in order to notify the frontend, the backend
issues
a
> specific device-model call to query Xen to inject a
corresponding SPI
to
> the guest. > > > > > > > > > > > > > > > > > > > > Could we consider the kernel internally converting
IOREQ
> messages from > > > > > > the Xen hypervisor to eventfd events? Would this scale
with
> other kernel > > > > > > hypercall interfaces? > > > > > > > > > > > > So any thoughts on what directions are worth
experimenting
with?
> > > > > > > > > > One option we should consider is for each backend to
connect
to
> Xen via > > > > > the IOREQ interface. We could generalize the IOREQ
interface
and
> make it > > > > > hypervisor agnostic. The interface is really trivial and
easy
to
> add. > > > > > > > > As I said above, my proposal does the same thing that you
mentioned
> here :) > > > > The difference is that I do call hypervisor interfaces via
virtio-
> proxy. > > > > > > > > > The only Xen-specific part is the notification mechanism,
which is
> an > > > > > event channel. If we replaced the event channel with
something
> else the > > > > > interface would be generic. See: > > > > > https://gitlab.com/xen-project/xen/- > /blob/staging/xen/include/public/hvm/ioreq.h#L52 > > > > > > > > > > I don't think that translating IOREQs to eventfd in the
kernel
is
> a > > > > > good idea: if feels like it would be extra complexity
and that
the
> > > > > kernel shouldn't be involved as this is a backend-
hypervisor
> interface. > > > > > > > > Given that we may want to implement BE as a bare-metal
application
> > > > as I did on Zephyr, I don't think that the translation
would not
be
> > > > a big issue, especially on RTOS's. > > > > It will be some kind of abstraction layer of interrupt
handling
> > > > (or nothing but a callback mechanism). > > > > > > > > > Also, eventfd is very Linux-centric and we are trying to
design an
> > > > > interface that could work well for RTOSes too. If we
want to
do
> > > > > something different, both OS-agnostic and hypervisor-
agnostic,
> perhaps > > > > > we could design a new interface. One that could be
implementable
> in the > > > > > Xen hypervisor itself (like IOREQ) and of course any
other
> hypervisor > > > > > too. > > > > > > > > > > > > > > > There is also another problem. IOREQ is probably not be
the
only
> > > > > interface needed. Have a look at > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2.
Don't we
> also need > > > > > an interface for the backend to inject interrupts into
the
> frontend? And > > > > > if the backend requires dynamic memory mappings of
frontend
pages,
> then > > > > > we would also need an interface to map/unmap domU pages. > > > > > > > > My proposal document might help here; All the interfaces
required
> for > > > > virtio-proxy (or hypervisor-related interfaces) are listed
as
> > > > RPC protocols :) > > > > > > > > > These interfaces are a lot more problematic than IOREQ:
IOREQ
is
> tiny > > > > > and self-contained. It is easy to add anywhere. A new
interface to
> > > > > inject interrupts or map pages is more difficult to
manage
because
> it > > > > > would require changes scattered across the various
emulators.
> > > > > > > > Exactly. I have no confident yet that my approach will
also
apply
> > > > to other hypervisors than Xen. > > > > Technically, yes, but whether people can accept it or not
is a
> different > > > > matter. > > > > > > > > Thanks, > > > > -Takahiro Akashi > > > > > > > > > > > > -- > > > Regards, > > > > > > Oleksandr Tyshchenko > > IMPORTANT NOTICE: The contents of this email and any
attachments are
> confidential and may also be privileged. If you are not the
intended
> recipient, please notify the sender immediately and do not
disclose
the
> contents to any other person, use it for any purpose, or store
or copy
the
> information in any medium. Thank you.
Hi Akashi,
-----Original Message----- From: AKASHI Takahiro takahiro.akashi@linaro.org Sent: 2021年8月31日 14:18 To: Wei Chen Wei.Chen@arm.com Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
Wei,
On Thu, Aug 26, 2021 at 12:10:19PM +0000, Wei Chen wrote:
Hi Akashi,
-----Original Message----- From: AKASHI Takahiro takahiro.akashi@linaro.org Sent: 2021年8月26日 17:41 To: Wei Chen Wei.Chen@arm.com Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
Hi Wei,
On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
Hi Akashi,
-----Original Message----- From: AKASHI Takahiro takahiro.akashi@linaro.org Sent: 2021年8月18日 13:39 To: Wei Chen Wei.Chen@arm.com Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
On Tue, Aug 17, 2021 at 08:39:09AM +0000, Wei Chen wrote: > Hi Akashi, > > > -----Original Message----- > > From: AKASHI Takahiro takahiro.akashi@linaro.org > > Sent: 2021年8月17日 16:08 > > To: Wei Chen Wei.Chen@arm.com > > Cc: Oleksandr Tyshchenko olekstysh@gmail.com; Stefano
Stabellini
> > sstabellini@kernel.org; Alex Benn??e
Stratos > > Mailing List stratos-dev@op-lists.linaro.org; virtio- dev@lists.oasis- > > open.org; Arnd Bergmann arnd.bergmann@linaro.org; Viresh
Kumar
> > viresh.kumar@linaro.org; Stefano Stabellini > > stefano.stabellini@xilinx.com; stefanha@redhat.com; Jan
Kiszka
> > jan.kiszka@siemens.com; Carl van Schaik
> > pratikp@quicinc.com; Srivatsa Vaddagiri
Jean-
> > Philippe Brucker jean-philippe@linaro.org; Mathieu Poirier > > mathieu.poirier@linaro.org; Oleksandr Tyshchenko > > Oleksandr_Tyshchenko@epam.com; Bertrand Marquis > > Bertrand.Marquis@arm.com; Artem Mygaiev
Julien > > Grall julien@xen.org; Juergen Gross jgross@suse.com;
Paul
Durrant
> > paul@xen.org; Xen Devel xen-devel@lists.xen.org > > Subject: Re: Enabling hypervisor agnosticism for VirtIO
backends
> > > > Hi Wei, Oleksandr, > > > > On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote: > > > Hi All, > > > > > > Thanks for Stefano to link my kvmtool for Xen proposal
here.
> > > This proposal is still discussing in Xen and KVM
communities.
> > > The main work is to decouple the kvmtool from KVM and make > > > other hypervisors can reuse the virtual device
implementations.
> > > > > > In this case, we need to introduce an intermediate
hypervisor
> > > layer for VMM abstraction, Which is, I think it's very
close
> > > to stratos' virtio hypervisor agnosticism work. > > > > # My proposal[1] comes from my own idea and doesn't always
represent
> > # Linaro's view on this subject nor reflect Alex's concerns. Nevertheless, > > > > Your idea and my proposal seem to share the same background. > > Both have the similar goal and currently start with, at
first,
Xen
> > and are based on kvm-tool. (Actually, my work is derived
from
> > EPAM's virtio-disk, which is also based on kvm-tool.) > > > > In particular, the abstraction of hypervisor interfaces has
a
same
> > set of interfaces (for your "struct vmm_impl" and my "RPC
interfaces").
> > This is not co-incident as we both share the same origin as
I
said
above. > > And so we will also share the same issues. One of them is a
way
of
> > "sharing/mapping FE's memory". There is some trade-off
between
> > the portability and the performance impact. > > So we can discuss the topic here in this ML, too. > > (See Alex's original email, too). > > > Yes, I agree. > > > On the other hand, my approach aims to create a "single-
binary"
solution > > in which the same binary of BE vm could run on any
hypervisors.
> > Somehow similar to your "proposal-#2" in [2], but in my
solution,
all
> > the hypervisor-specific code would be put into another
entity
(VM),
> > named "virtio-proxy" and the abstracted operations are
served
via RPC.
> > (In this sense, BE is hypervisor-agnostic but might have OS dependency.) > > But I know that we need discuss if this is a requirement
even
> > in Stratos project or not. (Maybe not) > > > > Sorry, I haven't had time to finish reading your virtio-proxy
completely
> (I will do it ASAP). But from your description, it seems we
need a
> 3rd VM between FE and BE? My concern is that, if my assumption
is
right,
> will it increase the latency in data transport path? Even if
we're
> using some lightweight guest like RTOS or Unikernel,
Yes, you're right. But I'm afraid that it is a matter of degree. As far as we execute 'mapping' operations at every fetch of
payload,
we will see latency issue (even in your case) and if we have
some
solution
for it, we won't see it neither in my proposal :)
Oleksandr has sent a proposal to Xen mailing list to reduce this
kind
of "mapping/unmapping" operations. So the latency caused by this
behavior
on Xen may eventually be eliminated, and Linux-KVM doesn't have
that
problem.
Obviously, I have not yet caught up there in the discussion. Which patch specifically?
Can you give me the link to the discussion or patch, please?
It's an RFC discussion. We have tested this RFC patch internally: https://lists.xenproject.org/archives/html/xen-devel/2021-07/msg01532.html
I'm afraid I'm missing something here: I don't see why this proposed API would eliminate the 'mmap' needed to access the queued payload on every request.
This API gives the Xen device model (QEMU or kvmtool) the ability to map the whole of guest RAM into the device model's address space. In this case, the device model doesn't need dynamic hypercalls to map/unmap payload memory; it can use a flat offset to access payload memory in its address space directly, just like a KVM device model does now.
Before this API, when the device model wanted to map the whole of guest memory, it would severely consume the physical pages of Dom0/DomD.
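To illustrate the difference on the request path, here is a rough C sketch; the guest_ram_view type and the hyp_map_guest_pages()/hyp_unmap_guest_pages() placeholders are hypothetical, not Xen or kvmtool interfaces.

/*
 * Sketch contrasting the two access models; the mapping helpers below are
 * hypothetical placeholders, not real hypervisor interfaces.
 */
#include <stddef.h>
#include <stdint.h>

struct guest_ram_view {
    void    *base;        /* whole guest RAM pre-mapped into the BE */
    uint64_t gpa_start;   /* guest physical address of that mapping */
    uint64_t size;
};

/* With the whole of guest RAM pre-mapped, payload access is just a flat
 * offset calculation -- no hypercall on the request path. */
static void *gpa_to_ptr(const struct guest_ram_view *ram,
                        uint64_t gpa, uint64_t len)
{
    if (len > ram->size || gpa < ram->gpa_start ||
        gpa - ram->gpa_start > ram->size - len)
        return NULL;
    return (uint8_t *)ram->base + (gpa - ram->gpa_start);
}

/* Without it, each request needs a map/unmap pair through
 * hypervisor-specific calls (placeholder names), which is where the
 * per-request latency comes from. */
void *hyp_map_guest_pages(uint64_t gpa, uint64_t len);   /* placeholder */
void  hyp_unmap_guest_pages(void *ptr, uint64_t len);    /* placeholder */

static int process_request_slow(uint64_t gpa, uint64_t len)
{
    void *p = hyp_map_guest_pages(gpa, len);
    if (!p)
        return -1;
    /* ... handle the payload at p ... */
    hyp_unmap_guest_pages(p, len);
    return 0;
}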
-Takahiro Akashi
Thanks, -Takahiro Akashi
-Takahiro Akashi
> > Specifically speaking about kvm-tool, I have a concern about
its
> > license term; Targeting different hypervisors and different
OSs
> > (which I assume includes RTOS's), the resultant library
should
be
> > license permissive and GPL for kvm-tool might be an issue. > > Any thoughts? > > > > Yes. If user want to implement a FreeBSD device model, but the
virtio
> library is GPL. Then GPL would be a problem. If we have
another
good
> candidate, I am open to it.
I have some candidates, particularly for vq/vring, in my mind:
- Open-AMP, or
- corresponding Free-BSD code
Interesting, I will look into them : )
Cheers, Wei Chen
-Takahiro Akashi
Hi Wei,
On Wed, Sep 01, 2021 at 11:12:58AM +0000, Wei Chen wrote:
Hi Akashi,
-----Original Message-----
From: AKASHI Takahiro takahiro.akashi@linaro.org
Sent: 2021年8月31日 14:18
Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
Wei,
On Thu, Aug 26, 2021 at 12:10:19PM +0000, Wei Chen wrote:
Hi Akashi,
-----Original Message-----
From: AKASHI Takahiro takahiro.akashi@linaro.org
Sent: 2021年8月26日 17:41
Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
Hi Wei,
On Fri, Aug 20, 2021 at 03:41:50PM +0900, AKASHI Takahiro wrote:
On Wed, Aug 18, 2021 at 08:35:51AM +0000, Wei Chen wrote:
Hi Akashi,
> > Sorry, I haven't had time to finish reading your virtio-proxy completely
> > (I will do it ASAP). But from your description, it seems we need a
> > 3rd VM between FE and BE? My concern is that, if my assumption is right,
> > will it increase the latency in data transport path? Even if we're
> > using some lightweight guest like RTOS or Unikernel,
>
> Yes, you're right. But I'm afraid that it is a matter of degree.
> As far as we execute 'mapping' operations at every fetch of payload,
> we will see latency issue (even in your case) and if we have some solution
> for it, we won't see it neither in my proposal :)
Oleksandr has sent a proposal to the Xen mailing list to reduce this kind
of "mapping/unmapping" operations. So the latency caused by this behavior
on Xen may eventually be eliminated, and Linux-KVM doesn't have that
problem.

Obviously, I have not yet caught up there in the discussion. Which patch
specifically? Can you give me the link to the discussion or patch, please?

It's an RFC discussion. We have tested this RFC patch internally.
https://lists.xenproject.org/archives/html/xen-devel/2021-07/msg01532.html

I'm afraid that I miss something here, but I don't know why this proposed
API will lead to eliminating 'mmap' in accessing the queued payload at
every request?

This API gives the Xen device model (QEMU or kvmtool) the ability to map
the whole guest RAM in the device model's address space. In this case, the
device model doesn't need dynamic hypercalls to map/unmap payload memory.
It can use a flat offset to access payload memory in its address space
directly, just like a KVM device model does now.
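To make the contrast concrete, here is a small illustrative sketch in C of the two data-path patterns being compared: mapping and unmapping the payload around every request, versus a one-time whole-RAM mapping that is then accessed by flat offset. The hyp_map_payload()/hyp_unmap_payload() names are hypothetical placeholders, not a real Xen or KVM interface.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical stand-ins for a per-request hypervisor mapping call; in a
 * real Xen backend these would be foreign/grant map/unmap operations whose
 * hypercalls and P2M updates are the latency concern discussed above. */
static void *hyp_map_payload(uint64_t gpa, size_t len)
{
    (void)gpa;
    return malloc(len);               /* pretend the payload got mapped */
}

static void hyp_unmap_payload(void *va, size_t len)
{
    (void)len;
    free(va);
}

/* Pattern 1: map/unmap around every single request. */
static void handle_request_per_mapping(uint64_t payload_gpa, size_t len)
{
    uint8_t *va = hyp_map_payload(payload_gpa, len);   /* hypercall(s) */
    if (va)
        memset(va, 0, len);           /* "process" the payload */
    hyp_unmap_payload(va, len);                        /* hypercall(s) */
}

/* Pattern 2: the whole FE RAM was mapped once up front, so a payload is
 * reached by a flat offset with no hypercall on the data path. */
static void handle_request_premapped(uint8_t *ram_va, uint64_t ram_gpa_base,
                                     size_t ram_size,
                                     uint64_t payload_gpa, size_t len)
{
    if (payload_gpa < ram_gpa_base ||
        payload_gpa - ram_gpa_base + len > ram_size)
        return;                       /* out of range */
    memset(ram_va + (payload_gpa - ram_gpa_base), 0, len);
}

int main(void)
{
    static uint8_t fake_fe_ram[4096]; /* stands in for the premapped FE RAM */

    handle_request_per_mapping(0x40000100, 64);
    handle_request_premapped(fake_fe_ram, 0x40000000, sizeof(fake_fe_ram),
                             0x40000100, 64);
    puts("both patterns handled one request");
    return 0;
}

In the pre-mapped variant the only per-request work is a bounds check and an offset calculation, which is what the latency argument above comes down to.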
Thank you. Quickly, let me make sure one thing: this API itself doesn't do
any mapping operations, right? So I suppose that the virtio BE guest is
responsible to
1) fetch the information about all the memory regions in FE,
2) call this API to allocate a big chunk of unused space in BE,
3) create grant/foreign mappings for FE onto this region(s)
in the initialization/configuration of emulated virtio devices.

Is this the way this API is expected to be used?
Does Xen already have an interface for (1)?

-Takahiro Akashi

Before this API, when the device model maps the whole guest memory, it will
severely consume the physical pages of Dom-0/Dom-D.
Hi Akashi,
I am sorry for the possible format issues.
It's an RFC discussion. We have tested this RFC patch internally.
https://lists.xenproject.org/archives/html/xen-devel/2021-07/msg01532.html
I'm afraid that I miss something here, but I don't know why this proposed API will lead to eliminating 'mmap' in accessing the queued payload at every request?
This API gives the Xen device model (QEMU or kvmtool) the ability to map
the whole guest RAM in the device model's address space. In this case, the
device model doesn't need dynamic hypercalls to map/unmap payload memory.
It can use a flat offset to access payload memory in its address space
directly, just like a KVM device model does now.
Yes!
Thank you. Quickly, let me make sure one thing: This API itself doesn't do any mapping operations, right?
Right. The only purpose of that "API" is to query the hypervisor for
unallocated address space ranges to map the foreign pages into (instead of
stealing real RAM pages). In a nutshell, if you try to map the whole guest
memory in the backend address space on Arm (or even cache some mappings)
you might end up with memory exhaustion in the backend domain (XSA-300),
and the possibility of hitting XSA-300 is higher if your backend needs to
serve several Guests. Of course, this depends on the memory assigned to the
backend domain and to the Guest(s) it serves... We believe that with the
proposed solution the backend will be able to handle Guest(s) without
wasting its real RAM. However, please note that the proposed Xen + Linux
changes which are on review now [1] are far from the final solution and
require rework and some prereq work to operate in a proper and safe way.
So I suppose that virtio BE guest is responsible to
1) fetch the information about all the memory regions in FE,
2) call this API to allocate a big chunk of unused space in BE,
3) create grant/foreign mappings for FE onto this region(s)
in the initialization/configuration of emulated virtio devices.
Is this the way this API is expected to be used?
Not really: the userspace backend doesn't need to call this API at all. All
that the backend calls is still xenforeignmemory_map()/xenforeignmemory_unmap(),
so let's say the "magic" is done by Linux and Xen internally. You can take a
look at the virtio-disk PoC [2] (last 4 patches) to better understand what
Wei and I are talking about. There we map the Guest memory at the beginning
and just calculate a pointer at runtime. Again, the code is not in good
shape, but it is enough to demonstrate the feasibility of the improvement.
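For readers unfamiliar with that interface, below is a minimal sketch of the pattern the PoC follows: map the whole FE RAM once through the existing xenforeignmemory_map()/xenforeignmemory_unmap() calls and compute payload pointers by offset afterwards. The domid, RAM base/size and error handling are assumptions for illustration (as noted, discovering the real guest layout still needs a proper design), and the program needs libxenforeignmemory to build.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <xenforeignmemory.h>

/* Map `size` bytes of FE RAM starting at guest-physical `gpa_base` into
 * our address space with one call, returning the BE-side virtual address. */
static void *map_fe_ram(xenforeignmemory_handle *fmem, uint32_t domid,
                        uint64_t gpa_base, size_t size)
{
    size_t pages = size >> 12;                     /* assuming 4K pages */
    xen_pfn_t *pfns = calloc(pages, sizeof(*pfns));
    int *errs = calloc(pages, sizeof(*errs));
    void *va = NULL;

    if (pfns && errs) {
        for (size_t i = 0; i < pages; i++)
            pfns[i] = (gpa_base >> 12) + i;        /* FE guest frame numbers */
        va = xenforeignmemory_map(fmem, domid, PROT_READ | PROT_WRITE,
                                  pages, pfns, errs);
    }
    free(pfns);
    free(errs);
    return va;
}

int main(int argc, char **argv)
{
    uint32_t domid    = (argc > 1) ? (uint32_t)atoi(argv[1]) : 1;
    uint64_t gpa_base = 0x40000000;                /* assumed FE RAM base */
    size_t   size     = 64UL << 20;                /* assumed FE RAM size */

    xenforeignmemory_handle *fmem = xenforeignmemory_open(NULL, 0);
    if (!fmem)
        return 1;

    uint8_t *ram = map_fe_ram(fmem, domid, gpa_base, size);
    if (ram) {
        /* From here on, a payload at FE guest-physical address `gpa` is
         * simply ram + (gpa - gpa_base): no hypercall on the data path. */
        xenforeignmemory_unmap(fmem, ram, size >> 12);
    }
    xenforeignmemory_close(fmem);
    return ram ? 0 : 1;
}

Where such a wholesale mapping should land on the backend side, and how to do it without exhausting the backend domain's own memory, is exactly what the Xen/Linux work referenced in [1] is about.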
Does Xen already have an interface for (1)?
I am not aware of anything existing. For the PoC I guessed the Guest memory
layout in a really hackish way (I got the total Guest memory size, so having
GUEST_RAMX_BASE/GUEST_RAMX_SIZE in hand I just performed the calculation).
Definitely, it is a no-go, so 1) deserves additional discussion/design.
[1] https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-olekstys... https://lore.kernel.org/lkml/1627490656-1267-1-git-send-email-olekstysh@gmai... https://lore.kernel.org/lkml/1627490656-1267-2-git-send-email-olekstysh@gmai... [2] https://github.com/otyshchenko1/virtio-disk/commits/map_opt_next
Hi Akashi,
-----Original Message-----
From: AKASHI Takahiro takahiro.akashi@linaro.org
Sent: 2021年9月1日 20:29
Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
Thank you. Let me quickly make sure of one thing: this API itself doesn't do any mapping operations, right? So I suppose that the virtio BE guest is responsible for the following, during the initialization/configuration of the emulated virtio devices:
(1) fetch the information about all the memory regions in the FE,
(2) call this API to allocate a big chunk of unused space in the BE,
(3) create grant/foreign mappings for the FE onto this region (or regions).
Is this the way this API is expected to be used? Does Xen already have an interface for (1)?
They are discussing in that thread how to do it properly. Because this API is common to both architectures, x86 and Arm both have to be considered.
Hi Akashi, Oleksandr,
-----Original Message-----
From: Xen-devel xen-devel-bounces@lists.xenproject.org On Behalf Of Wei Chen
Sent: 2021年9月2日 9:31
To: AKASHI Takahiro takahiro.akashi@linaro.org
Subject: RE: Enabling hypervisor agnosticism for VirtIO backends
They are discussing in that thread how to do it properly. Because this API is common to both architectures, x86 and Arm both have to be considered.
Please ignore my above reply. I hadn't seen that Oleksandr had already replied to this question. Sorry about that!
-Takahiro Akashi
Before this API, When device model to map whole guest memory, will severely consume the physical pages of Dom-0/Dom-D.
-Takahiro Akashi
Thanks, -Takahiro Akashi
> -Takahiro Akashi > > > > > > Specifically speaking about kvm-tool, I have a concern
about
its
> > > > > license term; Targeting different hypervisors and
different
OSs
> > > > > (which I assume includes RTOS's), the resultant
library
should
be > > > > > license permissive and GPL for kvm-tool might be an
issue.
> > > > > Any thoughts? > > > > > > > > > > > > > Yes. If user want to implement a FreeBSD device model,
but
the
virtio > > > > library is GPL. Then GPL would be a problem. If we have
another
good > > > > candidate, I am open to it. > > > > > > I have some candidates, particularly for vq/vring, in my
mind:
> > > * Open-AMP, or > > > * corresponding Free-BSD code > > > > > > > Interesting, I will look into them : ) > > > > Cheers, > > Wei Chen > > > > > -Takahiro Akashi > > > > > > > > > > > -Takahiro Akashi > > > > > > > > > > > > > > > [1] https://op-lists.linaro.org/pipermail/stratos-
dev/2021-
> > > > > August/000548.html > > > > > [2] https://marc.info/?l=xen-
devel&m=162373754705233&w=2
> > > > > > > > > > > > > > > > > > From: Oleksandr Tyshchenko olekstysh@gmail.com > > > > > > > Sent: 2021年8月14日 23:38 > > > > > > > To: AKASHI Takahiro takahiro.akashi@linaro.org;
Stefano
> > > Stabellini > > > > > sstabellini@kernel.org > > > > > > > Cc: Alex Benn??e alex.bennee@linaro.org; Stratos
Mailing
List > > > > > stratos-dev@op-lists.linaro.org; virtio-
dev@lists.oasis-
open.org; > > > Arnd > > > > > Bergmann arnd.bergmann@linaro.org; Viresh Kumar > > > > > viresh.kumar@linaro.org; Stefano Stabellini > > > > > stefano.stabellini@xilinx.com; stefanha@redhat.com;
Jan
Kiszka
> > > > > jan.kiszka@siemens.com; Carl van Schaik cvanscha@qti.qualcomm.com; > > > > > pratikp@quicinc.com; Srivatsa Vaddagiri
Jean- > > > > > Philippe Brucker jean-philippe@linaro.org; Mathieu
Poirier
> > > > > mathieu.poirier@linaro.org; Wei Chen
Oleksandr > > > > > Tyshchenko Oleksandr_Tyshchenko@epam.com; Bertrand
Marquis
> > > > > Bertrand.Marquis@arm.com; Artem Mygaiev Artem_Mygaiev@epam.com; > > > Julien > > > > > Grall julien@xen.org; Juergen Gross
Paul
Durrant > > > > > paul@xen.org; Xen Devel xen-devel@lists.xen.org > > > > > > > Subject: Re: Enabling hypervisor agnosticism for
VirtIO
backends > > > > > > > > > > > > > > Hello, all. > > > > > > > > > > > > > > Please see some comments below. And sorry for the
possible
format > > > > > issues. > > > > > > > > > > > > > > > On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro > > > > > mailto:takahiro.akashi@linaro.org wrote: > > > > > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700,
Stefano
Stabellini > > > wrote: > > > > > > > > > CCing people working on Xen+VirtIO and IOREQs.
Not
trimming > > > the > > > > > original > > > > > > > > > email to let them read the full context. > > > > > > > > > > > > > > > > > > My comments below are related to a potential
Xen
> > > implementation, > > > > > not > > > > > > > > > because it is the only implementation that
matters,
but
> > > because it > > > > > is > > > > > > > > > the one I know best. > > > > > > > > > > > > > > > > Please note that my proposal (and hence the
working
prototype)[1] > > > > > > > > is based on Xen's virtio implementation (i.e.
IOREQ)
and
> > > > > particularly > > > > > > > > EPAM's virtio-disk application (backend server). > > > > > > > > It has been, I believe, well generalized but is
still
a
bit > > > biased > > > > > > > > toward this original design. > > > > > > > > > > > > > > > > So I hope you like my approach :) > > > > > > > > > > > > > > > > [1] https://op-
lists.linaro.org/pipermail/stratos-
dev/2021- > > > > > August/000546.html > > > > > > > > > > > > > > > > Let me take this opportunity to explain a bit
more
about
my > > > approach > > > > > below. > > > > > > > > > > > > > > > > > Also, please see this relevant email thread: > > > > > > > > > https://marc.info/?l=xen-
devel&m=162373754705233&w=2
> > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 4 Aug 2021, Alex Bennée wrote: > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > One of the goals of Project Stratos is to
enable
hypervisor > > > > > agnostic > > > > > > > > > > backends so we can enable as much re-use of
code
as
possible > > > and > > > > > avoid > > > > > > > > > > repeating ourselves. This is the flip side
of
the
front end > > > > > where > > > > > > > > > > multiple front-end implementations are
required -
one
per OS, > > > > > assuming > > > > > > > > > > you don't just want Linux guests. The
resultant
guests
are > > > > > trivially > > > > > > > > > > movable between hypervisors modulo any
abstracted
paravirt > > > type > > > > > > > > > > interfaces. > > > > > > > > > > > > > > > > > > > > In my original thumb nail sketch of a
solution
I
envisioned > > > > > vhost-user > > > > > > > > > > daemons running in a broadly POSIX like
environment.
The > > > > > interface to > > > > > > > > > > the daemon is fairly simple requiring only
some
mapped
> > > memory > > > > > and some > > > > > > > > > > sort of signalling for events (on Linux this
is
eventfd). > > > The > > > > > idea was a > > > > > > > > > > stub binary would be responsible for any
hypervisor
specific > > > > > setup and > > > > > > > > > > then launch a common binary to deal with the
actual
> > > virtqueue > > > > > requests > > > > > > > > > > themselves. > > > > > > > > > > > > > > > > > > > > Since that original sketch we've seen an
expansion
in
the > > > sort > > > > > of ways > > > > > > > > > > backends could be created. There is interest
in
> > > encapsulating > > > > > backends > > > > > > > > > > in RTOSes or unikernels for solutions like
SCMI.
There
> > > interest > > > > > in Rust > > > > > > > > > > has prompted ideas of using the trait
interface to
abstract > > > > > differences > > > > > > > > > > away as well as the idea of bare-metal Rust
backends.
> > > > > > > > > > > > > > > > > > > > We have a card (STR-12) called "Hypercall Standardisation" > > > which > > > > > > > > > > calls for a description of the APIs needed
from
the
> > > hypervisor > > > > > side to > > > > > > > > > > support VirtIO guests and their backends.
However
we
are > > > some > > > > > way off > > > > > > > > > > from that at the moment as I think we need
to
at
least
> > > > > demonstrate one > > > > > > > > > > portable backend before we start codifying requirements. To > > > that > > > > > end I > > > > > > > > > > want to think about what we need for a
backend
to
function. > > > > > > > > > > > > > > > > > > > > Configuration > > > > > > > > > > ============= > > > > > > > > > > > > > > > > > > > > In the type-2 setup this is typically fairly
simple
because > > > the > > > > > host > > > > > > > > > > system can orchestrate the various modules
that
make
up the > > > > > complete > > > > > > > > > > system. In the type-1 case (or even type-2
with
delegated > > > > > service VMs) > > > > > > > > > > we need some sort of mechanism to inform the
backend
VM > > > about > > > > > key > > > > > > > > > > details about the system: > > > > > > > > > > > > > > > > > > > > - where virt queue memory is in it's
address
space
> > > > > > > > > > - how it's going to receive (interrupt)
and
trigger
(kick) > > > > > events > > > > > > > > > > - what (if any) resources the backend
needs
to
connect to > > > > > > > > > > > > > > > > > > > > Obviously you can elide over configuration
issues
by
having > > > > > static > > > > > > > > > > configurations and baking the assumptions
into
your
guest > > > images > > > > > however > > > > > > > > > > this isn't scalable in the long term. The
obvious
solution > > > seems > > > > > to be > > > > > > > > > > extending a subset of Device Tree data to
user
space
but > > > perhaps > > > > > there > > > > > > > > > > are other approaches? > > > > > > > > > > > > > > > > > > > > Before any virtio transactions can take
place
the
> > > appropriate > > > > > memory > > > > > > > > > > mappings need to be made between the FE
guest
and
the
BE > > > guest. > > > > > > > > > > > > > > > > > > > Currently the whole of the FE guests address
space
needs to > > > be > > > > > visible > > > > > > > > > > to whatever is serving the virtio requests.
I
can
envision 3 > > > > > approaches: > > > > > > > > > > > > > > > > > > > > * BE guest boots with memory already mapped > > > > > > > > > > > > > > > > > > > > This would entail the guest OS knowing
where
in
it's
Guest > > > > > Physical > > > > > > > > > > Address space is already taken up and
avoiding
clashing. I > > > > > would assume > > > > > > > > > > in this case you would want a standard
interface
to
> > > userspace > > > > > to then > > > > > > > > > > make that address space visible to the
backend
daemon.
> > > > > > > > > > > > > > > > Yet another way here is that we would have well
known
"shared > > > > > memory" between > > > > > > > > VMs. I think that Jailhouse's ivshmem gives us
good
insights on > > > this > > > > > matter > > > > > > > > and that it can even be an alternative for
hypervisor-
agnostic > > > > > solution. > > > > > > > > > > > > > > > > (Please note memory regions in ivshmem appear as
a
PCI
device > > > and > > > > > can be > > > > > > > > mapped locally.) > > > > > > > > > > > > > > > > I want to add this shared memory aspect to my
virtio-
proxy,
but > > > > > > > > the resultant solution would eventually look
similar
to
ivshmem. > > > > > > > > > > > > > > > > > > * BE guests boots with a hypervisor handle
to
memory
> > > > > > > > > > > > > > > > > > > > The BE guest is then free to map the FE's
memory
to
where > > > it > > > > > wants in > > > > > > > > > > the BE's guest physical address space. > > > > > > > > > > > > > > > > > > I cannot see how this could work for Xen.
There
is
no
"handle" > > > to > > > > > give > > > > > > > > > to the backend if the backend is not running
in
dom0.
So
for > > > Xen I > > > > > think > > > > > > > > > the memory has to be already mapped > > > > > > > > > > > > > > > > In Xen's IOREQ solution (virtio-blk), the
following
information > > > is > > > > > expected > > > > > > > > to be exposed to BE via Xenstore: > > > > > > > > (I know that this is a tentative approach
though.)
> > > > > > > > - the start address of configuration space > > > > > > > > - interrupt number > > > > > > > > - file path for backing storage > > > > > > > > - read-only flag > > > > > > > > And the BE server have to call a particular
hypervisor
interface > > > to > > > > > > > > map the configuration space. > > > > > > > > > > > > > > Yes, Xenstore was chosen as a simple way to pass configuration > > > info to > > > > > the backend running in a non-toolstack domain. > > > > > > > I remember, there was a wish to avoid using
Xenstore
in
Virtio > > > backend > > > > > itself if possible, so for non-toolstack domain, this
could
done
with > > > > > adjusting devd (daemon that listens for devices and
launches
backends) > > > > > > > to read backend configuration from the Xenstore
anyway
and
pass it > > > to > > > > > the backend via command line arguments. > > > > > > > > > > > > > > > > > > > Yes, in current PoC code we're using xenstore to
Yes, in the current PoC code we're using xenstore to pass the device configuration. We also designed a static device configuration parse method for Dom0less or other scenarios that don't have the xen tools; yes, it's from the device model command line or a config file.
But, if ...
In my approach (virtio-proxy), all that Xen (or hypervisor)-specific stuff is contained in virtio-proxy, yet another VM, to hide all the details.
... the solution how to overcome that is already found and proven to work, then even better.
# My point is that a "handle" is not mandatory for executing mapping.
and the mapping probably done by the toolstack (also see below.) Or we would have to invent a new Xen hypervisor interface and Xen virtual machine privileges to allow this kind of mapping.
If we run the backend in Dom0 then we have no problems of course.
One of the difficulties on Xen that I found in my approach is that calling such hypervisor interfaces (registering IOREQ, mapping memory) is only allowed on the BE servers themselves and so we will have to extend those interfaces. This, however, will raise some concerns about security and privilege distribution, as Stefan suggested.
We also faced policy-related issues with a Virtio backend running in a domain other than Dom0 in "dummy" xsm mode. In our target system we run the backend in a driver domain (we call it DomD) where the underlying H/W resides. We trust it, so we wrote policy rules (to be used in "flask" xsm mode) to provide it with a little bit more privilege than a simple DomU had. Now it is permitted to issue device-model, resource and memory mapping, etc. calls.
To activate the mapping will require some sort of hypercall to the hypervisor. I can see two options at this point:
- expose the handle to userspace for daemon/helper to trigger the mapping via existing hypercall interfaces. If using a helper you would have a hypervisor specific one to avoid the daemon having to care too much about the details or push that complexity into a compile time option for the daemon which would result in different binaries although a common source base.
- expose a new kernel ABI to abstract the hypercall differences away in the guest kernel. In this case the userspace would essentially ask for an abstract "map guest N memory to userspace ptr" and let the kernel deal with the different hypercall interfaces. This of course assumes the majority of BE guests would be Linux kernels and leaves the bare-metal/unikernel approaches to their own devices.
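To picture that second option, the abstract "map guest N memory to userspace ptr" request could be a small ioctl on a character device, with the kernel choosing the right hypercall underneath. Everything in the sketch below (the device node, structure and ioctl number) is invented purely for illustration; no such ABI exists today.
    /* Hypothetical "virtio bridge" kernel ABI: map a frontend guest's memory
     * into the backend daemon. Nothing here exists in Linux; names invented. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct vbridge_map_req {
        uint32_t guest_id;   /* which FE guest to map             */
        uint64_t guest_gpa;  /* start of region in guest physical */
        uint64_t size;       /* length of the mapping             */
    };

    #define VBRIDGE_IOC_MAP _IOW('V', 0x01, struct vbridge_map_req)  /* invented */

    int main(void)
    {
        int fd = open("/dev/virtio-bridge", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct vbridge_map_req req = {
            .guest_id = 1, .guest_gpa = 0x40000000, .size = 1 << 20,
        };
        /* The kernel picks the right hypercall (Xen, KVM, ...) behind this. */
        if (ioctl(fd, VBRIDGE_IOC_MAP, &req) < 0) { perror("ioctl"); return 1; }

        /* The prepared region is then mmap()ed like any other device memory. */
        void *p = mmap(NULL, req.size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* ... hand 'p' to the virtqueue processing code ... */
        munmap(p, req.size);
        close(fd);
        return 0;
    }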
Operation =========
The core of the operation of VirtIO is fairly simple. Once the vhost-user feature negotiation is done it's a case of receiving update events and parsing the resultant virt queue for data. The vhost-user specification handles a bunch of setup before that point, mostly to detail where the virt queues are set up FD's for memory and event communication. This is where the envisioned stub process would be responsible for getting the daemon up and ready to run. This is currently done inside a big VMM like QEMU but I suspect a modern approach would be to use the rust-vmm vhost crate. It would then either communicate with the kernel's abstracted ABI or be re-targeted as a build option for the various hypervisors.
One thing I mentioned before to Alex is that Xen doesn't have VMMs the way they are typically envisioned and described in other environments. Instead, Xen has IOREQ servers. Each of them connects independently to Xen via the IOREQ interface. E.g. today multiple QEMUs could be used as emulators for a single Xen VM, each of them connecting to Xen independently via the IOREQ interface.
The component responsible for starting a daemon and/or setting up shared interfaces is the toolstack: the xl command and the libxl/libxc libraries.
I think that VM configuration management (or orchestration in Stratos jargon?) is a subject to debate in parallel. Otherwise, is there any good assumption to avoid it right now?
Oleksandr and others I CCed have been working on ways for the toolstack to create virtio backends and set up memory mappings. They might be able to provide more info on the subject. I do think we miss a way to provide the configuration to the backend and anything else that the backend might require to start doing its job.
Yes, some work has been done for the toolstack to handle Virtio MMIO devices in general and Virtio block devices in particular. However, it has not been upstreamed yet. Updated patches are on review now: https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-email-olekstysh@gmail.com/
There is an additional (also important) activity to improve/fix foreign memory mapping on Arm which I am also involved in. The foreign memory mapping is proposed to be used for Virtio backends (device emulators) if there is a need to run the guest OS completely unmodified. Of course, the more secure way would be to use grant memory mapping. Briefly, the main difference between them is that with foreign mapping the backend can map any guest memory it wants to map, but with grant mapping it is allowed to map only what was previously granted by the frontend.
So, there might be a problem if we want to pre-map some guest memory in advance or to cache mappings in the backend in order to improve performance (because mapping/unmapping guest pages on every request requires a lot of back and forth to Xen + P2M updates). In a nutshell, currently, in order to map a guest page into the backend address space we need to steal a real physical page from the backend domain. So, with the said optimizations we might end up with no free memory in the backend domain (see XSA-300). What we try to achieve is to not waste real domain memory at all by providing safe, non-allocated-yet (so unused) address space for the foreign (and grant) pages to be mapped into; this enabling work implies Xen and Linux (and likely DTB bindings) changes. However, as it turned out, for this to work in a proper and safe way some prereq work needs to be done. You can find the related Xen discussion at: https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-olekstysh@gmail.com/
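For reference, the foreign mapping path described above is roughly what libxenforeignmemory already gives a sufficiently privileged backend domain. A minimal sketch, assuming the Xen 4.x library API and leaving out the XSM/privilege setup discussed earlier (the domain ID and frame numbers are placeholders):
    /* Sketch: map two frontend (domU) pages into the backend's address space
     * with libxenforeignmemory. Build: gcc -o fmap fmap.c -lxenforeignmemory */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <xenforeignmemory.h>

    int main(void)
    {
        xenforeignmemory_handle *fmem = xenforeignmemory_open(NULL, 0);
        if (!fmem) { perror("xenforeignmemory_open"); return 1; }

        uint32_t domid = 2;                        /* frontend domain (placeholder) */
        xen_pfn_t gfns[2] = { 0x80000, 0x80001 };  /* guest frames (placeholders)   */
        int err[2];

        void *p = xenforeignmemory_map(fmem, domid, PROT_READ | PROT_WRITE,
                                       2, gfns, err);
        if (!p) { perror("xenforeignmemory_map"); return 1; }

        /* ... walk the frontend's virtqueue or data buffers here ... */

        xenforeignmemory_unmap(fmem, p, 2);
        xenforeignmemory_close(fmem);
        return 0;
    }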
One question is how to best handle notification and kicks. The existing vhost-user framework uses eventfd to signal the daemon (although QEMU is quite capable of simulating them when you use TCG). Xen has its own IOREQ mechanism. However latency is an important factor and having events go through the stub would add quite a lot.
Yeah I think, regardless of anything else, we want the backends to connect directly to the Xen hypervisor.
In my approach,
a) BE -> FE: interrupts triggered by the BE calling a hypervisor interface via virtio-proxy
b) FE -> BE: MMIO to config raises events (in event channels), which are converted to a callback to the BE via virtio-proxy (Xen's event channel is internally implemented by interrupts.)
I don't know what "connect directly" means here, but sending interrupts to the opposite side would be the most efficient. Ivshmem, I suppose, takes this approach by utilizing PCI's msi-x mechanism.
Agree that MSI would be more efficient than SPI... At the moment, in order to notify the frontend, the backend issues a specific device-model call to query Xen to inject a corresponding SPI to the guest.
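For anyone less familiar with the eventfd mechanism referenced above: in vhost-user a "kick" is just a counter write that wakes the daemon's poll loop, which is the behaviour any IOREQ or event-channel bridge would have to reproduce. A minimal, Linux-only sketch with the kick faked from the same process:
    /* Minimal eventfd "kick" pattern as used by vhost-user daemons: one side
     * writes to the fd, the backend poll loop wakes up and processes the
     * virtqueue. Here the kick is simulated locally. */
    #include <poll.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(void)
    {
        int kick_fd = eventfd(0, 0);
        if (kick_fd < 0) { perror("eventfd"); return 1; }

        uint64_t one = 1;
        write(kick_fd, &one, sizeof(one));        /* "guest" kicks the queue */

        struct pollfd pfd = { .fd = kick_fd, .events = POLLIN };
        while (poll(&pfd, 1, 0) > 0) {            /* backend event loop      */
            uint64_t n;
            read(kick_fd, &n, sizeof(n));         /* consume the kick count  */
            printf("got %llu kick(s), process the virtqueue here\n",
                   (unsigned long long)n);
        }

        close(kick_fd);
        return 0;
    }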
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add.
As I said above, my proposal does the same thing that you mentioned here :) The difference is that I do call hypervisor interfaces via virtio-proxy.
The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See: https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
I don't think that translating IOREQs to eventfd in the kernel is a good idea: it feels like it would be extra complexity and that the kernel shouldn't be involved as this is a backend-hypervisor interface.
Given that we may want to implement the BE as a bare-metal application, as I did on Zephyr, I don't think that the translation would be a big issue, especially on RTOSes. It will be some kind of abstraction layer for interrupt handling (or nothing but a callback mechanism).
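That "abstraction layer for interrupt handling" can be pictured as a small table of callbacks which a bare-metal or RTOS backend registers with whatever event source the platform offers (Xen event channel, KVM ioeventfd, a raw interrupt). The interface below is invented purely to illustrate the shape of such a layer:
    /* Invented notification abstraction for a bare-metal/RTOS backend: the
     * platform layer delivers "queue N was kicked" through one callback and
     * offers one call to signal the frontend back. */
    #include <stdint.h>
    #include <stdio.h>

    typedef void (*queue_kick_cb)(void *ctx, uint16_t queue_index);

    struct notify_ops {
        int  (*register_kick)(uint16_t queue_index, queue_kick_cb cb, void *ctx);
        void (*signal_used)(uint16_t queue_index);   /* BE -> FE interrupt */
    };

    /* Hypervisor-agnostic backend handler for a kicked virtqueue. */
    static void on_kick(void *ctx, uint16_t q)
    {
        (void)ctx;
        printf("queue %u kicked: pop descriptors and service the request\n", q);
    }

    /* Dummy platform layer standing in for the real event source. */
    static queue_kick_cb registered_cb;
    static void *registered_ctx;

    static int dummy_register(uint16_t q, queue_kick_cb cb, void *ctx)
    {
        (void)q; registered_cb = cb; registered_ctx = ctx; return 0;
    }

    static void dummy_signal_used(uint16_t q)
    {
        printf("would inject an interrupt/event for queue %u into the FE\n", q);
    }

    int main(void)
    {
        struct notify_ops ops = { dummy_register, dummy_signal_used };
        ops.register_kick(0, on_kick, NULL);
        registered_cb(registered_ctx, 0);  /* simulate a kick from the FE */
        ops.signal_used(0);                /* notify the FE of completion */
        return 0;
    }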
Also, eventfd is very Linux-centric and we are trying to design an interface that could work well for RTOSes too. If we want to do something different, both OS-agnostic and hypervisor-agnostic, perhaps we could design a new interface. One that could be implementable in the Xen hypervisor itself (like IOREQ) and of course any other hypervisor too.
There is also another problem. IOREQ is probably not the only interface needed. Have a look at https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need an interface for the backend to inject interrupts into the frontend? And if the backend requires dynamic memory mappings of frontend pages, then we would also need an interface to map/unmap domU pages.
My proposal document might help here; all the interfaces required for virtio-proxy (or hypervisor-related interfaces) are listed as RPC protocols :)
These interfaces are a lot more problematic than IOREQ: IOREQ is tiny and self-contained. It is easy to add anywhere. A new interface to inject interrupts or map pages is more difficult to manage because it would require changes scattered across the various emulators.
Exactly. I am not confident yet that my approach will also apply to hypervisors other than Xen. Technically, yes, but whether people can accept it or not is a different matter.
Thanks,
-Takahiro Akashi
-- Regards,
Oleksandr Tyshchenko
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add. The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See: https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ior...
There have been experiments with something kind of similar in KVM recently (see struct ioregionfd_cmd): https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828...
There is also another problem. IOREQ is probably not be the only interface needed. Have a look at https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need an interface for the backend to inject interrupts into the frontend? And if the backend requires dynamic memory mappings of frontend pages, then we would also need an interface to map/unmap domU pages.
These interfaces are a lot more problematic than IOREQ: IOREQ is tiny and self-contained. It is easy to add anywhere. A new interface to inject interrupts or map pages is more difficult to manage because it would require changes scattered across the various emulators.
Something like ioreq is indeed necessary to implement arbitrary devices, but if you are willing to restrict yourself to VIRTIO then other interfaces are possible too because the VIRTIO device model is different from the general purpose x86 PIO/MMIO that Xen's ioreq seems to support.
Stefan
Hi Stefan,
On Tue, Aug 17, 2021 at 11:41:01AM +0100, Stefan Hajnoczi wrote:
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add. The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See: https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ior...
There have been experiments with something kind of similar in KVM recently (see struct ioregionfd_cmd): https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828...
Do you know the current status of Elena's work? It was last February that she posted her latest patch and it has not been merged upstream yet.
There is also another problem. IOREQ is probably not be the only interface needed. Have a look at https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need an interface for the backend to inject interrupts into the frontend? And if the backend requires dynamic memory mappings of frontend pages, then we would also need an interface to map/unmap domU pages.
These interfaces are a lot more problematic than IOREQ: IOREQ is tiny and self-contained. It is easy to add anywhere. A new interface to inject interrupts or map pages is more difficult to manage because it would require changes scattered across the various emulators.
Something like ioreq is indeed necessary to implement arbitrary devices, but if you are willing to restrict yourself to VIRTIO then other interfaces are possible too because the VIRTIO device model is different from the general purpose x86 PIO/MMIO that Xen's ioreq seems to support.
Can you please elaborate your thoughts a bit more here?
It seems to me that trapping MMIOs to configuration space and forwarding those events to the BE (or device emulation) is a quite straightforward way to emulate device MMIOs. Or are you thinking of something like the protocols used in vhost-user?
# On the contrary, virtio-ivshmem only requires a driver to explicitly # forward a "write" request of MMIO accesses to BE. But I don't think # it's your point.
-Takahiro Akashi
Stefan
On Mon, Aug 23, 2021 at 03:25:00PM +0900, AKASHI Takahiro wrote:
Hi Stefan,
On Tue, Aug 17, 2021 at 11:41:01AM +0100, Stefan Hajnoczi wrote:
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add. The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See: https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ior...
There have been experiments with something kind of similar in KVM recently (see struct ioregionfd_cmd): https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828...
Do you know the current status of Elena's work? It was last February that she posted her latest patch and it has not been merged upstream yet.
Elena worked on this during her Outreachy internship. At the moment no one is actively working on the patches.
There is also another problem. IOREQ is probably not be the only interface needed. Have a look at https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need an interface for the backend to inject interrupts into the frontend? And if the backend requires dynamic memory mappings of frontend pages, then we would also need an interface to map/unmap domU pages.
These interfaces are a lot more problematic than IOREQ: IOREQ is tiny and self-contained. It is easy to add anywhere. A new interface to inject interrupts or map pages is more difficult to manage because it would require changes scattered across the various emulators.
Something like ioreq is indeed necessary to implement arbitrary devices, but if you are willing to restrict yourself to VIRTIO then other interfaces are possible too because the VIRTIO device model is different from the general purpose x86 PIO/MMIO that Xen's ioreq seems to support.
Can you please elaborate your thoughts a bit more here?
It seems to me that trapping MMIOs to configuration space and forwarding those events to BE (or device emulation) is a quite straight-forward way to emulate device MMIOs. Or do you think of something of protocols used in vhost-user?
# On the contrary, virtio-ivshmem only requires a driver to explicitly # forward a "write" request of MMIO accesses to BE. But I don't think # it's your point.
See my first reply to this email thread about alternative interfaces for VIRTIO device emulation. The main thing to note was that although the shared memory vring is used by VIRTIO transports today, the device model actually allows transports to implement virtqueues differently (e.g. making it possible to create a VIRTIO over TCP transport without shared memory in the future).
It's possible to define a hypercall interface as a new VIRTIO transport that provides higher-level virtqueue operations. Doing this is more work than using vrings though since existing guest driver and device emulation code already supports vrings.
I don't know the requirements of Stratos so I can't say if creating a new hypervisor-independent interface (VIRTIO transport) that doesn't rely on shared memory vrings makes sense. I just wanted to raise the idea in case you find that VIRTIO's vrings don't meet your requirements.
Stefan
Hi Stefan,
On Mon, Aug 23, 2021 at 10:58:46AM +0100, Stefan Hajnoczi wrote:
On Mon, Aug 23, 2021 at 03:25:00PM +0900, AKASHI Takahiro wrote:
Hi Stefan,
On Tue, Aug 17, 2021 at 11:41:01AM +0100, Stefan Hajnoczi wrote:
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add. The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See: https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ior...
There have been experiments with something kind of similar in KVM recently (see struct ioregionfd_cmd): https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828...
Do you know the current status of Elena's work? It was last February that she posted her latest patch and it has not been merged upstream yet.
Elena worked on this during her Outreachy internship. At the moment no one is actively working on the patches.
Does Red Hat plan to take over or follow up on her work hereafter? # I'm simply asking out of curiosity.
There is also another problem. IOREQ is probably not be the only interface needed. Have a look at https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need an interface for the backend to inject interrupts into the frontend? And if the backend requires dynamic memory mappings of frontend pages, then we would also need an interface to map/unmap domU pages.
These interfaces are a lot more problematic than IOREQ: IOREQ is tiny and self-contained. It is easy to add anywhere. A new interface to inject interrupts or map pages is more difficult to manage because it would require changes scattered across the various emulators.
Something like ioreq is indeed necessary to implement arbitrary devices, but if you are willing to restrict yourself to VIRTIO then other interfaces are possible too because the VIRTIO device model is different from the general purpose x86 PIO/MMIO that Xen's ioreq seems to support.
Can you please elaborate your thoughts a bit more here?
It seems to me that trapping MMIOs to configuration space and forwarding those events to BE (or device emulation) is a quite straight-forward way to emulate device MMIOs. Or do you think of something of protocols used in vhost-user?
# On the contrary, virtio-ivshmem only requires a driver to explicitly # forward a "write" request of MMIO accesses to BE. But I don't think # it's your point.
See my first reply to this email thread about alternative interfaces for VIRTIO device emulation. The main thing to note was that although the shared memory vring is used by VIRTIO transports today, the device model actually allows transports to implement virtqueues differently (e.g. making it possible to create a VIRTIO over TCP transport without shared memory in the future).
Do you have any example of such use cases or systems?
It's possible to define a hypercall interface as a new VIRTIO transport that provides higher-level virtqueue operations. Doing this is more work than using vrings though since existing guest driver and device emulation code already supports vrings.
Personally, I'm open to discussing your point, but
I don't know the requirements of Stratos so I can't say if creating a new hypervisor-independent interface (VIRTIO transport) that doesn't rely on shared memory vrings makes sense. I just wanted to raise the idea in case you find that VIRTIO's vrings don't meet your requirements.
While I cannot represent the project's view, this is what the JIRA task assigned to me describes: Deliverables * Low level library allowing: * management of virtio rings and buffers [and so on] So supporting the shared memory-based vring is one of our assumptions.
In my understanding, the goal of the Stratos project is that we would have several VMs congregated into a SoC, yet sharing most of the physical IPs, where shared memory should be, I assume, the most efficient transport for virtio. One of the target applications would be automotive, I guess.
Alex and Mike should have more to say here.
-Takahiro Akashi
Stefan
On Wed, Aug 25, 2021 at 07:29:45PM +0900, AKASHI Takahiro wrote:
On Mon, Aug 23, 2021 at 10:58:46AM +0100, Stefan Hajnoczi wrote:
On Mon, Aug 23, 2021 at 03:25:00PM +0900, AKASHI Takahiro wrote:
Hi Stefan,
On Tue, Aug 17, 2021 at 11:41:01AM +0100, Stefan Hajnoczi wrote:
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add. The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See: https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ior...
There have been experiments with something kind of similar in KVM recently (see struct ioregionfd_cmd): https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828...
Do you know the current status of Elena's work? It was last February that she posted her latest patch and it has not been merged upstream yet.
Elena worked on this during her Outreachy internship. At the moment no one is actively working on the patches.
Does RedHat plan to take over or follow up her work hereafter? # I'm simply asking from my curiosity.
At the moment I'm not aware of anyone from Red Hat working on it. If someone decides they need this KVM API then that could change.
There is also another problem. IOREQ is probably not be the only interface needed. Have a look at https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need an interface for the backend to inject interrupts into the frontend? And if the backend requires dynamic memory mappings of frontend pages, then we would also need an interface to map/unmap domU pages.
These interfaces are a lot more problematic than IOREQ: IOREQ is tiny and self-contained. It is easy to add anywhere. A new interface to inject interrupts or map pages is more difficult to manage because it would require changes scattered across the various emulators.
Something like ioreq is indeed necessary to implement arbitrary devices, but if you are willing to restrict yourself to VIRTIO then other interfaces are possible too because the VIRTIO device model is different from the general purpose x86 PIO/MMIO that Xen's ioreq seems to support.
Can you please elaborate your thoughts a bit more here?
It seems to me that trapping MMIOs to configuration space and forwarding those events to BE (or device emulation) is a quite straight-forward way to emulate device MMIOs. Or do you think of something of protocols used in vhost-user?
# On the contrary, virtio-ivshmem only requires a driver to explicitly # forward a "write" request of MMIO accesses to BE. But I don't think # it's your point.
See my first reply to this email thread about alternative interfaces for VIRTIO device emulation. The main thing to note was that although the shared memory vring is used by VIRTIO transports today, the device model actually allows transports to implement virtqueues differently (e.g. making it possible to create a VIRTIO over TCP transport without shared memory in the future).
Do you have any example of such use cases or systems?
This aspect of VIRTIO isn't being exploited today AFAIK. But the layering to allow other virtqueue implementations is there. For example, Linux's virtqueue API is independent of struct vring, so existing drivers generally aren't tied to vrings.
It's possible to define a hypercall interface as a new VIRTIO transport that provides higher-level virtqueue operations. Doing this is more work than using vrings though since existing guest driver and device emulation code already supports vrings.
Personally, I'm open to discuss about your point, but
I don't know the requirements of Stratos so I can't say if creating a new hypervisor-independent interface (VIRTIO transport) that doesn't rely on shared memory vrings makes sense. I just wanted to raise the idea in case you find that VIRTIO's vrings don't meet your requirements.
While I cannot represent the project's view, what the JIRA task that is assigned to me describes: Deliverables * Low level library allowing: * management of virtio rings and buffers [and so on] So supporting the shared memory-based vring is one of our assumptions.
If shared memory is allowed then vrings are the natural choice. That way existing virtio code will work with minimal modifications.
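For context on why vrings travel so well between hypervisors: the split-virtqueue layout is fixed by the VIRTIO specification as plain little-endian tables in shared memory, so any backend that can see the frontend's pages can service it. A sketch of the descriptor table entry (this mirrors the layout in the spec and in Linux's virtio_ring UAPI header):
    /* Split-virtqueue descriptor as laid out in shared memory (VIRTIO spec).
     * Little-endian on the wire; endianness handling omitted for brevity. */
    #include <stdint.h>
    #include <stdio.h>

    #define VRING_DESC_F_NEXT  1  /* buffer continues in the 'next' descriptor */
    #define VRING_DESC_F_WRITE 2  /* buffer is device-writable                 */

    struct vring_desc {
        uint64_t addr;   /* guest physical address of the buffer */
        uint32_t len;    /* length of the buffer in bytes        */
        uint16_t flags;  /* VRING_DESC_F_*                       */
        uint16_t next;   /* index of the chained descriptor      */
    };

    int main(void)
    {
        /* A two-descriptor chain: a read-only request header followed by a
         * device-writable data buffer, as a block request would use. */
        struct vring_desc ring[2] = {
            { .addr = 0x80001000, .len = 16,   .flags = VRING_DESC_F_NEXT,  .next = 1 },
            { .addr = 0x80002000, .len = 4096, .flags = VRING_DESC_F_WRITE, .next = 0 },
        };
        printf("desc size = %zu bytes, chain of %zu\n",
               sizeof(ring[0]), sizeof(ring) / sizeof(ring[0]));
        return 0;
    }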
Stefan
Stefan Hajnoczi stefanha@redhat.com writes:
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add. The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See: https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ior...
There have been experiments with something kind of similar in KVM recently (see struct ioregionfd_cmd): https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828...
Reading the cover letter was very useful in showing how this provides a separate channel for signalling IO events to userspace instead of using the normal type-2 vmexit type event. I wonder how deeply tied the userspace facing side of this is to KVM? Could it provide a common FD type interface to IOREQ?
As I understand IOREQ this is currently a direct communication between userspace and the hypervisor using the existing Xen message bus. My worry would be that by adding knowledge of what the underlying hypervisor is we'd end up with excess complexity in the kernel. For one thing we certainly wouldn't want an API version dependency on the kernel to understand which version of the Xen hypervisor it was running on.
There is also another problem. IOREQ is probably not be the only interface needed. Have a look at https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need an interface for the backend to inject interrupts into the frontend? And if the backend requires dynamic memory mappings of frontend pages, then we would also need an interface to map/unmap domU pages.
These interfaces are a lot more problematic than IOREQ: IOREQ is tiny and self-contained. It is easy to add anywhere. A new interface to inject interrupts or map pages is more difficult to manage because it would require changes scattered across the various emulators.
Something like ioreq is indeed necessary to implement arbitrary devices, but if you are willing to restrict yourself to VIRTIO then other interfaces are possible too because the VIRTIO device model is different from the general purpose x86 PIO/MMIO that Xen's ioreq seems to support.
It's true our focus is just VirtIO, which does support alternative transport options; however, most implementations seem to be targeting virtio-mmio for its relative simplicity and understood semantics (modulo a desire for MSI to reduce round-trip latency when handling signalling).
Stefan
On Wed, Sep 01, 2021 at 01:53:34PM +0100, Alex Bennée wrote:
Stefan Hajnoczi stefanha@redhat.com writes:
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add. The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See: https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ior...
There have been experiments with something kind of similar in KVM recently (see struct ioregionfd_cmd): https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828...
Reading the cover letter was very useful in showing how this provides a separate channel for signalling IO events to userspace instead of using the normal type-2 vmexit type event. I wonder how deeply tied the userspace facing side of this is to KVM? Could it provide a common FD type interface to IOREQ?
I wondered this too after reading Stefano's link to Xen's ioreq. They seem to be quite similar. ioregionfd is closer to have PIO/MMIO vmexits are handled in KVM while I guess ioreq is closer to how Xen handles them, but those are small details.
It may be possible to use the ioreq struct instead of ioregionfd in KVM, but I haven't checked each field.
As I understand IOREQ this is currently a direct communication between userspace and the hypervisor using the existing Xen message bus. My worry would be that by adding knowledge of what the underlying hypervisor is we'd end up with excess complexity in the kernel. For one thing we certainly wouldn't want an API version dependency on the kernel to understand which version of the Xen hypervisor it was running on.
There is also another problem. IOREQ is probably not be the only interface needed. Have a look at https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need an interface for the backend to inject interrupts into the frontend? And if the backend requires dynamic memory mappings of frontend pages, then we would also need an interface to map/unmap domU pages.
These interfaces are a lot more problematic than IOREQ: IOREQ is tiny and self-contained. It is easy to add anywhere. A new interface to inject interrupts or map pages is more difficult to manage because it would require changes scattered across the various emulators.
Something like ioreq is indeed necessary to implement arbitrary devices, but if you are willing to restrict yourself to VIRTIO then other interfaces are possible too because the VIRTIO device model is different from the general purpose x86 PIO/MMIO that Xen's ioreq seems to support.
It's true our focus is just VirtIO which does support alternative transport options however most implementations seem to be targeting virtio-mmio for it's relative simplicity and understood semantics (modulo a desire for MSI to reduce round trip latency handling signalling).
Okay.
Stefan
Alex,
On Wed, Sep 01, 2021 at 01:53:34PM +0100, Alex Bennée wrote:
Stefan Hajnoczi stefanha@redhat.com writes:
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add. The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See: https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ior...
There have been experiments with something kind of similar in KVM recently (see struct ioregionfd_cmd): https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828...
Reading the cover letter was very useful in showing how this provides a separate channel for signalling IO events to userspace instead of using the normal type-2 vmexit type event. I wonder how deeply tied the userspace facing side of this is to KVM? Could it provide a common FD type interface to IOREQ?
Why do you stick to a "FD" type interface?
As I understand IOREQ this is currently a direct communication between userspace and the hypervisor using the existing Xen message bus. My
With IOREQ server, IO event occurrences are notified to BE via Xen's event channel, while the actual contexts of IO events (see struct ioreq in ioreq.h) are put in a queue on a single shared memory page which is to be assigned beforehand with xenforeignmemory_map_resource hypervisor call.
worry would be that by adding knowledge of what the underlying hypervisor is we'd end up with excess complexity in the kernel. For one thing we certainly wouldn't want an API version dependency on the kernel to understand which version of the Xen hypervisor it was running on.
That's exactly what virtio-proxy in my proposal[1] does; all the hypervisor-specific details of IO event handling are contained in virtio-proxy, and the virtio BE will communicate with virtio-proxy through a virtqueue (yes, virtio-proxy is seen as yet another virtio device on the BE) and will get IO event-related *RPC* callbacks, either MMIO read or write, from virtio-proxy.
See page 8 (protocol flow) and 10 (interfaces) in [1].
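To picture the kind of RPC that flows over that virtqueue, an MMIO-trap callback could be a small fixed-size record along the lines below. This is a guess at the shape of the protocol for illustration only; the authoritative definitions are in the proposal referenced as [1].
    /* Illustrative guess at a virtio-proxy RPC record for a trapped MMIO
     * access; see the proposal [1] for the real protocol definitions. */
    #include <stdint.h>
    #include <stdio.h>

    enum proxy_rpc_op {
        PROXY_RPC_MMIO_READ  = 1,   /* FE read of device/config space  */
        PROXY_RPC_MMIO_WRITE = 2,   /* FE write of device/config space */
        PROXY_RPC_IRQ_INJECT = 3,   /* BE asks proxy to signal the FE  */
    };

    struct proxy_rpc_msg {
        uint32_t op;       /* enum proxy_rpc_op                          */
        uint32_t dev_id;   /* which emulated device this refers to       */
        uint64_t offset;   /* offset within the device's MMIO region     */
        uint32_t size;     /* access width in bytes (1, 2, 4, 8)         */
        uint32_t pad;
        uint64_t value;    /* write payload, or read result on the reply */
    };

    int main(void)
    {
        /* Example: the FE wrote 4 bytes to a queue-notify style register. */
        struct proxy_rpc_msg m = {
            .op = PROXY_RPC_MMIO_WRITE, .dev_id = 0,
            .offset = 0x50, .size = 4, .value = 0x1,
        };
        printf("op=%u off=0x%llx size=%u val=0x%llx\n",
               m.op, (unsigned long long)m.offset, m.size,
               (unsigned long long)m.value);
        return 0;
    }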
If kvm's ioregionfd can fit into this protocol, virtio-proxy for kvm will hopefully be implemented using ioregionfd.
-Takahiro Akashi
[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000548.html
There is also another problem. IOREQ is probably not be the only interface needed. Have a look at https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need an interface for the backend to inject interrupts into the frontend? And if the backend requires dynamic memory mappings of frontend pages, then we would also need an interface to map/unmap domU pages.
These interfaces are a lot more problematic than IOREQ: IOREQ is tiny and self-contained. It is easy to add anywhere. A new interface to inject interrupts or map pages is more difficult to manage because it would require changes scattered across the various emulators.
Something like ioreq is indeed necessary to implement arbitrary devices, but if you are willing to restrict yourself to VIRTIO then other interfaces are possible too because the VIRTIO device model is different from the general purpose x86 PIO/MMIO that Xen's ioreq seems to support.
It's true our focus is just VirtIO which does support alternative transport options however most implementations seem to be targeting virtio-mmio for it's relative simplicity and understood semantics (modulo a desire for MSI to reduce round trip latency handling signalling).
Stefan
-- Alex Bennée
AKASHI Takahiro takahiro.akashi@linaro.org writes:
Alex,
On Wed, Sep 01, 2021 at 01:53:34PM +0100, Alex Bennée wrote:
Stefan Hajnoczi stefanha@redhat.com writes:
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add. The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See: https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ior...
There have been experiments with something kind of similar in KVM recently (see struct ioregionfd_cmd): https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828...
Reading the cover letter was very useful in showing how this provides a separate channel for signalling IO events to userspace instead of using the normal type-2 vmexit type event. I wonder how deeply tied the userspace facing side of this is to KVM? Could it provide a common FD type interface to IOREQ?
Why do you stick to a "FD" type interface?
I mean most user space interfaces on POSIX start with a file descriptor and the usual read/write semantics or a series of ioctls.
As I understand IOREQ this is currently a direct communication between userspace and the hypervisor using the existing Xen message bus. My
With IOREQ server, IO event occurrences are notified to BE via Xen's event channel, while the actual contexts of IO events (see struct ioreq in ioreq.h) are put in a queue on a single shared memory page which is to be assigned beforehand with xenforeignmemory_map_resource hypervisor call.
If we abstracted the IOREQ via the kernel interface you would probably just want to put the ioreq structure on a queue rather than expose the shared page to userspace.
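A very rough sketch of how that could look from the daemon's side, if such a kernel abstraction existed: the driver drains the hypervisor-specific ring and userspace just read()s fixed-size records and write()s completions back. The device node and record layout are invented for illustration:
    /* Hypothetical userspace view of a kernel-abstracted IOREQ queue.
     * /dev/vio-ioreq and struct vio_ioreq are invented; a kernel driver
     * would translate Xen ioreqs (or KVM exits) into these records. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    struct vio_ioreq {
        uint64_t addr;    /* faulting guest physical address */
        uint64_t data;    /* write payload / read result     */
        uint32_t size;    /* access width in bytes           */
        uint8_t  dir;     /* 0 = read, 1 = write             */
        uint8_t  pad[3];
    };

    int main(void)
    {
        int fd = open("/dev/vio-ioreq", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct vio_ioreq req;
        while (read(fd, &req, sizeof(req)) == sizeof(req)) {
            /* Emulate the access, then push the completion back. */
            if (req.dir == 0)
                req.data = 0;                 /* value "read" from the device */
            write(fd, &req, sizeof(req));     /* completes the ioreq          */
        }
        close(fd);
        return 0;
    }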
worry would be that by adding knowledge of what the underlying hypervisor is we'd end up with excess complexity in the kernel. For one thing we certainly wouldn't want an API version dependency on the kernel to understand which version of the Xen hypervisor it was running on.
That's exactly what virtio-proxy in my proposal[1] does; All the hypervisor- specific details of IO event handlings are contained in virtio-proxy and virtio BE will communicate with virtio-proxy through a virtqueue (yes, virtio-proxy is seen as yet another virtio device on BE) and will get IO event-related *RPC* callbacks, either MMIO read or write, from virtio-proxy.
See page 8 (protocol flow) and 10 (interfaces) in [1].
There are two areas of concern with the proxy approach at the moment. The first is how the bootstrap of the virtio-proxy channel happens and the second is how many context switches are involved in a transaction. Of course with all things there is a trade off. Things involving the very tightest latency would probably opt for a bare metal backend which I think would imply hypervisor knowledge in the backend binary.
If kvm's ioregionfd can fit into this protocol, virtio-proxy for kvm will hopefully be implemented using ioregionfd.
-Takahiro Akashi
[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000548.html
Alex,
On Fri, Sep 03, 2021 at 10:28:06AM +0100, Alex Bennée wrote:
AKASHI Takahiro takahiro.akashi@linaro.org writes:
Alex,
On Wed, Sep 01, 2021 at 01:53:34PM +0100, Alex Bennée wrote:
Stefan Hajnoczi stefanha@redhat.com writes:
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add. The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See: https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ior...
There have been experiments with something kind of similar in KVM recently (see struct ioregionfd_cmd): https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828...
Reading the cover letter was very useful in showing how this provides a separate channel for signalling IO events to userspace instead of using the normal type-2 vmexit type event. I wonder how deeply tied the userspace facing side of this is to KVM? Could it provide a common FD type interface to IOREQ?
Why do you stick to a "FD" type interface?
I mean most user space interfaces on POSIX start with a file descriptor and the usual read/write semantics or a series of ioctls.
Who do you assume is responsible for implementing this kind of fd semantics, OSs on BE or hypervisor itself?
I think such interfaces can only be easily implemented on type-2 hypervisors.
# In this sense, I don't think rust-vmm, as it is, can be a general solution.
As I understand IOREQ this is currently a direct communication between userspace and the hypervisor using the existing Xen message bus. My
With IOREQ server, IO event occurrences are notified to BE via Xen's event channel, while the actual contexts of IO events (see struct ioreq in ioreq.h) are put in a queue on a single shared memory page which is to be assigned beforehand with xenforeignmemory_map_resource hypervisor call.
If we abstracted the IOREQ via the kernel interface you would probably just want to put the ioreq structure on a queue rather than expose the shared page to userspace.
Where is that queue?
worry would be that by adding knowledge of what the underlying hypervisor is we'd end up with excess complexity in the kernel. For one thing we certainly wouldn't want an API version dependency on the kernel to understand which version of the Xen hypervisor it was running on.
That's exactly what virtio-proxy in my proposal[1] does; All the hypervisor- specific details of IO event handlings are contained in virtio-proxy and virtio BE will communicate with virtio-proxy through a virtqueue (yes, virtio-proxy is seen as yet another virtio device on BE) and will get IO event-related *RPC* callbacks, either MMIO read or write, from virtio-proxy.
See page 8 (protocol flow) and 10 (interfaces) in [1].
There are two areas of concern with the proxy approach at the moment. The first is how the bootstrap of the virtio-proxy channel happens and
As I said, from the BE point of view, virtio-proxy would be seen as yet another virtio device by which the BE could talk to the "virtio proxy" VM or whatever else.
This way we guarantee the BE's hypervisor-agnosticism instead of having "common" hypervisor interfaces. That is the basis of my idea.
the second is how many context switches are involved in a transaction. Of course with all things there is a trade off. Things involving the very tightest latency would probably opt for a bare metal backend which I think would imply hypervisor knowledge in the backend binary.
In the configuration phase of a virtio device, the latency won't be a big matter. In device operations (i.e. read/write to block devices), if we can resolve the 'mmap' issue, as Oleksandr is proposing right now, the only issue is how efficiently we can deliver notifications to the opposite side. Right? And this is a very common problem whatever approach we take.
Anyhow, if we do care about latency in my approach, most of the virtio-proxy-related code can be re-implemented just as a stub (or shim?) library since the protocols are defined as RPCs. In this case, however, we would lose the benefit of providing a "single binary" BE. (I know this is an arguable requirement, though.)
# Would it be better to first discuss what "hypervisor-agnosticism" means?
-Takahiro Akashi
If kvm's ioregionfd can fit into this protocol, virtio-proxy for kvm will hopefully be implemented using ioregionfd.
-Takahiro Akashi
[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000548.html
-- Alex Bennée
On Sun, Sep 5, 2021 at 7:24 PM AKASHI Takahiro via Stratos-dev <stratos-dev@op-lists.linaro.org> wrote:
Alex,
On Fri, Sep 03, 2021 at 10:28:06AM +0100, Alex Bennée wrote:
AKASHI Takahiro takahiro.akashi@linaro.org writes:
Alex,
On Wed, Sep 01, 2021 at 01:53:34PM +0100, Alex Bennée wrote:
Stefan Hajnoczi stefanha@redhat.com writes:
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add. The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See: https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ior...
There have been experiments with something kind of similar in KVM recently (see struct ioregionfd_cmd): https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828...
Reading the cover letter was very useful in showing how this provides a separate channel for signalling IO events to userspace instead of using the normal type-2 vmexit type event. I wonder how deeply tied the userspace facing side of this is to KVM? Could it provide a common FD type interface to IOREQ?
Why do you stick to a "FD" type interface?
I mean most user space interfaces on POSIX start with a file descriptor and the usual read/write semantics or a series of ioctls.
Who do you assume is responsible for implementing this kind of fd semantics, OSs on BE or hypervisor itself?
I think such interfaces can only be easily implemented on type-2 hypervisors.
# In this sense, I don't think rust-vmm, as it is, can be a general solution.
As I understand IOREQ this is currently a direct communication between userspace and the hypervisor using the existing Xen message bus. My
With IOREQ server, IO event occurrences are notified to BE via Xen's event channel, while the actual contexts of IO events (see struct ioreq in ioreq.h) are put in a queue on a single shared memory page which is to be assigned beforehand with xenforeignmemory_map_resource hypervisor call.
If we abstracted the IOREQ via the kernel interface you would probably just want to put the ioreq structure on a queue rather than expose the shared page to userspace.
Where is that queue?
worry would be that by adding knowledge of what the underlying hypervisor is we'd end up with excess complexity in the kernel. For one thing we certainly wouldn't want an API version dependency on the kernel to understand which version of the Xen hypervisor it was running on.
That's exactly what virtio-proxy in my proposal[1] does; All the hypervisor-specific details of IO event handlings are contained in virtio-proxy and virtio BE will communicate with virtio-proxy through a virtqueue (yes, virtio-proxy is seen as yet another virtio device on BE) and will get IO event-related *RPC* callbacks, either MMIO read or write, from virtio-proxy.
See page 8 (protocol flow) and 10 (interfaces) in [1].
There are two areas of concern with the proxy approach at the moment. The first is how the bootstrap of the virtio-proxy channel happens and
As I said, from BE point of view, virtio-proxy would be seen as yet another virtio device by which BE could talk to "virtio proxy" vm or whatever else.
This way we guarantee BE's hypervisor-agnosticism instead of having "common" hypervisor interfaces. That is the base of my idea.
the second is how many context switches are involved in a transaction. Of course with all things there is a trade off. Things involving the very tightest latency would probably opt for a bare metal backend which I think would imply hypervisor knowledge in the backend binary.
In configuration phase of virtio device, the latency won't be a big matter. In device operations (i.e. read/write to block devices), if we can resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue is how efficiently we can deliver notification to the opposite side. Right? And this is a very common problem whatever approach we would take.
Anyhow, if we do care the latency in my approach, most of virtio-proxy- related code can be re-implemented just as a stub (or shim?) library since the protocols are defined as RPCs. In this case, however, we would lose the benefit of providing "single binary" BE. (I know this is is an arguable requirement, though.)
# Would we better discuss what "hypervisor-agnosticism" means?
Is there a call that you could recommend that we join to discuss this and the topics of this thread? There is definitely interest in pursuing a new interface for Argo that can be implemented in other hypervisors and enable guest binary portability between them, at least on the same hardware architecture, with VirtIO transport as a primary use case.
The notes from the Xen Summit Design Session on VirtIO Cross-Project BoF for Xen and Guest OS, which include context about the several separate approaches to VirtIO on Xen, have now been posted here: https://lists.xenproject.org/archives/html/xen-devel/2021-09/msg00472.html
Christopher
-Takahiro Akashi
On Mon, Sep 06, 2021 at 07:41:48PM -0700, Christopher Clark wrote:
On Sun, Sep 5, 2021 at 7:24 PM AKASHI Takahiro via Stratos-dev <stratos-dev@op-lists.linaro.org> wrote:
Alex,
On Fri, Sep 03, 2021 at 10:28:06AM +0100, Alex Bennée wrote:
AKASHI Takahiro takahiro.akashi@linaro.org writes:
Alex,
On Wed, Sep 01, 2021 at 01:53:34PM +0100, Alex Bennée wrote:
Stefan Hajnoczi stefanha@redhat.com writes:
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add. The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See:
https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ior...
There have been experiments with something kind of similar in KVM recently (see struct ioregionfd_cmd):
https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828...
Reading the cover letter was very useful in showing how this provides a separate channel for signalling IO events to userspace instead of using the normal type-2 vmexit type event. I wonder how deeply tied the userspace facing side of this is to KVM? Could it provide a common FD type interface to IOREQ?
Why do you stick to a "FD" type interface?
I mean most user space interfaces on POSIX start with a file descriptor and the usual read/write semantics or a series of ioctls.
Who do you assume is responsible for implementing this kind of fd semantics, OSs on BE or hypervisor itself?
I think such interfaces can only be easily implemented on type-2 hypervisors.
# In this sense, I don't think rust-vmm, as it is, can be a general solution.
As I understand IOREQ this is currently a direct communication between userspace and the hypervisor using the existing Xen message bus.
With IOREQ server, IO event occurrences are notified to BE via Xen's event channel, while the actual contexts of IO events (see struct ioreq in ioreq.h) are put in a queue on a single shared memory page which is to be assigned beforehand with xenforeignmemory_map_resource hypervisor call.
If we abstracted the IOREQ via the kernel interface you would probably just want to put the ioreq structure on a queue rather than expose the shared page to userspace.
Where is that queue?
My worry would be that by adding knowledge of what the underlying hypervisor is we'd end up with excess complexity in the kernel. For one thing we certainly wouldn't want an API version dependency on the kernel to understand which version of the Xen hypervisor it was running on.
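As a concrete rendering of the "put the ioreq structure on a queue" idea above, here is a minimal sketch in Rust. It assumes a purely hypothetical /dev/vhost-ioreq character device that delivers fixed-size, hypervisor-agnostic request records to userspace; the device name, the IoRequest layout and the read() semantics are all invented for illustration and do not correspond to any existing kernel ABI.
```rust
use std::fs::File;
use std::io::{self, Read};

/// Hypothetical, hypervisor-agnostic IO request record as it might be
/// delivered by a kernel queue interface (layout is illustrative only).
#[repr(C)]
#[derive(Debug, Default, Clone, Copy)]
struct IoRequest {
    addr: u64, // guest-physical address of the access
    data: u64, // value written, or space for the value to be read
    size: u32, // access width in bytes (1, 2, 4, 8)
    dir: u32,  // 0 = read, 1 = write
}

fn read_request(dev: &mut File) -> io::Result<IoRequest> {
    // Read one fixed-size record; a real interface would also need a way
    // to post the completion (e.g. a write() back or an ioctl).
    let mut buf = [0u8; std::mem::size_of::<IoRequest>()];
    dev.read_exact(&mut buf)?;
    // Safety: IoRequest is #[repr(C)] and contains only plain integers.
    Ok(unsafe { std::ptr::read_unaligned(buf.as_ptr() as *const IoRequest) })
}

fn main() -> io::Result<()> {
    // Purely hypothetical device node; this fails unless such an
    // interface actually exists on the running system.
    let mut dev = File::open("/dev/vhost-ioreq")?;
    let req = read_request(&mut dev)?;
    println!("{:?}", req);
    Ok(())
}
```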
That's exactly what virtio-proxy in my proposal[1] does; all the hypervisor-specific details of IO event handling are contained in virtio-proxy, and the virtio BE will communicate with virtio-proxy through a virtqueue (yes, virtio-proxy is seen as yet another virtio device on the BE) and will get IO event-related *RPC* callbacks, either MMIO read or write, from virtio-proxy.
See page 8 (protocol flow) and 10 (interfaces) in [1].
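To make the RPC framing more concrete, here is a minimal sketch of the kind of MMIO read/write callbacks a BE might receive over such a proxy virtqueue. The message layout (ProxyRpc/ProxyRpcReply) is invented for illustration and is not the actual virtio-proxy protocol described in [1].
```rust
/// Illustrative request sent from the proxy to the BE over a virtqueue.
/// Field names and encoding are invented; see [1] for the real protocol.
#[derive(Debug)]
enum ProxyRpc {
    MmioRead { offset: u64, len: u32 },
    MmioWrite { offset: u64, len: u32, value: u64 },
}

/// Illustrative response pushed back on the used ring.
#[derive(Debug)]
enum ProxyRpcReply {
    ReadValue(u64),
    WriteAck,
}

/// A toy device model: a single 32-bit "status" register at offset 0.
fn handle_rpc(status: &mut u32, rpc: ProxyRpc) -> ProxyRpcReply {
    match rpc {
        ProxyRpc::MmioRead { offset: 0, .. } => ProxyRpcReply::ReadValue(*status as u64),
        ProxyRpc::MmioWrite { offset: 0, value, .. } => {
            *status = value as u32;
            ProxyRpcReply::WriteAck
        }
        // Unknown registers read as zero and ignore writes in this sketch.
        ProxyRpc::MmioRead { .. } => ProxyRpcReply::ReadValue(0),
        ProxyRpc::MmioWrite { .. } => ProxyRpcReply::WriteAck,
    }
}

fn main() {
    let mut status = 0u32;
    let reply = handle_rpc(&mut status, ProxyRpc::MmioWrite { offset: 0, len: 4, value: 0xF });
    println!("{:?}, status = {:#x}", reply, status);
}
```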
There are two areas of concern with the proxy approach at the moment. The first is how the bootstrap of the virtio-proxy channel happens and
As I said, from BE point of view, virtio-proxy would be seen as yet another virtio device by which BE could talk to "virtio proxy" vm or whatever else.
This way we guarantee BE's hypervisor-agnosticism instead of having "common" hypervisor interfaces. That is the base of my idea.
the second is how many context switches are involved in a transaction. Of course with all things there is a trade off. Things involving the very tightest latency would probably opt for a bare metal backend which I think would imply hypervisor knowledge in the backend binary.
In configuration phase of virtio device, the latency won't be a big matter. In device operations (i.e. read/write to block devices), if we can resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue is how efficiently we can deliver notification to the opposite side. Right? And this is a very common problem whatever approach we would take.
Anyhow, if we do care about the latency in my approach, most of the virtio-proxy-related code can be re-implemented just as a stub (or shim?) library since the protocols are defined as RPCs. In this case, however, we would lose the benefit of providing a "single binary" BE. (I know this is an arguable requirement, though.)
# Would we better discuss what "hypervisor-agnosticism" means?
Is there a call that you could recommend that we join to discuss this and the topics of this thread?
Stratos call? Alex should have more to say.
-Takahiro Akashi
There is definitely interest in pursuing a new interface for Argo that can be implemented in other hypervisors and enable guest binary portability between them, at least on the same hardware architecture, with VirtIO transport as a primary use case.
The notes from the Xen Summit Design Session on VirtIO Cross-Project BoF for Xen and Guest OS, which include context about the several separate approaches to VirtIO on Xen, have now been posted here: https://lists.xenproject.org/archives/html/xen-devel/2021-09/msg00472.html
Christopher
-Takahiro Akashi
Christopher Clark christopher.w.clark@gmail.com writes:
On Sun, Sep 5, 2021 at 7:24 PM AKASHI Takahiro via Stratos-dev stratos-dev@op-lists.linaro.org wrote:
Alex,
On Fri, Sep 03, 2021 at 10:28:06AM +0100, Alex Bennée wrote:
<snip>
In configuration phase of virtio device, the latency won't be a big matter. In device operations (i.e. read/write to block devices), if we can resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue is how efficiently we can deliver notification to the opposite side. Right? And this is a very common problem whatever approach we would take.
Anyhow, if we do care about the latency in my approach, most of the virtio-proxy-related code can be re-implemented just as a stub (or shim?) library since the protocols are defined as RPCs. In this case, however, we would lose the benefit of providing a "single binary" BE. (I know this is an arguable requirement, though.)
The proposal for a single binary would always require something to shim between hypervisors. This is still an area of discussion. A compile-time selectable approach is practically unavoidable for "bare metal" backends, though, because there are no other processes/layers that communication with the hypervisor can be delegated to.
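As a rough sketch of what "compile-time selectable" could look like in Rust: a hypothetical HypervisorShim trait captures the small hypervisor-specific surface, with one implementation per hypervisor chosen by a cargo feature. The trait name, methods and feature names are all illustrative, not an existing API.
```rust
/// Hypothetical trait capturing the small hypervisor-specific surface a
/// backend needs: sharing guest memory and signalling/receiving events.
trait HypervisorShim {
    fn map_guest_memory(&self, guest_paddr: u64, len: usize) -> Result<*mut u8, String>;
    fn kick_frontend(&self) -> Result<(), String>;
    fn wait_for_kick(&self) -> Result<(), String>;
}

// One implementation per hypervisor, selected at build time. The feature
// name ("xen") and the placeholder bodies are invented for the sketch.
#[cfg(feature = "xen")]
mod shim {
    pub struct Shim;
    impl super::HypervisorShim for Shim {
        fn map_guest_memory(&self, _gpa: u64, _len: usize) -> Result<*mut u8, String> {
            Err("would call the Xen-specific mapping interface here".into())
        }
        fn kick_frontend(&self) -> Result<(), String> { Ok(()) }
        fn wait_for_kick(&self) -> Result<(), String> { Ok(()) }
    }
}

#[cfg(not(feature = "xen"))]
mod shim {
    pub struct Shim;
    impl super::HypervisorShim for Shim {
        fn map_guest_memory(&self, _gpa: u64, _len: usize) -> Result<*mut u8, String> {
            Err("would use eventfd/vhost-user style plumbing here".into())
        }
        fn kick_frontend(&self) -> Result<(), String> { Ok(()) }
        fn wait_for_kick(&self) -> Result<(), String> { Ok(()) }
    }
}

fn main() {
    // The common backend code only ever sees the trait.
    let s = shim::Shim;
    let _ = s.kick_frontend();
}
```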
# Would we better discuss what "hypervisor-agnosticism" means?
Is there a call that you could recommend that we join to discuss this and the topics of this thread? There is definitely interest in pursuing a new interface for Argo that can be implemented in other hypervisors and enable guest binary portability between them, at least on the same hardware architecture, with VirtIO transport as a primary use case.
There is indeed ;-)
We have a regular open call every two weeks for the Stratos project which you are welcome to attend. You can find the details on the project overview page:
https://linaro.atlassian.net/wiki/spaces/STR/overview
we regularly have teams from outside the project present their work as well.
The notes from the Xen Summit Design Session on VirtIO Cross-Project BoF for Xen and Guest OS, which include context about the several separate approaches to VirtIO on Xen, have now been posted here: https://lists.xenproject.org/archives/html/xen-devel/2021-09/msg00472.html
Thanks for the link - looks like a very detailed summary.
Christopher
-Takahiro Akashi
On Mon, 6 Sep 2021, AKASHI Takahiro wrote:
the second is how many context switches are involved in a transaction. Of course with all things there is a trade off. Things involving the very tightest latency would probably opt for a bare metal backend which I think would imply hypervisor knowledge in the backend binary.
In configuration phase of virtio device, the latency won't be a big matter. In device operations (i.e. read/write to block devices), if we can resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue is how efficiently we can deliver notification to the opposite side. Right? And this is a very common problem whatever approach we would take.
Anyhow, if we do care about the latency in my approach, most of the virtio-proxy-related code can be re-implemented just as a stub (or shim?) library since the protocols are defined as RPCs. In this case, however, we would lose the benefit of providing a "single binary" BE. (I know this is an arguable requirement, though.)
In my experience, latency, performance, and security are far more important than providing a single binary.
In my opinion, we should optimize for the best performance and security, then be practical on the topic of hypervisor agnosticism. For instance, a shared source with a small hypervisor-specific component, with one implementation of the small component for each hypervisor, would provide a good enough hypervisor abstraction. It is good to be hypervisor agnostic, but I wouldn't go extra lengths to have a single binary. I cannot picture a case where a BE binary needs to be moved between different hypervisors and a recompilation is impossible (BE, not FE). Instead, I can definitely imagine detailed requirements on IRQ latency having to be lower than 10us or bandwidth higher than 500 MB/sec.
Instead of virtio-proxy, my suggestion is to work together on a common project and common source with others interested in the same problem.
I would pick something like kvmtool as a basis. It doesn't have to be kvmtools, and kvmtools specifically is GPL-licensed, which is unfortunate because it would help if the license was BSD-style for ease of integration with Zephyr and other RTOSes.
As long as the project is open to working together on multiple hypervisors and deployment models then it is fine. For instance, the shared source could be based on OpenAMP kvmtool [1] (the original kvmtool likely prefers to stay small and narrow-focused on KVM). OpenAMP kvmtool was created to add support for hypervisor-less virtio but they are very open to hypervisors too. It could be a good place to add a Xen implementation, a KVM fatqueue implementation, a Jailhouse implementation, etc. -- work together toward the common goal of a single BE source (not binary) supporting multiple different deployment models.
Hi
On Tue, 14 Sep 2021 at 01:51, Stefano Stabellini via Stratos-dev <stratos-dev@op-lists.linaro.org> wrote:
On Mon, 6 Sep 2021, AKASHI Takahiro wrote:
the second is how many context switches are involved in a transaction. Of course with all things there is a trade off. Things involving the very tightest latency would probably opt for a bare metal backend which I think would imply hypervisor knowledge in the backend binary.
In configuration phase of virtio device, the latency won't be a big matter. In device operations (i.e. read/write to block devices), if we can resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue is how efficiently we can deliver notification to the opposite side. Right? And this is a very common problem whatever approach we would take.
Anyhow, if we do care about the latency in my approach, most of the virtio-proxy-related code can be re-implemented just as a stub (or shim?) library since the protocols are defined as RPCs. In this case, however, we would lose the benefit of providing a "single binary" BE. (I know this is an arguable requirement, though.)
In my experience, latency, performance, and security are far more important than providing a single binary.
In my opinion, we should optimize for the best performance and security, then be practical on the topic of hypervisor agnosticism. For instance, a shared source with a small hypervisor-specific component, with one implementation of the small component for each hypervisor, would provide a good enough hypervisor abstraction. It is good to be hypervisor agnostic, but I wouldn't go extra lengths to have a single binary. I cannot picture a case where a BE binary needs to be moved between different hypervisors and a recompilation is impossible (BE, not FE). Instead, I can definitely imagine detailed requirements on IRQ latency having to be lower than 10us or bandwidth higher than 500 MB/sec.
Instead of virtio-proxy, my suggestion is to work together on a common project and common source with others interested in the same problem.
I would pick something like kvmtool as a basis. It doesn't have to be kvmtools, and kvmtools specifically is GPL-licensed, which is unfortunate because it would help if the license was BSD-style for ease of integration with Zephyr and other RTOSes.
As long as the project is open to working together on multiple hypervisors and deployment models then it is fine. For instance, the shared source could be based on OpenAMP kvmtool [1] (the original kvmtool likely prefers to stay small and narrow-focused on KVM). OpenAMP kvmtool was created to add support for hypervisor-less virtio but they are very open to hypervisors too. It could be a good place to add a Xen implementation, a KVM fatqueue implementation, a Jailhouse implementation, etc. -- work together toward the common goal of a single BE source (not binary) supporting multiple different deployment models.
I like the hypervisor-less approach described in the link below. It can also be used to define an abstract HAL between the normal world and TrustZone to implement confidential workloads in the TZ. Virtio-sock is of particular interest. In addition, this can define a HAL that can be re-used in many contexts: could we use this to implement something similar to the Android Generic Kernel Image stuff?
[1] https://github.com/OpenAMP/kvmtool
Stefano Stabellini stefano.stabellini@xilinx.com writes:
On Mon, 6 Sep 2021, AKASHI Takahiro wrote:
the second is how many context switches are involved in a transaction. Of course with all things there is a trade off. Things involving the very tightest latency would probably opt for a bare metal backend which I think would imply hypervisor knowledge in the backend binary.
In configuration phase of virtio device, the latency won't be a big matter. In device operations (i.e. read/write to block devices), if we can resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue is how efficiently we can deliver notification to the opposite side. Right? And this is a very common problem whatever approach we would take.
Anyhow, if we do care about the latency in my approach, most of the virtio-proxy-related code can be re-implemented just as a stub (or shim?) library since the protocols are defined as RPCs. In this case, however, we would lose the benefit of providing a "single binary" BE. (I know this is an arguable requirement, though.)
In my experience, latency, performance, and security are far more important than providing a single binary.
In my opinion, we should optimize for the best performance and security, then be practical on the topic of hypervisor agnosticism. For instance, a shared source with a small hypervisor-specific component, with one implementation of the small component for each hypervisor, would provide a good enough hypervisor abstraction. It is good to be hypervisor agnostic, but I wouldn't go extra lengths to have a single binary.
I agree it shouldn't be a primary goal, although a single binary working with helpers to bridge the gap would make a cool demo. The real aim of agnosticism is to avoid having multiple implementations of the backend itself for no other reason than a change in hypervisor.
I cannot picture a case where a BE binary needs to be moved between different hypervisors and a recompilation is impossible (BE, not FE). Instead, I can definitely imagine detailed requirements on IRQ latency having to be lower than 10us or bandwidth higher than 500 MB/sec.
Instead of virtio-proxy, my suggestion is to work together on a common project and common source with others interested in the same problem.
I would pick something like kvmtool as a basis. It doesn't have to be kvmtools, and kvmtools specifically is GPL-licensed, which is unfortunate because it would help if the license was BSD-style for ease of integration with Zephyr and other RTOSes.
This does imply making some choices, especially the implementation language. However, I feel that C is really the lowest common denominator here and I get the sense that people would rather avoid it if they could, given the potential security implications of a bug-prone backend. This is what is prompting interest in Rust.
As long as the project is open to working together on multiple hypervisors and deployment models then it is fine. For instance, the shared source could be based on OpenAMP kvmtool [1] (the original kvmtool likely prefers to stay small and narrow-focused on KVM). OpenAMP kvmtool was created to add support for hypervisor-less virtio but they are very open to hypervisors too. It could be a good place to add a Xen implementation, a KVM fatqueue implementation, a Jailhouse implementation, etc. -- work together toward the common goal of a single BE source (not binary) supporting multiple different deployment models.
Hello,
On 9/13/2021 4:51 PM, Stefano Stabellini via Stratos-dev wrote:
On Mon, 6 Sep 2021, AKASHI Takahiro wrote:
the second is how many context switches are involved in a transaction. Of course with all things there is a trade off. Things involving the very tightest latency would probably opt for a bare metal backend which I think would imply hypervisor knowledge in the backend binary.
In configuration phase of virtio device, the latency won't be a big matter. In device operations (i.e. read/write to block devices), if we can resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue is how efficiently we can deliver notification to the opposite side. Right? And this is a very common problem whatever approach we would take.
Anyhow, if we do care about the latency in my approach, most of the virtio-proxy-related code can be re-implemented just as a stub (or shim?) library since the protocols are defined as RPCs. In this case, however, we would lose the benefit of providing a "single binary" BE. (I know this is an arguable requirement, though.)
In my experience, latency, performance, and security are far more important than providing a single binary.
In my opinion, we should optimize for the best performance and security, then be practical on the topic of hypervisor agnosticism. For instance, a shared source with a small hypervisor-specific component, with one implementation of the small component for each hypervisor, would provide a good enough hypervisor abstraction. It is good to be hypervisor agnostic, but I wouldn't go extra lengths to have a single binary. I cannot picture a case where a BE binary needs to be moved between different hypervisors and a recompilation is impossible (BE, not FE). Instead, I can definitely imagine detailed requirements on IRQ latency having to be lower than 10us or bandwidth higher than 500 MB/sec.
Instead of virtio-proxy, my suggestion is to work together on a common project and common source with others interested in the same problem.
I would pick something like kvmtool as a basis. It doesn't have to be kvmtools, and kvmtools specifically is GPL-licensed, which is unfortunate because it would help if the license was BSD-style for ease of integration with Zephyr and other RTOSes.
As long as the project is open to working together on multiple hypervisors and deployment models then it is fine. For instance, the shared source could be based on OpenAMP kvmtool [1] (the original kvmtool likely prefers to stay small and narrow-focused on KVM). OpenAMP kvmtool was created to add support for hypervisor-less virtio but they are very open to hypervisors too. It could be a good place to add a Xen implementation, a KVM fatqueue implementation, a Jailhouse implementation, etc. -- work together toward the common goal of a single BE source (not binary) supporting multiple different deployment models.
I have my reservations on using "kvmtool" to do any development here. "kvmtool" can't be used on the products and it is just a tool for the developers.
The benefit of solving the problem w/ rust-vmm is that some of the crates from this project can be utilized for a real product. Alex has mentioned that "rust-vmm" today has some KVM-specific bits but the rust-vmm community is already discussing removing or re-organising them in such a way that other hypervisors can fit in.
Microsoft has a Hyper-V implementation w/ cloud-hypervisor which uses some of the rust-vmm components as well, and they have shown interest in adding Hyper-V support to the "rust-vmm" project too. I don't know the current progress but they have proven it in the "cloud-hypervisor" project.
"rust-vmm" project's license will work as well for most of the project developments and I see that "CrosVM" is shipping in the products as well.
---Trilok Soni
On Tue, 14 Sep 2021, Trilok Soni wrote:
On 9/13/2021 4:51 PM, Stefano Stabellini via Stratos-dev wrote:
On Mon, 6 Sep 2021, AKASHI Takahiro wrote:
the second is how many context switches are involved in a transaction. Of course with all things there is a trade off. Things involving the very tightest latency would probably opt for a bare metal backend which I think would imply hypervisor knowledge in the backend binary.
In configuration phase of virtio device, the latency won't be a big matter. In device operations (i.e. read/write to block devices), if we can resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue is how efficiently we can deliver notification to the opposite side. Right? And this is a very common problem whatever approach we would take.
Anyhow, if we do care about the latency in my approach, most of the virtio-proxy-related code can be re-implemented just as a stub (or shim?) library since the protocols are defined as RPCs. In this case, however, we would lose the benefit of providing a "single binary" BE. (I know this is an arguable requirement, though.)
In my experience, latency, performance, and security are far more important than providing a single binary.
In my opinion, we should optimize for the best performance and security, then be practical on the topic of hypervisor agnosticism. For instance, a shared source with a small hypervisor-specific component, with one implementation of the small component for each hypervisor, would provide a good enough hypervisor abstraction. It is good to be hypervisor agnostic, but I wouldn't go extra lengths to have a single binary. I cannot picture a case where a BE binary needs to be moved between different hypervisors and a recompilation is impossible (BE, not FE). Instead, I can definitely imagine detailed requirements on IRQ latency having to be lower than 10us or bandwidth higher than 500 MB/sec.
Instead of virtio-proxy, my suggestion is to work together on a common project and common source with others interested in the same problem.
I would pick something like kvmtool as a basis. It doesn't have to be kvmtools, and kvmtools specifically is GPL-licensed, which is unfortunate because it would help if the license was BSD-style for ease of integration with Zephyr and other RTOSes.
As long as the project is open to working together on multiple hypervisors and deployment models then it is fine. For instance, the shared source could be based on OpenAMP kvmtool [1] (the original kvmtool likely prefers to stay small and narrow-focused on KVM). OpenAMP kvmtool was created to add support for hypervisor-less virtio but they are very open to hypervisors too. It could be a good place to add a Xen implementation, a KVM fatqueue implementation, a Jailhouse implementation, etc. -- work together toward the common goal of a single BE source (not binary) supporting multiple different deployment models.
I have my reservations on using "kvmtool" to do any development here. "kvmtool" can't be used on the products and it is just a tool for the developers.
The benefit of solving the problem w/ rust-vmm is that some of the crates from this project can be utilized for a real product. Alex has mentioned that "rust-vmm" today has some KVM-specific bits but the rust-vmm community is already discussing removing or re-organising them in such a way that other hypervisors can fit in.
Microsoft has a Hyper-V implementation w/ cloud-hypervisor which uses some of the rust-vmm components as well, and they have shown interest in adding Hyper-V support to the "rust-vmm" project too. I don't know the current progress but they have proven it in the "cloud-hypervisor" project.
"rust-vmm" project's license will work as well for most of the project developments and I see that "CrosVM" is shipping in the products as well.
Most things in open source start as a developers tool before they become part of a product :)
I am concerned about how "embeddable" rust-vmm is going to be. Do you think it would be possible to run it against an RTOS together with other apps written in C?
Let me make a realistic example. You can imagine a Zephyr instance with simple toolstack functionalities written in C (starting/stopping VMs). One might want to add a virtio backend to it. I am not familiar enough with Rust and rust-vmm to know if it would be feasible and "easy" to run a rust-vmm backend as a Zephyr app.
A C project of the size of kvmtool, but BSD-licensed, could run on Zephyr with only a little porting effort using the POSIX compatibility layer. I think that would be ideal. Anybody aware of a project fulfilling these requirements?
If we have to give up the ability to integrate with an RTOS, then I think QEMU could be the leading choice because it is still the main reference implementation for virtio.
Hi Stefano,
On 9/14/2021 8:29 PM, Stefano Stabellini wrote:
On Tue, 14 Sep 2021, Trilok Soni wrote:
On 9/13/2021 4:51 PM, Stefano Stabellini via Stratos-dev wrote:
On Mon, 6 Sep 2021, AKASHI Takahiro wrote:
the second is how many context switches are involved in a transaction. Of course with all things there is a trade off. Things involving the very tightest latency would probably opt for a bare metal backend which I think would imply hypervisor knowledge in the backend binary.
In configuration phase of virtio device, the latency won't be a big matter. In device operations (i.e. read/write to block devices), if we can resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue is how efficiently we can deliver notification to the opposite side. Right? And this is a very common problem whatever approach we would take.
Anyhow, if we do care about the latency in my approach, most of the virtio-proxy-related code can be re-implemented just as a stub (or shim?) library since the protocols are defined as RPCs. In this case, however, we would lose the benefit of providing a "single binary" BE. (I know this is an arguable requirement, though.)
In my experience, latency, performance, and security are far more important than providing a single binary.
In my opinion, we should optimize for the best performance and security, then be practical on the topic of hypervisor agnosticism. For instance, a shared source with a small hypervisor-specific component, with one implementation of the small component for each hypervisor, would provide a good enough hypervisor abstraction. It is good to be hypervisor agnostic, but I wouldn't go extra lengths to have a single binary. I cannot picture a case where a BE binary needs to be moved between different hypervisors and a recompilation is impossible (BE, not FE). Instead, I can definitely imagine detailed requirements on IRQ latency having to be lower than 10us or bandwidth higher than 500 MB/sec.
Instead of virtio-proxy, my suggestion is to work together on a common project and common source with others interested in the same problem.
I would pick something like kvmtool as a basis. It doesn't have to be kvmtools, and kvmtools specifically is GPL-licensed, which is unfortunate because it would help if the license was BSD-style for ease of integration with Zephyr and other RTOSes.
As long as the project is open to working together on multiple hypervisors and deployment models then it is fine. For instance, the shared source could be based on OpenAMP kvmtool [1] (the original kvmtool likely prefers to stay small and narrow-focused on KVM). OpenAMP kvmtool was created to add support for hypervisor-less virtio but they are very open to hypervisors too. It could be a good place to add a Xen implementation, a KVM fatqueue implementation, a Jailhouse implementation, etc. -- work together toward the common goal of a single BE source (not binary) supporting multiple different deployment models.
I have my reservations on using "kvmtool" to do any development here. "kvmtool" can't be used on the products and it is just a tool for the developers.
The benefit of solving the problem w/ rust-vmm is that some of the crates from this project can be utilized for a real product. Alex has mentioned that "rust-vmm" today has some KVM-specific bits but the rust-vmm community is already discussing removing or re-organising them in such a way that other hypervisors can fit in.
Microsoft has a Hyper-V implementation w/ cloud-hypervisor which uses some of the rust-vmm components as well, and they have shown interest in adding Hyper-V support to the "rust-vmm" project too. I don't know the current progress but they have proven it in the "cloud-hypervisor" project.
"rust-vmm" project's license will work as well for most of the project developments and I see that "CrosVM" is shipping in the products as well.
Most things in open source start as a developers tool before they become part of a product :)
Agree, but I had offline discussions with one of the active developers of kvmtool and the confidence in using it in a product was nowhere near what we expected during our evaluation. The same goes for QEMU, and one of the biggest problems was the number of security issues against the huge QEMU codebase.
I am concerned about how "embeddable" rust-vmm is going to be. Do you think it would be possible to run it against an RTOS together with other apps written in C?
I don't see any limitations of rust-vmm. For example, I am confident that we can port rust-vmm based backend into the QNX as host OS and same goes w/ Zephyr as well. Some work is needed but nothing fundamentally blocking it. We should be able to run it w/ Fuchsia as well with some effort.
---Trilok Soni
On Wed, 15 Sep 2021, Trilok Soni wrote:
On 9/14/2021 8:29 PM, Stefano Stabellini wrote:
On Tue, 14 Sep 2021, Trilok Soni wrote:
On 9/13/2021 4:51 PM, Stefano Stabellini via Stratos-dev wrote:
On Mon, 6 Sep 2021, AKASHI Takahiro wrote:
the second is how many context switches are involved in a transaction. Of course with all things there is a trade off. Things involving the very tightest latency would probably opt for a bare metal backend which I think would imply hypervisor knowledge in the backend binary.
In configuration phase of virtio device, the latency won't be a big matter. In device operations (i.e. read/write to block devices), if we can resolve 'mmap' issue, as Oleksandr is proposing right now, the only issue is how efficiently we can deliver notification to the opposite side. Right? And this is a very common problem whatever approach we would take.
Anyhow, if we do care about the latency in my approach, most of the virtio-proxy-related code can be re-implemented just as a stub (or shim?) library since the protocols are defined as RPCs. In this case, however, we would lose the benefit of providing a "single binary" BE. (I know this is an arguable requirement, though.)
In my experience, latency, performance, and security are far more important than providing a single binary.
In my opinion, we should optimize for the best performance and security, then be practical on the topic of hypervisor agnosticism. For instance, a shared source with a small hypervisor-specific component, with one implementation of the small component for each hypervisor, would provide a good enough hypervisor abstraction. It is good to be hypervisor agnostic, but I wouldn't go extra lengths to have a single binary. I cannot picture a case where a BE binary needs to be moved between different hypervisors and a recompilation is impossible (BE, not FE). Instead, I can definitely imagine detailed requirements on IRQ latency having to be lower than 10us or bandwidth higher than 500 MB/sec.
Instead of virtio-proxy, my suggestion is to work together on a common project and common source with others interested in the same problem.
I would pick something like kvmtool as a basis. It doesn't have to be kvmtools, and kvmtools specifically is GPL-licensed, which is unfortunate because it would help if the license was BSD-style for ease of integration with Zephyr and other RTOSes.
As long as the project is open to working together on multiple hypervisors and deployment models then it is fine. For instance, the shared source could be based on OpenAMP kvmtool [1] (the original kvmtool likely prefers to stay small and narrow-focused on KVM). OpenAMP kvmtool was created to add support for hypervisor-less virtio but they are very open to hypervisors too. It could be a good place to add a Xen implementation, a KVM fatqueue implementation, a Jailhouse implementation, etc. -- work together toward the common goal of a single BE source (not binary) supporting multiple different deployment models.
I have my reservations on using "kvmtool" to do any development here. "kvmtool" can't be used on the products and it is just a tool for the developers.
The benefit of solving the problem w/ rust-vmm is that some of the crates from this project can be utilized for a real product. Alex has mentioned that "rust-vmm" today has some KVM-specific bits but the rust-vmm community is already discussing removing or re-organising them in such a way that other hypervisors can fit in.
Microsoft has a Hyper-V implementation w/ cloud-hypervisor which uses some of the rust-vmm components as well, and they have shown interest in adding Hyper-V support to the "rust-vmm" project too. I don't know the current progress but they have proven it in the "cloud-hypervisor" project.
"rust-vmm" project's license will work as well for most of the project developments and I see that "CrosVM" is shipping in the products as well.
Most things in open source start as a developers tool before they become part of a product :)
Agree, but I had offline discussions with one of the active developers of kvmtool and the confidence in using it in a product was nowhere near what we expected during our evaluation. The same goes for QEMU, and one of the biggest problems was the number of security issues against the huge QEMU codebase.
That is fair, but it is important to recognize that these are *known* security issues.
Does rust-vmm have a security process and a security response team? I tried googling for it but couldn't find relevant info.
QEMU is a very widely used and very well inspected codebase. It has a mailing list to report security issues and a security process. As a consequence we know of many vulnerabilities affecting the code base. As far as I am aware rust-vmm has not been inspected yet with the same level of attention and the same amount of security researchers.
That said, of course it is undeniable that the larger size of QEMU implies a higher number of security issues. But for this project we wouldn't be using the whole of QEMU. We would be narrowing it down to a build with only a few relevant pieces. I imagine that the total LOC count would still be higher but the number of relevant security vulnerabilities would only be a small fraction of the QEMU total.
I am concerned about how "embeddable" rust-vmm is going to be. Do you think it would be possible to run it against an RTOS together with other apps written in C?
I don't see any limitations of rust-vmm. For example, I am confident that we can port rust-vmm based backend into the QNX as host OS and same goes w/ Zephyr as well. Some work is needed but nothing fundamentally blocking it. We should be able to run it w/ Fuchsia as well with some effort.
That's good to hear.
On Wed, Aug 04, 2021 at 10:04:30AM +0100, Alex Bennée wrote:
Hi,
One of the goals of Project Stratos is to enable hypervisor agnostic backends so we can enable as much re-use of code as possible and avoid repeating ourselves. This is the flip side of the front end where multiple front-end implementations are required - one per OS, assuming you don't just want Linux guests. The resultant guests are trivially movable between hypervisors modulo any abstracted paravirt type interfaces.
In my original thumb nail sketch of a solution I envisioned vhost-user daemons running in a broadly POSIX like environment. The interface to the daemon is fairly simple requiring only some mapped memory and some sort of signalling for events (on Linux this is eventfd). The idea was a stub binary would be responsible for any hypervisor specific setup and then launch a common binary to deal with the actual virtqueue requests themselves.
Since that original sketch we've seen an expansion in the sort of ways backends could be created. There is interest in encapsulating backends in RTOSes or unikernels for solutions like SCMI. There interest in Rust has prompted ideas of using the trait interface to abstract differences away as well as the idea of bare-metal Rust backends.
We have a card (STR-12) called "Hypercall Standardisation" which calls for a description of the APIs needed from the hypervisor side to support VirtIO guests and their backends. However we are some way off from that at the moment as I think we need to at least demonstrate one portable backend before we start codifying requirements. To that end I want to think about what we need for a backend to function.
Configuration
In the type-2 setup this is typically fairly simple because the host system can orchestrate the various modules that make up the complete system. In the type-1 case (or even type-2 with delegated service VMs) we need some sort of mechanism to inform the backend VM about key details about the system:
- where virt queue memory is in it's address space
- how it's going to receive (interrupt) and trigger (kick) events
- what (if any) resources the backend needs to connect to
Obviously you can elide over configuration issues by having static configurations and baking the assumptions into your guest images however this isn't scalable in the long term. The obvious solution seems to be extending a subset of Device Tree data to user space but perhaps there are other approaches?
Before any virtio transactions can take place the appropriate memory mappings need to be made between the FE guest and the BE guest. Currently the whole of the FE guests address space needs to be visible to whatever is serving the virtio requests. I can envision 3 approaches:
- BE guest boots with memory already mapped
This would entail the guest OS knowing where in it's Guest Physical Address space is already taken up and avoiding clashing. I would assume in this case you would want a standard interface to userspace to then make that address space visible to the backend daemon.
- BE guests boots with a hypervisor handle to memory
The BE guest is then free to map the FE's memory to where it wants in the BE's guest physical address space. To activate the mapping will require some sort of hypercall to the hypervisor. I can see two options at this point:
expose the handle to userspace for daemon/helper to trigger the mapping via existing hypercall interfaces. If using a helper you would have a hypervisor specific one to avoid the daemon having to care too much about the details or push that complexity into a compile time option for the daemon which would result in different binaries although a common source base.
expose a new kernel ABI to abstract the hypercall differences away in the guest kernel. In this case the userspace would essentially ask for an abstract "map guest N memory to userspace ptr" and let the kernel deal with the different hypercall interfaces. This of course assumes the majority of BE guests would be Linux kernels and leaves the bare-metal/unikernel approaches to their own devices.
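Purely for illustration, here is a sketch of how such an abstract "map guest N memory to userspace ptr" ABI might look from userspace, assuming a hypothetical /dev/vhost-xlate device node, an invented ioctl request code and the libc crate. Nothing here corresponds to an existing kernel interface.
```rust
use std::fs::OpenOptions;
use std::os::unix::io::AsRawFd;

/// Hypothetical argument for a "map guest N memory into my address space"
/// request; the field names and the ioctl number below are invented.
#[repr(C)]
struct MapGuestMem {
    guest_id: u32,    // which FE guest to map
    _pad: u32,
    guest_paddr: u64, // start of the region in the FE's physical space
    len: u64,         // length of the region
    user_addr: u64,   // filled in by the kernel: userspace pointer
}

const VHOST_XLATE_MAP_GUEST: libc::c_ulong = 0xC020_AF01; // invented request code

fn main() -> std::io::Result<()> {
    // Purely hypothetical device node exposing the abstract ABI.
    let dev = OpenOptions::new().read(true).write(true).open("/dev/vhost-xlate")?;
    let mut req = MapGuestMem {
        guest_id: 1,
        _pad: 0,
        guest_paddr: 0x4000_0000,
        len: 1 << 20,
        user_addr: 0,
    };
    // The kernel would pick the right hypercall (Xen, KVM, ...) behind this.
    let ret = unsafe {
        libc::ioctl(dev.as_raw_fd(), VHOST_XLATE_MAP_GUEST as _, &mut req as *mut MapGuestMem)
    };
    if ret < 0 {
        return Err(std::io::Error::last_os_error());
    }
    println!("guest memory mapped at {:#x}", req.user_addr);
    Ok(())
}
```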
VIRTIO typically uses the vring memory layout but doesn't need to. The VIRTIO device model deals with virtqueues. The shared memory vring layout is part of the VIRTIO transport (PCI, MMIO, and CCW use vrings). Alternative transports with other virtqueue representations are possible (e.g. VIRTIO-over-TCP). They don't need to involve a BE mapping shared memory and processing a vring owned by the FE.
For example, there could be BE hypercalls to pop virtqueue elements, push virtqueue elements, and access buffers (basically DMA read/write). The FE could either be a traditional virtio-mmio/pci device with a vring or use FE hypercalls to add available elements to a virtqueue and get used elements.
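A sketch of what such a hypercall-style interface could look like from the BE side, expressed as a Rust trait. The trait and method names are invented to mirror the pop/push/DMA operations described above; they are not an existing API, just an illustration of how a backend could be written without ever mapping a vring.
```rust
/// Illustrative interface a BE could be given instead of mapping a vring:
/// the hypervisor (or transport) owns the virtqueue representation and the
/// BE only sees element handles and DMA-style buffer accessors.
trait VirtqueueChannel {
    /// Pop the next available element, if any, returning an opaque handle.
    fn pop_element(&mut self, queue: u16) -> Option<u64>;
    /// Copy bytes out of a buffer belonging to the element (DMA read).
    fn read_buffer(&self, elem: u64, offset: u64, out: &mut [u8]) -> Result<usize, String>;
    /// Copy bytes into a buffer belonging to the element (DMA write).
    fn write_buffer(&mut self, elem: u64, offset: u64, data: &[u8]) -> Result<usize, String>;
    /// Push the element back as used, with the number of bytes written.
    fn push_used(&mut self, queue: u16, elem: u64, len: u32) -> Result<(), String>;
}

/// A backend written against this trait never touches guest memory or a
/// vring layout directly, so it does not care how the FE is implemented.
fn echo_once<C: VirtqueueChannel>(chan: &mut C) -> Result<(), String> {
    if let Some(elem) = chan.pop_element(0) {
        let mut buf = [0u8; 64];
        let n = chan.read_buffer(elem, 0, &mut buf)?;
        let written = chan.write_buffer(elem, 0, &buf[..n])?;
        chan.push_used(0, elem, written as u32)?;
    }
    Ok(())
}

fn main() {
    // No concrete implementation here; echo_once() only shows the shape of
    // the interface a hypervisor or transport layer would have to provide.
    println!("see echo_once() for how a BE would drive the interface");
}
```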
I don't know the goals of project Stratos or whether this helps, but it might allow other architectures that have different security, complexity, etc properties.
Stefan
On Wed, Aug 04, 2021 at 10:04:30AM +0100, Alex Bennée wrote:
Hi,
One of the goals of Project Stratos is to enable hypervisor agnostic backends so we can enable as much re-use of code as possible and avoid repeating ourselves. This is the flip side of the front end where
Despite the fact that I submitted a proposal on this subject, I'm still not confident that this goal is an indisputable consensus among the parties.
multiple front-end implementations are required - one per OS, assuming you don't just want Linux guests. The resultant guests are trivially movable between hypervisors modulo any abstracted paravirt type interfaces.
What does the future system look like? Any big picture to be shared?
-Takahiro Akashi
In my original thumb nail sketch of a solution I envisioned vhost-user daemons running in a broadly POSIX like environment. The interface to the daemon is fairly simple requiring only some mapped memory and some sort of signalling for events (on Linux this is eventfd). The idea was a stub binary would be responsible for any hypervisor specific setup and then launch a common binary to deal with the actual virtqueue requests themselves.
Since that original sketch we've seen an expansion in the sort of ways backends could be created. There is interest in encapsulating backends in RTOSes or unikernels for solutions like SCMI. There interest in Rust has prompted ideas of using the trait interface to abstract differences away as well as the idea of bare-metal Rust backends.
We have a card (STR-12) called "Hypercall Standardisation" which calls for a description of the APIs needed from the hypervisor side to support VirtIO guests and their backends. However we are some way off from that at the moment as I think we need to at least demonstrate one portable backend before we start codifying requirements. To that end I want to think about what we need for a backend to function.
Configuration
In the type-2 setup this is typically fairly simple because the host system can orchestrate the various modules that make up the complete system. In the type-1 case (or even type-2 with delegated service VMs) we need some sort of mechanism to inform the backend VM about key details about the system:
- where virt queue memory is in it's address space
- how it's going to receive (interrupt) and trigger (kick) events
- what (if any) resources the backend needs to connect to
Obviously you can elide over configuration issues by having static configurations and baking the assumptions into your guest images however this isn't scalable in the long term. The obvious solution seems to be extending a subset of Device Tree data to user space but perhaps there are other approaches?
Before any virtio transactions can take place the appropriate memory mappings need to be made between the FE guest and the BE guest. Currently the whole of the FE guests address space needs to be visible to whatever is serving the virtio requests. I can envision 3 approaches:
- BE guest boots with memory already mapped
This would entail the guest OS knowing where in it's Guest Physical Address space is already taken up and avoiding clashing. I would assume in this case you would want a standard interface to userspace to then make that address space visible to the backend daemon.
- BE guests boots with a hypervisor handle to memory
The BE guest is then free to map the FE's memory to where it wants in the BE's guest physical address space. To activate the mapping will require some sort of hypercall to the hypervisor. I can see two options at this point:
expose the handle to userspace for daemon/helper to trigger the mapping via existing hypercall interfaces. If using a helper you would have a hypervisor specific one to avoid the daemon having to care too much about the details or push that complexity into a compile time option for the daemon which would result in different binaries although a common source base.
expose a new kernel ABI to abstract the hypercall differences away in the guest kernel. In this case the userspace would essentially ask for an abstract "map guest N memory to userspace ptr" and let the kernel deal with the different hypercall interfaces. This of course assumes the majority of BE guests would be Linux kernels and leaves the bare-metal/unikernel approaches to their own devices.
Operation
The core of the operation of VirtIO is fairly simple. Once the vhost-user feature negotiation is done it's a case of receiving update events and parsing the resultant virt queue for data. The vhost-user specification handles a bunch of setup before that point, mostly to detail where the virt queues are and to set up FDs for memory and event communication. This is where the envisioned stub process would be responsible for getting the daemon up and ready to run. This is currently done inside a big VMM like QEMU but I suspect a modern approach would be to use the rust-vmm vhost crate. It would then either communicate with the kernel's abstracted ABI or be re-targeted as a build option for the various hypervisors.
One question is how to best handle notification and kicks. The existing vhost-user framework uses eventfd to signal the daemon (although QEMU is quite capable of simulating them when you use TCG). Xen has its own IOREQ mechanism. However latency is an important factor and having events go through the stub would add quite a lot.
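For reference, a minimal sketch of the eventfd-based signalling that the existing vhost-user flow relies on (Linux only, using the libc crate, with error handling kept minimal). Any kernel shim that converted IOREQ-style events would ultimately have to feed something equivalent to the read() side shown here.
```rust
use std::io;

/// Create an eventfd, post one "kick" on it, then consume it, mimicking the
/// signalling path a vhost-user daemon uses between VMM and backend.
fn main() -> io::Result<()> {
    // EFD_CLOEXEC keeps the fd from leaking into child processes.
    let efd = unsafe { libc::eventfd(0, libc::EFD_CLOEXEC) };
    if efd < 0 {
        return Err(io::Error::last_os_error());
    }

    // The "kick": the notifier writes a counter increment to the eventfd.
    let one: u64 = 1;
    let n = unsafe { libc::write(efd, &one as *const u64 as *const libc::c_void, 8) };
    assert_eq!(n, 8);

    // The backend side: a blocking read returns the accumulated count.
    let mut count: u64 = 0;
    let n = unsafe { libc::read(efd, &mut count as *mut u64 as *mut libc::c_void, 8) };
    assert_eq!(n, 8);
    println!("received {} kick(s)", count);

    unsafe { libc::close(efd) };
    Ok(())
}
```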
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
-- Alex Bennée
Hello Alex,
I can tell you my experience from working on a PoC (library) to allow the implementation of virtio-devices that are hypervisor/OS agnostic. I focused on two use cases:
1. A type-1 hypervisor in which the backend is running as a VM. This is an in-house hypervisor that does not support VMExits.
2. Linux user-space. In this case, the library is just used to communicate between threads. The goal of this use case is merely testing.
I have chosen virtio-mmio as the way to exchange information between the frontend and backend. I found it hard to synchronize the access to the virtio-mmio layout without VMExits. I had to add some extra bits to allow the front-end and back-end to synchronize, which is required during the device-status initialization. These extra bits would not be needed in case the hypervisor supports VMExits, e.g., KVM.
Each guest has a memory region that is shared with the backend. This memory region is used by the frontend to allocate the io-buffers. This region also maps the virtio-mmio layout that is initialized by the backend. For the moment, this region is defined when the guest is created. One limitation is that the memory for io-buffers is fixed. At some point, the guest shall be able to balloon this region. Notifications between the frontend and the backend are implemented by using a hypercall. The hypercall mechanism and the memory allocation are abstracted away by a platform layer that exposes an interface that is hypervisor/OS agnostic.
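A rough Rust rendering of the platform layer described above, with hypothetical trait and method names. The point is only that the backend needs exactly two things from the platform: the shared region (virtio-mmio layout plus io-buffers) and a way to notify the peer; how those are provided is the hypervisor/OS-specific part.
```rust
/// Illustrative platform abstraction: the backend only asks the platform
/// for the shared region and for a way to notify the frontend.
trait Platform {
    /// Return the shared region established when the guest was created.
    fn shared_region(&mut self) -> &mut [u8];
    /// Notify the peer (a hypercall in the type-1 case; e.g. an eventfd or
    /// a condition variable in the Linux user-space test case).
    fn notify_peer(&self);
}

/// A trivial in-process "platform" standing in for the Linux user-space
/// test case: the region is just a Vec and notify is a no-op.
struct InProcessPlatform {
    region: Vec<u8>,
}

impl Platform for InProcessPlatform {
    fn shared_region(&mut self) -> &mut [u8] {
        &mut self.region
    }
    fn notify_peer(&self) {
        // A real implementation would issue a hypercall or signal an eventfd.
    }
}

fn main() {
    let mut plat = InProcessPlatform { region: vec![0u8; 4096] };
    // The backend writes a status byte into its part of the shared layout
    // (0x70 is the Status register offset in the virtio-mmio layout) and
    // then kicks the frontend.
    plat.shared_region()[0x70] = 0x1;
    plat.notify_peer();
    println!("status byte = {:#x}", plat.region[0x70]);
}
```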
I split the backend into a virtio-device driver and a backend driver. The virtio-device driver handles the virtqueues and the backend driver gets packets from the virtqueue for post-processing. For example, in the case of virtio-net, the backend driver would decide if the packet goes to the hardware or to another virtio-net device. The virtio-device drivers may be implemented in different ways, like by using a single thread, multiple threads, or one thread for all the virtio-devices.
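The driver split could be expressed along these lines; the types and the in-memory "virtqueue" are invented so the sketch stays self-contained, and a real virtio-device driver would of course be driven by descriptors rather than a Vec.
```rust
/// The backend driver only sees packets already pulled off a virtqueue and
/// decides what to do with them (hardware, another virtio-net device, ...).
trait BackendDriver {
    fn process(&mut self, packet: &[u8]);
}

/// The virtio-device driver side: it owns the queue handling and hands each
/// payload to the backend driver. Here the "virtqueue" is just a Vec of
/// buffers so the example can run on its own.
struct VirtioNetDevice<B: BackendDriver> {
    pending: Vec<Vec<u8>>,
    backend: B,
}

impl<B: BackendDriver> VirtioNetDevice<B> {
    fn poll(&mut self) {
        for pkt in self.pending.drain(..) {
            self.backend.process(&pkt);
        }
    }
}

/// A stand-in backend driver that just counts packets.
struct CountingBackend {
    packets: usize,
}

impl BackendDriver for CountingBackend {
    fn process(&mut self, packet: &[u8]) {
        self.packets += 1;
        println!("got a {}-byte packet", packet.len());
    }
}

fn main() {
    let mut dev = VirtioNetDevice {
        pending: vec![vec![0u8; 64], vec![0u8; 1500]],
        backend: CountingBackend { packets: 0 },
    };
    dev.poll();
    println!("processed {} packets", dev.backend.packets);
}
```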
In this PoC, I just tackled two very simple use-cases. These use-cases allowed me to extract some requirements for an hypervisor to support virtio.
Matias
On Wed, Aug 04, 2021 at 10:04:30AM +0100, Alex Bennée wrote:
Hi,
One of the goals of Project Stratos is to enable hypervisor agnostic backends so we can enable as much re-use of code as possible and avoid repeating ourselves. This is the flip side of the front end where multiple front-end implementations are required - one per OS, assuming you don't just want Linux guests. The resultant guests are trivially movable between hypervisors modulo any abstracted paravirt type interfaces.
In my original thumb nail sketch of a solution I envisioned vhost-user daemons running in a broadly POSIX like environment. The interface to the daemon is fairly simple requiring only some mapped memory and some sort of signalling for events (on Linux this is eventfd). The idea was a stub binary would be responsible for any hypervisor specific setup and then launch a common binary to deal with the actual virtqueue requests themselves.
Since that original sketch we've seen an expansion in the sort of ways backends could be created. There is interest in encapsulating backends in RTOSes or unikernels for solutions like SCMI. There interest in Rust has prompted ideas of using the trait interface to abstract differences away as well as the idea of bare-metal Rust backends.
We have a card (STR-12) called "Hypercall Standardisation" which calls for a description of the APIs needed from the hypervisor side to support VirtIO guests and their backends. However we are some way off from that at the moment as I think we need to at least demonstrate one portable backend before we start codifying requirements. To that end I want to think about what we need for a backend to function.
Configuration
In the type-2 setup this is typically fairly simple because the host system can orchestrate the various modules that make up the complete system. In the type-1 case (or even type-2 with delegated service VMs) we need some sort of mechanism to inform the backend VM about key details about the system:
- where virt queue memory is in it's address space
- how it's going to receive (interrupt) and trigger (kick) events
- what (if any) resources the backend needs to connect to
Obviously you can elide over configuration issues by having static configurations and baking the assumptions into your guest images however this isn't scalable in the long term. The obvious solution seems to be extending a subset of Device Tree data to user space but perhaps there are other approaches?
Before any virtio transactions can take place the appropriate memory mappings need to be made between the FE guest and the BE guest. Currently the whole of the FE guests address space needs to be visible to whatever is serving the virtio requests. I can envision 3 approaches:
- BE guest boots with memory already mapped
This would entail the guest OS knowing where in it's Guest Physical Address space is already taken up and avoiding clashing. I would assume in this case you would want a standard interface to userspace to then make that address space visible to the backend daemon.
- BE guests boots with a hypervisor handle to memory
The BE guest is then free to map the FE's memory to where it wants in the BE's guest physical address space. To activate the mapping will require some sort of hypercall to the hypervisor. I can see two options at this point:
expose the handle to userspace for daemon/helper to trigger the mapping via existing hypercall interfaces. If using a helper you would have a hypervisor specific one to avoid the daemon having to care too much about the details or push that complexity into a compile time option for the daemon which would result in different binaries although a common source base.
expose a new kernel ABI to abstract the hypercall differences away in the guest kernel. In this case the userspace would essentially ask for an abstract "map guest N memory to userspace ptr" and let the kernel deal with the different hypercall interfaces. This of course assumes the majority of BE guests would be Linux kernels and leaves the bare-metal/unikernel approaches to their own devices.
Operation
The core of the operation of VirtIO is fairly simple. Once the vhost-user feature negotiation is done it's a case of receiving update events and parsing the resultant virt queue for data. The vhost-user specification handles a bunch of setup before that point, mostly to detail where the virt queues are and to set up FDs for memory and event communication. This is where the envisioned stub process would be responsible for getting the daemon up and ready to run. This is currently done inside a big VMM like QEMU but I suspect a modern approach would be to use the rust-vmm vhost crate. It would then either communicate with the kernel's abstracted ABI or be re-targeted as a build option for the various hypervisors.
One question is how to best handle notification and kicks. The existing vhost-user framework uses eventfd to signal the daemon (although QEMU is quite capable of simulating them when you use TCG). Xen has its own IOREQ mechanism. However latency is an important factor and having events go through the stub would add quite a lot.
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
-- Alex Bennée
Hi Matias,
On Thu, Aug 19, 2021 at 11:11:55AM +0200, Matias Ezequiel Vara Larsen wrote:
Hello Alex,
I can tell you my experience from working on a PoC (library) to allow the implementation of virtio-devices that are hypervisor/OS agnostic.
What hypervisor are you using for your PoC here?
I focused on two use cases:
1. A type-1 hypervisor in which the backend is running as a VM. This is an in-house hypervisor that does not support VMExits.
2. Linux user-space. In this case, the library is just used to communicate between threads. The goal of this use case is merely testing.
I have chosen virtio-mmio as the way to exchange information between the frontend and backend. I found it hard to synchronize the access to the virtio-mmio layout without VMExits. I had to add some extra bits to allow
Can you explain how MMIOs to registers in the virtio-mmio layout (which I think means a configuration space?) will be propagated to the BE?
the front-end and back-end to synchronize, which is required during the device-status initialization. These extra bits would not be needed in case the hypervisor supports VMExits, e.g., KVM.
Each guest has a memory region that is shared with the backend. This memory region is used by the frontend to allocate the io-buffers. This region also maps the virtio-mmio layout that is initialized by the backend. For the moment, this region is defined when the guest is created. One limitation is that the memory for io-buffers is fixed.
So in summary, you have a single memory region that is used for the virtio-mmio layout and io-buffers (I think they are for payload) and you assume that the region will be (at least for now) statically shared between FE and BE so that you can eliminate an 'mmap' every time you access the payload. Correct?
If so, it can be an alternative solution for the memory access issue, and a similar technique is used in some implementations:
- (Jailhouse's) ivshmem
- Arnd's fat virtqueue
In either case, however, you will have to allocate payload from the region and so you will see some impact on FE code (at least at some low level). (In ivshmem, dma_ops in the kernel is defined for this purpose.) Correct?
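To illustrate the FE-side impact, here is a toy bump allocator carving io-buffers out of one fixed shared region, which is roughly the job dma_ops ends up doing in the ivshmem case. The type, sizes and alignment policy are invented for the sketch.
```rust
/// Toy allocator handing out io-buffer space from a single fixed region
/// shared with the BE, illustrating why the FE's buffer allocation has to
/// be steered into that region.
struct SharedRegionAllocator {
    region: Vec<u8>, // stands in for the statically shared memory
    next: usize,
}

impl SharedRegionAllocator {
    fn new(size: usize) -> Self {
        SharedRegionAllocator { region: vec![0u8; size], next: 0 }
    }

    /// Allocate `len` bytes (8-byte aligned) or fail when the region is full;
    /// a fixed region means allocation can fail even if the OS has free RAM.
    fn alloc(&mut self, len: usize) -> Option<&mut [u8]> {
        let start = (self.next + 7) & !7;
        let end = start.checked_add(len)?;
        if end > self.region.len() {
            return None;
        }
        self.next = end;
        Some(&mut self.region[start..end])
    }
}

fn main() {
    let mut alloc = SharedRegionAllocator::new(64 * 1024);
    let buf = alloc.alloc(1500).expect("region exhausted");
    buf[0] = 0xAB;
    println!("allocated {} bytes from the shared region", buf.len());
}
```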
-Takahiro Akashi
At some point, the guest shall be able to balloon this region. Notifications between the frontend and the backend are implemented by using a hypercall. The hypercall mechanism and the memory allocation are abstracted away by a platform layer that exposes an interface that is hypervisor/OS agnostic.
I split the backend into a virtio-device driver and a backend driver. The virtio-device driver handles the virtqueues and the backend driver gets packets from the virtqueue for post-processing. For example, in the case of virtio-net, the backend driver would decide if the packet goes to the hardware or to another virtio-net device. The virtio-device drivers may be implemented in different ways, like by using a single thread, multiple threads, or one thread for all the virtio-devices.
In this PoC, I just tackled two very simple use-cases. These use-cases allowed me to extract some requirements for a hypervisor to support virtio.
Matias
Hello,
On Fri, Aug 20, 2021 at 03:05:58PM +0900, AKASHI Takahiro wrote:
Hi Matias,
On Thu, Aug 19, 2021 at 11:11:55AM +0200, Matias Ezequiel Vara Larsen wrote:
Hello Alex,
I can tell you my experience from working on a PoC (library) to allow the implementation of virtio-devices that are hypervisor/OS agnostic.
What hypervisor are you using for your PoC here?
I am using an in-house hypervisor, which is similar to Jailhouse.
I focused on two use cases:
1. A type-1 hypervisor in which the backend runs as a VM. This is an in-house hypervisor that does not support VMExits.
2. Linux user space. In this case, the library is just used to communicate between threads. The goal of this use case is merely testing.
I have chosen virtio-mmio as the way to exchange information between the frontend and backend. I found it hard to synchronize the access to the virtio-mmio layout without VMExits. I had to add some extra bits to allow
Can you explain how MMIOs to registers in the virtio-mmio layout (which I think means a configuration space?) will be propagated to the BE?
In this PoC, the BE guest is created with a fixed number of memory regions, one representing each device. The BE initializes these regions and then waits for the FEs to begin the initialization.
the front-end and back-end to synchronize, which is required during the device-status initialization. These extra bits would not be needed in case the hypervisor supports VMExits, e.g., KVM.
Each guest has a memory region that is shared with the backend. This memory region is used by the frontend to allocate the io-buffers. This region also maps the virtio-mmio layout that is initialized by the backend. For the moment, this region is defined when the guest is created. One limitation is that the memory for io-buffers is fixed.
So in summary, you have a single memory region that is used for the virtio-mmio layout and io-buffers (I think they are for payload) and you assume that the region will be (at least for now) statically shared between FE and BE so that you can avoid an 'mmap' every time the payload is accessed. Correct?
Yes, it is.
If so, it can be an alternative solution for the memory access issue, and a similar technique is used in some implementations:
- (Jailhouse's) ivshmem
- Arnd's fat virtqueue
In either case, however, you will have to allocate payload from the region and so you will see some impact on FE code (at least at some low level). (In ivshmem, dma_ops in the kernel is defined for this purpose.) Correct?
Yes, it is. The FE implements a sort of malloc() to manage the allocation of io-buffers from that memory region.
Thinking again about VMExits, I am not sure how this mechanism could be used when both the FE and the BE are VMs. The use of VMExits may require involving the hypervisor.
Matias
Hi Matias,
On Sat, Aug 21, 2021 at 04:08:20PM +0200, Matias Ezequiel Vara Larsen wrote:
Hello,
On Fri, Aug 20, 2021 at 03:05:58PM +0900, AKASHI Takahiro wrote:
Hi Matias,
On Thu, Aug 19, 2021 at 11:11:55AM +0200, Matias Ezequiel Vara Larsen wrote:
Hello Alex,
I can tell you my experience from working on a PoC (library) to allow the implementation of virtio-devices that are hypervisor/OS agnostic.
What hypervisor are you using for your PoC here?
I am using an in-house hypervisor, which is similar to Jailhouse.
I focused on two use cases:
1. A type-1 hypervisor in which the backend runs as a VM. This is an in-house hypervisor that does not support VMExits.
2. Linux user space. In this case, the library is just used to communicate between threads. The goal of this use case is merely testing.
I have chosen virtio-mmio as the way to exchange information between the frontend and backend. I found it hard to synchronize the access to the virtio-mmio layout without VMExits. I had to add some extra bits to allow
Can you explain how MMIOs to registers in the virtio-mmio layout (which I think means a configuration space?) will be propagated to the BE?
In this PoC, the BE guest is created with a fixed number of memory regions, one representing each device. The BE initializes these regions and then waits for the FEs to begin the initialization.
Let me ask in another way: when the FE writes to a register in the configuration space, say QueueSel, how is the BE notified of this event?
the front-end and back-end to synchronize, which is required during the device-status initialization. These extra bits would not be needed in case the hypervisor supports VMExits, e.g., KVM.
Each guest has a memory region that is shared with the backend. This memory region is used by the frontend to allocate the io-buffers. This region also maps the virtio-mmio layout that is initialized by the backend. For the moment, this region is defined when the guest is created. One limitation is that the memory for io-buffers is fixed.
So in summary, you have a single memory region that is used for the virtio-mmio layout and io-buffers (I think they are for payload) and you assume that the region will be (at least for now) statically shared between FE and BE so that you can avoid an 'mmap' every time the payload is accessed. Correct?
Yes, it is.
If so, it can be an alternative solution for the memory access issue, and a similar technique is used in some implementations:
- (Jailhouse's) ivshmem
- Arnd's fat virtqueue
In either case, however, you will have to allocate payload from the region and so you will see some impact on FE code (at least at some low level). (In ivshmem, dma_ops in the kernel is defined for this purpose.) Correct?
Yes, it is. The FE implements a sort of malloc() to manage the allocation of io-buffers from that memory region.
Thinking again about VMExits, I am not sure how this mechanism could be used when both the FE and the BE are VMs. The use of VMExits may require involving the hypervisor.
Maybe I misunderstand something. Are FE/BE not VMs in your PoC?
-Takahiro Akashi
On Mon, Aug 23, 2021 at 10:20:29AM +0900, AKASHI Takahiro wrote:
Hi Matias,
On Sat, Aug 21, 2021 at 04:08:20PM +0200, Matias Ezequiel Vara Larsen wrote:
Hello,
On Fri, Aug 20, 2021 at 03:05:58PM +0900, AKASHI Takahiro wrote:
Hi Matias,
On Thu, Aug 19, 2021 at 11:11:55AM +0200, Matias Ezequiel Vara Larsen wrote:
Hello Alex,
I can tell you my experience from working on a PoC (library) to allow the implementation of virtio-devices that are hypervisor/OS agnostic.
What hypervisor are you using for your PoC here?
I am using an in-house hypervisor, which is similar to Jailhouse.
I focused on two use cases:
1. A type-1 hypervisor in which the backend runs as a VM. This is an in-house hypervisor that does not support VMExits.
2. Linux user space. In this case, the library is just used to communicate between threads. The goal of this use case is merely testing.
I have chosen virtio-mmio as the way to exchange information between the frontend and backend. I found it hard to synchronize the access to the virtio-mmio layout without VMExits. I had to add some extra bits to allow
Can you explain how MMIOs to registers in the virtio-mmio layout (which I think means a configuration space?) will be propagated to the BE?
In this PoC, the BE guest is created with a fixed number of memory regions, one representing each device. The BE initializes these regions and then waits for the FEs to begin the initialization.
Let me ask in another way: when the FE writes to a register in the configuration space, say QueueSel, how is the BE notified of this event?
In my PoC, the BE is not notified when the FE writes to a register. For example, QueueSel is only used in one of the steps of the device-status configuration, and the BE is only notified when the FE is in that step. When the FE is setting up the vrings, it sets the address, sets QueueSel, and then blocks until the BE has read the values. The BE gets the values and resumes the FE, which moves to the next step.
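Spelled out as code, the handshake is roughly the following (field and flag names are invented for illustration; in the PoC the FE is actually blocked and resumed by the hypervisor rather than spinning):

  #include <stdatomic.h>
  #include <stdint.h>

  /* Per-step synchronization words (the "extra bits") in the shared layout */
  struct vq_setup {
      uint64_t desc_addr;     /* vring address chosen by the FE  */
      uint32_t queue_sel;     /* which queue the values refer to */
      atomic_uint fe_ready;   /* FE: "values are valid, go"      */
      atomic_uint be_done;    /* BE: "values consumed, carry on" */
  };

  /* FE side: publish the values, then wait until the BE has them */
  static void fe_setup_queue(struct vq_setup *s, uint32_t sel, uint64_t addr)
  {
      s->desc_addr = addr;
      s->queue_sel = sel;
      atomic_store(&s->fe_ready, 1);
      while (!atomic_load(&s->be_done))
          ;   /* in the PoC the FE is blocked and later resumed here */
  }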
the front-end and back-end to synchronize, which is required during the device-status initialization. These extra bits would not be needed in case the hypervisor supports VMExits, e.g., KVM.
Each guest has a memory region that is shared with the backend. This memory region is used by the frontend to allocate the io-buffers. This region also maps the virtio-mmio layout that is initialized by the backend. For the moment, this region is defined when the guest is created. One limitation is that the memory for io-buffers is fixed.
So in summary, you have a single memory region that is used for the virtio-mmio layout and io-buffers (I think they are for payload) and you assume that the region will be (at least for now) statically shared between FE and BE so that you can avoid an 'mmap' every time the payload is accessed. Correct?
Yes, it is.
If so, it can be an alternative solution for the memory access issue, and a similar technique is used in some implementations:
- (Jailhouse's) ivshmem
- Arnd's fat virtqueue
In either case, however, you will have to allocate payload from the region and so you will see some impact on FE code (at least at some low level). (In ivshmem, dma_ops in the kernel is defined for this purpose.) Correct?
Yes, it is. The FE implements a sort of malloc() to manage the allocation of io-buffers from that memory region.
Thinking again about VMExits, I am not sure how this mechanism could be used when both the FE and the BE are VMs. The use of VMExits may require involving the hypervisor.
Maybe I misunderstand something. Are FE/BE not VMs in your PoC?
Yes, both are VMs. I meant that if both are VMs AND a VMExit mechanism is used, such a mechanism would require the hypervisor to forward the traps. In my PoC, both are VMs BUT there is no VMExit mechanism.
Matias
Matias Ezequiel Vara Larsen <matiasevara@gmail.com> writes:
Hello Alex,
I can tell you my experience from working on a PoC (library) to allow the implementation of virtio-devices that are hypervisor/OS agnostic. I focused on two use cases:
1. A type-1 hypervisor in which the backend runs as a VM. This is an in-house hypervisor that does not support VMExits.
2. Linux user space. In this case, the library is just used to communicate between threads. The goal of this use case is merely testing.
I have chosen virtio-mmio as the way to exchange information between the frontend and backend. I found it hard to synchronize the access to the virtio-mmio layout without VMExits. I had to add some extra bits to allow the front-end and back-end to synchronize, which is required during the device-status initialization. These extra bits would not be needed in case the hypervisor supports VMExits, e.g., KVM.
The support for a vmexit seems rather fundamental to type-2 hypervisors (like KVM) as the VMM is intrinsically linked to a vCPU's run loop. This makes handling a condition like a bit of MMIO fairly natural to implement. For type-1 cases the line of execution between "guest accesses MMIO" and "something services that request" is a little trickier to pin down. Ultimately at that point you are relying on the hypervisor itself to make the scheduling decision to stop executing the guest and allow the backend to do its thing. We don't really want to expose the exact details about that as it probably varies a lot between hypervisors. However would a backend API semantic that expresses:
- guest has done some MMIO
- hypervisor has stopped execution of guest
- guest will be restarted when response conditions are set by backend
cover the needs of a virtio backend and could the userspace-facing portion of that be agnostic?
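One possible userspace-facing shape for that semantic (types and names are only a sketch, not a proposal for an actual ABI):

  #include <stdint.h>

  struct mmio_event {
      uint64_t addr;       /* guest physical address of the access  */
      uint32_t len;        /* access width in bytes                 */
      uint32_t is_write;
      uint64_t data;       /* value written, or space for the reply */
  };

  /* guest has done some MMIO and is stopped: block until we see it */
  int be_wait_mmio(int backend_fd, struct mmio_event *ev);

  /* set the response conditions; the hypervisor restarts the guest */
  int be_complete_mmio(int backend_fd, const struct mmio_event *ev);

The hypervisor-specific part would sit behind backend_fd (IOREQ, an ioctl on a char device, ...) while the loop around these two calls stays common across hypervisors.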
<snip>