Hi Akashi,
-----Original Message----- From: AKASHI Takahiro takahiro.akashi@linaro.org Sent: 17 August 2021 16:08 To: Wei Chen Wei.Chen@arm.com Cc: Oleksandr Tyshchenko olekstysh@gmail.com; Stefano Stabellini sstabellini@kernel.org; Alex Bennée alex.bennee@linaro.org; Stratos Mailing List stratos-dev@op-lists.linaro.org; virtio-dev@lists.oasis-open.org; Arnd Bergmann arnd.bergmann@linaro.org; Viresh Kumar viresh.kumar@linaro.org; Stefano Stabellini stefano.stabellini@xilinx.com; stefanha@redhat.com; Jan Kiszka jan.kiszka@siemens.com; Carl van Schaik cvanscha@qti.qualcomm.com; pratikp@quicinc.com; Srivatsa Vaddagiri vatsa@codeaurora.org; Jean-Philippe Brucker jean-philippe@linaro.org; Mathieu Poirier mathieu.poirier@linaro.org; Oleksandr Tyshchenko Oleksandr_Tyshchenko@epam.com; Bertrand Marquis Bertrand.Marquis@arm.com; Artem Mygaiev Artem_Mygaiev@epam.com; Julien Grall julien@xen.org; Juergen Gross jgross@suse.com; Paul Durrant paul@xen.org; Xen Devel xen-devel@lists.xen.org Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
Hi Wei, Oleksandr,
On Mon, Aug 16, 2021 at 10:04:03AM +0000, Wei Chen wrote:
Hi All,
Thanks to Stefano for linking my kvmtool-for-Xen proposal here. This proposal is still being discussed in the Xen and KVM communities. The main work is to decouple kvmtool from KVM so that other hypervisors can reuse its virtual device implementations.
In this case, we need to introduce an intermediate hypervisor abstraction layer in the VMM, which I think is very close to Stratos' virtio hypervisor-agnosticism work.
# My proposal[1] comes from my own idea and doesn't always represent
# Linaro's view on this subject nor reflect Alex's concerns. Nevertheless,
your idea and my proposal seem to share the same background. Both have a similar goal and currently start, at first, with Xen and are based on kvm-tool. (Actually, my work is derived from EPAM's virtio-disk, which is also based on kvm-tool.)
In particular, the abstraction of hypervisor interfaces uses the same set of interfaces (your "struct vmm_impl" and my "RPC interfaces"). This is not a coincidence, as we both share the same origin as I said above. And so we will also share the same issues. One of them is the way of sharing/mapping the FE's memory. There is some trade-off between portability and the performance impact, so we can discuss that topic here in this ML, too. (See Alex's original email, too.)
Yes, I agree.
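(As a rough illustration only, and not code from either proposal: such a shared set of abstracted operations might look something like the C sketch below. All names here are invented.)

/*
 * Illustrative only: a hypothetical table of hypervisor-specific
 * operations that a portable, kvmtool-derived backend could be
 * written against. Names are made up for this sketch.
 */
#include <stddef.h>
#include <stdint.h>

struct vmm_ops {
    /* map `size` bytes of frontend guest memory at guest physical
     * address `gpa` into the backend's address space */
    void *(*map_guest_mem)(uint32_t fe_id, uint64_t gpa, size_t size);
    void  (*unmap_guest_mem)(void *va, size_t size);

    /* notification plumbing: inject an interrupt into the frontend,
     * and register a callback for "kick" events from the frontend */
    int   (*inject_irq)(uint32_t fe_id, uint32_t irq);
    int   (*register_kick_handler)(uint32_t fe_id,
                                   void (*handler)(void *opaque),
                                   void *opaque);
};

/* Each hypervisor would provide its own instance, e.g.: */
extern const struct vmm_ops xen_vmm_ops;   /* backed by IOREQ/event channels */
extern const struct vmm_ops kvm_vmm_ops;   /* backed by ioeventfd/irqfd */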
On the other hand, my approach aims to create a "single-binary" solution in which the same binary of a BE VM could run on any hypervisor. It is somehow similar to your "proposal-#2" in [2], but in my solution all the hypervisor-specific code would be put into another entity (VM), named "virtio-proxy", and the abstracted operations are served via RPC. (In this sense, the BE is hypervisor-agnostic but might have an OS dependency.) But I know that we need to discuss whether this is a requirement in the Stratos project or not. (Maybe not.)
Sorry, I haven't had time to finish reading your virtio-proxy completely (I will do it ASAP). But from your description, it seems we need a 3rd VM between FE and BE? My concern is that, if my assumption is right, will it increase the latency in the data transport path? Even if we're using some lightweight guest like an RTOS or a unikernel, the extra hop may still add overhead.
Specifically speaking about kvm-tool, I have a concern about its license terms; targeting different hypervisors and different OSs (which I assume includes RTOS's), the resultant library should be permissively licensed, and GPL for kvm-tool might be an issue. Any thoughts?
Yes. If a user wants to implement a FreeBSD device model but the virtio library is GPL, then the GPL would be a problem. If we have another good candidate, I am open to it.
-Takahiro Akashi
[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000548.html
[2] https://marc.info/?l=xen-devel&m=162373754705233&w=2
From: Oleksandr Tyshchenko olekstysh@gmail.com Sent: 14 August 2021 23:38 To: AKASHI Takahiro takahiro.akashi@linaro.org; Stefano Stabellini
Cc: Alex Bennée alex.bennee@linaro.org; Stratos Mailing List stratos-dev@op-lists.linaro.org; virtio-dev@lists.oasis-open.org; Arnd Bergmann arnd.bergmann@linaro.org; Viresh Kumar viresh.kumar@linaro.org; Stefano Stabellini stefano.stabellini@xilinx.com; stefanha@redhat.com; Jan Kiszka jan.kiszka@siemens.com; Carl van Schaik cvanscha@qti.qualcomm.com; pratikp@quicinc.com; Srivatsa Vaddagiri vatsa@codeaurora.org; Jean-Philippe Brucker jean-philippe@linaro.org; Mathieu Poirier mathieu.poirier@linaro.org; Wei Chen Wei.Chen@arm.com; Oleksandr Tyshchenko Oleksandr_Tyshchenko@epam.com; Bertrand Marquis Bertrand.Marquis@arm.com; Artem Mygaiev Artem_Mygaiev@epam.com; Julien Grall julien@xen.org; Juergen Gross jgross@suse.com; Paul Durrant paul@xen.org; Xen Devel xen-devel@lists.xen.org
Subject: Re: Enabling hypervisor agnosticism for VirtIO backends
Hello, all.
Please see some comments below. And sorry for the possible format issues.
On Wed, Aug 11, 2021 at 9:27 AM AKASHI Takahiro <takahiro.akashi@linaro.org> wrote:
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote:
CCing people working on Xen+VirtIO and IOREQs. Not trimming the original email to let them read the full context.
My comments below are related to a potential Xen implementation, not because it is the only implementation that matters, but because it is the one I know best.
Please note that my proposal (and hence the working prototype)[1] is based on Xen's virtio implementation (i.e. IOREQ) and particularly EPAM's virtio-disk application (backend server). It has been, I believe, well generalized but is still a bit biased toward this original design.
So I hope you like my approach :)
[1] https://op-lists.linaro.org/pipermail/stratos-dev/2021-August/000546.html
Let me take this opportunity to explain a bit more about my approach below.
Also, please see this relevant email thread: https://marc.info/?l=xen-devel&m=162373754705233&w=2
On Wed, 4 Aug 2021, Alex Bennée wrote:
Hi,
One of the goals of Project Stratos is to enable hypervisor agnostic backends so we can enable as much re-use of code as possible and avoid repeating ourselves. This is the flip side of the front end where multiple front-end implementations are required - one per OS, assuming you don't just want Linux guests. The resultant guests are trivially movable between hypervisors modulo any abstracted paravirt type interfaces.
In my original thumbnail sketch of a solution I envisioned vhost-user daemons running in a broadly POSIX-like environment. The interface to the daemon is fairly simple, requiring only some mapped memory and some sort of signalling for events (on Linux this is eventfd). The idea was that a stub binary would be responsible for any hypervisor-specific setup and then launch a common binary to deal with the actual virtqueue requests themselves.
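(A minimal sketch of that signalling piece, assuming Linux and eventfd as described above; the split between a stub launcher and this common loop is only notional here, and error handling is trimmed.)

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

static void process_virtqueue(void)
{
    /* placeholder: parse descriptors from the mapped virtqueue memory */
    printf("kick received, draining virtqueue\n");
}

int main(void)
{
    /* in reality the kick fd would be handed over by the stub binary */
    int kick_fd = eventfd(0, 0);
    uint64_t n;

    for (;;) {
        /* blocks until the frontend (or its proxy) kicks us */
        if (read(kick_fd, &n, sizeof(n)) == sizeof(n))
            process_virtqueue();
    }
    return 0;
}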
Since that original sketch we've seen an expansion in the sort of ways backends could be created. There is interest in encapsulating backends in RTOSes or unikernels for solutions like SCMI. The interest in Rust has prompted ideas of using the trait interface to abstract differences away, as well as the idea of bare-metal Rust backends.
We have a card (STR-12) called "Hypercall Standardisation" which calls for a description of the APIs needed from the hypervisor side to support VirtIO guests and their backends. However we are some way off from that at the moment as I think we need to at least demonstrate one portable backend before we start codifying requirements. To that end I want to think about what we need for a backend to function.
Configuration
In the type-2 setup this is typically fairly simple because the host system can orchestrate the various modules that make up the complete system. In the type-1 case (or even type-2 with delegated service VMs) we need some sort of mechanism to inform the backend VM about key details about the system:
- where virt queue memory is in its address space
- how it's going to receive (interrupt) and trigger (kick) events
- what (if any) resources the backend needs to connect to
Obviously you can elide over configuration issues by having static configurations and baking the assumptions into your guest images, however this isn't scalable in the long term. The obvious solution seems to be extending a subset of Device Tree data to user space, but perhaps there are other approaches?
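(Whatever the transport ends up being, whether a Device Tree subset, Xenstore or a command line, the information itself is small. The struct below is just an invented illustration of the three items listed above, not a proposed format.)

#include <stdint.h>

/* Sketch only: the minimum a backend VM would need to be told.
 * Field names are invented for this example. */
struct be_device_config {
    uint32_t fe_id;            /* which frontend VM to serve */
    uint64_t virtq_gpa;        /* where the virtqueue memory sits in the
                                  frontend's (or pre-shared) address space */
    uint64_t virtq_size;
    uint32_t notify_irq;       /* how the BE injects "used buffer" events */
    uint32_t kick_event;       /* how the BE receives "kick" events */
    const char *resource;      /* e.g. backing storage path for virtio-blk */
    int read_only;
};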
Before any virtio transactions can take place the appropriate memory mappings need to be made between the FE guest and the BE guest. Currently the whole of the FE guest's address space needs to be visible to whatever is serving the virtio requests. I can envision 3 approaches:
- BE guest boots with memory already mapped
This would entail the guest OS knowing where in its Guest Physical Address space is already taken up and avoiding clashing. I would assume in this case you would want a standard interface to userspace to then make that address space visible to the backend daemon.
Yet another way here is that we would have well-known "shared memory" between VMs. I think that Jailhouse's ivshmem gives us good insights on this matter and that it can even be an alternative for a hypervisor-agnostic solution. (Please note that memory regions in ivshmem appear as a PCI device and can be mapped locally.)
I want to add this shared memory aspect to my virtio-proxy, but the resultant solution would eventually look similar to ivshmem.
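(For what it's worth, the ivshmem-style mapping on the Linux side is typically just an mmap of the PCI BAR that carries the shared region. The sketch below assumes BAR2 holds it, as in the usual ivshmem layout, and the PCI address is a placeholder.)

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* placeholder PCI address; resource2 maps BAR2, the shared-memory BAR */
    const char *bar = "/sys/bus/pci/devices/0000:00:04.0/resource2";
    int fd = open(bar, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    void *shm = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) { perror("mmap"); return 1; }

    /* virtqueues/buffers would be laid out inside this region
     * by a convention agreed between FE and BE */
    printf("shared region mapped, %ld bytes\n", (long)st.st_size);

    munmap(shm, st.st_size);
    close(fd);
    return 0;
}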
- BE guest boots with a hypervisor handle to memory
The BE guest is then free to map the FE's memory to where it wants in the BE's guest physical address space.
I cannot see how this could work for Xen. There is no "handle" to give to the backend if the backend is not running in dom0. So for Xen I think the memory has to be already mapped.
In Xen's IOREQ solution (virtio-blk), the following information is expected to be exposed to the BE via Xenstore (I know that this is a tentative approach though):
- the start address of configuration space
- interrupt number
- file path for backing storage
- read-only flag
And the BE server has to call a particular hypervisor interface to map the configuration space.
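(A sketch of what fetching those four items with libxenstore could look like. The Xenstore paths used here are invented for illustration and are not the layout from the PoC.)

#include <stdio.h>
#include <stdlib.h>
#include <xenstore.h>

static char *read_key(struct xs_handle *xs, const char *path)
{
    unsigned int len;
    return xs_read(xs, XBT_NULL, path, &len);   /* caller frees */
}

int main(void)
{
    struct xs_handle *xs = xs_open(0);
    if (!xs) { perror("xs_open"); return 1; }

    /* hypothetical key names */
    char *base  = read_key(xs, "device/vdisk/0/base");      /* config space GPA */
    char *irq   = read_key(xs, "device/vdisk/0/irq");       /* interrupt number */
    char *image = read_key(xs, "device/vdisk/0/filename");  /* backing storage */
    char *ro    = read_key(xs, "device/vdisk/0/readonly");

    printf("base=%s irq=%s image=%s ro=%s\n", base, irq, image, ro);

    free(base); free(irq); free(image); free(ro);
    xs_close(xs);
    return 0;
}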
Yes, Xenstore was chosen as a simple way to pass configuration info to the backend running in a non-toolstack domain.
I remember there was a wish to avoid using Xenstore in the Virtio backend itself if possible, so for a non-toolstack domain this could be done by adjusting devd (the daemon that listens for devices and launches backends) to read the backend configuration from Xenstore anyway and pass it to the backend via command line arguments.
Yes, in the current PoC code we're using Xenstore to pass device configuration.
We also designed a static device configuration parse method for Dom0less or other scenarios that don't have the Xen toolstack. Yes, it's from the device model command line or a config file.
But, if ...
In my approach (virtio-proxy), all that Xen (or hypervisor)-specific stuff is contained in virtio-proxy, yet another VM, to hide all the details.
... the solution for how to overcome that is already found and proven to work, then even better.
# My point is that a "handle" is not mandatory for executing mapping, and the mapping is probably done by the toolstack (also see below.)
Or we would have to invent a new Xen hypervisor interface and Xen virtual machine privileges to allow this kind of mapping.
If we run the backend in Dom0 then we have no problems of course.
One of the difficulties on Xen that I found in my approach is that calling such hypervisor interfaces (registering IOREQ, mapping memory) is only allowed on BE servers themselves, and so we will have to extend those interfaces. This, however, will raise some concerns about security and privilege distribution, as Stefan suggested.
We also faced policy-related issues with a Virtio backend running in a domain other than Dom0 in the "dummy" xsm mode. In our target system we run the backend in a driver domain (we call it DomD) where the underlying H/W resides. We trust it, so we wrote policy rules (to be used in the "flask" xsm mode) to provide it with a little bit more privileges than a simple DomU has. Now it is permitted to issue device-model, resource and memory-mapping calls, etc.
To activate the mapping will require some sort of hypercall to the hypervisor. I can see two options at this point:
- expose the handle to userspace for a daemon/helper to trigger the mapping via existing hypercall interfaces. If using a helper you would have a hypervisor-specific one to avoid the daemon having to care too much about the details, or push that complexity into a compile-time option for the daemon which would result in different binaries although a common source base.
- expose a new kernel ABI to abstract the hypercall differences away in the guest kernel. In this case the userspace would essentially ask for an abstract "map guest N memory to userspace ptr" and let the kernel deal with the different hypercall interfaces. This of course assumes the majority of BE guests would be Linux kernels and leaves the bare-metal/unikernel approaches to their own devices.
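(On Xen today the first option roughly corresponds to a privileged userspace helper using libxenforeignmemory. A minimal sketch, with placeholder domid and frame numbers, might look like the following.)

#include <stdio.h>
#include <sys/mman.h>
#include <xenctrl.h>            /* for xen_pfn_t */
#include <xenforeignmemory.h>

int main(void)
{
    xenforeignmemory_handle *fmem = xenforeignmemory_open(NULL, 0);
    if (!fmem) { perror("xenforeignmemory_open"); return 1; }

    uint32_t fe_domid = 1;                      /* frontend domain id (placeholder) */
    xen_pfn_t pfns[1] = { 0x80000 };            /* guest frame to map (placeholder) */
    int err[1];

    void *va = xenforeignmemory_map(fmem, fe_domid, PROT_READ | PROT_WRITE,
                                    1, pfns, err);
    if (!va || err[0]) { fprintf(stderr, "map failed\n"); return 1; }

    /* ... access the frontend page through va ... */

    xenforeignmemory_unmap(fmem, va, 1);
    xenforeignmemory_close(fmem);
    return 0;
}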
Operation
The core of the operation of VirtIO is fairly simple. Once the vhost-user feature negotiation is done it's a case of receiving update events and parsing the resultant virt queue for data. The vhost-user specification handles a bunch of setup before that point, mostly to detail where the virt queues are and set up FDs for memory and event communication. This is where the envisioned stub process would be responsible for getting the daemon up and ready to run. This is currently done inside a big VMM like QEMU, but I suspect a modern approach would be to use the rust-vmm vhost crate. It would then either communicate with the kernel's abstracted ABI or be re-targeted as a build option for the various hypervisors.
One thing I mentioned before to Alex is that Xen doesn't have VMMs the way they are typically envisioned and described in other environments. Instead, Xen has IOREQ servers. Each of them connects independently to Xen via the IOREQ interface. E.g. today multiple QEMUs could be used as emulators for a single Xen VM, each of them connecting to Xen independently via the IOREQ interface.
The component responsible for starting a daemon and/or setting up shared interfaces is the toolstack: the xl command and the libxl/libxc libraries.
I think that VM configuration management (or orchestration in Stratos jargon?) is a subject to debate in parallel. Otherwise, is there any good assumption to avoid it right now?
Oleksandr and others I CCed have been working on ways for the toolstack to create virtio backends and set up memory mappings. They might be able to provide more info on the subject. I do think we are missing a way to provide the configuration to the backend and anything else that the backend might require to start doing its job.
Yes, some work has been done for the toolstack to handle Virtio MMIO devices in general and Virtio block devices in particular. However, it has not been upstreamed yet.
Updated patches are on review now: https://lore.kernel.org/xen-devel/1621626361-29076-1-git-send-email-olekstysh@gmail.com/
There is an additional (also important) activity to improve/fix foreign memory mapping on Arm which I am also involved in. The foreign memory mapping is proposed to be used for Virtio backends (device emulators) if there is a need to run a guest OS completely unmodified.
Of course, the more secure way would be to use grant memory mapping. Briefly, the main difference between them is that with foreign mapping the backend can map any guest memory it wants to map, but with grant mapping it is allowed to map only what was previously granted by the frontend.
So, there might be a problem if we want to pre-map some guest memory in advance or to cache mappings in the backend in order to improve performance (because mapping/unmapping guest pages on every request requires a lot of back and forth to Xen plus P2M updates). In a nutshell, currently, in order to map a guest page into the backend address space we need to steal a real physical page from the backend domain. So, with the said optimizations we might end up with no free memory in the backend domain (see XSA-300). What we try to achieve is to not waste real domain memory at all, by providing safe not-yet-allocated (so unused) address space for the foreign (and grant) pages to be mapped into; this enabling work implies Xen and Linux (and likely DTB bindings) changes. However, as it turned out, for this to work in a proper and safe way some prerequisite work needs to be done.
You can find the related Xen discussion at: https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-olekstysh@gmail.com/
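(For comparison with the foreign-mapping sketch earlier in the thread, the grant-based path from userspace usually goes through libxengnttab. In the sketch below the domid and grant reference are placeholders that would really come from the frontend via the transport.)

#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <xengnttab.h>

int main(void)
{
    xengnttab_handle *xgt = xengnttab_open(NULL, 0);
    if (!xgt) { perror("xengnttab_open"); return 1; }

    uint32_t fe_domid = 1;      /* frontend domain id (placeholder) */
    uint32_t gref = 42;         /* grant reference issued by the frontend (placeholder) */

    /* only succeeds for pages the frontend has actually granted */
    void *va = xengnttab_map_grant_ref(xgt, fe_domid, gref,
                                       PROT_READ | PROT_WRITE);
    if (!va) { fprintf(stderr, "grant map failed\n"); return 1; }

    /* ... use the page ... */

    xengnttab_unmap(xgt, va, 1);
    xengnttab_close(xgt);
    return 0;
}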
One question is how to best handle notification and kicks. The existing vhost-user framework uses eventfd to signal the daemon (although QEMU is quite capable of simulating them when you use TCG). Xen has its own IOREQ mechanism. However latency is an important factor and having events go through the stub would add quite a lot.
Yeah I think, regardless of anything else, we want the backends to connect directly to the Xen hypervisor.
In my approach,
a) BE -> FE: interrupts triggered by the BE calling a hypervisor interface via virtio-proxy
b) FE -> BE: MMIO to config raises events (in event channels), which are converted to a callback to the BE via virtio-proxy
(Xen's event channel is internally implemented by interrupts.)
I don't know what "connect directly" means here, but sending interrupts to the opposite side would be the most efficient. Ivshmem, I suppose, takes this approach by utilizing PCI's MSI-X mechanism.
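(To make the FE -> BE direction a bit more concrete on Xen: a backend, or virtio-proxy acting for it, can block on an event channel with libxenevtchn and turn pending events into a callback, roughly as sketched below. The domid and remote port are placeholders.)

#include <stdio.h>
#include <xenevtchn.h>

static void be_callback(evtchn_port_t port)
{
    printf("notification on port %u, handle the IOREQ/virtqueue\n", port);
}

int main(void)
{
    xenevtchn_handle *xce = xenevtchn_open(NULL, 0);
    if (!xce) { perror("xenevtchn_open"); return 1; }

    /* bind to the event channel advertised by the hypervisor/toolstack */
    evtchn_port_t local = xenevtchn_bind_interdomain(xce, /* fe_domid */ 1,
                                                     /* remote port */ 5);

    for (;;) {
        evtchn_port_t pending = xenevtchn_pending(xce);   /* blocks */
        if (pending == local)
            be_callback(pending);
        xenevtchn_unmask(xce, pending);
    }
    return 0;
}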
Agree that MSI would be more efficient than SPI... At the moment, in order to notify the frontend, the backend issues a specific device-model call asking Xen to inject a corresponding SPI into the guest.
Could we consider the kernel internally converting IOREQ messages from the Xen hypervisor to eventfd events? Would this scale with other kernel hypercall interfaces?
So any thoughts on what directions are worth experimenting with?
One option we should consider is for each backend to connect to Xen via the IOREQ interface. We could generalize the IOREQ interface and make it hypervisor agnostic. The interface is really trivial and easy to add.
As I said above, my proposal does the same thing that you mentioned here :) The difference is that I do call hypervisor interfaces via virtio-proxy.
The only Xen-specific part is the notification mechanism, which is an event channel. If we replaced the event channel with something else the interface would be generic. See: https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52
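(For readers without the header handy, the structure at that link is roughly the following; this is a paraphrase from memory, so please treat the linked ioreq.h as the authoritative layout.)

#include <stdint.h>

struct ioreq {
    uint64_t addr;          /* physical address of the access */
    uint64_t data;          /* data to write / data read (or a pointer to it) */
    uint32_t count;         /* number of repeats (rep prefixes) */
    uint32_t size;          /* access size in bytes */
    uint32_t vp_eport;      /* event channel for notifications to/from the device model */
    uint16_t _pad0;
    uint8_t  state:4;       /* request lifecycle state */
    uint8_t  data_is_ptr:1; /* if set, `data` is a guest paddr of the real data */
    uint8_t  dir:1;         /* read or write */
    uint8_t  df:1;
    uint8_t  _pad1:1;
    uint8_t  type;          /* PIO, copy (MMIO), PCI config, ... */
};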
I don't think that translating IOREQs to eventfd in the kernel is a good idea: it feels like it would be extra complexity, and the kernel shouldn't be involved as this is a backend-hypervisor interface.
Given that we may want to implement the BE as a bare-metal application, as I did on Zephyr, I don't think that the translation would be a big issue, especially on RTOS's. It will be some kind of abstraction layer for interrupt handling (or nothing but a callback mechanism).
Also, eventfd is very Linux-centric and we are trying to design an interface that could work well for RTOSes too. If we want to do something different, both OS-agnostic and hypervisor-agnostic, perhaps we could design a new interface. One that could be implementable in the Xen hypervisor itself (like IOREQ) and of course any other hypervisor too.
There is also another problem. IOREQ is probably not the only interface needed. Have a look at https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need an interface for the backend to inject interrupts into the frontend? And if the backend requires dynamic memory mappings of frontend pages, then we would also need an interface to map/unmap domU pages.
My proposal document might help here; all the interfaces required for virtio-proxy (or hypervisor-related interfaces) are listed as RPC protocols :)
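(Purely as an illustrative guess, not taken from the proposal document: the RPC operation set implied by this thread might be enumerated along these lines, with all names invented here.)

#include <stdint.h>

enum virtio_proxy_rpc_op {
    VPROXY_RPC_REGISTER_BACKEND,   /* roughly: register an IOREQ server */
    VPROXY_RPC_MAP_GUEST_MEM,      /* map frontend pages */
    VPROXY_RPC_UNMAP_GUEST_MEM,
    VPROXY_RPC_INJECT_IRQ,         /* BE -> FE notification */
    VPROXY_RPC_WAIT_EVENT,         /* FE -> BE kick / config access */
};

struct vproxy_rpc_msg {
    uint32_t op;                   /* enum virtio_proxy_rpc_op */
    uint32_t fe_id;
    uint64_t args[4];              /* op-specific arguments */
};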
These interfaces are a lot more problematic than IOREQ: IOREQ is tiny and self-contained. It is easy to add anywhere. A new interface to inject interrupts or map pages is more difficult to manage because it would require changes scattered across the various emulators.
Exactly. I am not confident yet that my approach will also apply to hypervisors other than Xen. Technically, yes, but whether people can accept it or not is a different matter.
Thanks, -Takahiro Akashi
-- Regards,
Oleksandr Tyshchenko