Hi Alex,
On Mon, Apr 12, 2021 at 08:34:54AM +0000, Alex Bennée via Stratos-dev wrote:
Alex Bennée via Stratos-dev stratos-dev@op-lists.linaro.org writes:
Hi All,
We've been discussing various ideas for Stratos in and around STR-7 (common virtio library). I'd originally de-emphasised the STR-7 work because I wasn't sure if this was duplicate effort given we already had libvhost-user as well as interest in rust-vmm for portable backends in user-space. However we have seen from the Windriver hypervisor-less virtio, NXP's Zephyr/Jailhouse and the requirements for the SCMI server that there is a use-case for a small, liberally licensed C library that is suitable for embedding in lightweight backends without a full Linux stack behind it. These workloads would run in either simple command loops, RTOSes or Unikernels.
Given the multiple interested parties I'm hoping we have enough people who can devote time to collaborate on the project to make the following realistic over the next cycle and culminate in the following demo in 6 months:
Components
portable virtio backend library
- source based library (so you include directly in your project)
- liberally licensed (Apache? to facilitate above)
- tailored for non-POSIX, limited resource setups
- e.g. avoid malloc/free, provide abstraction hooks where needed
- not assume OS facilities (so embeddable in RTOS or Unikernel)
- static virtio configuration supplied from outside library (location of queues etc)
- hypervisor agnostic
- provide a hook to signal when queues need processing
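As a concrete sketch of the bullet points above (all names here are illustrative; nothing below is an existing API), the embedder would hand the library a static configuration plus a small table of hooks, and the library itself would never touch malloc or any OS service:

  /* Sketch only: illustrative names, not an existing API. */
  #include <stdint.h>
  #include <stddef.h>

  /* Static configuration handed to the library by the embedder
   * (addresses fixed at integration time, e.g. from a DTB or a
   * linker script). */
  struct virtio_be_config {
      uintptr_t mmio_base;     /* guest-visible virtio-mmio register block */
      uintptr_t shmem_base;    /* shared memory holding the virtqueues     */
      size_t    shmem_size;
      uint32_t  queue_num_max; /* maximum queue depth we will accept       */
  };

  /* Hooks the embedding RTOS/unikernel/bare-metal loop must provide;
   * the library calls out through these instead of assuming an OS. */
  struct virtio_be_ops {
      /* ring a doorbell so the front-end (via the hypervisor/VMM)
       * sees that used buffers are available */
      void (*notify_frontend)(void *opaque, uint16_t queue_idx);
      /* translate a guest physical address from a descriptor into a
       * pointer the back-end can dereference */
      void *(*gpa_to_va)(void *opaque, uint64_t gpa, size_t len);
      /* memory barrier, as we cannot assume any particular CPU/OS API */
      void (*mb)(void);
  };

  struct virtio_be_dev; /* opaque, allocated by the embedder, not the library */

  /* Called by the embedder when its fixed IRQ/event fires to say
   * "queues need processing". */
  int virtio_be_handle_kick(struct virtio_be_dev *dev, uint16_t queue_idx);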
Following on from a discussion with Vincent and Akashi-san last week we need to think more about how to make this hypervisor agnostic. There will always be some code that has to live outside the library but if it ends up being the same amount again what have we gained?
I suspect this should be a from scratch implementation but it's certainly worth investigating the BSD implementation as Windriver have suggested.
SCMI server
This is the work product of STR-4, embedded in an RTOS. I'm suggesting Zephyr makes the most sense here given the previous work done by Peng and Akashi-san but I guess an argument could be made for a unikernel. I would suggest demonstrating portability to a unikernel would be a stretch goal for the demo.
The server would be *build* time selectable to deploy either in a Jailhouse partition or a Xen Dom U deployment. Either way there will need to be an out-of-band communication of location of virtual queues and memory maps - I assume via DTB.
From our discussion last week Zephyr's DTB support is very limited and not designed to cope with dynamic setups. So rather than having two build-time selectable configurations we should opt for a single binary with a fixed expectation of where the remote guest's memory and virtqueues will exist in its memory space.
I'm still a bit skeptical about the "single binary with a fixed expectation" concept.
- Is it feasible to expect that all the hypervisors would configure a BE domain in the same way (in respect of assigning a memory region or an interrupt number)?
- what if we want a BE domain to
  - provide different types of virtio devices
  - support more than one frontend domain
  at the same time?
How can a single binary without the ability of dynamic configuration deal with those requirements?
It will then be up to the VMM setting things up to ensure everything is mapped in the appropriate place in the RTOS memory map. There would also be a fixed IRQ map for signalling to the RTOS when things change and a fixed doorbell for signalling the other way.
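As a strawman (every address, IRQ and doorbell value below is invented purely for illustration), that fixed expectation could be captured in a small board-style header which the VMM on each hypervisor has to honour when it constructs the BE domain:

  /* Strawman layout - all values are assumptions for discussion, not
   * anything a hypervisor currently provides. */
  #define SCMI_BE_SHMEM_BASE  0x48000000UL  /* window onto FE guest memory  */
  #define SCMI_BE_SHMEM_SIZE  0x00200000UL
  #define SCMI_BE_VQ_OFFSET   0x00010000UL  /* virtqueues inside the window */
  #define SCMI_BE_KICK_IRQ    48            /* FE -> BE "queues changed"    */
  #define SCMI_BE_DOORBELL    0x49000000UL  /* trapped write, BE -> FE      */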
We should think of two different phases:
1) virtio setup/negotiation via MMIO configuration space
2) (virtio device specific) operation via virtqueue
Anyway, the signaling mechanism can be different from hypervisor to hypervisor; on Xen, for example,
- the notification of MMIO's to the configuration space by FE will be trapped and delivered via an event channel + a dedicated "shared IO page"
- the notification of virtqueue update from BE to FE will be done via another event channel, not by interrupt.
Another topic is "virtio device specific configuration parameters," for instance, a file path as backend storage for a virtio block. We might need an out-of-band (side-band?) communication channel to feed that information to a BE domain. (For Xen, xenstore is used for this purpose in EPAM's virtio-disk implementation.)
I'm unfamiliar with the RTOS build process but I guess this would be a single git repository with the RTOS and glue code and git sub-projects for the virtio and scmi libraries?
I think that Zephyr's build process (cmake) allows us to import a library from an external repository.
Deployments
To demonstrate portability we would have:
- Xen hypervisor
- Dom0 with Linux/xen tools
- DomU with Linux with a virtio-scmi front-end
- DomU with RTOS/SCMI server with virtio-scmi back-end
The Dom0 in this case is just for convenience of the demo as we don't currently have a fully working dom0-less setup. The key communication is between the two DomU guests.
However the Dom0 will also need the glue code to set up the communication and memory mapping between the two DomU guests.
I initially thought so, but after looking into the Xen APIs, I found that we have to make IOREQ-related hypervisor calls directly on a BE domain. At least under the current implementation, dom0 cannot call them on behalf of a BE domain.
This could happily link with the existing Xen library for setting up the guest table mappings.
- Jailhouse
- Linux kernel partition with virtio-scmi front-end
- RTOS/SCMI server partition with a virtio-scmi back-end
The RTOS/SCMI server would be the same binary blob as for Xen. Again some glue setup code would be required. I'm still unsure on how this would work for Jailhouse so if we don't have any Jailhouse expertise joining the project we could do this with KVM instead.
- Linux/KVM host
- setup code in main host (kvmtool/QEMU launch)
- KVM guest with Linux with a virtio-scmi front-end
- KVM guest with RTOS/SCMI server with virtio-scmi back-end
The easiest way of implementing a BE for KVM is to utilize the vhost-user library, but please note that this library internally uses socket(AF_UNIX), eventfd() and mmap(), which are in some sense hypervisor-specific interfaces given that Linux works as a type-2 hypervisor :) Then it's not quite straightforward to port it to an RTOS like Zephyr and I don't think a single binary would work on both Xen and KVM.
-Takahiro Akashi
This is closer to Windriver's hypervisor-less virtio deployment as Jailhouse is not a "proper" hypervisor in this case, just a way of partitioning up the resources. There will need to be some way for the kernel and server partitions to signal each other when queues are updated.
Platform
We know we have working Xen on Synquacer and Jailhouse on the iMX. Should we target those as well as a QEMU -M virt for those who wish to play without hardware?
Stretch Goals
Integrate Arnd's fat virtqueues
Hopefully this will be ready soon enough in the cycle that we can add this to the library and prototype the minimal memory cross section.
This is dependent on having something at least sketched out early in the cycle. It would allow us to simplify the shared memory mapping to just plain virtqueues.
Port the server/library to another RTOS/unikernel
This would demonstrate the core code hasn't grown any assumptions about what it is running in.
Run the server blob on another hypervisor
Running in KVM is probably boring at this point. Maybe investigate having it in Hafnium? Or in an R-profile safety island setup?
So what do people think? Thoughts? Comments? Volunteers?
--
Alex Bennée
--
Stratos-dev mailing list
Stratos-dev@op-lists.linaro.org
https://op-lists.linaro.org/mailman/listinfo/stratos-dev
AKASHI Takahiro takahiro.akashi@linaro.org writes:
Hi Alex,
On Mon, Apr 12, 2021 at 08:34:54AM +0000, Alex Bennée via Stratos-dev wrote:
Alex Bennée via Stratos-dev stratos-dev@op-lists.linaro.org writes:
Hi All,
We've been discussing various ideas for Stratos in and around STR-7 (common virtio library). I'd originally de-emphasised the STR-7 work because I wasn't sure if this was duplicate effort given we already had libvhost-user as well as interest in rust-vmm for portable backends in user-space. However we have seen from the Windriver hypervisor-less virtio, NXP's Zephyr/Jailhouse and the requirements for the SCMI server that there is a use-case for a small, liberally licensed C library that is suitable for embedding in lightweight backends without a full Linux stack behind it. These workloads would run in either simple command loops, RTOSes or Unikernels.
Given the multiple interested parties I'm hoping we have enough people who can devote time to collaborate on the project to make the following realistic over the next cycle and culminate in the following demo in 6 months:
Components
portable virtio backend library
- source based library (so you include directly in your project)
- liberally licensed (Apache? to facilitate above)
- tailored for non-POSIX, limited resource setups
- e.g. avoid malloc/free, provide abstraction hooks where needed
- not assume OS facilities (so embeddable in RTOS or Unikernel)
- static virtio configuration supplied from outside library (location of queues etc)
- hypervisor agnostic
- provide a hook to signal when queues need processing
Following on from a discussion with Vincent and Akashi-san last week we need to think more about how to make this hypervisor agnostic. There will always be some code that has to live outside the library but if it ends up being the same amount again what have we gained?
I suspect this should be a from scratch implementation but it's certainly worth investigating the BSD implementation as Windriver have suggested.
SCMI server
This is the work product of STR-4, embedded in an RTOS. I'm suggesting Zephyr makes the most sense here given the previous work done by Peng and Akashi-san but I guess an argument could be made for a unikernel. I would suggest demonstrating portability to a unikernel would be a stretch goal for the demo.
The server would be *build* time selectable to deploy either in a Jailhouse partition or a Xen Dom U deployment. Either way there will need to be an out-of-band communication of location of virtual queues and memory maps - I assume via DTB.
From our discussion last week Zephyr's DTB support is very limited and not designed to cope with dynamic setups. So rather than having two build-time selectable configurations we should opt for a single binary with a fixed expectation of where the remote guest's memory and virtqueues will exist in its memory space.
I'm still a bit skeptical about "single binary with a fixed expectation" concept.
- Is it feasible to expect that all the hypervisors would configure a BE domain in the same way (in respect of assigning a memory region or an interrupt number)?
I think the configuration mechanism will be different but surely it's possible to give the same guest view to the BE from any hypervisor.
- what if we want a BE domain to
  - provide different types of virtio devices
  - support more than one frontend domain
  at the same time?
That is certainly out of scope for this proposed demo which is a single statically configured device servicing a single frontend domain.
How can a single binary without ability of dynamic configuration deal with those requirements?
I don't think it can. For complex topologies of devices and backends I think you will need a degree of flexibility so while layouts will be static on the device the components will need to be flexible/portable. This isn't really a topic we've explored in detail yet but would be further work under STR-10 (hypervisor boot orchestration).
It will then be up to the VMM setting things up to ensure everything is mapped in the appropriate place in the RTOS memory map. There would also be a fixed IRQ map for signalling to the RTOS when things change and a fixed doorbell for signalling the other way.
We should think of two different phases:
- virtio setup/negotiation via MMIO configuration space
There is a stage before this which is knowing there is a MMIO device in the first place (on PCI this is simplified a little by the PCI probe).
- (virtio device specific) operation via virtqueue
Anyway, the signaling mechanism can be different from hypervisor to hypervisor; on Xen, for example,
- the notification of MMIO's to the configuration space by FE will be trapped and delivered via an event channel + a dedicated "shared IO page"
If we want to keep hypervisor specifics out of the BE can't the VMM or equivalent then trigger an IRQ in the BE domain as a result of the signal?
- the notification of virtqueue update from BE to FE will be done via another event channel, not by interrupt.
The BE can write to a trapped location and trigger the event the other way around (BE -> VMM -> FE IRQ)
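Building on the earlier sketches (the strawman header and the hook interface, all names still illustrative; register_irq_handler() is a stand-in for whatever the RTOS actually provides), the BE-side glue then boils down to one IRQ in and one trapped write out:

  #include <stdint.h>

  /* SCMI_BE_DOORBELL, SCMI_BE_KICK_IRQ, struct virtio_be_dev and
   * virtio_be_handle_kick() come from the sketches earlier in the
   * thread; register_irq_handler() is a placeholder for the RTOS API. */
  struct virtio_be_dev;
  extern int virtio_be_handle_kick(struct virtio_be_dev *dev, uint16_t queue_idx);
  extern void register_irq_handler(int irq, void (*isr)(void *), void *arg);

  static volatile uint32_t *const doorbell =
      (volatile uint32_t *)SCMI_BE_DOORBELL;  /* trapped by the VMM */

  static struct virtio_be_dev *scmi_dev;

  /* notify_frontend hook (plugged into the virtio_be_ops table at init,
   * not shown): BE -> VMM -> FE IRQ */
  static void notify_frontend(void *opaque, uint16_t queue_idx)
  {
      (void)opaque;
      *doorbell = queue_idx;  /* the write traps to the VMM, which then
                               * injects the interrupt into the FE domain */
  }

  /* FE -> VMM -> BE IRQ: just tell the library to look at the rings */
  static void kick_isr(void *arg)
  {
      (void)arg;
      virtio_be_handle_kick(scmi_dev, 0 /* single SCMI queue in the demo */);
  }

  void scmi_be_init(void)
  {
      register_irq_handler(SCMI_BE_KICK_IRQ, kick_isr, NULL);
  }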
Another topic is "virtio device specific configuration parameters," for instance, a file path as backend storage for a virtio block. We might need an out-of-band (side-band?) communication channel to feed that information to a BE domain. (For Xen, xenstore is used for this purpose in EPAM's virtio-disk implementation.)
This isn't in scope for this demo. The reason I proposed the SCMI server is that it is very self-contained and doesn't have to deal with additional things such as storage locations.
I'm unfamiliar with the RTOS build process but I guess this would be a single git repository with the RTOS and glue code and git sub-projects for the virtio and scmi libraries?
I think that Zephyr's build process (cmake) allows us to import a library from an external repository.
Deployments
To demonstrate portability we would have:
- Xen hypervisor
- Dom0 with Linux/xen tools
- DomU with Linux with a virtio-scmi front-end
- DomU with RTOS/SCMI server with virtio-scmi back-end
The Dom0 in this case is just for convenience of the demo as we don't currently have a fully working dom0-less setup. The key communication is between the two DomU guests.
However the Dom0 will also need the glue code to set up the communication and memory mapping between the two DomU guests.
I initially thought so, but after looking into the Xen APIs, I found that we have to make IOREQ-related hypervisor calls directly on a BE domain. At least under the current implementation, dom0 cannot call them on behalf of a BE domain.
OK so this is a concrete implementation detail we need to work out. I thought the IOREQ events are going from the FE domain (one domU) to the dom0. Can it not trigger an IRQ in the BE domain (another domU)?
This could happily link with the existing Xen library for setting up the guest table mappings.
- Jailhouse
- Linux kernel partition with virtio-scmi front-end
- RTOS/SCMI server partition with a virtio-scmi back-end
The RTOS/SCMI server would be the same binary blob as for Xen. Again some glue setup code would be required. I'm still unsure on how this would work for Jailhouse so if we don't have any Jailhouse expertise joining the project we could do this with KVM instead.
- Linux/KVM host
- setup code in main host (kvmtool/QEMU launch)
- KVM guest with Linux with a virtio-scmi front-end
- KVM guest with RTOS/SCMI server with virtio-scmi back-end
The easiest way of implementing a BE for KVM is to utilize the vhost-user library, but please note that this library internally uses socket(AF_UNIX), eventfd() and mmap(), which are in some sense hypervisor-specific interfaces given that Linux works as a type-2 hypervisor :)
You wouldn't be able to terminate vhost-user in the KVM guest (at least not yet) - but what we are talking about here is a mediation in the main host. For example:
KVM guest, virtio-scmi FE -> trigger event
event -> QEMU (QEMU is the VMM for the FE guest)
QEMU -> vhost-user BE (kvm-tool for BE guest implements vhost-user)
vhost-user BE -> triggers IRQ in FE guest
so the vhost-user part in the main host would just be shuttling events between the two KVM guests. The processing of the virtqueue would all be in the BE guest but it wouldn't be vhost-user aware.
Then it's not quite straightforward to port it to an RTOS like Zephyr and I don't think a single binary would work on both Xen and KVM.
Well we are trying to abstract the hypervisor specific details out into their own respective blobs and leave the BE in the guest as purely dealing with virtqueues and IRQ/doorbell signalling.
-Takahiro Akashi
On Thu, Apr 15, 2021 at 11:42:04AM +0100, Alex Bennée wrote:
AKASHI Takahiro takahiro.akashi@linaro.org writes:
Hi Alex,
On Mon, Apr 12, 2021 at 08:34:54AM +0000, Alex Bennée via Stratos-dev wrote:
Alex Bennée via Stratos-dev stratos-dev@op-lists.linaro.org writes:
Hi All,
We've been discussing various ideas for Stratos in and around STR-7 (common virtio library). I'd originally de-emphasised the STR-7 work because I wasn't sure if this was duplicate effort given we already had libvhost-user as well as interest in rust-vmm for portable backends in user-space. However we have seen from the Windriver hypervisor-less virtio, NXP's Zephyr/Jailhouse and the requirements for the SCMI server that there is a use-case for a small, liberally licensed C library that is suitable for embedding in lightweight backends without a full Linux stack behind it. These workloads would run in either simple command loops, RTOSes or Unikernels.
Given the multiple interested parties I'm hoping we have enough people who can devote time to collaborate on the project to make the following realistic over the next cycle and culminate in the following demo in 6 months:
Components
portable virtio backend library
- source based library (so you include directly in your project)
- liberally licensed (Apache? to facilitate above)
- tailored for non-POSIX, limited resource setups
- e.g. avoid malloc/free, provide abstraction hooks where needed
- not assume OS facilities (so embeddable in RTOS or Unikernel)
- static virtio configuration supplied from outside library (location of queues etc)
- hypervisor agnostic
- provide a hook to signal when queues need processing
Following on from a discussion with Vincent and Akashi-san last week we need to think more about how to make this hypervisor agnostic. There will always be some code that has to live outside the library but if it ends up being the same amount again what have we gained?
I suspect this should be a from scratch implementation but it's certainly worth investigating the BSD implementation as Windriver have suggested.
SCMI server
This is the work product of STR-4, embedded in an RTOS. I'm suggesting Zephyr makes the most sense here given the previous work done by Peng and Akashi-san but I guess an argument could be made for a unikernel. I would suggest demonstrating portability to a unikernel would be a stretch goal for the demo.
The server would be *build* time selectable to deploy either in a Jailhouse partition or a Xen Dom U deployment. Either way there will need to be an out-of-band communication of location of virtual queues and memory maps - I assume via DTB.
From our discussion last week Zephyr's DTB support is very limited and not designed to cope with dynamic setups. So rather than having two build-time selectable configurations we should opt for a single binary with a fixed expectation of where the remote guest's memory and virtqueues will exist in its memory space.
I'm still a bit skeptical about "single binary with a fixed expectation" concept.
- Is it feasible to expect that all the hypervisors would configure a BE domain in the same way (in respect of assigning a memory region or an interrupt number)?
I think the configuration mechanism will be different but surely it's possible to give the same guest view to the BE from any hypervisor.
- what if we want a BE domain to
  - provide different types of virtio devices
  - support more than one frontend domain
  at the same time?
That is certainly out of scope for this proposed demo which is a single statically configured device servicing a single frontend domain.
How can a single binary without ability of dynamic configuration deal with those requirements?
I don't think it can. For complex topologies of devices and backends I think you will need a degree of flexibility so while layouts will be static on the device the components will need to be flexible/portable. This isn't really a topic we've explored in detail yet but would be further work under STR-10 (hypervisor boot orchestration).
I'd like to see a 'big picture' of system/device configuration.
It will then be up to the VMM setting things up to ensure everything is mapped in the appropriate place in the RTOS memory map. There would also be a fixed IRQ map for signalling to the RTOS when things change and a fixed doorbell for signalling the other way.
We should think of two different phases:
- virtio setup/negotiation via MMIO configuration space
There is a stage before this which is knowing there is a MMIO device in the first place (on PCI this is simplified a little by the PCI probe).
- (virtio device specific) operation via virtqueue
Anyway, the signaling mechanism can be different from hypervisor to hypervisor; on Xen, for example,
- the notification of MMIO's to the configuration space by FE will be trapped and delivered via an event channel + a dedicated "shared IO page"
If we want to keep hypervisor specifics out of the BE can't the VMM or equivalent then trigger an IRQ in the BE domain as a result of the signal?
For MMIO configuration, it's not enough. BE needs to know details of MMIO requests:
- address (or offset) in the configuration space
- IO size (mostly 4 bytes)
- type of access (read or write)
- value (in case of write)
On Xen, that information is exposed to BE through a dedicated page ("shared IO page"). So when BE is notified of such an event via an event channel, BE is expected to access that page. Once BE has recognized and completed a MMIO request, it will issue another event channel to notify FE. The emulation mechanism is quite hypervisor specific. How can we handle that with a single binary?
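One way to keep that out of the common code is for the library to define a tiny hypervisor-neutral record of a trapped access and leave it to the per-hypervisor glue to fill it in. A sketch, with made-up names:

  #include <stdint.h>
  #include <stdbool.h>

  /* Hypervisor-neutral description of one trapped config-space access.
   * The per-hypervisor glue (Xen shared IO page, a KVM MMIO exit, a
   * Jailhouse trap, ...) fills this in; the library never sees how. */
  struct virtio_mmio_req {
      uint64_t offset;    /* offset into the virtio-mmio register block */
      uint32_t size;      /* access width, usually 4 bytes              */
      bool     is_write;
      uint64_t value;     /* data for a write, result for a read        */
  };

  struct virtio_be_dev;   /* from the earlier library sketch */

  /* Library entry point: decode the register, update device state and
   * return the value to complete a read with. */
  uint64_t virtio_be_handle_mmio(struct virtio_be_dev *dev,
                                 struct virtio_mmio_req *req);

  /* The Xen-side glue would then be, roughly: wait on the event channel,
   * copy the fields out of the shared IO page into a virtio_mmio_req,
   * call virtio_be_handle_mmio(), write the result back and signal
   * completion. Only that file knows anything about Xen. */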
- the notification of virtqueue update from BE to FE will be done via another event channel, not by interrupt.
The BE can write to a trapped location and trigger the event the other way around (BE -> VMM -> FE IRQ)
Again, on Xen, such a trigger will be kicked off by calling a particular hypervisor call. (How can we handle that with a single binary?)
Another topic is "virtio device specific configuration parameters," for instance, a file path as backend storage for a virtio block. We might need an out-of-band (side-band?) communication channel to feed that information to a BE domain. (For Xen, xenstore is used for this purpose in EPAM's virtio-disk implementation.)
This isn't in scope for this demo. The reason I proposed the SCMI server is that it is very self-contained and doesn't have to deal with additional things such as storage locations.
I'm unfamiliar with the RTOS build process but I guess this would be a single git repository with the RTOS and glue code and git sub-projects for the virtio and scmi libraries?
I think that Zephyr's build process (cmake) allows us to import a library from an external repository.
Deployments
To demonstrate portability we would have:
- Xen hypervisor
- Dom0 with Linux/xen tools
- DomU with Linux with a virtio-scmi front-end
- DomU with RTOS/SCMI server with virtio-scmi back-end
The Dom0 in this case is just for convenience of the demo as we don't currently have a fully working dom0-less setup. The key communication is between the two DomU guests.
However the Dom0 will also need the glue code to set up the communication and memory mapping between the two DomU guests.
I initially thought so, but after looking into the Xen APIs, I found that we have to make IOREQ-related hypervisor calls directly on a BE domain. At least under the current implementation, dom0 cannot call them on behalf of a BE domain.
OK so this is a concrete implementation detail we need to work out. I thought the IOREQ events are going from the FE domain (one domU) to the dom0. Can it not trigger an IRQ in the BE domain (another domU)?
I'm not sure what 'IOREQ events' mean here in this context. If it is an event to notify BE of some data being written to a virtqueue, it will be escalated to BE in this way:
- FE to push a data buffer and update a descriptor in vq
- FE to write bogus data to "QueueNotify" register in the configuration space
- Xen to trap this IO and notify BE via an event channel
- (event channels are actually implemented by an interrupt on BE)
- BE to acknowledge that an IO request was made on "QueueNotify" and invoke a virtio-device-specific callback function
- Such a callback function handles the request
- ...
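In terms of the library sketch earlier in the thread, the tail of that sequence is just a register decode: QueueNotify really is at offset 0x50 in the virtio-mmio layout, while the type and function names below are illustrative.

  /* Inside virtio_be_handle_mmio(): a write to QueueNotify is what turns
   * a trapped IO into "go and process virtqueue N"; device-specific work
   * happens in the kick path. Types come from the earlier sketches. */
  #define VIRTIO_MMIO_QUEUE_NOTIFY 0x050

  uint64_t virtio_be_handle_mmio(struct virtio_be_dev *dev,
                                 struct virtio_mmio_req *req)
  {
      switch (req->offset) {
      case VIRTIO_MMIO_QUEUE_NOTIFY:
          if (req->is_write)
              virtio_be_handle_kick(dev, (uint16_t)req->value);
          return 0;
      /* ... status, feature negotiation and queue setup registers ... */
      default:
          return 0;
      }
  }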
What I meant above by "IOREQ-related hypervisor calls" are ones that must be called by BE in order to create an IOREQ server in Xen and bind FE's configuration space to BE for Xen to trap any IO's and notify BE.
Again, that initialization process is hypervisor specific and must be done by BE itself under the current Xen implementation.
This could happily link with the existing Xen library for setting up the guest table mappings.
- Jailhouse
- Linux kernel partition with virtio-scmi front-end
- RTOS/SCMI server partition with a virtio-scmi back-end
The RTOS/SCMI server would be the same binary blob as for Xen. Again some glue setup code would be required. I'm still unsure on how this would work for Jailhouse so if we don't have any Jailhouse expertise joining the project we could do this with KVM instead.
- Linux/KVM host
- setup code in main host (kvmtool/QEMU launch)
- KVM guest with Linux with a virtio-scmi front-end
- KVM guest with RTOS/SCMI server with virtio-scmi back-end
The easiest way of implementing a BE for KVM is to utilize the vhost-user library, but please note that this library internally uses socket(AF_UNIX), eventfd() and mmap(), which are in some sense hypervisor-specific interfaces given that Linux works as a type-2 hypervisor :)
You wouldn't be able to terminate vhost-user in the KVM guest (at least not yet) - but what we are talking about here is a mediation in the main host. For example:
KVM guest, virtio-scmi FE -> trigger event
event -> QEMU (QEMU is the VMM for the FE guest)
QEMU -> vhost-user BE (kvm-tool for BE guest implements vhost-user)
vhost-user BE -> triggers IRQ in FE guest
so the vhost-user part in the main host would just be shuttling events between the two KVM guests. The processing of the virtqueue would all be in the BE guest but it wouldn't be vhost-user aware.
#I'm afraid that I don't fully understand what you meant here.
Yes, once all is set up, what is needed is "would just be shuttling events between the two KVM guests" and "the processing of the virtqueue" on BE won't be vhost-user aware. That's fine.
My main concerns about vhost-user approach are not there.
Virtio on Xen is implemented based on the IOREQ feature, *but* Xen itself doesn't know anything about virtio-mmio, particularly the meanings of registers in the configuration space. It simply traps any memory accesses by FE and notifies BE of such events.
On the other hand, qemu "as VMM" is responsible not only for trapping MMIO's but also interpreting such IO's as register accesses and transforming them into vhost-user messages for BE. It works as a proxy (or delegate?) for BE.
Under the current implementation on Xen, this kind of processing must be done by BE itself.
Another issue is, as I mentioned, the APIs used in vhost-user. The communication between qemu and BE is done via socket() and eventfd(), which are Linux-specific interfaces. Mapping FE's virtqueue and memory regions for data is done by mmap() with a file descriptor which represents (part of?) FE's address space.
It can be difficult to implement those APIs with the same semantics in Zephyr on Xen.
In short, my point is that the issues I mentioned above suggest that the "single binary" concept might not work well.
Then it's not quite straightforward to port it to RTOS like Zephyr and I don't think a single binary would work both on Xen and kvm.
Well we are trying to abstract the hypervisor specific details out into their own respective blobs and leave the BE in the guest as purely dealing with virtqueues and IRQ/doorbell signalling.
What do you mean by "respective blobs"? Doesn't it contradict "a single binary"?
-Takahiro Akashi
Does anybody have any comments, agree or not agree? Any better ideas? I'm willing to give you technical details behind my thoughts if you like?
-Takahiro Akashi
AKASHI Takahiro takahiro.akashi@linaro.org writes:
Does anybody have any comments, agree or not agree? Any better ideas? I'm willing to give you technical details behind my thoughts if you like?
I too would like to know if anyone else has any thoughts.
-Takahiro Akashi
On Fri, Apr 16, 2021 at 09:18:37PM +0900, AKASHI Takahiro wrote:
On Thu, Apr 15, 2021 at 11:42:04AM +0100, Alex Bennée wrote:
AKASHI Takahiro takahiro.akashi@linaro.org writes:
Hi Alex,
On Mon, Apr 12, 2021 at 08:34:54AM +0000, Alex Bennée via Stratos-dev wrote:
Alex Bennée via Stratos-dev stratos-dev@op-lists.linaro.org writes:
Hi All,
We've been discussing various ideas for Stratos in and around STR-7 (common virtio library). I'd originally de-emphasised the STR-7 work because I wasn't sure if this was duplicate effort given we already had libvhost-user as well as interest in rust-vmm for portable backends in user-space. However we have seen from the Windriver hypervisor-less virtio, NXP's Zephyr/Jailhouse and the requirements for the SCMI server that there is a use-case for a small, liberally licensed C library that is suitable for embedding in lightweight backends without a full Linux stack behind it. These workloads would run in either simple command loops, RTOSes or Unikernels.
Given the multiple interested parties I'm hoping we have enough people who can devote time to collaborate on the project to make the following realistic over the next cycle and culminate in the following demo in 6 months:
Components
portable virtio backend library
- source based library (so you include directly in your project)
- liberally licensed (Apache? to facilitate above)
- tailored for non-POSIX, limited resource setups
- e.g. avoid malloc/free, provide abstraction hooks where needed
- not assume OS facilities (so embeddable in RTOS or Unikernel)
- static virtio configuration supplied from outside library (location of queues etc)
- hypervisor agnostic
- provide a hook to signal when queues need processing
Following on from a discussion with Vincent and Akashi-san last week we need to think more about how to make this hypervisor agnostic. There will always be some code that has to live outside the library but if it ends up being the same amount again what have we gained?
<snip>
The server would be *build* time selectable to deploy either in a Jailhouse partition or a Xen Dom U deployment. Either way there will need to be an out-of-band communication of location of virtual queues and memory maps - I assume via DTB.
From our discussion last week Zephyr's DTB support is very limited and not designed to cope with dynamic setups. So rather than having two build-time selectable configurations we should opt for a single binary with a fixed expectation of where the remote guest's memory and virtqueues will exist in its memory space.
I'm still a bit skeptical about "single binary with a fixed expectation" concept.
- Is it feasible to expect that all the hypervisors would configure a BE domain in the same way (in respect of assigning a memory region or an interrupt number)?
I think the configuration mechanism will be different but surely it's possible to give the same guest view to the BE from any hypervisor.
- what if we want a BE domain to
  - provide different types of virtio devices
  - support more than one frontend domain
  at the same time?
That is certainly out of scope for this proposed demo which is a single statically configured device servicing a single frontend domain.
How can a single binary without ability of dynamic configuration deal with those requirements?
I don't think it can. For complex topologies of devices and backends I think you will need a degree of flexibility so while layouts will be static on the device the components will need to be flexible/portable. This isn't really a topic we've explored in detail yet but would be further work under STR-10 (hypervisor boot orchestration).
I'd like to see a 'big picture' of system/device configuration.
It will then be up to the VMM setting things up to ensure everything is mapped in the appropriate place in the RTOS memory map. There would also be a fixed IRQ map for signalling to the RTOS when things change and a fixed doorbell for signalling the other way.
We should think of two different phases:
- virtio setup/negotiation via MMIO configuration space
There is a stage before this which is knowing there is a MMIO device in the first place (on PCI this is simplified a little by the PCI probe).
- (virtio device specific) operation via virtqueue
Anyway, the signaling mechanism can be different from hypervisor to hypervisor; on Xen, for example,
- the notification of MMIO's to the configuration space by FE will be trapped and delivered via an event channel + a dedicated "shared IO page"
If we want to keep hypervisor specifics out of the BE can't the VMM or equivalent then trigger an IRQ in the BE domain as a result of the signal?
For MMIO configuration, it's not enough. BE needs to know details of MMIO requests:
- address (or offset) in the configuration space
- IO size (mostly 4bytes)
- type of access (read or write)
- value (in case of write)
On Xen, that information is exposed to BE through a dedicated page ("shared IO page"). So when BE is notified of such an event via an event channel, BE is expected to access that page. Once BE has recognized and completed a MMIO request, it will issue another event channel to notify FE. The emulation mechanism is quite hypervisor specific. How can we handle that with a single binary?
Is there a halfway house - can we keep as much of the hypervisor specifics as possible on the host side and just bring in the minimal hypervisor interface needed for each hypervisor? It could be runtime selectable, or it sounds like we would need to fall back to a build-time selectable interface.
What I'm trying to avoid though is a binary that has more hypervisor-specific code in it than generic virtio backend handling code.
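Concretely that could be a small ops table the build selects (all names below are invented for the sake of argument): the generic back-end only ever sees the neutral interface, and the per-hypervisor file carries the event-channel/eventfd/doorbell details.

  #include <stdint.h>

  struct virtio_mmio_req;   /* from the earlier sketch */

  /* Build-time selectable glue: one implementation per hypervisor, the
   * generic back-end code only ever calls through these ops. */
  struct hv_ops {
      int  (*init)(void);                               /* ioreq server, eventfds, ... */
      int  (*wait_event)(struct virtio_mmio_req *req);  /* block until the FE pokes us */
      void (*complete_mmio)(const struct virtio_mmio_req *req);
      void (*kick_frontend)(uint16_t queue_idx);
  };

  #if defined(CONFIG_HV_XEN)
  extern const struct hv_ops xen_hv_ops;        /* event channels + shared IO page */
  #define HV_OPS (&xen_hv_ops)
  #elif defined(CONFIG_HV_JAILHOUSE)
  extern const struct hv_ops jailhouse_hv_ops;  /* partition doorbell/IRQ          */
  #define HV_OPS (&jailhouse_hv_ops)
  #else
  extern const struct hv_ops kvm_hv_ops;        /* host-side vhost-user shuttle    */
  #define HV_OPS (&kvm_hv_ops)
  #endif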