On Fri, 2 Oct 2020, Arnd Bergmann via Stratos-dev wrote:
> On Fri, Oct 2, 2020 at 3:44 PM Jean-Philippe Brucker
> <jean-philippe(a)linaro.org> wrote:
> >
> > Hi,
> >
> > I've looked in more details at limited memory sharing (STR-6, STR-8,
> > STR-15), mainly from the Linux guest perspective. Here are a few thoughts.
>
> Thanks for writing this up and for sharing!
>
> > At the moment the bounce buffer is allocated from a global pool in the low
> > physical pages. However a recent proposal by Chromium would add support
> > for per-device swiotlb pools:
> >
> > https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromiu…
> >
> > And quoting Tomasz from the discussion on patch 4:
> >
> > For this, I'd like to propose a "restricted-dma-region" (feel free
> > to suggest a better name) binding, which is explicitly specified
> > to be the only DMA-able memory for this device and make Linux use
> > the given pool for coherent DMA allocations and bouncing
> > non-coherent DMA.
>
> Right, I think this can work, but there are very substantial downsides to it:
>
> - it is a fairly substantial departure from the virtio specification, which
> defines that transfers can be made to any part of the guest physical
> address space
>
> - for each device that allocates a region, we have to reserve host memory
> that cannot be used for anything else, and the size is based on the
> maximum we might be needing at any point of time.
>
> - it removes any kind of performance benefit that we might see from
> using an iommu, by forcing the guest to do bounce-buffering, in
> addition to the copies that may have to be done in the host.
> (you have the same point below as well)
>
> > Dynamic regions
> > ---------------
> >
> > In a previous discussion [1], several people suggested using a vIOMMU to
> > dynamically update the mappings rather than statically setting a memory
> > region usable by the backend. I believe that approach is still worth
> > considering because it satisfies the security requirement and doesn't
> > necessarily have worse performance. There is a trade-off between bounce
> > buffers on one hand, and map notifications on the other.
> >
> > The problem with static regions is that all of the traffic will require
> > copying. Sub-page payloads will need bounce buffering anyway, for proper
> > isolation. But for large payloads bounce buffering might be prohibitive,
> > and using a virtual IOMMU might actually be more efficient. Instead of
> > copying large buffers the guest would send a MAP request to the
> > hypervisor, which would then map the pages into the backend. Despite
> > taking plenty of cycles for context switching and setting up the maps, it
> > might be less costly than copying.
>
> Agreed, I would think the iommu based approach is much more promising
> here.
The two approaches are not mutually exclusive. The first approach could be
demoed in a couple weeks, while this approach will require months of
work and at least one new virtio interface.
My suggestion would be to hack together a pre-shared memory solution, or
a hypervisor-mediated solution with Argo, do some benchmarks to
understand the degradation, and figure out if the degradation is bad
enough that we need to go down the virtio IOMMU route.
From my experience with the Xen grant table and PV drivers, the virtio
IOMMU alone might not make things a lot faster without changes to the
protocols (i.e. virtio-net/block need specific improvements).
If it turns out that the pre-shared memory causes performance issues,
one "crazy" idea would be to make use of DMA engines in Xen to speed up
hypervisor-based copies.
> Maybe the partial-page sharing could be part of the iommu dma_ops?
> There are many scenarios where we already rely on the iommu for
> isolating the kernel from malicious or accidental DMA, but I have not
> seen anyone do this on a sub-page granularity. Using bounce
> buffers for partial page dma_map_* could be something we can do
> separately from the rest, as both aspects seem useful regardless of
> one another.
>
> > Since it depends on the device, I guess we'll need a survey of memory
> > access patterns by the different virtio devices that we're considering.
> > In the end a mix of both solutions might be necessary.
>
> Can you describe a case in which the iommu would clearly be
> inferior? IOW, what stops us from always using an iommu here?
Xen PV drivers have started with the equivalent of a virtio IOMMU in
place, which we call "grant table". A virtual machine uses the grant
table to share memory explicitly with another virtual machine.
Specifically, the frontend uses the grant table to share memory with the
backend, otherwise the backend is not allowed to map the memory.
(There is a question on whether we could standardize the grant table
interface.)
Speaking from that experience, we ended up switching PV network to use
hypervisor-based copy (without Argo; Argo came later and it is a more
generic solution) because it was faster than the alternatives. We are
still using the grant table for everything else (block, framebuffer,
etc.)
For block, we ended up extending the life of certain grant mappings to
improve performance; we called them "persistent grants".
The takeaway is that the results might differ significantly, not just
between one protocol and another (net and block), but also between
hypervisors (Xen and KVM), and between SoCs. These are difficult waters
to navigate.
On Fri, 9 Oct 2020, Jean-Philippe Brucker via Stratos-dev wrote:
> Thanks for the discussion. More thoughts on vIOMMU below, in case we want
> to try this solution in addition to the pre-shared one.
>
> On Tue, Oct 06, 2020 at 02:46:02PM +0200, Arnd Bergmann wrote:
> > > From our learnings on mediated devices (vfio stuff), I wouldn't play with
> > > physical memory. If MMU and SMMU don't play a key role in orchestrating
> > > address spaces, I am not sure what I describe is feasible.
> >
> > So you would lean towards what Jean-Philippe explained regarding the
> > use of vIOMMU in the front-end guest to be able to use an unmodified
> > set of virtio drivers on guest pages, while using the dynamic iommu mappings
> > to control which pages are shared with the back-end, right?
> >
> > I think this can work in general, but it seems to introduce a ton of complexity
> > in the host, in particular with a type-2 hypervisor like KVM, and I'm not
> > convinced that the overhead of trapping into the hypervisor for each
> > iotlb flush as well as manipulating the page tables of a remote guest
> > from it is lower than just copying the data through the CPU cache.
> >
> > In the case of KVM, I would guess that the backend guest has to map an
> > object for the queue into its guest-physical address space as a
> > vm_area_struct, possibly coming from a memfd or dmabuf fd that was
> > shared by the front-end guest. Performing the dynamic remapping of
> > that VMA from host user space (qemu, kvmtool, ...) would certainly
> > mean a high runtime overhead, while doing it in the host kernel
> > alone may not be acceptable for mainline kernels because of the
> > added complexity in too many sensitive areas of the kernel (memory
> > management / file systems, kvm, vhost/virtio).
>
> With a vIOMMU the frontend wouldn't populate the backend's page tables and
> send IOTLB flushes. And I don't think the hypervisor could touch the
> backend's stage-1 page tables either. Neither do grant tables, I believe,
> though I still have some reading to do there. The frontend would send MAP
> and UNMAP messages to the host. Roughly:
>
> MAP {
>     u32 domain_id;  // context (virtio device)
>     u32 flags;      // permissions (read|write)
>     u64 iova;       // an arbitrary virtual address
>     u64 gpa;        // guest#1 physical address
>     u64 size;       // multiple of guest#1 page size
> };
>
> The hypervisor receives this and finds the phys page (pa) corresponding
> to g1pa. Then:
>
> (1) Allocates pages in the backend's guest-physical memory, at g2pa.
> (2) Creates the stage-2 mapping g2pa->pa (for example in KVM
> KVM_SET_USER_MEMORY_REGION?).
> (3) Tells the backend about the iova->g2pa mapping for the virtio device.
>
> Then frontend builds the virtqueue or publishes buffers, using an IOVA
> within the range of what has been mapped so far. The backend translates
> that to g2pa, then to a g2va and accesses the buffer. When done the
> frontend unmaps the page:
>
> UNMAP {
>     u32 domain_id;
>     u64 iova;
>     u64 size;
> };
>
> The host tells the backend and tears down the stage-2 mapping.
As a reference, the grant table works a bit differently. Continuing with
your abstract example, the MAP operation would return a grant_reference,
which is just a number identifying the "grant".
MAP {
    u32 domain_id;  // context (virtio device)
    u32 flags;      // permissions (read|write)
    u64 gpa;        // guest#1 physical address
    u32 gnt_ref;    // output: the grant reference
};
The frontend passes the grant_reference to the backend, for instance
over a ring. The backend issues a GRANT_MAP operation to map the
grant_reference at g2pa->pa.
GRANT_MAP {
    u32 domain_id;  // source (frontend) domain_id
    u32 gnt_ref;    // grant reference
    u64 g2pa;       // guest physical address where to create the mapping
}
This way, the frontend doesn't have to know about g2pa->pa. The backend
is free to choose any (valid) g2pa it wants.
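To make the flow concrete, here is a minimal C sketch of the sequence
described above. The map_req/grant_map structures mirror the abstract MAP
and GRANT_MAP operations in this thread, and hyp_map(), send_to_backend()
and grant_map_in_backend() are hypothetical placeholders standing in for
the real hypercall and ring plumbing, not actual Xen or KVM interfaces:

#include <stdint.h>

/* Abstract operations from the discussion above: hypothetical
 * placeholders for illustration, not real hypervisor interfaces. */
struct map_req   { uint32_t domain_id; uint32_t flags; uint64_t gpa; uint32_t gnt_ref; };
struct grant_map { uint32_t domain_id; uint32_t gnt_ref; uint64_t g2pa; };

void  hyp_map(struct map_req *req);               /* frontend -> hypervisor, fills gnt_ref */
void  send_to_backend(uint32_t gnt_ref);          /* e.g. over a shared ring */
void *grant_map_in_backend(struct grant_map *gm); /* backend -> hypervisor */

/* Frontend: grant the backend access to one page of a buffer. */
static uint32_t share_buffer(uint32_t backend_domid, uint64_t buf_gpa)
{
    struct map_req req = {
        .domain_id = backend_domid,
        .flags     = 0x3,         /* read|write */
        .gpa       = buf_gpa,     /* guest#1 physical address */
    };
    hyp_map(&req);                /* hypervisor fills in the grant reference */
    send_to_backend(req.gnt_ref); /* the backend never learns pa or g1pa */
    return req.gnt_ref;
}

/* Backend: map the grant at a free g2pa of its own choosing. */
static void *map_grant(uint32_t frontend_domid, uint32_t gnt_ref, uint64_t free_g2pa)
{
    struct grant_map gm = {
        .domain_id = frontend_domid,
        .gnt_ref   = gnt_ref,
        .g2pa      = free_g2pa,
    };
    return grant_map_in_backend(&gm);
}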
Thanks Jean-Philippe for describing the problem in an abstract way.
I would like to share considerations related to establishing data
paths between a VM (frontend) and a device through another VM
(backend).
Sharing model
---------------
The VM in direct contact with the device may provide driver service to
just a single other VM. This simplifies inbound traffic handling, as
there is no marshaling of data to different VMs; shortcuts can be
implemented and performance can be dramatically different. We thus have
two models: passthrough with abstraction (to virtio) versus a shared model.
IO Model
---------------
There are two main ways to orchestrate IO around memory regions for
data and rings of descriptors for outbound requests or inbound data.
The key structural difference is how inbound traffic is handled:
- the device is provided a list of buffers that are pointed to through
  pre-populated rings [that is almost the rule on PCI adapters, with a
  handful of exceptions]
- the device is provided one or more unstructured memory areas and
  rings; the device allocates memory as it pleases from the
  unstructured area(s) [this is very common on Arm platform devices]
The structured-buffer method can be thought of as computer-driven IO,
while the other is device-driven IO. Computer-driven IO allows building
zero-copy data paths from device to front-end VM (if other conditions
are met), while device-driven IO imposes a bounce buffer for incoming
traffic, though it may still be zero-copy for outbound traffic.
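As a minimal illustration of the computer-driven model (the descriptor
layout and the alloc_dma_buf() allocator are made up for this sketch, not
any particular device's interface): the driver pre-populates a ring of
descriptors pointing at buffers it chose itself, so inbound data can only
land in memory the driver has already decided to share.

#include <stdint.h>
#include <stddef.h>

#define RING_SIZE 256
#define BUF_SIZE  2048

/* Illustrative RX descriptor: the driver, not the device, decides
 * where inbound data may be written. */
struct rx_desc {
    uint64_t buf_addr;   /* DMA address of a driver-chosen buffer */
    uint32_t buf_len;    /* capacity handed to the device */
    uint32_t status;     /* written back by the device on completion */
};

/* Pre-populate the ring. If these buffers already live in memory
 * shared with the frontend VM, inbound traffic can be zero-copy;
 * in the device-driven model the device picks addresses itself
 * within an unstructured area, so a bounce copy is needed instead. */
static void fill_rx_ring(struct rx_desc ring[RING_SIZE],
                         uint64_t (*alloc_dma_buf)(size_t len))
{
    for (int i = 0; i < RING_SIZE; i++) {
        ring[i].buf_addr = alloc_dma_buf(BUF_SIZE);  /* hypothetical allocator */
        ring[i].buf_len  = BUF_SIZE;
        ring[i].status   = 0;
    }
}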
Coherent interconnects such as CCIX and CXL do not fundamentally
change the IO Model but change the driver architecture.
Memory element
---------------
What is the smallest unit of memory shared between VMs? Naturally this
is a page, but the page size may not be controllable: RHEL on arm64,
for instance, uses 64KB pages. Using a 64KB page to hold a 64-byte
network packet or audio sample is not very efficient. Sub-page
allocations are possible but would incur a significant performance
penalty, as the VMM would have to validate each sub-page memory access,
resulting in a VM exit for every check.
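To put a number on the overhead (plain arithmetic, using only sysconf()
to query the page size):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);  /* 4096 on most configs, 65536 on 64KB-page arm64 kernels */
    long payload = 64;                  /* e.g. a small network packet or audio sample */

    /* Sharing works at page granularity, so the whole page is exposed
     * to (and pinned for) the other VM even for a tiny payload. */
    printf("page size       : %ld bytes\n", page);
    printf("payload         : %ld bytes\n", payload);
    printf("wasted per share: %ld bytes (%.1f%%)\n",
           page - payload, 100.0 * (page - payload) / page);
    return 0;
}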
Underlay networking
---------------
In cloud infrastructure, VM network traffic is always encapsulated with
protocols such as GENEVE or VXLAN. It is also frequent to have a
smartNIC that marshals traffic to the VMs and the host through distinct
pseudo devices.
In telecom parlance the raw network is called the underlay network.
So there may be big differences in achieving the goal depending on the
use case (which has to be defined and assessed):
- cloud native: there is an underlay network
- smartphone virtualization: there may be an underlay network to
  differentiate between user traffic on different APNs, as well as
  operator traffic dedicated to the SIM card or other cases.
- virtualization in the car: probably no underlay network
The presence of an underlay network is significant for the use case:
with proper hashing of inbound traffic (with or without a pseudo
device), traffic for the front-end VM can be associated with a
dedicated set of queues. This assignment may help work around IO-model
constraints with device-driven IO and find a way to do zero (data) copy
from device to front-end VM.
User land IO
---------------
I think the use of userland IO (in the backend and/or the frontend VM)
may impact the data path in various ways, so this should be factored
into the analysis.
A key element of performance is to use metadata (prepended or appended
to the data) to get information, rather than the ring descriptor, which
is in uncached memory (we have concrete examples where the cost of the
different strategies can be up to 50% of the base performance).
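A small sketch of that idea (the layout is purely illustrative): the
producer prepends metadata in the same cacheable buffer as the payload,
so the consumer avoids extra reads of the uncached ring descriptor.

#include <stdint.h>

/* Illustrative prepended metadata, living in the same (cacheable)
 * buffer as the payload rather than in the uncached descriptor ring. */
struct pkt_meta {
    uint16_t len;        /* payload length */
    uint16_t offset;     /* payload offset from the start of the buffer */
    uint32_t flags;      /* e.g. checksum-verified, VLAN stripped, ... */
};

/* The consumer touches one cacheline of data memory instead of doing
 * an uncached descriptor read for every field it needs. */
static inline const void *pkt_payload(const void *buf)
{
    const struct pkt_meta *m = buf;
    return (const uint8_t *)buf + m->offset;
}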
Virtio
---------------
There are significant performance differences between virtio 1.0 and
virtio 1.1: virtio 1.0 touches something like 6 cachelines to insert an
element in a queue, while virtio 1.1 touches only one (not sure about
the exact numbers, but they should not be far off).
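For reference, the descriptor layouts below (a sketch following the
layouts in the virtio 1.0/1.1 specifications) show why: the 1.1 packed
descriptor carries address, length, id and flags in a single 16-byte
entry, while the 1.0 split layout spreads the same insertion across the
descriptor table, the available ring and its index.

#include <stdint.h>

/* virtio 1.1 "packed" virtqueue descriptor: one 16-byte entry per
 * element, typically a single cacheline touched per insertion. */
struct virtq_packed_desc {
    uint64_t addr;   /* buffer guest-physical address (little-endian on the wire) */
    uint32_t len;
    uint16_t id;     /* buffer id, echoed back on completion */
    uint16_t flags;  /* includes the AVAIL/USED wrap-counter bits */
};

/* virtio 1.0 "split" virtqueue: descriptor table, available ring and
 * used ring are separate structures, hence more cachelines dirtied. */
struct virtq_desc      { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct virtq_avail     { uint16_t flags; uint16_t idx; uint16_t ring[]; };
struct virtq_used_elem { uint32_t id;    uint32_t len; };
struct virtq_used      { uint16_t flags; uint16_t idx; struct virtq_used_elem ring[]; };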
As a reference, 6WIND has a VM-to-VM network driver that is not virtio
based and can go beyond 100Gbps per x86 vCPU, so I expect virtio 1.1 to
reach that level of performance.
Memory allocation backend
---------------
Shared memory allocation shall be controlled by the VMM (QEMU, Xen...)
but the VMM may further ask FF-A to do so. I think there have been
efforts in FF-A to find memory with the proper attributes (coherent
between the device and all VMs). I frankly have no clue here but this may
be worth digging into.
A zero-copy DPDK based DNS on an Arm platform
---------------
This example is meant to prove a point: all aspects of the target use
cases must be detailed with great care, otherwise ideas may not fly.
(Please don't tell me I forgot this or that this does not fly; it is
just to give an idea of the level of detail required.)
Assumptions:
- multiple front end VMs, with underlay network based on VLANs
- the network interface can allocate multiple queues for one VLAN and
  thus have multiple sets of queues for multiple VLANs
- the IO model is either device or computer driven; it does not matter here
- data memory can be associated with a VLAN/VM
- DPDK based front end DNS application
initialization (this is to highlight that it is entirely possible for a
simple data path to be built on an impossible initialization scenario:
initialization must be an integral part of the description of the
solution)
- orchestration spawns backend VM with assigned device
- orchestration spawns front end VM which creates virtio-net with private memory
- orchestration informs the backend VM to assign some traffic for a VLAN
  to it; the backend VM creates the queues for the VLAN, the virtio-net
  with the packet memory wrapped around the memory assigned for the
  queues, and the "bridge" configuration between this VLAN and the
  virtio
- the backend VM informs orchestration to bind the newly created
  virtio-net memory to the front-end memory
- orchestration asks the front-end VMM to wrap its virtio-net around
  the specified area
- the front-end DPDK application listens on UDP port 53 and is bound to
  the virtio-net device (DPDK apps are bound to devices, not IP
  addresses)
- IP configuration of the DPDK app is out of scope.
inbound traffic:
traffic is received in device SRAM
the packet is marshalled to memory associated with a VLAN (i.e. a VM)
a descriptor is updated to point to this packet
the backend VM kernel handles the queues (IRQ, busy polling...) and
creates virtio descriptors pointing to the data as part of the bridging
(stripping the underlay network VLAN tag)
the front-end DPDK app reads the descriptor and the packet
if it is DNS at the expected IP, the application handles the packet;
otherwise it is dropped
outbound traffic:
the DPDK DNS application gets a packet from device memory (i.e. from a
memory pool accessible by the device) and populates the response
the DPDK lib forms the descriptor in virtio
the backend VM kernel gets the packet, forms a descriptor in the right
queue (adding the underlay network VLAN tag) and rings the doorbell
the hardware fetches the descriptor and retrieves the DNS response
(which comes directly from front-end memory)
the hardware copies the packet into SRAM and it is serialized on the wire
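As a very rough sketch of the front-end side (assuming a recent DPDK;
port setup, mempool creation and the handle_dns_query() helper are
omitted or hypothetical), the poll loop that makes the app "bound to the
device, not an IP address" could look like this:

#include <netinet/in.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_udp.h>
#include <rte_byteorder.h>

#define BURST    32
#define DNS_PORT 53

void handle_dns_query(struct rte_mbuf *m);   /* hypothetical: builds and sends the response */

/* Poll one RX queue of the virtio-net port and keep only UDP/53. */
static void dns_poll(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *pkts[BURST];

    for (;;) {
        uint16_t n = rte_eth_rx_burst(port_id, queue_id, pkts, BURST);
        for (uint16_t i = 0; i < n; i++) {
            struct rte_ether_hdr *eth =
                rte_pktmbuf_mtod(pkts[i], struct rte_ether_hdr *);
            if (eth->ether_type != rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4))
                goto drop;

            struct rte_ipv4_hdr *ip = (struct rte_ipv4_hdr *)(eth + 1);
            if (ip->next_proto_id != IPPROTO_UDP)
                goto drop;

            struct rte_udp_hdr *udp = (struct rte_udp_hdr *)
                ((uint8_t *)ip + (ip->version_ihl & 0x0f) * 4);
            if (udp->dst_port != rte_cpu_to_be_16(DNS_PORT))
                goto drop;

            handle_dns_query(pkts[i]);   /* packet data stays in shared memory */
            continue;
drop:
            rte_pktmbuf_free(pkts[i]);   /* not DNS for us: drop */
        }
    }
}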
On Fri, 2 Oct 2020 at 15:44, Jean-Philippe Brucker via Stratos-dev
<stratos-dev(a)op-lists.linaro.org> wrote:
>
> Hi,
>
> I've looked in more details at limited memory sharing (STR-6, STR-8,
> STR-15), mainly from the Linux guest perspective. Here are a few thoughts.
>
> Problem
> -------
>
> We have a primary VM running a guest, and a secondary one running a
> backend that manages one hardware resource (for example network access).
> They communicate with virtio (for example virtio-net). The guest
> implements a virtio driver, the backend a virtio device. Problem is, how
> to ensure that the backend and guest only share the memory required for
> the virtio communication, and that the backend cannot access any other
> memory from the guest?
>
> Static shared region
> --------------------
>
> Let's first look at static DMA regions. The hypervisor allocates a subset
> of memory to be shared between guest and backend. The hypervisor
> communicates this per-device DMA restriction to the guest during boot. It
> could be using a firmware property, or a discovery protocol. I did start
> drafting such a protocol in virtio-iommu, but I now think the
> reserved-memory mechanism in device-tree, below, is preferable. Would we
> need an equivalent for ACPI, though?
>
> How would we implement this in a Linux guest? The virtqueue of a virtio
> device has two components. Static ring buffers, allocated at boot with
> dma_alloc_coherent(), and the actual data payload, mapped with
> dma_map_page() and dma_map_single(). Linux calls the former "coherent"
> DMA, and the latter "streaming" DMA.
>
> Coherent DMA can already obtain its pages from a per-device memory pool.
> dma_init_coherent_memory() defines a range of physical memory usable by a
> device. Importantly this region has to be distinct from system memory RAM
> and reserved by the platform. It is mapped non-cacheable. If it exists,
> dma_alloc_coherent() will only get its pages from that region.
>
> On the other hand streaming DMA doesn't allocate memory. The virtio
> drivers don't control where that memory comes from, since the pages are
> fed to them by an upper layer of the subsystem, specific to the device
> type (net, block, video, etc). Often they are pages from the slab cache
> that contain other unrelated objects, a notorious problem for DMA
> isolation. If the page is not accessible by the device, swiotlb_map()
> allocates a bounce buffer somewhere more convenient and copies the data
> when needed.
>
> At the moment the bounce buffer is allocated from a global pool in the low
> physical pages. However a recent proposal by Chromium would add support
> for per-device swiotlb pools:
>
> https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromiu…
>
> And quoting Tomasz from the discussion on patch 4:
>
> For this, I'd like to propose a "restricted-dma-region" (feel free
> to suggest a better name) binding, which is explicitly specified
> to be the only DMA-able memory for this device and make Linux use
> the given pool for coherent DMA allocations and bouncing
> non-coherent DMA.
>
> That seems to be precisely what we need. Even when using the virtio-pci
> transport, it is still possible to define per-device properties in the
> device-tree, for example:
>
> /* PCI root complex node */
> pcie@10000000 {
>     compatible = "pci-host-ecam-generic";
>
>     /* Add properties to endpoint with BDF 00:01.0 */
>     ep@0008 {
>         reg = <0x00000800 0 0 0 0>;
>         restricted-dma-region = <&dma_region_1>;
>     };
> };
>
> reserved-memory {
>     /* Define 64MB reserved region at address 0x50400000 */
>     dma_region_1: restricted_dma_region {
>         reg = <0x50400000 0x4000000>;
>     };
> };
>
> Dynamic regions
> ---------------
>
> In a previous discussion [1], several people suggested using a vIOMMU to
> dynamically update the mappings rather than statically setting a memory
> region usable by the backend. I believe that approach is still worth
> considering because it satisfies the security requirement and doesn't
> necessarily have worse performance. There is a trade-off between bounce
> buffers on one hand, and map notifications on the other.
>
> The problem with static regions is that all of the traffic will require
> copying. Sub-page payloads will need bounce buffering anyway, for proper
> isolation. But for large payloads bounce buffering might be prohibitive,
> and using a virtual IOMMU might actually be more efficient. Instead of
> copying large buffers the guest would send a MAP request to the
> hypervisor, which would then map the pages into the backend. Despite
> taking plenty of cycles for context switching and setting up the maps, it
> might be less costly than copying.
>
>
> Since it depends on the device, I guess we'll need a survey of memory
> access patterns by the different virtio devices that we're considering.
> In the end a mix of both solutions might be necessary.
>
> Thanks,
> Jean
>
> [1] https://lists.oasis-open.org/archives/virtio-dev/202006/msg00037.html
> --
> Stratos-dev mailing list
> Stratos-dev(a)op-lists.linaro.org
> https://op-lists.linaro.org/mailman/listinfo/stratos-dev
--
François-Frédéric Ozog | Director Linaro Edge & Fog Computing Group
T: +33.67221.6485
francois.ozog(a)linaro.org | Skype: ffozog
Hi,
This is an initial implementation of a vhost-user backend for the
VirtIO RPMB device. The device is currently in the draft of the next
VirtIO specification and describes a block device which uses a combination
of a key, nonce, hashing and a persistent write counter to prevent
replay attacks (hence Replay Protected Memory Block).
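As a rough sketch of the replay-protection idea only (the real JEDEC-style
frame layout, field offsets and endianness are defined by the spec, and
the struct, field names and hmac_sha256() prototype here are simplified
placeholders rather than the daemon's actual code):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for an RPMB frame; the real frame has fixed
 * offsets, stuff bytes and big-endian fields. */
struct rpmb_frame {
    uint8_t  mac[32];        /* HMAC-SHA256 over the authenticated region */
    uint8_t  data[256];
    uint8_t  nonce[16];
    uint32_t write_counter;
    uint16_t address;
    uint16_t block_count;
    uint16_t result;
    uint16_t req_resp;
};

/* Assumed to match the hmac_sha256 helpers imported later in the series. */
void hmac_sha256(uint8_t out[32], const uint8_t *key, size_t key_len,
                 const uint8_t *msg, size_t msg_len);

/* Replay protection in a nutshell: a write is accepted only if the MAC
 * verifies against the programmed key AND the frame's write counter
 * matches the device's persistent counter, which only advances on a
 * successful write, so captured frames cannot be replayed later. */
static bool write_request_ok(const struct rpmb_frame *f,
                             const uint8_t key[32],
                             uint32_t device_counter)
{
    uint8_t mac[32];

    hmac_sha256(mac, key, 32, f->data,
                sizeof(*f) - offsetof(struct rpmb_frame, data));
    if (memcmp(mac, f->mac, sizeof(mac)) != 0)
        return false;                          /* wrong key or tampered frame */
    return f->write_counter == device_counter; /* stale counter => replay */
}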
It is implemented as a vhost-user device because we want to experiment
in making portable backends that can be used with multiple
hypervisors. We also want to support backends isolated in their own
separate service VMs with limited memory cross-sections with the
principal guest. This is part of a wider initiative called project
Stratos for which you can find information here:
https://collaborate.linaro.org/display/STR/Stratos
I mention this to explain the decision to duplicate some of the
utility functions (specifically iov and hmac handling) and write the
daemon as a fairly pure glib application that just depends on
libvhost-user. As it happens I ended up having to include libqemuutil
as libvhost-user requires qemu_memfd_alloc. Whether this is an
oversight for libvhost-user or it means we should split these daemons
into a separate repository is a discussion I would like to have with
the community. Now I have a working reference implementation I also
want to explore how easy it is to write a Rust version of the backend
which raises similar questions about where such a project should live.
The current Linux kernel doesn't support RPMB devices in the vanilla
tree so if you want to test you will need to look at my testing tree
which is based on Thomas Winkler's original patches although somewhat
cut down and pared back to just support the JEDEC-style frames of the
upstream spec and the simple chardev based userspace interface. You
can find my kernel testing tree here:
https://git.linaro.org/people/alex.bennee/linux.git/log/?h=testing/virtio-r…
The above branch includes a simple test script with the rpmb userspace
tool which I've used to exercise the various features. I'm unsure if
there will ever be a push to upstream support for RPMB to the kernel
as access to these sorts of devices is usually the preserve of
firmware living in the secure world. There is currently work underway
to support this device in uboot and I suspect eventually there will be
support for OPTEE as well.
Any review comments gratefully received as well as discussion about if
we should consider creating some new projects for housing these sort
of vhost-user backends.
Alex Bennée (19):
tools/virtiofsd: add support for --socket-group
hw/block: add boilerplate for vhost-user-rpmb device
hw/virtio: move virtio-pci.h into shared include space
hw/block: add vhost-user-rpmb-pci boilerplate
virtio-pci: add notification trace points
tools/vhost-user-rpmb: add boilerplate and initial main
tools/vhost-user-rpmb: implement --print-capabilities
tools/vhost-user-rpmb: connect to fd and instantiate basic run loop
tools/vhost-user-rpmb: add a --verbose/debug flags for logging
tools/vhost-user-rpmb: handle shutdown and SIGINT/SIGHUP cleanly
tools/vhost-user-rpmb: add --flash-path for backing store
tools/vhost-user-rpmb: import hmac_sha256 functions
tools/vhost-user-rpmb: implement the PROGRAM_KEY handshake
tools/vhost-user-rpmb: implement VIRTIO_RPMB_REQ_GET_WRITE_COUNTER
tools/vhost-user-rpmb: implement VIRTIO_RPMB_REQ_DATA_WRITE
tools/vhost-user-rpmb: implement VIRTIO_RPMB_REQ_DATA_READ
tools/vhost-user-rpmb: add key persistence
tools/vhost-user-rpmb: allow setting of the write_count
docs: add a man page for vhost-user-rpmb
docs/tools/index.rst | 1 +
docs/tools/vhost-user-rpmb.rst | 102 +++
docs/tools/virtiofsd.rst | 4 +
include/hw/virtio/vhost-user-rpmb.h | 46 ++
{hw => include/hw}/virtio/virtio-pci.h | 0
tools/vhost-user-rpmb/hmac_sha256.h | 87 ++
tools/virtiofsd/fuse_i.h | 1 +
hw/block/vhost-user-rpmb-pci.c | 82 ++
hw/block/vhost-user-rpmb.c | 333 ++++++++
hw/virtio/vhost-scsi-pci.c | 2 +-
hw/virtio/vhost-user-blk-pci.c | 2 +-
hw/virtio/vhost-user-fs-pci.c | 2 +-
hw/virtio/vhost-user-input-pci.c | 2 +-
hw/virtio/vhost-user-scsi-pci.c | 2 +-
hw/virtio/vhost-user-vsock-pci.c | 2 +-
hw/virtio/vhost-vsock-pci.c | 2 +-
hw/virtio/virtio-9p-pci.c | 2 +-
hw/virtio/virtio-balloon-pci.c | 2 +-
hw/virtio/virtio-blk-pci.c | 2 +-
hw/virtio/virtio-input-host-pci.c | 2 +-
hw/virtio/virtio-input-pci.c | 2 +-
hw/virtio/virtio-iommu-pci.c | 2 +-
hw/virtio/virtio-net-pci.c | 2 +-
hw/virtio/virtio-pci.c | 5 +-
hw/virtio/virtio-rng-pci.c | 2 +-
hw/virtio/virtio-scsi-pci.c | 2 +-
hw/virtio/virtio-serial-pci.c | 2 +-
tools/vhost-user-rpmb/hmac_sha256.c | 331 ++++++++
tools/vhost-user-rpmb/main.c | 880 +++++++++++++++++++++
tools/virtiofsd/fuse_lowlevel.c | 6 +
tools/virtiofsd/fuse_virtio.c | 20 +-
MAINTAINERS | 5 +
hw/block/Kconfig | 5 +
hw/block/meson.build | 3 +
hw/virtio/trace-events | 7 +-
tools/meson.build | 8 +
tools/vhost-user-rpmb/50-qemu-rpmb.json.in | 5 +
tools/vhost-user-rpmb/meson.build | 12 +
38 files changed, 1956 insertions(+), 21 deletions(-)
create mode 100644 docs/tools/vhost-user-rpmb.rst
create mode 100644 include/hw/virtio/vhost-user-rpmb.h
rename {hw => include/hw}/virtio/virtio-pci.h (100%)
create mode 100644 tools/vhost-user-rpmb/hmac_sha256.h
create mode 100644 hw/block/vhost-user-rpmb-pci.c
create mode 100644 hw/block/vhost-user-rpmb.c
create mode 100644 tools/vhost-user-rpmb/hmac_sha256.c
create mode 100644 tools/vhost-user-rpmb/main.c
create mode 100644 tools/vhost-user-rpmb/50-qemu-rpmb.json.in
create mode 100644 tools/vhost-user-rpmb/meson.build
--
2.20.1
On Fri, 2 Oct 2020, Jean-Philippe Brucker via Stratos-dev wrote:
> Hi,
>
> I've looked in more details at limited memory sharing (STR-6, STR-8,
> STR-15), mainly from the Linux guest perspective. Here are a few thoughts.
>
> Problem
> -------
>
> We have a primary VM running a guest, and a secondary one running a
> backend that manages one hardware resource (for example network access).
> They communicate with virtio (for example virtio-net). The guest
> implements a virtio driver, the backend a virtio device. Problem is, how
> to ensure that the backend and guest only share the memory required for
> the virtio communication, and that the backend cannot access any other
> memory from the guest?
Well described, this is the crux of the issue and also the same thing we
tried to capture with STR-14 (please ignore the non-virtio items on
STR-14 for now.)
> Static shared region
> --------------------
>
> Let's first look at static DMA regions. The hypervisor allocates a subset
> of memory to be shared between guest and backend. The hypervisor
> communicates this per-device DMA restriction to the guest during boot. It
> could be using a firmware property, or a discovery protocol. I did start
> drafting such a protocol in virtio-iommu, but I now think the
> reserved-memory mechanism in device-tree, below, is preferable. Would we
> need an equivalent for ACPI, though?
>
> How would we implement this in a Linux guest? The virtqueue of a virtio
> device has two components. Static ring buffers, allocated at boot with
> dma_alloc_coherent(), and the actual data payload, mapped with
> dma_map_page() and dma_map_single(). Linux calls the former "coherent"
> DMA, and the latter "streaming" DMA.
>
> Coherent DMA can already obtain its pages from a per-device memory pool.
> dma_init_coherent_memory() defines a range of physical memory usable by a
> device. Importantly this region has to be distinct from system memory RAM
> and reserved by the platform. It is mapped non-cacheable. If it exists,
> dma_alloc_coherent() will only get its pages from that region.
>
> On the other hand streaming DMA doesn't allocate memory. The virtio
> drivers don't control where that memory comes from, since the pages are
> fed to them by an upper layer of the subsystem, specific to the device
> type (net, block, video, etc). Often they are pages from the slab cache
> that contain other unrelated objects, a notorious problem for DMA
> isolation. If the page is not accessible by the device, swiotlb_map()
> allocates a bounce buffer somewhere more convenient and copies the data
> when needed.
>
> At the moment the bounce buffer is allocated from a global pool in the low
> physical pages. However a recent proposal by Chromium would add support
> for per-device swiotlb pools:
>
> https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromiu…
>
> And quoting Tomasz from the discussion on patch 4:
>
> For this, I'd like to propose a "restricted-dma-region" (feel free
> to suggest a better name) binding, which is explicitly specified
> to be the only DMA-able memory for this device and make Linux use
> the given pool for coherent DMA allocations and bouncing
> non-coherent DMA.
>
> That seems to be precisely what we need. Even when using the virtio-pci
> transport, it is still possible to define per-device properties in the
> device-tree, for example:
>
> /* PCI root complex node */
> pcie@10000000 {
>     compatible = "pci-host-ecam-generic";
>
>     /* Add properties to endpoint with BDF 00:01.0 */
>     ep@0008 {
>         reg = <0x00000800 0 0 0 0>;
>         restricted-dma-region = <&dma_region_1>;
>     };
> };
>
> reserved-memory {
>     /* Define 64MB reserved region at address 0x50400000 */
>     dma_region_1: restricted_dma_region {
>         reg = <0x50400000 0x4000000>;
>     };
> };
Excellent! You should also be aware of a proposal by Qualcomm to the
LKML about this: https://marc.info/?l=linux-kernel&m=158807398403549
I like this approach in the short term because we should be able to
make it work with very little effort, and it doesn't seem to require any
changes to the virtio specification, which means it is "backward
compatible" or at least easier to backport.
(FYI this approach corresponds to "pre-shared memory" in STR-14.)
> Dynamic regions
> ---------------
>
> In a previous discussion [1], several people suggested using a vIOMMU to
> dynamically update the mappings rather than statically setting a memory
> region usable by the backend. I believe that approach is still worth
> considering because it satisfies the security requirement and doesn't
> necessarily have worse performance. There is a trade-off between bounce
> buffers on one hand, and map notifications on the other.
>
> The problem with static regions is that all of the traffic will require
> copying. Sub-page payloads will need bounce buffering anyway, for proper
> isolation. But for large payloads bounce buffering might be prohibitive,
> and using a virtual IOMMU might actually be more efficient. Instead of
> copying large buffers the guest would send a MAP request to the
> hypervisor, which would then map the pages into the backend. Despite
> taking plenty of cycles for context switching and setting up the maps, it
> might be less costly than copying.
>
>
> Since it depends on the device, I guess we'll need a survey of memory
> access patterns by the different virtio devices that we're considering.
> In the end a mix of both solutions might be necessary.
Right, and in my experience the results of the trade-off differ greatly
on different architectures; e.g. on x86 bounce buffers are faster, on
ARM TLB flushes are (or used to be?) much faster. You are completely right
that it also depends on the protocol. From the measurements I did last
time (2+ years ago) this approach was faster on ARM, even for networking.
With recent hardware things might have changed. This approach is also
the most "expensive" to build and requires an implementation in the
hypervisor itself for Xen: I think the virtio IOMMU backend
implementation cannot be in QEMU (or kvmtools), it needs to be in the
Xen hypervisor proper.
(FYI this approach corresponds to dynamically shared memory in STR-14.)
Finally, there is one more approach to consider: hypervisor-mediated
copy. See this presentation:
https://www.spinics.net/lists/automotive-discussions/attachments/pdfZi7_xDH…
If you are interested in more details we could ask Christopher to give
the presentation again at one of the next Stratos meetings. Argo is an
interface to ask Xen to do the copy on behalf of the VM. The advantage
is that the hypervisor can validate the input, preventing a class of
attacks where a VM changes the buffer while the other VM is still
reading it.
(This approach was not considered at the time of writing STR-14.)
Hi,
So an update on a potential demo platform for the AGL demo.
Platform:
MACCHIATObin 1.2 board
+ manually applied PCIe reset fix
+ removed JTAG & hacked PCIe
+ NVIDIA GK208B (originally from SynQaucer)
+ EFI firmware on sdcard from Steve McIntyre
I have a fairly stable box running Debian Buster + backported kernel +
testing Grub (for Xen) with managed graphics (via nouveau) and working
sound (via a USB audio dongle). With the following caveats:
I use the DT system info instead of ACPI.
I keep hitting ESC on the serial console at the boot prompt and
selecting "Reset" until the graphics card wakes up and displays the EFI
splash screen on the monitor.
and
Only kernel 5.6.0-0.bpo.2-arm64 #1 SMP Debian 5.6.14-2~bpo10+1 (2020-06-09)
While I can get a newer kernel to build and boot, the graphics either
comes up corrupted or keeps crashing the X server with a kernel-related
nouveau bug. I've tried to narrow it down, but both vanilla 5.6.0 and
5.6.14 have the same problem, so I suspect there is a patch in the
Debian series which fixes it and might need upstreaming.
(aside: anyone know how to make a traditional bisectable git tree from
a series of Debian patches?)
KVM works and I'm able to boot up my test kernel and get to a userspace
prompt on the serial console. I still have to boot up with virtio-gpu
and prove that works.
I've been unable to get Xen up and running on it so far. I've tested
with vanilla and also the current testing tree that I finally got
working on QEMU TCG. I have gotten some Xen diagnostics which I've
dumped into a separate thread so as not to confuse things too much.
Subject: Current status of Xen on MachiatoBin
Date: Fri, 02 Oct 2020 17:00:43 +0100
Message-ID: <87lfgomxck.fsf(a)linaro.org>
--
Alex Bennée
All
We have an interesting agenda today [1] with updates on RPMB and the
demo; *however, the call has been moved one hour later to avoid a
conflict with Linaro's Town Hall all-hands meeting.*
Mike
[1]
https://collaborate.linaro.org/display/STR/2020-10-01+Project+Stratos+Sync+…
--
Mike Holmes | Director, Foundation Technologies, Linaro
Mike.Holmes(a)linaro.org <mike.holmes(a)linaro.org>
"Work should be fun and collaborative, the rest follows"