Thanks Jean-Philippe for describing the problem in an abstract way.
I would like to share considerations related to establishing data paths between a VM (frontend) and a device through another VM (backend).
Sharing model
---------------
The VM in direct contact with the device may provide driver service to just a single other VM. This simplifies inbound traffic handling as there is no marshalling of data to different VMs. Shortcuts can be implemented and performance can be dramatically different. We thus have two models: passthrough with abstraction (to virtio) vs. a shared model.
IO Model
---------------
There are two main ways to orchestrate IO around memory regions for data and rings of descriptors for outbound requests or input data. The key structural difference is how inbound traffic is handled:
- the device is given a list of buffers, pointed to through pre-populated rings [that is almost the rule on PCI adapters, with a handful of exceptions]
- the device is given one or more unstructured memory areas plus rings, and allocates memory as it pleases within the unstructured area(s) [this is very common on Arm platform devices]
The structured buffer method can be understood as computer-driven IO, the other as device-driven IO. Computer-driven IO allows building zero-copy data paths from device to frontend VM (if other conditions are met), while device-driven IO imposes a bounce buffer for incoming traffic, though it may be zero-copy for outbound traffic. Coherent interconnects such as CCIX and CXL do not fundamentally change the IO model but do change the driver architecture.
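To make the distinction concrete, here is a generic sketch of the two inbound models in C; the structure and field names are made up for illustration and do not come from any particular driver.

  #include <stdint.h>

  /* Computer-driven IO: the driver pre-populates each RX descriptor with a
   * buffer it chose, so the device writes inbound data straight into memory
   * owned by the driver (a prerequisite for a zero-copy path to the
   * frontend VM). */
  struct rx_desc_prepopulated {
          uint64_t buf_addr;      /* buffer address chosen by the driver */
          uint32_t buf_len;
          uint32_t status;        /* written back by the device on completion */
  };

  /* Device-driven IO: the driver only hands the device one or more
   * unstructured areas; the device places packets wherever it likes inside
   * them and reports the location in a completion, which typically forces a
   * bounce copy toward the frontend VM. */
  struct rx_area {
          uint64_t base;          /* unstructured region given to the device */
          uint64_t size;
  };

  struct rx_completion {
          uint32_t area_id;       /* which area the device used */
          uint32_t data_len;
          uint64_t data_offset;   /* offset chosen by the device */
  };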
Memory element
---------------
What is the smallest unit of memory that can be shared between VMs? Naturally this is a page, but the page size may not be controllable: RHEL on arm64, for instance, uses 64KB pages. Using a 64KB page to hold a 64-byte network packet or audio sample is not really efficient. Sub-page allocations are possible but would incur a significant performance penalty, as the VMM would have to validate each sub-page memory access, resulting in a VM exit for every check.
Underlay networking
---------------
In cloud infrastructure, VM network traffic is always encapsulated with protocols such as GENEVE, VXLAN... It is also frequent to have a smartNIC that marshals traffic to VMs and host through distinct pseudo devices. In telecom parlance the raw network is called the underlay network. So there may be big differences in achieving the goal depending on the use case (to be defined and assessed):
- cloud native: there is an underlay network
- smartphone virtualization: there may be an underlay network to differentiate between user traffic on different APNs, as well as operator traffic dedicated to the SIM card and other cases
- virtualization in the car: probably no underlay network
The presence of an underlay network is significant for the use case: with proper hashing of inbound traffic (with or without a pseudo device), traffic for a frontend VM can be associated with a dedicated set of queues. This assignment may help work around IO model constraints with device-controlled IO and provide a way to do zero (data) copy from device to frontend VM.
User land IO
---------------
I think the use of userland IO (in the backend and/or the frontend VM) may impact the data path in various ways, so this should be factored into the analysis. A key element of performance: use metadata (prepended or appended to the data) to convey information rather than the ring descriptor, which is in uncached memory (we have concrete examples, and the cost of the different strategies can amount to up to 50% of the base performance).
Virtio
---------------
There are significant performance differences between virtio 1.0 and virtio 1.1: virtio 1.0 touches something like 6 cache lines to insert an element in a queue, while it is only one with virtio 1.1 (not sure about the exact numbers but they should not be far off). As a reference, 6WIND has a VM-to-VM network driver that is not virtio based and can go beyond 100Gbps per x86 vCPU. So I expect virtio 1.1 to reach that level of performance.
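For reference, the descriptor layouts below paraphrase the virtio 1.0 split ring and virtio 1.1 packed ring formats (the Linux names are vring_desc and vring_packed_desc in include/uapi/linux/virtio_ring.h; the names used here are just for the example). With the split ring the producer writes the descriptor table entry, an avail ring slot and the avail index, and later reads the used ring and used index, spread over several cache lines; with the packed ring a single 16-byte descriptor carries address, length, buffer id and the availability flags, so inserting an element touches essentially one cache line.

  #include <stdint.h>

  /* virtio 1.0 split ring: descriptor table entry; the avail and used rings
   * are separate structures, hence the extra cache lines per insertion. */
  struct desc_split {
          uint64_t addr;
          uint32_t len;
          uint16_t flags;
          uint16_t next;          /* chaining within the descriptor table */
  };

  /* virtio 1.1 packed ring: one ring of descriptors shared by driver and
   * device; AVAIL/USED wrap-count bits in flags replace the avail/used rings. */
  struct desc_packed {
          uint64_t addr;
          uint32_t len;
          uint16_t id;            /* buffer id returned on completion */
          uint16_t flags;
  };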
Memory allocation backend
---------------
Shared memory allocation shall be controlled by the VMM (Qemu, Xen...) but the VMM may further delegate it to FF-A. I think there have been efforts in FF-A to find memory with the proper attributes (coherent between the device and all VMs). I frankly have no clue here but this may be worth digging into.
A zero-copy DPDK based DNS on an Arm platform
---------------
This example is meant to prove a point: all aspects of the target use cases must be detailed with great care, otherwise ideas may not fly... (Please don't tell me I forgot this or that this does not fly; it is just to give an idea of the level of detail required.)
Assumptions:
- multiple frontend VMs, with an underlay network based on VLANs
- the network interface can allocate multiple queues for one VLAN and thus have multiple sets of queues for multiple VLANs
- the IO model can be either device or computer driven, it does not matter here
- data memory can be associated to a VLAN/VM
- DPDK based frontend DNS application
Initialization (this is to highlight that it is entirely possible for a simple data path to be built on an impossible initialization scenario: initialization must be an integral part of the description of the solution):
- orchestration spawns the backend VM with the assigned device
- orchestration spawns the frontend VM, which creates a virtio-net with private memory
- orchestration informs the backend VM to assign the traffic of a given VLAN to a frontend VM; the backend VM creates the queues for the VLAN, the virtio-net with the packet memory wrapped around the memory assigned for the queues, and the "bridge" configuration between this VLAN and the virtio
- the backend VM informs orchestration to bind the newly created virtio-net memory to the frontend memory
- orchestration asks the frontend VMM to wrap its virtio-net to the specified area
- the frontend DPDK application listens on UDP port 53 and is bound to the virtio-net device (DPDK apps are bound to devices, not IP addresses)
- IP configuration of the DPDK app is out of scope
Inbound traffic:
- traffic is received in device SRAM
- the packet is marshalled to a memory area associated with a VLAN (i.e. a VM)
- the descriptor is updated to point to this packet
- the backend VM kernel handles the queues (IRQ, busy polling...) and creates virtio descriptors pointing to the data as part of the bridging (stripping the underlay network VLAN tag)
- the frontend DPDK app reads the descriptor and the packet
- if it is DNS at the expected IP, the application handles the packet, otherwise it drops it
Outbound traffic:
- the DPDK DNS application gets a packet buffer from the device memory (i.e. from a memory pool accessible by the device) and populates the response
- the DPDK lib forms the descriptor in virtio
- the backend VM kernel gets the packet, forms a descriptor in the right queue (adding the underlay network VLAN tag) and rings the doorbell
- the hardware gets the descriptor and retrieves the DNS response (which comes directly from frontend memory)
- the hardware copies the packet into SRAM and it is serialized on the wire
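As an illustration of the frontend side of the data path above, here is a minimal DPDK busy-poll sketch. It assumes the application is bound to the virtio-net port as port 0, queue 0, and ignores VLANs, IPv4 options and checksum handling; build_dns_reply() is a hypothetical helper, not a DPDK API.

  #include <netinet/in.h>
  #include <rte_byteorder.h>
  #include <rte_ethdev.h>
  #include <rte_ether.h>
  #include <rte_ip.h>
  #include <rte_mbuf.h>
  #include <rte_udp.h>

  #define BURST 32

  static void build_dns_reply(struct rte_mbuf *m);        /* hypothetical helper */

  static void dns_poll_loop(uint16_t port)
  {
          struct rte_mbuf *pkts[BURST];

          for (;;) {
                  uint16_t n = rte_eth_rx_burst(port, 0, pkts, BURST);

                  for (uint16_t i = 0; i < n; i++) {
                          struct rte_ether_hdr *eth =
                                  rte_pktmbuf_mtod(pkts[i], struct rte_ether_hdr *);
                          struct rte_ipv4_hdr *ip = (struct rte_ipv4_hdr *)(eth + 1);
                          struct rte_udp_hdr *udp = (struct rte_udp_hdr *)(ip + 1);

                          if (eth->ether_type != rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4) ||
                              ip->next_proto_id != IPPROTO_UDP ||
                              udp->dst_port != rte_cpu_to_be_16(53)) {
                                  rte_pktmbuf_free(pkts[i]);      /* not DNS: drop */
                                  continue;
                          }

                          /* Rewrite the mbuf in place into a DNS response; the data
                           * stays in the device-reachable mbuf pool, so TX can stay
                           * zero-copy. */
                          build_dns_reply(pkts[i]);

                          if (rte_eth_tx_burst(port, 0, &pkts[i], 1) == 0)
                                  rte_pktmbuf_free(pkts[i]);      /* TX queue full */
                  }
          }
  }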
On Fri, 2 Oct 2020 at 15:44, Jean-Philippe Brucker via Stratos-dev stratos-dev@op-lists.linaro.org wrote:
Hi,
I've looked in more details at limited memory sharing (STR-6, STR-8, STR-15), mainly from the Linux guest perspective. Here are a few thoughts.
Problem
We have a primary VM running a guest, and a secondary one running a backend that manages one hardware resource (for example network access). They communicate with virtio (for example virtio-net). The guest implements a virtio driver, the backend a virtio device. Problem is, how to ensure that the backend and guest only share the memory required for the virtio communication, and that the backend cannot access any other memory from the guest?
Static shared region
Let's first look at static DMA regions. The hypervisor allocates a subset of memory to be shared between guest and backend. The hypervisor communicates this per-device DMA restriction to the guest during boot. It could be using a firmware property, or a discovery protocol. I did start drafting such a protocol in virtio-iommu, but I now think the reserved-memory mechanism in device-tree, below, is preferable. Would we need an equivalent for ACPI, though?
How would we implement this in a Linux guest? The virtqueue of a virtio device has two components. Static ring buffers, allocated at boot with dma_alloc_coherent(), and the actual data payload, mapped with dma_map_page() and dma_map_single(). Linux calls the former "coherent" DMA, and the latter "streaming" DMA.
Coherent DMA can already obtain its pages from a per-device memory pool. dma_init_coherent_memory() defines a range of physical memory usable by a device. Importantly this region has to be distinct from system memory RAM and reserved by the platform. It is mapped non-cacheable. If it exists, dma_alloc_coherent() will only get its pages from that region.
On the other hand streaming DMA doesn't allocate memory. The virtio drivers don't control where that memory comes from, since the pages are fed to them by an upper layer of the subsystem, specific to the device type (net, block, video, etc). Often they are pages from the slab cache that contain other unrelated objects, a notorious problem for DMA isolation. If the page is not accessible by the device, swiotlb_map() allocates a bounce buffer somewhere more convenient and copies the data when needed.
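To make the two styles concrete, here is a rough sketch using the standard Linux DMA API, not taken from any existing driver: a coherent allocation for the ring, which can come from a per-device pool, and a streaming mapping for a payload handed down by an upper layer, which swiotlb may transparently bounce if the device cannot reach it.

  #include <linux/device.h>
  #include <linux/dma-mapping.h>
  #include <linux/errno.h>
  #include <linux/gfp.h>

  struct my_queue {
          void            *ring;          /* CPU address of the ring */
          dma_addr_t      ring_dma;       /* device-visible address */
  };

  static int setup_queue(struct device *dev, struct my_queue *q, size_t ring_size)
  {
          /* "Coherent" DMA: may be served from a per-device restricted pool. */
          q->ring = dma_alloc_coherent(dev, ring_size, &q->ring_dma, GFP_KERNEL);
          return q->ring ? 0 : -ENOMEM;
  }

  static dma_addr_t map_payload(struct device *dev, void *buf, size_t len)
  {
          /* "Streaming" DMA: the buffer comes from an upper layer (e.g. an skb);
           * swiotlb may substitute a bounce buffer here.  The caller must check
           * the result with dma_mapping_error(). */
          return dma_map_single(dev, buf, len, DMA_TO_DEVICE);
  }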
At the moment the bounce buffer is allocated from a global pool in the low physical pages. However a recent proposal by Chromium would add support for per-device swiotlb pools:
https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromium...
And quoting Tomasz from the discussion on patch 4:
For this, I'd like to propose a "restricted-dma-region" (feel free to suggest a better name) binding, which is explicitly specified to be the only DMA-able memory for this device and make Linux use the given pool for coherent DMA allocations and bouncing non-coherent DMA.
That seems to be precisely what we need. Even when using the virtio-pci transport, it is still possible to define per-device properties in the device-tree, for example:
	/* PCI root complex node */
	pcie@10000000 {
		compatible = "pci-host-ecam-generic";

		/* Add properties to endpoint with BDF 00:01.0 */
		ep@0008 {
			reg = <0x00000800 0 0 0 0>;
			restricted-dma-region = <&dma_region_1>;
		};
	};

	reserved-memory {
		/* Define 64MB reserved region at address 0x50400000 */
		dma_region_1: restricted_dma_region {
			reg = <0x50400000 0x4000000>;
		};
	};
Dynamic regions
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation. But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient. Instead of copying large buffers the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
Since it depends on the device, I guess we'll need a survey of memory access patterns by the different virtio devices that we're considering. In the end a mix of both solutions might be necessary.
Thanks, Jean
[1] https://lists.oasis-open.org/archives/virtio-dev/202006/msg00037.html
On Sat, Oct 3, 2020 at 12:20 PM François Ozog francois.ozog@linaro.org wrote:
IO Model
...
- the device is provided an unstructured memory area (or multiple) and
rings. The device does memory allocation as it pleases on the unstructured area(s) [this is very common on Arm platform devices] The structured buffer method is to be understood as computer driven IO while the other is device driven IO.
I still need to read up on this. Can you point to a device driver that implements this so I can see the implications?
User land IO
I think the use of userland IO (in the backend and/or the fronted VM) may impact the data path in various ways and thus this should be factored in the analysis. Key elements of performance: use metata data (prepend, postpend to data) to get information rather than ring descriptor which is in uncached memory (we have concrete examples and the cost of the different strategies may be up to 50% of the base performance)
Can you clarify why any of the descriptors would be uncached? Do you mean just the descriptors of the physical device in case of a noncoherent bus, or also shared memory between the guests?
Virtio
There are significant performance different between virtio 1.0 and virtio 1.1: virtio 1.0 touches something like 6 cachelines to insert an element in a queue while it is only one with virtio 1.1 (note sure about the numbers but should not be far). as a reference 6WIND has a VM 2 VM network driver that is not virtio based and can go beyond 100Gbps per x86 vCPU. So I expect virtio 1.1 to reach that level of performance.
I tried to get some information about the 6WIND driver but couldn't find it. Do you have a link to their sources, or do you know what they do specifically?
Memory allocation backend
Shared memory allocation shall be controlled by the VMM (Qemu, Xen...) but the VMM may further ask FF-A to do so. I think there has been efforts in the FF-A to find memory with proper attributes (coherent between device and all VMs). I frankly have no clue here but this may be worth digging into.
I would expect any shared memory between guests to just use the default memory attributes: for incoming data you end up copying each packet twice anyway (from the hw driver allocated buffer to the shared memory, and from shared memory to the actual skb); for outbound data over a noncoherent device, the driver needs to take care of proper barriers and cache management.
Just to clarify, the shared memory buffer in this case would be a fixed host-physical buffer, rather than a fixed guest-physical location with page flipping, right?
inbound traffic: traffic is received in device SRAM packet is marshalled to a memory associated to a VLAN (i.e. VM). descriptor is updated to point to this packet backend VM kernel handle queues (IRQ, busy polling...), create virtio descriptors pointing to data as part of the bridging (stripping the underlay network VLAN tag) front end DPDK app read the descriptor and the packet if DNS at expected IP, application handle the packet otherwise drop
Would you expect the guests in this scenario to run simultaneously on different CPUs and send IPIs between them, or would each queue always be on a fixed CPU across both guests?
Arnd
Thanks for your comments Arnd. Some responses below.
On Mon, 5 Oct 2020 at 16:41, Arnd Bergmann arnd@linaro.org wrote:
On Sat, Oct 3, 2020 at 12:20 PM François Ozog francois.ozog@linaro.org wrote:
IO Model
...
- the device is provided an unstructured memory area (or multiple) and
rings. The device does memory allocation as it pleases on the unstructured area(s) [this is very common on Arm platform devices] The structured buffer method is to be understood as computer driven IO while the other is device driven IO.
I still need to read up on this. Can you point to a device driver that implements this so I can see the implications?
For x86 PCI: Chelsio T5+ adapters: the ring is associated with one or more huge pages. Chelsio can run in the two models. I am not sure the upstream driver offers this possibility, you may want to check the one Chelsio provides publicly. https://www.netcope.com/en/products/netcopep4 ( https://github.com/DPDK/dpdk/blob/main/drivers/net/szedata2/rte_eth_szedata2... not checked the code) is also quite interesting to check out. They have multiple DMA models (more than Chelsio) depending on the use case. They have both open source and closed source stuff. I don't know what is currently open.
For Arm: NXP DPAA2: each ring is associated with a fixed set of arrays (n rings with p arrays: a total of n*p arrays), an array for packets up to 64 bytes, up to 128 bytes, up to 256 bytes... The ring is not pre-populated with packet pointers as you don't know the incoming size. The hardware places inbound packets into the array corresponding to the size and updates the ring descriptor accordingly.
User land IO
I think the use of userland IO (in the backend and/or the fronted VM) may impact the data path in various ways and thus this should be factored in the analysis. Key elements of performance: use metata data (prepend, postpend to data) to get information rather than ring descriptor which is in uncached memory (we have concrete examples and the cost of the different strategies may be up to 50% of the base performance)
Can you clarify why any of the descriptors would be uncached? Do you mean just the descriptors of the physical device in case of a noncoherent bus, or also shared memory between the guests?
Two methods I am aware of to know if there is any work to do:
- read an MMIO register
- poll ring descriptors (depends on HW capability)
As far as I have seen so far, the slowest is the MMIO read. On all hardware on which I used busy polling of the ring descriptors, those have been uncached.
Between guests, which is close to host/VM: I am not sure what the virtio policy on ring mapping is, but I assume this is also uncached. I don't know the differences between virtio 1.0 and virtio 1.1 (so-called packed queues).
[Most of my network activities have been in the context of telecom. In that context, one could argue that a link not 100% used is a link that has a problem. So no need for IRQs, except at startup: I always think of busy polling to get ultra low latency and low jitter.]
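A generic sketch of the two approaches, not tied to any particular NIC; the descriptor layout and the DONE bit are made up for the example:

  #include <stdbool.h>
  #include <stdint.h>

  struct desc {
          uint64_t addr;
          uint32_t len;
          uint32_t flags;                 /* bit 0: completed by the device */
  };
  #define DESC_DONE 0x1u

  /* Method 1: read an MMIO status/producer-index register (uncached, and
   * potentially a trap when done from inside a VM). */
  static bool work_pending_mmio(volatile uint32_t *status_reg, uint32_t last_seen)
  {
          return *status_reg != last_seen;
  }

  /* Method 2: busy-poll the next ring descriptor; the device (or the other
   * side of a virtqueue) updates this memory directly. */
  static bool work_pending_poll(volatile struct desc *ring, uint32_t head)
  {
          return ring[head].flags & DESC_DONE;
  }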
Virtio
There are significant performance different between virtio 1.0 and virtio 1.1: virtio 1.0 touches something like 6 cachelines to insert an element in a queue while it is only one with virtio 1.1 (note sure about the numbers but should not be far). as a reference 6WIND has a VM 2 VM network driver that is not virtio based and can go beyond 100Gbps per x86 vCPU. So I expect virtio 1.1 to reach that level of performance.
I tried to get some information about the 6WIND driver but couldn't find it. Do you have a link to their sources, or do you know what they do specifically?
No. Closed source.
Memory allocation backend
Shared memory allocation shall be controlled by the VMM (Qemu, Xen...) but the VMM may further ask FF-A to do so. I think there has been efforts in the FF-A to find memory with proper attributes (coherent between device and all VMs). I frankly have no clue here but this may be worth digging into.
I would expect any shared memory between guests to just use the default memory attributes: For incoming data you end up copying each packet twice anyway (from the hw driver allocated buffer to the shared memory, and from shared memory to the actual skb; for outbound data over a noncoherent device, the driver needs to take care of proper barriers and cache management).
Some use cases may allow zero-copy in one direction only or two directions.
Zero-copy is feasible, but is it desirable? What the actual trustworthiness of the construct is remains to be seen. An skb may or may not be present in the backend if DPDK is used to feed the virtio-net device, and the same may apply on the frontend side. It's not just about sharing memory. Some accelerators have limits on "physical address" and bus visibility (according to member discussion). FF-A is able to identify memory that can be used between normal world entities, between normal and secure world, and between normal world and device. So when you design a full zero-copy data path you may need to control the memory with the help of FF-A.
Just to clarify, the shared memory buffer in this case would be a fixed host-physical buffer, rather than a fixed guest-physical location with page flipping, right?
From our learnings on mediated devices (vfio stuff), I wouldn't play with
physical memory. If MMU and SMMU don't play a key role in orchestrating address spaces, I am not sure what I describe is feasible.
inbound traffic:
traffic is received in device SRAM packet is marshalled to a memory associated to a VLAN (i.e. VM). descriptor is updated to point to this packet backend VM kernel handle queues (IRQ, busy polling...), create virtio descriptors pointing to data as part of the bridging (stripping the underlay network VLAN tag) front end DPDK app read the descriptor and the packet if DNS at expected IP, application handle the packet otherwise drop
Would you expect the guests in this scenario to run simultaneously on different CPUs and send IPIs between them, or would each queue always be on a fixed CPU across both guests?
With DPDK in both VMs I would use vCPU pinning. IPIs are not desired as they cause way too much delay at very high performance, since an IPI also corresponds to a VM exit. In any case, DPDK busy polling makes the IPI unnecessary and results in zero-exit operation.
Arnd
On Tue, Oct 6, 2020 at 12:06 AM François Ozog francois.ozog@linaro.org wrote:
Thanks for you comments Arnd. Some responses below.
On Mon, 5 Oct 2020 at 16:41, Arnd Bergmann arnd@linaro.org wrote:
On Sat, Oct 3, 2020 at 12:20 PM François Ozog francois.ozog@linaro.org
wrote:
IO Model
...
- the device is provided an unstructured memory area (or multiple) and
rings. The device does memory allocation as it pleases on the unstructured area(s) [this is very common on Arm platform devices] The structured buffer method is to be understood as computer driven IO while the other is device driven IO.
I still need to read up on this. Can you point to a device driver that implements this so I can see the implications?
For x86 PCI: Chelsio T5+ adapters: the ring is associated with one or
more huge pages. Chelsio can run on the two models. I am not sure the upstream driver is offering this possibility, you may want to check the Chelsio pubicly provided one.
https://github.com/DPDK/dpdk/blob/main/drivers/net/szedata2/rte_eth_szedata2... not cheked the code) is also quite interesting to checkout. They have multiple DMA models (more than Chelsio) depending on the usecase. They have both open source and closed source stuff. I don't know what is currently open.
For Arm: NXP DPAA2: each ring is associated with a fixed set of arrays (n
rings with p arrays : total of n*p arrays), an array for 64 packets, up 128 packets, up to 256 packets.... The ring is not prepopulated with packets pointers as you don't know the incoming size. The hardware placed inbound packets into the array corresponding to the size and update the ring descriptor accordingly.
Ok, got it now. So these would both be fairly unusual adapters and only used for multi-gigabit networking.
User land IO
I think the use of userland IO (in the backend and/or the fronted VM) may impact the data path in various ways and thus this should be factored in the analysis. Key elements of performance: use metata data (prepend, postpend to data) to get information rather than ring descriptor which is in uncached memory (we have concrete examples and the cost of the different strategies may be up to 50% of the base performance)
Can you clarify why any of the descriptors would be uncached? Do you mean just the descriptors of the physical device in case of a noncoherent bus, or also shared memory between the guests?
two methods I am aware of to know if there are any work to do:
- read an MMIO register
- poll ring descriptors (depends on HW capability)
As far as I have seen so far, the slowest is the MMIO. On all hardware on which I used busy polling of the ring descriptors,
those have been uncached.
Between guests, which is close to host/VM: I am not sure what is virtio
policy on ring mapping, but I assume this is also uncached.
No.
The only reason ring descriptors are uncached on low-end Arm SoCs is because they are on a noncoherent bus and uncached mappings can be used as a workaround to maintain a coherent view of the ring between hardware and the OS. In case of virtio, the rings are always cached because both sides are on the CPUs that are connected over a coherent bus. This might be different if the virtio device implementation is on a remote CPU behind a noncoherent bus, but I don't think we support that case at the moment.
The MMIO registers of the virtio device may appear to be mapped as uncached to the guest OS, but they don't actually point to memory here. Instead, the MMIO registers are implemented by trapping into the hypervisor that then handles the side effects of the access, such as forwarding the buffers to a hardware device.
Don't know the differences between virtio 1.0 and virtio 1.1 (so called
packed queues)
[Most of my network activities have been in the context of telecom. In
that context, one could argue that a link not 100% used is a link that has a problem. So not need for IRQs, except at startup: I always think of busy polling to get ultra low latency and low jitter]
This is a bit different in the kernel: polling on a register is generally seen as a waste of time, so the kernel always tries to batch as much work as it can without introducing latency elsewhere, and then let the CPU do useful work elsewhere while a new batch of work piles up. ;-)
Just to clarify, the shared memory buffer in this case would be a fixed host-physical buffer, rather than a fixed guest-physical location with page flipping, right?
From our learnings on mediated devices (vfio stuff), I wouldn't play with
physical memory. If MMU and SMMU don't play a key role in orchestrating address spaces, I am not sure what I describe is feasible.
So you would lean towards what Jean-Philippe explained regarding the use of vIOMMU in the front-end guest to be able to use an unmodified set of virtio drivers on guest pages, while using the dynamic iommu mappings to control which pages are shared with the back-end, right?
I think this can work in general, but it seems to introduce a ton of complexity in the host, in particular with a type-2 hypervisor like KVM, and I'm not convinced that the overhead of trapping into the hypervisor for each iotlb flush as well as manipulating the page tables of a remote guest from it is lower than just copying the data through the CPU cache.
In the case of KVM, I would guess that the backend guest has to map an object for the queue into its guest-physical address space as a vm_area_struct, possibly coming from a memfd or dmabuf fd that was shared by the front-end guest. Performing the dynamic remapping of that VMA from host user space (qemu, kvmtool, ...) would certainly mean a high runtime overhead, while doing it in the host kernel alone may not be acceptable for mainline kernels because of the added complexity in too many sensitive areas of the kernel (memory management / file systems, kvm, vhost/virtio).
inbound traffic: traffic is received in device SRAM packet is marshalled to a memory associated to a VLAN (i.e. VM). descriptor is updated to point to this packet backend VM kernel handle queues (IRQ, busy polling...), create virtio descriptors pointing to data as part of the bridging (stripping the underlay network VLAN tag) front end DPDK app read the descriptor and the packet if DNS at expected IP, application handle the packet otherwise drop
Would you expect the guests in this scenario to run simultaneously on different CPUs and send IPIs between them, or would each queue always be on a fixed CPU across both guests?
With DPDK in both VMs I would use vCPU pinning. IPIs are not desired as
it cause way to much delay in very high performance as it also corresponds to a VM exit.
In any case DPDK busy polling make the IPI and results in zero-exit
operations.
Wouldn't user space spinning on the state of the queue be a bug in this scenario, if the thing that would add work to the queue is prevented from running because of the busy guest?
Arnd
Thanks for the discussion. More thoughts on vIOMMU below, in case we want to try this solution in addition to the pre-shared one.
On Tue, Oct 06, 2020 at 02:46:02PM +0200, Arnd Bergmann wrote:
From our learnings on mediated devices (vfio stuff), I wouldn't play with
physical memory. If MMU and SMMU don't play a key role in orchestrating address spaces, I am not sure what I describe is feasible.
So you would lean towards what Jean-Philippe explained regarding the use of vIOMMU in the front-end guest to be able to use an unmodified set of virtio drivers on guest pages, while using the dynamic iommu mappings to control which pages are shared with the back-end, right?
I think this can work in general, but it seems to introduce a ton of complexity in the host, in particular with a type-2 hypervisor like KVM, and I'm not convinced that the overhead of trapping into the hypervisor for each iotlb flush as well as manipulating the page tables of a remote guest from it is lower than just copying the data through the CPU cache.
In the case of KVM, I would guess that the backend guest has to map an object for the queue into its guest-physical address space as a vm_area_struct, possibly coming from a memfd or dmabuf fd that was shared by the front-end guest. Performing the dynamic remapping of that VMA from host user space (qemu, kvmtool, ...) would certainly mean a high runtime overhead, while doing it in the host kernel alone may not be acceptable for mainline kernels because of the added complexity in too many sensitive areas of the kernel (memory management / file systems, kvm, vhost/virtio).
With a vIOMMU the frontend wouldn't populate the backend's page tables and send IOTLB flushes. And I don't think the hypervisor could touch the backend's stage-1 page tables either. Neither do grant tables, I believe, though I still have some reading to do there. The frontend would send MAP and UNMAP messages to the host. Roughly:
MAP {
	u32 domain_id;	// context (virtio device)
	u32 flags;	// permissions (read|write)
	u64 iova;	// an arbitrary virtual address
	u64 gpa;	// guest#1 physical address
	u64 size;	// multiple of guest#1 page size
};
The hypervisor receives this and finds the phys page (pa) corresponding to g1pa. Then:
(1) Allocates pages in the backend's guest-physical memory, at g2pa.
(2) Creates the stage-2 mapping g2pa->pa (for example in KVM KVM_SET_USER_MEMORY_REGION?).
(3) Tells the backend about the iova->g2pa mapping for the virtio device.
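As a rough illustration of step (2) from host user space, assuming a KVM-based backend VMM that already has the frontend page mapped at some host virtual address (host_va and the slot number are assumptions of the example):

  #include <linux/kvm.h>
  #include <stdint.h>
  #include <sys/ioctl.h>

  static int map_into_backend(int vm_fd, uint32_t slot,
                              uint64_t g2pa, void *host_va, uint64_t size)
  {
          struct kvm_userspace_memory_region region = {
                  .slot            = slot,
                  .flags           = 0,
                  .guest_phys_addr = g2pa,        /* backend guest-physical address */
                  .memory_size     = size,        /* multiple of the page size */
                  .userspace_addr  = (uint64_t)(uintptr_t)host_va,
          };

          /* Installs (or updates) the stage-2 mapping g2pa -> pa for guest#2. */
          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
  }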
Then frontend builds the virtqueue or publishes buffers, using an IOVA within the range of what has been mapped so far. The backend translates that to g2pa, then to a g2va and accesses the buffer. When done the frontend unmaps the page:
UNMAP {
	u32 domain_id;
	u64 iova;
	u64 size;
};
The host tells the backend and tears down the stage-2 mapping.
I would get rid of steps (1) and (3) by having the frontend directly allocate iova=g2pa. It requires the host to reserve a fixed range of guest#2 physical memory upfront (but not backed by any physical pages at that point), and tell the guest about usable iova ranges (already supported by Linux).
That way there wouldn't be any context switch to the backend during MAP/UNMAP, and the backend implementation is the same as with pre-shared memory.
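A minimal sketch of that iova = g2pa allocation on the frontend side; the window base and size, and the trivial bitmap allocator, are made-up assumptions rather than an existing interface:

  #include <stdint.h>

  #define IOVA_WINDOW_BASE   0x100000000ull      /* reserved guest#2 physical window */
  #define IOVA_WINDOW_PAGES  4096
  #define FE_PAGE_SIZE       4096ull

  static uint8_t iova_used[IOVA_WINDOW_PAGES];   /* one byte per page, for clarity */

  /* Returns an IOVA that is, by construction, also the backend's g2pa. */
  static uint64_t iova_alloc(void)
  {
          for (uint32_t i = 0; i < IOVA_WINDOW_PAGES; i++) {
                  if (!iova_used[i]) {
                          iova_used[i] = 1;
                          return IOVA_WINDOW_BASE + i * FE_PAGE_SIZE;
                  }
          }
          return 0;       /* window exhausted */
  }

  static void iova_free(uint64_t iova)
  {
          iova_used[(iova - IOVA_WINDOW_BASE) / FE_PAGE_SIZE] = 0;
  }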
Thanks, Jean
--- A diagram with the notations I'm using. We have two virt-phys translations pointing to the same physical page. Each guest manages their stage-1 page tables, and the hypervisor manages stage-2. In the above discussion, stage-2 of guest 1 (frontend) is static, while stage-2 of guest 2 (backend) is modified dynamically.
 guest 1        guest 1                      guest 2        guest 2
 stage-1 PT     stage-2 PT     phys mem      stage-2 PT     stage-1 PT

 g1va --------> g1pa --------> pa <--------- g2pa <--------- g2va
On Fri, 9 Oct 2020 at 15:46, Jean-Philippe Brucker jean-philippe@linaro.org wrote:
Thanks for the discussion. More thoughts on vIOMMU below, in case we want to try this solution in addition to the pre-shared one.
On Tue, Oct 06, 2020 at 02:46:02PM +0200, Arnd Bergmann wrote:
From our learnings on mediated devices (vfio stuff), I wouldn't play
with
physical memory. If MMU and SMMU don't play a key role in orchestrating address spaces, I am not sure what I describe is feasible.
So you would lean towards what Jean-Philippe explained regarding the use of vIOMMU in the front-end guest to be able to use an unmodified set of virtio drivers on guest pages, while using the dynamic iommu
mappings
to control which pages are shared with the back-end, right?
I think this can work in general, but it seems to introduce a ton of
complexity
in the host, in particular with a type-2 hypervisor like KVM, and I'm not convinced that the overhead of trapping into the hypervisor for each iotlb flush as well as manipulating the page tables of a remote guest from it is lower than just copying the data through the CPU cache.
In the case of KVM, I would guess that the backend guest has to map an object for the queue into its guest-physical address space as a vm_area_struct, possibly coming from a memfd or dmabuf fd that was shared by the front-end guest. Performing the dynamic remapping of that VMA from host user space (qemu, kvmtool, ...) would certainly mean a high runtime overhead, while doing it in the host kernel alone may not be acceptable for mainline kernels because of the added complexity in too many sensitive areas of the kernel (memory management / file systems, kvm, vhost/virtio).
With a vIOMMU the frontend wouldn't populate the backend's page tables and send IOTLB flushes. And I don't think the hypervisor could touch the backend's stage-1 page tables either. Neither do grant tables, I believe, though I still have some reading to do there. The frontend would send MAP and UNMAP messages to the host. Roughly:
MAP {
	u32 domain_id;	// context (virtio device)
	u32 flags;	// permissions (read|write)
	u64 iova;	// an arbitrary virtual address
	u64 gpa;	// guest#1 physical address
	u64 size;	// multiple of guest#1 page size
};
The hypervisor receives this and finds the phys page (pa) corresponding to g1pa. Then:
(1) Allocates pages in the backend's guest-physical memory, at g2pa.
(2) Creates the stage-2 mapping g2pa->pa (for example in KVM KVM_SET_USER_MEMORY_REGION?).
(3) Tells the backend about the iova->g2pa mapping for the virtio device.
Then frontend builds the virtqueue or publishes buffers, using an IOVA within the range of what has been mapped so far. The backend translates that to g2pa, then to a g2va and accesses the buffer. When done the frontend unmaps the page:
UNMAP {
	u32 domain_id;
	u64 iova;
	u64 size;
};
we bumped into a complication of iommu group membership management if we want to have devices in a single IOVA. I don’t think it is relevant here but may be worth giving a thought just in case.
The host tells the backend and tears down the stage-2 mapping.
I would get rid of steps (1) and (3) by having the frontend directly allocate iova=g2pa. It requires the host to reserve a fixed range of guest#2 physical memory upfront (but not backed by any physical pages at that point), and tell the guest about usable iova ranges (already supported by Linux).
That way there wouldn't be any context switch to the backend during MAP/UNMAP, and the backend implementation is the same as with pre-shared memory.
Thanks, Jean
A diagram with the notations I'm using. We have two virt-phys translations pointing to the same physical page. Each guest manages their stage-1 page tables, and the hypervisor manages stage-2. In the above discussion, stage-2 of guest 1 (frontend) is static, while stage-2 of guest 2 (backend) is modified dynamically.
 guest 1        guest 1                      guest 2        guest 2
 stage-1 PT     stage-2 PT     phys mem      stage-2 PT     stage-1 PT

 g1va --------> g1pa --------> pa <--------- g2pa <--------- g2va
On Fri, Oct 09, 2020 at 04:00:17PM +0200, François Ozog wrote:
we bumped into a complication of iommu group membership management if we want to have devices in a single IOVA. I don’t think it is relevant here but may be worth giving a thought just in case.
Hmm, we might not be using the same terminology here (to me IOVA is "I/O virtual address"). Do you know which system had this problem? Linux puts two devices in the same IOMMU group when it's not possible to isolate them from each other with the IOMMU. For example when they are on a conventional PCI bus where DMA transactions can be snooped by other devices. Or with hardware bugs, for example when the PCIe hierarchy doesn't properly implement ACS isolation, or the IOMMU receives all DMA with RequesterID 0...
For stratos, IOMMU groups would be a problem when assigning a hardware device to a backend VM. But for virtio devices it's not a problem, the emulated bus won't have those isolation issues.
Thanks, Jean
On Fri, Oct 9, 2020 at 3:46 PM Jean-Philippe Brucker jean-philippe@linaro.org wrote:
On Tue, Oct 06, 2020 at 02:46:02PM +0200, Arnd Bergmann wrote:
From our learnings on mediated devices (vfio stuff), I wouldn't play with
physical memory. If MMU and SMMU don't play a key role in orchestrating address spaces, I am not sure what I describe is feasible.
So you would lean towards what Jean-Philippe explained regarding the use of vIOMMU in the front-end guest to be able to use an unmodified set of virtio drivers on guest pages, while using the dynamic iommu mappings to control which pages are shared with the back-end, right?
I think this can work in general, but it seems to introduce a ton of complexity in the host, in particular with a type-2 hypervisor like KVM, and I'm not convinced that the overhead of trapping into the hypervisor for each iotlb flush as well as manipulating the page tables of a remote guest from it is lower than just copying the data through the CPU cache.
In the case of KVM, I would guess that the backend guest has to map an object for the queue into its guest-physical address space as a vm_area_struct, possibly coming from a memfd or dmabuf fd that was shared by the front-end guest. Performing the dynamic remapping of that VMA from host user space (qemu, kvmtool, ...) would certainly mean a high runtime overhead, while doing it in the host kernel alone may not be acceptable for mainline kernels because of the added complexity in too many sensitive areas of the kernel (memory management / file systems, kvm, vhost/virtio).
With a vIOMMU the frontend wouldn't populate the backend's page tables and send IOTLB flushes. And I don't think the hypervisor could touch the backend's stage-1 page tables either. Neither do grant tables, I believe, though I still have some reading to do there. The frontend would send MAP and UNMAP messages to the host. Roughly:
MAP {
	u32 domain_id;	// context (virtio device)
	u32 flags;	// permissions (read|write)
	u64 iova;	// an arbitrary virtual address
	u64 gpa;	// guest#1 physical address
	u64 size;	// multiple of guest#1 page size
};
The hypervisor receives this and finds the phys page (pa) corresponding to g1pa. Then:
(1) Allocates pages in the backend's guest-physical memory, at g2pa.
(2) Creates the stage-2 mapping g2pa->pa (for example in KVM KVM_SET_USER_MEMORY_REGION?).
(3) Tells the backend about the iova->g2pa mapping for the virtio device.
Then frontend builds the virtqueue or publishes buffers, using an IOVA within the range of what has been mapped so far. The backend translates that to g2pa, then to a g2va and accesses the buffer. When done the frontend unmaps the page:
UNMAP {
	u32 domain_id;
	u64 iova;
	u64 size;
};
The host tells the backend and tears down the stage-2 mapping.
This is what I was trying to explain, but your explanation provides some of the details I wasn't sure about and is much clearer.
I did not mean that any changes to the backend's guest page tables would be required, but clearly the host page tables.
I think when you refer to the MAP/UNMAP operations, that's what I meant with the iotlb flushes: whichever operation one does that makes the device forget about a previous IOMMU mapping and use the new one instead. Depending on the architecture that would be a CPU instruction, hypercall, MMIO access or a message passed through the virtio-iommu ring.
I would get rid of steps (1) and (3) by having the frontend directly allocate iova=g2pa. It requires the host to reserve a fixed range of guest#2 physical memory upfront (but not backed by any physical pages at that point), and tell the guest about usable iova ranges (already supported by Linux).
Right, that would be ideal, but adds the complexity that guest 1 has to know about the available g2pa range. There is also a requirement that guest 2 needs to check the address in the ring to ensure they are within the bounds of that address range to prevent a data leak, but that is probably the case for any scenario.
That way there wouldn't be any context switch to the backend during MAP/UNMAP, and the backend implementation is the same as with pre-shared memory.
There are still two (direct queue) or four (indirect queue) map/unmap operations per update to the virtio ring, and each of these lead to an update of the host page tables for the secondary guest, plus the communication needed to trigger that update.
Arnd