I am pretty sure the reasons have to do with old x86 PV guests, so I am CCing Juergen and Boris.
Hi,
While we've been working on the rust-vmm virtio backends on Xen we obviously have to map guest memory into the userspace of the daemon. However, following the logic of what is going on is a little confusing. For example, in the Linux backend we have this:
    void *osdep_xenforeignmemory_map(xenforeignmemory_handle *fmem,
                                     uint32_t dom, void *addr,
                                     int prot, int flags, size_t num,
                                     const xen_pfn_t arr[/*num*/],
                                     int err[/*num*/])
    {
        int fd = fmem->fd;
        privcmd_mmapbatch_v2_t ioctlx;
        size_t i;
        int rc;

        addr = mmap(addr, num << XC_PAGE_SHIFT, prot, flags | MAP_SHARED,
                    fd, 0);
        if ( addr == MAP_FAILED )
            return NULL;

        ioctlx.num = num;
        ioctlx.dom = dom;
        ioctlx.addr = (unsigned long)addr;
        ioctlx.arr = arr;
        ioctlx.err = err;

        rc = ioctl(fd, IOCTL_PRIVCMD_MMAPBATCH_V2, &ioctlx);
Where the fd passed down is associated with the /dev/xen/privcmd device for issuing hypercalls on userspace's behalf. What is confusing is why the function does its own mmap - one would assume the passed addr would be associated with an anonymous or file-backed mmap region that the calling code has already set up. Applying a mmap to a special device seems a little odd.
Looking at the implementation on the kernel side it seems the mmap handler only sets a few flags:
    static int privcmd_mmap(struct file *file, struct vm_area_struct *vma)
    {
        /* DONTCOPY is essential for Xen because copy_page_range doesn't know
         * how to recreate these mappings */
        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTCOPY |
                         VM_DONTEXPAND | VM_DONTDUMP;
        vma->vm_ops = &privcmd_vm_ops;
        vma->vm_private_data = NULL;

        return 0;
    }
So can I confirm that the mmap of /dev/xen/privcmd is being called for its side effects? Is it so that, when the actual ioctl is called, the correct flags are set on the pages associated with the userspace virtual address range?
Can I confirm there shouldn't be any limitation on where and how the userspace virtual address space is set up for mapping in the guest memory?
Is there a reason why this isn't done in the ioctl path itself?
I'm trying to understand the differences between Xen and KVM in the API choices here. I think the equivalent is the KVM_SET_USER_MEMORY_REGION ioctl for KVM, which brings a section of the guest physical address space into the userspace's vaddr range.
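For comparison, here is a minimal sketch of the KVM side of that question. It only shows how the region descriptor is filled in and handed to the ioctl; the helper names describe_region() and register_region() are made up for illustration, and vm_fd is assumed to come from KVM_CREATE_VM elsewhere. The point is that with KVM the caller owns an ordinary userspace range first and then tells the kernel about it, whereas the Xen code above maps through the device and then issues the ioctl.

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: build the descriptor that tells KVM which userspace
 * range backs a guest-physical range. The caller already owns the memory
 * at host_va (for example from an ordinary anonymous mmap()). */
static struct kvm_userspace_memory_region
describe_region(uint32_t slot, uint64_t guest_pa, void *host_va, uint64_t size)
{
    struct kvm_userspace_memory_region r;

    memset(&r, 0, sizeof(r));
    r.slot = slot;
    r.guest_phys_addr = guest_pa;
    r.memory_size = size;
    r.userspace_addr = (uint64_t)(uintptr_t)host_va;
    return r;
}

/* Hypothetical helper: register the region with a VM file descriptor.
 * Returns 0 on success, -1 with errno set on failure. */
static int register_region(int vm_fd,
                           const struct kvm_userspace_memory_region *r)
{
    return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, r);
}
```

Nothing here needs a special mmap(); the userspace address can come from any mapping the caller already has.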
On 24.03.22 02:42, Stefano Stabellini wrote:
Is there a reason why this isn't done in the ioctl path itself?
For a rather long time we were using "normal" user pages for this purpose, which were just locked into memory for doing the hypercall.
Unfortunately there have been very rare problems with that approach, as the Linux kernel can set a user page related PTE to invalid for short periods of time, which led to EFAULT in the hypervisor when trying to access the hypercall data.
In Linux this can be avoided only by using kernel memory, which is the reason why the hypercall buffers are allocated and mmap()-ed through the privcmd driver.
I'm trying to understand the differences between Xen and KVM in the API choices here. I think the equivalent is the KVM_SET_USER_MEMORY_REGION ioctl for KVM, which brings a section of the guest physical address space into the userspace's vaddr range.
The main difference is just that the consumer of the hypercall buffer is NOT the kernel, but the hypervisor. In the KVM case both are the same, so a brief period of an invalid PTE can be handled just fine in KVM, while the Xen hypervisor has no idea that this situation will be over very soon.
Juergen
(add Arnd to CC)
Juergen Gross jgross@suse.com writes:
For a rather long time we were using "normal" user pages for this purpose, which were just locked into memory for doing the hypercall.
Was this using the normal mlock() semantics to stop pages being swapped out of RAM?
Unfortunately there have been very rare problems with that approach, as the Linux kernel can set a user page related PTE to invalid for short periods of time, which led to EFAULT in the hypervisor when trying to access the hypercall data.
I must admit I'm not super familiar with the internals of page table handling with Linux+Xen. Doesn't the kernel need to delegate the tweaking of page tables to the hypervisor or is it allowed to manipulate the page tables itself?
In Linux this can be avoided only by using kernel memory, which is the reason why the hypercall buffers are allocated and mmap()-ed through the privcmd driver.
I'm trying to understand the differences between Xen and KVM in the API choices here. I think the equivalent is the KVM_SET_USER_MEMORY_REGION ioctl for KVM, which brings a section of the guest physical address space into the userspace's vaddr range.
The main difference is just that the consumer of the hypercall buffer is NOT the kernel, but the hypervisor. In the KVM case both are the same, so a brief period of an invalid PTE can be handled just fine in KVM, while the Xen hypervisor has no idea that this situation will be over very soon.
I still don't follow the details of why we have the separate mmap. Is it purely because the VM flags of the special file can be changed in a way that can't be done with a traditional file-backed mmap?
I can see various other devices have their own setting of vm flags but VM_DONTCOPY for example can be set with the appropriate madvise call:
MADV_DONTFORK (since Linux 2.6.16) Do not make the pages in this range available to the child after a fork(2). This is useful to prevent copy-on-write semantics from changing the physical location of a page if the parent writes to it after a fork(2). (Such page relocations cause problems for hardware that DMAs into the page.)
For the vhost-user work we need to be able to share the guest memory between the xen-vhost-master (which is doing the ioctls to talk to Xen) and the vhost-user daemon (which doesn't know about hypervisors but just deals in memory and events).
Would it be enough to loosen the API and just have xen_remap_pfn() verify the kernel's VM flags are appropriately set before requesting that Xen update the page tables?
On 25.03.22 17:07, Alex Bennée wrote:
Was this using the normal mlock() semantics to stop pages being swapped out of RAM?
The code is still in tools/libs/call/linux.c in alloc_pages_nobufdev(), which is used if the kernel driver doesn't support the special device for the kernel memory mmap().
Unfortunately there have been very rare problems with that approach, as the Linux kernel can set a user page related PTE to invalid for short periods of time, which led to EFAULT in the hypervisor when trying to access the hypercall data.
I must admit I'm not super familiar with the internals of page table handling with Linux+Xen. Doesn't the kernel need to delegate the tweaking of page tables to the hypervisor or is it allowed to manipulate the page tables itself?
PV domains need to do page table manipulations via the hypervisor, but the issue would occur in PVH or HVM domains, too.
In Linux this can be avoided only by using kernel memory, which is the reason why the hypercall buffers are allocated and mmap()-ed through the privcmd driver.
I'm trying to understand the differences between Xen and KVM in the API choices here. I think the equivalent is the KVM_SET_USER_MEMORY_REGION ioctl for KVM, which brings a section of the guest physical address space into the userspace's vaddr range.
The main difference is just that the consumer of the hypercall buffer is NOT the kernel, but the hypervisor. In the KVM case both are the same, so a brief period of an invalid PTE can be handled just fine in KVM, while the Xen hypervisor has no idea that this situation will be over very soon.
I still don't follow the details of why we have the separate mmap. Is it purely because the VM flags of the special file can be changed in a way that can't be done with a traditional file-backed mmap?
Yes. You can't make the kernel believe that a user page is a kernel one. And only kernel pages are not affected by the short time PTE invalidation which caused the problems (this is what I was told by the guy maintaining the kernel's memory management at SUSE).
I can see various other devices have their own setting of vm flags but VM_DONTCOPY for example can be set with the appropriate madvise call:
MADV_DONTFORK (since Linux 2.6.16) Do not make the pages in this range available to the child after a fork(2). This is useful to prevent copy-on-write semantics from changing the physical location of a page if the parent writes to it after a fork(2). (Such page relocations cause problems for hardware that DMAs into the page.)
For the vhost-user work we need to be able to share the guest memory between the xen-vhost-master (which is doing the ioctls to talk to Xen) and the vhost-user daemon (which doesn't know about hypervisors but just deals in memory and events).
The problem is really only with the hypervisor trying to access a domain's buffer via a domain virtual memory address. It has nothing to do with mapping another domain's memory into a domain.
Would it be enough to loosen the API and just have xen_remap_pfn() verify the kernel's VM flags are appropriately set before requesting that Xen update the page tables?
I don't think you have to change anything for that.
Juergen
On 24-03-22, 06:12, Juergen Gross wrote:
For a rather long time we were using "normal" user pages for this purpose, which were just locked into memory for doing the hypercall.
Unfortunately there have been very rare problems with that approach, as the Linux kernel can set a user page related PTE to invalid for short periods of time, which led to EFAULT in the hypervisor when trying to access the hypercall data.
In Linux this can be avoided only by using kernel memory, which is the reason why the hypercall buffers are allocated and mmap()-ed through the privcmd driver.
Hi Juergen,
I understand why we moved from user pages to kernel pages, but I don't fully understand why we need to make two separate calls to map the guest memory, i.e. mmap() followed by ioctl(IOCTL_PRIVCMD_MMAPBATCH).
Why aren't we doing all of it from mmap() itself? I hacked it up to check, and it works fine if we do it all from mmap() itself.
Aren't we abusing the Linux userspace ABI here? Standard userspace code would expect mmap() alone to be enough to map the memory. Yes, the current user, Xen itself, is adapted to make two calls, but this breaks as soon as we want to use something that relies on the Linux userspace ABI.
For instance, in our case, where we are looking to create hypervisor-agnostic virtio backends, the rust-vmm library [1] issues only mmap() and expects it to work. It doesn't know it is running on a Xen system, and it shouldn't need to know either.
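To illustrate what "issues mmap() only" means for such a backend, here is a rough C sketch of the consumer side. The names map_region() and make_fake_guest_memory() are invented for this example, and an unlinked temporary file stands in for guest memory; on Xen the fd would have to come from somewhere else, which is exactly the open question.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: what a hypervisor-agnostic backend wants to do.
 * Given only a file descriptor, an offset and a length, map the region.
 * Returns NULL on failure. */
static void *map_region(int fd, off_t offset, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical stand-in for "guest memory": an unlinked temporary file of
 * `len` bytes. Returns the fd, or -1 on failure. */
static int make_fake_guest_memory(size_t len)
{
    char path[] = "/tmp/fake-guest-mem-XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);  /* the file lives on only through the open fd */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With a regular file this single call is all that is needed; with /dev/xen/privcmd the same mmap() succeeds but nothing is mapped until the ioctl has been issued.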
Thanks.
On 24/08/2022 10:19, Viresh Kumar wrote:
Aren't we abusing the Linux userspace ABI here ? As standard userspace code would expect just mmap() to be enough to map the memory. Yes, the current user, Xen itself, is adapted to make two calls, but it breaks as soon as we want to use something that relies on Linux userspace ABI.
For instance, in our case, where we are looking to create hypervisor-agnostic virtio backends, the rust-vmm library [1] issues mmap() only and expects it to work. It doesn't know it is running on a Xen system, and it shouldn't know that as well.
Use /dev/xen/hypercall which has a sane ABI for getting "safe" memory. privcmd is very much not sane.
In practice you'll need to use both. /dev/xen/hypercall for getting "safe" memory, and /dev/xen/privcmd for issuing hypercalls for now.
~Andrew
Andrew Cooper Andrew.Cooper3@citrix.com writes:
I understand why we moved from user pages to kernel pages, but I don't fully understand why we need to make two separate calls to map the guest memory, i.e. mmap() followed by ioctl(IOCTL_PRIVCMD_MMAPBATCH).
Why aren't we doing all of it from mmap() itself ? I hacked it up to check on it and it works fine if we do it all from mmap() itself.
As I understand it, the MMAPBATCH ioctl is being treated like every other hypercall proxied through the ioctl interface. That makes sense from the point of view of having a consistent interface to the hypervisor, but not from the point of view of providing a consistent userspace interface for mapping memory which doesn't care about the hypervisor details.
The privcmd_mmapbatch_v2 interface is slightly richer than what you could expose via mmap() because it allows the handling of partial mappings with what I presume is a per-page *err array. If you issued the hypercall directly from the mmap() and one of the pages wasn't mapped by the hypervisor you would have to unwind everything before returning EFAULT to the user.
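As a sketch of what that per-page array buys the caller: after the ioctl returns, userspace can walk err[] and decide page by page what to do. The struct and function names below are made up, and the sketch assumes (I have not checked every path) that each entry is 0 on success or a negative errno, with -ENOENT meaning the frame is paged out and worth retrying.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical summary of one batch mapping attempt. */
struct batch_result {
    size_t mapped;  /* entries with err[i] == 0 */
    size_t retry;   /* entries equal to -ENOENT (assumed retryable) */
    size_t failed;  /* entries with any other non-zero value */
};

/* Hypothetical helper: classify every entry of the per-page error array. */
static struct batch_result summarize_batch(const int err[], size_t num)
{
    struct batch_result r = { 0, 0, 0 };

    for (size_t i = 0; i < num; i++) {
        if (err[i] == 0)
            r.mapped++;
        else if (err[i] == -ENOENT)
            r.retry++;
        else
            r.failed++;
    }
    return r;
}
```

A bare mmap() has only a single return value, so it could not report this kind of partial result.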
Use /dev/xen/hypercall which has a sane ABI for getting "safe" memory. privcmd is very much not sane.
In practice you'll need to use both. /dev/xen/hypercall for getting "safe" memory, and /dev/xen/privcmd for issuing hypercalls for now.
I'm unsure what is meant by safe memory here. privcmd_buf_mmap() looks like it just allocates a bunch of GFP_KERNEL pages rather than interacting with the hypervisor directly. Are these the same pages that get used when you eventually call privcmd_ioctl_mmap_batch()?
The fact that /dev/xen/hypercall is specified by xen_privcmdbuf_dev is a little confusing TBH.
Anyway, the goal here is to provide a non-Xen-aware userspace with a standard userspace API to access the guest's memory. Perhaps messing around with the semantics of the /dev/xen/[hypercall|privcmd] device nodes is too confusing.
Maybe we could instead:
1. Have the Xen-aware VMM ask to make the guest's memory visible to the host kernel's address space.
2. When this is done, explicitly create a device node to represent it (/dev/xen/dom-%d-mem?).
3. Pass this new device to the non-Xen-aware userspace, which uses the standard mmap() call to make the kernel pages visible to userspace.
Does that make sense?
On 24.08.22 13:22, Alex Bennée wrote:
I'm unsure what is meant by safe memory here. privcmd_buf_mmap() looks like it just allocates a bunch of GFP_KERNEL pages rather than interacting with the hypervisor directly. Are these the same pages that get used when you eventually call privcmd_ioctl_mmap_batch()?
privcmd_buf_mmap() is allocating kernel pages which are used for data being accessed by the hypervisor when doing the hypercall later. This is a generic interface being used for all hypercalls, not only for privcmd_ioctl_mmap_batch().
The fact that /dev/xen/hypercall is specified by xen_privcmdbuf_dev is a little confusing TBH.
Anyway the goal here is to provide a non-xen aware userspace with standard userspace API to access the guests memory. Perhaps messing
This is what the Xen-related libraries are meant for. Your decision to ignore those is backfiring now.
around with the semantics of the /dev/xen/[hypercall|privcmd] devices nodes is too confusing.
Maybe we could instead:
- Have the Xen aware VMM ask to make the guests memory visible to the host kernels address space.
Urgh. This would be a major breach of the Xen security concept.
- When this is done explicitly create a device node to represent it (/dev/xen/dom-%d-mem?)
- Pass this new device to the non-Xen aware userspace which uses the standard mmap() call to make the kernel pages visible to userspace
Does that make sense?
Maybe from your point of view, but not from the Xen architectural point of view IMHO. You are removing basically the main security advantages of Xen by generating a kernel interface for mapping arbitrary guest memory easily.
Juergen
Juergen Gross jgross@suse.com writes:
This is what the Xen related libraries are meant for. Your decision to ignore those is firing back now.
We didn't ignore them - the initial version of the xen-vhost-master binary was built with Rust and linked against the Xen libraries. We are however in the process of moving to more pure Rust (with the xen-sys crate being a pure Rust ioctl/hypercall wrapper).
However, I was under the impression there were two classes of hypercalls: ABI-stable ones which won't change (which is all we are planning to implement for xen-sys) and non-stable ABIs which would need mediating by the Xen libs. We are hoping we can do all of VirtIO with just the stable ABI.
Maybe from your point of view, but not from the Xen architectural point of view IMHO. You are removing basically the main security advantages of Xen by generating a kernel interface for mapping arbitrary guest memory easily.
We are not talking about doing an end-run around the Xen architecture. The guest still has to instruct the hypervisor to grant access to its memory. Currently this is a global thing (i.e. whole address space or nothing), but obviously more fine-grained grants can be done on a transaction-by-transaction basis, although we are exploring more efficient mechanisms for this (shared pools and carve-outs).
This does raise questions for the mmap interface though - each individually granted region would need to be mapped into the dom0 userspace virtual address space or perhaps a new flag for mmap() so we can map the whole address space but expect SIGBUS faults if we access something that hasn't been granted.
Juergen
On 24.08.22 17:58, Alex Bennée wrote:
Juergen Gross jgross@suse.com writes:
On 24.08.22 13:22, Alex Bennée wrote:
Andrew Cooper Andrew.Cooper3@citrix.com writes:
On 24/08/2022 10:19, Viresh Kumar wrote:
On 24-03-22, 06:12, Juergen Gross wrote:
For a rather long time we were using "normal" user pages for this purpose, which were just locked into memory for doing the hypercall.
Unfortunately there have been very rare problems with that approach, as the Linux kernel can set a user page related PTE to invalid for short periods of time, which led to EFAULT in the hypervisor when trying to access the hypercall data.
In Linux this can be avoided only by using kernel memory, which is the reason why the hypercall buffers are allocated and mmap()-ed through the privcmd driver.
Hi Juergen,
I understand why we moved from user pages to kernel pages, but I don't fully understand why we need to make two separate calls to map the guest memory, i.e. mmap() followed by ioctl(IOCTL_PRIVCMD_MMAPBATCH).
Why aren't we doing all of it from mmap() itself ? I hacked it up to check on it and it works fine if we do it all from mmap() itself.
As I understand it the MMAPBATCH ioctl is being treated like every other hypercall proxied through the ioctl interface. That makes sense from the point of view of having a consistent interface to the hypervisor, but not from the point of view of providing a consistent userspace interface for mapping memory which doesn't care about the hypervisor details. The privcmd_mmapbatch_v2 interface is slightly richer than what you could expose via mmap() because it allows the handling of partial mappings with what I presume is a per-page *err array. If you issued the hypercall directly from the mmap() and one of the pages wasn't mapped by the hypervisor you would have to unwind everything before returning EFAULT to the user.
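To make the two-step flow and the per-page error handling concrete, here is a minimal sketch in C. It is not the libxenforeignmemory code: the names map_batch, populate_fn and the two fake_populate_* helpers are hypothetical, the populate callback merely stands in for ioctl(fd, IOCTL_PRIVCMD_MMAPBATCH_V2, ...), and an anonymous mapping stands in for the mmap() of the privcmd fd so the sketch runs without Xen. A 4 KiB page size is assumed.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical stand-in for the MMAPBATCH_V2 ioctl: backs the range at
 * addr with the given frames and records a per-page status in err[].
 * Returns 0 if the request was issued at all, non-zero otherwise. */
typedef int (*populate_fn)(void *addr, size_t num,
                           const uint64_t frames[], int err[]);

/* Step 1: reserve a virtual address range. Step 2: ask the callback to
 * populate it. Returns the mapping, or NULL after unwinding when the
 * request failed or any single page reported an error. */
static void *map_batch(populate_fn populate, size_t num,
                       const uint64_t frames[], int err[])
{
    size_t len = num << SKETCH_PAGE_SHIFT;
    size_t i;
    void *addr;

    assert(populate != NULL);

    /* The real code mmap()s the privcmd fd with MAP_SHARED here. */
    addr = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
        return NULL;

    if (populate(addr, num, frames, err) != 0)
        goto unwind;

    for (i = 0; i < num; i++)
        if (err[i] != 0)
            goto unwind; /* a caller could instead retry just this page */

    return addr;

unwind:
    munmap(addr, len);
    return NULL;
}

/* Fake backend: every page maps successfully. */
static int fake_populate_ok(void *addr, size_t num,
                            const uint64_t frames[], int err[])
{
    size_t i;
    (void)addr; (void)frames;
    for (i = 0; i < num; i++)
        err[i] = 0;
    return 0;
}

/* Fake backend: the second page fails, the others succeed. */
static int fake_populate_second_fails(void *addr, size_t num,
                                      const uint64_t frames[], int err[])
{
    size_t i;
    (void)addr; (void)frames;
    for (i = 0; i < num; i++)
        err[i] = (i == 1) ? -EFAULT : 0;
    return 0;
}
```

The point of the err[] array is visible in the unwind path: a caller that only had mmap() would see one failure code for the whole range, whereas here it can tell which page failed and decide whether to retry or tear everything down.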
Aren't we abusing the Linux userspace ABI here ? As standard userspace code would expect just mmap() to be enough to map the memory. Yes, the current user, Xen itself, is adapted to make two calls, but it breaks as soon as we want to use something that relies on Linux userspace ABI.
For instance, in our case, where we are looking to create hypervisor-agnostic virtio backends, the rust-vmm library [1] issues mmap() only and expects it to work. It doesn't know it is running on a Xen system, and it shouldn't know that as well.
Use /dev/xen/hypercall which has a sane ABI for getting "safe" memory. privcmd is very much not sane.
In practice you'll need to use both. /dev/xen/hypercall for getting "safe" memory, and /dev/xen/privcmd for issuing hypercalls for now.
I'm unsure what is meant by safe memory here. privcmd_buf_mmap() looks like it just allocates a bunch of GFP_KERNEL pages rather than interacting with the hypervisor directly. Are these the same pages that get used when you eventually call privcmd_ioctl_mmap_batch()?
privcmd_buf_mmap() is allocating kernel pages which are used for data being accessed by the hypervisor when doing the hypercall later. This is a generic interface being used for all hypercalls, not only for privcmd_ioctl_mmap_batch().
The fact that /dev/xen/hypercall is specified by xen_privcmdbuf_dev is a little confusing TBH. Anyway the goal here is to provide a non-Xen aware userspace with a standard userspace API to access the guest's memory. Perhaps messing
This is what the Xen related libraries are meant for. Your decision to ignore those is backfiring now.
We didn't ignore them - the initial version of the xen-vhost-master binary was built with Rust and linked against the Xen libraries. We are however in the process of moving to more pure Rust (with the xen-sys crate being a pure Rust ioctl/hypercall wrapper).
Ah, okay, I wasn't aware of this.
However I was under the impression there were two classes of hypercalls: ABI-stable ones which won't change (which is all we are planning to implement for xen-sys) and non-stable ABIs which would need mediating by the Xen libs. We are hoping we can do all of VirtIO with just the stable ABI.
Okay.
around with the semantics of the /dev/xen/[hypercall|privcmd] device nodes is too confusing. Maybe we could instead:
- Have the Xen aware VMM ask to make the guest's memory visible to the host kernel's address space.
Urgh. This would be a major breach of the Xen security concept.
- When this is done explicitly create a device node to represent it (/dev/xen/dom-%d-mem?)
- Pass this new device to the non-Xen aware userspace which uses the standard mmap() call to make the kernel pages visible to userspace
Does that make sense?
Maybe from your point of view, but not from the Xen architectural point of view IMHO. You are removing basically the main security advantages of Xen by generating a kernel interface for mapping arbitrary guest memory easily.
We are not talking about doing an end-run around the Xen architecture. The guest still has to instruct the hypervisor to grant access to its memory. Currently this is a global thing (i.e. whole address space or nothing) but obviously more fine grained grants can be done on a transaction by transaction basis although we are exploring more efficient mechanisms for this (shared pools and carve outs).
Happy to hear that.
This does raise questions for the mmap interface though - each individually granted region would need to be mapped into the dom0 userspace virtual address space or perhaps a new flag for mmap() so we can map the whole address space but expect SIGBUS faults if we access something that hasn't been granted.
Do I understand that correctly? You want the guest to grant a memory region to the backend, and the backend should be able to map this region not using grants, but the guest physical addresses?
Juergen
Juergen Gross jgross@suse.com writes:
On 24.08.22 17:58, Alex Bennée wrote:
Juergen Gross jgross@suse.com writes:
On 24.08.22 13:22, Alex Bennée wrote:
Andrew Cooper Andrew.Cooper3@citrix.com writes:
On 24/08/2022 10:19, Viresh Kumar wrote:
On 24-03-22, 06:12, Juergen Gross wrote:
For a rather long time we were using "normal" user pages for this purpose, which were just locked into memory for doing the hypercall.
Unfortunately there have been very rare problems with that approach, as the Linux kernel can set a user page related PTE to invalid for short periods of time, which led to EFAULT in the hypervisor when trying to access the hypercall data.
In Linux this can be avoided only by using kernel memory, which is the reason why the hypercall buffers are allocated and mmap()-ed through the privcmd driver.
Hi Juergen,
I understand why we moved from user pages to kernel pages, but I don't fully understand why we need to make two separate calls to map the guest memory, i.e. mmap() followed by ioctl(IOCTL_PRIVCMD_MMAPBATCH).
Why aren't we doing all of it from mmap() itself ? I hacked it up to check on it and it works fine if we do it all from mmap() itself.
As I understand it the MMAPBATCH ioctl is being treated like every other hypercall proxied through the ioctl interface. That makes sense from the point of view of having a consistent interface to the hypervisor, but not from the point of view of providing a consistent userspace interface for mapping memory which doesn't care about the hypervisor details. The privcmd_mmapbatch_v2 interface is slightly richer than what you could expose via mmap() because it allows the handling of partial mappings with what I presume is a per-page *err array. If you issued the hypercall directly from the mmap() and one of the pages wasn't mapped by the hypervisor you would have to unwind everything before returning EFAULT to the user.
Aren't we abusing the Linux userspace ABI here ? As standard userspace code would expect just mmap() to be enough to map the memory. Yes, the current user, Xen itself, is adapted to make two calls, but it breaks as soon as we want to use something that relies on Linux userspace ABI.
For instance, in our case, where we are looking to create hypervisor-agnostic virtio backends, the rust-vmm library [1] issues mmap() only and expects it to work. It doesn't know it is running on a Xen system, and it shouldn't know that as well.
Use /dev/xen/hypercall which has a sane ABI for getting "safe" memory. privcmd is very much not sane.
In practice you'll need to use both. /dev/xen/hypercall for getting "safe" memory, and /dev/xen/privcmd for issuing hypercalls for now.
I'm unsure what is meant by safe memory here. privcmd_buf_mmap() looks like it just allocates a bunch of GFP_KERNEL pages rather than interacting with the hypervisor directly. Are these the same pages that get used when you eventually call privcmd_ioctl_mmap_batch()?
privcmd_buf_mmap() is allocating kernel pages which are used for data being accessed by the hypervisor when doing the hypercall later. This is a generic interface being used for all hypercalls, not only for privcmd_ioctl_mmap_batch().
The fact that /dev/xen/hypercall is specified by xen_privcmdbuf_dev is a little confusing TBH. Anyway the goal here is to provide a non-Xen aware userspace with a standard userspace API to access the guest's memory. Perhaps messing
This is what the Xen related libraries are meant for. Your decision to ignore those is backfiring now.
We didn't ignore them - the initial version of the xen-vhost-master binary was built with Rust and linked against the Xen libraries. We are however in the process of moving to more pure Rust (with the xen-sys crate being a pure Rust ioctl/hypercall wrapper).
Ah, okay, I wasn't aware of this.
However I was under the impression there were two classes of hypercalls: ABI-stable ones which won't change (which is all we are planning to implement for xen-sys) and non-stable ABIs which would need mediating by the Xen libs. We are hoping we can do all of VirtIO with just the stable ABI.
Okay.
around with the semantics of the /dev/xen/[hypercall|privcmd] device nodes is too confusing. Maybe we could instead:
- Have the Xen aware VMM ask to make the guest's memory visible to the host kernel's address space.
Urgh. This would be a major breach of the Xen security concept.
- When this is done explicitly create a device node to represent it (/dev/xen/dom-%d-mem?)
- Pass this new device to the non-Xen aware userspace which uses the standard mmap() call to make the kernel pages visible to userspace
Does that make sense?
Maybe from your point of view, but not from the Xen architectural point of view IMHO. You are removing basically the main security advantages of Xen by generating a kernel interface for mapping arbitrary guest memory easily.
We are not talking about doing an end-run around the Xen architecture. The guest still has to instruct the hypervisor to grant access to its memory. Currently this is a global thing (i.e. whole address space or nothing) but obviously more fine grained grants can be done on a transaction by transaction basis although we are exploring more efficient mechanisms for this (shared pools and carve outs).
Happy to hear that.
This does raise questions for the mmap interface though - each individually granted region would need to be mapped into the dom0 userspace virtual address space or perhaps a new flag for mmap() so we can map the whole address space but expect SIGBUS faults if we access something that hasn't been granted.
Do I understand that correctly? You want the guest to grant a memory region to the backend, and the backend should be able to map this region not using grants, but the guest physical addresses?
Yes - although it doesn't have to be the whole GPA range. The vhost-user protocol communicates what offset into the GPA space the various memory regions exist at.
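As an illustration of what the backend does with those offsets, here is a minimal sketch in C of translating a guest physical address into a pointer inside the backend's own mappings. The struct mem_region layout and the gpa_to_hva name are hypothetical and only loosely modelled on the kind of region description vhost-user passes around; how host_base gets mapped (grants, foreign mapping, or something new) is exactly the open question in this thread and is not addressed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-region record kept by the backend. */
struct mem_region {
    uint64_t guest_phys_addr; /* start of the region in GPA space */
    uint64_t size;            /* length of the region in bytes */
    uint8_t *host_base;       /* where the backend has it mapped */
};

/* Translate [gpa, gpa + len) into a backend pointer. Returns NULL when
 * the range is not fully contained in a single known region. */
static void *gpa_to_hva(const struct mem_region *regions, size_t n,
                        uint64_t gpa, uint64_t len)
{
    size_t i;

    assert(regions != NULL || n == 0);

    for (i = 0; i < n; i++) {
        const struct mem_region *r = &regions[i];
        uint64_t off;

        if (gpa < r->guest_phys_addr)
            continue;
        off = gpa - r->guest_phys_addr;

        /* Compare against the remaining space so off + len cannot wrap. */
        if (off <= r->size && len <= r->size - off)
            return r->host_base + off;
    }
    return NULL;
}
```

Anything outside the advertised regions simply fails the lookup, which is why the backend does not need the whole GPA range to be mapped.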
Juergen
On 31.08.22 18:02, Alex Bennée wrote:
Juergen Gross jgross@suse.com writes:
On 24.08.22 17:58, Alex Bennée wrote:
Juergen Gross jgross@suse.com writes:
On 24.08.22 13:22, Alex Bennée wrote:
Andrew Cooper Andrew.Cooper3@citrix.com writes:
On 24/08/2022 10:19, Viresh Kumar wrote:
On 24-03-22, 06:12, Juergen Gross wrote:
For a rather long time we were using "normal" user pages for this purpose, which were just locked into memory for doing the hypercall.
Unfortunately there have been very rare problems with that approach, as the Linux kernel can set a user page related PTE to invalid for short periods of time, which led to EFAULT in the hypervisor when trying to access the hypercall data.
In Linux this can be avoided only by using kernel memory, which is the reason why the hypercall buffers are allocated and mmap()-ed through the privcmd driver.
Hi Juergen,
I understand why we moved from user pages to kernel pages, but I don't fully understand why we need to make two separate calls to map the guest memory, i.e. mmap() followed by ioctl(IOCTL_PRIVCMD_MMAPBATCH).
Why aren't we doing all of it from mmap() itself? I hacked it up to check on it and it works fine if we do it all from mmap() itself.
As I understand it the MMAPBATCH ioctl is being treated like every other hypercall proxied through the ioctl interface. That makes sense from the point of view of having a consistent interface to the hypervisor, but not from the point of view of providing a consistent userspace interface for mapping memory which doesn't care about the hypervisor details. The privcmd_mmapbatch_v2 interface is slightly richer than what you could expose via mmap() because it allows the handling of partial mappings with what I presume is a per-page *err array. If you issued the hypercall directly from the mmap() and one of the pages wasn't mapped by the hypervisor you would have to unwind everything before returning EFAULT to the user.
Aren't we abusing the Linux userspace ABI here? As standard userspace code would expect just mmap() to be enough to map the memory. Yes, the current user, Xen itself, is adapted to make two calls, but it breaks as soon as we want to use something that relies on Linux userspace ABI.
For instance, in our case, where we are looking to create hypervisor-agnostic virtio backends, the rust-vmm library [1] issues mmap() only and expects it to work. It doesn't know it is running on a Xen system, and it shouldn't know that as well.
Use /dev/xen/hypercall which has a sane ABI for getting "safe" memory. privcmd is very much not sane.
In practice you'll need to use both. /dev/xen/hypercall for getting "safe" memory, and /dev/xen/privcmd for issuing hypercalls for now.
I'm unsure what is meant by safe memory here. privcmd_buf_mmap() looks like it just allocates a bunch of GFP_KERNEL pages rather than interacting with the hypervisor directly. Are these the same pages that get used when you eventually call privcmd_ioctl_mmap_batch()?
privcmd_buf_mmap() is allocating kernel pages which are used for data being accessed by the hypervisor when doing the hypercall later. This is a generic interface being used for all hypercalls, not only for privcmd_ioctl_mmap_batch().
The fact that /dev/xen/hypercall is specified by xen_privcmdbuf_dev is a little confusing TBH. Anyway the goal here is to provide a non-Xen aware userspace with a standard userspace API to access the guest's memory. Perhaps messing
This is what the Xen related libraries are meant for. Your decision to ignore those is backfiring now.
We didn't ignore them - the initial version of the xen-vhost-master binary was built with Rust and linked against the Xen libraries. We are however in the process of moving to more pure Rust (with the xen-sys crate being a pure Rust ioctl/hypercall wrapper).
Ah, okay, I wasn't aware of this.
However I was under the impression there were two classes of hypercalls: ABI-stable ones which won't change (which is all we are planning to implement for xen-sys) and non-stable ABIs which would need mediating by the Xen libs. We are hoping we can do all of VirtIO with just the stable ABI.
Okay.
around with the semantics of the /dev/xen/[hypercall|privcmd] device nodes is too confusing. Maybe we could instead:
1. Have the Xen aware VMM ask to make the guest's memory visible to the host kernel's address space.
Urgh. This would be a major breach of the Xen security concept.
2. When this is done explicitly create a device node to represent it (/dev/xen/dom-%d-mem?)
3. Pass this new device to the non-Xen aware userspace which uses the standard mmap() call to make the kernel pages visible to userspace
Does that make sense?
Maybe from your point of view, but not from the Xen architectural point of view IMHO. You are removing basically the main security advantages of Xen by generating a kernel interface for mapping arbitrary guest memory easily.
We are not talking about doing an end-run around the Xen architecture. The guest still has to instruct the hypervisor to grant access to its memory. Currently this is a global thing (i.e. whole address space or nothing) but obviously more fine grained grants can be done on a transaction by transaction basis although we are exploring more efficient mechanisms for this (shared pools and carve outs).
Happy to hear that.
This does raise questions for the mmap interface though - each individually granted region would need to be mapped into the dom0 userspace virtual address space or perhaps a new flag for mmap() so we can map the whole address space but expect SIGBUS faults if we access something that hasn't been granted.
Do I understand that correctly? You want the guest to grant a memory region to the backend, and the backend should be able to map this region not using grants, but the guest physical addresses?
Yes - although it doesn't have to be the whole GPA range. The vhost-user protocol communicates what offset into the GPA space the various memory regions exist at.
What would the interface with the hypervisor look like then?
In order to make this secure, the hypervisor would need to scan the grant table of the guest to look for a physical address the backend wants to map. I don't think this is an acceptable interface.
Juergen
On 24.08.22 11:19, Viresh Kumar wrote:
On 24-03-22, 06:12, Juergen Gross wrote:
For a rather long time we were using "normal" user pages for this purpose, which were just locked into memory for doing the hypercall.
Unfortunately there have been very rare problems with that approach, as the Linux kernel can set a user page related PTE to invalid for short periods of time, which led to EFAULT in the hypervisor when trying to access the hypercall data.
In Linux this can be avoided only by using kernel memory, which is the reason why the hypercall buffers are allocated and mmap()-ed through the privcmd driver.
Hi Juergen,
I understand why we moved from user pages to kernel pages, but I don't fully understand why we need to make two separate calls to map the guest memory, i.e. mmap() followed by ioctl(IOCTL_PRIVCMD_MMAPBATCH).
Why aren't we doing all of it from mmap() itself ? I hacked it up to check on it and it works fine if we do it all from mmap() itself.
Hypercall buffers are needed for more than just the "MMAPBATCH" hypercall. Or are you suggesting one device per possible hypercall?
Aren't we abusing the Linux userspace ABI here ? As standard userspace code would expect just mmap() to be enough to map the memory. Yes, the current user, Xen itself, is adapted to make two calls, but it breaks as soon as we want to use something that relies on Linux userspace ABI.
I think you are still mixing up the hypercall buffers with the memory you want to map via the hypercall. At least the reference to kernel memory above is suggesting that.
Juergen
Juergen Gross jgross@suse.com writes:
On 24.08.22 11:19, Viresh Kumar wrote:
On 24-03-22, 06:12, Juergen Gross wrote:
For a rather long time we were using "normal" user pages for this purpose, which were just locked into memory for doing the hypercall.
Unfortunately there have been very rare problems with that approach, as the Linux kernel can set a user page related PTE to invalid for short periods of time, which led to EFAULT in the hypervisor when trying to access the hypercall data.
In Linux this can be avoided only by using kernel memory, which is the reason why the hypercall buffers are allocated and mmap()-ed through the privcmd driver.
Hi Juergen,
I understand why we moved from user pages to kernel pages, but I don't fully understand why we need to make two separate calls to map the guest memory, i.e. mmap() followed by ioctl(IOCTL_PRIVCMD_MMAPBATCH).
Why aren't we doing all of it from mmap() itself? I hacked it up to check on it and it works fine if we do it all from mmap() itself.
Hypercall buffers are needed for more than just the "MMAPBATCH" hypercall. Or are you suggesting one device per possible hypercall?
Aren't we abusing the Linux userspace ABI here ? As standard userspace code would expect just mmap() to be enough to map the memory. Yes, the current user, Xen itself, is adapted to make two calls, but it breaks as soon as we want to use something that relies on Linux userspace ABI.
I think you are still mixing up the hypercall buffers with the memory you want to map via the hypercall. At least the reference to kernel memory above is suggesting that.
Aren't the hypercall buffers all internal to the kernel/hypervisor interface or are you talking about the ioctl contents?
Juergen
On 24.08.22 13:47, Alex Bennée wrote:
Juergen Gross jgross@suse.com writes:
On 24.08.22 11:19, Viresh Kumar wrote:
On 24-03-22, 06:12, Juergen Gross wrote:
For a rather long time we were using "normal" user pages for this purpose, which were just locked into memory for doing the hypercall.
Unfortunately there have been very rare problems with that approach, as the Linux kernel can set a user page related PTE to invalid for short periods of time, which led to EFAULT in the hypervisor when trying to access the hypercall data.
In Linux this can be avoided only by using kernel memory, which is the reason why the hypercall buffers are allocated and mmap()-ed through the privcmd driver.
Hi Juergen,
I understand why we moved from user pages to kernel pages, but I don't fully understand why we need to make two separate calls to map the guest memory, i.e. mmap() followed by ioctl(IOCTL_PRIVCMD_MMAPBATCH).
Why aren't we doing all of it from mmap() itself? I hacked it up to check on it and it works fine if we do it all from mmap() itself.
Hypercall buffers are needed for more than just the "MMAPBATCH" hypercall. Or are you suggesting one device per possible hypercall?
Aren't we abusing the Linux userspace ABI here ? As standard userspace code would expect just mmap() to be enough to map the memory. Yes, the current user, Xen itself, is adapted to make two calls, but it breaks as soon as we want to use something that relies on Linux userspace ABI.
I think you are still mixing up the hypercall buffers with the memory you want to map via the hypercall. At least the reference to kernel memory above is suggesting that.
Aren't the hypercall buffers all internal to the kernel/hypervisor interface or are you talking about the ioctl contents?
The hypercall buffers are filled by the Xen libraries in user mode. The ioctl() is really only a passthrough mechanism for doing hypercalls, as hypercalls are allowed only from the kernel. To avoid having to adapt the kernel driver for each new hypercall, all parameters for the hypercall, including the in-memory ones, are prepared by the Xen libraries and then given to the hypervisor via the ioctl(). This allows existing kernels to be used with new Xen versions.
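A rough sketch in C of that division of labour, for readers following along. Everything here is illustrative: struct sketch_hypercall is assumed to resemble the op-plus-arguments descriptor handed to the privcmd hypercall ioctl, prepare_hypercall plays the role of the user-mode library, passthrough plays the role of the kernel driver, and fake_hypervisor is an invented operation (op 1 sums 32-bit words) that has nothing to do with any real Xen hypercall.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Assumed shape of the descriptor: an operation number plus arguments. */
struct sketch_hypercall {
    uint64_t op;
    uint64_t arg[5];
};

/* Stand-in for the hypervisor entry point. */
typedef long (*issue_fn)(uint64_t op, const uint64_t arg[5]);

/* "Library" side: knows the operation and lays out its parameters,
 * including the pointer to the in-memory buffer. */
static void prepare_hypercall(struct sketch_hypercall *hc, uint64_t op,
                              void *buf, uint64_t count)
{
    assert(hc != NULL);
    memset(hc, 0, sizeof(*hc));
    hc->op = op;
    hc->arg[0] = (uint64_t)(uintptr_t)buf;
    hc->arg[1] = count;
}

/* "Kernel" side: forwards the descriptor without interpreting it, so it
 * needs no change when a new operation is introduced. */
static long passthrough(const struct sketch_hypercall *hc, issue_fn issue)
{
    return issue(hc->op, hc->arg);
}

/* Invented hypervisor: op 1 sums arg[1] 32-bit words found at arg[0]. */
static long fake_hypervisor(uint64_t op, const uint64_t arg[5])
{
    const uint32_t *words = (const uint32_t *)(uintptr_t)arg[0];
    long sum = 0;
    uint64_t i;

    if (op != 1)
        return -ENOSYS;
    for (i = 0; i < arg[1]; i++)
        sum += words[i];
    return sum;
}
```

In the real setup the buffer that arg[0] points at is the part that has to live in memory the hypervisor can always reach, which is what the hypercall buffer mmap() is for.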
Your decision to ignore the Xen libraries might backfire if a dom0-only hypercall is changed in a new Xen version or even in a Xen update: as the Xen tools and the hypervisor are coupled, the updated Xen libraries will work with the new hypervisor, while your VMM will probably break unless you build it for each Xen version.
Juergen
stratos-dev@op-lists.linaro.org