This series makes it possible for purecap apps to use the aio_ring shared memory region to bypass the overhead of the io_getevents syscall. This mechanism is also used by libaio.
With these patches, all io_* LTP tests pass in both Purecap and plain AArch64 modes. Note that the LTP tests only exercise the basic functionality of the aio subsystem; a significant portion of it remains untested by LTP.
For more comprehensive testing, libaio has been updated with the new uAPI and ported. All the libaio tests pass in both Purecap and plain AArch64 modes.
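For context, the user-space fast path that this enables looks roughly as follows. This is a hedged sketch, not the exact libaio code: struct aio_ring is kernel-internal, so userspace (libaio included) relies on its own mirror of the layout, and the barriers needed around head/tail are omitted here.

	/* Poll the shared ring directly instead of calling io_getevents(). */
	struct aio_ring *ring = (struct aio_ring *)ctx;	/* ctx from io_setup() */

	if (ring->magic == AIO_RING_MAGIC) {
		unsigned head = ring->head;

		while (head != ring->tail) {
			struct io_event ev = ring->io_events[head];

			head = (head + 1) % ring->nr;
			/* handle ev entirely in userspace */
		}
		ring->head = head;
	}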
v4..v3:
- Restore flush_dcache_page in all places, with the exception of one where it is replaced with flush_kernel_vmap_range
- Use ifdef instead of IS_ENABLED in a few places
- Improve formatting
v3..v2:
- Improve the commit messages
- Revert a few unrelated changes
- Change compat_aio_context_t to compat_uptr_t
- Remove io_events_compat union member
- Improve code formatting
- Add copy_to_user_with_ptr in copy_io_events_to_user
- Split copy_from_user_with_ptr for struct __aio_sigset into a different patch
v2..v1:
- Add Patch 1 that fixes a parameter type for the compat handler
- Split the change of the types to user pointers into two patches: one for aio_context_t, and the other for the io_event struct fields
- vmap all the ring pages at the beginning and cache them in the ctx
- Don't remap the pages while allowing tag access to the shared memory; setting the VM flags is enough
- Change aio_context_t to a void __user *
- Improve commit messages
- Refactor some of the functions for compat handling
- Create valid user pointers from ctx_id when received from a compat task
Gitlab issue: https://git.morello-project.org/morello/kernel/linux/-/issues/49
Review branch: https://git.morello-project.org/tudcre01/linux/-/commits/morello/aio_v4
Tudor Cretu (7):
  aio: Fix type of nr parameter in compat handler of io_submit
  aio: Use copy_from_user_with_ptr for struct __aio_sigset
  aio: vmap entire aio_ring instead of kmapping each page
  aio: Implement compat handling for the io_event struct
  aio: Allow capability tag access on the shared memory
  aio: Change aio_context_t to a user pointer
  aio: Use user pointer type in the io_event struct
 fs/aio.c                     | 283 ++++++++++++++++++++++-------------
 include/asm-generic/compat.h |   4 +-
 include/uapi/linux/aio_abi.h |  12 +-
 3 files changed, 186 insertions(+), 113 deletions(-)
The native handler of the io_submit syscall takes the nr parameter as a long. For consistency, change the type of the nr parameter in the compat handler to compat_long_t.
Reported-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 fs/aio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/aio.c b/fs/aio.c
index 079074f47c2e..7d64ba7e42a5 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -2139,7 +2139,7 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
 
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id,
-		       int, nr, compat_uptr_t __user *, iocbpp)
+		       compat_long_t, nr, compat_uptr_t __user *, iocbpp)
 {
 	struct kioctx *ctx;
 	long ret = 0;
struct __aio_sigset contains user pointers, so use the correct copy routine.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 fs/aio.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 7d64ba7e42a5..1f235c446e89 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -2307,7 +2307,7 @@ SYSCALL_DEFINE6(io_pgetevents,
 	if (timeout && unlikely(get_timespec64(&ts, timeout)))
 		return -EFAULT;
 
-	if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
+	if (usig && copy_from_user_with_ptr(&ksig, usig, sizeof(ksig)))
 		return -EFAULT;
 
 	ret = set_user_sigmask(ksig.sigmask, ksig.sigsetsize);
@@ -2342,7 +2342,7 @@ SYSCALL_DEFINE6(io_pgetevents_time32,
 	if (timeout && unlikely(get_old_timespec32(&ts, timeout)))
 		return -EFAULT;
 
-	if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
+	if (usig && copy_from_user_with_ptr(&ksig, usig, sizeof(ksig)))
 		return -EFAULT;
The aio_ring mechanism relies on two assumptions:

1. The size of the header of an aio_ring is equal to the size of an io_event, i.e. sizeof(aio_ring) equals sizeof(io_event)
2. There is no leftover space at the end of a page when populating it with io_events, i.e. PAGE_SIZE is a multiple of sizeof(io_event)
These assumptions are not met in PCuABI. This change removes these assumptions, making the aio_ring mechanism more flexible and robust.
Mapping pages independently using kmap doesn't create a contiguous virtual address space. As some io_event structs could span page boundaries, all the pages necessary to access the events' data need to be mapped into a contiguous virtual address space. For this reason, the ring pages are vmapped at the context initialisation step, and the mapping is cached in the context.
In addition, this change improves the performance of the aio_ring mechanism. Previously, io_events were copied from the shared ring to userspace using multiple copy_to_user calls. This change batches the copying of io_events into a maximum of two copy_to_user calls.
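To illustrate the indexing change, here is a simplified before/after sketch (names follow the patch below):

	/* Before: each page is kmapped on its own, so an event's position
	 * has to be split into a page index and an in-page offset. */
	pos = head + AIO_EVENTS_OFFSET;
	ev_page = kmap(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
	ev = ev_page + pos % AIO_EVENTS_PER_PAGE;

	/* After: the ring is vmapped once at setup, so events are plain
	 * array elements and a batch of them is virtually contiguous,
	 * which is what allows the batched copy_to_user calls. */
	ev = &ctx->ring->io_events[head];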
The changes are similar to the approach taken in [0].
[0] https://lore.kernel.org/all/1349764760-21093-4-git-send-email-koverstreet@go...
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 fs/aio.c | 95 +++++++++++++++++++++-----------------------------
 1 file changed, 35 insertions(+), 60 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 1f235c446e89..1d91b1504297 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -119,6 +119,7 @@ struct kioctx {
 	/* Size of ringbuffer, in units of struct io_event */
 	unsigned	nr_events;
 
+	struct aio_ring	*ring;
 	unsigned long	mmap_base;
 	unsigned long	mmap_size;
 
@@ -351,6 +352,9 @@ static void aio_free_ring(struct kioctx *ctx)
 {
 	int i;
 
+	if (ctx->ring)
+		vunmap(ctx->ring);
+
 	/* Disconnect the kiotx from the ring file.  This prevents future
 	 * accesses to the kioctx from page migration.
 	 */
@@ -508,7 +512,6 @@ static const struct address_space_operations aio_ctx_aops = {
 
 static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 {
-	struct aio_ring *ring;
 	struct mm_struct *mm = current->mm;
 	unsigned long size, unused;
 	int nr_pages;
@@ -589,24 +592,25 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 	ctx->user_id = ctx->mmap_base;
 	ctx->nr_events = nr_events; /* trusted copy */
 
-	ring = kmap_atomic(ctx->ring_pages[0]);
-	ring->nr = nr_events;	/* user copy */
-	ring->id = ~0U;
-	ring->head = ring->tail = 0;
-	ring->magic = AIO_RING_MAGIC;
-	ring->compat_features = AIO_RING_COMPAT_FEATURES;
-	ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
-	ring->header_length = sizeof(struct aio_ring);
-	kunmap_atomic(ring);
+	ctx->ring = vmap(ctx->ring_pages, nr_pages, VM_MAP, PAGE_KERNEL);
+	if (!ctx->ring) {
+		aio_free_ring(ctx);
+		return -ENOMEM;
+	}
+
+	ctx->ring->nr = nr_events;	/* user copy */
+	ctx->ring->id = ~0U;
+	ctx->ring->head = ctx->ring->tail = 0;
+	ctx->ring->magic = AIO_RING_MAGIC;
+	ctx->ring->compat_features = AIO_RING_COMPAT_FEATURES;
+	ctx->ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
+	ctx->ring->header_length = sizeof(struct aio_ring);
+
+	/* Only the header is updated, so flush the first page */
 	flush_dcache_page(ctx->ring_pages[0]);
 
 	return 0;
 }
 
-#define AIO_EVENTS_PER_PAGE	(PAGE_SIZE / sizeof(struct io_event))
-#define AIO_EVENTS_FIRST_PAGE	((PAGE_SIZE - sizeof(struct aio_ring)) / sizeof(struct io_event))
-#define AIO_EVENTS_OFFSET	(AIO_EVENTS_PER_PAGE - AIO_EVENTS_FIRST_PAGE)
-
 void kiocb_set_cancel_fn(struct kiocb *iocb, kiocb_cancel_fn *cancel)
 {
 	struct aio_kiocb *req = container_of(iocb, struct aio_kiocb, rw);
@@ -683,7 +687,6 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm)
 {
 	unsigned i, new_nr;
 	struct kioctx_table *table, *old;
-	struct aio_ring *ring;
 
 	spin_lock(&mm->ioctx_lock);
 	table = rcu_dereference_raw(mm->ioctx_table);
@@ -700,9 +703,7 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm)
				 * we are protected from page migration
				 * changes ring_pages by ->ring_lock.
				 */
-				ring = kmap_atomic(ctx->ring_pages[0]);
-				ring->id = ctx->id;
-				kunmap_atomic(ring);
+				ctx->ring->id = ctx->id;
 				return 0;
 			}
 
@@ -1031,7 +1032,6 @@ static void user_refill_reqs_available(struct kioctx *ctx)
 {
 	spin_lock_irq(&ctx->completion_lock);
 	if (ctx->completed_events) {
-		struct aio_ring *ring;
 		unsigned head;
 
 		/* Access of ring->head may race with aio_read_events_ring()
@@ -1043,9 +1043,7 @@ static void user_refill_reqs_available(struct kioctx *ctx)
		 * against ctx->completed_events below will make sure we do the
		 * safe/right thing.
		 */
-		ring = kmap_atomic(ctx->ring_pages[0]);
-		head = ring->head;
-		kunmap_atomic(ring);
+		head = ctx->ring->head;
 
 		refill_reqs_available(ctx, head, ctx->tail);
 	}
@@ -1133,9 +1131,7 @@ static inline void iocb_destroy(struct aio_kiocb *iocb)
 static void aio_complete(struct aio_kiocb *iocb)
 {
 	struct kioctx	*ctx = iocb->ki_ctx;
-	struct aio_ring	*ring;
-	struct io_event	*ev_page, *event;
-	unsigned tail, pos, head;
+	unsigned int tail, head;
 	unsigned long	flags;
 
 	/*
@@ -1146,19 +1142,13 @@ static void aio_complete(struct aio_kiocb *iocb)
 	spin_lock_irqsave(&ctx->completion_lock, flags);
 
 	tail = ctx->tail;
-	pos = tail + AIO_EVENTS_OFFSET;
+
+	ctx->ring->io_events[tail] = iocb->ki_res;
+	flush_kernel_vmap_range(&ctx->ring->io_events[tail], sizeof(struct io_event));
 
 	if (++tail >= ctx->nr_events)
 		tail = 0;
 
-	ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
-	event = ev_page + pos % AIO_EVENTS_PER_PAGE;
-
-	*event = iocb->ki_res;
-
-	kunmap_atomic(ev_page);
-	flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
-
 	pr_debug("%p[%u]: %p: %p %Lx %Lx %Lx\n", ctx, tail, iocb,
		 (void __user *)(unsigned long)iocb->ki_res.obj,
		 iocb->ki_res.data, iocb->ki_res.res, iocb->ki_res.res2);
@@ -1169,11 +1159,8 @@ static void aio_complete(struct aio_kiocb *iocb)
 	smp_wmb();	/* make event visible before updating tail */
 
 	ctx->tail = tail;
-
-	ring = kmap_atomic(ctx->ring_pages[0]);
-	head = ring->head;
-	ring->tail = tail;
-	kunmap_atomic(ring);
+	head = ctx->ring->head;
+	ctx->ring->tail = tail;
 	flush_dcache_page(ctx->ring_pages[0]);
 
 	ctx->completed_events++;
@@ -1218,8 +1205,7 @@ static inline void iocb_put(struct aio_kiocb *iocb)
 static long aio_read_events_ring(struct kioctx *ctx,
				 struct io_event __user *event, long nr)
 {
-	struct aio_ring *ring;
-	unsigned head, tail, pos;
+	unsigned int head, tail;
 	long ret = 0;
 	int copy_ret;
 
@@ -1232,11 +1218,9 @@ static long aio_read_events_ring(struct kioctx *ctx,
 	sched_annotate_sleep();
 	mutex_lock(&ctx->ring_lock);
 
-	/* Access to ->ring_pages here is protected by ctx->ring_lock. */
-	ring = kmap_atomic(ctx->ring_pages[0]);
-	head = ring->head;
-	tail = ring->tail;
-	kunmap_atomic(ring);
+	/* Access to ->ring here is protected by ctx->ring_lock. */
+	head = ctx->ring->head;
+	tail = ctx->ring->tail;
 
 	/*
	 * Ensure that once we've read the current tail pointer, that
@@ -1254,24 +1238,17 @@ static long aio_read_events_ring(struct kioctx *ctx,
 
 	while (ret < nr) {
 		long avail;
-		struct io_event *ev;
-		struct page *page;
 
 		avail = (head <= tail ?  tail : ctx->nr_events) - head;
+
 		if (head == tail)
 			break;
 
-		pos = head + AIO_EVENTS_OFFSET;
-		page = ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE];
-		pos %= AIO_EVENTS_PER_PAGE;
-
 		avail = min(avail, nr - ret);
-		avail = min_t(long, avail, AIO_EVENTS_PER_PAGE - pos);
 
-		ev = kmap(page);
-		copy_ret = copy_to_user(event + ret, ev + pos,
-					sizeof(*ev) * avail);
-		kunmap(page);
+		copy_ret = copy_to_user(event + ret,
+					&ctx->ring->io_events[head],
+					sizeof(struct io_event) * avail);
 
 		if (unlikely(copy_ret)) {
 			ret = -EFAULT;
@@ -1283,9 +1260,7 @@ static long aio_read_events_ring(struct kioctx *ctx,
 		head %= ctx->nr_events;
 	}
 
-	ring = kmap_atomic(ctx->ring_pages[0]);
-	ring->head = head;
-	kunmap_atomic(ring);
+	ctx->ring->head = head;
 	flush_dcache_page(ctx->ring_pages[0]);
 
 	pr_debug("%li  h%u t%u\n", ret, head, tail);
As the io_event struct might contain pointers as members, introduce a compat version of the struct. Also, implement functions that convert the compat version to the native version of the struct.
A subsequent patch is going to change the io_event struct to enable it to support new architectures. On such architectures, the current struct layout still needs to be supported for compat tasks.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 fs/aio.c | 119 +++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 98 insertions(+), 21 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 1d91b1504297..52d8b4dd96b7 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -65,9 +65,8 @@ struct aio_ring {
 	unsigned	incompat_features;
 	unsigned	header_length;	/* size of aio_ring */
 
-	struct io_event		io_events[];
-}; /* 128 bytes + ring size */
+	struct io_event		io_events[];
+}; /* 32 bytes + ring size */
 
 /*
  * Plugging is meant to work with larger batches of IOs. If we don't
@@ -165,6 +164,9 @@ struct kioctx {
 	struct file		*aio_ring_file;
 
 	unsigned		id;
+#ifdef CONFIG_COMPAT64
+	bool			compat;
+#endif
 };
 
 /*
@@ -219,6 +221,13 @@ struct aio_kiocb {
 	struct eventfd_ctx	*ki_eventfd;
 };
 
+struct compat_io_event {
+	__u64		data;
+	__u64		obj;
+	__s64		res;
+	__s64		res2;
+};
+
 struct compat_iocb {
 	__u64	aio_data;
 #if defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN)
@@ -281,6 +290,63 @@ static struct vfsmount *aio_mnt;
 static const struct file_operations aio_ring_fops;
 static const struct address_space_operations aio_ctx_aops;
 
+static inline bool aio_in_compat64(struct kioctx *ctx)
+{
+#ifdef CONFIG_COMPAT64
+	return ctx->compat;
+#else
+	return false;
+#endif
+}
+
+static inline size_t io_event_size(struct kioctx *ctx)
+{
+	return aio_in_compat64(ctx) ? sizeof(struct compat_io_event)
+				    : sizeof(struct io_event);
+}
+
+static u32 __user *aio_key_uptr(struct kioctx *ctx,
+				struct iocb __user *user_iocb)
+{
+	return aio_in_compat64(ctx) ?
+		&((struct compat_iocb __user *)user_iocb)->aio_key :
+		&user_iocb->aio_key;
+}
+
+static int copy_io_events_to_user(struct kioctx *ctx,
+				  struct io_event __user *event_array,
+				  long offset,
+				  unsigned int ring_head,
+				  long nr)
+{
+	if (aio_in_compat64(ctx))
+		return copy_to_user((struct compat_io_event __user *)event_array + offset,
+				    (struct compat_io_event *)ctx->ring->io_events + ring_head,
+				    sizeof(struct compat_io_event) * nr);
+	return copy_to_user(event_array + offset,
+			    ctx->ring->io_events + ring_head,
+			    sizeof(struct io_event) * nr);
+}
+
+static void copy_io_event_to_ring(struct kioctx *ctx,
+				  unsigned int ring_idx,
+				  struct io_event *native_event)
+{
+	if (aio_in_compat64(ctx)) {
+		struct compat_io_event *compat_ring_event =
+			(struct compat_io_event *)ctx->ring->io_events + ring_idx;
+
+		compat_ring_event->data = native_event->data;
+		compat_ring_event->obj = native_event->obj;
+		compat_ring_event->res = native_event->res;
+		compat_ring_event->res2 = native_event->res2;
+		flush_kernel_vmap_range(compat_ring_event, sizeof(struct compat_io_event));
+		return;
+	}
+	ctx->ring->io_events[ring_idx] = *native_event;
+	flush_kernel_vmap_range(&ctx->ring->io_events[ring_idx], sizeof(struct io_event));
+}
+
 static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 {
 	struct file *file;
@@ -522,7 +588,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 	nr_events += 2;	/* 1 is required, 2 for good luck */
 
 	size = sizeof(struct aio_ring);
-	size += sizeof(struct io_event) * nr_events;
+	size += io_event_size(ctx) * nr_events;
 
 	nr_pages = PFN_UP(size);
 	if (nr_pages < 0)
@@ -536,7 +602,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 
 	ctx->aio_ring_file = file;
 	nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring))
-			/ sizeof(struct io_event);
+			/ io_event_size(ctx);
 
 	ctx->ring_pages = ctx->internal_pages;
 	if (nr_pages > AIO_RING_PAGES) {
@@ -747,10 +813,13 @@ static void aio_nr_sub(unsigned nr)
 /* ioctx_alloc
  *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
-static struct kioctx *ioctx_alloc(unsigned nr_events)
+static struct kioctx *ioctx_alloc(unsigned nr_events, bool compat)
 {
 	struct mm_struct *mm = current->mm;
 	struct kioctx *ctx;
+	size_t event_size = (IS_ENABLED(CONFIG_COMPAT64) && compat) ?
+			    sizeof(struct compat_io_event) :
+			    sizeof(struct io_event);
 	int err = -ENOMEM;
 
 	/*
@@ -772,7 +841,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 		nr_events *= 2;
 
 	/* Prevent overflows */
-	if (nr_events > (0x10000000U / sizeof(struct io_event))) {
+	if (nr_events > (0x10000000U / event_size)) {
 		pr_debug("ENOMEM: nr_events too high\n");
 		return ERR_PTR(-EINVAL);
 	}
@@ -784,6 +853,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 	if (!ctx)
 		return ERR_PTR(-ENOMEM);
 
+#ifdef CONFIG_COMPAT64
+	ctx->compat = compat;
+#endif
 	ctx->max_reqs = max_reqs;
 
 	spin_lock_init(&ctx->ctx_lock);
@@ -1143,8 +1215,7 @@ static void aio_complete(struct aio_kiocb *iocb)
 
 	tail = ctx->tail;
 
-	ctx->ring->io_events[tail] = iocb->ki_res;
-	flush_kernel_vmap_range(&ctx->ring->io_events[tail], sizeof(struct io_event));
+	copy_io_event_to_ring(ctx, tail, &iocb->ki_res);
 
 	if (++tail >= ctx->nr_events)
 		tail = 0;
@@ -1203,7 +1274,8 @@ static inline void iocb_put(struct aio_kiocb *iocb)
  * events fetched
  */
 static long aio_read_events_ring(struct kioctx *ctx,
-				 struct io_event __user *event, long nr)
+				 struct io_event __user *event, long event_idx,
+				 long nr)
 {
 	unsigned int head, tail;
 	long ret = 0;
@@ -1246,9 +1318,8 @@ static long aio_read_events_ring(struct kioctx *ctx,
 
 		avail = min(avail, nr - ret);
 
-		copy_ret = copy_to_user(event + ret,
-					&ctx->ring->io_events[head],
-					sizeof(struct io_event) * avail);
+		copy_ret = copy_io_events_to_user(ctx, event, event_idx + ret,
+						  head, avail);
 
 		if (unlikely(copy_ret)) {
 			ret = -EFAULT;
@@ -1273,7 +1344,7 @@ static long aio_read_events_ring(struct kioctx *ctx,
 static bool aio_read_events(struct kioctx *ctx, long min_nr, long nr,
			    struct io_event __user *event, long *i)
 {
-	long ret = aio_read_events_ring(ctx, event + *i, nr - *i);
+	long ret = aio_read_events_ring(ctx, event, *i, nr - *i);
 
 	if (ret > 0)
 		*i += ret;
@@ -1346,7 +1417,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
 		goto out;
 	}
 
-	ioctx = ioctx_alloc(nr_events);
+	ioctx = ioctx_alloc(nr_events, false);
 	ret = PTR_ERR(ioctx);
 	if (!IS_ERR(ioctx)) {
 		ret = put_user(ioctx->user_id, ctxp);
@@ -1377,7 +1448,7 @@ COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, compat_aio_context_t __use
 		goto out;
 	}
 
-	ioctx = ioctx_alloc(nr_events);
+	ioctx = ioctx_alloc(nr_events, true);
 	ret = PTR_ERR(ioctx);
 	if (!IS_ERR(ioctx)) {
 		/* truncating is ok because it's a user address */
@@ -1952,7 +2023,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
 		req->ki_eventfd = eventfd;
 	}
 
-	if (unlikely(put_user(KIOCB_KEY, &user_iocb->aio_key))) {
+	if (unlikely(put_user(KIOCB_KEY, aio_key_uptr(ctx, user_iocb)))) {
 		pr_debug("EFAULT: aio_key\n");
 		return -EFAULT;
 	}
@@ -2174,15 +2245,20 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
 	u32 key;
 	u64 obj = user_ptr_addr(iocb);
 
-	if (unlikely(get_user(key, &iocb->aio_key)))
-		return -EFAULT;
-	if (unlikely(key != KIOCB_KEY))
-		return -EINVAL;
-
 	ctx = lookup_ioctx(ctx_id);
 	if (unlikely(!ctx))
 		return -EINVAL;
 
+	if (unlikely(get_user(key, aio_key_uptr(ctx, iocb)))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (unlikely(key != KIOCB_KEY)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
 	spin_lock_irq(&ctx->ctx_lock);
 	/* TODO: use a hash or array, this sucks. */
 	list_for_each_entry(kiocb, &ctx->active_reqs, ki_list) {
@@ -2203,6 +2279,7 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
 		ret = -EINPROGRESS;
 	}
 
+out:
 	percpu_ref_put(&ctx->users);
 
 	return ret;
The aio_ring buffer is a shared memory region that hosts the io_event array. The io_event struct may contain user pointers, so the memory region must be allowed to store and load capability pointers in PCuABI.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 fs/aio.c | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/fs/aio.c b/fs/aio.c
index 52d8b4dd96b7..9e40452e2a48 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -484,6 +484,11 @@ static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	vma->vm_flags |= VM_DONTEXPAND;
 	vma->vm_ops = &aio_ring_vm_ops;
+#ifdef CONFIG_CHERI_PURECAP_UABI
+	vma->vm_flags |= VM_READ_CAPS | VM_WRITE_CAPS;
+	vma_set_page_prot(vma);
+#endif
+
 	return 0;
 }
aio_context_t represents a token that allows userspace to query a specific aio context. It can also be used as a pointer to the userspace mapping of the aio_ring buffer. Some userspace software, including libaio, uses it directly to access the aio_ring. Therefore, aio_context_t must be able to hold a full capability. As the __kernel_ulong_t type is not large enough to hold a user pointer in PCuABI, change the aio_context_t type to void __user *.
As aio_context_t is still mainly used as a token id, it should not be modified in any way by userspace. Therefore, perform a full capability comparison in lookup_ioctx() when matching the aio_context_t.
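The distinction matters because, in CHERI C, the equality operator on pointers compares addresses only. A sketch of the two comparisons (user_ptr_is_same() is the PCuABI helper used in the diff below):

	/* Address-only: a re-derived or modified capability with the
	 * right address would still match. */
	if (ctx->user_id == ctx_id)

	/* Full comparison: bounds, permissions and tag must match too,
	 * so only the exact capability handed out by io_setup() passes. */
	if (user_ptr_is_same(ctx->user_id, ctx_id))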
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 fs/aio.c                     | 54 +++++++++++++++++++++++-------------
 include/asm-generic/compat.h |  4 +--
 include/uapi/linux/aio_abi.h |  2 +-
 3 files changed, 38 insertions(+), 22 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 9e40452e2a48..7b2733b0f773 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -97,7 +97,7 @@ struct kioctx {
 
 	struct percpu_ref	reqs;
 
-	unsigned long		user_id;
+	aio_context_t		user_id;
 
 	struct __percpu kioctx_cpu *cpu;
 
@@ -313,6 +313,14 @@ static u32 __user *aio_key_uptr(struct kioctx *ctx,
 		&user_iocb->aio_key;
 }
 
+static inline bool aio_ctx_id_is_same(struct kioctx *ctx,
+				      aio_context_t ctx_id)
+{
+	if (aio_in_compat64(ctx))
+		return ctx->user_id == ctx_id;
+	return user_ptr_is_same(ctx->user_id, ctx_id);
+}
+
 static int copy_io_events_to_user(struct kioctx *ctx,
				  struct io_event __user *event_array,
				  long offset,
@@ -459,7 +467,9 @@ static int aio_ring_mremap(struct vm_area_struct *vma)
 		ctx = rcu_dereference(table->table[i]);
 		if (ctx && ctx->aio_ring_file == file) {
 			if (!atomic_read(&ctx->dead)) {
-				ctx->user_id = ctx->mmap_base = vma->vm_start;
+				ctx->mmap_base = vma->vm_start;
+				/* TODO [PCuABI] - derive proper capability */
+				ctx->user_id = uaddr_to_user_ptr_safe(ctx->mmap_base);
 				res = 0;
 			}
 			break;
@@ -660,7 +670,8 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 
 	pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
 
-	ctx->user_id = ctx->mmap_base;
+	/* TODO [PCuABI] - derive proper capability */
+	ctx->user_id = uaddr_to_user_ptr_safe(ctx->mmap_base);
 	ctx->nr_events = nr_events; /* trusted copy */
 
 	ctx->ring = vmap(ctx->ring_pages, nr_pages, VM_MAP, PAGE_KERNEL);
@@ -914,7 +925,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events, bool compat)
 	mutex_unlock(&ctx->ring_lock);
 
 	pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n",
-		 ctx, ctx->user_id, mm, ctx->nr_events);
+		 ctx, user_ptr_addr(ctx->user_id), mm, ctx->nr_events);
 	return ctx;
 
 err_cleanup:
@@ -1164,9 +1175,9 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
 	return req;
 }
 
-static struct kioctx *lookup_ioctx(unsigned long ctx_id)
+static struct kioctx *lookup_ioctx(aio_context_t ctx_id)
 {
-	struct aio_ring __user *ring  = (void __user *)ctx_id;
+	struct aio_ring __user *ring = ctx_id;
 	struct mm_struct *mm = current->mm;
 	struct kioctx *ctx, *ret = NULL;
 	struct kioctx_table *table;
@@ -1183,7 +1194,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
 
 	id = array_index_nospec(id, table->nr);
 	ctx = rcu_dereference(table->table[id]);
-	if (ctx && ctx->user_id == ctx_id) {
+	if (ctx && aio_ctx_id_is_same(ctx, ctx_id)) {
 		if (percpu_ref_tryget_live(&ctx->users))
 			ret = ctx;
 	}
@@ -1405,27 +1416,32 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
  * pointer is passed for ctxp.  Will fail with -ENOSYS if not
  * implemented.
  */
-SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
+SYSCALL_DEFINE2(io_setup, unsigned, nr_events,
+#ifdef CONFIG_CHERI_PURECAP_UABI
+		aio_context_t * __capability, ctxp)
+#else
+		aio_context_t __user *, ctxp)
+#endif
 {
 	struct kioctx *ioctx = NULL;
-	unsigned long ctx;
+	aio_context_t ctx;
 	long ret;
 
-	ret = get_user(ctx, ctxp);
+	ret = get_user_ptr(ctx, ctxp);
 	if (unlikely(ret))
 		goto out;
 
 	ret = -EINVAL;
 	if (unlikely(ctx || nr_events == 0)) {
 		pr_debug("EINVAL: ctx %lu nr_events %u\n",
-			 ctx, nr_events);
+			 user_ptr_addr(ctx), nr_events);
 		goto out;
 	}
 
 	ioctx = ioctx_alloc(nr_events, false);
 	ret = PTR_ERR(ioctx);
 	if (!IS_ERR(ioctx)) {
-		ret = put_user(ioctx->user_id, ctxp);
+		ret = put_user_ptr(ioctx->user_id, ctxp);
 		if (ret)
 			kill_ioctx(current->mm, ioctx, NULL);
 		percpu_ref_put(&ioctx->users);
@@ -1439,7 +1455,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
 COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, compat_aio_context_t __user *, ctxp)
 {
 	struct kioctx *ioctx = NULL;
-	unsigned long ctx;
+	compat_aio_context_t ctx;
 	long ret;
 
 	ret = get_user(ctx, ctxp);
@@ -1456,8 +1472,7 @@ COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, compat_aio_context_t __use
 	ioctx = ioctx_alloc(nr_events, true);
 	ret = PTR_ERR(ioctx);
 	if (!IS_ERR(ioctx)) {
-		/* truncating is ok because it's a user address */
-		ret = put_user((compat_aio_context_t)ioctx->user_id, ctxp);
+		ret = put_user(ptr_to_compat(ioctx->user_id), ctxp);
 		if (ret)
 			kill_ioctx(current->mm, ioctx, NULL);
 		percpu_ref_put(&ioctx->users);
@@ -1477,6 +1492,7 @@ COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, compat_aio_context_t __use
 SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
 {
 	struct kioctx *ioctx = lookup_ioctx(ctx);
+
 	if (likely(NULL != ioctx)) {
 		struct ctx_rq_wait wait;
 		int ret;
@@ -2200,7 +2216,7 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id,
 	if (unlikely(nr < 0))
 		return -EINVAL;
 
-	ctx = lookup_ioctx(ctx_id);
+	ctx = lookup_ioctx(compat_ptr(ctx_id));
 	if (unlikely(!ctx)) {
 		pr_debug("EINVAL: invalid context id\n");
 		return -EINVAL;
@@ -2421,7 +2437,7 @@ SYSCALL_DEFINE6(io_pgetevents_time32,
 
 #if defined(CONFIG_COMPAT_32BIT_TIME)
 
-SYSCALL_DEFINE5(io_getevents_time32, __u32, ctx_id,
+SYSCALL_DEFINE5(io_getevents_time32, aio_context_t, ctx_id,
		__s32, min_nr,
		__s32, nr,
		struct io_event __user *, events,
@@ -2473,7 +2489,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents,
 	if (ret)
 		return ret;
 
-	ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL);
+	ret = do_io_getevents(compat_ptr(ctx_id), min_nr, nr, events, timeout ? &t : NULL);
 
 	interrupted = signal_pending(current);
 	restore_saved_sigmask_unless(interrupted);
@@ -2508,7 +2524,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents_time64,
 	if (ret)
 		return ret;
 
-	ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL);
+	ret = do_io_getevents(compat_ptr(ctx_id), min_nr, nr, events, timeout ? &t : NULL);
 
 	interrupted = signal_pending(current);
 	restore_saved_sigmask_unless(interrupted);
diff --git a/include/asm-generic/compat.h b/include/asm-generic/compat.h
index f4ef0518470e..f7f82db5caa4 100644
--- a/include/asm-generic/compat.h
+++ b/include/asm-generic/compat.h
@@ -56,7 +56,6 @@ typedef s64 compat_long_t;
 typedef u64 compat_ulong_t;
 typedef u64 compat_uptr_t;
 typedef u64 compat_caddr_t;
-typedef u64 compat_aio_context_t;
 typedef u64 compat_old_sigset_t;
 #else
 typedef u32 compat_size_t;
@@ -68,10 +67,11 @@ typedef s32 compat_long_t;
 typedef u32 compat_ulong_t;
 typedef u32 compat_uptr_t;
 typedef u32 compat_caddr_t;
-typedef u32 compat_aio_context_t;
 typedef u32 compat_old_sigset_t;
 #endif
 
+typedef compat_uptr_t compat_aio_context_t;
+
 #ifndef __compat_uid_t
 typedef u32 __compat_uid_t;
 typedef u32 __compat_gid_t;
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 8050909c2887..87c270021449 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -31,7 +31,7 @@
 #include <linux/fs.h>
 #include <asm/byteorder.h>
 
-typedef __kernel_ulong_t aio_context_t;
+typedef void __user *aio_context_t;
 
 enum {
	IOCB_CMD_PREAD = 0,
The libaio library may set up I/O completion callback functions for I/O requests. The callback function pointer is stored in the data field of the io_event struct, and the pointer to the original iocb struct is stored in the obj field. Therefore, both the data and obj fields of the io_event struct might hold user pointers.
In the PCuABI, a user pointer is a 129-bit capability, so the __u64 type is not big enough to hold it. Use the __kernel_uintptr_t type instead, which is big enough on the affected architectures while remaining 64-bit on others. Additionally, use the special copy routine when copying user pointers to userspace.
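For illustration, typical libaio callback usage that motivates this (a sketch; io_set_callback() simply stores the function pointer in the iocb's data field, which the kernel echoes back verbatim in the completion event):

	struct iocb cb;

	io_prep_pread(&cb, fd, buf, len, 0);
	io_set_callback(&cb, on_read_done);	/* cb.data = on_read_done */

	/* After completion, io_getevents() yields an event with
	 * event.data == on_read_done and event.obj == &cb, so both
	 * fields must be able to carry a full capability in PCuABI. */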
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 fs/aio.c                     | 24 ++++++++++++------------
 include/uapi/linux/aio_abi.h | 10 +++++-----
 2 files changed, 17 insertions(+), 17 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 7b2733b0f773..82f06f7b0df8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -331,9 +331,9 @@ static int copy_io_events_to_user(struct kioctx *ctx,
 		return copy_to_user((struct compat_io_event __user *)event_array + offset,
				    (struct compat_io_event *)ctx->ring->io_events + ring_head,
				    sizeof(struct compat_io_event) * nr);
-	return copy_to_user(event_array + offset,
-			    ctx->ring->io_events + ring_head,
-			    sizeof(struct io_event) * nr);
+	return copy_to_user_with_ptr(event_array + offset,
+				     ctx->ring->io_events + ring_head,
+				     sizeof(struct io_event) * nr);
 }
 
 static void copy_io_event_to_ring(struct kioctx *ctx,
@@ -344,8 +344,8 @@ static void copy_io_event_to_ring(struct kioctx *ctx,
 		struct compat_io_event *compat_ring_event =
			(struct compat_io_event *)ctx->ring->io_events + ring_idx;
 
-		compat_ring_event->data = native_event->data;
-		compat_ring_event->obj = native_event->obj;
+		compat_ring_event->data = (__u64)native_event->data;
+		compat_ring_event->obj = (__u64)native_event->obj;
 		compat_ring_event->res = native_event->res;
 		compat_ring_event->res2 = native_event->res2;
 		flush_kernel_vmap_range(compat_ring_event, sizeof(struct compat_io_event));
@@ -1236,9 +1236,10 @@ static void aio_complete(struct aio_kiocb *iocb)
 	if (++tail >= ctx->nr_events)
 		tail = 0;
 
-	pr_debug("%p[%u]: %p: %p %Lx %Lx %Lx\n", ctx, tail, iocb,
-		 (void __user *)(unsigned long)iocb->ki_res.obj,
-		 iocb->ki_res.data, iocb->ki_res.res, iocb->ki_res.res2);
+	pr_debug("%p[%u]: %p: %Lx %Lx %Lx %Lx\n", ctx, tail, iocb,
+		 (unsigned long long)iocb->ki_res.obj,
+		 (unsigned long long)iocb->ki_res.data,
+		 iocb->ki_res.res, iocb->ki_res.res2);
 
 	/* after flagging the request as done, we
	 * must never even look at it again
@@ -2049,7 +2050,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
 		return -EFAULT;
 	}
 
-	req->ki_res.obj = user_ptr_addr(user_iocb);
+	req->ki_res.obj = (__kernel_uintptr_t)user_iocb;
 	req->ki_res.data = iocb->aio_data;
 	req->ki_res.res = 0;
 	req->ki_res.res2 = 0;
@@ -2080,7 +2081,7 @@ static int get_compat_iocb(struct iocb *iocb, const struct iocb __user *user_ioc
 	struct compat_iocb compat_iocb;
 	if (unlikely(copy_from_user(&compat_iocb, user_iocb, sizeof(struct compat_iocb))))
 		return -EFAULT;
-	iocb->aio_data = compat_iocb.aio_data;
+	iocb->aio_data = (__kernel_uintptr_t)compat_iocb.aio_data;
 	iocb->aio_key = compat_iocb.aio_key;
 	iocb->aio_rw_flags = compat_iocb.aio_rw_flags;
 	iocb->aio_lio_opcode = compat_iocb.aio_lio_opcode;
@@ -2264,7 +2265,6 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
 	struct aio_kiocb *kiocb;
 	int ret = -EINVAL;
 	u32 key;
-	u64 obj = user_ptr_addr(iocb);
 
 	ctx = lookup_ioctx(ctx_id);
 	if (unlikely(!ctx))
@@ -2283,7 +2283,7 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
 	spin_lock_irq(&ctx->ctx_lock);
 	/* TODO: use a hash or array, this sucks. */
 	list_for_each_entry(kiocb, &ctx->active_reqs, ki_list) {
-		if (kiocb->ki_res.obj == obj) {
+		if (user_ptr_is_same((struct iocb __user *)kiocb->ki_res.obj, iocb)) {
 			ret = kiocb->ki_cancel(&kiocb->rw);
 			list_del_init(&kiocb->ki_list);
 			break;
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 87c270021449..edd2ac54ac42 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -58,10 +58,10 @@ enum {
 
 /* read() from /dev/aio returns these structures. */
 struct io_event {
-	__u64		data;		/* the data field from the iocb */
-	__u64		obj;		/* what iocb this event came from */
-	__s64		res;		/* result code for this event */
-	__s64		res2;		/* secondary result */
+	__kernel_uintptr_t	data;	/* the data field from the iocb */
+	__kernel_uintptr_t	obj;	/* what iocb this event came from */
+	__s64			res;	/* result code for this event */
+	__s64			res2;	/* secondary result */
 };
 
 /*
@@ -72,7 +72,7 @@ struct io_event {
 
 struct iocb {
	/* these are internal to the kernel/libc. */
-	__u64	aio_data;	/* data to be returned in event's data */
+	__kernel_uintptr_t aio_data;	/* data to be returned in event's data */
 
 #if defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN)
	__u32	aio_key;	/* the kernel sets aio_key to the req # */
On 03/07/2023 19:46, Tudor Cretu wrote:
> [...]
Now applied on next, thanks!
Along the way I removed some leftover empty lines in aio_read_events_ring() (patch 3) and the io_destroy handler (patch 6).
I had some hesitation around the use of flush_kernel_vmap_range() vs flush_dcache_page() in patch 3. No firm conclusion has been reached on whether the current approach is correct in general, but what is clear is that it does not matter for arm64 as only the coherence between D-cache and I-cache is a concern there, and these pages are never executed. Since we have no easy way to verify the correctness of that patch w.r.t. the cache flushes, it doesn't seem worth spending more time investigating.
For the 6.4 branch, I updated patch 5 to use vm_flags_set() (like the corresponding io_uring patch).
Kevin
On 24-07-2023 11:42, Kevin Brodsky wrote:
> On 03/07/2023 19:46, Tudor Cretu wrote:
>> [...]
> Now applied on next, thanks!
>
> Along the way I removed some leftover empty lines in aio_read_events_ring() (patch 3) and the io_destroy handler (patch 6).
Many thanks for the improvements!
> I had some hesitation around the use of flush_kernel_vmap_range() vs flush_dcache_page() in patch 3. No firm conclusion has been reached on whether the current approach is correct in general, but what is clear is that it does not matter for arm64 as only the coherence between D-cache and I-cache is a concern there, and these pages are never executed. Since we have no easy way to verify the correctness of that patch w.r.t. the cache flushes, it doesn't seem worth spending more time investigating.
>
> For the 6.4 branch, I updated patch 5 to use vm_flags_set() (like the corresponding io_uring patch).
I knew about this, but somehow I forgot about it. Many thanks for taking care of this as well, greatly appreciate it!
Tudor