This series makes it possible for purecap apps to use the aio_ring shared memory region to bypass the overhead of the io_getevents syscall; libaio notably relies on this mechanism.
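For context, the fast path this enables looks roughly like the sketch below. It is modelled on what libaio does and on the struct aio_ring layout in fs/aio.c, and is illustrative only: the ring layout and AIO_RING_MAGIC are not part of the uAPI headers, and error handling is omitted.

```
/*
 * Illustrative userspace sketch of polling the shared aio_ring
 * (modelled on libaio's behaviour; not the libaio code itself).
 */
#include <linux/aio_abi.h>

#define AIO_RING_MAGIC	0xa10a10a1

struct aio_ring {
	unsigned id;
	unsigned nr;		/* number of io_events */
	unsigned head;
	unsigned tail;
	unsigned magic;
	unsigned compat_features;
	unsigned incompat_features;
	unsigned header_length;	/* size of aio_ring */
	struct io_event io_events[];
};

/* Consume up to nr completed events without entering the kernel. */
static long ring_getevents(aio_context_t ctx_id, struct io_event *out, long nr)
{
	struct aio_ring *ring = (struct aio_ring *)ctx_id;
	long got = 0;

	if (ring->magic != AIO_RING_MAGIC)
		return -1;	/* unexpected layout: fall back to io_getevents() */

	while (got < nr && ring->head != ring->tail) {
		out[got++] = ring->io_events[ring->head];
		__sync_synchronize();	/* read the event before moving head */
		ring->head = (ring->head + 1) % ring->nr;
	}
	return got;
}
```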
With these patches, all io_* LTP tests pass in both Purecap and plain AArch64 modes. Note that the LTP tests only cover the basic functionality of the aio subsystem; a significant portion of it remains untested by LTP.
For more comprehensive testing, libaio has been updated to the new uAPI and ported. All libaio tests pass as well, in both Purecap and plain AArch64 modes.
v1..v2:
- Add Patch 1 that fixes a parameter type for the compat handler
- Split the change of the types to user pointers into two patches: one for aio_context_t, and the other for the io_event struct fields
- vmap all the ring pages at the beginning and cache them in the ctx
- Don't remap the pages while allowing tag access to the shared memory; setting the VM flags is enough
- Change aio_context_t to a void __user *
- Improve commit messages
- Refactor some of the functions for compat handling
- Create valid user pointers from ctx_id when received from a compat task
Gitlab issue: https://git.morello-project.org/morello/kernel/linux/-/issues/49
Review branch: https://git.morello-project.org/tudcre01/linux/-/commits/morello/aio_v2
Tudor Cretu (6):
  aio: Fix type of nr parameter in compat handler of io_submit
  aio: vmap entire aio_ring instead of kmapping each page
  aio: Implement compat handling for the io_event struct
  aio: Allow capability tag access on the shared memory
  aio: Change aio_context_t to a user pointer
  aio: Use user pointer type in the io_event struct
 fs/aio.c                     | 306 ++++++++++++++++++++++-------------
 include/uapi/linux/aio_abi.h |  12 +-
 2 files changed, 198 insertions(+), 120 deletions(-)
The native handler of the io_submit syscall takes the nr parameter as a long. For consistency, change the type of the nr parameter in the compat handler to compat_long_t.
Reported-by: Kevin Brodsky kevin.brodsky@arm.com
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
---
 fs/aio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/aio.c b/fs/aio.c
index 079074f47c2e..7d64ba7e42a5 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -2139,7 +2139,7 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id,
-		       int, nr, compat_uptr_t __user *, iocbpp)
+		       compat_long_t, nr, compat_uptr_t __user *, iocbpp)
 {
 	struct kioctx *ctx;
 	long ret = 0;
Previously, the aio_ring mechanism relied on two assumptions:
1. The size of the header of an aio_ring is equal to the size of an io_event, i.e. sizeof(aio_ring) equals sizeof(io_event)
2. There is no leftover space at the end of a page when populating it with io_events, i.e. PAGE_SIZE is a multiple of sizeof(io_event)
These assumptions are not met in PCuABI. This change removes these assumptions, making the aio_ring mechanism more flexible and robust.
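Expressed as compile-time checks (purely illustrative, not part of the patch), the two assumptions read as follows. Both hold for the existing layout, whose header and io_event are 32 bytes each, but no longer hold in PCuABI once io_event carries capability-sized fields.

```
#include <linux/build_bug.h>

/* Assumption 1: the ring header occupies exactly one io_event slot. */
static_assert(sizeof(struct aio_ring) == sizeof(struct io_event));

/* Assumption 2: a page holds a whole number of io_events (no leftover space). */
static_assert(PAGE_SIZE % sizeof(struct io_event) == 0);
```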
In addition, this change improves the performance of the aio_ring mechanism. Previously, io_events were copied from the shared ring to userspace using multiple copy_to_user calls. This change batches the copying of io_events into a maximum of two copy_to_user calls.
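As a sketch of the resulting copy-out (simplified from the aio_read_events_ring() loop in the diff below; the helper name is made up), the pending events form at most two contiguous runs once the ring is mapped contiguously, hence at most two copy_to_user() calls:

```
/*
 * Simplified sketch, assuming the kioctx caches the vmapped ring as in
 * this patch. Pending events are either one run [head, tail) or, after a
 * wrap, two runs [head, nr_events) and [0, tail).
 */
static long copy_pending_events(struct kioctx *ctx,
				struct io_event __user *uevents,
				unsigned int head, unsigned int tail, long nr)
{
	long copied = 0;

	while (copied < nr && head != tail) {
		/* length of the contiguous run starting at head */
		long run = (head <= tail ? tail : ctx->nr_events) - head;

		run = min(run, nr - copied);
		if (copy_to_user(uevents + copied,
				 &ctx->ring->io_events[head],
				 sizeof(struct io_event) * run))
			return -EFAULT;

		copied += run;
		head = (head + run) % ctx->nr_events;	/* wraps at most once */
	}
	return copied;
}
```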
Mapping pages independently using kmap doesn't create a contiguous virtual space. As some io_event structs could span across different pages, all the pages necessary to access the events' data need to be mapped to a contiguous virtual address space. For this reason, the ring pages are vmapped at the context initialisation step, and the mapping is cached in the context.
The changes are similar to the approach taken in [0].
[0] https://lore.kernel.org/all/1349764760-21093-4-git-send-email-koverstreet@go...
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
---
 fs/aio.c | 99 ++++++++++++++++++--------------------------------------
 1 file changed, 32 insertions(+), 67 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index 7d64ba7e42a5..4fdab03bc0f2 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -119,6 +119,7 @@ struct kioctx { /* Size of ringbuffer, in units of struct io_event */ unsigned nr_events;
+ struct aio_ring *ring; unsigned long mmap_base; unsigned long mmap_size;
@@ -508,7 +509,6 @@ static const struct address_space_operations aio_ctx_aops = {
static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) { - struct aio_ring *ring; struct mm_struct *mm = current->mm; unsigned long size, unused; int nr_pages; @@ -589,24 +589,23 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) ctx->user_id = ctx->mmap_base; ctx->nr_events = nr_events; /* trusted copy */
- ring = kmap_atomic(ctx->ring_pages[0]); - ring->nr = nr_events; /* user copy */ - ring->id = ~0U; - ring->head = ring->tail = 0; - ring->magic = AIO_RING_MAGIC; - ring->compat_features = AIO_RING_COMPAT_FEATURES; - ring->incompat_features = AIO_RING_INCOMPAT_FEATURES; - ring->header_length = sizeof(struct aio_ring); - kunmap_atomic(ring); - flush_dcache_page(ctx->ring_pages[0]); + ctx->ring = vmap(ctx->ring_pages, nr_pages, VM_MAP, PAGE_KERNEL); + if (!ctx->ring) { + aio_free_ring(ctx); + return -ENOMEM; + } + + ctx->ring->nr = nr_events; /* user copy */ + ctx->ring->id = ~0U; + ctx->ring->head = ctx->ring->tail = 0; + ctx->ring->magic = AIO_RING_MAGIC; + ctx->ring->compat_features = AIO_RING_COMPAT_FEATURES; + ctx->ring->incompat_features = AIO_RING_INCOMPAT_FEATURES; + ctx->ring->header_length = sizeof(struct aio_ring);
return 0; }
-#define AIO_EVENTS_PER_PAGE (PAGE_SIZE / sizeof(struct io_event)) -#define AIO_EVENTS_FIRST_PAGE ((PAGE_SIZE - sizeof(struct aio_ring)) / sizeof(struct io_event)) -#define AIO_EVENTS_OFFSET (AIO_EVENTS_PER_PAGE - AIO_EVENTS_FIRST_PAGE) - void kiocb_set_cancel_fn(struct kiocb *iocb, kiocb_cancel_fn *cancel) { struct aio_kiocb *req = container_of(iocb, struct aio_kiocb, rw); @@ -683,7 +682,6 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) { unsigned i, new_nr; struct kioctx_table *table, *old; - struct aio_ring *ring;
spin_lock(&mm->ioctx_lock); table = rcu_dereference_raw(mm->ioctx_table); @@ -700,9 +698,7 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) * we are protected from page migration * changes ring_pages by ->ring_lock. */ - ring = kmap_atomic(ctx->ring_pages[0]); - ring->id = ctx->id; - kunmap_atomic(ring); + ctx->ring->id = ctx->id; return 0; }
@@ -1031,7 +1027,6 @@ static void user_refill_reqs_available(struct kioctx *ctx) { spin_lock_irq(&ctx->completion_lock); if (ctx->completed_events) { - struct aio_ring *ring; unsigned head;
/* Access of ring->head may race with aio_read_events_ring() @@ -1043,9 +1038,7 @@ static void user_refill_reqs_available(struct kioctx *ctx) * against ctx->completed_events below will make sure we do the * safe/right thing. */ - ring = kmap_atomic(ctx->ring_pages[0]); - head = ring->head; - kunmap_atomic(ring); + head = ctx->ring->head;
refill_reqs_available(ctx, head, ctx->tail); } @@ -1133,9 +1126,7 @@ static inline void iocb_destroy(struct aio_kiocb *iocb) static void aio_complete(struct aio_kiocb *iocb) { struct kioctx *ctx = iocb->ki_ctx; - struct aio_ring *ring; - struct io_event *ev_page, *event; - unsigned tail, pos, head; + unsigned int tail, head; unsigned long flags;
/* @@ -1146,18 +1137,8 @@ static void aio_complete(struct aio_kiocb *iocb) spin_lock_irqsave(&ctx->completion_lock, flags);
tail = ctx->tail; - pos = tail + AIO_EVENTS_OFFSET; - - if (++tail >= ctx->nr_events) - tail = 0; - - ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]); - event = ev_page + pos % AIO_EVENTS_PER_PAGE; - - *event = iocb->ki_res;
- kunmap_atomic(ev_page); - flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]); + ctx->ring->io_events[tail] = iocb->ki_res;
pr_debug("%p[%u]: %p: %p %Lx %Lx %Lx\n", ctx, tail, iocb, (void __user *)(unsigned long)iocb->ki_res.obj, @@ -1168,13 +1149,12 @@ static void aio_complete(struct aio_kiocb *iocb) */ smp_wmb(); /* make event visible before updating tail */
- ctx->tail = tail; + if (++tail >= ctx->nr_events) + tail = 0;
- ring = kmap_atomic(ctx->ring_pages[0]); - head = ring->head; - ring->tail = tail; - kunmap_atomic(ring); - flush_dcache_page(ctx->ring_pages[0]); + ctx->tail = tail; + head = ctx->ring->head; + ctx->ring->tail = tail;
ctx->completed_events++; if (ctx->completed_events > 1) @@ -1218,10 +1198,8 @@ static inline void iocb_put(struct aio_kiocb *iocb) static long aio_read_events_ring(struct kioctx *ctx, struct io_event __user *event, long nr) { - struct aio_ring *ring; - unsigned head, tail, pos; + unsigned int head, tail; long ret = 0; - int copy_ret;
/* * The mutex can block and wake us up and that will cause @@ -1233,10 +1211,8 @@ static long aio_read_events_ring(struct kioctx *ctx, mutex_lock(&ctx->ring_lock);
/* Access to ->ring_pages here is protected by ctx->ring_lock. */ - ring = kmap_atomic(ctx->ring_pages[0]); - head = ring->head; - tail = ring->tail; - kunmap_atomic(ring); + head = ctx->ring->head; + tail = ctx->ring->tail;
/* * Ensure that once we've read the current tail pointer, that @@ -1254,24 +1230,16 @@ static long aio_read_events_ring(struct kioctx *ctx,
while (ret < nr) { long avail; - struct io_event *ev; - struct page *page; + int copy_ret;
- avail = (head <= tail ? tail : ctx->nr_events) - head; if (head == tail) break; - - pos = head + AIO_EVENTS_OFFSET; - page = ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]; - pos %= AIO_EVENTS_PER_PAGE; - + avail = (head <= tail ? tail : ctx->nr_events) - head; avail = min(avail, nr - ret); - avail = min_t(long, avail, AIO_EVENTS_PER_PAGE - pos);
- ev = kmap(page); - copy_ret = copy_to_user(event + ret, ev + pos, - sizeof(*ev) * avail); - kunmap(page); + copy_ret = copy_to_user(event + ret, + &ctx->ring->io_events[head], + sizeof(struct io_event) * avail);
if (unlikely(copy_ret)) { ret = -EFAULT; @@ -1283,10 +1251,7 @@ static long aio_read_events_ring(struct kioctx *ctx, head %= ctx->nr_events; }
- ring = kmap_atomic(ctx->ring_pages[0]); - ring->head = head; - kunmap_atomic(ring); - flush_dcache_page(ctx->ring_pages[0]); + ctx->ring->head = head;
pr_debug("%li h%u t%u\n", ret, head, tail); out:
On 15/06/2023 17:54, Tudor Cretu wrote:
Previously, the aio_ring mechanism relied on two assumptions:
Nit: I find it clearer to talk about the situation before the patch as the current situation (present tense), and then say what the patch is doing to change that. It's just a convention really, it's also tempting to do it the way you did.
- The size of the header of an aio_ring is equal to the size of an io_event, i.e. sizeof(aio_ring) equals sizeof(io_event)
- There is no leftover space at the end of a page when populating it with io_events, i.e. PAGE_SIZE is a multiple of sizeof(io_event)
These assumptions are not met in PCuABI. This change removes these assumptions, making the aio_ring mechanism more flexible and robust.
In addition, this change improves the performance of the aio_ring mechanism. Previously, io_events were copied from the shared ring to userspace using multiple copy_to_user calls. This change batches the copying of io_events into a maximum of two copy_to_user calls.
Mapping pages independently using kmap doesn't create a contiguous virtual space. As some io_event structs could span across different pages, all the pages necessary to access the events' data need to be mapped to a contiguous virtual address space. For this reason, the ring pages are vmapped at the context initialisation step, and the mapping is cached in the context.
Should this paragraph be before the previous one? Considering this is the main justification it would feel more natural.
The changes are similar to the approach taken in [0].
[0] https://lore.kernel.org/all/1349764760-21093-4-git-send-email-koverstreet@go...
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
fs/aio.c | 99 ++++++++++++++++++-------------------------------------- 1 file changed, 32 insertions(+), 67 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index 7d64ba7e42a5..4fdab03bc0f2 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -119,6 +119,7 @@ struct kioctx { /* Size of ringbuffer, in units of struct io_event */ unsigned nr_events;
- struct aio_ring *ring; unsigned long mmap_base; unsigned long mmap_size;
@@ -508,7 +509,6 @@ static const struct address_space_operations aio_ctx_aops = { static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) {
- struct aio_ring *ring; struct mm_struct *mm = current->mm; unsigned long size, unused; int nr_pages;
@@ -589,24 +589,23 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) ctx->user_id = ctx->mmap_base; ctx->nr_events = nr_events; /* trusted copy */
- ring = kmap_atomic(ctx->ring_pages[0]);
- ring->nr = nr_events; /* user copy */
- ring->id = ~0U;
- ring->head = ring->tail = 0;
- ring->magic = AIO_RING_MAGIC;
- ring->compat_features = AIO_RING_COMPAT_FEATURES;
- ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
- ring->header_length = sizeof(struct aio_ring);
- kunmap_atomic(ring);
- flush_dcache_page(ctx->ring_pages[0]);
- ctx->ring = vmap(ctx->ring_pages, nr_pages, VM_MAP, PAGE_KERNEL);
- if (!ctx->ring) {
aio_free_ring(ctx);
return -ENOMEM;
- }
- ctx->ring->nr = nr_events; /* user copy */
- ctx->ring->id = ~0U;
- ctx->ring->head = ctx->ring->tail = 0;
- ctx->ring->magic = AIO_RING_MAGIC;
- ctx->ring->compat_features = AIO_RING_COMPAT_FEATURES;
- ctx->ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
- ctx->ring->header_length = sizeof(struct aio_ring);
return 0; } -#define AIO_EVENTS_PER_PAGE (PAGE_SIZE / sizeof(struct io_event)) -#define AIO_EVENTS_FIRST_PAGE ((PAGE_SIZE - sizeof(struct aio_ring)) / sizeof(struct io_event)) -#define AIO_EVENTS_OFFSET (AIO_EVENTS_PER_PAGE - AIO_EVENTS_FIRST_PAGE)
void kiocb_set_cancel_fn(struct kiocb *iocb, kiocb_cancel_fn *cancel) { struct aio_kiocb *req = container_of(iocb, struct aio_kiocb, rw); @@ -683,7 +682,6 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) { unsigned i, new_nr; struct kioctx_table *table, *old;
- struct aio_ring *ring;
spin_lock(&mm->ioctx_lock); table = rcu_dereference_raw(mm->ioctx_table); @@ -700,9 +698,7 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) * we are protected from page migration
Speaking of migration, I wonder if we shouldn't do something about aio_migrate_folio(), since it seems to replace the backing pages (I don't completely understand what it does though).
* changes ring_pages by ->ring_lock. */
ring = kmap_atomic(ctx->ring_pages[0]);
ring->id = ctx->id;
kunmap_atomic(ring);
ctx->ring->id = ctx->id; return 0; }
@@ -1031,7 +1027,6 @@ static void user_refill_reqs_available(struct kioctx *ctx) { spin_lock_irq(&ctx->completion_lock); if (ctx->completed_events) {
- struct aio_ring *ring;
unsigned head;
/* Access of ring->head may race with aio_read_events_ring() @@ -1043,9 +1038,7 @@ static void user_refill_reqs_available(struct kioctx *ctx) * against ctx->completed_events below will make sure we do the * safe/right thing. */
ring = kmap_atomic(ctx->ring_pages[0]);
head = ring->head;
kunmap_atomic(ring);
head = ctx->ring->head;
refill_reqs_available(ctx, head, ctx->tail); } @@ -1133,9 +1126,7 @@ static inline void iocb_destroy(struct aio_kiocb *iocb) static void aio_complete(struct aio_kiocb *iocb) { struct kioctx *ctx = iocb->ki_ctx;
- struct aio_ring *ring;
- struct io_event *ev_page, *event;
- unsigned tail, pos, head;
- unsigned int tail, head; unsigned long flags;
/* @@ -1146,18 +1137,8 @@ static void aio_complete(struct aio_kiocb *iocb) spin_lock_irqsave(&ctx->completion_lock, flags); tail = ctx->tail;
- pos = tail + AIO_EVENTS_OFFSET;
- if (++tail >= ctx->nr_events)
tail = 0;
- ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
- event = ev_page + pos % AIO_EVENTS_PER_PAGE;
- *event = iocb->ki_res;
- kunmap_atomic(ev_page);
- flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
- ctx->ring->io_events[tail] = iocb->ki_res;
pr_debug("%p[%u]: %p: %p %Lx %Lx %Lx\n", ctx, tail, iocb, (void __user *)(unsigned long)iocb->ki_res.obj, @@ -1168,13 +1149,12 @@ static void aio_complete(struct aio_kiocb *iocb) */ smp_wmb(); /* make event visible before updating tail */
- ctx->tail = tail;
- if (++tail >= ctx->nr_events)
Nit: should stay before the pr_debug() to keep the output unchanged. Not sure printing the new tail value was intended, but just in case...
tail = 0;
- ring = kmap_atomic(ctx->ring_pages[0]);
- head = ring->head;
- ring->tail = tail;
- kunmap_atomic(ring);
- flush_dcache_page(ctx->ring_pages[0]);
- ctx->tail = tail;
- head = ctx->ring->head;
- ctx->ring->tail = tail;
ctx->completed_events++; if (ctx->completed_events > 1) @@ -1218,10 +1198,8 @@ static inline void iocb_put(struct aio_kiocb *iocb) static long aio_read_events_ring(struct kioctx *ctx, struct io_event __user *event, long nr) {
- struct aio_ring *ring;
- unsigned head, tail, pos;
- unsigned int head, tail; long ret = 0;
- int copy_ret;
/* * The mutex can block and wake us up and that will cause @@ -1233,10 +1211,8 @@ static long aio_read_events_ring(struct kioctx *ctx, mutex_lock(&ctx->ring_lock); /* Access to ->ring_pages here is protected by ctx->ring_lock. */
Nit: I guess s/ring_pages/ring
- ring = kmap_atomic(ctx->ring_pages[0]);
- head = ring->head;
- tail = ring->tail;
- kunmap_atomic(ring);
- head = ctx->ring->head;
- tail = ctx->ring->tail;
/* * Ensure that once we've read the current tail pointer, that @@ -1254,24 +1230,16 @@ static long aio_read_events_ring(struct kioctx *ctx, while (ret < nr) { long avail;
struct io_event *ev;
struct page *page;
int copy_ret;
avail = (head <= tail ? tail : ctx->nr_events) - head;
Nit: I'd rather avoid this sort of move (both this line and copy_ret), as it's not actually necessary or related to what the patch does. Both do make sense, but they are a bit distracting when looking at the patch.
Kevin
if (head == tail) break;
pos = head + AIO_EVENTS_OFFSET;
page = ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE];
pos %= AIO_EVENTS_PER_PAGE;
avail = (head <= tail ? tail : ctx->nr_events) - head;
avail = min(avail, nr - ret);
avail = min_t(long, avail, AIO_EVENTS_PER_PAGE - pos);
ev = kmap(page);
copy_ret = copy_to_user(event + ret, ev + pos,
sizeof(*ev) * avail);
kunmap(page);
copy_ret = copy_to_user(event + ret,
&ctx->ring->io_events[head],
sizeof(struct io_event) * avail);
if (unlikely(copy_ret)) { ret = -EFAULT; @@ -1283,10 +1251,7 @@ static long aio_read_events_ring(struct kioctx *ctx, head %= ctx->nr_events; }
- ring = kmap_atomic(ctx->ring_pages[0]);
- ring->head = head;
- kunmap_atomic(ring);
- flush_dcache_page(ctx->ring_pages[0]);
- ctx->ring->head = head;
pr_debug("%li h%u t%u\n", ret, head, tail); out:
On 22-06-2023 08:39, Kevin Brodsky wrote:
On 15/06/2023 17:54, Tudor Cretu wrote:
Previously, the aio_ring mechanism relied on two assumptions:
Nit: I find it clearer to talk about the situation before the patch as the current situation (present tense), and then say what the patch is doing to change that. It's just a convention really, it's also tempting to do it the way you did.
Done!
- The size of the header of an aio_ring is equal to the size of an io_event, i.e. sizeof(aio_ring) equals sizeof(io_event)
- There is no leftover space at the end of a page when populating it with io_events, i.e. PAGE_SIZE is a multiple of sizeof(io_event)
These assumptions are not met in PCuABI. This change removes these assumptions, making the aio_ring mechanism more flexible and robust.
In addition, this change improves the performance of the aio_ring mechanism. Previously, io_events were copied from the shared ring to userspace using multiple copy_to_user calls. This change batches the copying of io_events into a maximum of two copy_to_user calls.
Mapping pages independently using kmap doesn't create a contiguous virtual space. As some io_event structs could span across different pages, all the pages necessary to access the events' data need to be mapped to a contiguous virtual address space. For this reason, the ring pages are vmapped at the context initialisation step, and the mapping is cached in the context.
Should this paragraph be before the previous one? Considering this is the main justification it would feel more natural.
Indeed!
The changes are similar to the approach taken in [0].
[0] https://lore.kernel.org/all/1349764760-21093-4-git-send-email-koverstreet@go...
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
fs/aio.c | 99 ++++++++++++++++++-------------------------------------- 1 file changed, 32 insertions(+), 67 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index 7d64ba7e42a5..4fdab03bc0f2 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -119,6 +119,7 @@ struct kioctx { /* Size of ringbuffer, in units of struct io_event */ unsigned nr_events;
- struct aio_ring *ring; unsigned long mmap_base; unsigned long mmap_size;
@@ -508,7 +509,6 @@ static const struct address_space_operations aio_ctx_aops = { static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) {
- struct aio_ring *ring; struct mm_struct *mm = current->mm; unsigned long size, unused; int nr_pages;
@@ -589,24 +589,23 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) ctx->user_id = ctx->mmap_base; ctx->nr_events = nr_events; /* trusted copy */
- ring = kmap_atomic(ctx->ring_pages[0]);
- ring->nr = nr_events; /* user copy */
- ring->id = ~0U;
- ring->head = ring->tail = 0;
- ring->magic = AIO_RING_MAGIC;
- ring->compat_features = AIO_RING_COMPAT_FEATURES;
- ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
- ring->header_length = sizeof(struct aio_ring);
- kunmap_atomic(ring);
- flush_dcache_page(ctx->ring_pages[0]);
- ctx->ring = vmap(ctx->ring_pages, nr_pages, VM_MAP, PAGE_KERNEL);
- if (!ctx->ring) {
aio_free_ring(ctx);
return -ENOMEM;
- }
- ctx->ring->nr = nr_events; /* user copy */
- ctx->ring->id = ~0U;
- ctx->ring->head = ctx->ring->tail = 0;
- ctx->ring->magic = AIO_RING_MAGIC;
- ctx->ring->compat_features = AIO_RING_COMPAT_FEATURES;
- ctx->ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
- ctx->ring->header_length = sizeof(struct aio_ring);
return 0; } -#define AIO_EVENTS_PER_PAGE (PAGE_SIZE / sizeof(struct io_event)) -#define AIO_EVENTS_FIRST_PAGE ((PAGE_SIZE - sizeof(struct aio_ring)) / sizeof(struct io_event)) -#define AIO_EVENTS_OFFSET (AIO_EVENTS_PER_PAGE - AIO_EVENTS_FIRST_PAGE)
- void kiocb_set_cancel_fn(struct kiocb *iocb, kiocb_cancel_fn *cancel) { struct aio_kiocb *req = container_of(iocb, struct aio_kiocb, rw);
@@ -683,7 +682,6 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) { unsigned i, new_nr; struct kioctx_table *table, *old;
- struct aio_ring *ring;
spin_lock(&mm->ioctx_lock); table = rcu_dereference_raw(mm->ioctx_table); @@ -700,9 +698,7 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) * we are protected from page migration
Speaking of migration, I wonder if we shouldn't do something about aio_migrate_folio(), since it seems to replace the backing pages (I don't completely understand what it does though).
I don't think we need to do anything. IIUC, the migration of the physical location of the pages doesn't change their virtual addresses. From mm/Kconfig:
```
config MIGRATION
	bool "Page migration"
	def_bool y
	depends on (NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION || CMA) && MMU
	help
	  Allows the migration of the physical location of pages of processes
	  while the virtual addresses are not changed. This is useful in
	  two situations. The first is on NUMA systems to put pages nearer
	  to the processors accessing. The second is when allocating huge
	  pages as migration can relocate pages to satisfy a huge page
	  allocation instead of reclaiming.
```
So, the vmap should still be valid even after migration.
* changes ring_pages by ->ring_lock. */
ring = kmap_atomic(ctx->ring_pages[0]);
ring->id = ctx->id;
kunmap_atomic(ring);
ctx->ring->id = ctx->id; return 0; }
@@ -1031,7 +1027,6 @@ static void user_refill_reqs_available(struct kioctx *ctx) { spin_lock_irq(&ctx->completion_lock); if (ctx->completed_events) {
- struct aio_ring *ring;
unsigned head;
/* Access of ring->head may race with aio_read_events_ring() @@ -1043,9 +1038,7 @@ static void user_refill_reqs_available(struct kioctx *ctx) * against ctx->completed_events below will make sure we do the * safe/right thing. */
ring = kmap_atomic(ctx->ring_pages[0]);
head = ring->head;
kunmap_atomic(ring);
head = ctx->ring->head;
refill_reqs_available(ctx, head, ctx->tail); } @@ -1133,9 +1126,7 @@ static inline void iocb_destroy(struct aio_kiocb *iocb) static void aio_complete(struct aio_kiocb *iocb) { struct kioctx *ctx = iocb->ki_ctx;
- struct aio_ring *ring;
- struct io_event *ev_page, *event;
- unsigned tail, pos, head;
- unsigned int tail, head; unsigned long flags;
/* @@ -1146,18 +1137,8 @@ static void aio_complete(struct aio_kiocb *iocb) spin_lock_irqsave(&ctx->completion_lock, flags); tail = ctx->tail;
- pos = tail + AIO_EVENTS_OFFSET;
- if (++tail >= ctx->nr_events)
tail = 0;
- ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
- event = ev_page + pos % AIO_EVENTS_PER_PAGE;
- *event = iocb->ki_res;
- kunmap_atomic(ev_page);
- flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
- ctx->ring->io_events[tail] = iocb->ki_res;
pr_debug("%p[%u]: %p: %p %Lx %Lx %Lx\n", ctx, tail, iocb, (void __user *)(unsigned long)iocb->ki_res.obj, @@ -1168,13 +1149,12 @@ static void aio_complete(struct aio_kiocb *iocb) */ smp_wmb(); /* make event visible before updating tail */
- ctx->tail = tail;
- if (++tail >= ctx->nr_events)
Nit: should stay before the pr_debug() to keep the output unchanged. Not sure printing the new tail value was intended, but just in case...
According to the original pr_debug() message, the output should be the updated index of the context, not the new tail. I agree though, this change is not relevant to this commit, so I moved the tail's increment back before the pr_debug().
tail = 0;
- ring = kmap_atomic(ctx->ring_pages[0]);
- head = ring->head;
- ring->tail = tail;
- kunmap_atomic(ring);
- flush_dcache_page(ctx->ring_pages[0]);
- ctx->tail = tail;
- head = ctx->ring->head;
- ctx->ring->tail = tail;
ctx->completed_events++; if (ctx->completed_events > 1) @@ -1218,10 +1198,8 @@ static inline void iocb_put(struct aio_kiocb *iocb) static long aio_read_events_ring(struct kioctx *ctx, struct io_event __user *event, long nr) {
- struct aio_ring *ring;
- unsigned head, tail, pos;
- unsigned int head, tail; long ret = 0;
- int copy_ret;
/* * The mutex can block and wake us up and that will cause @@ -1233,10 +1211,8 @@ static long aio_read_events_ring(struct kioctx *ctx, mutex_lock(&ctx->ring_lock); /* Access to ->ring_pages here is protected by ctx->ring_lock. */
Nit: I guess s/ring_pages/ring
Thanks!
- ring = kmap_atomic(ctx->ring_pages[0]);
- head = ring->head;
- tail = ring->tail;
- kunmap_atomic(ring);
- head = ctx->ring->head;
- tail = ctx->ring->tail;
/* * Ensure that once we've read the current tail pointer, that @@ -1254,24 +1230,16 @@ static long aio_read_events_ring(struct kioctx *ctx, while (ret < nr) { long avail;
struct io_event *ev;
struct page *page;
int copy_ret;
avail = (head <= tail ? tail : ctx->nr_events) - head;
Nit: I'd rather avoid this sort of move (both this line and copy_ret), as it's not actually necessary or related to what the patch does. Both do make sense, but they are a bit distracting when looking at the patch.
Sure!
Thanks, Tudor
Kevin
if (head == tail) break;
pos = head + AIO_EVENTS_OFFSET;
page = ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE];
pos %= AIO_EVENTS_PER_PAGE;
avail = (head <= tail ? tail : ctx->nr_events) - head;
avail = min(avail, nr - ret);
avail = min_t(long, avail, AIO_EVENTS_PER_PAGE - pos);
ev = kmap(page);
copy_ret = copy_to_user(event + ret, ev + pos,
sizeof(*ev) * avail);
kunmap(page);
copy_ret = copy_to_user(event + ret,
&ctx->ring->io_events[head],
sizeof(struct io_event) * avail);
if (unlikely(copy_ret)) { ret = -EFAULT; @@ -1283,10 +1251,7 @@ static long aio_read_events_ring(struct kioctx *ctx, head %= ctx->nr_events; }
- ring = kmap_atomic(ctx->ring_pages[0]);
- ring->head = head;
- kunmap_atomic(ring);
- flush_dcache_page(ctx->ring_pages[0]);
- ctx->ring->head = head;
pr_debug("%li h%u t%u\n", ret, head, tail); out:
On 23/06/2023 15:34, Tudor Cretu wrote:
Speaking of migration, I wonder if we shouldn't do something about aio_migrate_folio(), since it seems to replace the backing pages (I don't completely understand what it does though).
I don't think we need to do anything. IIUC, the migration of the physical location of the pages doesn't change their virtual addresses. From mm/Kconfig:
config MIGRATION
	bool "Page migration"
	def_bool y
	depends on (NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION || CMA) && MMU
	help
	  Allows the migration of the physical location of pages of processes
	  while the virtual addresses are not changed. This is useful in
	  two situations. The first is on NUMA systems to put pages nearer
	  to the processors accessing. The second is when allocating huge
	  pages as migration can relocate pages to satisfy a huge page
	  allocation instead of reclaiming.
So, the vmap should still be valid even after migration.
Right, the vmap mapping does not need to move for sure. However, I am not sure what happens if folio_migrate_mapping() is passed pages that are already mapped by vmap(). Would it magically update the mapping (and invalidate TLBs), or would it fail? For that matter I don't really understand what happens to the userspace mapping either, as clearly userspace doesn't do anything itself. That might require further investigation.
Kevin
Some userspace software, including libaio, uses struct aio_ring's layout, even though the uAPI headers don't expose it. This is typically done to skip the io_getevents syscall's overhead by reading directly from the io_events array in the aio_ring, which is shared with userspace.
As the io_event struct might contain pointers as members, introduce a compat version of the struct. Also, implement functions that convert the compat version to the native version of the struct.
A subsequent patch is going to change the io_event struct to enable it to support new architectures. On such architectures, the current struct layout still needs to be supported for compat tasks.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
---
 fs/aio.c | 123 +++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 102 insertions(+), 21 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index 4fdab03bc0f2..31a9b1c818e0 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -65,9 +65,8 @@ struct aio_ring { unsigned incompat_features; unsigned header_length; /* size of aio_ring */
- struct io_event io_events[]; -}; /* 128 bytes + ring size */ +}; /* 32 bytes + ring size */
/* * Plugging is meant to work with larger batches of IOs. If we don't @@ -123,6 +122,11 @@ struct kioctx { unsigned long mmap_base; unsigned long mmap_size;
+ union { + struct compat_io_event *io_events_compat; + struct io_event *io_events; + }; + struct page **ring_pages; long nr_pages;
@@ -165,6 +169,9 @@ struct kioctx { struct file *aio_ring_file;
unsigned id; +#ifdef CONFIG_COMPAT64 + bool compat; +#endif };
/* @@ -219,6 +226,13 @@ struct aio_kiocb { struct eventfd_ctx *ki_eventfd; };
+struct compat_io_event { + __u64 data; + __u64 obj; + __s64 res; + __s64 res2; +}; + struct compat_iocb { __u64 aio_data; #if defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN) @@ -281,6 +295,56 @@ static struct vfsmount *aio_mnt; static const struct file_operations aio_ring_fops; static const struct address_space_operations aio_ctx_aops;
+static inline bool aio_in_compat64(struct kioctx *ctx) +{ + return IS_ENABLED(CONFIG_COMPAT64) && ctx->compat; +} + +static inline size_t io_event_size(struct kioctx *ctx) +{ + return aio_in_compat64(ctx) ? sizeof(struct compat_io_event) + : sizeof(struct io_event); +} + +static u32 __user *aio_key_uptr(struct kioctx *ctx, + struct iocb __user *user_iocb) +{ + return aio_in_compat64(ctx) ? + &((struct compat_iocb __user *)user_iocb)->aio_key : + &user_iocb->aio_key; +} + +static int copy_io_events_to_user(struct kioctx *ctx, + struct io_event __user *event_array, + long offset, + long ring_head, + int nr) +{ + if (aio_in_compat64(ctx)) + return copy_to_user((struct compat_io_event __user *)event_array + offset, + &ctx->io_events_compat[ring_head], + io_event_size(ctx) * nr); + return copy_to_user(event_array + offset, + &ctx->io_events[ring_head], + io_event_size(ctx) * nr); +} + +static void copy_io_event_to_ring(struct kioctx *ctx, + long ring_tail, + struct io_event *native_event) +{ + if (aio_in_compat64(ctx)) { + struct compat_io_event *compat_ring_event = &ctx->io_events_compat[ring_tail]; + + compat_ring_event->data = native_event->data; + compat_ring_event->obj = native_event->obj; + compat_ring_event->res = native_event->res; + compat_ring_event->res2 = native_event->res2; + return; + } + ctx->io_events[ring_tail] = *native_event; +} + static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages) { struct file *file; @@ -352,6 +416,9 @@ static void aio_free_ring(struct kioctx *ctx) { int i;
+ if (ctx->ring) + vunmap(ctx->ring); + /* Disconnect the kiotx from the ring file. This prevents future * accesses to the kioctx from page migration. */ @@ -519,7 +586,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) nr_events += 2; /* 1 is required, 2 for good luck */
size = sizeof(struct aio_ring); - size += sizeof(struct io_event) * nr_events; + size += io_event_size(ctx) * nr_events;
nr_pages = PFN_UP(size); if (nr_pages < 0) @@ -533,7 +600,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
ctx->aio_ring_file = file; nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) - / sizeof(struct io_event); + / io_event_size(ctx);
ctx->ring_pages = ctx->internal_pages; if (nr_pages > AIO_RING_PAGES) { @@ -595,6 +662,8 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) return -ENOMEM; }
+ ctx->io_events = ctx->ring->io_events; + ctx->ring->nr = nr_events; /* user copy */ ctx->ring->id = ~0U; ctx->ring->head = ctx->ring->tail = 0; @@ -766,8 +835,16 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) nr_events = max(nr_events, num_possible_cpus() * 4); nr_events *= 2;
+ ctx = kmem_cache_zalloc(kioctx_cachep, GFP_KERNEL); + if (!ctx) + return ERR_PTR(-ENOMEM); + +#ifdef CONFIG_COMPAT64 + ctx->compat = in_compat_syscall(); +#endif + /* Prevent overflows */ - if (nr_events > (0x10000000U / sizeof(struct io_event))) { + if (nr_events > (0x10000000U / io_event_size(ctx))) { pr_debug("ENOMEM: nr_events too high\n"); return ERR_PTR(-EINVAL); } @@ -775,10 +852,6 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (!nr_events || (unsigned long)max_reqs > aio_max_nr) return ERR_PTR(-EAGAIN);
- ctx = kmem_cache_zalloc(kioctx_cachep, GFP_KERNEL); - if (!ctx) - return ERR_PTR(-ENOMEM); - ctx->max_reqs = max_reqs;
spin_lock_init(&ctx->ctx_lock); @@ -1138,7 +1211,7 @@ static void aio_complete(struct aio_kiocb *iocb)
tail = ctx->tail;
- ctx->ring->io_events[tail] = iocb->ki_res; + copy_io_event_to_ring(ctx, tail, &iocb->ki_res);
pr_debug("%p[%u]: %p: %p %Lx %Lx %Lx\n", ctx, tail, iocb, (void __user *)(unsigned long)iocb->ki_res.obj, @@ -1196,7 +1269,8 @@ static inline void iocb_put(struct aio_kiocb *iocb) * events fetched */ static long aio_read_events_ring(struct kioctx *ctx, - struct io_event __user *event, long nr) + struct io_event __user *event, long idx, + long nr) { unsigned int head, tail; long ret = 0; @@ -1237,9 +1311,7 @@ static long aio_read_events_ring(struct kioctx *ctx, avail = (head <= tail ? tail : ctx->nr_events) - head; avail = min(avail, nr - ret);
- copy_ret = copy_to_user(event + ret, - &ctx->ring->io_events[head], - sizeof(struct io_event) * avail); + copy_ret = copy_io_events_to_user(ctx, event, idx + ret, head, avail);
if (unlikely(copy_ret)) { ret = -EFAULT; @@ -1263,7 +1335,10 @@ static long aio_read_events_ring(struct kioctx *ctx, static bool aio_read_events(struct kioctx *ctx, long min_nr, long nr, struct io_event __user *event, long *i) { - long ret = aio_read_events_ring(ctx, event + *i, nr - *i); + long ret = aio_read_events_ring(ctx, + event, + *i, + nr - *i);
if (ret > 0) *i += ret; @@ -1942,7 +2017,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, req->ki_eventfd = eventfd; }
- if (unlikely(put_user(KIOCB_KEY, &user_iocb->aio_key))) { + if (unlikely(put_user(KIOCB_KEY, aio_key_uptr(ctx, user_iocb)))) { pr_debug("EFAULT: aio_key\n"); return -EFAULT; } @@ -2164,15 +2239,20 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb, u32 key; u64 obj = user_ptr_addr(iocb);
- if (unlikely(get_user(key, &iocb->aio_key))) - return -EFAULT; - if (unlikely(key != KIOCB_KEY)) - return -EINVAL; - ctx = lookup_ioctx(ctx_id); if (unlikely(!ctx)) return -EINVAL;
+ if (unlikely(get_user(key, aio_key_uptr(ctx, iocb)))) { + ret = -EFAULT; + goto out; + } + + if (unlikely(key != KIOCB_KEY)) { + ret = -EINVAL; + goto out; + } + spin_lock_irq(&ctx->ctx_lock); /* TODO: use a hash or array, this sucks. */ list_for_each_entry(kiocb, &ctx->active_reqs, ki_list) { @@ -2193,6 +2273,7 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb, ret = -EINPROGRESS; }
+out: percpu_ref_put(&ctx->users);
return ret;
On 15/06/2023 17:54, Tudor Cretu wrote:
Some userspace software, including libaio, use struct aio_ring's layout, even if the uAPI headers don't expose it. This is typically done to skip the io_getevents syscall's overhead by reading directly from the io_events array in the aio_ring, which is shared with the userspace.
It's still not clear to me why direct access to the ring makes a difference, as we're not touching struct aio_ring itself and struct io_event is already exposed through syscalls.
As the io_event struct might contain pointers as members, introduce a compat version of the struct. Also, implement functions that convert the compat version to the native version of the struct.
A subsequent patch is going to change the io_event struct to enable it to support new architectures. On such architectures, the current struct layout still needs to be supported for compat tasks.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
fs/aio.c | 123 +++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 102 insertions(+), 21 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index 4fdab03bc0f2..31a9b1c818e0 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -65,9 +65,8 @@ struct aio_ring { unsigned incompat_features; unsigned header_length; /* size of aio_ring */
- struct io_event io_events[];
-}; /* 128 bytes + ring size */ +}; /* 32 bytes + ring size */ /*
- Plugging is meant to work with larger batches of IOs. If we don't
@@ -123,6 +122,11 @@ struct kioctx { unsigned long mmap_base; unsigned long mmap_size;
- union {
struct compat_io_event *io_events_compat;
struct io_event *io_events;
- };
I'm not convinced this is worth it. io_events_compat is used in just two places; casting ctx->ring->io_events to struct compat_io_event * there would be less overhead than adding that member to struct kioctx.
- struct page **ring_pages; long nr_pages;
@@ -165,6 +169,9 @@ struct kioctx { struct file *aio_ring_file; unsigned id; +#ifdef CONFIG_COMPAT64
- bool compat;
+#endif }; /* @@ -219,6 +226,13 @@ struct aio_kiocb { struct eventfd_ctx *ki_eventfd; }; +struct compat_io_event {
- __u64 data;
- __u64 obj;
- __s64 res;
- __s64 res2;
+};
struct compat_iocb { __u64 aio_data; #if defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN) @@ -281,6 +295,56 @@ static struct vfsmount *aio_mnt; static const struct file_operations aio_ring_fops; static const struct address_space_operations aio_ctx_aops; +static inline bool aio_in_compat64(struct kioctx *ctx) +{
- return IS_ENABLED(CONFIG_COMPAT64) && ctx->compat;
+}
+static inline size_t io_event_size(struct kioctx *ctx) +{
- return aio_in_compat64(ctx) ? sizeof(struct compat_io_event)
: sizeof(struct io_event);
+}
+static u32 __user *aio_key_uptr(struct kioctx *ctx,
struct iocb __user *user_iocb)
+{
- return aio_in_compat64(ctx) ?
&((struct compat_iocb __user *)user_iocb)->aio_key :
&user_iocb->aio_key;
+}
+static int copy_io_events_to_user(struct kioctx *ctx,
struct io_event __user *event_array,
long offset,
long ring_head,
int nr)
Maybe long to match avail in aio_read_events_ring()? Otherwise much clearer now!
+{
- if (aio_in_compat64(ctx))
return copy_to_user((struct compat_io_event __user *)event_array + offset,
&ctx->io_events_compat[ring_head],
io_event_size(ctx) * nr);
- return copy_to_user(event_array + offset,
&ctx->io_events[ring_head],
io_event_size(ctx) * nr);
+}
+static void copy_io_event_to_ring(struct kioctx *ctx,
long ring_tail,
Nit: I think ring_idx or similar would make a bit more sense as it doesn't matter that it's the tail or not here, we're writing to whatever index is passed.
Also, why long? Ring indices seem to be represented as unsigned everywhere else.
struct io_event *native_event)
+{
- if (aio_in_compat64(ctx)) {
struct compat_io_event *compat_ring_event = &ctx->io_events_compat[ring_tail];
compat_ring_event->data = native_event->data;
compat_ring_event->obj = native_event->obj;
compat_ring_event->res = native_event->res;
compat_ring_event->res2 = native_event->res2;
return;
- }
- ctx->io_events[ring_tail] = *native_event;
+}
static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages) { struct file *file; @@ -352,6 +416,9 @@ static void aio_free_ring(struct kioctx *ctx) { int i;
- if (ctx->ring)
vunmap(ctx->ring);
That one escaped from patch 2 I think :)
- /* Disconnect the kiotx from the ring file. This prevents future
*/
- accesses to the kioctx from page migration.
@@ -519,7 +586,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) nr_events += 2; /* 1 is required, 2 for good luck */ size = sizeof(struct aio_ring);
- size += sizeof(struct io_event) * nr_events;
- size += io_event_size(ctx) * nr_events;
nr_pages = PFN_UP(size); if (nr_pages < 0) @@ -533,7 +600,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) ctx->aio_ring_file = file; nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring))
/ sizeof(struct io_event);
/ io_event_size(ctx);
ctx->ring_pages = ctx->internal_pages; if (nr_pages > AIO_RING_PAGES) { @@ -595,6 +662,8 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) return -ENOMEM; }
- ctx->io_events = ctx->ring->io_events;
- ctx->ring->nr = nr_events; /* user copy */ ctx->ring->id = ~0U; ctx->ring->head = ctx->ring->tail = 0;
@@ -766,8 +835,16 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) nr_events = max(nr_events, num_possible_cpus() * 4); nr_events *= 2;
- ctx = kmem_cache_zalloc(kioctx_cachep, GFP_KERNEL);
- if (!ctx)
return ERR_PTR(-ENOMEM);
+#ifdef CONFIG_COMPAT64
- ctx->compat = in_compat_syscall();
+#endif
- /* Prevent overflows */
- if (nr_events > (0x10000000U / sizeof(struct io_event))) {
- if (nr_events > (0x10000000U / io_event_size(ctx))) { pr_debug("ENOMEM: nr_events too high\n"); return ERR_PTR(-EINVAL);
I think we would leak ctx in this case, since it is now allocated before that check.
We could make it simpler by having ioctx_alloc() take an extra compat argument. That would avoid having to allocate ctx first, and it would also remove the need to use in_compat_syscall(), since we already have a compat handler.
I agree with your reply on v1, it is not essential to use the right size here. That said, effectively dividing the max nr_events by 2 in compat64 is still a user-visible change we should try to avoid.
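A rough sketch of that suggestion (hypothetical, not the applied patch), just to show the shape: the size check runs before any allocation and compat is threaded in by the callers.

```
static struct kioctx *ioctx_alloc(unsigned nr_events, bool compat)
{
	struct kioctx *ctx;
	size_t ev_size = (IS_ENABLED(CONFIG_COMPAT64) && compat) ?
			 sizeof(struct compat_io_event) :
			 sizeof(struct io_event);

	/* Prevent overflows: checked before allocating, so nothing can leak */
	if (nr_events > (0x10000000U / ev_size))
		return ERR_PTR(-EINVAL);

	ctx = kmem_cache_zalloc(kioctx_cachep, GFP_KERNEL);
	if (!ctx)
		return ERR_PTR(-ENOMEM);
#ifdef CONFIG_COMPAT64
	ctx->compat = compat;
#endif
	/* ... rest of the existing initialisation unchanged ... */
	return ctx;
}

/*
 * The native io_setup() handler would then call ioctx_alloc(nr_events, false)
 * and the compat handler ioctx_alloc(nr_events, true).
 */
```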
} @@ -775,10 +852,6 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (!nr_events || (unsigned long)max_reqs > aio_max_nr) return ERR_PTR(-EAGAIN);
- ctx = kmem_cache_zalloc(kioctx_cachep, GFP_KERNEL);
- if (!ctx)
return ERR_PTR(-ENOMEM);
- ctx->max_reqs = max_reqs;
spin_lock_init(&ctx->ctx_lock); @@ -1138,7 +1211,7 @@ static void aio_complete(struct aio_kiocb *iocb) tail = ctx->tail;
- ctx->ring->io_events[tail] = iocb->ki_res;
- copy_io_event_to_ring(ctx, tail, &iocb->ki_res);
pr_debug("%p[%u]: %p: %p %Lx %Lx %Lx\n", ctx, tail, iocb, (void __user *)(unsigned long)iocb->ki_res.obj, @@ -1196,7 +1269,8 @@ static inline void iocb_put(struct aio_kiocb *iocb)
- events fetched
*/ static long aio_read_events_ring(struct kioctx *ctx,
struct io_event __user *event, long nr)
struct io_event __user *event, long idx,
Nit: maybe event_idx (for a while I was confused as I thought idx was a ring index).
long nr)
{ unsigned int head, tail; long ret = 0; @@ -1237,9 +1311,7 @@ static long aio_read_events_ring(struct kioctx *ctx, avail = (head <= tail ? tail : ctx->nr_events) - head; avail = min(avail, nr - ret);
copy_ret = copy_to_user(event + ret,
&ctx->ring->io_events[head],
sizeof(struct io_event) * avail);
copy_ret = copy_io_events_to_user(ctx, event, idx + ret, head, avail);
if (unlikely(copy_ret)) { ret = -EFAULT; @@ -1263,7 +1335,10 @@ static long aio_read_events_ring(struct kioctx *ctx, static bool aio_read_events(struct kioctx *ctx, long min_nr, long nr, struct io_event __user *event, long *i) {
- long ret = aio_read_events_ring(ctx, event + *i, nr - *i);
- long ret = aio_read_events_ring(ctx,
event,
*i,
nr - *i);
It should all fit in one line again.
Kevin
if (ret > 0) *i += ret; @@ -1942,7 +2017,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, req->ki_eventfd = eventfd; }
- if (unlikely(put_user(KIOCB_KEY, &user_iocb->aio_key))) {
- if (unlikely(put_user(KIOCB_KEY, aio_key_uptr(ctx, user_iocb)))) { pr_debug("EFAULT: aio_key\n"); return -EFAULT; }
@@ -2164,15 +2239,20 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb, u32 key; u64 obj = user_ptr_addr(iocb);
- if (unlikely(get_user(key, &iocb->aio_key)))
return -EFAULT;
- if (unlikely(key != KIOCB_KEY))
return -EINVAL;
- ctx = lookup_ioctx(ctx_id); if (unlikely(!ctx)) return -EINVAL;
- if (unlikely(get_user(key, aio_key_uptr(ctx, iocb)))) {
ret = -EFAULT;
goto out;
- }
- if (unlikely(key != KIOCB_KEY)) {
ret = -EINVAL;
goto out;
- }
- spin_lock_irq(&ctx->ctx_lock); /* TODO: use a hash or array, this sucks. */ list_for_each_entry(kiocb, &ctx->active_reqs, ki_list) {
@@ -2193,6 +2273,7 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb, ret = -EINPROGRESS; } +out: percpu_ref_put(&ctx->users); return ret;
On 22-06-2023 08:41, Kevin Brodsky wrote:
On 15/06/2023 17:54, Tudor Cretu wrote:
Some userspace software, including libaio, use struct aio_ring's layout, even if the uAPI headers don't expose it. This is typically done to skip the io_getevents syscall's overhead by reading directly from the io_events array in the aio_ring, which is shared with the userspace.
It's still not clear to me why direct access to the ring makes a difference, as we're not touching struct aio_ring itself and struct io_event is already exposed through syscalls.
Indeed it doesn't, it just slipped into v2. Sorry about that!
As the io_event struct might contain pointers as members, introduce a compat version of the struct. Also, implement functions that convert the compat version to the native version of the struct.
A subsequent patch is going to change the io_event struct to enable it to support new architectures. On such architectures, the current struct layout still needs to be supported for compat tasks.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
fs/aio.c | 123 +++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 102 insertions(+), 21 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index 4fdab03bc0f2..31a9b1c818e0 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -65,9 +65,8 @@ struct aio_ring { unsigned incompat_features; unsigned header_length; /* size of aio_ring */
- struct io_event io_events[];
-}; /* 128 bytes + ring size */ +}; /* 32 bytes + ring size */ /*
- Plugging is meant to work with larger batches of IOs. If we don't
@@ -123,6 +122,11 @@ struct kioctx { unsigned long mmap_base; unsigned long mmap_size;
- union {
struct compat_io_event *io_events_compat;
struct io_event *io_events;
- };
I'm not convinced this is worth it. io_events_compat is used in just two places; casting ctx->ring->io_events to struct compat_io_event * there would be less overhead than adding that member to struct kioctx.
Makes sense, I agree.
- struct page **ring_pages; long nr_pages;
@@ -165,6 +169,9 @@ struct kioctx { struct file *aio_ring_file; unsigned id; +#ifdef CONFIG_COMPAT64
- bool compat;
+#endif }; /* @@ -219,6 +226,13 @@ struct aio_kiocb { struct eventfd_ctx *ki_eventfd; }; +struct compat_io_event {
- __u64 data;
- __u64 obj;
- __s64 res;
- __s64 res2;
+};
- struct compat_iocb { __u64 aio_data; #if defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN)
@@ -281,6 +295,56 @@ static struct vfsmount *aio_mnt; static const struct file_operations aio_ring_fops; static const struct address_space_operations aio_ctx_aops; +static inline bool aio_in_compat64(struct kioctx *ctx) +{
- return IS_ENABLED(CONFIG_COMPAT64) && ctx->compat;
+}
+static inline size_t io_event_size(struct kioctx *ctx) +{
- return aio_in_compat64(ctx) ? sizeof(struct compat_io_event)
: sizeof(struct io_event);
+}
+static u32 __user *aio_key_uptr(struct kioctx *ctx,
struct iocb __user *user_iocb)
+{
- return aio_in_compat64(ctx) ?
&((struct compat_iocb __user *)user_iocb)->aio_key :
&user_iocb->aio_key;
+}
+static int copy_io_events_to_user(struct kioctx *ctx,
struct io_event __user *event_array,
long offset,
long ring_head,
int nr)
Maybe long to match avail in aio_read_events_ring()? Otherwise much clearer now!
Indeed! Thank you for the suggestion!
+{
- if (aio_in_compat64(ctx))
return copy_to_user((struct compat_io_event __user *)event_array + offset,
&ctx->io_events_compat[ring_head],
io_event_size(ctx) * nr);
- return copy_to_user(event_array + offset,
&ctx->io_events[ring_head],
io_event_size(ctx) * nr);
+}
+static void copy_io_event_to_ring(struct kioctx *ctx,
long ring_tail,
Nit: I think ring_idx or similar would make a bit more sense as it doesn't matter that it's the tail or not here, we're writing to whatever index is passed.
Also, why long? Ring indices seem to be represented as unsigned everywhere else.
Done!
struct io_event *native_event)
+{
- if (aio_in_compat64(ctx)) {
struct compat_io_event *compat_ring_event = &ctx->io_events_compat[ring_tail];
compat_ring_event->data = native_event->data;
compat_ring_event->obj = native_event->obj;
compat_ring_event->res = native_event->res;
compat_ring_event->res2 = native_event->res2;
return;
- }
- ctx->io_events[ring_tail] = *native_event;
+}
- static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages) { struct file *file;
@@ -352,6 +416,9 @@ static void aio_free_ring(struct kioctx *ctx) { int i;
- if (ctx->ring)
vunmap(ctx->ring);
That one escaped from patch 2 I think :)
It didn't reach very far :). Thanks for spotting that!
- /* Disconnect the kiotx from the ring file. This prevents future
*/
- accesses to the kioctx from page migration.
@@ -519,7 +586,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) nr_events += 2; /* 1 is required, 2 for good luck */ size = sizeof(struct aio_ring);
- size += sizeof(struct io_event) * nr_events;
- size += io_event_size(ctx) * nr_events;
nr_pages = PFN_UP(size); if (nr_pages < 0) @@ -533,7 +600,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) ctx->aio_ring_file = file; nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring))
/ sizeof(struct io_event);
/ io_event_size(ctx);
ctx->ring_pages = ctx->internal_pages; if (nr_pages > AIO_RING_PAGES) { @@ -595,6 +662,8 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) return -ENOMEM; }
- ctx->io_events = ctx->ring->io_events;
- ctx->ring->nr = nr_events; /* user copy */ ctx->ring->id = ~0U; ctx->ring->head = ctx->ring->tail = 0;
@@ -766,8 +835,16 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) nr_events = max(nr_events, num_possible_cpus() * 4); nr_events *= 2;
- ctx = kmem_cache_zalloc(kioctx_cachep, GFP_KERNEL);
- if (!ctx)
return ERR_PTR(-ENOMEM);
+#ifdef CONFIG_COMPAT64
- ctx->compat = in_compat_syscall();
+#endif
- /* Prevent overflows */
- if (nr_events > (0x10000000U / sizeof(struct io_event))) {
- if (nr_events > (0x10000000U / io_event_size(ctx))) { pr_debug("ENOMEM: nr_events too high\n"); return ERR_PTR(-EINVAL);
I think we would leak ctx in this case, since it is now allocated before that check.
Indeed, well spotted.
We could make it simpler by having ioctx_alloc() take an extra compat argument. That would avoid having to allocate ctx first, and it would also remove the need to use in_compat_syscall(), since we already have a compat handler.
Sounds good!
I agree with your reply on v1, it is not essential to use the right size here. That said, effectively dividing the max nr_events by 2 in compat64 is still a user-visible change we should try to avoid.
} @@ -775,10 +852,6 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (!nr_events || (unsigned long)max_reqs > aio_max_nr) return ERR_PTR(-EAGAIN);
- ctx = kmem_cache_zalloc(kioctx_cachep, GFP_KERNEL);
- if (!ctx)
return ERR_PTR(-ENOMEM);
- ctx->max_reqs = max_reqs;
spin_lock_init(&ctx->ctx_lock); @@ -1138,7 +1211,7 @@ static void aio_complete(struct aio_kiocb *iocb) tail = ctx->tail;
- ctx->ring->io_events[tail] = iocb->ki_res;
- copy_io_event_to_ring(ctx, tail, &iocb->ki_res);
pr_debug("%p[%u]: %p: %p %Lx %Lx %Lx\n", ctx, tail, iocb, (void __user *)(unsigned long)iocb->ki_res.obj, @@ -1196,7 +1269,8 @@ static inline void iocb_put(struct aio_kiocb *iocb)
- events fetched
*/ static long aio_read_events_ring(struct kioctx *ctx,
struct io_event __user *event, long nr)
struct io_event __user *event, long idx,
Nit: maybe event_idx (for a while I was confused as I thought idx was a ring index).
Done
long nr)
{ unsigned int head, tail; long ret = 0;
@@ -1237,9 +1311,7 @@ static long aio_read_events_ring(struct kioctx *ctx, avail = (head <= tail ? tail : ctx->nr_events) - head; avail = min(avail, nr - ret);
copy_ret = copy_to_user(event + ret,
&ctx->ring->io_events[head],
sizeof(struct io_event) * avail);
copy_ret = copy_io_events_to_user(ctx, event, idx + ret, head, avail);
if (unlikely(copy_ret)) { ret = -EFAULT; @@ -1263,7 +1335,10 @@ static long aio_read_events_ring(struct kioctx *ctx, static bool aio_read_events(struct kioctx *ctx, long min_nr, long nr, struct io_event __user *event, long *i) {
- long ret = aio_read_events_ring(ctx, event + *i, nr - *i);
- long ret = aio_read_events_ring(ctx,
event,
*i,
nr - *i);
It should all fit in one line again.
Done
Thanks, Tudor
Kevin
if (ret > 0) *i += ret;
@@ -1942,7 +2017,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, req->ki_eventfd = eventfd; }
- if (unlikely(put_user(KIOCB_KEY, &user_iocb->aio_key))) {
- if (unlikely(put_user(KIOCB_KEY, aio_key_uptr(ctx, user_iocb)))) { pr_debug("EFAULT: aio_key\n"); return -EFAULT; }
@@ -2164,15 +2239,20 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb, u32 key; u64 obj = user_ptr_addr(iocb);
- if (unlikely(get_user(key, &iocb->aio_key)))
return -EFAULT;
- if (unlikely(key != KIOCB_KEY))
return -EINVAL;
- ctx = lookup_ioctx(ctx_id); if (unlikely(!ctx)) return -EINVAL;
- if (unlikely(get_user(key, aio_key_uptr(ctx, iocb)))) {
ret = -EFAULT;
goto out;
- }
- if (unlikely(key != KIOCB_KEY)) {
ret = -EINVAL;
goto out;
- }
- spin_lock_irq(&ctx->ctx_lock); /* TODO: use a hash or array, this sucks. */ list_for_each_entry(kiocb, &ctx->active_reqs, ki_list) {
@@ -2193,6 +2273,7 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb, ret = -EINPROGRESS; } +out: percpu_ref_put(&ctx->users); return ret;
The aio_ring buffer is a shared memory region that hosts the io_event array. The io_event struct may contain user pointers, so the memory region must be allowed to store and load capability pointers.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
---
 fs/aio.c | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/fs/aio.c b/fs/aio.c
index 31a9b1c818e0..a0bd049b0c45 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -482,6 +482,11 @@ static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	vma->vm_flags |= VM_DONTEXPAND;
 	vma->vm_ops = &aio_ring_vm_ops;
+	if (IS_ENABLED(CONFIG_CHERI_PURECAP_UABI)) {
+		vma->vm_flags |= VM_READ_CAPS | VM_WRITE_CAPS;
+		vma_set_page_prot(vma);
+	}
+
 	return 0;
 }
On 15/06/2023 17:54, Tudor Cretu wrote:
The aio_ring buffer is a shared memory region that hosts the io_event array. The io_event struct may contain user pointers, so the memory region must be allowed to store and load capability pointers.
... in PCuABI.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
fs/aio.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/fs/aio.c b/fs/aio.c
index 31a9b1c818e0..a0bd049b0c45 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -482,6 +482,11 @@ static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	vma->vm_flags |= VM_DONTEXPAND;
 	vma->vm_ops = &aio_ring_vm_ops;
+	if (IS_ENABLED(CONFIG_CHERI_PURECAP_UABI)) {

A lot simpler now! We could just use a normal #ifdef since we don't need to declare variables any more.

Kevin

+		vma->vm_flags |= VM_READ_CAPS | VM_WRITE_CAPS;
+		vma_set_page_prot(vma);
+	}
+
 	return 0;
 }
On 22-06-2023 08:54, Kevin Brodsky wrote:
On 15/06/2023 17:54, Tudor Cretu wrote:
The aio_ring buffer is a shared memory region that hosts the io_event array. The io_event struct may contain user pointers, so the memory region must be allowed to store and load capability pointers.
... in PCuABI.
Done
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
fs/aio.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/fs/aio.c b/fs/aio.c
index 31a9b1c818e0..a0bd049b0c45 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -482,6 +482,11 @@ static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	vma->vm_flags |= VM_DONTEXPAND;
 	vma->vm_ops = &aio_ring_vm_ops;
+	if (IS_ENABLED(CONFIG_CHERI_PURECAP_UABI)) {

A lot simpler now! We could just use a normal #ifdef since we don't need to declare variables any more.

Indeed!

Thanks, Tudor

Kevin

+		vma->vm_flags |= VM_READ_CAPS | VM_WRITE_CAPS;
+		vma_set_page_prot(vma);
+	}
+
 	return 0;
 }
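For illustration, the hunk reworked with a plain #ifdef, as suggested, might look something like this (a sketch only, not the actual follow-up patch):

static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
{
	vma->vm_flags |= VM_DONTEXPAND;
	vma->vm_ops = &aio_ring_vm_ops;
#ifdef CONFIG_CHERI_PURECAP_UABI
	/* Let the shared ring mapping load and store capability tags. */
	vma->vm_flags |= VM_READ_CAPS | VM_WRITE_CAPS;
	vma_set_page_prot(vma);
#endif
	return 0;
}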
The aio_context_t represents the pointer to the userspace mapping of the aio_ring buffer. In some userspace software, including libaio, it's used to directly access the aio_ring. As the __kernel_ulong_t type is not large enough to fit a user pointer in the PCuABI, change aio_context_t to void __user *. Moreover, make a full capability comparison in lookup_ioctx().
Also, make sure to always create a valid user pointer in the compat syscall handlers or if the syscalls are invoked from a compat context by using compat_ptr().
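For context (not part of the patch), this is roughly how a libaio-style fast path uses the context value once it is a real user pointer; the ring header layout below is a simplified assumption for illustration only:

#include <linux/aio_abi.h>

/* Assumed, simplified ring header layout -- illustration only. */
struct ring_hdr {
	unsigned int id;
	unsigned int nr;	/* number of io_event slots */
	unsigned int head;
	unsigned int tail;
	/* ... remaining header fields, then the io_event array ... */
};

/* Returns non-zero if completions are waiting, without a syscall. */
static int ring_has_events(aio_context_t ctx_id)
{
	struct ring_hdr *ring = (struct ring_hdr *)ctx_id;

	/* Real code needs acquire semantics when reading head/tail. */
	return ring->head != ring->tail;
}

Under PCuABI the cast only works because ctx_id is now a genuine capability to the mapping, which is exactly what this patch provides.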
Signed-off-by: Tudor Cretu tudor.cretu@arm.com --- fs/aio.c | 71 +++++++++++++++++++++++++----------- include/uapi/linux/aio_abi.h | 2 +- 2 files changed, 50 insertions(+), 23 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index a0bd049b0c45..dff01598760e 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -97,7 +97,7 @@ struct kioctx {
struct percpu_ref reqs;
- unsigned long user_id; + aio_context_t user_id;
struct __percpu kioctx_cpu *cpu;
@@ -314,6 +314,14 @@ static u32 __user *aio_key_uptr(struct kioctx *ctx, &user_iocb->aio_key; }
+static inline bool aio_ctx_id_is_same(struct kioctx *ctx, + aio_context_t ctx_id) +{ + if (IS_ENABLED(CONFIG_CHERI_PURECAP_UABI) && !aio_in_compat64(ctx)) + return __builtin_cheri_equal_exact(ctx->user_id, ctx_id); + return ctx->user_id == ctx_id; +} + static int copy_io_events_to_user(struct kioctx *ctx, struct io_event __user *event_array, long offset, @@ -457,7 +465,9 @@ static int aio_ring_mremap(struct vm_area_struct *vma) ctx = rcu_dereference(table->table[i]); if (ctx && ctx->aio_ring_file == file) { if (!atomic_read(&ctx->dead)) { - ctx->user_id = ctx->mmap_base = vma->vm_start; + ctx->mmap_base = vma->vm_start; + /* TODO [PCuABI] - derive proper capability */ + ctx->user_id = uaddr_to_user_ptr_safe(ctx->mmap_base); res = 0; } break; @@ -658,7 +668,8 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
- ctx->user_id = ctx->mmap_base; + /* TODO [PCuABI] - derive proper capability */ + ctx->user_id = uaddr_to_user_ptr_safe(ctx->mmap_base); ctx->nr_events = nr_events; /* trusted copy */
ctx->ring = vmap(ctx->ring_pages, nr_pages, VM_MAP, PAGE_KERNEL); @@ -910,7 +921,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) mutex_unlock(&ctx->ring_lock);
pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n", - ctx, ctx->user_id, mm, ctx->nr_events); + ctx, user_ptr_addr(ctx->user_id), mm, ctx->nr_events); return ctx;
err_cleanup: @@ -1160,9 +1171,9 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx) return req; }
-static struct kioctx *lookup_ioctx(unsigned long ctx_id) +static struct kioctx *lookup_ioctx(aio_context_t ctx_id) { - struct aio_ring __user *ring = (void __user *)ctx_id; + struct aio_ring __user *ring = ctx_id; struct mm_struct *mm = current->mm; struct kioctx *ctx, *ret = NULL; struct kioctx_table *table; @@ -1179,7 +1190,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
id = array_index_nospec(id, table->nr); ctx = rcu_dereference(table->table[id]); - if (ctx && ctx->user_id == ctx_id) { + if (ctx && aio_ctx_id_is_same(ctx, ctx_id)) { if (percpu_ref_tryget_live(&ctx->users)) ret = ctx; } @@ -1399,27 +1410,32 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr, * pointer is passed for ctxp. Will fail with -ENOSYS if not * implemented. */ -SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp) +SYSCALL_DEFINE2(io_setup, unsigned, nr_events, +#ifdef CONFIG_CHERI_PURECAP_UABI +void * __capability * __capability, ctxp) +#else +aio_context_t __user *, ctxp) +#endif { struct kioctx *ioctx = NULL; - unsigned long ctx; + aio_context_t ctx; long ret;
- ret = get_user(ctx, ctxp); + ret = get_user_ptr(ctx, ctxp); if (unlikely(ret)) goto out;
ret = -EINVAL; if (unlikely(ctx || nr_events == 0)) { pr_debug("EINVAL: ctx %lu nr_events %u\n", - ctx, nr_events); + user_ptr_addr(ctx), nr_events); goto out; }
ioctx = ioctx_alloc(nr_events); ret = PTR_ERR(ioctx); if (!IS_ERR(ioctx)) { - ret = put_user(ioctx->user_id, ctxp); + ret = put_user_ptr(ioctx->user_id, ctxp); if (ret) kill_ioctx(current->mm, ioctx, NULL); percpu_ref_put(&ioctx->users); @@ -1433,7 +1449,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp) COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, compat_aio_context_t __user *, ctxp) { struct kioctx *ioctx = NULL; - unsigned long ctx; + compat_aio_context_t ctx; long ret;
ret = get_user(ctx, ctxp); @@ -1451,7 +1467,7 @@ COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, compat_aio_context_t __use ret = PTR_ERR(ioctx); if (!IS_ERR(ioctx)) { /* truncating is ok because it's a user address */ - ret = put_user((compat_aio_context_t)ioctx->user_id, ctxp); + ret = put_user(ptr_to_compat(ioctx->user_id), ctxp); if (ret) kill_ioctx(current->mm, ioctx, NULL); percpu_ref_put(&ioctx->users); @@ -1470,7 +1486,12 @@ COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, compat_aio_context_t __use */ SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx) { - struct kioctx *ioctx = lookup_ioctx(ctx); + struct kioctx *ioctx; + + if (in_compat_syscall()) + ctx = compat_ptr(user_ptr_addr(ctx)); + + ioctx = lookup_ioctx(ctx); if (likely(NULL != ioctx)) { struct ctx_rq_wait wait; int ret; @@ -2194,7 +2215,7 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id, if (unlikely(nr < 0)) return -EINVAL;
- ctx = lookup_ioctx(ctx_id); + ctx = lookup_ioctx(compat_ptr(ctx_id)); if (unlikely(!ctx)) { pr_debug("EINVAL: invalid context id\n"); return -EINVAL; @@ -2244,6 +2265,9 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb, u32 key; u64 obj = user_ptr_addr(iocb);
+ if (in_compat_syscall()) + ctx_id = compat_ptr(user_ptr_addr(ctx_id)); + ctx = lookup_ioctx(ctx_id); if (unlikely(!ctx)) return -EINVAL; @@ -2329,6 +2353,9 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id, if (timeout && unlikely(get_timespec64(&ts, timeout))) return -EFAULT;
+ if (in_compat_syscall()) + ctx_id = compat_ptr(user_ptr_addr(ctx_id)); + ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL); if (!ret && signal_pending(current)) ret = -EINTR; @@ -2378,7 +2405,7 @@ SYSCALL_DEFINE6(io_pgetevents, #if defined(CONFIG_COMPAT_32BIT_TIME) && !defined(CONFIG_64BIT)
SYSCALL_DEFINE6(io_pgetevents_time32, - aio_context_t, ctx_id, + compat_aio_context_t, ctx_id, long, min_nr, long, nr, struct io_event __user *, events, @@ -2401,7 +2428,7 @@ SYSCALL_DEFINE6(io_pgetevents_time32, if (ret) return ret;
- ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL); + ret = do_io_getevents(compat_ptr(ctx_id), min_nr, nr, events, timeout ? &ts : NULL);
interrupted = signal_pending(current); restore_saved_sigmask_unless(interrupted); @@ -2415,7 +2442,7 @@ SYSCALL_DEFINE6(io_pgetevents_time32,
#if defined(CONFIG_COMPAT_32BIT_TIME)
-SYSCALL_DEFINE5(io_getevents_time32, __u32, ctx_id, +SYSCALL_DEFINE5(io_getevents_time32, compat_aio_context_t, ctx_id, __s32, min_nr, __s32, nr, struct io_event __user *, events, @@ -2427,7 +2454,7 @@ SYSCALL_DEFINE5(io_getevents_time32, __u32, ctx_id, if (timeout && get_old_timespec32(&t, timeout)) return -EFAULT;
- ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL); + ret = do_io_getevents(compat_ptr(ctx_id), min_nr, nr, events, timeout ? &t : NULL); if (!ret && signal_pending(current)) ret = -EINTR; return ret; @@ -2467,7 +2494,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents, if (ret) return ret;
- ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL); + ret = do_io_getevents(compat_ptr(ctx_id), min_nr, nr, events, timeout ? &t : NULL);
interrupted = signal_pending(current); restore_saved_sigmask_unless(interrupted); @@ -2502,7 +2529,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents_time64, if (ret) return ret;
- ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL); + ret = do_io_getevents(compat_ptr(ctx_id), min_nr, nr, events, timeout ? &t : NULL);
interrupted = signal_pending(current); restore_saved_sigmask_unless(interrupted); diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h index 8050909c2887..87c270021449 100644 --- a/include/uapi/linux/aio_abi.h +++ b/include/uapi/linux/aio_abi.h @@ -31,7 +31,7 @@ #include <linux/fs.h> #include <asm/byteorder.h>
-typedef __kernel_ulong_t aio_context_t; +typedef void __user *aio_context_t;
enum { IOCB_CMD_PREAD = 0,
On 15/06/2023 17:54, Tudor Cretu wrote:
The aio_context_t represents the pointer to the userspace mapping of aio_ring buffer.
I would slightly rephrase this. It seems that originally aio_context_t was meant to represent a token with no particular meaning. In a way it still represents that, but it can also be used in the way you describe afterwards.
In some userspace software, including libaio, it's used to directly access the aio_ring. As __kernel_ulong_t type is not large enough to fit a user pointer in the PCuABI, change the aio_context_t to void __user *. Moreover, make a full capability comparison in lookup_ioctx().
Would be good to explain why (mainly we still consider it a token).
Also, make sure to always create a valid user pointer in the compat syscall handlers or if the syscalls are invoked from a compat context by using compat_ptr().
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
fs/aio.c | 71 +++++++++++++++++++++++++----------- include/uapi/linux/aio_abi.h | 2 +- 2 files changed, 50 insertions(+), 23 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index a0bd049b0c45..dff01598760e 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -97,7 +97,7 @@ struct kioctx { struct percpu_ref reqs;
- unsigned long user_id;
- aio_context_t user_id;
struct __percpu kioctx_cpu *cpu; @@ -314,6 +314,14 @@ static u32 __user *aio_key_uptr(struct kioctx *ctx, &user_iocb->aio_key; } +static inline bool aio_ctx_id_is_same(struct kioctx *ctx,
aio_context_t ctx_id)
+{
- if (IS_ENABLED(CONFIG_CHERI_PURECAP_UABI) && !aio_in_compat64(ctx))
return __builtin_cheri_equal_exact(ctx->user_id, ctx_id);
I've pondered on this some more and I think this is the right thing to do, starting from the assumption that aio_context_t represents some token and should therefore not be modified in any way. In any case we do want to check the capability metadata, as it needs to enable access to the ring (which we access via a kernel mapping, so some kind of explicit checking is required). It might be a bit tricky to ensure that the derivation in aio_ring_mremap() matches exactly what mremap() returns once we start narrowing bounds/permissions there, but we'll see how it goes. I'm not sure that functionality is actually used at all anyway.
Small issue with the current implementation, it won't build on !CHERI targets. That can be fixed by using user_ptr_is_same(). It is generic, so no need to special-case PCuABI at all. It still makes sense to special-case compat64 I think, as a full capability comparison would be artificial in that case (and slower).
- return ctx->user_id == ctx_id;
+}
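For illustration, the helper reworked along those lines might look roughly like this (a sketch only, assuming the generic user_ptr_is_same() and user_ptr_addr() helpers; not the actual v3 code):

static inline bool aio_ctx_id_is_same(struct kioctx *ctx,
				      aio_context_t ctx_id)
{
	/*
	 * In compat64 a full capability comparison would be artificial,
	 * so only compare addresses there.
	 */
	if (aio_in_compat64(ctx))
		return user_ptr_addr(ctx->user_id) == user_ptr_addr(ctx_id);

	/* Generic helper; also compares capability metadata under PCuABI. */
	return user_ptr_is_same(ctx->user_id, ctx_id);
}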
static int copy_io_events_to_user(struct kioctx *ctx, struct io_event __user *event_array, long offset, @@ -457,7 +465,9 @@ static int aio_ring_mremap(struct vm_area_struct *vma) ctx = rcu_dereference(table->table[i]); if (ctx && ctx->aio_ring_file == file) { if (!atomic_read(&ctx->dead)) {
ctx->user_id = ctx->mmap_base = vma->vm_start;
ctx->mmap_base = vma->vm_start;
/* TODO [PCuABI] - derive proper capability */
ctx->user_id = uaddr_to_user_ptr_safe(ctx->mmap_base); res = 0; } break;
@@ -658,7 +668,8 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
- ctx->user_id = ctx->mmap_base;
- /* TODO [PCuABI] - derive proper capability */
- ctx->user_id = uaddr_to_user_ptr_safe(ctx->mmap_base); ctx->nr_events = nr_events; /* trusted copy */
 	ctx->ring = vmap(ctx->ring_pages, nr_pages, VM_MAP, PAGE_KERNEL);
@@ -910,7 +921,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 	mutex_unlock(&ctx->ring_lock);

 	pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n",
-		 ctx, ctx->user_id, mm, ctx->nr_events);
+		 ctx, user_ptr_addr(ctx->user_id), mm, ctx->nr_events);

 	return ctx;
err_cleanup: @@ -1160,9 +1171,9 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx) return req; } -static struct kioctx *lookup_ioctx(unsigned long ctx_id) +static struct kioctx *lookup_ioctx(aio_context_t ctx_id) {
- struct aio_ring __user *ring = (void __user *)ctx_id;
- struct aio_ring __user *ring = ctx_id; struct mm_struct *mm = current->mm; struct kioctx *ctx, *ret = NULL; struct kioctx_table *table;
@@ -1179,7 +1190,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id) id = array_index_nospec(id, table->nr); ctx = rcu_dereference(table->table[id]);
- if (ctx && ctx->user_id == ctx_id) {
- if (ctx && aio_ctx_id_is_same(ctx, ctx_id)) { if (percpu_ref_tryget_live(&ctx->users)) ret = ctx; }
@@ -1399,27 +1410,32 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
- pointer is passed for ctxp. Will fail with -ENOSYS if not
- implemented.
*/ -SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp) +SYSCALL_DEFINE2(io_setup, unsigned, nr_events, +#ifdef CONFIG_CHERI_PURECAP_UABI +void * __capability * __capability, ctxp)
I think we can still use aio_context_t, can't we? We just need to move the second __capability after *.
Also these argument lines should be indented as usual.
+#else +aio_context_t __user *, ctxp) +#endif { struct kioctx *ioctx = NULL;
- unsigned long ctx;
- aio_context_t ctx; long ret;
- ret = get_user(ctx, ctxp);
- ret = get_user_ptr(ctx, ctxp); if (unlikely(ret)) goto out;
 	ret = -EINVAL;
 	if (unlikely(ctx || nr_events == 0)) {
 		pr_debug("EINVAL: ctx %lu nr_events %u\n",
-			ctx, nr_events);
+			user_ptr_addr(ctx), nr_events);
 		goto out;
 	}

 	ioctx = ioctx_alloc(nr_events);
 	ret = PTR_ERR(ioctx);
 	if (!IS_ERR(ioctx)) {
-		ret = put_user(ioctx->user_id, ctxp);
+		ret = put_user_ptr(ioctx->user_id, ctxp);
 		if (ret)
 			kill_ioctx(current->mm, ioctx, NULL);
 		percpu_ref_put(&ioctx->users);
@@ -1433,7 +1449,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp) COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, compat_aio_context_t __user *, ctxp) { struct kioctx *ioctx = NULL;
- unsigned long ctx;
- compat_aio_context_t ctx; long ret;
 	ret = get_user(ctx, ctxp);
@@ -1451,7 +1467,7 @@ COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, compat_aio_context_t __use
 	ret = PTR_ERR(ioctx);
 	if (!IS_ERR(ioctx)) {
 		/* truncating is ok because it's a user address */

I'd remove that comment now we're converting the pointer the proper way.

-		ret = put_user((compat_aio_context_t)ioctx->user_id, ctxp);
+		ret = put_user(ptr_to_compat(ioctx->user_id), ctxp);
 		if (ret)
 			kill_ioctx(current->mm, ioctx, NULL);
 		percpu_ref_put(&ioctx->users);
@@ -1470,7 +1486,12 @@ COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, compat_aio_context_t __use */ SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx) {
- struct kioctx *ioctx = lookup_ioctx(ctx);
- struct kioctx *ioctx;
- if (in_compat_syscall())
ctx = compat_ptr(user_ptr_addr(ctx));
No need to bother with that: in compat, the syscall wrappers automatically create the user pointers a syscall expects using compat_ptr() :) Otherwise ctxp wouldn't be usable in io_setup, for instance.
- ioctx = lookup_ioctx(ctx); if (likely(NULL != ioctx)) { struct ctx_rq_wait wait; int ret;
@@ -2194,7 +2215,7 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id, if (unlikely(nr < 0)) return -EINVAL;
- ctx = lookup_ioctx(ctx_id);
- ctx = lookup_ioctx(compat_ptr(ctx_id)); if (unlikely(!ctx)) { pr_debug("EINVAL: invalid context id\n"); return -EINVAL;
@@ -2244,6 +2265,9 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb, u32 key; u64 obj = user_ptr_addr(iocb);
- if (in_compat_syscall())
ctx_id = compat_ptr(user_ptr_addr(ctx_id));
- ctx = lookup_ioctx(ctx_id); if (unlikely(!ctx)) return -EINVAL;
@@ -2329,6 +2353,9 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id, if (timeout && unlikely(get_timespec64(&ts, timeout))) return -EFAULT;
- if (in_compat_syscall())
ctx_id = compat_ptr(user_ptr_addr(ctx_id));
- ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL); if (!ret && signal_pending(current)) ret = -EINTR;
@@ -2378,7 +2405,7 @@ SYSCALL_DEFINE6(io_pgetevents, #if defined(CONFIG_COMPAT_32BIT_TIME) && !defined(CONFIG_64BIT) SYSCALL_DEFINE6(io_pgetevents_time32,
aio_context_t, ctx_id,
compat_aio_context_t, ctx_id,
I'm confused, this is not a compat handler?
Kevin
long, min_nr, long, nr, struct io_event __user *, events,
@@ -2401,7 +2428,7 @@ SYSCALL_DEFINE6(io_pgetevents_time32, if (ret) return ret;
- ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
- ret = do_io_getevents(compat_ptr(ctx_id), min_nr, nr, events, timeout ? &ts : NULL);
interrupted = signal_pending(current); restore_saved_sigmask_unless(interrupted); @@ -2415,7 +2442,7 @@ SYSCALL_DEFINE6(io_pgetevents_time32, #if defined(CONFIG_COMPAT_32BIT_TIME) -SYSCALL_DEFINE5(io_getevents_time32, __u32, ctx_id, +SYSCALL_DEFINE5(io_getevents_time32, compat_aio_context_t, ctx_id, __s32, min_nr, __s32, nr, struct io_event __user *, events, @@ -2427,7 +2454,7 @@ SYSCALL_DEFINE5(io_getevents_time32, __u32, ctx_id, if (timeout && get_old_timespec32(&t, timeout)) return -EFAULT;
- ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL);
- ret = do_io_getevents(compat_ptr(ctx_id), min_nr, nr, events, timeout ? &t : NULL); if (!ret && signal_pending(current)) ret = -EINTR; return ret;
@@ -2467,7 +2494,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents, if (ret) return ret;
- ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL);
- ret = do_io_getevents(compat_ptr(ctx_id), min_nr, nr, events, timeout ? &t : NULL);
interrupted = signal_pending(current); restore_saved_sigmask_unless(interrupted); @@ -2502,7 +2529,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents_time64, if (ret) return ret;
- ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL);
- ret = do_io_getevents(compat_ptr(ctx_id), min_nr, nr, events, timeout ? &t : NULL);
interrupted = signal_pending(current); restore_saved_sigmask_unless(interrupted); diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h index 8050909c2887..87c270021449 100644 --- a/include/uapi/linux/aio_abi.h +++ b/include/uapi/linux/aio_abi.h @@ -31,7 +31,7 @@ #include <linux/fs.h> #include <asm/byteorder.h> -typedef __kernel_ulong_t aio_context_t; +typedef void __user *aio_context_t; enum { IOCB_CMD_PREAD = 0,
On 22-06-2023 08:59, Kevin Brodsky wrote:
On 15/06/2023 17:54, Tudor Cretu wrote:
The aio_context_t represents the pointer to the userspace mapping of aio_ring buffer.
I would slightly rephrase this. It seems that originally aio_context_t was meant to represent a token with no particular meaning. In a way it still represents that, but it can also be used in the way you describe afterwards.
Indeed!
In some userspace software, including libaio, it's used to directly access the aio_ring. As __kernel_ulong_t type is not large enough to fit a user pointer in the PCuABI, change the aio_context_t to void __user *. Moreover, make a full capability comparison in lookup_ioctx().
Would be good to explain why (mainly we still consider it a token).
Done
Also, make sure to always create a valid user pointer in the compat syscall handlers or if the syscalls are invoked from a compat context by using compat_ptr().
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
fs/aio.c | 71 +++++++++++++++++++++++++----------- include/uapi/linux/aio_abi.h | 2 +- 2 files changed, 50 insertions(+), 23 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index a0bd049b0c45..dff01598760e 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -97,7 +97,7 @@ struct kioctx { struct percpu_ref reqs;
- unsigned long user_id;
- aio_context_t user_id;
struct __percpu kioctx_cpu *cpu; @@ -314,6 +314,14 @@ static u32 __user *aio_key_uptr(struct kioctx *ctx, &user_iocb->aio_key; } +static inline bool aio_ctx_id_is_same(struct kioctx *ctx,
aio_context_t ctx_id)
+{
- if (IS_ENABLED(CONFIG_CHERI_PURECAP_UABI) && !aio_in_compat64(ctx))
return __builtin_cheri_equal_exact(ctx->user_id, ctx_id);
I've pondered on this some more and I think this is the right thing to do, starting from the assumption that aio_context_t represents some token and should therefore not be modified in any way. In any case we do want to check the capability metadata, as it needs to enable access to the ring (which we access via a kernel mapping, so some kind of explicit checking is required). It might be a bit tricky to ensure that the derivation in aio_ring_mremap() matches exactly what mremap() returns once we start narrowing bounds/permissions there, but we'll see how it goes. I'm not sure that functionality is actually used at all anyway.
Small issue with the current implementation, it won't build on !CHERI targets. That can be fixed by using user_ptr_is_same(). It is generic, so no need to special-case PCuABI at all. It still makes sense to special-case compat64 I think, as a full capability comparison would be artificial in that case (and slower).
Done!
- return ctx->user_id == ctx_id;
+}
- static int copy_io_events_to_user(struct kioctx *ctx, struct io_event __user *event_array, long offset,
@@ -457,7 +465,9 @@ static int aio_ring_mremap(struct vm_area_struct *vma) ctx = rcu_dereference(table->table[i]); if (ctx && ctx->aio_ring_file == file) { if (!atomic_read(&ctx->dead)) {
ctx->user_id = ctx->mmap_base = vma->vm_start;
ctx->mmap_base = vma->vm_start;
/* TODO [PCuABI] - derive proper capability */
ctx->user_id = uaddr_to_user_ptr_safe(ctx->mmap_base); res = 0; } break;
@@ -658,7 +668,8 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
- ctx->user_id = ctx->mmap_base;
- /* TODO [PCuABI] - derive proper capability */
- ctx->user_id = uaddr_to_user_ptr_safe(ctx->mmap_base); ctx->nr_events = nr_events; /* trusted copy */
 	ctx->ring = vmap(ctx->ring_pages, nr_pages, VM_MAP, PAGE_KERNEL);
@@ -910,7 +921,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 	mutex_unlock(&ctx->ring_lock);

 	pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n",
-		 ctx, ctx->user_id, mm, ctx->nr_events);
+		 ctx, user_ptr_addr(ctx->user_id), mm, ctx->nr_events);

 	return ctx;
err_cleanup: @@ -1160,9 +1171,9 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx) return req; } -static struct kioctx *lookup_ioctx(unsigned long ctx_id) +static struct kioctx *lookup_ioctx(aio_context_t ctx_id) {
- struct aio_ring __user *ring = (void __user *)ctx_id;
- struct aio_ring __user *ring = ctx_id; struct mm_struct *mm = current->mm; struct kioctx *ctx, *ret = NULL; struct kioctx_table *table;
@@ -1179,7 +1190,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id) id = array_index_nospec(id, table->nr); ctx = rcu_dereference(table->table[id]);
- if (ctx && ctx->user_id == ctx_id) {
- if (ctx && aio_ctx_id_is_same(ctx, ctx_id)) { if (percpu_ref_tryget_live(&ctx->users)) ret = ctx; }
@@ -1399,27 +1410,32 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
- pointer is passed for ctxp. Will fail with -ENOSYS if not
- implemented.
*/ -SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp) +SYSCALL_DEFINE2(io_setup, unsigned, nr_events, +#ifdef CONFIG_CHERI_PURECAP_UABI +void * __capability * __capability, ctxp)
I think we can still use aio_context_t, can't we? We just need to move the second __capability after *.
Indeed!
Also these argument lines should be indented as usual.
Oops!
+#else +aio_context_t __user *, ctxp) +#endif { struct kioctx *ioctx = NULL;
- unsigned long ctx;
- aio_context_t ctx; long ret;
- ret = get_user(ctx, ctxp);
- ret = get_user_ptr(ctx, ctxp); if (unlikely(ret)) goto out;
 	ret = -EINVAL;
 	if (unlikely(ctx || nr_events == 0)) {
 		pr_debug("EINVAL: ctx %lu nr_events %u\n",
-			ctx, nr_events);
+			user_ptr_addr(ctx), nr_events);
 		goto out;
 	}

 	ioctx = ioctx_alloc(nr_events);
 	ret = PTR_ERR(ioctx);
 	if (!IS_ERR(ioctx)) {
-		ret = put_user(ioctx->user_id, ctxp);
+		ret = put_user_ptr(ioctx->user_id, ctxp);
 		if (ret)
 			kill_ioctx(current->mm, ioctx, NULL);
 		percpu_ref_put(&ioctx->users);
@@ -1433,7 +1449,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp) COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, compat_aio_context_t __user *, ctxp) { struct kioctx *ioctx = NULL;
- unsigned long ctx;
- compat_aio_context_t ctx; long ret;
 	ret = get_user(ctx, ctxp);
@@ -1451,7 +1467,7 @@ COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, compat_aio_context_t __use
 	ret = PTR_ERR(ioctx);
 	if (!IS_ERR(ioctx)) {
 		/* truncating is ok because it's a user address */

I'd remove that comment now we're converting the pointer the proper way.

Done

-		ret = put_user((compat_aio_context_t)ioctx->user_id, ctxp);
+		ret = put_user(ptr_to_compat(ioctx->user_id), ctxp);
 		if (ret)
 			kill_ioctx(current->mm, ioctx, NULL);
 		percpu_ref_put(&ioctx->users);
@@ -1470,7 +1486,12 @@ COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, compat_aio_context_t __use */ SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx) {
- struct kioctx *ioctx = lookup_ioctx(ctx);
- struct kioctx *ioctx;
- if (in_compat_syscall())
ctx = compat_ptr(user_ptr_addr(ctx));
No need to bother with that: in compat, the syscall wrappers automatically create the user pointers a syscall expects using compat_ptr() :) Otherwise ctxp wouldn't be usable in io_setup, for instance.
It slipped my mind when I did this change. Thank you!
- ioctx = lookup_ioctx(ctx); if (likely(NULL != ioctx)) { struct ctx_rq_wait wait; int ret;
@@ -2194,7 +2215,7 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id, if (unlikely(nr < 0)) return -EINVAL;
- ctx = lookup_ioctx(ctx_id);
- ctx = lookup_ioctx(compat_ptr(ctx_id)); if (unlikely(!ctx)) { pr_debug("EINVAL: invalid context id\n"); return -EINVAL;
@@ -2244,6 +2265,9 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb, u32 key; u64 obj = user_ptr_addr(iocb);
- if (in_compat_syscall())
ctx_id = compat_ptr(user_ptr_addr(ctx_id));
- ctx = lookup_ioctx(ctx_id); if (unlikely(!ctx)) return -EINVAL;
@@ -2329,6 +2353,9 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id, if (timeout && unlikely(get_timespec64(&ts, timeout))) return -EFAULT;
- if (in_compat_syscall())
ctx_id = compat_ptr(user_ptr_addr(ctx_id));
- ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL); if (!ret && signal_pending(current)) ret = -EINTR;
@@ -2378,7 +2405,7 @@ SYSCALL_DEFINE6(io_pgetevents, #if defined(CONFIG_COMPAT_32BIT_TIME) && !defined(CONFIG_64BIT) SYSCALL_DEFINE6(io_pgetevents_time32,
aio_context_t, ctx_id,
compat_aio_context_t, ctx_id,
I'm confused, this is not a compat handler?
Done!
Many thanks, Tudor
Kevin
long, min_nr, long, nr, struct io_event __user *, events,
@@ -2401,7 +2428,7 @@ SYSCALL_DEFINE6(io_pgetevents_time32, if (ret) return ret;
- ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
- ret = do_io_getevents(compat_ptr(ctx_id), min_nr, nr, events, timeout ? &ts : NULL);
interrupted = signal_pending(current); restore_saved_sigmask_unless(interrupted); @@ -2415,7 +2442,7 @@ SYSCALL_DEFINE6(io_pgetevents_time32, #if defined(CONFIG_COMPAT_32BIT_TIME) -SYSCALL_DEFINE5(io_getevents_time32, __u32, ctx_id, +SYSCALL_DEFINE5(io_getevents_time32, compat_aio_context_t, ctx_id, __s32, min_nr, __s32, nr, struct io_event __user *, events, @@ -2427,7 +2454,7 @@ SYSCALL_DEFINE5(io_getevents_time32, __u32, ctx_id, if (timeout && get_old_timespec32(&t, timeout)) return -EFAULT;
- ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL);
- ret = do_io_getevents(compat_ptr(ctx_id), min_nr, nr, events, timeout ? &t : NULL); if (!ret && signal_pending(current)) ret = -EINTR; return ret;
@@ -2467,7 +2494,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents, if (ret) return ret;
- ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL);
- ret = do_io_getevents(compat_ptr(ctx_id), min_nr, nr, events, timeout ? &t : NULL);
interrupted = signal_pending(current); restore_saved_sigmask_unless(interrupted); @@ -2502,7 +2529,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents_time64, if (ret) return ret;
- ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL);
- ret = do_io_getevents(compat_ptr(ctx_id), min_nr, nr, events, timeout ? &t : NULL);
interrupted = signal_pending(current); restore_saved_sigmask_unless(interrupted); diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h index 8050909c2887..87c270021449 100644 --- a/include/uapi/linux/aio_abi.h +++ b/include/uapi/linux/aio_abi.h @@ -31,7 +31,7 @@ #include <linux/fs.h> #include <asm/byteorder.h> -typedef __kernel_ulong_t aio_context_t; +typedef void __user *aio_context_t; enum { IOCB_CMD_PREAD = 0,
The libaio library may set up I/O completion callback functions for I/O requests. The callback function pointer is stored in the data field of the io_event struct, and the pointer to the original iocb struct is stored in the obj field. Therefore, both the data and obj fields of the io_event struct might hold user pointers.
In the PCuABI, a user pointer is a 129-bit capability, so the __u64 type is not big enough to hold it. Use the __kernel_uintptr_t type instead, which is big enough on the affected architectures while remaining 64-bit on others.
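As an illustration (not part of the patch), the classic libaio-style callback pattern shows why these fields must be capability-sized: a function pointer stored in aio_data is expected to come back intact in io_event.data. The helper names below are made up for the example:

#include <linux/aio_abi.h>

typedef void (*io_callback_t)(struct io_event *ev);

/* Stash the completion callback in the request... */
static void set_callback(struct iocb *cb, io_callback_t fn)
{
	cb->aio_data = (__kernel_uintptr_t)fn;	/* capability-sized after this patch */
}

/* ...and recover it from the completion event. */
static void run_callback(struct io_event *ev)
{
	io_callback_t fn = (io_callback_t)ev->data;

	fn(ev);
}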
Signed-off-by: Tudor Cretu tudor.cretu@arm.com --- fs/aio.c | 20 ++++++++++---------- include/uapi/linux/aio_abi.h | 10 +++++----- 2 files changed, 15 insertions(+), 15 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index dff01598760e..cb0b5de179df 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -344,8 +344,8 @@ static void copy_io_event_to_ring(struct kioctx *ctx, if (aio_in_compat64(ctx)) { struct compat_io_event *compat_ring_event = &ctx->io_events_compat[ring_tail];
- compat_ring_event->data = native_event->data; - compat_ring_event->obj = native_event->obj; + compat_ring_event->data = (__u64)native_event->data; + compat_ring_event->obj = (__u64)native_event->obj; compat_ring_event->res = native_event->res; compat_ring_event->res2 = native_event->res2; return; @@ -1229,9 +1229,10 @@ static void aio_complete(struct aio_kiocb *iocb)
copy_io_event_to_ring(ctx, tail, &iocb->ki_res);
- pr_debug("%p[%u]: %p: %p %Lx %Lx %Lx\n", ctx, tail, iocb, - (void __user *)(unsigned long)iocb->ki_res.obj, - iocb->ki_res.data, iocb->ki_res.res, iocb->ki_res.res2); + pr_debug("%p[%u]: %p: %Lx %Lx %Lx %Lx\n", ctx, tail, iocb, + (unsigned long long)iocb->ki_res.obj, + (unsigned long long)iocb->ki_res.data, + iocb->ki_res.res, iocb->ki_res.res2);
/* after flagging the request as done, we * must never even look at it again @@ -2048,7 +2049,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, return -EFAULT; }
- req->ki_res.obj = user_ptr_addr(user_iocb); + req->ki_res.obj = (__kernel_uintptr_t)user_iocb; req->ki_res.data = iocb->aio_data; req->ki_res.res = 0; req->ki_res.res2 = 0; @@ -2079,7 +2080,7 @@ static int get_compat_iocb(struct iocb *iocb, const struct iocb __user *user_ioc struct compat_iocb compat_iocb; if (unlikely(copy_from_user(&compat_iocb, user_iocb, sizeof(struct compat_iocb)))) return -EFAULT; - iocb->aio_data = compat_iocb.aio_data; + iocb->aio_data = (__kernel_uintptr_t)compat_iocb.aio_data; iocb->aio_key = compat_iocb.aio_key; iocb->aio_rw_flags = compat_iocb.aio_rw_flags; iocb->aio_lio_opcode = compat_iocb.aio_lio_opcode; @@ -2263,7 +2264,6 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb, struct aio_kiocb *kiocb; int ret = -EINVAL; u32 key; - u64 obj = user_ptr_addr(iocb);
if (in_compat_syscall()) ctx_id = compat_ptr(user_ptr_addr(ctx_id)); @@ -2285,7 +2285,7 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb, spin_lock_irq(&ctx->ctx_lock); /* TODO: use a hash or array, this sucks. */ list_for_each_entry(kiocb, &ctx->active_reqs, ki_list) { - if (kiocb->ki_res.obj == obj) { + if (user_ptr_is_same((struct iocb __user *)kiocb->ki_res.obj, iocb)) { ret = kiocb->ki_cancel(&kiocb->rw); list_del_init(&kiocb->ki_list); break; @@ -2385,7 +2385,7 @@ SYSCALL_DEFINE6(io_pgetevents, if (timeout && unlikely(get_timespec64(&ts, timeout))) return -EFAULT;
- if (usig && copy_from_user(&ksig, usig, sizeof(ksig))) + if (usig && copy_from_user_with_ptr(&ksig, usig, sizeof(ksig))) return -EFAULT;
ret = set_user_sigmask(ksig.sigmask, ksig.sigsetsize); diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h index 87c270021449..edd2ac54ac42 100644 --- a/include/uapi/linux/aio_abi.h +++ b/include/uapi/linux/aio_abi.h @@ -58,10 +58,10 @@ enum {
/* read() from /dev/aio returns these structures. */ struct io_event { - __u64 data; /* the data field from the iocb */ - __u64 obj; /* what iocb this event came from */ - __s64 res; /* result code for this event */ - __s64 res2; /* secondary result */ + __kernel_uintptr_t data; /* the data field from the iocb */ + __kernel_uintptr_t obj; /* what iocb this event came from */ + __s64 res; /* result code for this event */ + __s64 res2; /* secondary result */ };
/* @@ -72,7 +72,7 @@ struct io_event {
struct iocb { /* these are internal to the kernel/libc. */ - __u64 aio_data; /* data to be returned in event's data */ + __kernel_uintptr_t aio_data; /* data to be returned in event's data */
#if defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN) __u32 aio_key; /* the kernel sets aio_key to the req # */
On 15/06/2023 17:54, Tudor Cretu wrote:
The libaio library may set up I/O completion callback functions for I/O requests. The callback function pointer is stored in the data field of the io_event struct, and the pointer to the original iocb struct is stored in the obj field. Therefore, both the data and obj fields of the io_event struct might hold user pointers.
In the PCuABI, a user pointer is a 129-bit capability, so the __u64 type is not big enough to hold it. Use the __kernel_uintptr_t type instead, which is big enough on the affected architectures while remaining 64-bit on others.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
fs/aio.c | 20 ++++++++++---------- include/uapi/linux/aio_abi.h | 10 +++++----- 2 files changed, 15 insertions(+), 15 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index dff01598760e..cb0b5de179df 100644 --- a/fs/aio.c +++ b/fs/aio.c
Surely in copy_io_events_to_user() we need to use copy_to_user_with_ptr()?
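A rough sketch of that suggestion (assumed parameter names, not the real function body): only the native path deals with capabilities, so only it needs the tag-preserving copy routine, while the compat path can keep plain copy_to_user().

static int copy_io_events_to_user(struct kioctx *ctx,
				  struct io_event __user *event_array,
				  long offset, unsigned int ring_head,
				  long nr)
{
	if (aio_in_compat64(ctx))
		/* compat events hold no capabilities, a plain copy is fine */
		return copy_to_user((struct compat_io_event __user *)event_array + offset,
				    &ctx->io_events_compat[ring_head],
				    sizeof(struct compat_io_event) * nr) ? -EFAULT : 0;

	/* native io_events may carry user pointers -> preserve tags */
	return copy_to_user_with_ptr(event_array + offset,
				     &ctx->ring->io_events[ring_head],
				     sizeof(struct io_event) * nr) ? -EFAULT : 0;
}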
@@ -344,8 +344,8 @@ static void copy_io_event_to_ring(struct kioctx *ctx,
 	if (aio_in_compat64(ctx)) {
 		struct compat_io_event *compat_ring_event =
 			&ctx->io_events_compat[ring_tail];

-		compat_ring_event->data = native_event->data;
-		compat_ring_event->obj = native_event->obj;
+		compat_ring_event->data = (__u64)native_event->data;
+		compat_ring_event->obj = (__u64)native_event->obj;
 		compat_ring_event->res = native_event->res;
 		compat_ring_event->res2 = native_event->res2;
 		return;
@@ -1229,9 +1229,10 @@ static void aio_complete(struct aio_kiocb *iocb) copy_io_event_to_ring(ctx, tail, &iocb->ki_res);
- pr_debug("%p[%u]: %p: %p %Lx %Lx %Lx\n", ctx, tail, iocb,
(void __user *)(unsigned long)iocb->ki_res.obj,
iocb->ki_res.data, iocb->ki_res.res, iocb->ki_res.res2);
- pr_debug("%p[%u]: %p: %Lx %Lx %Lx %Lx\n", ctx, tail, iocb,
(unsigned long long)iocb->ki_res.obj,
(unsigned long long)iocb->ki_res.data,
iocb->ki_res.res, iocb->ki_res.res2);
/* after flagging the request as done, we * must never even look at it again @@ -2048,7 +2049,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, return -EFAULT; }
- req->ki_res.obj = user_ptr_addr(user_iocb);
- req->ki_res.obj = (__kernel_uintptr_t)user_iocb; req->ki_res.data = iocb->aio_data; req->ki_res.res = 0; req->ki_res.res2 = 0;
@@ -2079,7 +2080,7 @@ static int get_compat_iocb(struct iocb *iocb, const struct iocb __user *user_ioc struct compat_iocb compat_iocb; if (unlikely(copy_from_user(&compat_iocb, user_iocb, sizeof(struct compat_iocb)))) return -EFAULT;
- iocb->aio_data = compat_iocb.aio_data;
- iocb->aio_data = (__kernel_uintptr_t)compat_iocb.aio_data; iocb->aio_key = compat_iocb.aio_key; iocb->aio_rw_flags = compat_iocb.aio_rw_flags; iocb->aio_lio_opcode = compat_iocb.aio_lio_opcode;
@@ -2263,7 +2264,6 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb, struct aio_kiocb *kiocb; int ret = -EINVAL; u32 key;
- u64 obj = user_ptr_addr(iocb);
if (in_compat_syscall()) ctx_id = compat_ptr(user_ptr_addr(ctx_id)); @@ -2285,7 +2285,7 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb, spin_lock_irq(&ctx->ctx_lock); /* TODO: use a hash or array, this sucks. */ list_for_each_entry(kiocb, &ctx->active_reqs, ki_list) {
if (kiocb->ki_res.obj == obj) {
if (user_ptr_is_same((struct iocb __user *)kiocb->ki_res.obj, iocb)) { ret = kiocb->ki_cancel(&kiocb->rw); list_del_init(&kiocb->ki_list); break;
@@ -2385,7 +2385,7 @@ SYSCALL_DEFINE6(io_pgetevents, if (timeout && unlikely(get_timespec64(&ts, timeout))) return -EFAULT;
- if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
- if (usig && copy_from_user_with_ptr(&ksig, usig, sizeof(ksig)))
Arguably this is an unrelated fix (i.e. it was already broken before this series and we would have found out if uaccess was checked), so would be better as a separate patch.
Kevin
return -EFAULT;
ret = set_user_sigmask(ksig.sigmask, ksig.sigsetsize); diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h index 87c270021449..edd2ac54ac42 100644 --- a/include/uapi/linux/aio_abi.h +++ b/include/uapi/linux/aio_abi.h @@ -58,10 +58,10 @@ enum { /* read() from /dev/aio returns these structures. */ struct io_event {
- __u64 data; /* the data field from the iocb */
- __u64 obj; /* what iocb this event came from */
- __s64 res; /* result code for this event */
- __s64 res2; /* secondary result */
- __kernel_uintptr_t data; /* the data field from the iocb */
- __kernel_uintptr_t obj; /* what iocb this event came from */
- __s64 res; /* result code for this event */
- __s64 res2; /* secondary result */
}; /* @@ -72,7 +72,7 @@ struct io_event { struct iocb { /* these are internal to the kernel/libc. */
- __u64 aio_data; /* data to be returned in event's data */
- __kernel_uintptr_t aio_data; /* data to be returned in event's data */
#if defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN) __u32 aio_key; /* the kernel sets aio_key to the req # */