This series makes it possible for purecap apps to use the io_uring subsystem.

With these patches, all io_uring LTP tests pass in both Purecap and compat modes. Note that the LTP tests only cover the basic functionality of the io_uring subsystem; a significant portion of the multiplexed functionality is untested in LTP.

I have finished investigating the Purecap and plain AArch64 liburing tests and examples, and the series has been updated accordingly.
v4:
- Rebase on top of morello/next
- Remove the union for flags in struct compat_io_uring_sqe and keep only a single member
- Improve formatting and move functions as per feedback on v3
- Add a new helper for checking if the context is compat
- Remove struct conversion in fdinfo and just use macros
- Remove the union from struct io_overflow_cqe and just leave the native struct
- Fix the cqe_cached/cqe_sentinel mechanism
- Separate the fix for the shared ring size's off-by-one error into a new PATCH 6
- Remove the compat_ptr for addr fields that represent user_data values
- Extend the trace events accordingly to propagate capabilities
- Use the copy*_with_ptr routine for copy_msghdr_from_user in a new PATCH 1
- Fix the misuse of addr2 and off in IORING_OP_CONNECT and IORING_OP_POLL_REMOVE
v3:
- Introduce Patch 5, which exposes the compat handling logic for epoll_event. This is then used in io_uring/epoll.c.
- Introduce Patch 6, which makes sure that when struct iovec is copied from userspace, the capability tags are preserved.
- Fix a few sizeof(var) to sizeof(*var).
- Use iovec_from_user so that the compat handling logic is applied instead of copying directly from userspace
- Add a few missing copy_from_user_with_ptr where suitable.
v2:
- Rebase on top of release 6.1
- Remove the VM_READ_CAPS/VM_LOAD_CAPS patches as they are already merged
- Update the commit message in PATCH 1
- Add the generic changes PATCH 2 and PATCH 3 to avoid copying user pointers from/to userspace unnecessarily. These could be upstreamable.
- Split the "pulling the cqes member out" change into PATCH 4
- The changes for PATCH 5 and 6 are now split into their respective files after the rebase.
- Format and change organization based on the feedback on the previous version, including creating copy_*_from_* helpers for various uAPI structs
- Add comments related to the handling of the setup flags IORING_SETUP_SQE128 and IORING_SETUP_CQE32
- Add handling for the new uAPI structs: io_uring_buf, io_uring_buf_ring, io_uring_buf_reg, io_uring_sync_cancel_reg.
Gitlab issue: https://git.morello-project.org/morello/kernel/linux/-/issues/2
Review branch: https://git.morello-project.org/tudcre01/linux/-/commits/morello/io_uring_v4
Tudor Cretu (9):
  net: socket: use copy_from_user_with_ptr for struct user_msghdr
  io_uring/rw: Restrict copy to only uiov->len from userspace
  io_uring/tctx: Copy only the offset field back to user
  io_uring: Pull cqes member out from rings struct
  epoll: Expose compat handling logic of epoll_event
  io_uring/kbuf: Fix size for shared buffer ring
  io_uring: Implement compat versions of uAPI structs and handle them
  io_uring: Allow capability tag access on the shared memory
  io_uring: Use user pointer type in the uAPI structs
 fs/eventpoll.c                  |  38 ++--
 include/linux/eventpoll.h       |   4 +
 include/linux/io_uring_types.h  | 155 +++++++++++++--
 include/trace/events/io_uring.h |  46 ++---
 include/uapi/linux/io_uring.h   |  76 ++++----
 io_uring/advise.c               |   7 +-
 io_uring/cancel.c               |  28 ++-
 io_uring/cancel.h               |   2 +-
 io_uring/epoll.c                |   4 +-
 io_uring/fdinfo.c               |  80 +++++---
 io_uring/fs.c                   |  16 +-
 io_uring/io_uring.c             | 324 +++++++++++++++++++++++---------
 io_uring/io_uring.h             | 124 +++++++++---
 io_uring/kbuf.c                 | 109 +++++++++--
 io_uring/kbuf.h                 |   8 +-
 io_uring/msg_ring.c             |   4 +-
 io_uring/net.c                  |  25 +--
 io_uring/openclose.c            |   4 +-
 io_uring/poll.c                 |   6 +-
 io_uring/rsrc.c                 | 136 +++++++++++---
 io_uring/rw.c                   |  22 +--
 io_uring/statx.c                |   4 +-
 io_uring/tctx.c                 |  56 +++++-
 io_uring/timeout.c              |  10 +-
 io_uring/uring_cmd.c            |   5 +
 io_uring/uring_cmd.h            |   5 +
 io_uring/xattr.c                |  12 +-
 net/socket.c                    |   2 +-
 28 files changed, 978 insertions(+), 334 deletions(-)
struct user_msghdr contains user pointers, so use the pointer-preserving copy routine, copy_from_user_with_ptr, which keeps capability tags intact.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 net/socket.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/socket.c b/net/socket.c
index 741086ceff95d..0ac6d2a16808e 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2418,7 +2418,7 @@ static int copy_msghdr_from_user(struct msghdr *kmsg,
 	struct user_msghdr msg;
 	ssize_t err;
 
-	if (copy_from_user(&msg, umsg, sizeof(*umsg)))
+	if (copy_from_user_with_ptr(&msg, umsg, sizeof(*umsg)))
 		return -EFAULT;
 
 	err = __copy_msghdr(kmsg, &msg, save_addr);
Only the len member of the user iovec is needed, so copy just that field from userspace instead of the whole struct.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 io_uring/rw.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 1393cdae75854..2edca190450ee 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -55,7 +55,6 @@ static int io_iov_compat_buffer_select_prep(struct io_rw *rw)
 static int io_iov_buffer_select_prep(struct io_kiocb *req)
 {
 	struct iovec __user *uiov;
-	struct iovec iov;
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 
 	if (rw->len != 1)
@@ -67,9 +66,8 @@ static int io_iov_buffer_select_prep(struct io_kiocb *req)
 #endif
 
 	uiov = u64_to_user_ptr(rw->addr);
-	if (copy_from_user(&iov, uiov, sizeof(*uiov)))
+	if (get_user(rw->len, &uiov->iov_len))
 		return -EFAULT;
-	rw->len = iov.iov_len;
 	return 0;
 }
Upon successful return of the io_uring_register system call, the offset field will contain the value of the registered file descriptor to be used for future io_uring_enter system calls. The rest of the struct doesn't need to be copied back to userspace, so restrict the copy_to_user call only to the offset field.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 io_uring/tctx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index 4324b1cf1f6af..96f77450cf4e2 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -289,7 +289,7 @@ int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg,
 			break;
 
 		reg.offset = ret;
-		if (copy_to_user(&arg[i], &reg, sizeof(reg))) {
+		if (put_user(reg.offset, &arg[i].offset)) {
 			fput(tctx->registered_rings[reg.offset]);
 			tctx->registered_rings[reg.offset] = NULL;
 			ret = -EFAULT;
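For reference, this is how the written-back offset is consumed from userspace. A minimal sketch using raw syscalls (error handling elided; register_ring_fd and enter_registered are hypothetical helper names), where offset is set to -1 to let the kernel pick a free slot:

#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Register the ring fd itself; the kernel writes the chosen slot back
 * into up.offset, the only field updated after this patch. */
static int register_ring_fd(int ring_fd)
{
	struct io_uring_rsrc_update up;

	memset(&up, 0, sizeof(up));
	up.offset = -1U;	/* let the kernel pick a free slot */
	up.data = ring_fd;
	if (syscall(__NR_io_uring_register, ring_fd, IORING_REGISTER_RING_FDS,
		    &up, 1) != 1)
		return -1;
	return (int)up.offset;
}

/* Subsequent enters pass the registered index, not the real fd. */
static int enter_registered(int reg_offset, unsigned int to_submit)
{
	return syscall(__NR_io_uring_enter, reg_offset, to_submit, 0,
		       IORING_ENTER_REGISTERED_RING, NULL, 0);
}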
Pull the cqes member out of the rings struct so that cqes and cqes_compat can later be placed in a union. This is done in a similar way to commit 75b28affdd6a ("io_uring: allocate the two rings together"), where sq_array was pulled out of the rings struct.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 include/linux/io_uring_types.h | 18 +++++++++--------
 io_uring/fdinfo.c              |  2 +-
 io_uring/io_uring.c            | 35 ++++++++++++++++++++++++----------
 3 files changed, 36 insertions(+), 19 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index df7d4febc38a4..440179029a8f0 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -141,14 +141,6 @@ struct io_rings {
 	 * ordered with any other data.
 	 */
 	u32			cq_overflow;
-	/*
-	 * Ring buffer of completion events.
-	 *
-	 * The kernel writes completion events fresh every time they are
-	 * produced, so the application is allowed to modify pending
-	 * entries.
-	 */
-	struct io_uring_cqe	cqes[] ____cacheline_aligned_in_smp;
 };
 
 struct io_restriction {
@@ -270,7 +262,17 @@ struct io_ring_ctx {
 		struct xarray		personalities;
 		u32			pers_next;
 
+	/* completion data */
 	struct {
+		/*
+		 * Ring buffer of completion events.
+		 *
+		 * The kernel writes completion events fresh every time they are
+		 * produced, so the application is allowed to modify pending
+		 * entries.
+		 */
+		struct io_uring_cqe	*cqes;
+
 		/*
 		 * We cache a range of free CQEs we can use, once exhausted it
 		 * should go through a slower range setup, see __io_get_cqe()
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index 2e04850a657b0..bc8c9d764bc13 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -119,7 +119,7 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx,
 	cq_entries = min(cq_tail - cq_head, ctx->cq_entries);
 	for (i = 0; i < cq_entries; i++) {
 		unsigned int entry = i + cq_head;
-		struct io_uring_cqe *cqe = &r->cqes[(entry & cq_mask) << cq_shift];
+		struct io_uring_cqe *cqe = &ctx->cqes[(entry & cq_mask) << cq_shift];
 
 		seq_printf(m, "%5u: user_data:%llu, res:%d, flag:%x",
 			   entry & cq_mask, cqe->user_data, cqe->res,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index df41a63c642c1..707229ae04dc8 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -743,7 +743,6 @@ bool io_req_cqe_overflow(struct io_kiocb *req)
  */
 struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 {
-	struct io_rings *rings = ctx->rings;
 	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
 	unsigned int free, queued, len;
 
@@ -768,14 +767,14 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 		len <<= 1;
 	}
 
-	ctx->cqe_cached = &rings->cqes[off];
+	ctx->cqe_cached = &ctx->cqes[off];
 	ctx->cqe_sentinel = ctx->cqe_cached + len;
 
 	ctx->cached_cq_tail++;
 	ctx->cqe_cached++;
 	if (ctx->flags & IORING_SETUP_CQE32)
 		ctx->cqe_cached++;
-	return &rings->cqes[off];
+	return &ctx->cqes[off];
 }
 
 bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags,
@@ -2476,13 +2475,28 @@ static void *io_mem_alloc(size_t size)
 }
 
 static unsigned long rings_size(struct io_ring_ctx *ctx, unsigned int sq_entries,
-				unsigned int cq_entries, size_t *sq_offset)
+				unsigned int cq_entries, size_t *sq_offset,
+				size_t *cq_offset)
 {
 	struct io_rings *rings;
-	size_t off, sq_array_size;
+	size_t off, cq_array_size, sq_array_size;
+
+	off = sizeof(*rings);
+
+#ifdef CONFIG_SMP
+	off = ALIGN(off, SMP_CACHE_BYTES);
+	if (off == 0)
+		return SIZE_MAX;
+#endif
+
+	if (cq_offset)
+		*cq_offset = off;
+
+	cq_array_size = array_size(sizeof(struct io_uring_cqe), cq_entries);
+	if (cq_array_size == SIZE_MAX)
+		return SIZE_MAX;
 
-	off = struct_size(rings, cqes, cq_entries);
-	if (off == SIZE_MAX)
+	if (check_add_overflow(off, cq_array_size, &off))
 		return SIZE_MAX;
 	if (ctx->flags & IORING_SETUP_CQE32) {
 		if (check_shl_overflow(off, 1, &off))
@@ -3314,13 +3328,13 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
 					 struct io_uring_params *p)
 {
 	struct io_rings *rings;
-	size_t size, sq_array_offset;
+	size_t size, cqes_offset, sq_array_offset;
 
 	/* make sure these are sane, as we already accounted them */
 	ctx->sq_entries = p->sq_entries;
 	ctx->cq_entries = p->cq_entries;
 
-	size = rings_size(ctx, p->sq_entries, p->cq_entries, &sq_array_offset);
+	size = rings_size(ctx, p->sq_entries, p->cq_entries, &sq_array_offset, &cqes_offset);
 	if (size == SIZE_MAX)
 		return -EOVERFLOW;
 
@@ -3329,6 +3343,7 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
 		return -ENOMEM;
 
 	ctx->rings = rings;
+	ctx->cqes = (struct io_uring_cqe *)((char *)rings + cqes_offset);
 	ctx->sq_array = (u32 *)((char *)rings + sq_array_offset);
 	rings->sq_ring_mask = p->sq_entries - 1;
 	rings->cq_ring_mask = p->cq_entries - 1;
@@ -3533,7 +3548,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	p->cq_off.ring_mask = offsetof(struct io_rings, cq_ring_mask);
 	p->cq_off.ring_entries = offsetof(struct io_rings, cq_ring_entries);
 	p->cq_off.overflow = offsetof(struct io_rings, cq_overflow);
-	p->cq_off.cqes = offsetof(struct io_rings, cqes);
+	p->cq_off.cqes = (char *)ctx->cqes - (char *)ctx->rings;
 	p->cq_off.flags = offsetof(struct io_rings, cq_flags);
 
 	p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP |
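The cqes offset is not ABI-fixed: userspace discovers it via p->cq_off.cqes, so relocating cqes inside the rings allocation stays transparent to applications that honour the reported offset. A minimal userspace sketch (raw syscalls, no error handling; cq_example is a hypothetical name):

#include <linux/io_uring.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Map the CQ ring and locate the CQE array through the offset the
 * kernel reports, rather than assuming a fixed layout. */
int cq_example(void)
{
	struct io_uring_params p = {0};
	int fd = syscall(__NR_io_uring_setup, 8, &p);
	size_t cq_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
	void *cq_ring = mmap(NULL, cq_sz, PROT_READ | PROT_WRITE,
			     MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_CQ_RING);
	struct io_uring_cqe *cqes =
		(struct io_uring_cqe *)((char *)cq_ring + p.cq_off.cqes);

	(void)cqes;
	return fd;
}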
Move the logic that copies an epoll_event from userspace into its own function and expose it in the eventpoll.h header. This allows other subsystems, such as io_uring, to handle epoll_event structures.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 fs/eventpoll.c            | 38 +++++++++++++++++++++++-------------
 include/linux/eventpoll.h |  4 ++++
 2 files changed, 29 insertions(+), 13 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 7e33a2781dec8..c6afc25b1d4ee 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2197,6 +2197,27 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
 	return error;
 }
 
+static int get_compat_epoll_event(struct epoll_event *epds,
+				  const void __user *user_epds)
+{
+	struct compat_epoll_event compat_epds;
+
+	if (unlikely(copy_from_user(&compat_epds, user_epds, sizeof(compat_epds))))
+		return -EFAULT;
+	epds->events = compat_epds.events;
+	epds->data = (__kernel_uintptr_t)as_user_ptr(compat_epds.data);
+	return 0;
+}
+
+int copy_epoll_event_from_user(struct epoll_event *epds,
+			       const void __user *user_epds,
+			       bool compat)
+{
+	if (compat)
+		return get_compat_epoll_event(epds, user_epds);
+	return copy_from_user_with_ptr(epds, user_epds, sizeof(*epds));
+}
+
 /*
  * The following function implements the controller interface for
  * the eventpoll file that enables the insertion/removal/change of
@@ -2211,20 +2232,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	struct epoll_event epds;
 
 	if (ep_op_has_event(op)) {
-		if (in_compat_syscall()) {
-			struct compat_epoll_event compat_epds;
-
-			if (copy_from_user(&compat_epds, event,
-					   sizeof(struct compat_epoll_event)))
-				return -EFAULT;
+		int ret;
 
-			epds.events = compat_epds.events;
-			epds.data = (__kernel_uintptr_t)as_user_ptr(compat_epds.data);
-		} else {
-			if (copy_from_user_with_ptr(&epds, event,
-						    sizeof(struct epoll_event)))
-				return -EFAULT;
-		}
+		ret = copy_epoll_event_from_user(&epds, event, in_compat_syscall());
+		if (ret)
+			return -EFAULT;
 	}
 
 	return do_epoll_ctl(epfd, op, fd, &epds, false);
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 457811d82ff20..62b0829354a0e 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -103,4 +103,8 @@ epoll_put_uevent(__poll_t revents, __kernel_uintptr_t data,
 }
 #endif
 
+int copy_epoll_event_from_user(struct epoll_event *epds,
+			       const void __user *user_epds,
+			       bool compat);
+
 #endif /* #ifndef _LINUX_EVENTPOLL_H */
The size of the ring is the product of ring_entries and the size of struct io_uring_buf. As the ring header is a union overlaying bufs[0], sizeof(struct io_uring_buf_ring) equals sizeof(struct io_uring_buf), so struct_size(br, bufs, ring_entries) evaluates to (ring_entries + 1) * sizeof(struct io_uring_buf), which is an off-by-one error. Fix it by using size_mul directly.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 io_uring/kbuf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index e2c46889d5fab..182e594b56c6e 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -509,7 +509,7 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 	}
 
 	pages = io_pin_pages(reg.ring_addr,
-			     struct_size(br, bufs, reg.ring_entries),
+			     size_mul(sizeof(struct io_uring_buf), reg.ring_entries),
			     &nr_pages);
 	if (IS_ERR(pages)) {
 		kfree(free_bl);
Introduce compat versions of the structs exposed in the uAPI headers that may contain user pointers as members. Also, implement functions that convert the compat versions of these structs to their native counterparts.
A subsequent patch is going to change the io_uring structs to enable them to support new architectures. On such architectures, the current struct layout still needs to be supported for compat tasks.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 include/linux/io_uring_types.h | 135 ++++++++++++++++++-
 io_uring/cancel.c              |  26 +++-
 io_uring/epoll.c               |   2 +-
 io_uring/fdinfo.c              |  80 ++++++-----
 io_uring/io_uring.c            | 234 +++++++++++++++++++++++----------
 io_uring/io_uring.h            | 111 +++++++++++++---
 io_uring/kbuf.c                |  96 ++++++++++++--
 io_uring/kbuf.h                |   6 +-
 io_uring/net.c                 |   5 +-
 io_uring/rsrc.c                | 108 +++++++++++++--
 io_uring/tctx.c                |  56 +++++++-
 io_uring/uring_cmd.h           |   5 +
 12 files changed, 704 insertions(+), 160 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 440179029a8f0..f0eb34ad8b709 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -7,6 +7,127 @@
 #include <linux/llist.h>
 #include <uapi/linux/io_uring.h>
 
+struct compat_io_uring_sqe {
+	__u8	opcode;
+	__u8	flags;
+	__u16	ioprio;
+	__s32	fd;
+	union {
+		__u64	off;
+		__u64	addr2;
+		struct {
+			__u32	cmd_op;
+			__u32	__pad1;
+		};
+	};
+	union {
+		__u64	addr;
+		__u64	splice_off_in;
+	};
+	__u32	len;
+	/* This member is actually a union in the native struct */
+	__kernel_rwf_t	rw_flags;
+	__u64	user_data;
+	union {
+		__u16	buf_index;
+		__u16	buf_group;
+	} __packed;
+	__u16	personality;
+	union {
+		__s32	splice_fd_in;
+		__u32	file_index;
+		struct {
+			__u16	addr_len;
+			__u16	__pad3[1];
+		};
+	};
+	union {
+		struct {
+			__u64	addr3;
+			__u64	__pad2[1];
+		};
+		__u8	cmd[0];
+	};
+};
+
+struct compat_io_uring_cqe {
+	__u64	user_data;
+	__s32	res;
+	__u32	flags;
+	__u64	big_cqe[];
+};
+
+struct compat_io_uring_files_update {
+	__u32	offset;
+	__u32	resv;
+	__aligned_u64	fds;
+};
+
+struct compat_io_uring_rsrc_register {
+	__u32	nr;
+	__u32	flags;
+	__u64	resv2;
+	__aligned_u64	data;
+	__aligned_u64	tags;
+};
+
+struct compat_io_uring_rsrc_update {
+	__u32	offset;
+	__u32	resv;
+	__aligned_u64	data;
+};
+
+struct compat_io_uring_rsrc_update2 {
+	__u32	offset;
+	__u32	resv;
+	__aligned_u64	data;
+	__aligned_u64	tags;
+	__u32	nr;
+	__u32	resv2;
+};
+
+struct compat_io_uring_buf {
+	__u64	addr;
+	__u32	len;
+	__u16	bid;
+	__u16	resv;
+};
+
+struct compat_io_uring_buf_ring {
+	union {
+		struct {
+			__u64	resv1;
+			__u32	resv2;
+			__u16	resv3;
+			__u16	tail;
+		};
+		struct compat_io_uring_buf	bufs[0];
+	};
+};
+
+struct compat_io_uring_buf_reg {
+	__u64	ring_addr;
+	__u32	ring_entries;
+	__u16	bgid;
+	__u16	pad;
+	__u64	resv[3];
+};
+
+struct compat_io_uring_getevents_arg {
+	__u64	sigmask;
+	__u32	sigmask_sz;
+	__u32	pad;
+	__u64	ts;
+};
+
+struct compat_io_uring_sync_cancel_reg {
+	__u64	addr;
+	__s32	fd;
+	__u32	flags;
+	struct __kernel_timespec	timeout;
+	__u64	pad[4];
+};
+
 struct io_wq_work_node {
 	struct io_wq_work_node *next;
 };
@@ -216,7 +337,10 @@ struct io_ring_ctx {
 	 * array.
 	 */
 	u32			*sq_array;
-	struct io_uring_sqe	*sq_sqes;
+	union {
+		struct compat_io_uring_sqe	*sq_sqes_compat;
+		struct io_uring_sqe		*sq_sqes;
+	};
 	unsigned		cached_sq_head;
 	unsigned		sq_entries;
@@ -271,14 +395,17 @@ struct io_ring_ctx {
 		 * produced, so the application is allowed to modify pending
 		 * entries.
 		 */
-		struct io_uring_cqe	*cqes;
+		union {
+			struct compat_io_uring_cqe	*cqes_compat;
+			struct io_uring_cqe		*cqes;
+		};
 
 		/*
 		 * We cache a range of free CQEs we can use, once exhausted it
 		 * should go through a slower range setup, see __io_get_cqe()
 		 */
-		struct io_uring_cqe	*cqe_cached;
-		struct io_uring_cqe	*cqe_sentinel;
+		void			*cqe_cached;
+		void			*cqe_sentinel;
 
 		unsigned		cached_cq_tail;
 		unsigned		cq_entries;
diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index 2291a53cdabd1..8382ea03fe899 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -27,6 +27,30 @@ struct io_cancel {
 #define CANCEL_FLAGS	(IORING_ASYNC_CANCEL_ALL | IORING_ASYNC_CANCEL_FD | \
			 IORING_ASYNC_CANCEL_ANY | IORING_ASYNC_CANCEL_FD_FIXED)
 
+static int get_compat64_io_uring_sync_cancel_reg(struct io_uring_sync_cancel_reg *sc,
+						 const void __user *user_sc)
+{
+	struct compat_io_uring_sync_cancel_reg compat_sc;
+
+	if (copy_from_user(&compat_sc, user_sc, sizeof(compat_sc)))
+		return -EFAULT;
+	sc->addr = compat_sc.addr;
+	sc->fd = compat_sc.fd;
+	sc->flags = compat_sc.flags;
+	sc->timeout = compat_sc.timeout;
+	memcpy(sc->pad, compat_sc.pad, sizeof(sc->pad));
+	return 0;
+}
+
+static int copy_io_uring_sync_cancel_reg_from_user(struct io_ring_ctx *ctx,
+						   struct io_uring_sync_cancel_reg *sc,
+						   const void __user *arg)
+{
+	if (is_compat64_io_ring_ctx(ctx))
+		return get_compat64_io_uring_sync_cancel_reg(sc, arg);
+	return copy_from_user(sc, arg, sizeof(*sc));
+}
+
 static bool io_cancel_cb(struct io_wq_work *work, void *data)
 {
 	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
@@ -243,7 +267,7 @@ int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg)
 	DEFINE_WAIT(wait);
 	int ret;
 
-	if (copy_from_user(&sc, arg, sizeof(sc)))
+	if (copy_io_uring_sync_cancel_reg_from_user(ctx, &sc, arg))
 		return -EFAULT;
 	if (sc.flags & ~CANCEL_FLAGS)
 		return -EINVAL;
diff --git a/io_uring/epoll.c b/io_uring/epoll.c
index 9aa74d2c80bc4..d5580ff465c3e 100644
--- a/io_uring/epoll.c
+++ b/io_uring/epoll.c
@@ -40,7 +40,7 @@ int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		struct epoll_event __user *ev;
 
 		ev = u64_to_user_ptr(READ_ONCE(sqe->addr));
-		if (copy_from_user(&epoll->event, ev, sizeof(*ev)))
+		if (copy_epoll_event_from_user(&epoll->event, ev, req->ctx->compat))
 			return -EFAULT;
 	}
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index bc8c9d764bc13..c724e6c544809 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -88,45 +88,64 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx,
 	sq_entries = min(sq_tail - sq_head, ctx->sq_entries);
 	for (i = 0; i < sq_entries; i++) {
 		unsigned int entry = i + sq_head;
-		struct io_uring_sqe *sqe;
-		unsigned int sq_idx;
+		unsigned int sq_idx, sq_off;
 
 		sq_idx = READ_ONCE(ctx->sq_array[entry & sq_mask]);
 		if (sq_idx > sq_mask)
 			continue;
-		sqe = &ctx->sq_sqes[sq_idx << sq_shift];
-		seq_printf(m, "%5u: opcode:%s, fd:%d, flags:%x, off:%llu, "
-			      "addr:0x%llx, rw_flags:0x%x, buf_index:%d "
-			      "user_data:%llu",
-			   sq_idx, io_uring_get_opcode(sqe->opcode), sqe->fd,
-			   sqe->flags, (unsigned long long) sqe->off,
-			   (unsigned long long) sqe->addr, sqe->rw_flags,
-			   sqe->buf_index, sqe->user_data);
-		if (sq_shift) {
-			u64 *sqeb = (void *) (sqe + 1);
-			int size = sizeof(struct io_uring_sqe) / sizeof(u64);
-			int j;
-
-			for (j = 0; j < size; j++) {
-				seq_printf(m, ", e%d:0x%llx", j,
-						(unsigned long long) *sqeb);
-				sqeb++;
-			}
-		}
+		sq_off = sq_idx << sq_shift;
+#define print_sqe(sqe)								\
+	do {									\
+		seq_printf(m, "%5u: opcode:%s, fd:%d, flags:%x, off:%llu, "	\
+			      "addr:0x%llx, rw_flags:0x%x, buf_index:%d "	\
+			      "user_data:%llu",					\
+			   sq_idx, io_uring_get_opcode((sqe)->opcode), (sqe)->fd, \
+			   (sqe)->flags, (unsigned long long) (sqe)->off,	\
+			   (unsigned long long) (sqe)->addr, (sqe)->rw_flags,	\
+			   (sqe)->buf_index, (sqe)->user_data);			\
+		if (sq_shift) {							\
+			u64 *sqeb = (void *) ((sqe) + 1);			\
+			int size = sizeof(*(sqe)) / sizeof(u64);		\
+			int j;							\
+										\
+			for (j = 0; j < size; j++) {				\
+				seq_printf(m, ", e%d:0x%llx", j,		\
+						(unsigned long long) *sqeb);	\
+				sqeb++;						\
+			}							\
+		}								\
+	} while (0)
+
+		if (is_compat64_io_ring_ctx(ctx))
+			print_sqe(&ctx->sq_sqes_compat[sq_off]);
+		else
+			print_sqe(&ctx->sq_sqes[sq_off]);
+#undef print_sqe
+		seq_printf(m, "\n");
 	}
 	seq_printf(m, "CQEs:\t%u\n", cq_tail - cq_head);
 	cq_entries = min(cq_tail - cq_head, ctx->cq_entries);
 	for (i = 0; i < cq_entries; i++) {
 		unsigned int entry = i + cq_head;
-		struct io_uring_cqe *cqe = &ctx->cqes[(entry & cq_mask) << cq_shift];
-
-		seq_printf(m, "%5u: user_data:%llu, res:%d, flag:%x",
-			   entry & cq_mask, cqe->user_data, cqe->res,
-			   cqe->flags);
-		if (cq_shift)
-			seq_printf(m, ", extra1:%llu, extra2:%llu\n",
-				   cqe->big_cqe[0], cqe->big_cqe[1]);
+		unsigned int cq_off = (entry & cq_mask) << cq_shift;
+
+#define print_cqe(cqe)								\
+	do {									\
+		seq_printf(m, "%5u: user_data:%llu, res:%d, flag:%x",		\
+			   entry & cq_mask, (cqe)->user_data, (cqe)->res,	\
+			   (cqe)->flags);					\
+		if (cq_shift)							\
+			seq_printf(m, ", extra1:%llu, extra2:%llu\n",		\
+				   (cqe)->big_cqe[0], (cqe)->big_cqe[1]);	\
+	} while (0)
+
+		if (is_compat64_io_ring_ctx(ctx))
+			print_cqe((struct compat_io_uring_cqe *)&ctx->cqes_compat[cq_off]);
+		else
+			print_cqe((struct io_uring_cqe *)&ctx->cqes[cq_off]);
+#undef print_cqe
+		seq_printf(m, "\n");
 	}
 
@@ -191,8 +210,7 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx,
 			struct io_uring_cqe *cqe = &ocqe->cqe;
 
 			seq_printf(m, "  user_data=%llu, res=%d, flags=%x\n",
-				   cqe->user_data, cqe->res, cqe->flags);
-
+				   (cqe)->user_data, (cqe)->res, (cqe)->flags);
 		}
 
 	spin_unlock(&ctx->completion_lock);
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 707229ae04dc8..3f0e005481f3f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -152,6 +152,35 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx);
 static struct kmem_cache *req_cachep;
 
+static int get_compat64_io_uring_getevents_arg(struct io_uring_getevents_arg *arg,
+					       const void __user *user_arg)
+{
+	struct compat_io_uring_getevents_arg compat_arg;
+
+	if (copy_from_user(&compat_arg, user_arg, sizeof(compat_arg)))
+		return -EFAULT;
+	arg->sigmask = compat_arg.sigmask;
+	arg->sigmask_sz = compat_arg.sigmask_sz;
+	arg->pad = compat_arg.pad;
+	arg->ts = compat_arg.ts;
+	return 0;
+}
+
+static int copy_io_uring_getevents_arg_from_user(struct io_ring_ctx *ctx,
+						 struct io_uring_getevents_arg *arg,
+						 const void __user *argp,
+						 size_t size)
+{
+	if (is_compat64_io_ring_ctx(ctx)) {
+		if (size != sizeof(struct compat_io_uring_getevents_arg))
+			return -EINVAL;
+		return get_compat64_io_uring_getevents_arg(arg, argp);
+	}
+	if (size != sizeof(*arg))
+		return -EINVAL;
+	return copy_from_user(arg, argp, sizeof(*arg));
+}
+
 struct sock *io_uring_get_socket(struct file *file)
 {
 #if defined(CONFIG_UNIX)
@@ -604,14 +633,10 @@ void io_cq_unlock_post(struct io_ring_ctx *ctx)
 static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 {
 	bool all_flushed;
-	size_t cqe_size = sizeof(struct io_uring_cqe);
 
 	if (!force && __io_cqring_events(ctx) == ctx->cq_entries)
 		return false;
 
-	if (ctx->flags & IORING_SETUP_CQE32)
-		cqe_size <<= 1;
-
 	io_cq_lock(ctx);
 	while (!list_empty(&ctx->cq_overflow_list)) {
 		struct io_uring_cqe *cqe = io_get_cqe_overflow(ctx, true);
@@ -621,9 +646,18 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 			break;
 		ocqe = list_first_entry(&ctx->cq_overflow_list,
					struct io_overflow_cqe, list);
-		if (cqe)
-			memcpy(cqe, &ocqe->cqe, cqe_size);
-		else
+		if (cqe) {
+			u64 extra1 = 0;
+			u64 extra2 = 0;
+
+			if (ctx->flags & IORING_SETUP_CQE32) {
+				extra1 = ocqe->cqe.big_cqe[0];
+				extra2 = ocqe->cqe.big_cqe[1];
+			}
+
+			__io_fill_cqe(ctx, cqe, ocqe->cqe.user_data, ocqe->cqe.res,
+				      ocqe->cqe.flags, extra1, extra2);
+		} else
 			io_account_cq_overflow(ctx);
 
 		list_del(&ocqe->list);
@@ -745,6 +779,10 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 {
 	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
 	unsigned int free, queued, len;
+	size_t cqe_size = ctx->compat ?
+			  sizeof(struct compat_io_uring_cqe) :
+			  sizeof(struct io_uring_cqe);
+	struct io_uring_cqe *cqe;
 
 	/*
	 * Posting into the CQ when there are pending overflowed CQEs may break
@@ -767,14 +805,15 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 		len <<= 1;
 	}
 
-	ctx->cqe_cached = &ctx->cqes[off];
-	ctx->cqe_sentinel = ctx->cqe_cached + len;
+	cqe = ctx->compat ? (struct io_uring_cqe *)&ctx->cqes_compat[off] : &ctx->cqes[off];
+	ctx->cqe_cached = cqe;
+	ctx->cqe_sentinel = ctx->cqe_cached + len * cqe_size;
 
 	ctx->cached_cq_tail++;
-	ctx->cqe_cached++;
+	ctx->cqe_cached += cqe_size;
 	if (ctx->flags & IORING_SETUP_CQE32)
-		ctx->cqe_cached++;
-	return &ctx->cqes[off];
+		ctx->cqe_cached += cqe_size;
+	return cqe;
 }
 
 bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags,
@@ -793,14 +832,7 @@ bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags
 	if (likely(cqe)) {
 		trace_io_uring_complete(ctx, NULL, user_data, res, cflags, 0, 0);
 
-		WRITE_ONCE(cqe->user_data, user_data);
-		WRITE_ONCE(cqe->res, res);
-		WRITE_ONCE(cqe->flags, cflags);
-
-		if (ctx->flags & IORING_SETUP_CQE32) {
-			WRITE_ONCE(cqe->big_cqe[0], 0);
-			WRITE_ONCE(cqe->big_cqe[1], 0);
-		}
+		__io_fill_cqe(ctx, cqe, user_data, res, cflags, 0, 0);
 		return true;
 	}
@@ -2240,7 +2272,9 @@ static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx)
 	/* double index for 128-byte SQEs, twice as long */
 	if (ctx->flags & IORING_SETUP_SQE128)
 		head <<= 1;
-	return &ctx->sq_sqes[head];
+	return ctx->compat ?
+	       (struct io_uring_sqe *)&ctx->sq_sqes_compat[head] :
+	       &ctx->sq_sqes[head];
 }
 
 /* drop invalid entries */
@@ -2267,6 +2301,7 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 	do {
 		const struct io_uring_sqe *sqe;
 		struct io_kiocb *req;
+		struct io_uring_sqe native_sqe[2];
 
 		if (unlikely(!io_alloc_req_refill(ctx)))
 			break;
@@ -2276,6 +2311,11 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 			io_req_add_to_cache(req, ctx);
 			break;
 		}
+		if (is_compat64_io_ring_ctx(ctx)) {
+			convert_compat64_io_uring_sqe(ctx, native_sqe,
+						      (struct compat_io_uring_sqe *)sqe);
+			sqe = native_sqe;
+		}
 
 		/*
		 * Continue submitting even for sqe failure if the
@@ -2480,6 +2520,9 @@ static unsigned long rings_size(struct io_ring_ctx *ctx, unsigned int sq_entries
 {
 	struct io_rings *rings;
 	size_t off, cq_array_size, sq_array_size;
+	size_t cqe_size = ctx->compat ?
+			  sizeof(struct compat_io_uring_cqe) :
+			  sizeof(struct io_uring_cqe);
 
 	off = sizeof(*rings);
 
@@ -2492,7 +2535,7 @@ static unsigned long rings_size(struct io_ring_ctx *ctx, unsigned int sq_entries
 	if (cq_offset)
 		*cq_offset = off;
 
-	cq_array_size = array_size(sizeof(struct io_uring_cqe), cq_entries);
+	cq_array_size = array_size(cqe_size, cq_entries);
 	if (cq_array_size == SIZE_MAX)
 		return SIZE_MAX;
@@ -3120,20 +3163,22 @@ static unsigned long io_uring_nommu_get_unmapped_area(struct file *file,
 
 #endif /* !CONFIG_MMU */
 
-static int io_validate_ext_arg(unsigned flags, const void __user *argp, size_t argsz)
+static int io_validate_ext_arg(struct io_ring_ctx *ctx, unsigned int flags,
+			       const void __user *argp, size_t argsz)
 {
 	if (flags & IORING_ENTER_EXT_ARG) {
 		struct io_uring_getevents_arg arg;
+		int ret;
 
-		if (argsz != sizeof(arg))
-			return -EINVAL;
-		if (copy_from_user(&arg, argp, sizeof(arg)))
-			return -EFAULT;
+		ret = copy_io_uring_getevents_arg_from_user(ctx, &arg, argp, argsz);
+		if (ret)
+			return ret;
 	}
 	return 0;
 }
 
-static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz,
+static int io_get_ext_arg(struct io_ring_ctx *ctx, unsigned int flags,
+			  const void __user *argp, size_t *argsz,
 #ifdef CONFIG_CHERI_PURECAP_UABI
 			  struct __kernel_timespec * __capability *ts,
 			  const sigset_t * __capability *sig)
@@ -3143,6 +3188,7 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz
 #endif
 {
 	struct io_uring_getevents_arg arg;
+	int ret;
 
 	/*
	 * If EXT_ARG isn't set, then we have no timespec and the argp pointer
@@ -3158,10 +3204,9 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz
	 * EXT_ARG is set - ensure we agree on the size of it and copy in our
	 * timespec and sigset_t pointers if good.
	 */
-	if (*argsz != sizeof(arg))
-		return -EINVAL;
-	if (copy_from_user(&arg, argp, sizeof(arg)))
-		return -EFAULT;
+	ret = copy_io_uring_getevents_arg_from_user(ctx, &arg, argp, *argsz);
+	if (ret)
+		return ret;
 	if (arg.pad)
 		return -EINVAL;
 	*sig = u64_to_user_ptr(arg.sigmask);
@@ -3268,7 +3313,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
		 */
 		mutex_lock(&ctx->uring_lock);
 iopoll_locked:
-		ret2 = io_validate_ext_arg(flags, argp, argsz);
+		ret2 = io_validate_ext_arg(ctx, flags, argp, argsz);
 		if (likely(!ret2)) {
 			min_complete = min(min_complete,
					   ctx->cq_entries);
@@ -3279,7 +3324,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		const sigset_t __user *sig;
 		struct __kernel_timespec __user *ts;
 
-		ret2 = io_get_ext_arg(flags, argp, &argsz, &ts, &sig);
+		ret2 = io_get_ext_arg(ctx, flags, argp, &argsz, &ts, &sig);
 		if (likely(!ret2)) {
 			min_complete = min(min_complete,
					   ctx->cq_entries);
@@ -3329,6 +3374,9 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
 {
 	struct io_rings *rings;
 	size_t size, cqes_offset, sq_array_offset;
+	size_t sqe_size = ctx->compat ?
+			  sizeof(struct compat_io_uring_sqe) :
+			  sizeof(struct io_uring_sqe);
 
 	/* make sure these are sane, as we already accounted them */
 	ctx->sq_entries = p->sq_entries;
@@ -3351,9 +3399,9 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
 	rings->cq_ring_entries = p->cq_entries;
 
 	if (p->flags & IORING_SETUP_SQE128)
-		size = array_size(2 * sizeof(struct io_uring_sqe), p->sq_entries);
+		size = array_size(2 * sqe_size, p->sq_entries);
 	else
-		size = array_size(sizeof(struct io_uring_sqe), p->sq_entries);
+		size = array_size(sqe_size, p->sq_entries);
 	if (size == SIZE_MAX) {
 		io_mem_free(ctx->rings);
 		ctx->rings = NULL;
@@ -4107,48 +4155,48 @@ static int __init io_uring_init(void)
 #define BUILD_BUG_SQE_ELEM_SIZE(eoffset, esize, ename) \
	__BUILD_BUG_VERIFY_OFFSET_SIZE(struct io_uring_sqe, eoffset, esize, ename)
 	BUILD_BUG_ON(sizeof(struct io_uring_sqe) != 64);
-	BUILD_BUG_SQE_ELEM(0,  __u8,   opcode);
-	BUILD_BUG_SQE_ELEM(1,  __u8,   flags);
-	BUILD_BUG_SQE_ELEM(2,  __u16,  ioprio);
-	BUILD_BUG_SQE_ELEM(4,  __s32,  fd);
-	BUILD_BUG_SQE_ELEM(8,  __u64,  off);
-	BUILD_BUG_SQE_ELEM(8,  __u64,  addr2);
-	BUILD_BUG_SQE_ELEM(8,  __u32,  cmd_op);
+	BUILD_BUG_SQE_ELEM(0, __u8, opcode);
+	BUILD_BUG_SQE_ELEM(1, __u8, flags);
+	BUILD_BUG_SQE_ELEM(2, __u16, ioprio);
+	BUILD_BUG_SQE_ELEM(4, __s32, fd);
+	BUILD_BUG_SQE_ELEM(8, __u64, off);
+	BUILD_BUG_SQE_ELEM(8, __u64, addr2);
+	BUILD_BUG_SQE_ELEM(8, __u32, cmd_op);
 	BUILD_BUG_SQE_ELEM(12, __u32, __pad1);
-	BUILD_BUG_SQE_ELEM(16, __u64, addr);
-	BUILD_BUG_SQE_ELEM(16, __u64, splice_off_in);
-	BUILD_BUG_SQE_ELEM(24, __u32, len);
+	BUILD_BUG_SQE_ELEM(16, __u64, addr);
+	BUILD_BUG_SQE_ELEM(16, __u64, splice_off_in);
+	BUILD_BUG_SQE_ELEM(24, __u32, len);
 	BUILD_BUG_SQE_ELEM(28, __kernel_rwf_t, rw_flags);
 	BUILD_BUG_SQE_ELEM(28, /* compat */ int, rw_flags);
 	BUILD_BUG_SQE_ELEM(28, /* compat */ __u32, rw_flags);
-	BUILD_BUG_SQE_ELEM(28, __u32,  fsync_flags);
-	BUILD_BUG_SQE_ELEM(28, /* compat */ __u16,  poll_events);
-	BUILD_BUG_SQE_ELEM(28, __u32,  poll32_events);
-	BUILD_BUG_SQE_ELEM(28, __u32,  sync_range_flags);
-	BUILD_BUG_SQE_ELEM(28, __u32,  msg_flags);
-	BUILD_BUG_SQE_ELEM(28, __u32,  timeout_flags);
-	BUILD_BUG_SQE_ELEM(28, __u32,  accept_flags);
-	BUILD_BUG_SQE_ELEM(28, __u32,  cancel_flags);
-	BUILD_BUG_SQE_ELEM(28, __u32,  open_flags);
-	BUILD_BUG_SQE_ELEM(28, __u32,  statx_flags);
-	BUILD_BUG_SQE_ELEM(28, __u32,  fadvise_advice);
-	BUILD_BUG_SQE_ELEM(28, __u32,  splice_flags);
-	BUILD_BUG_SQE_ELEM(28, __u32,  rename_flags);
-	BUILD_BUG_SQE_ELEM(28, __u32,  unlink_flags);
-	BUILD_BUG_SQE_ELEM(28, __u32,  hardlink_flags);
-	BUILD_BUG_SQE_ELEM(28, __u32,  xattr_flags);
-	BUILD_BUG_SQE_ELEM(28, __u32,  msg_ring_flags);
-	BUILD_BUG_SQE_ELEM(32, __u64,  user_data);
-	BUILD_BUG_SQE_ELEM(40, __u16,  buf_index);
-	BUILD_BUG_SQE_ELEM(40, __u16,  buf_group);
-	BUILD_BUG_SQE_ELEM(42, __u16,  personality);
-	BUILD_BUG_SQE_ELEM(44, __s32,  splice_fd_in);
-	BUILD_BUG_SQE_ELEM(44, __u32,  file_index);
-	BUILD_BUG_SQE_ELEM(44, __u16,  addr_len);
-	BUILD_BUG_SQE_ELEM(46, __u16,  __pad3[0]);
-	BUILD_BUG_SQE_ELEM(48, __u64,  addr3);
+	BUILD_BUG_SQE_ELEM(28, __u32, fsync_flags);
+	BUILD_BUG_SQE_ELEM(28, /* compat */ __u16, poll_events);
+	BUILD_BUG_SQE_ELEM(28, __u32, poll32_events);
+	BUILD_BUG_SQE_ELEM(28, __u32, sync_range_flags);
+	BUILD_BUG_SQE_ELEM(28, __u32, msg_flags);
+	BUILD_BUG_SQE_ELEM(28, __u32, timeout_flags);
+	BUILD_BUG_SQE_ELEM(28, __u32, accept_flags);
+	BUILD_BUG_SQE_ELEM(28, __u32, cancel_flags);
+	BUILD_BUG_SQE_ELEM(28, __u32, open_flags);
+	BUILD_BUG_SQE_ELEM(28, __u32, statx_flags);
+	BUILD_BUG_SQE_ELEM(28, __u32, fadvise_advice);
+	BUILD_BUG_SQE_ELEM(28, __u32, splice_flags);
+	BUILD_BUG_SQE_ELEM(28, __u32, rename_flags);
+	BUILD_BUG_SQE_ELEM(28, __u32, unlink_flags);
+	BUILD_BUG_SQE_ELEM(28, __u32, hardlink_flags);
+	BUILD_BUG_SQE_ELEM(28, __u32, xattr_flags);
+	BUILD_BUG_SQE_ELEM(28, __u32, msg_ring_flags);
+	BUILD_BUG_SQE_ELEM(32, __u64, user_data);
+	BUILD_BUG_SQE_ELEM(40, __u16, buf_index);
+	BUILD_BUG_SQE_ELEM(40, __u16, buf_group);
+	BUILD_BUG_SQE_ELEM(42, __u16, personality);
+	BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in);
+	BUILD_BUG_SQE_ELEM(44, __u32, file_index);
+	BUILD_BUG_SQE_ELEM(44, __u16, addr_len);
+	BUILD_BUG_SQE_ELEM(46, __u16, __pad3[0]);
+	BUILD_BUG_SQE_ELEM(48, __u64, addr3);
 	BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd);
-	BUILD_BUG_SQE_ELEM(56, __u64,  __pad2);
+	BUILD_BUG_SQE_ELEM(56, __u64, __pad2);
 
 	BUILD_BUG_ON(sizeof(struct io_uring_files_update) !=
		     sizeof(struct io_uring_rsrc_update));
@@ -4160,6 +4208,46 @@ static int __init io_uring_init(void)
 	BUILD_BUG_ON(offsetof(struct io_uring_buf, resv) !=
		     offsetof(struct io_uring_buf_ring, tail));
 
+#ifdef CONFIG_COMPAT64
+#define BUILD_BUG_COMPAT_SQE_ELEM(eoffset, etype, ename) \
+	__BUILD_BUG_VERIFY_OFFSET_SIZE(struct compat_io_uring_sqe, eoffset, sizeof(etype), ename)
+#define BUILD_BUG_COMPAT_SQE_ELEM_SIZE(eoffset, esize, ename) \
+	__BUILD_BUG_VERIFY_OFFSET_SIZE(struct compat_io_uring_sqe, eoffset, esize, ename)
+	BUILD_BUG_ON(sizeof(struct compat_io_uring_sqe) != 64);
+	BUILD_BUG_COMPAT_SQE_ELEM(0, __u8, opcode);
+	BUILD_BUG_COMPAT_SQE_ELEM(1, __u8, flags);
+	BUILD_BUG_COMPAT_SQE_ELEM(2, __u16, ioprio);
+	BUILD_BUG_COMPAT_SQE_ELEM(4, __s32, fd);
+	BUILD_BUG_COMPAT_SQE_ELEM(8, __u64, off);
+	BUILD_BUG_COMPAT_SQE_ELEM(8, __u64, addr2);
+	BUILD_BUG_COMPAT_SQE_ELEM(8, __u32, cmd_op);
+	BUILD_BUG_COMPAT_SQE_ELEM(12, __u32, __pad1);
+	BUILD_BUG_COMPAT_SQE_ELEM(16, __u64, addr);
+	BUILD_BUG_COMPAT_SQE_ELEM(16, __u64, splice_off_in);
+	BUILD_BUG_COMPAT_SQE_ELEM(24, __u32, len);
+	BUILD_BUG_COMPAT_SQE_ELEM(28, __kernel_rwf_t, rw_flags);
+	BUILD_BUG_COMPAT_SQE_ELEM(32, __u64, user_data);
+	BUILD_BUG_COMPAT_SQE_ELEM(40, __u16, buf_index);
+	BUILD_BUG_COMPAT_SQE_ELEM(40, __u16, buf_group);
+	BUILD_BUG_COMPAT_SQE_ELEM(42, __u16, personality);
+	BUILD_BUG_COMPAT_SQE_ELEM(44, __s32, splice_fd_in);
+	BUILD_BUG_COMPAT_SQE_ELEM(44, __u32, file_index);
+	BUILD_BUG_COMPAT_SQE_ELEM(44, __u16, addr_len);
+	BUILD_BUG_COMPAT_SQE_ELEM(46, __u16, __pad3[0]);
+	BUILD_BUG_COMPAT_SQE_ELEM(48, __u64, addr3);
+	BUILD_BUG_COMPAT_SQE_ELEM_SIZE(48, 0, cmd);
+	BUILD_BUG_COMPAT_SQE_ELEM(56, __u64, __pad2);
+
+	BUILD_BUG_ON(sizeof(struct compat_io_uring_files_update) !=
+		     sizeof(struct compat_io_uring_rsrc_update));
+	BUILD_BUG_ON(sizeof(struct compat_io_uring_rsrc_update) >
+		     sizeof(struct compat_io_uring_rsrc_update2));
+
+	BUILD_BUG_ON(offsetof(struct compat_io_uring_buf_ring, bufs) != 0);
+	BUILD_BUG_ON(offsetof(struct compat_io_uring_buf, resv) !=
+		     offsetof(struct compat_io_uring_buf_ring, tail));
+#endif /* CONFIG_COMPAT64 */
+
 	/* should fit into one byte */
 	BUILD_BUG_ON(SQE_VALID_FLAGS >= (1 << 8));
 	BUILD_BUG_ON(SQE_COMMON_FLAGS >= (1 << 8));
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 50bc3af449534..b44ad558137be 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -5,6 +5,7 @@
 #include <linux/lockdep.h>
 #include <linux/io_uring_types.h>
 #include "io-wq.h"
+#include "uring_cmd.h"
 #include "slist.h"
 #include "filetable.h"
@@ -93,16 +94,69 @@ static inline void io_cq_lock(struct io_ring_ctx *ctx)
 
 void io_cq_unlock_post(struct io_ring_ctx *ctx);
 
+static inline bool is_compat64_io_ring_ctx(struct io_ring_ctx *ctx)
+{
+	return IS_ENABLED(CONFIG_COMPAT64) && ctx->compat;
+}
+
+static inline void convert_compat64_io_uring_sqe(struct io_ring_ctx *ctx,
+						 struct io_uring_sqe *sqe,
+						 const struct compat_io_uring_sqe *compat_sqe)
+{
+/*
+ * The struct io_uring_sqe contains anonymous unions and there is no field
+ * keeping track of which union's member is active. Because in all the cases,
+ * the unions are between integral types and the types are compatible, use the
+ * largest member of each union to perform the copy. Use this compile-time
+ * check to ensure that the union's members are not truncated during the
+ * conversion.
+ */
+#define BUILD_BUG_COMPAT_SQE_UNION_ELEM(elem1, elem2) \
+	BUILD_BUG_ON(sizeof_field(struct compat_io_uring_sqe, elem1) != \
+		     (offsetof(struct compat_io_uring_sqe, elem2) - \
+		      offsetof(struct compat_io_uring_sqe, elem1)))
+
+	sqe->opcode = READ_ONCE(compat_sqe->opcode);
+	sqe->flags = READ_ONCE(compat_sqe->flags);
+	sqe->ioprio = READ_ONCE(compat_sqe->ioprio);
+	sqe->fd = READ_ONCE(compat_sqe->fd);
+	BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr2, addr);
+	sqe->addr2 = READ_ONCE(compat_sqe->addr2);
+	BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr, len);
+	sqe->addr = READ_ONCE(compat_sqe->addr);
+	sqe->len = READ_ONCE(compat_sqe->len);
+	BUILD_BUG_COMPAT_SQE_UNION_ELEM(rw_flags, user_data);
+	sqe->rw_flags = READ_ONCE(compat_sqe->rw_flags);
+	sqe->user_data = READ_ONCE(compat_sqe->user_data);
+	BUILD_BUG_COMPAT_SQE_UNION_ELEM(buf_index, personality);
+	sqe->buf_index = READ_ONCE(compat_sqe->buf_index);
+	sqe->personality = READ_ONCE(compat_sqe->personality);
+	BUILD_BUG_COMPAT_SQE_UNION_ELEM(splice_fd_in, addr3);
+	sqe->splice_fd_in = READ_ONCE(compat_sqe->splice_fd_in);
+	if (sqe->opcode == IORING_OP_URING_CMD) {
+		size_t compat_cmd_size = compat_uring_cmd_pdu_size(ctx->flags &
+								   IORING_SETUP_SQE128);
+
+		memcpy(sqe->cmd, compat_sqe->cmd, compat_cmd_size);
+	} else {
+		sqe->addr3 = READ_ONCE(compat_sqe->addr3);
+		sqe->__pad2[0] = READ_ONCE(compat_sqe->__pad2[0]);
+	}
+#undef BUILD_BUG_COMPAT_SQE_UNION_ELEM
+}
+
 static inline struct io_uring_cqe *io_get_cqe_overflow(struct io_ring_ctx *ctx,
						       bool overflow)
 {
 	if (likely(ctx->cqe_cached < ctx->cqe_sentinel)) {
 		struct io_uring_cqe *cqe = ctx->cqe_cached;
+		size_t cqe_size = ctx->compat ?
+				  sizeof(struct compat_io_uring_cqe) :
+				  sizeof(struct io_uring_cqe);
 
 		ctx->cached_cq_tail++;
-		ctx->cqe_cached++;
+		ctx->cqe_cached += cqe_size;
 		if (ctx->flags & IORING_SETUP_CQE32)
-			ctx->cqe_cached++;
+			ctx->cqe_cached += cqe_size;
 		return cqe;
 	}
 
@@ -114,10 +168,40 @@ static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
 	return io_get_cqe_overflow(ctx, false);
 }
 
+static inline void __io_fill_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe *cqe,
+				 u64 user_data, s32 res, u32 cflags,
+				 u64 extra1, u64 extra2)
+{
+	if (is_compat64_io_ring_ctx(ctx)) {
+		struct compat_io_uring_cqe *compat_cqe = (struct compat_io_uring_cqe *)cqe;
+
+		WRITE_ONCE(compat_cqe->user_data, user_data);
+		WRITE_ONCE(compat_cqe->res, res);
+		WRITE_ONCE(compat_cqe->flags, cflags);
+
+		if (ctx->flags & IORING_SETUP_CQE32) {
+			WRITE_ONCE(compat_cqe->big_cqe[0], extra1);
+			WRITE_ONCE(compat_cqe->big_cqe[1], extra2);
+		}
+		return;
+	}
+
+	WRITE_ONCE(cqe->user_data, user_data);
+	WRITE_ONCE(cqe->res, res);
+	WRITE_ONCE(cqe->flags, cflags);
+
+	if (ctx->flags & IORING_SETUP_CQE32) {
+		WRITE_ONCE(cqe->big_cqe[0], extra1);
+		WRITE_ONCE(cqe->big_cqe[1], extra2);
+	}
+}
+
 static inline bool __io_fill_cqe_req(struct io_ring_ctx *ctx,
				     struct io_kiocb *req)
 {
 	struct io_uring_cqe *cqe;
+	u64 extra1 = 0;
+	u64 extra2 = 0;
 
 	/*
	 * If we can't get a cq entry, userspace overflowed the
@@ -128,24 +212,17 @@ static inline bool __io_fill_cqe_req(struct io_ring_ctx *ctx,
 	if (unlikely(!cqe))
 		return io_req_cqe_overflow(req);
 
+	if (ctx->flags & IORING_SETUP_CQE32 && req->flags & REQ_F_CQE32_INIT) {
+		extra1 = req->extra1;
+		extra2 = req->extra2;
+	}
+
 	trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
				req->cqe.res, req->cqe.flags,
-				(req->flags & REQ_F_CQE32_INIT) ? req->extra1 : 0,
-				(req->flags & REQ_F_CQE32_INIT) ? req->extra2 : 0);
+				extra1, extra2);
 
-	memcpy(cqe, &req->cqe, sizeof(*cqe));
-
-	if (ctx->flags & IORING_SETUP_CQE32) {
-		u64 extra1 = 0, extra2 = 0;
-
-		if (req->flags & REQ_F_CQE32_INIT) {
-			extra1 = req->extra1;
-			extra2 = req->extra2;
-		}
-
-		WRITE_ONCE(cqe->big_cqe[0], extra1);
-		WRITE_ONCE(cqe->big_cqe[1], extra2);
-	}
+	__io_fill_cqe(ctx, cqe, req->cqe.user_data, req->cqe.res,
+		      req->cqe.flags, extra1, extra2);
 	return true;
 }
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 182e594b56c6e..b388592e67df9 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -16,6 +16,7 @@
 #include "kbuf.h"
 
 #define IO_BUFFER_LIST_BUF_PER_PAGE (PAGE_SIZE / sizeof(struct io_uring_buf))
+#define IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE (PAGE_SIZE / sizeof(struct compat_io_uring_buf))
 
 #define BGID_ARRAY	64
 
@@ -28,6 +29,30 @@ struct io_provide_buf {
 	__u16				bid;
 };
 
+static int get_compat64_io_uring_buf_reg(struct io_uring_buf_reg *reg,
+					 const void __user *user_reg)
+{
+	struct compat_io_uring_buf_reg compat_reg;
+
+	if (copy_from_user(&compat_reg, user_reg, sizeof(compat_reg)))
+		return -EFAULT;
+	reg->ring_addr = compat_reg.ring_addr;
+	reg->ring_entries = compat_reg.ring_entries;
+	reg->bgid = compat_reg.bgid;
+	reg->pad = compat_reg.pad;
+	memcpy(reg->resv, compat_reg.resv, sizeof(reg->resv));
+	return 0;
+}
+
+static int copy_io_uring_buf_reg_from_user(struct io_ring_ctx *ctx,
+					   struct io_uring_buf_reg *reg,
+					   const void __user *arg)
+{
+	if (is_compat64_io_ring_ctx(ctx))
+		return get_compat64_io_uring_buf_reg(reg, arg);
+	return copy_from_user(reg, arg, sizeof(*reg));
+}
+
 static inline struct io_buffer_list *io_buffer_get_list(struct io_ring_ctx *ctx,
							unsigned int bgid)
 {
@@ -125,6 +150,35 @@ static void __user *io_provided_buffer_select(struct io_kiocb *req, size_t *len,
 	return NULL;
 }
 
+static void __user *io_ring_buffer_select_compat64(struct io_kiocb *req, size_t *len,
+						   struct io_buffer_list *bl,
+						   unsigned int issue_flags)
+{
+	struct compat_io_uring_buf_ring *br = bl->buf_ring_compat;
+	struct compat_io_uring_buf *buf;
+	__u16 head = bl->head;
+
+	if (unlikely(smp_load_acquire(&br->tail) == head))
+		return NULL;
+
+	head &= bl->mask;
+	if (head < IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE) {
+		buf = &br->bufs[head];
+	} else {
+		int off = head & (IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE - 1);
+		int index = head / IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE;
+		buf = page_address(bl->buf_pages[index]);
+		buf += off;
+	}
+	if (*len == 0 || *len > buf->len)
+		*len = buf->len;
+	req->flags |= REQ_F_BUFFER_RING;
+	req->buf_list = bl;
+	req->buf_index = buf->bid;
+
+	return compat_ptr(buf->addr);
+}
+
 static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len,
					  struct io_buffer_list *bl,
					  unsigned int issue_flags)
@@ -151,6 +205,23 @@ static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len,
 	req->buf_list = bl;
 	req->buf_index = buf->bid;
 
+	return u64_to_user_ptr(buf->addr);
+}
+
+static void __user *io_ring_buffer_select_any(struct io_kiocb *req, size_t *len,
+					      struct io_buffer_list *bl,
+					      unsigned int issue_flags)
+{
+	void __user *ret;
+
+	if (is_compat64_io_ring_ctx(req->ctx))
+		ret = io_ring_buffer_select_compat64(req, len, bl, issue_flags);
+	else
+		ret = io_ring_buffer_select(req, len, bl, issue_flags);
+
+	if (!ret)
+		return ret;
+
 	if (issue_flags & IO_URING_F_UNLOCKED || !file_can_poll(req->file)) {
 		/*
		 * If we came in unlocked, we have no choice but to consume the
@@ -165,7 +236,7 @@ static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len,
 		req->buf_list = NULL;
 		bl->head++;
 	}
-	return u64_to_user_ptr(buf->addr);
+	return ret;
 }
 
 void __user *io_buffer_select(struct io_kiocb *req, size_t *len,
@@ -180,7 +251,7 @@ void __user *io_buffer_select(struct io_kiocb *req, size_t *len,
 	bl = io_buffer_get_list(ctx, req->buf_index);
 	if (likely(bl)) {
 		if (bl->buf_nr_pages)
-			ret = io_ring_buffer_select(req, len, bl, issue_flags);
+			ret = io_ring_buffer_select_any(req, len, bl, issue_flags);
 		else
 			ret = io_provided_buffer_select(req, len, bl);
 	}
@@ -215,9 +286,12 @@ static int __io_remove_buffers(struct io_ring_ctx *ctx,
 		return 0;
 
 	if (bl->buf_nr_pages) {
+		__u16 tail = ctx->compat ?
+			     bl->buf_ring_compat->tail :
+			     bl->buf_ring->tail;
 		int j;
 
-		i = bl->buf_ring->tail - bl->head;
+		i = tail - bl->head;
 		for (j = 0; j < bl->buf_nr_pages; j++)
 			unpin_user_page(bl->buf_pages[j]);
 		kvfree(bl->buf_pages);
@@ -469,13 +543,13 @@ int io_provide_buffers(struct io_kiocb *req, unsigned int issue_flags)
 
 int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 {
-	struct io_uring_buf_ring *br;
 	struct io_uring_buf_reg reg;
 	struct io_buffer_list *bl, *free_bl = NULL;
 	struct page **pages;
+	size_t pages_size;
 	int nr_pages;
 
-	if (copy_from_user(&reg, arg, sizeof(reg)))
+	if (copy_io_uring_buf_reg_from_user(ctx, &reg, arg))
 		return -EFAULT;
 
 	if (reg.pad || reg.resv[0] || reg.resv[1] || reg.resv[2])
@@ -508,19 +582,19 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 		return -ENOMEM;
 	}
 
-	pages = io_pin_pages(reg.ring_addr,
-			     size_mul(sizeof(struct io_uring_buf), reg.ring_entries),
-			     &nr_pages);
+	pages_size = ctx->compat ?
+		     size_mul(sizeof(struct compat_io_uring_buf), reg.ring_entries) :
+		     size_mul(sizeof(struct io_uring_buf), reg.ring_entries);
+	pages = io_pin_pages(reg.ring_addr, pages_size, &nr_pages);
 	if (IS_ERR(pages)) {
 		kfree(free_bl);
 		return PTR_ERR(pages);
 	}
 
-	br = page_address(pages[0]);
 	bl->buf_pages = pages;
 	bl->buf_nr_pages = nr_pages;
 	bl->nr_entries = reg.ring_entries;
-	bl->buf_ring = br;
+	bl->buf_ring = page_address(pages[0]);
 	bl->mask = reg.ring_entries - 1;
 	io_buffer_add_list(ctx, bl, reg.bgid);
 	return 0;
@@ -531,7 +605,7 @@ int io_unregister_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 	struct io_uring_buf_reg reg;
 	struct io_buffer_list *bl;
 
-	if (copy_from_user(&reg, arg, sizeof(reg)))
+	if (copy_io_uring_buf_reg_from_user(ctx, &reg, arg))
 		return -EFAULT;
 	if (reg.pad || reg.resv[0] || reg.resv[1] || reg.resv[2])
 		return -EINVAL;
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index c23e15d7d3caf..1aa5bbbc5d628 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -2,6 +2,7 @@
 #ifndef IOU_KBUF_H
 #define IOU_KBUF_H
 
+#include <linux/io_uring_types.h>
 #include <uapi/linux/io_uring.h>
 
 struct io_buffer_list {
@@ -13,7 +14,10 @@ struct io_buffer_list {
 		struct list_head buf_list;
 		struct {
 			struct page **buf_pages;
-			struct io_uring_buf_ring *buf_ring;
+			union {
+				struct io_uring_buf_ring *buf_ring;
+				struct compat_io_uring_buf_ring *buf_ring_compat;
+			};
 		};
 	};
 	__u16 bgid;
diff --git a/io_uring/net.c b/io_uring/net.c
index c586278858e7e..4c133bc6f9d1d 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -4,6 +4,7 @@
 #include <linux/file.h>
 #include <linux/slab.h>
 #include <linux/net.h>
+#include <linux/uio.h>
 #include <linux/compat.h>
 #include <net/compat.h>
 #include <linux/io_uring.h>
@@ -435,7 +436,9 @@ static int __io_recvmsg_copy_hdr(struct io_kiocb *req,
 		} else if (msg.msg_iovlen > 1) {
 			return -EINVAL;
 		} else {
-			if (copy_from_user(iomsg->fast_iov, msg.msg_iov, sizeof(*msg.msg_iov)))
+			void *iov = iovec_from_user(msg.msg_iov, 1, 1, iomsg->fast_iov,
						    req->ctx->compat);
+			if (IS_ERR(iov))
 				return -EFAULT;
 			sr->len = iomsg->fast_iov[0].iov_len;
 			iomsg->free_iov = NULL;
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 41e192de9e8a7..c65b99fb9264f 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -23,6 +23,89 @@ struct io_rsrc_update {
 	u32				offset;
 };
+static int get_compat64_io_uring_rsrc_update(struct io_uring_rsrc_update2 *up2, + const void __user *user_up) +{ + struct compat_io_uring_rsrc_update compat_up; + + if (copy_from_user(&compat_up, user_up, sizeof(compat_up))) + return -EFAULT; + up2->offset = compat_up.offset; + up2->resv = compat_up.resv; + up2->data = compat_up.data; + return 0; +} + +static int get_compat64_io_uring_rsrc_update2(struct io_uring_rsrc_update2 *up2, + const void __user *user_up2) +{ + struct compat_io_uring_rsrc_update2 compat_up2; + + if (copy_from_user(&compat_up2, user_up2, sizeof(compat_up2))) + return -EFAULT; + up2->offset = compat_up2.offset; + up2->resv = compat_up2.resv; + up2->data = compat_up2.data; + up2->tags = compat_up2.tags; + up2->nr = compat_up2.nr; + up2->resv2 = compat_up2.resv2; + return 0; +} + +static int get_compat64_io_uring_rsrc_register(struct io_uring_rsrc_register *rr, + const void __user *user_rr) +{ + struct compat_io_uring_rsrc_register compat_rr; + + if (copy_from_user(&compat_rr, user_rr, sizeof(compat_rr))) + return -EFAULT; + rr->nr = compat_rr.nr; + rr->flags = compat_rr.flags; + rr->resv2 = compat_rr.resv2; + rr->data = compat_rr.data; + rr->tags = compat_rr.tags; + return 0; +} + +static int copy_io_uring_rsrc_update_from_user(struct io_ring_ctx *ctx, + struct io_uring_rsrc_update2 *up2, + const void __user *arg) +{ + if (is_compat64_io_ring_ctx(ctx)) + return get_compat64_io_uring_rsrc_update(up2, arg); + return copy_from_user(up2, arg, sizeof(struct io_uring_rsrc_update)); +} + +static int copy_io_uring_rsrc_update2_from_user(struct io_ring_ctx *ctx, + struct io_uring_rsrc_update2 *up2, + const void __user *arg, + size_t size) +{ + if (is_compat64_io_ring_ctx(ctx)) { + if (size != sizeof(struct compat_io_uring_rsrc_update2)) + return -EINVAL; + return get_compat64_io_uring_rsrc_update2(up2, arg); + } + if (size != sizeof(*up2)) + return -EINVAL; + return copy_from_user(up2, arg, sizeof(*up2)); +} + +static int copy_io_uring_rsrc_register_from_user(struct io_ring_ctx *ctx, + struct io_uring_rsrc_register *rr, + const void __user *arg, + size_t size) +{ + if (is_compat64_io_ring_ctx(ctx)) { + if (size != sizeof(struct compat_io_uring_rsrc_register)) + return -EINVAL; + return get_compat64_io_uring_rsrc_register(rr, arg); + } + if (size != sizeof(*rr)) + return -EINVAL; + return copy_from_user(rr, arg, size); +} + static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov, struct io_mapped_ubuf **pimu, struct page **last_hpage); @@ -597,12 +680,14 @@ int io_register_files_update(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args) { struct io_uring_rsrc_update2 up; + int ret;
if (!nr_args) return -EINVAL; memset(&up, 0, sizeof(up)); - if (copy_from_user(&up, arg, sizeof(struct io_uring_rsrc_update))) - return -EFAULT; + ret = copy_io_uring_rsrc_update_from_user(ctx, &up, arg); + if (ret) + return ret; if (up.resv || up.resv2) return -EINVAL; return __io_register_rsrc_update(ctx, IORING_RSRC_FILE, &up, nr_args); @@ -612,11 +697,11 @@ int io_register_rsrc_update(struct io_ring_ctx *ctx, void __user *arg, unsigned size, unsigned type) { struct io_uring_rsrc_update2 up; + int ret;
- if (size != sizeof(up)) - return -EINVAL; - if (copy_from_user(&up, arg, sizeof(up))) - return -EFAULT; + ret = copy_io_uring_rsrc_update2_from_user(ctx, &up, arg, size); + if (ret) + return ret; if (!up.nr || up.resv || up.resv2) return -EINVAL; return __io_register_rsrc_update(ctx, type, &up, up.nr); @@ -626,14 +711,11 @@ __cold int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg, unsigned int size, unsigned int type) { struct io_uring_rsrc_register rr; + int ret;
- /* keep it extendible */ - if (size != sizeof(rr)) - return -EINVAL; - - memset(&rr, 0, sizeof(rr)); - if (copy_from_user(&rr, arg, size)) - return -EFAULT; + ret = copy_io_uring_rsrc_register_from_user(ctx, &rr, arg, size); + if (ret) + return ret; if (!rr.nr || rr.resv2) return -EINVAL; if (rr.flags & ~IORING_RSRC_REGISTER_SPARSE) diff --git a/io_uring/tctx.c b/io_uring/tctx.c index 96f77450cf4e2..e69e8d7ba36c0 100644 --- a/io_uring/tctx.c +++ b/io_uring/tctx.c @@ -12,6 +12,28 @@ #include "io_uring.h" #include "tctx.h"
+static int get_compat64_io_uring_rsrc_update(struct io_uring_rsrc_update *up,
+					     const void __user *user_up)
+{
+	struct compat_io_uring_rsrc_update compat_up;
+
+	if (copy_from_user(&compat_up, user_up, sizeof(compat_up)))
+		return -EFAULT;
+	up->offset = compat_up.offset;
+	up->resv = compat_up.resv;
+	up->data = compat_up.data;
+	return 0;
+}
+
+static int copy_io_uring_rsrc_update_from_user(struct io_ring_ctx *ctx,
+					       struct io_uring_rsrc_update *up,
+					       const void __user *arg)
+{
+	if (is_compat64_io_ring_ctx(ctx))
+		return get_compat64_io_uring_rsrc_update(up, arg);
+	return copy_from_user(up, arg, sizeof(struct io_uring_rsrc_update));
+}
+
 static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
 					struct task_struct *task)
 {
@@ -233,6 +255,16 @@ static int io_ring_add_registered_fd(struct io_uring_task *tctx, int fd,
 	return -EBUSY;
 }

+static void __user *get_ith_io_uring_rsrc_update(struct io_ring_ctx *ctx,
+						 void __user *__arg,
+						 int i)
+{
+	if (is_compat64_io_ring_ctx(ctx))
+		return &((struct compat_io_uring_rsrc_update __user *)__arg)[i];
+	else
+		return &((struct io_uring_rsrc_update __user *)__arg)[i];
+}
+
 /*
  * Register a ring fd to avoid fdget/fdput for each io_uring_enter()
  * invocation. User passes in an array of struct io_uring_rsrc_update
@@ -244,8 +276,6 @@ static int io_ring_add_registered_fd(struct io_uring_task *tctx, int fd,
 int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg,
 		       unsigned nr_args)
 {
-	struct io_uring_rsrc_update __user *arg = __arg;
-	struct io_uring_rsrc_update reg;
 	struct io_uring_task *tctx;
 	int ret, i;

@@ -260,9 +290,14 @@ int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg,

 	tctx = current->io_uring;
 	for (i = 0; i < nr_args; i++) {
+		void __user *arg;
+		__u32 __user *arg_offset;
+		struct io_uring_rsrc_update reg;
 		int start, end;

-		if (copy_from_user(&reg, &arg[i], sizeof(reg))) {
+		arg = get_ith_io_uring_rsrc_update(ctx, __arg, i);
+
+		if (copy_io_uring_rsrc_update_from_user(ctx, &reg, arg)) {
 			ret = -EFAULT;
 			break;
 		}
@@ -289,7 +324,10 @@ int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg,
 			break;

 		reg.offset = ret;
-		if (put_user(reg.offset, &arg[i].offset)) {
+		arg_offset = ctx->compat ?
+			&((struct compat_io_uring_rsrc_update __user *)arg)->offset :
+			&((struct io_uring_rsrc_update __user *)arg)->offset;
+		if (put_user(reg.offset, arg_offset)) {
 			fput(tctx->registered_rings[reg.offset]);
 			tctx->registered_rings[reg.offset] = NULL;
 			ret = -EFAULT;
@@ -303,9 +341,7 @@ int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg,
 int io_ringfd_unregister(struct io_ring_ctx *ctx, void __user *__arg,
 			 unsigned nr_args)
 {
-	struct io_uring_rsrc_update __user *arg = __arg;
 	struct io_uring_task *tctx = current->io_uring;
-	struct io_uring_rsrc_update reg;
 	int ret = 0, i;

 	if (!nr_args || nr_args > IO_RINGFD_REG_MAX)
@@ -314,10 +350,16 @@ int io_ringfd_unregister(struct io_ring_ctx *ctx, void __user *__arg,
 		return 0;

 	for (i = 0; i < nr_args; i++) {
-		if (copy_from_user(&reg, &arg[i], sizeof(reg))) {
+		void __user *arg;
+		struct io_uring_rsrc_update reg;
+
+		arg = get_ith_io_uring_rsrc_update(ctx, __arg, i);
+
+		if (copy_io_uring_rsrc_update_from_user(ctx, &reg, arg)) {
 			ret = -EFAULT;
 			break;
 		}
+
 		if (reg.resv || reg.data || reg.offset >= IO_RINGFD_REG_MAX) {
 			ret = -EINVAL;
 			break;

diff --git a/io_uring/uring_cmd.h b/io_uring/uring_cmd.h
index 7c6697d13cb2e..d7e577b39625b 100644
--- a/io_uring/uring_cmd.h
+++ b/io_uring/uring_cmd.h
@@ -11,3 +11,8 @@ int io_uring_cmd_prep_async(struct io_kiocb *req);
 #define uring_cmd_pdu_size(is_sqe128) \
 	((1 + !!(is_sqe128)) * sizeof(struct io_uring_sqe) - \
 		offsetof(struct io_uring_sqe, cmd))
+
+#define compat_uring_cmd_pdu_size(is_sqe128) \
+	((1 + !!(is_sqe128)) * sizeof(struct compat_io_uring_sqe) - \
+		offsetof(struct compat_io_uring_sqe, cmd))
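As a sanity check on the new macro (the assertions below are illustrative only, not part of the patch): with the 64-bit compat layout, sizeof(struct compat_io_uring_sqe) is 64 and cmd sits at offset 48, so the macro yields the familiar sizes.

	/* Illustrative only: expected compat PDU sizes */
	BUILD_BUG_ON(compat_uring_cmd_pdu_size(false) != 16);	/* 1 * 64 - 48 */
	BUILD_BUG_ON(compat_uring_cmd_pdu_size(true) != 80);	/* 2 * 64 - 48 */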
On 16/03/2023 14:40, Tudor Cretu wrote:
Introduce compat versions of the structs exposed in the uAPI headers that might contain pointers as a member. Also, implement functions that convert the compat versions to the native versions of the struct.
A subsequent patch is going to change the io_uring structs to enable them to support new architectures. On such architectures, the current struct layout still needs to be supported for compat tasks.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 include/linux/io_uring_types.h | 135 ++++++++++++++++++-
 io_uring/cancel.c              |  26 +++-
 io_uring/epoll.c               |   2 +-
 io_uring/fdinfo.c              |  80 ++++++-----
 io_uring/io_uring.c            | 234 +++++++++++++++++++++++----------
 io_uring/io_uring.h            | 111 +++++++++++++---
 io_uring/kbuf.c                |  96 ++++++++++++--
 io_uring/kbuf.h                |   6 +-
 io_uring/net.c                 |   5 +-
 io_uring/rsrc.c                | 108 +++++++++++++--
 io_uring/tctx.c                |  56 +++++++-
 io_uring/uring_cmd.h           |   5 +
 12 files changed, 704 insertions(+), 160 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 440179029a8f0..f0eb34ad8b709 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -7,6 +7,127 @@
 #include <linux/llist.h>
 #include <uapi/linux/io_uring.h>

+struct compat_io_uring_sqe {
+	__u8	opcode;
+	__u8	flags;
+	__u16	ioprio;
+	__s32	fd;
+	union {
+		__u64	off;
+		__u64	addr2;
+		struct {
+			__u32	cmd_op;
+			__u32	__pad1;
+		};
+	};
+	union {
+		__u64	addr;
+		__u64	splice_off_in;
+	};
+	__u32	len;
+	/* This member is actually a union in the native struct */
+	__kernel_rwf_t	rw_flags;
+	__u64	user_data;
+	union {
+		__u16	buf_index;
+		__u16	buf_group;
+	} __packed;
+	__u16	personality;
+	union {
+		__s32	splice_fd_in;
+		__u32	file_index;
+		struct {
+			__u16	addr_len;
+			__u16	__pad3[1];
+		};
+	};
+	union {
+		struct {
+			__u64	addr3;
+			__u64	__pad2[1];
+		};
+		__u8	cmd[0];
+	};
+};
+
+struct compat_io_uring_cqe {
+	__u64	user_data;
+	__s32	res;
+	__u32	flags;
+	__u64	big_cqe[];
+};
+
+struct compat_io_uring_files_update {
+	__u32	offset;
+	__u32	resv;
+	__aligned_u64	fds;
+};
+
+struct compat_io_uring_rsrc_register {
+	__u32	nr;
+	__u32	flags;
+	__u64	resv2;
+	__aligned_u64	data;
+	__aligned_u64	tags;
+};
+
+struct compat_io_uring_rsrc_update {
+	__u32	offset;
+	__u32	resv;
+	__aligned_u64	data;
+};
+
+struct compat_io_uring_rsrc_update2 {
+	__u32	offset;
+	__u32	resv;
+	__aligned_u64	data;
+	__aligned_u64	tags;
+	__u32	nr;
+	__u32	resv2;
+};
+
+struct compat_io_uring_buf {
+	__u64	addr;
+	__u32	len;
+	__u16	bid;
+	__u16	resv;
+};
+
+struct compat_io_uring_buf_ring {
+	union {
+		struct {
+			__u64	resv1;
+			__u32	resv2;
+			__u16	resv3;
+			__u16	tail;
+		};
+		struct compat_io_uring_buf	bufs[0];
+	};
+};
+
+struct compat_io_uring_buf_reg {
+	__u64	ring_addr;
+	__u32	ring_entries;
+	__u16	bgid;
+	__u16	pad;
+	__u64	resv[3];
+};
+
+struct compat_io_uring_getevents_arg {
+	__u64	sigmask;
+	__u32	sigmask_sz;
+	__u32	pad;
+	__u64	ts;
+};
+
+struct compat_io_uring_sync_cancel_reg {
+	__u64	addr;
+	__s32	fd;
+	__u32	flags;
+	struct __kernel_timespec	timeout;
+	__u64	pad[4];
+};
Just thinking about this now: it might make more sense to have these structs at the end of the file, rather than before the (common) native ones. Or maybe it would be even better to move them to a new file, say <linux/io_uring_compat.h>, that can be included from this file. That would make conflicts less likely and avoid adding a lot of structs to io_uring_types.h.
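For concreteness, a minimal sketch of what the suggested header could look like (the file name comes from the suggestion above; the guard name is an assumption, and the struct bodies are the ones quoted above):

/* Hypothetical include/linux/io_uring_compat.h */
#ifndef _LINUX_IO_URING_COMPAT_H
#define _LINUX_IO_URING_COMPAT_H

#include <linux/types.h>
#include <linux/time_types.h>

struct compat_io_uring_cqe {
	__u64	user_data;
	__s32	res;
	__u32	flags;
	__u64	big_cqe[];
};

/* ... the remaining compat_io_uring_* structs move here unchanged ... */

#endif /* _LINUX_IO_URING_COMPAT_H */

io_uring_types.h would then just gain a single #include <linux/io_uring_compat.h>.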
[...]
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index bc8c9d764bc13..c724e6c544809 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -88,45 +88,64 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx,
 	sq_entries = min(sq_tail - sq_head, ctx->sq_entries);
 	for (i = 0; i < sq_entries; i++) {
 		unsigned int entry = i + sq_head;
-		struct io_uring_sqe *sqe;
-		unsigned int sq_idx;
+		unsigned int sq_idx, sq_off;

 		sq_idx = READ_ONCE(ctx->sq_array[entry & sq_mask]);
 		if (sq_idx > sq_mask)
 			continue;
-		sqe = &ctx->sq_sqes[sq_idx << sq_shift];
-		seq_printf(m, "%5u: opcode:%s, fd:%d, flags:%x, off:%llu, "
-			      "addr:0x%llx, rw_flags:0x%x, buf_index:%d "
-			      "user_data:%llu",
-			   sq_idx, io_uring_get_opcode(sqe->opcode), sqe->fd,
-			   sqe->flags, (unsigned long long) sqe->off,
-			   (unsigned long long) sqe->addr, sqe->rw_flags,
-			   sqe->buf_index, sqe->user_data);
-		if (sq_shift) {
-			u64 *sqeb = (void *) (sqe + 1);
-			int size = sizeof(struct io_uring_sqe) / sizeof(u64);
-			int j;
-
-			for (j = 0; j < size; j++) {
-				seq_printf(m, ", e%d:0x%llx", j,
-						(unsigned long long) *sqeb);
-				sqeb++;
-			}
-		}
+		sq_off = sq_idx << sq_shift;
+#define print_sqe(sqe)							\
It's unusual to define macros in the middle of a function. I can see why it could make sense here, but I think it's still more readable to define them normally, before the function. The number of arguments remains reasonable if you do it this way (AFAICT you just need to add m, sq_idx and sq_shift); see the sketch after the quoted hunk below.

Also, a minor thing: some of the \ are not aligned.
+do {									\
+	seq_printf(m, "%5u: opcode:%s, fd:%d, flags:%x, off:%llu, "	\
+		      "addr:0x%llx, rw_flags:0x%x, buf_index:%d "	\
+		      "user_data:%llu",					\
+		   sq_idx, io_uring_get_opcode((sqe)->opcode), (sqe)->fd,	\
+		   (sqe)->flags, (unsigned long long) (sqe)->off,	\
+		   (unsigned long long) (sqe)->addr, (sqe)->rw_flags,	\
+		   (sqe)->buf_index, (sqe)->user_data);			\
+	if (sq_shift) {							\
+		u64 *sqeb = (void *) ((sqe) + 1);			\
+		int size = sizeof(*(sqe)) / sizeof(u64);		\
+		int j;							\
+									\
+		for (j = 0; j < size; j++) {				\
+			seq_printf(m, ", e%d:0x%llx", j,		\
+					(unsigned long long) *sqeb);	\
+			sqeb++;						\
+		}							\
+	}								\
+} while (0)
+		if (is_compat64_io_ring_ctx(ctx))
+			print_sqe(&ctx->sq_sqes_compat[sq_off]);
+		else
+			print_sqe(&ctx->sq_sqes[sq_off]);
+#undef print_sqe
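As a rough illustration of the file-scope alternative suggested above (the parameter list is an assumption; the body is the quoted one, minus the SQE128 tail dump for brevity):

/* Sketch: print_sqe defined before __io_uring_show_fdinfo(), taking the
 * seq_file and loop state explicitly instead of capturing them. */
#define print_sqe(m, sqe, sq_idx, sq_shift)				\
do {									\
	seq_printf(m, "%5u: opcode:%s, fd:%d, flags:%x, off:%llu, "	\
		      "addr:0x%llx, rw_flags:0x%x, buf_index:%d "	\
		      "user_data:%llu",					\
		   sq_idx, io_uring_get_opcode((sqe)->opcode), (sqe)->fd,	\
		   (sqe)->flags, (unsigned long long) (sqe)->off,	\
		   (unsigned long long) (sqe)->addr, (sqe)->rw_flags,	\
		   (sqe)->buf_index, (sqe)->user_data);			\
	/* SQE128 extra-data dump elided; same as the quoted version */	\
} while (0)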
 		seq_printf(m, "\n");
 	}

 	seq_printf(m, "CQEs:\t%u\n", cq_tail - cq_head);
 	cq_entries = min(cq_tail - cq_head, ctx->cq_entries);
 	for (i = 0; i < cq_entries; i++) {
 		unsigned int entry = i + cq_head;
-		struct io_uring_cqe *cqe = &ctx->cqes[(entry & cq_mask) << cq_shift];
-
-		seq_printf(m, "%5u: user_data:%llu, res:%d, flag:%x",
-			   entry & cq_mask, cqe->user_data, cqe->res,
-			   cqe->flags);
-		if (cq_shift)
-			seq_printf(m, ", extra1:%llu, extra2:%llu\n",
-				   cqe->big_cqe[0], cqe->big_cqe[1]);
+		unsigned int cq_off = (entry & cq_mask) << cq_shift;
+#define print_cqe(cqe)							\
+do {									\
+	seq_printf(m, "%5u: user_data:%llu, res:%d, flag:%x",		\
+		   entry & cq_mask, (cqe)->user_data, (cqe)->res,	\
+		   (cqe)->flags);					\
+	if (cq_shift)							\
+		seq_printf(m, ", extra1:%llu, extra2:%llu\n",		\
+			   (cqe)->big_cqe[0], (cqe)->big_cqe[1]);	\
+} while (0)
+		if (is_compat64_io_ring_ctx(ctx))
+			print_cqe((struct compat_io_uring_cqe *)&ctx->cqes_compat[cq_off]);
I don't think you need the casts.
+		else
+			print_cqe((struct io_uring_cqe *)&ctx->cqes[cq_off]);
+#undef print_cqe
 		seq_printf(m, "\n");
 	}
@@ -191,8 +210,7 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx,
 		struct io_uring_cqe *cqe = &ocqe->cqe;

 		seq_printf(m, "  user_data=%llu, res=%d, flags=%x\n",
-			   cqe->user_data, cqe->res, cqe->flags);
+			   (cqe)->user_data, (cqe)->res, (cqe)->flags);
Looks like a leftover; I guess you don't need to change this any more, as struct io_overflow_cqe is now always native?
 	}
 	spin_unlock(&ctx->completion_lock);

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 707229ae04dc8..3f0e005481f3f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -152,6 +152,35 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx);

 static struct kmem_cache *req_cachep;

+static int get_compat64_io_uring_getevents_arg(struct io_uring_getevents_arg *arg,
+					       const void __user *user_arg)
+{
+	struct compat_io_uring_getevents_arg compat_arg;
+
+	if (copy_from_user(&compat_arg, user_arg, sizeof(compat_arg)))
+		return -EFAULT;
+	arg->sigmask = compat_arg.sigmask;
+	arg->sigmask_sz = compat_arg.sigmask_sz;
+	arg->pad = compat_arg.pad;
+	arg->ts = compat_arg.ts;
+	return 0;
+}
+
+static int copy_io_uring_getevents_arg_from_user(struct io_ring_ctx *ctx,
+						 struct io_uring_getevents_arg *arg,
+						 const void __user *argp,
+						 size_t size)
+{
+	if (is_compat64_io_ring_ctx(ctx)) {
+		if (size != sizeof(struct compat_io_uring_getevents_arg))
+			return -EINVAL;
+		return get_compat64_io_uring_getevents_arg(arg, argp);
+	}
+	if (size != sizeof(*arg))
+		return -EINVAL;
+	return copy_from_user(arg, argp, sizeof(*arg));
+}
+
 struct sock *io_uring_get_socket(struct file *file)
 {
 #if defined(CONFIG_UNIX)
@@ -604,14 +633,10 @@ void io_cq_unlock_post(struct io_ring_ctx *ctx)
 static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 {
 	bool all_flushed;
-	size_t cqe_size = sizeof(struct io_uring_cqe);

 	if (!force && __io_cqring_events(ctx) == ctx->cq_entries)
 		return false;

-	if (ctx->flags & IORING_SETUP_CQE32)
-		cqe_size <<= 1;
-
 	io_cq_lock(ctx);
 	while (!list_empty(&ctx->cq_overflow_list)) {
 		struct io_uring_cqe *cqe = io_get_cqe_overflow(ctx, true);
@@ -621,9 +646,18 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 			break;
 		ocqe = list_first_entry(&ctx->cq_overflow_list,
 					struct io_overflow_cqe, list);
-		if (cqe)
-			memcpy(cqe, &ocqe->cqe, cqe_size);
-		else
+		if (cqe) {
+			u64 extra1 = 0;
+			u64 extra2 = 0;
+
+			if (ctx->flags & IORING_SETUP_CQE32) {
+				extra1 = ocqe->cqe.big_cqe[0];
+				extra2 = ocqe->cqe.big_cqe[1];
+			}
+
+			__io_fill_cqe(ctx, cqe, ocqe->cqe.user_data, ocqe->cqe.res,
+				      ocqe->cqe.flags, extra1, extra2);
+		} else
 			io_account_cq_overflow(ctx);

 		list_del(&ocqe->list);
@@ -745,6 +779,10 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 {
 	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
 	unsigned int free, queued, len;
+	size_t cqe_size = ctx->compat ?
It would be better to use is_compat64_io_ring_ctx() everywhere, now that we have a helper. This should leave the code entirely unmodified unless CONFIG_COMPAT64 is selected.
+			  sizeof(struct compat_io_uring_cqe) :
+			  sizeof(struct io_uring_cqe);
+	struct io_uring_cqe *cqe;

 	/*
 	 * Posting into the CQ when there are pending overflowed CQEs may break
@@ -767,14 +805,15 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 		len <<= 1;
 	}

-	ctx->cqe_cached = &ctx->cqes[off];
-	ctx->cqe_sentinel = ctx->cqe_cached + len;
+	cqe = ctx->compat ? (struct io_uring_cqe *)&ctx->cqes_compat[off] : &ctx->cqes[off];
+	ctx->cqe_cached = cqe;
+	ctx->cqe_sentinel = ctx->cqe_cached + len * cqe_size;
The cqe_cached / cqe_sentinel changes look fine, but I wonder if we wouldn't be better off changing things a little more in that case. Manipulating these as void * pointers is a little awkward and can be error-prone; for instance, on this line I thought "why not use cqe instead of ctx->cqe_cached", but that is definitely not equivalent, as they are pointers to different types.

An alternative would be to make cqe_cached / cqe_sentinel indices into cqes / cqes_compat. This way we don't need to compute cqe_size to update them, and the logic remains otherwise unaffected. It could be a separate patch to make it easier to understand. That's just an idea though; leaving it as-is is also fine by me.
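A minimal sketch of the index-based idea (the helper name is hypothetical; field types are assumed to become plain unsigned indices):

/* cqe_cached/cqe_sentinel as element indices rather than pointers */
static inline struct io_uring_cqe *io_cqe_at(struct io_ring_ctx *ctx,
					     unsigned int idx)
{
	return is_compat64_io_ring_ctx(ctx) ?
		(struct io_uring_cqe *)&ctx->cqes_compat[idx] :
		&ctx->cqes[idx];
}

__io_get_cqe() would then set ctx->cqe_cached = off and ctx->cqe_sentinel = off + len, and advance with plain increments, with no cqe_size arithmetic at all.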
 	ctx->cached_cq_tail++;
-	ctx->cqe_cached++;
+	ctx->cqe_cached += cqe_size;
 	if (ctx->flags & IORING_SETUP_CQE32)
-		ctx->cqe_cached++;
-	return &ctx->cqes[off];
+		ctx->cqe_cached += cqe_size;
+	return cqe;
 }

[...]
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 50bc3af449534..b44ad558137be 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -5,6 +5,7 @@
 #include <linux/lockdep.h>
 #include <linux/io_uring_types.h>
 #include "io-wq.h"
+#include "uring_cmd.h"
 #include "slist.h"
 #include "filetable.h"
@@ -93,16 +94,69 @@ static inline void io_cq_lock(struct io_ring_ctx *ctx)

 void io_cq_unlock_post(struct io_ring_ctx *ctx);

+static inline bool is_compat64_io_ring_ctx(struct io_ring_ctx *ctx)
I think it makes sense to follow the convention of having the full struct name in conversion helpers, but I'm less convinced here. It could be better to use the io_ prefix like most functions in this header, and pass struct io_ring_ctx without reflecting it in the function name. Maybe simply io_in_compat64(), reminiscent of in_compat_syscall()? That's also shorter, which is handy as we're using it a lot.
+{
+	return IS_ENABLED(CONFIG_COMPAT64) && ctx->compat;
+}
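With the rename, the helper would look like this (same body as quoted above, only the name changes):

static inline bool io_in_compat64(struct io_ring_ctx *ctx)
{
	return IS_ENABLED(CONFIG_COMPAT64) && ctx->compat;
}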
[...]
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 41e192de9e8a7..c65b99fb9264f 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -23,6 +23,89 @@ struct io_rsrc_update {
 	u32				offset;
 };

+static int get_compat64_io_uring_rsrc_update(struct io_uring_rsrc_update2 *up2,
+					     const void __user *user_up)
+{
+	struct compat_io_uring_rsrc_update compat_up;
+
+	if (copy_from_user(&compat_up, user_up, sizeof(compat_up)))
+		return -EFAULT;
+	up2->offset = compat_up.offset;
+	up2->resv = compat_up.resv;
+	up2->data = compat_up.data;
+	return 0;
+}
+
+static int get_compat64_io_uring_rsrc_update2(struct io_uring_rsrc_update2 *up2,
+					      const void __user *user_up2)
+{
+	struct compat_io_uring_rsrc_update2 compat_up2;
+
+	if (copy_from_user(&compat_up2, user_up2, sizeof(compat_up2)))
+		return -EFAULT;
+	up2->offset = compat_up2.offset;
+	up2->resv = compat_up2.resv;
+	up2->data = compat_up2.data;
+	up2->tags = compat_up2.tags;
+	up2->nr = compat_up2.nr;
+	up2->resv2 = compat_up2.resv2;
+	return 0;
+}
+
+static int get_compat64_io_uring_rsrc_register(struct io_uring_rsrc_register *rr,
+					       const void __user *user_rr)
+{
+	struct compat_io_uring_rsrc_register compat_rr;
+
+	if (copy_from_user(&compat_rr, user_rr, sizeof(compat_rr)))
+		return -EFAULT;
+	rr->nr = compat_rr.nr;
+	rr->flags = compat_rr.flags;
+	rr->resv2 = compat_rr.resv2;
+	rr->data = compat_rr.data;
+	rr->tags = compat_rr.tags;
+	return 0;
+}
+
+static int copy_io_uring_rsrc_update_from_user(struct io_ring_ctx *ctx,
+					       struct io_uring_rsrc_update2 *up2,
+					       const void __user *arg)
+{
+	if (is_compat64_io_ring_ctx(ctx))
+		return get_compat64_io_uring_rsrc_update(up2, arg);
+	return copy_from_user(up2, arg, sizeof(struct io_uring_rsrc_update));
I've just realised that this can be problematic. Since io_register_files_update() returns whatever this function returns if not 0, it effectively returns what copy_from_user() returns (and eventually that value will be returned by the io_uring_register handler). However, copy_from_user() doesn't always return 0 or an error. It can also return a positive value, the number of bytes left to copy, in case a fault happened before reaching the requested size. We would therefore end up returning that positive value to userspace, instead of the intended -EFAULT.
In most other contexts where a copy_io_uring_* helper is called, this doesn't matter as -EFAULT will be forced as return value at some point, but we should handle this consistently to avoid accidentally returning the value from copy_from_user() unmodified. Either all the copy_io_uring_* helpers should return -EFAULT explicitly, or all their (direct) callers should return -EFAULT explicitly. The former might be a little safer.
Kevin
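Following the first of those two options, each helper would squash the copy_from_user() result itself; a sketch of the pattern, applied to the helper quoted above:

static int copy_io_uring_rsrc_update_from_user(struct io_ring_ctx *ctx,
					       struct io_uring_rsrc_update2 *up2,
					       const void __user *arg)
{
	if (is_compat64_io_ring_ctx(ctx))
		return get_compat64_io_uring_rsrc_update(up2, arg);
	/* Never leak the "bytes not copied" count to callers */
	if (copy_from_user(up2, arg, sizeof(struct io_uring_rsrc_update)))
		return -EFAULT;
	return 0;
}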
On 28-03-2023 09:48, Kevin Brodsky wrote:
On 16/03/2023 14:40, Tudor Cretu wrote:
[...]

Just thinking about this now: it might make more sense to have these structs at the end of the file, rather than before the (common) native ones. Or maybe it would be even better to move them to a new file, say <linux/io_uring_compat.h>, that can be included from this file. That would make conflicts less likely and avoid adding a lot of structs to io_uring_types.h.
Indeed much better, done!
[...]

It's unusual to define macros in the middle of a function. I can see why it could make sense here, but I think it's still more readable to define them normally, before the function. The number of arguments remains reasonable if you do it this way (AFAICT you just need to add m, sq_idx and sq_shift).

Also, a minor thing: some of the \ are not aligned.
Sure!
[...]

I don't think you need the casts.
Yes, I've removed them from print_sqe() and I forgot them here...
[...]

Looks like a leftover; I guess you don't need to change this any more, as struct io_overflow_cqe is now always native?
That's right, thank you!
[...]

It would be better to use is_compat64_io_ring_ctx() everywhere, now that we have a helper. This should leave the code entirely unmodified unless CONFIG_COMPAT64 is selected.
Indeed, I missed quite a few of them...
[...]

The cqe_cached / cqe_sentinel changes look fine, but I wonder if we wouldn't be better off changing things a little more in that case. Manipulating these as void * pointers is a little awkward and can be error-prone; for instance, on this line I thought "why not use cqe instead of ctx->cqe_cached", but that is definitely not equivalent, as they are pointers to different types.

An alternative would be to make cqe_cached / cqe_sentinel indices into cqes / cqes_compat. This way we don't need to compute cqe_size to update them, and the logic remains otherwise unaffected. It could be a separate patch to make it easier to understand. That's just an idea though; leaving it as-is is also fine by me.
I've thought about that as well. Now it seems it's the obvious choice when you put it like that. Added the changes in a new patch.
[...]

I think it makes sense to follow the convention of having the full struct name in conversion helpers, but I'm less convinced here. It could be better to use the io_ prefix like most functions in this header, and pass struct io_ring_ctx without reflecting it in the function name. Maybe simply io_in_compat64(), reminiscent of in_compat_syscall()? That's also shorter, which is handy as we're using it a lot.
Indeed, done!
[...]

I've just realised that this can be problematic. Since io_register_files_update() returns whatever this function returns if not 0, it effectively returns what copy_from_user() returns (and eventually that value will be returned by the io_uring_register handler). However, copy_from_user() doesn't always return 0 or an error. It can also return a positive value, the number of bytes left to copy, in case a fault happened before reaching the requested size. We would therefore end up returning that positive value to userspace, instead of the intended -EFAULT.
Ah, good catch!
In most other contexts where a copy_io_uring_* helper is called, this doesn't matter as -EFAULT will be forced as return value at some point, but we should handle this consistently to avoid accidentally returning the value from copy_from_user() unmodified. Either all the copy_io_uring_* helpers should return -EFAULT explicitly, or all their (direct) callers should return -EFAULT explicitly. The former might be a little safer.
Alright, great suggestion! Thanks!
Kevin
Many thanks for the detailed review!
Tudor
The io_uring shared memory region hosts the io_uring_sqe and io_uring_cqe arrays. These structs may contain user pointers, so the memory region must be allowed to store and load capability pointers.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 io_uring/io_uring.c | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3f0e005481f3f..d4710672b4fc7 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3132,6 +3132,11 @@ static __cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
 	if (IS_ERR(ptr))
 		return PTR_ERR(ptr);

+#ifdef CONFIG_CHERI_PURECAP_UABI
+	vma->vm_flags |= VM_READ_CAPS | VM_WRITE_CAPS;
+	vma_set_page_prot(vma);
+#endif
+
 	pfn = virt_to_phys(ptr) >> PAGE_SHIFT;
 	return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);
 }
Some members of the io_uring uAPI structs may contain user pointers. In the PCuABI, a user pointer is a 129-bit capability, so the __u64 type is not big enough to hold it. Use the __kernel_uintptr_t type instead, which is big enough on the affected architectures while remaining 64-bit on others.
The user_data field must be passed unchanged from the submission queue to the completion queue. As it is standard practice to store a pointer in user_data, expand the field to __kernel_uintptr_t. However, the kernel doesn't dereference the user_data, so don't convert it in the compat case.
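For illustration, this is the userspace idiom being preserved (liburing-style; io_uring_sqe_set_data() stores a pointer in user_data, and the variable names ring, fd, buf and req_ctx are assumptions):

	/* Typical liburing usage: a request context pointer round-trips
	 * through user_data and comes back in the CQE. In PCuABI it is
	 * now a full capability rather than a truncated __u64. */
	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
	io_uring_sqe_set_data(sqe, req_ctx);
	io_uring_submit(&ring);

	struct io_uring_cqe *cqe;
	io_uring_wait_cqe(&ring, &cqe);
	struct my_req *done = io_uring_cqe_get_data(cqe);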
In addition, for the io_uring structs containing user pointers, use the special copy routines when copying user pointers from/to userspace.
In the case of operation IORING_OP_POLL_REMOVE, if IORING_POLL_UPDATE_USER_DATA is set in the SQE len field, then the request will update the user_data of an existing poll request based on the value passed in the addr2 field, instead of the off field. This is required because the off field is not large enough to fit a user_data value.
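The poll.c hunk itself is not quoted in this extract; roughly, assuming the mainline io_poll_remove_prep() shape and field names, the change amounts to:

	/* Sketch (field names assumed from mainline io_poll_update) */
	flags = READ_ONCE(sqe->len);
	upd->old_user_data = READ_ONCE(sqe->addr);
	upd->update_user_data = flags & IORING_POLL_UPDATE_USER_DATA;
	upd->new_user_data = READ_ONCE(sqe->addr2);	/* was READ_ONCE(sqe->off) */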
Note that the structs io_uring_sqe and io_uring_cqe are doubled in size in PCuABI. The setup flags IORING_SETUP_SQE128 and IORING_SETUP_CQE32 used to double the sizes of the two structs, up to 128 bytes and 32 bytes respectively. In PCuABI, the two flags still double the sizes of the two structs, but, as the base structs have grown, the doubled sizes are now 256 bytes and 64 bytes respectively.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 include/linux/io_uring_types.h  |  4 +-
 include/trace/events/io_uring.h | 46 ++++++++++----------
 include/uapi/linux/io_uring.h   | 76 ++++++++++++++++++---------------
 io_uring/advise.c               |  7 +--
 io_uring/cancel.c               |  6 +--
 io_uring/cancel.h               |  2 +-
 io_uring/epoll.c                |  2 +-
 io_uring/fdinfo.c               |  8 ++--
 io_uring/fs.c                   | 16 +++----
 io_uring/io_uring.c             | 62 +++++++++++++++++++++++----
 io_uring/io_uring.h             | 25 ++++++-----
 io_uring/kbuf.c                 | 19 +++++----
 io_uring/kbuf.h                 |  2 +-
 io_uring/msg_ring.c             |  4 +-
 io_uring/net.c                  | 20 ++++-----
 io_uring/openclose.c            |  4 +-
 io_uring/poll.c                 |  6 +--
 io_uring/rsrc.c                 | 44 +++++++++----------
 io_uring/rw.c                   | 18 ++++----
 io_uring/statx.c                |  4 +-
 io_uring/tctx.c                 |  4 +-
 io_uring/timeout.c              | 10 ++---
 io_uring/uring_cmd.c            |  5 +++
 io_uring/xattr.c                | 12 +++---
 24 files changed, 235 insertions(+), 171 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index f0eb34ad8b709..186504cfb2f9a 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -600,8 +600,8 @@ struct io_task_work {
 };
struct io_cqe { - __u64 user_data; - __s32 res; + __kernel_uintptr_t user_data; + __s32 res; /* fd initially, then cflags for completion */ union { __u32 flags; diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h index 936fd41bf147e..846e762d8a0ea 100644 --- a/include/trace/events/io_uring.h +++ b/include/trace/events/io_uring.h @@ -112,10 +112,10 @@ TRACE_EVENT(io_uring_file_get, TP_ARGS(req, fd),
TP_STRUCT__entry ( - __field( void *, ctx ) - __field( void *, req ) - __field( u64, user_data ) - __field( int, fd ) + __field( void *, ctx ) + __field( void *, req ) + __field( __kernel_uintptr_t, user_data ) + __field( int, fd ) ),
TP_fast_assign( @@ -146,7 +146,7 @@ TRACE_EVENT(io_uring_queue_async_work, TP_STRUCT__entry ( __field( void *, ctx ) __field( void *, req ) - __field( u64, user_data ) + __field( __kernel_uintptr_t, user_data ) __field( u8, opcode ) __field( unsigned int, flags ) __field( struct io_wq_work *, work ) @@ -190,7 +190,7 @@ TRACE_EVENT(io_uring_defer, TP_STRUCT__entry ( __field( void *, ctx ) __field( void *, req ) - __field( unsigned long long, data ) + __field( __kernel_uintptr_t, data ) __field( u8, opcode )
__string( op_str, io_uring_get_opcode(req->opcode) ) @@ -289,7 +289,7 @@ TRACE_EVENT(io_uring_fail_link, TP_STRUCT__entry ( __field( void *, ctx ) __field( void *, req ) - __field( unsigned long long, user_data ) + __field( __kernel_uintptr_t, user_data ) __field( u8, opcode ) __field( void *, link )
@@ -325,19 +325,19 @@ TRACE_EVENT(io_uring_fail_link, */ TRACE_EVENT(io_uring_complete,
- TP_PROTO(void *ctx, void *req, u64 user_data, int res, unsigned cflags, + TP_PROTO(void *ctx, void *req, __kernel_uintptr_t user_data, int res, unsigned cflags, u64 extra1, u64 extra2),
TP_ARGS(ctx, req, user_data, res, cflags, extra1, extra2),
TP_STRUCT__entry ( - __field( void *, ctx ) - __field( void *, req ) - __field( u64, user_data ) - __field( int, res ) - __field( unsigned, cflags ) - __field( u64, extra1 ) - __field( u64, extra2 ) + __field( void *, ctx ) + __field( void *, req ) + __field( __kernel_uintptr_t, user_data ) + __field( int, res ) + __field( unsigned, cflags ) + __field( u64, extra1 ) + __field( u64, extra2 ) ),
TP_fast_assign( @@ -377,7 +377,7 @@ TRACE_EVENT(io_uring_submit_sqe, TP_STRUCT__entry ( __field( void *, ctx ) __field( void *, req ) - __field( unsigned long long, user_data ) + __field( __kernel_uintptr_t, user_data ) __field( u8, opcode ) __field( u32, flags ) __field( bool, force_nonblock ) @@ -423,7 +423,7 @@ TRACE_EVENT(io_uring_poll_arm, TP_STRUCT__entry ( __field( void *, ctx ) __field( void *, req ) - __field( unsigned long long, user_data ) + __field( __kernel_uintptr_t, user_data ) __field( u8, opcode ) __field( int, mask ) __field( int, events ) @@ -464,7 +464,7 @@ TRACE_EVENT(io_uring_task_add, TP_STRUCT__entry ( __field( void *, ctx ) __field( void *, req ) - __field( unsigned long long, user_data ) + __field( __kernel_uintptr_t, user_data ) __field( u8, opcode ) __field( int, mask )
@@ -505,19 +505,19 @@ TRACE_EVENT(io_uring_req_failed, TP_STRUCT__entry ( __field( void *, ctx ) __field( void *, req ) - __field( unsigned long long, user_data ) + __field( __kernel_uintptr_t, user_data ) __field( u8, opcode ) __field( u8, flags ) __field( u8, ioprio ) __field( u64, off ) - __field( u64, addr ) + __field( __kernel_uintptr_t, addr ) __field( u32, len ) __field( u32, op_flags ) __field( u16, buf_index ) __field( u16, personality ) __field( u32, file_index ) __field( u64, pad1 ) - __field( u64, addr3 ) + __field( __kernel_uintptr_t, addr3 ) __field( int, error )
__string( op_str, io_uring_get_opcode(sqe->opcode) ) @@ -573,14 +573,14 @@ TRACE_EVENT(io_uring_req_failed, */ TRACE_EVENT(io_uring_cqe_overflow,
- TP_PROTO(void *ctx, unsigned long long user_data, s32 res, u32 cflags, + TP_PROTO(void *ctx, __kernel_uintptr_t user_data, s32 res, u32 cflags, void *ocqe),
TP_ARGS(ctx, user_data, res, cflags, ocqe),
TP_STRUCT__entry ( __field( void *, ctx ) - __field( unsigned long long, user_data ) + __field( __kernel_uintptr_t, user_data ) __field( s32, res ) __field( u32, cflags ) __field( void *, ocqe ) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 2df3225b562fa..121c9aef5ad00 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -11,6 +11,11 @@ #include <linux/fs.h> #include <linux/types.h> #include <linux/time_types.h> +#ifdef __KERNEL__ +#include <linux/stddef.h> /* for offsetof */ +#else +#include <stddef.h> /* for offsetof */ +#endif
#ifdef __cplusplus extern "C" { @@ -25,16 +30,16 @@ struct io_uring_sqe { __u16 ioprio; /* ioprio for the request */ __s32 fd; /* file descriptor to do IO on */ union { - __u64 off; /* offset into file */ - __u64 addr2; + __u64 off; /* offset into file */ + __kernel_uintptr_t addr2; struct { __u32 cmd_op; __u32 __pad1; }; }; union { - __u64 addr; /* pointer to buffer or iovecs */ - __u64 splice_off_in; + __kernel_uintptr_t addr; /* pointer to buffer or iovecs */ + __u64 splice_off_in; }; __u32 len; /* buffer size or number of iovecs */ union { @@ -58,7 +63,7 @@ struct io_uring_sqe { __u32 msg_ring_flags; __u32 uring_cmd_flags; }; - __u64 user_data; /* data to be passed back at completion time */ + __kernel_uintptr_t user_data; /* data to be passed back at completion time */ /* pack this to avoid bogus arm OABI complaints */ union { /* index into fixed buffers, if used */ @@ -78,12 +83,14 @@ struct io_uring_sqe { }; union { struct { - __u64 addr3; - __u64 __pad2[1]; + __kernel_uintptr_t addr3; + __kernel_uintptr_t __pad2[1]; }; /* * If the ring is initialized with IORING_SETUP_SQE128, then - * this field is used for 80 bytes of arbitrary command data + * this field is used to double the size of the + * struct io_uring_sqe to store bytes of arbitrary + * command data, i.e. 80 bytes or 160 bytes in PCuABI */ __u8 cmd[0]; }; @@ -326,13 +333,14 @@ enum { * IO completion data structure (Completion Queue Entry) */ struct io_uring_cqe { - __u64 user_data; /* sqe->data submission passed back */ - __s32 res; /* result code for this event */ - __u32 flags; + __kernel_uintptr_t user_data; /* sqe->data submission passed back */ + __s32 res; /* result code for this event */ + __u32 flags;
/* * If the ring is initialized with IORING_SETUP_CQE32, then this field - * contains 16-bytes of padding, doubling the size of the CQE. + * doubles the size of the CQE, i.e. contains 16 bytes, or in PCuABI, + * 32 bytes of padding. */ __u64 big_cqe[]; }; @@ -504,7 +512,7 @@ enum { struct io_uring_files_update { __u32 offset; __u32 resv; - __aligned_u64 /* __s32 * */ fds; + __kernel_aligned_uintptr_t /* __s32 * */ fds; };
/* @@ -517,21 +525,21 @@ struct io_uring_rsrc_register { __u32 nr; __u32 flags; __u64 resv2; - __aligned_u64 data; - __aligned_u64 tags; + __kernel_aligned_uintptr_t data; + __kernel_aligned_uintptr_t tags; };
struct io_uring_rsrc_update { __u32 offset; __u32 resv; - __aligned_u64 data; + __kernel_aligned_uintptr_t data; };
struct io_uring_rsrc_update2 { __u32 offset; __u32 resv; - __aligned_u64 data; - __aligned_u64 tags; + __kernel_aligned_uintptr_t data; + __kernel_aligned_uintptr_t tags; __u32 nr; __u32 resv2; }; @@ -581,10 +589,10 @@ struct io_uring_restriction { };
struct io_uring_buf { - __u64 addr; - __u32 len; - __u16 bid; - __u16 resv; + __kernel_uintptr_t addr; + __u32 len; + __u16 bid; + __u16 resv; };
struct io_uring_buf_ring { @@ -594,9 +602,7 @@ struct io_uring_buf_ring { * ring tail is overlaid with the io_uring_buf->resv field. */ struct { - __u64 resv1; - __u32 resv2; - __u16 resv3; + __u8 resv[offsetof(struct io_uring_buf, resv)]; __u16 tail; }; struct io_uring_buf bufs[0]; @@ -605,11 +611,11 @@ struct io_uring_buf_ring {
/* argument for IORING_(UN)REGISTER_PBUF_RING */ struct io_uring_buf_reg { - __u64 ring_addr; - __u32 ring_entries; - __u16 bgid; - __u16 pad; - __u64 resv[3]; + __kernel_uintptr_t ring_addr; + __u32 ring_entries; + __u16 bgid; + __u16 pad; + __u64 resv[3]; };
/* @@ -632,17 +638,17 @@ enum { };
struct io_uring_getevents_arg { - __u64 sigmask; - __u32 sigmask_sz; - __u32 pad; - __u64 ts; + __kernel_uintptr_t sigmask; + __u32 sigmask_sz; + __u32 pad; + __kernel_uintptr_t ts; };
/* * Argument for IORING_REGISTER_SYNC_CANCEL */ struct io_uring_sync_cancel_reg { - __u64 addr; + __kernel_uintptr_t addr; __s32 fd; __u32 flags; struct __kernel_timespec timeout; diff --git a/io_uring/advise.c b/io_uring/advise.c index 449c6f14649f7..05fd3bbaf8090 100644 --- a/io_uring/advise.c +++ b/io_uring/advise.c @@ -23,7 +23,7 @@ struct io_fadvise {
struct io_madvise { struct file *file; - u64 addr; + void __user *addr; u32 len; u32 advice; }; @@ -36,7 +36,7 @@ int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (sqe->buf_index || sqe->off || sqe->splice_fd_in) return -EINVAL;
- ma->addr = READ_ONCE(sqe->addr); + ma->addr = (void __user *)READ_ONCE(sqe->addr); ma->len = READ_ONCE(sqe->len); ma->advice = READ_ONCE(sqe->fadvise_advice); return 0; @@ -54,7 +54,8 @@ int io_madvise(struct io_kiocb *req, unsigned int issue_flags) if (issue_flags & IO_URING_F_NONBLOCK) return -EAGAIN;
- ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice); + /* TODO [PCuABI] - capability checks for uaccess */ + ret = do_madvise(current->mm, user_ptr_addr(ma->addr), ma->len, ma->advice); io_req_set_res(req, ret, 0); return IOU_OK; #else diff --git a/io_uring/cancel.c b/io_uring/cancel.c index 8382ea03fe899..dd642da52233f 100644 --- a/io_uring/cancel.c +++ b/io_uring/cancel.c @@ -19,7 +19,7 @@
struct io_cancel { struct file *file; - u64 addr; + __kernel_uintptr_t addr; u32 flags; s32 fd; }; @@ -34,7 +34,7 @@ static int get_compat64_io_uring_sync_cancel_reg(struct io_uring_sync_cancel_reg
if (copy_from_user(&compat_sc, user_sc, sizeof(compat_sc))) return -EFAULT; - sc->addr = compat_sc.addr; + sc->addr = (__kernel_uintptr_t)compat_sc.addr; sc->fd = compat_sc.fd; sc->flags = compat_sc.flags; sc->timeout = compat_sc.timeout; @@ -48,7 +48,7 @@ static int copy_io_uring_sync_cancel_reg_from_user(struct io_ring_ctx *ctx, { if (is_compat64_io_ring_ctx(ctx)) return get_compat64_io_uring_sync_cancel_reg(sc, arg); - return copy_from_user(sc, arg, sizeof(*sc)); + return copy_from_user_with_ptr(sc, arg, sizeof(*sc)); }
static bool io_cancel_cb(struct io_wq_work *work, void *data) diff --git a/io_uring/cancel.h b/io_uring/cancel.h index 6a59ee484d0cc..7c1249d61bf25 100644 --- a/io_uring/cancel.h +++ b/io_uring/cancel.h @@ -5,7 +5,7 @@ struct io_cancel_data { struct io_ring_ctx *ctx; union { - u64 data; + __kernel_uintptr_t data; struct file *file; }; u32 flags; diff --git a/io_uring/epoll.c b/io_uring/epoll.c index d5580ff465c3e..d9d5983f823c2 100644 --- a/io_uring/epoll.c +++ b/io_uring/epoll.c @@ -39,7 +39,7 @@ int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (ep_op_has_event(epoll->op)) { struct epoll_event __user *ev;
- ev = u64_to_user_ptr(READ_ONCE(sqe->addr)); + ev = (struct epoll_event __user *)READ_ONCE(sqe->addr); if (copy_epoll_event_from_user(&epoll->event, ev, req->ctx->compat)) return -EFAULT; } diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c index c724e6c544809..e5442e0ddbc8b 100644 --- a/io_uring/fdinfo.c +++ b/io_uring/fdinfo.c @@ -102,7 +102,7 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, sq_idx, io_uring_get_opcode((sqe)->opcode), (sqe)->fd, \ (sqe)->flags, (unsigned long long) (sqe)->off, \ (unsigned long long) (sqe)->addr, (sqe)->rw_flags, \ - (sqe)->buf_index, (sqe)->user_data); \ + (sqe)->buf_index, (unsigned long long)(sqe)->user_data); \ if (sq_shift) { \ u64 *sqeb = (void *) ((sqe) + 1); \ int size = sizeof(*(sqe)) / sizeof(u64); \ @@ -133,7 +133,8 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, #define print_cqe(cqe) \ do { \ seq_printf(m, "%5u: user_data:%llu, res:%d, flag:%x", \ - entry & cq_mask, (cqe)->user_data, (cqe)->res, \ + entry & cq_mask, \ + (unsigned long long) (cqe)->user_data, (cqe)->res, \ (cqe)->flags); \ if (cq_shift) \ seq_printf(m, ", extra1:%llu, extra2:%llu\n", \ @@ -210,7 +211,8 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct io_uring_cqe *cqe = &ocqe->cqe;
seq_printf(m, " user_data=%llu, res=%d, flags=%x\n", - (cqe)->user_data, (cqe)->res, (cqe)->flags); + (unsigned long long) cqe->user_data, cqe->res, + cqe->flags); }
spin_unlock(&ctx->completion_lock); diff --git a/io_uring/fs.c b/io_uring/fs.c index 7100c293c13a8..2e01e7da1d4ba 100644 --- a/io_uring/fs.c +++ b/io_uring/fs.c @@ -58,8 +58,8 @@ int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EBADF;
ren->old_dfd = READ_ONCE(sqe->fd); - oldf = u64_to_user_ptr(READ_ONCE(sqe->addr)); - newf = u64_to_user_ptr(READ_ONCE(sqe->addr2)); + oldf = (char __user *)READ_ONCE(sqe->addr); + newf = (char __user *)READ_ONCE(sqe->addr2); ren->new_dfd = READ_ONCE(sqe->len); ren->flags = READ_ONCE(sqe->rename_flags);
@@ -117,7 +117,7 @@ int io_unlinkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (un->flags & ~AT_REMOVEDIR) return -EINVAL;
- fname = u64_to_user_ptr(READ_ONCE(sqe->addr)); + fname = (char __user *)READ_ONCE(sqe->addr); un->filename = getname(fname); if (IS_ERR(un->filename)) return PTR_ERR(un->filename); @@ -164,7 +164,7 @@ int io_mkdirat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) mkd->dfd = READ_ONCE(sqe->fd); mkd->mode = READ_ONCE(sqe->len);
- fname = u64_to_user_ptr(READ_ONCE(sqe->addr)); + fname = (char __user *)READ_ONCE(sqe->addr); mkd->filename = getname(fname); if (IS_ERR(mkd->filename)) return PTR_ERR(mkd->filename); @@ -206,8 +206,8 @@ int io_symlinkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EBADF;
sl->new_dfd = READ_ONCE(sqe->fd); - oldpath = u64_to_user_ptr(READ_ONCE(sqe->addr)); - newpath = u64_to_user_ptr(READ_ONCE(sqe->addr2)); + oldpath = (char __user *)READ_ONCE(sqe->addr); + newpath = (char __user *)READ_ONCE(sqe->addr2);
sl->oldpath = getname(oldpath); if (IS_ERR(sl->oldpath)) @@ -250,8 +250,8 @@ int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
lnk->old_dfd = READ_ONCE(sqe->fd); lnk->new_dfd = READ_ONCE(sqe->len); - oldf = u64_to_user_ptr(READ_ONCE(sqe->addr)); - newf = u64_to_user_ptr(READ_ONCE(sqe->addr2)); + oldf = (char __user *)READ_ONCE(sqe->addr); + newf = (char __user *)READ_ONCE(sqe->addr2); lnk->flags = READ_ONCE(sqe->hardlink_flags);
lnk->oldpath = getname(oldf); diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index d4710672b4fc7..98179f01cd12b 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -159,10 +159,10 @@ static int get_compat64_io_uring_getevents_arg(struct io_uring_getevents_arg *ar
if (copy_from_user(&compat_arg, user_arg, sizeof(compat_arg))) return -EFAULT; - arg->sigmask = compat_arg.sigmask; + arg->sigmask = (__kernel_uintptr_t)compat_ptr(compat_arg.sigmask); arg->sigmask_sz = compat_arg.sigmask_sz; arg->pad = compat_arg.pad; - arg->ts = compat_arg.ts; + arg->ts = (__kernel_uintptr_t)compat_ptr(compat_arg.ts); return 0; }
@@ -178,7 +178,7 @@ static int copy_io_uring_getevents_arg_from_user(struct io_ring_ctx *ctx, } if (size != sizeof(*arg)) return -EINVAL; - return copy_from_user(arg, argp, sizeof(*arg)); + return copy_from_user_with_ptr(arg, argp, sizeof(*arg)); }
struct sock *io_uring_get_socket(struct file *file) @@ -721,7 +721,7 @@ static __cold void io_uring_drop_tctx_refs(struct task_struct *task) } }
-static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, +static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data, s32 res, u32 cflags, u64 extra1, u64 extra2) { struct io_overflow_cqe *ocqe; @@ -816,8 +816,8 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow) return cqe; }
-bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags, - bool allow_overflow) +bool io_fill_cqe_aux(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data, + s32 res, u32 cflags, bool allow_overflow) { struct io_uring_cqe *cqe;
@@ -843,7 +843,7 @@ bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags }
bool io_post_aux_cqe(struct io_ring_ctx *ctx, - u64 user_data, s32 res, u32 cflags, + __kernel_uintptr_t user_data, s32 res, u32 cflags, bool allow_overflow) { bool filled; @@ -3214,9 +3214,9 @@ static int io_get_ext_arg(struct io_ring_ctx *ctx, unsigned int flags, return ret; if (arg.pad) return -EINVAL; - *sig = u64_to_user_ptr(arg.sigmask); + *sig = (sigset_t __user *)arg.sigmask; *argsz = arg.sigmask_sz; - *ts = u64_to_user_ptr(arg.ts); + *ts = (struct __kernel_timespec __user *)arg.ts; return 0; }
@@ -4159,6 +4159,49 @@ static int __init io_uring_init(void) __BUILD_BUG_VERIFY_OFFSET_SIZE(struct io_uring_sqe, eoffset, sizeof(etype), ename) #define BUILD_BUG_SQE_ELEM_SIZE(eoffset, esize, ename) \ __BUILD_BUG_VERIFY_OFFSET_SIZE(struct io_uring_sqe, eoffset, esize, ename) +#ifdef CONFIG_CHERI_PURECAP_UABI + BUILD_BUG_ON(sizeof(struct io_uring_sqe) != 128); + BUILD_BUG_SQE_ELEM(0, __u8, opcode); + BUILD_BUG_SQE_ELEM(1, __u8, flags); + BUILD_BUG_SQE_ELEM(2, __u16, ioprio); + BUILD_BUG_SQE_ELEM(4, __s32, fd); + BUILD_BUG_SQE_ELEM(16, __u64, off); + BUILD_BUG_SQE_ELEM(16, __uintcap_t, addr2); + BUILD_BUG_SQE_ELEM(16, __u32, cmd_op); + BUILD_BUG_SQE_ELEM(20, __u32, __pad1); + BUILD_BUG_SQE_ELEM(32, __uintcap_t, addr); + BUILD_BUG_SQE_ELEM(32, __u64, splice_off_in); + BUILD_BUG_SQE_ELEM(48, __u32, len); + BUILD_BUG_SQE_ELEM(52, __kernel_rwf_t, rw_flags); + BUILD_BUG_SQE_ELEM(52, __u32, fsync_flags); + BUILD_BUG_SQE_ELEM(52, __u16, poll_events); + BUILD_BUG_SQE_ELEM(52, __u32, poll32_events); + BUILD_BUG_SQE_ELEM(52, __u32, sync_range_flags); + BUILD_BUG_SQE_ELEM(52, __u32, msg_flags); + BUILD_BUG_SQE_ELEM(52, __u32, timeout_flags); + BUILD_BUG_SQE_ELEM(52, __u32, accept_flags); + BUILD_BUG_SQE_ELEM(52, __u32, cancel_flags); + BUILD_BUG_SQE_ELEM(52, __u32, open_flags); + BUILD_BUG_SQE_ELEM(52, __u32, statx_flags); + BUILD_BUG_SQE_ELEM(52, __u32, fadvise_advice); + BUILD_BUG_SQE_ELEM(52, __u32, splice_flags); + BUILD_BUG_SQE_ELEM(52, __u32, rename_flags); + BUILD_BUG_SQE_ELEM(52, __u32, unlink_flags); + BUILD_BUG_SQE_ELEM(52, __u32, hardlink_flags); + BUILD_BUG_SQE_ELEM(52, __u32, xattr_flags); + BUILD_BUG_SQE_ELEM(52, __u32, msg_ring_flags); + BUILD_BUG_SQE_ELEM(64, __uintcap_t, user_data); + BUILD_BUG_SQE_ELEM(80, __u16, buf_index); + BUILD_BUG_SQE_ELEM(80, __u16, buf_group); + BUILD_BUG_SQE_ELEM(82, __u16, personality); + BUILD_BUG_SQE_ELEM(84, __s32, splice_fd_in); + BUILD_BUG_SQE_ELEM(84, __u32, file_index); + BUILD_BUG_SQE_ELEM(84, __u16, addr_len); + BUILD_BUG_SQE_ELEM(86, __u16, __pad3[0]); + BUILD_BUG_SQE_ELEM(96, __uintcap_t, addr3); + BUILD_BUG_SQE_ELEM_SIZE(96, 0, cmd); + BUILD_BUG_SQE_ELEM(112, __uintcap_t, __pad2); +#else /* !CONFIG_CHERI_PURECAP_UABI */ BUILD_BUG_ON(sizeof(struct io_uring_sqe) != 64); BUILD_BUG_SQE_ELEM(0, __u8, opcode); BUILD_BUG_SQE_ELEM(1, __u8, flags); @@ -4202,6 +4245,7 @@ static int __init io_uring_init(void) BUILD_BUG_SQE_ELEM(48, __u64, addr3); BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd); BUILD_BUG_SQE_ELEM(56, __u64, __pad2); +#endif /* !CONFIG_CHERI_PURECAP_UABI */
BUILD_BUG_ON(sizeof(struct io_uring_files_update) != sizeof(struct io_uring_rsrc_update)); diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h index b44ad558137be..ad6b8d79e98de 100644 --- a/io_uring/io_uring.h +++ b/io_uring/io_uring.h @@ -34,10 +34,10 @@ void io_req_complete_failed(struct io_kiocb *req, s32 res); void __io_req_complete(struct io_kiocb *req, unsigned issue_flags); void io_req_complete_post(struct io_kiocb *req); void __io_req_complete_post(struct io_kiocb *req); -bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags, - bool allow_overflow); -bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags, - bool allow_overflow); +bool io_post_aux_cqe(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data, + s32 res, u32 cflags, bool allow_overflow); +bool io_fill_cqe_aux(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data, + s32 res, u32 cflags, bool allow_overflow); void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages); @@ -120,13 +120,13 @@ static inline void convert_compat64_io_uring_sqe(struct io_ring_ctx *ctx, sqe->ioprio = READ_ONCE(compat_sqe->ioprio); sqe->fd = READ_ONCE(compat_sqe->fd); BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr2, addr); - sqe->addr2 = READ_ONCE(compat_sqe->addr2); + sqe->addr2 = (__kernel_uintptr_t)compat_ptr(READ_ONCE(compat_sqe->addr2)); BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr, len); - sqe->addr = READ_ONCE(compat_sqe->addr); + sqe->addr = (__kernel_uintptr_t)compat_ptr(READ_ONCE(compat_sqe->addr)); sqe->len = READ_ONCE(compat_sqe->len); BUILD_BUG_COMPAT_SQE_UNION_ELEM(rw_flags, user_data); sqe->rw_flags = READ_ONCE(compat_sqe->rw_flags); - sqe->user_data = READ_ONCE(compat_sqe->user_data); + sqe->user_data = (__kernel_uintptr_t)READ_ONCE(compat_sqe->user_data); BUILD_BUG_COMPAT_SQE_UNION_ELEM(buf_index, personality); sqe->buf_index = READ_ONCE(compat_sqe->buf_index); sqe->personality = READ_ONCE(compat_sqe->personality); @@ -136,9 +136,14 @@ static inline void convert_compat64_io_uring_sqe(struct io_ring_ctx *ctx, size_t compat_cmd_size = compat_uring_cmd_pdu_size(ctx->flags & IORING_SETUP_SQE128);
+ /* + * Note that sqe->cmd is bigger than compat_sqe->cmd, but + * uring_cmd handlers are not using that extra data in the + * compat mode, so the end of sqe->cmd is left uninitialised. + */ memcpy(sqe->cmd, compat_sqe->cmd, compat_cmd_size); } else { - sqe->addr3 = READ_ONCE(compat_sqe->addr3); + sqe->addr3 = (__kernel_uintptr_t)compat_ptr(READ_ONCE(compat_sqe->addr3)); sqe->__pad2[0] = READ_ONCE(compat_sqe->__pad2[0]); } #undef BUILD_BUG_COMPAT_SQE_UNION_ELEM @@ -169,13 +174,13 @@ static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx) }
static inline void __io_fill_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe *cqe, - u64 user_data, s32 res, u32 cflags, + __kernel_uintptr_t user_data, s32 res, u32 cflags, u64 extra1, u64 extra2) { if (is_compat64_io_ring_ctx(ctx)) { struct compat_io_uring_cqe *compat_cqe = (struct compat_io_uring_cqe *)cqe;
- WRITE_ONCE(compat_cqe->user_data, user_data); + WRITE_ONCE(compat_cqe->user_data, (__u64)user_data); WRITE_ONCE(compat_cqe->res, res); WRITE_ONCE(compat_cqe->flags, cflags);
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index b388592e67df9..4614ab633c4bd 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -22,7 +22,7 @@
 
 struct io_provide_buf {
 	struct file			*file;
-	__u64				addr;
+	void __user			*addr;
 	__u32				len;
 	__u32				bgid;
 	__u16				nbufs;
@@ -36,7 +36,7 @@ static int get_compat64_io_uring_buf_reg(struct io_uring_buf_reg *reg,
 	if (copy_from_user(&compat_reg, user_reg, sizeof(compat_reg)))
 		return -EFAULT;
-	reg->ring_addr = compat_reg.ring_addr;
+	reg->ring_addr = (__kernel_uintptr_t)compat_ptr(compat_reg.ring_addr);
 	reg->ring_entries = compat_reg.ring_entries;
 	reg->bgid = compat_reg.bgid;
 	reg->pad = compat_reg.pad;
@@ -50,7 +50,7 @@ static int copy_io_uring_buf_reg_from_user(struct io_ring_ctx *ctx,
 {
 	if (is_compat64_io_ring_ctx(ctx))
 		return get_compat64_io_uring_buf_reg(reg, arg);
-	return copy_from_user(reg, arg, sizeof(*reg));
+	return copy_from_user_with_ptr(reg, arg, sizeof(*reg));
 }
 
 static inline struct io_buffer_list *io_buffer_get_list(struct io_ring_ctx *ctx,
@@ -145,7 +145,7 @@ static void __user *io_provided_buffer_select(struct io_kiocb *req, size_t *len,
 		req->flags |= REQ_F_BUFFER_SELECTED;
 		req->kbuf = kbuf;
 		req->buf_index = kbuf->bid;
-		return u64_to_user_ptr(kbuf->addr);
+		return (void __user *)kbuf->addr;
 	}
 	return NULL;
 }
@@ -205,7 +205,7 @@ static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len,
 	req->buf_list = bl;
 	req->buf_index = buf->bid;
 
-	return u64_to_user_ptr(buf->addr);
+	return (void __user *)buf->addr;
 }
 
 static void __user *io_ring_buffer_select_any(struct io_kiocb *req, size_t *len,
@@ -403,17 +403,17 @@ int io_provide_buffers_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (!tmp || tmp > USHRT_MAX)
 		return -E2BIG;
 	p->nbufs = tmp;
-	p->addr = READ_ONCE(sqe->addr);
+	p->addr = (void __user *)READ_ONCE(sqe->addr);
 	p->len = READ_ONCE(sqe->len);
 
 	if (check_mul_overflow((unsigned long)p->len, (unsigned long)p->nbufs,
 				&size))
 		return -EOVERFLOW;
-	if (check_add_overflow((unsigned long)p->addr, size, &tmp_check))
+	if (check_add_overflow(user_ptr_addr(p->addr), size, &tmp_check))
 		return -EOVERFLOW;
 
 	size = (unsigned long)p->len * p->nbufs;
-	if (!access_ok(u64_to_user_ptr(p->addr), size))
+	if (!access_ok(p->addr, size))
 		return -EFAULT;
 
 	p->bgid = READ_ONCE(sqe->buf_group);
@@ -473,7 +473,7 @@ static int io_add_buffers(struct io_ring_ctx *ctx, struct io_provide_buf *pbuf,
 			  struct io_buffer_list *bl)
 {
 	struct io_buffer *buf;
-	u64 addr = pbuf->addr;
+	void __user *addr = pbuf->addr;
 	int i, bid = pbuf->bid;
 
 	for (i = 0; i < pbuf->nbufs; i++) {
@@ -585,6 +585,7 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 	pages_size = ctx->compat ?
 		size_mul(sizeof(struct compat_io_uring_buf), reg.ring_entries) :
 		size_mul(sizeof(struct io_uring_buf), reg.ring_entries);
+	/* TODO [PCuABI] - capability checks for uaccess */
 	pages = io_pin_pages(reg.ring_addr, pages_size, &nr_pages);
 	if (IS_ERR(pages)) {
 		kfree(free_bl);
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index 1aa5bbbc5d628..1977c13ccf3ff 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -31,7 +31,7 @@ struct io_buffer_list {
 
 struct io_buffer {
 	struct list_head list;
-	__u64 addr;
+	void __user *addr;
 	__u32 len;
 	__u16 bid;
 	__u16 bgid;
diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index 90d2fc6fd80e4..654f5ad0b11c0 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -15,7 +15,7 @@
 
 struct io_msg {
 	struct file *file;
-	u64 user_data;
+	__kernel_uintptr_t user_data;
 	u32 len;
 	u32 cmd;
 	u32 src_fd;
@@ -130,7 +130,7 @@ int io_msg_ring_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (unlikely(sqe->buf_index || sqe->personality))
 		return -EINVAL;
 
-	msg->user_data = READ_ONCE(sqe->off);
+	msg->user_data = READ_ONCE(sqe->addr2);
 	msg->len = READ_ONCE(sqe->len);
 	msg->cmd = READ_ONCE(sqe->addr);
 	msg->src_fd = READ_ONCE(sqe->addr3);
diff --git a/io_uring/net.c b/io_uring/net.c
index 4c133bc6f9d1d..6fd28a49b6715 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -243,13 +243,13 @@ int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (req->opcode == IORING_OP_SEND) {
 		if (READ_ONCE(sqe->__pad3[0]))
 			return -EINVAL;
-		sr->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+		sr->addr = (void __user *)READ_ONCE(sqe->addr2);
 		sr->addr_len = READ_ONCE(sqe->addr_len);
 	} else if (sqe->addr2 || sqe->file_index) {
 		return -EINVAL;
 	}
 
-	sr->umsg = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	sr->umsg = (struct user_msghdr __user *)READ_ONCE(sqe->addr);
 	sr->len = READ_ONCE(sqe->len);
 	sr->flags = READ_ONCE(sqe->ioprio);
 	if (sr->flags & ~IORING_RECVSEND_POLL_FIRST)
@@ -421,7 +421,7 @@ static int __io_recvmsg_copy_hdr(struct io_kiocb *req,
 	struct user_msghdr msg;
 	int ret;
 
-	if (copy_from_user(&msg, sr->umsg, sizeof(*sr->umsg)))
+	if (copy_from_user_with_ptr(&msg, sr->umsg, sizeof(*sr->umsg)))
 		return -EFAULT;
 
 	ret = __copy_msghdr(&iomsg->msg, &msg, &iomsg->uaddr);
@@ -549,7 +549,7 @@ int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (unlikely(sqe->file_index || sqe->addr2))
 		return -EINVAL;
 
-	sr->umsg = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	sr->umsg = (struct user_msghdr __user *)READ_ONCE(sqe->addr);
 	sr->len = READ_ONCE(sqe->len);
 	sr->flags = READ_ONCE(sqe->ioprio);
 	if (sr->flags & ~(RECVMSG_FLAGS))
@@ -966,7 +966,7 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (req->opcode == IORING_OP_SEND_ZC) {
 		if (READ_ONCE(sqe->__pad3[0]))
 			return -EINVAL;
-		zc->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+		zc->addr = (void __user *)READ_ONCE(sqe->addr2);
 		zc->addr_len = READ_ONCE(sqe->addr_len);
 	} else {
 		if (unlikely(sqe->addr2 || sqe->file_index))
@@ -975,7 +975,7 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 			return -EINVAL;
 	}
 
-	zc->buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	zc->buf = (void __user *)READ_ONCE(sqe->addr);
 	zc->len = READ_ONCE(sqe->len);
 	zc->msg_flags = READ_ONCE(sqe->msg_flags) | MSG_NOSIGNAL;
 	if (zc->msg_flags & MSG_DONTWAIT)
@@ -1242,8 +1242,8 @@ int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (sqe->len || sqe->buf_index)
 		return -EINVAL;
 
-	accept->addr = u64_to_user_ptr(READ_ONCE(sqe->addr));
-	accept->addr_len = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	accept->addr = (void __user *)READ_ONCE(sqe->addr);
+	accept->addr_len = (int __user *)READ_ONCE(sqe->addr2);
 	accept->flags = READ_ONCE(sqe->accept_flags);
 	accept->nofile = rlimit(RLIMIT_NOFILE);
 	flags = READ_ONCE(sqe->ioprio);
@@ -1392,8 +1392,8 @@ int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (sqe->len || sqe->buf_index || sqe->rw_flags || sqe->splice_fd_in)
 		return -EINVAL;
 
-	conn->addr = u64_to_user_ptr(READ_ONCE(sqe->addr));
-	conn->addr_len = READ_ONCE(sqe->addr2);
+	conn->addr = (void __user *)READ_ONCE(sqe->addr);
+	conn->addr_len = READ_ONCE(sqe->off);
 	conn->in_progress = false;
 	return 0;
 }
diff --git a/io_uring/openclose.c b/io_uring/openclose.c
index 67178e4bb282d..0a5c838885306 100644
--- a/io_uring/openclose.c
+++ b/io_uring/openclose.c
@@ -47,7 +47,7 @@ static int __io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		open->how.flags |= O_LARGEFILE;
 
 	open->dfd = READ_ONCE(sqe->fd);
-	fname = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	fname = (char __user *)READ_ONCE(sqe->addr);
 	open->filename = getname(fname);
 	if (IS_ERR(open->filename)) {
 		ret = PTR_ERR(open->filename);
@@ -81,7 +81,7 @@ int io_openat2_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	size_t len;
 	int ret;
 
-	how = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	how = (struct open_how __user *)READ_ONCE(sqe->addr2);
 	len = READ_ONCE(sqe->len);
 	if (len < OPEN_HOW_SIZE_VER0)
 		return -EINVAL;
diff --git a/io_uring/poll.c b/io_uring/poll.c
index d9bf1767867e6..0b7936c817e50 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -22,8 +22,8 @@
 
 struct io_poll_update {
 	struct file *file;
-	u64 old_user_data;
-	u64 new_user_data;
+	__kernel_uintptr_t old_user_data;
+	__kernel_uintptr_t new_user_data;
 	__poll_t events;
 	bool update_events;
 	bool update_user_data;
@@ -890,7 +890,7 @@ int io_poll_remove_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	upd->update_events = flags & IORING_POLL_UPDATE_EVENTS;
 	upd->update_user_data = flags & IORING_POLL_UPDATE_USER_DATA;
 
-	upd->new_user_data = READ_ONCE(sqe->off);
+	upd->new_user_data = READ_ONCE(sqe->addr2);
 	if (!upd->update_user_data && upd->new_user_data)
 		return -EINVAL;
 	if (upd->update_events)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index c65b99fb9264f..7c308e00e1c2c 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -18,7 +18,7 @@
 
 struct io_rsrc_update {
 	struct file *file;
-	u64 arg;
+	__s32 __user *arg;
 	u32 nr_args;
 	u32 offset;
 };
@@ -32,7 +32,7 @@ static int get_compat64_io_uring_rsrc_update(struct io_uring_rsrc_update2 *up2,
 		return -EFAULT;
 	up2->offset = compat_up.offset;
 	up2->resv = compat_up.resv;
-	up2->data = compat_up.data;
+	up2->data = (__kernel_uintptr_t)compat_ptr(compat_up.data);
 	return 0;
 }
 
@@ -45,8 +45,8 @@ static int get_compat64_io_uring_rsrc_update2(struct io_uring_rsrc_update2 *up2,
 		return -EFAULT;
 	up2->offset = compat_up2.offset;
 	up2->resv = compat_up2.resv;
-	up2->data = compat_up2.data;
-	up2->tags = compat_up2.tags;
+	up2->data = (__kernel_uintptr_t)compat_ptr(compat_up2.data);
+	up2->tags = (__kernel_uintptr_t)compat_ptr(compat_up2.tags);
 	up2->nr = compat_up2.nr;
 	up2->resv2 = compat_up2.resv2;
 	return 0;
@@ -62,8 +62,8 @@ static int get_compat64_io_uring_rsrc_register(struct io_uring_rsrc_register *rr,
 	rr->nr = compat_rr.nr;
 	rr->flags = compat_rr.flags;
 	rr->resv2 = compat_rr.resv2;
-	rr->data = compat_rr.data;
-	rr->tags = compat_rr.tags;
+	rr->data = (__kernel_uintptr_t)compat_ptr(compat_rr.data);
+	rr->tags = (__kernel_uintptr_t)compat_ptr(compat_rr.tags);
 	return 0;
 }
 
@@ -73,7 +73,7 @@ static int copy_io_uring_rsrc_update_from_user(struct io_ring_ctx *ctx,
 {
 	if (is_compat64_io_ring_ctx(ctx))
 		return get_compat64_io_uring_rsrc_update(up2, arg);
-	return copy_from_user(up2, arg, sizeof(struct io_uring_rsrc_update));
+	return copy_from_user_with_ptr(up2, arg, sizeof(struct io_uring_rsrc_update));
 }
 
 static int copy_io_uring_rsrc_update2_from_user(struct io_ring_ctx *ctx,
@@ -88,7 +88,7 @@ static int copy_io_uring_rsrc_update2_from_user(struct io_ring_ctx *ctx,
 	}
 	if (size != sizeof(*up2))
 		return -EINVAL;
-	return copy_from_user(up2, arg, sizeof(*up2));
+	return copy_from_user_with_ptr(up2, arg, sizeof(*up2));
 }
 
 static int copy_io_uring_rsrc_register_from_user(struct io_ring_ctx *ctx,
@@ -103,7 +103,7 @@ static int copy_io_uring_rsrc_register_from_user(struct io_ring_ctx *ctx,
 	}
 	if (size != sizeof(*rr))
 		return -EINVAL;
-	return copy_from_user(rr, arg, size);
+	return copy_from_user_with_ptr(rr, arg, size);
 }
 
 static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
@@ -184,13 +184,13 @@ static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
 		if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov)))
 			return -EFAULT;
 
-		dst->iov_base = u64_to_user_ptr((u64)ciov.iov_base);
+		dst->iov_base = compat_ptr(ciov.iov_base);
 		dst->iov_len = ciov.iov_len;
 		return 0;
 	}
 #endif
 	src = (struct iovec __user *) arg;
-	if (copy_from_user(dst, &src[index], sizeof(*dst)))
+	if (copy_from_user_with_ptr(dst, &src[index], sizeof(*dst)))
 		return -EFAULT;
 	return 0;
 }
@@ -517,8 +517,8 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 				 struct io_uring_rsrc_update2 *up,
 				 unsigned nr_args)
 {
-	u64 __user *tags = u64_to_user_ptr(up->tags);
-	__s32 __user *fds = u64_to_user_ptr(up->data);
+	u64 __user *tags = (u64 __user *)up->tags;
+	__s32 __user *fds = (__s32 __user *)up->data;
 	struct io_rsrc_data *data = ctx->file_data;
 	struct io_fixed_file *file_slot;
 	struct file *file;
@@ -597,9 +597,9 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 				   struct io_uring_rsrc_update2 *up,
 				   unsigned int nr_args)
 {
-	u64 __user *tags = u64_to_user_ptr(up->tags);
+	u64 __user *tags = (u64 __user *)up->tags;
 	struct iovec iov;
-	struct iovec __user *iovs = u64_to_user_ptr(up->data);
+	struct iovec __user *iovs = (struct iovec __user *)up->data;
 	struct page *last_hpage = NULL;
 	bool needs_switch = false;
 	__u32 done;
@@ -725,13 +725,13 @@ __cold int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
 	case IORING_RSRC_FILE:
 		if (rr.flags & IORING_RSRC_REGISTER_SPARSE && rr.data)
 			break;
-		return io_sqe_files_register(ctx, u64_to_user_ptr(rr.data),
-					     rr.nr, u64_to_user_ptr(rr.tags));
+		return io_sqe_files_register(ctx, (void __user *)rr.data,
+					     rr.nr, (u64 __user *)rr.tags);
 	case IORING_RSRC_BUFFER:
 		if (rr.flags & IORING_RSRC_REGISTER_SPARSE && rr.data)
 			break;
-		return io_sqe_buffers_register(ctx, u64_to_user_ptr(rr.data),
-					       rr.nr, u64_to_user_ptr(rr.tags));
+		return io_sqe_buffers_register(ctx, (void __user *)rr.data,
+					       rr.nr, (u64 __user *)rr.tags);
 	}
 	return -EINVAL;
 }
@@ -749,7 +749,7 @@ int io_files_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	up->nr_args = READ_ONCE(sqe->len);
 	if (!up->nr_args)
 		return -EINVAL;
-	up->arg = READ_ONCE(sqe->addr);
+	up->arg = (__s32 __user *)READ_ONCE(sqe->addr);
 	return 0;
 }
 
@@ -757,7 +757,7 @@ static int io_files_update_with_index_alloc(struct io_kiocb *req,
 					    unsigned int issue_flags)
 {
 	struct io_rsrc_update *up = io_kiocb_to_cmd(req, struct io_rsrc_update);
-	__s32 __user *fds = u64_to_user_ptr(up->arg);
+	__s32 __user *fds = up->arg;
 	unsigned int done;
 	struct file *file;
 	int ret, fd;
@@ -800,7 +800,7 @@ int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
 	int ret;
 
 	up2.offset = up->offset;
-	up2.data = up->arg;
+	up2.data = (__kernel_uintptr_t)up->arg;
 	up2.nr = 0;
 	up2.tags = 0;
 	up2.resv = 0;
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 2edca190450ee..229c0d778c9d6 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -23,7 +23,7 @@
 struct io_rw {
 	/* NOTE: kiocb has the file as the first member, so don't do it here */
 	struct kiocb			kiocb;
-	u64				addr;
+	void __user			*addr;
 	u32				len;
 	rwf_t				flags;
 };
@@ -39,7 +39,7 @@ static int io_iov_compat_buffer_select_prep(struct io_rw *rw)
 	struct compat_iovec __user *uiov;
 	compat_ssize_t clen;
 
-	uiov = u64_to_user_ptr(rw->addr);
+	uiov = rw->addr;
 	if (!access_ok(uiov, sizeof(*uiov)))
 		return -EFAULT;
 	if (__get_user(clen, &uiov->iov_len))
@@ -65,7 +65,7 @@ static int io_iov_buffer_select_prep(struct io_kiocb *req)
 		return io_iov_compat_buffer_select_prep(rw);
 #endif
 
-	uiov = u64_to_user_ptr(rw->addr);
+	uiov = rw->addr;
 	if (get_user(rw->len, &uiov->iov_len))
 		return -EFAULT;
 	return 0;
@@ -104,7 +104,7 @@ int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		rw->kiocb.ki_ioprio = get_current_ioprio();
 	}
 
-	rw->addr = READ_ONCE(sqe->addr);
+	rw->addr = (void __user *)READ_ONCE(sqe->addr);
 	rw->len = READ_ONCE(sqe->len);
 	rw->flags = READ_ONCE(sqe->rw_flags);
 
@@ -364,13 +364,14 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
 	ssize_t ret;
 
 	if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
-		ret = io_import_fixed(ddir, iter, req->imu, rw->addr, rw->len);
+		ret = io_import_fixed(ddir, iter, req->imu,
+				      user_ptr_addr(rw->addr), rw->len);
 		if (ret)
 			return ERR_PTR(ret);
 		return NULL;
 	}
 
-	buf = u64_to_user_ptr(rw->addr);
+	buf = rw->addr;
 	sqe_len = rw->len;
 
 	if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE ||
@@ -379,8 +380,7 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
 		buf = io_buffer_select(req, &sqe_len, issue_flags);
 		if (!buf)
 			return ERR_PTR(-ENOBUFS);
-		/* TODO [PCuABI] - capability checks for uaccess */
-		rw->addr = user_ptr_addr(buf);
+		rw->addr = buf;
 		rw->len = sqe_len;
 	}
 
@@ -446,7 +446,7 @@ static ssize_t loop_rw_iter(int ddir, struct io_rw *rw, struct iov_iter *iter)
 	if (!iov_iter_is_bvec(iter)) {
 		iovec = iov_iter_iovec(iter);
 	} else {
-		iovec.iov_base = u64_to_user_ptr(rw->addr);
+		iovec.iov_base = rw->addr;
 		iovec.iov_len = rw->len;
 	}
 
diff --git a/io_uring/statx.c b/io_uring/statx.c
index d8fc933d3f593..d2604fdbcbe33 100644
--- a/io_uring/statx.c
+++ b/io_uring/statx.c
@@ -32,8 +32,8 @@ int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	sx->dfd = READ_ONCE(sqe->fd);
 	sx->mask = READ_ONCE(sqe->len);
-	path = u64_to_user_ptr(READ_ONCE(sqe->addr));
-	sx->buffer = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	path = (char __user *)READ_ONCE(sqe->addr);
+	sx->buffer = (struct statx __user *)READ_ONCE(sqe->addr2);
 	sx->flags = READ_ONCE(sqe->statx_flags);
 
 	sx->filename = getname_flags(path,
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index e69e8d7ba36c0..d36993fb577c9 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -21,7 +21,7 @@ static int get_compat64_io_uring_rsrc_update(struct io_uring_rsrc_update *up,
 		return -EFAULT;
 	up->offset = compat_up.offset;
 	up->resv = compat_up.resv;
-	up->data = compat_up.data;
+	up->data = (__kernel_uintptr_t)compat_ptr(compat_up.data);
 	return 0;
 }
 
@@ -31,7 +31,7 @@ static int copy_io_uring_rsrc_update_from_user(struct io_ring_ctx *ctx,
 {
 	if (is_compat64_io_ring_ctx(ctx))
 		return get_compat64_io_uring_rsrc_update(up, arg);
-	return copy_from_user(up, arg, sizeof(struct io_uring_rsrc_update));
+	return copy_from_user_with_ptr(up, arg, sizeof(struct io_uring_rsrc_update));
 }
 
 static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
diff --git a/io_uring/timeout.c b/io_uring/timeout.c
index e8a8c20994805..5a0fe53c13329 100644
--- a/io_uring/timeout.c
+++ b/io_uring/timeout.c
@@ -26,7 +26,7 @@ struct io_timeout {
 
 struct io_timeout_rem {
 	struct file			*file;
-	u64				addr;
+	__kernel_uintptr_t		addr;
 
 	/* timeout update */
 	struct timespec64		ts;
@@ -337,7 +337,7 @@ static clockid_t io_timeout_get_clock(struct io_timeout_data *data)
 	}
 }
 
-static int io_linked_timeout_update(struct io_ring_ctx *ctx, __u64 user_data,
+static int io_linked_timeout_update(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data,
 				    struct timespec64 *ts, enum hrtimer_mode mode)
 	__must_hold(&ctx->timeout_lock)
 {
@@ -365,7 +365,7 @@ static int io_linked_timeout_update(struct io_ring_ctx *ctx, __u64 user_data,
 	return 0;
 }
 
-static int io_timeout_update(struct io_ring_ctx *ctx, __u64 user_data,
+static int io_timeout_update(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data,
 			     struct timespec64 *ts, enum hrtimer_mode mode)
 	__must_hold(&ctx->timeout_lock)
 {
@@ -405,7 +405,7 @@ int io_timeout_remove_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		tr->ltimeout = true;
 	if (tr->flags & ~(IORING_TIMEOUT_UPDATE_MASK|IORING_TIMEOUT_ABS))
 		return -EINVAL;
-	if (get_timespec64(&tr->ts, u64_to_user_ptr(sqe->addr2)))
+	if (get_timespec64(&tr->ts, (struct __kernel_timespec __user *)sqe->addr2))
 		return -EFAULT;
 	if (tr->ts.tv_sec < 0 || tr->ts.tv_nsec < 0)
 		return -EINVAL;
@@ -490,7 +490,7 @@ static int __io_timeout_prep(struct io_kiocb *req,
 	data->req = req;
 	data->flags = flags;
 
-	if (get_timespec64(&data->ts, u64_to_user_ptr(sqe->addr)))
+	if (get_timespec64(&data->ts, (struct __kernel_timespec __user *)sqe->addr))
 		return -EFAULT;
 
 	if (data->ts.tv_sec < 0 || data->ts.tv_nsec < 0)
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index e50de0b6b9f84..4d2d2e3f885ee 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -65,8 +65,13 @@ int io_uring_cmd_prep_async(struct io_kiocb *req)
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
 	size_t cmd_size;
 
+#ifdef CONFIG_CHERI_PURECAP_UABI
+	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 32);
+	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 160);
+#else
 	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 16);
 	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 80);
+#endif
 
 	cmd_size = uring_cmd_pdu_size(req->ctx->flags & IORING_SETUP_SQE128);
 
diff --git a/io_uring/xattr.c b/io_uring/xattr.c
index 99df641594d74..1f13032e59536 100644
--- a/io_uring/xattr.c
+++ b/io_uring/xattr.c
@@ -53,8 +53,8 @@ static int __io_getxattr_prep(struct io_kiocb *req,
 	ix->filename = NULL;
 	ix->ctx.kvalue = NULL;
-	name = u64_to_user_ptr(READ_ONCE(sqe->addr));
-	ix->ctx.cvalue = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	name = (char __user *)READ_ONCE(sqe->addr);
+	ix->ctx.cvalue = (void __user *)READ_ONCE(sqe->addr2);
 	ix->ctx.size = READ_ONCE(sqe->len);
 	ix->ctx.flags = READ_ONCE(sqe->xattr_flags);
 
@@ -93,7 +93,7 @@ int io_getxattr_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (ret)
 		return ret;
 
-	path = u64_to_user_ptr(READ_ONCE(sqe->addr3));
+	path = (char __user *)READ_ONCE(sqe->addr3);
 
 	ix->filename = getname_flags(path, LOOKUP_FOLLOW, NULL);
 	if (IS_ERR(ix->filename)) {
@@ -159,8 +159,8 @@ static int __io_setxattr_prep(struct io_kiocb *req,
 		return -EBADF;
 
 	ix->filename = NULL;
-	name = u64_to_user_ptr(READ_ONCE(sqe->addr));
-	ix->ctx.cvalue = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	name = (char __user *)READ_ONCE(sqe->addr);
+	ix->ctx.cvalue = (void __user *)READ_ONCE(sqe->addr2);
 	ix->ctx.kvalue = NULL;
 	ix->ctx.size = READ_ONCE(sqe->len);
 	ix->ctx.flags = READ_ONCE(sqe->xattr_flags);
@@ -189,7 +189,7 @@ int io_setxattr_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (ret)
 		return ret;
 
-	path = u64_to_user_ptr(READ_ONCE(sqe->addr3));
+	path = (char __user *)READ_ONCE(sqe->addr3);
 
 	ix->filename = getname_flags(path, LOOKUP_FOLLOW, NULL);
 	if (IS_ERR(ix->filename)) {
On 16/03/2023 14:40, Tudor Cretu wrote:
Some members of the io_uring uAPI structs may contain user pointers. In the PCuABI, a user pointer is a 129-bit capability, so the __u64 type is not big enough to hold it. Use the __kernel_uintptr_t type instead, which is big enough on the affected architectures while remaining 64-bit on others.
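For reference, __kernel_uintptr_t is expected to reduce to something along these lines (a sketch only; the exact uapi definition may differ in detail):

	#ifdef __CHERI_PURE_CAPABILITY__
	typedef __uintcap_t		__kernel_uintptr_t;	/* 129-bit capability */
	#else
	typedef unsigned long long	__kernel_uintptr_t;	/* plain 64-bit */
	#endif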
The user_data field must be passed unchanged from the submission queue to the completion queue. As it is standard practice to store a pointer in user_data, expand the field to __kernel_uintptr_t. However, the kernel doesn't dereference the user_data, so don't convert it in the compat case.
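For illustration, this is the usual userspace pattern that relies on the round-trip, sketched with liburing (struct my_req is hypothetical application code, not part of this series):

	struct my_req {			/* per-request application state */
		int fd;
		char buf[4096];
	};

	static void submit_read(struct io_uring *ring, struct my_req *r)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		io_uring_prep_read(sqe, r->fd, r->buf, sizeof(r->buf), 0);
		/* a full pointer - a capability in PCuABI - goes in user_data */
		io_uring_sqe_set_data(sqe, r);
		io_uring_submit(ring);
	}

	static struct my_req *reap_one(struct io_uring *ring)
	{
		struct io_uring_cqe *cqe;
		struct my_req *r;

		io_uring_wait_cqe(ring, &cqe);
		/* ...and must come back intact in the CQE */
		r = io_uring_cqe_get_data(cqe);
		io_uring_cqe_seen(ring, cqe);
		return r;
	}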
In addition, for the io_uring structs containing user pointers, use the special copy routines when copying user pointers from/to userspace.
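Schematically, for any struct with embedded user pointers (a plain copy_from_user() would strip the capability tags, while the _with_ptr variant preserves them):

	struct io_uring_rsrc_update2 up2;

	/* tag-preserving: user pointers in *arg arrive as valid capabilities */
	if (copy_from_user_with_ptr(&up2, arg, sizeof(up2)))
		return -EFAULT;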
In the case of operation IORING_OP_POLL_REMOVE, if IORING_POLL_UPDATE_USER_DATA is set in the SQE len field, then the request will update the user_data of an existing poll request based on the value passed in the addr2 field, instead of the off field. This is required because the off field is not large enough to fit a user_data value.
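In SQE terms, an update request is now laid out as follows (illustrative userspace preparation, where old_user_data and new_user_data are the application's values; not code from this series):

	sqe->opcode = IORING_OP_POLL_REMOVE;
	sqe->addr = old_user_data;		/* which poll request to update */
	sqe->len = IORING_POLL_UPDATE_USER_DATA;	/* flags live in len */
	sqe->addr2 = new_user_data;		/* new value; previously in off */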
Note that struct io_uring_sqe and struct io_uring_cqe double in size in PCuABI. The setup flags IORING_SETUP_SQE128 and IORING_SETUP_CQE32 previously doubled the sizes of the two structs to 128 bytes and 32 bytes respectively. In PCuABI the two flags still double the struct sizes, but since the base structs have grown, the doubled sizes are now 256 bytes and 64 bytes.
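For reference, the resulting sizes in bytes, as implied by the above ("flag" meaning IORING_SETUP_SQE128 or IORING_SETUP_CQE32 respectively):

	struct		native	native+flag	PCuABI	PCuABI+flag
	io_uring_sqe	64	128		128	256
	io_uring_cqe	16	32		32	64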
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
 include/linux/io_uring_types.h  |  4 +-
 include/trace/events/io_uring.h | 46 ++++++++++-----------
As per my reply in v3, I think we're better off not changing the signature of trace functions unless we also make them print the capability metadata, which doesn't sound essential.
 include/uapi/linux/io_uring.h | 76 ++++++++++++++++++---------
 io_uring/advise.c             |  7 +--
 io_uring/cancel.c             |  6 +--
 io_uring/cancel.h             |  2 +-
 io_uring/epoll.c              |  2 +-
 io_uring/fdinfo.c             |  8 ++--
 io_uring/fs.c                 | 16 +++---
 io_uring/io_uring.c           | 62 +++++++++++++++++++++++----
 io_uring/io_uring.h           | 25 ++++++-----
 io_uring/kbuf.c               | 19 +++++----
 io_uring/kbuf.h               |  2 +-
 io_uring/msg_ring.c           |  4 +-
 io_uring/net.c                | 20 ++++-----
 io_uring/openclose.c          |  4 +-
 io_uring/poll.c               |  6 +--
 io_uring/rsrc.c               | 44 +++++++++----------
 io_uring/rw.c                 | 18 ++++----
 io_uring/statx.c              |  4 +-
 io_uring/tctx.c               |  4 +-
 io_uring/timeout.c            | 10 ++---
 io_uring/uring_cmd.c          |  5 +++
 io_uring/xattr.c              | 12 +++---
 24 files changed, 235 insertions(+), 171 deletions(-)
[...]
 static bool io_cancel_cb(struct io_wq_work *work, void *data)
diff --git a/io_uring/cancel.h b/io_uring/cancel.h
index 6a59ee484d0cc..7c1249d61bf25 100644
--- a/io_uring/cancel.h
+++ b/io_uring/cancel.h
@@ -5,7 +5,7 @@
 struct io_cancel_data {
 	struct io_ring_ctx *ctx;
 	union {
-		u64 data;
+		__kernel_uintptr_t data;
This still needs some more work if we really want to treat the user data as a capability in PCuABI. That means that functions such as io_cancel_cb() need to do a full capability comparison (standard arithmetic only compares the address). Worth mentioning in the commit message too as this is not necessarily an obvious choice.
At first I thought you could use user_ptr_is_same() for that purpose, but in fact this isn't entirely appropriate, as we always want a 64-bit comparison in !PCuABI and user pointers are 32-bit on a 32-bit arch. It would probably be better to introduce an io_uring helper, with an implementation similar to user_ptr_is_same(), but taking __kernel_uintptr_t instead of void __user * (that would also avoid unnecessary casts).
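For concreteness, such a helper might look something like this (a sketch only, modelled on user_ptr_is_same(); the name and exact placement are open):

	static inline bool io_user_data_is_same(__kernel_uintptr_t d1,
						__kernel_uintptr_t d2)
	{
	#ifdef CONFIG_CHERI_PURECAP_UABI
		/* compares capability metadata and tag, not just the address */
		return __builtin_cheri_equal_exact(d1, d2);
	#else
		return d1 == d2;
	#endif
	}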
 		struct file *file;
 	};
 	u32 flags;
[...]
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index b388592e67df9..4614ab633c4bd 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -22,7 +22,7 @@
 struct io_provide_buf {
 	struct file *file;
-	__u64 addr;
+	void __user *addr;
 	__u32 len;
 	__u32 bgid;
 	__u16 nbufs;
@@ -36,7 +36,7 @@ static int get_compat64_io_uring_buf_reg(struct io_uring_buf_reg *reg,
 	if (copy_from_user(&compat_reg, user_reg, sizeof(compat_reg)))
 		return -EFAULT;
-	reg->ring_addr = compat_reg.ring_addr;
+	reg->ring_addr = (__kernel_uintptr_t)compat_ptr(compat_reg.ring_addr);
 	reg->ring_entries = compat_reg.ring_entries;
 	reg->bgid = compat_reg.bgid;
 	reg->pad = compat_reg.pad;
@@ -50,7 +50,7 @@ static int copy_io_uring_buf_reg_from_user(struct io_ring_ctx *ctx,
 {
 	if (is_compat64_io_ring_ctx(ctx))
 		return get_compat64_io_uring_buf_reg(reg, arg);
-	return copy_from_user(reg, arg, sizeof(*reg));
+	return copy_from_user_with_ptr(reg, arg, sizeof(*reg));
 }
 
 static inline struct io_buffer_list *io_buffer_get_list(struct io_ring_ctx *ctx,
@@ -145,7 +145,7 @@ static void __user *io_provided_buffer_select(struct io_kiocb *req, size_t *len,
 		req->flags |= REQ_F_BUFFER_SELECTED;
 		req->kbuf = kbuf;
 		req->buf_index = kbuf->bid;
-		return u64_to_user_ptr(kbuf->addr);
+		return (void __user *)kbuf->addr;
I think you don't need a cast any more.
[...]
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index e69e8d7ba36c0..d36993fb577c9 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -21,7 +21,7 @@ static int get_compat64_io_uring_rsrc_update(struct io_uring_rsrc_update *up,
 		return -EFAULT;
 	up->offset = compat_up.offset;
 	up->resv = compat_up.resv;
-	up->data = compat_up.data;
+	up->data = (__kernel_uintptr_t)compat_ptr(compat_up.data);
 	return 0;
 }
 
@@ -31,7 +31,7 @@ static int copy_io_uring_rsrc_update_from_user(struct io_ring_ctx *ctx,
 {
 	if (is_compat64_io_ring_ctx(ctx))
 		return get_compat64_io_uring_rsrc_update(up, arg);
-	return copy_from_user(up, arg, sizeof(struct io_uring_rsrc_update));
+	return copy_from_user_with_ptr(up, arg, sizeof(struct io_uring_rsrc_update));
You might have missed my comment in v3: there is no pointer here, and we may want to rename that function to disambiguate.
Kevin
On 28-03-2023 09:49, Kevin Brodsky wrote:
On 16/03/2023 14:40, Tudor Cretu wrote:
Some members of the io_uring uAPI structs may contain user pointers. In the PCuABI, a user pointer is a 129-bit capability, so the __u64 type is not big enough to hold it. Use the __kernel_uintptr_t type instead, which is big enough on the affected architectures while remaining 64-bit on others.
The user_data field must be passed unchanged from the submission queue to the completion queue. As it is standard practice to store a pointer in user_data, expand the field to __kernel_uintptr_t. However, the kernel doesn't dereference the user_data, so don't convert it in the compat case.
In addition, for the io_uring structs containing user pointers, use the special copy routines when copying user pointers from/to userspace.
In the case of operation IORING_OP_POLL_REMOVE, if IORING_POLL_UPDATE_USER_DATA is set in the SQE len field, then the request will update the user_data of an existing poll request based on the value passed in the addr2 field, instead of the off field. This is required because the off field is not large enough to fit a user_data value.
Note that struct io_uring_sqe and struct io_uring_cqe double in size in PCuABI. The setup flags IORING_SETUP_SQE128 and IORING_SETUP_CQE32 previously doubled the sizes of the two structs to 128 bytes and 32 bytes respectively. In PCuABI the two flags still double the struct sizes, but since the base structs have grown, the doubled sizes are now 256 bytes and 64 bytes.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
 include/linux/io_uring_types.h  |  4 +-
 include/trace/events/io_uring.h | 46 ++++++++++-----------
As per my reply in v3, I think we're better off not changing the signature of trace functions unless we also make them print the capability metadata, which doesn't sound essential.
I agree
 include/uapi/linux/io_uring.h | 76 ++++++++++++++++++---------
 io_uring/advise.c             |  7 +--
 io_uring/cancel.c             |  6 +--
 io_uring/cancel.h             |  2 +-
 io_uring/epoll.c              |  2 +-
 io_uring/fdinfo.c             |  8 ++--
 io_uring/fs.c                 | 16 +++---
 io_uring/io_uring.c           | 62 +++++++++++++++++++++++----
 io_uring/io_uring.h           | 25 ++++++-----
 io_uring/kbuf.c               | 19 +++++----
 io_uring/kbuf.h               |  2 +-
 io_uring/msg_ring.c           |  4 +-
 io_uring/net.c                | 20 ++++-----
 io_uring/openclose.c          |  4 +-
 io_uring/poll.c               |  6 +--
 io_uring/rsrc.c               | 44 +++++++++----------
 io_uring/rw.c                 | 18 ++++----
 io_uring/statx.c              |  4 +-
 io_uring/tctx.c               |  4 +-
 io_uring/timeout.c            | 10 ++---
 io_uring/uring_cmd.c          |  5 +++
 io_uring/xattr.c              | 12 +++---
 24 files changed, 235 insertions(+), 171 deletions(-)
[...]
 static bool io_cancel_cb(struct io_wq_work *work, void *data)
diff --git a/io_uring/cancel.h b/io_uring/cancel.h
index 6a59ee484d0cc..7c1249d61bf25 100644
--- a/io_uring/cancel.h
+++ b/io_uring/cancel.h
@@ -5,7 +5,7 @@
 struct io_cancel_data {
 	struct io_ring_ctx *ctx;
 	union {
-		u64 data;
+		__kernel_uintptr_t data;
This still needs some more work if we really want to treat the user data as a capability in PCuABI. That means that functions such as io_cancel_cb() need to do a full capability comparison (standard arithmetic only compares the address). Worth mentioning in the commit message too as this is not necessarily an obvious choice.
At first I thought you could use user_ptr_is_same() for that purpose, but in fact this isn't entirely appropriate, as we always want a 64-bit comparison in !PCuABI and user pointers are 32-bit on a 32-bit arch. It would probably be better to introduce an io_uring helper, with an implementation similar to user_ptr_is_same(), but taking __kernel_uintptr_t instead of void __user * (that would also avoid unnecessary casts).
Thank you for the details! I added a helper.
 		struct file *file;
 	};
 	u32 flags;
[...]
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index b388592e67df9..4614ab633c4bd 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -22,7 +22,7 @@
 struct io_provide_buf {
 	struct file *file;
-	__u64 addr;
+	void __user *addr;
 	__u32 len;
 	__u32 bgid;
 	__u16 nbufs;
@@ -36,7 +36,7 @@ static int get_compat64_io_uring_buf_reg(struct io_uring_buf_reg *reg,
 	if (copy_from_user(&compat_reg, user_reg, sizeof(compat_reg)))
 		return -EFAULT;
-	reg->ring_addr = compat_reg.ring_addr;
+	reg->ring_addr = (__kernel_uintptr_t)compat_ptr(compat_reg.ring_addr);
 	reg->ring_entries = compat_reg.ring_entries;
 	reg->bgid = compat_reg.bgid;
 	reg->pad = compat_reg.pad;
@@ -50,7 +50,7 @@ static int copy_io_uring_buf_reg_from_user(struct io_ring_ctx *ctx,
 {
 	if (is_compat64_io_ring_ctx(ctx))
 		return get_compat64_io_uring_buf_reg(reg, arg);
-	return copy_from_user(reg, arg, sizeof(*reg));
+	return copy_from_user_with_ptr(reg, arg, sizeof(*reg));
 }
 
 static inline struct io_buffer_list *io_buffer_get_list(struct io_ring_ctx *ctx,
@@ -145,7 +145,7 @@ static void __user *io_provided_buffer_select(struct io_kiocb *req, size_t *len,
 		req->flags |= REQ_F_BUFFER_SELECTED;
 		req->kbuf = kbuf;
 		req->buf_index = kbuf->bid;
-		return u64_to_user_ptr(kbuf->addr);
+		return (void __user *)kbuf->addr;
I think you don't need a cast any more.
That's right, thank you!
[...]
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index e69e8d7ba36c0..d36993fb577c9 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -21,7 +21,7 @@ static int get_compat64_io_uring_rsrc_update(struct io_uring_rsrc_update *up,
 		return -EFAULT;
 	up->offset = compat_up.offset;
 	up->resv = compat_up.resv;
-	up->data = compat_up.data;
+	up->data = (__kernel_uintptr_t)compat_ptr(compat_up.data);
 	return 0;
 }
 
@@ -31,7 +31,7 @@ static int copy_io_uring_rsrc_update_from_user(struct io_ring_ctx *ctx,
 {
 	if (is_compat64_io_ring_ctx(ctx))
 		return get_compat64_io_uring_rsrc_update(up, arg);
-	return copy_from_user(up, arg, sizeof(struct io_uring_rsrc_update));
+	return copy_from_user_with_ptr(up, arg, sizeof(struct io_uring_rsrc_update));
You might have missed my comment in v3: there is no pointer here, and we may want to rename that function to disambiguate.
I did miss it; I apologize! I updated the names too.
Kevin
Thank you again!
Tudor