Whenever a GUP call is made, the page address for the lookup is a 64-bit long raw pointer. When working in PCuABI, this means that the metadata of the capability gets discarded, hence any access made by the GUP is not checked in hardware.
This commit introduces explicit capability checks whenever a call to the current mm through the GUP functions is made.
Signed-off-by: Luca Vizzarro <Luca.Vizzarro@arm.com>
---
Hello,
this patch adds explicit capability checks needed in conjunction with GUP calls. Submitting only for review, not for merging.
In order for this patch to work, Kevin's user pointer helpers patch is required, in addition to a modification that I have suggested in his series thread.
Another essential required change is actually porting the futex_waitv syscall to PCuABI. Otherwise this patch will be a BREAKING change and LTP will fail. LTP was run successfully against this patch ONLY with the minimum required changes for futex_waitv to work, which are still a WIP.
The ticket related to this patch is #7: https://git.morello-project.org/morello/kernel/linux/-/issues/7
This patch can be found on the following branch, which contains some futex_waitv changes, Kevin's series and its modification: https://git.morello-project.org/Sevenarth/linux/-/commits/morello/gup-checks
Best,
Luca
---
 drivers/usb/core/devio.c |  8 ++++++--
 io_uring/kbuf.c          |  5 ++++-
 io_uring/net.c           |  4 +++-
 io_uring/rsrc.c          |  7 ++++++-
 kernel/bpf/helpers.c     |  4 +++-
 kernel/futex/core.c      | 11 ++++++++---
 lib/iov_iter.c           | 21 ++++++++++++++++++---
 mm/gup.c                 |  8 ++++++--
 8 files changed, 54 insertions(+), 14 deletions(-)
diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
index cb37f6d2010a..4d3249ba343a 100644
--- a/drivers/usb/core/devio.c
+++ b/drivers/usb/core/devio.c
@@ -1584,8 +1584,12 @@ find_memory_area(struct usb_dev_state *ps, const struct usbdevfs_urb *uurb)
 {
 	struct usb_memory *usbm = NULL, *iter;
 	unsigned long flags;
-	/* TODO [PCuABI] - capability checks for uaccess */
-	unsigned long uurb_start = user_ptr_addr(uurb->buffer);
+	unsigned long uurb_start;
+
+	if (!check_user_ptr_rw(uurb->buffer, uurb->buffer_length))
+		return ERR_PTR(-EFAULT);
+
+	uurb_start = user_ptr_addr(uurb->buffer);
 
 	spin_lock_irqsave(&ps->lock, flags);
 	list_for_each_entry(iter, &ps->memory_list, memlist) {
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 70056c27d778..d6e6227caab0 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -587,7 +587,10 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 	pages_size = io_in_compat64(ctx) ?
 		size_mul(sizeof(struct compat_io_uring_buf), reg.ring_entries) :
 		size_mul(sizeof(struct io_uring_buf), reg.ring_entries);
-	/* TODO [PCuABI] - capability checks for uaccess */
+
+	if (!check_user_ptr_rw((void __user *)reg.ring_addr, pages_size))
+		return -EFAULT;
+
 	pages = io_pin_pages(reg.ring_addr, pages_size, &nr_pages);
 	if (IS_ERR(pages)) {
 		kfree(free_bl);
diff --git a/io_uring/net.c b/io_uring/net.c
index 6fd28a49b671..a8766d53cad8 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1088,7 +1088,9 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
 		return io_setup_async_addr(req, &__address, issue_flags);
 
 	if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
-		/* TODO [PCuABI] - capability checks for uaccess */
+		if (!check_user_ptr_write(zc->buf, zc->len))
+			return -EFAULT;
+
 		ret = io_import_fixed(WRITE, &msg.msg_iter, req->imu,
 				      user_ptr_addr(zc->buf), zc->len);
 		if (unlikely(ret))
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 9e716fef91d7..285938fcf119 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1285,7 +1285,12 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 		return 0;
 
 	ret = -ENOMEM;
-	/* TODO [PCuABI] - capability checks for uaccess */
+
+	if (!check_user_ptr_rw(iov->iov_base, iov->iov_len)) {
+		ret = -EFAULT;
+		goto done;
+	}
+
 	pages = io_pin_pages(user_ptr_addr(iov->iov_base), iov->iov_len,
 			     &nr_pages);
 	if (IS_ERR(pages)) {
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index a8e76cf06da7..4db910f1758d 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -675,7 +675,9 @@ BPF_CALL_5(bpf_copy_from_user_task, void *, dst, u32, size,
 	if (unlikely(!size))
 		return 0;
 
-	/* TODO [PCuABI] - capability checks for uaccess */
+	if (!check_user_ptr_read(user_ptr, size))
+		return -EFAULT;
+
 	ret = access_process_vm(tsk, user_ptr_addr(user_ptr), dst, size, 0);
 	if (ret == size)
 		return 0;
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 9613080ccf0c..04289dc13b4a 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -226,8 +226,6 @@ int get_futex_key(u32 __user *uaddr, bool fshared, union futex_key *key,
 	struct address_space *mapping;
 	int err, ro = 0;
 
-	/* TODO [PCuABI] - capability checks for uaccess */
-
 	/*
 	 * The futex address must be "naturally" aligned.
 	 */
@@ -239,6 +237,12 @@ int get_futex_key(u32 __user *uaddr, bool fshared, union futex_key *key,
 	if (unlikely(!access_ok(uaddr, sizeof(u32))))
 		return -EFAULT;
 
+	if (rw == FUTEX_READ && !check_user_ptr_read(uaddr, sizeof(u32)))
+		return -EFAULT;
+
+	if (rw == FUTEX_WRITE && !check_user_ptr_rw(uaddr, sizeof(u32)))
+		return -EFAULT;
+
 	if (unlikely(should_fail_futex(fshared)))
 		return -EFAULT;
 
@@ -413,7 +417,8 @@ int fault_in_user_writeable(u32 __user *uaddr)
 	struct mm_struct *mm = current->mm;
 	int ret;
 
-	/* TODO [PCuABI] - capability checks for uaccess */
+	if (!check_user_ptr_rw(uaddr, PAGE_SIZE))
+		return -EFAULT;
 
 	mmap_read_lock(mm);
 	ret = fixup_user_fault(mm, user_ptr_addr(uaddr),
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 2d74d8d00ad9..046915cc1562 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1481,10 +1481,16 @@ static unsigned long first_iovec_segment(const struct iov_iter *i, size_t *size)
 {
 	size_t skip;
 	long k;
+	bool is_rw = iov_iter_rw(i) != WRITE;
+
+	if (iter_is_ubuf(i)) {
+		if (is_rw && !check_user_ptr_rw(i->ubuf, *size))
+			return -EFAULT;
+		if (!is_rw && !check_user_ptr_read(i->ubuf, *size))
+			return -EFAULT;
 
-	if (iter_is_ubuf(i))
-		/* TODO [PCuABI] - capability checks for uaccess */
 		return user_ptr_addr(i->ubuf) + i->iov_offset;
+	}
 
 	for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
 		size_t len = i->iov[k].iov_len - skip;
@@ -1493,7 +1499,13 @@ static unsigned long first_iovec_segment(const struct iov_iter *i, size_t *size)
 			continue;
 		if (*size > len)
 			*size = len;
-		/* TODO [PCuABI] - capability checks for uaccess */
+		if (is_rw && !check_user_ptr_rw(i->iov[k].iov_base + skip,
+						*size))
+			return -EFAULT;
+		if (!is_rw && !check_user_ptr_read(i->iov[k].iov_base + skip,
+						   *size))
+			return -EFAULT;
+
 		return user_ptr_addr(i->iov[k].iov_base) + skip;
 	}
 	BUG(); // if it had been empty, we wouldn't get called
@@ -1539,6 +1551,9 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		gup_flags |= FOLL_NOFAULT;
 
 	addr = first_iovec_segment(i, &maxsize);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	*start = addr % PAGE_SIZE;
 	addr &= PAGE_MASK;
 	n = want_pages_array(pages, maxsize, *start, maxpages);
diff --git a/mm/gup.c b/mm/gup.c
index dc02749eaf9b..197459c58cd1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1843,11 +1843,15 @@ EXPORT_SYMBOL(fault_in_subpage_writeable);
  */
size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
 {
-	/* TODO [PCuABI] - capability checks for uaccess */
-	unsigned long start = user_ptr_addr(uaddr), end;
+	unsigned long start, end;
 	struct mm_struct *mm = current->mm;
 	bool unlocked = false;
 
+	if (!check_user_ptr_read(uaddr, size))
+		return 0;
+
+	start = user_ptr_addr(uaddr);
+
 	if (unlikely(size == 0))
 		return 0;
 	end = PAGE_ALIGN(start + size);
On 03-05-2023 15:17, Luca Vizzarro wrote:
Whenever a GUP call is made, the page address for the lookup is a
"page address" is a bit ambiguous. The "start" parameter is a user address, and the first page pinned is the page that would be accessed when the start address is accessed by a user thread.
64-bit long raw pointer. When working in PCuABI, this means that the
"raw pointer" is a bit ambiguous as well. Capabilities are pointers.
I think this paragraph needs a bit of rephrasing.
metadata of the capability gets discarded, hence any access made by the GUP is not checked in hardware.
This commit introduces explicit capability checks whenever a call to the current mm through the GUP functions is made.
Signed-off-by: Luca Vizzarro Luca.Vizzarro@arm.com
I think the tag should be: "mm/gup:"
Hello,
this patch adds explicit capability checks needed in conjunction with GUP calls. Submitting only for review, not for merging.
In order for this patch to work, Kevin's user pointer helpers patch is required, in addition to a modification that I have suggested in his series thread.
Another essential required change is actually porting the futex_waitv syscall to PCuABI. Otherwise this patch will be a BREAKING change and LTP will fail. LTP was run successfully against this patch ONLY with the minimum required changes for futex_waitv to work, which are still a WIP.
The ticket related to this patch is #7: https://git.morello-project.org/morello/kernel/linux/-/issues/7
This patch can be found on the following branch, which contains some futex_waitv changes, Kevin's series and its modification: https://git.morello-project.org/Sevenarth/linux/-/commits/morello/gup-checks
Best,
Luca
 drivers/usb/core/devio.c |  8 ++++++--
 io_uring/kbuf.c          |  5 ++++-
 io_uring/net.c           |  4 +++-
 io_uring/rsrc.c          |  7 ++++++-
 kernel/bpf/helpers.c     |  4 +++-
 kernel/futex/core.c      | 11 ++++++++---
 lib/iov_iter.c           | 21 ++++++++++++++++++---
 mm/gup.c                 |  8 ++++++--
 8 files changed, 54 insertions(+), 14 deletions(-)
diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
index cb37f6d2010a..4d3249ba343a 100644
--- a/drivers/usb/core/devio.c
+++ b/drivers/usb/core/devio.c
@@ -1584,8 +1584,12 @@ find_memory_area(struct usb_dev_state *ps, const struct usbdevfs_urb *uurb)
 {
 	struct usb_memory *usbm = NULL, *iter;
 	unsigned long flags;
-	/* TODO [PCuABI] - capability checks for uaccess */
-	unsigned long uurb_start = user_ptr_addr(uurb->buffer);
+	unsigned long uurb_start;
+
+	if (!check_user_ptr_rw(uurb->buffer, uurb->buffer_length))
+		return ERR_PTR(-EFAULT);
+
+	uurb_start = user_ptr_addr(uurb->buffer);
Is this change related to gup?
 	spin_lock_irqsave(&ps->lock, flags);
 	list_for_each_entry(iter, &ps->memory_list, memlist) {
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 70056c27d778..d6e6227caab0 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -587,7 +587,10 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 	pages_size = io_in_compat64(ctx) ?
 		size_mul(sizeof(struct compat_io_uring_buf), reg.ring_entries) :
 		size_mul(sizeof(struct io_uring_buf), reg.ring_entries);
-	/* TODO [PCuABI] - capability checks for uaccess */
+
+	if (!check_user_ptr_rw((void __user *)reg.ring_addr, pages_size))
+		return -EFAULT;
If you return here with an error, there's a potential memory leak. You need to kfree(free_bl) (as below), or just move the check before the kzalloc on line 582.
+
 	pages = io_pin_pages(reg.ring_addr, pages_size, &nr_pages);
 	if (IS_ERR(pages)) {
 		kfree(free_bl);
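For the second option, roughly like this (untested sketch; the pages_size computation would also need to move up before the allocation, and I'm quoting the kzalloc line from memory):

	if (!check_user_ptr_rw((void __user *)reg.ring_addr, pages_size))
		return -EFAULT;

	/* Nothing has been allocated yet at this point, so the error
	 * path above needs no kfree(). */
	free_bl = bl = kzalloc(sizeof(*bl), GFP_KERNEL);
	if (!bl)
		return -ENOMEM;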
diff --git a/io_uring/net.c b/io_uring/net.c
index 6fd28a49b671..a8766d53cad8 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1088,7 +1088,9 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
 		return io_setup_async_addr(req, &__address, issue_flags);
 
 	if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
-		/* TODO [PCuABI] - capability checks for uaccess */
+		if (!check_user_ptr_write(zc->buf, zc->len))
+			return -EFAULT;
AFAICT, this change isn't related to gup.
+
 		ret = io_import_fixed(WRITE, &msg.msg_iter, req->imu,
 				      user_ptr_addr(zc->buf), zc->len);
 		if (unlikely(ret))
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 9e716fef91d7..285938fcf119 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1285,7 +1285,12 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 		return 0;
 
 	ret = -ENOMEM;
-	/* TODO [PCuABI] - capability checks for uaccess */
+
+	if (!check_user_ptr_rw(iov->iov_base, iov->iov_len)) {
+		ret = -EFAULT;
+		goto done;
+	}
Neither pages nor imu is allocated at this point, so you can just return -EFAULT really.
Related to this instance and the io_uring/kbuf.c one, I think it would be better to add the check in io_pin_pages (a function defined by io_uring). Only that function calls pin_user_pages, the actual GUP function. Also, I don't think checking iov_len is sufficient if it doesn't represent an integer number of pages. What if the last page grabbed contains an area outside of the capability bounds...
+
 	pages = io_pin_pages(user_ptr_addr(iov->iov_base), iov->iov_len,
 			     &nr_pages);
 	if (IS_ERR(pages)) {
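To make the idea concrete, a rough sketch (the wrapper name and placement are illustrative only, untested): a checked variant taking the user capability itself, so the check sits right next to the only io_uring call site of pin_user_pages():

	/* Illustrative only: validate the capability, then reuse the
	 * existing address-based io_pin_pages(). */
	static struct page **io_pin_pages_checked(void __user *ubuf,
						  unsigned long len, int *npages)
	{
		if (!check_user_ptr_rw(ubuf, len))
			return ERR_PTR(-EFAULT);

		return io_pin_pages(user_ptr_addr(ubuf), len, npages);
	}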
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index a8e76cf06da7..4db910f1758d 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -675,7 +675,9 @@ BPF_CALL_5(bpf_copy_from_user_task, void *, dst, u32, size,
 	if (unlikely(!size))
 		return 0;
 
-	/* TODO [PCuABI] - capability checks for uaccess */
+	if (!check_user_ptr_read(user_ptr, size))
+		return -EFAULT;
Is this change related to gup?
+
 	ret = access_process_vm(tsk, user_ptr_addr(user_ptr), dst, size, 0);
 	if (ret == size)
 		return 0;
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 9613080ccf0c..04289dc13b4a 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -226,8 +226,6 @@ int get_futex_key(u32 __user *uaddr, bool fshared, union futex_key *key,
 	struct address_space *mapping;
 	int err, ro = 0;
 
-	/* TODO [PCuABI] - capability checks for uaccess */
-
 	/*
 	 * The futex address must be "naturally" aligned.
 	 */
@@ -239,6 +237,12 @@ int get_futex_key(u32 __user *uaddr, bool fshared, union futex_key *key,
 	if (unlikely(!access_ok(uaddr, sizeof(u32))))
 		return -EFAULT;
 
+	if (rw == FUTEX_READ && !check_user_ptr_read(uaddr, sizeof(u32)))
+		return -EFAULT;
+
+	if (rw == FUTEX_WRITE && !check_user_ptr_rw(uaddr, sizeof(u32)))
+		return -EFAULT;
+
 	if (unlikely(should_fail_futex(fshared)))
 		return -EFAULT;
@@ -413,7 +417,8 @@ int fault_in_user_writeable(u32 __user *uaddr)
 	struct mm_struct *mm = current->mm;
 	int ret;
 
-	/* TODO [PCuABI] - capability checks for uaccess */
+	if (!check_user_ptr_rw(uaddr, PAGE_SIZE))
+		return -EFAULT;
Just curious, does uaddr here really have bounds for an entire PAGE_SIZE? Following the code a bit: in handling the fault it can allocate the page that uaddr is part of, but this doesn't mean that uaddr is the beginning of a page. If you have a look at
https://git.morello-project.org/Sevenarth/linux/-/blob/morello/master/mm/mem...
It takes the starting address of the page and uses that afterwards. So if a check is needed here, then it should be something like check_user_ptr_rw(uaddr & PAGE_MASK, PAGE_SIZE)
I am doubtful however that we need to perform this check here. To me it looks like the kernel really needs an address to allocate a page, and doesn't dereference it. Moreover, I would find it bizarre that uaddr has a capability length larger than sizeof(u32). Or am I missing something...
 	mmap_read_lock(mm);
 	ret = fixup_user_fault(mm, user_ptr_addr(uaddr),
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 2d74d8d00ad9..046915cc1562 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1481,10 +1481,16 @@ static unsigned long first_iovec_segment(const struct iov_iter *i, size_t *size)
 {
 	size_t skip;
 	long k;
+	bool is_rw = iov_iter_rw(i) != WRITE;
I can see this comes from __iov_iter_get_pages_alloc, but it still gives me a headache figuring out why it is like this...
+
+	if (iter_is_ubuf(i)) {
+		if (is_rw && !check_user_ptr_rw(i->ubuf, *size))
+			return -EFAULT;
+		if (!is_rw && !check_user_ptr_read(i->ubuf, *size))
+			return -EFAULT;
 
-	if (iter_is_ubuf(i))
-		/* TODO [PCuABI] - capability checks for uaccess */
 		return user_ptr_addr(i->ubuf) + i->iov_offset;
+	}
 
 	for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
 		size_t len = i->iov[k].iov_len - skip;
@@ -1493,7 +1499,13 @@ static unsigned long first_iovec_segment(const struct iov_iter *i, size_t *size)
 			continue;
 		if (*size > len)
 			*size = len;
-		/* TODO [PCuABI] - capability checks for uaccess */
+		if (is_rw && !check_user_ptr_rw(i->iov[k].iov_base + skip,
+						*size))
+			return -EFAULT;
+		if (!is_rw && !check_user_ptr_read(i->iov[k].iov_base + skip,
+						   *size))
+			return -EFAULT;
+
 		return user_ptr_addr(i->iov[k].iov_base) + skip;
 	}
 	BUG(); // if it had been empty, we wouldn't get called
@@ -1539,6 +1551,9 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		gup_flags |= FOLL_NOFAULT;
 
 	addr = first_iovec_segment(i, &maxsize);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	*start = addr % PAGE_SIZE;
 	addr &= PAGE_MASK;
 	n = want_pages_array(pages, maxsize, *start, maxpages);
diff --git a/mm/gup.c b/mm/gup.c
index dc02749eaf9b..197459c58cd1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1843,11 +1843,15 @@ EXPORT_SYMBOL(fault_in_subpage_writeable);
  */
size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
 {
-	/* TODO [PCuABI] - capability checks for uaccess */
-	unsigned long start = user_ptr_addr(uaddr), end;
+	unsigned long start, end;
 	struct mm_struct *mm = current->mm;
 	bool unlocked = false;
 
+	if (!check_user_ptr_read(uaddr, size))
+		return 0;
+
+	start = user_ptr_addr(uaddr);
+
 	if (unlikely(size == 0))
 		return 0;
 	end = PAGE_ALIGN(start + size);
Some comments and questions only. I am not convinced that the changes here are all related to gup; some look like various other types of uaccess. Maybe split the patches and leave here only the gup-related ones?
Thanks, Tudor
Hi Tudor,
Thank you for the great review. I have missed quite a few things, as you pointed out, so well spotted!
Regarding all the changes being squashed into one single commit, this was actually intentional, as I was not certain what kind of splitting is preferred in this case. The preference has indeed been voiced, as I expected, so thank you for that. Will split them!
The only relation to the GUP is that the checks are made whenever the GUP functions are involved. Except for the one change in the GUP itself, of course.
On 03/05/2023 18:12, Tudor Cretu wrote:
Related to this instance and the io_uring/kbuf.c one, I think it would be better to add the check in io_pin_pages (a function defined by io_uring). Only that function calls pin_user_pages, the actual GUP function. Also, I don't think checking iov_len is sufficient if it doesn't represent an integer number of pages. What if the last page grabbed contains an area outside of the capability bounds...
This is a rather complicated one. The point of view I am following here is that we are only checking that the address space for the user data is correctly delimited by the user capability. While in this case the kernel effectively grabs all the interested pages, userspace certainly won't gain access to them. Therefore, this is a matter of making sure that the kernel itself does not access parts of pages which are not covered by the provided capability bounds. If we care about this now, how could we tackle it then?
I can see this comes from __iov_iter_get_pages_alloc, but it still gives me a headache figuring out why it is like this...
I am there with you. I personally don't understand it either. I was indeed hoping that somebody would shed some light on it.
Best,
Luca
On 11/05/2023 16:34, Luca Vizzarro wrote:
On 03/05/2023 18:12, Tudor Cretu wrote:
Related to this instance and the io_uring/kbuf.c one, I think it would be better to add the check in io_pin_pages (a function defined by io_uring). Only that function calls pin_user_pages, the actual GUP function. Also, I don't think checking iov_len is sufficient if it doesn't represent an integer number of pages. What if the last page grabbed contains an area outside of the capability bounds...
This is a rather complicated one. The point of view I am following here is that we are only checking that the address space for the user data is correctly delimited by the user capability. While in this case the kernel effectively grabs all the interested pages, userspace certainly won't gain access to them. Therefore, this is a matter of making sure that the kernel itself does not access parts of pages which are not covered by the provided capability bounds. If we care about this now, how could we tackle it then?
On where to do the checking: as a rule, we should do it as generically as possible, so in the helper and not the caller if possible. Whether we can do that here is still unclear to me.
On the bound checking: I do think that checking iov_len is sufficient. What matters is that only the [iov_base, iov_base + iov_len] area is accessed by the kernel on behalf of userspace, and AFAICT this is the case. The fact that the kernel needs to pin entire pages is an implementation detail that should remain invisible to userspace.
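To illustrate with the usual GUP arithmetic (sketch only, not actual patch code):

	unsigned long start = user_ptr_addr(iov->iov_base);
	unsigned long offset = start & ~PAGE_MASK;
	unsigned long nr_pages = DIV_ROUND_UP(offset + iov->iov_len, PAGE_SIZE);

	/*
	 * The pin covers nr_pages whole pages, but every access the kernel
	 * makes on behalf of userspace stays within
	 * [start, start + iov->iov_len), which is exactly the range that
	 * check_user_ptr_rw(iov->iov_base, iov->iov_len) validates. The
	 * slack at either end of the first/last page is never touched
	 * through the user's capability.
	 */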
I can see this comes from __iov_iter_get_pages_alloc, but it still gives me a headache figuring out why it is like this...
I am there with you. I personally don't understand it either. I was indeed hoping that somebody would shed some light on it.
That one made me twitch too. Took me a while to figure it out, but in fact it makes perfect sense, if you think about it from the right angle! iov_iter_rw() returns the direction of *data_source*. What the iterator does is on the other side, hence the direction is reversed. Take for instance this line in new_sync_read():
iov_iter_ubuf(&iter, READ, buf, len);
Here READ means "we're reading a file", and as a consequence we are *writing* to the specified buffer. For that reason even simple syscalls like read/write are confusing in terms of data direction...
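Mapping that back to the checks in first_iovec_segment(), the logic can be read as follows (sketch only, equivalent to what the patch does with is_rw):

	bool ok;

	/* iov_iter_rw(i) is the direction of the data *source*. */
	if (iov_iter_rw(i) == READ) {
		/* Reading a file: the user buffer is the destination,
		 * so it must be writeable. */
		ok = check_user_ptr_rw(i->ubuf, *size);
	} else {
		/* Writing a file: the user buffer is only read from. */
		ok = check_user_ptr_read(i->ubuf, *size);
	}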
Kevin
On 12/05/2023 18:29, Kevin Brodsky wrote:
I can see this comes from __iov_iter_get_pages_alloc, but it still gives me a headache figuring out why it is like this...
I am there with you. I personally don't understand it either. I was indeed hoping that somebody would shed some light on it.
That one made me twitch too. Took me a while to figure it out, but in fact it makes perfect sense, if you think about it from the right angle! iov_iter_rw() returns the direction of *data_source*. What the iterator does is on the other side, hence the direction is reversed. Take for instance this line in new_sync_read():
iov_iter_ubuf(&iter, READ, buf, len);
Here READ means "we're reading a file", and as a consequence we are *writing* to the specified buffer. For that reason even simple syscalls like read/write are confusing in terms of data direction...
FWIW we're not the only ones getting confused by all this; in the meantime a patch by Al Viro [1] renaming all such uses of READ/WRITE to ITER_{DEST,SOURCE} appeared in mainline. The discussion with Torvalds under that patch is worth a glance.
It might make sense to cherry-pick that patch as part of this series, to avoid having to replace all the READ/WRITE in your changes next time we rebase.
Kevin
[1] https://lore.kernel.org/all/20221028023352.3532080-12-viro@zeniv.linux.org.u...
On 03/05/2023 16:17, Luca Vizzarro wrote:
Whenever a GUP call is made, the page address for the lookup is a 64-bit long raw pointer. When working in PCuABI, this means that the metadata of the capability gets discarded, hence any access made by the GUP is not checked in hardware.
This commit introduces explicit capability checks whenever a call to the current mm through the GUP functions is made.
Signed-off-by: Luca Vizzarro Luca.Vizzarro@arm.com
Hello,
this patch adds explicit capability checks needed in conjunction with GUP calls. Submitting only for review, not for merging.
In order for this patch to work, Kevin's user pointer helpers patch is required, in addition to a modification that I have suggested in his series thread.
Another essential required change is actually porting the futex_waitv syscall to PCuABI. Otherwise this patch will be a BREAKING change and LTP will fail. LTP was run successfully against this patch ONLY with the minimum required changes for futex_waitv to work, which are still a WIP.
The ticket related to this patch is #7: https://git.morello-project.org/morello/kernel/linux/-/issues/7
This patch can be found on the following branch, which contains some futex_waitv changes, Kevin's series and its modification: https://git.morello-project.org/Sevenarth/linux/-/commits/morello/gup-checks
Best, Luca
 drivers/usb/core/devio.c |  8 ++++++--
 io_uring/kbuf.c          |  5 ++++-
 io_uring/net.c           |  4 +++-
 io_uring/rsrc.c          |  7 ++++++-
 kernel/bpf/helpers.c     |  4 +++-
 kernel/futex/core.c      | 11 ++++++++---
 lib/iov_iter.c           | 21 ++++++++++++++++++---
 mm/gup.c                 |  8 ++++++--
 8 files changed, 54 insertions(+), 14 deletions(-)
I haven't reviewed the patch thoroughly but it's a good start. Just a few comments below on what caught my eyes. I also second Tudor's suggestions, especially in terms of splitting the patches (there's no reason not to split them by subsystem here).
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 70056c27d778..d6e6227caab0 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -587,7 +587,10 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 	pages_size = io_in_compat64(ctx) ?
 		size_mul(sizeof(struct compat_io_uring_buf), reg.ring_entries) :
 		size_mul(sizeof(struct io_uring_buf), reg.ring_entries);
-	/* TODO [PCuABI] - capability checks for uaccess */
+
+	if (!check_user_ptr_rw((void __user *)reg.ring_addr, pages_size))
Maybe a question both for you and Tudor: do we actually have any idea whether the operation that will be performed on the registered buffer is R or W at this point? If not this is a pretty serious issue, as we must not reject say a read-only pointer if the buffer is only ever read from, not written to. The same question applies to all users of io_pin_pages(); for io_import_fixed() it looks like the prototype already tells us about that.
+		return -EFAULT;
+
 	pages = io_pin_pages(reg.ring_addr, pages_size, &nr_pages);
 	if (IS_ERR(pages)) {
 		kfree(free_bl);
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
@@ -413,7 +417,8 @@ int fault_in_user_writeable(u32 __user *uaddr)
 	struct mm_struct *mm = current->mm;
 	int ret;
 
-	/* TODO [PCuABI] - capability checks for uaccess */
+	if (!check_user_ptr_rw(uaddr, PAGE_SIZE))
Following up on Tudor's comments, the size to require here is tricky. The purpose of this function is to bring in a (writeable) page in order to retry the atomic operation. This does not however mean that the pointer needs to be able to access the entire page. In fact, AFAICT the pointer we are manipulating here is only ever used to access a futex, i.e. a u32. So the required size should probably be that of a u32.
Maybe some elaboration is required here on why we need a check at all. In itself causing a page fault should have no functional effect in userspace, so from a capability model perspective we don't really need an explicit check. However, it is essential, for another reason: breaking the loop between futex_cmpxchg_value_locked() (or similar) when it returns -EFAULT, and fault_in_user_writeable() when it succeeds. The explicit check ensures that when the atomic operation faults, we figure out here that this was due to the pointer itself and break the loop, irrespective of the underlying page. Probably worth a comment in this function.
+		return -EFAULT;
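Putting those two points together, the function could end up looking something like this (sketch only, assuming the rest of the function stays unchanged):

	int fault_in_user_writeable(u32 __user *uaddr)
	{
		struct mm_struct *mm = current->mm;
		int ret;

		/*
		 * Not needed for the capability model itself (faulting a
		 * page in has no user-visible effect), but required to break
		 * the loop described above: if the atomic op faults because
		 * of the capability, fail here too instead of succeeding
		 * forever on the underlying page.
		 */
		if (!check_user_ptr_rw(uaddr, sizeof(u32)))
			return -EFAULT;

		mmap_read_lock(mm);
		ret = fixup_user_fault(mm, user_ptr_addr(uaddr),
				       FAULT_FLAG_WRITE, NULL);
		mmap_read_unlock(mm);

		return ret < 0 ? ret : 0;
	}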
diff --git a/mm/gup.c b/mm/gup.c
index dc02749eaf9b..197459c58cd1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1843,11 +1843,15 @@ EXPORT_SYMBOL(fault_in_subpage_writeable);
  */
size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
 {
-	/* TODO [PCuABI] - capability checks for uaccess */
-	unsigned long start = user_ptr_addr(uaddr), end;
+	unsigned long start, end;
 	struct mm_struct *mm = current->mm;
 	bool unlocked = false;
 
+	if (!check_user_ptr_read(uaddr, size))
check_user_ptr_write() surely? That's the main reason why you need the change in prototype in the patch adding the helpers AIUI.
Kevin
On 08-05-2023 08:04, Kevin Brodsky wrote:
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
@@ -587,7 +587,10 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 	pages_size = io_in_compat64(ctx) ?
 		size_mul(sizeof(struct compat_io_uring_buf), reg.ring_entries) :
 		size_mul(sizeof(struct io_uring_buf), reg.ring_entries);
-	/* TODO [PCuABI] - capability checks for uaccess */
+
+	if (!check_user_ptr_rw((void __user *)reg.ring_addr, pages_size))
Maybe a question both for you and Tudor: do we actually have any idea whether the operation that will be performed on the registered buffer is R or W at this point? If not this is a pretty serious issue, as we must not reject say a read-only pointer if the buffer is only ever read from, not written to. The same question applies to all users of io_pin_pages(); for io_import_fixed() it looks like the prototype already tells us about that.
IIUC, the registered buffers are only accessible by SQE requests that read or receive data from a file or socket, so it should be enough to check only for read permission here.
For the io_pin_pages() in rsrc.c, they could be used for both read and write operations as far as I can tell...
On 10/05/2023 13:52, Tudor Cretu wrote:
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
@@ -587,7 +587,10 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 	pages_size = io_in_compat64(ctx) ?
 		size_mul(sizeof(struct compat_io_uring_buf), reg.ring_entries) :
 		size_mul(sizeof(struct io_uring_buf), reg.ring_entries);
-	/* TODO [PCuABI] - capability checks for uaccess */
+	if (!check_user_ptr_rw((void __user *)reg.ring_addr, pages_size))
Maybe a question both for you and Tudor: do we actually have any idea whether the operation that will be performed on the registered buffer is R or W at this point? If not this is a pretty serious issue, as we must not reject say a read-only pointer if the buffer is only ever read from, not written to. The same question applies to all users of io_pin_pages(); for io_import_fixed() it looks like the prototype already tells us about that.
IIUC, the registered buffers are only accessible by SQE requests that read or receive data from a file or socket, so it should be enough to check only for read permission here.
I think you're right, though it's the other way round: read operations write to the provided buffer! FWIW the definitive list of operations allowed to use IOSQE_BUFFER_SELECT can be found in opdef.c (.buffer_select must be set to 1); there are 4 currently and all are reads. Which I suppose makes sense given the use-case.
For the io_pin_pages() in rsrc.c, they could be used for both read and write operations as far as I can tell...
That seems to be the case; in fact AFAICT this can only be used together with IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED.
Overall I feel like we need to have a deeper think about this. There are multiple situations to consider, and multiple places where the checking could happen. __io_import_iovec() summarises the multiple situations quite well.
From what I've gathered:

- For IORING_REGISTER_BUFFERS and friends, it's fairly straightforward: they can only be used together with IORING_OP_{READ,WRITE}_FIXED. Crucially, these requests are still required to specify addr and len, within the registered region. One option here would be to check only when processing the operation, not at the point of registration, since the latter can only be used through the former.

- For IORING_REGISTER_PBUF_RING and friends, the situation is different. They can be used with operations specifying IOSQE_BUFFER_SELECT (4 allowed, see above), but here the operation does not specify the addr / len. We could still do the checking when processing the operation, since io_buffer_select() returns the original user pointer. It could be better to do it at the point of registration instead though (which is possible permission-wise because we know the operation will write to the buffer, as per above).
I'm sure I've missed something, so I'd be very grateful if someone could dig deeper into this and figure out all the angles. We should come up with a consistent approach, for instance checking as much as possible as soon as possible (when registering) or checking only at the point of processing the operation, not at the point of registration.
Kevin
On 12-05-2023 18:24, Kevin Brodsky wrote:
On 10/05/2023 13:52, Tudor Cretu wrote:
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
@@ -587,7 +587,10 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 	pages_size = io_in_compat64(ctx) ?
 		size_mul(sizeof(struct compat_io_uring_buf), reg.ring_entries) :
 		size_mul(sizeof(struct io_uring_buf), reg.ring_entries);
-	/* TODO [PCuABI] - capability checks for uaccess */
+	if (!check_user_ptr_rw((void __user *)reg.ring_addr, pages_size))
Maybe a question both for you and Tudor: do we actually have any idea whether the operation that will be performed on the registered buffer is R or W at this point? If not this is a pretty serious issue, as we must not reject say a read-only pointer if the buffer is only ever read from, not written to. The same question applies to all users of io_pin_pages(); for io_import_fixed() it looks like the prototype already tells us about that.
IIUC, the registered buffers are only accessible by SQE requests that read or receive data from a file or socket, so it should be enough to check only for read permission here.
I think you're right, though it's the other way round: read operations write to the provided buffer! FWIW the definitive list of operations allowed to use IOSQE_BUFFER_SELECT can be found in opdef.c (.buffer_select must be set to 1); there are 4 currently and all are reads. Which I suppose makes sense given the use-case.
This is also what I deduced. Also, thank you for the correction, that's right, these need to have write permissions.
For the io_pin_pages() in rsrc.c, they could be used for both read and write operations as far as I can tell...
That seems to be the case; in fact AFAICT this can only be used together with IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED.
I've reached the same conclusion.
Overall I feel like we need to have a deeper think about this. There are multiple situations to consider, and multiple places where the checking could happen. __io_import_iovec() summarises the multiple situations quite well.
From what I've gathered:
- For IORING_REGISTER_BUFFERS and friends, it's fairly straightforward: they can only be used together with IORING_OP_{READ,WRITE}_FIXED. Crucially, these requests are still required to specify addr and len, within the registered region. One option here would be to check only when processing the operation, not at the point of registration, since the latter can only be used through the former.
I think doing the check only when processing the operation is fine. But we should leave a comment at the registration point on why there is no need to check.
- For IORING_REGISTER_PBUF_RING and friends, the situation is different. They can be used with operations specifying IOSQE_BUFFER_SELECT (4 allowed, see above), but here the operation does not specify the addr / len. We could still do the checking when processing the operation, since io_buffer_select() returns the original user pointer. It could be better to do it at the point of registration instead though (which is possible permission-wise because we know the operation will write to the buffer, as per above).
I think just doing the check at the registration point sounds good.
I'm sure I've missed something, so I'd be very grateful if someone could dig deeper into this and figure out all the angles. We should come up with a consistent approach, for instance checking as much as possible as soon as possible (when registering) or checking only at the point of processing the operation, not at the point of registration.
I've gone through the code and I confirm that your summary above is exhaustive and... exceptionally concise, well done!
I agree that a consistent approach would be nice, but I don't really think it's worth complicating the code in the case of IORING_REGISTER_BUFFERS or double-checking in the case of IORING_REGISTER_PBUF_RING. As long as all the possible registration-operation paths are checked and the decisions are documented, it should be enough. After all, these are exceptions to the standard... This is not a strong opinion, of course.
Thanks, Tudor
Kevin
On 15/05/2023 17:05, Tudor Cretu wrote:
For the io_pin_pages() in rsrc.c, they could be used for both read and write operations as far as I can tell...
That seems to be the case; in fact AFAICT this can only be used together with IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED.
I've reached the same conclusion.
Overall I feel like we need to have a deeper think about this. There are multiple situations to consider, and multiple places where the checking could happen. __io_import_iovec() summarises the multiple situations quite well.
From what I've gathered:
- For IORING_REGISTER_BUFFERS and friends, it's fairly straightforward: they can only be used together with IORING_OP_{READ,WRITE}_FIXED. Crucially, these requests are still required to specify addr and len, within the registered region. One option here would be to check only when processing the operation, not at the point of registration, since the latter can only be used through the former.
I think doing the check only when processing the operation is fine. But we should leave a comment at the registration point on why there is no need to check.
- For IORING_REGISTER_PBUF_RING and friends, the situation is different. They can be used with operations specifying IOSQE_BUFFER_SELECT (4 allowed, see above), but here the operation does not specify the addr / len. We could still do the checking when processing the operation, since io_buffer_select() returns the original user pointer. It could be better to do it at the point of registration instead though (which is possible permission-wise because we know the operation will write to the buffer, as per above).
I think just doing the check at the registration point sounds good.
I'm sure I've missed something, so I'd be very grateful if someone could dig deeper into this and figure out all the angles. We should come up with a consistent approach, for instance checking as much as possible as soon as possible (when registering) or checking only at the point of processing the operation, not at the point of registration.
I've gone through the code and I confirm that your summary above is exhaustive and... exceptionally concise, well done!
I agree that a consistent approach would be nice, but I don't really think it's worth complicating the code in the case of IORING_REGISTER_BUFFERS or double-checking in the case of IORING_REGISTER_PBUF_RING. As long as all the possible registration-operation paths are checked and the decisions are documented, it should be enough. After all, these are exceptions to the standard... This is not a strong opinion, of course.
Thanks for going through that code too! I'd tend to agree, some asymmetry is probably better than shoehorning checks in this case.
Having had another look though, I think there is more complexity than just what I've mentioned. Grepping buf_index in io_uring/ gives a better idea of all users of registered buffers, and there are indeed more cases:

- IORING_OP_SEND_ZC with IORING_RECVSEND_FIXED_BUF. This one should be no different from IORING_OP_WRITE_FIXED, with the same check inserted.

- IORING_OP_PROVIDE_BUFFERS. I suppose it's equivalent to IORING_REGISTER_PBUF_RING and we should therefore perform the check there?

- IORING_OP_URING_CMD with IORING_URING_CMD_FIXED. Undocumented so even more fun! Should be the same as IORING_OP_{READ,WRITE}_FIXED, though which exactly, I'm not sure... io_uring_cmd_import_fixed() looks relevant, though I'm not sure it's the only way to use a fixed buffer in this case.
Something that makes it quite hard to understand it all is that although buf_group has been put in a union with buf_index to distinguish the two cases, the kernel often uses buf_index to mean buf_group... And of course the man page being out of date in multiple ways is not helping.
Hopefully that's all of them found now, but a second look is definitely appreciated.
Kevin
On 17-05-2023 09:38, Kevin Brodsky wrote:
On 15/05/2023 17:05, Tudor Cretu wrote:
For the io_pin_pages() in rsrc.c, they could be used for both read and write operations as far as I can tell...
That seems to be the case; in fact AFAICT this can only be used together with IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED.
I've reached the same conclusion.
Overall I feel like we need to have a deeper think about this. There are multiple situations to consider, and multiple places where the checking could happen. __io_import_iovec() summarises the multiple situations quite well.
From what I've gathered:
- For IORING_REGISTER_BUFFERS and friends, it's fairly straightforward: they can only be used together with IORING_OP_{READ,WRITE}_FIXED. Crucially, these requests are still required to specify addr and len, within the registered region. One option here would be to check only when processing the operation, not at the point of registration, since the latter can only be used through the former.
I think doing the check only when processing the operation is fine. But we should leave a comment at the registration point on why there is no need to check.
- For IORING_REGISTER_PBUF_RING and friends, the situation is different. They can be used with operations specifying IOSQE_BUFFER_SELECT (4 allowed, see above), but here the operation does not specify the addr / len. We could still do the checking when processing the operation, since io_buffer_select() returns the original user pointer. It could be better to do it at the point of registration instead though (which is possible permission-wise because we know the operation will write to the buffer, as per above).
I think just doing the check at the registration point sounds good.
I'm sure I've missed something, so I'd be very grateful if someone could dig deeper into this and figure out all the angles. We should come up with a consistent approach, for instance checking as much as possible as soon as possible (when registering) or checking only at the point of processing the operation, not at the point of registration.
I've gone through the code and I confirm that your summary above is exhaustive and... exceptionally concise, well done!
I agree that a consistent approach would be nice, but I don't really think it's worth complicating the code in the case of IORING_REGISTER_BUFFERS or double-checking in the case of IORING_REGISTER_PBUF_RING. As long as all the possible registration-operation paths are checked and the decisions are documented, it should be enough. After all, these are exceptions to the standard... This is not a strong opinion, of course.
Thanks for going through that code too! I'd tend to agree, some asymmetry is probably better than shoehorning checks in this case.
Having had another look though, I think there is more complexity than just what I've mentioned. Grepping buf_index in io_uring/ gives a better idea of all users of registered buffers, and there are indeed more cases:
Great finds! I've gone through liburing as well this time, and IORING_OP_SEND_ZC with IORING_RECVSEND_FIXED_BUF and IORING_OP_URING_CMD with IORING_URING_CMD_FIXED seem to be the only cases besides IORING_OP_{READ,WRITE}_FIXED that can use buffers registered with IORING_REGISTER_BUFFERS/IORING_REGISTER_BUFFERS2/IORING_REGISTER_BUFFERS_UPDATE.
- IORING_OP_SEND_ZC with IORING_RECVSEND_FIXED_BUF. This one should be no different from IORING_OP_WRITE_FIXED, with the same check inserted.
This looks correct from what I can tell.
- IORING_OP_PROVIDE_BUFFERS. I suppose it's equivalent to
IORING_REGISTER_PBUF_RING and we should therefore perform the check there?
I noticed this one yesterday, and I agree, it is part of the "IORING_REGISTER_PBUF_RING and friends" group and there should be a check there.
- IORING_OP_URING_CMD with IORING_URING_CMD_FIXED. Undocumented so even
more fun! Should be the same as IORING_OP_{READ,WRITE}_FIXED, though which exactly, I'm not sure... io_uring_cmd_import_fixed() looks relevant, though I'm not sure it's the only way to use a fixed buffer in this case.
IIUC, io_uring_cmd_import_fixed() is the only (intended) way to use the fixed buffer for IORING_OP_URING_CMD requests. struct io_uring_cmd doesn't have the buffer exposed, only struct io_kiocb does. And the only place where the buffer (`imu` field from struct io_kiocb) is accessed is from io_uring_cmd_import_fixed().
Tudor
Something that makes it quite hard to understand it all is that although buf_group has been put in a union with buf_index to distinguish the two cases, the kernel often uses buf_index to mean buf_group... And of course the man page being out of date in multiple ways is not helping.
On 17/05/2023 16:54, Tudor Cretu wrote:
On 17-05-2023 09:38, Kevin Brodsky wrote:
[...]
Having had another look though, I think there is more complexity than just what I've mentioned. Grepping buf_index in io_uring/ gives a better idea of all users of registered buffers, and there are indeed more cases:
Great finds! I've gone through liburing as well this time, and IORING_OP_SEND_ZC with IORING_RECVSEND_FIXED_BUF and IORING_OP_URING_CMD with IORING_URING_CMD_FIXED seem to be the only cases besides IORING_OP_{READ,WRITE}_FIXED that can use buffers registered with IORING_REGISTER_BUFFERS/IORING_REGISTER_BUFFERS2/IORING_REGISTER_BUFFERS_UPDATE.
Thanks for double-checking!
[...]
- IORING_OP_URING_CMD with IORING_URING_CMD_FIXED. Undocumented so even
more fun! Should be the same as IORING_OP_{READ,WRITE}_FIXED, though which exactly, I'm not sure... io_uring_cmd_import_fixed() looks relevant, though I'm not sure it's the only way to use a fixed buffer in this case.
IIUC, io_uring_cmd_import_fixed() is the only (intended) way to use the fixed buffer for IORING_OP_URING_CMD requests. struct io_uring_cmd doesn't have the buffer exposed, only struct io_kiocb does. And the only place where the buffer (`imu` field from struct io_kiocb) is accessed is from io_uring_cmd_import_fixed().
Alright, that makes sense; thanks for the confirmation. In that case we should also have a check if IORING_URING_CMD_FIXED is passed when processing IORING_OP_URING_CMD. It should be relatively straightforward to implement, as we automatically get that check if we add it to io_import_fixed() (which knows the R/W direction). That requires changing the prototype of io_uring_cmd_import_fixed() to take a void __user *, but fortunately only the NVMe driver makes use of that functionality at the moment (and the change there should be quite small). We do not build this driver yet, so no immediate action is required there.
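To illustrate, the prototype change would be roughly the following (a sketch from memory, so the exact parameter list may differ):

-int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
-			       struct iov_iter *iter, struct io_uring_cmd *ioucmd);
+int io_uring_cmd_import_fixed(void __user *ubuf, unsigned long len, int rw,
+			       struct iov_iter *iter, struct io_uring_cmd *ioucmd);

That way the full capability travels down to io_import_fixed(), and the NVMe ioctl path would simply pass its user pointer through instead of the raw address.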
Kevin
On 12/05/2023 19:24, Kevin Brodsky wrote:
- For IORING_REGISTER_PBUF_RING and friends, the situation is different.
They can be used with operations specifying IOSQE_BUFFER_SELECT (4 allowed, see above), but here the operation does not specify the addr / len. We could still do the checking when processing the operation, since io_buffer_select() returns the original user pointer. It could be better to do it at the point of registration instead though (which is possible permission-wise because we know the operation will write to the buffer, as per above).
I'm sure I've missed something [...]
... and indeed I have. A pretty big thing in fact! IORING_REGISTER_PBUF_RING is different from IORING_OP_PROVIDE_BUFFERS (addressed further down the thread) in an important way: while the latter specifies N consecutive buffers "synchronously", the former in fact specifies *a ring* of struct io_uring_buf entries, each describing a buffer, and the contents of the ring are *not* read at the point of registration. A glance at io_buffer_select() makes the difference between the two paths very clear.
So for IORING_REGISTER_PBUF_RING, there are really two levels of checks. The first is checking whether we can read the ring via struct io_uring_buf_reg::ring_addr. This is straightforward and should be done at the point of registration. The second is checking whether the selected struct io_uring_buf::addr allows accessing the buffer, and clearly this one can only be done at the point of use, as the ring can be modified at any point by userspace.
This leads me to another confusion I created: there is in fact no need to explicitly check the pointers to the registered buffers. This is because such pointers returned by io_buffer_select() are still user pointers (void __user *), and explicit checking is only needed in the cases where we have to obtain the address for GUP or similar. It does happen in some cases, but it is orthogonal to these registration mechanisms.
This means that there is no good reason to check the pointer provided to IORING_OP_PROVIDE_BUFFERS either. We can just register the pointer, and then whoever calls io_buffer_select() might get back this pointer, which will eventually be checked like any other pointer provided by SQE::addr or similar.
The nice thing with this new approach is that we no longer have an inconsistency between the two approaches ({READ,WRITE}_FIXED and PROVIDE_BUFFERS / PBUF_RING): we now always check when processing the operation and never when registering buffers. Does that sound sensible?
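To make the "implicit check" point concrete, the consuming side looks roughly like this (a sketch based on my reading of io_buffer_select(); example_consume_selected() is a made-up helper, not something the patch needs to add):

static int example_consume_selected(struct io_kiocb *req,
				    unsigned int issue_flags, size_t len)
{
	void __user *buf;

	/*
	 * io_buffer_select() hands back a full user pointer: in PCuABI
	 * it keeps its capability metadata, so the eventual uaccess
	 * (or explicit check wherever GUP needs a raw address) covers
	 * it like any pointer taken from sqe->addr.
	 */
	buf = io_buffer_select(req, &len, issue_flags);
	if (!buf)
		return -ENOBUFS;

	/* ... the normal import/copy path follows, checked as usual ... */
	return 0;
}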
Kevin
On 13-06-2023 13:02, Kevin Brodsky wrote:
On 12/05/2023 19:24, Kevin Brodsky wrote:
[...]
I'm sure I've missed something [...]
... and indeed I have. A pretty big thing in fact! IORING_REGISTER_PBUF_RING is different from IORING_OP_PROVIDE_BUFFERS (addressed further down the thread) in an important way: while the latter specifies N consecutive buffers "synchronously", the former in fact specifies *a ring* of struct io_uring_buf entries, each describing a buffer, and the contents of the ring are *not* read at the point of registration. A glance at io_buffer_select() makes the difference between the two paths very clear.
Ah, indeed the whole point of registering a ring is so that userspace can add and manipulate registered buffers later as well. Great find!
[...]
The nice thing with this new approach is that we no longer have an inconsistency between the two approaches ({READ,WRITE}_FIXED and PROVIDE_BUFFERS / PBUF_RING): we now always check when processing the operation and never when registering buffers. Does that sound sensible?
It sounds good to me! Well done! :)
Tudor
On 08/05/2023 08:04, Kevin Brodsky wrote:
Following up on Tudor's comments, the size to require here is tricky. The purpose of this function is to bring in a (writeable) page in order to retry the atomic operation. This does not, however, mean that the pointer needs to be able to access the entire page. In fact, AFAICT the pointer we are manipulating here is only ever used to access a futex, i.e. a u32. So the required size should probably be that of a u32.
Maybe some elaboration is required here on why we need a check at all. In itself, causing a page fault has no functional effect in userspace, so from a capability model perspective we don't really need an explicit check. However, it is essential for another reason: breaking the loop between futex_cmpxchg_value_locked() (or similar) returning -EFAULT and fault_in_user_writeable() succeeding. The explicit check ensures that when the atomic operation faults, we figure out here that this was due to the pointer itself and break the loop, irrespective of the underlying page. Probably worth a comment in this function.
The required size is indeed u32; using PAGE_SIZE here was actually a mistake I didn't spot myself, so thank you for pointing it out! Good point about explaining why we apply a check there; a comment will be included in the next version of the patch.
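For reference, something along these lines (only a sketch, using the helpers from Kevin's series; the fixup_user_fault() call is unchanged from the current fault_in_user_writeable()):

static int fault_in_user_writeable(u32 __user *uaddr)
{
	struct mm_struct *mm = current->mm;
	int ret;

	/*
	 * Not needed for memory safety (faulting the page in has no
	 * user-visible effect), but essential to break the retry loop:
	 * if the atomic operation keeps faulting because the capability
	 * itself is invalid, faulting in the underlying page would
	 * otherwise succeed forever. Check the pointer for the size of
	 * a futex word.
	 */
	if (!check_user_ptr_write(uaddr, sizeof(u32)))
		return -EFAULT;

	mmap_read_lock(mm);
	ret = fixup_user_fault(mm, user_ptr_addr(uaddr),
			       FAULT_FLAG_WRITE, NULL);
	mmap_read_unlock(mm);

	return ret < 0 ? ret : 0;
}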
Luca