Hi,
This series introduces new user_ptr helpers to help in certain
uaccess-related situations. This is a follow-up to my previous series
"New CHERI API and separation of root capabilities"; the CHERI helpers
it introduced are used to implement the new generic user_ptr helpers in
PCuABI.
The new helpers are (see patch 1 for details):
* make_user_ptr_for_<perms>_uaccess(), to create user pointers in order
to perform uaccess, with appropriate bounds and permissions.
* check_user_ptr_<perms>(), to perform explicit checking of user
pointers.
This series does not actually make use of check_user_ptr_<perms>(),
rather it prepares the ground for implementing explicit checking when
user memory is accessed via kernel mappings [1].
The rest of the series (patch 2-9) is about converting existing uses of
uaddr_to_user_ptr_safe(), as it should now only be used for *providing*
user pointers to userspace, and not for uaccess.
After this series, the only remaining users of uaddr_to_user_ptr_safe()
are:
- fs/binfmt_elf.c to provide all the initial capabilities (stack,
AT_CHERI_*_CAP, etc.). uaddr_to_user_ptr_safe() is still used to write
the initial data on the stack too; it didn't seem worthwhile to
refactor this code as it is going to change anyway as part of [2]
and [3].
- mmap / mremap / shmat to return a valid capability.
To clarify which helper should be used in which situation, here are two
tables specifying the helper to use depending on whether the address is
specified by userspace or the kernel itself, and whether the pointer is
provided to userspace or used by the kernel itself.
*Before* this series:
+-----------------------------------+---------------------+--------------------------+
| Pointer for \ Address provided by | User | Kernel |
+===================================+=====================+==========================+
| User | - | uaddr_to_user_ptr_safe() |
+-----------------------------------+---------------------+--------------------------+
| Kernel (uaccess) | uaddr_to_user_ptr() | uaddr_to_user_ptr_safe() |
+-----------------------------------+---------------------+--------------------------+
*After* this series:
+-----------------------------------+---------------------+-------------------------------+
| Pointer for \ Address provided by | User | Kernel |
+===================================+=====================+===============================+
| User | - | uaddr_to_user_ptr_safe() |
+-----------------------------------+---------------------+-------------------------------+
| Kernel (uaccess) | uaddr_to_user_ptr() | make_user_ptr_*_for_uaccess() |
+-----------------------------------+---------------------+-------------------------------+
Eventually both uaddr_to_user_ptr() and uaddr_to_user_ptr_safe() should
disappear, the first thanks to userspace always providing full pointers
and the second being replaced by handcrafted code creating capabilities
in line with the PCuABI spec (whose bounds give access to only the
intended object and potentially padding).
Note that patch 1 and 4 were included in the first RFC of the CHERI API
series [4]. They remain broadly the same, but:
- make_privileged_user_ptr and check_user_ptr() have been renamed, and the
permissions are now specified by calling the right variant of the function
instead of passing a bitfield. They are now called respectively
make_user_ptr_for_<perms>_uaccess() and check_user_ptr_<perms>().
- The user_ptr documentation has been updated accordingly.
- The commit messages have been improved to reflect the overall
intention better.
Review branch:
https://git.morello-project.org/kbrodsky-arm/linux/-/commits/morello/user_p…
Rendered doc:
https://git.morello-project.org/kbrodsky-arm/linux/-/blob/morello/user_ptr_…
Thanks,
Kevin
[1] https://git.morello-project.org/morello/kernel/linux/-/issues/7
[2] https://git.morello-project.org/morello/kernel/linux/-/issues/19
[3] https://git.morello-project.org/morello/kernel/linux/-/issues/22
[4] https://op-lists.linaro.org/archives/list/linux-morello@op-lists.linaro.org…
Kevin Brodsky (9):
linux/user_ptr.h: Introduce uaccess-related helpers
fs/binfmt_elf: Create appropriate user pointer for uaccess
coredump: Create appropriate user pointer for uaccess
mm/memory: Create appropriate user pointer for uaccess
Revert "mm/hugetlb: Use appropriate user pointer conversions"
Revert "mm/shmem: Use appropriate user pointer conversions"
audit: Create appropriate user pointer for uaccess
perf: Avoid uaddr_to_user_ptr_safe() for arbitrary user address
arm64: Create appropriate user pointer for uaccess
Documentation/core-api/user_ptr.rst | 100 ++++++++++++++++++----------
arch/arm64/kernel/debug-monitors.c | 3 +-
arch/arm64/kernel/traps.c | 2 +-
fs/binfmt_elf.c | 14 ++--
fs/coredump.c | 4 +-
include/linux/user_ptr.h | 86 ++++++++++++++++++++++--
kernel/auditsc.c | 3 +-
kernel/events/internal.h | 2 +-
lib/user_ptr.c | 46 +++++++++++++
mm/hugetlb.c | 2 +-
mm/memory.c | 2 +-
mm/shmem.c | 2 +-
12 files changed, 216 insertions(+), 50 deletions(-)
--
2.38.1
Hi,
For the sake of clarity, I am starting this new thread to discuss the
interface we should adopt for the check_user_ptr interface, introduced
by the "New user_ptr helpers for uaccess" series [1.1]. This function is
meant to check that the specifier user pointer (capability in PCuABI)
allows a given range to be accessed in a given way. The range is
specified quite naturally by the address of the pointer + a separate
size argument. As to the nature of the access (read and/or write), as
suggested [2.2] during the review of the preliminary RFC, it is
currently specified in the function name itself: check_user_ptr_read(),
check_user_ptr_write(), check_user_ptr_rw().
I believe this is looking pretty good, but Luca's recent explicit check
patch [3] has shown that it is a little inconvenient in the cases where
the nature of the access is only known at runtime, often through an
integer set to either READ or WRITE. This led me to suggest [1.2] to use
these constants as an argument for check_user_ptr too, only to realise
later that it doesn't work, as it is not possible to use them together
(READ | WRITE is meaningless).
The question now is therefore the following: should we go back to
something like in the RFC [2.1], i.e. pass the access type as argument,
introducing a whole new type / constants for the access type? Or should
we stick to [1.1], i.e. the access type in the function name itself. At
this point I would tend to favour the latter option, as there are not so
many situations where the access type is only known at runtime (just 3
so far), but maybe this is short-sighted.
Opinions very welcome!
Thanks,
Kevin
[1.1]
https://op-lists.linaro.org/archives/list/linux-morello@op-lists.linaro.org…
[1.2]
https://op-lists.linaro.org/archives/list/linux-morello@op-lists.linaro.org…
[2.1]
https://op-lists.linaro.org/archives/list/linux-morello@op-lists.linaro.org…
[2.2]
https://op-lists.linaro.org/archives/list/linux-morello@op-lists.linaro.org…
[3]
https://op-lists.linaro.org/archives/list/linux-morello@op-lists.linaro.org…
Some io_uring operations' SQEs store user_data values in the addr2 field.
These don't need to be modified as they're not dereferenced by the kernel.
Reported-by: Kevin Brodsky <kevin.brodsky(a)arm.com>
Signed-off-by: Tudor Cretu <tudor.cretu(a)arm.com>
---
v2:
- Updated the comment that it's only propagated, not matched.
Review branch:
https://git.morello-project.org/tudcre01/linux/-/commits/morello/addr2_fix_…
---
io_uring/io_uring.h | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 5b4f0f298ad9..26cfe280b049 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -132,9 +132,22 @@ static inline void convert_compat64_io_uring_sqe(struct io_ring_ctx *ctx,
sqe->ioprio = READ_ONCE(compat_sqe->ioprio);
sqe->fd = READ_ONCE(compat_sqe->fd);
BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr2, addr);
- sqe->addr2 = (__kernel_uintptr_t)compat_ptr(READ_ONCE(compat_sqe->addr2));
- BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr, len);
+ /*
+ * Some opcodes set a user_data value in the addr2 field to propagate
+ * it as-is to the user_data field of a CQE. It's not dereferenced
+ * by the kernel, so don't modify it.
+ */
+ switch (sqe->opcode) {
+ case IORING_OP_POLL_REMOVE:
+ case IORING_OP_MSG_RING:
+ sqe->addr2 = (__kernel_uintptr_t)READ_ONCE(compat_sqe->addr2);
+ break;
+ default:
+ sqe->addr2 = (__kernel_uintptr_t)compat_ptr(READ_ONCE(compat_sqe->addr2));
+ break;
+ }
+ BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr, len);
/*
* Some opcodes set a user_data value in the addr field to be matched
* with a pre-existing IO event's user_data. It's not dereferenced by
--
2.34.1
With this patch capabilities provided to userspace to access argv and envp
are properly bounded with the permissions defined in the PCuABI spec.
This change is split between two files, as they handle two parts of the
process. This cover letter tries to explain the process surrounding the
argument strings, hopefully to facilitate reviewing the patch but also
to check my understanding, and the logic implemented by the patch.
Hopefully I can find some time to refine it after reviews, but in case
I cannot I hope the details will be useful.
# General patch comments
One thing that might be an issue is that I don't see how to update
fs/exec.c:bprm_stack_limits to handle this new process short of
reproducing it almost entirely inside the function.
There are warnings because elf_stack_put_user_cap expects uintcap_t and
cheri_build_user_cap builds a capability, but I didn't really know
how to handle that.
The function itself might be entirely unecessary.
checkpatch complains about elf_stack_put_user_cap, but it follows the
style of the other functions, and from missing blank lines which appear
to be there, so I'm not too sure what to do about it.
# Giving argument strings to userspace
Calling a new executable is done through execve, where userspace passes
the argv and envp pointer arrays to the kernel.
In fs/exec.c:do_execveat_common, the kernel creates a struct linux_binprm
which is used throughout the process of loading a new binary.
Among other things, it keeps track of the stack for the new executable,
both the memory space and its current top, which it allocates here.
The argument are copied to this new stack in fs/exec.c:copy_strings.
It does so from the top of the stack, starting with the last envp element
and ending with the first of argv: opposite to how it will be read later.
The function allocates pages as it goes, thus splitting the string in at
most page-sized chunks, and handles offsets where the page was already
partially used or the remainder of the string will not fill it.
This only puts the /strings/ themselves on the stack.
Later, in fs/binfmt_elf.c:create_elf_tables, the kernel puts the auxv,
argv and envp arrays themselves – thus the pointers to the strings put
on the stack earlier – on the stack.
It does so in the order expected by userspace: from bottom to top of the
strings, so from the first of argv to the last of envp.
# Making it CHERI
We thus have a potential issue: the moment where the strings themselves
are allocated is different from the place where the capabilities are
created, with limited ways to pass information in between.
However, one piece of information is available in both cases: the length
of the null-terminated string.
This way, we can get the same representable length from the length of the
string, which remains unchanged, while allocating the string *and* when
creating the capability.
Assuming we can properly align the strings in copy_strings, we don't need
more information to properly create a capability with exact bounds
in create_elf_tables.
Thankfully, we can re-use most of the machinery in copy_strings to do so.
If the representable length of a string differs from its real length, we
need to get the mask needed for aligning it for exact bounds.
Given this new length, we can compute the position where it would end up
without alignment and ALIGN_DOWN() as we are going downards the stack.
This gives the position where the string *must* start for exact capability
bounds. Thus, all the padding will be after the string.
However as we are going backwards, we need to put the padding first and
the string second.
The code already goes through the string chunk by chunk, handling smaller
chunks at page and argument boundaries.
We can re-use this by first going through the length of padding, and once
this is done go through the string as normal.
This guarantees the string will start at the properly aligned address with
the appropriate amount of padding behind the previous allocations.
We do need to handle this padding and to detect when strings might need
proper handling for exact bounds.
As there isn't padding normally, we can simply check if there is some.
If there is, get the representable length of the string and use it for
the exact capability bounds.
As all the padding is after the strings, if there was padding needed for
alignment the next argument will start in the padding. This is slow, but
just go through the zeroes until we find the real start of the argument
and go from there.
This is not really properly done in this revision, it is a remnant of
the first draft and as such will improperly flag arguments as needing
adjustment after one that needed.
This is known but should be simply fixable by moving this looping
through the zeroes after allocating the capability.
In fs/binfmt_elf.c:create_elf_tables argv and envp are handled exactly the
same but in different loops. I removed the comments from the envp section
for conciseness.
# Testing
To generate a big enough argument, you can use this dirty bash script:
bigarg=""; count=0; while [ $(echo $bigarg | wc -c) -le 20000 ]; do
bigarg=${bigarg}"==${count}==This_should_be_a_really_big_arg_string"
count=$(($count + 1))
done
It is a bit unnecessary and slow but it does allow easily checking that
it starts and ends in the proper places.
Without the patch, the kernel log should show a CPU fault if $bigarg is
passed as an argument or in the environment. With the patch, it should
work fine.
Hopefully this is useful for reviewing, and correct in the first place !
Review branch:
https://git.morello-project.org/Teo-CD/linux/-/tree/review-arg-str-resticte…
Commit itself:
https://git.morello-project.org/Teo-CD/linux/-/commit/a4d9fe880d098e9f9fe1e…
Thanks very much in advance !
Téo
Teo Couprie Diaz (1):
fs: Handle exact bounds for argv and envp
fs/binfmt_elf.c | 111 ++++++++++++++++++++++++++++++++++++++++++--
fs/exec.c | 66 ++++++++++++++++++++++++++
include/linux/elf.h | 4 ++
3 files changed, 178 insertions(+), 3 deletions(-)
--
2.34.1
This series of patches enables nfs rootfs
support on the Morello board.
Patch 01 is fixing the inital kernel build error
associated with a wrong function pointer type within
the sunrpc modules due to the unlocked_ioctl fp,
the error occurs upon enabling nfs within the defconfig.
Patch 02 deals with the fallout caused by changes
inferred by patches 01. See details in the description
of the patch.
Patch 03 is enabling nfs rootfs by default in the kernel.
It was confirmed that the kernel can boot with a nfs rootfs.
V2 changes:
- patch only the modules that are actually being used
- change description and fix nits
- address the incorrect proc_compat_ioctl
Pawel Zalewski (3):
net: sunrpc: fix unlocked_ioctl handler signature
include: linux: fix proc_ioctl
arm64: morello: enable nfs rootfs by default
arch/arm64/configs/morello_transitional_pcuabi_defconfig | 2 ++
drivers/pci/proc.c | 4 ++--
include/linux/proc_fs.h | 2 +-
net/sunrpc/cache.c | 6 +++---
net/sunrpc/rpc_pipe.c | 2 +-
5 files changed, 9 insertions(+), 7 deletions(-)
--
2.34.1
Some io_uring operations' SQEs store user_data values in the addr2 field.
These don't need to be modified as they're not dereferenced by the kernel.
Reported-by: Kevin Brodsky <kevin.brodsky(a)arm.com>
Signed-off-by: Tudor Cretu <tudor.cretu(a)arm.com>
---
Review branch:
https://git.morello-project.org/tudcre01/linux/-/commits/morello/addr2_fix
---
io_uring/io_uring.h | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 5b4f0f298ad9..db4f91cc64b2 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -132,9 +132,23 @@ static inline void convert_compat64_io_uring_sqe(struct io_ring_ctx *ctx,
sqe->ioprio = READ_ONCE(compat_sqe->ioprio);
sqe->fd = READ_ONCE(compat_sqe->fd);
BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr2, addr);
- sqe->addr2 = (__kernel_uintptr_t)compat_ptr(READ_ONCE(compat_sqe->addr2));
- BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr, len);
+ /*
+ * Some opcodes set a user_data value in the addr2 field to be matched
+ * with a pre-existing IO event's user_data or to propagate it to the
+ * user_data field of a CQE. It's not dereferenced by the kernel, so
+ * don't modify it.
+ */
+ switch (sqe->opcode) {
+ case IORING_OP_POLL_REMOVE:
+ case IORING_OP_MSG_RING:
+ sqe->addr2 = (__kernel_uintptr_t)READ_ONCE(compat_sqe->addr2);
+ break;
+ default:
+ sqe->addr2 = (__kernel_uintptr_t)compat_ptr(READ_ONCE(compat_sqe->addr2));
+ break;
+ }
+ BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr, len);
/*
* Some opcodes set a user_data value in the addr field to be matched
* with a pre-existing IO event's user_data. It's not dereferenced by
--
2.34.1