On 30/10/2023 15:42, Kevin Brodsky wrote:

On 25/10/2023 18:57, Teo Couprie Diaz wrote:
argv and envp strings can be large enough so that a capability
cannot be created to access them exclusively: their bounds would
not be exactly representable and go outside of the strings.

During allocation on the stack, align the strings that need an adjusted
base and pad if needed for alignment or a representable length.
Handle this potential padding in fs/binfmt_elf by detecting unexpected 0s.
"unexpected" is a bit... unexpected for something we actually expect? :D
Maybe "extra".

This is done irrespective of the calling binary or loaded binary ABIs.
Indeed, at the point where strings are put on the stack during `execve`
we can only know the ABI of the calling binary, but not that of the loaded
binary.
Thus the compat path will be slightly slower and the memory used by the
strings will be larger, as if to create capabilities with exact bounds.
However, this memory impact should stay minimal and only shows up in rare
cases where argv or envp strings are more than a couple thousand characters long.
Same comment as in v2.

Add a a helper function to put already created capabilities on the stack
s/a a/a/

during elf setup.

Additionally, align the newly created stack to a multiple of the
maximum argument length to make sure the relocated arguments will still be
respresentable with their new alignment in a PC-uABI kernel.
s/respresentable/representable/

Also I would say "in their new location", because the point is that we
preserve a sufficient alignment.
Agree with all comments, sorry for forgetting that one from v2. Fixed !

A big improvement over v2 overall, well done!
Thanks for the comments ! v4 is mostly ready, just need a bit of input
regarding one of the comments below.

Signed-off-by: Teo Couprie Diaz <teo.coupriediaz@arm.com>
---
 fs/binfmt_elf.c | 101 +++++++++++++++++++++++++++++++++++++++++-------
 fs/exec.c       |  58 +++++++++++++++++++++++++--
 2 files changed, 142 insertions(+), 17 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 6455313d7f36..198c1ed442a6 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -151,6 +151,84 @@ static int padzero(unsigned long elf_bss)
 	}
 	return 0;
 }
+/*
+ * Put a pointer or capability to a string on the stack, return the effective
+ * size of the string on the stack or an error.
+ * When building a PC-uABI kernel, this will take into account padding and
"PCuABI" without dash (just to make grepping and such more reliable).
I wasn't sure, updated wherever I put a dash !

+ * create an exactly representable capability if in purecap.
+ */
+static long put_str_ptr(char __user *ustr,
+		elf_stack_item_t __user *stack_item)
+{
+	size_t len;
+#if defined(CONFIG_CHERI_PURECAP_UABI)
+	char c;
+	void __user *str_ptr;
Nit: char __user *
Done

+	len = strnlen_user(ustr, MAX_ARG_STRLEN);
+	if (!len || len > MAX_ARG_STRLEN)
+		return -EINVAL;
+
+		/*
+		 * Check if the next arg string starts right after the current one.
+		 * If not, the len was not exactly representable and there is padding.
+		 */
That comment looks misindented.
It was indeed, but removed as per below.

+	if (get_user(c, ustr + len))
+		return -EFAULT;
+	if (c == '\0')
+		/*
+		 * If the string has been adjusted for exact bounds representation,
+		 * its start should already be properly aligned. Get the representable
+		 * length by using the same length that was used during allocation:
+		 * the length of the original string.
+		 * This takes into account the padding due to length change after
+		 * the string, but not that for alignment. Thus we might not end up
+		 * at the start of the next arg. If not, we will need to take a slow
+		 * path to go through the padding.
+		 */
+		len = cheri_representable_length(len);
I'm realising: do we even need to do this conditionally? The
representable length should always be equal to the length if there is no
padding. Executing RRLEN unconditionally is definitely faster than
get_user() (and of course that's simpler too).

No I guess you are entirely right.

I think I wanted to not waste an instruction, but I obviously didn't think about it
enough because of the get_user taking even more time.

Doing it unconditionally now.


+
+#if (ELF_COMPAT == 0)
+	str_ptr = cheri_perms_and(ustr,
+		(CHERI_PERM_GLOBAL | CHERI_PERM_STORE | CHERI_PERM_LOAD));
+	str_ptr = cheri_bounds_set_exact(str_ptr, len);
+	if (put_user_ptr(str_ptr, stack_item++))
+		return -EFAULT;
+#else
+	if (elf_stack_put_user(ustr, stack_item++))
Nit: to reduce duplication some more and make it even clearer what's
happening, this line could be common to the two paths, simply setting
str_ptr to ustr in compat.
I wasn't sure when preparing v3 as I got a compile time assertion when
using what I felt was the most generic way to put the pointer on the stack,
with put_user_ptr.

It seems to build when using elf_stack_put_user instead and gets rid of the
warning introduced. Testing it looks to be just fine as well, so I just didn't
understand those two properly :)

Changed to have elf_stack_put_user outside the #ifdef.
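
Concretely, the deduplicated shape could look something like this (just a
sketch of the idea, assuming elf_stack_put_user accepts the adjusted
capability in the purecap case as well):

```c
	char __user *str_ptr = ustr;

#if (ELF_COMPAT == 0)
	str_ptr = cheri_perms_and(ustr,
		(CHERI_PERM_GLOBAL | CHERI_PERM_STORE | CHERI_PERM_LOAD));
	str_ptr = cheri_bounds_set_exact(str_ptr, len);
#endif
	if (elf_stack_put_user(str_ptr, stack_item++))
		return -EFAULT;
```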


+		return -EFAULT;
+#endif
+
+	/*
+	 * If right after the end of the argument length we have a zero,
+	 * that means the argument alignment was adjusted in order to create a
+	 * representable capability in purecap, even if we are not loading a
+	 * purecap binary. This padding is added at the end, so find the real
+	 * end by going through the padding.
+	 */
+	if (get_user(c, ustr + len))
+		return -EFAULT;
+	if (c == '\0') {
+		size_t pad_len = 0;
+
+		while (pad_len < MAX_ARG_STRLEN && c == '\0') {
+			pad_len++;
I think there's an off-by-one, since we increment pad_len, then call
get_user(), and then only after that check that pad_len <
MAX_ARG_STRLEN. I was actually going to suggest using a for loop
instead, which makes such issues less likely.
I'd argue that in this case there's even an off-by len+1 ! I don't think
it makes sense to use pad_len by itself here.

Otherwise agree, there is a mistake.


Also, I'm not sure we need to have that initial get_user() + if. We can
directly have a loop, which may exit right away (need to be careful
about limit conditions of course).

However I am not sure of the shape you would want for the for loop,
specifically if it were to remove the if check and get_user ?

There's two conditions there, is the padding too large and is the
character null.

Would you be happy with the following ?

  for (pad_len = 0; len + pad_len < MAX_ARG_STRLEN; pad_len += 1)

With the get_user and a break condition on != '\0' inside the loop body ?

But that doesn't exit right away: it goes through an iteration (which makes sense, to be fair).

So not sure of what you'd expect :)

diff --git a/fs/exec.c b/fs/exec.c
index 892cd911549d..7bc423a6fc9a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -66,6 +66,7 @@
 #include <linux/coredump.h>
 #include <linux/time_namespace.h>
 #include <linux/user_events.h>
+#include <linux/cheri.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -76,6 +77,12 @@
 
 #include <trace/events/sched.h>
 
+#ifdef CONFIG_CHERI_PURECAP_UABI
+#define STACK_ALIGN_SIZE MAX_ARG_STRLEN
+#else
+#define STACK_ALIGN_SIZE PAGE_SIZE
+#endif
+
 static int bprm_creds_from_file(struct linux_binprm *bprm);
 
 int suid_dumpable = 0;
@@ -515,6 +522,39 @@ static int bprm_stack_limits(struct linux_binprm *bprm)
 	return 0;
 }
 
+#if defined(CONFIG_CHERI_PURECAP_UABI)
+/*
+ * Compute a new value for p that allows for creating an exactly
+ * representable capability with all its padding after.
+ */
+size_t align_p_for_exact_bounds(unsigned long p, size_t len)
I would say s/exact/representable/ (it is indeed in order to be able to
set exact capability bounds, but representability is the root cause).
Fair, as mentioned I wasn't sure so happy to change it !

+{
+	size_t repr_len = 0, repr_align;
+
+	repr_len = cheri_representable_length(len);
+	repr_align = ~cheri_representable_alignment_mask(len) + 1;
+	if (repr_len > len || repr_align > 0) {
Do we actually need the conditions? It feels like the calculation should
work in any case. That certainly appears to be true for the length, I
would expect repr_align to compute to 1 if there is no requirement but
worth checking. If it is the case, removing the conditions is not only
more readable but also probably faster, as the calculation is
straightforward arithmetic (better than adding a branch even if it is a
no-op).

I am pretty sure it does, as I did check before adding the repr_align > 0 check.

It certainly makes it much more clear, removing the check.


@@ -542,6 +582,18 @@ static int copy_strings(int argc, struct user_arg_ptr argv,
 		if (!len)
 			goto out;
 
+#if defined(CONFIG_CHERI_PURECAP_UABI)
The nice thing with having moved everything to a function is that we can
easily have a #else there, returning p unconditionally, so that we can
avoid this #ifdef.
That's fair, maybe not even an #else by moving it inside the function.

+		/*
+		 * Handle all strings as if they were for purecap binaries as we only
+		 * know the calling process compat status, which might be different
+		 * from the binary to be exec'ed.
+		 *
+		 * Start directly at the string end, with all the padding already
+		 * taken into account.
"taken into account" as in skipped? I'm not sure how to interpret this
sentence.
That was the idea yes, changed to "skipped" as it's clearer.

Kevin

+		 */
+		bprm->p = align_p_for_exact_bounds(bprm->p, len);
+#endif
+
 		ret = -E2BIG;
 		if (!valid_arg_len(bprm, len))
 			goto out;
@@ -775,14 +827,14 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	if (vma->vm_end - vma->vm_start > stack_base)
 		return -ENOMEM;
 
-	stack_base = PAGE_ALIGN(stack_top - stack_base);
+	stack_base = ALIGN(stack_top - stack_base, STACK_ALIGN_SIZE);
 
 	stack_shift = vma->vm_start - stack_base;
 	mm->arg_start = bprm->p - stack_shift;
 	bprm->p = vma->vm_end - stack_shift;
 #else
 	stack_top = arch_align_stack(stack_top);
-	stack_top = PAGE_ALIGN(stack_top);
+	stack_top = ALIGN(stack_top, STACK_ALIGN_SIZE);
 
 	if (unlikely(stack_top < mmap_min_addr) ||
 	    unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))