On Mon, Sep 26, 2022, at 10:09 AM, Vincenzo Frascino wrote:
Hi Arnd,
I spoke to Linus (in Cc) on Friday and I thought it was a good idea to give you an update on what we are doing as part of the Linux on Morello project. We originally started with the basic enablement of the feature almost two years ago and then proceeded to enable userspace support as part of the research project. To do so we went through the exercise of defining a Pure Capability based user Application Binary Interface (PCuABI) [1]. This ABI is still in review and we are hoping to finalize it by the end of October 2022.
Hi Vincenzo,
Thanks for looping me in on this. It's good to see you have made so much progress. Sorry for taking so long to reply here, I had started my reply before the merge window but didn't get all my thoughts sorted before I had to get the 6.1 stuff out first.
To get started with our implementation we identified a more stable subset of the full PCuABI which we call transitional PCuABI [2] and made sure it can work with the most commonly used C libraries (musl, glibc). The full PCuABI can be seen as an extension of the transitional PCuABI.
Recently we opened our implementation of the transitional PCuABI for external contributions [3]. We set up a mailing list as well for reviews and general discussions around Morello [4] and have a public task tracker that details what we are planning to do next [5]. Last but not least we have a public CI that verifies our implementation (currently based on kselftest and LTP but we are planning to extend it to more test suites in future) [6].
This sounds like a reasonable approach, especially the scope of the transitional ABI.
In reading our code, please consider that to enable userspace "quickly" we had to take some shortcuts of which we are aware. Because of that we feel that this is the right moment to start discussing design choices with the wider Linux community, especially after Matt's (in Cc) presentation at LPC ("Zettalinux: It's Not Too Late To Start"), which made us realize that in the near future we will have to solve similar kinds of problems.
In fact, we consider problems like the distinction between an address and a pointer to be foundational work for a pure capability kernel.
I've skimmed through the documents and your current implementation to get a rough feeling where you are now. I have a number of questions and some concerns regarding how this would end up being upstreamed as well as about possible sources of bugs. Here are some initial thoughts I have:
=== Integer types ===
* I understand that __user pointers, __kernel_uintptr_t, and __uintcap_t are 128 bits wide in a purecap kernel. However, I'm constantly confused about the size of the basic types (void *, long, long long) in userspace, in a purecap kernel and in a kernel without CONFIG_CHERI_PURECAP_UABI. While those are probably completely obvious to anyone working on Morello, maybe you can repeat how these are defined in the ABI documents and/or in the Documentation/arm64/morello.rst files.
* The use of uintcap_t and __kernel_uintcap_t type names makes perfect sense in the context of Morello, but there is a risk that anything using those will have to be reworked again in order to deal with architectures that have >64bit addresses without using Cheri-style capabilities.
* I assume we'll eventually need to do a mass-conversion from 'unsigned long' to 'uintptr_t' (or similar) inside the kernel at some point. This seems entirely uncontroversial but will need some interesting scripting using coccinelle.
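To make the type-size question above concrete, here is a minimal userspace sketch of the kind of check I have in mind. The helper names (abi_sizes_get(), ulong_holds_pointer(), uintptr_roundtrip_ok()) are made up for illustration and are not part of any ABI document. The sketch deliberately does not assert what the values are on Morello; it only reports what the compiler in use says, and shows why 'unsigned long' is only a safe pointer container when it happens to be at least as wide as 'void *'.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: sizes of the basic types for the current ABI. */
struct abi_sizes {
	size_t ptr;	/* sizeof(void *) */
	size_t lng;	/* sizeof(long) */
	size_t llng;	/* sizeof(long long) */
	size_t uptr;	/* sizeof(uintptr_t) */
};

struct abi_sizes abi_sizes_get(void)
{
	struct abi_sizes s = {
		.ptr  = sizeof(void *),
		.lng  = sizeof(long),
		.llng = sizeof(long long),
		.uptr = sizeof(uintptr_t),
	};
	return s;
}

/*
 * Returns 1 if 'unsigned long' is at least as wide as a pointer.
 * Kernel code that stores pointers in 'unsigned long' implicitly
 * relies on this being true.
 */
int ulong_holds_pointer(void)
{
	return sizeof(unsigned long) >= sizeof(void *);
}

/*
 * Round-trips a pointer through uintptr_t, which is the conversion
 * the C standard actually guarantees. Returns 1 on success.
 */
int uintptr_roundtrip_ok(void)
{
	int x = 42;
	uintptr_t u = (uintptr_t)&x;
	int *p = (int *)u;

	return p == &x && *p == 42;
}
```

On an ordinary LP64 target all of the sizes come out as 8 and both helpers return 1; whatever the purecap values are, any place where ulong_holds_pointer() would return 0 is a candidate for the 'unsigned long' to 'uintptr_t' conversion.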
=== System call ABI ===
* Repurposing the 32-bit compat layer to support normal LP64 tasks is clearly a clever trick to get things going. However, this is something that worries me because you end up implicitly defining the native ABI with 128-bit pointers. This will likely repeat some of the mistakes that got introduced with the first 64-bit ABIs. My feeling is that the compat namespace (struct compat_*, in_compat_syscall(), ...) is better left for 32-bit tasks, but adding a compat64_* and/or compat128_* namespace would be fine.
* One immediate issue I see with the new native ABI is that it creates structures with holes in them like this one from your documentation:
	struct clone_args {
		__aligned_u64 flags;
		__kernel_uintptr_t pidfd;
		__kernel_uintptr_t child_tid;
		__kernel_uintptr_t parent_tid;
		__aligned_u64 exit_signal;
		__kernel_uintptr_t stack;
		__aligned_u64 stack_size;
		__kernel_uintptr_t tls;
		__kernel_uintptr_t set_tid;
		__aligned_u64 set_tid_size;
		__aligned_u64 cgroup;
	};

Since __kernel_uintptr_t has 128-bit alignment, there are now holes after flags, exit_signal, and stack_size. We try hard not to define ABIs like this in the kernel because that risks information leaks when copying the structure from kernel to user space (clone3 does not copy its arguments back, but other syscalls do).
* Unfortunately, the system call table definition on arm64 is not in good shape, which is partially my fault for leaving some technical debt after the time64 syscall conversion. The way I think this should actually be done is to generalize the syscall.tbl method (see e.g. arch/powerpc/kernel/syscalls/syscall.tbl) to all architectures that currently use include/asm-generic/unistd.h and list separate entry points for 64-bit and 128-bit pointers systematically.
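Coming back to the clone_args holes mentioned above, they can be checked on any host with a rough model of the layout. This is only a sketch: demo_uintptr128_t is a made-up stand-in for a 128-bit, 16-byte-aligned __kernel_uintptr_t (it is not a capability), demo_aligned_u64 stands in for __aligned_u64, and the demo_* helpers are hypothetical. It assumes GCC or Clang for the aligned attribute.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit, 16-byte-aligned __kernel_uintptr_t. */
typedef struct {
	uint64_t lo, hi;
} __attribute__((aligned(16))) demo_uintptr128_t;

/* Stand-in for __aligned_u64. */
typedef uint64_t demo_aligned_u64 __attribute__((aligned(8)));

/* Same member order as the clone_args layout quoted above. */
struct demo_clone_args {
	demo_aligned_u64 flags;
	demo_uintptr128_t pidfd;
	demo_uintptr128_t child_tid;
	demo_uintptr128_t parent_tid;
	demo_aligned_u64 exit_signal;
	demo_uintptr128_t stack;
	demo_aligned_u64 stack_size;
	demo_uintptr128_t tls;
	demo_uintptr128_t set_tid;
	demo_aligned_u64 set_tid_size;
	demo_aligned_u64 cgroup;
};

/* Number of padding bytes between 'member' and the following member 'next'. */
#define DEMO_HOLE_AFTER(type, member, next) \
	(offsetof(type, next) - \
	 (offsetof(type, member) + sizeof(((type *)0)->member)))

size_t demo_hole_after_flags(void)
{
	return DEMO_HOLE_AFTER(struct demo_clone_args, flags, pidfd);
}

size_t demo_total_padding(void)
{
	return DEMO_HOLE_AFTER(struct demo_clone_args, flags, pidfd) +
	       DEMO_HOLE_AFTER(struct demo_clone_args, exit_signal, stack) +
	       DEMO_HOLE_AFTER(struct demo_clone_args, stack_size, tls);
}

/*
 * Usual mitigation: clear the whole object, holes included, before
 * filling in members, so stale bytes are not copied out with it.
 */
void demo_clone_args_init(struct demo_clone_args *args, uint64_t flags)
{
	memset(args, 0, sizeof(*args));
	args->flags = flags;
}
```

With this model each of the three holes is 8 bytes, so 24 of the 160 bytes in the structure are padding. Reordering the members or adding explicit reserved fields would avoid the holes, at the cost of diverging from the existing 64-bit layout.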
=== ioctl interfaces ===
* Getting driver ioctl interfaces right is going to be the most tedious part of this work. You have so far sidestepped a lot of this by not submitting the code to the upstream kernel and only supporting a small set of drivers that actually make sense on the Morello SoC, but I would still hope that we can come up with a plan that lets us do this in the mainline kernel in a way that is helpful to both you and others that need changes to the way we handle ioctls.
* I think where we need to go with ioctl handlers is to have a table-driven method similar to the existing drivers/media/v4l2-core/v4l2-ioctl.c, and make it handle multiple ABIs (arm vs x86, 32 vs 64 vs 128, mips/ppc/sparc command encodings, ...) in addition to handling the copy-in/out and type safety. I have some plans for how this could be implemented, but of course this will need to be discussed on the mailing list.
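To give a rough idea of what I mean by table-driven, here is a heavily simplified userspace sketch. Everything in it is hypothetical: the demo_* names, the command numbers and the direction flags are invented for illustration, memcpy() stands in for copy_from_user()/copy_to_user(), and there is no per-ABI translation step yet, which is where the real work would be. It is not how v4l2-ioctl.c is implemented, only the general shape.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_DIR_IN	1u	/* argument is copied in before the handler */
#define DEMO_DIR_OUT	2u	/* argument is copied back after the handler */

/* One table entry per command: direction, argument size, handler. */
struct demo_ioctl_entry {
	unsigned int cmd;
	unsigned int dir;
	size_t size;
	int (*handler)(void *karg);
};

struct demo_pair {
	int a;
	int b;
};

static int demo_swap(void *karg)
{
	struct demo_pair *p = karg;
	int tmp = p->a;

	p->a = p->b;
	p->b = tmp;
	return 0;
}

static int demo_get_defaults(void *karg)
{
	struct demo_pair *p = karg;

	p->a = 10;
	p->b = 20;
	return 0;
}

static const struct demo_ioctl_entry demo_table[] = {
	{ 1, DEMO_DIR_IN | DEMO_DIR_OUT, sizeof(struct demo_pair), demo_swap },
	{ 2, DEMO_DIR_OUT, sizeof(struct demo_pair), demo_get_defaults },
};

/*
 * Generic dispatcher: handlers only ever see a zeroed, correctly sized
 * private buffer, and all copying happens in this one place.
 */
int demo_ioctl(unsigned int cmd, void *uarg)
{
	size_t i;

	for (i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++) {
		const struct demo_ioctl_entry *e = &demo_table[i];
		void *karg;
		int ret;

		if (e->cmd != cmd)
			continue;

		karg = calloc(1, e->size);	/* zeroed, so no stale bytes leak out */
		if (!karg)
			return -ENOMEM;
		if (e->dir & DEMO_DIR_IN)
			memcpy(karg, uarg, e->size);
		ret = e->handler(karg);
		if (ret == 0 && (e->dir & DEMO_DIR_OUT))
			memcpy(uarg, karg, e->size);
		free(karg);
		return ret;
	}
	return -ENOTTY;	/* unknown command */
}
```

The point of the design is that supporting another ABI would mean adding a conversion step in the dispatcher (or another column in the table) instead of touching every handler.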
Arnd