On 10/11/22 11:45, Arnd Bergmann wrote:
On Mon, Sep 26, 2022, at 10:09 AM, Vincenzo Frascino wrote:
Hi Arnd,
I spoke to Linus (in Cc) on Friday and I thought it was a good idea to give to you an update on what we are doing as part of the linux on Morello project. We originally started with the basic enablement of the feature almost two year ago and then proceeded enabling the userspace support as part of the research project. To do so we went through the exercise of defining a Pure Capability based user Application Binary Interface (PCuABI) [1]. This ABI is still in review and we are hoping to finalize it by the end of October 2022.
Hi Vincenzo,
Thanks for looping me in for this. It's good to see you have made so much progress. Sorry for taking too long to reply here, I had started my reply before the merge window but didn't get all my thoughts sorted before I had to get the 6.1 stuff out first.
Hi Arnd,
thank you very much for taking the time and replying to my email. I understand that time is very limited during a merging window. It took me a while as well to go through your comments and try to find synergies for potential future collaborations. So, here I am.
To get started with our implementation we identified a more stable subset of the full PCuABI which we call transitional PCuABI [2] and made sure it can work with the most commonly used C libraries (musl, glibc). The full PCuABI can be seen as an extension of the Transitional PCuABI.
Recently we opened our implementation of the transitional PCuABI for external contributions [3]. We setup a mailing list as well for reviews and general discussions around Morello [4] and have a public task tracker that details what we are planning to do next [5]. Last but not least we have a public CI that verifies our implementation (currently based on kselftest and ltp but we are planning to extend it to more test suites in future) [6].
This sounds like a reasonable approach, especially the scope of the transitional ABI.
In reading our code, please consider that to enable userspace "quickly" we had to take some shortcuts of which we are aware. Because of that we feel that this is the right moment to start discussing design choices with the wider linux community especially after Matt's (in Cc) presentation at LPC ("Zettalinux: It's Not Too Late To Start") which made us realize that in the near future we will have to solve similar kind of problems.
We consider in fact problems like the distinction in between an address and a pointer foundational work for a pure capability kernel.
I've skimmed through the documents and your current implementation to get a rough feeling where you are now. I have a number of questions and some concerns regarding how this would end up being upstreamed as well as about possible sources of bugs. Here are some initial thoughts I have:
We are not planning to upstream our work, at least in its current form. As I was mentioning in my previous email, Morello is not a committed architecture hence arm64 maintainers are not keen to support it in the long term, unless something changes. Our current focus is to enable userspace (as quickly as possible) to allow research on capabilities. We are really keen to align our implementation with upstream and contribute back (where it makes sense) because we feel it helps us to maintain our implementation and makes it simple to rebase our patches on a recent kernel.
=== Integer types ===
- I understand that __user pointers, __kernel_uintptr_t, and __uintcap_t are 128 bit wide in a purecap kernel. However I'm constantly confused about the size of the basic types (void *, long, long long) in userspace, in a purecap kernel and in a kernel without CONFIG_CHERI_PURECAP_UABI. While those are probably completely obvious to anyone working on Morello, maybe you can repeat how these are defined in the ABI documents and/or in the Documentation/arm64/morello.rst files.
I agree, it is not a straight forward concept and requires further clarification. To fix this we are currently working on a new document (will be placed in Documentation/cheri/) that will cover various aspects of the PCuABI including sizes of the basic types. It should be ready in the coming months. I will make sure you are Cc'ed once it is posted for review.
We decided to go for a separate document because we think that the information contained might be relevant even for other architectures/platforms and that it makes it easier to extend it as we proceed with the implementation of the kernel.
- The use of uintcap_t and __kernel_uintcap_t type names makes perfect sense in the context of Morello, but there is a risk that anything using those will have to be reworked again in order to deal with architectures that have >64bit addresses without using Cheri-style capabilities.
The risk is there and IMHO it depends on whether or not we decide to support capability based architectures in the kernel together with 128 bit ones.
If we do, we will always have the necessity to distinguish in between what it is a pointer (allows deref) and what it is an address (no deref). Another dimension to keep into account is how we distinguish in between what belongs to the kernel to what belongs to the user.
Keeping this in mind, probably we require a matrix of four types to make sure we capture all the possibilies: - For the kernel: - ptraddr_t identifies an address - uintptr_t identifies a pointer (capability) - For the user: - user_ptraddr_t identifies an address - user_uintptr_t identifies a pointer (capability)
This approach seems generic and should work transparently for architectures that have >64bit addresses without using Cheri-style capabilities.
For what concerns the provenance (user vs kernel), we could use __user to detect it. Whereas tools like sparse should be OK with it, I am not sure the same can be applied to compilers (we need to discuss it with compiler guys). In fact I am not currently aware of any annotation that modifies directly the size of an integer type and there might be good reasons for that.
The good thing is that, in the current implementation, we are not using/planning to use uintcap_t (__kernel_uintcap_t) in many places, hence it should not be hard to replace them once we agree on the naming. Our implementation relies mosty on user_uintptr_t which is aligned with what I was mentioning above.
Note: I do not have a good eye for naming of the types, so suggestions are welcome.
- I assume we'll eventually need to do a mass-conversion from 'unsigned long' to 'uintptr_t' (or similar) inside the kernel at some point. This seems entirely uncontroversial but will need some interesting scripting using coccinelle.
Keeping in mind the distinction in between addresses and pointers, I think that having an automated way to convert the kernel would definitely make our life easier. Something I played in past with is combining python with coccinelle which might be useful in this situation as well.
=== System call ABI ===
- Repurposing the 32-bit compat layer to support normal LP64 tasks is clearly a clever trick to get things going. However, this is something that worries me because you end up implicitly defining the native ABI with 128-bit pointers. This will likely repeat some of the mistakes that got introduced with the first 64-bit ABIs. My feeling is that the compat namespace (struct compat_*, in_compat_syscall(), ...) is better left for 32-bit tasks, but adding a compat64_* and/or compat128_* namespace would be fine.
This approach was taken because it was the fastest for us to be able to run multiple PCuABI compatible binaries without having to port a shell and bought us time to focus on the kernel implementation. It has definitely limitations and problems, we recognize it, but it served its purpose. Once we complete the implementation of the PCuABI and enforce capability properties across the kernel (our main focus at the moment) we might revisit this. Having a different namespace for compat64 sounds a good approach.
One immediate issue I see with the new native ABI is that it creates structures with holes in them like this one from your documentation:
struct clone_args { __aligned_u64 flags; __kernel_uintptr_t pidfd; __kernel_uintptr_t child_tid; __kernel_uintptr_t parent_tid; __aligned_u64 exit_signal; __kernel_uintptr_t stack; __aligned_u64 stack_size; __kernel_uintptr_t tls; __kernel_uintptr_t set_tid; __aligned_u64 set_tid_size; __aligned_u64 cgroup; };
 Since __kernel_uintptr_t has 128-bit alignment, there are now holes after flags, exit_signal, and stack_size. We try hard not to define ABIs like this in the kernel because that risks information leaks when copying the structure from kernel to user space (clone3 does not copy its arguments back, but other syscalls do).
I discussed this topic with the team and we agree on the potential risk of leaking information when data structures containing holes are copied from kernel to userspace.
Said that, we tried to define our PCuABI with two concepts in mind: maintainability and simplicity in porting applications that today do not support capabilities. One of the approaches we could have followed to prevent holes would have been to rearrange the data structures (this is possible because we are defining a new ABI) but it means increasing the complexity in maintaining different data structures across ABIs and makes generally more difficult porting applications because it might break some userspace assumptions.
This does not mean that we take the leaks problem less seriously, in fact we were thinking to run, in future, KMSAN sanitizer [1] (support recently merged upstream) as part of our CI, which should be able to detect these kind of problems for both structs on the stack and on the heap. We have a ticket against our project that describes the details [2].
Note: It requires LLVM 14.0.6 for arm64 support.
[1] https://www.kernel.org/doc/html/latest/dev-tools/kmsan.html [2] https://git.morello-project.org/morello/kernel/linux/-/issues/38
- Unfortunately, the system call table definition on arm64 is not in a good shape, which is partially my fault for leaving some technical debt after the time64 syscall conversion. The way I think this should actually be done is to generalize the syscall.tbl method (see e.g. arch/powerpc/kernel/syscalls/syscall.tbl) to all architectures that currently use include/asm-generic/unistd.h and list separate entry points for 64-bit and 128-bit pointers systematically.
I do not feel like it is someone's fault, it just takes time, especially considering the number of things you are involved in.
That said, I really like the approach, makes the addition of new ABIs very simple. We will keep a close eye on the relevant lists and we will convert our implementation once it lands on arm64.
=== ioctl interfaces ===
- Getting driver ioctl interfaces right is going to be the most tedious part of this work. You have so far sidestepped a lot of this by not submitting the code to the upstream kernel and only supporting a small set of drivers that actually make sense on the Morello SoC, but I would still hope that we can come up with a plan that lets us do this in the mainline kernel in a way that is helpful to both you and others that need changes to the way we handle ioctls.
I hope so as well, and I agree on the fact that dealing with ioctl() is the most tedious part of the whole project. We decided therefore to leave it for the end and enable in the meantime only what was strictly needed to focus on the PCuABI.
- I think where we need to go with ioctl handlers is to have a table-driven method similar to the existing drivers/media/v4l2-core/v4l2-ioctl.c, and make it handle multiple ABIs (arm vs x86, 32 vs 64 vs 128, mips/ppc/sparc command encodings, ...) in addition to handling the copy-in/out and type safety. I have some plans for how this could be implemented, but of course this will need to be discussed on the mailing list.
I had a look at the driver you pointed out and the approach looks quite clean, and I personally think that table based approach is the most suited to solve multi-ABI support related problems [3].
Please, if you start a discussion upstream on this topic, keep me and the list in copy. We would be really interested to follow the evolution.
[3] arch/arm64/kernel/vdso.c
Arnd
Thanks, Vincenzo