Hi Arnd,
I spoke to Linus (in Cc) on Friday and I thought it was a good idea to give to you an update on what we are doing as part of the linux on Morello project. We originally started with the basic enablement of the feature almost two year ago and then proceeded enabling the userspace support as part of the research project. To do so we went through the exercise of defining a Pure Capability based user Application Binary Interface (PCuABI) [1]. This ABI is still in review and we are hoping to finalize it by the end of October 2022.
To get started with our implementation we identified a more stable subset of the full PCuABI which we call transitional PCuABI [2] and made sure it can work with the most commonly used C libraries (musl, glibc). The full PCuABI can be seen as an extension of the Transitional PCuABI.
Recently we opened our implementation of the transitional PCuABI for external contributions [3]. We setup a mailing list as well for reviews and general discussions around Morello [4] and have a public task tracker that details what we are planning to do next [5]. Last but not least we have a public CI that verifies our implementation (currently based on kselftest and ltp but we are planning to extend it to more test suites in future) [6].
In reading our code, please consider that to enable userspace "quickly" we had to take some shortcuts of which we are aware. Because of that we feel that this is the right moment to start discussing design choices with the wider linux community especially after Matt's (in Cc) presentation at LPC ("Zettalinux: It's Not Too Late To Start") which made us realize that in the near future we will have to solve similar kind of problems.
We consider in fact problems like the distinction in between an address and a pointer foundational work for a pure capability kernel.
Caveat: Morello is not a committed architecture and the arm64 maintainers are not keen on supporting it upstream before that happens.
I hope this email provides an helpful introduction on where we are and what we are doing. Please feel free to ask any question and to subscribe to the list if you want to follow the work we are doing.
Thanks, Vincenzo
[1] https://git.morello-project.org/morello/kernel/linux/-/wikis/Morello-pure-ca... [2] https://git.morello-project.org/morello/kernel/linux/-/wikis/Transitional-Mo... [3] https://git.morello-project.org/morello/kernel/linux [4] https://op-lists.linaro.org/mailman3/lists/linux-morello.op-lists.linaro.org... [5] https://git.morello-project.org/groups/morello/kernel/-/epics?state=opened&a... [6] https://git.morello-project.org/morello/kernel/linux/-/pipelines
On Mon, Sep 26, 2022, at 10:09 AM, Vincenzo Frascino wrote:
Hi Arnd,
I spoke to Linus (in Cc) on Friday and I thought it was a good idea to give to you an update on what we are doing as part of the linux on Morello project. We originally started with the basic enablement of the feature almost two year ago and then proceeded enabling the userspace support as part of the research project. To do so we went through the exercise of defining a Pure Capability based user Application Binary Interface (PCuABI) [1]. This ABI is still in review and we are hoping to finalize it by the end of October 2022.
Hi Vincenzo,
Thanks for looping me in for this. It's good to see you have made so much progress. Sorry for taking too long to reply here, I had started my reply before the merge window but didn't get all my thoughts sorted before I had to get the 6.1 stuff out first.
To get started with our implementation we identified a more stable subset of the full PCuABI which we call transitional PCuABI [2] and made sure it can work with the most commonly used C libraries (musl, glibc). The full PCuABI can be seen as an extension of the Transitional PCuABI.
Recently we opened our implementation of the transitional PCuABI for external contributions [3]. We setup a mailing list as well for reviews and general discussions around Morello [4] and have a public task tracker that details what we are planning to do next [5]. Last but not least we have a public CI that verifies our implementation (currently based on kselftest and ltp but we are planning to extend it to more test suites in future) [6].
This sounds like a reasonable approach, especially the scope of the transitional ABI.
In reading our code, please consider that to enable userspace "quickly" we had to take some shortcuts of which we are aware. Because of that we feel that this is the right moment to start discussing design choices with the wider linux community especially after Matt's (in Cc) presentation at LPC ("Zettalinux: It's Not Too Late To Start") which made us realize that in the near future we will have to solve similar kind of problems.
We consider in fact problems like the distinction in between an address and a pointer foundational work for a pure capability kernel.
I've skimmed through the documents and your current implementation to get a rough feeling where you are now. I have a number of questions and some concerns regarding how this would end up being upstreamed as well as about possible sources of bugs. Here are some initial thoughts I have:
=== Integer types ===
* I understand that __user pointers, __kernel_uintptr_t, and __uintcap_t are 128 bit wide in a purecap kernel. However I'm constantly confused about the size of the basic types (void *, long, long long) in userspace, in a purecap kernel and in a kernel without CONFIG_CHERI_PURECAP_UABI. While those are probably completely obvious to anyone working on Morello, maybe you can repeat how these are defined in the ABI documents and/or in the Documentation/arm64/morello.rst files.
* The use of uintcap_t and __kernel_uintcap_t type names makes perfect sense in the context of Morello, but there is a risk that anything using those will have to be reworked again in order to deal with architectures that have >64bit addresses without using Cheri-style capabilities.
* I assume we'll eventually need to do a mass-conversion from 'unsigned long' to 'uintptr_t' (or similar) inside the kernel at some point. This seems entirely uncontroversial but will need some interesting scripting using coccinelle.
=== System call ABI ===
* Repurposing the 32-bit compat layer to support normal LP64 tasks is clearly a clever trick to get things going. However, this is something that worries me because you end up implicitly defining the native ABI with 128-bit pointers. This will likely repeat some of the mistakes that got introduced with the first 64-bit ABIs. My feeling is that the compat namespace (struct compat_*, in_compat_syscall(), ...) is better left for 32-bit tasks, but adding a compat64_* and/or compat128_* namespace would be fine.
* One immediate issue I see with the new native ABI is that it creates structures with holes in them like this one from your documentation:
struct clone_args { __aligned_u64 flags; __kernel_uintptr_t pidfd; __kernel_uintptr_t child_tid; __kernel_uintptr_t parent_tid; __aligned_u64 exit_signal; __kernel_uintptr_t stack; __aligned_u64 stack_size; __kernel_uintptr_t tls; __kernel_uintptr_t set_tid; __aligned_u64 set_tid_size; __aligned_u64 cgroup; };  Since __kernel_uintptr_t has 128-bit alignment, there are now holes after flags, exit_signal, and stack_size. We try hard not to define ABIs like this in the kernel because that risks information leaks when copying the structure from kernel to user space (clone3 does not copy its arguments back, but other syscalls do).
* Unfortunately, the system call table definition on arm64 is not in a good shape, which is partially my fault for leaving some technical debt after the time64 syscall conversion. The way I think this should actually be done is to generalize the syscall.tbl method (see e.g. arch/powerpc/kernel/syscalls/syscall.tbl) to all architectures that currently use include/asm-generic/unistd.h and list separate entry points for 64-bit and 128-bit pointers systematically.
=== ioctl interfaces ===
* Getting driver ioctl interfaces right is going to be the most tedious part of this work. You have so far sidestepped a lot of this by not submitting the code to the upstream kernel and only supporting a small set of drivers that actually make sense on the Morello SoC, but I would still hope that we can come up with a plan that lets us do this in the mainline kernel in a way that is helpful to both you and others that need changes to the way we handle ioctls.
* I think where we need to go with ioctl handlers is to have a table-driven method similar to the existing drivers/media/v4l2-core/v4l2-ioctl.c, and make it handle multiple ABIs (arm vs x86, 32 vs 64 vs 128, mips/ppc/sparc command encodings, ...) in addition to handling the copy-in/out and type safety. I have some plans for how this could be implemented, but of course this will need to be discussed on the mailing list.
Arnd
On 10/11/22 11:45, Arnd Bergmann wrote:
On Mon, Sep 26, 2022, at 10:09 AM, Vincenzo Frascino wrote:
Hi Arnd,
I spoke to Linus (in Cc) on Friday and I thought it was a good idea to give to you an update on what we are doing as part of the linux on Morello project. We originally started with the basic enablement of the feature almost two year ago and then proceeded enabling the userspace support as part of the research project. To do so we went through the exercise of defining a Pure Capability based user Application Binary Interface (PCuABI) [1]. This ABI is still in review and we are hoping to finalize it by the end of October 2022.
Hi Vincenzo,
Thanks for looping me in for this. It's good to see you have made so much progress. Sorry for taking too long to reply here, I had started my reply before the merge window but didn't get all my thoughts sorted before I had to get the 6.1 stuff out first.
Hi Arnd,
thank you very much for taking the time and replying to my email. I understand that time is very limited during a merging window. It took me a while as well to go through your comments and try to find synergies for potential future collaborations. So, here I am.
To get started with our implementation we identified a more stable subset of the full PCuABI which we call transitional PCuABI [2] and made sure it can work with the most commonly used C libraries (musl, glibc). The full PCuABI can be seen as an extension of the Transitional PCuABI.
Recently we opened our implementation of the transitional PCuABI for external contributions [3]. We setup a mailing list as well for reviews and general discussions around Morello [4] and have a public task tracker that details what we are planning to do next [5]. Last but not least we have a public CI that verifies our implementation (currently based on kselftest and ltp but we are planning to extend it to more test suites in future) [6].
This sounds like a reasonable approach, especially the scope of the transitional ABI.
In reading our code, please consider that to enable userspace "quickly" we had to take some shortcuts of which we are aware. Because of that we feel that this is the right moment to start discussing design choices with the wider linux community especially after Matt's (in Cc) presentation at LPC ("Zettalinux: It's Not Too Late To Start") which made us realize that in the near future we will have to solve similar kind of problems.
We consider in fact problems like the distinction in between an address and a pointer foundational work for a pure capability kernel.
I've skimmed through the documents and your current implementation to get a rough feeling where you are now. I have a number of questions and some concerns regarding how this would end up being upstreamed as well as about possible sources of bugs. Here are some initial thoughts I have:
We are not planning to upstream our work, at least in its current form. As I was mentioning in my previous email, Morello is not a committed architecture hence arm64 maintainers are not keen to support it in the long term, unless something changes. Our current focus is to enable userspace (as quickly as possible) to allow research on capabilities. We are really keen to align our implementation with upstream and contribute back (where it makes sense) because we feel it helps us to maintain our implementation and makes it simple to rebase our patches on a recent kernel.
=== Integer types ===
- I understand that __user pointers, __kernel_uintptr_t, and __uintcap_t are 128 bit wide in a purecap kernel. However I'm constantly confused about the size of the basic types (void *, long, long long) in userspace, in a purecap kernel and in a kernel without CONFIG_CHERI_PURECAP_UABI. While those are probably completely obvious to anyone working on Morello, maybe you can repeat how these are defined in the ABI documents and/or in the Documentation/arm64/morello.rst files.
I agree, it is not a straight forward concept and requires further clarification. To fix this we are currently working on a new document (will be placed in Documentation/cheri/) that will cover various aspects of the PCuABI including sizes of the basic types. It should be ready in the coming months. I will make sure you are Cc'ed once it is posted for review.
We decided to go for a separate document because we think that the information contained might be relevant even for other architectures/platforms and that it makes it easier to extend it as we proceed with the implementation of the kernel.
- The use of uintcap_t and __kernel_uintcap_t type names makes perfect sense in the context of Morello, but there is a risk that anything using those will have to be reworked again in order to deal with architectures that have >64bit addresses without using Cheri-style capabilities.
The risk is there and IMHO it depends on whether or not we decide to support capability based architectures in the kernel together with 128 bit ones.
If we do, we will always have the necessity to distinguish in between what it is a pointer (allows deref) and what it is an address (no deref). Another dimension to keep into account is how we distinguish in between what belongs to the kernel to what belongs to the user.
Keeping this in mind, probably we require a matrix of four types to make sure we capture all the possibilies: - For the kernel: - ptraddr_t identifies an address - uintptr_t identifies a pointer (capability) - For the user: - user_ptraddr_t identifies an address - user_uintptr_t identifies a pointer (capability)
This approach seems generic and should work transparently for architectures that have >64bit addresses without using Cheri-style capabilities.
For what concerns the provenance (user vs kernel), we could use __user to detect it. Whereas tools like sparse should be OK with it, I am not sure the same can be applied to compilers (we need to discuss it with compiler guys). In fact I am not currently aware of any annotation that modifies directly the size of an integer type and there might be good reasons for that.
The good thing is that, in the current implementation, we are not using/planning to use uintcap_t (__kernel_uintcap_t) in many places, hence it should not be hard to replace them once we agree on the naming. Our implementation relies mosty on user_uintptr_t which is aligned with what I was mentioning above.
Note: I do not have a good eye for naming of the types, so suggestions are welcome.
- I assume we'll eventually need to do a mass-conversion from 'unsigned long' to 'uintptr_t' (or similar) inside the kernel at some point. This seems entirely uncontroversial but will need some interesting scripting using coccinelle.
Keeping in mind the distinction in between addresses and pointers, I think that having an automated way to convert the kernel would definitely make our life easier. Something I played in past with is combining python with coccinelle which might be useful in this situation as well.
=== System call ABI ===
- Repurposing the 32-bit compat layer to support normal LP64 tasks is clearly a clever trick to get things going. However, this is something that worries me because you end up implicitly defining the native ABI with 128-bit pointers. This will likely repeat some of the mistakes that got introduced with the first 64-bit ABIs. My feeling is that the compat namespace (struct compat_*, in_compat_syscall(), ...) is better left for 32-bit tasks, but adding a compat64_* and/or compat128_* namespace would be fine.
This approach was taken because it was the fastest for us to be able to run multiple PCuABI compatible binaries without having to port a shell and bought us time to focus on the kernel implementation. It has definitely limitations and problems, we recognize it, but it served its purpose. Once we complete the implementation of the PCuABI and enforce capability properties across the kernel (our main focus at the moment) we might revisit this. Having a different namespace for compat64 sounds a good approach.
One immediate issue I see with the new native ABI is that it creates structures with holes in them like this one from your documentation:
struct clone_args { __aligned_u64 flags; __kernel_uintptr_t pidfd; __kernel_uintptr_t child_tid; __kernel_uintptr_t parent_tid; __aligned_u64 exit_signal; __kernel_uintptr_t stack; __aligned_u64 stack_size; __kernel_uintptr_t tls; __kernel_uintptr_t set_tid; __aligned_u64 set_tid_size; __aligned_u64 cgroup; };
 Since __kernel_uintptr_t has 128-bit alignment, there are now holes after flags, exit_signal, and stack_size. We try hard not to define ABIs like this in the kernel because that risks information leaks when copying the structure from kernel to user space (clone3 does not copy its arguments back, but other syscalls do).
I discussed this topic with the team and we agree on the potential risk of leaking information when data structures containing holes are copied from kernel to userspace.
Said that, we tried to define our PCuABI with two concepts in mind: maintainability and simplicity in porting applications that today do not support capabilities. One of the approaches we could have followed to prevent holes would have been to rearrange the data structures (this is possible because we are defining a new ABI) but it means increasing the complexity in maintaining different data structures across ABIs and makes generally more difficult porting applications because it might break some userspace assumptions.
This does not mean that we take the leaks problem less seriously, in fact we were thinking to run, in future, KMSAN sanitizer [1] (support recently merged upstream) as part of our CI, which should be able to detect these kind of problems for both structs on the stack and on the heap. We have a ticket against our project that describes the details [2].
Note: It requires LLVM 14.0.6 for arm64 support.
[1] https://www.kernel.org/doc/html/latest/dev-tools/kmsan.html [2] https://git.morello-project.org/morello/kernel/linux/-/issues/38
- Unfortunately, the system call table definition on arm64 is not in a good shape, which is partially my fault for leaving some technical debt after the time64 syscall conversion. The way I think this should actually be done is to generalize the syscall.tbl method (see e.g. arch/powerpc/kernel/syscalls/syscall.tbl) to all architectures that currently use include/asm-generic/unistd.h and list separate entry points for 64-bit and 128-bit pointers systematically.
I do not feel like it is someone's fault, it just takes time, especially considering the number of things you are involved in.
That said, I really like the approach, makes the addition of new ABIs very simple. We will keep a close eye on the relevant lists and we will convert our implementation once it lands on arm64.
=== ioctl interfaces ===
- Getting driver ioctl interfaces right is going to be the most tedious part of this work. You have so far sidestepped a lot of this by not submitting the code to the upstream kernel and only supporting a small set of drivers that actually make sense on the Morello SoC, but I would still hope that we can come up with a plan that lets us do this in the mainline kernel in a way that is helpful to both you and others that need changes to the way we handle ioctls.
I hope so as well, and I agree on the fact that dealing with ioctl() is the most tedious part of the whole project. We decided therefore to leave it for the end and enable in the meantime only what was strictly needed to focus on the PCuABI.
- I think where we need to go with ioctl handlers is to have a table-driven method similar to the existing drivers/media/v4l2-core/v4l2-ioctl.c, and make it handle multiple ABIs (arm vs x86, 32 vs 64 vs 128, mips/ppc/sparc command encodings, ...) in addition to handling the copy-in/out and type safety. I have some plans for how this could be implemented, but of course this will need to be discussed on the mailing list.
I had a look at the driver you pointed out and the approach looks quite clean, and I personally think that table based approach is the most suited to solve multi-ABI support related problems [3].
Please, if you start a discussion upstream on this topic, keep me and the list in copy. We would be really interested to follow the evolution.
[3] arch/arm64/kernel/vdso.c
Arnd
Thanks, Vincenzo
On Tue, Oct 25, 2022 at 04:13:37PM +0100, Vincenzo Frascino wrote:
On 10/11/22 11:45, Arnd Bergmann wrote:
=== Integer types ===
- I understand that __user pointers, __kernel_uintptr_t, and __uintcap_t are 128 bit wide in a purecap kernel. However I'm constantly confused about the size of the basic types (void *, long, long long) in userspace, in a purecap kernel and in a kernel without CONFIG_CHERI_PURECAP_UABI. While those are probably completely obvious to anyone working on Morello, maybe you can repeat how these are defined in the ABI documents and/or in the Documentation/arm64/morello.rst files.
I agree, it is not a straight forward concept and requires further clarification. To fix this we are currently working on a new document (will be placed in Documentation/cheri/) that will cover various aspects of the PCuABI including sizes of the basic types. It should be ready in the coming months. I will make sure you are Cc'ed once it is posted for review.
We decided to go for a separate document because we think that the information contained might be relevant even for other architectures/platforms and that it makes it easier to extend it as we proceed with the implementation of the kernel.
- The use of uintcap_t and __kernel_uintcap_t type names makes perfect sense in the context of Morello, but there is a risk that anything using those will have to be reworked again in order to deal with architectures that have >64bit addresses without using Cheri-style capabilities.
The risk is there and IMHO it depends on whether or not we decide to support capability based architectures in the kernel together with 128 bit ones.
If we do, we will always have the necessity to distinguish in between what it is a pointer (allows deref) and what it is an address (no deref). Another dimension to keep into account is how we distinguish in between what belongs to the kernel to what belongs to the user.
Keeping this in mind, probably we require a matrix of four types to make sure we capture all the possibilies:
- For the kernel:
- ptraddr_t identifies an address
- uintptr_t identifies a pointer (capability)
- For the user:
- user_ptraddr_t identifies an address
- user_uintptr_t identifies a pointer (capability)
This approach seems generic and should work transparently for architectures that have >64bit addresses without using Cheri-style capabilities.
For what concerns the provenance (user vs kernel), we could use __user to detect it. Whereas tools like sparse should be OK with it, I am not sure the same can be applied to compilers (we need to discuss it with compiler guys). In fact I am not currently aware of any annotation that modifies directly the size of an integer type and there might be good reasons for that.
The good thing is that, in the current implementation, we are not using/planning to use uintcap_t (__kernel_uintcap_t) in many places, hence it should not be hard to replace them once we agree on the naming. Our implementation relies mosty on user_uintptr_t which is aligned with what I was mentioning above.
Note: I do not have a good eye for naming of the types, so suggestions are welcome.
- I assume we'll eventually need to do a mass-conversion from 'unsigned long' to 'uintptr_t' (or similar) inside the kernel at some point. This seems entirely uncontroversial but will need some interesting scripting using coccinelle.
Keeping in mind the distinction in between addresses and pointers, I think that having an automated way to convert the kernel would definitely make our life easier. Something I played in past with is combining python with coccinelle which might be useful in this situation as well.
When I last spoke to Torvalds about uintptr_t, he's completely opposed. In fact, we already do make a distinction inside the kernel between addresses (unsigned long) and pointers (void * / void __user *). Yes, there are places where we do confuse the two, but I think they're fixable.
This was what I was trying to propose in my talk, that we widen both long and pointer to 128-bit at the same time. It means we don't need the mass-conversion of long -> intptr_t.
On 25/10/2022 16:49, Matthew Wilcox wrote:
Keeping in mind the distinction in between addresses and pointers, I think that having an automated way to convert the kernel would definitely make our life easier. Something I played in past with is combining python with coccinelle which might be useful in this situation as well.
When I last spoke to Torvalds about uintptr_t, he's completely opposed. In fact, we already do make a distinction inside the kernel between addresses (unsigned long) and pointers (void * / void __user *).
As I said in my previous email, this all depends on whether or not we decide to support capability based architectures in the kernel together with 128 bit ones.
If we say we do not support capability based architectures in the kernel this approach is perfect (e.g. RV128).
If we decide to support capability based architectures, the first thing we need to think of is the fact that userspace on these architectures will most likely not upscale artificially unsigned long to 128 bit (it will continue to use 64 bit ones). This is mainly for memory impact and performance reasons. In such a scenario according to me we (as in for Morello) are "forced" to head to a pure capability kernel (not a big deal since it is my preferred choice) but we are still left with one mass conversion at ABI level. In fact IMHO we will need to replace unsigned long with something like __kernel_ulong_t (already present in the kernel) or similar to distinguish the semantical meaning of longs in userspace from what the kernel does.
On open question on this front is the performance impact of such a change since we still did not have time to measure it.
An alternative, for sake of completeness, would be to say that unsigned long can contain capabilities, but I am mentioning it only for academic reasons because I think that we all agree on the fact that making unsigned long 129-bits (128-bit data + capability valid bit) would have a devastating effect on performances.
Yes, there are places where we do confuse the two, but I think they're fixable.
In our current implementation we redefine the meaning of __user to capability to make sure that we propagate capabilities at ABI level in the correct way. This approach has clearly some limitations and it is not meant for upstream. But it was the fastest way we found to enable userspace.
Another advantage of this approach is to identify all the places where __user is used improperly (the kernel does not build, we cannot bypass it). We fixed a few, and where it makes sense we are contributing back our findings upstream, for instance [1].
[1] https://lore.kernel.org/all/20220907121230.21252-1-vincenzo.frascino@arm.com...
This was what I was trying to propose in my talk, that we widen both long and pointer to 128-bit at the same time. It means we don't need the mass-conversion of long -> intptr_t.
We watched very carefully your presentation because we think that most of the concepts are in line with a pure capability based kernel implementation. And we believe that this is the only sane way to bring capabilities upstream.
Though the doubt I presented above still stands.
On Tue, Nov 1, 2022, at 16:27, Vincenzo Frascino wrote:
On 25/10/2022 16:49, Matthew Wilcox wrote:
Keeping in mind the distinction in between addresses and pointers, I think that having an automated way to convert the kernel would definitely make our life easier. Something I played in past with is combining python with coccinelle which might be useful in this situation as well.
When I last spoke to Torvalds about uintptr_t, he's completely opposed. In fact, we already do make a distinction inside the kernel between addresses (unsigned long) and pointers (void * / void __user *).
As I said in my previous email, this all depends on whether or not we decide to support capability based architectures in the kernel together with 128 bit ones.
If we say we do not support capability based architectures in the kernel this approach is perfect (e.g. RV128).
If we decide to support capability based architectures, the first thing we need to think of is the fact that userspace on these architectures will most likely not upscale artificially unsigned long to 128 bit (it will continue to use 64 bit ones). This is mainly for memory impact and performance reasons.
I think the same reasons apply on RV128 as well. The draft specification for RV128 has two suggested ABIs, using either 128 pointer and 64-bit long, or introducing a separate 128-bit far pointer, but leaving normal pointers and 'long' values as 64 bit wide.
While this could be changed before actual RV128 products come out, I believe the exact same tradeoffs apply in userspace for RV128 and for Morello: there is very little gain in changing 'long', but a significant cost in memory usage, CPU cycles and developer time for fixing applications that make assumptions about 'long' being either 32-bit or 64-bit wide.
OTOH, I also agree with Linus that for the kernel, breaking the assumption of size(long)==sizeof(void*) is crazy difficult because the amount of wasted developer time for anyone who does not care about 128-bit pointers.
Yes, there are places where we do confuse the two, but I think they're fixable.
In our current implementation we redefine the meaning of __user to capability to make sure that we propagate capabilities at ABI level in the correct way. This approach has clearly some limitations and it is not meant for upstream. But it was the fastest way we found to enable userspace.
I guess this would also work with the rv128 'far pointer' model in the scenario where one needs to support 128-bit processes in user space, without extending the physical address space visible to the kernel, though it breaks down for scenarios where the goal is to extend the physical address space.
Side note: on arm64, both physical and virtual address space is limited to 48 bit in practice (52 if you push it) but extending this any further is going to involve either trading off architectural features that use the upper bits, or adding larger pointers. On other architectures, it is apparently sufficient to go to 5-level paging to get 57 bits (x86, rv64) or 64 bits (s390, rv64 6-level) of virtual space.
Another advantage of this approach is to identify all the places where __user is used improperly (the kernel does not build, we cannot bypass it). We fixed a few, and where it makes sense we are contributing back our findings upstream, for instance [1].
[1] https://lore.kernel.org/all/20220907121230.21252-1-vincenzo.frascino@arm.com...
Indeed, this is definitely helpful.
This was what I was trying to propose in my talk, that we widen both long and pointer to 128-bit at the same time. It means we don't need the mass-conversion of long -> intptr_t.
We watched very carefully your presentation because we think that most of the concepts are in line with a pure capability based kernel implementation. And we believe that this is the only sane way to bring capabilities upstream.
Though the doubt I presented above still stands.
As there is no clear answer on how to proceed with the integer types, we're probably left with leaving this up to your experiments before the uintptr_t or uintcap_t annotations can be upstreamed. While this is probably the largest part of your work, there should be a lot of other parts that we can work on upstreaming, or at least nailing down first, especially around the syscall and ioctl ABIs.
Arnd
On Tue, Oct 25, 2022, at 17:13, Vincenzo Frascino wrote:
On 10/11/22 11:45, Arnd Bergmann wrote:
On Mon, Sep 26, 2022, at 10:09 AM, Vincenzo Frascino wrote:
I've skimmed through the documents and your current implementation to get a rough feeling where you are now. I have a number of questions and some concerns regarding how this would end up being upstreamed as well as about possible sources of bugs. Here are some initial thoughts I have:
We are not planning to upstream our work, at least in its current form. As I was mentioning in my previous email, Morello is not a committed architecture hence arm64 maintainers are not keen to support it in the long term, unless something changes. Our current focus is to enable userspace (as quickly as possible) to allow research on capabilities. We are really keen to align our implementation with upstream and contribute back (where it makes sense) because we feel it helps us to maintain our implementation and makes it simple to rebase our patches on a recent kernel.
Right, that was my understanding already.
=== System call ABI ===
- Repurposing the 32-bit compat layer to support normal LP64 tasks is clearly a clever trick to get things going. However, this is something that worries me because you end up implicitly defining the native ABI with 128-bit pointers. This will likely repeat some of the mistakes that got introduced with the first 64-bit ABIs. My feeling is that the compat namespace (struct compat_*, in_compat_syscall(), ...) is better left for 32-bit tasks, but adding a compat64_* and/or compat128_* namespace would be fine.
This approach was taken because it was the fastest for us to be able to run multiple PCuABI compatible binaries without having to port a shell and bought us time to focus on the kernel implementation. It has definitely limitations and problems, we recognize it, but it served its purpose. Once we complete the implementation of the PCuABI and enforce capability properties across the kernel (our main focus at the moment) we might revisit this. Having a different namespace for compat64 sounds a good approach.
Ok.
One immediate issue I see with the new native ABI is that it creates structures with holes in them like this one from your documentation:
struct clone_args { __aligned_u64 flags; __kernel_uintptr_t pidfd; __kernel_uintptr_t child_tid; __kernel_uintptr_t parent_tid; __aligned_u64 exit_signal; __kernel_uintptr_t stack; __aligned_u64 stack_size; __kernel_uintptr_t tls; __kernel_uintptr_t set_tid; __aligned_u64 set_tid_size; __aligned_u64 cgroup; };
 Since __kernel_uintptr_t has 128-bit alignment, there are now holes after flags, exit_signal, and stack_size. We try hard not to define ABIs like this in the kernel because that risks information leaks when copying the structure from kernel to user space (clone3 does not copy its arguments back, but other syscalls do).
I discussed this topic with the team and we agree on the potential risk of leaking information when data structures containing holes are copied from kernel to userspace.
Said that, we tried to define our PCuABI with two concepts in mind: maintainability and simplicity in porting applications that today do not support capabilities. One of the approaches we could have followed to prevent holes would have been to rearrange the data structures (this is possible because we are defining a new ABI) but it means increasing the complexity in maintaining different data structures across ABIs and makes generally more difficult porting applications because it might break some userspace assumptions.
This does not mean that we take the leaks problem less seriously, in fact we were thinking to run, in future, KMSAN sanitizer [1] (support recently merged upstream) as part of our CI, which should be able to detect these kind of problems for both structs on the stack and on the heap. We have a ticket against our project that describes the details [2].
Note: It requires LLVM 14.0.6 for arm64 support.
[1] https://www.kernel.org/doc/html/latest/dev-tools/kmsan.html [2] https://git.morello-project.org/morello/kernel/linux/-/issues/38
I've created an account there now and subscribed to this issue.
Most likely we should end up with a combination of new interfaces where appropriate and other cases where the implicit padding gets turned into explicit padding with an #ifdef. We can discuss this further in that issue.
=== ioctl interfaces ===
- Getting driver ioctl interfaces right is going to be the most tedious part of this work. You have so far sidestepped a lot of this by not submitting the code to the upstream kernel and only supporting a small set of drivers that actually make sense on the Morello SoC, but I would still hope that we can come up with a plan that lets us do this in the mainline kernel in a way that is helpful to both you and others that need changes to the way we handle ioctls.
I hope so as well, and I agree on the fact that dealing with ioctl() is the most tedious part of the whole project. We decided therefore to leave it for the end and enable in the meantime only what was strictly needed to focus on the PCuABI.
How do you plan to deal with cases where you currently have an accidental ABI that may or may not work when the correct one involves an ABI change? If you do a series of internal releases that are not required to be strictly backwards compatible, that won't be much of a problem, but I'm worried that defining interfaces ad-hoc causes more work later if you need to stay compatible.
Arnd
linux-morello@op-lists.linaro.org