On Tue, Nov 1, 2022, at 16:27, Vincenzo Frascino wrote:
On 25/10/2022 16:49, Matthew Wilcox wrote:
Keeping in mind the distinction between addresses and pointers, I think that having an automated way to convert the kernel would definitely make our lives easier. Something I have played with in the past is combining Python with Coccinelle, which might be useful in this situation as well.
When I last spoke to Torvalds about uintptr_t, he was completely opposed. In fact, we already do make a distinction inside the kernel between addresses (unsigned long) and pointers (void * / void __user *).
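To make the address/pointer distinction concrete, here is a minimal userspace sketch. The names example_page_base, example_read and EXAMPLE_PAGE_SIZE are made up for illustration and are not kernel APIs; the only point is that mask arithmetic happens on an unsigned long, while dereferencing happens through a pointer type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page size, chosen only for this illustration. */
#define EXAMPLE_PAGE_SIZE 4096UL

/* An address is a plain integer: rounding down to a page boundary is
 * ordinary mask arithmetic on unsigned long. */
static unsigned long example_page_base(unsigned long addr)
{
	return addr & ~(EXAMPLE_PAGE_SIZE - 1);
}

/* A pointer is something we dereference; no integer arithmetic on it. */
static int example_read(const int *ptr)
{
	assert(ptr != NULL);
	return *ptr;
}
```

With these definitions, example_page_base(0x12345UL) yields 0x12000UL, and example_read() simply returns the pointed-to value.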
As I said in my previous email, this all depends on whether or not we decide to support capability-based architectures in the kernel together with 128-bit ones.
If we say we do not support capability-based architectures in the kernel, this approach is perfect (e.g. RV128).
If we decide to support capability-based architectures, the first thing we need to think about is that userspace on these architectures will most likely not artificially widen unsigned long to 128 bits (it will continue to use 64-bit ones). This is mainly for memory impact and performance reasons.
I think the same reasons apply on RV128 as well. The draft specification for RV128 has two suggested ABIs: either using 128-bit pointers and a 64-bit long, or introducing a separate 128-bit far pointer but leaving normal pointers and 'long' values 64 bits wide.
While this could be changed before actual RV128 products come out, I believe the exact same tradeoffs apply in userspace for RV128 and for Morello: there is very little gain in changing 'long', but a significant cost in memory usage, CPU cycles and developer time for fixing applications that make assumptions about 'long' being either 32-bit or 64-bit wide.
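As a rough illustration of the memory-usage part of that tradeoff, the sketch below compares a structure with three 64-bit fields against the same structure with each field widened to 128 bits. All names are hypothetical, and the 128-bit value is modelled as a pair of uint64_t only so that the example builds on ordinary 64-bit toolchains; a real 128-bit ABI might also impose stricter alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit 'long': two 64-bit halves. */
struct example_u128 {
	uint64_t lo;
	uint64_t hi;
};

/* A typical node with three long-sized fields, as on today's LP64 ABIs. */
struct example_node64 {
	uint64_t next;
	uint64_t flags;
	uint64_t count;
};

/* The same node if every long-sized field were widened to 128 bits. */
struct example_node128 {
	struct example_u128 next;
	struct example_u128 flags;
	struct example_u128 count;
};

/* Ratio of the widened size to the original size. */
static unsigned int example_growth_factor(void)
{
	return sizeof(struct example_node128) / sizeof(struct example_node64);
}
```

Under these assumptions the node grows from 24 to 48 bytes, so every cache line holds half as many of them.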
OTOH, I also agree with Linus that for the kernel, breaking the assumption of sizeof(long)==sizeof(void*) is crazy difficult because of the amount of wasted developer time for anyone who does not care about 128-bit pointers.
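For readers less familiar with that assumption, here is a small userspace sketch of the pattern it permits. The function names are invented for this example; the round trip through unsigned long is only lossless while long is at least as wide as a pointer, which is exactly what an ABI with 128-bit pointers and a 64-bit long would break.

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero when an unsigned long can hold every pointer value,
 * as on the ILP32 and LP64 ABIs that Linux uses today. */
static int example_long_holds_pointer(void)
{
	return sizeof(unsigned long) >= sizeof(void *);
}

/* Stash a pointer in an integer 'cookie' and recover it again.
 * This is only safe when example_long_holds_pointer() is true. */
static void *example_roundtrip(void *p)
{
	unsigned long cookie = (unsigned long)p;

	return (void *)cookie;
}
```

On a hypothetical ABI where the check returns zero, every such cast would need to be found and converted to a pointer-sized integer type, which is the mass conversion being discussed.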
Yes, there are places where we do confuse the two, but I think they're fixable.
In our current implementation we redefine __user to mean a capability, to make sure that we propagate capabilities at the ABI level in the correct way. This approach clearly has some limitations and is not meant for upstream, but it was the fastest way we found to enable userspace.
I guess this would also work with the rv128 'far pointer' model in the scenario where one needs to support 128-bit processes in user space, without extending the physical address space visible to the kernel, though it breaks down for scenarios where the goal is to extend the physical address space.
Side note: on arm64, both the physical and virtual address spaces are limited to 48 bits in practice (52 if you push it), but extending this any further is going to involve either trading off architectural features that use the upper bits, or adding larger pointers. On other architectures, it is apparently sufficient to go to 5-level paging to get 57 bits (x86, rv64) or 64 bits (s390, rv64 6-level) of virtual space.
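The 48 and 57 bit figures follow from simple arithmetic, sketched below under the assumption of 4 KiB pages (12 offset bits) and 9 index bits per page-table level, as on x86-64 and the RISC-V Sv48/Sv57 modes. The function name is made up for this example, and s390 is not modelled here since its tables use different index widths.

```c
#include <assert.h>

/* Virtual address bits for a given number of page-table levels,
 * assuming 4 KiB pages and 512-entry tables (9 bits per level).
 * The result is capped at 64 because the registers are 64 bits wide. */
static unsigned int example_va_bits(unsigned int levels)
{
	unsigned int bits;

	assert(levels >= 1 && levels <= 6);
	bits = 12 + 9 * levels;

	return bits > 64 ? 64 : bits;
}
```

Four levels give 48 bits and five give 57; a sixth level would nominally give 66, so it is capped at the full 64 bits.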
Another advantage of this approach is that it identifies all the places where __user is used improperly (the kernel does not build, so we cannot bypass it). We fixed a few, and where it makes sense we are contributing our findings back upstream, for instance [1].
[1] https://lore.kernel.org/all/20220907121230.21252-1-vincenzo.frascino@arm.com...
Indeed, this is definitely helpful.
This was what I was trying to propose in my talk, that we widen both long and pointer to 128-bit at the same time. It means we don't need the mass-conversion of long -> intptr_t.
We watched your presentation very carefully because we think that most of the concepts are in line with a pure capability-based kernel implementation. And we believe that this is the only sane way to bring capabilities upstream.
Though the doubt I presented above still stands.
As there is no clear answer on how to proceed with the integer types, we're probably left with leaving this up to your experiments before the uintptr_t or uintcap_t annotations can be upstreamed. While this is probably the largest part of your work, there should be a lot of other parts that we can work on upstreaming, or at least nailing down first, especially around the syscall and ioctl ABIs.
Arnd