On 09/11/2023 11:50, Amit Daniel Kachhap wrote:
On 11/9/23 14:44, Kevin Brodsky wrote:
On 03/11/2023 07:56, Amit Daniel Kachhap wrote:
- The reservation management remains next to the mapping management,
like in these patches. It is done unconditionally in PCuABI.
The way binfmt_elf calls vm_mmap() is for each elf segment so the approach I am thinking is to defer the reservation till all the elf segments are mapped. In the end, call the api reserv_range_insert_entry() with the range including the complete elf segments. This api will also return a capability with the bound same as reservation range.
Right I see that binfmt_elf is a special case, as the reservation we need to create is larger than the first vm_mmap() call (for the first segment). Ideally we wouldn't need to add a different path in vm_mmap() just for this situation though. Intuitively the reservation should be created before any mapping it contains in this case, but clearly we cannot do this when representing reservations as vma metadata. Having considered this further, couldn't we simply enlarge the reservation created by the first vm_mmap()? That is:
- Map the first segment [unchanged]
- Call reserv_vma_insert_entry() on that new vma, setting the
reservation bounds to cover all the segments 3. Map the following segments with MAP_FIXED, within that reservation [unchanged] 4. Set AT_CHERI_*_{RW,RX}_CAP based on the bounds of the executable/interpreter reservation
At step 2, there is an assumption that there is sufficient space to create that reservation. However I believe this is essentially the same assumption that exists today, as the kernel has to be able to map all the remaining segments at a fixed offset from the first one.
The calls to vm_mmap() at step 3 almost observe the standard reservation rules. Almost, because they are creating mappings in unmapped ranges within the reservation, which is normally forbidden. I suspect this rule isn't really necessary for in-kernel users anyway (it's meant to prevent userspace from unmapping and remapping the same range, instead of doing that atomically by overwriting the range with MAP_FIXED), so we could avoid this problem by having this check in the syscall handler too, not down in vm_mmap().
I believe that with this approach, there is no need to special-case vm_mmap() for the binfmt_elf usage, but I may well be missing something.
Your suggestion makes sense. How about the below steps?.
- Call vm_mmap with total segment size and maximum permission.
Reservation will be created for full size. 2). call vm_mmap with MAP_FIXED address and each segment size.
The disadvantage of this approach is that there will be extra vma's for each of the holes.
I considered this, but I would really like to avoid any user-visible side-effect, in this case additional (PROT_NONE) vmas. We could manually unmap those, but the whole process because quite cumbersome, along the lines of: 1. Map the whole reservation (anonymous, PROT_NONE) 2. Map each segment (MAP_FIXED) 3. Unmap leftovers of the PROT_NONE mapping from 1.
The above approach is also doable but does not look clean as per spec
The spec is really about the kernel-user ABI - as long as what the user sees is consistent with the spec, we are fine. Do you have any particular concern with the approach I outlined above?
Kevin