On 26/09/2023 16:45, Tudor Cretu wrote:
+ if (!check_user_ptr_read((void __user *)zc->address, zc->length))
This is a bit of a special situation as the kernel is essentially asked to throw away all pages in that range and replace them with new pages. I would argue this looks more like a write than a read - from a userspace perspective, data appears in the specified range, which is what a write from a kernel looks like regardless of the specific mechanism.
Hi Kevin,
Thank you for the detailed review! Indeed it looked to me as well that a write happens here. AFAIK. Harry chose a `check_user_ptr_read` because the user passes a capability with only read permissions. The capability is one returned from an mmap with PROT_READ only, as also found in this kselftest [1]:
If you try to call getsockopt with zc.address equal to an mmap(..., PROT_READ | PROT_WRITE, ...), you actually get an EINVAL. So, I'm not sure what we should do, as there's no capability that would suggest that the userspace is allowed to write there.
Ah, thanks for pointing this out, I missed it completely. I hadn't realised that in fact we are not dealing with some arbitrary anonymous mapping, but rather one backed by a (TCP) socket fd, and I see that tcp_mmap() fully prevents mapping it as writeable. That makes sense, because otherwise userspace could corrupt these pages (which are owned by the kernel).
With this considered, I understand why a read check was chosen - there isn't any other option with our current generic interface. However, I wonder if this is sufficient. The problem is that with that approach, any pointer that allows reading (at least a page of) the zerocopy region also allows changing its contents via the appropriate getsockopt() call. As a result, such a pointer cannot be passed around (say to a non-privileged context) while retaining the guarantee that the underlying memory will be left unmodified - which one would expect from a read-only pointer.
Thus, as a very special case, I think we should consider requiring VMem here, i.e. that the capability owns the range (as Harry originally wrote in the commit message). This is not quite the semantics of VMem, but it is close enough - that getsockopt() call is more or less the same as an mmap(MAP_FIXED) overwriting (part of) the mapping. Does that make sense? We do not have a generic helper for checking this, but this may well be the only generic case where we need it so I think it would be fine to have some PCuABI-specific code there (using cheri_check_cap()).
Kevin