On 26-09-2023 10:02, Kevin Brodsky wrote:
On 22/09/2023 14:07, Tudor Cretu wrote:
From: Harry Ramsey harry.ramsey@arm.com
The tcp_zerocopy_receive struct is used to create a shared memory region between userspace and the kernel in which received network packets are directly copied from the network interface card.
In the PCuABI, a user pointer is a 129-bit capability, so the __u64 type is not big enough to hold it. Use the __kernel_uintptr_t type instead, which is big enough on the affected architectures while remaining 64-bit on others. Additionally, use the special copy routine when copying user pointers from and to userspace.
An additional check has been added to ensure that the owning capability has read access in order to read the received network packets from the shared memory region.
To be updated, see my last comment: this is actually a write from a userspace perspective, and we do not require an owning capability.
Signed-off-by: Harry Ramsey harry.ramsey@arm.com Co-developed-by: Tudor Cretu tudor.cretu@arm.com Signed-off-by: Tudor Cretu tudor.cretu@arm.com
include/linux/sockptr.h | 28 ++++++++++++++++++++++++++ include/uapi/linux/tcp.h | 10 +++++----- net/ipv4/tcp.c | 43 +++++++++++++++++++++------------------- 3 files changed, 56 insertions(+), 25 deletions(-)
diff --git a/include/linux/sockptr.h b/include/linux/sockptr.h index bae5e2369b4f..fd270378e8c8 100644 --- a/include/linux/sockptr.h +++ b/include/linux/sockptr.h @@ -41,6 +41,20 @@ static inline bool sockptr_is_null(sockptr_t sockptr) return !sockptr.user; } +static inline int copy_from_sockptr_offset_with_ptr(void *dst, sockptr_t src,
size_t offset, size_t size)
+{
- if (!sockptr_is_kernel(src))
return copy_from_user_with_ptr(dst, src.user + offset, size);
- memcpy(dst, src.kernel + offset, size);
- return 0;
+}
+static inline int copy_from_sockptr_with_ptr(void *dst, sockptr_t src, size_t size) +{
- return copy_from_sockptr_offset_with_ptr(dst, src, 0, size);
+}
- static inline int copy_from_sockptr_offset(void *dst, sockptr_t src, size_t offset, size_t size) {
@@ -55,6 +69,20 @@ static inline int copy_from_sockptr(void *dst, sockptr_t src, size_t size) return copy_from_sockptr_offset(dst, src, 0, size); } +static inline int copy_to_sockptr_offset_with_ptr(sockptr_t dst, size_t offset,
const void *src, size_t size)
+{
- if (!sockptr_is_kernel(dst))
return copy_to_user_with_ptr(dst.user + offset, src, size);
- memcpy(dst.kernel + offset, src, size);
- return 0;
+}
+static inline int copy_to_sockptr_with_ptr(sockptr_t dst, const void *src, size_t size) +{
- return copy_to_sockptr_offset_with_ptr(dst, 0, src, size);
+}
I would split this out, especially since it's not specific to TCP.
[...]
@@ -2248,11 +2248,11 @@ static void tcp_zc_finalize_rx_tstamp(struct sock *sk, struct tcp_zerocopy_receive *zc, struct scm_timestamping_internal *tss) {
- unsigned long msg_control_addr;
- __kernel_uintptr_t msg_control_addr;tcp_zerocopy_vm_insert_batch
It would make sense to use user_uintptr_t here too.
struct msghdr cmsg_dummy;
- msg_control_addr = (unsigned long)zc->msg_control;
- cmsg_dummy.msg_control_user = uaddr_to_user_ptr(msg_control_addr);
- msg_control_addr = zc->msg_control;
- cmsg_dummy.msg_control_user = (void __user *)msg_control_addr; cmsg_dummy.msg_controllen = (__kernel_size_t)zc->msg_controllen; cmsg_dummy.msg_flags = in_compat_syscall()
@@ -2262,8 +2262,8 @@ static void tcp_zc_finalize_rx_tstamp(struct sock *sk, if (zc->msg_control == msg_control_addr && zc->msg_controllen == cmsg_dummy.msg_controllen) { tcp_recv_timestamp(&cmsg_dummy, sk, tss);
zc->msg_control = (__u64)
user_ptr_addr(cmsg_dummy.msg_control_user);
zc->msg_control =
zc->msg_controllen = (__u64)cmsg_dummy.msg_controllen; zc->msg_flags = (__u32)cmsg_dummy.msg_flags;(__kernel_uintptr_t)cmsg_dummy.msg_control_user;
@@ -2276,7 +2276,7 @@ static int tcp_zerocopy_receive(struct sock *sk, struct scm_timestamping_internal *tss) { u32 length = 0, offset, vma_len, avail_len, copylen = 0;
- unsigned long address = (unsigned long)zc->address;
- user_uintptr_t address = (user_uintptr_t)zc->address;
There's no need to change this or tcp_zerocopy_vm_insert_batch*. Everything operates on addresses here, the check below that uses zc->address directly is sufficient.
struct page *pages[TCP_ZEROCOPY_PAGE_BATCH_SIZE]; s32 copybuf_len = zc->copybuf_len; struct tcp_sock *tp = tcp_sk(sk); @@ -2303,6 +2303,9 @@ static int tcp_zerocopy_receive(struct sock *sk, if (inq && inq <= copybuf_len) return receive_fallback_to_copy(sk, zc, inq, tss);
- if (!check_user_ptr_read((void __user *)zc->address, zc->length))
This is a bit of a special situation as the kernel is essentially asked to throw away all pages in that range and replace them with new pages. I would argue this looks more like a write than a read - from a userspace perspective, data appears in the specified range, which is what a write from a kernel looks like regardless of the specific mechanism.
Hi Kevin,
Thank you for the detailed review! Indeed it looked to me as well that a write happens here. AFAIK. Harry chose a `check_user_ptr_read` because the user passes a capability with only read permissions. The capability is one returned from an mmap with PROT_READ only, as also found in this kselftest [1]:
If you try to call getsockopt with zc.address equal to an mmap(..., PROT_READ | PROT_WRITE, ...), you actually get an EINVAL. So, I'm not sure what we should do, as there's no capability that would suggest that the userspace is allowed to write there.
[1] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/net/tc...
Thanks, Tudor
I would also tend to put that check just after the address alignment check, so that we fail reliably regardless of the state of the socket.
Kevin
return -EFAULT;
- if (inq < PAGE_SIZE) { zc->length = 0; zc->recv_skip_hint = inq;