RFC                                                          April 2024
Arm Ltd                                                    Zachary Leaf
Compartmentalising eBPF with Morello
Abstract
This document describes how a hybrid compartment enabled by the Morello architecture/CHERI can be used to sandbox execution of JIT'd eBPF code in the Linux kernel.
Since Morello enforces isolation in hardware, it can provide much more lightweight compartmentalisation, isolation and sandboxing than pure software approaches.
Further, it can offer greater guarantees of memory safety when combined with the existing security model used by eBPF. Where previous verifier bypass exploits in eBPF would result in arbitrary kernel read/writes and privilege escalation, these attacks would instead result in hardware exceptions due to out of bounds memory accesses.
RFC patchset
The attached patch series is a first draft enabling compartmentalisation of JIT'd eBPF using a hybrid model. Restricted/Executive mode is used for domain transitions.
This is a working proof of concept that outlines a general approach and technique to isolate eBPF. Some of the limitations and the further work required are found below. In particular, one major technical problem (filtering arguments passed to helper functions) remains unsolved and needs further work.
The patch series is also available as a branch at: https://git.morello-project.org/morello/kernel/linux/-/tree/morello/bpf_comp...
What is eBPF
eBPF is a relatively new kernel feature that allows extending the operating system, much like kernel modules. Compared with kernel modules, eBPF programs are meant to come with much stronger safety and reliability guarantees.
eBPF programs also run differently from kernel modules: eBPF is its own unique architecture and bytecode that runs inside a virtual machine in the kernel. Programs can either be run via the eBPF interpreter, or JIT compiled into native code.
Since by design eBPF programs run in the kernel context, they're fast - avoiding syscalls and context switches, and calling directly into the kernel. This is important as eBPF programs are event based and may run frequently, for instance on every incoming packet.
There are now many use cases for eBPF, from the original - writing complex packet filtering logic - to live tracing, monitoring and security programs.
Threat Model
eBPF has had a number of CVEs[1] in recent years. The current security model relies on accurate analysis by the eBPF verifier, a pre-runtime static analyser that checks if programs are safe to run. As evidenced by the number of CVEs relating to errors in the verifier[2], this kind of verification does not currently offer a strong guarantee of safety. The verification is not a formal one, rather a long list of checks. With the complexity of modern eBPF, it is unlikely the verifier could ever be made provably safe without significant restrictions on the capabilities of eBPF itself. Further details about the verifier and attacks against it can be found in the sections below.
Since eBPF runs in the kernel context, a large number of these CVEs allow arbitrary kernel reads/writes, leading to local privilege escalation. This, combined with Spectre speculation attacks[3][4], has resulted in the kernel and all major Linux distributions disabling unprivileged eBPF execution by default[5]. Many eBPF maintainers have also deprecated any new efforts to support unprivileged execution due to the difficulty of securing it[6][7].
The CAP_BPF permission was introduced in v5.8 to attempt to limit possible damage from the previously required CAP_SYS_ADMIN permission and to prevent "pointer leaks and arbitrary kernel memory access"[8]. While this is an improvement over the wide capabilities granted by CAP_SYS_ADMIN, it remains vulnerable to verifier bypasses and hence privilege escalation from CAP_BPF to root.
Using Morello could therefore strengthen the CAP_BPF model as well as allow safer usage of unprivileged eBPF. This includes re-enabling several existing use cases, e.g. user socket filtering and access control, as well as opening up the possibility of new use cases, such as eBPF seccomp filters[9][10].
eBPF Verifier
                                                  ┌────────┐  ┌────────────────┐
                             BPF_PROG_LOAD     ┌─►│  JIT   ├─►│[native aarch64]│
┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐ │  └────────┘  └────────────────┘
│  [C]   ├─►│ clang  ├─►│ [eBPF] ├─►│verifier├─┤
└────────┘  └────────┘  └────────┘  └────────┘ │  ┌─────────────┐
                                               └─►│ interpreter │
                                                  └─────────────┘
A simplified flow of writing and loading eBPF in the kernel is as follows:
1. Compile a subset of C with clang --target=bpf into eBPF bytecode
2. Request to load the program into the kernel
3. eBPF bytecode is run through the verifier
   - The verifier steps through all possible execution paths and
     instructions, keeping track of state
   - Control flow checks (e.g. no infinite loops, no unreachable
     instructions)
   - Individual instruction checks, both static (e.g. divide by zero) +
     dynamic (out of range accesses, stack + register checks)
   - No leaking kernel ptrs to memory shared with userspace (e.g.
     through eBPF maps) etc...
   - Once all checks are passed, code is determined/marked as "safe"
4. eBPF is then loaded into the kernel, either JIT compiled or as raw
   eBPF bytecode, ready to run in the kernel context when triggered
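As a concrete example of steps 1 and 2, a minimal socket filter might look like the following (an illustrative sketch; the file and function names are not taken from the RFC patches):

	/* minimal_filter.bpf.c
	 * build: clang -O2 --target=bpf -c minimal_filter.bpf.c */
	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	/* Accept every packet on the socket, truncated to at most 1500
	 * bytes; a socket filter's return value is the number of bytes
	 * of the packet to keep */
	SEC("socket")
	int trim_pkt(struct __sk_buff *skb)
	{
		return skb->len < 1500 ? skb->len : 1500;
	}

	char _license[] SEC("license") = "GPL";

Loading the resulting object via the bpf(BPF_PROG_LOAD, ...) syscall (e.g. through libbpf) then triggers steps 3 and 4.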
Both JIT'd and interpreted programs are run from a memory area marked as executable inside the kernel address space. On arm64[11], x86 and some other archs this area is also set to RO.
The rules of the verifier do not allow bpf programs to access arbitrary kernel memory. They are restricted to calling:
 - other bpf programs
 - limited + approved kernel memory info and data, only via bpf helper
   functions
 - limited and explicitly allowed kernel functions marked as kfuncs
Attempted accesses outside these bounds should be caught by the verifier before a bpf program is loaded into the kernel.
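For example, a program that casts an arbitrary scalar to a pointer and dereferences it should never make it past BPF_PROG_LOAD (an illustrative sketch reusing the includes from the example above; the exact verifier error message varies between kernel versions):

	SEC("socket")
	int bad_prog(struct __sk_buff *skb)
	{
		/* arbitrary kernel address - not derived from ctx or a map */
		unsigned long *p = (unsigned long *)0xffff800012345678UL;
		return *p;	/* load rejected by the verifier: invalid mem access */
	}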
                             kernel memory
┌──────────────────────────────────────────────────────────────────┐
│                                 executable memory (RO)           │
│  ┌──────────┐                  ┌────────┬─────────┬────────┐     │
│  │          │                  │        │         │        │     │
│  │  KFUNCS  │   allowed        │  BPF   │         │  BPF   │     │
│  │          │◄─────────────────┤  PROG  ◄─────────►  PROG  │     │
│  └──────────┘                  │        │         │        │     │
│                                ├────────┘         └────────┤     │
│                                │                           │     │
│                                │             ┌--------┐    │     │
│  ┌-----------┐                 │ load denied '        '    │     │
│  ' arbitrary '◄----X─────────────────────────'  BPF   '    │     │
│  '    r/w    '                 │ by verifier '  PROG  '    │     │
│  └-----------┘                 │             '        '    │     │
│                                │             └--------┘    │     │
│                                │                           │     │
│  ┌─────────┐◄───┐              │             ┌────────┐    │     │
│  └─────────┘    ├───────────┐  │             │        │    │     │
│                 │    BPF    │◄─┼─────────────┤  BPF   │    │     │
│  ┌─────────┐◄───┤  HELPERS  │  │             │  PROG  │    │     │
│  └─────────┘    └───────────┘  │             │        │    │     │
│                                └─────────────┴────────┴────┘     │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
Verifier Bypass
Attacks on the verifier generally share a common theme - tricking the verifier into marking unsafe code as safe. This has been done in a number of ways, generally by exploiting faulty control flow graph logic in the verifier[12], abusing type bounds[13], adding unsafe offsets to pointers[14], and others[2].
Since bpf programs exist in the kernel address space, once the program has passed the verifier and been loaded into the kernel, there are no further barriers or run time checks on kernel memory accesses. This results in arbitrary read/writes usually leading to privilege escalation.
                             kernel memory
┌──────────────────────────────────────────────────────────────────┐
│                                 executable memory (RO)           │
│  ┌──────────┐                  ┌────────┬─────────┬────────┐     │
│  │          │                  │        │         │        │     │
│  │  KFUNCS  │   allowed        │  BPF   │         │  BPF   │     │
│  │          │◄─────────────────┤  PROG  ◄─────────►  PROG  │     │
│  └──────────┘                  │        │         │        │     │
│                                ├────────┘         └────────┤     │
│                                │                           │     │
│                                │             ┌────────┐    │     │
│  ┌───────────┐                 │  verifier   │        │    │     │
│  │ arbitrary │◄────────────────┼─────────────┤  BPF   │    │     │
│  │    r/w    │                 │   bypass    │  PROG  │    │     │
│  └───────────┘                 │             │        │    │     │
│                                │             └────────┘    │     │
│                                │                           │     │
│  ┌─────────┐◄───┐              │             ┌────────┐    │     │
│  └─────────┘    ├───────────┐  │             │        │    │     │
│                 │    BPF    │◄─┼─────────────┤  BPF   │    │     │
│  ┌─────────┐◄───┤  HELPERS  │  │             │  PROG  │    │     │
│  └─────────┘    └───────────┘  │             │        │    │     │
│                                └─────────────┴────────┴────┘     │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
This ability to call directly into the kernel is somewhat by design. Any mechanism such as putting programs in a different address space would result in additional latency from context switches. When running as packet filtering code for example, this additional latency may be unacceptable.
In theory, Morello compartments provide a much more lightweight way to enforce memory isolation. Depending on the performance impact (yet to be determined), this compartmentalisation method could be deployed either only on some eBPF program types, as determined by the sysadmin, or eBPF-wide to provide robustness to the overall system.
Hybrid Mode
Morello supports two execution modes - 'hybrid' mode and 'pure capability' aka 'purecap' mode. In purecap mode, all pointers are capabilities, and every access is checked against the bounds and permissions of the capability attached to the address. In hybrid mode, pointers remain 64-bit unless specifically annotated with the `__capability` qualifier in code, resulting in a mix of capability pointers and standard pointers.
Importantly for hybrid mode, and perhaps counter-intuitively, capability checks are still made for all normal, standard non-capability pointer memory accesses. A standard pointer will derive a capability from the Default Data Capability (DDC) and memory accesses will be checked against that. Similarly, the Program Counter is extended with a capability (PCC). When the instruction at the PCC is fetched for decoding and execution, it is checked against the corresponding capability. Any out of bounds access, permission violation or tag issue results in a capability fault exception.
One benefit of this is that existing aarch64 programs gain capability checks without having to be recompiled to use capabilities. In hybrid mode, by setting the DDC register of the processor we can limit the memory accesses of all normal pointers. By setting a capability in PCC we can limit what code can be executed. This provides a simple mechanism to form the basis of a hybrid compartment.
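As a rough illustration in C (a userspace-style sketch using the CHERI compiler builtins from cheriintrin.h, the same builtins used by the patches later in this series):

	#include <cheriintrin.h>

	char buf[64];

	int hybrid_demo(void)
	{
		/* derive a capability from DDC and shrink it to cover only buf */
		char * __capability p = cheri_address_set(cheri_ddc_get(),
							  (unsigned long)buf);
		p = cheri_bounds_set(p, sizeof(buf));

		p[0] = 'a';	/* within bounds: OK */
		/* p[64] = 'a';	   out of bounds: capability fault */
		return p[0];
	}

In the eBPF case the same idea is applied via the system registers: the kernel installs restricted capabilities in (R)DDC and PCC, and every plain load, store and instruction fetch performed by the JIT'd code is checked against them.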
For more information on hybrid mode see the Morello Examples repo on GitLab[15].
Hybrid Compartment
Being able to restrict and run existing aarch64 code with Morello maps nicely onto the arm64 eBPF JIT engine. The eBPF JIT converts eBPF instructions into plain aarch64 assembly, which can be restricted using the features of hybrid mode to limit memory access and code execution. The Morello Linux kernel on which these changes are based is also a hybrid kernel (supporting a pure capability userspace)[16]. This is therefore a relatively simple approach with fairly minimal changes required.
A rough model of compartmentalising eBPF with hybrid mode is shown below. The bounds of DDC and PCC are restricted to a memory area forming the hybrid compartment, allowing intra-bpf calls. Any memory accesses or branches outside of the approved kfuncs or bpf helper calls are disallowed.
          kernel memory                  hybrid compartment
┌─────────────────────────────────┬─────────────────────────────┬──┐
│                                 │┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼│  │
│  ┌──────────┐                   │┼────────┼─────────┼────────┼│  │
│  │          │                   ││        │         │        ││  │
│  │  KFUNCS  │          allowed  ││  BPF   │         │  BPF   ││  │
│  │          │◄──────────────────┼┤  PROG  ◄─────────►  PROG  ││  │
│  └──────────┘                   ││        │         │        ││  │
│                                 │┼────────┘         └────────┼│  │
│                                 ││                           ││  │
│                                 ││             ┌────────┐    ││  │
│   capability                    ├┘  verifier   │        │    ││  │
│   fault/exception               │X◄────────────┤  BPF   │    ││  │
│                                 ├┐  bypass     │  PROG  │    ││  │
│                                 ││             │        │    ││  │
│                                 ││             └────────┘    ││  │
│                                 ││                           ││  │
│  ┌─────────┐◄───┐               ││             ┌────────┐    ││  │
│  └─────────┘    ├───────────┐   ││             │        │    ││  │
│                 │    BPF    │◄──┼┼─────────────┤  BPF   │    ││  │
│  ┌─────────┐◄───┤  HELPERS  │   ││             │  PROG  │    ││  │
│  └─────────┘    └───────────┘   ││             │        │    ││  │
│                                 │┼─────────────┴────────┴────┼│  │
│                                 │┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼│  │
└─────────────────────────────────┴─────────────────────────────┴──┘
This type of compartment has in general proved impractical for most use cases, since many accesses in and out of the compartment are usually required (e.g. library calls into libc), and controlling the compartment boundary in terms of these accesses can be difficult. In the case of eBPF, the majority of code and accesses are internal. There are no library calls, and each program type has strictly limited access to a small number of appropriate bpf helpers and kfuncs that are pre-defined in the verifier[17][18].
eBPF therefore turns out to be a good use case for this type of compartment. Since the accesses outside the compartment are limited and well defined this should allow for easier domain transitions between the compartment and the kernel.
Moving the eBPF stack
Currently JIT'd eBPF reuses the kernel stack. In order for the compartmentalised eBPF program to access the stack, a separate stack must be allocated inside the compartment. This also allows for clean separation and isolation of kernel and eBPF stacks.
Allocation and freeing of this new stack is done at the same time as the memory for the JIT'd binary image; this way the eBPF stack lives for the lifetime of the eBPF program.
Domain Transitions via Restricted/Executive mode
The Restricted/Executive mode of the Morello hardware provides a simple way to handle domain transitions aka switching between the kernel and the hybrid compartment.
Restricted mode introduces the banked alternative registers RDDC_EL0, RCSP_EL0 and RCTPIDR_EL0, controlled via the EXECUTIVE bit in PCC[19]. Executive mode has access to the Restricted regs, but not vice versa. Hence the compartment manager, in this case the kernel, can set up a compartment in the Restricted regs, such as providing a new stack pointer in RCSP_EL0, before atomically switching execution to Restricted mode.
Switching into Restricted mode is done via a BRR/BLRR instruction[20] on a sealed (non-modifiable) capability function pointer with the EXECUTIVE bit unset.
Atomically returning back to Executive mode can be done simply with RET(CLR), where CLR is a sealed link register capability with the EXECUTIVE bit set.
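Put together, a round trip through the compartment looks schematically like this (a simplified sketch distilled from the bpf_enter_sandbox() code later in this series; the c-register choices are illustrative):

	/* Executive mode - compartment manager sets up Restricted state */
	msr rcsp_el0, c1	/* c1 = capability for the new Restricted stack */
	msr rddc_el0, c2	/* c2 = restricted data capability */
				/* c0  = sealed entry cap, EXECUTIVE bit clear */
				/* c30 = sealed return cap, EXECUTIVE bit set */
	brr c0			/* atomically switch into Restricted mode */

	/* ... Restricted mode - compartmentalised code runs here ... */

	ret clr			/* sealed CLR with EXECUTIVE set: back to Executive */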
For further details about Restricted/Executive mode see the Morello Examples repo on Gitlab[21] and the Morello Architecture Reference Manual[19].
Hybrid Compartment Structure
The RFC patches implement a hybrid compartment structure roughly as below. Every JIT'd eBPF program is placed between a prologue/epilogue as generated by the bpf_jit_comp.c:build_{prologue,epilogue} functions.
┌─── prologue
│1           |
│2           | executive mode
│3           |
│4 brr(fnp)─┐|            // fnp=5 clr=y
│5 ▼---------┼-----------
│6           |
├───         | main
│7           |
│8           |
│9           | restricted mode
│.           |
│.           |
│.           |
├───         | epilogue
│x ret(clr)─┐|            // clr=y
│y ▼---------┼-----------
│z           |
│.           | executive mode
│.           |
└───
The prologue comprises two parts. The first part, labeled above as 1,2,3, mostly adheres to the Arm64 Procedure Call Standard (AAPCS), e.g. preserving the FP, LR and regs x19-x28 on the original kernel stack.
After this, we set up the compartment. The JIT compiler does not provide encodings for Morello instructions, so this is done by branching to bpf_enter_sandbox(). bpf_enter_sandbox() sets up the Restricted mode regs, which includes setting the new stack pointer RCSP_EL0 and restricting RDDC_EL0 as appropriate.
To restrict the PCC, the sealed function pointer (FNP) used as the target of the BRR instruction has its bounds and permissions restricted. FNP points back into the prologue (labeled above as '5'), where we can continue with setting up the eBPF stack in Restricted mode after we've atomically switched stacks to the new value we put in RCSP_EL0.
Before switching execution to Restricted via BRR(FNP), we manually create a sealed capability link register (CLR) to point to the instruction (labeled above as 'y') near the top of the epilogue. Since the JIT compiler does multiple passes, we're able to calculate this as a fixed offset. Since we don't yet know where the JIT code will be loaded, the CLR can be described as a PCC relative offset using an ADR instruction. Having the EXECUTIVE bit set on CLR switches execution back to Executive mode when used as the target of RET.
Since not all registers are banked, after the transition from Executive to Restricted, care must be taken to sanitise all general purpose regs to avoid leaking kernel regs to potentially untrusted eBPF code in main.
Exception Handling
Usually kernel exceptions are handled by el1h_64_xyz_handler() where 'h' indicates that the stack pointer in SP_EL1 is being used[22].
As per the Morello Architecture Reference manual:
"If the PE is in Restricted and an exception is taken from the current Exception level, exception entry uses the same exception vector as an exception taken from the current Exception level with SP_EL0" [23]
This means for code running in Restricted mode at EL1, exceptions will use the el1t_64_xyz_handler(), where the 't' suffix indicates SP_EL0 is being used.
Since the el1t_xyz_handlers are currently unhandled in the kernel, the RFC patches reuse the existing exception handlers.
No special exception handling is required here. Should an eBPF program make an out of bounds access or attempt to execute code out of bounds, this will result in a capability fault. Since kernel state may have been changed by the eBPF program, e.g. via various bpf helper functions, we cannot reliably or easily unwind this state back to a known good state. In this case, as per the existing behaviour of eBPF, the only safe thing to do is to kill the kernel thread, resulting in a kernel Oops, for example:
[66105.333338] Unable to handle kernel paging request at virtual address ffff800080168b58
[66105.341538] Mem abort info:
[66105.344368]   ESR = 0x0000000086000028
[66105.348389]   EC = 0x21: IABT (current EL), IL = 32 bits
[66105.353824]   SET = 0, FnV = 0
[66105.356911]   EA = 0, S1PTW = 0
[66105.360056]   FSC = 0x28: capability tag fault
[...]
[66105.383898] Internal error: Oops: 0000000086000028 [#3] PREEMPT SMP
[66105.390152] Modules linked in:
[66105.393194] CPU: 0 PID: 3273 Comm: bpf_print_sp Not tainted 6.7.0-gcf3037d47c40 #4
[66105.402312] Hardware name: ARM LTD Morello System Development Platform, BIOS EDK II Jul 19 2023
[66105.410994] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS ISA=A64 BTYPE=--)
[66105.418637] pc : bpf_trace_printk+0x0/0x13c
[66105.422812] lr : bpf_prog_015bce9d80f8185c+0xf8/0x128
[...]
[66105.502369] pcc: 0:ff7f40004ddc8cac:ffff800080168b58
[66105.507319] clr: 0:0000000000000000:ffff800082618d60
[66105.512268] csp: 0:0000000000000000:0000000000000000
[...]
[66105.649892] Call trace:
[66105.652325]  bpf_trace_printk+0x0/0x13c
[66105.662228] ---[ end trace 0000000000000000 ]---
Breaking this down, we can see that we've faulted at the start (+0x0) of the bpf_trace_printk() bpf helper function. The link reg shows we got there from a bpf program. The fault status code (FSC) shows a capability tag fault, and looking at PCC we can see the first field, the tag bit, is 0 (invalid).
What has happened is that bpf_trace_printk() exists outside the compartment. When the branch is based on an immediate or X register value like here, the program counter is updated with the address of bpf_trace_printk() via BranchToAddr[24], and the PCC is modified/updated with CapSetValue[25].
BranchToAddr(bits(N) target, BranchType branch_type)
	[...]
	assert N == 64 && !UsingAArch32();
	_PC = target<63:0>;
	PCC = CapSetValue(PCC, target<63:0>);
	return;
In this case, the address of bpf_trace_printk() is so far out of bounds that it is unrepresentable, and therefore the tag is cleared by CapSetValue. This then results in a capability tag fault on the instruction fetch in the next cycle.
This is a bit of a special case. Where the instruction is outside of capability bounds but representable we would expect a standard out of bounds capability fault.
Note: bpf_trace_printk is a bpf helper function and should work, however the RFC patches currently do not include any trampolines or mechanism to call out of the compartment to approved helper functions or kfuncs.
Also note that due to zeroing regs on entering the compartment, we've lost the ability to get a full kernel call trace.
Current State & Future Work
The attached RFC patches roughly implement the below:
          kernel memory                  hybrid compartment
┌─────────────────────────────────┬─────────────────────────────┬──┐
│                                 │┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼│  │
│  ┌──────────┐                   │┼────────┼─────────┼────────┼│  │
│  │          │                   ││        │         │        ││  │
│  │  KFUNCS  │◄──────────────────┼X  BPF   X◄───────►X  BPF   ││  │
│  │          │                   ││  PROG  │         │  PROG  ││  │
│  └──────────┘                   ││        │         │        ││  │
│                                 │┼────────┘         └────────┼│  │
│                                 ││                           ││  │
│                                 ││             ┌────────┐    ││  │
│   capability                    ├┘  verifier   │        │    ││  │
│   fault/exceptions              │X◄────────────┤  BPF   │    ││  │
│                                 ├┐  bypass     │  PROG  │    ││  │
│                                 ││             │        │    ││  │
│                                 ││             └────────┘    ││  │
│                                 ││                           ││  │
│  ┌─────────┐◄───┐               ││             ┌────────┐    ││  │
│  └─────────┘    ├───────────┐   ││             │        │    ││  │
│                 │    BPF    │◄──┼X─────────────┤  BPF   │    ││  │
│  ┌─────────┐◄───┤  HELPERS  │   ││             │  PROG  │    ││  │
│  └─────────┘    └───────────┘   ││             │        │    ││  │
│                                 │┼─────────────┴────────┴────┼│  │
│                                 │┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼│  │
└─────────────────────────────────┴─────────────────────────────┴──┘
RDDC_EL0: bounds set to vmalloc memory region
RCSP_EL0: points to new stack allocated in vmalloc region
PCC:      bounds set to the JIT'd program binary image
For simplicity's sake in the RFC patch set, the DDC bounds are currently set up to cover the entire vmalloc memory region in the kernel[26]. This is a wide memory area where JIT'd eBPF programs are already allocated. The eBPF stack is also allocated in this region. The PCC bounds are set to be roughly the code area taken up by the eBPF binary image. This means tail calls between two bpf programs and general intra-bpf calls are not currently possible. Future work should include either a specific carve out for all eBPF programs and their stacks, or individual compartments per program with a mechanism for intra-bpf calls.
There is also currently no mechanism to allow access to bpf helpers and kfuncs. It's highly likely this will have to involve a full domain transition back to Executive to allow the bpf helpers to access the kernel memory they require. The bpf helper must then return with RETR or B(L)RR back into Restricted mode. Capabilities to allow access to these helpers must therefore be made available to the compartment before branching into Restricted mode.
This will look roughly something like the below:
	// Restricted
	blr <helper_cap>
	// Executive
	str clr, [...]
	...
	ldr clr, [...]
	retr clr
	// Restricted
Filtering of arguments to the bpf helper functions and kfuncs, especially those containing pointers, is one of the most problematic and currently unsolved aspects. There is a risk of bpf helper functions being used as gadgets to perform operations not allowed inside the compartment.
For example, the bpf_strncmp helper allows taking any arbitrary pointer and passing it to strncmp:
BPF_CALL_3(bpf_strncmp, const char *, s1, u32, s1_sz, const char *, s2)
{
	return strncmp(s1, s2, s1_sz);
}
This could conceivably be used to methodically move through and map out kernel memory based on the location of known strings.
bpf_strtol is even more concerning, as this takes an arbitrary pointer, reads it as a string and writes the long int equivalent to another arbitrary pointer. This is essentially an arbitrary read and write primitive for all of kernel memory.
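For reference, the helper's definition in kernel/bpf/helpers.c has roughly the following shape (details may differ between kernel versions); note that both buf and res are pointers supplied directly by the bpf program:

	BPF_CALL_4(bpf_strtol, const char *, buf, size_t, buf_len, u64, flags,
		   long *, res)
	{
		long long _res;
		int err;

		/* reads kernel memory through buf */
		err = __bpf_strtoll(buf, buf_len, flags, &_res);
		if (err < 0)
			return err;
		if (_res != (long)_res)
			return -ERANGE;
		/* writes kernel memory through res */
		*res = _res;
		return err;
	}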
The next step for anyone continuing this work is to first resolve these issues; the argument filtering in particular seems difficult to resolve. Access to these kinds of bpf helpers that provide wide ranging capabilities will have to be limited in some way.
Future uArch changes
Currently, to exit from Executive mode we have to manually craft a CLR that points to the instruction directly after the RET(CLR). It is possible to determine this as a fixed offset thanks to the multi-pass JIT compilation; however, the process could be simplified with a new instruction that returns back to Executive via a label instead of a register. For example, RETE #4 would switch to Executive at the next instruction, avoiding the need to retain CLR through the call stack. This would require strong landing pads, similar to BTI, to verify and check that switching states is allowed. This also works well in a JIT context, where we can guarantee that user controlled code cannot generate RETE instructions or landing pads.
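Schematically (every mnemonic below is hypothetical and invented purely for illustration; nothing like RETE or its landing pad exists in the current Morello ISA):

	/* Restricted mode */
	...
	rete #4		/* hypothetical: return to Executive, resume at PC+4 */
	lpe		/* hypothetical Executive landing pad, analogous to BTI */
	/* Executive mode continues here */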
Other Future Work
An alternative option to the hybrid compartment would be to extend the JIT engine to output Morello/capability instructions and make the resulting program purecap. The engineering effort involved in this would be significant: adding the instruction encodings for the new Morello instructions and then using them to generate correct and secure purecap code would be possible but non-trivial.
In this scenario, where the kernel remains hybrid, there exists a mismatch between the kernel and eBPF ABIs. Since the eBPF interface remains a C interface using standard pointers, translating it to use capabilities and making appropriate restrictions would be difficult. Thus a purecap eBPF with a hybrid kernel, or vice versa, a purecap kernel with hybrid eBPF, may not be viable.
A purecap kernel and purecap eBPF would provide the strongest and most straightforward model of compartmentalisation, however this comes with a very high engineering cost. The hybrid model therefore is attractive for the potential to add security with relative ease of implementation and being relatively simple to integrate into existing systems.
To further extend the hybrid model and mitigate vulnerabilities in the JIT compiler and verifier itself, it might be useful to run these inside separate compartments. Since these are all parsing user input they may benefit from some level of isolation.
General Limitations
A hybrid compartment solves the issue of bpf programs accessing other areas of memory outside of the compartment, but it cannot easily solve the inverse - exploits elsewhere in the system executing data within the compartment as code.
A general problem with JIT compilation is there exists a user controlled area of executable data in memory. If an attacker can control a return address they can jump to and execute this data.
                 kernel memory
┌─────────────────────────────────┬──┬──────────┬──────┬────┬───┬──┐
│                                 │┼┼│executable│memory│(RO)│┼┼┼│  │
│                                 │┼─┴──────────┼──────┴────┴──┼│  │
│                                 ││     JIT'd  │              ││  │
│                                 ││  BPF PROG  │              ││  │
│  ┌─────────┐    jump/branch     ││  ┌──────┐  │              ││  │
│  │ exploit ├────────────────────┼┼──► data │  │              ││  │
│  └─────────┘                    ││  └──────┘  │              ││  │
│                                 ││            │              ││  │
│                                 │┼────────────┘              ││  │
│                                 ││                           ││  │
│                                 ││                           ││  │
│                                 │┼───────────────────────────┼│  │
│                                 │┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼│  │
└─────────────────────────────────┴─────────────────────────────┴──┘
A hybrid Morello model could solve the issue but it would require 2 compartments:
1. eBPF compartment
2. !eBPF aka "everything else"
Currently the DDC and PCC in the PCuABI kernel are completely unrestricted. Short of a clear cut, simple linear address space, setting the bounds for the second compartment, i.e. the kernel, is difficult if not impossible with how bounds encoding currently works in Morello and with a hybrid kernel; the limited (compressed) bounds encoding of Morello is the limiting factor here. A purecap kernel should make mitigating this type of exploit much easier.
Note: the arm64 kernel already contains some mitigations against this problem. The addresses of JIT images are already randomised and marked as RO[11], although JIT spraying can be used to bypass this. In addition, Branch Target Identification (BTI)[27] has also been enabled for JIT'd images[28], although BTI is not available on the Morello platform, which is based on Armv8.2-A.
The /proc/sys/net/core/bpf_jit_harden option[29][30] can also be used to effectively nullify this attack, although with some overhead. All JIT'd 32b and 64b constants are "blinded" in memory by saving them XOR'd with a random number. This operation is then undone at execution time.
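Schematically, the rewrite looks like this for each instruction carrying a constant (a sketch in eBPF pseudo-assembly; BPF_REG_AX is the scratch register the kernel reserves for blinding):

	; original - 0xdead would appear verbatim in the image
	mov r2, 0xdead

	; blinded - using a per-program random value rnd
	mov rAX, (rnd ^ 0xdead)	; only the blinded constant is emitted
	xor rAX, rAX, rnd	; recover the real value at run time
	mov r2, rAX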
LTP Test
An LTP test that can be used to test current operation of the RFC patches is available at:
https://git.morello-project.org/zdleaf/morello-linux-test-project/-/commits/...
This is a simple test that makes a single call to the bpf helper function bpf_trace_printk to print the current SP. Given the current lack of a mechanism/trampoline to call helper functions, this should result in a capability tag fault and a test fail. Oops details are printed in dmesg.
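The bpf program at the heart of the test looks roughly like the following (a sketch, not the exact LTP source; reading the read-only frame pointer r10 via BPF inline asm is one way of observing the current bpf SP). The call to bpf_trace_printk is what faults at the compartment boundary:

	SEC("socket")
	int bpf_print_sp(struct __sk_buff *skb)
	{
		char fmt[] = "sp=%lx\n";
		unsigned long sp;

		asm volatile("%0 = r10" : "=r"(sp)); /* r10 = bpf frame ptr */
		bpf_trace_printk(fmt, sizeof(fmt), sp);
		return 0;
	}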
Thanks
Thanks to everyone at Arm and Cambridge University for their help and support, in particular Yury Khrustalev and Kevin Brodsky. This work was only possible building on the extensive work already completed on compartments, the PCuABI Linux kernel and Morello/CHERI.
Refs
1. https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=BPF
2. CVE-2016-2383 CVE-2016-4557 CVE-2017-16995 CVE-2017-17856 CVE-2017-17855
   CVE-2017-17854 CVE-2017-17853 CVE-2017-17852 CVE-2017-16996 CVE-2017-17862
   CVE-2017-17863 CVE-2017-17864 CVE-2017-9150 CVE-2018-18445 CVE-2019-7308
   CVE-2020-27170 CVE-2020-27171 CVE-2021-33200 CVE-2021-33624 CVE-2021-3444
   CVE-2021-3490 CVE-2021-4001 CVE-2021-45402 CVE-2022-0500 CVE-2022-23222
   CVE-2022-2785 CVE-2022-2905 CVE-2023-2163 CVE-2023-39191
   (list is not exhaustive)
3. https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-wit...
4. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
5. https://lore.kernel.org/lkml/YX%2FWKa4qYamp1ml9@FVFF77S0Q05N/T/
6. https://lwn.net/Articles/796328/
7. https://lwn.net/Articles/929746/
8. https://lore.kernel.org/bpf/20200513230355.7858-1-alexei.starovoitov@gmail.c...
9. https://arxiv.org/pdf/2302.10366.pdf
10. https://lwn.net/Articles/857228/
11. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
12. CVE-2023-2163
13. CVE-2021-33200, CVE-2021-3490, CVE-2021-3444, CVE-2020-8835, CVE-2020-27194,
    CVE-2018-18445
14. CVE-2022-23222
15. https://git.morello-project.org/morello/morello-examples/-/tree/main/src/hyb...
16. https://git.morello-project.org/morello/kernel/linux/-/wikis/Morello-pure-ca...
17. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kern...
18. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kern...
19. Morello Architecture Reference Manual (DDI0606), Executive/Restricted banking, RNHGSJ
20. Morello Architecture Reference Manual (DDI0606), 4.4.11 BLRR + 4.4.16 BRR
21. https://git.morello-project.org/morello/morello-examples/-/tree/main/src/res...
22. Arm Architecture Reference Manual for A-profile architecture (DDI 0487K.a),
    D1.2.2.1 Stack pointer register selection, IVYNZY
23. Morello Architecture Reference Manual (DDI0606), 2.13 Exception model, RGXNXG
24. Morello Architecture Reference Manual (DDI0606), 5.613 shared/functions/registers/BranchToAddr
25. Morello Architecture Reference Manual (DDI0606), 5.390 shared/functions/capability/CapSetValue
26. https://www.kernel.org/doc/html/v6.7/arch/arm64/memory.html
27. https://community.arm.com/arm-community-blogs/b/architectures-and-processors...
28. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
29. https://docs.cilium.io/en/latest/bpf/architecture/#hardening
30. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
Kevin Brodsky (3):
  arm64: morello: Context-switch RCSP at EL1 too
  arm64: morello: Context-switch RDDC on kernel entry/return
  arm64: morello: Set CCTLR_ELx.SBL
Zachary Leaf (10):
  arm64: morello: enable bpf jit in defconfig
  bpf: debug: bpf_jit_enable=0 by default
  bpf: debug: print jit'd code location
  bpf: debug: disable prologue size check
  bpf: jit: zero general purpose regs
  bpf: jit: simplify preserving x19-x28
  bpf: jit: move image_size into ctx
  bpf: jit: allocate stack in vmalloc region
  bpf: jit: handle exceptions
  bpf: jit: run inside restricted mode
 arch/arm64/configs/morello_pcuabi_defconfig |   2 +
 arch/arm64/include/asm/morello.h            |   1 -
 arch/arm64/include/asm/ptrace.h             |   1 +
 arch/arm64/include/asm/suspend.h            |   2 +-
 arch/arm64/kernel/asm-offsets.c             |   1 +
 arch/arm64/kernel/entry-common.c            |  23 ++-
 arch/arm64/kernel/entry.S                   |  15 ++
 arch/arm64/kernel/head.S                    |   5 +-
 arch/arm64/kernel/morello.c                 |   4 -
 arch/arm64/kernel/ptrace.c                  |   4 +-
 arch/arm64/mm/proc.S                        |  19 +-
 arch/arm64/net/bpf_jit_comp.c               | 205 ++++++++++++++++----
 include/linux/filter.h                      |   2 +
 kernel/bpf/core.c                           |  17 +-
 14 files changed, 240 insertions(+), 61 deletions(-)
--
2.34.1
From: Kevin Brodsky <kevin.brodsky@arm.com>
RCSP is currently considered as an EL0 register. However, Restricted registers are not banked, and as a result this assumes that the kernel itself does not modify it.
To enable the kernel to make use of RCSP, save and restore it when taking an exception from / returning to EL1, in addition to EL0.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/kernel/entry.S | 9 +++++++++
 1 file changed, 9 insertions(+)
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 329f58c5d1fb..cfc1abc396af 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -309,6 +309,10 @@ alternative_else_nop_endif
 	scs_load_current
 	.else
+#ifdef CONFIG_ARM64_MORELLO
+	mrs c22, rcsp_el0
+	str c22, [sp, #S_CSP + 16]
+#endif
 	add x21, sp, #PT_REGS_SIZE
 	get_current_task tsk
 	.endif /* \el == 0 */
@@ -476,6 +480,11 @@ alternative_else_nop_endif
 	mte_set_user_gcr tsk, x0, x1
 	apply_ssbd 0, x0, x1
+#ifdef CONFIG_ARM64_MORELLO
+	.else
+	ldr c25, [sp, #S_CSP + 16]
+	msr rcsp_el0, c25
+#endif
 	.endif
msr spsr_el1, x22
From: Kevin Brodsky <kevin.brodsky@arm.com>
RDDC is currently considered as an EL0 register. However, Restricted registers are not banked, and as a result this assumes that the kernel itself does not modify it.
To enable the kernel to make use of RDDC, context-switch it on every kernel entry/return, instead of when rescheduling (__switch_to). As a result its value is saved in pt_regs instead of morello_state, and a few places need fixing up accordingly.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/morello.h |  1 -
 arch/arm64/include/asm/ptrace.h  |  1 +
 arch/arm64/include/asm/suspend.h |  2 +-
 arch/arm64/kernel/asm-offsets.c  |  1 +
 arch/arm64/kernel/entry.S        |  6 ++++++
 arch/arm64/kernel/morello.c      |  4 ----
 arch/arm64/kernel/ptrace.c       |  4 ++--
 arch/arm64/mm/proc.S             | 14 ++++++--------
 8 files changed, 17 insertions(+), 16 deletions(-)
diff --git a/arch/arm64/include/asm/morello.h b/arch/arm64/include/asm/morello.h
index 556af690ed4d..c33a7193b397 100644
--- a/arch/arm64/include/asm/morello.h
+++ b/arch/arm64/include/asm/morello.h
@@ -19,7 +19,6 @@ struct morello_state {
 	uintcap_t ctpidr;
 	uintcap_t rctpidr;
 	uintcap_t ddc;
-	uintcap_t rddc;
 	uintcap_t cid;
 	unsigned long cctlr;
 };
diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index c7fa869f9191..1be90aebfcd5 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -208,6 +208,7 @@ struct pt_regs {
 	uintcap_t csp;
 	uintcap_t rcsp;
 	uintcap_t pcc;
+	uintcap_t rddc;
 #endif
 };
diff --git a/arch/arm64/include/asm/suspend.h b/arch/arm64/include/asm/suspend.h
index 7d49c553b8df..d7672c60aa9f 100644
--- a/arch/arm64/include/asm/suspend.h
+++ b/arch/arm64/include/asm/suspend.h
@@ -5,7 +5,7 @@
 #include <asm/morello.h>
 #define NR_CTX_REGS 13
-#define NR_CTX_CREGS 6
+#define NR_CTX_CREGS 5
 #define NR_CALLEE_SAVED_REGS 12
 /*
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index defc1f6f5d11..efa3431db4cb 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -85,6 +85,7 @@ int main(void)
 	DEFINE(S_CREGS, offsetof(struct pt_regs, cregs));
 	DEFINE(S_CSP, offsetof(struct pt_regs, csp));
 	DEFINE(S_PCC, offsetof(struct pt_regs, pcc));
+	DEFINE(S_RDDC, offsetof(struct pt_regs, rddc));
 #endif
 	DEFINE(PT_REGS_SIZE, sizeof(struct pt_regs));
 	BLANK();
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index cfc1abc396af..949c6869daa7 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -239,6 +239,9 @@ alternative_cb_end
 	mrs c26, celr_el1
 	str c26, [sp, #S_PCC]
+
+	mrs c27, rddc_el0
+	str c27, [sp, #S_RDDC]
 #endif
 	.if \el == 0
@@ -490,6 +493,9 @@ alternative_else_nop_endif
 	msr spsr_el1, x22
 #ifdef CONFIG_ARM64_MORELLO
+	ldr c25, [sp, #S_RDDC]
+	msr rddc_el0, c25
+
 	morello_merge_c_x 26, x21 // merge PC into PCC
 	msr celr_el1, c26
diff --git a/arch/arm64/kernel/morello.c b/arch/arm64/kernel/morello.c
index da9d8b1f2578..7003c53445f9 100644
--- a/arch/arm64/kernel/morello.c
+++ b/arch/arm64/kernel/morello.c
@@ -172,9 +172,7 @@ void morello_thread_init_user(void)
 	write_cap_sysreg(0, rctpidr_el0);
 	write_cap_sysreg(ddc, ddc_el0);
-	write_cap_sysreg(0, rddc_el0);
 	morello_state->ddc = ddc;
-	morello_state->rddc = (uintcap_t)0;
 	write_cap_sysreg(0, cid_el0);
 	morello_state->cid = (uintcap_t)0;
@@ -205,7 +203,6 @@ void morello_thread_save_user_state(struct task_struct *tsk)
 	/* (R)CTPIDR is handled by task_save_user_tls */
 	morello_state->ddc = read_cap_sysreg(ddc_el0);
-	morello_state->rddc = read_cap_sysreg(rddc_el0);
 	morello_state->cid = read_cap_sysreg(cid_el0);
 	morello_state->cctlr = read_sysreg(cctlr_el0);
 }
@@ -216,7 +213,6 @@ void morello_thread_restore_user_state(struct task_struct *tsk)
 	/* (R)CTPIDR is handled by task_restore_user_tls */
 	write_cap_sysreg(morello_state->ddc, ddc_el0);
-	write_cap_sysreg(morello_state->rddc, rddc_el0);
 	write_cap_sysreg(morello_state->cid, cid_el0);
 	write_sysreg(morello_state->cctlr, cctlr_el0);
 }
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index be1faed4908e..57aa2c49f6d0 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -1472,7 +1472,7 @@ static int morello_get(struct task_struct *target,
 	MORELLO_STATE_COPY_VAL_TAG(out, ctpidr, morello_state->ctpidr);
 	MORELLO_STATE_COPY_VAL_TAG(out, rcsp, regs->rcsp);
-	MORELLO_STATE_COPY_VAL_TAG(out, rddc, morello_state->rddc);
+	MORELLO_STATE_COPY_VAL_TAG(out, rddc, regs->rddc);
 	MORELLO_STATE_COPY_VAL_TAG(out, rctpidr, morello_state->rctpidr);
 	MORELLO_STATE_COPY_VAL_TAG(out, cid, morello_state->cid);
@@ -1514,7 +1514,7 @@ static int morello_set(struct task_struct *target,
 	MORELLO_STATE_BUILD_CAP(new_state, ctpidr, morello_state->ctpidr);
 	MORELLO_STATE_BUILD_CAP(new_state, rcsp, regs->rcsp);
-	MORELLO_STATE_BUILD_CAP(new_state, rddc, morello_state->rddc);
+	MORELLO_STATE_BUILD_CAP(new_state, rddc, regs->rddc);
 	MORELLO_STATE_BUILD_CAP(new_state, rctpidr, morello_state->rctpidr);
 	MORELLO_STATE_BUILD_CAP(new_state, cid, morello_state->cid);
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index f16186d16d2a..f67a1a17e909 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -92,12 +92,11 @@ SYM_FUNC_START(cpu_do_suspend)
 	mrs c9, ctpidr_el0
 	mrs c10, rctpidr_el0
 	mrs c11, ddc_el0
-	mrs c12, rddc_el0
-	mrs c13, cid_el0
-	mrs c14, cvbar_el1
+	mrs c12, cid_el0
+	mrs c13, cvbar_el1
 	stp c9, c10, [x0, #CPU_CTX_CREGS]
 	stp c11, c12, [x0, #CPU_CTX_CREGS + 32]
-	stp c13, c14, [x0, #CPU_CTX_CREGS + 64]
+	str c13, [x0, #CPU_CTX_CREGS + 64]
 #else /* CONFIG_ARM64_MORELLO */
 	mrs x2, tpidr_el0
 	mrs x8, vbar_el1
@@ -170,10 +169,9 @@ SYM_FUNC_START(cpu_do_resume)
 	msr rctpidr_el0, c3
 	ldp c2, c3, [x0, #CPU_CTX_CREGS + 32]
 	msr ddc_el0, c2
-	msr rddc_el0, c3
-	ldp c2, c3, [x0, #CPU_CTX_CREGS + 64]
-	msr cid_el0, c2
-	msr cvbar_el1, c3
+	msr cid_el0, c3
+	ldr c2, [x0, #CPU_CTX_CREGS + 64]
+	msr cvbar_el1, c2
 #else /* CONFIG_ARM64_MORELLO */
 	msr tpidr_el0, x2
 	msr vbar_el1, x9
From: Kevin Brodsky <kevin.brodsky@arm.com>
This should help with in-kernel sandboxing.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/kernel/head.S | 5 +++--
 arch/arm64/mm/proc.S     | 5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 59fda4ffc43f..072a676a7bbc 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -598,8 +598,9 @@ SYM_INNER_LABEL(init_el2, SYM_L_LOCAL)
 	bic x1, x1, #CPTR_EL2_TC
 	msr cptr_el2, x1
 	isb
-	/* Disable PCC/DDC base offset and other capability-related features */
-	msr cctlr_el2, xzr
+	/* Seal CLR / require a sealed target for capability branches */
+	mov x9, #CCTLR_ELx_SBL
+	msr cctlr_el2, x9
 	/*
 	 * Capability exception entry/return is now enabled, as a result we
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index f67a1a17e909..3ba111dab6fc 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -498,8 +498,9 @@ SYM_FUNC_START(__cpu_setup)
 	orr x9, x9, CPACR_EL1_CEN
 	msr cpacr_el1, x9
 	isb
-	/* Disable PCC/DDC base offset and other capability-related features */
-	msr cctlr_el1, xzr
+	/* Seal CLR / require a sealed target for capability branches */
+	mov x9, #CCTLR_ELx_SBL
+	msr cctlr_el1, x9
 	/*
 	 * Allow controlling the Morello-defined capability tag load/store
Enable bpf JIT and allow unprivileged bpf by default without having to specify on every boot:
echo "0" > /proc/sys/kernel/unprivileged_bpf_disabled
Signed-off-by: Zachary Leaf <zachary.leaf@arm.com>
---
 arch/arm64/configs/morello_pcuabi_defconfig | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/arch/arm64/configs/morello_pcuabi_defconfig b/arch/arm64/configs/morello_pcuabi_defconfig
index eb778c38abbe..620460d12749 100644
--- a/arch/arm64/configs/morello_pcuabi_defconfig
+++ b/arch/arm64/configs/morello_pcuabi_defconfig
@@ -4,6 +4,8 @@ CONFIG_AUDIT=y
 CONFIG_NO_HZ_IDLE=y
 CONFIG_HIGH_RES_TIMERS=y
 CONFIG_BPF_SYSCALL=y
+CONFIG_BPF_JIT=y
+# CONFIG_BPF_UNPRIV_DEFAULT_OFF is not set
 CONFIG_PREEMPT=y
 CONFIG_IRQ_TIME_ACCOUNTING=y
 CONFIG_BSD_PROCESS_ACCT=y
While messing with the JIT, it's useful to have it turned off by default; since systemd runs some bpf on startup, if we break the JIT we'll end up not being able to boot.
note: CONFIG_BPF_JIT_DEFAULT_ON is automatically selected by ARCH_WANT_DEFAULT_BPF_JIT and cannot be selected as part of a defconfig, hence the approach here.
Once booted, turn on the JIT via:
echo "2" > /proc/sys/net/core/bpf_jit_enable
Signed-off-by: Zachary Leaf <zachary.leaf@arm.com>
---
 kernel/bpf/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index fe254ae035fe..93b7dd22236c 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -555,7 +555,7 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
 #ifdef CONFIG_BPF_JIT
 /* All BPF JIT sysctl knobs here. */
-int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_DEFAULT_ON);
+int bpf_jit_enable   __read_mostly = 0;
 int bpf_jit_kallsyms __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_DEFAULT_ON);
 int bpf_jit_harden   __read_mostly;
 long bpf_jit_limit   __read_mostly;
For debugging/setting kernel breakpoints etc, it's useful to know the location of our JIT'd program in memory. Since the start of the JIT'd bpf code is randomised to make ROP more difficult, print out the actual start.
Signed-off-by: Zachary Leaf <zachary.leaf@arm.com>
---
 kernel/bpf/core.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 93b7dd22236c..510fec53df3f 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1058,6 +1058,8 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 	/* Leave a random number of instructions before BPF code. */
 	*image_ptr = &hdr->image[start];
+	/* The actual start of the JIT code */
+	printk("%s JIT loc=%#lx\n", __func__, *image_ptr);
 	return hdr;
 }
We're about to add a lot of instructions to the prologue. To avoid having to recalculate PROLOGUE_OFFSET every time and keep that up to date, disable the static #define and check in build_prologue.
The check in build_prologue seems like it should know what the correct offset is anyway, and the only place this offset is used is in build_body which is called after build_prologue in bpf_int_jit_compile.
Set the offset in build_prologue to save manually calculating/updating the #define.
Signed-off-by: Zachary Leaf <zachary.leaf@arm.com>
---
 arch/arm64/net/bpf_jit_comp.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 7d4af64e3982..920c1bfd098e 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -41,6 +41,8 @@
 #define check_imm19(imm) check_imm(19, imm)
 #define check_imm26(imm) check_imm(26, imm)
+static int PROLOGUE_OFFSET = 0;
+
 /* Map BPF registers to A64 registers */
 static const int bpf2a64[] = {
 	/* return value from in-kernel function, and exit value from eBPF */
@@ -282,9 +284,6 @@ static bool is_lsi_offset(int offset, int scale)
 /* Offset of nop instruction in bpf prog entry to be poked */
 #define POKE_OFFSET (BTI_INSNS + 1)
-/* Tail call offset to jump into */
-#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 8)
-
 static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 {
 	const struct bpf_prog *prog = ctx->prog;
@@ -297,7 +296,6 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 	const u8 tcc = bpf2a64[TCALL_CNT];
 	const u8 fpb = bpf2a64[FP_BOTTOM];
 	const int idx0 = ctx->idx;
-	int cur_offset;
 	/*
 	 * BPF prog stack layout
@@ -354,12 +352,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 	/* Initialize tail_call_cnt */
 	emit(A64_MOVZ(1, tcc, 0, 0), ctx);
-	cur_offset = ctx->idx - idx0;
-	if (cur_offset != PROLOGUE_OFFSET) {
-		pr_err_once("PROLOGUE_OFFSET = %d, expected %d!\n",
-			    cur_offset, PROLOGUE_OFFSET);
-		return -1;
-	}
+	PROLOGUE_OFFSET = ctx->idx - idx0;
 	/* BTI landing pad for the tail call, done with a BR */
 	emit_bti(A64_BTI_J, ctx);
Generate arm64 bytecode to zero general purpose regs.
This can be used to sanitise regs after domain transitions between the kernel and eBPF compartments.
Signed-off-by: Zachary Leaf <zachary.leaf@arm.com>
---
 arch/arm64/net/bpf_jit_comp.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 920c1bfd098e..7f1f6e09ea53 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -284,6 +284,21 @@ static bool is_lsi_offset(int offset, int scale)
 /* Offset of nop instruction in bpf prog entry to be poked */
 #define POKE_OFFSET (BTI_INSNS + 1)
+static inline void zero_gpr(struct jit_ctx *ctx)
+{
+	/*
+	 * Try generating this without repeating yourself using
+	 * emit(A64_MOVZ(1, A64_R(0), 0, 0), ctx);
+	 * ...
+	 */
+	int base = 0xd2800000;	// mov x0, #0
+	// 0xd2800001;		// mov x1, #0
+	// ...
+	// 0xd280001d;		// mov x29, #0
+	for (int i = 0; i <= 29; i++)
+		emit(base + i, ctx);
+}
+
 static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 {
 	const struct bpf_prog *prog = ctx->prog;
AAPCS specifies that x19-x29 must be preserved between function calls; however, the JIT compiler outputs arm64 asm using only some of these regs. Due to this, only those used are saved and restored in the prologue/epilogue of the JIT'd bpf code.
Now that we intend to zero all GPRs on entry to the JIT'd bpf, simplify saving/restoring regs by explicitly referring to x19-x28 instead of the bpf regs.
Signed-off-by: Zachary Leaf <zachary.leaf@arm.com>
---
 arch/arm64/net/bpf_jit_comp.c | 33 +++++++++++----------------------
 1 file changed, 11 insertions(+), 22 deletions(-)
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 7f1f6e09ea53..d8ecb79dbb74 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -303,10 +303,6 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 {
 	const struct bpf_prog *prog = ctx->prog;
 	const bool is_main_prog = !bpf_is_subprog(prog);
-	const u8 r6 = bpf2a64[BPF_REG_6];
-	const u8 r7 = bpf2a64[BPF_REG_7];
-	const u8 r8 = bpf2a64[BPF_REG_8];
-	const u8 r9 = bpf2a64[BPF_REG_9];
 	const u8 fp = bpf2a64[BPF_REG_FP];
 	const u8 tcc = bpf2a64[TCALL_CNT];
 	const u8 fpb = bpf2a64[FP_BOTTOM];
@@ -355,10 +351,11 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 	emit(A64_MOV(1, A64_FP, A64_SP), ctx);
 	/* Save callee-saved registers */
-	emit(A64_PUSH(r6, r7, A64_SP), ctx);
-	emit(A64_PUSH(r8, r9, A64_SP), ctx);
-	emit(A64_PUSH(fp, tcc, A64_SP), ctx);
-	emit(A64_PUSH(fpb, A64_R(28), A64_SP), ctx);
+	emit(A64_PUSH(A64_R(19), A64_R(20), A64_SP), ctx);
+	emit(A64_PUSH(A64_R(21), A64_R(22), A64_SP), ctx);
+	emit(A64_PUSH(A64_R(23), A64_R(24), A64_SP), ctx);
+	emit(A64_PUSH(A64_R(25), A64_R(26), A64_SP), ctx);
+	emit(A64_PUSH(A64_R(27), A64_R(28), A64_SP), ctx);
 	/* Set up BPF prog stack base register */
 	emit(A64_MOV(1, fp, A64_SP), ctx);
@@ -664,24 +661,16 @@ static void build_plt(struct jit_ctx *ctx)
 static void build_epilogue(struct jit_ctx *ctx)
 {
 	const u8 r0 = bpf2a64[BPF_REG_0];
-	const u8 r6 = bpf2a64[BPF_REG_6];
-	const u8 r7 = bpf2a64[BPF_REG_7];
-	const u8 r8 = bpf2a64[BPF_REG_8];
-	const u8 r9 = bpf2a64[BPF_REG_9];
-	const u8 fp = bpf2a64[BPF_REG_FP];
-	const u8 fpb = bpf2a64[FP_BOTTOM];
 	/* We're done with BPF stack */
 	emit(A64_ADD_I(1, A64_SP, A64_SP, ctx->stack_size), ctx);
-	/* Restore x27 and x28 */
-	emit(A64_POP(fpb, A64_R(28), A64_SP), ctx);
-	/* Restore fs (x25) and x26 */
-	emit(A64_POP(fp, A64_R(26), A64_SP), ctx);
-
-	/* Restore callee-saved register */
-	emit(A64_POP(r8, r9, A64_SP), ctx);
-	emit(A64_POP(r6, r7, A64_SP), ctx);
+	/* Restore x19-x28 */
+	emit(A64_POP(A64_R(27), A64_R(28), A64_SP), ctx);
+	emit(A64_POP(A64_R(25), A64_R(26), A64_SP), ctx);
+	emit(A64_POP(A64_R(23), A64_R(24), A64_SP), ctx);
+	emit(A64_POP(A64_R(21), A64_R(22), A64_SP), ctx);
+	emit(A64_POP(A64_R(19), A64_R(20), A64_SP), ctx);
 	/* Restore FP/LR registers */
 	emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
For setting compartment bounds it's useful to know the overall size of the eBPF JIT image.
Signed-off-by: Zachary Leaf <zachary.leaf@arm.com>
---
 arch/arm64/net/bpf_jit_comp.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index d8ecb79dbb74..85810908dc15 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -78,6 +78,7 @@ struct jit_ctx {
 	int *offset;
 	int exentry_idx;
 	__le32 *image;
+	int image_size;
 	u32 stack_size;
 	int fpb_offset;
 };
@@ -1514,7 +1515,7 @@ struct arm64_jit_data {
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
-	int image_size, prog_size, extable_size, extable_align, extable_offset;
+	int prog_size, extable_size, extable_align, extable_offset;
 	struct bpf_prog *tmp, *orig_prog = prog;
 	struct bpf_binary_header *header;
 	struct arm64_jit_data *jit_data;
@@ -1594,8 +1595,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 	prog_size = sizeof(u32) * ctx.idx;
 	/* also allocate space for plt target */
 	extable_offset = round_up(prog_size + PLT_TARGET_SIZE, extable_align);
-	image_size = extable_offset + extable_size;
-	header = bpf_jit_binary_alloc(image_size, &image_ptr,
+	ctx.image_size = extable_offset + extable_size;
+	header = bpf_jit_binary_alloc(ctx.image_size, &image_ptr,
 				      sizeof(u32), jit_fill_hole);
 	if (header == NULL) {
 		prog = orig_prog;
arm64 JIT'd bpf programs on Morello currently re-use the existing kernel stack in the kernel logical memory map (VMAP_STACK is not available in PCuABI).
Since JIT programs will be running in a compartment with memory accesses limited to the vmalloc region, new stacks for each eBPF program must be created here to allow access and provide isolation of stacks.
Allocate and free new page sized/aligned stacks in the vmalloc region at the same time as the binary image to ensure they last for the lifetime of the program.
Signed-off-by: Zachary Leaf <zachary.leaf@arm.com>
---
 arch/arm64/net/bpf_jit_comp.c | 27 +++++++++++++++++++++++++++
 include/linux/filter.h        |  2 ++
 kernel/bpf/core.c             | 13 +++++++++++++
 3 files changed, 42 insertions(+)
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 85810908dc15..36419cdaa710 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -24,6 +24,8 @@
#include "bpf_jit.h"
+#define BPF_STACK_SZ (PAGE_SIZE * 16)
+
 #define TMP_REG_1 (MAX_BPF_JIT_REG + 0)
 #define TMP_REG_2 (MAX_BPF_JIT_REG + 1)
 #define TCALL_CNT (MAX_BPF_JIT_REG + 2)
@@ -79,6 +81,7 @@ struct jit_ctx {
 	int exentry_idx;
 	__le32 *image;
 	int image_size;
+	void *stack;
 	u32 stack_size;
 	int fpb_offset;
 };
@@ -1513,6 +1516,25 @@ struct arm64_jit_data {
 	struct jit_ctx ctx;
 };
+void *bpf_jit_alloc_stack()
+{
+	void *sp = __vmalloc_node(BPF_STACK_SZ, PAGE_SIZE, THREADINFO_GFP,
+				  NUMA_NO_NODE, __builtin_return_address(0));
+
+	if (!sp) {
+		printk("%s stack allocation failed\n", __func__);
+		return NULL;
+	}
+	printk("%s allocated stack at:%#lx-%lx size:%d\n", __func__,
+	       sp, sp+BPF_STACK_SZ, BPF_STACK_SZ);
+	if (((uintptr_t)sp & 0xF) != 0)
+		printk("%s stack is NOT 16B aligned\n", __func__);
+	if (((uintptr_t)sp & 0xFFF) != 0)
+		printk("%s stack is NOT 4k page aligned\n", __func__);
+
+	return kasan_reset_tag(sp);
+}
+
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
 	int prog_size, extable_size, extable_align, extable_offset;
@@ -1603,6 +1625,11 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		goto out_off;
 	}
+	if (header->stack == NULL) {
+		goto out_off;
+	}
+	ctx.stack = header->stack;
+
 	/* 2. Now, the actual pass. */
 	ctx.image = (__le32 *)image_ptr;
diff --git a/include/linux/filter.h b/include/linux/filter.h
index a4953fafc8cb..471cef983dd4 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -601,6 +601,7 @@ struct sock_fprog_kern {
 struct bpf_binary_header {
 	u32 size;
+	void *stack;
 	u8 image[] __aligned(BPF_IMAGE_ALIGNMENT);
 };
@@ -1061,6 +1062,7 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 void bpf_jit_binary_free(struct bpf_binary_header *hdr);
 u64 bpf_jit_alloc_exec_limit(void);
 void *bpf_jit_alloc_exec(unsigned long size);
+void *bpf_jit_alloc_stack(void);
 void bpf_jit_free_exec(void *addr);
 void bpf_jit_free(struct bpf_prog *fp);
 struct bpf_binary_header *
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 510fec53df3f..509a70e6c25f 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1023,6 +1023,11 @@ void __weak bpf_jit_free_exec(void *addr)
 	module_memfree(addr);
 }
+void *__weak bpf_jit_alloc_stack()
+{
+	return NULL;
+}
+
 struct bpf_binary_header *
 bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 		     unsigned int alignment,
@@ -1061,6 +1066,12 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 	/* The actual start of the JIT code */
 	printk("%s JIT loc=%#lx\n", __func__, *image_ptr);
+	/*
+	 * Save the bpf SP in the header; it's the easiest way to ensure the
+	 * memory is free'd at the same time as the image in bpf_jit_binary_free
+	 */
+	hdr->stack = bpf_jit_alloc_stack();
+
 	return hdr;
 }
@@ -1068,6 +1079,8 @@ void bpf_jit_binary_free(struct bpf_binary_header *hdr)
 {
 	u32 size = hdr->size;
+	if (hdr->stack)
+		bpf_jit_free_exec(hdr->stack);
 	bpf_jit_free_exec(hdr);
 	bpf_jit_uncharge_modmem(size);
 }
As per Morello arch ref manual:
"If the PE is in Restricted and an exception is taken from the current Exception level, exception entry uses the same exception vector as an exception taken from the current Exception level with SP_EL0"
This means for bpf programs running in Restricted mode at EL1, exceptions will use the el1t_64_sync_handler(), where the t suffix indicates SP_EL0 is being used.
Since the el1t_xyz_handlers are currently unhandled, pass through to/reuse the existing exception handlers.
Signed-off-by: Zachary Leaf <zachary.leaf@arm.com>
---
 arch/arm64/kernel/entry-common.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index 63cab2184e11..bf34209fcab1 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -384,10 +384,25 @@ static inline void fp_user_discard(void)
 	}
 }
-UNHANDLED(el1t, 64, sync)
-UNHANDLED(el1t, 64, irq)
-UNHANDLED(el1t, 64, fiq)
-UNHANDLED(el1t, 64, error)
+asmlinkage void noinstr el1t_64_sync_handler(struct pt_regs *regs)
+{
+	el1h_64_sync_handler(regs);
+}
+
+asmlinkage void noinstr el1t_64_irq_handler(struct pt_regs *regs)
+{
+	el1h_64_irq_handler(regs);
+}
+
+asmlinkage void noinstr el1t_64_fiq_handler(struct pt_regs *regs)
+{
+	el1h_64_fiq_handler(regs);
+}
+
+asmlinkage void noinstr el1t_64_error_handler(struct pt_regs *regs)
+{
+	el1h_64_error_handler(regs);
+}
 static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
 {
Amend the JIT prologue/epilogue to switch execution to the hybrid compartment and back again.
The branch to bpf_enter_sandbox() sets up the compartment bounds and permissions then switches into Restricted mode.
From here we can zero all GPR and setup the bpf stack using the new stack pointer in RCSP_EL0 before entering into bpf main.
Return from the compartment back to Executive mode at the top of the epilogue.
┌─── prologue
│1           |
│2           | executive mode
│3           |
│4 brr(fnp)─┐|            // fnp=5 clr=y
│5 ▼---------┼-----------
│6           |
├───         | main
│7           |
│8           |
│9           | restricted mode
│.           |
│.           |
│.           |
├───         | epilogue
│x ret(clr)─┐|            // clr=y
│y ▼---------┼-----------
│z           |
│.           | executive mode
│.           |
└───
Signed-off-by: Zachary Leaf <zachary.leaf@arm.com>
---
 arch/arm64/net/bpf_jit_comp.c | 112 +++++++++++++++++++++++++++++++++-
 1 file changed, 110 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 36419cdaa710..87e041638936 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -26,6 +26,9 @@
#define BPF_STACK_SZ (PAGE_SIZE * 16)
+#define PERM_SYS_REG	__CHERI_CAP_PERMISSION_ACCESS_SYSTEM_REGISTERS__
+#define PERM_EXECUTIVE	__ARM_CAP_PERMISSION_EXECUTIVE__
+
 #define TMP_REG_1 (MAX_BPF_JIT_REG + 0)
 #define TMP_REG_2 (MAX_BPF_JIT_REG + 1)
 #define TCALL_CNT (MAX_BPF_JIT_REG + 2)
@@ -266,6 +269,84 @@ static bool is_lsi_offset(int offset, int scale)
 	return true;
 }
+void bpf_enter_sandbox(void *sp, void *ret_addr, int image_size)
+{
+	void * __capability ddc;
+	void * __capability rcsp;
+	void * __capability rddc;
+	void * __capability fnp;
+	void * __capability clr;
+	void *lr;
+
+	/*
+	 * All new caps are explicitly derived from kernel DDC; since EL1 DDC is
+	 * entirely unrestricted, this gives us a blank cap to use as a base
+	 */
+	ddc = cheri_ddc_get();
+
+	/* Setup RCSP to top of new stack created in vmalloc region */
+	rcsp = cheri_address_set(ddc, (u64)sp);
+	rcsp = cheri_bounds_set(rcsp, BPF_STACK_SZ);
+	rcsp = cheri_offset_set(rcsp, BPF_STACK_SZ);
+	asm volatile("msr rcsp_el0, %[RCSP]" :: [RCSP] "C" (rcsp));
+
+	/* Restrict RDDC to the vmalloc region */
+	rddc = cheri_address_set(ddc, VMALLOC_START);
+	rddc = cheri_bounds_set(rddc, VMALLOC_END - VMALLOC_START);
+	rddc = cheri_perms_clear(rddc, PERM_EXECUTIVE | PERM_SYS_REG);
+	asm volatile("msr rddc_el0, %[RDDC]" :: [RDDC] "C" (rddc));
+
+	/*
+	 * To switch to restricted mode with BLRR/BRR, we must have a sealed
+	 * function ptr with a cleared executive permission bit (PERM_EXECUTIVE)
+	 *
+	 * We want to return from bpf_enter_sandbox in restricted mode, so use
+	 * link reg (x30) for the function ptr
+	 *
+	 * Setting the bounds to LR + image_size isn't precise, but it's good
+	 * enough for now. That restricts PCC/execution to roughly the current
+	 * bpf program. To support bpf to bpf and bpf tail calls within the
+	 * vmalloc region needs further work.
+	 *
+	 * PERM_SYS_REG: remove access to privileged system regs e.g. mmu,
+	 * interupts mgmt, processor reset
+	 */
+	__asm__ volatile("mov %[LR], x30" : [LR] "=r" (lr) :);
+	fnp = cheri_address_set(ddc, (u64)lr);
+	fnp = cheri_bounds_set(fnp, image_size);
+	fnp = cheri_perms_clear(fnp, PERM_EXECUTIVE | PERM_SYS_REG);
+	fnp = cheri_sentry_create(fnp);
+
+	/*
+	 * To exit restricted mode, we need to do ret(clr)
+	 *
+	 * If we use BLRR we'd create a sealed c30/clr pointing to the
+	 * instruction directly after the BLRR - that's no good to return
+	 * back here to the end of this function
+	 *
+	 * Use BRR and manually craft the CLR to be able to return somewhere
+	 * we actually want to return to when exiting restricted mode -
+	 * currently that's the start of the epilogue, directly after the
+	 * ret(clr) instruction
+	 *
+	 * Note: since we're using BRR to exit instead of a normal RET, we're
+	 * relying on the code gen here to not create a stack frame; otherwise
+	 * we end up branching out before restoring the stack. Since this
+	 * is a leaf function that doesn't allocate or use stack space we're
+	 * ok, however this function would be less liable to break in pure
+	 * assembly
+	 */
+	clr = cheri_address_set(ddc, (u64)ret_addr);
+	clr = cheri_sentry_create(clr);
+	__asm__ volatile(
+		"mov c30, %[CLR]\n"
+		"brr %[FNP]"	/* Branch to restricted mode instead of RET */
+		::
+		[CLR] "r" (clr),
+		[FNP] "r" (fnp)
+	);
+}
+
 /* generated prologue:
  *	bti c // if CONFIG_ARM64_BTI_KERNEL
  *	mov x9, lr
@@ -361,6 +442,26 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 	emit(A64_PUSH(A64_R(25), A64_R(26), A64_SP), ctx);
 	emit(A64_PUSH(A64_R(27), A64_R(28), A64_SP), ctx);
+	/* Setup and enter restricted mode compartment:
+	 * arg1: new SP
+	 * arg2: PCC-relative address of where we want to return to on
+	 *       exiting restricted mode
+	 * arg3: image size
+	 */
+	emit_addr_mov_i64(A64_R(0), (const u64)ctx->stack, ctx);
+	/* byte offset = idx * sizeof(inst) + sizeof(emit_call) */
+	emit(A64_ADR(A64_R(1), (epilogue_offset(ctx)*4)+4), ctx);
+	emit_a64_mov_i(0, A64_R(2), ctx->image_size, ctx);
+	emit_call((const u64)bpf_enter_sandbox, ctx);
+	/* ----> Now we're in restricted mode */
+
+	/*
+	 * Since not all regs are banked between Restricted/Executive, zero
+	 * GPRs to avoid leaking kernel regs to bpf code
+	 * We don't really mind the other way around, i.e. R -> E.
+	 */
+	zero_gpr(ctx);
+
 	/* Set up BPF prog stack base register */
 	emit(A64_MOV(1, fp, A64_SP), ctx);
@@ -666,8 +767,12 @@ static void build_epilogue(struct jit_ctx *ctx)
 {
 	const u8 r0 = bpf2a64[BPF_REG_0];
-	/* We're done with BPF stack */
-	emit(A64_ADD_I(1, A64_SP, A64_SP, ctx->stack_size), ctx);
+	/*
+	 * Exit from restricted mode compartment
+	 * CLR should point to the instruction below this one
+	 */
+	emit(0xc2c253c0, ctx);	// ret clr
+	/* ----> Now we're back in executive mode */
 	/* Restore x19-x28 */
 	emit(A64_POP(A64_R(27), A64_R(28), A64_SP), ctx);
@@ -1647,6 +1752,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		goto out_off;
 	}
+	/* Update now we know the actual size */
+	ctx.epilogue_offset = ctx.idx;
+
 	build_epilogue(&ctx);
 	build_plt(&ctx);
linux-morello@op-lists.linaro.org