As planned during the vCPU hot-add discussions from previous LOD meetings, this prototype lets userspace handle PSCI calls from a guest.
The vCPU hot-add model preferred by Arm presents all possible resources through ACPI at boot time, and only marks unavailable vCPUs as hidden. The VMM prevents bringing up those vCPUs by rejecting PSCI CPU_ON calls. This keeps things simple for vCPU scaling enablement, while leaving the door open for hardware CPU hot-add.
This series focuses on moving PSCI support into userspace. Patches 1-3 allow userspace to request that KVM execute a WFI on its behalf. That way the VMM can easily implement the CPU_SUSPEND function, which is mandatory from PSCI v0.2 onwards (even if it has no more useful implementation than WFI, which is natively available to the guest). An alternative would be to poll the vGIC implemented in KVM for interrupts, but I haven't explored that solution. Patches 4 and 5 let the VMM handle PSCI calls.
The guest needs additional support to deal with hidden CPUs and to gracefully handle the "NOT_PRESENT" return value from PSCI CPU_ON. The full prototype can be found here:
https://jpbrucker.net/git/linux/log/?h=cpuhp/devel
https://jpbrucker.net/git/qemu/log/?h=cpuhp/devel
Jean-Philippe Brucker (5):
  KVM: arm64: Replace power_off with mp_state in struct kvm_vcpu_arch
  KVM: arm64: Move WFI execution to check_vcpu_requests()
  KVM: arm64: Allow userspace to request WFI
  KVM: arm64: Pass hypercalls to userspace
  KVM: arm64: Pass PSCI calls to userspace
 Documentation/virt/kvm/api.rst      | 46 +++++++++++++++----
 Documentation/virt/kvm/arm/psci.rst |  1 +
 arch/arm64/include/asm/kvm_host.h   | 10 ++++-
 include/kvm/arm_hypercalls.h        |  1 +
 include/kvm/arm_psci.h              |  4 ++
 include/uapi/linux/kvm.h            |  3 ++
 arch/arm64/kvm/arm.c                | 66 +++++++++++++++++++--------
 arch/arm64/kvm/handle_exit.c        |  3 +-
 arch/arm64/kvm/hypercalls.c         | 28 +++++++++++-
 arch/arm64/kvm/psci.c               | 69 ++++++++++++++---------------
 10 files changed, 165 insertions(+), 66 deletions(-)
In order to add a new "suspend" power state later, replace power_off with mp_state in struct kvm_vcpu_arch. Factor the power-off logic into kvm_arm_vcpu_power_off() while we're here.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 arch/arm64/include/asm/kvm_host.h |  6 ++++--
 arch/arm64/kvm/arm.c              | 29 +++++++++++++++--------------
 arch/arm64/kvm/psci.c             | 19 ++++++------------
 3 files changed, 25 insertions(+), 29 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 7cd7d5c8c4bc..55a04f4d5919 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -340,8 +340,8 @@ struct kvm_vcpu_arch { u32 mdscr_el1; } guest_debug_preserved;
-	/* vcpu power-off state */
-	bool power_off;
+	/* vcpu power state (runnable, stopped, halted) */
+	u32 mp_state;
/* Don't run the guest (internal implementation need) */ bool pause; @@ -720,6 +720,8 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr); int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr); +void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu); +bool kvm_arm_vcpu_is_off(struct kvm_vcpu *vcpu);
/* Guest/host FPSIMD coordination helpers */ int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu); diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 1cb39c0803a4..e4a8bf1b638b 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -435,21 +435,22 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) vcpu->cpu = -1; }
-static void vcpu_power_off(struct kvm_vcpu *vcpu)
+void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu)
 {
-	vcpu->arch.power_off = true;
+	vcpu->arch.mp_state = KVM_MP_STATE_STOPPED;
 	kvm_make_request(KVM_REQ_SLEEP, vcpu);
 	kvm_vcpu_kick(vcpu);
 }
+bool kvm_arm_vcpu_is_off(struct kvm_vcpu *vcpu)
+{
+	return vcpu->arch.mp_state == KVM_MP_STATE_STOPPED;
+}
+
 int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
 				    struct kvm_mp_state *mp_state)
 {
-	if (vcpu->arch.power_off)
-		mp_state->mp_state = KVM_MP_STATE_STOPPED;
-	else
-		mp_state->mp_state = KVM_MP_STATE_RUNNABLE;
-
+	mp_state->mp_state = vcpu->arch.mp_state;
 	return 0;
 }
@@ -460,10 +461,10 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
switch (mp_state->mp_state) { case KVM_MP_STATE_RUNNABLE: - vcpu->arch.power_off = false; + vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; break; case KVM_MP_STATE_STOPPED: - vcpu_power_off(vcpu); + kvm_arm_vcpu_power_off(vcpu); break; default: ret = -EINVAL; @@ -483,7 +484,7 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v) { bool irq_lines = *vcpu_hcr(v) & (HCR_VI | HCR_VF); return ((irq_lines || kvm_vgic_vcpu_pending_irq(v)) - && !v->arch.power_off && !v->arch.pause); + && !kvm_arm_vcpu_is_off(v) && !v->arch.pause); }
bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu) @@ -643,10 +644,10 @@ static void vcpu_req_sleep(struct kvm_vcpu *vcpu) struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
rcuwait_wait_event(wait, - (!vcpu->arch.power_off) &&(!vcpu->arch.pause), + !kvm_arm_vcpu_is_off(vcpu) && !vcpu->arch.pause, TASK_INTERRUPTIBLE);
- if (vcpu->arch.power_off || vcpu->arch.pause) { + if (kvm_arm_vcpu_is_off(vcpu) || vcpu->arch.pause) { /* Awaken to handle a signal, request we sleep again later. */ kvm_make_request(KVM_REQ_SLEEP, vcpu); } @@ -1073,9 +1074,9 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu, * Handle the "start in power-off" case. */ if (test_bit(KVM_ARM_VCPU_POWER_OFF, vcpu->arch.features)) - vcpu_power_off(vcpu); + kvm_arm_vcpu_power_off(vcpu); else - vcpu->arch.power_off = false; + vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
return 0; } diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c index db4056ecccfd..24b4a2265dbd 100644 --- a/arch/arm64/kvm/psci.c +++ b/arch/arm64/kvm/psci.c @@ -52,13 +52,6 @@ static unsigned long kvm_psci_vcpu_suspend(struct kvm_vcpu *vcpu) return PSCI_RET_SUCCESS; }
-static void kvm_psci_vcpu_off(struct kvm_vcpu *vcpu) -{ - vcpu->arch.power_off = true; - kvm_make_request(KVM_REQ_SLEEP, vcpu); - kvm_vcpu_kick(vcpu); -} - static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu) { struct vcpu_reset_state *reset_state; @@ -78,7 +71,7 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu) */ if (!vcpu) return PSCI_RET_INVALID_PARAMS; - if (!vcpu->arch.power_off) { + if (!kvm_arm_vcpu_is_off(vcpu)) { if (kvm_psci_version(source_vcpu, kvm) != KVM_ARM_PSCI_0_1) return PSCI_RET_ALREADY_ON; else @@ -107,7 +100,7 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu) */ smp_wmb();
-	vcpu->arch.power_off = false;
+	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 	kvm_vcpu_wake_up(vcpu);
return PSCI_RET_SUCCESS; @@ -142,7 +135,7 @@ static unsigned long kvm_psci_vcpu_affinity_info(struct kvm_vcpu *vcpu) mpidr = kvm_vcpu_get_mpidr_aff(tmp); if ((mpidr & target_affinity_mask) == target_affinity) { matching_cpus++; - if (!tmp->arch.power_off) + if (!kvm_arm_vcpu_is_off(tmp)) return PSCI_0_2_AFFINITY_LEVEL_ON; } } @@ -168,7 +161,7 @@ static void kvm_prepare_system_event(struct kvm_vcpu *vcpu, u32 type) * re-initialized. */ kvm_for_each_vcpu(i, tmp, vcpu->kvm) - tmp->arch.power_off = true; + tmp->arch.mp_state = KVM_MP_STATE_STOPPED; kvm_make_all_cpus_request(vcpu->kvm, KVM_REQ_SLEEP);
memset(&vcpu->run->system_event, 0, sizeof(vcpu->run->system_event)); @@ -237,7 +230,7 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu) val = kvm_psci_vcpu_suspend(vcpu); break; case PSCI_0_2_FN_CPU_OFF: - kvm_psci_vcpu_off(vcpu); + kvm_arm_vcpu_power_off(vcpu); val = PSCI_RET_SUCCESS; break; case PSCI_0_2_FN_CPU_ON: @@ -350,7 +343,7 @@ static int kvm_psci_0_1_call(struct kvm_vcpu *vcpu)
switch (psci_fn) { case KVM_PSCI_FN_CPU_OFF: - kvm_psci_vcpu_off(vcpu); + kvm_arm_vcpu_power_off(vcpu); val = PSCI_RET_SUCCESS; break; case KVM_PSCI_FN_CPU_ON:
Add a suspend request and move the WFI execution into check_vcpu_requests(), next to the power-off logic. This will allow userspace implementing PSCI to request WFI before returning to the guest.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
There is a compatibility question regarding the GET_MP_STATE ioctl. Since older userspace does not know about the new KVM_MP_STATE_HALTED, we cannot return that value, otherwise we would break their mp_state support. This is not a risk at the moment, because we don't return to userspace with the HALTED state (WFI and PSCI both stay in the run loop until the suspend request is consumed). But I'm not 100% sure of that, and it will certainly change in the future. I'd rather explicitly return RUNNABLE in GET_MP_STATE, but I kept it simple for the moment - we'll probably toss this patch anyway.
---
 arch/arm64/include/asm/kvm_host.h |  2 ++
 arch/arm64/kvm/arm.c              | 18 ++++++++++++++-
 arch/arm64/kvm/handle_exit.c      |  3 +--
 arch/arm64/kvm/psci.c             | 37 +++++++++++++------------------
 4 files changed, 35 insertions(+), 25 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 55a04f4d5919..3ca732feb9a5 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -46,6 +46,7 @@ #define KVM_REQ_VCPU_RESET KVM_ARCH_REQ(2) #define KVM_REQ_RECORD_STEAL KVM_ARCH_REQ(3) #define KVM_REQ_RELOAD_GICv4 KVM_ARCH_REQ(4) +#define KVM_REQ_SUSPEND KVM_ARCH_REQ(5)
#define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \ KVM_DIRTY_LOG_INITIALLY_SET) @@ -722,6 +723,7 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr); void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu); bool kvm_arm_vcpu_is_off(struct kvm_vcpu *vcpu); +void kvm_arm_vcpu_suspend(struct kvm_vcpu *vcpu);
/* Guest/host FPSIMD coordination helpers */ int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu); diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index e4a8bf1b638b..4a42d54299db 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -447,6 +447,12 @@ bool kvm_arm_vcpu_is_off(struct kvm_vcpu *vcpu) return vcpu->arch.mp_state == KVM_MP_STATE_STOPPED; }
+void kvm_arm_vcpu_suspend(struct kvm_vcpu *vcpu) +{ + vcpu->arch.mp_state = KVM_MP_STATE_HALTED; + kvm_make_request(KVM_REQ_SUSPEND, vcpu); +} + int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu, struct kvm_mp_state *mp_state) { @@ -667,6 +673,8 @@ static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
static void check_vcpu_requests(struct kvm_vcpu *vcpu) { + bool irq_pending; + if (kvm_request_pending(vcpu)) { if (kvm_check_request(KVM_REQ_SLEEP, vcpu)) vcpu_req_sleep(vcpu); @@ -678,7 +686,7 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu) * Clear IRQ_PENDING requests that were made to guarantee * that a VCPU sees new virtual interrupts. */ - kvm_check_request(KVM_REQ_IRQ_PENDING, vcpu); + irq_pending = kvm_check_request(KVM_REQ_IRQ_PENDING, vcpu);
if (kvm_check_request(KVM_REQ_RECORD_STEAL, vcpu)) kvm_update_stolen_time(vcpu); @@ -690,6 +698,14 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu) vgic_v4_load(vcpu); preempt_enable(); } + + if (kvm_check_request(KVM_REQ_SUSPEND, vcpu)) { + if (!irq_pending) { + kvm_vcpu_block(vcpu); + kvm_clear_request(KVM_REQ_UNHALT, vcpu); + } + vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; + } } }
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index 6f48336b1d86..9717df3104cf 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -95,8 +95,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
 	} else {
 		trace_kvm_wfx_arm64(*vcpu_pc(vcpu), false);
 		vcpu->stat.wfi_exit_stat++;
-		kvm_vcpu_block(vcpu);
-		kvm_clear_request(KVM_REQ_UNHALT, vcpu);
+		kvm_arm_vcpu_suspend(vcpu);
 	}
kvm_incr_pc(vcpu); diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c index 24b4a2265dbd..42a307ceb95f 100644 --- a/arch/arm64/kvm/psci.c +++ b/arch/arm64/kvm/psci.c @@ -31,27 +31,6 @@ static unsigned long psci_affinity_mask(unsigned long affinity_level) return 0; }
-static unsigned long kvm_psci_vcpu_suspend(struct kvm_vcpu *vcpu) -{ - /* - * NOTE: For simplicity, we make VCPU suspend emulation to be - * same-as WFI (Wait-for-interrupt) emulation. - * - * This means for KVM the wakeup events are interrupts and - * this is consistent with intended use of StateID as described - * in section 5.4.1 of PSCI v0.2 specification (ARM DEN 0022A). - * - * Further, we also treat power-down request to be same as - * stand-by request as-per section 5.4.2 clause 3 of PSCI v0.2 - * specification (ARM DEN 0022A). This means all suspend states - * for KVM will preserve the register state. - */ - kvm_vcpu_block(vcpu); - kvm_clear_request(KVM_REQ_UNHALT, vcpu); - - return PSCI_RET_SUCCESS; -} - static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu) { struct vcpu_reset_state *reset_state; @@ -227,7 +206,21 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu) break; case PSCI_0_2_FN_CPU_SUSPEND: case PSCI_0_2_FN64_CPU_SUSPEND: - val = kvm_psci_vcpu_suspend(vcpu); + /* + * NOTE: For simplicity, we make VCPU suspend emulation to be + * same-as WFI (Wait-for-interrupt) emulation. + * + * This means for KVM the wakeup events are interrupts and this + * is consistent with intended use of StateID as described in + * section 5.4.1 of PSCI v0.2 specification (ARM DEN 0022A). + * + * Further, we also treat power-down request to be same as + * stand-by request as-per section 5.4.2 clause 3 of PSCI v0.2 + * specification (ARM DEN 0022A). This means all suspend states + * for KVM will preserve the register state. + */ + kvm_arm_vcpu_suspend(vcpu); + val = PSCI_RET_SUCCESS; break; case PSCI_0_2_FN_CPU_OFF: kvm_arm_vcpu_power_off(vcpu);
To help userspace implement PSCI CPU_SUSPEND, allow setting the "HALTED" MP state to request a WFI before returning to the guest. There is no way to cancel the request at the moment.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 Documentation/virt/kvm/api.rst | 15 +++++++++------
 include/uapi/linux/kvm.h       |  1 +
 arch/arm64/kvm/arm.c           |  4 ++++
 3 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 7fcb2fd38f42..8da6a9940086 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -1416,8 +1416,8 @@ Possible values are: which has not yet received an INIT signal [x86] KVM_MP_STATE_INIT_RECEIVED the vcpu has received an INIT signal, and is now ready for a SIPI [x86] - KVM_MP_STATE_HALTED the vcpu has executed a HLT instruction and - is waiting for an interrupt [x86] + KVM_MP_STATE_HALTED the vcpu has executed a HLT/WFI instruction + and is waiting for an interrupt [x86,arm64] KVM_MP_STATE_SIPI_RECEIVED the vcpu has just received a SIPI (vector accessible via KVM_GET_VCPU_EVENTS) [x86] KVM_MP_STATE_STOPPED the vcpu is stopped [s390,arm/arm64] @@ -1435,8 +1435,9 @@ these architectures. For arm/arm64: ^^^^^^^^^^^^^^
-The only states that are valid are KVM_MP_STATE_STOPPED and -KVM_MP_STATE_RUNNABLE which reflect if the vcpu is paused or not. +Valid states are KVM_MP_STATE_STOPPED and KVM_MP_STATE_RUNNABLE which reflect +if the vcpu is paused or not. KVM_MP_STATE_HALTED is valid if +KVM_CAP_ARM_MP_HALTED is present.
4.39 KVM_SET_MP_STATE --------------------- @@ -1457,8 +1458,10 @@ these architectures. For arm/arm64: ^^^^^^^^^^^^^^
-The only states that are valid are KVM_MP_STATE_STOPPED and -KVM_MP_STATE_RUNNABLE which reflect if the vcpu should be paused or not. +Valid states are KVM_MP_STATE_STOPPED and KVM_MP_STATE_RUNNABLE which reflect +if the vcpu should be paused or not. If KVM_CAP_ARM_MP_HALTED is present, +KVM_MP_STATE_HALTED can be set, to wait for interrupts targeted at the vcpu +before running it.
4.40 KVM_SET_IDENTITY_MAP_ADDR ------------------------------ diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 3fd9a7e9d90c..9934a57db40c 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1082,6 +1082,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_SGX_ATTRIBUTE 196 #define KVM_CAP_VM_COPY_ENC_CONTEXT_FROM 197 #define KVM_CAP_PTP_KVM 198 +#define KVM_CAP_ARM_MP_HALTED 199
#ifdef KVM_CAP_IRQ_ROUTING
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 4a42d54299db..10e1f7832e7f 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -207,6 +207,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_SET_GUEST_DEBUG: case KVM_CAP_VCPU_ATTRIBUTES: case KVM_CAP_PTP_KVM: + case KVM_CAP_ARM_MP_HALTED: r = 1; break; case KVM_CAP_SET_GUEST_DEBUG2: @@ -469,6 +470,9 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu, case KVM_MP_STATE_RUNNABLE: vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; break; + case KVM_MP_STATE_HALTED: + kvm_arm_vcpu_suspend(vcpu); + break; case KVM_MP_STATE_STOPPED: kvm_arm_vcpu_power_off(vcpu); break;
When capability KVM_CAP_ARM_HVC_TO_USER is available, userspace can request to handle all hypercalls that aren't handled by KVM. With the help of another capability, this will allow userspace to handle PSCI calls.
Suggested-by: James Morse <james.morse@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
Notes on this implementation:
* A similar mechanism was proposed for SDEI some time ago [1]. This RFC generalizes the idea to all hypercalls, since that was suggested on the list [2, 3].
* We're reusing kvm_run.hypercall. I copied x0-x5 into kvm_run.hypercall.args[] to help userspace, but I'm tempted to remove this, because:
  - Most user handlers will need to write results back into the registers (x0-x3 for SMCCC), so if we keep this shortcut we should go all the way and read them back on return to the kernel.
  - QEMU doesn't care about this shortcut; it pulls all vcpu regs before handling the call.
  - SMCCC uses x0-x16 for parameters. x0 does contain the SMCCC function ID and may be useful for fast dispatch; we could keep that plus the immediate number.
* Add a flag in the kvm_run.hypercall telling whether this is HVC or SMC? Can be added later in those bottom longmode and pad fields.
* On top of this we could share with userspace which HVC ranges are available and which ones are handled by KVM. That can actually be added independently, through a vCPU/VM device attribute, which doesn't consume a new ioctl:
  - userspace issues a HAS_ATTR ioctl on the vcpu fd to query whether this feature is available.
  - userspace queries the number N of HVC ranges using one GET_ATTR.
  - userspace passes an array of N ranges using another GET_ATTR. The array is filled and returned by KVM.
* Enabling this using a vCPU arch feature rather than the whole-VM capability would be fine, but it would be difficult to do the same for the following psci-in-user capability. So let's enable everything at the VM scope.
* No idea whether this works out of the box for AArch32 guests.
[1] https://lore.kernel.org/linux-arm-kernel/20170808164616.25949-12-james.morse...
[2] https://lore.kernel.org/linux-arm-kernel/bf7e83f1-c58e-8d65-edd0-d08f27b8b76...
[3] https://lore.kernel.org/linux-arm-kernel/f56cf420-affc-35f0-2355-801a924b8a3...
---
 Documentation/virt/kvm/api.rst    | 17 +++++++++++++++--
 arch/arm64/include/asm/kvm_host.h |  1 +
 include/kvm/arm_psci.h            |  4 ++++
 include/uapi/linux/kvm.h          |  1 +
 arch/arm64/kvm/arm.c              |  5 +++++
 arch/arm64/kvm/hypercalls.c       | 28 +++++++++++++++++++++++++++-
 6 files changed, 53 insertions(+), 3 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 8da6a9940086..1afab8deadb3 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -5228,8 +5228,12 @@ to the byte array. __u32 pad; } hypercall;
-Unused. This was once used for 'hypercall to userspace'. To implement -such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390). +On x86 this was once used for 'hypercall to userspace'. To implement such +functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390). + +On arm64 it is used for hypercalls, when the KVM_CAP_ARM_HVC_TO_USER capability +is enabled. 'nr' contains the HVC or SMC immediate. 'args' contains registers +x0 - x5. The other parameters are unused.
.. note:: KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO.
@@ -6894,3 +6898,12 @@ This capability is always enabled. This capability indicates that the KVM virtual PTP service is supported in the host. A VMM can check whether the service is available to the guest on migration. + +8.33 KVM_CAP_ARM_HVC_TO_USER +---------------------------- + +:Architecture: arm64 + +This capability indicates that KVM can pass unhandled hypercalls to userspace, +if the VMM enables it. Hypercalls are passed with KVM_EXIT_HYPERCALL in +kvm_run::hypercall. diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 3ca732feb9a5..25554ce97045 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -123,6 +123,7 @@ struct kvm_arch { * supported. */ bool return_nisv_io_abort_to_user; + bool hvc_to_user;
/* * VM-wide PMU filter, implemented as a bitmap and big enough for diff --git a/include/kvm/arm_psci.h b/include/kvm/arm_psci.h index 5b58bd2fe088..d6b71a48fbb1 100644 --- a/include/kvm/arm_psci.h +++ b/include/kvm/arm_psci.h @@ -16,6 +16,10 @@
#define KVM_ARM_PSCI_LATEST KVM_ARM_PSCI_1_0
+#define KVM_PSCI_FN_LAST KVM_PSCI_FN(3) +#define PSCI_0_2_FN_LAST PSCI_0_2_FN(0x3f) +#define PSCI_0_2_FN64_LAST PSCI_0_2_FN64(0x3f) + /* * We need the KVM pointer independently from the vcpu as we can call * this from HYP, and need to apply kern_hyp_va on it... diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 9934a57db40c..1d8b6dd5d68f 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1083,6 +1083,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_VM_COPY_ENC_CONTEXT_FROM 197 #define KVM_CAP_PTP_KVM 198 #define KVM_CAP_ARM_MP_HALTED 199 +#define KVM_CAP_ARM_HVC_TO_USER 200
#ifdef KVM_CAP_IRQ_ROUTING
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 10e1f7832e7f..3c2fcf878b72 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -93,6 +93,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, r = 0; kvm->arch.return_nisv_io_abort_to_user = true; break; + case KVM_CAP_ARM_HVC_TO_USER: + r = 0; + kvm->arch.hvc_to_user = true; + break; default: r = -EINVAL; break; @@ -208,6 +212,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_VCPU_ATTRIBUTES: case KVM_CAP_PTP_KVM: case KVM_CAP_ARM_MP_HALTED: + case KVM_CAP_ARM_HVC_TO_USER: r = 1; break; case KVM_CAP_SET_GUEST_DEBUG2: diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c index 30da78f72b3b..b00ffd59d10e 100644 --- a/arch/arm64/kvm/hypercalls.c +++ b/arch/arm64/kvm/hypercalls.c @@ -58,6 +58,28 @@ static void kvm_ptp_get_time(struct kvm_vcpu *vcpu, u64 *val) val[3] = lower_32_bits(cycles); }
+static int kvm_hvc_user(struct kvm_vcpu *vcpu)
+{
+	int i;
+	struct kvm_run *run = vcpu->run;
+
+	if (!vcpu->kvm->arch.hvc_to_user) {
+		smccc_set_retval(vcpu, SMCCC_RET_NOT_SUPPORTED, 0, 0, 0);
+		return 1;
+	}
+
+	run->exit_reason = KVM_EXIT_HYPERCALL;
+	run->hypercall.nr = kvm_vcpu_hvc_get_imm(vcpu);
+	/* Add the first parameters for fast access. */
+	for (i = 0; i < 6; i++)
+		run->hypercall.args[i] = vcpu_get_reg(vcpu, i);
+	run->hypercall.ret = 0;
+	run->hypercall.longmode = 0;
+	run->hypercall.pad = 0;
+
+	return 0;
+}
+
 int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 {
 	u32 func_id = smccc_get_function(vcpu);
@@ -139,8 +161,12 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
 	case ARM_SMCCC_TRNG_RND32:
 	case ARM_SMCCC_TRNG_RND64:
 		return kvm_trng_call(vcpu);
-	default:
+	case KVM_PSCI_FN_BASE...KVM_PSCI_FN_LAST:
+	case PSCI_0_2_FN_BASE...PSCI_0_2_FN_LAST:
+	case PSCI_0_2_FN64_BASE...PSCI_0_2_FN64_LAST:
 		return kvm_psci_call(vcpu);
+	default:
+		return kvm_hvc_user(vcpu);
 	}
smccc_set_retval(vcpu, val[0], val[1], val[2], val[3]);
When the KVM_CAP_ARM_PSCI_TO_USER capability is available, userspace can request to handle PSCI calls.
SMCCC probe requires PSCI v1.x. If userspace only implements PSCI v0.2, the guest won't query SMCCC support through PSCI and won't use the Spectre workarounds. We could hijack PSCI_VERSION and pretend to support v1.0 if userspace does not, then handle all v1.0 calls ourselves (including guessing the PSCI feature set implemented by the guest), but that seems unnecessary. After all, the API already allows userspace to force a version lower than v1.0 using the firmware pseudo-registers.
The KVM_REG_ARM_PSCI_VERSION pseudo-register currently resets to either v0.1 if userspace doesn't set KVM_ARM_VCPU_PSCI_0_2, or KVM_ARM_PSCI_LATEST (1.0).
Suggested-by: James Morse <james.morse@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 Documentation/virt/kvm/api.rst      | 14 ++++++++++++++
 Documentation/virt/kvm/arm/psci.rst |  1 +
 arch/arm64/include/asm/kvm_host.h   |  1 +
 include/kvm/arm_hypercalls.h        |  1 +
 include/uapi/linux/kvm.h            |  1 +
 arch/arm64/kvm/arm.c                | 10 +++++++---
 arch/arm64/kvm/hypercalls.c         |  2 +-
 arch/arm64/kvm/psci.c               | 13 +++++++++++++
 8 files changed, 39 insertions(+), 4 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 1afab8deadb3..c98bba51776f 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6907,3 +6907,17 @@ available to the guest on migration. This capability indicates that KVM can pass unhandled hypercalls to userspace, if the VMM enables it. Hypercalls are passed with KVM_EXIT_HYPERCALL in kvm_run::hypercall. + +8.34 KVM_CAP_ARM_PSCI_TO_USER +----------------------------- + +:Architectures: arm64 + +When the VMM enables this capability, all PSCI calls are passed to userspace +instead of being handled by KVM. Capability KVM_CAP_ARM_HVC_TO_USER must be +enabled first. + +Userspace should support at least PSCI v1.0. Otherwise SMCCC features won't be +available to the guest. Userspace does not need to handle the SMCCC_VERSION +parameter for the PSCI_FEATURES function. The KVM_ARM_VCPU_PSCI_0_2 vCPU +feature should be set even if this capability is enabled. diff --git a/Documentation/virt/kvm/arm/psci.rst b/Documentation/virt/kvm/arm/psci.rst index d52c2e83b5b8..110011d1fa3f 100644 --- a/Documentation/virt/kvm/arm/psci.rst +++ b/Documentation/virt/kvm/arm/psci.rst @@ -34,6 +34,7 @@ The following register is defined: - Allows any PSCI version implemented by KVM and compatible with v0.2 to be set with SET_ONE_REG - Affects the whole VM (even if the register view is per-vcpu) + - Defaults to PSCI 1.0 if userspace enables KVM_CAP_ARM_PSCI_TO_USER.
* KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1: Holds the state of the firmware support to mitigate CVE-2017-5715, as diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 25554ce97045..5d74b769c16d 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -124,6 +124,7 @@ struct kvm_arch { */ bool return_nisv_io_abort_to_user; bool hvc_to_user; + bool psci_to_user;
/* * VM-wide PMU filter, implemented as a bitmap and big enough for diff --git a/include/kvm/arm_hypercalls.h b/include/kvm/arm_hypercalls.h index 0e2509d27910..b66c6a000ef3 100644 --- a/include/kvm/arm_hypercalls.h +++ b/include/kvm/arm_hypercalls.h @@ -6,6 +6,7 @@
#include <asm/kvm_emulate.h>
+int kvm_hvc_user(struct kvm_vcpu *vcpu); int kvm_hvc_call_handler(struct kvm_vcpu *vcpu);
static inline u32 smccc_get_function(struct kvm_vcpu *vcpu) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 1d8b6dd5d68f..83702fb6d39d 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1084,6 +1084,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_PTP_KVM 198 #define KVM_CAP_ARM_MP_HALTED 199 #define KVM_CAP_ARM_HVC_TO_USER 200 +#define KVM_CAP_ARM_PSCI_TO_USER 201
#ifdef KVM_CAP_IRQ_ROUTING
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 3c2fcf878b72..37ddcf7089ad 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -83,7 +83,7 @@ int kvm_arch_check_processor_compat(void *opaque) int kvm_vm_ioctl_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap) { - int r; + int r = -EINVAL;
if (cap->flags) return -EINVAL; @@ -97,8 +97,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, r = 0; kvm->arch.hvc_to_user = true; break; - default: - r = -EINVAL; + case KVM_CAP_ARM_PSCI_TO_USER: + if (kvm->arch.hvc_to_user) { + r = 0; + kvm->arch.psci_to_user = true; + } break; }
@@ -213,6 +216,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_PTP_KVM: case KVM_CAP_ARM_MP_HALTED: case KVM_CAP_ARM_HVC_TO_USER: + case KVM_CAP_ARM_PSCI_TO_USER: r = 1; break; case KVM_CAP_SET_GUEST_DEBUG2: diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c index b00ffd59d10e..7fe7713f61df 100644 --- a/arch/arm64/kvm/hypercalls.c +++ b/arch/arm64/kvm/hypercalls.c @@ -58,7 +58,7 @@ static void kvm_ptp_get_time(struct kvm_vcpu *vcpu, u64 *val) val[3] = lower_32_bits(cycles); }
-static int kvm_hvc_user(struct kvm_vcpu *vcpu) +int kvm_hvc_user(struct kvm_vcpu *vcpu) { int i; struct kvm_run *run = vcpu->run; diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c index 42a307ceb95f..7f44ee527966 100644 --- a/arch/arm64/kvm/psci.c +++ b/arch/arm64/kvm/psci.c @@ -353,6 +353,16 @@ static int kvm_psci_0_1_call(struct kvm_vcpu *vcpu) return 1; }
+static bool kvm_psci_call_is_user(struct kvm_vcpu *vcpu)
+{
+	/* Handle the special case of SMCCC probe through PSCI */
+	if (smccc_get_function(vcpu) == PSCI_1_0_FN_PSCI_FEATURES &&
+	    smccc_get_arg1(vcpu) == ARM_SMCCC_VERSION_FUNC_ID)
+		return false;
+
+	return vcpu->kvm->arch.psci_to_user;
+}
+
 /**
  * kvm_psci_call - handle PSCI call if r0 value is in range
  * @vcpu: Pointer to the VCPU struct
@@ -369,6 +379,9 @@ static int kvm_psci_0_1_call(struct kvm_vcpu *vcpu)
  */
 int kvm_psci_call(struct kvm_vcpu *vcpu)
 {
+	if (kvm_psci_call_is_user(vcpu))
+		return kvm_hvc_user(vcpu);
+
 	switch (kvm_psci_version(vcpu, vcpu->kvm)) {
 	case KVM_ARM_PSCI_1_0:
 		return kvm_psci_1_0_call(vcpu);
Enable new KVM capabilities to handle PSCI calls in QEMU rather than in KVM. This work is part of the larger vCPU hot-add prototype for arm64, which can be found at [1]. I rebased Salil's work [2], with small changes to hide unavailable CPUs in ACPI. These four patches move PSCI to userspace so we can differentiate nonexistent from unavailable CPUs (see patch 4, associated with the Linux patch "arm64: psci: Ignore NOT_PRESENT CPUs").
My testing commands are roughly:
qemu-system-aarch64 -cpu host -enable-kvm -M virt -M 'gic-version=3' \
    -bios QEMU_EFI.fd -smp 'cpus=2,maxcpus=4' -m 1G -kernel ...

device_add host-arm-cpu,id=core2,core-id=2     # QEMU monitor
echo 1 > /sys/devices/system/cpu/cpu2/online   # Guest
device_del core2
[1] https://jpbrucker.net/git/qemu/log/?h=cpuhp/devel
    https://jpbrucker.net/git/linux/log/?h=cpuhp/devel
[2] https://lore.kernel.org/qemu-devel/20200613213629.21984-1-salil.mehta@huawei...
Jean-Philippe Brucker (4):
  target/arm/kvm: Write CPU state back to KVM on reset
  target/arm: Support PSCI CPU_SUSPEND for KVM
  target/arm/kvm: Handle PSCI calls
  target/arm/arm-powerctl: Handle unplugged CPUs
 target/arm/arm-powerctl.h |  1 +
 target/arm/cpu.h          |  3 ++
 target/arm/internals.h    |  2 +-
 target/arm/arm-powerctl.c | 44 ++++++++++++++++++-----
 target/arm/kvm.c          | 75 +++++++++++++++++++++++++++++++++++----
 target/arm/kvm64.c        |  5 ++-
 target/arm/psci.c         | 14 +++++---
 7 files changed, 120 insertions(+), 24 deletions(-)
From: Jean-Philippe Brucker [mailto:jean-philippe@linaro.org]
Sent: Thursday, May 20, 2021 2:19 PM
To: linaro-open-discussions@op-lists.linaro.org
Cc: Salil Mehta <salil.mehta@huawei.com>; Jonathan Cameron <jonathan.cameron@huawei.com>; james.morse@arm.com; lorenzo.pieralisi@arm.com; Jean-Philippe Brucker <jean-philippe@linaro.org>
Subject: [RFC qemu 0/4] target/arm/kvm: Implement PSCI in userspace
Hi Jean,

Many thanks for this. I can see from your repository that you have used most of the earlier QEMU patches [2]. Some of the patches related to ACPI/GED may no longer be needed, since we are no longer conveying vCPU hotplug events to the guest through the ACPI hotplug channel.

There are many places in QEMU where we still need to work, perhaps together with the community, to arrive at a refined solution at the QEMU level.

But it is good to see that we have worked out a *kind-of* acceptable prototype. I will go through these further over the next couple of days.
Thanks,
Salil
When a KVM vCPU is reset following a PSCI CPU_ON call, its power state is currently not synchronized with KVM. Because the vCPU is not marked dirty, we miss the call to kvm_arch_put_registers() that writes to KVM's MP_STATE. Force mp_state synchronization.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 target/arm/kvm.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 20d55c23c39..56e460b388f 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -604,11 +604,12 @@ void kvm_arm_cpu_post_load(ARMCPU *cpu)
 void kvm_arm_reset_vcpu(ARMCPU *cpu)
 {
     int ret;
+    CPUState *cs = CPU(cpu);
 
     /* Re-init VCPU so that all registers are set to
      * their respective reset values.
      */
-    ret = kvm_arm_vcpu_init(CPU(cpu));
+    ret = kvm_arm_vcpu_init(cs);
     if (ret < 0) {
         fprintf(stderr, "kvm_arm_vcpu_init failed: %s\n", strerror(-ret));
         abort();
@@ -625,6 +626,12 @@ void kvm_arm_reset_vcpu(ARMCPU *cpu)
      * for the same reason we do so in kvm_arch_get_registers().
      */
     write_list_to_cpustate(cpu);
+
+    /*
+     * Ensure we call kvm_arch_put_registers(). The vCPU isn't marked dirty if
+     * it was parked in KVM and is now booting from a PSCI CPU_ON call.
+     */
+    cs->vcpu_dirty = true;
 }
 
 void kvm_arm_create_host_vcpu(ARMCPU *cpu)
The CPU_SUSPEND function is mandatory in PSCI v0.2+. As KVM handles timer interrupts and IPIs with the vGIC, it can implement wait-for-interrupt more easily than userspace. To implement CPU_SUSPEND, tell KVM to wait for interrupts before returning to the guest.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 target/arm/cpu.h  |  3 +++
 target/arm/kvm.c  | 17 +++++++++++++----
 target/arm/psci.c | 14 +++++++++-----
 3 files changed, 25 insertions(+), 9 deletions(-)
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index 02ab8495665..fdc09e27b1a 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -882,6 +882,9 @@ struct ARMCPU {
     /* KVM steal time */
     OnOffAuto kvm_steal_time;
 
+    /* Put the vCPU in WFI before returning to the guest */
+    bool kvm_suspend;
+
     /* Uniprocessor system with MP extensions */
     bool mp_is_up;
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 56e460b388f..1a7e52b6bad 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -38,6 +38,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
 };
 
 static bool cap_has_mp_state;
+static bool cap_has_mp_halted;
 static bool cap_has_inject_serror_esr;
 static bool cap_has_inject_ext_dabt;
 
@@ -256,6 +257,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
     kvm_halt_in_kernel_allowed = true;
 
     cap_has_mp_state = kvm_check_extension(s, KVM_CAP_MP_STATE);
+    cap_has_mp_halted = kvm_check_extension(s, KVM_CAP_ARM_MP_HALTED);
 
     if (ms->smp.cpus > 256 &&
         !kvm_check_extension(s, KVM_CAP_ARM_IRQ_LINE_LAYOUT_2)) {
@@ -672,10 +674,16 @@ void kvm_arm_create_host_vcpu(ARMCPU *cpu)
 int kvm_arm_sync_mpstate_to_kvm(ARMCPU *cpu)
 {
     if (cap_has_mp_state) {
-        struct kvm_mp_state mp_state = {
-            .mp_state = (cpu->power_state == PSCI_OFF) ?
-            KVM_MP_STATE_STOPPED : KVM_MP_STATE_RUNNABLE
-        };
+        struct kvm_mp_state mp_state = {};
+
+        if (cpu->power_state == PSCI_OFF) {
+            mp_state.mp_state = KVM_MP_STATE_STOPPED;
+        } else if (cap_has_mp_halted && cpu->kvm_suspend) {
+            mp_state.mp_state = KVM_MP_STATE_HALTED;
+        } else {
+            mp_state.mp_state = KVM_MP_STATE_RUNNABLE;
+        }
+
         int ret = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MP_STATE, &mp_state);
         if (ret) {
             fprintf(stderr, "%s: failed to set MP_STATE %d/%s\n",
@@ -702,6 +710,7 @@ int kvm_arm_sync_mpstate_to_qemu(ARMCPU *cpu)
         }
         cpu->power_state = (mp_state.mp_state == KVM_MP_STATE_STOPPED) ?
             PSCI_OFF : PSCI_ON;
+        cpu->kvm_suspend = (mp_state.mp_state == KVM_MP_STATE_HALTED);
     }
 
     return 0;
diff --git a/target/arm/psci.c b/target/arm/psci.c
index 6709e280133..3a980c7f2bf 100644
--- a/target/arm/psci.c
+++ b/target/arm/psci.c
@@ -21,6 +21,7 @@
 #include "exec/helper-proto.h"
 #include "kvm-consts.h"
 #include "qemu/main-loop.h"
+#include "sysemu/kvm.h"
 #include "sysemu/runstate.h"
 #include "internals.h"
 #include "arm-powerctl.h"
@@ -184,13 +185,16 @@ void arm_handle_psci_call(ARMCPU *cpu)
             ret = QEMU_PSCI_RET_INVALID_PARAMS;
             break;
         }
-        /* Powerdown is not supported, we always go into WFI */
-        if (is_a64(env)) {
-            env->xregs[0] = 0;
+        /*
+         * Powerdown is not supported, we always go into WFI.
+         * Under KVM, let the kernel suspend the vCPU.
+         */
+        if (kvm_enabled()) {
+            cpu->kvm_suspend = true;
         } else {
-            env->regs[0] = 0;
+            helper_wfi(env, 4);
         }
-        helper_wfi(env, 4);
+        ret = QEMU_ARM_POWERCTL_RET_SUCCESS;
         break;
     case QEMU_PSCI_0_1_FN_MIGRATE:
     case QEMU_PSCI_0_2_FN_MIGRATE:
If KVM supports it, handle PSCI calls in QEMU. For CPU_ON and CPU_OFF, the arm-powerctl implementation used by TCG can be reused as is.
Note that we add some infrastructure to halt CPUs within QEMU rather than KVM (kvm_arch_process_async_events()) for reference, but it can be removed. To implement CPU_SUSPEND we do rely on halting in kernel, so we have to keep kvm_halt_in_kernel_allowed, set by kvm_irqchip_create(). As a result OFF CPUs are still parked in the kernel rather than in QEMU.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 target/arm/internals.h |  2 +-
 target/arm/kvm.c       | 49 +++++++++++++++++++++++++++++++++++++++++-
 target/arm/kvm64.c     |  5 ++---
 3 files changed, 51 insertions(+), 5 deletions(-)
diff --git a/target/arm/internals.h b/target/arm/internals.h
index 32821f8b04a..71df4ad3104 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -293,7 +293,7 @@ vaddr arm_adjust_watchpoint_address(CPUState *cs, vaddr addr, int len);
 /* Callback function for when a watchpoint or breakpoint triggers. */
 void arm_debug_excp_handler(CPUState *cs);
 
-#if defined(CONFIG_USER_ONLY) || !defined(CONFIG_TCG)
+#if defined(CONFIG_USER_ONLY) || (!defined(CONFIG_TCG) && !defined(CONFIG_KVM))
 static inline bool arm_is_psci_call(ARMCPU *cpu, int excp_type)
 {
     return false;
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 1a7e52b6bad..0fef482878b 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -276,6 +276,19 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
         }
     }
 
+    /*
+     * To handle PSCI calls in QEMU, we need KVM support for suspending the
+     * vCPU, and of course the PSCI-to-userspace capability.
+     */
+    if (cap_has_mp_halted &&
+        kvm_check_extension(kvm_state, KVM_CAP_ARM_HVC_TO_USER) &&
+        kvm_check_extension(kvm_state, KVM_CAP_ARM_PSCI_TO_USER)) {
+        if (kvm_vm_enable_cap(s, KVM_CAP_ARM_HVC_TO_USER, 0) ||
+            kvm_vm_enable_cap(s, KVM_CAP_ARM_PSCI_TO_USER, 0)) {
+            error_report("Failed to enable KVM_CAP_ARM_PSCI_TO_USER");
+        }
+    }
+
     return ret;
 }
 
@@ -951,6 +964,32 @@ static int kvm_arm_handle_dabt_nisv(CPUState *cs, uint64_t esr_iss,
     return -1;
 }
 
+static int kvm_arm_handle_hypercall(CPUState *cs, uint16_t imm)
+{
+    ARMCPU *cpu = ARM_CPU(cs);
+    CPUARMState *env = &cpu->env;
+
+    if (imm != 0) {
+        return 0;
+    }
+
+    kvm_cpu_synchronize_state(cs);
+
+    /* Under KVM, the PSCI conduit is HVC */
+    cs->exception_index = EXCP_HVC;
+    env->exception.target_el = 1;
+    env->exception.syndrome = syn_aa64_hvc(imm);
+    qemu_mutex_lock_iothread();
+    arm_cpu_do_interrupt(cs);
+    qemu_mutex_unlock_iothread();
+
+    /*
+     * For PSCI, exit the kvm_run loop and process the work. Especially
+     * important if this was a CPU_OFF command and we can't return to the
+     * guest.
+     */
+    return EXCP_INTERRUPT;
+}
+
 int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
 {
     int ret = 0;
@@ -966,6 +1005,9 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
         ret = kvm_arm_handle_dabt_nisv(cs, run->arm_nisv.esr_iss,
                                        run->arm_nisv.fault_ipa);
         break;
+    case KVM_EXIT_HYPERCALL:
+        ret = kvm_arm_handle_hypercall(cs, run->hypercall.nr);
+        break;
     default:
         qemu_log_mask(LOG_UNIMP, "%s: un-handled exit reason %d\n",
                       __func__, run->exit_reason);
@@ -981,7 +1023,12 @@ bool kvm_arch_stop_on_emulation_error(CPUState *cs)
 
 int kvm_arch_process_async_events(CPUState *cs)
 {
-    return 0;
+    if (kvm_halt_in_kernel_allowed) {
+        return 0;
+    }
+
+    /* If we're handling PSCI, don't call KVM_RUN for halted vCPUs */
+    return cs->halted;
 }
 
 void kvm_arch_update_guest_debug(CPUState *cs, struct kvm_guest_debug *dbg)
diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index 8b9fd50ff6c..d7f1bfa6127 100644
--- a/target/arm/kvm64.c
+++ b/target/arm/kvm64.c
@@ -877,9 +877,8 @@ int kvm_arch_init_vcpu(CPUState *cs)
     }
 
     /*
-     * When KVM is in use, PSCI is emulated in-kernel and not by qemu.
-     * Currently KVM has its own idea about MPIDR assignment, so we
-     * override our defaults with what we get from KVM.
+     * KVM may emulate PSCI in-kernel. Currently KVM has its own idea about
+     * MPIDR assignment, so we override our defaults with what we get from KVM.
      */
     ret = kvm_get_one_reg(cs, ARM64_SYS_REG(ARM_CPU_ID_MPIDR), &mpidr);
     if (ret) {
For CPU_ON, check whether a CPU is disabled or actually does not exist. Return NOT_PRESENT rather than INVALID_PARAMS in the former case.
This requires an update to the PSCI spec. As of PSCI v1.1, NOT_PRESENT isn't a valid return value for CPU_ON.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 target/arm/arm-powerctl.h |  1 +
 target/arm/arm-powerctl.c | 44 +++++++++++++++++++++++++++++++--------
 2 files changed, 36 insertions(+), 9 deletions(-)
diff --git a/target/arm/arm-powerctl.h b/target/arm/arm-powerctl.h
index 37c8a04f0a9..258f8570e8e 100644
--- a/target/arm/arm-powerctl.h
+++ b/target/arm/arm-powerctl.h
@@ -18,6 +18,7 @@
 #define QEMU_ARM_POWERCTL_ALREADY_ON QEMU_PSCI_RET_ALREADY_ON
 #define QEMU_ARM_POWERCTL_IS_OFF QEMU_PSCI_RET_DENIED
 #define QEMU_ARM_POWERCTL_ON_PENDING QEMU_PSCI_RET_ON_PENDING
+#define QEMU_ARM_POWERCTL_NOT_PRESENT QEMU_PSCI_RET_NOT_PRESENT
 
 /*
  * arm_get_cpu_by_id:
diff --git a/target/arm/arm-powerctl.c b/target/arm/arm-powerctl.c
index b75f813b403..d80eec70b6a 100644
--- a/target/arm/arm-powerctl.c
+++ b/target/arm/arm-powerctl.c
@@ -13,6 +13,7 @@
 #include "cpu-qom.h"
 #include "internals.h"
 #include "arm-powerctl.h"
+#include "hw/boards.h"
 #include "qemu/log.h"
 #include "qemu/main-loop.h"
 
@@ -27,18 +28,37 @@
     } \
 } while (0)
 
+static CPUArchId *arm_get_archid_by_id(uint64_t id)
+{
+    int n;
+    CPUArchId *arch_id;
+    MachineState *ms = MACHINE(qdev_get_machine());
+
+    /*
+     * At this point disabled CPUs don't have a CPUState, but their CPUArchId
+     * exists.
+     *
+     * TODO: Is arch_id == mp_affinity? This needs work.
+     */
+    for (n = 0; n < ms->possible_cpus->len; n++) {
+        arch_id = &ms->possible_cpus->cpus[n];
+
+        if (arch_id->arch_id == id) {
+            return arch_id;
+        }
+    }
+    return NULL;
+}
+
 CPUState *arm_get_cpu_by_id(uint64_t id)
 {
-    CPUState *cpu;
+    CPUArchId *arch_id;
 
     DPRINTF("cpu %" PRId64 "\n", id);
 
-    CPU_FOREACH(cpu) {
-        ARMCPU *armcpu = ARM_CPU(cpu);
-
-        if (armcpu->mp_affinity == id) {
-            return cpu;
-        }
+    arch_id = arm_get_archid_by_id(id);
+    if (arch_id && arch_id->cpu) {
+        return CPU(arch_id->cpu);
     }
 
     qemu_log_mask(LOG_GUEST_ERROR,
@@ -145,6 +165,7 @@ int arm_set_cpu_on(uint64_t cpuid, uint64_t entry, uint64_t context_id,
 {
     CPUState *target_cpu_state;
     ARMCPU *target_cpu;
+    CPUArchId *arch_id;
    struct CpuOnInfo *info;
 
     assert(qemu_mutex_iothread_locked());
@@ -165,11 +186,16 @@ int arm_set_cpu_on(uint64_t cpuid, uint64_t entry, uint64_t context_id,
     }
 
     /* Retrieve the cpu we are powering up */
-    target_cpu_state = arm_get_cpu_by_id(cpuid);
-    if (!target_cpu_state) {
+    arch_id = arm_get_archid_by_id(cpuid);
+    if (!arch_id) {
         /* The cpu was not found */
         return QEMU_ARM_POWERCTL_INVALID_PARAM;
     }
+    if (!arch_id->cpu) {
+        /* The cpu is not plugged in */
+        return QEMU_ARM_POWERCTL_NOT_PRESENT;
+    }
+    target_cpu_state = CPU(arch_id->cpu);
 
     target_cpu = ARM_CPU(target_cpu_state);
     if (target_cpu->power_state == PSCI_ON) {
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Sent: Thursday, May 20, 2021 2:07 PM
To: linaro-open-discussions@op-lists.linaro.org
Cc: Salil Mehta <salil.mehta@huawei.com>; Jonathan Cameron <jonathan.cameron@huawei.com>; james.morse@arm.com; lorenzo.pieralisi@arm.com; Jean-Philippe Brucker <jean-philippe@linaro.org>
Subject: [RFC linux 0/5] KVM: arm64: Let userspace handle PSCI
As planned during the vCPU hot-add discussions from previous LOD meetings, this prototype lets userspace handle PSCI calls from a guest.
The vCPU hot-add model preferred by Arm presents all possible resources through ACPI at boot time, only marking unavailable vCPUs as hidden. The VMM prevents bringing up those vCPUs by rejecting PSCI CPU_ON calls. This allows to keep things simple for vCPU scaling enablement, while leaving the door open for hardware CPU hot-add.
This series focuses on moving PSCI support into userspace. Patches 1-3 allow userspace to request WFI to be executed by KVM. That way the VMM can easily implement the CPU_SUSPEND function, which is mandatory from PSCI v0.2 onwards (even if it doesn't have a more useful implementation than WFI, natively available to the guest). An alternative would be to poll the vGIC implemented in KVM for interrupts, but I haven't explored that solution. Patches 4 and 5 let the VMM request PSCI calls.
The guest needs additional support to deal with hidden CPUs and to gracefully handle the "NOT_PRESENT" return value from PSCI CPU_ON. The full prototype can be found here:
https://jpbrucker.net/git/linux/log/?h=cpuhp/devel
https://jpbrucker.net/git/qemu/log/?h=cpuhp/devel
Hi Jean/James, Thanks for this very useful contribution and sharing patches.
I have quickly gone through all of the patches, including the Linux host/KVM changes and the guest changes done by James, and they look good logically. I will spend time on a detailed review over this weekend and get back to you with comments.
BTW, I have tested hot-{add,remove} and found it working, at least for the straightforward cases. I will do more testing over the next couple of days.
Again, many thanks to all of you for taking this pain and getting us an acceptable solution. This sets up good ground for a useful discussion at the LOD meeting on 25th May 2021.
Thanks,
Salil
'lo
On 20/05/2021 14:07, Jean-Philippe Brucker wrote:
As planned during the vCPU hot-add discussions from previous LOD meetings, this prototype lets userspace handle PSCI calls from a guest.
The vCPU hot-add model preferred by Arm presents all possible resources through ACPI at boot time, only marking unavailable vCPUs as hidden. The VMM prevents bringing up those vCPUs by rejecting PSCI CPU_ON calls. This allows to keep things simple for vCPU scaling enablement, while leaving the door open for hardware CPU hot-add.
This series focuses on moving PSCI support into userspace. Patches 1-3 allow userspace to request WFI to be executed by KVM. That way the VMM can easily implement the CPU_SUSPEND function, which is mandatory from PSCI v0.2 onwards (even if it doesn't have a more useful implementation than WFI, natively available to the guest). An alternative would be to poll the vGIC implemented in KVM for interrupts, but I haven't explored that solution. Patches 4 and 5 let the VMM request PSCI calls.
As mentioned on the call, I've tested the udev output on x86 and arm64; as expected it's the same:

| root@vm:~# udevadm monitor
| monitor will print the received events for:
| UDEV - the event which udev sends out after rule processing
| KERNEL - the kernel uevent
|
| KERNEL[33.935817] add /devices/system/cpu/cpu1 (cpu)
| KERNEL[33.946333] bind /devices/system/cpu/cpu1 (cpu)
| UDEV [33.953251] add /devices/system/cpu/cpu1 (cpu)
| UDEV [33.958676] bind /devices/system/cpu/cpu1 (cpu)
(I've not played with the KVM changes yet)
The guest needs additional support to deal with hidden CPUs and to gracefully handle the "NOT_PRESENT" return value from PSCI CPU_ON. The full prototype can be found here:
Hopefully it's possible to make those silent!
Thanks,
James
From: James Morse <james.morse@arm.com>
Sent: Thursday, June 3, 2021 4:39 PM
To: Jean-Philippe Brucker <jean-philippe@linaro.org>; linaro-open-discussions@op-lists.linaro.org
Cc: Salil Mehta <salil.mehta@huawei.com>; Jonathan Cameron <jonathan.cameron@huawei.com>; lorenzo.pieralisi@arm.com
Subject: Re: [RFC linux 0/5] KVM: arm64: Let userspace handle PSCI
Regarding PSCI, I believe you are referring to the return value in the code hunk below and the PSCI spec changes mentioned in the patch [1]?
@@ -209,6 +209,8 @@ static int __psci_cpu_on(u32 fn, unsigned long cpuid, unsigned long entry_point)
 	int err;
 
 	err = invoke_psci_fn(fn, cpuid, entry_point, 0);
+	if (err == PSCI_RET_NOT_PRESENT)
+		return -EPROBE_DEFER;
 	return psci_to_linux_errno(err);
 }
[1] https://jpbrucker.net/git/linux/commit/?h=cpuhp/devel&id=afa4089cb122637...
Perhaps by "additional" support for the hidden CPUs you mean their effect on the sizing of various data structures and other features that depend on explicit awareness of the present and possible CPUs?
Salil.
Hi,
On Thu, May 20, 2021 at 03:07:02PM +0200, Jean-Philippe Brucker wrote:
As planned during the vCPU hot-add discussions from previous LOD meetings, this prototype lets userspace handle PSCI calls from a guest.
To kick things off I've sent the Linux RFC to the public lists https://lore.kernel.org/kvmarm/20210608154805.216869-1-jean-philippe@linaro....
I've only sent the KVM bits to see if folks are strongly opposed to the idea. I'm guessing the next steps are:
(1) Discuss changes to the PSCI spec with ATG. The spec needs to recognize the new behavior in section 5.6 (and probably elsewhere): A core may be available for power-up or not. The mechanism by which the OS detects if a core can be powered up is IMPLEMENTATION DEFINED (we use ACPI _STA). If the core is not available, CPU_ON returns a new error (I've reused NOT_PRESENT -7 in my patch, to avoid introducing a new error code in PSCI).
(2) Send a new RFC for Qemu, which includes Salil's RFC + the additional PSCI support.
(3) Send the changes for Linux guests, both for ACPI and PSCI.
Thanks, Jean
Hi Jean,
On 20/05/2021 14:07, Jean-Philippe Brucker wrote:
The guest needs additional support to deal with hidden CPUs and to gracefully handle the "NOT_PRESENT" return value from PSCI CPU_ON. The full prototype can be found here:
There was some internal discussion on the PSCI bits..
Could we use 'DENIED' here, as it's firmware policy not to allow the core to be brought online? I'm nervous that reporting 'PRESENT' in the _STA method, and 'NOT_PRESENT' from firmware, is going to require us to paper over something later!
(I need to find where you posted this publicly...)
Thanks,
James
On Tue, Jun 22, 2021 at 04:56:02PM +0100, James Morse wrote:
Hi Jean,
On 20/05/2021 14:07, Jean-Philippe Brucker wrote:
The guest needs additional support to deal with hidden CPUs and to gracefully handle the "NOT_PRESENT" return value from PSCI CPU_ON. The full prototype can be found here:
There was some internal discussion on the PSCI bits..
Could we use 'DENIED' here, as it's firmware policy not to allow the core to be brought online? I'm nervous that reporting 'PRESENT' in the _STA method, and 'NOT_PRESENT' from firmware, is going to require us to paper over something later!
It makes sense. I don't think that's a problem: the code implementing it can't be merged pending the spec changes anyway, and NOT_PRESENT is just a made-up return value for the time being.
(I need to find where you posted this publicly...)
https://lore.kernel.org/kvm/20210608154805.216869-1-jean-philippe@linaro.org
Lorenzo