Hi James,
From: James Morse [mailto:james.morse@arm.com] Sent: Monday, October 31, 2022 4:49 PM To: Salil Mehta salil.mehta@huawei.com; joyce.qi@linaro.org Cc: linaro-open-discussions@op-lists.linaro.org; Jonathan Cameron jonathan.cameron@huawei.com; lorenzo.pieralisi@linaro.org; ilkka@os.amperecomputing.com; Jean-Philippe Brucker jean-philippe.brucker@arm.com; salil.mehta@opnsrc.net; mehta.salil.lnk@gmail.com Subject: Re: [Linaro-open-discussions] Re: Invitation: Linaro Open Discussions monthly meeting @ Wed 5 Oct 2022 19:00 - 20:00 (HKT) (linaro-open-discussions@op-lists.linaro.org)
Hi Salil,
On 31/10/2022 12:40, Salil Mehta wrote:
From: James Morse [mailto:james.morse@arm.com] Sent: Friday, October 28, 2022 5:17 PM To: Salil Mehta salil.mehta@huawei.com; joyce.qi@linaro.org Cc: linaro-open-discussions@op-lists.linaro.org; lorenzo.pieralisi@linaro.org; ilkka@os.amperecomputing.com; Jean-Philippe Brucker jean-philippe.brucker@arm.com; salil.mehta@opnsrc.net Subject: Re: [Linaro-open-discussions] Re: Invitation: Linaro Open Discussions monthly meeting @ Wed 5 Oct 2022 19:00 - 20:00 (HKT) (linaro-open-discussions@op-lists.linaro.org)
On the qemu side: Using KVM and your Qemu tree, PSCI returns success, when it should return denied for CPUs that are forbidden due to firmware policy: [ 0.027597] smp: Bringing up secondary CPUs ... [ 5.108204] CPU1: failed to come online [ 5.109198] CPU1: failed in unknown state : 0x0 [ 5.110384] smp: Brought up 1 node, 1 CPU
Did you try to use below branch(with some suggested fixes) which I shared with you?
[1] James Approach with online-capable and present==possible (some fixes) https://github.com/salil-mehta/linux.git virt-cpuhp-arm64/rfc-v2/jmorse-pres-eq-poss-cpu
No, its the qemu changes I'm after.
Returning "success" to CPU_ON, but not actually bringing the CPU online is an abomination. I can't imagine what the maintainer will throw at you if you suggest it.
This is mostly harmless for linux, but using PSCI_DENIED squashes the error, and avoids the timeout waiting for the secondary to turn up. Mostly harmless as this may cause 'cpus stuck in kernel' to get set, which would prevent kexec.
This probably means you're not using the HVC_TO_USER support for KVM that gets added as part of this series. You can then use the PSCI_TO_USER cap to disable the kernel's PSCI handler, which lets the VMM manage this directly.
Yes, that is true . I intentionally removed the PSCI-to-userspace patches as I was wondering if the CPU devices are not available in the guest kernel for the disabled CPUs how can they be made online?
... and if your guest kernel is malicious?
Ok for the security, Yes, it makes sense but wanted to hear clearly from you the reasoning so took off those patches. I will add back those patches both in the linux and the QEMU(changes to handle HVC exit calls in userspace and enable capabilities in the KVM at arch init) repositories.
We would need to explain this reason clearly in the patches.
PSCI has to be able to return an error if you try to online a disabled CPU, (as without enforcement the feature is pointless). So there is no need to track which CPUs are disabled in the MADT - just online the lot, and handle the error gracefully.
Without security being compromised(and CPU devices not existing in kernel) CPU online'ing cannot be done. Security and feature functionality are two different aspects of the policy enforcement.
Do you mean no need to track disabled CPUs as in MADT !GICC.flags.Enabled?
Doing this then lines up nicely with the ACPI changes, where if firmware isn't doing any enforcement, then the CPUs get registered and a nasty firmware-bug message is printed. This is what makes those changes safe on x86 where the _STA enabled bit may be wrong on mass-produced laptops. If their 'firmware' isn't enforcing any policy, then the _STA bit gets ignored.
Still using KVM, firmware should clear the enabled bit from _STA for any call after the eject-request. I see this is still 0xF before _EJx is called.
This is how the existing handshake protocol works. Firmware intimates the OS about ejection request which confirms its status about ejection-in-progress. Firmware/VMM then sends the ACPI event to remove the CPU from the kernel(which effectively is try-to-offline process and making not-present), Until this point _STA.ENABLED=1 (which is a correct behavior since firmware/VMM cannot disable the CPU before it has been made offline by the OS - can we snatch the physical resources before completing the housekeeping at OS?). Once offline'ing/removal is done, OS then informs the firmware/VMM about this completion using _EJ0 method (Which is effective way of giving green signal to the firmware/VMM to go ahead and remove the physical resources).
Okay, the scope is narrower than after the eject-request... The problem is apci_processor's remove method calls _STA, and finds "no change", so it does nothing. But it appears this is what Qemu does on x86 too.
Exactly. and this was the reason why below crash was happening during remove leg as arch_unregister_cpu() was not getting called and then during next processor_add() arch_register_cpu() would fail with below
[ 73.647991] sysfs_warn_dup+0x60/0x80 [ 73.648414] sysfs_create_dir_ns+0xe4/0x100 [ 73.648885] kobject_add_internal+0x98/0x220 [ 73.649367] kobject_add+0x94/0x108 [ 73.649759] device_add+0xf8/0x8a8 [ 73.650145] device_register+0x20/0x30 [ 73.650569] register_cpu+0xf0/0x1b0 [ 73.650974] arch_register_cpu+0x5c/0x70 [ 73.651415] acpi_processor_add+0x410/0x680 [ 73.651886] acpi_bus_attach+0x12c/0x228 [ 73.652334] acpi_bus_scan+0x58/0x110 [ 73.652745] acpi_device_hotplug+0x208/0x470 [ 73.653227] acpi_hotplug_work_fn+0x24/0x40 [ 73.653697] process_one_work+0x1d0/0x320 [ 73.654148] worker_thread+0x4c/0x400 [ 73.654561] kthread+0x110/0x120 [ 73.654924] ret_from_fork+0x10/0x20
Once _EJ0 has been called, I'd expect _STA to reflect the conclusion of the eject-request.
Yes, agreed. and as I mentioned earlier this change is pending in QEMU. I deferred it intentionally since it did not matter with respect to your changes as the places where it was being used in your code would always have _STA.Ebaled=1.
Linux needs to see _STA change, as otherwise it can't know which bits in _STA have changed.
Ok we can do that now. But even with this change in the QEMU, the remove leg in the kernel where it is being referred, flag _STA.Enabled would still be 1.
It is better to use online-capable Bit there or we would need to hack the existing handshake Hotplug protocol to fit the _STA.Enabled=0 which is not very clean as the eject-in-progress handling in OS should be exactly same as that of the physical Hotplug feature. The only difference is what we do with the _STA.PRESENT Bits later at the ACPI level in the firmware/VMM. It should be exposed as present by the VMM/firmware after _EJ0 method has been evaluated by the OS.
Firmware then removes the device and any further evaluation of _STA would show _STA.ENABLED=0 and ideally should show is not present(but this is something which I have not changed in the QEMU - maybe we need to do or does it even matter at QEMU level?).
No, it looks like linux is calling these in the wrong order. The ACPI processor driver needs to have its remove call made once the CPU is offline and _EJ0 has been called,...
I beg to differ on this. I think order during eject-in-progress handling in OS should be exactly same as that of the physical Hotplug feature.
Hence, the acpi_bus_trim() should come before acpi_evaluate_ej0().
acpi_scan_hot_remove(struct acpi_device *device) { [...] 1. acpi_scan_try_to_offline(device); [...] 2. acpi_bus_trim(device); /* processor_remove() gets called here */ [...] 3. acpi_evaluate_ej0(handle); /* firmware will make _STA.Enabled=0, _STA.PRESENT=1(not doing right now) */ [...] 4. acpi_evaluate_integer(handle, "_STA", NULL, &sta); /* this is not checking PRESENT Bit right now */ [...] 5. acpi_evaluate_ost(...); /* NOT required still? */ [...] }
(I thought this is what qemu was doing on x86, but now that I test it again...!)
[..]
It looks like a post_eject callback on the scan handler is the right way to do this.
Would look like a hack to be frank.
This means the cpu remains registered, and user-space doesn't get notified that the CPU is no longer available. User-space can still try to enable the CPU, which will fail.
That is true and to fix this we could use the online-capable Bit instead of using the _STA.ENABLED Bit status as discussed in the 5th Oct LOD meeting. I have done those changes in the above branch [1].
But the MADT online-capable bit is static, it only ever has one value. We need something that can be changed by firmware, (and the OS _knows_ is being changed by firmware). This is what _STA is for.
You can't guess from an eject-request that the CPU is being disabled, what happens when we need to add support for physical hot-remove?
Isn’t presence of online-capability(for different architectures this could be interpreted differently and they just need to add their bit) enough to take different route than the physical Hotplug path of hot-remove?
You might want to check below changes which achieves above
Repository: https://github.com/salil-mehta/linux.git virt-cpuhp-arm64/rfc-v2/jmorse-pres-eq-poss-cpu
Patches: ACPI: APIs to check online-capability of the processor ACPI: Fix: Introduce online-capable check to fix remove cpu crash
- if (cpu_present(pr->id) && !(sta & ACPI_STA_DEVICE_ENABLED)) + if (cpu_present(pr->id) && acpi_check_online_capable(pr)) { + arch_unregister_cpu(pr->id); + } else { + pr_err_once(FW_BUG "CPU%u not online-capable is being made not present\n", pr->id); + add_taint(TAINT_FIRMWARE_WORKAROUND, LOCKDEP_STILL_OK); + }
Thanks Salil