[Linaro-open-discussions] Re: Invitation: Linaro Open Discussions monthly meeting @ Wed 5 Oct 2022 19:00 - 20:00 (HKT) (linaro-open-discussions@op-lists.linaro.org)

1 Nov 2022

      Hi Salil,
On 01/11/2022 11:09, Salil Mehta wrote:
...
...
From: James Morse [mailto:james.morse@arm.com]
Sent: Monday, October 31, 2022 4:49 PM
To: Salil Mehta salil.mehta@huawei.com; joyce.qi@linaro.org
Cc: linaro-open-discussions@op-lists.linaro.org; Jonathan Cameron
jonathan.cameron@huawei.com; lorenzo.pieralisi@linaro.org;
ilkka@os.amperecomputing.com; Jean-Philippe Brucker
jean-philippe.brucker@arm.com; salil.mehta@opnsrc.net;
mehta.salil.lnk@gmail.com
Subject: Re: [Linaro-open-discussions] Re: Invitation: Linaro Open Discussions
monthly meeting @ Wed 5 Oct 2022 19:00 - 20:00 (HKT)
(linaro-open-discussions@op-lists.linaro.org)
...
...
...
...
This probably means you're not using the HVC_TO_USER support for KVM that gets added as
part of this series. You can then use the PSCI_TO_USER cap to disable the kernel's PSCI
handler, which lets the VMM manage this directly.
...
Yes, that is true . I intentionally removed the PSCI-to-userspace patches as
I was wondering if the CPU devices are not available in the guest kernel for the
disabled CPUs how can they be made online?
... and if your guest kernel is malicious?
...
Ok for the security,
I wouldn't call it security, just you can't trust the guest to play nice.
...
Yes, it makes sense but wanted to hear clearly from you the
reasoning so took off those patches. I will add back those patches both in the
linux and the QEMU(changes to handle HVC exit calls in userspace and enable
capabilities in the KVM at arch init) repositories.
We would need to explain this reason clearly in the patches.
Its described in the documentation patch as:
| Firmware can enforce its policy via PSCI's return codes. e.g. `DENIED``
I'm coming from a position of "PSCI has always been able to return errors", but this is
the second time I've been surprised by people not seeing it this way - I'll try and expand
the commit messages to explain this some more...
...
...
PSCI has to be able to return an error if you try to online a disabled CPU, (as without
enforcement the feature is pointless). So there is no need to track which CPUs are
disabled in the MADT - just online the lot, and handle the error gracefully.
Without security being compromised(and CPU devices not existing in kernel) CPU
online'ing cannot be done. Security and feature functionality are two different
aspects of the policy enforcement.
Do you mean no need to track disabled CPUs as in MADT !GICC.flags.Enabled?
Linux doesn't need those GICC flags, they were added for another operating system.
For linux, its simpler to try to enable everything, and handle the failure gracefully.
This lets us disappear the feature if it turns out firmware isn't enforcing any policy.
(as described below:)
...
...
Doing this then lines up nicely with the ACPI changes, where if firmware isn't doing any
enforcement, then the CPUs get registered and a nasty firmware-bug message is printed.
This is what makes those changes safe on x86 where the _STA enabled bit may be wrong on
mass-produced laptops. If their 'firmware' isn't enforcing any policy, then the _STA bit
gets ignored.
...
...
...
...
Still using KVM, firmware should clear the enabled bit from _STA for any call after the
eject-request. I see this is still 0xF before _EJx is called.
This is how the existing handshake protocol works. Firmware intimates the
OS about ejection request which confirms its status about ejection-in-progress.
Firmware/VMM then sends the ACPI event to remove the CPU from the kernel(which
effectively is try-to-offline process and making not-present), Until
this point _STA.ENABLED=1 (which is a correct behavior since firmware/VMM
cannot disable the CPU before it has been made offline by the OS - can we
snatch the physical resources before completing the housekeeping at OS?).
Once offline'ing/removal is done, OS then informs the firmware/VMM about
this completion using _EJ0 method (Which is effective way of giving green
signal to the firmware/VMM to go ahead and remove the physical resources).
Okay, the scope is narrower than after the eject-request... The problem is
apci_processor's remove method calls _STA, and finds "no change", so it does nothing. But
it appears this is what Qemu does on x86 too.
Exactly. and this was the reason why below crash was happening during remove
leg as arch_unregister_cpu() was not getting called and then during next
processor_add() arch_register_cpu() would fail with below
Yes, linux was calling process_remove and _EJ0 in the wrong order. I've fixed that, but
now Qemu is reported 'not present' via _STA, so acpi_processor_make_not_present() prints a
nastygram about this not being supported, and arch_unregister_cpu() is never called.
The work to make a CPU not-present has not been defined, we must not paper over this.
For arm64 we know there are no physical hotplug systems, if we can get the OSI strings
defined, we can call this a firmware-bug if it tries when the OS didn't advertise support.
(for x86 that ship has sailed, so a nasty print message about support being disabled is
all we can do)
...
...
Once _EJ0 has been called, I'd expect _STA to reflect the conclusion of the
eject-request.
Yes, agreed. and as I mentioned earlier this change is pending in QEMU.
I deferred it intentionally since it did not matter with respect to your
changes as the places where it was being used in your code would always
have _STA.Ebaled=1.
Fair enough, but I'd really expect _STA to be a reflection of Qemu's device-model. Maybe
its more complex than I think.
...
...
Linux needs to see _STA change, as otherwise it can't know which bits in _STA
have changed.
Ok we can do that now. But even with this change in the QEMU, the remove
leg in the kernel where it is being referred, flag _STA.Enabled would
still be 1.
I've added a post_eject handler to ACPI's bus scanning stuff. That fixes it for me.
https://gitlab.arm.com/linux-arm/linux-jm/-/tree/virtual_cpu_hotplug/rfc/v0....
...
It is better to use online-capable Bit
But its static! It can't change. All you can use this for is to _guess_ that an
eject-request is to make a CPU disabled.
Its much better to change linux to call _STA at the right point!
...
there or we would need to hack
the existing handshake Hotplug protocol to fit the _STA.Enabled=0 which
is not very clean as the eject-in-progress handling in OS should be
exactly same as that of the physical Hotplug feature.
The existing processor remove code is equally broken: It assume an eject-request is to
make the CPU not-present.
...
The only
difference is what we do with the _STA.PRESENT Bits later at the ACPI
level in the firmware/VMM. It should be exposed as present by the
VMM/firmware after _EJ0 method has been evaluated by the OS.
For arm64 yes.
...
...
...
Firmware then removes the device and any further evaluation of _STA would
show _STA.ENABLED=0 and ideally should show is not present(but this is
something which I have not changed in the QEMU - maybe we need to do or
does it even matter at QEMU level?).
No, it looks like linux is calling these in the wrong order.
The ACPI processor driver needs to have its remove call made once the CPU is offline and
_EJ0 has been called,...
I beg to differ on this. I think order during eject-in-progress
handling in OS should be exactly same as that of the physical
Hotplug feature.
Hence, the acpi_bus_trim() should come before acpi_evaluate_ej0().
I agree - but the reason is for all the other drivers that have a detach callback that
needs to happen while the device is still present so that they can quiesce it.
This is really another phase as some drivers need work to be done 'post eject'. the ACPI
processor driver currently gets away with it by making assumptions.
...
acpi_scan_hot_remove(struct acpi_device *device)
{
   [...]

acpi_scan_try_to_offline(device);
[...]
acpi_bus_trim(device);  /* processor_remove() gets called here */
[...]
acpi_evaluate_ej0(handle); /* firmware will make _STA.Enabled=0, _STA.PRESENT=1(not doing right now) */
[...]
acpi_evaluate_integer(handle, "_STA", NULL, &sta); /* this is not checking PRESENT Bit right now */
[...]

...

acpi_evaluate_ost(...);  /* NOT required still? */
[...]

_OST is very important for real systems that do this. _OST is the only way the firmware
can know that the OS received the eject-request and has started work on it. _OST is the
only way firmware can know if the eject-request failed for some reason. Without it,
firmware has to guess (a timeout - how long?) whether the eject failed ... or is just
taking longer ...
...
...
(I thought this is what qemu was doing on x86, but now that I test it again...!)
[..]
It looks like a post_eject callback on the scan handler is the right way to do
this.
Would look like a hack to be frank.
I think its fair that drivers have some cleanup work that needs to be done after their
device has been removed. Especially if they don't know how its been 'ejected' until _EJ0
has completed. _STA is already called after _EJ0 to 'see' if the eject worked.
I think the alternative would be to add a new disable-request that calls _DIS on
completion. It would be even harder to make this look like hotplug to user-space.
(then again, all the world is not a VHS)
...
...
...
...
This means the
cpu remains
registered, and user-space doesn't get notified that the CPU is no longer
available.
User-space can still try to enable the CPU, which will fail.
...
That is true and to fix this we could use the online-capable Bit instead
of using the _STA.ENABLED Bit status as discussed in the 5th Oct LOD
meeting. I have done those changes in the above branch [1].
But the MADT online-capable bit is static, it only ever has one value. We need something
that can be changed by firmware, (and the OS _knows_ is being changed by firmware). This
is what _STA is for.
You can't guess from an eject-request that the CPU is being disabled, what happens when we
need to add support for physical hot-remove?
...
Isn’t presence of online-capability(for different architectures this could be interpreted
differently and they just need to add their bit) enough to take different route than the
physical Hotplug path of hot-remove?
No - that bit was only needed by another operating system to prevent their boot code from
choking on errors returned by PSCI.
We never did need it for linux, I don't think we should add code to carry it around to
quirk how ACPI behaves.
...
You might want to check below changes which achieves above
Repository:
https://github.com/salil-mehta/linux.git virt-cpuhp-arm64/rfc-v2/jmorse-pres-eq-poss-cpu
Patches:
ACPI: APIs to check online-capability of the processor
ACPI: Fix: Introduce online-capable check to fix remove cpu crash

if (cpu_present(pr->id) && !(sta & ACPI_STA_DEVICE_ENABLED))

if (cpu_present(pr->id) && acpi_check_online_capable(pr)) {
arch_unregister_cpu(pr->id);

} else {
pr_err_once(FW_BUG "CPU%u not online-capable is being made not present\n", pr->id);

add_taint(TAINT_FIRMWARE_WORKAROUND, LOCKDEP_STILL_OK);

}

But if you don't read _STA, you don't know whether the present bit or the enabled bit is
being toggled. How are stable kernels with this code supposed to know on hardware that
supports physical hot-remove? What about platforms that support both!?
Don't guess - call _STA to know for sure. acpi_scan_hot_remove() already calls _STA after
_EJ0.
Thanks,
James

2025

2024

2023

2022

2021

2020

[Linaro-open-discussions] Re: Invitation: Linaro Open Discussions monthly meeting @ Wed 5 Oct 2022 19:00 - 20:00 (HKT) (linaro-open-discussions@op-lists.linaro.org)