Hi James,
From: James Morse <james.morse@arm.com>
Sent: Monday, October 31, 2022 5:51 PM
To: Salil Mehta <salil.mehta@huawei.com>; joyce.qi@linaro.org
Cc: linaro-open-discussions@op-lists.linaro.org; lorenzo.pieralisi@linaro.org; ilkka@os.amperecomputing.com; Jean-Philippe Brucker <jean-philippe.brucker@arm.com>; salil.mehta@opnsrc.net
Subject: Re: [Linaro-open-discussions] Re: Invitation: Linaro Open Discussions monthly meeting @ Wed 5 Oct 2022 19:00 - 20:00 (HKT) (linaro-open-discussions@op-lists.linaro.org)
Hi Salil,
On 31/10/2022 16:48, James Morse via Linaro-open-discussions wrote:
On 31/10/2022 12:40, Salil Mehta wrote:
On 28/10/2022 17:17, James Morse wrote:
Still using KVM, firmware should clear the enabled bit from _STA for any call after the eject-request. I see this is still 0xF before _EJx is called.
This is how the existing handshake protocol works. Firmware notifies the OS of the ejection request, and the OS confirms that ejection is in progress. Firmware/VMM then sends the ACPI event to remove the CPU from the kernel (which is effectively a try-to-offline followed by marking it not-present). Until this point _STA.ENABLED=1, which is correct behaviour, since the firmware/VMM cannot disable the CPU before the OS has taken it offline - can we snatch the physical resources away before the OS has completed its housekeeping? Once offlining/removal is done, the OS informs the firmware/VMM of completion via the _EJ0 method, which is effectively the green signal for the firmware/VMM to go ahead and remove the physical resources.
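As a hedged sketch of the handshake above (plain C with illustrative stage names; the bit values are the standard _STA return bits from the ACPI spec), this is what _STA would report at each point, using the thread's eventual conclusion that the CPU stays present after _EJ0:

```c
#include <assert.h>

/* _STA return bits, per the ACPI spec's "_STA (Device Status)" */
#define STA_PRESENT     0x1
#define STA_ENABLED     0x2
#define STA_SHOWN_IN_UI 0x4
#define STA_FUNCTIONING 0x8

/* Illustrative VMM-side stages of the eject handshake */
enum hp_stage {
	STAGE_RUNNING,         /* vCPU online in the guest                 */
	STAGE_EJECT_REQUESTED, /* eject-request delivered, OS offlining    */
	STAGE_AFTER_EJ0,       /* OS evaluated _EJ0, firmware removed vCPU */
};

/* What _STA would report at each stage of the handshake */
static unsigned int sta_for_stage(enum hp_stage s)
{
	switch (s) {
	case STAGE_RUNNING:
	case STAGE_EJECT_REQUESTED:
		/* ENABLED must stay set until the OS has offlined the CPU */
		return STA_PRESENT | STA_ENABLED |
		       STA_SHOWN_IN_UI | STA_FUNCTIONING; /* 0xF */
	case STAGE_AFTER_EJ0:
		/* ENABLED cleared once the OS has given the green signal;
		 * PRESENT stays set (redistributor etc. still exist) */
		return STA_PRESENT;
	}
	return 0;
}
```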
Okay, so the scope is narrower than I thought - this is after the eject-request... The problem is acpi_processor's remove method calls _STA and finds "no change", so it does nothing. But it appears this is what Qemu does on x86 too. Once _EJ0 has been called, I'd expect _STA to reflect the conclusion of the eject-request.
Linux needs to see _STA change, as otherwise it can't know which bits in _STA have changed.
Firmware then removes the device, and any further evaluation of _STA would show _STA.ENABLED=0 and ideally should also show not-present (but this is something I have not changed in QEMU - maybe we need to, or does it even matter at the QEMU level?).
No, it looks like linux is calling these in the wrong order. The ACPI processor driver needs to have its remove call made once the CPU is offline and _EJ0 has been called,...
(I thought this is what qemu was doing on x86, but now that I test it again...!)
[..]
It looks like a post_eject callback on the scan handler is the right way to do this.
This works - but now it looks like you are returning 0 from _STA after _EJ0. Please don't do this: the present bit needs to remain set because, while the CPU has been disabled, the redistributor, any RAS ERR nodes, and anything else that hangs off the CPU is still present.
Sure, as mentioned in the earlier update, I intentionally deferred this change in QEMU. I will introduce it, i.e. after _EJ0 the flags will be _STA.ENABLED=0 and _STA.PRESENT=1.
Does anyone think we need _OSI strings for this? If so, now is the time to define them. My suggestions would be "ACPI0007 present updates" and "ACPI0007 enabled updates", which makes it fairly clear this is about toggling bits in the processor object.
It sounds like kubernetes might have some kind of GUI that shows something based on the cpu present mask. If it's needed, my current thinking is to have a cpu_enabled_mask in the kernel (defaulting to cpu_present_mask on everything but arm64), and use it to filter CPUs out of the sysfs 'offline' file. This avoids creating new files in sysfs, and may even fix whatever it is that kubernetes is doing...
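A rough sketch of that filtering, modelled with plain bitmasks rather than the kernel's struct cpumask (the function and parameter names here are made up for illustration):

```c
#include <assert.h>

/* Toy model: each bit is one CPU. The real kernel would operate on
 * struct cpumask with helpers like cpumask_andnot()/cpumask_and(). */
static unsigned long sysfs_offline_cpus(unsigned long present,
					unsigned long online,
					unsigned long enabled)
{
	/* Today 'offline' effectively shows present & ~online;
	 * additionally masking with a cpu_enabled_mask hides CPUs
	 * that are present but disabled at the ACPI level. */
	return present & ~online & enabled;
}
```

On architectures where the hypothetical cpu_enabled_mask defaults to cpu_present_mask, the extra mask is a no-op, so the 'offline' file would be unchanged there.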
Maybe, but if possible I would still prefer to use cpu_present_mask; it is dynamic. I do understand, as you mentioned earlier, that right now the present mask may be tied to the presence of the CPU in the ACPI tables.
File: https://elixir.bootlin.com/linux/latest/source/include/linux/cpumask.h
[...]
 * If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
 * depending on what ACPI reports as currently plugged in, otherwise
 * cpu_present_mask is just a copy of cpu_possible_mask.
[...]
Any upper layer using the present mask should already know that it is dynamic and should handle it changing. The question is really about the association between a CPU being present in Linux and being present at the ACPI level. Can we make this association conditional?
After all, we are already making the presence of CPU devices conditional (by calling arch_{un}register_cpu() selectively), even though the CPU will always be present at the ACPI level?
AFAICS nothing breaks inside the kernel?
But yes, maybe it is safer to introduce a new mask. If we decide to use another mask, then there will be more changes than just hiding sysfs entries and conditionally making CPU devices appear/disappear.
For example, any upper layer that currently relies on the present mask might need to consult this new enabled/available mask as well, like the CPUHP state machine in the kernel:
https://elixir.bootlin.com/linux/latest/source/kernel/cpu.c#L1335
static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
{
	[...]
	if (!cpu_present(cpu)) {	/* and a similar check in _cpu_down() */
		ret = -EINVAL;
		goto out;
	}
	[...]
}
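To illustrate the concern, here is a toy model of that gate (plain bitmasks and hypothetical names; the real _cpu_up() calls cpu_present() on struct cpumask) showing how it might have to grow a parallel check against the new mask:

```c
#include <assert.h>
#include <errno.h>

/* Toy model of the _cpu_up()/_cpu_down() presence gate; each bit of
 * 'present'/'enabled' represents one CPU. Names are illustrative. */
static int cpu_up_gate(unsigned int cpu, unsigned long present,
		       unsigned long enabled)
{
	if (!(present & (1UL << cpu)))
		return -EINVAL;
	/* With a separate cpu_enabled_mask, a present-but-disabled
	 * CPU would presumably also have to be rejected here. */
	if (!(enabled & (1UL << cpu)))
		return -EINVAL;
	return 0;
}
```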
Also, all future changes will now need to be careful with this new meaning of a vCPU 'being enabled' while it is present inside the kernel.
Thanks,
Salil