Hi James,
Do we have a plan for when to post the non-RFC Virtual CPU Hotplug patch-set
for the kernel side to the community? I guess people will be more interested in reviewing
it if it is posted without the RFC tag.
If time is a problem on your side, would you like my help in carrying it forward?
Please let me know.
Thanks
Salil.
Hello,
We have been trying to verify system suspend/restore with the vCPU Hotplug patches recently and
found that this functionality does not work on ARM64 VMs even without our patches, i.e. using the latest kernel and QEMU repositories.
estuary:/$ cat /sys/power/mem_sleep
[s2idle]
estuary:/$
estuary:/$ cat /sys/power/state
freeze mem disk
estuary:/$
estuary:/$
estuary:/$ echo mem > /sys/power/state
[ 60.458445] PM: suspend entry (s2idle)
[ 60.458840] Filesystems sync: 0.000 seconds
[ 60.459649] Freezing user space processes
[ 60.461149] Freezing user space processes completed (elapsed 0.001 seconds)
[ 60.461830] OOM killer disabled.
[ 60.462144] Freezing remaining freezable tasks
[ 60.463188] Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
[ 60.463920] printk: Suspending console(s) (use no_console_suspend to debug)
(qemu)
(qemu) sys
system_powerdown system_reset system_wakeup
(qemu) system_wakeup
Error: wake-up from suspend is not supported by this guest
(qemu)
Or using # systemctl suspend
What is the expected behavior, or are we missing something?
Thanks
Salil
Hi all,
May I, representing our members, bring forth an issue for discussion?
The impact of this issue is significant: until it is resolved, the scenario
where a GPU is passed through to a virtual machine cannot be used. It
occurs on Arm only; x86 is unaffected. (I know workarounds exist, but we want
to fix this in mainline.)
After reading through so many community emails (see below), I believe it’s
unlikely that a single patch can quickly gain everyone's support. A broad
discussion involving the ARM ecosystem and the KVM community is essential.
Only with consensus can a submitted patch receive sufficient support,
eventually resolving this issue in the mainline version.
History:
- This issue was first reported publicly on 2021-07-01; see this
  [URL](https://gitee.com/openeuler/kernel/issues/I3YRDP?from=project-issue).
- A patch was submitted to the kernel mailing list on 2022-04-01, authored by
  kylinos.com, but it was rejected. No follow-up was found after that.
  [Link to the Patch](https://lore.kernel.org/lkml/20220401090828.614167-1-xieming@kylinos.cn/T/)
- Another discussion took place in this email chain on 2022-05-09, authored
  by nvidia.com, but it reached no conclusion.
  https://lore.kernel.org/all/20210429162906.32742-1-sdonthineni@nvidia.com/
- As of this writing (2023-10-09), the issue can still be reproduced in
  kernel 6.1.x with Nvidia / AMD GPUs on Arm.
Problem Description:
A GPU is passed through to a virtual machine via a PCIe node. When
installing the GPU driver within a virtual machine that runs on an openEuler
22.03 LTS SP2 (aarch64) system (based on Linux kernel 5.10), the following
error occurs:
“Unsupported FSC: EC=0x24 xFSC=0x21 ESR_EL2=0x92000061”
PS: the same issue can also be reproduced in kernel 6.1.x with Nvidia / AMD
GPUs.
Upon consulting the official ARM documentation, the meanings of the
fields in this error message are as follows:
EC=0x24
The binary code is 0b100100, which corresponds to a Data Abort from a lower
Exception level. A possible cause of this problem could be alignment errors.
xFSC=0x21
The binary code is 0b100001. This fault status code represents an
alignment fault.
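As a cross-check of the decoding above, here is a minimal Python sketch that extracts the same fields from the reported ESR_EL2 value. It assumes the standard ESR_ELx layout (EC in bits [31:26], the fault status code in ISS bits [5:0], WnR in ISS bit 6 for data aborts); the function name `decode_esr` is made up for illustration and is not a kernel or KVM API.

```python
# Sketch only: decode a few ESR_ELx fields relevant to the reported fault.
# Field positions follow the Arm architecture's ESR_ELx layout.
EC_SHIFT = 26      # Exception Class occupies bits [31:26]
EC_MASK = 0x3F
FSC_MASK = 0x3F    # fault status code occupies ISS bits [5:0]
WNR_BIT = 1 << 6   # ISS bit 6: set when the abort was caused by a write


def decode_esr(esr):
    """Return the EC, FSC and WnR fields of a data-abort syndrome value."""
    return {
        "ec": (esr >> EC_SHIFT) & EC_MASK,
        "fsc": esr & FSC_MASK,
        "wnr": bool(esr & WNR_BIT),
    }


fields = decode_esr(0x92000061)
# EC is 0x24 and FSC is 0x21, matching the values printed in the error message.
print({name: (hex(value) if name != "wnr" else value)
       for name, value in fields.items()})
```

For the value in the log, WnR is also set, which suggests the faulting access was a write rather than a read.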
Few solutions to this error can be found online. KylinOS engineers once
proposed a patch for this error, but it was rejected by the community.
Moreover, their modification was based on the old 4.x kernel version.
Link: [Link to the Patch](https://lore.kernel.org/lkml/20220401090828.614167-1-xieming@kylinos.cn/T/)
This patch suggests that it is unreasonable to set the I/O memory
attributes of the virtual machine to the Device-nGnRE type.
According to ARM's official whitepaper, "Understanding Write Combining on Arm",
Device-GRE is a type of relaxed-order memory that allows for gathering
operations, but it does not allow for read speculation and imposes strict
alignment constraints. You may refer to the table below or the following
link for more information.
[Link to ARM Community](https://community.arm.com/arm-research/m/resources/1012)
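To make the naming concrete, the following Python sketch expands a Device memory type name into its three attributes (G = Gathering, R = Reordering, E = Early Write Acknowledgement, with an "n" prefix meaning the property is not permitted) and checks whether an access is naturally aligned. The helper names `device_attrs` and `is_naturally_aligned` are hypothetical and only illustrate the terminology; they do not model KVM's stage-2 mappings or any particular GPU driver.

```python
# Sketch only: interpret Arm Device memory type names such as "Device-nGnRE".
def device_attrs(name):
    """Map a Device memory type name to its G/R/E permissions."""
    suffix = name.split("-", 1)[1]
    attrs = {}
    pos = 0
    for letter, meaning in (("G", "gathering"),
                            ("R", "reordering"),
                            ("E", "early_write_ack")):
        if suffix.startswith("n" + letter, pos):
            attrs[meaning] = False   # "nX": property not permitted
            pos += 2
        elif suffix.startswith(letter, pos):
            attrs[meaning] = True    # "X": property permitted
            pos += 1
        else:
            raise ValueError("unrecognised Device memory type: " + name)
    if pos != len(suffix):
        raise ValueError("unrecognised Device memory type: " + name)
    return attrs


def is_naturally_aligned(address, size):
    """True if an access of `size` bytes at `address` is size-aligned."""
    return address % size == 0


print(device_attrs("Device-nGnRE"))
print(device_attrs("Device-GRE"))
print(is_naturally_aligned(0x1004, 8))
```

An 8-byte access at an address such as 0x1004 is not naturally aligned, which is the kind of access that raises an alignment fault on memory mapped with a Device type.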
Preliminary Deduction:
The GPU might be using Device-GRE type memory but without proper alignment,
leading to the generation of this error.
Thanks.
Best regards,
Guodong Xu
Linaro