aarch64: KVM + GPU: hypervisor encountered an error while trying to access memory - Linaro-open-discussions - op-lists.linaro.org

9 Oct 2023


Hi, all
May I, representing our members, bring forth an issue for discussion?
The impact of this issue is significant: until it is resolved, the scenario
where a GPU is passed through to a virtual machine cannot be used. It
exists on Arm only; x86 is not affected. (I know workarounds exist, but we
want to fix this in the mainline.)
After reading through so many community emails (see below), I believe it’s
unlikely that a single patch can quickly gain everyone's support. A broad
discussion involving the ARM ecosystem and the KVM community is essential.
Only with consensus can a submitted patch receive sufficient support,
eventually resolving this issue in the mainline version.
History:
- This issue was first reported publicly on 2021-07-01; see
  https://gitee.com/openeuler/kernel/issues/I3YRDP?from=project-issue
- A patch authored by kylinos.com was submitted to the kernel mailing list
  on 2022-04-01, but it was rejected. No follow-up was found after that.
  https://lore.kernel.org/lkml/20220401090828.614167-1-xieming@kylinos.cn/T/
- Another discussion, authored by nvidia.com, went on in this email chain
  on 2022-05-09, but reached no conclusion.
  https://lore.kernel.org/all/20210429162906.32742-1-sdonthineni@nvidia.com/
- As of this writing (2023-10-09), the issue can still be reproduced in
  kernel 6.1.x with Nvidia / AMD GPUs on Arm.
Problem Description:
A GPU is passed through to a virtual machine via a PCIe node. When
installing the GPU driver within a virtual machine that runs on an openEuler
22.03 LTS SP2 (aarch64) system (based on Linux kernel 5.10), the following
error occurs:
“Unsupported FSC: EC=0x24 xFSC=0x21 ESR_EL2=0x92000061”
PS: the same issue can also be reproduced in kernel 6.1.x with Nvidia / AMD
GPUs.
Upon consulting the official ARM documentation, the meanings of some of the
error codes are as follows:
EC=0x24
The binary code is 0b100100, which is the exception class for a Data Abort
from a lower Exception level (here, an abort taken from the guest to EL2).
An alignment fault is one possible cause of such an abort.
xFSC=0x21
The binary code is 0b100001. This fault status code represents an
alignment fault.
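For reference, the reported ESR_EL2 value can be unpacked by hand. Below is a minimal Python sketch, assuming the standard ESR_ELx layout for Data Aborts (EC in bits [31:26], IL in bit 25, ISV in bit 24, WnR in bit 6, fault status code in bits [5:0]). The function name and the small lookup tables are my own, for illustration only, and cover just the codes seen in this report.

```python
# Minimal ESR_EL2 decoder for the Data Abort case discussed above.
# Bit positions follow the ESR_ELx layout for Data Aborts; the helper
# name and the lookup tables are illustrative, not taken from the kernel.

EC_NAMES = {
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
}

FSC_NAMES = {
    0x21: "Alignment fault",
}


def decode_esr(esr):
    """Split a 32-bit ESR value into the fields relevant to a Data Abort."""
    ec = (esr >> 26) & 0x3F      # Exception Class, bits [31:26]
    il = (esr >> 25) & 0x1       # Instruction Length, bit 25
    iss = esr & 0x1FFFFFF        # Instruction Specific Syndrome, bits [24:0]
    fsc = iss & 0x3F             # fault status code, bits [5:0]
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "not in this sketch's table"),
        "il": il,
        "isv": (iss >> 24) & 0x1,  # 1 if the access syndrome fields are valid
        "wnr": (iss >> 6) & 0x1,   # Write-not-Read: 1 for a write access
        "fsc": fsc,
        "fsc_name": FSC_NAMES.get(fsc, "not in this sketch's table"),
    }


if __name__ == "__main__":
    for key, value in decode_esr(0x92000061).items():
        print(f"{key}: {value}")
```

For 0x92000061 this gives EC=0x24 and FSC=0x21, matching the log line, and additionally shows WnR=1 (the faulting access was a write) and ISV=0 (no valid access syndrome was reported).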
There are few published solutions to this error. KylinOS engineers once
proposed a patch for it, but it was rejected by the community.
Moreover, their modification was based on an old 4.x kernel version.
Link: [Link to the Patch](
https://lore.kernel.org/lkml/20220401090828.614167-1-xieming@kylinos.cn/T/)
This patch suggests that it is unreasonable to set the I/O memory
attributes of the virtual machine to Device_nGnRE type.
According to ARM's official whitepaper, "Understanding Write Combining on
Arm", Device-GRE is a relaxed-order memory type that allows gathering
operations, but it does not allow read speculation and imposes strict
alignment constraints. You may refer to the table below or the following
link for more information.
[Link to ARM Community](
https://community.arm.com/arm-research/m/resources/1012 )
Preliminary Deduction:
The GPU might be using Device-GRE type memory but without proper alignment,
leading to the generation of this error.
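To make "proper alignment" concrete: accesses to Device memory types must be aligned to the size of the access, otherwise an alignment fault is raised. A minimal Python sketch of that check follows. The helper name and the example addresses are hypothetical, chosen only for illustration; they are not taken from any GPU driver.

```python
# Natural-alignment check: an access of `size` bytes is naturally aligned
# when its address is a multiple of `size`. The helper name and the sample
# addresses are hypothetical.

def is_naturally_aligned(address, size):
    """Return True if an access of `size` bytes at `address` is aligned."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a positive power of two")
    return address % size == 0


if __name__ == "__main__":
    # An 8-byte store at a 4-byte-aligned address would fault on Device
    # memory, while a 4-byte store at the same address would not.
    print(is_naturally_aligned(0x1004, 8))
    print(is_naturally_aligned(0x1004, 4))
```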
Thanks.
Best regards,
Guodong Xu
Linaro