Hi all,
May I, on behalf of our members, raise an issue for discussion?
The impact of this issue is significant: until it is resolved, passing a GPU through to a virtual machine is unusable. It exists on Arm only; x86 works. (I know workarounds exist, but we want a fix in mainline.)
After reading through the many community emails (see below), I believe it is unlikely that a single patch will quickly gain everyone's support. A broad discussion involving the Arm ecosystem and the KVM community is essential. Only with consensus can a submitted patch gather enough support to resolve this issue in mainline.
History:
- This issue was first reported publicly on 2021-07-01; see this [URL](https://gitee.com/openeuler/kernel/issues/I3YRDP?from=project-issue).
- A patch authored by kylinos.cn was submitted to the kernel mailing list on 2022-04-01, but it was rejected. No follow-up was found after that. [Link to the Patch](https://lore.kernel.org/lkml/20220401090828.614167-1-xieming@kylinos.cn/T/)
- Another discussion, authored by nvidia.com, went on in this email chain on 2022-05-09, but it reached no conclusion: https://lore.kernel.org/all/20210429162906.32742-1-sdonthineni@nvidia.com/
- As of this writing (2023-10-09), the issue can still be reproduced on kernel 6.1.x with Nvidia / AMD GPUs on Arm.
Problem Description:
A GPU is passed through to a virtual machine via a PCIe node. When installing the GPU driver inside a virtual machine running openEuler 22.03 LTS SP2 (aarch64, based on Linux kernel 5.10), the following error occurs:
“Unsupported FSC: EC=0x24 xFSC=0x21 ESR_EL2=0x92000061”
PS: the same issue can also be reproduced in kernel 6.1.x with Nvidia / AMD GPUs.
According to the official Arm documentation, the relevant fields of the error decode as follows:
EC=0x24
In binary this is 0b100100, which corresponds to a Data Abort from a lower Exception level. A possible cause is an alignment error.
xFSC=0x21
In binary this is 0b100001, which is the fault status code for an alignment fault.
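As a cross-check of the decoding above, here is a minimal sketch that splits the ESR_EL2 value from the log into its fields. The bit positions follow the ESR_ELx layout in the Arm architecture documentation; the function name `decode_esr` and the dictionary keys are only illustrative and do not come from the kernel.

```python
# Minimal sketch: split an ESR_ELx value into the fields discussed above.
# decode_esr and its keys are hypothetical names used for illustration only.

def decode_esr(esr: int) -> dict:
    iss = esr & 0x1FFFFFF            # ISS: bits [24:0]
    return {
        "ec": (esr >> 26) & 0x3F,    # Exception Class: bits [31:26]
        "il": (esr >> 25) & 0x1,     # Instruction Length: bit 25
        "isv": (iss >> 24) & 0x1,    # Instruction Syndrome Valid: ISS bit 24
        "wnr": (iss >> 6) & 0x1,     # Write-not-Read: ISS bit 6
        "dfsc": iss & 0x3F,          # Data Fault Status Code: ISS bits [5:0]
    }

fields = decode_esr(0x92000061)
print({name: hex(value) for name, value in fields.items()})
```

For the value in the log this gives EC=0x24 and a fault status code of 0x21, matching the message, and WnR=1, which suggests the faulting access was a write.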
Few solutions to this error can be found online. KylinOS engineers once proposed a patch for it, but the community rejected it. Moreover, their change was based on the old 4.x kernel.
Link: [Link to the Patch](https://lore.kernel.org/lkml/20220401090828.614167-1-xieming@kylinos.cn/T/)
This patch argues that it is unreasonable to set the I/O memory attributes of the virtual machine to the Device-nGnRE type.
According to Arm's official whitepaper, Understanding Write Combining on Arm, Device-GRE is a relaxed-order memory type that allows gathering, but it does not allow read speculation and imposes strict alignment constraints. You may refer to the table below or the following link for more information.
[Link to ARM Community](https://community.arm.com/arm-research/m/resources/1012)
Preliminary Deduction:
The GPU might be using Device-GRE memory without properly aligned accesses, which triggers this error.
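To make the deduction concrete, here is a toy model of the alignment rule, not a description of what the hardware or KVM actually executes. It assumes that any access to a Device memory type must be naturally aligned, and that Normal memory tolerates unaligned ordinary loads and stores when alignment checking is disabled (SCTLR_ELx.A == 0). The function names are hypothetical.

```python
# Toy model of the alignment rule; the names here are illustrative only.
DEVICE_TYPES = {"Device-nGnRnE", "Device-nGnRE", "Device-nGRE", "Device-GRE"}

def is_naturally_aligned(address: int, size: int) -> bool:
    # An access is naturally aligned when its address is a multiple of its size.
    return address % size == 0

def raises_alignment_fault(memory_type: str, address: int, size: int) -> bool:
    if memory_type in DEVICE_TYPES:
        # Device memory: an unaligned access generates an alignment fault.
        return not is_naturally_aligned(address, size)
    # Assumption: Normal memory (for example Normal-NC), alignment checking
    # disabled (SCTLR_ELx.A == 0), ordinary load/store instruction.
    return False

# An 8-byte access at an address that is only 4-byte aligned.
print(raises_alignment_fault("Device-nGnRE", 0x1004, 8))
print(raises_alignment_fault("Normal-NC", 0x1004, 8))
```

Under these assumptions the same unaligned access faults on a Device mapping but not on a Normal-NC mapping.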
Thanks.
Best regards,
Guodong Xu
Linaro
Hi Guodong,
Just to make sure: won't the patch series below address the problem you mentioned here?
https://yhbt.net/lore/all/20230907181459.18145-2-ankita@nvidia.com/T/#m70641...
Unless I am mistaken, it looks like patch #2 provides a way to set the I/O memory to NORMAL_NC, which can support unaligned access.
I think the above series is a follow-up to the discussions here:
https://lore.kernel.org/lkml/ZLRvf1M3gk4jjPp0@nvidia.com/T/#m6ccabf232f6f387...
Sorry if this is not the issue you are trying to address.
Thanks, Shameer
-----Original Message-----
From: Guodong Xu via Linaro-open-discussions [mailto:linaro-open-discussions@op-lists.linaro.org]
Sent: 09 October 2023 11:06
To: linaro-open-discussions linaro-open-discussions@op-lists.linaro.org
Cc: Joyce Qi joyce.qi@linaro.org
Subject: [Linaro-open-discussions] aarch64: KVM + GPU: hypervisor encountered an error while trying to access memory
Linaro-open-discussions mailing list -- linaro-open-discussions@op-lists.linaro.org https://collaborate.linaro.org/display/LOD/Linaro+Open+Discussions+Home
On Mon, 9 Oct 2023 at 13:11, Shameerali Kolothum Thodi via Linaro-open-discussions linaro-open-discussions@op-lists.linaro.org wrote:
> Sorry if this is not the issue you are trying to address.
I think it should be, and I also think it would be useful to comment on that thread with bug reports to make sure it gets the attention it deserves.
Lorenzo