Hi Zhangfei,
On Fri, Mar 26, 2021 at 12:35:49PM +0000, Zhangfei Gao via Linaro-open-discussions wrote:
Hi,
I am looking for some suggestions about DVM [1].
We are testing SVA with OpenSSL. DVM is enabled by default to broadcast TLB maintenance, and THP is also enabled in the system by default, so the khugepaged daemon (mm/khugepaged.c) periodically collapses memory into huge pages.
We found that THP's khugepaged may cause problems for the SVA test cases: https://github.com/Linaro/uadk/issues/215
Two cases:
- Heavyweight test case, async mode, 36+ jobs:
  openssl speed -elapsed -engine uadk -async_jobs 36 rsa2048
  Once collapse_huge_page happens, the hardware may hang on I/O page faults: huge numbers of page faults keep occurring, whereas usually only a few I/O page faults are reported.
- High THP scan frequency, lightweight test case, sync mode, 1 job:
  sudo openssl speed -engine uadk -seconds 1 rsa2048
  The data may be incorrect:
  Doing 2048 bits public rsa's for 1s: RSA verify failure
Two workarounds:
- Disable THP:
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
- Keep THP enabled, but add an explicit TLBI instead of relying on DVM:
  Add a call to arm_smmu_tlb_inv_range() in arm_smmu_mm_invalidate_range() (arm-smmu-v3-sva.c), which is called by khugepaged via collapse_huge_page() -> mmu_notifier_invalidate_range_end(). A rough sketch is shown below.
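For illustration, the change might look roughly like the sketch below. The mn_to_smmu() helper, the smmu_mn->domain field and the arm_smmu_tlb_inv_range() signature are assumptions about the driver internals of that period, not the actual patch:

    static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn,
                                             struct mm_struct *mm,
                                             unsigned long start,
                                             unsigned long end)
    {
            struct arm_smmu_mmu_notifier *smmu_mn = mn_to_smmu(mn);

            /*
             * Workaround 2: push the TLB invalidation through the SMMU
             * command queue instead of relying on broadcast DVM.
             */
            arm_smmu_tlb_inv_range(start, end - start, PAGE_SIZE, false,
                                   smmu_mn->domain);

            /* ... keep whatever invalidation the callback already does ... */
    }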
It looks like DVM is not taking effect in some corner cases.
Questions:
- If khugepaged collapses memory used by the device and then invalidates the TLB, can DVM synchronize this TLB change to the SMMU?
- Is it possible that DMA is actively using memory at the moment khugepaged collapses it? Can DVM handle this case, or should khugepaged avoid touching memory that is in use by a device? It looks like khugepaged cannot tell the difference.
I think the huge page is just one symptom, and may not be the only way to trigger the issue. On khugepaged collapse, a large invalidation is issued, which arch/arm64/include/asm/tlbflush.h turns into a TLBI by ASID:
    if ((!system_supports_tlb_range() &&
         (end - start) >= (MAX_TLBI_OPS * stride)) ||
        pages >= MAX_TLBI_RANGE_PAGES) {
            flush_tlb_mm(vma->vm_mm);
            return;
    }
With MAX_TLBI_OPS = 512 and stride = 0x1000 we hit the size limit of 2M, which is the size of a huge page. My guess is that on khugepaged collapse we issue a TLBI ASIDE1IS (rather than a TLBI VALE1IS), which somehow isn't taken into account by the SMMU, and we end up with stale TLB entries leading to memory corruption. If that's the case, I'd suggest keeping DVM disabled on this platform (workaround 2), to force all SVA invalidations to go through the command queue.
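To make the threshold concrete, with the numbers quoted above:

    MAX_TLBI_OPS * stride = 512 * 0x1000 = 0x200000 = 2 MiB

so any range invalidation covering 2 MiB or more, such as a collapsed huge page, falls back to flush_tlb_mm(), i.e. an invalidation by ASID rather than by VA.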
Thanks, Jean
On 2021/3/26 at 9:36 PM, Jean-Philippe Brucker wrote:
We have found the reason: hpre is connected to another SMMU, whose DVM is not enabled by the BIOS :(.

    sudo busybox devmem 0x2001c0030 32
    0x1 // wrong
    0x9 // correct

With the updated BIOS, the stress test passed over the weekend: 1000 runs of the async test and 50,000 runs of the sync test. We will update the BIOS of the openlab board as well.
By the way, one uncertainty remains: is it possible for THP to collapse memory that is currently being used by DMA? The spin lock has no effect on the device. Though in the stress test we did not see such an issue.
Thanks
On Mon, Mar 29, 2021 at 10:15:14AM +0800, Zhangfei Gao wrote:
We have found the reason: hpre is connected to another SMMU, whose DVM is not enabled by the BIOS :(.
Oh OK, that's good news; it explains why I couldn't reproduce it with the zip engine.
    sudo busybox devmem 0x2001c0030 32
    0x1 // wrong
    0x9 // correct

With the updated BIOS, the stress test passed over the weekend: 1000 runs of the async test and 50,000 runs of the sync test. We will update the BIOS of the openlab board as well.
By the way, one uncertainty remains: is it possible for THP to collapse memory that is currently being used by DMA?
I'm not sure I understand the question. You can have DMA trigger THP collapse, but the collapse is done by a separate thread, khugepaged, which regularly scans memory looking to transform contiguous small-page mappings into huge pages:
1. Create a virtually-contiguous 2MB region of 4k pages:
    for (page = 0; page < SZ_2M; page += SZ_4K) {
            mmap(base + page, SZ_4K, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, 0, 0);
            /* Allocate half of the 4k pages now */
            if (page % SZ_8K)
                    *(char *)(base + page) = 1;
    }
2. Issue DMA for the other pages in the range, causing IOPF to allocate the remaining pages.
3. If khugepaged scans that region (reduce scan_sleep_millisecs to increase the chances of that), it will collapse it into a huge page. I check /proc/self/smaps before and after to see whether an address range has AnonHugePages or not.
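For example, something along these lines (the sysfs knob is the standard khugepaged tunable, and <pid> stands for the test process):

    # Make khugepaged scan more frequently (default interval is 10000 ms)
    echo 100 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

    # Before and after, check whether the range was collapsed
    grep AnonHugePages /proc/<pid>/smaps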
You can also cause huge page allocation from IOPF, by allocating a 2MB mapping and initializing it using DMA.
the spin lock has no effect on the device.
Which spinlock?
Thanks, Jean