On Mon, 21 Oct 2024 16:19:41 +0100 Jonathan Cameron via Linaro-open-discussions linaro-open-discussions@op-lists.linaro.org wrote:
On Mon, 21 Oct 2024 13:53:31 +0100 James Morse james.morse@arm.com wrote:
Hi Jonathan,
On 21/10/2024 09:54, Jonathan Cameron via Linaro-open-discussions wrote:
I'm checking the availability of people for the MPAM topic.
In the meantime one other topic might be good to discuss briefly.
Flush by PA range - need for a mini subsystem...
This is needed for CXL and we've jumped through various possible ways to handle it.
- PSCI call (not currently in spec) Doesn't work well as 'only' solution as some hardware provides direct memory mapped IO interfaces for this.
I have the code for this - but have been unable to (re)test it. I really want to kill the stop_machine() 'rendezvous' in the kernel cases, but equally the firmware people don't want to do it in firmware!
I'm kind of assuming that anyone using this solution will soon learn it is a bad idea. Want it to be 'possible' but maybe we wait for someone to show up actually asking for it.
- ACPI wrapping of both PSCI call and memory mapped. Unfortunately you can wrap SCMI (via a PCC channel) but not (as I understand it) SMCCC, so I don't think an ACPI wrapper in general works.
I thought the opposite - the FFH "Operation Region" allows SMC calls, and Linux supports this via acpi_ffh_address_space_arch_handler(). It's restricted to a few SMC ranges - but we should be able to do this in the 'standard' space as I bet a hypervisor is going to have to emulate this one day in the future... (not touching CXL hardware directly - but providing an encryption key to unlock a device)
Hmm. Seems I'll need to dig further as I'd missed that delight. However, doesn't that leave us stuck if stop_machine() needs to run in the kernel? I'm hoping there isn't AML to do that!
Because I never trust anything to work until I have some code, I hacked this into QEMU this morning and, subject to the small issue that the kernel needs SMCCC 1.2 for this and QEMU does SMCCC 1.0, it works as expected. (I bodged the version number and crossed my fingers ;)
Worth a quick discussion on whether the stop machine option is doable via this path though.
Jonathan
I was thinking of a device that is all firmware in the DSDT that has _DSM to do this - and a driver to poke it in linux.
Ok. Sounds like this might be possible, but I'm not sure we want to be in a world where we get to write more AML.
Anyhow, seems this one is back on the table, though maybe one part of a solution not the whole thing.
- Current option: Face up to the fact we need to just do this kernel first with drivers for the various hardware that surfaces. We aren't particularly far into a design yet so maybe email is fine for now if people are busy.
For this I'm thinking a really small subsystem / class. Lots of design options but one might be
- Driver registers with class and provides PA ranges for which it needs to be notified (simpler option is it gets notified of everything) Class is there for userspace to be able to see what is involved...
- Notification chain follows similar design to mmu notifiers, just on physical addresses and global rather than tied to each mm.
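To make the two design points above concrete, here is a minimal userspace sketch of a global notification chain keyed on physical address ranges. Every name in it (pa_flush_client, pa_flush_register(), pa_flush_range()) is hypothetical and not an existing kernel interface, and it leaves out locking, unregistration and the class device entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout: none of this is an existing kernel API. */
struct pa_flush_client {
	uint64_t start;	/* first physical address of interest */
	uint64_t end;	/* one past the last address (exclusive) */
	/* Called with the part of the request that overlaps [start, end). */
	int (*flush)(struct pa_flush_client *c, uint64_t start, uint64_t end);
	struct pa_flush_client *next;
	unsigned int calls;	/* bookkeeping for this example only */
};

/* One global chain, not one per mm. */
static struct pa_flush_client *pa_flush_chain;

static void pa_flush_register(struct pa_flush_client *c)
{
	assert(c && c->flush && c->start < c->end);
	c->next = pa_flush_chain;
	pa_flush_chain = c;
}

/*
 * Notify every registered client whose range overlaps [start, end).
 * Returns the number of clients notified, or the first negative error.
 */
static int pa_flush_range(uint64_t start, uint64_t end)
{
	struct pa_flush_client *c;
	int notified = 0;

	for (c = pa_flush_chain; c; c = c->next) {
		uint64_t lo = start > c->start ? start : c->start;
		uint64_t hi = end < c->end ? end : c->end;
		int ret;

		if (lo >= hi)
			continue;	/* no overlap with this client */
		ret = c->flush(c, lo, hi);
		if (ret < 0)
			return ret;
		notified++;
	}
	return notified;
}

/* Stand-in for a real driver callback: just counts invocations. */
static int pa_flush_count_only(struct pa_flush_client *c,
			       uint64_t start, uint64_t end)
{
	(void)start;
	(void)end;
	c->calls++;
	return 0;
}
```

A return of 0 from pa_flush_range() means nobody claimed the range, which the caller would presumably need to treat as a failure rather than success.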
Requirements for design:
- Multiple entities may need to be told to flush for a given range (interleaving etc may be going on).
- Flush should be range. Any hardware that doesn't support range will need to flush everything.
This needs advertising to the OS to avoid repeatedly flushing everything!
Agreed, it's details I'd expect to provide via the class - effectively a performance hint to let the ultimate software triggering this know how painful it might be.
Though flush all is what x86 does so people will be used to it ;)
- Support rich set of flushes. Drivers may have to upgrade the lighter ones to what they support (cache coherency protocols all allow this so should be technically fine to do rather than reject the flush)
Is this just clean/clean+invalidate, or is there another option?
In most general sense invalidate without clean. Whether it is useful is a whole different question. Just single clean+invalidate should work for functionality I think. The others are performance optimizations.
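As a rough illustration of the "upgrade the lighter ones" idea, the sketch below picks the operation a driver would actually issue for a request. The enum and pa_flush_pick_op() are made-up names for this example, and it assumes clean+invalidate is an acceptable substitute for either lighter operation, as suggested above.

```c
/* Hypothetical operation flags; a driver advertises a mask of these. */
enum pa_flush_op {
	PA_FLUSH_CLEAN       = 1 << 0,	/* write back dirty lines */
	PA_FLUSH_INVAL       = 1 << 1,	/* invalidate without clean */
	PA_FLUSH_CLEAN_INVAL = 1 << 2,	/* write back, then invalidate */
};

/*
 * Choose what to issue for a single requested operation.
 * Returns the requested op if supported, otherwise upgrades to
 * clean+invalidate, otherwise -1 if the request cannot be met.
 */
static int pa_flush_pick_op(unsigned int supported, enum pa_flush_op requested)
{
	if (supported & (unsigned int)requested)
		return requested;
	if (supported & PA_FLUSH_CLEAN_INVAL)
		return PA_FLUSH_CLEAN_INVAL;
	return -1;
}
```

A driver that only implements clean+invalidate would therefore accept every request, which matches the point that the lighter operations are performance optimizations.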
Other things I've forgotten?
Asynchronous completion? We may want to do some other work while the hardware does its thing.
Maybe. Initially synchronous should cover the use cases I know about. However we should indeed keep it in mind.
In particular I'd like any insights people have on plausible hardware designs that this approach might not work for.
When thrashing out the PSCI spec proposal, the "don't build that" corner case we had was "system caches" after the PoC that lack a mechanism to clean+invalidate while running, and can't be power-cycled to reset them if they share a power-domain with something important. Even by-VA clean+invalidate would be insufficient on such a platform...
(Is it plausible? I hope not ...)
If they build that then they don't get CXL hotplug. Maybe some guidance notes on silly things not to do would be good if anyone fancies writing them.
A sticky corner is how to know that all drivers for architecturally necessary flushes are present. Easy to check if there is one covering the range, not so easy to check if one instance of an interleaved solution is missing. We can probably solve this at the individual driver level, but it's ugly.
Wouldn't this 'just' need to cover the CXL.Mem decoders address windows?
If you mean the area we need to flush - yes, an HDM decoder or a DCD extent (typically much smaller). I meant the other way around. It is easy to bound what to flush but not in general to know which hardware (and how much hardware) you need to poke to do it.
To avoid the races that need the stop_machine() you need to be sure a cacheline won't move around between these entities that are responsible for ensuring a global flush of the line happens, but different cachelines in the range you care about can be handled by different entities. It might be doable to wrap all that up in careful device definition by pretending everyone involved is one device even if they are highly distributed in the system. That way discovering that everyone has shown up is pushed to individual drivers and made an impdef problem.
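One way the "pretend everyone is one device" idea might look from the driver side is sketched below: the logical device only reports ready once every hardware agent it was told to expect has shown up. The structure and function names are hypothetical, and how the expected count is discovered is exactly the impdef part this does not attempt to solve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: one logical flush device backed by several agents. */
struct pa_flush_group {
	unsigned int expected;	/* agents the platform description promises */
	uint32_t present;	/* bit n is set once agent n has probed */
};

static void pa_flush_group_agent_arrived(struct pa_flush_group *g,
					 unsigned int id)
{
	assert(g->expected <= 32 && id < g->expected);
	g->present |= UINT32_C(1) << id;
}

/* Only a complete group may be registered as a flush provider. */
static bool pa_flush_group_ready(const struct pa_flush_group *g)
{
	uint32_t all = g->expected == 32 ?
		UINT32_MAX : (UINT32_C(1) << g->expected) - 1;

	return g->expected != 0 && g->present == all;
}
```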
Jonathan
Thanks,
James