[Linaro-open-discussions] Re: Flush by PA range Designs. (was Re: Linaro-open-discussions Digest, Vol 38, Issue 14)

22 Oct 2024

      On Mon, 21 Oct 2024 16:19:41 +0100
Jonathan Cameron via Linaro-open-discussions linaro-open-discussions@op-lists.linaro.org wrote:
...
On Mon, 21 Oct 2024 13:53:31 +0100
James Morse james.morse@arm.com wrote:
...
Hi Jonathan,
On 21/10/2024 09:54, Jonathan Cameron via Linaro-open-discussions wrote:
...
I'm checking the availability of people for the MPAM topic.
In the meantime one other topic might be good to discuss briefly.
...
Flush by PA range - need for a mini subsystem...
This is needed for CXL and we've jumped through various possible ways to handle
it.

PSCI call (not currently in spec)
Doesn't work well as 'only' solution as some hardware provides direct
memory mapped IO interfaces for this.

I have the code for this - but have been unable to (re)test it. I really want to kill the
stop_machine() 'rendezvous' in the kernel cases, but equally the firmware people don't
want to do it in firmware!
I'm kind of assuming that anyone using this solution will soon learn it
is a bad idea. Want it to be 'possible' but maybe we wait for someone to show
up actually asking for it.
...
...

ACPI wrapping of both PSCI call and memory mapped.
Unfortunately you can wrap SCMI (via a PCC channel) but not (as I understand
it) SMCCC, so I don't think an ACPI wrapper in general works.

I thought the opposite - the FFH "Operation Region" allows SMC calls, and Linux supports
this via acpi_ffh_address_space_arch_handler(). It's restricted to a few SMC ranges - but
we should be able to do this in the 'standard' space as I bet a hypervisor is going to
have to emulate this one day in the future... (not touching CXL hardware directly - but
providing an encryption key to unlock a device)
Hmm. Seems I'll need to dig further as I'd missed that delight.
However, doesn't that leave us stuck if stop_machine() needs to run in kernel?
I'm hoping there isn't AML to do that!
Because I never trust anything to work until I have some code, I hacked
this into QEMU this morning and, subject to the small issue that the kernel
needs SMCCC 1.2 for this and QEMU does SMCCC 1.0, it works as expected.
(I bodged the version number and crossed my fingers ;)
Worth a quick discussion on whether the stop machine option is doable via
this path though.
Jonathan
...
...
I was thinking of a device that is all firmware in the DSDT that has _DSM to do this - and
a driver to poke it in Linux.
Ok. Sounds like this might be possible, but I'm not sure we want to be
in a world where we get to write more AML.
Anyhow, seems this one is back on the table, though maybe one part of a solution
not the whole thing.
...
...

Current option:  Face up to the fact we need to just do this kernel first with
drivers for the various hardware that surfaces.  We aren't particularly
far into a design yet so maybe email is fine for now if people are busy.

...
For this I'm thinking a really small subsystem / class.
Lots of design options but one might be

Driver registers with class and provides PA ranges for which it needs
to be notified (simpler option is it gets notified of everything)
Class is there for userspace to be able to see what is involved...
Notification chain follows similar design to mmu notifiers, just on
physical addresses and global rather than tied to each mm.
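
As a rough illustration of the notification chain described above, here is a
minimal userspace sketch in C. Every name in it (pa_flush_notifier,
pa_flush_register, pa_flush_range, demo_flusher) is hypothetical and not an
existing kernel API. It assumes inclusive PA ranges, a single global chain, and
no locking, and it hands each driver only the part of the request that overlaps
the window it registered for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not an existing kernel interface. */
struct pa_flush_notifier {
    uint64_t start;   /* first physical address of interest (inclusive) */
    uint64_t end;     /* last physical address of interest (inclusive) */
    int (*flush)(struct pa_flush_notifier *nb, uint64_t start, uint64_t end);
    struct pa_flush_notifier *next;
};

/* One global chain, rather than one per mm as with mmu notifiers. */
static struct pa_flush_notifier *pa_flush_chain;

void pa_flush_register(struct pa_flush_notifier *nb)
{
    assert(nb->flush != NULL && nb->start <= nb->end);
    nb->next = pa_flush_chain;
    pa_flush_chain = nb;
}

/*
 * Call every registered driver whose window overlaps [start, end], passing
 * only the overlapping part. Returns the number of drivers notified, or the
 * first negative error returned by a driver.
 */
int pa_flush_range(uint64_t start, uint64_t end)
{
    int notified = 0;

    for (struct pa_flush_notifier *nb = pa_flush_chain; nb; nb = nb->next) {
        if (nb->end < start || nb->start > end)
            continue;   /* no overlap with this driver's window */

        uint64_t s = start > nb->start ? start : nb->start;
        uint64_t e = end < nb->end ? end : nb->end;
        int ret = nb->flush(nb, s, e);

        if (ret < 0)
            return ret;
        notified++;
    }
    return notified;
}

/* Toy driver that only records what it was asked to flush. */
struct demo_flusher {
    struct pa_flush_notifier nb;   /* must stay first for the cast below */
    uint64_t last_start, last_end;
    int calls;
};

int demo_flush(struct pa_flush_notifier *nb, uint64_t start, uint64_t end)
{
    struct demo_flusher *d = (struct demo_flusher *)nb;

    d->last_start = start;
    d->last_end = end;
    d->calls++;
    return 0;
}
```

The "simpler option" of notifying everyone of everything would correspond to
each driver registering the full address space as its window.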

Requirements for design:

Multiple entities may need to be told to flush for a given range
(interleaving etc may be going on).
Flush should be range. Any hardware that doesn't support range will
need to flush everything.

This needs advertising to the OS to avoid repeatedly flushing everything!
Agreed, it's details I'd expect to provide via the class - effectively
a performance hint to let the ultimate software triggering this know
how painful it might be.
Though flush all is what x86 does so people will be used to it ;)
...
...

Support rich set of flushes. Drivers may have to upgrade the lighter
ones to what they support (cache coherency protocols all allow
this so should be technically fine to do rather than reject the
flush)

Is this just clean/clean+invalidate, or is there another option?
In most general sense invalidate without clean.  Whether it is useful
is a whole different question.  Just single clean+invalidate should
work for functionality I think. The others are performance optimizations.
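
To make the "upgrade rather than reject" idea concrete, here is a small C
sketch. The enum and function names are hypothetical. It follows the
assumption stated in this thread that clean+invalidate is an acceptable
stand-in for any lighter operation the hardware lacks; whether that holds for
a particular caller is not something this sketch checks.

```c
#include <assert.h>

/* Hypothetical operation bits a driver could advertise as supported. */
enum pa_flush_op {
    PA_FLUSH_CLEAN            = 1 << 0,  /* write back dirty lines */
    PA_FLUSH_INVALIDATE       = 1 << 1,  /* drop lines without write back */
    PA_FLUSH_CLEAN_INVALIDATE = 1 << 2,  /* write back, then drop */
};

/*
 * Pick the operation to issue for a request, given the mask of operations
 * the hardware supports. Uses the requested operation when available,
 * otherwise upgrades to clean+invalidate. Returns -1 if neither is possible.
 */
int pa_flush_pick_op(enum pa_flush_op requested, unsigned int supported)
{
    assert(requested == PA_FLUSH_CLEAN ||
           requested == PA_FLUSH_INVALIDATE ||
           requested == PA_FLUSH_CLEAN_INVALIDATE);

    if (supported & requested)
        return requested;
    if (supported & PA_FLUSH_CLEAN_INVALIDATE)
        return PA_FLUSH_CLEAN_INVALIDATE;
    return -1;
}
```

A driver whose hardware only does clean+invalidate would advertise just that
bit, and every request would be upgraded to it.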
...
...
Other things I've forgotten?
Asynchronous completion? We may want to do some other work while the hardware does its thing.
Maybe.  Initially synchronous should cover the use cases
I know about. However we should indeed keep it in mind.
...
...
In particular I'd like any insights people have on plausible hardware
designs that this approach might not work for.
When thrashing out the PSCI spec proposal, the "don't build that" corner case we had was
"system caches" after the PoC that lack a mechanism to clean+invalidate while running, and
can't be power-cycled to reset them if they share a power-domain with something important.
Even by-VA clean+invalidate would be insufficient on such a platform...
(Is it plausible? I hope not ...)
If they build that then they don't get CXL hotplug. Maybe some guidance
notes on silly things not to do would be good if anyone fancies writing them.
...
...
A sticky corner is how to know that all drivers for architecturally necessary
flushes are present.  Easy to check if there is one covering the range, not
so easy to check if one instance of an interleaved solution is missing.
We can probably solve this at the individual driver level, but it's ugly.
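
The "easy" half of that check, whether the registered windows together cover a
requested range, might look like the C sketch below. The struct and function
names are hypothetical and the ranges are assumed inclusive. It says nothing
about the hard half: a window can be covered by one driver while another
instance of an interleaved set is still missing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one driver's registered window (inclusive). */
struct pa_window {
    uint64_t start;
    uint64_t end;
};

/*
 * Return true if every address in [start, end] falls inside at least one of
 * the n windows. Windows may overlap and need not be sorted.
 */
bool pa_range_covered(const struct pa_window *w, size_t n,
                      uint64_t start, uint64_t end)
{
    uint64_t cursor = start;

    assert(start <= end);

    for (;;) {
        bool found = false;
        uint64_t reach = 0;

        /* Find how far the windows containing 'cursor' extend. */
        for (size_t i = 0; i < n; i++) {
            if (w[i].start <= cursor && cursor <= w[i].end) {
                if (!found || w[i].end > reach)
                    reach = w[i].end;
                found = true;
            }
        }

        if (!found)
            return false;       /* hole at 'cursor' */
        if (reach >= end)
            return true;        /* covered through the end of the request */
        cursor = reach + 1;     /* cannot overflow since reach < end */
    }
}
```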
Wouldn't this 'just' need to cover the CXL.mem decoders' address windows?
If you mean the area we need to flush - yes, an HDM decoder or a DCD extent
(typically much smaller).  I meant the other way around.  It is easy to
bound what to flush but not in general to know which hardware (and
how much hardware) you need to poke to do it.
To avoid the races that need the stop_machine() you need to be sure a cacheline
won't move around between these entities that are responsible for ensuring
a global flush of the line happens, but different cachelines in the range
you care about can be handled by different entities. It might be doable
to wrap all that up in careful device definition by pretending everyone involved
is one device even if they are highly distributed in the system. That way,
discovering that everyone has shown up is pushed to individual drivers
and made an impdef problem.
Jonathan
...
Thanks,
James
