[Linaro-open-discussions] ACPI Generic Initiator support status and potential additions

Jonathan Cameron Jonathan.Cameron at Huawei.com
Fri Nov 13 11:16:32 UTC 2020


On Thu, 12 Nov 2020 16:36:01 +0000
Lorenzo Pieralisi <lorenzo.pieralisi at arm.com> wrote:

> On Wed, Nov 04, 2020 at 01:23:43PM +0000, Jonathan Cameron via Linaro-open-discussions wrote:
> > Hi All,
> > 
> > This one made it onto the list of topics to discuss (now marked as no need
> > to discuss). I've been meaning to give a status update by email including
> > what is outstanding here. Please let me know if this fails to cover
> > some aspect of interest.
> > 
> > Background:
> > 
> > https://github.com/hisilicon/acpi-numa-whitepaper/releases/tag/v0.93 chapter 3.
> > 
> > Generic initiators are a concept in ACPI 6.3 (sec 5.2.16.6) to plug a hole
> > in the definition of proximity domains.
> > 
> > Proximity domains in ACPI (NUMA nodes in kernel) are defined by entries in SRAT
> > table.  There are a whole range of different types of SRAT entry but before
> > ACPI 6.3 this more or less in practice meant that a proximity domain only
> > existed if it contained either (or both) memory and CPUs.   Other initiators
> > of memory transactions such as network cards can be assigned to an existing
> > proximity domain via _PXM in ACPI DSDT. This restricted them to sharing a domain
> > with either memory or processors.
> > 
> > That doesn't always reflect system architecture, particularly with the addition
> > of richer descriptions of access characteristics (latency / bandwidth) brought
> > in by HMAT.  Hence Generic Initiator domains to allow you to specify a
> > proximity domain with some other type of device in it (such as a network card)
> > and get all of the descriptive capability available for CPU / memory nodes.
> > 
> > Note that this was brought in prior to CXL becoming public but 1.1 CXL spec
> > states that initiators on CXL should be described using Generic Initiator nodes.
> > This should accelerate the number of users of this feature considerably.
> > It is also useful in some existing systems.
> > 
> > What support was needed in kernel:
> > 
> > 1) Parsing of the SRAT Generic Initiator Affinity Structure
> > 2) Instantiating the NUMA nodes that map to the GI PXM nodes to ensure stuff
> >    like fallback lists for memory allocation work as normal.
> > 3) Richer use of HMAT access characteristics to differentiate nearest CPU
> >    to memory from nearest initiator to memory.
> > 4) PXM assignment from the SRAT record rather than _PXM (not yet done).
> > 5) PCI PXM assignment (not yet done)
> > 
> > Status:
> > 
> > The kernel patches sat on the list (with rebases) for well over a year
> > failing to get the architecture review needed (as there was significant
> > risk of breakage in both ARM64 and x86). It was to break this blockage
> > that we were interested in an open discussion on this. However, they did
> > recently get x86 review this and Rafael queued them for 5.10 (now merged)
> > 
> > The PCI PXM issue has been long standing due to some buggy firmware
> > on certain X86 boards and the need for a clarification in the ACPI spec
> > (added in 6.3).  To make this safe, needed to ensure that NUMA nodes on
> > ACPI systems can only be instantiated during the main parse of SRAT.
> > 
> > https://lore.kernel.org/linux-mm/20200818142430.1156547-1-Jonathan.Cameron@huawei.com/
> > 
> > That fix is now in place, and we'll resend the PCI fix shortly.  
> 
> Overall, this is a topic to be discussed since it is important.
> 
> Shorter term, I need to pick your brain on this:
> 
> 1) The series above, it should fix this issue:
> 
> https://lore.kernel.org/lkml/26284ca5-ea05-0496-629d-9951f49dda8f@linux.alibaba.com
> 
> Correct ?

From a code read through I 'think' the PXM only from SRAT series should fix that
but without actually setting up a suitable test I'm not 100% confident.

> 
> 2) Why in dummy_numa_init() (arch/arm64/mm/numa.c) we don't turn
>    numa_off == true if numa_add_memblk() fails ?
>    (Side note: I really like the comment :) "...It is unlikely that this
>    function fails."

So this got moved around by Mike's patch set.  That set more or less by coincidence
fixed the issues that I saw with SRAT not covering all memblocks. The issue I
ran into in some broken firmware test cases left memblock in a wierd state after
the ACPI code ran. Mike's series put everything back together again because
the core memblock handling was a bit more resilient than the loop in the arm64
code.

He didn't change the flow however, we never set numa_off in the the error
path. 

So can we actually carry on safely if numa_add_memblk() has failed here?
The only way that can happen is to have a failure from memblock_isolate_range()
- the actual code to set the node can't fail.

memblock_isolate_range() only has one failure path from memblock_double_array()
That only fails for reasons of code ordering (reserved regions not set yet)
or a memory allocation failure (very unlikely this early in boot unless something
is horribly wrong) So I think the comment is kind of valid.  It doesn't fail.

If this code does fail to run we potentially end up with corrupted memblock
entries but they 'might' be fine. That makes me nervous.
Perhaps worth noting x86 doesn't even bother checking for return value of
numa_add_memblk().

On the numa_off front, we had a discussion around whether x86 should
set it for similar error paths (it doesn't).

https://patchwork.kernel.org/project/linux-pci/patch/20181211094737.71554-1-Jonathan.Cameron@huawei.com/

I haven't tested, but I think we could drop the numa_off = true
from the arm64 code now without any problems. 

At somepoint I'd like to explore more cross arch unification in this
code where possible. There is a lot of cut and paste going on and
assumptions in this code tend to have unexpected affects on ordering
of core code.  One implementation would make reworking things much
easier, even if we have to handle a few x86 quirks in that generic code.


> 
> Forgive me for not looking earlier into the series above that Rafael
> merged - as I said above this deserves more attention on my side.

No problem - always far too many things to do!

Thanks,

Jonathan

> 
> Thanks,
> Lorenzo
> 
> > Note it may be "interesting" to support nodes from CXL CDAT tables at runtime
> > but that is another topic.
> > ( https://uefi.org/node/4093 https://lore.kernel.org/linux-cxl/20201102183428.00005f4f@Huawei.com/T/#m52fb027e30e7c476c4993ba72aad618620f178c6 )
> > 
> > For a Generic Initiator Nodes, there are two ways a device an be assigned
> > to the proximity domain. Conventional _PXM in DSDT can be used and
> > that is now supported. The SRAT entry itself also contains an address
> > (PCI seg + BDF or Platform UID / HID based).  There is no obligation to
> > provide both. The SRAT based method will require some level of alternative
> > infrastructure to that used for _PXM. We may look at this at some stage.
> > 
> > So a few outstanding things but probably not worth discussing on a call
> > at this stage unless anyone is seeing problems with the stuff already merged.
> > 
> > Thanks,
> > 
> > Jonathan
> > -- 
> > Linaro-open-discussions mailing list
> > https://collaborate.linaro.org/display/LOD/Linaro+Open+Discussions+Home
> > https://op-lists.linaro.org/mailman/listinfo/linaro-open-discussions  



More information about the Linaro-open-discussions mailing list