Potential issues (and fixes for them) in the current MPAM support from Arm [1].
[1] https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/...
Hesham Almatary (2):
  MPAM: Fix calculating the bandwidth granularity
  MPAM/resctrl: allocate a domain per component
 drivers/platform/mpam/mpam_resctrl.c | 30 ++++++++++++++----------------
 1 file changed, 14 insertions(+), 16 deletions(-)
The minimum bandwidth granularity is calculated from MPAMF_MBW_IDR.BWA_WD, which represents the number of bits for a fixed-point fraction (See Section 11.3.8 MPAM Reference Manual).
Right now, the calculation works fine for 1, but is wrong for 2 bits and more. For example, 1 bit will equal 50%. In the current implementation, 2 bits will equal 33% instead of 25%. Similarly, 3 bits will equal 25% instead of 12.5%. This commit corrects the calculation.
Signed-off-by: Hesham Almatary <hesham.almatary@huawei.com>
---
 drivers/platform/mpam/mpam_resctrl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index f9f2bab8365a..6d35b3aea828 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -512,7 +512,7 @@ static u32 get_mba_granularity(struct mpam_props *cprops)
 		 * bwa_wd is the number of bits implemented in the 0.xxx
 		 * fixed point fraction. 1 bit is 50%, 2 is 25% etc.
 		 */
-		return MAX_MBA_BW / (cprops->bwa_wd + 1);
+		return (MAX_MBA_BW / BIT(cprops->bwa_wd));
 	}
 
 	return 0;
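For reference, a standalone sketch of the arithmetic above (assuming MAX_MBA_BW is 100, i.e. resctrl's percentage scale; this is not driver code, and the in-kernel helper returns a truncated u32):

#include <stdio.h>

#define MAX_MBA_BW 100	/* assumed: resctrl expresses bandwidth as a percentage */

int main(void)
{
	/* bwa_wd is the number of implemented fixed-point fraction bits */
	for (unsigned int bwa_wd = 1; bwa_wd <= 3; bwa_wd++) {
		unsigned int buggy = MAX_MBA_BW / (bwa_wd + 1);      /* 50, 33, 25 */
		double fixed = (double)MAX_MBA_BW / (1u << bwa_wd);  /* 50, 25, 12.5 */

		printf("bwa_wd=%u: buggy=%u%% fixed=%.1f%%\n", bwa_wd, buggy, fixed);
	}
	return 0;
}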
Hi Hesham,
On 22/11/2022 11:41, Hesham Almatary wrote:
The minimum bandwidth granularity is calculated from MPAMF_MBW_IDR.BWA_WD, which represents the number of bits for a fixed-point fraction (See Section 11.3.8 MPAM Reference Manual).
Right now, the calculation works fine for 1, but is wrong for 2 bits and more. For example, 1 bit will equal 50%. In the current implementation, 2 bits will equal 33% instead of 25%. Similarly, 3 bits will equal 25% instead of 12.5%. This commit corrects the calculation.
diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index f9f2bab8365a..6d35b3aea828 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -512,7 +512,7 @@ static u32 get_mba_granularity(struct mpam_props *cprops)
 		 * bwa_wd is the number of bits implemented in the 0.xxx
 		 * fixed point fraction. 1 bit is 50%, 2 is 25% etc.
 		 */
-		return MAX_MBA_BW / (cprops->bwa_wd + 1);
+		return (MAX_MBA_BW / BIT(cprops->bwa_wd));
 	}
 
 	return 0;
Oops, thanks!
I've included this as is, but if I do any refactoring near here it is likely to get squashed into the buggy patch. If that happens I'll add you to CC on the patch. (The joke is CC also stands for Celebrate Contribution!)
Thanks,
James
On Wed, 23 Nov 2022 16:11:51 +0000 James Morse via Linaro-open-discussions linaro-open-discussions@op-lists.linaro.org wrote:
Hi Hesham,
On 22/11/2022 11:41, Hesham Almatary wrote:
The minimum bandwidth granularity is calculated from MPAMF_MBW_IDR.BWA_WD, which represents the number of bits for a fixed-point fraction (See Section 11.3.8 MPAM Reference Manual).
Right now, the calculation works fine for 1, but is wrong for 2 bits and more. For example, 1 bit will equal 50%. In the current implementation, 2 bits will equal 33% instead of 25%. Similarly, 3 bits will equal 25% instead of 12.5%. This commit corrects the calculation.
diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index f9f2bab8365a..6d35b3aea828 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -512,7 +512,7 @@ static u32 get_mba_granularity(struct mpam_props *cprops)
 		 * bwa_wd is the number of bits implemented in the 0.xxx
 		 * fixed point fraction. 1 bit is 50%, 2 is 25% etc.
 		 */
-		return MAX_MBA_BW / (cprops->bwa_wd + 1);
+		return (MAX_MBA_BW / BIT(cprops->bwa_wd));
 	}
 
 	return 0;
Oops, thanks!
I've included this as is, but if I do any refactoring near here it is likely to get squashed into the buggy patch. If that happens I'll add you to CC on the patch. (The joke is CC also stands for Celebrate Contribution!)
Glad that was the right fix. No problem about squashing it, please feel free to do whatever makes things go smoother for you.
Thanks,
James
The current code will only allocate/online a domain for only one component in a class but not the remaining ones. For example, in a system that has 4 MSCs, each of which controls a memory controller, only one memory controller will appear in the resctrl schemata instead of 4.
This commit iterates over all the components in a class (e.g., all MBA MSCs in the memory class) and creates/onlines a domain for each. resctrl will then correctly show all 4 memory-bandwidth (MB) domains.
Signed-off-by: Hesham Almatary <hesham.almatary@huawei.com>
---
 drivers/platform/mpam/mpam_resctrl.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)
diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index 6d35b3aea828..9ac256c073e7 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -996,19 +996,9 @@ void resctrl_arch_reset_resources(void)
 }
 
 static struct mpam_resctrl_dom *
-mpam_resctrl_alloc_domain(unsigned int cpu, struct mpam_resctrl_res *res)
+mpam_resctrl_alloc_domain(unsigned int cpu, struct mpam_resctrl_res *res, struct mpam_component *comp)
 {
 	struct mpam_resctrl_dom *dom;
-	struct mpam_class *class = res->class;
-	struct mpam_component *comp_iter, *comp;
-
-	comp = NULL;
-	list_for_each_entry(comp_iter, &class->components, class_list) {
-		if (cpumask_test_cpu(cpu, &comp_iter->affinity)) {
-			comp = comp_iter;
-			break;
-		}
-	}
 
 	/* cpu with unknown exported component? */
 	if (WARN_ON_ONCE(!comp))
@@ -1069,6 +1059,7 @@ int mpam_resctrl_online_cpu(unsigned int cpu)
 	int i, err;
 	struct mpam_resctrl_dom *dom;
 	struct mpam_resctrl_res *res;
+	struct mpam_component *comp_iter;
 
 	for (i = 0; i < RDT_NUM_RESOURCES; i++) {
 		res = &mpam_resctrl_exports[i];
@@ -1082,12 +1073,19 @@ int mpam_resctrl_online_cpu(unsigned int cpu)
 			continue;
 		}
 
-		dom = mpam_resctrl_alloc_domain(cpu, res);
-		if (IS_ERR(dom))
+		list_for_each_entry(comp_iter, &res->class->components, class_list) {
+
+			if (!cpumask_test_cpu(cpu, &comp_iter->affinity))
+				continue;
+
+			dom = mpam_resctrl_alloc_domain(cpu, res, comp_iter);
+			if (IS_ERR(dom))
 				return PTR_ERR(dom);
-		err = resctrl_online_domain(&res->resctrl_res, &dom->resctrl_dom);
-		if (err)
+
+			err = resctrl_online_domain(&res->resctrl_res, &dom->resctrl_dom);
+			if (err)
 				return err;
+		}
 	}
 
 	return resctrl_online_cpu(cpu);
Hi Hesham,
On 22/11/2022 11:41, Hesham Almatary wrote:
The current code will only allocate/online a domain for only one component in a class but not the remaining ones.
This is deliberate....
For example, in a system that has 4 MSCs, each of which controls a memory controller, only one memory controller will appear in the resctrl schemata instead of 4.
Unless there is something wrong with the ACPI parsing code: this is how you described your system.
Your memory controller might be made up of four slices or channels, each with an MSC. The regulation might go wrong if the MSCs are programmed differently. This is what components are for; they are hidden from resctrl.
In contrast, your four memory controllers could control different regions of memory with different 'proximity domains'. The regulation for these is independent. This is what classes are for, and this is why there is only one domain per class.
The comment above mpam_classes should clarify the structure here. If it's not clear, I can try to improve it!
If you really have four memory controllers in one NUMA node, you really shouldn't configure them differently. The OS will compact memory within a NUMA node, and doesn't expect surprise performance effects if it goes through a different MSC. User-space can't find the PA range of its memory to know which MSC it needs to configure.
Thanks,
James
Hello James,
On Wed, 23 Nov 2022 16:11:48 +0000 James Morse james.morse@arm.com wrote:
Hi Hesham,
On 22/11/2022 11:41, Hesham Almatary wrote:
The current code will only allocate/online a domain for only one component in a class but not the remaining ones.
This is deliberate....
For example, in a system that has 4 MSCs, each of which controls a memory controller, only one memory controller will appear in the resctrl schemata instead of 4.
Unless there is something wrong with the ACPI parsing code: this is how you described your system.
Your memory controller might be made up of four slices or channels, each with an MSC. The regulation might go wrong if the MSCs are programmed differently. This is what components are for; they are hidden from resctrl.
In contrast, your four memory controllers could control different regions of memory with different 'proximity domains'. The regulation for these is independent. This is what classes are for, and this is why there is only one domain per class.
The comment above mpam_classes should clarify the structure here. If it's not clear, I can try to improve it!
Thanks for explaining that. This is indeed my case; 4 different NUMA nodes, each with a memory controller and an MSC. No RIS support and no channel/slices. So for my understanding, and please correct me if I am wrong, you are saying that each memory controller/MSC (each with a different 'proximity domain') in my case should be on its own class? If yes, this means that each one should have the same type (MPAM_CLASS_MEMORY), but different class id? As in the following?
----------------------------  ----------------------------
| Class: MPAM_CLASS_MEMORY |  | Class: MPAM_CLASS_MEMORY |
| ID: 0                    |  | ID: 1                    |
|   -----------------      |  |   -----------------      |
|   |   Component   |      |  |   |   Component   |      |
|   |      MSC      |      |  |   |      MSC      |      |
|   -----------------      |  |   -----------------      |
----------------------------  ----------------------------

----------------------------  ----------------------------
| Class: MPAM_CLASS_MEMORY |  | Class: MPAM_CLASS_MEMORY |
| ID: 2                    |  | ID: 3                    |
|   -----------------      |  |   -----------------      |
|   |   Component   |      |  |   |   Component   |      |
|   |      MSC      |      |  |   |      MSC      |      |
|   -----------------      |  |   -----------------      |
----------------------------  ----------------------------
In that case, what would the proximity domain be mapped to? A class id or a component id? The comment in the code says that "Classes are the set components of the same type." This is a bit confusing to me because it means that all components of the same type (e.g., MPAM_CLASS_MEMORY) should be part of a single class?
The other questions is, according to my understanding of the code [1], it maps the 'proximity domain' to a component ID, but all MSCs with type MPAM_CLASS_MEMORY will be assigned just one class with ID 255 i.e., there will only be one class for all MSCs of type MPAM_CLASS_MEMORY, and MSCs will be mapped to different components under the same single class as follows:
----------------------------------------------------------
| Class type: MPAM_CLASS_MEMORY                           |
| Class ID: 255                                           |
| ------------------------    ------------------------   |
| | Component/MSC        |    | Component/MSC        |   |
| | ID: proximity domain |    | ID: proximity domain |   |
| ------------------------    ------------------------   |
|                                                         |
| ------------------------    ------------------------   |
| | Component/MSC        |    | Component/MSC        |   |
| | ID: proximity domain |    | ID: proximity domain |   |
| ------------------------    ------------------------   |
----------------------------------------------------------
Is that the correct intended behavior?
[1] https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/tree/drivers...
Thanks, Hesham
If you really have four memory controllers in one NUMA node, you really shouldn't configure them differently. The OS will compact memory within a NUMA node, and doesn't expect surprise performance effects if it goes through a different MSC. User-space can't find the PA range of its memory to know which MSC it needs to configure.
Thanks,
James
Hi Hesham,
On 24/11/2022 12:21, Hesham Almatary wrote:
On Wed, 23 Nov 2022 16:11:48 +0000 James Morse james.morse@arm.com wrote:
On 22/11/2022 11:41, Hesham Almatary wrote:
The current code will only allocate/online a domain for only one component in a class but not the remaining ones.
This is deliberate....
For example, in a system that has 4 MSCs, each of which controls a memory controller, only one memory controller will appear in the resctrl schemata instead of 4.
Unless there is something wrong with the ACPI parsing code: this is how you described your system.
Your memory controller might be made up of four slices or channels, each with an MSC. The regulation might go wrong if the MSCs are programmed differently. This is what components are for; they are hidden from resctrl.
In contrast, your four memory controllers could control different regions of memory with different 'proximity domains'. The regulation for these is independent. This is what classes are for, and this is why there is only one domain per class.
The comment above mpam_classes should clarify the structure here. If it's not clear, I can try to improve it!
Thanks for explaining that. This is indeed my case; 4 different NUMA nodes, each with a memory controller and an MSC. No RIS support and no channel/slices. So for my understanding, and please correct me if I am wrong, you are saying that each memory controller/MSC (each with a different 'proximity domain') in my case should be on its own class? If yes, this means that each one should have the same type (MPAM_CLASS_MEMORY), but different class id?
AH - you're right. I think I was off by one in how the driver structures these things. (Maybe I need to expand those comments for my own benefit!)
Let me try and reproduce what you are seeing to work out what it should be doing. (then I'll get back to your reply). I have ACPI tables to test this with combining/splitting the FVP's 'cache's...
Thanks,
James
As in the following?
----------------------------  ----------------------------
| Class: MPAM_CLASS_MEMORY |  | Class: MPAM_CLASS_MEMORY |
| ID: 0                    |  | ID: 1                    |
|   -----------------      |  |   -----------------      |
|   |   Component   |      |  |   |   Component   |      |
|   |      MSC      |      |  |   |      MSC      |      |
|   -----------------      |  |   -----------------      |
----------------------------  ----------------------------

----------------------------  ----------------------------
| Class: MPAM_CLASS_MEMORY |  | Class: MPAM_CLASS_MEMORY |
| ID: 2                    |  | ID: 3                    |
|   -----------------      |  |   -----------------      |
|   |   Component   |      |  |   |   Component   |      |
|   |      MSC      |      |  |   |      MSC      |      |
|   -----------------      |  |   -----------------      |
----------------------------  ----------------------------
In that case, what would the proximity domain be mapped to? A class id or a component id? The comment in the code says that "Classes are the set components of the same type." This is a bit confusing to me because it means that all components of the same type (e.g., MPAM_CLASS_MEMORY) should be part of a single class?
The other questions is, according to my understanding of the code [1], it maps the 'proximity domain' to a component ID, but all MSCs with type MPAM_CLASS_MEMORY will be assigned just one class with ID 255 i.e., there will only be one class for all MSCs of type MPAM_CLASS_MEMORY, and MSCs will be mapped to different components under the same single class as follows:
----------------------------------------------------------
| Class type: MPAM_CLASS_MEMORY                           |
| Class ID: 255                                           |
| ------------------------    ------------------------   |
| | Component/MSC        |    | Component/MSC        |   |
| | ID: proximity domain |    | ID: proximity domain |   |
| ------------------------    ------------------------   |
|                                                         |
| ------------------------    ------------------------   |
| | Component/MSC        |    | Component/MSC        |   |
| | ID: proximity domain |    | ID: proximity domain |   |
| ------------------------    ------------------------   |
----------------------------------------------------------
Is that the correct intended behavior?
[1] https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/tree/drivers...
Thanks, Hesham
If you really have four memory controllers in one NUMA node, you really shouldn't configure them differently. The OS will compact memory within a NUMA node, and doesn't expect surprise performance effects if it goes through a different MSC. User-space can't find the PA range of its memory to know which MSC it needs to configure.
Thanks,
James
Hi Hesham,
On 24/11/2022 12:29, James Morse via Linaro-open-discussions wrote:
On 24/11/2022 12:21, Hesham Almatary wrote:
Thanks for explaining that. This is indeed my case; 4 different NUMA nodes, each with a memory controller and an MSC. No RIS support and no channel/slices. So for my understanding, and please correct me if I am wrong, you are saying that each memory controller/MSC (each with a different 'proximity domain') in my case should be on its own class? If yes, this means that each one should have the same type (MPAM_CLASS_MEMORY), but different class id?
AH - you're right. I think I was off by one in how the driver structures these things. (Maybe I need to expand those comments for my own benefit!)
Let me try and reproduce what you are seeing to work out what it should be doing. (then I'll get back to your reply). I have ACPI tables to test this with combining/splitting the FVP's 'cache's...
I've reproduced something like your setup in the FVP, but I can't reproduce the problem you describe:
| root@(none):/sys/fs/resctrl# cat schemata
| MB:0=100;1=100
I have found one issue with cpumask_of_node() - it only knows about online CPUs, meaning the sanity check that a class covers all possible CPUs fails if not all CPUs are online, and no MB resources are created. This doesn't match what you described, but it might be relevant. Does [0] help?
Your diff is creating domains when any CPU comes online ... this is the wrong thing to do, and would be an ABI break for resctrl. If all the CPUs in a domain are offline, it is supposed to disappear from the schemata file. When a CPU in the domain returns, it should re-create the domain, or at least reset all the configuration.
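As an illustration of that expectation only (not the driver's actual code): assuming dom->resctrl_dom embeds the usual resctrl domain fields (cpu_mask, list) and that a resctrl_offline_domain() counterpart to resctrl_online_domain() exists, the offline side would look roughly like this, with the lookup helper being hypothetical:

/* Sketch only: tear the domain down when its last CPU goes offline. */
static void example_offline_cpu(unsigned int cpu, struct mpam_resctrl_res *res)
{
	struct mpam_resctrl_dom *dom;

	dom = example_find_domain(res, cpu);	/* hypothetical lookup helper */
	if (!dom)
		return;

	cpumask_clear_cpu(cpu, &dom->resctrl_dom.cpu_mask);
	if (cpumask_empty(&dom->resctrl_dom.cpu_mask)) {
		/* Last CPU gone: the domain must vanish from the schemata file */
		resctrl_offline_domain(&res->resctrl_res, &dom->resctrl_dom);
		list_del(&dom->resctrl_dom.list);
		kfree(dom);
	}
}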
If [0] doesn't change things, could you share your MPAM and SRAT ACPI tables? I suspect something is going wrong when working out the MSC affinity. The output of:
| find /sys/kernel/debug/mpam/ -type f -print -exec cat {} \;
might help me decode your tables. Feel free to send these things in private if that's going to be easier. (I just won't be allowed to reply)
Thanks,
James
[0] Don't use cpumask_of_node, it only works for online CPUs
--------------------%<--------------------
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c
index c1b2bd9caa29..d4853c567b4e 100644
--- a/drivers/platform/mpam/mpam_devices.c
+++ b/drivers/platform/mpam/mpam_devices.c
@@ -418,6 +418,21 @@ static int get_cpumask_from_cache_id(u32 cache_id, u32 cache_level,
 	return 0;
 }
 
+/*
+ * cpumask_of_node() only knows about online CPUs. This can't tell us whether
+ * a class is represented on all possible CPUs.
+ */
+static void get_cpumask_from_node_id(u32 node_id, cpumask_t *affinity)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		if (node_id == cpu_to_node(cpu))
+			cpumask_set_cpu(cpu, affinity);
+	}
+}
+
 static int get_cpumask_from_cache(struct device_node *cache,
 				  cpumask_t *affinity)
 {
@@ -460,7 +475,7 @@ static int mpam_ris_get_affinity(struct mpam_msc *msc, cpumask_t *affinity,
 		break;
 	case MPAM_CLASS_MEMORY:
-		*affinity = *cpumask_of_node(comp->comp_id);
+		get_cpumask_from_node_id(comp->comp_id, affinity);
 		if (cpumask_empty(affinity))
 			pr_warn_once("%s no CPUs associated with memory node",
 				     dev_name(&msc->pdev->dev));
--------------------%<--------------------
Hello James,
Thanks for getting back to me on this. I have done some changes to my ACPI tables and got your code working fine without this patch. In particular, I matched the proximity domain field with the NUMA node ID for each memory controller. If they differ, the code won't work (as it has the assumption that the proximity domain is the same as NUMA id, from which the affinity/accessibility is set). This leaves me with a few questions regarding the design and implementation of the driver. I'd appreciate your input on that.
1) What does a memory MSC correspond to? A class (with a unique ID) or a component? From the code, it seems like it maps to a component to me.
2) Could we have a use case in which we have different class IDs with the same class type? If yes could you please give an example?
3) What should a component ID for a memory MSC be/represent? The code assumes it's a (NUMA?) node ID.
4) What should a class ID represent for a memory MSC? Which is different from the class type itself.
5) How would 4 memory MSCs (with different proximity domains) map to classes and components?
6) How would 2 Memory MSCs with(in) the same proximity domain and/or same NUMA node work, if at all?
7) Should the ACPI/MPAM MSC's "identifier" field be mapped to class IDs or component IDs at all?
Regards, Hesham
On Wed, 21 Dec 2022 13:54:01 +0000 James Morse james.morse@arm.com wrote:
Hi Hesham,
On 24/11/2022 12:29, James Morse via Linaro-open-discussions wrote:
On 24/11/2022 12:21, Hesham Almatary wrote:
Thanks for explaining that. This is indeed my case; 4 different NUMA nodes, each with a memory controller and an MSC. No RIS support and no channel/slices. So for my understanding, and please correct me if I am wrong, you are saying that each memory controller/MSC (each with a different 'proximity domain') in my case should be on its own class? If yes, this means that each one should have the same type (MPAM_CLASS_MEMORY), but different class id?
AH - you're right. I think I was off by one in how the driver structures these things. (Maybe I need to expand those comments for my own benefit!)
Let me try and reproduce what you are seeing to work out what it should be doing. (then I'll get back to your reply). I have ACPI tables to test this with combining/splitting the FVP's 'cache's...
I've reproduced something like your setup in the FVP, but I can't reproduce the problem you describe:
| root@(none):/sys/fs/resctrl# cat schemata
| MB:0=100;1=100
I have found one issue with cpumask_of_node() - it only knows about online CPUs, meaning the sanity check that a class covers all possible CPUs fails if not all CPUs are online, and no MB resources are created. This doesn't match what you described, but it might be relevant. Does [0] help?
Your diff is creating domains when any CPU comes online ... this is the wrong thing to do, and would be an ABI break for resctrl. If all the CPUs in a domain are offline, it is supposed to disappear from the schemata file. When a CPU in the domain returns, it should re-create the domain, or at least reset all the configuration.
If [0] doesn't change things, could you share your MPAM and SRAT ACPI tables? I suspect something is going wrong when working out the MSC affinity. The output of:
| find /sys/kernel/debug/mpam/ -type f -print -exec cat {} \;
might help me decode your tables. Feel free to send these things in private if that's going to be easier. (I just won't be allowed to reply)
Thanks,
James
[0] Don't use cpumask_of_node, it only works for online CPUs
--------------------%<--------------------
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c
index c1b2bd9caa29..d4853c567b4e 100644
--- a/drivers/platform/mpam/mpam_devices.c
+++ b/drivers/platform/mpam/mpam_devices.c
@@ -418,6 +418,21 @@ static int get_cpumask_from_cache_id(u32 cache_id, u32 cache_level,
 	return 0;
 }
 
+/*
+ * cpumask_of_node() only knows about online CPUs. This can't tell us whether
+ * a class is represented on all possible CPUs.
+ */
+static void get_cpumask_from_node_id(u32 node_id, cpumask_t *affinity)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		if (node_id == cpu_to_node(cpu))
+			cpumask_set_cpu(cpu, affinity);
+	}
+}
+
 static int get_cpumask_from_cache(struct device_node *cache,
 				  cpumask_t *affinity)
 {
@@ -460,7 +475,7 @@ static int mpam_ris_get_affinity(struct mpam_msc *msc, cpumask_t *affinity,
 		break;
 	case MPAM_CLASS_MEMORY:
-		*affinity = *cpumask_of_node(comp->comp_id);
+		get_cpumask_from_node_id(comp->comp_id, affinity);
 		if (cpumask_empty(affinity))
 			pr_warn_once("%s no CPUs associated with memory node",
 				     dev_name(&msc->pdev->dev));
--------------------%<--------------------
Hi Hesham,
On 12/01/2023 10:34, Hesham Almatary wrote:
Thanks for getting back to me on this. I have done some changes to my ACPI tables and got your code working fine without this patch. In particular, I matched the proximity domain field with the NUMA node ID for each memory controller. If they differ, the code won't work (as it has the assumption that the proximity domain is the same as NUMA id,
Right, if there is an extra level of indirection in there, it's something I wasn't aware of. I'll need to dig into it. I agree this explains what you were seeing.
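For illustration, if the extra indirection is real, the existing ACPI helper pxm_to_node() could translate the proximity domain into a Linux node id before the affinity mask is built. A sketch under the assumption that comp->comp_id currently carries the proximity domain (this is not the driver's current code):

/* Sketch: map an ACPI proximity domain to CPUs via the Linux NUMA node id. */
static void example_cpumask_from_proximity_domain(u32 pxm, cpumask_t *affinity)
{
	int node = pxm_to_node(pxm);	/* ACPI helper: proximity domain -> node id */
	int cpu;

	if (node == NUMA_NO_NODE)
		return;

	for_each_possible_cpu(cpu) {
		if (cpu_to_node(cpu) == node)
			cpumask_set_cpu(cpu, affinity);
	}
}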
from which the affinity/accessibility is set). This leaves me with a few questions regarding the design and implementation of the driver. I'd appreciate your input on that.
1) What does a memory MSC correspond to? A class (with a unique ID) or a component? From the code, it seems like it maps to a component to me.
An MSC is the device: it has registers and generates interrupts. If it's part of your memory controller, it gets described like that in the ACPI tables, which lets Linux guess that this MSC (or the RIS within it) controls some policy in the memory controller.
Components exist to group devices that should be configured the same. This happens where designs are sliced up, but the slicing makes no sense to the software. Classes are a group of components that do the same thing, but not to the same resource, e.g. they all control memory controllers.
The ACPI tables should describe the MSC; it's up to the driver to build the class and component structures from what it can infer from the other ACPI tables.
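A rough sketch of that hierarchy, with type names invented for illustration (the list and field names follow the patches earlier in the thread, not the driver's exact definitions):

struct example_msc {			/* one MPAM device: registers, interrupts */
	struct list_head comp_list;	/* entry in the owning component */
};

struct example_component {		/* MSCs that must be programmed identically */
	u32 comp_id;			/* e.g. cache-id or NUMA node id */
	cpumask_t affinity;		/* CPUs that reach memory through it */
	struct list_head devices;	/* the MSCs (or RIS) in this component */
	struct list_head class_list;	/* entry in the owning class */
};

struct example_class {			/* components doing the same job, e.g. memory */
	u8 class_id;			/* cache level, or a reserved value for memory */
	struct list_head components;	/* one entry per independently regulated resource */
};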
2) Could we have a use case in which we have different class IDs with the same class type? If yes could you please give an example?
Your L2 and L3 are both caches, but use the level number as the id. I doubt anyone builds a system with MSCs on both, but it's possible by the architecture, and we could expose both via resctrl.
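If such a system existed, the schemata file could plausibly carry one line per class, e.g. (hypothetical domain ids and capacity bitmasks):
L3:0=7fff;1=7fff
L2:0=ff;1=ff;2=ff;3=ff
MB:0=100;1=100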
3) What should a component ID for a memory MSC be/represent? The code assumes it's a (NUMA?) node ID.
The component-ids are some number that makes sense to Linux, and matches something in the ACPI tables. These are exposed via the schema file to user-space. For the caches, it's the cache-id property from the PPTT table. This is exposed to user-space via /sys/devices/system/cpu/cpu0/cache/index3/id or equivalent.
It's important that user-space can work out which CPUs share a component/domain in the schema. Using a sensible id is the prerequisite for that.
Intel's memory bandwidth control appears to be implemented on the L3, so they re-use the id of the L3 cache. These seem to correspond to NUMA nodes already.
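For illustration, a small user-space sketch (not part of the driver) that groups CPUs by that exposed id, assuming index3 is the L3 cache on the machine:

#include <stdio.h>

int main(void)
{
	char path[128];
	FILE *f;
	int id;

	for (int cpu = 0; ; cpu++) {
		/* The id resctrl uses as the L3 domain id in the schemata file */
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/cache/index3/id", cpu);
		f = fopen(path, "r");
		if (!f)
			break;	/* stop at the first missing CPU/index3 directory */
		if (fscanf(f, "%d", &id) == 1)
			printf("cpu%d -> L3 domain %d\n", cpu, id);
		fclose(f);
	}
	return 0;
}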
For MPAM - we have no idea if the memory controllers map 1:1 with any level in the cache. Instead, the driver expects to use the NUMA node number directly.
(I'll put this on the list of KNOWN_ISSUES; the Intel side of this ought to be cleaned up so it doesn't break if they build a SoC where L3 doesn't map 1:1 with NUMA nodes. It looks like they are getting away with it because Atom doesn't support L3 or memory bandwidth.)
4) What should a class ID represent for a memory MSC? Which is different from the class type itself.
The class id is private to the driver; for the caches it needs to be the cache level. Because of that, memory is shoved at the end, on the assumption no-one has an L255 cache, and 'unknown' devices are shoved at the beginning... L0 caches probably do exist, but I doubt anyone would add an MSC to them.
Classes can't be arbitrarily created, as the resctrl picking code needs to know how they map to resctrl schemas, as we can't invent new schemas without messing up user-space.
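A hypothetical illustration of that numbering (these are not the driver's actual macros):

enum example_class_id {
	EXAMPLE_CLASS_ID_UNKNOWN = 0,	/* 'unknown' devices at the beginning */
	EXAMPLE_CLASS_ID_L2      = 2,	/* caches use the cache level */
	EXAMPLE_CLASS_ID_L3      = 3,
	EXAMPLE_CLASS_ID_MEMORY  = 255,	/* memory at the end: no-one has an L255 cache */
};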
5) How would 4 memory MSCs (with different proximity domains) map to classes and components?
Each MSC would be a device. There would be one device per component, because each proximity domain is different. They would all be the same class, as you'd described them all with a memory type in the ACPI tables.
If you see a problem with this, let me know! The folk who write the ACPI specs didn't find any systems where this would lead to problems... that doesn't mean you haven't build something that looks quite different.
6) How would 2 Memory MSCs with(in) the same proximity domain and/or same NUMA node work, if at all?
If you build this, I bet your hardware people say those two MSCs must be programmed the same for the regulation to work. (If not - how is software expected to understand the hashing scheme used to map physical addresses to memory controllers?!)
Each MSC would be a device. They would both be part of the same component as they have the same proximity domain.
Configuration is applied to the component, so each device/MSC within the component is always configured the same.
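A sketch of that rule, reusing the illustrative types from the hierarchy sketch above; example_apply_to_msc() and the config type are hypothetical:

/* Configuration lives on the component; every MSC under it gets the same programming. */
static void example_apply_component_config(struct example_component *comp,
					    const struct example_config *cfg)
{
	struct example_msc *msc;

	list_for_each_entry(msc, &comp->devices, comp_list)
		example_apply_to_msc(msc, cfg);
}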
7) Should the ACPI/MPAM MSC's "identifier" field be mapped to class IDs or component IDs at all?
Classes, no - these are just for the driver to keep track of the groups. Components, probably ... but another number may make more sense. This should line up with something that is already exposed to user-space via sysfs.
Thanks,
James
Hello James,
Many thanks for clarifying things up, that definitely helps with my understanding.
On Thu, 12 Jan 2023 13:38:17 +0000 James Morse james.morse@arm.com wrote:
Hi Hesham,
On 12/01/2023 10:34, Hesham Almatary wrote:
Thanks for getting back to me on this. I have done some changes to my ACPI tables and got your code working fine without this patch. In particular, I matched the proximity domain field with the NUMA node ID for each memory controller. If they differ, the code won't work (as it has the assumption that the proximity domain is the same as NUMA id,
Right, if there is an extra level of indirection in there, it's something I wasn't aware of. I'll need to dig into it. I agree this explains what you were seeing.
from which the affinity/accessibility is set). This leaves me with a few questions regarding the design and implementation of the driver. I'd appreciate your input on that.
1) What does a memory MSC correspond to? A class (with a unique ID) or a component? From the code, it seems like it maps to a component to me.
An MSC is the device: it has registers and generates interrupts. If it's part of your memory controller, it gets described like that in the ACPI tables, which lets Linux guess that this MSC (or the RIS within it) controls some policy in the memory controller.
Components exist to group devices that should be configured the same. This happens where designs are sliced up, but the slicing makes no sense to the software. Classes are a group of components that do the same thing, but not to the same resource, e.g. they all control memory controllers.
The ACPI tables should describe the MSC; it's up to the driver to build the class and component structures from what it can infer from the other ACPI tables.
2) Could we have a use case in which we have different class IDs with the same class type? If yes could you please give an example?
Your L2 and L3 are both caches, but use the level number as the id. I doubt anyone builds a system with MSCs on both, but it's possible by the architecture, and we could expose both via resctrl.
3) What should a component ID for a memory MSC be/represent? The code assumes it's a (NUMA?) node ID.
The component-ids are some number that makes sense to Linux, and matches something in the ACPI tables. These are exposed via the schema file to user-space. For the caches, it's the cache-id property from the PPTT table. This is exposed to user-space via /sys/devices/system/cpu/cpu0/cache/index3/id or equivalent.
It's important that user-space can work out which CPUs share a component/domain in the schema. Using a sensible id is the prerequisite for that.
Intel's memory bandwidth control appears to be implemented on the L3, so they re-use the id of the L3 cache. These seem to correspond to NUMA nodes already.
For MPAM - we have no idea if the memory controllers map 1:1 with any level in the cache. Instead, the driver expects to use the NUMA node number directly.
(I'll put this on the list of KNOWN_ISSUES; the Intel side of this ought to be cleaned up so it doesn't break if they build a SoC where L3 doesn't map 1:1 with NUMA nodes. It looks like they are getting away with it because Atom doesn't support L3 or memory bandwidth.)
That's very useful and informative. Some form of documentation (in KNOWN_ISSUES, comments, or a README) would help with such assumptions, as the ACPI/MPAM spec doesn't mention them (i.e., the logical ID assignments from the OS's point of view).
4) What should a class ID represent for a memory MSC? Which is different from the class type itself.
The class id is private to the driver; for the caches it needs to be the cache level. Because of that, memory is shoved at the end, on the assumption no-one has an L255 cache, and 'unknown' devices are shoved at the beginning... L0 caches probably do exist, but I doubt anyone would add an MSC to them.
Classes can't be arbitrarily created, as the resctrl picking code needs to know how they map to resctrl schemas, as we can't invent new schemas without messing up user-space.
5) How would 4 memory MSCs (with different proximity domains) map to classes and components?
Each MSC would be a device. There would be one device per component, because each proximity domain is different. They would all be the same class, as you'd described them all with a memory type in the ACPI tables.
If you see a problem with this, let me know! The folk who write the ACPI specs didn't find any systems where this would lead to problems... that doesn't mean you haven't build something that looks quite different.
6) How would 2 Memory MSCs with(in) the same proximity domain and/or same NUMA node work, if at all?
If you build this, I bet your hardware people say those two MSCs must be programmed the same for the regulation to work. (If not - how is software expected to understand the hashing scheme used to map physical addresses to memory controllers?!)
Each MSC would be a device. They would both be part of the same component as they have the same proximity domain.
Configuration is applied to the component, so each device/MSC within the component is always configured the same.
7) Should the ACPI/MPAM MSC's "identifier" field be mapped to class IDs or component IDs at all?
Classes, no - these are just for the driver to keep track of the groups. Components, probably ... but another number may make more sense. This should line up with something that is already exposed to user-space via sysfs.
Thanks,
James