This is the follow-up work to support the cluster scheduler. Previously we added a cluster level to the scheduler [1] to spread tasks between clusters, which brings more memory bandwidth and reduces cache contention. But this may hurt workloads that are sensitive to communication latency, since related tasks can end up placed across clusters.

This series modifies select_idle_cpu() on the wake-affine path so that a wake-affine task is more likely to be woken on the same cluster as the waker. Latency decreases because a waker and wakee in the same cluster can benefit from the hot L3 cache tag.
[1] https://lore.kernel.org/lkml/20210924085104.44806-1-21cnbao@gmail.com/
Hi Tim and Barry,

This is the modified patchset for the packing path of the cluster scheduler. Tests have been done on a Kunpeng 920 2-socket 4-NUMA 128-core platform, with 8 clusters in each NUMA node. The patches are based on 5.16-rc1.

Compared to the previous version [2], we give up the approach of scanning the first cpu of the cluster, as cpu ids may not be contiguous. Instead we scan the cluster first, before the rest of the LLC; a toy sketch of this scan order follows the link below. The results from tbench and pgbench are rather positive.
[2] https://op-lists.linaro.org/pipermail/linaro-open-discussions/2021-October/0...
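To illustrate the intended scan order, here is a toy, self-contained userspace C sketch (not kernel code; the llc/cluster/idle masks are made-up example values): look for an idle cpu inside the target's cluster first, then in the rest of the LLC with the cluster masked out.

/*
 * Toy illustration (not kernel code) of the scan order proposed in this
 * series.  CPU sets are modelled as 64-bit masks; llc_mask, cluster_mask
 * and idle_mask are made-up example values.
 */
#include <stdint.h>
#include <stdio.h>

static int first_idle(uint64_t candidates, uint64_t idle_mask)
{
	uint64_t hit = candidates & idle_mask;

	return hit ? __builtin_ctzll(hit) : -1;
}

int main(void)
{
	uint64_t llc_mask     = 0xffffull;	/* CPUs 0-15 share the LLC   */
	uint64_t cluster_mask = 0x00f0ull;	/* CPUs 4-7 form the cluster */
	uint64_t idle_mask    = 0x1020ull;	/* CPUs 5 and 12 are idle    */
	int cpu;

	/* Stage 1: the target's cluster first. */
	cpu = first_idle(cluster_mask, idle_mask);
	/* Stage 2: the rest of the LLC, excluding the cluster. */
	if (cpu < 0)
		cpu = first_idle(llc_mask & ~cluster_mask, idle_mask);

	printf("picked CPU %d\n", cpu);	/* prints 5: the in-cluster idle CPU wins */
	return 0;
}

In the real patches the candidate sets are cpumasks taken from sched_domain_span() of the cluster and LLC domains, restricted to p->cpus_ptr.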
Barry Song (2):
  sched: Add per_cpu cluster domain info
  sched/fair: Scan cluster before scanning LLC in wake-up path
 include/linux/sched/sd_flags.h |  9 ++++++++
 include/linux/sched/topology.h |  2 +-
 kernel/sched/fair.c            | 41 +++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h           |  1 +
 kernel/sched/topology.c        |  5 +++++
 5 files changed, 56 insertions(+), 2 deletions(-)
From: Barry Song song.bao.hua@hisilicon.com
Add per-cpu cluster domain info. This is preparation for optimizing select_idle_cpu() on platforms with a cluster scheduler level.
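As a minimal usage sketch (assumed consumer code, not part of this patch; cluster_cpus is just a local name for the example), the new per-cpu pointer is intended to be read under RCU, the way the next patch does in select_idle_cpu():

	struct sched_domain *cluster_sd;
	const struct cpumask *cluster_cpus = NULL;

	rcu_read_lock();	/* sched_domain pointers are RCU-protected */
	cluster_sd = rcu_dereference(per_cpu(sd_cluster, cpu));	/* 'cpu' is the CPU of interest */
	if (cluster_sd)
		cluster_cpus = sched_domain_span(cluster_sd);	/* CPUs in cpu's cluster */
	rcu_read_unlock();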
Signed-off-by: Barry Song song.bao.hua@hisilicon.com
Signed-off-by: Yicong Yang yangyicong@hisilicon.com
---
 include/linux/sched/sd_flags.h | 9 +++++++++
 include/linux/sched/topology.h | 2 +-
 kernel/sched/sched.h           | 1 +
 kernel/sched/topology.c        | 5 +++++
 4 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 57bde66d95f7..656473a17904 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -109,6 +109,15 @@ SD_FLAG(SD_ASYM_CPUCAPACITY_FULL, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
  */
 SD_FLAG(SD_SHARE_CPUCAPACITY, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
 
+/*
+ * Domain members share CPU cluster resources (i.e. llc cache tags or l2)
+ *
+ * SHARED_CHILD: Set from the base domain up until spanned CPUs no longer share
+ *               the cluster resources (such as llc tags or l2)
+ * NEEDS_GROUPS: Caches are shared between groups.
+ */
+SD_FLAG(SD_SHARE_CLS_RESOURCES, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
+
 /*
  * Domain members share CPU package resources (i.e. caches)
  *
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c07bfa2d80f2..208471341a5d 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -45,7 +45,7 @@ static inline int cpu_smt_flags(void)
 #ifdef CONFIG_SCHED_CLUSTER
 static inline int cpu_cluster_flags(void)
 {
-	return SD_SHARE_PKG_RESOURCES;
+	return SD_SHARE_CLS_RESOURCES | SD_SHARE_PKG_RESOURCES;
 }
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0e66749486e7..b4a2294f5d0c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1764,6 +1764,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..dfbc5eb32824 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -644,6 +644,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
@@ -657,6 +658,9 @@ static void update_top_cache_domain(int cpu)
 	int id = cpu;
 	int size = 1;
 
+	sd = highest_flag_domain(cpu, SD_SHARE_CLS_RESOURCES);
+	rcu_assign_pointer(per_cpu(sd_cluster, cpu), sd);
+
 	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
@@ -1514,6 +1518,7 @@ static unsigned long __read_mostly *sched_numa_onlined_nodes;
  */
 #define TOPOLOGY_SD_FLAGS		\
 	(SD_SHARE_CPUCAPACITY	|	\
+	 SD_SHARE_CLS_RESOURCES	|	\
	 SD_SHARE_PKG_RESOURCES |	\
	 SD_NUMA		|	\
	 SD_ASYM_PACKING)
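For context on why attaching SD_SHARE_CLS_RESOURCES to the cluster level is enough for update_top_cache_domain(): highest_flag_domain() walks up from the lowest domain and returns the last consecutive level that still carries the flag, so it yields the CLS domain here (or NULL when the base domain does not carry the flag). Roughly, paraphrased from kernel/sched/sched.h:

/* Paraphrased sketch of highest_flag_domain() in kernel/sched/sched.h. */
static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
{
	struct sched_domain *sd, *hsd = NULL;

	for_each_domain(cpu, sd) {
		if (!(sd->flags & flag))
			break;		/* stop at the first level without the flag */
		hsd = sd;		/* remember the highest level that has it */
	}

	return hsd;
}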
From: Barry Song song.bao.hua@hisilicon.com
On platforms with clusters, such as Kunpeng 920, tasks in the same cluster share the L3 cache tag and so have lower latency when synchronizing and accessing shared resources. Based on this, waking the task within the same cluster as the waker will make the migration cost smaller. This patch tries to find a wake cpu by scanning the cluster first, before scanning the LLC.
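In outline (condensed from the diff below, not a standalone compilable unit), select_idle_cpu() now does the following before its usual LLC walk:

	/* 1) Try the target's cluster first (idle core or idle CPU). */
	cluster_sd = rcu_dereference(per_cpu(sd_cluster, target));
	if (cluster_sd) {
		i = scan_cluster(p, cluster_sd, has_idle_core, target, &idle_cpu);
		if ((unsigned int)i < nr_cpumask_bits)
			return i;
	}

	/* 2) Fall back to the LLC, with the already-scanned cluster masked out. */
	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
	if (cluster_sd)
		cpumask_andnot(cpus, cpus, sched_domain_span(cluster_sd));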
Benchmark tests have been done on a 2-socket 4-NUMA Kunpeng 920 with 8 clusters in each NUMA node. The results from tbench and hackbench are rather positive.
tbench4
                       5.16-rc1-vanilla       5.16-rc1+patch
Hmean     1          341.78 (   0.00%)      350.10 *   2.43%*
Hmean     2          684.31 (   0.00%)      700.25 *   2.33%*
Hmean     4         1350.03 (   0.00%)     1374.33 *   1.80%*
Hmean     8         2563.33 (   0.00%)     2615.74 *   2.04%*
Hmean     16        4976.31 (   0.00%)     4911.05 *  -1.31%*
Hmean     32        8446.80 (   0.00%)     9076.71 *   7.46%*
Hmean     64        4938.98 (   0.00%)     5890.29 *  19.26%*
Hmean     128       7422.75 (   0.00%)     8941.65 *  20.46%*
Hmean     256       7503.72 (   0.00%)     7609.30 *   1.41%*
Hmean     512       6526.50 (   0.00%)     7616.90 *  16.71%*
hackbench-process-pipes
                       5.16-rc1-vanilla       5.16-rc1+patch
Amean     1           0.7233 (   0.00%)      0.6048 *  16.38%*
Amean     4           1.6168 (   0.00%)      0.9831 *  39.19%*
Amean     7           1.7604 (   0.00%)      1.3456 *  23.56%*
Amean     12          2.1637 (   0.00%)      2.0515 *   5.19%*
Amean     21          3.7302 (   0.00%)      3.4755 *   6.83%*
Amean     30          6.8281 (   0.00%)      5.4964 *  19.50%*
Amean     48         11.5442 (   0.00%)      9.2672 *  19.72%*
Amean     79         14.1319 (   0.00%)     12.1617 *  13.94%*
Amean     110        17.2689 (   0.00%)     15.0081 *  13.09%*
Amean     141        20.2057 (   0.00%)     18.4041 *   8.92%*
Amean     172        25.2087 (   0.00%)     21.2069 *  15.87%*
Amean     203        28.4038 (   0.00%)     24.8319 *  12.58%*
Amean     234        32.4690 (   0.00%)     28.2500 *  12.99%*
Amean     256        33.1803 (   0.00%)     30.0114 *   9.55%*
Tested-by: Yicong Yang yangyicong@hisilicon.com
Signed-off-by: Barry Song song.bao.hua@hisilicon.com
Signed-off-by: Yicong Yang yangyicong@hisilicon.com
---
 kernel/sched/fair.c | 41 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e476f6d9435..f8b094738c03 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6230,6 +6230,34 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
+#ifdef CONFIG_SCHED_CLUSTER
+static inline int scan_cluster(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target, int *idle_cpu)
+{
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	int i, cpu;
+
+	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	cpumask_clear_cpu(target, cpus);
+
+	for_each_cpu_wrap(cpu, cpus, target + 1) {
+		if (has_idle_core)
+			i = select_idle_core(p, cpu, cpus, idle_cpu);
+		else
+			i = __select_idle_cpu(cpu, p);
+
+		if ((unsigned int)i < nr_cpumask_bits)
+			return i;
+	}
+
+	return -1;
+}
+#else
+static inline int scan_cluster(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target, int *idle_cpu)
+{
+	return -1;
+}
+#endif
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6241,14 +6269,25 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
-	struct sched_domain *this_sd;
+	struct sched_domain *this_sd, *cluster_sd;
 	u64 time = 0;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
 		return -1;
 
+	/* scan cluster before scanning LLC */
+	cluster_sd = rcu_dereference(per_cpu(sd_cluster, target));
+	if (cluster_sd) {
+		i = scan_cluster(p, cluster_sd, has_idle_core, target, &idle_cpu);
+		if ((unsigned int)i < nr_cpumask_bits)
+			return i;
+	}
+
+	/* scan LLC excluding cluster */
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	if (cluster_sd)
+		cpumask_andnot(cpus, cpus, sched_domain_span(cluster_sd));
 
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		u64 avg_cost, avg_idle, span_avg;
On Wed, Nov 24, 2021 at 11:17 PM Yicong Yang yangyicong@hisilicon.com wrote:
> From: Barry Song song.bao.hua@hisilicon.com
>
> For platforms having clusters like Kunpeng 920, tasks in the same cluster sharing L3 Cache Tag will have lower latency when synchronizing and accessing shared resources. Based on this, wake the task within the same cluster with the waker will make migration cost smaller.
The target could be the waker or the wakee; the wakee could be the scan target as well.
> This patch tries to find a wake cpu by scanning the cluster first before scanning LLC.
find an idle cpu
[...]
On Wed, Nov 24, 2021 at 11:17 PM Yicong Yang yangyicong@hisilicon.com wrote:
> This is the follow-up work to support cluster scheduler. Previously we have added cluster level in the scheduler[1] to make tasks spread between clusters to bring more memory bandwidth and decrease cache contention. But it may hurt some workloads which are sensitive to the communication latency as they will be placed across clusters.
>
> We modified the select_idle_cpu() on the wake affine path in this series, expecting the wake affined task to be woken more likely on the same cluster with the waker. The latency will be decreased as the waker and wakee in the same cluster may benefit from the hot L3 cache tag.
If the task runs in the same cluster as the scanned target, the data synchronization cost will be lower. The task can wake up either in the cluster of the waker or in the cluster of the wakee, depending on the return value of wake_wide(). So, to be more accurate, we are not always putting the waker and wakee in the same cluster; we are trying to put the task in the same cluster as the target, so that it can either get the active cache from the waker or its own old cache from the wakee.

In case A wakes up B: if we scan from A, we get the new cache that A has just written for B; if we scan from B as the target, we get B's old cache.

In both cases, the cache synchronization cost is lower because we find an idle cpu within the cluster of A or B.
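For reference, a heavily simplified sketch of how that target is chosen on the wake-up fast path in select_task_rq_fair() (v5.16): wake_wide() and cpumask_test_cpu() are the real helpers, while wake_affine_pick() is a hypothetical stand-in for wake_affine().

/*
 * Heavily simplified sketch (not the literal kernel code) of how the scan
 * target is chosen on the wakeup fast path.  wake_affine_pick() below is a
 * hypothetical stand-in for wake_affine().
 */
static int sketch_wakeup_target(struct task_struct *p, int prev_cpu)
{
	int cpu = smp_processor_id();	/* the waker's CPU */
	int target = prev_cpu;		/* default: the wakee's old CPU */

	/* Affine wakeup only when wake_wide() says the wakee fan-out is small. */
	if (!wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr))
		target = wake_affine_pick(cpu, prev_cpu);	/* waker's CPU or prev_cpu */

	/*
	 * select_idle_sibling()/select_idle_cpu() then scan around @target;
	 * with this series they try @target's cluster first, so the wakee
	 * lands near either the waker's fresh data or its own old cache.
	 */
	return target;
}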
> [1] https://lore.kernel.org/lkml/20210924085104.44806-1-21cnbao@gmail.com/
We are able to refer directly to the commit id here, as it has been mainlined.
[...]
Thanks
Barry