ARM64 server chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 CPUs. All clusters share L3 cache data, but each cluster has a local L3 tag. On the other hand, the CPUs within one cluster share some internal system bus. This means the cache coherence overhead inside one cluster is much less than the overhead across clusters.
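As a rough illustration (the sysfs attribute name below is an assumption based on the topology patch earlier in this series and may differ by kernel version), the span of one cluster can be read from userspace:

# expected to print 0-3 on kunpeng920, i.e. the 4 cpus of cluster0
cat /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list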
This patch adds the sched_domain for clusters. On kunpeng920, without this patch, domain0 of cpu0 would be MC, covering cpu0~cpu23; with this patch, MC becomes domain1, and a new domain0 "CLS" covers cpu0~cpu3.
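As a quick sanity check (a sketch, assuming CONFIG_SCHED_DEBUG is enabled; the files sit under procfs or debugfs depending on the kernel version), the resulting hierarchy can be dumped like:

# with this patch, domain0 of cpu0 is expected to be CLS and domain1 MC
cat /proc/sys/kernel/sched_domain/cpu0/domain*/name 2>/dev/null
cat /sys/kernel/debug/sched/domains/cpu0/domain*/name 2>/dev/null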
This will help spread tasks among clusters, thus decreasing contention and improving throughput.

Verified by Mel's mm-tests, by writing config files as below to define the number of threads for stream, e.g. configs/config-workload-stream-omp-4threads:

export STREAM_SIZE=$((1048576*512))
export STREAM_THREADS=4
export STREAM_METHOD=omp
export STREAM_ITERATIONS=5
export STREAM_BUILD_FLAGS="-lm -Ofast"
Ran the stream benchmark on kunpeng920 with 4 NUMA nodes, each with 24 cores, by commands like:

numactl -N 0 -m 0 ./run-mmtests.sh -c \
	configs/config-workload-stream-omp-4threads tip-sched-core-4threads
and compared the results between vanilla tip/sched/core and tip/sched/core with the cluster scheduler. The results are as below:
4threads stream (on 1numa * 24cores = 24cores)
                        stream                 stream
                      4threads                 4threads-cluster-scheduler
MB/sec copy     29929.64 (   0.00%)    32932.68 (  10.03%)
MB/sec scale    29861.10 (   0.00%)    32710.58 (   9.54%)
MB/sec add      27034.42 (   0.00%)    32400.68 (  19.85%)
MB/sec triad    27225.26 (   0.00%)    31965.36 (  17.41%)

6threads stream (on 1numa * 24cores = 24cores)
                        stream                 stream
                      6threads                 6threads-cluster-scheduler
MB/sec copy     40330.24 (   0.00%)    42377.68 (   5.08%)
MB/sec scale    40196.42 (   0.00%)    42197.90 (   4.98%)
MB/sec add      37427.00 (   0.00%)    41960.78 (  12.11%)
MB/sec triad    37841.36 (   0.00%)    42513.64 (  12.35%)

12threads stream (on 1numa * 24cores = 24cores)
                        stream                 stream
                     12threads                 12threads-cluster-scheduler
MB/sec copy     52639.82 (   0.00%)    53818.04 (   2.24%)
MB/sec scale    52350.30 (   0.00%)    53253.38 (   1.73%)
MB/sec add      53607.68 (   0.00%)    55198.82 (   2.97%)
MB/sec triad    54776.66 (   0.00%)    56360.40 (   2.89%)
The result was generated by commands like:

../../compare-kernels.sh --baseline tip-sched-core-4threads \
	--compare tip-sched-core-4threads-cluster-scheduler
Thus, this patch could help memory-bound workloads, especially under medium load. For example, running the mmtests configs/config-workload-lkp-compress benchmark on a kunpeng920 with 4numa * 24cores = 96 cores, the 12-, 21- and 30-thread cases show the best improvement:
lkp-pbzip2 (on 4numa * 24cores = 96cores)
                                   lkp                        lkp
                  compress-w/o-cluster        compress-w/-cluster
Hmean  tput-2   11062841.57 (   0.00%)   11341817.51 *   2.52%*
Hmean  tput-5   26815503.70 (   0.00%)   27412872.65 *   2.23%*
Hmean  tput-8   41873782.21 (   0.00%)   43326212.92 *   3.47%*
Hmean  tput-12  61875980.48 (   0.00%)   64578337.51 *   4.37%*
Hmean  tput-21 105814963.07 (   0.00%)  111381851.01 *   5.26%*
Hmean  tput-30 150349470.98 (   0.00%)  156507070.73 *   4.10%*
Hmean  tput-48 237195937.69 (   0.00%)  242353597.17 *   2.17%*
Hmean  tput-79 360252509.37 (   0.00%)  362635169.23 *   0.66%*
Hmean  tput-96 394571737.90 (   0.00%)  400952978.48 *   1.62%*
Ran the same benchmark with "numactl -N 0 -m 0", from 2 threads to 24 threads, on NUMA node0 which has 24 cores:
lkp-pbzip2 (on 1numa * 24cores = 24cores)
                                        lkp                          lkp
                  compress-1numa-w/o-cluster    compress-1numa-w/-cluster
Hmean  tput-2    11071705.49 (   0.00%)   11296869.10 *   2.03%*
Hmean  tput-4    20782165.19 (   0.00%)   21949232.15 *   5.62%*
Hmean  tput-6    30489565.14 (   0.00%)   33023026.96 *   8.31%*
Hmean  tput-8    40376495.80 (   0.00%)   42779286.27 *   5.95%*
Hmean  tput-12   61264033.85 (   0.00%)   62995632.78 *   2.83%*
Hmean  tput-18   86697139.39 (   0.00%)   86461545.74 (  -0.27%)
Hmean  tput-24  104854637.04 (   0.00%)  104522649.46 *  -0.32%*
In the cases of 6 and 8 threads, we see the greatest performance improvement.
A similar improvement was seen on lkp-pixz, though the gain is smaller:
lkp-pixz (on 1numa * 24cores = 24cores)
                                        lkp                          lkp
                  compress-1numa-w/o-cluster    compress-1numa-w/-cluster
Hmean  tput-2     6486981.16 (   0.00%)    6561515.98 *   1.15%*
Hmean  tput-4    11645766.38 (   0.00%)   11614628.43 (  -0.27%)
Hmean  tput-6    15429943.96 (   0.00%)   15957350.76 *   3.42%*
Hmean  tput-8    19974087.63 (   0.00%)   20413746.98 *   2.20%*
Hmean  tput-12   28172068.18 (   0.00%)   28751997.06 *   2.06%*
Hmean  tput-18   39413409.54 (   0.00%)   39896830.55 *   1.23%*
Hmean  tput-24   49101815.85 (   0.00%)   49418141.47 *   0.64%*
On the other hand, it is slightly helpful to CPU-bound tasks. With configs/config-workload-kernbench like:

export KERNBENCH_ITERATIONS=3
export KERNBENCH_MIN_THREADS=$((NUMCPUS/4))
export KERNBENCH_MAX_THREADS=$((NUMCPUS))
export KERNBENCH_CONFIG=allmodconfig
export KERNBENCH_TARGETS=vmlinux,modules
export KERNBENCH_SKIP_WARMUP=yes
export MMTESTS_THREAD_CUTOFF=
export KERNBENCH_VERSION=5.3

ran kernbench with 24, 48 and 96 threads to compile an entire kernel without numactl binding; each case ran 3 iterations:

24 threads w/o and w/ cluster-scheduler:
w/o: 10:03.26 10:00.46 10:01.09
w/ : 10:01.11 10:00.83  9:58.64
48 threads w/o and w/ cluster-scheduler:
w/o: 5:33.96 5:34.28 5:34.06
w/ : 5:32.65 5:32.57 5:33.25

96 threads w/o and w/ cluster-scheduler:
w/o: 3:33.34 3:31.22 3:31.31
w/ : 3:32.22 3:30.47 3:32.69
kernbench (on 4numa * 24cores = 96cores)
                           kernbench              kernbench
                         w/o-cluster             w/-cluster
Min    user-24    12054.67 (   0.00%)    12024.19 (   0.25%)
Min    syst-24     1751.51 (   0.00%)     1731.68 (   1.13%)
Min    elsp-24      600.46 (   0.00%)      598.64 (   0.30%)
Min    user-48    12361.93 (   0.00%)    12315.32 (   0.38%)
Min    syst-48     1917.66 (   0.00%)     1892.73 (   1.30%)
Min    elsp-48      333.96 (   0.00%)      332.57 (   0.42%)
Min    user-96    12922.40 (   0.00%)    12921.17 (   0.01%)
Min    syst-96     2143.94 (   0.00%)     2110.39 (   1.56%)
Min    elsp-96      211.22 (   0.00%)      210.47 (   0.36%)
Amean  user-24    12063.99 (   0.00%)    12030.78 *   0.28%*
Amean  syst-24     1755.20 (   0.00%)     1735.53 *   1.12%*
Amean  elsp-24      601.60 (   0.00%)      600.19 (   0.23%)
Amean  user-48    12362.62 (   0.00%)    12315.56 *   0.38%*
Amean  syst-48     1921.59 (   0.00%)     1894.95 *   1.39%*
Amean  elsp-48      334.10 (   0.00%)      332.82 *   0.38%*
Amean  user-96    12925.27 (   0.00%)    12922.63 (   0.02%)
Amean  syst-96     2146.66 (   0.00%)     2122.20 *   1.14%*
Amean  elsp-96      211.96 (   0.00%)      211.79 (   0.08%)
[ Hi Yicong,
  Is it possible for you to run a similar SPECrate mcf test with Tim
  and get some supportive data here?

  Thanks
  Barry ]
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 arch/arm64/Kconfig             | 7 +++++++
 include/linux/sched/topology.h | 7 +++++++
 include/linux/topology.h       | 7 +++++++
 kernel/sched/topology.c        | 5 +++++
 4 files changed, 26 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9f1d8566bbf9..3b54ea4e1bd7 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -999,6 +999,13 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config SCHED_CLUSTER
+	bool "Cluster scheduler support"
+	help
+	  Cluster scheduler support improves the CPU scheduler's decision
+	  making when dealing with machines that have clusters (sharing internal
+	  bus or sharing LLC cache tag). If unsure say N here.
+
 config SCHED_SMT
 	bool "SMT scheduler support"
 	help
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..2f9166f6dec8 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -42,6 +42,13 @@ static inline int cpu_smt_flags(void)
 }
 #endif
 
+#ifdef CONFIG_SCHED_CLUSTER
+static inline int cpu_cluster_flags(void)
+{
+	return SD_SHARE_PKG_RESOURCES;
+}
+#endif
+
 #ifdef CONFIG_SCHED_MC
 static inline int cpu_core_flags(void)
 {
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 80d27d717631..0b3704ad13c8 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -212,6 +212,13 @@ static inline const struct cpumask *cpu_smt_mask(int cpu)
 }
 #endif
 
+#if defined(CONFIG_SCHED_CLUSTER) && !defined(cpu_cluster_mask)
+static inline const struct cpumask *cpu_cluster_mask(int cpu)
+{
+	return topology_cluster_cpumask(cpu);
+}
+#endif
+
 static inline const struct cpumask *cpu_cpu_mask(int cpu)
 {
 	return cpumask_of_node(cpu_to_node(cpu));
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 55a0a243e871..c7523dc7aab7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1511,6 +1511,11 @@ static struct sched_domain_topology_level default_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
+
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
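
[ Not part of the patch: just a sketch of enabling the new option for a
  test build, assuming the in-tree scripts/config helper:

  ./scripts/config --enable CONFIG_SCHED_CLUSTER
  make olddefconfig ]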