ARM64 server chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 CPUs. All clusters share L3 cache data, but each cluster has a local L3 tag. On the other hand, the CPUs within one cluster share some internal system bus. This means the cache coherence overhead inside one cluster is much less than the overhead across clusters.
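As a rough illustration (the sysfs attribute name below is an assumption based on the topology patch earlier in this series and may differ by kernel version), the span of one cluster can be read from userspace:

# expected to print 0-3 on kunpeng920, i.e. the 4 cpus of cluster0
cat /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list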
This patch adds the sched_domain for clusters. On kunpeng920, without this patch, domain0 of cpu0 would be MC, covering cpu0~cpu23; with this patch, MC becomes domain1, and a new domain0 "CLS" covers cpu0~cpu3.
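As a quick sanity check (a sketch, assuming CONFIG_SCHED_DEBUG is enabled; the files sit under procfs or debugfs depending on the kernel version), the resulting hierarchy can be dumped like:

# with this patch, domain0 of cpu0 is expected to be CLS and domain1 MC
cat /proc/sys/kernel/sched_domain/cpu0/domain*/name 2>/dev/null
cat /sys/kernel/debug/sched/domains/cpu0/domain*/name 2>/dev/null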
This will help spread tasks among clusters, thus decreasing contention and improving throughput.

Verified by Mel's mm-tests, by writing config files as below to define the number of threads for stream, e.g. configs/config-workload-stream-omp-4threads:

export STREAM_SIZE=$((1048576*512))
export STREAM_THREADS=4
export STREAM_METHOD=omp
export STREAM_ITERATIONS=5
export STREAM_BUILD_FLAGS="-lm -Ofast"
Ran the stream benchmark on kunpeng920 with 4 NUMA nodes, each with 24 cores, by commands like:

numactl -N 0 -m 0 ./run-mmtests.sh -c \
	configs/config-workload-stream-omp-4threads tip-sched-core-4threads
and compared the results between vanilla tip/sched/core and tip/sched/core with the cluster scheduler. The results are as below:
4threads stream (on 1numa * 24cores = 24cores)
                        stream                 stream
                      4threads                 4threads-cluster-scheduler
MB/sec copy     29929.64 (   0.00%)    32932.68 (  10.03%)
MB/sec scale    29861.10 (   0.00%)    32710.58 (   9.54%)
MB/sec add      27034.42 (   0.00%)    32400.68 (  19.85%)
MB/sec triad    27225.26 (   0.00%)    31965.36 (  17.41%)

6threads stream (on 1numa * 24cores = 24cores)
                        stream                 stream
                      6threads                 6threads-cluster-scheduler
MB/sec copy     40330.24 (   0.00%)    42377.68 (   5.08%)
MB/sec scale    40196.42 (   0.00%)    42197.90 (   4.98%)
MB/sec add      37427.00 (   0.00%)    41960.78 (  12.11%)
MB/sec triad    37841.36 (   0.00%)    42513.64 (  12.35%)

12threads stream (on 1numa * 24cores = 24cores)
                        stream                 stream
                     12threads                 12threads-cluster-scheduler
MB/sec copy     52639.82 (   0.00%)    53818.04 (   2.24%)
MB/sec scale    52350.30 (   0.00%)    53253.38 (   1.73%)
MB/sec add      53607.68 (   0.00%)    55198.82 (   2.97%)
MB/sec triad    54776.66 (   0.00%)    56360.40 (   2.89%)
The result was generated by commands like:

../../compare-kernels.sh --baseline tip-sched-core-4threads \
	--compare tip-sched-core-4threads-cluster-scheduler
Thus, this patch could help memory-bound workloads, especially under medium load. For example, running the mmtests configs/config-workload-lkp-compress benchmark on a kunpeng920 with 4numa * 24cores = 96 cores, the 12-, 21- and 30-thread cases show the best improvement:
lkp-pbzip2 (on 4numa * 24cores = 96cores)
                                   lkp                        lkp
                  compress-w/o-cluster        compress-w/-cluster
Hmean  tput-2   11062841.57 (   0.00%)   11341817.51 *   2.52%*
Hmean  tput-5   26815503.70 (   0.00%)   27412872.65 *   2.23%*
Hmean  tput-8   41873782.21 (   0.00%)   43326212.92 *   3.47%*
Hmean  tput-12  61875980.48 (   0.00%)   64578337.51 *   4.37%*
Hmean  tput-21 105814963.07 (   0.00%)  111381851.01 *   5.26%*
Hmean  tput-30 150349470.98 (   0.00%)  156507070.73 *   4.10%*
Hmean  tput-48 237195937.69 (   0.00%)  242353597.17 *   2.17%*
Hmean  tput-79 360252509.37 (   0.00%)  362635169.23 *   0.66%*
Hmean  tput-96 394571737.90 (   0.00%)  400952978.48 *   1.62%*
Ran the same benchmark with "numactl -N 0 -m 0", from 2 threads to 24 threads, on NUMA node0 which has 24 cores:
lkp-pbzip2 (on 1numa * 24cores = 24cores)
                                        lkp                          lkp
                  compress-1numa-w/o-cluster    compress-1numa-w/-cluster
Hmean  tput-2    11071705.49 (   0.00%)   11296869.10 *   2.03%*
Hmean  tput-4    20782165.19 (   0.00%)   21949232.15 *   5.62%*
Hmean  tput-6    30489565.14 (   0.00%)   33023026.96 *   8.31%*
Hmean  tput-8    40376495.80 (   0.00%)   42779286.27 *   5.95%*
Hmean  tput-12   61264033.85 (   0.00%)   62995632.78 *   2.83%*
Hmean  tput-18   86697139.39 (   0.00%)   86461545.74 (  -0.27%)
Hmean  tput-24  104854637.04 (   0.00%)  104522649.46 *  -0.32%*
In the cases of 6 and 8 threads, we see the greatest performance improvement.
A similar improvement was seen on lkp-pixz, though the gain is smaller:
lkp-pixz (on 1numa * 24cores = 24cores)
                                        lkp                          lkp
                  compress-1numa-w/o-cluster    compress-1numa-w/-cluster
Hmean  tput-2     6486981.16 (   0.00%)    6561515.98 *   1.15%*
Hmean  tput-4    11645766.38 (   0.00%)   11614628.43 (  -0.27%)
Hmean  tput-6    15429943.96 (   0.00%)   15957350.76 *   3.42%*
Hmean  tput-8    19974087.63 (   0.00%)   20413746.98 *   2.20%*
Hmean  tput-12   28172068.18 (   0.00%)   28751997.06 *   2.06%*
Hmean  tput-18   39413409.54 (   0.00%)   39896830.55 *   1.23%*
Hmean  tput-24   49101815.85 (   0.00%)   49418141.47 *   0.64%*
On the other hand, it is slightly helpful to CPU-bound tasks. With configs/config-workload-kernbench like:

export KERNBENCH_ITERATIONS=3
export KERNBENCH_MIN_THREADS=$((NUMCPUS/4))
export KERNBENCH_MAX_THREADS=$((NUMCPUS))
export KERNBENCH_CONFIG=allmodconfig
export KERNBENCH_TARGETS=vmlinux,modules
export KERNBENCH_SKIP_WARMUP=yes
export MMTESTS_THREAD_CUTOFF=
export KERNBENCH_VERSION=5.3

ran kernbench with 24, 48 and 96 threads to compile an entire kernel without numactl binding; each case ran 3 iterations:

24 threads w/o and w/ cluster-scheduler:
w/o: 10:03.26 10:00.46 10:01.09
w/ : 10:01.11 10:00.83  9:58.64
48 threads w/o and w/ cluster-scheduler:
w/o: 5:33.96 5:34.28 5:34.06
w/ : 5:32.65 5:32.57 5:33.25

96 threads w/o and w/ cluster-scheduler:
w/o: 3:33.34 3:31.22 3:31.31
w/ : 3:32.22 3:30.47 3:32.69
kernbench (on 4numa * 24cores = 96cores)
                           kernbench              kernbench
                         w/o-cluster             w/-cluster
Min    user-24    12054.67 (   0.00%)    12024.19 (   0.25%)
Min    syst-24     1751.51 (   0.00%)     1731.68 (   1.13%)
Min    elsp-24      600.46 (   0.00%)      598.64 (   0.30%)
Min    user-48    12361.93 (   0.00%)    12315.32 (   0.38%)
Min    syst-48     1917.66 (   0.00%)     1892.73 (   1.30%)
Min    elsp-48      333.96 (   0.00%)      332.57 (   0.42%)
Min    user-96    12922.40 (   0.00%)    12921.17 (   0.01%)
Min    syst-96     2143.94 (   0.00%)     2110.39 (   1.56%)
Min    elsp-96      211.22 (   0.00%)      210.47 (   0.36%)
Amean  user-24    12063.99 (   0.00%)    12030.78 *   0.28%*
Amean  syst-24     1755.20 (   0.00%)     1735.53 *   1.12%*
Amean  elsp-24      601.60 (   0.00%)      600.19 (   0.23%)
Amean  user-48    12362.62 (   0.00%)    12315.56 *   0.38%*
Amean  syst-48     1921.59 (   0.00%)     1894.95 *   1.39%*
Amean  elsp-48      334.10 (   0.00%)      332.82 *   0.38%*
Amean  user-96    12925.27 (   0.00%)    12922.63 (   0.02%)
Amean  syst-96     2146.66 (   0.00%)     2122.20 *   1.14%*
Amean  elsp-96      211.96 (   0.00%)      211.79 (   0.08%)
[ Hi Yicong,
  Is it possible for you to run a similar SPECrate mcf test with Tim
  and get some supportive data here?

  Thanks
  Barry ]
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 arch/arm64/Kconfig             | 7 +++++++
 include/linux/sched/topology.h | 7 +++++++
 include/linux/topology.h       | 7 +++++++
 kernel/sched/topology.c        | 5 +++++
 4 files changed, 26 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9f1d8566bbf9..3b54ea4e1bd7 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -999,6 +999,13 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config SCHED_CLUSTER
+	bool "Cluster scheduler support"
+	help
+	  Cluster scheduler support improves the CPU scheduler's decision
+	  making when dealing with machines that have clusters (sharing internal
+	  bus or sharing LLC cache tag). If unsure say N here.
+
 config SCHED_SMT
 	bool "SMT scheduler support"
 	help
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..2f9166f6dec8 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -42,6 +42,13 @@ static inline int cpu_smt_flags(void)
 }
 #endif
 
+#ifdef CONFIG_SCHED_CLUSTER
+static inline int cpu_cluster_flags(void)
+{
+	return SD_SHARE_PKG_RESOURCES;
+}
+#endif
+
 #ifdef CONFIG_SCHED_MC
 static inline int cpu_core_flags(void)
 {
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 80d27d717631..0b3704ad13c8 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -212,6 +212,13 @@ static inline const struct cpumask *cpu_smt_mask(int cpu)
 }
 #endif
 
+#if defined(CONFIG_SCHED_CLUSTER) && !defined(cpu_cluster_mask)
+static inline const struct cpumask *cpu_cluster_mask(int cpu)
+{
+	return topology_cluster_cpumask(cpu);
+}
+#endif
+
 static inline const struct cpumask *cpu_cpu_mask(int cpu)
 {
 	return cpumask_of_node(cpu_to_node(cpu));
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 55a0a243e871..c7523dc7aab7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1511,6 +1511,11 @@ static struct sched_domain_topology_level default_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
+
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
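
[ Not part of the patch: just a sketch of enabling the new option for a
  test build, assuming the in-tree scripts/config helper:

  ./scripts/config --enable CONFIG_SCHED_CLUSTER
  make olddefconfig ]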