Re: [Linaro-open-discussions] [PATCH 2/3] scheduler: add scheduler level for clusters

21 Jun 2021

Hi Barry and Tim,
Here are some more updates for SPEC CPU. Most results are rather positive
and as expected.
For mcf_r alone, bound to NUMA node 0 (5 iterations):
[w/o]
    	min	max	aver	stddev
4 Copies	10.7	11.2	11	0.187082869
8 Copies	19.4	20.1	19.7	0.250998008
16 Copies	30.1	30.7	30.5	0.240831892
[w]
    	min	max	aver	stddev		aver. enhancement
4 Copies	11.3	11.4	11.3	0.054772256	2.73%
8 Copies	21.1	21.4	21.3	0.114017543	8.12%
16 Copies	30.5	30.8	30.7	0.130384048	0.66%
8 copies is the best case, as there are 8 clusters per NUMA node and the
threads can spread across the clusters well. For 4, 8 and 16 copies
the standard deviation is smaller with the cluster scheduler level,
which suggests less bouncing.
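For anyone rechecking the numbers: the "aver. enhancement" column above is
consistent with the relative change of the average rate, (w - w/o) / w/o.
A minimal C sketch of that arithmetic follows. The function name
enhancement_pct is made up for illustration; it is not part of SPEC CPU
or of the patch.

```c
#include <assert.h>

/*
 * Relative change of the rate measured with the patch ("with")
 * over the rate measured without it ("without"), in percent.
 * A positive value means the patched kernel scored a higher rate.
 * Hypothetical helper, for illustration only.
 */
double enhancement_pct(double without, double with)
{
    assert(without > 0.0);
    return (with - without) / without * 100.0;
}
```

For example, enhancement_pct(19.7, 21.3) is about 8.12, which matches the
8 copies row.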
For the whole intrate suite, without NUMA binding and with 32 copies (4 iterations):
[w/o]
    	Min	Max	Aver	Stddev
500.perlbench_r	86	87.1	86.7	0.496655481
502.gcc_r	89.9	91	90.25	0.519615242
505.mcf_r	71.5	73.6	72.375	0.970824392
520.omnetpp_r	40.3	41.2	40.6	0.40824829
523.xalancbmk_r	57.4	59.2	58.1	0.836660027
525.x264_r	197	198	197.5	0.577350269
531.deepsjeng_r	109	110	109.25	0.5
541.leela_r	95.4	95.6	95.525	0.095742711
548.exchange2_r	163	163	163	0
557.xz_r	64.2	65.2	64.85	0.450924975
Est. SPECrate2017_int_base              88.1
[w]
    	Min	Max	Aver	Stddev	aver. Enhancement
500.perlbench_r	87.3	87.9	87.625	0.2500	1.07%
502.gcc_r	93.1	95.3	94.5	0.9933	4.71%
505.mcf_r	77	81.4	79.075	2.3514	9.26%
520.omnetpp_r	42.8	43.5	43.15	0.3109	6.28%
523.xalancbmk_r	60.6	62	61.45	0.6455	5.77%
525.x264_r	197	198	197.75	0.5000	0.13%
531.deepsjeng_r	109	110	109.75	0.5000	0.46%
541.leela_r	95.5	95.6	95.55	0.0577	0.03%
548.exchange2_r	163	163	163	0.0000	0.00%
557.xz_r	66	66.5	66.3	0.2160	2.24%
Est. SPECrate2017_int_base              90.7 (+2.95%)
mcf_r performs even better. Although its stddev is larger
than in the w/o result, its minimum rate (77) is higher than
the maximum rate (73.6) of the w/o result.
Some benchmarks are not affected by the patch; I guess
they are CPU bound. We cannot increase the bandwidth
by placing these threads across clusters.
The test machines both have 2 sockets with 128 cores, 32 cores per NUMA
node and 4 cores per cluster.
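To make that layout concrete, here is a small userspace C sketch that maps
a CPU number to its cluster and NUMA node. It assumes CPUs are numbered
contiguously within a cluster and within a node (the numactl output quoted
further down shows node 0 holding cpus 0-31, but the per-cluster numbering
is an assumption). All names are hypothetical and unrelated to the kernel's
own topology helpers.

```c
#include <assert.h>

/* Assumed layout of the test machine: 128 cores in total,
 * 32 cores per NUMA node, 4 cores per cluster, contiguous numbering. */
enum {
    SKETCH_NR_CPUS = 128,
    SKETCH_CORES_PER_NODE = 32,
    SKETCH_CORES_PER_CLUSTER = 4
};

/* Index of the cluster a CPU belongs to (0..31 on this machine). */
int sketch_cluster_of(int cpu)
{
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
    return cpu / SKETCH_CORES_PER_CLUSTER;
}

/* Index of the NUMA node a CPU belongs to (0..3 on this machine). */
int sketch_node_of(int cpu)
{
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
    return cpu / SKETCH_CORES_PER_NODE;
}

/* Number of clusters inside one NUMA node. */
int sketch_clusters_per_node(void)
{
    return SKETCH_CORES_PER_NODE / SKETCH_CORES_PER_CLUSTER;
}

/* Non-zero if both CPUs sit in the same cluster. */
int sketch_same_cluster(int a, int b)
{
    return sketch_cluster_of(a) == sketch_cluster_of(b);
}
```

Under these assumptions sketch_clusters_per_node() is 8, which matches the
8 clusters per NUMA mentioned for the mcf_r runs.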
Thanks,
Yicong
On 2021/6/18 18:27, Yicong Yang wrote:
...
Hi Barry and Tim,
As Barry pointed out, I didn't enable CONFIG_SCHED_CLUSTER...
I'd like to share some updated results with the config correctly enabled.
I re-ran mcf_r with 4, 8 and 16 copies on NUMA 0; the results are:
   	Base     	 	Base
   	Run Time   	 	Rate
   	-------  	 	---------
4 Copies	w/o 580 (w 570)       	w/o 11.1 (w 11.3)
8 Copies	w/o 647 (w 605)       	w/o 20.0 (w 21.4)
16 Copies	w/o 844 (w 844)       	w/o 30.6 (w 30.6)
Seems there is a ~7% improvement for 8 Copies but little changed
for 4 and 16 copies.
This time from htop the tasks are spread through clusters well.
For the 4 copies I use
perf stat -e probe:active_load_balance_cpu_stop -- ./bin/runcpu default.cfg 505.mcf_r
to check the difference in 'active_load_balance_cpu_stop' with and
without the patch. There is no difference and the counts are both 0.
I also ran the whole intrate suite on another machine, the same model
as the one above. 32 copies were launched across the whole system without
binding to a specific NUMA node, and there seem to be some positive
results.
(x264_r is not included as there is a bug for x264 while compiling
with gcc 10: https://www.spec.org/cpu2017/Docs/benchmarks/625.x264_s.html
I'll fix this in the following test)
[w/o]
                 Base     Base        Base
Benchmarks       Copies  Run Time     Rate

500.perlbench_r      32        584       87.2  *
502.gcc_r            32        503       90.2  *
505.mcf_r            32        745       69.4  *
520.omnetpp_r        32       1031       40.7  *
523.xalancbmk_r      32        597       56.6  *
525.x264_r            1         --            CE
531.deepsjeng_r      32        336      109    *
541.leela_r          32        556       95.4  *
548.exchange2_r      32        513      163    *
557.xz_r             32        530       65.2  *
 Est. SPECrate2017_int_base              80.3
[w]
                  Base     Base        Base
Benchmarks       Copies  Run Time     Rate

500.perlbench_r      32        580      87.8 (+0.688%)  *
502.gcc_r            32        477      95.1 (+5.432%)  *
505.mcf_r            32        644      80.3 (+13.574%) *
520.omnetpp_r        32        942      44.6 (+9.58%)   *
523.xalancbmk_r      32        560      60.4 (+6.714%)  *
525.x264_r            1         --           CE
531.deepsjeng_r      32        337      109  (+0.000%) *
541.leela_r          32        554      95.6 (+0.210%) *
548.exchange2_r      32        515      163  (+0.000%) *
557.xz_r             32        524      66.0 (+1.227%) *
 Est. SPECrate2017_int_base              83.7 (+4.062%)
The test ran for only 1 iteration, and I'm going to increase it
to 5 to see the average result.
Thanks,
Yicong
On 2021/6/17 19:33, Yicong Yang wrote:
...
On 2021/6/16 17:36, Barry Song wrote:
...
ARM64 chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
cluster has 4 cpus. All clusters share L3 cache data, but each cluster
has local L3 tag. On the other hand, each cluster will share some
internal system bus. This means cache coherence overhead inside one
cluster is much less than the overhead across clusters.
This patch adds the sched_domain for clusters. On kunpeng 920, without
this patch, domain0 of cpu0 would be MC with cpu0~cpu23; with this
patch, MC becomes domain1, a new domain0 "CLS" including cpu0-cpu3.
This will help spread tasks among clusters, thus decrease the contention
and improve the throughput. Verified by Mel's mm-tests by writing config
files as below to define the number of threads for stream,e.g
configs/config-workload-stream-omp-4threads:
export STREAM_SIZE=$((1048576*512))
export STREAM_THREADS=4
export STREAM_METHOD=omp
export STREAM_ITERATIONS=5
export STREAM_BUILD_FLAGS="-lm -Ofast"
Ran the stream benchmark on kunpeng920 with 4 NUMA nodes, each with
24 cores, by commands like:
numactl -N 0 -m 0 ./run-mmtests.sh -c \
   configs/config-workload-stream-omp-4threads tip-sched-core-4threads
and compared the cases between tip/sched/core and tip/sched/core with
cluster scheduler. The result is as below:
4threads stream (on 1numa * 24cores = 24cores)
                            stream                    stream
                4threads               4threads-cluster-scheduler
MB/sec copy     29929.64 (   0.00%)    32932.68 (  10.03%)
MB/sec scale    29861.10 (   0.00%)    32710.58 (   9.54%)
MB/sec add      27034.42 (   0.00%)    32400.68 (  19.85%)
MB/sec triad    27225.26 (   0.00%)    31965.36 (  17.41%)
6threads stream (on 1numa * 24cores = 24cores)
                            stream                    stream
                6threads               6threads-cluster-scheduler
MB/sec copy     40330.24 (   0.00%)    42377.68 (   5.08%)
MB/sec scale    40196.42 (   0.00%)    42197.90 (   4.98%)
MB/sec add      37427.00 (   0.00%)    41960.78 (  12.11%)
MB/sec triad    37841.36 (   0.00%)    42513.64 (  12.35%)
12threads stream (on 1numa * 24cores = 24cores)
                             stream                  stream
               12threads               12threads-cluster-scheduler
MB/sec copy     52639.82 (   0.00%)    53818.04 (   2.24%)
MB/sec scale    52350.30 (   0.00%)    53253.38 (   1.73%)
MB/sec add      53607.68 (   0.00%)    55198.82 (   2.97%)
MB/sec triad    54776.66 (   0.00%)    56360.40 (   2.89%)
The result was generated by commands like:
../../compare-kernels.sh --baseline tip-sched-core-4threads \
   --compare tip-sched-core-4threads-cluster-scheduler
Thus, it could help memory-bound workload especially under medium load.
For example, ran mmtests configs/config-workload-lkp-compress benchmark
on 4numa*24cores=96 cores kunpeng920, 12,21,30 threads present the best
improvement:
lkp-pbzip2 (on 4numa * 24cores = 96cores)
                                 lkp                    lkp
                 compress-w/o-cluster     compress-w/-cluster

Hmean     tput-2   11062841.57 (   0.00%)  11341817.51 *   2.52%*
Hmean     tput-5   26815503.70 (   0.00%)  27412872.65 *   2.23%*
Hmean     tput-8   41873782.21 (   0.00%)  43326212.92 *   3.47%*
Hmean     tput-12  61875980.48 (   0.00%)  64578337.51 *   4.37%*
Hmean     tput-21 105814963.07 (   0.00%) 111381851.01 *   5.26%*
Hmean     tput-30 150349470.98 (   0.00%) 156507070.73 *   4.10%*
Hmean     tput-48 237195937.69 (   0.00%) 242353597.17 *   2.17%*
Hmean     tput-79 360252509.37 (   0.00%) 362635169.23 *   0.66%*
Hmean     tput-96 394571737.90 (   0.00%) 400952978.48 *   1.62%*
Ran the same benchmark by "numactl -N 0 -m 0" from 2 threads to
24 threads on numa node0 with 24 cores:
lkp-pbzip2 (on 1numa * 24cores = 24cores)
                                     lkp                    lkp
                 compress-1numa-w/o-cluster compress-1numa-w/-cluster
Hmean     tput-2   11071705.49 (   0.00%)  11296869.10 *   2.03%*
Hmean     tput-4   20782165.19 (   0.00%)  21949232.15 *   5.62%*
Hmean     tput-6   30489565.14 (   0.00%)  33023026.96 *   8.31%*
Hmean     tput-8   40376495.80 (   0.00%)  42779286.27 *   5.95%*
Hmean     tput-12  61264033.85 (   0.00%)  62995632.78 *   2.83%*
Hmean     tput-18  86697139.39 (   0.00%)  86461545.74 (  -0.27%)
Hmean     tput-24 104854637.04 (   0.00%) 104522649.46 *  -0.32%*
In the case of 6 threads and 8 threads, we see the greatest performance
improvement.
Similar improvement was seen on lkp-pixz though the improvement is
smaller:
lkp-pixz (on 1numa * 24cores = 24cores)
                                     lkp                    lkp
                 compress-1numa-w/o-cluster compress-1numa-w/-cluster
Hmean     tput-2   6486981.16 (   0.00%)  6561515.98 *   1.15%*
Hmean     tput-4  11645766.38 (   0.00%) 11614628.43 (  -0.27%)
Hmean     tput-6  15429943.96 (   0.00%) 15957350.76 *   3.42%*
Hmean     tput-8  19974087.63 (   0.00%) 20413746.98 *   2.20%*
Hmean     tput-12 28172068.18 (   0.00%) 28751997.06 *   2.06%*
Hmean     tput-18 39413409.54 (   0.00%) 39896830.55 *   1.23%*
Hmean     tput-24 49101815.85 (   0.00%) 49418141.47 *   0.64%*
On the other hand, it is slightly helpful to cpu-bound tasks.
With configs/config-workload-kernbench like:
export KERNBENCH_ITERATIONS=3
export KERNBENCH_MIN_THREADS=$((NUMCPUS/4))
export KERNBENCH_MAX_THREADS=$((NUMCPUS))
export KERNBENCH_CONFIG=allmodconfig
export KERNBENCH_TARGETS=vmlinux,modules
export KERNBENCH_SKIP_WARMUP=yes
export MMTESTS_THREAD_CUTOFF=
export KERNBENCH_VERSION=5.3
Ran kernbench with 24, 48 and 96 threads to compile an entire kernel
without numactl binding, 3 iterations per case:
24 threads w/o and w/ cluster-scheduler:
w/o 10:03.26 10:00.46 10:01.09
w/  10:01.11 10:00.83 9:58.64
48 threads w/o and w/ cluster-scheduler:
w/o 5:33.96 5:34.28 5:34.06
w/  5:32.65 5:32.57 5:33.25
96 threads w/o and w/ cluster-scheduler:
w/o 3:33.34 3:31.22 3:31.31
w/  3:32.22 3:30.47 3:32.69
kernbench (on 4numa * 24cores = 96cores)
                               kernbench              kernbench
                              w/o-cluster              w/-cluster
Min       user-24    12054.67 (   0.00%)    12024.19 (   0.25%)
Min       syst-24     1751.51 (   0.00%)     1731.68 (   1.13%)
Min       elsp-24      600.46 (   0.00%)      598.64 (   0.30%)
Min       user-48    12361.93 (   0.00%)    12315.32 (   0.38%)
Min       syst-48     1917.66 (   0.00%)     1892.73 (   1.30%)
Min       elsp-48      333.96 (   0.00%)      332.57 (   0.42%)
Min       user-96    12922.40 (   0.00%)    12921.17 (   0.01%)
Min       syst-96     2143.94 (   0.00%)     2110.39 (   1.56%)
Min       elsp-96      211.22 (   0.00%)      210.47 (   0.36%)
Amean     user-24    12063.99 (   0.00%)    12030.78 *   0.28%*
Amean     syst-24     1755.20 (   0.00%)     1735.53 *   1.12%*
Amean     elsp-24      601.60 (   0.00%)      600.19 (   0.23%)
Amean     user-48    12362.62 (   0.00%)    12315.56 *   0.38%*
Amean     syst-48     1921.59 (   0.00%)     1894.95 *   1.39%*
Amean     elsp-48      334.10 (   0.00%)      332.82 *   0.38%*
Amean     user-96    12925.27 (   0.00%)    12922.63 (   0.02%)
Amean     syst-96     2146.66 (   0.00%)     2122.20 *   1.14%*
Amean     elsp-96      211.96 (   0.00%)      211.79 (   0.08%)
[
Hi Yicong,
Is it possible for you to run a SPECrate mcf test similar to
Tim's and get some supportive data here?
For our Kunpeng 920, I first ran the whole intrate suite with 32 copies,
without binding to a NUMA node. Here's the result:
                 Base     Base        		Base
Benchmarks       Copies  Run Time     		Rate

500.perlbench_r      32  w/o 580(w 578)       	w/o 87.8(w 88.2)  *
502.gcc_r            32  w/o 500(w 504)       	w/o 90.5(w 90.0)  *
505.mcf_r            32  w/o 764(w 767)       	w/o 67.7(w 67.4)  *
520.omnetpp_r        32  w/o 1030(w 1024)       w/o 40.7(w 41.0)  *
523.xalancbmk_r      32  w/o 584(w 584)       	w/o 57.9(w 57.8)  *
525.x264_r           32  w/o 285(w 284)      	w/o 196 (w 197)   *
531.deepsjeng_r      32  w/o 336(w 338)      	w/o 109 (w 108)   *
541.leela_r          32  w/o 570(w 569)       	w/o 93.0(w 93.1)  *
548.exchange2_r      32  w/o 526(w 532)      	w/o 160 (w 157)   *
557.xz_r             32  w/o 538(w 542)       	w/o 64.2(w 63.8)  *
 Est. SPECrate2017_int_base              w/o 87.4(w 87.2)
(w/o is without the patch, the bigger the rate is the better)
Then I tested mcf_r alone with different numbers of copies, bound to NUMA 0:
   	Base     	 Base
   	Run Time   	 Rate
   	-------  	 ---------
4 Copies	w/o 618 (w 580)  w/o 10.5 (w 11.1)
8 Copies	w/o 645 (w 647)	 w/o 20.0 (w 20)
16 Copies	w/o 849 (w 844)	 w/o 30.4 (w 30.6)
As I checked from htop, the tasks running on the CPUs didn't spread
across the clusters strictly.
I didn't apply patch #3 as I hit some conflicts and didn't try to resolve
them. As we're testing on arm64 I think it's okay to test without patch #3.
The machine I tested has 128 cores in 2 sockets and 4 NUMA nodes with 32
cores each. Of course, there are still 4 cores in one cluster. Below is the
memory info per NUMA node:
   available: 4 nodes (0-3)
   node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
   28 29 30 31
   node 0 size: 257190 MB
   node 0 free: 254203 MB
   node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
   57 58 59 60 61 62 63
   node 1 size: 258005 MB
   node 1 free: 257191 MB
   node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
   89 90 91 92 93 94 95
   node 2 size: 96763 MB
   node 2 free: 96158 MB
   node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114
   115 116 117 118 119 120 121 122 123 124 125 126 127
   node 3 size: 127540 MB
   node 3 free: 126922 MB
   node distances:
   node   0   1   2   3
     0:  10  12  20  22
     1:  12  10  22  24
     2:  20  22  10  12
     3:  22  24  12  10



Any comments? I noticed Tim observed that sleep and wakeup have some influence, so I wonder
whether the speccpu intrate test also suffers from this.
Thanks,
Yicong
...
Thanks
Barry
]
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 arch/arm64/Kconfig             | 7 +++++++
 include/linux/sched/topology.h | 7 +++++++
 include/linux/topology.h       | 7 +++++++
 kernel/sched/topology.c        | 5 +++++
 4 files changed, 26 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9f1d8566bbf9..3b54ea4e1bd7 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -999,6 +999,13 @@ config SCHED_MC
     making when dealing with multi-core CPU chips at a cost of slightly
     increased overhead in some places. If unsure say N here.
 
+config SCHED_CLUSTER
+	bool "Cluster scheduler support"
+	help
+	  Cluster scheduler support improves the CPU scheduler's decision
+	  making when dealing with machines that have clusters(sharing internal
+	  bus or sharing LLC cache tag). If unsure say N here.
+
 config SCHED_SMT
   bool "SMT scheduler support"
   help
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..2f9166f6dec8 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -42,6 +42,13 @@ static inline int cpu_smt_flags(void)
 }
 #endif
 
+#ifdef CONFIG_SCHED_CLUSTER
+static inline int cpu_cluster_flags(void)
+{
+	return SD_SHARE_PKG_RESOURCES;
+}
+#endif
+
 #ifdef CONFIG_SCHED_MC
 static inline int cpu_core_flags(void)
 {
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 80d27d717631..0b3704ad13c8 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -212,6 +212,13 @@ static inline const struct cpumask *cpu_smt_mask(int cpu)
 }
 #endif
 
+#if defined(CONFIG_SCHED_CLUSTER) && !defined(cpu_cluster_mask)
+static inline const struct cpumask *cpu_cluster_mask(int cpu)
+{
+	return topology_cluster_cpumask(cpu);
+}
+#endif
+
 static inline const struct cpumask *cpu_cpu_mask(int cpu)
 {
   return cpumask_of_node(cpu_to_node(cpu));
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 55a0a243e871..c7523dc7aab7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1511,6 +1511,11 @@ static struct sched_domain_topology_level default_topology[] = {
 #ifdef CONFIG_SCHED_SMT
   { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
+
 #ifdef CONFIG_SCHED_MC
   { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
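
As a rough illustration of why the position of the CLS entry matters, the
userspace C sketch below models an ordered level table in which the first
entry becomes domain0 and the next one domain1. It is not kernel code:
struct cpu_span, struct sketch_level and the span functions are
hypothetical stand-ins, the 4 and 24 CPU spans are taken from the Kunpeng
920 description in the commit message, contiguous CPU numbering is assumed,
and the SMT level is left out because the commit message names MC as
domain0 before the patch.

```c
#include <assert.h>

/* Inclusive range of CPU numbers covered by one domain. */
struct cpu_span {
    int first;
    int last;
};

/* One level of the sketch table: a name and a span function. */
struct sketch_level {
    const char *name;
    struct cpu_span (*span)(int cpu);
};

/* Cluster level: assumed 4 contiguous CPUs per cluster. */
struct cpu_span cls_span(int cpu)
{
    struct cpu_span s;

    assert(cpu >= 0);
    s.first = cpu / 4 * 4;
    s.last = s.first + 3;
    return s;
}

/* MC level: assumed 24 contiguous CPUs per NUMA node. */
struct cpu_span mc_span(int cpu)
{
    struct cpu_span s;

    assert(cpu >= 0);
    s.first = cpu / 24 * 24;
    s.last = s.first + 23;
    return s;
}

/* Smallest span first, so CLS sits below MC. */
const struct sketch_level sketch_levels[] = {
    { "CLS", cls_span },
    { "MC", mc_span },
};

/* Span of the domain at the given depth for the given CPU. */
struct cpu_span domain_span(int depth, int cpu)
{
    assert(depth >= 0 && depth < 2);
    return sketch_levels[depth].span(cpu);
}
```

In this model domain_span(0, 0) covers cpu0-cpu3 and domain_span(1, 0)
covers cpu0-cpu23, which mirrors the domain0/domain1 description in the
commit message.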