On 2021/6/25 1:56, Tim Chen wrote:
On 6/18/21 3:27 AM, Yicong Yang wrote:
Hi Barry and Tim,
As Barry pointed out, I didn't enable CONFIG_SCHED_CLUSTER... I'd like to share some updated results with the config correctly enabled.
I re-ran mcf_r with 4, 8 and 16 copies on NUMA node 0. The results are:

             Base Run Time        Base Rate
             -----------------    -----------------
4  Copies    w/o 580 (w 570)      w/o 11.1 (w 11.3)
8  Copies    w/o 647 (w 605)      w/o 20.0 (w 21.4)
16 Copies    w/o 844 (w 844)      w/o 30.6 (w 30.6)
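For anyone reproducing this, a quick sanity check that the cluster domain really is active might look like the sketch below. This is my own suggestion rather than part of the test above, and the debugfs path assumes a recent kernel with CONFIG_SCHED_DEBUG (older kernels expose the same names under /proc/sys/kernel/sched_domain/):

# confirm the option is built into the running kernel
grep CONFIG_SCHED_CLUSTER /boot/config-$(uname -r)
# with CONFIG_SCHED_DEBUG, the per-CPU domain hierarchy should include a CLS level below MC
cat /sys/kernel/debug/sched/domains/cpu0/domain*/name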
Barry and Yicong,
Our benchmark team helped me run mcf on a Jacobsville that has 24 Atom cores, arranged into 6 clusters of 4 cores each. Their numbers are as follows:
Improvement over baseline kernel for mcf_r

copies    run time    base rate
   1       -0.1%        -0.2%
   6       25.1%        25.1%
  12       18.8%        19.0%
  24        0.3%         0.3%
So this looks pretty good. It is even better than what Yicong saw. I probed into their system's task distribution, and saw some pretty bad clumping for the vanilla kernel without the L2 cluster domain in the 6 and 12 copy cases. With the extra cluster domain, the load does get evened out between the clusters.
The load balancing helps a lot at moderate load point for mcf. As expected, there is little change in performance for the single copy case and the fully loaded 24 copies case.
It seems there is a ~7% improvement for 8 copies but little change for 4 and 16 copies.
This time, htop shows the tasks are spread well across the clusters.
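As an aside, and not something from the original test: one quick way to double-check that placement, assuming the cluster_id topology attribute from the cluster patches is exposed under sysfs on the test kernel and that the benchmark process name contains "mcf_r", is a small shell loop like this sketch:

for pid in $(pgrep mcf_r); do
    # psr is the CPU the task last ran on; cluster_id maps that CPU to its cluster
    cpu=$(ps -o psr= -p "$pid" | tr -d ' ')
    cluster=$(cat /sys/devices/system/cpu/cpu${cpu}/topology/cluster_id)
    echo "pid $pid -> cpu $cpu (cluster $cluster)"
done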
For the 4-copy case I used
perf stat -e probe:active_load_balance_cpu_stop -- ./bin/runcpu default.cfg 505.mcf_r
to check the difference in 'active_load_balance_cpu_stop' counts with and without the patch. There is no difference; the count is 0 in both cases.
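(Just for completeness, and an assumption on my part rather than something stated above: the probe event would first need to be created with perf probe, roughly as below.)

# create the kprobe on the kernel function, count it around the run, then remove it again
perf probe --add active_load_balance_cpu_stop
perf stat -e probe:active_load_balance_cpu_stop -- ./bin/runcpu default.cfg 505.mcf_r
perf probe --del active_load_balance_cpu_stop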
I also ran the whole intrate suite on another machine of the same model as the one above. 32 copies were launched across the whole system without binding to a specific NUMA node, and there seem to be some positive results.
(x264_r is not included as there is a bug in x264 when compiling with gcc 10: https://www.spec.org/cpu2017/Docs/benchmarks/625.x264_s.html I'll fix this in a following test)
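(The exact runcpu invocation isn't shown here; a typical whole-suite rate run with 32 copies and no NUMA binding would be roughly the command below, with the config file name being a placeholder, not something taken from the mail.)

# illustrative only -- config name is a placeholder
./bin/runcpu --config=default.cfg --copies=32 --tune=base intrate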
[w/o]
                            Base     Base      Base
Benchmarks                 Copies  Run Time    Rate
-------------------------  ------  --------  --------
500.perlbench_r                32      584      87.2  *
502.gcc_r                      32      503      90.2  *
505.mcf_r                      32      745      69.4  *
520.omnetpp_r                  32     1031      40.7  *
523.xalancbmk_r                32      597      56.6  *
525.x264_r                      1       --        CE
531.deepsjeng_r                32      336       109  *
541.leela_r                    32      556      95.4  *
548.exchange2_r                32      513       163  *
557.xz_r                       32      530      65.2  *
 Est. SPECrate2017_int_base                     80.3
[w]
                            Base     Base      Base
Benchmarks                 Copies  Run Time    Rate
-------------------------  ------  --------  ------------------
500.perlbench_r                32      580      87.8 (+0.688%)   *
502.gcc_r                      32      477      95.1 (+5.432%)   *
505.mcf_r                      32      644      80.3 (+13.574%)  *
520.omnetpp_r                  32      942      44.6 (+9.58%)    *
523.xalancbmk_r                32      560      60.4 (+6.714%)   *
525.x264_r                      1       --        CE
531.deepsjeng_r                32      337       109 (+0.000%)   *
541.leela_r                    32      554      95.6 (+0.210%)   *
548.exchange2_r                32      515       163 (+0.000%)   *
557.xz_r                       32      524      66.0 (+1.227%)   *
 Est. SPECrate2017_int_base                     83.7 (+4.062%)
You have 24 cores, right? So this is the case with a little bit of overload? It is nice that we are also seeing improvement here.
On this machine there are 4 NUMA nodes with 32 cores each. We have two types of machines: Barry's result is from the 24-cores-per-NUMA machine and mine is from the 32-cores-per-NUMA machine.
Tim