On 2021/6/25 1:56, Tim Chen wrote:
On 6/18/21 3:27 AM, Yicong Yang wrote:
Hi Barry and Tim,
As Barry pointed out, I didn't enable CONFIG_SCHED_CLUSTER... I'd like to share some updated results with the config correctly enabled.
I re-ran mcf_r with 4, 8 and 16 copies on NUMA node 0. The results are:

             Base Run Time        Base Rate
             -----------------    -----------------
4  Copies    w/o 580 (w 570)      w/o 11.1 (w 11.3)
8  Copies    w/o 647 (w 605)      w/o 20.0 (w 21.4)
16 Copies    w/o 844 (w 844)      w/o 30.6 (w 30.6)
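For anyone reproducing this, a quick sanity check that the cluster domain really is active might look like the sketch below. This is my own suggestion rather than part of the test above, and the debugfs path assumes a recent kernel with CONFIG_SCHED_DEBUG (older kernels expose the same names under /proc/sys/kernel/sched_domain/):

# confirm the option is built into the running kernel
grep CONFIG_SCHED_CLUSTER /boot/config-$(uname -r)
# with CONFIG_SCHED_DEBUG, the per-CPU domain hierarchy should include a CLS level below MC
cat /sys/kernel/debug/sched/domains/cpu0/domain*/name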
Barry and Yicong,
Our benchmark team helped me run mcf on a Jacobsville that has 24 Atom cores, arranged into 6 clusters of 4 cores each. Their numbers are as follows:
Improvement over baseline kernel for mcf_r

copies    run time    base rate
   1       -0.1%        -0.2%
   6       25.1%        25.1%
  12       18.8%        19.0%
  24        0.3%         0.3%
So this looks pretty good. It is even better than what Yicong saw. I probed into their system's task distribution, and saw some pretty bad clumping for the vanilla kernel without the L2 cluster domain in the 6 and 12 copy cases. With the extra cluster domain, the load does get evened out between the clusters.
The load balancing helps a lot at moderate load point for mcf. As expected, there is little change in performance for the single copy case and the fully loaded 24 copies case.
It seems there is a ~7% improvement for 8 copies but little change for 4 and 16 copies.
This time, htop shows the tasks are spread well across the clusters.
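As an aside, and not something from the original test: one quick way to double-check that placement, assuming the cluster_id topology attribute from the cluster patches is exposed under sysfs on the test kernel and that the benchmark process name contains "mcf_r", is a small shell loop like this sketch:

for pid in $(pgrep mcf_r); do
    # psr is the CPU the task last ran on; cluster_id maps that CPU to its cluster
    cpu=$(ps -o psr= -p "$pid" | tr -d ' ')
    cluster=$(cat /sys/devices/system/cpu/cpu${cpu}/topology/cluster_id)
    echo "pid $pid -> cpu $cpu (cluster $cluster)"
done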
For the 4-copy case I used
perf stat -e probe:active_load_balance_cpu_stop -- ./bin/runcpu default.cfg 505.mcf_r
to check the difference in 'active_load_balance_cpu_stop' counts with and without the patch. There is no difference; the count is 0 in both cases.
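(Just for completeness, and an assumption on my part rather than something stated above: the probe event would first need to be created with perf probe, roughly as below.)

# create the kprobe on the kernel function, count it around the run, then remove it again
perf probe --add active_load_balance_cpu_stop
perf stat -e probe:active_load_balance_cpu_stop -- ./bin/runcpu default.cfg 505.mcf_r
perf probe --del active_load_balance_cpu_stop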
I also ran the whole intrate suite on another machine of the same model as the one above. 32 copies were launched across the whole system without binding to a specific NUMA node, and there seem to be some positive results.
(x264_r is not included as there is a bug in x264 when compiling with gcc 10: https://www.spec.org/cpu2017/Docs/benchmarks/625.x264_s.html I'll fix this in a following test)
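(The exact runcpu invocation isn't shown here; a typical whole-suite rate run with 32 copies and no NUMA binding would be roughly the command below, with the config file name being a placeholder, not something taken from the mail.)

# illustrative only -- config name is a placeholder
./bin/runcpu --config=default.cfg --copies=32 --tune=base intrate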
[w/o]
                            Base     Base      Base
Benchmarks                 Copies  Run Time    Rate
-------------------------  ------  --------  --------
500.perlbench_r                32      584      87.2  *
502.gcc_r                      32      503      90.2  *
505.mcf_r                      32      745      69.4  *
520.omnetpp_r                  32     1031      40.7  *
523.xalancbmk_r                32      597      56.6  *
525.x264_r                      1       --        CE
531.deepsjeng_r                32      336       109  *
541.leela_r                    32      556      95.4  *
548.exchange2_r                32      513       163  *
557.xz_r                       32      530      65.2  *
 Est. SPECrate2017_int_base                     80.3
[w]
                            Base     Base      Base
Benchmarks                 Copies  Run Time    Rate
-------------------------  ------  --------  ------------------
500.perlbench_r                32      580      87.8 (+0.688%)   *
502.gcc_r                      32      477      95.1 (+5.432%)   *
505.mcf_r                      32      644      80.3 (+13.574%)  *
520.omnetpp_r                  32      942      44.6 (+9.58%)    *
523.xalancbmk_r                32      560      60.4 (+6.714%)   *
525.x264_r                      1       --        CE
531.deepsjeng_r                32      337       109 (+0.000%)   *
541.leela_r                    32      554      95.6 (+0.210%)   *
548.exchange2_r                32      515       163 (+0.000%)   *
557.xz_r                       32      524      66.0 (+1.227%)   *
 Est. SPECrate2017_int_base                     83.7 (+4.062%)
You have 24 cores, right? So this is the case with a little bit of overload? It is nice that we are also seeing improvement here.
On this machine there are 4 NUMA nodes with 32 cores each. We have two types of machines: Barry's result is from the 24-cores-per-NUMA machine and mine is from the 32-cores-per-NUMA machine.
Tim