On 6/18/21 3:27 AM, Yicong Yang wrote:
Hi Barry and Tim,
As Barry pointed out, I didn't enable CONFIG_SCHED_CLUSTER... I'd like to share some updated results with the config correctly enabled.
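(A quick way to double-check that the option is actually on in the running kernel, assuming the config is exposed via /boot or CONFIG_IKCONFIG_PROC:

  grep CONFIG_SCHED_CLUSTER /boot/config-$(uname -r)
  # or, if /proc/config.gz is available:
  zcat /proc/config.gz | grep CONFIG_SCHED_CLUSTER

Either should print CONFIG_SCHED_CLUSTER=y.)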
I re-ran mcf_r with 4, 8 and 16 copies on NUMA node 0; the results are:

                 Base Run Time          Base Rate
             -------------------   -------------------
  4 Copies   w/o 580  (w 570)      w/o 11.1  (w 11.3)
  8 Copies   w/o 647  (w 605)      w/o 20.0  (w 21.4)
 16 Copies   w/o 844  (w 844)      w/o 30.6  (w 30.6)
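(Something like the following, with numactl, is one way to bind such a run to node 0; the runcpu flags here are illustrative rather than exactly what I used:

  numactl --cpunodebind=0 --membind=0 ./bin/runcpu --config=default.cfg --copies=4 505.mcf_r

The same invocation with --copies=8 and --copies=16 covers the other two points.)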
Barry and Yicong,
Our benchmark team helped me run mcf on a Jacobsville that has 24 Atom cores, arranged into 6 clusters of 4 cores each. The numbers they reported are as follows:
Improvement over baseline kernel for mcf_r

copies    run time    base rate
   1       -0.1%        -0.2%
   6       25.1%        25.1%
  12       18.8%        19.0%
  24        0.3%         0.3%
So this looks pretty good. It is even better than what Yicong saw. I probed into their system's task distribution, and saw some pretty bad clumping for the vanilla kernel without the L2 cluster domain in the 6 and 12 copies cases. With the extra domain for cluster, the load does get evened out between the clusters.
The load balancing helps a lot at moderate load point for mcf. As expected, there is little change in performance for the single copy case and the fully loaded 24 copies case.
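In case anyone wants to reproduce the placement check, a rough way to see where the mcf tasks land is something like the following (the cluster_cpus_list sysfs file is the one added by this series, so the exact path is an assumption on my side):

  # last-run CPU of each mcf task
  ps -eo pid,psr,comm | grep mcf_r
  # CPUs sharing a cluster with e.g. CPU 0
  cat /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list

Mapping the psr column onto the per-cluster CPU lists shows whether the copies are clumping onto a few clusters or spread out.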
It seems there is a ~7% improvement for 8 copies but little change for 4 and 16 copies.
This time, htop shows the tasks are spread across the clusters well.
For the 4 copies case I used
perf stat -e probe:active_load_balance_cpu_stop -- ./bin/runcpu default.cfg 505.mcf_r
to check the difference in 'active_load_balance_cpu_stop' counts with and without the patch. There is no difference; the counts are both 0.
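(Since active_load_balance_cpu_stop is not a predefined event, the dynamic probe has to be created first; roughly:

  perf probe --add active_load_balance_cpu_stop
  perf stat -e probe:active_load_balance_cpu_stop -- ./bin/runcpu default.cfg 505.mcf_r
  perf probe --del active_load_balance_cpu_stop

This is just a sketch of the setup I mean; whether kprobes can attach to that function depends on the kernel build.)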
I also ran the whole intrate suite on another machine of the same model as the one above. 32 copies were launched across the whole system without binding to a specific NUMA node, and there seem to be some positive results.
(x264_r is not included as there is a known bug when compiling x264 with gcc 10: https://www.spec.org/cpu2017/Docs/benchmarks/625.x264_s.html. I'll fix this in the following test.)
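(For reference, a whole-suite run of this kind can be launched without any numactl binding along the lines of

  ./bin/runcpu --config=default.cfg --copies=32 intrate

though the exact flags here are illustrative rather than the ones I actually used.)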
[w/o]
                      Base       Base        Base
Benchmarks           Copies    Run Time      Rate
---------------     -------   ---------   --------
500.perlbench_r        32        584         87.2  *
502.gcc_r              32        503         90.2  *
505.mcf_r              32        745         69.4  *
520.omnetpp_r          32       1031         40.7  *
523.xalancbmk_r        32        597         56.6  *
525.x264_r              1         --           CE
531.deepsjeng_r        32        336        109    *
541.leela_r            32        556         95.4  *
548.exchange2_r        32        513        163    *
557.xz_r               32        530         65.2  *
 Est. SPECrate2017_int_base                  80.3
[w]
                      Base       Base        Base
Benchmarks           Copies    Run Time      Rate
---------------     -------   ---------   --------
500.perlbench_r        32        580         87.8  (+0.688%)   *
502.gcc_r              32        477         95.1  (+5.432%)   *
505.mcf_r              32        644         80.3  (+13.574%)  *
520.omnetpp_r          32        942         44.6  (+9.58%)    *
523.xalancbmk_r        32        560         60.4  (+6.714%)   *
525.x264_r              1         --           CE
531.deepsjeng_r        32        337        109    (+0.000%)   *
541.leela_r            32        554         95.6  (+0.210%)   *
548.exchange2_r        32        515        163    (+0.000%)   *
557.xz_r               32        524         66.0  (+1.227%)   *
 Est. SPECrate2017_int_base                  83.7  (+4.062%)
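(As a sanity check on the "Est." lines: the estimated SPECrate2017_int_base should be the geometric mean of the per-benchmark base rates, with the missing x264_r excluded. That can be verified with, e.g.:

  printf '87.2 90.2 69.4 40.7 56.6 109 95.4 163 65.2\n' | \
      awk '{ s = 0; for (i = 1; i <= NF; i++) s += log($i); print exp(s / NF) }'

which prints about 80.3 for the w/o run, matching the table above.)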
You have 24 cores, right? So this is a case with a little bit of overload? It is nice we are also seeing improvement here.
Tim