On 2021/11/4 20:12, Barry Song wrote:
On Thu, Nov 4, 2021 at 11:39 PM Barry Song <21cnbao@gmail.com> wrote:
On Thu, Oct 28, 2021 at 9:18 PM Yicong Yang <yangyicong@hisilicon.com> wrote:
From: Barry Song <song.bao.hua@hisilicon.com>
For platforms with clusters, such as Kunpeng 920, tasks in the same cluster share the L3 cache tag and thus have lower latency when synchronizing and accessing shared resources. Based on this, this patch changes the CPU at which scanning begins in select_idle_cpu() from the CPU after the target to the first CPU of the target's cluster. The search then covers the cluster first, so we have a better chance of waking the wakee in the same cluster as the waker.
Benchmark tests have been done on a 2-socket, 4-NUMA-node Kunpeng 920 with 8 clusters per NUMA node, both across the whole machine and on NUMA node 0 alone. Improvements are observed in most cases compared to 5.15-rc1 with the cluster scheduler level [1].
hackbench-process-pipes
                 5.15-rc1+cluster      5.15-rc1+cluster+patch
Amean     1        0.6136 (   0.00%)      0.5988 (   2.41%)
Amean     4        0.8380 (   0.00%)      0.8904 *  -6.25%*
Amean     7        1.1661 (   0.00%)      1.1017 *   5.52%*
Amean     12       1.4670 (   0.00%)      1.5994 *  -9.03%*
Amean     21       2.8909 (   0.00%)      2.8640 (   0.93%)
Amean     30       4.3943 (   0.00%)      4.2052 (   4.30%)
Amean     48       6.6870 (   0.00%)      6.4079 (   4.17%)
Amean     79      10.4796 (   0.00%)      9.5507 *   8.86%*
Amean     110     14.5310 (   0.00%)     12.2114 *  15.96%*
Amean     141     16.4772 (   0.00%)     14.1517 *  14.11%*
Amean     172     20.0868 (   0.00%)     15.9852 *  20.42%*
Amean     203     22.9282 (   0.00%)     18.4574 *  19.50%*
Amean     234     25.8139 (   0.00%)     20.4725 *  20.69%*
Amean     256     27.6834 (   0.00%)     22.9076 *  17.25%*
tbench4
                 5.15-rc1+cluster      5.15-rc1+cluster+patch
Hmean     1      338.50 (   0.00%)      345.47 *   2.06%*
Hmean     2      672.20 (   0.00%)      695.10 *   3.41%*
Hmean     4     1329.03 (   0.00%)     1357.40 *   2.14%*
Hmean     8     2513.25 (   0.00%)     2419.88 *  -3.71%*
Hmean     16    4957.39 (   0.00%)     4882.04 *  -1.52%*
Hmean     32    8737.07 (   0.00%)     8649.97 *  -1.00%*
Hmean     64    4929.31 (   0.00%)     6570.13 *  33.29%*
Hmean     128   5052.75 (   0.00%)     8157.96 *  61.46%*
Hmean     256   6971.70 (   0.00%)     7648.01 *   9.70%*
Hmean     512   7427.32 (   0.00%)     7450.68 *   0.31%*
tbench4 NUMA 0
                 5.15-rc1+cluster      5.15-rc1+cluster+patch
Hmean     1      318.98 (   0.00%)      322.53 *   1.11%*
Hmean     2      640.50 (   0.00%)      641.89 *   0.22%*
Hmean     4     1277.57 (   0.00%)     1292.54 *   1.17%*
Hmean     8     2584.55 (   0.00%)     2622.64 *   1.47%*
Hmean     16    5245.05 (   0.00%)     5440.75 *   3.73%*
Hmean     32    3231.60 (   0.00%)     3991.83 *  23.52%*
Hmean     64    7361.28 (   0.00%)     7356.56 (  -0.06%)
Hmean     128   6240.28 (   0.00%)     6293.78 *   0.86%*
hackbench-process-pipes NUMA 0
                 5.15-rc1+cluster      5.15-rc1+cluster+patch
Amean     1        0.5196 (   0.00%)      0.5121 (   1.44%)
Amean     4        1.0946 (   0.00%)      1.3234 * -20.90%*
Amean     7        1.9368 (   0.00%)      2.4304 * -25.49%*
Amean     12       3.4168 (   0.00%)      3.6422 *  -6.60%*
Amean     21       6.1119 (   0.00%)      5.5032 *   9.96%*
Amean     30       7.8980 (   0.00%)      7.5433 *   4.49%*
Amean     48      11.2969 (   0.00%)     10.6889 *   5.38%*
Amean     79      17.3220 (   0.00%)     15.2553 *  11.93%*
Amean     110     22.9893 (   0.00%)     19.8521 *  13.65%*
Amean     141     28.5319 (   0.00%)     24.9064 *  12.71%*
Amean     172     34.1731 (   0.00%)     30.8424 *   9.75%*
Amean     203     39.9368 (   0.00%)     35.4607 *  11.21%*
Amean     234     45.6207 (   0.00%)     40.4969 *  11.23%*
Amean     256     50.0725 (   0.00%)     45.0295 *  10.07%*
[1] https://lore.kernel.org/lkml/20210924085104.44806-1-21cnbao@gmail.com/
the patchset is causing a kernel panic during kexec reboot:
[ 1254.167993] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000120
[ 1254.176771] Mem abort info:
[ 1254.179551]   ESR = 0x96000004
[ 1254.182596]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 1254.187899]   SET = 0, FnV = 0
[ 1254.190944]   EA = 0, S1PTW = 0
[ 1254.194076]   FSC = 0x04: level 0 translation fault
[ 1254.198944] Data abort info:
[ 1254.201815]   ISV = 0, ISS = 0x00000004
[ 1254.205643]   CM = 0, WnR = 0
[ 1254.208604] user pgtable: 4k pages
Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 1254.227375] Modules linked in:
[ 1254.230416] CPU: 0 PID: 786 Comm: kworker/1:2 Not tainted 5.15.0-rc1-00005-g4c1b4a4d90b6-dirty #302
[ 1254.239447] Hardware name: Huawei XA320 V2 /BC82HPNBB, BIOS 0.86 07/19/2019
[ 1254.246393] Workqueue: events cpuset_hotplug_workfn
[ 1254.251263] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 1254.258211] pc : __bitmap_weight+0x30/0x90
[ 1254.262297] lr : cpu_attach_domain+0x1ec/0x838
[ 1254.266729] sp : ffff8000238fba10
[ 1254.270029] x29: ffff8000238fba10 x28: ffff204000059f00 x27: 0000000000000000
[ 1254.277151] x26: ffff800010e3a238 x25: 0000000000000001 x24: ffff8000117858f0
[ 1254.284274] x23: 0000000000000100 x22: 0000000000000004 x21: 0000000000000120
[ 1254.291395] x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000001
[ 1254.298517] x17: 0000000000000000 x16: 00000000000006d4 x15: 00000000000006d1
[ 1254.305639] x14: 0000000000000002 x13: 0000000000000000 x12: 0000000000000000
[ 1254.312760] x11: 00000000000000c0 x10: 0000000000000a80 x9 : 0000000000000001
[ 1254.319882] x8 : ffff002080410000 x7 : 0000000000000000 x6 : 0000000000000000
[ 1254.327004] x5 : ffff800011f60b00 x4 : 00000000002dc6c0 x3 : ffff803f6e3fd000
[ 1254.334126] x2 : 0000000000000000 x1 : 0000000000000100 x0 : 0000000000000120
[ 1254.341247] Call trace:
[ 1254.343680]  __bitmap_weight+0x30/0x90
[ 1254.347416]  cpu_attach_domain+0x1ec/0x838
[ 1254.351499]  partition_sched_domains_locked+0x12c/0x908
[ 1254.356711]  rebuild_sched_domains_locked+0x384/0x800
[ 1254.361749]  rebuild_sched_domains+0x24/0x40
[ 1254.366006]  cpuset_hotplug_workfn+0x34c/0x548
[ 1254.370437]  process_one_work+0x1bc/0x338
[ 1254.374433]  worker_thread+0x48/0x418
[ 1254.378081]  kthread+0x14c/0x158
[ 1254.381297]  ret_from_fork+0x10/0x20
[ 1254.384861] Code: 2a0103f7 54000300 d2800013 52800014 (f8737aa0)
[ 1254.390940] ---[ end trace 179fc74a465f3bec ]---
Sorry, please ignore the noise; it was caused by my local debug code.
One benchmark result:
running sysbench on NUMA 0-1 (cpu0-cpu63) and mysqld on NUMA 2-3 (cpu64-cpu127)
sysbench command as below:
numactl -C 0-63 sysbench --db-driver=mysql --mysql-user=sbtest_user \
  --mysql_password=password --mysql-db=sbtest --mysql-host=127.0.0.1 \
  --mysql-port=3306 --point-selects=10 --simple-ranges=1 --sum-ranges=1 \
  --order-ranges=1 --distinct-ranges=1 --delete_inserts=1 --index-updates=1 \
  --non-index-updates=1 --delete-inserts=1 --range-size=100 --time=600 \
  --events=0 --report-interval=60 --tables=64 --table-size=2000000 \
  --threads=64 /usr/share/sysbench/oltp_write_only.lua run
        w/o patchset    w/ patchset
tps     53325.97        52331.69  (-1.86%)
qps     319955.80       313990.12 (-1.86%)
It seems the patchset brings some regression in this particular case. It will need more thought to figure out a better approach.
I set up a mysql environment on my server and, interestingly, I got a somewhat different result.
Since my SSD is located on NUMA 0, I bound mysqld to cpu0-63 and sysbench to cpu64-127.
I know little about mysql, so I ran a very basic sysbench command like below:
numactl -C 64-127 /mnt/sde/sysbench-1.0.20/INSTALL/bin/sysbench \
  /mnt/sde/sysbench-1.0.20/INSTALL/share/sysbench/oltp_read_write.lua \
  --mysql-host=localhost \
  --mysql-port=3306 \
  --mysql-user=root \
  --mysql-db=test \
  --db-driver=mysql \
  --report-interval=10 \
  --tables=12 \
  --table-size=1000000 \
  --threads=64 \
  --time=120 \
  --events=0 \
  run
            w/o patchset    w/ patchset
tps         20073.61        21510.50  (+7.16%)
qps         401472.28       430209.96 (+7.16%)
avg lat     3.19            2.97      (+6.90%)
The number of tables and the table size differ, so perhaps in this mysql case we do get an improvement.
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
 kernel/sched/fair.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..852a048a5f8c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6265,10 +6265,10 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	int i, cpu, scan_from, idle_cpu = -1, nr = INT_MAX;
+	struct sched_domain *this_sd, *cluster_sd;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
-	struct sched_domain *this_sd;
 	u64 time = 0;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
@@ -6276,6 +6276,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		return -1;
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	cpumask_clear_cpu(target, cpus);
+
+	cluster_sd = rcu_dereference(*this_cpu_ptr(&sd_cluster));
+	scan_from = cluster_sd ? cpumask_first(sched_domain_span(cluster_sd)) : target + 1;
 
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		u64 avg_cost, avg_idle, span_avg;
@@ -6305,7 +6309,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		time = cpu_clock(this);
 	}
 
-	for_each_cpu_wrap(cpu, cpus, target + 1) {
+	for_each_cpu_wrap(cpu, cpus, scan_from) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
 			if ((unsigned int)i < nr_cpumask_bits)
-- 2.33.0
Thanks,
Barry