On Thu, Oct 28, 2021 at 9:18 PM Yicong Yang <yangyicong@hisilicon.com> wrote:

From: Barry Song <song.bao.hua@hisilicon.com>
For platforms with clusters, such as the Kunpeng 920, tasks in the same cluster share the L3 cache tag and therefore have lower latency when synchronizing and accessing shared resources. Based on this, this patch changes the starting CPU of the scan in select_idle_cpu() from the CPU after the target to the first CPU of the target's cluster. The search is then performed within the cluster first, giving a better chance of waking the wakee in the same cluster as the waker.
Benchmark tests have been done on a 2-socket, 4-NUMA-node Kunpeng 920 with 8 clusters per NUMA node, both machine-wide and restricted to NUMA node 0. Improvements are observed in most cases compared to 5.15-rc1 with the cluster scheduler level [1].
hackbench-process-pipes
                     5.15-rc1+cluster       5.15-rc1+cluster+patch
Amean     1      0.6136 (   0.00%)      0.5988 (   2.41%)
Amean     4      0.8380 (   0.00%)      0.8904 *  -6.25%*
Amean     7      1.1661 (   0.00%)      1.1017 *   5.52%*
Amean     12     1.4670 (   0.00%)      1.5994 *  -9.03%*
Amean     21     2.8909 (   0.00%)      2.8640 (   0.93%)
Amean     30     4.3943 (   0.00%)      4.2052 (   4.30%)
Amean     48     6.6870 (   0.00%)      6.4079 (   4.17%)
Amean     79    10.4796 (   0.00%)      9.5507 *   8.86%*
Amean     110   14.5310 (   0.00%)     12.2114 *  15.96%*
Amean     141   16.4772 (   0.00%)     14.1517 *  14.11%*
Amean     172   20.0868 (   0.00%)     15.9852 *  20.42%*
Amean     203   22.9282 (   0.00%)     18.4574 *  19.50%*
Amean     234   25.8139 (   0.00%)     20.4725 *  20.69%*
Amean     256   27.6834 (   0.00%)     22.9076 *  17.25%*
tbench4
                     5.15-rc1+cluster       5.15-rc1+cluster+patch
Hmean     1      338.50 (   0.00%)      345.47 *   2.06%*
Hmean     2      672.20 (   0.00%)      695.10 *   3.41%*
Hmean     4     1329.03 (   0.00%)     1357.40 *   2.14%*
Hmean     8     2513.25 (   0.00%)     2419.88 *  -3.71%*
Hmean     16    4957.39 (   0.00%)     4882.04 *  -1.52%*
Hmean     32    8737.07 (   0.00%)     8649.97 *  -1.00%*
Hmean     64    4929.31 (   0.00%)     6570.13 *  33.29%*
Hmean     128   5052.75 (   0.00%)     8157.96 *  61.46%*
Hmean     256   6971.70 (   0.00%)     7648.01 *   9.70%*
Hmean     512   7427.32 (   0.00%)     7450.68 *   0.31%*
tbench4 (NUMA 0)
                     5.15-rc1+cluster       5.15-rc1+cluster+patch
Hmean     1      318.98 (   0.00%)      322.53 *   1.11%*
Hmean     2      640.50 (   0.00%)      641.89 *   0.22%*
Hmean     4     1277.57 (   0.00%)     1292.54 *   1.17%*
Hmean     8     2584.55 (   0.00%)     2622.64 *   1.47%*
Hmean     16    5245.05 (   0.00%)     5440.75 *   3.73%*
Hmean     32    3231.60 (   0.00%)     3991.83 *  23.52%*
Hmean     64    7361.28 (   0.00%)     7356.56 (  -0.06%)
Hmean     128   6240.28 (   0.00%)     6293.78 *   0.86%*
hackbench-process-pipes (NUMA 0)
                     5.15-rc1+cluster       5.15-rc1+cluster+patch
Amean     1      0.5196 (   0.00%)      0.5121 (   1.44%)
Amean     4      1.0946 (   0.00%)      1.3234 * -20.90%*
Amean     7      1.9368 (   0.00%)      2.4304 * -25.49%*
Amean     12     3.4168 (   0.00%)      3.6422 *  -6.60%*
Amean     21     6.1119 (   0.00%)      5.5032 *   9.96%*
Amean     30     7.8980 (   0.00%)      7.5433 *   4.49%*
Amean     48    11.2969 (   0.00%)     10.6889 *   5.38%*
Amean     79    17.3220 (   0.00%)     15.2553 *  11.93%*
Amean     110   22.9893 (   0.00%)     19.8521 *  13.65%*
Amean     141   28.5319 (   0.00%)     24.9064 *  12.71%*
Amean     172   34.1731 (   0.00%)     30.8424 *   9.75%*
Amean     203   39.9368 (   0.00%)     35.4607 *  11.21%*
Amean     234   45.6207 (   0.00%)     40.4969 *  11.23%*
Amean     256   50.0725 (   0.00%)     45.0295 *  10.07%*
[1] https://lore.kernel.org/lkml/20210924085104.44806-1-21cnbao@gmail.com/
The patchset is causing a kernel panic during kexec reboot:
[ 1254.167993] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000120
[ 1254.176771] Mem abort info:
[ 1254.179551]   ESR = 0x96000004
[ 1254.182596]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 1254.187899]   SET = 0, FnV = 0
[ 1254.190944]   EA = 0, S1PTW = 0
[ 1254.194076]   FSC = 0x04: level 0 translation fault
[ 1254.198944] Data abort info:
[ 1254.201815]   ISV = 0, ISS = 0x00000004
[ 1254.205643]   CM = 0, WnR = 0
[ 1254.208604] user pgtable: 4k pages [...]
[ ...] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 1254.227375] Modules linked in:
[ 1254.230416] CPU: 0 PID: 786 Comm: kworker/1:2 Not tainted 5.15.0-rc1-00005-g4c1b4a4d90b6-dirty #302
[ 1254.239447] Hardware name: Huawei XA320 V2 /BC82HPNBB, BIOS 0.86 07/19/2019
[ 1254.246393] Workqueue: events cpuset_hotplug_workfn
[ 1254.251263] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 1254.258211] pc : __bitmap_weight+0x30/0x90
[ 1254.262297] lr : cpu_attach_domain+0x1ec/0x838
[ 1254.266729] sp : ffff8000238fba10
[ 1254.270029] x29: ffff8000238fba10 x28: ffff204000059f00 x27: 0000000000000000
[ 1254.277151] x26: ffff800010e3a238 x25: 0000000000000001 x24: ffff8000117858f0
[ 1254.284274] x23: 0000000000000100 x22: 0000000000000004 x21: 0000000000000120
[ 1254.291395] x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000001
[ 1254.298517] x17: 0000000000000000 x16: 00000000000006d4 x15: 00000000000006d1
[ 1254.305639] x14: 0000000000000002 x13: 0000000000000000 x12: 0000000000000000
[ 1254.312760] x11: 00000000000000c0 x10: 0000000000000a80 x9 : 0000000000000001
[ 1254.319882] x8 : ffff002080410000 x7 : 0000000000000000 x6 : 0000000000000000
[ 1254.327004] x5 : ffff800011f60b00 x4 : 00000000002dc6c0 x3 : ffff803f6e3fd000
[ 1254.334126] x2 : 0000000000000000 x1 : 0000000000000100 x0 : 0000000000000120
[ 1254.341247] Call trace:
[ 1254.343680]  __bitmap_weight+0x30/0x90
[ 1254.347416]  cpu_attach_domain+0x1ec/0x838
[ 1254.351499]  partition_sched_domains_locked+0x12c/0x908
[ 1254.356711]  rebuild_sched_domains_locked+0x384/0x800
[ 1254.361749]  rebuild_sched_domains+0x24/0x40
[ 1254.366006]  cpuset_hotplug_workfn+0x34c/0x548
[ 1254.370437]  process_one_work+0x1bc/0x338
[ 1254.374433]  worker_thread+0x48/0x418
[ 1254.378081]  kthread+0x14c/0x158
[ 1254.381297]  ret_from_fork+0x10/0x20
[ 1254.384861] Code: 2a0103f7 54000300 d2800013 52800014 (f8737aa0)
[ 1254.390940] ---[ end trace 179fc74a465f3bec ]---
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
 kernel/sched/fair.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..852a048a5f8c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6265,10 +6265,10 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	int i, cpu, scan_from, idle_cpu = -1, nr = INT_MAX;
+	struct sched_domain *this_sd, *cluster_sd;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
-	struct sched_domain *this_sd;
 	u64 time = 0;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
@@ -6276,6 +6276,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		return -1;
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	cpumask_clear_cpu(target, cpus);
+
+	cluster_sd = rcu_dereference(*this_cpu_ptr(&sd_cluster));
+	scan_from = cluster_sd ? cpumask_first(sched_domain_span(cluster_sd)) : target + 1;
 
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		u64 avg_cost, avg_idle, span_avg;
@@ -6305,7 +6309,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		time = cpu_clock(this);
 	}
 
-	for_each_cpu_wrap(cpu, cpus, target + 1) {
+	for_each_cpu_wrap(cpu, cpus, scan_from) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
 			if ((unsigned int)i < nr_cpumask_bits)
-- 
2.33.0