On 2021/11/4 20:12, Barry Song wrote:
On Thu, Nov 4, 2021 at 11:39 PM Barry Song <21cnbao@gmail.com> wrote:
On Thu, Oct 28, 2021 at 9:18 PM Yicong Yang <yangyicong@hisilicon.com> wrote:
From: Barry Song <song.bao.hua@hisilicon.com>
For platforms with clusters, such as Kunpeng 920, tasks in the same cluster share the L3 cache tag and thus have lower latency when synchronizing and accessing shared resources. Based on this, this patch changes the CPU at which scanning begins in select_idle_cpu() from the CPU after the target to the first CPU of the target's cluster. The search then covers the cluster first, so we have a better chance of waking the wakee in the same cluster as the waker.
Benchmark tests have been done on a 2-socket, 4-NUMA-node Kunpeng 920 with 8 clusters per NUMA node, both across the whole machine and on NUMA node 0 alone. Improvements are observed in most cases compared to 5.15-rc1 with the cluster scheduler level [1].
hackbench-process-pipes
                 5.15-rc1+cluster      5.15-rc1+cluster+patch
Amean     1        0.6136 (   0.00%)      0.5988 (   2.41%)
Amean     4        0.8380 (   0.00%)      0.8904 *  -6.25%*
Amean     7        1.1661 (   0.00%)      1.1017 *   5.52%*
Amean     12       1.4670 (   0.00%)      1.5994 *  -9.03%*
Amean     21       2.8909 (   0.00%)      2.8640 (   0.93%)
Amean     30       4.3943 (   0.00%)      4.2052 (   4.30%)
Amean     48       6.6870 (   0.00%)      6.4079 (   4.17%)
Amean     79      10.4796 (   0.00%)      9.5507 *   8.86%*
Amean     110     14.5310 (   0.00%)     12.2114 *  15.96%*
Amean     141     16.4772 (   0.00%)     14.1517 *  14.11%*
Amean     172     20.0868 (   0.00%)     15.9852 *  20.42%*
Amean     203     22.9282 (   0.00%)     18.4574 *  19.50%*
Amean     234     25.8139 (   0.00%)     20.4725 *  20.69%*
Amean     256     27.6834 (   0.00%)     22.9076 *  17.25%*
tbench4
                 5.15-rc1+cluster      5.15-rc1+cluster+patch
Hmean     1      338.50 (   0.00%)      345.47 *   2.06%*
Hmean     2      672.20 (   0.00%)      695.10 *   3.41%*
Hmean     4     1329.03 (   0.00%)     1357.40 *   2.14%*
Hmean     8     2513.25 (   0.00%)     2419.88 *  -3.71%*
Hmean     16    4957.39 (   0.00%)     4882.04 *  -1.52%*
Hmean     32    8737.07 (   0.00%)     8649.97 *  -1.00%*
Hmean     64    4929.31 (   0.00%)     6570.13 *  33.29%*
Hmean     128   5052.75 (   0.00%)     8157.96 *  61.46%*
Hmean     256   6971.70 (   0.00%)     7648.01 *   9.70%*
Hmean     512   7427.32 (   0.00%)     7450.68 *   0.31%*
tbench4 NUMA 0
                 5.15-rc1+cluster      5.15-rc1+cluster+patch
Hmean     1      318.98 (   0.00%)      322.53 *   1.11%*
Hmean     2      640.50 (   0.00%)      641.89 *   0.22%*
Hmean     4     1277.57 (   0.00%)     1292.54 *   1.17%*
Hmean     8     2584.55 (   0.00%)     2622.64 *   1.47%*
Hmean     16    5245.05 (   0.00%)     5440.75 *   3.73%*
Hmean     32    3231.60 (   0.00%)     3991.83 *  23.52%*
Hmean     64    7361.28 (   0.00%)     7356.56 (  -0.06%)
Hmean     128   6240.28 (   0.00%)     6293.78 *   0.86%*
hackbench-process-pipes NUMA 0
                 5.15-rc1+cluster      5.15-rc1+cluster+patch
Amean     1        0.5196 (   0.00%)      0.5121 (   1.44%)
Amean     4        1.0946 (   0.00%)      1.3234 * -20.90%*
Amean     7        1.9368 (   0.00%)      2.4304 * -25.49%*
Amean     12       3.4168 (   0.00%)      3.6422 *  -6.60%*
Amean     21       6.1119 (   0.00%)      5.5032 *   9.96%*
Amean     30       7.8980 (   0.00%)      7.5433 *   4.49%*
Amean     48      11.2969 (   0.00%)     10.6889 *   5.38%*
Amean     79      17.3220 (   0.00%)     15.2553 *  11.93%*
Amean     110     22.9893 (   0.00%)     19.8521 *  13.65%*
Amean     141     28.5319 (   0.00%)     24.9064 *  12.71%*
Amean     172     34.1731 (   0.00%)     30.8424 *   9.75%*
Amean     203     39.9368 (   0.00%)     35.4607 *  11.21%*
Amean     234     45.6207 (   0.00%)     40.4969 *  11.23%*
Amean     256     50.0725 (   0.00%)     45.0295 *  10.07%*
[1] https://lore.kernel.org/lkml/20210924085104.44806-1-21cnbao@gmail.com/
the patchset is causing a kernel panic during kexec reboot:
[ 1254.167993] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000120
[ 1254.176771] Mem abort info:
[ 1254.179551]   ESR = 0x96000004
[ 1254.182596]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 1254.187899]   SET = 0, FnV = 0
[ 1254.190944]   EA = 0, S1PTW = 0
[ 1254.194076]   FSC = 0x04: level 0 translation fault
[ 1254.198944] Data abort info:
[ 1254.201815]   ISV = 0, ISS = 0x00000004
[ 1254.205643]   CM = 0, WnR = 0
[ 1254.208604] user pgtable: 4k pages
Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 1254.227375] Modules linked in:
[ 1254.230416] CPU: 0 PID: 786 Comm: kworker/1:2 Not tainted 5.15.0-rc1-00005-g4c1b4a4d90b6-dirty #302
[ 1254.239447] Hardware name: Huawei XA320 V2 /BC82HPNBB, BIOS 0.86 07/19/2019
[ 1254.246393] Workqueue: events cpuset_hotplug_workfn
[ 1254.251263] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 1254.258211] pc : __bitmap_weight+0x30/0x90
[ 1254.262297] lr : cpu_attach_domain+0x1ec/0x838
[ 1254.266729] sp : ffff8000238fba10
[ 1254.270029] x29: ffff8000238fba10 x28: ffff204000059f00 x27: 0000000000000000
[ 1254.277151] x26: ffff800010e3a238 x25: 0000000000000001 x24: ffff8000117858f0
[ 1254.284274] x23: 0000000000000100 x22: 0000000000000004 x21: 0000000000000120
[ 1254.291395] x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000001
[ 1254.298517] x17: 0000000000000000 x16: 00000000000006d4 x15: 00000000000006d1
[ 1254.305639] x14: 0000000000000002 x13: 0000000000000000 x12: 0000000000000000
[ 1254.312760] x11: 00000000000000c0 x10: 0000000000000a80 x9 : 0000000000000001
[ 1254.319882] x8 : ffff002080410000 x7 : 0000000000000000 x6 : 0000000000000000
[ 1254.327004] x5 : ffff800011f60b00 x4 : 00000000002dc6c0 x3 : ffff803f6e3fd000
[ 1254.334126] x2 : 0000000000000000 x1 : 0000000000000100 x0 : 0000000000000120
[ 1254.341247] Call trace:
[ 1254.343680]  __bitmap_weight+0x30/0x90
[ 1254.347416]  cpu_attach_domain+0x1ec/0x838
[ 1254.351499]  partition_sched_domains_locked+0x12c/0x908
[ 1254.356711]  rebuild_sched_domains_locked+0x384/0x800
[ 1254.361749]  rebuild_sched_domains+0x24/0x40
[ 1254.366006]  cpuset_hotplug_workfn+0x34c/0x548
[ 1254.370437]  process_one_work+0x1bc/0x338
[ 1254.374433]  worker_thread+0x48/0x418
[ 1254.378081]  kthread+0x14c/0x158
[ 1254.381297]  ret_from_fork+0x10/0x20
[ 1254.384861] Code: 2a0103f7 54000300 d2800013 52800014 (f8737aa0)
[ 1254.390940] ---[ end trace 179fc74a465f3bec ]---
Sorry, please ignore the noise; it was caused by my local debug code.
One benchmark result:
running sysbench on NUMA 0-1 (cpu0-cpu63) and mysqld on NUMA 2-3 (cpu64-cpu127)
sysbench command as below:
numactl -C 0-63 sysbench --db-driver=mysql --mysql-user=sbtest_user \
  --mysql_password=password --mysql-db=sbtest --mysql-host=127.0.0.1 \
  --mysql-port=3306 --point-selects=10 --simple-ranges=1 --sum-ranges=1 \
  --order-ranges=1 --distinct-ranges=1 --delete_inserts=1 --index-updates=1 \
  --non-index-updates=1 --delete-inserts=1 --range-size=100 --time=600 \
  --events=0 --report-interval=60 --tables=64 --table-size=2000000 \
  --threads=64 /usr/share/sysbench/oltp_write_only.lua run
        w/o patchset    w/ patchset
tps     53325.97        52331.69  (-1.86%)
qps     319955.80       313990.12 (-1.86%)
It seems the patchset brings some regression in this particular case. It will need more thought to figure out a better approach.
I set up a mysql environment on my server and, interestingly, I got a somewhat different result.
Since my SSD is located on NUMA 0, I bound mysqld to cpu0-63 and sysbench to cpu64-127.
I know little about mysql, so I ran a very basic sysbench command like below:
numactl -C 64-127 /mnt/sde/sysbench-1.0.20/INSTALL/bin/sysbench \
  /mnt/sde/sysbench-1.0.20/INSTALL/share/sysbench/oltp_read_write.lua \
  --mysql-host=localhost \
  --mysql-port=3306 \
  --mysql-user=root \
  --mysql-db=test \
  --db-driver=mysql \
  --report-interval=10 \
  --tables=12 \
  --table-size=1000000 \
  --threads=64 \
  --time=120 \
  --events=0 \
  run
            w/o patchset    w/ patchset
tps         20073.61        21510.50  (+7.16%)
qps         401472.28       430209.96 (+7.16%)
avg lat     3.19            2.97      (+6.90%)
The number of tables and the table size differ, so perhaps in this mysql case we do get an improvement.
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
 kernel/sched/fair.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..852a048a5f8c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6265,10 +6265,10 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	int i, cpu, scan_from, idle_cpu = -1, nr = INT_MAX;
+	struct sched_domain *this_sd, *cluster_sd;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
-	struct sched_domain *this_sd;
 	u64 time = 0;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
@@ -6276,6 +6276,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		return -1;
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	cpumask_clear_cpu(target, cpus);
+
+	cluster_sd = rcu_dereference(*this_cpu_ptr(&sd_cluster));
+	scan_from = cluster_sd ? cpumask_first(sched_domain_span(cluster_sd)) : target + 1;
 
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		u64 avg_cost, avg_idle, span_avg;
@@ -6305,7 +6309,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		time = cpu_clock(this);
 	}
 
-	for_each_cpu_wrap(cpu, cpus, target + 1) {
+	for_each_cpu_wrap(cpu, cpus, scan_from) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
 			if ((unsigned int)i < nr_cpumask_bits)
-- 2.33.0
Thanks,
Barry