-----Original Message-----
From: Tim Chen [mailto:tim.c.chen@linux.intel.com]
Sent: Friday, June 25, 2021 1:00 PM
To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; yangyicong <yangyicong@huawei.com>
Cc: guodong.xu@linaro.org; tangchengchang <tangchengchang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Jonathan Cameron <jonathan.cameron@huawei.com>; chenxiang (M) <chenxiang66@hisilicon.com>; Linuxarm <linuxarm@huawei.com>
Subject: Re: [PATCH 4/4] sched/fair: Use cpus_share_cluster to further pull wakee
On 6/24/21 2:15 AM, Barry Song wrote:
In wake_affine_idle() and select_idle_sibling(), use cpus_share_cluster()
rather than cpus_share_cache() to determine whether waker and wakee are in
the same cache domain. This patch mainly benefits lightly loaded systems.
We are seeing a huge improvement while load is relatively light
(1 NUMA node, 24 cores, 6 clusters):
tbench4
                                  tbench                         tbench
            1numa-w/-cluster-spread-only  1numa-w-cluster-spread+scancluster+cpus_share_cluster
Hmean     1        297.22 (   0.00%)        331.31 *  11.47%*
Hmean     2        597.33 (   0.00%)        661.35 *  10.72%*
Hmean     4       1211.53 (   0.00%)       1317.34 *   8.73%*
Hmean     8       2465.05 (   0.00%)       2612.18 *   5.97%*
Hmean    16       3194.06 (   0.00%)       4089.96 *  28.05%*
Hmean    32       5984.82 (   0.00%)       6374.53 *   6.51%*
Hmean    64       5017.10 (   0.00%)       5255.47 *   4.75%*
Hmean    96       4705.88 (   0.00%)       4860.40 *   3.28%*
With this patch, tbench4 is doing better than w/o cluster for both
light load and heavy load:
tbench4
                                  tbench                         tbench
                       1numa-w/o-cluster  1numa-w-cluster-spread+scancluster+cpus_share_cluster
Hmean     1        314.11 (   0.00%)        331.31 *   5.47%*
Hmean     2        612.83 (   0.00%)        661.35 *   7.92%*
Hmean     4       1232.90 (   0.00%)       1317.34 *   6.85%*
Hmean     8       2470.26 (   0.00%)       2612.18 *   5.74%*
Hmean    16       3203.89 (   0.00%)       4089.96 *  27.66%*
Hmean    32       6210.83 (   0.00%)       6374.53 *   2.64%*
Hmean    64       5053.54 (   0.00%)       5255.47 *   4.00%*
Hmean    96       4726.44 (   0.00%)       4860.40 *   2.83%*
[
Todo:
need pgbench, hackbench on 1numa, 4numa; need tbench4 on 4numa
]
kernel/sched/core.c | 5 ++++-
kernel/sched/fair.c | 6 +++---
2 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 47a4c82..6cfb0fa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3504,7 +3504,10 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
bool cpus_share_cluster(int this_cpu, int that_cpu)
{
-	return per_cpu(sd_cluster_id, this_cpu) == per_cpu(sd_cluster_id, that_cpu);
+	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}

static inline bool ttwu_queue_cond(int cpu, int wake_flags)
Nitpicking a bit. When sched_cluster is not active, we are really comparing
the last level cache. Wonder if we can give cpus_share_cluster a better name,
as it will return true in this case even when the cpus are not in the same
cluster.
Yes. It is a little bit ugly here. In the RFC patchset I used to add a
parameter to cpus_share_cache(cpux, cpuy, cluster): if cluster was 0, we
compared LLC; if cluster was 1, we compared the cluster. But I felt it made
cpus_share_cache() more obscure, so I added cpus_share_cluster(cpux, cpuy),
which is also not perfect. Might we add an internal helper in fair.c:

static bool cpus_share_wake_affine_cache(int cpux, int cpuy)
{
	if (cluster is active)
		return cpus_share_cluster(cpux, cpuy);

	return cpus_share_cache(cpux, cpuy);
}

and remove the "if(cluster is active)" from the current cpus_share_cluster()?
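To be concrete, cpus_share_cluster() in core.c could then go back to the plain
cluster-id comparison this patch already relies on. Only a sketch of the idea,
untested:

/*
 * Sketch: cpus_share_cluster() without the LLC fallback; the
 * cluster-vs-LLC decision would live entirely in the fair.c helper above.
 */
bool cpus_share_cluster(int this_cpu, int that_cpu)
{
	return per_cpu(sd_cluster_id, this_cpu) == per_cpu(sd_cluster_id, that_cpu);
}

That way the name would mean exactly what it says, which should address the
naming concern.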
BTW, Tim, I am curious whether your machine will get the same kind of tbench4
result as kunpeng920. I am using https://github.com/gormanm/mmtests
and commands like:
run-mmtests.sh --no-monitor -c configs/config-network-tbench-quick testtag1
run-mmtests.sh --no-monitor -c configs/config-network-tbench-quick testtag2
cd work/log
../../compare-kernels.sh --baseline testtag1 --compare testtag2
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdc2cf8..41ebbf2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5797,7 +5797,7 @@ static int wake_wide(struct task_struct *p)
* a cpufreq perspective, it's better to have higher utilisation
* on one CPU.
*/
-	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
+	if (available_idle_cpu(this_cpu) && cpus_share_cluster(this_cpu, prev_cpu))
return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
if (sync && cpu_rq(this_cpu)->nr_running == 1)
@@ -6297,7 +6297,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
/*
* If the previous CPU is cache affine and idle, don't be stupid:
*/
-	if (prev != target && cpus_share_cache(prev, target) &&
+	if (prev != target && cpus_share_cluster(prev, target) &&
(available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
asym_fits_capacity(task_util, prev))
return prev;
@@ -6320,7 +6320,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
	recent_used_cpu = p->recent_used_cpu;
	if (recent_used_cpu != prev &&
	    recent_used_cpu != target &&
-	    cpus_share_cache(recent_used_cpu, target) &&
+	    cpus_share_cluster(recent_used_cpu, target) &&
	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
	    asym_fits_capacity(task_util, recent_used_cpu)) {
Thanks
Barry