Hi,

(I had to drop many of the Cc's, my mail server refused to send the mail otherwise.)
On 12/26/24 9:11 PM, Oleg Nesterov wrote:
> I mostly agree, see my reply to this patch, but...
>
> On 12/26, Linus Torvalds wrote:
>> If the optimization is correct, there is no point to having a config option.
>>
>> If the optimization is incorrect, there is no point to having the code.
>>
>> Either way, there's no way we'd ever have a config option for this.
Agreed,
>>	if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
>>		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
>>
>> End result: you need to explain why the race cannot exist.
Agreed,
>> And I think your patch is simply buggy
Agreed again, but probably my reasoning was wrong,
>> But now waiters use "wait_event_interruptible_exclusive()" explicitly outside the pipe mutex, so the waiters and wakers aren't actually serialized.
> This is what I don't understand... Could you spell?
>
> I _think_ that
>
>	wait_event_whatever(WQ, CONDITION);
>
> vs
>
>	CONDITION = 1;
>	if (wq_has_sleeper(WQ))
>		wake_up_xxx(WQ, ...);
>
> is fine.
This pattern is documented in wait.h:
https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/wait.h#L96
Thus if there is an issue, then the documentation should be updated.
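
For completeness, a condensed sketch of that documented pattern (my own illustrative code, not copied verbatim from wait.h; "cond", "wq", waker() and waiter() are made-up names):

	static bool cond;
	static DECLARE_WAIT_QUEUE_HEAD(wq);

	static void waker(void)
	{
		cond = true;
		smp_mb();	/* pairs with set_current_state() in the waiter */
		if (waitqueue_active(&wq))
			wake_up(&wq);
	}

	static void waiter(void)
	{
		DEFINE_WAIT(wait);

		for (;;) {
			prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE);
			/* set_current_state() inside prepare_to_wait() implies smp_mb() */
			if (cond)
				break;
			schedule();
		}
		finish_wait(&wq, &wait);
	}

The intent, per the wait.h comment, is that either the waker observes the waiter already on the list, or the waiter observes cond set after its barrier.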
> Both wq_has_sleeper() and wait_event_whatever()->prepare_to_wait_event()
> have the necessary barriers to serialize the waiters and wakers.
>
> Damn I am sure I missed something ;)
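
For the record, wq_has_sleeper() in wait.h is essentially just waitqueue_active() preceded by the full barrier (paraphrased):

	static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
	{
		/* full barrier, pairs with the barrier on the waiting side */
		smp_mb();
		return waitqueue_active(wq_head);
	}

so the caller does not need its own smp_mb() before the check.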
Actually:
Does this work universally? I.e. can we add the optimization into __wake_up()?
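
Something like this is what I mean (hypothetical sketch only, not a tested patch; __wake_up_common_lock() is the existing helper in kernel/sched/wait.c):

	int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
		      int nr_exclusive, void *key)
	{
		/*
		 * Bail out early if nobody is queued; wq_has_sleeper()
		 * provides the smp_mb() that is supposed to pair with the
		 * barrier in prepare_to_wait_event()/set_current_state().
		 */
		if (!wq_has_sleeper(wq_head))
			return 0;

		return __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
	}

The open question is whether every caller of __wake_up() follows the documented pattern, i.e. updates its condition before calling wake_up().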
But I do not understand this comment (from 2.6.0):
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/kern...
/*
 * Note: we use "set_current_state()" _after_ the wait-queue add,
 * because we need a memory barrier there on SMP, so that any
 * wake-function that tests for the wait-queue being active
 * will be guaranteed to see waitqueue addition _or_ subsequent
 * tests in this thread will see the wakeup having taken place.
 *
 * The spin_unlock() itself is semi-permeable and only protects
 * one way (it only protects stuff inside the critical region and
 * stops them from bleeding out - it would still allow subsequent
 * loads to move into the critical region).
 */
void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
{
	unsigned long flags;

	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
	spin_lock_irqsave(&q->lock, flags);
	if (list_empty(&wait->task_list))
		__add_wait_queue(q, wait);
	set_current_state(state);
	spin_unlock_irqrestore(&q->lock, flags);
}
set_current_state() now uses smp_store_mb(), which places the memory barrier _after_ the store. Thus I do not see what enforces that the store happens before the store done by __add_wait_queue().
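
For reference, the asm-generic fallback is essentially:

	#define smp_store_mb(var, value)	\
		do { WRITE_ONCE(var, value); smp_mb(); } while (0)

(x86 implements it with xchg instead), i.e. the only documented guarantee is "store, then full barrier".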
--
Manfred