Hi, (I had to remove many ccs; my mail server refused to send the mail otherwise)
On 12/26/24 9:11 PM, Oleg Nesterov wrote:
I mostly agree, see my reply to this patch, but...
On 12/26, Linus Torvalds wrote:
If the optimization is correct, there is no point to having a config option.
If the optimization is incorrect, there is no point to having the code.
Either way, there's no way we'd ever have a config option for this.
Agreed,
	if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
End result: you need to explain why the race cannot exist.
Agreed,
And I think your patch is simply buggy
Agreed again, but probably my reasoning was wrong,
But now waiters use "wait_event_interruptible_exclusive()" explicitly outside the pipe mutex, so the waiters and wakers aren't actually serialized.
This is what I don't understand... Could you spell it out? I _think_ that
	wait_event_whatever(WQ, CONDITION);

vs

	CONDITION = 1;
	if (wq_has_sleeper(WQ))
		wake_up_xxx(WQ, ...);

is fine.
This pattern is documented in wait.h:
https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/wait.h#L96
Thus if there is an issue, then the documentation should be updated.
Both wq_has_sleeper() and wait_event_whatever()->prepare_to_wait_event() have the necessary barriers to serialize the waiters and wakers.
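To spell out how I read that pairing (only a sketch; CONDITION, WQ and entry are placeholders, not real identifiers):

	/* waker */
	CONDITION = 1;
	if (wq_has_sleeper(&WQ))        /* does smp_mb() before checking the list */
		wake_up(&WQ);

	/* waiter, roughly what wait_event_whatever() expands to */
	for (;;) {
		/* adds entry under WQ->lock and does set_current_state(),
		 * i.e. a full barrier before CONDITION is re-checked */
		prepare_to_wait_event(&WQ, &entry, TASK_INTERRUPTIBLE);
		if (CONDITION)
			break;
		schedule();
	}
	finish_wait(&WQ, &entry);

Either the waiter sees CONDITION == 1 and does not sleep, or the waker sees the queue as non-empty and wakes it up.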
Damn I am sure I missed something ;)
Actually:
Does this work universally? I.e. can we add the optimization into __wake_up()?
But I do not understand this comment (from 2.6.0)
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/kern...
	/*
	 * Note: we use "set_current_state()" _after_ the wait-queue add,
	 * because we need a memory barrier there on SMP, so that any
	 * wake-function that tests for the wait-queue being active
	 * will be guaranteed to see waitqueue addition _or_ subsequent
	 * tests in this thread will see the wakeup having taken place.
	 *
	 * The spin_unlock() itself is semi-permeable and only protects
	 * one way (it only protects stuff inside the critical region and
	 * stops them from bleeding out - it would still allow subsequent
	 * loads to move into the critical region).
	 */
	void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
	{
		unsigned long flags;

		wait->flags &= ~WQ_FLAG_EXCLUSIVE;
		spin_lock_irqsave(&q->lock, flags);
		if (list_empty(&wait->task_list))
			__add_wait_queue(q, wait);
		set_current_state(state);
		spin_unlock_irqrestore(&q->lock, flags);
	}
set_current_state() now uses smp_store_mb(), which is a memory barrier _after_ the store. Thus I do not see what enforces any ordering between that store and the store done by __add_wait_queue().
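In other words, a simplified sketch of what prepare_to_wait() does today (not the exact source):

	spin_lock_irqsave(&q->lock, flags);
	__add_wait_queue(q, wait);              /* store #1: the list_add */
	/* set_current_state(state) is smp_store_mb(), which on most
	 * architectures is WRITE_ONCE() followed by smp_mb(): */
	WRITE_ONCE(current->__state, state);    /* store #2 */
	smp_mb();                               /* the only barrier, after both stores */
	spin_unlock_irqrestore(&q->lock, flags);

So the barrier orders both stores against later loads, but not the two stores against each other.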
--
Manfred
On 12/27, Manfred Spraul wrote:
I _think_ that
	wait_event_whatever(WQ, CONDITION);

vs

	CONDITION = 1;
	if (wq_has_sleeper(WQ))
		wake_up_xxx(WQ, ...);
is fine.
This pattern is documented in wait.h:
https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/wait.h#L96
Thus if there is an issue, then the documentation should be updated.
Agreed, it is basically the same pattern; prepare_to_wait_event() is similar to prepare_to_wait().
But I do not understand this comment (from 2.6.0)
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/kern...
	/*
	 * Note: we use "set_current_state()" _after_ the wait-queue add,
	 * because we need a memory barrier there on SMP, so that any
	 * wake-function that tests for the wait-queue being active
	 * will be guaranteed to see waitqueue addition _or_ subsequent
	 * tests in this thread will see the wakeup having taken place.
	 *
	 * The spin_unlock() itself is semi-permeable and only protects
	 * one way (it only protects stuff inside the critical region and
	 * stops them from bleeding out - it would still allow subsequent
	 * loads to move into the critical region).
	 */
...
set_current_state() now uses smp_store_mb(), which is a memory barrier _after_ the store.
And afaics this is what we actually need.
Thus I do not see what enforces any ordering between that store and the store done by __add_wait_queue().
IIUC this is fine, no need to serialize list_add() and STORE(tsk->__state), they can be reordered.
But we need mb() between __add_wait_queue + __set_current_state (in any order) and the subsequent "if (CONDITION)" check.
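IOW, just a sketch of the ordering that matters (locking omitted; CONDITION/q/wait are placeholders):

	/* waiter: prepare_to_wait() + the check */
	__add_wait_queue(q, wait);              /* A: store, order vs B irrelevant */
	WRITE_ONCE(current->__state, state);    /* B: store, order vs A irrelevant */
	smp_mb();                               /* from smp_store_mb() */
	if (!CONDITION)                         /* C: load, must come after A and B */
		schedule();

	/* waker */
	CONDITION = 1;                          /* D: store */
	smp_mb();                               /* from wq_has_sleeper() */
	if (waitqueue_active(q))                /* E: load, must come after D */
		wake_up(q);

With the two full barriers, C and E cannot both miss the other side's store: either the waiter sees CONDITION == 1, or the waker sees the entry on the queue and wakes it.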
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -124,6 +124,23 @@ static int __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int m
 int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
 	      int nr_exclusive, void *key)
 {
+	if (list_empty(&wq_head->head)) {
+		struct list_head *pn;
+
+		/*
+		 * pairs with spin_unlock_irqrestore(&wq_head->lock);
+		 * We actually do not need to acquire wq_head->lock, we just
+		 * need to be sure that there is no prepare_to_wait() that
+		 * completed on any CPU before __wake_up was called.
+		 * Thus instead of load_acquiring the spinlock and dropping
+		 * it again, we load_acquire the next list entry and check
+		 * that the list is not empty.
+		 */
+		pn = smp_load_acquire(&wq_head->head.next);
+		if (pn == &wq_head->head)
+			return 0;
+	}
Too subtle for me ;)
I have some concerns, but I need to think a bit more to (try to) actually understand this change.
Oleg.
On 12/28, Oleg Nesterov wrote:
 int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
 	      int nr_exclusive, void *key)
 {
+	if (list_empty(&wq_head->head)) {
+		struct list_head *pn;
+
+		/*
+		 * pairs with spin_unlock_irqrestore(&wq_head->lock);
+		 * We actually do not need to acquire wq_head->lock, we just
+		 * need to be sure that there is no prepare_to_wait() that
+		 * completed on any CPU before __wake_up was called.
+		 * Thus instead of load_acquiring the spinlock and dropping
+		 * it again, we load_acquire the next list entry and check
+		 * that the list is not empty.
+		 */
+		pn = smp_load_acquire(&wq_head->head.next);
+		if (pn == &wq_head->head)
+			return 0;
+	}
Too subtle for me ;)
I have some concerns, but I need to think a bit more to (try to) actually understand this change.
If nothing else, consider
	int CONDITION;
	wait_queue_head_t WQ;

	void wake(void)
	{
		CONDITION = 1;
		wake_up(WQ);
	}

	void wait(void)
	{
		DEFINE_WAIT_FUNC(entry, woken_wake_function);

		add_wait_queue(WQ, entry);
		if (!CONDITION)
			wait_woken(entry, ...);
		remove_wait_queue(WQ, entry);
	}
This code is correct even if LOAD(CONDITION) can leak into the critical section in add_wait_queue(), so the CPU running wait() can actually do
	// add_wait_queue
	spin_lock(WQ->lock);
	LOAD(CONDITION);        // false!
	list_add(entry, head);
	spin_unlock(WQ->lock);

	if (!false)             // result of the LOAD above
		wait_woken(entry, ...);
Now suppose that another CPU executes wake() between LOAD(CONDITION) and list_add(entry, head). With your patch wait() will miss the event. The same for __pollwait(), I think...
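Spelled out as an interleaving (just to illustrate, with the modified __wake_up() above):

	CPU 0 (wait)                            CPU 1 (wake)
	------------                            ------------
	spin_lock(WQ->lock);
	LOAD(CONDITION);    // reads 0
	                                        CONDITION = 1;
	                                        __wake_up():
	                                          smp_load_acquire(&head.next)
	                                          // list still empty -> return 0,
	                                          // no wakeup is sent
	list_add(entry, head);
	spin_unlock(WQ->lock);
	if (!false)         // stale LOAD result
		wait_woken(entry, ...);         // sleeps, the event is lost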
No?
Oleg.
On 12/28, Oleg Nesterov wrote:
If nothing else, consider
	int CONDITION;
	wait_queue_head_t WQ;

	void wake(void)
	{
		CONDITION = 1;
		wake_up(WQ);
	}

	void wait(void)
	{
		DEFINE_WAIT_FUNC(entry, woken_wake_function);

		add_wait_queue(WQ, entry);
		if (!CONDITION)
			wait_woken(entry, ...);
		remove_wait_queue(WQ, entry);
	}
This code is correct even if LOAD(CONDITION) can leak into the critical section in add_wait_queue(), so the CPU running wait() can actually do
	// add_wait_queue
	spin_lock(WQ->lock);
	LOAD(CONDITION);        // false!
	list_add(entry, head);
	spin_unlock(WQ->lock);

	if (!false)             // result of the LOAD above
		wait_woken(entry, ...);
Now suppose that another CPU executes wake() between LOAD(CONDITION) and list_add(entry, head). With your patch wait() will miss the event. The same for __pollwait(), I think...
No?
Even simpler,
	void wait(void)
	{
		DEFINE_WAIT(entry);

		__set_current_state(XXX);
		add_wait_queue(WQ, entry);

		if (!CONDITION)
			schedule();

		remove_wait_queue(WQ, entry);
		__set_current_state(TASK_RUNNING);
	}
This code is ugly but currently correct unless I am totally confused.
Oleg.