When a user calls the read/write system call and passes a pipe descriptor, the pipe_read/pipe_write functions are invoked:
1. pipe_read():
   1) Checks that the pipe is valid and whether there is any data in the pipe buffer.
   2) Waits for data:
      * If there is no data in the pipe and the write end is still open, the current process sleeps (wait_event()) until data is written.
      * If the write end is closed, returns 0.
   3) Reads data:
      * Once woken, copies data from the pipe's kernel buffer to user space.
      * When the buffer is full, the writing process sleeps until the pipe state changes and it is woken via wake_up_interruptible_sync_poll(). Once data has been read from the buffer, the writer can continue writing and the reader can continue reading new data.
   4) Returns the number of bytes read on success.
2. pipe_write():
   1) Checks that the pipe is valid and whether there is any space left in the pipe buffer.
   2) Waits for buffer space:
      * If the pipe buffer is full and no data has been read, pipe_write() may put the current process to sleep until space becomes available.
      * If the read end of the pipe is closed (no process can ever read), -EPIPE is returned and a SIGPIPE signal may be sent to the process.
   3) Writes data:
      * If there is enough space in the pipe buffer, pipe_write() copies data from the user-space buffer into the pipe's kernel buffer (using copy_from_user()).
      * If the amount of data to write is larger than the available space, multiple writes may be required, or the process may wait for new space to be freed.
   4) Wakes up waiting readers:
      * After the data is written, pipe_write() wakes up any processes that may be waiting to read (via wake_up_interruptible_sync_poll()).
   5) Returns the number of bytes successfully written.
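As an illustration of these blocking semantics, a minimal userspace sketch (not part of the patch itself) might look like:

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/wait.h>

	int main(void)
	{
		int fds[2];
		char buf[64];
		ssize_t n;

		if (pipe(fds) < 0)
			return 1;

		if (fork() == 0) {                      /* child: writer */
			close(fds[0]);
			write(fds[1], "hello", 5);      /* may sleep if the buffer is full */
			close(fds[1]);                  /* closing the write end gives the reader EOF */
			_exit(0);
		}

		close(fds[1]);
		n = read(fds[0], buf, sizeof(buf));     /* sleeps until data is written */
		printf("read %zd bytes\n", n);
		while (read(fds[0], buf, sizeof(buf)) > 0)  /* returns 0 once the writer closes */
			;
		wait(NULL);
		return 0;
	}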
Introduce wq_has_sleeper() to check whether any process is actually waiting on the wait queue before waking it during pipe read/write operations.
If no processes are waiting, there's no need to execute wake_up_interruptible_sync_poll(), thus avoiding unnecessary wake-ups.
Unnecessary wake-ups can lead to context switches, where a process is woken up to handle I/O events even when there is no immediate need.
By checking with wq_has_sleeper(), processes are woken only when someone is actually waiting, which reduces context switches and system overhead.
Additionally, by reducing unnecessary synchronization and wake-up operations, wq_has_sleeper() can decrease system resource waste and lock contention, improving overall system performance.
For pipe read/write operations, this eliminates ineffective scheduling and enhances concurrency.
It's important to note that enabling this option means invoking wq_has_sleeper() to check for sleeping processes in the wait queue for every read or write operation.
While this is a lightweight operation, it still incurs some overhead.
In low-load or single-task scenarios, this overhead may not yield significant benefits and could even introduce minor performance degradation.
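For reference, wq_has_sleeper() in include/linux/wait.h is, roughly, just a memory barrier followed by a wait-queue emptiness check, which is why the per-call cost is small:

	static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
	{
		/*
		 * The barrier pairs with one on the waiting side so that
		 * the emptiness check is ordered against a waiter adding
		 * itself to the queue.
		 */
		smp_mb();
		return waitqueue_active(wq_head);
	}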
UnixBench Pipe benchmark results on Zhaoxin KX-U6780A processor:
  With the option disabled: Single-core: 841.8, Multi-core (8): 4621.6
  With the option enabled:  Single-core: 877.8, Multi-core (8): 4854.7
Relative to the disabled baseline, single-core performance improved by about 4.3% and multi-core by about 5.0%.
Co-developed-by: Shengjin Yu <yushengjin@uniontech.com>
Signed-off-by: Shengjin Yu <yushengjin@uniontech.com>
Co-developed-by: Dandan Zhang <zhangdandan@uniontech.com>
Signed-off-by: Dandan Zhang <zhangdandan@uniontech.com>
Tested-by: Dandan Zhang <zhangdandan@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
---
 fs/Kconfig | 13 +++++++++++++
 fs/pipe.c  | 21 +++++++++++++++------
 2 files changed, 28 insertions(+), 6 deletions(-)
diff --git a/fs/Kconfig b/fs/Kconfig
index 64d420e3c475..0dacc46a73fe 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -429,4 +429,17 @@ source "fs/unicode/Kconfig"
 config IO_WQ
 	bool
 
+config PIPE_SKIP_SLEEPER
+	bool "Skip sleeping processes during pipe read/write"
+	default n
+	help
+	  This option introduces a check whether the sleep queue will
+	  be awakened during pipe read/write.
+
+	  It often leads to a performance improvement. However, in
+	  low-load or single-task scenarios, it may introduce minor
+	  performance overhead.
+
+	  If unsure, say N.
+
 endmenu

diff --git a/fs/pipe.c b/fs/pipe.c
index 12b22c2723b7..c085333ae72c 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -247,6 +247,15 @@ static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe,
 	return tail;
 }
 
+static inline bool
+pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
+{
+	if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
+		return wq_has_sleeper(wq_head);
+	else
+		return true;
+}
+
 static ssize_t
 pipe_read(struct kiocb *iocb, struct iov_iter *to)
 {
@@ -377,7 +386,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	 * _very_ unlikely case that the pipe was full, but we got
 	 * no data.
 	 */
-	if (unlikely(was_full))
+	if (unlikely(was_full) && pipe_check_wq_has_sleeper(&pipe->wr_wait))
 		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 	kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
@@ -398,9 +407,9 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	wake_next_reader = false;
 	mutex_unlock(&pipe->mutex);
 
-	if (was_full)
+	if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
 		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
-	if (wake_next_reader)
+	if (wake_next_reader && pipe_check_wq_has_sleeper(&pipe->rd_wait))
 		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 	kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 	if (ret > 0)
@@ -573,7 +582,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	 * become empty while we dropped the lock.
 	 */
 	mutex_unlock(&pipe->mutex);
-	if (was_empty)
+	if (was_empty && pipe_check_wq_has_sleeper(&pipe->rd_wait))
 		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 	kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 	wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
@@ -598,10 +607,10 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	 * Epoll nonsensically wants a wakeup whether the pipe
 	 * was already empty or not.
 	 */
-	if (was_empty || pipe->poll_usage)
+	if ((was_empty || pipe->poll_usage) && pipe_check_wq_has_sleeper(&pipe->rd_wait))
 		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 	kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
-	if (wake_next_writer)
+	if (wake_next_writer && pipe_check_wq_has_sleeper(&pipe->wr_wait))
 		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 	if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
 		int err = file_update_time(filp);
Don't you think the Cc list is a bit overloaded?
On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> When a user calls the read/write system call and passes a pipe descriptor, the pipe_read/pipe_write functions are invoked:
>
> - pipe_read():
>   1) Checks that the pipe is valid and whether there is any data in the pipe buffer.
>   2) Waits for data:
>      * If there is no data in the pipe and the write end is still open, the current process sleeps (wait_event()) until data is written.
>      * If the write end is closed, returns 0.
>   3) Reads data:
>      * Once woken, copies data from the pipe's kernel buffer to user space.
>      * When the buffer is full, the writing process sleeps until the pipe state changes and it is woken via wake_up_interruptible_sync_poll(). Once data has been read from the buffer, the writer can continue writing and the reader can continue reading new data.
>   4) Returns the number of bytes read on success.
>
> - pipe_write():
>   1) Checks that the pipe is valid and whether there is any space left in the pipe buffer.
>   2) Waits for buffer space:
>      * If the pipe buffer is full and no data has been read, pipe_write() may put the current process to sleep until space becomes available.
>      * If the read end of the pipe is closed (no process can ever read), -EPIPE is returned and a SIGPIPE signal may be sent to the process.
>   3) Writes data:
>      * If there is enough space in the pipe buffer, pipe_write() copies data from the user-space buffer into the pipe's kernel buffer (using copy_from_user()).
>      * If the amount of data to write is larger than the available space, multiple writes may be required, or the process may wait for new space to be freed.
>   4) Wakes up waiting readers:
>      * After the data is written, pipe_write() wakes up any processes that may be waiting to read (via wake_up_interruptible_sync_poll()).
>   5) Returns the number of bytes successfully written.
>
> Introduce wq_has_sleeper() to check whether any process is actually waiting on the wait queue before waking it during pipe read/write operations.
>
> If no processes are waiting, there's no need to execute wake_up_interruptible_sync_poll(), thus avoiding unnecessary wake-ups.
>
> Unnecessary wake-ups can lead to context switches, where a process is woken up to handle I/O events even when there is no immediate need.
>
> By checking with wq_has_sleeper(), processes are woken only when someone is actually waiting, which reduces context switches and system overhead.
>
> Additionally, by reducing unnecessary synchronization and wake-up operations, wq_has_sleeper() can decrease system resource waste and lock contention, improving overall system performance.
>
> For pipe read/write operations, this eliminates ineffective scheduling and enhances concurrency.
>
> It's important to note that enabling this option means invoking wq_has_sleeper() to check for sleeping processes in the wait queue for every read or write operation.
>
> While this is a lightweight operation, it still incurs some overhead.
>
> In low-load or single-task scenarios, this overhead may not yield significant benefits and could even introduce minor performance degradation.
>
> UnixBench Pipe benchmark results on Zhaoxin KX-U6780A processor:
>
>   With the option disabled: Single-core: 841.8, Multi-core (8): 4621.6
>   With the option enabled:  Single-core: 877.8, Multi-core (8): 4854.7
>
> Relative to the disabled baseline, single-core performance improved by about 4.3% and multi-core by about 5.0%.
...
> +config PIPE_SKIP_SLEEPER
> +	bool "Skip sleeping processes during pipe read/write"
> +	default n

'n' is the default 'default', no need to have this line.

> +	help
> +	  This option introduces a check whether the sleep queue will
> +	  be awakened during pipe read/write.
> +
> +	  It often leads to a performance improvement. However, in
> +	  low-load or single-task scenarios, it may introduce minor
> +	  performance overhead.
> +
> +	  If unsure, say N.

Illogical, it's already N as you stated by putting a redundant line, but after removing that line it will make sense.
...
> +static inline bool

Have you built this with Clang and `make W=1 ...`?

> +pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
> +{
> +	if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
> +		return wq_has_sleeper(wq_head);
> +	else

Redundant.

> +		return true;

	if (!foo)
		return true;

	return bar(...);

> +}
On 2024/12/25 21:30, Andy Shevchenko wrote:
> Don't you think the Cc list is a bit overloaded?
Hi,
I apologize for any inconvenience this may cause.
I understand that under normal circumstances, one would simply pass the modified code path as an argument to the kernel's scripts/get_maintainer.pl script to determine the appropriate recipients.
However, given the vast and complex nature of the Linux kernel community, with tens of thousands of developers worldwide, and considering the varying "customs" of different subsystems, as well as time zone differences and individual work habits, it's not uncommon for patches to be sent to mailing lists and subsequently ignored or left pending.
This patch, for example, has been submitted multiple times without receiving any response, unfortunately.
My intention is simply to seek your review, and that of other technical experts like yourself, but I cannot be certain, prior to your response, which specific experts on which lists would be willing to provide feedback.
I would appreciate any other suggestions you may have.
>> UnixBench Pipe benchmark results on Zhaoxin KX-U6780A processor:
>>
>>   With the option disabled: Single-core: 841.8, Multi-core (8): 4621.6
>>   With the option enabled:  Single-core: 877.8, Multi-core (8): 4854.7
>>
>> Relative to the disabled baseline, single-core performance improved by about 4.3% and multi-core by about 5.0%.
...
As you know, the kernel is extremely sensitive to performance.
Even a 1% performance improvement can lead to significant efficiency gains and reduced carbon emissions in production environments, as long as there is sufficient testing and theoretical analysis to prove that the improvement is real and not due to measurement error or jitter.
>> +config PIPE_SKIP_SLEEPER
>> +	bool "Skip sleeping processes during pipe read/write"
>> +	default n
>
> 'n' is the default 'default', no need to have this line.
OK, I'll drop it. Thanks.
>> +	help
>> +	  This option introduces a check whether the sleep queue will
>> +	  be awakened during pipe read/write.
>> +
>> +	  It often leads to a performance improvement. However, in
>> +	  low-load or single-task scenarios, it may introduce minor
>> +	  performance overhead.
>> +
>> +	  If unsure, say N.
>
> Illogical, it's already N as you stated by putting a redundant line, but after removing that line it will make sense.
...
As noted, I'll remove "default n" as it serves no purpose.
>> +static inline bool
>
> Have you built this with Clang and `make W=1 ...`?
Hmm...I've noticed a discrepancy in kernel compilation results with and without "make W=1".
When I use x86_64_defconfig and clang-19.1.1 (Ubuntu 24.10) and run "make", there are no warnings.
However, when I run "make W=1", the kernel generates a massive number of errors, causing the compilation to fail prematurely.
e.g.
In file included from arch/x86/kernel/asm-offsets.c:14:
In file included from ./include/linux/suspend.h:5:
In file included from ./include/linux/swap.h:9:
In file included from ./include/linux/memcontrol.h:21:
In file included from ./include/linux/mm.h:2224:
./include/linux/vmstat.h:504:43: error: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Werror,-Wenum-enum-conversion]
  504 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
      |                            ~~~~~~~~~~~~~~~~~~~~~ ^
  505 |                            item];
      |                            ~~~~
./include/linux/vmstat.h:511:43: error: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Werror,-Wenum-enum-conversion]
  511 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
      |                            ~~~~~~~~~~~~~~~~~~~~~ ^
  512 |                            NR_VM_NUMA_EVENT_ITEMS +
      |                            ~~~~~~~~~~~~~~~~~~~~~~
./include/linux/vmstat.h:524:43: error: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Werror,-Wenum-enum-conversion]
  524 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
      |                            ~~~~~~~~~~~~~~~~~~~~~ ^
  525 |                            NR_VM_NUMA_EVENT_ITEMS +
      |                            ~~~~~~~~~~~~~~~~~~~~~~
3 errors generated.
And I've observed similar behavior with gcc-14.2.0.
While I'm keen on addressing as many potential compile errors and warnings in the kernel as possible, it seems like a long-term endeavor.
Regarding this specific code, I'd appreciate your insights on how to improve it.
>> +pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
>> +{
>> +	if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
>> +		return wq_has_sleeper(wq_head);
>> +	else
>
> Redundant.
>
>> +		return true;
>
> if (!foo)
> 	return true;
>
> return bar(...);
>
>> +}
Yes. I'll rework the code structure here. Thanks.
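Presumably the rework would look something like this (a sketch following Andy's suggested structure, not a posted v2):

	static inline bool
	pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
	{
		/* Option disabled: preserve the old behavior and always wake up. */
		if (!IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
			return true;

		/* Only report that a wakeup is needed if someone is queued. */
		return wq_has_sleeper(wq_head);
	}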
Thanks,
On Wed, 25 Dec 2024 at 01:44, WangYuli <wangyuli@uniontech.com> wrote:
> +config PIPE_SKIP_SLEEPER
> +	bool "Skip sleeping processes during pipe read/write"
If the optimization is correct, there is no point to having a config option.
If the optimization is incorrect, there is no point to having the code.
Either way, there's no way we'd ever have a config option for this.
> +pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
> +{
> +	return wq_has_sleeper(wq_head);
So generally, the reason this is buggy is that it's racy due to either CPU memory ordering issues or simply because the sleeper is going to sleep but hasn't *quite* added itself to the wait queue.
Which is why the wakeup path takes the wq head lock.
Which is the only thing you are actually optimizing away.
> +	if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
> 		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
End result: you need to explain why the race cannot exist.
And I think your patch is simply buggy because the race can in fact exist, because there is no other lock than the waitqueue lock that enforces memory ordering and protects against the "somebody is just about to sleep" race.
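Sketched as a timeline, the "somebody is just about to sleep" window being described is roughly:

	writer (pipe_write)                      reader (pipe_read)
	-------------------                      ------------------
	                                         finds the pipe empty,
	                                         decides to sleep ...
	writes data into the buffer
	wq_has_sleeper(&pipe->rd_wait)
	  -> reader not queued yet,
	     returns false
	  -> wakeup skipped
	                                         ... adds itself to rd_wait
	                                         and re-checks the pipe; with
	                                         no shared lock enforcing the
	                                         ordering, it can miss the
	                                         write and sleep indefinitely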
That said, *if* all the wait queue work was done under the pipe mutex, we could use the pipe mutex for exclusion instead.
That's actually how it kind of used to work long long ago (not really - but close: it depended on the BKL originally, and adding waiters to the wait-queues early before all the tests for full/empty, and avoiding the races that way)
But now waiters use "wait_event_interruptible_exclusive()" explicitly outside the pipe mutex, so the waiters and wakers aren't actually serialized.
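For example, the reader side of pipe_read() in current kernels sleeps only after dropping the mutex, roughly:

	mutex_unlock(&pipe->mutex);
	/* wake any writers before going to sleep ourselves */
	if (unlikely(was_full))
		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
	kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
	/* sleep outside the mutex; only the wait-queue lock orders us */
	if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0)
		return -ERESTARTSYS;
	mutex_lock(&pipe->mutex);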
And no, you are unlikely to see the races in practice (particularly the memory ordering ones). So you have to *think* about them, not test them.
Linus