On 05-06-2023 09:39, Kevin Brodsky wrote:
In commit title: I'm a bit hesitant to call this a fix, as it doesn't actually fix anything if the size of io_event remains the same. Maybe just talk directly about using vmap? The commit message itself clarifies the point of doing that.
Makes sense. Done!
On 04/04/2023 15:53, Tudor Cretu wrote:
There were two underlying assumptions:
- The size of the aio_ring header is equal to the size of an io_event, i.e. sizeof(aio_ring) == sizeof(io_event)
- There is no leftover space at the end of a page when populating it with io_events, i.e. PAGE_SIZE is a multiple of sizeof(io_event)
This change aims to make the aio_ring mechanism more robust so that it no longer relies on the above assumptions. It also improves the copying of io_events from the shared ring to userspace by batching them into at most two copy_to_user calls.
Mapping pages independently using kmap doesn't create a virtually contiguous space. As some io_event structs could span page boundaries, all the pages needed to access the events' data have to be mapped into a virtually contiguous range. For this reason, kmap is replaced with vmap.
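For illustration, the batched copy amounts to something like the sketch below (the helper name and its parameters are made up for this example and do not match the patch exactly; "events" is assumed to point at the vmap'ed, virtually contiguous view of the ring's io_event array):

/*
 * Sketch only: copy "nr" events starting at slot "head" out of a ring
 * that holds "nr_events" slots. If the range wraps past the end of the
 * ring, it is split into two contiguous chunks, so at most two
 * copy_to_user() calls are needed.
 */
static long copy_ring_events_to_user(struct io_event __user *uevents,
				     struct io_event *events,
				     unsigned int head, unsigned int nr,
				     unsigned int nr_events)
{
	unsigned int first = min(nr, nr_events - head);

	if (copy_to_user(uevents, events + head,
			 first * sizeof(struct io_event)))
		return -EFAULT;

	if (nr > first &&
	    copy_to_user(uevents + first, events,
			 (nr - first) * sizeof(struct io_event)))
		return -EFAULT;

	return nr;
}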
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 fs/aio.c | 80 ++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 52 insertions(+), 28 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 079074f47c2e7..c835deda5cdcc 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -603,9 +603,24 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 	return 0;
 }
 
-#define AIO_EVENTS_PER_PAGE	(PAGE_SIZE / sizeof(struct io_event))
-#define AIO_EVENTS_FIRST_PAGE	((PAGE_SIZE - sizeof(struct aio_ring)) / sizeof(struct io_event))
-#define AIO_EVENTS_OFFSET	(AIO_EVENTS_PER_PAGE - AIO_EVENTS_FIRST_PAGE)
+static unsigned int get_event_pages_idx(unsigned int event_idx,
+					unsigned int nr_events,
+					size_t *event_offset,
+					unsigned int *nr_pages)
+{
+	unsigned int page_idx;
+	size_t off;
+
+	off = sizeof(struct aio_ring);
+	off += sizeof(struct io_event) * event_idx;
+
+	page_idx = off / PAGE_SIZE;
+	*event_offset = offset_in_page(off);
+
+	off += sizeof(struct io_event) * nr_events - 1;
+	*nr_pages = off / PAGE_SIZE + 1 - page_idx;
+
+	return page_idx;
+}
 
 void kiocb_set_cancel_fn(struct kiocb *iocb, kiocb_cancel_fn *cancel)
 {
@@ -1134,8 +1149,10 @@ static void aio_complete(struct aio_kiocb *iocb)
 {
 	struct kioctx	*ctx = iocb->ki_ctx;
 	struct aio_ring	*ring;
-	struct io_event	*ev_page, *event;
-	unsigned tail, pos, head;
+	struct io_event	*event;
+	void *ring_pages;
+	size_t event_offset;
+	unsigned int tail, head, ctx_page_idx, nr_pages;
 	unsigned long	flags;
 
 	/*
@@ -1146,18 +1163,21 @@ static void aio_complete(struct aio_kiocb *iocb)
 	spin_lock_irqsave(&ctx->completion_lock, flags);
 
 	tail = ctx->tail;
-	pos = tail + AIO_EVENTS_OFFSET;
-
 	if (++tail >= ctx->nr_events)
 		tail = 0;
 
-	ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
-	event = ev_page + pos % AIO_EVENTS_PER_PAGE;
+	ctx_page_idx = get_event_pages_idx(tail, 1, &event_offset, &nr_pages);
+	ring_pages = vmap(ctx->ring_pages + ctx_page_idx, nr_pages, VM_MAP, PAGE_KERNEL);
I've pondered on this for a while; in the end, the current approach does worry me in terms of performance. The io_getevents() case is not a very big deal: you can fetch multiple pages of events at the same time this way, and anyway high-performance applications will probably just read directly from the ring. OTOH there is no way around io_submit(), and here we are effectively performing a vmap() + vunmap() for every event in the array, which seems very wasteful.
Very interestingly, while I was checking whether the overhead of vmap() had been measured, I stumbled upon an aio patch [1] that also replaces kmap with vmap! A big difference is that it sets up the mapping when creating the ring and then keeps it around, allowing the ring to be accessed directly at all times (a bit like io_uring). There is some discussion on that patch, including about performance, but unfortunately no conclusion, and it looks like the author then lost interest.
In our situation, where struct io_event ends up being 48 bytes, switching to vmap seems justified, if only to keep things reasonably simple. It does feel, though, that the approach in [1] makes more sense: we do increase TLB pressure by keeping the kernel mapping around, but that should be a negligible cost compared to calling vmap/vunmap in a loop.
I don't think [1] applies as-is and it might not be a perfect solution anyway, so feel free to adapt it or just take some inspiration from it.
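Roughly, I'd imagine something along these lines (untested sketch, not the actual code from [1]; the ctx->ring field and the helper names are made up):

/*
 * Sketch: map the whole ring once when it is created and keep the
 * mapping for the lifetime of the kioctx, instead of calling
 * vmap()/vunmap() around every access. "ring" would be a new
 * "struct aio_ring *" field in struct kioctx.
 */
static int aio_map_ring(struct kioctx *ctx)
{
	ctx->ring = vmap(ctx->ring_pages, ctx->nr_pages, VM_MAP, PAGE_KERNEL);
	return ctx->ring ? 0 : -ENOMEM;
}

static void aio_unmap_ring(struct kioctx *ctx)
{
	vunmap(ctx->ring);
	ctx->ring = NULL;
}

aio_complete() and aio_read_events_ring() could then simply index into ctx->ring, without any per-event mapping or unmapping.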
I wanted the changes to be as non-invasive as possible, but I agree that the overhead is just too much and this solution is much better. [1] is a great find, and it confirms that this change of approach is viable. Many thanks!
Tudor
Kevin
[1] https://lore.kernel.org/all/1349764760-21093-4-git-send-email-koverstreet@go...