Building an Async Rust Runtime on io_uring: 7.5ms vs Tokio's 14.9ms

The Question That Started Everything

Why Does Async Exist at All?

Enter io_uring: The Kernel's Secret Weapon

How RingCore Works: A Tour of the Four Layers

Layer 1: Talking to the Kernel (src/sys.rs, src/ring.rs)

Layer 2: Wrapping Operations in Futures (src/op.rs)

Layer 3: The Executor (src/executor.rs)

Layer 4: Friendly Wrappers (src/net.rs)

The Benchmarks

File I/O: reading a 100MB file

Networking: sequential and concurrent requests

Advanced: kernel-level task chaining

The Mental Model That Changes Everything

What's in the Repo

Requirements

Why Build This Instead of Just Using Tokio?

This is Part of a Series

You use async/await every day. But do you know what actually happens when your code "pauses"? I didn't, so I built something to find out. The result is RingCore, a minimal async runtime in Rust, built directly on Linux's io_uring, with zero abstraction layers in the way. No Tokio. No hidden thread pools. Just Rust, a kernel interface, and a lot of curiosity.

If you've written async Rust, you've probably typed this: `let data = file.read().await;`. And it just works. The program doesn't freeze. Other tasks keep running. But I kept asking: what is actually happening when .await suspends a task? Where does execution go? Who wakes it back up? How does the OS fit into any of this? Most tutorials stop at "the runtime handles it." That answer never satisfied me.

Imagine you're a chef in a kitchen. You put a steak on the grill and just stand there watching it cook. You don't prep the salad. You don't plate the dessert. You just wait. That's synchronous I/O: your program calls read(), the OS fetches data from disk or the network, and your thread sits idle until it comes back. Wasteful.

Async I/O lets you be a smarter chef. You start the steak, set a timer, and go do other things. When the timer fires, you come back and finish. In Rust, async/await is the language-level mechanism for writing this kind of code. But Rust itself doesn't define how the waiting works; that's the runtime's job. Most people reach for Tokio, which is fantastic and production-ready. But it's also a black box. I wanted the white box.

Traditional async I/O on Linux is expensive. Every interaction with the kernel requires a context switch: a CPU jump from user mode (your program) into kernel mode (the OS) and back. Under heavy I/O load, these add up fast. io_uring, introduced in Linux 5.1 by kernel developer Jens Axboe, takes a radically different approach.
Instead of making individual system calls, your program and the kernel share two ring buffers in memory: a Submission Queue (SQ) where you write I/O requests, and a Completion Queue (CQ) where the kernel writes results back. Think of it like a diner counter with a ticket window. Instead of running to the kitchen for every order, you slide all your tickets through the window at once and the kitchen slides the finished plates back. One trip. Maximum efficiency. Multiple I/O operations can be batched into a single io_uring_enter system call. Context switches plummet. Performance soars.

The lowest layer handles raw kernel communication. No OS library wrappers. No abstraction. RingCore manually invokes SYS_IO_URING_SETUP and SYS_IO_URING_ENTER via libc, and uses mmap to map the kernel's SQ and CQ ring buffers directly into the process's address space. This is the part most async tutorials skip entirely. In RingCore, it's front and center.

This is where things get interesting. Rust's Future trait is simple: poll it, and you get Poll::Ready(value) if the result is done, or Poll::Pending if not, along with a Waker so someone can nudge the task later. In RingCore, every io_uring operation becomes a Future; the key poll implementation is shown in the code listings below. The elegant part: when the kernel finishes and writes a CQE with a matching ID, the executor retrieves the stored Waker and calls it. No magic; it's just a map, an ID, and a callback.

The executor is the brain that orchestrates everything. Its main loop is beautifully simple: a classic event loop, similar in spirit to Node.js, but with direct kernel access instead of libuv underneath.

The top layer gives you TcpListener and TcpStream with clean async fn methods. They feel like normal Rust networking, but under the hood they're submitting SQEs to the ring. The whole stack: four files, clean separation, nothing hidden.

Tested on Debian 13, kernel 6.12, comparing RingCore against std and Tokio.

Tokio is 5× slower here. Why? Tokio doesn't use io_uring for file I/O by default; it offloads blocking file reads to a thread pool, which adds significant overhead.
RingCore uses true async kernel operations.

The 1,000-request stress test is the eye-opener. Tokio takes over a second because its multi-threaded scheduling model pays heavy coordination costs at this scale. RingCore handles all of it on a single thread, with the kernel doing the heavy lifting.

Using IOSQE_IO_LINK, RingCore chains dependent operations (like Read → Write) so the kernel executes them back-to-back without ever returning to userspace. One io_uring_enter call. Zero ping-pong.

Here's what building RingCore made concrete for me, the thing no tutorial made clear before. When you .await something in Rust, you're saying: "I'm not ready yet. Here's my callback (the Waker). Come get me when something changes." The executor moves on to other tasks. The kernel works in the background. When the kernel is done, it writes a CQE. The executor reads it, finds the matching Waker in the map, and calls it. Your task wakes up and continues from where it left off. That's the entire model. RingCore makes every step of it visible, and there's no layer you can't read.

Examples are organized into four tiers so you can explore progressively:

- Tier 1: proving the runtime
- Tier 2: the async model
- Tier 3: real workloads
- Tier 4: advanced features

Start with echo, trace through the source, and you'll have a complete mental model of async I/O in about an afternoon.

Tokio is the right choice for production. I'm not suggesting you replace it. But if you've ever stared at a select! macro, a JoinHandle, or a .await and wondered what is actually happening in the kernel right now, building something like RingCore is the answer. I'm not intimidated by async Rust anymore. Not because it got simpler, but because I can now see every moving part. The abstraction didn't disappear; I just understand what it's abstracting.

RingCore isn't the first time I've gone down this rabbit hole.
A few weeks ago I also built a container engine in Rust that starts in 10ms, cracking open Linux namespaces, cgroups, and clone() syscalls along the way. The two projects rhyme. With the container engine I asked: what actually happens when you run a container? With RingCore I asked: what actually happens when you .await? Both answers live in the kernel. Both are learnable. And the best way to demystify them is to build a tiny, intentionally incomplete version yourself.

If this sparked any curiosity about systems programming, async I/O, or Rust internals, that was the whole point. Issues and PRs are very welcome.


The async call from the intro:

```rust
let data = file.read().await;
```

Layer 1: manual syscalls and ring mapping (src/sys.rs, src/ring.rs):

```rust
// Manually invoke the io_uring_setup syscall
let ring_fd = unsafe {
    libc::syscall(
        libc::SYS_io_uring_setup,
        QUEUE_DEPTH as libc::c_long,
        &params as *const _ as libc::c_long,
    )
} as i32;

// Map the Submission Queue into our address space
let sq_ptr = unsafe {
    libc::mmap(
        std::ptr::null_mut(),
        sq_size,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_SHARED | libc::MAP_POPULATE,
        ring_fd,
        libc::IORING_OFF_SQ_RING as libc::off_t,
    )
};
```

Layer 2: the Future implementation for an operation (src/op.rs):

```rust
impl Future for Op {
    type Output = i32;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // If this is the first poll, submit the SQE to the ring
        if !self.submitted {
            RING.with(|ring| {
                let mut ring = ring.borrow_mut();
                // Write the Submission Queue Entry to the shared kernel buffer
                ring.push_sqe(self.sqe);
            });
            // Store the Waker in a global map, keyed by our unique operation ID.
            // The executor will retrieve this when the kernel signals completion.
            WAKER_MAP.with(|map| {
                map.borrow_mut().insert(self.user_data, cx.waker().clone());
            });
            self.submitted = true;
            return Poll::Pending; // Go away; we'll call you when the kernel is done
        }

        // Check if our Completion Queue Entry has arrived
        match self.result.take() {
            Some(res) => Poll::Ready(res),
            None => {
                // Update the waker and keep waiting
                WAKER_MAP.with(|map| {
                    map.borrow_mut().insert(self.user_data, cx.waker().clone());
                });
                Poll::Pending
            }
        }
    }
}
```

Layer 3: the executor's main loop (src/executor.rs):

```rust
pub fn run(&mut self) {
    loop {
        // Step 1: Poll all tasks that have been woken up
        while let Some(task) = self.ready_queue.pop_front() {
            let waker = task.waker();
            let mut cx = Context::from_waker(&waker);
            match task.future.borrow_mut().as_mut().poll(&mut cx) {
                Poll::Ready(_) => { /* Task complete, drop it */ }
                Poll::Pending => { /* Task is waiting on I/O, leave it */ }
            }
        }

        // Step 2: Submit pending SQEs and harvest completed CQEs.
        // min_complete=1 means: block until at least one operation finishes.
        // This puts the thread to sleep until the kernel has work for us.
        let completed = self.ring.submit_and_wait(1);

        // Step 3: For each completed operation, wake the waiting task
        for cqe in completed {
            WAKER_MAP.with(|map| {
                if let Some(waker) = map.borrow_mut().remove(&cqe.user_data) {
                    // Store the result, then wake the future
                    store_result(cqe.user_data, cqe.res);
                    waker.wake();
                }
            });
        }

        if self.all_tasks_complete() {
            break;
        }
    }
}
```

Layer 4: the friendly TcpStream wrapper (src/net.rs):

```rust
impl TcpStream {
    pub async fn read(&self, buf: &mut [u8]) -> io::Result<usize> {
        // This creates an Op future that submits IORING_OP_READ
        // and suspends until the kernel completes it
        let result = Op::read(self.fd, buf).await;
        if result < 0 {
            Err(io::Error::from_raw_os_error(-result))
        } else {
            Ok(result as usize)
        }
    }
}
```

Tier 1: proving the runtime:

```sh
cargo run --example echo           # Chained Accept → Read → Write
cargo run --example cat -- <file>  # File I/O in isolation
cargo run --example timer          # Task parking and waking without I/O
```

Tier 2: the async model:

```sh
cargo run --example concurrent_downloads  # 100 SQEs submitted simultaneously
cargo run --example timeout_race          # Operation cancellation via IORING_OP_ASYNC_CANCEL
```

Tier 3: real workloads:

```sh
cargo run --example http_server  # High-concurrency "Hello World"
cargo run --example file_server  # Serving static files over TCP
```

Tier 4: advanced features:

```sh
sudo cargo run --example sqpoll           # Kernel-side SQ polling (needs CAP_SYS_ADMIN)
cargo run --example linked_cat -- <file>  # Chained Read + Write at kernel level
cargo run --example multishot_accept      # One SQE → infinite connection CQEs
```

Adding RingCore to a project:

```toml
[dependencies]
ringcore = "0.1.0"
```

The two ring buffers:

- Submission Queue (SQ): you write your I/O requests here.
- Completion Queue (CQ): the kernel writes results back here.

Requirements:

- Linux 5.10+ for stable IORING_OP_ACCEPT support
- x86_64 architecture
- Dependencies: libc and std only

Links:

- GitHub: github.com/sumant1122/ringcore
- Crates.io: crates.io/crates/ringcore
- Docs: docs.rs/ringcore
- Companion project (container engine): github.com/sumant1122/Nucleus