io_uring Interface Overview

Updated 17 January 2026
  • io_uring is a modern asynchronous I/O interface that utilizes shared submission and completion ring buffers for batching system calls.
  • It unifies storage and network I/O operations, significantly reducing syscall overhead and boosting throughput in high-performance systems.
  • Advanced features such as registered buffers and SQPOLL enable fine-tuned optimizations, offering practical benefits for DBMS integrations.

The Linux io_uring interface is a modern, asynchronous system call batching facility that integrates both storage and network I/O under a unified, low-overhead API. The core innovation is the use of two shared, memory-mapped ring buffers—the Submission Queue (SQ) and Completion Queue (CQ)—enabling user-space processes to enqueue and complete arbitrary system calls asynchronously, batch requests to minimize syscall overhead, and efficiently handle high-throughput, low-latency I/O workloads. Its extensibility and unified interface have rendered io_uring a foundational component for high-performance systems, notably database management systems (DBMSs), though optimal performance requires careful architectural design and tuning (Jasny et al., 4 Dec 2025).

1. Architectural Foundations

The io_uring interface centers on two user–kernel shared, mmap-ed circular ring buffers: the SQ and CQ. Applications initialize io_uring via io_uring_queue_init(entries, &ring, flags), defining the size of these structures. The user-space process populates Submission Queue Entries (SQEs), tracked with SQ_head and SQ_tail pointers, and submits up to N requests with a single io_uring_enter(ring_fd, to_submit=N,...) syscall. Completion Queue Events (CQEs) are produced asynchronously by the kernel, with applications collecting up to M results per invocation.

By batching submissions (N ≫ 1), syscall costs are amortized drastically: microbenchmarks report a reduction from approximately 300 cycles per SQE for individual submissions to around 40 cycles per SQE when batching 16 requests. This design delivers a 5–6× reduction in per-I/O syscall cost, directly improving I/O throughput and CPU efficiency (Jasny et al., 4 Dec 2025).

2. Unification of Storage and Network I/O

Traditional Linux offered disjoint APIs for different I/O types (e.g., read/write for storage, epoll for sockets, libaio for block devices), each with distinct semantics. io_uring subsumes these under a single, completion-based model. A wide set of operations (e.g., read(), write(), sendmsg(), recvmsg(), fsync(), openat(), madvise(), and arbitrary NVMe commands via IORING_OP_URING_CMD) are issued as SQEs. The kernel demultiplexes operations to the filesystem, block-layer, or network stack as appropriate; all completions are uniformly reported via the CQ.

Execution proceeds along three kernel-defined paths:

  • Inline: Operations are immediately satisfied if data is available (e.g., socket with pending data), minimizing latency.
  • Non-blocking/pollable: For pollable resources, an async_wake handler completes requests when readiness is signaled, avoiding thread blocking.
  • Worker threads: Operations incompatible with pure asynchrony (e.g., fsync, large reads) fall back to io_worker threads. Explicit control is exposed (IOSQE_ASYNC), but excessive fallback is detrimental (adding ~7.3 µs per operation) (Jasny et al., 4 Dec 2025).

System builders must weigh trade-offs: the unified interface increases configuration complexity (e.g., selecting among DEFER_TASKRUN, COOP_TASKRUN, and SQPOLL modes, or tuning O_DIRECT versus NVMe passthrough and polled versus interrupt-driven completion), and flag misconfiguration can induce undesirable worker fallback or inter-processor interrupts.

3. Advanced Features and Performance Implications

io_uring provides a suite of advanced mechanisms, each contributing to performance but with stringent compatibility constraints:

  • Registered Buffers ("Fixed Buffers"): Pages are pinned once via io_uring_register, allowing DMA directly without per-I/O pinning or memcpy. In a YCSB-style buffer manager, this reduces per-I/O overhead, yielding ~11% throughput gain (238k tx/s vs 216k tx/s).
  • NVMe Passthrough (IORING_OP_URING_CMD): Enables direct NVMe command issuance, bypassing the Linux block layer and scheduler, producing an additional ~20% throughput increase (300k tx/s in the cited workload).
  • IOPoll (NAPI for Storage): Replaces interrupt-driven completion with polling of NVMe queues, securing another 21% gain (376k tx/s, single core), albeit at the expense of dedicating CPU cycles to the polling loop when I/O is sparse.
  • SQPOLL: A dedicated kernel thread spins on the SQ, obviating most io_uring_enter syscalls. This approach, requiring a dedicated core, yields an observed ~32% throughput surge (546k tx/s).
  • Zero-copy Networking: Socket buffers and FDs are registered to enable kernel–user zero-copy on both send and receive paths. In a 6-node shuffle, this halves memory bandwidth per GiB transmitted (~10GiB/GiB to ~5GiB/GiB), with up to 2.5× end-to-end speedup versus epoll+copy under high concurrency.
  • Multishot recv and PollFirst: Multishot receive allows one SQE to generate multiple CQEs in datagram-heavy traffic; PollFirst disables speculative syscalls, lowering CPU consumption for RPC-style exchanges (Jasny et al., 4 Dec 2025).

An analytical model for batched, asynchronous DBMS threads is presented:

  • For synchronous, latency-bound designs:

$$\text{Throughput} \approx \frac{1}{r_{pf} \cdot (L_{read} + L_{write})}$$

  • For asynchronous, CPU-bound regimes:

$$T \approx \frac{f_{clk}}{c_{tx} + r_{pf} \cdot c_{io}}$$

Here $r_{pf}$ is the page-fetch rate per transaction, $L_{read}$ and $L_{write}$ are device read and write latencies, $f_{clk}$ is the core clock frequency, $c_{tx}$ the in-memory cycles per transaction (~8,264), and $c_{io}$ the amortized I/O cost (e.g., 15,900 cycles per operation unbatched, 11,100 cycles with read batching). The model aligns with observation (e.g., 190k tx/s predicted vs. 183k tx/s measured) (Jasny et al., 4 Dec 2025).

4. Guidelines for Integration in Database Management Systems

Practical lessons and configuration guidelines for exploiting io_uring in DBMSs are derived empirically:

  • Profiling: Substantial io_uring benefits manifest only when I/O or memory bandwidth dominates runtimes (>30% CPU time). For in-memory or compute-bound loads (e.g., in-memory TPC-C), improvements are modest.
  • Async/Batched Architectures: Critical execution paths should issue I/O in batches, overlap computation and I/O (using coroutines or fibers), and eschew one-request-per-syscall designs to exploit batching.
  • Correct Execution Flags:
    • DEFER_TASKRUN defers completion-side task work until the application explicitly reaps CQEs, avoiding unsolicited interruptions of the submitting thread.
    • SQPOLL should be enabled if a core can be dedicated.
    • Avoid IOSQE_ASYNC fallback by adhering to hardware and IOMMU constraints.
  • Activation of Appropriate Optimizations:
    • Passthrough/IOPOLL require raw block devices (i.e., no filesystem overlay).
    • Registered buffers need page alignment.
    • Zero-copy networking requires NIC support.
    • Batched submission sizes (16–64 I/Os) must balance syscall amortization with added latency variance (Jasny et al., 4 Dec 2025).

5. Empirical Case Study: PostgreSQL Integration

PostgreSQL 18 introduces an optional io_uring backend for both data and WAL I/O. Even naïve, asynchronous reads (no SQPOLL/fixed buffers) produce a 3× speedup on cold scans over blocking readahead, confirming I/O-bound execution. The staged application of advanced features yields cumulative gains:

Optimization                          Observed Gain
Register entire buffer pool           +4–6%
Enable IOPOLL with O_DIRECT/ext4      +7.5%
Activate SQPOLL (shared thread)       +2–3%

End-to-end, the integration delivers 11–15% over the io_uring baseline and approximately 14% beyond pre-io_uring PostgreSQL releases (Jasny et al., 4 Dec 2025).
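Enabling the backend is a configuration change; a minimal sketch, assuming the io_method setting introduced in PostgreSQL 18 (value names per its documentation; the concurrency value is illustrative, not a recommendation from the cited work):

```ini
# postgresql.conf (PostgreSQL 18): select the asynchronous I/O backend.
io_method = io_uring            # alternatives: worker (default), sync
effective_io_concurrency = 64   # illustrative: permit deeper read batches
```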

6. Summary and Implications for Future I/O System Design

io_uring embodies three core principles: a unified model for asynchronous storage and networking I/O, efficient batched communication via shared rings, and an extensible infrastructure accommodating low-level optimizations (registered buffers, NVMe passthrough, zero-copy networking). Realizing maximal benefit is contingent upon architectural re-engineering toward async, batched paradigms, judicious flag selection, and deep awareness of hardware interactions. When properly employed, io_uring is empirically validated to provide up to a doubling of single-threaded IOPS, saturate single-node network bandwidth at 400 Gbit/s with minimal cores, and yield up to 14% end-to-end speedup in mature DBMS deployments (Jasny et al., 4 Dec 2025).
