
High-Performance DBMSs via io_uring

Updated 10 December 2025
  • High-performance DBMSs with io_uring are advanced systems that leverage a unified, low-overhead asynchronous I/O interface using shared submission and completion ring buffers.
  • They employ batched system calls and adaptive thread pool models to maximize throughput, achieving performance gains up to 32% and significant latency reductions.
  • Empirical studies demonstrate improvements of up to 11× in transaction rates and enhanced network efficiency by exploiting features like SQPOLL and zero-copy I/O.

High-performance database management systems (DBMSs) have increasingly adopted Linux io_uring to more effectively exploit advances in storage and network device capabilities. io_uring provides a unified, low-overhead, asynchronous interface for both block and network I/O that eliminates much of the overhead intrinsic to traditional synchronous and asynchronous I/O APIs. When carefully architected, io_uring integration enables DBMSs to achieve device-saturating throughput with low latency and flexible concurrency control, but it introduces new complexities in orchestration, error handling, and durability. This article details the technical principles, design patterns, workload considerations, empirical results, and key challenges surrounding high-performance DBMS construction with io_uring.

1. Technical Fundamentals of io_uring in DBMS Context

io_uring introduces two ring buffers—Submission Queue (SQ) and Completion Queue (CQ)—shared between user and kernel space. Applications enqueue I/O requests into an SQ of depth $Q_{depth}$ and receive completions via the CQ, each entry carrying a user-defined identifier so completions can be matched out of order. System call batching is central: requests are filled into SQEs, then a single io_uring_enter() system call submits the batch, amortizing syscall overhead. In typical usage, batches of $B = 16$ requests can reduce cycles per operation by 5–6× compared to legacy interfaces (Jasny et al., 4 Dec 2025).
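The amortization argument can be made concrete with a small model. The constants below are illustrative assumptions, not measurements from the cited work:

```python
# Illustrative model of syscall amortization with io_uring batching.
# SYSCALL_CYCLES and PER_OP_CYCLES are assumed values for illustration.

SYSCALL_CYCLES = 2500   # assumed fixed cost of one io_uring_enter() / legacy syscall
PER_OP_CYCLES = 300     # assumed per-request cost (SQE fill, bookkeeping)

def cycles_per_op(batch_size: int) -> float:
    """Amortized cycles per I/O when `batch_size` SQEs share one syscall."""
    return PER_OP_CYCLES + SYSCALL_CYCLES / batch_size

legacy = cycles_per_op(1)      # one syscall per I/O, as with pread/pwrite
batched = cycles_per_op(16)    # B = 16 requests per submission
print(f"legacy: {legacy:.0f} cycles/op, batched: {batched:.0f} cycles/op, "
      f"reduction: {legacy / batched:.1f}x")
```

With these assumed constants the reduction lands in the 5–6× range reported above; the qualitative point is that the fixed syscall cost is divided across the batch while the per-request cost is not.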

io_uring supports several execution paths:

  • Inline completions for ready data.
  • Event-based non-blocking path for pollable operations.
  • Blocking fallback (io_worker pool) for operations requiring blocking context, with additional latency (~7.3 μs).

Advanced features relevant to DBMSs include:

  • Registered Buffers: Pinning pages for zero-copy DMA.
  • Passthrough I/O: Direct NVMe command submission, bypassing the block stack.
  • IOPOLL and SQPOLL: Polling modes that further reduce interrupt and syscall overhead at the expense of dedicated CPU resource (Jasny et al., 4 Dec 2025).

2. io_uring-Driven DBMS Architectures

Three main architecture models have emerged for integrating io_uring into high-performance DBMSs (Pestka et al., 25 Nov 2024):

| Approach | Ring/Thread Model | Characteristics |
| --- | --- | --- |
| Thread-per-Core (1:1) | 1 io_uring per worker | Zero sharing, minimal locking, high isolation, CPU hungry |
| Static I/O Pool (M:N) | M io_urings for N workers | Centralized dispatch, finely tuned poll threads, decoupled |
| Dynamic I/O Pool | Adaptive M at runtime | Scale threads/rings with load, management overhead tradeoff |

  • Thread-per-Core: Each DBMS thread owns an io_uring instance (optionally with SQPOLL). This eliminates inter-thread sharing and minimizes synchronization but ties kernel poll thread allocation directly to the number of workers, increasing idle CPU usage if not load-balanced.
  • Static I/O Thread Pool: A fixed number (M) of io_uring instances, each managed by a dedicated I/O thread, serve requests dispatched from a larger pool (N) of application threads. This approach decouples computation and I/O, enabling explicit control over resource allocation but introduces cross-thread queuing and notification overhead.
  • Dynamic I/O Thread Pool: The number of active io_uring instances and associated poll threads adapts to system load, dynamically balancing throughput and CPU utilization. This requires robust ring creation/destruction and hysteresis-based scaling to avoid oscillatory behavior.

Context-sensitive code sketches for submission/completion flows and inter-thread notification are standard in all three models, with careful ring buffer state management and ordering semantics (Pestka et al., 25 Nov 2024).
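As an illustration of the dynamic-pool case, the hysteresis-based scaling described above can be sketched as follows (the class and thresholds are hypothetical; a real system would derive them from measured ring occupancy):

```python
# Sketch of hysteresis-based scaling for a dynamic I/O thread pool.
# All thresholds are assumed values, not taken from the cited papers.

class DynamicIoPool:
    def __init__(self, min_rings=1, max_rings=8,
                 scale_up_load=0.75, scale_down_load=0.25):
        # Separated up/down thresholds provide hysteresis: the pool does
        # not oscillate when load hovers around a single cut-off.
        self.rings = min_rings
        self.min_rings, self.max_rings = min_rings, max_rings
        self.scale_up_load = scale_up_load
        self.scale_down_load = scale_down_load

    def observe(self, load: float) -> int:
        """load = fraction of the last interval the pool's rings were busy."""
        if load > self.scale_up_load and self.rings < self.max_rings:
            self.rings += 1          # create one io_uring + poll thread
        elif load < self.scale_down_load and self.rings > self.min_rings:
            self.rings -= 1          # tear one down
        return self.rings

pool = DynamicIoPool()
for load in [0.9, 0.9, 0.5, 0.5, 0.1]:
    pool.observe(load)
print(pool.rings)
```

Loads between the two thresholds leave the ring count unchanged, which is exactly the oscillation-avoidance property the dynamic model requires.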

3. Complexity, Responsibility, and Orchestration Challenges

Using io_uring in a DBMS context replaces kernel-side blocking abstraction with user-space orchestration. Key responsibilities include:

  • Ring buffer management: SQ/CQ are fixed size; overflow or underutilization must be dynamically managed to avoid -ENOSPC errors and idle resources.
  • Ordering and atomicity: Durable operation ordering (especially for write and fsync) must be explicitly enforced, as atomicity guarantees are only preserved within a submission order per ring.
  • Memory safety and synchronization: Multi-threaded access to a ring’s data requires memory fences; registered buffers and file descriptors further complicate allocation and reclamation.
  • Error handling: Underlying device or API faults emerge as negative completion results; user code must detect incomplete operations, handle CQ flooding, and support partial completions.
  • Debugging and tracing: The asynchronous, fragmented call stack model necessitates explicit context propagation and lifecycle tracking for each I/O request (Pestka et al., 25 Nov 2024).

The complexity is further exacerbated by a lack of high-level DBMS-facing wrappers, requiring each system to build its own I/O façade to encapsulate ring management, completion handling, and error propagation.
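A minimal sketch of such a façade, with simulated submission and completion (all names are hypothetical; real code would call liburing, and a negative result mirrors io_uring's convention of returning `-errno` in the CQE result field):

```python
# Hypothetical DBMS-facing I/O facade: hides ring state behind
# submit()/drain() and turns negative completion results into exceptions.
import errno

class IoError(Exception):
    pass

class IoFacade:
    def __init__(self):
        self._pending = {}   # user_data -> request description
        self._cq = []        # simulated completion queue
        self._next_id = 0

    def submit(self, op: str, nbytes: int) -> int:
        """Enqueue a request; returns the user_data identifier."""
        uid = self._next_id
        self._next_id += 1
        self._pending[uid] = (op, nbytes)
        return uid

    def complete(self, uid: int, res: int):
        # In real io_uring the kernel posts this CQE; res < 0 encodes -errno.
        self._cq.append((uid, res))

    def drain(self):
        """Reap all completions; raise on any negative result."""
        done = []
        while self._cq:
            uid, res = self._cq.pop(0)
            self._pending.pop(uid, None)
            if res < 0:
                raise IoError(errno.errorcode.get(-res, str(res)))
            done.append((uid, res))
        return done

facade = IoFacade()
req = facade.submit("write", 4096)
facade.complete(req, 4096)   # simulated kernel-side completion
print(facade.drain())        # [(0, 4096)]
```

The point of the façade is that user logic only sees identifiers, byte counts, and exceptions, never raw SQ/CQ state.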

4. Practical Use Cases and Empirical Performance

Rigorous evaluation of io_uring-based designs has been performed on both storage-bound and network-bound DBMS workloads (Jasny et al., 4 Dec 2025).

Storage-Bound Buffer Managers

On hardware with an AMD 3.7 GHz CPU and 8× PCIe 5.0 NVMe SSDs:

  • Synchronous baseline (single I/O in flight): ~16.5 k tx/s.
  • Batching dirty-page writebacks: +14% throughput.
  • User-level fibers (fully async): ~183 k tx/s (11×).
  • Adaptive batch submit: +18%.
  • Registered buffers: +11%.
  • Passthrough (direct NVMe): +20%.
  • IOPOLL: +21%.
  • SQPOLL (dedicating a CPU core): +32% (to ~546 k tx/s).

Out-of-memory OLTP workloads (TPC-C, 100 warehouses) showed up to 12.5× improvement over blocking I/O (Jasny et al., 4 Dec 2025).

Network-Bound Analytical Shuffling

On 400 Gb/s-capable clusters:

  • Legacy epoll: ~30 GiB/s/node (240 Gb/s).
  • io_uring, ring-per-thread model: Scales linearly to saturate links at ~50 GiB/s/node.
  • Zero-copy send/receive: 2.5× improvement for large tuples over epoll.

io_uring’s zero-copy features dominate when per-thread memory stalls are low relative to I/O stalls and payloads exceed 1 KiB (Jasny et al., 4 Dec 2025).

Case Study: AisLSM and Asynchronous Compaction

For LSM-tree compaction, the AisLSM model demonstrates that decoupling CPU and I/O for merge-sort, write, and fsync via io_uring:

  • Increases throughput by up to 2.1× (e.g., 4.2 × 10⁵ ops/s in RocksDB vs. 8.8 × 10⁵ ops/s in AisLSM with 1 KB values).
  • Reduces 99th percentile fillrandom latency from 20.1 ms (RocksDB) to 10.3 ms.
  • Increases CPU utilization from ~40% to ~80% and increases typical queue depth from 8 to 32 for NVMe devices.
  • Write-intensive workloads and balanced CPU/storage speed are particularly effective scenarios (Hu et al., 2023).

5. Workload Suitability and Performance Boundaries

The advantages of io_uring are workload-dependent:

  • Storage: Substantial benefit when the page-fault rate $r_{pf}$ is high (≥ 50%). For in-memory workloads, performance gains are <10% (Jasny et al., 4 Dec 2025).
  • Network: Advantages increase when per-thread memory stalls are lower than I/O stalls. Zero-copy and batched operation features are beneficial for large payloads (>1 KiB).
  • Concurrency/Queue-Depth: To amortize device latency, asynchronous concurrency $C \geq \mathrm{BW}_{device} \times T_{lat}$ must be achieved. Overly large batch sizes ($B \geq 64$) can spike tail latency (~200 μs+). Polling (SQPOLL) is only justifiable for IOPS > 500k or when the system can dedicate a CPU core.
  • Avoiding io_worker fallbacks: Maintain I/O request sizes < 512 KiB and ensure submissions do not exceed device DMA mapping capabilities (Jasny et al., 4 Dec 2025).
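A worked example of the concurrency bound, using assumed device characteristics (the IOPS and latency figures below are illustrative, not measurements from the cited work):

```python
# Worked example of the bound C >= BW_device * T_lat for 4 KiB random reads.
import math

bw_device_iops = 1_500_000   # assumed device capability, ops/s
t_lat = 100e-6               # assumed per-I/O device latency, 100 microseconds

c_min = math.ceil(bw_device_iops * t_lat)
print(f"need at least {c_min} I/Os in flight to saturate the device")
```

At these assumed numbers a single synchronous request in flight would leave the device more than two orders of magnitude under-utilized, which is the gap the asynchronous designs above close.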

A simple throughput model is often used: $\text{Throughput} \approx \frac{f_{CPU}}{c_{tx} + r_{io}\,c_{io}}$, where $c_{tx}$ is CPU cycles per transaction and $r_{io} c_{io}$ is cycles spent in I/O per transaction.
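Plugging assumed values into this model gives a feel for its shape (every parameter below is an illustrative assumption):

```python
# Worked example of the model T ~ f_CPU / (c_tx + r_io * c_io).
f_cpu = 3.7e9      # cycles/s (3.7 GHz core)
c_tx = 20_000      # assumed CPU cycles per transaction
c_io = 2_000       # assumed cycles spent per I/O operation
r_io = 2.0         # assumed I/Os issued per transaction

throughput = f_cpu / (c_tx + r_io * c_io)
print(f"~{throughput / 1000:.0f} k tx/s per core")
```

The model makes the optimization targets explicit: registered buffers and passthrough shrink $c_{io}$, while caching and batching shrink $r_{io}$.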

6. Design Guidelines and Mitigation Strategies

The literature provides precise, empirically validated engineering guidance:

  • Asynchronous Architecture: Employ coroutines (fibers) to overlap CPU and I/O work, and batch I/O operations (8–32) at natural computation boundaries (e.g., page eviction, commit).
  • Execution Tuning: Use deferred task running (DEFER_TASKRUN) for predictable batching, SQPOLL for extreme IOPS or bandwidth demands (dedicating a core), and exploit NUMA topology sensitivity when pinning rings.
  • IO_uring Features: Deploy registered buffers for zero-copy DMA (payload >1 KiB), passthrough I/O for device-specialized workloads, and enable multishot receive for small (< 1 KiB) network messages.
  • Robust Error and Resource Management: Wrap common error-prone patterns (cancels, ordering, resource exhaustion) in internal libraries to promote consistency and fault tolerance.
  • Mitigating Adoption Barriers: Employ dynamic ring sizing and back-pressure, avoid idle poll threads via pool adaptation, and shield user logic from low-level management via an I/O façade (Pestka et al., 25 Nov 2024, Jasny et al., 4 Dec 2025).
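The overlap-and-batch guideline can be sketched with coroutines standing in for user-level fibers (a simulation; `fake_write` is a placeholder for an asynchronous device write, not a real API):

```python
# Sketch: flush dirty pages in batches at a natural boundary (commit),
# with asyncio tasks standing in for user-level fibers.
import asyncio

BATCH = 16  # submit at most 16 "SQEs" per flush, as with B = 16 batching

async def fake_write(page: int) -> int:
    await asyncio.sleep(0)          # placeholder for an async device write
    return page

async def commit(dirty_pages):
    written = []
    # Each gather models one batched submission: the writes in a batch are
    # in flight concurrently rather than issued one syscall at a time.
    for i in range(0, len(dirty_pages), BATCH):
        batch = dirty_pages[i:i + BATCH]
        written += await asyncio.gather(*(fake_write(p) for p in batch))
    return written

result = asyncio.run(commit(list(range(40))))
print(len(result))  # 40
```

The structure, not the toy I/O, is the point: batching happens where the workload naturally accumulates work (commit, eviction), so no artificial delay is added to form batches.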

A practical decision flow is:

  • If $(r_{io} \times T_{lat}) > 10\%$ of total time, use async fibers and batching ($B \approx 16$).
  • If per-request CPU cost $c_{io} > 2000$ cycles, register buffers.
  • If direct device access is available, enable passthrough + IOPOLL. For network messages < 1 KiB, use multishot receive; otherwise, use zero-copy (Jasny et al., 4 Dec 2025).
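This decision flow can be written down directly as a function (the return labels are shorthand for this sketch, not API names):

```python
# The decision flow above as a function; thresholds follow the guidelines,
# return labels are shorthand for this sketch.

def choose_features(io_time_fraction, c_io_cycles, has_direct_device,
                    net_msg_bytes=None):
    plan = []
    if io_time_fraction > 0.10:          # I/O stalls dominate
        plan.append("async-fibers+batching(B~16)")
    if c_io_cycles > 2000:               # per-request CPU cost is high
        plan.append("registered-buffers")
    if has_direct_device:                # NVMe reachable without block stack
        plan.append("passthrough+IOPOLL")
    if net_msg_bytes is not None:        # network path present
        plan.append("multishot-recv" if net_msg_bytes < 1024 else "zero-copy")
    return plan

print(choose_features(0.4, 2500, True, net_msg_bytes=512))
```

Encoding the thresholds once, in one place, also makes them easy to re-derive when hardware changes.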

7. Durability, Consistency, and System Integration

Durability and correctness semantics require explicit orchestration with io_uring:

  • For compaction (e.g., AisLSM), file-generation dependency tracking and deferred deletion strategies ensure that source SST files are only deleted after all dependent asynchronous fsyncs have completed. This decouples I/O and CPU, eliminating compaction stalls while preserving lineage for durability (Hu et al., 2023).
  • In PostgreSQL 18, io_uring integration—fixed buffer pools, IOPoll+SQPoll, shared rings—produces aggregate scan throughput improvements of up to 14% in I/O-bound settings, with a clear alignment to empirically derived guidelines (Jasny et al., 4 Dec 2025).

The correctness of such approaches depends on rigorous tracking of pending fsyncs and ensuring that visibility and persistence events are explicitly managed at the completion queue level.
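A minimal sketch of such pending-fsync tracking, in the spirit of AisLSM's deferred deletion (simulated; the class and method names are hypothetical):

```python
# Sketch: defer deletion of source SST files until the asynchronous fsync
# of the compaction output has completed durably.

class CompactionTracker:
    def __init__(self):
        # output file -> set of source SSTs whose data it now carries
        self._pending_fsyncs = {}

    def start_compaction(self, sources, output):
        # The output's fsync has been submitted asynchronously; the sources
        # must survive until it completes, preserving lineage for durability.
        self._pending_fsyncs[output] = set(sources)

    def on_fsync_complete(self, output):
        """Called from the completion queue; returns sources now deletable."""
        return sorted(self._pending_fsyncs.pop(output, set()))

tracker = CompactionTracker()
tracker.start_compaction(["a.sst", "b.sst"], "merged.sst")
print(tracker.on_fsync_complete("merged.sst"))  # ['a.sst', 'b.sst']
```

The invariant is the one stated above: visibility of the merged file and deletion of its sources are both gated on an explicit completion event, never on submission.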


Comprehensive empirical validation demonstrates that, for I/O-intensive DBMS cores, io_uring enables near-linear throughput scaling to the limits of device and interconnect parallelism, with substantial reductions in system overhead. Successful adoption depends on end-to-end async, batched designs tuned to application and hardware characteristics, and robust handling of underlying API complexity (Pestka et al., 25 Nov 2024, Jasny et al., 4 Dec 2025, Hu et al., 2023).
