Direct NVMe Engine Overview
- Direct NVMe Engines are storage architectures that bypass traditional OS storage stacks to provide near-native access to NVMe SSDs.
- They leverage direct memory mapping, IOMMU-assisted DMA, and SmartNIC offloading to achieve low latency and high throughput.
- These engines support advanced cloud and HPC applications while addressing challenges in scheduling, management, and security.
A Direct NVMe Engine is a storage system architecture or component that allows software—often user-space frameworks, virtualization stacks, or distributed applications—to issue NVMe commands and access NVMe data paths with minimal software overhead, reduced mediation from traditional OS or hypervisor stacks, and often direct hardware- or accelerator-level offloading. This approach aims to exploit the native performance potential of NVMe SSDs by minimizing data path latency, maximizing queue and parallelism utilization, enabling advanced data management features, and, in some designs, providing security or virtualization guarantees. Direct NVMe Engines are increasingly central to high-performance, cloud, HPC, and disaggregated storage deployments, especially in the context of modern SSD innovations and evolving workload requirements.
1. Architectural Principles and Technology Foundations
At the core of Direct NVMe Engine design is access to NVMe hardware data paths that bypass or minimize the involvement of legacy storage stacks (such as POSIX block layers or interrupt-heavy device models). This is realized through:
- Direct mapping of user or VM memory to NVMe controller queues or controller memory buffer (CMB) regions, supporting lock-free, high-queue-depth, multi-tenant I/O submissions (LightIOV (Chen et al., 2023), FlexBSO (Aschenbrenner et al., 4 Sep 2024)).
- Use of technologies such as IOMMU-assisted DMA remapping and posted-interrupt mechanisms to deliver low-latency, high-throughput performance, and to safely multiplex device access between software contexts (LightIOV (Chen et al., 2023), FlexBSO (Aschenbrenner et al., 4 Sep 2024)).
- Frontend and backend architectural separation: frontend layers expose high-performance virtual or physical block devices—often supporting multiple queues or rich management functions—while backend engines interact directly and efficiently with physical NVMe SSDs, persistent memory devices (e.g., Intel Optane (Subedi et al., 2018)), or advanced storage class memory.
- Emphasis on software-defined storage interfaces, enabling programmability for storage logic (e.g., RAID, compression, encryption) within SmartNIC or DPU environments (FlexBSO (Aschenbrenner et al., 4 Sep 2024)).
- Elimination or minimization of mediation points such as hypervisor VM-exits, OS-level I/O scheduling, or block cache indirection.
NVMe’s rich command set and deep queuing (up to 65,535 I/O queues, each with up to 65,536 entries) underpin this approach, enabling each tenant, VM, or application to achieve near-native SSD performance while maintaining strong isolation.
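To make the submission path concrete, the following C sketch shows the direct-mapped queue model these designs share: a 64-byte submission entry is written into a memory-mapped queue (in host memory or the CMB) and the controller is notified by a doorbell write, with no syscall, block-layer hop, or VM-exit in between. All names here are illustrative; production engines (SPDK, LightIOV) additionally manage completion-queue phase bits, polling or interrupt delivery, and IOMMU mappings.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative 64-byte NVMe submission queue entry (per the NVMe
 * base spec layout); only the fields needed for a Read are named. */
struct nvme_sqe {
    uint8_t  opcode;                 /* 0x02 = Read (NVM command set) */
    uint8_t  flags;
    uint16_t cid;                    /* command identifier */
    uint32_t nsid;                   /* namespace ID */
    uint64_t rsvd, mptr;
    uint64_t prp1, prp2;             /* DMA addresses of the data buffer */
    uint32_t cdw10, cdw11;           /* starting LBA, low/high */
    uint32_t cdw12;                  /* number of blocks, zero-based */
    uint32_t cdw13, cdw14, cdw15;
};

struct nvme_sq {
    volatile struct nvme_sqe *ring;  /* mapped host memory or CMB */
    volatile uint32_t *doorbell;     /* mapped controller register */
    uint16_t tail, depth;
};

/* Submit one read: fill the next SQE in place, advance the tail,
 * then ring the doorbell with a single MMIO write. */
static void sq_submit_read(struct nvme_sq *sq, uint32_t nsid,
                           uint64_t slba, uint16_t nblocks,
                           uint64_t buf_iova, uint16_t cid)
{
    struct nvme_sqe sqe = {0};
    sqe.opcode = 0x02;
    sqe.cid    = cid;
    sqe.nsid   = nsid;
    sqe.prp1   = buf_iova;                    /* IOMMU-translated address */
    sqe.cdw10  = (uint32_t)slba;
    sqe.cdw11  = (uint32_t)(slba >> 32);
    sqe.cdw12  = (uint32_t)(nblocks - 1);     /* zero-based count */

    memcpy((void *)&sq->ring[sq->tail], &sqe, sizeof(sqe));
    sq->tail = (uint16_t)((sq->tail + 1) % sq->depth);
    __atomic_thread_fence(__ATOMIC_RELEASE);  /* SQE visible before doorbell */
    *sq->doorbell = sq->tail;                 /* device fetches the command */
}
```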
2. Performance, Parallelism, and Benchmarking
Direct NVMe Engine designs consistently demonstrate substantial advantages in bandwidth, IOPS, and latency:
- In representative FIO benchmarks, emerging byte-addressable NVRAM SSDs such as the Intel Optane SSD DC P4800X achieve sequential write throughput of up to 1005 MB/s (vs. 128 MB/s for a conventional NVMe SSD, the Intel DC P3700, both at 4 KB transfer size and QD=1), and can attain >2 GB/s random read/write performance at larger transfer sizes and high queue depths. Such performance is sustained only when the engine or application saturates the available concurrency and queue resources (Subedi et al., 2018).
- Direct mapping of NVMe I/O queues allows VM-attached devices to achieve 97.6%–100.2% of exclusive-pass-through (VFIO) IOPS, while maintaining device sharing flexibility and supporting VM counts in the thousands (LightIOV (Chen et al., 2023)).
- Hardware-offloaded solutions using SmartNICs (FlexBSO (Aschenbrenner et al., 4 Sep 2024)) record throughput of up to 14 GB/s and, in low-latency workloads, realize average read latencies of 16 μs (vs. 63.7 μs for RDMA-based NVMe-oF), a nearly 4× reduction.
- Optimized user-space block-device frontends (ublk) and direct-to-disk schemes (DBS) in distributed storage engines (Longhorn (Kampadais et al., 20 Feb 2025)) lift frontend IOPS from 20k (iSCSI-based) to 500k, with end-to-end improvements of 3×–6× over the default I/O paths.
A key insight from these results is that full exploitation of NVMe SSD capability requires the engine to issue large or highly concurrent I/O operations, designed to saturate device queues—achievable only when user-level or VM-level applications have near-direct access to hardware submission paths, with minimal performance mediation.
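The sketch below makes this concrete: it keeps a fixed window of large reads in flight against an NVMe block device using Linux io_uring with O_DIRECT. The device path, block size, and queue depth are illustrative and error handling is elided; the essential pattern, prime the queue and resubmit on every completion, is what separates engines that saturate device queues from those that serialize at QD=1.

```c
#define _GNU_SOURCE               /* for O_DIRECT */
#include <fcntl.h>
#include <liburing.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define QD 256                    /* reads kept in flight */
#define BS (128 * 1024)           /* large requests amortize per-I/O cost */

int main(void)
{
    struct io_uring ring;
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0 || io_uring_queue_init(QD, &ring, 0) < 0)
        return 1;

    /* Prime: QD reads outstanding before reaping anything. */
    for (uint64_t i = 0; i < QD; i++) {
        void *buf;
        posix_memalign(&buf, 4096, BS);           /* O_DIRECT alignment */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, BS, i * BS);
        io_uring_sqe_set_data(sqe, buf);
    }
    io_uring_submit(&ring);

    /* Steady state: resubmit on each completion so the device's
     * submission queues never drain. */
    for (uint64_t next = QD; next < QD + 100000; next++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        void *buf = io_uring_cqe_get_data(cqe);
        io_uring_cqe_seen(&ring, cqe);

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, BS, next * BS);
        io_uring_sqe_set_data(sqe, buf);
        io_uring_submit(&ring);
    }
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```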
3. Advanced Flash Management and Placement
Emerging SSD features have shifted substantial device and data management responsibilities to host stack or engine logic. Notable developments include:
- Zoned Namespace (ZNS) SSDs expose append-only zones and management commands (e.g., open, finish, reset), delegating data placement, block allocation, and predictable garbage collection to the host or engine (Doekemeijer et al., 2023); a Linux zone-management sketch appears at the end of this section.
- Flexible Data Placement (FDP) enables the host engine to “hint” data placement for physically segregating data by type or usage pattern, crucial in Flash cache environments (e.g., CacheLib (Allison et al., 21 Feb 2025)), to prevent small-object (SOC) and large-object (LOC) data from intermixing, reducing device-level write amplification (DLWA).
- In practical deployments, targeted data placement achieves a near-ideal DLWA of 1 (see the definition after this list), reducing overprovisioning requirements from 50% to near zero and translating into substantial sustainability benefits (lower energy use, less frequent SSD replacement) under the carbon-amortization analysis of (Allison et al., 21 Feb 2025).
- Management of ZNS and FDP requires engines to account for both I/O and management command costs, zone state transition latencies, and tailored I/O sizes to harness both intra-zone and inter-zone parallelism (Doekemeijer et al., 2023, Allison et al., 21 Feb 2025).
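For reference, the write-amplification quantity these results are stated in has the standard definition below; the carbon-amortization formula itself is developed in (Allison et al., 21 Feb 2025) and is not reproduced here.

```latex
\mathrm{DLWA} \;=\; \frac{\text{bytes written to NAND by the device}}{\text{bytes written by the host}} \;\geq\; 1
```

A DLWA of 1 means the device performs no garbage-collection rewrites of host data, which is precisely what segregating SOC from LOC traffic approaches.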
This drive towards host-based flash control is motivated by the limitations and inefficiencies in traditional drive-level Flash Translation Layers (FTL), and can only be fully leveraged with direct engine–device interfaces.
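As a concrete instance of such a direct engine–device interface, Linux exposes the ZNS zone lifecycle to host software via the blkzoned ioctls. The sketch below (device path and zone geometry are illustrative) walks a single zone through the open/finish/reset transitions named above; each transition carries device-side latency that an engine must budget for.

```c
#define _GNU_SOURCE                  /* for O_DIRECT */
#include <fcntl.h>
#include <linux/blkzoned.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    /* Assumes a ZNS namespace exposed as a zoned block device. */
    int fd = open("/dev/nvme0n2", O_RDWR | O_DIRECT);
    if (fd < 0)
        return 1;

    /* One zone, addressed in 512-byte sectors; 2 GiB is illustrative. */
    struct blk_zone_range zone = {
        .sector     = 0,
        .nr_sectors = (2ULL << 30) / 512,
    };

    /* Host-controlled lifecycle; sequential writes at the write
     * pointer would happen between open and finish (not shown). */
    ioctl(fd, BLKOPENZONE,   &zone);   /* empty -> explicitly open */
    ioctl(fd, BLKFINISHZONE, &zone);   /* open  -> full            */
    ioctl(fd, BLKRESETZONE,  &zone);   /* full  -> empty (reclaim) */

    close(fd);
    return 0;
}
```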
4. Software, Virtualization, and Programmability
Direct NVMe Engine innovation spans both hypervisor-driven and user-space designs:
- LightIOV (Chen et al., 2023) leverages CMB-based queue mapping, IOMMU DMA remapping, and minimalistic software trap-and-emulate layers to share an NVMe device among many VMs, each with dedicated queues, while providing isolation and reducing CPU overhead compared to poll-based approaches such as SPDK-Vhost.
- FlexBSO (Aschenbrenner et al., 4 Sep 2024) uses NVIDIA SNAP on BlueField-2 SmartNICs to fully offload NVMe device emulation and block-stack logic (e.g., SPDK-based RAID, compression) onto the programmable ARM cores, exposing SR-IOV NVMe block devices to each VM and thereby removing VM-exits and host-side mediation.
- The xNVMe project (Lund et al., 11 Nov 2024) offers a cross-platform, minimal message-passing API that unifies diverse storage I/O command paths (POSIX, io_uring, libaio, SPDK, etc.), supporting both synchronous and asynchronous queue models and letting developers exploit direct device features regardless of operating system (see the sketch after this list).
- Cloud-native distributed block stores (Longhorn (Kampadais et al., 20 Feb 2025)) achieve direct NVMe performance by refactoring frontend device emulation (ublk with io_uring), controller-replica concurrency primitives, and extent-based direct block stores.
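A minimal sketch of this unified model using xNVMe's C API follows; the names track recent libxnvme releases, but exact signatures and option fields vary by version, so treat it as illustrative rather than authoritative. The point of the abstraction is that the same read can be serviced by io_uring, libaio, or SPDK by changing only the backend options.

```c
#include <libxnvme.h>

int main(void)
{
    /* opts.be / opts.async select the backend (e.g., io_uring, SPDK)
     * without changing any of the I/O code below. */
    struct xnvme_opts opts = xnvme_opts_default();
    struct xnvme_dev *dev = xnvme_dev_open("/dev/nvme0n1", &opts);
    if (!dev)
        return 1;

    uint32_t nsid = xnvme_dev_get_nsid(dev);
    void *buf = xnvme_buf_alloc(dev, 4096);  /* backend-appropriate, DMA-able */

    /* Synchronous one-shot read of LBA 0 (nlb is zero-based); the
     * asynchronous path reuses the same command-context abstraction
     * over per-backend queues. */
    struct xnvme_cmd_ctx ctx = xnvme_cmd_ctx_from_dev(dev);
    int err = xnvme_nvm_read(&ctx, nsid, 0, 0, buf, NULL);

    xnvme_buf_free(dev, buf);
    xnvme_dev_close(dev);
    return err ? 1 : 0;
}
```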
This shift empowers both application-level and systems-level code to manage NVMe storage in a scalable, flexible way, which is critical as direct access modes (per-VM, per-container, even per-accelerator) become standard in large deployments.
5. Security Considerations and Countermeasures
The direct, low-overhead access model for NVMe devices exposes new security challenges, which are reflected in several research efforts:
- NVMe-oF and RDMA. NVMe-oF relies on RDMA for remote block device access, but protocol-level flaws (e.g., lack of source authentication, weak queue management) can be exploited to inject, spoof, or disrupt storage commands—including unauthorized block writes and forced disconnects (NeVerMore (Taranov et al., 2022)). Application-layer MACs and improved RDMA primitives are required to mitigate these threats.
- Malicious devices. Sophisticated firmware attacks (eNVMe platform (Wertenbroek et al., 1 Nov 2024)) can exploit DMA via PCIe to escalate privileges or subvert OS storage subsystems, highlighting the need for proper IOMMU configuration, Secure Boot adoption, and off-disk cryptography for sensitive data.
- Confidential computing overlays. sNVMe-oF (Chrapek et al., 21 Oct 2025) demonstrates that confidentiality, integrity, and freshness guarantees can be layered over direct NVMe-oF paths by encapsulating cryptographic metadata in the NVMe metadata fields and offloading Merkle-tree integrity updates to SmartNICs with confidential-computing (CC) support. These mechanisms operate with as little as 2% performance degradation versus bare metal and require no protocol modification.
This suggests that future Direct NVMe Engine designs must embed security-by-design—enforcing DMA boundaries via IOMMU, integrating or supporting cryptographic verification in the data/message path, and providing hardware-anchored attestation for programmable stacks.
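One concrete shape such in-path verification can take is a per-command MAC with replay protection, so a target can reject injected, spoofed, or replayed capsules. The sketch below uses OpenSSL's one-shot HMAC; the capsule layout, sequence-number scheme, and key handling are simplified placeholders, not the NeVerMore or sNVMe-oF constructions.

```c
#include <openssl/crypto.h>
#include <openssl/hmac.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 32   /* HMAC-SHA-256 output size */

/* Initiator: append a monotonic sequence number and an HMAC over
 * (SQE || seqno) to the outgoing capsule. Returns capsule length. */
size_t capsule_seal(const uint8_t sqe[64], uint64_t seqno,
                    const uint8_t *key, size_t keylen,
                    uint8_t out[64 + 8 + MAC_LEN])
{
    unsigned int maclen = 0;
    memcpy(out, sqe, 64);
    memcpy(out + 64, &seqno, 8);
    HMAC(EVP_sha256(), key, (int)keylen, out, 72, out + 72, &maclen);
    return 72 + maclen;
}

/* Target: recompute the MAC and compare in constant time; the caller
 * must also check that seqno is strictly increasing (replay defense). */
int capsule_verify(const uint8_t capsule[64 + 8 + MAC_LEN],
                   const uint8_t *key, size_t keylen)
{
    uint8_t mac[MAC_LEN];
    unsigned int maclen = 0;
    HMAC(EVP_sha256(), key, (int)keylen, capsule, 72, mac, &maclen);
    return CRYPTO_memcmp(mac, capsule + 72, MAC_LEN) == 0;
}
```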
6. Advanced I/O Scheduling, Structural Encoding, and Data Layout
Maximizing direct NVMe engine performance for complex analytical and machine learning workloads demands attention to physical I/O request characteristics, structuring, and CPU–I/O scheduling:
- For columnar data formats, random access and scan throughput can be dramatically improved by aligning file layout and read sizes to device granularity: smaller Parquet pages (e.g., 8 KiB) and adaptive search caches deliver >60× better random-access rates, with only marginal trade-offs in scan throughput or RAM use (Lance (Pace et al., 21 Apr 2025)).
- Structural encodings such as full-zip (for large types) and mini-block (for small types) can reduce per-query I/O count, optimizing both random and scan-intensive workloads.
- Scheduling strategies that merge sequential I/Os (e.g., via Linux mq-deadline for ZNS (Doekemeijer et al., 2023)) improve throughput, albeit sometimes with higher latency.
- Adaptive encoding and struct packing minimize metadata read overhead and enable direct engine-driven access even with complex, nested, or variable-width types.
This emphasizes that the effectiveness of a Direct NVMe Engine lies not only in raw physical access but in the synergy of device interface, data layout, and application usage; the sketch below illustrates the alignment and merging mechanics.
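The sketch assumes 4 KiB device blocks and byte-addressed ranges: each read is rounded outward to device granularity, and adjacent requests are merged so several small page reads become one larger sequential I/O, the same effect mq-deadline-style merging produces in the kernel.

```c
#include <stdint.h>

#define DEV_BLOCK 4096ULL   /* assumed device LBA granularity */

struct io_range { uint64_t off, len; };

/* Round a byte range outward to block boundaries so the read covers
 * whole device blocks (required for O_DIRECT, and cheap when the
 * file layout is already page-aligned, as discussed above). */
static struct io_range align_range(uint64_t off, uint64_t len)
{
    uint64_t start = off & ~(DEV_BLOCK - 1);
    uint64_t end   = (off + len + DEV_BLOCK - 1) & ~(DEV_BLOCK - 1);
    return (struct io_range){ .off = start, .len = end - start };
}

/* Merge b into a if the aligned ranges touch or overlap; a pass over
 * offset-sorted ranges thus turns many small page reads into a few
 * large sequential I/Os. Returns 1 if merged, 0 otherwise. */
static int try_merge(struct io_range *a, const struct io_range *b)
{
    if (b->off > a->off + a->len)
        return 0;                       /* gap: cannot merge */
    uint64_t end = b->off + b->len;
    if (end > a->off + a->len)
        a->len = end - a->off;
    return 1;
}
```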
7. Limitations, Trade-offs, and Future Evolutions
Despite their benefits, Direct NVMe Engines present challenges:
- Engineering for full performance mandates careful tuning of I/O size, queue depth, and data layout. Under-utilized engines (low queue depth, single-threaded access) fail to capitalize on available bandwidth (Subedi et al., 2018, Doekemeijer et al., 2023).
- Security is a continual concern: unmediated hardware access can enable sophisticated attacks if the device, stack, or management plane is improperly isolated or misconfigured (Taranov et al., 2022, Wertenbroek et al., 1 Nov 2024).
- Emerging SSD interfaces (FDP, ZNS) move responsibility for data placement and GC to the host; while offering potential efficiency, they increase system complexity and require efficient, possibly application-aware engine scheduling (Allison et al., 21 Feb 2025, Doekemeijer et al., 2023).
- Current emulation and simulation environments may not accurately reproduce key aspects of direct NVMe operation, especially ZNS or management command latencies (Doekemeijer et al., 2023), implying the need for improved tools and models.
- Balancing CPU-side overhead, RAM usage (especially search-cache and index metadata), and I/O offload requires trade-offs tuned to particular deployment and workload requirements (Pace et al., 21 Apr 2025, Chen et al., 2023).
Industry trends indicate continued evolution in hardware interfaces (e.g., more advanced SmartNICs, CC accelerators), further integration of storage engines with confidential computing, and broader adoption of open-source, programmable device firmware (Wertenbroek et al., 1 Nov 2024), all of which will continue to shape the design, adoption, and security properties of Direct NVMe Engines.