FlexiNS: Programmable SmartNIC Network Stack
- FlexiNS is a SmartNIC-centric, fully programmable network stack designed for line-rate performance and seamless RDMA compatibility.
- It leverages a header-only TX path, in-cache RX processing, and a high-performance DMA-only notification pipe to eliminate CPU overhead and DRAM bottlenecks.
- The architecture integrates a programmable offloading engine and delivers 2.2× higher IOPS than a microkernel-based stack and 1.3× higher bandwidth than a hardware-offloaded RDMA stack.
FlexiNS is a SmartNIC-centric, fully programmable, line-rate network stack designed to address the growing mismatch between modern data center network speeds and the performance of CPU-centric and hardware-offloaded network stacks. It systematically rearchitects both the transport and data planes on top of off-path SmartNIC platforms (specifically Nvidia BlueField-3), integrating four core architectural innovations to deliver software-defined flexibility, minimal resource overhead, and line-rate performance simultaneously, while maintaining compatibility with the standard RDMA IBV verbs interface (Chen et al., 25 Apr 2025).
1. Design Motivation and Context
Conventional datacenter network stacks, whether CPU-centric (kernel-based or microkernel) or hardware-offloaded (RDMA NICs), encounter scalability and flexibility barriers as network speeds reach 200/400/800G. CPU-centric stacks incur prohibitive CPU and DRAM overhead, high tail latency, and resource contention, especially under workload interference. Hardware-offloaded stacks offer high throughput but are inflexible and not easily programmable: protocol evolution and feature addition are limited by ASIC design cycles.
Off-path SmartNICs, such as BlueField-3, provide general-purpose ARM cores and a full OS, allowing reprogrammability. However, naively porting an entire transport stack to the SmartNIC exposes severe bottlenecks:
- C1 (Link Bottleneck): Shared bandwidth between SmartNIC ARM and NIC switch constrains throughput.
- C2 (DRAM Bandwidth): ARM DRAM interface cannot sustain full-duplex line-rate for both RX and TX in software.
- C3 (Latency): ARM in-path processing and PCIe roundtrips inflate per-packet latency.
FlexiNS is designed to resolve all three bottlenecks, enabling both cloud-scale application performance and deep transport control extensibility.
2. Architectural Overview and Execution Model
The FlexiNS architecture consists of a host user library and kernel module exposing standard RDMA verbs, and a SmartNIC-side software stack divided into data and control cores. The user views FlexiNS as a standard RDMA device, supporting existing tools and applications "out of the box." The SmartNIC ARM cores execute protocol processing and offload logic ("data cores"), along with stack management and resource scheduling ("control cores"). Data movements and notifications between SmartNIC and host bypass bottlenecks via hardware-accelerated and memory-local mechanisms.
The following diagram (as in (Chen et al., 25 Apr 2025) Fig. 1) illustrates the main components:
```
[Host RDMA Application]
            |
[User Library / Kernel RDMA Driver]
            |
----------- PCIe -----------
            |
[SmartNIC ARM Cores]
            |
[SmartNIC NIC Switch ASIC]
            |
        [Network]
```
FlexiNS leverages BlueField-3 hardware features—such as cache-coherent DMA, programmable memory translation, and lockless DMA pipes—to orchestrate all stack operations at line rate while minimizing host involvement.
3. Key System Innovations
FlexiNS introduces four tightly interlocking mechanisms:
(a) Header-Only Offloading TX Path
For outgoing data, FlexiNS decouples payload and header handling. Only headers are generated and processed by the ARM cores; the SmartNIC DMA engine fetches payload data directly from host memory using pre-registered "shadow" regions, avoiding double copies or excessive ARM DRAM traffic. This architecture removes ARM-NIC link contention and achieves line-rate TX throughput even during full-duplex loads. The ARM-side memory bandwidth for TX remains below 0.5 GB/s at 400G. Multiple send queues communicate through a scalable shared notification mechanism.
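As a concrete illustration, a header-only send descriptor could take the shape sketched below: the ARM core materializes only the small header, while the payload scatter-gather entry points into the pre-registered host shadow region for the NIC DMA engine to fetch. All struct and field names here are hypothetical illustrations, not the FlexiNS or DEVX interface.

```cpp
// Conceptual sketch of the header-only TX path (types and names are
// hypothetical, not the published FlexiNS or DOCA/DEVX API).
#include <cstdint>
#include <cstring>

// A scatter-gather entry: either ARM-local (header) or host memory (payload).
struct SgEntry {
    uint64_t addr;   // DMA address (ARM DRAM for header, host memory for payload)
    uint32_t length;
    uint32_t lkey;   // memory key of the pre-registered region
};

struct TxDescriptor {
    SgEntry header;  // small, built on the ARM core
    SgEntry payload; // gathered by the NIC DMA engine straight from host memory
};

// Build a 64-byte transport header on the ARM core. The payload itself never
// transits ARM DRAM; only its (host address, length, key) triple does.
TxDescriptor build_header_only_send(uint64_t host_payload_addr,
                                    uint32_t payload_len,
                                    uint32_t shadow_region_key,
                                    uint32_t seq_num) {
    static thread_local uint8_t hdr_buf[64];
    std::memset(hdr_buf, 0, sizeof(hdr_buf));
    std::memcpy(hdr_buf, &seq_num, sizeof(seq_num));  // toy header contents

    TxDescriptor d;
    d.header  = {reinterpret_cast<uint64_t>(hdr_buf), sizeof(hdr_buf), /*lkey=*/0};
    d.payload = {host_payload_addr, payload_len, shadow_region_key};
    return d;  // posted to the NIC send queue; hardware gathers both entries
}
```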
(b) Unlimited-Working-Set In-Cache Processing RX Path
For reception, RX payloads are delivered by the NIC switch directly into the ARM LLC (leveraging DDIO), enabling protocol processing to proceed in-cache rather than in DRAM. After processing and DMA delivery to host memory, cachelines are explicitly invalidated, removing residual DRAM pressure. This design supports arbitrarily large RX working set sizes (hundreds of thousands of concurrent buffers) with constant RX throughput; observed ARM DRAM traffic is <0.8 GB/s at line rate.
The required cache size is governed by the bandwidth-delay product $C = B \times t_{\text{proc}}$: at 400 Gbps (50 GB/s) with 10 μs of processing per packet, only 50 GB/s × 10 μs = 500 KB is required, well below the typical LLC size on BlueField-3 (16 MB).
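A minimal sketch of the resulting RX loop follows, with stub stand-ins for the hardware mechanisms (DDIO-style steering into the LLC, the DMA engine, and the Nvidia cache-invalidation opcodes); names and types are illustrative assumptions, not the FlexiNS code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct RxBuffer { uint8_t* data; std::size_t len; };

// Stubs standing in for hardware mechanisms: the real payload is steered by
// the NIC switch directly into the ARM LLC, and invalidation would use the
// Nvidia-provided cache opcodes mentioned in the text.
static std::array<uint8_t, 4096> slot;
RxBuffer poll_rx_ring() { return {slot.data(), slot.size()}; }
void dma_to_host(const uint8_t*, std::size_t) {}
void llc_invalidate(const void*, std::size_t) {}

void rx_worker_step() {
    RxBuffer buf = poll_rx_ring();   // payload already resident in the LLC
    // ... transport processing (reassembly, ACK generation) runs in-cache ...
    dma_to_host(buf.data, buf.len);  // deliver the payload into host memory
    // Explicitly drop the cachelines so they are never written back to ARM
    // DRAM; the buffer slot can then be reused at line rate.
    llc_invalidate(buf.data, buf.len);
}
```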
(c) High-Performance DMA-Only Notification Pipe
Traditional notification mechanisms (PCIe MMIO writes, doorbells) induce latency and scaling bottlenecks. FlexiNS instead implements the notification pipe as a lockless single-producer/single-consumer ring, using high-speed DMA to transfer request and completion events between host and SmartNIC. This keeps notification latency under 2× that of hardware RNICs and enables full-duplex, microsecond-scale operation even with hundreds of thousands of queues, with negligible ARM or host overhead.
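At its core the pipe is a classic lockless SPSC ring. The stand-alone sketch below shows the structure in process-local memory; in FlexiNS the slots would live in DMA-able memory with the DMA engine moving entries across PCIe, so this is a conceptual approximation rather than the actual implementation.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <optional>

// Minimal single-producer/single-consumer ring of the kind the notification
// pipe uses. N must be a power of two so indices wrap with a mask.
template <typename T, std::size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T slots_[N];
    alignas(64) std::atomic<uint64_t> head_{0};  // advanced by the consumer
    alignas(64) std::atomic<uint64_t> tail_{0};  // advanced by the producer
public:
    bool push(const T& v) {  // producer side (e.g., the host library)
        uint64_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
        slots_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);  // publish the slot
        return true;
    }
    std::optional<T> pop() {  // consumer side (e.g., a SmartNIC data core)
        uint64_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return std::nullopt;
        T v = slots_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);  // free the slot
        return v;
    }
};
```

Because each side polls plain memory that the DMA engine keeps coherent, neither PCIe MMIO writes nor doorbell interrupts appear on the critical path.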
(d) Programmable Offloading Engine
FlexiNS exposes a general programmable engine on the SmartNIC, allowing user-supplied protocol opcodes and application-defined code handlers to be registered and executed on spare ARM cores as part of the stack datapath. The API provides high-level primitives for DMA management, task scheduling, and asynchronous coroutines, greatly simplifying the development of server-side batched operations, pointer-chasing, and in-network compute without hardware-specific programming. Prototyping new transport or RPC logic is reduced to a few hundred lines of C++.
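The paper's exact API surface is not reproduced here, but a registration-and-dispatch engine of this kind might look like the following sketch, with all names illustrative:

```cpp
// Hypothetical shape of the offloading-engine API: applications register a
// handler for a custom opcode, and spare ARM data cores invoke it from the
// datapath. Names are illustrative, not the published FlexiNS interface.
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

struct OffloadCtx {                 // per-request context handed to handlers
    std::vector<uint8_t> request;   // inbound payload (already in the LLC)
    std::vector<uint8_t> response;  // filled by the handler, DMA'd back out
};

class OffloadEngine {
    std::unordered_map<uint16_t, std::function<void(OffloadCtx&)>> handlers_;
public:
    void register_opcode(uint16_t op, std::function<void(OffloadCtx&)> h) {
        handlers_[op] = std::move(h);
    }
    void dispatch(uint16_t op, OffloadCtx& ctx) {  // runs on a spare data core
        if (auto it = handlers_.find(op); it != handlers_.end()) it->second(ctx);
    }
};

// Example: a server-side "batched read" opcode that gathers several objects
// in one round trip instead of issuing one RDMA READ per object.
void install_batched_read(OffloadEngine& eng) {
    eng.register_opcode(/*op=*/0x42, [](OffloadCtx& ctx) {
        // parse the batch of (addr, len) pairs in ctx.request, gather each
        // object locally, and append the results to ctx.response ...
    });
}
```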
4. Comparative Positioning and Impact on Network Stack Design
FlexiNS achieves a combination of attributes previously thought mutually incompatible:
| Architecture | Throughput | Host Overhead | Memory Contention | Programmability |
|---|---|---|---|---|
| CPU/Kernel-Based | Low | High | High | Medium |
| Microkernel-Based | Medium | Medium | High | High |
| Hardware-Offloaded RDMA | High | Low | None | Low |
| Naive SmartNIC Stack | Low | Low | None | High |
| FlexiNS | High | Low | None | High |
FlexiNS thereby delivers the performance (line-rate IOPS and bandwidth, hardware-class latency), resource efficiency (no host CPU/DRAM spike), and the programmability (custom protocol, batching, and stack logic) needed by emerging cloud, storage, and AI workloads.
5. Evaluation and Performance Results
All results are measured on dual Xeon hosts with Nvidia BlueField-3 SmartNICs (2×200G links):
- Network Stack Microbenchmarks: FlexiNS matches hardware RNICs in single-connection throughput, achieving up to 3.5× the throughput of Snap (microkernel) on SEND/WRITE. Latency is 1.5× that of RNICs, but 1.4× lower than Snap; optimized notification further narrows this gap.
- Block Storage Disaggregation (Solar Protocol): Delivers 2.2× higher IOPS than microkernel-based Snap (4K READ). With additional CRC/DSA host offload, still achieves 1.5× higher IOPS vs. optimized CPU kernel stacks.
- KVCache Transfer: Yields 1.3× higher bandwidth over Mooncake-RDMA hardware stack and reduces large-object transfer latency by 3.2× compared to Mooncake-TCP.
- Programmable Engine (Batched RDMA READ, Pointer-Chasing): Batched RDMA READ achieves 3.5× hardware-RNIC throughput; server-side pointer-chasing reduces total latency by 1.7× versus client-based RDMA.
Key factors enabling these results are the elimination of ARM link/memory bottlenecks, efficient in-cache RX, and fast DMA-based notifications, combined with a flexible application offload model.
6. Implementation and Compatibility
FlexiNS is implemented in C++20 (SmartNIC and user space) plus a small C kernel module, and modifies only the SmartNIC-side mlx5 driver to handle the enhanced DMA/cache operations; the host driver is left unmodified. It does not depend on Nvidia DOCA; DEVX and IBV verbs are accessed directly.
FlexiNS is fully compatible with standard RDMA IBV verbs. The stack is registered as a standard RDMA device in the OS kernel, allowing unmodified RDMA-aware user-space applications, devices, and management tools to operate transparently atop FlexiNS. Core stack management is partitioned between "data" and "control" SmartNIC cores for scalability. Efficient LLC invalidation is realized via Nvidia-provided opcodes.
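For instance, a stock libibverbs program such as the following (standard verbs calls, error handling abbreviated) would open and use FlexiNS exactly as it would a hardware RNIC:

```cpp
#include <infiniband/verbs.h>
#include <cstdio>

int main() {
    int n = 0;
    ibv_device** devs = ibv_get_device_list(&n);  // FlexiNS enumerates here
    if (!devs || n == 0) return 1;

    ibv_context* ctx = ibv_open_device(devs[0]);
    ibv_pd* pd = ibv_alloc_pd(ctx);
    ibv_cq* cq = ibv_create_cq(ctx, /*cqe=*/256, nullptr, nullptr, 0);

    std::printf("opened %s\n", ibv_get_device_name(devs[0]));

    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```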
7. Summary Table: Design Features and Achieved Benefits
| Innovation | Problem Addressed | Method/Mechanism | Key Result |
|---|---|---|---|
| Header-only TX Offload | Link/DRAM bottlenecks | ARM headers + direct DMA payload fetch | Full link-rate; minimal ARM b/w |
| Unlimited In-Cache RX | RX memory scaling | RX direct to cache, explicit invalidation | Scalable working set; line-rate throughput |
| DMA-Only Notification | Notification latency/bottlenecks | Lockless DMA ring buffer (host ↔ SmartNIC) | µs-level latency; scalable notifications |
| Programmable Offloading | Application CPU offload | Code registration, handler API on ARM | In-network logic; rapid prototyping |
FlexiNS thereby enables datacenters to attain hardware-grade performance, programmability, and resource efficiency, without sacrificing backwards compatibility with existing RDMA software ecosystems.
8. Conclusion
FlexiNS establishes a new design paradigm for SmartNIC-enabled datacenter network stacks by achieving scalable, line-rate, software-definable transport and application offloading, underpinned by four systematic hardware/software co-designs. Its results—2.2× IOPS over microkernel (block storage), 1.3× bandwidth over hardware RDMA (KVCache), and 3.5× throughput for offloaded batched operations—demonstrate the feasibility of achieving high throughput, ultra-low host overhead, and deep programmability concurrently, while maintaining seamless RDMA compatibility (Chen et al., 25 Apr 2025). This architecture is positioned to support cloud, AI, and storage workloads as network and application demands continue to intensify.