RoCEv2 RDMA Offload Engine

Updated 27 February 2026

RoCEv2-compliant RDMA offload engines are specialized network subsystems implemented in ASIC, FPGA, or configurable logic to map RDMA verbs over UDP/IP with near line-rate throughput.
They employ a layered pipeline architecture—including ingress, egress, state management, and protocol offload logic—to ensure low latency, reliable transport, and minimal CPU overhead.
Practical implementations in SmartNICs and accelerator-offload cards demonstrate scalable zero-copy data movement, enhanced security, and in-network compute capabilities for high-throughput systems.

A RoCEv2-compliant RDMA offload engine is a network subsystem, implemented in ASIC, FPGA, or other configurable logic, dedicated to efficiently mapping InfiniBand-style RDMA verbs onto UDP/IP encapsulation as defined by the RoCEv2 specification. It performs reliable transport, flow control, segmentation, and completion processing in hardware, aiming for near-line-rate throughput, low end-to-end latency, and minimal CPU involvement. Such engines are foundational to modern SmartNICs, accelerator-offload cards, and high-throughput data acquisition systems, enabling scalable zero-copy data movement across commodity Ethernet networks while preserving the reliability and verb semantics of InfiniBand networks.

1. Architectural Principles and High-Level Pipeline

RoCEv2-compliant RDMA offload engines integrate a layered pipeline architecture for transmitting and receiving packets, managing connection state, and interfacing with host (or accelerator) memory subsystems. The primary path comprises five essential stages:

Ingress (RX Path): Ethernet frames, received by a high-speed MAC (e.g., 100 GbE CMAC), are parsed for IP/UDP/RoCEv2 headers. The headers are decoded to identify RDMA verbs, extract Queue Pair numbers (QPNs), Packet Sequence Numbers (PSNs), and other protocol fields. Subsequent modules manage PSN order checking, window management, and direct data movement to application memory via PCIe DMA or on-chip buffers. Completion notification logic posts status to host-accessible "Completion Queues" (CQs) (Zhong et al., 2023, Heer et al., 27 Jul 2025).
Egress (TX Path): Host or compute engines post Work Queue Elements (WQEs) to the engine via MMIO or PCIe. The engine reads WQEs, fetches or assembles the payload, builds headers (BTH plus optional RETH/AETH/IETH), performs segmentation, enforces flow control, appends the protocol-specific checksum (RoCE ICRC), and transmits frames over the MAC. Payloads are tracked for potential retransmission (Zhong et al., 2023, Mittal et al., 2018).
State Tables: Connection, state, and MSN (Message Sequence Number) tables per QP are maintained in local memory (BRAM/URAM/SRAM), facilitating fast lookups for connection attributes, PSN space, flow-control credits, and GID or key validation (Heer et al., 27 Jul 2025, Marini et al., 2 Sep 2025).
Protocol Offload Logic: Dedicated modules handle retransmission buffers (typically in HBM or BRAM), ACK/NACK parsing, PSN wraparound logic, and in some cases, in-line compute or protocol enhancement stages (e.g., encryption, ML-DPI, in-flight reduction plugins) (Heer et al., 27 Jul 2025, He et al., 2023).
Host and Application Interface: Doorbell mechanisms, direct-mapped MMIO, or batch WQE submission methods are used to trigger operations and minimize PCIe transaction overhead (Zhong et al., 2023).
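The PSN order check and wraparound logic mentioned above can be illustrated with a minimal sketch. It assumes the 24-bit PSN space of the InfiniBand transport and treats the half-space behind the expected PSN as the duplicate region; the function names `psn_diff` and `classify_psn` are hypothetical and are not taken from any of the cited designs.

```python
PSN_BITS = 24
PSN_MASK = (1 << PSN_BITS) - 1
PSN_HALF = 1 << (PSN_BITS - 1)


def psn_diff(a: int, b: int) -> int:
    """Signed distance a - b in 24-bit sequence space, in [-2**23, 2**23)."""
    d = (a - b) & PSN_MASK
    return d - (1 << PSN_BITS) if d >= PSN_HALF else d


def classify_psn(received: int, expected: int) -> str:
    """Classify an incoming PSN relative to the receiver's expected PSN."""
    d = psn_diff(received, expected)
    if d == 0:
        return "in_order"       # accept and advance the expected PSN
    if d < 0:
        return "duplicate"      # already seen; typically re-acknowledged
    return "out_of_order"       # gap detected; NACK or record in a bitmap


# The comparison stays correct across the 0xFFFFFF -> 0 wrap.
print(classify_psn(0x000000, 0xFFFFFF))  # out_of_order
print(classify_psn(0xFFFFFF, 0x000000))  # duplicate
```

In hardware the same comparison reduces to a 24-bit subtraction and a test of the top bit, which is why it fits in a single pipeline stage.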

Pipeline scheduling targets a fixed, wide data path (e.g., 512 bits at 250 MHz) with an initiation interval of 1 (II=1) for each FSM and pipelined header generator, which gives a peak internal bandwidth of about 128 Gb/s and sustains line-rate operation on a 100 Gb/s link (Heer et al., 27 Jul 2025).
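The peak figure follows from the data-path width and clock frequency alone, assuming one full-width beat is accepted every cycle (II=1):

$$
512\ \tfrac{\text{bit}}{\text{cycle}} \times 250 \times 10^{6}\ \tfrac{\text{cycle}}{\text{s}} = 128 \times 10^{9}\ \tfrac{\text{bit}}{\text{s}} = 128\ \text{Gb/s}
$$

Any stage with II > 1 would divide this figure accordingly, which is why every FSM on the data path must accept a new beat each cycle.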

2. RoCEv2 Protocol Compliance and Transport Semantics

RoCEv2 offload engines comply strictly with the protocol’s multi-layer encapsulation and verb semantics:

Encapsulation: All RDMA packets are constructed as Ethernet frames carrying IPv4/IPv6 + UDP, with destination port 4791. The RDMA payload is preceded by the InfiniBand Base Transport Header (BTH), with optional extended headers (RETH, AETH, DETH, IETH) as required by the verb type. The RoCE ICRC is appended after the payload and covers the invariant fields of the packet from the IP header through the payload, with fields that change in flight masked out (Zhong et al., 2023, Marini et al., 2 Sep 2025).
Resource Validation: QPs, P_Keys, GIDs, and memory region keys are validated in hardware; GID matching and key checks are integrated into the packet parsing stage for each transfer (Heer et al., 27 Jul 2025).
Verbs and Reliable Connected Mode: The engines support core RDMA verbs—READ, WRITE, SEND (with Immediate), and (in selected designs) atomics—operating in reliable connection (RC) mode with full PSN and MSN tracking, cumulative ACK/NACK, and dual flow control (credit-based and optional explicit congestion notification, e.g., ECN/QCN) (He et al., 2023, Mittal et al., 2018, Heer et al., 27 Jul 2025).
Loss Recovery: Standard engines employ go-back-N or selective retransmit with per-packet SACK fields, supporting dual RTO timers for latency optimization (short/long), and window-based in-flight packet cap (BDP-based or credit-based) to avoid overrun (Mittal et al., 2018, Marini et al., 2 Sep 2025). Enhanced designs such as IRN replace the need for PFC by using BDP-FC plus selective recovery (Mittal et al., 2018).
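As an illustration of the encapsulation described above, the following sketch packs and parses the 12-byte BTH that follows the UDP header. The field layout follows the InfiniBand transport header format; the `Bth` class and `parse_bth` helper are hypothetical names used only for this example, and the reserved bits are simply written as zero.

```python
import struct
from dataclasses import dataclass

ROCEV2_UDP_PORT = 4791       # destination port that identifies RoCEv2
BTH_LEN = 12
RC_RDMA_WRITE_ONLY = 0x0A    # RC opcode for a single-packet RDMA WRITE


@dataclass
class Bth:
    opcode: int
    solicited: bool
    migreq: bool
    pad_count: int
    tver: int
    pkey: int
    dest_qp: int
    ack_req: bool
    psn: int

    def pack(self) -> bytes:
        flags = ((self.solicited << 7) | (self.migreq << 6)
                 | ((self.pad_count & 0x3) << 4) | (self.tver & 0xF))
        word2 = self.dest_qp & 0xFFFFFF                       # 8 reserved bits + 24-bit QPN
        word3 = (self.ack_req << 31) | (self.psn & 0xFFFFFF)  # AckReq + 24-bit PSN
        return struct.pack(">BBHII", self.opcode, flags, self.pkey, word2, word3)


def parse_bth(udp_payload: bytes) -> Bth:
    """Decode the BTH at the start of a RoCEv2 UDP payload."""
    if len(udp_payload) < BTH_LEN:
        raise ValueError("payload shorter than a BTH")
    opcode, flags, pkey, word2, word3 = struct.unpack(">BBHII", udp_payload[:BTH_LEN])
    return Bth(
        opcode=opcode,
        solicited=bool(flags & 0x80),
        migreq=bool(flags & 0x40),
        pad_count=(flags >> 4) & 0x3,
        tver=flags & 0xF,
        pkey=pkey,
        dest_qp=word2 & 0xFFFFFF,
        ack_req=bool(word3 >> 31),
        psn=word3 & 0xFFFFFF,
    )


hdr = Bth(RC_RDMA_WRITE_ONLY, False, False, 0, 0, 0xFFFF, 0x12, True, 0x100)
print(hdr.pack().hex())  # 0a00ffff0000001280000100
```

A hardware parser extracts the same fields with fixed bit slices of the first data beat; the ICRC and the extended headers are deliberately left out of this sketch.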

3. Flow Control, Loss Recovery, and Congestion Handling

Flow and reliability semantics are implemented to ensure zero-copy, ordered delivery with minimal host involvement:

Credit- and BDP-Based Flow Control: Outgoing packets per QP are bounded by a window $W$, either representing receive credits (credit-based) or calculated as the bandwidth-delay product divided by the MTU ($W = \lceil \frac{C \cdot RTT}{MTU} \rceil$ for link capacity $C$ and round-trip time $RTT$), with the in-flight count enforced at the transmit engine (Mittal et al., 2018, Heer et al., 27 Jul 2025).
Packet Loss and Retransmission: Engines detect out-of-order or missing PSNs via sequence-tracking bitmaps and ACK/NACK packets. Timeout triggers dual RTO logic with $RTO_{low} \ll RTO_{high}$, where the values are chosen for short-message tail latency minimization and loss recovery robustness respectively. If a NACK or RTO event is detected, the sender scans the selective ACK bitmap for holes and selectively retransmits missing PSNs up to the recovery point. Recovery mode exits when the cumulative ACK advances (Mittal et al., 2018).
Congestion Control Extensions: Some engines integrate explicit ECN feedback (IP ECN bits, BTH FECN/BECN bits) and provide a token-bucket or AIMD-based rate control interface either in hardware or programmable microcode. Designs are compatible with DCQCN, TIMELY, or host-driven rate control (He et al., 2023, Mittal et al., 2018, Heer et al., 27 Jul 2025).
PFC Optionality: RoCEv2 was historically coupled with hop-by-hop Priority Flow Control (PFC); however, improved NIC designs using advanced loss recovery schemes (e.g., IRN) render PFC unnecessary, eliminating head-of-line blocking, congestion spreading, and deadlock risks. Performance studies document consistent improvement (6–83%) of IRN (no PFC) versus legacy RoCE+PFC (Mittal et al., 2018).
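The window formula and the bitmap scan described in this section can be sketched as follows. This is a simplified model rather than the logic of any cited engine: it assumes $C$ is given in bit/s and the MTU in bytes, that bit $i$ of the selective-ACK bitmap refers to PSN `cum_ack + i`, and that PSNs are 24 bits wide. All function names are hypothetical.

```python
import math
from typing import List

PSN_MASK = (1 << 24) - 1


def bdp_window(link_bps: float, rtt_s: float, mtu_bytes: int) -> int:
    """W = ceil(C * RTT / MTU), with the MTU converted to bits."""
    return math.ceil(link_bps * rtt_s / (mtu_bytes * 8))


def can_send(in_flight: int, window: int) -> bool:
    """Transmit-side cap: a new packet is released only below the window."""
    return in_flight < window


def holes_to_retransmit(cum_ack: int, sack_bitmap: int, recovery_psn: int) -> List[int]:
    """PSNs from cum_ack up to recovery_psn (inclusive) not marked received."""
    span = ((recovery_psn - cum_ack) & PSN_MASK) + 1
    return [
        (cum_ack + i) & PSN_MASK
        for i in range(span)
        if not (sack_bitmap >> i) & 1
    ]


# 100 Gb/s link, 10 us RTT, 4096-byte MTU
print(bdp_window(100e9, 10e-6, 4096))            # 31

# Receiver holds PSNs 101, 102 and 104; sender had sent up to PSN 105.
print(holes_to_retransmit(100, 0b10110, 105))    # [100, 103, 105]
```

Bounding the scan by the recovery point keeps the sender from retransmitting packets that were first sent after recovery began and may simply still be in flight.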

4. Practical Implementations and FPGA/SmartNIC Integration

RoCEv2-compliant RDMA offload is realized in both proprietary and open-source platforms, each leveraging specific hardware assets:

RecoNIC uses a Xilinx ERNIC IP block for full RoCEv2 offload with programmable compute blocks (RTL, HLS, Vitis P4) sharing the engine, targeting Alveo U250 at 100 Gb/s line rate. Host and device memory are accessible to RDMA verbs, and engine state machines control reliability and flow control in “firmware-free” RTL (Zhong et al., 2023).
BALBOA implements all protocol logic in BRAM and pipeline logic, supporting up to 1000 QPs and line-rate throughput at 100 Gb/s. Deeply pipelined FSMs manage all protocol fields, and auxiliary pipelines perform on-datapath enhancements (e.g., AES encryption, ML-based DPI, network-attached pre-processing) without loss of host or wire performance (Heer et al., 27 Jul 2025).
ACCL+ employs a Coyote RDMA protocol offload engine with programmable microcontroller (MicroBlaze) and a data movement processor (DMP), integrating high-level MPI-style collectives directly over RDMA verbs, with streaming data paths and custom plugin support for reductions and transforms (He et al., 2023).
CTAO-LST Readout demonstrates a complete modular BSV-based RoCEv2 core in telescope data acquisition: digitized data are assembled, headerized and streamed using Bluespec SystemVerilog, transferred to commodity servers via RDMA WRITEs, with resource footprint and performance comparable to commercial solutions (Marini et al., 2 Sep 2025).

5. Performance Characterization and Resource Metrics

Quantitative evaluation and resource metrics for leading offload engines are summarized below:

Engine/Platform	Throughput (Gb/s)	Latency (µs)	Max QPs	Notable Features
RecoNIC (U250 FPGA)	>90 (WRITE, READ)	~0.4 (batch)	N/A	Batch WQE, programmable compute
BALBOA (U55C FPGA)	100 (WRITE)	2.1 (64B WR)	500–1000	On-datapath encryption/ML
ACCL+ (U55C FPGA)	95 (alltoall)	15–25 (coll.)	N/A	MPI collectives, host offloading
CTAO-LST Readout	9.7 (10 GbE PHY)	2–4 (one-way)	O(10)	BSV implementation, full core open

All four report near-line-rate throughput for bulk transfers, and the point-to-point designs report sub-10 µs message latency for MTU-sized packets under normal operating conditions; the 15–25 µs figure for ACCL+ refers to whole collective operations rather than single messages. FPGA resource utilization for the major logic blocks generally does not exceed 5–13% of LUTs or 5–8% of BRAM on typical UltraScale+ devices (Heer et al., 27 Jul 2025, He et al., 2023, Marini et al., 2 Sep 2025).
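For context on what "near line rate" can mean, the sketch below estimates the theoretical payload goodput of a RoCEv2 stream from per-packet framing overhead. It is a back-of-the-envelope calculation, not a measurement from the cited papers, and it assumes untagged Ethernet, IPv4 without options, and one BTH per packet; `goodput_gbps` is a hypothetical helper.

```python
# Per-packet overhead in bytes (assumed: no VLAN tag, IPv4 without options)
ETH_PREAMBLE_SFD = 8
ETH_HEADER = 14
ETH_FCS = 4
ETH_INTERFRAME_GAP = 12
IPV4_HEADER = 20
UDP_HEADER = 8
BTH = 12
ICRC = 4
RETH = 16  # carried by the first or only packet of an RDMA WRITE


def goodput_gbps(line_rate_gbps: float, payload_bytes: int,
                 ext_header_bytes: int = 0) -> float:
    """Payload bandwidth left after framing and header overhead."""
    overhead = (ETH_PREAMBLE_SFD + ETH_HEADER + ETH_FCS + ETH_INTERFRAME_GAP
                + IPV4_HEADER + UDP_HEADER + BTH + ICRC + ext_header_bytes)
    return line_rate_gbps * payload_bytes / (payload_bytes + overhead)


# 4096-byte payloads on a 100 Gb/s link (middle packets, no RETH)
print(round(goodput_gbps(100, 4096), 2))        # 98.04

# 64-byte single-packet WRITEs, each carrying a RETH
print(round(goodput_gbps(100, 64, RETH), 2))    # 39.51
```

Under these assumptions a bulk transfer cannot exceed roughly 98% of the nominal link rate, while small messages are dominated by header overhead regardless of how fast the engine is.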

6. Advanced Services, Programmability, and Ecosystem Integration

Besides baseline protocol offload, engines are increasingly enhanced for adaptive compute and protocol services:

Programmable Compute: RecoNIC, BALBOA, and ACCL+ integrate compute units—triggered as “lookaside” or streaming processors—capable of in-network aggregation, telemetry, ML inference, or application pre-processing (e.g., for recommender systems) at full line rate by interfacing AXI-stream/AXI-MM to the RDMA engine (Zhong et al., 2023, Heer et al., 27 Jul 2025, He et al., 2023).
Security and Inspection: BALBOA incorporates line-rate AES encryption modules and ML DPI engines in the RX/TX pipeline, demonstrating hardware-offloaded cryptography and live network inference with sub-100 ns cost (Heer et al., 27 Jul 2025).
OSS and Extensibility: Multiple platforms (RecoNIC, BALBOA, ACCL+) offer open-source implementations and/or APIs for QP, MR, and WQE lifecycle management, batch operation, and accelerator integration, promoting rapid prototyping and in-situ extension of services for research and industry.

7. Design Tradeoffs, Limitations, and Best Practices

RoCEv2 offload engine design exposes several tradeoffs and caveats:

Resource Overhead: Advanced flow control and selective-retransmit additions incur minimal per-QP NIC-local SRAM requirements (typically ≤1 KB/QP, or ≲6% overhead over base state), with negligible area impact in FPGA or ASIC implementations (Mittal et al., 2018).
Scalability: The number of QPs scales with available BRAM/URAM. Design practices use collision-free table indexing and round-robin arbiters to achieve multi-QP fairness and avoid head-of-line blocking (Heer et al., 27 Jul 2025).
Protocol Pitfalls: Out-of-order memory writes must be guarded by InfiniBand fences or non-overlapping ranges; aggressive load-balancing or switch reordering may necessitate increased NACK thresholds prior to invoking loss recovery (Mittal et al., 2018).
Integration and Configuration: Recommended practice is to negotiate a dual mode for backward compatibility with legacy RoCE peers, expose BDP and timeout parameters via sysfs or device APIs for system tuning, and fall back to software RDMA for unsupported operations (Mittal et al., 2018).
Testbench Rigor: Verification frameworks employ cocotb, code-injection, and error-injection to validate protocol compliance, error-handling, and error rates, with observed hardware BER < $10^{-12}$ and near-zero packet loss in line-rate tests (Marini et al., 2 Sep 2025).
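The round-robin arbitration mentioned under Scalability can be modelled in a few lines. This is a behavioural sketch of a generic rotating-priority arbiter, not the arbiter of any cited design; the class name `RoundRobinArbiter` and its interface are assumptions made for illustration.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Grants one requesting QP per call, rotating priority after each grant."""

    def __init__(self, num_qps: int) -> None:
        self.num_qps = num_qps
        self.last = num_qps - 1  # so that QP 0 has priority on the first call

    def grant(self, requests: List[bool]) -> Optional[int]:
        # Search starts just after the most recently granted QP.
        for offset in range(1, self.num_qps + 1):
            qp = (self.last + offset) % self.num_qps
            if requests[qp]:
                self.last = qp
                return qp
        return None  # no QP has work pending


arb = RoundRobinArbiter(4)
pending = [True, False, True, True]
print([arb.grant(pending) for _ in range(4)])  # [0, 2, 3, 0]
```

Because priority moves past the QP that was just served, a QP with a long backlog cannot starve the others, which is the fairness property the text refers to.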

RoCEv2-compliant RDMA offload engines thus constitute a key intersection of high-performance networking, in-network compute, and hardware-accelerated protocol processing, with state-of-the-art designs achieving composability, line-rate performance, and flexible protocol innovation across datacenter and scientific domains (Mittal et al., 2018, Zhong et al., 2023, He et al., 2023, Heer et al., 27 Jul 2025, Marini et al., 2 Sep 2025).