RecoNIC: RoCEv2 RDMA Offload Engine

Updated 27 February 2026
  • RecoNIC is a RoCEv2 RDMA offload engine designed for low-latency data transfers and in-network compute acceleration.
  • It features a deeply pipelined architecture incorporating Ethernet MAC, UDP/IP, and InfiniBand RDMA logic with dedicated hardware for iCRC and flow control.
  • The design achieves high throughput and minimal host intervention, making it ideal for datacenter and scientific applications.

A RoCEv2-compliant RDMA offload engine is a hardware implementation of Remote Direct Memory Access over Converged Ethernet (RoCE) version 2. It performs direct memory transfers between network endpoints with minimal host CPU intervention while adhering strictly to the RoCEv2 protocol stack, including Reliable Connected (RC) transport semantics encapsulated in UDP/IP. State-of-the-art implementations target SmartNICs and FPGAs for low-latency, high-throughput datacenter and scientific workloads, and increasingly provide programmability and extensibility for in-network acceleration.

1. Architectural Principles of RoCEv2 RDMA Offload Engines

A RoCEv2-compliant RDMA offload engine consists of a deeply pipelined hardware datapath incorporating Ethernet MAC, UDP/IP, and InfiniBand RDMA transport logic. The principal design blocks (shown in BALBOA, ACCL+, and RecoNIC) include packet parsers, queue pair (QP) tables, retransmission logic, flow and congestion control, DMA engines (PCIe or direct-attached memory), and dedicated hardware support for iCRC computation and line-rate data movement (Heer et al., 27 Jul 2025, He et al., 2023, Zhong et al., 2023).

The receive pipeline typically parses layered headers: Ethernet, VLAN (optional), IPv4/IPv6, UDP (destination port 4791 for RoCEv2), followed by the InfiniBand Base Transport Header (BTH) and any extended headers (RETH for RDMA READ/WRITE requests, AETH for ACK/NAK, etc.) (Zhong et al., 2023, Heer et al., 27 Jul 2025). Transmit logic assembles these headers, retrieves payloads using DMA or host memory adapters, maintains per-QP PSN (Packet Sequence Number) logic, and supports segmentation and scatter/gather as dictated by posted work requests.
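
This layered parse is easy to model in software. The Python sketch below is a simplified host-side model, not the FPGA pipeline; the helper names and the returned dictionary layout are illustrative. It decodes the 12-byte BTH that follows the UDP header, plus the 16-byte RETH carried by RDMA READ requests and by the first (or only) packet of an RDMA WRITE.

```python
import struct

ROCEV2_UDP_DPORT = 4791   # RoCEv2 packets are identified by this UDP destination port
BTH_LEN, RETH_LEN = 12, 16

def parse_bth(pkt: bytes) -> dict:
    """Decode the InfiniBand Base Transport Header (first 12 bytes after UDP)."""
    opcode, flags, pkey, dqp_word, psn_word = struct.unpack("!BBHII", pkt[:BTH_LEN])
    return {
        "opcode":    opcode,                       # packet/verb type (e.g. RDMA WRITE First)
        "solicited": bool(flags & 0x80),           # SE bit
        "pad_count": (flags >> 4) & 0x3,           # bytes of payload padding
        "pkey":      pkey,                         # partition key, checked against QP state
        "dest_qp":   dqp_word & 0x00FFFFFF,        # 24-bit destination queue pair number
        "ack_req":   bool(psn_word & 0x80000000),  # AckReq bit
        "psn":       psn_word & 0x00FFFFFF,        # 24-bit packet sequence number
    }

def parse_reth(pkt: bytes) -> dict:
    """Decode the RDMA Extended Transport Header that follows the BTH on
    RDMA READ requests and first/only RDMA WRITE packets."""
    va, rkey, dma_len = struct.unpack("!QII", pkt[BTH_LEN:BTH_LEN + RETH_LEN])
    return {"va": va, "rkey": rkey, "dma_len": dma_len}
```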

Per-QP hardware state tables maintain connection attributes (remote IP, QPN, GID, P_Key), reliable transport state (next TX PSN, expected RX PSN), message-sequence (MSN) tracking, and flow-control credits. Scalability is achieved by parametrizing table sizes, typically supporting hundreds to thousands of QPs using either BRAM or external HBM resources (Heer et al., 27 Jul 2025).
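
As a rough software analogue, one such table entry might carry the fields below; the names and defaults are illustrative, not the register map of any of the cited engines.

```python
from dataclasses import dataclass

@dataclass
class QPContext:
    """Illustrative per-QP context entry (not any engine's actual register layout)."""
    qpn: int                    # local queue pair number
    remote_qpn: int             # peer's QPN
    remote_ip: str              # IPv4/IPv6 address used for the UDP/IP encapsulation
    pkey: int                   # partition key validated on ingress
    next_tx_psn: int = 0        # PSN assigned to the next outgoing request
    expected_rx_psn: int = 0    # PSN the responder expects next
    msn: int = 0                # message sequence number reported back in AETHs
    tx_credits: int = 0         # flow-control credits advertised by the peer

# Table sizing is a synthesis-time parameter: small tables fit in on-chip BRAM,
# while thousands of QPs spill to external HBM.
qp_table = [QPContext(qpn=i, remote_qpn=0, remote_ip="::", pkey=0xFFFF)
            for i in range(1024)]
```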

Additional datapath instrumentation (as in BALBOA) allows seamless insertion of protocol-aware in-line accelerators such as AES cryptography cores and ML-powered Deep Packet Inspection, demonstrating the openness of the architecture to line-rate service extensions (Heer et al., 27 Jul 2025).

2. RoCEv2 Protocol Compliance and Transport Mechanics

RoCEv2-compliant engines encapsulate InfiniBand RC traffic in UDP packets, leveraging IP for routing and supporting both IPv4 and IPv6 addressing. Compliance encompasses precise formatting and parsing of the BTH, extended headers (RETH, AETH, etc.), PSN tracking, and support for protection and key checks (P_Key, M_Key, rkey validation), as well as iCRC computation for packet integrity (Zhong et al., 2023, Heer et al., 27 Jul 2025, Marini et al., 2 Sep 2025).

Reliable transport is enforced via per-QP PSN state, ACK/NAK handling, retransmission timers, and NAK-based loss recovery. All reviewed engines implement credit-based flow control: transmitters are gated by advertised receive credits, stall when credits are exhausted, and retransmit on NAKs or timeouts (He et al., 2023, Heer et al., 27 Jul 2025, Zhong et al., 2023). Engines also support congestion control mechanisms compatible with RoCEv2, ready for DCQCN or TIMELY via ECN bits in the IP or BTH headers.
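
A minimal, host-side sketch of the responder-side PSN check follows (sender-side retransmission and credit gating mirror this logic); it is a behavioral model under simplified assumptions, not any engine's RTL.

```python
from types import SimpleNamespace

PSN_MOD = 1 << 24  # PSNs are 24-bit counters that wrap around

def psn_delta(a: int, b: int) -> int:
    """Signed distance a - b in modular 24-bit PSN space."""
    d = (a - b) % PSN_MOD
    return d - PSN_MOD if d >= PSN_MOD // 2 else d

def on_rc_request(ctx, psn: int) -> str:
    """Responder reaction to an incoming RC request packet (simplified)."""
    d = psn_delta(psn, ctx.expected_rx_psn)
    if d == 0:                              # in order: execute, advance state, ACK
        ctx.expected_rx_psn = (psn + 1) % PSN_MOD
        ctx.msn += 1
        return "execute + ACK"
    if d < 0:                               # duplicate: re-ACK, do not re-execute
        return "duplicate, re-ACK"
    return "NAK (PSN sequence error)"       # gap: ask the sender to roll back and retransmit

ctx = SimpleNamespace(expected_rx_psn=0, msn=0)
print(on_rc_request(ctx, 0))   # execute + ACK
print(on_rc_request(ctx, 0))   # duplicate, re-ACK
print(on_rc_request(ctx, 5))   # NAK (PSN sequence error)
```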

Notably, IRN demonstrates that Priority Flow Control (PFC)—historically a requirement for RoCEv2—is not mandatory if the NIC supports selective per-packet loss recovery and end-to-end BDP-based flow-control, eliminating PFC-induced problems such as head-of-line blocking and deadlock (Mittal et al., 2018).

3. Engine Variants and Extensible Offload Capabilities

Recent RoCEv2 RDMA offload engines extend beyond classical RDMA verbs (READ, WRITE, SEND) to support programmable in-network compute functions and collective communication primitives. For example, ACCL+ features a CCLO engine with an embedded microcontroller, a pipelined data movement processor, and HLS streaming APIs, enabling line-rate implementation of collectives (e.g., broadcast, reduce, all-to-all) mapped onto sequences of RDMA verbs and fully offloaded from host CPUs (He et al., 2023).
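
As a toy illustration of that mapping (a flat schedule in plain Python, whereas the CCLO engine pipelines the transfers and may use tree schedules; the rank and buffer names are made up), a broadcast from a root rank decomposes into one RDMA WRITE per peer:

```python
def broadcast_as_rdma_writes(root: int, ranks: list, buffer_id: str) -> list:
    """Toy decomposition of a broadcast collective into RDMA WRITE work requests
    issued by the root; a real offload engine would pipeline these over its QPs."""
    return [("RDMA_WRITE", root, dst, buffer_id) for dst in ranks if dst != root]

print(broadcast_as_rdma_writes(0, [0, 1, 2, 3], "activations_chunk0"))
# [('RDMA_WRITE', 0, 1, 'activations_chunk0'), ('RDMA_WRITE', 0, 2, ...), ...]
```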

RecoNIC and BALBOA both exemplify the integration of programmable compute (e.g., HLS or P4 kernels, ML preprocessors, AES encryption) inserted directly into the AXI-Stream pipelines. Data is processed as it traverses the SmartNIC, before or after the RDMA transport block, enabling applications such as in-network aggregation, deep packet inspection, or direct-to-GPU data preprocessing (e.g., for ML inference workloads) (Zhong et al., 2023, Heer et al., 27 Jul 2025).
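
A loose software analogy for the insertion point is sketched below, with Python generators standing in for AXI-Stream interfaces; the transform is a placeholder, not an actual AES or DPI core.

```python
def axis_beats(frames):
    """Model an AXI-Stream as a sequence of (data, last) beats, one beat per frame."""
    for f in frames:
        yield f, True

def inline_stage(stream, transform):
    """An in-line accelerator stage: every beat is touched on the way through,
    with no copy to host memory before or after the RDMA transport block."""
    for data, last in stream:
        yield transform(data), last

# Placeholder "encryption" stage applied to payloads before transmission
out = list(inline_stage(axis_beats([b"frame0", b"frame1"]),
                        lambda d: bytes(b ^ 0x5A for b in d)))
print(out)
```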

In scientific instrumentation, implementations such as the CTAO-LST readout system employ BSV-based RoCEv2 cores on custom data acquisition FPGAs, enabling deterministic zero-copy event transfer between digitizer backends and event-builders, meeting stringent loss/latency requirements (Marini et al., 2 Sep 2025).

4. Flow and Congestion Control: Mechanisms and Implications

RoCEv2 engines implement flow control to prevent buffer overrun and maintain in-order data delivery at high link utilization. In RC mode, per-QP credits are managed: decremented on send, incremented on ACK, stalled at zero. Engines like IRN argue for replacing in-network PFC and end-to-end credit frames with a static cap on in-flight packets computed from the link Bandwidth-Delay Product (BDP), where W = ⌈(C · RTT) / MTU⌉ bounds the allowed in-flight packets and the NIC enforces in_flight_pkts ≤ W (Mittal et al., 2018).
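
A worked instance of the cap is shown below; the 8 µs fabric RTT is an illustrative value chosen for the example, not a figure from the cited papers.

```python
import math

def bdp_packet_cap(link_gbps: float, rtt_us: float, mtu_bytes: int) -> int:
    """Static in-flight cap W = ceil(C * RTT / MTU) used in place of PFC/credits."""
    bdp_bytes = (link_gbps * 1e9 / 8) * (rtt_us * 1e-6)   # bandwidth-delay product in bytes
    return math.ceil(bdp_bytes / mtu_bytes)

# 100 Gb/s link, 8 us fabric RTT, 4096 B MTU -> BDP = 100 kB -> W = 25 packets;
# the NIC then simply refuses to transmit while in_flight_pkts >= W.
print(bdp_packet_cap(100, 8, 4096))   # 25
```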

Congestion control compatibility is assured, with many engines supporting DCQCN or TIMELY for rate adaptation. ECN propagation from the IP or BTH header is exposed to control logic, enabling adaptation of credit rates or fallback to eager protocols (He et al., 2023, Heer et al., 27 Jul 2025).
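
For concreteness, the sketch below shows the multiplicative-decrease step a DCQCN-style reaction point applies when ECN-marked traffic triggers a congestion notification; the structure and constants are simplified from the published DCQCN algorithm and are not taken from the cited engines.

```python
class DcqcnRate:
    """Simplified DCQCN-style sender rate state (reaction point only)."""
    def __init__(self, line_rate_gbps: float, g: float = 1 / 256):
        self.rate = line_rate_gbps   # current sending rate
        self.alpha = 1.0             # congestion estimate in [0, 1]
        self.g = g                   # EWMA gain

    def on_congestion_notification(self):
        """Called when the peer reports ECN-marked packets (e.g. via a CNP)."""
        self.rate *= (1 - self.alpha / 2)              # multiplicative decrease
        self.alpha = (1 - self.g) * self.alpha + self.g

r = DcqcnRate(100.0)
r.on_congestion_notification()
print(round(r.rate, 1))   # 50.0 after the first notification (alpha starts at 1)
```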

Analysis further demonstrates that accurate BDP configuration is critical; underestimation risks TCP-style stalling, overestimation wastes NIC resources. Testing protocols involve link-drop emulation, RTT variation, incast stress, and resource utilization measurement to validate robustness (Mittal et al., 2018).

5. Performance and Resource Utilization

RoCEv2-compliant RDMA offload engines achieve near line-rate throughput and minimal latency. Measurements illustrate:

  • BALBOA saturates 100 Gb/s links at 32 KB payloads with 2.1–2.4 μs WRITE and 3.8–4.2 μs READ latency (single QP), matching commercial NICs within 10–15% (Heer et al., 27 Jul 2025).
  • ACCL+ achieves 95 Gb/s for both FPGA-to-FPGA and host-to-host RDMA, with collective operation latencies (broadcast, reduce, all-to-all) 2–3× lower than software MPI (e.g., 15–18 μs for 64 KB broadcast vs 45–55 μs for SW MPI) (He et al., 2023).
  • RecoNIC obtains 92 Gb/s throughput for 32 KiB batched operations, with 400 ns RDMA READ latency for small payloads in batch mode (Zhong et al., 2023).
  • In BSV-based scientific readout (CTAO-LST), 100 Gb/s line rate is attained on large FPGAs, with a measured per-packet payload of 4,038 B at MTU = 4096 B (overhead from all L2–L4 headers and iCRC included; a header-overhead breakdown follows this list) (Marini et al., 2 Sep 2025).
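
The payload figure is consistent with a straightforward header budget, assuming IPv4 without VLAN tags and counting the Ethernet header inside the 4096 B per-packet budget; that accounting is an assumption made here to reconcile the numbers, not something stated in the source.

```python
# L2-L4 header budget for one RoCEv2 packet (IPv4, no VLAN), plus the 4-byte iCRC
ETH_HDR, IPV4_HDR, UDP_HDR, BTH, ICRC = 14, 20, 8, 12, 4
overhead = ETH_HDR + IPV4_HDR + UDP_HDR + BTH + ICRC
print(overhead, 4096 - overhead)   # 58 4038 -> matches the measured 4,038 B payload
```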

Resource utilization varies, but post-route results for BALBOA and ACCL+ engines indicate <14% LUT and <6% BRAM utilization on high-end FPGAs, supporting hundreds to a thousand QPs (Heer et al., 27 Jul 2025, He et al., 2023). Additional pipeline stages for protocol or application offload (encryption, DPI, ML preprocessing) incur marginally higher resource use but retain line-rate throughput and <100 ns per-stage latency (Heer et al., 27 Jul 2025).

6. Design Considerations, Integration Strategies, and Limitations

Integration of RoCEv2 offload engines into existing host and datacenter infrastructure mandates careful QP and memory registration, reliable host interface via PCIe (XDMA/QDMA/AXI4-MM), and dual-mode negotiation for peer compatibility (e.g., IRN fallback to standard RoCE if peer does not support enhancements) (Mittal et al., 2018, Zhong et al., 2023). Many platforms expose configuration and tuning parameters (BDP, RTOs) via host driver APIs (e.g., sysfs entries for datacenter operators) (Mittal et al., 2018).

Potential pitfalls include application exposure to out-of-order writes (mitigated with InfiniBand fences), risks of misconfigured BDPs, and edge cases caused by per-packet load balancing. Programmability is evolving: control firmware remains host-driven in RecoNIC and MicroBlaze-driven in ACCL+, while IRN and BALBOA are firmware-free, with future work exploring further offload of connection management and memory registration logic (Zhong et al., 2023, He et al., 2023, Heer et al., 27 Jul 2025).
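
A schematic work-queue listing (pseudo-verbs in Python, not a real verbs binding) shows where a fence sits: a fenced request is not started until previously posted RDMA READ and atomic operations on that QP have completed, which applications use to order dependent operations.

```python
# Illustrative posting order: the notification SEND carries a FENCE flag so it
# cannot be processed before the preceding RDMA READ has completed.
work_queue = [
    {"opcode": "RDMA_READ",  "local": "scratch",  "remote": "peer_buf", "flags": []},
    {"opcode": "RDMA_WRITE", "local": "result",   "remote": "peer_dst", "flags": []},
    {"opcode": "SEND",       "local": "done_msg",                       "flags": ["FENCE", "SIGNALED"]},
]
for wr in work_queue:
    print(wr["opcode"], wr["flags"])
```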

Current-generation engines offer limited support for atomic verbs, dynamic QP creation, and advanced transport modes (e.g., SRQ, XRC, UD); efforts are underway to extend support for these, enable multi-path RoCE, and accelerate memory registration in hardware (He et al., 2023, Heer et al., 27 Jul 2025). Empirical and simulation-based robustness testing (random drop, error injection) is crucial to validate correctness and performance across parameter sweeps and failure scenarios (Mittal et al., 2018, Marini et al., 2 Sep 2025).
