BALBOA Engine: RoCEv2 RDMA Offload

Updated 27 February 2026

BALBOA Engine is a RoCEv2-compliant RDMA offload engine that accelerates memory access by offloading protocol processing to hardware.
It employs deep pipelining, dynamic QP management, and programmable data-path features to enhance throughput and reduce CPU overhead.
Applications include smart NICs, data center accelerators, and scientific instrumentation, with reported throughput of up to 100 Gb/s at low latency.

A RoCEv2-compliant RDMA offload engine is a network interface subsystem, implemented as a discrete ASIC, an FPGA module, or part of a SmartNIC architecture, that provides hardware acceleration for Remote Direct Memory Access (RDMA) operations over Ethernet using the RDMA over Converged Ethernet v2 (RoCEv2) protocol. RoCEv2 encapsulates InfiniBand transport semantics in UDP/IP headers, enabling reliable, low-latency, high-throughput access between the memories of different hosts or devices across standard Ethernet data center fabrics, often at 100 Gb/s or more. By offloading protocol processing, packet sequencing, flow control, and reliability mechanisms to hardware, these engines significantly reduce CPU utilization and memory-copy overhead while supporting advanced features such as selective retransmission, credit-based flow control, and programmable data-path acceleration (Mittal et al., 2018, Heer et al., 27 Jul 2025, Zhong et al., 2023, He et al., 2023, Marini et al., 2 Sep 2025).

1. Core Protocol Compliance and Architectural Principles

RoCEv2-compliant RDMA offload engines conform to the protocol stack prescribed by the RoCEv2 specification, encapsulating InfiniBand Base Transport Headers (BTH) and optional extension headers (RETH, AETH, ImmDt, IETH) within UDP/IP over Ethernet. This architecture, exemplified in open designs like RoCE BALBOA and RecoNIC, accommodates the full range of RDMA verbs (READ, WRITE, SEND, SEND_WITH_IMM, INVALIDATE) and implements Reliable Connected (RC) semantics, which require precise packet sequencing, PSN maintenance, and credit-based or bandwidth-delay-product-based flow control (Heer et al., 27 Jul 2025, Zhong et al., 2023).
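To make the encapsulation concrete, the following minimal Python sketch packs and unpacks the 12-byte BTH that follows the UDP header in a RoCEv2 packet. The field layout and the three RC opcodes follow the InfiniBand transport header format; the function names (`pack_bth`, `unpack_bth`) and default values are illustrative assumptions and are not taken from BALBOA or RecoNIC.

```python
import struct

ROCEV2_UDP_DPORT = 4791  # UDP destination port assigned to RoCEv2

# A few Reliable Connected (RC) opcodes from the InfiniBand transport header.
RC_SEND_ONLY = 0x04
RC_RDMA_WRITE_ONLY = 0x0A
RC_RDMA_READ_REQUEST = 0x0C


def pack_bth(opcode, dest_qp, psn, pkey=0xFFFF, ack_req=False,
             solicited=False, pad_count=0):
    """Pack a 12-byte Base Transport Header in network byte order."""
    if not 0 <= dest_qp < (1 << 24):
        raise ValueError("destination QP number is a 24-bit field")
    if not 0 <= psn < (1 << 24):
        raise ValueError("PSN is a 24-bit field")
    # Byte 1: SE (1 bit) | M (1 bit, left 0) | PadCnt (2 bits) | TVer (4 bits, 0)
    flags = (int(solicited) << 7) | ((pad_count & 0x3) << 4)
    # Word 2: 8 reserved bits (left 0 here) followed by the 24-bit DestQP
    word2 = dest_qp
    # Word 3: AckReq (1 bit) | 7 reserved bits | 24-bit PSN
    word3 = (int(ack_req) << 31) | psn
    return struct.pack("!BBHII", opcode, flags, pkey, word2, word3)


def unpack_bth(data):
    """Decode the first 12 bytes of `data` as a BTH."""
    opcode, flags, pkey, word2, word3 = struct.unpack("!BBHII", data[:12])
    return {
        "opcode": opcode,
        "solicited": bool(flags >> 7),
        "pad_count": (flags >> 4) & 0x3,
        "pkey": pkey,
        "dest_qp": word2 & 0xFFFFFF,
        "ack_req": bool(word3 >> 31),
        "psn": word3 & 0xFFFFFF,
    }
```

A hardware parser performs the same bit slicing on the incoming stream; extension headers such as RETH or AETH would follow these 12 bytes depending on the opcode.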

High-performance implementations utilize deeply pipelined packet processing for both ingress (RX) and egress (TX) datapaths, supporting features such as:

Header parsing/assembly at 100 GbE line rate (a 512-bit AXI-Stream datapath clocked at 250 MHz gives 128 Gb/s of theoretical throughput)
Dynamic queue-pair (QP) management in on-chip SRAM/BRAM for connection, state, and flow control tables
Programmable transport-layer checksums (iCRC) and hardware-implemented end-to-end retransmission logic
Doorbell and completion-queue notification mechanisms (PCIe MMIO or on-chip register interfaces) for high-throughput host-software interaction (Heer et al., 27 Jul 2025, Zhong et al., 2023, Marini et al., 2 Sep 2025).
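The QP-state and PSN bookkeeping listed above can be sketched in software as follows. This is a simplified model under stated assumptions: the class and field names (`QPTable`, `QPState`, `expected_psn`) are hypothetical, the table is a plain fixed-size array standing in for on-chip SRAM/BRAM, and only the receive-side PSN check is shown.

```python
from dataclasses import dataclass

PSN_MASK = (1 << 24) - 1  # PSNs are 24-bit counters that wrap around


@dataclass
class QPState:
    remote_qpn: int
    expected_psn: int = 0  # next PSN the receive side expects


class QPTable:
    """Fixed-capacity table indexed by local QP number (hypothetical model)."""

    def __init__(self, capacity):
        self.slots = [None] * capacity

    def create(self, qpn, remote_qpn, start_psn=0):
        if self.slots[qpn] is not None:
            raise ValueError("QP already in use")
        self.slots[qpn] = QPState(remote_qpn, start_psn & PSN_MASK)

    def destroy(self, qpn):
        self.slots[qpn] = None

    def on_receive(self, qpn, psn):
        """Classify an incoming PSN for the given QP."""
        qp = self.slots[qpn] if 0 <= qpn < len(self.slots) else None
        if qp is None:
            return "no_qp"
        # Modular distance from the expected PSN handles wrap-around.
        delta = (psn - qp.expected_psn) & PSN_MASK
        if delta == 0:
            qp.expected_psn = (qp.expected_psn + 1) & PSN_MASK
            return "in_order"
        # The upper half of the 24-bit space counts as "behind" (duplicate),
        # the lower half as "ahead" (a gap, i.e. out of order).
        return "duplicate" if delta >= (1 << 23) else "out_of_order"
```

Indexing directly by QP number keeps the lookup to a single memory read, which is one reason per-QP tables map naturally onto block RAM.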

2. Loss Recovery, Flow Control, and Congestion Management

Classic RoCEv2 deployments depend on Priority Flow Control (PFC) to provide a "lossless" Ethernet fabric, which introduces challenges such as head-of-line blocking and network congestion spreading. IRN (Improved RoCE NIC) provides an alternative: the key insight is that these dependencies are artifacts of insufficient transport logic in NICs (e.g., go-back-N retransmissions, lack of robust flow control). IRN introduces two crucial mechanisms:

Selective Loss Recovery: Per-packet SACKs and dual RTOs (short and long) allow for rapid recovery of isolated losses and efficient tail latency reduction. Retransmission logic tracks the cumulative ACK, SACK bitmaps, and recovery points for selective retransmit (not go-back-N).
End-to-End BDP-based Flow Control: The number of outstanding in-flight packets is statically capped at $W = \lceil \text{BDP} / \text{MTU} \rceil$, which prevents overflow of network buffers and removes the need for PFC. Packet transmission is stalled when $\text{in\_flight\_pkts} \geq W$ (Mittal et al., 2018).
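The two mechanisms can be illustrated with a small sender-side model. This is a sketch under simplifying assumptions, not IRN's actual hardware logic: in-flight packets are counted as the distance between the next new PSN and the cumulative ACK, each hole is retransmitted once per recovery episode, and the dual-RTO timers are omitted. All names (`bdp_window`, `IrnSender`) are hypothetical.

```python
import math


def bdp_window(link_gbps, rtt_us, mtu_bytes):
    """W = ceil(BDP / MTU). Gb/s times microseconds gives kilobits, so
    multiplying by 125 converts the product to bytes."""
    bdp_bytes = link_gbps * rtt_us * 125
    return math.ceil(bdp_bytes / mtu_bytes)


class IrnSender:
    def __init__(self, window, total_pkts):
        self.window = window      # static BDP cap W
        self.total = total_pkts
        self.next_new = 0         # next never-sent PSN
        self.cum_ack = 0          # every PSN below this is acknowledged
        self.sacked = set()       # PSNs above cum_ack reported as received
        self.retx_done = set()    # holes already retransmitted

    def in_flight(self):
        return self.next_new - self.cum_ack

    def on_ack(self, cum_ack, sack_psn=None):
        """Process an ACK, optionally carrying one selectively acked PSN."""
        self.cum_ack = max(self.cum_ack, cum_ack)
        if sack_psn is not None and sack_psn >= self.cum_ack:
            self.sacked.add(sack_psn)
        # Drop bookkeeping for packets now covered by the cumulative ACK.
        self.sacked = {p for p in self.sacked if p >= self.cum_ack}
        self.retx_done = {p for p in self.retx_done if p >= self.cum_ack}

    def next_packet(self):
        """Return the PSN to transmit next, or None if stalled or finished."""
        # Selective recovery: resend only holes below the highest SACKed PSN.
        if self.sacked:
            for psn in range(self.cum_ack, max(self.sacked)):
                if psn not in self.sacked and psn not in self.retx_done:
                    self.retx_done.add(psn)
                    return psn
        # BDP flow control: new data only while in_flight < W.
        if self.next_new < self.total and self.in_flight() < self.window:
            psn = self.next_new
            self.next_new += 1
            return psn
        return None
```

For instance, `bdp_window(100, 10, 4096)` evaluates to 31 packets for a 100 Gb/s link with a 10 µs round trip and a 4096-byte MTU (both values chosen only for illustration). In contrast to go-back-N, a loss of PSN 1 followed by a SACK for PSN 2 causes only PSN 1 to be resent.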

These changes result in a 6–83% performance gain over legacy RoCEv2+PFC across a range of network scenarios and are implementable with $\lesssim 1$ KB of state per QP and roughly 1.4% of flip-flops (FF) and 4% of LUTs on the FPGA (Mittal et al., 2018).

Moreover, optional on-NIC congestion control (e.g., DCQCN) can be layered above IRN without impacting the loss recovery layer, leveraging ECN-marked frames for transport pacing.
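As a rough illustration of ECN-driven pacing, the sketch below models the sender-side rate update of DCQCN, following the commonly published update rules: a multiplicative decrease when a congestion notification packet (CNP) arrives, and a recovery step that moves the rate halfway back to the target. The additive and hyper-increase phases, byte counters, and timer management are omitted. The class name and the default gain `g = 1/256` are assumptions for this example, not parameters taken from any engine discussed here.

```python
class DcqcnRate:
    """Simplified DCQCN reaction-point state (illustrative only)."""

    def __init__(self, line_rate_gbps, g=1 / 256):
        self.rate = line_rate_gbps    # current sending rate R_C
        self.target = line_rate_gbps  # target rate R_T
        self.alpha = 1.0              # congestion estimate
        self.g = g                    # gain for the alpha moving average

    def on_cnp(self):
        """A CNP arrived: remember the old rate, then cut the current rate."""
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self):
        """No CNP during the last alpha-update interval: decay alpha."""
        self.alpha *= 1 - self.g

    def on_increase_timer(self):
        """Fast-recovery step: move halfway back toward the target rate."""
        self.rate = (self.target + self.rate) / 2
```

Because this logic only adjusts the pacing rate, it can sit alongside loss recovery without touching the retransmission state.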

3. Advanced Hardware Architectures and Programmability

Recent RoCEv2 offload engines leverage FPGA fabric for both protocol offload and application extension. RecoNIC and ACCL+ exemplify this approach:

RecoNIC: Integrates programmable compute blocks (Lookaside and Streaming) with the ERNIC RoCEv2 IP core, exposing unified AXI4-Stream/MM interfaces. User logic in RTL/HLS/Vitis P4 can issue RDMA operations directly and process network traffic at line rate, enabling in-situ data transformation and acceleration (Zhong et al., 2023).
ACCL+: Augments a RoCEv2-compliant core with a MicroBlaze-based collective offload engine and a microcode-driven DMP, enabling runtime algorithm selection for collectives (e.g., broadcast, reduce). Streaming plugins for in-flight reduction or transforms are inserted as AXI stream modules, offering efficient FPGA-to-FPGA and CPU-offloaded collectives (He et al., 2023).
RoCE BALBOA: Implements a modular datapath where protocol services (e.g., AES encryption, ML-based DPI) or application logic (e.g., ML preprocessing) are inserted as pipeline blocks. This enables protocol enhancement and compute offload with sub-100 ns latency per service, all at 100 Gb/s line rate. Resource usage is parameterized by QP count, offload service count, and buffer pool sizing (Heer et al., 27 Jul 2025).
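The idea of inserting services as pipeline blocks can be modeled in software as a chain of stages, each of which transforms or drops a packet. The sketch below is purely conceptual: the stage names are hypothetical, the XOR stage is a stand-in for a real cipher such as AES (it provides no security), and the pattern-matching stage is a toy substitute for ML-based DPI.

```python
from typing import Callable, List, Optional

Packet = dict
Stage = Callable[[Packet], Optional[Packet]]


def xor_stage(key: int) -> Stage:
    """Placeholder for an encryption block; XOR is NOT real encryption."""
    def stage(pkt: Packet) -> Optional[Packet]:
        return {**pkt, "payload": bytes(b ^ key for b in pkt["payload"])}
    return stage


def dpi_stage(blocked: bytes) -> Stage:
    """Toy inspection block: drop packets whose payload contains a pattern."""
    def stage(pkt: Packet) -> Optional[Packet]:
        return None if blocked in pkt["payload"] else pkt
    return stage


class Datapath:
    """Ordered list of services applied to every packet."""

    def __init__(self) -> None:
        self.stages: List[Stage] = []

    def insert(self, stage: Stage) -> "Datapath":
        self.stages.append(stage)
        return self

    def process(self, pkt: Packet) -> Optional[Packet]:
        for stage in self.stages:
            pkt = stage(pkt)
            if pkt is None:  # a stage dropped the packet
                return None
        return pkt
```

A datapath built as `Datapath().insert(dpi_stage(b"bad")).insert(xor_stage(0x5A))` inspects each packet before transforming it. In hardware, each stage would be a streaming module that adds a fixed number of cycles of latency.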

4. Resource Utilization, Scalability, and Performance Models

Resource allocation in RoCEv2 offload engines is dominated by per-QP state tables and datapath pipelining. Empirical synthesis results demonstrate:

Engine	LUTs (% of device)	BRAM (% of device)	FFs (% of device)	QP capacity
BALBOA Core	43,732 (3.4%)	101 (5.1%)	102,988 (4%)	500–1000+
ACCL+ CCLO+POE	12–13%	5–5.7%	n/a	limited by FPGA memory
CTAO-LST RDMA	29,802 (12%)	8.5 (1.4%)	40,902 (8%)	tens of QPs (VU9P)

Performance typically saturates line rate for batch or large-payload transfers:

Throughput: $T_{payload} \approx \frac{\text{MSS}}{\text{RTT} + T_{proc}}$, approaching wire speed for large buffers ($\sim 92$–$100$ Gb/s at 32 KiB payloads)
Latency: For small messages, write latencies of $2.1$–$2.4\,\mu$s and read latencies of $3.8$–$4.2\,\mu$s have been reported for 64 B to 4 KB payloads over 100 GbE (Heer et al., 27 Jul 2025, Zhong et al., 2023, Marini et al., 2 Sep 2025)
Batch RDMA: Amortizes doorbell and WQE-fetch overhead, yielding back-to-back completions at 40 ns intervals after the initial read
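The throughput model and the datapath arithmetic can be evaluated numerically with a few lines of Python. The round-trip and processing times used in the usage note are assumed values chosen only to show the calculation; they are not measurements from the cited works.

```python
def datapath_gbps(clock_mhz, width_bits):
    """Raw datapath bandwidth: clock rate times bus width, in Gb/s."""
    return clock_mhz * width_bits / 1e3


def payload_throughput_gbps(mss_bytes, rtt_us, t_proc_us, line_rate_gbps=100.0):
    """T_payload ~ MSS / (RTT + T_proc), capped at the line rate.

    Bits divided by microseconds gives Mb/s, so dividing by a further
    1e3 yields Gb/s.
    """
    model = mss_bytes * 8 / ((rtt_us + t_proc_us) * 1e3)
    return min(model, line_rate_gbps)
```

With these definitions, `datapath_gbps(250, 512)` gives 128.0 Gb/s, and a 32 KiB transfer with an assumed 2.2 µs round trip plus 0.5 µs of processing, `payload_throughput_gbps(32 * 1024, 2.2, 0.5)`, gives about 97 Gb/s, which falls within the range quoted above.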

Scalability is dictated by QP table provisioning (BRAM/URAM usage) and pipeline scheduling (e.g., flat BRAM-hash versus round-robin QP allocation for fairness), with tested deployment up to 1000 QPs (Heer et al., 27 Jul 2025).

5. Application Domains, Protocol Extensions, and Offload Use Cases

RoCEv2-compliant RDMA offload engines are deployed in diverse contexts:

SmartNICs and Data Center Accelerators: Enabling low-latency, high-throughput RDMA for distributed compute, e.g., ML, storage, and in-network data preprocessing (Heer et al., 27 Jul 2025, Zhong et al., 2023)
Scientific Instrumentation: As demonstrated in CTAO-LST, an FPGA-based RoCEv2 core written in Bluespec SystemVerilog enables direct DAQ-to-RDMA offload for high-throughput camera readout, attaining up to 100 Gb/s, low error rates ($<10^{-12}$ BER), and fully offloaded transfer to host memory (Marini et al., 2 Sep 2025)
Protocol/Service Enhancement: BALBOA integrates cryptographic engines (e.g., AES-128 ECB, 100 Gb/s line-rate) and ML-based DPI with negligible latencies and minimal (<1%) fabric cost, demonstrating how protocol and security enhancements can be natively offloaded in the data-path (Heer et al., 27 Jul 2025)
Collective Communication: ACCL+ exposes MPI-style collectives mapped onto RDMA verbs, achieving broadcast latencies of 15–25 $\mu$s (vs. 45–70 $\mu$s for host MPI) and reducing CPU utilization by up to 70% (He et al., 2023)
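To show how a collective can be decomposed into point-to-point RDMA operations, the sketch below generates a binomial-tree broadcast schedule, a standard algorithm in which the set of ranks holding the data doubles each round. Which algorithms ACCL+ actually selects at runtime is not specified here; the binomial tree and the function name are assumptions for illustration. Each `(src, dst)` pair would correspond to one RDMA WRITE or SEND.

```python
def binomial_broadcast_schedule(n_ranks, root=0):
    """Return a list of rounds; each round is a list of (src, dst) transfers."""
    rounds = []
    step = 1
    while step < n_ranks:
        sends = []
        # Ranks 0..step-1 (relative to the root) already hold the data.
        for rel in range(step):
            peer = rel + step
            if peer < n_ranks:
                sends.append(((rel + root) % n_ranks, (peer + root) % n_ranks))
        rounds.append(sends)
        step *= 2
    return rounds
```

For eight ranks the schedule completes in three rounds, compared with seven sequential transfers if the root sent to every peer itself.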

6. Verification, Testing, and Best Practices

Robustness and compliance are validated via:

Synthetic testbenches (cocotb, Questa Advanced) for protocol and error handling (PSN wrap, CRC error, multi-MTU sweep) (Marini et al., 2 Sep 2025)
Emulation of random link drops and incast/congestion stress for tail latency (Mittal et al., 2018)
QP negotiation and fallback mechanisms for backward compatibility with legacy RoCE engines (Mittal et al., 2018)
Exposure of critical parameters (BDP, RTO) via NIC driver sysfs so that they can be tuned to application requirements (Mittal et al., 2018)
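A link-drop emulation of the kind listed above can be prototyped in software before moving to a hardware testbench. The following sketch counts how many transmissions are needed to deliver a message under go-back-N versus selective retransmission. It is a deliberately coarse round-based model with hypothetical names: the sender transmits one window per round, ACKs are never lost, and timing is ignored.

```python
import random


def random_loss(drop_prob, seed=0):
    """Build a loss function that drops each transmission independently."""
    rng = random.Random(seed)
    return lambda psn: rng.random() < drop_prob


def count_transmissions(n_pkts, window, selective, is_lost):
    """Total packets sent until all n_pkts are delivered in order."""
    delivered = [False] * n_pkts
    base = 0   # lowest PSN not yet delivered
    sent = 0
    while base < n_pkts:
        gap_seen = False
        for psn in range(base, min(base + window, n_pkts)):
            if selective and delivered[psn]:
                continue  # already received out of order, no need to resend
            sent += 1
            if is_lost(psn):
                gap_seen = True
            elif selective or not gap_seen:
                # A go-back-N receiver discards everything after a gap.
                delivered[psn] = True
        while base < n_pkts and delivered[base]:
            base += 1
    return sent
```

Calling `count_transmissions(1000, 32, True, random_loss(0.01))` and the same call with `selective=False` gives a quick comparison of the retransmission overhead of the two schemes under 1% random loss.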

Common integration recommendations include dual-mode negotiation to enable seamless upgrade and staged PFC withdrawal as IRN-capable deployment approaches ubiquity (Mittal et al., 2018).

Potential pitfalls include misestimation of the BDP, out-of-order memory writes (mitigated through application-level fences), and limited support for advanced transports (e.g., XRC and MP-RoCE, which are still under active development in recent FPGA engines) (He et al., 2023, Heer et al., 27 Jul 2025).

7. Future Directions and Unresolved Challenges

Persistent challenges and research frontiers include:

Dynamic QP creation and path-aware PSN for multi-path RoCE (MP-RoCE)
Hardware-accelerated memory registration (MKEY/MR offload)
Exposing atomic operations (fetch-add, CAS) for GPU collectives (He et al., 2023)
In-hardware congestion/admission control leveraging DCQCN/ECN and runtime QoS
Full compatibility with new transport extensions and application-level offload APIs (Heer et al., 27 Jul 2025, Marini et al., 2 Sep 2025)

These evolving areas are expected to further increase the adaptability and efficiency of RoCEv2-compliant RDMA offload engines for next-generation high-performance and cloud-scale deployments.