
CXL-over-Ethernet: Memory Disaggregation

Updated 1 April 2026
  • CXL-over-Ethernet designs are novel architectures that encapsulate CXL memory semantics in Ethernet frames, enabling disaggregated, low-latency remote memory access.
  • They leverage commodity Ethernet switches and minimal endpoint software changes to extend memory access beyond rack-local PCIe limits and achieve near-wire speeds.
  • Recent implementations demonstrate cache-hit latencies as low as 415 ns and aggregate throughputs up to 1.9 Tb/s, ideal for large-scale and AI workloads.

A Compute Express Link (CXL)-over-Ethernet design delivers CXL memory semantics (low-latency, native load/store access to remote memory) across Ethernet, enabling disaggregated memory architectures that scale beyond the physical reach of PCI Express buses or native CXL fabrics. Native CXL is limited to rack-local deployments by the attenuation and hop-count constraints of PCIe/CXL signal integrity. CXL-over-Ethernet architectures remove this limit by encapsulating CXL.mem transactions for transmission over high-speed Ethernet, reusing commodity switching infrastructure while minimizing software changes at endpoints. This approach underpins recent developments in ultra-low-latency distributed memory, pooled NIC architectures, and hybrid two-tier interconnects that target large-scale data-parallel and AI workloads (Su et al., 2024, Wang et al., 2023, Zhang et al., 2024).

1. CXL-over-Ethernet: Architectural Principles

CXL-over-Ethernet designs bridge the gap between cache-coherent memory interfaces and conventional Ethernet fabrics. They intercept memory load/store instructions at the host, encapsulate them in suitable transport frames, and keep remote memory bandwidth and latency within the range that host-attached memory semantics require. The approach decouples memory provisioning from compute nodes and overcomes intra-rack scaling limits, providing transparent access at the application and OS levels.

Typical system organization incorporates:

  • Host Domain: Standard CPU with a PCIe-rooted CXL port, supporting native CXL.mem and CXL.io protocols.
  • CXL Agent (FPGA or ASIC): Translates CXL flits to AXI4, encapsulates semantics into custom Ethernet frames.
  • Memory Node (FPGA/DRAM): Receives Ethernet frames, restores AXI4, translates addresses and accesses local DRAM.
  • Ethernet Fabric: High-speed Ethernet with Priority Flow Control (PFC) across standard switches, carrying encapsulated CXL frames.
  • NIC Pool (DFabric): Pool of CXL-attached NICs enabling rack-wide and cross-rack communication, supporting aggregated bandwidth scaling (Wang et al., 2023, Zhang et al., 2024).

The encapsulation preserves CXL-native semantics, requiring no application changes, and leverages high-throughput, low-latency fabric with reliability mechanisms such as explicit ACK/NAK, congestion control, and selective retransmit (Wang et al., 2023).
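
The encapsulation step can be illustrated with a minimal sketch. The header layout here (opcode, tag, address, payload length) is a hypothetical choice for illustration, since the frame format used by the cited designs is not given in this summary. The EtherType 0x88B5 is the IEEE 802 local experimental value, used here only as a placeholder.

```python
import struct

# Hypothetical frame layout; not the format used by the cited systems.
ETHERTYPE_EXPERIMENTAL = 0x88B5  # IEEE 802 local experimental EtherType
OP_READ, OP_WRITE = 0, 1

# Memory header: opcode (1 B), tag (2 B), address (8 B), payload length (2 B)
_MEM_HDR = struct.Struct("!BHQH")
_ETH_HDR_LEN = 14  # dst MAC (6) + src MAC (6) + EtherType (2)


def encapsulate(dst_mac: bytes, src_mac: bytes, opcode: int, tag: int,
                addr: int, data: bytes = b"") -> bytes:
    """Wrap one memory request in an Ethernet frame (FCS omitted)."""
    if len(dst_mac) != 6 or len(src_mac) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    eth = dst_mac + src_mac + struct.pack("!H", ETHERTYPE_EXPERIMENTAL)
    return eth + _MEM_HDR.pack(opcode, tag, addr, len(data)) + data


def decapsulate(frame: bytes):
    """Recover (opcode, tag, addr, data) from a frame built by encapsulate()."""
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_EXPERIMENTAL:
        raise ValueError("not an encapsulated memory frame")
    opcode, tag, addr, length = _MEM_HDR.unpack_from(frame, _ETH_HDR_LEN)
    start = _ETH_HDR_LEN + _MEM_HDR.size
    return opcode, tag, addr, frame[start:start + length]
```

The tag field stands in for whatever identifier a real design uses to match responses to outstanding requests, which explicit ACK/NAK and selective retransmit both depend on.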

2. Protocol Stack Innovations and Data Path

Recent solutions depart from the traditional layered networking stack to eliminate OS intervention and reduce per-packet overhead. Notably, EDM implements the entire remote memory transport at the Physical Coding Sublayer (PCS) of the Ethernet PHY. This allows the protocol to:

  • Transmit at the granularity of 66-bit physical blocks (matching IEEE 802.3), circumventing the latency and inefficiencies of Ethernet MAC-level framing (minimum 64B, inter-frame gaps, no preemption).
  • Use idle symbols in the inter-frame gap (IFG) to transport memory data, and introduce block-level traffic preemption between memory and best-effort flows.
  • Expand the block-type space in the 66b PHY to define explicit block markers for memory start, continuation, and termination, as well as notifications and grants for scheduler interaction. The added block types are /MS/ (Memory Start), /MD/ (Memory Data), /MT/ (Memory Terminate), /MST/ (Memory Single-block), /N/ (Notify), and /G/ (Grant) (Su et al., 2024).
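
As a rough illustration of how the memory block markers could delimit a message, the sketch below splits a payload into a sequence of tagged blocks. The 7-byte payload per block is an assumption (a 64-bit block body minus an 8-bit block-type field); the actual bit layout in EDM is not described in this summary, and the function names are hypothetical.

```python
from typing import List, Tuple

# Assumption: each 66-bit block carries 7 payload bytes after the block-type field.
PAYLOAD_BYTES_PER_BLOCK = 7

Block = Tuple[str, bytes]


def to_blocks(message: bytes) -> List[Block]:
    """Split a memory message into /MST/ or /MS/ ... /MD/ ... /MT/ blocks."""
    if not message:
        raise ValueError("empty message")
    step = PAYLOAD_BYTES_PER_BLOCK
    chunks = [message[i:i + step] for i in range(0, len(message), step)]
    if len(chunks) == 1:
        return [("/MST/", chunks[0])]          # fits in a single block
    blocks = [("/MS/", chunks[0])]             # start of message
    blocks += [("/MD/", c) for c in chunks[1:-1]]  # continuation blocks
    blocks.append(("/MT/", chunks[-1]))        # termination
    return blocks


def from_blocks(blocks: List[Block]) -> bytes:
    """Reassemble a message, checking that the markers are well-formed."""
    kinds = [kind for kind, _ in blocks]
    single = kinds == ["/MST/"]
    multi = (len(kinds) >= 2 and kinds[0] == "/MS/" and kinds[-1] == "/MT/"
             and all(k == "/MD/" for k in kinds[1:-1]))
    if not (single or multi):
        raise ValueError("malformed block sequence")
    return b"".join(payload for _, payload in blocks)
```

Because each block is self-describing, a receiver can tell where a memory message starts and ends without waiting for a full MAC frame, which is what makes block-level preemption possible.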

Data-path operation (in both FPGA and host logic) typically follows:

  • TX Side: CXL flits are mapped to AXI4 and looked up in the local cache; a cache miss triggers direct encapsulation into Ethernet packets, which are injected at line rate into the hardware TX pipeline.
  • RX Side: Ethernet frames are parsed, AXI transactions are reconstructed, and memory requests are issued; responses are encapsulated and sent back to the requester.

Optimized designs bypass the OS network stack entirely, with FPGA soft-logic operating as a custom CXL-to-Ethernet bridge, leveraging programmable logic for cache and protocol management (Wang et al., 2023).
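
The TX-side decision (serve from the agent's cache on a hit, go to the network on a miss) can be modeled in software as follows. This is a behavioral sketch only: the LRU replacement policy, the 64-byte line size, and the `fetch_remote` callback are assumptions made for illustration, not details taken from the cited FPGA designs.

```python
from collections import OrderedDict
from typing import Callable

CACHE_LINE = 64  # assumed line size in bytes


class AgentCache:
    """Software model of a cache in the CXL agent (LRU policy is an assumption)."""

    def __init__(self, capacity_lines: int,
                 fetch_remote: Callable[[int], bytes]) -> None:
        self._capacity = capacity_lines
        self._fetch_remote = fetch_remote  # stands in for encapsulate + send + wait
        self._lines: "OrderedDict[int, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def load(self, addr: int) -> bytes:
        """Return the cache line containing addr, fetching it remotely on a miss."""
        line = addr - addr % CACHE_LINE
        if line in self._lines:
            self._lines.move_to_end(line)      # mark as most recently used
            self.hits += 1
            return self._lines[line]
        self.misses += 1
        data = self._fetch_remote(line)        # miss: request goes over Ethernet
        self._lines[line] = data
        if len(self._lines) > self._capacity:
            self._lines.popitem(last=False)    # evict least recently used line
        return data
```

The `hits` and `misses` counters give the hit rate that determines average access latency.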

3. In-Network Scheduling and Congestion Control

Ultra-low latency memory access over shared Ethernet infrastructure requires carefully engineered scheduling and congestion mechanisms. Key approaches include:

  • In-PHY Virtual Circuit Scheduler (EDM): Scheduler inspired by the Parallel Iterative Matching (PIM) algorithm operates within the physical layer of the switch, establishing dynamic virtual circuits between source-destination pairs. This design guarantees bandwidth reservation, reduces switch internal queuing, and eliminates Layer 2 packet processing delay for memory traffic, achieving deterministic low latency.
  • FPGA-Embedded Congestion Control (CXL-over-Ethernet): Utilizes IEEE 802.1Qbb PFC, a custom six-state finite-state machine for flow regulation, and selective retransmission/ACK to maintain reliability and stable throughput, converging to within 7% of sustainable rate after initial recovery, and ensuring stability even under constrained buffer sizing (Wang et al., 2023, Su et al., 2024).
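
For reference, the classic Parallel Iterative Matching algorithm that the EDM scheduler is described as being inspired by can be sketched as below. This is the textbook request/grant/accept loop, not EDM's in-PHY implementation; the function name, the port numbering, and the iteration count are illustrative assumptions.

```python
import random
from typing import Dict, List, Optional, Set


def pim_match(requests: Dict[int, Set[int]], iterations: int = 3,
              rng: Optional[random.Random] = None) -> Dict[int, int]:
    """Return a conflict-free input -> output matching.

    requests[i] is the set of outputs that input i has queued traffic for.
    """
    rng = rng or random.Random()
    match: Dict[int, int] = {}
    taken: Set[int] = set()
    for _ in range(iterations):
        # Request: unmatched inputs ask every still-free output they need.
        requesters: Dict[int, List[int]] = {}
        for inp in sorted(requests):
            if inp in match:
                continue
            for out in sorted(requests[inp]):
                if out not in taken:
                    requesters.setdefault(out, []).append(inp)
        if not requesters:
            break
        # Grant: each free output picks one requesting input at random.
        grants: Dict[int, List[int]] = {}
        for out, inps in requesters.items():
            grants.setdefault(rng.choice(inps), []).append(out)
        # Accept: each input picks one of its grants at random.
        for inp, outs in grants.items():
            out = rng.choice(outs)
            match[inp] = out
            taken.add(out)
    return match
```

Each matched pair corresponds to one virtual circuit for the next scheduling slot; inputs left unmatched retry in a later slot.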

Hierarchical or ring-based scheduling is recommended for large clusters to contain allreduce and similar patterns prior to egress, minimizing cross-rack contention (Zhang et al., 2024).
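
The benefit of the hierarchical pattern is that only one partial result per rack has to cross the rack boundary. A minimal sketch of that three-stage flow follows, with plain Python lists standing in for gradient buffers; the function name and data layout are assumptions for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def _vec_sum(vectors: Sequence[Vector]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def hierarchical_allreduce(racks: List[List[Vector]]) -> List[List[Vector]]:
    """racks[r][n] is the vector held by node n in rack r.

    Stage 1 reduces inside each rack, stage 2 combines one partial sum per
    rack across racks, and stage 3 broadcasts the total back to every node.
    """
    rack_partials = [_vec_sum(rack) for rack in racks]      # intra-rack reduce
    total = _vec_sum(rack_partials)                         # cross-rack reduce
    return [[list(total) for _ in rack] for rack in racks]  # broadcast


# Two racks: one with two nodes, one with a single node.
result = hierarchical_allreduce([[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]])
```

In this example, stage 2 moves two vectors between racks instead of the three that a flat all-to-all exchange would involve.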

4. Latency, Bandwidth, and Scalability Analysis

Across experimental deployments and validated analytic models, key performance features are:

| Architecture | Remote Access Latency | Line-rate Throughput | Protocol Location |
| EDM | ~300 ns (unloaded) | Near-wire speed | Ethernet PCS/PHY (Su et al., 2024) |
| CXL-over-Ethernet (FPGA cache) | 415 ns (cache hit) | ~100 Gbps (full bandwidth) | FPGA edge/host (Wang et al., 2023) |
| CXL-over-Ethernet (no cache) | 1.97 μs | ~100 Gbps | FPGA edge/host |
| One-sided RDMA | 1.85 μs | - | Stack above MAC |
| DFabric Data Pooling | 400 ns (CXL.mem) | 1.9 Tb/s (NIC pool) | Hybrid CXL+Ethernet (Zhang et al., 2024) |

Low latency comes primarily from bypassing the OS, MAC-layer framing, and software switches. Fine-grained transfers and cache-hit optimizations further reduce round-trip delays; e.g., CXL-over-Ethernet cache hits are measured at 415 ns, while the cache-miss round trip is 1.88 μs. Average latency is $L_{\text{avg}} = \alpha\,L_{\text{hit}} + (1-\alpha)\,L_{\text{miss}}$, where $\alpha$ is the cache hit rate (Wang et al., 2023).
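
As a worked example of the average-latency relation (hit rate times hit latency plus miss rate times miss latency), the sketch below plugs in the 415 ns and 1.88 μs figures quoted above. The 90% hit rate and the 1 μs target are arbitrary illustrative values, and the function names are hypothetical.

```python
def average_latency_ns(hit_rate: float, hit_ns: float = 415.0,
                       miss_ns: float = 1880.0) -> float:
    """Weighted average of hit and miss latency for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns


def required_hit_rate(target_ns: float, hit_ns: float = 415.0,
                      miss_ns: float = 1880.0) -> float:
    """Hit rate needed to reach a target average latency (inverse of the above)."""
    if not hit_ns <= target_ns <= miss_ns:
        raise ValueError("target must lie between hit and miss latency")
    return (miss_ns - target_ns) / (miss_ns - hit_ns)


avg = average_latency_ns(0.9)      # 0.9 * 415 + 0.1 * 1880 = 561.5 ns
alpha = required_hit_rate(1000.0)  # hit rate needed for a 1 us average
```

At a 90% hit rate the average is about 561.5 ns, so the miss penalty still contributes roughly a third of the total even when misses are rare.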

Bandwidth is constrained by DRAM device throughput and aggregate NIC pool rate. For instance, DFabric demonstrates aggregate throughput scaling to 7.5 TB/s across 16 compute nodes, surpassing pure CXL-only or GbE rack configurations.

Memory pool architectures mitigate local DDR bottlenecks by striping data across multiple devices to match or exceed the aggregate Ethernet output (Zhang et al., 2024).
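
One simple way to stripe data across devices is round-robin interleaving at cache-line granularity. The sketch below shows that address mapping; the 64-byte stripe size and the function name are assumptions, and the actual striping scheme in DFabric may differ.

```python
from typing import Tuple

CACHE_LINE = 64  # assumed stripe granularity in bytes


def locate(addr: int, num_devices: int,
           stripe_bytes: int = CACHE_LINE) -> Tuple[int, int]:
    """Map a pool address to (device index, device-local address)."""
    if addr < 0 or num_devices <= 0 or stripe_bytes <= 0:
        raise ValueError("invalid arguments")
    stripe = addr // stripe_bytes            # which stripe the address falls in
    device = stripe % num_devices            # stripes rotate across devices
    local = (stripe // num_devices) * stripe_bytes + addr % stripe_bytes
    return device, local


# Four consecutive cache lines land on four different devices.
placement = [locate(a, num_devices=4) for a in (0, 64, 128, 192)]
```

With this mapping a sequential copy touches every device in turn, so the pool's aggregate bandwidth rather than a single device's bandwidth bounds the transfer.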

5. Hardware Platforms, Implementation, and Empirical Evaluation

Prototypes use production server hardware (Intel Sapphire Rapids CPUs, Linux 5.x kernels), CXL 1.1-3.0 compliant root ports, and FPGA boards configured as CXL agents and memory endpoints:

  • Physical Connectivity: PCIe Gen4 ×8 for host-to-FPGA, 100 Gbps QSFP28 for Ethernet, up to 32 GB DDR4/DDR5 for memory backing.
  • FPGAs: Xilinx UltraScale+ U280 as CXL agent and memory controller, running at 250–322 MHz.
  • Software Stack: Linux CXL.io driver supports mapping of remote memory (CMem) in the system address space with IOMMU; bitstreams integrate vendor and custom IP for CXL/Ethernet.
  • Switches: Standard Ethernet switches with PFC enabled; rack-wide or cross-rack logical domains established via standard or custom firmware (Wang et al., 2023, Su et al., 2024).

No changes to application code are required; normal malloc/free and pointer-based dereferencing work transparently, facilitating ease of adoption.

Empirical results demonstrate:

  • CXL-over-Ethernet with cache hit achieves 415 ns round-trip latency (FPGA cache mode); average memory access latency is 1.97 μs uncached.
  • Throughput is sustained at 100 Gbps per device, or up to 1.9 Tb/s for pooled NIC architectures.
  • DRAM-cache and cache-line striping increase effective memory-copy bandwidth (e.g., from 5 GB/s to 30 GB/s in DFabric).
  • DFabric reduces AllReduce communication time (ResNet-50 training) by 28% vs. 100GbE and 15% vs. pure CXL switch fabrics (Zhang et al., 2024).

6. Design Trade-offs, Limitations, and Best Practices

Key design trade-offs include:

  • Switch Silicon Maturity: CXL switching silicon is emerging and less mature than commodity Ethernet, potentially increasing cost and complexity (Zhang et al., 2024).
  • Fabric Scope: Current CXL fabric topologies are restricted to rack-scale; cross-rack switching requires Ethernet as the transport or custom hardware.
  • Bulk Efficiency vs. Latency: Synchronous CXL.mem messaging achieves low latency, but large data movement is more efficient via DMA/start-burst mode. A DRAM cache in the endpoint mitigates this inefficiency but adds area and power cost.
  • Transport and Coherency: Out-of-order Ethernet paths require sophisticated transport (e.g., Multipath TCP, flowlet-aware schedulers). DRAM-cache coherency may degrade for random, small-packet workloads without locality.

Best practices identified in the literature:

  • Size aggregate memory pool bandwidth at least 1.5× higher than the NIC pool to absorb bursty ingress.
  • Employ hierarchical or in-fabric collective patterns to localize allreduce and similar operations, reducing cross-rack congestion (Zhang et al., 2024).
  • Apply endpoint-level flow or MPTCP-based scheduling to mitigate path asymmetry.
  • Configure DRAM-cache for hot-spot accesses, bypassing for bulk transfers.
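
The sizing and cache-bypass guidelines above can be expressed as two small checks. The 1.5× headroom comes from the list; the 4 KiB bulk threshold and the function names are hypothetical values chosen only to make the sketch concrete.

```python
def min_memory_pool_tbps(nic_pool_tbps: float, headroom: float = 1.5) -> float:
    """Smallest aggregate memory-pool bandwidth that meets the headroom rule."""
    if nic_pool_tbps < 0 or headroom < 1.0:
        raise ValueError("invalid arguments")
    return nic_pool_tbps * headroom


def use_dram_cache(transfer_bytes: int, bulk_threshold: int = 4096) -> bool:
    """Cache small (likely hot-spot) accesses; bypass the cache for bulk moves.

    bulk_threshold is an assumed tuning knob, not a value from the literature.
    """
    return transfer_bytes < bulk_threshold


# A 1.9 Tb/s NIC pool calls for roughly 2.85 Tb/s of memory-pool bandwidth.
needed = min_memory_pool_tbps(1.9)
```

In practice the threshold would be tuned against the workload's access-size distribution rather than fixed in advance.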

A plausible implication is that combining fine-grained cache-coherent memory transport with pooled, high-capacity NIC architectures over Ethernet enables scalable, low-latency memory disaggregation, suitable for LLM, DNN training, and analytics workloads beyond traditional server boundaries (Su et al., 2024, Wang et al., 2023, Zhang et al., 2024).
