Memory Disaggregation: Decoupling Memory & Compute
- Memory disaggregation is an architecture that separates compute from memory, exposing DRAM as a shared, network-accessible pool.
- Advances in CXL, RDMA, and FPGA-based fabrics enable low-latency interconnects and effective resource management in these systems.
- MD improves utilization for memory-intensive applications while introducing trade-offs in latency, coherence, and power costs.
Memory disaggregation (MD) denotes the architectural separation of compute and memory resources in datacenters, decoupling DRAM from CPUs and exposing memory as a shared, network-accessible pool. This paradigm responds to persistent challenges in resource fragmentation and under-utilization inherent in monolithic server designs and is increasingly enabled by advances in high-performance interconnects such as CXL, RDMA, and FPGA-based network fabrics. MD delivers the elasticity to dynamically allocate memory independent of CPU provisioning, accommodates memory-intensive applications that exceed local DRAM capacity, and improves overall utilization and resource efficiency. However, it introduces trade-offs in the form of increased access latency, complexity in coherence and consistency, and additional costs in power and infrastructure.
1. Architectural Foundations and System Models
Memory disaggregation is architected through a variety of system fabrics, classified broadly as follows (Wang et al., 26 Mar 2025, Yelam, 2022):
- Single-Node, Single-Pool: Memory expanders (e.g., CXL-attached DIMMs, OpenCAPI) connect one compute node to a single memory blade, typically via a cache-coherent protocol over a PCIe-class link.
- Rack-Scale (Multi-Compute, Multi-Memory): Meshes of compute nodes and memory pools communicate over high-throughput, low-latency networks (e.g., Ethernet, InfiniBand, Gen-Z, CXL fabrics), with local DRAM acting as a cache for the disaggregated memory pool (Puri et al., 2023, Puri et al., 2023). Interconnects may employ standard ToR switches with PFC, custom FPGA packetization, or even in-network logic (Su et al., 13 Nov 2024, Lee et al., 2021).
- Component Blocks: System-level modules include:
- Host CPUs/SoCs with CXL/PCIe interfaces
- Network-attached memory nodes (FPGAs, DDR/NVM banks)
- In-network managers for translation, protection, and scheduling
- Software or hardware global managers for allocation and policy (Puri et al., 2023, Wang et al., 2023)
Typical access flows, as in CXL-over-Ethernet designs, combine PCIe-based CXL endpoints in compute FPGAs, custom packet managers that encapsulate memory requests and responses in Ethernet payloads, and dedicated translation/CAM units on memory-side FPGAs for address mapping and response formulation (Wang et al., 2023).
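As a purely illustrative sketch of this request/response flow (the opcodes, field widths, and helper functions below are assumptions for exposition, not the wire format of (Wang et al., 2023)), a remote read could be encapsulated and serviced roughly as follows:

```python
import struct

# Hypothetical encapsulation of a remote memory read inside an Ethernet payload.
# Assumed layout: opcode (1 B), request id (2 B), remote physical address (8 B),
# length in bytes (2 B).
REQ_HDR = struct.Struct("!BHQH")

OP_READ, OP_WRITE = 0x01, 0x02

def build_read_request(req_id: int, remote_addr: int, length: int) -> bytes:
    """Compute-side packet manager: wrap a CXL/PCIe load into an Ethernet payload."""
    return REQ_HDR.pack(OP_READ, req_id, remote_addr, length)

def handle_request(payload: bytes, translate, dram_read) -> bytes:
    """Memory-side FPGA: translate the address (e.g., via a CAM) and build a response."""
    opcode, req_id, remote_addr, length = REQ_HDR.unpack_from(payload)
    local_addr = translate(remote_addr)      # CAM/TLB lookup on the memory node
    data = dram_read(local_addr, length)     # access the DDR/NVM bank
    return struct.pack("!BH", opcode, req_id) + data

# Example round trip with stub translation and a 64 B cacheline of zeros.
if __name__ == "__main__":
    req = build_read_request(req_id=7, remote_addr=0x1000_0000, length=64)
    resp = handle_request(req, translate=lambda a: a & 0x0FFF_FFFF,
                          dram_read=lambda addr, n: bytes(n))
    print(len(req), len(resp))  # 13-byte request, 3-byte response header + 64 B data
```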
2. Performance, Latency, and Modeling
Critical to MD adoption are end-to-end memory access latencies, throughput ceilings, and their decomposition (Wang et al., 26 Mar 2025, Maruf et al., 2023). Measured results from current hardware primitives include:
| Stack/Mechanism | Remote Read Latency | Throughput |
|---|---|---|
| CXL-over-Ethernet + FPGA (Wang et al., 2023) | 1.97 μs avg / 415 ns (FPGA cache hit) | 100 Gbps/FPGA |
| RDMA over RoCEv2 (Su et al., 13 Nov 2024) | ≈2.03 μs | ≈8 Greq/sec at 25 Gbps |
| EDM/PHY-based Ethernet (Su et al., 13 Nov 2024) | ≈0.30 μs | ≈22 Greq/sec at 25 Gbps |
Latency decompositions in MD systems typically sum terms of the form T_remote ≈ T_serialize + T_protocol + T_network + T_access + T_response, where the terms cover link serialization, protocol logic, network transit, memory access, and response handling (Wang et al., 2023).
Performance is tightly bound to:
- Cache/miss ratios in local DRAM/L3
- Network link serialization & queuing (e.g., 800 ns two-way for MAC+PHY+switch (Wang et al., 2023))
- Queueing in DRAM controllers on memory nodes (modeled as M/M/1 systems (Puri et al., 2023))
- Allocation and policy designs that avoid hot-spotting in both the interconnect and the memory-node queues
Cache-optimized MD architectures demonstrate significant reductions in remote access delays (e.g., 415 ns cache-hit, 1.97 μs miss on FPGA (Wang et al., 2023)), outperforming RDMA-based or unoptimized systems by up to 37%.
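A back-of-the-envelope model, shown below, combines the hit/miss figures above with an M/M/1 approximation of the memory-node DRAM controller (the 90% hit ratio and the arrival/service rates are illustrative assumptions; only the hit and miss latencies come from (Wang et al., 2023)):

```python
# Effective remote-access latency as a weighted sum of hit and miss paths,
# plus M/M/1 queueing delay at the memory-node DRAM controller.
T_HIT_NS, T_MISS_NS = 415.0, 1970.0   # FPGA cache hit / miss (Wang et al., 2023)

def effective_latency_ns(hit_ratio: float, arrival_rate: float, service_rate: float) -> float:
    """hit_ratio: fraction of accesses served by the on-FPGA cache.
    arrival_rate, service_rate: requests per ns at the DRAM controller (lambda, mu)."""
    assert service_rate > arrival_rate, "queue must be stable (rho < 1)"
    # M/M/1 mean sojourn time W = 1 / (mu - lambda), charged only to cache misses.
    queueing_ns = 1.0 / (service_rate - arrival_rate)
    return hit_ratio * T_HIT_NS + (1.0 - hit_ratio) * (T_MISS_NS + queueing_ns)

# Assumed operating point: 90% hit ratio, controller at 50% utilization
# (100 ns service time, one request every 200 ns).
print(effective_latency_ns(hit_ratio=0.90, arrival_rate=0.005, service_rate=0.010))
```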
3. Resource Management and Allocation Policies
Memory management spans global and pool-local policies, addressing not only function (load/store access) but also resource balancing (Puri et al., 2023, Yelam, 2022). Notable points:
- Address Translation: TLBs or content-addressable memory (CAM) in FPGA/ASICs manage high-throughput CXL or RDMA memory translations (Wang et al., 2023).
- Allocation Algorithms: 'Smart-Idle' pool selection (Puri et al., 2023) uses recent access frequencies and allocation counts to choose the least-congested pool (formally: minimize Allocₚ over the subset of m = ⌈log₂ n⌉ pools with the smallest access factors Afₚ); a minimal sketch of this selection appears at the end of this section.
- Local-First (LF) vs. Alternate (LR): LF policies exhaust local DRAM before spilling to remote memory, producing sharp 'latency cliffs'; LR interleaves local and remote pages, flattening the latency ramp and improving tail metrics (Puri et al., 2023).
- Congestion Control: Hardware and software rate limiting (token-buckets, PFC, additive increase/multiplicative decrease) and retransmission protocols (Go-Back-N/Selective-ACK) are deployed in custom packet managers to maximize sustainable link utilization without incurring excessive 'pause' events (Wang et al., 2023).
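As an illustration of the rate-limiting half of this machinery (a minimal software sketch; the rate, burst size, and byte-granular accounting are assumptions, and the packet managers in (Wang et al., 2023) implement the equivalent in hardware logic):

```python
import time

class TokenBucket:
    """Software analogue of a hardware token-bucket rate limiter: a request may
    only be sent to the fabric if enough tokens (bytes of credit) are available."""
    def __init__(self, rate_bytes_per_s: float, burst_bytes: int):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = float(burst_bytes)
        self.last = time.monotonic()

    def try_send(self, request_bytes: int) -> bool:
        now = time.monotonic()
        # Refill credit in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= request_bytes:
            self.tokens -= request_bytes
            return True   # within the sustainable rate: transmit now
        return False      # would exceed the rate: queue or back off (e.g., AIMD)

# Example: cap remote-memory traffic at ~10 GB/s with a 64 KB burst allowance.
bucket = TokenBucket(rate_bytes_per_s=10e9, burst_bytes=64 * 1024)
print(bucket.try_send(4096))  # True while burst credit remains
```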
Empirical results indicate that intelligent pool selection (Smart-Idle) can consistently halve both average and 99th-percentile access latencies (e.g., to 250–350 ns per access, with an >80% reduction in >1 μs tail events (Puri et al., 2023)).
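A minimal sketch of the Smart-Idle selection rule referenced above (the dictionary layout and tie-breaking are assumptions; only the "take the m = ⌈log₂ n⌉ least-accessed pools, then pick the one with the fewest allocations" rule follows (Puri et al., 2023)):

```python
import math

def smart_idle_select(pools: list[dict]) -> int:
    """Pick a memory pool: among the m = ceil(log2(n)) pools with the smallest
    access factor Af, return the index of the one with the fewest allocations."""
    n = len(pools)
    m = max(1, math.ceil(math.log2(n))) if n > 1 else 1
    # Rank all pools by recent access factor (lower = less congested).
    least_accessed = sorted(range(n), key=lambda p: pools[p]["access_factor"])[:m]
    # Among that subset, choose the pool with the minimum allocation count.
    return min(least_accessed, key=lambda p: pools[p]["alloc_count"])

# Example: four pools with assumed access factors and allocation counts.
pools = [
    {"access_factor": 0.80, "alloc_count": 120},
    {"access_factor": 0.10, "alloc_count": 300},
    {"access_factor": 0.15, "alloc_count": 40},
    {"access_factor": 0.55, "alloc_count": 10},
]
print(smart_idle_select(pools))  # -> 2: among the 2 least-accessed pools, fewest allocations
```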
4. Software, Hardware, and Cross-Layer Optimizations
Modern MD systems leverage a full cross-stack integration for both programmability and efficiency (Wang et al., 26 Mar 2025):
- FPGA/SmartNIC Offload: CXL endpoint IPs and soft logic on FPGAs provide full hardware data paths for load/store and protocol management, keeping protocol overhead off the host CPU and supporting in-hardware MESI cache coherence (Wang et al., 2023, Guo et al., 2021).
- On-FPGA Caching: Low-latency, set-associative caches (e.g., 32 KB, 4-way, MESI) with per-set LRU replacement and write-allocate policies deliver <60 ns hit times (Wang et al., 2023); a behavioral model appears at the end of this section.
- Event-based & Cycle-accurate Simulation: Models such as DRackSim (Puri et al., 2023) and detailed hybrid models (Puri et al., 2023) integrate event-based interconnects, cycle-accurate DRAM modeling, and realistic queue/buffer sizes for system design space exploration (e.g., queueing, migration, and hot-page policies).
- Congestion-Aware Scheduling: In switch-based or PHY-based fabrics (EDM (Su et al., 13 Nov 2024)), parallel iterative matching (PIM) algorithms implemented in hardware schedule slot reservations for memory traffic, eliminating queueing delay and maintaining near line-rate utilization even at 90% load.
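A simplified software rendering of one PIM scheduling epoch (in EDM this runs in switch hardware over fixed time slots; the request/grant/accept interface below is an assumption for illustration):

```python
import random

def pim_schedule(requests: dict[int, set[int]], iterations: int = 3) -> dict[int, int]:
    """One epoch of parallel iterative matching (PIM): requests maps each
    compute-node port to the set of memory-node ports it wants. Returns a
    conflict-free matching {compute_port: memory_port} for the next slot."""
    matching: dict[int, int] = {}
    free_inputs = set(requests)
    free_outputs = {m for reqs in requests.values() for m in reqs}
    for _ in range(iterations):
        # Grant phase: each free output grants one requesting free input at random.
        grants: dict[int, list[int]] = {}
        for out in free_outputs:
            candidates = [i for i in free_inputs if out in requests[i]]
            if candidates:
                grants.setdefault(random.choice(candidates), []).append(out)
        # Accept phase: each granted input accepts one granting output at random.
        for inp, outs in grants.items():
            chosen = random.choice(outs)
            matching[inp] = chosen
            free_inputs.discard(inp)
            free_outputs.discard(chosen)
    return matching

# Example: three compute ports contending for two memory ports.
print(pim_schedule({0: {0}, 1: {0, 1}, 2: {1}}))
```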
Performance modeling consistently shows that remote DRAM access is 4–10× slower than local access at cacheline granularity, but this penalty can be largely hidden for workloads with high spatial/temporal locality or when hot/cold working sets are managed in local tiers.
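The on-FPGA cache described in the list above can be modeled in a few lines (a behavioral sketch only: hit/miss and LRU victim selection, with MESI coherence and write-allocate handling omitted; the geometry follows the reported 32 KB, 4-way configuration (Wang et al., 2023), while the 64 B line size is an assumption):

```python
class SetAssociativeCache:
    """Behavioral model of the assumed on-FPGA cache: 32 KB, 4-way set-associative,
    64 B lines, per-set LRU replacement. Only hit/miss and victim selection are modeled."""
    LINE = 64
    WAYS = 4
    SETS = (32 * 1024) // (LINE * WAYS)   # 128 sets

    def __init__(self):
        # Each set is an LRU-ordered list of resident line tags (most recent last).
        self.sets = [[] for _ in range(self.SETS)]

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss, fill the line, evicting the LRU way."""
        line_addr = addr // self.LINE
        index = line_addr % self.SETS
        tag = line_addr // self.SETS
        ways = self.sets[index]
        if tag in ways:
            ways.remove(tag)
            ways.append(tag)              # promote to most-recently-used
            return True
        if len(ways) == self.WAYS:
            ways.pop(0)                   # evict the least-recently-used way
        ways.append(tag)
        return False

cache = SetAssociativeCache()
print(cache.access(0x1000), cache.access(0x1000))   # miss then hit: False True
```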
5. Security, Isolation, and Practical Deployment
Security in MD systems requires both fine-grained protection and performance-aware validation (Heo et al., 2021):
- Access Control: Hardware-enforced per-page permission tables (e.g., in FPGA HBM) replace RDMA's region-level rkey facilities, preventing kernel or hypervisor attacks from escalating privileges or leaking data across tenants (Heo et al., 2021).
- Confidentiality & Integrity: AES-GCM encryption is implemented CPU-side, with keys managed by trusted hardware, ensuring that even if physical DRAM or host software is compromised, memory content remains unreadable and unmodifiable (Heo et al., 2021); see the sketch after this list.
- Oblivious Access: Randomized page remapping, performed in hardware (FPGA-level randomization) on every swap-out/in, breaks static access-pattern leakage, providing a weaker form of obliviousness without full ORAM-level guarantees.
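As an illustration of the CPU-side sealing step (a minimal sketch using Python's cryptography package; the key handling, nonce policy, page size, and address binding are assumptions, not the hardware design in (Heo et al., 2021)):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

PAGE_SIZE = 4096  # assumed 4 KB swap granularity

def seal_page(key: bytes, page_addr: int, plaintext: bytes) -> tuple[bytes, bytes]:
    """Encrypt-and-authenticate a page before it leaves the trusted CPU side.
    The page address is bound as associated data so a relocated or replayed
    ciphertext fails authentication on swap-in."""
    nonce = os.urandom(12)                       # fresh 96-bit nonce per swap-out
    aad = page_addr.to_bytes(8, "little")
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, aad)
    return nonce, ciphertext                     # stored alongside the remote page

def unseal_page(key: bytes, page_addr: int, nonce: bytes, ciphertext: bytes) -> bytes:
    """Decrypt on swap-in; raises InvalidTag if the remote side tampered with the page."""
    return AESGCM(key).decrypt(nonce, ciphertext, page_addr.to_bytes(8, "little"))

key = AESGCM.generate_key(bit_length=256)        # in practice held by trusted hardware
nonce, ct = seal_page(key, 0x7f00_0000, b"\x00" * PAGE_SIZE)
assert unseal_page(key, 0x7f00_0000, nonce, ct) == b"\x00" * PAGE_SIZE
```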
Overheads for fine-grained secure MD remain modest (≈10–20% in bandwidth or latency), and reported multi-tenant scaling and hardware resource usage (LUT/BRAM) leave headroom for future extensions (Heo et al., 2021).
6. Comparative Analysis, Impact, and Open Problems
Memory disaggregation is shown to deliver substantial improvements over fixed, local DRAM allocation in both utilization and average performance (Wang et al., 2023, Puri et al., 2023, Puri et al., 2023). Key findings include:
- Latency reductions of more than 30% compared to an RDMA-disaggregation baseline, and median sub-microsecond (415 ns) access latency for cache hits (Wang et al., 2023).
- Tail-latency control via optimized allocation and proactive congestion-management policies (Puri et al., 2023).
- Scalability to multiple racks and network fabrics, contingent on FPGA resource sizing and switching fabric support (Wang et al., 2023, Su et al., 13 Nov 2024).
- Trade-offs in cache size, aggressiveness of allocation/pooling, and hardware complexity, with portable FPGA-based deployment favored for rapid system adoption (Wang et al., 2023).
Open questions remain in integrating MD into commodity CXL host ASICs, generalizing hardware prefetch to minimize cache-miss penalties, co-designing network and memory management for larger topologies (e.g., CXL 3.0’s multi-root support (Wang et al., 2023)), and extending security primitives for denial-of-service resilience and stronger access pattern hiding (Heo et al., 2021).
7. Future Directions and Research Challenges
Research advances in MD target lower latency fabrics (e.g., PHY-layer scheduling (Su et al., 13 Nov 2024)), more intelligent tiering and hot/cold management (Puri et al., 2023), improved policy for allocation and migration under contention (Puri et al., 2023), and expanded support for cross-tenant multitenancy and isolation. Integration with composable data center architectures, further hardware/software co-design (e.g., PIM, in-network management), and adaptive, real-time policy frameworks will likely mature MD platforms for mainstream, large-scale deployments. As design, deployment, and operational experience accumulate, the interplay between hardware scalability, network architecture, and resource policy will shape the next generation of elastic, efficient datacenter memory systems.