Memory-Compute Separation (MCSep)

Updated 18 January 2026
  • Memory-Compute Separation (MCSep) is an architectural principle that decouples compute and memory resources to enable independent scaling and efficient resource utilization across data centers.
  • MCSep employs high-speed RDMA, programmable switches, and advanced coherence protocols to minimize latency and enhance overall system performance.
  • By separating compute and memory, MCSep facilitates flexible resource provisioning and cost savings, addressing underutilization in traditional monolithic architectures.

Memory-Compute Separation (MCSep) is an architectural principle in which computational tasks and memory/storage resources are physically and/or logically decoupled, enabling independent provisioning, management, and scaling of compute and memory across a data center or distributed system. Historically, this separation has been motivated by the need to address resource underutilization, reduce total cost of ownership (TCO), and overcome performance bottlenecks induced by fixed ratios of CPU to DRAM in monolithic server architectures. Modern MCSep designs employ high-speed RDMA networks, programmable switches, near-memory compute primitives, and software or hardware-managed coherence techniques to achieve high utilization, sub-10 μs tail latencies, and flexible memory pooling at datacenter scale. The emergence of high-speed coherent fabrics (e.g., CXL, InfiniBand) and approaches such as Active Data Objects (ADO), in-network memory management, and compute-side cache coherence extend the concept of MCSep from mere memory disaggregation to tightly coupled, high-performance, shared-memory or key-value systems, even under durability and persistence requirements [2104.06225, 2305.03943, 2107.00164, 2409.02088].

1. Fundamental Principles and Taxonomy of MCSep

The foundational aspect of MCSep is the architectural decoupling of “compute nodes” (CPU sockets) from “memory nodes” (DRAM modules or persistent memory devices), treating all memory within a cluster or rack as a single logical resource pool. Any CPU can then allocate from a combined address space spanning local DRAM (typical access latency $L_{\mathrm{loc}}\approx 80\,\mathrm{ns}$) and remote DRAM, whose access latency decomposes as $L_{\mathrm{remote}} = L_{\mathrm{network}} + L_{\mathrm{DRAM\,access}}$, with $L_{\mathrm{network}}\approx 2$–$5\,\mu$s over RDMA and $50$–$100\,\mathrm{ns}$ over CXL. The achievable bandwidth to remote memory is constrained by the slower of the network and the DRAM channel, $B_{\mathrm{remote}} = \min(B_{\mathrm{network}}, B_{\mathrm{DRAM\,channel}})$.
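
As an illustrative calculation using representative midpoints of the ranges above (not measurements of any particular system): $L_{\mathrm{remote}}\approx 3\,\mu\mathrm{s} + 80\,\mathrm{ns}\approx 3.1\,\mu\mathrm{s}\approx 39\,L_{\mathrm{loc}}$ over RDMA, versus $\approx 75\,\mathrm{ns} + 80\,\mathrm{ns}\approx 155\,\mathrm{ns}\approx 1.9\,L_{\mathrm{loc}}$ over CXL. On the bandwidth side, a 200 Gb/s (25 GB/s) link feeding a ≈20 GB/s DRAM channel gives $B_{\mathrm{remote}} = \min(25, 20) = 20$ GB/s, i.e., the DRAM channel rather than the fabric is the nominal bound, although per-connection software overheads can keep achieved RDMA throughput far lower (Section 4).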

MCSep systems can be realized along several axes:
- Disaggregated Memory: Local compute nodes consume memory resources provisioned elsewhere via fast networks [2305.03943].
- Near-Memory Compute: Compute logic is pushed down into the memory domain, as in MCAS/ADO [2104.06225].
- In-Network Memory Management: Programmable switches manage global page tables, permissions, and cache coherence, enabling rack-scale shared memory [2107.00164].
- Compute-Driven Coherence: Coherence protocol state and bookkeeping reside entirely on the compute tier, freeing memory nodes from any per-line processing, as in SELCC [2409.02088].

A typical MCSep deployment achieves TCO savings when memory underutilization $(1 - U_{\mathrm{loc}})$ offsets the additional network and management costs. Savings of 15–25% in net TCO are projected at cluster utilizations $U_{\mathrm{loc}}\approx0.4$–$0.6$ [2305.03943].
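
One way to make this break-even condition explicit (the cost terms here are introduced for illustration and are not taken from the cited survey): let $c_{\mathrm{DRAM}}$ denote the amortized cost of provisioned DRAM capacity and $c_{\mathrm{net}} + c_{\mathrm{mgmt}}$ the added fabric and management cost of pooling it; disaggregation is then net-positive roughly when $(1 - U_{\mathrm{loc}})\,c_{\mathrm{DRAM}} > c_{\mathrm{net}} + c_{\mathrm{mgmt}}$, i.e., when the spend recovered by pooling otherwise-stranded capacity exceeds the overhead. The projected 15–25% savings correspond to regimes in which 40–60% of provisioned DRAM would otherwise sit idle.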

2. Architectural Realizations and System Designs

Contemporary MCSep architectures draw from several models:
- MCAS (Memory Centric Active Storage): Sharded RDMA-attached key-value store where all value manipulation, including pointer-based data structures, occurs via ADOs directly in persistent memory. Each MCAS shard maintains a hopscotch hash-table index in persistent memory, and plugin ADO processes handle user logic in isolation by memory-mapping persistent-memory pools (a plugin sketch follows this list) [2104.06225].
- Memory Disaggregation via RDMA/CXL: Compute-only blades interconnected with memory-only blades over high-speed networks (50–200 Gb/s, 2–5 μs RDMA, up to 256 GB/s CXL), forming a physically or logically unified pool. The software stack includes kernel extensions for page-fault handling (Infiniswap), remote memory block exports, and advanced resilience and isolation layers (Hydra, Justitia, Aequitas) [2305.03943].
- MIND (In-Network Memory Management): Employs programmable switches as line-rate in-network MMUs, implementing global page tables, address translation, range-based protections, and directory-based MSI protocol for coherence. Compute blades operate with local DRAM caches and forward misses to memory blades through the switch, with all metadata coherence handled centrally in switch SRAM/TCAM [2107.00164].
- SELCC (Shared-Exclusive Latch Cache Coherence): Embeds cache-ownership metadata per line in 64b latch-words in remote DRAM, manipulated atomically via RDMA. All coherence transitions (invalidations, upgrades, demotions) are handled by compute nodes, and remote DRAM remains “dumb” [2409.02088].
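
The push-down pattern can be made concrete with a minimal sketch of an ADO-style plugin callback that mutates a memory-mapped persistent value in place and then makes the update durable with explicit flush and fence primitives. The ado_request_t descriptor, the ado_append callback, and the pm_persist helper are hypothetical names introduced for illustration; this is not the MCAS plugin API, and the stack buffer in main merely stands in for a persistent-memory mapping.

```c
/* Minimal sketch of an ADO-style push-down update (hypothetical interface,
 * not the actual MCAS plugin API).  The pattern: the plugin receives a
 * pointer into a memory-mapped persistent-memory pool, modifies the value
 * in place, then flushes and fences so the update is durable before the
 * shard acknowledges the client. */
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    void       *value;        /* value region mapped from the PM pool      */
    size_t      value_len;
    const void *request;      /* opaque, client-marshalled request payload */
    size_t      request_len;
} ado_request_t;              /* hypothetical request descriptor */

/* Flush a byte range from the CPU caches, then order it before later stores. */
static void pm_persist(const void *addr, size_t len)
{
    const char *p   = (const char *)((uintptr_t)addr & ~(uintptr_t)63);
    const char *end = (const char *)addr + len;
    for (; p < end; p += 64)
        _mm_clflush(p);
    _mm_sfence();
}

/* Example plugin logic: append the request bytes to a log-structured value.
 * Ordering: persist the appended data first, then the 64-bit header update,
 * so a crash never exposes a header that points past durable data. */
static int ado_append(ado_request_t *req)
{
    uint64_t *used = (uint64_t *)req->value;        /* header: bytes appended so far */
    char     *log  = (char *)req->value + sizeof *used;

    if (*used + req->request_len > req->value_len - sizeof *used)
        return -1;                                  /* value region is full */

    memcpy(log + *used, req->request, req->request_len);
    pm_persist(log + *used, req->request_len);

    *used += req->request_len;                      /* atomic-width 64-bit update */
    pm_persist(used, sizeof *used);
    return 0;
}

int main(void)
{
    static _Alignas(64) char pool[256];             /* stand-in for a PM mapping */
    ado_request_t req = { pool, sizeof pool, "hello", 5 };
    if (ado_append(&req) == 0)
        printf("appended, header now %llu bytes\n",
               (unsigned long long)*(uint64_t *)pool);
    return 0;
}
```

The ordering discipline (persist the appended data before the 64-bit header that makes it visible) is one simple way to obtain crash consistency; MCAS plugins may equally rely on undo logging, as noted in Section 3.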

A synthesized layered view across these approaches is as follows:

| Layer | Example Systems | Key Functionality |
|---|---|---|
| Application/Client | MCAS, MIND, SELCC | Key-value, shared-memory, or transactional workloads |
| Compute Node | MCAS, MIND, SELCC | Local cache, RDMA, plugin execution, coherence |
| Network/Fabric | MIND | Programmable switch, line-rate directory, MMU, MSI |
| Memory Node | MCAS, MIND, SELCC | DRAM/Optane PM, zero logic, RDMA registrations |
| Persistent Storage | MCAS | Write-ahead logs, undo/redo logs for crash consistency |

3. Coherence, Consistency, and Fault-Tolerance Strategies

MCSep introduces unique consistency and coherence challenges due to its physical separation and potential for cross-node sharing:
- Cache Coherence: Techniques range from software-managed (Leap) and hardware-managed (CXL, in-switch directories in MIND) coherence protocols to fully compute-driven approaches (SELCC), which use per-line 64b latch words and per-node spin mutexes to maintain MSI semantics and sequential consistency (a latch-word sketch follows this list). Scalability limits (e.g., SELCC's reader bitmap accommodating ≈56 compute nodes) motivate sharded or hierarchical directories at larger scales [2409.02088].
- Consistency Guarantees: MCAS and SELCC guarantee crash consistency and sequential consistency, respectively. In MCAS, crash consistency is achieved by plugin-controlled undo-logging and the use of atomic 64b updates and explicit flush/barrier primitives [2104.06225]. SELCC’s eager invalidation and local spin synchronization ensure global write ordering [2409.02088].
- Replication and Fault Tolerance: Synchronous replication is enforced in MCAS, wherein updates are acknowledged only after all replica shards confirm the operation. Hydra introduces erasure-coding and CodingSets for multi-machine fault domains [2305.03943].
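
To make the compute-driven protocol state concrete, below is a minimal sketch of shared/exclusive acquisition on a per-line 64-bit latch word carrying an exclusive flag, an owner field, and a reader bitmap. The bit layout, field widths, and function names are assumptions for illustration rather than SELCC's actual encoding, and a C11 atomic compare-and-swap stands in for the one-sided RDMA CAS that a compute node would issue against the latch word in remote DRAM.

```c
/* Sketch of compute-driven shared/exclusive latching on a 64-bit word.
 * Assumed layout (illustrative, not SELCC's actual encoding):
 *   bit 63      exclusive flag
 *   bits 56-62  exclusive owner id
 *   bits 0-55   reader bitmap, one bit per compute node (node_id < 56)
 * A C11 atomic CAS stands in for the one-sided RDMA compare-and-swap a
 * compute node would issue against the latch word in remote DRAM. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define X_FLAG      (1ULL << 63)
#define OWNER_SHIFT 56

typedef _Atomic uint64_t latch_word_t;

/* Shared acquire: add our reader bit, but only while no writer holds the line. */
static bool s_lock(latch_word_t *latch, unsigned node_id)
{
    uint64_t old = atomic_load(latch);
    while (!(old & X_FLAG)) {
        uint64_t want = old | (1ULL << node_id);
        if (atomic_compare_exchange_weak(latch, &old, want))
            return true;          /* CAS succeeded; we are registered as a sharer */
    }
    return false;                 /* writer present: caller retries after invalidation */
}

/* Exclusive acquire: succeeds only when the line has no readers and no writer. */
static bool x_lock(latch_word_t *latch, unsigned node_id)
{
    uint64_t expected = 0;
    uint64_t desired  = X_FLAG | ((uint64_t)node_id << OWNER_SHIFT);
    return atomic_compare_exchange_strong(latch, &expected, desired);
}

static void s_unlock(latch_word_t *latch, unsigned node_id)
{
    atomic_fetch_and(latch, ~(1ULL << node_id));   /* clear our reader bit */
}

static void x_unlock(latch_word_t *latch)
{
    atomic_store(latch, 0);                        /* drop flag and owner id */
}

int main(void)
{
    latch_word_t latch = 0;
    printf("reader 3 acquires: %d\n", s_lock(&latch, 3));
    printf("writer 1 acquires: %d (blocked by the reader)\n", x_lock(&latch, 1));
    s_unlock(&latch, 3);
    printf("writer 1 acquires: %d\n", x_lock(&latch, 1));
    x_unlock(&latch);
    return 0;
}
```

In a real deployment each transition would also move data (fetch on acquire, write-back or invalidation on release), all of it driven from the compute tier so that the memory node performs no per-line processing.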

In MIND, the network switch tracks all cacheable regions, maintains sharer bitmaps, and issues synchronized invalidations, supporting rack-scale coherence and region resizing (bounded splitting) for directory efficiency [2107.00164]. Performance degradation due to coherence storms is mitigated by parallel multicast invalidations and two-phase transitions.
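
For intuition about range-based directory state, the following host-side sketch models a coarse-grained directory entry together with a splitting decision driven by false invalidations. The field names, the 0.25 threshold, and the halving policy are illustrative assumptions; MIND's actual entries live in switch TCAM/SRAM and its bounded-splitting policy is evaluated as match-action stages at line rate.

```c
/* Sketch of a range-based coherence-directory entry of the kind MIND keeps
 * in switch TCAM/SRAM, expressed as host-side C for illustration only. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum coherence_state { STATE_I, STATE_S, STATE_M };

typedef struct {
    uint64_t base;              /* start of the virtual-address range          */
    uint64_t len;               /* range length (power of two for TCAM match)  */
    enum coherence_state state; /* MSI state tracked for the whole range       */
    uint64_t sharers;           /* bitmap of compute blades caching the range  */
    uint64_t invalidations;     /* invalidations issued for this range         */
    uint64_t false_invals;      /* invalidations sent to blades that never
                                   touched the faulting sub-range              */
} dir_entry_t;

/* Split when too many invalidations hit blades that did not actually share
 * the accessed sub-range; 0.25 is an arbitrary illustrative threshold. */
static bool should_split(const dir_entry_t *e)
{
    return e->invalidations >= 64 &&
           (double)e->false_invals / (double)e->invalidations > 0.25;
}

/* Halve the range; both halves conservatively inherit the sharer set. */
static void split(const dir_entry_t *e, dir_entry_t out[2])
{
    out[0] = *e;  out[0].len  = e->len / 2;
    out[1] = *e;  out[1].base = e->base + e->len / 2;  out[1].len = e->len / 2;
    out[0].invalidations = out[0].false_invals = 0;
    out[1].invalidations = out[1].false_invals = 0;
}

int main(void)
{
    dir_entry_t e = { 0x100000, 1 << 20, STATE_S, 0x5, 128, 40 };
    if (should_split(&e)) {
        dir_entry_t halves[2];
        split(&e, halves);
        printf("split into [%#llx, +%llu) and [%#llx, +%llu)\n",
               (unsigned long long)halves[0].base, (unsigned long long)halves[0].len,
               (unsigned long long)halves[1].base, (unsigned long long)halves[1].len);
    }
    return 0;
}
```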

4. Performance, Latency, and Memory Movement Analysis

MCSep’s principal performance argument derives from minimizing host-DRAM round-trips and eliminating network round-trips for data that can be operated on in-memory or near-memory:
- Host Data Movement Reduction: MCAS eliminates nearly all host-side load/store traffic relative to thick-client architectures. In a Continuous Data Protection (CDP) use case, it achieves up to 4.92M updates/s with $P_{99.7}$ latency ≈10 μs and mean latency 6.7 μs; ADO throughput is 43% higher than client-side merges ($U_\mathrm{ADO}(n)/U_\mathrm{plain}(n)\approx1.43$) [2104.06225].
- Network Latency and Bandwidth: Memory disaggregation over RDMA introduces ~5–20 μs remote access latency and up to ~1 GB/s throughput per connection (local DRAM ≳20 GB/s/channel) [2305.03943]. Leap’s prefetching reduces effective remote hit latency to ~0.8 μs by raising prefetch hit ratios above 80%.
- In-Network Coherence Latency: In MIND, end-to-end latency for remote access is modeled as $L(v)\approx L_{net}+L_{MMU}+L_{coh}+L_{TLB}+L_{queue}$ where $L_{net}\approx7\,\mu$s. Aggregate throughput of 1.1M IOPS (8 blades, 1 thread per blade, $L_{avg}\approx9\,\mu$s) is demonstrated for shared read [2107.00164].
- SELCC Microbenchmarks: SELCC achieves 2.5M ops/s, 3.4–3.6× over competitors, and >80% cache hit rates in skewed patterns [2409.02088].

A general slowdown model is $L_{\mathrm{avg}} = (1-f)\,L_{\mathrm{loc}} + f\,L_{\mathrm{remote}}$, where $f$ is the fraction of memory accesses served from remote memory. For latency-sensitive workloads, keeping $f$ below a system-dependent threshold preserves performance within 1.5× of all-local memory [2305.03943].
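
As an illustrative instance of this model (using the representative figures quoted above rather than results from the cited systems): with $L_{\mathrm{loc}} = 80\,\mathrm{ns}$ and $L_{\mathrm{remote}} = 3\,\mu\mathrm{s}$ over raw RDMA, staying within 1.5× of all-local latency requires $(1-f)\cdot 80 + f\cdot 3000 \le 120$, i.e., $f \le 40/2920 \approx 1.4\%$; with an effective remote latency of 0.8 μs after Leap-style prefetching, the bound relaxes to $f \lesssim 5.6\%$. This is why high local cache and prefetch hit rates, rather than raw fabric speed alone, determine how much memory can be moved off-host without violating latency targets.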

5. Software, APIs, and Methodological Evolution

MCSep systems have developed a software stack and APIs aligned with the layered architecture:
- Plugin/Push-Down Compute: MCAS Active Data Objects are shared-object plugins, invoked via opaque request marshalling (e.g., status_t invoke_ado(...)). Plugins mmap value buffers, modify in place, and can allocate/free persistent memory pools with crash-consistency enforced at the plugin [2104.06225].
- SELCC Application API: Disaggregated memory lines are manipulated through Allocate, Free, SELCC_SLock, SELCC_XLock, and Atomic calls, with latches acquired via atomic RDMA operations and local mutexes, as summarized in the table below (a usage sketch follows this list) [2409.02088].

| API | Input | Semantics |
|---|---|---|
| Allocate() | – | Allocate a GCL in remote memory |
| Free() | gaddr | Free the GCL |
| SLock() | gaddr | Acquire shared latch, return local ptr |
| XLock() | gaddr | Acquire exclusive latch, return local ptr |
| SUnlock() | handle | Release local latch (global latch is released lazily) |
| XUnlock() | handle | Release local latch (global latch is released lazily) |
| Atomic() | gaddr, f, args | One-sided RDMA atomic (FAA or CAS) |
- Infiniswap and Leap: Early page-fault and VMM integration for remote memory, offering block-device exports and on-demand or prefetch-based page migration [2305.03943].
- Directory Management: MIND employs range partitioning in TCAM for efficient directory and protection management, dynamically splitting regions based on false-invalidation rates to maintain $O(N\log M)$ directory entry scaling [2107.00164].
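
To illustrate the programming model implied by the table, the following is a minimal usage sketch. The gaddr_t and handle_t types, the exact signatures, and the malloc-backed stubs are assumptions that merely mirror the call names and inputs listed above; SELCC's real runtime returns latched local cache copies and issues one-sided RDMA reads, writes, and atomics rather than operating on local heap memory.

```c
/* Usage sketch for a SELCC-style disaggregated-memory API.  The types,
 * signatures, and the local malloc-backed stubs are hypothetical: they
 * mirror the call names and inputs in the table above so the access
 * pattern can be shown end to end, but they are not SELCC's interface. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t gaddr_t;                 /* global address of a GCL        */
typedef struct { gaddr_t g; void *local; } handle_t;

#define GCL_SIZE 256                      /* assumed line granularity        */

/* --- stubs standing in for the disaggregated runtime --- */
static gaddr_t  Allocate(void)      { return (gaddr_t)(uintptr_t)calloc(1, GCL_SIZE); }
static void     Free(gaddr_t g)     { free((void *)(uintptr_t)g); }
static handle_t SLock(gaddr_t g)    { handle_t h = { g, (void *)(uintptr_t)g }; return h; } /* shared latch + fetch    */
static handle_t XLock(gaddr_t g)    { handle_t h = { g, (void *)(uintptr_t)g }; return h; } /* exclusive latch + fetch */
static void     SUnlock(handle_t h) { (void)h; }  /* local release; global latch released lazily        */
static void     XUnlock(handle_t h) { (void)h; }  /* local release; write-back deferred until invalidation */

int main(void)
{
    gaddr_t g = Allocate();               /* one GCL in (simulated) remote memory */

    handle_t w = XLock(g);                /* exclusive access: mutate through the local pointer */
    strcpy((char *)w.local, "hello from a cached line");
    XUnlock(w);

    handle_t r = SLock(g);                /* shared access from the same (or another) node */
    printf("%s\n", (char *)r.local);
    SUnlock(r);

    Free(g);
    return 0;
}
```

The lazy global release noted in the table lets repeated accesses from the same node be served from the locally cached copy without an additional RDMA round trip.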

The interface trajectory has progressed from raw block or slab exports, through VMM-integrated paging, to byte-addressable, programmable coherence-capable APIs, and runtime-optimized interfaces for graph/ML frameworks.

6. Trade-offs, Limitations, and Open Research Areas

Adoption of MCSep imposes nontrivial engineering and theoretical trade-offs:
- Programming and Debugging Complexity: MCAS plugin authors are responsible for explicit crash-consistency primitives (flush, fence, undo-log). Debugging PM data structures outside of DRAM adds complexity [2104.06225].
- Resource Contention and Scalability: SELCC’s 56-bit reader bitmap constrains node scale; hierarchical or sharded directories are needed for very large deployments. RDMA atomic operation bottlenecks may emerge at extremely high contention [2409.02088].
- Coherence Bottlenecks: Strict home-node or single-root directories (MIND, SELCC) do not scale beyond rack-level due to TCAM/SRAM limits. Directory sharding, region-based coherence, or weakening of consistency (e.g., PSO/RMO modes) are feasible mitigations [2107.00164, 2305.03943].
- QoS and Isolation: Per-tenant admission control and end-to-end scheduling are required, with approaches like Justitia and Aequitas exemplifying hardware- and switch-mediated techniques [2305.03943].
- Security and Fault Domains: Memory side-channel risks, integrity, and confidentiality can be managed via SGX enclaves, attestation, erasure coding (Hydra), and placement-aware coding sets.

The future trajectory in MCSep architecture includes CXL-based coherent address spaces, dynamic granularity memory access (switching between 4 KB and cache-line accesses), hardware-managed in-line prefetching and erasure coding in switches, and programmable coherence logic on SmartNICs [2305.03943].

7. Comparative Evaluation and Key Results

Empirical evaluations consistently demonstrate the benefits of MCSep for high-utilization, low-latency, and fault-tolerant operation:
- MCAS vs. Baseline: In CDP workloads, 43% higher throughput and sub-10 μs latencies are attributed to push-down ADO compute and zero DRAM data shuffling [2104.06225].
- Disaggregated-memory microbenchmarks: SELCC achieves 2.5M ops/s, outperforming centralized protocols by 3.6×, and caches >80% of reads under realistic access distributions [2409.02088].
- In-network management (MIND): 1.1M aggregate IOPS and a 59× TensorFlow speedup on 8 blades, with a minimal directory and translation storage footprint. Under 100% writes, invalidation traffic limits efficiency, motivating consistency-model relaxation or region splitting [2107.00164].
- Cluster-wide TCO and utilization: SymbioticLab’s experiments demonstrate that 15–25% TCO reduction is attainable as idle DRAM can be globally repurposed. Proactive prefetching and erasure coding further close the performance gap with monolithic server designs [2305.03943].

In summary, MCSep advances resource utilization, performance, and isolation in contemporary data centers through principled decoupling of memory and compute, enabled by a spectrum of architectural and protocol innovations in both hardware and software. Continued advances in interconnect technology, coherence management, and resource abstraction are expected to extend these benefits to larger and more heterogeneous infrastructures.
