Centralized Memory Systems Overview
- Centralized memory systems are unified architectures where multiple compute agents share one large memory pool, reducing data movement overhead and NUMA effects.
- They leverage high-bandwidth interconnects and memory-node compositions (e.g., NVLink rings, crossbars) to achieve significant speedups and scalable performance for AI and multi-agent applications.
- Unified memory management protocols simplify programming by coordinating coherent access and dynamic page migration across heterogeneous devices.
A centralized memory system is a memory architecture in which heterogeneous or homogeneous compute agents (CPUs, GPUs, accelerators, or even distributed agents) share a logically unified and centrally managed memory pool, typically exposed as a single physical or virtual address space. Centralized memory systems can manifest as hardware (shared bus, crossbar, high-radix switch, device-side “memory-nodes”), OS-level constructs (unified virtual memory, global mapping), or distributed software protocols (centralized caching, AI agent memory), but all share the property that multiple agents can directly access and coordinate through a common memory substrate.
1. Architectural Principles of Centralized Memory Systems
Centralized memory systems seek to remove or minimize data movement overhead, NUMA effects, and fragmentation by allowing multiple compute agents to address and operate on a single large memory pool with uniform or near-uniform access semantics. High-end implementations, such as the Memory-Centric Deep Learning Architecture (MC-DLA), replace the traditional device-centric model—where each accelerator is bound to local memory and bridges to host DRAM via bandwidth-limited PCIe—with a topology in which capacity-optimized "memory-nodes" are directly integrated into the same high-bandwidth, low-latency fabric as the accelerators themselves (e.g., NVLink rings). This exposes both local device memory and remote memory modules under a unified address space, managed transparently by the device driver and runtime (Kwon et al., 2019).
The MGPU-TSM architecture builds upon this by physically sharing all HBM stacks among multiple GPUs via a fast switch-based crossbar, resolving both coherency and address translation at the interconnect level and eliminating the need for replication or RDMA transactions (Mojumder et al., 2020). Centralized systems may also be constructed in the CPU domain using commodity DDR DIMMs connected via shared memory controllers and inter-socket links (UPI, Infinity Fabric), with all sockets observing a flat DRAM address pool (Liu et al., 28 Aug 2025).
In tightly coupled many-core chips, centralized L1/L2 memory (e.g., MemPool's 1 MiB shared L1 SPM) is accessed by all cores through physically optimized interconnects, maintaining uniform latency and providing a global scratchpad visible in every core's address space (Cavalcante et al., 2020).
2. Memory Node Composition, Interconnects, and Topologies
Centralized systems are engineered to optimize capacity, bandwidth, and programmability by decoupling memory expansion from host-centric devices. MC-DLA’s "memory-node" comprises a protocol-compatible engine for device-side links (e.g., NVLink), a DMA engine for high-speed transfers, and a collection of commodity DIMMs managed as a rank under a robust memory controller. Devices and memory-nodes are woven into a ring-based interconnect, with each device linked to two adjacent memory-nodes; with N link ports per device (typical N = 6, each 25 GB/s), total remote bandwidth is N × B (e.g., 6 × 25 GB/s = 150 GB/s per device), far exceeding PCIe Gen3 (Kwon et al., 2019).
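The bandwidth figure above is simple arithmetic, and a short calculation makes the comparison with PCIe Gen3 concrete. The sketch below takes N = 6 and B = 25 GB/s from the text; the PCIe Gen3 x16 figure (8 GT/s per lane, 128b/130b encoding, one direction) is an assumed comparison point that the source does not spell out.

```python
# Figures from the text: N = 6 link ports per device, B = 25 GB/s per link.
LINKS_PER_DEVICE = 6
GB_PER_S_PER_LINK = 25.0


def pcie_gen3_x16_gb_per_s() -> float:
    """Per-direction PCIe Gen3 x16 bandwidth in GB/s (assumed comparison point)."""
    gigatransfers_per_s = 8.0   # 8 GT/s per lane
    lanes = 16
    encoding = 128 / 130        # 128b/130b line coding overhead
    return gigatransfers_per_s * lanes * encoding / 8  # bits -> bytes


remote_bw = LINKS_PER_DEVICE * GB_PER_S_PER_LINK
pcie_bw = pcie_gen3_x16_gb_per_s()
ratio = remote_bw / pcie_bw

print(f"remote bandwidth per device: {remote_bw:.0f} GB/s")
print(f"PCIe Gen3 x16, one direction: {pcie_bw:.2f} GB/s")
print(f"ratio: {ratio:.1f}x")
```

Under these assumptions the ring provides roughly an order of magnitude more remote bandwidth per device than a single PCIe Gen3 x16 link.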
MGPU-TSM adopts a centralized switch topology: all on-die GPU L2 banks and DRAM banks connect to a high-radix crossbar (e.g., 32 ports, ≈1 TB/s peak bidirectional), guaranteeing uniform access to all DRAM and providing high aggregate bandwidth. This design removes NUMA disparities and non-uniform latency between “local” and “remote” memory (Mojumder et al., 2020).
In CPU-dense racks, shared DRAM is allocated across memory controllers and exposed via DRAM interconnects at the PCB level, but as system size scales, latency, per-bit energy, and bandwidth become bottlenecked by trace length, signaling rate, and pin count (Liu et al., 28 Aug 2025). Compute-memory nodes using 2.5D/3D integration collapse memory distances to µm-scale, supporting ultra-high bandwidth and minimal energy per bit compared to mm/cm-scale topologies.
3. Unified Memory Management and Coordination
Modern centralized systems frequently enforce a unified memory management protocol at the OS or runtime level. The GMEM architecture provides a high-level virtualization framework where both CPUs and devices (GPUs, network cards) attach to a single process virtual address space (gm_as), with all translation (VA allocation, logical/physical mapping, TLB invalidation) centrally orchestrated by the OS. Devices supply only minimal MMU-specific hooks, resulting in a coherent, shared VA space and enabling automatic page migration or remote access (Zhu et al., 2023).
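The idea of one centrally orchestrated address space with automatic migration can be illustrated with a toy model. This is a minimal sketch, not GMEM's actual interface: the class `SharedAddressSpace`, its methods, the 4 KiB page size, and the single-owner migrate-on-access policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size for this sketch


@dataclass
class SharedAddressSpace:
    """Toy model of one virtual address space shared by several agents."""
    next_vpn: int = 0
    owner: dict = field(default_factory=dict)     # vpn -> agent holding the page
    mappings: dict = field(default_factory=dict)  # agent -> set of mapped vpns
    migrations: int = 0

    def attach(self, agent: str) -> None:
        # An agent registers once and then shares the same virtual addresses.
        self.mappings.setdefault(agent, set())

    def alloc(self, agent: str, nbytes: int) -> int:
        # The central allocator hands out pages; the allocating agent owns them.
        npages = -(-nbytes // PAGE_SIZE)  # ceiling division
        base = self.next_vpn
        for vpn in range(base, base + npages):
            self.owner[vpn] = agent
            self.mappings[agent].add(vpn)
        self.next_vpn += npages
        return base * PAGE_SIZE

    def access(self, agent: str, vaddr: int) -> str:
        # Resolve an access, migrating the page if another agent holds it.
        vpn = vaddr // PAGE_SIZE
        if vpn not in self.owner:
            raise MemoryError(f"unmapped address {vaddr:#x}")
        if vpn in self.mappings[agent]:
            return "hit"
        previous = self.owner[vpn]
        self.mappings[previous].discard(vpn)  # stands in for a TLB invalidation
        self.mappings[agent].add(vpn)
        self.owner[vpn] = agent
        self.migrations += 1
        return f"migrated from {previous}"


space = SharedAddressSpace()
space.attach("cpu")
space.attach("gpu")
buf = space.alloc("cpu", 3 * PAGE_SIZE)
print(space.access("cpu", buf))  # page is already mapped by the CPU
print(space.access("gpu", buf))  # first GPU touch triggers a migration
print(space.access("gpu", buf))  # subsequent GPU touches hit locally
```

The point of the sketch is that allocation, mapping, and invalidation decisions live in one place, while each agent only needs to report accesses; a real system would add remote-access mappings, permissions, and eviction.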
Centralized GPU memory (MGPU-TSM) uses switch-filtered coherence (potentially timestamp-based, e.g., HALCONE), obviating the need for data replication or explicit page movement between memories. A single address space allows uniform access latency and enables stronger, more programmer-friendly consistency models that approach release consistency or sequential consistency (Mojumder et al., 2020). Similarly, MC-DLA presents a flat address space combining on-package and remote memory sections, augmented by bandwidth-aware page placement and allocation policies (Kwon et al., 2019). These strategies eliminate manual buffer management and context switching, supporting plug-and-play device expansion and transparent oversubscription.
4. Performance, Scalability, and Physical Constraints
Centralized memory systems in leading proposals achieve substantial performance and scalability gains. MC-DLA reports 2.8× end-to-end speedup across eight DL applications (eight-device node), reaching 84–99% of “infinite on-package memory” throughput and scaling memory capacity per device from tens of GB to 10.4 TB (eight memory-nodes, each 1.3 TB) (Kwon et al., 2019).
MGPU-TSM yields a mean 3.9× speedup over RDMA-based multi-GPU, up to 27× lower remote-access effects for matrix-multiplication benchmarks, and close to linear scaling from 1 to 4 GPUs—achieved by the switch’s elimination of off-chip remote traffic and aggressive uniform page striping (Mojumder et al., 2020). In networking and GPU driver case studies, GMEM-based drivers reduce code complexity (≥800 LoC eliminated), improve RX throughput by up to 54%, and decrease CPU usage by 32% compared to legacy approaches (Zhu et al., 2023).
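Uniform page striping, mentioned above as one source of the MGPU-TSM scaling result, can be sketched as round-robin interleaving of pages across memory banks. The source does not give the actual mapping or granularity, so the page size, the bank count, and the modulo mapping below are assumptions chosen only to show why striping evens out load.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed striping granularity
NUM_BANKS = 8     # assumed number of memory banks behind the switch


def bank_for(addr: int) -> int:
    """Map a physical address to a bank by round-robin page interleaving."""
    return (addr // PAGE_SIZE) % NUM_BANKS


def bank_load(start: int, nbytes: int) -> Counter:
    """Count how many pages of the range [start, start + nbytes) land on each bank."""
    first_page = start // PAGE_SIZE
    last_page = (start + nbytes - 1) // PAGE_SIZE
    return Counter(page % NUM_BANKS for page in range(first_page, last_page + 1))


# A contiguous 64-page buffer spreads evenly, so no single bank becomes a hotspot.
load = bank_load(0, 64 * PAGE_SIZE)
print(dict(sorted(load.items())))
```

With this mapping, any sufficiently large contiguous buffer touches every bank almost equally, which is what lets aggregate bandwidth grow with the number of banks rather than being limited by one.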
Physical limitations arise when centralized memory is realized as a massive DRAM pool connected to many CPUs via on-board links. As total capacity grows toward petabyte scales, latency and energy per bit rise with physical trace length (PCB copper or cable), while aggregate bandwidth saturates due to pin constraints. For distances d > 1 mm, per-bit energy and achievable bandwidth degrade exponentially (moving from a 36 µm microbump to a 730 µm C4 bump increases energy per bit 5× and cuts bandwidth 12×); systems beyond O(10 TB) become impractical (Liu et al., 28 Aug 2025). The alternative, disaggregated nodes with in-package or 2.5D/3D-integrated memory, restores scalability and energy efficiency.
5. Centralized Memory for AI and Multi-Agent Systems
Centralized memory paradigms extend to multi-agent AI systems and digital context management. In StackPlanner, a single Central Coordinator agent administers both a task-level LIFO stack ($\mathcal{M}$) maintaining the current execution trace and a structured experience memory ($\mathcal{E}$) indexing reusable cross-task knowledge. All sub-agents read and write through coordinator-mediated memory operations (Update, Condense, Prune) to ensure context stability and robustness; high-level coordination is separated from subtask execution (Zhang et al., 9 Jan 2026).
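A coordinator-mediated memory of this kind can be sketched in a few lines. The operation names Update, Condense, and Prune come from the description above, but their exact semantics are not given there; the behavior below (push an entry, replace the k most recent entries with a summary, pop trailing entries of one agent) is an assumption, as are the class `CoordinatorMemory` and the helper `remember`.

```python
from dataclasses import dataclass, field


@dataclass
class CoordinatorMemory:
    """Hypothetical coordinator-owned memory: task stack M and experience store E."""
    stack: list = field(default_factory=list)       # task memory M (LIFO)
    experience: dict = field(default_factory=dict)  # experience memory E

    def update(self, agent: str, note: str) -> None:
        # Update: push a new (agent, note) entry onto the task stack.
        self.stack.append((agent, note))

    def condense(self, k: int, summary: str) -> None:
        # Condense: replace the k most recent entries with one summary entry.
        if not 0 < k <= len(self.stack):
            raise ValueError("k must be between 1 and the stack depth")
        del self.stack[-k:]
        self.stack.append(("coordinator", summary))

    def prune(self, agent: str) -> int:
        # Prune: pop entries from the top while they belong to `agent`.
        removed = 0
        while self.stack and self.stack[-1][0] == agent:
            self.stack.pop()
            removed += 1
        return removed

    def remember(self, task_key: str, lesson: str) -> None:
        # Store a reusable cross-task lesson under a task key (hypothetical helper).
        self.experience.setdefault(task_key, []).append(lesson)


mem = CoordinatorMemory()
mem.update("search", "found 3 candidate papers")
mem.update("search", "paper B is off-topic")
mem.condense(2, "2 relevant papers: A, C")
mem.update("writer", "draft failed validation")
mem.prune("writer")
mem.remember("literature-review", "filter by topic before summarizing")
print(mem.stack)
print(mem.experience)
```

Because sub-agents never touch the stack directly, the coordinator alone decides what stays in context, which is the property the centralized design is meant to provide.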
In privacy- and sovereignty-critical settings, centralized AI memory introduces novel trust challenges. The MemTrust framework decomposes an AI memory system into five TEE-hardened layers (Storage, Extraction, Learning, Retrieval, Governance), each running within an enclave to ensure confidentiality and integrity. Zero-trust is enforced via cryptographically managed key hierarchies, attestation-backed session tokens, and side-channel-resistant retrieval protocols (e.g., oblivious bucket sampling), achieving “local-equivalent” guarantees even for cloud-based deployments (Zhou et al., 11 Jan 2026).
| System / Approach | Topology / Coordination | Key Outcomes |
|---|---|---|
| MC-DLA (Kwon et al., 2019) | NVLink-based device/memory rings | 2.8× speedup, up to 10.4 TB/device, transparent ops |
| MGPU-TSM (Mojumder et al., 2020) | Central switch, uniform memory access | 3.9× speedup, near-linear scaling, coherence support |
| GMEM (Zhu et al., 2023) | OS-managed global VA, device attach | 54% RX gain, code size −92%, auto migration |
| StackPlanner (Zhang et al., 9 Jan 2026) | Central agent, explicit memory stack | Long-horizon RL, stable multi-agent coordination |
| MemTrust (Zhou et al., 11 Jan 2026) | Five-layer TEE, zero-trust | Local-grade security with cross-agent sharing |
6. Limitations, Engineering Barriers, and Alternatives
While centralized designs can drastically improve bandwidth, capacity, and transparency, they are constrained by both physical and logical barriers. Petabyte-scale shared DRAM suffers prohibitive cost ($100K–200K/PB), energy inefficiency due to long signaling paths, and flat per-core bandwidth as core counts rise. Empirical data demonstrate that the energy and bandwidth degrade rapidly with physical distance; a petabyte shared across a rack (d ~ 0.5 m) incurs 50–100 pJ/bit signaling overhead (Liu et al., 28 Aug 2025).
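To give the 50–100 pJ/bit figure a sense of scale, the calculation below converts it into the energy needed to move one petabyte once and into the signaling power at a sustained traffic level. The per-bit energies come from the text; the decimal petabyte and the 1 TB/s traffic level are assumptions chosen for illustration.

```python
PICOJOULE = 1e-12
BITS_PER_BYTE = 8
PETABYTE = 1e15        # bytes, decimal definition (assumed)
TRAFFIC_B_PER_S = 1e12  # 1 TB/s sustained traffic (assumed, not from the source)


def transfer_energy_joules(nbytes: float, pj_per_bit: float) -> float:
    """Signaling energy to move `nbytes` once at `pj_per_bit`."""
    return nbytes * BITS_PER_BYTE * pj_per_bit * PICOJOULE


def signaling_power_watts(bytes_per_s: float, pj_per_bit: float) -> float:
    """Signaling power at a sustained transfer rate."""
    return bytes_per_s * BITS_PER_BYTE * pj_per_bit * PICOJOULE


results = {}
for pj in (50, 100):
    energy = transfer_energy_joules(PETABYTE, pj)
    power = signaling_power_watts(TRAFFIC_B_PER_S, pj)
    results[pj] = (energy, power)
    print(f"{pj} pJ/bit: {energy / 1e3:.0f} kJ per PB moved, "
          f"{power:.0f} W at 1 TB/s")
```

Under these assumptions, a single pass over the full pool costs hundreds of kilojoules in signaling alone, and sustained terabyte-per-second traffic costs hundreds of watts before any DRAM access energy is counted.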
Scaling switch-based centralized memories (e.g., MGPU-TSM) incurs area and power costs that increase with port/radix count, and fully scalable coherence across thousands of compute units remains unresolved. When physically monolithic memory becomes unviable, the only path forward is to disaggregate into compute-memory nodes tightly coupled via µm-scale packaging, keep all hot data in in-node memory, and treat off-package DRAM as a cold tier (Liu et al., 28 Aug 2025).
7. Emerging Directions: Memory-Centric and Processing-In-Memory Architectures
A notable trend in centralized memory design is the migration toward "memory-centric" architectures where memory is elevated from passive storage to an active, self-managing, and even compute-capable participant (Mutlu et al., 1 May 2025). In this paradigm, DRAM structures autonomously manage maintenance (e.g., RowHammer mitigation) without host intervention, and may further support processing-in-memory (PIM), either by embedding accelerators (ALUs, SIMD units) within or adjacent to the DRAM die (processing-near-memory, PNM) or by computing with the analog behavior of the memory array itself (processing-using-memory, PUM; e.g., RowClone bulk copy, Ambit bulk bitwise operations).
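The Ambit example can be made concrete with a purely functional model. Ambit is generally described as activating three DRAM rows at once so that each bitline settles to the majority of the three cells; fixing the third row to all zeros or all ones then yields AND or OR. The sketch below models only that logic on Python integers; the 16-bit row width and the function names are assumptions, and DRAM timing, row copies, and the separate mechanism used for NOT are ignored.

```python
ROW_BITS = 16                    # assumed row width for this sketch
ALL_ZEROS = 0
ALL_ONES = (1 << ROW_BITS) - 1


def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority of three rows, standing in for triple-row activation."""
    return (a & b) | (b & c) | (a & c)


def row_and(a: int, b: int) -> int:
    # With the control row at all zeros, the majority reduces to AND.
    return majority(a, b, ALL_ZEROS)


def row_or(a: int, b: int) -> int:
    # With the control row at all ones, the majority reduces to OR.
    return majority(a, b, ALL_ONES)


row_a = 0b1100_1010_1111_0000
row_b = 0b1010_0110_0000_1111
print(f"AND: {row_and(row_a, row_b):016b}")
print(f"OR:  {row_or(row_a, row_b):016b}")
```

The model shows why a single in-array operation can serve as a building block for several bulk bitwise functions without moving the operands off-chip.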
These designs minimize off-chip data movement (which dominates energy), scale compute/bandwidth with memory expansion, and support flexible, evolutionary adoption (from simple interface extensions to fully disaggregated memory-compute fabrics). Memory-centric systems promise 5–50× performance and energy improvements on graph, ML, and analytics workloads, with robust scalability and transparent maintenance (Mutlu et al., 1 May 2025). However, broad deployment depends on standardizing protocols, expanding OS/runtime support, and careful verification of novel in-memory controllers.
A centralized memory system thus encapsulates an array of hardware, OS, and system-level techniques to enable true shared access, bandwidth scalability, and simplified programming across compute agents, but is ultimately constrained by physical laws and must co-evolve with new paradigms in packaging, autonomy, and memory-centric computation.