
Modular Memory Design Space

Updated 29 December 2025
  • Modular memory design space is a framework that defines memory as interchangeable modules with standardized interfaces and dynamic reconfiguration.
  • It employs vector-indexed storage, runtime swappability, and parameterized trade-offs to balance performance, capacity, and hardware constraints.
  • The design supports diverse applications—from dialogue systems to DNN accelerators—by enabling specialization, scalability, and efficient resource utilization.

Modular memory design space encompasses the architectural, algorithmic, and systems-level choices that organize memory resources as a collection of interchangeable, composable modules. Across diverse computing domains—dialogue agents, DNN accelerators, cryptographic engines, and multi-agent workflow systems—modularity in memory systems enables specialization, scalability, and real-time adaptability. The design space spans parameters such as module granularity, interface, performance quality, hardware constraints, and dynamic reconfiguration, with quantitative trade-offs rigorously characterized in recent research.

1. Architectural Foundations of Modular Memory

Modular memory systems abstract the memory subsystem as a set of discrete components, often with standardized read/write interfaces and runtime swappability. In SLM-backed conversational agents, each NPC instance combines a “fixed-persona” SLM with two independent memory modules: a conversation memory $M^c$ and a world-knowledge memory $M^w$ (Braas et al., 13 Nov 2025). Each module operates as a vector-indexed key-value store (e.g., ChromaDB, Sentence-BERT–sized embeddings), supporting dynamic retrieval of relevant snippets and enabling isolation or sharing of knowledge boundaries as needed:

  • Block diagram:

$x_t \rightarrow \text{Embed} \rightarrow (\text{Read } M^c,\ \text{Read } M^w) \rightarrow \text{Format } P_t \rightarrow [\text{SLM + Persona}] \rightarrow y_t \rightarrow \text{Write } M^c$
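This per-turn flow can be sketched as a short Python loop. This is a minimal illustration, not the paper's implementation: the function names, module interface, and prompt format are hypothetical placeholders.

```python
# Hypothetical sketch of one NPC dialogue turn over two memory modules.
# embed(), persona_slm(), and the module read/write methods stand in for
# a real embedding model, SLM backend, and vector store such as ChromaDB.

def npc_turn(x_t, conv_mem, world_mem, persona_slm, embed, k=4):
    """One turn: embed input, read both modules, format P_t, generate, write back."""
    q = embed(x_t)                               # dense query embedding
    conv_snippets = conv_mem.read(q, top_k=k)    # retrieve from M^c
    world_snippets = world_mem.read(q, top_k=k)  # retrieve from M^w
    # Format prompt P_t from retrieved context plus the new utterance
    p_t = "\n".join(world_snippets + conv_snippets + [f"Player: {x_t}"])
    y_t = persona_slm(p_t)                       # fixed-persona SLM generates y_t
    conv_mem.write(embed(y_t), y_t)              # append reply to M^c only
    return y_t
```

Note that only $M^c$ receives writes here: the world-knowledge module $M^w$ stays read-only at dialogue time, which is what allows it to be shared across NPCs.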

Memory modularity is also foundational in hardware accelerators: for instance, design frameworks for true multi-port SRAM/DRAM modules (Sethi, 2020), memory slices co-integrating DRAM and compute (Asgari et al., 2018), compositional multi-agent LLM systems (Han et al., 6 Oct 2025, Shang et al., 2024), hybrid mobile memories (Wen et al., 2020), and hierarchy-aware DRAM cache models (Babaie et al., 2023).

2. Internal Structure and Interfaces of Memory Modules

A modular memory unit is defined by its physical composition and abstracted API:

  • Vector-indexed memory: Each memory module stores $(k_i, v_i)$ pairs, with $k_i \in \mathbb{R}^d$ a dense embedding and $v_i$ the text (utterance, fact, execution trace, etc.). Reads select the top-$k$ values by cosine similarity to the current query embedding; writes append new (embedding, value) pairs (Braas et al., 13 Nov 2025, Han et al., 6 Oct 2025).
  • Swappability and independence: Modules (such as $M^c$, $M^w$) are disk-backed and runtime-swappable. Loading or unloading a module is reduced to opening or closing a memory-mapped file handle, incurring sub-30 ms overhead at 1k entries, and does not require in-memory reloading or SLM retraining.
  • Abstracted interfaces: In agentic LLM frameworks the memory is encapsulated as two black-box operators, $\text{mem} = M_{\mathrm{write}}(o, \text{mem})$ and $m = M_{\mathrm{retrieve}}(o, \text{mem})$, where $o$ is the current observation, applied to tasks ranging from episodic diaries to hierarchical or knowledge-driven buffers (Shang et al., 2024). This enables recombination and evolution in search spaces for agent design.
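A minimal instance of such a module, implementing the write/retrieve operators over an in-memory list with cosine top-$k$ reads, can be sketched in plain Python. A real system would back this with a persistent index (e.g., ChromaDB); the class below is illustrative only.

```python
import math

class MemoryModule:
    """Minimal vector-indexed key-value memory: write appends an
    (embedding, value) pair; retrieve returns the top-k values by
    cosine similarity to a query embedding. Illustrative sketch."""

    def __init__(self):
        self.keys = []    # dense embeddings k_i
        self.values = []  # payloads v_i (utterances, facts, traces)

    def write(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def retrieve(self, query, top_k=3):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0
        # Rank all stored pairs by similarity to the query (naive linear scan)
        scored = sorted(zip(self.keys, self.values),
                        key=lambda kv: cos(query, kv[0]), reverse=True)
        return [v for _, v in scored[:top_k]]
```

The linear scan in `retrieve` is exactly what makes retrieval latency grow linearly in the number of entries, a trade-off quantified in the next section.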

3. Key Design Parameters and Formal Trade-offs

The modular memory design space is parameterized along axes controlling performance, isolation, capacity, and retrieval. A representative selection:

  • Capacity vs. retrieval speed $(N, T_r)$: Larger memory modules recall more but increase retrieval latency. Measured $T_r(N) \approx \alpha N + \beta$ (empirically $\alpha \sim 10^{-5}$ s/entry, $T_r < 0.042$ s for $N = 1{,}000$ with a naive index) (Braas et al., 13 Nov 2025).
  • Persona isolation vs. consistency: Encoding persona in SLM weights achieves isolation; sharing world modules across NPCs of the same type ensures lore consistency.
  • Hardware resources vs. quality: SLM parameter count $M$ dictates VRAM usage $V(M)$, time-to-first-token $\text{TTF}(M)$, latency $L(M)$, and dialogue quality $Q(M)$. For example, at 7B parameters, factuality and context retention plateau, with the best latency-vs-quality trade-off at roughly 1B–7B (Braas et al., 13 Nov 2025).
  • Fine-grained multi-port memories: Algorithmic Multi-Port Memories (AMMs) (Sethi, 2020) trade off the number of ports $(P_r, P_w)$, bank count $B$, depth $N$, and partitioning (cyclic, block, no-partition). Cost models for area, energy, and latency scale with these choices, as formalized in $L_{\mathrm{access}}(P_r, P_w, N, B) = L_{\mathrm{SRAM}}(N, B) + L_{\mathrm{logic}}(P_r, P_w)$.
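Two of these cost models can be written directly as executable functions. The retrieval coefficients below match the empirical fit quoted above; the SRAM and port-logic latency terms in the AMM model are invented placeholder values, since the paper's fitted constants are not reproduced here.

```python
# Illustrative cost models for Section 3. alpha/beta follow the empirical
# fit quoted above; the SRAM/logic terms use made-up coefficients.

def retrieval_latency(n_entries, alpha=1e-5, beta=0.002):
    """T_r(N) ~ alpha*N + beta for a naive (linear-scan) vector index, in s."""
    return alpha * n_entries + beta

def amm_access_latency(p_r, p_w, n, b,
                       sram_lat=lambda n, b: 0.5 + 0.1 * (n / (1024 * b)),
                       logic_lat=lambda p_r, p_w: 0.05 * (p_r + p_w)):
    """L_access = L_SRAM(N, B) + L_logic(P_r, P_w), in arbitrary ns units."""
    return sram_lat(n, b) + logic_lat(p_r, p_w)
```

The second model captures the qualitative behavior reported for AMMs: adding read/write ports grows the arbitration-logic term even when the SRAM array itself is unchanged.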

4. Systemic Methodologies for Design Space Exploration

Exploration of modular memory design space employs parameter sweeps, multi-objective optimization, and cost-model–guided Pareto analysis:

  • DSE in simulation frameworks: ZigZag (Mei et al., 2020) and MICSim (Wang et al., 2024) allow construction and evaluation of memory hierarchies (capacities, bandwidths, sharing) and mapping strategies for DNN accelerators. In ZigZag, each hierarchy is defined as $\{(C_\ell, B_\ell, S_\ell)\}_{\ell = 0, \ldots, L-1}$, and temporal mappings of nested loops across operands and levels are exhaustively or heuristically enumerated.
  • Multi-level modularity: DreamRAM (Cai et al., 13 Dec 2025) models 3D die-stacked DRAM as a Cartesian product of choices at the inter-bank, bank/subarray, MAT, and routing levels, and applies wire-aware equations to map out five-dimensional tuples $(\text{BW}, \text{cap}, E_{\text{bit}}, \text{lat}, \text{area})$ and the resulting Pareto frontiers.
  • Dynamic reconfiguration: Embedded systems can exploit split/unified bank assignments, associativity tuning, scratchpad vs. cache modes, and L2 tiling policies, all exposed as orthogonal knobs to software (Bates et al., 2016).
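The common pattern across these tools, a cost-model-guided sweep followed by Pareto filtering, can be sketched in a few lines. The candidate grid and the energy/latency formulas below are invented for illustration and are not ZigZag's or DreamRAM's actual equations.

```python
from itertools import product

def pareto_front(points):
    """Keep points not dominated on (energy, latency); lower is better on both."""
    return [p for p in points
            if not any(q["energy"] <= p["energy"] and q["latency"] <= p["latency"]
                       and q != p for q in points)]

# Toy sweep over per-level (capacity in KB, bandwidth in GB/s) choices;
# the cost equations are placeholders standing in for a real cost model.
candidates = []
for cap, bw in product([32, 64, 128], [8, 16, 32]):
    energy = 0.1 * cap + 50.0 / bw       # larger buffers cost static energy
    latency = 100.0 / bw + 500.0 / cap   # bandwidth and capacity cut latency
    candidates.append({"cap": cap, "bw": bw, "energy": energy, "latency": latency})

frontier = pareto_front(candidates)
```

Real frameworks replace the two toy formulas with validated analytical or simulated cost models, but the sweep-then-filter structure is the same.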

5. Evaluation, Practical Topologies, and Performance Data

The impact of different modular memory configurations is quantitatively established:

| System/Domain | Key Parameters Tuned | Empirical Metrics (selected) | Reference |
|---|---|---|---|
| Persona-NPC SLMs | SLM size $M$, memory size $N$ | Q: 93/100/100% (factual/context/retrieval), $L = 5.49$ s | (Braas et al., 13 Nov 2025) |
| AMM accelerators | $(P_r, P_w, B, N)$ | Speedup $2.3\times$, area $1.8\times$ | (Sethi, 2020) |
| Memory slices | $M_s$, $B_{\text{mem}}$, $N$ | Up to 747 GFLOPs/J, superlinear scaling | (Asgari et al., 2018) |
| LEGOMem agents | Memory bank split, granularity | +12–13 pts overall success gain | (Han et al., 6 Oct 2025) |
| DRAM cache models | $C, B, A, W, T, N_o$ | Pareto optima: MR $\leq 20\%$, best at balanced tag SRAM | (Babaie et al., 2023) |

Notable outcomes:

  • Modular procedural memory in multi-agent systems gives strong gains even for SLM-only teams (+13.4 pts vs. no memory), with orchestrator memory consistently yielding the highest leverage for planning/execution (Han et al., 6 Oct 2025).
  • In dialogue SLMs, quantization reduces disk footprint by $4\times$ but can increase generation latency $7\times$; memory retrieval takes $<0.042$ s for 1k entries and module swap-in $<0.03$ s (Braas et al., 13 Nov 2025).
  • In mobile hybrid systems, block vs. page granularities and adaptive thresholds in the hardware memory manager yield a 39% energy reduction versus all-DRAM, with 12% performance loss, a tunable Pareto envelope (Wen et al., 2020).

6. Application-Driven Specialization and Extensions

Modular memory design space is heavily application-driven:

  • Dialogue systems: Runtime-swappable persona and memory modules permit maintaining strict knowledge boundaries in NPCs and minimize VRAM residency for concurrent characters (Braas et al., 13 Nov 2025).
  • Neural accelerators: Modular slices and programmable memory interfaces allow matching the application's operational intensity $I_{\text{app}}$ to the balanced compute/memory point, such that $C_{\text{op}}/B_{\text{mem}} \approx I_{\text{app}}$, enabling superlinear scaling with increasing slices (Asgari et al., 2018).
  • Deep learning: In DNN accelerators, modular design supports uneven loop-to-memory mapping, hierarchical sharing, and rapid DSE to identify architecture/dataflow points that cut energy use by up to 33% (Mei et al., 2020).
  • Multi-agent LLMs: Procedural memory units can be allocated globally (for orchestrators), locally (fine-grained agent guidance), or both; LEGOMem and AgentSquare show search frameworks and module evolution over this axis, with flexible recombination for architecture search (Han et al., 6 Oct 2025, Shang et al., 2024).
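The balance condition $C_{\text{op}}/B_{\text{mem}} \approx I_{\text{app}}$ is a roofline-style argument, and can be checked with a small helper. The throughput/bandwidth numbers below are invented for illustration and do not come from the cited papers.

```python
def attained_gflops(c_op, b_mem, i_app):
    """Roofline attainable throughput: min(compute peak, bandwidth * intensity).
    c_op in GFLOP/s, b_mem in GB/s, i_app in FLOP/byte."""
    return min(c_op, b_mem * i_app)

def slices_needed(i_app, target_gflops, slice_c_op, slice_b_mem):
    """How many identical memory slices reach the target, assuming each slice
    adds both compute and bandwidth (the source of near-linear scaling)."""
    per_slice = attained_gflops(slice_c_op, slice_b_mem, i_app)
    return -(-target_gflops // per_slice)  # ceiling division
```

When a slice's ratio $C_{\text{op}}/B_{\text{mem}}$ matches $I_{\text{app}}$, neither compute nor bandwidth is stranded, so adding slices scales both resources usefully; a mismatched ratio leaves one of the two idle on every slice.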

7. Future Directions and Open Problems

Principal research directions raised in the literature include:

  • Adaptive continual memory updates and consolidation, especially for agentic or multi-agent systems undergoing mission variability or scaling to open-ended environments (Han et al., 6 Oct 2025).
  • Lifelong learning and memory operations (consolidation, forgetting) tightly integrated into modular frameworks.
  • More advanced cost-model integration with runtime reconfiguration to dynamically steer operating points on the Pareto boundary depending on workload phase or hardware resource availability.
  • Extending modular memory abstractions to new device technologies (e.g., emerging NVM, mixed-signal compute-in-memory) and deeper algorithm-hardware co-design, as demonstrated in mixed-precision CNN accelerators and modular 3D DRAM (Wang et al., 2024, Cai et al., 13 Dec 2025).

Modular memory design space, as rigorously articulated in contemporary research, structures the landscape of memory systems into a combinatorial set of modular components and policies. This approach exposes multi-dimensional trade-offs, empowers targeted specialization, and underpins the next generation of efficient, scalable, and adaptable computing systems by quantifying and systematizing the choices at every architectural level (Braas et al., 13 Nov 2025, Han et al., 6 Oct 2025, Babaie et al., 2023, Sethi, 2020, Cai et al., 13 Dec 2025).
