FlashMem Architecture

Updated 21 April 2026

FlashMem architecture is a family of frameworks that leverage flash storage and high-dimensional state representations to optimize memory use for LLMs and mobile DNNs.
It employs latent memory distillation with a Shared-KV Consolidator to streamline memory consolidation, reducing latency by up to 5× while maintaining accuracy.
FlashMem integrates dynamic memory streaming for mobile GPUs and hardware-level flash map management, balancing resource constraints and computational efficiency.

FlashMem architecture refers to a family of technical frameworks, each distinguished by high-efficiency memory management that leverages flash storage or high-dimensional state representations. The term encompasses three principal domains: latent memory distillation for LLMs (Hou et al., 9 Jan 2026), memory streaming for efficient deployment of Deep Neural Networks (DNNs) on resource-constrained mobile GPUs (Shu et al., 17 Feb 2026), and hardware-level address translation acceleration for NAND flash-based storage devices (Woo et al., 2017). This entry details the design principles, algorithms, and empirical outcomes of the two most recent and distinct FlashMem systems for LLMs and for mobile DNNs, with a brief reference to hardware-based Flash Map Management.

1. Intrinsic Latent Memory Distillation in FlashMem for LLMs

Latent memory in LLMs extends agent autonomy by distilling compact, persistent context summaries from transient backbone states. The core theoretical premise is the approximate injectivity of the internal representations of transformer-based models with respect to their input trajectory (Nikolaou et al., 2025). FlashMem exploits this property by treating the last hidden state $h_t \in \mathbb{R}^d$ at generation step $t$ as a sufficient statistic for all prior interactions $\tau_{<t}$:

$h_t = \mathrm{LLM}_\theta^{\text{last}}(\tau_{<t}, o_t)$

Rather than re-encoding the trajectory via a separate memory encoder $\mathcal{G}_\phi$, FlashMem projects $h_t$ into an initial seed embedding $m_0$ through a compact MLP:

$m_0 = \mathrm{MLP}_{\mathrm{proj}}(h_t)$

An autoregressive Shared-KV Consolidator then generates a concise sequence of $K$ memory tokens $\mathcal{M} = \{m_1, ..., m_K\}$, encapsulating the agent’s distilled “experience.”

2. Shared-KV Consolidator: Architecture and Dataflow

The Shared-KV Consolidator is architecturally distinguished by direct cache reuse and projection-free cross-attention. Its inputs comprise the last hidden state $h_t$ and the frozen key-value (KV) cache $(K_{\text{cache}}, V_{\text{cache}})$ from the LLM backbone. The consolidation pipeline can be summarized as:

Project $h_t$ to $m_0$
For $k$ from 1 to $K$:
- $q_k \leftarrow$ DecoderSelfAttnAndFFN $(m_0, \ldots, m_{k-1})$
- Projection-free cross-attention:
$m_k \leftarrow \mathrm{Attn}(q_k, K_{\text{cache}}, V_{\text{cache}})$

Only the consolidator's query-side parameters are learned afresh; the backbone’s cached keys and values are inherited. This mechanism eliminates the need to re-encode history or introduce new key/value projection heads for memory formation (Hou et al., 9 Jan 2026).
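To make the dataflow concrete, the following is a minimal single-head sketch in plain Python, not the authors' implementation. It assumes a one-layer tanh projection in place of $\mathrm{MLP}_{\mathrm{proj}}$, a single linear query map (`W_q`, a hypothetical name) in place of the decoder self-attention and FFN block, and a residual update to form each memory token. The only point it illustrates is that cross-attention reads the cached keys and values as they are, with no new key/value projections.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(q, keys, values):
    """Attend from q over the cached keys/values as-is (no new K/V projections)."""
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def consolidate(h_t, keys, values, W_proj, W_q, num_tokens):
    """Generate num_tokens memory vectors from h_t and a frozen KV cache."""
    # Seed m_0: a single tanh layer stands in for MLP_proj (assumption).
    m = [math.tanh(x) for x in matvec(W_proj, h_t)]
    memory = []
    for _ in range(num_tokens):
        q = matvec(W_q, m)  # simplified stand-in for DecoderSelfAttnAndFFN
        context = cross_attend(q, keys, values)
        m = [a + b for a, b in zip(m, context)]  # residual update (assumption)
        memory.append(m)
    return memory

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, T, K = 4, 6, 3  # hidden size, cached positions, memory tokens (toy values)
h_t = rand_matrix(1, d)[0]
keys, values = rand_matrix(T, d), rand_matrix(T, d)
memory = consolidate(h_t, keys, values, rand_matrix(d, d), rand_matrix(d, d), K)
print(len(memory), len(memory[0]))
```

In this toy setup the cache (`keys`, `values`) is never modified or re-encoded; only `W_proj` and `W_q` would be trainable.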

3. Uncertainty-Driven, Parameter-Free Memory Scheduling

FlashMem introduces a cognitive monitor that adaptively decides when latent memory must be consolidated, based on model uncertainty. Let $\alpha_t^{(i)}$ denote the raw attention weights of head $i$ at step $t$ (with attention sinks masked). The per-head Shannon entropy is:

$H_t^{(i)} = -\sum_j \alpha_{t,j}^{(i)} \log \alpha_{t,j}^{(i)}$

The aggregated system uncertainty $\bar{H}_t$ combines the per-head entropies. Memory consolidation is triggered if and only if $\bar{H}_t > \delta$, where the threshold $\delta$ is set to a chosen percentile of $\bar{H}_t$ on validation data.
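A small sketch of such an entropy gate follows. Several details are assumptions made for illustration rather than taken from the source: per-head entropies are aggregated by a plain mean, sink masking drops the first `n_sink` positions and renormalizes, and the threshold uses a nearest-rank percentile. All function names are hypothetical.

```python
import math

def head_entropy(weights, n_sink=1):
    """Shannon entropy of one head's attention row after masking sink positions."""
    kept = weights[n_sink:]
    total = sum(kept)
    if total <= 0:
        return 0.0
    probs = [w / total for w in kept]
    return -sum(p * math.log(p) for p in probs if p > 0)

def system_uncertainty(heads, n_sink=1):
    """Aggregate per-head entropies; the mean is an assumption."""
    return sum(head_entropy(h, n_sink) for h in heads) / len(heads)

def percentile_threshold(values, pct):
    """Nearest-rank percentile of a list of validation uncertainties."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def should_consolidate(heads, threshold, n_sink=1):
    return system_uncertainty(heads, n_sink) > threshold

# Toy data: one sharply focused head pattern and one diffuse pattern.
focused = [[0.1, 0.9, 0.0, 0.0]]
diffuse = [[0.2, 0.2, 0.2, 0.2, 0.2]]
threshold = percentile_threshold([0.2, 0.4, 0.6, 0.8, 1.0], 80)

print(should_consolidate(focused, threshold))
print(should_consolidate(diffuse, threshold))
```

The gate has no learned parameters: it only needs attention weights the backbone already computes and one scalar threshold fitted on validation data.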

Empirically, entropy drops after injection in about 76.5% of triggers (Table 8), by 0.215 on average, indicating that memory synthesis resolves epistemic uncertainty.

4. Computation Reuse and Complexity Analysis

Conventional latent memory methods incur a cost that grows with the history length, because they separately re-encode all historical tokens. FlashMem's cost per consolidation operation is independent of the history length, since it reuses the backbone’s KV cache for all cross-attention computations. For long-context scenarios (64k tokens), this yields a dramatic empirical advantage:

System	Peak VRAM (GB)	Throughput (tok/s)	Latency (ms/step)
MemGen	40.78	4.13	61.99
FlashMem	31.44	20.86	12.28
Vanilla (no mem)	31.12	22.10	9.97

FlashMem thus achieves a roughly $5\times$ latency reduction compared to segregated encoder-based latent memory designs, with negligible VRAM increase over the vanilla model (Hou et al., 9 Jan 2026).
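The ratios behind these statements can be checked directly from the table. The snippet below only redoes the arithmetic on the reported figures; the dictionary layout and variable names are illustrative.

```python
# (peak VRAM in GB, throughput in tok/s, latency in ms/step), copied from the table
results = {
    "MemGen": (40.78, 4.13, 61.99),
    "FlashMem": (31.44, 20.86, 12.28),
    "Vanilla": (31.12, 22.10, 9.97),
}

mem_vram, mem_tps, mem_lat = results["MemGen"]
fm_vram, fm_tps, fm_lat = results["FlashMem"]
van_vram, van_tps, van_lat = results["Vanilla"]

latency_reduction = mem_lat / fm_lat               # FlashMem vs. MemGen
throughput_gain = fm_tps / mem_tps                 # FlashMem vs. MemGen
vram_overhead_gb = fm_vram - van_vram              # FlashMem vs. vanilla
vram_overhead_pct = 100 * vram_overhead_gb / van_vram
latency_vs_vanilla = fm_lat / van_lat

print(f"latency reduction vs MemGen: {latency_reduction:.2f}x")
print(f"throughput gain vs MemGen:   {throughput_gain:.2f}x")
print(f"VRAM overhead vs vanilla:    {vram_overhead_gb:.2f} GB ({vram_overhead_pct:.2f}%)")
print(f"latency vs vanilla:          {latency_vs_vanilla:.2f}x")
```

Both the latency and throughput ratios come out at about 5.05, and the VRAM overhead relative to vanilla is about 0.32 GB (roughly 1%), which is what "negligible" refers to.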

5. Empirical Evaluation of FlashMem-LM

On challenging reasoning and summarization benchmarks (GSM8K, MATH, GPQA, BookSum, GovReport, KodCode), the framework matches or outperforms heavy-weight memory baselines with the following key results:

Latency: At 64k-token context, FlashMem's latency (12.28 ms/step) is close to vanilla (9.97 ms/step) and far below MemGen's (61.99 ms/step).
Accuracy: GSM8K accuracy (Qwen 2.5B): MemGen 70.54%, FlashMem 70.09%, Vanilla 65.12%.
Memory fidelity: FlashMem remains close to or exceeds MemGen’s accuracy, especially on smaller models for challenging mathematical tasks.
Ablation studies: Single-layer ($L = 1$) consolidators suffice; the entropy-based monitor reliably predicts resolution of epistemic uncertainty.

This demonstrates FlashMem’s utility in bridging efficiency and persistent cognition in LLM-based agents, without loss in core task performance (Hou et al., 9 Jan 2026).

6. FlashMem for Mobile DNN Workloads: Memory Streaming Architecture

FlashMem, in the context of resource-constrained mobile DNN accelerators, is a streaming execution engine that eliminates the inefficiencies of full-weight preloading. The system statically schedules a hybrid of preloading and just-in-time streaming of model weights, governed by a CP-SAT-based overlap plan.
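The preload-versus-stream trade-off can be illustrated with a toy planner. This is a sketch under a deliberately simple cost model that is assumed here, not taken from the source: each layer is either fully preloaded (resident for the whole run) or streamed while the previous layer computes, one buffer sized for the largest streamed layer is reserved, and any load time not hidden by the previous layer's compute counts as stall. Exhaustive search stands in for the CP-SAT solver, and all names and numbers are hypothetical.

```python
from itertools import product

def plan_loading(sizes_mb, compute_ms, bandwidth_mb_per_ms, budget_mb):
    """Pick preload (True) or stream (False) per layer to minimize stall time."""
    best = None
    for preload in product((True, False), repeat=len(sizes_mb)):
        resident = sum(s for s, p in zip(sizes_mb, preload) if p)
        # One streaming buffer large enough for the biggest streamed layer.
        buffer_mb = max((s for s, p in zip(sizes_mb, preload) if not p), default=0)
        if resident + buffer_mb > budget_mb:
            continue  # violates the memory budget
        stall = 0.0
        for i, (size, p) in enumerate(zip(sizes_mb, preload)):
            if p:
                continue
            hidden = compute_ms[i - 1] if i > 0 else 0.0  # overlap with previous layer
            stall += max(0.0, size / bandwidth_mb_per_ms - hidden)
        candidate = (stall, resident, preload)  # tie-break on less resident memory
        if best is None or candidate < best:
            best = candidate
    return best

sizes_mb = [40, 10, 60, 20]
compute_ms = [5, 2, 8, 3]
best = plan_loading(sizes_mb, compute_ms, bandwidth_mb_per_ms=10, budget_mb=100)
print(best)
```

With a 100 MB budget the two large layers cannot both stay resident, so the planner preloads the first layer (whose load cannot overlap with any earlier compute) and streams the rest. A real scheduler would replace the brute-force loop with a constraint solver and model chunk-wise streaming and separate DRAM and texture budgets.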

Key Components

Static Model-Loading Scheduler (OPG): Solves a constrained optimization problem balancing full preloading against chunk-wise streaming under explicit DRAM and texture memory constraints.

Dynamic Streaming Engine: For each executed layer, model weights not already in texture memory are streamed from disk via unified (CPU) memory and DMA into 2.5D texture layouts directly favored by mobile GPUs. A ring buffer manages texture pages, asynchronously evicting unneeded weights.
2.5D Texture Memory: Weights are tiled into micro-tiles (H, W, 4) on disk, ensuring that disk reads can be mapped directly to GPU image objects with no online transformation (yielding a 2.5–3.5$\times$ speedup over unified-memory transform pipelines).
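The ring-buffer management of texture pages mentioned above can be sketched as follows. This is a synchronous illustration with hypothetical names: it tracks which layer's weights occupy each page slot and overwrites the oldest slot on a miss, whereas the described engine performs streaming and eviction asynchronously via DMA.

```python
class TexturePageRing:
    """Fixed number of texture page slots, reused in ring order."""

    def __init__(self, num_pages):
        self.slots = [None] * num_pages  # slot index -> layer id currently resident
        self.where = {}                  # layer id -> slot index
        self.head = 0                    # next slot to overwrite

    def acquire(self, layer_id):
        """Return (slot, streamed): streamed is True if weights had to be loaded."""
        if layer_id in self.where:
            return self.where[layer_id], False
        slot = self.head
        evicted = self.slots[slot]
        if evicted is not None:
            del self.where[evicted]      # the oldest page is dropped
        self.slots[slot] = layer_id      # a real engine would issue the disk read here
        self.where[layer_id] = slot
        self.head = (self.head + 1) % len(self.slots)
        return slot, True

ring = TexturePageRing(num_pages=2)
trace = [ring.acquire(layer) for layer in ["a", "b", "c", "b", "a"]]
print(trace)
```

Ring order suits layer-by-layer inference because weights are consumed in a fixed sequence, so the oldest page is normally the one no longer needed.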

Memory and Latency Reduction Formulas

Two quantities are modeled: the streaming peak footprint and the inference latency speedup, whose latency terms take I/O-compute overlap into account.

Empirical benchmarking (11 models, Table 1 (Shu et al., 17 Feb 2026)):

Memory reduction: measured relative to preloading frameworks
Speedup: measured across modern DNN workloads, supporting models up to 70B parameters and multi-model pipelines

7. Hardware-Level Flash Map Management (Brief Reference)

In the context of flash storage devices, the Flash Map Management Unit (FMMU) architecture (Woo et al., 2017) automates address translation at the hardware level for SSDs. The FMMU implements two-level SRAM caches (CMT and CTP), non-blocking request pipelines, and cost-aware dirty translation management to eliminate the translation bottleneck in NAND flash storage, achieving a 44% reduction in Flash Translation Layer execution time on map cache hits and effective scaling to multi-channel, high-performance systems.
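As a rough software analogy for a two-level map cache, the sketch below resolves a logical page number through a small entry cache, then a cache of whole translation pages, and finally the map stored on flash. It assumes, for illustration only, that CMT holds individual logical-to-physical entries and CTP holds whole translation pages, both with LRU replacement; it is read-only and does not model the non-blocking pipeline or dirty-entry management. All names and sizes are hypothetical.

```python
from collections import OrderedDict

ENTRIES_PER_PAGE = 4  # hypothetical number of map entries per translation page

class TwoLevelMapCache:
    def __init__(self, flash_map, cmt_size, ctp_size):
        self.flash_map = flash_map  # list: logical page -> physical page (stands in for NAND)
        self.cmt = OrderedDict()    # logical page -> physical page (LRU order)
        self.ctp = OrderedDict()    # translation page number -> list of entries (LRU order)
        self.cmt_size = cmt_size
        self.ctp_size = ctp_size
        self.flash_reads = 0

    def translate(self, lpn):
        if lpn in self.cmt:                      # first-level hit
            self.cmt.move_to_end(lpn)
            return self.cmt[lpn]
        page_no, offset = divmod(lpn, ENTRIES_PER_PAGE)
        if page_no in self.ctp:                  # second-level hit
            self.ctp.move_to_end(page_no)
        else:                                    # miss: fetch the translation page
            start = page_no * ENTRIES_PER_PAGE
            self.ctp[page_no] = self.flash_map[start:start + ENTRIES_PER_PAGE]
            self.flash_reads += 1
            if len(self.ctp) > self.ctp_size:
                self.ctp.popitem(last=False)     # evict least recently used page
        ppn = self.ctp[page_no][offset]
        self.cmt[lpn] = ppn
        if len(self.cmt) > self.cmt_size:
            self.cmt.popitem(last=False)         # evict least recently used entry
        return ppn

flash_map = [100 + i for i in range(16)]
cache = TwoLevelMapCache(flash_map, cmt_size=2, ctp_size=1)
results = [cache.translate(lpn) for lpn in [5, 5, 6, 9, 5]]
print(results, cache.flash_reads)
```

In the access sequence above, only the first, fourth, and fifth lookups reach flash; the others are served from one of the two cache levels.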

In aggregate, FlashMem architectures exemplify the synthesis of theoretical insight (injectivity, sufficiency, optimal streaming) and advanced systems engineering (direct cache reuse, precise entropy-based gating, branch-free tiling, and asynchronous execution). The empirical evidence demonstrates substantial improvements in both memory efficiency and computational performance while maintaining, and often surpassing, task accuracy across LLM and DNN benchmarks (Hou et al., 9 Jan 2026; Shu et al., 17 Feb 2026). The methodology highlights a paradigm in which memory, computation reuse, and tightly coupled scheduling combine to address the challenges of persistent cognition and efficient inference in modern AI systems.

References

Hou et al. (2026). FlashMem: Distilling Intrinsic Latent Memory via Computation Reuse.

Shu et al. (2026). FlashMem: Supporting Modern DNN Workloads on Mobile with GPU Memory Hierarchy Optimizations.

Woo et al. (2017). FMMU: A Hardware-Automated Flash Map Management Unit for Scalable Performance of NAND Flash-Based SSDs.
