
FlashMem Architecture

Updated 21 April 2026
  • FlashMem architecture is a family of frameworks that leverage flash storage and high-dimensional state representations to optimize memory use for LLMs and mobile DNNs.
  • It employs latent memory distillation with a Shared-KV Consolidator to streamline memory consolidation, reducing latency by up to 5× while maintaining accuracy.
  • FlashMem integrates dynamic memory streaming for mobile GPUs and hardware-level flash map management, balancing resource constraints and computational efficiency.

FlashMem architecture refers to a family of technical frameworks, each distinguished by high-efficiency memory management that leverages flash storage or high-dimensional state representations. The term encompasses three principal domains: latent memory distillation for LLMs (Hou et al., 9 Jan 2026), memory streaming for efficient deployment of Deep Neural Networks (DNNs) on resource-constrained mobile GPUs (Shu et al., 17 Feb 2026), and hardware-level address translation acceleration for NAND flash-based storage devices (Woo et al., 2017). This entry details the design principles, algorithms, and empirical outcomes of the two most recent and distinct FlashMem systems for LLMs and for mobile DNNs, with a brief reference to hardware-based Flash Map Management.

1. Intrinsic Latent Memory Distillation in FlashMem for LLMs

Latent memory in LLMs extends agent autonomy by distilling compact, persistent context summaries from transient backbone states. The core theoretical premise is the approximate injectivity of the internal representations of transformer-based models with respect to their input trajectory (Nikolaou et al., 2025). FlashMem exploits this property by treating the last hidden state $h_t \in \mathbb{R}^d$ at generation step $t$ as a sufficient statistic for all prior interactions $\tau_{<t}$:

$$h_t = \mathrm{LLM}_\theta^{\text{last}}(\tau_{<t}, o_t)$$

Rather than re-encoding the trajectory via a separate memory encoder $\mathcal{G}_\phi$, FlashMem projects $h_t$ into an initial seed embedding $m_0$ through a compact MLP:

$$m_0 = \mathrm{MLP}_{\mathrm{proj}}(h_t)$$

An autoregressive Shared-KV Consolidator then generates a concise sequence of $K$ memory tokens $\mathcal{M} = \{m_1, \dots, m_K\}$, encapsulating the agent’s distilled “experience.”

2. Shared-KV Consolidator: Architecture and Dataflow

The Shared-KV Consolidator is architecturally distinguished by direct cache reuse and projection-free cross-attention. Its inputs comprise the last hidden state $h_t$ and the frozen key-value (KV) cache from the LLM backbone. The consolidation pipeline can be formalized as:

  • Project $h_t$ to the seed embedding $m_0$
  • For $k$ from 1 to $K$:

    • Apply DecoderSelfAttnAndFFN to the current memory state
    • Projection-free cross-attention: attend over the backbone's cached keys and values to produce the memory token $m_k$

Only the consolidator's own parameters are learned afresh; the backbone’s cached keys and values are inherited. This mechanism eliminates the need to re-encode history or introduce new key/value projection heads for memory formation.

The consolidation pipeline above follows Hou et al. (9 Jan 2026).
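The consolidation steps described in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `tanh` nonlinearity, the single weight matrix `w_step` standing in for the decoder self-attention and FFN step, and the random toy inputs are all hypothetical. The one property it is meant to show is that each memory token is produced by attending directly over the backbone's cached keys and values, with no new key or value projections.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def cross_attend(query, keys, values):
    """Attend from `query` over cached keys/values without new K/V projections."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def consolidate(h_t, kv_cache, w_proj, w_step, num_tokens):
    """Generate `num_tokens` memory vectors from h_t and a frozen KV cache."""
    keys, values = kv_cache
    # Seed embedding m_0 (assumption: a single tanh layer stands in for the MLP).
    state = [math.tanh(z) for z in matvec(w_proj, h_t)]
    memory = []
    for _ in range(num_tokens):
        # Hypothetical stand-in for the decoder self-attention + FFN step.
        query = [math.tanh(z) for z in matvec(w_step, state)]
        # Reuse the backbone cache directly; the cache itself is never modified.
        state = cross_attend(query, keys, values)
        memory.append(state)
    return memory

def rand_matrix(rows, cols):
    """Toy random matrix used only for this illustration."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, T, K = 4, 6, 3  # hidden size, cached positions, memory tokens (toy values)
h_t = [random.uniform(-1.0, 1.0) for _ in range(d)]
kv_cache = (rand_matrix(T, d), rand_matrix(T, d))
memory = consolidate(h_t, kv_cache, rand_matrix(d, d), rand_matrix(d, d), K)
print(len(memory), len(memory[0]))
```

Because the cross-attention output is a softmax-weighted average of the cached values, every memory token in this sketch lies inside the range spanned by those values; a trained consolidator would add learned structure on top of this.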

3. Uncertainty-Driven, Parameter-Free Memory Scheduling

FlashMem introduces a cognitive monitor that adaptively decides when latent memory must be consolidated, based on model uncertainty. Let $\alpha_t^{(i)}$ denote the raw attention weights of head $i$ at step $t$ (with attention sinks masked). The per-head Shannon entropy is:

$$H_t^{(i)} = -\sum_j \alpha_{t,j}^{(i)} \log \alpha_{t,j}^{(i)}$$

The aggregated system uncertainty $\bar{H}_t$ combines the per-head entropies. Memory consolidation is triggered if and only if $\bar{H}_t > \delta$, where the threshold $\delta$ is set to a fixed percentile of $\bar{H}_t$ on validation data.

Empirically, entropy drops after memory injection in about 76.5% of triggers (Table 8), by 0.215 on average, which suggests that memory synthesis resolves epistemic uncertainty.
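The gating rule can be illustrated with a short sketch. Several details here are assumptions rather than facts from the source: sinks are masked by dropping the first `num_sink_tokens` positions and renormalizing, the per-head entropies are aggregated with a plain mean, the threshold uses a nearest-rank percentile, and the 80th percentile in the example is an arbitrary illustrative choice. All function names are hypothetical.

```python
import math

def head_entropy(weights, num_sink_tokens=1):
    """Shannon entropy of one head's attention after masking sink positions."""
    kept = weights[num_sink_tokens:]
    total = sum(kept)
    if total <= 0:
        return 0.0
    probs = [w / total for w in kept]  # assumption: renormalize after masking
    return -sum(p * math.log(p) for p in probs if p > 0)

def aggregate_uncertainty(per_head_weights, num_sink_tokens=1):
    """Assumption: aggregate by averaging the per-head entropies."""
    entropies = [head_entropy(w, num_sink_tokens) for w in per_head_weights]
    return sum(entropies) / len(entropies)

def percentile_threshold(validation_values, pct):
    """Nearest-rank percentile of aggregated entropies seen on validation data."""
    ordered = sorted(validation_values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def should_consolidate(per_head_weights, threshold):
    """Trigger consolidation only when aggregated uncertainty exceeds the threshold."""
    return aggregate_uncertainty(per_head_weights) > threshold

# Toy attention rows: position 0 plays the role of an attention sink.
peaked = [0.5, 0.5, 0.0, 0.0, 0.0]    # confident head: all remaining mass on one token
uniform = [0.2, 0.2, 0.2, 0.2, 0.2]   # uncertain head: mass spread evenly

threshold = percentile_threshold([0.2, 0.4, 0.6, 0.8, 1.0], pct=80)
print(should_consolidate([peaked, peaked], threshold))
print(should_consolidate([uniform, uniform], threshold))
```

The monitor adds no trainable parameters: it only reads attention weights the backbone already computes, and the single scalar threshold is calibrated offline.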

4. Computation Reuse and Complexity Analysis

Conventional latent memory methods incur a cost that grows with the history length because they separately re-encode all historical tokens. FlashMem avoids this re-encoding cost in each consolidation operation, since it reuses the backbone’s KV cache for all cross-attention computations. For long-context scenarios (64k tokens), this yields a substantial empirical advantage:

| System | Peak VRAM (GB) | Throughput (tok/s) | Latency (ms/step) |
|---|---|---|---|
| MemGen | 40.78 | 4.13 | 61.99 |
| FlashMem | 31.44 | 20.86 | 12.28 |
| Vanilla (no mem) | 31.12 | 22.10 | 9.97 |

FlashMem thus achieves a roughly 5× latency reduction compared to segregated encoder-based latent memory designs (61.99 vs. 12.28 ms/step), with a negligible VRAM increase over the vanilla model (31.44 vs. 31.12 GB) (Hou et al., 9 Jan 2026).
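The headline ratios can be checked directly from the reported figures. The short script below only restates the table values and does the arithmetic; the dictionary layout and variable names are illustrative.

```python
# Figures copied from the comparison table (64k-token context).
benchmarks = {
    "MemGen":   {"vram_gb": 40.78, "tok_per_s": 4.13,  "latency_ms": 61.99},
    "FlashMem": {"vram_gb": 31.44, "tok_per_s": 20.86, "latency_ms": 12.28},
    "Vanilla":  {"vram_gb": 31.12, "tok_per_s": 22.10, "latency_ms": 9.97},
}

flash, memgen, vanilla = (benchmarks[k] for k in ("FlashMem", "MemGen", "Vanilla"))

# Latency and throughput relative to the separate-encoder baseline.
latency_reduction = memgen["latency_ms"] / flash["latency_ms"]   # about 5.05
throughput_gain = flash["tok_per_s"] / memgen["tok_per_s"]       # about 5.05

# Overheads relative to running with no latent memory at all.
vram_overhead_gb = flash["vram_gb"] - vanilla["vram_gb"]         # about 0.32 GB
latency_overhead_ms = flash["latency_ms"] - vanilla["latency_ms"]  # about 2.31 ms

print(f"latency reduction vs MemGen: {latency_reduction:.2f}x")
print(f"throughput gain vs MemGen:   {throughput_gain:.2f}x")
print(f"VRAM overhead vs vanilla:    {vram_overhead_gb:.2f} GB")
print(f"latency overhead vs vanilla: {latency_overhead_ms:.2f} ms/step")
```

The latency and throughput ratios agree, as expected when throughput is the inverse of per-step latency up to measurement noise.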

5. Empirical Evaluation of FlashMem-LM

On challenging reasoning and summarization benchmarks (GSM8K, MATH, GPQA, BookSum, GovReport, KodCode), the framework matches or outperforms heavy-weight memory baselines with the following key results:

  • Latency: At 64k-token context, FlashMem's latency (12.28 ms/step) is close to vanilla (9.97 ms/step) and far below MemGen's (61.99 ms/step).
  • Accuracy: GSM8K accuracy (Qwen 2.5B): MemGen 70.54%, FlashMem 70.09%, Vanilla 65.12%.
  • Memory fidelity: FlashMem stays within a small margin of MemGen’s accuracy or exceeds it, especially on smaller models for challenging mathematical tasks.
  • Ablation studies: Single-layer consolidators suffice; the entropy-based monitor reliably predicts resolution of epistemic uncertainty.

This demonstrates FlashMem’s utility in bridging efficiency and persistent cognition in LLM-based agents, without loss in core task performance (Hou et al., 9 Jan 2026).

6. FlashMem for Mobile DNN Workloads: Memory Streaming Architecture

FlashMem, in the context of resource-constrained mobile DNN accelerators, is a streaming execution engine that eliminates the inefficiencies of full-weight preloading. The system statically schedules a hybrid of preloading and just-in-time streaming of model weights, governed by a CP-SAT-based overlap plan.

Key Components

  • Static Model-Loading Scheduler (OPG): Solves a constrained optimization problem that balances full preloading of weights against chunk-wise streaming under explicit DRAM and texture memory constraints.

  • Dynamic Streaming Engine: For each executed layer, model weights not already in texture memory are streamed from disk via unified (CPU) memory and DMA into 2.5D texture layouts directly favored by mobile GPUs. A ring buffer manages texture pages, asynchronously evicting unneeded weights.
  • 2.5D Texture Memory: Weights are tiled into micro-tiles (H, W, 4) on disk, ensuring that disk reads can be mapped directly to GPU image objects with no online transformation (yielding a 2.5–3.5× speedup over unified-memory transform pipelines).
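To make the (H, W, 4) layout concrete, the sketch below packs a flat weight list into rows of four-channel texels, which is the shape a mobile GPU image object typically expects. It is an illustration only: the function name, the zero-padding policy, and the row-major ordering are assumptions, and the actual on-disk tile format used by FlashMem is not specified here.

```python
def pack_rgba_texture(weights, width):
    """Pack a flat weight list into a height x width grid of 4-channel texels.

    Assumptions: row-major texel order and zero padding for the tail.
    """
    texel_count = -(-len(weights) // 4)   # ceiling division: 4 values per texel
    height = -(-texel_count // width)     # rows needed at the given width
    padded = list(weights) + [0.0] * (height * width * 4 - len(weights))
    texture = []
    for row in range(height):
        texels = []
        for col in range(width):
            start = (row * width + col) * 4
            texels.append(padded[start:start + 4])
        texture.append(texels)
    return texture

# Ten weights at width 2 need three texels, so two rows with padding.
texture = pack_rgba_texture([float(i) for i in range(10)], width=2)
for row in texture:
    print(row)
```

If this packing is done once offline and the result is stored on disk, a read at load time already has the texel layout, which is the point of the "no online transformation" claim above.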

Memory and Latency Reduction Formulas

  • Streaming peak footprint: the peak memory held while weights are streamed, in contrast to the full-model footprint of preloading.
  • Inference latency speedup: the ratio of preloading latency to streaming latency, with both latency terms taking I/O-compute overlap into account.
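The trade-off between peak footprint and latency can be explored with a toy model. Everything in the sketch below is an assumption made for illustration: a streamed layer's load is overlapped only with the previous layer's compute, peak memory is the preloaded weights plus the single largest streamed layer, and an exhaustive search over preload sets stands in for the CP-SAT formulation used by the actual scheduler. The layer figures are invented.

```python
from itertools import combinations

# Hypothetical per-layer costs: weight size, compute time, and load time.
layers = [
    {"size_mb": 40, "compute_ms": 5, "load_ms": 8},
    {"size_mb": 10, "compute_ms": 4, "load_ms": 2},
    {"size_mb": 30, "compute_ms": 6, "load_ms": 9},
]

def evaluate(layers, preloaded):
    """Return (peak_mb, latency_ms) for a set of preloaded layer indices."""
    resident = sum(layers[i]["size_mb"] for i in preloaded)
    streamed = [l["size_mb"] for i, l in enumerate(layers) if i not in preloaded]
    peak_mb = resident + max(streamed, default=0)  # one streaming buffer (assumption)
    latency_ms = 0.0
    prev_compute = 0.0
    for i, layer in enumerate(layers):
        if i not in preloaded:
            # Only the part of the load not hidden behind prior compute stalls.
            latency_ms += max(0.0, layer["load_ms"] - prev_compute)
        latency_ms += layer["compute_ms"]
        prev_compute = layer["compute_ms"]
    return peak_mb, latency_ms

def plan(layers, budget_mb):
    """Exhaustively pick the preload set with the lowest latency within budget."""
    best = None
    for r in range(len(layers) + 1):
        for subset in combinations(range(len(layers)), r):
            peak_mb, latency_ms = evaluate(layers, set(subset))
            if peak_mb <= budget_mb and (best is None or latency_ms < best[1]):
                best = (set(subset), latency_ms, peak_mb)
    return best

stream_all = evaluate(layers, set())
preload_all = evaluate(layers, {0, 1, 2})
best_plan = plan(layers, budget_mb=75)
print("stream everything:", stream_all)
print("preload everything:", preload_all)
print("best within 75 MB:", best_plan)
```

Exhaustive search is exponential in the number of layers and is only workable for toy inputs; a constraint solver is the natural replacement at realistic model sizes.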

Empirical benchmarking over 11 models (Table 1, Shu et al., 17 Feb 2026) reports reduced memory use relative to preloading frameworks and inference speedups across modern DNN workloads, supporting models up to 70B parameters and multi-model pipelines.

7. Hardware-Level Flash Map Management (Brief Reference)

In the context of flash storage devices, the Flash Map Management Unit (FMMU) architecture (Woo et al., 2017) automates address translation at the hardware level for SSDs. The FMMU implements two-level SRAM caches (CMT and CTP), non-blocking request pipelines, and cost-aware dirty translation management to eliminate the translation bottleneck in NAND flash storage, achieving a 44% reduction in Flash Translation Layer execution time on map cache hits and effective scaling to multi-channel, high-performance systems.
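The idea of a two-level translation cache can be shown with a small software model. This is a generic sketch, not a description of the FMMU hardware: it assumes the first level caches individual logical-to-physical entries and the second level caches whole translation pages, which may not match the exact roles of CMT and CTP. It is also read-only and blocking, so the non-blocking pipeline and dirty-entry management are left out. The class name, LRU policy, and page size are hypothetical.

```python
from collections import OrderedDict

ENTRIES_PER_PAGE = 4  # toy value: mappings stored per translation page

class TwoLevelMapCache:
    """Read-only logical-to-physical translation with two LRU cache levels."""

    def __init__(self, flash_map, entry_capacity, page_capacity):
        self.flash_map = flash_map          # page number -> list of physical addresses
        self.entry_cache = OrderedDict()    # level 1: single mapping entries
        self.page_cache = OrderedDict()     # level 2: whole translation pages
        self.entry_capacity = entry_capacity
        self.page_capacity = page_capacity
        self.stats = {"entry_hit": 0, "page_hit": 0, "flash_read": 0}

    @staticmethod
    def _insert(cache, key, value, capacity):
        """Insert as most recently used and evict the least recently used."""
        cache[key] = value
        cache.move_to_end(key)
        while len(cache) > capacity:
            cache.popitem(last=False)

    def translate(self, lpn):
        """Return the physical address for logical page number `lpn`."""
        if lpn in self.entry_cache:
            self.entry_cache.move_to_end(lpn)
            self.stats["entry_hit"] += 1
            return self.entry_cache[lpn]
        page_no, offset = divmod(lpn, ENTRIES_PER_PAGE)
        if page_no in self.page_cache:
            self.page_cache.move_to_end(page_no)
            self.stats["page_hit"] += 1
        else:
            # Slow path: fetch the translation page from flash.
            self.stats["flash_read"] += 1
            self._insert(self.page_cache, page_no,
                         self.flash_map[page_no], self.page_capacity)
        ppn = self.page_cache[page_no][offset]
        self._insert(self.entry_cache, lpn, ppn, self.entry_capacity)
        return ppn

flash_map = {0: [100, 101, 102, 103], 1: [200, 201, 202, 203]}
cache = TwoLevelMapCache(flash_map, entry_capacity=2, page_capacity=1)
results = [cache.translate(lpn) for lpn in [0, 0, 1, 5, 2]]
print(results, cache.stats)
```

Only the slow path touches flash, so the hit rate of the two levels determines how often translation stalls a request.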


In aggregate, FlashMem architectures combine theoretical insight (injectivity, sufficiency, optimal streaming) with systems engineering (direct cache reuse, entropy-based gating, branch-free tiling, and asynchronous execution). The reported results show substantial improvements in memory efficiency and computational performance while keeping accuracy close to, and in some cases above, heavier baselines on LLM and DNN benchmarks (Hou et al., 9 Jan 2026, Shu et al., 17 Feb 2026). The common thread is that memory management, computation reuse, and tightly coupled scheduling are designed together to address persistent cognition and efficient inference in modern AI systems.
