
Dynamic Memory Compression

Updated 23 January 2026
  • Dynamic Memory Compression is a suite of adaptive techniques that compress neural model activations, weights, and caches to reduce runtime memory usage.
  • It leverages methods such as bitwidth-adaptive quantization, entropy-driven coding, and trainable cache merging to balance performance improvements with minimal accuracy trade-offs.
  • Practical implementations demonstrate up to 22.9× memory savings and 3.2× training speedups in on-device, LLM, and system-level applications.

Dynamic Memory Compression (DMC) refers to a broad set of algorithmic and system-level techniques for adaptively reducing the runtime memory footprint of neural networks, LLMs, memory-centric data processing systems, and hardware/software memory hierarchies. DMC approaches are distinguished by their adaptivity—compression is carried out online, responding to data access patterns, resource constraints, or the intrinsic compressibility of states, activations, or model parameters. State-of-the-art DMC strategies span neural activation quantization, lossless/lossy weight encoding, hardware-level block compression, trainable cache merging in transformers, and hierarchical memory management in systems and model checking.

1. Adaptive Compression for DNN Activations and Inference Caches

A central challenge in on-device and large-scale training/inference is the memory explosion caused by storing intermediate activations or key–value (KV) caches, particularly evident in deep neural networks and transformers. In on-device supervised learning, the DAF (Dynamic Activation Framework) system applies bitwidth-adaptive quantization to activations on a per-layer basis, using a quantization function

\hat{x} = Q(x;\,\alpha, b) = \mathrm{clamp}\left(\mathrm{round}\left(\frac{x - x_{\min}}{\alpha}\right),\, 0,\, 2^b - 1\right)

with $\alpha = \frac{x_{\max}-x_{\min}}{2^b-1}$ and $b\in\{0,\ldots,8\}$, which enables memory savings up to $22.9\times$ and training speedups up to $3.2\times$ with negligible ($<1\%$) accuracy losses. DAF implements a suite of kernel-level and OS-level optimizations—hybrid (tree/atomic) reductions for quantization statistics, CPU–GPU collaborative bit packing using SIMD instructions, and importance-aware paging with eviction (based on moving-average error sensitivity and memory pressure) managed via red–black trees to bound fragmentation and memory usage (Liu et al., 9 Jul 2025).
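The quantization function above can be sketched directly in NumPy. This is a minimal illustration of uniform bitwidth-adaptive quantization, not DAF's actual kernels (which add SIMD bit packing, paging, and per-layer bitwidth selection); the function names are hypothetical.

```python
import numpy as np

def quantize(x: np.ndarray, b: int):
    """Uniform b-bit quantization of an activation tensor (valid for b >= 1).

    alpha = (x_max - x_min) / (2^b - 1); codes are clamped to [0, 2^b - 1],
    matching the clamp(round((x - x_min) / alpha), 0, 2^b - 1) form above.
    """
    x_min, x_max = float(x.min()), float(x.max())
    alpha = (x_max - x_min) / (2**b - 1)
    codes = np.clip(np.round((x - x_min) / alpha), 0, 2**b - 1)
    return codes.astype(np.uint8), alpha, x_min

def dequantize(codes: np.ndarray, alpha: float, x_min: float) -> np.ndarray:
    """Reconstruct activations; per-element error is bounded by alpha / 2."""
    return codes.astype(np.float32) * alpha + x_min

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)).astype(np.float32)
codes, alpha, x_min = quantize(x, b=4)
x_hat = dequantize(codes, alpha, x_min)
assert np.max(np.abs(x - x_hat)) <= alpha / 2 + 1e-5
```

At $b=4$ this stores one byte of codes per activation (packable to half a byte), versus four bytes for FP32, which is where the large memory savings originate.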

In LLMs, transformer decoders' KV caches grow linearly with context, threatening both GPU/TPU memory capacity and inference throughput. DMC for transformers introduces per-head, per-layer online decisions to either append or merge (weighted accumulate) KV entries, implemented via a binary gating variable $\alpha_{h,\ell,t} = \mathrm{round}\bigl(\sigma(k_{h,\ell,t}[0])\bigr)$ and importance weights $\omega_{h,\ell,t}$. This dynamic cache compaction, retrained with a Gumbel-Sigmoid relaxation for differentiability, achieves $2\times$–$4\times$ compression and $1.8\times$–$3.7\times$ throughput increases with zero or negligible downstream degradation, substantially outperforming fixed-schedule token pooling or grouped-query attention (GQA) baselines (Nawrot et al., 2024).
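The append-or-merge decision rule can be sketched as follows. This toy single-head cache shows only the inference-time data flow; the importance-weight parametrization here is a simplification, and the Gumbel-Sigmoid training of the gate is not shown. The class name is hypothetical.

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

class DMCCache:
    """Append-or-merge KV cache for one attention head (inference sketch)."""

    def __init__(self):
        self.keys, self.values, self.weights = [], [], []

    def update(self, k: np.ndarray, v: np.ndarray):
        # Gate alpha = round(sigmoid(k[0])): 1 -> merge, 0 -> append.
        alpha = 1 if sigmoid(k[0]) >= 0.5 else 0
        omega = sigmoid(k[0])  # importance weight (simplified stand-in)
        if alpha == 1 and self.keys:
            # Merge: weighted accumulate into the last cache slot, so the
            # slot holds (sum omega_i k_i) / (sum omega_i) as a running avg.
            w = self.weights[-1]
            self.keys[-1] = (w * self.keys[-1] + omega * k) / (w + omega)
            self.values[-1] = (w * self.values[-1] + omega * v) / (w + omega)
            self.weights[-1] = w + omega
        else:
            # Append: open a new cache slot for this token.
            self.keys.append(k.astype(float))
            self.values.append(v.astype(float))
            self.weights.append(omega)
```

Because merged tokens share one slot, the cache length (and thus KV memory) grows sublinearly whenever the learned gate fires.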

2. Lossless and Lossy Model Weight Compression

NeuZip establishes an entropy-theoretic approach to compressing neural weights through bit-splitting and per-field (sign, exponent, mantissa) modeling. Empirically, BFloat16 parameter exponents exhibit very low entropy ($H(e)\approx 2.8$ bits), permitting efficient per-layer ANS (Asymmetric Numeral Systems) coding. Mantissa truncation is reserved for inference-only settings, with the relative loss bounded by $1/2^k$ for $k$-bit mantissa retention. The total effective compressed representation approaches $10.8$ bits/weight (versus $16$ bits raw), achieving $1.48\times$ theoretical and $51\%$ empirical memory savings for state-of-the-art LLM training and $>50\%$ at inference, all while maintaining lossless backward compatibility and unconstrained training dynamics (Hao et al., 2024). Compression/decompression overheads are amortized by layer-wise processing; throughput matches that of mainstream quantization schemes.
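The low-entropy premise behind this approach is easy to check empirically. The sketch below measures the Shannon entropy of the exponent field of Gaussian-initialized weights (BFloat16 shares float32's 8-bit exponent, so it can be read from the float32 bit pattern); the ANS coder itself is not implemented here, and the initialization scale is an assumption for illustration.

```python
import numpy as np

def exponent_entropy(weights: np.ndarray) -> float:
    """Empirical Shannon entropy (bits/symbol) of the 8-bit exponent field."""
    bits = np.ascontiguousarray(weights.astype(np.float32)).view(np.uint32)
    exponents = ((bits >> 23) & 0xFF).astype(np.int64)  # float32 bits 23..30
    counts = np.bincount(exponents, minlength=256).astype(np.float64)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=100_000)  # assumed weight-init scale
h = exponent_entropy(w)
# Entropy is far below the 8 raw exponent bits, so an entropy coder such
# as ANS can shrink the exponent plane substantially without any loss.
assert 0.0 < h < 8.0
```

Because the coding is lossless, gradients and optimizer states see exactly the same weights as an uncompressed run, which is why training dynamics are unconstrained.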

In contrast to static quantization, which introduces performance-degrading rounding errors, and to low-rank adapter approaches such as LoRA/QLoRA (which restrict trainable capacity), entropy-driven DMC methods are "always-on," require no model re-architecture, and do not interfere with the loss landscape or convergence.

3. Hardware and System-Level Memory Compression Frameworks

At the hardware level, DMC encompasses both narrow (controller-resident) and system-wide architectural features. In AI accelerators, DMC techniques such as bit-plane disaggregation and on-die block compression (LZ4, ZSTD) are embedded within memory controllers. Data (weights, KV caches) are partitioned into planes (e.g., per-bit, exponent, mantissa), clustered (e.g., cross-token, per-channel), and compressed before being routed to DRAM. At inference, partial-plane fetching enables dynamic precision scaling—e.g., serving high-importance context at FP16 and less salient tokens at FP8/FP4—directly matching DRAM bandwidth and energy use to task demands. Prototyped subsystems demonstrate lossless $25.2\%$ model-weight and $46.9\%$ KV-cache reductions, up to $30\%$ DRAM energy savings, and $8$ TB/s throughput with minimal ($<4\,\mathrm{mm}^2$) area/power cost (Xie et al., 24 Mar 2025).

In conventional memory hierarchies, architectures such as CRAM employ metadata-free, marker-based cache line grouping to exploit on-the-fly line-level compressibility, increasing DRAM bandwidth via implicit 2–4-to-1 line packing, managed by a hardware line location predictor (LLP) with $98\%$ prediction accuracy. A dynamic cost-benefit controller toggles compression based on measured DRAM write and retrieval benefits, delivering a robust average speedup of $6\%$ (up to $73\%$ in favorable patterns) and comprehensive bandwidth/energy reductions with $<300$ bytes of on-controller state (Young et al., 2018).

System software mechanisms extend DMC to OS/user-space by managing data across multiple compressed memory tiers. Projects such as TierScape dynamically allocate application pages among $N$ software-defined pools (distinguished by algorithm, allocator, and backing media). A user-space daemon profiles access hotness, solves an ILP for optimal placement given TCO and performance constraints, and controls migration among tiers. This approach yields $22$–$40$ percentage point TCO savings beyond state-of-the-art two-tiered systems and preserves or even improves runtime on real-world benchmarks with judicious tuning (Kumar et al., 2024).
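The hotness-aware placement problem can be sketched with a greedy heuristic standing in for TierScape's ILP. The tier names and cost figures below are entirely hypothetical; the point is only the shape of the trade-off: hot pages go to cheap-to-access tiers, cold pages to cheap-to-store tiers, under a storage budget.

```python
# Hypothetical tiers: (name, access latency cost, storage cost per page).
TIERS = [
    ("fast-lz4", 1.0, 0.6),    # fast codec, weak compression: costly to store
    ("slow-zstd", 4.0, 0.3),   # slower codec, better ratio
    ("far-media", 10.0, 0.1),  # cheapest storage, slowest access
]

def place_pages(hotness: list[float], budget: float) -> dict[int, str]:
    """Greedy stand-in for an ILP placement: assign hot pages first,
    giving each the fastest tier that still fits the storage budget."""
    order = sorted(range(len(hotness)), key=lambda i: -hotness[i])
    placement, spent = {}, 0.0
    for page in order:
        for name, latency, store_cost in TIERS:
            is_last = name == TIERS[-1][0]
            if spent + store_cost <= budget or is_last:
                placement[page] = name  # last tier is the overflow fallback
                spent += store_cost
                break
    return placement

plan = place_pages([0.9, 0.1, 0.5, 0.3, 0.7], budget=1.2)
# the hottest page lands in the fast tier, the coldest overflows outward
assert plan[0] == "fast-lz4" and plan[1] == "far-media"
```

A real ILP would jointly minimize latency and storage cost under both constraints; the greedy version captures the same monotone hot-to-fast structure in a few lines.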

4. Trainable Cache and Memory Bank Compression in LLMs

Recent innovations in DMC for LLMs utilize trainable, data-driven strategies to control which cached or memory representations are retained, evicted, or merged during and after training.

  • Dynamic Memory Sparsification (DMS): During inference, tokens marked for eviction are not dropped immediately but delayed for a preset window, allowing the model to merge their information implicitly and retain attention quality at a lower cache footprint. A lightweight, Gumbel-Sigmoid-based scorer learns eviction policies post hoc, requiring just $1$K retrofitting steps and yielding $4$–$8\times$ cache compression with minimal accuracy impact, enabling "hyper-scaling"—extended context or batch size for fixed compute/memory (Łańcucki et al., 5 Jun 2025).
  • Memory Bank Compression (MBC): For continual learning, DMC is realized via codebook-based vector quantization: document/context embeddings are quantized to a small learned codebook of size $N_c$, indexed, and only the indices plus codebook are stored. A critical online resetting step maintains codeword diversity (preventing collapse). Aggregation and retrieval are made efficient by hierarchical pairing; memory reduction reaches $0.3\%$ of baseline footprints with F1/EM retention $>96\%$. KV-LoRA adapters allow the compressed codes to be used during inference with less than $0.5\%$ extra trainable parameters (Katraouras et al., 2 Jan 2026).
  • Compression Beacons in Transformers: Breadcrumbs Reasoning introduces periodic learned beacon tokens that compress the most recent $c$ tokens into a single summary KV entry. This blockwise, RL-distilled joint training achieves bounded memory ($L/c$ scaling) and strict Pareto improvements in memory–accuracy trade-offs on autoregressive multi-step reasoning tasks. Compared to streaming or training-free baselines, accuracy is preserved or even improved under aggressive ($8$–$32\times$) cache reductions (Monea et al., 15 Oct 2025).
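The delayed-eviction mechanism from the DMS bullet above can be sketched as a small data structure. This is a toy illustration of the sliding-window bookkeeping only: the class name is hypothetical, and the learned Gumbel-Sigmoid scorer is replaced by a caller-supplied score.

```python
from collections import deque

class DelayedEvictionCache:
    """Sketch of DMS-style delayed eviction: a token scored for eviction
    stays readable for `delay` more steps before removal, so attention can
    still read (and implicitly merge) its information in the interim."""

    def __init__(self, delay: int):
        self.delay = delay
        self.live = []          # list of (token_id, kv) still in the cache
        self.pending = deque()  # FIFO of (evict_at_step, token_id)
        self.step = 0

    def add(self, token_id, kv, evict_score: float, threshold: float = 0.5):
        self.step += 1
        self.live.append((token_id, kv))
        if evict_score > threshold:
            # Marked for eviction, but retained for `delay` more steps.
            self.pending.append((self.step + self.delay, token_id))
        # Evict tokens whose grace window has expired.
        while self.pending and self.pending[0][0] <= self.step:
            _, tid = self.pending.popleft()
            self.live = [(t, v) for t, v in self.live if t != tid]
```

For example, with `delay=2`, a token scored for eviction at step 2 remains readable through step 3 and disappears once step 4 begins, keeping the steady-state cache size roughly proportional to the fraction of retained tokens.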

5. Hierarchical and Model Checking Applications of DMC

DMC schemes are also effective for hierarchical transformer memory management and distributed or variable-length state representations:

  • Hierarchical Compression (MELODI): Separates short-term (recurrent, cross-layer) and long-term (aggregative queue, mid-layer) memories, employing recurrent compression operators (STM) and a single long-term memory aggregation (LTM) function. STM compresses transitions between context windows; LTM aggressively compresses summaries and batches into a FIFO queue for deep history. This architecture delivers an $8\times$ reduction in memory while matching or surpassing baselines on long-context datasets (Chen et al., 2024).
  • Variable-Length State Compression in Model Checking: In dynamic memory-centric settings (e.g., multi-core software model checking), DMC leverages a tree-of-trees ("dtree") structure for variable-length state vectors, supporting incremental updates, partial reads, and significant subvector sharing, essential for heap-allocated/concurrent data structures. Compared to fixed-length and generic hash table-based state stores, dtree-based DMC is up to $2.9\times$ faster and $29\%$ more compact on dynamic, variable-length benchmarks (Berg, 2020).
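The two-level memory layout described for MELODI can be sketched as a data-flow skeleton. The real system uses learned transformer layers for both compression operators; this toy replaces them with fixed pooling (an assumption, flagged in the comments) and shows only the bounded-memory structure.

```python
from collections import deque
import numpy as np

class HierarchicalMemory:
    """Toy two-level memory in the spirit of MELODI: a fixed-size recurrent
    short-term memory (STM) absorbs each context window, while a FIFO
    long-term memory (LTM) keeps one aggressively pooled summary per window."""

    def __init__(self, stm_slots: int = 4, ltm_slots: int = 16, dim: int = 8):
        self.stm = np.zeros((stm_slots, dim))
        self.ltm = deque(maxlen=ltm_slots)  # FIFO queue bounds deep history

    def process_window(self, tokens: np.ndarray):
        # STM: recurrent compression across windows. Mean pooling plus an
        # exponential blend stands in for the learned compression operator.
        summary = tokens.mean(axis=0)
        self.stm = 0.5 * self.stm + 0.5 * summary
        # LTM: one summary vector per window; the deque's maxlen evicts the
        # oldest entry, so total memory stays constant however long the run.
        self.ltm.append(summary)
```

However many windows are processed, memory stays at `stm_slots + ltm_slots` vectors, which is the source of the fixed-footprint scaling the bullet describes.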

6. Limitations, Trade-Offs, and Guidelines

DMC methods present design and implementation trade-offs:

  • Accuracy vs. Compression: Bitwidth adaptation, mixed-precision, and delayed eviction entail a balance between accuracy preservation and compressibility. Trainable DMC methods, such as DMC for LLMs or MBC, impose some tuning/validation cost to hit the memory–quality Pareto frontier.
  • Hardware/Kernel Overheads: Depending on deployment (AI accelerator, general-purpose CPU/GPU, OS tiered memory), DMC can require hardware pipelines (for bit-plane shuffle, compression), small metadata buffers, or user/kernel daemons. Most schemes report negligible area/power or kernel cycle overhead.
  • Inference vs. Training: Certain DMC schemes are lossless for both training and inference (e.g., NeuZip), while others (bitwidth truncation, cache eviction) are best applied at inference or with explicit retraining.
  • Compatibility: DMC is usually integrated as a drop-in function—leveraging on-device memory management, attention backends, or memory controllers—but some flexibility in kernel/backend code is advantageous for optimal results.
  • Parameter Sensitivity: Schemes such as MBC or DMS are sensitive to codebook size, window size, and target compression, requiring principled hyperparameter selection.

7. Empirical Performance and Benchmarks

Extensive experimental results across neural training/inference, LLMs, and data processing confirm the practical advantages of DMC, with representative figures summarized below:

| System / Task | Memory Reduction | Speedup / Throughput | Accuracy / Quality Drop | Reference |
|---|---|---|---|---|
| On-device DNN (DAF) | 11.2–22.9× | 3.2× | <1% | (Liu et al., 9 Jul 2025) |
| LLM KV cache (DMC) | 2–4× | 1.8–3.7× | 0 to −0.7 pts (MMLU) | (Nawrot et al., 2024) |
| NeuZip (BFloat16, large LLM) | 1.5× (theoretical) | No slowdown | None / near-lossless | (Hao et al., 2024) |
| AI accelerator (weights/KV, ZSTD) | 25.2% / 46.9% | 8 TB/s mode | None (lossless) | (Xie et al., 24 Mar 2025) |
| LLM (DMS, CR = 4–8×) | 4–8× | 4× memory & KV speedup | <3.5 pts avg (AIME24) | (Łańcucki et al., 5 Jun 2025) |
| Model checking (dtree) | 29% more compact | 2.9× faster | None | (Berg, 2020) |
| System memory (TierScape) | 22–40 pp TCO savings over two-tier | 2–10% perf. improvement | None / TCO-tunable | (Kumar et al., 2024) |

Applications, ablation studies, and convergence analyses consistently show that DMC achieves near-theoretical compression in memory-bound scenarios without sacrificing core model accuracy or introducing unacceptable system complexity.


In summary, Dynamic Memory Compression represents a cross-disciplinary paradigm that integrates algorithmic, systems, and hardware approaches to adaptive, data-driven memory compaction, enabling previously impractical neural, data, and concurrent software applications on constrained compute platforms, and setting new benchmarks for memory–accuracy–performance trade-offs across the computing stack (Liu et al., 9 Jul 2025, Nawrot et al., 2024, Hao et al., 2024, Xie et al., 24 Mar 2025, Łańcucki et al., 5 Jun 2025, Katraouras et al., 2 Jan 2026, Monea et al., 15 Oct 2025, Young et al., 2018, Chen et al., 2024, Berg, 2020, Kumar et al., 2024).
