
Memory Reuse Rate (MRR): Metrics & Implications

Updated 30 April 2026
  • Memory Reuse Rate (MRR) is a metric that measures the extent to which systems reuse cached computation or data, reducing the need for recomputation and primary memory fetches.
  • It is defined and evaluated across diverse domains using methods like cache hit indicators, token reuse counts, and reuse distance profiles, enhancing system efficiency in models and hardware.
  • Optimizing MRR involves trade-offs in cache strategies, semantic indexing, and dataflow architectures, with empirical studies showing significant speedups and resource savings.

Memory Reuse Rate (MRR) quantifies the proportion of computation or memory operations in a system that are serviced by reusing previously stored results or data, rather than recomputing or fetching from primary memory. The metric provides a fundamental measure of temporal and spatial locality across diverse domains ranging from deep neural network (DNN) accelerators and static code analysis to transformer inference and large-reasoning models. Its computation and interpretation are strongly context-dependent but consistently play a critical role in optimizing latency, bandwidth, and overall system efficiency.

1. Formal Definitions Across Domains

The definition of MRR varies by domain but shares the core objective of quantifying effective reuse:

  • Transformer Inference (LLMCache): For a transformer with $L$ layers and a dataset $D$, each input $X \in D$ performs one cache lookup per layer. Define the hit indicator

$$I_l(X) = \begin{cases} 1 & \text{if the cache at layer } l \text{ is used for } X \\ 0 & \text{otherwise} \end{cases}$$

Then the global MRR is

$$\mathrm{MRR} = \frac{1}{|D| \cdot L} \sum_{X \in D} \sum_{l=1}^{L} I_l(X),$$

with $\mathrm{MRR}_l = \frac{1}{|D|} \sum_{X \in D} I_l(X)$ as the layerwise rate (Bansal, 18 Dec 2025).
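As a minimal sketch, both rates can be computed directly from a hit-indicator matrix; the indicator values below are invented for illustration, not taken from the paper:

```python
# Hypothetical hit indicators: hits[i][l] = 1 if input i was served
# from the layer-l cache, 0 if the layer had to be recomputed.
hits = [
    [1, 1, 0, 0],  # input 1
    [1, 0, 1, 0],  # input 2
    [1, 1, 1, 0],  # input 3
]
num_inputs, num_layers = len(hits), len(hits[0])

# Global MRR: fraction of all (input, layer) lookups served from cache.
mrr_global = sum(map(sum, hits)) / (num_inputs * num_layers)

# Layerwise MRR_l: hit fraction at each layer across the dataset.
mrr_layer = [sum(row[l] for row in hits) / num_inputs
             for l in range(num_layers)]

print(mrr_global)  # 7 hits / 12 lookups ≈ 0.583
print(mrr_layer)   # [1.0, 0.666..., 0.666..., 0.0]
```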

  • Large Reasoning Models (ENGRAM-R): MRR is the proportion of tokens saved by reusing memory instead of recomputation:

$$\mathrm{MRR} = \frac{T_\mathrm{FC} - T_\mathrm{ER}}{T_\mathrm{FC}},$$

where $T_\mathrm{FC}$ and $T_\mathrm{ER}$ are the token counts for full-context and memory-augmented runs, respectively (Patel et al., 17 Nov 2025).
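This token-saving form is a one-line computation; the token counts in the example below are illustrative, not results from the paper:

```python
def token_saving_mrr(t_fc: int, t_er: int) -> float:
    """MRR = (T_FC - T_ER) / T_FC: fraction of tokens saved by memory reuse."""
    if t_fc <= 0:
        raise ValueError("full-context token count must be positive")
    return (t_fc - t_er) / t_fc

# A full-context run of 12,000 tokens vs. 3,400 tokens with memory reuse:
print(token_saving_mrr(12_000, 3_400))  # 0.7166... -> about 71.7% of tokens saved
```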

  • DNN Accelerator Architecture (Voltra):

$$\mathrm{MRR} = \frac{R}{F},$$

where $R$ is the number of local operand (on-chip) reuses and $F$ is the count of expensive off-chip or shared-memory fetches (Yi et al., 11 Feb 2026).
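Measured from hardware counters, this is again a simple ratio; the counter readings below are invented for illustration:

```python
# Illustrative counter readings: every operand fetched from shared or
# off-chip memory (F) is subsequently reused many times on-chip (R).
on_chip_reuses = 4096      # R: local operand reuses counted on-chip
off_chip_fetches = 256     # F: expensive shared/off-chip fetches

mrr = on_chip_reuses / off_chip_fetches
print(mrr)  # 16.0 reuses amortized per expensive fetch
```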

  • LLVM Static Analysis:

After solving for a reuse distance profile $P(d)$ (the probability that a memory access has reuse distance $d$), MRR for an LRU cache of capacity $C$ is computed as:

$$\mathrm{MRR}(C) = \sum_{d < C} P(d),$$

representing the steady-state hit rate for the cache under the observed profile (Barai et al., 2023).
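The same quantity can be cross-checked dynamically. The sketch below builds a reuse-distance (LRU stack distance) histogram from a short address trace and sums the probability mass below the cache capacity; it is a trace-based stand-in for the static profile, with hypothetical function names:

```python
from collections import OrderedDict

def reuse_distance_profile(trace):
    """Histogram P(d): fraction of accesses whose reuse distance is d
    (d = distinct addresses touched since the last use; inf for first use)."""
    stack = OrderedDict()   # LRU stack: most recently used entry is last
    counts = {}
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            d = len(keys) - 1 - keys.index(addr)  # distinct addrs since last use
            stack.move_to_end(addr)
        else:
            d = float("inf")
            stack[addr] = None
        counts[d] = counts.get(d, 0) + 1
    n = len(trace)
    return {d: c / n for d, c in counts.items()}

def lru_mrr(profile, capacity):
    """Steady-state LRU hit rate: cumulative probability of d < capacity."""
    return sum(p for d, p in profile.items() if d < capacity)

profile = reuse_distance_profile(["a", "b", "a", "b", "c", "a"])
print(lru_mrr(profile, 2))  # 2 of 6 accesses hit in a 2-entry LRU cache
```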

2. Measurement and Algorithmic Realizations

Each context requires distinct measurement strategies, typically leveraging either dynamic instrumentation or static analysis:

  • LLMCache: Semantic fingerprinting (SimHash, PCA, or MinHash) generates fixed-length keys for each input. Cache banks per layer store activations indexed by these fingerprints. Cosine or Jaccard similarity governs matches, subject to a tunable similarity threshold. Hits are tallied by PyTorch hooks, and MRR is computed as the hit fraction after inference (Bansal, 18 Dec 2025).
  • ENGRAM-R: Instrumentation of the inference loop counts tokens in both the baseline and memory-reusing runs for both input and reasoning steps. Fact-card rendering and citation control guarantee that evidence is genuinely reused, not simply rephrased (Patel et al., 17 Nov 2025).
  • Voltra Accelerator: Hardware counter arrays count both operand fetches from shared/off-chip memory and the number of subsequent on-chip reuses. Analytical models relate the unrolling factors of the 2D and 3D systolic dataflows to achievable MRR (Yi et al., 11 Feb 2026).
  • LLVM Static Analysis: Construction of a bracketed static memory trace from control-flow and loop annotations enables recursive computation of the reuse distance histogram. Once the profile is obtained, the MRR is immediately calculated for any cache capacity using the definition above. Notably, this method executes in time independent of program input size (Barai et al., 2023).
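The fingerprint-indexed lookup described in the first bullet can be sketched as follows. This is a toy SimHash over feature vectors with a cosine-similarity gate; the class, parameter names, and threshold value are illustrative assumptions, not the LLMCache API:

```python
import math
import random

random.seed(0)
DIM, BITS = 8, 16
# Random hyperplanes for SimHash: one sign bit per plane.
PLANES = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def simhash(vec):
    """Fixed-length fingerprint: sign pattern of projections onto random planes."""
    bits = 0
    for i, plane in enumerate(PLANES):
        if sum(p * v for p, v in zip(plane, vec)) >= 0:
            bits |= 1 << i
    return bits

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

class LayerCache:
    """One per-layer cache bank: fingerprint -> (input vector, stored activation)."""
    def __init__(self, tau=0.88):
        self.tau = tau      # similarity threshold gating reuse
        self.bank = {}
    def lookup(self, vec):
        entry = self.bank.get(simhash(vec))
        if entry is not None and cosine(entry[0], vec) >= self.tau:
            return entry[1]  # cache hit: reuse stored activation
        return None          # miss: recompute the layer
    def store(self, vec, activation):
        self.bank[simhash(vec)] = (vec, activation)
```

A higher `tau` rejects more near-matches (fewer hits, higher fidelity); a lower `tau` raises the hit rate at some quality cost, mirroring the threshold ablations discussed below.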

3. Reported MRR Values and Correlations

Empirical measurements across systems reveal strong associations between high MRR, throughput gains, and resource savings:

| System/Task | Reported MRR (or proxy) | Impact on latency/throughput | Impact on fidelity |
|---|---|---|---|
| GPT-2, WikiText-103 (Bansal, 18 Dec 2025) | 72% global; >90% in low layers | 2.2×–3.1× speedup | <0.5% accuracy loss |
| BERT-base, SQuAD (Bansal, 18 Dec 2025) | 78% global | – | <0.2% F1 drop |
| ENGRAM-R, LoCoMo (Patel et al., 17 Nov 2025) | 88.4% input; 71.7% reasoning | 68% latency reduction | +2.5% for multi-hop |
| Voltra, ResNet50 (Yi et al., 11 Feb 2026) | 100% spatial utilization (proxy) | up to 2.12–2.94× reuse boost | – |
| LLVM static analysis (Barai et al., 2023) | derived per cache profile | – | – |

Extensive ablations on the similarity threshold, cache budget, and eviction policy in LLMCache reveal distinctive trade-offs: a higher threshold (0.88) keeps accuracy loss under 0.1% but reduces MRR (to roughly 65%), while a lower threshold boosts MRR at some cost in upper-layer accuracy (Bansal, 18 Dec 2025). In ENGRAM-R, input and reasoning MRRs above 85% consistently yield order-of-magnitude reductions in total token budget, especially for multi-hop tasks (Patel et al., 17 Nov 2025). In hardware, the 3D-spatial scheme in Voltra directly multiplies spatial or temporal MRR by the unrolling factor, up to 2.94× that of traditional 2D arrays, translating into up to 50% savings in DRAM bandwidth (Yi et al., 11 Feb 2026).

4. Methodological Variants and Design Considerations

The parameterizations and system-level choices critically affect MRR outcomes and their downstream implications:

  • Fingerprinting and Matching Criteria (LLMCache): The choice of hash/similarity function and matching threshold mediates the tension between hit rate (MRR) and output quality.
  • Eviction Strategies: LRU, frequency-based, and divergence-aware policies adjust the window of reuse, with LRU/FRQ optimizing short-term MRR and divergence-aware methods sacrificing some MRR for longer-term output fidelity (Bansal, 18 Dec 2025).
  • Dataflow Architecture (Voltra): Size and aspect of the on-chip buffer tiles, FIFO depth, and streamer channel width together define maximal attainable spatial/temporal MRRs, but incur area/power costs and may exacerbate bank contention (Yi et al., 11 Feb 2026).
  • Static Trace Analysis (LLVM): The granularity of loop brackets and block-level CFG determines the accuracy of the reuse-distance estimation, though the method is invariant to input data size (Barai et al., 2023).
  • ENGRAM-R Retrieval Budget (K): Reducing $K$ increases the input MRR but may undercut recall of critical facts; a retrieval-budget sweep balances reuse against answer accuracy (Patel et al., 17 Nov 2025).
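As a concrete example of the eviction dimension above, a minimal LRU bank (a generic sketch, not tied to any of the cited systems) shows how capacity bounds the reuse window:

```python
from collections import OrderedDict

class LRUBank:
    """Bounded cache bank with least-recently-used eviction: only the
    `capacity` most recently touched entries remain reusable."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None                     # miss: evicted or never stored
        self._data.move_to_end(key)         # refresh recency on hit
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used entry

bank = LRUBank(capacity=2)
bank.put("q1", "act1"); bank.put("q2", "act2"); bank.put("q3", "act3")
print(bank.get("q1"))  # None -- evicted: the reuse window is only 2 entries deep
print(bank.get("q3"))  # act3
```

A frequency-based or divergence-aware policy would replace only the eviction rule in `put`, trading short-term MRR against longer-term fidelity as described above.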

5. Practical Effects and Trade-offs

High MRR consistently confers substantial reductions in wall-clock latency, compute, and off-chip bandwidth usage, with controllable or negligible degradation in accuracy. Key empirical insights include:

  • LLMCache achieves between 2.2× and 3.1× speedups at global MRRs of 70–78%, with under a 0.5% drop in F1 or end-task accuracy, especially when restricting caching to lower transformer layers (Bansal, 18 Dec 2025).
  • In large-reasoning pipelines via ENGRAM-R, input and reasoning MRRs near 90% and 75% yield order-of-magnitude reductions in context and reasoning tokens, with observed accuracy improvements in composition-heavy QA tasks (Patel et al., 17 Nov 2025).
  • Hardware spatial and temporal utilization scales directly with achieved MRR; Voltra's 3D tiling and streaming yield measured 2.12–2.94× boosts in effective reuse (Yi et al., 11 Feb 2026).
  • Memory-vs-hit-rate curves display strong diminishing returns beyond moderate cache or memory investments: doubling the cache from 500 MB to 1 GB raises MRR by only about 4% in LLMCache (Bansal, 18 Dec 2025).
  • Data compaction (e.g., PCA) for cached outputs yields minor (roughly 1%) absolute reductions in MRR, often justified by the memory savings (Bansal, 18 Dec 2025).

6. Static Analysis and Reuse Profiling

Barai et al. demonstrate that accurate MRR estimation is attainable in constant time via LLVM-based static analysis, without dynamic trace generation. Constructed control-flow graphs, loop annotations, and recursive profile computation directly yield the probability distribution of reuse distances, from which MRR for any LRU cache is calculated as the cumulative probability up to the cache's capacity (Barai et al., 2023). This method enables rapid, accurate forecasting of memory system performance for arbitrary input or workload scale.

7. Broader Implications and Optimization Guidelines

Maximizing MRR is a central goal in the design of high-throughput inference engines, efficient accelerators, and cache-optimized programs. Across domains, the following principles surface:

  • Locality-aware Dataflows: Deep spatial and temporal reuse (3D unrolling, streaming, loop tiling) significantly augment MRR and system efficiency (Yi et al., 11 Feb 2026).
  • Semantic Indexing and Adaptive Eviction: Tailoring cache lookup keys, thresholds, and eviction policies allows practitioners to tune the balance between speed, memory, and task fidelity (Bansal, 18 Dec 2025).
  • Typed Memory and Citation Control: For reasoning models, explicit fact-card abstractions and enforced memory citation prevent regeneration of redundant material, sharply elevating effective MRR (Patel et al., 17 Nov 2025).
  • Static Profiling: Rapid, input-invariant MRR estimation empowers compiler and hardware designers with actionable intelligence for buffer sizing, tile partitioning, and prefetch allocation (Barai et al., 2023).

Empirically, elevated MRR is consistently associated with reduced token usage, lower inference latency, and increased hardware utilization, provided mechanisms for memory freshness and semantic matching are robust. Adaptive policies that govern retrieval, eviction, and memory update ensure that high MRR can be attained without compromising correctness or long-horizon compositional accuracy.
