
Internal Memory Cache Architectures

Updated 14 February 2026
  • Internal Memory Cache is a hardware or software-managed structure that partitions memory into tagless hot regions and dynamic tagged caches to reduce access latency.
  • It employs advanced tag management techniques, including parallel on-die probing and dedicated hit/miss buses, achieving up to 2.6× faster tag checks and improved energy efficiency.
  • Hybrid HW/SW caching approaches leverage profiling and adaptive admission control policies to optimize performance across DRAM, NVRAM, and ML inference systems.

An internal memory cache is a hardware or software-managed structure used to store recently or frequently accessed data close to the processing units, with the goal of reducing average memory access latency and improving effective bandwidth. Internal memory caches operate at multiple scales, ranging from on-die SRAM/DRAM for processors, to device-level caches in accelerators and persistent memory systems, to application-layer architectures such as key-value caches in machine learning inference.

1. Fundamental Architecture and Partitioning Models

Internal memory caches are commonly structured according to their underlying physical substrate, allocation policy, and tag management. For die-stacked DRAM, such as in the MemCache proposal, the total capacity $D$ is partitioned into a memory region ($M$) for hot pages and a cache region ($C = D - M$) serving as a traditional hardware-managed cache with tags. The memory region delivers zero-tag, fast accesses to statically selected pages; the cache region captures dynamic variations in access locality but incurs tag storage, lookup bandwidth, and replacement overheads (Bakhshalipour et al., 2018):

  • Memory Region: OS-mapped, static, tagless; stores hot pages
  • Cache Region: hardware-managed, set-associative, tagged

The address path is:

  • If the physical address falls within the memory region, serve it from $M$ with no tag check
  • Otherwise, probe the cache-region tags, serving from $C$ on a hit or from off-chip memory on a miss

This hybrid partitioning eliminates unnecessary tag checks for dominant, hot-memory accesses while preserving adaptability for the remainder of working set accesses.
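The address path above can be sketched as a small lookup routine. This is an illustrative model only: the region boundary, set count, and the direct-mapped fill policy are hypothetical placeholders, not parameters from the paper.

```python
# Sketch of a MemCache-style address path (illustrative; the region bound,
# geometry, and direct-mapped fill policy are hypothetical assumptions).
MEM_REGION_END = 1 << 20   # hot pages are OS-mapped below this physical address

cache_tags = {}            # set index -> tag, standing in for the tagged cache region

def lookup(paddr, line_bits=6, n_sets=1 << 14):
    line = paddr >> line_bits          # cache-line number
    set_idx = line % n_sets
    tag = line // n_sets
    if paddr < MEM_REGION_END:
        return "mem-region"            # tagless: serve directly, no tag check
    if cache_tags.get(set_idx) == tag:
        return "cache-hit"             # tagged cache-region hit
    cache_tags[set_idx] = tag          # fill on miss (direct-mapped for brevity)
    return "off-chip-miss"
```

The key property the sketch captures is that accesses to the memory region never touch the tag store, so the dominant hot-page traffic pays no tag-lookup latency or bandwidth.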

2. Tag Management and Latency

Tag management is central to hardware-managed caches, directly impacting access time, bandwidth, and energy efficiency. For DRAM-based caches, such as TDRAM, embedding low-latency tag arrays ("tag mats") on the DRAM die and introducing a dedicated hit/miss bus enables fast parallel tag and data access, with on-die tag comparison and conditional data response. The result is a substantial reduction in tag-check latency, from ~55–57 ns (prior direct-mapped DRAM caches) to 22 ns with early probing (Babaie et al., 2024). Key architectural advances include:

  • Parallel tag+data activation, on-die comparison
  • Early tag probing for miss prediction
  • Dedicated flush buffers to expedite dirty write misses
  • Result: 2.6× faster tag checks, 1.2× application speedup, 21% less DRAM-side energy

Skipping unnecessary data transfers (e.g., read misses to clean lines) prevents bandwidth waste and improves both latency and system power. Area and pin overheads (8.3% die, ~10% pins) are the principal trade-offs.

3. HW/SW Hot Data Identification and Placement

For caches that hybridize software and hardware management, as in MemCache, the selection of "hot" pages for static memory mapping is delegated to profiling/compiler passes and the OS, while hardware cache structures handle the dynamic portion. The workflow is:

  • Profile the program or its traces under a simulated LLC
  • Accumulate per-page access counts
  • Map the top-$M$ pages into the memory region
  • Manage the remainder dynamically in the hardware cache

This separation allows the system to serve a large fraction of accesses from tagless memory, saving bandwidth and cycle time, and applies widely in current 3D-DRAM and HBM systems (Bakhshalipour et al., 2018).
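The profiling step reduces to counting post-LLC accesses per page and taking the top entries. A minimal sketch, assuming a flat address trace and a 4 KiB page (the trace format and helper name are illustrative, not from the paper):

```python
from collections import Counter

def select_hot_pages(access_trace, m_pages, page_bits=12):
    """Profile a (simulated) post-LLC access trace and pick the top-M pages
    for the tagless memory region. Trace format is an illustrative assumption."""
    counts = Counter(addr >> page_bits for addr in access_trace)
    # The OS statically maps these pages; all other pages fall through
    # to the hardware-managed, tagged cache region.
    return {page for page, _ in counts.most_common(m_pages)}

trace = [0x1000, 0x1008, 0x1010, 0x2000, 0x3000, 0x1020]
hot = select_hot_pages(trace, 1)   # page 0x1 dominates this toy trace
```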

4. Analytical Performance Models

Performance modeling for internal caches includes hit rate, bandwidth, latency, and storage overhead metrics. For a cache partitioned as in MemCache:

  • Hit rate: $H_\mathrm{total} = (M/D)\,H_\mathrm{mem} + (C/D)\,H_\mathrm{cache}$
  • Bandwidth served: $\Delta \mathrm{BW} = \mathrm{BW}_\mathrm{die}\,H_\mathrm{total} - \mathrm{BW}_\mathrm{off}\,(1 - H_\mathrm{total})$
  • Tag bits: $T_\mathrm{bits} = (C/L)\left[\lceil \log_2 N_\mathrm{sets}\rceil + \lceil \log_2 N_\mathrm{ways}\rceil + 1\right]$, where $L$ is the line size
  • Latency benefit: $\Delta L = L_\mathrm{off}\,(1 - H_\mathrm{total}) - L_\mathrm{die}\,H_\mathrm{total}$
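These metrics are straightforward to evaluate numerically. The sketch below implements the four formulas directly; all example parameter values are hypothetical and chosen only to exercise the model, not figures reported in the paper.

```python
import math

def memcache_metrics(M, C, H_mem, H_cache, bw_die, bw_off,
                     L_off, L_die, line_size, n_sets, n_ways):
    """Evaluate the partitioned-cache model: total hit rate, bandwidth
    delta, tag-storage bits, and latency benefit. Inputs are illustrative."""
    D = M + C
    h_total = (M / D) * H_mem + (C / D) * H_cache
    d_bw = bw_die * h_total - bw_off * (1 - h_total)
    tag_bits = (C // line_size) * (math.ceil(math.log2(n_sets))
                                   + math.ceil(math.log2(n_ways)) + 1)
    d_lat = L_off * (1 - h_total) - L_die * h_total
    return h_total, d_bw, tag_bits, d_lat

# Example: equal 1 MiB regions, perfect memory-region hits, 50% cache hits.
h, d_bw, t_bits, d_lat = memcache_metrics(
    1 << 20, 1 << 20, 1.0, 0.5, 100.0, 20.0, 100.0, 10.0, 64, 256, 8)
```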

Empirical results with hybrid DRAM caches show overall speedups of 28–114% over prior state-of-the-art (Banshee, Alloy), with tagless memory regions contributing significantly to reductions in off-chip bandwidth and metadata overhead (Bakhshalipour et al., 2018).

5. Dynamic Caching Algorithms and Admission Control

Workload-driven admission control policies are required for memory technologies with intrinsic asymmetries, such as NVMe and NVRAM. NVCache for Optane NVRAM introduces a cost-benefit-adaptive admission policy based on the Overhead-Bypass Ratio (OBP):

$$\mathrm{OBP}(t) = \frac{\#\,\text{blocks inserted} + \#\,\text{blocks removed}}{\#\,\text{cache lookups}}$$

Admission is throttled if $\mathrm{OBP} > \tau$ (empirically $\tau = 0.10$), preventing excessive cache-line insertions on write-sensitive devices while remaining transparent for read-dominated or small-working-set workloads (Fedorova et al., 2022). Performance relative to unconstrained admission schemes improves substantially, particularly for working sets exceeding NVRAM capacity, where NVCache outperforms Intel Memory Mode by 30–170% on YCSB workloads.
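The throttle itself is a simple ratio test over running counters. A minimal sketch modeled on the OBP rule, with counter plumbing that is a simplification of the paper's design (in particular, pairing each insertion with an eventual eviction is an illustrative assumption):

```python
class OBPAdmission:
    """Cost-benefit admission throttle in the style of NVCache's OBP test.
    The counter bookkeeping here is a simplified, illustrative model."""

    def __init__(self, tau=0.10):
        self.tau = tau
        self.inserted = self.removed = self.lookups = 0

    def record_lookup(self):
        self.lookups += 1

    def admit(self):
        # OBP = (insertions + removals) / lookups; bypass when churn
        # outweighs the expected benefit of caching the block.
        obp = (self.inserted + self.removed) / max(self.lookups, 1)
        if obp > self.tau:
            return False
        self.inserted += 1   # assume each admission eventually evicts a line
        self.removed += 1
        return True
```

Because the ratio rises with every admission and falls with every lookup, admission self-limits under churn-heavy workloads and reopens as lookup traffic accumulates.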

6. Energy, Bandwidth, and Device-Specific Considerations

Device-specific constraints inform microarchitectural cache design. For example:

  • STT-RAM/eDRAM L2 partitions (RRAP): retention-relaxed and long-retention regions for general traffic and read-critical lines, reducing L2 miss ratio by 51.4% and increasing IPC by 11.7% compared to SRAM L2, with notable reductions in leakage and dynamic energy (Khoshavi et al., 2016).
  • For CXL-expansion systems, ICGMM applies a hardware-friendly Gaussian Mixture Model for online hot-spot detection in HBM caches, reducing SSD access latency by up to 39.14% with minimal area and order-of-magnitude lower inference latency vs. LSTM policies in FPGA implementations (Chen et al., 2024).
  • Row-buffer misses in hybrid DRAM-NVM memory are tracked by a small on-chip stats store. When row-buffer miss counts exceed adaptive thresholds, lines are migrated to DRAM, optimizing for both energy and endurance (Yoon et al., 2018).
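The row-buffer-miss policy in the last bullet can be sketched as a counter table plus a threshold check. This is a toy model: the fixed threshold and the names are illustrative (the actual design adapts thresholds online and bounds the stats store).

```python
from collections import defaultdict

# Toy model of row-buffer-miss-driven migration in hybrid DRAM-NVM
# (fixed threshold is an illustrative stand-in for the adaptive one).
MISS_THRESHOLD = 4
rb_miss_count = defaultdict(int)   # small on-chip stats store, per NVM row
in_dram = set()                    # rows already migrated to DRAM

def on_nvm_access(row, row_buffer_hit):
    if not row_buffer_hit:
        rb_miss_count[row] += 1    # only misses indicate poor row locality
    if rb_miss_count[row] >= MISS_THRESHOLD and row not in in_dram:
        in_dram.add(row)           # migrate: future accesses use faster DRAM
        return "migrated"
    return "stay"
```

Counting only row-buffer misses (rather than all accesses) targets migration at rows that are both hot and locality-poor, which is where DRAM placement saves the most energy and NVM wear.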

7. Application-Layer and ML-Inference Cache Management

Internal caches are increasingly crucial for controlling memory footprint in online, stream-processing, and inference settings. InfiniPot-V introduces a training-free, query-agnostic KV cache controller for multimodal transformer inference, enforcing a fixed memory cap with minimal accuracy degradation by combining Temporal-axis Redundancy (cosine-similarity pruning of stale vision tokens) and Value-Norm ranking (retention of semantically salient tokens). In streaming video LLM settings, this enables up to 94% reduction in GPU memory and real-time operation under constant resource bounds (Kim et al., 18 Jun 2025).
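The two signals can be combined in a compact pass over the cache. The sketch below is a loose, toy interpretation of those ideas, not the paper's algorithm: the similarity threshold, the predecessor-only redundancy test, and the tensor shapes are all illustrative assumptions.

```python
import numpy as np

def compress_kv(keys, values, cap, sim_thresh=0.95):
    """Toy two-stage KV compression: drop tokens whose key nearly duplicates
    its predecessor (temporal redundancy), then keep the `cap` tokens with
    the largest value norms. Threshold and shapes are illustrative."""
    norm_k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sims = np.sum(norm_k[1:] * norm_k[:-1], axis=1)      # cosine with prev token
    keep = np.concatenate([[True], sims < sim_thresh])   # prune near-duplicates
    idx = np.flatnonzero(keep)
    if idx.size > cap:                                   # value-norm ranking
        vnorm = np.linalg.norm(values[idx], axis=1)
        idx = idx[np.argsort(-vnorm)[:cap]]
        idx.sort()                                       # preserve token order
    return keys[idx], values[idx]
```

The point of the sketch is the division of labor: a cheap redundancy filter bounds growth along the temporal axis, and a salience ranking enforces the hard memory cap when the filter alone is insufficient.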


| Architecture | Key Mechanism | Reported Result |
|---|---|---|
| MemCache (Bakhshalipour et al., 2018) | HW/SW memory–cache split | +114% IPC vs. Banshee; –15% MPKI |
| TDRAM (Babaie et al., 2024) | On-die parallel tag/data | 2.6× tag-check speed, –21% DRAM energy |
| NVCache (Fedorova et al., 2022) | OBP-based admission | 30–170% faster than Memory Mode (large workloads) |
| ICGMM (Chen et al., 2024) | GMM hot-spot detection | –0.32–6% miss, –16–39% SSD access latency |
| InfiniPot-V (Kim et al., 18 Jun 2025) | TaR+VaN compression | –94% memory, ≈accuracy of full cache |

These results illustrate the diversity and specificity of internal memory cache architectures and controllers as deployed in modern systems, with strong dependence on memory technology, application pattern, and system resource constraints.
