
Memory-Heterogeneous Accelerators (MemHA)

Updated 4 July 2026
  • Memory-Heterogeneous Accelerators (MemHA) are architectures that integrate diverse memory types and access modes to optimize workload-specific performance.
  • They employ variable coherence modes, specialized address translation, and profiling-guided memory composition to reduce latency, energy, and bandwidth costs.
  • Applications span AI systems and accelerator-rich SoCs, where adaptive memory hierarchies yield significant speedup and energy efficiency improvements.

Memory-Heterogeneous Accelerators (MemHA) denote accelerator architectures and systems that exploit multiple memory organizations, technologies, or access modes within a single design. In the cited literature, this heterogeneity appears because accelerator-rich SoCs and AI systems exhibit sharply different memory behaviors: some accelerators stream long bursts, some access a few words irregularly, some benefit from private caches, and others push most traffic off-chip; likewise, some data are short-lived and bandwidth-sensitive, while other data are long-lived, capacity-dominated, or latency-critical (Zuckerman et al., 2021, Li et al., 21 Apr 2025). MemHA therefore spans coherence specialization, accelerator-local MMUs, heterogeneous scratchpads, SRAM/GCRAM composition, local-plus-remote memory hierarchies, processing-in-memory substrates, and phase-aware split execution across distinct devices (Kim et al., 2017, Kwon et al., 2019, Duan et al., 16 Sep 2025, Wei et al., 29 Jun 2026).

1. Conceptual scope and taxonomy

Across the literature, MemHA is not a single microarchitectural template but a family of design strategies organized around complementary memory properties. At least three recurring forms are visible. First, the same accelerator may expose several coherence or address-translation modes, with the mode selected statically or at runtime. Second, a single on-chip hierarchy may mix fast and dense memories according to lifetime, access order, or retention requirements. Third, whole systems may split workloads across physically distinct memory substrates, such as local HBM plus remote memory nodes, SRAM-PIM plus HBM-PIM, or GDDR-based prefill plus HBM-based decode (Zuckerman et al., 2021, Wang et al., 24 Feb 2026, Negi et al., 3 Oct 2025).

| Axis of heterogeneity | Representative options | Example papers |
| --- | --- | --- |
| Interface and control | non-coh DMA, LLC-coh DMA, coh DMA, fully-coh; accelerator-local MMUs | (Zuckerman et al., 2021, Kim et al., 2017) |
| On-chip memory technology | SRAM, Si-GCRAM, Hybrid-GCRAM; SHIFT plus RANDOM scratchpads | (Li et al., 21 Apr 2025, Wang et al., 24 Feb 2026, Zokaee et al., 2021) |
| System and phase specialization | local HBM plus remote memory-nodes; SRAM-PIM plus HBM-PIM; GDDR prefill plus HBM decode | (Kwon et al., 2019, Duan et al., 16 Sep 2025, Wei et al., 29 Jun 2026) |

This taxonomy suggests that MemHA is best understood as a cross-layer co-design problem rather than a narrow memory-device question. In some papers, heterogeneity is expressed through cache-coherence flexibility; in others, through retention-aware array composition, logic-layer compute placement, or cross-vendor serving pipelines. The unifying premise is that memory behavior is workload-dependent enough that fixed homogeneous provision is systematically suboptimal.

2. Coherence, address translation, and unified memory placement

A foundational MemHA problem is that no single coherence interface is optimal for all fixed-function, loosely coupled accelerators. Cohmeleon identifies four primary coherence modes: Non-Coherent DMA ("direct off-chip"), in which the accelerator has no private cache and coherence is handled in software via explicit CPU-side flushes and invalidations; LLC-Coherent DMA, in which DMA transactions are routed through the shared LLC; Coherent DMA (I/O coherence), in which the interconnect implements full hardware coherence; and Fully-Coherent (Private Cache), in which the accelerator tile includes a private cache implementing a full MESI/MOESI or custom protocol (Zuckerman et al., 2021). Because these modes trade off invocation latency, on-chip bandwidth, off-chip traffic, and software overhead, Cohmeleon reports that statically selecting one mode at design time leads to 20–50% performance loss and 40–70% more off-chip accesses compared to the best possible choice in each situation.

Fine-grain specialization can be pushed below the level of whole-accelerator mode selection. "A Case for Fine-grain Coherence Specialization in Heterogeneous Systems" proposes an architecture that enables low-complexity independent specialization of each individual coherence request by building on the Spandex coherence interface. The paper’s abstract reports that these techniques can reduce execution time by up to 61% or network traffic by up to 99% while adding minimal complexity to the protocol (Alsop et al., 2021). This places request-level specialization and invocation-level mode selection on a continuum rather than in opposition.

Address translation is a second control-plane dimension of MemHA. Kim et al. analyze CPU-managed MMUs, local PTW-only designs, and hierarchical TLB+PTW designs, and conclude that accelerators should not rely on the CPU MMU for any aspect of address translation, but instead must have a local, fully fledged MMU tailored to the application (Kim et al., 2017). Their translation model expresses accelerator-side translation overhead as

$$Acc\_MMU = N_{ref}\bigl[h\cdot T_{TLB} + (1-h)\cdot T_{PTWalk}\bigr].$$

The same study states that even a 1 ns round-trip to the CPU’s L2 TLB eliminates more than 50% of accelerator speedup, and that workload-optimal designs differ markedly: streaming kernels can be well served by a 32-entry L1 TLB alone, whereas random-pattern kernels benefit from larger TLB reach or 8- to 32-way PTW parallelism.
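The translation model above can be evaluated directly. The following is a minimal sketch, assuming $h$ is the TLB hit rate and that both latencies use the same time unit; the function name and the example numbers are illustrative and are not taken from Kim et al.

```python
def translation_overhead(n_ref: int, hit_rate: float,
                         t_tlb: float, t_ptwalk: float) -> float:
    """Acc_MMU = N_ref * [h * T_TLB + (1 - h) * T_PTWalk]."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must lie in [0, 1]")
    return n_ref * (hit_rate * t_tlb + (1.0 - hit_rate) * t_ptwalk)


# Hypothetical workloads: one million references, 1-cycle TLB hit,
# 100-cycle page-table walk. Only the hit rate differs.
streaming = translation_overhead(1_000_000, 0.99, 1.0, 100.0)
irregular = translation_overhead(1_000_000, 0.60, 1.0, 100.0)
print(f"streaming: {streaming:.3g} cycles, irregular: {irregular:.3g} cycles")
```

Under these assumed numbers the irregular kernel spends roughly twenty times longer in translation, which is consistent with the observation that random-pattern kernels need more TLB reach or more parallel page-table walkers.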

Unified address spaces do not eliminate these optimization pressures. On the Grace Hopper Superchip, all processing units share a single 64 KB-page virtual address space via a unified page table and an ATS-TBU, yet measured bandwidth and latency still depend strongly on placement, and the reported optimization rule is to place data as close as possible to the dominant consumer (Fusco et al., 2024). The paper applies the standard latency-bandwidth model

$$T(S) = \alpha + S/\beta$$

and shows that transparent fine-grained access shifts, rather than removes, the burden of memory optimization.
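To see why placement still matters under this model, it helps to compute the effective bandwidth of a transfer. The sketch below assumes $\alpha$ is a fixed startup latency in seconds and $\beta$ is the peak bandwidth in bytes per second; the two example paths use made-up values, not GH200 measurements.

```python
def transfer_time(size_bytes: float, alpha_s: float, beta_bps: float) -> float:
    """T(S) = alpha + S / beta."""
    return alpha_s + size_bytes / beta_bps


def effective_bandwidth(size_bytes: float, alpha_s: float, beta_bps: float) -> float:
    """Bytes per second actually achieved for a transfer of the given size."""
    return size_bytes / transfer_time(size_bytes, alpha_s, beta_bps)


def half_peak_size(alpha_s: float, beta_bps: float) -> float:
    """Transfer size at which effective bandwidth reaches half of beta."""
    return alpha_s * beta_bps


# Hypothetical paths: a near memory and a far memory reached over a link.
near = {"alpha_s": 0.5e-6, "beta_bps": 3.0e12}
far = {"alpha_s": 2.0e-6, "beta_bps": 0.4e12}
for size in (4 * 1024, 64 * 1024 * 1024):
    print(size, effective_bandwidth(size, **near), effective_bandwidth(size, **far))
```

Small transfers are dominated by $\alpha$ and large ones by $\beta$, so the preferred placement can depend on access granularity as well as on which memory has the higher peak bandwidth.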

3. On-chip heterogeneous memories and profiling-guided composition

A major branch of MemHA research treats on-chip memory composition as a function of application behavior. GainSight is a profiling-driven framework with two components: retargetable hardware backends that emit cycle-accurate memory traces, and an application-agnostic analytical frontend that computes data lifetimes, read/write frequencies, and capacity utilization, then correlates them with SRAM, Si-GCRAM, and hybrid GCRAM device models (Li et al., 21 Apr 2025). Its key lifetime metric is

$$\tau = t_2 - t_1,$$

where $t_1$ is the first write or fetch into the on-chip memory and $t_2$ is the last read before overwrite, eviction, or invalidation. GainSight further models active energy as

$$E_{\mathrm{active}} = E_r\,(N_r + R) + E_w\,(N_w + R).$$

In its reported case studies, 64% of L1 and 18% of L2 GPU cache accesses, and 79% of systolic-array scratchpad accesses, are short-lived and suitable for Si-GCRAM; heterogeneous arrays augmenting SRAM with GCRAM can reduce active energy by up to 66.8% (Li et al., 21 Apr 2025).
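The lifetime and energy expressions can be sketched in a few lines. This is an illustration only: the text does not define $R$, so the code treats it as a count added to both reads and writes (plausibly refresh operations, which is an assumption), and the retention threshold used to call a datum short-lived is a hypothetical parameter.

```python
from typing import Sequence


def lifetime(t_first_write: float, t_last_read: float) -> float:
    """tau = t_2 - t_1 for one datum resident in on-chip memory."""
    if t_last_read < t_first_write:
        raise ValueError("last read precedes first write")
    return t_last_read - t_first_write


def short_lived_fraction(lifetimes: Sequence[float], retention: float) -> float:
    """Fraction of data whose lifetime fits within the given retention time."""
    if not lifetimes:
        raise ValueError("no lifetimes given")
    return sum(1 for tau in lifetimes if tau <= retention) / len(lifetimes)


def active_energy(e_read: float, e_write: float,
                  n_reads: int, n_writes: int, r: int) -> float:
    """E_active = E_r * (N_r + R) + E_w * (N_w + R); R is assumed to be a refresh count."""
    return e_read * (n_reads + r) + e_write * (n_writes + r)


# Hypothetical trace: (first write, last read) timestamps in microseconds.
trace = [(0.0, 0.4), (1.0, 1.2), (2.0, 9.0), (3.0, 3.1)]
taus = [lifetime(t1, t2) for t1, t2 in trace]
print(short_lived_fraction(taus, retention=1.0))
```

A profile in which most lifetimes fall under the retention time of a gain-cell array is the situation in which the text reports GCRAM to be a suitable substitute for SRAM.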

OpenGCRAM extends this profiling perspective into macro generation and design-space exploration. The compiler supports 6T SRAM, 2T Si-Si GCRAM, and 2T OS-Si GCRAM in TSMC 40 nm, generating DRC/LVS-clean macros, SPICE-characterized timing and power, and Pareto frontiers over area and leakage (Wang et al., 24 Feb 2026). Reported cell-area ratios are 0.69× 6T-SRAM for Si-Si GCRAM and 0.35× 6T-SRAM for OS-Si GCRAM. The study also characterizes SRAM as the “speed peak,” Si-Si GCRAM as a “transient high-bandwidth cache” regime, and OS-based GCRAM as a long-term low-leakage store. Its design rules include matching retention to data lifetime and reserving SRAM for the fastest, hottest data. This suggests that workload-guided lifetime profiling and macro-level memory compilation can be coupled directly, rather than treated as separate stages.

Heterogeneous scratchpads offer a more tightly specialized on-chip embodiment. SMART, for SFQ-based systolic CNN accelerators, combines ultra-dense sequential SHIFT scratchpads with a large shared RANDOM scratchpad implemented as a pipelined multi-bank CMOS-SFQ array (Zokaee et al., 2021). The SHIFT banks support sequential accesses in 0.02 ns per word; the RANDOM path provides 0.11 ns/word service for arbitrary indexing. An ILP-based compiler deploys CNN models across this hierarchy, and the reported results are 3.9× throughput improvement and 86% energy reduction for single-image inference, or 2.2× throughput improvement and 71% energy reduction for batch inference, at +3% total chip area over the latest SHIFT-based baseline. SMART illustrates a form of MemHA in which heterogeneity is driven by access order, not only by retention or bandwidth.

4. Memory-centric architectures, remote pools, and processing-in-memory

At the system level, MemHA often appears as an explicit partition across memory tiers with complementary latency, bandwidth, capacity, area, or power characteristics. The cited work spans capacity expansion, 3D-stacked in-memory engines, heterogeneous PIM, and 2.5D-integrated CiM/CiD systems (Kwon et al., 2019, Falahati et al., 2018, Duan et al., 16 Sep 2025, Negi et al., 3 Oct 2025).

| System | Heterogeneous organization | Reported outcome |
| --- | --- | --- |
| MC-DLA | on-package HBM plus ring-attached remote memory-nodes | average 2.8× speedup; capacity scales to tens of TBs (Kwon et al., 2019) |
| ORIGAMI | logic-layer compute engines plus off-memory platform | 1.55× avg speedup; 29× avg EDP improvement (Falahati et al., 2018) |
| HPIM | SRAM-PIM plus HBM-PIM | average 6.2× lower latency than A100; peak 22.8× speedup (Duan et al., 16 Sep 2025) |
| HALO | HBM-CiD plus analog CiM via 2.5D integration | 18× speedup over AttAcc1; 2.4× over CENT (Negi et al., 3 Oct 2025) |

The memory-centric deep learning system in "Beyond the Memory Wall" decouples capacity-optimized memory nodes from the host bus and embeds them into the same high-bandwidth device-side interconnect used for accelerator-to-accelerator communication (Kwon et al., 2019). Its first tier is local HBM; its second tier is a ring-attached remote pool addressed through cudaMallocRemote and cudaFreeRemote, with a BW_AWARE policy that stripes pages across adjacent memory nodes. In the reported 8-device evaluation on eight DNNs, this design achieves an average 2.8× speedup and expands system-wide memory to tens of TBs.
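Striping pages across adjacent memory nodes can be pictured as a round-robin assignment. The sketch below is a simplified illustration under that assumption; it is not the paper's BW_AWARE implementation, and the function name and node labels are hypothetical.

```python
from typing import List, Sequence


def stripe_pages(num_pages: int, adjacent_nodes: Sequence[str]) -> List[str]:
    """Assign page i to node i mod k so that transfers can use all k links at once."""
    if not adjacent_nodes:
        raise ValueError("at least one memory node is required")
    k = len(adjacent_nodes)
    return [adjacent_nodes[page % k] for page in range(num_pages)]


# In a ring, a device has two neighbouring memory nodes (labels are hypothetical).
placement = stripe_pages(6, ["left_node", "right_node"])
print(placement)
```

Spreading consecutive pages over both neighbours lets a bulk copy draw on the combined link bandwidth rather than the bandwidth of a single link.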

ORIGAMI addresses the limited area and power budgets of the logic layer in 3D-stacked memory by extracting a small set of recurring ML compute patterns (MAC, comparator, optimization, and special nonlinear operations) into heterogeneous in-memory compute engines, then splitting the residual work to an off-memory FPGA/GPU/TPU-class platform (Falahati et al., 2018). The compiler exploits model-level, partial-level, and block-level parallelism, and the reported system operates within 1% of an ideal unlimited-logic reference while delivering 1.55× average performance speedup and 29× average EDP improvement over an FPGA-only baseline.

HPIM is explicitly framed as a memory-centric heterogeneous PIM accelerator for LLM inference. It couples an SRAM-PIM subsystem for latency-critical attention operations with an HBM-PIM subsystem for weight-intensive GEMV, coordinated by a controller and scheduler (Duan et al., 16 Sep 2025). The HBM-PIM subsystem is built on four HBM3 stacks with 96 GB capacity and over 100 TB/s internal bandwidth; the SRAM-PIM side provides 32 cores, each with near-array GEMV support, a 64×64 TCU, vector and scalar units, and local scratchpads. The paper models per-token latency as

$$T_{total} = \max(T_{SRAM},T_{HBM}) - \delta_{overlap},$$

and reports 6.2× lower latency than an NVIDIA A100 on average across OPT models, with a peak 22.8× speedup.
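The per-token latency model is simple enough to evaluate by hand, but a short sketch makes the role of the slower subsystem explicit. The input values are hypothetical and the helper names are not from the paper.

```python
def per_token_latency(t_sram: float, t_hbm: float, delta_overlap: float) -> float:
    """T_total = max(T_SRAM, T_HBM) - delta_overlap."""
    return max(t_sram, t_hbm) - delta_overlap


def bottleneck(t_sram: float, t_hbm: float) -> str:
    """Name the subsystem that determines the max() term."""
    return "SRAM-PIM" if t_sram > t_hbm else "HBM-PIM"


# Hypothetical per-token times in microseconds.
print(per_token_latency(3.0, 5.0, 0.5), bottleneck(3.0, 5.0))
```

Because the two subsystems run concurrently, speeding up the one that is not the bottleneck does not change the result under this model.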

HALO reaches a related conclusion for low-batch LLM inference but with a different substrate. It combines HBM-based Compute-in-DRAM with on-chip analog Compute-in-Memory using 2.5D integration, mapping compute-bound prefill GEMMs to CiM and memory-bound decode GEMVs to CiD (Negi et al., 3 Oct 2025). Its reported geometric-mean results on LLaMA-2 7B and Qwen3 8B include 18× end-to-end speedup over AttAcc1, 2.4× over CENT, and energy reductions of 2.0× and 1.8×, respectively. Taken together, HPIM and HALO show that MemHA can mean pairing distinct PIM or in-memory substrates inside one accelerator, not only adding more tiers to a conventional hierarchy.

5. Runtime orchestration, phase-aware control, and heterogeneous mapping workflows

MemHA designs frequently require runtime policies because the best memory mode depends on workload size, footprint, and system contention. Cohmeleon formulates coherence-mode selection as a reinforcement-learning problem in which the state vector

$$s_t = (n_{fc}, n_{nc\_dma}, n_{llc\_dma}, tile\_footprint, acc\_footprint)$$

is discretized into 243 states, the action space consists of the four coherence modes, and the reward is a weighted sum

$$R_t = x\cdot R_{exec} + y\cdot R_{comm} + z\cdot R_{mem}.$$

The implementation uses a 243×4 Q-table, $\epsilon$-greedy exploration, and in practice a simpler on-policy update that emphasizes immediate invocation performance (Zuckerman et al., 2021). Hardware support is intentionally small: a coherence-control register per accelerator, mode support in the interconnect/LLC, and counters exposed via a register block costing <100 LUTs per tile and read with ~100 ns overhead. On FPGA prototypes with 11 accelerators, Cohmeleon converges in ≲10 application iterations and achieves on average 38% higher throughput with 66% fewer off-chip DRAM accesses than state-of-the-art static policies.
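A tabular selector of this shape can be sketched as follows. Several details are assumptions rather than facts from the paper: that each of the five state attributes is bucketed into three levels (which yields 3^5 = 243 states), that the update is an exponential moving average of the immediate reward with no bootstrapped future term, and all hyperparameter values and identifiers.

```python
import random
from typing import Sequence

MODES = ("non_coh_dma", "llc_coh_dma", "coh_dma", "fully_coh")
LEVELS = 3       # assumption: 3 buckets per attribute
ATTRIBUTES = 5   # n_fc, n_nc_dma, n_llc_dma, tile_footprint, acc_footprint


def state_index(levels: Sequence[int]) -> int:
    """Encode five bucketed attributes (each 0..2) as one index in [0, 243)."""
    if len(levels) != ATTRIBUTES:
        raise ValueError("expected five attributes")
    idx = 0
    for level in levels:
        if not 0 <= level < LEVELS:
            raise ValueError("bucket out of range")
        idx = idx * LEVELS + level
    return idx


def reward(r_exec: float, r_comm: float, r_mem: float,
           x: float = 1 / 3, y: float = 1 / 3, z: float = 1 / 3) -> float:
    """R_t = x * R_exec + y * R_comm + z * R_mem (weights here are placeholders)."""
    return x * r_exec + y * r_comm + z * r_mem


class ModeSelector:
    def __init__(self, epsilon: float = 0.1, lr: float = 0.25, seed: int = 0):
        self.q = [[0.0] * len(MODES) for _ in range(LEVELS ** ATTRIBUTES)]
        self.epsilon, self.lr = epsilon, lr
        self.rng = random.Random(seed)

    def choose(self, state: int) -> int:
        """Epsilon-greedy: explore with probability epsilon, else take the best mode."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(MODES))
        row = self.q[state]
        return row.index(max(row))

    def update(self, state: int, action: int, r: float) -> None:
        """Assumed update: move Q toward the immediate reward only."""
        self.q[state][action] += self.lr * (r - self.q[state][action])


selector = ModeSelector()
s = state_index((0, 1, 2, 0, 1))
a = selector.choose(s)
selector.update(s, a, reward(0.8, 0.5, 0.9))
print(MODES[a], selector.q[s])
```

One decision is made per accelerator invocation, which matches the granularity at which the text says the reinforcement-learning overhead stays acceptable.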

Phase-aware control becomes even more explicit in cross-device LLM serving. HMA-Serve pairs GDDR-based accelerators for prefill with HBM-based GPUs for decode, arguing that HBM bandwidth sits almost entirely idle during prefill and is therefore economically mismatched to that phase (Wei et al., 29 Jun 2026). Its three central mechanisms are phase-wise quantization, a layer-wise compute-transfer pipeline, and deferred dequantization. The pipeline replaces a naive first-token latency, in which computation and transfer are serialized, with an overlapped form.

The system ships raw BFP8 KV bytes over the network and reconstructs them lazily on the decode GPU; the reported full-KV reconstruction cost is ≈1.5 ms per request versus ~72 ms for a naive element-wise path. Across four Qwen3 models and three production traces, HMA-Serve reports up to 3.2× higher goodput and 4.8× higher goodput-per-dollar than memory-homogeneous baselines, with no measurable loss on generation-quality benchmarks.
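The benefit of a layer-wise compute-transfer pipeline can be illustrated with a generic two-stage pipeline model. This is not the latency expression from the HMA-Serve paper; it assumes that the transfer of a layer's KV data can start as soon as that layer has been computed and the previous transfer has finished, and all timings are hypothetical.

```python
from typing import Sequence


def naive_latency(compute: Sequence[float], transfer: Sequence[float]) -> float:
    """Serialized: finish all computation, then ship everything."""
    return sum(compute) + sum(transfer)


def pipelined_latency(compute: Sequence[float], transfer: Sequence[float]) -> float:
    """Overlapped: layer i is transferred while later layers are still computing."""
    if len(compute) != len(transfer):
        raise ValueError("need one compute and one transfer time per layer")
    compute_done = 0.0
    transfer_done = 0.0
    for c, x in zip(compute, transfer):
        compute_done += c
        # A transfer needs its layer's output and a free link.
        transfer_done = max(compute_done, transfer_done) + x
    return transfer_done


# Hypothetical per-layer times in milliseconds.
layers_compute = [2.0, 2.0, 2.0]
layers_transfer = [1.0, 1.0, 1.0]
print(naive_latency(layers_compute, layers_transfer),
      pipelined_latency(layers_compute, layers_transfer))
```

When per-layer transfer is cheaper than per-layer compute, only the last layer's transfer remains exposed; when transfer dominates, the link becomes the bottleneck and only the first layer's compute is exposed.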

For analog in-memory computing, runtime and compile-time mapping are linked by precision sensitivity. The unified AIMC workflow classifies heterogeneous mapping by granularity and optimization strategy, and applies a four-stage procedure: precision sensitivity profiling, partitioning-strategy selection, mapping granularity and allocation, and final optimization/deployment (Lammie, 1 Jun 2026). In its GPT-2 case study, sensitivity is dominated by 4 of 49 projections, with the first decoder block’s attention output dominating by an order of magnitude. A threshold rule on the profiled sensitivity maps 9 projections to digital and 40 to analog, and the paper reports the resulting analog MAC ratio together with a <0.5% increase over baseline perplexity. This result is notable because it rejects both chip-level “all analog” and “all digital” simplifications in favor of projection-level heterogeneity.
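A threshold-based mapping of this kind can be sketched as below. The projection names, sensitivity scores, MAC counts, and threshold are all hypothetical; the sketch only shows the mechanics of sending sensitive projections to digital hardware and computing the fraction of multiply-accumulate operations that remain analog.

```python
from typing import Dict, Set, Tuple


def map_projections(sensitivity: Dict[str, float], macs: Dict[str, int],
                    threshold: float) -> Tuple[Set[str], Set[str], float]:
    """Projections above the threshold go digital; return both sets and the analog MAC ratio."""
    if sensitivity.keys() != macs.keys():
        raise ValueError("sensitivity and macs must cover the same projections")
    digital = {name for name, score in sensitivity.items() if score > threshold}
    analog = set(sensitivity) - digital
    total_macs = sum(macs.values())
    analog_ratio = sum(macs[name] for name in analog) / total_macs
    return digital, analog, analog_ratio


# Hypothetical three-projection model.
scores = {"blk0.attn_out": 0.90, "blk0.mlp_in": 0.10, "blk1.mlp_out": 0.05}
mac_counts = {"blk0.attn_out": 100, "blk0.mlp_in": 100, "blk1.mlp_out": 200}
digital, analog, ratio = map_projections(scores, mac_counts, threshold=0.5)
print(sorted(digital), sorted(analog), ratio)
```

Because the ratio is weighted by MAC count rather than by the number of projections, moving a few small but sensitive projections to digital can leave most of the arithmetic on analog hardware.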

6. Misconceptions, limitations, and open research directions

A recurring misconception is that unified address spaces or full hardware coherence make memory decisions largely automatic. The GH200 measurements contradict this: transparent access exists, but bandwidth and latency remain NUMA-sensitive, and careful placement still matters materially for kernels, LLM inference, and collectives (Fusco et al., 2024). A related misconception is that MemHA is synonymous with accumulating more HBM. Several cited systems do the opposite: HMA-Serve argues explicitly that HBM is not all you need for disaggregated LLM serving, and GainSight/OpenGCRAM emphasize retention-aware composition with GCRAM rather than bandwidth-only scaling (Wei et al., 29 Jun 2026, Wang et al., 24 Feb 2026).

The literature also identifies several technical limits. Cohmeleon notes that a tabular Q-table scales poorly if more state attributes are tracked, and that RL overhead becomes problematic if decisions are made more frequently than once per accelerator task; its stated future directions are Deep Q-Networks, predictive workload models, and within-kernel mode switching (Zuckerman et al., 2021). GainSight provides a clear optimization formulation over subpartitions, area budgets, and retention constraints, but the paper explicitly does not present a full integer-programming solver (Li et al., 21 Apr 2025). SMART depends on mature cryo-CMOS SRAM at 4 K, reports that current ILP is per-layer and near-optimal under a fixed prefetch horizon, and identifies cooling overhead (×400 at 4 K) as a practical system-level energy cost (Zokaee et al., 2021).

Several works converge on a common future direction: tighter coupling between profiling, hardware synthesis, and runtime control. OpenGCRAM positions macro generation, SPICE-accurate characterization, and Pareto filtering as a basis for plugging in new memory technologies and new PDKs (Wang et al., 24 Feb 2026). GH200 points toward hardware counters and programmable migration thresholds for coherent layered memories (Fusco et al., 2024). Cohmeleon points toward finer-grain adaptation, while AIMC mapping points toward projection-level rather than layer-level control (Zuckerman et al., 2021, Lammie, 1 Jun 2026). This suggests that future MemHA research will likely be defined less by any single memory device than by the sophistication with which systems observe, model, and exploit workload-specific memory behavior across coherence, translation, retention, capacity, bandwidth, and phase structure.
