
Memory-Guided Unified Accelerator

Updated 9 January 2026
  • Memory-guided unified hardware accelerators are architectures that co-optimize memory hierarchy, compression, and control flow to eliminate bottlenecks.
  • They employ dynamic batching, reconfigurable SRAM hierarchies, and precision-tuned scheduling to achieve up to 65.9× external memory reduction and 6.2× speedup.
  • Empirical results validate significant gains in throughput, energy efficiency, and performance across applications like DNNs, CNNs, and scientific computing.

A memory-guided unified hardware accelerator is a hardware architecture designed to minimize memory capacity requirements, maximize on-chip data reuse, and enable high throughput across a diverse set of workloads through systematic integration of memory subsystem design, memory-aware dataflow, and dynamically reconfigurable control. Such accelerators are characterized by co-optimization of the memory hierarchy, data compression/encoding, scheduling, and processing element (PE) utilization, thereby eliminating the memory bottlenecks that typically constrain system-level performance and energy efficiency in conventional accelerator designs. Recent literature demonstrates concrete embodiments of memory-guided unified architectures for domains spanning transformers, DNNs, scientific computing, and image processing, and codifies key algorithms for memory hierarchy tailoring and integration (Moon et al., 1 Mar 2025; Bause et al., 2024; Wang et al., 8 Jan 2026; Shao et al., 2021).

1. Memory-Guided Compression and On-Chip Hierarchy

Reducing external memory access (EMA) is central to advanced accelerators. For transformer inference, T-REX decomposes each weight matrix $W$ as $W = W_s + W_d$, where $W_s$ is a dense “shared” matrix preloaded once on-chip and $W_d$ is a layer-specific sparse matrix subject to hard $k$-sparsity per column. Comprehensive bit-width minimization is achieved by non-uniform quantization of $W_s$ to 4 bits using LUTs, together with delta-plus-uniform quantization of $W_d$'s indices and values. Combined with efficient encoding and reordering, this approach yields an EMA reduction of 31× to 65.9×, verified across multiple models and benchmarks (Moon et al., 1 Mar 2025).
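
The decomposition can be illustrated with a short NumPy sketch. This is a simplified model rather than the T-REX implementation: it assumes the shared matrix $W_s$ is already given, prunes the residual to the $k$ largest-magnitude entries per column, and uses quantile-based codebook levels as a stand-in for the actual non-uniform 4-bit quantizer.

```python
import numpy as np

def topk_per_column(delta: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest-magnitude entries in each column (hard k-sparsity)."""
    sparse = np.zeros_like(delta)
    for j in range(delta.shape[1]):
        idx = np.argsort(np.abs(delta[:, j]))[-k:]   # indices of the k largest entries
        sparse[idx, j] = delta[idx, j]
    return sparse

def lut_quantize(x: np.ndarray, bits: int = 4):
    """Non-uniform quantization: map each value to the nearest of 2**bits codebook
    levels; here the levels are empirical quantiles of x (a stand-in codebook)."""
    levels = np.quantile(x, np.linspace(0.0, 1.0, 2 ** bits))
    codes = np.abs(x[..., None] - levels).argmin(axis=-1)
    return codes.astype(np.uint8), levels

def decompose(W: np.ndarray, W_s: np.ndarray, k: int):
    """Split W into the shared dense part W_s (preloaded once on-chip) and a
    layer-specific k-sparse residual W_d = W - W_s, then quantize W_s to 4 bits."""
    W_d = topk_per_column(W - W_s, k)
    codes_s, lut_s = lut_quantize(W_s, bits=4)
    return codes_s, lut_s, W_d
```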

In general DNN workloads, configurable multi-level SRAM hierarchies absorb off-chip latency and adapt buffer sizing (depth $N_i$, width $W_i$, and ports per level) to match the reuse and unroll patterns of each layer, balancing storage and bandwidth provisioning (Bause et al., 2024). An output shift register (OSR) is often added for fine-granularity strided or overlapped access, further improving flexibility.
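
A minimal sketch of how per-layer reuse and unroll parameters might drive the sizing of one buffer level is shown below; the data structure and heuristic are illustrative assumptions, not the configuration algorithm of Bause et al.

```python
from dataclasses import dataclass
import math

@dataclass
class LayerReuse:
    reuse_window: int   # elements that must stay resident on-chip to be reused
    unroll: int         # spatial unroll factor: elements consumed per cycle
    elem_bits: int      # width of one element in bits

def size_buffer_level(layer: LayerReuse, sram_word_bits: int = 64) -> dict:
    """Size one SRAM level so the reuse window fits and one cycle's unrolled
    accesses can be served without stalling."""
    width_bits = layer.unroll * layer.elem_bits          # bits delivered per cycle
    ports = math.ceil(width_bits / sram_word_bits)       # parallel words -> banks/ports
    depth = math.ceil(layer.reuse_window / layer.unroll) # rows needed to hold the window
    return {"depth": depth, "width_bits": width_bits, "ports": ports}
```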

For CNNs, on-chip storage of interlayer feature maps is drastically reduced through in-stream lossless/lossy compression (e.g., 8×8 DCT+quantization) and lightweight bitmask-driven SRAM layouts. Dynamically reconfigurable partitioning lets the same SRAM blocks serve as feature, scratchpad, or index buffers according to per-layer needs (Shao et al., 2021).
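
The in-stream compression path can be sketched as follows, assuming a uniform quantization step and a dense bitmask per 8×8 tile; the actual quantization tables and SRAM packing of Shao et al. may differ.

```python
import numpy as np
from scipy.fft import dctn, idctn

def compress_tile(tile: np.ndarray, q_step: float = 8.0):
    """Compress one 8x8 feature-map tile: 2-D DCT, uniform quantization, and a
    bitmask marking the surviving non-zero coefficients."""
    coeffs = dctn(tile, norm="ortho")
    quant = np.round(coeffs / q_step).astype(np.int16)
    bitmask = quant != 0               # stored alongside the packed non-zero values
    return bitmask, quant[bitmask]

def decompress_tile(bitmask: np.ndarray, values: np.ndarray, q_step: float = 8.0):
    """Scatter the packed values via the bitmask, dequantize, and apply the inverse DCT."""
    quant = np.zeros(bitmask.shape, dtype=np.float32)
    quant[bitmask] = values
    return idctn(quant * q_step, norm="ortho")
```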

2. Dynamic Control Flow: Batching, Scheduling, and Policy

Control logic in memory-guided accelerators must adaptively orchestrate memory and compute to maximize on-chip data reuse and minimize idle cycles. In T-REX, a dynamic batching state machine selects the batch size $B \in \{1, 2, 4\}$ as a function of token length and remaps the dataflow to maximize amortization of parameter DMA over $B$ tokens, achieving up to 3.31× higher utilization for short sequences (Moon et al., 1 Mar 2025).
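
A toy version of such a policy is sketched below; the token-length thresholds are hypothetical, since the actual decision boundaries are not given here.

```python
def select_batch_size(token_len: int, short_thresh: int = 128, mid_thresh: int = 512) -> int:
    """Toy dynamic-batching policy: short sequences leave the PE array under-utilized
    per token, so more tokens are batched to amortize the one-time parameter DMA.
    Thresholds are illustrative, not taken from T-REX."""
    if token_len <= short_thresh:
        return 4          # short sequence: batch aggressively
    if token_len <= mid_thresh:
        return 2
    return 1              # long sequence: compute already dominates DMA cost
```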

Unified accelerators for mixed-precision scientific computing maintain learned mappings in dedicated long-term (LTM) and short-term (STM) BRAM buffers. Precision and parallelism are assigned through memory-guided selector units: small hardware lookups that, based on historical statistics (e.g., condition number, utilization, accuracy trade-offs), select precision tuples and dynamic 4D systolic-array tilings for each compute stage (Wang et al., 8 Jan 2026). Runtime adaptation is continuous: after each execution batch, the STM is updated and the thresholds in the LTM are tuned to further minimize energy and maximize accuracy under changing workload demands.
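
The behavior of such a selector can be modeled in software roughly as follows; the threshold values, the update rule, and the field names are illustrative assumptions, not the hardware of Wang et al.

```python
from dataclasses import dataclass, field

@dataclass
class PrecisionSelector:
    """Illustrative model of a memory-guided selector: long-term memory (LTM) holds
    condition-number thresholds, short-term memory (STM) holds recent error samples
    used to nudge those thresholds after each batch."""
    ltm_thresholds: dict = field(default_factory=lambda: {"fp16": 1e3, "fp32": 1e7})
    stm_errors: list = field(default_factory=list)

    def select_precision(self, condition_number: float) -> str:
        if condition_number < self.ltm_thresholds["fp16"]:
            return "fp16"
        if condition_number < self.ltm_thresholds["fp32"]:
            return "fp32"
        return "fp64"

    def update(self, observed_error: float, error_budget: float = 1e-3):
        """After each execution batch, record the error and relax or tighten thresholds."""
        self.stm_errors.append(observed_error)
        recent = sum(self.stm_errors[-8:]) / min(len(self.stm_errors), 8)
        scale = 1.1 if recent < error_budget else 0.9   # allow lower precision if safe
        self.ltm_thresholds = {k: v * scale for k, v in self.ltm_thresholds.items()}
```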

In NPU-PIM unified systems for LLMs, centralized command scheduling ensures that PIM commands, which require DRAM-level atomicity, are strictly isolated from normal DMA traffic to eliminate conflicts and preserve bandwidth efficiency (Seo et al., 2024). Adaptive analytical models route each fully connected layer to either MU/NPU execution or PIM computation, based on profile-derived latency and arithmetic intensity.
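
A roofline-style routing decision of this kind might look like the sketch below; the bandwidth and compute figures are placeholders rather than IANUS measurements, and the real analytical model is profile-derived.

```python
def route_fc_layer(m: int, k: int, n: int,
                   npu_tops: float = 100e12, npu_bw: float = 200e9,
                   pim_tops: float = 10e12, pim_bw: float = 1000e9) -> str:
    """Route an (m x k) @ (k x n) fully connected layer to NPU or PIM execution
    using a simple roofline estimate. All throughput/bandwidth numbers are
    illustrative placeholders."""
    flops = 2 * m * k * n
    weight_bytes = 2 * k * n                      # assume fp16 weights dominate traffic
    npu_latency = max(flops / npu_tops, weight_bytes / npu_bw)
    pim_latency = max(flops / pim_tops, weight_bytes / pim_bw)
    return "NPU" if npu_latency <= pim_latency else "PIM"
```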

3. Dataflow Mechanisms and Buffer Architectures

Physical buffer architectures are specialized to maximize throughput while minimizing per-access latency and area. T-REX employs “two-direction accessible register files” (TRF): 4×4 or 8×8 tiles allowing single-cycle access to a full row or column, thus replacing multiple SRAMs with crossbar-wired register banks and reducing the tile-multiply cycle count by up to $N$ per matrix operation (Moon et al., 1 Mar 2025). The resulting PE hardware utilization increases by 12–20% system-wide.
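
A behavioral model of a TRF tile is sketched below: both a whole row and a whole column are addressable in one access, whereas a conventional single-row-port memory would need $N$ separate reads to gather a column. The class and method names are illustrative.

```python
import numpy as np

class TwoDirectionRegisterFile:
    """Behavioral model of a TRF tile: an N x N register bank wired so that an
    entire row or an entire column can be read or written in a single access."""
    def __init__(self, n: int = 8):
        self.regs = np.zeros((n, n), dtype=np.int32)

    def write_row(self, i: int, data):
        self.regs[i, :] = data        # one access via the row port

    def write_col(self, j: int, data):
        self.regs[:, j] = data        # one access via the column port

    def read_row(self, i: int):
        return self.regs[i, :].copy()

    def read_col(self, j: int):
        return self.regs[:, j].copy()
```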

Push-memory abstractions drive CGRA-based image/DNN pipelines, where unified buffers combine storage, per-port address sequencing, and cycle-accurate schedule gating. Here, compiler algorithms (via polyhedral analysis) synthesize a minimal set of vectorized/register-buffered/strip-mined physical unified buffers (PUBs), ensuring line-rate delivery and precise control of memory-port contention (Liu et al., 2021).

Coalescing, banking, and shift-register introduction (for address/schedule equivalence across ports) reduce on-chip occupancy while preserving functional correctness, and are mapped automatically per-stage by ILP-driven compilers (Ujjainkar et al., 2023). For data with irregular access (e.g., sparse tensors, data-dependent pipelines), curriculum learning and dynamic policy networks tune the memory structure online for optimal throughput and error.
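
The port-coalescing condition can be approximated in software as in the sketch below: two ports may share one bank through a shift register when their address streams coincide up to a constant cycle offset. This greedy check is a simplified stand-in for the ILP formulation, and the trace format is an assumption.

```python
def can_share_bank(trace_a, trace_b) -> bool:
    """Two ports can share one bank (via a shift register) when they touch the same
    address sequence with a constant cycle offset. Traces are lists of
    (cycle, address) pairs; this is a simplified stand-in for the ILP check."""
    if [a for _, a in trace_a] != [a for _, a in trace_b]:
        return False                              # different address streams
    offsets = {ca - cb for (ca, _), (cb, _) in zip(trace_a, trace_b)}
    return len(offsets) == 1                      # pure time shift between the ports

def coalesce_ports(ports: dict) -> list:
    """Greedily merge ports into shared banks; real compilers solve this with ILP."""
    banks = []                                    # list of (representative trace, members)
    for name, trace in ports.items():
        for rep, members in banks:
            if can_share_bank(trace, rep):
                members.append(name)
                break
        else:
            banks.append((trace, [name]))
    return [members for _, members in banks]
```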

4. Unified Integration: Processing Elements, Reuse, and Control

A memory-guided unified accelerator physically integrates a PE array (matrix/vector compute cores, convolution engines, or scientific compute tiles) with a global buffer, per-core memory banks, and on-chip interconnect. Batch-wise and layer-wise data reuse are central:

  • T-REX: Four DMM (dense matrix) and four SMM (sparse matrix) cores share a global buffer holding $W_s$, $W_d$, and activations, and reuse the preloaded $W_s$ across all layers and batches, with controller-driven overlapping of DMA and compute (Moon et al., 1 Mar 2025).
  • IANUS: NPUs with MU and VU units share a single GDDR6-AiM-based memory for both standard DRAM loads/stores and PIM compute, eliminating parameter duplication and leveraging analytical cost models for optimal workload placement (Seo et al., 2024).
  • Scientific computing MGUA: Three-stage streaming pipeline (AP-FEM, SNN, Sparse Tensor Engine) connected by double-buffered BRAM, with a unified FSM (Control Unit) tracking module readiness, data availability, and adaptive configuration trajectories (Wang et al., 8 Jan 2026).

Key attributes are the ability to perform static and dynamic memory assignment, elastic partitioning of buffers, and hardware-level policy enforcement per application cycle and per workload phase.

5. System-Level Metrics, Benefits, and Quantitative Validation

Measured system metrics validate the approach:

| Accelerator | EMA / Memory Saving | Utilization Gain | Latency / Energy | Accuracy Impact |
|---|---|---|---|---|
| T-REX (Moon et al., 1 Mar 2025) | 31–65.9× | 1.2–3.4× | 68–567 μs/token, 0.41–3.95 μJ/token | <0.3% acc. loss |
| CNN Compress (Shao et al., 2021) | 1.4–3.3× (on-chip/off-chip) | -- | 403 GOPS, 2.16 TOPS/W | <1% top-1 loss |
| DNN Hierarchy (Bause et al., 2024) | 62.2% area reduction | 2.4% max. perf. loss | -- | -- |
| MGUA (Wang et al., 8 Jan 2026) | 34% energy reduction | 45–65% more throughput vs. domain-specific | 2.8% L2 error improvement | -- |
| IANUS (Seo et al., 2024) | 2× footprint, 3.7–4.4× energy | 3.2–6.2× speedup (GPT-2) | -- | -- |

Empirical studies demonstrate that co-optimization of memory capacity, compression, dynamic reuse, and integration of control yields Pareto optimal designs across throughput, area, energy, and accuracy spaces.

6. Configuration and Design Methodology

Systematic configuration is realized through automated analysis of application loop nests or DAGs. The following pattern is representative across domains:

  1. Analyze the application schedule for the minimum buffer reuse window ($cycle\_length_i$), unique weight/input strides, and memory access patterns.
  2. Select memory hierarchy depth and sizing for each level to cover the largest reuse window, matching buffer arch to access width and concurrency required per step (Bause et al., 2024).
  3. Use on-demand, FSM-controlled data arrival and retirement at each buffer level to maximize bank/port utilization and minimize stalls.
  4. Partition, bank, and coalesce buffer SRAM macros to minimize total area subject to throughput and port-contention constraints, solved via ILP or LP approaches (Ujjainkar et al., 2023; Piccolboni et al., 2019).
  5. Configure the controller with state-machine code and DMA descriptors that overlap data load, compute, and AFU execution per batch/layer.

Automated “fit_hierarchy” and “allocate_mem” routines further streamline per-layer or per-stage adaptation for dynamic workloads.
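
A hypothetical sketch of what a `fit_hierarchy`-style routine could look like follows; the routine name comes from the text above, but its real interface and internals are not specified, so the inputs, grouping heuristic, and output format are assumptions.

```python
import math

def fit_hierarchy(layer_windows, unroll, elem_bits=16, max_levels=3):
    """Hypothetical sketch of a `fit_hierarchy`-style routine: choose up to
    `max_levels` buffer levels whose depths cover the distinct reuse windows seen
    across layers, with the smallest window closest to the PEs.
    `layer_windows` lists per-layer reuse-window sizes in elements; `unroll` is the
    spatial unroll (elements needed per cycle)."""
    distinct = sorted(set(layer_windows))                 # ascending reuse windows
    step = max(1, math.ceil(len(distinct) / max_levels))  # merge windows into levels
    chosen = distinct[step - 1::step]
    if chosen[-1] != distinct[-1]:
        chosen.append(distinct[-1])                       # outermost level covers everything
    return [{"level": i, "depth": d, "width_bits": unroll * elem_bits}
            for i, d in enumerate(chosen)]
```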

7. Limitations, Open Problems, and Future Directions

While memory-guided unified accelerators achieve substantial improvements across energy, area, and throughput, they remain subject to several constraints:

  • PIM/DRAM mutual exclusion: concurrent DMA and PIM are generally infeasible, necessitating sophisticated scheduling and sometimes leading to resource underutilization (Seo et al., 2024).
  • Applicability to irregular/sparse workloads may require integration of dynamic sparsity maps and extended memory assignment models (Ujjainkar et al., 2023; Wang et al., 8 Jan 2026).
  • Most flows assume regular compute graphs and static image/mesh dimensions. Expansion to fully dynamic or data-dependent patterns is underway but incomplete.

Prospective enhancements include finer-grained NoC arbitration, fully integrated dynamic scheduling policies, and scaling to multi-chiplet/heterogeneous fabric deployments. The progression toward fully autonomous, memory-controlled compute architectures, in which memory acts as the principal design constraint and driver, reflects a unification of workload diversity, energy scalability, and high-performance computing in modern accelerator systems (Moon et al., 1 Mar 2025; Bause et al., 2024; Seo et al., 2024; Wang et al., 8 Jan 2026; Shao et al., 2021; Ujjainkar et al., 2023; Piccolboni et al., 2019).
