
Memory-Guided Unified Accelerator

Updated 9 January 2026
  • Memory-guided unified hardware accelerators are architectures that co-optimize memory hierarchy, compression, and control flow to eliminate bottlenecks.
  • They employ dynamic batching, reconfigurable SRAM hierarchies, and precision-tuned scheduling to achieve up to 65.9× external memory reduction and 6.2× speedup.
  • Empirical results validate significant gains in throughput, energy efficiency, and performance across applications like DNNs, CNNs, and scientific computing.

A memory-guided unified hardware accelerator is a hardware architecture designed to minimize memory capacity requirements, maximize on-chip data reuse, and enable high throughput across a diverse set of workloads through systematic integration of memory subsystem design, memory-aware dataflow, and dynamically reconfigurable control. Such accelerators are characterized by co-optimization of the memory hierarchy, data compression/encoding, scheduling, and processing element (PE) utilization, thereby eliminating the memory bottlenecks that typically constrain system-level performance and energy efficiency in conventional accelerator designs. Recent literature demonstrates concrete embodiments of memory-guided unified architectures for domains spanning transformers, DNNs, scientific computing, and image processing, and codifies key algorithms for memory hierarchy tailoring and integration (Moon et al., 1 Mar 2025; Bause et al., 2024; Wang et al., 8 Jan 2026; Shao et al., 2021).

1. Memory-Guided Compression and On-Chip Hierarchy

Reducing external memory access (EMA) is central to advanced accelerators. For transformer inference, T-REX decomposes each weight matrix $W$ as $W = W_s + W_d$, where $W_s$ is a dense “shared” matrix preloaded once on-chip and $W_d$ is a layer-specific sparse matrix subject to hard $k$-sparsity per column. Comprehensive bit-width minimization is achieved by non-uniform quantization of $W_s$ to 4 bits using LUTs, together with delta-plus-uniform quantization of $W_d$'s indices and values. Combined with efficient encoding and reordering, this approach yields an EMA reduction of 31× to 65.9×, verified across multiple models and benchmarks (Moon et al., 1 Mar 2025).
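
The decomposition can be illustrated with a short NumPy sketch. This is a simplified model rather than the T-REX implementation: it assumes the shared matrix $W_s$ is already given, prunes the residual to the $k$ largest-magnitude entries per column, and uses quantile-based codebook levels as a stand-in for the actual non-uniform 4-bit quantizer.

```python
import numpy as np

def topk_per_column(delta: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest-magnitude entries in each column (hard k-sparsity)."""
    sparse = np.zeros_like(delta)
    for j in range(delta.shape[1]):
        idx = np.argsort(np.abs(delta[:, j]))[-k:]   # indices of the k largest entries
        sparse[idx, j] = delta[idx, j]
    return sparse

def lut_quantize(x: np.ndarray, bits: int = 4):
    """Non-uniform quantization: map each value to the nearest of 2**bits codebook
    levels; here the levels are empirical quantiles of x (a stand-in codebook)."""
    levels = np.quantile(x, np.linspace(0.0, 1.0, 2 ** bits))
    codes = np.abs(x[..., None] - levels).argmin(axis=-1)
    return codes.astype(np.uint8), levels

def decompose(W: np.ndarray, W_s: np.ndarray, k: int):
    """Split W into the shared dense part W_s (preloaded once on-chip) and a
    layer-specific k-sparse residual W_d = W - W_s, then quantize W_s to 4 bits."""
    W_d = topk_per_column(W - W_s, k)
    codes_s, lut_s = lut_quantize(W_s, bits=4)
    return codes_s, lut_s, W_d
```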

In general DNN workloads, configurable multi-level SRAM hierarchies absorb off-chip latency and adapt buffer sizing (depth $N_i$, width $W_i$, and ports per level) to match the reuse and unroll patterns of each layer, balancing storage and bandwidth provisioning (Bause et al., 2024). An output shift register (OSR) is often added for fine-granularity strided or overlapped access, further improving flexibility.
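
A minimal sketch of how per-layer reuse and unroll parameters might drive the sizing of one buffer level is shown below; the data structure and heuristic are illustrative assumptions, not the configuration algorithm of Bause et al.

```python
from dataclasses import dataclass
import math

@dataclass
class LayerReuse:
    reuse_window: int   # elements that must stay resident on-chip to be reused
    unroll: int         # spatial unroll factor: elements consumed per cycle
    elem_bits: int      # width of one element in bits

def size_buffer_level(layer: LayerReuse, sram_word_bits: int = 64) -> dict:
    """Size one SRAM level so the reuse window fits and one cycle's unrolled
    accesses can be served without stalling."""
    width_bits = layer.unroll * layer.elem_bits          # bits delivered per cycle
    ports = math.ceil(width_bits / sram_word_bits)       # parallel words -> banks/ports
    depth = math.ceil(layer.reuse_window / layer.unroll) # rows needed to hold the window
    return {"depth": depth, "width_bits": width_bits, "ports": ports}
```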

For CNNs, on-chip storage of interlayer feature maps is drastically reduced through in-stream lossless/lossy compression (e.g., 8×8 DCT+quantization) and lightweight bitmask-driven SRAM layouts. Dynamically reconfigurable partitioning lets the same SRAM blocks serve as feature, scratchpad, or index buffers according to per-layer needs (Shao et al., 2021).
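
The in-stream compression path can be sketched as follows, assuming a uniform quantization step and a dense bitmask per 8×8 tile; the actual quantization tables and SRAM packing of Shao et al. may differ.

```python
import numpy as np
from scipy.fft import dctn, idctn

def compress_tile(tile: np.ndarray, q_step: float = 8.0):
    """Compress one 8x8 feature-map tile: 2-D DCT, uniform quantization, and a
    bitmask marking the surviving non-zero coefficients."""
    coeffs = dctn(tile, norm="ortho")
    quant = np.round(coeffs / q_step).astype(np.int16)
    bitmask = quant != 0               # stored alongside the packed non-zero values
    return bitmask, quant[bitmask]

def decompress_tile(bitmask: np.ndarray, values: np.ndarray, q_step: float = 8.0):
    """Scatter the packed values via the bitmask, dequantize, and apply the inverse DCT."""
    quant = np.zeros(bitmask.shape, dtype=np.float32)
    quant[bitmask] = values
    return idctn(quant * q_step, norm="ortho")
```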

2. Dynamic Control Flow: Batching, Scheduling, and Policy

Control logic in memory-guided accelerators must adaptively orchestrate memory and compute to maximize on-chip data reuse and minimize idle cycles. In T-REX, a dynamic batching state machine selects the batch size $B \in \{1, 2, 4\}$ as a function of token length and remaps the dataflow to maximize amortization of parameter DMA over $B$ tokens, achieving up to 3.31× higher utilization for short sequences (Moon et al., 1 Mar 2025).
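
A toy version of such a policy is sketched below; the token-length thresholds are hypothetical, since the actual decision boundaries are not given here.

```python
def select_batch_size(token_len: int, short_thresh: int = 128, mid_thresh: int = 512) -> int:
    """Toy dynamic-batching policy: short sequences leave the PE array under-utilized
    per token, so more tokens are batched to amortize the one-time parameter DMA.
    Thresholds are illustrative, not taken from T-REX."""
    if token_len <= short_thresh:
        return 4          # short sequence: batch aggressively
    if token_len <= mid_thresh:
        return 2
    return 1              # long sequence: compute already dominates DMA cost
```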

Unified accelerators for mixed-precision scientific computing maintain learned mappings in dedicated long-term (LTM) and short-term (STM) BRAM buffers. Precision and parallelism are assigned through memory-guided selector units: small hardware lookups that, based on historical statistics (e.g., condition number, utilization, accuracy trade-offs), select precision tuples and dynamic 4D systolic-array tilings for each compute stage (Wang et al., 8 Jan 2026). Runtime adaptation is continuous: after each execution batch, the STM is updated and the thresholds in the LTM are tuned to further minimize energy and maximize accuracy under changing workload demands.
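
The behavior of such a selector can be modeled in software roughly as follows; the threshold values, the update rule, and the field names are illustrative assumptions, not the hardware of Wang et al.

```python
from dataclasses import dataclass, field

@dataclass
class PrecisionSelector:
    """Illustrative model of a memory-guided selector: long-term memory (LTM) holds
    condition-number thresholds, short-term memory (STM) holds recent error samples
    used to nudge those thresholds after each batch."""
    ltm_thresholds: dict = field(default_factory=lambda: {"fp16": 1e3, "fp32": 1e7})
    stm_errors: list = field(default_factory=list)

    def select_precision(self, condition_number: float) -> str:
        if condition_number < self.ltm_thresholds["fp16"]:
            return "fp16"
        if condition_number < self.ltm_thresholds["fp32"]:
            return "fp32"
        return "fp64"

    def update(self, observed_error: float, error_budget: float = 1e-3):
        """After each execution batch, record the error and relax or tighten thresholds."""
        self.stm_errors.append(observed_error)
        recent = sum(self.stm_errors[-8:]) / min(len(self.stm_errors), 8)
        scale = 1.1 if recent < error_budget else 0.9   # allow lower precision if safe
        self.ltm_thresholds = {k: v * scale for k, v in self.ltm_thresholds.items()}
```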

In NPU-PIM unified systems for LLMs, centralized command scheduling ensures that PIM commands, which require DRAM-level atomicity, are strictly isolated from normal DMA traffic to eliminate conflicts and preserve bandwidth efficiency (Seo et al., 2024). Adaptive analytical models route each fully connected layer to either MU/NPU execution or PIM computation, based on profile-derived latency and arithmetic intensity.
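
A roofline-style routing decision of this kind might look like the sketch below; the bandwidth and compute figures are placeholders rather than IANUS measurements, and the real analytical model is profile-derived.

```python
def route_fc_layer(m: int, k: int, n: int,
                   npu_tops: float = 100e12, npu_bw: float = 200e9,
                   pim_tops: float = 10e12, pim_bw: float = 1000e9) -> str:
    """Route an (m x k) @ (k x n) fully connected layer to NPU or PIM execution
    using a simple roofline estimate. All throughput/bandwidth numbers are
    illustrative placeholders."""
    flops = 2 * m * k * n
    weight_bytes = 2 * k * n                      # assume fp16 weights dominate traffic
    npu_latency = max(flops / npu_tops, weight_bytes / npu_bw)
    pim_latency = max(flops / pim_tops, weight_bytes / pim_bw)
    return "NPU" if npu_latency <= pim_latency else "PIM"
```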

3. Dataflow Mechanisms and Buffer Architectures

Physical buffer architectures are specialized to maximize throughput while minimizing per-access latency and area. T-REX employs “two-direction accessible register files” (TRF): 4×4 or 8×8 tiles allowing single-cycle access to a full row or column, thus replacing multiple SRAMs with crossbar-wired register banks and reducing the tile-multiply cycle count by up to $N$ per matrix operation (Moon et al., 1 Mar 2025). The resulting PE hardware utilization increases by 12–20% system-wide.
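
A behavioral model of a TRF tile is sketched below: both a whole row and a whole column are addressable in one access, whereas a conventional single-row-port memory would need $N$ separate reads to gather a column. The class and method names are illustrative.

```python
import numpy as np

class TwoDirectionRegisterFile:
    """Behavioral model of a TRF tile: an N x N register bank wired so that an
    entire row or an entire column can be read or written in a single access."""
    def __init__(self, n: int = 8):
        self.regs = np.zeros((n, n), dtype=np.int32)

    def write_row(self, i: int, data):
        self.regs[i, :] = data        # one access via the row port

    def write_col(self, j: int, data):
        self.regs[:, j] = data        # one access via the column port

    def read_row(self, i: int):
        return self.regs[i, :].copy()

    def read_col(self, j: int):
        return self.regs[:, j].copy()
```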

Push-memory abstractions drive CGRA-based image/DNN pipelines, where unified buffers combine storage, per-port address sequencing, and cycle-accurate schedule gating. Here, compiler algorithms (via polyhedral analysis) synthesize a minimal set of vectorized/register-buffered/strip-mined physical unified buffers (PUBs), ensuring line-rate delivery and precise control of memory-port contention (Liu et al., 2021).

Coalescing, banking, and shift-register introduction (for address/schedule equivalence across ports) reduce on-chip occupancy while preserving functional correctness, and are mapped automatically per-stage by ILP-driven compilers (Ujjainkar et al., 2023). For data with irregular access (e.g., sparse tensors, data-dependent pipelines), curriculum learning and dynamic policy networks tune the memory structure online for optimal throughput and error.
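
The port-coalescing condition can be approximated in software as in the sketch below: two ports may share one bank through a shift register when their address streams coincide up to a constant cycle offset. This greedy check is a simplified stand-in for the ILP formulation, and the trace format is an assumption.

```python
def can_share_bank(trace_a, trace_b) -> bool:
    """Two ports can share one bank (via a shift register) when they touch the same
    address sequence with a constant cycle offset. Traces are lists of
    (cycle, address) pairs; this is a simplified stand-in for the ILP check."""
    if [a for _, a in trace_a] != [a for _, a in trace_b]:
        return False                              # different address streams
    offsets = {ca - cb for (ca, _), (cb, _) in zip(trace_a, trace_b)}
    return len(offsets) == 1                      # pure time shift between the ports

def coalesce_ports(ports: dict) -> list:
    """Greedily merge ports into shared banks; real compilers solve this with ILP."""
    banks = []                                    # list of (representative trace, members)
    for name, trace in ports.items():
        for rep, members in banks:
            if can_share_bank(trace, rep):
                members.append(name)
                break
        else:
            banks.append((trace, [name]))
    return [members for _, members in banks]
```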

4. Unified Integration: Processing Elements, Reuse, and Control

A memory-guided unified accelerator physically integrates a PE array (matrix/vector compute cores, convolution engines, or scientific compute tiles) with a global buffer, per-core memory banks, and on-chip interconnect. Batch-wise and layer-wise data reuse are central:

  • T-REX: Four DMM (dense matrix) and four SMM (sparse matrix) cores share a global buffer holding $W_s$, $W_d$, and activations, and reuse the preloaded $W_s$ across all layers and batches, with controller-driven overlapping of DMA and compute (Moon et al., 1 Mar 2025).
  • IANUS: NPUs with MU and VU units share a single GDDR6-AiM-based memory for both standard DRAM loads/stores and PIM compute, eliminating parameter duplication and leveraging analytical cost models for optimal workload placement (Seo et al., 2024).
  • Scientific computing MGUA: Three-stage streaming pipeline (AP-FEM, SNN, Sparse Tensor Engine) connected by double-buffered BRAM, with a unified FSM (Control Unit) tracking module readiness, data availability, and adaptive configuration trajectories (Wang et al., 8 Jan 2026).

Key attributes are the ability to perform static and dynamic memory assignment, elastic partitioning of buffers, and hardware-level policy enforcement per application cycle and per workload phase.

5. System-Level Metrics, Benefits, and Quantitative Validation

Measured system metrics validate the approach:

| Accelerator | EMA / Memory Saving | Utilization Gain | Latency / Energy | Accuracy Impact |
|---|---|---|---|---|
| T-REX (Moon et al., 1 Mar 2025) | 31–65.9× | 1.2–3.4× | 68–567 μs/token, 0.41–3.95 μJ/token | <0.3% acc. loss |
| CNN Compress (Shao et al., 2021) | 1.4–3.3× (on-chip/off-chip) | -- | 403 GOPS, 2.16 TOPS/W | <1% top-1 loss |
| DNN Hierarchy (Bause et al., 2024) | 62.2% area reduction | 2.4% max. perf. loss | -- | -- |
| MGUA (Wang et al., 8 Jan 2026) | 34% energy reduction | 45–65% more throughput vs. domain-specific | 2.8% L2 error improvement | -- |
| IANUS (Seo et al., 2024) | 2× footprint, 3.7–4.4× energy | 3.2–6.2× speedup (GPT-2) | -- | -- |

Empirical studies demonstrate that co-optimization of memory capacity, compression, dynamic reuse, and integration of control yields Pareto optimal designs across throughput, area, energy, and accuracy spaces.

6. Configuration and Design Methodology

Systematic configuration is realized through automated analysis of application loop nests or DAGs. The following pattern is representative across domains:

  1. Analyze the application schedule for the minimum buffer reuse window ($cycle\_length_i$), unique weight/input strides, and memory access patterns.
  2. Select memory hierarchy depth and sizing for each level to cover the largest reuse window, matching buffer arch to access width and concurrency required per step (Bause et al., 2024).
  3. Use on-demand, FSM-controlled data arrival and retirement at each buffer level to maximize bank/port utilization and minimize stalls.
  4. Partition, bank, and coalesce buffer SRAM macros to minimize total area subject to throughput and port-contention constraints, solved via ILP or LP approaches (Ujjainkar et al., 2023; Piccolboni et al., 2019).
  5. Configure the controller with state-machine code and DMA descriptors that overlap data load, compute, and AFU execution per batch/layer.

Automated “fit_hierarchy” and “allocate_mem” routines further streamline per-layer or per-stage adaptation for dynamic workloads.
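
A hypothetical sketch of what a `fit_hierarchy`-style routine could look like follows; the routine name comes from the text above, but its real interface and internals are not specified, so the inputs, grouping heuristic, and output format are assumptions.

```python
import math

def fit_hierarchy(layer_windows, unroll, elem_bits=16, max_levels=3):
    """Hypothetical sketch of a `fit_hierarchy`-style routine: choose up to
    `max_levels` buffer levels whose depths cover the distinct reuse windows seen
    across layers, with the smallest window closest to the PEs.
    `layer_windows` lists per-layer reuse-window sizes in elements; `unroll` is the
    spatial unroll (elements needed per cycle)."""
    distinct = sorted(set(layer_windows))                 # ascending reuse windows
    step = max(1, math.ceil(len(distinct) / max_levels))  # merge windows into levels
    chosen = distinct[step - 1::step]
    if chosen[-1] != distinct[-1]:
        chosen.append(distinct[-1])                       # outermost level covers everything
    return [{"level": i, "depth": d, "width_bits": unroll * elem_bits}
            for i, d in enumerate(chosen)]
```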

7. Limitations, Open Problems, and Future Directions

While memory-guided unified accelerators achieve substantial improvements across energy, area, and throughput, they remain subject to several constraints:

  • PIM/DRAM mutual exclusion: concurrent DMA and PIM are generally infeasible, necessitating sophisticated scheduling and sometimes leading to resource underutilization (Seo et al., 2024).
  • Applicability to irregular/sparse workloads may require integration of dynamic sparsity maps and extended memory assignment models (Ujjainkar et al., 2023; Wang et al., 8 Jan 2026).
  • Most flows assume regular compute graphs and static image/mesh dimensions. Expansion to fully dynamic or data-dependent patterns is underway but incomplete.

Prospective enhancements include finer-grained NoC arbitration, fully integrated dynamic scheduling policies, and scaling to multi-chiplet/heterogeneous fabric deployments. The progression toward fully autonomous, memory-controlled compute architectures, in which memory acts as the principal design constraint and driver, reflects a unification of workload diversity, energy scalability, and high-performance computing in modern accelerator systems (Moon et al., 1 Mar 2025; Bause et al., 2024; Seo et al., 2024; Wang et al., 8 Jan 2026; Shao et al., 2021; Ujjainkar et al., 2023; Piccolboni et al., 2019).
