RACM Accelerators: Memory Constraints & Strategies

Updated 18 December 2025

RACM accelerators are architectures defined by a low random-to-sequential bandwidth ratio, requiring integrated hardware-software design to maximize efficiency.
They employ techniques like polyhedral dataflow transformations, dynamic allocation, and banked streaming engines to enhance memory utilization and throughput.
Empirical studies report significant speedup and energy efficiency improvements, underscoring their role in optimizing large model inference and DNN serving.

A Random-Access-Constrained Memory (RACM) accelerator is a computational architecture where the underlying memory technology exhibits a pronounced disparity between sequential (streaming) and random-access bandwidths, with $B_\mathrm{rand}/B_\mathrm{seq} \lesssim \alpha < 1$ , for a constant $\alpha$ substantially less than unity. This constraint is characteristic of memory technologies such as LPDDR5, eNVM crossbar arrays, cryogenic memories, software-managed scratchpad banks, and some emerging processing-in-memory (PIM) substrates. Such platforms require carefully co-designed hardware–software mechanisms to maximize throughput and memory utilization, despite the penalty for fine-grained, low-locality memory references. RACM accelerators have emerged as a necessity for efficient large model inference, in-memory computing, and bandwidth-constrained DNN serving, prompting new allocation strategies, dataflow transformations, and adaptive runtime scheduling.

1. Definition and Core Memory Constraints

A RACM substrate is defined by the relative suppression of small, random-accessed reads/writes compared to its streaming bandwidth:

$B_\mathrm{seq}$ : sustained bandwidth under large, streaming-oriented accesses.
$B_\mathrm{rand}$ : bandwidth for cache-line-scale, low-locality (random) accesses.
Random-access-constrained if $B_\mathrm{rand}/B_\mathrm{seq} \lesssim \alpha < 1$ .

Commodity LPDDR5, for instance, delivers near-peak $B_\mathrm{seq}$ but only $30$– $70\,\%$ of peak for small, random requests; command-bus occupancy and burst granularity intensify this effect. Similar constraints are found in crossbar eNVM arrays (where analog-to-digital conversion precision limits row-level parallel reads), cryogenic SFQ SHIFT registers, and processor-managed scratchpad architectures (Zou et al., 10 Dec 2025, Crafton et al., 2020, Zokaee et al., 2021).

2. Representative RACM Architectures

2.1 LPDDR- and GDDR-Based Platforms

LPDDR-class memories on devices like Cambricon MLU370-X4 severely penalize fine-grained paging or PagedAttention-style indirection, leading to degraded throughput for workloads (e.g., LLM KV-cache) demanding frequent dynamic memory allocations (Zou et al., 10 Dec 2025). Static memory pre-allocation further reduces effective concurrency.

2.2 In-DRAM Bit-Serial PIM

The RACAM architecture incorporates bank-locality buffers, bit-serial processing elements, popcount reduction units, and on-die broadcast networks to exploit streaming access and data reuse. By minimizing row activations from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$ for $n$ -bit operands, and by reusing loaded data through locality buffers, RACAM attains a bit-serial reuse factor approaching $R \rightarrow 0.875$ for $n=8$ , thereby boosting effective bandwidth (Ma et al., 10 Dec 2025).

2.3 Compute-In-Memory (CIM) with Crossbar Arrays

CIM fabrics using RRAM/PCRAM are fundamentally RACM devices. Small ADC quantization windows (e.g., 3 bits) enforce time-multiplexed access and throttle random parallel activation, while synchronization barriers (layer/block) impact global array utilization (Crafton et al., 2020).

2.4 Scratchpad and Hybrid SRAM Architectures

Scratchpad-memory accelerators (e.g., AWS Inferentia), as described in global memory-access-optimization frameworks, deploy polyhedral compilation to minimize the frequency of costly cross-bank or off-chip transitions. These transformations ensure maximal streaming-style reuse within each banked local memory (Zheng et al., 2020).

2.5 Decoupled Dataflow/Banked Streaming Engines

Access-execute decoupling, e.g., DataMaestro’s streamer+PE+datapath pipeline, leverages programmable affine AGUs, fine-grained prefetch, and dynamic address permutation to orchestrate efficient streaming into banked SRAM or DRAM, while mitigating bank conflicts and shielding against random-access stalls (Yi et al., 18 Apr 2025).

3. Memory Management and Allocation Strategies

3.1 Contiguous/Adaptive KV-cache Management

The ODMA framework introduces allocation strategies for LLM serving that combine a lightweight prompt-length predictor (with uncertainty scaling), dynamic bucket partitioning (quantiles from live histograms), and a large-bucket fallback to maximize utilization and RPS/TPS. Allocation operates subject to

$\max_{\mathbf{x}} U(\mathbf{x}) = \frac{\sum_{i=1}^N \mathrm{size}(x_i)}{C}, \quad \min_{\mathbf{x}} L(\mathbf{x}) = \sum_{i=1}^N L_i(x_i)$

subject to memory capacity constraints. Contiguous buffers ensure streaming-friendly layout, avoiding the random-access overhead of fine-grained paging (Zou et al., 10 Dec 2025).

3.2 Load Balancing in CIM and Banked SRAM

In both crossbar CIM and banked SRAM RACM architectures, profiling-driven, greedy allocation algorithms assign arrays among fine-grained blocks to minimize makespan (maximum block latency), adapting dynamically to input data sparsity and observed zero-skipping statistics. This yields utilization increases above $90\,\%$ and up to $7.47\times$ speedup in ResNet18 benchmarking (Crafton et al., 2020).

3.3 Polyhedral Dataflow Transformations

Systematic tiling and bank mapping, driven by the polyhedral model, enable global minimization of cross-bank and off-chip references while guaranteeing that all live data fits in the constrained local memory. Data-movement elimination (DME) further collapses superfluous operator-local copies (Zheng et al., 2020).

3.4 Decoupled Asynchronous Memory Access

CXLMemUring and DataMaestro demonstrate that advanced in-core logic, endpoint-side weak cores, and programmable access streamers can both batch requests and adapt prefetch window sizes to maximize overlap between host computation and remote memory fetch, tailored to the RACM’s response characteristics (Yang, 2023, Yi et al., 18 Apr 2025).

4. Performance/Limitation Analysis and Empirical Impact

4.1 Key Performance Metrics

Memory Utilization: $U = \mathrm{used\_bytes}/\mathrm{capacity\_bytes}$
Prediction Accuracy (for allocators): $\mathrm{Accuracy} = \#\mathrm{correct}/\#\mathrm{total}$
Requests/Tokens per Second (LLMs): $\mathrm{RPS}$ , $\mathrm{TPS}$
Block/Array Makespan (CIM): $T_i(k_i) = W_i/(k_i \mu_i)$

4.2 Notable Benchmarks

Accelerator	Utilization	Speedup/Improvement	Notes
ODMA (MLU370-X4)	72.45%	+29% RPS, +27% TPS	vs static; 99.55% pred. acc. on Alpaca
RACAM (GPT-3)	–	$9\times$ – $102\times$ over GPU	233 $\times$ perf./mm $^2$ vs Proteus
SMART (SFQ CNN)	–	$3.9\times$ throughput	$86\%$ energy reduction (single image)
DataMaestro (512PE)	95–100%	$1.05$– $21.39\times$	2.57 TOPS/W, 6.43% area, 15.06% power
CIM (ResNet18)	>90%	$7.47\times$	After greedy block allocation
Inferentia	–	$1.7\times$ – $2\times$	With global bank mapping, DME (Zheng et al., 2020)

4.3 Fundamental Limitations

Device-level constraints—such as small ADC quantization in crossbars, TLB windowing in HBM (e.g., $\leq$ 64GiB per controller on A100 (Walker, 19 May 2024)), finite command-bus width, and synchronization barriers—are inherent. Solutions must adapt to but cannot eliminate these; partitioned scheduling, window-conscious memory layout, and bank-aware tiling are universal mitigations.

5. Design Principles and Synthesis Across RACM Variants

Favor streaming/sequential access: All RACM-aware frameworks employ aggressive tiling, contiguous pre-allocation, or in-place logic to preserve locality or enable amortization of expensive operations.
Exploit data reuse: Bit-serial PIM, CIM, and DCRAM architectures minimize data movement through local buffering, reuse of loaded operands, and in-memory logic, evaluated with explicit reuse-factor models (Ma et al., 10 Dec 2025, Traversa et al., 2013).
Dynamic, hardware-aware allocation: Bucketing, greedy task allocation, or online adaptation (ODMA, profiling-driven scheduling, ILP-based placement in SMART) are essential to track workload drifts, heavy-tailed input, and device-age/lifetime effects.
Architectural co-design: Many solutions embed weak endpoint cores, reconfigurable streaming engines, or hybrid SRAM structures to resolve the tension between sequential and random-access efficiency (Yang, 2023, Zokaee et al., 2021).
Efficient software/hardware interfaces: Polyhedral optimization, JIT-guided prefetch window adaptation, and customizable access pattern engines (ND AGUs) bridge traditional compute models and RACM constraints (Zheng et al., 2020, Yi et al., 18 Apr 2025).

6. Open Challenges and Research Trajectories

Joint optimization of allocation, scheduling, and compute placement over multi-tenant accelerators remains an unsolved online problem (Zou et al., 10 Dec 2025).
Automated, real-time adaptation to input distribution and workload drift: While sliding-window histograms and background retraining exist, integration with full-precision online learning is still emerging.
Scaling to larger, heterogeneous memory regimes: Combining RACM with HBM for multi-level hierarchies, or integrating with NVDLA-style DMA, opens further complexity in partitioned and nested offloads (Yang, 2023).
Device advancements: Increasing TLB reach (HBM), wider ADCs or improved quantization (CIM), and lower-overhead in-place logic (DCRAM) offer increased effective bandwidth but require area, power, and reliability trade-offs.
Compiler and OS integration: For full exploitation, RACM-aware compilation and runtime tiling must become standard in ML frameworks and DNN execution engines (Zheng et al., 2020).

In sum, RACM accelerators unify a broad class of architectures where hardware, allocation software, and runtime scheduling are co-designed to exploit streaming and data reuse, overcome random-access bottlenecks, and directly target energy-efficient, high-throughput inference and in-memory computation across advanced memory technologies. The continued evolution of these devices will depend on finer-grained adaptation both at the hardware and software levels, leveraging advances in modeling, allocation, and profiling to approach the theoretical ceilings imposed by their underlying memory physics.