On-Chip Memory-Only Framework
- On-chip memory-only frameworks are architectures that localize all data storage on the chip, optimizing energy use, latency, and throughput.
- They leverage methods like quantization, pruning, and dynamic scheduling to ensure efficient use of limited on-chip memory in diverse applications.
- Compiler-assisted optimizations and streamlined simulation techniques enable cycle-accurate operation and reduced power consumption in real-time systems.
An on-chip memory-only framework is a design paradigm or simulation environment wherein all core data storage and memory transactions occur within the boundaries of the chip—eschewing reliance on off-chip DRAM or other external storage components. Such frameworks span digital, analog, and optical domains and find critical use in FPGA implementations, neural network accelerators, multiprocessor simulation platforms, memory-efficient image processing pipelines, and next-generation SoC architectures. They target maximum throughput, improved energy efficiency, reduced latency, and enhanced integration density—all while leveraging emerging memory technologies, specialized scheduling tactics, and sometimes novel hardware primitives or simulation wrappers.
1. Foundational Principles of On-Chip Memory-Only Frameworks
The essential property of on-chip memory-only approaches is the containment of all working data—weights, activations, buffers, intermediate states—within the chip’s dedicated memory banks (such as SRAM, BRAM, eNVM, or ROM). Traditional designs often depend on external DRAM bandwidth (especially when memory demands exceed on-chip capacity). On-chip memory-only frameworks employ radical architectural, quantization, and resource scheduling strategies to circumvent these bottlenecks.
Examples:
- FPGA-based DNN accelerators quantize weights to 3 bits (Park et al., 2016), enabling multi-million weight storage in on-chip BRAM and eliminating DRAM traffic.
- Object detection CNN engines implement stream-based line buffers and rigorous accelerator-aware pruning to fit all weights and activations into FPGA internal memory (Kang, 2022).
The key is aggressive memory footprint reduction (via quantization, pruning, hierarchical scheduling, and block-wise pipelining), guaranteeing that intermediate data and model parameters remain on-chip throughout execution.
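To make the footprint argument concrete, the back-of-the-envelope sketch below compares 32-bit and 3-bit weight storage against an on-chip BRAM budget. The model and BRAM sizes are hypothetical illustrations, not figures from the cited papers.

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // Hypothetical sizes for illustration; not taken from the cited papers.
    const uint64_t num_weights = 4'000'000;            // multi-million-weight DNN
    const uint64_t bram_bits   = 16ULL * 1024 * 1024;  // ~16 Mbit of on-chip BRAM

    const uint64_t fp32_bits = num_weights * 32;       // full-precision storage
    const uint64_t q3_bits   = num_weights * 3;        // 3-bit quantized storage

    std::cout << std::boolalpha
              << "FP32 weights:  " << fp32_bits / 8.0 / (1 << 20) << " MiB, fits: "
              << (fp32_bits <= bram_bits) << '\n'
              << "3-bit weights: " << q3_bits / 8.0 / (1 << 20) << " MiB, fits: "
              << (q3_bits <= bram_bits) << '\n';
    return 0;
}
```

Only the 3-bit variant fits within the assumed budget, which is why quantization, together with pruning and scheduling, is a precondition for DRAM-free execution.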
2. Dynamic Memory Management and Simulation Integration
In co-simulation and MPSoC prototyping, dynamic memory operations pose significant fidelity and performance challenges. A foundational technique leverages host OS memory management capabilities to emulate dynamic allocation, deallocation, and pointer arithmetic in high-speed simulation contexts (0710.4646).
- When an allocation occurs in the simulation, a wrapper calls the host’s calloc; the returned pointer is mapped and indexed inside a simulation-local pointer table, which handles translation between virtual and real addresses.
- Pointer arithmetic is rigorously tracked. For each new allocation, a virtual pointer ($V_{ptr}$) is created and mapped to the host pointer ($H_{ptr}$) returned by calloc; offset calculations for a simulated access $V_{acc}$ are handled via $H_{acc} = H_{ptr} + (V_{acc} - V_{ptr})$. A minimal wrapper sketch follows this list.
- These mappings ensure that data coherency and timing fidelity are maintained, even under concurrent, multi-core access—using mechanisms such as reservation bits (semaphores) for pointer reservations.
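A minimal sketch of such a wrapper appears below. The class name, fixed virtual base address, and map-based pointer table are illustrative assumptions rather than the actual interface of the framework in (0710.4646); the sketch only demonstrates calloc-backed allocation and the $H_{acc} = H_{ptr} + (V_{acc} - V_{ptr})$ translation, with a small main showing usage.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <map>
#include <new>
#include <stdexcept>

// Illustrative simulation-local pointer table: maps simulated (virtual)
// pointers to host pointers obtained from the host OS allocator.
class SimMemoryWrapper {
public:
    // Simulated allocation: backed by the host's calloc; returns a virtual
    // pointer (Vptr) from a private address range to the simulated program.
    uint64_t sim_calloc(std::size_t count, std::size_t size) {
        void* hptr = std::calloc(count, size);
        if (hptr == nullptr) throw std::bad_alloc();
        const uint64_t vptr = next_vptr_;
        table_.emplace(vptr, Block{hptr, count * size});
        next_vptr_ += count * size;          // keep virtual blocks disjoint
        return vptr;
    }

    // Translate a simulated access Vacc into a host address:
    // Hacc = Hptr + (Vacc - Vptr) for the allocation containing Vacc.
    void* translate(uint64_t vacc) const {
        auto it = table_.upper_bound(vacc);
        if (it == table_.begin()) throw std::out_of_range("unmapped address");
        --it;                                // allocation with greatest base <= Vacc
        const uint64_t offset = vacc - it->first;
        if (offset >= it->second.size) throw std::out_of_range("unmapped address");
        return static_cast<uint8_t*>(it->second.hptr) + offset;
    }

    void sim_free(uint64_t vptr) {
        auto it = table_.find(vptr);
        if (it == table_.end()) return;
        std::free(it->second.hptr);
        table_.erase(it);
    }

private:
    struct Block { void* hptr; std::size_t size; };
    std::map<uint64_t, Block> table_;        // Vptr base -> host block
    uint64_t next_vptr_ = 0x10000000ULL;     // arbitrary virtual base address
};

int main() {
    SimMemoryWrapper mem;
    const uint64_t v = mem.sim_calloc(64, sizeof(int));           // simulated allocation
    int* host = static_cast<int*>(mem.translate(v + 4 * sizeof(int)));
    *host = 42;                                                   // access element 4
    mem.sim_free(v);
    return 0;
}
```

The reservation bits and other concurrency controls mentioned above would sit on top of this table; they are omitted to keep the sketch minimal.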
The result is rapid, cycle-accurate simulation of multiprocessor memory subsystems without the prohibitively slow overhead of explicit memory modeling.
3. Memory Reduction Techniques and Architectural Specialization
On-chip memory-only frameworks routinely employ domain-specific architectural optimizations:
- Quantization and Pruning: Reducing weight bitwidth (e.g., 3-bit weights (Park et al., 2016)) or applying accelerator-aware pruning (e.g., 75% of weights pruned and mapped for regular access (Kang, 2022)) ensures that model parameters fit within the available memory; a generic quantization sketch follows this list.
- Stream-Based Dataflow and Line Buffers: Intermediate activation storage is minimized by retaining only the rows the convolution window needs (a sliding window of $K$ rows). For CNN pipelines, the buffer requirement drops from $W \times H$ (a full frame) to $K \times W$ per layer (Kang, 2022); a buffer-sizing sketch appears at the end of this section.
- Hierarchical Banking and Pipelining: Layer-dedicated pipeline blocks, each with a self-contained memory buffer, enable efficient input-output streaming and concurrency using only on-chip banks (Kang, 2022, Ujjainkar et al., 2023).
- Power-Gated Partitioning: Memory banks and sectors are partitioned and selectively power-gated based on utilization; sleep transistors and handshake protocols minimize static leakage (Marchisio et al., 2019).
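As an illustration of the quantization step, the sketch below packs floating-point weights into 3-bit signed codes using plain uniform quantization; it is a generic stand-in, not the specific scheme of (Park et al., 2016).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Uniform symmetric quantization to 3-bit signed codes in [-4, 3].
// Generic illustration; the cited accelerator may use a different scheme.
std::vector<int8_t> quantize_3bit(const std::vector<float>& w, float& scale_out) {
    float max_abs = 0.0f;
    for (float x : w) max_abs = std::max(max_abs, std::fabs(x));
    const float scale = (max_abs > 0.0f) ? max_abs / 4.0f : 1.0f;  // avoid div-by-zero

    std::vector<int8_t> q(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) {
        const int code = static_cast<int>(std::lround(w[i] / scale));
        q[i] = static_cast<int8_t>(std::clamp(code, -4, 3));       // 3-bit signed range
    }
    scale_out = scale;  // kept on-chip for dequantization during inference
    return q;
}

int main() {
    std::vector<float> w = {0.12f, -0.5f, 0.03f, 0.9f, -0.77f};
    float scale = 0.0f;
    std::vector<int8_t> q = quantize_3bit(w, scale);  // q now fits in 3 bits/weight
    return 0;
}
```

Ten such 3-bit codes can be packed into each 32-bit memory word, which is what allows multi-million-weight models to remain resident in on-chip BRAM.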
Tabular summary:
| Technique | Memory Impact | Example Paper |
|---|---|---|
| Quantization | Bitwidth ↓ | (Park et al., 2016) |
| Pruning | Model size ↓ | (Kang, 2022) |
| Line buffers | Activations ↓ | (Kang, 2022) |
| Sector power gating | Leakage ↓ | (Marchisio et al., 2019) |
These strategies operate in concert to enable complex inference pipelines and real-time processing on memory-limited hardware.
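To quantify the line-buffer argument above, the sketch below compares full-frame buffering against a $K$-row line buffer for a stream-based convolution layer; the frame, channel, and kernel sizes are hypothetical.

```cpp
#include <cstddef>
#include <iostream>

int main() {
    // Hypothetical layer shape; values are illustrative, not from the cited work.
    const std::size_t W = 640, H = 480;  // frame width and height in pixels
    const std::size_t C = 16;            // channels stored per pixel position
    const std::size_t K = 3;             // convolution window height (rows retained)

    const std::size_t full_frame = W * H * C;  // buffering a whole intermediate frame
    const std::size_t line_buf   = K * W * C;  // only the K rows the window touches

    std::cout << "Full-frame buffer: " << full_frame << " values\n"
              << "Line buffer (K=" << K << "): " << line_buf << " values\n"
              << "Reduction: " << static_cast<double>(full_frame) / line_buf << "x\n";
    return 0;
}
```

For the assumed 640x480 frame with a 3-row window, the per-layer activation buffer shrinks by a factor of 160, independent of the channel count.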
4. Hardware Specialization for Emerging Workloads
Recent advances integrate not only high-density and low-leakage memory cells, but also novel compute-in-memory (CiM), ROM-based (Chen et al., 2022, Wang et al., 17 Mar 2025), and hybrid storage architectures:
- ROM-Based CiM and LLM Acceleration: Base model weights for LLMs are stored in high-density ROM (using block-optimized B-ROM), while dynamically updated data (LoRA weights, KV cache) resides in SRAM. A fused-cell layout co-locates ROM and compute functions for area efficiency (Wang et al., 17 Mar 2025, Chen et al., 2022); a software analogue of the ROM/SRAM split is sketched after this list.
- Emerging NVM Integration: Many frameworks exploit eNVM (RRAM, PCM, FeFET) for ultra-dense, low-power storage of synaptic weights or cache lines. Cross-stack evaluation tools such as NVMExplorer (Pentecost et al., 2021) provide quantitative trade-off analysis on density, endurance, latency, and energy for system architects.
- Heterogeneous On-Chip Memory Profiling: Application-guided profiling steers short-lived data into silicon gain cell RAM (Si-GCRAM), yielding a 66.8% energy saving in case studies, while data with longer lifetimes remains in SRAM or emerging NVM (Li et al., 21 Apr 2025).
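As a software analogue of the ROM/SRAM split (not the fused-cell circuit of (Wang et al., 17 Mar 2025), and with deliberately tiny, illustrative dimensions), the sketch below keeps the dense base weights in a constant, ROM-placeable array while the low-rank LoRA factors stay writable, computing $y = (W_{ROM} + AB)x$ without ever materializing the sum.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 4;   // tiny illustrative dimension
constexpr std::size_t R = 1;   // LoRA rank

// Base weights: constant data, placeable in ROM / read-only memory by the linker.
constexpr std::array<std::array<float, N>, N> W_rom = {{
    {0.1f, 0.2f, 0.0f, 0.3f},
    {0.0f, 0.5f, 0.1f, 0.0f},
    {0.2f, 0.0f, 0.4f, 0.1f},
    {0.3f, 0.1f, 0.0f, 0.2f},
}};

// LoRA factors: small, writable, updated per task (SRAM-resident analogue).
std::array<std::array<float, R>, N> A{};   // N x R
std::array<std::array<float, N>, R> B{};   // R x N

// y = (W_rom + A * B) * x, computed without materializing the summed matrix.
void forward(const std::array<float, N>& x, std::array<float, N>& y) {
    std::array<float, R> t{};              // t = B * x
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t j = 0; j < N; ++j) t[r] += B[r][j] * x[j];
    for (std::size_t i = 0; i < N; ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < N; ++j) acc += W_rom[i][j] * x[j];  // ROM reads
        for (std::size_t r = 0; r < R; ++r) acc += A[i][r] * t[r];      // SRAM reads
        y[i] = acc;
    }
}

int main() {
    std::array<float, N> x = {1.0f, 0.0f, 0.0f, 0.0f}, y{};
    A[0][0] = 0.05f; B[0][0] = 1.0f;       // a trivial task-specific update
    forward(x, y);
    return 0;
}
```

Because only $A$, $B$, and the activations are writable, the mutable state scales with the LoRA rank rather than with the base model, mirroring the area argument made for ROM-based CiM.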
These innovations underpin frameworks that deliver area, power, and latency improvements, guided by architecture- and workload-specific analysis of how the on-chip memory system should be composed.
5. Scheduling, Optimization, and Compiler-Assisted Frameworks
Compiler frameworks such as ImaGen (Ujjainkar et al., 2023) formulate pipeline scheduling and memory banking as constrained ILP problems:
- The objective is minimization of the total line-buffer size, $\min \sum_i S_i$, where each buffer size $S_i$ depends on the start cycles assigned to its producer and consumer stages; start cycles are chosen to minimize overlap while preserving data dependencies.
- Constraints enforce causality (a consumer may not start before its producer's output is available, e.g., $t_j \ge t_i + d_{ij}$), per-cycle port access limits on each memory bank, and buffer coalescence; a simplified, brute-force stand-in for this formulation is sketched after this list.
- Such optimizations enable pixel-per-cycle throughput while achieving up to 86% reduction in memory usage and 62.9% less power compared to naive pipelines.
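The toy sketch below is a brute-force stand-in for such an ILP: it enumerates stage start cycles under a causality constraint and a crude phase-based port limit, and minimizes the total line-buffer storage implied by the chosen offsets. The chain topology, cost model, and port model are simplified assumptions, not ImaGen's actual formulation.

```cpp
#include <array>
#include <iostream>
#include <limits>

// Toy stand-in for ILP-based pipeline scheduling. Stages form a chain; the
// line buffer between stage i and i+1 holds (t[i+1] - t[i]) rows of width W.
// Causality: the gap must be at least dep[i] rows. A crude port model allows
// at most PORTS stages to start in the same cycle modulo PHASES.
constexpr int S = 4;                               // pipeline stages (hypothetical)
constexpr int W = 640;                             // row width in pixels (hypothetical)
constexpr std::array<int, S - 1> dep = {2, 1, 3};  // minimum producer lead, in rows
constexpr int PHASES = 2;                          // bank access phases
constexpr int PORTS = 2;                           // stages allowed per phase
constexpr int SLACK = 3;                           // extra delay explored per stage

long long best_cost = std::numeric_limits<long long>::max();
std::array<int, S> best_t{};

void search(int stage, std::array<int, S>& t) {
    if (stage == S) {                              // complete schedule: score it
        long long cost = 0;
        for (int i = 0; i + 1 < S; ++i)
            cost += static_cast<long long>(t[i + 1] - t[i]) * W;
        if (cost < best_cost) { best_cost = cost; best_t = t; }
        return;
    }
    for (int extra = 0; extra <= SLACK; ++extra) {
        t[stage] = t[stage - 1] + dep[stage - 1] + extra;  // causality constraint
        int same_phase = 0;                                 // crude port constraint
        for (int j = 0; j < stage; ++j)
            if (t[j] % PHASES == t[stage] % PHASES) ++same_phase;
        if (same_phase < PORTS) search(stage + 1, t);
    }
}

int main() {
    std::array<int, S> t{};
    t[0] = 0;                                      // first stage starts at cycle 0
    search(1, t);
    std::cout << "Minimum total line-buffer size: " << best_cost << " pixels; starts:";
    for (int x : best_t) std::cout << ' ' << x;
    std::cout << '\n';
    return 0;
}
```

A real compiler replaces this enumeration with an ILP solver and a far richer model of banking, coalescing, and port contention, but the structure of the objective and constraints is the same.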
Scheduling and banking decisions that are driven by compiler analysis or workload profiling enhance both hardware efficiency and scalability, and are essential for large-scale deployments.
6. Applications and System-Level Implications
On-chip memory-only frameworks address a range of applications:
- Edge AI and Embedded DNNs: Real-time digit/phoneme recognition systems running entirely from on-chip FPGA memory (Park et al., 2016), large-scale object detection engines using high-density ROM-CiM (Chen et al., 2022), or line-buffered, pipelined CNNs (Kang, 2022).
- Neuromorphic and Manycore Systems: Multi-bank memristor crossbars and distributed 3D memory stacks support high fan-out, low-latency spike routing in neuromorphic platforms (Moradi et al., 2018).
- Latency/Throughput-Critical Architectures: Hierarchical shared-memory controllers utilize combinatorial and geometric port assignment (including fractal and directed randomization) for balanced performance and area reduction (Luan et al., 2020).
- Image Processing Pipelines: ILP-scheduled, line-coalesced pipelines in frameworks such as ImaGen (Ujjainkar et al., 2023) minimize both area and power without throughput compromise.
Broader implications:
- Realization of compute-efficient, real-time, or privacy-respecting edge applications.
- Support for dense deployment of ML/LLM inference on resource-constrained hardware, with on-chip storage of multi-billion parameter models (Wang et al., 17 Mar 2025).
- Enhanced profiling and optimization tools for domain-aligned memory system composition (Li et al., 21 Apr 2025, Pentecost et al., 2021).
7. Future Directions and Paradigm Shifts
Cutting-edge research challenges the necessity of traditional memory hierarchies altogether. The prospect of pairing ultra-high-bandwidth DDR DRAM (up to 7 Tbps) with L1-cache-equivalent wireless-on-chip interconnects (“GodSpeed transceivers”) suggests a paradigm in which cache layers are eliminated entirely (GS et al., 23 Apr 2024):
- CPUs transmit memory requests wirelessly (directly from the register file to DRAM), removing cache-induced latency, area, and power constraints.
- The projected system leads to simplified hardware, unrestricted bandwidth, and fundamentally improved responsiveness and energy profiles.
In summary, the on-chip memory-only framework unifies architectural, technological, and scheduling innovations to contain all working memory within the chip boundary. These approaches yield empirical gains in energy, area, and speed, and promise future systems that re-architect fundamental computing hierarchies toward DRAM-local, wireless, or ROM-compute paradigms, driven by comprehensive scheduling, compiler, and profiling methodologies.