SRAM-Peripheral Near-Memory Acceleration
- SRAM-peripheral near-memory acceleration is a computing paradigm that integrates peripheral circuits with SRAM arrays to execute in-situ, energy-efficient computation and overcome data-movement bottlenecks.
- The architecture leverages analog charge-domain computing, bitline-driven Boolean logic, and dedicated arithmetic engines to support diverse workloads including neural inference and cryptography.
- Innovations such as capacitive adders, ADC streaming with early-stop, and output-based calibration enable scalable, robust, and high-performance computation with reduced area and energy costs.
Static Random-Access Memory (SRAM)-Peripheral Near-Memory Acceleration refers to the integration of advanced peripheral circuits and compute primitives immediately adjacent to the memory cell array within SRAM macros to enable in-situ and near-memory computation with high performance, energy efficiency, and scalability. This paradigm overcomes the data-movement bottleneck intrinsic to von Neumann architectures by executing massively parallel arithmetic or logic operations near or within the periphery of SRAM arrays, without disturbing primary memory storage functions. Modern research demonstrates significant advances using analog charge-domain computing, digital in-memory logic, and peripheral circuit innovations to accelerate workloads ranging from neural inference and modular arithmetic to cryptography and secure data manipulation.
1. Architectural Principles and Circuit Primitives
SRAM-peripheral near-memory acceleration leverages both minor and major modifications to standard SRAM macros:
- Analog Charge-Domain and Bit-Parallel Compute: Charge-domain computing-in-memory (CD-CiM) macros use local metal–insulator–metal or metal–oxide–metal (MIM/MOM) capacitors, transmission-gate switches, and hierarchical charge-redistribution networks to execute multiply–accumulate (MAC) operations fully or partially in the analog domain, with summation performed in dedicated peripheral capacitor ladders or adder trees (Yin et al., 2022, Chen et al., 2024).
- Bitline-Driven Boolean Logic and In-Situ Digital Operations: Through multi-row activation and sense amplifier (SA) enhancements, periphery logic supports massively parallel Boolean functions (XNOR, AND, XOR, NOR, etc.) directly via bitline micro-architectures. Specialized 9T, 10T, 8T, or MEFET-enhanced cells expand functionality, such as single-cycle array-wide XOR or non-volatility with ME-SRAM (Lokhande et al., 16 Nov 2025, Yin et al., 2023, Najafi et al., 2023).
- Near-Memory Arithmetic Engines: Barrel shifters, compressor-tree accumulators, adders, and subtractors are instantiated in the peripheral datapath immediately adjacent to the SRAM array to enable variable-precision accumulation, multi-level shift-and-add, carry-save addition, modular reduction, or popcount (Lokhande et al., 16 Nov 2025, Ku et al., 2024, Li et al., 5 Nov 2025).
- Array Organizations: Designs range from homogeneous bit-parallel banks (128×128, 288×144, 1152×9, etc.), hierarchical cluster partitioning, and dual-mode bank structures (memory-only vs. compute-enabled sub-arrays), to reconfigurable groupings for hardware–software optimized dataflow (Chen et al., 2024, Li et al., 5 Nov 2025, Li et al., 27 Mar 2025).
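The bitline-driven logic described above can be illustrated with a small behavioral model. This is a minimal sketch, not a circuit from any of the cited macros: it assumes the commonly used dual-row activation scheme in which the bitline (BL) stays precharged only when both activated cells store 1 and the complementary bitline (BLB) stays precharged only when both store 0. The function names are hypothetical.

```python
import numpy as np

def bitline_and_nor(row_a, row_b):
    """Behavioral model of activating two rows on shared bitline pairs.

    Assumption: BL remains high only if both cells store 1 (AND),
    and BLB remains high only if both cells store 0 (NOR).
    """
    a = np.asarray(row_a, dtype=bool)
    b = np.asarray(row_b, dtype=bool)
    bl = a & b        # sensed on BL
    blb = ~a & ~b     # sensed on BLB
    return bl, blb

def periphery_xor(row_a, row_b):
    """XOR derived in the sense-amplifier periphery as NOT(AND OR NOR)."""
    bl, blb = bitline_and_nor(row_a, row_b)
    return ~(bl | blb)

def xnor_popcount_dot(row_a, row_b):
    """Binary dot product with bits encoding +1/-1.

    XNOR marks matching positions; a peripheral popcount turns the
    match count into the signed sum: dot = 2 * matches - n.
    """
    bl, blb = bitline_and_nor(row_a, row_b)
    xnor = bl | blb
    return 2 * int(xnor.sum()) - xnor.size
```

Each call operates on a whole row at once, which mirrors the array-wide parallelism of the hardware: one activation yields one Boolean result per column.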
2. Peripheral Innovations: Capacitive Summation, ADC, and Calibration
Breakthroughs in peripheral circuit design are central to the throughput, energy efficiency, and robustness of SRAM-peripheral near-memory architectures:
- Hierarchical Capacitive Adder Networks (CAAT): Two-level capacitor trees are used for partial and total sum formation in charge-domain MAC arrays, substantially reducing the physical capacitor area and RC time constants versus classical structures. Leaf-level hybrid (binary-weighted + C-2C ladder) implementations multiplex high dynamic range into area-efficient layouts (Yin et al., 2022).
- Single-ADC Streaming and Early-Stop Nonlinearities: High-throughput macros utilize a single SAR or time-domain ADC for array output, with customized early stop for ReLU non-linearity (i.e., enforcing non-negative output via MSB gating) that halves ADC switching and energy. Time-domain ADCs with dual-threshold comparators further lower static power via aggressive power-gating during idle cycles (Yin et al., 2022, Chen et al., 2024).
- Output-Based Calibration: One-shot, output-based linear correction re-aligns analog array/ADC non-idealities with digital ground truth, using post-silicon mean/variance matching to compute the linear correction coefficients. This maintains inference accuracy without the need for per-macro retraining or run-time iteration (Yin et al., 2022).
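The mean/variance matching step can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the procedure from Yin et al.: it assumes the correction is a single gain and offset per macro, fitted once so that the measured outputs match the mean and standard deviation of the digital reference on a calibration set. The names `gain` and `offset` are hypothetical, and the fit assumes the analog transfer has positive slope.

```python
import numpy as np

def fit_output_calibration(measured, reference):
    """Fit a one-shot linear correction by matching mean and variance.

    measured:  raw macro/ADC outputs on a calibration set
    reference: digital ground-truth outputs for the same inputs
    Returns (gain, offset) such that gain * measured + offset has the
    same mean and standard deviation as the reference.
    """
    measured = np.asarray(measured, dtype=float)
    reference = np.asarray(reference, dtype=float)
    gain = reference.std() / measured.std()
    offset = reference.mean() - gain * measured.mean()
    return gain, offset

def apply_calibration(raw, gain, offset):
    """Apply the stored coefficients at inference time (no iteration)."""
    return gain * np.asarray(raw, dtype=float) + offset
```

Because the coefficients are fitted once after fabrication and then applied as a fixed affine map, the run-time cost is one multiply and one add per output, and no network retraining is involved.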
3. Supported Workloads: Neural Inference, Modular Arithmetic, Cryptography
SRAM peripheral near-memory accelerators are deployed for a broad spectrum of data-intensive and compute-bound applications:
- Deep Neural Network Inference: 8b×8b or 4b×4b MACs are executed in situ for convolutional and fully-connected layers, with trade-offs between bit-parallel and bit-serial input/output depending on array capacity, energy, and task accuracy. Sparse, clustered, or weight-pooled architectures (e.g., CIMPool) enable orders-of-magnitude area reduction and energy savings for large-scale DNNs with negligible accuracy sacrifices (Yin et al., 2022, Chen et al., 2024, Li et al., 27 Mar 2025).
- Large-Word Modular Multiplication: Barrett, Montgomery, and radix-4 carry-save arithmetic for ECC, zero-knowledge proofs, and PQC use byte-wise MAC mapping onto parallel macros, with all partial products, shifters, accumulators, and modular reduction staged in peripheral logic. This enables very high-bitwidth operations at a fraction of cycle and area cost versus previous works (Li et al., 5 Nov 2025, Ku et al., 2024).
- High-Speed and Secure Cryptography: AES and SHA3 kernels are realized by combining multi-row activation, periphery logic (AND/NOR/XOR/shift), and minimized data transfers. Secure in-SRAM engines with periphery extensions (Sealer, CryptoSRAM, 9T security cell) allow >2 orders-of-magnitude improvement in throughput-per-area and an order of magnitude reduction in energy-per-bit relative to off-array cryptographic accelerators (Zhang et al., 26 Sep 2025, Zhang et al., 2022, Yin et al., 2023).
- Post-Quantum NTT and Polynomial Arithmetic: Reconfigurable bit-cell arrays (e.g., 10T) plus glitch-driven pulse generators operate near-memory compute units for modular butterfly operations, achieving topology-optimized dataflow and pipelining for high-throughput lattice-cryptography primitives (Ding et al., 13 May 2025).
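The large-word modular multiplication flow above (byte-wise MACs, shift-and-add accumulation, then modular reduction in the periphery) can be sketched in software. This is a minimal functional model under stated assumptions: it uses plain schoolbook partial products and textbook Barrett reduction, and it does not reproduce the specific mapping, radix-4 carry-save scheme, or scheduling of LaMoS or ModSRAM. All function names are hypothetical.

```python
def to_bytes_le(x, n_bytes):
    """Split an integer into little-endian 8-bit limbs."""
    return [(x >> (8 * i)) & 0xFF for i in range(n_bytes)]

def bytewise_multiply(a, b, n_bytes):
    """Full product built only from 8b x 8b multiply-accumulates.

    Partial products with the same byte weight (i + j) are summed into
    one column, as a bank of parallel MAC macros would do; the columns
    are then combined by shift-and-add.
    """
    a_limbs = to_bytes_le(a, n_bytes)
    b_limbs = to_bytes_le(b, n_bytes)
    columns = [0] * (2 * n_bytes - 1)
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            columns[i + j] += ai * bj          # one 8b x 8b MAC
    return sum(c << (8 * k) for k, c in enumerate(columns))

def barrett_reduce(x, modulus, k, mu):
    """Barrett reduction for 0 <= x < modulus**2.

    k is the bit length of the modulus and mu = floor(2**(2k) / modulus)
    is precomputed, so the reduction needs only multiplies, a shift,
    and a few conditional subtractions.
    """
    q = (x * mu) >> (2 * k)
    r = x - q * modulus
    while r >= modulus:                        # small bounded correction
        r -= modulus
    return r

def modmul(a, b, modulus):
    """Modular multiplication of operands already reduced mod modulus."""
    k = modulus.bit_length()
    mu = (1 << (2 * k)) // modulus
    n_bytes = (k + 7) // 8
    return barrett_reduce(bytewise_multiply(a, b, n_bytes), modulus, k, mu)
```

For example, `modmul(a, b, 2**255 - 19)` decomposes each 255-bit operand into 32 byte limbs, so the product is formed from 32 × 32 independent 8b×8b MACs that a hardware mapping could distribute across macros.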
4. Performance Metrics, Area, Energy, and Robustness
Key quantitative and comparative metrics exemplify the impact of SRAM-peripheral near-memory acceleration:
| Macro/Work | Peak Throughput | Energy Efficiency | Area / Density | Accuracy | Notable Peripheral Innovation(s) |
|---|---|---|---|---|---|
| CD-CiM (Yin et al., 2022) | 51.2 GOPS (8b MACs) | 10.3 TOPS/W @240 MHz | 1.2× smaller than multi-ADC | 88.6% (CIFAR-10) | CAAT, ReLU-optimized SAR ADC |
| PICO-RAM (Chen et al., 2024) | 25.3 GMAC/s (4b×4b) | 297 TOPS/W @1.2V | 559 Kb/mm² | 90.7% (4b-ResNet-20) | Dual-threshold TD-ADC, in-situ DAC reuse |
| FERMI-ML (Lokhande et al., 16 Nov 2025) | 1.93 TOPS (binary) | 364 TOPS/W | 4.58 TOPS/mm² | >97.5% (TinyML CNNs) | In-situ XNOR 9T cell, C22T compressor |
| LaMoS (Li et al., 5 Nov 2025) | 7.02× speedup vs ModSRAM | 3× area-normalized latency·area | ≈0.11 mm² | N/A | Byte-wise MAC mapping, workload grouping |
| Sealer (Zhang et al., 2022) | 24 GB/s/array | ≈3× lower than prior AIM | ~1–1.5% overhead | N/A | Multi-row BL activation, XOR/shift SA |
Robustness to PVT (process, voltage, temperature) variation is attained through passive charge-domain and thin-cell matching, while circuit-level calibration and reprogrammable periphery logic sustain inference accuracy and modular arithmetic correctness, even under fine-grained supply and temperature ranges (Chen et al., 2024).
5. Design Trade-offs, Scalability, and Generalization
SRAM-peripheral near-memory acceleration involves nontrivial trade-offs in cell complexity, peripheral area, precision scaling, and energy efficiency:
- Area and Cell Complexity: Additional transistors (8T, 9T, 10T, MEFET) incur a 12–50% area penalty over baseline 6T, but enable logic functions (XNOR, XOR, carry-save, in-situ storage) unattainable otherwise. Peripheral logic trees (e.g., compressor, FA, shifter) may constitute 10–40% of macro area.
- Precision and Throughput: Fully bit-parallel implementations yield highest density and throughput at the cost of increased peripheral analog/digital complexity and vulnerability to non-idealities. Bit-serial or weight-sharing designs (CIMPool, ModSRAM) trade per-cycle performance for massive model capacity or energy optimality (Ku et al., 2024, Li et al., 27 Mar 2025).
- Scalability: Techniques such as workload grouping, clustered quantization, buffer ping-ponging, and dual-mode access (memory vs. compute) allow scaling to large models, high-bitwidth operands, and DNNs with hundreds of millions of parameters or arithmetic precision up to 1024 bits (Li et al., 27 Mar 2025, Li et al., 5 Nov 2025).
- Broader Applicability: Principles established in SRAM-peripheral near-memory accelerators are extensible to eDRAM, MRAM, RRAM, or other NVM arrays for SoC and heterogeneous computing. Examples include compute-enabled last-level caches, in-line security engines, and periphery-embedded polynomial accelerators (Chakraborty et al., 15 Sep 2025, Ding et al., 13 May 2025).
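The precision/throughput trade-off between bit-parallel and bit-serial operation can be made concrete with a small functional model. This is a sketch under stated assumptions, not the dataflow of any cited design: inputs are unsigned integers streamed one bit-plane per array cycle (LSB first), weights are stored as signed integers, and a peripheral shift-and-add accumulator recombines the per-cycle partial sums. The function name is hypothetical.

```python
import numpy as np

def bit_serial_mac(inputs, weights, input_bits=8):
    """Dot product computed one input bit-plane per cycle.

    inputs:  unsigned integers in [0, 2**input_bits)
    weights: signed integers held in the array
    Returns (result, cycles); a bit-parallel design would produce the
    same result in one cycle at the cost of wider periphery.
    """
    x = np.asarray(inputs, dtype=np.int64)
    w = np.asarray(weights, dtype=np.int64)
    acc = 0
    cycles = 0
    for bit in range(input_bits):
        plane = (x >> bit) & 1            # 1-bit slice of every input
        partial = int(np.dot(plane, w))   # array-level sum for this slice
        acc += partial << bit             # peripheral shift-and-add
        cycles += 1
    return acc, cycles
```

The cycle count grows linearly with input precision while the per-cycle hardware stays 1-bit wide, which is the trade the bit-serial designs make in exchange for simpler analog/digital periphery.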
6. Current Challenges and Future Directions
Despite demonstrated improvements, key challenges remain:
- ADC Bottleneck: For analog MVM, ADC area and power dominate for MACs below roughly 6–10 b precision; innovations in power-gated or column-shared ADCs and precision scaling are critical (Chen et al., 2024).
- Non-Ideality Mitigation: Passive matching, per-macro calibration, and adaptive error correction are active areas to contain capacitor/signal non-linearities and bit-line coupling.
- Endurance and Variability: Emerging device options (e.g., MEFET, embedded RRAM) need continued research in endurance, retention, and cell/circuit co-optimization on advanced nodes (Najafi et al., 2023, Chakraborty et al., 15 Sep 2025).
- Algorithm–Hardware Co-Design: Close coupling between algorithm mapping (e.g., radix-4 CSA, grouped pooling, compressed indices) and hardware dataflow is essential for area/energy–throughput optimality, a trend that will likely intensify as models and cryptographic workloads scale (Li et al., 5 Nov 2025, Li et al., 27 Mar 2025).
- Integration with Compiler and System Stack: Automatic mapping of DNN layers, quantization-aware bit-packing, and security policy enforcement at the memory controller/peripheral level are needed for seamless deployment in practical systems (Li et al., 27 Mar 2025).
SRAM-peripheral near-memory acceleration, through advanced charge-domain and digital logic with low-overhead periphery, is established as a key technique for overcoming memory wall limitations and enabling energy-efficient, high-throughput, and scalable computing for AI, cryptography, and edge workloads (Yin et al., 2022, Chen et al., 2024, Li et al., 5 Nov 2025, Lokhande et al., 16 Nov 2025, Li et al., 27 Mar 2025).