Hybrid Compute-in-Memory Architectures
- Hybrid CiM architectures are integrated compute systems that merge digital and analog processing within memory to reduce data movement and alleviate the von Neumann bottleneck.
- They leverage MSB-digital and LSB-analog partitioning to efficiently balance precision, throughput, and energy consumption for AI workloads.
- Co-design of algorithms and hardware, along with advanced mixed-signal techniques, is essential for mitigating non-idealities and optimizing overall system performance.
Hybrid Compute-in-Memory (CiM) architectures integrate both digital and analog computation within or near the memory arrays, strategically partitioning computational tasks to maximize system efficiency, minimize data movement, and optimize accuracy, energy, and area for advanced AI workloads. These architectures are central to overcoming the traditional von Neumann bottleneck by collapsing compute and memory hierarchies, and are increasingly employed in applications ranging from DNN inference at the edge to precision-critical generative AI models.
1. Architectural Principles and Hybridization Strategies
Hybrid CiM architectures marry digital compute-in-memory (DCIM) and analog compute-in-memory (ACIM) techniques, using the memory array periphery or even the bitcells themselves to embed computation without moving data out to external logic. In SRAM-based macros, DCIM provides high computational precision and robust process scalability, typically embedding full adder trees or bitwise logic for exact dot-products. ACIM exploits current, charge, or voltage summation directly on the bitlines, digitized by peripheral ADCs, yielding superior energy and area efficiency, particularly for medium-precision tasks (Yoshioka et al., 2024, Konno et al., 25 Aug 2025).
The canonical hybridization approach is to delegate the most significant bits (MSB) of operands to digital computation for high accuracy, while assigning the least significant bits (LSB) to analog accumulation for efficiency. This MSB-digital/LSB-analog split achieves effective 8–10 bit performance with lower energy and area overhead than full-precision DCIM, while avoiding the precision and SNR limitations inherent in deep-analog ACIM (Yoshioka et al., 2024, Konno et al., 25 Aug 2025, Yi et al., 11 Feb 2025).
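As an illustration of the MSB-digital/LSB-analog split, the following Python sketch partitions each weight into an upper field that is accumulated exactly and a lower field that passes through additive noise and a low-resolution ADC. It is a behavioral model under stated assumptions, not the circuit of any cited macro: splitting the weight (rather than the input or the partial products), the 3/5 bit division, the Gaussian noise term, the uniform ADC, and the names `split_bits`, `adc_quantize`, and `hybrid_dot` are all hypothetical choices made for the example.

```python
import random


def split_bits(value, total_bits=8, msb_bits=3):
    """Split an unsigned integer into (upper msb_bits, remaining lower bits)."""
    lsb_bits = total_bits - msb_bits
    return value >> lsb_bits, value & ((1 << lsb_bits) - 1)


def adc_quantize(value, full_scale, adc_bits):
    """Clip value to [0, full_scale] and round it to one of 2**adc_bits uniform levels."""
    if full_scale <= 0:
        return 0.0
    steps = (1 << adc_bits) - 1
    clipped = min(max(value, 0.0), full_scale)
    return round(clipped / full_scale * steps) * full_scale / steps


def hybrid_dot(inputs, weights, total_bits=8, msb_bits=3, input_max=255,
               adc_bits=6, noise_sigma=0.0, rng=None):
    """Dot product with an exact (digital) MSB path and a modeled analog LSB path."""
    if not inputs:
        return 0.0
    rng = rng or random.Random(0)
    lsb_bits = total_bits - msb_bits
    msb_sum = 0    # digital path: exact integer accumulation
    lsb_sum = 0.0  # analog path: ideal sum before noise and quantization
    for x, w in zip(inputs, weights):
        w_msb, w_lsb = split_bits(w, total_bits, msb_bits)
        msb_sum += x * w_msb
        lsb_sum += x * w_lsb
    lsb_sum += rng.gauss(0.0, noise_sigma)  # assumed additive analog noise
    # Largest possible LSB-path sum, used as the ADC input range.
    full_scale = len(inputs) * input_max * ((1 << lsb_bits) - 1)
    lsb_digitized = adc_quantize(lsb_sum, full_scale, adc_bits)
    # Recombine: the MSB field carries a weight of 2**lsb_bits.
    return (msb_sum << lsb_bits) + lsb_digitized
```

With `noise_sigma=0.0`, the result differs from the exact dot product by at most half an ADC step of the LSB path, while the MSB contribution stays exact. Raising `adc_bits` or moving more bits into the digital field trades energy and area for accuracy in this model.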
Other modalities include current-domain, charge-domain, and time-domain ACIM; Lightning-style interleaving of DCIM and ACIM bitcells; and spatially partitioned analog/digital compute blocks. Hybrid architectures also extend to complex-number MACs, hybrid floating-point multiply-accumulate, and cryogenic or FeFET-based memory platforms (Konno et al., 25 Aug 2025, Yi et al., 11 Feb 2025, Yin et al., 2024, Alam et al., 2021).
2. Circuit-Level Implementations
Hybrid-bit-split designs accumulate a small group of MSBs digitally, often in local adder trees built from in-cell or periphery logic, while LSB contributions accumulate on local analog capacitors or bitlines and are digitized by low-resolution (3–7 bit) SAR or flash ADCs. For SRAM-based macros, 6T or 8T cells are augmented with MIM/MOM capacitors, split-port wordlines, local counting logic for the DCIM path, and pass-transistor logic for the ACIM path (Yoshioka et al., 2024, Konno et al., 25 Aug 2025, Chen et al., 2023).
Representative implementations include:
- A 28nm hybrid SRAM macro using a 2D-weighted capacitor array for complex-number MACs, where the upper 3 bits are computed digitally and the lower 5 bits via charge-based ACIM, achieving 1.80 Mb/mm² density and 0.435% RMS error (Konno et al., 25 Aug 2025).
- An FP8 hybrid macro with per-cell pseudo-AND/XOR logic for sub-addition in the digital path and switched-capacitor analog sub-multiplication, giving a 1.53× energy advantage over all-digital floating-point CiM (Yi et al., 11 Feb 2025).
- OSA-HCIM: On-the-fly saliency-aware macros that dynamically switch the digital/analog boundary per MAC using a runtime evaluator, with a split-port 6T SRAM organization enabling concurrent DCIM/ACIM operations (Chen et al., 2023).
- PACiM: Probabilistic approximate computing for DNNs, using exact DCIM for MSBs and sparsity-domain statistical approximation for LSBs to cut bit-serial compute cycles by 81% (Zhang et al., 2024).
The interface between digital and analog sub-paths requires careful matching of data widths, calibration for non-idealities, and mixed-signal postprocessing. DAC-free input gating techniques (digital gating) further eliminate the need for per-column DACs, reducing area and mismatch errors (Konno et al., 25 Aug 2025).
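One common way to drive an array with purely digital inputs is bit-serial operation: each cycle applies a single bit plane of the inputs as on/off gating, and the per-plane sums are combined by shift-and-add. The sketch below shows that arithmetic in Python. Whether the cited DAC-free macro uses exactly this scheme is an assumption, and `bit_serial_dot` is a hypothetical name introduced for illustration.

```python
def bit_serial_dot(inputs, weights, input_bits=8):
    """Dot product of unsigned inputs and integer weights, one input bit plane per cycle."""
    for x in inputs:
        if not 0 <= x < (1 << input_bits):
            raise ValueError(f"input {x} does not fit in {input_bits} unsigned bits")
    total = 0
    for b in range(input_bits):
        # Cycle b: an input either gates its weight onto the sum (bit = 1) or not (bit = 0),
        # so no multi-level input DAC is needed.
        plane_sum = sum(w for x, w in zip(inputs, weights) if (x >> b) & 1)
        # Shift-and-add restores the binary weight of bit plane b.
        total += plane_sum << b
    return total
```

The result equals the ordinary dot product, at the cost of `input_bits` cycles per operation, which is why reducing bit-serial cycles matters for throughput.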
3. Performance, Energy, and Precision Trade-Offs
Hybrid CiM macros strike an efficient balance among precision, throughput, energy, and area:
| Architecture | Precision | Energy (fJ/MAC) | Latency (ns/MAC) | Density (Mb/mm²) |
|---|---|---|---|---|
| DCIM | 12+ bits | 8–20 | 1–2 | ~3–4 |
| ACIM (charge) | 6–8 bits | 0.2–1 | 5–10 | ~1–2 |
| Hybrid (typical) | 8–10 bits | 1–3 | 3–5 | ~1–1.8 |
| 28nm hybrid (Konno et al., 25 Aug 2025) | 8 (complex) | 28 | ~0.3 | 1.80 |
Precision: Handling MSBs digitally ensures that hybrid schemes exceed the effective precision (SNR >30 dB) of purely analog front-ends. For DNN workloads such as ResNet or MobileNet, ≤1% accuracy loss is typical with 8–10 effective bits (Konno et al., 25 Aug 2025, Yoshioka et al., 2024, Yi et al., 11 Feb 2025).
Energy/performance: Hybrid macros achieve an order-of-magnitude lower energy than full-precision DCIM, e.g., 1–3 fJ/MAC at ~300 TOPS/W, with marginal impact on latency. Macro-level area also improves dramatically by removing high-resolution ADCs and using denser supporting logic (Yoshioka et al., 2024, Konno et al., 25 Aug 2025, Kundu et al., 2024, Zhang et al., 2024). Optimizations such as DAC-free input (Konno et al., 25 Aug 2025) and memory-immersed ADCs (Nasrin et al., 2023) further drive energy and area savings.
Throughput/latency: The cycle time of a hybrid macro is set by the slower of its two paths: ADC conversion time in the analog path or adder-tree delay in the digital path. Techniques such as collaborative digitization and pipelined operation mitigate these bottlenecks for large-scale DNN inference (Nasrin et al., 2023).
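The latency argument can be made concrete with a first-order timing model. The Python sketch below assumes that the analog and digital paths run in parallel after an array-access phase, and that pipelining overlaps the array access of one operation with the readout of the previous one. The two-stage structure, the parameter names, and the function `macro_latency_ns` are illustrative assumptions rather than a description of any cited design.

```python
def macro_latency_ns(n_ops, t_array, t_adc, t_adder, pipelined=False):
    """First-order latency (ns) for n_ops MAC operations on a hybrid macro."""
    if n_ops <= 0:
        return 0.0
    # Analog and digital readout run side by side, so the slower one sets the pace.
    t_readout = max(t_adc, t_adder)
    if not pipelined:
        return n_ops * (t_array + t_readout)
    # Two-stage pipeline: fill once, then one result per slowest stage.
    stage = max(t_array, t_readout)
    return (t_array + t_readout) + (n_ops - 1) * stage
```

In this model, shortening the faster path gives no benefit, and pipelining helps most when the array-access and readout times are similar.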
4. Co-Design and Algorithm–Hardware Optimization
Algorithm–hardware co-design is essential because DNN topology, dataflow characteristics, and hardware configuration interact strongly. Joint optimization frameworks such as CiMNet (Kundu et al., 2024) and CIMNAS (Krestinskaya et al., 30 Sep 2025) formalize this process as a search for Pareto-optimal pairs of network design (θ) and hardware configuration (H), using objectives that trade accuracy against execution latency or, more generally, the energy-delay-area product (EDAP). The design space includes:
- Network elastic parameters: channel widths, depths, attention heads.
- Hardware elastic parameters: memory-array sizes (M), vector bandwidths (B), processing granularity (P), parallel lanes (L), SRAM/DRAM buffer division.
- Quantization policies and device/circuit-level attributes for further efficiency gains (Krestinskaya et al., 30 Sep 2025).
Predictor-guided search (e.g., NSGA-II) with surrogate models enables efficient exploration, capturing non-linear trade-offs between bandwidth, MAC granularity, and accuracy. Experimental evaluations indicate:
- Joint θ+H optimization yields up to 5.4× cycle-count reduction vs. architecture-only tuning.
- DNNs with low arithmetic intensity (e.g., depthwise CNNs) benefit from higher bandwidth, whereas compute-bound workloads (transformers, wide CNNs) benefit from greater MAC parallelism (Kundu et al., 2024).
- Adaptive digital-analog partitioning, saliency-aware boundary selection (Chen et al., 2023), and approximate computation (Zhang et al., 2024) further optimize resource allocation.
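The notion of Pareto-optimal (θ, H) pairs used in this section can be illustrated with a small filter over already-evaluated candidates. The Python sketch below keeps every candidate that no other candidate beats on both accuracy and latency. The `Candidate` record, its fields, and the two-objective setup are assumptions for the example; the cited frameworks use predictor-guided evolutionary search rather than this exhaustive comparison.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str         # label for a (network, hardware) pair
    accuracy: float   # higher is better
    latency: float    # lower is better


def dominates(a, b):
    """True if a is at least as good as b on both objectives and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.latency <= b.latency
    better = a.accuracy > b.accuracy or a.latency < b.latency
    return no_worse and better


def pareto_front(candidates):
    """Return the candidates that are not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]
```

A multi-objective search such as NSGA-II repeatedly applies this kind of dominance test to rank a population, so the filter above corresponds to extracting the first non-dominated rank.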
System-level co-design extends to entire SoCs with heterogeneous NPU+CiM partitioning, where neural architecture search allocates sub-modules for compute-bound vs. memory-bound layers, achieving significant system-wide reductions in latency (up to 56%) and energy (up to 41.7%) (Zhao et al., 2024).
5. Advanced Hybridization Modes and Resilience
Recent hybrid CiM research addresses FP/INT domain blending, resilience to device and algorithmic faults, and support for multi-modal workloads:
- Hybrid-domain floating-point CiM splits FP MACs such that analog paths carry sub-multiplication (mantissa products) while digital logic executes sub-addition and exponent management. This yields high-accuracy FP (e.g., BFLOAT16) with <1% DNN accuracy drop and 1.5× energy efficiency gain over fully digital baselines (Yi et al., 11 Feb 2025).
- SafeCiM incorporates fault-resilient post-alignment of FP mantissas and tiling of MAC arrays to bound the impact of MSB errors, achieving up to 49× reduction in inference accuracy loss under single-adder faults in large-scale networks (Bhattacharya et al., 23 Nov 2025).
- 1FeFET-1C arrays for neuro-symbolic AI demonstrate simultaneous support for MAC (DNN) and CAM (associative) operations by leveraging ferroelectric field-effect non-volatility and robust charge-domain analog computation within a DRAM-compatible cell (Yin et al., 2024).
- Cryogenic QAHE-based CiM realizes single-cycle universal logic operations with topologically protected, robust storage at ultra-low temperatures (Alam et al., 2021).
- Hybrid projection decomposition (HPD) splits large output projections across analog crossbars (first SVD factors) and digital logic (V-matrix, correction factor), greatly improving robustness to analog perturbations while maintaining throughput (Feng et al., 16 Aug 2025).
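The hybrid-domain floating-point split in the first item above can be sketched numerically: each operand is separated into a mantissa and an exponent, the mantissa product stands in for the analog sub-multiplication, and the exponent sum and accumulation are carried out exactly. In the Python sketch below, the single relative error term `mantissa_error` is a deliberately crude stand-in for analog non-idealities, and `hybrid_fp_mac` is a hypothetical name; the cited macro's actual circuit partitioning is not modeled.

```python
import math


def hybrid_fp_mac(xs, ws, mantissa_error=0.0):
    """Multiply-accumulate with separate mantissa (modeled analog) and exponent (exact) handling."""
    acc = 0.0
    for x, w in zip(xs, ws):
        mx, ex = math.frexp(x)  # x == mx * 2**ex with 0.5 <= |mx| < 1, or mx == 0
        mw, ew = math.frexp(w)
        # Modeled analog sub-multiplication: mantissa product with an assumed relative error.
        mant_prod = mx * mw * (1.0 + mantissa_error)
        # Exact digital exponent handling.
        exp_sum = ex + ew
        # Digital accumulation of the rescaled product.
        acc += math.ldexp(mant_prod, exp_sum)
    return acc
```

Because the error enters only through the mantissa product, it stays bounded relative to each term regardless of the operands' magnitudes, which is one intuition for why this partition preserves floating-point dynamic range.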
6. System-Level Integration, Compilation, and Future Prospects
Hybrid CiM deployment in full-stack systems hinges on dynamic allocation of memory vs. compute roles, efficient compilation, and co-scheduling with external cores:
- Dual-mode fabrics allow arrays to dynamically switch between compute and storage (scratchpad) modes, exposing an additional layer of scheduling flexibility. Compiler–hardware co-designs (CMSwitch) integrate segmentation and mixed-integer programming techniques to optimize mode switching, yielding up to 2× end-to-end speedups across diverse DNN workloads (Zhao et al., 24 Feb 2025).
- End-to-end accelerators (e.g., CIMR-V) couple programmable RISC-V cores to hybrid SRAM-CiM macros, combining analog MAC with digital instruction flow, on-chip DMA, and layer/weight fusion for 85% latency reduction and >3 kTOPS/W efficiency on TinyML tasks (and et al., 28 Mar 2025).
- Emerging CiM hardware targets algorithm–hardware robustness, statistical approximation, and dynamic digital-analog boundary control. Approaches such as memory-immersed collaborative digitization exploit in-memory capacitive DACs and shared reference generation for scalable, area-efficient ADC implementation (Nasrin et al., 2023).
Key remaining challenges include:
- ADC overhead mitigation and analog non-ideality management.
- Scaling hybrid methodologies to multi-bit floating-point, large transformer models, and variable-precision workflows.
- Dynamic, input-driven resource management and online segmentation for non-static DNNs.
- Integrated hardware–software toolflows for joint network and hardware parameter search over immense design spaces (Krestinskaya et al., 30 Sep 2025, Kundu et al., 2024).
- Integration with heterogeneous SoCs (NPUs, CPUs, custom accelerators) and algorithm/precision-adaptive execution (Zhao et al., 2024, Chen et al., 2023).
Hybrid CiM architectures are critical to enabling efficient, robust AI inference and training at scale. By combining the strengths of the digital and analog domains, carefully partitioning computation and storage, and automating co-optimization across the algorithm–hardware stack, hybrid CiM remains a leading paradigm for next-generation AI hardware (Yoshioka et al., 2024, Konno et al., 25 Aug 2025, Kundu et al., 2024, Yi et al., 11 Feb 2025).