Hybrid Analogue-Digital CIM System
- Hybrid analogue–digital CIM systems are integrated architectures that combine analog in-memory computing with digital processing to deliver energy-efficient and high-precision neural network computations.
- They use dynamic partitioning to assign dense layers and MAC operations to analog cores while leveraging digital logic for control and complex functions, optimizing throughput and area utilization.
- Key performance metrics include up to 20 GOPS/mm² area efficiency and over 2 TOPS/W energy efficiency, making these systems pivotal for next-generation AI accelerators.
A hybrid analogue–digital compute-in-memory (CIM) system is a heterogeneous architecture that integrates analogue in-memory compute engines—typically based on non-volatile or SRAM/DRAM-based crossbars for massive parallelization of multiply-accumulate (MAC) operations—with digital processors, such as CPUs, digital compute-in-memory (DCIM) subarrays, or digital logic for task orchestration, data movement, control, and high-precision computation. This architecture enables simultaneous exploitation of the ultra-high energy/area efficiency of analogue MAC computation and the flexibility, accuracy, and programmability of digital logic and memory, targeting modern neural network inference and training workloads.
1. System-Level Architecture
Hybrid analogue–digital CIM systems adopt a tightly-coupled architecture in which digital processing cores (e.g., RISC-V or ARM clusters) share memory resources and interconnect with one or more analogue in-memory compute accelerators (IMA) or analogue crossbar arrays. Shared resources allow for high-bandwidth, low-latency data exchange, mutual exclusion via locks, and concurrent job scheduling between analogue and digital domains. A canonical example consists of:
- An 8-core RISC-V cluster and an analogue IMA, both accessing a shared 512 kB tightly-coupled data memory (TCDM) via a low-latency interconnect.
- The IMA itself is hierarchical, comprising a control FSM, 3D-strided address generators, digital input layer (DAC buffers), the analogue compute core (e.g., PCM crossbar with DAC/ADC), and digital output layer (ADC buffers).
- The number of master ports on the TCDM interconnect (Nload/Nstore) can be varied at design time to trade area and performance.
- The hybridization appears both at the SoC system level and down to the individual memory macro, where, for example, the upper bits of a word are processed digitally and the lower bits in the analogue domain in a split-domain macro (Konno et al., 25 Aug 2025, Yoshioka et al., 2024).
This partitioning supports flexible workload mapping and minimizes data-movement bottlenecks, with all processing phases tightly synchronized by software through control/status registers, done flags, and optional interrupts (Ottavi et al., 2021, Gallo et al., 2022, Klein et al., 2022).
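The software-synchronized offload flow described above can be sketched as follows. This is a toy Python model, not any real driver API: the register names, the `MockIMA` class, and the dict-based TCDM are all illustrative assumptions; a real system would use memory-mapped registers and interrupts instead of polling a Python object.

```python
class MockIMA:
    """Toy model of an analogue in-memory accelerator with a control FSM."""

    def __init__(self, weights):
        self.weights = weights  # values "programmed" into the crossbar columns
        self.regs = {"start": 0, "done": 0, "src": None, "dst": None}

    def step(self, tcdm):
        # Control FSM: when 'start' is set, run one MVM job, then raise 'done'.
        if self.regs["start"] and not self.regs["done"]:
            x = tcdm[self.regs["src"]]
            # Analogue MAC phase (idealized): one dot-product per column.
            tcdm[self.regs["dst"]] = [
                sum(w * xi for w, xi in zip(col, x)) for col in self.weights
            ]
            self.regs["done"] = 1


def offload_mvm(ima, tcdm, src, dst):
    """Digital-core side: configure the job, trigger it, poll the done flag."""
    ima.regs.update(src=src, dst=dst, done=0, start=1)
    while not ima.regs["done"]:  # in hardware: wait-for-interrupt instead
        ima.step(tcdm)
    ima.regs["start"] = 0
    return tcdm[dst]


tcdm = {"in": [1, 2, 3]}                       # shared TCDM, modeled as a dict
ima = MockIMA(weights=[[1, 0, 1], [0, 1, 0]])  # two columns of length 3
print(offload_mvm(ima, tcdm, "in", "out"))     # → [4, 2]
```

The key point the sketch captures is that synchronization is purely software-driven: the digital core owns the control/status registers and the accelerator signals completion through a flag, exactly the handshake the text describes.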
2. Dataflow Partitioning and Integration Strategies
Workload mapping in hybrid CIM systems is highly workload-aware: the high-arithmetic-intensity portions of the neural network, such as dense/pointwise convolutions or fully-connected layers, are offloaded to the analog in-memory core, while digital cores or DCIM structures handle elements with poor analogue utilization, such as depthwise separable convolutions, low-rank/batch-norm updates, non-linear activation functions, and control flow (Ottavi et al., 2021, Gallo et al., 2022, Yoshioka et al., 2024, Klein et al., 2022).
Hybrid partitioning can be static—e.g., all 1×1 convolutions sent to the analogue CIM, depthwise convolutions executed by CPU cores—or dynamic with more sophisticated runtime or compiler-based orchestration. Analog-to-digital interfaces are realized via integrated DACs/ADCs, digital input/output buffers, and lock-step sequencers for synchronized operation:
- Input tiles are streamed into the IMA via 3D-strided access engines or pulse-width-modulated wordlines.
- Outputs are quantized by ADCs (variable precision, typically 4–8 bits) and streamed back into the TCDM or digital registers, where further processing can occur.
Multiple jobs can be pipelined or burst loaded by prefetching configuration and stride registers, reducing the effective non-computation cycle overhead (Ottavi et al., 2021).
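A static partitioning policy of the kind described above (1×1 convolutions to the analogue CIM, depthwise and other layers to the CPU cores) is simple enough to sketch directly. The layer-descriptor format and target names below are invented for illustration; real deployments would hook this into a compiler or runtime scheduler.

```python
def assign_target(layer):
    """Static partitioning policy: return which domain executes a layer."""
    is_pointwise = (
        layer["type"] == "conv"
        and layer["kernel"] == (1, 1)
        and not layer.get("depthwise", False)
    )
    if is_pointwise:
        return "analogue_ima"   # dense/pointwise: high crossbar utilization
    return "digital_cores"      # depthwise, activations, control flow, ...

# Hypothetical three-layer network fragment:
net = [
    {"type": "conv", "kernel": (1, 1)},
    {"type": "conv", "kernel": (3, 3), "depthwise": True},
    {"type": "relu", "kernel": None},
]
schedule = [(layer["type"], assign_target(layer)) for layer in net]
print(schedule)
# → [('conv', 'analogue_ima'), ('conv', 'digital_cores'), ('relu', 'digital_cores')]
```

A dynamic variant would replace the fixed predicate with a cost model evaluated at compile time or at runtime, as the text notes.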
3. Mathematical Performance Models and Efficiency Metrics
Key metrics for hybrid CIM systems are throughput, energy efficiency, and area utilization. Formally:
- Throughput: $\mathrm{GOPS} = N_{\mathrm{MAC}} / (t \cdot 10^{9})$
- Energy efficiency: $\mathrm{TOPS/W} = N_{\mathrm{MAC}} / (t \cdot P \cdot 10^{12})$
- Area efficiency: $\mathrm{GOPS/mm^{2}} = \mathrm{GOPS} / (A_{\mathrm{dig}} + A_{\mathrm{ana}})$
where $N_{\mathrm{MAC}}$ is the number of MAC operations, $t$ the execution time, $P$ the total power, and $A_{\mathrm{dig}}$/$A_{\mathrm{ana}}$ the digital/analogue area. For a PCM crossbar of size $M \times N$, up to $N$ dot-products of length $M$ can be completed per 70 ns analogue cycle (Ottavi et al., 2021, Gallo et al., 2022).
Efficiency scaling is determined by the mapping and utilization. Analogue core utilization is optimal for dense layers with high fan-in/out but drops dramatically for low-channel-count depthwise or separable convolutions, leading to a trade-off between area and throughput per layer. Area efficiency can reach up to $\sim 20$ GOPS/mm² for hybrid mappings compared to $4$–$8$ for pure digital, with energy efficiency exceeding $2$ TOPS/W (Ottavi et al., 2021).
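The three metrics can be turned into a small worked example. The functions below implement the definitions directly; the operating point is chosen to roughly reproduce the hybrid row of the table in Section 4 (13.2 GOPS, 2.55 TOPS/W, 19.7 GOPS/mm²), so the power and area inputs are back-calculated assumptions, not published figures.

```python
def throughput_gops(n_mac, t_s):
    """Throughput in GOPS: MAC operations per second, divided by 1e9."""
    return n_mac / t_s / 1e9

def energy_eff_tops_per_w(n_mac, t_s, p_w):
    """Energy efficiency in TOPS/W: operations per second per watt / 1e12."""
    return n_mac / t_s / p_w / 1e12

def area_eff_gops_per_mm2(n_mac, t_s, a_mm2):
    """Area efficiency in GOPS/mm²: throughput over total silicon area."""
    return throughput_gops(n_mac, t_s) / a_mm2

# Assumed operating point: 1.32e10 MACs in 1 s, ~5.18 mW, ~0.67 mm².
print(throughput_gops(1.32e10, 1.0))                           # → 13.2
print(round(energy_eff_tops_per_w(1.32e10, 1.0, 5.18e-3), 2))  # → 2.55
print(round(area_eff_gops_per_mm2(1.32e10, 1.0, 0.67), 2))     # → 19.7
```

Note how the same throughput yields very different area efficiencies depending on how much silicon the mapping activates, which is exactly the utilization effect discussed above.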
4. Hybrid Compute-in-Memory Microarchitectures
Hybrid architectures manifest both at the macro design and the system orchestration level. Notable structures include:
- Bit-sliced macros: Upper (MSB) bits computed by DCIM (popcount or adder trees), lower (LSB) bits in charge-domain or time-domain ACIM cells with shared SAR ADC (Konno et al., 25 Aug 2025, Yoshioka et al., 2024). Final output is merged via digital addition and shifting.
- ADC-Less hybrids: Replace bulky high-resolution ADCs with ternary/binary comparators, quantization-aware training, and a digital in-memory adder/subtractor tree for scaling and accumulation, leveraging the inherent sparsity for clock/energy gating (Negi et al., 2024).
- Storage-hybrid cells: Integration of non-volatile (e.g., RRAM) and CMOS paths within a single bit-cell, enabling analog compute on power rails and preserving digital SRAM read/write (Chakraborty et al., 15 Sep 2025).
- Task-level split: Assigning pointwise convolutions to the IMA and depthwise convolutions to digital cores improves both throughput and energy efficiency and yields roughly $2.5\times$ higher area efficiency than full-IMA mapping for full MobileNetV2 workloads (Ottavi et al., 2021).
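The bit-sliced macro from the first bullet reduces to a shift-and-add merge, which the following sketch demonstrates. The 4/4 MSB/LSB split is illustrative, and the LSB path is modeled as noiseless here; in a real ACIM slice it would carry analogue error that the MSB digital slice shields the result from.

```python
LSB_BITS = 4  # assumed slice width: upper 4 bits digital, lower 4 analogue

def split_weight(w):
    """Split an unsigned 8-bit weight into (MSB slice, LSB slice)."""
    return w >> LSB_BITS, w & ((1 << LSB_BITS) - 1)

def hybrid_dot(weights, x):
    """Dot-product with MSB slice computed digitally, LSB slice 'analogue'."""
    msb = sum(split_weight(w)[0] * xi for w, xi in zip(weights, x))  # digital
    lsb = sum(split_weight(w)[1] * xi for w, xi in zip(weights, x))  # analogue
    # Digital merge: shift the MSB partial sum up and add the LSB partial sum.
    return (msb << LSB_BITS) + lsb

weights, x = [200, 15, 129], [1, 2, 1]
assert hybrid_dot(weights, x) == sum(w * xi for w, xi in zip(weights, x))
print(hybrid_dot(weights, x))  # → 359
```

Because the merge is exact integer arithmetic, any analogue error in the LSB slice is bounded to the low-order bits of the result, which is the robustness argument behind MSB/LSB domain splitting.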
A selection of published efficiency/throughput values is provided below.
| Configuration | Throughput (GOPS) | Energy Eff. (TOPS/W) | Area Eff. (GOPS/mm²) |
|---|---|---|---|
| SW only | 4.4 | 0.85 | 4.2 |
| IMA (ima8) | 12.9 | 1.8 | 7.9 |
| IMA (ima16) | 13.5 | 1.5 | 6.1 |
| Hybrid (1x1→IMA, DW→SW) | 13.2 | 2.55 | 19.7 |
5. Example Applications, Trade-Offs, and Bottlenecks
Hybrid architectures are particularly advantageous for convolutional and transformer-based DNNs, where workload heterogeneity precludes a one-domain-fits-all solution:
- Dense/pointwise layers fully utilize analogue crossbar parallelism, yielding substantial speedup over digital cores for 1×1 convolutions.
- Depthwise/separable layers are mapped to digital cores due to suboptimal crossbar mapping, avoiding a large area penalty and yielding an overall speedup for networks like MobileNetV2 (Ottavi et al., 2021).
- Attention accelerators prune a large fraction of tokens in analogue CIM, passing only the informative subset for accurate digital refinement, achieving $14.8$ TOPS/W in the analogue core and $1.65$ TOPS/W at SoC level with only a small accuracy drop (Moradifirouzabadi et al., 2024).
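The analogue-prune/digital-refine flow for attention can be sketched as a two-stage filter. The scores, keep ratio, and function names below are illustrative; in the real accelerator the coarse scores come from low-precision analogue attention, and only the survivors are re-processed at full digital precision.

```python
def analogue_prune(tokens, scores, keep_ratio):
    """Coarse low-precision pass: keep the top-scoring fraction of tokens."""
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # preserve original token order for the refine pass
    return [tokens[i] for i in keep]

tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
scores = [0.9, 0.1, 0.7, 0.05, 0.8, 0.2]  # e.g., coarse attention saliency
survivors = analogue_prune(tokens, scores, keep_ratio=0.5)
print(survivors)  # → ['t0', 't2', 't4']
```

The energy win comes from the asymmetry: the cheap analogue pass touches every token, while the expensive digital pass touches only the surviving subset.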
Key system trade-offs include:
- Throughput ceilings due to streaming bottlenecks at the memory interface; beyond $4$–$8$ TCDM master ports, additional ports yield diminishing returns.
- Analog non-idealities (variability, IR drop, precision limitations) requiring algorithm-hardware co-design, retraining, or compensation (Gallo et al., 2022, Ottavi et al., 2021).
- Design complexity: Coordination of multiple hardware units, heterogeneity in software toolchains, partitioning support, calibration, and runtime orchestration.
- Conversion/auxiliary overhead: SAR ADCs and auxiliary digital circuits can dominate energy and latency, demanding conversion-efficient SAR/Flash hybrids, or even ADC-less approaches with sparse quantization (Nasrin et al., 2023, Negi et al., 2024).
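The ADC-less direction in the last bullet replaces a multi-bit converter with a ternary comparator per column, followed by digital scaling and accumulation. The sketch below is an idealization under assumed thresholds; in practice the threshold and scale would come from quantization-aware training, as the text describes.

```python
def ternary_compare(v, thr):
    """Ternary comparator: +1, 0, or -1 depending on the analogue column sum."""
    if v > thr:
        return 1
    if v < -thr:
        return -1
    return 0

def adc_less_readout(column_sums, thr, scale):
    """Quantize each column with a comparator, then scale digitally."""
    outs = [ternary_compare(v, thr) for v in column_sums]
    # Sparsity-aware digital accumulation: zero outputs can be clock/energy-gated.
    return [scale * o for o in outs]

print(adc_less_readout([0.8, -0.05, -1.2], thr=0.1, scale=4))  # → [4, 0, -4]
```

The dead zone around zero is what creates the exploitable sparsity: columns whose analogue sum falls inside it produce no digital activity at all.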
6. Design Challenges and Mitigation Strategies
Frequent obstacles are the limited ADC/DAC bandwidth, process-induced analog variability, calibration requirements, and poor analog utilization for certain structure types or data patterns. Solutions include:
- Dynamic partitioning: Saliency- or inference-aware allocation of MAC bits to digital or analog domains, e.g., on-the-fly boundary setting via OSE in OSA-HCIM (Yoshioka et al., 2024).
- Calibration and drift compensation: One-time per-ADC calibration, global rescaling of MVM output, hardware-in-the-loop finetuning, and device-aware retraining to address non-idealities and weight drift (Gallo et al., 2022, Yi et al., 11 Feb 2025).
- Exploiting algorithmic robustness: By carefully assigning the critical (MSB) bits to the digital domain, overall accuracy is preserved even at moderate SNR/CNR in the analogue domain. For transformers, this boundary must keep the compute SNR (CSNR) above a minimum dB threshold (Yoshioka et al., 2024).
- Architectural scaling: Array tiling, pipelined hybrid conversion (collaborative digitization), and 3D stacking for large-model support (Nasrin et al., 2023).
- SRAM stability & overhead: Hybrid cell designs avoid area increase by leveraging back-end integration or time-multiplexed analog/digital operation modes (Chakraborty et al., 15 Sep 2025, Konno et al., 25 Aug 2025).
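The calibration bullet above amounts to fitting a per-channel affine correction and applying it to subsequent MVM outputs. The sketch below uses a simple least-squares fit; the gain/offset drift model is invented for illustration, and a real flow would fit per ADC or per column against known test vectors.

```python
def fit_affine(measured, expected):
    """Least-squares fit of expected ≈ gain * measured + offset."""
    n = len(measured)
    mx = sum(measured) / n
    my = sum(expected) / n
    var = sum((m - mx) ** 2 for m in measured)
    cov = sum((m - mx) * (e - my) for m, e in zip(measured, expected))
    gain = cov / var
    return gain, my - gain * mx

def calibrate(raw, gain, offset):
    """Apply the one-time calibration to a raw analogue readout."""
    return gain * raw + offset

# Simulated drifted channel: raw = 0.5 * true + 3 (assumed error model).
true = [0.0, 10.0, 20.0, 40.0]
raw = [0.5 * t + 3.0 for t in true]
gain, offset = fit_affine(raw, true)

# A new raw readout for a true value of 16.0 is corrected back:
print(round(calibrate(0.5 * 16.0 + 3.0, gain, offset), 6))  # → 16.0
```

Because the correction is a single multiply-add per output, it fits naturally in the digital post-processing path without eroding the analogue energy advantage.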
7. Outlook and Future Research Directions
Hybrid analogue–digital CIM systems represent a converged path toward maximizing the compute-in-memory paradigm for DNN inference and training, achieving synergy between analog efficiency and digital precision/flexibility. Research directions include:
- Dynamic, input-aware partitioning at runtime or via compiler, possibly guided by saliency, quantization-awareness, or workload profiling.
- Ultra-dense, precision-adaptable macros supporting 8–12+ b fixed/floating point via hierarchical hybrid slicing (Konno et al., 25 Aug 2025, Yi et al., 11 Feb 2025).
- Extension of the hybrid paradigm to on-the-fly continual learning, training, and secure machine unlearning by co-located digital LoRA branches (Lin et al., 15 Jan 2026).
- Enhanced system integration with memory-immersed ADC/DAC, pipelined conversion, and crossbar-in-loop digital support for further efficiency scaling (Nasrin et al., 2023).
- Technology scaling toward sub-10 nm nodes and integration of novel devices (e.g., high-TMR STT-MRAM, fine-grained RRAM) for larger, more flexible hybrid arrays (Cai et al., 2021, Chakraborty et al., 15 Sep 2025).
The hybrid analogue–digital CIM approach is central to bridging the energy-precision-flexibility trade space required for next-generation edge and server AI accelerators, as formalized and demonstrated across multiple recent experimental and simulation studies (Ottavi et al., 2021, Gallo et al., 2022, Negi et al., 2024, Yoshioka et al., 2024, Konno et al., 25 Aug 2025, Yi et al., 11 Feb 2025, Chakraborty et al., 15 Sep 2025, Klein et al., 2022, Nasrin et al., 2023, Moradifirouzabadi et al., 2024, Lin et al., 15 Jan 2026).