Heterogeneous Compute-in-Memory Accelerators

Updated 26 November 2025
  • Heterogeneous Compute-in-Memory accelerators are architectures that combine multiple memory and compute modalities (analog and digital) to overcome data-movement limits in ML workloads.
  • They integrate diverse memory types, such as ReRAM, PCM, and SRAM, with digital processing to achieve fine-grained tradeoffs between energy efficiency, throughput, and precision.
  • Hardware-software co-design and advanced compilation techniques enable dynamic workload mapping and scalable performance in heterogeneous CIM systems.

Heterogeneous Compute-in-Memory (CIM) accelerators constitute a class of architectures that exploit multiple memory and compute modalities—typically analog and digital, and often spanning diverse memory device types, numerical precisions, and functional blocks—to overcome the data-movement and scalability limits of traditional von Neumann systems in advanced ML workloads. A heterogeneous CIM system combines crossbar-based analog compute (e.g., ReRAM, PCM, FeFET, SRAM) with digital-side compute (ALUs, SIMD units, on-chip or chiplet-level functional units), sometimes coordinating different array technologies spatially or hierarchically. This approach enables fine-grained tradeoffs between energy efficiency, throughput, precision, endurance, and scalability, and presents unique challenges in system integration, compilation, and workload mapping.

1. Architectural Principles of Heterogeneous CIM Accelerators

A heterogeneous CIM accelerator integrates multiple memory technologies or compute paradigms (analog, digital, mixed-signal) at varying architectural levels—array, core, chip, or system. Notable examples include architectures that mix analog crossbar arrays (e.g., eNVM, SRAM, FeFET) for in-situ multiply–accumulate with digital logic for control and non-linear/post-processing steps; designs that interleave analog and digital CIM (DCIM) chiplets in a mesh; and hybrids offering both a memory mode (storage) and a compute mode (in-place MAC engine) with dynamic switching.

Key generalizations:

  • Device Diversity: Systems employ SRAM, RRAM, PCM, FeFET, eDRAM, and MRAM, each optimized for specific tasks, e.g., static vector–matrix multiplication (analog), dynamic computation (SRAM), or associative search (CAM), reflecting their differing density, precision, and endurance characteristics (Khan et al., 24 Jan 2024, Wang et al., 19 Nov 2025, Qu et al., 23 Jan 2024, Chen et al., 2023).
  • Functional Heterogeneity: Analog CIM blocks perform dense MVM for convolution/projection layers; digital units handle index-sensitive or control-heavy operations, non-linearities, normalization, and high-precision accumulation of crossbar outputs (a minimal sketch of this division of labor follows this list).
  • Hierarchy and Chiplets: Chiplet-style platforms (e.g., Hemlet) tile arrays of ACIM and DCIM chiplets with global-buffer and intermediate data processing (IDP) chiplets, interconnected by a network-on-package (NoP), for scalable deployment in large workloads such as vision transformers (Wang et al., 19 Nov 2025).
  • Dynamic Reconfiguration: Dual-mode arrays (compute/memory) allow adaptive allocation and programmable dataflow, reallocating resources as workload demands shift (Zhao et al., 24 Feb 2025).
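
To make these principles concrete, the following minimal Python sketch models functional heterogeneity and dual-mode arrays: an analog tile holds stationary weights and performs in-array MVM, while a digital unit applies the nonlinearity. The class names and interfaces are illustrative assumptions, not APIs from any cited design.

```python
import numpy as np

class AnalogCIMTile:
    """Crossbar tile: weights are programmed once; MVM happens in-array."""
    def __init__(self, weights):
        self.weights = np.asarray(weights, dtype=np.float32)
        self.mode = "compute"  # dual-mode array: "compute" or "memory"

    def mvm(self, x):
        assert self.mode == "compute", "tile is currently plain storage"
        return x @ self.weights  # in-array multiply-accumulate

class DigitalUnit:
    """Digital side: control-heavy and non-linear ops at full precision."""
    @staticmethod
    def relu(x):
        return np.maximum(x, 0.0)

def run_layer(tile, digital, x):
    # Scheduler view: the dense MVM goes to the analog tile,
    # the nonlinearity to the digital unit.
    return digital.relu(tile.mvm(x))

tile = AnalogCIMTile(np.random.randn(8, 4))
y = run_layer(tile, DigitalUnit(), np.random.randn(8))
print(y.shape)  # (4,)
```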

2. Device and Modality Integration

Multiple CIM implementations exploit heterogeneity at the device and circuit level to harmonize compute and storage.

  • Analog-Digital Integration: Core compute steps—such as MVM—use crossbar-based analog in-memory operations, followed by digital accumulation, quantization, or logic. For example, RRAM- and PCM-based designs use a conductance-based dot product with digital ADC readout, while SRAM/FeFET analog macros employ charge-domain or capacitive multiplication, digitized by an ADC or flash ADC (Khan et al., 24 Jan 2024, Wang et al., 19 Nov 2025); a numeric sketch of this flow follows this list.
  • Dynamic Precision and Hybrid Arrays: Hybrid CIM arrays (as in OSA-HCIM) blend ACIM and DCIM modes in a split-port SRAM cell, allowing application-driven, per-operation dynamic selection of digital or analog MAC, controlled through saliency-aware precision mapping (Chen et al., 2023).
  • Technology-aware Partitioning: Frameworks like DNN+NeuroSim assign different layers or processing elements to different device types: SRAM for rapid weight updates/gradients in training, eNVM devices (e.g., FeFET, PCM, RRAM) for large, high-fan-in layers in inference (Peng et al., 2020).
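
As a rough numeric illustration of this analog/digital split, the sketch below stores weights at limited conductance precision, forms the bitline dot products, and digitizes each column output through a model ADC. The uniform quantizer and the chosen bit widths are assumptions; real designs differ in their DAC/ADC schemes and noise sources.

```python
import numpy as np

def quantize(x, n_bits, x_max):
    """Uniform symmetric quantizer over [-x_max, x_max]."""
    levels = 2 ** n_bits - 1
    step = 2 * x_max / levels
    return np.clip(np.round(x / step), -(levels // 2), levels // 2) * step

def crossbar_mvm(weights, x, w_bits=4, adc_bits=6):
    # Weights stored as quantized conductances (finite device levels).
    w_q = quantize(weights, w_bits, np.abs(weights).max())
    # Analog dot product accumulates along each bitline (column).
    bitline = x @ w_q
    # The ADC digitizes each column output with limited resolution.
    return quantize(bitline, adc_bits, np.abs(bitline).max() + 1e-12)

rng = np.random.default_rng(0)
W, x = rng.standard_normal((16, 8)), rng.standard_normal(16)
exact = x @ W
approx = crossbar_mvm(W, x)
print("relative error:", np.linalg.norm(exact - approx) / np.linalg.norm(exact))
```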

Table 1 summarizes representative heterogeneous CIM designs:

| Architecture | Memory Technology | Heterogeneity Mode |
|---|---|---|
| Hemlet (Wang et al., 19 Nov 2025) | RRAM (ACIM), SRAM (DCIM), SRAM (IDP) | Chiplet-partitioned, analog/digital |
| OSA-HCIM (Chen et al., 2023) | Hybrid 6T SRAM | Dynamic analog/digital boundary |
| DNN+NeuroSim V2.0 (Peng et al., 2020) | FeFET, RRAM, PCM, SRAM | Layer-wise device assignment |
| IBM 64-core PCM (Khan et al., 24 Jan 2024) | PCM, on-chip digital | Analog in-array compute + digital post-processing |

3. Compilation, Abstraction, and Programming Methodologies

The emergence of heterogeneous CIM imposes nontrivial challenges for mapping, scheduling, and code generation.

  • Hierarchical and Multi-level Compilation: Networks are mapped using multi-level abstractions—chip, core, crossbar/array—combined with iterative, tiered compilation passes (e.g., CIM-MLC) that partition workloads, tile matrices, pipeline operations, and remap data granularity, accounting for each device's operating regime and resources (Qu et al., 23 Jan 2024).
  • Device- and Mode-aware Abstractions: Compilers such as CMSwitch extend standard architectures with dynamic hardware annotations (e.g., array dual-mode switching, mode-switch latency) and employ dynamic-programming and mixed-integer-programming (DP+MIP) passes to jointly segment graphs and allocate modes across arrays for optimal throughput and resource utilization (Zhao et al., 24 Feb 2025).
  • Meta-operator and Multi-target IR: Advanced compilers (CINM, CIM-MLC) emit meta-operators at each architectural tier (e.g., core, crossbar, wordline), which a backend lowers to native commands, supporting portability across backends (Qu et al., 23 Jan 2024, Khan et al., 2022). CINM, built on MLIR, uses parallel dialects (e.g., “cim”, “cnm”, “memristor-dialect”) and progressive lowering for device-agnostic and device-aware optimizations (Khan et al., 2022).
  • Hardware-software Co-design: In HASTILY, ISA extensions (e.g., EXP_VEC, MAX_REDUCE_VECTOR) and compiler-aware scheduling eliminate memory bottlenecks in transformer softmax by co-optimizing dataflow, reduction, and in-memory element-wise operations (Kim et al., 17 Feb 2025); an illustrative emulation of these primitives follows this list.
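
As a concrete illustration, the following sketch emulates the two named vector primitives in Python and composes them into the numerically stable softmax that such extensions accelerate. Only the instruction names come from the HASTILY description; the emulated semantics are assumptions.

```python
import numpy as np

def MAX_REDUCE_VECTOR(v):
    # Assumed semantics: single-pass max reduction over a vector.
    return np.max(v)

def EXP_VEC(v):
    # Assumed semantics: element-wise exponential on a vector unit.
    return np.exp(v)

def softmax(scores):
    m = MAX_REDUCE_VECTOR(scores)  # subtract the max for stability
    e = EXP_VEC(scores - m)
    return e / e.sum()             # final normalization in digital logic

print(softmax(np.array([1.0, 2.0, 3.0])))
```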

4. Performance Metrics, Energy Efficiency, and Trade-offs

Heterogeneous CIM systems achieve significant gains in throughput and energy efficiency compared to both monolithic architectures and von Neumann baselines, but they expose nuanced trade-offs among area, latency, and other architectural parameters.

  • Throughput and Efficiency: Chiplet-based designs (Hemlet) deliver up to 8.68 TOPS per system, with system-level speedups ranging from 1.44× to 4.07× over monolithic baselines, and up to 3.86 TOPS/W energy efficiency. OSA-HCIM achieves 5.79 TOPS/W, a 1.95× gain over pure DCIM for ≤0.1% accuracy drop (Wang et al., 19 Nov 2025, Chen et al., 2023); a worked energy-per-operation conversion follows this list.
  • Latency and Mode-switching: Dynamic mode allocation (CMSwitch) incurs minimal (3–5%) runtime overhead, but offers up to 2.03× speedup versus single-mode mapping (Zhao et al., 24 Feb 2025).
  • Area/performance/energy: SiTe CiM I in 8T-SRAM incurs 18% area overhead for nearly 7.7× latency improvement; SiTe CiM II reduces area cost (6%) at some expense in speed. HASTILY's UCLM modules add <3% area and <2.5% dynamic power (Thakuria et al., 24 Aug 2024, Kim et al., 17 Feb 2025).
  • Device Variation and Endurance: Analog NVMs remain limited by retention, endurance, and fabrication variability, while digital SRAM/DCIM features larger area per bit and higher energy but superior endurance (Khan et al., 24 Jan 2024, Wang et al., 19 Nov 2025).
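
For intuition, a reported efficiency of X TOPS/W converts directly to energy per operation: X·10¹² operations per joule, i.e., 1/X pJ per operation. A two-line check against the figures above:

```python
def pj_per_op(tops_per_watt):
    # X TOPS/W = X * 1e12 ops per joule, so energy/op = 1/X picojoules.
    return 1.0 / tops_per_watt

for name, eff in [("Hemlet", 3.86), ("OSA-HCIM", 5.79)]:
    print(f"{name}: {eff} TOPS/W ≈ {pj_per_op(eff):.2f} pJ/op")
# Hemlet: 3.86 TOPS/W ≈ 0.26 pJ/op
# OSA-HCIM: 5.79 TOPS/W ≈ 0.17 pJ/op
```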

5. Scalability, System Integration, and Workload Mapping

Scaling heterogeneous CIM accelerators beyond single-die constraints leverages chiplets, network-on-package, and hierarchical resource allocation.

  • Chiplet Modularization: Hemlet's mesh of ACIM, DCIM, and IDP chiplets interconnected by the NoP enables expansion in compute and memory capacity. However, NoP bandwidth and hop latency dominate when the system is scaled, capping throughput scaling (Wang et al., 19 Nov 2025).
  • Group-Level Parallelism: GLP (as in Hemlet) interleaves columns from different layers within an ACIM group, enabling all columns in a group to be digitized in parallel, which improves ADC utilization and throughput scaling within analog CIM tiles (Wang et al., 19 Nov 2025); a simplified cycle-count sketch follows this list.
  • Dynamic Mapping and Reconfiguration: By exposing per-array, per-chip, and per-segment constraints, compilers (CMSwitch, CIM-MLC) automatically adapt mapping and resource allocation to changing model structure, arithmetic intensity, and array availability (Zhao et al., 24 Feb 2025, Qu et al., 23 Jan 2024).
  • Heterogeneous Task Partitioning: Certain frameworks (e.g., DNN+NeuroSim) partition networks layer-wise, mapping high-fan-in layers to eNVM arrays and computation-heavy, update-intensive stages (e.g., weight gradients) to SRAM or PCM. Offloading is performed for GEMM-heavy kernels (TDO-CIM), with less compute-intensive tasks retained on the host CPU (Peng et al., 2020, Vadivel et al., 2020).
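
The ADC-utilization benefit of group-level column interleaving can be seen with a simplified cycle count, assuming each ADC group digitizes G columns per cycle. This toy model ignores routing and accumulation detail and is a sketch under stated assumptions, not the cited paper's scheme.

```python
import math

def adc_cycles_per_layer(cols_per_layer, group_size):
    """Per-layer mapping: each layer's columns fill ADC groups separately."""
    return sum(math.ceil(c / group_size) for c in cols_per_layer)

def adc_cycles_interleaved(cols_per_layer, group_size):
    """GLP-style mapping: columns from different layers share a group."""
    return math.ceil(sum(cols_per_layer) / group_size)

cols = [10, 6, 3]  # columns mapped from three different layers
G = 8              # columns an ADC group digitizes in parallel
print(adc_cycles_per_layer(cols, G))   # ceil(10/8)+ceil(6/8)+ceil(3/8) = 4
print(adc_cycles_interleaved(cols, G)) # ceil(19/8) = 3
```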

6. Challenges, Limitations, and Future Outlook

Although heterogeneous CIM accelerators present substantial promise, several technical challenges and open questions persist.

  • Device/Integration Barriers: Yield, endurance, analog drift/noise, and device variation (both device-to-device, D2D, and cycle-to-cycle, C2C) continue to limit broad deployment and reliability in large-scale systems (Khan et al., 24 Jan 2024, Peng et al., 2020).
  • Software and Programming Model Complexity: Extracting the full benefit of heterogeneity mandates robust, hierarchical compilation stacks, device- and workload-aware scheduling, and generalized IRs, all of which increase programming complexity and hinder adoption (Khan et al., 2022, Qu et al., 23 Jan 2024).
  • On-chip Communication Bottlenecks: As in Hemlet, NoP link bandwidth and communication latency cap practical scaling, and tradeoffs must be negotiated between compute density and communication-intensive inter-chip or inter-array dataflows (Wang et al., 19 Nov 2025).
  • Cross-layer/Function Adaptivity: Runtime or co-design mechanisms for allocating modes, splitting workloads, or tuning analog/digital boundaries (e.g., saliency-aware in OSA-HCIM) are still emerging and typically lack OS/runtime coordination (Chen et al., 2023, Zhao et al., 24 Feb 2025).
  • Precision and Generalizability: Most efficiency gains are in low- to moderate-precision settings (1–8 bit), with extensions to high-precision or non-dense/sparse workloads requiring additional research and, in many cases, new analog primitives (Thakuria et al., 24 Aug 2024, Peng et al., 2020).

A plausible implication is that future generations of heterogeneous CIM accelerators will increasingly integrate hardware–compiler co-design, programmable dual-mode or multi-modal arrays, advanced communication fabrics (e.g., photonic NoP), and mixed-technology stacks within and across chiplets or system tiles. Systematic performance–energy–precision trade-off exploration, enabled by co-simulation and design frameworks (e.g., DNN+NeuroSim), remains critical for balanced designs across diverse ML benchmarks (Peng et al., 2020, Khan et al., 24 Jan 2024).
