Heterogeneous-Hybrid PIM (HH-PIM)
- HH-PIM is an advanced accelerator architecture that combines distinct PIM device types to tailor processing for varying workload demands.
- It employs dynamic scheduling and data mapping to overcome the limitations of homogeneous designs, ensuring optimal device utilization.
- Empirical results show significant throughput, energy efficiency, and performance gains across applications like LLM inference and GNN training.
A heterogeneous-hybrid processing-in-memory (HH-PIM) architecture integrates multiple, distinct PIM device types or compute modalities within a unified accelerator platform, orchestrating the complementary strengths of each type for a target workload’s diverse memory, compute, latency, or power requirements. In an HH-PIM system, various memory and compute elements—such as analog/memristive PIM arrays, digital systolic arrays, SRAM-PIM, DRAM-PIM, or MRAM/FeFET PIM modules—operate as specialized compute islands or tiers under coordinated scheduling. This approach overcomes the fixed granularity, resource rigidity, or device-specific bottlenecks of homogeneous or simple hybrid PIM by dynamically partitioning and mapping sub-tasks to the optimal device type. Recent HH-PIM architectures have demonstrated order-of-magnitude improvements in throughput, energy efficiency, and resilience to workload skew or device non-idealities across applications ranging from LLMs to GNN training, AI inference, and transactional-analytical database integration.
1. Foundational Design Principles of HH-PIM
HH-PIM design mandates modularity at the device, array, and interconnect levels. Representative implementations combine:
- Multiple compute substrates (e.g., analog RRAM crossbars, digital accumulators, embedded SRAM/FeFET, MRAM) on planar or 3D-integrated stacks (Ogbogu et al., 22 Aug 2025, Kanani et al., 14 Aug 2025, Jeon et al., 2 Apr 2025).
- Discrete digital accelerator blocks (e.g., systolic arrays, TPUs/NPUs) coupled to PIM blocks for high-precision or nonlinear tasks (Malekar et al., 31 Mar 2025, Chen et al., 10 Nov 2025, Duan et al., 16 Sep 2025, Heo et al., 2024).
- Clusters or chiplets grouped by PIM type, linked via an on-chip or chiplet/interposer-level network (Kanani et al., 14 Aug 2025, Jeon et al., 2 Apr 2025).
- Centralized or distributed schedulers responsible for graph partitioning, data movement, and kernel-to-device mapping (Malekar et al., 31 Mar 2025, Duan et al., 16 Sep 2025, Kang et al., 2022).
A key tenet is that each memory technology is tasked according to its performance envelope; for example, analog RRAM PIM performs energy-dense low-precision projections, while high-endurance SRAM or FeFET PIM handles frequent writes or nonlinear reductions. This separation is fundamental when faced with workload dynamism, non-uniform memory access patterns, and hardware constraints typical of real AI and data workloads.
2. Architectural and Device-Level Heterogeneity
Device heterogeneity spans NVM (ReRAM, MRAM, FeFET), SRAM, analog/digital modalities, and voltage domains. Notable architectural examples include:
- PIM-LLM: Combines analog RRAM crossbar PIM (256×256, 1b×8b, for projection layers) with an 8b digital systolic array (32×32, for attention heads), orchestrated by a CPU layer scheduler. The on-chip network routes activations/weights, and LPDDR serves as the backing storage medium (Malekar et al., 31 Mar 2025).
- HePGA: 3D-stacked “tiles” with tiered device types: bottom two tiers employ high-density ReRAM (static data), mid-tier FeFET (fast, moderate write endurance), top-tier SRAM (write-dominated flows, e.g. gradient updates) (Ogbogu et al., 22 Aug 2025).
- THERMOS: Chiplet-level heterogeneity by mixing ReRAM, SRAM (digital/analog), FeFET-based PIM, in various crossbar sizes and memory bit-widths, grouped into clusters per technology (Kanani et al., 14 Aug 2025).
- HH-PIM for edge AI: Implements a big.LITTLE-style cluster structure that pairs high-performance MRAM+SRAM PIM modules with low-power MRAM+SRAM PIM modules, both managed by dynamic allocation and power gating (Jeon et al., 2 Apr 2025).
This configuration enables dynamic mapping of compute or storage sub-tasks to device types with the best matching latency, endurance, or energy profile.
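As a minimal sketch of this kind of best-fit mapping, the snippet below picks the densest tier that still meets a sub-task's endurance and latency needs, loosely mirroring the HePGA tiering described above. The read latencies and energies are the values listed in the device-parameter section of this article; the ordinal `endurance_rank` and `density_rank` fields, the `SubTask` thresholds, and all names are illustrative assumptions, not part of any cited design.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DeviceProfile:
    name: str
    read_ns: float       # read latency (value reported in this article)
    read_pj: float       # read energy per access (value reported in this article)
    endurance_rank: int  # ASSUMED ordinal: 0 = static data only, 2 = write-heavy OK
    density_rank: int    # ASSUMED ordinal: higher = denser storage


DEVICES = (
    DeviceProfile("ReRAM", 30.0, 50.0, endurance_rank=0, density_rank=2),
    DeviceProfile("FeFET", 5.0, 30.0, endurance_rank=1, density_rank=1),
    DeviceProfile("SRAM", 1.0, 10.0, endurance_rank=2, density_rank=0),
)


@dataclass(frozen=True)
class SubTask:
    name: str
    min_endurance_rank: int  # how write-intensive the sub-task is (assumed scale)
    max_read_ns: float       # per-access latency ceiling (hypothetical)


def best_fit(task: SubTask,
             devices: Sequence[DeviceProfile] = DEVICES) -> Optional[DeviceProfile]:
    """Return the densest device that satisfies the task, or None if none does."""
    feasible = [
        d for d in devices
        if d.endurance_rank >= task.min_endurance_rank
        and d.read_ns <= task.max_read_ns
    ]
    # Prefer dense tiers so that scarce, fast SRAM is kept for write-heavy work.
    return max(feasible, key=lambda d: d.density_rank, default=None)


if __name__ == "__main__":
    for task in (
        SubTask("graph adjacency (static)", 0, 50.0),
        SubTask("combine kernel", 1, 10.0),
        SubTask("gradient updates", 2, 10.0),
    ):
        chosen = best_fit(task)
        print(task.name, "->", chosen.name if chosen else "no feasible device")
```

Preferring density over raw speed is a deliberate choice in this sketch: with the listed numbers SRAM wins on latency and energy alone, so a capacity-like criterion is needed for the slower tiers to be selected at all.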
3. Scheduling, Data Mapping, and Dynamic Resource Allocation
Scheduling in HH-PIM is central to achieving Pareto-optimal deployment. Key strategies observed:
- Centralized dispatch: CPU or top-level controller statically and/or dynamically partitions computation, delegating high-precision kernels to digital arrays or SRAM-PIM, low-precision or bandwidth-bound operations to analog/NVM PIM.
  - e.g., PIM-LLM’s master scheduler partitions Transformer layers into projections (PIM) and attention heads (TPU), controlling weight and activation movement (Malekar et al., 31 Mar 2025).
  - HPIM leverages a hardware-aware compiler to label and distribute operators (GEMM/GEMV) to SRAM-PIM or HBM-PIM, with intra-token pipelining (Duan et al., 16 Sep 2025).
- Dynamic, workload-adaptive resource allocation: THERMOS achieves multi-objective optimization (latency, energy, thermal) by training a multi-objective RL policy to select the device cluster per neural layer, factoring device temperature, available memory, and inter-chiplet distance (Kanani et al., 14 Aug 2025).
- Data placement optimization: HH-PIM for edge AI applies DP-based (knapsack) allocation to minimize energy for a given latency budget by splitting neural weights across HP-SRAM/MRAM and LP-SRAM/MRAM, with runtime migration and power-gating (Jeon et al., 2 Apr 2025).
- Hardware/software co-design for concurrency: NeuPIMs unlocks lock-free concurrency by introducing dual row-buffers in each DRAM bank, enabling NPU and PIM to access separate banks simultaneously and batch interleaving to maximize utilization (Heo et al., 2024).
Such scheduling and mapping approaches are essential to exploit the full heterogeneity and avoid underutilization or thermal violations across complex hardware planes.
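The DP-based (knapsack) placement idea can be sketched as a multiple-choice knapsack: each weight block is placed in exactly one memory tier, and the goal is the lowest total energy whose total latency fits the budget. This is an illustrative sketch only. The tier costs in `TIERS`, the assumption that block latencies add up serially, and the function name `allocate` are hypothetical and are not taken from Jeon et al.; runtime migration and power gating are not modeled.

```python
# Hypothetical per-block cost of each tier: (latency units, energy units).
TIERS = {
    "HP-SRAM": (1, 8),
    "HP-MRAM": (2, 6),
    "LP-SRAM": (3, 3),
    "LP-MRAM": (4, 2),
}


def allocate(blocks, latency_budget):
    """Pick one tier per block, minimizing energy with total latency <= budget.

    blocks: list of dicts mapping tier name -> (integer latency, energy).
    Returns (min_energy, [tier per block]) or None if the budget is infeasible.
    """
    INF = float("inf")
    dp = [0] + [INF] * latency_budget  # dp[t]: min energy at total latency exactly t
    choices = []                       # choices[i][t] = (tier, previous t) for block i
    for options in blocks:
        nxt = [INF] * (latency_budget + 1)
        pick = [None] * (latency_budget + 1)
        for t, energy_so_far in enumerate(dp):
            if energy_so_far == INF:
                continue
            for tier, (lat, energy) in options.items():
                t2 = t + lat
                if t2 <= latency_budget and energy_so_far + energy < nxt[t2]:
                    nxt[t2] = energy_so_far + energy
                    pick[t2] = (tier, t)
        dp = nxt
        choices.append(pick)

    t_best = min(range(latency_budget + 1), key=lambda t: dp[t])
    if dp[t_best] == INF:
        return None

    # Walk the recorded choices backwards to recover the per-block placement.
    assignment, t = [], t_best
    for pick in reversed(choices):
        tier, t = pick[t]
        assignment.append(tier)
    assignment.reverse()
    return dp[t_best], assignment


if __name__ == "__main__":
    blocks = [TIERS] * 3           # three identical weight blocks
    print(allocate(blocks, 12))    # loose budget: everything fits in LP-MRAM
    print(allocate(blocks, 3))     # tight budget: everything must go to HP-SRAM
    print(allocate(blocks, 2))     # infeasible budget: None
```

With a loose budget the sketch settles on the low-power tier for every block, and as the budget tightens it shifts blocks toward the high-performance tiers, which is the qualitative behavior the latency-constrained energy minimization is meant to produce.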
4. Application Domains and Workload-Specific Mapping
HH-PIM has delivered empirical and modeled performance advances in distinct workload domains:
- Transformer LLM Inference:
  - PIM-LLM shows up to 80× throughput and 70% higher tokens/J versus traditional accelerators; analog PIM is assigned low-precision projection matmuls, while the digital systolic array handles attention heads. Peak analog MatMul throughput is modeled analytically, and the digital OS systolic array achieves 102.4 GOP/s (Malekar et al., 31 Mar 2025).
  - HPIM partitions LLM kernels such that latency-critical attention is on SRAM-PIM (ultra-low latency, full ISA), while HBM-PIM handles bandwidth-bound GEMV, with tightly coupled inter-domain pipelining. Speedups up to 22.8× against NVIDIA A100; parse-phase (GEMM) on TCU, decode-phase (GEMV) on PIM units (Duan et al., 16 Sep 2025).
  - IANUS integrates NPU and DRAM-PIM as unified memory, offloading memory-bound small mat-vecs to PIM and compute-bound mat-mats to NPU. Achieves 6.2× speedup and 3.7–4.4× energy efficiency versus NPU-memory baselines (Seo et al., 2024).
  - P³-LLM and NeuPIMs extend this paradigm with mixed-precision quantization and quantized PIM compute units or concurrent NPU/PIM execution, further boosting decode throughput and energy efficiency (Chen et al., 10 Nov 2025, Heo et al., 2024).
- GNN Training (HePGA): Maps large, static, sparse graph adjacency to ReRAM, fast moderate-size combine kernels to FeFET, and write-intensive gradient flows to SRAM, optimizing for energy, area, and temperature-induced device variation. Up to 3.8× energy and 6.8× area efficiency improvements (Ogbogu et al., 22 Aug 2025).
- Edge AI: HH-PIM for edge devices dynamically distributes weights and data across performance-optimized and power-optimized clusters in response to real-time latency constraints, providing average 60.4% energy savings vs. prior PIMs (Jeon et al., 2 Apr 2025).
- HTAP Databases (PUSHtap): Fuses conventional CPU and PIM access via a unified row/column-aligned block-circulant format and in-place MVCC; achieves 3.4×/4.4× OLAP/OLTP throughput increase over multi-instance PIM (Zhao et al., 4 Aug 2025).
- Skew-resistant Ordered Indexing: PIM-tree partitions B+-/skip-list hybrids between shared-memory CPU (L3/top) and distributed PIM (L1/L2/leaves), and dynamically “pushes” or “pulls” requests based on contention, guaranteeing bounded communication per query even under power-law skew (Kang et al., 2022).
A common motif is per-operator or per-layer assignment to the “best-fit” PIM device, either via static analysis or runtime adaptation.
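One simple way to picture static per-operator assignment is a roofline-style rule: operators with low arithmetic intensity (memory-bound, such as decode-phase GEMV) go to PIM, and operators with high intensity (compute-bound, such as GEMM) go to the NPU or systolic array. The sketch below is an assumption-laden illustration of that rule, not the mapping algorithm of IANUS, HPIM, or any other cited system; the threshold `RIDGE_OPS_PER_BYTE` and the function names are hypothetical.

```python
RIDGE_OPS_PER_BYTE = 16.0  # hypothetical crossover between memory- and compute-bound


def arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 1) -> float:
    """Ops per byte for an (m x k) @ (k x n) product, counting each MAC as 2 ops.

    Assumes every operand and result element is moved exactly once.
    """
    ops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return ops / bytes_moved


def assign(m: int, k: int, n: int, ridge: float = RIDGE_OPS_PER_BYTE) -> str:
    """Return 'PIM' for memory-bound shapes and 'NPU' for compute-bound ones."""
    return "NPU" if arithmetic_intensity(m, k, n) >= ridge else "PIM"


if __name__ == "__main__":
    # Decode step: one token against a 4096 x 4096 weight matrix (GEMV).
    print("decode GEMV ->", assign(1, 4096, 4096))
    # Prompt processing: 512 tokens against the same weights (GEMM).
    print("prompt GEMM ->", assign(512, 4096, 4096))
```

For the GEMV shape the intensity is about 2 ops per byte, while the 512-token GEMM reaches roughly 819 ops per byte, so the two land on opposite sides of any reasonable threshold. Runtime-adaptive schemes would replace the fixed threshold with measured load or contention.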
5. Quantitative Results, Device Characterization, and Performance Models
Reported empirical and modeled performance advances are achieved via device specialization and fine-grained orchestration. Systems provide:
- Substantial throughput and energy improvements:
  - PIM-LLM: ≥ 2× and ≥ 5× improvement in GOPS and GOPS/W vs. previous PIM LLM accelerators (Malekar et al., 31 Mar 2025).
  - HPIM: Average 6.2× speedup, up to 22.8× versus A100 GPU (OPT family, up to 30B parameters) (Duan et al., 16 Sep 2025).
  - IANUS: 6.2× faster than A100 GPU for single-token GPT-2, reducing system energy by factors of 3.7–13× (Seo et al., 2024).
  - HH-PIM Edge: 60.4% average energy savings, up to 86.2% under light loads (Jeon et al., 2 Apr 2025).
  - HePGA: 3.8× energy, 6.8× area efficiency vs. homogeneous PIMs (Ogbogu et al., 22 Aug 2025).
  - PUSHtap: 3.4× OLTP/4.4× OLAP throughput vs. multi-instance PIM HTAP (Zhao et al., 4 Aug 2025).
  - NeuPIMs: Up to 3× throughput and 2–3× utilization over NPU-only/NPU+PIM (Heo et al., 2024).
- Device parameters (selected; from Ogbogu et al., 22 Aug 2025, Kanani et al., 14 Aug 2025):
  - ReRAM: 32 nm, 2 bits/cell, read ≈ 30 ns/50 pJ, non-ideality: conductance window shrinks with temperature.
  - FeFET: 28 nm, 1 bit/cell, read ≈ 5 ns/30 pJ, non-ideality: Vt window reduces linearly with temperature.
  - SRAM: 14 nm, 6T cell, read ≈ 1 ns/10 pJ, very high endurance.
  - Analog PIM (RRAM): 0.8 pJ/MAC (DAC ≈ 0.1 pJ, ADC ≈ 0.5 pJ, conduction ≈ 0.2 pJ).
- Scheduling and power models are tailored to each architecture, e.g., THERMOS uses a multi-objective RL framework, while HH-PIM edge applies DP-based allocation and LUT power-gating (Kanani et al., 14 Aug 2025, Jeon et al., 2 Apr 2025).
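The analog PIM energy figure above can be checked with a few lines of arithmetic. The sketch below sums the listed per-MAC components and scales the result to one matrix-vector product on a 256×256 crossbar. Pairing that crossbar size with these energy numbers is an illustrative assumption, as is the simplification that every cell contributes exactly one MAC at the quoted per-MAC cost; the function names are hypothetical.

```python
# Per-MAC energy components in femtojoules (integers avoid floating-point rounding).
# Values are the approximate figures listed above: DAC 0.1 pJ, ADC 0.5 pJ, conduction 0.2 pJ.
ANALOG_MAC_FJ = {"dac": 100, "adc": 500, "conduction": 200}


def mac_energy_fj(components=ANALOG_MAC_FJ) -> int:
    """Total energy of one analog MAC in femtojoules."""
    return sum(components.values())


def matvec_energy_nj(rows: int, cols: int, fj_per_mac: int) -> float:
    """Energy of one rows x cols matrix-vector product, assuming one MAC per cell."""
    return rows * cols * fj_per_mac / 1e6  # fJ -> nJ


def share(component: str, components=ANALOG_MAC_FJ) -> float:
    """Fraction of the per-MAC energy spent in one component."""
    return components[component] / mac_energy_fj(components)


if __name__ == "__main__":
    per_mac = mac_energy_fj()
    print("per-MAC energy (pJ):", per_mac / 1000)
    print("256x256 mat-vec energy (nJ):", matvec_energy_nj(256, 256, per_mac))
    print("ADC share of per-MAC energy:", share("adc"))
```

The components sum to 800 fJ, matching the 0.8 pJ/MAC total, and the ADC accounts for 62.5% of it, which is why converter cost tends to dominate discussions of analog crossbar efficiency.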
These HH-PIM systems are evaluated on standard benchmarks (Transformer LLMs, GNNs, CNNs, HTAP workloads), often using cycle-accurate simulation or FPGA prototyping.
6. Implementation Challenges, Limitations, and Generalization
Several practical challenges are identified:
- Resource constraints: Limited PIM program size, limited host-PIM bandwidth, and modest cache sizes in today's hardware (Kang et al., 2022).
- Thermal and non-ideality management: Crossbar device parameters and their variability require variation-aware mapping and thermal modeling (Ogbogu et al., 22 Aug 2025, Kanani et al., 14 Aug 2025).
- Complexity and scheduling: Coordinated allocation policies, RL-based schedulers, dynamic power/latency trade-off mechanisms introduce additional design and runtime complexity.
- Extension to new domains: Push-pull scheduling and shadow replication in PIM-tree, or block-circulant data mapping in PUSHtap, can be generalized to B-trees, radix indexes, graphs, and new PIM device types (Kang et al., 2022, Zhao et al., 4 Aug 2025).
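To make the push-pull idea concrete, the sketch below shows one plausible reading of contention-based routing: for each batch, queries aimed at lightly loaded PIM nodes are pushed to those nodes, while nodes that attract more queries than a threshold have their data pulled to the CPU instead. This is a simplified illustration under stated assumptions, not PIM-tree's actual algorithm; the `owner_of` mapping, the threshold, and the function name `push_pull` are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, List, Sequence, Tuple


def push_pull(
    batch: Sequence[int],
    owner_of: Callable[[int], Hashable],
    threshold: int,
) -> Tuple[Dict[Hashable, List[int]], Dict[Hashable, List[int]]]:
    """Split a query batch by per-node contention.

    Returns (push, pull):
      push[node] = queries to send to that PIM node (load <= threshold)
      pull[node] = queries served on the CPU after pulling the hot node's data
    """
    load = Counter(owner_of(key) for key in batch)
    push: Dict[Hashable, List[int]] = {}
    pull: Dict[Hashable, List[int]] = {}
    for key in batch:
        node = owner_of(key)
        target = pull if load[node] > threshold else push
        target.setdefault(node, []).append(key)
    return push, pull


if __name__ == "__main__":
    # Hypothetical range partitioning: keys 0-99 on node 0, 100-199 on node 1, ...
    owner = lambda key: key // 100
    skewed_batch = [5, 7, 9, 11, 150, 320]  # node 0 is the hot spot
    push, pull = push_pull(skewed_batch, owner, threshold=3)
    print("push:", push)
    print("pull:", pull)
```

Because the decision depends only on per-node query counts, the same routing rule could in principle sit in front of other partitioned structures such as B-trees or radix indexes, which is the kind of generalization mentioned above.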
A plausible implication is that as both memory-embedded PIM and specialized compute islands (e.g., digital NPUs, analog crossbars) scale, HH-PIM architectures and their associated scheduling software will become the de facto baseline for AI and data-intensive workload acceleration on heterogeneous platforms.
7. Comparative Summary of HH-PIM Architectures
| Architecture | Main Device Types | Application Domain | Performance Summary/Advantage |
|---|---|---|---|
| PIM-LLM (Malekar et al., 31 Mar 2025) | Analog RRAM PIM + digital systolic | 1-bit LLM inference | Up to 80× tokens/s; ≥2× GOPS and ≥5× GOPS/W vs. prior PIM accelerators |
| HPIM (Duan et al., 16 Sep 2025) | SRAM-PIM + HBM-PIM | LLM single-batch inference | Up to 22.8× speedup vs. A100, 6.2× avg. |
| HePGA (Ogbogu et al., 22 Aug 2025) | 3D ReRAM, FeFET, SRAM | GNN training/inference | Up to 3.8× energy, 6.8× area efficiency |
| THERMOS (Kanani et al., 14 Aug 2025) | ReRAM, SRAM/FeFET chiplets | Multi-model AI inference | Up to 89% faster, 57% less energy vs. baselines |
| IANUS (Seo et al., 2024) | NPU + DRAM-PIM | LLM end-to-end inference | 6.2× speedup, 3.7–4.4× energy efficiency |
| HH-PIM edge (Jeon et al., 2 Apr 2025) | HP/LP MRAM+SRAM | Edge AI adaptation | Avg. 60.4% energy savings vs. conventional PIM |
| PIM-tree (Kang et al., 2022) | CPU + 2048 DRAM-PIM nodes | Ordered index/DB | 59.1–69.7× speedup vs. prior skip lists |
| PUSHtap (Zhao et al., 4 Aug 2025) | CPU + DRAM-PIM, unified data | OLTP/OLAP (HTAP) | 3.4× OLTP/4.4× OLAP throughput frontier shift |
These results collectively demonstrate that the HH-PIM paradigm achieves a new Pareto frontier in bandwidth, energy, throughput, and adaptivity for AI and data workloads by leveraging heterogeneous device specializations, dynamic mapping, and workload-aware orchestration.