Heterogeneous Chiplet PIM Architecture

Updated 17 August 2025
  • Heterogeneous Chiplet-Based PIM architecture is a design that integrates diverse processing and memory chiplets to minimize data movement and boost energy-efficient performance for data-intensive applications.
  • It employs specialized chiplets with distinct process technologies and interconnects, leveraging hierarchical organization to balance cost, yield, and scalability across varied workloads.
  • The architecture relies on a co-designed software-hardware stack with unified programming models, adaptive scheduling, and runtime support to optimize workload partitioning and throughput.

A heterogeneous chiplet-based PIM (Processing-in-Memory) architecture integrates multiple distinct processing and memory chiplets—potentially fabricated using different technologies and process nodes—within a single system. The objective is to move computation closer to, or inside, memory to alleviate data movement bottlenecks, enhance performance, and improve energy efficiency, especially for data-intensive applications. The heterogeneity manifests not only in the computational/memory resources and accelerators present, but also in varying design trade-offs and integration strategies across chiplets within a package. Such architectures require careful co-design of software, hardware, system integration, and runtime, as well as support for efficient communication, cache coherence, programming models, workload partitioning, and cost-aware physical partitioning.

1. Architectural Principles and Integration Strategies

Heterogeneous chiplet-based PIM systems combine logic (e.g., processor cores, general-purpose or domain-specific accelerators), diverse memory types (e.g., DRAM, SRAM, ReRAM, MRAM), and persistent storage modules, all interconnected via high-bandwidth, low-latency in-package links. This compositional approach allows designers to optimize specific chiplets for different functions—such as deep-learning acceleration, in-memory transaction processing, or bulk pointer chasing—while leveraging cost, yield, and performance benefits of heterogeneous integration (Hao et al., 2023, Graening et al., 26 Jul 2025, Kanani et al., 14 Aug 2025, Jeon et al., 2 Apr 2025).

Key aspects include:

  • Modular Specialization: Chiplets may be specialized for compute (e.g., bit-serial CRAMs in PIMSAB (Arora et al., 2023), NPU units in NeuPIMs (Heo et al., 1 Mar 2024)), bandwidth-centric memory access (e.g., SRAM/DRAM arrays), or persistent storage.
  • Diverse Process Technologies: Technology assignment can be heterogeneous, pairing advanced nodes for performance-critical chiplets with mature nodes elsewhere; cost models show that mixing nodes yields up to 43% lower cost than homogeneous designs (Graening et al., 26 Jul 2025) (see the yield-based cost sketch after this list).
  • Inter-Chiplet Interconnects: Physical interconnects may include organic substrates, passive/active silicon interposers, or network-on-chip (NoC) topologies, each with specific bandwidth, latency, and “reach” constraints (Hao et al., 2023, Graening et al., 26 Jul 2025).
  • Hierarchical Organization: Tiling and clustering of chiplets enable scalable architectures, where cross-chiplet communication is critical and often realized through hierarchical networks (e.g., H-trees, 2D mesh, or packet-switched networks) (Arora et al., 2023, Leitersdorf et al., 2023, Heo et al., 1 Mar 2024).
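
To make the cost argument concrete, the sketch below applies the standard negative-binomial yield model to compare a monolithic die against a heterogeneous chiplet split. All wafer costs, defect densities, areas, and the per-chiplet assembly adder are illustrative assumptions, not values from the cited studies.

```python
# Sketch: why mixed process nodes can cut cost (negative-binomial
# yield model). All numbers below are illustrative assumptions.

def die_yield(area_mm2, d0_per_mm2, alpha=3.0):
    """Negative-binomial yield: Y = (1 + A*D0/alpha)^(-alpha)."""
    return (1.0 + area_mm2 * d0_per_mm2 / alpha) ** (-alpha)

def cost_per_good_die(area_mm2, wafer_cost, d0, wafer_area_mm2=70000.0):
    dies_per_wafer = wafer_area_mm2 / area_mm2   # ignores edge loss
    return wafer_cost / (dies_per_wafer * die_yield(area_mm2, d0))

# Hypothetical nodes: (wafer cost in $, defect density per mm^2).
ADV, MATURE = (17000.0, 0.002), (4000.0, 0.0008)

# Monolithic 400 mm^2 die entirely on the advanced node:
mono = cost_per_good_die(400, *ADV)

# Heterogeneous split: one 100 mm^2 compute chiplet on the advanced
# node, three 100 mm^2 memory/IO chiplets on the mature node, plus an
# assumed $20 bonding/test adder per extra chiplet.
hetero = (cost_per_good_die(100, *ADV)
          + 3 * cost_per_good_die(100, *MATURE)
          + 3 * 20.0)

print(f"monolithic: ${mono:.0f}  heterogeneous: ${hetero:.0f}")
```

The smaller dies' higher yield and the mature node's cheaper wafers drive the savings; the same mechanism underlies the up-to-43% figure above, though the real cost models also account for packaging, IO, and reach constraints.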

2. Programming Model, Software Stack, and System Co-Design

Programming heterogeneous chiplet-based PIM architectures requires unified abstractions that span disparate hardware domains. This entails:

  • Unified Programming Models: Execution models such as the Codelet PXM decompose computation into fine-grained, event-driven “codelets” with explicit data dependencies, decoupling programming from underlying chiplet heterogeneity and providing a contract for safe, predictable execution (Fox et al., 2022). This enables portable, composable software across vendor-specific chiplets.
  • High-level Frameworks: Software frameworks like SimplePIM abstract low-level hardware management on real PIM architectures (e.g., UPMEM), providing iterator-based functional interfaces (map, reduce, zip) and high-productivity primitives for data distribution, aggregation, and communication. These models hide the complexity of distributed chiplet environments, delivering productivity and performance simultaneously (Chen et al., 2023); a sketch of this interface style appears after this list.
  • ISA and Microarchitecture: Middleware such as PyPIM exposes tensor-oriented programming interfaces in Python, automatically mapping user code down to programmable, partition-parallel PIM ISAs and microarchitectures that support efficient intra- and inter-chiplet parallelism (Leitersdorf et al., 2023).
  • Compiler and DSL: End-to-end co-design with compiler flow (as in PIMSAB’s TVM integration) allows automatic mapping of data/memory tiling, loop transformations, and adaptive precision to underlying chiplet and memory topology (Arora et al., 2023).
  • OS and Runtime: Support is required for PIM-aware memory management, address translation (e.g., IMPICA’s region-based page tables (Ghose et al., 2018)), scheduling, and virtual memory abstraction, ensuring that disaggregated chiplets operate under unified or virtualized address spaces (Oliveira et al., 2022, Ghose et al., 2019).
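
To illustrate what such unified abstractions look like to the programmer, here is a hypothetical Python rendering of the map/zip/reduce iterator style that SimplePIM exposes. The real SimplePIM is a C framework targeting UPMEM hardware; the PimArray class, its method names, and its sharding scheme are inventions for illustration only.

```python
# Hypothetical sketch of an iterator-style PIM interface in the
# spirit of SimplePIM; "DPUs" are simulated as plain Python lists.

class PimArray:
    """Host-side handle to data block-distributed across PIM DPUs."""

    def __init__(self, data, n_dpus=4):
        chunk = max(1, -(-len(data) // n_dpus))  # ceiling division
        self.shards = [data[i * chunk:(i + 1) * chunk]
                       for i in range(n_dpus)]

    def map(self, f):
        # Each DPU applies f to its local shard: no inter-chiplet traffic.
        out = PimArray([], len(self.shards))
        out.shards = [[f(x) for x in s] for s in self.shards]
        return out

    def zip_with(self, other, f):
        # Element-wise combine of two identically distributed arrays.
        out = PimArray([], len(self.shards))
        out.shards = [[f(a, b) for a, b in zip(s1, s2)]
                      for s1, s2 in zip(self.shards, other.shards)]
        return out

    def reduce(self, f, init):
        # Local per-DPU reduction, then a cheap host-side combine;
        # mirrors the hierarchical aggregation such frameworks perform.
        partials = []
        for s in self.shards:
            acc = init
            for x in s:
                acc = f(acc, x)
            partials.append(acc)
        total = init
        for p in partials:
            total = f(total, p)
        return total

# Dot product expressed entirely through the iterator interface:
a = PimArray(list(range(8)))
b = PimArray([10] * 8)
print(a.zip_with(b, lambda x, y: x * y).reduce(lambda x, y: x + y, 0))  # 280
```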

3. Communication, Data Movement, and Cache Coherence

Efficient communication and cache coherence are central to chiplet-based PIM performance. Approaches include:

  • Communication Topologies: Intra-tile H-tree and inter-tile mesh/NoC connections allow spatially optimized reductions, broadcasts, and point-to-point data transfer, minimizing off-chip movement (Arora et al., 2023, Leitersdorf et al., 2023, Heo et al., 1 Mar 2024).
  • Signature-Based/Batch Coherence: Protocols such as LazyPIM employ speculative execution within PIM cores, record memory references in compressed Bloom filter–based signatures (PIMReadSet, PIMWriteSet, CPUWriteSet), and batch coherence checking at kernel commit, reducing off-chip traffic by up to 30.9% and improving performance by 19.6% over prior PIM mechanisms (Boroumand et al., 2017, Ghose et al., 2018); a sketch of this commit-time check appears after this list.
  • Conflict Detection and Rollbacks: Fine-tuning the granularity of speculative execution chunks manages the trade-off between false positives (rollback frequency) and batch size. Coherence is enforced by reconciling Bloom-filter-based access sets at kernel end, with partial commits and address locking to guarantee forward progress (Boroumand et al., 2017).
  • Programming Model Integration: Advanced program execution models, by expressing data dependencies through codelets, can reduce the need for complex global cache coherence, as explicit message passing or data transfer between codelets replaces aggressive hardware coherence (Fox et al., 2022).
  • Transactional Memory: PIM-STM shows that restricting transactions to local chiplet memory and supporting diverse TM algorithms (Tiny, NOrec, VR) enables scalable, efficient concurrency control with explicit APIs, achieving speedups up to 14.53× over CPU baselines in the UPMEM context (Lopes et al., 17 Jan 2024).
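
The signature mechanism is easy to sketch. Below is a minimal, illustrative model of LazyPIM-style commit-time checking: addresses touched by a speculative PIM kernel are recorded in Bloom-filter signatures, and a nonzero bitwise intersection with the CPU's write signature conservatively forces a rollback. The hash choice and filter size are assumptions, not the paper's hardware parameters.

```python
# Illustrative model of signature-based coherence checking in the
# spirit of LazyPIM; filter size and hashing are assumptions.

import hashlib

class BloomSignature:
    def __init__(self, bits=1024, k=3):
        self.bits, self.k, self.word = bits, k, 0

    def _bit_positions(self, addr):
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{addr}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "little") % self.bits

    def add(self, addr):
        for b in self._bit_positions(addr):
            self.word |= 1 << b

    def intersects(self, other):
        # A nonzero AND of the bit vectors may indicate a shared
        # address; false positives only cause extra rollbacks, never
        # missed conflicts, so the check is conservative but safe.
        return (self.word & other.word) != 0

# Speculative PIM kernel reads three cache lines; the CPU concurrently
# writes one of them, so the commit-time check must trigger a rollback.
pim_read, pim_write, cpu_write = (BloomSignature() for _ in range(3))
for addr in (0x100, 0x140, 0x180):
    pim_read.add(addr)
cpu_write.add(0x140)

conflict = cpu_write.intersects(pim_read) or cpu_write.intersects(pim_write)
print("rollback" if conflict else "commit")  # -> rollback
```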

4. Workload Partitioning, Scheduling, and Optimization

Dynamically mapping and partitioning workloads across heterogeneous chiplets is central to overall system efficiency:

  • Workload Characterization & Benchmarking: Systematic workload characterization (e.g., DAMOV (Oliveira et al., 2022), PrIM (Gómez-Luna et al., 2021)) identifies memory-bound and compute-bound kernels, enables targeted partitioning, and benchmarks cross-chiplet communication constraints.
  • Multi-Objective Scheduling: THERMOS applies multi-objective reinforcement learning (MORL) to dynamically place neural network layers onto chiplet clusters, seeking Pareto-optimal execution time, energy, and thermal envelope under dynamic load. A proximity-aware mapping stage within each cluster further reduces inter-chiplet communication. It demonstrates up to 89% lower execution time and 57% lower energy than prior algorithms, with runtime overhead as low as 0.14% (Kanani et al., 14 Aug 2025).
  • Dynamic Data Placement: HH-PIM employs a knapsack-like dynamic programming algorithm for runtime allocation of neural network weights between HP/LP MRAM/SRAM banks, subject to inference latency constraints (a simplified version of this dynamic program appears after this list). Lookup-table-based results ensure rapid (per time-slice) reallocation under fluctuating workloads, yielding up to 60.43% average energy savings over static placement (Jeon et al., 2 Apr 2025).
  • Co-Designed Partitioning: ChipletPart employs a genetic algorithm for technology assignment and a simulated annealing floorplanner to partition block-level netlists under reach, cost, and heterogeneous technology constraints, reducing overall chiplet manufacturing costs by up to 58% and enabling floorplan-feasible, performance-aware partitions (Graening et al., 26 Jul 2025).
  • Co-Optimization of Architecture and Packaging: The Monad framework encodes both architectural (PE/resource mapping, dataflow, pipelining) and integration parameters (packaging, network, placement) for chiplets, using Bayesian optimization and simulated annealing to search for trade-offs between PPA (power, performance, area) and cost. This approach leads to EDP (energy-delay product) reductions of 16%–30% compared to state-of-the-art chiplet spatial accelerators (Hao et al., 2023).
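
As one concrete instance of these optimization loops, the sketch below implements a knapsack-style dynamic program in the spirit of HH-PIM's placement step: each weight block goes to a high-performance (HP) or low-power (LP) bank so that total energy is minimized under a latency budget. The block costs and integer latency scale are illustrative assumptions.

```python
# Knapsack-style placement sketch in the spirit of HH-PIM: minimize
# energy subject to a latency budget. All costs are illustrative.

def place_blocks(blocks, latency_budget):
    """blocks: list of (lat_hp, en_hp, lat_lp, en_lp) tuples with
    integer latencies. Returns (min_energy, placement) or None."""
    INF = float("inf")
    dp = [0.0] + [INF] * latency_budget   # dp[t]: min energy at latency t
    choice = [[None] * (latency_budget + 1) for _ in blocks]
    for i, (lat_hp, en_hp, lat_lp, en_lp) in enumerate(blocks):
        new = [INF] * (latency_budget + 1)
        for t, e in enumerate(dp):
            if e == INF:
                continue
            for tag, lat, en in (("HP", lat_hp, en_hp), ("LP", lat_lp, en_lp)):
                if t + lat <= latency_budget and e + en < new[t + lat]:
                    new[t + lat] = e + en
                    choice[i][t + lat] = (tag, t)
        dp = new
    best_t = min(range(latency_budget + 1), key=lambda t: dp[t])
    if dp[best_t] == INF:
        return None                        # budget infeasible
    placement, t = [None] * len(blocks), best_t
    for i in reversed(range(len(blocks))):
        placement[i], t = choice[i][t]     # walk the choices backwards
    return dp[best_t], placement

# Three weight blocks as (HP latency, HP energy, LP latency, LP energy):
blocks = [(1, 5.0, 3, 2.0), (2, 8.0, 5, 3.0), (1, 4.0, 2, 1.5)]
print(place_blocks(blocks, latency_budget=8))
# -> (9.5, ['HP', 'LP', 'LP']): LP wherever the budget allows
```

HH-PIM additionally precomputes such solutions into lookup tables so that reallocation can happen per time slice rather than re-solving the program online.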

5. Application Domains and Representative Performance

Heterogeneous chiplet-based PIM architectures offer substantial, demonstrated benefits for large-scale, data-intensive applications:

  • Graph Processing and Databases: Coherent, high-throughput memory access and atomic kernel commitment benefit applications like PageRank, connected components, and HTAP workloads requiring both transactional and analytical processing on shared datasets (Boroumand et al., 2017, Oliveira et al., 2022).
  • Deep Learning and AI Inference: PIMSAB, exploiting bit-serial CRAMs and spatial reduction networks, delivers 3× higher performance and 4.2× lower energy than an A100 GPU on convolution and matrix-multiplication kernels, and is 3.88× faster than SIMDRAM (Arora et al., 2023). NeuPIMs, which couples an NPU and PIM on a single chiplet with dual-row-buffer DRAM banks, enables concurrent GEMM and GEMV execution for LLM inference, yielding 3×, 2.4×, and 1.6× higher throughput than GPU, NPU-only, and naive NPU+PIM baselines, respectively (Heo et al., 1 Mar 2024).
  • Edge AI: HH-PIM’s two-cluster hybrid design offers real-time dynamic power and performance optimization for edge DNNs, with up to 60.43% energy savings and significant reductions in inference latency, validated on FPGA prototypes (Jeon et al., 2 Apr 2025).
  • Scientific Computing and Tensor Processing: Generalized frameworks such as PyPIM and SimplePIM accelerate vector, matrix, and user-defined tensor workloads with programming simplicity, high resource utilization, and scalability to large systems (Leitersdorf et al., 2023, Chen et al., 2023).

6. Challenges, Open Problems, and Future Directions

Several challenges persist:

  • Cache Coherence and Consistency: Efficient, scalable coherence remains non-trivial as systems grow more heterogeneous and support more flexible programming models. Research continues into batch verification, compressed signatures, and “codelet”-granularity message passing to manage trade-offs between correctness, overhead, and system complexity (Boroumand et al., 2017, Fox et al., 2022, Oliveira et al., 2022).
  • Toolchain and System Support: The maturity of compilers, auto-partitioners, allocation algorithms, and runtime frameworks lags behind hardware. Ongoing work includes full-stack integration (DAMOV, SIMDRAM (Oliveira et al., 2022)), high-level API abstraction (PyPIM, SimplePIM), and domain-specialized DSLs (PIMSAB’s tensor DSL) to reduce software overhead and programmer effort.
  • Partitioning and Packaging: As designs escalate in scale and heterogeneity, partitioning under area, power, IO “reach,” and cost constraints becomes a multi-objective, NP-hard problem, motivating advanced optimization frameworks integrating technological, economic, and physical constraints (Graening et al., 26 Jul 2025, Hao et al., 2023).
  • Thermal and Power Management: Multi-objective, thermally-aware scheduling such as THERMOS (Kanani et al., 14 Aug 2025) addresses hot-spot avoidance and energy constraints, especially under rapid workload variation and as PIM devices diversify in technology (ReRAM, SRAM, etc.).
  • Scalable Communication: Apportioning workloads and minimizing communication cost, whether through proximity-aware scheduling (Kanani et al., 14 Aug 2025), flexible interconnect design (Arora et al., 2023), or dynamic data placement (Jeon et al., 2 Apr 2025), is critical to maximizing the benefit of chiplet-level integration.

7. Summary Table: Representative Techniques and Metrics

| Solution | Core Idea | Performance/Energy Impact |
|---|---|---|
| LazyPIM | Batch, signature-based coherence | +19.6% performance, –30.9% off-chip traffic, –18% energy vs. best prior (Boroumand et al., 2017) |
| Monad | Co-optimization of architecture and integration | 16–30% EDP reduction; co-optimization yields 8.1× latency and 3.9× energy gains (Hao et al., 2023) |
| PIMSAB | Bit-serial CRAM + spatial communication network | 3× speedup and 4.2× energy savings over an A100 GPU; 3.88× faster than SIMDRAM (Arora et al., 2023) |
| NeuPIMs | NPU+PIM dual-row-buffer, sub-batching | 3× (GPU), 2.4× (NPU), 1.6× (naive NPU+PIM) speedup (Heo et al., 1 Mar 2024) |
| ChipletPart | Cost-driven genetic-algorithm partitioning/floorplanning | 43–58% chiplet cost reduction vs. min-cut/homogeneous baselines; IO-reach- and floorplan-feasible (Graening et al., 26 Jul 2025) |
| HH-PIM | HP/LP hybrid with dynamic placement | Up to 60.43% energy savings and large latency improvements on edge AI workloads (Jeon et al., 2 Apr 2025) |
| THERMOS | MORL-based thermally-aware scheduling | Up to 89% lower execution time and 57% lower energy vs. Simba/RELMAS; 0.14% runtime overhead (Kanani et al., 14 Aug 2025) |

These methodologies collectively illustrate the state of the art in scalable, energy-efficient, and cost-effective heterogeneous chiplet-based PIM architectures as evidenced by empirical and analytical results across a spectrum of emerging research.
