
Near-bank PIM Architectures

Updated 18 November 2025
  • Near-bank Processing-in-Memory architectures are systems that integrate simple compute engines adjacent to DRAM banks to improve data locality and maximize internal bandwidth.
  • Adaptive data migration and subscription controls have been shown to reduce average memory latency by up to 54% and to enhance throughput for data-reuse-intensive workloads.
  • Successful deployment requires hardware-software co-design addressing challenges in coherence, virtual memory, and interconnect overhead to balance performance gains with system complexity.

Near-bank Processing-in-Memory (PIM) architectures are a class of tightly integrated memory-compute systems in which programmable logic or lightweight accelerators are situated immediately adjacent to individual DRAM banks, either within a logic layer (in 3D-stacked memory) or at the bank periphery of advanced DRAM chips. By physically collocating logic and DRAM at fine granularity, these systems exploit orders-of-magnitude higher internal bandwidth and vastly reduced access latency for "local" data compared to conventional CPU-centric architectures, and sharply curtail the energy and latency overheads stemming from off-chip data movement. Near-bank PIM thus represents a critical architectural response to both the bandwidth and energy limitations imposed by the von Neumann bottleneck, delivering system-level performance and efficiency gains for data-intensive workloads (Tian et al., 9 Oct 2025, Mutlu et al., 2019, Giannoula et al., 2022, Oliveira et al., 2022, Mutlu et al., 2020).

1. Architectural Principles and Microstructural Building Blocks

Near-bank PIM architectures typically integrate simple compute engines or processing units immediately next to the DRAM subarrays they serve. The dominant integration paradigms, illustrated by the structural sketch after this list, are:

  • 3D-Stacked DRAM + Logic Layer: Multiple DRAM dies are stacked over a logic die, interconnected via thousands of through-silicon vias (TSVs). The logic die is partitioned into local controllers for each DRAM array (vaults in HMC, channels in HBM), within which reside clusters of PIM "cores," scratchpads, and direct high-bandwidth I/O to nearby banks (Tian et al., 9 Oct 2025, Mutlu et al., 2020).
  • Near-Bank Compute Cores: Each DRAM bank or vault is paired with a digital PIM core (e.g., a lightweight RISC engine, SIMD ALU, or bank-level DSP), providing exclusive low-latency access to its associated memory array with aggregate bandwidth scaling linearly with the number of banks (Giannoula et al., 2022, Alsop et al., 2023).
  • Logic Placement and Locality: The logic must be topologically proximal to its partition of the DRAM array to maximize throughput and minimize contention. For instance, in HMC, a 6×6 network connects 32 vaults, and in HBM a 4×2 mesh connects 8 channels; variations (e.g., UPMEM's per-bank RISC DPUs) exploit similar principles (Tian et al., 9 Oct 2025, Giannoula et al., 2022).
  • Peripheral and Interconnect Support: Additional structures such as subscription tables, address-indirection tables, local feedback registers, and subscription buffers are required for more sophisticated data-locality management, as exemplified by the DL-PIM architecture (Tian et al., 9 Oct 2025).
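
The following minimal sketch models this organization; the class names and latency constants are illustrative assumptions, not values from the cited papers. It captures the defining property: accesses that stay within a core's own vault pay only local latency, while remote accesses additionally pay per-hop network costs.

```python
from dataclasses import dataclass

LOCAL_LATENCY_NS = 60   # assumed in-vault access latency (sub-100 ns)
HOP_LATENCY_NS = 25     # assumed per-hop interconnect cost

@dataclass
class Stack:
    num_vaults: int = 32  # HMC-style partitioning; an HBM stack would use 8 channels

    def vault_of(self, addr: int) -> int:
        # Simple interleaving: low-order address bits select the vault.
        return addr % self.num_vaults

    def hops(self, src: int, dst: int) -> int:
        # Placeholder distance metric; a real NoC would use 2D mesh coordinates.
        return abs(src - dst)

@dataclass
class PimCore:
    vault_id: int
    stack: Stack

    def access_latency_ns(self, addr: int) -> int:
        dst = self.stack.vault_of(addr)
        if dst == self.vault_id:
            return LOCAL_LATENCY_NS  # request never leaves the vault
        # Remote access: traverse the inter-vault network, paying per-hop latency.
        return LOCAL_LATENCY_NS + HOP_LATENCY_NS * self.stack.hops(self.vault_id, dst)

core = PimCore(vault_id=3, stack=Stack())
print(core.access_latency_ns(35))  # 35 % 32 == 3 -> local: 60 ns
print(core.access_latency_ns(17))  # vault 17, 14 hops away -> 410 ns
```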

2. Data Locality, Movement, and Migration Mechanisms

A fundamental performance constraint for near-bank PIM is the non-uniformity in latency and bandwidth between "local" and "remote" memory accesses; a cost-model sketch of the resulting trade-off follows the list:

  • Local Accesses: Accesses that remain within the DRAM partition (e.g., vault/bank/channel paired with the requesting PIM core) are serviced at high bandwidth (often >50 GB/s per vault) and low latency (sub-100 ns), typically saturating only with highly concurrent streaming (Tian et al., 9 Oct 2025, Giannoula et al., 2022, Mutlu et al., 2020).
  • Remote Accesses: Accesses to non-local DRAM banks traverse the internal PIM interconnect, incurring additional network transfer latency proportional to the number of hops, and facing queuing delays and contention at various ingress/egress points—it is here that much of the "missed" PIM performance is lost (Tian et al., 9 Oct 2025).
  • DL-PIM's Locality Optimization: DL-PIM embodies a hardware data migration protocol: when a PIM core in vault V repeatedly accesses a remote data block B (hosted in vault W), DL-PIM migrates B into V's reserved local memory region and updates the local subscription table to redirect future requests. This can eliminate network hops for high-reuse data, which, according to benchmarks, reduces average memory latency by 54% for HMC and 50% for HBM, and realizes up to 2× speedups for data reuse–intensive workloads (Tian et al., 9 Oct 2025).
  • Address Indirection and Overheads: To maintain correctness and support dynamic migration, all memory accesses are subject to a fast, distributed address-indirection lookup. The trade-off is between the incremental indirection latency per access and the bulk savings in data-movement for hot blocks. The net benefit is a function of access reuse, frequency, and migration cost (Tian et al., 9 Oct 2025).
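
The trade-off just described admits a back-of-the-envelope model. The sketch below uses assumed cost terms (not DL-PIM's exact model, and charging the indirection tax only against the migrated block for simplicity): subscribing a remote block pays off once the hop savings over its remaining reuses exceed the one-time migration cost plus the per-access indirection overhead.

```python
def migration_pays_off(reuses: int, hops: int, hop_ns: float,
                       migrate_ns: float, indirect_ns: float) -> bool:
    """True when subscribing a remote block is a net win for that block."""
    saved = reuses * hops * hop_ns            # network latency avoided after migration
    cost = migrate_ns + reuses * indirect_ns  # one-time copy + per-access table lookups
    return saved > cost

# Under these assumed numbers, a block 4 hops away and reused 20 times is
# worth subscribing; a block reused only twice is not.
print(migration_pays_off(reuses=20, hops=4, hop_ns=25, migrate_ns=500, indirect_ns=5))  # True
print(migration_pays_off(reuses=2,  hops=4, hop_ns=25, migrate_ns=500, indirect_ns=5))  # False
```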

3. Adaptive Control and Hardware-Software Co-Design

Optimal utilization of near-bank PIM depends on hardware-software co-design that manages data placement, migration, and offload scheduling; a sketch of the adaptive feedback loop follows the list:

  • Adaptive Subscription Control: DL-PIM employs an epoch-based, feedback-driven control plane that measures actual (vs. hypothetical) network hops and average access latency to dynamically enable or suppress block subscription. 'Hops-based' and 'latency-based' adaptive policies are implemented via per-vault feedback registers, and set-wise policy sampling is used to avoid locally optimal but globally pessimistic subscription states (Tian et al., 9 Oct 2025).
  • Coordination Across Banks: Mechanisms are required to avoid 'subscription away,' where data movement only shifts contention hot spots instead of reducing queuing or network load. Negative feedback from source and destination vaults is aggregated to globally modulate subscription behavior (Tian et al., 9 Oct 2025).
  • Integration with Host Memory Stack: Modern commercial implementations (e.g., UPMEM, PIM-MMU) provide DRAM-to-PIM transfer acceleration, per-bank memory scheduling, and address mapping units at both hardware and OS levels to further reduce overheads of collective input/output broadcasts and data redistribution (Lee et al., 10 Sep 2024).
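
A minimal sketch of such a feedback loop follows; the epoch-over-epoch comparison and the 5% tolerance are assumptions for illustration, not DL-PIM's exact policy. Each vault closes an epoch by comparing measured average latency against the previous epoch and gates new subscriptions when migration is not paying off.

```python
class SubscriptionController:
    """Epoch-based gate: enable subscriptions only while they help latency."""

    def __init__(self, tolerance: float = 1.05):
        self.tolerance = tolerance        # allow 5% measurement noise
        self.prev_avg_latency_ns = None
        self.subscriptions_enabled = True

    def end_of_epoch(self, total_latency_ns: float, num_accesses: int) -> None:
        avg = total_latency_ns / max(num_accesses, 1)
        if self.prev_avg_latency_ns is not None:
            # Negative feedback: suppress new subscriptions if latency
            # regressed beyond tolerance; re-enable once it improves.
            self.subscriptions_enabled = avg <= self.prev_avg_latency_ns * self.tolerance
        self.prev_avg_latency_ns = avg

ctrl = SubscriptionController()
ctrl.end_of_epoch(6_000_000, 100_000)  # epoch 1: 60 ns average
ctrl.end_of_epoch(7_500_000, 100_000)  # epoch 2: 75 ns average -> regression
print(ctrl.subscriptions_enabled)      # False: subscriptions gated off
```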

4. Quantitative Performance and System Evaluation

Extensive empirical studies and simulation analyses demonstrate that near-bank PIM delivers substantial latency, throughput, and energy improvements for representative memory-bound workloads:

  • DL-PIM (HMC/HBM): Reduces average per-request memory latency by 54% (HMC) and 50% (HBM). Overall speedup is 15% (HMC) and 5% (HBM) for workloads with high data reuse, and 6% (HMC) and 3% (HBM) averaged across all representative workloads (Tian et al., 9 Oct 2025).
  • Bandwidth and Energy Scaling: The bank-parallel compute model exposes aggregate internal memory bandwidth of several hundred GB/s, with energy consumption for internal data movement orders of magnitude lower than for off-chip traffic (see the worked calculation at the end of this section). For example, UPMEM hardware achieves up to 75% aggregate memory bandwidth utilization and 2.5–3× higher energy efficiency than a CPU and 1.5× higher than a GPU on SpMV (Giannoula et al., 2022).
  • Offload-Dependent Scaling Limits: Scaling is ultimately constrained by bottlenecks in CPU–PIM transfers (broadcast/gather), bank-group partitioning, and data reuse patterns. Workload types with low data reuse or heavy cross-bank communication see limited benefit and may even suffer due to indirection or protocol overheads (Tian et al., 9 Oct 2025, Giannoula et al., 2022).
  • Representative numbers:

| Workload  | Latency Reduction | Speedup (HMC) | Speedup (HBM) |
|-----------|-------------------|---------------|---------------|
| CHABsBez  | 60%               | 1.25×         | 1.20×         |
| SPLRad    | 70%               | 2.05×         | 1.90×         |
| PHELinReg | 55%               | 1.15×         | 1.10×         |
| All (avg) | 54%               | 1.06×         | 1.03×         |

Characteristically, benchmarks with high spatial and temporal data locality respond best to near-bank optimizations, whereas pointer-chasing, random-access, or adversarially remote access patterns dilute the overall benefit (Tian et al., 9 Oct 2025).
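
To make the bandwidth-scaling argument concrete, the worked calculation below combines the per-partition local-bandwidth figure quoted in Section 2 with an HBM-style channel count; the off-chip baseline is a single DDR4-3200 channel. All numbers are illustrative.

```python
per_partition_gbs = 50   # per-vault/channel internal bandwidth (Section 2)
partitions = 8           # HBM-style stack with 8 channels
off_chip_gbs = 25.6      # one DDR4-3200 channel (3200 MT/s x 8 B), for contrast

aggregate = per_partition_gbs * partitions
print(f"aggregate internal: {aggregate} GB/s vs. off-chip: {off_chip_gbs} GB/s "
      f"({aggregate / off_chip_gbs:.1f}x)")  # 400 GB/s vs. 25.6 GB/s (15.6x)
```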

5. System Integration, Programming, and Coherence

Integrating near-bank PIM in a system-compliant, software-facing manner introduces complex design and runtime trade-offs; a sketch of a host-side offload flow follows the list:

  • Coherence and Consistency: Lightweight coherence is essential; approaches include speculative execution with batched verification (LazyPIM), batched invalidation, and region-based page tables for address translation (IMPICA) (Ghose et al., 2018). Non-intrusive protocols avoid the off-chip round-trip penalty that would nullify PIM gains.
  • Virtual Memory and Protection: Near-bank PIM logic must be extended to handle virtual-to-physical address translation in ways compatible with standard OS abstractions. Techniques such as region-based page tables and TLBs local to the logic layer add negligible steady-state overhead (Ghose et al., 2018).
  • Programming Model: Practical PIM deployment requires new instruction extensions (PIM-enabled intrinsics), memory allocators that control data bank affinity, and OS/runtime infrastructure for kernel offload, synchronization, and error reporting. Compiler-driven optimizations can automatically map kernels to near-bank hardware and minimize cross-bank dependencies (Oliveira et al., 2022, Mutlu et al., 2020).
  • Application Domains and Workload Fit: Memory-bound kernels such as SpMV, bitwise operations, database scans, graph traversal, and simple stencils are especially amenable to bank-parallel PIM offload. By contrast, operations with high intra-core data reuse or requiring cross-bank communication may be better suited to host-centric or hybrid execution models (Alsop et al., 2023).
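
The offload flow below illustrates the programming-model ingredients listed above in pure Python; the function is hypothetical (a real deployment would use vendor APIs such as the UPMEM SDK's C interface, and the "launch" step would dispatch a kernel binary rather than a host-side loop). The key idea shown is bank-affine partitioning: each bank's core touches only its local slice.

```python
def offload_vector_add(a: list, b: list, num_banks: int = 16) -> list:
    # 1. Bank-affine allocation: stripe operands so each bank's core
    #    only touches its local slice (no cross-bank traffic).
    chunks = [(a[i::num_banks], b[i::num_banks]) for i in range(num_banks)]
    # 2. "Launch": each bank computes on its local data. A real runtime
    #    would issue PIM-enabled instructions or load a kernel here.
    partial = [[x + y for x, y in zip(ca, cb)] for ca, cb in chunks]
    # 3. Gather: interleave per-bank results back into one host array.
    out = [0] * len(a)
    for i, p in enumerate(partial):
        out[i::num_banks] = p
    return out

print(offload_vector_add(list(range(8)), list(range(8)), num_banks=4))
# -> [0, 2, 4, 6, 8, 10, 12, 14]
```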

6. Implementation Variants, Trade-offs, and Research Directions

The field exhibits a diversity of near-bank PIM design variants; a workload-partitioning sketch follows the list:

  • Subscription-Driven Architectures: DL-PIM demonstrates dynamic data migration for improved spatial locality but highlights the importance of adaptive controls to avoid over-subscribing low-reuse data and incurring unnecessary indirection overhead (Tian et al., 9 Oct 2025).
  • Commercial Prototypes: Systems like UPMEM provide RISC-core-per-bank near-bank PIM on DDR4 modules, with hardware/software co-designed memory management units to mitigate host–PIM transfer bottlenecks (Lee et al., 10 Sep 2024).
  • Algorithm-Architecture Co-Design: Advances include application-driven partitioning (row/column/2D tiling), per-bank workload balancing, and explicit support for irregular access patterns, all of which are required to maximize local compute and minimize synchronization and communication volume (Giannoula et al., 2022, Kang et al., 2022).
  • Open Problems: Outstanding challenges span programming model standardization, coherence granularity, OS-level integration, thermal management (especially in 3D stacks), and secure processing. Research continues into fine-grained coherence, dynamic data mapping, bank-wise task scheduling, and workload-optimized hardware primitives (Tian et al., 9 Oct 2025, Mutlu et al., 2020, Oliveira et al., 2022).
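
As an example of the partitioning and load-balancing techniques above, the sketch below assigns sparse-matrix rows to banks by nonzero count, a common greedy heuristic (the cited works evaluate several such schemes; this is not any one paper's exact algorithm).

```python
def partition_rows_by_nnz(row_nnz: list, num_banks: int) -> list:
    """Greedily assign each row to the currently least-loaded bank."""
    loads = [0] * num_banks
    assignment = [[] for _ in range(num_banks)]
    # Longest-processing-time heuristic: place the heaviest rows first.
    for row in sorted(range(len(row_nnz)), key=lambda r: -row_nnz[r]):
        bank = loads.index(min(loads))
        assignment[bank].append(row)
        loads[bank] += row_nnz[row]
    return assignment

print(partition_rows_by_nnz([90, 3, 40, 41, 5, 88], num_banks=2))
# -> [[0, 2, 1], [5, 3, 4]]: roughly equal nonzeros per bank, so per-bank
#    SpMV time is balanced and no core idles at the next sync point.
```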

7. Comparative Perspective and Practical Significance

Near-bank PIM architectures represent a fundamental shift to memory-centric, data-proximal computing. When hardware and system support are suitably co-designed, speedups of 2–14× and energy reductions of 40–90% over CPU-centric architectures are routinely observed for suitable workloads (Tian et al., 9 Oct 2025, Mutlu et al., 2020). However, the effectiveness of near-bank PIM is contingent upon (a) application-specific data locality and reuse, (b) architectural support for minimizing remote accesses, and (c) robust, adaptive migration and offload policies. Ongoing research aims to unify these mechanisms and establish near-bank PIM as a standard component of heterogeneous, high-density, data-centric systems (Tian et al., 9 Oct 2025, Oliveira et al., 2022, Ghose et al., 2018).
