Near-Bank PIM Architectures

Updated 5 February 2026
  • Near-bank PIM architectures are memory-centric designs that tightly couple lightweight compute engines to memory bank peripheries for efficient, parallel, in-place data processing.
  • They integrate digital ALUs, filtering units, and neural primitives at DRAM/SRAM bank levels, reducing data movement and enhancing throughput with finely tuned bank-level operations.
  • Reported results include up to 40% lower latency and 85% energy savings, at the cost of added area and static power.

Near-bank Processing-in-Memory (PIM) architectures are a class of memory-centric designs that tightly couple compute logic to the periphery of memory banks, enabling the direct manipulation of data in-place with minimal data movement. These architectures exploit the abundant internal bandwidth and potential for massive parallelism inherent to modern DRAM and SRAM devices by embedding lightweight computation engines—such as digital ALUs, filtering units, or neural primitives—proximate to each memory bank or macro. The near-bank placement is critical: in contrast to in-situ (cell-matrix) analog PIM and far-memory approaches (hosted in memory controllers), near-bank PIM can leverage both digital design flexibility and the fine-grain access/bandwidth properties of bank-local operations.

1. Architectural Principles and Bank-Level Organization

Modern near-bank PIM systems are constructed by enhancing the hierarchy of commodity memory devices (e.g., DRAM, SRAM, BRAM in FPGAs) to embed compute resources directly at the bank level. The canonical organization is:

  • DRAM: Channel → Rank → Chip → Bank → Subarray → Row/Column. Shared-PIM, for example, augments each bank to include both conventional compute rows and a small set of "shared rows" per subarray, each attachable to a segmented bank-level bus (Mamdouh et al., 2024).
  • SRAM (e.g., in FPGAs or custom ASICs): Each macro or bank is equipped with local near-bank logic (e.g., digital ALUs, bit-serial compute elements), sometimes partitioned further by bit-slice to maximize concurrency (Kabir et al., 2023, Duan et al., 25 May 2025).
  • Peripheral logic and buses: Dedicated global bit-lines or busses (e.g., Shared-PIM's BK-bus) and associated sense amplifiers provide low-capacitance, high-bandwidth paths for data movement among subarrays or between memory and compute (Mamdouh et al., 2024).
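The hierarchy listed above determines which data a given near-bank engine can reach without leaving its bank. As a minimal sketch, the Python below splits a flat address into hierarchy coordinates and checks whether two addresses fall in the same bank. The field widths and their ordering in `FIELD_BITS` are hypothetical, chosen only for illustration; real controllers use device-specific and often interleaved mappings. The chip level is left out on the assumption that chips in a rank are accessed in lockstep.

```python
# Hypothetical field widths (in bits), listed from least to most significant.
# These are illustrative assumptions, not the mapping of any real device.
FIELD_BITS = [
    ("column", 10),
    ("row", 9),
    ("subarray", 6),
    ("bank", 4),
    ("rank", 1),
    ("channel", 1),
]


def decode_address(addr: int) -> dict:
    """Split a flat address into hierarchy coordinates, LSB-first."""
    coords = {}
    for name, bits in FIELD_BITS:
        coords[name] = addr & ((1 << bits) - 1)  # keep the low `bits` bits
        addr >>= bits                            # move on to the next field
    return coords


def same_bank(a: int, b: int) -> bool:
    """True when both addresses sit behind the same bank-level engine."""
    ca, cb = decode_address(a), decode_address(b)
    return all(ca[k] == cb[k] for k in ("channel", "rank", "bank"))
```

Under this assumed layout, operands that satisfy `same_bank` can be combined by bank-local logic, while operands in different banks need an inter-bank or host-mediated transfer first.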

Table: Representative Near-Bank Additions in Notable Designs

Architecture      | Near-Bank Compute/Logic              | Data Movement Support
------------------|--------------------------------------|-------------------------------------
Shared-PIM        | LUT-based PIM logic + shared rows    | 4-segment BK-bus + sense amplifiers
PiCaSO (FPGA-PIM) | Bit-serial ALUs at every BRAM bank   | RFs + reduction/fold network
PIMfused          | PIMcores per DRAM bank               | LBUF/GBUF buffers, in-DRAM bus
DB-PIM (SRAM)     | AND/Adder in 6T cells, per macro     | Local IPU, CSD adder tree

All such designs retain the property that PIM execution is orchestrated by logic tightly coupled to the bank's internal data path, rather than requiring full off-chip transfer.

2. Microarchitectural Innovations and Resource Arbitration

Achieving high throughput and overlap between computation and data movement in near-bank PIM requires substantial microarchitectural modification to standard memory devices:

  • Shared-PIM employs augmented memory cells featuring an extra gate transistor (GWL) to connect shared rows to a bank-level bus, and adds dedicated BK-SAs (bank-level sense amps) for isolation during data movement (Mamdouh et al., 2024).
  • Command and controller enhancements enable new operational primitives, e.g., ACTIVATE_C for compute, ACTIVATE_S for transfer, PRECHARGE_B for bus reset. Arbitration between concurrent compute/transfer is handled by dual-queue controllers and round-robin scheduling on shared resources.
  • In SRAM-based DB-PIM, low-level logic (AND gates, local adders) is tightly grafted into the periphery of each 6T cell group. An input pre-processing unit dynamically skips all-zero input columns, while a CSD adder tree accumulates computations in a highly parallel, carry-free fashion (Duan et al., 25 May 2025).
  • In FPGA overlays such as PiCaSO, a bit-serial pipeline is coupled with networked reduction logic at the BRAM bank input/output, allowing deep pipelining and on-the-fly fold accumulation without the need to modify underlying BRAM structures (Kabir et al., 2023).
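To make the bit-serial style of computation concrete, here is a generic sketch of bit-serial addition over many lanes. It is not PiCaSO's actual datapath; the function names and the bit-plane layout are assumptions for illustration. Operands are stored transposed, so each step reads one bit position for all lanes and updates a per-lane carry.

```python
def to_bit_planes(values, width):
    """Transpose integers into bit-planes: planes[i][lane] is bit i of values[lane]."""
    return [[(v >> i) & 1 for v in values] for i in range(width)]


def from_bit_planes(planes):
    """Inverse of to_bit_planes: rebuild one integer per lane."""
    lanes = len(planes[0])
    return [
        sum(planes[i][lane] << i for i in range(len(planes)))
        for lane in range(lanes)
    ]


def bit_serial_add(a_planes, b_planes):
    """Add two bit-plane operands, one bit position per step, all lanes in lockstep."""
    lanes = len(a_planes[0])
    carry = [0] * lanes  # one carry register per lane
    out = []
    for a_bits, b_bits in zip(a_planes, b_planes):
        # Full-adder sum and carry, evaluated for every lane at this bit position.
        out.append([a ^ b ^ c for a, b, c in zip(a_bits, b_bits, carry)])
        carry = [(a & b) | (c & (a ^ b)) for a, b, c in zip(a_bits, b_bits, carry)]
    out.append(carry)  # final carry-out becomes the top bit-plane
    return out
```

The number of steps grows with operand width rather than with the number of lanes, which is why this style pairs well with wide, bank-parallel memory rows.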

Arbitration policies are essential to maintain concurrent progress of compute and data transfers without stalling either resource pool, achieved through decoupled resource queues and bank-level pipelining.
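The dual-queue, round-robin idea can be sketched in a few lines of Python. This is a simplified illustration rather than the controller logic of any cited design: the `arbitrate` function is hypothetical, it models a single shared resource, and it ignores timing constraints between commands.

```python
from collections import deque


def arbitrate(compute_cmds, transfer_cmds):
    """Interleave two command queues with round-robin priority.

    Each slot grants the shared resource to one queue. Priority flips after
    every grant, so neither queue can starve the other; if the preferred
    queue is empty, the other queue is served instead.
    """
    queues = [deque(compute_cmds), deque(transfer_cmds)]
    turn = 0  # 0: compute queue has priority, 1: transfer queue has priority
    schedule = []
    while queues[0] or queues[1]:
        pick = turn if queues[turn] else 1 - turn
        schedule.append(queues[pick].popleft())
        turn = 1 - pick  # the queue that was not served goes first next time
    return schedule
```

For example, `arbitrate(["ACTIVATE_C#0", "ACTIVATE_C#1"], ["ACTIVATE_S#0"])` alternates between the two command classes until the transfer queue drains, then issues the remaining compute command.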

3. Mathematical Models: Latency, Energy, and Area

Closed-form expressions characterize the operational improvements over baseline and prior-art architectures. For Shared-PIM (Mamdouh et al., 2024):

  • Compute Latency:
    • Bitwise: $\mathrm{Lat}_{\mathrm{compute}} = t_{RCD} + t_{OP} + t_{RP}$
    • 32–128b op: up to 1.4× faster than pLUTo+LISA
  • Transfer Latency (intra-bank, 8 KB row):

$$\mathrm{Lat}_{\mathrm{transfer}} = 2 t_{ACT\_S} + t_{SENSE\_B} + t_{PRE\_B} - \Delta_{\mathrm{overlap}} \approx 52.75\,\mathrm{ns}$$

  • Energy per Data-Move:

$$E_{transfer} = 2 \cdot (C_{bus} V_{dd}^2 + E_{BK\text{-}SA}) \times \frac{\text{Cols}}{N_{segments}}$$

    • Results: $0.14\,\mu\mathrm{J}$ (Shared-PIM) vs $0.17\,\mu\mathrm{J}$ (LISA) vs $\sim 6\,\mu\mathrm{J}$ (memcpy)
  • Area Overhead:

$$\frac{\Delta A}{A_{\text{base}}} \approx 7.16\%$$

Analogous metrics for SRAM- and BRAM-based near-bank PIMs quantify speedups, utilization efficiency, and area cost of tightly integrating ALUs into memory peripheries (Duan et al., 25 May 2025, Kabir et al., 2023).
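The Shared-PIM expressions above translate directly into small helper functions. The sketch below mirrors the formulas term by term and then derives the energy ratios implied by the reported figures. The function names are hypothetical and no timing or capacitance values are assumed here, since the individual parameters are not given in this section.

```python
def compute_latency(t_rcd, t_op, t_rp):
    """Bitwise compute latency: activate, operate, precharge."""
    return t_rcd + t_op + t_rp


def transfer_latency(t_act_s, t_sense_b, t_pre_b, overlap):
    """Intra-bank row transfer latency, net of overlapped time."""
    return 2 * t_act_s + t_sense_b + t_pre_b - overlap


def transfer_energy(c_bus, v_dd, e_bk_sa, cols, n_segments):
    """Energy per data move, following the closed form above."""
    return 2 * (c_bus * v_dd ** 2 + e_bk_sa) * cols / n_segments


# Reported per-move energies in microjoules (the memcpy figure is approximate).
REPORTED_UJ = {"Shared-PIM": 0.14, "LISA": 0.17, "memcpy": 6.0}


def energy_ratio_vs_shared_pim(name):
    """How many times more energy `name` uses than Shared-PIM per move."""
    return REPORTED_UJ[name] / REPORTED_UJ["Shared-PIM"]
```

With the reported figures, LISA comes out at roughly 1.2× and memcpy at roughly 43× the per-move energy of Shared-PIM.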

4. Quantitative End-to-End Application Impact

Empirical results demonstrate the implications of architectural advances for key computational kernels:

  • Shared-PIM + pLUTo:
    • Matrix multiplication (200×200): 40% latency reduction, 18% lower data-movement energy
    • Graph BFS/DFS: 29% faster, 17% lower data-movement energy
    • Addition/multiplication: up to 1.40× faster (128b ops)
  • PIMfused:
  • PiCaSO Overlay (FPGA):
    • 80% of custom PIM throughput, 2.56× lower latency, 25–43% higher BRAM efficiency (Kabir et al., 2023).
  • DB-PIM:

These gains arise from the capacity to overlap computation with data movement, exploit massive bank-level parallelism, and minimize internal and external data traffic.
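The benefit of overlapping computation with data movement can be estimated with a simple two-stage pipeline model. This is a generic back-of-the-envelope sketch, not a model taken from the cited papers; it assumes fixed per-tile compute and move times and that tile i+1 can be moved while tile i is being processed.

```python
def serial_time(n_tiles, t_compute, t_move):
    """No overlap: every tile is moved, then processed, one after another."""
    return n_tiles * (t_compute + t_move)


def overlapped_time(n_tiles, t_compute, t_move):
    """Two-stage pipeline: the next tile is moved while the current one is processed.

    The first move and the last compute cannot be hidden; in between, each
    step costs as much as the slower of the two stages.
    """
    if n_tiles == 0:
        return 0
    return t_move + t_compute + (n_tiles - 1) * max(t_compute, t_move)
```

For 10 tiles with 3 time units of compute and 2 of movement per tile, the serial schedule takes 50 units and the overlapped one 32. The gain is largest when the two stages are balanced and shrinks as one stage dominates.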

5. Design Trade-offs, Bottlenecks, and Scalability

While aggressive near-bank integration yields systematic throughput and efficiency benefits, several trade-offs and bottlenecks are inherent:

  • Area Overhead:
    • E.g., Shared-PIM's BK-bus, BK-SAs, and GWL driver logic constitute a 7.16% area increase over the pLUTo baseline (Mamdouh et al., 2024).
  • Resource Allocation:
    • Shared rows diminish storage density; fixed bus segmentation trades off transfer speed for SA area overhead.
  • Static Power:
    • Additional sense amps and always-precharged bus lines elevate static energy use, although throughput gains typically amortize this penalty.
  • Scalability and Hotspot Mitigation:
    • Bus segment conflicts, cross-bank bandwidth, and crosstalk (mitigated by bitline twist/coding) represent critical scalability factors.
  • Applicability Limitations:
    • Tasks must be mapped to bank-local computation/resources; random-data access and non-bulk/non-parallel kernels benefit less, and data-dependent (e.g., irregular aggregation) workloads may suffer diminished speedup.
  • Software/ISA Coordination:

Future extensions such as dynamic shared row configuration, cross-bank coupling, and in-place pipelining for more complex operations (e.g., histograms, convolutions) are active areas of research.

Near-bank PIM principles are realized in diverse substrates:

  • Emerging DRAM/3D-Stacked Devices:
    • Commercial examples (Samsung HBM-PIM, SK Hynix GDDR-PIM) integrate SIMD ALUs per bank, with memory controller–issued PIM commands and up to 1.2 TB/s internal bandwidth (Alsop et al., 2023).
  • FPGA BRAM Overlays and Custom Silicon:
    • Bit-serial overlays (PiCaSO) utilize standard BRAM blocks, reduction (fold) networks, and host command buffers for PIM instruction orchestration; custom extensions (CoMeFa, CCB) offer higher density/throughput at the expense of more exotic hardware (Kabir et al., 2023).
  • Security and Confidential Computing:
    • PIM-Enclave demonstrates bank-local execution environments with native AES-GCM protection and per-bank attestation, achieving 2.89× CPU speedup at negligible overhead, and eliminating off-chip side channels (Duy et al., 2021).
  • Hybrid CPU+PIM Database Analytics:
    • Membrane offers bank-level comparators for in-situ filtering, paired with cooperative software for denormalization and selective offload; this achieves ~6× OLAP speedup with minimal DRAM area investment (Shekar et al., 8 Apr 2025).

Industrial and research implementations consistently show that near-bank localization yields dramatic reductions in data-movement overhead, provided the software layer is capable of partitioning tasks and memory allocations to match the bank-local compute and bandwidth properties.
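As an illustration of why bank-local filtering reduces data movement, the sketch below models a per-bank comparator pass that produces one match bit per row, after which the host fetches only the matching rows. It is a conceptual model and not Membrane's actual interface; `bank_filter`, `filtered_scan`, and the data layout are hypothetical.

```python
def bank_filter(bank_rows, predicate):
    """Per-bank comparator pass: one match bit per row, data stays in place."""
    return [1 if predicate(value) else 0 for value in bank_rows]


def filtered_scan(banks, predicate):
    """Host side: collect only the rows whose match bit is set.

    The loop over banks is sequential here; in hardware each bank would
    evaluate its own rows in parallel.
    """
    selected = []
    rows_moved = 0
    for rows in banks:
        mask = bank_filter(rows, predicate)
        for value, match in zip(rows, mask):
            if match:
                selected.append(value)
                rows_moved += 1  # only matching rows cross the memory interface
    return selected, rows_moved
```

In this model the traffic to the host scales with the selectivity of the predicate rather than with the table size, which is the property that selective offload relies on.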

7. Outlook and Ongoing Challenges

Major ongoing research thrusts include:

  • Dynamic Reconfiguration of Bank Resources:
    • Adaptive tuning of shared-row provision, bus segmentation, and deployment of additional compute resources per bank/subarray.
  • Cross-Bank/Global Interconnects:
    • Coupling of bank-level buses (e.g., across banks or through TSVs in 3D-stacked DRAM) to facilitate more flexible intra/inter-bank communication.
  • Complex Kernel Support:
    • Expanding near-bank PIM beyond primitive kernels (e.g., addition, MAC) to in-place reductions, filtering, histogramming, and even non-linear operations without resorting to off-bank data transfer.
  • System Coherence and Programming Model:

The balance of area, throughput, programmability, and energy efficiency continues to define the scope of near-bank PIM research and its translation into commercial systems. Leading-edge implementations such as Shared-PIM report a 5× reduction in data-movement latency and a 1.2× reduction in energy relative to prior intra-DRAM transfer architectures, along with workload-level speedups of 29–44% and modest silicon overheads, illustrating the efficacy of rethinking the memory bank periphery as an active compute substrate (Mamdouh et al., 2024).
