AME-PIM: Can Memory be Your Next Tensor Accelerator?

Published 30 Apr 2026 in cs.AR | (2604.27808v1)

Abstract: High Bandwidth Memory with Processing-in-Memory (HBM-PIM) offers an opportunity to reduce data movement by executing computation directly inside memory, but current commercial platforms expose limited instruction sets and require specialized software stacks. In this work, we investigate whether HBM-PIM can serve as a backend for ISA-level matrix acceleration, using the RISC-V Attached Matrix Extension (AME) as a semantic reference. We propose a PEP-based execution model that maps AME element-wise and matrix instructions to HBM-PIM micro-kernels and data instructions in memory operations. Differently from SoA HBM-PIM, we introduce a reduction-free outer-product dataflow that enables accumulation entirely within memory despite the lack of native reduction support. Our approach supports end-to-end execution of element-wise operations, GEMV, and GEMM in PIM mode, minimizing host involvement and off-chip transfers. An experimental evaluation on Samsung Aquabolt-XL shows that AME matrix tile multiplication achieves up to 14.9 GFLOP/s (59.4 FLOP/cycle) on a single HBM pseudo-channel.

Summary

  • The paper evaluates commercial HBM-PIM as a backend for ISA-level matrix acceleration, executing complete matrix operations in memory and reaching up to 14.9 GFLOP/s on a single pseudo-channel.
  • The methodology maps RISC-V AME instructions to HBM-PIM micro-kernels, enabling reduction-free execution for GEMM, GEMV, and element-wise operations.
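  • The study identifies PIM ISA limitations (no native reductions, restricted operand routing) and proposes architectural enhancements, such as memory-to-all-bank broadcasting and instruction fusion, to raise computational efficiency for data-intensive AI and HPC workloads.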

Summary of AME-PIM: Can Memory be Your Next Tensor Accelerator? (2604.27808)

Motivation and Context

The paper addresses the fundamental inefficiencies of modern computational architectures, particularly the von Neumann bottleneck, which arises when memory bandwidth and energy cost associated with moving data across the memory hierarchy become the limiting factors for performance. This issue is acute for workloads in machine learning, scientific computing, and large-scale analytics, especially as model sizes vastly exceed the capacities of on-chip caches and accelerators. High Bandwidth Memory (HBM) mitigates some of these constraints by providing substantial parallelism and higher bandwidth, but even HBM is insufficient for emerging generative AI and foundation model workloads.

Processing-in-Memory (PIM), specifically HBM-PIM, is introduced as a complementary solution that not only increases memory bandwidth but also reduces off-chip data movement by allowing computation to occur directly where data resides. Commercial platforms such as Samsung Aquabolt-XL demonstrate practical PIM implementations. However, the limited instruction sets, lack of native reduction operations, and need for device-specific APIs hinder the broad adoption and effective integration of HBM-PIM into standard programming models and CPU ISAs.

Architectural and ISA Mapping

This work investigates whether HBM-PIM can act as a backend for ISA-level matrix acceleration, using the RISC-V Attached Matrix Extension (AME) as the target semantic abstraction. The authors analyze the architectural constraints of Samsung Aquabolt-XL and devise a mapping strategy from AME instructions to HBM-PIM micro-kernels.

Element-wise and matrix instructions in the AME specification, including addition, multiplication, subtraction (emulated as multiplication by -1 followed by addition), and multiply-and-accumulate (MAC), can be implemented with native PIM instructions. Two groups of operations remain unsupported: minimum and maximum, because the PIM ISA has no conditional or comparison instructions, and widening variants, because the datapath is restricted to FP16. The mapping uses an outer-product dataflow that works around the absence of native reduction support by accumulating results directly in the memory banks.
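The subtraction emulation can be illustrated with a minimal sketch. It models FP16 element-wise tiles in plain Python; the helper names (`fp16`, `ew_mul_scalar`, `ew_add`, `ew_sub`) are hypothetical and are neither PIM instructions nor AME mnemonics. Because negation is exact in IEEE 754 arithmetic, the emulated result should match a direct FP16 subtraction.

```python
import struct


def fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def ew_mul_scalar(tile, scalar):
    """Element-wise multiply of a tile by a scalar, rounded to FP16."""
    return [[fp16(v * scalar) for v in row] for row in tile]


def ew_add(a, b):
    """Element-wise add of two equally shaped tiles, rounded to FP16."""
    return [[fp16(x + y) for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def ew_sub(a, b):
    """a - b emulated as a + (b * -1), using only multiply and add."""
    return ew_add(a, ew_mul_scalar(b, -1.0))


a = [[1.5, 2.0], [0.25, -4.0]]
b = [[0.5, 2.0], [1.0, 1.0]]
result = ew_sub(a, b)
print(result)
```

The emulation costs one extra multiply per subtraction, which is the price of staying within an instruction set that has no native subtract.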

AME matrix tile and accumulation registers are abstracted as memory-resident constructs in the HBM-PIM DRAM dies. Data movement instructions are realized via pointer updates in a hardware table, eliminating unnecessary data motion and aligning memory access patterns with the SIMD requirements of the PIM units.
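A rough sketch of the pointer-update idea follows. It assumes the table simply maps register names to memory locations; the class names, the `bank`/`row` fields, and the register labels are hypothetical, since the summary does not say where the table lives or how it is encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileLocation:
    """Hypothetical descriptor of where a tile lives in HBM-PIM memory."""
    bank: int
    row: int


class TileRegisterTable:
    """Maps AME-style register names (e.g. 'tr0', 'acc0') to memory locations."""

    def __init__(self):
        # register name -> TileLocation
        self._table = {}

    def bind(self, reg, loc):
        """Associate a register name with a memory-resident tile."""
        self._table[reg] = loc

    def move(self, dst, src):
        """Register-to-register move: only the pointer changes, no data is copied."""
        self._table[dst] = self._table[src]

    def lookup(self, reg):
        return self._table[reg]


table = TileRegisterTable()
table.bind("tr0", TileLocation(bank=0, row=16))  # tiles sit in even banks
table.move("tr1", "tr0")                         # no DRAM traffic in this model
print(table.lookup("tr1"))
```

In this model both registers alias the same memory after a move, so a later write through one name is visible through the other; a real design would need a policy for that case.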

Computational Model and Execution

The paper formalizes PIM Execution Primitives (PEPs): concise micro-kernels of native instructions orchestrated within the HBM-PIM environment. The AME programming model is realized through iterative invocation of these PEPs, exploiting parallelism across pseudo-channels and memory banks. Tiles are mapped to even banks, and accumulations to odd banks, with column-major ordering to optimize parallel execution.

The key advance is the reduction-free matrix multiplication within memory. The MAC-PEP enables complete GEMM, GEMV, and element-wise operations to proceed entirely within HBM-PIM without host-side reduction, fundamentally minimizing host involvement and off-chip transfers.
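The difference between the two dataflows can be sketched in a few lines of plain Python. This is an illustrative model only, not the MAC-PEP itself: the function names are hypothetical, lists stand in for memory banks, and FP16 rounding is omitted. The inner-product version needs a sum across `k` for every output element (a reduction), while the outer-product version only ever performs independent multiply-accumulates into the accumulator.

```python
def gemm_inner_product(a, b):
    """Reference C = A @ B; each output element requires a reduction over k."""
    k_dim, n = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(k_dim)) for j in range(n)]
            for row in a]


def gemm_outer_product(a, b):
    """C = A @ B as a series of rank-1 updates accumulated in place.

    For each (k, j), column k of A is scaled by the scalar b[k][j] and added
    into column j of the accumulator. Every row i is an independent lane, so
    no cross-lane sum is ever needed.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    acc = [[0.0] * n for _ in range(m)]
    for k in range(k_dim):
        for j in range(n):
            scalar = b[k][j]
            for i in range(m):  # lanes run in parallel on real hardware
                acc[i][j] += a[i][k] * scalar
    return acc


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 x 2
b = [[7.0, 8.0], [9.0, 10.0]]             # 2 x 2
c_outer = gemm_outer_product(a, b)
c_inner = gemm_inner_product(a, b)
print(c_outer)
```

Both functions compute the same product; the outer-product form trades the reduction for repeated read-modify-write traffic on the accumulator.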

Performance Evaluation

Experimental results indicate that matrix tile multiplication using AME instructions on Aquabolt-XL achieves up to 14.9 GFLOP/s (59.4 FLOP/cycle) per pseudo-channel for 128 × 4096 tiles. Setup overhead is negligible (less than 1% for maximum tile sizes). The observed throughput is close to the theoretical limits imposed by the PIM architecture, with computational efficiency saturating for larger tile sizes.
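The two reported figures can be cross-checked with simple arithmetic. The sketch below derives the clock implied by 14.9 GFLOP/s and 59.4 FLOP/cycle, and compares the measured rate with the 128 FLOP/cycle theoretical figure quoted in the knowledge-gaps list. The halved bound of 64 FLOP/cycle is an assumption made here, based on reading the 1:1 compute-to-data-movement instruction ratio as leaving half of the issue slots for compute; the summary does not state that bound.

```python
measured_flops_per_s = 14.9e9       # reported peak, one pseudo-channel
measured_flop_per_cycle = 59.4      # reported alongside it
theoretical_flop_per_cycle = 128.0  # figure quoted in the knowledge-gaps list

# Clock frequency implied by the two reported numbers.
implied_clock_hz = measured_flops_per_s / measured_flop_per_cycle

# Share of the quoted theoretical peak that is reached.
fraction_of_theoretical = measured_flop_per_cycle / theoretical_flop_per_cycle

# ASSUMPTION: one data-movement instruction per compute instruction halves
# the achievable rate. This bound is an interpretation, not a reported value.
assumed_bound = theoretical_flop_per_cycle / 2
fraction_of_assumed_bound = measured_flop_per_cycle / assumed_bound

print(f"implied clock: {implied_clock_hz / 1e6:.1f} MHz")
print(f"fraction of 128 FLOP/cycle: {fraction_of_theoretical:.1%}")
print(f"fraction of assumed 64 FLOP/cycle bound: {fraction_of_assumed_bound:.1%}")
```

The implied clock of roughly 251 MHz is consistent with the 250 MHz operating point mentioned elsewhere on this page. Under the stated assumption, the measured rate is about 93% of the bound while being about 46% of the unconstrained theoretical figure.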

These results are directly compared to prior works such as MPC-Wrapper [4], which rely on external reduction engines, thus incurring significant intermediate data movement and only achieving 58.1 FLOP/cycle. The AME-PIM approach delivers in-memory accumulation for GEMM/GEMV and element-wise operations, representing a practical solution for large, tiled matrix workloads.

The main bottlenecks remaining are attributable to intrinsic limitations in the PIM ISA: specifically, the lack of cross-lane reduction instructions and operand routing restrictions. The authors propose architectural enhancements such as memory-to-all-bank broadcasting and instruction fusions to increase computational efficiency further.

Implications and Future Directions

The work establishes that HBM-PIM, when correctly abstracted and orchestrated, can serve as a viable substrate for ISA-level matrix extensions, enabling portable and generalized exploitation of PIM features. Mapping AME onto PIM keeps matrix-heavy workloads inside memory, which reduces host involvement and off-chip data movement. Energy and power were not measured on Aquabolt-XL, so the expected power savings remain to be quantified.

Practically, this paves the way for improved integration of PIM into standard compiler and software stacks, enhancing portability and reducing engineering overhead. Theoretically, the research informs future HBM-PIM architectures—specifically, the addition of reduction operations, improved operand routing, enhanced broadcasting, and potentially expanded data-type support.

The authors highlight that future work will need to examine the cost and scalability of more complex data movement instructions and multi-channel coordination, as well as full exploitation of platform capabilities for an end-to-end AME implementation.

Conclusion

The paper delivers a rigorous analysis and practical blueprint for leveraging HBM-PIM as a matrix accelerator backend via the RISC-V AME abstraction. By enabling complete matrix and element-wise operations within memory, the approach reaches throughput close to the limit that the current PIM ISA allows (59.4 FLOP/cycle on one pseudo-channel) while highlighting the impact of ISA and microarchitectural constraints. The research clarifies the necessary advancements in HBM-PIM architectures and programming models to ensure scalable, efficient, and portable matrix acceleration for data-intensive AI and HPC workloads.

Explain it Like I'm 14

Overview: What this paper is about

This paper asks a simple question: can we make computer memory “do the math” for us, instead of hauling huge amounts of data back and forth to a CPU or GPU? The authors look at a special kind of memory called HBM-PIM (High Bandwidth Memory with Processing-in-Memory), which has tiny calculators built inside the memory chips. They show a way to run common matrix operations (the kind used in AI and science) directly inside memory, and they test it on a real Samsung HBM-PIM chip.

The main goal, in plain terms

The researchers wanted to find out:

  • If HBM-PIM can act like a “matrix math engine” that a regular processor could use through standard instructions.
  • How to run matrix multiplies and element-wise math entirely inside memory, even though current HBM-PIM chips don’t support some features (like “reductions,” which are steps where lots of partial numbers get added together).
  • Whether this approach can cut data movement and still be fast on real hardware.

They use the RISC-V AME (Attached Matrix Extension) as a guideline for “what matrix instructions should look like,” and then try to make HBM-PIM behave like a backend that executes those instructions.

How they did it (with everyday analogies)

Think of a kitchen:

  • Traditionally, cooking (computation) happens on the stove (CPU/GPU), and ingredients (data) are fetched from the pantry (memory). Running back and forth wastes time and energy.
  • Processing-in-Memory (PIM) is like putting small cooktops inside the pantry shelves. You bring less stuff out; you do more work right where the ingredients are.

Here’s their approach, step by step:

  • They looked at the AME “recipe book” (the set of matrix instructions) and matched those recipes to what the tiny cooktops inside HBM-PIM can actually do.
  • They built tiny programs called PIM Execution Primitives (PEPs). Each PEP is a short, reusable sequence of in-memory steps that performs a specific operation, like add, multiply, subtract (by multiplying by −1), and multiply-and-accumulate (MAC).
  • They mapped matrix “tiles” (small blocks of a big matrix) into specific places in memory, so each tiny in-memory calculator always knows exactly which data to handle.
  • The clever part: for matrix multiplication, they used an “outer-product” strategy. Instead of the usual way (which would need a “reduction” step to add many partial sums together—a feature the chip doesn’t have), they multiply columns by scalars and add results directly in memory as they go. This avoids the missing reduction feature.
  • They ran and measured these PIM programs on a real Samsung Aquabolt-XL HBM-PIM device controlled by an FPGA board.

Key terms made simple:

  • Matrix multiply (GEMM) and matrix–vector multiply (GEMV): ways of combining tables of numbers to get new tables or columns—core operations in AI and scientific computing.
  • Element-wise operations: doing the same math (like add or multiply) to each matching pair of numbers in two matrices.
  • Reduction: adding lots of partial results into a final result. Like totaling scores from many columns.
  • Outer product: a way to build a big matrix by taking one column and one row at a time and adding their contributions directly where they belong—no separate “sum later” step needed.

What they found and why it matters

Main results:

  • They showed that many AME-style matrix instructions can run fully inside memory on today’s HBM-PIM hardware:
    • Element-wise add, multiply, and subtract (they emulate subtract using “multiply by −1 + add”).
    • Matrix–vector multiply (GEMV).
    • Matrix–matrix multiply (GEMM), using their outer-product trick to avoid reductions.
  • On a real Samsung Aquabolt-XL HBM-PIM chip, their in-memory matrix tile multiplication reached up to 14.9 GFLOP/s and 59.4 FLOP/cycle on a single “pseudo-channel” (think of it as one slice of the memory chip).
  • They identified the main limits that hold performance back:
    • No native “reduction” instructions in the chip.
    • Limited ways to route or broadcast operands to all compute lanes.
    • Only FP16 (half-precision) is supported, so widening to larger types isn’t available, and the chip has no comparison instructions, so min/max can’t be done either.
  • Even with those limits, doing more work inside memory cut data movement and kept the CPU out of the loop, which is exactly what PIM is supposed to do.

Why it matters:

  • Moving data is often the biggest cost in modern AI and data-heavy workloads. If memory can do more math where the data lives, systems can be faster and use less energy.
  • Their approach makes PIM look more like a standard “matrix engine” that compilers and programming tools could target, instead of a one-off device that needs hand-written code.
  • The results suggest that small upgrades to future PIM chips (like adding reduction or better broadcasting) could boost performance a lot.

What this could change next

  • For hardware designers: Adding native reductions and better broadcasting inside PIM could make in-memory matrix math even faster and simpler, closing much of the gap to the theoretical maximum.
  • For software and compilers: If PIM can be used through a standard matrix instruction set (like AME), developers can write normal code and let compilers place the work in memory automatically.
  • For AI and big-data apps: Running GEMM, GEMV, and element-wise ops inside memory is a step toward handling massive models and datasets more efficiently.

A short “cheat sheet” of ideas in this paper

  • The problem: Moving data between memory and processors costs time and energy.
  • The device: HBM-PIM puts simple compute units inside High Bandwidth Memory.
  • The idea: Treat HBM-PIM like a backend for a standard matrix instruction set (AME).
  • The trick: Use an outer-product dataflow to avoid unsupported “reductions.”
  • The result: Element-wise ops, GEMV, and GEMM can run fully in memory, hitting up to 14.9 GFLOP/s on one memory slice, despite hardware limits.
  • The impact: Less data movement, simpler programming models, and clear guidance for future PIM features.

Knowledge Gaps

Unresolved Gaps, Limitations, and Open Questions

Below is a single, concrete list of the main knowledge gaps and limitations left open by the paper, phrased to guide future work:

  • Architectural coverage gap: current HBM-PIM lacks reductions and comparisons (min/max), preventing full AME instruction-set coverage; what minimal in-memory ISA additions (e.g., reduction, compare, conditional moves) would unlock complete AME support with acceptable area/power?
  • Data type limitation: the FP16-only PIM datapath precludes AME widening variants and mixed-precision accumulations (e.g., FP16→FP32, int8 dot-products); what hardware or software techniques (e.g., chunked accumulation, Kahan summation, emulation) are feasible with bounded numerical error?
  • Numerical accuracy unanswered: outer-product accumulation order differs from inner-product GEMM; how does FP16 in-memory accumulation affect accuracy vs. standard GEMM across ML and HPC kernels?
  • Fixed-row tiling constraint: mapping enforces 128 rows per tile (tied to PIM units and SIMD width); how to efficiently handle M dimensions <128 or not divisible by 128 without wasting lanes or incurring high tail-handling overhead?
  • Loop control constraint: AB-PIM JUMP limit (255 iterations, no nesting) caps K per PEP and complicates larger tiles; what are the performance/complexity trade-offs of microkernel chaining or ISA changes (e.g., longer counters, nested loops)?
  • Operand-routing restriction: inability to use SRF_M as a MAC input in Address-Aligned Mode forces extra instructions; can operand routing be broadened (or AAM redesigned) to reduce instruction overhead and approach theoretical peak?
  • Broadcast inefficiency: lack of a single-cycle memory-to-all-bank lane broadcast leaves a 1:1 compute-to-data-movement instruction ratio; what broadcast primitive is most cost-effective and how much speedup would it enable in practice?
  • Reduction-free dataflow generality: outer-product strategy removes reductions for GEMM/GEMV, but does not address reductions for other ops (e.g., softmax sums, norms, layer-normalization); what generic in-PIM reduction patterns are viable without off-chip involvement?
  • Element-wise coverage gap: min/max and conditionals are unsupported; what fallbacks (host or near-memory) are needed and what are the quantified data-movement costs for real workloads?
  • Tile-register virtualization feasibility: the proposed “hardware table” mapping AME tr/acc registers to memory addresses is conceptual; where does it live (host, controller, logic die), how is it accessed atomically, and can it be realized within JEDEC-compliant interfaces?
  • Correctness of “pointer-only” packing: replacing AME pack/unpack/transpose with pointer updates assumes consumers accept alternative layouts; when is real data reorganization unavoidable and how can it be done entirely in-PIM?
  • Memory management and coherence: the paper does not address cache coherence, consistency, or TLB/virtual memory interactions when tiles are modified in memory; what OS/runtime support is required to make AME-PIM safe and correct on shared-memory systems?
  • Concurrency and preemption: AB-PIM executes all banks in lockstep, potentially blocking regular memory traffic; how should OS/runtimes schedule PIM kernels to avoid QoS violations and enable preemption or time-slicing?
  • Multi-tenant isolation and security: AB broadcasts and in-memory execution raise isolation concerns; what mechanisms prevent cross-tenant data leaks or side channels in shared HBM stacks?
  • Scaling beyond one pseudo-channel: evaluation is per single pseudo-channel; how do performance, synchronization, and inter-pseudo-channel dataflows scale across an entire stack and across multiple stacks?
  • Cross-pseudo-channel accumulations: when outputs must be reduced across pseudo-channels or stacks, what on-device mechanisms avoid off-chip aggregation?
  • Platform evaluation limits: measurements rely on FPGA-side bus counters and a fixed-access-pattern PIM kernel; microarchitectural effects (pipeline stalls, bank conflicts, timing margins) remain unobserved—how do native counters or silicon traces change the picture?
  • End-to-end baselines and energy: no comparisons to CPU/GPU/AMX/SME baselines or energy/power on Aquabolt-XL; how does AME-PIM fare on performance/Watt and total time-to-solution on representative ML/HPC workloads?
  • Throughput bound diagnosis: 59.4 FLOP/cycle vs. 128 theoretical is attributed to data-movement overhead, but the exact contribution of AAM limits, operand routing, and bank bandwidth is not isolated; can per-stage profiling quantify each bottleneck?
  • Generality of outer-product GEMM: the impact of accumulator write intensity, bank conflicts, and thermal constraints for diverse matrix shapes (tall-skinny, wide, batched) is not assessed; do some shapes negate the benefits?
  • Activation fusion: PIM supports optional simple activations (e.g., ReLU) during data movement, but AME-PIM does not exploit it; what compiler/runtime strategies can legally and profitably fuse activations with AME semantics?
  • Software toolchain gap: no concrete compiler/runtime path from AME instructions to PEP sequences is provided; what is required to integrate with RISC-V toolchains (lowering, scheduling, tile selection, auto-tuning)?
  • Mode transition overheads: the costs and side effects of switching between SB, AB, and AB-PIM modes in multi-application scenarios are not characterized; what is the amortization threshold for real workloads?
  • Memory capacity and allocation: dedicating Even and Odd banks for tiles/accumulators implies capacity partitioning; how should memory allocators manage fragmentation, dynamic tile lifetimes, and contention among multiple kernels?
  • Robustness and reliability: ECC behavior and fault coverage for in-PIM compute and accumulations are not discussed; how are soft errors detected/corrected during MACs and data movement within PIM units?
  • Portability across HBM-PIM vendors: proposed PEPs and mappings target Aquabolt-XL; to what extent are they portable across other HBM-PIM designs and what minimal standardization is needed?
  • Frequency/thermal variability: results assume 250 MHz; how do performance and stability vary at 300 MHz, across temperature/voltage, and under sustained workloads?
  • Mixed-precision and integer acceleration: common ML workflows use int8/int4 and BF16; can the approach be extended (architecturally or via emulation) to these types without eroding benefits?
  • End-to-end integration path: the prototype is host-driven over PCIe; what system architecture enables a CPU to issue AME instructions that transparently dispatch to HBM-PIM via a JEDEC-compliant controller without bespoke FPGA logic?
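Two of the constraints listed above, the fixed 128-row tiles and the 255-iteration JUMP limit, lend themselves to a back-of-the-envelope planning sketch. The function below is hypothetical and assumes one loop iteration per step along K, which the summary implies but does not spell out; it only counts tiles and chunks and does not model timing.

```python
import math

TILE_ROWS = 128    # fixed rows per tile (from the list above)
JUMP_LIMIT = 255   # maximum loop iterations per PEP (from the list above)


def plan_tiling(m: int, k: int) -> dict:
    """Estimate row-tile count, lane utilization, and PEP chunks along K.

    ASSUMPTION: each loop iteration consumes one step of K, so a PEP can
    cover at most JUMP_LIMIT steps before a new invocation is needed.
    """
    row_tiles = math.ceil(m / TILE_ROWS)
    padded_rows = row_tiles * TILE_ROWS
    return {
        "row_tiles": row_tiles,
        "padded_rows": padded_rows,
        "lane_utilization": m / padded_rows,
        "k_chunks": math.ceil(k / JUMP_LIMIT),
    }


plan = plan_tiling(m=300, k=1000)
print(plan)
```

For M = 300 the last of three row tiles is mostly padding, so only about 78% of the lanes do useful work; this is the tail-handling cost the fixed-row tiling question refers to.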

Practical Applications

Immediate Applications

The following applications can be prototyped or deployed today on systems that already include HBM-PIM (e.g., Samsung Aquabolt-XL) and can leverage the AME-to-PIM mapping and reduction-free outer-product dataflow presented in the paper.

  • AI/ML inference offload for memory-bound layers
    • Sector: AI/Cloud; Software; Semiconductors
    • What: Offload FP16 GEMM/GEMV and element-wise kernels (e.g., MLP blocks, attention projections, final linear layers) to HBM-PIM using the proposed PIM Execution Primitives (PEPs) and AME semantics. Reduce off-chip data movement by executing multiply–accumulate and element-wise ops entirely within memory.
    • Tools/products/workflows:
      • Integrate a PIM-backed GEMM/GEMV path in BLAS-like libraries (e.g., a PIM-aware cBLAS/cuBLAS alternative or TVM/XLA backend pass that dispatches large FP16 tiles to PIM).
      • Runtime layer that converts AME tile ops into PEP micro-kernel invocations and manages AB-PIM mode transitions plus the “hardware table” tile-to-memory mapping described in the paper.
      • Use GPUs/CPUs for non-supported ops (e.g., min/max, comparisons) while PIM handles tiled GEMM/GEMV cores.
    • Assumptions/dependencies:
      • Availability of HBM-PIM devices (e.g., Aquabolt-XL) and drivers exposing AB-PIM mode and CRF loading.
      • FP16 precision only; fixed tile-row height (128), max tile width 4096; lack of reductions, min/max, and widening.
      • Data layouts must be adapted to the Even/Odd bank mapping and column-major row partitioning.
  • HPC microbenchmarks and kernel acceleration in research clusters
    • Sector: HPC; Academia; Semiconductor test labs
    • What: Prototype in-memory acceleration for dense linear algebra kernels where GEMV/GEMM dominate memory traffic (e.g., blocked matrix multiplications in solvers, pre/post-processing stages).
    • Tools/products/workflows:
      • Add PIM execution paths to existing HPC kernels (e.g., blocked GEMM in vendor or open-source BLAS) that issue AME-like tile instructions mapped to PEPs.
      • Use the FPGA-based evaluation flow (PCIe control + PIM_kernel) to script end-to-end experiments.
    • Assumptions/dependencies:
      • The paper’s GFLOP/s figures are measured per pseudo-channel; performance scaling to multiple pseudo-channels requires additional orchestration not covered in the current evaluation flow.
      • Only a subset of AME instruction semantics supported (no reductions, comparisons).
  • Energy-efficiency pilots in datacenters
    • Sector: Energy; Cloud; Policy (evaluation)
    • What: Demonstrate power savings by using in-memory MAC (outer-product) to avoid host-side reductions and off-chip transfers for large FP16 tiles in model inference pipelines.
    • Tools/products/workflows:
      • Instrumented A/B tests comparing PIM-enabled GEMV/GEMM vs. conventional CPU/GPU paths on the same models.
      • Integrate PIM usage into green-computing dashboards for facility-level power tracking.
    • Assumptions/dependencies:
      • Requires PIM-capable HBM stacks and software integration.
      • Benefits are workload-dependent; best results on large tiles with high memory pressure.
  • Teaching and research platforms for PIM programming via ISA semantics
    • Sector: Academia; Education
    • What: Use AME semantics mapped to HBM-PIM as a teaching example to bridge architecture, compilers, and memory systems. Students write AME-style code that executes on PIM via PEPs.
    • Tools/products/workflows:
      • Course labs using FPGA boards with Aquabolt-XL and the CRF/AB-PIM programming model.
      • Open-source examples that implement element-wise, GEMV, and GEMM via outer-product PEPs.
    • Assumptions/dependencies:
      • Access to hardware and low-level programming interfaces; vendor documentation/SDKs often under NDA.
  • Prototype RISC-V AME compiler backends and runtimes
    • Sector: Software/Compilers; RISC-V ecosystem
    • What: Implement an AME backend that lowers tile ops to HBM-PIM PEPs with a runtime that manages PIM modes, tile-to-memory mapping, and bank-aware layouts.
    • Tools/products/workflows:
      • LLVM or GCC AME pass that emits AME IR, plus a runtime to translate AME ops into PEP sequences and configuration writes (CRF, JUMP counts).
      • Basic scheduler for mixing CPU/GPU/PIM execution based on tile sizes and op types.
    • Assumptions/dependencies:
      • AME is a proposal (not ratified); target a subset compatible with current PIM ISA (no reductions, no widening, no min/max).
      • Requires tight coordination with the memory driver and PIM firmware.
  • Vendor SDK and device enablement for HBM-PIM as a “tensor backend”
    • Sector: Semiconductors; System OEMs
    • What: Package the PEPs as a device SDK that exposes matrix-ISA-style interfaces to customers (e.g., “AME-on-PIM” APIs).
    • Tools/products/workflows:
      • Libraries of PEP microkernels, reference AME mappings, and sample tiling strategies (128×4096).
      • Device-side firmware for CRF management and robust AB-PIM mode control.
    • Assumptions/dependencies:
      • Commercialization decisions; need for stable APIs and documentation; customer access to PIM modes.
  • Columnar analytics prototypes with in-memory element-wise transforms
    • Sector: Data Analytics (R&D)
    • What: Prototype in-memory FP16 scaling, normalization, and simple element-wise transformations directly within HBM-PIM columns to reduce data motion in ETL/feature-prep stages.
    • Tools/products/workflows:
      • Vectorized pipelines that push element-wise passes to PIM while CPU/GPU handles filtering and comparisons (not supported natively in PIM).
    • Assumptions/dependencies:
      • Servers with HBM-PIM are atypical for database deployments; likely confined to research pilots.
      • FP16 numeric behavior must be acceptable for the transformation stage.
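Several of the items above rely on a runtime that decides which operations go to PIM and which stay on the host. A minimal sketch of such a dispatch rule follows, built only from the constraints listed on this page (FP16 only, 128-row tiles up to 4096 columns, no min/max or reductions). The function name, the operation labels, and the policy of padding shorter tiles are hypothetical choices made for illustration.

```python
# Operations the summary describes as runnable in memory.
PIM_SUPPORTED_OPS = {"add", "mul", "sub", "mac", "gemv", "gemm"}

TILE_ROWS = 128       # fixed tile-row height
MAX_TILE_COLS = 4096  # maximum tile width


def choose_backend(op: str, dtype: str, rows: int, cols: int) -> str:
    """Return 'pim' if a tile op fits the listed PIM constraints, else 'host'.

    ASSUMPTION: tiles with fewer than 128 rows are padded rather than
    rejected; a real runtime might prefer the host for very small tiles.
    """
    if op not in PIM_SUPPORTED_OPS:
        return "host"  # e.g. min, max, comparisons, reductions
    if dtype != "fp16":
        return "host"  # the PIM datapath is FP16 only
    if not (1 <= rows <= TILE_ROWS and 1 <= cols <= MAX_TILE_COLS):
        return "host"  # tile must be split before it can be offloaded
    return "pim"


decisions = [
    choose_backend("gemm", "fp16", 128, 4096),
    choose_backend("max", "fp16", 128, 4096),
    choose_backend("add", "fp32", 128, 1024),
    choose_backend("gemv", "fp16", 256, 4096),
]
print(decisions)
```

A production scheduler would also weigh mode-transition overheads and tile size, neither of which is characterized in the summary.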

Long-Term Applications

These require further research, hardware evolution, ecosystem integration, standardization, or scaling beyond a single pseudo-channel.

    • ISA-integrated PIM as a first-class “tensor accelerator”
    • Sector: CPU vendors; RISC-V ecosystem; System software
    • What: Couple RISC-V AME (or similar ISA extensions) with HBM-PIM as an implementation backend across CPUs, exposing matrix tiles via architectural semantics and scheduling PIM as an on-package accelerator.
    • Tools/products/workflows:
    • AME-aware CPUs/SoCs with PIM-aware memory controllers and OS drivers.
    • Compiler/runtime that schedules tiles among CPU/GPU/PIM based on data locality and memory pressure.
    • Assumptions/dependencies:
    • Standardization of AME subsets and JEDEC-compliant PIM command sets.
    • Co-design of drivers, memory allocators, and tiling policies aware of Even/Odd bank mappings.
    • Next-generation PIM ISA features for higher efficiency and broader kernel coverage
    • Sector: Semiconductors; AI hardware
    • What: Add native reductions, single-cycle broadcasts, richer data types (FP32, BF16, INT8/FP8), comparisons/min/max to PIM units to lift current constraints and approach theoretical FLOP/cycle.
    • Tools/products/workflows:
    • Revised PIM logic layers and interconnects to support all-bank lane broadcasts and dot-product/reduction ops.
    • Updated AME mappings to fuse loads, broadcasts, and MACs efficiently.
    • Assumptions/dependencies:
    • Logic density and thermal budgets in HBM logic dies; JEDEC and vendor roadmaps; verification and reliability of extended PIM ISAs.
    • End-to-end integration with ML compilers and frameworks
    • Sector: Software/AI Frameworks (PyTorch, TensorFlow, TVM, XLA)
    • What: Frameworks that partition graphs to PIM for memory-bound matrix ops while leaving control-heavy or unsupported ops on CPU/GPU; automatic tiling/layout transformations for PIM-friendly mapping.
    • Tools/products/workflows:
    • Graph-level cost models that account for PIM’s FLOP/cycle, tile limits (128×4096), and op support.
    • Runtime kernels for PIM with asynchronous command streams and synchronization with other accelerators.
    • Assumptions/dependencies:
    • Stable PIM device interfaces and predictable performance across pseudo-channels/stacks.
    • Tooling for mixed-precision and accuracy management when using FP16 in the memory path.
    • Large-scale deployment in cloud inference/training pipelines
    • Sector: Cloud; AI services
    • What: Use PIM to accelerate memory-bound matvecs and large-tile GEMMs in transformer models, improving throughput-per-watt and freeing GPU memory bandwidth; integrate across many pseudo-channels/stacks.
    • Tools/products/workflows:
    • Cluster-level software to dispatch PIM jobs, manage data placement, and overlap PIM/CPU/GPU work.
    • SLA-aware scheduling that leverages PIM where latency/throughput gains are maximized.
    • Assumptions/dependencies:
    • Proven multi-pseudo-channel scaling and orchestration.
    • Broader op support (e.g., reductions) and more data types for training scenarios.
    • Database and analytics engines with in-memory scans and aggregations
      • Sector: Data platforms
      • What: Execute columnar scans, projections, and (future) aggregations inside memory to reduce data motion; eventually support predicates (min/max/comparisons) and reductions for GROUP BY/analytics.
      • Tools/products/workflows:
        • PIM-aware query planners that push down supported element-wise and reduction primitives.
        • Column store formats aligned to Even/Odd bank layouts for bank-parallel scans.
      • Assumptions/dependencies:
        • PIM ISA must grow to include comparisons, min/max, and reduction operations.
        • Server adoption of PIM-capable HBM stacks in data analytics SKUs.
    • Energy- and carbon-aware procurement and policy
      • Sector: Policy; Cloud; Government
      • What: Encourage or mandate procurement of PIM-capable memory devices for AI/HPC workloads where in-memory compute demonstrably reduces energy per inference/training step.
      • Tools/products/workflows:
        • Certification benchmarks and metrics (e.g., joules/op for GEMM on PIM vs. GPU/CPU).
        • Green-computing guidelines that recognize PIM as a qualifying energy-saving technology.
      • Assumptions/dependencies:
        • Transparent, vendor-neutral metrics; supply chain availability; standards for PIM programmability and safety.
    • Edge and robotics platforms with low-power in-memory tensor compute
      • Sector: Edge/Robotics; Mobile/AR/VR (future memory stacks)
      • What: Use PIM-like capabilities in future stacked memories (e.g., LPDDR-PIM variants) to run on-device inference for perception and control with lower energy and thermal impact.
      • Tools/products/workflows:
        • Lightweight AME-like interfaces in embedded RISC-V cores; small PIM microkernels for common tensor ops.
      • Assumptions/dependencies:
        • Emergence of low-power DRAM stacks with PIM features; simplified ISAs and toolchains for embedded markets.
    • Scientific computing: in-memory Krylov solvers and preconditioners
      • Sector: HPC/Scientific
      • What: Offload dense/sparse matvecs and, with reductions added, dot products and norms to PIM to reduce memory traffic in iterative solvers.
      • Tools/products/workflows:
        • PIM-aware linear algebra libraries; domain solvers that tile and schedule to PIM.
      • Assumptions/dependencies:
        • Native reductions and richer precision beyond FP16 are typically needed for solver stability and accuracy.
    • Security and data-residency benefits via reduced data motion
      • Sector: Security; Enterprise
      • What: Process sensitive tensors in-place within memory to limit exposure on external buses and intermediate buffers.
      • Tools/products/workflows:
        • Data governance policies and audit trails acknowledging in-memory compute paths; secure runtimes.
      • Assumptions/dependencies:
        • Threat modeling to validate that reduced movement confers meaningful security; hardened PIM firmware and isolation support.
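
To make the Krylov-solver item above more concrete, here is a minimal sketch of a conjugate gradient iteration with its operations split by kind. It is an illustration, not code from the paper: the function names (`matvec`, `axpy`, `dot`, `conjugate_gradient`) are hypothetical, everything runs on the host in ordinary Python floats rather than FP16, and the comments only mark which steps would be PIM candidates today (GEMV and element-wise updates) and which would need the reductions the current ISA lacks.

```python
def matvec(a, x):
    # GEMV: the class of operation the paper reports running in PIM mode.
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def axpy(alpha, x, y):
    # Element-wise alpha * x + y: also a PIM candidate (scalar broadcast + MAC).
    return [alpha * x_i + y_i for x_i, y_i in zip(x, y)]


def dot(x, y):
    # Cross-lane reduction: assumed to stay on the host until the PIM ISA
    # gains native reductions.
    return sum(x_i * y_i for x_i, y_i in zip(x, y))


def conjugate_gradient(a, b, tol=1e-10, max_iter=100):
    """Solve a @ x = b for a symmetric positive-definite matrix a."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - a @ x, with x = 0
    p = list(r)          # search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = matvec(a, p)
        alpha = rs_old / dot(p, ap)
        x = axpy(alpha, p, x)        # x <- x + alpha * p
        r = axpy(-alpha, ap, r)      # r <- r - alpha * a @ p
        rs_new = dot(r, r)
        p = axpy(rs_new / rs_old, p, r)  # p <- r + beta * p
        rs_old = rs_new
    return x


solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Each iteration here performs one `matvec`, three `axpy` updates, and two `dot` calls, so even with GEMV offloaded the two reductions per iteration would still cross the memory interface under the current ISA.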

Notes on feasibility and cross-cutting dependencies

  • Hardware dependencies:
    • HBM-PIM availability (e.g., Samsung Aquabolt-XL) and access to AB-PIM mode, CRF programming, and device drivers.
    • Current ISA limits: FP16-only, no native reductions/min/max/widening, limited broadcast, and loop iteration caps; tile mapping constrained to 128 rows and up to 4096 columns.
    • Performance measured per pseudo-channel; end-to-end gains require scaling across channels/stacks and careful orchestration.
  • Software dependencies:
    • AME is a proposal; practical deployments will target a compatible subset and require compiler/runtime support.
    • Data layout transformations to match Even/Odd bank mapping and outer-product dataflows.
    • Interoperability with CPUs/GPUs for unsupported operations and mixed-precision pipelines.
  • Ecosystem and standardization:
    • Coordination between JEDEC, RISC-V, memory vendors, and OS/toolchain communities to standardize PIM programming models and AME subsets.
    • SDKs and libraries to make PIM a drop-in backend for matrix operations in existing stacks.
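
The tile limits and the per-pseudo-channel measurement noted above lend themselves to a quick back-of-the-envelope calculation. The sketch below is an assumption-laden illustration: it treats the 128-row and 4096-column limits as hard tile bounds, takes the 14.9 GFLOP/s figure from the abstract as a peak value, and computes an ideal aggregate that ignores orchestration cost entirely. The helper names and the pseudo-channel count passed in are hypothetical.

```python
import math

# Tile limits quoted in the hardware notes; treated here as fixed bounds.
MAX_TILE_ROWS = 128
MAX_TILE_COLS = 4096
# "Up to" figure reported for a single pseudo-channel in the abstract.
PEAK_GFLOPS_PER_PC = 14.9


def tiles_needed(rows: int, cols: int) -> int:
    """Number of tiles required to cover a rows x cols matrix."""
    if rows <= 0 or cols <= 0:
        raise ValueError("matrix dimensions must be positive")
    return math.ceil(rows / MAX_TILE_ROWS) * math.ceil(cols / MAX_TILE_COLS)


def ideal_aggregate_gflops(num_pseudo_channels: int) -> float:
    """Upper bound assuming perfect linear scaling and no host overhead."""
    if num_pseudo_channels <= 0:
        raise ValueError("need at least one pseudo-channel")
    return PEAK_GFLOPS_PER_PC * num_pseudo_channels


tiles = tiles_needed(1024, 8192)       # 8 row blocks x 2 column blocks
bound = ideal_aggregate_gflops(4)      # hypothetical 4 pseudo-channels
```

The aggregate is only a ceiling: as the notes say, real end-to-end gains depend on how well work is orchestrated across channels and stacks.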

Glossary

  • Activate (ACT) command: A DRAM command that opens a specific row in a bank to make it accessible for subsequent column operations. "Activate (ACT) command"
  • Advanced Matrix Extensions (AMX): Intel’s ISA-level matrix acceleration extension providing tile-based operations. "Advanced Matrix Extensions (AMX)"
  • Address-aligned mode (AAM): A PIM execution mode where operand selection is derived implicitly from DRAM row/column addresses to reduce instruction overhead. "address-aligned mode (AAM)"
  • All-Bank (AB) Mode: An HBM-PIM mode that issues a single column command to the same row/column across all banks in a pseudo-channel in lock-step. "All-Bank (AB) Mode"
  • All-Bank PIM (AB-PIM) Mode: An HBM-PIM mode coupling all-bank column commands with in-memory execution, advancing a PIM program counter. "All-Bank PIM (AB-PIM) Mode"
  • Attached Matrix Extension (AME): A RISC-V proposal that defines an ISA abstraction for tiled matrix computation with tile and accumulation registers. "Attached Matrix Extension (AME)"
  • AXI4 crossbar: An on-chip interconnect fabric implementing the AXI4 protocol to connect multiple masters and slaves. "AXI4 crossbar"
  • Bank-level parallelism: Parallel execution across multiple DRAM banks to increase throughput. "bank-level parallelism"
  • Command Register File (CRF): A per-pseudo-channel storage for PIM instructions that the controller fetches and executes. "Command Register File (CRF)"
  • Column-major order: A memory layout where elements are stored column-by-column, shaping how tiles are mapped to banks. "column-major order"
  • Control and Status Registers (CSRs): Architectural registers used to configure and expose execution state for the matrix extension. "Control and Status Registers (CSRs)"
  • Cross-lane reduction: A SIMD operation that aggregates values across lanes; not supported by the described PIM ISA. "cross-lane reduction"
  • Data-type widening: Performing arithmetic that promotes operands to a wider data type for higher precision. "data-type widening"
  • DRAM row buffer: The sense-amplifier buffer that holds an activated row’s data for fast access by subsequent operations. "DRAM row buffer"
  • Even and odd bank: The paired DRAM banks connected to a single PIM unit, which can access one bank at a time. "even and odd bank"
  • FP16: A 16-bit floating-point format used for efficient compute and storage in PIM datapaths. "FP16"
  • Fused multiply-add: An operation that computes a × b + c as a single step (often one instruction with one rounding), used as a building block in linear algebra kernels. "fused multiply-add style computation"
  • General Matrix–Matrix Multiplication (GEMM): A core linear algebra operation multiplying two matrices to produce a matrix. "GEMM"
  • General Matrix–Vector Multiplication (GEMV): A linear algebra operation multiplying a matrix by a vector. "GEMV"
  • General Register File (GRF): A set of PIM-local vector registers used as sources/destinations for arithmetic and data movement. "General Register File (GRF)"
  • HBM logic layer: The logic die in an HBM stack where lightweight compute units for PIM are integrated. "HBM logic layer"
  • HBM-PIM: High Bandwidth Memory equipped with in-memory compute units to reduce data movement. "HBM-PIM"
  • High Bandwidth Memory (HBM): Stacked DRAM with very wide interfaces that provide much higher bandwidth than conventional DDR. "High Bandwidth Memory (HBM)"
  • JEDEC: The standards body that publishes memory interface specifications, including HBM; compliance ensures controller compatibility. "JEDEC-compliant memory controllers"
  • JUMP instruction: A PIM control-flow instruction used for looping with minimal overhead. "JUMP instructions"
  • Multiply-and-Accumulate (MAC): An operation that multiplies operands and accumulates into a destination, central to GEMM/GEMV. "Multiply-and-Accumulate (MAC) units"
  • Outer-product dataflow: A matrix multiplication strategy forming results via sums of outer products, avoiding explicit reductions. "reduction-free outer-product dataflow"
  • PCIe DMA IP: A hardware IP core that enables direct memory access transfers over PCIe. "PCIe DMA IP"
  • PIM Execution Primitive (PEP): A small, reusable microkernel of PIM instructions that implements higher-level operations (e.g., AME ops). "PIM Execution Primitive (PEP)"
  • Processing-in-Memory (PIM): An architectural paradigm performing computation near or within memory to reduce data movement. "Processing-in-Memory (PIM)"
  • Pseudo-channel: A subdivision of an HBM channel comprising a set of banks and PIM units that operate in lock-step. "pseudo-channel"
  • Scalar Register File (SRF): PIM-local scalar registers, often used for broadcasted scalar operands during vector operations. "Scalar Register File (SRF)"
  • Scalable Matrix Extension (SME): Arm’s ISA-level matrix acceleration extension similar in spirit to AMX/AME. "Scalable Matrix Extension (SME)"
  • Single-Bank (SB) Mode: The standard HBM mode where column commands address a single bank’s open row. "Single-Bank (SB) Mode"
  • Tile registers (tr0–tr3): Architectural registers in AME that hold input matrix tiles for computation. "tile registers (tr0–tr3)"
  • von Neumann bottleneck: The performance/efficiency limit imposed by data movement between CPU and memory. "von Neumann bottleneck"
  • Zero-cycle execution: An optimization where certain control-flow operations incur no additional cycles. "zero-cycle execution"
  • SIMD (Single Instruction, Multiple Data): A parallel execution model applying one instruction to multiple data elements simultaneously. "16-wide SIMD"
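
Several glossary entries (outer-product dataflow, cross-lane reduction, MAC, scalar broadcast) fit together in one small example. The sketch below shows, in plain Python, how a matrix product can be built from rank-1 updates so that every output element is accumulated in its own lane and no sum across lanes is ever needed. It illustrates the general technique only: the function name is hypothetical, the choice of which operand is broadcast is an assumption, and it does not model banks, GRF/SRF registers, or FP16 arithmetic.

```python
def gemm_outer_product(a, b):
    """Compute C = A @ B as a sum of rank-1 (outer-product) updates."""
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions of A and B must match")
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for p in range(k):                 # one rank-1 update per step p
        row_b = b[p]                   # vector operand for this step
        for i in range(m):
            scalar = a[i][p]           # scalar broadcast to every lane
            acc = c[i]
            for j in range(n):
                # Lane j only ever updates acc[j]: a multiply-accumulate
                # with no communication between lanes.
                acc[j] += scalar * row_b[j]
    return c


result = gemm_outer_product([[1.0, 2.0], [3.0, 4.0]],
                            [[5.0, 6.0], [7.0, 8.0]])
```

By contrast, an inner-product formulation computes each output element as a dot product of a row and a column, which requires summing across lanes; that is the step the described PIM ISA cannot perform natively.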
