
AttAcc PIM Backend for Transformer Inference

Updated 26 November 2025
  • AttAcc PIM Backend is a processing-in-memory co-design that accelerates transformer self-attention by pipelining operations across DRAM banks and embedded MAC units.
  • It employs dynamic command orchestration with direct PIM access and lazy memory allocation, reducing off-chip data movement by up to 100×.
  • Compiler optimizations and integration with heterogeneous accelerators yield significant speedups in memory-bound kernels while also supporting confidential computing.

AttAcc PIM Backend is a processing-in-memory (PIM) software–hardware co-design targeting the memory-bound components of attention-based models, notably transformer self-attention and KV-cache management in LLM inference. The backend implements a family of data- and pipeline-parallel management techniques, dynamic command and memory orchestration, and data-centric code generation, permitting dramatic improvements in memory-bandwidth utilization, inference throughput, and model capacity versus conventional accelerator backends. It also provides hooks for confidential-computing protection and sophisticated compiler/runtime integration.

1. Hardware Architecture and Memory-Bound Parallelism

AttAcc PIM backends are built upon a topology of DRAM subsystems augmented with embedded compute engines, typically dot-product MAC (multiply–accumulate) units, inside or near each DRAM bank. Each memory "group" (e.g., HBM channel) includes 32–64 banks, forming the fundamental parallel granularity. In architectures such as LoL-PIM, each module consists of sixteen or more banks, each with a local buffer and per-bank output registers; AttAcc inherits this, with 16-way FP16 GEMV datapaths streaming vectors directly from local DRAM (Kwon et al., 28 Dec 2024, Yang et al., 19 Nov 2025).

All banks in a group connect to a PIM controller or "hub," which manages broadcast of input vectors and bank synchronization. A channel-level softmax or post-processing unit enables on-chip merge/reduction operations for attention, such as exponentiation and normalization in softmax. Bandwidth is a central concern: with 64 GB/s per module as typical off-chip interface and over 65 TB/s internal parallel bandwidth across banks, optimal utilization of internal resources drives the architecture (Kwon et al., 28 Dec 2024, Yang et al., 19 Nov 2025).
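The gap between the two cited figures can be checked with simple arithmetic; note that the 65 TB/s number is an aggregate across all banks, while 64 GB/s is a single module's external interface, so this is a motivating comparison rather than a per-module ratio:

```python
# Back-of-envelope check on the bandwidth figures quoted above.
per_module_offchip = 64e9   # 64 GB/s: typical off-chip interface per module
aggregate_internal = 65e12  # >65 TB/s: internal parallel bandwidth across banks
ratio = aggregate_internal / per_module_offchip
# A roughly three-order-of-magnitude gap, which is why keeping the KV-cache
# and partial products inside the DRAM banks pays off.
```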

Unique to AttAcc is the ability to pipeline transformer layer-groups across a chain of PIM modules, with each module or node responsible for L/PP layers (L=total layers, PP=number of pipeline stages). This pipeline parallelism enables multiple micro-batches to advance concurrently, each at a different token/layer pair, maximizing steady-state throughput.
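The layer-to-stage assignment and steady-state occupancy can be sketched as follows (the layer count, stage count, and one-step-per-phase timing are illustrative assumptions, not AttAcc's actual schedule):

```python
# Sketch: assigning L transformer layers to PP pipeline stages (PIM modules)
# and stepping micro-batches through them. Values are hypothetical.
L, PP = 96, 4                # e.g. 96 layers over 4 pipeline stages
layers_per_stage = L // PP   # each module owns L/PP consecutive layers

def stage_of(layer):
    """Which PIM module executes a given layer."""
    return layer // layers_per_stage

def schedule(num_microbatches, num_steps):
    """Per time step, the (micro-batch, stage) pairs active concurrently.

    Micro-batch mb enters the pipeline at step mb and advances one stage
    per step, so in steady state PP micro-batches are in flight at once.
    """
    steps = []
    for t in range(num_steps):
        active = [(mb, t - mb) for mb in range(num_microbatches)
                  if 0 <= t - mb < PP]
        steps.append(active)
    return steps

# Usage: schedule(6, 9)[3] shows all four stages busy at step 3.
```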

2. Dynamic Command Orchestration via Direct PIM Access

Traditional PIM command models encode static memory addresses and cannot adapt to dynamic sequence lengths or arbitrary context growth; this leads to over-allocation and poor device utilization. AttAcc PIM Backend addresses this via a Direct PIM Access (DPA) controller, which supports dynamic loops (Dyn-Loop) and address-modifying (Dyn-Modi) commands, dispatched just-in-time by an on-module micro-controller (Kwon et al., 28 Dec 2024).

The DPA system comprises:

  • A virtual-to-physical table (Va2Pa), mapping logical chunk IDs to actual DRAM rows/pages.
  • Command and configuration buffers, with command stacks encoding DPA's loop and address modifications.
  • Decoding logic that patches row/column indices in commands at runtime.

Lazy memory allocation is implemented such that physical chunks are only assigned once additional tokens actually arrive, and immediately released upon completion or end-of-sequence, minimizing fragmentation and memory waste.

This design pattern generalizes across PIM-based attention accelerators, allowing dynamic context management and KV-cache sizing without manual preallocation, and supports arbitrary-length LLM decoding up to hardware capacity.
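The lazy-allocation behavior behind the Va2Pa table can be modeled functionally as below; the table and chunk names follow the text, while the free-list mechanics are an assumption for illustration:

```python
# Illustrative model of lazy KV-cache chunk allocation with a
# virtual-to-physical (Va2Pa) table, in the spirit of the DPA controller.
class Va2PaTable:
    def __init__(self, num_physical_chunks):
        self.free = list(range(num_physical_chunks))  # free physical DRAM chunks
        self.table = {}  # logical chunk ID -> physical chunk

    def touch(self, logical_id):
        """Allocate a physical chunk only when a token actually lands in it."""
        if logical_id not in self.table:
            if not self.free:
                raise MemoryError("PIM module capacity exhausted")
            self.table[logical_id] = self.free.pop()
        return self.table[logical_id]

    def release_sequence(self, logical_ids):
        """Return chunks to the free list at end-of-sequence."""
        for lid in logical_ids:
            phys = self.table.pop(lid, None)
            if phys is not None:
                self.free.append(phys)
```

A decoding loop would call `touch` once per newly generated context chunk and `release_sequence` when a request completes, so physical capacity tracks live context rather than a preallocated maximum.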

3. Mapping Attention Workloads to PIM Microarchitecture

Core transformer attention operations—specifically, scaled dot-product attention and softmax—are natively mapped to PIM bank and group parallel resources. A frequently used data partitioning, intra-module token-parallel partitioning (ITPP), assigns different banks to token slices (over the context or sequence dimension), allowing each bank to compute a portion of the QK^T or SV dot-products in parallel (Kwon et al., 28 Dec 2024).

The operand mapping for attention is:

  • Q: Query vectors are broadcast to all banks.
  • K/V: Key and value caches are sharded across banks over the token dimension, such that each bank holds keys/values for a subset of tokens.
  • Output: Partial results are reduced on-chip before returning only essential elements off-chip.

The softmax operation is carried out using a bank-group reduction unit that collects local logits, computes max/exponent/sum, and writes back normalized weights.

This strategy yields O(L²d) complexity in total per-head dot-products, but all KV-cache and intermediate products remain inside DRAM, limiting off-chip transfer to O(Ld + d²); for long sequences (large L), this results in up to 100× reduction in data movement compared to GPU or CPU execution (Kwon et al., 28 Dec 2024).
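The operand mapping and the max/exponent/sum merge can be captured in a small functional model (pure Python, no hardware semantics; the two-step partial/merge split mirrors the bank-then-group reduction described above):

```python
# Functional sketch of ITPP attention: K/V rows sharded across banks over the
# token dimension, query broadcast, group-level softmax merge of bank partials.
import math

def bank_partial(q, k_shard, v_shard):
    """One bank: local logits and (max, exp-sum, weighted-V) statistics."""
    logits = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_shard]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    acc = [sum(e * v[j] for e, v in zip(exps, v_shard)) for j in range(len(q))]
    return m, s, acc

def group_reduce(partials):
    """Channel-level merge: rescale each bank's stats to a common max."""
    g_max = max(m for m, _, _ in partials)
    g_sum, out = 0.0, None
    for m, s, acc in partials:
        scale = math.exp(m - g_max)
        g_sum += s * scale
        scaled = [a * scale for a in acc]
        out = scaled if out is None else [o + a for o, a in zip(out, scaled)]
    return [o / g_sum for o in out]
```

Only the per-bank (max, sum, accumulator) triples cross the bank boundary, which is the mechanism that keeps the full KV-cache and logits inside DRAM.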

4. Compiler, Data Layout, and Code Optimization

The AttAcc PIM backend leverages advanced compiler infrastructure, exemplified by the DCC data-centric tensor compiler (Yang et al., 19 Nov 2025), which co-optimizes both data rearrangement and loop partitioning to fit the constraints of the PIM hardware. The compiler stack operates in three layers:

  1. System level: Host issues all commands and handles address mapping.
  2. PIM group level: Channel-wide operations (e.g., softmax) are orchestrated as group instructions.
  3. PIM core level: Per-bank GEMV (dot-product) commands are issued and executed independently for maximal parallelism.

Key optimizations include:

  • Tensor→bank tiling, selecting bank tile sizes B_i, B_j such that bank loads saturate SIMD units and avoid underutilization (enforced e.g. by B_i mod 16 = 0).
  • Software-managed double-buffering: Each bank contains dual register files to support simultaneous data transfer and computation without idle cycles.
  • Vectorization and alignment: All tiling and buffer fills align with the bank’s SIMD width (typically 16) for peak compute efficiency.
  • Bank-level scratchpad constraints: Scheduling is pruned to ensure per-bank kernels fit in limited local scratchpads.
  • Host-to-PIM data rearrangement: Data is pre-interleaved on host shared memory before parallel dispatch to banks, maximizing channel and bank bandwidth utilization.

Performance is predicted via a hybrid of analytical and machine-learned models, using empirical features (tile size, bandwidth, command overhead) to select optimal schedules for each tensor operation.
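The tile-size search implied above can be sketched as follows; the scratchpad budget, byte accounting, and cost model here are illustrative assumptions, not DCC's actual predictor:

```python
# Sketch of tiling selection: enumerate (B_i, B_j) candidates aligned to the
# SIMD width, prune those exceeding a per-bank scratchpad budget, and rank by
# a toy analytical cost model.
import math

SIMD = 16          # bank SIMD width (e.g. 16 FP16 lanes)
SCRATCHPAD = 8192  # hypothetical per-bank scratchpad budget, in elements

def candidate_tiles(M, N):
    for bi in range(SIMD, M + 1, SIMD):          # enforce B_i mod 16 == 0
        for bj in range(SIMD, N + 1, SIMD):
            if bi * bj + bi + bj <= SCRATCHPAD:  # tile + I/O vectors must fit
                yield bi, bj

def cost(bi, bj, M, N, cmd_overhead=64.0):
    """Toy model: data moved per tile plus a fixed per-command overhead."""
    tiles = math.ceil(M / bi) * math.ceil(N / bj)
    return tiles * (bi * bj + cmd_overhead)

def best_tile(M, N):
    return min(candidate_tiles(M, N), key=lambda t: cost(*t, M, N))
```

A real backend would replace `cost` with the hybrid analytical/learned predictor mentioned above, but the pruning structure (alignment, capacity, then ranking) is the same.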

5. Confidential Computing and Security

AttAcc supports secure computation via enclave-style protection mechanisms, adopting the AttAcc "PIM-Enclave" backend abstraction (Duy et al., 2021). Each DRAM vault/bank is paired with a PIM core and an AES-GCM capable DMA engine, which transparently encrypts and authenticates all inbound/outbound data. Key storage is local, and attestation is implemented via a ROM-stored endorsement key and remote measurement protocol.

Security features include:

  • Access control: Host commands targeting enclave-protected regions are rejected unless properly provisioned and authenticated.
  • Side-channel resistance: With all plaintext and computation resident in DRAM during offload, no bus traffic or address patterns are exposed to the untrusted host.
  • Practicality: Encryption overhead is <25% per-DMA, and the total slowdown is <4% for typical data-intensive kernels, while removing the data-movement bottleneck and eliminating CPU-enclave memory limits.
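The access-control rule can be modeled as a gate in front of the PIM command path; the `PimEnclaveGate` class, the MAC scheme, and chunk-granularity protection are hypothetical simplifications (the real design uses AES-GCM DMA engines and key attestation as described above):

```python
# Toy model of enclave access control: commands touching protected chunks must
# carry a valid MAC under the enclave's provisioned key; others pass through.
import hashlib
import hmac

class PimEnclaveGate:
    def __init__(self, enclave_key: bytes, protected_chunks: set):
        self.key = enclave_key
        self.protected = protected_chunks

    def admit(self, chunk_id: int, command: bytes, tag: bytes = b"") -> bool:
        if chunk_id not in self.protected:
            return True  # untrusted host may freely use unprotected chunks
        expected = hmac.new(self.key, command, hashlib.sha256).digest()
        return hmac.compare_digest(expected, tag)  # constant-time comparison
```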

A plausible implication is that such secure PIM backends can service both bandwidth-intensive and confidential workloads without significant performance tradeoffs compared to unprotected PIM, strongly outperforming CPU-enclave baselines (Duy et al., 2021).

6. Integration with Heterogeneous Accelerators and Scheduling

State-of-the-art heterogeneous architectures such as IANUS and NeuPIMs combine PIM with traditional NPUs or GPUs to exploit both high memory bandwidth (for GEMV in PIM) and high compute throughput (for GEMM in NPU/TPU) (Heo et al., 1 Mar 2024, Seo et al., 19 Oct 2024). The system may employ a shared unified DRAM, with both PIM and NPU issuing accesses and compute commands.

Critical to efficiency is fine-grained scheduling:

  • Dual row-buffer or equivalent mechanisms allow concurrent memory and compute paths, so NPU and PIM can interleave accesses.
  • Sub-batch interleaving and pipeline scheduling maximize device utilization. For example, while one sub-batch runs MHA on PIM, another computes QKV on the NPU (Heo et al., 1 Mar 2024).
  • Compiler models select, per FC or attention block, whether PIM or NPU is optimal based on bandwidth-versus-compute phase.
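The sub-batch interleaving idea can be illustrated with a toy two-engine timeline; the coarse two-phase split per layer (one GEMM-heavy NPU phase, one MHA PIM phase) and the one-phase lag between sub-batches are simplifying assumptions, not the actual NeuPIMs scheduler:

```python
# Two sub-batches alternate between the NPU (GEMM phases) and PIM (MHA phase),
# offset by one phase so the two engines are never contended.
PHASES = ["gemm_npu", "mha_pim"]  # coarse per-layer phase split

def interleave(num_layers):
    """(time_step, sub_batch, phase) triples for two interleaved sub-batches."""
    timeline = []
    total = num_layers * len(PHASES)
    for t in range(total + 1):
        for sb, offset in ((0, 0), (1, 1)):  # sub-batch 1 lags by one phase
            idx = t - offset
            if 0 <= idx < total:
                timeline.append((t, sb, PHASES[idx % 2]))
    return timeline

def conflicts(timeline):
    """Time steps where both sub-batches need the same engine."""
    by_t = {}
    for t, sb, ph in timeline:
        by_t.setdefault(t, []).append(ph.split("_")[1])
    return [t for t, engines in by_t.items()
            if len(engines) == 2 and engines[0] == engines[1]]
```

With this offset, every interior time step runs one sub-batch on the NPU and the other on PIM, which is the utilization argument made above.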

Tradeoffs include increased scheduler and buffer complexity, and serialization points when PIM compute and NPU DMA cannot overlap, but net throughput and energy efficiency increase (IANUS: up to 6.2× GPT-2 speedup vs A100; NeuPIMs: 3× batched throughput) (Heo et al., 1 Mar 2024, Seo et al., 19 Oct 2024).

7. Empirical Performance and Scalability

The AttAcc PIM backend and its data-centric compiler stack have demonstrated:

  • Up to 13.17× speedup (5.75× average) in memory-bound ML kernels over GPU-only execution (e.g., attention and GEMV ops).
  • End-to-end LLM inference (GPT-3, LLaMA-2): 7.71× maximum speedup (4.88× average) over GPU (Yang et al., 19 Nov 2025).
  • For long-context LLM (up to 32K tokens): 8.54× throughput improvement over 16×A100 GPU-HBM, and up to 16× over naïve GPU+PIM (Kwon et al., 28 Dec 2024).
  • PIM MAC-unit utilization improved from ~15% to ~30%, with robust scaling from 7B- to 72B-parameter models (Kwon et al., 28 Dec 2024).
  • Empirical confirmation that compiler-driven aggressive data rearrangement and partitioning (with predictor-tuned loop nest and tiling) yield a further 24–50% boost over naïve fixed-tiling PIM kernels (Yang et al., 19 Nov 2025).

These results demonstrate that AttAcc PIM backend designs, when tightly orchestrated by software with respect to microarchitectural constraints, dynamic memory allocation, and heterogeneous compute scheduling, provide a scalable and practical solution for both high-throughput and secure inference in modern LLMs.
