Low-Batch Interleaving Mode (LBIM)
- LBIM is a mode in CD-PIM architectures that time-multiplexes DRAM pseudo-banks to concurrently handle GEMV and GEMM operations, reducing inference latency.
- It strategically schedules PIM and CPU workloads, achieving up to 1.46× speedup in low-batch LLM inference over the high-bandwidth mode.
- LBIM enhances edge computing efficiency by balancing compute and memory bandwidth tradeoffs, enabling fast, low-power LLM processing.
CD-PIM refers to digital Processing-In-Memory (PIM) architectures that implement either cross-division bank-level computation for bandwidth and utilization optimization or confidential computation through memory-resident security enclaves. Recent CD-PIM designs have emerged to address the challenges of memory-bound LLM inference on edge devices and of securely offloading data-intensive workloads within trusted memory banks. Architectures such as CD-PIM for low-batch LLM acceleration (Lin et al., 18 Jan 2026) and PIM-Enclave for confidential computation (Duy et al., 2021) implement distinct hardware and computational strategies to extend the PIM paradigm.
1. Architectural Foundations of CD-PIM
Bank-level CD-PIM leverages digital computation units placed within DRAM banks to accelerate general matrix-vector multiplication (GEMV) and general matrix-matrix multiplication (GEMM) operations central to transformer-based LLM inference. The architecture utilizes commodity LPDDR5 DRAM equipped with two pipelined computing units (CUs) per bank, partitioning each physical LPDDR5 bank into four pseudo-banks ("Pbanks") to maximize internal bandwidth and parallelism. High-level features include:
- Per-bank compute: Two identical CUs, each capable of 8-bit MACs at double the DRAM internal clock frequency ($400$ MHz for a $200$ MHz internal clock).
- Bank segmentation: Each bank is divided into upper/lower and left/right segments via isolation transistors and split global bitlines, enabling independent activation of four Pbanks.
- Overlapped workload support: Specialized modes for simultaneous GEMM (CPU) and GEMV (PIM).
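The bank organization above can be sketched as a small model: each physical bank splits into four Pbanks along its upper/lower and left/right segments, and the operating mode decides which engine each Pbank serves. This is an illustrative sketch, not the paper's design; the names `make_pbanks` and `assign` are invented here.

```python
# Sketch of LPDDR5 bank segmentation into four pseudo-banks ("Pbanks")
# and their mode-dependent assignment. Illustrative only.
from itertools import product

def make_pbanks():
    """Enumerate the four Pbanks of one physical bank (upper/lower x left/right)."""
    return [f"{v}-{h}" for v, h in product(("upper", "lower"), ("left", "right"))]

def assign(mode):
    """Assign each Pbank to an engine according to the operating mode."""
    pbanks = make_pbanks()
    if mode == "HBCEM":
        # All four Pbanks activated in parallel for PIM GEMV.
        return {pb: "PIM-GEMV" for pb in pbanks}
    if mode == "LBIM":
        # Time-multiplexed: two Pbanks serve PIM GEMV, two serve regular
        # DRAM requests for CPU-driven GEMM.
        return {pb: ("PIM-GEMV" if i < 2 else "CPU-GEMM")
                for i, pb in enumerate(pbanks)}
    raise ValueError(mode)

print(assign("LBIM"))
```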
In contrast, confidential CD-PIM as realized by PIM-Enclave (Duy et al., 2021) organizes each memory bank (or vault) to include a lightweight RISC core, AES-GCM-capable DMA engine, and access-control logic tied to a root-of-trust key and ROM. The memory module enables remote attestation, secure session and data key provisioning, and hardware-based protection against unauthorized access and side-channel leakage.
2. Modes of Operation: HBCEM and LBIM
CD-PIM for LLM acceleration introduces two principal operating modes on LPDDR5 (Lin et al., 18 Jan 2026):
High-Bandwidth Compute-Efficient Mode (HBCEM)
- Each LPDDR5 bank is subdivided into four Pbanks, each with independent sense amplifiers.
- All Pbanks are activated in parallel via the PIM_MAC_FM instruction, multiplying the bandwidth by 4. For example, with a $200$ MHz internal clock and $16$ B per access, one bank’s bandwidth increases from $3.2$ GB/s to $12.8$ GB/s.
- Across 16 banks, the aggregate internal bandwidth reaches $204.8$ GB/s.
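The bandwidth figures above follow from a simple product; the sketch below checks them under the assumption (consistent with the CU clock quoted later) of a $200$ MHz internal clock and $16$ B delivered per bank per internal cycle.

```python
# Back-of-envelope check of the HBCEM bandwidth figures.
INTERNAL_CLOCK_HZ = 200e6   # DRAM internal clock (assumed: half the 400 MHz CU clock)
BYTES_PER_ACCESS = 16       # bytes per bank per internal cycle (assumed)
PBANKS_PER_BANK = 4
BANKS = 16

per_bank = INTERNAL_CLOCK_HZ * BYTES_PER_ACCESS   # 3.2 GB/s per bank
per_bank_hbcem = per_bank * PBANKS_PER_BANK       # 12.8 GB/s with 4 Pbanks active
aggregate = per_bank_hbcem * BANKS                # 204.8 GB/s across 16 banks
print(per_bank / 1e9, per_bank_hbcem / 1e9, aggregate / 1e9)  # 3.2 12.8 204.8
```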
Low-Batch Interleaving Mode (LBIM)
- The four Pbanks per bank are time-multiplexed: two process GEMV operations for LLM decoding, while the remaining two handle regular DRAM requests for CPU-driven GEMM.
- Enables concurrent execution where GEMV (decode) is processed on PIM and GEMM (prefill) on the CPU, reducing total inference latency.
- LBIM leverages the bandwidth/compute tradeoff for improved latency in compute-bound low-batch settings, empirically yielding up to 1.46× speedup over HBCEM.
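Why interleaving can beat the higher-bandwidth mode is captured by a toy latency model: giving half the Pbanks to the CPU slows the PIM-side GEMV, but overlapping the two streams removes the serialization. The numbers below are illustrative, not measurements from the paper.

```python
# Toy latency model for HBCEM (serialized) vs. LBIM (overlapped).
def hbcem_latency(t_gemv_pim_fast, t_gemm_cpu):
    # HBCEM: PIM uses all four Pbanks for GEMV, so the CPU-side GEMM
    # must wait for DRAM access; the phases serialize.
    return t_gemv_pim_fast + t_gemm_cpu

def lbim_latency(t_gemv_pim_slow, t_gemm_cpu):
    # LBIM: two Pbanks serve PIM GEMV (at roughly half bandwidth),
    # two serve the CPU; the phases overlap.
    return max(t_gemv_pim_slow, t_gemm_cpu)

# GEMV on half the Pbanks takes ~2x longer, yet overlap still wins
# whenever the CPU-side GEMM dominates:
print(hbcem_latency(1.0, 3.0), lbim_latency(2.0, 3.0))  # 4.0 3.0
```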
3. Compute Unit Microarchitecture and Data-Mapping
Each bank integrates two deeply pipelined CUs, designed for maximal multiply-accumulate throughput:
- Each CU contains a 64 B input buffer and a 128 B output partial-sum buffer.
- The CU operates on 32 B wide weight vectors, processing two input-by-weight outer products per cycle.
- At twice the DRAM internal clock ($400$ MHz for a $200$ MHz internal clock), each CU delivers $128$ MACs per cycle; two CUs yield $256$ MACs per bank per cycle, for $1638.4$ GMAC/s across 16 banks.
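The aggregate MAC throughput follows directly from the per-CU figures:

```python
# Arithmetic behind the per-bank and aggregate MAC throughput quoted above.
CU_CLOCK_HZ = 400e6       # CUs run at 2x the 200 MHz internal DRAM clock
MACS_PER_CU_CYCLE = 128
CUS_PER_BANK = 2
BANKS = 16

per_bank = MACS_PER_CU_CYCLE * CUS_PER_BANK   # 256 MACs per bank per cycle
total = per_bank * BANKS * CU_CLOCK_HZ        # aggregate MAC/s across 16 banks
print(per_bank, total / 1e9)                  # 256 1638.4
```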
Optimized data-mapping strategies maximize CU utilization for both transformer K-cache (outer-product flow) and V-cache (inner-product flow):
- Column-wise (K-cache): one weight tile per bank, a broadcast input vector, and bank-wise outer-product accumulation.
- Row-wise (V-cache): one value tile per bank, a broadcast attention-weight vector, and bank-wise inner-product computation.
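The two flows compute the same class of GEMV with different dataflows: the outer-product flow accumulates column partial sums as the broadcast vector streams in, while the inner-product flow reduces one dot product per row. A minimal sketch in plain Python (tile sizes and names are illustrative, not the paper's tiling):

```python
# Minimal sketch of the two data-mapping flows for one bank's tile.
def outer_product_flow(vec, tile):
    """K-cache style: broadcast `vec`; accumulate column partial sums."""
    rows, cols = len(tile), len(tile[0])
    psums = [0] * cols
    for i in range(rows):          # each vec element scales one weight row
        for j in range(cols):
            psums[j] += vec[i] * tile[i][j]
    return psums

def inner_product_flow(attn, tile):
    """V-cache style: broadcast `attn`; one dot product per tile row."""
    return [sum(a * w for a, w in zip(attn, row)) for row in tile]

print(outer_product_flow([1, 2], [[1, 0], [0, 1]]))  # [1, 2]
print(inner_product_flow([1, 2], [[1, 0], [0, 1]]))  # [1, 2]
```

Note the orientation difference: the outer-product flow computes vector-times-matrix, the inner-product flow matrix-times-vector, so one equals the other applied to the transposed tile.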
4. Confidential Processing-In-Memory: PIM-Enclave CD-PIM
PIM-Enclave (Duy et al., 2021) supports confidential CD-PIM by constructing a memory-resident secure enclave leveraging:
- AES-GCM-capable DMA engine for zero-copy encrypted data transfer.
- Access-control logic per bank to enforce "in-enclave" memory regions, inaccessible to unauthorized host reads/writes via efficient on-die range checks.
- EK (endorsement key) and ROM for attestation, key establishment, and root-of-trust.
Operational workflow includes:
- Remote attestation where host verifies enclave integrity and provisions session/data keys using asymmetric encryption.
- Protected regions are locked via base and mask registers.
- All DMA transfers are authenticated and encrypted via hardware, ensuring IND-CPA confidentiality and tamper detection (integrity).
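The base/mask locking step above amounts to a one-cycle range check: a host address belongs to the protected region when its masked bits match the base register. A minimal sketch, with illustrative register values and a `PermissionError` standing in for whatever denial behavior the hardware implements:

```python
# Sketch of the base/mask register range check for "in-enclave" regions.
BASE = 0x4000_0000   # region base, aligned to the region size (illustrative)
MASK = 0xFFF0_0000   # upper address bits compared; here a 1 MiB region

def in_enclave(addr):
    """True when `addr` falls inside the protected region."""
    return (addr & MASK) == (BASE & MASK)

def host_read(addr, mem):
    """Host-side read, gated by the on-die access check."""
    if in_enclave(addr):
        # Real hardware denies the access; a raised error models that here.
        raise PermissionError("read to enclave region denied")
    return mem.get(addr, 0)

print(in_enclave(0x4000_1234), in_enclave(0x5000_0000))  # True False
```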
Programming interface resembles GPU offload paradigms, with explicit enclaving, encrypted code/data transfer, execution, and result retrieval.
5. Evaluation Metrics and System Impact
Evaluation of CD-PIM for LLMs on edge devices via Ramulator 2.0 simulation and real device benchmarks (Lin et al., 18 Jan 2026) demonstrates:
| System | Batch | Comparison | Speedup | Area Overhead |
|---|---|---|---|---|
| Jetson Orin | 1 (single) | HBCEM vs. GPU-only | 11.42× | ~0.8% |
| Jetson Orin | 1 (single) | HBCEM vs. AttAcc | 4.25× | - |
| iPhone 15 | 1 (single) | HBCEM vs. GPU-only | 4.5×–18.6× | - |
| Jetson Orin | 4 (low) | LBIM vs. HBCEM | 1.01×–1.46× | - |
Each CU incurs a small area footprint and $4.5$ mW; the cumulative cost of two CUs per bank remains modest relative to die size and power ($144$ mW total for 32 CUs on a $32$ Gb die).
Confidential CD-PIM via PIM-Enclave (Duy et al., 2021) achieves:
- Near-linear speedup when scaling beyond 6 banks.
- Encrypted DMA overhead as low as $3.7$%.
- Robust mitigation of bus side-channel leakages, cold-boot, replay/tampering.
6. Security Properties and Threat Model
Confidential CD-PIM designs define adversaries capable of system software compromise, physical bus analysis, DMA, and cold-boot attacks. Security goals are:
- Confidentiality: IND-CPA assurance for enclave memory contents.
- Integrity: AES-GCM tag-based tampering/replay protection.
- Side-channel resilience: No runtime leakage of memory access patterns on external buses.
Hardware-based access controls ensure in-enclave memory exclusivity during execution; all host read attempts to protected regions are rejected in hardware. Remote attestation allows cryptographic verification of enclave state prior to execution.
7. Context and Adaptability
CD-PIM architectures, through either high-bandwidth multi-bank compute or secure bank-resident enclaving, offer scalable, efficient solutions for both edge intelligence workloads (LLM inference) and confidential data processing. CD-PIM/LLM implementations optimize for low-latency, low-batch scenarios, achieving significant throughput and latency improvements over GPU-only and previous PIM baselines. Confidential CD-PIM strategies eliminate practical bus-level side-channels while maintaining minimal performance overhead. Adoption in memory-centric acceleration and cloud security primitives suggests ongoing relevance as DRAM and PIM paradigms evolve (Lin et al., 18 Jan 2026, Duy et al., 2021).