CD-PIM: LLM Acceleration & Confidential Computing
- CD-PIM is a dual-architecture in-memory computing paradigm that accelerates low-batch LLM inference on edge devices while securely processing confidential workloads.
- It employs techniques such as bank subdivision, time multiplexing, and targeted data mapping to optimize both GEMV and GEMM operations efficiently.
- The architecture achieves up to 14.6× speedup in decode stages and maintains modest area and power overheads, enabling performance and security in constrained environments.
CD-PIM refers to two distinct but thematically related architectures in the processing-in-memory (PIM) field: (1) Cross-Division PIM, a bandwidth- and compute-optimized digital PIM architecture built atop LPDDR5 for low-batch LLM inference on edge devices (Lin et al., 18 Jan 2026), and (2) Confidential Data Processing-In-Memory, a security-enhanced PIM system for confidential workload offloading, as realized in the PIM-Enclave architecture (Duy et al., 2021). Both approaches fundamentally exploit PIM principles to minimize data movement, maximize in-situ compute throughput, and—optionally—enforce end-to-end security on sensitive workloads.
1. Design Goals and Key Challenges
CD-PIM for LLM acceleration on edge platforms targets the autoregressive decoding stage, characterized by memory-bound general matrix-vector multiplications (GEMV) at low batch sizes. Three principal challenges underlie this design:
- Insufficient Bank-Level Bandwidth: Commodity LPDDR5, as used in mobile and edge SoCs, provides limited internal bank parallelism compared to HBM-based systems, constraining per-bank GEMV acceleration.
- Resource Underutilization with Mixed Workloads: Standard PIM deployments execute PIM compute and normal DRAM access in a blocked, mutually exclusive mode, so alternating GEMM (compute-bound, typically prefill) and GEMV (decode) phases leave resources idle.
- Limited Bank Compute Capacity: A single CU per bank running at DRAM frequencies cannot fully exploit the available internal I/O bandwidth, restricting overall throughput.
The goal is to enable high-throughput, low-latency GEMV execution during the decoding phase, while supporting overlapping prefill stage GEMM execution to maximize hardware utilization on edge devices such as NVIDIA Jetson AGX Orin and Apple iPhone 15 Pro (Lin et al., 18 Jan 2026).
2. High-Bandwidth Compute-Efficient Mode (HBCEM) and Low-Batch Interleaving Mode (LBIM)
HBCEM
In HBCEM, each LPDDR5 bank is subdivided into four independent pseudo-banks ("Pbanks") by segmenting the global bitlines (GBLs) via two orthogonal cuts (left/right and upper/lower) and associated isolation transistors. Each segment, addressed as Bank_TL, Bank_TR, Bank_BL, Bank_BR, possesses its own sense amplifier. This structural partitioning enables:
- Quadrupled Effective Bandwidth: Segmenting the GBLs lets all four Pbanks stream data simultaneously, so a bank with internal bandwidth $B$ delivers an effective $4B$ in HBCEM.
- Scalable Internal Throughput: On platforms with 16 banks, the 64 Pbanks operate in parallel, scaling aggregate die-internal bandwidth by the same 4× factor over the baseline die.
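The bandwidth scaling above is straight multiplication; a minimal sketch (the 6.4 GB/s per-bank figure is a hypothetical placeholder, not a number from the paper):

```python
def pbank_bandwidth(bank_bw_gbps: float, pbanks_per_bank: int = 4) -> float:
    """Effective per-bank bandwidth when all four Pbanks stream in parallel."""
    return bank_bw_gbps * pbanks_per_bank

def die_bandwidth(bank_bw_gbps: float, banks: int = 16) -> float:
    """Aggregate die-internal bandwidth across all banks under HBCEM."""
    return pbank_bandwidth(bank_bw_gbps) * banks

# With a hypothetical 6.4 GB/s per bank, the die-internal aggregate is
# 4 * 16 = 64x the single-bank figure.
print(die_bandwidth(6.4))  # → 409.6
```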
LBIM
LBIM interleaves the GEMV computation on PIM with GEMM execution on the host CPU by time-multiplexing the Pbanks:
- Two Pbanks are dedicated to independent GEMV (decode) operations, while the other two serve normal DRAM requests to support concurrent GEMM (prefill) on the CPU.
- End-to-end latency for LBIM is approximately $\max(T_{\text{GEMV}}, T_{\text{GEMM}})$, contrasting with the sequential $T_{\text{GEMV}} + T_{\text{GEMM}}$ of HBCEM.
Empirical results demonstrate reduced end-to-end latency on compute-bound low-batch workloads via LBIM, with speedups ranging upward from $1.01\times$ (Lin et al., 18 Jan 2026).
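The latency contrast can be sketched as a two-term model. The overlap of the two phases is from the paper; the assumption that halving the Pbanks available to GEMV doubles its time is an illustrative simplification, and the millisecond values are placeholders:

```python
def hbcem_latency(t_gemv_ms: float, t_gemm_ms: float) -> float:
    """HBCEM: all four Pbanks serve GEMV, so the host's GEMM must wait
    for DRAM access -- the two phases run back to back."""
    return t_gemv_ms + t_gemm_ms

def lbim_latency(t_gemv_ms: float, t_gemm_ms: float) -> float:
    """LBIM: two Pbanks run GEMV (assumed 2x slower with half the
    bandwidth) while the other two serve the host's GEMM; the phases
    overlap, so the critical path is the slower of the two."""
    return max(2 * t_gemv_ms, t_gemm_ms)

# Compute-bound case (GEMM dominates): overlapping wins.
print(hbcem_latency(2.0, 10.0))  # → 12.0
print(lbim_latency(2.0, 10.0))   # → 10.0
```

When GEMM dominates, hiding the GEMV time behind it shrinks the end-to-end latency, matching the compute-bound regime where LBIM pays off.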
3. Compute Unit (CU) Microarchitecture
Each Pbank includes two pipelined compute units, with the following microarchitectural features:
- Pipeline and Buffering: Each CU comprises a 64 B input buffer and a 128 B output buffer, accepting 8-bit weight streams serially.
- Throughput: Each cycle, a CU receives 1 B of input and multiplies it against two 32 B weight vectors, generating 64 partial sums; at a clock rate of $200$ MHz, per-bank throughput is $256$ MACs per cycle.
- Totals: Across 16 banks at $200$ MHz, aggregate throughput reaches approximately $819.2$ GMAC/s ($0.82$ TMAC/s).
Each CU occupies a small area and draws $4.5$ mW in TSMC 28 nm; two CUs per bank imply a total power cost of $144$ mW per $32$ Gb die (Lin et al., 18 Jan 2026).
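The aggregate figure follows directly from the per-bank numbers (16 banks × 256 MACs/cycle × 200 MHz):

```python
BANKS = 16
MACS_PER_BANK_PER_CYCLE = 256
CLOCK_HZ = 200e6  # 200 MHz DRAM-side clock

aggregate_mac_s = BANKS * MACS_PER_BANK_PER_CYCLE * CLOCK_HZ
print(aggregate_mac_s / 1e9)  # → 819.2 (GMAC/s)
```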
4. Data-Mapping and Execution Strategies
CD-PIM applies tailored data-mapping strategies to maximize CU occupancy for both outer-product and inner-product flows in LLM attention mechanisms:
- K-Cache (Column-Wise Mapping): The K-cache matrix is stored column-wise as tiles within each bank. Query vectors (Q) are broadcast in 64-element sub-vectors, enabling the CUs to perform efficient outer-product accumulation across all output columns of a tile.
- V-Cache (Row-Wise Mapping): The V-cache is mapped row-wise as tiles within each bank; attention-weight vectors are broadcast in matching 64-element sub-vectors, enabling inner-product (dot-product) reduction.
These arrangements ensure that both CUs per Pbank remain continuously active across all cycles, fully utilizing internal parallelism (Lin et al., 18 Jan 2026).
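The two dataflows can be contrasted in a pure-Python sketch. The 64-element sub-vector broadcast is from the text; the tile shapes here are illustrative, and the loops model arithmetic order only, not cycle timing:

```python
def outer_product_step(q_sub, k_tile):
    """K-cache flow (column-wise): broadcast a Q sub-vector and accumulate
    its rank-1 (outer-product) contribution against a K tile.
    k_tile: one row of weights per incoming Q element."""
    cols = len(k_tile[0])
    acc = [0] * cols
    for i, q in enumerate(q_sub):          # one input element per cycle
        for j in range(cols):
            acc[j] += q * k_tile[i][j]     # partial sums stay resident in the CU
    return acc

def inner_product_step(attn_sub, v_tile_rows):
    """V-cache flow (row-wise): broadcast an attention sub-vector and
    reduce it against matching rows of V as dot products."""
    return [sum(a * v for a, v in zip(attn_sub, row)) for row in v_tile_rows]

print(outer_product_step([1, 2], [[1, 0], [0, 1]]))      # → [1, 2]
print(inner_product_step([1, 1], [[1, 2], [3, 4]]))      # → [3, 7]
```

In both flows every broadcast element immediately pairs with weights already resident in the bank, which is what keeps the CUs busy on every cycle.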
5. Evaluation Results
CD-PIM has been evaluated using a modified Ramulator 2.0 simulation framework, modeling 4 GB LPDDR5 dies with platforms including Jetson Orin and iPhone 15 Pro. Benchmarks include LLaMA-1B, 7B, and 13B across decode- and compute-intensive scenarios.
| Comparison | Metric | Result |
|---|---|---|
| CD-PIM vs. GPU-only (HBCEM, batch=1) | Decode-stage speedup | 4.5×–14.6× |
| CD-PIM vs. AttAcc PIM (HBCEM, batch=1) | Decode-stage speedup | Mean speedup reported |
| LBIM vs. HBCEM (batch=4, compute-bound) | End-to-end latency reduction | From $1.01\times$ |
CD-PIM achieves up to 14.6× decode-stage acceleration on resource-constrained edge platforms, with area and power overheads considered modest in context (Lin et al., 18 Jan 2026).
6. CD-PIM for Confidential Computing
CD-PIM, in the context of PIM-Enclave (Duy et al., 2021), refers to confidential data processing within memory banks, realized via secure enclave extensions to PIM systems:
- Architecture: Each DRAM bank incorporates a PIM processor core, local scratchpad, AES-GCM–capable DMA engine, and bank-level access-control.
- Security Model: Formalizes IND-CPA confidentiality and detection of integrity violations through AES-GCM authenticated encryption. Banks resist bus side-channels, cold-boot, replay, and unauthorized access; the adversary is assumed to control host software and physical buses but not the PIM internals.
- Programming Model: Host enclaves offload kernels, data, and parameters via encrypted DMA to in-memory enclaves, perform remote attestation, and set up session/data keys. All DMA transfers are cryptographically protected, and the host CPU is denied direct access to protected regions during execution.
- Evaluation: PIM-Enclave eliminates host–memory bus side-channels while adding only minimal overhead versus insecure PIM models; encrypted DMA sustains high throughput with a modest bandwidth penalty over unsecured DMA.
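The encrypted-DMA flow can be illustrated with a toy authenticated-encryption model. PIM-Enclave uses hardware AES-GCM; the SHA-256 keystream and HMAC tag below are stdlib stand-ins for the cipher, not the actual construction, and serve only to show the encrypt-then-verify protocol shape:

```python
import hmac
import hashlib

def _keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    """Toy counter-mode keystream (illustrative stand-in for AES)."""
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return out[:n]

def dma_encrypt(key: bytes, nonce: bytes, plaintext: bytes):
    """Host enclave -> PIM enclave: ciphertext plus integrity tag."""
    ks = _keystream(key, nonce, len(plaintext))
    ct = bytes(p ^ k for p, k in zip(plaintext, ks))
    tag = hmac.new(key, nonce + ct, hashlib.sha256).digest()
    return ct, tag

def dma_decrypt(key: bytes, nonce: bytes, ct: bytes, tag: bytes) -> bytes:
    """PIM-enclave side: reject tampered transfers before decrypting."""
    expected = hmac.new(key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity violation detected")
    ks = _keystream(key, nonce, len(ct))
    return bytes(c ^ k for c, k in zip(ct, ks))

key, nonce = b"0123456789abcdef", b"nonce123"
ct, tag = dma_encrypt(key, nonce, b"model weights")
assert dma_decrypt(key, nonce, ct, tag) == b"model weights"
```

Flipping any ciphertext byte makes the tag check fail, modeling the integrity-violation detection that the architecture enforces on every DMA transfer.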
This instantiation of CD-PIM demonstrates secure PIM as a practical and efficient substrate for confidential computing at in-memory scale (Duy et al., 2021).
7. Broader Significance and Implications
CD-PIM advances both performance and security for memory-intensive machine learning and confidential data workloads in edge and, potentially, cloud domains. By addressing internal bandwidth scaling, maximizing per-bank CU throughput, and supporting secure enclaves, CD-PIM enables LLM inference and sensitive-workload acceleration on platforms where energy and area budgets are tightly constrained and where data movement across external interfaces exposes privacy risks. This suggests that future edge and hybrid-cloud accelerators will increasingly adopt PIM paradigms that merge high throughput with cryptographic capability, facilitated by the nuanced bank, mode, and tile management embodied in CD-PIM (Lin et al., 18 Jan 2026, Duy et al., 2021).