CD-PIM: LLM Acceleration & Confidential Computing
- CD-PIM is a dual-architecture in-memory computing paradigm that accelerates low-batch LLM inference on edge devices while securely processing confidential workloads.
- It employs techniques such as bank subdivision, time multiplexing, and targeted data mapping to optimize both GEMV and GEMM operations efficiently.
- The architecture achieves up to 14.6× speedup in decode stages and maintains modest area and power overheads, enabling performance and security in constrained environments.
CD-PIM refers to two distinct but thematically related architectures in the processing-in-memory (PIM) field: (1) Cross-Division PIM, a bandwidth- and compute-optimized digital PIM architecture built atop LPDDR5 for low-batch LLM inference on edge devices (Lin et al., 18 Jan 2026), and (2) Confidential Data Processing-In-Memory, a security-enhanced PIM system for confidential workload offloading, as realized in the PIM-Enclave architecture (Duy et al., 2021). Both approaches fundamentally exploit PIM principles to minimize data movement, maximize in-situ compute throughput, and—optionally—enforce end-to-end security on sensitive workloads.
1. Design Goals and Key Challenges
CD-PIM for LLM acceleration on edge platforms targets the autoregressive decoding stage, characterized by memory-bound general matrix-vector multiplications (GEMV) at low batch sizes. Three principal challenges underlie this design:
- Insufficient Bank-Level Bandwidth: Commodity LPDDR5, as used in mobile and edge SoCs, provides limited internal bank parallelism compared to HBM-based systems, constraining per-bank GEMV acceleration.
- Resource Underutilization with Mixed Workloads: Standard PIM deployments execute PIM compute and normal DRAM access in a blocked, mutually exclusive mode, so alternating GEMM (compute-bound, typically prefill) and GEMV (decode) phases leave resources idle.
- Limited Bank Compute Capacity: A single CU per bank running at DRAM frequencies cannot fully exploit the available internal I/O bandwidth, restricting overall throughput.
The goal is to enable high-throughput, low-latency GEMV execution during the decoding phase, while supporting overlapping prefill stage GEMM execution to maximize hardware utilization on edge devices such as NVIDIA Jetson AGX Orin and Apple iPhone 15 Pro (Lin et al., 18 Jan 2026).
2. High-Bandwidth Compute-Efficient Mode (HBCEM) and Low-Batch Interleaving Mode (LBIM)
HBCEM
In HBCEM, each LPDDR5 bank is subdivided into four independent pseudo-banks ("Pbanks") by segmenting the global bitlines (GBLs) via two orthogonal cuts (left/right and upper/lower) and associated isolation transistors. Each segment, addressed as Bank_TL, Bank_TR, Bank_BL, Bank_BR, possesses its own sense amplifier. This structural partitioning enables:
- Quadrupled Effective Bandwidth: Segmenting the GBLs lets all four Pbanks stream data simultaneously, so a bank with internal bandwidth $B$ delivers an effective $4B$ in HBCEM.
- Scalable Internal Throughput: On platforms with 16 banks, the 64 Pbanks operate in parallel, scaling aggregate die-internal bandwidth by the same 4× factor over the baseline die.
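The bandwidth scaling above is straight multiplication; a minimal sketch (the 6.4 GB/s per-bank figure is a hypothetical placeholder, not a number from the paper):

```python
def pbank_bandwidth(bank_bw_gbps: float, pbanks_per_bank: int = 4) -> float:
    """Effective per-bank bandwidth when all four Pbanks stream in parallel."""
    return bank_bw_gbps * pbanks_per_bank

def die_bandwidth(bank_bw_gbps: float, banks: int = 16) -> float:
    """Aggregate die-internal bandwidth across all banks under HBCEM."""
    return pbank_bandwidth(bank_bw_gbps) * banks

# With a hypothetical 6.4 GB/s per bank, the die-internal aggregate is
# 4 * 16 = 64x the single-bank figure.
print(die_bandwidth(6.4))  # → 409.6
```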
LBIM
LBIM interleaves the GEMV computation on PIM with GEMM execution on the host CPU by time-multiplexing the Pbanks:
- Two Pbanks are dedicated to independent GEMV (decode) operations, while the other two serve normal DRAM requests to support concurrent GEMM (prefill) on the CPU.
- End-to-end latency for LBIM is approximately $\max(T_{\text{GEMV}}, T_{\text{GEMM}})$, contrasting with the sequential $T_{\text{GEMV}} + T_{\text{GEMM}}$ of HBCEM.
Empirical results demonstrate reduced end-to-end latency on compute-bound low-batch workloads via LBIM, with speedups ranging upward from $1.01\times$ (Lin et al., 18 Jan 2026).
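The latency contrast can be sketched as a two-term model. The overlap of the two phases is from the paper; the assumption that halving the Pbanks available to GEMV doubles its time is an illustrative simplification, and the millisecond values are placeholders:

```python
def hbcem_latency(t_gemv_ms: float, t_gemm_ms: float) -> float:
    """HBCEM: all four Pbanks serve GEMV, so the host's GEMM must wait
    for DRAM access -- the two phases run back to back."""
    return t_gemv_ms + t_gemm_ms

def lbim_latency(t_gemv_ms: float, t_gemm_ms: float) -> float:
    """LBIM: two Pbanks run GEMV (assumed 2x slower with half the
    bandwidth) while the other two serve the host's GEMM; the phases
    overlap, so the critical path is the slower of the two."""
    return max(2 * t_gemv_ms, t_gemm_ms)

# Compute-bound case (GEMM dominates): overlapping wins.
print(hbcem_latency(2.0, 10.0))  # → 12.0
print(lbim_latency(2.0, 10.0))   # → 10.0
```

When GEMM dominates, hiding the GEMV time behind it shrinks the end-to-end latency, matching the compute-bound regime where LBIM pays off.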
3. Compute Unit (CU) Microarchitecture
Each Pbank includes two pipelined compute units, with the following microarchitectural features:
- Pipeline and Buffering: Each CU comprises a 64 B input buffer and a 128 B output buffer, accepting 8-bit weight streams serially.
- Throughput: Each cycle, a CU receives 1 B of input and multiplies it against two 32 B weight vectors, generating 64 partial sums; at a clock rate of $200$ MHz, per-bank throughput is $256$ MACs per cycle.
- Totals: Across 16 banks at $200$ MHz, aggregate throughput reaches approximately $819.2$ GMAC/s ($0.82$ TMAC/s).
Each CU occupies a small area and draws $4.5$ mW in TSMC 28 nm; two CUs per bank imply a total power cost of $144$ mW per $32$ Gb die (Lin et al., 18 Jan 2026).
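The aggregate figure follows directly from the per-bank numbers (16 banks × 256 MACs/cycle × 200 MHz):

```python
BANKS = 16
MACS_PER_BANK_PER_CYCLE = 256
CLOCK_HZ = 200e6  # 200 MHz DRAM-side clock

aggregate_mac_s = BANKS * MACS_PER_BANK_PER_CYCLE * CLOCK_HZ
print(aggregate_mac_s / 1e9)  # → 819.2 (GMAC/s)
```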
4. Data-Mapping and Execution Strategies
CD-PIM applies tailored data-mapping strategies to maximize CU occupancy for both outer-product and inner-product flows in LLM attention mechanisms:
- K-Cache (Column-Wise Mapping): The K-cache matrix is stored column-wise as tiles within each bank. Query vectors (Q) are broadcast in 64-element sub-vectors, enabling the CUs to perform efficient outer-product accumulation across all output columns of a tile.
- V-Cache (Row-Wise Mapping): The V-cache is mapped row-wise as tiles within each bank; attention-weight vectors are broadcast in matching 64-element sub-vectors, enabling inner-product (dot-product) reduction.
These arrangements ensure that both CUs per Pbank remain continuously active across all cycles, fully utilizing internal parallelism (Lin et al., 18 Jan 2026).
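The two dataflows can be contrasted in a pure-Python sketch. The 64-element sub-vector broadcast is from the text; the tile shapes here are illustrative, and the loops model arithmetic order only, not cycle timing:

```python
def outer_product_step(q_sub, k_tile):
    """K-cache flow (column-wise): broadcast a Q sub-vector and accumulate
    its rank-1 (outer-product) contribution against a K tile.
    k_tile: one row of weights per incoming Q element."""
    cols = len(k_tile[0])
    acc = [0] * cols
    for i, q in enumerate(q_sub):          # one input element per cycle
        for j in range(cols):
            acc[j] += q * k_tile[i][j]     # partial sums stay resident in the CU
    return acc

def inner_product_step(attn_sub, v_tile_rows):
    """V-cache flow (row-wise): broadcast an attention sub-vector and
    reduce it against matching rows of V as dot products."""
    return [sum(a * v for a, v in zip(attn_sub, row)) for row in v_tile_rows]

print(outer_product_step([1, 2], [[1, 0], [0, 1]]))      # → [1, 2]
print(inner_product_step([1, 1], [[1, 2], [3, 4]]))      # → [3, 7]
```

In both flows every broadcast element immediately pairs with weights already resident in the bank, which is what keeps the CUs busy on every cycle.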
5. Evaluation Results
CD-PIM has been evaluated using a modified Ramulator 2.0 simulation framework, modeling 4 GB LPDDR5 dies with platforms including Jetson Orin and iPhone 15 Pro. Benchmarks include LLaMA-1B, 7B, and 13B across decode- and compute-intensive scenarios.
| Comparison | Metric | Result |
|---|---|---|
| CD-PIM vs. GPU-only (HBCEM, batch=1) | Decode-stage speedup | 4.5×–14.6× |
| CD-PIM vs. AttAcc PIM (HBCEM, batch=1) | Decode-stage speedup | Mean speedup reported |
| LBIM vs. HBCEM (batch=4, compute-bound) | End-to-end latency reduction | From $1.01\times$ |
CD-PIM achieves up to 14.6× decode-stage acceleration on resource-constrained edge platforms, with area and power overheads considered modest in context (Lin et al., 18 Jan 2026).
6. CD-PIM for Confidential Computing
CD-PIM, in the context of PIM-Enclave (Duy et al., 2021), refers to confidential data processing within memory banks, realized via secure enclave extensions to PIM systems:
- Architecture: Each DRAM bank incorporates a PIM processor core, local scratchpad, AES-GCM–capable DMA engine, and bank-level access-control.
- Security Model: Formalizes IND-CPA confidentiality and detection of integrity violations through AES-GCM authenticated encryption. Banks resist bus side-channels, cold-boot, replay, and unauthorized access; the adversary is assumed to control host software and physical buses but not the PIM internals.
- Programming Model: Host enclaves offload kernels, data, and parameters via encrypted DMA to in-memory enclaves, perform remote attestation, and set up session/data keys. All DMA transfers are cryptographically protected, and the host CPU is denied direct access to protected regions during execution.
- Evaluation: PIM-Enclave eliminates host–memory bus side-channels while adding only minimal overhead versus insecure PIM models; encrypted DMA sustains high throughput with a modest bandwidth penalty over unsecured DMA.
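The encrypted-DMA flow can be illustrated with a toy authenticated-encryption model. PIM-Enclave uses hardware AES-GCM; the SHA-256 keystream and HMAC tag below are stdlib stand-ins for the cipher, not the actual construction, and serve only to show the encrypt-then-verify protocol shape:

```python
import hmac
import hashlib

def _keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    """Toy counter-mode keystream (illustrative stand-in for AES)."""
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return out[:n]

def dma_encrypt(key: bytes, nonce: bytes, plaintext: bytes):
    """Host enclave -> PIM enclave: ciphertext plus integrity tag."""
    ks = _keystream(key, nonce, len(plaintext))
    ct = bytes(p ^ k for p, k in zip(plaintext, ks))
    tag = hmac.new(key, nonce + ct, hashlib.sha256).digest()
    return ct, tag

def dma_decrypt(key: bytes, nonce: bytes, ct: bytes, tag: bytes) -> bytes:
    """PIM-enclave side: reject tampered transfers before decrypting."""
    expected = hmac.new(key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity violation detected")
    ks = _keystream(key, nonce, len(ct))
    return bytes(c ^ k for c, k in zip(ct, ks))

key, nonce = b"0123456789abcdef", b"nonce123"
ct, tag = dma_encrypt(key, nonce, b"model weights")
assert dma_decrypt(key, nonce, ct, tag) == b"model weights"
```

Flipping any ciphertext byte makes the tag check fail, modeling the integrity-violation detection that the architecture enforces on every DMA transfer.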
This instantiation of CD-PIM demonstrates secure PIM as a practical and efficient substrate for confidential computing at in-memory scale (Duy et al., 2021).
7. Broader Significance and Implications
CD-PIM advances both performance and security for memory-intensive machine learning and confidential data workloads in edge and, potentially, cloud domains. By addressing internal bandwidth scaling, maximizing per-bank CU throughput, and supporting secure enclaves, CD-PIM enables LLM inference and sensitive-workload acceleration on platforms where energy and area budgets are tightly constrained and where data movement across external interfaces exposes privacy risks. This suggests that future edge and hybrid-cloud accelerators will increasingly adopt PIM paradigms that merge high throughput with cryptographic capability, facilitated by the nuanced bank, mode, and tile management embodied in CD-PIM (Lin et al., 18 Jan 2026, Duy et al., 2021).