
Low-Batch Interleaving Mode (LBIM)

Updated 25 January 2026
  • LBIM is a mode in CD-PIM architectures that time-multiplexes DRAM pseudo-banks to concurrently handle GEMV and GEMM operations, reducing inference latency.
  • It strategically schedules PIM and CPU workloads, achieving up to 1.46× speedup in low-batch LLM inference over high-bandwidth modes.
  • LBIM enhances edge computing efficiency by balancing compute and memory bandwidth tradeoffs, enabling fast, low-power LLM processing.

CD-PIM refers to digital Processing-In-Memory (PIM) architectures that implement either cross-division bank-level computation for bandwidth and utilization optimization or confidential computation through memory-resident security enclaves. Recent CD-PIM designs have emerged to address the challenges of memory-bound LLM inference on edge devices and of securely offloading data-intensive workloads to trusted memory banks. Architectures such as CD-PIM for low-batch LLM acceleration (Lin et al., 18 Jan 2026) and PIM-Enclave for confidential computation (Duy et al., 2021) implement distinct hardware and computational strategies to extend the PIM paradigm.

1. Architectural Foundations of CD-PIM

Bank-level CD-PIM leverages digital computation units placed within DRAM banks to accelerate general matrix-vector multiplication (GEMV) and general matrix-matrix multiplication (GEMM) operations central to transformer-based LLM inference. The architecture utilizes commodity LPDDR5 DRAM equipped with two pipelined computing units (CUs) per bank, partitioning each physical LPDDR5 bank into four pseudo-banks ("Pbanks") to maximize internal bandwidth and parallelism. High-level features include:

  • Per-bank compute: Two identical CUs, each capable of 8-bit MACs at double the DRAM internal clock frequency ($f_{CU} = 2 \times f_{int}$).
  • Bank segmentation: Each bank is divided into upper/lower and left/right segments via isolation transistors and split global bitlines, enabling independent activation of four Pbanks.
  • Overlapped workload support: Specialized modes for simultaneous GEMM (CPU) and GEMV (PIM).
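The four-way segmentation above can be illustrated with a toy address decoder that selects a Pbank from one row bit (upper/lower segment) and one column bit (left/right segment). The concrete bit positions are assumptions for illustration; the paper does not specify them.

```python
# Toy model of four-Pbank selection within one LPDDR5 bank.
# ROW_HALF_BIT and COL_HALF_BIT are illustrative assumptions,
# not values taken from the CD-PIM paper.

ROW_HALF_BIT = 15  # upper vs. lower segment (isolation transistors)
COL_HALF_BIT = 9   # left vs. right segment (split global bitlines)

def pbank_index(row: int, col: int) -> int:
    """Map a (row, col) address to one of four Pbanks (0..3)."""
    upper = (row >> ROW_HALF_BIT) & 1
    right = (col >> COL_HALF_BIT) & 1
    return (upper << 1) | right

# Each Pbank has its own sense amplifiers, so all four indices
# can be activated independently.
```

Because the two selector bits are independent, the four Pbanks cover the address space without overlap, which is what permits independent activation.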

In contrast, confidential CD-PIM as realized by PIM-Enclave (Duy et al., 2021) organizes each memory bank (or vault) to include a lightweight RISC core, AES-GCM-capable DMA engine, and access-control logic tied to a root-of-trust key and ROM. The memory module enables remote attestation, secure session and data key provisioning, and hardware-based protection against unauthorized access and side-channel leakage.

2. Modes of Operation: HBCEM and LBIM

CD-PIM for LLM acceleration introduces two principal operating modes on LPDDR5 (Lin et al., 18 Jan 2026):

High-Bandwidth Compute-Efficient Mode (HBCEM)

  • Each LPDDR5 bank is subdivided into four Pbanks, each with independent sense amplifiers.
  • All Pbanks are activated in parallel via the PIM_MAC_FM instruction, multiplying the bandwidth by 4. For example, with $f_{int}=200$ MHz and $W=16$ B, one bank’s bandwidth increases from $3.2$ GB/s to $12.8$ GB/s.
  • Across 16 banks, the aggregate internal bandwidth reaches $204.8$ GB/s.
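The bandwidth figures above follow directly from the quoted parameters; a short sketch of the arithmetic:

```python
# Internal-bandwidth arithmetic for HBCEM, using the figures quoted in
# the text: f_int = 200 MHz, access width W = 16 B, 4 Pbanks per bank,
# 16 banks per device.

F_INT_HZ = 200e6   # DRAM internal clock
W_BYTES = 16       # bytes fetched per Pbank per internal cycle
PBANKS_PER_BANK = 4
BANKS = 16

per_pbank_gbps = F_INT_HZ * W_BYTES / 1e9          # 3.2 GB/s
per_bank_gbps = per_pbank_gbps * PBANKS_PER_BANK   # 12.8 GB/s
aggregate_gbps = per_bank_gbps * BANKS             # 204.8 GB/s

print(per_pbank_gbps, per_bank_gbps, aggregate_gbps)
```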

Low-Batch Interleaving Mode (LBIM)

  • The four Pbanks per bank are time-multiplexed: two process GEMV operations for LLM decoding, while the remaining two handle regular DRAM requests for CPU-driven GEMM.
  • Enables concurrent execution where GEMV (decode) is processed on PIM and GEMM (prefill) on the CPU, reducing total inference latency:
    • $T_{HBCEM} = T_{prefill} + T_{decode}$
    • $T_{LBIM} = \max(T_{prefill}, T_{decode\_PIM}/2)$
  • LBIM leverages the bandwidth/compute tradeoff for improved latency in compute-bound low-batch settings, empirically yielding up to $1.46\times$ speedup over HBCEM.
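The two latency expressions can be compared with a minimal sketch. The formulas are reproduced verbatim from the text (including the factor of 2 on the PIM decode term); the example timings are hypothetical, in arbitrary units.

```python
# Sketch of the HBCEM vs. LBIM latency models quoted above. Under
# HBCEM, prefill and decode run serially; under LBIM, decode on PIM
# overlaps with CPU prefill, so total latency is bounded by the
# slower of the two stages.

def t_hbcem(t_prefill: float, t_decode: float) -> float:
    # Serial execution: prefill, then decode
    return t_prefill + t_decode

def t_lbim(t_prefill: float, t_decode_pim: float) -> float:
    # Overlapped execution, formula taken verbatim from the text
    return max(t_prefill, t_decode_pim / 2)

# Hypothetical timings for illustration:
speedup = t_hbcem(10.0, 4.0) / t_lbim(10.0, 4.0)
print(speedup)  # 1.4
```

Whenever decode fully hides under prefill, the LBIM latency collapses to the prefill time alone, which is where the largest speedups arise.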

3. Compute Unit Microarchitecture and Data-Mapping

Each Pbank integrates two deeply pipelined CUs, designed for maximal multiply-accumulate throughput:

  • Each CU contains a 64 B input buffer and a 128 B output partial-sum buffer.
  • The CU operates on 32 B wide weight vectors, processing two input-by-weight outer products per cycle.
  • At twice the DRAM internal clock ($400$ MHz for $f_{int}=200$ MHz), each CU delivers $128$ MACs per cycle; two CUs yield $256$ MACs per bank per cycle, achieving $\sim 819.2$ GMAC/s on 16 banks.
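The quoted $\sim 819.2$ GMAC/s total is consistent with counting 256 MACs per bank at the DRAM internal clock; that interpretation is an assumption made here to match the number in the text.

```python
# MAC-throughput roll-up reproducing the ~819.2 GMAC/s figure,
# taking "per cycle" as the DRAM internal clock (f_int = 200 MHz).

F_INT_HZ = 200e6
MACS_PER_BANK_PER_INT_CYCLE = 256   # two CUs x 128 MACs
BANKS = 16

gmacs = MACS_PER_BANK_PER_INT_CYCLE * F_INT_HZ * BANKS / 1e9
print(gmacs)  # 819.2
```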

Optimized data-mapping strategies maximize CU utilization for both transformer K-cache (outer-product flow) and V-cache (inner-product flow):

  • Column-wise (K-cache): $2 \times 64$ tile per bank, broadcast $Q$ vector, bank-wise outer-product accumulation.
  • Row-wise (V-cache): $64 \times 2$ tile per bank, broadcast attention-weight vector, bank-wise inner-product calculation.
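A toy sketch of the two per-bank tile shapes, in pure Python with arbitrary values. The outer-/inner-product labels in the text refer to how partial results accumulate across banks; only one bank's tiles are shown here.

```python
# Per-bank tile shapes for the two mappings. DIM = 64 matches the
# CU's 64 B input buffer at 8-bit precision; all values are arbitrary.

DIM = 64

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# K-cache, column-wise: a 2 x 64 K tile plus a broadcast query vector Q,
# producing two partial attention scores for this bank.
K_tile = [[1] * DIM, [2] * DIM]
q = [1] * DIM
scores_partial = [dot(row, q) for row in K_tile]

# V-cache, row-wise: a 64 x 2 V tile plus broadcast attention weights,
# producing two output partial sums for this bank.
V_tile = [[1, 2] for _ in range(DIM)]
attn = [1] * DIM
out_partial = [dot(attn, [r[j] for r in V_tile]) for j in range(2)]

print(scores_partial, out_partial)  # [64, 128] [64, 128]
```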

4. Confidential Processing-In-Memory: PIM-Enclave CD-PIM

PIM-Enclave (Duy et al., 2021) supports confidential CD-PIM by constructing a memory-resident secure enclave leveraging:

  • AES-GCM-capable DMA engine for zero-copy encrypted data transfer.
  • Access-control logic per bank to enforce "in-enclave" memory regions, inaccessible to unauthorized host reads/writes via efficient on-die range checks.
  • EK (endorsement key) and ROM for attestation, key establishment, and root-of-trust.

Operational workflow includes:

  • Remote attestation where host verifies enclave integrity and provisions session/data keys using asymmetric encryption.
  • Protected regions are locked via base and mask registers.
  • All DMA transfers are authenticated and encrypted via hardware, ensuring IND-CPA confidentiality and tamper detection (integrity).
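The base/mask region locking can be sketched as a simple range match, assuming a power-of-two aligned protected region. The base and mask registers come from the text; the concrete values and helper names are illustrative.

```python
# Minimal sketch of base/mask region locking for in-enclave memory.
# A host read whose address falls inside the locked region is refused
# by the on-die range check and returns None (standing in for ⊥).

REGION_BASE = 0x4000_0000   # base register: start of enclave region
REGION_MASK = 0xFFF0_0000   # mask register: locks a 1 MiB region

DRAM = {0x0000_1000: 0xAB}  # toy backing store for unprotected memory

def host_read(addr: int):
    """Host-side read path with the enclave range check applied."""
    if (addr & REGION_MASK) == (REGION_BASE & REGION_MASK):
        return None           # access to the enclave region is refused
    return DRAM.get(addr, 0)  # normal DRAM path

print(host_read(0x4000_1234), host_read(0x0000_1000))  # None 171
```

A single AND-and-compare per request is what makes this check cheap enough to sit on the memory access path.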

Programming interface resembles GPU offload paradigms, with explicit enclaving, encrypted code/data transfer, execution, and result retrieval.
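A hypothetical host-side view of that offload flow is sketched below. Every name is illustrative (the paper does not define a concrete API), and the hardware AES-GCM DMA path is stubbed with an XOR keystream plus an HMAC tag purely to keep the sketch self-contained.

```python
# Toy host-side offload sequence: seal (encrypt + tag) data for
# transfer into the enclave, then unseal the returned result.
# NOT real AES-GCM; a stand-in for the encrypted DMA engine.

import hmac, hashlib

KEY = b"session-key-16by"  # would be provisioned during attestation

def seal(data: bytes) -> bytes:
    body = bytes(b ^ KEY[i % len(KEY)] for i, b in enumerate(data))
    tag = hmac.new(KEY, body, hashlib.sha256).digest()[:16]
    return tag + body       # tag enables the integrity check

def unseal(blob: bytes) -> bytes:
    tag, body = blob[:16], blob[16:]
    expect = hmac.new(KEY, body, hashlib.sha256).digest()[:16]
    if not hmac.compare_digest(tag, expect):
        raise ValueError("tampered transfer")  # replay/tamper detection
    return bytes(b ^ KEY[i % len(KEY)] for i, b in enumerate(body))

# Offload sequence mirroring the text: encrypted transfer in,
# enclave-side execution (elided), encrypted result back out.
payload = seal(b"input tensor bytes")
result = unseal(payload)
print(result)
```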

5. Evaluation Metrics and System Impact

Evaluation of CD-PIM for LLMs on edge devices via Ramulator 2.0 simulation and real device benchmarks (Lin et al., 18 Jan 2026) demonstrates:

| System      | Batch      | Mode             | Speedup vs. GPU-only | Speedup vs. AttAcc | Area Overhead |
|-------------|------------|------------------|----------------------|--------------------|---------------|
| Jetson Orin | 1 (single) | HBCEM            | 11.42×               | 4.25×              | ~0.8%         |
| iPhone 15   | 1 (single) | HBCEM            | 4.5×–18.6×           | –                  | –             |
| Jetson Orin | 4 (low)    | LBIM (vs. HBCEM) | 1.01×–1.46×          | –                  | –             |
Each CU incurs $\sim 14{,}941\,\mu\text{m}^2$ of area and $4.5$ mW; the cumulative addition for two CUs per bank remains modest relative to die size and power ($\approx 144$ mW on a $32$ Gb die).
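The cumulative figures follow from the per-CU numbers quoted above:

```python
# Area/power roll-up from the per-CU figures in the text:
# ~14,941 um^2 and 4.5 mW per CU, with two CUs in each of 16 banks.

CU_AREA_UM2 = 14_941
CU_POWER_MW = 4.5
NUM_CUS = 2 * 16

total_power_mw = CU_POWER_MW * NUM_CUS        # 144.0 mW on the 32 Gb die
total_area_mm2 = CU_AREA_UM2 * NUM_CUS / 1e6  # ~0.478 mm^2
print(total_power_mw, round(total_area_mm2, 3))
```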

Confidential CD-PIM via PIM-Enclave (Duy et al., 2021) achieves:

  • Linear speedup with >6 banks.
  • Encrypted DMA overhead of only $3.7\%$–$22\%$.
  • Robust mitigation of bus side-channel leakages, cold-boot, replay/tampering.

6. Security Properties and Threat Model

Confidential CD-PIM designs define adversaries capable of system software compromise, physical bus analysis, DMA, and cold-boot attacks. Security goals are:

  • Confidentiality: IND-CPA assurance for enclave memory contents.
  • Integrity: AES-GCM tag-based tampering/replay protection.
  • Side-channel resilience: No runtime leakage of memory access patterns on external buses.

Hardware-based access controls ensure in-enclave memory exclusivity during execution; all host read attempts to protected regions return $\perp$. Remote attestation allows cryptographic verification of enclave state prior to execution.

7. Context and Adaptability

CD-PIM architectures, through either high-bandwidth multi-bank compute or secure bank-resident enclaving, offer scalable, efficient solutions for both edge intelligence workloads (LLM inference) and confidential data processing. CD-PIM/LLM implementations optimize for low-latency, low-batch scenarios, achieving significant throughput and latency improvements over GPU-only and previous PIM baselines. Confidential CD-PIM strategies eliminate practical bus-level side-channels while maintaining minimal performance overhead. Adoption in memory-centric acceleration and cloud security primitives suggests ongoing relevance as DRAM and PIM paradigms evolve (Lin et al., 18 Jan 2026, Duy et al., 2021).
