
L3-Based Resource Allocation for LLM Inference

Updated 5 January 2026
  • L3-based resource allocation is a co-designed method that distributes computational loads between GPUs and DIMM-PIM to overcome memory and bandwidth limitations in LLM inference.
  • Its adaptive scheduling algorithm balances decoding and prefill operations, achieving up to 14.3× larger batch sizes than HBM-only GPU baselines and up to 6.1× speedup over state-of-the-art HBM-PIM.
  • Innovative techniques such as bit-level re-layout and rank-set interleaving enhance data transfer and reduce latency, enabling near-linear improvements with increased DIMM-PIM capacity and bandwidth.

L3-based resource allocation refers to the hardware–software co-designed methodology for efficient distribution and orchestration of computational resources between GPUs and DIMM-based Processing-In-Memory (DIMM-PIM) subsystems during long-context LLM inference. The L3 system achieves scalability in memory capacity and bandwidth by offloading the decoding phase of multi-head attention (MHA)—the principal bottleneck for context length and batch size—from GPU high-bandwidth memory (HBM) to host-side DIMM-PIM, overcoming trade-offs inherent to conventional HBM-accelerated architectures (Liu et al., 24 Apr 2025).

1. Architectural Overview and Resource Model

L3 operates on a heterogeneous platform comprising $G$ GPUs, each characterized by FP16 throughput $C_{\mathrm{GPU}}$ [TFLOP/s], HBM capacity $H_{\mathrm{GPU}}$ [bytes], and HBM bandwidth $B_{\mathrm{HBM}}$ [bytes/s]. The host-side memory subsystem is augmented with DIMM-PIM capability, consisting of $D$ channels, $R$ ranks per channel, and $B$ banks per rank, with each DRAM chip featuring a data bus width $w_{\mathrm{chip}}$.

The total PIM memory capacity is formalized as
$$C_{\mathrm{PIM}} = D \cdot R \cdot B \cdot (\text{rows} \cdot \text{cols} \cdot w_{\mathrm{chip}})$$
Peak bandwidth per channel is $B_{\mathrm{PIM,chan}}$, yielding total PIM bandwidth
$$B_{\mathrm{PIM}} = D \cdot B_{\mathrm{PIM,chan}}$$
Data transfer between GPU and host is mediated by PCIe with bandwidth $B_{\mathrm{PCIe}}$.
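The capacity and bandwidth formulas above can be sketched directly. In this minimal Python sketch, all platform parameters (channel count, rank/bank geometry, array dimensions, per-channel bandwidth) are illustrative assumptions, not figures from the paper:

```python
# Sketch of the L3 resource model. All platform parameters below are
# illustrative assumptions, not values reported in the paper.

def pim_capacity_bytes(D, R, B, rows, cols, w_chip_bits):
    """C_PIM = D * R * B * (rows * cols * w_chip), converted from bits to bytes."""
    return D * R * B * rows * cols * w_chip_bits // 8

def pim_bandwidth_bytes_per_s(D, b_chan):
    """B_PIM = D * B_PIM,chan (aggregate over all channels)."""
    return D * b_chan

# Hypothetical DDR-class platform: 8 channels, 2 ranks/channel,
# 16 banks/rank, 64Ki x 1Ki cell arrays, x8 chips.
C_PIM = pim_capacity_bytes(D=8, R=2, B=16, rows=65536, cols=1024, w_chip_bits=8)
B_PIM = pim_bandwidth_bytes_per_s(D=8, b_chan=25.6e9)
print(C_PIM // 2**30, "GiB total PIM capacity")  # 16 GiB under these assumptions
```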

2. Adaptive Scheduling and Latency Modeling

L3’s scheduler orchestrates requests in prefilling and decoding phases, striving for maximal overlap of GPU and PIM pipelines to suppress idle bubbles. In each iteration, two sub-batches are constructed:

  • Prefilling requests: tracked by the set $\mathcal{P}_i$, each with finished-token count $f_s$, possibly processed in chunks of size $c_s \le L_s$.
  • Decoding requests: tracked by the set $\mathcal{D}_i$, with total context length $L_r$ per request.

Critical latencies per sub-batch $i$:

  • GPU-side latency: $T^{\mathrm{GPU}}_i = \mathrm{RFR}_{\mathrm{prefill}}\bigl(\sum_{s\in\mathcal{P}_i}(L_s - f_s), c, \dots\bigr) + \mathrm{RFR}_{\mathrm{batch}}\bigl(\sum_{r\in\mathcal{D}_i} L_r, \dots\bigr)$
  • PIM-side latency: $T^{\mathrm{PIM}}_i = \alpha \sum_{r \in \mathcal{D}_i} L_r + \beta \sum_{r \in \mathcal{D}_i} L_r - \gamma \sum_{s \in \mathcal{P}_i} c_s$, with $\alpha \approx 1/(\text{PIM FLOP/s})$, $\beta \approx 1/B_{\mathrm{PCIe}}$, and $\gamma$ encoding the overlap of prefill transfers.

Each iteration solves
$$\min \bigl\{ \max_i T^{\mathrm{GPU}}_i,\ \max_i T^{\mathrm{PIM}}_i \bigr\}$$
subject to the memory-capacity constraints
$$\sum_{r \in \mathcal{D}_i} (\text{bytes per token}) \cdot L_r \;\le\; H_{\mathrm{GPU}},\ C_{\mathrm{PIM}}$$
(the KV-cache footprint must fit both HBM and DIMM-PIM)

and the rate constraints
$$\beta \sum_{r \in \mathcal{D}_i} L_r \le 1, \qquad \alpha \sum_{r \in \mathcal{D}_i} L_r \le 1$$

In this way, resource allocation adapts dynamically to both memory and bandwidth constraints.

3. Heuristic Scheduling Procedure

The practical L3 scheduler employs a greedy iterative procedure:

  1. Pull decoding requests that fit host memory; partition them into $\mathcal{D}_0$ and $\mathcal{D}_1$ with balanced context-length sums.
  2. Initialize $\mathcal{P}_0 = \mathcal{P}_1 = \emptyset$; add the largest remaining prefilling requests to sub-batch 0 until $T^{\mathrm{GPU}}_0 > T^{\mathrm{PIM}}_1$, and symmetrically for sub-batch 1.
  3. If an imbalance remains, select one request in each sub-batch and set its chunk size $c_s$ to equalize $T^{\mathrm{GPU}}_i$ and $T^{\mathrm{PIM}}_{1-i}$ (solving a linear equation for $c_s$).
  4. Update the per-request counters $f_s \leftarrow f_s + c_s$. Unfinished chunks are re-enqueued for subsequent iterations.

This approach balances GPU and PIM compute utilization and overlaps communication with computation.
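Step 1 of the procedure (the balanced partition of decoding requests) can be sketched with a simple longest-first greedy heuristic. This is an illustrative stand-in, not necessarily the partition rule L3 itself uses:

```python
# Toy balanced partition for step 1: split decoding requests into D_0, D_1
# so their context-length sums are as even as possible. Longest-first greedy
# is an assumed heuristic for illustration.

def partition_decodes(context_lens):
    d0, d1 = [], []
    s0 = s1 = 0
    for L in sorted(context_lens, reverse=True):
        if s0 <= s1:           # always feed the lighter sub-batch
            d0.append(L); s0 += L
        else:
            d1.append(L); s1 += L
    return d0, d1

d0, d1 = partition_decodes([8192, 6144, 4096, 2048, 2048])
```

A longest-first greedy split is a standard approximation for two-way load balancing; the real scheduler only needs the sums to be close enough that step 3's chunk-size adjustment can absorb the residual imbalance.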

4. Hardware–Software Co-Design and Data Mapping Techniques

L3 resolves hardware mismatches and communication overhead through several architectural innovations:

  • Bit-level re-layout: 16-bit FP elements split across ×8 chips are rearranged so that all bits of an element are co-located on one chip; a per-rank "re-layout unit" swaps the upper and lower 8 bits beat-by-beat during write bursts with zero added cycles: $\text{chip\_id} = \lfloor e / w_{\mathrm{chip}} \rfloor, \quad \text{new\_chip\_id} = \text{chip\_id} \oplus 1$
  • Element-level mapping for K/V matrices: for the $Q K^T$ score computation, K is tiled so that each bank holds contiguous $D_h$ slices; for the $S \cdot V$ context computation, V-token slices are mapped to successive banks.
  • Rank-set interleaving: only one rank per channel is driven during PCIe offload while the others continue PIM compute, keeping up to $\frac{R-1}{R}$ of the PIM compute capacity active during transfer. Prefill-only offloads run in background GPU FC paths, restricting critical-path communication to the essential Q/K/V and attention vectors.
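The bit-level re-layout index arithmetic from the first bullet can be sketched as follows; the chip width matches the ×8 chips described above, but the lane numbering is an assumption for illustration:

```python
# Sketch of the bit-level re-layout mapping: bit lane e of a write burst
# normally lands on chip floor(e / w_chip); the re-layout unit flips the
# low bit of the chip id so both bytes of a 16-bit element reach one chip.
# w_chip = 8 matches x8 chips; lane indexing is assumed for illustration.

W_CHIP = 8  # data bus width per chip, in bits

def chip_id(e, w_chip=W_CHIP):
    """Default mapping: chip_id = floor(e / w_chip)."""
    return e // w_chip

def relayout_chip_id(e, w_chip=W_CHIP):
    """Re-layout unit: new_chip_id = chip_id XOR 1 (swap adjacent chips)."""
    return chip_id(e, w_chip) ^ 1
```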

5. Performance Metrics and Analytical Outcomes

Key performance metrics include:

  • Speedup: $S = T_{\mathrm{baseline}} / T_{\mathrm{L3}}$

  • Maximum batch size before out-of-memory (OOM):
    • Baseline HBM-GPU: $\le 2.28$ requests of GPT-175B @ 8k tokens on 80 GB HBM (batch $\lesssim 2$).
    • L3 with 2 TB DIMM-PIM: batch $\gtrsim 14$ requests (up to $14.3\times$ larger).
  • Time Between Tokens (TBT): the latency between successive generated tokens.

On representative traces (OpenR1, Dolphin, OpenThoughts, LongBench) and models (OPT-66B, GPT-89B, GPT-175B), L3 demonstrates:

  • Up to $6.1\times$ speedup compared to state-of-the-art HBM-PIM.
  • Up to $14.3\times$ larger batch sizes versus HBM-only GPU.
  • $\gtrsim 9\times$ speedup versus CPU-offload methods (NEO/FastDecode), owing to superior aggregate PIM bandwidth (8–30× vs. DDR).

6. Scalability and Latency Trade-offs

Analysis of scalability reveals:

  • Scaling DIMM-PIM capacity alone ($\times 8$) yields only $\sim 1.6\times$ throughput improvement (PCIe/PIM bandwidth-limited).
  • Scaling bandwidth alone ($\times 8$ rank sets) yields only $\sim 1.1\times$ gain (capacity saturation).
  • Scaling capacity and bandwidth together ($\times 8$ each) enables $\sim 5.1\times$ gain; the full benefit accrues only through concerted resource growth.

Latency outcomes show:

  • L3 maintains TBT within 29–53% of the GPU-only baseline even on GPT-89B with 6k-token contexts, due to pipelined PCIe overlap.
  • Increasing the number of rank sets ($2 \rightarrow 16$) yields near-linear TBT reductions, paralleling the growth in $B_{\mathrm{PIM}}$.

7. Contextual Significance and System Implications

L3-based resource allocation exemplifies a tightly-coupled approach leveraging joint hardware-software innovation to resolve memory and bandwidth bottlenecks in long-context LLM inference. By formalizing the GPU versus PIM trade space, applying iterative latency-balancing scheduling, implementing dynamic data re-layouts, and exploiting communication overlap, L3 substantially increases throughput and batch capacity (5–6× speedup; 10–15× batch capacity) without sacrificing per-token latency. This architecture marks a substantive advancement in scalable LLM serving and informs future directions in resource management for memory-intensive AI workloads (Liu et al., 24 Apr 2025).
