L3-Based Resource Allocation for LLM Inference
- L3-based resource allocation is a co-designed method that distributes computational loads between GPUs and DIMM-PIM to overcome memory and bandwidth limitations in LLM inference.
- Its adaptive scheduling algorithm balances decoding and prefill operations, achieving up to 14.3× larger batch sizes and 6.1× speedup compared to traditional HBM-only methods.
- Innovative techniques such as bit-level re-layout and rank-set interleaving enhance data transfer and reduce latency, enabling near-linear improvements with increased DIMM-PIM capacity and bandwidth.
L3-based resource allocation refers to the hardware–software co-designed methodology for efficient distribution and orchestration of computational resources between GPUs and DIMM-based Processing-In-Memory (DIMM-PIM) subsystems during long-context LLM inference. The L3 system achieves scalability in memory capacity and bandwidth by offloading the decoding phase of multi-head attention (MHA)—the principal bottleneck for context length and batch size—from GPU high-bandwidth memory (HBM) to host-side DIMM-PIM, overcoming trade-offs inherent to conventional HBM-accelerated architectures (Liu et al., 24 Apr 2025).
1. Architectural Overview and Resource Model
L3 operates on a heterogeneous platform comprising GPUs, each characterized by FP16 throughput $C_{\text{GPU}}$ [TFLOP/s], HBM capacity $M_{\text{HBM}}$ [bytes], and HBM bandwidth $B_{\text{HBM}}$ [bytes/s]. The host-side memory subsystem is augmented with DIMM-PIM capability, consisting of $N_{ch}$ channels, $N_{rk}$ ranks per channel, and $N_{bk}$ banks per rank, with each DRAM chip featuring a data bus width of $w$ bits.
The total PIM memory capacity is formalized as $M_{\text{PIM}} = N_{ch} \cdot N_{rk} \cdot N_{bk} \cdot M_{bk}$, where $M_{bk}$ is the per-bank capacity. Peak bandwidth per channel is $B_{ch}$, yielding total PIM bandwidth $B_{\text{PIM}} = N_{ch} \cdot B_{ch}$. Data transfer between GPU and host is mediated by PCIe with bandwidth $B_{\text{PCIe}}$.
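As a concrete illustration, the sketch below evaluates this resource model in Python; the field names and the example configuration are assumptions for illustration, not values reported in the paper.

```python
from dataclasses import dataclass

@dataclass
class PimConfig:
    n_channels: int   # N_ch: DIMM-PIM channels
    n_ranks: int      # N_rk: ranks per channel
    n_banks: int      # N_bk: banks per rank
    bank_bytes: int   # M_bk: per-bank capacity in bytes
    chan_bw: float    # B_ch: peak per-channel bandwidth, bytes/s

    def total_capacity(self) -> int:
        # M_PIM = N_ch * N_rk * N_bk * M_bk
        return self.n_channels * self.n_ranks * self.n_banks * self.bank_bytes

    def total_bandwidth(self) -> float:
        # B_PIM = N_ch * B_ch
        return self.n_channels * self.chan_bw

# Hypothetical configuration sized to roughly 2 TiB of PIM capacity
cfg = PimConfig(n_channels=32, n_ranks=8, n_banks=32,
                bank_bytes=256 << 20, chan_bw=25.6e9)
print(f"M_PIM = {cfg.total_capacity() / 2**40:.1f} TiB")   # 2.0 TiB
print(f"B_PIM = {cfg.total_bandwidth() / 1e9:.1f} GB/s")   # 819.2 GB/s
```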
2. Adaptive Scheduling and Latency Modeling
L3’s scheduler orchestrates requests in prefilling and decoding phases, striving for maximal overlap of GPU and PIM pipelines to suppress idle bubbles. In each iteration, two sub-batches are constructed:
- Prefilling requests: Tracked by set $P$, each with finished token count $n_i$, possibly processed in chunks of size $c_i$.
- Decoding requests: Tracked by set $D$, with total context length $l_j$ per request.
Critical latencies per sub-batch $b \in \{0, 1\}$:
- GPU-side latency: $T^{\text{GPU}}_b = F(P_b) / C_{\text{GPU}}$, where $F(P_b)$ denotes the FP16 FLOP count of sub-batch $b$'s prefill chunks.
- PIM-side latency: $T^{\text{PIM}}_b = V(D_b) / B_{\text{PIM}} + T^{\text{xfer}}_b$, with $V(D_b) \propto \sum_{j \in D_b} l_j$ the K/V bytes scanned by decode attention and $T^{\text{xfer}}_b$ encoding prefill transfer overlap.
Each iteration solves $\min \max_b \left( T^{\text{GPU}}_b, T^{\text{PIM}}_b \right)$ subject to the HBM and PIM capacity constraints $M_{\text{HBM}}$ and $M_{\text{PIM}}$.
Thereby, resource allocation dynamically adapts to memory and bandwidth constraints.
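This min-max objective can be sketched in Python under the functional forms above; all names (`gpu_latency`, `pim_latency`, `iteration_latency`) and the example numbers are illustrative assumptions, not the paper's implementation.

```python
def gpu_latency(prefill_flops: float, c_gpu: float) -> float:
    """T_b^GPU: FP16 work of sub-batch b's prefill chunks over GPU throughput."""
    return prefill_flops / c_gpu

def pim_latency(kv_bytes: float, b_pim: float, xfer_overlap_s: float) -> float:
    """T_b^PIM: K/V bytes scanned by decode attention over aggregate PIM
    bandwidth, plus prefill-transfer time not hidden behind compute."""
    return kv_bytes / b_pim + xfer_overlap_s

def iteration_latency(sub_batches: list[dict]) -> float:
    """Scheduling objective: the slower pipeline across both sub-batches.
    The scheduler searches batch compositions that minimize this value."""
    return max(
        max(gpu_latency(b["flops"], b["c_gpu"]),
            pim_latency(b["kv_bytes"], b["b_pim"], b["xfer_s"]))
        for b in sub_batches
    )

# Two illustrative sub-batches on a 300 TFLOP/s GPU and 800 GB/s PIM
subs = [
    {"flops": 6e13, "c_gpu": 3e14, "kv_bytes": 1.5e11, "b_pim": 8e11, "xfer_s": 0.01},
    {"flops": 5e13, "c_gpu": 3e14, "kv_bytes": 1.6e11, "b_pim": 8e11, "xfer_s": 0.01},
]
print(f"iteration latency: {iteration_latency(subs) * 1e3:.1f} ms")
```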
3. Heuristic Scheduling Procedure
The practical L3 scheduler employs a greedy iterative procedure:
- Pull decoding requests fitting host memory; partition into $D_0$, $D_1$ for balanced context-length sums.
- Initialize $P_0 = P_1 = \emptyset$; add the largest remaining prefilling requests to sub-batch 0 until $T^{\text{GPU}}_0 \geq T^{\text{PIM}}_0$, and symmetrically for sub-batch 1.
- If a residual imbalance exists, select one request in each sub-batch; set its chunk size $c$ to equalize $T^{\text{GPU}}_b$ and $T^{\text{PIM}}_b$ (solving a linear equation for $c$).
- Update per-request token counters $n_i \leftarrow n_i + c_i$. Unfinished chunks are re-enqueued for subsequent iterations.
This approach balances GPU and PIM compute utilization and overlaps communication with computation.
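A minimal sketch of the two core steps, the longest-first partition of decoding requests and the chunk-size equation, assuming the latency forms of Section 2; `partition_decode` and `equalizing_chunk` are hypothetical names.

```python
def partition_decode(ctx_lens: list[int]) -> tuple[list[list[int]], list[int]]:
    """Split decoding requests into two sub-batches D0, D1 with near-equal
    context-length sums (longest-first greedy heuristic)."""
    d, sums = [[], []], [0, 0]
    for length in sorted(ctx_lens, reverse=True):
        i = 0 if sums[0] <= sums[1] else 1
        d[i].append(length)
        sums[i] += length
    return d, sums

def equalizing_chunk(t_pim: float, t_gpu_fixed: float,
                     flops_per_token: float, c_gpu: float) -> int:
    """Chunk size c for one leftover prefill request such that
    t_gpu_fixed + c * flops_per_token / c_gpu == t_pim
    (the linear equation in c that the scheduler solves)."""
    c = (t_pim - t_gpu_fixed) * c_gpu / flops_per_token
    return max(0, int(c))

d, sums = partition_decode([8192, 6144, 4096, 4096, 2048])
print(d, sums)   # context-length sums come out balanced: 12288 vs 12288
print(equalizing_chunk(t_pim=0.20, t_gpu_fixed=0.15,
                       flops_per_token=7e9, c_gpu=3e14))  # chunk in tokens
```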
4. Hardware–Software Co-Design and Data Mapping Techniques
L3 resolves hardware mismatches and communication overhead through several architectural innovations:
- Bit-level re-layout: 16-bit FP elements split across ×8-bit chips are rearranged so that all 16 bits are co-located; a rank-PU "re-layout unit" swaps the upper and lower 8 bits beat-by-beat during write bursts with zero added cycles (see the sketch after this list).
- Element-level mapping for K/V matrices: For score computation, K is tiled such that each bank holds contiguous row slices of K. For context computation, V token-slices are mapped to successive banks.
- Rank-set interleaving: Only one rank per channel is driven during PCIe offload; the others continue PIM compute, keeping up to $(N_{rk} - 1)/N_{rk}$ of aggregate PIM compute live during transfer. Prefill-only offloads are performed in GPU FC background paths, minimizing critical-path communication to only essential Q/K/V and attention vectors.
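The sketch below gives a functional (not cycle-accurate) model of what the bit-level re-layout achieves, assuming FP16 elements striped over two ×8 chips; `split_naive` and `split_relayout` are illustrative names.

```python
import struct

def split_naive(vals: list[float], n_chips: int = 2) -> list[bytes]:
    """Without re-layout: byte i of the write burst lands on chip i % n_chips,
    so every FP16 element is torn across two x8 chips (low byte on one,
    high byte on the other)."""
    raw = b"".join(struct.pack("<e", v) for v in vals)
    return [raw[i::n_chips] for i in range(n_chips)]

def split_relayout(vals: list[float], n_chips: int = 2) -> list[bytes]:
    """With re-layout (functional view): whole elements are steered to
    alternating chips, so each chip holds complete FP16 values and its
    bank-level PU can compute without cross-chip gathers."""
    packed = [struct.pack("<e", v) for v in vals]
    return [b"".join(packed[i::n_chips]) for i in range(n_chips)]

vals = [1.0, 2.0, 3.0, 4.0]
print([c.hex() for c in split_naive(vals)])     # half of every element per chip
print([c.hex() for c in split_relayout(vals)])  # whole elements per chip
```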
5. Performance Metrics and Analytical Outcomes
Key performance metrics include:
- Speedup: $\text{Speedup} = T_{\text{baseline}} / T_{\text{L3}}$, the ratio of baseline to L3 execution time.
- Maximum batch size before out-of-memory (OOM): the baseline HBM-GPU configuration runs GPT-175B at 8k tokens on 80 GB HBM; L3 with 2 TB DIMM-PIM supports batches up to 14.3× larger (a first-order capacity estimate follows this list).
- Time Between Tokens (TBT): latency between consecutive generated tokens during decoding.
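To make the OOM metric concrete, the following first-order KV-cache arithmetic assumes a GPT-175B-like shape (96 layers, hidden size 12288) and ignores weights and activations, so its raw capacity ratio differs from the paper's measured 14.3×.

```python
def kv_bytes_per_request(n_layers: int, hidden: int, seq_len: int,
                         dtype_bytes: int = 2) -> int:
    """K and V each store n_layers * hidden values per token (FP16)."""
    return 2 * n_layers * hidden * dtype_bytes * seq_len

def max_batch(capacity_bytes: float, per_request_bytes: int) -> int:
    return int(capacity_bytes // per_request_bytes)

# GPT-175B-like shape: 96 layers, hidden size 12288, 8k-token contexts
per_req = kv_bytes_per_request(n_layers=96, hidden=12288, seq_len=8192)
print(f"KV cache per request: {per_req / 2**30:.1f} GiB")      # ~36 GiB
print("80 GB HBM  (weights ignored):", max_batch(80e9, per_req))
print("2 TB DIMM-PIM               :", max_batch(2e12, per_req))
```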
On representative traces (OpenR1, Dolphin, OpenThoughts, LongBench) and models (OPT-66B, GPT-89B, GPT-175B), L3 demonstrates:
- Up to 6.1× speedup compared to state-of-the-art HBM-PIM.
- Up to 14.3× larger batch sizes versus HBM-only GPU.
- Substantial speedup versus CPU-offload methods (NEO/FastDecode), owing to superior aggregate PIM bandwidth (8–30× that of DDR).
6. Scalability and Latency Trade-offs
Analysis of scalability reveals:
- Scaling DIMM-PIM capacity alone yields little throughput improvement (PCIe/PIM bandwidth-limited).
- Scaling bandwidth alone (additional rank-sets) yields only limited gain (capacity saturation).
- Simultaneous scaling of capacity and bandwidth enables near-linear gains; the full benefit accrues only through concerted resource growth (see the toy model after this list).
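A toy roofline model, under assumed rates, illustrates why capacity and bandwidth must grow together: capacity caps the batch size (and hence GPU-side FC throughput), while PIM bandwidth caps the attention scan rate.

```python
def decode_throughput(capacity_bytes: float, pim_bw: float,
                      kv_per_req: float, gpu_rate_per_req: float = 1.0) -> float:
    """Toy roofline for the decode loop: PIM capacity caps the batch size,
    PIM-side attention is bandwidth-bound, and GPU-side FC layers scale
    with batch size; throughput is the slower of the two pipelines."""
    batch = capacity_bytes // kv_per_req      # capacity-limited batch size
    pim_rate = pim_bw / kv_per_req            # tokens/s, batch-independent
    gpu_rate = gpu_rate_per_req * batch       # tokens/s, grows with batch
    return min(pim_rate, gpu_rate)

base = decode_throughput(2e12, 1e12, 36e9)
print(f"4x capacity only : {decode_throughput(8e12, 1e12, 36e9) / base:.1f}x")
print(f"4x bandwidth only: {decode_throughput(2e12, 4e12, 36e9) / base:.1f}x")
print(f"4x both          : {decode_throughput(8e12, 4e12, 36e9) / base:.1f}x")
```

In this model, quadrupling either resource alone saturates on the other bound (1.0× and 2.0× here), while quadrupling both yields the full 4.0× gain.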
Latency outcomes show:
- L3 maintains TBT within a small margin of the GPU-only baseline even on GPT-89B with 6k tokens, due to pipelined PCIe overlap.
- Increasing rank-sets yields near-linear TBT reductions, paralleling growth in aggregate PIM bandwidth $B_{\text{PIM}}$.
7. Contextual Significance and System Implications
L3-based resource allocation exemplifies a tightly coupled approach leveraging joint hardware–software innovation to resolve memory and bandwidth bottlenecks in long-context LLM inference. By formalizing the GPU versus PIM trade space, applying iterative latency-balancing scheduling, implementing dynamic data re-layouts, and exploiting communication overlap, L3 substantially increases throughput and batch capacity (5–6× speedup; 10–15× batch capacity) without sacrificing per-token latency. This architecture marks a substantive advancement in scalable LLM serving and informs future directions in resource management for memory-intensive AI workloads (Liu et al., 24 Apr 2025).