L3-Based Resource Allocation for LLM Inference
- L3-based resource allocation is a co-designed method that distributes computational loads between GPUs and DIMM-PIM to overcome memory and bandwidth limitations in LLM inference.
- Its adaptive scheduling algorithm balances decoding and prefill operations, achieving up to 14.3× larger batch sizes and 6.1× speedup compared to traditional HBM-only methods.
- Innovative techniques such as bit-level re-layout and rank-set interleaving enhance data transfer and reduce latency, enabling near-linear improvements with increased DIMM-PIM capacity and bandwidth.
L3-based resource allocation refers to the hardware–software co-designed methodology for efficient distribution and orchestration of computational resources between GPUs and DIMM-based Processing-In-Memory (DIMM-PIM) subsystems during long-context LLM inference. The L3 system achieves scalability in memory capacity and bandwidth by offloading the decoding phase of multi-head attention (MHA)—the principal bottleneck for context length and batch size—from GPU high-bandwidth memory (HBM) to host-side DIMM-PIM, overcoming trade-offs inherent to conventional HBM-accelerated architectures (Liu et al., 24 Apr 2025).
1. Architectural Overview and Resource Model
L3 operates on a heterogeneous platform comprising GPUs, each characterized by FP16 throughput $C_{\text{GPU}}$ [TFLOP/s], HBM capacity $M_{\text{HBM}}$ [bytes], and HBM bandwidth $B_{\text{HBM}}$ [bytes/s]. The host-side memory subsystem is augmented with DIMM-PIM capability, consisting of $N_{ch}$ channels, $N_{rk}$ ranks per channel, and $N_{bk}$ banks per rank, with each DRAM chip featuring a data bus width of $w$ bits.
The total PIM memory capacity is formalized as $M_{\text{PIM}} = N_{ch} \cdot N_{rk} \cdot N_{bk} \cdot M_{bk}$, where $M_{bk}$ is the per-bank capacity. Peak bandwidth per channel is $B_{ch}$, yielding total PIM bandwidth $B_{\text{PIM}} = N_{ch} \cdot B_{ch}$. Data transfer between GPU and host is mediated by PCIe with bandwidth $B_{\text{PCIe}}$.
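As a concrete illustration, the sketch below evaluates this resource model in Python; the field names and the example configuration are assumptions for illustration, not values reported in the paper.

```python
from dataclasses import dataclass

@dataclass
class PimConfig:
    n_channels: int   # N_ch: DIMM-PIM channels
    n_ranks: int      # N_rk: ranks per channel
    n_banks: int      # N_bk: banks per rank
    bank_bytes: int   # M_bk: per-bank capacity in bytes
    chan_bw: float    # B_ch: peak per-channel bandwidth, bytes/s

    def total_capacity(self) -> int:
        # M_PIM = N_ch * N_rk * N_bk * M_bk
        return self.n_channels * self.n_ranks * self.n_banks * self.bank_bytes

    def total_bandwidth(self) -> float:
        # B_PIM = N_ch * B_ch
        return self.n_channels * self.chan_bw

# Hypothetical configuration sized to roughly 2 TiB of PIM capacity
cfg = PimConfig(n_channels=32, n_ranks=8, n_banks=32,
                bank_bytes=256 << 20, chan_bw=25.6e9)
print(f"M_PIM = {cfg.total_capacity() / 2**40:.1f} TiB")   # 2.0 TiB
print(f"B_PIM = {cfg.total_bandwidth() / 1e9:.1f} GB/s")   # 819.2 GB/s
```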
2. Adaptive Scheduling and Latency Modeling
L3’s scheduler orchestrates requests in prefilling and decoding phases, striving for maximal overlap of GPU and PIM pipelines to suppress idle bubbles. In each iteration, two sub-batches are constructed:
- Prefilling requests: Tracked by set $P$, each with finished token count $n_i$, possibly processed in chunks of size $c_i$.
- Decoding requests: Tracked by set $D$, with total context length $l_j$ per request.
Critical latencies per sub-batch $b \in \{0, 1\}$:
- GPU-side latency: $T^{\text{GPU}}_b = F(P_b) / C_{\text{GPU}}$, where $F(P_b)$ denotes the FP16 FLOP count of sub-batch $b$'s prefill chunks.
- PIM-side latency: $T^{\text{PIM}}_b = V(D_b) / B_{\text{PIM}} + T^{\text{xfer}}_b$, with $V(D_b) \propto \sum_{j \in D_b} l_j$ the K/V bytes scanned by decode attention and $T^{\text{xfer}}_b$ encoding prefill transfer overlap.
Each iteration solves $\min \max_b \left( T^{\text{GPU}}_b, T^{\text{PIM}}_b \right)$ subject to the HBM and PIM capacity constraints $M_{\text{HBM}}$ and $M_{\text{PIM}}$.
Thereby, resource allocation dynamically adapts to memory and bandwidth constraints.
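This min-max objective can be sketched in Python under the functional forms above; all names (`gpu_latency`, `pim_latency`, `iteration_latency`) and the example numbers are illustrative assumptions, not the paper's implementation.

```python
def gpu_latency(prefill_flops: float, c_gpu: float) -> float:
    """T_b^GPU: FP16 work of sub-batch b's prefill chunks over GPU throughput."""
    return prefill_flops / c_gpu

def pim_latency(kv_bytes: float, b_pim: float, xfer_overlap_s: float) -> float:
    """T_b^PIM: K/V bytes scanned by decode attention over aggregate PIM
    bandwidth, plus prefill-transfer time not hidden behind compute."""
    return kv_bytes / b_pim + xfer_overlap_s

def iteration_latency(sub_batches: list[dict]) -> float:
    """Scheduling objective: the slower pipeline across both sub-batches.
    The scheduler searches batch compositions that minimize this value."""
    return max(
        max(gpu_latency(b["flops"], b["c_gpu"]),
            pim_latency(b["kv_bytes"], b["b_pim"], b["xfer_s"]))
        for b in sub_batches
    )

# Two illustrative sub-batches on a 300 TFLOP/s GPU and 800 GB/s PIM
subs = [
    {"flops": 6e13, "c_gpu": 3e14, "kv_bytes": 1.5e11, "b_pim": 8e11, "xfer_s": 0.01},
    {"flops": 5e13, "c_gpu": 3e14, "kv_bytes": 1.6e11, "b_pim": 8e11, "xfer_s": 0.01},
]
print(f"iteration latency: {iteration_latency(subs) * 1e3:.1f} ms")
```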
3. Heuristic Scheduling Procedure
The practical L3 scheduler employs a greedy iterative procedure:
- Pull decoding requests fitting host memory; partition into $D_0$, $D_1$ for balanced context-length sums.
- Initialize $P_0 = P_1 = \emptyset$; add the largest remaining prefilling requests to sub-batch 0 until $T^{\text{GPU}}_0 \geq T^{\text{PIM}}_0$, and symmetrically for sub-batch 1.
- If a residual imbalance exists, select one request in each sub-batch; set its chunk size $c$ to equalize $T^{\text{GPU}}_b$ and $T^{\text{PIM}}_b$ (solving a linear equation for $c$).
- Update per-request token counters $n_i \leftarrow n_i + c_i$. Unfinished chunks are re-enqueued for subsequent iterations.
This approach balances GPU and PIM compute utilization and overlaps communication with computation.
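A minimal sketch of the two core steps, the longest-first partition of decoding requests and the chunk-size equation, assuming the latency forms of Section 2; `partition_decode` and `equalizing_chunk` are hypothetical names.

```python
def partition_decode(ctx_lens: list[int]) -> tuple[list[list[int]], list[int]]:
    """Split decoding requests into two sub-batches D0, D1 with near-equal
    context-length sums (longest-first greedy heuristic)."""
    d, sums = [[], []], [0, 0]
    for length in sorted(ctx_lens, reverse=True):
        i = 0 if sums[0] <= sums[1] else 1
        d[i].append(length)
        sums[i] += length
    return d, sums

def equalizing_chunk(t_pim: float, t_gpu_fixed: float,
                     flops_per_token: float, c_gpu: float) -> int:
    """Chunk size c for one leftover prefill request such that
    t_gpu_fixed + c * flops_per_token / c_gpu == t_pim
    (the linear equation in c that the scheduler solves)."""
    c = (t_pim - t_gpu_fixed) * c_gpu / flops_per_token
    return max(0, int(c))

d, sums = partition_decode([8192, 6144, 4096, 4096, 2048])
print(d, sums)   # context-length sums come out balanced: 12288 vs 12288
print(equalizing_chunk(t_pim=0.20, t_gpu_fixed=0.15,
                       flops_per_token=7e9, c_gpu=3e14))  # chunk in tokens
```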
4. Hardware–Software Co-Design and Data Mapping Techniques
L3 resolves hardware mismatches and communication overhead through several architectural innovations:
- Bit-level re-layout: 16-bit FP elements split across ×8-bit chips are rearranged so that all 16 bits are co-located; a rank-PU "re-layout unit" swaps the upper and lower 8 bits beat-by-beat during write bursts with zero added cycles (see the sketch after this list).
- Element-level mapping for K/V matrices: For score computation, K is tiled such that each bank holds contiguous row slices of K. For context computation, V token-slices are mapped to successive banks.
- Rank-set interleaving: Only one rank per channel is driven during PCIe offload; the others continue PIM compute, keeping up to $(N_{rk} - 1)/N_{rk}$ of aggregate PIM compute live during transfer. Prefill-only offloads are performed in GPU FC background paths, minimizing critical-path communication to only essential Q/K/V and attention vectors.
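The sketch below gives a functional (not cycle-accurate) model of what the bit-level re-layout achieves, assuming FP16 elements striped over two ×8 chips; `split_naive` and `split_relayout` are illustrative names.

```python
import struct

def split_naive(vals: list[float], n_chips: int = 2) -> list[bytes]:
    """Without re-layout: byte i of the write burst lands on chip i % n_chips,
    so every FP16 element is torn across two x8 chips (low byte on one,
    high byte on the other)."""
    raw = b"".join(struct.pack("<e", v) for v in vals)
    return [raw[i::n_chips] for i in range(n_chips)]

def split_relayout(vals: list[float], n_chips: int = 2) -> list[bytes]:
    """With re-layout (functional view): whole elements are steered to
    alternating chips, so each chip holds complete FP16 values and its
    bank-level PU can compute without cross-chip gathers."""
    packed = [struct.pack("<e", v) for v in vals]
    return [b"".join(packed[i::n_chips]) for i in range(n_chips)]

vals = [1.0, 2.0, 3.0, 4.0]
print([c.hex() for c in split_naive(vals)])     # half of every element per chip
print([c.hex() for c in split_relayout(vals)])  # whole elements per chip
```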
5. Performance Metrics and Analytical Outcomes
Key performance metrics include:
- Speedup: $\text{Speedup} = T_{\text{baseline}} / T_{\text{L3}}$, the ratio of baseline to L3 execution time.
- Maximum batch size before out-of-memory (OOM): the baseline HBM-GPU configuration runs GPT-175B at 8k tokens on 80 GB HBM; L3 with 2 TB DIMM-PIM supports batches up to 14.3× larger (a first-order capacity estimate follows this list).
- Time Between Tokens (TBT): latency between consecutive generated tokens during decoding.
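To make the OOM metric concrete, the following first-order KV-cache arithmetic assumes a GPT-175B-like shape (96 layers, hidden size 12288) and ignores weights and activations, so its raw capacity ratio differs from the paper's measured 14.3×.

```python
def kv_bytes_per_request(n_layers: int, hidden: int, seq_len: int,
                         dtype_bytes: int = 2) -> int:
    """K and V each store n_layers * hidden values per token (FP16)."""
    return 2 * n_layers * hidden * dtype_bytes * seq_len

def max_batch(capacity_bytes: float, per_request_bytes: int) -> int:
    return int(capacity_bytes // per_request_bytes)

# GPT-175B-like shape: 96 layers, hidden size 12288, 8k-token contexts
per_req = kv_bytes_per_request(n_layers=96, hidden=12288, seq_len=8192)
print(f"KV cache per request: {per_req / 2**30:.1f} GiB")      # ~36 GiB
print("80 GB HBM  (weights ignored):", max_batch(80e9, per_req))
print("2 TB DIMM-PIM               :", max_batch(2e12, per_req))
```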
On representative traces (OpenR1, Dolphin, OpenThoughts, LongBench) and models (OPT-66B, GPT-89B, GPT-175B), L3 demonstrates:
- Up to 6.1× speedup compared to state-of-the-art HBM-PIM.
- Up to 14.3× larger batch sizes versus HBM-only GPU.
- Substantial speedup versus CPU-offload methods (NEO/FastDecode), owing to superior aggregate PIM bandwidth (8–30× that of DDR).
6. Scalability and Latency Trade-offs
Analysis of scalability reveals:
- Scaling DIMM-PIM capacity alone yields little throughput improvement (PCIe/PIM bandwidth-limited).
- Scaling bandwidth alone (additional rank-sets) yields only limited gain (capacity saturation).
- Simultaneous scaling of capacity and bandwidth enables near-linear gains; the full benefit accrues only through concerted resource growth (see the toy model after this list).
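A toy roofline model, under assumed rates, illustrates why capacity and bandwidth must grow together: capacity caps the batch size (and hence GPU-side FC throughput), while PIM bandwidth caps the attention scan rate.

```python
def decode_throughput(capacity_bytes: float, pim_bw: float,
                      kv_per_req: float, gpu_rate_per_req: float = 1.0) -> float:
    """Toy roofline for the decode loop: PIM capacity caps the batch size,
    PIM-side attention is bandwidth-bound, and GPU-side FC layers scale
    with batch size; throughput is the slower of the two pipelines."""
    batch = capacity_bytes // kv_per_req      # capacity-limited batch size
    pim_rate = pim_bw / kv_per_req            # tokens/s, batch-independent
    gpu_rate = gpu_rate_per_req * batch       # tokens/s, grows with batch
    return min(pim_rate, gpu_rate)

base = decode_throughput(2e12, 1e12, 36e9)
print(f"4x capacity only : {decode_throughput(8e12, 1e12, 36e9) / base:.1f}x")
print(f"4x bandwidth only: {decode_throughput(2e12, 4e12, 36e9) / base:.1f}x")
print(f"4x both          : {decode_throughput(8e12, 4e12, 36e9) / base:.1f}x")
```

In this model, quadrupling either resource alone saturates on the other bound (1.0× and 2.0× here), while quadrupling both yields the full 4.0× gain.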
Latency outcomes show:
- L3 maintains TBT within a small margin of the GPU-only baseline even on GPT-89B with 6k tokens, due to pipelined PCIe overlap.
- Increasing rank-sets yields near-linear TBT reductions, paralleling growth in aggregate PIM bandwidth $B_{\text{PIM}}$.
7. Contextual Significance and System Implications
L3-based resource allocation exemplifies a tightly coupled approach leveraging joint hardware–software innovation to resolve memory and bandwidth bottlenecks in long-context LLM inference. By formalizing the GPU versus PIM trade space, applying iterative latency-balancing scheduling, implementing dynamic data re-layouts, and exploiting communication overlap, L3 substantially increases throughput and batch capacity (5–6× speedup; 10–15× batch capacity) without sacrificing per-token latency. This architecture marks a substantive advancement in scalable LLM serving and informs future directions in resource management for memory-intensive AI workloads (Liu et al., 24 Apr 2025).