Request-Level Batching (RLB)

Updated 21 January 2026

Request-Level Batching (RLB) is a practice that aggregates full inference requests to improve throughput, reduce resource overhead, and meet strict SLA/SLO constraints.
It utilizes advanced scheduling methods, dynamic bin packing, and adaptive policies to handle heterogeneous request loads in diverse AI applications.
RLB significantly enhances performance in LLM serving, recommender systems, and edge inference by optimizing latency, utilization, and overall throughput.

Request-Level Batching (RLB) refers to the practice of aggregating complete inference requests, rather than partial computations such as graph nodes or individual tokens, into shared execution units for increased throughput, reduced resource overhead, and improved quality-of-service (QoS) under varied constraints. In modern AI systems, RLB has emerged as a critical scheduling abstraction across production-grade LLM serving (Tian et al., 18 Dec 2025, Zheng et al., 23 Jul 2025), cloud and edge inference engines (Choi et al., 2020, Zhang et al., 2023), adaptive fine-tuning infrastructures (Wen et al., 2023), and billion-scale recommender systems (Guan et al., 8 Nov 2025), superseding simpler batching strategies to address heterogeneity, scalability, and strict deadline requirements.

1. Principles and Definitions

RLB is characterized by grouping entire inference requests or training examples into a batch prior to shared computation over the model execution graph. Each request consists of an input (such as a prompt, history, or image) and associated metadata (such as a target, user adapter, or SLO/SLA deadline), and is processed end-to-end through the appropriate layers. This contrasts with static batching, which waits until a fixed number of requests has arrived, and with token-level “continuous” batching, which combines partial computations but does not address holistic request-level resource amortization or scheduling (Zheng et al., 23 Jul 2025, Tian et al., 18 Dec 2025).

Formally, in the RLB paradigm:

A batch $B = \{ r_1, r_2, ..., r_N \}$ is a set of independent requests, each with its own execution trajectory through the model.
In LLM serving, RLB groups queries with different sequence lengths and arrival times, forming a schedule for simultaneous execution of their full prefill or decode phases (Tian et al., 18 Dec 2025, Zheng et al., 23 Jul 2025).
In multitarget recommender systems, RLB enables shared encoding of repeated features (e.g., user history) across multiple targets (Guan et al., 8 Nov 2025).
In adapter-based inference (e.g., FLoRA), RLB supports batching diverse requests each with their own low-rank adaptation weights (Wen et al., 2023).
In SLA/SLO-constrained inference, RLB underpins adaptive admission control to balance deadline satisfaction and resource utilization (Choi et al., 2020, Zhang et al., 2023, Chang et al., 24 Jun 2025).

2. Algorithmic Methodologies

2.1 Scheduling and Buffering

RLB systems typically introduce a scheduler-side buffer that temporarily holds requests and dynamically forms batches according to configurable windows, resource states, or pending deadlines (Tian et al., 18 Dec 2025). For example, Staggered Batch Scheduling (SBS) buffers requests for a window $\Delta t$ to match device availability and then dispatches an optimal batch across distributed DP units, reducing queueing in hidden device buffers and improving both time-to-first-token (TTFT) and throughput (Tian et al., 18 Dec 2025). In contrast, offline batch pipelines (e.g., BlendServe) may reorder requests according to resource profiles to maximize both prefix sharing and operator resource overlap (Zhao et al., 2024).
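The buffering idea can be illustrated with a minimal sketch. It is not the SBS implementation: the class name `WindowedBuffer`, the `max_batch` cap, and the `device_free` flag are assumptions made for illustration. Requests are held until the window $\Delta t$ has elapsed (or the buffer is full) and a device is available.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Request:
    req_id: str
    arrival: float  # arrival time in seconds


@dataclass
class WindowedBuffer:
    """Hypothetical scheduler-side buffer; names are illustrative only."""
    window: float      # buffering window (the Delta t of the text), seconds
    max_batch: int     # assumed cap on requests per dispatched batch
    _pending: List[Request] = field(default_factory=list)
    _window_start: Optional[float] = None

    def submit(self, req: Request) -> None:
        # The window opens when the first request enters an empty buffer.
        if not self._pending:
            self._window_start = req.arrival
        self._pending.append(req)

    def poll(self, now: float, device_free: bool) -> List[Request]:
        """Return a batch if the window elapsed (or buffer is full) and a device is free."""
        if not self._pending or not device_free:
            return []
        window_elapsed = now - self._window_start >= self.window
        full = len(self._pending) >= self.max_batch
        if not (window_elapsed or full):
            return []
        batch = self._pending[: self.max_batch]
        self._pending = self._pending[self.max_batch:]
        # Any leftover requests start a new window at their own arrival time.
        self._window_start = self._pending[0].arrival if self._pending else None
        return batch
```

Holding requests at the scheduler rather than pushing them straight to devices keeps the queue visible to the dispatch policy, which is the property the text attributes to staggered scheduling.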

2.2 Bin Packing and Load Balancing

Batch formation in RLB must account for heterogeneity such as sequence length (LLMs), memory usage (autoregressive models), priority (SLA/SLO), or adapter identity (personalized fine-tuning). Water-filling, longest-job-first, and prioritized scheduling algorithms are deployed to maximize resource occupancy and minimize straggler effects (Zheng et al., 23 Jul 2025, Tian et al., 18 Dec 2025, Zhao et al., 2024). In distributed LLM serving, global allocation policies—such as Prioritized Batch Allocation and IQR-aware decode scheduling—dynamically bin-pack requests by DP unit state, available capacity, and cache pressure (Tian et al., 18 Dec 2025).
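As a rough illustration of longest-job-first load balancing, the sketch below greedily assigns requests, sorted by descending sequence length, to whichever unit currently has the smallest load. The function name and the use of sequence length as the only cost signal are assumptions; the cited systems also consider capacity and cache pressure, which are omitted here.

```python
import heapq
from typing import Dict, List, Tuple


def longest_job_first(lengths: Dict[str, int], num_units: int) -> List[List[str]]:
    """Greedy longest-job-first assignment of requests to DP units.

    `lengths` maps a request id to its sequence length, used here as a
    stand-in for cost. Returns one list of request ids per unit.
    """
    # Min-heap of (current load, unit index): the least-loaded unit pops first.
    heap: List[Tuple[int, int]] = [(0, u) for u in range(num_units)]
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(num_units)]
    # Longest first; ties broken by request id for determinism.
    for req_id, length in sorted(lengths.items(), key=lambda kv: (-kv[1], kv[0])):
        load, unit = heapq.heappop(heap)
        assignment[unit].append(req_id)
        heapq.heappush(heap, (load + length, unit))
    return assignment
```

Placing long requests first leaves the short ones to fill the remaining gaps, which tends to reduce the straggler effect mentioned above.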

2.3 Adaptive and SLA-aware Policies

Where requests carry deadlines or SLOs, RLB includes runtime predictors to admit only those batches likely to finish within constraints. This involves slack estimators (Choi et al., 2020), reward-augmented reinforcement learning objectives (Zhang et al., 2023), or explicit per-request feasibility models based on measured resource-speed curves (e.g., Universal Scalability Law for CodeLLMs) (Chang et al., 24 Jun 2025). Node-granular batching (LazyBatching) allows fine-grained preemption, adaptively merging execution across subgraphs to maintain high throughput while upholding strict deadlines (Choi et al., 2020).
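A slack-based admission check might look like the following sketch. The linear latency model, the `fixed_overhead` constant, and the `est_cost` field are hypothetical placeholders for whatever predictor a real system uses; the point is only that a candidate joins a batch when the tightest deadline in the batch still has non-negative slack.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Req:
    req_id: str
    deadline: float  # absolute deadline, seconds
    est_cost: float  # assumed per-request latency contribution, seconds


def batch_latency(batch: List[Req], fixed_overhead: float = 0.010) -> float:
    # Assumed linear cost model: a fixed launch overhead plus per-request cost.
    return fixed_overhead + sum(r.est_cost for r in batch)


def try_admit(batch: List[Req], candidate: Req, now: float) -> bool:
    """Add `candidate` to `batch` only if every deadline is still met."""
    trial = batch + [candidate]
    finish = now + batch_latency(trial)
    # Slack of the batch is set by its tightest deadline.
    slack = min(r.deadline for r in trial) - finish
    if slack >= 0:
        batch.append(candidate)
        return True
    return False
```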

3. System Architectures and Implementation Patterns

3.1 Scheduler Layer Organization

RLB requires an orchestrated control plane managing request queues, resource monitors, and feedback from compute engines. Architectures such as SBS in LLM clusters utilize a three-plane design (Control, State, Resource) to synchronize buffer formation, dispatch triggers, and load allocation (Tian et al., 18 Dec 2025). Edge deployments (BCEdge) instantiate an MDP-based DRL agent that dynamically selects batch size and concurrency for each scheduling epoch (Zhang et al., 2023).

3.2 Memory-efficient Data Structuring

Bucket-based batchers (BucketServe) maintain disjoint batches (“buckets”) partitioned by sequence length, dynamically splitting/merging as workloads and memory availability fluctuate (Zheng et al., 23 Jul 2025). Jagged tensors and compacted record blocks are used for minimal padding overhead in multimodal and long-sequence pipelines (Guan et al., 8 Nov 2025). In RLB-enabled recommender systems, collating all per-request targets into grouped micro-batches enables shared encoding and bandwidth amortization (Guan et al., 8 Nov 2025).
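The effect of length-based bucketing on padding can be sketched as follows. The bucket boundaries and helper names are illustrative assumptions, and the dynamic splitting and merging described above is not modeled; the sketch only compares padded token counts for one mixed batch against length-partitioned buckets.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def bucketize(seq_lens: Sequence[int], boundaries: Sequence[int]) -> Dict[int, List[int]]:
    """Place each sequence length in the smallest bucket whose bound fits it."""
    ordered = sorted(boundaries)
    buckets: Dict[int, List[int]] = defaultdict(list)
    for n in seq_lens:
        bound = next((b for b in ordered if n <= b), None)
        if bound is None:
            raise ValueError(f"sequence length {n} exceeds the largest bucket")
        buckets[bound].append(n)
    return dict(buckets)


def padded_tokens(groups: Iterable[List[int]]) -> int:
    """Tokens processed when each group is padded to its own longest sequence."""
    return sum(max(g) * len(g) for g in groups if g)
```

For lengths `[30, 40, 500, 480, 2000]`, a single batch pads every request to 2000 tokens (10000 tokens for 3050 real ones), while buckets bounded at 64, 512, and 2048 process 3080.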

3.3 Adapter Batching for Personalized Inference

Standard low-rank adaptation (LoRA) does not batch efficiently when requests in the same batch use different adapters. Fast LoRA (FLoRA) reframes the forward pass via an elementwise Hadamard mask, allowing each request in the batch to carry its own $(B_i, A_i)$ adapter while sharing the expensive GEMM operations, resulting in 2–5$\times$ throughput and latency gains for small rank values (Wen et al., 2023).
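One way to see why a Hadamard formulation allows heterogeneous adapters in one batch is the rank-1 case sketched below. It assumes a multiplicative adapter $W_i = W_0 \circ (b_i a_i^\top)$, which is one reading of the formulation and not necessarily the exact parameterization used in the paper. Under that assumption, $W_i x_i = b_i \circ (W_0 (a_i \circ x_i))$, so only elementwise scaling differs per request and the product with $W_0$ is shared. Plain Python lists are used to keep the sketch dependency-free.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, x: Vector) -> Vector:
    return [sum(wjk * xk for wjk, xk in zip(row, x)) for row in w]


def per_request_forward(w0: Matrix, a: Vector, b: Vector, x: Vector) -> Vector:
    """Reference path: materialize W_i = W0 * (b a^T) elementwise, then multiply."""
    w_i = [[w0[j][k] * b[j] * a[k] for k in range(len(a))] for j in range(len(b))]
    return matvec(w_i, x)


def batched_forward(w0: Matrix, A: Matrix, B: Matrix, X: Matrix) -> Matrix:
    """Batched path: scale inputs by a_i, apply the shared W0, scale outputs by b_i."""
    scaled_inputs = [[ak * xk for ak, xk in zip(a, x)] for a, x in zip(A, X)]
    # This loop stands in for a single shared GEMM over the whole batch.
    shared = [matvec(w0, row) for row in scaled_inputs]
    return [[bj * yj for bj, yj in zip(b, y)] for b, y in zip(B, shared)]
```

The batched path never builds a per-request weight matrix, which is what makes mixing adapters in one batch cheap in this simplified setting.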

4. Theoretical Models and Mathematical Formulations

The performance and feasibility of RLB are governed by explicit cost models:

Service times and device queuing: Immediate dispatch gives average device queueing of $T/2$ per batch, while RLB with $N$ parallel engines reduces this to $T/(2N)$ plus scheduler wait, yielding lower TTFT if scheduler intervals are kept small (Tian et al., 18 Dec 2025).
Memory fitting: For LLMs, the maximum batch size $N_{\max}$ given memory budget $M_{\text{safe}}$ is

$N_{\max} = \max \left\{ N : 2 L H D S_{\max} B_{\text{dtype}} N \leq M_{\text{safe}}\right\}$

where $L$, $H$, and $D$ denote the number of transformer layers, the head count, and the per-head dimension, $S_{\max}$ is the maximum sequence length, and $B_{\text{dtype}}$ is the datatype size in bytes (Zheng et al., 23 Jul 2025).
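The memory-fitting bound is simple enough to evaluate directly. The sketch below reads $L$, $H$, $D$ as layer count, head count, and per-head dimension, $S_{\max}$ as the maximum sequence length, and $B_{\text{dtype}}$ as bytes per element; the function name and the example configuration are hypothetical.

```python
def max_batch_size(
    num_layers: int,       # L
    num_heads: int,        # H
    head_dim: int,         # D
    max_seq_len: int,      # S_max
    dtype_bytes: int,      # B_dtype
    mem_budget_bytes: int  # M_safe
) -> int:
    """Largest N with 2 * L * H * D * S_max * B_dtype * N <= M_safe."""
    # The factor 2 follows the formula; it is commonly read as keys plus values.
    per_request = 2 * num_layers * num_heads * head_dim * max_seq_len * dtype_bytes
    return mem_budget_bytes // per_request
```

For an illustrative configuration with 32 layers, 32 heads of dimension 128, a 4096-token limit, and 2-byte elements, each request reserves 2 GiB, so a 40 GiB budget admits 20 requests.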

Utility and reward functions: In edge settings, a log-utility function is maximized via entropy-regularized RL to find the optimal batch size and concurrency within memory and SLO constraints (Zhang et al., 2023).

Admission control and SLA feasibility: Predictive models of request speed under concurrent load allow a request to be accepted only if its expected completion time at the new concurrency level meets its deadline, optimizing “goodput” (the percentage of requests fulfilling their SLA) (Chang et al., 24 Jun 2025).
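A per-request feasibility check of this kind can be sketched with the Universal Scalability Law, $X(N) = \lambda N / (1 + \sigma (N-1) + \kappa N (N-1))$. This is not the cited system's implementation: the coefficient values, the even split of throughput across concurrent requests, and the function names are assumptions for illustration.

```python
def usl_throughput(n: int, lam: float, sigma: float, kappa: float) -> float:
    """Universal Scalability Law: aggregate throughput at concurrency n."""
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def admit(
    remaining_tokens: int,
    time_to_deadline: float,
    current_concurrency: int,
    lam: float,    # single-request speed (tokens/s), assumed fitted offline
    sigma: float,  # contention coefficient, assumed fitted offline
    kappa: float,  # coherency coefficient, assumed fitted offline
) -> bool:
    """Accept a request only if it is predicted to finish by its deadline."""
    n = current_concurrency + 1  # concurrency if the request is admitted
    # Assumption: aggregate throughput is shared evenly across requests.
    per_request_speed = usl_throughput(n, lam, sigma, kappa) / n
    return remaining_tokens / per_request_speed <= time_to_deadline
```

With `lam=50`, `sigma=0.05`, and `kappa=0.001`, a 200-token request with 6 seconds of budget is admitted when it would be the 8th concurrent request and rejected when it would be the 16th, because predicted per-request speed falls as concurrency rises.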

5. Empirical Results and Comparative Evaluations

5.1 LLM Serving and Throughput

SBS achieves 30–40% lower TTFT and 15–22% higher throughput versus immediate dispatch, with chunk utilization improved from ~52% to nearly 89% (Tian et al., 18 Dec 2025).
BucketServe demonstrates 3.58$\times$ higher token/sec throughput than UELLM and 1.31$\times$ that of DistServe, and can handle nearly double the request load at 80% SLO compliance, sustaining $>$80% GPU utilization (Zheng et al., 23 Jul 2025).
SABER yields up to 26% more SLA-compliant completions (“goodput”) than the best static setting, dropping end-to-end latency variability by 31–45% (Chang et al., 24 Jun 2025).

5.2 Resource-Optimized Training and Recommendation

Request-Level Batching in billion-scale recommenders reduces history-encoding compute and GPU memory, cuts PS CPU usage by 50%, and roughly doubles training throughput (2.2$\times$ over point-wise batching), with no loss in AUC or NLL (Guan et al., 8 Nov 2025).
BlendServe outperforms state-of-the-art offline LLM batchers (vLLM, SGLang, NanoFlow) by about 20.8% in throughput, reaching 86.6% of the feasible optimum for joint compute/memory overlap while maintaining around 97% of prefix-sharing efficiency (Zhao et al., 2024).

5.3 Personalization and Adapter Diversity

FLoRA achieves 2–5$\times$ lower token latency and up to 3$\times$ higher throughput for small-rank adapters in personalized LLM serving, matching LoRA’s accuracy across diverse code-generation and speech recognition tasks (Wen et al., 2023).

A representative summary of salient RLB gains is:

System	Relative Throughput	SLA/TTFT/SLO Gains	Utilization
SBS (LLM) (Tian et al., 18 Dec 2025)	+15–22%	–30–40% TTFT	88% chunks
BucketServe (Zheng et al., 23 Jul 2025)	+3.58x (offline)	1.93x more reqs @80% SLO	>80% GPU
SABER (Chang et al., 24 Jun 2025)	+26% goodput	–45% latency var	adaptive
BCEdge (Zhang et al., 2023)	+37.6% utility	SLO <5% up to 40 rps	adaptive
FLoRA (Wen et al., 2023)	2–5x (rank 1–4)	~0.5s/token latency	N/A
Douyin RLB (Guan et al., 8 Nov 2025)	+2.2–5.1x (train)	+3.35% finish rate	~8x longer

6. Applications, Limitations, and Extensions

RLB is fundamental to high-throughput, low-latency AI workloads with heterogeneous, bursty, or personalized request patterns. It underpins modern LLM serving pipelines, billion-scale personalization (video, recommendation), and edge inference with hard SLO/SLA guarantees. Key limitations include overheads from complex batch formation logic, challenges in predicting output-length distributions for balanced memory use (Zhao et al., 2024), and potential under-utilization from overly conservative admission or slack estimation. Inference across highly dynamic graphs or with extreme deadline variance remains challenging for naive RLB schedulers (Choi et al., 2020). Future directions include further integration of online learning for dynamic batch/scheduling policies (Zhang et al., 2023), extension to mixture-of-adapters per-request (Wen et al., 2023), and incorporation of advanced attention/memory management schemes for even higher resource efficiency (Zhao et al., 2024).

7. Comparison to Alternative Batching Paradigms

The table below contrasts RLB with alternative batching approaches:

Paradigm	Batch Unit	Flexibility	SLA/SLO Awareness	Padding Overhead	Personalization
Static Batching	Fixed N requests	Low	None	High (variable)	No
Token-level/Continuous	Individual tokens	High (decode loop)	None	Medium	No
Graph-level Batching	All nodes/graph	Moderate	Low/explicit	Medium–High	No
Request-Level Batching	Entire request	High	Yes (via policy)	Minimal (bucket)	Yes (adapters)