Request-Level Batching (RLB)
- Request-Level Batching (RLB) is a practice that aggregates full inference requests to improve throughput, reduce resource overhead, and meet strict SLA/SLO constraints.
- It utilizes advanced scheduling methods, dynamic bin packing, and adaptive policies to handle heterogeneous request loads in diverse AI applications.
- RLB significantly enhances performance in LLM serving, recommender systems, and edge inference by optimizing latency, utilization, and overall throughput.
Request-Level Batching (RLB) refers to the practice of aggregating complete inference requests, rather than partial computations such as graph nodes or individual tokens, into shared execution units for increased throughput, reduced resource overhead, and improved quality-of-service (QoS) under varied constraints. In modern AI systems, RLB has emerged as a critical scheduling abstraction in production-grade LLM serving (Tian et al., 18 Dec 2025, Zheng et al., 23 Jul 2025), cloud and edge inference engines (Choi et al., 2020, Zhang et al., 2023), adaptive fine-tuning infrastructures (Wen et al., 2023), and billion-scale recommender systems (Guan et al., 8 Nov 2025), superseding simpler batching strategies to address heterogeneity, scalability, and strict deadline requirements.
1. Principles and Definitions
RLB is characterized by grouping entire inference requests or training examples into a batch prior to shared computation over the model execution graph. Each request consists of an input (such as a prompt, history, or image) and associated metadata (such as a target, user adapter, or SLO/SLA deadline), and is processed end-to-end through the appropriate layers. This contrasts with static batching, which waits for a fixed number of requests, and with token-level “continuous” batching, which combines partial computations; neither addresses holistic request-level resource amortization or scheduling (Zheng et al., 23 Jul 2025, Tian et al., 18 Dec 2025).
Formally, in the RLB paradigm:
- A batch is a set of independent requests, each with its own execution trajectory through the model.
- In LLM serving, RLB groups queries with different sequence lengths and arrival times, forming a schedule for simultaneous execution of their full prefill or decode phases (Tian et al., 18 Dec 2025, Zheng et al., 23 Jul 2025).
- In multitarget recommender systems, RLB enables shared encoding of repeated features (e.g., user history) across multiple targets (Guan et al., 8 Nov 2025).
- In adapter-based inference (e.g., FLoRA), RLB supports batching diverse requests each with their own low-rank adaptation weights (Wen et al., 2023).
- In SLA/SLO-constrained inference, RLB underpins adaptive admission control to balance deadline satisfaction and resource utilization (Choi et al., 2020, Zhang et al., 2023, Chang et al., 24 Jun 2025).
2. Algorithmic Methodologies
2.1 Scheduling and Buffering
RLB systems typically introduce a scheduler-side buffer that temporarily holds requests and dynamically forms batches according to configurable windows, resource states, or pending deadlines (Tian et al., 18 Dec 2025). For example, Staggered Batch Scheduling (SBS) buffers requests for a window to match device availability and then dispatches an optimal batch across distributed DP units, reducing queueing in hidden device buffers and improving both time-to-first-token (TTFT) and throughput (Tian et al., 18 Dec 2025). In contrast, offline batch pipelines (e.g., BlendServe) may reorder requests according to resource profiles to maximize both prefix sharing and operator resource overlap (Zhao et al., 2024).
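The buffering behaviour described above can be illustrated with a minimal sketch. It is not the SBS implementation: the class name `WindowedBuffer`, the `window` and `max_batch` knobs, and the single `device_free` flag are all hypothetical simplifications of a scheduler that holds requests briefly and releases them only when a device can take them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    req_id: int
    arrival: float    # arrival time in seconds
    prompt_len: int   # prompt length in tokens


@dataclass
class WindowedBuffer:
    """Scheduler-side buffer (hypothetical): hold requests, release them in batches."""
    window: float     # how long the oldest buffered request may wait, in seconds
    max_batch: int    # upper bound on requests per dispatched batch
    pending: List[Request] = field(default_factory=list)
    window_start: float = 0.0

    def submit(self, req: Request) -> None:
        # The window starts when the first request enters an empty buffer.
        if not self.pending:
            self.window_start = req.arrival
        self.pending.append(req)

    def maybe_dispatch(self, now: float, device_free: bool) -> List[Request]:
        """Return a batch if a device is free and the window elapsed or the buffer is full."""
        if not self.pending or not device_free:
            return []
        window_elapsed = now - self.window_start >= self.window
        buffer_full = len(self.pending) >= self.max_batch
        if not (window_elapsed or buffer_full):
            return []
        batch = self.pending[: self.max_batch]
        self.pending = self.pending[self.max_batch:]
        if self.pending:
            # Leftover requests start a fresh window.
            self.window_start = now
        return batch
```

Keeping requests in the scheduler until `device_free` is true is what avoids queueing inside the device; a shorter `window` lowers the added scheduler wait at the cost of smaller batches.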
2.2 Bin Packing and Load Balancing
Batch formation in RLB must account for heterogeneity such as sequence length (LLMs), memory usage (autoregressive models), priority (SLA/SLO), or adapter identity (personalized fine-tuning). Water-filling, longest-job-first, and prioritized scheduling algorithms are deployed to maximize resource occupancy and minimize straggler effects (Zheng et al., 23 Jul 2025, Tian et al., 18 Dec 2025, Zhao et al., 2024). In distributed LLM serving, global allocation policies—such as Prioritized Batch Allocation and IQR-aware decode scheduling—dynamically bin-pack requests by DP unit state, available capacity, and cache pressure (Tian et al., 18 Dec 2025).
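As an illustration of longest-job-first balancing, the sketch below assigns requests to data-parallel (DP) units greedily, always placing the next-longest request on the currently least-loaded unit. It uses sequence length as the only cost signal, which is an assumption made for brevity; the cited systems also consider capacity and cache pressure. The function name `allocate_longest_first` is hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def allocate_longest_first(
    seq_lens: Dict[str, int], num_units: int
) -> Tuple[List[List[str]], List[int]]:
    """Greedy longest-first allocation of requests to DP units.

    Returns (assignment, loads): the request ids per unit and the total
    token load per unit.
    """
    # Min-heap of (current load, unit index): the least-loaded unit pops first.
    heap = [(0, unit) for unit in range(num_units)]
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(num_units)]

    # Place long requests first so short ones can fill the remaining gaps.
    for req_id, length in sorted(seq_lens.items(), key=lambda kv: kv[1], reverse=True):
        load, unit = heapq.heappop(heap)
        assignment[unit].append(req_id)
        heapq.heappush(heap, (load + length, unit))

    loads = [0] * num_units
    for load, unit in heap:
        loads[unit] = load
    return assignment, loads
```

For requests of length 900, 500, 400, and 100 on two units, this places the 900- and 100-token requests on one unit and the 500- and 400-token requests on the other, giving loads of 1000 and 900 tokens.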
2.3 Adaptive and SLA-aware Policies
Where requests carry deadlines or SLOs, RLB includes runtime predictors to admit only those batches likely to finish within constraints. This involves slack estimators (Choi et al., 2020), reward-augmented reinforcement learning objectives (Zhang et al., 2023), or explicit per-request feasibility models based on measured resource-speed curves (e.g., Universal Scalability Law for CodeLLMs) (Chang et al., 24 Jun 2025). Node-granular batching (LazyBatching) allows fine-grained preemption, adaptively merging execution across subgraphs to maintain high throughput while upholding strict deadlines (Choi et al., 2020).
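A slack-based admission step of the kind described above might look like the following sketch. The latency model in `batch_latency` (slowest member plus a fixed fractional penalty per extra request) is a made-up placeholder, not a model from any of the cited papers; a real system would substitute a measured predictor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pending:
    req_id: int
    deadline: float   # absolute deadline, in seconds
    est_cost: float   # estimated service time when run alone, in seconds


def batch_latency(costs: List[float], overhead_per_extra: float = 0.2) -> float:
    """Hypothetical model: the slowest member, inflated for each extra request."""
    if not costs:
        return 0.0
    return max(costs) * (1.0 + overhead_per_extra * (len(costs) - 1))


def admit(queue: List[Pending], now: float, overhead_per_extra: float = 0.2) -> List[Pending]:
    """Grow a batch in earliest-deadline order while no member misses its deadline."""
    batch: List[Pending] = []
    for cand in sorted(queue, key=lambda r: r.deadline):
        trial = batch + [cand]
        finish = now + batch_latency([r.est_cost for r in trial], overhead_per_extra)
        # Admit the candidate only if every member still has non-negative slack.
        if all(finish <= r.deadline for r in trial):
            batch = trial
    return batch
```

Requests that are skipped stay in the queue for a later batch, so tight-deadline requests are not delayed by long-running neighbours.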
3. System Architectures and Implementation Patterns
3.1 Scheduler Layer Organization
RLB requires an orchestrated control plane managing request queues, resource monitors, and feedback from compute engines. Architectures such as SBS in LLM clusters utilize a three-plane design (Control, State, Resource) to synchronize buffer formation, dispatch triggers, and load allocation (Tian et al., 18 Dec 2025). Edge deployments (BCEdge) instantiate an MDP-based DRL agent that dynamically selects batch size and concurrency for each scheduling epoch (Zhang et al., 2023).
3.2 Memory-efficient Data Structuring
Bucket-based batchers (BucketServe) maintain disjoint batches (“buckets”) partitioned by sequence length, dynamically splitting/merging as workloads and memory availability fluctuate (Zheng et al., 23 Jul 2025). Jagged tensors and compacted record blocks are used for minimal padding overhead in multimodal and long-sequence pipelines (Guan et al., 8 Nov 2025). In RLB-enabled recommender systems, collating all per-request targets into grouped micro-batches enables shared encoding and bandwidth amortization (Guan et al., 8 Nov 2025).
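The effect of length-based bucketing on padding can be seen in a small sketch. The static bucket boundaries and the helper names `bucketize` and `padding_tokens` are illustrative assumptions; BucketServe adjusts its buckets dynamically, which is not modelled here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def bucketize(seq_lens: Sequence[int], boundaries: Sequence[int]) -> Dict[int, List[int]]:
    """Map each request index to the smallest bucket whose upper bound covers its length."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    for idx, length in enumerate(seq_lens):
        for bound in sorted(boundaries):
            if length <= bound:
                buckets[bound].append(idx)
                break
        else:
            raise ValueError(f"sequence length {length} exceeds the largest bucket")
    return dict(buckets)


def padding_tokens(seq_lens: Sequence[int], groups: Iterable[List[int]]) -> int:
    """Count pad tokens when each group is padded to its own longest member."""
    total = 0
    for members in groups:
        longest = max(seq_lens[i] for i in members)
        total += sum(longest - seq_lens[i] for i in members)
    return total
```

For lengths `[30, 100, 900, 120, 28, 1000]`, padding everything to one batch costs 3822 pad tokens, while buckets bounded at 128 and 1024 tokens cost 302.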
3.3 Adapter Batching for Personalized Inference
Standard low-rank adaptation mechanisms (LoRA) are batch-incompatible with heterogeneous adapters. Fast LoRA (FLoRA) reframes the forward pass via an elementwise Hadamard mask, allowing each request in the batch to carry its own adapter while sharing the expensive GEMM operations, resulting in 2–5× throughput and latency gains for small rank values (Wen et al., 2023).
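The idea of sharing one GEMM across requests with different adapters can be checked numerically. The sketch below assumes a rank-1 multiplicative adapter, where request i uses the weight W0 ∘ (b_i a_iᵀ); this form is chosen for illustration and is not claimed to be the exact formulation in Wen et al. Under that assumption, scaling the input by b_i and the output by a_i around a single shared matrix product gives the same result as building each per-request weight. Plain Python lists are used to keep the example dependency-free; a tensor library would be used in practice.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product of x (n x d_in) and w (d_in x d_out)."""
    return [[sum(xv * wv for xv, wv in zip(row, col)) for col in zip(*w)] for row in x]


def per_request_forward(x: Matrix, w0: Matrix, a: Matrix, b: Matrix) -> Matrix:
    """Reference path: materialise W0 * (b_i a_i^T) elementwise for every request."""
    out = []
    for xi, ai, bi in zip(x, a, b):
        wi = [[w0[p][q] * bi[p] * ai[q] for q in range(len(ai))] for p in range(len(bi))]
        out.append(matmul([xi], wi)[0])
    return out


def batched_forward(x: Matrix, w0: Matrix, a: Matrix, b: Matrix) -> Matrix:
    """Batched path: scale inputs by b_i, run ONE shared matmul, scale outputs by a_i."""
    scaled = [[xv * bv for xv, bv in zip(xi, bi)] for xi, bi in zip(x, b)]
    shared = matmul(scaled, w0)
    return [[sv * av for sv, av in zip(si, ai)] for si, ai in zip(shared, a)]
```

The batched path never forms a per-request weight matrix, so the adapter cost reduces to two elementwise multiplications per request.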
4. Theoretical Models and Mathematical Formulations
The performance and feasibility of RLB are governed by explicit cost models:
- Service times and device queuing: Immediate dispatch incurs an average device-side queueing delay per batch, while RLB with parallel engines reduces this delay at the cost of an added scheduler wait, yielding lower TTFT if scheduler intervals are kept small (Tian et al., 18 Dec 2025).
- Memory fitting: For LLMs, the maximum batch size under a given memory budget is bounded by the per-request memory footprint, which depends on the number of transformer layers, the head count, the per-head dimension, and the datatype size (Zheng et al., 23 Jul 2025).
- Utility and reward functions: In edge settings, a log-utility function is maximized via entropy-regularized RL to find the optimal batch size and concurrency within memory and SLO constraints (Zhang et al., 2023).
- Admission control and SLA feasibility: Predictive models of concurrent load allow a request to be accepted only if its expected completion time under the new concurrency level meets its deadline, optimizing “goodput” (the percentage of requests fulfilling their SLA) (Chang et al., 24 Jun 2025).
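To make the memory-fitting item concrete, the sketch below uses the standard key-value cache estimate for a decoder-only transformer: two vectors (key and value) per head per layer per token. This is a common back-of-the-envelope model and an assumption here, not necessarily the exact expression used by Zheng et al.; the function names and the fixed `seq_len` per request are illustrative.

```python
def kv_bytes_per_token(num_layers: int, num_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies (factor 2 = one key + one value vector)."""
    return 2 * num_layers * num_heads * head_dim * dtype_bytes


def max_batch_size(
    memory_budget_bytes: int,
    seq_len: int,
    num_layers: int,
    num_heads: int,
    head_dim: int,
    dtype_bytes: int,
) -> int:
    """Largest number of requests of length seq_len whose KV cache fits in the budget."""
    per_request = seq_len * kv_bytes_per_token(num_layers, num_heads, head_dim, dtype_bytes)
    return memory_budget_bytes // per_request
```

With 32 layers, 32 heads of dimension 128, and a 2-byte datatype, each token occupies 512 KiB, so a 2048-token request needs 1 GiB and a 40 GiB budget fits 40 such requests.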
5. Empirical Results and Comparative Evaluations
5.1 LLM Serving and Throughput
- SBS achieves 30–40% lower TTFT and 15–22% higher throughput versus immediate dispatch, with chunk utilization improved from ~52% to nearly 89% (Tian et al., 18 Dec 2025).
- BucketServe demonstrates 3.58× higher token/sec throughput than UELLM and 1.31× higher than DistServe, can handle nearly double the request load at 80% SLO compliance, and sustains over 80% GPU utilization (Zheng et al., 23 Jul 2025).
- SABER yields up to 26% more SLA-compliant completions (“goodput”) than the best static setting, dropping end-to-end latency variability by 31–45% (Chang et al., 24 Jun 2025).
5.2 Resource-Optimized Training and Recommendation
- Request-Level Batching in billion-scale recommenders reduces history-encoding compute and GPU memory, cuts PS CPU usage by 50%, and doubles training throughput (2.2× over point-wise), with no loss in metric (AUC/NLL) (Guan et al., 8 Nov 2025).
- BlendServe outperforms SOTA offline LLM batchers (vLLM/SGLang/NanoFlow) by roughly 20.8% in throughput, reaching 86.6% of the feasible optimum for joint compute/memory overlap, while maintaining roughly 97% of prefix-sharing efficiency (Zhao et al., 2024).
5.3 Personalization and Adapter Diversity
- FLoRA achieves 2–5× lower token latency and up to 3× higher throughput for small-rank adapters in personalized LLM serving, matching LoRA’s accuracy across diverse code-generation and speech recognition tasks (Wen et al., 2023).
A representative summary of salient RLB gains is:
| System | Relative Throughput | SLA/TTFT/SLO Gains | Utilization |
|---|---|---|---|
| SBS (LLM) (Tian et al., 18 Dec 2025) | +15–22% | –30–40% TTFT | 88% chunks |
| BucketServe (Zheng et al., 23 Jul 2025) | +3.58x (offline) | 1.93x more reqs @80% SLO | >80% GPU |
| SABER (Chang et al., 24 Jun 2025) | +26% goodput | –45% latency var | adaptive |
| BCEdge (Zhang et al., 2023) | +37.6% utility | SLO <5% up to 40 rps | adaptive |
| FLoRA (Wen et al., 2023) | 2–5x (rank 1–4) | ~0.5s/token latency | N/A |
| Douyin RLB (Guan et al., 8 Nov 2025) | +2.2–5.1x (train) | +3.35% finish rate | ~8x longer |
6. Applications, Limitations, and Extensions
RLB is fundamental to high-throughput, low-latency AI workloads with heterogeneous, bursty, or personalized request patterns. It underpins modern LLM serving pipelines, billion-scale personalization (video, recommendation), and edge inference with hard SLO/SLA guarantees. Key limitations include overheads from complex batch formation logic, challenges in predicting output-length distributions for balanced memory use (Zhao et al., 2024), and potential under-utilization from overly conservative admission or slack estimation. Inference across highly dynamic graphs or with extreme deadline variance remains challenging for naive RLB schedulers (Choi et al., 2020). Future directions include further integration of online learning for dynamic batch/scheduling policies (Zhang et al., 2023), extension to mixture-of-adapters per-request (Wen et al., 2023), and incorporation of advanced attention/memory management schemes for even higher resource efficiency (Zhao et al., 2024).
7. Comparison to Alternative Batching Paradigms
The table below contrasts RLB with alternative batching approaches:
| Paradigm | Batch Unit | Flexibility | SLA/SLO Awareness | Padding Overhead | Personalization |
|---|---|---|---|---|---|
| Static Batching | Fixed N requests | Low | None | High (variable) | No |
| Token-level/Continuous | Individual tokens | High (decode loop) | None | Medium | No |
| Graph-level Batching | All nodes/graph | Moderate | Low/explicit | Medium–High | No |
| Request-Level Batching | Entire request | High | Yes (via policy) | Minimal (bucket) | Yes (adapters) |
RLB uniquely enables resource-optimal, deadline-aware, and personalized servicing of heterogeneous and stateful inference workloads across cloud, edge, and offline/online use cases (Tian et al., 18 Dec 2025, Zheng et al., 23 Jul 2025, Wen et al., 2023, Guan et al., 8 Nov 2025, Choi et al., 2020, Zhang et al., 2023, Zhao et al., 2024, Chang et al., 24 Jun 2025).