Request-Level Batching (RLB)

Updated 21 January 2026
  • Request-Level Batching (RLB) is a practice that aggregates full inference requests to improve throughput, reduce resource overhead, and meet strict SLA/SLO constraints.
  • It utilizes advanced scheduling methods, dynamic bin packing, and adaptive policies to handle heterogeneous request loads in diverse AI applications.
  • RLB significantly enhances performance in LLM serving, recommender systems, and edge inference by optimizing latency, utilization, and overall throughput.

Request-Level Batching (RLB) is the practice of aggregating complete inference requests, rather than partial computations such as graph nodes or individual tokens, into shared execution units to increase throughput, reduce resource overhead, and improve quality-of-service (QoS) under varied constraints. In modern AI systems, RLB has emerged as a critical scheduling abstraction across production-grade LLM serving (Tian et al., 18 Dec 2025, Zheng et al., 23 Jul 2025), cloud and edge inference engines (Choi et al., 2020, Zhang et al., 2023), adaptive fine-tuning infrastructures (Wen et al., 2023), and billion-scale recommender systems (Guan et al., 8 Nov 2025), superseding simpler batching strategies to address heterogeneity, scalability, and strict deadline requirements.

1. Principles and Definitions

RLB is characterized by grouping entire inference requests or training examples into a batch prior to shared computation over the model execution graph. Each request consists of an input (such as a prompt, history, or image) and associated metadata (such as a target, user adapter, or SLO/SLA deadline), and is processed end-to-end through the appropriate layers. This contrasts with static batching (which waits for a fixed number of requests) and token-level “continuous” batching, which combine partial computations but do not address holistic request-level resource amortization or scheduling (Zheng et al., 23 Jul 2025, Tian et al., 18 Dec 2025).

Formally, in the RLB paradigm:

  • A batch $B = \{ r_1, r_2, \ldots, r_N \}$ is a set of independent requests, each with its own execution trajectory through the model.
  • In LLM serving, RLB groups queries with different sequence lengths and arrival times, forming a schedule for simultaneous execution of their full prefill or decode phases (Tian et al., 18 Dec 2025, Zheng et al., 23 Jul 2025).
  • In multitarget recommender systems, RLB enables shared encoding of repeated features (e.g., user history) across multiple targets (Guan et al., 8 Nov 2025).
  • In adapter-based inference (e.g., FLoRA), RLB supports batching diverse requests each with their own low-rank adaptation weights (Wen et al., 2023).
  • In SLA/SLO-constrained inference, RLB underpins adaptive admission control to balance deadline satisfaction and resource utilization (Choi et al., 2020, Zhang et al., 2023, Chang et al., 24 Jun 2025).
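The shared-encoding case in the list above can be illustrated with a minimal sketch. It assumes a hypothetical `Sample` record (request id, user history, target) and simply counts how many times a history encoder would run under point-wise versus request-level grouping; none of these names come from the cited systems.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical training/inference sample: one target within one request."""
    request_id: str
    history: tuple  # feature shared by every target of the same request
    target: str


def group_by_request(samples):
    """Collate all targets that belong to the same request."""
    groups = defaultdict(list)
    for s in samples:
        groups[s.request_id].append(s)
    return dict(groups)


def encoder_calls(samples):
    """Return (point-wise, request-level) counts of history-encoder invocations."""
    pointwise = len(samples)                        # history re-encoded per target
    request_level = len(group_by_request(samples))  # history encoded once per request
    return pointwise, request_level


samples = [
    Sample("u1", (1, 2, 3), "v1"),
    Sample("u1", (1, 2, 3), "v2"),
    Sample("u1", (1, 2, 3), "v3"),
    Sample("u2", (7,), "v1"),
    Sample("u2", (7,), "v9"),
]
print(encoder_calls(samples))  # (5, 2)
```

The saving grows with the number of targets per request, which is why the grouping matters most when the shared feature (here, the history) is expensive to encode.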

2. Algorithmic Methodologies

2.1 Scheduling and Buffering

RLB systems typically introduce a scheduler-side buffer that temporarily holds requests and dynamically forms batches according to configurable windows, resource states, or pending deadlines (Tian et al., 18 Dec 2025). For example, Staggered Batch Scheduling (SBS) buffers requests for a window $\Delta t$ to match device availability and then dispatches an optimal batch across distributed DP units, reducing queueing in hidden device buffers and improving both time-to-first-token (TTFT) and throughput (Tian et al., 18 Dec 2025). In contrast, offline batch pipelines (e.g., BlendServe) may reorder requests according to resource profiles to maximize both prefix sharing and operator resource overlap (Zhao et al., 2024).
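A scheduler-side buffer with a fixed window can be sketched as follows. This is an illustration only, not the SBS implementation: the class name `WindowedBuffer`, its methods, and the explicit `now` argument (used instead of a wall clock so the behavior is deterministic) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WindowedBuffer:
    """Hold requests for up to `window` seconds, then release them as one batch."""
    window: float
    pending: list = field(default_factory=list)
    window_start: Optional[float] = None

    def submit(self, request, now):
        # The window opens when the first request of a new batch arrives.
        if not self.pending:
            self.window_start = now
        self.pending.append(request)

    def poll(self, now):
        """Return the buffered batch once the window has elapsed, else None."""
        if self.pending and now - self.window_start >= self.window:
            batch, self.pending = self.pending, []
            self.window_start = None
            return batch
        return None


buf = WindowedBuffer(window=0.010)
buf.submit("req-a", now=0.000)
buf.submit("req-b", now=0.004)
early = buf.poll(now=0.005)   # window still open
ready = buf.poll(now=0.010)   # window elapsed, batch released
```

A production scheduler would also trigger dispatch on device availability or pending deadlines, as the paragraph above describes; the fixed window is the simplest of those triggers.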

2.2 Bin Packing and Load Balancing

Batch formation in RLB must account for heterogeneity such as sequence length (LLMs), memory usage (autoregressive models), priority (SLA/SLO), or adapter identity (personalized fine-tuning). Water-filling, longest-job-first, and prioritized scheduling algorithms are deployed to maximize resource occupancy and minimize straggler effects (Zheng et al., 23 Jul 2025, Tian et al., 18 Dec 2025, Zhao et al., 2024). In distributed LLM serving, global allocation policies—such as Prioritized Batch Allocation and IQR-aware decode scheduling—dynamically bin-pack requests by DP unit state, available capacity, and cache pressure (Tian et al., 18 Dec 2025).
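As an illustration of the longest-job-first idea, the sketch below greedily assigns requests, longest first, to whichever unit currently carries the least load. It is a generic greedy heuristic under the assumption that sequence length is a usable cost proxy; it is not the Prioritized Batch Allocation or IQR-aware policy from the cited work, and the function name is hypothetical.

```python
import heapq


def longest_first_assign(lengths, num_units):
    """Assign request indices to units, longest first, always to the least-loaded unit."""
    heap = [(0, unit) for unit in range(num_units)]  # (current load, unit id)
    heapq.heapify(heap)
    assignment = {unit: [] for unit in range(num_units)}

    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        load, unit = heapq.heappop(heap)
        assignment[unit].append(idx)
        heapq.heappush(heap, (load + lengths[idx], unit))

    loads = {u: sum(lengths[i] for i in idxs) for u, idxs in assignment.items()}
    return assignment, loads


lengths = [900, 100, 500, 400, 300, 200]  # illustrative token counts
assignment, loads = longest_first_assign(lengths, num_units=2)
print(loads)  # {0: 1200, 1: 1200}
```

Placing long requests first leaves the short ones to even out the remaining gaps, which is what limits stragglers in this toy setting.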

2.3 Adaptive and SLA-aware Policies

Where requests carry deadlines or SLOs, RLB includes runtime predictors to admit only those batches likely to finish within constraints. This involves slack estimators (Choi et al., 2020), reward-augmented reinforcement learning objectives (Zhang et al., 2023), or explicit per-request feasibility models based on measured resource-speed curves (e.g., Universal Scalability Law for CodeLLMs) (Chang et al., 24 Jun 2025). Node-granular batching (LazyBatching) allows fine-grained preemption, adaptively merging execution across subgraphs to maintain high throughput while upholding strict deadlines (Choi et al., 2020).
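A deadline-aware admission check can be sketched as below. The linear latency model (`base_ms + per_request_ms * batch_size`) and its constants are assumptions chosen for illustration; the cited systems use slack estimators, learned policies, or measured scalability curves instead.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    deadline_ms: int  # absolute deadline in milliseconds


def predicted_latency_ms(batch_size, base_ms=20, per_request_ms=5):
    """Hypothetical latency model: fixed cost plus a per-request increment."""
    return base_ms + per_request_ms * batch_size


def admit(batch, candidate, now_ms):
    """Admit `candidate` only if every request in the enlarged batch still meets its deadline."""
    finish = now_ms + predicted_latency_ms(len(batch) + 1)
    return all(finish <= r.deadline_ms for r in batch + [candidate])


batch = [Request("a", deadline_ms=50)]
ok = admit(batch, Request("b", deadline_ms=100), now_ms=0)       # finish at 30 ms
too_tight = admit(batch, Request("c", deadline_ms=25), now_ms=0)  # candidate would miss
```

Checking the existing members as well as the candidate matters: a new request with a loose deadline can still be rejected if it would push an already-admitted request past its own deadline.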

3. System Architectures and Implementation Patterns

3.1 Scheduler Layer Organization

RLB requires an orchestrated control plane managing request queues, resource monitors, and feedback from compute engines. Architectures such as SBS in LLM clusters utilize a three-plane design (Control, State, Resource) to synchronize buffer formation, dispatch triggers, and load allocation (Tian et al., 18 Dec 2025). Edge deployments (BCEdge) instantiate an MDP-based DRL agent that dynamically selects batch size and concurrency for each scheduling epoch (Zhang et al., 2023).

3.2 Memory-efficient Data Structuring

Bucket-based batchers (BucketServe) maintain disjoint batches (“buckets”) partitioned by sequence length, dynamically splitting/merging as workloads and memory availability fluctuate (Zheng et al., 23 Jul 2025). Jagged tensors and compacted record blocks are used for minimal padding overhead in multimodal and long-sequence pipelines (Guan et al., 8 Nov 2025). In RLB-enabled recommender systems, collating all per-request targets into grouped micro-batches enables shared encoding and bandwidth amortization (Guan et al., 8 Nov 2025).
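The effect of length bucketing on padding can be shown with a small sketch. The bucket boundaries and the helper names are assumptions for illustration; BucketServe's dynamic splitting and merging of buckets is not modeled here.

```python
def bucketize(lengths, boundaries):
    """Place each sequence length in the first bucket whose upper bound fits it.

    `boundaries` is assumed to be sorted in ascending order.
    """
    buckets = {b: [] for b in boundaries}
    for n in lengths:
        for b in boundaries:
            if n <= b:
                buckets[b].append(n)
                break
        else:
            raise ValueError(f"length {n} exceeds the largest bucket")
    return buckets


def padded_tokens(group):
    """Padding tokens needed to pad every sequence in `group` to the longest one."""
    return max(group) * len(group) - sum(group) if group else 0


lengths = [10, 12, 100, 110, 900]
single_batch_padding = padded_tokens(lengths)
buckets = bucketize(lengths, boundaries=[16, 128, 1024])
bucketed_padding = sum(padded_tokens(g) for g in buckets.values())
print(single_batch_padding, bucketed_padding)  # 3368 12
```

In this toy workload one long sequence forces thousands of padding tokens onto the short ones when everything shares a batch, while bucketing keeps similar lengths together.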

3.3 Adapter Batching for Personalized Inference

Standard low-rank adaptation mechanisms (LoRA) are batch-incompatible with heterogeneous adapters. Fast LoRA (FLoRA) reframes the forward pass via an elementwise Hadamard mask, allowing each request in the batch to carry its own $(B_i, A_i)$ adapter while sharing the expensive GEMM operations, resulting in 2–5× throughput and latency gains for small rank values (Wen et al., 2023).
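One way to see why an elementwise formulation lets heterogeneous adapters share a matrix multiply is the rank-1 identity sketched below. It assumes each request adapts the base weight as $W_i = W_0 \circ (b_i a_i^\top)$; that form is an assumption made for illustration and may differ from the exact FLoRA parameterization. Under it, $x_i^\top W_i = a_i \circ ((b_i \circ x_i)^\top W_0)$, so only the cheap elementwise scalings are per-request.

```python
def vec_mat(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[k] * W[k][j] for k in range(len(x))) for j in range(len(W[0]))]


def naive_per_request(x, W0, b, a):
    """Materialize the adapted weight W_i = W0 * (b a^T) elementwise, then multiply."""
    Wi = [[W0[k][j] * b[k] * a[j] for j in range(len(a))] for k in range(len(b))]
    return vec_mat(x, Wi)


def shared_weight_batch(X, W0, B, A):
    """Scale inputs by b_i, multiply by the shared W0, then scale outputs by a_i."""
    outputs = []
    for x, b, a in zip(X, B, A):
        scaled_in = [xk * bk for xk, bk in zip(x, b)]
        y = vec_mat(scaled_in, W0)  # on real hardware: one GEMM over the stacked batch
        outputs.append([aj * yj for aj, yj in zip(a, y)])
    return outputs


W0 = [[1, 2], [3, 4], [5, 6]]      # shared base weight, 3 x 2
X = [[1, 0, 2], [2, 1, 0]]         # two requests
B = [[1, 2, 3], [2, 1, 1]]         # per-request input-side vectors b_i
A = [[1, -1], [3, 2]]              # per-request output-side vectors a_i

shared = shared_weight_batch(X, W0, B, A)
naive = [naive_per_request(x, W0, b, a) for x, b, a in zip(X, B, A)]
print(shared[0])  # [31, -38]
```

The naive path builds a separate weight matrix per request, while the shared path never modifies `W0`, which is the property that makes batching across different adapters possible.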

4. Theoretical Models and Mathematical Formulations

The performance and feasibility of RLB are governed by explicit cost models:

  • Service times and device queuing: Immediate dispatch gives average device queueing of $T/2$ per batch, while RLB with $N$ parallel engines reduces this to $T/(2N)$ plus scheduler wait, yielding lower TTFT if scheduler intervals are kept small (Tian et al., 18 Dec 2025).
  • Memory fitting: For LLMs, the maximum batch size $N_{\max}$ given memory budget $M_{\text{safe}}$ is

$$N_{\max} = \max \left\{ N : 2 L H D S_{\max} B_{\text{dtype}} N \leq M_{\text{safe}} \right\}$$

where $L$, $H$, $D$ denote the number of transformer layers, head count, and per-head dimension, $S_{\max}$ is the maximum sequence length, and $B_{\text{dtype}}$ is the datatype size (Zheng et al., 23 Jul 2025).

  • Utility and reward functions: In edge settings, a log-utility function is maximized via entropy-regularized RL to find the optimal batch size and concurrency within memory and SLO constraints (Zhang et al., 2023).

  • Admission control and SLA feasibility: Predictive models of completion time under concurrent load admit a request only if its expected completion time at the new concurrency level meets its deadline, optimizing “goodput” (the percentage of requests fulfilling their SLA) (Chang et al., 24 Jun 2025).
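The queueing and memory-fitting models above reduce to a few lines of arithmetic. The sketch below is a worked instance under assumed, illustrative values (a 32-layer model with 32 heads of dimension 128, 2-byte datatype, 2048-token maximum sequence, 40 GiB budget); the function names are hypothetical.

```python
def max_batch_size(layers, heads, head_dim, s_max, dtype_bytes, m_safe_bytes):
    """Largest N with 2 * L * H * D * S_max * B_dtype * N <= M_safe."""
    per_request_bytes = 2 * layers * heads * head_dim * s_max * dtype_bytes
    return m_safe_bytes // per_request_bytes


def mean_device_wait(step_time, num_engines=1, scheduler_wait=0.0):
    """Average device queueing T / (2N), plus any time spent in the scheduler buffer."""
    return step_time / (2 * num_engines) + scheduler_wait


# Memory fitting: each request needs 2*32*32*128*2048*2 bytes = 1 GiB at full length.
n_max = max_batch_size(layers=32, heads=32, head_dim=128,
                       s_max=2048, dtype_bytes=2, m_safe_bytes=40 * 2**30)
print(n_max)  # 40

# Queueing: immediate dispatch (N = 1, no scheduler wait) versus 8 engines
# with a 5 ms scheduler window, for a 100 ms step time.
immediate = mean_device_wait(0.100)
staggered = mean_device_wait(0.100, num_engines=8, scheduler_wait=0.005)
```

With these numbers the immediate-dispatch wait is 50 ms and the staggered wait is 11.25 ms, which shows the stated condition: the benefit holds only while the scheduler window stays small relative to $T/2$.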

5. Empirical Results and Comparative Evaluations

5.1 LLM Serving and Throughput

  • SBS achieves 30–40% lower TTFT and 15–22% higher throughput versus immediate dispatch, with chunk utilization improved from ~52% to nearly 89% (Tian et al., 18 Dec 2025).
  • BucketServe demonstrates 3.58× higher token/sec throughput than UELLM and 1.31× higher than DistServe, and can handle nearly double the request load at 80% SLO compliance, sustaining >80% GPU utilization (Zheng et al., 23 Jul 2025).
  • SABER yields up to 26% more SLA-compliant completions (“goodput”) than the best static setting, dropping end-to-end latency variability by 31–45% (Chang et al., 24 Jun 2025).

5.2 Resource-Optimized Training and Recommendation

  • Request-Level Batching in billion-scale recommenders reduces history-encoding compute and GPU memory, cuts PS CPU usage by 50%, and roughly doubles training throughput (2.2× over point-wise), with no loss in AUC/NLL metrics (Guan et al., 8 Nov 2025).
  • BlendServe outperforms SOTA offline LLM batchers (vLLM/SGLang/NanoFlow) by roughly 20.8% in throughput, reaching 86.6% of the feasible optimum for joint compute/memory overlap, while maintaining roughly 97% of prefix-sharing efficiency (Zhao et al., 2024).

5.3 Personalization and Adapter Diversity

  • FLoRA achieves 2–5× lower token latency and up to 3× higher throughput for small-rank adapters in personalized LLM serving, matching LoRA’s accuracy across diverse code-generation and speech recognition tasks (Wen et al., 2023).

A representative summary of salient RLB gains is:

| System | Relative Throughput | SLA/TTFT/SLO Gains | Utilization |
|---|---|---|---|
| SBS (LLM) (Tian et al., 18 Dec 2025) | +15–22% | –30–40% TTFT | 88% chunks |
| BucketServe (Zheng et al., 23 Jul 2025) | +3.58x (offline) | 1.93x more reqs @80% SLO | >80% GPU |
| SABER (Chang et al., 24 Jun 2025) | +26% goodput | –45% latency var | adaptive |
| BCEdge (Zhang et al., 2023) | +37.6% utility | SLO <5% up to 40 rps | adaptive |
| FLoRA (Wen et al., 2023) | 2–5x (rank 1–4) | ~0.5s/token latency | N/A |
| Douyin RLB (Guan et al., 8 Nov 2025) | +2.2–5.1x (train) | +3.35% finish rate | ~8x longer |

6. Applications, Limitations, and Extensions

RLB is fundamental to high-throughput, low-latency AI workloads with heterogeneous, bursty, or personalized request patterns. It underpins modern LLM serving pipelines, billion-scale personalization (video, recommendation), and edge inference with hard SLO/SLA guarantees. Key limitations include overheads from complex batch formation logic, challenges in predicting output-length distributions for balanced memory use (Zhao et al., 2024), and potential under-utilization from overly conservative admission or slack estimation. Inference across highly dynamic graphs or with extreme deadline variance remains challenging for naive RLB schedulers (Choi et al., 2020). Future directions include further integration of online learning for dynamic batch/scheduling policies (Zhang et al., 2023), extension to mixture-of-adapters per-request (Wen et al., 2023), and incorporation of advanced attention/memory management schemes for even higher resource efficiency (Zhao et al., 2024).

7. Comparison to Alternative Batching Paradigms

The table below contrasts RLB with alternative batching approaches:

| Paradigm | Batch Unit | Flexibility | SLA/SLO Awareness | Padding Overhead | Personalization |
|---|---|---|---|---|---|
| Static Batching | Fixed N requests | Low | None | High (variable) | No |
| Token-level/Continuous | Individual tokens | High (decode loop) | None | Medium | No |
| Graph-level Batching | All nodes/graph | Moderate | Low/explicit | Medium–High | No |
| Request-Level Batching | Entire request | High | Yes (via policy) | Minimal (bucket) | Yes (adapters) |

RLB uniquely enables resource-optimal, deadline-aware, and personalized servicing of heterogeneous and stateful inference workloads across cloud, edge, and offline/online use cases (Tian et al., 18 Dec 2025, Zheng et al., 23 Jul 2025, Wen et al., 2023, Guan et al., 8 Nov 2025, Choi et al., 2020, Zhang et al., 2023, Zhao et al., 2024, Chang et al., 24 Jun 2025).
