Request-Level Batching (RLB) Explained
- Request-Level Batching (RLB) is defined as grouping and scheduling complete inference requests to enhance resource utilization and enforce system constraints.
- It employs techniques like homogeneous binning, adaptive bucket grouping, and DRL-based scheduling to balance trade-offs between throughput, latency, cost, and memory.
- Empirical results demonstrate significant gains, such as up to 5.1× throughput improvements and substantial cost reductions, confirming its applicability across AI serving domains.
Request-Level Batching (RLB) refers to a suite of algorithmic and systems techniques for grouping, scheduling, and executing whole inference “requests” in AI and serving pipelines. Unlike micro-batching (which splits operators inside a request) or pure token-level/continuous batching (which aggregates tokens across requests during iterative inference), RLB manages entire independent request dataflows, allowing the system to maximize resource utilization, amortize expensive sub-computations, and/or enforce system constraints such as SLO, cost, or memory. The implementation details of RLB vary substantially by domain—including recommender systems, LLM serving (online/offline), serverless DNN inference, edge model scheduling, and batching of control-intensive programs—but the unifying principle is that RLB organizes the batch and serving pipeline at “request granularity,” enabling shared computation, parameter-efficient scheduling, and systematic trade-offs between throughput, latency, cost, and resource constraints.
1. Core Concepts and Formal Models
Modern AI serving workloads often comprise requests of heterogeneous structure, length, and resource demand. In RLB, the atomic unit of batching is the whole request, which can consist of variable-length sequences (e.g., input prompts and output tokens for LLMs), multi-target candidate sets (e.g., user histories and candidate recommendations), or generic data-dependent interpreters (e.g., for MCMC). Key distinctions versus micro-batching and continuous token batching are:
- Batch Formation: Batches are formed by grouping together requests (possibly with similar length, SLO, or resource profile), rather than forming batches by operator or token.
- Scheduling Objective: RLB seeks to optimize a target system metric over the batch—e.g., minimizing per-request latency, maximizing tokens/sec, satisfying multi-SLO requirements, or minimizing serverless cost.
- Memory and Control Constraints: With variable-length and heterogeneous requests, batch feasibility is often constrained by GPU/TPU KV-cache, system RAM, or operation-adaptive memory models. Mathematical feasibility conditions are typically of the form $\sum_{i \in \mathcal{B}} m_i \le M$, where $i$ indexes requests in the batch $\mathcal{B}$, $m_i$ is the memory footprint of request $i$, and $M$ is the available memory capacity (a minimal check of this form is sketched below).
Common formalizations include queueing-theoretic models (Guldogan et al., 3 Dec 2024), cost-SLO analytical optimization (Chen et al., 9 May 2024), and per-target complexity reduction (Guan et al., 8 Nov 2025).
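To make the feasibility constraint above concrete, the following is a minimal sketch of an additive KV-cache budget check. The `Request` fields and the worst-case footprint model are illustrative assumptions, not drawn from any of the cited systems.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int    # prefill length
    max_new_tokens: int   # decode budget reserved for this request

def kv_footprint(req: Request, bytes_per_token: int) -> int:
    """Worst-case KV-cache bytes for one request under an additive model (m_i)."""
    return (req.prompt_tokens + req.max_new_tokens) * bytes_per_token

def batch_is_feasible(batch: list[Request], capacity_bytes: int,
                      bytes_per_token: int) -> bool:
    """Check the request-level constraint sum_i m_i <= M for a candidate batch."""
    return sum(kv_footprint(r, bytes_per_token) for r in batch) <= capacity_bytes

# Example: two requests against an 8 GiB KV budget at ~160 KiB per token.
ok = batch_is_feasible([Request(512, 128), Request(2048, 256)],
                       capacity_bytes=8 * 2**30, bytes_per_token=160 * 1024)
```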
2. Batching and Scheduling Algorithms
RLB methodologies employ specialized algorithms to construct and execute batches, exploiting statistical structure in request arrival, request length distribution, and shared computation.
a. Homogeneous Grouping and Bin-Based Partitioning
When the primary inefficiency arises from variable request lengths—such as in LLMs, where batch completion is gated by the slowest request—binning approaches partition requests by predicted execution time (Guldogan et al., 3 Dec 2024). With $k$ bins, a request with predicted execution time $t$ is assigned to the bin whose interval contains $t$, and batches are formed within each bin. As $k \to \infty$, throughput approaches the parallel-optimal rate.
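A minimal sketch of the binning step, assuming each request arrives with a predicted execution time and that bin boundaries are fixed in advance; the `BIN_EDGES` values and `predict_time` callback are placeholders, not parameters from the cited work.

```python
import bisect
from collections import defaultdict

# Illustrative bin boundaries on predicted execution time (seconds); k = 5 bins.
BIN_EDGES = [0.5, 1.0, 2.0, 4.0]

def assign_bin(predicted_time: float) -> int:
    """Map a request's predicted execution time to one of k bins."""
    return bisect.bisect_left(BIN_EDGES, predicted_time)

def form_batches(requests, predict_time, batch_size: int):
    """Group requests into per-bin queues, then emit fixed-size batches per bin."""
    bins = defaultdict(list)
    for req in requests:
        bins[assign_bin(predict_time(req))].append(req)
    for bin_id, queue in bins.items():
        for start in range(0, len(queue), batch_size):
            yield bin_id, queue[start:start + batch_size]

# Example: batch integer "requests" whose predicted time is the value itself.
batches = list(form_batches(range(10), predict_time=float, batch_size=4))
```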
b. Bucket-Based Adaptive Batching
BucketServe (Zheng et al., 23 Jul 2025) formulates RLB as bucketing requests into sequence-length-homogeneous groups, adapting bucket boundaries and batch size in real time to balance padding inefficiency, memory usage, and latency SLO. Adaptive splitting and merging adjust to workload, and priority-aware scheduling gives precedence to latency-sensitive traffic.
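The split/merge logic can be illustrated with a simplified sketch that splits a bucket when padding waste exceeds a threshold; the threshold, median pivot, and function names are assumptions rather than BucketServe's actual policy.

```python
def padding_waste(lengths: list[int]) -> float:
    """Fraction of padded tokens if every request is padded to the bucket max."""
    if not lengths:
        return 0.0
    return 1.0 - sum(lengths) / (len(lengths) * max(lengths))

def maybe_split(bucket: list[int], waste_threshold: float = 0.3) -> list[list[int]]:
    """Split a bucket of sequence lengths at its lower median when padding
    waste exceeds the threshold; otherwise keep it intact."""
    if len(bucket) < 2 or padding_waste(bucket) <= waste_threshold:
        return [bucket]
    pivot = sorted(bucket)[(len(bucket) - 1) // 2]
    low = [l for l in bucket if l <= pivot]
    high = [l for l in bucket if l > pivot]
    return [low, high] if low and high else [bucket]

# Example: a skewed bucket gets split, a homogeneous one does not.
print(maybe_split([64, 70, 900, 1024]))   # -> [[64, 70], [900, 1024]]
print(maybe_split([500, 510, 512, 505]))  # -> [[500, 510, 512, 505]]
```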
c. Grouping by Prefix Sharing and SLO
BatchLLM (Zheng et al., 29 Nov 2024) and BlendServe (Zhao et al., 25 Nov 2024) implement RLB by performing global prefix analysis, grouping requests sharing maximal common prefixes to amortize expensive prefill computation, and scheduling these groups to blend decode-heavy and prefill-heavy work. The system employs memory-centric batching, horizontal fusion of attention kernels, and group-reordering to maximize GPU utilization.
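As a toy illustration of prefix grouping, the sketch below uses a simple sort-and-group heuristic over tokenized prompts; the real systems perform global prefix analysis, so this conveys only the flavor of the idea.

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the shared token prefix of two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def group_by_prefix(prompts: list[list[int]], min_shared: int = 4) -> list[list[int]]:
    """Sort prompts lexicographically, then start a new group whenever the
    shared prefix with the previous prompt falls below min_shared tokens.
    Returns groups of prompt indices."""
    order = sorted(range(len(prompts)), key=lambda i: prompts[i])
    groups: list[list[int]] = []
    current: list[int] = order[:1]
    for prev, cur in zip(order, order[1:]):
        if common_prefix_len(prompts[prev], prompts[cur]) >= min_shared:
            current.append(cur)
        else:
            groups.append(current)
            current = [cur]
    if current:
        groups.append(current)
    return groups

# Example: prompts 0 and 2 share a long common prefix and end up grouped.
prompts = [[1, 2, 3, 4, 9], [7, 7, 7, 7, 7], [1, 2, 3, 4, 5]]
print(group_by_prefix(prompts))  # -> [[2, 0], [1]]
```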
d. Cost- and SLO-Optimized Grouping
HarmonyBatch (Chen et al., 9 May 2024) addresses multi-tenant serverless DNN serving by merging request groups (with diverse SLOs and rates) into batchable clusters, jointly provisioning CPU/GPU resources, batch size, and timeout to minimize expected monetary cost while guaranteeing per-request SLO via analytical latency fits.
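A hedged sketch of this analytical style follows: a linear batch-latency fit, a per-request serverless cost model, and an SLO-constrained search over batch sizes. The coefficients, pricing function, and search range are placeholders, not HarmonyBatch's actual fits.

```python
def batch_latency(b: int, alpha: float, beta: float) -> float:
    """Fitted batch inference latency, modeled as roughly linear in batch size b."""
    return alpha * b + beta

def per_request_cost(b: int, alpha: float, beta: float,
                     mem_gb: float, price_gb_s: float) -> float:
    """Serverless cost per request: billed GB-seconds of one batched invocation,
    amortized over the b requests it serves."""
    return batch_latency(b, alpha, beta) * mem_gb * price_gb_s / b

def best_batch_size(slo_s: float, timeout_s: float, alpha: float, beta: float,
                    mem_gb: float, price_gb_s: float, b_max: int = 64):
    """Pick the cheapest batch size whose worst-case latency (wait for the batch
    to fill, then run inference) still meets the per-request SLO."""
    feasible = [b for b in range(1, b_max + 1)
                if timeout_s + batch_latency(b, alpha, beta) <= slo_s]
    if not feasible:
        return None  # no batch size meets the SLO under this configuration
    return min(feasible, key=lambda b: per_request_cost(b, alpha, beta,
                                                        mem_gb, price_gb_s))
```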
e. Control-Intensive Batched Execution
RLB as a program transformation (e.g., for Bayesian inference in (Radul et al., 2019)) involves tracking control state (program counter, call stack) for each request, issuing kernels only when all requests are at the same interpreter point, and supporting divergent control flow/recursion in data-parallel batches.
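A minimal sketch of the lockstep principle, with a toy per-request program standing in for the interpreter; this illustrates program-counter grouping only, not the actual autobatching transformation of (Radul et al., 2019).

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Prog:
    """Toy per-request interpreter state: counts down from n across two points."""
    pc: int
    n: int
    def step(self, point: int):
        if point == 0:                        # point 0: decrement the counter
            self.n -= 1
            return 1
        return None if self.n <= 0 else 0     # point 1: finish or loop back

def run_lockstep(programs, max_steps: int = 10_000):
    """Advance requests in lockstep: group live requests by program counter and
    'launch one kernel' per step over the most populous interpreter point."""
    active = set(range(len(programs)))
    for _ in range(max_steps):
        if not active:
            break
        by_point = defaultdict(list)
        for i in active:
            by_point[programs[i].pc].append(i)
        point, members = max(by_point.items(), key=lambda kv: len(kv[1]))
        for i in members:                     # stands in for a single batched kernel
            nxt = programs[i].step(point)
            if nxt is None:
                active.discard(i)
            else:
                programs[i].pc = nxt

run_lockstep([Prog(pc=0, n=k) for k in (1, 3, 5)])  # divergent iteration counts
```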
f. Resource- and Throughput-Oriented Scheduling
LLM serving with variable prefill/decode lengths (Wang et al., 8 Aug 2025) formalizes RLB as a batch scheduling problem over heterogeneous requests under a capacity-constrained memory model of the form $\sum_{i \in \mathcal{B}_t} \big(p_i + d_i(t)\big) \le M$, where $\mathcal{B}_t$ is the set of requests being advanced at step $t$, $p_i$ is request $i$'s prefill length, and $d_i(t)$ is the number of tokens generated so far. The “Sorted-F” algorithm forms batches by minimizing batch cost under memory feasibility.
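A hedged sketch of greedy batch formation under such a memory constraint, sorting by remaining work and admitting requests while the KV budget holds; this mimics the flavor of the scheduling problem, not the exact Sorted-F cost function.

```python
def form_batch(pending, memory_budget: int):
    """pending: list of dicts with 'prefill', 'decoded', and 'remaining' token
    counts. Greedily admit requests (shortest remaining work first) while the
    summed KV footprint (prefill + decoded so far) stays within the budget."""
    batch, used = [], 0
    for req in sorted(pending, key=lambda r: r["remaining"]):
        footprint = req["prefill"] + req["decoded"]
        if used + footprint <= memory_budget:
            batch.append(req)
            used += footprint
    return batch

# Example: both requests fit within a 2048-token KV budget.
pending = [{"prefill": 1024, "decoded": 50, "remaining": 200},
           {"prefill": 256, "decoded": 10, "remaining": 30}]
batch = form_batch(pending, memory_budget=2048)
```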
3. Complexity Reduction and Shared Computation
A central motivation of RLB is to amortize computation, bandwidth, or memory costs across multiple "sub-samples" of a request or across requests:
- Amortized User-/Prefix Embedding: In long-sequence recommenders that score $m$ candidates per request, RLB encodes the user/history once and shares it across all $m$ predictions, reducing communication and per-target cost to $1/m$ of the unshared baseline (Guan et al., 8 Nov 2025); a minimal sketch appears after this list.
- Global Prefix Sharing: In LLM serving, grouping requests by shared prefix drastically reduces redundant key-value computation; empirical KV reuse ratios rise from 44% (LRU caching) to 54.9% (global grouping) (Zheng et al., 29 Nov 2024). BlendServe retains most of the optimal prefix-sharing ratio while reordering requests for compute balance (Zhao et al., 25 Nov 2024).
- Batch Memory Scaling: Reusing the same user-embedding or prefix expands maximal trainable sequence length (at fixed RAM) (Guan et al., 8 Nov 2025); batch-structured transformers can handle up to 10k-length histories or unpadded LLM prompt settings.
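As a concrete illustration of the first point above, the following minimal sketch encodes the user history once and reuses the resulting vector to score all $m$ candidates; the shapes and toy scoring functions are illustrative, not the cited production architecture.

```python
import numpy as np

def encode_user(history: np.ndarray, W_u: np.ndarray) -> np.ndarray:
    """Expensive user/history encoding, performed once per request."""
    return np.tanh(history.mean(axis=0) @ W_u)          # shape (d,)

def score_candidates(user_vec: np.ndarray, candidates: np.ndarray,
                     W_c: np.ndarray) -> np.ndarray:
    """Cheap per-candidate scoring that reuses the shared user vector."""
    cand_vecs = np.tanh(candidates @ W_c)                # shape (m, d)
    return cand_vecs @ user_vec                          # shape (m,) CTR logits

# Request-level batching: one history encode amortized over m candidate scores.
rng = np.random.default_rng(0)
d = 16
history = rng.normal(size=(1000, d))     # long user history
candidates = rng.normal(size=(8, d))     # m = 8 candidates in one request
W_u = rng.normal(size=(d, d))
W_c = rng.normal(size=(d, d))
user_vec = encode_user(history, W_u)                     # computed once
logits = score_candidates(user_vec, candidates, W_c)     # shared across all m
```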
4. System Integration and Implementation Strategies
Specific instantiations of RLB are realized via integration with serving engines, training/inference pipelines, and scheduler design:
- LLM Inference Engines: BatchLLM and BlendServe operate as scheduling layers above the core inference kernel, reordering offline or large-batch LLM requests by group/prefix and memory footprint, and dispatching fused kernels with horizontal attention (Zheng et al., 29 Nov 2024, Zhao et al., 25 Nov 2024).
- Online Serving: In BucketServe, batch sizes and bucket boundaries are adjusted dynamically with each queue-length change, and batches are scheduled according to earliest-deadline (online) or SJF/LJF (offline) within buckets (Zheng et al., 23 Jul 2025).
- DRL-Based Schedulers: BCEdge frames the request-level batching/adaptive-concurrency problem as an MDP, optimizes a log-throughput/latency utility (a toy version is sketched after this list), and employs off-policy, maximum-entropy DRL (discrete SAC) to tune batch size and concurrent runs (Zhang et al., 2023).
- Analytical Provisioning: HarmonyBatch analytically solves for group-batching, function type, and parameter assignment to minimize cost; batch size for GPU is chosen at the utilization-limited point (Chen et al., 9 May 2024).
- Programmatic Transformation: Autobatching for control-intensive interpreters transforms each scalar variable into a stack tensor, and maintains program counters for all requests, synchronizing kernel invocation across control flow divergence (Radul et al., 2019).
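As a rough illustration of the utility-driven formulation mentioned above, here is a toy reward of the log-throughput-minus-latency-penalty form and a discrete action space over (batch size, concurrency) pairs; the weights and grids are assumptions, not BCEdge's exact definitions.

```python
import math

def utility(throughput_rps: float, latency_s: float, slo_s: float,
            w_t: float = 1.0, w_l: float = 1.0) -> float:
    """Per-step reward: weighted log-throughput minus a penalty that grows as
    observed latency approaches and exceeds the SLO."""
    penalty = max(0.0, latency_s / slo_s - 1.0)
    return w_t * math.log(1.0 + throughput_rps) - w_l * penalty

# Discrete action space the scheduler searches over: (batch size, concurrency).
ACTIONS = [(b, c) for b in (1, 2, 4, 8, 16) for c in (1, 2, 4)]

# Example: a higher-throughput but SLO-violating operating point scores lower.
print(utility(throughput_rps=200, latency_s=0.08, slo_s=0.1))            # within SLO
print(utility(throughput_rps=400, latency_s=0.25, slo_s=0.1, w_l=5.0))   # violates SLO
```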
5. Quantitative Impact and Empirical Outcomes
Empirical validation across domains consistently demonstrates that request-level batching yields substantial improvements in resource utilization, latency, cost, and/or SLO satisfaction. The following table summarizes selected results:
| System/Paper | Metric | Baseline | RLB/Improved | Relative Gain |
|---|---|---|---|---|
| Douyin/STCA (Guan et al., 8 Nov 2025) | Bandwidth | — | 77–84% reduction | — |
| Douyin/STCA (Guan et al., 8 Nov 2025) | Throughput | — | +2.2–5.1× (GPU, kernel) | — |
| BucketServe (Zheng et al., 23 Jul 2025) | Throughput | UELLM / DistServe | 3.58× / 1.31× | — |
| BucketServe (Zheng et al., 23 Jul 2025) | SLO attainment | DistServe | 1.93× higher RPS | — |
| BatchLLM (Zheng et al., 29 Nov 2024) | Requests/sec | vLLM | 1.1× (A100) – 2.0× (MI200) | — |
| BlendServe (Zhao et al., 25 Nov 2024) | Throughput | SGLang/vLLM/NanoFlow | 19.3%–22.7% | up to 1.44× |
| HarmonyBatch (Chen et al., 9 May 2024) | Cost | BATCH, MBS⁺ | up to 82.9% reduction | — |
| BCEdge (Zhang et al., 2023) | Utility | DeepRT | +37.6% | — |
| Autobatcher (Radul et al., 2019) | Gradients/sec | Python NUTS | 400k+ (RLB) vs. 200k (naive) | up to 30× |
The surveyed RLB methods scale robustly across workloads, with throughput increasing monotonically with batch size or degree of grouping until hardware or latency constraints dominate. Notably, methods such as multi-bin batching (Guldogan et al., 3 Dec 2024) and Sorted-F (Wang et al., 8 Aug 2025) are provably optimal (or within a constant factor) with respect to parallel throughput or latency.
6. Trade-offs, Limiting Factors, and Approximation Schemes
Design and deployment of RLB involve nontrivial trade-offs, often system- or application-specific:
- Latency vs. Throughput: Increasing the group or bin count (larger $k$, finer homogeneity) better aligns per-batch service times and monotonically improves throughput, but requests wait longer for same-bin peers, potentially raising mean or tail latency (Guldogan et al., 3 Dec 2024).
- Approximate Grouping: Hybrid and approximate algorithms—e.g., local swap, quantile-greedy, DP-relaxations—can deliver near-optimal batch selection and scheduling with much lower per-batch compute cost (Wang et al., 8 Aug 2025).
- Prefix-Group Reordering: In BatchLLM, early scheduling of heavy-decode groups boosts blending in subsequent batches but may be suboptimal for unpredictable output-length distributions; conservative memory thresholds and fallback to memory-centric batching provide robustness (Zheng et al., 29 Nov 2024).
- Prediction Accuracy: Gains in multi-bin batching and bucketed scheduling rely on accurate length prediction or online profiling; predictor misclassification reintroduces some idle tail time within batches, but degradation remains graceful (Guldogan et al., 3 Dec 2024).
- Hardware/Kernel Overheads: Fused attention and memory adaptation may fall marginally short of hand-optimized kernels in extreme regimes (~10–15% overhead), but overall system utilization increases due to fewer kernel launches and smoother occupancy curves (Zheng et al., 29 Nov 2024, Zhao et al., 25 Nov 2024).
7. Domains of Application and Future Research Directions
Request-level batching has achieved wide adoption and ongoing research innovation across:
- High-throughput LLM Inference: RLB is fundamental in modern offline and large-batch LLM serving, where prefix reuse, compute/memory balancing, and latency/SLO-constrained scheduling are primary drivers (Zheng et al., 23 Jul 2025, Zheng et al., 29 Nov 2024, Guldogan et al., 3 Dec 2024, Zhao et al., 25 Nov 2024, Wang et al., 8 Aug 2025).
- Recommendation and Ranking Systems: Amortized user-side encoding (as in Douyin’s STCA with RLB) is critical for long-history, multi-target CTR prediction at scale (Guan et al., 8 Nov 2025).
- Serverless and Edge DNN Inference: RLB enables multi-SLO, cost-optimized, and heterogeneous batch scheduling over cloud and edge compute, leveraging analytical, DRL, or heuristic control (Chen et al., 9 May 2024, Zhang et al., 2023).
- General Control-Intensive Computation: Probabilistic programming, gradient-based MCMC, and recursive algorithms leverage RLB program transformations to unlock SIMD acceleration scaling (Radul et al., 2019).
Future directions include tighter integration of continuous/token-level batching with global RLB, adaptive online binning/grouping based on workload shifts, advanced hybrid policy learning (combining DRL and analytical models), kernel- and hardware-aware group scheduling, and extension to multi-model, multi-tenant, or federated AI serving environments. A plausible implication is that as model and request heterogeneity grow, RLB abstractions and algorithms will increasingly co-evolve with serving system software and hardware stack co-design.