Request-Level Batching (RLB) Explained

Updated 15 November 2025
  • Request-Level Batching (RLB) is defined as grouping and scheduling complete inference requests to enhance resource utilization and enforce system constraints.
  • It employs techniques like homogeneous binning, adaptive bucket grouping, and DRL-based scheduling to balance trade-offs between throughput, latency, cost, and memory.
  • Empirical results demonstrate significant gains, such as up to 5.1× throughput improvements and substantial cost reductions, confirming its applicability across AI serving domains.

Request-Level Batching (RLB) refers to a suite of algorithmic and systems techniques for grouping, scheduling, and executing whole inference “requests” in AI and serving pipelines. Unlike micro-batching (which splits operators inside a request) or pure token-level/continuous batching (which aggregates tokens across requests during iterative inference), RLB manages entire independent request dataflows, allowing the system to maximize resource utilization, amortize expensive sub-computations, and/or enforce system constraints such as SLO, cost, or memory. The implementation details of RLB vary substantially by domain—including recommender systems, LLM serving (online/offline), serverless DNN inference, edge model scheduling, and batching of control-intensive programs—but the unifying principle is that RLB organizes the batch and serving pipeline at “request granularity,” enabling shared computation, parameter-efficient scheduling, and systematic trade-offs between throughput, latency, cost, and resource constraints.

1. Core Concepts and Formal Models

Modern AI serving workloads often comprise requests of heterogeneous structure, length, and resource demand. In RLB, the atomic unit of batching is the whole request, which can consist of variable-length sequences (e.g., input prompts and output tokens for LLMs), multi-target candidate sets (e.g., user histories and candidate recommendations), or generic data-dependent interpreters (e.g., for MCMC). Key distinctions versus micro-batching and continuous token batching are:

  • Batch Formation: Batches are formed by grouping together requests (possibly with similar length, SLO, or resource profile), rather than forming batches by operator or token.
  • Scheduling Objective: RLB seeks to optimize a target system metric over the batch—e.g., minimizing per-request latency, maximizing tokens/sec, satisfying multi-SLO requirements, or minimizing serverless cost.
  • Memory and Control Constraints: With variable-length and heterogeneous requests, batch feasibility is often constrained by GPU/TPU KV-cache, system RAM, or operation-adaptive memory models. Mathematical feasibility conditions are typically of the form:

$$\sum_{i \in \mathcal{B}} \mathrm{Mem}(r_i) \leq M_{\text{device}}$$

where $\mathcal{B}$ indexes the requests in the batch.
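
To make the feasibility condition concrete, the following minimal sketch (with an invented Request type and a crude per-request KV-memory model; none of it is taken from a specific cited system) greedily packs queued requests into a batch while respecting a device memory budget:

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    prompt_tokens: int          # known prefill length
    max_new_tokens: int         # upper bound on decode length

def request_memory(req: Request, bytes_per_token_kv: int) -> int:
    """Worst-case KV-cache footprint of one request (illustrative model)."""
    return (req.prompt_tokens + req.max_new_tokens) * bytes_per_token_kv

def form_feasible_batch(queue: list[Request],
                        device_memory_bytes: int,
                        bytes_per_token_kv: int) -> list[Request]:
    """Greedily admit requests while the sum of Mem(r_i) stays <= M_device."""
    batch, used = [], 0
    for req in queue:
        need = request_memory(req, bytes_per_token_kv)
        if used + need <= device_memory_bytes:
            batch.append(req)
            used += need
    return batch

if __name__ == "__main__":
    queue = [Request("a", 512, 128), Request("b", 2048, 256), Request("c", 64, 64)]
    batch = form_feasible_batch(queue, device_memory_bytes=2 * 1024**3,
                                bytes_per_token_kv=160_000)
    print([r.req_id for r in batch])
```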

Common formalizations include queueing-theoretic models (Guldogan et al., 3 Dec 2024), cost-SLO analytical optimization (Chen et al., 9 May 2024), and per-target complexity reduction (Guan et al., 8 Nov 2025).

2. Batching and Scheduling Algorithms

RLB methodologies employ specialized algorithms to construct and execute batches, exploiting statistical structure in request arrival, request length distribution, and shared computation.

a. Homogeneous Grouping and Bin-Based Partitioning

When the primary inefficiency arises from variable request lengths—such as in LLMs, where batch completion is gated by the slowest request—binning approaches partition requests by predicted execution time (Guldogan et al., 3 Dec 2024). For $k$ bins, a request $r$ with predicted length $L_r$ is assigned to the bin $[l_{i-1}, l_i)$ that contains $L_r$, and batches are formed within each bin. As $k \rightarrow \infty$, throughput approaches the parallel-optimal rate.
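
A minimal sketch of the binning idea, assuming a length predictor already exists; the bin boundaries, drain policy, and data types below are illustrative choices, not the specific construction of Guldogan et al.:

```python
import bisect
from collections import defaultdict

def assign_bins(requests, predicted_lengths, boundaries):
    """Place each request into the bin [l_{i-1}, l_i) containing its
    predicted length, so batches within a bin finish at similar times."""
    bins = defaultdict(list)
    for req, length in zip(requests, predicted_lengths):
        i = bisect.bisect_right(boundaries, length)   # index of containing bin
        bins[i].append(req)
    return bins

def drain_bin(bins, bin_id, batch_size):
    """Form a batch from a single bin once enough requests have accumulated."""
    if len(bins[bin_id]) >= batch_size:
        batch, bins[bin_id] = bins[bin_id][:batch_size], bins[bin_id][batch_size:]
        return batch
    return None

# Example: 4 bins with boundaries at 128, 512, 2048 predicted output tokens.
bins = assign_bins(["r1", "r2", "r3", "r4"], [90, 100, 600, 3000], [128, 512, 2048])
print(dict(bins))                        # r1, r2 share a bin; r3 and r4 do not
print(drain_bin(bins, 0, batch_size=2))  # -> ['r1', 'r2']
```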

b. Bucket-Based Adaptive Batching

BucketServe (Zheng et al., 23 Jul 2025) formulates RLB as bucketing requests into sequence-length-homogeneous groups, adapting bucket boundaries and batch size in real time to balance padding inefficiency, memory usage, and latency SLO. Adaptive splitting and merging adjust to workload, and priority-aware scheduling gives precedence to latency-sensitive traffic.
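
The following simplified sketch illustrates bucket splitting and merging driven by padding waste and occupancy; the thresholds and criteria are assumptions for illustration, not BucketServe's actual policy:

```python
def padding_waste(lengths):
    """Fraction of padded tokens if all sequences are padded to the bucket max."""
    if not lengths:
        return 0.0
    longest = max(lengths)
    return 1.0 - sum(lengths) / (longest * len(lengths))

def maybe_split(bucket, waste_threshold=0.3):
    """Split a bucket at its median length when padding waste grows too large."""
    lengths = sorted(bucket)
    if padding_waste(lengths) <= waste_threshold or len(lengths) < 4:
        return [bucket]
    mid = len(lengths) // 2
    return [lengths[:mid], lengths[mid:]]

def maybe_merge(bucket_a, bucket_b, min_occupancy=4):
    """Merge two sparsely populated adjacent buckets to keep batches full."""
    if len(bucket_a) < min_occupancy and len(bucket_b) < min_occupancy:
        return [sorted(bucket_a + bucket_b)]
    return [bucket_a, bucket_b]

print(maybe_split([40, 45, 50, 900, 950, 1000]))   # high waste -> split in two
print(maybe_merge([40, 45], [60]))                 # both sparse -> merged
```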

c. Grouping by Prefix Sharing and SLO

BatchLLM (Zheng et al., 29 Nov 2024) and BlendServe (Zhao et al., 25 Nov 2024) implement RLB by performing global prefix analysis, grouping requests that share maximal common prefixes to amortize expensive prefill computation, and scheduling these groups to blend decode-heavy and prefill-heavy work. These systems employ memory-centric batching, horizontal fusion of attention kernels, and group reordering to maximize GPU utilization.
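
A simplified sketch of prefix-based grouping; the fixed-length prefix signature and minimum shared length are illustrative shortcuts, not the global prefix analysis performed by BatchLLM or BlendServe:

```python
from collections import defaultdict

def group_by_shared_prefix(requests, min_shared_tokens=32):
    """Group requests whose token sequences share a long common prefix so the
    prefill (KV cache) for that prefix can be computed once per group."""
    groups = defaultdict(list)
    for req_id, tokens in requests.items():
        key = tuple(tokens[:min_shared_tokens])  # coarse prefix signature
        groups[key].append(req_id)
    return list(groups.values())

# Example: two requests share a 32+ token system prompt, one does not.
system_prompt = list(range(40))                  # stand-in token ids
requests = {
    "a": system_prompt + [101, 102],
    "b": system_prompt + [201, 202, 203],
    "c": list(range(1000, 1040)),
}
print(group_by_shared_prefix(requests))          # [['a', 'b'], ['c']]
```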

d. Cost- and SLO-Optimized Grouping

HarmonyBatch (Chen et al., 9 May 2024) addresses multi-tenant serverless DNN serving by merging request groups (with diverse SLOs and rates) into batchable clusters, jointly provisioning CPU/GPU resources, batch size, and timeout to minimize expected monetary cost while guaranteeing per-request SLO via analytical latency fits.
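
The sketch below illustrates the general pattern of choosing a batch size from a fitted latency model under a per-request SLO; the linear latency fit, the pricing constant, and the omission of queueing/timeout effects are simplifying assumptions, not HarmonyBatch's actual model:

```python
def batch_latency_ms(batch_size, a=12.0, b=85.0):
    """Fitted analytical latency model: roughly linear in batch size.
    The coefficients a and b are placeholders, not a real fit."""
    return a * batch_size + b

def cost_per_request(batch_size, price_per_ms=1e-6):
    """Per-request cost when one invocation serves `batch_size` requests."""
    return batch_latency_ms(batch_size) * price_per_ms / batch_size

def choose_batch_size(slo_ms, max_batch=64):
    """Pick the cheapest batch size whose modeled execution latency still
    meets the SLO (queueing and timeout effects ignored in this sketch)."""
    best = None
    for b in range(1, max_batch + 1):
        if batch_latency_ms(b) <= slo_ms:
            c = cost_per_request(b)
            if best is None or c < best[1]:
                best = (b, c)
    return best

print(choose_batch_size(slo_ms=500))   # larger batches amortize the fixed term
```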

e. Control-Intensive Batched Execution

RLB as a program transformation (e.g., for Bayesian inference in (Radul et al., 2019)) involves tracking control state (program counter, call stack) for each request, issuing kernels only when all requests are at the same interpreter point, and supporting divergent control flow/recursion in data-parallel batches.
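
A toy sketch of the idea: each logical request carries its own control state, and every vectorized "kernel" is masked by the set of requests currently at the corresponding interpreter point. The countdown program below is invented purely for illustration:

```python
import numpy as np

def batched_countdown(values):
    """Toy autobatched program: each request decrements its value until it
    reaches zero.  Requests diverge (finish at different times), so every
    vectorized 'kernel' launch is masked by the requests whose program
    counter is still at the decrement instruction."""
    x = np.asarray(values, dtype=float)
    pc = np.zeros(x.shape, dtype=int)        # 0 = decrementing, 1 = halted

    while (pc == 0).any():                   # some requests still running
        active = pc == 0
        x[active] -= 1.0                     # one vectorized kernel launch
        done = active & (x <= 0)             # branch: these requests halt
        pc[done] = 1
    return x

print(batched_countdown([3, 1, 5, 2]))       # all requests reach zero
```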

f. Resource- and Throughput-Oriented Scheduling

LLM serving with variable prefill/decode lengths (Wang et al., 8 Aug 2025) formalizes RLB as a batch scheduling problem over heterogeneous requests and a capacity-constrained memory model:

$$\sum_{i \in U} \big(p_i + (a_i + 1)\big) + \sum_{i \in R_t \setminus U} (p_i + a_i) \leq M$$

where $U$ is the set of requests being advanced in the current step, $R_t$ is the set of requests active at step $t$, $p_i$ is the prefill length, and $a_i$ is the number of tokens request $i$ has generated so far. The "Sorted-F" algorithm forms batches by minimizing the batch cost $F(X) = (\sum_{i \in X} o_i)/|X|^2$, with $o_i$ the expected output length, under memory feasibility.
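
A sketch of the Sorted-F idea: sort requests by expected output length and grow the batch while $F(X)$ keeps improving and memory remains feasible. A simplified total-KV budget stands in for the per-step constraint above, and the stopping rule is a simplification rather than the published algorithm:

```python
def batch_cost(batch):
    """F(X) = (sum of output lengths) / |X|^2 -- lower is better."""
    return sum(o for _, _, o in batch) / len(batch) ** 2

def kv_tokens(batch):
    """KV-cache tokens needed if the whole batch runs to completion: p_i + o_i."""
    return sum(p + o for _, p, o in batch)

def sorted_f(requests, memory_budget_tokens):
    """requests: list of (req_id, prefill_len, expected_output_len)."""
    ordered = sorted(requests, key=lambda r: r[2])      # shortest outputs first
    batch = []
    for req in ordered:
        candidate = batch + [req]
        if kv_tokens(candidate) > memory_budget_tokens:
            break
        if batch and batch_cost(candidate) > batch_cost(batch):
            break                                       # F(X) stopped improving
        batch = candidate
    return batch

reqs = [("a", 100, 20), ("b", 200, 25), ("c", 50, 400), ("d", 80, 30)]
print([r[0] for r in sorted_f(reqs, memory_budget_tokens=1000)])  # ['a', 'b', 'd']
```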

3. Complexity Reduction and Shared Computation

A central motivation of RLB is to amortize computation, bandwidth, or memory costs across multiple "sub-samples" of a request or across requests:

  • Amortized User/Prefix Embedding: In long-sequence recommenders scoring $m$ candidates per request, RLB encodes the user/history once and shares the result across all $m$ predictions, reducing communication and per-target cost to roughly $1/m$ of the per-candidate baseline (Guan et al., 8 Nov 2025); a minimal sketch follows this list.
  • Global Prefix Sharing: In LLMs, grouping requests by shared prefix drastically reduces redundant key-value computation; empirical KV reuse ratios rise from 44% (LRU caching) to 54.9% (global grouping) (Zheng et al., 29 Nov 2024), and BlendServe retains $>97\%$ of the optimal prefix sharing (Zhao et al., 25 Nov 2024).
  • Batch Memory Scaling: Reusing the same user embedding or prefix expands the maximal trainable sequence length by $\sim 8\times$ at fixed RAM (Guan et al., 8 Nov 2025); batch-structured transformers can handle up to 10k-length histories or unpadded LLM prompt settings.
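
A minimal sketch of the amortized-encoding pattern referenced in the first bullet, with a placeholder encoder standing in for the expensive user/history model:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_history(history_tokens, d_model=64):
    """Stand-in for an expensive user/history encoder (e.g., a transformer)."""
    return rng.standard_normal(d_model)          # placeholder embedding

def score_candidates_batched(history_tokens, candidate_embs):
    """Encode the shared user history ONCE, then score all m candidates
    against it, so the encoder cost is amortized across the m targets."""
    user_emb = encode_history(history_tokens)    # computed a single time
    return candidate_embs @ user_emb             # m dot products

m, d = 8, 64
candidates = rng.standard_normal((m, d))
scores = score_candidates_batched(list(range(10_000)), candidates)
print(scores.shape)                              # (8,)
```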

4. System Integration and Implementation Strategies

Specific instantiations of RLB are realized via integration with serving engines, training/inference pipelines, and scheduler design:

  • LLM Inference Engines: BatchLLM and BlendServe operate as scheduling layers above the core inference kernel, reordering offline or large-batch LLM requests by group/prefix and memory footprint, and dispatching fused kernels with horizontal attention (Zheng et al., 29 Nov 2024, Zhao et al., 25 Nov 2024).
  • Online Serving: In BucketServe, batch sizes and bucket boundaries are adjusted dynamically with each queue-length change, and batches are scheduled according to earliest-deadline (online) or SJF/LJF (offline) within buckets (Zheng et al., 23 Jul 2025).
  • DRL-Based Schedulers: BCEdge frames the request-level batching/adaptive concurrency problem as an MDP, optimizes a log-throughput/latency utility, and employs off-policy, maximum-entropy DRL (discrete SAC) to tune batch size and concurrent runs (Zhang et al., 2023).
  • Analytical Provisioning: HarmonyBatch analytically solves for group batching, function type, and parameter assignment to minimize cost; the GPU batch size is chosen at the utilization-limited point $b = \lfloor r^{\mathcal{X}} T^{\mathcal{X}} \rfloor + 1$ (Chen et al., 9 May 2024); a small worked example follows this list.
  • Programmatic Transformation: Autobatching for control-intensive interpreters transforms each scalar variable into a stack tensor, and maintains program counters for all requests, synchronizing kernel invocation across control flow divergence (Radul et al., 2019).
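
A small worked example of the utilization-limited batch-size rule quoted above, under the assumption (not stated explicitly here) that $r^{\mathcal{X}}$ is the group's request arrival rate and $T^{\mathcal{X}}$ its batching timeout; the numbers are purely illustrative:

```python
import math

def gpu_batch_size(arrival_rate_rps: float, timeout_s: float) -> int:
    """Utilization-limited batch size b = floor(r * T) + 1: the expected
    number of requests arriving within the batching timeout, plus one."""
    return math.floor(arrival_rate_rps * timeout_s) + 1

# Illustrative numbers: 50 requests/s with a 150 ms batching window.
print(gpu_batch_size(arrival_rate_rps=50.0, timeout_s=0.150))   # -> 8
```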

5. Quantitative Impact and Empirical Outcomes

Empirical validation across domains consistently demonstrates that request-level batching yields substantial improvements in resource utilization, latency, cost, and/or SLO satisfaction. The following table summarizes selected results:

System/Paper | Metric | Baseline | RLB Result / Gain
Douyin/STCA (Guan et al., 8 Nov 2025) | Bandwidth | — | 77–84% reduction
Douyin/STCA (Guan et al., 8 Nov 2025) | Throughput | — | +2.2–5.1× (GPU, kernel)
BucketServe (Zheng et al., 23 Jul 2025) | Throughput | UELLM / DistServe | 3.58× / 1.31×
BucketServe (Zheng et al., 23 Jul 2025) | SLO attainment | DistServe | 1.93× higher RPS
BatchLLM (Zheng et al., 29 Nov 2024) | Requests/sec | vLLM | 1.1× (A100) – 2.0× (MI200)
BlendServe (Zhao et al., 25 Nov 2024) | Throughput | SGLang / vLLM / NanoFlow | 19.3%–22.7% (up to 1.44×)
HarmonyBatch (Chen et al., 9 May 2024) | Cost | BATCH, MBS⁺ | up to 82.9% reduction
BCEdge (Zhang et al., 2023) | Utility | DeepRT | +37.6%
Autobatcher (Radul et al., 2019) | Gradients/sec | Python NUTS | 400k+ (RLB) vs. 200k (naive), up to 30×

The evaluated RLB methods demonstrate robust scaling across workloads, with characteristic monotonic increases in throughput as batch size or degree of grouping grows (until hardware or latency constraints dominate). Notably, methods such as multi-bin batching (Guldogan et al., 3 Dec 2024) and Sorted-F (Wang et al., 8 Aug 2025) are provably optimal (or within a constant factor) with respect to parallel throughput or latency.

6. Trade-offs, Limiting Factors, and Approximation Schemes

Design and deployment of RLB involve nontrivial trade-offs, often system- or application-specific:

  • Latency vs. Throughput: Increasing the group or bin count (larger $k$, finer homogeneity) aligns per-batch service times but can lead to longer queueing delays, producing monotonic throughput improvements ($c_k \rightarrow c_{\max}$ as $k$ grows) at the expense of possibly higher mean or tail latency (Guldogan et al., 3 Dec 2024); a small simulation sketch follows this list.
  • Approximate Grouping: Hybrid and approximate algorithms—e.g., local swap, quantile-greedy, DP-relaxations—can deliver near-optimal batch selection and scheduling with much lower per-batch compute cost (Wang et al., 8 Aug 2025).
  • Prefix-Group Reordering: In BatchLLM, early scheduling of heavy-decode groups boosts blending in subsequent batches but may be suboptimal for unpredictable output-length distributions; conservative memory thresholds and fallback to memory-centric batching provide robustness (Zheng et al., 29 Nov 2024).
  • Prediction Accuracy: Gains from multi-bin batching and bucketed scheduling rely on highly accurate length prediction or online profiling; predictor misclassification leaves a residual idle tail, but system degradation remains graceful (Guldogan et al., 3 Dec 2024).
  • Hardware/Kernel Overheads: Fused attention and memory adaptation may fall marginally short of hand-optimized kernels in extreme regimes (~10–15% overhead), but overall system utilization increases due to fewer kernel launches and smoother occupancy curves (Zheng et al., 29 Nov 2024, Zhao et al., 25 Nov 2024).
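
The latency/throughput trade-off in the first bullet can be made concrete with a small simulation. It measures only the idle-compute side of the trade-off (work wasted waiting for the slowest request in a batch); uniformly random lengths and equal-width bins are illustrative assumptions, not the queueing model of Guldogan et al.:

```python
import random

def simulate(num_bins, n_requests=10_000, batch_size=8, max_len=1024, seed=0):
    """Fraction of batch compute wasted because requests idle until the
    slowest request in their batch finishes; more bins -> more homogeneous
    batches and less idle compute."""
    rng = random.Random(seed)
    lengths = [rng.randint(1, max_len) for _ in range(n_requests)]
    width = max_len / num_bins
    bins = [[] for _ in range(num_bins)]
    wasted = total = 0
    for length in lengths:
        b = min(int(length // width), num_bins - 1)
        bins[b].append(length)
        if len(bins[b]) == batch_size:
            batch, bins[b] = bins[b], []
            wasted += max(batch) * len(batch) - sum(batch)
            total += max(batch) * len(batch)
    return wasted / total

for k in (1, 2, 4, 8):
    print(f"{k} bin(s): {simulate(k):.1%} of batch compute idle")
```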

7. Domains of Application and Future Research Directions

Request-level batching has achieved wide adoption and ongoing research innovation across recommender systems, online and offline LLM serving, serverless DNN inference, edge model scheduling, and batched execution of control-intensive programs such as probabilistic inference.

Future directions include tighter integration of continuous/token-level batching with global RLB, adaptive online binning/grouping based on workload shifts, advanced hybrid policy learning (combining DRL and analytical models), kernel- and hardware-aware group scheduling, and extension to multi-model, multi-tenant, or federated AI serving environments. A plausible implication is that as model and request heterogeneity grow, RLB abstractions and algorithms will increasingly co-evolve with serving-system software and hardware stacks.
