
Dynamic Rebatching Techniques

Updated 24 December 2025
  • Dynamic rebatching is an adaptive methodology that groups computation tasks in real time to optimize throughput, latency, and resource utilization across diverse applications such as LLM inference and distributed DNN training.
  • It leverages control policies like SMDP and reinforcement learning to make real-time decisions on batch sizes and compositions based on system state metrics such as queue length and memory usage.
  • Empirical evidence shows that dynamic rebatching can reduce kernel launch overhead by up to 10× and boost throughput by over 3× while ensuring compliance with SLAs and efficient hardware utilization.

Dynamic rebatching is a set of methodologies for adaptively grouping computation tasks—data samples, requests, or computational graph nodes—into executable batches at runtime, rather than statically fixing batch structure or size. The goal is to maximize throughput and hardware efficiency while satisfying domain-specific constraints such as latency, SLAs, or memory limits. This approach is essential for applications with highly dynamic workloads, control-intensive computation graphs, or heterogeneous computational resources, and spans use cases ranging from LLM inference serving and distributed DNN training to adaptive experimentation and dynamic graph deep learning.

1. Formal Definitions and Theoretical Models

Dynamic rebatching generalizes static batching by making the batch size, composition, and even the batch boundary itself a real-time decision variable. In systems terms, dynamic rebatching can be formulated as a control policy $\pi$ that, at each decision epoch, observes system state $s$—for example, queue length, request ages, memory usage, or graph-node frontiers—and selects a batch action $a$ (e.g., batch size, group of tasks, node signature to execute) to optimize long-run objectives.

A canonical formalization is via a semi-Markov decision process (SMDP), as in GPU inference serving, minimizing a weighted sum of average response time and average power consumption:

$$J(\pi) = \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}^\pi \Big[ w_1\int_0^T s(t)\,dt + w_2\sum_{i:\,b_i>0} \zeta(b_i) \Big],$$

where $s(t)$ is the instantaneous system state (e.g., queued requests), $b_i$ is the batch size at decision $i$, and $\zeta(b)$ is the energy cost of batch size $b$ (Xu et al., 4 Jan 2025).
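To make the objective concrete, the following is a minimal simulation sketch that estimates $J(\pi)$ for a simple control-limit policy. The exponential arrival process, the linear service-time and energy models, and the simplification of ignoring arrivals during batch service are all assumptions for illustration, not the formulation of Xu et al.:

```python
import random

def estimate_objective(Q=4, lam=8.0, w1=1.0, w2=0.1, horizon=10_000.0, seed=0):
    """Estimate J(pi) by simulation for a control-limit policy: serve the whole
    queue as one batch only once at least Q requests are waiting."""
    rng = random.Random(seed)

    def service_time(b):      # toy batch latency model: grows mildly with batch size
        return 0.05 + 0.01 * b

    def energy(b):            # toy zeta(b): per-batch energy cost
        return 1.0 + 0.2 * b

    t, queue, cost = 0.0, 0, 0.0
    while t < horizon:
        if queue >= Q:                            # control-limit policy: serve now
            dt = service_time(queue)
            cost += w1 * queue * dt + w2 * energy(queue)
            queue = 0                             # arrivals during service ignored for brevity
        else:                                     # otherwise wait for the next Poisson arrival
            dt = rng.expovariate(lam)
            cost += w1 * queue * dt
            queue += 1
        t += dt
    return cost / t                               # time-averaged cost, an estimate of J(pi)
```

Sweeping the threshold `Q` in this toy model reproduces the qualitative trade-off: larger thresholds lower per-request energy cost at the expense of higher average response time.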

For computation graphs, the problem is a partitioning of a global job set $N$ into batches $B_1,\ldots,B_m$ such that all $u,v \in B_k$ are eligible for joint execution (e.g., same operator type and shape), and all data dependencies are honored (Fegade et al., 2023, Neubig et al., 2017, Chen et al., 2023).

In the context of distributed ML, dynamic rebatching policies seek per-worker batch sizes $\{b_i\}$ such that all workers approximately equalize iteration latency, $T_i(b_i) \approx T_j(b_j)$ for all $i,j$ (Tyagi et al., 2023, Ye et al., 2020).
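A minimal sketch of this latency-equalization idea, assuming a simple throughput-proportional update rule rather than the fuller PID-style controllers described in the cited systems:

```python
def rebalance(batch_sizes, iter_times, global_batch):
    """Scale each worker's next batch size in proportion to its measured
    throughput so all workers take roughly equal time, preserving the global batch."""
    throughputs = [b / t for b, t in zip(batch_sizes, iter_times)]   # samples/sec per worker
    total = sum(throughputs)
    new_sizes = [max(1, round(global_batch * thr / total)) for thr in throughputs]
    # Absorb rounding drift so the global batch size stays exactly constant.
    new_sizes[new_sizes.index(max(new_sizes))] += global_batch - sum(new_sizes)
    return new_sizes
```

For example, `rebalance([32, 32, 32], [1.0, 2.0, 4.0], 96)` returns `[55, 27, 14]`, shifting work toward the fastest worker while keeping the global batch at 96.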

2. Algorithmic Strategies

Dynamic rebatching frameworks instantiate a range of algorithmic strategies, often dictated by system constraints and workload characteristics:

Operator/Node-Level Rebatching for Dynamic Graphs

  • Agenda-based Batching: Maintains an agenda of ready nodes keyed by op type/signature, greedily flushing large compatible batches to minimize kernel launches (Neubig et al., 2017); a minimal sketch follows this list.
  • FSM/Policy Learning: Uses finite-state machine encodings of the batching process and tabular Q-learning to discover near-optimal batching policies, with reward structured to both penalize kernels and encourage frontier advancement (Chen et al., 2023).
  • Hybrid Static + Dynamic Compilation: Applies compile-time control-flow and taint analysis to annotate nodes with, e.g., depth or inline phase, reducing dynamic scheduling cost at runtime and ensuring correctness across control-flow divergence (Fegade et al., 2023).
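The agenda-based strategy above can be sketched as follows, assuming a toy node representation (signature string plus predecessor list) rather than any particular framework's graph API; a production implementation would update the agenda incrementally instead of rescanning ready nodes after every launch:

```python
from collections import defaultdict

class Node:
    """Toy graph node: `signature` stands in for (op type, shape, parameter id)."""
    def __init__(self, signature, deps=()):
        self.signature = signature
        self.deps = list(deps)
        self.done = False

def agenda_batching(nodes, run_batch):
    remaining = set(range(len(nodes)))
    while remaining:
        # Agenda: ready nodes (all predecessors done), grouped by signature.
        agenda = defaultdict(list)
        for i in remaining:
            if all(d.done for d in nodes[i].deps):
                agenda[nodes[i].signature].append(nodes[i])
        # Greedy flush: launch the largest compatible group as one batched kernel.
        sig, batch = max(agenda.items(), key=lambda kv: len(kv[1]))
        run_batch(sig, batch)
        for n in batch:
            n.done = True
        remaining = {i for i in remaining if not nodes[i].done}
```

Here `run_batch` stands in for a single batched kernel launch (e.g., one matmul over the stacked inputs of the group).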

SLA- or Resource-Constrained Batching for Inference Serving

  • SMDP-Based Decision Making: At each service completion or arrival, selects a batch action $a$ that minimizes a Bellman-style average cost-to-go, typically using tail-cost penalization for a tractable finite-state reduction (Xu et al., 4 Jan 2025).
  • Memory/Latency Feedback Loops: Uses periodic measurements of queue/memory state and recent latency to adapt the batch size per interval, blending a memory-aware scheduler (to avoid OOM) with a latency-bracket adjustment under SLA constraints (Pang et al., 7 Mar 2025); a controller sketch follows this list.
  • Admission Control with Predictive Modeling: Applies per-request SLA feasibility checks, using “Universal Scalability Law” fits of per-request speed under contention to admit additional requests only if their SLA can be met under current resource allocation (Chang et al., 24 Jun 2025).
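A sketch of such a per-interval controller, with assumed parameter names and back-off/probe constants; the actual scheduler in the cited work combines its memory and latency terms differently:

```python
def next_batch_size(prev_bs, active_requests, free_mem_bytes, kv_bytes_per_request,
                    observed_p95_latency, sla_latency, b_max,
                    shrink=0.8, grow=1.1, headroom=0.9):
    """One scheduling-interval update: memory bound + latency bracket, then clipping."""
    # Memory bound: how many requests' KV cache fit in the budget, with headroom to avoid OOM.
    mem_bound = int(headroom * free_mem_bytes // kv_bytes_per_request)
    # Latency bracket: back off multiplicatively on SLA violation, probe upward when there is slack.
    if observed_p95_latency > sla_latency:
        lat_target = max(1, int(prev_bs * shrink))
    elif observed_p95_latency < 0.7 * sla_latency:
        lat_target = int(prev_bs * grow) + 1
    else:
        lat_target = prev_bs
    # Final batch size: clipped by memory, active queue depth, and B_max.
    return max(1, min(lat_target, mem_bound, active_requests, b_max))
```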

Model/Experiment-Aware Dynamic Batching

  • Variance/Efficiency-Driven Training Batch Sizing: Tracks per-microbatch gradient direction change; accumulates until the gradient direction begins to fluctuate, then triggers a parameter update, thereby adapting the batch size dynamically with reported empirical gains (Xu et al., 2020); see the sketch after this list.
  • Quadrature-Precision-Driven Batch Sizing: In Bayesian optimization or active learning, frames batch selection as a kernel quadrature problem, adjusting batch size at every round to meet a prescribed worst-case integration error, with explicit LP-based selection (Adachi et al., 2023, Lyu et al., 2020).
  • Batched Bandit Experimentation: Uses residual-horizon model predictive optimization, dynamically computing batch sizes for each arm in a finite-horizon adaptive experiment, based on a Gaussian dynamic programming approximation (Che et al., 2023).
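A sketch of the gradient-direction criterion from the first bullet, assuming a simple cosine-similarity threshold against the running accumulated gradient; the cited method's exact fluctuation test may differ:

```python
import numpy as np

def accumulate_until_fluctuation(microbatch_grads, cos_threshold=0.5):
    """Accumulate flattened microbatch gradients until a new microbatch's direction
    disagrees with the running accumulated direction, then emit one update."""
    acc, count, updates = None, 0, []
    for g in microbatch_grads:
        if acc is None:
            acc, count = g.copy(), 1
            continue
        cos = float(np.dot(acc, g) /
                    (np.linalg.norm(acc) * np.linalg.norm(g) + 1e-12))
        if cos < cos_threshold:
            updates.append(acc / count)   # direction fluctuated: trigger a parameter update
            acc, count = g.copy(), 1      # start accumulating the next dynamic batch
        else:
            acc += g
            count += 1
    if acc is not None:
        updates.append(acc / count)
    return updates                        # each entry corresponds to one dynamic-batch update
```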

3. Dynamic Rebatching in Dynamic Graph Frameworks

Implementing dynamic rebatching for models with instance-dependent control flow or dynamic dataflow graphs necessitates sophisticated policy and scheduling approaches:

  • Signature-Based Grouping: Nodes/ops are grouped by a deterministic signature (operator type, shape, parameter ID), enabling batch formation only for compatible nodes. Hashing and canonicalization ensure O(1) amortized cost per node (Neubig et al., 2017, Zha et al., 2019).
  • Depth/Agenda/Sorted-Multiset State Encodings: Scheduling may use depth-based groupings, agenda-based priorities, or state compressions such as sorted frontier multisets (for FSM-based learners) to efficiently determine which nodes to batch (Chen et al., 2023, Neubig et al., 2017).
  • PQ-Tree Memory Planning: Efficient in-place memory allocation layouts are computed offline, ensuring all batched operands are contiguous and aligned, thus avoiding scattered copies or index-gather kernels at runtime (Chen et al., 2023).
  • Reinforcement Learning for Policy Discovery: The batch scheduling policy can be learned via tabular Q-learning over the FSM representation, with a reward structure that provably recovers the first step of an optimal schedule in some cases (Chen et al., 2023); a toy sketch follows this list.
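The FSM/Q-learning approach can be illustrated on a toy workload of independent chains with two alternating op signatures: the state is the sorted multiset of ready signatures, each action launches one batched kernel for a chosen signature, and the reward is −1 per launch. This is a self-contained illustration under those assumptions, not the compiler described in the cited work:

```python
import random
from collections import defaultdict

def make_instance(rng):
    """Toy computation graph: independent chains of alternating op signatures A/B."""
    chains = []
    for _ in range(rng.randint(4, 8)):
        start = rng.choice(["A", "B"])
        length = rng.randint(2, 5)
        chains.append(["A" if (i % 2 == 0) == (start == "A") else "B"
                       for i in range(length)])
    return chains

def frontier(chains, progress):
    """Signatures of the next (ready) node of every unfinished chain."""
    return [c[p] for c, p in zip(chains, progress) if p < len(c)]

def encode(sigs):
    """FSM-style state: the sorted multiset of ready signatures."""
    counts = {}
    for s in sigs:
        counts[s] = counts.get(s, 0) + 1
    return tuple(sorted(counts.items()))

def step(chains, progress, action):
    """One batched kernel for `action`: advance every chain whose ready node matches."""
    return [p + 1 if p < len(c) and c[p] == action else p
            for c, p in zip(chains, progress)]

def train(episodes=5000, eps=0.2, alpha=0.5, gamma=1.0, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)                        # tabular Q[(state, action)]
    for _ in range(episodes):
        chains = make_instance(rng)
        progress = [0] * len(chains)
        while True:
            sigs = frontier(chains, progress)
            if not sigs:
                break                             # all nodes executed
            state, actions = encode(sigs), sorted(set(sigs))
            if rng.random() < eps:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(state, x)])
            progress = step(chains, progress, a)
            nxt = frontier(chains, progress)
            future = max((Q[(encode(nxt), x)] for x in set(nxt)), default=0.0)
            # Reward of -1 per kernel launch, so the learned policy minimizes launches.
            Q[(state, a)] += alpha * (-1.0 + gamma * future - Q[(state, a)])
    return Q
```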

Empirical results on chain, tree, and lattice models (e.g., BiLSTM-Tagger, TreeLSTM, LatticeGRU) show kernel-launch reductions of up to 3.27× over heuristic strategies and throughput speedups up to 3.13× on GPU (Chen et al., 2023).

4. Rebatching for SLA and Hardware Utilization

In inference serving for LLMs and DNNs, rebatching is essential for controlling latency, memory consumption, and quality-of-service:

  • SMDP-derived Control-Limit Policies: Optimal SMDP policies often have a "threshold" form: while the queue length $s < Q$, wait and accumulate requests rather than serving small batches; once $s \ge Q$, serve as many requests as possible. The threshold $Q$ increases as the power-cost weight increases (Xu et al., 4 Jan 2025).
  • Integrated Memory and SLA Constraints: Real-time rebatching combines a memory-constrained upper bound on batch size (accounting for per-request KV cache usage) with a responsive bracket search that keeps per-token latency below the SLA target. The two bounds are combined at each scheduling interval, with the batch size clipped by the number of active requests and $B_{\max}$ (Pang et al., 7 Mar 2025).
  • Adaptive Admission Control: In systems such as SABER, every request admission is tested for predicted SLA feasibility via an offline-fitted per-request speed function; only those requests that would not cause either themselves or already-admitted requests to violate their SLAs are admitted into the batch (Chang et al., 24 Jun 2025); a sketch of this admission test follows the list.
  • Empirical Results: Memory- and SLA-aware dynamic batching offers consistent throughput gains (8–28% over static batching, +22% QPS at 50ms SLA) with minimal SLA violations, and further reduces latency variability by 31–45% in self-hosted LLM serving (Pang et al., 7 Mar 2025, Chang et al., 24 Jun 2025).
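A sketch of the admission test, assuming a standard Universal Scalability Law form for the per-request speed fit and a simple tokens-remaining/deadline SLA model; the parameter names and the precise feasibility check used by SABER are assumptions here:

```python
def usl_per_request_speed(batch_size, gamma, alpha, beta):
    """USL-style fit: per-request decode speed (tokens/s) at a given batch size.
    gamma, alpha, beta come from an offline fit on the serving hardware."""
    n = batch_size
    return gamma / (1.0 + alpha * (n - 1) + beta * n * (n - 1))

def admit(candidate_tokens_left, candidate_deadline_s, admitted, gamma, alpha, beta):
    """Admit the candidate only if, at the enlarged batch size, every request
    (candidate included) can still finish its remaining tokens before its deadline."""
    new_bs = len(admitted) + 1
    speed = usl_per_request_speed(new_bs, gamma, alpha, beta)
    jobs = admitted + [(candidate_tokens_left, candidate_deadline_s)]
    return all(tokens / speed <= deadline for tokens, deadline in jobs)
```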

5. Applications: Distributed Training, Early-Exit Models, and Adaptive Experimentation

Dynamic rebatching methodologies extend beyond inference and dynamic graphs:

  • Distributed DNN Training: In heterogeneous clusters, per-worker batch size is dynamically adjusted based on observed throughput, utilizing proportional/PID-inspired control logic so that all workers converge to similar iteration latencies, drastically reducing idle times and speeding up training by 2–4.5× in practice (Tyagi et al., 2023, Ye et al., 2020). DBS and similar algorithms guarantee standard convergence rates as long as all local batch sizes grow (Ye et al., 2020).
  • Early-Exit Transformers: In LLMs supporting early exit (EE), requests within a batch may diverge at EE ramps; dynamic rebatching identifies those tokens eligible to exit immediately, while "continuing" requests are buffered and regrouped for deeper layers. Copy-free buffering (bit-mask/index, not physical tensors) and memory-optimized virtual KV mapping are used to support this efficiently. An analytical "adaptive rebatching threshold" (ART) determines when batch splitting is profitable, yielding up to 12% throughput improvement and eliminating forced or involuntary exits relative to grouped policies (Liu et al., 17 Dec 2025); a mask-based sketch follows this list.
  • Adaptive Experimentation and Active Learning: In adaptive experimentation (e.g., clinical trials, A/B/n tests), batch allocation across treatment arms can be dynamically derived using residual-horizon optimization or by kernel quadrature precision; batch size varies per reallocation epoch to minimize credible intervals or maximize power vs. computational cost (Che et al., 2023, Adachi et al., 2023).
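A mask-based sketch of the early-exit rebatching step from the second bullet, with assumed confidence-threshold and buffering logic; the ART computation and virtual KV-cache remapping in the cited system are considerably more involved:

```python
import numpy as np

class ExitRampRebatcher:
    """Toy early-exit rebatcher: sequences confident enough at a ramp exit now;
    the rest are tracked by index (no tensor copies) and regrouped into a
    deeper-layer batch only once enough of them have accumulated."""
    def __init__(self, conf_threshold=0.9, art=4):
        self.conf_threshold = conf_threshold   # exit-confidence cutoff at the ramp
        self.art = art                         # adaptive rebatching threshold (min profitable group)
        self.buffer = []                       # indices of continuing sequences awaiting regrouping

    def step(self, seq_ids, exit_confidence):
        """seq_ids and exit_confidence are aligned 1-D numpy arrays for one batch."""
        exit_mask = exit_confidence >= self.conf_threshold
        exited = list(seq_ids[exit_mask])                  # these sequences leave immediately
        self.buffer.extend(seq_ids[~exit_mask])            # the rest are buffered by index
        if len(self.buffer) >= self.art:                   # worth launching deeper layers now
            regrouped, self.buffer = self.buffer, []
            return exited, regrouped
        return exited, []                                  # keep buffering continuing sequences
```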

6. Complexity, Overhead, and Empirical Scaling

Dynamic rebatching introduces modest analysis or scheduling overhead, almost always dominated by the kernel execution cost. For graphs with $N$ nodes across $B$ examples:

  • Graph Analysis and Scheduling: Typically $O(N)$ in total node count, with batching overhead <10% of runtime for realistic workloads (Zha et al., 2019, Neubig et al., 2017).
  • Kernel Launch Reduction: Batching reduces operator launches by a factor equal to the average batch size per group, yielding 5–10× speedups even for moderate groupings. In sparse Mixture-of-Experts, the reduction can be on the order of 1000× (Suarez et al., 2017).
  • Online Control-Loop Complexity: Memory/SLA dynamic batching is $O(1)$ per interval; admission control requiring predictive fits needs only simple arithmetic comparisons per candidate (Pang et al., 7 Mar 2025, Chang et al., 24 Jun 2025).
  • FSM Policy Learning: Per-model compile overhead for policy learning is typically <30 s, PQ-tree memory planning <50 ms; practical for deployment scale (Chen et al., 2023).

Limitations include possible sub-optimality under extreme memory or sequence-length diversity, retraining cost of RL-based policies when graph topology changes frequently, and diminishing returns for tiny or communication-bound models.

7. Practical Guidance and Impact

In deployment, dynamic rebatching should be used whenever:

  • The workload is highly variable or bursty, and static batch sizing leads to poor hardware utilization or frequent OOMs.
  • Models contain instance-dependent control flow, recursion, or require tight SLA adherence.
  • The system demands robustness to heterogeneous computational resources or needs to optimize under power/memory constraints.

Implementers are encouraged to profile per-operator latency/energy curves, choose an aggregation and policy structure (SMDP, RL, PID, signature-based), and set throughput/power/SLA weights according to application constraints. For dynamic graph workloads, FSM-based or agenda-based heuristics are recommended; for distributed inference serving, SMDP, PID, or admission-control loops are well suited (Pang et al., 7 Mar 2025, Chang et al., 24 Jun 2025, Chen et al., 2023, Fegade et al., 2023).

Dynamic rebatching continues to be a fundamental systems and algorithmic primitive for scaling contemporary ML, with new work integrating probabilistic numerics, bandit optimization, and advanced memory management enabling both theoretical and practical performance gains (Adachi et al., 2023, Liu et al., 17 Dec 2025, Che et al., 2023).
