
Dynamic Batching Algorithms

Updated 12 January 2026
  • Dynamic batching algorithms are adaptive techniques that group computations at runtime based on workload, latency requirements, and hardware constraints.
  • They are applied in neural network execution, GPU inference, distributed training, and dynamic graph processing to lower compute overhead and reduce kernel launches.
  • Empirical results show notable improvements, including up to 6.25× speedup in neural batching and significant reductions in memory movement and resource waste.

Dynamic batching algorithms comprise a class of techniques that automatically group computations, requests, or data items into batches at runtime, with batch sizes and groupings determined adaptively from workload characteristics, hardware constraints, efficiency objectives, or latency requirements. Unlike static batching, which relies on pre-set batch sizes or fixed batch composition, dynamic batching algorithms leverage runtime information and models (queuing, learning, memory usage, graph structure, etc.) to optimize the trade-off among efficiency, resource utilization, and responsiveness across domains including neural network execution, distributed training, GPU inference serving, and dynamic graph algorithms.

1. Foundational Principles and Models

Dynamic batching leverages runtime adaptivity to maximize parallel hardware utilization while minimizing excess latency and resource waste. In neural network frameworks, dynamic batching groups isomorphic (structurally compatible) operations or subgraphs from variable input instances and fuses them into batched executions, reducing compute overhead and kernel launches compared to naïve instance-by-instance scheduling (Neubig et al., 2017, Zha et al., 2019, Looks et al., 2017, Chen et al., 2023). For service queues and batch processing systems, policy optimization may be formalized as a semi-Markov decision process (SMDP) minimizing average latency and resource cost, yielding near-threshold or control-limit policies for dispatching batches (Xu et al., 4 Jan 2025).
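
The batching step itself can be illustrated with a minimal sketch of agenda-style scheduling in the spirit of Neubig et al. (2017); the node, dependency, and signature representations below are illustrative assumptions rather than the DyNet API:

```python
from collections import defaultdict

def agenda_batch_schedule(nodes, deps, signature):
    """Greedy agenda scheduling: repeatedly batch ready nodes that share an
    operation signature (illustrative sketch, not the DyNet implementation).

    nodes     -- iterable of node ids of a per-batch computation DAG
    deps      -- dict: node id -> set of node ids it depends on
    signature -- dict: node id -> hashable op signature (op type, shapes, ...)
    Returns a list of batches (lists of node ids) in execution order.
    """
    remaining = {n: set(deps.get(n, ())) for n in nodes}
    done, schedule = set(), []
    while remaining:
        # Nodes whose dependencies are all computed are ready to run.
        ready = [n for n, d in remaining.items() if d <= done]
        # Group ready nodes by signature; each group can run as one fused kernel.
        groups = defaultdict(list)
        for n in ready:
            groups[signature[n]].append(n)
        # Greedy choice: launch the largest compatible group first.
        batch = max(groups.values(), key=len)
        schedule.append(batch)
        done.update(batch)
        for n in batch:
            del remaining[n]
    return schedule
```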

Fundamental modeling elements include:

  • Batch-dependent processing time: Service time and energy consumption grow with batch size, so schedulers must weigh the throughput gains of larger batches against longer per-batch service times and energy draw.
  • Memory and resource constraints: Batching must prevent memory overflow and excessive per-batch resource utilization, with explicit memory models (e.g., for attention KV-cache size) guiding real-time batch size limits (Pang et al., 7 Mar 2025, Zheng et al., 23 Jul 2025).
  • Latency-service level objectives (SLOs): Online serving systems integrate latency feedback to modulate batch sizes dynamically, balancing throughput against deadline compliance.
  • Dynamic input structure: Batched execution must handle variable-length sequences, tree- or graph-shaped inputs, or evolving graphs whose topology changes across instances or time.
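
To make these trade-offs concrete, consider a toy batch-dependent cost model (the linear service-time form T(b) = t0 + t1·b and the regular arrival pattern are assumptions for illustration, not taken from the cited papers): throughput b/T(b) improves with batch size b, while the earliest-arriving request in a batch also pays the time spent waiting for the batch to fill.

```python
def largest_slo_feasible_batch(t0, t1, arrival_gap, slo, b_max=256):
    """Largest batch size whose worst-case request latency stays under the SLO,
    under a linear service-time model T(b) = t0 + t1 * b with one request
    arriving every `arrival_gap` seconds (the first request in a batch waits
    (b - 1) * arrival_gap before the batch is dispatched)."""
    best = 1
    for b in range(1, b_max + 1):
        worst_latency = (b - 1) * arrival_gap + t0 + t1 * b
        if worst_latency <= slo:
            best = b          # larger batches improve throughput b / T(b)
    return best

# Example: 5 ms setup, 1 ms per item, a request every 2 ms, 100 ms SLO -> b = 32.
print(largest_slo_feasible_batch(t0=0.005, t1=0.001, arrival_gap=0.002, slo=0.1))
```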

2. Algorithmic Frameworks for Dynamic Batching

Several architectural frameworks have emerged for dynamic batching:

A. Neural Network Execution on Dynamic Graphs

  • Agenda-based scheduling (DyNet): Build a per-batch graph, partition nodes by operation signature, and use a greedy agenda algorithm to maximize batching opportunities at each topological depth (Neubig et al., 2017).
  • Just-in-time compilation (MXNet Gluon): Extend the imperative API with delayed NDArrayFutures, group isomorphic operators or subgraphs within a defined batching scope, and memoize batched graphs for reuse (Zha et al., 2019); see the sketch after this list.
  • Finite state machine learning (ED-Batch): Learn an FSM-based scheduling policy via RL to minimize kernel launches, and use PQ-tree based memory planning to optimally layout batched data, reducing memory movement (Chen et al., 2023).
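
A minimal sketch of the memoization idea behind just-in-time batching (Zha et al., 2019); the `compile_batched_graph` step below is a hypothetical placeholder, not the MXNet Gluon API:

```python
import functools

def compile_batched_graph(signature):
    """Hypothetical compile step: in a real framework this would fuse the
    isomorphic operators/subgraphs sharing `signature` into one batched kernel."""
    def batched_kernel(stacked_inputs):
        return [("out", signature, x) for x in stacked_inputs]   # stand-in
    return batched_kernel

@functools.lru_cache(maxsize=None)
def get_batched_kernel(signature):
    """Memoize compiled batched graphs: only the first batch with a given
    signature pays the scheduling/compilation analysis; later batches reuse it."""
    return compile_batched_graph(signature)

def run_batch(signature, stacked_inputs):
    return get_batched_kernel(signature)(stacked_inputs)
```

Reusing the compiled graph across batches with the same signature is what amortizes the batching analysis cost noted in Section 4.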

B. Service Queue and Generic Batch Scheduling

  • SMDP-based dynamic batching formulates batch dispatch as a continuous-time decision process whose objective, a weighted sum of response time and power, is solved via a truncated-state-space DT-MDP and relative value iteration; aggregating tail states into an abstract cost dramatically reduces state and time complexity (Xu et al., 4 Jan 2025).
  • Latency/memory-aware online batch selection: At each scheduling epoch, compute one batch-size bound from a memory model and another from recent latency feedback, then take the minimum to meet the SLA while avoiding out-of-memory failures (Pang et al., 7 Mar 2025).
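
A minimal sketch of this min-of-two-bounds rule; the backoff constants and the per-request memory estimate are simplified assumptions rather than the exact controller of Pang et al. (7 Mar 2025):

```python
def select_batch_size(free_mem_bytes, per_req_mem_bytes,
                      recent_p99_latency, slo, current_batch,
                      shrink=0.8, grow=1):
    """Batch size for the next scheduling epoch: the minimum of a
    memory-derived bound and a latency-feedback bound (simplified sketch)."""
    # Memory bound: never admit more requests than safely fit in free memory.
    mem_bound = max(1, free_mem_bytes // per_req_mem_bytes)
    # Latency bound: back off multiplicatively on SLO violations,
    # otherwise probe upward additively.
    if recent_p99_latency > slo:
        lat_bound = max(1, int(current_batch * shrink))
    else:
        lat_bound = current_batch + grow
    return min(mem_bound, lat_bound)

# Example epoch: 8 GiB free, ~64 MiB per request, p99 within a 200 ms SLO.
print(select_batch_size(8 << 30, 64 << 20, recent_p99_latency=0.15,
                        slo=0.2, current_batch=16))       # -> 17
```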

C. LLM Inference Serving

  • Bucket-based batching: Partition requests by sequence length into buckets to minimize padding waste, adaptively split or merge buckets according to observed load, and select the batch size in real time via a KV-cache memory model (Zheng et al., 23 Jul 2025); a sketch follows this list.
  • Priority-aware scheduling: Enforce FCFS, SJF, or LJF policies within and across buckets to optimize latency or tokens/sec, with continuous batching during the decoding phase.
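
A minimal sketch of length-bucketed batch formation with a KV-cache-derived cap, in the spirit of Zheng et al. (23 Jul 2025); the bucket boundaries, FCFS ordering, and memory constants are illustrative assumptions:

```python
import bisect

BUCKET_BOUNDS = [128, 512, 2048, 8192]   # illustrative per-bucket max lengths

def bucket_of(seq_len):
    """Smallest bucket whose padded length covers the request (clamped)."""
    return min(bisect.bisect_left(BUCKET_BOUNDS, seq_len), len(BUCKET_BOUNDS) - 1)

def batch_cap(bucket_idx, safe_mem, layers, heads, head_dim, bytes_per_el):
    """Max requests per batch so the padded KV cache fits in safe memory;
    per-token KV cost is 2 * L * H * D * B (keys and values)."""
    per_req = 2 * layers * heads * head_dim * bytes_per_el * BUCKET_BOUNDS[bucket_idx]
    return max(1, safe_mem // per_req)

def form_batches(requests, **mem_kwargs):
    """Group (request_id, seq_len) pairs by bucket, keep FCFS order within
    each bucket, and chunk each bucket's queue by its memory-derived cap."""
    queues = {}
    for rid, seq_len in requests:                     # FCFS: arrival order
        queues.setdefault(bucket_of(seq_len), []).append(rid)
    batches = []
    for b, queue in sorted(queues.items()):
        cap = batch_cap(b, **mem_kwargs)
        batches += [queue[i:i + cap] for i in range(0, len(queue), cap)]
    return batches
```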

D. Distributed and Heterogeneous Training

  • PID-based mini-batch controllers: Each worker in a heterogeneous cluster adjusts its mini-batch size via proportional control on the deviation of its iteration time from a target; gradients are weighted in proportion to batch size for correctness (Tyagi et al., 2023).
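
A minimal sketch of such a controller, combining the proportional update formalized in Section 3 with the deadband, bounds, and EWMA smoothing discussed in Section 4 (the gains, bounds, and smoothing factor here are illustrative assumptions):

```python
class ProportionalBatchController:
    """Per-worker mini-batch controller in the spirit of Tyagi et al. (2023):
    b_{t+1} = b_t - X * (T_t - T*), with EWMA smoothing of iteration times,
    a deadband to avoid ping-ponging, and hard bounds on the batch size."""

    def __init__(self, target_time, b_init, b_min=8, b_max=4096,
                 deadband=0.05, ewma_alpha=0.3):
        self.target = target_time           # desired per-iteration time T*
        self.batch = b_init
        self.b_min, self.b_max = b_min, b_max
        self.deadband = deadband            # relative tolerance around T*
        self.alpha = ewma_alpha
        self.smoothed = None

    def update(self, iter_time):
        """Observe one iteration time and return the next mini-batch size."""
        # EWMA smoothing to avoid reacting to single noisy iterations.
        if self.smoothed is None:
            self.smoothed = iter_time
        else:
            self.smoothed = self.alpha * iter_time + (1 - self.alpha) * self.smoothed
        error = self.smoothed - self.target
        # Deadband: ignore small deviations so the batch size does not oscillate.
        if abs(error) <= self.deadband * self.target:
            return self.batch
        throughput = self.batch / self.smoothed        # X_k: items per second
        self.batch = int(self.batch - throughput * error)
        self.batch = max(self.b_min, min(self.b_max, self.batch))
        return self.batch
```

Gradients from each worker would then be weighted in proportion to its batch size, as noted above.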

3. Mathematical Formulations and Cost Models

Dynamic batching is rigorously modeled using:

  • Cost and throughput models balancing batch size against processing/latency overheads, e.g., $\text{TotalTime}(G) = T_{\text{analysis}}(G) + \alpha\, N_{\text{ops}} / B_{\text{eff}}(G)$ (Zha et al., 2019).
  • SMDP and DT-MDP Bellman equations for service systems: the optimal policy satisfies $h(s) = \min_a \{\, c(s,a) - g^{*} y(s,a) + \sum_j m(j \mid s,a)\, h(j) \,\}$ (Xu et al., 4 Jan 2025).
  • Memory-aware batch limits of the form $N_{\text{max}} \leq M_{\text{safe}} / (2 L H D B)$, or probabilistic bounds via the CLT and normal quantiles (Pang et al., 7 Mar 2025, Zheng et al., 23 Jul 2025).
  • PID-based dynamic control for distributed batch sizing: $b_k^{t+1} = b_k^t - X_k (T_k^t - T^{*})$, where $X_k$ is the worker's throughput (Tyagi et al., 2023).
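
As a worked instance of the memory-aware bound above, reading $N_{\text{max}}$ as the total number of cached tokens across the batch (the model dimensions and reserved memory budget are assumed for illustration):

```python
# Assumed model: L = 32 layers, H = 32 heads, D = 128 head dim, fp16 (B = 2 bytes).
L, H, D, B = 32, 32, 128, 2
per_token_kv = 2 * L * H * D * B      # keys + values: 524,288 bytes per token
M_safe = 40 * 2**30                   # assume 40 GiB reserved for the KV cache
N_max = M_safe // per_token_kv        # 81,920 tokens may be cached at once
print(per_token_kv, N_max)
```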

4. Implementation Strategies and Heuristics

Practical dynamic batching implementations integrate:

  • Subgraph or operation signature hashing for grouping nodes.
  • Heuristics to divide batching at subgraph, operator, or kernel granularity for trade-offs between matching cost and batch effectiveness (Zha et al., 2019).
  • Caching and memoization of compiled batched graphs to avoid repeated scheduling analysis (Zha et al., 2019).
  • Adaptive bucket formation and splitting algorithms, e.g., a 50% midpoint rule for splitting and merging when total load drops below the memory threshold (Zheng et al., 23 Jul 2025); a sketch follows this list.
  • Deadbanding for PID controllers, bounds on per-worker batch size, EWMA smoothing over iteration times to avoid instability (Tyagi et al., 2023).
  • Priority queues and PQ-tree-based memory planning to enforce alignment and minimize data movement (Chen et al., 2023).
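
A minimal sketch of the bucket split/merge heuristic in the list above; reading the "50% midpoint rule" as splitting an overloaded bucket's length range at its midpoint is an interpretation, and the thresholds are assumptions:

```python
def adapt_buckets(buckets, load, split_threshold, merge_threshold):
    """buckets: sorted list of (lo, hi) sequence-length ranges.
    load:    dict mapping each range to its currently queued load.
    Split an overloaded range at its midpoint and merge adjacent lightly
    loaded ranges (illustrative sketch; thresholds are assumptions)."""
    if not buckets:
        return []
    split, split_load = [], {}
    for rng in buckets:
        lo, hi = rng
        qload = load.get(rng, 0)
        if qload > split_threshold and hi - lo > 1:
            mid = (lo + hi) // 2                       # 50% midpoint split
            split += [(lo, mid), (mid, hi)]
            # Assume queued load divides roughly evenly between the halves.
            split_load[(lo, mid)] = split_load[(mid, hi)] = qload / 2
        else:
            split.append(rng)
            split_load[rng] = qload
    merged, merged_load = [split[0]], [split_load[split[0]]]
    for rng in split[1:]:
        if merged_load[-1] + split_load[rng] < merge_threshold:
            merged[-1] = (merged[-1][0], rng[1])       # absorb into previous
            merged_load[-1] += split_load[rng]
        else:
            merged.append(rng)
            merged_load.append(split_load[rng])
    return merged
```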

5. Empirical Performance and Complexity Results

Significant empirical findings include:

  • Neural dynamic batching achieves up to 6.25× speedup vs. manual or naïve per-instance execution; kernel launches reduced from millions to thousands per epoch (Zha et al., 2019, Neubig et al., 2017).
  • ED-Batch improves throughput by 1.15× (chains), 1.39× (trees), and 2.45× (lattices) over state-of-the-art frameworks with RL-discovered policies, and memory PQ-tree planning reduces data movement up to 66× (Chen et al., 2023).
  • LLM serving with BucketServe: up to 3.58× throughput improvement over UELLM, 1.93× higher request load under 80% SLO, 1.975× higher system capacity (Zheng et al., 23 Jul 2025).
  • SMDP-based policies enable up to 98% time and 63.5% state space reduction via abstract cost for tail state handling; consistently achieve Pareto-optimal trade-off curves (latency vs. energy) outperforming static and greedy baselines (Xu et al., 4 Jan 2025).
  • Dynamic per-worker batching in heterogeneous training yields up to 4× speedup in wall-clock time compared to static, with improved accuracy in asynchronous SGD via reduced gradient staleness (Tyagi et al., 2023).
  • For GNN batching, dynamic algorithms (JAX/Jraph) provide up to 2.7× speedup on CPU, but static batching may outperform on GPU for long runs with large batch sizes; no effect on model convergence or accuracy (Speckhard et al., 2 Feb 2025).

6. Limitations and Practical Guidelines

Known limitations and guidance include:

  • Overhead: dynamic batching incurs matching/grouping, stacking/slicing, and queue management costs; overhead grows with structural heterogeneity and batch size (Neubig et al., 2017, Chen et al., 2023).
  • Unique graphs: if every graph/sample is distinct, batching offers little benefit but can still incur analysis cost (Zha et al., 2019).
  • Conditional control flow, cycles, and recursion are harder to batch; program-counter autobatching handles recursive programs at the cost of higher memory overhead (Radul et al., 2019).
  • Padding waste: bucket-based and dynamic batching strategies minimize but cannot eliminate padding in LLM and graph-serving workloads (Zheng et al., 23 Jul 2025, Speckhard et al., 2 Feb 2025).
  • For distributed training, per-worker batch adjustment must avoid rapid oscillations ("ping-ponging"); use deadband and bounded updates (Tyagi et al., 2023).
  • Dynamic batching only parallelizes operations with aligned signatures; deep, unbalanced structures may limit batching opportunity (Looks et al., 2017).

Best practices:

  • Wrap repeated operations in sub-blocks for coarse batching (Zha et al., 2019).
  • Scope dynamic batching tightly to batch-aligned computations, excluding logging/metrics (Neubig et al., 2017).
  • Pre-declare types and operation signatures; batch over depth or other partition axes (Looks et al., 2017).
  • For variable GPU memory or latency constraints, continuously monitor running statistics and adapt batch size in real time, applying stricter bounds when traffic or resource spikes are observed (Pang et al., 7 Mar 2025, Zheng et al., 23 Jul 2025).
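
A minimal sketch of the running-statistics approach in the last point, paired with a CLT-style probabilistic batch bound of the kind cited in Section 3 (the normal approximation, the ~95% quantile z = 1.645, and the online-estimation scheme are assumptions):

```python
import math

class RunningStats:
    """Welford's online mean/variance over observed per-request memory usage."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

def probabilistic_batch_bound(stats, mem_budget, z=1.645):
    """Largest b with b*mean + z*sqrt(b)*std <= mem_budget: by the CLT the
    total memory of b requests is roughly Normal(b*mean, b*var), so this
    keeps the overflow probability near 5% for z = 1.645."""
    if stats.mean <= 0:
        return 1
    std = math.sqrt(stats.var)
    b = 0
    while (b + 1) * stats.mean + z * math.sqrt(b + 1) * std <= mem_budget:
        b += 1
    return max(1, b)
```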

7. Impact and Future Directions

Dynamic batching algorithms have become foundational in high-performance neural network systems (MXNet, DyNet, TensorFlow Fold, vLLM, distributed ML frameworks), enabling efficient execution of models with irregular or dynamic computation graphs in both training and inference. The techniques bridge the gap between manual operator grouping and fully automated scheduling under complex constraints. Future research directions include:

  • Extending RL- or SMDP-based policy discovery to more complex service queues and multi-stage systems.
  • Integrating fine-grained batching with conditional computation and data-dependent control flow.
  • Exploiting probabilistic knowledge of workload for tighter batch-sizing under SLA and memory constraints.
  • Adapting bucket-based batching and memory feedback for emerging hardware with highly variable resource profiles.
  • Pushing batch-dynamic algorithms into streaming, out-of-core, and cross-cluster environments.

Dynamic batching stands as a core abstraction for scalable, resource-efficient, and responsive computation systems, with continuing evolution toward unified, learning-based scheduling and integration across frameworks and hardware platforms (Zheng et al., 23 Jul 2025, Pang et al., 7 Mar 2025, Chen et al., 2023, Zha et al., 2019, Xu et al., 4 Jan 2025, Tyagi et al., 2023, Neubig et al., 2017).
