Heterogeneous Batching in ML Systems

Updated 20 October 2025
  • Heterogeneous batching is a method for grouping diverse computational tasks to maximize efficiency, throughput, and resource utilization in machine learning pipelines.
  • It leverages dynamic scheduling, adaptive batch sizing, and resource-aware algorithms to manage variations in data, model structures, and hardware performance.
  • It is applied in domains like LLM inference, edge multi-task processing, and distributed computing, demonstrating improvements in convergence speed, latency, and cost-efficiency.

Heterogeneous batching refers to the set of algorithmic, systems, and architectural techniques that optimize the grouping, scheduling, and execution of computational tasks—across varying data types, model structures, hardware, and application demands—so as to maximize utilization, throughput, and flexibility in machine learning and data processing pipelines. Unlike traditional homogeneous batching, where all elements in a batch share identical computational or statistical characteristics (such as feature size, sequence length, or resource requirements), heterogeneous batching enables efficient parallelism and resource allocation in the presence of substantial diversity. This article surveys the core methodologies, mathematical models, practical engineering mechanisms, and principal application domains of heterogeneous batching.
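
To make the cost of ignoring this diversity concrete, the short sketch below (with hypothetical sequence lengths) compares the compute wasted by a homogeneous pad-to-max batch against a simple length-grouped alternative; the numbers and variable names are illustrative only.

```python
# Illustrative only: quantify padding waste when batching variable-length
# sequences by padding every sequence to the longest one in the batch.
lengths = [3, 5, 4, 100]          # hypothetical token counts in one batch

padded_slots = len(lengths) * max(lengths)   # 4 * 100 = 400 computed slots
useful_slots = sum(lengths)                  # 112 slots carry real tokens
waste = 1 - useful_slots / padded_slots      # 0.72: ~72% of compute is padding

# Grouping the three short sequences separately from the long one
# (a simple form of heterogeneous batching) removes most of that waste.
grouped = [[3, 5, 4], [100]]
grouped_slots = sum(len(g) * max(g) for g in grouped)   # 15 + 100 = 115
grouped_waste = 1 - useful_slots / grouped_slots        # ~2.6%
print(f"pad-to-max waste: {waste:.1%}, grouped waste: {grouped_waste:.1%}")
```

Even this toy grouping recovers most of the padded-out compute, which is the basic motivation behind the techniques surveyed in the following sections.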

1. Foundations: Definitions and Taxonomy

Heterogeneous batching arises whenever batched computational units (data points, requests, subgraphs, or subproblems) are not uniform and require explicit management to avoid waste, inefficiency, or loss of accuracy. The main sources of heterogeneity in batching include:

  • Instance heterogeneity: Input data varies by length (as in NLP with variable-length sequences), structure (e.g., trees vs. sequences in dynamic neural networks), statistical properties (e.g., distribution across sources or classes), or task (multi-task inference).
  • System heterogeneity: Processing hardware (e.g., CPU, GPU, cloud/serverless, edge) exhibits varying throughput, memory, and latency characteristics.
  • Algorithmic/graph heterogeneity: Control flow, operation type, or dependency structure varies across batch elements.

Principal modes of heterogeneous batching can be classified along several axes:

Dimension | Representative Examples
Data structure | Variable-length sequences, tree-structured models, graph-based batches
Hardware/context | Mixed CPU/GPU/TPU/resource-aware batching (Ma et al., 2020; Zhang et al., 2023; Zhou et al., 2023)
Model/execution level | Layer/node-level batching (e.g., LazyBatching (Choi et al., 2020)); adaptive normalization (Alsobhi et al., 2022)
Scheduling/objectives | SLO/SLA/QoS-aware batching (Choi et al., 2020; Zhang et al., 2023; Chen et al., 9 May 2024; Zheng et al., 23 Jul 2025)

This diversity necessitates specialized batching algorithms, compatibility models, and optimization techniques.

2. Core Algorithmic Approaches

Multiple core designs have been established for implementing heterogeneous batching at various levels:

  • Graph-based and dynamic operation batching: Automatic batching schemes for dynamic computation graphs (e.g., DyNet, ED-Batch) assign operation "signatures" encoding type and shape and group compatible nodes for efficient batched execution. Execution scheduling involves either depth-based heuristics or agenda/priority-based batching to maximize compatibility while preserving dependency order (Neubig et al., 2017, Chen et al., 2023).
  • Finite State and RL-based batching: For highly dynamic or structurally variable models, batching policies are encoded as finite state machines learned by RL to minimize the number of batches or data movement, with reward functions targeting minimal kernel launches or contiguous memory usage (Chen et al., 2023).
  • Program transformation for control-intensive kernels: Control-flow intensive and recursive programs (e.g., Markov Chain Monte Carlo algorithms) are transformed to explicit batched forms, either using masking (local static autobatching) or explicit program counter stacks (program counter autobatching), enabling aggregate execution on accelerators while respecting divergent control paths (Radul et al., 2019).
  • Heterogeneous resource allocation: Adaptive batching on heterogeneous CPU+GPU clusters assigns batch sizes proportional to device speed and dynamically tunes them to balance update frequency and statistical efficiency, with explicit mechanisms for reallocation based on observed throughput (Ma et al., 2020, Zhang et al., 2023, Tyagi et al., 2023, Zhou et al., 2023).
  • Bucket and bin-based batching: In high-variance inference (e.g., LLMs), requests are partitioned into bins or buckets according to predicted execution time or sequence length to reduce padding and "straggler" effects, maximizing GPU utilization and throughput (Guldogan et al., 3 Dec 2024, Zheng et al., 23 Jul 2025); a minimal sketch of this idea follows this list.
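
As a concrete illustration of the bucket/bin-based approach, the following minimal sketch groups requests by a hypothetical predicted_len field into fixed length buckets; the bucket boundaries, batch size, and request format are assumptions and do not reflect the scheduling logic of any particular cited system.

```python
from collections import defaultdict
from typing import Iterable

def bucket_batches(requests: Iterable[dict],
                   boundaries=(64, 256, 1024),
                   max_batch_size: int = 8):
    """Group requests into batches whose members have similar predicted
    lengths, so that little computation is spent on padding.

    Each request is assumed to carry a 'predicted_len' field (e.g. prompt
    length plus a decode-length estimate); boundaries define the upper
    edge of each bucket.
    """
    buckets = defaultdict(list)
    for req in requests:
        # Assign the request to the first bucket that covers its length.
        bucket_id = next((i for i, b in enumerate(boundaries)
                          if req["predicted_len"] <= b), len(boundaries))
        buckets[bucket_id].append(req)

    # Emit fixed-size batches from each bucket; leftovers form smaller batches.
    for bucket_id in sorted(buckets):
        reqs = buckets[bucket_id]
        for start in range(0, len(reqs), max_batch_size):
            yield reqs[start:start + max_batch_size]

# Example: requests with widely varying predicted lengths land in different
# batches instead of forcing short requests to wait on long ones.
demo = [{"id": i, "predicted_len": n} for i, n in enumerate([30, 40, 900, 35, 700])]
for batch in bucket_batches(demo):
    print([r["predicted_len"] for r in batch])
```

Production schedulers additionally adapt the bin edges to the observed workload and respect GPU memory budgets, as in the bucket- and bin-based systems cited above.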

3. Optimization, Compatibility, and Scheduling Mechanisms

Efficient heterogeneous batching relies on precise definitions of compatibility and combinatorial scheduling techniques:

  • Signature-based compatibility: Nodes/operations are compared by signatures encoding operation type, tensor dimensions, parameter identities, and execution context. Only nodes with matching signatures (and non-conflicting dependencies) are batched (Neubig et al., 2017); a sketch combining signatures with readiness-driven scheduling follows this list.
  • Agenda/depth-based scheduling: Batching opportunities are detected and scheduled to maximize concurrency. The agenda-based strategy delays execution to accumulate more batchable nodes, considering the average depth of nodes in the graph (Neubig et al., 2017, Chen et al., 2023).
  • PQ-tree memory planning: To minimize data movement during heterogeneous batching, PQ-tree–based algorithms are used to produce memory layouts that guarantee adjacency and alignment constraints for batched operations (Chen et al., 2023).
  • Adaptive batch sizing and resource matching: Synchronous or delayed synchronous training/distributed inference mechanisms dynamically adjust batch sizes or the number of concurrent model instances according to hardware speed or load, via proportional control, PID-like adjustment, or DRL-based controllers (Ma et al., 2020, Zhang et al., 2023, Tyagi et al., 2023, Zhou et al., 2023, Luan et al., 16 Jan 2025).
  • Node/layer-level SLA/QoS-aware batching: Fine-grained scheduling (e.g., LazyBatching) allows SLA-aware admission control at the node/layer level, optimizing both latency and throughput via conservative slack time estimation models (Choi et al., 2020).
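
The signature-based compatibility test and the scheduling strategies described above can be combined in a small scheduler sketch. The Node structure and signature fields below are assumptions, and the scheme shown is the simpler readiness-driven (depth-style) heuristic; the agenda strategy in the cited work additionally prioritizes which compatible group to launch (e.g., by average node depth) rather than launching every ready group at once.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                                    # operation type, e.g. "matmul", "tanh"
    shape: tuple                               # output tensor shape
    deps: list = field(default_factory=list)   # indices of input nodes

def signature(node: Node) -> tuple:
    # Nodes are batchable only if they run the same kernel on the same shapes.
    return (node.op, node.shape)

def batched_schedule(nodes: list[Node]):
    """Greedy compatibility-driven batching: repeatedly take all currently
    'ready' nodes (dependencies satisfied), group them by signature, and
    launch one batched kernel per group. Dependency order is preserved
    because a node only becomes ready after all of its inputs have run."""
    done = set()
    remaining = set(range(len(nodes)))
    while remaining:
        ready = [i for i in remaining if all(d in done for d in nodes[i].deps)]
        groups = defaultdict(list)
        for i in ready:
            groups[signature(nodes[i])].append(i)
        for sig, members in groups.items():
            yield sig, members                 # one batched kernel launch per group
        done.update(ready)
        remaining.difference_update(ready)

# Example: two independent matmuls with identical shapes batch together,
# while the tanh that depends on one of them runs in the next step.
g = [Node("matmul", (4, 8)), Node("matmul", (4, 8)), Node("tanh", (4, 8), deps=[0])]
for sig, members in batched_schedule(g):
    print(sig, members)
```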

4. Applications and Architectures

Heterogeneous batching has been successfully adopted in a variety of application domains:

  • High-variance LLM inference: Bucket-based and multi-bin batching enable throughput-optimal serving of requests with diverse sequence lengths by minimizing padding and service time variance (Guldogan et al., 3 Dec 2024, Zheng et al., 23 Jul 2025).
  • Edge and serverless multi-task inference: Joint optimization of batch formation and model offloading/onloading in hierarchical multi-task inference systems delivers high accuracy and system utilization under strict memory, compute, and SLO constraints (Cha et al., 18 Aug 2025, Chen et al., 9 May 2024).
  • Warehouse and logistics optimization: Task-oriented heterogeneous graph clustering allows the co-batching of orders for picking paths, directly optimizing operational metrics such as picking distance (Duan et al., 2020).
  • GNN training: Community-structure-aware randomized mini-batching (COMM-RAND) enables control over the tradeoff between convergence efficiency and per-epoch speed by interpolating between pure randomization and deterministic graph-structured batching, significantly improving GPU cache utilization (Balaji et al., 25 Apr 2025).
  • Cloud and distributed data processing: The streaming batch model (Ray Data) merges batch and streaming paradigms, offering memory-efficient, pipelined, and fault-tolerant heterogeneous execution across distributed CPU and GPU clusters, improving throughput by 3–8× for batch inference pipelines (Luan et al., 16 Jan 2025).

5. Performance Tradeoffs and Empirical Outcomes

Empirical results across domains demonstrate the value and necessity of heterogeneous batching:

  • Throughput and utilization improvements: BucketServe achieves up to 3.58× throughput over static batching baselines for LLMs, simultaneously increasing system load capacity without exceeding GPU memory (Zheng et al., 23 Jul 2025). HarmonyBatch reduces serverless inference costs by up to 82.9% while ensuring multi-SLO compliance (Chen et al., 9 May 2024).
  • Convergence acceleration: Adaptive Hogbatch and ABS-SGD demonstrate 1.3–4× faster convergence in heterogeneous clusters versus standard synchronous approaches, owing to dynamic batch reallocation and gradient weighting (Ma et al., 2020, Zhou et al., 2023, Tyagi et al., 2023).
  • Latency and SLO compliance: LazyBatching yields up to 15× lower response time and 5.5× improvement in SLA satisfaction by dynamically batching at the DNN node level (Choi et al., 2020). BCEdge increases the utility metric (joint throughput–latency objective) by up to 37.6% using DRL for dual-parameter (batch size/concurrence) adaptation (Zhang et al., 2023).
  • Task-specific gains: In graph learning, COMM-RAND reduces GNN per-epoch time by up to 2.76× (1.8× average) while maintaining accuracy within 1.8 percentage points relative to fully randomized strategies (Balaji et al., 25 Apr 2025); in multi-user edge AI, joint batching and scheduling methods increase throughput while respecting system constraints (Cang et al., 2023).

6. Mathematical Models and Theoretical Guarantees

Heterogeneous batching research frequently employs rigorous mathematical modeling and offers provable properties:

  • Queueing-theoretic analysis: Multi-bin batching for LLM inference uses explicit throughput and latency models, optimizing batch assignment via convex programming and order statistics to formally approach optimal throughput as bins grow finer (Guldogan et al., 3 Dec 2024); a schematic version of this throughput argument appears after this list.
  • Optimization relaxations and duality: Joint onloading/offloading in hierarchical inference is addressed via alternating Lagrangian-relaxed submodular maximization (for model assignment) and surrogate-constrained LP (for routing/offloading), ensuring optimality gaps are provably minimized (Cha et al., 18 Aug 2025).
  • PID/proportional control: Dynamic batch reallocation controllers are designed using proportional control and moving average filtering, providing stable convergence to uniform iteration times across workers (Tyagi et al., 2023).
  • Statistical guarantees: Adaptive batch normalization with heterogeneity-aware thresholds delivers better accuracy and stability in small-batch scenarios than uniform BN application, as shown by systematic empirical evaluation in (Alsobhi et al., 2022).
  • Complexity and scalability: Techniques such as PQ-tree rearrangement for minimizing data movement during dynamic batching are proven to be nearly linear in practice for constant batch sizes (Chen et al., 2023).
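
Two of the models above can be written in schematic form (illustrative simplifications, not the exact formulations of the cited works): the throughput limit that motivates multi-bin batching, and a proportional batch-size update of the kind used for dynamic reallocation.

```latex
% Schematic throughput model for batched serving: a batch of B requests with
% service times T_1,...,T_B finishes only when its slowest request does.
\[
  \mathrm{Throughput}(B)
  \;=\; \frac{B}{\mathbb{E}\!\left[\max_{1 \le i \le B} T_i\right]}
  \;\le\; \frac{B}{\mathbb{E}[T]} .
\]
% Binning requests by predicted service time shrinks the within-bin spread,
% so E[max T_i] approaches E[T] inside each bin and throughput approaches the
% upper bound as the bins become finer.

% Schematic proportional batch-size update for worker i with smoothed
% iteration time tau_i and target iteration time tau*:
\[
  b_i^{(t+1)} \;=\; b_i^{(t)} \,\frac{\tau^{\ast}}{\tau_i^{(t)}} ,
\]
% which grows the batch on fast workers and shrinks it on slow ones until
% per-worker iteration times equalize.
```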

7. Trends and Open Research Directions

The evolution of heterogeneous batching reflects larger trends in machine learning systems:

  • Shift to dynamic and resource-adaptive execution: Increasing model complexity, diverse application demands, and the rise of mixed hardware platforms necessitate dynamic, compatibility-driven batching strategies.
  • Scaling and fault tolerance: Hybrid models such as the streaming batch paradigm deliver batch-style reproducibility and fault recovery with stream-style elasticity, representing a convergence of conflicting objectives in distributed ML pipelines (Luan et al., 16 Jan 2025).
  • Unified abstraction and automation: Automatic batching compilers, RL-based policy learning, and analytical cost models collectively aim to close the gap between developer intent and system execution, reducing the manual burden of tuning and design across data/model/hardware axes.
  • Extension to statistical heterogeneity: Methods such as structure-aware batching (COMM-RAND) and adaptive normalization/tuning (Balaji et al., 25 Apr 2025, Alsobhi et al., 2022) indicate applicability to scenarios with non-i.i.d. or temporally-/spatially-varying data statistics.
  • Open research challenges: Key open problems include optimal bin/bucket partitioning under adversarial prediction errors, meta-learning of batching policy hyperparameters, and fine-grained online adaptation under unpredictable workload changes.

Heterogeneous batching has thus become a fundamental construct across the ML stack, underpinning the efficiency, scalability, and flexibility of modern data-driven systems. Systematic progress in this area continues to drive significant improvements in both real-world applications and academic benchmarks across diverse domains.
