ML Inference Scheduling
- ML Inference Scheduling is the orchestration of machine learning model inference workloads on diverse hardware while meeting latency, throughput, and SLO constraints.
- Techniques include queueing models, mixed-integer programming, and reinforcement learning, with strategies such as batching, spatial partitioning, and interference-aware scheduling.
- Empirical results demonstrate significant throughput gains and reduced tail latency, highlighting the importance of resource-aware optimization and multi-objective scheduling.
Machine learning (ML) inference scheduling is the discipline concerned with orchestrating the execution of ML model inference workloads on hardware platforms—such as GPUs, edge clusters, or custom accelerators—under constraints of latency, throughput, resource partitioning, accuracy, and service-level objectives (SLOs). Modern systems increasingly require precise, high-throughput inference despite fluctuating arrivals, heterogeneous requests, and complex multi-tenant or multi-modal workloads. The problem intersects with queueing theory, multi-resource optimization, hardware architecture, and reinforcement learning, and is foundational to real-time analytics, cloud/edge intelligence, and LLM serving.
1. Formal Problem Statement and Mathematical Models
Abstractly, ML inference scheduling takes as input a dynamic stream of inference requests, each defined by parameters such as model identity, input/output size, accuracy requirement, arrival/deadline, and resource footprint, and seeks an assignment, order, and partition of these requests across available compute resources subject to platform constraints. Typical objectives include:
- Latency minimization: Minimize E2E or quantile response times.
- SLO attainment: Maximize the number/weight of jobs completed under SLOs.
- Resource utilization: Maximize GPU/cluster occupancy.
- Accuracy/utility: Optimize the weighted sum of accuracy-adjusted completed inferences.
The problem is often captured using queueing-theoretical, mixed-integer-programming, or Markov decision process formulations:
- Bin-packing/MIP: Assign requests to bins (clients, GPU partitions) to minimize makespan under per-job or per-batch processing times, e.g., $\min \max_{k} \sum_{j \in B_k} p_j$, where $p_j$ is the processing time of request $j$ and $B_k$ is the set of requests assigned to bin $k$ (see (Pang et al., 14 Feb 2025); a greedy heuristic sketch appears at the end of this section).
- Queueing models with predictions: Non-preemptive/predicted-SJF, preemptive SPRPT and variants, using ML-based service time estimators $\hat{p}_j$ for each job $j$, allowing analytical or competitive-performance guarantees (see (Mitzenmacher et al., 10 Mar 2025)).
- SLO-goodput objective: Maximize goodput per unit latency, e.g., $\frac{\sum_j \mathbb{1}[\text{SLO}_j \text{ met}]}{\sum_j \ell_j}$, counting SLO-satisfied completions per unit latency (Huang et al., 21 Apr 2025).
- Multi-objective fitness: a weighted combination such as $F = w_1\,\text{Utilization} - w_2\,\text{Latency}$, blending utilization and latency (Li et al., 28 Jul 2025).
- Gain-index/MDP: At the wireless edge, gain indices quantify the error reduction per resource unit, leading to maximum-gain-first (MGF) policies (Shisher et al., 8 Jan 2025).
This formalization often considers additional resource models: GPU SMs, memory, KV-cache pressure, communication constraints, and interference models.
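To make the bin-packing view concrete, the following is a minimal Python sketch of a greedy longest-processing-time-first (LPT) assignment for the makespan objective above. It is an illustrative heuristic under assumed per-request processing-time estimates, not the MIP formulation solved in (Pang et al., 14 Feb 2025); the request names and times are hypothetical.

```python
import heapq

def lpt_assign(processing_times, num_bins):
    """Greedy longest-processing-time-first assignment of requests to bins.

    A classic heuristic for the makespan-minimization view: sort requests by
    processing time (descending) and always place the next request on the
    currently least-loaded bin (GPU partition). Returns (assignment, makespan).
    """
    # Min-heap of (current load, bin index).
    bins = [(0.0, k) for k in range(num_bins)]
    heapq.heapify(bins)
    assignment = {}
    # Place the longest requests first so large jobs do not land last
    # on an already-loaded bin.
    for job, p in sorted(processing_times.items(), key=lambda kv: -kv[1]):
        load, k = heapq.heappop(bins)
        assignment[job] = k
        heapq.heappush(bins, (load + p, k))
    makespan = max(load for load, _ in bins)
    return assignment, makespan

# Example: 6 requests with estimated per-batch processing times (ms), 2 partitions.
times = {"r0": 40.0, "r1": 25.0, "r2": 10.0, "r3": 35.0, "r4": 5.0, "r5": 20.0}
assignment, makespan = lpt_assign(times, num_bins=2)
print(assignment, makespan)
```

The same skeleton extends to per-batch processing times or extra capacity constraints by tightening the feasibility check before each placement.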
2. Core Scheduling Algorithms and Strategies
Inference schedulers are implemented using a suite of algorithmic techniques, attuned to system and workload specifics.
A. Partitioning and Parallelism
- Spatial partitioning (gpu-lets, partitions): Modern GPUs (Ampere, Hopper) allow slicing into "gpu-lets" or resource partitions, each executing a model or batch at an assigned SM/memory share (Choi et al., 2021, Kim et al., 2022).
- Batching and queueing: Requests are grouped into inference batches subject to memory/SLO constraints; batch sizes dynamically selected to maximize throughput while respecting per-batch/partition constraints (Kim et al., 2022, Huang et al., 21 Apr 2025).
- Heterogeneity-aware assignment: Partition sizes are matched to the observed/modelled distribution of request batch sizes, e.g., small slices for low-batch MobileNet, large slices for high-batch BERT (Kim et al., 2022).
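As a sketch of the batching bullet above: pick the largest batch size whose predicted latency and memory still fit the per-partition constraints. The linear latency and memory models (base_ms, per_req_ms, per_req_mb) are placeholder assumptions; deployed systems profile these curves per model and per partition.

```python
def pick_batch_size(queue_len, slo_ms, mem_budget_mb,
                    base_ms=4.0, per_req_ms=1.5, per_req_mb=80.0,
                    max_batch=64):
    """Return the largest batch size that fits the SLO and memory budget.

    Assumes a linear latency model latency(b) = base_ms + per_req_ms * b and
    a linear memory model mem(b) = per_req_mb * b; real systems substitute
    profiled curves for the model and partition at hand.
    """
    best = 0
    for b in range(1, min(queue_len, max_batch) + 1):
        predicted_latency = base_ms + per_req_ms * b
        predicted_mem = per_req_mb * b
        if predicted_latency <= slo_ms and predicted_mem <= mem_budget_mb:
            best = b
        else:
            break  # both models are monotone in b, so we can stop early
    return best

# Example: 40 queued requests, 50 ms SLO, 4 GB of free partition memory.
print(pick_batch_size(queue_len=40, slo_ms=50.0, mem_budget_mb=4096.0))
```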
B. SLO- and Utility-Aware Scheduling
- SLO mapping and simulated annealing: Jobs are prioritized by SLO urgency/type, with sequence and batch size optimized via simulated annealing over a goodput-per-latency objective (Huang et al., 21 Apr 2025).
- Accuracy scaling and data-aware selection: Model selection and prioritization are adapted using instance-specific predicted accuracy $\hat{a}_i$ and slack/deadline, producing a priority of the form $\hat{a}_i / \text{slack}_i$ (Wolfrath et al., 10 May 2025); a minimal priority sketch follows this list.
- Iteration-level preemptive scheduling: For LLMs, at each token step, jobs can be preempted, using Lagrangian cost heuristics to decide when to insert prefill tasks or continue decode (Pang et al., 14 Feb 2025).
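A minimal sketch of the accuracy- and slack-weighted priority named in the accuracy-scaling bullet above; the exact functional form and accuracy predictor in (Wolfrath et al., 10 May 2025) may differ, and the Request fields here are illustrative.

```python
import time
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    predicted_accuracy: float  # instance-specific accuracy estimate \hat{a}_i
    deadline_s: float          # absolute deadline (epoch seconds)
    est_service_s: float       # estimated service time on the chosen model

def priority(req: Request, now: float, eps: float = 1e-3) -> float:
    """Higher is more urgent/valuable: predicted accuracy divided by slack.

    Slack is the time remaining after subtracting the estimated service time,
    so requests that are both high-accuracy and nearly out of slack float to
    the front of the queue.
    """
    slack = max(req.deadline_s - now - req.est_service_s, eps)
    return req.predicted_accuracy / slack

def order_queue(requests, now=None):
    """Sort pending requests by descending priority."""
    now = time.time() if now is None else now
    return sorted(requests, key=lambda r: priority(r, now), reverse=True)
```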
C. Interference and Resource Awareness
- Interference-aware admission: Predicts a per-batch slowdown factor $\delta$ due to co-located workloads using linear regression on GPU counters, scheduling a batch only if its latency remains below the deadline under the predicted $\delta$ (Zhao et al., 21 Dec 2025); an admission-test sketch follows this list.
- Cache and model locality: Preference for GPUs where the requested model is resident; O3 out-of-order mechanisms allow cache-hit requests to "jump ahead" within a fairness window, minimizing load/unload overhead (Zhao et al., 2023).
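A minimal sketch of the interference-aware admission test: a pre-fitted linear model over GPU counters predicts the co-location slowdown factor $\delta$, and a batch is admitted only if its scaled latency still meets the deadline. Counter names and coefficients are placeholders, not those measured in (Zhao et al., 21 Dec 2025).

```python
def predict_slowdown(counters, coef, intercept=1.0):
    """Predict the co-location slowdown factor delta >= 1 from GPU counters.

    `counters` and `coef` are dicts keyed by counter name (e.g. L2 miss rate,
    DRAM bandwidth utilization); the linear model is assumed to have been
    fitted offline against co-location measurements.
    """
    delta = intercept + sum(coef[k] * counters[k] for k in coef)
    return max(delta, 1.0)  # a batch never runs faster under interference

def admit(batch_latency_ms, deadline_ms, counters, coef):
    """Admit the batch only if the interference-scaled latency meets the deadline."""
    delta = predict_slowdown(counters, coef)
    return batch_latency_ms * delta <= deadline_ms

# Hypothetical counters and coefficients for illustration only.
coef = {"l2_miss_rate": 0.8, "dram_bw_util": 0.5}
counters = {"l2_miss_rate": 0.3, "dram_bw_util": 0.4}
print(admit(batch_latency_ms=20.0, deadline_ms=30.0, counters=counters, coef=coef))
```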
D. Learning-Based Methods
- Deep RL meta-schedulers: Model-free RL agents (e.g., ASET, DQN) select among a small set of static heuristics based on system state, learning cost/delay tradeoffs (especially in edge deployments with dynamic network conditions) (Castellano et al., 2023).
- Hybrid RL+exact: RL-based solvers yield solutions to relaxed scheduling subproblems, refined by deterministic ILP within a reduced search space for optimality (Yin et al., 2023).
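The RL meta-scheduler bullet above can be illustrated with a deliberately simplified stand-in: an epsilon-greedy bandit that picks among a handful of static heuristics and updates an empirical reward estimate for whichever one it used. Systems such as ASET use deep RL over richer system state; this sketch only shows the select-heuristic/observe-reward loop.

```python
import random

class MetaScheduler:
    """Epsilon-greedy meta-policy that picks one of several static heuristics."""

    def __init__(self, heuristics, epsilon=0.1):
        self.heuristics = heuristics          # name -> callable(queue, cluster_state)
        self.epsilon = epsilon
        self.counts = {name: 0 for name in heuristics}
        self.mean_reward = {name: 0.0 for name in heuristics}

    def choose(self):
        # Explore occasionally, otherwise exploit the best-performing heuristic.
        if random.random() < self.epsilon:
            return random.choice(list(self.heuristics))
        return max(self.mean_reward, key=self.mean_reward.get)

    def update(self, name, reward):
        # Incremental mean update for the chosen heuristic.
        self.counts[name] += 1
        n = self.counts[name]
        self.mean_reward[name] += (reward - self.mean_reward[name]) / n

    def schedule(self, queue, cluster_state):
        name = self.choose()
        decision = self.heuristics[name](queue, cluster_state)
        return name, decision  # caller reports the observed reward via update()
```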
E. Multi-Modal and Distributed Inference
- Index-threshold policies for AoI: In remote/multimodal inference, optimal policy is index-based: switch data modality when its marginal inference-error reduction index exceeds a threshold (Zhang et al., 11 Aug 2025).
- MGF (Maximum-Gain-First) scheduling: In wireless multi-task inference, dual decomposition yields per-task gain indices guiding CPU and channel allocation (Shisher et al., 8 Jan 2025).
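A minimal sketch of maximum-gain-first allocation: each resource unit goes to the task whose current gain index (error reduction per additional unit) is largest. The gain curves below are hypothetical inputs, not the dual-decomposition indices derived in (Shisher et al., 8 Jan 2025).

```python
def mgf_allocate(tasks, total_units):
    """Maximum-gain-first allocation of discrete resource units to tasks.

    Each task maps to a function gain(units_already_given) -> estimated
    inference-error reduction from one more resource unit (CPU slot, channel
    use). At every step the next unit goes to the task with the largest
    current gain index.
    """
    allocation = {name: 0 for name in tasks}
    for _ in range(total_units):
        # Recompute gain indices given what each task has received so far.
        best = max(tasks, key=lambda name: tasks[name](allocation[name]))
        allocation[best] += 1
    return allocation

# Example with diminishing-returns gain curves (hypothetical).
tasks = {
    "detector": lambda u: 1.0 / (1 + u),
    "tracker":  lambda u: 0.6 / (1 + 0.5 * u),
}
print(mgf_allocate(tasks, total_units=5))
```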
3. Hardware Models, Resource Partitioning, and Interference
The hardware context is critical. State-of-the-art frameworks leverage:
- Dynamic GPU partitioning: Nvidia A100/Hopper supports time- or space-multiplexed partitions (GPCs, MPS), enabling assignment of variable slice sizes (Kim et al., 2022, Choi et al., 2021).
- Programmable, heterogeneous accelerators: Custom architectures combining systolic arrays and vector processors, where schedulers assign tasks dynamically based on layer type and a resource model (Kim et al., 2022).
- Cache and memory management: LRU-based cache management enables high cache-hit rates; model swapping and eviction policies are influenced by per-GPU utilization and memory (Zhao et al., 2023).
- Interference modeling: Linear models (using counters such as L2 cache and DRAM bandwidth utilization), validated against extensive co-location measurements, drive the scheduler's co-residence decisions (Choi et al., 2021, Zhao et al., 21 Dec 2025).
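A small sketch of how such a linear interference model might be fitted offline from co-location profiling runs, assuming measured GPU counters and observed slowdowns are available; the counters and data below are hypothetical.

```python
import numpy as np

def fit_interference_model(counter_samples, measured_slowdowns):
    """Fit a linear slowdown model from co-location profiling runs.

    counter_samples: (n_runs, n_counters) array of GPU counter readings
    (e.g. L2 miss rate, DRAM bandwidth utilization) taken while co-locating
    pairs of models; measured_slowdowns: (n_runs,) observed latency ratios
    relative to running alone. Returns (coefficients, intercept) for the
    linear predictor used by the admission test.
    """
    X = np.asarray(counter_samples, dtype=float)
    y = np.asarray(measured_slowdowns, dtype=float)
    # Append a bias column and solve ordinary least squares.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    coef, intercept = theta[:-1], theta[-1]
    return coef, intercept

# Hypothetical profiling data: 4 co-location runs, 2 counters each.
X = [[0.2, 0.3], [0.5, 0.6], [0.1, 0.8], [0.7, 0.4]]
y = [1.15, 1.45, 1.30, 1.40]
coef, intercept = fit_interference_model(X, y)
print(coef, intercept)
```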
4. Multi-Tenant, Multi-Objective, and Multi-SLO Scenarios
Servicing simultaneous heterogeneous workloads—multiple LLMs, multi-modal tasks, agentic graphs, or concurrent inference+training—complicates scheduling.
- Elastic scheduling (ELSA): Assigns jobs to partitions by predicted waiting time and SLA “slack,” prioritizing SLO-safe placement and falling back to fastest-finish placement if the SLA cannot be met (Kim et al., 2022).
- Unified inference-training scheduler (LeMix): Jointly schedules LLM serving and retraining, using prediction models for idle gaps and response times and SLO-aware queue reordering (Li et al., 28 Jul 2025).
- Multi-SLO scheduling: Explicit per-request SLO objectives (e2e latency, TTFT, TPOT) are integrated into a global “goodput-per-latency” maximization framework (Huang et al., 21 Apr 2025).
- Augmented LLMs and memory-time scheduling: Inference jobs invoking APIs (e.g., tool-augmented LLMs) are ranked by predicted integral of memory occupancy over time (“memory-time footprint”), with API-handling strategies (Preserve, Discard, Swap) selected per request to minimize collective resource waste (Shahout et al., 23 Oct 2024, Mitzenmacher et al., 10 Mar 2025).
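As a sketch of the memory-time scheduling idea in the last bullet: rank requests by their predicted memory-time footprint (KV-cache occupancy integrated over remaining time) and choose an API-handling strategy by a crude cost comparison. The field names, the footprint formula's exact form, and the strategy thresholds are illustrative assumptions rather than the models in (Shahout et al., 23 Oct 2024).

```python
from dataclasses import dataclass

@dataclass
class LLMRequest:
    req_id: str
    kv_cache_mb: float          # current KV-cache occupancy
    est_remaining_s: float      # predicted remaining decode time
    est_api_wait_s: float = 0.0 # predicted duration of any pending API call

def memory_time_footprint(req: LLMRequest) -> float:
    """Predicted integral of memory occupancy over time (MB * seconds)."""
    return req.kv_cache_mb * (req.est_remaining_s + req.est_api_wait_s)

def api_strategy(req: LLMRequest, swap_cost_s: float, recompute_cost_s: float) -> str:
    """Pick how to handle the KV cache while the request waits on an API call.

    Crude cost comparison: keep the cache resident if the wait is short,
    swap it out if swapping is cheaper than recomputing, otherwise discard
    and recompute. The cited papers use more refined per-request models.
    """
    if req.est_api_wait_s <= swap_cost_s:
        return "Preserve"
    if swap_cost_s <= recompute_cost_s:
        return "Swap"
    return "Discard"

def order_by_footprint(requests):
    """Serve requests with the smallest memory-time footprint first."""
    return sorted(requests, key=memory_time_footprint)
```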
5. Empirical Results and Comparative Analysis
Quantitative results consistently show substantial gains:
| Technique | Throughput Gain | Latency SLO Attainment | Energy/Resource Utilization | Key Paper |
|---|---|---|---|---|
| gpu-let+intf (partitioning) | 2.0–2.1× | SLO violation <0.14% | Near-ideal utilization | (Choi et al., 2021) |
| PARIS+ELSA (reconfig. GPU) | 1.7× (SLAs) | Up to 17.4× tail reduction | Balanced occupancy | (Kim et al., 2022) |
| SimAnn SLO scheduler | 5× (SLOs met) | 31.6% lower avg latency | Not reported | (Huang et al., 21 Apr 2025) |
| RL-based initialization (Inc-ILP) | 128× faster solver | Same as exact (ILP) | Optimal pipeline assignment | (Yin et al., 2023) |
| Sched. with RL meta-policy (ASET) | +7–10% success | – | – | (Castellano et al., 2023) |
| Multi-modal index-threshold | 55% error reduction | – | – | (Zhang et al., 11 Aug 2025) |
| Memory-time LLM scheduling | 27–85% less latency | 4–96% TTFT reduction | Up to 1.5× throughput | (Shahout et al., 23 Oct 2024) |
| Interference-aware scheduling | +33% throughput | 80% fewer SLO misses | Modest overhead (<1%) | (Zhao et al., 21 Dec 2025) |
| Edge co-sched. (MGF) | 26–32× lower inference error | – | Asymptotically optimal | (Shisher et al., 8 Jan 2025) |
Significance:
- Spatial/heterogeneous partitioning doubles throughput compared to temporal sharing in multi-model settings (Choi et al., 2021, Kim et al., 2022).
- SLO-aware and utility-maximizing schedulers deliver >5× improvement in SLO compliance and marked latency reduction (Huang et al., 21 Apr 2025).
- Resource-aware and memory-time algorithms substantially cut tail latencies, crucial for LLM serving with prompt sharing or API calls (Shahout et al., 23 Oct 2024).
- Interference-aware strategies yield 80% SLO-miss reduction, with little scheduling overhead (Zhao et al., 21 Dec 2025).
- RL and ML-based schedulers adapt better to dynamic/mismatched workload conditions, outperforming static heuristics (Castellano et al., 2023, Yin et al., 2023).
6. Open Problems and Future Directions
The ML inference scheduling problem remains an intensive area for research. Outstanding challenges include:
- Robust queueing with predictions: Developing analytical tools for scheduling with inexact or distributional predictions; understanding multi-server cases; adapting policies to aggregate or ignore bad predictions (Mitzenmacher et al., 10 Mar 2025).
- Dynamic and stochastic resource models: Incorporating random job sizes, arrivals, and dynamically shifting resource pools (e.g., GPU hotplugging), requiring stochastic programming approaches (Pang et al., 14 Feb 2025).
- Adaptive, learning-based schedulers: Reinforcement learning agents that adapt online to non-stationarity in model/job/task mix, handling prompt sharing and speculative decoding (Mitzenmacher et al., 10 Mar 2025, Li et al., 28 Jul 2025).
- Integration with system stack: Direct co-design with operators (e.g., CUDA stream management), multi-stage deployment optimization, and predictive/online cache placement (Zhao et al., 2023, Yu et al., 2021).
- Theoretical bounds for memory-augmented LLMs: Unified frameworks for requests that invoke variable-latency APIs and require complex memory handling (Shahout et al., 23 Oct 2024, Mitzenmacher et al., 10 Mar 2025).
- Multi-agent & DAG structured workflows: Schedulers for agentic reasoning programs, expansion/aggregation phase orchestration, DAG-wide memory reuse, and early termination (Mitzenmacher et al., 10 Mar 2025).
- QoS/accuracy-energy trade-offs: Schedulers that balance accuracy scaling, energy efficiency, and deadline constraints, especially on resource-constrained edge or accelerator hardware (Wolfrath et al., 10 May 2025, Kim et al., 2022).
These directions point toward a confluence of queueing theory, ML-based prediction, real-time optimization, and systems-level integration as the field continues to advance.