ML Inference Scheduling
- ML Inference Scheduling is the orchestration of machine learning model inference workloads on diverse hardware while meeting latency, throughput, and SLO constraints.
- Techniques include queueing models, mixed-integer programming, and reinforcement learning, with strategies such as batching, spatial partitioning, and interference-aware scheduling.
- Empirical results demonstrate significant throughput gains and reduced tail latency, highlighting the importance of resource-aware optimization and multi-objective scheduling.
Machine learning (ML) inference scheduling is the discipline concerned with orchestrating the execution of ML model inference workloads on hardware platforms—such as GPUs, edge clusters, or custom accelerators—under constraints of latency, throughput, resource partitioning, accuracy, and service-level objectives (SLOs). Modern systems increasingly require precise, high-throughput inference despite fluctuating arrivals, heterogeneous requests, and complex multi-tenant or multi-modal workloads. The problem intersects with queueing theory, multi-resource optimization, hardware architecture, and reinforcement learning, and is foundational to real-time analytics, cloud/edge intelligence, and LLM serving.
1. Formal Problem Statement and Mathematical Models
Abstractly, ML inference scheduling takes as input a dynamic stream of inference requests, each defined by parameters such as model identity, input/output size, accuracy requirement, arrival/deadline, and resource footprint, and seeks an assignment, order, and partition of these requests across available compute resources subject to platform constraints. Typical objectives include:
- Latency minimization: Minimize E2E or quantile response times.
- SLO attainment: Maximize the number/weight of jobs completed under SLOs.
- Resource utilization: Maximize GPU/cluster occupancy.
- Accuracy/utility: Optimize the weighted sum of accuracy-adjusted completed inferences.
The problem is often captured using queueing-theoretical, mixed-integer-programming, or Markov decision process formulations:
- Bin-packing/MIP: Assign requests to bins (clients, GPU partitions) to minimize makespan under per-job or per-batch processing times, e.g., $\min \max_{k} \sum_{j \in B_k} p_j$, where $p_j$ is the processing time of request $j$ and $B_k$ is the set of requests assigned to bin $k$ (see (Pang et al., 14 Feb 2025); a greedy heuristic sketch appears at the end of this section).
- Queueing models with predictions: Non-preemptive/predicted-SJF, preemptive SPRPT and variants, using ML-based service time estimators $\hat{p}_j$ for each job $j$, allowing analytical or competitive-performance guarantees (see (Mitzenmacher et al., 10 Mar 2025)).
- SLO-goodput objective: Maximize goodput per unit latency, e.g., $\frac{\sum_j \mathbb{1}[\text{SLO}_j \text{ met}]}{\sum_j \ell_j}$, counting SLO-satisfied completions per unit latency (Huang et al., 21 Apr 2025).
- Multi-objective fitness: a weighted combination such as $F = w_1\,\text{Utilization} - w_2\,\text{Latency}$, blending utilization and latency (Li et al., 28 Jul 2025).
- Gain-index/MDP: At the wireless edge, gain indices quantify the error reduction per resource unit, leading to maximum-gain-first (MGF) policies (Shisher et al., 8 Jan 2025).
This formalization often considers additional resource models: GPU SMs, memory, KV-cache pressure, communication constraints, and interference models.
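To make the bin-packing view concrete, the following is a minimal Python sketch of a greedy longest-processing-time-first (LPT) assignment for the makespan objective above. It is an illustrative heuristic under assumed per-request processing-time estimates, not the MIP formulation solved in (Pang et al., 14 Feb 2025); the request names and times are hypothetical.

```python
import heapq

def lpt_assign(processing_times, num_bins):
    """Greedy longest-processing-time-first assignment of requests to bins.

    A classic heuristic for the makespan-minimization view: sort requests by
    processing time (descending) and always place the next request on the
    currently least-loaded bin (GPU partition). Returns (assignment, makespan).
    """
    # Min-heap of (current load, bin index).
    bins = [(0.0, k) for k in range(num_bins)]
    heapq.heapify(bins)
    assignment = {}
    # Place the longest requests first so large jobs do not land last
    # on an already-loaded bin.
    for job, p in sorted(processing_times.items(), key=lambda kv: -kv[1]):
        load, k = heapq.heappop(bins)
        assignment[job] = k
        heapq.heappush(bins, (load + p, k))
    makespan = max(load for load, _ in bins)
    return assignment, makespan

# Example: 6 requests with estimated per-batch processing times (ms), 2 partitions.
times = {"r0": 40.0, "r1": 25.0, "r2": 10.0, "r3": 35.0, "r4": 5.0, "r5": 20.0}
assignment, makespan = lpt_assign(times, num_bins=2)
print(assignment, makespan)
```

The same skeleton extends to per-batch processing times or extra capacity constraints by tightening the feasibility check before each placement.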
2. Core Scheduling Algorithms and Strategies
Inference schedulers are implemented using a suite of algorithmic techniques, attuned to system and workload specifics.
A. Partitioning and Parallelism
- Spatial partitioning (gpu-lets, partitions): Modern GPUs (Ampere, Hopper) allow slicing into "gpu-lets" or resource partitions, each executing a model or batch at an assigned SM/memory share (Choi et al., 2021, Kim et al., 2022).
- Batching and queueing: Requests are grouped into inference batches subject to memory/SLO constraints; batch sizes dynamically selected to maximize throughput while respecting per-batch/partition constraints (Kim et al., 2022, Huang et al., 21 Apr 2025).
- Heterogeneity-aware assignment: Partition sizes are matched to the observed/modelled distribution of request batch sizes, e.g., small slices for low-batch MobileNet, large slices for high-batch BERT (Kim et al., 2022).
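As a sketch of the batching bullet above: pick the largest batch size whose predicted latency and memory still fit the per-partition constraints. The linear latency and memory models (base_ms, per_req_ms, per_req_mb) are placeholder assumptions; deployed systems profile these curves per model and per partition.

```python
def pick_batch_size(queue_len, slo_ms, mem_budget_mb,
                    base_ms=4.0, per_req_ms=1.5, per_req_mb=80.0,
                    max_batch=64):
    """Return the largest batch size that fits the SLO and memory budget.

    Assumes a linear latency model latency(b) = base_ms + per_req_ms * b and
    a linear memory model mem(b) = per_req_mb * b; real systems substitute
    profiled curves for the model and partition at hand.
    """
    best = 0
    for b in range(1, min(queue_len, max_batch) + 1):
        predicted_latency = base_ms + per_req_ms * b
        predicted_mem = per_req_mb * b
        if predicted_latency <= slo_ms and predicted_mem <= mem_budget_mb:
            best = b
        else:
            break  # both models are monotone in b, so we can stop early
    return best

# Example: 40 queued requests, 50 ms SLO, 4 GB of free partition memory.
print(pick_batch_size(queue_len=40, slo_ms=50.0, mem_budget_mb=4096.0))
```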
B. SLO- and Utility-Aware Scheduling
- SLO mapping and simulated annealing: Jobs are prioritized by SLO urgency/type, with sequence and batch size optimized via simulated annealing over a goodput-per-latency objective (Huang et al., 21 Apr 2025).
- Accuracy scaling and data-aware selection: Model selection and prioritization are adapted using instance-specific predicted accuracy $\hat{a}_i$ and slack/deadline, producing a priority of the form $\hat{a}_i / \text{slack}_i$ (Wolfrath et al., 10 May 2025); a minimal priority sketch follows this list.
- Iteration-level preemptive scheduling: For LLMs, at each token step, jobs can be preempted, using Lagrangian cost heuristics to decide when to insert prefill tasks or continue decode (Pang et al., 14 Feb 2025).
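A minimal sketch of the accuracy- and slack-weighted priority named in the accuracy-scaling bullet above; the exact functional form and accuracy predictor in (Wolfrath et al., 10 May 2025) may differ, and the Request fields here are illustrative.

```python
import time
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    predicted_accuracy: float  # instance-specific accuracy estimate \hat{a}_i
    deadline_s: float          # absolute deadline (epoch seconds)
    est_service_s: float       # estimated service time on the chosen model

def priority(req: Request, now: float, eps: float = 1e-3) -> float:
    """Higher is more urgent/valuable: predicted accuracy divided by slack.

    Slack is the time remaining after subtracting the estimated service time,
    so requests that are both high-accuracy and nearly out of slack float to
    the front of the queue.
    """
    slack = max(req.deadline_s - now - req.est_service_s, eps)
    return req.predicted_accuracy / slack

def order_queue(requests, now=None):
    """Sort pending requests by descending priority."""
    now = time.time() if now is None else now
    return sorted(requests, key=lambda r: priority(r, now), reverse=True)
```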
C. Interference and Resource Awareness
- Interference-aware admission: Predicts a per-batch slowdown factor $\delta$ due to co-located workloads using linear regression on GPU counters, scheduling a batch only if its latency remains below the deadline under the predicted $\delta$ (Zhao et al., 21 Dec 2025); an admission-test sketch follows this list.
- Cache and model locality: Preference for GPUs where the requested model is resident; O3 out-of-order mechanisms allow cache-hit requests to "jump ahead" within a fairness window, minimizing load/unload overhead (Zhao et al., 2023).
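A minimal sketch of the interference-aware admission test: a pre-fitted linear model over GPU counters predicts the co-location slowdown factor $\delta$, and a batch is admitted only if its scaled latency still meets the deadline. Counter names and coefficients are placeholders, not those measured in (Zhao et al., 21 Dec 2025).

```python
def predict_slowdown(counters, coef, intercept=1.0):
    """Predict the co-location slowdown factor delta >= 1 from GPU counters.

    `counters` and `coef` are dicts keyed by counter name (e.g. L2 miss rate,
    DRAM bandwidth utilization); the linear model is assumed to have been
    fitted offline against co-location measurements.
    """
    delta = intercept + sum(coef[k] * counters[k] for k in coef)
    return max(delta, 1.0)  # a batch never runs faster under interference

def admit(batch_latency_ms, deadline_ms, counters, coef):
    """Admit the batch only if the interference-scaled latency meets the deadline."""
    delta = predict_slowdown(counters, coef)
    return batch_latency_ms * delta <= deadline_ms

# Hypothetical counters and coefficients for illustration only.
coef = {"l2_miss_rate": 0.8, "dram_bw_util": 0.5}
counters = {"l2_miss_rate": 0.3, "dram_bw_util": 0.4}
print(admit(batch_latency_ms=20.0, deadline_ms=30.0, counters=counters, coef=coef))
```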
D. Learning-Based Methods
- Deep RL meta-schedulers: Model-free RL agents (e.g., ASET, DQN) select among a small set of static heuristics based on system state, learning cost/delay tradeoffs (especially in edge deployments with dynamic network conditions) (Castellano et al., 2023).
- Hybrid RL+exact: RL-based solvers yield solutions to relaxed scheduling subproblems, refined by deterministic ILP within a reduced search space for optimality (Yin et al., 2023).
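The RL meta-scheduler bullet above can be illustrated with a deliberately simplified stand-in: an epsilon-greedy bandit that picks among a handful of static heuristics and updates an empirical reward estimate for whichever one it used. Systems such as ASET use deep RL over richer system state; this sketch only shows the select-heuristic/observe-reward loop.

```python
import random

class MetaScheduler:
    """Epsilon-greedy meta-policy that picks one of several static heuristics."""

    def __init__(self, heuristics, epsilon=0.1):
        self.heuristics = heuristics          # name -> callable(queue, cluster_state)
        self.epsilon = epsilon
        self.counts = {name: 0 for name in heuristics}
        self.mean_reward = {name: 0.0 for name in heuristics}

    def choose(self):
        # Explore occasionally, otherwise exploit the best-performing heuristic.
        if random.random() < self.epsilon:
            return random.choice(list(self.heuristics))
        return max(self.mean_reward, key=self.mean_reward.get)

    def update(self, name, reward):
        # Incremental mean update for the chosen heuristic.
        self.counts[name] += 1
        n = self.counts[name]
        self.mean_reward[name] += (reward - self.mean_reward[name]) / n

    def schedule(self, queue, cluster_state):
        name = self.choose()
        decision = self.heuristics[name](queue, cluster_state)
        return name, decision  # caller reports the observed reward via update()
```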
E. Multi-Modal and Distributed Inference
- Index-threshold policies for AoI: In remote/multimodal inference, optimal policy is index-based: switch data modality when its marginal inference-error reduction index exceeds a threshold (Zhang et al., 11 Aug 2025).
- MGF (Maximum-Gain-First) scheduling: In wireless multi-task inference, dual decomposition yields per-task gain indices guiding CPU and channel allocation (Shisher et al., 8 Jan 2025).
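A minimal sketch of maximum-gain-first allocation: each resource unit goes to the task whose current gain index (error reduction per additional unit) is largest. The gain curves below are hypothetical inputs, not the dual-decomposition indices derived in (Shisher et al., 8 Jan 2025).

```python
def mgf_allocate(tasks, total_units):
    """Maximum-gain-first allocation of discrete resource units to tasks.

    Each task maps to a function gain(units_already_given) -> estimated
    inference-error reduction from one more resource unit (CPU slot, channel
    use). At every step the next unit goes to the task with the largest
    current gain index.
    """
    allocation = {name: 0 for name in tasks}
    for _ in range(total_units):
        # Recompute gain indices given what each task has received so far.
        best = max(tasks, key=lambda name: tasks[name](allocation[name]))
        allocation[best] += 1
    return allocation

# Example with diminishing-returns gain curves (hypothetical).
tasks = {
    "detector": lambda u: 1.0 / (1 + u),
    "tracker":  lambda u: 0.6 / (1 + 0.5 * u),
}
print(mgf_allocate(tasks, total_units=5))
```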
3. Hardware Models, Resource Partitioning, and Interference
The hardware context is critical. State-of-the-art frameworks leverage:
- Dynamic GPU partitioning: Nvidia A100/Hopper supports time- or space-multiplexed partitions (GPCs, MPS), enabling assignment of variable slice sizes (Kim et al., 2022, Choi et al., 2021).
- Programmable, heterogeneous accelerators: Custom architectures combining systolic arrays and vector processors, where schedulers assign tasks dynamically based on layer type and a resource model (Kim et al., 2022).
- Cache and memory management: LRU-based cache management enables high cache-hit rates; model swapping and eviction policies are influenced by per-GPU utilization and memory (Zhao et al., 2023).
- Interference modeling: Linear models (using counters such as L2 cache and DRAM bandwidth utilization), validated against extensive co-location measurements, drive the scheduler's co-residence decisions (Choi et al., 2021, Zhao et al., 21 Dec 2025).
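A small sketch of how such a linear interference model might be fitted offline from co-location profiling runs, assuming measured GPU counters and observed slowdowns are available; the counters and data below are hypothetical.

```python
import numpy as np

def fit_interference_model(counter_samples, measured_slowdowns):
    """Fit a linear slowdown model from co-location profiling runs.

    counter_samples: (n_runs, n_counters) array of GPU counter readings
    (e.g. L2 miss rate, DRAM bandwidth utilization) taken while co-locating
    pairs of models; measured_slowdowns: (n_runs,) observed latency ratios
    relative to running alone. Returns (coefficients, intercept) for the
    linear predictor used by the admission test.
    """
    X = np.asarray(counter_samples, dtype=float)
    y = np.asarray(measured_slowdowns, dtype=float)
    # Append a bias column and solve ordinary least squares.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    coef, intercept = theta[:-1], theta[-1]
    return coef, intercept

# Hypothetical profiling data: 4 co-location runs, 2 counters each.
X = [[0.2, 0.3], [0.5, 0.6], [0.1, 0.8], [0.7, 0.4]]
y = [1.15, 1.45, 1.30, 1.40]
coef, intercept = fit_interference_model(X, y)
print(coef, intercept)
```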
4. Multi-Tenant, Multi-Objective, and Multi-SLO Scenarios
Servicing simultaneous heterogeneous workloads—multiple LLMs, multi-modal tasks, agentic graphs, or concurrent inference+training—complicates scheduling.
- Elastic scheduling (ELSA): Assigns jobs to partitions by predicted waiting time and SLA “slack,” prioritizing SLO-safe placement and falling back to fastest-finish placement if the SLA cannot be met (Kim et al., 2022).
- Unified inference-training scheduler (LeMix): Jointly schedules LLM serving and retraining, using prediction models for idle gaps and response times and SLO-aware queue reordering (Li et al., 28 Jul 2025).
- Multi-SLO scheduling: Explicit per-request SLO objectives (e2e latency, TTFT, TPOT) are integrated into a global “goodput-per-latency” maximization framework (Huang et al., 21 Apr 2025).
- Augmented LLMs and memory-time scheduling: Inference jobs invoking APIs (e.g., tool-augmented LLMs) are ranked by predicted integral of memory occupancy over time (“memory-time footprint”), with API-handling strategies (Preserve, Discard, Swap) selected per request to minimize collective resource waste (Shahout et al., 23 Oct 2024, Mitzenmacher et al., 10 Mar 2025).
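As a sketch of the memory-time scheduling idea in the last bullet: rank requests by their predicted memory-time footprint (KV-cache occupancy integrated over remaining time) and choose an API-handling strategy by a crude cost comparison. The field names, the footprint formula's exact form, and the strategy thresholds are illustrative assumptions rather than the models in (Shahout et al., 23 Oct 2024).

```python
from dataclasses import dataclass

@dataclass
class LLMRequest:
    req_id: str
    kv_cache_mb: float          # current KV-cache occupancy
    est_remaining_s: float      # predicted remaining decode time
    est_api_wait_s: float = 0.0 # predicted duration of any pending API call

def memory_time_footprint(req: LLMRequest) -> float:
    """Predicted integral of memory occupancy over time (MB * seconds)."""
    return req.kv_cache_mb * (req.est_remaining_s + req.est_api_wait_s)

def api_strategy(req: LLMRequest, swap_cost_s: float, recompute_cost_s: float) -> str:
    """Pick how to handle the KV cache while the request waits on an API call.

    Crude cost comparison: keep the cache resident if the wait is short,
    swap it out if swapping is cheaper than recomputing, otherwise discard
    and recompute. The cited papers use more refined per-request models.
    """
    if req.est_api_wait_s <= swap_cost_s:
        return "Preserve"
    if swap_cost_s <= recompute_cost_s:
        return "Swap"
    return "Discard"

def order_by_footprint(requests):
    """Serve requests with the smallest memory-time footprint first."""
    return sorted(requests, key=memory_time_footprint)
```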
5. Empirical Results and Comparative Analysis
Quantitative results consistently show substantial gains:
| Technique | Throughput Gain | Latency SLO Attainment | Energy/Resource Utilization | Key Paper |
|---|---|---|---|---|
| gpu-let+intf (partitioning) | 2.0–2.1× | SLO violation <0.14% | Near-ideal utilization | (Choi et al., 2021) |
| PARIS+ELSA (reconfig. GPU) | 1.7× (SLAs) | Up to 17.4× tail reduction | Balanced occupancy | (Kim et al., 2022) |
| SimAnn SLO scheduler | 5× (SLOs met) | 31.6% lower avg latency | Not reported | (Huang et al., 21 Apr 2025) |
| RL-based initialization (Inc-ILP) | 128× faster solver | Same as exact (ILP) | Optimal pipeline assignment | (Yin et al., 2023) |
| Sched. with RL meta-policy (ASET) | +7–10% success | – | – | (Castellano et al., 2023) |
| Multi-modal index-threshold | 55% error reduction | – | – | (Zhang et al., 11 Aug 2025) |
| Memory-time LLM scheduling | 27–85% less latency | 4–96% TTFT reduction | Up to 1.5× throughput | (Shahout et al., 23 Oct 2024) |
| Interference-aware scheduling | +33% throughput | 80% fewer SLO misses | Modest overhead (<1%) | (Zhao et al., 21 Dec 2025) |
| Edge co-sched. (MGF) | 26–32× lower inference error | – | Asymptotically optimal | (Shisher et al., 8 Jan 2025) |
Significance:
- Spatial/heterogeneous partitioning doubles throughput compared to temporal sharing in multi-model settings (Choi et al., 2021, Kim et al., 2022).
- SLO-aware and utility-maximizing schedulers deliver >5× improvement in SLO compliance and marked latency reduction (Huang et al., 21 Apr 2025).
- Resource-aware and memory-time algorithms substantially cut tail latencies, crucial for LLM serving with prompt sharing or API calls (Shahout et al., 23 Oct 2024).
- Interference-aware strategies yield 80% SLO-miss reduction, with little scheduling overhead (Zhao et al., 21 Dec 2025).
- RL and ML-based schedulers adapt better to dynamic/mismatched workload conditions, outperforming static heuristics (Castellano et al., 2023, Yin et al., 2023).
6. Open Problems and Future Directions
The ML inference scheduling problem remains an intensive area for research. Outstanding challenges include:
- Robust queueing with predictions: Developing analytical tools for scheduling with inexact or distributional predictions; understanding multi-server cases; adapting policies to aggregate or ignore bad predictions (Mitzenmacher et al., 10 Mar 2025).
- Dynamic and stochastic resource models: Incorporating random job sizes, arrivals, and dynamically shifting resource pools (e.g., GPU hotplugging), requiring stochastic programming approaches (Pang et al., 14 Feb 2025).
- Adaptive, learning-based schedulers: Reinforcement learning agents that adapt online to non-stationarity in model/job/task mix, handling prompt sharing and speculative decoding (Mitzenmacher et al., 10 Mar 2025, Li et al., 28 Jul 2025).
- Integration with system stack: Direct co-design with operators (e.g., CUDA stream management), multi-stage deployment optimization, and predictive/online cache placement (Zhao et al., 2023, Yu et al., 2021).
- Theoretical bounds for memory-augmented LLMs: Unified frameworks for requests that invoke variable-latency APIs and require complex memory handling (Shahout et al., 23 Oct 2024, Mitzenmacher et al., 10 Mar 2025).
- Multi-agent & DAG structured workflows: Schedulers for agentic reasoning programs, expansion/aggregation phase orchestration, DAG-wide memory reuse, and early termination (Mitzenmacher et al., 10 Mar 2025).
- QoS/accuracy-energy trade-offs: Schedulers that balance accuracy scaling, energy efficiency, and deadline constraints, especially on resource-constrained edge or accelerator hardware (Wolfrath et al., 10 May 2025, Kim et al., 2022).
These directions point toward a confluence of queueing theory, ML-based prediction, real-time optimization, and systems-level integration as the field continues to advance.