
ML Inference Scheduling

Updated 28 December 2025
  • ML Inference Scheduling is the orchestration of machine learning model inference workloads on diverse hardware while meeting latency, throughput, and SLO constraints.
  • Techniques include queueing models, mixed-integer programming, and reinforcement learning, with strategies such as batching, spatial partitioning, and interference-aware scheduling.
  • Empirical results demonstrate significant throughput gains and reduced tail latency, highlighting the importance of resource-aware optimization and multi-objective scheduling.

Machine learning (ML) inference scheduling is the discipline concerned with orchestrating the execution of ML model inference workloads on hardware platforms—such as GPUs, edge clusters, or custom accelerators—under constraints of latency, throughput, resource partitioning, accuracy, and service-level objectives (SLOs). Modern systems increasingly require precise, high-throughput inference despite fluctuating arrivals, heterogeneous requests, and complex multi-tenant or multi-modal workloads. The problem intersects with queueing theory, multi-resource optimization, hardware architecture, and reinforcement learning, and is foundational to real-time analytics, cloud/edge intelligence, and LLM serving.

1. Formal Problem Statement and Mathematical Models

Abstractly, ML inference scheduling takes as input a dynamic stream of inference requests, each defined by parameters such as model identity, input/output size, accuracy requirement, arrival/deadline, and resource footprint, and seeks an assignment, order, and partition of these requests across available compute resources subject to platform constraints. Typical objectives include:

  • Latency minimization: Minimize end-to-end (E2E) or tail-quantile response times.
  • SLO attainment: Maximize the number/weight of jobs completed under SLOs.
  • Resource utilization: Maximize GPU/cluster occupancy.
  • Accuracy/utility: Optimize the weighted sum of accuracy-adjusted completed inferences.

The problem is often captured using queueing-theoretical, mixed-integer-programming, or Markov decision process formulations:

  • Bin-packing/MIP: Assign requests to bins (clients, GPU partitions) to minimize makespan under per-job or per-batch processing times, e.g.,

$$\min_{x_{i,j}}\ t^{\max} \quad \text{subject to} \quad \sum_j x_{i,j} = 1,\ \ \sum_i x_{i,j}\,T_i \leq t^{\max},\ \ x_{i,j}\in\{0,1\}$$

where $T_i$ is the processing time of request $i$ (see (Pang et al., 14 Feb 2025)).

  • Queueing models with predictions: Non-preemptive/predicted-SJF, preemptive SPRPT and variants, using ML-based service-time estimators ($\hat S$), allowing analytical or competitive-performance guarantees (see (Mitzenmacher et al., 10 Mar 2025)).
  • SLO-goodput objective: Maximize $G = (\sum_i x_i) / (\sum_i t_{\mathrm{e2e},i})$, counting SLO-satisfied completions per unit latency (Huang et al., 21 Apr 2025).
  • Multi-objective fitness: $f^n(t) = \text{Idleness penalty} + \lambda_2 \cdot \text{length consistency} - \lambda_1 \cdot \text{response time}$, blending utilization and latency (Li et al., 28 Jul 2025).
  • Gain-index/MDP: At the wireless edge, gain indices $\alpha_{m,j,t}(\delta)$ quantify the error reduction per resource unit, leading to maximum-gain-first (MGF) policies (Shisher et al., 8 Jan 2025).

This formalization often considers additional resource models: GPU SMs, memory, KV-cache pressure, communication constraints, and interference models. A minimal greedy sketch of the bin-packing assignment above follows.
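As a concrete illustration of the bin-packing view, the following minimal Python sketch assigns requests to partitions with a longest-processing-time-first greedy heuristic, a standard approximation to the makespan MIP above; the `Request` class and `assign_min_makespan` function are illustrative names, not drawn from any cited system.

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    proc_time: float  # estimated processing time T_i

def assign_min_makespan(requests, num_partitions):
    """Greedy LPT heuristic for the makespan-minimization MIP above:
    place each request (longest first) on the currently least-loaded partition."""
    loads = [0.0] * num_partitions
    assignment = {j: [] for j in range(num_partitions)}
    for req in sorted(requests, key=lambda r: r.proc_time, reverse=True):
        j = min(range(num_partitions), key=lambda k: loads[k])  # least-loaded bin
        assignment[j].append(req.req_id)
        loads[j] += req.proc_time
    return assignment

# Example: two partitions; the makespan is max(loads) after assignment.
reqs = [Request("a", 3.0), Request("b", 2.0), Request("c", 2.0), Request("d", 1.0)]
print(assign_min_makespan(reqs, num_partitions=2))
```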

2. Core Scheduling Algorithms and Strategies

Inference schedulers are implemented using a suite of algorithmic techniques, attuned to system and workload specifics.

A. Partitioning and Parallelism

  • Spatial partitioning (gpu-lets, partitions): Modern GPUs (Ampere, Hopper) allow slicing into "gpu-lets" or resource partitions, each executing a model or batch at an assigned SM/memory share (Choi et al., 2021, Kim et al., 2022).
  • Batching and queueing: Requests are grouped into inference batches subject to memory/SLO constraints, with batch sizes selected dynamically to maximize throughput while respecting per-batch/partition constraints (Kim et al., 2022, Huang et al., 21 Apr 2025); see the sketch after this list.
  • Heterogeneity-aware assignment: Partition sizes are matched to the observed/modelled distribution of request batch sizes, e.g., small slices for low-batch MobileNet, large slices for high-batch BERT (Kim et al., 2022).
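A rough sketch of SLO-constrained batching (not the exact policy of any cited system; the latency model, memory fields, and thresholds are assumptions) grows the batch only while the estimated batch latency stays within the tightest per-request deadline and the memory budget:

```python
from collections import namedtuple

Req = namedtuple("Req", "req_id mem_mb deadline")  # deadline as absolute time (s)

def form_batch(queue, mem_budget_mb, latency_model, now):
    """Greedily grow a batch while the estimated batch latency fits the tightest
    per-request deadline and total memory stays within budget.
    latency_model(n) estimates the latency of a batch of size n (assumed)."""
    batch, mem_used = [], 0.0
    for req in queue:
        candidate = batch + [req]
        est_latency = latency_model(len(candidate))
        tightest_deadline = min(r.deadline for r in candidate)
        if mem_used + req.mem_mb > mem_budget_mb:
            break  # memory budget exceeded
        if now + est_latency > tightest_deadline:
            break  # adding this request would violate someone's SLO
        batch, mem_used = candidate, mem_used + req.mem_mb
    return batch

# Example: latency grows roughly linearly with batch size (illustrative model).
queue = [Req("a", 200, 0.08), Req("b", 200, 0.12), Req("c", 300, 0.05)]
print(form_batch(queue, mem_budget_mb=512, latency_model=lambda n: 0.02 * n, now=0.0))
```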

B. SLO- and Utility-Aware Scheduling

  • SLO mapping and simulated annealing: Jobs are prioritized by SLO urgency/type, with sequence and batch size optimized via simulated annealing over a goodput-per-latency objective (Huang et al., 21 Apr 2025).
  • Accuracy scaling and data-aware selection: Model selection and prioritization are adapted using instance-specific predicted accuracy ($\hat a_{j,m}$) and slack/deadline, producing the priority

$$p_{j,m} = \lambda\,(\text{slack}_j/d_j) + (1-\lambda)\,\hat a_{j,m}$$

(Wolfrath et al., 10 May 2025); a small scoring sketch appears after this list.

  • Iteration-level preemptive scheduling: For LLMs, at each token step, jobs can be preempted, using Lagrangian cost heuristics to decide when to insert prefill tasks or continue decode (Pang et al., 14 Feb 2025).
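The slack/accuracy priority above can be transcribed directly; in this sketch the request fields, the value of $\lambda$, and the descending sort order are assumptions rather than details from the cited paper:

```python
def priority(slack_s, deadline_s, predicted_accuracy, lam=0.5):
    """Priority p_{j,m} = lam * (slack_j / d_j) + (1 - lam) * a_hat_{j,m}."""
    return lam * (slack_s / deadline_s) + (1.0 - lam) * predicted_accuracy

# Rank (job, model) candidates by the score; serving higher scores first is an
# assumed convention here, not a detail taken from the cited work.
candidates = [
    {"job": "q1", "model": "resnet50", "slack_s": 0.02, "deadline_s": 0.10, "acc": 0.76},
    {"job": "q1", "model": "resnet18", "slack_s": 0.06, "deadline_s": 0.10, "acc": 0.70},
]
candidates.sort(key=lambda c: priority(c["slack_s"], c["deadline_s"], c["acc"]),
                reverse=True)
print([c["model"] for c in candidates])
```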

C. Interference and Resource Awareness

  • Interference-aware admission: Predicts per-batch slowdown ($I$) due to co-located workloads using linear regression on GPU counters, scheduling a batch only if its latency remains below the deadline under the predicted $I$ (Zhao et al., 21 Dec 2025); a minimal admission-check sketch follows this list.
  • Cache and model locality: Preference for GPUs where the requested model is resident; O3 out-of-order mechanisms allow cache-hit requests to "jump ahead" within a fairness window, minimizing load/unload overhead (Zhao et al., 2023).
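The interference-aware admission test can be sketched as a linear slowdown predictor over GPU counters followed by a deadline check; the feature set and coefficients below are placeholders, not the fitted model of the cited work:

```python
def predict_slowdown(counters, weights, bias):
    """Linear interference model: slowdown I = w·x + b, where x are GPU counters
    (e.g., L2 and DRAM bandwidth utilization) of already co-located workloads."""
    return sum(w * x for w, x in zip(weights, counters)) + bias

def admit(isolated_latency_s, deadline_s, counters, weights, bias):
    """Admit the batch on this GPU only if its predicted co-located latency
    (isolated latency scaled by the predicted slowdown I) still meets the deadline."""
    slowdown = max(1.0, predict_slowdown(counters, weights, bias))
    return isolated_latency_s * slowdown <= deadline_s

# Example with placeholder coefficients (they would be fitted from co-location profiling).
print(admit(0.010, 0.025, counters=[0.6, 0.4], weights=[0.8, 1.2], bias=1.0))
```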

D. Learning-Based Methods

  • Deep RL meta-schedulers: Model-free RL agents (e.g., ASET, DQN) select among a small set of static heuristics based on system state, learning cost/delay tradeoffs, especially in edge deployments with dynamic network conditions (Castellano et al., 2023); a generic skeleton is sketched after this list.
  • Hybrid RL+exact: RL-based solvers yield solutions to relaxed scheduling subproblems, refined by deterministic ILP within a reduced search space for optimality (Yin et al., 2023).
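Such a meta-scheduler can be sketched as a tabular Q-learner that, given a coarse system state, picks one of a few fixed heuristics; this is a generic RL skeleton with illustrative heuristic names, not the ASET or DQN architecture of the cited papers:

```python
import random
from collections import defaultdict

HEURISTICS = ["fifo", "sjf", "slo_first"]  # fixed base policies (illustrative names)

class MetaScheduler:
    def __init__(self, eps=0.1, lr=0.1, gamma=0.9):
        self.q = defaultdict(float)          # Q[(state, heuristic)]
        self.eps, self.lr, self.gamma = eps, lr, gamma

    def choose(self, state):
        if random.random() < self.eps:       # epsilon-greedy exploration
            return random.choice(HEURISTICS)
        return max(HEURISTICS, key=lambda h: self.q[(state, h)])

    def update(self, state, heuristic, reward, next_state):
        """One-step Q-learning update from observed cost/delay reward."""
        best_next = max(self.q[(next_state, h)] for h in HEURISTICS)
        td = reward + self.gamma * best_next - self.q[(state, heuristic)]
        self.q[(state, heuristic)] += self.lr * td

# Example: pick a heuristic for the current system state, then learn from the outcome.
sched = MetaScheduler()
h = sched.choose(state=("load_high", "net_slow"))
sched.update(("load_high", "net_slow"), h, reward=-0.2, next_state=("load_high", "net_ok"))
```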

E. Multi-Modal and Distributed Inference

  • Index-threshold policies for AoI: In remote/multimodal inference, optimal policy is index-based: switch data modality when its marginal inference-error reduction index exceeds a threshold (Zhang et al., 11 Aug 2025).
  • MGF (Maximum-Gain-First) scheduling: In wireless multi-task inference, dual decomposition yields per-task gain indices guiding CPU and channel allocation (Shisher et al., 8 Jan 2025).
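A maximum-gain-first allocator can be sketched as repeatedly granting the next resource unit to the task with the largest current gain index; the `gain` callback below is a placeholder for the paper's $\alpha_{m,j,t}(\delta)$ indices:

```python
import heapq

def mgf_allocate(tasks, total_units, gain):
    """Maximum-gain-first: repeatedly grant one resource unit to the task whose
    next unit yields the largest gain. gain(task, units_already_granted) stands
    in for the per-task gain index and is assumed non-increasing in units."""
    alloc = {t: 0 for t in tasks}
    heap = [(-gain(t, 0), i, t) for i, t in enumerate(tasks)]  # max-heap via negation
    heapq.heapify(heap)
    for _ in range(total_units):
        neg_g, i, t = heapq.heappop(heap)
        if -neg_g <= 0:
            break  # best remaining gain is non-positive: stop allocating
        alloc[t] += 1
        heapq.heappush(heap, (-gain(t, alloc[t]), i, t))
    return alloc

# Example: diminishing gains per extra CPU unit for two inference tasks.
gains = {"taskA": [0.5, 0.3, 0.1], "taskB": [0.4, 0.2, 0.05]}
print(mgf_allocate(["taskA", "taskB"], total_units=4,
                   gain=lambda t, k: gains[t][k] if k < len(gains[t]) else 0.0))
```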

3. Hardware Models, Resource Partitioning, and Interference

The hardware context is critical. State-of-the-art frameworks leverage:

  • Dynamic GPU partitioning: Nvidia A100/Hopper supports time- or space-multiplexed partitions (GPCs, MPS), enabling assignment of variable slice sizes (Kim et al., 2022, Choi et al., 2021).
  • Programmable, heterogeneous accelerators: Custom architectures combine systolic arrays and vector processors, with schedulers assigning tasks dynamically based on layer type and a resource model (Kim et al., 2022).
  • Cache and memory management: LRU-based cache management enables high cache-hit rates; model swapping and eviction policies are influenced by per-GPU utilization and memory (Zhao et al., 2023). A minimal LRU sketch follows this list.
  • Interference modeling: Linear models (using counters such as L2 and DRAM bandwidth), validated against extensive co-location measurements, drive the scheduler’s co-residence decisions (Choi et al., 2021, Zhao et al., 21 Dec 2025).
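A minimal sketch of the LRU model-cache behavior referenced above, using Python's `OrderedDict` (a generic LRU with a memory budget, not the exact eviction policy of the cited system):

```python
from collections import OrderedDict

class ModelCache:
    """Per-GPU model cache with LRU eviction, keyed by model name and bounded by
    a memory budget (MB). A hit refreshes recency; a miss loads the model and
    evicts least-recently-used entries until it fits."""
    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.used_mb = 0.0
        self.models = OrderedDict()  # model name -> resident size in MB

    def lookup(self, name):
        if name in self.models:
            self.models.move_to_end(name)   # refresh recency on a hit
            return True
        return False

    def load(self, name, size_mb):
        while self.used_mb + size_mb > self.capacity_mb and self.models:
            _, evicted_mb = self.models.popitem(last=False)  # evict LRU model
            self.used_mb -= evicted_mb
        self.models[name] = size_mb
        self.used_mb += size_mb

# Example: route a request to this GPU on a hit, otherwise load (possibly evicting).
cache = ModelCache(capacity_mb=1000)
if not cache.lookup("bert-base"):
    cache.load("bert-base", size_mb=420)
```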

4. Multi-Tenant, Multi-Objective, and Multi-SLO Scenarios

Servicing simultaneous heterogeneous workloads—multiple LLMs, multi-modal tasks, agentic graphs, or concurrent inference+training—complicates scheduling.

  • Elastic scheduling (ELSA): Assigns jobs to partitions by predicted waiting time and SLA “slack,” prioritizing SLO-safe placement and falling back to fastest-finish placement if the SLA cannot be met (Kim et al., 2022).
  • Unified inference-training scheduler (LeMix): Jointly schedules LLM serving and retraining, using prediction models for idle gaps and response times and SLO-aware queue reordering (Li et al., 28 Jul 2025).
  • Multi-SLO scheduling: Explicit per-request SLO objectives (e2e latency, TTFT, TPOT) are integrated into a global “goodput-per-latency” maximization framework (Huang et al., 21 Apr 2025).
  • Augmented LLMs and memory-time scheduling: Inference jobs invoking APIs (e.g., tool-augmented LLMs) are ranked by predicted integral of memory occupancy over time (“memory-time footprint”), with API-handling strategies (Preserve, Discard, Swap) selected per request to minimize collective resource waste (Shahout et al., 23 Oct 2024, Mitzenmacher et al., 10 Mar 2025).
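The memory-time ranking can be sketched as scoring each request by the predicted integral of its KV-cache occupancy over its remaining lifetime and serving smaller footprints first; the prediction fields are assumptions, and the Preserve/Discard/Swap decision logic of the cited works is not reproduced here:

```python
def memory_time_footprint(kv_cache_mb, predicted_remaining_s, predicted_api_wait_s=0.0):
    """Approximate integral of memory occupancy over time: memory held by the
    request multiplied by how long it is expected to hold it, including any
    stall while an external API call is outstanding."""
    return kv_cache_mb * (predicted_remaining_s + predicted_api_wait_s)

def order_queue(requests):
    """Serve requests with the smallest predicted memory-time footprint first."""
    return sorted(
        requests,
        key=lambda r: memory_time_footprint(
            r["kv_cache_mb"], r["remaining_s"], r.get("api_wait_s", 0.0)
        ),
    )

# Example: the short, API-free request r2 is scheduled ahead of the stalled r1.
queue = [
    {"req": "r1", "kv_cache_mb": 300, "remaining_s": 1.2, "api_wait_s": 0.5},
    {"req": "r2", "kv_cache_mb": 80,  "remaining_s": 2.0},
]
print([r["req"] for r in order_queue(queue)])
```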

5. Empirical Results and Comparative Analysis

Quantitative results consistently show substantial gains:

| Technique | Throughput Gain | Latency / SLO Attainment | Energy / Resource Utilization | Key Paper |
|---|---|---|---|---|
| gpu-let + interference-aware partitioning | 2.0–2.1× | SLO violation <0.14% | Near-ideal utilization | (Choi et al., 2021) |
| PARIS+ELSA (reconfigurable GPU) | 1.7× (SLAs) | Up to 17.4× tail-latency reduction | Balanced occupancy | (Kim et al., 2022) |
| Simulated-annealing SLO scheduler | 5× (SLOs met) | 31.6% lower avg latency | Not reported | (Huang et al., 21 Apr 2025) |
| RL-based initialization (Inc-ILP) | 128× faster solver | Same as exact (ILP) | Optimal pipeline assignment | (Yin et al., 2023) |
| RL meta-policy scheduler (ASET) | — | +7–10% success rate | — | (Castellano et al., 2023) |
| Multi-modal index-threshold | — | 55% error reduction | — | (Zhang et al., 11 Aug 2025) |
| Memory-time LLM scheduling | Up to 1.5× | 27–85% lower latency; 4–96% TTFT reduction | — | (Shahout et al., 23 Oct 2024) |
| Interference-aware scheduling | +33% | 80% fewer SLO misses | Modest overhead (<1%) | (Zhao et al., 21 Dec 2025) |
| Edge co-scheduling (MGF) | — | 26–32× lower error; optimal as $r \to \infty$ | — | (Shisher et al., 8 Jan 2025) |

Significance: across heterogeneous platforms and workloads, resource-aware partitioning, interference prediction, and SLO-aware ordering deliver substantial gains in throughput, tail latency, and SLO attainment, reinforcing the case for multi-objective, prediction-driven scheduling.

6. Open Problems and Future Directions

The ML inference scheduling problem remains an active area of research. Ongoing work points toward a confluence of queueing theory, ML-based prediction, real-time optimization, and systems-level integration as the field continues to advance.
