Multi-SLO-Aware Scheduling Techniques
- Multi-SLO-aware scheduling is the coordinated allocation of resources in distributed environments to satisfy multiple quantitative service-level objectives simultaneously.
- It employs techniques such as SLO-based prioritization, bin-packing, and metaheuristic search to optimize overall utility and meet strict performance targets.
- Practical implementations across serverless, LLM, and HPC domains demonstrate significant reductions in SLO violations and enhanced resource efficiency.
Multi-SLO-aware scheduling is the real-time orchestration of resources and task assignments in distributed or parallel systems to simultaneously satisfy diverse, formally specified service-level objectives (SLOs) for multiple users, tenants, or applications. SLOs are typically expressed as quantitative constraints on metrics such as tail-latency percentiles, time-to-first-token (TTFT), time-per-output-token (TPOT), energy usage, or domain-specific deadlines. Modern multi-tenant serverless platforms, LLM serving systems, and high-performance computing environments require these techniques to guarantee differentiated quality of service and optimize aggregate resource efficiency under high load and variable workload mixes.
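To make the contract concrete, the sketch below encodes such a per-request SLO specification in Python; the field names (`ttft_ms`, `tpot_ms`, `deadline_ms`, `weight`) are illustrative assumptions, not the schema of any particular system cited here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SLOSpec:
    """Per-request SLO contract; None means no bound of that type."""
    ttft_ms: Optional[float] = None      # time-to-first-token bound
    tpot_ms: Optional[float] = None      # time-per-output-token bound
    deadline_ms: Optional[float] = None  # end-to-end deadline
    weight: float = 1.0                  # relative importance (utility weight)

# An interactive chat request vs. a throughput-oriented batch job.
interactive = SLOSpec(ttft_ms=200, tpot_ms=50, weight=2.0)
batch_job = SLOSpec(deadline_ms=60_000, weight=0.5)
```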
1. Formal Problem Models and SLO Specification
A multi-SLO scheduling problem is generally characterized by:
- Request/task set $\mathcal{R} = \{r_1, \dots, r_n\}$, where each $r_i$ may specify different types and levels of SLOs.
- SLOs can take forms such as:
- Tail-percentile deadlines: at least a fraction $p$ of function requests must finish within deadline $d$ (Yu et al., 2023).
- Per-request TTFT bound $t_i^{\mathrm{TTFT}}$, TPOT bound $t_i^{\mathrm{TPOT}}$, end-to-end deadline $d_i$, and utility weight $w_i$ for LLMs and real-time/priority inference (Chen et al., 5 Apr 2025, Zhou et al., 21 Oct 2025, Zhu et al., 17 Jul 2025).
- Composite or environmental constraints, such as joint minimization of SLO violations and carbon emissions (Qi et al., 31 Aug 2024, Qi et al., 9 Oct 2024).
The canonical objective is to maximize aggregate utility or service gain, frequently through a utility function $U = \sum_i w_i x_i$, where $x_i = 1$ iff request $r_i$'s SLO(s) are satisfied (and $x_i = 0$ otherwise), and $w_i$ encodes relative importance (Zhou et al., 21 Oct 2025). Constraints span resource capacities, admission or queuing strategies, batch or bin assignment policy, and per-iteration or per-phase SLO feasibility.
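A minimal sketch of this objective, assuming requests are represented as (measured-metrics, declared-bounds, weight) triples; the metric keys are illustrative:

```python
def slo_satisfied(measured: dict, bounds: dict) -> bool:
    """x_i = 1 iff every declared bound is met by the measured metric."""
    return all(measured.get(metric) is not None and measured[metric] <= limit
               for metric, limit in bounds.items() if limit is not None)

def aggregate_utility(requests: list) -> float:
    """U = sum of weights w_i of requests whose SLOs were all satisfied."""
    return sum(weight for measured, bounds, weight in requests
               if slo_satisfied(measured, bounds))

# One interactive request that met its bounds, one that missed its TTFT.
reqs = [
    ({"ttft_ms": 150, "tpot_ms": 40}, {"ttft_ms": 200, "tpot_ms": 50}, 2.0),
    ({"ttft_ms": 300}, {"ttft_ms": 200}, 1.0),
]
print(aggregate_utility(reqs))  # -> 2.0
```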
2. Scheduling Architectures and Algorithmic Patterns
Multi-SLO-aware schedulers employ diverse system architectures and algorithmic primitives:
- SLO-based prioritization: Functions or requests are ordered using online metrics that quantify SLO urgency, such as the Required-Request-Count (RRC) in FaaSwap, which dynamically tracks how many future requests must complete within the deadline for a function to still achieve its tail-percentile SLO (a minimal sketch follows this list). Requests are prioritized in queues and grouped for service admission using two- or multi-class stratification (Yu et al., 2023).
- Bin-packing/resource-aware placement: Multi-dimensional bin-packing heuristics assign each request to a worker or device so as to respect multi-SLO and multi-resource constraints (KV cache, GPU compute, etc.), with stages such as prefill and decode often co-modeled via latency predictors (Nie et al., 11 May 2024, Shen et al., 10 Nov 2024, Chen et al., 5 Apr 2025).
- Search/metaheuristics: Simulated annealing, dynamic programming, and greedy or hybrid algorithms are applied to derive batch orderings or token allocations that maximize the number of SLO-compliant requests or utility under feasibility constraints (Huang et al., 21 Apr 2025, Chen et al., 5 Apr 2025).
- Service-gain maximization: Schedulers like Tempo and SLICE utilize penalty or degradation functions linking SLO adherence to actual observable performance and utility, then iteratively re-prioritize or adjust request rates to keep SLO miss probability low (Zhang et al., 24 Apr 2025, Zhou et al., 21 Oct 2025).
- Interference-awareness and pipelined resource control: Schedulers integrate interference models (e.g., NVLink/PCIe concurrency or cache pipeline contention) and perform swap/placement or eviction policies to avoid cross-SLO interference (Yu et al., 2023, Shen et al., 10 Nov 2024).
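As referenced in the SLO-based prioritization item above, here is a hedged sketch of an RRC-style urgency score with two-class stratification. This is a plausible reconstruction from the prose description; the exact RRC formula in FaaSwap (Yu et al., 2023) may differ.

```python
import math

def required_request_count(p: float, n_seen: int, n_on_time: int,
                           window: int) -> int:
    """How many of the next `window` requests must meet the deadline so
    that at least a fraction p of all requests (seen plus window) are
    on time. Illustrative assumption, not FaaSwap's exact definition."""
    needed = math.ceil(p * (n_seen + window)) - n_on_time
    # Clamp: never negative, and never more than the window itself.
    return max(0, min(window, needed))

def classify(rrc: int, window: int) -> str:
    """Two-class stratification: no remaining slack -> urgent queue."""
    return "urgent" if rrc >= window else "best_effort"

# Example: 95th-percentile deadline SLO, 100 requests seen, 93 on time.
rrc = required_request_count(p=0.95, n_seen=100, n_on_time=93, window=10)
print(rrc, classify(rrc, window=10))  # -> 10 urgent
```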
3. Multi-SLO Coordination and Stage-specific Enforcement
Distinct phases of inference or serving, such as prefill and decode in LLM systems, often require coordinated SLO enforcement:
- Stage-level SLO mapping: Systems like Drift and SLOs-Serve associate explicit per-stage SLOs (e.g., TTFT and TPOT) with batched operations, translating aggregate deadlines into token-, chunk-, or block-level partitioning (Cui et al., 20 Apr 2025, Chen et al., 5 Apr 2025); a budget-splitting sketch follows this list.
- Hardware partitioning: Drift leverages low-level GPU SM partitioning (PD-multiplexing) to run prefill (TTFT-bound) and decode (TBT-bound) phases concurrently, dynamically optimizing per-phase resource splits to jointly satisfy SLO targets (Cui et al., 20 Apr 2025).
- Multi-tier binning: SLO tiering and binning create multi-priority or SLO classes (e.g., PolyServe's $S$ SLO bins), allowing scalable separation and routing of requests to the most appropriate instance or resource pool and enabling greedy auto-scaling and instance sharing among compatible tiers (Zhu et al., 17 Jul 2025).
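The budget-splitting sketch referenced in the stage-level SLO mapping item above: a hedged illustration of translating an end-to-end deadline into a TTFT budget plus a per-token decode (TPOT) budget. Real systems such as Drift and SLOs-Serve operate at finer (chunk- or block-level) granularity; the fixed prefill reservation here is an assumption.

```python
def stage_budgets(deadline_ms: float, expected_output_tokens: int,
                  ttft_budget_ms: float) -> tuple[float, float]:
    """Split an end-to-end deadline into (TTFT budget, per-token TPOT
    budget). The decode phase gets whatever remains after prefill."""
    decode_budget_ms = deadline_ms - ttft_budget_ms
    if decode_budget_ms <= 0 or expected_output_tokens <= 0:
        raise ValueError("deadline too tight for the requested output")
    return ttft_budget_ms, decode_budget_ms / expected_output_tokens

# Example: 5 s deadline, 256 expected tokens, 400 ms reserved for prefill.
ttft, tpot = stage_budgets(5000, 256, 400)
print(f"TTFT budget {ttft} ms, TPOT budget {tpot:.1f} ms/token")  # ~18.0
```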
4. Practical Implementations and System Integration
Modern systems integrate multi-SLO-aware scheduling into diverse environments:
- Serverless and containerized inference: FaaSwap implements late-binding model swapping and interference-aware decisions; HarmonyBatch minimizes serverless provisioning cost by grouping requests across SLOs and resources, supporting both CPU and cGPU, and enforcing SLO-based timeouts (Yu et al., 2023, Chen et al., 9 May 2024).
- Sustainability and dual-objective frameworks: CASA and sustainability-aware FaaS augment SLO scheduling with operational carbon and water usage constraints, utilizing hybrid evolutionary search and dual-objective local search to find Pareto-efficient tradeoffs (Qi et al., 31 Aug 2024, Qi et al., 9 Oct 2024).
- Edge and resource-constrained scenarios: SLICE embodies utility-maximizing SLO assignment with dynamic per-request generation rate control, supporting tight TTFT and TPOT requirements for LLM inference on edge devices (Zhou et al., 21 Oct 2025).
- HPC systems and batch schedulers: Extensions to SLURM allow joint malleable job reconfiguration subject to both performance- and power-based SLOs, solving at each tick for trajectories minimizing makespan, response time, and dynamic power-corridor violation (Chadha et al., 2020).
- LLM and diffusion serving: Systems like PATCHEDSERVE and EcoSERVE utilize fine-grained batching (patch or KV cache) and urgency-based slack scoring to admit requests or tasks in an SLO-feasible manner, improving satisfaction ratio under contention (Sun et al., 16 Jan 2025, Shen et al., 10 Nov 2024).
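In the spirit of the urgency-based slack scoring just mentioned for EcoSERVE and PATCHEDSERVE, the following hedged sketch admits a task only if its predicted remaining service time still fits before its deadline; the latency predictor and safety margin are stand-in assumptions, not those systems' actual scoring functions.

```python
import time

def slack_ms(deadline_ts: float, predicted_remaining_ms: float) -> float:
    """Slack = time left until the absolute deadline minus the latency a
    (hypothetical) predictor expects the remaining work to take."""
    return (deadline_ts - time.time()) * 1000.0 - predicted_remaining_ms

def admit(deadline_ts: float, predicted_remaining_ms: float,
          safety_margin_ms: float = 10.0) -> bool:
    """Admit only if the task is still SLO-feasible with some margin."""
    return slack_ms(deadline_ts, predicted_remaining_ms) >= safety_margin_ms

# Example: a request due 2 s from now, predicted to need 1.5 s of service.
print(admit(time.time() + 2.0, predicted_remaining_ms=1500.0))  # True
```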
5. Complexity, Scalability, and Evaluation
Most multi-SLO scheduling problems are NP-hard even in simplified forms (e.g., knapsack or bin-packing generalizations) (Zhou et al., 21 Oct 2025). Key strategies for scalability include:
- Greedy or priority-heuristic orderings that keep admission and per-iteration placement computationally cheap (Nie et al., 11 May 2024, Zhu et al., 17 Jul 2025).
- Divide-and-conquer (multi-bin, staged, or two-pass merging): HarmonyBatch achieves near-optimal groupings and batchings with modest per-group search time (Chen et al., 9 May 2024).
- GPU-accelerated multi-criteria decision: For large data centers, GPU-parallelized AHP/TOPSIS can deliver near-real-time scheduling decisions for thousands of containers and QoS-linked links (Rodrigues et al., 2019).
- Dynamic feedback and online adaptation: Control parameters (e.g., high/low watermark for scale-up, slack for SLO miss prediction) are constantly updated in frameworks such as FaaSwap, PolyServe, and Tempo to respond to bursty and heterogeneous workloads (Yu et al., 2023, Zhu et al., 17 Jul 2025, Zhang et al., 24 Apr 2025).
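A hedged sketch of the watermark-based feedback loop just described; the thresholds and single-step scaling policy are illustrative, not the exact controllers used by FaaSwap, PolyServe, or Tempo.

```python
def scale_decision(slo_attainment: float, instances: int,
                   low_wm: float = 0.90, high_wm: float = 0.99,
                   min_instances: int = 1) -> int:
    """Watermark controller: add an instance when measured SLO
    attainment drops below the low watermark, reclaim one when it sits
    above the high watermark, otherwise hold steady."""
    if slo_attainment < low_wm:
        return instances + 1
    if slo_attainment > high_wm and instances > min_instances:
        return instances - 1
    return instances

# Example: attainment dipped to 88% with 4 instances -> scale to 5.
print(scale_decision(0.88, 4))  # 5
```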
Empirical evaluations across systems demonstrate substantial reductions in SLO violation rates over naive or baseline schemes, sustained SLO attainment at much higher load, 30–82% cost savings, and goodput within a few percent of theoretical or oracle-optimal policies (Yu et al., 2023, Nie et al., 11 May 2024, Zhu et al., 17 Jul 2025, Zhang et al., 24 Apr 2025, Chen et al., 9 May 2024, Huang et al., 21 Apr 2025).
6. Limitations, Practical Challenges, and Extensions
Common challenges include accurate online prediction of per-request requirements (e.g., output length), efficient scaling across heterogeneous resources, and balancing complicated trade-offs (e.g., carbon vs. SLO vs. cost). Limitations of current approaches:
- Degraded performance with poor response length prediction (Shen et al., 10 Nov 2024).
- Conservative admission in the face of hardware heterogeneity or bursty, unpredictable arrivals (Zhu et al., 17 Jul 2025, Nie et al., 11 May 2024).
- Most frameworks assume static profiling tables and may underperform with hardware or model upgrades (Zhu et al., 17 Jul 2025).
- Cluster-scope admission and scaling decisions are often batched over short epochs and may not respond instantly to microburst changes (Qi et al., 31 Aug 2024, Qi et al., 9 Oct 2024).
- Some frameworks, e.g., SLOs-Serve, assume discrete SLO tiering; extension to continuous or arbitrary per-request SLO contracts is an open research challenge (Chen et al., 5 Apr 2025).
Opportunities for further work include extending multi-SLO-aware scheduling to cost/energy/fairness objectives, heterogeneous/disaggregated clusters, live pre-emption and speculative execution, and more advanced hybrid online-offline optimization.
Table: Representative Multi-SLO Scheduling Techniques and Domains
| Technique/Framework | Application Domain | Scheduling Principle |
|---|---|---|
| FaaSwap (Yu et al., 2023) | Serverless Inference | RRC-based multi-class prioritization |
| Aladdin (Nie et al., 11 May 2024) | LLM Cluster Serving | MIP + bin-packing w/ latency SLOs |
| CASA (Qi et al., 31 Aug 2024) | Serverless Autoscaling | Local search, carbon/SLO dual-optimization |
| PATCHEDSERVE (Sun et al., 16 Jan 2025) | Diffusion Inference | Slack-based patch scheduling |
| SLICE (Zhou et al., 21 Oct 2025) | Edge LLM Inference | Utility-maximizing rate control |
| PolyServe (Zhu et al., 17 Jul 2025) | Multi-SLO LLM Serving | Multi-bin queueing + SLO-aware routing |
| Drift (Cui et al., 20 Apr 2025) | GPU LLM Serving | Resource partitioning (PD-multiplexing) |
| EcoSERVE (Shen et al., 10 Nov 2024) | LLM Serving | Decoupled batching + KVC pipelining |
| SLOs-Serve (Chen et al., 5 Apr 2025) | LLM Multi-Stage | Dynamic-programming token allocation |
| Network-aware MILP (Rodrigues et al., 2019) | Container DC | MILP with bandwidth+placement SLOs |
Multi-SLO-aware scheduling is now a core challenge in the design of highly effective, scalable serving systems for modern ML/AI, serverless, and HPC workloads. State-of-the-art research demonstrates the value of combining mathematical models, SLO- and interference-aware prioritization, resource-aware bin-packing, and adaptive scaling with practical, empirically validated heuristics. These frameworks support differentiated user SLAs, near-optimal resource efficiency, and robust operation in heterogeneous and multi-tenant environments.