Multi-SLO-Aware Scheduler

Updated 25 August 2025
  • Multi-SLO-Aware Scheduler is a system that dynamically allocates resources for heterogeneous workloads by considering diverse service level objectives such as latency, throughput, and deadlines.
  • It employs multilevel scheduling and adaptive policy mechanisms to reduce scheduler-induced latency and improve utilization, particularly for short-duration tasks.
  • The architecture integrates probabilistic bin packing, real-time monitoring, and hierarchical scheduling to balance resource efficiency with strict SLO compliance in multi-tenant environments.

A Multi-SLO-Aware Scheduler is an advanced scheduling system designed to optimize the allocation and execution of computational workloads in environments where tasks exhibit heterogeneous service level objectives (SLOs), such as latency, throughput, and deadline requirements. The need for such schedulers has grown as workloads have diversified from monolithic, synchronously parallel applications to independently parallel, short-duration tasks and complex multi-tenant services. The fundamental challenge lies in maximizing infrastructure utilization while preserving each workload's quality of service, given that scheduler overheads and resource contention disproportionately affect short jobs and SLO-sensitive applications.

1. Workload Diversity and SLO Heterogeneity

Modern computing infrastructures face an evolving mix of workloads:

  • Traditional HPC workloads: These are long-running, synchronously parallel simulations managed by schedulers such as Slurm and Son of Grid Engine. They typically benefit from robust queueing, resource allocation, job array and dependency mechanisms, and backfilling algorithms to improve utilization.
  • High Performance Data Analysis (HPDA) and Big Data workloads: These are composed of short-duration, independently parallel jobs such as map-reduce or analytics tasks. Schedulers like Mesos and YARN were created for rapid launch, dynamic resource sharing, and data affinity—often supporting two-level scheduling architectures (central resource manager plus multiple framework-level schedulers).

In multi-tenant or cloud scenarios, requests from different applications may have widely varying SLOs: some require stringent latency guarantees (interactive services), others depend on throughput (batch analytics), and yet others may have collective deadlines (agentic/inference workflows). This diversity necessitates explicitly SLO-aware scheduling policies, rather than simple fairness or best-effort approaches.

2. Scheduler Latency, Utilization, and Multilevel Scheduling

Scheduler-induced latency—including submission, queue management, resource allocation, dispatch, and teardown—plays a crucial role in the effective utilization $U$ of a system, especially for short tasks. The utilization model is

$$U = \frac{T_{job}}{T_{total}}$$

where $T_{job}$ is the aggregate user task time (e.g., $t \cdot n$ for $n$ tasks of duration $t$), and $T_{total}$ additionally incorporates scheduler overhead. Benchmarking experiments reveal that for sub-second jobs, utilization can drop sharply due to cumulative scheduling latency—even with efficient schedulers such as Slurm or Mesos. For Big Data schedulers (e.g., YARN), overhead tends to be higher, exacerbating the problem.

Multilevel scheduling is a key mitigation. Instead of submitting each short task independently, multiple tasks are bundled into a single job launch (e.g., via tools like LLMapReduce). Bundling amortizes scheduler overhead:

$$N_{effective} = N_{total} / B$$

with $B$ as the bundle size. Experiments show utilization improvements to ~90% across all task durations when multilevel scheduling is applied, demonstrating its effectiveness for short-running and HPDA workloads.
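To make the overhead model concrete, here is a minimal sketch, assuming a fixed per-launch scheduler overhead that is serialized with the work (the specific numbers are illustrative, not measurements from the cited benchmarks):

```python
import math

def utilization(n_tasks: int, task_duration_s: float,
                overhead_per_launch_s: float, bundle_size: int = 1) -> float:
    """Model effective utilization U = T_job / T_total.

    T_job is aggregate useful work (n * t); T_total adds one fixed
    scheduler overhead per launch. Bundling B tasks per launch reduces
    the number of launches to ceil(n / B), amortizing that overhead.
    """
    t_job = n_tasks * task_duration_s
    n_launches = math.ceil(n_tasks / bundle_size)
    t_total = t_job + n_launches * overhead_per_launch_s
    return t_job / t_total

# Hypothetical numbers: 10,000 sub-second tasks, 2 s overhead per launch.
print(utilization(10_000, 0.5, 2.0))                   # ~0.20: overhead dominates
print(utilization(10_000, 0.5, 2.0, bundle_size=100))  # ~0.96: overhead amortized
```

Even this crude model reproduces the qualitative result: amortizing launch overhead over bundles of short tasks pushes $U$ toward the ~90% regime reported for multilevel scheduling.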

3. Multi-SLO Recognition and Policy Adaptation

A Multi-SLO-Aware Scheduler must dynamically distinguish workloads with diverse SLOs and adjust scheduling policies accordingly. For latency-critical requests, the scheduler may apply multilevel scheduling and prioritize fast queue turnaround, while for throughput-oriented or batch jobs, it may employ backfilling, longer timeslices, or low-priority batching.

Key capabilities include:

  • Real-time monitoring of utilization $U$ per workload class, enabling adaptive bundle sizing or queue management if $U$ falls below a threshold.
  • Modular architecture with a high-level scheduler managing resource pools abstracted by SLO (borrowed from Mesos’s two-level scheduling), and lower-level domain schedulers optimizing allocation within each SLO class.
  • Context-aware placement and resource partitioning, ensuring that short tasks with strict SLOs are not delayed by low-priority work or scheduler bottlenecks.

Such design principles allow the scheduler to flexibly respond to varying demand profiles and maintain high utility across mixed workloads.
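As a deliberately simplified illustration of the adaptive capability described above, the following sketch resizes the bundle for a workload class based on its measured utilization; the class names, thresholds, and doubling/halving rule are assumptions for the example, not a prescribed policy:

```python
from dataclasses import dataclass

@dataclass
class SLOClass:
    name: str            # e.g., "latency-critical" or "batch-throughput"
    bundle_size: int     # tasks launched per scheduler invocation
    max_bundle: int      # cap so latency-critical work is not over-batched

def adapt_bundle_size(slo_class: SLOClass, measured_utilization: float,
                      target_utilization: float = 0.9) -> None:
    """Grow the bundle when scheduler overhead is eroding utilization;
    shrink it when utilization is comfortably above target, keeping
    queue turnaround fast for latency-sensitive requests."""
    if measured_utilization < target_utilization:
        slo_class.bundle_size = min(slo_class.bundle_size * 2,
                                    slo_class.max_bundle)
    elif measured_utilization > target_utilization + 0.05 and slo_class.bundle_size > 1:
        slo_class.bundle_size //= 2

interactive = SLOClass("latency-critical", bundle_size=1, max_bundle=8)
batch = SLOClass("batch-throughput", bundle_size=16, max_bundle=1024)

adapt_bundle_size(interactive, measured_utilization=0.70)  # bundle_size -> 2
adapt_bundle_size(batch, measured_utilization=0.60)        # bundle_size -> 32
```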

4. Algorithms and Architectural Patterns

State-of-the-art schedulers increasingly use formal modeling and probabilistic methods to drive their multi-SLO decisions.

  • Resource-aware packing algorithms, e.g., the Gaussian Percentile Approximation (GPA) (Janus et al., 2017): For colocating many heterogeneous tasks, the scheduler models the summed instantaneous CPU usage as a Gaussian random variable with mean and variance derived from empirical trace data. SLO compliance is checked probabilistically by ensuring:

$$1 - \Phi(c; \mu', \sigma'^2) < \rho$$

where $c$ is machine capacity, $\Phi$ is the cumulative distribution function of the fitted Gaussian with mean $\mu'$ and variance $\sigma'^2$, and $\rho$ is the SLO-specified overflow probability; a minimal sketch of this check appears after the list below.

  • Dynamic bin packing and adaptive bundling: For environments with multiple SLOs, such probabilistic checks replace simple thresholding, and rebalancing steps can further improve SLO adherence at lower resource costs compared to extreme conservative schemes.
  • Two-level and multi-domain scheduling: Schedulers such as Mesos permit high-level partitioning of cluster resources among heterogeneous frameworks, each possibly tuned for distinct SLO criteria.
  • Monitoring and feedback loops: Utilization and SLO compliance are monitored in real time, triggering policy adaptations (e.g., increasing bundle size, shifting pool boundaries, or changing backfill strategies).
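The probabilistic admission check at the heart of GPA can be sketched in a few lines. This is an illustrative reconstruction from the formula above (treating task statistics as independent, so means and variances add), not the authors' implementation:

```python
from math import sqrt
from statistics import NormalDist

def fits_under_slo(task_means, task_variances, capacity: float,
                   overflow_prob: float) -> bool:
    """Approximate the sum of tasks' instantaneous CPU usage as a
    Gaussian (independent tasks: means and variances add), then
    require P[usage > capacity] < rho, i.e. 1 - Phi(c; mu', sigma'^2) < rho.
    """
    mu = sum(task_means)
    sigma = sqrt(sum(task_variances))
    if sigma == 0:
        return mu <= capacity
    tail = 1.0 - NormalDist(mu, sigma).cdf(capacity)
    return tail < overflow_prob

# Hypothetical trace-derived statistics for tasks already on a machine
# plus one candidate task, against a 64-core machine and rho = 0.01.
means = [10.0, 12.5, 8.0, 15.0]
variances = [4.0, 6.0, 2.5, 9.0]
print(fits_under_slo(means, variances, capacity=64.0, overflow_prob=0.01))  # True
```

A placement pass can then greedily assign each incoming task to the first machine for which the check holds, with periodic rebalancing as described above.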

5. Performance Considerations and Practical Implications

Multi-SLO-Aware Schedulers must balance several conflicting objectives:

  • Utilization vs SLO attainment: Maximizing utilization ($U$) often requires aggressive bundling or similar overhead-amortizing optimizations, but may risk violating tight SLOs for latency-critical jobs if not carefully managed; the sketch after this list quantifies the tension.
  • Scalability: Architectures leveraging distributed or hierarchical scheduling maintain scalability even in large clusters. Multilevel scheduling and bin partitioning reduce the per-scheduler invocation count, lowering overhead and improving throughput.
  • Resource efficiency: Algorithms such as GPA achieve SLO adherence with fewer machines compared to conservative admission control, thanks to accurate stochastic modeling. Benchmarking reveals that for realistic traces, multilevel scheduling and probabilistic bin packing dramatically reduce SLO violation rates.
  • Applicability to real-world systems: Recommendations based on empirical studies include continuous updating of task statistics, applying bundle-based launches for short jobs, enhancing bin-packing (e.g., via GPA), and extending algorithms to multidimensional resource constraints.
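To quantify the utilization-versus-latency tension noted in the first bullet, the following sketch reuses the serial-overhead model from Section 2 (same assumptions: fixed per-launch overhead, tasks within a bundle executed sequentially; the numbers are illustrative):

```python
def bundle_tradeoff(task_duration_s: float, overhead_s: float, bundle_size: int):
    """Return (utilization, worst_case_extra_wait_s) for one bundle.

    Utilization improves because one launch overhead is shared by B tasks,
    but the last task in a sequentially executed bundle waits behind the
    (B - 1) tasks ahead of it, which can violate a tight latency SLO.
    """
    useful = bundle_size * task_duration_s
    utilization = useful / (useful + overhead_s)
    worst_extra_wait = (bundle_size - 1) * task_duration_s
    return utilization, worst_extra_wait

for b in (1, 10, 100):
    u, wait = bundle_tradeoff(task_duration_s=0.5, overhead_s=2.0, bundle_size=b)
    print(f"B={b:4d}  U={u:.2f}  worst extra wait={wait:.1f}s")
# B=1: U=0.20, wait 0.0s; B=100: U=0.96, wait 49.5s: fine for batch,
# unacceptable under a sub-second latency SLO.
```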

6. Evolution, Extensions, and Research Directions

Recent extensions of Multi-SLO-Aware Scheduling address broader system goals:

  • Domain autonomy and privacy: DSSP (Sun et al., 2023) demonstrates that distributed query scheduling and privacy-preserving budget decomposition can enable competitive marketplaces (multiple sellers/buyers) without centralizing internal resource data.
  • Edge computing and heterogeneous environments: MultiTASC (Nikolaidis et al., 2023) and BCEdge (Zhang et al., 2023) develop adaptive, capacity-aware and learning-based scheduling for cascaded DNN inference and resource-constrained edge platforms. Such systems incorporate multi-tenancy, device heterogeneity, dynamic batching, and real-time feedback mechanisms, further broadening the application of multi-SLO-aware design.
  • Environmental and sustainability objectives: The SFCM framework (Qi et al., 2024) introduces co-optimization of SLO, carbon footprint, and wastewater generation by formulating multi-objective scheduling and scaling as a constrained weighted-sum optimization, navigated with hybrid local and evolutionary search, as sketched below.
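The constrained weighted-sum formulation underlying such multi-objective schedulers can be written schematically as (generic notation, not the exact symbols of the SFCM paper):

$$\min_{x \in \mathcal{X}} \; w_1 f_{SLO}(x) + w_2 f_{carbon}(x) + w_3 f_{water}(x) \quad \text{s.t.} \quad g_j(x) \le 0,\; j = 1, \dots, m$$

where $x$ ranges over joint scheduling and scaling decisions, the weights $w_i$ encode operator priorities among the objectives, and the constraints $g_j$ capture capacity limits and SLO floors; the hybrid local/evolutionary search then navigates this landscape.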

A plausible implication is that future scheduler designs will need to integrate multilevel scheduling, probabilistic bin packing, two-level architectural patterns, and hybrid optimization to remain effective as workload and objective heterogeneity continue to increase.

7. Summary Table: SLO-Aware Scheduler Mechanisms

| Architectural Feature | Scheduling Objective | Typical Implementation |
|---|---|---|
| Multilevel Scheduling | Maximize Utilization for Small Jobs | Bundle-based launches, LLMapReduce |
| Probabilistic Packing | SLO Violation Minimization | Gaussian Percentile Approximation (GPA) |
| Two-level (Hierarchical) Scheduling | Resource Pooling for Multiple SLO Tenants | Mesos, multi-domain frameworks |
| Adaptive Policy | Dynamic SLO Compliance | Real-time feedback, bundle resizing |
| Distributed Scheduling | Scalability and Autonomy | DSSP layered architecture |

In conclusion, Multi-SLO-Aware Schedulers represent an advanced, unified approach to resource management in contemporary computing infrastructures. By adapting scheduling logic to workload heterogeneity, leveraging multilevel task aggregation, and employing probabilistic and hierarchical algorithms, they achieve improved utilization and robust SLO fulfillment across diverse and dynamic application domains (Reuther et al., 2016; Janus et al., 2017; Zhang et al., 2023; Nikolaidis et al., 2023; Sun et al., 2023; Qi et al., 2024).
