Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
144 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
45 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Rosella: A Self-Driving Distributed Scheduler for Heterogeneous Clusters (2010.15206v3)

Published 28 Oct 2020 in cs.DC and cs.LG

Abstract: Large-scale interactive web services and advanced AI applications make sophisticated decisions in real-time, based on executing a massive amount of computation tasks on thousands of servers. Task schedulers, which often operate in heterogeneous and volatile environments, require high throughput, i.e., scheduling millions of tasks per second, and low latency, i.e., incurring minimal scheduling delays for millisecond-level tasks. Scheduling is further complicated by other users' workloads in a shared system, other background activities, and the diverse hardware configurations inside datacenters. We present Rosella, a new self-driving, distributed approach for task scheduling in heterogeneous clusters. Rosella automatically learns the compute environment and adjusts its scheduling policy in real-time. The solution provides high throughput and low latency simultaneously because it runs in parallel on multiple machines with minimum coordination and only performs simple operations for each scheduling decision. Our learning module monitors total system load and uses the information to dynamically determine optimal estimation strategy for the backends' compute-power. Rosella generalizes power-of-two-choice algorithms to handle heterogeneous workers, reducing the max queue length of O(log n) obtained by prior algorithms to O(log log n). We evaluate Rosella with a variety of workloads on a 32-node AWS cluster. Experimental results show that Rosella significantly reduces task response time, and adapts to environment changes quickly.

Citations (3)

Summary

We haven't generated a summary for this paper yet.