SLO-Aware Scheduling for Large Language Model Inferences (2504.14966v2)
Abstract: LLMs have revolutionized applications such as code completion, chatbots, and online classification. Service level objectives (SLOs) serve as crucial benchmarks for assessing the capabilities of inference services and, ultimately, user experience. In practice, an inference service processes multiple types of tasks, each with its own distinct SLO, so to ensure a satisfactory user experience each request's SLO should be considered during scheduling. However, existing designs lack this consideration, leading to poor hardware utilization and suboptimal performance. This paper analyzes scenarios in which tasks with varying SLOs are processed together, and introduces a simulated annealing-based scheduler that decides the request priority sequence based on each request's SLO, input length, and predicted output length. As the first specialized scheduler for multi-SLO scenarios, this work improves SLO attainment by up to 5x and reduces average latency by 31.6% on the Python-Code-23k-ShareGPT and ShareGPT_Vicuna_unfiltered datasets, compared to the state-of-the-art framework vLLM and the newer framework LMDeploy.
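To make the scheduling idea concrete, below is a minimal sketch of simulated annealing over request priority orders. It is not the paper's implementation: the `Request` dataclass, the linear prefill/decode cost model, its per-token constants, and the `slo_violations` objective are all illustrative assumptions standing in for whatever cost model and objective the authors actually use.

```python
import math
import random
from dataclasses import dataclass

@dataclass
class Request:
    slo_ms: float        # per-request latency SLO (assumed deadline budget)
    input_len: int       # prompt length in tokens
    est_output_len: int  # predicted output length in tokens

# Hypothetical linear cost model: serving time scales with the
# prompt (prefill) and the predicted output (decode) lengths.
PREFILL_MS_PER_TOK = 0.05
DECODE_MS_PER_TOK = 2.0

def service_time(r: Request) -> float:
    return r.input_len * PREFILL_MS_PER_TOK + r.est_output_len * DECODE_MS_PER_TOK

def slo_violations(order: list[Request]) -> int:
    """Count requests whose completion time exceeds their SLO
    when the batch is served sequentially in the given order."""
    clock, violations = 0.0, 0
    for r in order:
        clock += service_time(r)
        if clock > r.slo_ms:
            violations += 1
    return violations

def anneal_schedule(requests: list[Request],
                    t0: float = 10.0, cooling: float = 0.995,
                    steps: int = 5000) -> list[Request]:
    """Simulated annealing over priority orders: propose a random
    swap, always accept improvements, and accept regressions with
    a temperature-dependent probability that decays over time."""
    order = list(requests)
    cost = slo_violations(order)
    temp = t0
    for _ in range(steps):
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        new_cost = slo_violations(order)
        delta = new_cost - cost
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            cost = new_cost                            # accept the swap
        else:
            order[i], order[j] = order[j], order[i]    # revert it
        temp *= cooling
    return order
```

A swap-based neighborhood is the simplest choice for permutation search; the paper's scheduler presumably refines both the objective (e.g., weighting SLO attainment against average latency) and the cost model, but the accept/reject structure of simulated annealing is the same.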