Papers
Topics
Authors
Recent
2000 character limit reached

MaaSO: SLO-aware Orchestration of Heterogeneous Model Instances for MaaS (2509.06362v1)

Published 8 Sep 2025 in cs.DC

Abstract: Model-as-a-Service (MaaS) platforms face diverse Service Level Objective (SLO) requirements stemming from various LLM applications, manifested in contextual complexity, first-token latency, and between-token latency. On the other hand, an LLM instance, when configured with different parallelism strategies and inference batch sizes, exhibits distinct performance characteristics and can thus be used to serve different SLO requirements. However, current LLM inference systems typically deploy instances of the same model with identical configurations, lacking mechanisms to leverage such heterogeneity. To fill this research gap, we propose MaaSO, the first MaaS Orchestrator, which comprises three modules: (1) a profiler characterizing instance performance under diverse parallelism strategies and inference batch sizes; (2) a placer optimizing heterogeneous instance configurations; (3) a distributor enabling SLO-aware request distribution and preventing cascaded timeouts in continuous batching. Experiments show that MaaSO improves the SLO satisfaction ratio by 15 to 30% and reduces response latency by 40 to 60% compared to existing approaches, and significantly lowers overall orchestration overhead.

Summary

We haven't generated a summary for this paper yet.

Whiteboard

Paper to Video (Beta)

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.