MaaSO: SLO-aware Orchestration of Heterogeneous Model Instances for MaaS (2509.06362v1)
Abstract: Model-as-a-Service (MaaS) platforms face diverse Service Level Objective (SLO) requirements stemming from various LLM applications, manifested in contextual complexity, first-token latency, and between-token latency. On the other hand, an LLM instance, when configured with different parallelism strategies and inference batch sizes, exhibits distinct performance characteristics and can thus be used to serve different SLO requirements. However, current LLM inference systems typically deploy instances of the same model with identical configurations, lacking mechanisms to leverage such heterogeneity. To fill this research gap, we propose MaaSO, the first MaaS Orchestrator, which comprises three modules: (1) a profiler characterizing instance performance under diverse parallelism strategies and inference batch sizes; (2) a placer optimizing heterogeneous instance configurations; (3) a distributor enabling SLO-aware request distribution and preventing cascaded timeouts in continuous batching. Experiments show that MaaSO improves the SLO satisfaction ratio by 15 to 30% and reduces response latency by 40 to 60% compared to existing approaches, and significantly lowers overall orchestration overhead.
Sponsor
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.