
Maestro: Workload-Aware Cross-Cluster Scheduling for LLM-Based Multi-Agent Systems

Published 11 Jun 2026 in cs.DC | (2606.12950v1)

Abstract: LLM-based Multi-Agent Systems (LLM-MAS) have emerged as a powerful paradigm for tackling complex tasks by breaking them into collaborative workflows of specialized LLM-powered agents. However, deploying such multi-agent workloads at scale poses significant system challenges. Each user query spawns an iterative pipeline of LLM calls, greatly amplifying resource consumption compared to single-turn queries. In resource-constrained cloud settings, these workflows face non-deterministic and input-dependent costs at the decode stage, heavy-tailed multi-model requirements with memory fragmentation and over-provisioning, and cross-cluster scheduling trade-offs. We present Maestro, a workload-aware scheduling system designed for LLM-MAS serving under strict GPU budgets. Maestro explicitly leverages agent semantics and roles: it predicts the output length and memory usage of each stage and uses this prediction to drive a hierarchical scheduler. At the node level, Maestro enables dynamic multi-model co-location via hierarchical weight caching and elastic memory provisioning. At the cluster level, it performs latency-aware routing to avoid cold-start delays and memory overloads. At the global level, it enforces workflow-aware prioritization to minimize head-of-line blocking for interactive tasks. Across prototype experiments and trace-driven simulations, Maestro reduces KV-reservation HBM by 67.2% and improves high-contention SLO attainment over EDF by 23.6 percentage points.

Summary

  • The paper introduces Maestro, a scheduling system that leverages workload-aware prediction and agent semantics to optimize multi-agent LLM workflows.
  • It uses a two-phase regressor-classifier to predict output token lengths and KV memory demands, reducing prediction MAE by 19.2%.
  • The system achieves a 23.6 percentage-point improvement in high-contention SLO attainment over EDF and a 67.2% reduction in KV-reservation HBM through elastic multi-model colocation and dependency-aware routing.

Motivation and Problem Statement

The increasing adoption of LLM-based Multi-Agent Systems (LLM-MAS) enables decomposition of complex tasks into interdependent agent workflows, each involving specialized LLM calls. Unlike single-turn LLM serving, LLM-MAS workloads exhibit complex, dynamic resource demands due to iterative pipelines, input-dependent output lengths, and heterogeneous agent roles. The result is a dramatic amplification of compute and memory pressure, especially under strict or heterogeneous GPU budgets typical of private cloud and on-premises deployments. Major challenges impeding efficient, scalable LLM-MAS serving include: (1) high variance in execution costs due to input, role, and reasoning complexity; (2) long-tailed, dynamic multi-model invocation patterns leading to memory fragmentation and cold-start penalties; and (3) cross-cluster routing tradeoffs where network, memory, and model readiness interact nontrivially.

Maestro System Architecture

Maestro is designed as a workload-aware, hierarchical, cross-cluster scheduling system, leveraging direct access to agent- and workflow-level semantics to orchestrate LLM-MAS workloads efficiently. Its design integrates cost prediction, memory-safe multi-model colocation, elastic key-value (KV) cache management, and latency/resource-aware scheduling at node, cluster, and global levels.

Figure 2: The system architecture and workflow of Maestro.

Cost Prediction and Agent Context Modeling

A central innovation of Maestro is semantic and structural modeling of each workflow stage. At dispatch, Maestro extracts agent roles, workflow positions, tool intent, and prompt embeddings (via a MiniLM-style BERT encoder), generating a compact descriptor. It then predicts the output token length $\hat{L}(T)$, the KV footprint, and the tool-use probability using a calibrated two-phase regressor-classifier architecture.

Figure 4: Agent-aware output-length prediction architecture in Maestro.

This approach differentiates between short-structured (tool-based) completions and open-ended Chain-of-Thought (CoT) generative steps, reflecting the actual output token distributions observed in LLM-MAS traces.

Figure 5: Output-token length distributions under non-CoT and CoT settings, tool-call and non-tool-call.

Accurate cost prediction directly underpins memory budgeting, admission decisions, and the minimization of head-of-line (HoL) blocking. Maestro reduces output-length MAE by 19.2% relative to a strong embedding-based regressor baseline.
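The two-phase idea can be illustrated with a minimal Python sketch. It assumes the classifier's calibrated tool-call probability is used to blend a short-output and a long-output regressor; the summary does not specify how Maestro combines the two phases, so the blending rule, the class names (`StageDescriptor`, `TwoPhasePredictor`), and the stub models are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class StageDescriptor:
    """Compact per-stage descriptor (field names are illustrative)."""
    role: str                    # agent role, e.g. "planner"
    position: int                # index of the stage within its workflow
    embedding: Sequence[float]   # prompt embedding from a sentence encoder


@dataclass
class TwoPhasePredictor:
    """Phase 1 classifies tool intent; phase 2 regresses output length."""
    tool_classifier: Callable[[StageDescriptor], float]  # returns P(tool call)
    short_regressor: Callable[[StageDescriptor], float]  # tokens for tool-style output
    long_regressor: Callable[[StageDescriptor], float]   # tokens for open-ended output

    def predict(self, d: StageDescriptor) -> Tuple[float, float]:
        # Clamp the probability so a badly behaved classifier cannot
        # produce a negative or inflated length estimate.
        p_tool = min(1.0, max(0.0, self.tool_classifier(d)))
        # Assumption: take the expectation over the two output regimes.
        expected_len = (p_tool * self.short_regressor(d)
                        + (1.0 - p_tool) * self.long_regressor(d))
        return p_tool, expected_len


# Stub models standing in for trained components.
predictor = TwoPhasePredictor(
    tool_classifier=lambda d: 0.75,
    short_regressor=lambda d: 40.0,
    long_regressor=lambda d: 800.0,
)
stage = StageDescriptor(role="planner", position=0, embedding=[0.1, 0.2])
p_tool, expected_len = predictor.predict(stage)
```

Blending by probability is only one option. A hard switch on a probability threshold is another, and it depends more heavily on the classifier being well calibrated.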

Multi-Model Colocation and Memory Elasticity

Node-level orchestration in Maestro enables safe, dynamic colocation of multiple specialist models per GPU. A hierarchical weight residency mechanism, with least-recently-used (LRU) eviction across GPU, CPU, disk, and remote tiers, amortizes cold-start costs while preserving fast context recovery for frequent model transitions.
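A tiered LRU residency scheme of this kind might look like the following Python sketch. It assumes each model lives in exactly one tier and that an eviction demotes the victim to the next slower tier; the class name `TieredWeightCache` and the per-tier capacities (counted in models rather than bytes) are hypothetical simplifications, not Maestro's actual mechanism.

```python
from collections import OrderedDict

# Fastest to slowest local tiers. "remote" is treated as an unbounded
# backing store, so it needs no bookkeeping.
TIERS = ["gpu", "cpu", "disk"]


class TieredWeightCache:
    def __init__(self, capacity):
        # capacity: number of models each tier can hold,
        # e.g. {"gpu": 2, "cpu": 4, "disk": 8}
        self.capacity = capacity
        self.tiers = {t: OrderedDict() for t in TIERS}

    def locate(self, model):
        """Return the fastest tier that currently holds the model."""
        for t in TIERS:
            if model in self.tiers[t]:
                return t
        return "remote"

    def _insert(self, level, model):
        if level >= len(TIERS):
            return  # demoted past disk: only the remote copy remains
        name = TIERS[level]
        tier = self.tiers[name]
        tier[model] = True
        tier.move_to_end(model)  # mark as most recently used
        if len(tier) > self.capacity[name]:
            victim, _ = tier.popitem(last=False)  # least recently used
            self._insert(level + 1, victim)       # demote one tier down

    def activate(self, model):
        """Promote a model to GPU; return the tier it was loaded from."""
        source = self.locate(model)
        if source != "remote":
            del self.tiers[source][model]
        self._insert(0, model)
        return source


cache = TieredWeightCache({"gpu": 1, "cpu": 1, "disk": 1})
sources = [cache.activate(m) for m in ["a", "b", "c", "a"]]
```

The tier returned by `activate` is what a scheduler would translate into an expected activation delay: a GPU hit is free, while a remote fetch pays the full cold-start cost.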

Memory allocation is controlled via explicit KV cache accounting. Physical GPU memory partitioning is elastic: a stage is admitted only if its predicted total KV demand fits the available budget. Where possible, virtual memory and page-locked structures allow higher concurrency and memory overcommitment while avoiding out-of-memory (OOM) failures.

Figure 1: GPU utilization and memory usage under multi-model colocation.

Overcommitment by up to 205% of physical capacity (enabled by elastic KV allocation and accurate prediction) reduces high-bandwidth memory (HBM) reservation by 67.2% compared to statically partitioned baselines.
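The accounting behind prediction-driven admission can be sketched in Python as follows. The per-token KV size uses the standard formula for transformer attention caches (keys plus values, per layer, per KV head); the `KVLedger` class, its safety margin, and the example model shape are hypothetical and are not taken from the paper.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """KV cache bytes for one token: keys and values in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


class KVLedger:
    """Tracks KV reservations against a fixed per-GPU byte budget."""

    def __init__(self, budget_bytes, safety_margin=1.25):
        self.budget = budget_bytes
        self.margin = safety_margin  # headroom for under-predicted outputs
        self.reserved = {}           # stage_id -> reserved bytes

    def try_admit(self, stage_id, prompt_tokens, predicted_output_tokens,
                  bytes_per_token):
        # Reserve for the prompt plus the padded output prediction,
        # rather than for the model's maximum context length.
        tokens = prompt_tokens + predicted_output_tokens * self.margin
        need = int(tokens * bytes_per_token)
        if sum(self.reserved.values()) + need > self.budget:
            return False  # would exceed the budget: queue or route elsewhere
        self.reserved[stage_id] = need
        return True

    def release(self, stage_id):
        self.reserved.pop(stage_id, None)


# Example shape: 32 layers, 8 KV heads, head dimension 128, fp16.
per_token = kv_bytes_per_token(32, 8, 128)
ledger = KVLedger(budget_bytes=1 << 30)  # 1 GiB KV budget
admitted = [ledger.try_admit(i, 1000, 1000, per_token) for i in range(4)]
```

Reserving by predicted length instead of maximum context is what makes overcommitment possible, and it is only as safe as the predictor plus margin. A real system would also need a fallback, such as preemption or swapping, for stages that overrun their reservation.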

Cross-Cluster Scheduling and Dependency-Aware Prioritization

Maestro’s cluster-level scheduler integrates network latency, model readiness, and predicted memory feasibility into a unified fitness score for routing each agent stage. Only nodes passing flexible policy, network, and feasibility filters are candidates; among these, Maestro balances low latency (for interactive SLOs) against resource availability.

Figure 3: Intra- and inter-region network latency affecting cluster allocation efficiency.
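A filter-then-score routing step of the kind described above could be sketched in Python like this. The scoring formula, the weight `w_headroom`, and the load-penalty table are invented for illustration; the summary does not give Maestro's actual fitness function.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical activation delays (ms) by where the model weights reside.
LOAD_PENALTY_MS = {"gpu": 0.0, "cpu": 500.0, "disk": 3000.0, "remote": 15000.0}


@dataclass
class Node:
    name: str
    rtt_ms: float        # network round-trip time to the node
    model_tier: str      # tier currently holding the required model
    free_kv_bytes: int   # unreserved KV budget on the node


def fitness(node: Node, kv_need: int, w_headroom: float = 100.0) -> Optional[float]:
    """Higher is better; None means the node fails the feasibility filter."""
    if node.free_kv_bytes <= 0 or node.free_kv_bytes < kv_need:
        return None
    latency_ms = node.rtt_ms + LOAD_PENALTY_MS[node.model_tier]
    headroom = (node.free_kv_bytes - kv_need) / node.free_kv_bytes
    return -latency_ms + w_headroom * headroom


def route(nodes: List[Node], kv_need: int) -> Optional[Node]:
    scored = [(fitness(n, kv_need), n) for n in nodes]
    feasible = [(s, n) for s, n in scored if s is not None]
    if not feasible:
        return None  # no node can host the stage right now
    return max(feasible, key=lambda pair: pair[0])[1]


GB = 10**9
nodes = [
    Node("near-cold", rtt_ms=2.0, model_tier="disk", free_kv_bytes=10 * GB),
    Node("far-warm", rtt_ms=40.0, model_tier="gpu", free_kv_bytes=4 * GB),
    Node("near-full", rtt_ms=1.0, model_tier="gpu", free_kv_bytes=1 * GB),
]
chosen = route(nodes, kv_need=2 * GB)
```

In this example the farther node wins because its model is already resident on the GPU, which illustrates why network latency alone is a poor routing signal when cold starts cost seconds.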

Global scheduling leverages workflow-aware Shortest-Remaining-Time-First (SRTF) queueing, driven by rolling workflow profiles. This minimizes HoL blocking across dependent stages, which distinguishes Maestro from naive deadline- or FIFO-based schedulers. Preemption is supported at stage boundaries with explicit degradation-cost accounting, further mitigating contention in mixed interactive/batch settings.
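Workflow-aware SRTF can be sketched in Python with a priority queue keyed on the estimated remaining time of the whole workflow rather than of the single stage. The profile format, the function names, and the preemption rule are illustrative assumptions, not the paper's implementation.

```python
import heapq
import itertools


def remaining_time(profile_s, next_stage_index):
    """Estimated seconds left in a workflow, given per-stage estimates
    (for example, rolling averages) and the next stage to run."""
    return sum(profile_s[next_stage_index:])


class WorkflowSRTFQueue:
    """Pops the stage whose parent workflow has the least work remaining."""

    def __init__(self):
        self._heap = []
        self._tiebreak = itertools.count()  # FIFO order among equal keys

    def push(self, stage_id, remaining_workflow_s):
        heapq.heappush(
            self._heap, (remaining_workflow_s, next(self._tiebreak), stage_id))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def should_preempt(running_remaining_s, candidate_remaining_s, degradation_cost_s):
    """Preempt at a stage boundary only if the candidate still finishes
    sooner after paying the (assumed) cost of disrupting the running job."""
    return candidate_remaining_s + degradation_cost_s < running_remaining_s


queue = WorkflowSRTFQueue()
queue.push("batch-report/stage-1", remaining_time([60.0, 40.0, 20.0], 0))
queue.push("chat-turn/stage-3", remaining_time([1.0, 2.0, 4.0], 2))
queue.push("code-review/stage-2", remaining_time([10.0, 20.0, 10.0], 1))
order = [queue.pop() for _ in range(len(queue))]
```

Keying on the workflow's remaining time lets a nearly finished interactive workflow overtake the early stages of a long batch job, which is how this policy reduces head-of-line blocking.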

Empirical Results

Experimental evaluation uses both trace-driven simulations and deployments on a 32-node A100 cluster. Maestro is benchmarked against FCFS, EDF, Oracle-SRTF, and competing multi-model frameworks (QLM-style, etc.) under various load and SLO regimes.

Figure 7: Overall scheduling results across arrival rates and batch ratios: SLO attainment, mean latency, and interactive queueing delay.

Key results:

  • SLO Attainment: Under high contention and heavy batch ratios, Maestro improves SLO attainment by 23.6 percentage points over EDF. Interactive queueing delay is reduced by over 80%.
  • Resource Utilization: Memory overcommitment enables safe, efficient colocation; Maestro matches exclusive-GPU performance with fewer GPUs and modest additional latency (12.1% for 1/3 the GPU cost).
  • Prediction: Output-length MAE is cut by 19.2% relative to a strong embedding-based regressor baseline; tool-intent classification attains an AUC of 0.9625 with well-calibrated uncertainty (critical for downstream regression).

Figure 6: Prediction overhead (P50/P95) for methods with BERT encoders.

Prediction latency remains below 15 ms (P95) for the two-stage model, negligible compared to network and LLM decode times.

Figure 8: Model activation latency (0.6B--14B) on a single A100 40GB GPU under Maestro’s hierarchical loading.

Theoretical and Practical Implications

The approach in Maestro establishes that explicit integration of agent semantics, workflow structure, and stage-level cost prediction is essential for resource efficiency and SLO compliance in LLM-MAS. This departs from workload-agnostic frameworks, demonstrating that viewing agentic pipelines as non-stationary, dependency-coupled jobs allows for more aggressive memory elasticity and higher GPU utilization.

The explicit disruption- and resource-awareness at the control plane creates a scheduling regime robust to cold starts, interactive/batch mixtures, and non-trivial cluster topologies. The memory accounting and prediction-guided overcommitment strategy offers a practical path to efficient LLM serving on constrained hardware, a level of scalability previously possible only on hyperscaler-style, stateless microservice LLM serving stacks.

Future Directions

Potential directions include extending Maestro to heterogeneous accelerator clusters and integrating speculative decoding, KV compression, and fine-grained token-level preemption. Robust production deployment would also require stronger support for multi-tenant fairness, quota enforcement, and scheduling of external tool invocations. Dynamic adaptation to changing workload profiles and heterogeneous agent roles remains a promising research direction, especially for dynamic agent generation and evolving prompt formats.

Conclusion

Maestro provides a rigorous, empirically validated scheduling solution for large-scale LLM-MAS serving, unifying agent-aware prediction, multi-model colocation, and cross-cluster routing into a coordinated scheme. It achieves substantial gains in both SLO attainment and resource efficiency under realistic, heterogeneous workloads, providing a new reference point for multi-agent LLM system design. Its modular architecture allows integration with ongoing advancements in model architectures and low-level memory management, representing a substantial advance in the practical deployment of complex agentic LLM pipelines.
