
Maestro: Scheduling for LLM-MAS

Updated 3 July 2026
  • Maestro is a workload-aware scheduling framework for LLM-MAS, enabling efficient cross-cluster GPU orchestration under strict resource budgets.
  • It integrates semantic modeling, memory-efficient multi-model co-location, latency-aware routing, and global prioritization to address complex inter-agent dependencies.
  • The approach significantly improves memory utilization, SLO attainment, and cost-latency trade-offs, demonstrating robust performance over traditional methods.

Maestro refers to a set of advances in workload-aware scheduling for LLM-based Multi-Agent Systems (LLM-MAS), enabling efficient cross-cluster GPU orchestration under stringent resource budgets. LLM-MAS architectures, which decompose complex user queries into collaborative workflows of specialized LLM-powered agents, induce substantial system-level challenges well beyond those encountered in single-turn LLM inference. These include highly variable and input-dependent resource demands, intricate inter-agent dependencies, sharp trade-offs between memory fragmentation and multi-model overprovisioning, and the need to optimize for latency, throughput, and service-level objectives (SLOs) simultaneously. Maestro delivers a unified, hierarchical, prediction-driven solution to these issues—integrating per-stage semantic modeling, memory-efficient multi-model co-location, latency-aware cluster routing, and workflow-aware global prioritization—demonstrating significant improvements in memory utilization and SLO attainment over prior art (Wang et al., 11 Jun 2026).

1. Motivation: Challenges in LLM-MAS Serving

Workflows in LLM-MAS transform each user query into multi-stage, agent-specialized pipelines. The resource consumption for such workloads is amplified due to multiple iterative LLM invocations, unpredictably variable decode costs, and diverse model memory footprints. Key challenges targeted by Maestro include:

  • Non-deterministic, input-conditioned decode costs, especially at the KV-cache level.
  • Memory fragmentation, and the need for over-commitment, when hosting many models with different memory footprints on a single GPU.
  • Cross-cluster orchestration under GPU scarcity: selecting nodes and clusters so as to minimize cold starts, prevent memory overload, and prioritize low-latency interactive paths.
  • Workflow-level head-of-line (HoL) blocking due to inter-dependent agent stages.

Traditional GPU serving stacks, which are optimized for homogeneous, single-model, single-turn inference, fail to meet these complex multi-dimensional constraints efficiently.

2. Maestro Design: Workload-Aware, Hierarchical Scheduling

Maestro’s architecture explicitly leverages agent-role semantics to predict resource footprints ahead of dispatch, and organizes the scheduling stack into three tightly coordinated layers (Wang et al., 11 Jun 2026):

Node Level

  • Hierarchical weight residency: Models are dynamically loaded and evicted between GPU, CPU, disk, and remote tiers using a least-recently-used (LRU) cascade. Model activation latency is made explicit via

$$T_{\mathrm{act}}(m) \approx \frac{\mathrm{Size}(m)}{\mathrm{BW}_{\mathrm{tier}}}$$

  • Elastic KV-cache via CUDA VMM: KV (key-value) cache pages are mapped into a shared pool, with strict enforcement:

$$M_{\mathrm{kv}} + M_{\mathrm{res}} \leq M_{\mathrm{total}}$$

Runtime algorithms (Algorithm 2) enact minimum-impact degradation plans over resident engines to reclaim memory, measuring and minimizing total disruption cost $C_{\mathrm{deg}}$.
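The two node-level relations above can be sketched in a few lines of Python. This is a minimal illustration rather than Maestro's implementation: the tier bandwidths, the `GpuMemory` class, and all function names are hypothetical, and the real system manages KV pages through CUDA VMM rather than a Python counter.

```python
from dataclasses import dataclass

# Hypothetical per-tier bandwidths in GB/s; real values would come from
# microbenchmarks on the target hardware.
TIER_BANDWIDTH_GBPS = {"cpu": 25.0, "disk": 3.0, "remote": 1.0}


def activation_latency_s(model_size_gb: float, tier: str) -> float:
    """T_act(m) ~= Size(m) / BW_tier; a model already on the GPU costs nothing."""
    if tier == "gpu":
        return 0.0
    return model_size_gb / TIER_BANDWIDTH_GBPS[tier]


@dataclass
class GpuMemory:
    total_gb: float      # M_total
    resident_gb: float   # M_res: model weights currently resident on the GPU
    kv_gb: float = 0.0   # M_kv: KV-cache pages currently mapped

    def can_admit(self, kv_request_gb: float) -> bool:
        """Check that M_kv + M_res <= M_total still holds after the request."""
        return self.kv_gb + kv_request_gb + self.resident_gb <= self.total_gb

    def admit(self, kv_request_gb: float) -> bool:
        """Map the requested KV pages only if the invariant is preserved."""
        if not self.can_admit(kv_request_gb):
            return False
        self.kv_gb += kv_request_gb
        return True
```

With these assumed numbers, a 30 GB model on the disk tier takes about 10 s to activate, and a GPU with 40 GB total and 30 GB of resident weights admits an 8 GB KV request but rejects a further 4 GB one.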

Cluster Level

  • Latency-aware routing: Each job stage $T$ is routed based on predicted KV memory ($\hat R_{\mathrm{kv}}(T)$), empirical safety margins ($\rho$), node headroom, activation state, and queueing delay, all combined into a fitness score:

$$S(N,T) = A(N,T) - \lambda\,T_{\mathrm{ready}}(N,T) - \mu\,C_{\mathrm{deg}}(N,T)$$

This ensures that interactive workloads are not dispatched to cold nodes, minimizing cold-load latency.
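The scoring rule can be illustrated with a short Python sketch. Several details are assumptions rather than facts from the source: the meaning of $A(N,T)$ is treated as a generic placement-benefit term, the safety margin $\rho$ is applied multiplicatively to the predicted KV demand, headroom is used as a hard feasibility filter, and the default weights are arbitrary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeState:
    name: str
    free_kv_gb: float     # headroom available for KV-cache on this node
    affinity: float       # A(N, T): hypothetical placement-benefit term
    ready_s: float        # T_ready(N, T): activation plus queueing delay
    degrade_cost: float   # C_deg(N, T): disruption needed to make room


def fitness(node: NodeState, lam: float, mu: float) -> float:
    """S(N, T) = A(N, T) - lambda * T_ready(N, T) - mu * C_deg(N, T)."""
    return node.affinity - lam * node.ready_s - mu * node.degrade_cost


def route(nodes: List[NodeState], predicted_kv_gb: float,
          rho: float = 1.2, lam: float = 0.5, mu: float = 1.0) -> Optional[NodeState]:
    """Pick the feasible node with the highest fitness, or None if none fits."""
    # Assumption: the safety margin inflates the predicted KV demand.
    need_gb = rho * predicted_kv_gb
    feasible = [n for n in nodes if n.free_kv_gb >= need_gb]
    if not feasible:
        return None
    return max(feasible, key=lambda n: fitness(n, lam, mu))
```

Because the readiness term enters with a negative weight, a warm node with a short queue outscores a cold node with more free memory, which matches the behavior described above.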

Global Level

  • Workflow-aware prioritization: An SRTF (Shortest-Remaining-Time-First) policy globally orders workflow stages by the sum of the estimated current-stage time and the expected execution time of the remaining stages, using historical templates and preemption at LLM-call boundaries (with hysteresis and cooldown control). This directly mitigates HoL blocking, particularly for interactive and multi-stage workflows.
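A minimal Python sketch of this ordering, using a binary heap, is shown below. The class and function names, the tie-breaking by arrival order, and the hysteresis and cooldown defaults are all hypothetical; the source does not specify how those controls are parameterized.

```python
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class QueuedStage:
    remaining_s: float                       # current stage + expected future stages
    seq: int                                 # arrival order, used to break ties
    workflow_id: str = field(compare=False)  # not part of the ordering


class SrtfQueue:
    """Global queue ordered by estimated remaining workflow time."""

    def __init__(self) -> None:
        self._heap: List[QueuedStage] = []
        self._seq = 0

    def push(self, workflow_id: str, current_stage_s: float,
             future_stages_s: List[float]) -> None:
        remaining = current_stage_s + sum(future_stages_s)
        heapq.heappush(self._heap, QueuedStage(remaining, self._seq, workflow_id))
        self._seq += 1

    def pop(self) -> str:
        """Return the workflow with the shortest estimated remaining time."""
        return heapq.heappop(self._heap).workflow_id


def should_preempt(running_remaining_s: float, candidate_remaining_s: float,
                   seconds_since_last_preempt: float,
                   hysteresis_s: float = 2.0, cooldown_s: float = 5.0) -> bool:
    """Preempt at an LLM-call boundary only for a clear win and after cooldown."""
    if seconds_since_last_preempt < cooldown_s:
        return False
    return candidate_remaining_s + hysteresis_s < running_remaining_s
```

The hysteresis margin keeps two workflows with similar remaining times from preempting each other repeatedly, and the cooldown bounds how often preemption can occur at all.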

3. Prediction Models for Stage-Specific Cost Estimation

Central to Maestro is the ability to predict, prior to scheduling, the output length, runtime, and memory demand of every agent stage based on its semantic and structural role (Wang et al., 11 Jun 2026):

  • Tool-Intent Classifier: Structured features (role, position, tool-availability, invocation index) and MiniLM-based semantic embeddings are fed to a LightGBM classifier to predict the likelihood that a tool will be called.
  • Output-Length Regression: Per-role or global regressors, trained on $\log(1+L)$, estimate output token counts, achieving MAE 165.4 tokens ($R^2 = 0.7774$), an improvement of 19.2% over previous systems.
  • System Metrics Translation: Execution time and KV memory are projected using:

$$\widehat T_{\mathrm{exec}}(T) = t_{\mathrm{pre}}(P(T), M) + t_{\mathrm{dec}}(M) \cdot \hat{L}(T)$$

The accompanying KV-memory projection uses two coefficients that encode the per-token and prompt/context footprints.
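The translation from a predicted output length to system metrics can be sketched as follows. The execution-time function mirrors the formula above. The KV estimate is an assumption: it uses the usual transformer accounting (one key and one value vector per layer per token) and is not taken from the source. All function names and example values are hypothetical.

```python
import math


def decode_log_length(pred_log1p: float) -> float:
    """Invert the log(1 + L) training target to recover a token count."""
    return max(0.0, math.expm1(pred_log1p))


def predicted_exec_time_s(prefill_s: float, decode_s_per_token: float,
                          predicted_tokens: float) -> float:
    """T_exec(T) = t_pre(P(T), M) + t_dec(M) * L_hat(T)."""
    return prefill_s + decode_s_per_token * predicted_tokens


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    """Standard transformer KV size per token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def predicted_kv_bytes(prompt_tokens: int, predicted_tokens: float,
                       bytes_per_token: int) -> float:
    """Assumed linear model: prompt/context footprint plus per-token growth."""
    return (prompt_tokens + predicted_tokens) * bytes_per_token
```

For example, with an assumed 0.5 s prefill and 20 ms per decoded token, a predicted 200-token output gives an estimated 4.5 s execution time.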

4. Memory Management: Multi-Model Co-location and Elastic Provisioning

Efficient serving under GPU constraints is achieved through:

  • Hierarchical Caching: The runtime persistently maintains lightweight GPU contexts so model re-activation can occur via fast host-to-GPU transfers, rather than full cold loads.
  • Elastic Overcommitment: Virtual KV-cache allocation is overcommitted to about 3× the physical HBM, with admission sized to the predicted per-stage demand. Five models (0.6B to 14B parameters) are shown to coexist on 40 GB of HBM with a 122 GB total virtual budget, reducing reserved HBM by 67.2% compared to naive allocation.
  • Min-Impact Degradation: Admission failures due to under-prediction trigger carefully planned staged evictions rather than blanket preemption, minimizing disruption.
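The overcommit ratio and a staged-eviction plan can be illustrated in Python. The greedy rule below (reclaim from the engines with the lowest disruption cost per reclaimed GB first) is an assumed stand-in and should not be read as the paper's Algorithm 2; the `Engine` fields and the example costs are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Engine:
    name: str
    reclaimable_gb: float    # memory that degrading this engine would free
    disruption_cost: float   # hypothetical cost of degrading this engine


def overcommit_ratio(virtual_gb: float, physical_gb: float) -> float:
    """Ratio of the virtual KV budget to physical HBM."""
    return virtual_gb / physical_gb


def plan_degradation(engines: List[Engine], needed_gb: float) -> Optional[List[str]]:
    """Greedy plan: cheapest disruption per reclaimed GB first.

    Returns the engine names to degrade, or None if not enough can be reclaimed.
    """
    candidates = [e for e in engines if e.reclaimable_gb > 0]
    candidates.sort(key=lambda e: e.disruption_cost / e.reclaimable_gb)
    plan: List[str] = []
    reclaimed = 0.0
    for engine in candidates:
        if reclaimed >= needed_gb:
            break
        plan.append(engine.name)
        reclaimed += engine.reclaimable_gb
    return plan if reclaimed >= needed_gb else None
```

With the figures quoted above, `overcommit_ratio(122.0, 40.0)` evaluates to 3.05. A greedy plan like this touches idle engines before busy ones, which is one simple way to avoid blanket preemption.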

5. Quantitative Impact and Evaluation

Maestro’s design delivers:

  • Resource efficiency: 67.2% reduction in HBM reserved for KV-cache (via overcommit), with an effective virtual-to-physical memory ratio of 3.05× (122 GB virtual budget on 40 GB of HBM).
  • SLO Attainment: Under high contention, increases service-level objective attainment by 23.6 percentage points vs. EDF (Earliest-Deadline-First), from 50.0% to 73.6% at an arrival rate of 2.0 req/s, and cuts interactive queueing delay by 84.8%.
  • Cost-Latency Trade-off: For a six-stage workflow, lowers end-to-end latency by 38.9% (151 s→92.4 s) with a single GPU and reduces GPU costs by 67% over exclusive deployment, incurring only 12.1% latency overhead.

These improvements are robust even in ablation settings; preemption, cluster-level KV-cache packing, and network-aware routing all contribute significant incremental gains.

6. Trade-Offs, Limitations, and Future Directions

Maestro’s decisions reflect trade-offs between system complexity, granularity, and robustness:

  • Granularity: Stage-boundary-only preemption sacrifices fine granularity for reduced memory management risk and simpler state reclamation.
  • Prediction dependence: Effectiveness relies on accurate output-length and memory forecasting. Poor predictions can force emergency degradation plans and cause inefficiency spikes.
  • Scope limitations: No current support for heterogeneous accelerators, per-tenant isolation, or coordinated multi-modal (CPU, I/O, LLM) joint scheduling. Extending hierarchical scheduling to these contexts is a future direction.
  • Pluggability: Kernel-level accelerations (e.g. speculative decode, KV compression) can be incorporated by updating microbenchmark performance profiles.

7. Significance and Context in LLM-MAS Systems

Maestro’s predictive, hierarchical, and workflow-aware orchestration marks a substantive step forward for scalable LLM-MAS serving. Its results indicate that multi-agent systems, in contrast to single-model deployments, require new approaches to scheduling, memory management, and admission control, as shown by the efficiency and SLO improvements in both simulation and prototype deployments. These results position Maestro as an enabler for cost-effective, highly interactive, and robust orchestration of heterogeneous LLM-backed agent workflows under realistic cloud resource constraints (Wang et al., 11 Jun 2026).
