- The paper presents MoMA, a unified routing framework that dynamically directs queries to specialized agents or optimal LLMs based on a cost-performance trade-off.
- It employs a two-stage system using data augmentation, a MoE head, and Pareto optimization for fine-grained, query-specific model selection.
- Empirical evaluations show MoMA reduces costs by up to 37% while maintaining performance, effectively scaling across diverse domains and models.
Generalized Routing for Model and Agent Orchestration: The MoMA Framework
This essay provides a technical analysis of "Towards Generalized Routing: Model and Agent Orchestration for Adaptive and Efficient Inference" (arXiv:2509.07571), which introduces MoMA, a unified routing framework for orchestrating both LLMs and AI agents. The framework is designed to optimize inference efficiency and cost-effectiveness in heterogeneous, real-world AI service environments.
The proliferation of LLMs and domain-specific agents has led to a highly heterogeneous AI ecosystem, where user queries span a wide range of domains and task complexities. Existing routing solutions are typically limited to either LLM selection or agent invocation, and often fail to scale across diverse model pools or adapt to the evolving landscape of available execution units. The core challenge addressed is the development of a generalized, adaptive router that can scale across diverse model pools, adapt as new models and agents become available, and balance response quality against inference cost on a per-query basis.
MoMA Framework Overview
MoMA (Mixture of Models and Agents) is a two-stage routing system that integrates both LLM and agent routing. The framework is depicted in Figure 2.
Figure 2: The MoMA routing model framework.
Upon receiving a user query, MoMA first attempts to route the query to a specialized agent if the task is well-defined and matches agent capabilities. If no suitable agent is found, the query is routed to the most appropriate LLM, selected from a heterogeneous pool based on a cost-performance trade-off. This hierarchical approach leverages the determinism and efficiency of agents while retaining the flexibility and generality of LLMs.
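To make this two-stage decision concrete, the following Python sketch shows the control flow under assumed interfaces; `agent_router`, `llm_router`, and their methods are hypothetical names for illustration, not the paper's API.

```python
def route(query, agent_router, llm_router, preference="auto"):
    """Minimal sketch of MoMA's hierarchical routing decision (illustrative interfaces)."""
    # Stage 1: try to match the query to a specialized agent.
    agent = agent_router.match(query)          # assumed to return None if no agent fits
    if agent is not None:
        return agent.run(query)

    # Stage 2: score candidate LLMs and pick one on the score-cost frontier.
    scores = llm_router.predict_scores(query)  # per-LLM performance estimates
    model = llm_router.select(scores, preference=preference)  # Pareto + TOPSIS step
    return model.generate(query)
```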
LLM Routing: Data, Architecture, and Optimization
Training Data Construction and Augmentation
A large-scale, hierarchically organized dataset (D_train, ~2.25M instances) is constructed to profile LLM capabilities across domains (science, writing, technology, programming) and task complexities. The dataset is augmented using a BERT-based selection mechanism and LLM-as-a-judge pairwise comparisons, producing fine-grained performance labels for model pairs.
Figure 3: Training data distribution by category.
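To make the labeling step concrete, here is a minimal sketch of how judge scores for a model pair could be bucketed into the five comparison outcomes used downstream; the weak/strong margins are illustrative assumptions, not thresholds reported in the paper.

```python
def pairwise_label(judge_score_a, judge_score_b, weak=0.05, strong=0.20):
    """Map two LLM-as-a-judge scores to one of five outcomes for model A.

    Classes: 0 strong win, 1 weak win, 2 tie, 3 weak loss, 4 strong loss.
    The weak/strong margins are illustrative, not the paper's values.
    """
    margin = judge_score_a - judge_score_b
    if margin >= strong:
        return 0
    if margin >= weak:
        return 1
    if margin > -weak:
        return 2
    if margin > -strong:
        return 3
    return 4
```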
Router Architecture
The LLM router employs a pre-trained, instruction-tuned LLM (Qwen-3) as an encoder, extracting hidden states as feature representations. These are passed to a Mixture-of-Experts (MoE) head, which dynamically selects the top-k experts (corresponding to candidate LLMs) for each query. The output is an M-dimensional vector of predicted performance scores, one per LLM.
Figure 4: LLM routing network structure.
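A minimal PyTorch sketch of such a routing head is given below, assuming a pooled hidden state from the encoder as input. The hidden size, expert count, top-k value, and the decision to evaluate all experts (rather than only the selected ones) are illustrative simplifications, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoERoutingHead(nn.Module):
    """Sketch: gate over experts, combine top-k expert outputs into per-LLM scores."""
    def __init__(self, hidden_dim=1024, num_experts=8, top_k=2, num_models=12):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
                          nn.Linear(hidden_dim, num_models))
            for _ in range(num_experts)
        )

    def forward(self, h):                                   # h: (B, hidden_dim) pooled encoder state
        gate_logits = self.gate(h)                          # (B, num_experts)
        topk_vals, topk_idx = gate_logits.topk(self.top_k, dim=-1)
        gate = torch.full_like(gate_logits, float("-inf"))
        gate.scatter_(1, topk_idx, topk_vals)
        gate = F.softmax(gate, dim=-1)                      # zero weight outside the top-k experts
        expert_out = torch.stack([e(h) for e in self.experts], dim=1)  # (B, E, num_models)
        return torch.einsum("be,bem->bm", gate, expert_out)            # (B, num_models) scores
```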
The router is trained using a categorical cross-entropy loss over five pairwise comparison outcomes, with additional modeling of "strong" and "weak" wins/losses via a logit-based margin. The optimization objective is to minimize the discrepancy between predicted and true pairwise outcomes, supporting fine-grained, query-specific model selection.
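One plausible way to realize this objective (not necessarily the paper's exact formulation) is to derive five-class logits from the predicted score pair and its margin, then apply a standard cross-entropy loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseOutcomeLoss(nn.Module):
    """Sketch: map a pair of predicted model scores to five comparison-outcome logits."""
    def __init__(self):
        super().__init__()
        # [s_a, s_b, s_a - s_b] -> {strong win, weak win, tie, weak loss, strong loss}
        self.head = nn.Linear(3, 5)

    def forward(self, scores, pair_idx, labels):
        # scores:   (B, M) predicted performance scores for the M candidate LLMs
        # pair_idx: (B, 2) indices of the two models being compared
        # labels:   (B,)   ground-truth outcome class in {0..4}
        s_a = scores.gather(1, pair_idx[:, :1])              # (B, 1)
        s_b = scores.gather(1, pair_idx[:, 1:])              # (B, 1)
        feats = torch.cat([s_a, s_b, s_a - s_b], dim=1)      # (B, 3)
        return F.cross_entropy(self.head(feats), labels)
```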
Score-Cost Trade-off and Pareto Optimization
For each query, the router constructs a Pareto frontier over candidate LLMs, balancing normalized performance scores and inference costs. The TOPSIS algorithm is used to select the LLM closest to the ideal point (maximal performance, minimal cost), with user-adjustable weights for cost vs. performance. This enables dynamic adaptation to user preferences and system constraints.
Figure 5: The Pareto frontier for the score-cost trade-off.
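The selection step can be sketched as follows. The Pareto filter and TOPSIS computation are standard techniques; the preference weights and the example score/cost values are illustrative, not the paper's configuration.

```python
import numpy as np

def pareto_front(scores, costs):
    """Indices of models not dominated by another model with >= score and <= cost."""
    keep = []
    for i in range(len(scores)):
        dominated = any(
            scores[j] >= scores[i] and costs[j] <= costs[i]
            and (scores[j] > scores[i] or costs[j] < costs[i])
            for j in range(len(scores))
        )
        if not dominated:
            keep.append(i)
    return keep

def topsis_select(scores, costs, w_score=0.5, w_cost=0.5):
    """Pick the frontier model closest to the ideal point (max score, min cost)."""
    idx = pareto_front(scores, costs)
    s = w_score * scores[idx] / np.linalg.norm(scores[idx])
    c = w_cost * costs[idx] / np.linalg.norm(costs[idx])
    ideal, anti = np.array([s.max(), c.min()]), np.array([s.min(), c.max()])
    pts = np.stack([s, c], axis=1)
    d_pos = np.linalg.norm(pts - ideal, axis=1)
    d_neg = np.linalg.norm(pts - anti, axis=1)
    closeness = d_neg / (d_pos + d_neg + 1e-12)
    return idx[int(np.argmax(closeness))]      # index of the selected LLM

# Example: four candidate LLMs with normalized quality scores and relative costs.
scores = np.array([0.62, 0.75, 0.88, 0.90])
costs = np.array([0.10, 0.25, 0.60, 1.00])
print(topsis_select(scores, costs, w_score=0.7, w_cost=0.3))
```

Shifting the weights toward `w_cost` reproduces the cost-priority behavior described later; weighting `w_score` heavily recovers performance-first routing.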
Agent Routing: Hierarchical and Context-Aware Selection
Agent routing is implemented as a two-layer hierarchical retrieval system:
- First Layer: Coarse-grained classification of the query into agent categories using semantic embeddings and k-means clustering over agent descriptions (a minimal sketch follows this list).
- Second Layer: Fine-grained selection within candidate categories using a context-aware finite state machine (FSM), integrating rule-based and embedding-based semantic disambiguation.
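The sketch below illustrates the first, coarse-grained layer. The agent descriptions are invented examples, `embed` is a stand-in for whatever sentence encoder is used in practice (here it just returns random vectors so the snippet runs), and the cluster count is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented agent descriptions; in practice these come from the agent registry.
AGENT_DESCRIPTIONS = {
    "code_agent":   "Writes, debugs, and explains source code.",
    "math_agent":   "Solves mathematical problems step by step.",
    "travel_agent": "Plans trips and looks up flight and hotel options.",
    "doc_qa_agent": "Answers questions over uploaded documents.",
}

def embed(texts):
    """Placeholder for a real sentence-embedding model; returns random vectors here."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

names = list(AGENT_DESCRIPTIONS)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    embed([AGENT_DESCRIPTIONS[n] for n in names])
)

def coarse_category(query):
    """First layer: assign the query to the nearest agent-description cluster."""
    return int(kmeans.predict(embed([query]))[0])

# The returned cluster id narrows the candidate agents handed to the
# second-layer, FSM-based fine-grained selection.
```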
A token logits masking strategy is employed during LLM-based agent selection to prevent invocation of unavailable agents, ensuring robust and safe routing. The system supports efficient scaling via KV-cache-based prefetching and automated category assignment for new agents.
Figure 6: The Agent Routing framework.
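The masking step can be illustrated as follows: logits for tokens that would invoke unavailable agents are set to negative infinity before selection, so the LLM can only emit an agent that is actually invocable. The direct mapping from agent names to single token ids is a simplifying assumption for this sketch.

```python
import torch

def select_agent(next_token_logits, agent_token_ids, available_agents):
    """Constrain the selector LLM's next-token choice to invocable agents.

    next_token_logits: (vocab_size,) logits from the selector LLM
    agent_token_ids:   dict mapping agent name -> token id that triggers it
    available_agents:  set of agent names currently deployed and healthy
    """
    masked = torch.full_like(next_token_logits, float("-inf"))
    for name, tok in agent_token_ids.items():
        if name in available_agents:
            masked[tok] = next_token_logits[tok]   # keep logits only for invocable agents
    chosen = int(masked.argmax())
    return next(n for n, t in agent_token_ids.items() if t == chosen)
```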
Empirical Evaluation
Model Capability Profiling
Experiments on the Jiutian LLM series demonstrate that model performance varies significantly across mathematical subfields and difficulty levels, validating the need for fine-grained, query-specific routing.
Figure 7: Performance of the Jiutian series LLMs in the mathematics domain.
MoMA is evaluated on AIME2024 (mathematics), LiveCodeBench (code generation), and SimpleQA (general knowledge) using a diverse set of LLMs (1B–235B parameters, including open-source and proprietary models). Key findings include:
- Performance-First Routing: MoMA matches or exceeds the best single LLM (qwen3-235b-a22b) while reducing cost by 31.46%.
- Auto-Routing: Achieves near-optimal performance at significantly lower cost (a 37.19% cost reduction relative to performance-first routing).
- Cost-Priority Routing: Selects lightweight models for maximal efficiency with acceptable performance degradation.
MoMA outperforms SFT-based and contrastive learning-based routers in cost-performance trade-off, scalability, and adaptability.
Model Usage Distribution
Analysis of model usage under different routing preferences reveals that MoMA dynamically leverages specialized lightweight models (e.g., jiutian-math-8b, jiutian-code-8b) for domain-specific tasks, while defaulting to larger models only when necessary. This highlights the underutilized value of smaller, specialized models in practical deployments.
Figure 8: Model usage percentage across code, mathematical, and general knowledge domains. Each domain corresponds to three preferences: cost-priority, auto-routing, and performance-first (from left to right).
System Deployment and Practical Considerations
MoMA has been deployed with a dozen high-quality LLMs and over 20 expert agents, covering domains such as programming, mathematics, translation, and healthcare. The system supports real-time, dynamic orchestration, efficient scaling, and automated agent onboarding. The hierarchical routing and caching strategies ensure low latency and high throughput in production environments.
Implications and Future Directions
The MoMA framework demonstrates that unified, adaptive routing across heterogeneous LLMs and agents is both feasible and beneficial for real-world AI services. The approach enables:
- Fine-grained, query-specific orchestration for optimal cost-performance trade-off.
- Scalable integration of new models and agents without retraining the entire system.
- Dynamic adaptation to user preferences and evolving system constraints.
Future research directions include:
- Extending the routing framework to support multi-modal and multi-turn interactions.
- Incorporating reinforcement learning for continual adaptation of routing policies.
- Exploring federated or decentralized routing architectures for privacy-preserving inference.
Conclusion
MoMA represents a significant advance in the orchestration of heterogeneous AI execution units, providing a practical, scalable, and adaptive solution for efficient inference in complex, real-world environments. By unifying LLM and agent routing with a principled, data-driven approach, MoMA achieves state-of-the-art cost-performance trade-offs and lays the groundwork for future developments in generalized AI service routing.