
Routoo: Learning to Route to Large Language Models Effectively (2401.13979v3)

Published 25 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs with superior response quality--particularly larger or closed-source models--often come with higher inference costs, making their deployment inefficient and costly. Meanwhile, developing foundational LLMs from scratch is becoming increasingly resource-intensive and impractical for many applications. To address the challenge of balancing quality and cost, we introduce Routoo, an architecture designed to optimize the selection of LLMs for specific prompts based on performance, cost, and efficiency. Routoo provides controllability over the trade-off between inference cost and quality, enabling significant reductions in inference costs for a given quality requirement. Routoo comprises two key components: a performance predictor and cost-aware selector. The performance predictor is a lightweight LLM that estimates the expected performance of various underlying LLMs on a given prompt without executing them. The cost-aware selector module then selects the most suitable model based on these predictions and constraints such as cost and latency, significantly reducing inference costs for the same quality. We evaluated Routoo using the MMLU benchmark across 57 domains employing open-source models. Our results show that Routoo matches the performance of the Mixtral 8x7b model while reducing inference costs by one-third. Additionally, by allowing increased costs, Routoo surpasses Mixtral's accuracy by over 5% at equivalent costs, achieving an accuracy of 75.9%. When integrating GPT4 into our model pool, Routoo nearly matches GPT4's performance at half the cost and exceeds it with a 25% cost reduction. These outcomes highlight Routoo's potential to significantly reduce inference costs without compromising quality, and even to establish new state-of-the-art results by leveraging the collective capabilities of multiple LLMs.

Citations (2)

Summary

  • The paper presents the Leeroo-orch, an orchestrator that dynamically routes tasks to specialized LLM experts for improved efficiency and cost savings.
  • It employs a reinforcement learning-inspired self-play loop to continuously refine its decision strategy using diverse, real-world queries.
  • Evaluations on the MMLU benchmark show the framework outperforms models like Mixtral and delivers competitive performance at a fraction of the cost.

Introduction

The proliferation of LLMs has created a new landscape for AI-based text generation and understanding. The development of foundational models, however, is reaching a point where the gap between cost and performance gains is widening sharply: training these models requires substantial computational resources and data, and incremental improvements often come with exponential increases in expense. In this setting, a model architecture known as the Leeroo Orchestrator (Leeroo-orch) presents a promising, cost-effective, and performance-efficient approach to leveraging an ensemble of LLM experts.

Model Architecture

The central component of the proposed framework is Leeroo-orch, an LLM-based orchestrator that intelligently selects the best-suited underlying LLM expert for each specific task. Unlike a traditional Mixture of Experts (MoE) model, it does not require all expert sub-networks to be loaded onto a single machine: each 'expert' operates independently, can be hosted on a different machine, and can use a different neural network architecture, allowing greater flexibility and scalability. The Leeroo-orch stands out for its optimize-first approach, weighing speed, cost, accuracy, and other criteria to determine the most efficient use of resources without sacrificing output quality.
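The two-stage routing described above can be sketched as a performance predictor that scores each candidate expert for a prompt, followed by a cost-aware selector that picks the best-scoring expert within a cost budget. Everything here (`Expert`, `predict_scores`, `select_expert`, the scores and prices) is an illustrative assumption, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    cost_per_call: float  # hypothetical dollars per call

def predict_scores(prompt: str, experts: list[Expert]) -> dict[str, float]:
    """Stand-in for the lightweight predictor LLM: returns an estimated
    quality score in [0, 1] for each expert without executing it."""
    # A real predictor would be a trained model; fixed scores for illustration.
    fixed = {"small-7b": 0.62, "mixtral-8x7b": 0.74, "gpt4": 0.86}
    return {e.name: fixed.get(e.name, 0.5) for e in experts}

def select_expert(prompt: str, experts: list[Expert], budget: float) -> Expert:
    """Pick the highest-scoring expert whose per-call cost fits the budget."""
    scores = predict_scores(prompt, experts)
    affordable = [e for e in experts if e.cost_per_call <= budget]
    if not affordable:
        # Fallback when nothing fits: cheapest available expert.
        return min(experts, key=lambda e: e.cost_per_call)
    return max(affordable, key=lambda e: scores[e.name])

experts = [Expert("small-7b", 0.001), Expert("mixtral-8x7b", 0.004),
           Expert("gpt4", 0.03)]
print(select_expert("Explain entropy.", experts, budget=0.01).name)
# prints: mixtral-8x7b (gpt4 exceeds the budget; mixtral outscores small-7b)
```

Tightening or loosening the budget shifts selection toward cheaper or stronger experts, which is the controllability knob the architecture exposes.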

Training and Integration Methodology

The Leeroo-orch adopts a self-play training loop inspired by reinforcement learning: queries are generated, routed by the orchestrator, and evaluated, and the resulting feedback refines the orchestrator's decision-making over time. This allows for consistent improvement in outcomes as the model learns from a diverse range of questions. Additionally, the orchestrator is designed to gracefully integrate new expert models as they emerge, utilizing them in synergy with existing models to continuously enhance overall performance.
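The generate-orchestrate-evaluate loop above can be sketched as a simple bandit-style learner in which the evaluation reward nudges a per-(query type, expert) value estimate. The query types, reward values, and epsilon-greedy update are invented assumptions for illustration, not the paper's actual training procedure.

```python
import random

EXPERTS = ["small-7b", "mixtral-8x7b", "gpt4"]
DOMAINS = ["math", "coding", "history"]

def generate_query() -> str:
    """Stand-in for the query-generation stage of the loop."""
    return random.choice(DOMAINS)

def evaluate(query: str, expert: str) -> float:
    """Stand-in evaluator returning a reward in [0, 1]; made-up values."""
    table = {("math", "gpt4"): 0.9, ("math", "mixtral-8x7b"): 0.7}
    return table.get((query, expert), 0.5)

def train(steps=2000, eps=0.1, lr=0.3, seed=0):
    random.seed(seed)
    # Value estimate per (query type, expert), refined from feedback.
    q = {(d, e): 0.5 for d in DOMAINS for e in EXPERTS}
    for _ in range(steps):
        query = generate_query()
        if random.random() < eps:              # explore: try a random expert
            expert = random.choice(EXPERTS)
        else:                                  # exploit: current best estimate
            expert = max(EXPERTS, key=lambda e: q[(query, e)])
        reward = evaluate(query, expert)
        # Incremental update pulls the estimate toward the observed reward.
        q[(query, expert)] += lr * (reward - q[(query, expert)])
    return q

q = train()
best_math = max(EXPERTS, key=lambda e: q[("math", e)])
```

After training, the learned estimates route "math" queries to the expert the evaluator rewards most, mirroring how repeated feedback shapes the orchestrator's routing policy.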

Performance and Cost Optimization

Evaluations on the Massive Multitask Language Understanding (MMLU) benchmark show that the Leeroo-orch achieves state-of-the-art performance among open-source models while cutting costs: it matches the accuracy of the leading open-source LLM, Mixtral, at roughly two-thirds of its cost, and at equivalent cost surpasses Mixtral's accuracy by over 5%. Moreover, when GPT4 is integrated as an expert, Leeroo-orch exhibits competitive performance at only half the cost of GPT4 alone. The orchestrator's model selection is also tuned for cost-aware optimization, efficiently balancing expenditure without compromising output. Notably, smaller expert models contribute substantially to cost-to-performance efficiency, indicating a potential pathway toward more economical solutions without sacrificing output quality.
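The arithmetic behind such savings can be illustrated with a routing-probability-weighted expected cost: if most queries go to cheap experts and only hard ones reach GPT4, the blended per-query cost falls well below the GPT4-only baseline. The prices and routing fractions below are invented for illustration, not the paper's measurements.

```python
def expected_cost(routing: dict[str, float], costs: dict[str, float]) -> float:
    """Expected per-query cost under a routing distribution.

    routing: {expert: fraction of queries sent to it} (must sum to 1).
    costs:   {expert: hypothetical dollars per query}.
    """
    assert abs(sum(routing.values()) - 1.0) < 1e-9
    return sum(p * costs[e] for e, p in routing.items())

costs = {"small-7b": 0.001, "mixtral-8x7b": 0.004, "gpt4": 0.03}
# Route most queries to cheap experts, reserving GPT4 for the hardest ones.
routed = expected_cost(
    {"small-7b": 0.5, "mixtral-8x7b": 0.35, "gpt4": 0.15}, costs)
gpt4_only = costs["gpt4"]
print(routed, gpt4_only)  # blended cost is a fraction of the GPT4-only cost
```

Under these made-up numbers the blended cost is 0.0064 versus 0.03 for GPT4 alone; the real trade-off depends on how often hard queries genuinely require the most expensive expert.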

Conclusion

In conclusion, the Leeroo-orch points towards an innovative direction in the utilization of LLMs, shifting the focus from monolithic, general-purpose models to a collaborative ensemble of domain-specific ones. This methodology not only accomplishes a reduction in costs but simultaneously elevates AI’s capabilities by optimizing the synergistic relationship between various LLMs. As the field of AI continues to expand, the orchestrator embodiment stands as a testament to the potential of leveraging a diverse array of expertise within LLMs to achieve superior and economically viable performance outcomes.
