MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs
The research paper "MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs" addresses the challenge of selecting, for a given application, the LLM that best balances accuracy and cost. The proposed framework, MetaLLM, routes each query to the most appropriate LLM among several available options using a multi-armed bandit algorithm.
Overview of MetaLLM
MetaLLM is designed to optimize the performance and cost-effectiveness of LLMs on zero-shot classification tasks. The framework wraps a diverse suite of LLMs and dynamically decides which one to use for each query, balancing prediction accuracy against usage cost under uncertainty; its objective is to maximize a total reward that trades off overall performance against cost constraints.
Technical Contributions
- Dynamic Routing via a Multi-Armed Bandit Algorithm: MetaLLM uses a multi-armed bandit algorithm to make an intelligent routing decision for each query, balancing the accuracy of responses against the cost incurred and thereby optimizing resource use (a minimal routing sketch appears after this list).
- Performance Improvement with Cost Efficiency: Empirical results on well-known LLM platforms, including OpenAI, Amazon Bedrock, Anthropic's Claude, and Meta's LLaMA, demonstrate MetaLLM's effectiveness. For instance, MetaLLM improves classification accuracy by approximately 1% over the best-performing individual LLM while achieving significant cost savings (up to 60% on OpenAI and 40% on Bedrock APIs).
- Versatility and Scalability: Although the current work focuses on zero-shot classification tasks, the MetaLLM framework can be extended to other language-related tasks by modifying the reward function to incorporate suitable metrics for evaluating response quality.
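To make the routing idea concrete, below is a minimal, hypothetical sketch of epsilon-greedy selection over a handful of LLM "arms". The model names, embedding dimension, and the `route` helper are illustrative assumptions, not the paper's actual implementation; MetaLLM's published method is a bandit with a learned linear reward model, of which this is only a simplified instance.

```python
import numpy as np

# Candidate LLMs ("arms"); names and dimensions are assumed for illustration.
ARMS = ["gpt-4", "gpt-3.5-turbo", "claude-instant"]
DIM = 384          # query-embedding dimension (assumed)
EPSILON = 0.1      # exploration rate

# One linear scorer per arm: predicted reward = w . embedding(query)
weights = {arm: np.zeros(DIM) for arm in ARMS}

def route(query_embedding: np.ndarray) -> str:
    """Pick an LLM: explore uniformly with probability EPSILON, else exploit."""
    if np.random.rand() < EPSILON:
        return str(np.random.choice(ARMS))
    scores = {arm: float(weights[arm] @ query_embedding) for arm in ARMS}
    return max(scores, key=scores.get)

# Example: route a random stand-in for a real query embedding.
print(route(np.random.randn(DIM)))
```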
Methodology
The problem of selecting the optimal LLM for each query is modeled as decision-making under uncertainty. Given a set of LLMs with diverse capabilities and cost structures, MetaLLM trains a routing function, a linear model that maps the embedded features of a query to the expected reward, defined in terms of both performance and cost.
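The paper specifies only that a linear model maps query embeddings to expected reward; the sketch below shows one plausible way such a routing function could be fit online under bandit feedback. The plain SGD update rule, learning rate, and all names here are assumptions for illustration.

```python
import numpy as np

DIM = 384            # query-embedding dimension (assumed)
LEARNING_RATE = 0.01

# One weight vector per candidate LLM; predicted reward = w . embedding.
weights = {"gpt-4": np.zeros(DIM), "claude-instant": np.zeros(DIM)}

def update(arm: str, x: np.ndarray, observed_reward: float) -> None:
    """One SGD step on the squared error between predicted and observed reward.

    Only the chosen arm is updated (bandit feedback): the rewards the
    other LLMs would have earned on this query are never observed.
    """
    predicted = float(weights[arm] @ x)
    weights[arm] += LEARNING_RATE * (observed_reward - predicted) * x

# Example: after routing a query to "gpt-4" and scoring its answer.
update("gpt-4", np.random.randn(DIM), observed_reward=0.86)
```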
The reward function is formalized as $r_i(x) = s_i(x) - \alpha \, c_i$, where $s_i(x)$ denotes the accuracy of LLM $i$ on sample $x$ and $c_i$ represents its query cost. By dynamically adjusting the weight parameter $\alpha$, the framework can prioritize cost over performance or vice versa, according to the application's budget constraints.
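As a toy illustration of this trade-off (all numbers assumed, not drawn from the paper), a larger $\alpha$ steers the router toward cheaper models:

```python
# Toy numbers (assumed, not taken from the paper).
alpha = 0.1                        # cost weight in r_i(x) = s_i(x) - alpha * c_i
cheap  = {"acc": 0.88, "cost": 0.2}
pricey = {"acc": 0.91, "cost": 1.0}

r_cheap  = cheap["acc"]  - alpha * cheap["cost"]    # 0.88 - 0.02 = 0.86
r_pricey = pricey["acc"] - alpha * pricey["cost"]   # 0.91 - 0.10 = 0.81

# The cheaper model wins despite lower raw accuracy. Raising alpha penalizes
# cost more strongly; lowering it (e.g. alpha = 0.02) makes the pricier,
# more accurate model the better choice.
```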
Experimental Validation
The experimental setup evaluates MetaLLM on well-established datasets such as SST-2 and IMDB, using multiple API services from OpenAI and Bedrock. Key findings from the experiments include:
- OpenAI Models: MetaLLM achieved better performance while significantly reducing costs by intelligently selecting less expensive yet sufficiently accurate models for certain queries. For example, at one setting of the cost-scaling parameter $\alpha$, MetaLLM reached an accuracy of 90.98% at a cost reduction of about 32% compared to a single-model strategy.
- Bedrock APIs: When tested with a heterogeneous set of Bedrock models, such as Titan Lite and Claude Instant, MetaLLM attained higher accuracy with substantial cost efficiency compared to any single LLM. Specifically, with a suitable choice of $\alpha$, MetaLLM outperformed Llama 2 7B in accuracy while being approximately 42% more cost-efficient.
Implications and Future Directions
The adoption of MetaLLM represents a significant step towards enhancing the practical deployment of LLMs by ensuring cost efficiency without compromising performance. The implications are twofold:
- Practical Impact: By reducing the operational costs associated with LLMs, MetaLLM facilitates the broader adoption of advanced ML applications, making them more accessible and sustainable.
- Theoretical Implications: The framework underscores the importance of dynamically balancing performance and cost, paving the way for future research into more sophisticated and adaptive AI systems that can integrate diverse evaluation metrics into the decision-making process.
Future developments could explore the application of MetaLLM to a broader range of language tasks, such as question answering and text generation, by refining the reward function. Additionally, incorporating other aspects like inference time and model robustness could further enhance the framework’s utility across different AI applications.
In conclusion, MetaLLM offers a pragmatic and effective approach for leveraging multiple LLMs in real-world applications, addressing the dual objectives of high performance and cost efficiency through intelligent query routing and decision-making under uncertainty.