Cost-Effective Online Multi-LLM Selection with Versatile Reward Models (2405.16587v2)
Abstract: With the rapid advancement of LLMs, the diversity of multi-LLM tasks and the variability in their pricing structures have become increasingly important, as costs can vary greatly across LLMs. To tackle these challenges, we introduce C2MAB-V, a Cost-effective Combinatorial Multi-armed Bandit with Versatile reward models for optimal LLM selection and usage. Unlike traditional static approaches, or methods that rely on a single LLM without regard to cost, this online model selects multiple LLMs over a combinatorial search space, with several LLMs deployed on a scheduling cloud and a local server dedicated to handling user queries; it is tailored to various collaborative task types with different reward models. Using our designed online feedback mechanism and confidence-bound technique, C2MAB-V addresses the multi-LLM selection challenge by managing the exploration-exploitation trade-off across models while balancing cost and reward for diverse tasks. The underlying NP-hard integer linear program for selecting multiple LLMs is addressed by: i) relaxing the integer problem on the local server, ii) applying a discretization rounding scheme on the scheduling cloud to produce feasible LLM combinations, and iii) continually updating estimates online from feedback. Theoretically, we prove that C2MAB-V offers strict guarantees over versatile reward models, matching state-of-the-art regret and violation results in some degenerate cases. Empirically, we show that C2MAB-V effectively balances performance and cost-efficiency with nine LLMs across three application scenarios.
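The three-step procedure described in the abstract (relaxation, rounding, online feedback updates) can be sketched as a cost-aware combinatorial UCB loop. The sketch below is illustrative only: the class name, the toy costs and budget, and the greedy budgeted selection that stands in for the paper's LP relaxation and discretization rounding are all assumptions, not the authors' actual C2MAB-V implementation.

```python
import math
import random

class CostAwareCombinatorialUCB:
    """Hedged sketch of cost-aware combinatorial LLM selection.

    The paper's LP relaxation + rounding step is simplified here to a
    greedy knapsack over UCB-per-cost ratios; reward means are learned
    online from bandit feedback."""

    def __init__(self, costs, budget, seed=0):
        self.costs = costs              # known per-query cost of each LLM
        self.budget = budget            # per-round total cost budget
        self.n = [0] * len(costs)       # times each LLM was queried
        self.mu = [0.0] * len(costs)    # empirical mean reward per LLM
        self.t = 0
        self.rng = random.Random(seed)

    def select(self):
        """Choose a set of LLMs whose total cost fits the budget."""
        self.t += 1
        # Optimistic (upper-confidence-bound) reward estimate per LLM.
        ucb = [
            1.0 if self.n[i] == 0
            else min(1.0, self.mu[i] + math.sqrt(1.5 * math.log(self.t) / self.n[i]))
            for i in range(len(self.costs))
        ]
        # Greedy stand-in for the relaxed LP + rounding: add LLMs in
        # decreasing UCB-per-cost ratio while the budget allows.
        order = sorted(range(len(self.costs)),
                       key=lambda i: ucb[i] / self.costs[i], reverse=True)
        chosen, spent = [], 0.0
        for i in order:
            if spent + self.costs[i] <= self.budget:
                chosen.append(i)
                spent += self.costs[i]
        return chosen

    def update(self, i, reward):
        """Online feedback: incremental mean update for LLM i."""
        self.n[i] += 1
        self.mu[i] += (reward - self.mu[i]) / self.n[i]

# Toy run with 3 hypothetical LLMs (true mean rewards and costs made up).
true_means, costs = [0.9, 0.6, 0.3], [3.0, 1.0, 0.5]
bandit = CostAwareCombinatorialUCB(costs, budget=2.0, seed=42)
for _ in range(2000):
    for i in bandit.select():
        bandit.update(i, 1.0 if bandit.rng.random() < true_means[i] else 0.0)
# The strongest LLM (index 0) exceeds the budget on its own and is never
# chosen; the cheap-but-decent LLMs 1 and 2 are selected instead.
```

This illustrates the cost-reward dilemma the abstract describes: the highest-reward model can be priced out, so the learner must discover which affordable combination of models is best, under uncertainty about their rewards.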
Authors: Xiangxiang Dai, Jin Li, Xutong Liu, Anqi Yu, John C. S. Lui