Beyond Numeric Awards: In-Context Dueling Bandits with LLM Agents (2407.01887v2)

Published 2 Jul 2024 in cs.LG, cs.AI, and cs.CL

Abstract: In-context decision-making is an important capability of artificial general intelligence, which LLMs have effectively demonstrated in various scenarios. However, LLMs often face challenges when dealing with numerical contexts, and limited attention has been paid to evaluating their performance through preference feedback generated by the environment. This paper is the first to investigate the performance of LLMs as decision-makers in the context of Dueling Bandits (DB). We compare GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, Llama 3.1, and o1-preview against eight well-established DB algorithms. Our results reveal that LLMs, particularly GPT-4 Turbo, quickly identify the Condorcet winner, thus outperforming existing state-of-the-art algorithms in terms of weak regret. Nevertheless, LLMs struggle to converge even when explicitly prompted to do so, and they are sensitive to prompt variations. To overcome these issues, we introduce a hybrid algorithm, LLM-Enhanced Adaptive Dueling (LEAD), which takes advantage of both the in-context decision-making capabilities of LLMs and the theoretical guarantees inherited from classic DB algorithms. We show that LEAD has theoretical guarantees on both weak and strong regret and validate its robustness even under noisy and adversarial prompts. The design of such an algorithm sheds light on how to enhance the trustworthiness of LLMs used in decision-making tasks where performance robustness matters.
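To make the setting concrete: in dueling bandits, the learner selects a pair of arms each round and observes only a noisy preference outcome (which arm won the duel). With a Condorcet winner a* and gaps Δ_k = P(a* beats k) − 1/2, weak regret charges min(Δ_i, Δ_j) per round (zero if either chosen arm is a*), while strong regret charges max(Δ_i, Δ_j) (zero only if both are). The minimal Python sketch below illustrates this feedback model and the two regret notions; the greedy proposal rule here is a hypothetical stand-in for the LLM component, not the paper's LEAD algorithm.

```python
# Minimal dueling-bandits sketch. Assumptions: a fixed 3-arm preference
# matrix P with a Condorcet winner; the greedy proposal rule below is an
# illustrative stand-in for an LLM's suggestion, not the paper's LEAD.
import numpy as np

rng = np.random.default_rng(0)

# P[i, j] = probability that arm i beats arm j in a single duel.
# Arm 0 is the Condorcet winner since P[0, j] > 0.5 for all j != 0.
P = np.array([
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
])
K = P.shape[0]
condorcet = 0

def duel(i: int, j: int) -> int:
    """Return the index of the winning arm for one noisy comparison."""
    return i if rng.random() < P[i, j] else j

# Pairwise win counts; ones as an optimistic prior to avoid division by zero.
wins = np.ones((K, K))

weak_regret = strong_regret = 0.0
T = 1000
for t in range(T):
    p_hat = wins / (wins + wins.T)          # empirical preference estimates
    # Greedy proposal (stand-in for the LLM): arm with best average win rate.
    i = int(np.argmax(p_hat.mean(axis=1)))
    # Challenger: the arm we are least certain that i beats (excluding i).
    j = int(np.argmin(np.where(np.arange(K) == i, np.inf, p_hat[i])))
    winner = duel(i, j)
    loser = j if winner == i else i
    wins[winner, loser] += 1
    # Weak regret is zero if either arm is the Condorcet winner;
    # strong regret is zero only if both arms are.
    gaps = [P[condorcet, i] - 0.5, P[condorcet, j] - 0.5]
    weak_regret += min(gaps)
    strong_regret += max(gaps)

print(f"weak regret: {weak_regret:.1f}, strong regret: {strong_regret:.1f}")
```

Running the sketch shows the gap the paper highlights: weak regret stays small once the Condorcet winner keeps appearing in the chosen pair, but strong regret keeps growing until the pair converges to (a*, a*).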

Authors (4)
  1. Fanzeng Xia
  2. Hao Liu
  3. Yisong Yue
  4. Tongxin Li