
Can large language models explore in-context? (2403.15371v3)

Published 22 Mar 2024 in cs.LG, cs.AI, and cs.CL

Abstract: We investigate the extent to which contemporary LLMs can engage in exploration, a core capability in reinforcement learning and decision making. We focus on native performance of existing LLMs, without training interventions. We deploy LLMs as agents in simple multi-armed bandit environments, specifying the environment description and interaction history entirely in-context, i.e., within the LLM prompt. We experiment with GPT-3.5, GPT-4, and Llama2, using a variety of prompt designs, and find that the models do not robustly engage in exploration without substantial interventions: i) Across all of our experiments, only one configuration resulted in satisfactory exploratory behavior: GPT-4 with chain-of-thought reasoning and an externally summarized interaction history, presented as sufficient statistics; ii) All other configurations did not result in robust exploratory behavior, including those with chain-of-thought reasoning but unsummarized history. Although these findings can be interpreted positively, they suggest that external summarization -- which may not be possible in more complex settings -- is important for obtaining desirable behavior from LLM agents. We conclude that non-trivial algorithmic interventions, such as fine-tuning or dataset curation, may be required to empower LLM-based decision making agents in complex settings.

Exploring the Limits of Exploration: How LLMs Fare in Multi-Armed Bandit Environments

Introduction

The capacity for exploration underpins effective decision-making in complex environments. This paper scrutinizes the inherent ability of contemporary LLMs to engage in exploration, a capability central to reinforcement learning (RL) and sequential decision making. By deploying LLMs as agents within multi-armed bandit (MAB) settings, without any training interventions, the investigation places LLMs in scenarios where exploration is required for successful learning.

Experimental Design

Given the emerging relevance of in-context learning, this paper introduces a systematic examination of LLMs' exploration capabilities via simple yet foundational RL problems: multi-armed bandits. This choice is motivated by the simplicity and analytical tractability of MAB problems, which isolate the exploration-exploitation dilemma fundamental to decision making.
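
To make the setting concrete, the following is a minimal sketch of a Bernoulli multi-armed bandit of the kind used to pose this dilemma; the arm means and class interface are illustrative placeholders, not the paper's exact easy or hard instances.

  import random

  class BernoulliBandit:
      """Stochastic multi-armed bandit with Bernoulli rewards.

      Illustrative sketch: the arm means passed in are placeholders,
      not the exact instances studied in the paper.
      """

      def __init__(self, means):
          self.means = means  # success probability of each arm

      def pull(self, arm):
          # Reward is 1 with probability means[arm], otherwise 0.
          return 1 if random.random() < self.means[arm] else 0

  # One best arm with a modest gap over the rest; shrinking the gap or
  # adding arms makes the exploration problem harder.
  env = BernoulliBandit([0.5, 0.5, 0.5, 0.5, 0.7])
  reward = env.pull(2)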

The research employs three LLMs: GPT-3.5, GPT-4, and Llama2, using a variety of prompt designs to describe the MAB scenario and elicit actions. These models are given specifically designed prompts that detail the bandit environment and query for the next action; varying the prompt details gives rise to different experimental configurations.
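
A rough sketch of how such a prompt might be assembled from the environment description and the raw interaction history is shown below; the wording and format are assumptions for illustration, not the paper's exact prompt templates.

  def build_bandit_prompt(num_arms, history):
      """Assemble an in-context bandit prompt.

      Illustrative wording only; the paper studies several scenarios,
      framings, and formats that are not reproduced here.
      """
      lines = [
          f"You are repeatedly choosing among {num_arms} arms, labeled 0 to {num_arms - 1}.",
          "Your goal is to maximize the total reward you collect.",
          "Interaction history so far:",
      ]
      for t, (arm, reward) in enumerate(history, start=1):
          lines.append(f"  Round {t}: chose arm {arm}, observed reward {reward}.")
      lines.append("Which arm do you choose next? Reply with a single arm index.")
      return "\n".join(lines)

  # Example: three past rounds, then ask for the next action.
  prompt = build_bandit_prompt(5, [(1, 0), (3, 1), (3, 0)])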

The exploration behaviors of these LLMs are probed across multiple settings:

  • Environment Complexity: Easy and hard MAB instances are chosen, differing in the number of arms and in the reward distributions.
  • Temperature Settings: Sampling temperatures of 0 and 1 are used to distinguish intrinsic exploration from externally injected randomness.
  • Prompt Variations: Prompts range from basic to advanced, varying the scenario, framing, level of history summarization, and whether chain-of-thought reasoning is requested (a sketch of the resulting configuration grid follows this list).
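
Taken together, these axes define a grid of experimental configurations. The enumeration below is an illustrative sketch; the specific model names, temperatures, and variant labels are assumptions rather than the paper's exact inventory.

  from itertools import product

  # Illustrative experimental grid over the axes listed above; the exact
  # set of models, temperatures, prompt variants, and instances differs
  # in the paper.
  models = ["GPT-3.5", "GPT-4", "Llama2"]
  temperatures = [0.0, 1.0]
  prompt_variants = ["basic", "suggestive-framing", "summarized-history", "chain-of-thought"]
  instances = ["easy", "hard"]

  configurations = list(product(models, temperatures, prompt_variants, instances))

  for model, temperature, variant, instance in configurations:
      # Each configuration would be run for many independent replicates
      # of the bandit interaction and scored on its exploration behavior.
      pass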

Results and Findings

Across numerous experimental runs, only a single configuration succeeded: GPT-4 with a prompt that suggested exploration, presented a summarized interaction history, and elicited chain-of-thought reasoning. This configuration exhibited robust exploratory behavior, effectively identifying and exploiting the most rewarding actions in the stipulated bandit environment.
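
The summarized history replaces the raw round-by-round transcript with per-arm sufficient statistics; for Bernoulli rewards these reduce to pull counts and empirical mean rewards, as in the sketch below. The exact summary format presented to the model in the paper may differ.

  def summarize_history(history, num_arms):
      """Collapse a raw (arm, reward) transcript into per-arm sufficient
      statistics: number of pulls and empirical mean reward.

      Illustrative of an externally summarized history; the paper's
      exact summary format may differ.
      """
      counts = [0] * num_arms
      totals = [0.0] * num_arms
      for arm, reward in history:
          counts[arm] += 1
          totals[arm] += reward
      lines = []
      for arm in range(num_arms):
          mean = totals[arm] / counts[arm] if counts[arm] else 0.0
          lines.append(f"Arm {arm}: pulled {counts[arm]} times, average reward {mean:.2f}.")
      return "\n".join(lines)

  # Example: the summary for a short transcript.
  print(summarize_history([(1, 0), (3, 1), (3, 0)], num_arms=5))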

In contrast, the majority of configurations exhibited significant exploration deficiencies, manifesting either as an undue focus on exploiting immediate rewards (akin to a greedy strategy) or as an almost uniform, undiscriminating choice distribution across all actions, indicative of a failure to learn from past interactions.

Specifically, configurations that did not employ summarized interaction histories, or that lacked prompt attributes explicitly encouraging exploration, were prone to these failures. Interestingly, the lone success with GPT-4 also highlights the nuanced but critical role of prompt design in eliciting more sophisticated behaviors from LLMs.
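
One simple way to operationalize these two failure modes is to examine the empirical arm-choice frequencies of a run: frequencies concentrated almost entirely on one arm are consistent with greedy, under-exploring behavior, while near-uniform frequencies suggest the model is not learning from the history at all. The diagnostics below are an illustrative sketch with arbitrary thresholds, not the paper's exact evaluation metrics.

  def choice_frequencies(arms_chosen, num_arms):
      """Empirical frequency with which each arm was chosen over a run."""
      total = len(arms_chosen)
      return [arms_chosen.count(a) / total for a in range(num_arms)]

  def looks_greedy(freqs, threshold=0.9):
      # Almost all pulls go to a single arm: consistent with greedy,
      # under-exploring behavior (the threshold is an arbitrary choice).
      return max(freqs) >= threshold

  def looks_uniform(freqs, tolerance=0.05):
      # All arms chosen at close to the uniform rate: consistent with a
      # failure to learn from past interactions.
      uniform = 1.0 / len(freqs)
      return all(abs(f - uniform) <= tolerance for f in freqs)

  freqs = choice_frequencies([0, 2, 2, 2, 2, 2, 2, 2, 2, 2], num_arms=5)
  print(looks_greedy(freqs), looks_uniform(freqs))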

Implications and Future Directions

This investigation underlines the necessity of non-trivial prompt engineering, or potentially algorithmic interventions, to unlock and elevate the decision-making capacities of LLMs in settings that demand robust exploration strategies. The findings prompt several lines of inquiry and development:

  • Further Prompt Exploration: Expanding the diversity and depth of prompts may uncover more nuanced aspects of LLM capabilities.
  • Algorithmic Interventions: Fine-tuning or custom training paradigms might be essential for cultivating sophisticated exploration behaviors in more complex RL environments.
  • Methodological Advances: Developing methodologies for cost-effective, large-scale evaluations of LLM behaviors in decision-making contexts is paramount.

Conclusion

While a single configuration demonstrated the potential for LLMs to engage in strategic exploration within a controlled environment, the overarching evidence points to a general struggle among LLMs to autonomously navigate the exploration-exploitation trade-off without explicit guidance. Although it focuses on the elemental RL challenge of multi-armed bandits, this paper lays foundational insights for developing LLMs into more adept decision-making agents for broader and more complex tasks.

Authors
  1. Akshay Krishnamurthy
  2. Keegan Harris
  3. Dylan J. Foster
  4. Cyril Zhang
  5. Aleksandrs Slivkins