Large Language Models are Biased Reinforcement Learners (2405.11422v1)

Published 19 May 2024 in cs.CL, cs.AI, and cs.LG

Abstract: In-context learning enables LLMs to perform a variety of tasks, including learning to make reward-maximizing choices in simple bandit tasks. Given their potential use as (autonomous) decision-making agents, it is important to understand how these models perform such reinforcement learning (RL) tasks and the extent to which they are susceptible to biases. Motivated by the fact that, in humans, it has been widely documented that the value of an outcome depends on how it compares to other local outcomes, the present study focuses on whether similar value encoding biases apply to how LLMs encode rewarding outcomes. Results from experiments with multiple bandit tasks and models show that LLMs exhibit behavioral signatures of a relative value bias. Adding explicit outcome comparisons to the prompt produces opposing effects on performance, enhancing maximization in trained choice sets but impairing generalization to new choice sets. Computational cognitive modeling reveals that LLM behavior is well-described by a simple RL algorithm that incorporates relative values at the outcome encoding stage. Lastly, we present preliminary evidence that the observed biases are not limited to fine-tuned LLMs, and that relative value processing is detectable in the final hidden layer activations of a raw, pretrained model. These findings have important implications for the use of LLMs in decision-making applications.

Do LLMs Learn Like Humans in Decision-Making Tasks?

Introduction

LLMs like GPT-3.5 and GPT-4 have shown an impressive range of abilities, from language translation to problem-solving. Among these abilities is in-context learning, where models learn to perform new tasks simply by observing examples provided in the prompt. This paper examines how LLMs handle decision-making tasks that amount to reinforcement learning (RL) from in-context feedback. The focus is on whether these models exhibit human-like biases when encoding and using rewards to make decisions.

Experiment Setup: The Bandit Tasks

To probe the decision-making abilities of LLMs, the researchers employed so-called bandit tasks. These tasks involve repeatedly choosing from a set of options, where each choice yields a reward, with the goal of maximizing cumulative reward over time. Think of it as deciding which slot machine to play from a row of machines in order to collect the largest total payout.
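The sketch below shows, in simplified form, what such a bandit task looks like computationally. The reward values and noise level are made up for illustration; the paper's tasks use their own reward distributions and option groupings.

```python
import random

class BanditTask:
    """A toy multi-armed bandit: each option pays a noisy reward around its mean."""

    def __init__(self, mean_rewards, noise_sd=5.0, seed=0):
        self.mean_rewards = mean_rewards  # one mean reward per option
        self.noise_sd = noise_sd
        self.rng = random.Random(seed)

    def pull(self, option):
        """Return a noisy reward for the chosen option."""
        return self.rng.gauss(self.mean_rewards[option], self.noise_sd)

# Example: two options with mean payoffs 40 and 60. A reward-maximizing
# learner should discover, from feedback alone, that option 1 is better.
task = BanditTask(mean_rewards=[40, 60])
history = [(choice, round(task.pull(choice), 1)) for choice in [0, 1, 0, 1, 1]]
print(history)
```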

Here's how the experiment was set up:

  • Models Tested: The researchers tested four popular LLMs: two proprietary models (GPT-3.5-turbo-0125 and GPT-4-0125-preview) and two open-source models (Llama-2-70b-chat and Mixtral-8x7b-instruct).
  • Tasks: Five different bandit tasks were used, varying in structural features such as how rewards were distributed and how options were grouped into choice contexts.
  • Prompt Designs: Two types of prompts were tested: one listing outcomes in a neutral manner (standard prompt) and another adding explicit comparisons between outcomes (comparisons prompt).

Figure 1 in the paper illustrates an example of these bandit tasks with different contexts and prompt designs.
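As a rough, hypothetical illustration of the two prompt styles (the exact wording in the paper differs), the standard prompt lists past choices and outcomes neutrally, while the comparisons prompt adds an explicit statement about how the obtained outcome compares to the alternative:

```python
# Illustrative prompt strings only; the paper's actual prompts differ in wording.
standard_prompt = (
    "You chose slot machine F and received 40 points.\n"
    "You chose slot machine J and received 60 points.\n"
    "Which slot machine do you choose next, F or J?"
)

comparisons_prompt = (
    "You chose slot machine F and received 40 points.\n"
    "You chose slot machine J and received 60 points, "
    "which is more than slot machine F gave you.\n"
    "Which slot machine do you choose next, F or J?"
)
```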

Main Findings

Choice Accuracy

The researchers measured how well the LLMs performed in both the training phase (where feedback was provided) and the transfer test phase (where no feedback was given). Key observations include:

  • Training Phase: The comparisons prompt generally led to higher accuracy, suggesting that explicit comparisons helped models learn better within the training context.
  • Transfer Test: Interestingly, the comparisons prompt actually reduced accuracy, indicating a trade-off between learning well in the initial context and generalizing to new contexts.

Relative Value Bias

The paper revealed something fascinating: LLMs displayed what’s known as a relative value bias. This means that, like humans, the models tended to favor options that had higher relative rewards in the training context, even if those options were not the best in an absolute sense.

  • LLMs’ Preference: The models were more strongly biased toward relative values, especially with the comparisons prompt. For example, options that gave better local outcomes were favored even when they were not the best options overall.
  • Human-Like Bias: This mirrors a human tendency for subjective reward to depend more on the local context than on absolute values, which can lead to sub-optimal decisions in new contexts.
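To make the bias concrete, here is a toy numerical illustration (not the paper's exact parameterization) of how purely relative, range-normalized encoding can distort choices at transfer:

```python
def relative_value(reward, context_rewards):
    """Range-normalize a reward within its local choice context."""
    lo, hi = min(context_rewards), max(context_rewards)
    return (reward - lo) / (hi - lo) if hi > lo else 0.5

# Context A offers 30 vs. 40 points; context B offers 60 vs. 80 points.
best_in_A = relative_value(40, [30, 40])   # -> 1.0
worst_in_B = relative_value(60, [60, 80])  # -> 0.0

# In absolute terms 60 > 40, but a purely relative learner values the locally
# best option of the poor context above the locally worst option of the rich
# context, and so may prefer the 40-point option when the two are paired in a
# new transfer context.
print(best_in_A, worst_in_B)
```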

Computational Modeling

To further understand the underlying behavior, the researchers used computational cognitive models that combined relative and absolute value signals. The best-fitting models generally included the following (a minimal sketch of this model family appears after the list):

  • Relative Encoding: Incorporating both the absolute value of rewards and their relative standing compared to other options.
  • Confirmation Bias: The models updated their expectations differently based on whether the outcome confirmed or disconfirmed prior beliefs, mirroring a kind of confirmation bias seen in humans.
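The sketch below illustrates this family of models, assuming a range-based relative encoding mixed with the absolute reward via a weight, plus separate learning rates for outcomes that confirm versus disconfirm the current preference. The exact functional forms, parameter values, and fitting procedure in the paper may differ.

```python
OMEGA = 0.5          # weight on the relative (range-normalized) value -- assumed
ALPHA_CONF = 0.30    # learning rate when the outcome confirms current beliefs -- assumed
ALPHA_DISC = 0.10    # learning rate when it disconfirms them -- assumed
GLOBAL_MAX = 100.0   # assumed upper bound of the task's reward scale

def encode(reward, context_min, context_max):
    """Weighted mix of the rescaled absolute reward and its range-normalized
    value within the local choice context."""
    if context_max > context_min:
        rel = (reward - context_min) / (context_max - context_min)
    else:
        rel = 0.5
    return (1 - OMEGA) * (reward / GLOBAL_MAX) + OMEGA * rel

def update(q_values, chosen, encoded_reward):
    """Value update with asymmetric (confirmation-biased) learning rates."""
    delta = encoded_reward - q_values[chosen]
    # A positive prediction error on the chosen option confirms the choice;
    # a negative one disconfirms it.
    alpha = ALPHA_CONF if delta >= 0 else ALPHA_DISC
    q_values[chosen] += alpha * delta
    return q_values

# Example trial: two options valued at 0.4 and 0.5; option 1 is chosen and
# pays 60 points in a context whose rewards range from 60 to 80.
q = update([0.4, 0.5], chosen=1, encoded_reward=encode(60, 60, 80))
print(q)
```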

Hidden States Analysis

An additional interesting finding came from examining a raw, pretrained model that had not been fine-tuned (Gemma-7b). Signatures of relative value processing were detectable in its final hidden layer activations, indicating that such biases are not necessarily introduced during fine-tuning but may already emerge from pretraining on massive datasets.
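The snippet below is a rough sketch of this kind of hidden-state analysis: feed a bandit-style prompt to the pretrained model and read out the final hidden layer at the last token, which can then be probed (for example, with a linear regression across trials) for absolute versus relative value signals. The prompt text and probing details here are illustrative, not the paper's exact procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-7b"  # raw pretrained model, not instruction-tuned
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

# An illustrative bandit-style prompt (not the paper's exact wording).
prompt = "Machine F gave 40 points. Machine J gave 60 points. I choose machine"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Final hidden layer, last token position: one vector per prompt that can be
# regressed against the options' absolute and relative values across trials.
final_hidden = outputs.hidden_states[-1][0, -1, :]
print(final_hidden.shape)
```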

Implications and Future Directions

The paper offers several important takeaways and sets the stage for future research:

  • Decision-Making Applications: The finding that LLMs exhibit human-like value biases matters for deploying these models in real-world decision-making scenarios, where relative encoding can impair generalization to new sets of options.
  • Fine-Tuning vs. Pretraining: Because the bias appears even in a pretrained model, mitigation strategies cannot focus on fine-tuning alone.
  • Mitigation Strategies: Future research should explore methods to counteract these biases, potentially through different prompting strategies or architectural changes.

Exploring the intricate behavior of LLMs not only aids in improving their performance in various tasks but also deepens our understanding of how these models compare to human cognition.

Final Thoughts

The paper sheds light on the nuanced behavior of LLMs in reinforcement learning tasks, emphasizing the importance of understanding and potentially mitigating human-like biases for better autonomous decision-making. As LLMs continue to evolve, keeping an eye on such findings will be crucial to responsibly harness their capabilities.

Authors: William M. Hayes, Nicolas Yax, Stefano Palminteri