CogBench: a large language model walks into a psychology lab (2402.18225v1)

Published 28 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have significantly advanced the field of artificial intelligence. Yet, evaluating them comprehensively remains challenging. We argue that this is partly due to the predominant focus on performance metrics in most benchmarks. This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments. This novel approach offers a toolkit for phenotyping LLMs' behavior. We apply CogBench to 35 LLMs, yielding a rich and diverse dataset. We analyze this data using statistical multilevel modeling techniques, accounting for the nested dependencies among fine-tuned versions of specific LLMs. Our study highlights the crucial role of model size and reinforcement learning from human feedback (RLHF) in improving performance and aligning with human behavior. Interestingly, we find that open-source models are less risk-prone than proprietary models and that fine-tuning on code does not necessarily enhance LLMs' behavior. Finally, we explore the effects of prompt-engineering techniques. We discover that chain-of-thought prompting improves probabilistic reasoning, while take-a-step-back prompting fosters model-based behaviors.
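The abstract's mention of "statistical multilevel modeling techniques, accounting for the nested dependencies among fine-tuned versions of specific LLMs" refers to the core idea of partial pooling: estimates for a base-model family with few fine-tuned variants are shrunk toward the grand mean across all families. The sketch below is an illustrative toy version of that idea only, not the authors' actual analysis; all function names, scores, and model-family labels are hypothetical.

```python
def partial_pool(scores_by_family, k=2.0):
    """Shrink each family's mean toward the grand mean (toy partial pooling).

    Families with few observations are pulled more strongly toward the
    grand mean -- the central intuition behind multilevel (mixed) models.
    Here k stands in for the ratio of within- to between-family variance.
    """
    all_scores = [s for scores in scores_by_family.values() for s in scores]
    grand_mean = sum(all_scores) / len(all_scores)
    pooled = {}
    for family, scores in scores_by_family.items():
        n = len(scores)
        family_mean = sum(scores) / n
        w = n / (n + k)  # more observations -> less shrinkage
        pooled[family] = w * family_mean + (1 - w) * grand_mean
    return pooled

# Hypothetical behavioral-metric scores for fine-tuned variants,
# nested within two (fictional) base-model families
scores = {
    "base-model-A": [0.62, 0.58, 0.65, 0.60],
    "base-model-B": [0.40],  # single variant: estimate shrinks most
}
print(partial_pool(scores))
```

In a full analysis one would instead fit a proper mixed-effects model (e.g. with random intercepts per base model), but the shrinkage behavior shown here is the mechanism that lets such models respect the nesting of fine-tuned variants within base models.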
