Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament
Abstract: Accurately predicting the future would be an important milestone in the capabilities of artificial intelligence. However, research on the ability of LLMs to provide probabilistic predictions about future events remains nascent. To empirically test this ability, we enrolled OpenAI's state-of-the-art LLM, GPT-4, in a three-month forecasting tournament hosted on the Metaculus platform. The tournament, running from July to October 2023, attracted 843 participants and covered diverse topics including Big Tech, U.S. politics, viral outbreaks, and the Ukraine conflict. Focusing on binary forecasts, we show that GPT-4's probabilistic forecasts are significantly less accurate than the median human-crowd forecasts. In fact, GPT-4's forecasts did not significantly differ from the no-information strategy of assigning a 50% probability to every question. We explore one potential explanation, namely that GPT-4 is predisposed to predict probabilities close to the midpoint of the scale, but our data do not support this hypothesis. Overall, GPT-4 significantly underperforms median human-crowd forecasts on real-world predictive tasks. A potential explanation for this underperformance is that in real-world forecasting tournaments the true answers are genuinely unknown at the time of prediction, unlike in benchmark tasks such as professional exams or time-series forecasting, where strong performance may at least partly reflect answers memorized from the training data. This makes real-world forecasting tournaments an ideal environment for testing the generalized reasoning and prediction capabilities of artificial intelligence going forward.
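The accuracy comparison described above is standardly scored with the Brier score (Brier, 1950): the mean squared error between a predicted probability and the realized binary outcome, where the 50% no-information strategy scores exactly 0.25 by construction. The sketch below illustrates this scoring rule; the forecast values are hypothetical and are not the tournament data.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and
    realized binary outcomes (1 = event occurred). Lower is better."""
    assert len(probs) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Illustrative forecasts (hypothetical, not the paper's data).
outcomes = [1, 0, 0, 1, 0]
model_probs = [0.55, 0.45, 0.50, 0.60, 0.40]  # near-midpoint forecasts
baseline_probs = [0.5] * len(outcomes)        # no-information strategy

print(brier_score(model_probs, outcomes))
print(brier_score(baseline_probs, outcomes))  # 0.25 by construction
```

A forecaster that clusters near the midpoint, as in `model_probs` above, can only beat the 0.25 baseline by a small margin even when its directional leans are correct, which is why the paper's test of midpoint-clustering is a natural hypothesis to examine.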