
Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament (2310.13014v1)

Published 17 Oct 2023 in cs.CY, cs.AI, cs.CL, and cs.LG

Abstract: Accurately predicting the future would be an important milestone in the capabilities of artificial intelligence. However, research on the ability of LLMs to provide probabilistic predictions about future events remains nascent. To empirically test this ability, we enrolled OpenAI's state-of-the-art LLM, GPT-4, in a three-month forecasting tournament hosted on the Metaculus platform. The tournament, running from July to October 2023, attracted 843 participants and covered diverse topics including Big Tech, U.S. politics, viral outbreaks, and the Ukraine conflict. Focusing on binary forecasts, we show that GPT-4's probabilistic forecasts are significantly less accurate than the median human-crowd forecasts. We find that GPT-4's forecasts did not significantly differ from the no-information forecasting strategy of assigning a 50% probability to every question. We explore a potential explanation, that GPT-4 might be predisposed to predict probabilities close to the midpoint of the scale, but our data do not support this hypothesis. Overall, we find that GPT-4 significantly underperforms in real-world predictive tasks compared to median human-crowd forecasts. A potential explanation for this underperformance is that in real-world forecasting tournaments, the true answers are genuinely unknown at the time of prediction; unlike in other benchmark tasks like professional exams or time series forecasting, where strong performance may at least partly be due to the answers being memorized from the training data. This makes real-world forecasting tournaments an ideal environment for testing the generalized reasoning and prediction capabilities of artificial intelligence going forward.

Evaluating the Predictive Abilities of LLMs Through a Real-World Forecasting Tournament

The paper "LLM Prediction Capabilities: Evidence from a Real-World Forecasting Tournament" presents a rigorous assessment of GPT-4's forecasting capabilities in comparison to a median human crowd. Despite the potential of LLMs in varied domains, the paper reveals significant underperformance of GPT-4 in probabilistic predictions when placed in a real-world forecasting context on the Metaculus platform.

Methodology

The research involved enrolling GPT-4 in a forecasting tournament from July to October 2023. This setup provided a natural test environment to evaluate its forecasting prowess on a diverse set of binary questions across topics such as Big Tech, U.S. politics, and global conflicts. The intention was to circumvent issues of training-data memorization by testing the model in an environment where answers were unknown at prediction time.
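
The paper's exact prompts and elicitation pipeline are not reproduced here; the sketch below is a minimal illustration of how a binary probabilistic forecast might be elicited from GPT-4 via the OpenAI chat completions API. The helper name, prompt wording, and number-parsing heuristic are illustrative assumptions, not the authors' method.

```python
# Minimal sketch of eliciting a binary probabilistic forecast from GPT-4.
# Assumes the `openai` Python client; the prompt wording is illustrative,
# not the paper's actual template.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def elicit_forecast(question: str) -> float:
    """Ask the model for a probability in [0, 1] for a binary question."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are a careful forecaster. Answer with a single "
                        "probability between 0 and 1 that the event occurs."},
            {"role": "user", "content": question},
        ],
        temperature=0,  # reduce sampling variance across runs
    )
    text = response.choices[0].message.content
    match = re.search(r"\d*\.?\d+", text)  # pull the first number from the reply
    if match is None:
        raise ValueError(f"No probability found in reply: {text!r}")
    return min(max(float(match.group()), 0.0), 1.0)  # clamp to [0, 1]
```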

Key Findings

  1. Performance Comparison: GPT-4's probabilistic forecasts were significantly less accurate than the median human-crowd forecasts. A Brier score analysis (a worked example follows this list) showed that GPT-4's predictions did not differ statistically from a no-information baseline, i.e., it failed to beat the strategy of assigning a 50% probability to every question.
  2. Directional Accuracy: GPT-4 was directionally correct, i.e., on the right side of 50%, in 69.57% of its forecasts, well below the human crowd's 95.65%.
  3. Potential Conservatism: The paper explored whether GPT-4 exhibited a tendency towards mid-range probability estimates. A coefficient-of-variation analysis was descriptively suggestive, but statistical tests found no significant difference in variance, so the data do not support this hypothesis.
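
The Brier score underlying the first comparison is the mean squared error between forecast probabilities and binary outcomes; lower is better, and the 50% no-information strategy scores exactly 0.25. A minimal worked example in Python, using illustrative numbers rather than the paper's data:

```python
# Brier score for binary forecasts: mean squared error between the forecast
# probability f_i and the realized outcome o_i (1 if the event occurred,
# 0 otherwise). Lower is better; assigning 50% everywhere scores 0.25.
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    assert len(forecasts) == len(outcomes)
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Illustrative numbers only (not the paper's data):
print(brier_score([0.5, 0.5, 0.5], [1, 0, 1]))  # 0.25, the no-information baseline
print(brier_score([0.9, 0.2, 0.8], [1, 0, 1]))  # 0.03, a well-calibrated forecaster
```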

Theoretical and Practical Implications

The findings underscore the current limitations in LLMs' ability to generalize probabilistically to out-of-distribution scenarios. GPT-4's underwhelming performance highlights a crucial gap in its application to domains requiring future event predictions—fields that are economically significant, such as policy-making and strategic planning.

From a theoretical angle, the paper reinforces the importance of distinguishing genuine reasoning capabilities from memorization within AI systems. This differentiation is vital for evaluating artificial intelligence's potential across complex, real-world tasks, moving beyond simplistic question-answer settings often used in benchmarks.

Future Directions

Several avenues for future research arise from these results:

  • Improving Real-Time Information Access: Addressing the knowledge cutoff within LLMs by embedding mechanisms for real-time information updating without human intervention.
  • Harnessing Diverse Model Ensembles: Utilizing multiple LLM instances across varied configurations and datasets may help emulate a wisdom-of-the-crowds effect, potentially improving forecast accuracy (a minimal sketch follows this list).
  • Refining Aggregation Techniques: The paper suggests potential in Bayesian Model Averaging for combining machine and human forecasts, although such techniques will require enhancement to effectively incorporate LLMs.
  • Exploring Hybrid Prediction Models: Investigating systems combining human intuition with LLM outputs may lead to superior forecasting capabilities, fostering synergy between human and machine cognition.
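
Neither an ensemble nor a hybrid pipeline is implemented in the paper; the sketch below only illustrates the two aggregation ideas from the list above: a median over several independently configured model forecasters, mirroring how the human crowd is aggregated, and a toy linear pool combining a human-crowd median with a model forecast. Both function names and the 0.8 weight are illustrative assumptions.

```python
# Illustrative sketch only: neither function comes from the paper.
from statistics import median
from typing import Callable

def ensemble_forecast(question: str,
                      forecasters: list[Callable[[str], float]]) -> float:
    """Wisdom-of-the-crowds style aggregation: take the median of several
    independently configured model forecasters (e.g. different prompts,
    temperatures, or base models)."""
    return median(f(question) for f in forecasters)

def hybrid_forecast(human_median: float, model_prob: float,
                    weight: float = 0.8) -> float:
    """Toy linear pool combining a human-crowd median with a model forecast.
    The 0.8 weight on the human crowd is an arbitrary illustrative choice,
    not a value from the paper."""
    return weight * human_median + (1 - weight) * model_prob
```

A median is robust to any single miscalibrated forecaster, which is part of why crowd aggregates are hard to beat and why it is a natural first choice for machine ensembles as well.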

Conclusion

While GPT-4 shows impressive abilities across many tasks, real-world forecasting remains a capability requiring further development. This limitation points to an opportunity for advancing AI systems that can competently handle prediction-based applications. Ultimately, the paper provides critical insights to guide both the progression and the deployment of LLMs in real-world, economically relevant scenarios.

Authors

  1. Philipp Schoenegger
  2. Peter S. Park