
Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy (2402.19379v6)

Published 29 Feb 2024 in cs.CY, cs.AI, cs.CL, and cs.LG

Abstract: Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of LLMs suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human crowd forecasting tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of twelve LLMs. We compare the aggregated LLM predictions on 31 binary questions to that of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark and is not statistically different from the human crowd. In exploratory analyses, we find that these two approaches are equivalent with respect to medium-effect-size equivalence bounds. We also observe an acquiescence effect, with mean model predictions being significantly above 50%, despite an almost even split of positive and negative resolutions. Moreover, in Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%: though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of human crowd forecasting tournaments: via the simple, practically applicable method of forecast aggregation. This replicates the 'wisdom of the crowd' effect for LLMs, and opens up their use for a variety of applications throughout society.

Exploring the Forecasting Prowess of the Silicon Crowd

Introduction to Ensemble LLMs in Forecasting

LLMs have made significant strides in capability, and ensembling diverse models offers a way to imitate the human 'wisdom of the crowd' phenomenon. This approach has now been rigorously tested against human forecasting accuracy, showing that LLMs can match human crowd performance in probabilistic forecasting. Through two distinct but interconnected studies, the researchers examine the efficacy of the ensemble method and the influence of human-derived forecasts on LLM predictions.

Study 1: LLM Ensemble Versus Human Crowds

In Study 1, an ensemble of twelve LLMs was compared against the aggregated predictions of 925 human forecasters on 31 binary questions from a three-month forecasting tournament. The critical findings include the following (a minimal sketch of the aggregate-and-score procedure appears after the list):

  • The LLM ensemble outperformed a basic no-information benchmark and was statistically indistinguishable from the human crowd's forecasting accuracy.
  • An acquiescence effect was observed: the LLMs tended toward predictions above the 50% mark despite a close-to-even split of actual outcomes. This inclination toward affirmative answers echoes a human response bias, but it did not detract from overall predictive accuracy.
  • Accuracy varied across individual LLMs, yet no single model's weakness statistically undermined the ensemble's performance, suggesting broad robustness across model architectures and training regimes.
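To make the aggregation step concrete, the sketch below shows the basic "silicon crowd" recipe in Python: collect one probability per model for a binary question, combine them into a single crowd forecast, and score it with the Brier score standard in the forecasting literature. The median aggregation rule and all numbers are illustrative assumptions, not the paper's exact procedure.

```python
import statistics

def aggregate_forecasts(model_probs):
    """Combine one probability per model into a single crowd forecast.

    The median is a common robust aggregator; the paper's exact
    aggregation rule may differ (this choice is an assumption).
    """
    return statistics.median(model_probs)

def brier_score(prob, outcome):
    """Brier (1950) score for a binary event: (p - outcome)^2, lower is better."""
    return (prob - outcome) ** 2

# Hypothetical forecasts from twelve models on one question that resolved 'no' (0).
model_probs = [0.62, 0.55, 0.70, 0.48, 0.65, 0.58, 0.51, 0.60, 0.66, 0.54, 0.59, 0.63]
crowd_prob = aggregate_forecasts(model_probs)
print(f"LLM crowd forecast: {crowd_prob:.2f}")
print(f"Brier score: {brier_score(crowd_prob, 0):.3f}")
```

Because the Brier score is a strictly proper scoring rule, comparing mean Brier scores of the LLM crowd and the human crowd across all 31 questions is what grounds the equivalence claim above.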

Study 2: Integrating Human Cognitive Outputs

Building on this, Study 2 examined whether LLM predictions (from GPT-4 and Claude 2) can be improved by drawing on human cognitive output. Key results include the following (a sketch of human-machine blending appears after the list):

  • Both tested models, GPT-4 and Claude 2, exhibited improved forecasting accuracy upon exposure to the human crowd's median prediction, with gains of between 17% and 28%.
  • Exposure to the human forecasts also narrowed the models' prediction intervals relative to their initial uncertainty ranges, indicating refined prediction confidence.
  • The size of a model's update was proportional to how far its initial forecast deviated from the human median, showing that the models integrate external human-derived information in a graded rather than all-or-nothing fashion.
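The abstract notes that simply averaging human and machine forecasts was more accurate than prompting the models with the human median. The sketch below shows that simple blend; the equal 0.5 weighting and all numbers are illustrative assumptions.

```python
def blend_forecasts(llm_prob, human_median, weight=0.5):
    """Linearly blend an LLM forecast with the human crowd median.

    weight=0.5 reproduces the simple human-machine average that the
    abstract reports as more accurate than exposure-based updating;
    the weighting scheme itself is an illustrative assumption.
    """
    return weight * llm_prob + (1 - weight) * human_median

# Hypothetical question that resolved 'no' (0), with the model far from the crowd.
llm_prob, human_median, outcome = 0.80, 0.40, 0
blended = blend_forecasts(llm_prob, human_median)
print(f"Blended forecast: {blended:.2f}")
print(f"Brier score before: {(llm_prob - outcome) ** 2:.3f}")
print(f"Brier score after:  {(blended - outcome) ** 2:.3f}")
```

In this toy case the blend halves the distance to the human median and substantially reduces the Brier score, which is the intuition behind hybrid human-machine aggregation.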

Implications and Future Directions

Taken together, these studies both mark a significant benchmark in LLM capabilities and open avenues for practical application and further academic inquiry:

  • Practical Applications: The demonstrated equivalence in forecasting accuracy between LLM ensembles and human crowds, despite a noted positive bias in LLM predictions, introduces cost-effective, scalable alternatives to traditional human-driven forecasting tournaments.
  • Calibration and Bias: Despite their accuracy, the LLMs showed imperfect calibration and a notable acquiescence bias. Addressing these issues could enhance the reliability and applicability of LLM-driven forecasts across domains (one standard corrective from the forecasting literature is sketched after this list).
  • Integration of Human-AI Forecasts: The second study's insights into the dynamics of combining human and LLM forecasts spotlight the potential for hybrid forecasting models that leverage both human intuition and LLM processing strengths.
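One standard corrective for under-confident crowd aggregates, drawn from the forecasting literature on extremizing rather than from this paper, is to push the aggregate probability away from 50%. A minimal sketch, with the exponent as a hypothetical tuning parameter:

```python
def extremize(prob, a=2.0):
    """Push an aggregate probability away from 0.5.

    a > 1 extremizes, a = 1 is the identity, a < 1 shrinks toward 0.5.
    The exponent is a hypothetical tuning parameter for illustration.
    """
    num = prob ** a
    return num / (num + (1 - prob) ** a)

print(f"{extremize(0.60):.3f}")  # 0.692: a modestly confident aggregate becomes sharper
```

Note that extremizing alone would amplify, not correct, the acquiescence bias observed here, so in practice it would need to be paired with a debiasing step that recenters forecasts before sharpening them.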

Concluding Thoughts

Through an ensemble approach, LLMs have exhibited a capacity to rival human crowd forecasting accuracy. These findings mark a milestone for artificial intelligence and point toward new interdisciplinary research and application pathways. As LLMs continue to evolve, integrating human cognitive outputs may serve not only to refine predictive accuracy but also to harness the collective strengths of human and machine intelligence, forging a new frontier in forecasting methodology.

Authors (4)
  1. Philipp Schoenegger
  2. Indre Tuminauskaite
  3. Peter S. Park
  4. Philip E. Tetlock