
Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy (2402.19379v6)

Published 29 Feb 2024 in cs.CY, cs.AI, cs.CL, and cs.LG

Abstract: Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of LLMs suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human crowd forecasting tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of twelve LLMs. We compare the aggregated LLM predictions on 31 binary questions to that of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark and is not statistically different from the human crowd. In exploratory analyses, we find that these two approaches are equivalent with respect to medium-effect-size equivalence bounds. We also observe an acquiescence effect, with mean model predictions being significantly above 50%, despite an almost even split of positive and negative resolutions. Moreover, in Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%: though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of human crowd forecasting tournaments: via the simple, practically applicable method of forecast aggregation. This replicates the 'wisdom of the crowd' effect for LLMs, and opens up their use for a variety of applications throughout society.

Exploring the Forecasting Prowess of the Silicon Crowd

Introduction to Ensemble LLMs in Forecasting

LLMs have made significant strides in capability, and ensembling diverse models offers a way to imitate the human 'wisdom of the crowd' phenomenon. This approach has now been rigorously tested against human forecasting accuracy, showing that LLMs can match human crowd performance in probabilistic forecasting. Through two distinct but interconnected studies, the researchers examine the efficacy of the ensemble method and the influence of human-derived forecasts on LLM predictions.

Study 1: LLM Ensemble Versus Human Crowds

In Study 1, an ensemble of twelve LLMs was compared against the aggregated predictions of 925 human forecasters on 31 binary questions from a three-month forecasting tournament. The critical findings include the following (a minimal sketch of the aggregate-and-score procedure appears after the list):

  • The LLM ensemble outperformed a basic no-information benchmark and was statistically indistinguishable from the human crowd's forecasting accuracy.
  • An acquiescence effect was observed: the LLMs tended toward predictions above the 50% mark despite a close-to-even split of actual outcomes. This inclination toward affirmative answers echoes a human response bias, but it did not detract from overall predictive accuracy.
  • Accuracy varied across individual LLMs, yet no single model's weakness statistically undermined the ensemble's performance, suggesting broad robustness across model architectures and training regimes.
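To make the aggregation step concrete, the sketch below shows the basic "silicon crowd" recipe in Python: collect one probability per model for a binary question, combine them into a single crowd forecast, and score it with the Brier score standard in the forecasting literature. The median aggregation rule and all numbers are illustrative assumptions, not the paper's exact procedure.

```python
import statistics

def aggregate_forecasts(model_probs):
    """Combine one probability per model into a single crowd forecast.

    The median is a common robust aggregator; the paper's exact
    aggregation rule may differ (this choice is an assumption).
    """
    return statistics.median(model_probs)

def brier_score(prob, outcome):
    """Brier (1950) score for a binary event: (p - outcome)^2, lower is better."""
    return (prob - outcome) ** 2

# Hypothetical forecasts from twelve models on one question that resolved 'no' (0).
model_probs = [0.62, 0.55, 0.70, 0.48, 0.65, 0.58, 0.51, 0.60, 0.66, 0.54, 0.59, 0.63]
crowd_prob = aggregate_forecasts(model_probs)
print(f"LLM crowd forecast: {crowd_prob:.2f}")
print(f"Brier score: {brier_score(crowd_prob, 0):.3f}")
```

Because the Brier score is a strictly proper scoring rule, comparing mean Brier scores of the LLM crowd and the human crowd across all 31 questions is what grounds the equivalence claim above.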

Study 2: Integrating Human Cognitive Outputs

Building on this, Study 2 examined whether LLM predictions (from GPT-4 and Claude 2) can be improved by drawing on human cognitive output. Key results include the following (a sketch of human-machine blending appears after the list):

  • Both tested models, GPT-4 and Claude 2, exhibited improved forecasting accuracy upon exposure to the human crowd's median prediction, with gains of between 17% and 28%.
  • Exposure to the human forecasts also narrowed the models' prediction intervals relative to their initial uncertainty ranges, indicating refined prediction confidence.
  • The size of a model's update was proportional to how far its initial forecast deviated from the human median, showing that the models integrate external human-derived information in a graded rather than all-or-nothing fashion.
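The abstract notes that simply averaging human and machine forecasts was more accurate than prompting the models with the human median. The sketch below shows that simple blend; the equal 0.5 weighting and all numbers are illustrative assumptions.

```python
def blend_forecasts(llm_prob, human_median, weight=0.5):
    """Linearly blend an LLM forecast with the human crowd median.

    weight=0.5 reproduces the simple human-machine average that the
    abstract reports as more accurate than exposure-based updating;
    the weighting scheme itself is an illustrative assumption.
    """
    return weight * llm_prob + (1 - weight) * human_median

# Hypothetical question that resolved 'no' (0), with the model far from the crowd.
llm_prob, human_median, outcome = 0.80, 0.40, 0
blended = blend_forecasts(llm_prob, human_median)
print(f"Blended forecast: {blended:.2f}")
print(f"Brier score before: {(llm_prob - outcome) ** 2:.3f}")
print(f"Brier score after:  {(blended - outcome) ** 2:.3f}")
```

In this toy case the blend halves the distance to the human median and substantially reduces the Brier score, which is the intuition behind hybrid human-machine aggregation.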

Implications and Future Directions

Taken together, these studies both mark a significant benchmark in LLM capabilities and open avenues for practical application and further academic inquiry:

  • Practical Applications: The demonstrated equivalence in forecasting accuracy between LLM ensembles and human crowds, despite a noted positive bias in LLM predictions, introduces cost-effective, scalable alternatives to traditional human-driven forecasting tournaments.
  • Calibration and Bias: Despite their accuracy, the LLMs showed imperfect calibration and a notable acquiescence bias. Addressing these issues could enhance the reliability and applicability of LLM-driven forecasts across domains (one standard corrective from the forecasting literature is sketched after this list).
  • Integration of Human-AI Forecasts: The second study's insights into the dynamics of combining human and LLM forecasts spotlight the potential for hybrid forecasting models that leverage both human intuition and LLM processing strengths.
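One standard corrective for under-confident crowd aggregates, drawn from the forecasting literature on extremizing rather than from this paper, is to push the aggregate probability away from 50%. A minimal sketch, with the exponent as a hypothetical tuning parameter:

```python
def extremize(prob, a=2.0):
    """Push an aggregate probability away from 0.5.

    a > 1 extremizes, a = 1 is the identity, a < 1 shrinks toward 0.5.
    The exponent is a hypothetical tuning parameter for illustration.
    """
    num = prob ** a
    return num / (num + (1 - prob) ** a)

print(f"{extremize(0.60):.3f}")  # 0.692: a modestly confident aggregate becomes sharper
```

Note that extremizing alone would amplify, not correct, the acquiescence bias observed here, so in practice it would need to be paired with a debiasing step that recenters forecasts before sharpening them.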

Concluding Thoughts

Through an ensemble approach, LLMs have exhibited a capacity to rival human crowd forecasting accuracy. These findings mark a milestone for artificial intelligence and point toward new interdisciplinary research and application pathways. As LLMs continue to evolve, integrating human cognitive outputs may serve not only to refine predictive accuracy but also to harness the collective strengths of human and machine intelligence, forging a new frontier in forecasting methodology.

Authors (4)
  1. Philipp Schoenegger
  2. Indre Tuminauskaite
  3. Peter S. Park
  4. Philip E. Tetlock