
Approaching Human-Level Forecasting with Language Models (2402.18563v1)

Published 28 Feb 2024 in cs.LG, cs.AI, cs.CL, and cs.IR

Abstract: Forecasting future events is important for policy and decision making. In this work, we study whether LLMs (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it. Our work suggests that using LMs to forecast the future could provide accurate predictions at scale and help to inform institutional decision making.

Approaching Human-Level Forecasting with LLMs

Introduction to Automated Forecasting

In forecasting, the traditional dichotomy is between statistical forecasting and judgmental forecasting, the latter relying heavily on human expertise to factor in domain-specific knowledge, intuition, and context. This paper applies LLMs (LMs) to judgmental forecasting, harnessing their broad pre-trained knowledge and reasoning capabilities. By developing a retrieval-augmented forecasting system, the authors automate the generation, weighing, and synthesis of forecasts that traditionally required human intervention.

Methodology and System Design

The proposed system has three integral components: retrieval, reasoning, and aggregation. The retrieval component sources relevant news articles to inform the forecast, keeping the model current with events after its training-data cut-off. In the reasoning step, the system, conditioned on the retrieved articles, generates probabilistic forecasts along with their justifications. Finally, an aggregation step synthesizes these individual outputs into a single prediction. The system also incorporates a self-supervised fine-tuning approach that improves the model's forecasting accuracy and reasoning fidelity by iteratively training on real-world forecasting questions from competitive platforms.
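The retrieve-reason-aggregate loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `retrieve` and `reason` are hypothetical stand-ins for the news-search and LM components, and the median is one simple robust aggregator; the paper's actual prompts and aggregation scheme may differ.

```python
from statistics import median

def forecast(question, retrieve, reason, k=5):
    """Sketch of the three-stage pipeline: retrieve relevant
    articles, sample k independent probabilistic forecasts with
    rationales from the LM, then aggregate into one prediction.

    `retrieve` and `reason` are placeholders for the system's
    news-retrieval and reasoning components (assumptions, not
    the paper's exact interfaces).
    """
    articles = retrieve(question)                      # gather context
    probs = [reason(question, articles) for _ in range(k)]  # sample forecasts
    # Median is a robust aggregate over the sampled probabilities.
    return median(probs)
```

In practice the reasoning step would be a prompted LM call returning a probability in [0, 1]; sampling several forecasts and taking a central aggregate damps out individual noisy rationales.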

Data Collection and Evaluation

A purpose-built dataset of forecasting questions, sourced from competitive forecasting platforms, is used to validate and fine-tune the system. Importantly, the test questions were published after the knowledge cut-offs of the pre-trained models, ensuring an authentic testing ground for the system's forecasting capabilities. Evaluated with the Brier score, the system demonstrates near-human performance and, under certain conditions, surpasses the aggregated human forecast.
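For concreteness, the Brier score used for evaluation is just the mean squared error between probabilistic forecasts and binary outcomes:

```python
def brier_score(forecasts, outcomes):
    """Brier (1950) score: mean squared error between probability
    forecasts and binary outcomes. Lower is better; an uninformed
    constant 0.5 forecast scores 0.25."""
    assert len(forecasts) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Two confident correct calls and one hedged correct call:
brier_score([0.9, 0.2, 0.6], [1, 0, 1])  # ≈ 0.07
```

Being a strictly proper scoring rule, the Brier score rewards reporting one's true belief, which is what makes it a fair yardstick for comparing the system against the human crowd aggregate.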

Strengths, Limitations, and Future Directions

The evaluation highlights notable strengths of the system, particularly on questions with high uncertainty among human forecasters or where ample relevant information could be retrieved. Conversely, the system degrades when forced to predict without sufficient context, or on topics heavily dependent on events after its training cut-off. These findings motivate further work on iterative self-supervision, domain-adaptive training, and leveraging future LM generations for improved forecasting.

Conclusion and Implications

This paper advances the case for LMs in automating judgmental forecasting, offering a scalable and efficient complement to purely human-driven approaches. The implications for policy-making, business strategy, and decision-making more broadly are substantial: accurate automated forecasting could support informed decisions at a scale human forecasters cannot match. The research sets a promising trajectory for refining these systems, with the prospect of matching or exceeding human forecasting performance across a wider range of contexts.

Authors (4)
  1. Danny Halawi (6 papers)
  2. Fred Zhang (15 papers)
  3. Chen Yueh-Han (5 papers)
  4. Jacob Steinhardt (88 papers)
Citations (22)