Approaching Human-Level Forecasting with Language Models
Abstract: Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. On a test set of questions published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it. Our work suggests that using LMs to forecast the future could provide accurate predictions at scale and help to inform institutional decision making.
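The abstract describes the pipeline at a high level: retrieve relevant information for a question, prompt an LM for a probability, aggregate several such forecasts, and score the result against resolved outcomes. The sketch below illustrates that flow under stated assumptions only; `retrieve_articles` and `lm_forecast` are hypothetical stubs standing in for a news-search call and an LM call, and the median aggregator and Brier score shown are one standard choice for this kind of evaluation, not necessarily the authors' exact configuration.

```python
from statistics import median


def retrieve_articles(question: str, k: int = 5) -> list[str]:
    """Hypothetical stub: a real system would query a news search API
    and return summaries of the k most relevant articles."""
    return [f"[summary {i} for: {question}]" for i in range(k)]


def lm_forecast(question: str, context: list[str]) -> float:
    """Hypothetical stub: a real system would prompt an LM with the
    question plus retrieved context and parse a probability in [0, 1]
    from its response."""
    return 0.5  # placeholder value; stands in for a model's estimate


def ensemble_forecast(question: str, n_samples: int = 5) -> float:
    """Aggregate several independent LM forecasts. The median is one
    simple aggregator that damps outlier predictions."""
    context = retrieve_articles(question)
    return median(lm_forecast(question, context) for _ in range(n_samples))


def brier_score(prob: float, outcome: int) -> float:
    """Brier score for a binary question: squared error between the
    forecast probability and the 0/1 resolution. Lower is better;
    an uninformative 0.5 forecast always scores 0.25."""
    return (prob - outcome) ** 2


if __name__ == "__main__":
    q = "Will event X occur before 2025-01-01?"
    p = ensemble_forecast(q)
    print(f"forecast={p:.2f}; Brier if resolved YES: {brier_score(p, 1):.3f}")
```

The same scoring function applies to a crowd aggregate, so system and human forecasts can be compared on identical question sets, as the abstract describes.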