LLM-Generated Economic Forecasts
- LLM-generated economic forecasts are predictions produced by transformer-based models that integrate diverse data sources such as news, filings, and geospatial signals.
- They improve traditional methods by employing semantic path models, ensemble calibration, and multi-agent simulations to enhance interpretability and accuracy.
- Challenges include addressing data memorization, bias control, and ensuring genuine out-of-sample predictive power, driving ongoing research in feature engineering and calibration.
LLMs have become pervasive in economic forecasting over the past several years, providing new approaches to harnessing textual, numeric, and behavioral data for predictive analytics. LLM-generated economic forecasts typically refer to predictions about macroeconomic indicators, financial variables, or sectoral trends produced directly or indirectly by transformer-based models trained on vast corpora of news, financial filings, survey data, and other sources. These approaches have been evaluated and refined across multiple technical dimensions, including interpretability, calibration, bias control, ensemble combination, novel uncertainty quantification, agent-based behavioral simulation, and integration with high-dimensional data sources.
1. Origins and Model Architectures
LLM-generated economic forecasts have evolved from basic machine learning models using bag-of-words or vector embeddings to sophisticated architectures leveraging the full context sensitivity and generative capabilities of large-scale transformers. Early approaches extended traditional autoregressive time series models by integrating financial news as high-dimensional input features. For example, the semantic path model projects word counts from regulatory filings onto interpretable latent semantic structures—such as positivity, negativity, or uncertainty—and aggregates them via a regularized linear model. These innovations address issues of overfitting and low interpretability by enforcing domain-specific grouping and penalization schemes, allowing forecasts to be decomposed into the contributions of individual latent semantic constructs (Feuerriegel et al., 2018).
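A stylized sketch of this decomposition idea (the word groups, documents, and targets below are invented for illustration and are not taken from Feuerriegel et al.): word counts are pooled into construct scores, a ridge regression is fit over those scores, and each fitted forecast then splits additively into labeled per-construct contributions.

```python
import numpy as np

# Hypothetical word groups defining three latent semantic constructs.
GROUPS = {
    "positivity": ["growth", "gain", "strong"],
    "negativity": ["loss", "decline", "weak"],
    "uncertainty": ["may", "could", "risk"],
}

def construct_scores(counts):
    """Project raw word counts onto the latent constructs (group sums)."""
    return np.array(
        [sum(counts.get(w, 0) for w in ws) for ws in GROUPS.values()],
        dtype=float,
    )

def fit_ridge(X, y, lam=1.0):
    """Regularized linear model over construct scores."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def decompose_forecast(beta, scores):
    """Per-construct contributions; their sum is the point forecast."""
    contrib = beta * scores
    return dict(zip(GROUPS, contrib)), contrib.sum()

# Toy filings: word counts per document, paired with a next-period outcome.
docs = [
    {"growth": 3, "gain": 1, "risk": 2},
    {"loss": 2, "weak": 1, "may": 1},
    {"strong": 2, "could": 3},
]
y = np.array([0.8, -0.5, 0.3])
X = np.vstack([construct_scores(d) for d in docs])
beta = fit_ridge(X, y)
contrib, forecast = decompose_forecast(beta, construct_scores({"growth": 2, "risk": 1}))
```

Because the model is linear in the construct scores, every forecast splits exactly into the labeled contributions, which is what makes the decomposition interpretable.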
Recent work has further shifted toward utilizing pre-trained and instruction-tuned LLMs (GPT, Claude, Gemini, Moirai, TimesFM, etc.) for direct economic forecasting in tasks such as time-series prediction, “surveying” the LLM’s expectations, and simulating agent behaviors. These models are capable of ingesting diverse modalities—including text sequences, tables, satellite imagery, and web-scraped real-time data—and are evaluated both in zero-shot and fine-tuned configurations (Carriero et al., 1 Jul 2024, Ahn et al., 17 Jul 2025).
2. Feature Engineering and Interpretability
LLMs have enabled new forms of feature engineering in economic forecasting, moving beyond traditional dimensionality reduction techniques (e.g., PCA, LSA). Semantic path models and chain-of-thought prompting approaches allow words and phrases to be mapped onto domain-specific semantic features, which are then aggregated according to supervised weights optimized for covariation with macroeconomic outcomes. For instance, financial news can be projected onto a latent variable representing “market optimism,” with tf–idf weighted word counts grouped and supervised weights estimated to maximize predictive power (Feuerriegel et al., 2018).
Approaches like GeoReg use LLMs to act as “data engineers” by categorizing features derived from satellite imagery and geospatial information as positive, negative, mixed, or irrelevant for the target socio-economic indicator. These categorizations are enforced as sign constraints in linear regression models, improving interpretability and generalizing well to few-shot and data-poor contexts (Ahn et al., 17 Jul 2025). Sparse autoencoder “brain scan” techniques offer an interpretable layer that maps internal residual streams of LLMs to English-labeled economic concepts (e.g., sentiment, technical analysis, timing), enabling post-hoc steering and granular bias correction without diminishing predictive performance (Chen et al., 29 Aug 2025).
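A minimal sketch of the sign-constraint idea (the feature names, data, and the projected-gradient solver are illustrative assumptions, not the GeoReg implementation): the LLM's categorization is encoded as one sign per coefficient and enforced by projecting after each gradient step.

```python
import numpy as np

def sign_constrained_ols(X, y, signs, lr=0.01, steps=5000):
    """Least squares with per-coefficient sign constraints enforced by
    projection: '+' keeps beta_j >= 0, '-' keeps beta_j <= 0, and
    '0' pins a feature judged irrelevant to exactly zero."""
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(steps):
        beta -= lr * (X.T @ (X @ beta - y)) / n
        for j, s in enumerate(signs):
            if s == "+":
                beta[j] = max(beta[j], 0.0)
            elif s == "-":
                beta[j] = min(beta[j], 0.0)
            else:
                beta[j] = 0.0
    return beta

rng = np.random.default_rng(0)
n = 200
nightlights = rng.normal(size=n)    # hypothetically labeled "+" by the LLM
settlement = rng.normal(size=n)     # labeled "-"
cloud_cover = rng.normal(size=n)    # labeled "0" (irrelevant)
y = 2.0 * nightlights - 1.0 * settlement + rng.normal(0.0, 0.3, size=n)
X = np.column_stack([nightlights, settlement, cloud_cover])
beta = sign_constrained_ols(X, y, ["+", "-", "0"])
```

The constraint acts as a strong prior: even with few observations, coefficients cannot take economically implausible signs, which is where the few-shot robustness comes from.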
3. Ensemble Methods and Forecast Calibration
LLMs have been integrated into ensemble frameworks for combining expert forecasts, particularly in settings with high disagreement or inattentiveness. By prompting the model to assign dynamic weights based on historical accuracy, lagged updating, and recent trends, the LLM-ensemble method improves forecast accuracy for short-term horizons relative to simple averaging. In settings with heterogeneous expert opinions, systematic pattern recognition and behavioral compensation (e.g., for herd behavior or prediction inertia) allow these models to outperform baseline methods for GDP growth, inflation, and unemployment rate forecasts (Ren et al., 29 Jun 2025).
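One simple weighting rule consistent with this idea (a hand-rolled numeric sketch, not the prompting-based scheme of Ren et al.) scores each expert by the inverse of their mean absolute historical error, so recently accurate forecasters dominate the combination:

```python
def inverse_error_weights(past_errors, eps=1e-6):
    """One weight per expert, larger for experts with smaller mean
    absolute historical forecast error; weights sum to 1."""
    inv = [1.0 / (sum(abs(e) for e in errs) / len(errs) + eps)
           for errs in past_errors]
    total = sum(inv)
    return [w / total for w in inv]

def combine(forecasts, weights):
    """Weighted-average point forecast."""
    return sum(w * f for w, f in zip(weights, forecasts))

# Expert A has been roughly twice as accurate as expert B on recent calls.
past_errors = [[0.2, 0.1, 0.3], [0.5, 0.4, 0.6]]
weights = inverse_error_weights(past_errors)
combined = combine([2.1, 2.9], weights)
```

The combined forecast lands closer to the historically more accurate expert than a simple average would, which is the basic mechanism behind the short-horizon gains reported above.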
Log probability aggregation techniques employ the native logprobs of LLM token outputs to compute probabilistic forecasts with uncertainty quantification. By exponentially weighting multiple probability estimates, this method achieves well-calibrated output probabilities and yields improved Brier scores (e.g., 0.186, outperforming random chance and vanilla LLM baselines) (Soru et al., 8 Jan 2025).
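The mechanics can be sketched as follows (the two-token yes/no setup and the geometric weighting scheme are simplifying assumptions; the paper's exact aggregation may differ):

```python
import math

def prob_from_logprobs(logp_yes, logp_no):
    """Renormalize the log-probabilities of the 'Yes'/'No' tokens into P(yes)."""
    py, pn = math.exp(logp_yes), math.exp(logp_no)
    return py / (py + pn)

def exp_weighted_average(probs, alpha=0.7):
    """Geometrically down-weight earlier samples (most recent has weight 1)."""
    weights = [alpha ** (len(probs) - 1 - i) for i in range(len(probs))]
    return sum(w * p for w, p in zip(weights, probs)) / sum(weights)

def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Three sampled responses to the same question, aggregated into one forecast.
samples = [prob_from_logprobs(-0.2, -1.7),
           prob_from_logprobs(-0.4, -1.1),
           prob_from_logprobs(-0.1, -2.4)]
p = exp_weighted_average(samples)
```

Reading probabilities off the token logprobs, rather than asking the model to state a number, is what gives the method access to the model's native uncertainty; the Brier score then measures calibration of the aggregated output.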
4. Behavioral Biases and Agent-Based Simulation
LLM-driven forecasts often display behavioral features analogous to human decision-makers, such as underreaction to new information or extrapolative bias in asset return predictions. Surveying LLMs with financial news reveals expectation patterns that replicate consensus human surveys, including underreaction to shocks and overly optimistic point forecasts disconnected from objective expected returns (Bybee, 2023, Chen et al., 17 Sep 2024). Confidence interval prediction also shows that LLMs tend to be better calibrated in risk assessment than human forecasters, though they often generate conservative tail estimates and display optimism bias.
Agent-based simulation frameworks use multiple LLMs, each assigned distinct cognitive traits or risk preferences, to model heterogeneous economic agents for policy analysis. These multi-agent LLM systems can simulate differing group responses to policy interventions (e.g., interest-income taxation), capturing both objective (income, demographic) and subjective (reasoning, risk aversion) heterogeneities (Hao et al., 24 Feb 2025). Similar techniques are applied to model market uncertainty following central bank communications, with synthetic disagreement from virtual trader LLMs correlating strongly with observed market volatility (e.g., Spearman correlation ~0.5–0.58 at 2-year swap tenor) (Collodel, 19 Aug 2025).
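A toy version of the disagreement signal (agents, shock process, and reaction rule are all invented for illustration, and Spearman's rho is computed directly from ranks): heterogeneous agents react to each policy surprise with different sensitivities, and the cross-agent dispersion of their forecasts is compared against a stylized volatility series.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation (assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def agent_forecast(base_rate, surprise, sensitivity, rng):
    """Stylized virtual trader: reacts to a policy surprise in proportion
    to an agent-specific sensitivity, plus idiosyncratic noise."""
    return base_rate + sensitivity * surprise + rng.normal(0.0, 0.05)

rng = np.random.default_rng(1)
sensitivities = rng.uniform(0.2, 1.5, size=20)   # heterogeneous agents
surprises = rng.normal(size=12)                  # one shock per statement
disagreement = np.array([
    np.std([agent_forecast(2.0, s, a, rng) for a in sensitivities])
    for s in surprises
])
# Stylized realized volatility that rises with the size of the surprise.
volatility = np.abs(surprises) + rng.normal(0.0, 0.1, size=12)
rho = spearman(disagreement, volatility)
```

Because dispersion across agents scales with the size of the shock, the synthetic disagreement series co-moves with volatility, mirroring the positive rank correlations reported above.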
5. Data Sources, Memorization, and Challenges
The reliability of LLM-generated economic forecasts is contingent on data provenance and breadth. Models trained on historical macroeconomic and financial data demonstrate near-perfect recall for pre–cutoff periods, casting doubt on true predictive capability in those contexts. The memorization problem is acute: high forecast “accuracy” may often reflect direct retrieval of seen data rather than inference or reasoning. Attempts to mask entities, enforce cutoff constraints, or anonymize data rarely prevent recall of memorized values, making it essential to restrict out-of-sample evaluations to post–training cutoff dates (Lopez-Lira et al., 20 Apr 2025, Shi et al., 25 Nov 2024).
Conversely, LLMs excel at integrating high-frequency textual and geospatial data into econometric models. The LLM-CPI framework, for example, demonstrates efficient combination of online social media signals with econometric ARX/VARX models, yielding improved point forecasts and tighter prediction intervals for inflation metrics (Fan et al., 11 Jun 2025). Studies indicate that deliberate addition of noise predictors may reduce forecast variance once the interpolation threshold is crossed, provided underlying signals are dense—suggesting a counterintuitive ensemble diversification effect in high-dimensional LLM-based forecasting (Liao et al., 2023).
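The hybrid structure can be sketched as an ARX(1) fit by ordinary least squares, with a synthetic stand-in for the LLM-derived sentiment index (this toy setup illustrates the model class only and does not reproduce the LLM-CPI pipeline):

```python
import numpy as np

def fit_arx(y, x):
    """OLS fit of y_t = c + phi * y_{t-1} + beta * x_t
    (one autoregressive lag, one exogenous text-derived signal)."""
    Y = y[1:]
    Z = np.column_stack([np.ones(len(Y)), y[:-1], x[1:]])
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return coef  # (c, phi, beta)

def forecast_next(coef, y_last, x_next):
    """One-step-ahead point forecast from the fitted ARX(1)."""
    c, phi, beta = coef
    return c + phi * y_last + beta * x_next

# Simulate an inflation-like series driven by its own lag plus a
# stand-in "social media sentiment" index with known coefficients.
rng = np.random.default_rng(0)
T = 200
x = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.1 + 0.6 * y[t - 1] + 0.4 * x[t] + rng.normal(0.0, 0.1)
coef = fit_arx(y, x)
nxt = forecast_next(coef, y[-1], 0.5)
```

With the true coefficients known, the OLS estimates can be checked for recovery; in practice the exogenous regressor would be the LLM-extracted signal, and its fitted coefficient measures the incremental value of the text data.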
6. Performance, Evaluation, and Limitations
Empirical evaluations indicate that LLM-generated forecasts are frequently competitive with traditional econometric methods, particularly for short-term horizons and when applied to well-structured datasets. For long-term or decadal predictions, recursive deep learning frameworks employing LSTM variants demonstrate reliable extrapolation and robustness to nonlinear relationships—although risk of error propagation remains (Wang et al., 2023). Nevertheless, standard time-series models (Bayesian VAR, factor models) tend to deliver more stable performance and fewer outlier errors, with LLM-based models occasionally prone to extreme “hallucination” forecasts (Carriero et al., 1 Jul 2024).
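The error-propagation point can be illustrated with a deliberately mis-estimated one-step model fed back on its own predictions (a pure AR(1) toy, not an LSTM): a small one-step bias compounds over the horizon, so the relative error of the recursive forecast grows monotonically.

```python
def recursive_forecast(step_model, y_hist, horizon):
    """Feed each one-step prediction back as the input for the next step."""
    path = list(y_hist)
    for _ in range(horizon):
        path.append(step_model(path[-1]))
    return path[len(y_hist):]

# True dynamics decay at rate 0.8; the fitted model slightly underestimates
# the persistence (0.75), illustrating compounding multi-step error.
true_phi, est_phi = 0.8, 0.75
y0 = 1.0
horizon = 10
truth = [true_phi ** (h + 1) * y0 for h in range(horizon)]
pred = recursive_forecast(lambda v: est_phi * v, [y0], horizon)
gap = [abs(p - t) for p, t in zip(pred, truth)]
rel = [g / t for g, t in zip(gap, truth)]
```

Each step multiplies the coefficient error into the forecast again, so the relative gap widens with the horizon—the mechanism behind the error-propagation risk noted for recursive deep learning frameworks.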
While LLMs combined with advanced feature engineering (e.g., semantic path models, network topology descriptors from international trade) substantially outperform linear models and capture the economic environment more holistically, challenges persist around generalization, bias control, and trustworthiness. Ongoing work emphasizes the necessity of post-hoc monitoring, calibration, and interpretability plug-ins to safeguard against overfitting, “look-ahead bias,” and unintentional memorization (Chen et al., 29 Aug 2025, Shi et al., 25 Nov 2024).
7. Future Directions and Research Trajectories
The use of LLMs in economic forecasting fosters several promising research avenues. These include the development of interpretable steering and bias-correction modules, enhanced probabilistic calibration via log probability aggregation, multi-agent simulation frameworks for granular policy analysis, and novel methods for integrating high-frequency unstructured textual, spatial, and network data. Attention to the memorization problem and temporal consistency in model training is critical for trustworthy application and robust evaluation. The field is expected to advance further through the combination of empirical calibration, transparency mechanisms, domain-specific prompt engineering, and rigorous statistical benchmarking.
A plausible implication is that, while LLM-generated economic forecasts hold significant practical and theoretical promise—spanning improved accuracy, richer agent heterogeneity, and advanced uncertainty quantification—they must be accompanied by careful methodological scrutiny and post-hoc interpretability to ensure that their outputs reflect genuine reasoning rather than mere data recall. As interpretability, calibration, and data provenance become central, LLM-powered forecasting frameworks will require ongoing refinement to deliver reliable guidance in policy, operational, and investment contexts.