
FinDeepForecast: Benchmarking DR Agents

Updated 15 January 2026
  • FinDeepForecast is a live, end-to-end multi-agent system that benchmarks deep research agents on financial forecasting tasks using scheduled and unscheduled task generation.
  • It employs specialized agents for data collection, task generation, forecasting, and evaluation, ensuring rigorous, contamination-free performance assessments across global markets.
  • Empirical results reveal that while DR agents excel in binary event prediction, they struggle with precise numeric forecasts, highlighting the need for enhanced uncertainty quantification and process-level evaluation.

FinDeepForecast is a live, end-to-end multi-agent system designed to benchmark deep research (DR) agents on real-world, research-oriented financial forecasting tasks. It operationalizes a dual-track taxonomy to dynamically generate and evaluate tasks encompassing both scheduled (recurrent) and unscheduled (non-recurrent) phenomena at corporate and macroeconomic levels. The pipeline powers FinDeepForecastBench, a weekly updated benchmark spanning eight global economies and over 1,300 listed companies, supporting rigorous, contamination-free evaluation of model performance. The system represents the first public, extensible framework for continuous assessment of DR agents’ forecasting capabilities in high-stakes financial contexts, with detailed empirical results revealing both state-of-the-art strengths and persistent limitations in forward-looking reasoning (Li et al., 8 Jan 2026).

1. System Architecture and Multi-Agent Pipeline

FinDeepForecast is architected as a multi-stage pipeline driven by six specialized agents, progressing through Data Collection, Task Generation, Forecasting, and Evaluation. The pivotal components and their functions are:

  • Data Collection Agent: Continuously ingests four primary data streams—corporate filings (e.g., SEC EDGAR, HKEX), government releases (statistical bulletins), real-time news feeds, and market data (prices, volumes). All documents are strictly timestamped, with enforced temporal isolation via indexed storage.
  • Task Generation Agents (Dual Track):
    • Recurrent Task Generation Agent: Scans disclosure calendars for scheduled reports, using parameterized templates to instantiate quantitative forecasting tasks across 121 corporate metrics (balance sheet, income, cash-flow, ratios) and 96 macro indicators (e.g., GDP, CPI). Each problem encapsulates $(q, t_g, t_d, t_e, y)$, specifying query, generation time, deadline, evaluation time, and ground truth.
    • Non-Recurrent Task Generation Agent: Employs a three-stage LLM pipeline, comprising (1) signal extraction from news, (2) relevance assessment based on conflicting sources, and (3) LLM-driven synthesis of binary event questions. This agent covers 70 corporate event types (e.g., M&A, CEO changes) and 208 macro specifications (rate hikes, sanctions, shocks).
  • Forecasting Agent: Aggregates tasks by their deadline $t_d$ into weekly batches. Each model (13 total) is invoked via standardized interfaces, using only data published before $t_d$, with predictions recorded into a structured repository.
  • Evaluation Agents:
    • Ground Truth Extraction Agent: For recurrent tasks, programmatically parses official disclosures (spot-check accuracy >99.8%); for non-recurrent tasks, aggregates LLM-extracted evidence and requires 100% expert verification (95% inter-annotator agreement).
    • Evaluation Agent: Computes a binary scoring function on each task (with indicator-specific tolerances), permitting aggregate and partitioned accuracy analytics (by task class, prediction window, and geography).

This workflow enforces rigorous temporal isolation and guarantees that all forecasting is strictly forward in time relative to the informational cut-off.
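
For concreteness, the problem representation and the temporal-isolation constraint described above can be modeled as in the following sketch (illustrative Python with hypothetical names, not the authors' implementation):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Union

@dataclass(frozen=True)
class ForecastTask:
    """One benchmark problem P = (q, t_g, t_d, t_e, y)."""
    query: str              # q: natural-language forecasting question
    t_generation: datetime  # t_g: when the task was instantiated
    t_deadline: datetime    # t_d: informational cut-off for predictions
    t_evaluation: datetime  # t_e: when the ground truth becomes resolvable
    ground_truth: Optional[Union[float, bool]] = None  # y: unknown before t_e

    def __post_init__(self):
        # Enforce the strict ordering t_g < t_d < t_e required by the taxonomy.
        if not (self.t_generation < self.t_deadline < self.t_evaluation):
            raise ValueError("task violates temporal ordering t_g < t_d < t_e")

def visible_documents(task: ForecastTask, documents: list[dict]) -> list[dict]:
    """Temporal isolation: a forecasting agent may only see documents
    whose timestamps fall strictly before the task deadline t_d."""
    return [d for d in documents if d["timestamp"] < task.t_deadline]
```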

2. Dual-Track Taxonomy for Task Generation

FinDeepForecast introduces a taxonomy along two orthogonal axes: temporal predictability (recurrent vs. non-recurrent) and forecasting scope (corporate vs. macro). Each problem is formalized as $P = (q, t_g, t_d, t_e, y)$, adhering to the strict ordering $t_g < t_d < t_e$. The taxonomy is further elaborated as:

  • Recurrent Tasks: Anchored to scheduled disclosures. Corporate-level recurrent tasks encompass 121 metrics covering all major financial statements and ratios; macro-level recurrent tasks span 96 indicators reflecting output (growth), prices, monetary aggregates, and external balances. Outcomes are numeric but unresolved at the time of prediction.
  • Non-Recurrent Tasks: Triggered by unscheduled, event-driven phenomena. Corporate non-recurrents include 70 archetypal event types (e.g., M&A, leadership turnover). Macro non-recurrents are classified into $9 \times 26 = 234$ scenarios (e.g., fiscal shocks, trade disruptions), mapped through economy-specific templates and rules. These are instantiated as binary outcomes (event occurrence).
  • Scoring: The evaluation metric is

Score(y,y^)={1[∣y^−y∣/∣y∣≤Ek]for recurrent 1[y^=y]for non-recurrent\text{Score}(y, \hat{y}) = \begin{cases} \mathbf{1}[|\hat{y} - y|/|y| \leq E_k] & \text{for recurrent} \ \mathbf{1}[\hat{y} = y] & \text{for non-recurrent} \end{cases}

where $E_k$ is the indicator-specific tolerance (e.g., 5% for large metrics, 1% for ratios, 0.1% for rates).
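
For illustration, this scoring rule reduces to a few lines of Python (a minimal sketch; the handling of a zero-valued ground truth is an assumption, not specified above):

```python
def score(y, y_hat, recurrent: bool, tolerance: float = 0.05) -> int:
    """Binary score: 1 if the prediction is acceptable, 0 otherwise.

    Recurrent (numeric) tasks pass when the relative error is within the
    indicator-specific tolerance E_k; non-recurrent (binary event) tasks
    require exact agreement with the realized outcome.
    """
    if recurrent:
        if y == 0:
            # Degenerate ground truth: fall back to exact match (assumption).
            return int(y_hat == y)
        return int(abs(y_hat - y) / abs(y) <= tolerance)
    return int(y_hat == y)

# Example: a forecast of 2.6 against a realized 2.5 passes at a 5% tolerance.
assert score(2.5, 2.6, recurrent=True, tolerance=0.05) == 1
```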

This dual-track setup enables simultaneous evaluation along fundamentally different axes of task predictability and domain specificity.

3. Benchmark Scope and Construction

The FinDeepForecastBench corpus, produced via this system, spans ten consecutive weeks (from October 27, 2025), publishing new task sets weekly. Design features include:

  • Coverage: Includes eight economies (US, CN, HK, JP, UK, DE, FR, SG) and nine equity indices (e.g., S&P 500, NASDAQ 100), with a static pool of 1,314 listed companies as of an October 2025 reference date.
  • Task Statistics:
    • Total tasks: 1,394, partitioned as 723 recurrent corporate, 296 recurrent macro, 247 non-recurrent corporate, 128 non-recurrent macro.
    • Weekly averages: ~139 new tasks released (30 macro-recurrent, 72 corporate-recurrent, 16 macro-nonrecurrent, 82 corporate-nonrecurrent). Macro non-recurrents are stratified across economies; corporate non-recurrents fluctuate with news volume.
  • Temporal Structure: Tasks with deadlines $t_d$ falling between Thursday and Sunday of a given week are evaluated the following Monday, contingent on data availability at evaluation time $t_e$. Performance is tracked using rolling-horizon accuracy, $\text{accuracy}_w = (1/N_w) \sum_i \text{Score}(y_i, \hat{y}_i)$, where $N_w$ is the number of tasks in week $w$.

This benchmark design ensures systematic, real-time, high-coverage evaluation with strong cross-sectional and temporal granularity.
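
The rolling-horizon accuracy above is simply a per-week average of the binary scores; a minimal sketch (hypothetical data layout) is:

```python
from collections import defaultdict

def weekly_accuracy(results):
    """accuracy_w = (1/N_w) * sum_i Score(y_i, y_hat_i), computed per week.

    `results` is an iterable of (week_label, score) pairs, where each score
    is the 0/1 output of the binary scoring rule.  Returns {week: accuracy}.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for week, s in results:
        totals[week] += 1
        correct[week] += s
    return {week: correct[week] / totals[week] for week in totals}

# Example: three scored tasks in one week, two in the next.
print(weekly_accuracy([("W1", 1), ("W1", 0), ("W1", 1), ("W2", 0), ("W2", 1)]))
# -> {'W1': 0.666..., 'W2': 0.5}
```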

4. Evaluation Paradigms and Quantitative Metrics

All evaluated models operate under strict temporal isolation; post-deadline data is inaccessible. Three paradigms are compared under a common scoring scheme:

  • LLMs (Thinking Only): GPT-5, Claude-Sonnet-4.5, Gemini 2.5 Pro, Deepseek-v3.2, Grok 4, operating without external retrieval or plugins.
  • LLMs (Thinking+Search): Identical model suite, but augmented with web and plugin access to external sources.
  • Deep Research Agents: Specialized pipelines—OpenAI o3-deep-research, Perplexity Sonar Deep Research, Tongyi Deep Research.
  • Scoring Metrics:
    • Primary: Task accuracy under the binary scoring rule.
    • Numerical Tasks: Root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) calculated post hoc for standalone numeric forecasts:

    $$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2}, \qquad \text{MAE} = \frac{1}{N}\sum_{i=1}^{N}|\hat{y}_i - y_i|, \qquad \text{MAPE} = \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{\hat{y}_i - y_i}{y_i}\right|$$

The evaluation methodology foregrounds accuracy, robustness to temporal leakage, and discrimination by task type and prediction horizon.
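
For reference, the post-hoc numeric-error metrics follow directly from paired forecasts and outcomes; the sketch below uses NumPy and assumes nonzero ground-truth values so that MAPE is well defined:

```python
import numpy as np

def numeric_error_metrics(y_true, y_pred):
    """RMSE, MAE, and MAPE for the standalone numeric-forecast analysis."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(100.0 * np.mean(np.abs(err / y_true))),  # requires y_true != 0
    }

# Example: three quarterly revenue forecasts (hypothetical values, in $bn).
print(numeric_error_metrics([10.0, 12.5, 8.0], [9.5, 13.0, 8.2]))
```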

5. Empirical Results and Observed Challenges

Analysis over 1,394 tasks across ten weeks yields the following principal results:

  • Paradigm Performance:

    • Deep Research agents attained the highest overall accuracy (~39.5%, with OpenAI o3-deep-research at 39.5% and Perplexity Sonar Deep Research at 39.4%).
    • LLMs (Thinking+Search) achieved 35–36% accuracy.
    • LLMs (Thinking Only) ranged from 23% to 31%, but severely underperformed on recurrent tasks (<10%).
  • Effect of Search: Disabling retrieval consistently degraded accuracy by 11–14 percentage points, emphasizing external data dependency.
  • Task-Type Gap: Non-recurrent tasks (binary events) yielded high accuracy for DR models (~79%), while recurrent (numeric) tasks posed substantial difficulty (DR accuracy ~25%).
  • Cross-Market Variation: The U.S. and China saw the highest accuracies (>45% for DR agents), while Japan and smaller markets lagged, attributable to differences in data accessibility and language heterogeneity.
  • Temporal Trends: Accuracy rose over the ten-week period as the fraction of recurrent tasks declined, reinforcing the challenge of precise numeric forecasting relative to event detection.

These results reveal that while DR agents deliver robust event prediction, they remain significantly limited in precise, forward-looking numeric estimation, particularly under tight error tolerances.

6. Limitations and Prospective Advancements

Current constraints and targeted future work are as follows:

  • Limitations:
    • Numeric estimation with stringent tolerances is a persistent weakness; DR agents do not yet exhibit genuine anticipatory reasoning in these domains.
    • The current evaluation paradigm is restricted to point forecasts, excluding probabilistic or distributional prediction.
    • No analysis of process or reasoning traces is conducted, precluding detailed audits of evidence aggregation or model planning.
  • Proposed Extensions:
    • Broaden task variety to include probabilistic forecasts, cascading multi-step scenarios, and portfolio-level risk/reward assessment.
    • Develop process-level evaluation to dissect agent tool usage, planning, and error modes.
    • Expand market and firm coverage, and institute longer-horizon, rolling-window backtesting for enhanced robustness.

The scalability and extensibility of the FinDeepForecast backend provide a foundation for exploring these directions.

7. Integration of Feature-Fitted Online Conformal Prediction

For empirical tasks requiring uncertainty quantification, FinDeepForecast can be augmented with Feature-Fitted Online Conformal Prediction (FFOCP) (Huang et al., 13 May 2025). This method wraps pre-trained point models with a lightweight, feature-based quantile predictor and an online conformal calibration layer, constructing prediction intervals with provable coverage guarantees. Key elements include extracting the last-layer embedding as a feature representation, training a neural quantile predictor for residual errors, and employing an adaptive offset to maintain valid coverage under nonstationarity. Intervals are constructed online per time step as

$$C_{t,i,j} = \left[\hat{y}_{t,i,j} - \hat{q}_{t,i,j} - a_{t,i,j},\; \hat{y}_{t,i,j} + \hat{q}_{t,i,j} + a_{t,i,j}\right]$$

with $a_{t,i,j}$ updated by gradient steps to control drift. Theoretical results guarantee that average coverage converges to nominal targets as $T \rightarrow \infty$, with error rates determined by the feature and quantile-model quality. This structure enables efficient, online uncertainty quantification for all FinDeepForecast-compatible deep learning backbones, facilitating richer forms of evaluation (Huang et al., 13 May 2025).
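
A simplified sketch of the online calibration step is shown below: it updates the adaptive offset $a_{t,i,j}$ with an ACI-style gradient step toward a target coverage of $1-\alpha$. The learning rate, the exact update rule, and the example values are illustrative assumptions rather than the precise FFOCP procedure:

```python
def update_offset(a, y, y_hat, q_hat, alpha=0.1, lr=0.05):
    """One online step for the adaptive offset a in the interval
    [y_hat - q_hat - a, y_hat + q_hat + a]: widen after a miss,
    shrink slightly after a hit, targeting (1 - alpha) coverage."""
    covered = abs(y - y_hat) <= q_hat + a
    err_t = alpha - (0.0 if covered else 1.0)  # misses push a up, hits pull it down
    return a - lr * err_t

# Example: three rounds of online calibration around a point forecast.
a = 0.0
for y, y_hat, q_hat in [(101.0, 100.0, 0.5), (99.0, 100.0, 0.8), (100.2, 100.0, 0.7)]:
    lower, upper = y_hat - q_hat - a, y_hat + q_hat + a
    print(f"interval=({lower:.2f}, {upper:.2f}), covered={lower <= y <= upper}")
    a = update_offset(a, y, y_hat, q_hat)
```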


The FinDeepForecast platform, as established by Li et al. (2026), constitutes a comprehensive, live, and extensible infrastructure for benchmarking agentic forecasting systems in finance. Its dual-track taxonomy, broad empirical coverage, and robust evaluation protocol position it as a central resource for assessing and advancing deep research agents' financial reasoning (Li et al., 8 Jan 2026).
