FinDeepForecastBench: Live Financial Forecast Benchmark
- FinDeepForecastBench is a live benchmark platform for evaluating financial forecasting models, emphasizing forward-looking predictions and strict temporal isolation.
- It continuously generates research-grade forecasting tasks across macroeconomic and corporate domains using both recurrent and non-recurrent evaluation strategies.
- Reported results show Deep Research (DR) agents leading overall, with binary event tasks reaching up to 81% accuracy while numeric predictions stay below 26%, which points to numeric forecasting as the main open problem.
FinDeepForecastBench is a live, large-scale benchmark for evaluating the forward-looking forecasting capabilities of Deep Research (DR) agents, advanced LLMs, and related systems in the context of financial research. Developed as a core component of the FinDeepForecast platform, FinDeepForecastBench continuously generates and scores research-grade forecasting tasks that span macroeconomic and corporate domains, enforcing strict temporal isolation and supporting broad, contamination-free evaluation for financial AI research (Li et al., 8 Jan 2026).
1. Origin, Motivation, and System Context
FinDeepForecastBench was motivated by limitations of prior financial forecasting benchmarks. Conventional static datasets are susceptible to data contamination because they are publicly available, and their content quickly becomes obsolete. Existing time-sensitive benchmarks in domains such as code or mathematics often allow for rote memorization. In contrast, FinDeepForecastBench leverages properties specific to the financial domain (periodic disclosures, irregular high-impact events, and reliable temporal isolation of ground truth) to enable authentic, forward-looking evaluation. The benchmark is embedded within the broader FinDeepForecast platform, a multi-agent system that orchestrates DR agents and LLMs so that they access only information timestamped before each task’s prediction deadline.
2. Architecture and Workflow
FinDeepForecastBench is generated and maintained through a well-defined workflow comprising several specialized agent modules within the FinDeepForecast system:
- Data Collection: An agent continuously aggregates legally timestamped data from corporate filings, government releases, real-time news, and market data.
- Task Generation:
  - Recurrent Task Agent: Scans official calendars, generating structured numeric prediction tasks (e.g., quarterly EPS, national CPI).
  - Non-Recurrent Task Agent: Uses LLM-based event detection to craft genuinely uncertain, event-driven binary prediction tasks (e.g., unscheduled mergers, regulatory actions).
- Forecasting Agent: Batches the generated tasks and invokes candidate models or agents, enforcing that all context provided to them is strictly time-locked prior to each task’s deadline.
- Ground-Truth and Evaluation: An agent parses official releases or aggregates LLM-based evidence (with expert verification for 100% of non-recurrent tasks) to establish ground truth; a deterministic scoring protocol computes final marks.
- Live Leaderboard: The OpenFinArena platform maintains a real-time leaderboard, offering breakdowns by model, task type, and market.
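The time-locking step of the workflow above can be illustrated with a minimal sketch. It assumes each collected item carries a timezone-aware publication timestamp and that "strictly time-locked" means published strictly before the deadline. The `Document` class, its field names, and `time_locked_context` are hypothetical and are not taken from the FinDeepForecast codebase.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Document:
    """A collected source item (hypothetical schema for illustration)."""
    source: str
    published_at: datetime  # timezone-aware publication timestamp
    text: str


def time_locked_context(documents, deadline):
    """Keep only documents published strictly before the prediction deadline."""
    return [doc for doc in documents if doc.published_at < deadline]


# Illustrative data only; dates and texts are made up for the example.
deadline = datetime(2025, 11, 22, tzinfo=timezone.utc)
corpus = [
    Document("filing", datetime(2025, 11, 20, tzinfo=timezone.utc), "Quarterly report"),
    Document("news", datetime(2025, 11, 22, tzinfo=timezone.utc), "Published at the deadline"),
    Document("news", datetime(2025, 11, 25, tzinfo=timezone.utc), "Published after the deadline"),
]
visible = time_locked_context(corpus, deadline)
```

Using a strict comparison excludes items stamped exactly at the deadline, which is the conservative choice when the goal is to rule out leakage of the outcome.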
3. Task Taxonomy and Dataset Scope
FinDeepForecastBench applies a dual-track taxonomy that intersects temporal predictability with economic scope:
- Temporal Predictability:
  - Recurrent Tasks: Scheduled, numeric-value predictions (e.g., “Estimate Apple Q4 2025 EPS”).
  - Non-Recurrent Tasks: Unscheduled, binary predictions on event occurrence (e.g., “Will China’s MOFCOM impose new export controls by November 22, 2025?”).
- Forecasting Scope:
  - Corporate-Level: 121 financial metrics (balance sheet, income statement, cash flow, ratios) for 1,314 companies included in indices such as S&P 500, NASDAQ 100, Nikkei 225, etc.
  - Macro-Level: 96 macroeconomic indicators, including GDP, CPI, interest rates, exchange rates, and global indices.
- Scale: Over a 10-week live period, the system released 1,394 tasks (296 recurrent macro, 723 recurrent corporate, 128 non-recurrent macro, 247 non-recurrent corporate), covering eight major economies (US, CN, HK, JP, UK, DE, FR, SG).
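One way to picture the two-axis taxonomy is as a small task record plus a table of the reported counts. The sketch below is illustrative only: the class and field names (`Predictability`, `Scope`, `ForecastTask`, `answer_type`) are hypothetical, while the four counts are the ones reported above.

```python
from dataclasses import dataclass
from enum import Enum


class Predictability(Enum):
    RECURRENT = "recurrent"          # scheduled, numeric answer
    NON_RECURRENT = "non_recurrent"  # unscheduled, binary answer


class Scope(Enum):
    MACRO = "macro"
    CORPORATE = "corporate"


@dataclass(frozen=True)
class ForecastTask:
    """Hypothetical task record combining both taxonomy axes."""
    question: str
    predictability: Predictability
    scope: Scope
    economy: str

    @property
    def answer_type(self) -> str:
        # Recurrent tasks expect a number; non-recurrent tasks expect yes/no.
        return "numeric" if self.predictability is Predictability.RECURRENT else "binary"


# Task counts reported for the 10-week live period.
TASK_COUNTS = {
    (Predictability.RECURRENT, Scope.MACRO): 296,
    (Predictability.RECURRENT, Scope.CORPORATE): 723,
    (Predictability.NON_RECURRENT, Scope.MACRO): 128,
    (Predictability.NON_RECURRENT, Scope.CORPORATE): 247,
}

total = sum(TASK_COUNTS.values())
recurrent_total = sum(n for (p, _), n in TASK_COUNTS.items() if p is Predictability.RECURRENT)
non_recurrent_total = total - recurrent_total

example = ForecastTask(
    question="Estimate Apple Q4 2025 EPS",
    predictability=Predictability.RECURRENT,
    scope=Scope.CORPORATE,
    economy="US",
)
```

Summing the table reproduces the 1,394 total and shows that recurrent tasks (1,019) outnumber non-recurrent tasks (375) by roughly 2.7 to 1, which matters when reading overall accuracy figures.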
4. Evaluation Methodology and Metrics
FinDeepForecastBench employs a scoring protocol designed to reflect real-world decision criteria and to facilitate cross-model comparison:
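- Scoring Function:
  - For binary (non-recurrent) tasks: 1 if the prediction matches the outcome, 0 otherwise.
  - For numeric (recurrent) tasks: 1 if the relative error satisfies $|\hat{y} - y| / |y| \le \epsilon$, where $\hat{y}$ is the prediction, $y$ is the realized value, and $\epsilon$ is a metric-specific tolerance (e.g., 5% for million-scale values, 0.1% for rates), 0 otherwise.
  - Overall accuracy is the mean score $\frac{1}{N}\sum_{i=1}^{N} s_i$ over all $N$ tasks, where $s_i \in \{0, 1\}$ is the score on task $i$.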
- Standard Continuous Metrics (recurrent tasks, post hoc): Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE).
- Experimental Design: All models forecast identical weekly task batches under enforced data isolation and are compared only after the outcomes become public. Results are disaggregated by task type, scope, economy, and week. Significance can be assessed via paired bootstrap, though this is not reported in the main release.
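The scoring protocol and the post hoc continuous metrics can be sketched in a few lines. This is a sketch under assumptions, not the benchmark's reference implementation: it assumes the numeric tolerance is applied to relative error, falls back to absolute error when the realized value is zero (an assumption made here to avoid division by zero), and uses hypothetical function names throughout.

```python
import math


def score_binary(predicted: bool, outcome: bool) -> int:
    """1 if the predicted event outcome matches the realized one, else 0."""
    return int(predicted == outcome)


def score_numeric(predicted: float, actual: float, tolerance: float) -> int:
    """1 if the relative error is within tolerance (assumed form), else 0."""
    if actual == 0:
        # Assumption: use absolute error when relative error is undefined.
        return int(abs(predicted) <= tolerance)
    return int(abs(predicted - actual) / abs(actual) <= tolerance)


def accuracy(scores):
    """Mean of 0/1 task scores."""
    return sum(scores) / len(scores)


def mae(predicted, actual):
    """Mean Absolute Error."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root Mean Squared Error."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def mape(predicted, actual):
    """Mean Absolute Percentage Error, in percent; actual values must be nonzero."""
    return 100.0 * sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)


# Illustrative values only (not benchmark data).
scores = [
    score_binary(True, True),         # event predicted and occurred
    score_numeric(1.02, 1.00, 0.05),  # 2% error, inside a 5% tolerance
    score_numeric(1.10, 1.00, 0.05),  # 10% error, outside a 5% tolerance
    score_binary(False, False),       # event not predicted and did not occur
]
overall = accuracy(scores)
```

The thresholded score and the continuous metrics answer different questions: the former counts forecasts that are close enough to act on, while MAE, RMSE, and MAPE describe how far off the misses are.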
5. Baselines, Model Types, and Performance Results
Thirteen representative models, divided into three methodological paradigms, were evaluated:
- LLM (“Thinking” Only, No Web Search): GPT-5 (T), Claude-Sonnet-4.5 (T), Gemini 2.5 Pro (T), Deepseek v3.2 (T), Grok 4 (T).
- LLM (Thinking + Search): The same models with direct search augmentation.
- Deep Research Agents: OpenAI o3-deep-research, Perplexity Sonar Deep Research, Tongyi Deep Research.
Model configurations used default hyperparameters, zero- or few-shot prompts, temperature=0, and strictly enforced time-windowed data exposure.
Results:
- Overall: DR agents achieved the highest accuracy (OpenAI o3-deep-research 39.5%, Perplexity Sonar DR 39.4%), the LLM (T+S) cluster reached 35–36%, and LLM (T) models only 20.8–24.4%.
- Task Type Breakdown: Non-recurrent tasks saw up to 81% accuracy (Perplexity Sonar DR), while recurrent (numeric) tasks remained below 26% even for top models, indicating how difficult genuinely forward-looking numeric financial prediction is.
- Market/Country Analysis: Accuracies were highest in data-rich economies (US, CN) and lowest in markets with less English coverage or depth (JP).
- Temporal Trends: Accuracy improved slightly as scheduled disclosures accumulated later in the quarterly cycle for recurrent tasks. DR agent advantage remained consistent throughout.
6. Deployment, Leaderboard, and Maintenance
FinDeepForecastBench powers a publicly accessible leaderboard via OpenFinArena.com, facilitating API-based access to task definitions, submission, and results retrieval. The system is maintained via weekly automated pipelines that regenerate tasks, run models, process outcomes, and update statistical summaries.
- Leaderboard features: Breakdown by agent/model, task category, and economy; historical trends; download of past results.
- Ground-truth quality: Recurrent ground truth is parsed from official filings with 99.8% accuracy; non-recurrent tasks employ LLM-based aggregation with 100% expert verification.
7. Limitations and Future Directions
- Current limitations: Numeric (recurrent) financial forecasting remains an unsolved challenge, with even the top models below 26% accuracy on these tasks. The binary scoring protocol for numeric tasks can oversimplify forecasts near the tolerance boundary, since a forecast just outside the tolerance scores the same as one that is far off. There is no probabilistic/confidence forecast scoring, and only prediction outcomes (not the agent's process) are evaluated.
- Proposed extensions: Probabilistic interval/quantile forecasts, multi-step/chained forecast tasks, portfolio-level performance evaluation, process-based agent assessment (logging search/tool use and reasoning chains), and expanded coverage to new markets, sectors, and higher-frequency horizons (Li et al., 8 Jan 2026).
FinDeepForecastBench represents a rigorous, live, and contamination-free approach to benchmarking research agents and LLMs for financial forecasting, enabling ongoing and objective assessment under genuine temporal isolation and realistic uncertainty. Its methodology establishes a robust foundation for future research and development in predictive AI for high-stakes economic domains.