
FinTradeBench: Financial Reasoning Benchmark

Updated 26 March 2026
  • FinTradeBench is a benchmark that tests LLMs' ability to integrate company fundamentals with market trading signals.
  • The dataset spans a decade (2015–2025) for NASDAQ-100 firms with 1,400 carefully constructed questions in fundamentals, trading, and hybrid categories.
  • It employs a calibration-then-scaling framework with both human and automated evaluations to ensure rigorous and reproducible performance assessment.

FinTradeBench is a large-scale benchmark specifically designed to evaluate the financial reasoning capabilities of LLMs through questions that require integration of company fundamentals and trading signals. Constructed from NASDAQ-100 firms over a decade (2015–2025), and encompassing 1,400 calibrated questions across fundamentals-focused, trading-signal-focused, and hybrid types, FinTradeBench offers a rigorous and reproducible framework for assessing LLMs on the multifaceted reasoning tasks reflective of real-world financial analysis (Agrawal et al., 19 Mar 2026).

1. Motivation and Distinctive Scope

FinTradeBench addresses the core challenge that real-world financial decision-making entails: simultaneously reasoning over heterogeneous signals derived from company fundamentals (e.g., profitability and valuation ratios from regulatory filings) and trading dynamics (quantitative indicators from market data). Previous benchmarks—such as FinQA, ConvFinQA, and DocFinQA—have driven progress in question answering based on structured financial reports but remain limited in evaluating reasoning across market-driven signals and the reconciliation of conflicting market and fundamentals narratives. For example, sharp price movements following earnings releases often contrast with accounting-based metrics, demanding cross-signal synthesis for tasks such as investment recommendation. FinTradeBench is specifically engineered to close this evaluation gap for LLMs by integrating both signal modalities into a unified and scalable benchmark (Agrawal et al., 19 Mar 2026).

2. Dataset Construction and Question Taxonomy

FinTradeBench is grounded in a 10-year historical window (2015–2025) for NASDAQ-100 constituents, with data aligned quarterly. The benchmark encompasses:

  • Fundamentals: Extracted from SEC 10-K/10-Q filings, including:
    • Return on Assets (ROA): $\text{ROA} = \frac{\text{Net Income}}{\text{Total Assets}}$
    • Return on Equity (ROE), Earnings/Price, Book/Price, Debt/Equity, Cash Flow/Assets, Dividend Yield, etc.
  • Trading Signals: Derived from daily OHLCV (Open, High, Low, Close, Volume) data. Representative metrics include:
    • Simple Moving Average: $\text{MA}_N = \frac{1}{N}\sum_{i=1}^N P_{t-i}$
    • Exponential Moving Average: $\text{EMA}_t = \alpha\,P_t + (1-\alpha)\,\text{EMA}_{t-1}$
    • Relative Strength Index (RSI): $\text{RSI} = 100 - \frac{100}{1 + \text{RS}}$, where $\text{RS} = \frac{\text{Avg Gain}}{\text{Avg Loss}}$
    • One-Day Reversal: $\frac{P_t - P_{t-1}}{P_{t-1}}$
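
The indicator formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the function names are hypothetical, the EMA is seeded with the first price, and the RSI uses plain averages of gains and losses over the supplied window (the source does not say which smoothing variant is used).

```python
from typing import Sequence


def roa(net_income: float, total_assets: float) -> float:
    """Return on Assets: net income divided by total assets."""
    return net_income / total_assets


def simple_moving_average(prices: Sequence[float], n: int) -> float:
    """MA_N at time t, where prices[-1] is P_t.

    Follows the formula as written: the mean of P_{t-1} .. P_{t-N},
    so the latest price P_t itself is excluded.
    """
    if len(prices) < n + 1:
        raise ValueError("need at least n + 1 prices")
    window = prices[-(n + 1):-1]
    return sum(window) / n


def exponential_moving_average(prices: Sequence[float], alpha: float) -> float:
    """EMA_t = alpha * P_t + (1 - alpha) * EMA_{t-1}.

    Assumption: the recursion is seeded with the first price.
    """
    value = prices[0]
    for price in prices[1:]:
        value = alpha * price + (1 - alpha) * value
    return value


def rsi(prices: Sequence[float]) -> float:
    """RSI = 100 - 100 / (1 + RS), with RS = average gain / average loss.

    Assumption: simple averages over all price changes in the window.
    """
    changes = [b - a for a, b in zip(prices, prices[1:])]
    avg_gain = sum(c for c in changes if c > 0) / len(changes)
    avg_loss = sum(-c for c in changes if c < 0) / len(changes)
    if avg_loss == 0:
        return 100.0  # no losses in the window: RS is unbounded
    return 100 - 100 / (1 + avg_gain / avg_loss)


def one_day_reversal(prices: Sequence[float]) -> float:
    """(P_t - P_{t-1}) / P_{t-1} for the last two prices."""
    return (prices[-1] - prices[-2]) / prices[-2]
```

For example, `simple_moving_average([10, 11, 12, 13, 14], 3)` averages 11, 12, and 13, because the window stops one step before the latest price.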

The core question corpus is developed as follows:

| Category | Description | Example |
| --- | --- | --- |
| Fundamentals (F) | Reasoning over accounting metrics | "Is Nvidia’s profitability sustainable given its valuation in September 2025?" |
| Trading (T) | Analysis involving trading signals and price movement | "Which stocks in H1 2025 trade strongest above their 20-day EMA?" |
| Hybrid (FT) | Cross-signal reasoning combining fundamentals and trading signals | "Is Microsoft overvalued in Q3 2025 despite strong operational performance?" |

A seed set of 150 expert-authored questions (50 per category) is algorithmically scaled to 1,400 questions by instantiating each seed across all firms and quarters, ensuring broad and balanced coverage (Agrawal et al., 19 Mar 2026).
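
The scaling step can be pictured as template instantiation. The sketch below is an assumption about the mechanics, not the authors' pipeline: the `{firm}` and `{quarter}` placeholders, the `instantiate` helper, and the example templates are all hypothetical, and the source does not describe how instances are selected to arrive at 1,400 questions.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def instantiate(
    seed_templates: Sequence[Tuple[str, str]],
    firms: Sequence[str],
    quarters: Sequence[str],
) -> List[Dict[str, str]]:
    """Expand (category, template) seeds over every firm/quarter pair."""
    questions = []
    for (category, template), firm, quarter in product(seed_templates, firms, quarters):
        questions.append(
            {
                "category": category,
                "firm": firm,
                "quarter": quarter,
                "question": template.format(firm=firm, quarter=quarter),
            }
        )
    return questions


# Hypothetical seeds, loosely modelled on the example questions above.
seeds = [
    ("F", "Is {firm}'s profitability sustainable given its valuation in {quarter}?"),
    ("FT", "Is {firm} overvalued in {quarter} despite strong operational performance?"),
]
expanded = instantiate(seeds, ["Nvidia", "Microsoft"], ["Q2 2025", "Q3 2025"])
```

Keeping the category, firm, and quarter alongside each generated question makes it straightforward to check afterwards that coverage is balanced across the three question types.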

3. Calibration-Then-Scaling Evaluation Framework

FinTradeBench implements a three-phase calibration-then-scaling pipeline to maximize reliability and minimize bias:

  1. Multi-Model Candidate Generation and Self-Filtering
    • For each question and each model, six candidate answers are generated using TELeR-guided prompts across levels 1–6.
    • Each model conducts intra-model self-selection, nominating its top answer using internal measures of accuracy, completeness, and relevance.
    • Automated numerical auditing is performed via an independent LLM auditor labeling numerical claims as SUPPORTED or CONTRADICTED, with per-model numerical accuracy defined as:

    $$\text{Acc}_{\mathrm{num}}(M_m) = \frac{1}{N}\sum_{i=1}^N \mathbf{1}\left[\mathrm{is\_accurate}(a_i^m)=1\right]$$

  2. Human and LLM-as-Judge Calibration

    • Human experts rate selected answers on a 1–5 Likert scale for factual/numerical accuracy, completeness, relevance, and clarity.
    • An independent LLM judge (Claude Sonnet 4.5) applies the same rubric, incorporating the numerical audit.
    • Alignment between human and LLM-judge scoring is quantified as the mean absolute error (MAE):

    $$\mathrm{MAE}(M_m) = \frac{1}{n_m}\sum_{i=1}^{n_m} \left|S_{h,i}^m - S_{J,i}^m\right|$$

    • On the seed set, the MAE is consistently below 0.27 points (5.4% of the 5-point scale), indicating close human–LLM agreement.
  3. Automated Scaling
    • The generation, filtering, and evaluation protocol is propagated to all expanded instances, yielding ground-truth labels for the complete question set.

This pipeline underpins the reproducibility and trustworthiness of FinTradeBench's annotations and evaluation (Agrawal et al., 19 Mar 2026).
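
The two calibration quantities defined above, per-model numerical accuracy and human–judge MAE, reduce to simple averages. A minimal sketch follows; the function names and sample values are hypothetical, and the audit flags are assumed to arrive as one boolean per answer.

```python
from typing import Sequence


def numerical_accuracy(is_accurate: Sequence[bool]) -> float:
    """Acc_num: fraction of a model's answers whose numerical claims pass the audit."""
    return sum(1 for flag in is_accurate if flag) / len(is_accurate)


def judge_mae(human_scores: Sequence[float], judge_scores: Sequence[float]) -> float:
    """Mean absolute error between human and LLM-judge scores on the same answers."""
    if len(human_scores) != len(judge_scores):
        raise ValueError("score lists must be aligned item by item")
    total = sum(abs(h - j) for h, j in zip(human_scores, judge_scores))
    return total / len(human_scores)


def mae_as_percent_of_scale(mae: float, scale_max: float = 5.0) -> float:
    """Express an MAE in points as a percentage of the 5-point Likert scale."""
    return mae / scale_max * 100


# Hypothetical audit flags and scores for one model.
acc = numerical_accuracy([True, True, False, True])
mae = judge_mae([4, 5, 3, 4], [4, 4, 3, 5])
```

With this convention, the reported bound of 0.27 points corresponds to `mae_as_percent_of_scale(0.27)`, which is about 5.4.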

4. Model Evaluation Protocol and Metrics

A diverse suite of 14 LLMs is evaluated, spanning large proprietary models (e.g., DeepSeek-R1, Gemini 2.5 Flash, GPT-5-mini), midsize open-weight models (LLaMA 3.3 70B, Qwen 2.5 32B, R1-Distill-Qwen 32B), and smaller distilled or tuned models (Phi-4 14B, Mistral 7B, LFM 1.2B).

  • Experiment Settings:
    • Zero-Shot (No-RAG): No retrieval augmentation.
    • Retrieval-Augmented Generation (RAG): Dual-track retrieval engine (Track A: SEC document chunks; Track B: time-series data), hierarchical indexing, cross-encoder re-ranking. Generation incorporates TELeR prompts and model self-filtering.
  • Key Metrics:

    • Absolute Accuracy:

    $$A(M_m) = \frac{S_J^m}{5} \times 100\%$$

    • Retrieval Delta:

    $$\Delta(M_m) = \frac{A_{\mathrm{RAG}}(M_m) - A_{\mathrm{No\text{-}RAG}}(M_m)}{A_{\mathrm{No\text{-}RAG}}(M_m)} \times 100\%$$

    • Golden-Indicator F1: precision, recall, and F1 from matching the metrics a model cites against the expert-specified metrics.
    • Signal-Integration Scores: 1–5 ratings for fundamentals (FI) and trading signals (TI).
    • Reasoning Depth: a 1–5 score.
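
The metrics above can be sketched as follows. The helper names are hypothetical, and the Golden-Indicator matching is assumed here to be exact set overlap between cited and expert-listed indicator names; the source does not spell out the matching rule.

```python
from typing import Set, Tuple


def absolute_accuracy(judge_score: float) -> float:
    """A(M) = S_J / 5 * 100, mapping a 1-5 judge score to a percentage."""
    return judge_score / 5 * 100


def retrieval_delta(acc_rag: float, acc_no_rag: float) -> float:
    """Relative change in accuracy, in percent, when retrieval is added."""
    return (acc_rag - acc_no_rag) / acc_no_rag * 100


def golden_indicator_f1(cited: Set[str], golden: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of cited indicators against the expert set.

    Assumption: indicators match only when their names are identical.
    """
    true_positives = len(cited & golden)
    precision = true_positives / len(cited) if cited else 0.0
    recall = true_positives / len(golden) if golden else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: a model cites three indicators, two of which are expected.
p, r, f1 = golden_indicator_f1({"ROE", "RSI", "EMA"}, {"ROE", "RSI", "ROA", "Debt/Equity"})
```

Because the Retrieval Delta is relative to the No-RAG baseline, moving from 34% to 42% accuracy gives `retrieval_delta(42, 34)`, roughly +23.5%, rather than the 8-point absolute difference.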

5. Empirical Results and Analytical Insights

Key average results across all models, by question type:

| Question Category | No-RAG Accuracy | RAG Accuracy | Retrieval Delta ($\Delta$) |
| --- | --- | --- | --- |
| Fundamentals (F) | 34% | 42% | +23% |
| Trading (T) | 25% | 24% | $\sim$0% (often negative) |
| Hybrid (FT) | 31% | 38% | +25% |
| Overall | 30% | 37% | +20% |

Fundamentals and hybrid questions benefit significantly from RAG ($p < 0.01$, paired $t$-test), while trading-signal questions show no improvement and in some cases degrade. This is attributed to the coverage bias in LLM pre-training corpora: SEC filings are widely available and indexed, supporting effective retrieval, whereas proprietary, underrepresented time-series datasets impede LLMs' ability to extract value from raw price/volume data.
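
For readers unfamiliar with the paired $t$-test mentioned above, the statistic compares each model's RAG and No-RAG accuracy on the same questions, so the test operates on per-model differences. The sketch below uses only the standard library; the function name and the accuracy values are hypothetical and are not the paper's data.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(before: Sequence[float], after: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for paired samples.

    t = mean(d) / (stdev(d) / sqrt(n)), where d_i = after_i - before_i.
    """
    if len(before) != len(after):
        raise ValueError("samples must be paired one to one")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# Hypothetical per-model accuracies (percent), one pair per model.
no_rag = [30.0, 35.0, 28.0, 40.0, 33.0]
rag = [38.0, 41.0, 38.0, 47.0, 42.0]
t_stat, dof = paired_t_statistic(no_rag, rag)
```

Converting the statistic to a $p$-value requires the $t$ distribution with `dof` degrees of freedom, for example from a statistical table or a library such as SciPy.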

Additional findings reveal:

  • Golden-Indicator F1 drops by 56.5% under RAG, indicating distraction from precise metric extraction.
  • Reasoning Depth falls by 10.8%, suggesting information overload.
  • Model architecture has higher impact than sheer scale; latent-reasoning models (e.g., DeepSeek-R1) demonstrate up to +55.1% performance gain on hybrid questions, while instruction-tuned LLaMA variants sometimes perform worse with retrieval augmentation (Agrawal et al., 19 Mar 2026).

6. Limitations and Prospective Developments

FinTradeBench has several acknowledged limitations:

  • Coverage is restricted to NASDAQ-100 equities and a fixed set of signals.
  • No real-time or forward-looking scenarios are evaluated.
  • Benchmark construction depends on LLM-judge calibration using only 150 seed questions.
  • It omits portfolio optimization, risk management, and multi-asset questions.

Planned future work includes: integration of lightweight code execution for real-time computations on time series, dataset extension to other asset classes and macroeconomic indicators, enhancement of retrieval architectures (exploring agentic RAG and context compression), and incorporation of analyst forecasts and alternative data streams.

A plausible implication is that advances in retrieval and agentic systems, as well as expansion into underrepresented financial data modalities, are crucial for overcoming the modality-dependent performance gap observed in current LLMs (Agrawal et al., 19 Mar 2026).
