QuantEval Benchmark for LLMs in Finance

Updated 25 February 2026

QuantEval Benchmark is a unified suite that rigorously assesses financial LLMs across conceptual knowledge, quantitative reasoning, and executable strategy coding.
It combines MCQs, open-ended quantitative tasks, and Python trading scripts to address limitations of previous benchmarks in financial workflows.
The benchmark uses deterministic backtesting and detailed metrics to reveal gaps between LLM performance and expert human trading strategies.

QuantEval Benchmark is a comprehensive evaluation suite designed to rigorously assess LLMs across financial quantitative workflows: conceptual knowledge, mathematical reasoning, and executable strategy development. By integrating knowledge-based questions, open-ended quantitative problem solving, and backtested trading-code evaluation within a unified, reproducible platform, QuantEval isolates critical gaps in current LLM capabilities relative to human financial expertise (Kang et al., 13 Jan 2026).

1. Motivation, Scope, and Design Principles

The primary motivation for QuantEval is the fragmentation and narrowness of prior financial LLM benchmarks, which have predominantly tested knowledge-centric or table-based question answering (e.g., FinQA, TAT-QA, FinTextQA, FinEval). These benchmarks do not evaluate multi-step computation (such as chaining a volatility estimate into option pricing), nor the correctness and risk properties of model-generated, framework-compliant trading code.

QuantEval is constructed to bridge this gap by targeting:

Conceptual knowledge (definitions, formulas, regulatory concepts)
Multi-step quantitative mathematical reasoning grounded in real historical data
End-to-end strategy coding with execution-based performance evaluation

This integrated, execution-based approach reflects authentic quant workflows, where domain understanding and computational prowess must culminate in robust, backtestable strategies (Kang et al., 13 Jan 2026).

2. Benchmark Structure and Dataset Composition

QuantEval’s dataset spans three distinct yet complementary task families:

Component	# Samples	Task Format	Source/Scope
Knowledge-Based QA	660	MCQ, fill-in-blank	Textbooks, papers, regulations; expert-curated
Quantitative Reasoning	855	Open-ended	CFA/FRM templates, industry white papers, real market data
Quantitative Strategy Coding	60	Executable code	Adapted CTA libraries, expert scripts, unified Python API

Knowledge QA is distilled from textbooks and regulatory filings and structured as multiple-choice or fill-in-the-blank queries. Reasoning tasks range from direct financial calculations (e.g., convexity, VaR) to chains of market-data transformations (e.g., a rolling-window volatility estimate feeding into Black-Scholes option pricing). Strategy Coding items require the generation of Python trading scripts that run under QuantEval’s standardized CTA-style backtesting harness.
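The chained reasoning task mentioned above (rolling-window volatility feeding into Black-Scholes) can be sketched as follows. This is only an illustration of the kind of computation involved, not a QuantEval task or its reference solution: the function names, the 20-day window, the 252-day annualization, and the sample prices are all assumptions.

```python
import math
from statistics import stdev


def annualized_rolling_vol(closes, window=20, periods_per_year=252):
    """Annualized sample volatility of the last `window` daily log returns.

    The window length and annualization factor are illustrative assumptions.
    """
    if len(closes) < window + 1:
        raise ValueError("need at least window + 1 closing prices")
    recent = closes[-(window + 1):]
    log_returns = [math.log(b / a) for a, b in zip(recent, recent[1:])]
    return stdev(log_returns) * math.sqrt(periods_per_year)


def norm_cdf(x):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, sigma, maturity):
    """Black-Scholes price of a European call on a non-dividend-paying asset."""
    if sigma <= 0 or maturity <= 0:
        raise ValueError("sigma and maturity must be positive")
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma ** 2) * maturity) / (sigma * sqrt_t)
    d2 = d1 - sigma * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


if __name__ == "__main__":
    # Hypothetical closing prices, made up for the example.
    closes = [100 + 2 * math.sin(i / 3.0) + 0.1 * i for i in range(30)]
    sigma = annualized_rolling_vol(closes, window=20)
    price = black_scholes_call(spot=closes[-1], strike=105.0, rate=0.03,
                               sigma=sigma, maturity=0.5)
    print(f"sigma={sigma:.4f}, call={price:.4f}")
```

An error in the first step (for example, forgetting to annualize) propagates into the option price, which is what makes such chained tasks harder than single-formula questions.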

All code solutions are paired with metadata specifying universe, risk controls, cost models, and ground-truth metric values (Kang et al., 13 Jan 2026).

3. Evaluation Protocol, Metrics, and Backtesting Framework

A model’s performance is assessed by distinct criteria for each component:

Knowledge QA/Reasoning: Graded for exact-match or normalized semantic correctness. Chain-of-thought (CoT) prompting is evaluated for its effect on reasoning.
Strategy Coding:
- Executability Rate: Fraction of code samples that compile and execute under the CTA backtesting framework.
- Metric Deviation: Mean absolute error (MAE) to expert-validated ground truth for return, maximum drawdown, Sharpe ratio, and return/drawdown.
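As a rough sketch of how the two Strategy Coding scores could be aggregated, the snippet below computes an executability rate and a per-metric MAE. The record layout (`executed`, `metrics`), the metric key names, and the choice to average MAE over executed samples only are assumptions made for illustration; the source does not specify them.

```python
from statistics import mean

# Hypothetical metric keys; the benchmark's actual field names are not given.
METRICS = ("return", "max_drawdown", "sharpe", "return_over_drawdown")


def executability_rate(results):
    """Fraction of generated scripts whose backtest ran to completion."""
    if not results:
        raise ValueError("no results to score")
    return sum(1 for r in results if r["executed"]) / len(results)


def metric_mae(results, references):
    """Per-metric mean absolute error against reference values.

    Assumption: only executed samples contribute; returns None per metric
    when nothing executed.
    """
    pairs = [(r["metrics"], ref) for r, ref in zip(results, references) if r["executed"]]
    if not pairs:
        return {m: None for m in METRICS}
    return {m: mean(abs(got[m] - ref[m]) for got, ref in pairs) for m in METRICS}
```

Under this convention a model with low executability is penalized through the rate itself, while MAE describes only the scripts that actually ran, so the two numbers should be read together.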

The backtesting configuration is deterministic: a fixed universe (15 U.S. ETFs + large-caps), a fixed NYSE calendar, order execution at the open of the next bar, leverage and turnover constraints, and a hard-coded cost model (2 bps commission, 1 bps slippage). Researchers are given the exact YAML/JSON protocol, code, and data download scripts to ensure strict reproducibility.
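To make the next-bar-open execution rule and the cost model concrete, here is a deliberately simplified single-asset sketch. It is not the QuantEval harness: the function name, the weight-based interface, and the assumption that costs are charged on traded notional (change in weight times equity) are illustrative choices, and intra-bar weight drift is ignored.

```python
COMMISSION_BPS = 2.0  # from the protocol description
SLIPPAGE_BPS = 1.0    # from the protocol description


def backtest_next_open(opens, closes, target_weights, initial_equity=1.0):
    """Equity curve (marked at each close) for a single asset.

    The weight decided on bar t is traded at the open of bar t + 1, so a
    signal never uses the price at which it is executed.
    """
    if not (len(opens) == len(closes) == len(target_weights)):
        raise ValueError("inputs must have equal length")
    cost_rate = (COMMISSION_BPS + SLIPPAGE_BPS) / 10_000
    equity, weight = initial_equity, 0.0
    curve = [equity]
    for t in range(1, len(opens)):
        # Overnight move, still holding the previous weight.
        equity *= 1.0 + weight * (opens[t] / closes[t - 1] - 1.0)
        # Rebalance at the open to the weight decided on the previous bar.
        new_weight = target_weights[t - 1]
        equity -= equity * abs(new_weight - weight) * cost_rate
        weight = new_weight
        # Open-to-close move with the new weight.
        equity *= 1.0 + weight * (closes[t] / opens[t] - 1.0)
        curve.append(equity)
    return curve
```

Because every input is fixed and no randomness is involved, running the same script twice yields the same equity curve, which is the property a deterministic protocol relies on.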

Key financial metrics used:

$\text{Sharpe} = \frac{\mathbb{E}[r_{\mathrm{daily}}]}{\sqrt{\mathrm{Var}(r_{\mathrm{daily}})}} \sqrt{252}$

$\mathrm{MaxDD} = \max_{t} \frac{\max_{s\le t} \mathrm{Equity}(s) - \mathrm{Equity}(t)}{\max_{s\le t}\mathrm{Equity}(s)}$

$\mathrm{Return/Drawdown} = \frac{\text{Annualized Return}}{\text{Max Drawdown}}$
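The three metrics can be computed along the following lines. This is a minimal sketch rather than the benchmark's metric code: it assumes population variance for the Sharpe denominator, a zero risk-free rate (as in the formula above), drawdown measured against the running peak of the equity curve, and geometric annualization of return over 252 periods per year.

```python
import math
from statistics import mean, pstdev


def sharpe_ratio(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio with zero risk-free rate (population std assumed)."""
    sd = pstdev(daily_returns)
    if sd == 0:
        return float("nan")  # undefined for a constant return series
    return mean(daily_returns) / sd * math.sqrt(periods_per_year)


def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def annualized_return(equity, periods_per_year=252):
    """Geometric annualized return (the annualization convention is an assumption)."""
    n_periods = len(equity) - 1
    if n_periods < 1:
        raise ValueError("need at least two equity points")
    return (equity[-1] / equity[0]) ** (periods_per_year / n_periods) - 1.0


def return_over_drawdown(equity, periods_per_year=252):
    """Annualized return divided by maximum drawdown."""
    mdd = max_drawdown(equity)
    if mdd == 0:
        return float("inf")  # no drawdown observed
    return annualized_return(equity, periods_per_year) / mdd
```

For an equity curve of 100, 120, 90, 110, the running peak is 120 when equity falls to 90, so the maximum drawdown is (120 - 90) / 120 = 0.25.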

Results are reported as accuracy (QA/reasoning), executability rate, and MAE (coding). Human expert baselines serve as the reference ceiling (e.g., 100% executability, Sharpe MAE 0.07) (Kang et al., 13 Jan 2026).

4. Experimental Findings and Comparative Analysis

QuantEval’s evaluation of proprietary and open-source models reveals systematic capabilities and limitations:

Knowledge QA: Top proprietary models reach ~83–90% accuracy; reasoning tasks are harder (44–55%), with humans achieving 91.75% (QA) and 89.05% (reasoning).
Strategy Coding: Only advanced proprietary models achieve meaningful executability rates (51.7–63.3%); most open-source models register 0%. Metric MAEs remain materially higher than expert solutions (Sharpe MAE 0.16–0.18).
CoT prompting significantly improves reasoning (up to +21.2 percentage points for GPT-5) but can impair recall-based QA.
Errors cluster around arithmetic missteps in reasoning and API or interface violations in coding.

When compared to earlier financial NLP benchmarks, QuantEval is unique in: (i) integrating end-to-end executable strategy tasks, (ii) using a deterministic and reproducible backtest environment, (iii) explicitly quantifying code correctness via domain metrics, and (iv) exposing gaps between LLM and human practitioners under realistic constraints (Kang et al., 13 Jan 2026).

5. Supervised and Reinforcement Learning Enhancements

Domain-specific supervised fine-tuning (SFT) and group relative policy optimization (GRPO) are employed with ~57,000 in-domain samples from datasets such as Agentar-DeepFinance-100K, DianJin-R1, FinQA, and custom cases.

SFT raises QA accuracy from 50.0% to 53.2% and reasoning accuracy from 42.0% to 47.8%, and brings Sharpe MAE to 0.85.
RL further improves reasoning and lowers Sharpe MAE to 0.72 over 50K steps.
Gaps to human-level performance persist: multi-step reasoning and robust, interface-compliant code generation remain unsolved at scale, indicating the need for further work in numerical precision, hybrid architectures, and fine-tuning protocols (Kang et al., 13 Jan 2026).
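For readers unfamiliar with GRPO, its central step is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that normalization; the reward values are hypothetical, and how QuantEval's training run defines rewards is not described in the source.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Responses that beat the group average get positive advantages and the
    rest get negative ones. `eps` guards against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four sampled answers to one reasoning prompt.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal.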

6. Reproducibility, Limitations, and Impact

QuantEval provides its entire protocol, data, backtesting harness, and metric code to enable reproducible results and standardized benchmarking across the LLM community. By benchmarking knowledge, reasoning, and strategy coding in a single suite with rigorous metric baselines and statistical reporting, QuantEval establishes a foundation to accelerate LLM development for practical, risk-aware quantitative trading tasks.

Limitations acknowledged include:

Coverage is limited to CTA-style strategies (no multi-period, options, or real-time strategies)
Training-data contamination is controlled but not eliminated
Only the Python ecosystem and a fixed asset universe are supported in the initial release

Nonetheless, QuantEval is currently the only benchmark enabling the empirical study of LLMs’ ability to reason, compute, and generate actionable code across realistic financial workflows (Kang et al., 13 Jan 2026).

7. Position Relative to Other Quantitative and Domain Benchmarks

QuantEval constitutes the first unified end-to-end LLM benchmark for financial quantitative analysis. In contrast, prior work such as EQUATE targets quantitative reasoning in NLI (natural language inference) without code execution or strategy simulation (Ravichander et al., 2019); HumanEval and Qiskit HumanEval focus on code correctness in classical and quantum domains, respectively, but not financial strategy risk–return assessment (Vishwakarma et al., 2024). The integration of a deterministic backtesting harness with performance-based strategy evaluation is unique to QuantEval as of its publication, addressing an unmet need in automated finance research and LLM deployment benchmarking (Kang et al., 13 Jan 2026).