StockBench: LLM Trading Benchmark

Updated 4 October 2025

StockBench is an open-source benchmark offering contamination-free, multi-modal evaluation of LLM agents in realistic trading environments.
It simulates back-trading with fixed DJIA stocks, daily market data, and curated news to replicate authentic market dynamics.
The framework evaluates operational performance using key metrics such as cumulative return, maximum drawdown, and Sortino ratio.

StockBench is an open-source, contamination-free benchmark explicitly designed for evaluating LLM agents within realistic, multi-month stock trading environments (Chen et al., 2 Oct 2025). The framework delivers daily market signals—including price histories, key company fundamentals, and curated news streams—for a defined bundle of investment targets. Agents, structured to imitate the iterative decision-making workflow of human traders, must sequentially generate buy, sell, or hold decisions that determine portfolio allocation and trading execution. StockBench advances existing evaluation paradigms by shifting focus from static financial knowledge tasks to assessing dynamic, operational performance in noisy, out-of-sample market conditions, thereby providing a standardized platform for benchmarking autonomous financial agents in real-world scenarios.

1. Benchmark Architecture and Data Sources

StockBench simulates realistic back-trading environments using three integral data sources:

Investment Universe: A static bundle of twenty Dow Jones Industrial Average (DJIA) stocks selected for industry diversity and high market weight, capturing characteristic sectoral heterogeneity.
Market Data: Daily OHLCV prices and company fundamental indicators (e.g., market cap, P/E ratio, dividend yield) are included for each stock; historical slices are drawn from a fixed window that postdates model training cutoffs (e.g., March–July 2025), which guards against training-data contamination.
News Corpus: Up to five time-relevant news articles per stock (published within the previous 48 hours), adding a textual information stream alongside the numeric market data.
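The news rule described above (at most five articles per stock, published within the previous 48 hours) can be sketched as a simple filter. This is a minimal illustration, not the benchmark's actual loader: the `NewsItem` type, the `select_news` function, and the choice to keep the most recent articles first are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class NewsItem:
    """Hypothetical record type; field names are illustrative."""
    ticker: str
    published: datetime
    headline: str


def select_news(items, ticker, as_of, max_items=5, window_hours=48):
    """Return up to `max_items` articles for `ticker` from the trailing window.

    Assumption: when more than `max_items` qualify, the newest are kept.
    """
    cutoff = as_of - timedelta(hours=window_hours)
    recent = [
        n for n in items
        if n.ticker == ticker and cutoff <= n.published < as_of
    ]
    recent.sort(key=lambda n: n.published, reverse=True)
    return recent[:max_items]
```

Excluding articles stamped at or after `as_of` keeps the agent from seeing news that postdates the decision time.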

The agent workflow is formalized into four sequential modules: (1) portfolio overview, (2) targeted stock analysis, (3) decision generation (buy/sell/hold for each asset), and (4) order execution with output validation.
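The four modules can be pictured as a small pipeline. The sketch below is illustrative only: every function name is hypothetical, `policy` stands in for the LLM call, and the execution rule (one share per buy or sell, invalid actions treated as hold) is an assumption rather than the benchmark's actual order logic.

```python
VALID_ACTIONS = ("buy", "sell", "hold")


def portfolio_overview(cash, holdings, prices):
    """Step 1: summarize cash, positions, and total portfolio value."""
    value = cash + sum(qty * prices[t] for t, qty in holdings.items())
    return {"cash": cash, "holdings": dict(holdings), "value": value}


def analyze_stock(ticker, prices, news):
    """Step 2: gather per-stock context (placeholder for LLM analysis)."""
    return {"ticker": ticker, "price": prices[ticker], "news": news.get(ticker, [])}


def generate_decisions(overview, analyses, policy):
    """Step 3: one action per asset; `policy` stands in for the LLM."""
    return {a["ticker"]: policy(overview, a) for a in analyses}


def execute_orders(cash, holdings, prices, decisions):
    """Step 4: validate each action, then trade one share per buy/sell."""
    holdings = dict(holdings)
    for ticker, action in decisions.items():
        if action not in VALID_ACTIONS:
            action = "hold"  # assumption: invalid output falls back to hold
        price = prices[ticker]
        if action == "buy" and cash >= price:
            holdings[ticker] = holdings.get(ticker, 0) + 1
            cash -= price
        elif action == "sell" and holdings.get(ticker, 0) > 0:
            holdings[ticker] -= 1
            cash += price
    return cash, holdings


def trading_step(cash, holdings, prices, news, policy):
    """Run the four modules once for a single trading day."""
    overview = portfolio_overview(cash, holdings, prices)
    analyses = [analyze_stock(t, prices, news) for t in prices]
    decisions = generate_decisions(overview, analyses, policy)
    return execute_orders(cash, holdings, prices, decisions)
```

Keeping validation inside the execution step means a malformed decision degrades to a no-op instead of halting the simulation, which is one possible design among several.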

2. Evaluation Protocols and Financial Metrics

StockBench performance evaluation employs three primary quantitative metrics:

Final (Cumulative) Return: Total portfolio return from start to end of the benchmark horizon, defined by

$\text{Final Return} = \frac{V_T - V_0}{V_0}$

where $V_0$ is initial value and $V_T$ is final value.
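As a quick worked instance of this formula, the following sketch computes the final return from a list of daily portfolio values. The function name and the sample values are illustrative.

```python
def final_return(values):
    """Cumulative return (V_T - V_0) / V_0 from a series of portfolio values."""
    v0, vt = values[0], values[-1]
    return (vt - v0) / v0
```

For example, a portfolio that starts at 100 and ends at 104 has a final return of 0.04, regardless of the path taken in between.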

Maximum Drawdown (MDD): The largest peak-to-trough percentage loss experienced during the trading period,

$\text{Max Drawdown} = \min_{t \in [0,T]} \left( \frac{V_t - \max_{s \leq t} V_s}{\max_{s \leq t} V_s} \right)$
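The expression above can be evaluated in a single pass by tracking the running peak. This is a minimal sketch with an illustrative function name; it follows the signed convention of the formula, so the result is zero or negative.

```python
def max_drawdown(values):
    """Most negative (V_t - running_peak) / running_peak over the series."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)                   # max over s <= t of V_s
        worst = min(worst, (v - peak) / peak)  # min over t of the relative loss
    return worst
```

For the series 100, 120, 90, 110 the peak is 120 and the trough after it is 90, giving a drawdown of -0.25.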

Sortino Ratio: A risk-adjusted performance measure focusing on negative volatility,

$\text{Sortino Ratio} = \frac{R_p}{\sigma_d}$

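where $R_p$ is the annualized excess return and $\sigma_d$ is the downside deviation,

$\sigma_d = \sqrt{\frac{1}{N_d} \sum_{i} (\min(R_i, 0))^2}$

Here $R_i$ is the portfolio return in period $i$, so only negative returns contribute to the sum, and $N_d$ is the count used to normalize it.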
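A minimal sketch of the downside deviation and the resulting ratio follows. The text does not say whether $N_d$ counts all periods or only those with negative returns, so the `count_all` flag below is an assumption that exposes both readings; function names are illustrative, and annualization is left to the caller.

```python
import math


def downside_deviation(returns, count_all=True):
    """sqrt of the mean squared negative return.

    count_all=True divides by the number of periods; False divides by the
    number of negative periods. Which one N_d denotes is an assumption.
    """
    squared = [min(r, 0.0) ** 2 for r in returns]
    n = len(returns) if count_all else sum(1 for r in returns if r < 0)
    if n == 0:
        return 0.0
    return math.sqrt(sum(squared) / n)


def sortino_ratio(excess_return, returns, count_all=True):
    """R_p / sigma_d; undefined when there is no downside volatility."""
    sigma_d = downside_deviation(returns, count_all)
    if sigma_d == 0.0:
        raise ValueError("no downside returns; ratio is undefined")
    return excess_return / sigma_d
```

The two normalizations can differ noticeably on short horizons, so comparisons across reports should confirm which one was used.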

Models are ranked holistically by averaging z-scores across these three metrics:

$\text{Composite Rank} = \frac{z(\text{Final Return}) - z(\text{Max Drawdown}) + z(\text{Sortino Ratio})}{3}$

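This composite captures the trade-off between return maximization and risk minimization. Note that subtracting the drawdown z-score penalizes risk only if drawdown enters as a magnitude; under the signed definition given earlier, where deeper losses are more negative, the sign of that term would need to flip.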
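The aggregation can be sketched as follows. Two assumptions are made here and are not confirmed by the text: z-scores use the population standard deviation across the evaluated models, and drawdown is passed as a positive magnitude so that subtracting its z-score penalizes larger losses. Function names are illustrative.

```python
from statistics import mean, pstdev


def zscores(xs):
    """Standardize a metric across models (population std; an assumption)."""
    mu, sd = mean(xs), pstdev(xs)
    return [(x - mu) / sd if sd > 0 else 0.0 for x in xs]


def composite_scores(final_returns, drawdown_magnitudes, sortinos):
    """(z(return) - z(|drawdown|) + z(Sortino)) / 3 for each model."""
    zr = zscores(final_returns)
    zd = zscores(drawdown_magnitudes)
    zs = zscores(sortinos)
    return [(r - d + s) / 3 for r, d, s in zip(zr, zd, zs)]
```

A model with the highest return, the smallest drawdown magnitude, and the highest Sortino ratio receives the highest composite score under this convention.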

3. Agent Performance and Observed Limitations

The benchmark evaluates both proprietary (GPT-5, Claude-4) and open-weight (Qwen3 series, Kimi-K2, GLM-4.5) LLM agents. A simple equal-weight buy-and-hold baseline, which returns 0.4% with a maximum drawdown of –15.2%, serves as the reference point against which the agents are compared.
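An equal-weight buy-and-hold baseline of this kind can be sketched as below. This is not the benchmark's implementation: the function name, the starting capital, the use of fractional shares, and the absence of transaction costs are all simplifying assumptions.

```python
def buy_and_hold_values(price_history, initial_cash=100_000.0):
    """Daily portfolio values for an equal-weight buy-and-hold strategy.

    price_history maps ticker -> list of daily closes (equal lengths).
    Assumptions: fractional shares, no fees, purchase at the first close.
    """
    tickers = list(price_history)
    budget = initial_cash / len(tickers)  # equal capital per stock
    shares = {t: budget / price_history[t][0] for t in tickers}
    n_days = len(price_history[tickers[0]])
    return [
        sum(shares[t] * price_history[t][d] for t in tickers)
        for d in range(n_days)
    ]
```

The resulting value series can be fed to the same return and drawdown calculations used for the agents, which keeps the comparison like for like.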

Key findings include:

Certain agents (e.g., Kimi-K2, Qwen3-235B-Ins) achieve cumulative returns above 2% and drawdowns reduced to roughly –11% to –14%. Others underperform the baseline, particularly with respect to sustained downside risk.
Models fine-tuned for reasoning tasks (e.g., Qwen3-235B-Think) do not demonstrably exceed instruct-tuned versions in trading performance, indicating that proficiency in static reasoning or QA-style financial knowledge does not correlate with superior active trading outcomes.
Operational errors are common: arithmetic mistakes, schema formatting errors, and occasional misalignment with the reward objective show that output reliability remains a challenge regardless of model scale or sophistication.
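To make the schema-formatting failure mode concrete, here is a minimal sketch of an output validator. The JSON shape (a flat object mapping ticker to action) is a hypothetical schema chosen for illustration; the benchmark's actual output format is not specified in this text.

```python
import json

ALLOWED = ("buy", "sell", "hold")


def parse_decisions(raw, tickers):
    """Parse an agent reply and return (decisions or None, list of errors).

    Assumed schema: {"TICKER": "buy" | "sell" | "hold", ...} covering
    exactly the tickers in the investment universe.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None, ["reply is not valid JSON"]
    if not isinstance(data, dict):
        return None, ["reply is not a JSON object"]
    errors = []
    for t in tickers:
        if t not in data:
            errors.append(f"missing decision for {t}")
        elif data[t] not in ALLOWED:
            errors.append(f"invalid action for {t}: {data[t]!r}")
    for t in data:
        if t not in tickers:
            errors.append(f"unknown ticker {t}")
    return (data if not errors else None), errors
```

Returning the full error list, instead of stopping at the first problem, makes it possible to log how often each kind of formatting mistake occurs per model.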

4. Agent Workflow and Market Dynamics

StockBench formalizes a daily trading cycle paralleling practitioner workflows: each agent receives an aggregated overnight market update, conducts asset-wise analysis integrating fundamentals and recent news, generates structured trade recommendations, and executes orders within a realistic market simulation. This iterative protocol closely tracks industry-standard portfolio management practice, with fidelity to operational latency, information delays, and noise.

Crucially, the benchmark exposes agents to market regime variations (bullish, bearish, ranging), testing robustness against non-stationary environments and abrupt event shocks. Performance stratification across different micro-regimes (e.g., drawdown during volatile months versus value accrual in stable periods) provides deeper insight into agent adaptability.
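One simple way to stratify performance by period is to compute a return for each calendar month. The sketch below is an assumption about how such a breakdown could be done, not the benchmark's method: it measures each month from its first recorded value to its last, and the function name is illustrative.

```python
from datetime import date


def monthly_returns(dates, values):
    """Return {(year, month): return} from first to last value in each month."""
    by_month = {}
    for d, v in zip(dates, values):
        key = (d.year, d.month)
        first, _ = by_month.get(key, (v, v))  # keep the first value seen
        by_month[key] = (first, v)            # update the last value seen
    return {k: (last - first) / first for k, (first, last) in by_month.items()}
```

Comparing these per-month figures against the baseline's makes it easier to see whether an agent's edge comes from stable periods or survives volatile ones.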

5. Contamination Resistance and Reproducibility

A foundational feature of StockBench is strict contamination resistance: the historical data slices postdate the known training cutoffs of the evaluated models, so the models cannot have seen the evaluation period during training. This matters for financial research, where realistic simulation and reproducibility are paramount. The open-source availability of both complete datasets and code enables rigorous cross-comparison and systematic extension by the research community.
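The contamination condition reduces to a date comparison between each model's training cutoff and the start of the evaluation window. In the sketch below, the function name, model labels, and cutoff dates are all placeholders; real cutoffs would have to come from each model's documentation.

```python
from datetime import date


def contamination_check(window_start, model_cutoffs):
    """Map each model to True if its training cutoff precedes the window."""
    return {name: cutoff < window_start for name, cutoff in model_cutoffs.items()}


# Hypothetical cutoffs, for illustration only.
example = contamination_check(
    date(2025, 3, 1),
    {"model_a": date(2024, 10, 1), "model_b": date(2025, 4, 1)},
)
```

A model flagged `False` would need a later evaluation window before its results could be treated as out-of-sample.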

6. Research Implications and Future Directions

Initial empirical results highlight several bottlenecks and research avenues:

Surpassing the buy-and-hold baseline in dynamic, noisy environments remains difficult for LLM agents. This points to a need for agent architectures better suited to sequential error correction, dynamic portfolio rebalancing, and reward-aligned learning objectives.
The gap between static financial reasoning/QA benchmarks and true market-aligned sequential decision-making is considerable.
Opportunities lie in integrating enhanced error management (arithmetic, output schema), more robust fusion of multi-modal information streams, and adaptive strategies specific to market regime detection.

A plausible implication is that agent workflow design, market-sensitive evaluation, and operational error reduction are as critical as model scale or pretraining sophistication in achieving profitable, robust autonomous trading.

7. Position in the Benchmark Ecosystem

StockBench extends the landscape of financial AI evaluation beyond prior static QA or trend prediction benchmarks, focusing on dynamic, operations-level trading realism. In doing so, it complements frameworks such as QuantBench (Wang et al., 24 Apr 2025), FinTSB (Hu et al., 26 Feb 2025), and LOBCAST (Prata et al., 2023), all of which emphasize standardized protocols, multi-modal data, realistic simulation, and contamination resistance. StockBench directly targets the challenges of LLM-driven financial decision-making and offers a reproducible testbed for future advances in autonomous agent-based trading research.