
StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets? (2510.02209v1)

Published 2 Oct 2025 in cs.LG and cs.CL

Abstract: LLMs have recently demonstrated strong capabilities as autonomous agents, showing promise in reasoning, tool use, and sequential decision-making. While prior benchmarks have evaluated LLM agents in domains such as software engineering and scientific discovery, the finance domain remains underexplored, despite its direct relevance to economic value and high-stakes decision-making. Existing financial benchmarks primarily test static knowledge through question answering, but they fall short of capturing the dynamic and iterative nature of trading. To address this gap, we introduce StockBench, a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals -- including prices, fundamentals, and news -- and must make sequential buy, sell, or hold decisions. Performance is assessed using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. Our evaluation of state-of-the-art proprietary (e.g., GPT-5, Claude-4) and open-weight (e.g., Qwen3, Kimi-K2, GLM-4.5) models shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively. These findings highlight both the challenges and opportunities in developing LLM-powered financial agents, showing that excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies. We release StockBench as an open-source resource to support reproducibility and advance future research in this domain.

Summary

  • The paper presents a contamination-free benchmark that simulates sequential, real-world trading environments for LLM agents.
  • It outlines a realistic agent workflow that integrates daily market data, news, and portfolio analysis for informed decision-making.
  • Results indicate select LLM agents outperform passive baselines in profitability and risk management, though consistent superiority remains challenging.

StockBench: Evaluating LLM Agents in Real-World Stock Trading

Introduction

StockBench presents a contamination-free benchmark for assessing the capabilities of LLM agents in realistic, multi-month stock trading environments. Unlike prior financial benchmarks that focus on static question answering, StockBench requires agents to make sequential buy, sell, or hold decisions based on daily market signals, including prices, fundamentals, and news. The benchmark is designed to measure both profitability and risk management using established financial metrics, and it is instantiated with recent market data to prevent data leakage from LLM training corpora (Figure 1).

Figure 1: StockBench architecture, comprising a back-trading benchmark dataset and a workflow that converts backbone LLMs into trading agents.

Benchmark Design

Back-Trading Environment

StockBench simulates real-world trading by exposing agents only to information available at each decision point; a minimal sketch of this time-restricted access follows the list below. The environment is constructed around three pillars:

  • Investment Targets: 20 high-weight DJIA stocks are selected to ensure diversity and stability, minimizing the impact of sentiment-driven volatility.
  • Historical Market Data: Includes opening prices and fundamental metrics (market cap, P/E ratio, dividend yield, trading range), with strict timestamping to prevent future data leakage.
  • News Corpora: Up to five relevant news articles per stock are collected daily, restricted to the previous 48 hours, balancing context coverage and LLM computational cost.
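
As a rough illustration of this time-restricted access, the sketch below filters prices and news by timestamp before they are surfaced to the agent. It is a hedged assumption, not the released StockBench code: the record layout, field names, and recency-based top-k news selection are illustrative.

```python
from datetime import datetime, timedelta

# Hypothetical data stores: field names and contents are assumptions for illustration.
PRICES = {
    "AAPL": [{"as_of": datetime(2025, 3, 3, 9, 30), "open": 238.0, "pe_ratio": 37.5}],
}
NEWS = [
    {"ticker": "AAPL", "published": datetime(2025, 3, 2, 14, 0), "title": "Example headline"},
]

def observable_prices(ticker: str, decision_time: datetime) -> list:
    """Only records timestamped at or before the decision point are visible to the agent."""
    return [r for r in PRICES.get(ticker, []) if r["as_of"] <= decision_time]

def observable_news(ticker: str, decision_time: datetime, hours: int = 48, k: int = 5) -> list:
    """Up to k articles from the trailing 48-hour window before the decision point."""
    start = decision_time - timedelta(hours=hours)
    hits = [n for n in NEWS
            if n["ticker"] == ticker and start <= n["published"] <= decision_time]
    return sorted(hits, key=lambda n: n["published"], reverse=True)[:k]
```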

The evaluation window (March–June 2025) is chosen to fall after the knowledge cutoffs of mainstream LLMs, ensuring a contamination-free assessment.

Agent Workflow

The agent workflow is minimal and realistic, mirroring retail investor behavior:

  1. Portfolio Overview: Agent receives data for all stocks, including news, holdings, historical actions, and prices.
  2. In-Depth Stock Analysis: Agent selects stocks for deeper analysis, receiving additional fundamental data.
  3. Decision Generation: Agent outputs buy, sell, or hold actions for each stock.
  4. Execution and Validation: Actions are executed, with liquidity constraints enforced; invalid actions require revision (a minimal validation sketch follows this list).
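
Below is a minimal sketch of the execution-and-validation step under stated assumptions: the JSON fields (ticker, action, amount_usd), the action vocabulary, and the error messages are illustrative, not the benchmark's exact interface.

```python
import json

VALID_ACTIONS = {"buy", "sell", "hold"}  # assumed action vocabulary

def validate_decisions(raw_output, cash, prices, holdings):
    """Parse the agent's JSON decisions and flag schema or liquidity violations.
    `prices` maps ticker -> price; `holdings` maps ticker -> shares held."""
    try:
        decisions = json.loads(raw_output)
    except json.JSONDecodeError as exc:          # schema error: malformed JSON
        return None, [f"schema error: {exc}"]

    errors, planned_spend = [], 0.0
    for d in decisions:
        ticker, action = d.get("ticker"), d.get("action")
        if action not in VALID_ACTIONS:
            errors.append(f"{ticker}: unknown action {action!r}")
            continue
        amount = float(d.get("amount_usd", 0.0))
        if action == "buy":
            planned_spend += amount
        elif action == "sell":
            price = prices.get(ticker)
            if price is None:
                errors.append(f"{ticker}: no price available")
            elif amount / price > holdings.get(ticker, 0.0):
                errors.append(f"{ticker}: selling more shares than currently held")
    if planned_spend > cash:                     # liquidity constraint
        errors.append(f"liquidity error: plan spends {planned_spend:.2f}, cash is {cash:.2f}")
    return decisions, errors

# In StockBench, flagged actions are returned to the agent, which must revise
# its decisions until they can be executed within available resources.
```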

This workflow is designed to avoid inductive biases and ensure fair comparison across LLM architectures.

Experimental Evaluation

Setup

  • Models: Both proprietary (GPT-5, Claude-4) and open-weight (Qwen3, Kimi-K2, GLM-4.5, DeepSeek) LLMs are evaluated, each with 32k context windows and official decoding settings.
  • Baseline: An equal-weight buy-and-hold strategy serves as the passive benchmark.
  • Metrics: Final return, maximum drawdown, and Sortino ratio are computed, with a composite z-score ranking (a minimal computation sketch follows this list).
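
A minimal sketch of how these metrics and the composite z-score ranking could be computed from a daily series of portfolio values is shown below; the sign conventions, the minimal acceptable return, and the exact return definitions are assumptions here, not the paper's specification.

```python
import numpy as np

def final_return(values):
    """Cumulative return from the first to the last portfolio value."""
    return values[-1] / values[0] - 1.0

def max_drawdown(values):
    """Largest peak-to-trough decline, reported here as a positive fraction."""
    peaks = np.maximum.accumulate(values)
    return float(np.max((peaks - values) / peaks))

def sortino_ratio(values, mar=0.0):
    """Mean excess daily return divided by downside deviation (MAR assumed to be 0)."""
    rets = np.diff(values) / values[:-1]
    downside = np.minimum(rets - mar, 0.0)
    dd = np.sqrt(np.mean(downside ** 2))
    return float((rets.mean() - mar) / dd) if dd > 0 else float("inf")

def composite_score(metric_table):
    """Average per-metric z-scores across models (rows = models, cols = metrics).
    Metrics where lower is better (e.g., drawdown) should be negated beforehand."""
    z = (metric_table - metric_table.mean(axis=0)) / metric_table.std(axis=0)
    return z.mean(axis=1)

values = np.array([100_000.0, 101_200.0, 99_500.0, 102_300.0])  # toy daily portfolio values
print(final_return(values), max_drawdown(values), sortino_ratio(values))
```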

Results

  • Profitability: While most LLM agents struggle to consistently beat the passive baseline (0.4% return, -15.2% maximum drawdown, Sortino 0.0155), top agents (Kimi-K2, Qwen3-235B-Ins, GLM-4.5) achieve >2% returns and improved risk profiles.
  • Risk Management: All agents exhibit lower maximum drawdowns than the baseline, indicating effective downside risk mitigation.
  • Reasoning vs. Instruction Tuning: Reasoning-tuned models do not consistently outperform instruction-tuned variants; for example, Qwen3-235B-Ins achieves lower drawdown than Qwen3-235B-Think, highlighting a gap between reasoning ability and dynamic decision-making.

Analysis

Investment Target Size

Performance degrades and volatility increases as the number of tradable assets grows. Larger models (e.g., Kimi-K2) demonstrate greater robustness to portfolio expansion, maintaining positive returns and lower variability, while smaller models (GPT-OSS-120B) suffer from excessive volatility.

Error Analysis

  • Arithmetic Errors: More frequent in instruction-tuned models; reasoning-tuned models excel in numerical reasoning.
  • Schema Errors: More frequent in reasoning-tuned models, likely due to overcomplex output structures.

Ablation Study

Removing news and fundamental data leads to consistent declines in cumulative return, confirming that LLM agents benefit from integrating heterogeneous inputs. Kimi-K2 is more robust to missing news, but performance drops sharply when both modalities are absent.

Evaluation Window Impact

Model rankings shift significantly between downturn and upturn market periods. Some models (GPT-OSS-120B) perform better in bullish conditions, while others (Kimi-K2) maintain stable performance. Notably, LLM agents fail to outperform the baseline in bearish markets, indicating limitations in adapting to adverse regimes.

Model Return Variance

Return variance analysis reveals that DeepSeek-V3 exhibits the highest stability, while GPT-OSS-120B is highly volatile. High variance is undesirable for real-world financial applications, emphasizing the need for robust agent architectures.

Implications and Future Directions

StockBench demonstrates that current LLM agents can trade profitably and manage risk, but consistent outperformance of passive strategies remains elusive. The findings suggest that static financial knowledge and reasoning skills do not directly translate to effective trading in dynamic environments. Robustness to portfolio size, error handling, and adaptability to market regimes are critical challenges.

Practically, StockBench provides a reproducible, contamination-free framework for benchmarking LLM agents in finance. The open-source release enables further research into agent architectures, multi-modal input integration, and reinforcement learning approaches tailored to financial decision-making.

Theoretically, the results highlight the need for agentic LLMs to develop temporal reasoning, uncertainty modeling, and robust schema adherence. Future work should explore hybrid architectures combining LLMs with specialized financial models, continual learning, and meta-learning strategies to enhance adaptability and generalization.

Conclusion

StockBench establishes a rigorous benchmark for evaluating LLM agents in real-world stock trading, revealing both their potential and current limitations. While some models achieve profitable and risk-managed trading, consistent superiority over passive baselines is not yet realized. The benchmark will catalyze research into more capable, robust, and adaptive financial agents, with future extensions targeting broader market scenarios and advanced agentic capabilities.


Explain it Like I'm 14

Overview

This paper introduces StockBench, a new way to test whether AI agents built from LLMs can make good decisions when trading real stocks over several months. Instead of just answering finance questions, these AI agents must react to real market signals each day—like prices, company data, and news—and decide whether to buy, sell, or hold. The goal is to measure not only how much money they can make, but also how well they manage risk.

Objectives

The paper asks three simple questions:

  • Can LLM-based agents trade stocks in a realistic setting, day after day, and make a profit?
  • Do these agents handle risk (like big losses) better than simple strategies?
  • Does knowing lots of finance facts (good at Q&A) actually help with real trading decisions in a fast-changing market?

Methods and Approach

Think of StockBench like a “stock market video game” that uses real past data. The AI agent logs in each morning and makes choices, just like a regular investor.

Here’s how it works:

  • Realistic setup: The agents trade 20 well-known Dow Jones stocks from March to June 2025. This time period was chosen so models wouldn’t have seen the data during training (like giving a test with questions written after the textbook).
  • Daily inputs: Each day, the agent gets three types of clues:
    • Prices: What stocks opened at and recent movements.
    • Fundamentals: Simple facts about a company’s health (like market size, P/E ratio, and dividends).
    • News: Up to five recent headlines from the past 48 hours.
  • Trading workflow (four steps):
    1. Portfolio overview: Scan all 20 stocks and see what’s going on.
    2. In-depth analysis: Pick some stocks and look at their financial stats more closely.
    3. Decision: For each stock, decide whether to buy more, sell some, or hold.
    4. Execution: Turn dollar amounts into share counts. If the plan spends more cash than available, fix it and try again.
  • Baseline for comparison: A simple “buy-and-hold” strategy that splits money equally across all 20 stocks on day one and then does nothing until the end. This is like a basic, hands-off approach.
  • How performance is judged (a small worked example follows this list):
    • Cumulative return: How much the portfolio gained or lost by the end.
    • Maximum drawdown: The worst drop from a peak to a low point during the test (a measure of pain).
    • Sortino ratio: A score that looks at return but only penalizes the “bad” ups and downs (downside volatility), not the good ones.
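
As a quick made-up example (the numbers are not from the paper): if a portfolio starts at $10,000, climbs to $11,000, then dips to $9,900 before recovering to $10,400, the cumulative return is +4%, but the maximum drawdown is (11,000 - 9,900) / 11,000, or about 10%, because that was the worst slide from a peak to a low point along the way.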

Main Findings

What did they learn?

  • Mixed success versus the simple baseline: In the four-month window studied, several AI agents earned slightly more than the buy-and-hold approach and also had smaller worst-case drops. However, strong performance in finance Q&A did not guarantee better trading. Being good at facts and static tests isn’t the same as making smart moves in a noisy, fast-changing market.
  • Risk management showed promise: Many agents had lower maximum drawdowns than the baseline, meaning they sometimes avoided the worst dips better than doing nothing.
  • “Reasoning” models weren’t always better: Models tuned to think step-by-step (like “Think” versions) did not consistently beat simpler instruction-following versions. More “thinking” didn’t automatically mean better trading decisions.
  • Errors matter: Agents often made two kinds of mistakes:
    • Arithmetic errors: Miscalculating how many shares to buy with a given budget.
    • Schema errors: Outputting the wrong format (like messy JSON), which breaks the trading system.
    • Reasoning-heavy models made fewer math mistakes but more formatting mistakes, possibly because they “overthink” and produce complex outputs.
  • Scaling up is hard: When increasing the number of tradable stocks (from 5 up to 30), performance became more unstable. Bigger portfolios made the job tougher, especially for smaller models.
  • Inputs help: Removing news or fundamental data reduced performance. Agents did best when they combined both types of information—numbers and narratives.
  • Market conditions change the story: In a separate test, during a market downturn (bearish period), AI agents generally failed to beat the baseline. In an upturn (bullish period), many did better. This suggests some agents are more suited to rising markets than falling ones.

Why is this important?

  • It shows that making smart, money-related decisions day after day is a very different skill than answering test questions.
  • It highlights the need to judge AI agents on both profit and safety (risk), especially when real money could be involved.

Implications and Impact

StockBench gives researchers and builders a fair, updated, and realistic test bed to improve AI trading agents. The key takeaways:

  • Don’t assume an AI that aces finance quizzes will be a great trader. Dynamic decision-making requires different skills.
  • Real trading needs careful risk control. Lower drawdowns are valuable, even if profits aren’t huge.
  • Better tools and training are needed to reduce errors and handle more complex portfolios.
  • Performance depends on market conditions. Future agents must adapt to both good times and bad.
  • By open-sourcing StockBench and keeping it “contamination-free” (using new data not seen during training), the community can build and compare agents more fairly, speeding up progress toward safer, smarter financial AI.

In short, StockBench is a stepping stone: today’s LLM agents show hints of promise, but they’re not reliable stock traders yet. This benchmark will help researchers make them safer, steadier, and more effective over time.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, phrased to enable actionable follow-up work.

  • Transaction costs and market frictions are not modeled (commissions, bid–ask spreads, slippage, taxes). Open question: How do the agents’ returns and risk metrics change when realistic execution costs and tax considerations are applied?
  • Order execution realism is limited to fills at the opening price with no order types, partial fills, or intraday dynamics. Open question: What is the impact of realistic microstructure (limit/market orders, partial fills, price impact, latency) on agent performance?
  • Short selling, leverage, and margin are excluded. Open question: How do LLM agents perform under broader instrument sets (shorting, options, ETFs, futures) and explicit leverage/margin constraints?
  • Dividend and corporate action handling is unspecified (ex-dividend timing, split/merger effects, cash crediting). Open question: Does incorporating realized dividends and corporate actions materially change outcomes?
  • Baselines are restricted to equal-weight buy-and-hold. Open question: Do LLM agents outperform widely used systematic baselines (e.g., momentum, mean-reversion, moving-average crossover, risk parity, minimum variance, DJIA index tracker)?
  • Statistical rigor is limited (three seeds, 82 trading days, no confidence intervals or multiple-comparisons control). Open question: Are observed outperformance and risk differences statistically significant under bootstrapping, out-of-sample testing, and corrections for multiple model comparisons?
  • Composite ranking via equal-weighted z-scores may bias conclusions. Open question: How sensitive are rankings to alternative metric sets and weightings (e.g., Sharpe, Calmar, Omega, alpha vs benchmark, tracking error) and to different composite formulations?
  • Sortino ratio definition is incomplete (excess return R_p and minimal acceptable return/risk-free rate not specified). Open question: How do results change under different MAR/risk-free assumptions and alternative downside definitions?
  • “Contamination-free” claim is unverified for proprietary models with unknown training corpora and cutoffs. Open question: Can auditing protocols (e.g., provenance checks, temporal holdout beyond model cutoffs, synthetic blinds) substantiate contamination-free evaluation?
  • Reproducibility of dynamic news inputs is unclear (API-based retrieval, top-5 selection). Open question: Will the authors release a frozen snapshot (full text, timestamps, URLs, hashes) and documented ranking criteria to guarantee exact replication?
  • Action schema is ambiguous: Step 3 specifies “increase/decrease/hold,” but execution references “dollar targets.” Open question: What is the precise decision format (position sizing rules, target weights/amounts, constraints) and how is sizing determined?
  • Turnover, trade count, and average holding period are not reported. Open question: What are agents’ trading intensities, and how do turnover and holding periods affect costs and realized performance?
  • Historical price context is minimal (opening price, 52-week high/low) without time-series features. Open question: Does providing richer historical signals (returns, volatility, technical indicators, rolling fundamentals) improve agent decisions?
  • Memory and planning across days are constrained (only past 7 days of actions shared). Open question: How do longer memory windows, episodic memory, or planning modules impact consistency, regime adaptation, and performance?
  • Tool augmentation is not explored, despite arithmetic and schema errors. Open question: Do calculator tools, schema validators, self-checkers, or constrained decoding materially reduce execution errors and improve returns?
  • Scalability beyond 20 DJIA constituents shows degradation but is not dissected. Open question: Which bottlenecks (context limits, analysis load, action sizing) drive performance decay, and what architectures/workflows enable stable scaling to larger universes?
  • Market regime analysis is preliminary (bear-period underperformance noted). Open question: Can regime-aware agents (regime detection, adaptive prompts, conditional policies) improve robustness across downturns and high-volatility phases?
  • Liquidity modeling is limited to available cash; stock-level liquidity is not considered. Open question: How do volume constraints, ADV-based sizing, and liquidity-adjusted risk affect feasibility and outcomes?
  • Decision horizon is fixed to daily at market open. Open question: How do results vary across intraday, weekly, or monthly horizons, and what is the horizon that best matches LLM strengths vs. costs?
  • Human comparison is absent. Open question: How do LLM agents compare to retail/professional discretionary baselines when given the same information and constraints?
  • Explainability and attribution are not evaluated (do rationales correspond to actual drivers?). Open question: Can the benchmark measure decision-grounding quality (alignment to news/fundamentals), confidence calibration, and post-hoc attribution to inputs?
  • Prompt fairness and decoding sensitivity are under-specified. Open question: How robust are outcomes to prompt variants, instruction vs. CoT, sampling temperature, and system settings; can standardized prompt suites reduce bias across models?
  • Period length is short (four months). Open question: Do findings hold under longer rolling windows (multi-year), multiple non-overlapping test periods, and different start dates to assess generalization and seasonality?
  • Asset coverage is narrow (large-cap DJIA). Open question: How do agents perform on mid/small caps, higher-news-volatility names, sector-specific sets, and international markets?
  • News preprocessing, relevance ranking, and language/paywall handling are not detailed. Open question: What is the impact of deduplication, time alignment, sentiment labeling, and top-k truncation on decision quality?
  • Safety/compliance is not considered (risk disclosures, suitability constraints). Open question: How should the benchmark incorporate compliance guardrails (position limits, suitability filters, risk warnings) for deployment to retail contexts?
  • Data reliability and failure modes (API outages, delayed feeds) are not addressed. Open question: How do agents handle missing/delayed inputs (imputation, fallback logic) without degrading decisions?
  • Adaptation/learning is not explored (agents are zero-shot). Open question: Can fine-tuning, RL on historical regimes, or online learning improve profitability and stability without overfitting?
  • Risk measurement is limited (max drawdown and Sortino). Open question: Add path-dependent and tail-risk metrics (drawdown duration, VaR/CVaR, time-to-recovery) and factor exposures (beta, sector/factor loadings) to assess true risk management.
  • Cost-effectiveness is unmeasured (token usage, latency, dollar cost per decision). Open question: What is the performance–cost frontier, and which workflows yield the best risk-adjusted returns per dollar of inference?

Practical Applications

Immediate Applications

The following applications can be deployed now, leveraging StockBench’s open-source benchmark, trading workflow, and empirical findings.

  • Finance — Pre-deployment evaluation harness for AI trading agents
    • Use case: Banks, hedge funds, and brokerages can score LLM-based trading agents against contamination-free, multi-month scenarios using StockBench’s composite metrics (Final Return, Max Drawdown, Sortino).
    • Tools/workflows: StockBench CI pipeline for model gating, regime-specific test windows, portfolio-size stress tests (5/10/20/30 assets), z-score ranking, error logging (arithmetic/schema).
    • Assumptions/dependencies: Requires live or recent market data licenses (prices, fundamentals, news), inclusion of transaction costs and slippage in bespoke extensions, and vendor API reliability for large context windows.
  • Finance — Risk-testing sandbox for retail brokerage “AI co-pilots”
    • Use case: Brokerages can safely sandbox agent outputs (buy/sell/hold) without executing real trades, validating liquidity constraints and JSON schema adherence before enabling any human-in-the-loop workflows.
    • Tools/workflows: Execution validator (liquidity and budget checks), schema guards, calculator tools to prevent arithmetic mistakes, daily open decision simulation.
    • Assumptions/dependencies: Strong compliance guardrails; paper’s results indicate agents struggle in downturns—limit to advisory or paper-trading.
  • Finance — Research desk assistant for portfolio overview and news synthesis
    • Use case: Equity research teams get daily portfolio scans with summarized top-5 news per ticker, relevant fundamentals, and risk flags; human analysts validate trade ideas.
    • Tools/workflows: Portfolio overview stage, news API integration (48-hour window), fundamentals fetch, human review queue.
    • Assumptions/dependencies: News relevance ranking, timely data feeds, clear “not financial advice” disclosures; avoid automated execution.
  • Software/Fintech — Reliability wrappers for agent outputs
    • Use case: Productize “schema-guard” and “calc-guard” layers to reduce common agent errors identified in the paper (JSON format and arithmetic).
    • Tools/workflows: Structured output enforcement, function-calling calculators, retry/revision loops on bad outputs (a minimal guard sketch appears after this list).
    • Assumptions/dependencies: Access to tool-use features in chosen LLMs; rate limits and latency management.
  • Academia/Education — Hands-on module for AI-in-finance courses
    • Use case: Universities and training programs use StockBench to teach realistic agent evaluation, risk metrics, contamination avoidance, and regime sensitivity.
    • Tools/workflows: Course labs replicating four-stage workflow, ablations on news/fundamentals, regime comparisons (downturn vs upturn windows).
    • Assumptions/dependencies: Compute resources for multiple model runs and seeds; simplified data licenses for instruction.
  • Policy/RegTech — Pilot audits of AI trading tools
    • Use case: Regulators and regtech firms run pilot audits to assess AI agent risk profiles using benchmarked drawdown and downside-risk metrics before approving consumer exposure.
    • Tools/workflows: Minimum threshold checks (e.g., max drawdown caps), regime stress tests, documentation of contamination-free evaluation periods.
    • Assumptions/dependencies: Regulatory willingness; alignment on standardized test protocols; inclusion of realistic frictions.
  • Daily life — Paper-trading simulator for DIY investors
    • Use case: Consumers experiment with agent-generated suggestions in a simulated environment, tracking risk-adjusted performance without real capital at stake.
    • Tools/workflows: Portfolio simulation, risk dashboards (drawdown, Sortino), human override, educational prompts.
    • Assumptions/dependencies: Clear disclaimers; no auto-execution; results vary by market regime and portfolio size.
  • Finance/Procurement — Vendor benchmarking and model selection
    • Use case: Compare proprietary and open-weight LLMs (e.g., GPT-5, Claude-4, Qwen3 variants) using contamination-free windows to inform procurement decisions.
    • Tools/workflows: Standardized runs with multiple seeds, ablation on input modalities (news vs fundamentals), composite ranking reports.
    • Assumptions/dependencies: Representative tickers (DJIA top-20 are a starting point); extend to sector-specific universes; incorporate fees and constraints relevant to the institution.
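
As a rough sketch of the "schema-guard" retry wrapper mentioned under the reliability-wrappers item above: the function below repeatedly calls a supplied generation function, checks the reply against a required decision structure, and feeds errors back for revision. The function name, field set, and retry budget are assumptions for illustration, not an existing product or the paper's implementation.

```python
import json

REQUIRED_FIELDS = {"ticker", "action", "amount_usd"}  # assumed decision schema

def schema_guard(generate, prompt, max_retries=3):
    """Call an LLM wrapper `generate(prompt) -> str`, retrying with error
    feedback until the reply parses as a list of well-formed decision dicts."""
    feedback = ""
    for _ in range(max_retries):
        raw = generate(prompt + feedback)
        try:
            decisions = json.loads(raw)
            bad = [d for d in decisions if not REQUIRED_FIELDS <= set(d)]
        except (json.JSONDecodeError, TypeError):
            feedback = "\nYour last reply was not a valid JSON list; resend only JSON."
            continue
        if not bad:
            return decisions
        feedback = f"\nFix these malformed entries and resend the full JSON list: {bad}"
    raise ValueError("agent failed to produce schema-compliant output")
```

A "calc-guard" would analogously route share-count arithmetic through a deterministic calculator rather than trusting the model's own numbers.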

Long-Term Applications

These applications require further research, scaling, or development to address reliability, compliance, and performance gaps identified by the paper.

  • Finance — Regime-aware autonomous trading co-pilots
    • Use case: Agents that detect market regimes (bull/bear/sideways) and adapt strategies, improving robustness in downturns where current agents underperform.
    • Tools/workflows: Regime classification, dynamic policy switching, reinforcement learning with drawdown constraints; hybrid quant + LLM architecture.
    • Assumptions/dependencies: Better downturn performance and stability; explicit risk limits; continuous evaluation across evolving windows.
  • Finance — End-to-end AI trading with broker integration
    • Use case: Semi-autonomous systems that place trades under strict risk controls, audit trails, and human oversight in production environments.
    • Tools/workflows: OMS/EMS integration, transaction-cost analysis (TCA), slippage modeling, compliance workflows, kill-switches.
    • Assumptions/dependencies: Regulatory approval, robust error handling, verified schema and arithmetic correctness, strict cybersecurity and KYC/AML controls.
  • Finance — Multi-asset expansion beyond DJIA equities
    • Use case: Extend agent workflows to ETFs, bonds, commodities, FX, and crypto while maintaining contamination-free evaluation and realistic inputs.
    • Tools/workflows: Broader data ingestion, cross-asset risk modeling, sector/regional diversification, intraday decision variants.
    • Assumptions/dependencies: Data coverage, market microstructure differences, higher-fidelity simulators.
  • AI/ML Research — RL training frameworks with financial rewards and risk penalties
    • Use case: Train agents using rewards shaped by returns, drawdown, and downside volatility (Sortino), leveraging StockBench as a safe training/evaluation loop.
    • Tools/workflows: Reward shaping, off-policy evaluation, counterfactual “what-if” simulation, curriculum across regime windows.
    • Assumptions/dependencies: Prevention of overfitting to short windows; realistic simulators including fees and liquidity; reproducibility across updates.
  • Policy/Standards — Industry-wide audit norms for AI trading
    • Use case: Establish standards for contamination-free backtesting, multi-seed evaluations, regime stress tests, and error reporting before consumer deployment.
    • Tools/workflows: Certification protocols, minimum performance/risk thresholds, continuous benchmark updates, independent test labs.
    • Assumptions/dependencies: Multi-stakeholder buy-in; harmonization across jurisdictions; auditability and transparency requirements.
  • Software Engineering — Formal verification of agent outputs
    • Use case: Guarantee arithmetic correctness and schema adherence using verifiable tool chains, static checks, and constrained generation.
    • Tools/workflows: Typed JSON schemas, formal calculators, programmatic output validation, deterministic execution paths.
    • Assumptions/dependencies: Advances in constrained decoding and tool-use orchestration; model compliance under verification regimes.
  • Finance — Explainable decision pipelines combining news and fundamentals
    • Use case: Transparent reasoning for trade recommendations that link specific news events and fundamental signals to actions and risk projections.
    • Tools/workflows: Attribution methods, rationale templates, auditor-friendly reports; hybrid systems where LLMs interpret and quant models execute.
    • Assumptions/dependencies: Reliable explanations that correlate with performance; avoidance of “overthinking” errors in reasoning-tuned models.
  • Academia — Longitudinal studies on benchmark contamination and evolving market tests
    • Use case: Ongoing research programs that track how model training corpora overlap with market periods, updating StockBench to remain contamination-free and predictive.
    • Tools/workflows: Temporal data curation, leakage auditing, annual benchmark refreshes, multi-year performance tracking.
    • Assumptions/dependencies: Sustained community maintenance; data licensing; methodological consensus on leakage detection.
  • Daily life — Regulated consumer-grade AI investment advisors
    • Use case: Retail-facing advisors offering explainable, risk-managed suggestions with strong guardrails and proven robustness across regimes.
    • Tools/workflows: Suitability checks, fiduciary compliance, human-in-the-loop approvals, conservative policies in bearish markets.
    • Assumptions/dependencies: Regulatory approval, stable downturn performance, clear liability frameworks, continuous monitoring.
  • Fintech Infrastructure — “StockBench-as-a-Service” for continuous model monitoring
    • Use case: Hosted service that continuously evaluates deployed agents against fresh, contamination-free windows, issuing alerts on drawdown spikes or schema failures.
    • Tools/workflows: Live benchmarking, drift detection, error telemetry dashboards, automated rollback triggers.
    • Assumptions/dependencies: Reliable data feeds, operational SLAs, integration with model ops and trading risk systems.

Glossary

  • Ablation Study: A method of analyzing a system by removing components to assess their contribution. "we conduct an ablation study by progressively removing these inputs."
  • Agentic Capability: The ability of an AI system to autonomously perceive, decide, and act in multi-step tasks. "This agentic capability is verified by benchmarks in various different domains, such as software engineering~\citep{jimenez2024swebench,yang2024sweagent}, scientific discovery~\citep{mialon2023gaia}, and marketing~\citep{chen2025xbench,barres2025tau2bench}, using the most recent advanced LLMs such as GPT-5~\citep{gpt5} and Claude-4~\citep{claude4}, highlighting their promise for workflow automation and productivity gains."
  • Back-Trading Environment: A simulated trading setup using historical, time-restricted data to evaluate strategies as if in real time. "A back-trading environment, which contains historical data necessary for stock-trading decision making."
  • Buy-and-Hold Baseline: A simple benchmark that purchases assets and holds them without further trading. "struggle to outperform the simple buy-and-hold baseline"
  • Coefficient of Variation (CV): A normalized measure of dispersion defined as the ratio of the standard deviation to the mean. "coefficient of variation (CV)."
  • Composite Rank: An aggregate performance score combining standardized metrics (e.g., return, drawdown, ratio). "we derive a composite rank by leveraging the z-score of each metric, averaging them to produce a single performance score."
  • Contamination-Free: Designed to avoid training data leakage into evaluation, ensuring fair assessment. "a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments."
  • Cumulative Return: The total percentage gain or loss of a portfolio over a period. "Performance is assessed using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio."
  • Data Contamination: Unintended overlap between training and test data that can inflate evaluation results. "raising concerns about potential data contamination."
  • Dividend Yield: Annual dividends per share divided by the stock price, indicating income return. "market capitalization, price-to-earnings (P/E) ratio, dividend yield, and trading range."
  • Dow Jones Industrial Average (DJIA): A major U.S. stock market index of large industrial companies. "we select 20 stocks from the Dow Jones Industrial Average (DJIA) with the highest weights as our investment targets."
  • Downside Deviation: The standard deviation of negative returns used in risk-adjusted metrics like Sortino. "It is defined as the excess return R_p divided by the downside deviation σ_d:"
  • Equal-Weight Buy-and-Hold Strategy: A passive approach that allocates capital evenly across assets and holds. "we implement a passive equal-weight buy-and-hold strategy"
  • Inductive Biases: Built-in assumptions of a model or workflow that influence learning and decisions. "since overly complicated workflows introduce inductive biases that may favor certain backbone LLMs."
  • Investment Targets: The predefined set of assets available for trading in the evaluation. "The investment targets are a bundle of stocks that allow the trading agents to perform buy and sell operations."
  • Liquidity: The availability of cash or tradable capacity to execute transactions without undue impact. "If the decisions of the agents exceed available liquidity, the system flags the issue and requires the agent to revise its decisions until they can be executed within available resources."
  • Market Capitalization: The total market value of a company’s outstanding shares. "market capitalization, P/E ratio, and dividend yield."
  • Maximum Drawdown: The largest peak-to-trough decline in portfolio value over a period. "Maximum Drawdown. The maximum drawdown quantifies the largest decline in portfolio value from its peak to its trough during the evaluation period, providing a measure of downside risk:"
  • Price-to-Earnings (P/E) Ratio: A valuation metric comparing a company’s stock price to its earnings per share. "price-to-earnings (P/E) ratio"
  • Risk-Adjusted Return: A performance measure that accounts for the risk taken to achieve returns. "fail to outperform this simple baseline in terms of both cumulative return and risk-adjusted return."
  • Schema Error: A formatting mistake where outputs do not follow the required structured schema (e.g., JSON). "Schema Error, where the agent fails to adhere to the specified JSON output format, leading to parsing failures."
  • Sortino Ratio: A risk-adjusted performance metric that penalizes only downside volatility. "The Sortino ratio is a risk adjusted return metric that penalizes only downside volatility."
  • Temporal Separation: Ensuring the evaluation data occurs after the model’s training cutoff to prevent leakage. "ensuring temporal separation and avoiding any overlap with the training corpora of contemporary LLMs."
  • Token Context Window: The maximum number of tokens an LLM can attend to in a single input. "All models are equipped with 32,768 token context windows"
  • Z-Score: A standardized score indicating how many standard deviations a value is from the mean, used for metric aggregation. "by leveraging the z-score of each metric"