
LiveTradeBench: Seeking Real-World Alpha with Large Language Models (2511.03628v1)

Published 5 Nov 2025 in q-fin.TR, cs.AI, cs.CE, and cs.CL

Abstract: LLMs achieve strong performance across benchmarks--from knowledge quizzes and math reasoning to web-agent tasks--but these tests occur in static settings, lacking real dynamics and uncertainty. Consequently, they evaluate isolated reasoning or problem-solving rather than decision-making under uncertainty. To address this, we introduce LiveTradeBench, a live trading environment for evaluating LLM agents in realistic and evolving markets. LiveTradeBench follows three design principles: (i) Live data streaming of market prices and news, eliminating dependence on offline backtesting and preventing information leakage while capturing real-time uncertainty; (ii) a portfolio-management abstraction that extends control from single-asset actions to multi-asset allocation, integrating risk management and cross-asset reasoning; and (iii) multi-market evaluation across structurally distinct environments--U.S. stocks and Polymarket prediction markets--differing in volatility, liquidity, and information flow. At each step, an agent observes prices, news, and its portfolio, then outputs percentage allocations that balance risk and return. Using LiveTradeBench, we run 50-day live evaluations of 21 LLMs across families. Results show that (1) high LMArena scores do not imply superior trading outcomes; (2) models display distinct portfolio styles reflecting risk appetite and reasoning dynamics; and (3) some LLMs effectively leverage live signals to adapt decisions. These findings expose a gap between static evaluation and real-world competence, motivating benchmarks that test sequential decision making and consistency under live uncertainty.

Summary

  • The paper introduces a live trading benchmark using real-time data streaming to evaluate LLM-based agents across U.S. equities and prediction markets.
  • It formalizes portfolio management as a POMDP, integrating agent memory, tool use, and explicit reasoning to achieve adaptive, market-specific allocations.
  • Empirical results reveal that high general-purpose LLM reasoning scores do not correlate with superior trading outcomes, underscoring the need for specialized benchmarks.

LiveTradeBench: A Live Multi-Market Benchmark for LLM-Based Trading Agents

Motivation and Problem Formulation

The proliferation of LLMs has led to near-saturation on static benchmarks, yet these evaluations fail to capture the sequential, uncertain, and feedback-driven nature of real-world decision-making. LiveTradeBench addresses this gap by introducing a live trading environment for LLM agents, emphasizing three core principles: (i) live data streaming to eliminate information leakage and capture real-time uncertainty, (ii) a portfolio-management abstraction for multi-asset allocation and risk management, and (iii) multi-market evaluation across structurally distinct environments—U.S. equities and Polymarket prediction markets.

The portfolio management task is formalized as a POMDP, where the agent receives partial observations comprising current positions, market prices, and contextual news, and outputs a portfolio allocation vector. The environment simulates real-world market transitions and portfolio rebalancing, with the agent’s actions mapped directly to executable trading operations.

Figure 1: Market selection in LiveTradeBench, contrasting the dynamics of U.S. equities and Polymarket contracts, and illustrating the diversity of assets across domains.
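As a concrete illustration of one POMDP transition, a single step can be sketched as follows. This is a minimal sketch, not the paper's released code; the function name and assets are illustrative:

```python
def step_portfolio(weights, prices_t, prices_t1, value_t):
    """One transition: the agent rebalances to `weights`, then prices
    evolve exogenously and the portfolio is marked to market."""
    # Endogenous adjustment: convert target weights into holdings at time t
    holdings = {a: weights[a] * value_t / prices_t[a] for a in weights}
    # Exogenous evolution: prices move regardless of the agent's action
    return sum(holdings[a] * prices_t1[a] for a in holdings)

prices_t  = {"AAPL": 100.0, "CASH": 1.0}
prices_t1 = {"AAPL": 105.0, "CASH": 1.0}
weights   = {"AAPL": 0.6, "CASH": 0.4}      # agent's allocation decision
print(step_portfolio(weights, prices_t, prices_t1, 1000.0))  # 1030.0
```

The two lines of the function mirror the two halves of the transition: the allocation decision fixes holdings, and the subsequent price move determines the next portfolio value.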

Environment and Agent Architecture

Observation and Action Spaces

At each timestep, the agent observes its current portfolio, real-time asset prices, and a curated set of recent news headlines. The action space is the probability simplex over assets, representing the proportion of capital allocated to each asset (long-only, no shorting). For prediction markets, each binary contract is split into YES/NO assets, and the agent must allocate to only one side per market.

Figure 2: Observation and action space for LiveTradeBench, showing the integration of position, price, and news context, and the mapping to portfolio allocation decisions.
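Since model outputs must land on the long-only probability simplex, parsed allocations need a validation step of roughly this form. A minimal sketch; the clip-and-renormalize rule and the all-cash fallback are assumptions, not the paper's exact procedure:

```python
def validate_allocation(raw):
    """Map a raw model output onto the long-only probability simplex:
    clip negative weights, then renormalize to sum to 1."""
    clipped = {k: max(v, 0.0) for k, v in raw.items()}
    total = sum(clipped.values())
    if total == 0:
        # Degenerate output: assumed fallback to an all-cash position
        return {k: float(k == "CASH") for k in clipped}
    return {k: v / total for k, v in clipped.items()}

alloc = validate_allocation({"CASH": 0.2, "AAPL": 0.6, "TSLA": -0.1, "MSFT": 0.3})
print(alloc["TSLA"], round(sum(alloc.values()), 6))  # 0.0 1.0
```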

Agent Design

Agents are instantiated as LLMs wrapped in a ReAct-style framework, equipped with three key capabilities: (i) tool use for structured data extraction and feature engineering from raw observations, (ii) memory for maintaining a fixed-length history of past observations and actions, and (iii) explicit reasoning via chain-of-thought and ReAct prompting. The agent’s policy is a function of the current observation and memory, outputting a JSON-encoded allocation vector and a natural language rationale.

Figure 3: Agent-environment interaction loop, with the environment streaming live data and the agent leveraging tools, memory, and reasoning to generate adaptive portfolio allocations.
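A stripped-down version of this observe-reason-act loop might look like the following, with a stub standing in for the real LLM call. The prompt wording and JSON schema are illustrative assumptions, not the paper's templates:

```python
import json

def react_step(observation, memory, llm):
    """One reason-then-act step: build a prompt from the observation plus a
    fixed-length memory window, then parse the model's JSON allocation."""
    prompt = (
        "You manage a portfolio. Recent history:\n"
        + "\n".join(memory[-5:])                      # bounded memory window
        + f"\nObservation: {observation}\n"
        + 'Think step by step, then answer with JSON: {"allocation": {...}, "rationale": "..."}'
    )
    reply = json.loads(llm(prompt))
    memory.append(observation)                        # remember this step
    return reply["allocation"], reply["rationale"]

# Stub standing in for a real LLM call
stub = lambda prompt: '{"allocation": {"CASH": 0.5, "AAPL": 0.5}, "rationale": "balanced"}'
alloc, why = react_step("AAPL up 2% on earnings", [], stub)
print(alloc, why)  # {'CASH': 0.5, 'AAPL': 0.5} balanced
```

Swapping the stub for an API call to any of the evaluated models yields the same interface: one allocation vector and one rationale per timestep.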

Experimental Protocol and Evaluation Metrics

Model Pool and Setup

Twenty-one LLMs spanning eight major model families (OpenAI, Anthropic, Google, xAI, Meta, Qwen, DeepSeek, Moonshot) are evaluated in both the U.S. stock and Polymarket environments. All models are prompted with identical context and instruction templates, and outputs are parsed and validated for compliance with allocation constraints.

Metrics

Performance is assessed using cumulative return, Sharpe ratio, maximum drawdown (MDD), win rate, and volatility. These jointly capture profitability, risk-adjusted efficiency, downside protection, and stability.
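These metrics are standard; for concreteness, a minimal implementation over a series of per-step returns (this sketch uses the sample standard deviation and per-step quantities, which may differ from the paper's exact annualization conventions):

```python
import statistics

def metrics(returns, r_f=0.0):
    """Cumulative return, Sharpe ratio, max drawdown, win rate, volatility."""
    cum, peak, mdd = 1.0, 1.0, 0.0
    for r in returns:
        cum *= 1.0 + r
        peak = max(peak, cum)
        mdd = max(mdd, (peak - cum) / peak)       # worst peak-to-trough slump
    vol = statistics.stdev(returns)               # sample standard deviation
    return {
        "cumulative_return": cum - 1.0,
        "sharpe": (statistics.mean(returns) - r_f) / vol,  # (mean - r_f) / sigma
        "max_drawdown": mdd,
        "win_rate": sum(r > 0 for r in returns) / len(returns),
        "volatility": vol,
    }

m = metrics([0.01, -0.02, 0.03, 0.00, 0.01])
print(m["win_rate"], round(m["max_drawdown"], 4))  # 0.6 0.02
```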

Empirical Findings

Generalization and Market-Specificity

There is negligible correlation between trading performance in the stock and prediction markets (Sharpe ratio correlation ≈ 0), indicating that strategies do not generalize across market regimes. For example, GPT-4.1 achieves the highest cumulative return in stocks but underperforms in Polymarket, while Qwen2.5-72B-Instruct and Grok-4 display more stable cross-market behavior.

LLM Benchmark Scores vs. Trading Performance

Contrary to expectations, LMArena scores are uncorrelated or negatively correlated with trading outcomes (Spearman ρ ≈ 0.05 in stocks, −0.38 in Polymarket). High general-purpose reasoning ability does not translate to superior financial decision-making, underscoring the inadequacy of static benchmarks for evaluating real-world competence.

Figure 4: Correlation between LMArena score and Sharpe ratio in both markets, demonstrating the disconnect between general LLM ability and trading performance.
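Spearman ρ is simply the Pearson correlation of the ranked values; a tie-free sketch with hypothetical scores (the numbers below are illustrative, not the paper's data):

```python
def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Ties are not handled in this simplified version."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical leaderboard scores vs. Sharpe ratios for five models
print(round(spearman([1300, 1280, 1250, 1220, 1200],
                     [0.10, 0.40, -0.20, 0.30, 0.00]), 3))  # 0.3
```

A production implementation would average the ranks of tied values; for the small model pools here, ties are rare enough that this version conveys the idea.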

Portfolio Management Styles

Distinct risk and allocation styles emerge across models. Some (e.g., Claude-Opus-4.1, Grok-4) are conservative, prioritizing low volatility and drawdown, while others (e.g., Kimi-K2-Instruct, GPT-5) are risk-seeking. Cash management patterns also vary, with some agents maintaining high liquidity buffers and others remaining fully invested except during extreme volatility.

Reasoning Models and Overfitting

Models optimized for explicit reasoning (DeepSeek-R1, Qwen3-235B-A22B-Thinking, GPT-o3) do not outperform others and often exhibit excessive volatility, suggesting that the type of reasoning beneficial for math or code tasks does not transfer to financial domains. Over-deliberation can degrade trading consistency.

Analysis: Adaptivity and Decision Rationale

Rolling-k Delta Analysis

To test adaptivity, the rolling-k delta analysis delays agent actions by k days. Performance degrades as k increases, especially in Polymarket, confirming that agents are not random but respond to contemporaneous signals. In stocks, the effect is muted, reflecting smoother market dynamics.

Figure 5: Rolling k-delta analysis on U.S. stocks, quantifying the impact of delayed rebalancing on cumulative returns.
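The idea behind the rolling-k delta can be reproduced in a few lines: replay the same allocation sequence shifted k steps late and compare cumulative returns. Toy data below; the paper's exact replay protocol may differ:

```python
def cumulative_return(weights_seq, returns_seq):
    """Compound the portfolio return implied by a sequence of allocations."""
    value = 1.0
    for w, rets in zip(weights_seq, returns_seq):
        value *= 1.0 + sum(w[a] * rets[a] for a in w)
    return value - 1.0

def rolling_k_delta(weights_seq, returns_seq, k):
    """Delta_k: performance lost when every decision executes k steps late.
    The first k steps reuse the initial allocation."""
    delayed = weights_seq[:1] * k + weights_seq[:-k] if k > 0 else weights_seq
    return (cumulative_return(weights_seq, returns_seq)
            - cumulative_return(delayed, returns_seq))

# Toy series: the agent goes to cash exactly before the down day
W = [{"A": 1.0, "CASH": 0.0}, {"A": 0.0, "CASH": 1.0}, {"A": 1.0, "CASH": 0.0}]
R = [{"A": 0.10, "CASH": 0.0}, {"A": -0.10, "CASH": 0.0}, {"A": 0.10, "CASH": 0.0}]
print(round(rolling_k_delta(W, R, k=1), 4))  # 0.22: acting on time beats acting late
```

A large positive delta, as in this toy case, is the signature of an agent that genuinely exploits contemporaneous signals; a delta near zero would be consistent with timing-insensitive (or random) allocations.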

Reasoning Trace Annotation

Automated annotation of reasoning traces reveals that agents most frequently cite news in Polymarket and price history in stocks, with position information less dominant. Many decisions integrate multiple information sources, indicating nontrivial reasoning.

Figure 6: Proportion of reasoning traces referencing position, price, or news, highlighting market-specific information usage.

Case Studies

U.S. Stock Market: Cash Dynamics

During bullish phases (e.g., tech rally on August 28), agents reduce cash allocations, while sharp drawdowns (October 10) trigger defensive increases in cash. The best-performing agents preemptively shift to cash before downturns, as evidenced by their reasoning traces.

Figure 7: Case study for U.S. stocks, showing average cash ratio dynamics and model-specific reasoning during a market drawdown.

Polymarket: Event-Driven Trading

In the "Russia × Ukraine ceasefire in 2025?" market, agents react to news events with abrupt allocation shifts. On October 13, optimistic headlines prompt a switch to YES positions, but the market does not move, resulting in no profit. On October 17, credible diplomatic news leads to profitable position holds, demonstrating the challenge of distinguishing signal from noise.

Figure 8: Case study in Polymarket, illustrating allocation shifts and reasoning traces in response to news events.

Limitations and Future Directions

LiveTradeBench currently omits transaction costs, liquidity constraints, and slippage, potentially overestimating achievable returns for high-turnover strategies. The observation and action spaces are limited by LLM context length and the number of supported markets. Agent design is restricted to basic tool use and memory; future work should incorporate richer analytical tools, hierarchical memory, multi-agent interactions, and reinforcement learning-based training. Extending the framework to support longer horizons, richer news context, and more complex market universes is essential for advancing the realism and challenge of the benchmark.

Conclusion

LiveTradeBench establishes a new paradigm for evaluating LLM-based agents in live, multi-market trading environments. The results demonstrate that static benchmarks are insufficient proxies for real-world competence, and that market-specific, adaptive, and risk-aware strategies are essential. The benchmark provides a foundation for future research on adaptive, financially grounded, and socially intelligent agent systems, and is released as an open-source package with a real-time web interface for community use.

Explain it Like I'm 14

What is this paper about?

This paper introduces LiveTradeBench, a way to test AI models (large language models, or LLMs) in real, live financial markets. Instead of giving models fixed quizzes or past data, it lets them make trading decisions day by day with real prices and news rolling in, just like a real investor. The goal is to see if AIs can make good choices when the world is changing and uncertain.

What questions did the researchers ask?

They asked three simple questions:

  • Can AIs that ace standard tests also make smart money decisions in the real world?
  • How should we fairly test AIs on trading, without “peeking” at the future?
  • Do different AIs act with different risk styles (careful vs. bold), and can they adapt to live news and prices?

How did they test the idea?

Think of it like a live strategy game with fog-of-war:

  • The AI can’t see the whole future; it only sees current prices, its own holdings, and fresh news.
  • Each turn, the AI decides how to split its money across many assets (like slicing a pizza into slices of Apple, Microsoft, cash, etc.). This is called portfolio management.
  • The world keeps moving on its own—prices go up and down whether the AI trades or not—so the AI must adapt over time.

Here’s what they built and used:

  • Two live markets with very different “personalities”:
    • U.S. stocks (steady, slow-moving, influenced by fundamentals).
    • Polymarket prediction markets (fast, sharp moves, prices act like the crowd’s probability of an event, e.g., “Will interest rates go up?”).
  • Observations the AI sees each step:
    • Its current positions (how much it owns).
    • Live prices.
    • Recent news headlines and summaries.
  • Actions the AI takes:
    • Percent allocations (for example: 10% cash, 15% Apple, 5% Tesla, etc.). This directly maps to buy/sell/hold trades.
  • Agent design:
    • Tools: to fetch and process fresh data.
    • Memory: to remember recent steps.
    • Reasoning: to think through signals before acting (a plan-then-act loop).
  • Models and time period:
    • 21 different LLMs (from families like GPT, Claude, Qwen, Llama, DeepSeek, etc.).
    • Tested live for 50 trading days.
  • How they scored performance (plain-English versions):
    • Cumulative return: total profit or loss over the whole period.
    • Sharpe ratio: how good the returns are compared to how bumpy (risky) they were; higher means better reward per unit of risk.
    • Maximum drawdown: the worst slump from a high point; lower is safer.
    • Win rate: how often daily returns were positive.
    • Volatility: how wild the ups and downs were; lower is steadier.

What did they find?

In short: being smart on paper isn’t the same as being smart in the market.

Key findings:

  • High test scores didn’t mean better trading. Models that did great on popular AI leaderboards (like LMArena) didn’t necessarily make the best live trading choices. In some cases, the correlation was even slightly negative.
  • Success didn’t transfer across markets. Doing well in U.S. stocks didn’t predict doing well in Polymarket. The two markets require different instincts: stocks reward patient, balanced strategies; Polymarket demands quick reactions to events.
  • Different models showed different “styles.” Some were cautious (lower swings and smaller losses), others were aggressive (chasing higher returns with bigger risks). Cash handling and which assets they favored varied a lot.
  • “Big reasoning” models weren’t automatically better traders. Models designed to think longer didn’t outperform; in fact, some took on more risk and bounced around more, which hurt consistency.
  • The AIs used live signals, not random guessing. When the researchers delayed the models’ decisions by a few days (as if reacting late), performance got worse—especially in the fast-moving Polymarket. That means models were actually adapting to fresh information.
  • What guided their reasoning? In stocks, models leaned more on price trends; in Polymarket, they leaned more on news. Many decisions combined multiple cues.

Why this matters: It shows a gap between old-style, static tests and real-world decision-making. Live, sequential choices under uncertainty are a different skill.

Why does this matter?

  • Better, fairer testing: LiveTradeBench avoids “information leakage” (no using future data by accident) and measures what really counts—timely, consistent decisions in the face of uncertainty.
  • Smarter AI for real tasks: If we want AIs that help with investing (or any live, changing domain like logistics, health monitoring, or online safety), we need benchmarks that test adapting over time—not just solving one-off puzzles.
  • New research directions: Results suggest we need market-aware training, improved risk management, and reasoning tuned for noisy, fast-changing signals—not just bigger models.
  • Practical insight: Different markets need different strategies. A one-size-fits-all AI won’t cut it; models must adapt their “style” to the context.

In one sentence: LiveTradeBench shows that real-world “smarts” mean making steady, sensible choices as the world changes—and that’s not what most current AI tests measure.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single list of concrete gaps and open questions that remain unresolved and can guide future research:

  • Transaction costs, fees, slippage, bid–ask spreads, and liquidity constraints are not modeled; quantify their impact on performance and incorporate realistic execution (including turnover-adjusted results).
  • The action cadence is unclear and inconsistent (“live” streaming vs. daily stock closes); specify and study intraday vs. daily decision frequencies in both markets, and assess sensitivity to timing/latency.
  • Limited asset universe (15 U.S. equities/ETFs, 10 Polymarket contracts); evaluate generalization across broader universes (small/mid caps, international equities, fixed income, FX, crypto, futures/options).
  • Long-only constraint with no shorting, leverage, derivatives, or hedging; extend to market-neutral, leveraged, and derivative-enabled portfolios, and assess risk-adjusted benefits.
  • Cash asset is modeled with zero return; incorporate realistic risk-free yields (e.g., T‑bills for stocks, stablecoin yields for prediction markets) and re-evaluate Sharpe and allocation behavior.
  • No comparison to strong non-LLM baselines (e.g., buy-and-hold SPY/QQQ, equal-weight, risk-parity, momentum/trend, mean–variance); add baselines to contextualize LLM performance.
  • Statistical rigor is limited: no confidence intervals, no hypothesis tests, and no repeated runs; perform multi-window live evaluations, bootstrapping, and significance testing across seeds/runs.
  • Single 50-day window may reflect one regime; test across multiple non-overlapping live periods, regime shifts, macro events, and stress scenarios for robustness.
  • Tool-use and memory components are not ablated; quantify each component’s contribution (including memory horizon Δ), and evaluate effects of removing/altering them.
  • Prompting/hyperparameters are not standardized across models; control for temperature, decoding strategy, context length, and tool permissions to ensure fair comparisons.
  • News ingestion via Google News is unvalidated; evaluate coverage/latency/relevance, deduplication, sentiment quality, and conduct controlled ablations to isolate the incremental value of news.
  • Prompt-injection and adversarial content risks from live news are not addressed; design defenses and measure robustness to malicious or misleading external inputs.
  • Polymarket modeling abstracts away contract mechanics (fees, order book depth, liquidity, resolution timing); implement mark-to-market consistent with platform rules and evaluate unresolved vs. resolved markets separately.
  • No turnover and cost metrics are reported; measure turnover, cost-adjusted returns, and identify model-specific optimal rebalancing cadence under transaction costs.
  • Risk metrics are limited (Sharpe, volatility, MDD); include Sortino, CVaR, tail risk, risk-of-ruin, tracking error, and formal drawdown constraints to better capture downside risk.
  • Live–live market causality is hypothesized but not tested; quantify lead–lag and information flow between prediction markets and equities (e.g., Granger causality, transfer entropy).
  • Reasoning-trace analysis relies on another LLM with no validation; benchmark annotation accuracy with human experts and link rationales to measurable action features/outcomes (causal attribution).
  • Reproducibility is challenging in live settings; provide complete data snapshots, time stamps, prompts, parsed actions, and deterministic replays to enable faithful post hoc replication.
  • Real-world constraints (lot sizes, margin requirements, borrow availability, trading halts, market hours) are ignored; incorporate these to align with executable trading.
  • No exploration of training/adaptation; compare zero-shot agents to RL/fine-tuned or online-adaptive policies, and study sample efficiency and stability under live feedback.
  • Calibration in prediction markets is not assessed; evaluate belief calibration (e.g., Brier score, log loss) and align allocation exposures with probabilistic forecasts.
  • Model-family analysis lacks scaling laws and resource normalization; study performance vs. model size and ensure equal tool access, token budgets, and compute constraints across families.
  • Time synchronization and latency are unspecified; document streaming/data pipeline delays, market-hour alignment, and their effects on decision quality and slippage.
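For the calibration gap noted above, a standard starting point would be the Brier score. The probabilities and outcomes below are illustrative, not taken from the paper:

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes;
    lower means better-calibrated beliefs."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Illustrative YES-probabilities implied by an agent's allocations, vs. resolutions
print(round(brier_score([0.9, 0.2, 0.7], [1, 0, 1]), 4))  # 0.0467
```

Applied to an agent's implied YES-probabilities across resolved Polymarket contracts, this would separate models that merely react to news from models whose exposures track well-calibrated beliefs.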

Glossary

  • Alpha prediction: Forecasting expected excess returns (“alpha”) for assets to inform trading decisions. "Other approaches focus on alpha prediction, producing continuous vectors of alpha signals that represent expected excess returns or relative performance across assets"
  • Backtesting: Evaluating a trading strategy using historical market data rather than live data. "most evaluation frameworks rely on offline backtesting, which is prone to information leakage and fails to capture the uncertainty, volatility, and feedback of real-world environments"
  • Binary prediction markets: Markets whose contracts resolve to one of two outcomes (e.g., Yes/No), with prices reflecting outcome probabilities. "We continuously track ten active binary prediction markets from Polymarket"
  • Chain-of-thought prompting: A prompting technique that elicits step-by-step reasoning from LLMs. "Similar to chain-of-thought prompting~\citep{wei2022chain}"
  • Complementary assets: Paired assets representing opposing sides of a binary contract. "each binary contract corresponds to two complementary assets—YES and NO"
  • Endogenous portfolio adjustment: Changes to an agent’s holdings caused by its own actions. "and an endogenous portfolio adjustment induced by the agent’s allocation decision $\mathbf{a}_t$ under the new market state"
  • Exogenous market-state evolution: Market changes driven by external dynamics independent of the agent’s actions. "an exogenous market-state evolution, governed by real-world dynamics and observable as $(\mathbf{p}_t, \mathbf{c}_t) \rightarrow (\mathbf{p}_{t+1}, \mathbf{c}_{t+1})$"
  • Information leakage: Unintended use of future or unavailable information that biases evaluation results. "most evaluation frameworks rely on offline backtesting, which is prone to information leakage"
  • Latent market state space: The unobserved true state of the market underlying observed data. "where $\mathcal{S}$ is the latent market state space"
  • Liquidity: The ease of trading an asset without significantly affecting its price. "including asset fundamentals, volatility, liquidity, and other unobserved factors"
  • LMArena: A benchmark of general LLM capabilities used as a reference for model performance. "high LMArena scores do not imply superior trading outcomes"
  • Long-only setting: A portfolio constraint that allows only nonnegative positions (no short selling). "By default, we assume a long-only setting where $a_t^{(i)} \ge 0$ for all $i$"
  • Maximum drawdown (MDD): The largest peak-to-trough decline in portfolio value during an evaluation period. "maximum drawdown (MDD)"
  • Memory horizon: The length of historical observations the agent retains for decision-making. "where $\Delta$ denotes the memory horizon"
  • Observation emission function: The function mapping latent states to observable data. "and $\Omega$ the observation emission function"
  • Partially observable Markov decision process (POMDP): A decision-making framework where an agent acts with incomplete information about the true state. "We formulate the portfolio management task as a partially observable Markov decision process (POMDP)"
  • Probability simplex: The set of allocation vectors whose nonnegative components sum to one. "the probability simplex action space"
  • Portfolio rebalancing: Adjusting holdings to match target allocation proportions. "through proportional rebalancing"
  • Prediction market: A market where contract prices reflect the crowd’s beliefs about future events. "Polymarket prediction market"
  • ReAct framework: An agent paradigm that interleaves explicit reasoning with actions. "follows the ReAct~\citep{yao2022react} framework"
  • Rolling-k delta: A diagnostic that measures performance degradation when actions are delayed by k days. "rolling-$k$ delta ($\Delta_k$) analysis"
  • Risk-free rate: The theoretical return of an investment with zero risk, used in risk-adjusted metrics. "r_f is the risk-free rate representing the baseline return from a no-risk investment"
  • Sharpe ratio: A risk-adjusted performance metric equal to excess mean return divided by return volatility. "Sharpe ratio ($SR = \frac{\bar{r} - r_f}{\sigma}$)"
  • Spearman correlation: A rank-based measure of association between two variables. "the Spearman correlation between LMArena scores and cumulative returns is only 0.054"
  • Temporal de-biasing: Procedures to reduce time-related contamination or bias in backtests. "temporal de-biasing protocols"
  • Transition dynamics: The rules governing how states evolve over time in an MDP/POMDP. "and $\mathcal{T}$ the transition dynamics"
  • Tree-based searching: Planning method that explores possible action sequences via a search tree. "support tree-based searching~\citep{Koh2024TreeSFA,Putta2024AgentQAA,Aksitov2023ReSTMRA}"
  • Volatility: The variability of returns, commonly used as a measure of risk. "Strategies with lower volatility $\sigma$ exhibit more stable performance over time"
  • Win rate: The proportion of time steps with positive returns. "Win rate ($WR = \frac{1}{T-1}\sum_{t=2}^{T}\mathbb{I}(r_t > 0)$)"

Practical Applications

Immediate Applications

Industry: Finance

  • Algorithmic Trading Enhancement: The LiveTradeBench provides a real-time testing environment for developing and refining trading algorithms that leverage LLMs. This can be immediately integrated into existing platforms for improved decision-making under market uncertainty.

Academia

  • Educational Tool for Financial Courses: The environment can be used as a practical tool in academic settings for teaching students about portfolio management, financial risk assessment, and the integration of AI in finance.

Policy

  • Market Analysis and Regulation: Regulators can use the findings and methods as a benchmarking tool to assess the impact of automated trading decisions in live markets, ensuring fair trading practices.

Daily Life

  • Personal Investment Advisory: The insights from the environment can be adapted into personal finance applications that assist individual investors in making real-time, informed investment decisions.

Long-Term Applications

Industry: Software Development

  • Development of Advanced AI Trading Tools: Further research can be done to integrate LLMs in broader financial software systems, creating powerful tools that can handle complex multi-asset portfolios with market adaptability.

Energy and Environment

  • Renewable Energy Trading Platforms: The portfolio management framework could be extended to live trading markets in renewable energy certificates or carbon credits, requiring additional development to tailor for these specific markets.

Academia

  • Cross-domain AI Benchmarking: Development of new benchmarks for evaluating AI capabilities across different domains, such as finance, healthcare, and education, using environments similar to LiveTradeBench.

Assumptions and Dependencies

  • Deployment of these applications assumes continuous access to real-time market data.
  • Scalability of the LiveTradeBench for different assets may depend on further advancements in computational efficiency of LLMs.
  • Effective implementation of personal finance applications would require robust user data privacy frameworks.

The above points synthesize the paper's focus on evaluating LLMs in trading scenarios and translating these insights into diverse, practical applications.

Open Problems

We found no open problems mentioned in this paper.
