Papers
Topics
Authors
Recent
Search
2000 character limit reached

LiveTradeBench: Real-Time Trading Benchmark

Updated 26 November 2025
  • LiveTradeBench is a real-time benchmark framework that assesses the trading competence of LLM agents using live market and news data.
  • It employs a portfolio-management abstraction and multi-market evaluation to measure risk, return, and adaptability in evolving market conditions.
  • Empirical evaluations reveal significant generalization gaps across different market regimes and diverse strategy profiles among LLM models.

LiveTradeBench is a benchmark framework designed to evaluate the real-time trading competence of LLM agents in live, evolving market environments. It departs from offline and synthetic benchmarks by integrating actual market data streams, portfolio-level control, and heterogeneous market structures to study decision-making under genuine uncertainty. The system couples live market and news data with LLM-driven portfolio management, providing the infrastructure and metrics necessary to rigorously diagnose agent performance across risk, return, and adaptability dimensions (Yu et al., 5 Nov 2025).

1. Design Principles

LiveTradeBench is anchored by three fundamental principles that define its architecture and experimental claims:

  1. Live Data Streaming:
    • Market prices for U.S. equities are sourced via public APIs (e.g., yfinance) and for Polymarket contracts via public CLOB endpoints, both utilizing a 10-day lookback to mitigate information leakage.
    • News and sentiment are aggregated using Google News queries for stock tickers or Polymarket event texts, over a rolling window [t3,t1][t-3, t-1].
    • Robustness to network conditions is enforced with randomized fetch delays, standard User-Agent headers, exponential-backoff retries, and JSON parsing with timeouts.
    • Data preprocessing includes timestamp normalization (e.g., “3 hours ago” to UNIX time), title/snippet extraction, mapping to symbols/market IDs, and chronological sorting.
  2. Portfolio-Management Abstraction:
    • State at each timestep tt is ot=(qt,pt,ct)o_t = (q_t, p_t, c_t), where qtR+Nq_t \in \mathbb{R}_+^N represents holdings, ptRNp_t \in \mathbb{R}^N represents current prices or probabilities, and ctc_t conveys textual news summaries.
    • The agent action atΔN1={aia(i)=1,a(i)0}a_t \in \Delta^{N-1} = \{a \mid \sum_i a^{(i)}=1, a^{(i)}\geq 0\} is a long-only allocation vector over NN assets, including mandatory cash allocation.
    • No shorting is permitted, and explicit portfolio allocation fosters cross-asset reasoning—balancing risk, sector exposure, volatility, and diversification.
  3. Multi-Market Evaluation:
    • U.S. Stock Market: 15 highly liquid equities/ETFs across sectors and a cash asset, characterized by strong cross-correlations and moderate volatility.
    • Polymarket Prediction Markets: 10 binary contracts on diverse macro/political/crypto events, exhibiting high frequency, sentiment-driven moves, and elevated volatility.
    • This structural diversity tests the agent’s ability to generalize from regulated, low-volatility to unregulated, high-volatility regimes.

2. Environment Mechanics

At each trading interval, the environment-agent interaction follows a precise sequence:

  1. Fetch live prices ptp_t and news ctc_t.
  2. Emit observation tt0 reflecting the current state.
  3. The agent transforms tt1 via a “tool-use” module into structured features tt2 (e.g., returns, volatility, sentiment embeddings).
  4. Agent memory tt3, comprising the previous tt4 observations, is concatenated with tt5.
  5. The ReAct-style agent generates a chain-of-thought and outputs allocation tt6.
  6. The environment executes the allocation:
    • Update portfolio value: tt7
    • Rebalance holdings: tt8, ensuring tt9 (element-wise division).
  7. The next state ot=(qt,pt,ct)o_t = (q_t, p_t, c_t)0 is emitted. This loop continues for horizon ot=(qt,pt,ct)o_t = (q_t, p_t, c_t)1.

Agent–Environment Loop Pseudocode:

qtR+Nq_t \in \mathbb{R}_+^N5 This design reflects a fully live and sequential decision-making pipeline.

3. Market Structures and Agent Interface

The dual-market setup exposes agents to structurally distinct environments, summarized below:

Market Assets/Contracts Price Dynamics Notable Features
U.S. Stocks 15 equities/ETFs + cash Strong cross-sector correlations, moderate volatility Long-term fundamentals, lower frequency moves
Polymarket 10 binary contracts Sentiment-driven, sharp, frequent moves News-reactivity, high volatility

Agents interact with the environment solely through portfolio allocations encoded in a JSON schema, enforcing allocation constraints (long-only, sum to one, positive cash weight).

4. Performance Metrics and Evaluation

Evaluation utilizes five well-defined financial metrics, each formulated in LaTeX to ensure reproducibility:

  • Cumulative Return: ot=(qt,pt,ct)o_t = (q_t, p_t, c_t)2
  • Sharpe Ratio: ot=(qt,pt,ct)o_t = (q_t, p_t, c_t)3

where,

ot=(qt,pt,ct)o_t = (q_t, p_t, c_t)4, ot=(qt,pt,ct)o_t = (q_t, p_t, c_t)5, and ot=(qt,pt,ct)o_t = (q_t, p_t, c_t)6 specifies sample standard deviation.

  • Maximum Drawdown: ot=(qt,pt,ct)o_t = (q_t, p_t, c_t)7
  • Win Rate: ot=(qt,pt,ct)o_t = (q_t, p_t, c_t)8
  • Volatility: ot=(qt,pt,ct)o_t = (q_t, p_t, c_t)9

Editor's term: These metrics collectively delineate an agent’s return, risk-adjusted performance, loss resilience, consistency in outperforming prior periods, and risk profile over the test duration.

5. Empirical Findings from Live Evaluations

Twenty-one LLMs were evaluated during a 50-day window (August 18–October 24, 2025), spanning major families such as Claude, Grok, Qwen, Llama, GPT, Gemini, DeepSeek, and Kimi.

Key empirical results include:

  1. Benchmark Decoupling: High LMArena/general-LLM scores exhibit negligible or negative correlation (Spearman qtR+Nq_t \in \mathbb{R}_+^N0) with trading returns, particularly on Polymarket, indicating that traditional LLM benchmarks do not predict real-world trading competence.
  2. Market-Specific Generalization Gaps: The best stock-market agent (GPT-4.1, qtR+Nq_t \in \mathbb{R}_+^N1, qtR+Nq_t \in \mathbb{R}_+^N2) performs poorly on Polymarket (qtR+Nq_t \in \mathbb{R}_+^N3, significant drawdowns), highlighting failures to generalize across market regimes.
  3. Strategy Spectrum: Models such as Grok-4 and Qwen2.5 maintain stable, low-volatility strategies, while others (GPT-5, Kimi-K2) pursue aggressive returns but encounter severe drawdowns.
  4. Action Timing Analysis: Delaying agent actions (rolling-k delta analysis) rapidly degrades performance (qtR+Nq_t \in \mathbb{R}_+^N4), confirming that agents exploit live signals rather than executing static or autoregressive behaviors.
  5. Reasoning Trace: LLMs most frequently cite news in reasoning traces, followed by price history and position information. Polymarket agents rely even more on fresh news inputs versus stock-market agents.

This set of findings substantiates the claim that static benchmark proficiency does not entail robust, real-time decision-making capacities in live markets.

6. Implementation Architecture and Reproducibility

LiveTradeBench is distributed as an open-source Python package (live-trade-bench) with seamless installation via pip. The environment utilizes:

  • Stocks: Daily bars from yfinance REST API.
  • Polymarket: Public CLOB HTTP endpoints for contract discovery and historical pricing.
  • News: Google News scraping through query-generated URLs.

LLM agent invocation leverages a React-style loop with integrated tool-use and memory modules. The unified JSON schema ensures compatibility with a thin client (LiteLLM), routing requests to providers such as anthropic, openai, gemini, x-ai, etc., with fixed model parameters (temperature 0.3, max_tokens 16,000).

Reproducing Live Runs:

  1. Clone repository: git clone https://github.com/ulab-uiuc/live-trade-bench
  2. Install: pip install live-trade-bench
  3. Configure credentials (as required).
  4. Launch a live trading session: qtR+Nq_t \in \mathbb{R}_+^N6
  5. Outputs can be inspected through the web UI or exported as CSV.

The infrastructure combines real-time data streams, portfolio allocation, and standardized quantitative metrics, furnishing a reproducible apparatus for evaluating the full spectrum of LLM-based trading agent competence under real-world temporal and informational uncertainty (Yu et al., 5 Nov 2025).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to LiveTradeBench.