Papers
Topics
Authors
Recent
2000 character limit reached

LiveTradeBench: Real-Time Trading Benchmark

Updated 26 November 2025
  • LiveTradeBench is a real-time benchmark framework that assesses the trading competence of LLM agents using live market and news data.
  • It employs a portfolio-management abstraction and multi-market evaluation to measure risk, return, and adaptability in evolving market conditions.
  • Empirical evaluations reveal significant generalization gaps across different market regimes and diverse strategy profiles among LLM models.

LiveTradeBench is a benchmark framework designed to evaluate the real-time trading competence of LLM agents in live, evolving market environments. It departs from offline and synthetic benchmarks by integrating actual market data streams, portfolio-level control, and heterogeneous market structures to paper decision-making under genuine uncertainty. The system couples live market and news data with LLM-driven portfolio management, providing the infrastructure and metrics necessary to rigorously diagnose agent performance across risk, return, and adaptability dimensions (Yu et al., 5 Nov 2025).

1. Design Principles

LiveTradeBench is anchored by three fundamental principles that define its architecture and experimental claims:

  1. Live Data Streaming:
    • Market prices for U.S. equities are sourced via public APIs (e.g., yfinance) and for Polymarket contracts via public CLOB endpoints, both utilizing a 10-day lookback to mitigate information leakage.
    • News and sentiment are aggregated using Google News queries for stock tickers or Polymarket event texts, over a rolling window [t3,t1][t-3, t-1].
    • Robustness to network conditions is enforced with randomized fetch delays, standard User-Agent headers, exponential-backoff retries, and JSON parsing with timeouts.
    • Data preprocessing includes timestamp normalization (e.g., “3 hours ago” to UNIX time), title/snippet extraction, mapping to symbols/market IDs, and chronological sorting.
  2. Portfolio-Management Abstraction:
    • State at each timestep tt is ot=(qt,pt,ct)o_t = (q_t, p_t, c_t), where qtR+Nq_t \in \mathbb{R}_+^N represents holdings, ptRNp_t \in \mathbb{R}^N represents current prices or probabilities, and ctc_t conveys textual news summaries.
    • The agent action atΔN1={aia(i)=1,a(i)0}a_t \in \Delta^{N-1} = \{a \mid \sum_i a^{(i)}=1, a^{(i)}\geq 0\} is a long-only allocation vector over NN assets, including mandatory cash allocation.
    • No shorting is permitted, and explicit portfolio allocation fosters cross-asset reasoning—balancing risk, sector exposure, volatility, and diversification.
  3. Multi-Market Evaluation:
    • U.S. Stock Market: 15 highly liquid equities/ETFs across sectors and a cash asset, characterized by strong cross-correlations and moderate volatility.
    • Polymarket Prediction Markets: 10 binary contracts on diverse macro/political/crypto events, exhibiting high frequency, sentiment-driven moves, and elevated volatility.
    • This structural diversity tests the agent’s ability to generalize from regulated, low-volatility to unregulated, high-volatility regimes.

2. Environment Mechanics

At each trading interval, the environment-agent interaction follows a precise sequence:

  1. Fetch live prices ptp_t and news ctc_t.
  2. Emit observation ot=(qt,pt,ct)o_t = (q_t, p_t, c_t) reflecting the current state.
  3. The agent transforms oto_t via a “tool-use” module into structured features o~t\tilde{o}_t (e.g., returns, volatility, sentiment embeddings).
  4. Agent memory MtM_t, comprising the previous Δ\Delta observations, is concatenated with o~t\tilde{o}_t.
  5. The ReAct-style agent generates a chain-of-thought and outputs allocation at=fθ(ot,Mt)a_t = f_\theta(o_t, M_t).
  6. The environment executes the allocation:
    • Update portfolio value: vt+1=qtpt+1v_{t+1}^{-} = q_t^\top p_{t+1}
    • Rebalance holdings: qt+1=vt+1atpt+1q_{t+1} = v_{t+1}^{-} \frac{a_t}{p_{t+1}}, ensuring vt+1=qt+1pt+1=vt+1v_{t+1} = q_{t+1}^\top p_{t+1} = v_{t+1}^{-} (element-wise division).
  7. The next state ot+1o_{t+1} is emitted. This loop continues for horizon TT.

Agent–Environment Loop Pseudocode:

1
2
3
4
5
6
7
8
9
10
11
12
initialize q_0   initial holdings (e.g. all cash)
for t in 0..T-1:
    p_t, c_t   fetch_live_prices_and_news(t)
    o_t        (q_t, p_t, c_t)
    tilde_o_t  feature_extraction(o_t)
    M_t        update_memory(M_{t-1}, tilde_o_t)
    a_t        LLM_policy(o_t, M_t)  # JSON → allocation vector
    # Fetch next prices (live update)
    p_{t+1}    fetch_live_prices(t+1)
    v_next     q_t  p_{t+1}
    q_{t+1}    v_next × (a_t / p_{t+1})
end
This design reflects a fully live and sequential decision-making pipeline.

3. Market Structures and Agent Interface

The dual-market setup exposes agents to structurally distinct environments, summarized below:

Market Assets/Contracts Price Dynamics Notable Features
U.S. Stocks 15 equities/ETFs + cash Strong cross-sector correlations, moderate volatility Long-term fundamentals, lower frequency moves
Polymarket 10 binary contracts Sentiment-driven, sharp, frequent moves News-reactivity, high volatility

Agents interact with the environment solely through portfolio allocations encoded in a JSON schema, enforcing allocation constraints (long-only, sum to one, positive cash weight).

4. Performance Metrics and Evaluation

Evaluation utilizes five well-defined financial metrics, each formulated in LaTeX to ensure reproducibility:

  • Cumulative Return: CR=vTv0v0CR = \frac{v_T - v_0}{v_0}
  • Sharpe Ratio: SR=rˉrfσSR = \frac{\bar r - r_f}{\sigma}

where,

rˉ=1Tt=1Trt\displaystyle \bar r = \frac{1}{T}\sum_{t=1}^T r_t, rt=vtvt1vt1r_t = \frac{v_t - v_{t-1}}{v_{t-1}}, and σ\sigma specifies sample standard deviation.

  • Maximum Drawdown: MDD=maxt[1,T]  maxitvivtmaxitviMDD = \max_{t\in[1,T]}\;\frac{\max_{i\leq t}v_i - v_t}{\max_{i\leq t}v_i}
  • Win Rate: WR=1T1t=2T1{rt>0}WR = \frac{1}{T-1}\sum_{t=2}^T \mathbf{1}\{r_t > 0\}
  • Volatility: σ=1T1t=1T(rtrˉ)2\sigma = \sqrt{\frac{1}{T-1}\sum_{t=1}^T(r_t - \bar r)^2}

Editor's term: These metrics collectively delineate an agent’s return, risk-adjusted performance, loss resilience, consistency in outperforming prior periods, and risk profile over the test duration.

5. Empirical Findings from Live Evaluations

Twenty-one LLMs were evaluated during a 50-day window (August 18–October 24, 2025), spanning major families such as Claude, Grok, Qwen, Llama, GPT, Gemini, DeepSeek, and Kimi.

Key empirical results include:

  1. Benchmark Decoupling: High LMArena/general-LLM scores exhibit negligible or negative correlation (Spearman ρ0\rho \approx 0) with trading returns, particularly on Polymarket, indicating that traditional LLM benchmarks do not predict real-world trading competence.
  2. Market-Specific Generalization Gaps: The best stock-market agent (GPT-4.1, CR6.25%CR \approx 6.25\%, SR2.64SR \approx 2.64) performs poorly on Polymarket (CR<30%CR < -30\%, significant drawdowns), highlighting failures to generalize across market regimes.
  3. Strategy Spectrum: Models such as Grok-4 and Qwen2.5 maintain stable, low-volatility strategies, while others (GPT-5, Kimi-K2) pursue aggressive returns but encounter severe drawdowns.
  4. Action Timing Analysis: Delaying agent actions (rolling-k delta analysis) rapidly degrades performance (Δk<0\Delta_k < 0), confirming that agents exploit live signals rather than executing static or autoregressive behaviors.
  5. Reasoning Trace: LLMs most frequently cite news in reasoning traces, followed by price history and position information. Polymarket agents rely even more on fresh news inputs versus stock-market agents.

This set of findings substantiates the claim that static benchmark proficiency does not entail robust, real-time decision-making capacities in live markets.

6. Implementation Architecture and Reproducibility

LiveTradeBench is distributed as an open-source Python package (live-trade-bench) with seamless installation via pip. The environment utilizes:

  • Stocks: Daily bars from yfinance REST API.
  • Polymarket: Public CLOB HTTP endpoints for contract discovery and historical pricing.
  • News: Google News scraping through query-generated URLs.

LLM agent invocation leverages a React-style loop with integrated tool-use and memory modules. The unified JSON schema ensures compatibility with a thin client (LiteLLM), routing requests to providers such as anthropic, openai, gemini, x-ai, etc., with fixed model parameters (temperature 0.3, max_tokens 16,000).

Reproducing Live Runs:

  1. Clone repository: git clone https://github.com/ulab-uiuc/live-trade-bench
  2. Install: pip install live-trade-bench
  3. Configure credentials (as required).
  4. Launch a live trading session:
    1
    2
    3
    4
    5
    
    from live_trade_bench import LiveTradingEnv, build_llm_agent
    env   = LiveTradingEnv(market="stocks")  # or "polymarket"
    agent = build_llm_agent(model_name="openai/gpt-4.1")
    env.run_live(agent, days=50)
    print(env.summary_metrics())
  5. Outputs can be inspected through the web UI or exported as CSV.

The infrastructure combines real-time data streams, portfolio allocation, and standardized quantitative metrics, furnishing a reproducible apparatus for evaluating the full spectrum of LLM-based trading agent competence under real-world temporal and informational uncertainty (Yu et al., 5 Nov 2025).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Whiteboard

Follow Topic

Get notified by email when new papers are published related to LiveTradeBench.