LiveTradeBench: Real-Time Trading Benchmark

Updated 26 November 2025

LiveTradeBench is a real-time benchmark framework that assesses the trading competence of LLM agents using live market and news data.
It employs a portfolio-management abstraction and multi-market evaluation to measure risk, return, and adaptability in evolving market conditions.
Empirical evaluations reveal significant generalization gaps across different market regimes and diverse strategy profiles among LLM models.

LiveTradeBench is a benchmark framework designed to evaluate the real-time trading competence of LLM agents in live, evolving market environments. It departs from offline and synthetic benchmarks by integrating actual market data streams, portfolio-level control, and heterogeneous market structures to study decision-making under genuine uncertainty. The system couples live market and news data with LLM-driven portfolio management, providing the infrastructure and metrics necessary to rigorously diagnose agent performance across risk, return, and adaptability dimensions (Yu et al., 5 Nov 2025).

1. Design Principles

LiveTradeBench is anchored by three fundamental principles that define its architecture and experimental claims:

Live Data Streaming:
- Market prices for U.S. equities are sourced via public APIs (e.g., yfinance) and for Polymarket contracts via public CLOB endpoints, both utilizing a 10-day lookback to mitigate information leakage.
- News and sentiment are aggregated using Google News queries for stock tickers or Polymarket event texts, over a rolling window $[t-3, t-1]$ .
- Robustness to network conditions is enforced with randomized fetch delays, standard User-Agent headers, exponential-backoff retries, and JSON parsing with timeouts.
- Data preprocessing includes timestamp normalization (e.g., “3 hours ago” to UNIX time), title/snippet extraction, mapping to symbols/market IDs, and chronological sorting.
Portfolio-Management Abstraction:
- State at each timestep $t$ is $o_t = (q_t, p_t, c_t)$ , where $q_t \in \mathbb{R}_+^N$ represents holdings, $p_t \in \mathbb{R}^N$ represents current prices or probabilities, and $c_t$ conveys textual news summaries.
- The agent action $a_t \in \Delta^{N-1} = \{a \mid \sum_i a^{(i)}=1, a^{(i)}\geq 0\}$ is a long-only allocation vector over $N$ assets, including mandatory cash allocation.
- No shorting is permitted, and explicit portfolio allocation fosters cross-asset reasoning—balancing risk, sector exposure, volatility, and diversification.
Multi-Market Evaluation:
- U.S. Stock Market: 15 highly liquid equities/ETFs across sectors and a cash asset, characterized by strong cross-correlations and moderate volatility.
- Polymarket Prediction Markets: 10 binary contracts on diverse macro/political/crypto events, exhibiting high frequency, sentiment-driven moves, and elevated volatility.
- This structural diversity tests the agent’s ability to generalize from regulated, low-volatility to unregulated, high-volatility regimes.

2. Environment Mechanics

At each trading interval, the environment-agent interaction follows a precise sequence:

Fetch live prices $p_t$ and news $c_t$ .
Emit observation $o_t = (q_t, p_t, c_t)$ reflecting the current state.
The agent transforms $o_t$ via a “tool-use” module into structured features $\tilde{o}_t$ (e.g., returns, volatility, sentiment embeddings).
Agent memory $M_t$ , comprising the previous $\Delta$ observations, is concatenated with $\tilde{o}_t$ .
The ReAct-style agent generates a chain-of-thought and outputs allocation $a_t = f_\theta(o_t, M_t)$ .
The environment executes the allocation:
- Update portfolio value: $v_{t+1}^{-} = q_t^\top p_{t+1}$
- Rebalance holdings: $q_{t+1} = v_{t+1}^{-} \frac{a_t}{p_{t+1}}$ , ensuring $v_{t+1} = q_{t+1}^\top p_{t+1} = v_{t+1}^{-}$ (element-wise division).
The next state $o_{t+1}$ is emitted. This loop continues for horizon $T$ .

Agent–Environment Loop Pseudocode:

initialize q_0  ← initial holdings (e.g. all cash)
for t in 0..T-1:
    p_t, c_t  ← fetch_live_prices_and_news(t)
    o_t       ← (q_t, p_t, c_t)
    tilde_o_t ← feature_extraction(o_t)
    M_t       ← update_memory(M_{t-1}, tilde_o_t)
    a_t       ← LLM_policy(o_t, M_t)  # JSON → allocation vector
    # Fetch next prices (live update)
    p_{t+1}   ← fetch_live_prices(t+1)
    v_next    ← q_t ⋅ p_{t+1}
    q_{t+1}   ← v_next × (a_t / p_{t+1})
end

This design reflects a fully live and sequential decision-making pipeline.

3. Market Structures and Agent Interface

The dual-market setup exposes agents to structurally distinct environments, summarized below:

Market	Assets/Contracts	Price Dynamics	Notable Features
U.S. Stocks	15 equities/ETFs + cash	Strong cross-sector correlations, moderate volatility	Long-term fundamentals, lower frequency moves
Polymarket	10 binary contracts	Sentiment-driven, sharp, frequent moves	News-reactivity, high volatility

Agents interact with the environment solely through portfolio allocations encoded in a JSON schema, enforcing allocation constraints (long-only, sum to one, positive cash weight).

4. Performance Metrics and Evaluation

Evaluation utilizes five well-defined financial metrics, each formulated in LaTeX to ensure reproducibility:

Cumulative Return: $CR = \frac{v_T - v_0}{v_0}$
Sharpe Ratio: $SR = \frac{\bar r - r_f}{\sigma}$

where,

$\displaystyle \bar r = \frac{1}{T}\sum_{t=1}^T r_t$ , $r_t = \frac{v_t - v_{t-1}}{v_{t-1}}$ , and $\sigma$ specifies sample standard deviation.

Maximum Drawdown: $MDD = \max_{t\in[1,T]}\;\frac{\max_{i\leq t}v_i - v_t}{\max_{i\leq t}v_i}$
Win Rate: $WR = \frac{1}{T-1}\sum_{t=2}^T \mathbf{1}\{r_t > 0\}$
Volatility: $\sigma = \sqrt{\frac{1}{T-1}\sum_{t=1}^T(r_t - \bar r)^2}$

Editor's term: These metrics collectively delineate an agent’s return, risk-adjusted performance, loss resilience, consistency in outperforming prior periods, and risk profile over the test duration.

5. Empirical Findings from Live Evaluations

Twenty-one LLMs were evaluated during a 50-day window (August 18–October 24, 2025), spanning major families such as Claude, Grok, Qwen, Llama, GPT, Gemini, DeepSeek, and Kimi.

Key empirical results include:

Benchmark Decoupling: High LMArena/general-LLM scores exhibit negligible or negative correlation (Spearman $\rho \approx 0$ ) with trading returns, particularly on Polymarket, indicating that traditional LLM benchmarks do not predict real-world trading competence.
Market-Specific Generalization Gaps: The best stock-market agent (GPT-4.1, $CR \approx 6.25\%$ , $SR \approx 2.64$ ) performs poorly on Polymarket ( $CR < -30\%$ , significant drawdowns), highlighting failures to generalize across market regimes.
Strategy Spectrum: Models such as Grok-4 and Qwen2.5 maintain stable, low-volatility strategies, while others (GPT-5, Kimi-K2) pursue aggressive returns but encounter severe drawdowns.
Action Timing Analysis: Delaying agent actions (rolling-k delta analysis) rapidly degrades performance ( $\Delta_k < 0$ ), confirming that agents exploit live signals rather than executing static or autoregressive behaviors.
Reasoning Trace: LLMs most frequently cite news in reasoning traces, followed by price history and position information. Polymarket agents rely even more on fresh news inputs versus stock-market agents.

This set of findings substantiates the claim that static benchmark proficiency does not entail robust, real-time decision-making capacities in live markets.

6. Implementation Architecture and Reproducibility

LiveTradeBench is distributed as an open-source Python package (live-trade-bench) with seamless installation via pip. The environment utilizes:

Stocks: Daily bars from yfinance REST API.
Polymarket: Public CLOB HTTP endpoints for contract discovery and historical pricing.
News: Google News scraping through query-generated URLs.

LLM agent invocation leverages a React-style loop with integrated tool-use and memory modules. The unified JSON schema ensures compatibility with a thin client (LiteLLM), routing requests to providers such as anthropic, openai, gemini, x-ai, etc., with fixed model parameters (temperature 0.3, max_tokens 16,000).

Reproducing Live Runs:

Clone repository: git clone https://github.com/ulab-uiuc/live-trade-bench
Install: pip install live-trade-bench
Configure credentials (as required).

Launch a live trading session:

from live_trade_bench import LiveTradingEnv, build_llm_agent
env   = LiveTradingEnv(market="stocks")  # or "polymarket"
agent = build_llm_agent(model_name="openai/gpt-4.1")
env.run_live(agent, days=50)
print(env.summary_metrics())

Outputs can be inspected through the web UI or exported as CSV.

The infrastructure combines real-time data streams, portfolio allocation, and standardized quantitative metrics, furnishing a reproducible apparatus for evaluating the full spectrum of LLM-based trading agent competence under real-world temporal and informational uncertainty (Yu et al., 5 Nov 2025).

Markdown Upgrade to Chat

References (1)

LiveTradeBench: Seeking Real-World Alpha with Large Language Models (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to LiveTradeBench.