LiveTradeBench: Real-Time Trading Benchmark
- LiveTradeBench is a real-time benchmark framework that assesses the trading competence of LLM agents using live market and news data.
- It employs a portfolio-management abstraction and multi-market evaluation to measure risk, return, and adaptability in evolving market conditions.
- Empirical evaluations reveal significant generalization gaps across different market regimes and diverse strategy profiles among LLM models.
LiveTradeBench is a benchmark framework designed to evaluate the real-time trading competence of LLM agents in live, evolving market environments. It departs from offline and synthetic benchmarks by integrating actual market data streams, portfolio-level control, and heterogeneous market structures to study decision-making under genuine uncertainty. The system couples live market and news data with LLM-driven portfolio management, providing the infrastructure and metrics needed to diagnose agent performance across risk, return, and adaptability dimensions (Yu et al., 5 Nov 2025).
1. Design Principles
LiveTradeBench is anchored by three fundamental principles that define its architecture and experimental claims:
- Live Data Streaming:
- Market prices for U.S. equities are sourced via public APIs (e.g., yfinance) and for Polymarket contracts via public CLOB endpoints, both utilizing a 10-day lookback to mitigate information leakage.
- News and sentiment are aggregated using Google News queries for stock tickers or Polymarket event texts, over a rolling window.
- Robustness to network conditions is enforced with randomized fetch delays, standard User-Agent headers, exponential-backoff retries, and JSON parsing with timeouts.
- Data preprocessing includes timestamp normalization (e.g., “3 hours ago” to UNIX time), title/snippet extraction, mapping to symbols/market IDs, and chronological sorting.
- Portfolio-Management Abstraction:
- State at each timestep $t$ is $o_t = (q_t, p_t, c_t)$, where $q_t$ represents holdings, $p_t$ represents current prices or probabilities, and $c_t$ conveys textual news summaries.
- The agent action $a_t$ is a long-only allocation vector over assets, including mandatory cash allocation.
- No shorting is permitted, and explicit portfolio allocation fosters cross-asset reasoning—balancing risk, sector exposure, volatility, and diversification.
- Multi-Market Evaluation:
- U.S. Stock Market: 15 highly liquid equities/ETFs across sectors and a cash asset, characterized by strong cross-correlations and moderate volatility.
- Polymarket Prediction Markets: 10 binary contracts on diverse macro/political/crypto events, exhibiting high frequency, sentiment-driven moves, and elevated volatility.
- This structural diversity tests the agent’s ability to generalize from regulated, low-volatility to unregulated, high-volatility regimes.
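The timestamp normalization and chronological sorting mentioned under Live Data Streaming can be sketched as follows. This is a minimal illustration, not the package's actual preprocessing code: the function names, the `age` and `ts` field names, and the set of supported units are all assumptions made for the example.

```python
import re
from datetime import datetime, timezone

# Seconds per unit for relative labels such as "3 hours ago" (assumed unit set).
_UNIT_SECONDS = {"minute": 60, "hour": 3600, "day": 86400, "week": 604800}
_RELATIVE = re.compile(r"(\d+)\s+(minute|hour|day|week)s?\s+ago", re.IGNORECASE)


def to_unix(relative: str, now: datetime) -> int:
    """Convert a relative label like '3 hours ago' into a UNIX timestamp."""
    match = _RELATIVE.fullmatch(relative.strip())
    if match is None:
        raise ValueError(f"unrecognized relative time: {relative!r}")
    count, unit = int(match.group(1)), match.group(2).lower()
    return int(now.timestamp()) - count * _UNIT_SECONDS[unit]


def sort_chronologically(items: list[dict], now: datetime) -> list[dict]:
    """Attach a 'ts' field (hypothetical name) to each item and sort oldest first."""
    stamped = [{**item, "ts": to_unix(item["age"], now)} for item in items]
    return sorted(stamped, key=lambda item: item["ts"])


# Illustrative inputs only; real items would come from the news fetcher.
now = datetime(2025, 10, 24, 12, 0, tzinfo=timezone.utc)
news = [
    {"title": "B", "age": "3 hours ago"},
    {"title": "A", "age": "2 days ago"},
]
ordered = sort_chronologically(news, now)
```

Passing `now` explicitly rather than reading the clock inside the function keeps the conversion deterministic, which matters when the same news snapshot has to be replayed.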
2. Environment Mechanics
At each trading interval, the environment-agent interaction follows a precise sequence:
- Fetch live prices $p_t$ and news $c_t$.
- Emit observation $o_t = (q_t, p_t, c_t)$ reflecting the current state.
- The agent transforms $o_t$ via a “tool-use” module into structured features $\tilde{o}_t$ (e.g., returns, volatility, sentiment embeddings).
- Agent memory $M_{t-1}$, comprising the previous observations, is concatenated with $\tilde{o}_t$ to give $M_t$.
- The ReAct-style agent generates a chain-of-thought and outputs allocation $a_t$.
- The environment executes the allocation:
- Update portfolio value: $v_{t+1} = q_t \cdot p_{t+1}$
- Rebalance holdings: $q_{t+1} = v_{t+1}\,(a_t / p_{t+1})$ with element-wise division, ensuring $q_{t+1} \cdot p_{t+1} = v_{t+1}$ because the weights sum to one.
- The next state $o_{t+1}$ is emitted. This loop continues for horizon $T$.
Agent–Environment Loop Pseudocode:
```
initialize q_0 ← initial holdings (e.g. all cash)
for t in 0..T-1:
    p_t, c_t ← fetch_live_prices_and_news(t)
    o_t ← (q_t, p_t, c_t)
    tilde_o_t ← feature_extraction(o_t)
    M_t ← update_memory(M_{t-1}, tilde_o_t)
    a_t ← LLM_policy(o_t, M_t)   # JSON → allocation vector
    # Fetch next prices (live update)
    p_{t+1} ← fetch_live_prices(t+1)
    v_next ← q_t ⋅ p_{t+1}
    q_{t+1} ← v_next × (a_t / p_{t+1})
end
```
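The mark-to-market and rebalancing steps of the loop can be made concrete with a small runnable sketch. It makes several assumptions not found in the source: a fixed synthetic price path stands in for live data, an equal-weight rule stands in for the LLM policy, cash is modeled as asset 0 with a constant price of 1.0, and transaction costs are ignored. All function names are hypothetical.

```python
def rebalance(holdings, next_prices, weights):
    """Mark the portfolio to market at the next prices, then rebalance.

    All three arguments are equal-length lists. Index 0 is cash, modeled
    here (an assumption) as an asset whose price is always 1.0.
    """
    # v_{t+1} = q_t . p_{t+1}
    value = sum(q * p for q, p in zip(holdings, next_prices))
    # q_{t+1} = v_{t+1} * (a_t / p_{t+1}), element-wise
    new_holdings = [value * a / p for a, p in zip(weights, next_prices)]
    return value, new_holdings


def equal_weight_policy(observation):
    """Placeholder for the LLM policy: split evenly across all assets."""
    n = len(observation["prices"])
    return [1.0 / n] * n


# Synthetic price path for cash plus two assets (illustrative numbers only).
price_path = [
    [1.0, 100.0, 50.0],
    [1.0, 110.0, 50.0],
    [1.0, 110.0, 45.0],
]

holdings = [1000.0, 0.0, 0.0]  # start fully in cash
values = []
for t in range(len(price_path) - 1):
    obs = {"holdings": holdings, "prices": price_path[t], "news": []}
    weights = equal_weight_policy(obs)
    value, holdings = rebalance(holdings, price_path[t + 1], weights)
    values.append(value)
```

Because the weights sum to one, the value of the rebalanced holdings at the new prices equals the marked-to-market value, so rebalancing itself neither creates nor destroys wealth in this sketch.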
3. Market Structures and Agent Interface
The dual-market setup exposes agents to structurally distinct environments, summarized below:
| Market | Assets/Contracts | Price Dynamics | Notable Features |
|---|---|---|---|
| U.S. Stocks | 15 equities/ETFs + cash | Strong cross-sector correlations, moderate volatility | Long-term fundamentals, lower frequency moves |
| Polymarket | 10 binary contracts | Sentiment-driven, sharp, frequent moves | News-reactivity, high volatility |
Agents interact with the environment solely through portfolio allocations encoded in a JSON schema, enforcing allocation constraints (long-only, sum to one, positive cash weight).
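A validator for the three stated constraints might look like the sketch below. The source does not show the JSON schema, so the flat `{"SYMBOL": weight}` layout, the `"CASH"` key, the tolerance, and the function name are assumptions chosen for illustration.

```python
import json
import math


def parse_allocation(raw: str, assets: list[str], tol: float = 1e-6) -> dict[str, float]:
    """Parse a JSON allocation and enforce long-only, sum-to-one, positive cash."""
    weights = json.loads(raw)
    if not isinstance(weights, dict):
        raise ValueError("allocation must be a JSON object")
    expected = set(assets) | {"CASH"}  # "CASH" key is a hypothetical convention
    if set(weights) != expected:
        raise ValueError(f"allocation keys must be exactly {sorted(expected)}")
    for name, w in weights.items():
        is_number = isinstance(w, (int, float)) and not isinstance(w, bool)
        if not is_number or not math.isfinite(w):
            raise ValueError(f"weight for {name} is not a finite number")
        if w < 0:
            raise ValueError(f"short position in {name} is not allowed")
    if weights["CASH"] <= 0:
        raise ValueError("cash weight must be positive")
    if abs(sum(weights.values()) - 1.0) > tol:
        raise ValueError("weights must sum to one")
    return {name: float(w) for name, w in weights.items()}


# Example with made-up tickers and weights.
ok = parse_allocation('{"AAPL": 0.6, "MSFT": 0.3, "CASH": 0.1}', ["AAPL", "MSFT"])
```

Rejecting a malformed allocation outright, rather than silently renormalizing it, keeps model errors visible in the evaluation logs; whether the benchmark itself rejects or repairs such outputs is not stated in the source.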
4. Performance Metrics and Evaluation
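Evaluation uses five financial metrics, each given an explicit formula in the paper to ensure reproducibility:
- Cumulative Return: the total fractional change in portfolio value over the evaluation horizon.
- Sharpe Ratio: the mean per-period return, net of any risk-free rate, divided by the sample standard deviation of those returns.
- Maximum Drawdown: the largest peak-to-trough decline in portfolio value, expressed as a fraction of the peak.
- Win Rate: the fraction of periods in which portfolio value rises relative to the prior period.
- Volatility: the standard deviation of per-period returns.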
These metrics collectively delineate an agent’s return, risk-adjusted performance, loss resilience, consistency in outperforming prior periods, and risk profile over the test duration.
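The five metrics can be computed from a series of portfolio values as sketched below. The formulas are the common textbook forms and are assumptions here: simple per-period returns, a zero risk-free rate by default, no annualization, and the sample (n-1) standard deviation. The paper's exact conventions may differ.

```python
import math


def simple_returns(values):
    """r_t = v_t / v_{t-1} - 1 for each consecutive pair of portfolio values."""
    return [values[i] / values[i - 1] - 1.0 for i in range(1, len(values))]


def sample_std(xs):
    """Sample standard deviation with the n-1 denominator."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))


def cumulative_return(values):
    return values[-1] / values[0] - 1.0


def sharpe_ratio(values, risk_free=0.0):
    """Mean excess return over its sample std (not annualized; an assumption)."""
    excess = [r - risk_free for r in simple_returns(values)]
    return (sum(excess) / len(excess)) / sample_std(excess)


def max_drawdown(values):
    """Largest fractional drop from a running peak."""
    peak, worst = values[0], 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


def win_rate(values):
    """Fraction of periods with a strictly positive return."""
    rets = simple_returns(values)
    return sum(r > 0 for r in rets) / len(rets)


def volatility(values):
    return sample_std(simple_returns(values))


# Made-up equity curve used only to exercise the functions.
curve = [100.0, 110.0, 99.0, 108.9]
report = {
    "cumulative_return": cumulative_return(curve),
    "sharpe": sharpe_ratio(curve),
    "max_drawdown": max_drawdown(curve),
    "win_rate": win_rate(curve),
    "volatility": volatility(curve),
}
```

On this toy curve the returns are +10%, -10%, +10%, so the curve ends below its peak-implied compounding and the drawdown is the 110 to 99 drop.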
5. Empirical Findings from Live Evaluations
Twenty-one LLMs were evaluated during a 50-day window (August 18–October 24, 2025), spanning major families such as Claude, Grok, Qwen, Llama, GPT, Gemini, DeepSeek, and Kimi.
Key empirical results include:
- Benchmark Decoupling: High LMArena/general-LLM scores exhibit negligible or negative Spearman correlation with trading returns, particularly on Polymarket, indicating that traditional LLM benchmarks do not predict real-world trading competence.
- Market-Specific Generalization Gaps: The best stock-market agent (GPT-4.1) performs poorly on Polymarket, with significant drawdowns, highlighting a failure to generalize across market regimes.
- Strategy Spectrum: Models such as Grok-4 and Qwen2.5 maintain stable, low-volatility strategies, while others (GPT-5, Kimi-K2) pursue aggressive returns but encounter severe drawdowns.
- Action Timing Analysis: Delaying agent actions (rolling-k delta analysis) rapidly degrades performance, confirming that agents exploit live signals rather than executing static or autoregressive behaviors.
- Reasoning Trace: LLMs most frequently cite news in reasoning traces, followed by price history and position information. Polymarket agents rely even more heavily on fresh news inputs than stock-market agents do.
This set of findings substantiates the claim that static benchmark proficiency does not entail robust, real-time decision-making capacities in live markets.
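The benchmark-decoupling result rests on a Spearman rank correlation between general-LLM scores and trading returns. A dependency-free sketch of that statistic is shown below; the input numbers are made-up placeholders, not values from the paper, and the function names are hypothetical.

```python
def ranks(xs):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Placeholder inputs for illustration only, NOT benchmark results.
arena_scores = [1400, 1350, 1300, 1250]
trading_returns = [0.01, 0.03, -0.02, 0.02]
rho = spearman(arena_scores, trading_returns)
```

Working on ranks rather than raw values makes the statistic insensitive to the very different scales of arena scores and percentage returns.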
6. Implementation Architecture and Reproducibility
LiveTradeBench is distributed as an open-source Python package (live-trade-bench) with seamless installation via pip. The environment utilizes:
- Stocks: Daily bars retrieved via the yfinance library.
- Polymarket: Public CLOB HTTP endpoints for contract discovery and historical pricing.
- News: Google News scraping through query-generated URLs.
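All three sources are fetched over the network, where the design calls for randomized delays and exponential-backoff retries. One way such a wrapper could look is sketched here; the function name, default parameters, and the injected `fetch` and `sleep` callables are assumptions for the example rather than the package's actual interface.

```python
import random
import time


def fetch_with_retries(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, backing off exponentially with jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:  # real code would catch only network/timeout errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Base delay doubles each attempt; jitter spreads out retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


# Usage with a stand-in fetcher that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []  # record delays instead of actually sleeping
result = fetch_with_retries(flaky_fetch, sleep=delays.append)
```

Injecting `sleep` keeps the example instantaneous and makes the backoff schedule easy to inspect; in a live run the default `time.sleep` would be used.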
LLM agent invocation uses a ReAct-style loop with integrated tool-use and memory modules. The unified JSON schema ensures compatibility with a thin client (LiteLLM), which routes requests to providers such as anthropic, openai, gemini, and x-ai with fixed model parameters (temperature 0.3, max_tokens 16,000).
Reproducing Live Runs:
- Clone repository: `git clone https://github.com/ulab-uiuc/live-trade-bench`
- Install: `pip install live-trade-bench`
- Configure credentials (as required).
- Launch a live trading session:
```python
from live_trade_bench import LiveTradingEnv, build_llm_agent

env = LiveTradingEnv(market="stocks")  # or "polymarket"
agent = build_llm_agent(model_name="openai/gpt-4.1")
env.run_live(agent, days=50)
print(env.summary_metrics())
```
- Outputs can be inspected through the web UI or exported as CSV.
The infrastructure combines real-time data streams, portfolio allocation, and standardized quantitative metrics, furnishing a reproducible apparatus for evaluating the full spectrum of LLM-based trading agent competence under real-world temporal and informational uncertainty (Yu et al., 5 Nov 2025).