LiveTradeBench: Real-Time Trading Benchmark
- LiveTradeBench is a real-time benchmark framework that assesses the trading competence of LLM agents using live market and news data.
- It employs a portfolio-management abstraction and multi-market evaluation to measure risk, return, and adaptability in evolving market conditions.
- Empirical evaluations reveal significant generalization gaps across different market regimes and diverse strategy profiles among LLM models.
LiveTradeBench is a benchmark framework designed to evaluate the real-time trading competence of LLM agents in live, evolving market environments. It departs from offline and synthetic benchmarks by integrating actual market data streams, portfolio-level control, and heterogeneous market structures to study decision-making under genuine uncertainty. The system couples live market and news data with LLM-driven portfolio management, providing the infrastructure and metrics necessary to rigorously diagnose agent performance across risk, return, and adaptability dimensions (Yu et al., 5 Nov 2025).
1. Design Principles
LiveTradeBench is anchored by three fundamental principles that define its architecture and experimental claims:
- Live Data Streaming:
- Market prices for U.S. equities are sourced via public APIs (e.g., yfinance) and for Polymarket contracts via public CLOB endpoints, both utilizing a 10-day lookback to mitigate information leakage.
- News and sentiment are aggregated using Google News queries for stock tickers or Polymarket event texts, over a rolling window .
- Robustness to network conditions is enforced with randomized fetch delays, standard User-Agent headers, exponential-backoff retries, and JSON parsing with timeouts.
- Data preprocessing includes timestamp normalization (e.g., “3 hours ago” to UNIX time), title/snippet extraction, mapping to symbols/market IDs, and chronological sorting.
- Portfolio-Management Abstraction:
- State at each timestep is , where represents holdings, represents current prices or probabilities, and conveys textual news summaries.
- The agent action is a long-only allocation vector over assets, including mandatory cash allocation.
- No shorting is permitted, and explicit portfolio allocation fosters cross-asset reasoning—balancing risk, sector exposure, volatility, and diversification.
- Multi-Market Evaluation:
- U.S. Stock Market: 15 highly liquid equities/ETFs across sectors and a cash asset, characterized by strong cross-correlations and moderate volatility.
- Polymarket Prediction Markets: 10 binary contracts on diverse macro/political/crypto events, exhibiting high frequency, sentiment-driven moves, and elevated volatility.
- This structural diversity tests the agent’s ability to generalize from regulated, low-volatility to unregulated, high-volatility regimes.
2. Environment Mechanics
At each trading interval, the environment-agent interaction follows a precise sequence:
- Fetch live prices and news .
- Emit observation 0 reflecting the current state.
- The agent transforms 1 via a “tool-use” module into structured features 2 (e.g., returns, volatility, sentiment embeddings).
- Agent memory 3, comprising the previous 4 observations, is concatenated with 5.
- The ReAct-style agent generates a chain-of-thought and outputs allocation 6.
- The environment executes the allocation:
- Update portfolio value: 7
- Rebalance holdings: 8, ensuring 9 (element-wise division).
- The next state 0 is emitted. This loop continues for horizon 1.
Agent–Environment Loop Pseudocode:
5 This design reflects a fully live and sequential decision-making pipeline.
3. Market Structures and Agent Interface
The dual-market setup exposes agents to structurally distinct environments, summarized below:
| Market | Assets/Contracts | Price Dynamics | Notable Features |
|---|---|---|---|
| U.S. Stocks | 15 equities/ETFs + cash | Strong cross-sector correlations, moderate volatility | Long-term fundamentals, lower frequency moves |
| Polymarket | 10 binary contracts | Sentiment-driven, sharp, frequent moves | News-reactivity, high volatility |
Agents interact with the environment solely through portfolio allocations encoded in a JSON schema, enforcing allocation constraints (long-only, sum to one, positive cash weight).
4. Performance Metrics and Evaluation
Evaluation utilizes five well-defined financial metrics, each formulated in LaTeX to ensure reproducibility:
- Cumulative Return: 2
- Sharpe Ratio: 3
where,
4, 5, and 6 specifies sample standard deviation.
- Maximum Drawdown: 7
- Win Rate: 8
- Volatility: 9
Editor's term: These metrics collectively delineate an agent’s return, risk-adjusted performance, loss resilience, consistency in outperforming prior periods, and risk profile over the test duration.
5. Empirical Findings from Live Evaluations
Twenty-one LLMs were evaluated during a 50-day window (August 18–October 24, 2025), spanning major families such as Claude, Grok, Qwen, Llama, GPT, Gemini, DeepSeek, and Kimi.
Key empirical results include:
- Benchmark Decoupling: High LMArena/general-LLM scores exhibit negligible or negative correlation (Spearman 0) with trading returns, particularly on Polymarket, indicating that traditional LLM benchmarks do not predict real-world trading competence.
- Market-Specific Generalization Gaps: The best stock-market agent (GPT-4.1, 1, 2) performs poorly on Polymarket (3, significant drawdowns), highlighting failures to generalize across market regimes.
- Strategy Spectrum: Models such as Grok-4 and Qwen2.5 maintain stable, low-volatility strategies, while others (GPT-5, Kimi-K2) pursue aggressive returns but encounter severe drawdowns.
- Action Timing Analysis: Delaying agent actions (rolling-k delta analysis) rapidly degrades performance (4), confirming that agents exploit live signals rather than executing static or autoregressive behaviors.
- Reasoning Trace: LLMs most frequently cite news in reasoning traces, followed by price history and position information. Polymarket agents rely even more on fresh news inputs versus stock-market agents.
This set of findings substantiates the claim that static benchmark proficiency does not entail robust, real-time decision-making capacities in live markets.
6. Implementation Architecture and Reproducibility
LiveTradeBench is distributed as an open-source Python package (live-trade-bench) with seamless installation via pip. The environment utilizes:
- Stocks: Daily bars from yfinance REST API.
- Polymarket: Public CLOB HTTP endpoints for contract discovery and historical pricing.
- News: Google News scraping through query-generated URLs.
LLM agent invocation leverages a React-style loop with integrated tool-use and memory modules. The unified JSON schema ensures compatibility with a thin client (LiteLLM), routing requests to providers such as anthropic, openai, gemini, x-ai, etc., with fixed model parameters (temperature 0.3, max_tokens 16,000).
Reproducing Live Runs:
- Clone repository:
git clone https://github.com/ulab-uiuc/live-trade-bench - Install:
pip install live-trade-bench - Configure credentials (as required).
- Launch a live trading session: 6
- Outputs can be inspected through the web UI or exported as CSV.
The infrastructure combines real-time data streams, portfolio allocation, and standardized quantitative metrics, furnishing a reproducible apparatus for evaluating the full spectrum of LLM-based trading agent competence under real-world temporal and informational uncertainty (Yu et al., 5 Nov 2025).