Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets
Abstract: We introduce Prediction Arena, a benchmark for evaluating AI models' predictive accuracy and decision-making by enabling them to trade autonomously on live prediction markets with real capital. Unlike synthetic benchmarks, Prediction Arena tests models in environments where trades execute on actual exchanges (Kalshi and Polymarket), providing objective ground truth that cannot be gamed or overfitted. Each model operates as an independent agent starting with $10,000, making autonomous decisions every 15-45 minutes. Over a 57-day longitudinal evaluation (January 12 to March 9, 2026), we track two cohorts: six frontier models in live trading (Cohort 1, full period) and four next-generation models in paper trading (Cohort 2, 3-day preliminary). For Cohort 1, final Kalshi returns range from -16.0% to -30.8%. Our analysis identifies a clear performance hierarchy: initial prediction accuracy and the ability to capitalize on correct predictions are the main drivers, while research volume shows no correlation with outcomes. A striking cross-platform contrast emerges from parallel Polymarket live trading: Cohort 1 models averaged only -1.1% on Polymarket vs. -22.6% on Kalshi, with grok-4-20-checkpoint achieving a 71.4% settlement win rate, the highest across any platform or cohort. gemini-3.1-pro-preview (Cohort 2), which executed zero trades on Kalshi, achieved +6.02% on Polymarket in 3 days, the best return of any model across either cohort, demonstrating that platform design has a profound effect on which models succeed. Beyond performance, we analyze computational efficiency (token usage, cycle time), settlement accuracy, exit patterns, and market preferences, providing a comprehensive view of how frontier models behave under real financial pressure.
Explain it Like I'm 14
What this paper is about (big picture)
The paper introduces Prediction Arena, a “real-life test” for AI models. Instead of scoring AIs on quizzes or simulations, the authors let them trade on real prediction markets (places where people buy and sell “yes/no” shares on future events, like “Will it rain on Friday?”). The AIs get their own accounts and money, make trades every 15–45 minutes, and are judged by whether they make or lose money. This shows how well they can predict the future and make good decisions when it matters.
What questions the researchers asked
- Can AI models really predict future events well enough to make money when they have to act in the real world, not just answer test questions?
- Which skills matter most: researching a lot, being fast, or being accurate and knowing when to act?
- Does the type of market matter? Do AIs do better when they can pick their own opportunities or when everyone is forced to trade the same set of questions?
- How efficient are the models (for example, how many “thinking tokens” they use and how long they take per decision), and does that relate to success?
How the test worked (in everyday terms)
Think of this like a sports league for AIs:
- Each AI is a “player” that starts with $10,000.
- They “play” by trading in prediction markets such as Kalshi and Polymarket. A market is just a yes/no question priced between $0 and $1. If the event happens, “Yes” pays $1; if not, “No” pays $1. Buy low and be right, and you profit.
- Every 15–45 minutes, the AI:
- Looks at its balance and open bets.
- Checks current prices and rules of markets.
- Does some research on the web if it wants.
- Decides what to buy or sell.
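To make that cycle concrete, here is a minimal Python sketch of one decision cycle. Everything in it is illustrative: the helper functions, the market data, and the "buy YES below $0.40" rule are made up for the example; in the real benchmark the language model itself makes the trading decision through the paper's sandboxed tools.

```python
# Minimal sketch of one decision cycle; all helpers and data are placeholders.

def get_account():
    # Placeholder: cash balance plus any open yes/no positions.
    return {"cash": 10_000.0, "positions": []}

def fetch_markets():
    # Placeholder: each market is a yes/no question priced between $0 and $1.
    return [{"id": "RAIN-FRI", "question": "Will it rain on Friday?", "yes_price": 0.38}]

def research(question):
    # Placeholder for the optional web-search step.
    return f"notes about: {question}"

def decide_trades(account, markets, notes):
    # Placeholder policy: in the benchmark, the language model makes this decision.
    # Here we simply "buy YES" on anything priced below $0.40 to show an order's shape.
    return [{"market": m["id"], "side": "yes", "contracts": 10, "limit": m["yes_price"]}
            for m in markets if m["yes_price"] < 0.40]

def run_cycle():
    account = get_account()                               # 1. look at balance and open bets
    markets = fetch_markets()                             # 2. check current prices and rules
    notes = [research(m["question"]) for m in markets]    # 3. optional research
    for order in decide_trades(account, markets, notes):  # 4. decide what to buy or sell
        print("placing order:", order)

run_cycle()
# In the benchmark this loop repeats every 15-45 minutes.
```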
Two versions of the game:
- Kalshi (curated set): All AIs see the same 29 carefully chosen questions. This is like everyone playing the same fixed schedule, so results reflect pure prediction skill more than choosing the “easiest” opponents.
- Polymarket (open discovery): AIs can search the whole site to find any market they like. This tests whether they can find good opportunities in a big universe—like scouting for the best matches to play.
Keeping score fairly:
- Account value is updated using “mark-to-market.” In plain terms: they value open bets at the price someone would pay them right now (the bid). It’s like saying, “If I tried to sell this sneaker today, what would I actually get for it?” That’s more honest than pretending you still have your original purchase price.
- “Settlement” means the real-world event has happened (e.g., the game is over, the weather report is in), so the bet pays out. “Win rate” here means how often settled bets ended up profitable. Both scoring ideas are sketched in code right after this list.
- Some models traded with real money (live trading); some newer ones did a short test with “paper trading,” which simulates trades without needing a real counterparty. Paper trading is easier because your orders always “fill.”
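The two scoring ideas above, mark-to-market account value and settlement win rate, fit in a few lines of Python. The numbers below are made up purely to show the arithmetic; they are not from the paper.

```python
def account_value(cash, open_positions):
    # Mark-to-market: value each open bet at the current bid (what you'd get selling it now).
    return cash + sum(p["contracts"] * p["current_bid"] for p in open_positions)

def settlement_win_rate(settled):
    # Share of settled bets that ended up profitable (payout exceeded what they cost).
    wins = sum(1 for p in settled if p["payout"] > p["cost"])
    return wins / len(settled) if settled else 0.0

# Made-up example: $9,200 in cash, one open bet, two settled bets.
open_bets = [{"contracts": 100, "current_bid": 0.35}]   # worth $35 if sold right now
settled_bets = [
    {"cost": 40.0, "payout": 100.0},  # bought 100 YES at $0.40, event happened -> profit
    {"cost": 60.0, "payout": 0.0},    # bought 100 YES at $0.60, event didn't happen -> loss
]
print(account_value(9_200.0, open_bets))   # 9235.0
print(settlement_win_rate(settled_bets))   # 0.5
```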
Safety rules (so nothing gets out of hand; a sketch of these checks follows the list):
- No all-in gambles: each position is capped at 15% of the account.
- No borrowing or moving money.
- Trades must fit within available cash (including fees).
- AIs can only use a few sandboxed tools (search, notes, the exchange). No shell access or outside systems. Humans monitor the system.
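As an illustration only, the position-size and cash guardrails could be enforced with a pre-trade check like the one below. The function name, fee handling, and numbers are hypothetical, not the paper's actual implementation.

```python
def check_order(order, cash, account_value, fee=0.0):
    # Hypothetical pre-trade check mirroring the guardrails listed above.
    cost = order["contracts"] * order["limit"] + fee

    # No all-in gambles: a single position may not exceed 15% of the account.
    if cost > 0.15 * account_value:
        return False, "position would exceed 15% of account value"

    # Trades must fit within available cash, including fees (no borrowing).
    if cost > cash:
        return False, "not enough cash to cover cost plus fees"

    return True, "ok"

# Example: a $2,000 position on a $10,000 account fails the 15% cap.
order = {"contracts": 4_000, "limit": 0.50}
print(check_order(order, cash=9_500.0, account_value=10_000.0))
```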
What they found (plain summary)
- On Kalshi (everyone trades the same questions), all six live-trading models lost money over 57 days:
- Final returns ranged from about −16% to −31%.
- “grok-4-20-checkpoint” was the best performer over the first month, but by the end “glm-4.7” ranked first overall.
- Models that were better at being right when it counted (higher settlement win rates) and that actually capitalized on those correct calls did better. Doing more research didn’t correlate with better results.
- On Polymarket (open discovery), the same models did much better:
- Losses were tiny on average (around −1.1%) compared to Kalshi’s much larger losses over the same period.
- “grok-4-20-checkpoint” had a 71.4% win rate on settled Polymarket bets, the best across any platform or group.
- One next-gen model (“gemini-3.1-pro-preview”) made no trades on Kalshi but earned +6.02% in just 3 days of paper trading on Polymarket. This shows that letting AIs choose their own markets can change who shines.
- Efficiency and effort didn’t predict success:
- Using more tokens, taking longer to “think,” or doing more web searches didn’t reliably lead to better performance.
- The most careful or wordy model wasn’t the best trader.
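The "didn't correlate" claim is the kind of thing a one-line statistic can check. The sketch below uses invented numbers and plain Pearson correlation (the paper does not say which statistic it used), just to show what "effort doesn't track returns" looks like numerically.

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Invented numbers for six hypothetical models: tokens used (millions) vs. final return (%).
tokens_used  = [10, 12, 14, 16, 18, 20]
final_return = [-20, -28, -17, -30, -16, -25]

# A value near 0 means effort and outcome move independently.
print(round(correlation(tokens_used, final_return), 2))  # about -0.02
```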
- Risk patterns:
- Some models had big swings (drawdowns), but that didn’t always mean worse results—one top early performer had a large drop later but still led for a while due to earlier gains.
- Important caution:
- The “next-gen” cohort only ran for 3 days and used paper trading, which doesn’t face real-world execution problems. So you shouldn’t directly compare them to the 57-day, real-money results.
Why these results matter
- Real-world tests beat study guides: Performing well on a static test is not the same as making money against a crowd that constantly updates prices as new information appears. Profit here measures whether the AI is not just “right,” but right when the crowd is wrong—and willing to act on it.
- Market design changes who wins: AIs performed much better when they could pick their own markets (Polymarket) than when they were forced to trade the same fixed set (Kalshi). This suggests product design and discovery tools can massively affect AI success.
- “Try harder” isn’t enough: Doing more research or spending more computation didn’t guarantee better trading. What mattered most was being accurate and timing decisions well.
- Ground truth is objective: Because prediction markets settle on real events, there’s no way to “game” the answer key. This makes the benchmark harder and more honest.
What this could lead to (simple implications)
- Better AI forecasting tools: By seeing how AIs behave with real money on the line, researchers can build models that are more cautious, better-calibrated, and more selective—useful for business, economics, sports analytics, and public policy planning.
- Smarter benchmarks: Future AI tests may use more live, real-world challenges where outcomes can’t be faked and the “right answer” is decided by what actually happens.
- Platform design guidance: Market features (like discovery tools, liquidity, and question selection) can make or break AI performance. Designers can use this to create fairer, more informative evaluation environments.
- Safer autonomy: The safety and financial guardrails used here (position limits, isolation, sandboxed tools) offer a template for testing autonomous AIs responsibly in high-stakes settings.
In short, Prediction Arena shows that if you want to know whether an AI can truly predict the future and act wisely, you have to let it play—and keep score—in the real world.