Foresight Arena: An On-Chain Benchmark for Evaluating AI Forecasting Agents

Published 1 May 2026 in cs.MA, cs.LG, and q-fin.GN | (2605.00420v2)

Abstract: Evaluating the true forecasting ability of AI agents requires environments that are resistant to overfitting, free from centralized trust, and grounded in incentive-compatible scoring. Existing benchmarks either rely on static datasets vulnerable to training-data contamination, or measure trading PnL -- a metric conflating predictive accuracy with timing, sizing, and risk appetite. We introduce Foresight Arena, the first permissionless, on-chain benchmark for evaluating AI forecasting agents on real-world prediction markets. Agents submit probabilistic forecasts on binary Polymarket markets via a commit-reveal protocol enforced by Solidity smart contracts on Polygon PoS; outcomes are resolved trustlessly through the Gnosis Conditional Token Framework. Performance is measured by the Brier Score and a novel Alpha Score -- proper scoring rules that incentivize honest probability reporting and isolate predictive edge over market consensus. We provide a formal analysis: closed-form variance for per-market Alpha, the connection to Murphy's classical Brier decomposition, and a power analysis characterizing the number of rounds required to reliably distinguish agents of different skill levels. We show that detecting a true edge of $\alpha^* = 0.02$ at 80% power requires approximately 350 resolved binary predictions (50 rounds of 7 markets), while $\alpha^* = 0.01$ requires four times more. We complement these analytical results with a deterministic, seed-controlled simulation study calibrated to literature-reported Brier-score ranges, illustrating how Murphy decomposition distinguishes well-calibrated agents from market-tracking agents that fail through reduced resolution. Live results from the deployed benchmark will be reported in a future revision. All smart contracts and evaluation infrastructure are open-source.


Authors (2)

Summary

The paper presents a novel on-chain benchmark for evaluating AI forecasting agents using proper scoring metrics like Brier and Alpha scores.
It employs decentralized smart contracts and commit-reveal protocols to ensure verifiable, tamper-proof forecasting evaluation.
The simulation study illustrates that positive Alpha requires genuinely outperforming market consensus, and that small edges need hundreds of resolved markets to detect with adequate statistical power.


Introduction

"Foresight Arena: An On-Chain Benchmark for Evaluating AI Forecasting Agents" (2605.00420) presents a rigorously engineered framework for assessing the forecasting prowess of AI agents, specifically LLM-based models, using live, real-world prediction markets as the evaluative substrate. This work is motivated by core deficiencies in extant evaluation paradigms: dataset contamination due to static benchmarks, reliance on centralized and non-transparent recording and scoring, and the confounding influence of trading-derived metrics (PnL) that blend predictive signal with risk and market-timing artifacts. Foresight Arena addresses these gaps by offering a permissionless, trustless, and verifiably open infrastructure for probabilistic evaluation, anchored in proper scoring rules and implemented fully on-chain.

Limitations in Existing AI Forecasting Evaluation

The authors dissect three principal limitations of prevailing forecasting agent benchmarks:

Dataset contamination exposes static benchmarks to leakage and gaming, undermining their validity as true forward-looking evaluation.
Centralized trust in record-keeping and resolution procedures induces unverifiability and limits integrity, especially in settings with economic or reputational implications.
Improper evaluation metrics, particularly the confounding nature of trading profit & loss, obscure isolation of pure probabilistic accuracy, muddying distinctions between signal, risk-taking, and market access.

These issues are underscored empirically by results from contemporary competitions such as "Prediction Arena" (Zhang et al., 28 Mar 2026), where all six leading LLMs lost capital as market participants yet showed divergent behavior across market platforms, a clear signal of confounds beyond forecasting quality.

Foresight Arena Benchmark Design

Foresight Arena introduces an architecture that is both technically and methodologically robust:

On-Chain Protocols: All prediction commitments, reveals, and scoring computations are recorded and executed via audited Solidity smart contracts (PredictionArena and RoundManager) on Polygon PoS. Every aspect of an agent's performance is thus transparent and immutable.
Trustless Outcome Resolution: Resolution is sourced directly from Gnosis Conditional Token Framework (the oracle backend for Polymarket), ensuring outcome veracity without centralized arbitration.
Proper Incentive-Compatible Scoring: Performance is measured using the Brier Score and a novel Alpha Score, which quantifies predictive edge relative to the market consensus. Both are shown to be strictly proper scoring rules.
Commit-Reveal Protocol: A cryptographically enforced commit-reveal protocol ensures that agents submit immutable probabilistic vectors before market resolution, preventing peer imitation or retroactive gaming.
ERC-8004 On-Chain Reputation: Each agent’s cumulative performance is recorded in an ERC-8004 registry, yielding verifiable, persistent, and tamper-resistant credentials.
Permissionless, Gasless Participation: Any Ethereum-compatible keypair can participate, with relayer infrastructure providing gas abstraction, democratizing access and minimizing operational friction.
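
The commit-reveal step in the list above can be illustrated with a minimal off-chain sketch. It is not the contracts' actual encoding: the payload layout, the basis-point representation, the `agent` string, and the function names are all assumptions made for illustration, and SHA-256 stands in for whatever hash the Solidity contracts use.

```python
import hashlib
import secrets


def make_commitment(forecasts_bps: list[int], salt: bytes, agent: str) -> bytes:
    """Hash the forecast vector with a secret salt and the agent id.

    Hypothetical layout: agent id, then salt, then each forecast as a
    2-byte big-endian basis-point value (0..10000).
    """
    if any(not 0 <= f <= 10_000 for f in forecasts_bps):
        raise ValueError("forecasts must be basis points in [0, 10000]")
    body = b"".join(f.to_bytes(2, "big") for f in forecasts_bps)
    return hashlib.sha256(agent.encode() + salt + body).digest()


def verify_reveal(commitment: bytes, forecasts_bps: list[int],
                  salt: bytes, agent: str) -> bool:
    """Recompute the hash from the revealed values and compare it to the commitment."""
    expected = make_commitment(forecasts_bps, salt, agent)
    return secrets.compare_digest(commitment, expected)


# Commit phase: only the hash is published, so peers cannot copy the forecast.
salt = secrets.token_bytes(32)
forecasts = [6200, 1500, 8800]  # 0.62, 0.15, 0.88 expressed in basis points
commitment = make_commitment(forecasts, salt, "agent-1")

# Reveal phase: the vector and salt are disclosed and checked against the hash.
honest_ok = verify_reveal(commitment, forecasts, salt, "agent-1")
tampered_ok = verify_reveal(commitment, [6300, 1500, 8800], salt, "agent-1")
```

Binding the agent identifier into the hash means a copied commitment cannot be revealed under another identity, and the random salt keeps the short forecast vector from being brute-forced from the published hash.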

Statistical and Methodological Innovations

Foresight Arena’s evaluative backbone is a dual-metric system:

Brier Score: A classic strictly proper scoring rule quantifying mean squared error between forecast and outcome; its expected value is minimized only by reporting the true outcome probability.
Alpha Score: Defined as the difference between the market-consensus Brier Score (using Polymarket's mid-price as $b_i$) and the agent's Brier Score on the same market. $\alpha = B_{\text{market}} - B_{\text{agent}}$ captures informational advantage stripped of trading/timing confounds.
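
As a small worked illustration of the two metrics, the sketch below computes per-market Brier and Alpha values for binary outcomes. The function names and the example numbers are illustrative assumptions, not taken from the paper's codebase.

```python
def brier(forecast: float, outcome: int) -> float:
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (forecast - outcome) ** 2


def alpha(agent_forecast: float, market_price: float, outcome: int) -> float:
    """Per-market Alpha: market Brier minus agent Brier (positive = agent beat the market)."""
    return brier(market_price, outcome) - brier(agent_forecast, outcome)


def mean_alpha(rows: list[tuple[float, float, int]]) -> float:
    """Average Alpha over (agent_forecast, market_price, outcome) rows."""
    return sum(alpha(a, m, y) for a, m, y in rows) / len(rows)


# The agent is bolder than the market (0.75 vs 0.60) on the same question.
alpha_if_yes = alpha(0.75, 0.60, 1)  # 0.16 - 0.0625 = 0.0975
alpha_if_no = alpha(0.75, 0.60, 0)   # 0.36 - 0.5625 = -0.2025
```

The asymmetry between the two cases shows why boldness only pays when it is justified: moving away from the market price earns a modest gain when right and a larger loss when wrong.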

Importantly, the Alpha score is analytically decomposed (via Murphy decomposition) into:

Resolution Gain: The agent’s improvement in sorting outcomes relative to the base rate.
Reliability Gap: The agent’s net calibration error advantage versus market consensus.

This explicit decomposition diagnoses what drives an agent’s edge: neither overconfidence nor blind mimicry of the market can produce high Alpha.
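
The decomposition can be sketched with the classical Murphy identity, Brier = Reliability - Resolution + Uncertainty, computed by grouping on distinct forecast values. The helper below and its toy data are illustrative assumptions; the paper's own estimator and binning scheme may differ.

```python
from collections import defaultdict


def murphy_decomposition(forecasts, outcomes, decimals=2):
    """Return (reliability, resolution, uncertainty) for binary outcomes.

    Forecasts are grouped by value rounded to `decimals`; the identity
    Brier = reliability - resolution + uncertainty is exact when every
    forecast in a group is identical.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for f, y in zip(forecasts, outcomes):
        groups[round(f, decimals)].append(y)
    reliability = sum(len(ys) * (f - sum(ys) / len(ys)) ** 2
                      for f, ys in groups.items()) / n
    resolution = sum(len(ys) * (sum(ys) / len(ys) - base_rate) ** 2
                     for ys in groups.values()) / n
    return reliability, resolution, base_rate * (1 - base_rate)


outcomes = [1, 1, 0, 0, 1, 0, 1, 0]
market = [0.5] * 8                                # uninformative consensus
agent = [0.8, 0.8, 0.2, 0.2, 0.8, 0.2, 0.2, 0.8]  # sorts outcomes, mostly correctly

rel_a, res_a, unc = murphy_decomposition(agent, outcomes)
rel_m, res_m, _ = murphy_decomposition(market, outcomes)

resolution_gain = res_a - res_m   # agent's extra ability to separate outcomes
reliability_gap = rel_m - rel_a   # market calibration error minus agent's
alpha_from_parts = resolution_gain + reliability_gap

brier_agent = sum((f - y) ** 2 for f, y in zip(agent, outcomes)) / len(outcomes)
brier_market = sum((f - y) ** 2 for f, y in zip(market, outcomes)) / len(outcomes)
alpha_direct = brier_market - brier_agent
```

Because both forecasters are scored on the same outcomes, the uncertainty term cancels, so Alpha equals the resolution gain plus the reliability gap.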

A precise variance formula for per-market Alpha is derived, leading directly to closed-form power analyses. For example, an agent with edge $\alpha^* = 0.02$ over market consensus requires $\sim 350$ resolved markets for 80% detection power. This enables principled experiment sizing for future benchmarking and supports statistically honest reporting of significance.
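
The sample-size logic can be sketched with the standard normal-approximation formula for a one-sided test of a mean. This is not the paper's closed-form derivation: the per-market standard deviation `sigma`, the one-sided 5% significance level, and the function name are assumptions, and `sigma = 0.15` is a placeholder rather than a value taken from the paper.

```python
from math import ceil
from statistics import NormalDist


def required_markets(edge: float, sigma: float,
                     power: float = 0.80, significance: float = 0.05) -> int:
    """Markets needed to detect a mean Alpha of `edge` with a one-sided z-test.

    n = ((z_{1-significance} + z_{power}) * sigma / edge)^2, rounded up.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - significance)
    z_beta = std_normal.inv_cdf(power)
    return ceil(((z_alpha + z_beta) * sigma / edge) ** 2)


SIGMA = 0.15  # assumed per-market standard deviation of Alpha (placeholder)
n_for_2pct = required_markets(edge=0.02, sigma=SIGMA)
n_for_1pct = required_markets(edge=0.01, sigma=SIGMA)
ratio = n_for_1pct / n_for_2pct
```

Whatever value `sigma` takes, the required sample size scales with the inverse square of the edge, which is why halving the edge from 0.02 to 0.01 roughly quadruples the number of markets needed.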

System Architecture

Foresight Arena’s deployment comprises:

Smart Contracts: PredictionArena and RoundManager contracts manage round lifecycles, verifiable commit-reveal logic, and outcome-triggered scoring.
Agent Suite: Reference implementations include a fully automated LLM agent (supporting OpenAI, Anthropic, Google, xAI, and Zhipu models via OpenRouter), as well as a uniform-random baseline for sanity checks.
Oracles: Live market questions and resolutions are synchronized from Polymarket and Gnosis CTF, composing the question universe and outcomes without centralized curation.
Relayer API: Relayer infrastructure handles transaction broadcasting for agents without native tokens, enforcing rate limits to prevent Sybil attacks.
Frontend Leaderboard: Live and historic performance, traceable reasoning, and cumulative scores are accessible via an open, public dashboard.

Simulation Study and Empirical Insights

Because live on-chain results are deferred to a future revision, the paper’s empirical demonstration is a deterministic, seed-controlled Monte Carlo simulation, calibrated against literature Brier-score baselines for LLMs and human crowds (Karger et al., 2024, Schoenegger et al., 2023, Halawi et al., 2024):

Random Baseline: Consistently negative Alpha ($\sim -0.14$), easily distinguishable from any skilled agent, even at modest sample sizes.
Frontier LLM Agents: In simulation, reach Alphas in $[+0.003, +0.005]$, statistically indistinguishable from market consensus at $n \sim 350$, congruent with theoretical expectations regarding the high efficiency of real-money prediction markets.
Market-Tracking Agents: Agents echoing market prices with added noise exhibit negative Alpha, demonstrating the penalty for imitative or noisy consensus-tracking strategies.
Murphy Decomposition: The simulation validates that resolution improvements, not mere calibration tweaks, are necessary for positive Alpha when the market itself is already highly calibrated.
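
A toy version of such a seed-controlled simulation is sketched below. It is not the paper's simulator: it assumes the market mid-price equals the true probability, draws that probability uniformly from [0.05, 0.95], and models the market-tracking agent as the market price plus Gaussian noise. All parameter values and names are illustrative.

```python
import random


def simulate(seed: int = 0, n_markets: int = 20_000,
             tracker_noise: float = 0.10) -> dict[str, float]:
    """Mean Alpha of a uniform-random agent and a noisy market tracker."""
    rng = random.Random(seed)  # fixed seed makes every run reproducible
    totals = {"random": 0.0, "tracker": 0.0}
    for _ in range(n_markets):
        p = rng.uniform(0.05, 0.95)        # assumed true probability = market price
        y = 1 if rng.random() < p else 0   # resolved binary outcome
        market_brier = (p - y) ** 2
        random_forecast = rng.random()
        noisy = p + rng.gauss(0.0, tracker_noise)
        tracker_forecast = min(1.0, max(0.0, noisy))  # clip to [0, 1]
        totals["random"] += market_brier - (random_forecast - y) ** 2
        totals["tracker"] += market_brier - (tracker_forecast - y) ** 2
    return {name: total / n_markets for name, total in totals.items()}


results = simulate(seed=0)
```

Under these assumptions both agents end up with negative mean Alpha: the random agent by a wide margin, and the tracker by roughly the variance of its added noise, since noise on top of a calibrated price can only add squared error in expectation.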

Category-wise analysis reveals heterogeneity in achievable Alpha across domains (e.g., Crypto, Politics, Sports), reflecting differential signal-to-noise ratios and model strengths.

Practical and Theoretical Implications

Verifiable AI Credentials: Foresight Arena makes agent track records as auditable as blockchain transactions, eliminating reliance on organizer integrity. This enables trustworthy agent comparison and establishes the infrastructure for regulatory and marketplace accountability.
Metric Unconfoundedness: Proper scoring rules (Brier, Alpha) isolate predictive skill directly, untainted by trading style or market frictions, in contrast to PnL-based benchmarks.
Ensemble Methods: The framework amplifies the value of ensembling, as in the LLM ensemble work of Schoenegger et al. (2023): ensemble forecasters are provably easier to calibrate and more discriminative, aligning directly with the path to statistically detectable Alpha.
Design Guidance for AI Agents: To produce positive Alpha, agents must exploit fresh, non-market-embedded signals and optimize boldness only where justified by confidence, emphasizing the critical balance encapsulated by the Murphy decomposition.
Statistical Honesty: The framework makes explicit the sample-size–effect-size–power tradeoff, highlighting the current limits of ranking frontier LLMs at fine-grained Alpha scales with short-horizon evaluations.

Limitations and Future Work

The authors recognize and address several caveats:

All results before deployment are simulation-based; live on-chain rounds are forthcoming.
Market selection, though auditable, can influence comparative ranking; further work on randomization and domain expansion is needed.
Analysis tools assume per-market outcome independence; cluster-robust methods may be required for correlated markets.
Statistical power, though significantly improved, remains limiting for small Alphas; conditional scoring restricted to bold, anti-consensus predictions is proposed as a future statistical accelerator.
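
To illustrate the independence caveat, the sketch below compares a naive standard error of mean Alpha with a cluster-robust one that sums residuals within groups of correlated markets. The estimator is the textbook cluster-robust variance of a sample mean with a G/(G-1) small-sample correction; the grouping labels and Alpha values are made up for illustration, and the paper does not specify which cluster-robust method it has in mind.

```python
from collections import defaultdict
from math import sqrt


def naive_se(values: list[float]) -> float:
    """Standard error of the mean assuming independent observations."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1) / n)


def cluster_robust_se(values: list[float], clusters: list[str]) -> float:
    """Standard error of the mean allowing arbitrary correlation within clusters."""
    n = len(values)
    mean = sum(values) / n
    residual_sums = defaultdict(float)
    for v, c in zip(values, clusters):
        residual_sums[c] += v - mean  # residuals are summed within each cluster
    g = len(residual_sums)
    return sqrt(g / (g - 1) * sum(s ** 2 for s in residual_sums.values())) / n


# Hypothetical per-market Alphas; markets in the same event move together.
alphas = [0.10, 0.10, -0.05, -0.05, 0.20, 0.20]
events = ["election", "election", "fed", "fed", "final", "final"]

se_naive = naive_se(alphas)
se_clustered = cluster_robust_se(alphas, events)
```

When markets within an event are positively correlated, the clustered standard error is larger than the naive one, so a significance claim based on the independence assumption would be overstated.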

Extensions include token-incentivized staking, ensemble and copy-trading overlays, and more granular, category-conditional Alpha estimation.

Conclusion

Foresight Arena operationalizes a principled, on-chain, and open benchmark for evaluating the forecasting capabilities of AI agents. By formalizing and deploying strictly proper incentive structures, unifying live market data with trustless, on-chain resolution, and making statistical power constraints explicit, the benchmark rigorously elevates the standards for both academic and commercial AI evaluation. As rounds accumulate and agent diversity expands, the platform is positioned to become a definitive arbiter of probabilistic reasoning skill in autonomous systems, with broad implications for both AI deployment and policy.
