MarketBench: AI Market Evaluation

Updated 2 July 2026

MarketBench is a suite of frameworks that rigorously evaluates LLM agents' decision-making and economic performance in simulated market environments.
It incorporates realistic constraints such as budget limits, asset scarcity, and adversarial attacks to test auction, trading, and calibration strategies.
Empirical studies reveal winner-take-most outcomes, miscalibration in self-assessments, and varying operational efficiency across diverse market simulation scenarios.

MarketBench is a term applied to a family of benchmarking frameworks that rigorously evaluate the decision-making and economic capabilities of artificial agents—particularly LLMs—in simulated or real market environments. Distinct MarketBench variants target economic competition, agent calibration and self-assessment, market simulation, backtesting of trading strategies, and robust aggregation in decentralized marketplaces. The frameworks emphasize measurement under realistic constraints such as limited budgets, asset scarcity, adversarial attacks, and operationally meaningful metrics. Each instantiation enforces reproducible, transparent protocols for performance assessment in both isolated reasoning and agent competition settings.

1. Market-Bench: Economic and Trade Competition among LLM Agents

Market-Bench, as introduced in "Market-Bench: Benchmarking LLMs on Economic and Trade Competition" (Zheng et al., 7 Apr 2026), formalizes a controlled multi-agent supply chain market. The setup comprises $m$ retailer agents $\mathcal{A} = \{A_1, \ldots, A_m\}$, $n$ distinct merchandise types $\mathcal{X} = \{1, \ldots, n\}$, and a discrete time horizon $T$. At each step $t$, each agent $A_i$ manages:

Funds $F_i(t) \in \mathbb{R}_{\ge 0}$
Inventory $I_{i,x}(t) \in \mathbb{Z}_{\ge 0}$ for each $x \in \mathcal{X}$

Stage A: Procurement via Auction

Agents bid on available inventory through budget-constrained, multi-unit auctions:

Each bid $b_{i,x}(t) = (q_{i,x}(t), p^{\text{bid}}_{i,x}(t))$ denotes desired quantity and price per unit
Bids must satisfy $\sum_{x \in \mathcal{X}} q_{i,x}(t)\, p^{\text{bid}}_{i,x}(t) \le F_i(t)$ (budget constraint)
Goods are allocated by highest price; updates to $F_i(t)$ and $I_{i,x}(t)$ are made based on awarded quantities
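
The procurement stage can be illustrated with a minimal Python sketch for a single merchandise type. The names `Bid` and `run_auction`, the pay-as-bid payment rule, and the tie-breaking by submission order are assumptions made for illustration; the source specifies only the budget constraint and allocation by highest price.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    agent: str
    quantity: int   # desired quantity q_{i,x}(t)
    price: float    # bid price per unit p^bid_{i,x}(t)


def run_auction(bids, supply, funds):
    """Allocate `supply` units of one merchandise type to the highest bids.

    Assumptions (not from the source): winners pay their own bid price,
    and ties keep submission order because `sorted` is stable.
    """
    new_funds = dict(funds)
    # Budget constraint: discard bids whose total cost exceeds the agent's funds.
    affordable = [b for b in bids if b.quantity * b.price <= funds[b.agent]]
    awards = {}
    remaining = supply
    for bid in sorted(affordable, key=lambda b: b.price, reverse=True):
        if remaining == 0:
            break
        won = min(bid.quantity, remaining)
        remaining -= won
        awards[bid.agent] = awards.get(bid.agent, 0) + won
        new_funds[bid.agent] -= won * bid.price  # update funds F_i(t)
    return awards, new_funds


bids = [Bid("A", 5, 10.0), Bid("B", 4, 12.0), Bid("C", 10, 20.0)]
awards, funds_after = run_auction(bids, supply=6, funds={"A": 100.0, "B": 100.0, "C": 100.0})
```

In this example agent C's bid is rejected because its total cost (200) exceeds its funds, so B is served first and A receives the two remaining units.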

Stage B: Retail Competition and Persona-Gated Attention

Agents set retail prices and generate marketing slogans for their merchandise. Buyers, each with a latent persona and a hidden attention coefficient, sample a consideration set via persona-gated attention:

Similarity score: a semantic match between each agent's slogan and the buyer's persona
Attention: the buyer's attention coefficient, combined with the similarity score, determines which offers enter the consideration set
Buyers select products based on minimum price among considered, in-stock offers
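
A rough sketch of this buyer-side step follows. The gating rule used here (an offer is considered with probability `min(1, alpha * similarity)`), the use of cosine similarity on embedding vectors, and all names such as `choose_offer` and `slogan_vec` are hypothetical; the benchmark's exact attention formula is not reproduced here.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def choose_offer(persona, alpha, offers, rng):
    """Return the cheapest in-stock offer that passes the attention gate.

    Hypothetical gate: an offer is considered with probability
    clip(alpha * similarity, 0, 1), where alpha is the buyer's hidden
    attention coefficient.
    """
    considered = []
    for offer in offers:
        sim = cosine(persona, offer["slogan_vec"])
        p_attend = min(1.0, max(0.0, alpha * sim))
        if offer["stock"] > 0 and rng.random() < p_attend:
            considered.append(offer)
    if not considered:
        return None
    # Buyers pick the minimum price among considered, in-stock offers.
    return min(considered, key=lambda o: o["price"])


offers = [
    {"agent": "A", "slogan_vec": [1.0, 0.0], "price": 9.0, "stock": 3},
    {"agent": "B", "slogan_vec": [1.0, 0.0], "price": 7.0, "stock": 0},
    {"agent": "C", "slogan_vec": [2.0, 0.0], "price": 8.0, "stock": 1},
]
picked = choose_offer([1.0, 0.0], alpha=5.0, offers=offers, rng=random.Random(0))
ignored = choose_offer([1.0, 0.0], alpha=0.0, offers=offers, rng=random.Random(0))
```

With a large attention coefficient every in-stock offer is considered and the buyer picks C, since B is cheaper but out of stock; with a coefficient of zero no offer is considered at all.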

Automatic Metrics

Economic: Cumulative profit, Net profit margin (NPM)
Operational: Stockout rate, Inventory Efficiency Index (IEI)
Semantic: Mean Match Score (MMS, cosine similarity between slogans and buyer personas)

Empirical Findings

Experiments with 20 LLM-driven retailers show pronounced dispersion. Only a small subset (e.g., Gemini 2.5 Pro, Gemini 2.5 Flash) achieves substantial cumulative profits and margins (NPM ~0.17–0.19); most agents hover near break-even, despite similar MMS. The environment induces "winner-take-most" outcomes and a multi-winner oligopoly: the Gini coefficient rises from 0.07 to 0.21, and CR4 (top-4 market share) grows from 23% to 33%. Early bidding success (BidEfficiency, StockoutRate) is highly correlated with long-run profit, whereas marketing semantic alignment shows only a weak association with profit. The benchmark enables systematic study of quantitative, operational, and semantic interactions in LLM market behaviors (Zheng et al., 7 Apr 2026).
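
The two concentration measures quoted above have standard textbook definitions, sketched below in Python. The function names and the choice of non-negative inputs (for example, revenue or market shares) are assumptions; the source does not state which quantity the benchmark feeds into each measure.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


def concentration_ratio(shares, k=4):
    """CR-k: combined share of the k largest participants."""
    total = sum(shares)
    return sum(sorted(shares, reverse=True)[:k]) / total


equal_market = [1.0] * 20
cr4_equal = concentration_ratio(equal_market)  # 4 of 20 equal retailers
```

For 20 equally sized retailers CR4 is 0.20, which gives a reference point for the reported 23% and 33% figures.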

2. Agent Calibration and Market-Based Allocation Efficiency

"MarketBench: Evaluating AI Agents as Market Participants" (Fradkin et al., 26 Apr 2026) targets the calibration and self-assessment of AI agents within market-based allocation. Using a 93-task slice of SWE-bench Lite, six contemporary LLMs estimate their probability of success and their cost for each task, producing JSON-formatted self-reports for market-based auctions:

Sealed-bid, second-price auctions allocate tasks based on break-even bids calculated from these self-reports
Primary evaluation metrics include Brier score (probabilistic calibration), expected calibration error (ECE), and token-usage forecast ratios
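
The two calibration metrics named above can be computed as in the following sketch. The equal-width binning with 10 bins is a common default and an assumption here; the source does not state the binning used in the paper.

```python
def brier_score(probs, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE with equal-width bins: weighted mean |confidence - accuracy|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        confidence = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / n * abs(confidence - accuracy)
    return ece


stated = [0.9, 0.9, 0.6, 0.6]   # illustrative self-reported success probabilities
passed = [1, 0, 1, 1]           # illustrative realized task outcomes
brier = brier_score(stated, passed)
ece = expected_calibration_error(stated, passed)
```

The inputs are made-up values chosen for easy arithmetic, not data from the study.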

Large-scale miscalibration is observed: real pass rates are 75–81%, but the stated success probability varies from 61% to 93%. All six models underestimate cost (median estimated/realized token ratio ≈ 0.02 globally), leading to auctions that significantly underperform "oracle" allocation (e.g., GPT-5.2 falls short of the \$0.385 oracle value). Introducing calibration priors ("self-knowledge cards") improves Brier score and ECE (Brier: 0.1835 → 0.1693; ECE: 0.1065 → 0.0616), but narrows the profit gap only marginally. A toy live market-based router boosts success rates by ~10pp over solo operation, but still lags strong external scaffolds by ~16pp; model diversity, not auction dynamics, is the main source of gains (Fradkin et al., 26 Apr 2026).
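
The sealed-bid, second-price rule used for task allocation can be sketched generically as follows. Treating the auction as a procurement (lowest bid wins) auction is an assumption, as is the function name; the `lowest_wins` flag covers the opposite convention, and the paper's break-even bid formula is not reproduced.

```python
def second_price_auction(bids, lowest_wins=True):
    """Sealed-bid second-price auction over a mapping of agent -> bid.

    The best bid wins and the price is set by the second-best bid.
    Assumption: lowest_wins=True models agents quoting the payment they
    need to take on a task; set it to False for a highest-bid-wins auction.
    """
    if len(bids) < 2:
        raise ValueError("a second-price auction needs at least two bids")
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=not lowest_wins)
    winner = ranked[0][0]
    price = ranked[1][1]
    return winner, price


# Illustrative break-even bids for one task (made-up values).
task_bids = {"model_a": 0.30, "model_b": 0.12, "model_c": 0.20}
winner, price = second_price_auction(task_bids)
```

Because the price depends on the runner-up's bid rather than the winner's own, a miscalibrated self-report changes who wins the task without directly changing the price the winner faces.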

3. Comparative MarketBench Benchmarks in Trading, Generative Finance, and Gradient Marketplaces

Multiple derivatives of the "MarketBench" concept have been proposed:

PolyBench/MarketBench (Prediction Markets): PolyBench (Cheng et al., 3 Apr 2026) evaluates LLMs' forecasting and trading ability under live, timestamp-locked prediction market data (Polymarket). Seven LLMs are benchmarked using directional accuracy, confidence-weighted return (CWR), annualized percentage yield (APY), Sharpe ratio, and realistic order-book execution. Only two models, MiMo-V2-Flash (CWR=17.6%) and Gemini-3-Flash (CWR=6.2%), are systematically profitable; other models lose money due to overconfidence and calibration failures, despite strong surface-level fluency. The evaluation protocol explicitly prevents look-ahead contamination (Cheng et al., 3 Apr 2026).
Quantitative Trading and Market Dynamics: In "Market-Bench: Evaluating LLMs on Introductory Quantitative Trading and Market Dynamics" (Srivastava et al., 13 Dec 2025), LLMs must generate executable Python backtests for canonical quantitative trading tasks (scheduled trading, pairs trading, delta hedging). Pass@k and mean absolute error (MAE) distinguish structural code correctness from numerical fidelity. Models reach a high pass@3 on simple strategies (0.80), but wide error dispersion and frequent logical failures on more complex tasks indicate persistent weaknesses in operational reasoning (Srivastava et al., 13 Dec 2025).
Generative Finance and Order Book Modeling: LOB-Bench (Nagy et al., 13 Feb 2025) (cited as a "MarketBench" suite for generative market modeling) focuses on realism in LOB message generation, using Wasserstein distance, KL divergence, Maximum Mean Discrepancy (MMD), feature-level statistics (spread, volume, imbalance), and market impact functions. cGANs outperform autoregressive and parametric models on distributional closeness as well as adversarial discrimination metrics. This framework is positioned as a central testbed for generative LOB modeling (Nagy et al., 13 Feb 2025).
Gradient Marketplace Robustness: The framework in (Song et al., 6 Sep 2025) addresses market-centric properties in distributed gradient marketplaces: cost-of-convergence, seller reward fairness (Gini and Malicious Selection Rate), and selection dynamics (entropy, stability). Simulations expose vulnerabilities to Sybil attacks, with adaptive adversaries compromising both fairness and perceived efficiency. Payment dispersion and malicious selection rates sharply rise under attack scenarios, demonstrating the need for robust trust and anomaly-detection mechanisms (Song et al., 6 Sep 2025).
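
For the backtest-generation benchmark above, pass@k and MAE can be computed as in this sketch. It uses the unbiased pass@k estimator that is standard in code-generation evaluation; whether the paper uses this exact estimator is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n generated samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct sample.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_absolute_error(predicted, reference):
    """Average absolute gap between backtest outputs and reference values."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


score = pass_at_k(n=4, c=1, k=2)
mae = mean_absolute_error([1.0, 2.0], [1.5, 1.0])
```

Keeping the two metrics separate mirrors the distinction drawn in the text: a backtest can execute successfully (counting toward pass@k) while still producing numbers far from the reference (large MAE).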

4. Core Evaluation Protocols and Metrics

MarketBench implementations share emphasis on protocol reproducibility, statistical rigor, and comprehensive logging. Key methodological elements include:

Timestamp-locking and contamination prevention: All inference and market decisions are made against frozen-in-time information, with ground-truth outcomes unavailable at decision time (e.g., PolyBench (Cheng et al., 3 Apr 2026))
Economic and operational metrics: Profit, margin, inventory efficiency, market share/inequality measures (Gini, Theil, CR4, HHI)
Calibration diagnostics: Brier score, ECE, realized vs. forecast accuracy/costs (Fradkin et al., 26 Apr 2026)
Behavioral segregation: Agent heterogeneity, winner-take-most vs. competitive dynamics, role of procurement in long-term profitability (Zheng et al., 7 Apr 2026)
Agent and environment diversity: Wide range of model rosters, including state-of-the-art open- and closed-source agents

Evaluation domains vary—including synthetic economic games, live market replay, and distributed computing—but all enforce transaction- and information-realistic mechanisms to isolate agent strengths and bottlenecks.
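
Two of the inequality measures listed above, HHI and the Theil index, are sketched here using their standard definitions. The Theil T variant is assumed, since the text does not say which Theil index is meant, and the function names are illustrative.

```python
import math


def hhi(shares):
    """Herfindahl-Hirschman Index on normalized shares (1/n up to 1)."""
    total = sum(shares)
    return sum((s / total) ** 2 for s in shares)


def theil_t(values):
    """Theil T index of non-negative values (0 = perfectly equal).

    Zero entries contribute nothing, following the convention 0 * log(0) = 0.
    """
    n = len(values)
    mean = sum(values) / n
    return sum((x / mean) * math.log(x / mean) for x in values if x > 0) / n


equal_hhi = hhi([25, 25, 25, 25])
monopoly_theil = theil_t([0, 0, 0, 4])
```

HHI falls to 1/n when n participants are equally sized, while Theil T reaches log(n) when a single participant holds everything, so both give bounded reference points for comparing market runs.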

5. Insights, Limitations, and Distinguishing Features

MarketBench frameworks consistently identify key bottlenecks:

Resource management under scarcity is the gating skill in competitive supply chain and trading settings; early-stage auction efficacy is highly predictive of long-term agent success (Zheng et al., 7 Apr 2026)
Semantic (language-driven) persuasion under persona gating does not reliably translate into economic returns; slogans rapidly converge to generic optima (Zheng et al., 7 Apr 2026)
Calibration—metacognitive accuracy in self-assessment—is the primary limitation for market allocation efficiency among LLM agents. Over- and underconfidence drive inefficiencies in auction-based task routing (Fradkin et al., 26 Apr 2026)
Simple behavioral metrics (e.g., pass rates, match scores) are insufficient: Quantitative performance often diverges dramatically from semantic or code-level correctness, especially as market dynamics or adversarial manipulation are introduced (Srivastava et al., 13 Dec 2025, Song et al., 6 Sep 2025)

Limitations commonly cited include domain restriction (e.g., SWE-bench slicing or specific economic settings), the use of simplified auction or payment rules, and reliance on synthetic or re-simulated market data. Extensibility to richer agent representations (reputation, explicit negotiation, tool-use) or broader domains (structured planning, open-ended execution) is actively explored in ongoing research (Cheng et al., 3 Apr 2026, Nagy et al., 13 Feb 2025).

6. Relationship to MarketBench for Portfolio Construction

The term "MarketBench" is also found in a different context: as a closed-form, non-iterative system for constructing long-only benchmark portfolios and outperformance overlays in equity markets (Kakushadze et al., 2018). In this lineage, MarketBench refers to algorithms and code implementing:

Multifactor, multilevel clustering risk models: factor model covariance, cluster- and market-level variance decomposition
Positive-definite, long-only portfolio weights constructed via Sharpe-optimal formulas, bypassing principal component analysis (PCA) or iterative quadratic programming
Explicit, bottom-up propagation of idiosyncratic and systemic risk adjustments, ensuring portfolio positivity and sector diversification by construction

The original MarketBench in this sense serves as both a practical algorithmic benchmark for quantitative asset management and an analytical illustration of modern portfolio theory in the multifactor regime (Kakushadze et al., 2018).

Collectively, MarketBench frameworks serve as a rigorous standard for evaluating the intersection of artificial agent intelligence, economic optimization, and market-driven interaction—spanning resource allocation, market simulation, code generation for quantitative finance, and federated system robustness. Their reproducible protocols and multidimensional metrics have become central references for empirical research into market-capable AI.