- The paper presents MarketBench, a benchmark that evaluates AI agents' self-assessment and calibration for market-based task allocation.
- The study uses calibration tasks and auction simulations to quantify misestimations in success probabilities and token usage among top LLMs.
- The findings highlight that while market-inspired coordination improves pass rates, poor bid calibration remains a critical bottleneck for efficient decentralized AI systems.
MarketBench: A Benchmark for AI Agent Market Participation and Self-Assessment
Introduction and Motivation
The paper "MarketBench: Evaluating AI Agents as Market Participants" (2604.23897) introduces MarketBench, a benchmark for evaluating whether AI agents, specifically LLMs, are ready to participate in market-based coordination mechanisms. The central research question is whether current agents can give calibrated self-assessments of their success probability and resource usage on software engineering tasks, so that their bids in a market setting aggregate useful decentralized information. The study is motivated by the shift from static, centrally orchestrated multi-agent systems toward decentralized, market-style mechanisms for efficiently allocating heterogeneous cognitive labor.
Conceptual Framework
The paper establishes a formal model for market-based task allocation among agents with private, task-specific productivity and cost information. In this model, agents generate bids based on their ex ante probability of success and anticipated cost, under incentive-compatible mechanisms (e.g., second-price auctions with reserve prices). In theory, the market paradigm dominates fixed or parallel allocation schemes because it decentralizes decisions and aggregates local information through prices. The model therefore depends on meaningful, task-level self-assessment signals from agents, which form the backbone of any effective market allocation.
MarketBench: Benchmark Design and Protocol
MarketBench operationalizes agent evaluation in two primary task families on SWE-bench Lite, a curated corpus of GitHub-issue-based, real-world software engineering assignments:
- Calibration Tasks: Agents must, prior to execution, estimate their probability of success in a single attempt as well as expected token usage. These estimates are evaluated against realized binary task outcomes and true token consumption, enabling a direct assessment of calibration via metrics such as Brier score and expected calibration error.
- Auction Tasks: The elicited self-assessments are mapped to procurement bids in a reserve-price auction simulation. Payoffs and allocation efficiency are computed both in expectation and realized terms, benchmarking against the oracle outcome with perfect information.
Six top-tier LLM families are benchmarked to reveal aggregate and model-specific patterns in self-estimation and competitive behavior.
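The calibration metrics named above can be sketched in a few lines of Python. This is a minimal illustration using the standard textbook definitions of the Brier score, Brier skill against a base-rate forecaster, and expected calibration error with equal-width bins; the function names and the choice of ten bins are assumptions, not the paper's evaluation code.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Skill relative to always forecasting the realized base rate.

    Positive means better than the base-rate forecaster, negative means worse.
    """
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier_score([base_rate] * len(outcomes), outcomes)
    if reference == 0:
        raise ValueError("skill is undefined when all outcomes are identical")
    return 1.0 - brier_score(probs, outcomes) / reference


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average gap between mean confidence and pass rate per bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, o))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += len(members) / total * abs(confidence - accuracy)
    return ece
```

For a hypothetical agent that states 0.9 on four tasks and passes three of them, the Brier score is 0.21, the base-rate forecaster scores 0.1875, and the skill is therefore negative: the overconfident forecast does worse than simply quoting the pass rate.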
Empirical Results
All six LLMs display significant miscalibration in both reported success probability and token usage:
- Success Probability: The models have realized pass rates in a narrow band (75.3%–80.6%), but their mean stated confidences vary widely (Gemini 3 Pro Preview is overconfident at 92.9%, GPT-5-mini is underconfident at 61.4%). Only variants of Claude (Opus 4.5 and Sonnet 4.5) achieve marginally positive Brier skill over base-rate forecasters.
- Token Usage: Forecasts of token consumption are severely understated, often by an order of magnitude.
Incorporating a "self-knowledge card"—a summary of model-specific prior performance—into prompts yields a modest improvement in calibration, both in terms of Brier score and estimated/actual token ratio, but leaves a substantial gap to full-information performance.
Auction Efficiency
Market-based auction outcomes using model-generated self-estimates systematically underperform oracle allocation benchmarks. Models earn substantially lower expected and realized profit per task; for example, realized auction profit for GPT-5.2 is $0.006$/task compared to an oracle profit of $0.385$/task. Aggressive but miscalibrated bidding (e.g., Gemini) can distort allocation without generating efficient outcomes. Prompt-level prior information marginally reduces oracle gaps but does not resolve core deficiencies in taskwise self-assessment.
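To make the gap between self-reported and oracle outcomes concrete, the sketch below runs a second-price procurement auction with a reserve price. It rests on several assumptions that are not taken from the paper: each agent asks its stated cost divided by its stated success probability, the winner is paid only on success, the payment is the second-lowest eligible ask (or the reserve if there is only one), and the oracle is the same auction run on true values. All names and numbers are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Agent:
    name: str
    stated_p: float      # self-assessed success probability (must be > 0)
    stated_cost: float   # self-assessed cost of one attempt
    true_p: float        # actual success probability
    true_cost: float     # actual cost of one attempt

    def ask(self) -> float:
        # Assumed bid rule: break-even price when paid only on success.
        return self.stated_cost / self.stated_p


def run_auction(agents: List[Agent], reserve: float) -> Tuple[Optional[Agent], float]:
    """Lowest ask at or below the reserve wins; payment is the next-lowest ask."""
    eligible = sorted(
        (a for a in agents if a.ask() <= reserve), key=lambda a: a.ask()
    )
    if not eligible:
        return None, 0.0
    winner = eligible[0]
    payment = eligible[1].ask() if len(eligible) > 1 else reserve
    return winner, payment


def expected_profit(agent: Agent, payment: float) -> float:
    """Profit in expectation under the agent's true probability and cost."""
    return agent.true_p * payment - agent.true_cost


def with_perfect_information(agent: Agent) -> Agent:
    """Oracle variant: the agent bids as if it knew its true values."""
    return replace(agent, stated_p=agent.true_p, stated_cost=agent.true_cost)
```

With a hypothetical overconfident agent (stated 0.95 and 0.02, true 0.75 and 0.20) and a calibrated one (0.80 and 0.10 for both), the overconfident agent underbids, wins at a payment of 0.125, and earns a negative expected profit. Under perfect information the calibrated agent wins instead and earns a positive one, which mirrors the distortion described above.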
Market Scaffolding Experiment
An illustrative market-inspired routing scaffold is deployed to test market-based multi-agent coordination. The scaffold executes up to two agents per task, using self-assessed success probabilities and costs for routing. The market scaffold reaches a 58% pass rate, versus 48% for the best-performing solo model, GPT-5.2, but remains below both a strong external scaffold (74%) and the oracle (84%). Gains are mainly attributable to the diversity of the model pool rather than to market-based allocation per se. The experiment also reveals increased resource usage in the market condition due to multi-agent inspection and retrying.
Matched centralized routers using the same workers reach similar performance (54% pass rate), indicating the marginal impact of the current bidding mechanism is muted due to poor bid calibration. Interventions providing historical performance priors ("hard-prior" bidding) further enhance market scaffold performance, illustrating the importance of self-knowledge signals for future markets.
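The routing idea can be sketched as follows. This is a minimal illustration, not the paper's scaffold: it assumes bids are ranked by stated cost per expected success, that at most two agents are tried in sequence, and that "hard-prior" bidding means overwriting the stated probability with a historical pass rate. The `attempt` callback and all names are hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Bid(NamedTuple):
    agent: str
    stated_p: float     # self-assessed success probability (must be > 0)
    stated_cost: float  # self-assessed cost of one attempt


def route_task(
    bids: List[Bid], attempt: Callable[[str], bool], max_agents: int = 2
) -> Tuple[bool, List[str]]:
    """Try the cheapest agents per expected success until one passes."""
    ranked = sorted(bids, key=lambda b: b.stated_cost / b.stated_p)
    tried: List[str] = []
    for bid in ranked[:max_agents]:
        tried.append(bid.agent)
        if attempt(bid.agent):  # hypothetical: runs the agent, returns pass/fail
            return True, tried
    return False, tried


def apply_hard_prior(bids: List[Bid], historical_pass_rate: Dict[str, float]) -> List[Bid]:
    """Replace stated probabilities with historical pass rates where known."""
    return [
        b._replace(stated_p=historical_pass_rate.get(b.agent, b.stated_p))
        for b in bids
    ]
```

Because the second attempt only runs when the first fails, pass rate can only go up relative to the first-ranked agent alone, at the price of extra resource usage. When the bids themselves are unreliable, replacing them with historical rates changes the ranking, which is one way to read the effect of the hard-prior intervention.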
Implications and Future Directions
The empirical evidence indicates that self-assessment is the critical bottleneck for realizing the promised efficiency of market-based AI agent coordination. In the absence of reliable, task-specific probabilistic and cost estimates, decentralized price-based allocation is noisy and suboptimal. This insight has immediate design implications:
- Calibration as a Core Capability: Calibration, abstention, and disciplined resource prediction should become explicit objectives in model training, fine-tuning, and evaluation, alongside downstream task performance.
- Market Scoring Functions: Future markets will likely score bids not just on price, but also on historical reliability, abstention discipline, and domain reputation, akin to scoring auctions in procurement or sponsored search. Such designs can partially counteract poor agent-level calibration through aggregation of external or historical metrics.
- Cost-Quality Schedules: As frontier agents demonstrate increasing marginal returns for higher resource provision, simple scalar bidding (cost/expected success) is inadequate. Mechanisms will need to elicit and reason with agent-specific cost-quality schedules (token budget → pass probability curves, etc.).
- Richer Institutional Mechanisms: Full realization of AI-enabled markets will require reputation systems, escrow mechanisms, and possibly third-party verification oracles to ensure bid reliability and agent accountability.
- Beyond Software Engineering: Extension of MarketBench to broader task regimes (e.g., scientific reasoning, strategic planning) is essential to stress-test market-based designs at scale.
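Two of the proposals above, multi-attribute bid scoring and cost-quality schedules, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the shrinkage rule that blends a stated probability with a historical pass rate, the `prior_weight` parameter, the linear token price, and the function names. The source does not specify any of these.

```python
from typing import List, Tuple


def adjusted_probability(
    stated_p: float, hist_rate: float, n_hist: int, prior_weight: float = 5.0
) -> float:
    """Blend a stated probability with the agent's historical pass rate.

    The more past tasks (n_hist) on record, the less the stated value counts.
    """
    return (n_hist * hist_rate + prior_weight * stated_p) / (n_hist + prior_weight)


def bid_score(
    stated_p: float, price: float, hist_rate: float, n_hist: int, task_value: float
) -> float:
    """Score a bid by reliability-adjusted expected value minus asking price."""
    return adjusted_probability(stated_p, hist_rate, n_hist) * task_value - price


def best_budget(
    schedule: List[Tuple[int, float]], task_value: float, token_price: float
) -> Tuple[int, float]:
    """Pick the (token_budget, pass_probability) point with highest net value."""
    return max(schedule, key=lambda pt: pt[1] * task_value - pt[0] * token_price)
```

For example, an agent that states 0.95 but has passed 75% of 15 past tasks is scored as if its probability were 0.8. Given a hypothetical schedule of (10000, 0.4), (50000, 0.7), and (200000, 0.8) with a task value of 1.0 and a token price of 2e-6, the middle budget maximizes net value, which a single scalar bid could not express.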
Conclusion
MarketBench provides an operational framework for measuring the viability of market-oriented coordination among AI agents, and its results show that current LLMs are hampered by weak self-calibration. The findings underscore the need to prioritize calibration and metacognitive self-assessment in agent design for practical decentralized AI systems. Effective AI-agent markets will likely require richer self-knowledge, multi-attribute bid scoring, and improved economic protocols to close the gaps this benchmark reveals.