Test-Time Compute Games

Published 29 Jan 2026 in cs.CY, cs.AI, cs.GT, and cs.LG | (2601.21839v1)

Abstract: Test-time compute has emerged as a promising strategy to enhance the reasoning abilities of LLMs. However, this strategy has in turn increased how much users pay cloud-based providers offering LLM-as-a-service, since providers charge users for the amount of test-time compute they use to generate an output. In our work, we show that the market of LLM-as-a-service is socially inefficient: providers have a financial incentive to increase the amount of test-time compute, even if this increase contributes little to the quality of the outputs. To address this inefficiency, we introduce a reverse second-price auction mechanism where providers bid their offered price and (expected) quality for the opportunity to serve a user, and users pay proportionally to the marginal value generated by the winning provider relative to the second-highest bidder. To illustrate and complement our theoretical results, we conduct experiments with multiple instruct models from the $\texttt{Llama}$ and $\texttt{Qwen}$ families, as well as reasoning models distilled from $\texttt{DeepSeek-R1}$, on math and science benchmark datasets.

Abstract PDF Upgrade to Chat

Authors (4)

Summary

The paper introduces a game-theoretic model of LLM compute allocation, revealing that providers escalate compute to maximize profits at the cost of social welfare.
It demonstrates that such strategic behavior can lead to up to 19% inefficiency as measured by the price-of-anarchy in real-world LLM markets.
A reverse second-price auction mechanism is proposed, effectively aligning provider incentives and eliminating welfare loss under ideal conditions.

Test-Time Compute Games: Strategic Provider Behavior and Market Inefficiency in LLM-as-a-Service

Introduction and Market Motivation

LLM inference increasingly relies on cloud providers offering LLM-as-a-service. A central factor in production cost and model efficacy is test-time compute (TTC): the computational effort spent during inference, involving techniques such as chain-of-thought (CoT), best-of-n sampling, or tree-of-thought methods. Economically, providers directly tie user payments to the volume of compute expended, with billing models sensitive to the number of generated tokens—an established practice in leading commercial APIs.

The paper "Test-Time Compute Games" (2601.21839) initiates an economic and game-theoretic analysis of strategic TTC deployment, demonstrating how provider incentives in market equilibrium diverge from socially optimal compute allocation. When providers compete for users, they are incentivized to inflate compute (and thus costs) regardless of whether increased TTC yields substantial improvements in output quality or user utility. Crucially, this setting exposes a price-of-anarchy (PoA): the degree to which market competition, absent regulatory or incentive-alignment interventions, leads to outcomes that are suboptimal for aggregate social welfare.

Game-Theoretic Modeling of Provider Competition

The model frames provider competition for user tasks as an N-player normal-form game, with each provider choosing level $\theta \in \Theta$ of TTC for each task/query $\mathcal{T}$ . Output quality functions $q_i(\theta)$ typically increase with $\theta$ , but so do generation costs $c_i(\theta)$ and consequently the user-facing prices $p_i(\theta)$ .

User value for provider $i$ at compute level $\theta$ is $V_i(\theta) = q_i(\theta) - p_i(\theta)$ . Users allocate demand probabilistically (softmax, bounded rationality) or deterministically to the provider with the highest value (Bertrand competition). Provider utility is proportional to market share and profit: $U_i(\theta_i; \boldsymbol{\theta}_{-i}) = s_i(V_i(\theta_i); \mathbf{V}_{-i}(\boldsymbol{\theta}_{-i})) \cdot (p_i(\theta_i) - c_i(\theta_i))$ .

The global welfare function aggregates provider profit and user value, excluding monetary transfers which cancel:

$\mathrm{SW}(\boldsymbol{\theta}) = \sum_{i=1}^N s_i(V_i(\theta_i); \mathbf{V}_{-i}(\boldsymbol{\theta}_{-i})) \cdot (q_i(\theta_i) - c_i(\theta_i))$

Providers act strategically to maximize individual utility (profit), resulting in a Nash equilibrium for TTC selection. Theoretical analysis establishes:

Existence and structure: The game is a generalized ordinal potential game, guaranteeing pure Nash equilibria and convergence under better-response dynamics.
Suboptimality of equilibrium: Providers select compute to maximize profit, not welfare; this systematically diverges from the compute allocation that would maximize $\mathrm{SW}$ .

Quantitatively, the price-of-anarchy (PoA) lower bounds the ratio of optimal social welfare to equilibrium welfare. Even with high rationality parameter $\beta$ , the market can exhibit nontrivial inefficiency.

Figure 1: Dynamics of a test-time compute game using majority-voting illustrate strategic escalation of compute and divergence from optimal configuration.

Reverse Second-Price Auction: Welfare Alignment

To mitigate inefficiency, the paper introduces a reverse second-price auction mechanism:

Providers submit bids specifying TTC, expected output quality, and price.
The user pays a price linked to the marginal value generated by the winning provider over the second-best bid.
Winning providers serve queries at the compute maximizing their individual welfare contribution.

Formally, for provider $i$ , dominant strategy is to bid price equal to cost and choose the welfare-maximizing compute level. This mechanism implements the social optimum, eliminating PoA (i.e., PoA = 1):

$\pi_1 = \arg\max_{i} \max_{\theta} (q_i(\theta) - c_i(\theta))$

Market dynamics under auction guarantee non-negative provider profit, truth-telling, and sharp welfare alignment.

Empirical Evaluation

Experiments validate theory using Llama and Qwen families (instruct and distilled reasoning models) on STEM benchmarks (GSM8K, GPQA, AIME). For each model and TTC setting, accuracy and inference costs are measured, enabling calculation of equilibrium market shares, user value, provider utility, global welfare, and PoA.

Figure 2: Accuracy of Llama and Qwen models on GSM8K under majority voting and best-of-n.

Figure 3: Accuracy of reasoning models distilled from DeepSeek-R1 on GSM8K using chain-of-thought, showing solution quality as TTC increases.

Results:

Equilibrium in pay-for-compute markets induces up to 19% inefficiency as measured by PoA.
Welfare loss manifests due to strategic escalation of TTC by dominant providers, evident in empirical better-response dynamics.
Auction mechanism achieves higher user value (up to 29% increase), provider profit, and total welfare, fully eliminating inefficiency under ideal assumptions.

Figure 4: Better-response dynamics of compute allocation by providers illustrate market convergence and inefficiency at equilibrium.

Figure 5: Potential function evolution under best-of-n sampling; providers jointly optimize profit and second-highest user value among competitors.

Practical and Theoretical Implications

The analysis rigorously demonstrates that current LLM pricing models inherently incentivize providers to over-allocate TTC, generating significant deadweight welfare loss. As LLMs scale and TTC becomes a major cost driver (including externalities such as energy consumption [Fernandez et al., (Fernandez et al., 24 Apr 2025)]), aligning provider incentives with social welfare is both an economic and societal imperative.

Mechanism design via auctions (multi-attribute, reverse second-price) provides a robust provable framework for efficiency, but introduces vulnerabilities to collusion and manipulation [Mailath & Zemsky 1991; Chen & Tauman 2006]. Practical deployment requires careful scrutiny of auction implementation, quality verification, and applicability to open-ended or non-verifiable tasks.

Future directions involve auction mechanism robustness, integration with carbon- or energy-aware prompting [Adamska et al., (Adamska et al., 9 Mar 2025)], and empirical analysis with heterogeneous user preferences and provider service attributes.

Conclusion

"Test-Time Compute Games" establishes a rigorous model for the strategic interaction of LLM providers in TTC markets. The work introduces fundamental inefficiency results for real-world LLM-as-a-service ecosystems and provides an actionable, theoretically grounded auction mechanism for welfare maximization. As model inference costs escalate and AI markets mature, insights from this paper will be instrumental for regulatory design, platform engineering, and sustainable AI deployment.

Markdown Report Issue

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Practical Applications

off on

Glossary

off on

Conceptual Simplification

off on

Explain it Like I'm 14

Plain‑Language Summary of “Test‑Time Compute Games”

Overview

This paper looks at how companies that run LLMs in the cloud decide how much “extra thinking” a model should do before giving you an answer—and how that decision affects what you pay and what you get. The authors show that today’s market often pushes companies to use more compute than is actually useful, making users pay more without getting much better answers. They then design a fairer way to pick providers and set payments so everyone benefits.

What questions did the researchers ask?

Do LLM service providers have reasons to use more test-time compute (extra thinking) than is best for society?
What happens when many providers compete: does the market settle into a stable pattern, and is that pattern good for users?
Can we design a simple, fair rule for choosing providers and setting prices so that users get the best overall deal and providers still make money?

How did they study it? (Explained simply)

Think of hiring a tutor online:

“Quality” is how accurate the tutor’s answers are.
“Compute” is how many steps the tutor takes to think through a problem (like writing out a longer explanation).
“Price” is how much they charge you.
“Value to the user” is quality minus price: how good the answer is, given what you paid.

In today’s LLM services, you pay for all the tokens the model generates—even the “invisible” reasoning steps you never see. Providers can choose to make the model think more (use more test-time compute), which usually improves quality a little but increases your bill a lot. That can increase the provider’s profit even when it barely helps you.

What the authors did:

They built a game-theory model of this marketplace. Each provider picks how much extra compute to use. Users choose a provider based on value (quality minus price).
They analyzed “equilibrium” situations—stable outcomes where no provider wants to change their compute choice alone.
They measured how inefficient these outcomes can be using a standard metric called the “Price of Anarchy” (how far the market’s outcome is from the best possible total benefit for everyone).
They proposed a new system: a reverse second-price auction. Providers submit:
- their expected quality with a chosen compute level,
- and a price.
- A platform picks the provider with the best deal (quality minus price). The user then pays based on the difference between the winner’s quality and the second-best offer’s value. This encourages providers to be honest and to pick the compute level that’s truly best overall, not just best for profit.
They tested these ideas using several real models (Llama and Qwen families, plus reasoning models distilled from DeepSeek‑R1) on math/science benchmarks (GSM8K, GPQA, AIME) and common test-time methods (best‑of‑n, majority voting, chain‑of‑thought).

Key terms in everyday language:

Test-time compute: extra thinking steps the model does when answering a question.
Best‑of‑n / majority voting: the model tries several answers and picks the best or most common one.
Equilibrium: a stable state where no provider gains by changing their strategy alone.
Price of Anarchy: how much worse the market’s result is compared to the best possible outcome for everyone.
Reverse second‑price auction: the platform chooses the best offer, but the price paid is tied to the second-best offer—this makes being honest and efficient the best strategy.

Main findings and why they matter

Today’s market is often inefficient: Providers have a financial reason to dial up test-time compute, because more compute brings higher charges, even when it barely improves answers. In experiments, this inefficiency (Price of Anarchy) reached up to 19%.
A better way exists: The proposed reverse second‑price auction makes providers choose the compute level that maximizes overall benefit (quality minus cost). It also:
- Guarantees the best possible total outcome for users and providers.
- Ensures users always get at least the second-best possible value.
- Ensures providers make non‑negative (i.e., not losing) profit.
Stability: The authors show the current market does settle into stable behavior, but that stable point is usually not the best for users. Their auction reaches a stable outcome that is also socially optimal.

What could this change?

For users and companies buying LLM services: You may currently be paying for lots of “invisible thinking” that doesn’t help much. A better pricing and selection system could cut waste and maintain (or improve) answer quality.
For providers: The auction rewards efficient compute choices—spending more only when it meaningfully improves answers, not just profits.
For platforms and policymakers: The results suggest moving from “pay for compute” to “pay for delivered value,” using auctions or similar mechanisms, could align incentives and reduce waste across the AI ecosystem.
For evaluating LLMs: Benchmarks and leaderboards should factor in both accuracy and the compute cost required to achieve it.

In short, this paper shows that simply charging for extra compute pushes the market in the wrong direction. A smarter selection and payment rule can make the system fairer and more efficient—so users pay for real improvements, not just more tokens.

View Paper Prompt View All Prompts

Knowledge Gaps

Unresolved gaps, limitations, and open questions

Below is a focused list of what the paper leaves missing, uncertain, or unexplored, framed to be actionable for future research.

Strategic pricing as a decision variable: The competitive market model treats price per token as exogenous and focuses on providers choosing test-time compute (TTC). Extend the game to include strategic pricing (and potentially bundle pricing or tiered plans) alongside TTC, and characterize equilibria and welfare implications.
Cost structures and marginal profit assumptions: Results rely on costs and prices increasing with TTC and on positive marginal profit (Eq. 3). Study sensitivity to different cost curves (convex, concave, stepwise), volume discounts, and hardware utilization effects; identify conditions under which equilibria and welfare guarantees still hold.
Heterogeneous action spaces: Providers are assumed to share a common discrete TTC action set Θ. Generalize results to provider-specific, heterogeneous action spaces (different TTC methods and granularity per provider), and assess how heterogeneity affects equilibria and PoA.
Market share modeling beyond Bertrand/logit: The analysis uses two canonical market share functions (perfect rationality and logit RUM). Empirically estimate user choice models from real usage (including reputation, latency, safety, and pricing frictions), and test whether the potential-game characterization and equilibrium properties persist.
Explicit bounds for bounded rationality: Theorem 3 asserts existence of a finite β₀ but does not provide explicit, interpretable bounds. Derive quantitative β₀ conditions in terms of observable quantities (e.g., value gaps, noise scales) and analyze equilibrium selection and efficiency for moderate β.
Equilibrium multiplicity and convergence: Potential games ensure existence but not uniqueness or rates. Characterize the set of Nash equilibria, their welfare ranking, convergence rates of better-response dynamics, and robustness to asynchronous updates and noise.
Abstention/outside option in auction: The auction mechanism omits the user’s outside option V₀ considered in the market model. Design an auction that (i) allows abstention when welfare is negative, (ii) preserves social optimality, and (iii) handles cases where the second-best provider offers negative user value.
Verifiable quality in open-ended tasks: The auction assumes truthful quality reporting because users can “easily detect inconsistencies,” which is tenuous for open-ended tasks. Develop auditing protocols, scoring rules, or statistical verification methods that handle unverifiable outcomes, multi-dimensional quality, and sampling noise.
Robustness to overfitting on representative queries: Providers bid quality on a user-supplied query set Q. Analyze strategic overfitting (training/tuning to Q), sample-size requirements for reliable quality estimation, and robust mechanism variants (e.g., holdout splits, blinded evaluation, distribution shift safeguards).
Collusion and shill bidding in reverse second-price auctions: Investigate vulnerabilities to coordinated bidding (e.g., a provider inflating the second-best value to raise payments) and propose anti-collusion safeguards, monitoring, and penalty structures while maintaining efficiency.
Capacity constraints, latency, and SLAs: The models assume unconstrained capacity and ignore latency. Extend the theory to capacity-limited providers, queuing, and service-level agreements; quantify welfare trade-offs when high-quality providers face congestion and delays.
Adaptive per-query TTC policies: Providers pick a single TTC level per task. Study policies that adapt TTC per query (e.g., early-exit, confidence-based scaling), and analyze equilibrium and welfare when strategies are fine-grained and contingent on instance difficulty.
Multi-task and repeated-interaction dynamics: Tasks are treated independently. Model cross-task coupling (shared compute resources), repeated auctions, provider learning and reputation, and long-term incentives (e.g., investment in training to shift q_i(θ)).
Multi-attribute scoring (beyond price and “quality”): Real users value safety, faithfulness, speed, and reliability. Generalize the auction to multi-attribute scoring rules, identify scoring functions that remain welfare-maximizing, and study trade-offs across attributes.
Tighter PoA characterization: Theorem 4 gives a lower bound with higher-order terms, but not tight conditions. Derive exact PoA for key subclasses (e.g., specific cost/quality families), identify worst-case instances, and provide empirical PoA estimates across broader settings.
Empirical calibration of user rationality (β) and value functions: Experiments assume a fixed linear mapping from accuracy to monetary value. Collect user data to estimate β and non-linear value functions (e.g., diminishing returns, task-dependent valuations) and revisit welfare and equilibrium conclusions under calibrated preferences.
Real-market validation: Current experiments use offline models, synthetic token prices, and a fixed 25% margin. Validate with real API providers, actual prices (including network/storage fees), latency penalties, and variable profit margins; measure welfare and PoA in production.
TTC method coverage and control fidelity: For reasoning models, TTC is approximated via quantiles of reasoning tokens rather than explicit control. Evaluate mechanisms on TTC techniques with tunable effort (e.g., dynamic multi-pass reasoning, tree search depth) and assess how control fidelity affects theory and practice.
Tokenization and billing manipulation risks: Prior work shows providers can misreport token counts. Analyze how such manipulation interacts with TTC choices and with the proposed auction; design auditing mechanisms that jointly verify token counts and TTC usage.
Privacy and data governance for auctions: Users must share representative queries; these may contain sensitive data. Develop privacy-preserving auction workflows (e.g., secure enclaves, differential privacy, secure aggregation) that still allow reliable quality estimation.
Tie-breaking rules and edge cases: The analysis assumes no ties. Formalize tie-breaking policies for both market and auction settings, and study their effects on welfare, payments, and incentives.
Payment mechanics and budget constraints: The auction’s payment P = q_winner − V_second may be volatile. Explore budget caps, risk-aware variants (e.g., reserve prices, payment smoothing), and their impact on truthfulness and efficiency.
Safety and externalities: Welfare is defined as q − c, ignoring harmful outputs, bias, and safety incidents. Incorporate externalities into quality metrics and mechanism objectives; analyze how safety-aware scoring changes equilibria and provider incentives.
Granularity of market segmentation: The paper treats “tasks” as homogeneous sets of queries. Investigate segmentation strategies (subtasks by difficulty, domain, or user group) to improve welfare and fairness, and determine how auctions operate across segments.
Provider heterogeneity in hardware and energy costs: Costs are inferred from token counts and a fixed margin. Model hardware-specific cost curves, energy prices, and cooling constraints, and study their effects on optimal TTC and welfare.
Mechanism adoption and platform design: The auction requires a third-party platform. Analyze the institutional and business requirements (governance, certification, settlement, dispute resolution), and prototype end-to-end systems to assess feasibility and adoption barriers.

View Paper Prompt View All Prompts

Practical Applications

Immediate Applications

Below is a concise set of practical applications that can be deployed now, leveraging the paper’s findings about test-time compute (TTC) inefficiencies and its reverse second-price auction mechanism.

Industry

Auction-based LLM routing inside aggregators
- Description: Meta-providers (e.g., Together.ai, Fireworks, Vercel AI, enterprise model hubs) run a reverse second-price auction across their own or partner models per user/task, selecting the provider and TTC level that maximizes quality minus cost.
- Sectors: Software, cloud/AI platforms
- Tools/Workflows: Bidding API for “price–quality” offers; auction/scoring engine; audit logs; per-task representative query sets; reward models to estimate expected quality.
- Assumptions/Dependencies: Providers (or sub-models) can report expected quality on representative queries; a broker exists to compute payments and route calls; quality metrics are available or estimable; truthful quality reporting is verifiable via spot checks.
Welfare-optimized TTC controller
- Description: A controller that sets best-of-n/majority voting sample counts or chain-of-thought effort to maximize W(θ) = quality(θ) − cost(θ) at inference time, avoiding marginal TTC that offers little value.
- Sectors: Software engineering, robotics, finance analytics
- Tools/Workflows: Marginal gain estimators; reward models; early-stopping/gating policies; cost telemetry; per-task TTC policies integrated in inference pipelines.
- Assumptions/Dependencies: Access to quality–cost curves (per model/task); task-level performance signals (accuracy, test pass rates, reward scores); consistent token accounting.
Value-aware billing and SLAs
- Description: Enterprise contracts with providers that include “pay-for-value” clauses (e.g., no charge for marginal TTC with negligible quality gains; second-price style pricing pegged to competing offers).
- Sectors: Enterprise AI procurement
- Tools/Workflows: Contract templates; thresholds for marginal value; KPI dashboards tracking value gains vs TTC.
- Assumptions/Dependencies: Provider willingness to offer alternative billing; measurable value metrics; legal alignment with procurement processes.
TTC auditing and monitoring
- Description: Dashboards that compute price of anarchy (PoA), track quality vs token growth, and flag providers or models that inflate TTC without commensurate quality improvements.
- Sectors: Cloud AI ops, FinOps, MLOps
- Tools/Workflows: PoA calculator; token and cost telemetry; reward-model-based quality monitors; alerts.
- Assumptions/Dependencies: Access to token counts, per-token prices, and measured quality; standardized logging across providers.
Representative query set curation
- Description: Processes to build small, high-coverage samples of user tasks to estimate expected quality (used in auctions and TTC controllers).
- Sectors: Software, healthcare, education
- Tools/Workflows: Data selection pipelines; stratified sampling; privacy-preserving redaction; continual refresh of representative tasks.
- Assumptions/Dependencies: Availability of historic queries or proxy datasets; privacy compliance; mapping between representative performance and production tasks.

Academia

TTC-aware benchmarking
- Description: Extend evaluation frameworks (e.g., lm-eval, HELM) to report quality and TTC cost jointly, publish PoA estimates, and compare “pay-for-compute” vs “pay-for-value” rankings.
- Sectors: Academic ML evaluation
- Tools/Workflows: Evaluation harness plugins; cost normalization; public leaderboards reporting quality, TTC, and social welfare.
- Assumptions/Dependencies: Access to per-token pricing; reproducible TTC configurations; tasks with verifiable ground truth or robust satisfaction metrics.

Policy

Scoring auctions in public-sector AI procurement
- Description: Government/NGO RFPs run reverse second-price (scoring) auctions: bidders submit price and expected quality on representative tasks; award based on value; payment computed relative to runner-up.
- Sectors: Public procurement, healthcare, education
- Tools/Workflows: RFP templates; auction rules; audit procedures; independent quality validation.
- Assumptions/Dependencies: Legal authority and procurement capacity; standardized quality measures; accountability mechanisms for misreporting.

Daily Life

Smart consumer router
- Description: A client-side or browser extension that routes queries to the provider predicted to maximize net value (quality − price), with configurable budget caps and TTC limits.
- Sectors: Consumer productivity tools
- Tools/Workflows: Lightweight quality predictors; budget controls; model selection heuristics; optional “probe” questions per session.
- Assumptions/Dependencies: Aggregator APIs; viable quality proxies; transparency in token billing.

Sector-specific Examples

Healthcare (verifiable tasks)
- Use case: Clinical triage, guideline retrieval, and math-like risk calculations where ground truth exists—auction selection and TTC control minimize cost while preserving accuracy.
- Tools/Workflows: Hospital AI procurement adopting scoring auctions; validated clinical benchmarks; governance.
- Assumptions/Dependencies: Regulatory compliance; safety validation; protected health information handling.
Education
- Use case: STEM tutoring, homework help—set sample counts adaptively per problem type; choose provider with best accuracy-per-dollar.
- Tools/Workflows: Edtech platforms integrating TTC gating; math/science benchmark alignment; parental controls.
- Assumptions/Dependencies: Task difficulty calibration; fairness and accessibility.
Software engineering
- Use case: Code generation and bug fixing—set best-of-n dynamically based on test outcomes; avoid unnecessary sampling when tests already pass.
- Tools/Workflows: CI integrations; test-aware TTC gates; reward models using static analysis.
- Assumptions/Dependencies: Reliable tests; latency constraints; determinism in builds.
Finance
- Use case: Report drafting, risk scoring, data reconciliation—adopt welfare-optimal TTC policies in analysis pipelines to maximize ROI.
- Tools/Workflows: Value mapping to dollar outcomes; compliance logging; cost controls.
- Assumptions/Dependencies: Clear value-to-dollar mapping; audit trails; data governance.

Long-Term Applications

These applications require further research, standardization, provider participation, or scaling to realize the full benefits of the proposed mechanisms.

Industry

Open LLM auction marketplace
- Description: A cross-provider, standardized marketplace where providers bid quality and price in real time; platform runs reverse second-price auctions and handles clearing/settlement.
- Sectors: Cloud AI, fintech (market mechanisms)
- Tools/Workflows: Bidding protocol and APIs; payment rails; reputation systems; collusion detection.
- Assumptions/Dependencies: Wide provider adoption; standard quality metrics; robust verification and dispute resolution; latency guarantees.
Carbon- and latency-aware TTC auctions
- Description: Extend scoring auctions to include energy use, carbon intensity, and latency as attributes; optimize multi-criteria welfare.
- Sectors: Energy-aware computing, real-time systems, robotics
- Tools/Workflows: Emissions telemetry; latency SLAs; multi-attribute scoring rules.
- Assumptions/Dependencies: Reliable emissions reporting; task-specific latency constraints; multi-objective optimization.
Personalized, dynamic auctions and repeated-game mechanisms
- Description: Auctions that adapt to individual user preferences (e.g., risk tolerance, speed, cost), with repeated interactions, learning, and reputation effects.
- Sectors: Consumer and enterprise personalization
- Tools/Workflows: Preference elicitation; dynamic scoring; privacy-preserving history.
- Assumptions/Dependencies: Consent and privacy safeguards; robust behavioral models; avoidance of exploitation of bounded rationality.

Academia

Mechanism design extensions and robustness
- Description: Formal analysis of multi-attribute scoring auctions (quality, cost, latency, safety), collusion resistance, misreporting penalties, and open-ended task metrics (e.g., satisfaction, helpfulness).
- Sectors: Economics of ML, mechanism design
- Tools/Workflows: Theoretical models, simulations; field experiments; auditing methodologies.
- Assumptions/Dependencies: Agreed-upon scoring rules; ground-truth proxies for open-ended tasks; enforcement mechanisms.
Quality estimation research
- Description: Better off-policy quality predictors for diverse tasks, uncertainty quantification for expected quality bids, and causal methods to estimate marginal value of TTC.
- Sectors: ML evaluation, causal inference
- Tools/Workflows: Reward modeling; calibration; counterfactual TTC estimators.
- Assumptions/Dependencies: High-quality labeled datasets; avoidance of reward hacking; domain adaptation.

Policy

Transparency and token-accounting standards
- Description: Standards that require providers to disclose TTC usage, reasoning tokens, and pricing; independent audits to prevent overcharging for invisible reasoning tokens.
- Sectors: Consumer protection, standards bodies
- Tools/Workflows: Standardized logs; third-party auditors; certification programs.
- Assumptions/Dependencies: Regulator buy-in; interoperable logging; enforceable penalties.
Consumer “pay-for-value” rights
- Description: Policy frameworks that cap charges when marginal TTC does not improve quality, encourage auction-based procurement, and mandate disclosures of expected quality.
- Sectors: Consumer law, procurement policy
- Tools/Workflows: Model contract clauses; compliance monitoring; ombuds/appeals.
- Assumptions/Dependencies: Legal infrastructure; consensus on quality measures; economic impact assessment.

Daily Life

Personal AI wallets and compute insurance
- Description: Budgeting tools that cap TTC spend per task/session and automatically trigger auctions or refunds when marginal value falls; optional “compute insurance” to cover waste.
- Sectors: Consumer fintech, productivity
- Tools/Workflows: Wallet APIs; budget policies; auto-refund mechanisms integrating provider contracts.
- Assumptions/Dependencies: Provider cooperation; secure billing integrations; consumer education.

Sector-specific Examples

Healthcare
- Description: Regulatory adoption of welfare-optimized TTC and auction-based procurement for AI decision support; integration into EHRs; continuous post-market surveillance of value vs cost.
- Tools/Workflows: Clinical evaluation protocols; safety monitoring; procurement standards.
- Assumptions/Dependencies: Clinical validation; liability frameworks; interoperability with clinical systems.
Education
- Description: District or national edtech procurement via scoring auctions; standardized STEM benchmarks for quality bids; equity/fairness criteria included in scoring.
- Tools/Workflows: Public benchmarks; transparent auction results; educator oversight.
- Assumptions/Dependencies: Curriculum alignment; fairness metrics; long-term funding.
Finance
- Description: Exchange-like markets for research and analytics tasks where LLM providers bid; governance against collusion; integration with compliance systems.
- Tools/Workflows: Task taxonomies; surveillance; capital allocation models.
- Assumptions/Dependencies: Strong compliance; clear value attribution; risk controls.
Robotics
- Description: Real-time planning tasks with auctions balancing latency, quality, and cost; adaptive TTC policies that respect safety constraints.
- Tools/Workflows: Low-latency scoring; safety monitors; fallback policies.
- Assumptions/Dependencies: Deterministic timing guarantees; safe interruptions; robust verification.

In all cases, the feasibility hinges on measuring or reliably estimating expected quality per task and TTC level, transparent token and cost accounting, and mechanisms to verify or audit provider claims. Where ground truth is unavailable, robust reward models and satisfaction metrics must be developed and standardized.

View Paper Prompt View All Prompts

Glossary

Abstention threshold value: A baseline value that must be exceeded for users to choose any provider instead of abstaining. "users have the option to abstain from using any provider if none of them offers a value higher than an abstention threshold value $V_0 \in \mathbb{R}$ ."
Bertrand market: A market model where perfectly rational users always select the provider offering the highest value. "The first function corresponds to a Bertrand market~\citep{Tirole1988} where users are perfectly rational and always select the provider offering them the highest value."
Best-of-n sampling: A test-time compute method that generates multiple outputs and picks the best according to some scoring. "using techniques such as chain-of-thought~\citep{wei2022chain}, tree-of-thought~\citep{yao2023tree}, and best-of-n sampling~\citep{chow2024inference}."
Better-response dynamics: An iterative process where players sequentially switch to strategies that improve their utility. "start at some arbitrary point $\thetab^1\in\Theta^N$ and follow better-response dynamics over time."
Boundedly rational: A behavioral assumption where users choose probabilistically due to limited or noisy value estimation. "The second function corresponds to a market where users are boundedly rational~\citep{MCKELVEY19956} and select providers probabilistically based on their offered values."
Chain-of-thought: A prompting/inference technique that elicits step-by-step reasoning to improve model performance. "using techniques such as chain-of-thought~\citep{wei2022chain}, tree-of-thought~\citep{yao2023tree}, and best-of-n sampling~\citep{chow2024inference}."
Dominant provider: The provider that can offer the highest maximum value across compute levels. "we will refer to the provider who can offer the highest value $V^*_{1}$ as the dominant provider."
Dominant strategy: A strategy that maximizes a provider’s utility regardless of other providers’ actions. "each provider $i$ has a dominant strategy, that is, a combination of test-time compute $\theta^\dagger_i$ and price $p^\dagger_i$ that maximizes their utility regardless of how the other providers act."
Generalized ordinal potential game: A game where any unilateral utility-increasing move strictly increases a global potential function. "we start by showing that every test-time compute game is a generalized ordinal potential game"
Majority voting: A test-time compute method that aggregates multiple outputs by choosing the most frequent answer. "in test-time compute methods such as majority voting~\citep{wang2023selfconsistency} and best-of-n sampling~\citep{chow2024inference}"
Marginal profit: The increase in profit that results from increasing test-time compute. "we assume any increase in test-time compute leads to a positive marginal profit for the provider"
Marginal value: The additional user value created by the winning provider compared to the next-best offer. "users pay proportionally to the marginal value generated by the winning provider relative to the second-highest bidder."
Market share function: A function mapping offered values to the fraction of queries served by each provider or abstained. "we characterize the allocation of user demand across providers through a market share function $s: \mathbb{R}^{N+1} \to [0,1]^{N+1}$ "
Mechanism design: The study of designing rules (mechanisms) that lead to desired outcomes in strategic settings. "we draw inspiration from mechanism design~\citep{NISAN2001166,krishna2009auction} and propose a reverse second-price auction"
Normal-form game: A simultaneous-move game representation where players choose strategies to maximize utilities. "We start by modeling the current LLM-as-a-service market as a normal-form game~\citep{NISAN}"
Potential games: Games admitting a potential function whose increases align with players’ utility improvements. "we resort to the theory of potential games"
Price of Anarchy (PoA): The ratio between optimal social welfare and the welfare at equilibrium, measuring inefficiency. "Our results show that the existing pay-for-compute market is up to $19\%$ (socially) inefficient, as measured by the Price of Anarchy."
Procurement contracts: Contracts for acquiring goods/services, used as context for auction design analysis. "scoring auctions studied in the context of procurement contracts and supplier selection"
Pure Nash equilibrium: A strategy profile where no player can improve utility via unilateral deviation. "admits a pure Nash equilibrium"
Random Utility Models (RUMs): Models where choices are made based on noisy estimates of utilities/values. "In the context of random utility models (RUMs), boundedly rational users are those who only have access to noisy estimations of the value offered by each provider~\citep{yao2023rethinking}."
Reasoning tokens: Tokens corresponding to intermediate reasoning steps generated by a model. "based on the number of reasoning tokens in the output."
Reverse second-price auction mechanism: A procurement-style auction where the buyer selects the highest value offer and pays based on the runner-up. "we introduce a reverse second-price auction mechanism where providers bid their offered price and (expected) quality for the opportunity to serve a user"
Reward model: A model that scores outputs to select or rank them during inference. "according to a fixed reward model, namely ArmoRM-Llama3-8B-v0.1."
Scoring auctions: Auctions where bids include price and quality attributes and are evaluated by a scoring rule. "closely related to scoring auctions studied in the context of procurement contracts and supplier selection"
Scoring rule: A rule that combines price and quality attributes to determine the winning bid. "and the seller uses a scoring rule to determine the winner."
Second-price payment rule: A rule where payment is set relative to the second-best offer rather than the winner’s bid. "using a second-price payment rule"
Social welfare: The total value combining users’ value and providers’ utilities. "we define social welfare as the sum of the total utility obtained by all providers and the net value obtained by the user"
Test-time compute: Additional computation at inference (e.g., multiple samples or reasoning steps) to improve performance. "Test-time compute has emerged as a promising strategy to enhance the reasoning abilities of LLMs."
Tree-of-thought: A reasoning technique that explores branching sequences of steps before selecting an answer. "using techniques such as chain-of-thought~\citep{wei2022chain}, tree-of-thought~\citep{yao2023tree}, and best-of-n sampling~\citep{chow2024inference}."

Test-Time Compute Games

Summary

Test-Time Compute Games: Strategic Provider Behavior and Market Inefficiency in LLM-as-a-Service

Introduction and Market Motivation

Game-Theoretic Modeling of Provider Competition

Equilibrium Analysis and Social Inefficiency

Reverse Second-Price Auction: Welfare Alignment

Empirical Evaluation

Practical and Theoretical Implications

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

Plain‑Language Summary of “Test‑Time Compute Games”

Overview

What questions did the researchers ask?

How did they study it? (Explained simply)

Main findings and why they matter

What could this change?

Knowledge Gaps

Unresolved gaps, limitations, and open questions

Practical Applications

Immediate Applications

Industry

Academia

Policy

Daily Life

Sector-specific Examples

Long-Term Applications

Industry

Academia

Policy

Daily Life

Sector-specific Examples

Glossary

Open Problems

Continue Learning

Collections

Tweets

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research