How Inference Compute Shapes Frontier LLM Evaluation

Published 16 Jun 2026 in cs.AI | (2606.17930v1)

Abstract: AI evaluations are shifting toward harder tasks that benefit from longer trajectories involving tool use and iterative problem solving. As a result, performance is increasingly sensitive to the amount and allocation of compute available at test time ("inference compute"). Yet many evaluations still report performance at a single restrictive budget, meaning that low scores may reflect the evaluation setup rather than the model's underlying capability. To test this, we evaluate up to 12 frontier LLMs on seven challenging benchmarks spanning software engineering, mathematics, medicine, and cybersecurity. We use a controlled setup combining three simple inference-scaling interventions: larger token budgets, context compaction, and repeated submission attempts, guided either by the model itself or by minimal correctness feedback. We find three main results. First, larger token budgets substantially improve performance on benchmarks across multiple domains, including cybersecurity, FrontierMath, Humanity's Last Exam, and TerminalBench. Second, fixed-budget evaluations can increasingly understate frontier capability as models advance. Newer models reach higher performance at large budgets, where they unlock harder tasks and solve them more reliably. Third, benchmarks differ in which inference-scaling methods help most: repeated submission broadly improves performance, but the value of larger token budgets, external feedback, and parallel attempts varies by benchmark. Overall, our results show that benchmark scores are protocol-dependent. We therefore argue that evaluations should report capability as a function of inference-time compute, specify protocol choices explicitly, and compare model generations over a large shared compute range at matched budgets, especially in safety- or policy-relevant settings.

Summary

  • The paper demonstrates that expanded token budgets and iterative submission protocols significantly boost LLM performance on complex, long-horizon tasks.
  • The paper systematically evaluates up to 12 state-of-the-art LLMs across seven benchmarks, revealing that fixed compute protocols can underestimate true model capabilities.
  • The paper highlights generational improvements in reach and reliability while emphasizing that evaluation outcomes are critically shaped by inference-time compute allocation.

Inference-Time Compute as a Determinant in Frontier LLM Benchmarking

Motivation and Context

Recent advancements in LLM evaluation emphasize increasingly complex, long-horizon tasks that rely on extended trajectories, tool use, and iterative problem-solving. This shift makes measured performance more sensitive to how much inference-time compute is available and how it is allocated. Yet evaluation practices often rely on fixed, restrictive compute budgets (token limits, turn caps, or timeouts) that may artificially suppress observed capabilities. Such protocol-dependent measurement risks misrepresenting a model’s potential, especially as frontier LLMs progress. The paper examines inference scaling through a uniform protocol across up to 12 state-of-the-art LLMs on seven benchmarks spanning software engineering (TerminalBench, SWE-Bench Pro), mathematics (FrontierMath), medicine (HealthBench), expert knowledge (Humanity's Last Exam), and cybersecurity (Cyber CTFs, The Last Ones) (2606.17930).

Experimental Design

The investigation employs three general inference-scaling interventions:

  • Expanded Token Budgets: Total trajectory budgets are increased by 1–3 orders of magnitude over typical published defaults, allowing test-time compute to reach up to 100M tokens in some settings.
  • Context Compaction: Summarization of earlier context by the model enables serial scaling—circumventing window constraints and preserving information for long-horizon tasks.
  • Iterative Submission/Resubmission: Repeat attempts per task, governed either by self-guided exploration or minimal correctness feedback, create opportunities for refinement.

Feedback conditions are crossed with models: each model is assessed under both no-feedback and oracle-score-feedback conditions, using a ReAct-style agent scaffold (the cyber benchmarks, reused from prior work, were run with oracle feedback only). For each benchmark-model-condition combination, five independent trajectories are run, so that scaling effects can be estimated across repeated runs rather than from a single trajectory.
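As a rough illustration of how the three interventions fit together, the sketch below runs one trajectory of repeated submissions under a total token budget, with a compaction step and the two feedback modes. It is not the paper's scaffold: `run_trajectory`, `attempt`, and `is_correct` are hypothetical names, the compaction step is a placeholder string rather than model-written summarization, the default threshold of 130k tokens is borrowed from the figure quoted later in this summary, and scoring a trajectory as solved when any submission is correct is an assumption.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signatures: `attempt` stands in for one model call and returns
# (answer, tokens_used) with tokens_used > 0; `is_correct` stands in for the
# benchmark's grader.
Attempt = Callable[[List[str]], Tuple[str, int]]
Grader = Callable[[str], bool]


def run_trajectory(
    attempt: Attempt,
    is_correct: Grader,
    token_budget: int,
    oracle_feedback: bool,
    compaction_threshold: int = 130_000,
) -> Dict[str, object]:
    """Repeat submissions until the total token budget is spent.

    Assumption: a trajectory counts as solved if any submission is correct.
    """
    context: List[str] = []   # notes carried between attempts
    context_tokens = 0        # tokens accumulated since the last compaction
    used = 0                  # tokens spent over the whole trajectory
    submissions = 0
    solved = False

    while used < token_budget:
        answer, cost = attempt(context)
        used += cost
        context_tokens += cost
        if used > token_budget:
            break  # this attempt overran the budget, so it is discarded
        submissions += 1
        correct = is_correct(answer)
        solved = solved or correct

        if oracle_feedback:
            verdict = "correct" if correct else "incorrect"
            context.append(f"attempt {submissions}: {verdict}")
            if correct:
                break  # a confirmed answer ends the search
        else:
            context.append(f"attempt {submissions}: saved")  # neutral message

        if context_tokens > compaction_threshold:
            # Placeholder for model-written summarization of earlier context.
            context = [f"summary of {submissions} earlier attempts"]
            context_tokens = 0

    return {
        "solved": solved,
        "tokens": min(used, token_budget),
        "submissions": submissions,
    }
```

Running the same function with `oracle_feedback=True` and `oracle_feedback=False` for each model and task, five times each, would reproduce the shape of the crossed design described above, though not its details.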

Quantitative Performance Insights

Benchmark Sensitivity to Inference Scaling

Scaling with larger token budgets yields substantial performance improvements, with strong benchmark-dependent variability:

  • FrontierMath and HLE exhibit pronounced headroom. Increasing token budgets from typical values (1M to 10M for FrontierMath; 64k to 5M for HLE) raises success by +11.7 ± 11.0 and +9.3 ± 12.0 percentage points, respectively.
  • TerminalBench and SWE-Bench Pro, evaluated at already permissive budgets, only marginally benefit: increases are < 1.5 percentage points even when budgets are doubled.
  • HealthBench is nearly saturated; performance increases are negligible (+0.3 ± 0.4 points).
  • Cyber benchmarks (CTFs, The Last Ones) and selected stateful environments continue to improve within tested ranges—indicating protocol-induced headroom, not inherent saturation.

Plateau analysis shows that the budget at which returns diminish is benchmark-specific; in many cases the curves are still rising at the evaluated token cap, so the cap reveals only part of the reachable performance.
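To make the idea of a scaling curve and a plateau check concrete, here is a minimal sketch. It assumes each trajectory is summarized by the number of tokens it had used at its first success (`None` if it never succeeded within the cap). The function names are hypothetical, and comparing success at the cap with success at half the cap is only a simplified stand-in for the cap versus cap/2 criterion mentioned later in this summary, whose exact definition is not given here.

```python
from typing import Dict, List, Optional


def success_curve(
    tokens_to_solve: List[Optional[int]], budgets: List[int]
) -> Dict[int, float]:
    """Fraction of trajectories that have succeeded by each token budget."""
    n = len(tokens_to_solve)
    return {
        b: sum(1 for t in tokens_to_solve if t is not None and t <= b) / n
        for b in budgets
    }


def gain_at_cap(tokens_to_solve: List[Optional[int]], cap: int) -> float:
    """Success gained between half the cap and the cap.

    A value near zero suggests a plateau inside the tested range; a clearly
    positive value suggests the curve is still rising at the cap.
    """
    curve = success_curve(tokens_to_solve, [cap // 2, cap])
    return curve[cap] - curve[cap // 2]


# Illustrative data only: three successes and one failure.
runs = [1_000, 4_000, 9_000, None]
print(success_curve(runs, [2_000, 5_000, 10_000]))
print(gain_at_cap(runs, 10_000))
```

A single threshold on this gain has no uncertainty estimate attached, so any cutoff chosen for labeling a benchmark as saturated would need to be justified separately.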

Model Generational Effects

Decomposition of generational gains (across three generations per provider—Anthropic and OpenAI) demonstrates:

  • Reach: Successive model iterations unlock harder tasks (with negative generation-by-difficulty interaction for reach, p < 0.05 on five out of six benchmarks).
  • Reliability: Newer models solve unlocked tasks more consistently across repeated attempts, a robust trend on all benchmarks except HealthBench.
  • Efficiency: Token usage per solved task is reduced in newer models, but improvements are uneven and conditional on reach—most pronounced in Cyber CTFs and HealthBench.

Aggregate trends show that newer models reach higher performance at the budget cap and begin succeeding at lower budgets, but the principal advances are in reach and reliability rather than pure token efficiency.
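One way to picture the three-way decomposition is the sketch below. The definitions are assumptions chosen for illustration, not the paper's estimators: reach is the share of tasks solved at least once, reliability is the mean per-task success rate over tasks that were solved at least once, and efficiency is the median token count over successful trajectories. The function name `decompose` and the input layout are hypothetical.

```python
from statistics import mean, median
from typing import Dict, List, Optional


def decompose(results: Dict[str, List[Optional[int]]]) -> Dict[str, float]:
    """Split outcomes into reach, reliability, and efficiency.

    `results` maps a task id to one entry per trajectory: the tokens used
    when the trajectory succeeded, or None when it failed.
    """
    wins = {task: [x for x in runs if x is not None] for task, runs in results.items()}
    unlocked = [task for task, w in wins.items() if w]

    reach = len(unlocked) / len(results)
    reliability = (
        mean([len(wins[t]) / len(results[t]) for t in unlocked]) if unlocked else 0.0
    )
    efficiency = (
        median([x for t in unlocked for x in wins[t]]) if unlocked else float("nan")
    )
    return {
        "reach": reach,
        "reliability": reliability,
        "median_tokens_per_solve": efficiency,
    }


# Illustrative data only: five trajectories per task.
example = {
    "a": [100, 200, None, 100, None],
    "b": [None, None, None, None, None],
    "c": [500, 500, 500, 500, 500],
}
print(decompose(example))
```

Because efficiency is computed only over unlocked tasks, a newer model that unlocks harder tasks can look less efficient on this measure even when it uses fewer tokens on the tasks both models solve, which is one reason efficiency comparisons are conditional on reach.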

Protocol Dependence and Submission Dynamics

Repeated submission materially enhances performance across all studied benchmarks, with uplift multipliers ranging from 1.11x (FrontierMath) to 1.70x (HLE). Oracle score feedback is particularly potent in settings where it guides continued search (notably HLE and SWE-Bench Pro), whereas self-guided refinement dominates on benchmarks close to saturation.
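The summary does not state how the uplift multiplier is defined. A plausible reading, assumed in the sketch below, is the ratio of trajectories that succeed on any submission to trajectories that succeed on the first submission. The function name and input layout are hypothetical.

```python
from typing import List


def uplift_multiplier(outcomes: List[List[bool]]) -> float:
    """Ratio of any-submission successes to first-submission successes.

    `outcomes` holds one list per trajectory, with one boolean per submission
    in the order the submissions were made.
    """
    first = sum(1 for o in outcomes if o and o[0])
    final = sum(1 for o in outcomes if any(o))
    if first == 0:
        raise ValueError("uplift is undefined when no first submission succeeds")
    return final / first


# Illustrative data only: two first-try successes, four successes overall.
runs = [[True], [False, True], [False, False], [False, False, True], [True]]
print(uplift_multiplier(runs))
```

Under this definition a multiplier of 1.0 means resubmission added nothing, so the reported range of 1.11x to 1.70x would correspond to every benchmark gaining something from iteration.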

Parallelization—allocation of fixed compute budgets across multiple independent trajectories—benefits stateless tasks (HLE, HealthBench) far more than stateful tool-based environments. Gains from parallel sampling diminish markedly in newer models, reflecting increased ability to leverage deep, single trajectories for problem solving.
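The serial versus parallel comparison can be sketched as a reallocation over already-recorded trajectories: a fixed total budget is divided evenly across `width` trajectories, and the task counts as solved if any of them succeeds within its share. The function name is hypothetical, the even split is an assumption, and the token counts are invented to show the two regimes.

```python
from typing import List, Optional


def solved_with_width(
    tokens_to_solve: List[Optional[int]], total_budget: int, width: int
) -> bool:
    """Check whether a task is solved when the budget is split across `width` runs.

    `tokens_to_solve` holds, per recorded trajectory, the tokens used at first
    success, or None if the trajectory never succeeded.
    """
    if not 1 <= width <= len(tokens_to_solve):
        raise ValueError("width must be between 1 and the number of trajectories")
    per_trajectory = total_budget // width
    return any(
        t is not None and t <= per_trajectory for t in tokens_to_solve[:width]
    )


budget = 1_000_000

# Illustrative stateless task: quick when it works, but many runs fail outright.
stateless = [None, 30_000, 900_000, None, 45_000]
print(solved_with_width(stateless, budget, width=1))  # one deep attempt
print(solved_with_width(stateless, budget, width=5))  # five shallow attempts

# Illustrative stateful task: every success needs a long single trajectory.
stateful = [600_000, 700_000, None, 650_000, 800_000]
print(solved_with_width(stateful, budget, width=1))
print(solved_with_width(stateful, budget, width=5))
```

In this toy data the stateless task is only solved by going wide and the stateful task only by going deep, which mirrors the qualitative pattern reported above without reproducing any measured numbers.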

Theoretical and Practical Implications

The findings challenge the adequacy of fixed-protocol, single-budget evaluation practices. Benchmark scores are not static reflections of model architecture but are critically shaped by inference-time compute and submission rules. The results call for:

  • Reporting capability as a function of inference budget, not a single number.
  • Explicit protocol design and documentation—including tool use, feedback, and trajectory/task allocation.
  • Computing cross-generational comparisons at matched budgets over broad ranges, especially in safety- or policy-relevant contexts.

These recommendations align with recent literature emphasizing compute-optimal inference scaling [Snell et al., 2024], time-horizon analyses [Kwa et al., 2025], and the necessity of protocol-aware benchmarking [Balachandran et al., 2025].

Practically, underestimating model performance through compute-starved evaluation poses governance risks, as well-resourced actors may unlock substantial additional capabilities in real-world deployments. For theoretical analysis, inference scaling curves reveal latent capabilities and should be a staple in capability tracking.
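The matched-budget recommendation above can be illustrated with a small sketch: given two scaling curves, compare the models only at budgets where both were evaluated, and report the gap at each one. The function name is hypothetical and the scores are invented for illustration.

```python
from typing import Dict, List, Tuple


def matched_budget_gaps(
    curve_old: Dict[int, float], curve_new: Dict[int, float]
) -> List[Tuple[int, float]]:
    """Score difference (new minus old) at every budget both curves share."""
    shared = sorted(set(curve_old) & set(curve_new))
    return [(budget, curve_new[budget] - curve_old[budget]) for budget in shared]


# Illustrative curves only: token budget -> success rate.
older = {64_000: 0.125, 1_000_000: 0.25, 10_000_000: 0.25}
newer = {64_000: 0.125, 1_000_000: 0.5, 10_000_000: 0.75, 100_000_000: 0.875}

for budget, gap in matched_budget_gaps(older, newer):
    print(f"{budget:>12,d} tokens: gap {gap:+.3f}")
```

In this made-up example the two models are indistinguishable at the smallest budget and far apart at the largest shared one, so a report that quoted only the 64k score would miss the difference entirely. The 100M point is dropped because the older model was not evaluated there.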

Limitations and Future Directions

Limitations include use of a single, general scaffold and relatively simple inference-scaling techniques: more sophisticated elicitation (e.g., verifier-guided branching, adaptive width/depth) may shift results. Benchmark-specific saturation may reflect protocol or scaffold mismatches, not inherent model boundaries. LLM-judged repetition guards and continuous-score benchmarks (HealthBench) introduce measurement noise, but parallel-scaling effects are validated to be robust against judge unreliability.

Future research should investigate optimal inference scaling strategies, wider protocol variations, and more granular capability elicitation—especially as LLMs become more agentic and interactive. Extending these protocols to broader domains (e.g., RL agents, multimodal models) and integrating real-world feedback and safety constraints remains vital.

Conclusion

Performance assessment of frontier LLMs is inseparably linked to inference-time compute allocation and submission protocol. The investigated scaling interventions demonstrate that fixed-budget scores can understate model reach, reliability, and potential on complex benchmarks, particularly as models progress. Evaluation protocols must evolve to capture threshold unlocking, iterative search, and compute-limited headroom, establishing a more accurate and operationally relevant measure of AI capability (2606.17930).

Explain it Like I'm 14

Plain‑Language Summary of “How Inference Compute Shapes Frontier LLM Evaluation”

What is this paper about?

This paper looks at how much “thinking power” you let an AI use when you test it, and how that changes its score. The authors show that many tests only give AIs a tiny amount of time and space to think, which can make strong AIs look weaker than they really are. They argue we should measure how an AI’s performance grows as we give it more room to think, not just at one small setting.

What were the researchers trying to find out?

They asked three simple questions, in teen-friendly terms:

  • If you let an AI “think” longer and try more ideas, does it do better on hard tasks like coding, math, medicine, and cybersecurity?
  • Do newer AIs benefit more from extra “thinking time” than older ones?
  • Which ways of using that extra compute work best: one deep attempt, many shorter attempts, or giving feedback after each try?

How did they test it?

They evaluated up to 12 advanced LLMs on seven tough benchmarks across different areas (software engineering, math, medicine, and cybersecurity). To keep things fair and simple, they used the same general setup everywhere and three easy‑to‑understand “inference scaling” tricks:

  • Bigger token budgets: Think of “tokens” as pieces of text the AI can read or write. A “token budget” is like giving a student more time and scratch paper. The authors gave the AIs much larger budgets than usual—often 10–1000× bigger—so the models could plan, reason, and revise more.
  • Context compaction: As a conversation grows, the AI’s memory window fills up. “Context compaction” is like summarizing earlier pages of notes so there’s room for new work, without losing the important parts.
  • Repeated submissions: Instead of one shot, the AI could try many times—up to hundreds—refining or switching strategies. Two versions were tested:
    • No feedback: The model gets a neutral “saved” message after each try, with no hint if it’s right or wrong.
    • Oracle score feedback: The model is told if the last answer was correct (or partially correct). This is like a teacher saying “right” or “wrong” after each attempt.

They also compared two ways to spend a fixed amount of compute:

  • Serial (deep): Put most of the compute into one long, thoughtful attempt.
  • Parallel (wide): Split compute across many shorter, independent attempts.

Benchmarks included:

  • Software engineering: SWE‑Bench Pro (fixing real code) and TerminalBench (solving tasks via a command line).
  • Mathematics: FrontierMath (hard, often code‑based math problems).
  • Medicine: HealthBench (difficult clinical reasoning).
  • Expert knowledge: Humanity’s Last Exam (HLE; advanced cross‑disciplinary questions).
  • Cybersecurity: Capture‑the‑Flag puzzles and a long scenario called The Last Ones.

What did they find?

Here are the main takeaways, in plain language:

  • More “thinking room” often helps—especially on some tasks.
    • Giving larger token budgets noticeably improved performance on math (FrontierMath), expert knowledge (HLE), cybersecurity tasks (CTFs and The Last Ones), and terminal‑based tasks (TerminalBench).
    • On some software and medical tasks (SWE‑Bench Pro and HealthBench), extra budget helped only a little in this setup. That could mean these tasks are already near their limit with typical budgets—or that different strategies (not tested here) might help more.
  • Newer models shine when given more budget.
    • Newer generations tended to reach higher scores when they were allowed to “think” longer. At small budgets, newer and older models can look similar; at larger budgets, newer models pulled ahead.
    • This means a single, small‑budget score can miss true progress. As models get better at using extra compute, fixed low‑budget tests can understate what they can really do.
  • Different tasks benefit from different strategies—there’s no one‑size‑fits‑all.
    • Repeated submissions helped across the board. Letting the AI iterate generally raised scores.
    • Feedback mattered more on some tasks. For example, telling the AI “right/wrong” after each try boosted results a lot on HLE and SWE‑Bench Pro, where that signal helps the model steer its search.
    • Serial vs. parallel depends on the task:
      • Stateless tasks (like many Q&A problems) benefited more from parallel attempts: many quick tries may find a good answer faster.
      • Stateful tasks (where the AI works inside an ongoing environment, like a terminal or program) benefited more from going deep in a single trajectory, since the AI builds on what it already did.
  • Why performance improved: reach and reliability, more than efficiency.
    • Reach: Newer models could solve a wider range of tasks, especially the harder ones (they “unlocked” tasks earlier models couldn’t solve).
    • Reliability: On tasks they could solve, newer models solved them more consistently across repeated tries.
    • Efficiency: Using fewer tokens per solve improved in some areas (e.g., cybersecurity, some software tasks), but was uneven. The biggest gains came from being able to solve more tasks, and solving them more reliably—not only from solving the same tasks with fewer tokens.
  • Plateaus differ by benchmark.
    • Some tasks kept improving even at very large budgets (e.g., TerminalBench, cybersecurity CTFs).
    • Others showed flattening (diminishing returns) sooner (e.g., HealthBench), at least under this particular, simple setup.
    • Where the curve flattens can depend on the model family, generation, and whether feedback is provided.

Why does this matter?

  • Fairer and more useful evaluations: Judging an AI by a single, small-budget score is like grading a student after giving them only a few minutes and one attempt. For a fair picture, we should show performance as a function of “inference compute” (how much thinking time/space/tries the AI gets).
  • Better comparisons over time: As AIs get better at using extra compute, small-budget tests can hide real progress. Comparing generations should be done across a shared range of budgets and with clearly stated testing rules.
  • Safety and policy: In high‑stakes areas (health, cybersecurity), it’s crucial to know what an AI can do when given more time and attempts, because well‑resourced users—good or bad—can provide that compute. Reporting scaling curves helps decision‑makers understand both the best‑case performance and how accessible misuse might be at lower budgets.
  • Practical design: If you’re building an AI system, this study suggests:
    • Allow iterations and give useful feedback when possible.
    • Choose between deep single attempts and many quick attempts based on whether the task is stateful or stateless.
    • Expect newer models to gain more from extra inference compute.

Bottom line

The amount and the way we give AIs “thinking room” during testing strongly affects their scores—especially on hard, real‑world tasks. Newer models often unlock more problems and solve them more reliably when allowed to think longer or try multiple times. Because of this, evaluations should:

  • Report performance across a range of inference budgets,
  • Clearly document testing rules (e.g., feedback, iteration, token limits),
  • Compare models at matched budgets over a wide range,

so that we see the true capability of today’s frontier models and make better, safer decisions about how to use them.

Knowledge Gaps

Unresolved gaps, limitations, and open questions

Below is a concise list of concrete knowledge gaps and limitations left unresolved by the paper, framed to guide follow-on research:

  • Limited model diversity: results are based on two proprietary families (Anthropic Opus 4.x, OpenAI GPT‑5.x) and one Anthropic checkpoint (cyber only). It is unknown whether the findings hold for other closed- and open-weight families (e.g., Google, Meta, Mistral, Llama, DeepSeek, Qwen), smaller models, or distilled/finetuned variants.
  • Token-based compute as the sole metric: using “total tokens” (excluding judge tokens) ignores wall‑clock time, energy, FLOPs, accelerator utilization, tool/execution time, and tokenization differences across models. It remains unclear how conclusions change under compute-normalized metrics (e.g., FLOPs, latency) and when judge/auxiliary tokens are included.
  • Typical-budget baselines are approximated: for TerminalBench and SWE-Bench Pro, “typical” budgets were inferred from time/turn caps rather than reported tokens; the accuracy of these conversions and their impact on uplift estimates are unquantified.
  • Plateau identification is heuristic: the g_cap criterion (based on cap vs. cap/2) lacks uncertainty quantification and sensitivity analysis. More rigorous plateau detection (e.g., bootstrapped derivatives, change-point models, power-law fits) could alter which benchmarks are labeled as saturated.
  • High-budget ceiling remains untested for several benchmarks: multiple scaling curves did not plateau within the evaluated caps (e.g., TerminalBench, Cyber CTFs, parts of HLE/FrontierMath). The budget at which performance saturates and the shape of the high-compute tail are unknown.
  • Low-compute regime is underexplored: while the paper motivates both ends of the compute spectrum, it does not systematically characterize performance at very small budgets, where accessibility and misuse risks are highest.
  • Single scaffold dependence: results are obtained under one ReAct-style Inspect AI scaffold with bash/python tools. Sensitivity to alternative agentic strategies (e.g., tree-of-thought, toolformer-style planning, program-aided reasoning, code execution policies), tool suites (internet/APIs, retrieval, verifiers), and prompt/role designs is not assessed.
  • Context compaction strategy is fixed and unablated: only summary-based compaction is tested with a single trigger threshold (≈130k). The trade-offs among compaction methods (summarization vs. retrieval vs. external memory), thresholds, and memory externalization (files, scratchpads, KV cache virtualization) on scaling and error modes remain unknown.
  • Repetition guard effects are unexamined: an LLM-judge-based semantic-equivalence guard can prematurely terminate productive search. No sensitivity analysis of the guard’s thresholds, judge choice, or false-positive/false-negative rates is provided.
  • Feedback realism is limited: “oracle score feedback” (perfect correctness signals) may overstate real-world gains. How noisy, delayed, partial, or adversarial feedback affects scaling, search policies, and convergence is unresolved.
  • LLM-graded benchmarks introduce grading risk: HLE and HealthBench rely on a single judge model (GPT‑4o‑mini). Cross-judge robustness, judge calibration, adjudication against expert human graders, and susceptibility to judge-targeting strategies are not evaluated.
  • Judge/auxiliary tokens are excluded from budgets: the compute and cost impact of grading, semantic equivalence checks, and other auxiliary models are omitted from scaling curves; this may misrepresent end-to-end resource needs.
  • Interaction between per-call and per-trajectory budgets is untested: the per-generation reasoning budget (16k) is fixed. How varying per-call reasoning tokens, call timeouts, and call-level sampling policies interacts with total budget utilization and outcomes is unknown.
  • Decoding/sampling parameters are unspecified/unaltered: effects of temperature, top‑p, beam/branching, multi-sampling, and self-consistency on reach, efficiency, and reliability across compute regimes remain unexplored.
  • Parallel vs. serial compute allocation lacks prospective tests: “parallel scaling” is analyzed via reallocation over existing trajectories rather than with truly parallel, independently seeded attempts and adaptive schedulers. The optimal width–depth allocation and scheduling policies (bandits/early stopping) remain open.
  • Small number of trajectories per task: with 5 trajectories per (model, task, condition), estimates of reliability and tail behavior may be underpowered. The variance across seeds and the stability of curve estimates need quantification via more runs.
  • Task sampling and representativeness: up to 100 tasks per benchmark were used, with subsampling that makes results not directly comparable to published scores. The representativeness of the sampled sets, and how results change with full or stratified task coverage, are unreported.
  • Cyber benchmarks lack full cross-condition analysis: Cyber CTFs and The Last Ones were reused from prior work under oracle-only feedback and different setups; no no-feedback or partial-feedback comparisons, submission-behavior analyses, or unified protocols were run for these domains.
  • Tool and environment fidelity constraints: for stateful tasks, only bash/python were provided in a sandbox. The impact of richer tool ecosystems (IDEs, build systems, debuggers, internet, package managers, system privileges) and environment fidelity on scaling and protocol dependence is not addressed.
  • Data contamination and prior exposure: there is no analysis of training-data overlap or leakages that could affect where scaling gains appear (especially on high-profile benchmarks), nor methods to isolate contamination effects from genuine inference scaling.
  • Multilingual and multimodal generalization: experiments are primarily in English text with limited tool use. Whether the reported scaling patterns extend to multilingual settings and to multimodal tasks (vision, audio) is unknown.
  • Failure-mode taxonomy is limited: while reach/efficiency/reliability are decomposed, the paper does not categorize qualitative failure modes (e.g., search myopia, tool misuse, context-loss from compaction), nor how these evolve with more compute.
  • Safety and misuse impacts are unquantified: although policy relevance is emphasized, the study does not measure how increased inference compute changes harmful capability elicitation, persistence against safeguards, or exploit success rates under realistic constraints.
  • Economic and environmental costs are unreported: the marginal cost per unit of performance gain (tokens, time, dollars, energy) and cost-effectiveness of different protocol choices (feedback, compaction, iteration depth) are not analyzed.
  • Cross-model fairness concerns: tokenization differences, context window sizes, and provider-specific reasoning modes (e.g., xhigh vs. high) can bias cross-family comparisons under equal “token” budgets. Normalization and fairness adjustments are not provided.
  • Standardization of compute-aware reporting: the paper argues for reporting performance as a function of inference compute but does not propose a concrete community standard (e.g., canonical budget grid, metrics, judge protocols, inclusion of auxiliary tokens, uncertainty reporting).
  • Theoretical modeling of inference scaling is absent: there is no formal model linking training compute/parameters with inference scaling (reach/efficiency/reliability), nor predictive laws for scaling exponents across domains and protocols.

Practical Applications

Practical Applications Derived from “How Inference Compute Shapes Frontier LLM Evaluation”

Below are actionable applications grounded in the paper’s findings about inference-time compute, protocol design (iteration, feedback, and allocation of depth vs breadth), and cross-benchmark behavior. Items are grouped by deployment horizon and mapped to relevant sectors. Each item includes assumptions/dependencies that affect feasibility.

Immediate Applications

These can be deployed now with current models, tooling, and operational practices.

  • Industry – Software/AI Products: Compute-aware evaluation and release gating
    • What: Replace single-number benchmark reporting with cumulative success vs tokens (“inference-scaling curves”) to gate model upgrades and agent workflows before production rollout.
    • Sectors: Software, AI platforms, MLOps
    • Tools/products/workflows: Evaluation harnesses (e.g., Inspect AI), dashboards that plot cumulative success vs total tokens and show plateau detection, matched-budget A/B comparisons across model generations.
    • Assumptions/dependencies: Access to test suites or LLM-judge scoring; reproducible scaffolds; cost budget to run evaluations over wide token ranges.
  • Industry – MLOps: Adaptive inference budget controllers in production
    • What: Dynamically allocate token budgets, context-compaction thresholds, and max resubmissions per task type; switch between serial depth (stateful/interactive tasks) and parallel breadth (stateless QA) per the paper’s guidance.
    • Sectors: Software, agent platforms, customer support, search
    • Tools/products/workflows: “Inference Budget Controller” microservice; repetition guards; context compaction; optional LLM judges; task classifiers that route to depth vs breadth strategies.
    • Assumptions/dependencies: Stable signals of task type/statefulness; reliable context compaction; latency/cost SLOs.
  • Software Engineering: CI/CD loops with oracle-like feedback
    • What: Use unit tests as oracle feedback to drive repeated submissions in code generation/repair (PR bots iterate until tests pass or budget is exhausted).
    • Sectors: Software engineering
    • Tools/products/workflows: Test-backed agent scaffolds in CI (GitHub Actions/GitLab); capped iteration with early stopping on success; compute-tiered pipelines (fast/cheap vs thorough/expensive).
    • Assumptions/dependencies: High-quality tests; robust sandboxes; cost controls; secure handling of repositories.
  • Cybersecurity: Compute-tiered red teaming and risk profiling
    • What: Evaluate model and agent cyber capabilities across low-to-high inference budgets to characterize risk, including capture-the-flag and long-horizon ranges.
    • Sectors: Cybersecurity, safety evaluations
    • Tools/products/workflows: Red-team harnesses with adjustable token budgets, depth/width toggles, oracle-style success checks; reporting of “capability as a function of compute.”
    • Assumptions/dependencies: Legal/ethical clearance; secure sandboxes; expert oversight; cost ceilings.
  • Policy/Governance: Transparent compute-profile reporting in model cards and evaluations
    • What: Require publication of inference-scaling curves, plateau points, feedback conditions, and matched-budget comparisons (rather than single scores).
    • Sectors: Policy, standards bodies, journals, conferences
    • Tools/products/workflows: Reporting templates; editorial and grant review guidelines; procurement checklists requiring matched-budget evidence.
    • Assumptions/dependencies: Community/journal buy-in; consistent protocol descriptions; standardized definitions for budgets and plateau criteria.
  • Pricing and SLAs: Compute-tiered offerings with “quality vs speed” controls
    • What: Offer user-facing sliders or API parameters that set iteration count, token budget, and feedback use; set SLAs by compute tier.
    • Sectors: SaaS, developer platforms, finance (cost prediction), energy (carbon/cost tracking)
    • Tools/products/workflows: Tiered pricing tied to token and attempt limits; quality tiers for enterprise SKUs; cost calculators for “expected uplift per additional tokens.”
    • Assumptions/dependencies: Clear value deltas per tier; guardrails to prevent runaway cost; customer education.
  • Health and Expert QA: Iterative refinement with verifiable feedback signals
    • What: For medical or expert QA tasks where verifiable partial-credit scoring is available (rubrics, rules), allow a few guided resubmissions; avoid overspending on long trajectories when scaling does not help (as observed in HealthBench).
    • Sectors: Healthcare, expert advisory
    • Tools/products/workflows: Rule/rubric-based graders, medical ontologies, LLM-judge panels with adjudication; conservative iteration caps; human-in-the-loop gates.
    • Assumptions/dependencies: Strong oversight; validated graders; compliance (HIPAA/GDPR); risk management for LLM limitations.
  • Education: Attempt-based tutoring with correctness feedback
    • What: Tutors support multiple student attempts with immediate correctness feedback and modest iteration budgets; parallel attempts for stateless quiz items.
    • Sectors: Education
    • Tools/products/workflows: “Iterate-with-feedback” tutoring flows; compute-aware session management.
    • Assumptions/dependencies: High-quality answer keys or graders; pedagogy alignment; equitable access to compute.
  • Operations/Safety: Rate-limiting and anomaly detection by compute use
    • What: Detect and throttle unusually long trajectories or many parallel attempts (potential misuse), and cap public endpoints’ compute per request or per time window.
    • Sectors: Platform safety, abuse prevention
    • Tools/products/workflows: Compute-based anomaly detectors; budget caps; policy rules that escalate review for high-compute sessions.
    • Assumptions/dependencies: Accurate metering; low false positives; privacy-preserving telemetry.
  • Product UX: “Try another approach” and “Spend more compute” controls
    • What: Add buttons to trigger parallel alternatives (breadth) or deeper search (depth), and UI hints when more compute is likely to help based on task type.
    • Sectors: Productivity, consumer apps, coding assistants
    • Tools/products/workflows: UX affordances for iteration and budget control; progress indicators; explainable “why more compute may help.”
    • Assumptions/dependencies: Good task-type detection; user tolerance for latency/cost.
  • Academia/Research Practice: Standardized compute reporting and reproducible frameworks
    • What: Mandate reporting of token budgets, feedback protocols, and scaling curves; share Inspect-like scaffolds to enable matched-budget comparison across labs.
    • Sectors: Academia
    • Tools/products/workflows: Open-source evaluation packages; public dashboards; benchmark supplements with protocol metadata.
    • Assumptions/dependencies: Community standards; funding for high-budget runs.
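
The compute-based budget caps listed under Operations/Safety above could be prototyped along the following lines. This is a minimal sketch, not a method from the paper: the class `TokenWindowLimiter`, its parameters, and the sliding-window policy are all assumptions chosen for illustration, and timestamps are passed in explicitly so the behaviour is deterministic.

```python
from collections import defaultdict, deque


class TokenWindowLimiter:
    """Caps tokens per client over a sliding time window (hypothetical policy)."""

    def __init__(self, max_tokens: int, window_seconds: float):
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        # client_id -> deque of (timestamp, tokens) for admitted requests
        self._events = defaultdict(deque)

    def _used(self, client_id: str, now: float) -> int:
        events = self._events[client_id]
        # Drop admitted requests that have aged out of the window.
        while events and events[0][0] <= now - self.window_seconds:
            events.popleft()
        return sum(tokens for _, tokens in events)

    def allow(self, client_id: str, tokens: int, now: float) -> bool:
        """Admit the request only if it keeps the client within the cap."""
        if self._used(client_id, now) + tokens > self.max_tokens:
            return False
        self._events[client_id].append((now, tokens))
        return True


limiter = TokenWindowLimiter(max_tokens=1000, window_seconds=60.0)
decisions = [
    limiter.allow("alice", 600, now=0.0),   # within the cap
    limiter.allow("alice", 500, now=10.0),  # 600 + 500 exceeds the cap
    limiter.allow("alice", 400, now=20.0),  # 600 + 400 equals the cap
    limiter.allow("alice", 500, now=61.0),  # the first request has expired
]
print(decisions)  # [True, False, True, True]
```

A production system would likely meter actual tokens consumed rather than requested tokens, and would escalate rejected sessions for review instead of only refusing them.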

Long-Term Applications

These require further research, scaling, integration, or regulatory development.

  • Regulation and Standards: Compute-profile disclosure and matched-budget comparisons in high-stakes domains
    • What: Policy requiring “capability as a function of inference compute” for deployments in healthcare, critical infrastructure, education assessment, and finance; certification labels (e.g., “Grade at 64k/1M/10M tokens”).
    • Sectors: Policy, compliance, certification
    • Tools/products/workflows: Standardized “Inference Scaling Profile” (ISP) schema; third-party audits; conformance tests across specified compute ranges.
    • Assumptions/dependencies: Multi-stakeholder standards; accredited auditors; alignment with safety regimes.
  • Safety/Risk Management: Compute-aware access controls for dual-use capabilities
    • What: Gate access to higher compute tiers for riskier task categories; adaptive caps during incidents; risk-weighted budgets (more compute only after provenance, identity verification, or purpose checks).
    • Sectors: Platform trust & safety, cybersecurity
    • Tools/products/workflows: Policy engines binding task risk to compute ceilings; dynamic throttling; incident “circuit breakers.”
    • Assumptions/dependencies: Reliable task-risk classifiers; governance and legal frameworks.
  • Agent Orchestration: Meta-controllers that learn optimal depth/width allocation
    • What: Controllers that predict when to explore in parallel (stateless questions) vs deepen search with feedback (stateful problems), and when to stop (plateau detection).
    • Sectors: Agent platforms, robotics software stacks, enterprise automation
    • Tools/products/workflows: Reinforcement learning/meta-learning over allocation policies; plateau detectors; cost-quality frontiers integrated with schedulers.
    • Assumptions/dependencies: High-quality telemetry; robust success signals; generalization across tasks.
  • Domain-Oracles at Scale: Generalizable programmatic feedback channels
    • What: Build “oracle feedback” proxies across domains (contracts, tax, compliance, scientific computation) using validators, simulators, linters, and rule engines to enable productive iteration.
    • Sectors: LegalTech, FinTech, GovTech, scientific computing
    • Tools/products/workflows: Verifier libraries; domain sandboxes; hybrid human+automated adjudication.
    • Assumptions/dependencies: High-fidelity validators; careful handling of edge cases; liability management.
  • Benchmark Redesign: Multi-protocol, compute-explicit benchmarks with plateau checks
    • What: Next-gen benchmarks that specify multiple feedback modes, serial vs parallel allocations, and require reporting across a mandated compute range with plateau criteria.
    • Sectors: Benchmarking consortia, research
    • Tools/products/workflows: Shared harnesses; standardized scoring; metadata for protocol choices; leaderboards sorted by matched budgets.
    • Assumptions/dependencies: Community consensus; compute sponsorship; backward compatibility considerations.
  • Hardware/Systems Co-Design: Inference infrastructure optimized for long trajectories
    • What: Memory- and context-centric architectures (fast KV cache paging, summarization accelerators), schedulers that prioritize long-horizon, stateful tasks; cost-aware batching for iterative workflows.
    • Sectors: Cloud, chip design, systems
    • Tools/products/workflows: Inference schedulers aware of depth/width; memory hierarchy tuned for context compaction; “overnight solve” batch windows.
    • Assumptions/dependencies: Vendor roadmaps; throughput/latency trade-offs; energy budgets.
  • Market Mechanisms: Insurance, warranties, and SLAs tied to compute-reliability curves
    • What: Underwriting and warranties based on reliability improvements at specified compute tiers; cost forecasts and carbon disclosures per tier.
    • Sectors: Finance/insurance, sustainability, enterprise IT
    • Tools/products/workflows: Risk models using scaling curves; tier-specific incident expectations; green-compute reporting.
    • Assumptions/dependencies: Stable, auditable curves; actuarial data; standardized emissions accounting.
  • Defensive Cyber Operations: Compute-aware detection of multi-stage attack agents
    • What: Detect patterns of extended serial search or unusually broad parallel attempts indicative of automated attack chains; impose friction (CAPTCHAs, proof-of-work) at high compute thresholds.
    • Sectors: Cybersecurity, platform defense
    • Tools/products/workflows: Telemetry-driven detectors; adaptive friction; forensics linking compute use to incident timelines.
    • Assumptions/dependencies: Privacy-preserving logging; cooperative platform ecosystems; low operational overhead.
  • Education and Assessment: Compute-calibrated exams and learning analytics
    • What: Exams that report model performance at fixed compute budgets; learning tools that reveal marginal gains of additional attempts to teach metacognition and cost-benefit reasoning.
    • Sectors: Education, credentialing
    • Tools/products/workflows: Assessment specs with compute ceilings; analytics on compute-vs-learning gains.
    • Assumptions/dependencies: Acceptance by accreditation bodies; equity considerations for compute access.
  • Consumer Products: Scheduled high-compute “deep solve” modes
    • What: Modes that defer intensive multi-iteration tasks (e.g., complex coding, research syntheses) to low-cost windows or background processing with notifications upon success/plateau.
    • Sectors: Productivity, developer tools
    • Tools/products/workflows: Task queues; user consent to costs; progress/plateau notifications with “stop/continue” options.
    • Assumptions/dependencies: User trust; clear cost controls; robust resume/summarize mechanisms.
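
Several of the applications above depend on plateau detection, that is, deciding when additional compute has stopped paying off. One simple way to express the idea is sketched below. The function `plateau_index`, its `patience` and `min_gain` parameters, and the example scores are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def plateau_index(
    scores: Sequence[float], patience: int = 2, min_gain: float = 0.01
) -> Optional[int]:
    """Return the first checkpoint index at which the score has improved by
    less than `min_gain` over the preceding `patience` checkpoints.

    Returns None if no such checkpoint exists (the curve is still rising).
    """
    for i in range(patience, len(scores)):
        if scores[i] - scores[i - patience] < min_gain:
            return i
    return None


# Made-up success rates measured at increasing token budgets.
budgets = [64_000, 256_000, 1_000_000, 4_000_000, 16_000_000]
scores = [0.20, 0.35, 0.45, 0.455, 0.456]

stop = plateau_index(scores)
if stop is None:
    print("no plateau within the tested range")
else:
    print(f"plateau detected at budget {budgets[stop]:,} tokens")
```

A `None` result matters as much as a detected plateau: it signals that the tested compute range was too small to reveal saturated performance, so the reported score should be read as a lower bound.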

Notes on key assumptions/dependencies that recur across applications:

  • Availability of verifiable feedback (“oracle”) signals: unit tests, rubrics, validators, or reliable LLM judges.
  • Cost and energy budgets for running high-compute evaluations and deployments.
  • Accurate metering of total tokens and visibility into iteration behavior for telemetry and governance.
  • Safe sandboxes and tool integrations for stateful tasks (terminal, code execution, cyber ranges).
  • Reliable context compaction/summarization that does not erase critical information.
  • Clear, standardized reporting of inference protocols (depth vs width allocations, feedback conditions, timeouts, repetition guards).
  • Human oversight and domain expertise in high-stakes applications (healthcare, legal, finance), regardless of compute scaling gains.
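
The call for standardized reporting of inference protocols could take the form of a small machine-readable record attached to every reported score. The sketch below is one possible shape, not a schema defined by the paper: `InferenceProtocol`, its field names, and all example values are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class InferenceProtocol:
    """Hypothetical record of the protocol choices behind one reported score."""

    benchmark: str
    token_cap: int            # per-trajectory total tokens (input, output, reasoning)
    parallel_attempts: int    # width: number of independent trajectories
    feedback: str             # "none" or "oracle"
    context_compaction: bool
    repetition_guard: bool
    timeout_seconds: int

    def validate(self) -> None:
        """Reject records with impossible budgets or unknown feedback modes."""
        if self.token_cap <= 0 or self.parallel_attempts <= 0:
            raise ValueError("token_cap and parallel_attempts must be positive")
        if self.feedback not in {"none", "oracle"}:
            raise ValueError(f"unknown feedback mode: {self.feedback!r}")


protocol = InferenceProtocol(
    benchmark="example-benchmark",  # placeholder name
    token_cap=5_000_000,
    parallel_attempts=1,
    feedback="oracle",
    context_compaction=True,
    repetition_guard=True,
    timeout_seconds=3600,
)
protocol.validate()
record = json.dumps(asdict(protocol), sort_keys=True)
print(record)
```

Publishing such a record alongside each score would let readers check whether two results were obtained at matched budgets before comparing them.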

Glossary

  • Adaptive continuation prompt: A dynamic follow-up prompt that guides the model to refine or try alternate approaches during iteration. "Both conditions share an adaptive continuation prompt that invites the agent either to refine its previous answer or to try a substantially different approach."
  • Aggregate inference curve: The average over trajectories of success as a function of tokens consumed. "The aggregate inference curve is the mean of Si(t) over a specified set of trajectories,"
  • Capture-the-flag (CTF): A cybersecurity challenge format in which participants exploit vulnerabilities to retrieve hidden “flags.” "10 for the capture-the-flag (CTF) suite (April 2025 to April 2026)"
  • Compute-saturated performance: A regime where further compute adds negligible gains in performance. "TerminalBench and the Cyber CTFs are the clearest cases where the tested range still appears insufficient to reveal compute-saturated performance."
  • Context compaction: Summarizing earlier conversation turns to free context space for longer reasoning chains. "Context compaction, which replaces earlier turns with a model-generated summary to enable serial scaling beyond the nominal context-window size."
  • Context window: The maximum number of tokens the model can attend to in a single generation call. "beyond the nominal context-window size."
  • Cumulative success curve: A curve showing performance as a function of tokens rather than at a single budget. "Inference scaling can be characterised by a cumulative success curve, which shows performance as a function of tokens consumed rather than at one fixed budget."
  • Frontier capability: The upper envelope of what the most advanced models can achieve under generous inference resources. "fixed-budget evaluations can increasingly understate frontier capability as models advance."
  • Frontier LLM: A large language model at or near the current state of the art. "we evaluate up to 12 frontier LLMs on seven challenging benchmarks"
  • Inference compute: Compute available at test time for running models, often measured in tokens. "compute available at test time ("inference compute")."
  • Inference scaling: How performance changes as inference-time compute increases. "scaling of performance with inference-time compute (hereafter, "inference scaling")"
  • Inference-time compute: The compute budget allocated during evaluation rather than training. "Performance on these tasks increasingly depends on how much inference-time compute evaluations allow"
  • Kendall tau: A nonparametric statistic measuring rank correlation. "Per-facet Kendall τ between release date and A is annotated at the top of each panel"
  • LLM judge: A separate model used to grade outputs or detect repetition. "graded by a separate LLM judge (GPT-4o-mini at temperature 0)."
  • LLM-graded: Scored by an LLM rather than by deterministic tests. "LLM-graded (binary)"
  • Linear probability model: A linear regression model applied to a binary outcome. "Reach coefficients come from a linear probability model for whether a model unlocks a task"
  • Mixed-effects regression: A regression with fixed and random effects to model grouped/clustered data. "simple-slope predictions from a mixed-effects regression fit on continuous task difficulty"
  • Oracle score feedback: Immediate correctness feedback provided after each submission. "Oracle score feedback. The model is told whether each submission is correct"
  • Oracle-scored protocol: An evaluation setup where an oracle reveals correctness after attempts. "under a closely similar oracle-scored protocol only"
  • Parallel inference scaling: Allocating a fixed compute budget across multiple independent trajectories. "We study parallel inference scaling separately in Section 3.3.2, by reanalysing the same trajectories under different fixed-total-budget allocations."
  • Parallel scaling: Spreading compute across several shallower attempts instead of one deep attempt. "or spread across multiple shallower ones ("parallel" scaling)."
  • Programmatic verification: Automated, code-based checking of correctness (e.g., unit tests). "Scoring is by programmatic verification for TerminalBench (bundled unit tests)"
  • ReAct-style scaffold: A prompting framework that interleaves reasoning steps with tool calls/actions. "using one shared ReAct-style scaffold (Yao et al., 2023) implemented in Inspect AI"
  • Reasoning-token budget: A per-call token limit dedicated to the model’s internal reasoning. "a reasoning-token budget of 16,000 per generation call"
  • Repetition guard: A mechanism that stops a run when submissions become semantically repetitive. "we also employ a lightweight repetition guard that terminates a trajectory when a separate LLM judge finds three or more consecutive submissions semantically equivalent"
  • Sandboxed code execution: Running code in an isolated environment for safe grading. "FrontierMath (sandboxed code execution against per-problem reference implementations)"
  • Serial inference scaling: Using more compute within a single deep trajectory. "These techniques all operate within a single trajectory and target serial inference scaling"
  • Stateful benchmark: A task where environment state persists across multiple interactions. "Three are "stateful" benchmarks requiring environment state persistence across multi-turn tool use"
  • Stateless benchmark: A task without persistent state across interactions. "Two are "stateless" knowledge-and-reasoning benchmarks"
  • Theil–Sen fit: A robust method for estimating a linear trend based on median slopes. "lines are per-condition Theil-Sen fits (Section A.4.2)."
  • Token budget: The total number of tokens permitted for a trajectory or experiment. "Expanded total token budgets of 5M-30M tokens per trajectory"
  • Token cap: A hard upper limit on tokens for a trajectory. "Token cap refers to the per-trajectory total token budget for the target model (input, output, and reasoning tokens), excluding LLM judge tokens."
  • Trajectory: A single multi-step evaluation run comprising interactions, tool calls, and submissions. "These techniques all operate within a single trajectory and target serial inference scaling"
  • Turn cap: A limit on how many interaction turns are allowed in an evaluation. "turn caps, timeouts, poor context management"
  • Wall-clock time: Real elapsed time as a limit or measurement unit. "output tokens, turn caps, or wall-clock time"
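
The cumulative success curve and aggregate inference curve defined above can be illustrated with a short computation. This sketch assumes that S_i(t) is a binary indicator of whether trajectory i had succeeded by the time it consumed t tokens. The glossary excerpt does not spell out that definition, so treat it as an assumption. The function names and the data are hypothetical.

```python
from typing import List, Optional, Sequence


def success_indicator(first_success_tokens: Optional[int], t: int) -> int:
    """S_i(t): 1 if the trajectory had succeeded after consuming t tokens."""
    return int(first_success_tokens is not None and first_success_tokens <= t)


def aggregate_curve(
    first_success: Sequence[Optional[int]], budgets: Sequence[int]
) -> List[float]:
    """Mean of S_i(t) over all trajectories, evaluated at each budget t."""
    n = len(first_success)
    return [
        sum(success_indicator(f, t) for f in first_success) / n
        for t in budgets
    ]


# Made-up data: tokens consumed at the first correct submission of each
# trajectory; None means the trajectory never succeeded.
first_success = [50_000, 400_000, None, 2_000_000]
budgets = [100_000, 1_000_000, 10_000_000]

curve = aggregate_curve(first_success, budgets)
print(curve)  # [0.25, 0.5, 0.75]
```

Reading the curve at a single budget reproduces a conventional fixed-budget score. Here that score would be 0.25 at 100,000 tokens, even though the same trajectories reach 0.75 at 10,000,000 tokens.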
