
Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents

Published 9 Apr 2026 in cs.AI, cs.CL, and cs.MA | (2604.08369v1)

Abstract: Inference-time compute scaling has emerged as a powerful technique for improving the reliability of LLM agents, but existing methods apply compute uniformly: every decision step receives the same budget regardless of its difficulty. We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that allocates LLM calls adaptively across agent timesteps by measuring inter-rollout action agreement. At each step, TrACE samples a small set of candidate next actions and measures how consistently the model commits to the same action. High agreement signals an easy decision; the controller commits immediately. Low agreement signals uncertainty; the controller samples additional rollouts up to a configurable cap before committing to the plurality action. No learned components, no external verifier, and no human labels are required. We evaluate TrACE against greedy decoding and fixed-budget self-consistency (SC-4, SC-8) on two benchmarks spanning single-step reasoning (GSM8K, n=50) and multi-step household navigation (MiniHouse, n=30), using a Qwen 2.5 3B Instruct model running on CPU. TrACE-4 matches SC-4 accuracy while using 33% fewer LLM calls on GSM8K and 39% fewer on MiniHouse. TrACE-8 matches SC-8 accuracy with 55% fewer calls on GSM8K and 65% fewer on MiniHouse. We further show that inter-rollout agreement is a reliable signal of step-level success, validating the core hypothesis that the model's own output consistency encodes difficulty information that can be exploited without training. TrACE is the first training-free, per-timestep adaptive-compute controller for LLM agents to be evaluated on multi-step sequential decision tasks.

Authors (1)

Summary

  • The paper introduces TrACE, a training-free adaptive compute controller that adjusts LLM calls per timestep based on inter-rollout action agreement.
  • Experiments reveal compute cost reductions of 33% to 65% compared to fixed-budget approaches while maintaining similar accuracy on GSM8K and MiniHouse benchmarks.
  • TrACE uses consensus among candidate actions as a proxy for decision uncertainty, optimizing resource allocation for both simple and complex steps.

Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents

Introduction and Motivation

The paper introduces TrACE (Trajectorical Adaptive Compute via agrEement), a training-free, per-timestep adaptive compute controller for LLM-based agents operating in sequential decision-making settings (2604.08369). The core problem addressed is the uniform allocation of inference compute in deployed LLM agents—each timestep in a multi-step reasoning or acting pipeline typically receives an identical compute budget (e.g., a fixed number of samples), irrespective of variance in decision difficulty. This practice results in inefficient compute use, with trivial steps overprovisioned and complex decisions starved of needed exploration.

TrACE leverages inter-rollout action agreement as a proxy for agent uncertainty. The method adaptively increases or decreases the number of LLM calls at each timestep by measuring the empirical consistency in sampled next actions: high consistency signals that a step is easy and can be decided immediately; low consistency triggers additional samples, up to a maximum, before committing to an action. Crucially, TrACE is training-free and requires no labeled data, external verifiers, or modifications to the LLM itself.

Method: TrACE Adaptive Compute Controller

TrACE employs a stepwise algorithm for allocating compute budget:

  1. At each decision step: Draw a small set of candidate next actions by independently sampling the model at a specified temperature.
  2. Agreement computation: Measure the proportion of samples agreeing (modulo canonicalization) on an action, denoted $\alpha_t$.
  3. Threshold decision:
    • If $\alpha_t$ exceeds a tunable threshold $\tau_\text{high}$, commit to the plurality action immediately.
    • Otherwise, iteratively sample more candidates (up to a cap $k_\text{max}$) and repeat the agreement check, finally committing to the mode action.

This approach contrasts with fixed-$k$ self-consistency (SC-$k$), which applies the same number of samples per decision and does not exploit stepwise uncertainty. The default setting uses $k_\text{init} = 2$, $k_\text{max} \in \{4, 8\}$, and $\tau_\text{high} = 0.75$.
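The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `canonicalize`, `agreement`, `trace_step`, and `sample_action` are hypothetical, the canonicalization rule (trim, lowercase, collapse whitespace) is a placeholder, and drawing one extra sample per iteration is an assumption, since the text only says that more candidates are sampled up to the cap.

```python
from collections import Counter
from typing import Callable, List, Tuple


def canonicalize(action: str) -> str:
    """Hypothetical canonicalization: trim, lowercase, collapse whitespace."""
    return " ".join(action.strip().lower().split())


def agreement(actions: List[str]) -> Tuple[str, float]:
    """Return the plurality action and its share of the samples (alpha_t)."""
    counts = Counter(canonicalize(a) for a in actions)
    top_action, top_count = counts.most_common(1)[0]
    return top_action, top_count / len(actions)


def trace_step(
    sample_action: Callable[[], str],
    k_init: int = 2,
    k_max: int = 4,
    tau_high: float = 0.75,
) -> Tuple[str, int]:
    """Pick one action for the current timestep; return (action, LLM calls used)."""
    samples = [sample_action() for _ in range(k_init)]
    while True:
        action, alpha = agreement(samples)
        # Strict comparison, following the text's "exceeds".
        if alpha > tau_high or len(samples) >= k_max:
            return action, len(samples)
        # Assumption: grow the sample set one rollout at a time.
        samples.append(sample_action())


# A sampler that always returns the same action commits after k_init calls.
print(trace_step(lambda: "open door"))  # ('open door', 2)
```

In practice `sample_action` would wrap a call to the LLM at a nonzero temperature; here it is any zero-argument callable, which keeps the controller independent of the model backend.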

This empirical behavioral agreement (distinct from the LLM's self-reported confidence) is repeatedly validated as a robust signal for step-level correctness, outperforming naïve or token-level uncertainty metrics and offering a compelling operational tradeoff between compute cost and reliability.

Empirical Results

Experiments were conducted using a quantized Qwen 2.5 3B Instruct model on CPU across two benchmarks: GSM8K (single-step grade-school math) and MiniHouse (custom multi-step household navigation).

Figure 1: TrACE-4 and TrACE-8 match the accuracy of SC-4 and SC-8 while using substantially fewer LLM calls, dominating the self-consistency baseline on the compute–accuracy Pareto frontier.

TrACE demonstrates consistent efficiency improvements:

  • GSM8K: TrACE-4 matches SC-4 accuracy ($0.82$) with 33% fewer LLM calls per task. TrACE-8 matches SC-8 accuracy with 55% fewer calls per task.
  • MiniHouse: TrACE-4 matches SC-4 accuracy with 39% fewer LLM calls per task. TrACE-8 matches SC-8 accuracy with 65% fewer calls per task.

In all cases, TrACE matches the maximum reachable accuracy given the underlying model and task complexity but does so at significantly lower compute cost. Notably, in MiniHouse, increasing inference budget (greedy to SC-$k$ to TrACE-$k$) yields no accuracy improvement due to the model’s intrinsic weaknesses, but TrACE still provides meaningful efficiency gains.
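The percentage reductions quoted above are relative savings in LLM calls per task. A small helper makes the arithmetic explicit; it assumes the reduction is computed from mean calls per task as one minus the ratio of adaptive to fixed-budget calls. The function name `call_reduction` and the numbers in the usage line are hypothetical and are not taken from the paper.

```python
def call_reduction(baseline_calls: float, adaptive_calls: float) -> float:
    """Fraction of LLM calls saved relative to a fixed-budget baseline."""
    if baseline_calls <= 0:
        raise ValueError("baseline_calls must be positive")
    return 1.0 - adaptive_calls / baseline_calls


# Illustrative numbers only: a fixed budget of 8 calls versus an average of 4.
print(f"{call_reduction(8.0, 4.0):.0%}")  # prints 50%
```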

Analysis of Agreement as a Success Signal

A core hypothesis is that inter-rollout agreement encodes genuine decision confidence. Higher observed agreement correlates with steps situated in ultimately successful task trajectories, as demonstrated empirically (see left of Figure 2). Distributional analysis of LLM call counts reveals TrACE’s adaptivity: most steps exit at low sample counts (easy steps), with high sample counts reserved for genuinely ambiguous states (right of Figure 2).

Figure 2: Left: High stepwise agreement reliably predicts task-level success. Right: TrACE adaptively concentrates compute on more challenging steps, reflected in the non-uniform call distribution.
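One way to run this kind of check on your own agent logs is to compare mean step-level agreement between steps from successful and failed tasks. The sketch below assumes a hypothetical log format (`StepLog`) with one record per committed step; the example records are made up for illustration and are not data from the paper.

```python
from statistics import mean
from typing import Dict, List, NamedTuple


class StepLog(NamedTuple):
    agreement: float    # alpha_t recorded when the controller committed
    task_success: bool  # outcome of the task this step belongs to


def mean_agreement_by_outcome(logs: List[StepLog]) -> Dict[str, float]:
    """Average alpha_t over steps, split by whether the task succeeded."""
    succeeded = [s.agreement for s in logs if s.task_success]
    failed = [s.agreement for s in logs if not s.task_success]
    return {
        "success": mean(succeeded) if succeeded else float("nan"),
        "failure": mean(failed) if failed else float("nan"),
    }


# Made-up log entries for illustration only.
example_logs = [
    StepLog(1.0, True),
    StepLog(0.75, True),
    StepLog(0.5, False),
    StepLog(0.5, False),
]
print(mean_agreement_by_outcome(example_logs))
```

A gap between the two means is only suggestive; a fuller analysis would also look at the distributions, as the left panel of Figure 2 does.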

Ablation on Agreement Threshold

Sensitivity to the agreement threshold ($\tau_\text{high}$) was explored via ablation. Lowering the threshold below the default leads to greater efficiency but slightly reduced accuracy, as steps may terminate early before consensus. Raising it to unanimity ($\tau_\text{high} = 1.0$) degenerates to standard SC-$k$ performance and cost. The default ($\tau_\text{high} = 0.75$) maintains Pareto-optimality, trading minimal or no accuracy for large compute reductions.

Figure 3: Tuning $\tau_\text{high}$ explores the tradeoff between call count and accuracy; TrACE at the default $\tau_\text{high}$ yields strong efficiency while preserving the accuracy of SC-4.
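The cost side of this tradeoff can be explored with a toy simulation. The sketch below is not the paper's experiment: it assumes a synthetic mix of "easy" steps (all rollouts agree) and "hard" steps (rollouts drawn uniformly from three actions), and the names `calls_used`, `sweep`, and `p_easy` are hypothetical. Rollouts are pre-drawn per step so that every threshold sees identical samples, which makes the mean call count non-decreasing in the threshold.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence


def calls_used(samples: Sequence[str], tau_high: float, k_init: int = 2) -> int:
    """Rollouts consumed before committing; len(samples) acts as k_max."""
    k_max = len(samples)
    for k in range(k_init, k_max + 1):
        top_count = Counter(samples[:k]).most_common(1)[0][1]
        if top_count / k > tau_high:  # strict, following "exceeds"
            return k
    return k_max


def sweep(
    taus: Sequence[float],
    n_steps: int = 200,
    k_max: int = 4,
    p_easy: float = 0.7,
    seed: int = 0,
) -> Dict[float, float]:
    """Mean calls per step for each threshold on synthetic steps."""
    rng = random.Random(seed)
    steps: List[List[str]] = []
    for _ in range(n_steps):
        if rng.random() < p_easy:
            steps.append(["a"] * k_max)  # easy step: unanimous rollouts
        else:
            steps.append([rng.choice("abc") for _ in range(k_max)])
    return {tau: sum(calls_used(s, tau) for s in steps) / n_steps for tau in taus}


for tau, mean_calls in sweep([0.5, 0.75, 1.0]).items():
    print(f"tau_high={tau}: {mean_calls:.2f} calls/step")
```

With a strict comparison, a threshold of 1.0 can never be exceeded, so every step spends the full `k_max` rollouts, which mirrors the degeneration to fixed-budget self-consistency described above. Measuring the accuracy side of the tradeoff requires a real model and task and is outside this sketch.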

Practical and Theoretical Implications

TrACE offers a robust, training-free improvement to LLM agent deployment: it Pareto-dominates fixed-budget baselines in practical compute-constrained environments. This feature is especially pertinent as the community increasingly deploys open-weight models, where retraining or fine-tuning with reward models is often infeasible. The explicit demonstration that output agreement encodes actionable uncertainty information elevates the discussion of calibration in sequential LLM control, showing that behavioral variance is a more reliable signal than LLM self-report.

Nevertheless, limitations are acknowledged: experiments were run on a single small model on two benchmarks with relatively constrained action spaces; absolute accuracy is capped by model limitations. TrACE’s core advantage—relative compute efficiency at fixed accuracy—remains to be validated at larger scale and in more open-ended domains (e.g., code generation, complex planning).

Future Directions

Key avenues include extending TrACE to large models and domains with open-ended outputs, more systematic calibration of $\tau_\text{high}$, and, critically, composing behavioral agreement with complementary signals such as perturbation consistency (SPECTRA) under a meta-controller (COMPASS) for further reliability gains.

Conclusion

TrACE establishes inter-rollout action agreement as a principled, training-free signal for stepwise adaptive compute control in LLM agents. It matches the accuracy of fixed-budget self-consistency baselines at dramatically reduced inference cost and demonstrates that LLM behavioral agreement is a robust proxy for decision difficulty. Its deployment requires no additional learning or supervision, making it valuable for practical agents and suggesting a wider role for training-free signal exploitation in reliable AI systems (2604.08369).
