- The paper introduces interaction scaling, demonstrating that longer test-time interactions yield larger performance gains than deeper per-step reasoning.
- It presents the TTI algorithm, a curriculum-based online RL framework that gradually increases horizon lengths to balance exploitation with exploration.
- Experiments on WebVoyager and WebArena confirm that TTI achieves state-of-the-art results among open-source agents, improving success rates by up to 30%.
Detailed summary of
Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
(Junhong Shen et al., arXiv:2506.07976)
————————————————————————————————————
- Motivation
————————————————————————————————————
- Foundation-model agents (web-browsing agents, terminal bots, robots, etc.) act in partially observable, non-stationary environments.
- “Test-time scaling” today = give the model extra reasoning tokens (CoT, best-of-n, verifiers).
➜ This only re-organises information already observed; the agent still executes the same single action and remains reactive.
- Many tasks instead require gathering new evidence: explore, back-track, re-plan.
Example: hotel booking—must open many listings before deciding.
Key thesis → Scale the length of the interaction (number of environment steps) instead of, or in addition to, per-step thinking.
————————————————————————————————————
- Initial Evidence: Prompt-Only Interaction Scaling
————————————————————————————————————
WebArena subset, Gemma-3-12B, horizon limit h = 30.
Prompting trick (sketched in code at the end of this section): after the agent issues “ANSWER”, append
“Let’s pause and think again …” (double-check, triple-check, …).
Results (CoT baseline 23.8 % SR):
* Double-check ⇒ 26–28 % SR; trajectories ≈ 2× longer.
* Scaling per-step compute instead (forced longer CoT or best-of-n sampling) adds <3 %.
Conclusion → Given a fixed token budget, acting longer yields larger gains than thinking longer, because each extra step can surface new information.
Limitations of “prompt only”: repeated re-checks can confuse the model, and the agent cannot learn fine-grained exploration behaviour within a rollout.
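A minimal sketch of the prompt-only re-check wrapper described above, assuming generic `agent.act` / `env.step` interfaces and an illustrative re-check phrasing (not the paper's released code):

```python
# Prompt-only interaction scaling: intercept early "ANSWER" actions and ask the
# agent to re-check, so the rollout keeps interacting instead of terminating.
# `agent` and `env` are hypothetical interfaces used only for illustration.
RECHECK_PROMPTS = [
    "Let's pause and think again. Double-check your answer before committing.",
    "Let's pause and think again. Triple-check your answer before committing.",
]

def run_with_rechecks(agent, env, task, max_steps=30, max_rechecks=2):
    obs = env.reset(task)
    rechecks_used = 0
    for _ in range(max_steps):
        action = agent.act(obs)
        if action.startswith("ANSWER") and rechecks_used < max_rechecks:
            # Feed a re-check prompt back as part of the next observation, forcing
            # the agent to gather more evidence before committing to an answer.
            obs = obs + "\n" + RECHECK_PROMPTS[rechecks_used]
            rechecks_used += 1
            continue
        obs, reward, done = env.step(action)
        if done:
            return reward
    return 0.0
```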
————————————————————————————————————
- Challenges in Training for Long Horizons
————————————————————————————————————
Attempted online filtered behaviour-cloning (a REINFORCE variant; objective sketched below) with fixed horizons:
h = 5 → fast but overfits, terminates early.
h = 20 → learns very slowly; high-variance credit assignment; clicks randomly.
Need a mechanism that first masters exploitation at short horizons, then gradually introduces exploration at longer horizons.
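In equation form, the filtered-BC update amounts to behaviour cloning restricted to successful rollouts. A minimal sketch of the objective, assuming a binary trajectory reward R(τ) ∈ {0, 1} and task goal g (notation is ours, not copied from the paper):

```latex
% Filtered-BC objective (sketch): imitate only trajectories that achieved reward 1.
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{\tau \sim \pi_\theta}
    \left[ \mathbf{1}\{R(\tau) = 1\}
           \sum_{t=1}^{|\tau|} \log \pi_\theta\!\left(a_t \mid s_t, g\right) \right]
```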
————————————————————————————————————
- TTI (Test-Time Interaction) – Proposed Algorithm
————————————————————————————————————
Curriculum-based online RL built on filtered BC.
Horizon schedule: start at h₀ = h_min = 10, then hᵢ = min(h_max, h_min × i) (multiplicative growth found best).
* Early iterations: short roll-outs ⇒ stable gradients, core skills.
* Later iterations: progressively longer roll-outs ⇒ exploration, back-tracking.
Implementation (Alg. 1 in the appendix; training loop sketched below):
* Collect 512 rollouts per iteration with current horizon hᵢ.
* Keep only successful trajectories (reward = 1).
* Weighted replay buffer (recent data up-weighted).
* One epoch of AdamW fine-tuning (learning rate 1 × 10⁻⁶–4 × 10⁻⁶); vision tower frozen.
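A condensed sketch of the curriculum loop summarised above (growing horizon + filtered BC + recency-weighted replay). Helper names such as `collect_rollouts` and `finetune_one_epoch` are placeholders for the paper's rollout and fine-tuning infrastructure, not Alg. 1 verbatim:

```python
# TTI curriculum sketch: filtered BC on successful rollouts, with the rollout
# horizon growing across iterations. All helpers are hypothetical placeholders.

def horizon_schedule(i, h_min=10, h_max=30):
    """Multiplicative curriculum: the horizon grows with the iteration index, capped at h_max."""
    return min(h_max, h_min * max(i, 1))

def tti_train(policy, env, tasks, num_iters=6, rollouts_per_iter=512, recency_decay=0.5):
    replay = []  # (iteration, trajectory) pairs of successful rollouts
    for i in range(1, num_iters + 1):
        h = horizon_schedule(i)
        # 1) Online data collection at the current horizon.
        trajs = collect_rollouts(policy, env, tasks, n=rollouts_per_iter, max_steps=h)
        # 2) Filtered BC: keep only successful trajectories (reward == 1).
        replay += [(i, t) for t in trajs if t.reward == 1]
        # 3) Recency weighting: data from recent iterations is up-weighted.
        weights = [recency_decay ** (i - it) for it, _ in replay]
        # 4) One epoch of supervised fine-tuning on the weighted buffer
        #    (e.g. AdamW with a small learning rate, vision tower frozen).
        policy = finetune_one_epoch(policy, [t for _, t in replay], weights)
    return policy
```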
Synthetic-task pipeline (an extension of PAE): a generator agent explores real websites and an LLM proposes new tasks; a Gemma-3-27B verifier labels success (≈ 89 % agreement with gold labels).
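A rough sketch of how the task-proposal and verification loop could be wired together; `explore_site`, `propose_tasks_llm`, `rollout`, and `verify_with_llm` are hypothetical wrappers around the generator agent, the task-proposing LLM, and the Gemma-3-27B verifier:

```python
# Synthetic-task pipeline sketch: propose tasks from explored pages, attempt them,
# and let an LLM verifier assign the binary reward used by filtered BC.
# Every helper below is a placeholder, not a real API.

def generate_and_label(policy, env, websites, tasks_per_site=10, max_steps=30):
    labeled = []
    for site in websites:
        page_summaries = explore_site(policy, env, site)            # generator agent explores the site
        tasks = propose_tasks_llm(site, page_summaries, n=tasks_per_site)
        for task in tasks:
            traj = rollout(policy, env, task, max_steps=max_steps)
            # The verifier inspects the task plus the final observation/answer and
            # returns a binary success label (reported ≈89 % agreement with humans).
            traj.reward = 1 if verify_with_llm(task, traj.final_observation, traj.answer) else 0
            labeled.append(traj)
    return labeled
```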
————————————————————————————————————
- Experimental Results
————————————————————————————————————
Model: Gemma-3-12B; training on public synthetic data only.
A. WebVoyager (427 tasks, 13 domains)
  Method             Average SR
  Zero-shot Gemma      55.8 %
  Fixed h = 10         59.1 %
  Fixed h = 30         45.2 %
  TTI (h 10→30)        64.8 %   ← new open-source SOTA
(Highest SR in 8 of 13 domains; +30 % over the best previous open-weight/open-data agent.)
Learning dynamics: trajectory length, GoBack + Bing usage, and SR all rise once the h_max = 30 phase begins, while average CoT per step shrinks; the agent automatically trades per-step reasoning for more interaction.
B. Full WebArena (812 tasks)
  Method          Average SR
  Zero-shot         18.3 %
  Fixed h = 10      23.8 %
  TTI               26.1 %
Best proprietary baselines (GPT-4/Claude) exceed 45 %, but TTI is the top performer among fully open-source, self-improving agents.
C. Test-time re-checks on top of TTI
Adds small further gains; most useful for early-training checkpoints, since later in training TTI has already learned to re-check on its own.
————————————————————————————————————
- Qualitative Analysis
————————————————————————————————————
Strengths
- Complex info-gathering (browse multiple recipes, recover from pop-ups).
- Adaptive: exploits direct paths on simple tasks, explores on hard ones.
Failure modes
* Over-reliance on resetting via “Bing” search instead of recovering within the current site.
* Insufficient final self-verification; e.g. it sometimes submits an answer with the wrong year.
————————————————————————————————————
- Contributions & Take-aways
————————————————————————————————————
- Introduce interaction scaling as an orthogonal axis to per-step compute scaling for LLM agents.
- Empirically demonstrate that, under equal token budget, longer interaction brings larger benefits than deeper reasoning for web tasks.
- Present TTI, a simple curriculum-based online RL framework that steadily enlarges the horizon and yields adaptive exploration behaviour.
- Achieve state-of-the-art results among open-weight/open-data agents on WebVoyager and WebArena using only Gemma-3-12B.
- Release code, 139 K synthetic tasks, and analysis to foster further research.
————————————————————————————————————
- Future Directions (authors)
————————————————————————————————————
- Apply interaction scaling to robotics and other stochastic domains.
- Formalise the trade-off between “thinking tokens” and “interaction steps”.
- Richer RL algorithms (GRPO, PPO, value-based) and memory mechanisms for even longer horizons/contexts.