- The paper introduces interaction scaling, demonstrating that longer test-time interactions yield larger performance gains than deeper per-step reasoning.
- It presents the TTI algorithm, a curriculum-based online RL framework that gradually increases horizon lengths to balance exploitation with exploration.
- Experiments on WebVoyager and WebArena confirm that TTI achieves state-of-the-art results among open-source agents, improving success rates by up to 30%.
Detailed summary of
Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
(Junhong Shen et al., arXiv:2506.07976)
————————————————————————————————————
- Motivation
————————————————————————————————————
- Foundation-model agents (web-browsing agents, terminal bots, robots, etc.) act in partially observable, non-stationary environments.
- “Test-time scaling” today = give the model extra reasoning tokens (CoT, best-of-n, verifiers).
➜ This only re-organises information already observed; the agent still executes the same single action and remains reactive.
- Many tasks instead require gathering new evidence: explore, back-track, re-plan.
Example: hotel booking—must open many listings before deciding.
Key thesis → Scale the length of the interaction (number of environment steps) instead of, or in addition to, per-step thinking.
————————————————————————————————————
- Initial Evidence: Prompt-Only Interaction Scaling
————————————————————————————————————
WebArena subset, Gemma-3-12B, horizon limit h = 30.
Prompting trick (sketched in code at the end of this section): after the agent issues “ANSWER”, append
“Let’s pause and think again …” (double-check, triple-check, …).
Results (CoT baseline 23.8 % SR):
* Double-check ⇒ 26–28 % SR; trajectories ≈ 2× longer.
* Scaling per-step compute instead (forced longer CoT or best-of-n sampling) adds <3 %.
Conclusion → Given a fixed token budget, acting longer yields larger gains than thinking longer, because each extra step can surface new information.
Limitations of “prompt only”: repeated re-checks can confuse the model, and the agent cannot learn fine-grained exploration behaviour within a rollout.
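A minimal sketch of the prompt-only re-check wrapper described above, assuming generic `agent.act` / `env.step` interfaces and an illustrative re-check phrasing (not the paper's released code):

```python
# Prompt-only interaction scaling: intercept early "ANSWER" actions and ask the
# agent to re-check, so the rollout keeps interacting instead of terminating.
# `agent` and `env` are hypothetical interfaces used only for illustration.
RECHECK_PROMPTS = [
    "Let's pause and think again. Double-check your answer before committing.",
    "Let's pause and think again. Triple-check your answer before committing.",
]

def run_with_rechecks(agent, env, task, max_steps=30, max_rechecks=2):
    obs = env.reset(task)
    rechecks_used = 0
    for _ in range(max_steps):
        action = agent.act(obs)
        if action.startswith("ANSWER") and rechecks_used < max_rechecks:
            # Feed a re-check prompt back as part of the next observation, forcing
            # the agent to gather more evidence before committing to an answer.
            obs = obs + "\n" + RECHECK_PROMPTS[rechecks_used]
            rechecks_used += 1
            continue
        obs, reward, done = env.step(action)
        if done:
            return reward
    return 0.0
```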
————————————————————————————————————
- Challenges in Training for Long Horizons
————————————————————————————————————
Attempted online filtered behaviour-cloning (a REINFORCE variant; objective sketched below) with fixed horizons:
h = 5 → fast but overfits, terminates early.
h = 20 → learns very slowly; high-variance credit assignment; clicks randomly.
Need a mechanism that first masters exploitation at short horizons, then gradually introduces exploration at longer horizons.
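In equation form, the filtered-BC update amounts to behaviour cloning restricted to successful rollouts. A minimal sketch of the objective, assuming a binary trajectory reward R(τ) ∈ {0, 1} and task goal g (notation is ours, not copied from the paper):

```latex
% Filtered-BC objective (sketch): imitate only trajectories that achieved reward 1.
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{\tau \sim \pi_\theta}
    \left[ \mathbf{1}\{R(\tau) = 1\}
           \sum_{t=1}^{|\tau|} \log \pi_\theta\!\left(a_t \mid s_t, g\right) \right]
```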
————————————————————————————————————
- TTI (Test-Time Interaction) – Proposed Algorithm
————————————————————————————————————
Curriculum-based online RL built on filtered BC.
Horizon schedule: start at h₀ = h_min = 10, then hᵢ = min(h_max, h_min × i) (multiplicative growth found best).
* Early iterations: short roll-outs ⇒ stable gradients, core skills.
* Later iterations: progressively longer roll-outs ⇒ exploration, back-tracking.
Implementation (Alg. 1 in the appendix; training loop sketched below):
* Collect 512 rollouts per iteration with current horizon hᵢ.
* Keep only successful trajectories (reward = 1).
* Weighted replay buffer (recent data up-weighted).
* One epoch of AdamW fine-tuning (learning rate 1 × 10⁻⁶–4 × 10⁻⁶); vision tower frozen.
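A condensed sketch of the curriculum loop summarised above (growing horizon + filtered BC + recency-weighted replay). Helper names such as `collect_rollouts` and `finetune_one_epoch` are placeholders for the paper's rollout and fine-tuning infrastructure, not Alg. 1 verbatim:

```python
# TTI curriculum sketch: filtered BC on successful rollouts, with the rollout
# horizon growing across iterations. All helpers are hypothetical placeholders.

def horizon_schedule(i, h_min=10, h_max=30):
    """Multiplicative curriculum: the horizon grows with the iteration index, capped at h_max."""
    return min(h_max, h_min * max(i, 1))

def tti_train(policy, env, tasks, num_iters=6, rollouts_per_iter=512, recency_decay=0.5):
    replay = []  # (iteration, trajectory) pairs of successful rollouts
    for i in range(1, num_iters + 1):
        h = horizon_schedule(i)
        # 1) Online data collection at the current horizon.
        trajs = collect_rollouts(policy, env, tasks, n=rollouts_per_iter, max_steps=h)
        # 2) Filtered BC: keep only successful trajectories (reward == 1).
        replay += [(i, t) for t in trajs if t.reward == 1]
        # 3) Recency weighting: data from recent iterations is up-weighted.
        weights = [recency_decay ** (i - it) for it, _ in replay]
        # 4) One epoch of supervised fine-tuning on the weighted buffer
        #    (e.g. AdamW with a small learning rate, vision tower frozen).
        policy = finetune_one_epoch(policy, [t for _, t in replay], weights)
    return policy
```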
Synthetic-task pipeline (an extension of PAE): a generator agent explores real websites and an LLM proposes new tasks; a Gemma-3-27B verifier labels success (≈ 89 % agreement with gold labels).
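A rough sketch of how the task-proposal and verification loop could be wired together; `explore_site`, `propose_tasks_llm`, `rollout`, and `verify_with_llm` are hypothetical wrappers around the generator agent, the task-proposing LLM, and the Gemma-3-27B verifier:

```python
# Synthetic-task pipeline sketch: propose tasks from explored pages, attempt them,
# and let an LLM verifier assign the binary reward used by filtered BC.
# Every helper below is a placeholder, not a real API.

def generate_and_label(policy, env, websites, tasks_per_site=10, max_steps=30):
    labeled = []
    for site in websites:
        page_summaries = explore_site(policy, env, site)            # generator agent explores the site
        tasks = propose_tasks_llm(site, page_summaries, n=tasks_per_site)
        for task in tasks:
            traj = rollout(policy, env, task, max_steps=max_steps)
            # The verifier inspects the task plus the final observation/answer and
            # returns a binary success label (reported ≈89 % agreement with humans).
            traj.reward = 1 if verify_with_llm(task, traj.final_observation, traj.answer) else 0
            labeled.append(traj)
    return labeled
```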
————————————————————————————————————
- Experimental Results
————————————————————————————————————
Model: Gemma-3-12B; training on public synthetic data only.
A. WebVoyager (427 tasks, 13 domains)
  Method             Average SR
  Zero-shot Gemma      55.8 %
  Fixed h = 10         59.1 %
  Fixed h = 30         45.2 %
  TTI (h 10→30)        64.8 %   ← new open-source SOTA
(Highest SR in 8 of 13 domains; +30 % over the best previous open-weight/open-data agent.)
Learning dynamics: trajectory length, GoBack + Bing usage, and SR all rise once the h_max = 30 phase begins, while average CoT per step shrinks; the agent automatically trades per-step reasoning for more interaction.
B. Full WebArena (812 tasks)
  Method          Average SR
  Zero-shot         18.3 %
  Fixed h = 10      23.8 %
  TTI               26.1 %
Best proprietary baselines (GPT-4/Claude) exceed 45 %, but TTI is the top performer among fully open-source, self-improving agents.
C. Test-time re-checks on top of TTI
Adds small further gains; most useful for early-training checkpoints, since later in training TTI has already learned to re-check on its own.
————————————————————————————————————
- Qualitative Analysis
————————————————————————————————————
Strengths
- Complex info-gathering (browse multiple recipes, recover from pop-ups).
- Adaptive: exploits direct paths on simple tasks, explores on hard ones.
Failure modes
* Over-reliance on resetting via “Bing” search instead of recovering within the current site.
* Insufficient final self-verification; e.g. it sometimes submits an answer with the wrong year.
————————————————————————————————————
- Contributions & Take-aways
————————————————————————————————————
- Introduce interaction scaling as an orthogonal axis to per-step compute scaling for LLM agents.
- Empirically demonstrate that, under equal token budget, longer interaction brings larger benefits than deeper reasoning for web tasks.
- Present TTI, a simple curriculum-based online RL framework that steadily enlarges the horizon and yields adaptive exploration behaviour.
- Achieve state-of-the-art results among open-weight/open-data agents on WebVoyager and WebArena using only Gemma-3-12B.
- Release code, 139 K synthetic tasks, and analysis to foster further research.
————————————————————————————————————
- Future Directions (authors)
————————————————————————————————————
- Apply interaction scaling to robotics and other stochastic domains.
- Formalise the trade-off between “thinking tokens” and “interaction steps”.
- Richer RL algorithms (GRPO, PPO, value-based) and memory mechanisms for even longer horizons/contexts.