Papers
Topics
Authors
Recent
Search
2000 character limit reached

SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

Published 5 Jun 2026 in cs.LG and cs.AI | (2606.07074v1)

Abstract: Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning-generating long, redundant trajectories that are far from necessary for resolving these tasks, leading to wasteful tool calls and excessive token consumption. To overcome this efficiency trap, we propose SlimSearcher, a principled framework that pushes the Pareto frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, SlimSearcher employs Pareto-efficient filtration to distill trajectories that are both successful and economical, guiding the model toward inherently efficiency-aware search behaviors. During RL, we introduce Adaptive Reward Gating, a dynamic reward-shaping mechanism that evaluates relative tool and token efficiency within a sampled cohort. By cascading these adaptive efficiency metrics with a strict correctness gate, our approach effectively avoids the brevity bias associated with absolute penalties and mitigates reward hacking. Extensive experiments on long-horizon benchmarks, including GAIA, BrowseComp, and XBenchDeepSearch, demonstrate that SlimSearcher reduces average tool-call rounds by 17%-58% while maintaining or improving accuracy.

Summary

  • The paper proposes a novel adaptive reward gating mechanism integrating SFT and RL to minimize tool calls while guaranteeing correctness.
  • It employs multi-stage gating (correctness, tool efficiency, and token efficiency) to filter redundant trajectories and enforce resource efficiency.
  • Experiments show up to 58% reduction in tool calls and 33.4% token compression, maintaining or improving accuracy across multiple benchmarks.

SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

Motivation: Efficiency Collapse in Tool-Integrated Web Agents

Modern LLM-based web agents achieve high accuracy on complex information-seeking tasks via extensive tool integration (web search, browsing, code execution). However, accuracy-centric training paradigms induce a severe efficiency collapse in both open and proprietary systems, manifesting as blind tool dependency and performative reasoning. These agents invoke tools indiscriminately—even for answerable common-sense queries—and generate redundant chains of reasoning characterized by loops and dead-end branches, leading to excessive computational and time overhead with hundreds of tool calls per query. Figure 1

Figure 1: Behavioral analysis highlighting blind tool dependency and performative reasoning in agent trajectories, leading to wasted tool calls and increased latency.

Pareto-optimality between accuracy and resource expenditure is unaddressed: standard SFT admits any correct (including bloated) trajectory, and existing RL approaches focus solely on task completion, thereby internalizing highly redundant behavior. Naive prompt engineering or token-penalty RL fails to systematically prune actions—external tool invocations dominate real agent resource consumption.

SlimSearcher Framework

Multi-Stage Gating: Unified SFT and RL Optimization

SlimSearcher introduces a principled framework to push the efficiency-accuracy Pareto frontier, integrating efficiency optimization at both SFT and RL stages. The central mechanism is a hierarchical Multi-Stage Gating system:

  • Gate 1 (Correctness): Strict binary constraint, rcorrectr_{correct}, ensures reward is zero for any trajectory not matching ground-truth, preventing reward hacking and brevity bias.
  • Gate 2 (Tool Efficiency): Adaptive reward anchoring (AEA) exponentially penalizes tool overuse relative to the empirically minimal observed path within trajectory cohorts, providing task-adaptive gradients.
  • Gate 3 (Token Efficiency): Parallel anchoring mechanism penalizes excessive token usage, further enforcing conciseness and suppressing performative reasoning.

Reward is computed as:

Rfinal(Ï„)=rcorrect(Ï„)â‹…rtool(Ï„)â‹…rlen(Ï„)R_{final}(\tau) = r_{correct}(\tau) \cdot r_{tool}(\tau) \cdot r_{len}(\tau)

Efficiency-Aware SFT via Pareto Filtering

SlimSearcher implements reward-guided rejection sampling for SFT, selecting a single trajectory for each query that maximizes joint tool and token efficiency, forming a high-signal corpus that distills the Minimal Necessary Path. This approach directly purges poisoned or redundant trajectories from early model initialization.

RL Policy Optimization

GRPO-based RL is employed, using the group-standardized advantage beneath the Multi-Stage Gating reward to establish policy gradients. Each trajectory is compared to the empirical cohort minimum, anchoring learning to the true resource-optimal path. Figure 2

Figure 2: SlimSearcher framework with multi-dimensional filtering in SFT and adaptive reward gating during RL unifying accuracy and efficiency objectives.

Experimental Results

Pareto Frontier Shift: Efficiency Gains Without Accuracy Sacrifice

Extensive evaluation was performed across four benchmarks—XBench-DeepSearch, BrowseComp, GAIA, HLE—using both Tongyi-DeepResearch and Qwen3-30B-A3B-Instruct as backbones. SlimSearcher achieves 17%–58% reduction in tool-call rounds and substantial token compression with equal or higher accuracy compared to baseline SFT or prompt-engineered baselines.

On GAIA, SlimSearcher (SFT+RL) reduces tool-call rounds by 48.4% (from 20.56 to 10.61) and token usage by 33.4%, increasing accuracy from 0.682 to 0.709 on Tongyi backbone. Similar trends are observed for BrowseComp and HLE. Baselines (Prompt Control, DeepSeek-V3) either collapse accuracy or fail to consistently improve efficiency. Figure 3

Figure 3: Comparison of tool-call distribution and cumulative accuracy. SlimSearcher achieves steeper accuracy curves with compact action spaces.

Robustness and Ablations

  • Reward-Guided Rejection Sampling: Outperforms standard SFT (GAIA: accuracy rise ∼\sim0.665, tool rounds drop).
  • Correctness Gate: Its removal destroys accuracy, causing reward hacking and vacuous outputs (GAIA: accuracy plummets, tool-rounds near-zero).
  • Adaptive Efficiency Anchoring: Without AEA, brute-force exploration emerges—tool-call rounds inflate, token consumption increases, and accuracy stagnates.

Qualitative case comparisons demonstrate SlimSearcher’s strategic minimal reasoning traces, solving tasks in a few tool invocations, while baselines (e.g., MiroThinker) require up to 50×50\times more tool calls or fail entirely.

Implications and Future Directions

SlimSearcher advances data-driven training protocol design for web agents, representing a scalable solution for discipline in both computational and operational efficiency without accuracy regression. The Adaptive Reward Gating paradigm enables real-time anchoring to empirical minimal paths, an approach extensible to multimodal domains and dynamic action weighting.

Practical implications include tractable deployment of open-source agents at scale, decreased infrastructure cost, and improved latency for long-horizon research scenarios. Theoretical developments may extend toward cost-sensitive reward design, adaptive sampling in RL, and generalization to visual environments with multimodal redundancy.

Future work should tackle multimodal efficiency (visual web tasks), domain adaptation where initial empirical anchors are scarce, and fine-grained tool weighting to align with deployment economics.

Conclusion

SlimSearcher systematically overcomes the efficiency trap in LLM-based web agents, coupling efficiency-aware SFT and adaptive RL under a unified gating objective. This produces agents that are not only accurate, but also computationally disciplined, demonstrating clear Pareto improvements in real-world information-seeking tasks. The framework provides a robust foundation for future agentic training paradigms targeting scalable, efficient, and versatile autonomous web research.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 18 likes about this paper.