SlimSearcher: Efficient Web Research Agent

Updated 4 July 2026

SlimSearcher is a training framework and web agent that optimizes long-horizon research tasks using a correctness-first, efficiency-aware gating mechanism.
It integrates multi-stage gating in both supervised fine-tuning and reinforcement learning to select Pareto-efficient trajectories based on minimal tool calls and token usage.
Reported evaluations show 17%–58% fewer tool-call rounds and lower token consumption, with accuracy maintained or improved, on XBench‑DeepSearch, BrowseComp, GAIA, and HLE.

SlimSearcher is a training framework and resulting web agent designed for long-horizon web research tasks in which accuracy and computational cost are optimized jointly rather than sequentially. Its central objective is the “Minimal Necessary Path”: just enough tool calls and reasoning tokens to reach a correct answer, and no more. In the formulation introduced by “SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating” (Xie et al., 5 Jun 2026), this objective is implemented across both supervised fine-tuning and reinforcement learning through a correctness-first, efficiency-aware gating mechanism that explicitly targets the Pareto frontier between task success and search cost.

1. Problem setting and efficiency objective

SlimSearcher is formulated for long-horizon web research tasks in which an agent must answer a question by interacting with tools such as search, page visitation, file parsing, bibliographic search, and code execution. The evaluated environment includes search, visit, google_scholar, PythonInterpreter, and parse_file, and the reported benchmarks are XBench‑DeepSearch, BrowseComp, GAIA, and HLE, with HLE evaluated on 500 text-only questions (Xie et al., 5 Jun 2026).

The motivating diagnosis is an “efficiency trap” in deep research agents. Two failure modes are emphasized. The first is blind tool dependency, in which agents call web search, browsers, and other tools even when a question is answerable from internal knowledge. The second is performative reasoning, in which agents generate long, redundant trajectories with loops, repeated verification, and dead ends that do not improve the answer but do increase token consumption and tool usage. Accuracy-only rejection sampling in supervised pipelines and success-only reinforcement learning are identified as the training regimes that induce this behavior (Xie et al., 5 Jun 2026).

The system is evaluated under three explicit metrics. Accuracy (Acc) measures correctness of the final answer, typically by automatic or LLM-based judging. Tool-call rounds (Rounds) count the number of external tool invocations per query. Token usage (Token) counts model-generated reasoning tokens per query. SlimSearcher’s target is therefore not merely higher answer quality, but higher answer quality under lower Rounds and Token budgets (Xie et al., 5 Jun 2026).
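The three metrics can be aggregated per benchmark along the following lines. This is a minimal sketch rather than the paper's evaluation harness: the `EpisodeLog` record and its field names are hypothetical, and the correctness verdict is assumed to come from an upstream judge.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EpisodeLog:
    """Hypothetical per-query record; field names are illustrative."""
    correct: bool          # verdict of the automatic or LLM-based judge
    tool_calls: int        # external tool invocations (Rounds)
    reasoning_tokens: int  # model-generated reasoning tokens (Token)


def summarize(logs):
    """Return (Acc, mean Rounds, mean Token) over a list of episodes."""
    acc = mean(1.0 if log.correct else 0.0 for log in logs)
    rounds = mean(log.tool_calls for log in logs)
    tokens = mean(log.reasoning_tokens for log in logs)
    return acc, rounds, tokens


logs = [
    EpisodeLog(correct=True, tool_calls=4, reasoning_tokens=1200),
    EpisodeLog(correct=False, tool_calls=10, reasoning_tokens=3000),
]
acc, rounds, tokens = summarize(logs)
```

Reporting all three numbers side by side, rather than folding them into one score, is what makes a simultaneous accuracy and cost comparison possible.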

A plausible implication is that SlimSearcher belongs to a broader reorientation of web-agent training in which the object of optimization is no longer “solve if possible,” but “solve with the least sufficient externalization of computation.” In SlimSearcher, however, that reorientation is not expressed as a prompt-level heuristic; it is encoded directly into data selection and reward design.

2. Multi-stage gating and correctness-first valuation

The core mechanism is a hierarchical valuation system called multi-stage gating. For a trajectory $\tau$, SlimSearcher defines

$R_{\text{final}}(\tau)=r_{\text{correct}}(\tau)\cdot r_{\text{tool}}(\tau)\cdot r_{\text{len}}(\tau).$

The three multiplicative gates are a correctness gate, a tool-efficiency gate, and a token-efficiency gate (Xie et al., 5 Jun 2026). The structure is consequential: only correct trajectories are allowed to compete on efficiency. This prevents the standard failure mode in which absolute length or tool penalties encourage premature termination or under-exploration.

The correctness gate is binary. A trajectory receives nonzero reward only if its final answer matches the ground truth. The efficiency gates are then computed relative to a cohort of correct trajectories for the same query. Tool efficiency is based on the tool cost

$C(\tau_i)=\sum_{a\in \mathcal{A}(\tau_i)} w_{\text{type}(a)},$

with $w_{\text{type}(a)}=1$ in the reported experiments. Token efficiency is based on $L(\tau_i)$, the number of model-generated reasoning tokens. For each query, SlimSearcher computes the empirical minima $C_{\min}$ and $L_{\min}$ among correct trajectories, then scores other correct trajectories by their relative deviation from those minima (Xie et al., 5 Jun 2026).
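The tool cost and the two cohort minima can be sketched as follows. The trajectory representation (a dict with an `actions` list of tool-type names and a `tokens` count) is an assumption made for illustration; only the sum over per-type weights and the uniform weight of 1 come from the source.

```python
def tool_cost(actions, weights=None):
    """C(tau): sum of per-type weights over the tool calls in a trajectory.

    Types missing from `weights` default to 1, matching the uniform
    setting used in the reported experiments.
    """
    weights = weights or {}
    return sum(weights.get(action, 1) for action in actions)


def cohort_minima(correct_trajectories):
    """Return (C_min, L_min) over the correct trajectories of one query."""
    c_min = min(tool_cost(t["actions"]) for t in correct_trajectories)
    l_min = min(t["tokens"] for t in correct_trajectories)
    return c_min, l_min


# Hypothetical cohort of two correct trajectories for the same query.
correct_trajectories = [
    {"actions": ["search", "visit", "search"], "tokens": 900},
    {"actions": ["search"], "tokens": 1500},
]
c_min, l_min = cohort_minima(correct_trajectories)
```

In this toy cohort the two minima come from different trajectories, so no single trajectory necessarily attains both anchors at once.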

The paper terms the RL variant of this mechanism Adaptive Reward Gating, and also refers to the same idea as Adaptive Efficiency Anchoring. The critical detail is that both tool and token efficiency are cohort-relative rather than absolute. Efficiency is thus judged against the current best correct behavior discovered for that specific query, not against a global fixed penalty schedule. This avoids brevity bias on difficult tasks: if a task genuinely requires a long chain of actions, the anchor is long as well, so the reward does not force the policy below the empirical minimal correct path (Xie et al., 5 Jun 2026).

This design also explains why SlimSearcher treats efficiency as a co-primary objective rather than a post hoc constraint. The final reward is not “accuracy minus cost”; it is accuracy gated by relative efficiency.
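A minimal sketch of the multiplicative reward is given below. The source states only that the efficiency gates score correct trajectories by their relative deviation from the cohort minima; the specific ratio used in `relative_gate`, including its `+1` smoothing, is an assumption chosen for illustration and is not the paper's formula.

```python
def relative_gate(cost, best):
    """Assumed gate form: 1.0 at the cohort minimum, decaying as cost grows.

    The +1 smoothing is an assumption that keeps the ratio defined
    when the best correct trajectory uses zero tool calls.
    """
    return (best + 1) / (cost + 1)


def final_rewards(cohort):
    """R_final for each trajectory in one query's cohort.

    Each trajectory is a dict with 'correct', 'tool_cost', and 'tokens'.
    """
    correct = [t for t in cohort if t["correct"]]
    if not correct:
        return [0.0] * len(cohort)  # no anchor exists, every reward is zero
    c_min = min(t["tool_cost"] for t in correct)
    l_min = min(t["tokens"] for t in correct)
    rewards = []
    for t in cohort:
        r_correct = 1.0 if t["correct"] else 0.0
        r_tool = relative_gate(t["tool_cost"], c_min)
        r_len = relative_gate(t["tokens"], l_min)
        rewards.append(r_correct * r_tool * r_len)
    return rewards


cohort = [
    {"correct": True, "tool_cost": 3, "tokens": 999},
    {"correct": True, "tool_cost": 7, "tokens": 1999},
    {"correct": False, "tool_cost": 0, "tokens": 99},
]
rewards = final_rewards(cohort)  # [1.0, 0.25, 0.0]
```

The wrong trajectory is the cheapest in the cohort, yet it receives zero reward and does not influence the anchors, which is the sense in which only correct trajectories compete on efficiency.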

3. Pareto-efficient supervised fine-tuning

SlimSearcher’s supervised stage uses Pareto-efficient filtration to build an efficiency-aware corpus before optimization begins. The SFT source pool contains 13,863 trajectories drawn from prior web-agent datasets (ASearcher, TaskCraft, WebWalker, WebShaper, REDSearcher, WebDancer, Voyager, and MegaScience) as well as synthetic deep research tasks generated in the style of WebSailor and WebDancer (Xie et al., 5 Jun 2026).

Queries are first screened by pass rate in a baseline environment, Tongyi DeepResearch. Each query is run four times, and only those with $0<\mathrm{PR}(q)<1$ are retained, so the training set excludes both trivial and impossible tasks. For every retained query, the system generates $K$ candidate trajectories from the base model and filters them through the correctness gate. The valid set is therefore the subset of candidate trajectories whose final answer is correct (Xie et al., 5 Jun 2026).
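The pass-rate screen can be sketched as below. The query identifiers and the boolean outcome lists are hypothetical; the four runs per query and the retention rule $0<\mathrm{PR}(q)<1$ follow the description above.

```python
def pass_rate(outcomes):
    """Fraction of baseline runs that produced a correct answer."""
    return sum(outcomes) / len(outcomes)


def screen_queries(baseline_outcomes):
    """Keep only queries with 0 < PR(q) < 1.

    `baseline_outcomes` maps a query id to a list of booleans,
    one per baseline run (four runs in the source).
    """
    return [q for q, runs in baseline_outcomes.items()
            if 0 < pass_rate(runs) < 1]


baseline_outcomes = {
    "q_trivial": [True, True, True, True],         # PR = 1.0, dropped
    "q_useful": [True, False, False, True],        # PR = 0.5, kept
    "q_impossible": [False, False, False, False],  # PR = 0.0, dropped
}
kept = screen_queries(baseline_outcomes)
```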

Within that valid cohort, SlimSearcher identifies the Minimal Necessary Path as

$\tau^*=\underset{\tau\in \mathcal{T}^{\text{valid}}_q}{\arg\max}\Big(r_{\text{tool}}(\tau)\times r_{\text{len}}(\tau)\Big).$

The final SFT dataset contains exactly these selected trajectories, one per retained query, after Pareto-efficient filtration (Xie et al., 5 Jun 2026).
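The selection step can be illustrated with a short sketch. As before, the gate form in `relative_gate` is an assumed stand-in for the paper's relative-deviation scoring, and the candidate records are hypothetical.

```python
def relative_gate(cost, best):
    """Assumed gate form (not from the paper): 1.0 at the cohort minimum."""
    return (best + 1) / (cost + 1)


def minimal_necessary_path(candidates):
    """Pick the correct candidate maximizing r_tool * r_len, or None."""
    valid = [t for t in candidates if t["correct"]]
    if not valid:
        return None  # the query contributes nothing to the SFT set
    c_min = min(t["tool_cost"] for t in valid)
    l_min = min(t["tokens"] for t in valid)
    return max(
        valid,
        key=lambda t: relative_gate(t["tool_cost"], c_min)
        * relative_gate(t["tokens"], l_min),
    )


candidates = [
    {"id": "a", "correct": True, "tool_cost": 2, "tokens": 3000},
    {"id": "b", "correct": True, "tool_cost": 5, "tokens": 1000},
    {"id": "c", "correct": False, "tool_cost": 1, "tokens": 500},
    {"id": "d", "correct": True, "tool_cost": 3, "tokens": 1200},
]
selected = minimal_necessary_path(candidates)
```

Here the selected trajectory is `d`: it has neither the fewest tool calls (`a`) nor the fewest tokens (`b`), but it gives the best product of the two gates, while the cheap but incorrect `c` never enters the comparison.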

This differs materially from ordinary rejection sampling. Standard rejection SFT accepts any correct trajectory; SlimSearcher accepts only correct trajectories that are also near the efficiency Pareto frontier. The empirical effect is visible in the reported ablation on GAIA: reward-guided filtering raises accuracy from 0.641 to 0.665 while reducing average rounds from 25.90 to 24.46 and tokens from 7478 to 7299 (Xie et al., 5 Jun 2026). In the paper’s interpretation, the training distribution itself is the mechanism by which efficiency-aware behavior is taught.

4. Reinforcement learning with Adaptive Reward Gating

After SFT, SlimSearcher applies RL with a GRPO-style algorithm. The policy is the SFT-initialized LLM; the environment is the tool-backed web interface; each episode runs from question to final answer or to a fixed step or time limit. The RL corpus contains 1,510 deep research QA pairs from Voyager, WebShaper, REDSearcher, WebDancer, and MegaScience (Xie et al., 5 Jun 2026).

For each query, the policy samples a cohort of $G$ trajectories $\{\tau_i\}_{i=1}^{G}$ from the old policy. Each trajectory receives the same multiplicative reward as in the valuation system,

$R_{\text{final}}(\tau_i)=r_{\text{correct}}(\tau_i)\cdot r_{\text{tool}}(\tau_i)\cdot r_{\text{len}}(\tau_i),$

and the reward is standardized within the group to produce a group-relative advantage

$\hat{A}_i=\frac{R_{\text{final}}(\tau_i)-\operatorname{mean}\big(\{R_{\text{final}}(\tau_j)\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{R_{\text{final}}(\tau_j)\}_{j=1}^{G}\big)}.$

The policy is then updated with a PPO-like clipped objective under GRPO (Xie et al., 5 Jun 2026).
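The group standardization and the clipped surrogate can be sketched in a few lines. This follows the generic GRPO and PPO recipe, not the paper's training code: the `eps` stabilizer, the use of the population standard deviation, and the clip range of 0.2 are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one query's cohort.

    `eps` (an assumption) avoids division by zero when every
    trajectory in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio, advantage, clip=0.2):
    """PPO-style clipped surrogate for one sample.

    `ratio` is new-policy probability over old-policy probability;
    the clip range of 0.2 is a common default, assumed here.
    """
    clipped_ratio = max(1 - clip, min(1 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Gated rewards for a hypothetical cohort of four trajectories.
advantages = group_advantages([1.0, 0.25, 0.0, 0.0])
no_signal = group_advantages([0.0, 0.0, 0.0, 0.0])
```

When no trajectory in a cohort is correct, every gated reward is zero and so is every advantage, so that query contributes no learning signal in this sketch.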

Two ablations are central to the interpretation of this RL stage. Removing the correctness gate produces a severe collapse: on GAIA, rounds fall to 0.07, but accuracy falls to 0.136. This is the canonical “short but wrong” attractor. Removing Adaptive Efficiency Anchoring and replacing it with static cost penalties moves the model back toward brute-force exploration: on HLE, rounds increase from 19.51 to 31.05, and on GAIA, accuracy drops from 0.699 to 0.641 while rounds increase from 20.13 to 25.42 (Xie et al., 5 Jun 2026).
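The “short but wrong” attractor can be reproduced in a toy setting. The sketch below is an illustration under assumptions, not the paper's ablation code: the gate form is assumed, and removing the correctness gate is modeled by letting every trajectory, correct or not, set the anchors and collect efficiency reward.

```python
def relative_gate(cost, best):
    """Assumed gate form (not from the paper): 1.0 at the cohort minimum."""
    return (best + 1) / (cost + 1)


def rewards(cohort, use_correctness_gate=True):
    """Efficiency reward with or without the correctness gate."""
    pool = [t for t in cohort if t["correct"]] if use_correctness_gate else cohort
    if not pool:
        return [0.0] * len(cohort)
    c_min = min(t["tool_cost"] for t in pool)
    l_min = min(t["tokens"] for t in pool)
    out = []
    for t in cohort:
        gate = 1.0 if (t["correct"] or not use_correctness_gate) else 0.0
        out.append(gate
                   * relative_gate(t["tool_cost"], c_min)
                   * relative_gate(t["tokens"], l_min))
    return out


cohort = [
    {"correct": False, "tool_cost": 0, "tokens": 99},  # guesses immediately
    {"correct": True, "tool_cost": 3, "tokens": 999},  # researches, then answers
]
gated = rewards(cohort, use_correctness_gate=True)
ungated = rewards(cohort, use_correctness_gate=False)
```

With the gate, the correct trajectory receives the full reward and the guess receives nothing; without it, the ordering flips and the zero-tool guess becomes the most rewarded behavior.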

The implementation details are consistent with long-horizon agent training. SFT uses ms‑SWIFT; RL uses rLLM; rollout and inference use vLLM with tensor parallelism. RL is trained with batch size 64 and prompt length up to 8,000 tokens, response length up to 120,000 tokens, eight responses per prompt at temperature 1.0, a maximum of 100 environment steps, and a timeout of 7,200 seconds (Xie et al., 5 Jun 2026). The tools are concrete rather than abstract: search is Serper-based Google search, visit uses Jina Reader for URL parsing and summarization, and google_scholar, PythonInterpreter, and parse_file provide bibliographic, computational, and local-file interfaces.
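For reference, the reported RL settings can be collected in a single configuration object. The class and field names below are hypothetical and do not correspond to actual rLLM or vLLM options; only the values come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RLConfig:
    """Reported RL settings; field names are illustrative only."""
    batch_size: int = 64
    max_prompt_tokens: int = 8_000
    max_response_tokens: int = 120_000
    responses_per_prompt: int = 8
    temperature: float = 1.0
    max_env_steps: int = 100
    timeout_seconds: int = 7_200


cfg = RLConfig()
# Assumption: if the batch size counts prompts, one batch yields
# batch_size * responses_per_prompt rollouts.
rollouts_per_batch = cfg.batch_size * cfg.responses_per_prompt
timeout_hours = cfg.timeout_seconds / 3600
```

Under that reading, a batch corresponds to 512 rollouts, each allowed up to 100 environment steps and a two-hour timeout.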

5. Empirical performance and behavioral profile

Across GAIA, BrowseComp, XBench‑DeepSearch, and HLE, SlimSearcher is reported to reduce average tool-call rounds by 17%–58% while maintaining or improving accuracy (Xie et al., 5 Jun 2026). The strongest single comparison in the paper is the Tongyi‑DeepResearch backbone against its SlimSearcher-trained counterpart.

Benchmark	Tongyi‑DeepResearch	SlimSearcher (SFT+RL)
GAIA	Acc 0.682, Rounds 20.56, Tokens 7378	Acc 0.709, Rounds 10.61, Tokens 4915
BrowseComp	Acc 0.410, Rounds 63.70, Tokens 12014	Acc 0.447, Rounds 47.63, Tokens 11093
XBench	Acc 0.713, Rounds 14.26	Acc 0.790, Rounds 5.92
HLE	Acc 0.358, Rounds 23.92	Acc 0.376, Rounds 19.82

These results show simultaneous gains in effectiveness and efficiency rather than a simple cost-accuracy trade (Xie et al., 5 Jun 2026). A second backbone, Qwen3‑30B‑A3B‑Instruct‑2507, shows the same pattern. The untrained Qwen3‑30B‑A3B‑Instruct‑2507 model is reported to be weak as a web agent; after SlimSearcher SFT it improves substantially, and after SlimSearcher SFT+RL it reaches XBench accuracy 0.770 with rounds reduced from 15.04 to 10.87, and GAIA accuracy 0.699 with 20.13 rounds (Xie et al., 5 Jun 2026).
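The relative reductions implied by the Tongyi‑DeepResearch comparison can be checked directly from the Rounds values in the table. The snippet below is simple arithmetic on those reported numbers; the variable names are illustrative.

```python
# (baseline rounds, SlimSearcher rounds) from the comparison table.
rounds_table = {
    "GAIA": (20.56, 10.61),
    "BrowseComp": (63.70, 47.63),
    "XBench": (14.26, 5.92),
    "HLE": (23.92, 19.82),
}


def reduction_pct(before, after):
    """Relative reduction in percent."""
    return 100 * (before - after) / before


reductions = {
    name: round(reduction_pct(before, after), 1)
    for name, (before, after) in rounds_table.items()
}
# GAIA 48.4, BrowseComp 25.2, XBench 58.5, HLE 17.1
```

The smallest and largest values, about 17% on HLE and about 58% on XBench, are in line with the reported 17%–58% range.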

The behavioral analyses are equally central. On GAIA, in a task requiring identification of a dissertation footnote reference, its relation to Smithsonian paintings, and arithmetic over chapter numbers, SlimSearcher uses 22 tool calls and 15 searches, whereas MiroThinker uses 288 tool calls and 134 searches to reach the same answer. On an XBench query about a dish from the Northern and Southern Dynasties, SlimSearcher solves the task with 2 tool calls, while MiroThinker uses 102 tool calls. On another XBench query about the number of prefecture-level cities in three Chinese provinces bordering foreign countries, SlimSearcher uses 3 tool calls, whereas MiroThinker uses 403 and fails after hitting the turn limit (Xie et al., 5 Jun 2026).

The reported histograms and cumulative accuracy curves support the same interpretation. On GAIA and XBench, SlimSearcher compresses the long tail of 50–100+ tool calls into a concentration under roughly 20–40 rounds while preserving competitive or higher accuracy. The authors attribute this to a learned preference for high-yield information sources, reduced verification loops, and the ability to skip tools when internal knowledge suffices (Xie et al., 5 Jun 2026).

The PromptControl baseline addresses a common misconception, namely that efficiency can be obtained by prompting alone. Merely instructing the model to “please be efficient” reduces tool calls slightly in some settings but hurts accuracy and does not reproduce SlimSearcher’s Pareto improvements (Xie et al., 5 Jun 2026).

6. Relation to adjacent search systems and stated limitations

SlimSearcher is part of a broader literature on deep search agents, but its specific contribution is the unification of efficiency-aware SFT and efficiency-aware RL under the same multi-stage gating logic. This contrasts with systems that rely primarily on high-quality SFT trajectories. “SimpleDeepSearcher” demonstrates that SFT on only 871 curated real-web trajectories can outperform several RL-based deep search baselines, and explicitly presents RL as optional post-SFT refinement (Sun et al., 22 May 2025). SlimSearcher instead makes efficiency a co-primary training objective in both stages (Xie et al., 5 Jun 2026).

It also differs from architecture-centered approaches. “Flash-Searcher” reorients deep search around DAG-based parallel execution, goal decomposition, and dynamic workflow optimization, reporting 67.7% accuracy on BrowseComp and 83% on xbench-DeepSearch with up to 35% fewer execution steps (Qin et al., 29 Sep 2025). “SimpleSearch‑VL” approaches multimodal deep search through Factorized Adaptive Rollout, evidence-verified reasoning, and self-summary within the agent, using only 5K supervised tool-interleaved trajectories and 2K RL prompts (Dai et al., 30 Jun 2026). “VSearcher” focuses on long-horizon multimodal search with text search, image search, and browsing trained via SFT then RL (Zhang et al., 3 Mar 2026). SlimSearcher’s novelty lies elsewhere: it is neither principally a data-engineering recipe, nor principally a DAG scheduler, nor principally a multimodal tool-use framework, but a correctness-gated efficiency training framework (Xie et al., 5 Jun 2026).

The limitations stated for SlimSearcher are specific. On extremely niche tasks where the SFT stage never sees a correct, efficient trajectory, the RL anchoring mechanism may lack a useful empirical reference. The current tool cost model uses uniform weights, even though real deployments may incur heterogeneous latencies and monetary costs across tools. The framework is text-only, so extension to visual web content would require explicit modeling of image-processing cost inside the efficiency reward (Xie et al., 5 Jun 2026).

These limits clarify the scope of the contribution. SlimSearcher does not claim that all deep search efficiency problems reduce to shorter chain-of-thought or fewer API calls in isolation. Its claim is narrower and more technical: efficiency-aware training, grounded in Pareto-efficient trajectory selection and adaptive reward gating, can move the practical Pareto frontier of long-horizon web agents without sacrificing correctness (Xie et al., 5 Jun 2026).