
Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL (2508.07976v2)

Published 11 Aug 2025 in cs.CL and cs.AI

Abstract: Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. <=10, restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 46.7% and 20.8% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 40 turns and output tokens exceeding 150k during training time. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA, surpassing existing open-source 32B agents. We open-source our models, training data, and codes in https://github.com/inclusionAI/ASearcher.

Summary

  • The paper introduces the ASearcher framework, which overcomes restrictive turn limits and data bottlenecks using fully asynchronous RL.
  • It integrates a scalable QA synthesis pipeline that produces 134k diverse QA samples, enabling robust multi-tool reasoning and cross-document inference.
  • Empirical results demonstrate significant performance gains with up to 128-turn trajectories, achieving state-of-the-art accuracies on benchmarks like GAIA and xBench-DeepSearch.

Large-Scale Asynchronous RL for Long-Horizon Agentic Search: The ASearcher Framework

Introduction

The paper "Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL" (2508.07976) addresses the limitations of current open-source LLM-based search agents in achieving expert-level Search Intelligence. The authors identify two primary bottlenecks: (1) restrictive turn limits in RL training, which prevent the emergence of complex, multi-step search strategies, and (2) the lack of large-scale, high-quality, and challenging QA datasets necessary for robust agentic RL. To overcome these, the paper introduces ASearcher, an open-source framework that combines fully asynchronous RL training with a scalable, autonomous QA synthesis pipeline, enabling agents to perform long-horizon, multi-tool search and reasoning.

Limitations of Existing Approaches

Current open-source agentic RL systems typically cap the number of search turns per trajectory at 10 or fewer, severely constraining the agent's ability to learn deep, multi-step search strategies. This is particularly problematic for queries with high uncertainty, conflicting information, or a need for cross-document inference. The paper provides a detailed case study on a GAIA benchmark query, demonstrating that existing agents (e.g., Search-R1-32B) fail to decompose complex queries, hallucinate unsupported conclusions, and cannot resolve all unknowns due to short tool-use horizons (Figure 1).

Figure 1: A case study on a complex query from GAIA, illustrating the failure modes of Search-R1-32B and Search-o1 (QwQ) compared to the uncertainty-aware, cross-document reasoning of ASearcher-Web-QwQ.

Prompt-based LLM agents, while capable of extensive tool calls, are limited by the underlying model's extraction and verification capabilities. They often miss key information and fail to verify conclusions, even when the correct answer is present in the retrieved content. In contrast, ASearcher-Web-QwQ demonstrates uncertainty-aware reasoning, precise extraction from noisy web content, cross-document inference, and grounded verification (Figure 2).

Figure 2: Another GAIA case study showing ASearcher-Web-QwQ's ability to resolve multiple unknowns and conditions through precise queries and stepwise deduction, outperforming baselines.

Agent Design

ASearcher employs a minimalistic agent architecture with two core tools: a search engine and a web browser. The agent is responsible for both reasoning and summarizing lengthy web content, with all components optimized end-to-end via RL. For base LLMs (e.g., Qwen2.5-7B/14B), an append-only prompting strategy is used, while for advanced LRMs (e.g., QwQ-32B), the agent maintains a compact, instruction-following prompt history to ensure efficient token generation within context length constraints (Figure 3).

Figure 3: Comparison between ASearcher and Search-R1, highlighting ASearcher's unified reasoning and summarization capabilities and its independence from external LLMs.
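
A minimal sketch of this loop is shown below, assuming hypothetical callables for the model and the two tools; it illustrates the single-model reasoning-plus-summarization design rather than the repository's actual interfaces.

```python
# Sketch of a minimal search-agent loop with two tools (search engine, web browser).
# The callables and the dict shapes are illustrative assumptions, not ASearcher's API.
from typing import Callable

def run_search_agent(
    question: str,
    llm: Callable[[list[dict]], dict],   # returns {"action": ..., "argument": ..., "text": ...}
    search: Callable[[str], str],        # search-engine tool
    browse: Callable[[str], str],        # web-browser tool
    max_turns: int = 128,                # relaxed turn limit, as enabled by asynchronous training
) -> str:
    history = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        step = llm(history)
        if step["action"] == "search":
            raw = search(step["argument"])
        elif step["action"] == "browse":
            raw = browse(step["argument"])
        else:  # "answer": the agent commits to a final, verified answer
            return step["text"]
        # The same model condenses lengthy, noisy tool output so that the prompt
        # history stays compact and within context limits (no external summarizer LLM).
        summary = llm([{"role": "user", "content": "Summarize for the task:\n" + raw}])["text"]
        history.append({"role": "assistant", "content": step["text"]})
        history.append({"role": "tool", "content": summary})
    return "unresolved within the turn budget"
```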

Scalable QA Synthesis

To address the data bottleneck, the paper introduces a data synthesis agent that autonomously generates challenging QA pairs. Starting from seed questions, the agent iteratively applies two actions: Injection (enriching context with external facts) and Fuzzing (obscuring details to increase uncertainty). Each modification is validated for quality, difficulty, and answer uniqueness using LLM-based checks (Figure 4).

Figure 4: The data synthesis agent workflow, showing iterative injection and fuzzing actions with multi-stage quality verification.

Figure 5: Statistics from the data synthesis process, including distributions of supporting facts, injection/fuzz actions, and the difficulty of generated questions.

This pipeline produces a large-scale dataset (134k samples, 25.6k requiring tool use) with high diversity and complexity, directly incentivizing the learning of advanced search strategies such as reflection, cross-verification, and multi-hop reasoning.
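
The synthesis loop can be sketched as follows; the prompts, the 50/50 action choice, and the iteration budget are illustrative assumptions based on the paper's description of Injection, Fuzzing, and multi-stage verification.

```python
# Sketch of the iterative QA-synthesis loop built on Injection and Fuzzing actions.
# The judge prompts and thresholds are assumptions, not the paper's exact protocol.
import random
from typing import Callable

def synthesize_qa(
    seed_question: str,
    answer: str,
    llm: Callable[[str], str],   # one prompt-based LLM acts as both editor and judge
    max_iters: int = 8,
) -> tuple[str, str]:
    question = seed_question
    for _ in range(max_iters):
        if random.random() < 0.5:
            # Injection: enrich the question with an external fact so answering it
            # requires retrieving and combining more evidence.
            candidate = llm("Rewrite by injecting a verifiable external fact:\n" + question)
        else:
            # Fuzzing: obscure a concrete detail (a date, a name) to raise uncertainty.
            candidate = llm("Rewrite by obscuring one concrete detail:\n" + question)
        # Multi-stage verification: keep the edit only if quality, difficulty,
        # and answer uniqueness all pass an LLM-based check.
        checks = [
            llm("Is this question well-formed and natural? Answer yes/no.\n" + candidate),
            llm("Does answering this require multi-step search? Answer yes/no.\n" + candidate),
            llm(f"Is '{answer}' still the unique correct answer? Answer yes/no.\n" + candidate),
        ]
        if all(c.strip().lower().startswith("yes") for c in checks):
            question = candidate
    return question, answer
```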

Fully Asynchronous RL Training

Traditional batch-based RL systems suffer from severe inefficiencies when scaling to long trajectories, as the entire batch must wait for the slowest trajectory, resulting in significant GPU idle time. ASearcher adopts a fully asynchronous RL paradigm, decoupling trajectory generation from model updates and allowing trajectories to span multiple model versions. This enables relaxed turn limits (up to 128 turns/trajectory), facilitating the emergence of long-horizon search behaviors without sacrificing training throughput (Figure 6).

Figure 6: Comparison of one-step-off RL and fully asynchronous RL, demonstrating near-full resource utilization and faster training in the asynchronous regime.
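
The scheduling idea can be illustrated with a toy producer/consumer sketch (not the ASearcher training system): rollout workers generate long trajectories independently and push them to a queue, while the trainer updates the policy on whatever has finished, so trajectories may span several policy versions.

```python
# Toy illustration of decoupling rollout generation from policy updates.
# Real training runs rollouts and updates on separate GPU pools; the threads and
# sleeps here only stand in for that scheduling behavior.
import queue
import threading
import time

trajectory_queue = queue.Queue(maxsize=256)  # finished trajectories, of any length
policy_version = 0                           # bumped by the trainer on each update

def rollout_worker(worker_id: int) -> None:
    while True:
        start = policy_version
        time.sleep(0.01)  # stands in for a long-horizon rollout (up to 128 tool-use turns)
        trajectory_queue.put({
            "worker": worker_id,
            "start_version": start,          # trajectories may span several policy versions,
            "end_version": policy_version,   # since updates continue while they are generated
            "reward": 0.0,
        })

def trainer(num_updates: int = 10, batch_size: int = 32) -> None:
    global policy_version
    for _ in range(num_updates):
        # Consume whatever finished first; no batch ever waits for the slowest trajectory.
        batch = [trajectory_queue.get() for _ in range(batch_size)]
        policy_version += 1
        print(f"update {policy_version}: trained on {len(batch)} trajectories")

workers = [threading.Thread(target=rollout_worker, args=(i,), daemon=True) for i in range(8)]
for w in workers:
    w.start()
trainer()
```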

Empirical analysis shows that accuracy on complex benchmarks increases with the number of tool calls, confirming the necessity of long trajectories for effective agentic search (Figure 7).

Figure 7: Test scaling of ASearcher-Web-QwQ, showing accuracy improvements as the minimum number of tool calls increases.

Experimental Results

ASearcher is instantiated with both base LLMs and advanced LRMs, and evaluated on standard (HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle, Natural Questions, TriviaQA, PopQA) and challenging (GAIA, xBench-DeepSearch, Frames) benchmarks. The agent consistently outperforms existing open-source baselines of comparable model size across F1 and LLM-as-Judge metrics (Figure 8).

Figure 8: Avg@4 and Pass@4 of ASearcher compared to various 32B-scale agents using search tools only.

On GAIA and xBench-DeepSearch, ASearcher-Web-QwQ achieves Avg@4 scores of 52.8 and 42.1, respectively, surpassing previous state-of-the-art open-source agents. RL training yields substantial improvements (+46.7% on xBench, +20.8% on GAIA), with pass rates (Pass@4) also significantly higher (Figure 9).

Figure 9: Asynchronous RL brings substantial improvements in Avg@4 and Pass@4, and enables extreme long-horizon search with >40 tool calls and >150k output tokens during training.

Figure 10: Performance comparison of QwQ-32B agent before and after RL training, highlighting strong gains in both accuracy and pass rate.
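
For reference, a small sketch of how Avg@k and Pass@k are commonly computed over k = 4 attempts per question; the paper's exact judging protocol (string match vs. LLM-as-Judge) is abstracted behind the boolean correctness matrix.

```python
# Sketch of the Avg@k and Pass@k metrics for k attempts per question. Each inner
# list holds the per-attempt correctness of one question (judging is abstracted away).
def avg_at_k(correct: list[list[bool]]) -> float:
    """Avg@k: per-question mean accuracy over k attempts, averaged over questions."""
    return sum(sum(attempts) / len(attempts) for attempts in correct) / len(correct)

def pass_at_k(correct: list[list[bool]]) -> float:
    """Pass@k: fraction of questions with at least one correct attempt."""
    return sum(any(attempts) for attempts in correct) / len(correct)

# Example with 3 questions and k = 4 attempts each.
results = [[True, True, False, True], [False, False, False, True], [False] * 4]
print(avg_at_k(results))   # ~0.33 (per-question means 0.75, 0.25, 0.0)
print(pass_at_k(results))  # ~0.67 (2 of 3 questions solved at least once)
```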

Training Dynamics and Scaling

Training dynamics reveal that both the number of tool calls and generated tokens scale up dramatically during RL, with advanced agents reaching up to 70 tool calls and 150k tokens per trajectory. The asynchronous system maintains high GPU utilization and enables efficient scaling to long-horizon tasks (Figure 11).

Figure 11: Generated tokens during training, illustrating the scaling behavior of ASearcher.

Figure 12: Generated tokens for another training run, confirming the scaling trend.

Implications and Future Directions

The ASearcher framework demonstrates that fully asynchronous RL, combined with scalable, high-quality QA synthesis, is essential for unlocking long-horizon agentic search capabilities in open-source LLM agents. The results indicate that restrictive turn limits and insufficient data are the primary obstacles to expert-level Search Intelligence. By overcoming these, ASearcher enables agents to perform uncertainty-aware reasoning, precise extraction, cross-document inference, and rigorous verification—behaviors previously limited to proprietary systems.

Practically, this work provides a blueprint for training advanced search agents capable of handling real-world, knowledge-intensive tasks. The open-source release of models, data, and code facilitates reproducibility and further research. Theoretically, the findings suggest that agentic RL for LLMs can scale to arbitrarily long horizons given appropriate system design and data, with implications for broader applications in autonomous reasoning, planning, and tool integration.

Future work may explore further scaling of trajectory length, integration of additional tools (e.g., calculators, code interpreters), and more sophisticated reward functions. The asynchronous RL paradigm is likely to become standard for agentic LLM training, especially as task complexity and model size continue to grow.

Conclusion

ASearcher establishes a new standard for open-source agentic search by combining fully asynchronous RL training with autonomous, scalable QA synthesis. The framework enables long-horizon, multi-tool reasoning and achieves state-of-the-art results on challenging benchmarks, demonstrating the critical importance of system-level innovations in RL and data generation for advanced LLM agents. The open-source release is poised to accelerate progress in agentic AI, with broad implications for both research and real-world deployment.
