SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents (2509.06283v1)

Published 8 Sep 2025 in cs.AI and cs.CL

Abstract: Equipping LLMs with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning-oriented (``thinking'') models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. Our work in this paper focuses on the development of native Autonomous Single-Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take up pre-defined roles and are told what to do at each step in a static workflow, an autonomous single-agent determines its next action dynamically based on context, without manual directive. While prior work has proposed training recipes for base or instruction-tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. Towards this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open-source LLMs. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity's Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insights into our methodologies.


Summary

  • The paper introduces a reinforcement learning framework that enhances autonomous single-agent reasoning for deep research tasks.
  • It employs an agentic inference pipeline with memory management to convert multi-turn interactions into efficient single-turn contextual tasks.
  • Experimental results demonstrate significant performance gains and efficient tool usage with models like QwQ-32B and Qwen3.

Introduction

The paper "SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents" proposes a framework for developing Autonomous Single-Agent models optimized for Deep Research (DR) tasks. The approach relies on a minimal toolset of web crawling and Python tool integration, in contrast with multi-agent systems that follow predefined workflows. The research centers on continual reinforcement learning (RL) of reasoning-optimized models to enhance agentic skills while preserving their reasoning capabilities.

Methodology

Agentic Inference Pipeline

The pipeline employs an agentic scaffolding approach, resembling multi-turn conversations with tools, incorporating a memory management system that allows for an effectively unlimited context window. For models like QwQ-32B and Qwen3, the interaction is framed as a single-turn contextual question answering task, incorporating previous tool calls and responses as memory (Figure 1).

Figure 1: An example tool calling trajectory by our SFR-DR agentic workflow, catered for QwQ-32B and Qwen3 models. The multi-turn interaction is framed as a single-turn contextual question answering problem, where there is always only 1 user turn. The previous tool calls and responses are packed as memory and placed in the user turn together with the question.
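The single-turn framing can be sketched as follows. This is a minimal illustration of the idea, not the authors' actual prompt template; the function name, memory format, and example question are all hypothetical.

```python
# Hypothetical sketch: pack prior tool calls and responses into a single
# user turn, so a reasoning model sees one contextual QA prompt per step
# instead of a growing multi-turn conversation.
def build_single_turn_prompt(question, history):
    """history: list of (tool_call, tool_response) pairs from earlier steps."""
    memory_lines = []
    for i, (call, response) in enumerate(history, start=1):
        memory_lines.append(f"[Step {i}] Tool call: {call}")
        memory_lines.append(f"[Step {i}] Tool response: {response}")
    memory = "\n".join(memory_lines) if memory_lines else "(no prior steps)"
    return (
        f"Question: {question}\n\n"
        f"Memory of previous tool interactions:\n{memory}\n\n"
        "Decide the next action: call a tool or give the final answer."
    )

# Example: one prior search step is folded into the next prompt.
prompt = build_single_turn_prompt(
    "What year was the transformer architecture introduced?",
    [("search('transformer paper')", "Attention Is All You Need (2017) ...")],
)
```

Because each step rebuilds the prompt from a compact memory rather than appending raw turns, the effective context window is bounded by the memory summary, not the full interaction history.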

Training Data Synthesis and RL Recipe

The research introduces a synthetic data pipeline tailored for short-form QA and long-form report writing tasks. This dataset is challenging, requiring extensive reasoning and tool use. The RL training employs a length-normalized REINFORCE-based approach, incorporating trajectory filtering, partial rollouts, and reward modeling to stabilize the policy optimization process. This framework effectively minimizes unintended behaviors such as repetitive tool calls, which are mitigated by length normalization (Figure 2).

Figure 2: Average training trajectory lengths of SFR-DR-8B agents over the course of RL training with and without our proposed length normalization.
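The effect of length normalization on a REINFORCE-style objective can be illustrated with a toy loss. This is an interpretive sketch, not the paper's exact recipe: the function below simply divides each trajectory's summed log-probability by its token count, so long rollouts do not dominate the gradient.

```python
# Illustrative, length-normalized REINFORCE-style loss (hypothetical sketch).
# trajectories: list of (token_logprobs, scalar_reward) pairs.
def length_normalized_reinforce_loss(trajectories):
    total = 0.0
    for logprobs, reward in trajectories:
        # Normalizing by trajectory length keeps gradient magnitudes
        # comparable between short and long rollouts, removing the
        # incentive to pad trajectories with repetitive tool calls.
        norm_logprob = sum(logprobs) / max(len(logprobs), 1)
        total += -reward * norm_logprob
    return total / len(trajectories)

loss = length_normalized_reinforce_loss([
    ([-0.1, -0.2, -0.3], 1.0),   # short, rewarded trajectory
    ([-0.05] * 10, 0.0),         # longer, zero-reward trajectory
])
```

Without the division by length, a long trajectory's summed log-probability would contribute a proportionally larger term to the loss even at equal reward, which is one plausible mechanism behind the trajectory-length blow-up seen without normalization in Figure 2.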

Results

The SFR-DeepResearch framework shows strong performance across several benchmarks like Humanity's Last Exam (HLE), FRAMES, and GAIA. Remarkably, the SFR-DR-20B variant achieves up to 28.7% on HLE, showcasing significant improvements over various baselines. The approach also demonstrates effective tool usage and response efficiency, with the model generating fewer tokens per step (Figure 3).

Figure 3: Comparison of (a) average tool usage and (b) average step-level response lengths (tokens) on HLE across different models.

Analysis

The paper provides in-depth ablation studies, revealing insights into several methodological components:

  • The agentic workflow significantly improves performance by reformulating multi-turn conversations into single-turn tasks, enabling reasoning-optimized models to operate effectively.
  • The length normalization strategy effectively stabilizes RL training by penalizing unnecessary long trajectories, improving both training stability and performance.
  • A detailed analysis of tool usage and response lengths illustrates that models trained with this framework efficiently balance reasoning and tool-calling actions, leading to robust agentic behavior.
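The trajectory filtering mentioned in the RL recipe can be sketched as a simple pre-update filter. This is a hypothetical illustration, not the authors' actual criteria: it drops rollouts with malformed output or degenerate runs of repeated tool calls before the policy update, keeping only trajectories likely to carry a useful learning signal.

```python
# Hypothetical trajectory filter applied before each policy update.
def filter_trajectories(trajectories, max_repeats=3):
    kept = []
    for t in trajectories:
        calls = t["tool_calls"]
        # Find the longest run of identical consecutive tool calls.
        longest_run = 1
        run = 1
        for prev, cur in zip(calls, calls[1:]):
            run = run + 1 if cur == prev else 1
            longest_run = max(longest_run, run)
        # Keep only well-formatted rollouts without degenerate repetition.
        if t["valid_format"] and longest_run <= max_repeats:
            kept.append(t)
    return kept

sample = [
    {"tool_calls": ["search", "search", "browse"], "valid_format": True},
    {"tool_calls": ["search"] * 5, "valid_format": True},   # repetitive
    {"tool_calls": ["browse"], "valid_format": False},      # malformed
]
kept = filter_trajectories(sample)
```

Filtering of this kind complements length normalization: the former removes pathological rollouts outright, while the latter reshapes the gradient contribution of the rollouts that remain.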

Conclusion

SFR-DeepResearch offers a pragmatic RL framework for building Autonomous Single-Agent models capable of effective deep research. By leveraging reasoning-optimized backbones, the approach achieves substantial gains on DR tasks and provides a foundation for future work on autonomously reasoning AI agents. The results underline the efficacy of the proposed methodologies and motivate further exploration of length normalization and tool integration strategies.
