
SFR-DR-20B: Autonomous Deep Research Agent

Updated 9 September 2025
  • SFR-DR-20B is an autonomously reasoning single-agent LLM optimized for deep research by integrating a minimal set of essential tools into an agentic inference framework.
  • It employs a reinforcement learning framework with group-based rollouts and length-normalized advantage to stabilize long-horizon reasoning and tool usage.
  • The model achieves significant benchmark improvements, including a 65% gain on Humanity’s Last Exam, showcasing its efficient chained reasoning and autonomous memory management.

SFR-DR-20B is an autonomously reasoning single-agent LLM optimized for the domain of “Deep Research” (DR), a paradigm characterized by extended, tool-assisted search and interleaved reasoning across multiple data sources and modalities. Developed as part of the SFR-DeepResearch initiative, SFR-DR-20B integrates a minimal yet essential set of tool-use capabilities (web search, browsing, Python code execution) directly into an agentic inference framework. Distinct from multi-agent approaches or purely instruction-tuned models, SFR-DR-20B dynamically determines each next action based on full contextual awareness, leveraging a reasoning-optimized backbone (gpt-oss-20b) and continual reinforcement learning applied on challenging synthetic datasets (Nguyen et al., 8 Sep 2025). This agent achieves notable benchmark performance, including 28.7% Pass@1 on Humanity’s Last Exam, a 65% relative improvement over its base model.

1. Model Structure and Agentic Design

SFR-DR-20B is constructed atop the gpt-oss-20b backbone with a design intentionally focused on autonomous single-agent flexibility and robustness. Three key architectural distinctions define the model:

  • Agentic Inference Pipeline: Multi-turn tool calls and their results (e.g., search outcomes, code execution outcomes) are concatenated into a single-step contextual prompt, providing unified access to all historical tool interactions. This structure departs from conventional multi-turn conversation—a design which mitigates context fragmentation and allows context-rich reasoning in a single inference pass.
  • Minimalistic, Crucial Tool Suite: Rather than integrating a broad or sophisticated toolset, the agent is equipped only with a stateless code interpreter (code_interpreter), a web search API (search_internet), and a robust browsing tool (browse_page). This design mandates efficient learning of tool invocation strategies and maximizes the agent’s capacity to sequence, plan, and adapt tool usage to complex query requirements.
  • Autonomous Memory Management: A built-in clean_memory tool enables context window management, allowing the agent to prune less-salient or obsolete contextual elements and avoid exceeding token limits. This enhances the model’s capacity to tackle tasks with extended reasoning horizons without human intervention.

The architecture is therefore explicitly optimized for long-horizon reasoning, sequential tool usage, and contextual self-adaptation, in contrast with “single-step” reasoning models that fail to scale across multi-turn interactions.
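
A minimal sketch of how such a single-step contextual prompt could be assembled from the interaction history is shown below; the record format and function names (e.g., `ToolCall`, `build_agentic_prompt`) are illustrative assumptions rather than the released implementation.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    """One historical tool interaction (hypothetical record format)."""
    name: str       # e.g. "search_internet", "browse_page", "code_interpreter"
    arguments: str  # serialized arguments passed to the tool
    result: str     # (possibly truncated) tool output returned to the agent

def build_agentic_prompt(task: str, history: list[ToolCall]) -> str:
    """Concatenate all prior tool calls and results into one contextual prompt.

    Prior chain-of-thought text is deliberately omitted: only the task, the
    tool-call history, and an instruction to decide the next action are kept,
    so the model reasons afresh at every step.
    """
    lines = [f"Task: {task}", "", "Tool-call history:"]
    for i, call in enumerate(history, start=1):
        lines.append(f"[{i}] {call.name}({call.arguments})")
        lines.append(f"    Result: {call.result}")
    lines += ["", "Decide the next action: call a tool or produce the final answer."]
    return "\n".join(lines)

# Example usage
history = [ToolCall("search_internet", "query='SFR-DeepResearch'", "Top-10 snippets ...")]
print(build_agentic_prompt("Summarize the SFR-DR-20B training recipe.", history))
```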

2. Reinforcement Learning Framework

SFR-DR-20B employs a reinforcement learning framework rooted in the REINFORCE algorithm, combined with modern stabilizing techniques:

  • Group-Based Rollouts: For each query $q$, $G$ independent trajectories (rollouts) are sampled. Each trajectory $\tau_i = [(s_{i,1}, a_{i,1}), (s_{i,2}, a_{i,2}), \ldots, (s_{i,T_i}, a_{i,T_i})]$ represents a sequence of agent-environment interactions, where $s_{i,j}$ encodes the current state and context, and $a_{i,j}$ the agent’s action.
  • Length-Normalized Advantage: The stepwise advantage used to compute policy gradients is normalized by the trajectory length $T_i$, implemented as:

$$A_{i,j} = \frac{r_i - \mathrm{mean}(\bar{R})}{\mathrm{std}(\bar{R}) \cdot T_i}$$

where $r_i$ is the trajectory’s final reward and $\bar{R}$ the set of rewards within the group. This addresses the issue of degenerate behaviors, such as repetitive tool invocation, across longer sequences.

  • Trajectory Filtering and Recycling: Invalid or truncated responses are excluded from training; partial rollouts are reused as new starting points, increasing gradient update diversity.
  • Automated Reward Modeling: Short-form question-answering outputs receive a binary reward for semantic consistency with the ground truth, while long-form outputs are scored via a weighted sum of fact-checking, compliance, and quality assessments from a verifier model.

These strategies stabilize policy optimization for long-horizon, tool-intensive DR tasks and allow adaptation to the complexities introduced by synthetic, high-variance rollouts.
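
As a concrete illustration of the length-normalized advantage above, the sketch below computes per-trajectory advantages for one group of rollouts. It is a simplified rendering under stated assumptions (rewards are taken as given scalars, and names such as `length_normalized_advantages` are illustrative), not the training code.

```python
import numpy as np

def length_normalized_advantages(rewards: list[float], lengths: list[int]) -> np.ndarray:
    """Per-trajectory advantages for one group of G rollouts.

    rewards[i] : final reward r_i of trajectory i (binary for short-form QA,
                 or a weighted verifier score for long-form outputs)
    lengths[i] : number of steps T_i in trajectory i
    Returns A_i = (r_i - mean(R)) / (std(R) * T_i), which is broadcast to every
    step j of trajectory i when forming the REINFORCE policy-gradient loss.
    """
    r = np.asarray(rewards, dtype=np.float64)
    t = np.asarray(lengths, dtype=np.float64)
    std = r.std() + 1e-8  # epsilon guards against a zero-variance group
    return (r - r.mean()) / (std * t)

# Example: a group of G = 4 rollouts with mixed rewards and trajectory lengths
adv = length_normalized_advantages(rewards=[1.0, 0.0, 1.0, 0.0], lengths=[12, 30, 8, 25])
print(adv)  # longer trajectories receive proportionally smaller per-step advantage
```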

3. Integrated Reasoning and Tool-Use

SFR-DR-20B demonstrates intertwined reasoning and tool-use, mediated by both architectural and training-time innovations:

  • Essential Toolset:
    • search_internet(query:str): Returns top-10 search snippets.
    • browse_page(url:str, section_id:int): Cleans, parses, and returns selected HTML sections as Markdown.
    • code_interpreter(code:str): Executes stateless Python code snippets.
  • Contextualized Prompt Construction: The agentic inference pipeline ensures that all historical tool calls and outputs are available as context for each new action decision.
  • Dynamic Chain-of-Thought (CoT) Management: At each reasoning step, SFR-DR-20B regenerates its chain-of-thought tokens and omits prior, potentially error-prone CoTs. This design mitigates error propagation across long multi-step reasoning processes.
  • Fault Tolerance: Misformatted tool outputs or invalid invocations trigger error-driven retries—enabling on-the-fly self-correction.
  • Self-Governed Internal Memory: The dedicated clean_memory tool ensures only the most task-relevant context is maintained, particularly at token count limits.

The model’s prompt format and token management approaches differ depending on the backbone: for certain models (e.g., QwQ-32B, Qwen3), interaction histories are compressed into a single prompt; for gpt-oss-20b, a standard multi-turn structure is used.
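
A minimal sketch of how such a compact tool interface could be declared and dispatched is given below; the function signatures follow the names listed above, but the stub bodies, return types, and the error-handling dispatcher are illustrative assumptions.

```python
import subprocess
import sys
from typing import Callable

# Illustrative stubs for the three essential tools; real implementations would
# call a search API and an HTML-to-Markdown cleaning pipeline.
def search_internet(query: str) -> list[str]:
    """Return top-10 search snippets (stubbed)."""
    return [f"snippet {k} for '{query}'" for k in range(10)]

def browse_page(url: str, section_id: int) -> str:
    """Fetch, clean, and return one page section as Markdown (stubbed)."""
    return f"# Section {section_id} of {url}\n(markdown content)"

def code_interpreter(code: str) -> str:
    """Execute a stateless Python snippet in a fresh subprocess."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=30)
    return proc.stdout or proc.stderr

TOOLS: dict[str, Callable] = {
    "search_internet": search_internet,
    "browse_page": browse_page,
    "code_interpreter": code_interpreter,
}

def dispatch(name: str, **kwargs) -> str:
    """Route a model-issued tool call; malformed calls return an error string
    so the agent can retry (fault tolerance) instead of crashing the rollout."""
    try:
        return str(TOOLS[name](**kwargs))
    except Exception as exc:
        return f"TOOL_ERROR: {exc}"

print(dispatch("code_interpreter", code="print(2 + 2)"))   # "4\n"
print(dispatch("browse_page", url="https://example.com", section_id=3))
```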

4. Benchmark Results and Empirical Evaluation

SFR-DR-20B’s empirical performance is assessed on several challenging DR benchmarks:

| Benchmark | Metric | Score |
| --- | --- | --- |
| Humanity’s Last Exam (HLE) | Pass@1 | 28.7% |
| FRAMES | – | 82.8 |
| GAIA | – | 66.0 |
  • On HLE, a math- and science-intensive suite, SFR-DR-20B’s 28.7% Pass@1 represents a 65% relative improvement over gpt-oss-20b (implying a base score of roughly 17.4%).
  • The model substantially outperforms both prior single-agent systems and several larger proprietary models on auxiliary tasks, exhibiting reliable tool-use frequency, brevity, and effectiveness.
  • Compared to QwQ and Qwen3 variants, SFR-DR-20B executes up to an order of magnitude (about 10×) more tool calls, while keeping each reasoning step concise (under 2,000 tokens).

These metrics confirm both the robustness and efficiency of the agentic inference and RL framework adopted.

5. Synthetic Data Generation and Utilization

All SFR-DR-20B RL training is conducted on challenging synthetic datasets, progressively generated and refined to accentuate the complexities of deep research:

  • Multi-hop QA: The synthetic data pool contains tasks that necessitate extended, indirect reasoning chains and multi-modal fact-checking, often beyond the complexity of standard open-source datasets such as HotpotQA or WikiMultiHopQA.
  • Tool-Intensive Tasks: Many training queries are constructed to require iterative searches, information extraction, and code execution, compelling the agent to optimize tool invocation sequences.
  • Long-Form Content Generation: Reporting and writing tasks are designed to be open-ended, requiring the agent to adhere to grading rubrics evaluating factuality, compliance, clarity, and citation correctness.

The synthetic data is specifically engineered to push the boundaries of sequential reasoning and tool-use. This strategy compels the agent to generalize robustly across long reasoning horizons and multi-disciplinary subtasks.
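
As an illustration of the long-form reward described in Section 2 and the grading rubrics above, the sketch below combines verifier rubric scores into a single scalar; the rubric fields and weights are assumptions for demonstration, not values from the paper.

```python
def long_form_reward(scores: dict[str, float],
                     weights: dict[str, float] | None = None) -> float:
    """Combine verifier rubric scores (each in [0, 1]) into one scalar reward.

    The rubric fields and weights below are illustrative assumptions; the paper
    describes a weighted sum over fact-checking, compliance, and quality checks
    without prescribing these exact values.
    """
    weights = weights or {"factuality": 0.5, "compliance": 0.3, "quality": 0.2}
    return sum(w * scores.get(k, 0.0) for k, w in weights.items())

# Example: a verifier scored a generated report on each rubric dimension
print(long_form_reward({"factuality": 0.9, "compliance": 1.0, "quality": 0.7}))  # 0.89
```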

6. Analytical and Ablation Studies

Extensive analysis experiments and ablations elucidate the impacts of various system components:

  • Agentic vs. Default Multi-Turn Prompting: Reformatting interaction histories into a single, context-rich prompt leads to significant improvements: for instance, a 10% absolute lift on FRAMES when applied to the 32B model.
  • Dynamic Chain-of-Thought Management: Omitting previous, lengthy CoT outputs and regenerating at each step prevents error accumulation, leading to more accurate long-horizon reasoning.
  • RL Length Normalization: The use of length-normalized advantage (see Section 2) moderates degenerate tool-invocation patterns and yields higher and more stable returns.
  • Tool Call and Output Length Efficiency: SFR-DR-20B’s tool-invocation strategy (frequently more tool calls, consistently shorter reasoning outputs) is shown to improve both throughput and reward, compared to models with verbose, over-detailed outputs.

Empirical results from these experiments provide strong evidence that agentic inference scaffolding, internal memory management, and custom RL loss normalization are each crucial to the observed gains in tool-use efficacy and reasoning robustness.

7. Significance and Implications

SFR-DR-20B advances the state of autonomous LLM-based research agents by demonstrating that a compact tool set, when properly integrated with agentic inference and stabilized reinforcement learning, can yield substantial improvements on reasoning-intensive benchmarks. The centrality of the advantage normalization formula,

$$A_{i,j} = \frac{r_i - \mathrm{mean}(\bar{R})}{\mathrm{std}(\bar{R}) \cdot T_i}$$

in managing trajectory-length variance highlights a notable methodology for future RL training of agentic LLMs. The results suggest that aggressive context management, prompt engineering, and reward decomposition are all structurally and empirically validated strategies for scaling agentic LLMs to complex research workflows. A plausible implication is that continued advances in synthetic data generation, longer context management, and nuanced reward modeling will further extend the capabilities of single-agent, reasoning-oriented AI systems for real-world research challenges.
