
DR Tulu-8B: Open-Source Deep Research Model

Updated 25 November 2025
  • Deep Research Tulu-8B is an open-source 8B parameter language model designed for long-form deep research with integrated tool calls and multi-step reasoning.
  • It employs Reinforcement Learning with Evolving Rubrics (RLER) to optimize research reports, ensuring accurate citations and adherence to evaluation rubrics.
  • The model achieves state-of-the-art performance on diverse benchmarks while significantly reducing computational costs compared to proprietary systems.

Deep Research Tulu-8B (DR Tulu-8B) is an open-source, 8-billion-parameter LLM designed explicitly for open-ended, long-form deep research tasks. It is trained using Reinforcement Learning with Evolving Rubrics (RLER), a methodology that combines co-evolving evaluation rubrics and policy learning, enabling the model to produce multi-step, well-cited research reports that effectively utilize search and tool augmentation. DR Tulu-8B achieves state-of-the-art performance among open models on diverse research benchmarks and matches or exceeds proprietary systems in several domains, while being significantly smaller and more cost-efficient (Shao et al., 24 Nov 2025).

1. Reinforcement Learning with Evolving Rubrics (RLER)

The RLER framework addresses the limitations of previous deep research training recipes, which were constrained to short-form, verifiable QA via static reward signals. DR Tulu-8B treats the research agent as an LLM policy $\pi_\theta$ which, given a prompt $x$ (comprising both the research question and system instructions), generates trajectories $y$ consisting of interleaved:

  • <think>...</think> tokens: free-text, step-by-step reasoning.
  • <call_tool>...</call_tool>: explicit tool invocation for web search, paper lookup, or web browsing.
  • <cite id="...">...</cite>: inline attribution to supporting evidence, typically identified by tool calls.
  • <answer>...</answer>: demarcates the final report and terminates the trajectory.

Each research instance $x$ is associated with a set of rubrics $\mathcal{R}_x = \{ (r_{x,k}, w_{x,k}) \}_{k=1}^{K}$, where $r_{x,k}$ denotes a textual evaluation criterion and $w_{x,k}$ its weight. Each model output $y$ is scored by a rubric-based reward:

$$S(x,y) = \frac{\sum_{k=1}^{K} w_{x,k} \cdot \mathrm{Judge}(r_{x,k}, y)}{\sum_{k : w_{x,k} > 0} w_{x,k}}$$

where $\mathrm{Judge}(\cdot) \in \{0, 0.5, 1\}$ is an LM-based scorer evaluating rubric satisfaction.

The total reinforcement signal is

$$R(x,y) = \alpha \cdot S(x,y) + \beta \cdot r_{\mathrm{cit}}(x,y) + \gamma \cdot r_{\mathrm{format}}(x,y) + \delta \cdot r_{\mathrm{search}}(x,y)$$

where $r_{\mathrm{cit}}$ reflects citation quality (precision and recall, via LLM judgment), $r_{\mathrm{format}}$ captures adherence to output formatting (correct use of tags), $r_{\mathrm{search}}$ encourages appropriate tool use, and $\alpha, \beta, \gamma, \delta$ are weights (with $\alpha \approx 1$ and the others small).

The policy is optimized with Group Relative Policy Optimization (GRPO), incorporating a KL penalty toward a reference SFT policy $\pi_{\mathrm{ref}}$.

Rubrics themselves co-evolve with policy learning. Each RL update includes:

  • precomputed persistent rubrics generated by an LM from retrieved evidence,
  • dynamic enrichment of the rubric pool through new on-policy rollouts and rubric-generator prompts,
  • selection of the most discriminative rubrics based on judge variance across current rollouts, capped at $K_{\max}=5$ per step,
  • removal of non-informative rubrics (zero variance).

This methodology keeps the reward signal grounded in newly explored model behavior and maintains discriminative feedback as policy capabilities evolve (Shao et al., 24 Nov 2025).
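To make the reward definition above concrete, the following is a minimal sketch of the rubric-weighted scoring in Python. The `Rubric` structure, the `judge` callable, and the coefficient defaults are illustrative assumptions, not the released implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rubric:
    criterion: str  # textual criterion r_{x,k}
    weight: float   # weight w_{x,k}

def rubric_score(rubrics: List[Rubric], output: str,
                 judge: Callable[[str, str], float]) -> float:
    """S(x, y): weighted mean of LM-judge scores in {0, 0.5, 1},
    normalized by the sum of positive rubric weights."""
    numer = sum(r.weight * judge(r.criterion, output) for r in rubrics)
    denom = sum(r.weight for r in rubrics if r.weight > 0)
    return numer / denom if denom > 0 else 0.0

def total_reward(rubrics: List[Rubric], output: str,
                 judge: Callable[[str, str], float],
                 r_cit: float, r_format: float, r_search: float,
                 alpha: float = 1.0, beta: float = 0.1,
                 gamma: float = 0.05, delta: float = 0.05) -> float:
    """R(x, y) = alpha*S + beta*r_cit + gamma*r_format + delta*r_search.
    Coefficient values here are placeholders (the paper sets alpha near 1
    and the others small), not the published settings."""
    s = rubric_score(rubrics, output, judge)
    return alpha * s + beta * r_cit + gamma * r_format + delta * r_search
```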
2. Model Architecture and Agent Protocol

DR Tulu-8B's backbone is Qwen3-8B, a 32-layer Transformer with hidden size 4096 and 32 attention heads. Architectural modifications are limited to the addition of protocol tokens for explicit action differentiation: "think", "call_tool", "cite", and "answer". No other structural changes are made; all improvement is driven by the training methodology and protocol orchestration.

Interaction with external tools is orchestrated through the Model Context Protocol (MCP), supporting unified asynchronous tool calling and response handling. Tool actions exposed to the model include:

  • google_search(query; k, gl, hl): returns search snippets,
  • snippet_search(query; limit, year, fields): fetches scientific-paper snippets,
  • web_browse(URL): retrieves and, if necessary, summarizes web content.

At inference time, the agent repeatedly chooses the next action conditioned on the augmented state (system prompt, user question, prior actions and evidence) until an <answer> termination (Shao et al., 24 Nov 2025).

3. Training Regimen

Training proceeds in two stages.

a) Supervised Fine-Tuning (SFT)

A cold-start phase uses 16K expert-generated or program-matched trajectories, divided among:

  • long-form queries from OpenScholar and SearchArena (6K, emphasizing multi-step, search-driven tasks),
  • 4K rubric-annotated prompts (RaR),
  • 6K short-form QA instances (sourced from HotpotQA, TaskCraft, WebWalker, MegaScience, and synthetic PopQA).

The teacher model is GPT-5, with trajectories filtered for format and answer correctness. Training hyperparameters:

  • 5 epochs on 8×H100 GPUs (136 GPU·h total),
  • batch size 16 (via gradient accumulation),
  • learning rate $4\times 10^{-5}$, 10% warmup, BF16 precision.

b) Online RL with RLER

The RLER phase employs 5K fresh long-form research prompts (OpenScholar/SearchArena) and 4K RaR prompts. Training uses GRPO with evolving rubrics and MCP tool orchestration, with:

  • 32 prompts per RL step × 8 rollouts each, up to 18,500 tokens per rollout,
  • learning rate $5\times 10^{-7}$, KL penalty 0.001,
  • a maximum of 10 tool calls per trajectory,
  • 1,900 RL updates over 25 days on 2×H100 (≈9,700 GPU·h total).

All reward and rubric computations are "on-policy," i.e., based on the current model's own search-tool execution and outputs; both the Judge and the rubric-generation model $\mathcal{G}_{\text{rubric}}$ use these traces to produce scores and criteria.

4. Agent Infrastructure: dr-agent-lib and MCP

The dr-agent-lib library operationalizes the Model Context Protocol (MCP) for research agents, supplying:

  • unified asynchronous interfaces to web/search tools (Google, Semantic Scholar, Crawl4AI),
  • global caching of tool outputs to reduce redundancy,
  • per-API rate limiting,
  • extensible prompt scaffolding for flexible, chained, or stacked tool invocation,
  • high-throughput rollouts through non-blocking tool-call execution (agents sleep while awaiting results).

The agent acts in a loop: in each state, it generates an action/content pair; for "think", the state is extended with reasoning text; for "call_tool", the relevant tool output is concatenated; for "cite", citation markers are inserted into the ongoing answer; and for "answer", the output is finalized and emitted.

This infrastructure enables tractable synchronous and asynchronous RL with realistic tool-use traces and ensures that both policy learning and evaluation are feasible at research scale (Shao et al., 24 Nov 2025).
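The agent loop described above can be summarized in a short, illustrative sketch. The `policy` callable and `TOOLS` registry below are hypothetical stand-ins for the model and the MCP tool interfaces; this is not the dr-agent-lib API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool registry; in DR Tulu these are MCP-backed web/search tools.
TOOLS: Dict[str, Callable[[str], str]] = {
    "google_search": lambda q: f"[search snippets for: {q}]",
    "snippet_search": lambda q: f"[paper snippets for: {q}]",
    "web_browse": lambda url: f"[summarized content of: {url}]",
}

def run_agent(policy: Callable[[str], Tuple[str, str]],
              system_prompt: str, question: str,
              max_tool_calls: int = 10) -> str:
    """Roll out one trajectory. `policy` stands in for the model decoding its
    next protocol block; it returns an (action, content) pair where action is
    one of "think", "call_tool", "cite", "answer". Tool-call content is
    assumed to look like "tool_name: argument"."""
    state = system_prompt + "\n" + question
    report: List[str] = []
    tool_calls = 0
    while True:
        action, content = policy(state)
        if action == "think":
            state += f"\n<think>{content}</think>"
        elif action == "call_tool" and tool_calls < max_tool_calls:
            name, arg = content.split(":", 1)
            result = TOOLS[name.strip()](arg.strip())
            state += f"\n<call_tool>{content}</call_tool>\n{result}"
            tool_calls += 1
        elif action == "cite":
            report.append(f'<cite id="{content}"></cite>')
        elif action == "answer":
            report.append(content)
            return "<answer>" + "".join(report) + "</answer>"
        else:  # tool budget exhausted or malformed action: keep it in the state
            state += f"\n{content}"
```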
5. Evaluation and Benchmark Results

DR Tulu-8B is evaluated on four long-form deep research benchmarks, as well as clinical genetic-variant QA and several short-form QA tasks.

a) Long-Form Deep Research Benchmarks

| Model | SQAv2 | HB | RQA | DRB | Avg |
|---|---|---|---|---|---|
| OpenAI Deep Research | 79.6 | 53.8 | 79.2 | 46.9 | 64.9 |
| GPT-5 + Search | 74.8 | 59.5 | 78.2 | 50.7 | 65.8 |
| DR Tulu-8B (RL / RLER) | 86.8 | 50.2 | 74.3 | 43.4 | 63.7 |
| DR Tulu-8B (SFT) | 72.3 | 38.1 | 68.5 | 39.0 | 53.9 |
| Tongyi DeepResearch-30B | 46.5 | 46.2 | 66.7 | 40.6 | 50.0 |
| WebExplorer-8B | 42.5 | 33.7 | 64.8 | 36.7 | 44.4 |
| WebThinker-32B (DPO) | 32.9 | 11.1 | 48.6 | 23.3 | 28.9 |

Scores: SQAv2 (ScholarQA-CS2), HB (HealthBench), RQA (ResearchQA), DRB (DeepResearchBench); Avg = mean across the four.

DR Tulu-8B (RL) outperforms all other open-source models by 13.7–53.4 points and is within 1–2 points of leading proprietary systems, despite its smaller scale and inference cost (Shao et al., 24 Nov 2025).

b) Subscore Analysis

| Model | SQAv2-Rubric | SQAv2-Ans | SQAv2-Cite-P | SQAv2-Cite-R | DRB-Comp | DRB-Depth | DRB-Instr | DRB-Read |
|---|---|---|---|---|---|---|---|---|
| DR Tulu-8B (SFT) | 81.4 | 91.0 | 65.3 | 51.6 | 36.3 | 35.3 | 45.5 | 39.5 |
| DR Tulu-8B (RL) | 89.6 | 95.4 | 88.6 | 73.7 | 41.7 | 41.8 | 48.2 | 41.3 |

c) Clinical Genetic Variant QA

On 47 expert-curated questions (GeneticDiseasesQA), DR Tulu-8B (RL) attains:

  • final answer correctness: 76.1% (GPT-5 + Search: 66.5%),
  • evidence quality: 70.2%,
  • evidence synthesis: 62.3%,
  • evidence support: 88.5%.

d) Short-Form QA

| Model | SimpleQA | 2Wiki | WebWalker | Avg |
|---|---|---|---|---|
| DR Tulu-8B (RL) | 80.1 | 68.0 | 39.1 | 62.4 |
| DR Tulu-8B (SFT) | 75.5 | 66.5 | 31.9 | 58.0 |
| Qwen3-8B + Search | 70.5 | 44.0 | 27.9 | 47.5 |

6. Cost Analysis

| Model | Cost/query (USD) | Answer tokens | # Citations | # Tool calls |
|---|---|---|---|---|
| OpenAI Deep Research | 1.80 | 6445.1 | 79.6 | – |
| GPT-5 + Search | 0.29 | 2358.7 | 28.1 | – |
| Gemini 3 Pro + Search | 0.13 | 1310.9 | 8.6 | 8.5 |
| DR Tulu-8B (RL) | 0.0019 | 1889.2 | 35.8 | 4.3 |

DR Tulu-8B is 100–1000× cheaper per query than proprietary deep research APIs and 10–20× cheaper than comparable open models on SQAv2 (Shao et al., 24 Nov 2025).

7. Release and Reproducibility

All code for training, inference, and evaluation is released at https://github.com/rlresearch/dr-tulu. Model checkpoints (SFT and RL) are available on HuggingFace (https://huggingface.co/collections/rl-research/dr-tulu). Released datasets include the SFT trajectories, RL prompts, evolving-rubric logs, and the GeneticDiseasesQA benchmark. The agent library (dr-agent-lib), with MCP tool support, async tool calling, and caching, is also released. Reproducibility scripts cover end-to-end SFT, RLER training, and all benchmarks, requiring 2×H100 (≈10,000 GPU·h) for the full RL run (Shao et al., 24 Nov 2025).

DR Tulu-8B demonstrates that a carefully orchestrated combination of protocol-driven modeling, supervised and RL training, and co-evolved rubrics can substantially improve the open-model state of the art in deep research, in both quality and efficiency, across science, healthcare, and general research domains.
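For readers who want to try the released checkpoints, a minimal inference sketch using HuggingFace transformers is shown below. The repository ID is a placeholder (the collection linked above lists the actual checkpoint names), and a real deep-research run would route <call_tool> blocks through dr-agent-lib's MCP tools rather than leaving them unfulfilled.

```python
# Minimal sketch: load a released DR Tulu checkpoint and generate a draft.
# NOTE: "rl-research/DR-Tulu-8B" is a placeholder ID, not a verified name;
# see the HuggingFace collection linked above for the actual checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rl-research/DR-Tulu-8B"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Summarize current evidence on the pathogenicity of a given BRCA1 variant."
inputs = tokenizer(prompt, return_tensors="pt")
# Without a tool executor, any <call_tool> blocks are emitted but unanswered;
# pair generation with an MCP tool loop (see Section 4) for full runs.
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```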