DR Tulu-8B Agent: Open Research AI Blueprint
- DR Tulu-8B Agent is an advanced AI research agent that employs a modular Model–Context–Protocol structure to integrate planning, retrieval, and synthesis for long-form research.
- It utilizes Reinforcement Learning with Evolving Rubrics (RLER) to dynamically refine evaluation criteria and improve evidence grounding, achieving significant performance gains.
- Empirical results show that DR Tulu-8B Agent outperforms competitors with fewer tool calls, providing a scalable blueprint for research in low-resource and code-mixed domains.
DR Tulu-8B Agent refers to a class of AI research agents trained for open-ended, long-form deep research using LLMs. The most prominent instance, DR Tulu-8B, is the first open model directly optimized for open-ended research synthesis using Reinforcement Learning with Evolving Rubrics (RLER) atop a modular Model-Context-Protocol (MCP) agent infrastructure. This design enables multi-step planning, evidence retrieval, and synthesis, with a training paradigm focused on co-evolving, prompt-specific evaluation rubrics that adapt dynamically to new knowledge, emergent weaknesses, and domain-specific research requirements. DR Tulu-8B achieves substantial performance gains over both open and proprietary deep research systems, and suggests methodological blueprints for building future expert agents in low-resource, code-mixed, or specialized domains (Shao et al., 24 Nov 2025; D et al., 15 Aug 2025).
1. Agent Architecture and System Decomposition
DR Tulu-8B is structured as an MCP agent, where Model–Context–Protocol delineates the decomposition of research workflows. The agent comprises three principal modules:
- Planner: Implements stepwise reasoning by generating “think” traces. At each step $t$, the planner executes a planning action, proposing either next reasoning steps or search queries, and appends its output to the evolving context buffer $c_t$, yielding $c_{t+1}$.
- Retriever: Upon generation of a tool-call action, the agent invokes external tools (e.g., google_search, snippet_search, web_browse) with model-generated arguments. The returned evidence is incorporated into the context.
- Synthesizer: When sufficient evidence is present, the agent issues a final-answer action. The answer includes explicit citation tags, with all claims grounded in previously retrieved evidence.
The architecture is unified by a central LLM policy $\pi_\theta$, based on Qwen3-8B, responsible for both planning and synthesis. Inter-module communication operates exclusively through the token-level context $c_t$, and an asynchronous inference engine ensures non-blocking parallel API and tool calls. The entire context, including “think” steps, tool calls, and evidence, is accumulated, serving as an externalized working memory. Notably, the system dispenses with a separate critic; instead, a secondary LLM (GPT-4.1-mini) functions as the reward judge during training (Shao et al., 24 Nov 2025).
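A minimal Python sketch of this loop follows, under assumptions: the `policy_llm` and `tools` interfaces, the action names, and the fallback behavior are illustrative stand-ins, not the released dr-agent-lib API.

```python
# Minimal sketch of the MCP-style agent loop: a single policy LLM emits
# "think", tool-call, and final-answer actions; all intermediate output is
# appended to one token-level context buffer that serves as working memory.
# Function names and action tags here are illustrative, not the dr-agent-lib API.

from dataclasses import dataclass, field

@dataclass
class AgentState:
    context: list[str] = field(default_factory=list)  # externalized working memory

def run_agent(policy_llm, tools, prompt, max_tool_calls=10):
    state = AgentState(context=[prompt])
    for _ in range(max_tool_calls):
        # Planner: the policy generates a "think" trace plus its next action.
        step = policy_llm.generate("\n".join(state.context))
        state.context.append(step.text)

        if step.action == "tool_call":
            # Retriever: invoke an external tool (e.g., google_search,
            # snippet_search, web_browse) and append the returned evidence.
            evidence = tools[step.tool_name](**step.tool_args)
            state.context.append(f"<evidence>{evidence}</evidence>")
        elif step.action == "final_answer":
            # Synthesizer: return the cited, evidence-grounded report.
            return step.text
    # Fall back to forcing an answer once the tool-call budget is exhausted.
    return policy_llm.generate("\n".join(state.context) + "\nProvide the final answer.").text
```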
2. Reinforcement Learning with Evolving Rubrics (RLER)
The RLER framework generalizes RL with Verifiable Rewards (RLVR) by maintaining dynamically updating, instance-specific, knowledge-grounded rubrics that serve as task- and prompt-specific reward functions. For each prompt $x$, the system maintains:
- Persistent rubrics: Generated once, before training, using an LM prompted on $x$ plus relevant retrieved content.
- Active rubrics: Continuously accrued during RL, consisting of newly minted rubric items that most effectively differentiate recent on-policy rollouts.
Each rubric is a weighted criterion $(c_i, w_i)$, with $c_i$ a textual rubric phrase and $w_i$ its weight. The rubric score for an answer $y$ is
$$S(y) = \sum_i w_i \, s_i(y),$$
where each per-criterion score $s_i(y)$ is provided by a judge LLM.
Evolving rubrics are generated at each training step by a rubric-generator LM prompted to distinguish among sampled on-policy rollouts. Candidate rubrics are filtered to retain only the most discriminative items according to reward variance, and the policy is updated with the trust-region-free Group Relative Policy Optimization (GRPO) method.
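A compact sketch of the rubric-scoring and rubric-evolution logic described above, with assumptions: the `judge.score` interface, the variance-based filter, and the retention count `k` are simplified illustrations of the discriminativeness criterion, not the paper's exact implementation.

```python
# Sketch of RLER rubric scoring and evolution. Each rubric is a weighted
# textual criterion; a judge LLM scores whether an answer satisfies it, and
# newly proposed rubrics are kept only if they discriminate between rollouts
# (approximated here by per-rubric score variance). Interfaces are illustrative.

from statistics import pvariance

def rubric_score(answer, rubrics, judge):
    """Weighted sum of per-criterion judge scores, as in the rubric reward S(y)."""
    return sum(w * judge.score(answer, criterion) for criterion, w in rubrics)

def evolve_rubrics(active_rubrics, candidate_rubrics, rollouts, judge, k=8):
    """Keep the k candidate rubrics that best separate the sampled rollouts."""
    def discriminativeness(rubric):
        criterion, _w = rubric
        scores = [judge.score(answer, criterion) for answer in rollouts]
        return pvariance(scores)  # zero variance => rubric cannot rank rollouts

    ranked = sorted(candidate_rubrics, key=discriminativeness, reverse=True)
    return active_rubrics + ranked[:k]
```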
3. Mathematical Objectives and Optimization
The agent’s objective is cast as policy-gradient RL with auxiliary reward shaping. The reward for a trajectory $\tau$ under rubric weights $\{w_i\}$ is
$$R(\tau) = R_{\text{rubric}}(\tau) + \alpha_{\text{cite}}\, R_{\text{cite}}(\tau) + \alpha_{\text{fmt}}\, R_{\text{fmt}}(\tau) + \alpha_{\text{search}}\, R_{\text{search}}(\tau),$$
where $R_{\text{rubric}}$ depends on the rubric-driven assessment and the auxiliary terms reward citation quality, output format, and search-tool use. The optimization target is
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big].$$
Gradient steps are performed as
$$\theta \leftarrow \theta + \eta\, \nabla_\theta J(\theta).$$
Rubric weights are adjusted by ranking and selection; no end-to-end backpropagation through rubric generation occurs in practice (Shao et al., 24 Nov 2025).
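To make the shaped reward and the trust-region-free update concrete, the sketch below combines the rubric reward with the auxiliary terms and computes GRPO-style group-relative advantages over one prompt's rollouts; the coefficient names and default values are illustrative assumptions, not the paper's tuned settings.

```python
# Illustration of reward shaping plus GRPO-style group-relative advantages.
# The rubric reward is combined with auxiliary rewards for citation quality,
# format, and search-tool use; advantages are obtained by normalizing rewards
# within the group of rollouts for the same prompt (no critic network).
# Coefficients below are placeholders.

def shaped_reward(r_rubric, r_cite, r_format, r_search,
                  a_cite=0.1, a_format=0.1, a_search=0.1):
    return r_rubric + a_cite * r_cite + a_format * r_format + a_search * r_search

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: center and scale rewards within one prompt's rollouts."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```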
4. Training Regimen and Deployment Recipe
The system follows a two-phase training protocol:
- Supervised Fine-Tuning (SFT): The agent is initially trained on approximately 16,000 search-augmented trajectories. Sources include SearchArena (24k filtered user–assistant threads), OpenScholar (55k scientific queries), and a mixture of synthetic and benchmark QA data (HotpotQA, TaskCraft, WebWalker-Silver, MegaScience, PopQA). Teacher trajectories from GPT-5 with explicit “think” tokens are used for guidance. SFT is performed on Qwen3-8B for five epochs with batch size 16 and maximum sequence length 16,384.
- Online RL with RLER: Approximately 10,000 GPU hours are invested to fine-tune the policy in an online RL regime with RLER. Prompt sources expand to new long-form queries from SearchArena, OpenScholar, and RaR. Asynchronous inference and tool invocation minimize latency. Key hyperparameters: a batch size of 32 prompts, multiple rollouts per prompt, a KL penalty of 0.001, a cap of 10 tool calls per trajectory, and a bounded set of active rubrics. Sample packing and output masking are used to improve efficiency; a configuration sketch follows this list.
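The reported hyperparameters can be gathered into a configuration sketch. Fields set to `None` (exact learning rates, rollout count, rubric cap) were not recoverable from this summary and are placeholders to be filled from the released code and paper.

```python
# Training configuration sketch assembled from the reported recipe.
# Values marked None are placeholders, not recoverable from this summary;
# all other values are as reported.

SFT_CONFIG = {
    "base_model": "Qwen3-8B",
    "trajectories": 16_000,        # search-augmented SFT trajectories
    "epochs": 5,
    "batch_size": 16,
    "max_seq_len": 16_384,
    "learning_rate": None,         # placeholder
}

RL_CONFIG = {
    "algorithm": "GRPO",
    "prompt_batch_size": 32,
    "rollouts_per_prompt": None,   # placeholder
    "learning_rate": None,         # placeholder
    "kl_penalty": 0.001,
    "max_tool_calls": 10,
    "active_rubric_cap": None,     # placeholder
    "reward_judge": "GPT-4.1-mini",
    "sample_packing": True,
    "output_masking": True,
}
```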
All models, data, and infrastructure (including dr-agent-lib) are released for public research (Shao et al., 24 Nov 2025).
5. Empirical Evaluation and Benchmarking
DR Tulu-8B is systematically evaluated on four public long-form deep research benchmarks:
| Benchmark | Domain/Focus |
|---|---|
| AstaBench–ScholarQA–CS2 (SQAv2) | Science literature synthesis |
| HealthBench | Healthcare QA/dialogue |
| ResearchQA | Scholarly QA (75 fields) |
| DeepResearchBench | General-domain research |
Metrics include rubric-graded scores for factuality, comprehensiveness, coherence, and citation quality, with additional citation precision/recall for SQAv2, and detailed splits for DeepResearchBench.
Key results:
- Qwen3-8B baseline with search: substantially below both DR Tulu-8B variants
- DR Tulu-8B (SFT): $53.9$
- DR Tulu-8B (RLER fine-tuned): $63.7$
- Best open competitors (e.g. Search-R1-7B, ASearcher-Web-7B, WebExplorer-8B, WebThinker-32B, Tongyi-30B): mid 30s to low 50s
- Proprietary baselines: OpenAI Deep Research $64.9$, GPT-5+Search $65.8$, Gemini3 Pro+Search $57.0$, Claude-Sonnet $57.7$
DR Tulu-8B matches or exceeds the proprietary models despite being significantly smaller and substantially cheaper per query (USD 0.0019 vs. USD 1–2). The agent averages 4.3 tool calls per query (mainly free paper_search), versus 20–80 for closed models (Shao et al., 24 Nov 2025).
6. Ablation Studies and Qualitative Behavior
Ablation analyses confirm:
- Static, general rubrics (a single fixed checklist) reduce performance on long-form QA.
- LM-generated question-specific rubrics (“closed-book,” no search) confer only marginal gains.
- Initial search-contextualized rubrics improve measurably over the SFT checkpoint.
- Incorporating evolving rubrics further augments performance (see Tables 7–8 of the source).
Rubric specificity analysis finds that closed-book rubrics are substantially less assertive, whereas search-based and evolving rubrics are markedly more assertive and approach perfect factuality. RLER penalizes emergent pathological behaviors, for example unprompted code generation during survey tasks, leading to rapid correction (Figure 1). Despite RL being performed only on long-form data, transfer to short-form QA is strong: DR Tulu-8B (SFT) yields $58.0$ average accuracy on SimpleQA/2Wiki/WebWalker (vs. $47.5$ for the baseline), and RLER further improves this to $62.4$ (Table 9).
The agent demonstrates adaptive retrieval behaviors: paper_search is dominant (90%) on science QA (SQAv2), google_search is preferred for general QA, and tool mixing occurs on general research tasks (Figures 2–12).
7. Extensions and Outlook for Low-Resource Language Agents
While DR Tulu-8B is centered on long-form research in English and scientific/technical domains, research on extremely low-resource languages such as Dravidian Tulu reveals distinct challenges. On code-mixed Tulu, BiGRU models with self-attention strongly outperform large multilingual transformers, primarily because of underrepresentation in pretraining corpora, script switching and code-mixing, and domain shift. For a dedicated Tulu-8B agent, methodologically essential steps include corpus expansion by scraping diverse forums, code-mix generation, domain-adaptive masked-language modeling, self-attention modeling adapted to script switching, multi-task objectives for offense/sentiment/language identification, and human-in-the-loop active learning (D et al., 15 Aug 2025). A plausible implication is that the DR Tulu-8B blueprint of co-evolving question-specific evaluation and MCP-style modular context management may prove extensible to emerging low-resource agent designs.
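As a concrete reference point for the BiGRU-with-self-attention recipe, a minimal PyTorch sketch follows; the layer sizes, three-class output, and tokenization scheme are assumptions rather than the model from (D et al., 15 Aug 2025).

```python
# Minimal PyTorch sketch of a BiGRU + self-attention classifier for
# code-mixed text (e.g., Tulu offense/sentiment labels). Dimensions and
# vocabulary handling are illustrative; this is not the reference model.

import torch
import torch.nn as nn

class BiGRUAttentionClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bigru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)       # additive attention scorer
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids, mask):
        # token_ids, mask: (batch, seq_len); mask is 1 for real tokens, 0 for padding
        states, _ = self.bigru(self.embedding(token_ids))        # (B, T, 2H)
        scores = self.attn(torch.tanh(states)).squeeze(-1)       # (B, T)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)    # (B, T, 1)
        pooled = (weights * states).sum(dim=1)                   # attention-pooled (B, 2H)
        return self.classifier(pooled)
```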