
DeepSeekR1 Professor Agent

Updated 7 April 2026
  • DeepSeekR1 Professor Agent is a modular open-domain QA system integrating a Planner, Search Actor, and Reasoner with SRR-Judge evaluation.
  • It utilizes a modified ReAct loop to generate and refine candidate reasoning steps, enhancing pass@1 performance in multi-hop scenarios.
  • The system applies iterative rejection-sampling fine-tuning coupled with step-level calibration to achieve significant benchmark improvements.

The DeepSeekR1 Professor Agent is a search-integrated reasoning agent built on the DeepSeekR1 framework and enhanced by the SRR-Judge system for step-level assessment and refinement. It targets open-domain question answering and tool-augmented reasoning, evaluating and, where needed, correcting each decision step to produce robust responses in complex multi-hop QA scenarios. The agent uses a modular architecture (Planner, Search Actor, and Reasoner) and an SRR-Judge-mediated "rate-and-refine" workflow that aligns the policy to fine-grained expert feedback and improves pass@1 performance on challenging QA benchmarks (Zhang et al., 8 Feb 2026).

1. Architecture and Workflow

The DeepSeekR1 Professor Agent operates via a modified ReAct loop: a stepwise interaction paradigm in which, at each step $j$, the agent accumulates a history $h_j=(t_0,\text{act}_0,o_0,\ldots,t_{j-1},\text{act}_{j-1},o_{j-1})$, generates a "thought" $t_j$, selects an action $\text{act}_j\in\{\text{search},\text{answer}\}$, receives an observation $o_j$, and proceeds iteratively to completion. The SRR-Judge model, a 32B-parameter LLM fine-tuned for step-level evaluation, interposes at each decision point to score and optionally refine $(t_j,\text{act}_j)$ pairs based on their context-conditioned quality.

The inference protocol is organized as follows:

  • At each step $j$, the agent proposes $N$ candidate thought–action pairs $\{(t_j^i, \text{act}_j^i)\}_{i=1}^N$.
  • SRR-Judge evaluates these, outputting $(e_j^i, r_j^i, \tilde t_j^i, \tilde{\text{act}}_j^i)$, where $r_j^i$ denotes a step-level quality rating, $e_j^i$ is a short explanation, and $(\tilde t_j^i, \tilde{\text{act}}_j^i)$ provide optional refinements.
  • The highest-rated candidate ($i^*=\arg\max_i r_j^i$) is selected if $r_j^{i^*}\ge\tau$; otherwise, its refinement $(\tilde t_j^{i^*}, \tilde{\text{act}}_j^{i^*})$ is used.
  • This process continues until an answer action is selected or the predefined maximum number of steps ($K$) is reached.

The threshold $\tau$ and candidate count $N$ are set separately for online inference and for offline alignment trace generation; the RFT pseudocode in Section 3 uses $N=5$, $\tau=4$ for the latter.
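The rate-and-refine protocol above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose`, `judge`, and `search` are hypothetical callables standing in for the policy, SRR-Judge, and the search tool, and using the thought text as the search query is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str]  # (thought, action); action is "search" or "answer"


@dataclass
class Judgement:
    explanation: str                # short explanation e
    rating: int                     # step-level rating r
    refined: Optional[Step] = None  # optional refinement of the step


def select_step(candidates: List[Step],
                judge: Callable[[Step], Judgement],
                tau: int) -> Step:
    """Keep the highest-rated candidate if it reaches tau, else its refinement."""
    judged = [(judge(c), c) for c in candidates]
    best_j, best_c = max(judged, key=lambda jc: jc[0].rating)
    if best_j.rating >= tau or best_j.refined is None:
        return best_c
    return best_j.refined


def run_episode(propose, judge, search, question: str,
                n: int, tau: int, max_steps: int) -> list:
    """Run the loop until an answer action or the step limit (hypothetical API)."""
    history = []
    for _ in range(max_steps):
        candidates = propose(question, history, n)
        thought, action = select_step(
            candidates, lambda c: judge(history, c), tau)
        if action == "answer":
            history.append((thought, action, None))
            break
        # Assumption: the thought doubles as the search query.
        history.append((thought, action, search(thought)))
    return history
```

Falling back to the unrefined candidate when the judge offers no refinement is a design choice of this sketch; the source does not say what happens in that case.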

2. Step-Level Rating, Training, and Calibration

SRR-Judge assigns a rating $r_j$ to each step according to four criteria: clarity/conciseness, logical structure, query appropriateness, and coverage/improvement potential. The model is fine-tuned from QwQ-32B on step-level annotation supervision using the standard cross-entropy loss:

$$\mathcal{L}(\theta) = -\sum_{k} \log p_\theta(y_k \mid y_{<k}, x),$$

where $x$ is the judge input (history and candidate step) and $y$ is the target annotation. Calibration is quantified by the point-biserial correlation $r_{pb}$ between step ratings and final-answer correctness $c\in\{0,1\}$:

$$r_{pb} = \frac{M_1 - M_0}{s_r}\sqrt{\frac{n_1 n_0}{n^2}},$$

where $M_1$ and $M_0$ are the mean ratings of the correct ($c=1$) and incorrect ($c=0$) groups, $n_1$ and $n_0$ are the group sizes, $n=n_1+n_0$, and $s_r$ denotes the (population) standard deviation of the ratings.
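The point-biserial calibration measure can be computed directly from a list of ratings and binary correctness labels. The sketch below uses the standard textbook formula (difference of group means over the population standard deviation, scaled by the group proportions); the function name and argument layout are illustrative assumptions.

```python
import math
from typing import Sequence


def point_biserial(ratings: Sequence[float], correct: Sequence[int]) -> float:
    """Point-biserial correlation between ratings and 0/1 correctness labels."""
    if len(ratings) != len(correct):
        raise ValueError("ratings and correct must have the same length")
    n = len(ratings)
    pos = [r for r, c in zip(ratings, correct) if c == 1]
    neg = [r for r, c in zip(ratings, correct) if c == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    mean = sum(ratings) / n
    # Population standard deviation (divide by n), matching the n1*n0/n^2 factor.
    s = math.sqrt(sum((r - mean) ** 2 for r in ratings) / n)
    if s == 0:
        raise ValueError("ratings have zero variance")
    m1 = sum(pos) / len(pos)
    m0 = sum(neg) / len(neg)
    return (m1 - m0) / s * math.sqrt(len(pos) * len(neg) / n ** 2)
```

With this normalization the result equals the Pearson correlation between the ratings and the 0/1 labels.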

3. Iterative Rejection-Sampling Fine-Tuning (RFT)

The iterative RFT procedure aligns the DeepSeekR1 policy to high-quality, SRR-Judge-rated trajectories:

  • For each RFT iteration, the current policy generates best-of-$N$ trajectories per question using SRR-Judge scoring (the pseudocode below uses $N=5$).
  • Only trajectories for which $\min_j r_j \ge \tau_{\text{accept}}$ are retained.
  • The aggregate set of accepted trajectories is used to further fine-tune the policy via supervised loss.
  • This loop is repeated for $T$ rounds or as needed for further improvement.

The following pseudocode encapsulates the RFT procedure:

    procedure RFT_ITERATION(current_policy M_cur, judge F, data Q, iterations T)
        D_all ← ∅
        for it in 1..T do
            D_new ← ∅
            for q in Q do
                trajs ← INFER_WITH_SRR(M_cur, F, q, K, N=5, τ=4)
                if min_j r_j ≥ τ_accept then
                    D_new ← D_new ∪ {trajs}
            D_all ← D_all ∪ D_new
            M_cur ← SFT_FINE_TUNE(M_cur, D_all)
        return M_cur
    end procedure

Acceptance is predicated strictly on all step ratings meeting or exceeding the threshold.
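For readers who prefer executable code, the same loop can be written as a short Python sketch. The `infer` and `fine_tune` callables are hypothetical placeholders for SRR-guided inference and supervised fine-tuning, and representing a trajectory as a list of step dictionaries with a `"rating"` key is an assumption made for illustration.

```python
from typing import Callable, Dict, List, Sequence

Trajectory = List[Dict]  # each step is a dict carrying at least a "rating" key


def accept(traj: Trajectory, tau_accept: int) -> bool:
    """Accept only if every step rating meets or exceeds the threshold."""
    return bool(traj) and min(step["rating"] for step in traj) >= tau_accept


def rft(policy,
        questions: Sequence[str],
        infer: Callable,      # hypothetical: (policy, question) -> Trajectory
        fine_tune: Callable,  # hypothetical: (policy, trajectories) -> policy
        iterations: int,
        tau_accept: int):
    """Iterative rejection-sampling fine-tuning over accepted trajectories."""
    d_all: List[Trajectory] = []
    for _ in range(iterations):
        # Generate one judged trajectory per question with the current policy.
        d_new = [t for t in (infer(policy, q) for q in questions)
                 if accept(t, tau_accept)]
        d_all.extend(d_new)
        # Fine-tune on everything accepted so far, as in the pseudocode.
        policy = fine_tune(policy, d_all)
    return policy
```

Note that the accepted pool accumulates across rounds, so later rounds train on trajectories from earlier policies as well as the current one.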

4. Data Annotation, Filtering, and Trajectory Generation

SRR-Judge annotation relies on QA pairs drawn from a mixture of open QA benchmarks (e.g., InfoSeek-Hard, DuetQA-Verified, ASearcher-LRM), with separate subsets of pairs used for judge training and for SFT in the described experiments.

The annotation pipeline entails:

  1. Generating search-integrated trajectories using a strong teacher (DeepSeek-V3.1) under vanilla ReAct.
  2. Performing 5 independent SRR-Judge annotations per step and computing the majority vote for the rating $r_j$.
  3. Extracting one run's $e_j$, $\tilde t_j$, and $\tilde{\text{act}}_j$ as the representative annotation.
  4. Discarding trajectories whose average step rating correlates with final correctness below the point-biserial cutoff ($r_{pb} < 0.7$).
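Steps 2 and 3 of the pipeline amount to a majority vote over repeated annotations followed by picking one run as the representative. The sketch below shows one way to do that. The dictionary layout, the tie-breaking rule (prefer the lower rating), and the choice of the first run that agrees with the vote are all assumptions; the source does not specify them.

```python
from collections import Counter
from typing import Dict, List


def aggregate_annotations(runs: List[Dict]) -> Dict:
    """Majority-vote the rating and take one agreeing run as representative."""
    if not runs:
        raise ValueError("need at least one annotation run")
    counts = Counter(run["rating"] for run in runs)
    top = max(counts.values())
    # Assumption: ties are broken conservatively toward the lower rating.
    voted = min(r for r, c in counts.items() if c == top)
    # Assumption: the first run that agrees with the vote is representative.
    rep = next(run for run in runs if run["rating"] == voted)
    return {
        "rating": voted,
        "explanation": rep["explanation"],
        "refinement": rep["refinement"],
    }
```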

Curation of the SRR-annotated pool involves balancing the distribution of step ratings by upsampling rare cases and synthesizing negative examples.
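Balancing by upsampling can be illustrated with a small helper that resamples each rating bucket, with replacement, up to the size of the largest bucket. This is a generic sketch of the idea, not the curation procedure from the source: the target size, the seed, and the `"rating"` key are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List


def balance_by_rating(examples: List[Dict], seed: int = 0) -> List[Dict]:
    """Upsample rarer rating levels until every level matches the largest one."""
    if not examples:
        return []
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        buckets[ex["rating"]].append(ex)
    target = max(len(b) for b in buckets.values())
    balanced: List[Dict] = []
    for rating in sorted(buckets):
        bucket = buckets[rating]
        balanced.extend(bucket)
        # Sample with replacement to fill the gap; k may be zero.
        balanced.extend(rng.choices(bucket, k=target - len(bucket)))
    return balanced
```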

5. Retrofitting SRR-Judge to DeepSeekR1 Professor Agent

DeepSeekR1 is organized as three coupled modules: Planner (thought generation), Search Actor (tool invocation), and Reasoner (answer synthesis). Integration of SRR-Judge proceeds as follows:

  • Step Definition: Treat each Planner, Search Actor, and Reasoner invocation as a step $j$.
  • Inference Pipeline:
    1. Upon Planner emission of $t_j$ and Search Actor selection of $\text{act}_j$, invoke SRR-Judge on $(h_j, t_j, \text{act}_j)$.
    2. If $r_j < \tau$, replace $(t_j, \text{act}_j)$ with the refinement $(\tilde t_j, \tilde{\text{act}}_j)$ via Reasoner logic and re-present it to the Search Actor.
    3. Proceed if the rating passes the threshold.
  • Training Pipeline:
    1. Deploy DeepSeekR1 (current policy) on the QA pool.
    2. Annotate trajectories with SRR-Judge; filter for $\min_j r_j \ge \tau_{\text{accept}}$.
    3. Augment with baseline SFT data.
    4. Fine-tune all modules end-to-end to replicate the filtered high-quality trajectories.
    5. Repeat RFT as detailed above.

Standard evaluation uses the same tool-budgeted benchmarks as the original experiments: BrowseComp-En, BrowseComp-Zh, and Xbench-DeepSearch, under the step limit $K$.

6. Empirical Evaluation

The efficacy of the SRR-Judge-augmented DeepSeekR1 Professor Agent is demonstrated by both improved calibration of step ratings and substantial performance gains under standard metrics.

  • Step Rating–Correctness Correlation: SRR-Judge (QwQ-32B) is assessed by the point-biserial correlation of its first-step, last-step, and average-step ratings with final-answer correctness; its average-step correlation surpasses that of the larger DeepSeek-V3.1.
  • Inference-Time Refinement Gains: Across BrowseComp, BrowseComp-Zh, and Xbench-DeepSearch, DeepSeek-R1 + SRR refine achieves 14.6±3.2, 37.8±0.7, and 55.3±2.5 pass@1 respectively, improving on QwQ-32B + SRR refine and far exceeding vanilla QwQ-32B.
  • Alignment (RFT) Gains: RFT with SRR-Judge yields 16.2±0.8 (BrowseComp), 38.3±2.4 (BrowseComp-Zh), and 61.3±1.5 (Xbench) after two iterations, an absolute pass@1 increase of roughly 10 points over direct SFT, with all improvements statistically significant (paired bootstrap).
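A paired bootstrap test of the kind mentioned above can be sketched as follows. This is a generic one-sided variant, not necessarily the procedure used in the source: it resamples per-question score differences and reports the fraction of resamples in which the new system fails to beat the baseline. The function name, resample count, and seed are assumptions.

```python
import random
from typing import Sequence


def paired_bootstrap_p(base: Sequence[float],
                       new: Sequence[float],
                       n_resamples: int = 2000,
                       seed: int = 0) -> float:
    """One-sided paired bootstrap: share of resamples where new <= base."""
    if len(base) != len(new) or not base:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(base)
    # Per-question differences keep the pairing between the two systems.
    diffs = [b_new - b_old for b_old, b_new in zip(base, new)]
    not_better = 0
    for _ in range(n_resamples):
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples
```

Here `base` and `new` would hold per-question correctness (0 or 1) for the two systems on the same questions, so the mean of each list is its pass@1.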

7. Design Choices and Hyperparameters

Key design parameters include:

  • Maximum steps $K$.
  • Candidate count $N$, set separately for online inference and offline trace generation.
  • Rating and refinement threshold $\tau$.
  • Judge training: 1 epoch of SFT on ≈40,000 step examples, with rare rating levels upsampled (10,000 samples for one level) and 10,000 synthetic negatives.
  • RFT rounds $T$, each using ≈6,000 QA instances.
  • Trajectory filtering: point-biserial ≥ 0.7.

These choices optimize the balance between computational cost and the reliability of both ratings and policy alignment to high-quality intermediate reasoning steps.


By incorporating the SRR-Judge framework into the DeepSeekR1 Professor Agent, search-integrated reasoning agents can move beyond outcome-based supervision, achieving fine-grained control, traceable step-quality measurement, and significantly stronger benchmark performance, as confirmed by independent evaluations and controlled ablation studies (Zhang et al., 8 Feb 2026).
