DeepSeekR1 Professor Agent

Updated 7 April 2026

DeepSeekR1 Professor Agent is a modular open-domain QA system integrating a Planner, Search Actor, and Reasoner with SRR-Judge evaluation.
It utilizes a modified ReAct loop to generate and refine candidate reasoning steps, enhancing pass@1 performance in multi-hop scenarios.
The system applies iterative rejection-sampling fine-tuning coupled with step-level calibration to achieve significant benchmark improvements.

The DeepSeekR1 Professor Agent is a search-integrated reasoning agent built on the DeepSeekR1 framework and enhanced by the SRR-Judge system for step-level assessment and refinement. It targets open-domain question answering and tool-augmented reasoning, evaluating and, where needed, correcting each decision step to produce robust responses in complex multi-hop QA scenarios. The agent uses a modular architecture (Planner, Search Actor, and Reasoner) and an SRR-Judge-mediated "rate-and-refine" workflow that aligns the policy to fine-grained step-level feedback and yields substantial pass@1 improvements on challenging QA benchmarks (Zhang et al., 8 Feb 2026).

1. Architecture and Workflow

The DeepSeekR1 Professor Agent operates via a modified ReAct loop, a stepwise interaction paradigm in which, at each step $j$, the agent accumulates a history $h_j=(t_0,\text{act}_0,o_0,\ldots,t_{j-1},\text{act}_{j-1},o_{j-1})$, generates a "thought" $t_j$, selects an action $\text{act}_j\in\{\text{search},\text{answer}\}$, receives an observation $o_j$, and iterates until completion. The SRR-Judge model, a 32B-parameter LLM fine-tuned for step-level evaluation, intervenes at each decision point to score and optionally refine $(t_j,\text{act}_j)$ pairs according to their quality given the history.

The inference protocol is organized as follows:

- At each step $j$, the agent proposes $N$ candidate thought–action pairs $\{(t_j^i, \text{act}_j^i)\}_{i=1}^N$.
- SRR-Judge evaluates these, outputting $(e_j^i, r_j^i, \tilde t_j^i, \tilde{\text{act}}_j^i)$, where $r_j^i$ denotes a step-level quality rating, $e_j^i$ is a short explanation, and $(\tilde t_j^i, \tilde{\text{act}}_j^i)$ provide optional refinements.
- The highest-rated candidate ($i^\ast=\arg\max_i r_j^i$) is selected if $r_j^{i^\ast}\ge\tau$; otherwise, its refinement $(\tilde t_j^{i^\ast}, \tilde{\text{act}}_j^{i^\ast})$ is used.
- This process continues until an answer action is selected or the predefined maximum number of steps ($K$) is reached.

The threshold $\tau$ and the candidate count $N$ are set separately for online inference and for offline alignment trace generation; the RFT pseudocode in Section 3 calls the inference routine with $N=5$ and $\tau=4$.
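The select-or-refine rule for a single step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Judgement` container and the function name are hypothetical, and falling back to the original candidate when the judge supplies no refinement is an assumption (the text only says refinements are optional).

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# (thought, action); the action is "search" or "answer" in the text.
Step = Tuple[str, str]


@dataclass
class Judgement:
    """Hypothetical container for one SRR-Judge output."""
    rating: int              # step-level quality rating
    explanation: str         # short explanation of the rating
    refined: Optional[Step]  # optional refined (thought, action)


def select_or_refine(candidates: Sequence[Step],
                     judgements: Sequence[Judgement],
                     tau: int) -> Step:
    """Return the top-rated candidate, or its refinement if rated below tau."""
    if not candidates or len(candidates) != len(judgements):
        raise ValueError("need exactly one judgement per candidate")
    # Index of the highest-rated candidate (first one wins on ties).
    best = max(range(len(candidates)), key=lambda i: judgements[i].rating)
    verdict = judgements[best]
    # Assumption: keep the original step when no refinement is available.
    if verdict.rating >= tau or verdict.refined is None:
        return candidates[best]
    return verdict.refined
```

With `tau=4`, a best candidate rated 3 is swapped for its refinement, while a best candidate rated 4 or higher is used unchanged.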

2. Step-Level Rating, Training, and Calibration

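SRR-Judge assigns a rating $r_j$ to each step, considering four criteria: clarity/conciseness, logical structure, query appropriateness, and coverage/improvement potential. The model is fine-tuned from QwQ-32B on step-level annotation supervision using the standard cross-entropy loss:

$$\mathcal{L}(\theta) = -\sum_{k=1}^{|y|} \log p_\theta\left(y_k \mid x, y_{<k}\right),$$

where $x$ is the judge input (the history $h_j$ and the candidate step) and $y$ is the target annotation.

Calibration is quantified by the point-biserial correlation $r_{pb}$ between step ratings and final-answer correctness $c\in\{0,1\}$:

$$r_{pb} = \frac{M_1 - M_0}{s_n}\sqrt{\frac{n_1 n_0}{n^2}},$$

where $M_1$ and $M_0$ are the mean ratings over examples with $c=1$ and $c=0$, $n_1$ and $n_0$ are the corresponding counts, $n=n_1+n_0$, and $s_n$ denotes the standard deviation of the ratings.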
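The calibration measure can be computed directly from paired ratings and correctness labels. The sketch below uses the textbook point-biserial formula with the population standard deviation; the function name and input layout are illustrative assumptions, not part of the source.

```python
import math
from typing import Sequence


def point_biserial(ratings: Sequence[float], correct: Sequence[int]) -> float:
    """Point-biserial correlation between ratings and binary correctness."""
    if len(ratings) != len(correct):
        raise ValueError("ratings and correct must have the same length")
    if any(c not in (0, 1) for c in correct):
        raise ValueError("correct must contain only 0 or 1")
    pos = [r for r, c in zip(ratings, correct) if c == 1]
    neg = [r for r, c in zip(ratings, correct) if c == 0]
    n1, n0 = len(pos), len(neg)
    if n1 == 0 or n0 == 0:
        raise ValueError("need at least one correct and one incorrect example")
    n = n1 + n0
    mean_all = sum(ratings) / n
    # Population standard deviation of all ratings.
    s_n = math.sqrt(sum((r - mean_all) ** 2 for r in ratings) / n)
    if s_n == 0:
        raise ValueError("ratings have zero variance")
    m1 = sum(pos) / n1  # mean rating when the final answer is correct
    m0 = sum(neg) / n0  # mean rating when the final answer is incorrect
    return (m1 - m0) / s_n * math.sqrt(n1 * n0 / (n * n))
```

For ratings `[5, 4, 2, 1]` with correctness `[1, 1, 0, 0]`, the group means are 4.5 and 1.5, the standard deviation is $\sqrt{2.5}$, and the correlation is $\sqrt{0.9}\approx 0.949$.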

3. Iterative Rejection-Sampling Fine-Tuning (RFT)

The iterative RFT procedure aligns the DeepSeekR1 policy to high-quality, SRR-Judge-rated trajectories:

- For each RFT iteration, the current policy generates best-of-$N$ (typically $N=5$) trajectories per question using SRR-Judge scoring.
- Only trajectories for which $\min_j r_j \ge \tau_{\text{accept}}$ are retained.
- The aggregate set of accepted trajectories is used to further fine-tune the policy with a supervised loss.
- This loop is repeated for $T$ rounds or as needed for further improvement.

The following pseudocode encapsulates the RFT procedure:

    procedure RFT_ITERATION(current_policy M_cur, judge F, data Q, iterations T)
        D_all ← ∅
        for it in 1..T do
            D_new ← ∅
            for q in Q do
                trajs ← INFER_WITH_SRR(M_cur, F, q, K, N=5, τ=4)
                if min_j r_j ≥ τ_accept then
                    D_new ← D_new ∪ {trajs}
            D_all ← D_all ∪ D_new
            M_cur ← SFT_FINE_TUNE(M_cur, D_all)
        return M_cur
    end procedure

Acceptance is predicated strictly on all step ratings meeting or exceeding the threshold.
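The same loop can be written as a small runnable Python sketch. Everything model-specific is passed in as a callable: `infer_with_srr` and `sft_fine_tune` are hypothetical stand-ins for the inference and fine-tuning routines named in the pseudocode, and a trajectory is assumed to be a list of `(step_text, rating)` pairs.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Policy = TypeVar("Policy")
# Assumed layout: one (step text, judge rating) pair per step.
Trajectory = List[Tuple[str, int]]


def rft(policy: Policy,
        questions: Sequence[str],
        infer_with_srr: Callable[[Policy, str], Trajectory],
        sft_fine_tune: Callable[[Policy, List[Trajectory]], Policy],
        iterations: int,
        tau_accept: int) -> Tuple[Policy, List[Trajectory]]:
    """Iterative rejection-sampling fine-tuning over judge-rated trajectories."""
    accepted_all: List[Trajectory] = []
    for _ in range(iterations):
        accepted_new: List[Trajectory] = []
        for question in questions:
            traj = infer_with_srr(policy, question)
            # Accept only if every step rating meets the threshold.
            if traj and min(rating for _, rating in traj) >= tau_accept:
                accepted_new.append(traj)
        accepted_all.extend(accepted_new)
        # Fine-tune on everything accepted so far, as in the pseudocode.
        policy = sft_fine_tune(policy, accepted_all)
    return policy, accepted_all
```

Passing the routines as arguments keeps the acceptance logic testable with simple stubs before any model is attached.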

4. Data Annotation, Filtering, and Trajectory Generation

SRR-Judge annotation relies on QA pairs drawn from a mixture of open QA benchmarks (e.g., InfoSeek-Hard, DuetQA-Verified, ASearcher-LRM), with separate subsets of pairs used for judge training and for SFT in the described experiments.

The annotation pipeline entails:

- Generating search-integrated trajectories using a strong teacher (DeepSeek-V3.1) under vanilla ReAct.
- Performing 5 independent SRR-Judge annotations per step and taking the majority vote for the rating $r_j$.
- Extracting one run's $e_j$, $\tilde t_j$, and $\tilde{\text{act}}_j$ as the representative annotation.
- Discarding trajectories whose average step rating correlates with final correctness below the point-biserial cutoff ($r_{pb} < 0.7$).
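The majority-vote and representative-run steps can be sketched as below. The `Annotation` type is hypothetical, and two details are assumptions the text does not specify: ties are broken toward the lower rating, and the representative run is the first one whose rating matches the vote.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record for one independent judge run on a step."""
    rating: int
    explanation: str
    refined_thought: Optional[str] = None
    refined_action: Optional[str] = None


def aggregate_annotations(runs: Sequence[Annotation]) -> Annotation:
    """Majority-vote the rating, then return one run that carries it."""
    if not runs:
        raise ValueError("need at least one annotation run")
    counts = Counter(run.rating for run in runs)
    top = max(counts.values())
    # Assumption: on a tie, prefer the lower (more conservative) rating.
    voted = min(rating for rating, count in counts.items() if count == top)
    # Assumption: the first run with the voted rating is representative.
    return next(run for run in runs if run.rating == voted)
```

Returning a whole run rather than mixing fields across runs keeps the explanation and refinement consistent with the rating they were written for.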

Curation of the SRR-annotated pool involves balancing the distribution of step ratings by upsampling rare cases and synthesizing negative examples.
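Rating balancing by upsampling can be illustrated with a short sketch. It assumes each example is a `(step_text, rating)` pair and that rare ratings are padded by sampling with replacement up to a target count; both the function name and the target-count scheme are hypothetical. Synthesis of negative examples is not shown, since the source does not describe how it is done.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, int]  # assumed layout: (step text, rating)


def upsample_by_rating(examples: Sequence[Example],
                       target_per_rating: int,
                       seed: int = 0) -> List[Example]:
    """Pad under-represented ratings by resampling with replacement."""
    rng = random.Random(seed)
    buckets: Dict[int, List[Example]] = defaultdict(list)
    for example in examples:
        buckets[example[1]].append(example)
    balanced: List[Example] = []
    for rating in sorted(buckets):
        bucket = buckets[rating]
        balanced.extend(bucket)
        shortfall = target_per_rating - len(bucket)
        if shortfall > 0:
            # Duplicate existing examples; larger buckets are left untouched.
            balanced.extend(rng.choices(bucket, k=shortfall))
    return balanced
```

Buckets that already meet the target are kept as they are, so the procedure only ever adds duplicates of rare ratings and never discards data.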

5. Retrofitting SRR-Judge to DeepSeekR1 Professor Agent

DeepSeekR1 is organized as three coupled modules: Planner (thought generation), Search Actor (tool invocation), and Reasoner (answer synthesis). Integration of SRR-Judge proceeds as follows:

Step Definition: Treat each Planner, Search Actor, and Reasoner invocation as a step $j$.
Inference Pipeline:

- Upon Planner emission of $t_j$ and Search Actor selection of $\text{act}_j$, invoke SRR-Judge on $(h_j, t_j, \text{act}_j)$ to obtain the rating $r_j$.
- If $r_j < \tau$, replace $t_j$ with its refinement $\tilde t_j$ via Reasoner logic and re-present it to the Search Actor.
- Proceed once the rating passes the threshold.
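One way to picture the inference pipeline is as an episode loop over pluggable callables. This is a sketch under stated assumptions rather than the system's actual interface: the callable signatures, the `(kind, payload)` action encoding, and the single refinement attempt per step are all hypothetical.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, str]                  # assumed: ("search", query) or ("answer", text)
History = List[Tuple[str, Action, str]]   # (thought, action, observation) per step


def run_episode(question: str,
                planner: Callable[[str, History], str],
                search_actor: Callable[[str, History, str], Action],
                judge: Callable[[History, str, Action], int],
                refine: Callable[[History, str, Action], str],
                execute: Callable[[Action], str],
                max_steps: int,
                tau: int) -> History:
    """Run a judge-gated episode until an answer or the step limit."""
    history: History = []
    for _ in range(max_steps):
        thought = planner(question, history)
        action = search_actor(question, history, thought)
        if judge(history, thought, action) < tau:
            # Below threshold: refine the thought and re-present it
            # to the Search Actor (one attempt per step, by assumption).
            thought = refine(history, thought, action)
            action = search_actor(question, history, thought)
        observation = execute(action)
        history.append((thought, action, observation))
        if action[0] == "answer":
            break
    return history
```

The loop ends as soon as an `answer` action is recorded, or after `max_steps` steps if none is produced.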

Training Pipeline:

- Deploy DeepSeekR1 (current policy) on the QA pool.
- Annotate trajectories with SRR-Judge; filter for $\min_j r_j \ge \tau_{\text{accept}}$.
- Augment with baseline SFT data.
- Fine-tune all modules end-to-end to replicate the filtered high-quality trajectories.
- Repeat RFT as detailed above.

Standard evaluation uses the same tool-budgeted benchmarks as the original experiments (BrowseComp-En, BrowseComp-Zh, and Xbench-DeepSearch) under the maximum step limit $K$.

6. Empirical Evaluation

The efficacy of the SRR-Judge-augmented DeepSeekR1 Professor Agent is demonstrated by both improved calibration of step ratings and substantial performance gains under standard metrics.

- Step Rating–Correctness Correlation: Point-biserial correlations $r_{pb}$ are reported for first-step, last-step, and average-step ratings; SRR-Judge (QwQ-32B) surpasses the larger DeepSeek-V3.1 on average-step $r_{pb}$.
- Inference-Time Refinement Gains: Across BrowseComp, BrowseComp-Zh, and Xbench-DeepSearch, DeepSeek-R1 + SRR refine achieves 14.6±3.2, 37.8±0.7, and 55.3±2.5 pass@1, respectively, improving on QwQ-32B + SRR refine and far exceeding vanilla QwQ-32B.
- Alignment (RFT) Gains: RFT with SRR-Judge yields 16.2±0.8 (BrowseComp), 38.3±2.4 (BrowseComp-Zh), and 61.3±1.5 (Xbench) after two iterations, an absolute pass@1 increase of ∼10 percentage points over direct SFT, with all improvements statistically significant (paired bootstrap).
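A paired bootstrap test on per-question outcomes can be sketched as follows. The source names the test but not its exact variant, so this one-sided resampling scheme, the function name, and the default resample count are assumptions.

```python
import random
from typing import Sequence


def paired_bootstrap_p(scores_a: Sequence[float],
                       scores_b: Sequence[float],
                       n_resamples: int = 10_000,
                       seed: int = 0) -> float:
    """Fraction of resamples in which system A fails to beat system B."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    # Per-question differences keep the pairing between the two systems.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    not_better = 0
    for _ in range(n_resamples):
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples
```

Here `scores_a` and `scores_b` would hold per-question correctness (1 or 0) for the two systems on the same questions; a small returned value indicates that A's advantage is stable under resampling.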

7. Design Choices and Hyperparameters

Key design parameters include:

- Maximum steps $K$.
- Candidate count $N$, set separately for online inference and offline trace generation ($N=5$ in the RFT pseudocode).
- Rating and refinement threshold $\tau$ ($\tau=4$ in the RFT pseudocode).
- Judge training: 1 epoch of SFT on ≈40,000 step examples, with the rating distribution balanced by upsampling rare rating values (10,000 samples for one of them) and adding 10,000 synthetic negatives.
- RFT rounds $T$ (results are reported after two iterations), each using ≈6,000 QA instances.
- Trajectory filtering: point-biserial $r_{pb} \ge 0.7$.

These choices optimize the balance between computational cost and the reliability of both ratings and policy alignment to high-quality intermediate reasoning steps.
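For experimentation, the recoverable settings can be gathered into a single configuration object. The class and field names below are hypothetical; the defaults are the values stated in this article (candidate count and threshold from the RFT pseudocode, dataset sizes and cutoff from the list above), and the maximum step count has no default because its value is not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SRRConfig:
    """Hypothetical settings bundle for an SRR-Judge-style pipeline."""
    max_steps: int                        # K; must be supplied by the caller
    n_candidates: int = 5                 # N used in the RFT pseudocode
    tau: int = 4                          # rating/refinement threshold in the pseudocode
    judge_train_examples: int = 40_000    # approximate step examples for judge SFT
    rft_questions_per_round: int = 6_000  # approximate QA instances per RFT round
    point_biserial_cutoff: float = 0.7    # trajectory filtering cutoff

    def __post_init__(self) -> None:
        if self.max_steps < 1 or self.n_candidates < 1:
            raise ValueError("max_steps and n_candidates must be positive")
        if not 0.0 <= self.point_biserial_cutoff <= 1.0:
            raise ValueError("point_biserial_cutoff must lie in [0, 1]")
```

Making the object frozen means a run's settings cannot drift after construction, which helps when comparing online and offline configurations side by side.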

By incorporating the SRR-Judge framework into the DeepSeekR1 Professor Agent, search-integrated reasoning agents can move beyond outcome-based supervision, achieving fine-grained control, traceable step-quality measurement, and significantly stronger benchmark performance, as confirmed by independent evaluations and controlled ablation studies (Zhang et al., 8 Feb 2026).

References

SRR-Judge: Step-Level Rating and Refinement for Enhancing Search-Integrated Reasoning in Search Agents (2026)
