DeepSeekR1 Professor Agent
- DeepSeekR1 Professor Agent is a modular open-domain QA system integrating a Planner, Search Actor, and Reasoner with SRR-Judge evaluation.
- It utilizes a modified ReAct loop to generate and refine candidate reasoning steps, enhancing pass@1 performance in multi-hop scenarios.
- The system applies iterative rejection-sampling fine-tuning coupled with step-level calibration to achieve significant benchmark improvements.
The DeepSeekR1 Professor Agent is a search-integrated reasoning agent built on the DeepSeekR1 framework and enhanced by the SRR-Judge system for step-level assessment and refinement. It targets open-domain question answering and tool-augmented reasoning, evaluating and correcting each decision step to produce more robust responses in complex multi-hop QA scenarios. The agent uses a modular architecture (Planner, Search Actor, and Reasoner) and an SRR-Judge-mediated "rate-and-refine" workflow that aligns the policy to fine-grained expert feedback and improves pass@1 performance on challenging QA benchmarks (Zhang et al., 8 Feb 2026).
1. Architecture and Workflow
The DeepSeekR1 Professor Agent operates via a modified ReAct loop, a stepwise interaction paradigm in which, at each step $t$, the agent accumulates a history $h_t$, generates a "thought", selects an action $a_t$, receives an observation $o_t$, and proceeds iteratively to completion. The SRR-Judge model, a 32B-parameter LLM fine-tuned for step-level evaluation, interposes at each decision point to score and optionally refine (thought, action) pairs based on their context-conditioned quality.
The inference protocol is organized as follows:
- At each step $t$, the agent proposes $N$ candidate thought–action pairs.
- SRR-Judge evaluates these, outputting for each candidate a step-level quality rating $r$, a short explanation, and an optional refinement.
- The highest-rated candidate is selected if its rating meets the threshold $\tau$; otherwise, its refinement is used.
- This process continues until an answer action is selected or the predefined maximum number of steps ($K$) is reached.
The threshold $\tau$ and candidate count $N$ are set separately for online inference and for offline alignment trace generation; the RFT pseudocode in Section 3 uses $N = 5$ and $\tau = 4$.
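The selection rule in the protocol above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Judgement` type, `select_step`, and `toy_judge` are hypothetical names, and the fallback of keeping the original candidate when no refinement is offered is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, str]  # (thought, action)


@dataclass
class Judgement:
    rating: int                       # step-level quality rating
    explanation: str                  # short rationale for the rating
    refinement: Optional[Candidate]   # optional refined (thought, action)


def select_step(candidates: List[Candidate],
                judge: Callable[[Candidate], Judgement],
                tau: int) -> Candidate:
    """Return the top-rated candidate, or its refinement if rated below tau."""
    judged = [(judge(c), c) for c in candidates]
    best_judgement, best_candidate = max(judged, key=lambda jc: jc[0].rating)
    # Assumption: keep the candidate when the judge offers no refinement.
    if best_judgement.rating >= tau or best_judgement.refinement is None:
        return best_candidate
    return best_judgement.refinement


def toy_judge(candidate: Candidate) -> Judgement:
    """Stand-in judge for illustration only; a real judge would be an LLM call."""
    thought, action = candidate
    if action.startswith("search:"):
        return Judgement(3, "query too broad", (thought, action + " 2024"))
    return Judgement(2, "no tool use", None)


candidates = [("find the year", "search: x"), ("find the year", "noop")]
chosen = select_step(candidates, toy_judge, tau=4)
```

With `tau=4` the best candidate (rated 3) falls below the threshold, so its refinement is returned; with `tau=3` the original candidate is kept.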
2. Step-Level Rating, Training, and Calibration
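SRR-Judge assigns a rating $r$ to each step, considering four criteria: clarity/conciseness, logical structure, query appropriateness, and coverage/improvement potential. The model is fine-tuned from QwQ-32B on (rating, explanation, refinement) supervision using the standard cross-entropy loss:

$$\mathcal{L}(\theta) = -\sum_{i} \log p_\theta(y_i \mid x, y_{<i})$$

where $x$ is the step context and $y$ the target annotation tokens. Calibration is quantified by the point-biserial correlation $r_{pb}$ between step ratings and final-answer correctness $c \in \{0, 1\}$:

$$r_{pb} = \frac{M_1 - M_0}{s_r} \sqrt{\frac{n_1 n_0}{n^2}}$$

where $M_1$ and $M_0$ are the mean ratings over correct and incorrect cases, $n_1$ and $n_0$ are the corresponding counts, $n = n_1 + n_0$, and $s_r$ denotes the standard deviation of $r$.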
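As a worked illustration of the calibration metric, the sketch below computes a point-biserial correlation between step ratings and binary correctness labels. It assumes the standard population-form definition (difference of group means, scaled by the population standard deviation and the group proportions); whether the original work uses this exact variant is an assumption, and the function name is hypothetical.

```python
import math
from typing import Sequence


def point_biserial(ratings: Sequence[float], correct: Sequence[int]) -> float:
    """Point-biserial correlation between ratings and 0/1 correctness labels."""
    if len(ratings) != len(correct):
        raise ValueError("ratings and correct must have the same length")
    n = len(ratings)
    ones = [r for r, c in zip(ratings, correct) if c == 1]
    zeros = [r for r, c in zip(ratings, correct) if c == 0]
    n1, n0 = len(ones), len(zeros)
    if n1 == 0 or n0 == 0:
        raise ValueError("both correct and incorrect cases are required")
    mean = sum(ratings) / n
    # Population standard deviation of all ratings.
    s = math.sqrt(sum((r - mean) ** 2 for r in ratings) / n)
    if s == 0:
        raise ValueError("ratings have zero variance")
    m1 = sum(ones) / n1   # mean rating when the final answer is correct
    m0 = sum(zeros) / n0  # mean rating when the final answer is incorrect
    return (m1 - m0) / s * math.sqrt(n1 * n0 / (n * n))


# Toy data: high ratings coincide with correct answers.
r_pb = point_biserial([5, 4, 2, 1], [1, 1, 0, 0])
```

This form is numerically identical to the Pearson correlation between the ratings and the 0/1 labels, which is a convenient way to cross-check an implementation.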
3. Iterative Rejection-Sampling Fine-Tuning (RFT)
The iterative RFT procedure aligns the DeepSeekR1 policy to high-quality, SRR-Judge-rated trajectories:
- For each RFT iteration, the current policy generates best-of-$N$ trajectories per question using SRR-Judge scoring ($N = 5$ in the pseudocode below).
- Only trajectories for which $\min_j r_j \ge \tau_{\text{accept}}$ are retained, where $r_j$ is the rating of step $j$ and $\tau_{\text{accept}}$ is the acceptance threshold.
- The aggregate set of accepted trajectories is used to further fine-tune the policy via a supervised loss.
- This loop is repeated for $T$ rounds or as needed for further improvement.
The following pseudocode encapsulates the RFT procedure:
    procedure RFT_ITERATION(current_policy M_cur, judge F, data Q, iterations T, max_steps K, threshold τ_accept)
        D_all ← ∅
        for it in 1..T do
            D_new ← ∅
            for q in Q do
                // best-of-N trajectory produced with SRR-Judge rating and refinement
                traj ← INFER_WITH_SRR(M_cur, F, q, K, N=5, τ=4)
                // r_j is the SRR-Judge rating of step j in traj
                if min_j r_j ≥ τ_accept then
                    D_new ← D_new ∪ {traj}
            D_all ← D_all ∪ D_new
            M_cur ← SFT_FINE_TUNE(M_cur, D_all)
        return M_cur
    end procedure
Acceptance is predicated strictly on all step ratings meeting or exceeding the threshold.
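The acceptance rule can be expressed compactly. The sketch below is illustrative only: the trajectory representation (a dict with a `ratings` list), the function names, and the threshold value used in the example are assumptions, not details from the source.

```python
from typing import Dict, List


def accept(step_ratings: List[int], tau_accept: int) -> bool:
    """Accept a trajectory only if every step rating meets the threshold."""
    # Assumption: an empty trajectory is rejected rather than vacuously accepted.
    return len(step_ratings) > 0 and min(step_ratings) >= tau_accept


def filter_trajectories(trajectories: List[Dict], tau_accept: int) -> List[Dict]:
    """Keep trajectories whose minimum step rating is at least tau_accept."""
    return [t for t in trajectories if accept(t["ratings"], tau_accept)]


# Illustrative data; the threshold of 4 here is a placeholder value.
pool = [
    {"question": "q1", "ratings": [5, 4, 4]},
    {"question": "q2", "ratings": [5, 3, 5]},  # one weak step rejects the whole trajectory
]
kept = filter_trajectories(pool, tau_accept=4)
```

Filtering on the minimum rather than the mean means a single low-rated step is enough to exclude a trajectory, which matches the strict criterion stated above.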
4. Data Annotation, Filtering, and Trajectory Generation
SRR-Judge annotation relies on QA pairs drawn from a mixture of open QA benchmarks (e.g., InfoSeek-Hard, DuetQA-Verified, ASearcher-LRM), with one subset of pairs used for judge training and another for SFT in the described experiments.
The annotation pipeline entails:
- Generating search-integrated trajectories using a strong teacher (DeepSeek-V3.1) under vanilla ReAct.
- Performing 5 independent SRR-Judge annotations per step and taking the majority vote for the rating $r$.
- Extracting one run's rating, explanation, and refinement as the representative annotation.
- Discarding trajectories whose average step rating correlates with final correctness below the point-biserial cutoff (0.7, as listed in Section 7).
Curation of the SRR-annotated pool involves balancing the distribution of step ratings by upsampling rare cases and synthesizing negative examples.
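Two of the curation steps above, majority voting over repeated annotations and upsampling of rare ratings, can be sketched as follows. The tie-breaking rule (prefer the lower rating) and the choice to upsample every rating level to the size of the largest one are assumptions made for illustration; the source does not specify either, and the function names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List


def majority_rating(votes: List[int]) -> int:
    """Majority vote over repeated ratings; ties go to the lower rating (assumption)."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(r for r, c in counts.items() if c == top)


def upsample_balanced(examples: List[Dict], seed: int = 0) -> List[Dict]:
    """Upsample each rating level, with replacement, to the largest level's size."""
    rng = random.Random(seed)
    by_rating = defaultdict(list)
    for ex in examples:
        by_rating[ex["rating"]].append(ex)
    target = max(len(group) for group in by_rating.values())
    balanced = []
    for group in by_rating.values():
        balanced.extend(group)
        # Draw the shortfall by sampling with replacement from the same level.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


step_rating = majority_rating([4, 4, 5, 3, 4])  # five independent annotations
pool = [{"rating": 5}] * 4 + [{"rating": 2}]
balanced = upsample_balanced(pool)
```

After balancing, each rating level in `balanced` appears equally often, so the judge sees rare low ratings as frequently as common high ones during training.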
5. Retrofitting SRR-Judge to DeepSeekR1 Professor Agent
DeepSeekR1 is organized as three coupled modules: Planner (thought generation), Search Actor (tool invocation), and Reasoner (answer synthesis). Integration of SRR-Judge proceeds as follows:
- Step Definition: Treat each planner, search actor, and reasoner invocation as a step $t$.
- Inference Pipeline:
  - Upon Planner emission of a thought and Search Actor selection of an action $a_t$, invoke SRR-Judge on the resulting pair.
  - If the rating falls below the threshold $\tau$, replace the pair with its refinement via Reasoner logic and re-present it to the Search Actor.
  - Proceed if the rating passes the threshold.
- Training Pipeline:
  - Deploy DeepSeekR1 (current policy) on the QA pool.
  - Annotate trajectories with SRR-Judge; filter for $\min_j r_j \ge \tau_{\text{accept}}$.
  - Augment with baseline SFT data.
  - Fine-tune all modules end-to-end to replicate the filtered high-quality trajectories.
  - Repeat RFT as detailed above.
Standard evaluation uses the same tool-budgeted benchmarks as the original experiments: BrowseComp-En, BrowseComp-Zh, and Xbench-DeepSearch, with a maximum-step limit $K$.
6. Empirical Evaluation
The efficacy of the SRR-Judge-augmented DeepSeekR1 Professor Agent is demonstrated by both improved calibration of step ratings and substantial performance gains under standard metrics.
- Step Rating–Correctness Correlation: SRR-Judge (QwQ-32B) is assessed on first-step, last-step, and average-step point-biserial correlation, and its average-step correlation surpasses that of the larger DeepSeek-V3.1.
- Inference-Time Refinement Gains: Across BrowseComp-En, BrowseComp-Zh, and Xbench-DeepSearch, DeepSeek-R1 + SRR refine achieves 14.6±3.2, 37.8±0.7, and 55.3±2.5 pass@1 respectively, improving on QwQ-32B + SRR refine and scoring much higher than vanilla QwQ-32B.
- Alignment (RFT) Gains:
  - RFT with SRR-Judge yields 16.2±0.8 (BrowseComp-En), 38.3±2.4 (BrowseComp-Zh), and 61.3±1.5 (Xbench-DeepSearch) after two iterations, representing a ∼10 percent absolute pass@1 increase over direct SFT, with all improvements statistically significant under a paired bootstrap test.
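For readers unfamiliar with the significance test named above, the following is a minimal sketch of a one-sided paired bootstrap over per-question correctness. The exact variant, resample count, and function name are assumptions; the source only states that a paired bootstrap was used.

```python
import random
from typing import Sequence


def paired_bootstrap_p(system_a: Sequence[int],
                       system_b: Sequence[int],
                       n_resamples: int = 10_000,
                       seed: int = 0) -> float:
    """Estimate how often system A fails to beat system B under resampling.

    Inputs are per-question 0/1 correctness scores on the same questions.
    """
    if len(system_a) != len(system_b):
        raise ValueError("both systems must be scored on the same questions")
    rng = random.Random(seed)
    n = len(system_a)
    diffs = [a - b for a, b in zip(system_a, system_b)]
    not_better = 0
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each A/B pair together.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples


# Toy scores for illustration only; these are not results from the source.
p_clear_win = paired_bootstrap_p([1] * 20, [0] * 20, n_resamples=1000)
p_no_difference = paired_bootstrap_p([1, 0, 1, 0], [1, 0, 1, 0], n_resamples=1000)
```

Pairing matters because both systems answer the same questions: resampling question indices, rather than each system's scores independently, preserves the per-question correlation between the two systems.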
7. Design Choices and Hyperparameters
Key design parameters include:
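- Maximum steps $K$.
- Candidate count $N$, set separately for online and offline use ($N = 5$ in the RFT pseudocode).
- Rating and refinement threshold $\tau$ ($\tau = 4$ in the RFT pseudocode).
- Judge training: 1 epoch of SFT on ≈40,000 step examples, with upsampling of rare rating levels, 10,000 samples for a further rating level, and 10,000 synthetic negatives.
- RFT rounds $T$, each using ≈6,000 QA instances.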
- Trajectory filtering: point-biserial ≥ 0.7.
These choices optimize the balance between computational cost and the reliability of both ratings and policy alignment to high-quality intermediate reasoning steps.
By incorporating the SRR-Judge framework into the DeepSeekR1 Professor Agent, search-integrated reasoning agents can move beyond outcome-based supervision, achieving fine-grained control, traceable step-quality measurement, and significantly stronger benchmark performance, as confirmed by independent evaluations and controlled ablation studies (Zhang et al., 8 Feb 2026).