Dr. Zero: Data-Free Self-Evolving LLM Framework
- Dr. Zero is a search-augmented large language model framework that achieves open-domain, multi-hop reasoning without relying on human-annotated data.
- It employs a dual-agent design where a proposer generates synthetic QA tasks and a solver uses external search and reward feedback to refine answers.
- The framework’s hop-grouped relative policy optimization reduces computational costs while enabling performance comparable to fully supervised models on complex QA benchmarks.
Dr. Zero is a search-augmented LLM framework that enables open-domain and multi-hop reasoning capabilities through a fully data-free, self-evolutionary process. By constructing a closed-loop curriculum between a synthetic task proposer and an answering solver, both agents—initialized from the same base LLM—are co-evolved without access to human-annotated training examples. This paradigm leverages external search engines on-the-fly for knowledge acquisition and employs reward-driven feedback to autonomously calibrate problem difficulty and answer accuracy, culminating in emergent reasoning proficiency comparable to fully supervised agents (Yue et al., 11 Jan 2026).
1. Motivation and Conceptual Foundations
Dr. Zero addresses the constraints imposed by limited or costly high-quality training data, especially for open-domain tasks involving multi-step inferential reasoning and tool-use. Traditional models rely on human-annotated corpora and substantial supervision, which are prohibitive for many real-world scenarios such as proprietary databases or low-resource languages. Dr. Zero circumvents this bottleneck by employing a single LLM to instantiate both a "proposer" for generating diverse QA tasks and a "solver" for attempting their resolution, with both agents relying solely on external web search and self-generated feedback signals during their evolution (Yue et al., 11 Jan 2026).
The self-evolutionary feedback loop compels the proposer to create tasks that are neither trivial nor unsolvable, subject to a scalar reward based on the solver’s multi-try pass-rate. This recursive dynamic yields automated curriculum learning, continuously shifting the QA distribution toward higher complexity as the solver improves.
2. System Architecture and Self-Evolution Workflow
The Dr. Zero framework comprises two synergistic components:
- Proposer: Initialized from the base LLM and responsible for formulating QA pairs via multi-turn reasoning interleaved with search-engine calls.
- Solver: Also initialized from the base LLM, tasked with answering the synthetic questions using multi-step, tool-augmented rollouts.
The self-evolution feedback loop operates as follows:
- Proposer Creation Phase: For each input prompt, the proposer generates a QA pair (xᵢ, yᵢ) through a single multi-step reasoning and search sequence.
- Solver Evaluation Phase: The solver attempts each xᵢ multiple times (n rollouts) to empirically estimate the QA instance's difficulty and solvability, recording the number of correct responses, kᵢ.
- Reward Calculation: The reward signal rᵢ for each QA pair is linear in the solver's pass-rate:

  rᵢ = 𝟙[kᵢ > 0] · (1 − kᵢ/n) + b_fmt,

  where b_fmt is a format-validity bonus awarded when the generated QA pair respects the required output format. Questions the solver never answers correctly (kᵢ = 0) earn no difficulty reward, so the proposer is discouraged from generating unsolvable tasks.
- Proposer Update via HRPO: Using the hop-grouped relative policy optimization (HRPO), proposer policy parameters are updated with group-level advantage baselines (see Section 3).
- Solver Update via GRPO: The solver’s policy is refined using group relative policy optimization (GRPO) over the collected data.
This feedback loop is repeated, inducing an automated curriculum: as the solver improves, its average pass-rate on the current question distribution rises, and the proposer is incentivized to surface increasingly sophisticated (but still solvable) queries.
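The reward step above can be sketched as a small Python function. This is a minimal reading of the linear difficulty reward, assuming an illustrative format-bonus value of 0.1 (the paper's exact constant is not given here):

```python
def proposer_reward(k: int, n: int, format_valid: bool, fmt_bonus: float = 0.1) -> float:
    """Reward for a synthetic QA pair given the solver's multi-try results.

    k: number of the solver's n rollouts that answered correctly.
    Linear difficulty term: hard-but-solvable questions score highest;
    questions the solver never solves (k == 0) earn no difficulty reward,
    discouraging unsolvable tasks.
    """
    difficulty = (1.0 - k / n) if k > 0 else 0.0
    return difficulty + (fmt_bonus if format_valid else 0.0)

# A trivially easy question (solved in all rollouts) and an unsolvable one
# both earn only the format bonus; a hard-but-solvable one earns more.
easy = proposer_reward(k=5, n=5, format_valid=True)        # 0.1
unsolvable = proposer_reward(k=0, n=5, format_valid=True)  # 0.1
hard = proposer_reward(k=1, n=5, format_valid=True)        # ≈ 0.9
```

Because the reward peaks just above kᵢ = 0, the proposer's optimum shifts toward harder questions as the solver's pass-rates rise, which is the mechanism driving the curriculum.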
3. Hop-Grouped Relative Policy Optimization (HRPO)
HRPO is central to the practical efficiency and stability of Dr. Zero. The algorithm clusters QA pairs by their hop count hᵢ, the number of reasoning or search steps needed to answer. Each group Gₕ serves as a stratum for calculating a group-level baseline reward:

bₕ = (1 / |Gₕ|) Σ_{j ∈ Gₕ} rⱼ.

For sample i in group Gₕ, the advantage is

Aᵢ = rᵢ − bₕ.

Proposer policy parameters θ are updated by KL-regularized policy gradient ascent:

θ ← θ + η ∇θ ( E[Aᵢ log πθ(yᵢ | xᵢ)] − β · D_KL(πθ ‖ π_ref) ),

where η is the learning rate and β the KL regularization coefficient.
Solver optimization proceeds similarly, employing GRPO (group-based clipped policy gradient), which factors in empirical advantage and change in action probabilities. This hop-grouped baseline structure allows Dr. Zero to minimize estimator variance and compute requirements during training by using only one generated question per prompt, plus solver rollouts—resulting in 50% to 75% reduction in rollout count compared to standard methods.
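The hop-grouped baseline can be sketched directly from its definition. The hop counts and rewards below are illustrative values, not taken from the paper:

```python
from collections import defaultdict

def hop_grouped_advantages(hops, rewards):
    """HRPO-style advantages: for each sample, subtract the mean reward of
    its hop-count group (QA pairs needing the same number of reasoning or
    search steps) from its own reward."""
    groups = defaultdict(list)
    for h, r in zip(hops, rewards):
        groups[h].append(r)
    baselines = {h: sum(rs) / len(rs) for h, rs in groups.items()}
    return [r - baselines[h] for h, r in zip(hops, rewards)]

hops    = [2, 2, 3, 3, 3]            # hop count per QA pair
rewards = [0.9, 0.5, 0.8, 0.2, 0.5]  # proposer rewards
advs = hop_grouped_advantages(hops, rewards)
# 2-hop baseline = 0.7, 3-hop baseline = 0.5, so
# advs ≈ [0.2, -0.2, 0.3, -0.3, 0.0]
```

Stratifying the baseline by hop count keeps easy one-hop and hard multi-hop questions from being compared against a single pooled mean, which is what lets a single question per prompt still yield a low-variance advantage estimate.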
4. Training Efficiency and Computational Implications
Dr. Zero’s workflow, particularly under HRPO, yields substantial compute savings. With G candidate questions per prompt and n solver rollouts per question, standard GRPO costs G·n rollouts per prompt; HRPO generates a single question per prompt (G = 1), lowering this to 2n. Empirically, with n = 5, Dr. Zero requires 10 rollouts per prompt vs. 20 for GRPO—halving computation. Benchmark results with a 3B backbone model show HRPO achieving an exact match (EM) average of 0.326 versus GRPO’s 0.320. This suggests HRPO preserves or modestly improves answering quality despite compute reductions.
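The rollout accounting can be checked with a small helper. G = 4 and n = 5 are the values implied by the 20-vs-10 comparison above, and the split of HRPO's 2n budget into reward estimation plus solver update is an assumed reading, not stated explicitly:

```python
def rollouts_per_prompt(method: str, G: int, n: int) -> int:
    """Rollout budget per prompt under each training scheme.

    grpo: G candidate questions, each evaluated with n solver rollouts.
    hrpo: one generated question, with 2n rollouts total (assumed here to
          be n for reward estimation plus n for the solver update).
    """
    if method == "grpo":
        return G * n
    if method == "hrpo":
        return 2 * n
    raise ValueError(f"unknown method: {method}")

grpo_cost = rollouts_per_prompt("grpo", G=4, n=5)  # 20
hrpo_cost = rollouts_per_prompt("hrpo", G=4, n=5)  # 10
saving = 1 - hrpo_cost / grpo_cost                 # 0.5, i.e. 50%
```

With G between 4 and 8, the saving 1 − 2/G ranges from 50% to 75%, matching the reduction quoted in Section 3.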
The group-level baseline, however, can be less granular for complex multi-hop tasks, occasionally yielding increased outcome variance, though this is mitigated by on-policy updates and hop-stratified normalization.
5. Performance Evaluation and Benchmark Comparison
Dr. Zero is validated across seven QA datasets: three one-hop (Natural Questions, TriviaQA, PopQA) and four multi-hop (HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle), using both Qwen2.5-3B and Qwen2.5-7B LLMs. Baseline comparisons include few-shot prompting, IRCoT, Search-o1, RAG, as well as fully supervised agents (SFT, R1, Search-R1).
Key results include:
- Dr. Zero-3B outperforms all few-shot baselines on single-hop tasks (NQ EM 0.397 vs. prompting 0.106, IRCoT 0.111, Search-o1 0.238).
- Relative to supervised Search-R1 (which uses labeled data), Dr. Zero-3B achieves equivalent or superior EM: +22.9% (NQ), +6.5% (TriviaQA), +18.4% (PopQA), with no human training data.
- Dr. Zero-7B matches or exceeds Search-R1 on single-hop, and achieves ≥90% of supervised scores on multi-hop, narrowly outperforming on 2WikiMQA.
- Against other data-free approaches (SQLM*, R-Zero*), Dr. Zero produces +39.9%/+27.3% average gains, especially for multi-hop tasks.
Ablation studies demonstrate that removing the format-validity bonus reduces average EM, and that linear difficulty rewards yield optimal performance. Self-evolution reliably converges in 2–3 curriculum iterations; extended training may cause entropy collapse in higher-parameter models.
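The EM metric reported throughout these comparisons is conventionally computed after light answer normalization. The variant below follows standard open-domain QA practice (lowercasing, stripping punctuation and articles, collapsing whitespace) and is an assumption about the evaluation setup, not a detail confirmed by the paper:

```python
import re
import string

def normalize_answer(s: str) -> str:
    """Standard open-domain QA normalization: lowercase, drop punctuation,
    remove English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized prediction equals the normalized gold answer."""
    return int(normalize_answer(prediction) == normalize_answer(gold))

exact_match("The Eiffel Tower.", "eiffel tower")  # 1
exact_match("Paris, France", "Paris")             # 0
```

EM is strict: a correct but differently phrased answer scores 0, which is why the absolute EM values on multi-hop benchmarks appear low even for strong systems.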
6. Applications, Limitations, and Prospects
Dr. Zero’s architectural independence from curated datasets positions it for deployment in scenarios where labeled data are unavailable, including proprietary enterprise search, low-resource languages, and rapid prototyping for reasoning agents. Its ability to continually self-improve within closed knowledge environments (e.g., scientific or legal corpora) is noteworthy.
Limitations are intrinsic to the self-evolution process:
- Larger LLMs (7B+) may exhibit instability, including instruction drift and entropy collapse as iterations progress.
- Performance plateaus, forming a “curriculum ceiling” after a few cycles; further diversity or reward engineering could extend gains.
- The present reward strategy focuses on solver pass-rates; incorporating metrics such as retrieval diversity or proof length may strengthen curriculum robustness.
- Bias amplification is a risk under unchecked self-evolution, necessitating future integration of safety and fairness controls.
A plausible implication is that Dr. Zero’s approach lays groundwork for persistent self-improvement and adaptive tool-augmented reasoning in domains devoid of labeled data, stimulating research into automated curriculum generation and data-free agent evolution.
For detailed methodology, experimental analysis, and policy optimization pseudocode, see (Yue et al., 11 Jan 2026).