- The paper introduces a novel RL framework combining AI feedback and self-verification for deep research agents.
- It employs tool integration and multi-threaded synthesis to improve evidence retrieval and answer accuracy across ten benchmarks.
- The approach demonstrates enhanced robustness and human-aligned performance, outperforming existing 7B-scale research models.
PokeeResearch: Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold for Deep Research Agents
Introduction
PokeeResearch-7B presents a unified framework for developing deep research agents at the 7B parameter scale, focusing on robust tool-augmented reasoning and direct optimization for human-aligned answer quality. The system addresses key limitations of prior research agents, including shallow retrieval, brittle tool use, and training signals poorly aligned with human judgments of answer quality, by integrating reinforcement learning from AI feedback (RLAIF) with a robust reasoning scaffold. The agent is designed to decompose complex queries, retrieve and synthesize external evidence, and self-verify its outputs, achieving state-of-the-art performance across ten open-domain research benchmarks.
Methodology
Deep Research Workflow
PokeeResearch-7B operates in research-verification cycles. Upon receiving a query, the agent alternates between research mode—issuing tool calls or generating candidate answers—and verification mode, where it self-assesses the correctness of its output. Unlike prior approaches that terminate upon answer generation, PokeeResearch-7B introduces an explicit self-verification step, enabling the agent to detect and correct common failure modes such as incomplete answers, insufficient evidence, or logical errors. This iterative process continues until the answer passes verification or the context limit is reached.
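The loop below is a minimal sketch of this research-verification cycle, assuming hypothetical helper callables (`run_research_turn`, `execute_tool`, `verify_answer`) and a turn budget standing in for the context limit; it illustrates the control flow rather than the paper's actual implementation.

```python
# Minimal sketch of the research-verification loop. The helper callables are
# hypothetical placeholders, not the paper's actual implementation.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Turn:
    answer: Optional[str]      # candidate answer, if the model produced one
    tool_call: Optional[str]   # tool invocation, if the model chose to search/read

def research_verification_loop(
    query: str,
    run_research_turn: Callable[[list], Turn],   # one research-mode step
    execute_tool: Callable[[str], str],          # runs a tool call, returns observation
    verify_answer: Callable[[str, list], bool],  # verification-mode self-check
    max_turns: int = 32,                         # stand-in for the context limit
) -> Optional[str]:
    context = [query]
    for _ in range(max_turns):
        turn = run_research_turn(context)
        if turn.tool_call is not None:
            # Research mode: execute the tool call and append the observation.
            context.append(execute_tool(turn.tool_call))
        elif turn.answer is not None:
            # Verification mode: self-assess the candidate answer.
            if verify_answer(turn.answer, context):
                return turn.answer
            # Verification failed: record the failure and keep researching.
            context.append(f"Verification failed for: {turn.answer}")
    return None  # turn/context budget exhausted without a verified answer
```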
The agent is equipped with specialized tools for web search (Serper) and web reading (Jina Reader). Serper enables structured retrieval of relevant URLs and snippets, while Jina Reader provides concise summaries of webpage content. This modular tool-use design supports multi-hop information gathering and evidence synthesis, critical for complex research tasks.
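The snippet below sketches how these two tools are commonly called over HTTP. The endpoint paths, headers, and response fields follow the publicly documented Serper and Jina Reader interfaces as I understand them, and the wrapper functions (`web_search`, `read_page`) are assumptions rather than the paper's actual tool wrappers.

```python
# Hedged sketch of the two tool calls: Serper for structured web search and
# Jina Reader for readable page content. Treat request/response details as
# assumptions based on the public APIs, not the paper's exact wrappers.
import os
import requests

SERPER_KEY = os.environ.get("SERPER_API_KEY", "")

def web_search(query: str, k: int = 5) -> list[dict]:
    """Return up to k {title, link, snippet} results from Serper."""
    resp = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": SERPER_KEY, "Content-Type": "application/json"},
        json={"q": query},
        timeout=30,
    )
    resp.raise_for_status()
    organic = resp.json().get("organic", [])[:k]
    return [{"title": r.get("title"), "link": r.get("link"),
             "snippet": r.get("snippet")} for r in organic]

def read_page(url: str) -> str:
    """Fetch a readable text rendering of a webpage via Jina Reader."""
    resp = requests.get(f"https://r.jina.ai/{url}", timeout=60)
    resp.raise_for_status()
    return resp.text
```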
Training Pipeline and Algorithm
Training utilizes the MiroRL-GenQA dataset, which contains multi-turn, non-trivial question-answer pairs requiring deep research. The agent is fine-tuned using the REINFORCE Leave-One-Out (RLOO) algorithm, an unbiased on-policy policy gradient estimator. For each prompt, k completions are sampled, and the advantage for each is computed using a leave-one-out baseline. The update rule is:
$$\theta \leftarrow \theta + \alpha \, \frac{1}{k} \sum_{i=1}^{k} \left[ R(x, y^{(i)}) - b_i \right] \nabla_\theta \log \pi_\theta\!\left(y^{(i)} \mid x\right), \qquad b_i = \frac{1}{k-1} \sum_{j \neq i} R(x, y^{(j)})$$

where $b_i$ is the mean reward of the other $k-1$ completions. This approach reduces variance compared to classic REINFORCE and avoids the bias and instability observed in PPO-style algorithms and GRPO, especially during long training runs.
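As a concrete sketch, the function below computes an RLOO surrogate loss for one prompt in PyTorch; the tensor layout and function name are illustrative assumptions consistent with the update rule above, not the paper's training code.

```python
# Minimal RLOO sketch for one prompt: k sampled completions, each baselined
# against the mean reward of the other k-1 samples.
import torch

def rloo_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """
    log_probs: (k,) sum of log pi_theta(y_i | x) over the tokens of completion i
    rewards:   (k,) scalar reward R(x, y_i) for each completion
    """
    k = rewards.shape[0]
    # Leave-one-out baseline: b_i = mean of all rewards except the i-th.
    baselines = (rewards.sum() - rewards) / (k - 1)
    advantages = (rewards - baselines).detach()
    # Negative surrogate whose gradient matches the RLOO estimator above.
    return -(advantages * log_probs).mean()

# Usage: loss = rloo_loss(log_probs, rewards); loss.backward(); optimizer.step()
```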
Reward Design
PokeeResearch-7B's reward function is primarily based on AI feedback, supplemented by small format rewards. AI feedback leverages an external LLM judge to assess semantic equivalence between the agent's answer and the ground truth, overcoming the limitations of token-level metrics (F1, EM) which can be either overly permissive or excessively strict. This alignment with semantic correctness mitigates reward hacking and provides more reliable signals for complex reasoning.
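A hedged sketch of such a reward is shown below; the judge prompt, verdict parsing, and format-bonus weight are hypothetical stand-ins chosen for illustration, since the paper's exact prompt and weighting are not reproduced here.

```python
# Illustrative reward sketch: an external LLM judge scores semantic equivalence
# between the agent's answer and the reference, plus a small format bonus.
# The judge prompt and the weights are assumptions, not the paper's spec.
from typing import Callable

JUDGE_PROMPT = (
    "Question: {q}\nReference answer: {ref}\nModel answer: {pred}\n"
    "Are the two answers semantically equivalent? Reply YES or NO."
)

def compute_reward(
    question: str,
    prediction: str,
    reference: str,
    judge: Callable[[str], str],   # wraps a call to the external LLM judge
    followed_format: bool,         # e.g. answer emitted in the expected tags
    format_weight: float = 0.1,    # kept small relative to the AI-feedback term
) -> float:
    verdict = judge(JUDGE_PROMPT.format(q=question, ref=reference, pred=prediction))
    correctness = 1.0 if verdict.strip().upper().startswith("YES") else 0.0
    return correctness + (format_weight if followed_format else 0.0)
```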
Test-Time Research Threads Synthesis (RTS)
At inference, the agent launches multiple independent research threads per query. Each thread is summarized, and the model synthesizes these summaries to produce a final answer. This RTS approach is particularly effective for challenging queries, as it allows the agent to cross-validate evidence and select the most substantiated response, improving accuracy on benchmarks with high reasoning complexity.
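The sketch below illustrates the RTS pattern with hypothetical `run_thread` and `synthesize` callables; it is meant only to show the fan-out-then-synthesize structure, not the paper's inference code.

```python
# Sketch of Research Threads Synthesis (RTS): run several independent research
# threads, summarize each, then have the model synthesize a final answer from
# the summaries. Function names are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def rts_answer(
    query: str,
    run_thread: Callable[[str], str],        # one full research thread -> summary
    synthesize: Callable[[str, list], str],  # model call that merges the summaries
    n_threads: int = 4,
) -> str:
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        summaries = list(pool.map(run_thread, [query] * n_threads))
    # Cross-validate evidence across threads and return the best-supported answer.
    return synthesize(query, summaries)
```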
Experimental Results
PokeeResearch-7B was evaluated on ten benchmarks, including Natural Questions, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle, GAIA, BrowseComp, and Humanity's Last Exam. The agent consistently outperformed all open-source 7B-scale baselines (R1-Searcher, Search-R1, ZeroSearch, ASearcher, DeepResearcher) in mean@4 accuracy across all tasks. The RTS variant further improved performance, especially on the most challenging benchmarks (HLE, GAIA, BrowseComp), with gains of up to 4.4 points over the best baseline.
Self-verification was shown to be effective in correcting initially inaccurate answers, as demonstrated in detailed interaction logs. The agent's ability to diagnose and recover from tool failures and reasoning errors contributed to its superior robustness and reliability.
Implications and Future Directions
PokeeResearch-7B demonstrates that careful integration of reinforcement learning from AI feedback and robust reasoning scaffolds can yield research-grade agents that are both cost-efficient and resilient. The use of unbiased on-policy RL algorithms (RLOO) and semantic reward signals (AI feedback) directly optimizes for human-aligned answer quality, moving beyond surface-level lexical metrics.
Practically, the framework enables scalable deployment of deep research agents in open-domain settings, with robust handling of tool failures and dynamic environments. The modular tool-use and multi-threaded synthesis strategies are extensible to more complex toolchains and multi-modal research tasks.
Theoretically, the results suggest that reliability and alignment are as critical as model scale for advancing autonomous research agents. Future work may explore further improvements in reward modeling, integration of additional tools (e.g., code execution, data analysis), and extension to multi-modal and long-context reasoning. The open-source release of PokeeResearch-7B and its inference code provides a foundation for reproducible research and community-driven development.
Conclusion
PokeeResearch-7B establishes a new standard for 7B-scale deep research agents by combining reinforcement learning from AI feedback with a robust, self-correcting reasoning scaffold. The agent achieves state-of-the-art performance across diverse benchmarks, validating the efficacy of its design in both reasoning quality and operational resilience. The framework's emphasis on reliability, alignment, and modularity offers a promising direction for future research in scalable, autonomous, and human-aligned AI agents.