- The paper introduces a novel RL framework combining AI feedback and self-verification for deep research agents.
- It employs tool integration and multi-threaded synthesis to improve evidence retrieval and answer accuracy across ten benchmarks.
- The approach demonstrates enhanced robustness and human-aligned performance, outperforming existing 7B-scale research models.
PokeeResearch: Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold for Deep Research Agents
Introduction
PokeeResearch-7B presents a unified framework for developing deep research agents at the 7B parameter scale, focusing on robust tool-augmented reasoning and direct optimization for human-aligned answer quality. The system addresses key limitations of prior research agents, including shallow retrieval, brittle tool use, and training signals poorly aligned with human judgments of answer quality, by integrating reinforcement learning from AI feedback (RLAIF) with a robust reasoning scaffold. The agent is designed to decompose complex queries, retrieve and synthesize external evidence, and self-verify its outputs, achieving state-of-the-art performance across ten open-domain research benchmarks.
Methodology
Deep Research Workflow
PokeeResearch-7B operates in research-verification cycles. Upon receiving a query, the agent alternates between research mode—issuing tool calls or generating candidate answers—and verification mode, where it self-assesses the correctness of its output. Unlike prior approaches that terminate upon answer generation, PokeeResearch-7B introduces an explicit self-verification step, enabling the agent to detect and correct common failure modes such as incomplete answers, insufficient evidence, or logical errors. This iterative process continues until the answer passes verification or the context limit is reached.
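The loop below is a minimal sketch of this research-verification cycle, assuming hypothetical helper callables (`run_research_turn`, `execute_tool`, `verify_answer`) and a turn budget standing in for the context limit; it illustrates the control flow rather than the paper's actual implementation.

```python
# Minimal sketch of the research-verification loop. The helper callables are
# hypothetical placeholders, not the paper's actual implementation.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Turn:
    answer: Optional[str]      # candidate answer, if the model produced one
    tool_call: Optional[str]   # tool invocation, if the model chose to search/read

def research_verification_loop(
    query: str,
    run_research_turn: Callable[[list], Turn],   # one research-mode step
    execute_tool: Callable[[str], str],          # runs a tool call, returns observation
    verify_answer: Callable[[str, list], bool],  # verification-mode self-check
    max_turns: int = 32,                         # stand-in for the context limit
) -> Optional[str]:
    context = [query]
    for _ in range(max_turns):
        turn = run_research_turn(context)
        if turn.tool_call is not None:
            # Research mode: execute the tool call and append the observation.
            context.append(execute_tool(turn.tool_call))
        elif turn.answer is not None:
            # Verification mode: self-assess the candidate answer.
            if verify_answer(turn.answer, context):
                return turn.answer
            # Verification failed: record the failure and keep researching.
            context.append(f"Verification failed for: {turn.answer}")
    return None  # turn/context budget exhausted without a verified answer
```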
The agent is equipped with specialized tools for web search (Serper) and web reading (Jina Reader). Serper enables structured retrieval of relevant URLs and snippets, while Jina Reader provides concise summaries of webpage content. This modular tool-use design supports multi-hop information gathering and evidence synthesis, critical for complex research tasks.
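The snippet below sketches how these two tools are commonly called over HTTP. The endpoint paths, headers, and response fields follow the publicly documented Serper and Jina Reader interfaces as I understand them, and the wrapper functions (`web_search`, `read_page`) are assumptions rather than the paper's actual tool wrappers.

```python
# Hedged sketch of the two tool calls: Serper for structured web search and
# Jina Reader for readable page content. Treat request/response details as
# assumptions based on the public APIs, not the paper's exact wrappers.
import os
import requests

SERPER_KEY = os.environ.get("SERPER_API_KEY", "")

def web_search(query: str, k: int = 5) -> list[dict]:
    """Return up to k {title, link, snippet} results from Serper."""
    resp = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": SERPER_KEY, "Content-Type": "application/json"},
        json={"q": query},
        timeout=30,
    )
    resp.raise_for_status()
    organic = resp.json().get("organic", [])[:k]
    return [{"title": r.get("title"), "link": r.get("link"),
             "snippet": r.get("snippet")} for r in organic]

def read_page(url: str) -> str:
    """Fetch a readable text rendering of a webpage via Jina Reader."""
    resp = requests.get(f"https://r.jina.ai/{url}", timeout=60)
    resp.raise_for_status()
    return resp.text
```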
Training Pipeline and Algorithm
Training utilizes the MiroRL-GenQA dataset, which contains multi-turn, non-trivial question-answer pairs requiring deep research. The agent is fine-tuned using the REINFORCE Leave-One-Out (RLOO) algorithm, an unbiased on-policy policy gradient estimator. For each prompt, k completions are sampled, and the advantage for each is computed using a leave-one-out baseline. The update rule is:
$$\theta \leftarrow \theta + \alpha \, \frac{1}{k} \sum_{i=1}^{k} \left[ R(x, y^{(i)}) - b_i \right] \nabla_\theta \log \pi_\theta\!\left(y^{(i)} \mid x\right), \qquad b_i = \frac{1}{k-1} \sum_{j \neq i} R(x, y^{(j)})$$

where $b_i$ is the mean reward of the other $k-1$ completions. This approach reduces variance compared to classic REINFORCE and avoids the bias and instability observed in PPO-style algorithms and GRPO, especially during long training runs.
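As a concrete sketch, the function below computes an RLOO surrogate loss for one prompt in PyTorch; the tensor layout and function name are illustrative assumptions consistent with the update rule above, not the paper's training code.

```python
# Minimal RLOO sketch for one prompt: k sampled completions, each baselined
# against the mean reward of the other k-1 samples.
import torch

def rloo_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """
    log_probs: (k,) sum of log pi_theta(y_i | x) over the tokens of completion i
    rewards:   (k,) scalar reward R(x, y_i) for each completion
    """
    k = rewards.shape[0]
    # Leave-one-out baseline: b_i = mean of all rewards except the i-th.
    baselines = (rewards.sum() - rewards) / (k - 1)
    advantages = (rewards - baselines).detach()
    # Negative surrogate whose gradient matches the RLOO estimator above.
    return -(advantages * log_probs).mean()

# Usage: loss = rloo_loss(log_probs, rewards); loss.backward(); optimizer.step()
```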
Reward Design
PokeeResearch-7B's reward function is primarily based on AI feedback, supplemented by small format rewards. AI feedback leverages an external LLM judge to assess semantic equivalence between the agent's answer and the ground truth, overcoming the limitations of token-level metrics (F1, EM) which can be either overly permissive or excessively strict. This alignment with semantic correctness mitigates reward hacking and provides more reliable signals for complex reasoning.
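A hedged sketch of such a reward is shown below; the judge prompt, verdict parsing, and format-bonus weight are hypothetical stand-ins chosen for illustration, since the paper's exact prompt and weighting are not reproduced here.

```python
# Illustrative reward sketch: an external LLM judge scores semantic equivalence
# between the agent's answer and the reference, plus a small format bonus.
# The judge prompt and the weights are assumptions, not the paper's spec.
from typing import Callable

JUDGE_PROMPT = (
    "Question: {q}\nReference answer: {ref}\nModel answer: {pred}\n"
    "Are the two answers semantically equivalent? Reply YES or NO."
)

def compute_reward(
    question: str,
    prediction: str,
    reference: str,
    judge: Callable[[str], str],   # wraps a call to the external LLM judge
    followed_format: bool,         # e.g. answer emitted in the expected tags
    format_weight: float = 0.1,    # kept small relative to the AI-feedback term
) -> float:
    verdict = judge(JUDGE_PROMPT.format(q=question, ref=reference, pred=prediction))
    correctness = 1.0 if verdict.strip().upper().startswith("YES") else 0.0
    return correctness + (format_weight if followed_format else 0.0)
```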
Test-Time Research Threads Synthesis (RTS)
At inference, the agent launches multiple independent research threads per query. Each thread is summarized, and the model synthesizes these summaries to produce a final answer. This RTS approach is particularly effective for challenging queries, as it allows the agent to cross-validate evidence and select the most substantiated response, improving accuracy on benchmarks with high reasoning complexity.
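The sketch below illustrates the RTS pattern with hypothetical `run_thread` and `synthesize` callables; it is meant only to show the fan-out-then-synthesize structure, not the paper's inference code.

```python
# Sketch of Research Threads Synthesis (RTS): run several independent research
# threads, summarize each, then have the model synthesize a final answer from
# the summaries. Function names are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def rts_answer(
    query: str,
    run_thread: Callable[[str], str],        # one full research thread -> summary
    synthesize: Callable[[str, list], str],  # model call that merges the summaries
    n_threads: int = 4,
) -> str:
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        summaries = list(pool.map(run_thread, [query] * n_threads))
    # Cross-validate evidence across threads and return the best-supported answer.
    return synthesize(query, summaries)
```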
Experimental Results
PokeeResearch-7B was evaluated on ten benchmarks, including Natural Questions, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle, GAIA, BrowseComp, and Humanity's Last Exam. The agent consistently outperformed all open-source 7B-scale baselines (R1-Searcher, Search-R1, ZeroSearch, ASearcher, DeepResearcher) in mean@4 accuracy across all tasks. The RTS variant further improved performance, especially on the most challenging benchmarks (HLE, GAIA, BrowseComp), with gains of up to 4.4 points over the best baseline.
Self-verification was shown to be effective in correcting initially inaccurate answers, as demonstrated in detailed interaction logs. The agent's ability to diagnose and recover from tool failures and reasoning errors contributed to its superior robustness and reliability.
Implications and Future Directions
PokeeResearch-7B demonstrates that careful integration of reinforcement learning from AI feedback and robust reasoning scaffolds can yield research-grade agents that are both cost-efficient and resilient. The use of unbiased on-policy RL algorithms (RLOO) and semantic reward signals (AI feedback) directly optimizes for human-aligned answer quality, moving beyond surface-level lexical metrics.
Practically, the framework enables scalable deployment of deep research agents in open-domain settings, with robust handling of tool failures and dynamic environments. The modular tool-use and multi-threaded synthesis strategies are extensible to more complex toolchains and multi-modal research tasks.
Theoretically, the results suggest that reliability and alignment are as critical as model scale for advancing autonomous research agents. Future work may explore further improvements in reward modeling, integration of additional tools (e.g., code execution, data analysis), and extension to multi-modal and long-context reasoning. The open-source release of PokeeResearch-7B and its inference code provides a foundation for reproducible research and community-driven development.
Conclusion
PokeeResearch-7B establishes a new standard for 7B-scale deep research agents by combining reinforcement learning from AI feedback with a robust, self-correcting reasoning scaffold. The agent achieves state-of-the-art performance across diverse benchmarks, validating the efficacy of its design in both reasoning quality and operational resilience. The framework's emphasis on reliability, alignment, and modularity offers a promising direction for future research in scalable, autonomous, and human-aligned AI agents.