PokeeResearch-7B: Robust 7B Research Agent
- PokeeResearch-7B is a 7-billion-parameter agent designed to automate research reasoning and tool integration using a multi-step, chain-of-thought approach.
- It integrates specialized tool interfaces with interleaved dialog modes for precise tool calls, robust self-correction, and structured verification.
- Using Reinforcement Learning from AI Feedback, the model optimizes factual accuracy, citation relevance, and format compliance across complex research tasks.
PokeeResearch-7B is a 7-billion-parameter open-source deep research agent designed to address challenges in automated research reasoning, tool integration, and alignment within LLMs. Built upon the Qwen2.5-7B-Instruct backbone, PokeeResearch-7B incorporates structured tool use, robust self-correcting procedures, and direct policy optimization via reinforcement learning from AI feedback. The model achieves state-of-the-art results among 7B-scale agents across diverse and demanding research benchmarks, and prioritizes factual accuracy, citation relevance, and format compliance. The model and inference code are publicly available under the Apache 2.0 license at https://github.com/Pokee-AI/PokeeResearchOSS (Wan et al., 17 Oct 2025).
1. Model Architecture and System Design
PokeeResearch-7B employs a multi-part system architecture that layers specialized capabilities on top of the Qwen2.5-7B-Instruct LLM backbone. Key components of the architecture are:
- Specialized Tool Interfaces: Two principal tool APIs are integrated—Serper for web search and Jina Reader for individual web-page parsing—both accessed through XML-style tags (<tool_call>…</tool_call>, <tool_response>…</tool_response>).
- Interleaved Dialog Modes: The protocol distinguishes Research Mode, Verification Mode, and (at test time) Thread Synthesis Mode. Output segments are delimited with tags such as <answer> and <verification>, which make the agent's intent and context explicit at each step.
- Chain-of-Thought Reasoning: The model is prompted to reflect (within <think> tags) before tool invocation or answer generation, thereby generating intermediate reasoning steps.
- Turn-Level Operation: Upon each turn, the agent may propose a tool call, output an intermediate or final answer, or plan next steps—allowing granular tracking and diagnosis of tool use, as well as targeted correction.
- Componentization for Robustness: This architecture provides explicit hooks for failure detection, self-verification, and structured recovery measures, both at the tool-call and thread levels.
This design supports a research workflow in which tool calls and reasoning steps are traceable, reproducible, and modifiable at a fine-grained level.
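To illustrate how a tagged turn could be inspected, here is a minimal sketch in Python. The tag names come from the protocol described above; the JSON payload inside <tool_call> and the tool name `web_search` are hypothetical placeholders, not the released inference code's format.

```python
import re
from typing import Dict, List

# Tag names taken from the protocol described in this section.
TAGS = ("think", "tool_call", "tool_response", "answer", "verification")


def parse_turn(text: str) -> Dict[str, List[str]]:
    """Return the stripped contents of every known tag found in one turn."""
    found: Dict[str, List[str]] = {}
    for tag in TAGS:
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        found[tag] = [m.strip() for m in pattern.findall(text)]
    return found


def is_well_formed(text: str) -> bool:
    """Check that each known opening tag has as many closing tags."""
    return all(text.count(f"<{t}>") == text.count(f"</{t}>") for t in TAGS)


# Hypothetical turn: the payload format is an assumption for illustration.
turn = (
    "<think>I should look up the release date.</think>"
    '<tool_call>{"name": "web_search", "query": "PokeeResearch-7B"}</tool_call>'
)
parsed = parse_turn(turn)
```

A check like `is_well_formed` is one simple way a harness could flag a malformed tool call and hand the turn back to the agent for correction.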
2. Reinforcement Learning from AI Feedback (RLAIF)
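PokeeResearch-7B is trained using an annotation-free reinforcement learning paradigm termed Reinforcement Learning from AI Feedback (RLAIF). Unlike standard approaches that optimize for next-token likelihood or n-gram overlap, RLAIF directly optimizes the expected return

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \left[ R(x, y) \right],$$

where $R(x, y)$ represents a composite reward function over policy completions $y$ given prompt $x$. The training employs the REINFORCE Leave-One-Out (RLOO) algorithm:
- Sample $k$ trajectories $y^{(1)}, \dots, y^{(k)} \sim \pi_\theta(\cdot \mid x)$ for each input $x$.
- Assign rewards $R(x, y^{(i)})$.
- For each sample, compute the baseline $b_i = \frac{1}{k-1} \sum_{j \neq i} R(x, y^{(j)})$.
- Compute advantages $A_i = R(x, y^{(i)}) - b_i$.
- Gradient update: $\theta \leftarrow \theta + \alpha \, \frac{1}{k} \sum_{i=1}^{k} A_i \, \nabla_\theta \log \pi_\theta(y^{(i)} \mid x)$.
$R$ has two components:
- $R_{\text{AI}}$: A binary indicator from a high-capacity LLM “judge” of semantic correctness relative to ground truth.
- $R_{\text{format}}$: A small positive value conferred for adherence to required XML-format output.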
By optimizing directly for these reward signals, the agent aligns to human-valued qualities—factual accuracy, citation faithfulness (correct and relevant tool use), and instruction-following—rather than mere text similarity.
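The leave-one-out baseline at the heart of RLOO can be sketched in a few lines of Python. For each sampled completion, the baseline is the mean reward of the other completions drawn for the same prompt, and the advantage is the sample's own reward minus that baseline. The function name and the reward values below are illustrative assumptions, not taken from the released training code.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for k completions of a single prompt."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the mean reward of the other k - 1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]


# Illustrative rewards for four sampled trajectories (1.0 = judged correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = rloo_advantages(rewards)
```

Because each baseline excludes the sample's own reward, it does not bias the gradient estimate, and the advantages for a prompt always sum to zero.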
3. Chain-of-Thought and Multi-Call Reasoning Scaffold
At the core of PokeeResearch-7B’s process is a robust, chain-of-thought-driven, multi-call reasoning scaffold alternating between Research and Verification modes:
- Research Mode: The agent plans, makes tool calls, and iteratively accumulates context, interleaved with <think> annotations for internal reasoning.
- Verification Mode: After generating an <answer>, the agent enters self-verification, synthesizing a <verification> step to assess internal consistency and correctness.
- Self-Correction: Upon detection of errors (malformed tool calls, logical inconsistency, or tool failures), the model re-enters Research Mode to revise or restart its process.
This loop is encapsulated by a programmatic control skeleton (see the pseudocode in Wan et al., 17 Oct 2025) which, at inference, launches a configurable number of independent research threads in parallel. Upon thread completion, the model invokes a Research Thread Synthesis (RTS) stage that aggregates findings across threads, ranks them, and produces a consensus answer.
Key mechanisms for resilience include detection and correction of tool-call formatting errors, adaptive recovery from tool/API failures, and iterative self-verification before finalization.
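The alternation between Research and Verification modes can be pictured as a bounded retry loop. The following is a minimal sketch under stated assumptions: `research` and `verify` are hypothetical callables standing in for model calls, and the `max_rounds` cap is an illustrative choice rather than a documented setting.

```python
from typing import Callable, Optional


def research_with_verification(
    question: str,
    research: Callable[[str, Optional[str]], str],
    verify: Callable[[str, str], bool],
    max_rounds: int = 3,  # assumed cap, not a documented default
) -> Optional[str]:
    """Alternate research and verification until an answer passes."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        answer = research(question, feedback)  # Research Mode
        if verify(question, answer):           # Verification Mode
            return answer
        # Verification failed: re-enter Research Mode with feedback.
        feedback = f"Previous answer rejected: {answer}"
    return None  # no verified answer within the round budget


# Toy stand-ins so the sketch runs without a model.
def toy_research(question: str, feedback: Optional[str]) -> str:
    return "Paris" if feedback else "Lyon"


def toy_verify(question: str, answer: str) -> bool:
    return answer == "Paris"


result = research_with_verification(
    "What is the capital of France?", toy_research, toy_verify
)
```

In this toy run the first answer fails verification, the rejection is passed back as feedback, and the second round returns the verified answer.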
4. Empirical Evaluation and Benchmark Performance
PokeeResearch-7B is benchmarked on ten diverse “deep research” evaluation suites. Performance is quantified via mean@4 accuracy: each question is attempted in four independent runs, an LLM judges each run as correct or incorrect, and the fraction of correct runs is averaged over questions.
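As a concrete reading of the metric, the sketch below computes mean@k from per-run judge verdicts. The function name and the sample verdicts are illustrative assumptions. In the actual evaluation the verdicts come from an LLM judge.

```python
from typing import List


def mean_at_k(judgments: List[List[bool]]) -> float:
    """Average, over questions, of the fraction of runs judged correct."""
    if not judgments:
        raise ValueError("need at least one question")
    per_question = [sum(runs) / len(runs) for runs in judgments]
    return sum(per_question) / len(per_question)


# Two questions, four runs each (True = run judged correct).
judgments = [
    [True, True, False, True],
    [False, False, False, True],
]
score = mean_at_k(judgments)
```

Here the first question scores 0.75 and the second 0.25, giving a mean@4 of 0.5.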
Summary of benchmark results at 7B scale:
| Method | HLE | GAIA | BrowseComp |
|---|---|---|---|
| R1-Searcher | 5.4 | 8.3 | 1.0 |
| Search-R1 | 13.0 | 18.7 | 0.4 |
| ASearcher | 13.8 | 22.1 | 3.2 |
| DeepResearcher | 6.0 | 24.0 | 1.8 |
| PokeeResearch-7B | 15.2 | 36.9 | 5.4 |
| PokeeResearch-RTS | 17.6 | 41.3 | 8.4 |

On seven representative QA datasets (BAMB, 2WIKI, TQ, NQ, POPQA, MUSIQUE, HOTPOTQA), PokeeResearch-7B and PokeeResearch-RTS consistently surpass all prior 7B-scale systems. The RTS synthesis stage is particularly effective on the most complex tasks, selecting for correctness while discarding incomplete trajectories.
These results highlight the model’s superior robustness, alignment, and scalability stemming from its reasoning scaffold and direct reward optimization, even relative to substantially larger commercial systems.
5. Contributions, Limitations, and Prospective Directions
PokeeResearch-7B introduces several methodological advances for tool-augmented research agents at moderate scale:
- Direct Alignment with Human Value: The annotation-free RLAIF regime with AI feedback ensures that factual correctness, citation integrity, and proper formatting drive agent behavior.
- Robust and Modular Reasoning: The explicit self-correcting, multi-call process dramatically reduces unrecoverable workflow errors and improves generalization to complex queries.
- Resource-Efficient Deployment: The 7B-parameter scale supports broad accessibility and rapid inference, in contrast to 70B+ models.
Principal limitations and proposed directions currently documented:
- Text-Only Benchmarking: Evaluation has not extended to non-text modalities; future work includes integrating tabular, visual, and interactive web data.
- AI-Judge Reward Reliability: Reliance on a single LLM judge could admit calibration noise; ensemble or hybrid human judges may further optimize alignment.
- Memory Constraints: Scaling to multi-thread reasoning sessions exceeding 32,000 tokens requires advanced memory management.
- Hybrid Human–AI Feedback: Periodic human interventions are proposed to supplement or cross-validate LLM-based rewards in high-stakes applications.
A plausible implication is that PokeeResearch-7B’s architecture and training methods can serve as a template for future research agents aspiring to maximize robustness, verifiability, and alignment at accessible model scales (Wan et al., 17 Oct 2025).