- The paper introduces Gen-Searcher, which grounds image generation in actively retrieved external knowledge via agentic reinforcement learning.
- It employs a novel dual reward mechanism combining image-based and text-based evaluations to enhance visual correctness and factual accuracy.
- Experimental results demonstrate significant K-Score improvements and superior fine-grained attribute grounding compared to existing T2I models.
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Introduction and Motivation
The paper "Gen-Searcher: Reinforcing Agentic Search for Image Generation" (2603.28767) addresses a central limitation of contemporary text-to-image (T2I) generation models: their reliance on static parametric knowledge, which restricts their ability to synthesize images for knowledge-intensive or temporally dynamic prompts. The lack of integration between up-to-date external information and image synthesis significantly impacts real-world applicability, particularly when generating images involving entities with changing attributes, recent events, or highly specialized facts not encoded during model pretraining.
Existing approaches to retrieval-augmented generation (RAG) for T2I focus primarily on database retrieval or shallow web search, and often lack visual grounding, multi-hop reasoning, and adaptive tool use. Furthermore, most prior systems either rely on manual prompt-engineering workflows or use text-based search exclusively for fact gathering, yielding suboptimal quality, poor extensibility across domains, and inadequate visual faithfulness.
Methodology
Gen-Searcher is presented as the first paradigm that actively trains a multimodal agentic search framework for grounded image generation. The system is architected as a search-augmented agent: a Large Multimodal Model (LMM) that learns, via agentic reinforcement learning (RL), to perform multi-hop web search, textual querying, image reference retrieval, and evidence aggregation. The agent, initialized from Qwen3-VL-8B-Instruct, is trained with both supervised fine-tuning (SFT) and agentic RL on carefully curated datasets that reinforce robust search behaviors.
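To make the agent loop concrete, below is a minimal sketch of a multi-turn rollout of the kind described; the `Step` schema, the `lmm.generate` client, and the tool dictionary are illustrative assumptions rather than the paper's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One model turn: either a tool call or a final grounded prompt."""
    text: str                 # final search-augmented prompt when tool is None
    tool: str | None = None   # "text_search" | "image_search" | "browse"
    args: str = ""            # query string or URL for the tool call

def rollout(lmm, tools, user_prompt, max_turns=8):
    """Run one agentic search trajectory; returns (grounded_prompt, references)."""
    messages = [{"role": "user", "content": user_prompt}]
    references = []
    for _ in range(max_turns):
        step = lmm.generate(messages)              # assumed LMM client API
        if step.tool is None:                      # agent judges evidence sufficient
            return step.text, references
        observation = tools[step.tool](step.args)  # execute the chosen tool
        references.append((step.tool, observation))
        messages.append({"role": "assistant", "content": f"{step.tool}({step.args})"})
        messages.append({"role": "tool", "content": str(observation)})
    return lmm.generate(messages).text, references  # budget exhausted: force an answer
```

The grounded prompt and collected references would then be handed to the downstream image generator.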
A novel data curation pipeline supports this effort:
- Prompt Construction: Search-intensive prompts spanning 20+ domains (science, pop culture, news, etc.) are generated by prompt-engineering state-of-the-art LLMs (Gemini 3 Pro) and by recasting deep-research QA datasets as visual-synthesis objectives.
- Agentic Trajectory Generation: Search and image-retrieval actions are executed in multi-turn trajectories using three tool interfaces (text search, image search, webpage browsing), capturing reasoning paths for supervised learning.
- Ground Truth Synthesis: The synthesized prompts and references are passed to Nano Banana Pro to produce ground-truth images.
- Quality Filtering: Seed1.8 provides multidimensional scoring (faithfulness, correctness, aesthetics, etc.), followed by rule-based filtering to ensure high-quality supervision (sketched below).
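As a rough illustration of the final filtering step, the snippet below applies rule-based thresholds to the judge's multidimensional scores; the dimension names come from the summary, while the thresholds and data layout are assumed.

```python
# Rule-based filter over multidimensional judge scores; the thresholds are
# illustrative assumptions, not values from the paper.
THRESHOLDS = {"faithfulness": 0.8, "correctness": 0.8, "aesthetics": 0.6}

def keep_sample(scores: dict[str, float]) -> bool:
    """Keep a (prompt, trajectory, image) sample only if every scored
    dimension clears its minimum threshold."""
    return all(scores.get(dim, 0.0) >= t for dim, t in THRESHOLDS.items())

candidates = [
    {"id": 1, "judge_scores": {"faithfulness": 0.92, "correctness": 0.88, "aesthetics": 0.71}},
    {"id": 2, "judge_scores": {"faithfulness": 0.55, "correctness": 0.90, "aesthetics": 0.80}},
]
kept = [s for s in candidates if keep_sample(s["judge_scores"])]  # keeps sample 1 only
```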
Two primary training datasets result: Gen-Searcher-SFT-10k (for SFT) and Gen-Searcher-RL-6k (for RL). The KnowGen benchmark, comprising 630 manually verified samples, is introduced to robustly evaluate real-world, search-grounded image generation; benchmarking uses the K-Score metric, a weighted sum of faithfulness, visual correctness, text accuracy, and aesthetics.
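Since the summary specifies K-Score only as a weighted sum over four dimensions, the following is a minimal sketch with placeholder weights; the paper's actual weights are not given here.

```python
# K-Score as a weighted sum over four judged dimensions (scores assumed to
# lie in [0, 100]); the weights below are placeholders, not the paper's values.
K_WEIGHTS = {
    "faithfulness": 0.4,
    "visual_correctness": 0.3,
    "text_accuracy": 0.2,
    "aesthetics": 0.1,
}

def k_score(dim_scores: dict[str, float]) -> float:
    """Compute the scalar benchmark score from per-dimension judge scores."""
    return sum(w * dim_scores[d] for d, w in K_WEIGHTS.items())

print(k_score({"faithfulness": 40, "visual_correctness": 30,
               "text_accuracy": 25, "aesthetics": 50}))  # -> 35.0
```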
Agentic RL with Dual Reward Feedback
The RL stage employs Group Relative Policy Optimization (GRPO), augmented with a dual reward scheme that fuses image-based and text-based rewards:
- Image-based Reward: Evaluates the final image’s adherence to grounded attributes, using K-Score as reference.
- Text-based Reward: Judges the sufficiency and correctness of the generated search-augmented prompt and associated references (via GPT-4.1).
This dual reward structure (with a balancing hyperparameter, optimal at α ≈ 0.5) stabilizes training by mitigating reward signal variance arising from downstream generator limitations, while ensuring the agent’s outputs are both informationally sufficient and practically effective for robust image synthesis.
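Below is a minimal sketch of the reward fusion and a GRPO-style group-relative advantage, assuming the fusion is a convex combination (consistent with the reported α ≈ 0.5 optimum); trajectory sampling, KL regularization, and the judge models themselves are omitted.

```python
import numpy as np

def dual_reward(r_image: float, r_text: float, alpha: float = 0.5) -> float:
    """Fuse the image-based (K-Score judge) and text-based (prompt/reference
    judge) rewards; alpha = 0.5 follows the reported optimum."""
    return alpha * r_image + (1.0 - alpha) * r_text

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each trajectory's reward against
    the mean and std of its sampled group (no learned value function)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four trajectories sampled for the same prompt.
group = [dual_reward(ri, rt) for ri, rt in [(0.8, 0.9), (0.3, 0.7), (0.6, 0.2), (0.9, 0.8)]]
advantages = grpo_advantages(group)  # positive entries reinforce those trajectories
```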
Experimental Evaluation
Main Results
Gen-Searcher is systematically benchmarked on KnowGen and WISE, paired with state-of-the-art proprietary and open-source T2I generators. On KnowGen:

| Generator | K-Score (base) | K-Score (+Gen-Searcher) | Δ |
| --- | --- | --- | --- |
| Qwen-Image | 14.98 | 31.52 | +16.54 |
| Seedream 4.5 | 31.01 | 47.29 | +16.28 |
| Nano Banana Pro | 50.38 | 53.30 | +2.92 |

The gain on Nano Banana Pro holds despite that model's internal, text-only search capability.
Notably, these improvements are concentrated on visual correctness and text accuracy, the dimensions most critical for knowledge-intensive T2I generation.
On WISE, Gen-Searcher+Qwen-Image achieves 0.77 overall (vs. 0.62 for Qwen-Image), outperforming all open-domain image generators—even in subdomains such as chemistry, where knowledge specificity is acute.
Ablation Studies
Component-wise ablation demonstrates that each element (search augmentation, learned agentic policies, and both reward modalities) is essential. Removing either reward source (text or image) results in substantial performance degradation, confirming their complementary roles in optimizing agentic search for image synthesis.
Qualitative Analysis
Qualitative results indicate that Gen-Searcher yields superior fine-grained attribute grounding and accurate rendering of text/visual entities compared to both open-source and proprietary baselines. Error cases are predominantly the result of fundamental image generation constraints (e.g., multi-entity consistency, text rendering limitations), not the search-and-grounding pipeline.
Parameter Sensitivity
Performance is robust across a broad range of the reward-balancing coefficient α; pure text and pure image rewards are each empirically insufficient in isolation, while a tuned combination maximizes downstream generative fidelity and grounding.
Implications and Future Work
Gen-Searcher sets a strong precedent for integrating agentic web search, multi-hop reasoning, and multimodal evidence aggregation into the T2I pipeline. Methodologically, the demonstration of agentic RL for long-horizon search extends reinforcement-learning paradigms from text-based tool use to practical, large-scale multimodal generation tasks. Substantial, transferable gains across open- and closed-source models indicate that agentic search augmentation can upgrade the effective knowledge capacity and factuality of current generators, offsetting their parametric limitations without retrofitting core generative architectures.
The open-sourcing of all models, data, and benchmarks is poised to significantly accelerate research in this domain, facilitating reproducibility and downstream integration into verticals where knowledge accuracy and faithfulness are critical.
Theoretical directions include further exploration of hierarchical agent architectures, more nuanced credit assignment for long-horizon trajectories, and scaling towards real-time knowledge grounding under hard latency and interaction budgets. Practically, coupling Gen-Searcher with increasingly capable image generators and extending search policies to multimodal (e.g., video/audio) grounding are promising trajectories.
Conclusion
Gen-Searcher establishes a robust baseline for knowledge-intensive, search-grounded image generation via agentic RL, supported by domain-spanning data curation and rigorous benchmarking. The dual reward mechanism, multimodal tool-use policies, and strong empirical gains underscore the utility of agentic search frameworks for overcoming intrinsic constraints in contemporary T2I models. This work provides a practical template and rigorous evaluation protocol for future advances in agent-integrated image generation.