- The paper introduces Gen-Searcher, which grounds image generation in actively retrieved external knowledge via agentic reinforcement learning.
- It employs a novel dual reward mechanism combining image-based and text-based evaluations to enhance visual correctness and factual accuracy.
- Experimental results demonstrate significant K-Score improvements and superior fine-grained attribute grounding compared to existing T2I models.
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Introduction and Motivation
The paper "Gen-Searcher: Reinforcing Agentic Search for Image Generation" (2603.28767) addresses a central limitation of contemporary text-to-image (T2I) generation models: their reliance on static parametric knowledge, which restricts their ability to synthesize images for knowledge-intensive or temporally dynamic prompts. The lack of integration between up-to-date external information and image synthesis significantly impacts real-world applicability, particularly when generating images involving entities with changing attributes, recent events, or highly specialized facts not encoded during model pretraining.
Existing approaches to retrieval-augmented generation (RAG) for T2I focus primarily on database retrieval or shallow web search, and often lack visual grounding, multi-hop reasoning, and adaptive tool use. Furthermore, most prior systems either rely on manual prompt-engineering workflows or use text-based search exclusively for fact gathering, yielding suboptimal quality, poor extensibility across domains, and inadequate visual faithfulness.
Methodology
Gen-Searcher is presented as the first paradigm that actively trains a multimodal agentic search framework for grounded image generation. The system is architected as a search-augmented agent: a Large Multimodal Model (LMM) that learns, via agentic reinforcement learning (RL), to perform multi-hop web search, textual querying, image reference retrieval, and evidence aggregation. The agent, initialized from Qwen3-VL-8B-Instruct, is trained with both supervised fine-tuning (SFT) and agentic RL on carefully curated datasets that reinforce robust search behaviors.
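To make the agent loop concrete, below is a minimal sketch of a multi-turn rollout of the kind described; the `Step` schema, the `lmm.generate` client, and the tool dictionary are illustrative assumptions rather than the paper's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One model turn: either a tool call or a final grounded prompt."""
    text: str                 # final search-augmented prompt when tool is None
    tool: str | None = None   # "text_search" | "image_search" | "browse"
    args: str = ""            # query string or URL for the tool call

def rollout(lmm, tools, user_prompt, max_turns=8):
    """Run one agentic search trajectory; returns (grounded_prompt, references)."""
    messages = [{"role": "user", "content": user_prompt}]
    references = []
    for _ in range(max_turns):
        step = lmm.generate(messages)              # assumed LMM client API
        if step.tool is None:                      # agent judges evidence sufficient
            return step.text, references
        observation = tools[step.tool](step.args)  # execute the chosen tool
        references.append((step.tool, observation))
        messages.append({"role": "assistant", "content": f"{step.tool}({step.args})"})
        messages.append({"role": "tool", "content": str(observation)})
    return lmm.generate(messages).text, references  # budget exhausted: force an answer
```

The grounded prompt and collected references would then be handed to the downstream image generator.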
A novel data curation pipeline supports this effort:
- Prompt Construction: Search-intensive prompts spanning 20+ domains (science, pop culture, news, etc.) are generated by prompt-engineering state-of-the-art LLMs (Gemini 3 Pro) and by recasting deep-research QA datasets as visual-synthesis objectives.
- Agentic Trajectory Generation: Search and image-retrieval actions are executed in multi-turn trajectories using three tool interfaces (text search, image search, webpage browsing), capturing reasoning paths for supervised learning.
- Ground Truth Synthesis: The synthesized prompts and references are passed to Nano Banana Pro to produce ground-truth images.
- Quality Filtering: Seed1.8 provides multidimensional scoring (faithfulness, correctness, aesthetics, etc.), followed by rule-based filtering to ensure high-quality supervision (sketched below).
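As a rough illustration of the final filtering step, the snippet below applies rule-based thresholds to the judge's multidimensional scores; the dimension names come from the summary, while the thresholds and data layout are assumed.

```python
# Rule-based filter over multidimensional judge scores; the thresholds are
# illustrative assumptions, not values from the paper.
THRESHOLDS = {"faithfulness": 0.8, "correctness": 0.8, "aesthetics": 0.6}

def keep_sample(scores: dict[str, float]) -> bool:
    """Keep a (prompt, trajectory, image) sample only if every scored
    dimension clears its minimum threshold."""
    return all(scores.get(dim, 0.0) >= t for dim, t in THRESHOLDS.items())

candidates = [
    {"id": 1, "judge_scores": {"faithfulness": 0.92, "correctness": 0.88, "aesthetics": 0.71}},
    {"id": 2, "judge_scores": {"faithfulness": 0.55, "correctness": 0.90, "aesthetics": 0.80}},
]
kept = [s for s in candidates if keep_sample(s["judge_scores"])]  # keeps sample 1 only
```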
Two primary training datasets result: Gen-Searcher-SFT-10k (for SFT) and Gen-Searcher-RL-6k (for RL). The KnowGen benchmark, comprising 630 manually verified samples, is introduced to robustly evaluate real-world, search-grounded image generation; benchmarking uses the K-Score metric, a weighted sum of faithfulness, visual correctness, text accuracy, and aesthetics.
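Since the summary specifies K-Score only as a weighted sum over four dimensions, the following is a minimal sketch with placeholder weights; the paper's actual weights are not given here.

```python
# K-Score as a weighted sum over four judged dimensions (scores assumed to
# lie in [0, 100]); the weights below are placeholders, not the paper's values.
K_WEIGHTS = {
    "faithfulness": 0.4,
    "visual_correctness": 0.3,
    "text_accuracy": 0.2,
    "aesthetics": 0.1,
}

def k_score(dim_scores: dict[str, float]) -> float:
    """Compute the scalar benchmark score from per-dimension judge scores."""
    return sum(w * dim_scores[d] for d, w in K_WEIGHTS.items())

print(k_score({"faithfulness": 40, "visual_correctness": 30,
               "text_accuracy": 25, "aesthetics": 50}))  # -> 35.0
```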
Agentic RL with Dual Reward Feedback
The RL stage employs Group Relative Policy Optimization (GRPO), augmented with a dual reward scheme that fuses image-based and text-based rewards:
- Image-based Reward: Evaluates the final image’s adherence to grounded attributes, using K-Score as reference.
- Text-based Reward: Judges the sufficiency and correctness of the generated search-augmented prompt and associated references (via GPT-4.1).
This dual reward structure (with a balancing hyperparameter, optimal at α ≈ 0.5) stabilizes training by mitigating reward signal variance arising from downstream generator limitations, while ensuring the agent’s outputs are both informationally sufficient and practically effective for robust image synthesis.
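Below is a minimal sketch of the reward fusion and a GRPO-style group-relative advantage, assuming the fusion is a convex combination (consistent with the reported α ≈ 0.5 optimum); trajectory sampling, KL regularization, and the judge models themselves are omitted.

```python
import numpy as np

def dual_reward(r_image: float, r_text: float, alpha: float = 0.5) -> float:
    """Fuse the image-based (K-Score judge) and text-based (prompt/reference
    judge) rewards; alpha = 0.5 follows the reported optimum."""
    return alpha * r_image + (1.0 - alpha) * r_text

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each trajectory's reward against
    the mean and std of its sampled group (no learned value function)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four trajectories sampled for the same prompt.
group = [dual_reward(ri, rt) for ri, rt in [(0.8, 0.9), (0.3, 0.7), (0.6, 0.2), (0.9, 0.8)]]
advantages = grpo_advantages(group)  # positive entries reinforce those trajectories
```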
Experimental Evaluation
Main Results
Gen-Searcher is systematically benchmarked on KnowGen and WISE, paired with state-of-the-art proprietary and open-source T2I generators. On KnowGen:

| Generator | K-Score (base) | K-Score (+Gen-Searcher) | Δ |
| --- | --- | --- | --- |
| Qwen-Image | 14.98 | 31.52 | +16.54 |
| Seedream 4.5 | 31.01 | 47.29 | +16.28 |
| Nano Banana Pro | 50.38 | 53.30 | +2.92 |

The gain on Nano Banana Pro holds despite that model's internal, text-only search capability.
Notably, these improvements are concentrated on visual correctness and text accuracy, the dimensions most critical for knowledge-intensive T2I generation.
On WISE, Gen-Searcher+Qwen-Image achieves 0.77 overall (vs. 0.62 for Qwen-Image), outperforming all open-domain image generators—even in subdomains such as chemistry, where knowledge specificity is acute.
Ablation Studies
Component-wise ablation demonstrates that each element (search augmentation, learned agentic policies, and both reward modalities) is essential. Removing either reward source (text or image) results in substantial performance degradation, confirming their complementary roles in optimizing agentic search for image synthesis.
Qualitative Analysis
Qualitative results indicate that Gen-Searcher yields superior fine-grained attribute grounding and accurate rendering of text/visual entities compared to both open-source and proprietary baselines. Error cases are predominantly the result of fundamental image generation constraints (e.g., multi-entity consistency, text rendering limitations), not the search-and-grounding pipeline.
Parameter Sensitivity
Performance is robust across a broad range of the reward-balancing coefficient α; pure text and pure image rewards are each empirically insufficient in isolation, while a tuned combination maximizes downstream generative fidelity and grounding.
Implications and Future Work
Gen-Searcher sets a strong precedent for integrating agentic web search, multi-hop reasoning, and multimodal evidence aggregation into the T2I pipeline. Methodologically, the demonstration of agentic RL for long-horizon search extends reinforcement-learning paradigms from text-based tool use to practical, large-scale multimodal generation tasks. Substantial, transferable gains across open- and closed-source models indicate that agentic search augmentation can upgrade the effective knowledge capacity and factuality of current generators, offsetting their parametric limitations without retrofitting core generative architectures.
The open-sourcing of all models, data, and benchmarks is poised to significantly accelerate research in this domain, facilitating reproducibility and downstream integration into verticals where knowledge accuracy and faithfulness are critical.
Theoretical directions include further exploration of hierarchical agent architectures, more nuanced credit assignment for long-horizon trajectories, and scaling towards real-time knowledge grounding under hard latency and interaction budgets. Practically, coupling Gen-Searcher with increasingly capable image generators and extending search policies to multimodal (e.g., video/audio) grounding are promising trajectories.
Conclusion
Gen-Searcher establishes a robust baseline for knowledge-intensive, search-grounded image generation via agentic RL, supported by domain-spanning data curation and rigorous benchmarking. The dual reward mechanism, multimodal tool-use policies, and strong empirical gains underscore the utility of agentic search frameworks for overcoming intrinsic constraints in contemporary T2I models. This work provides a practical template and rigorous evaluation protocol for future advances in agent-integrated image generation.