Gen-Searcher-RL-6k: RL Dataset for Image Generation
- Gen-Searcher-RL-6k is a reinforcement-learning dataset that supports multi-turn, search-augmented image generation by providing search-intensive trajectories and dual reward signals.
- It comprises 6,000 curated instances drawn from a 17K pool produced by a four-stage pipeline, covering 20+ domains with multimodal tasks.
- The dataset enables RL on Gen-Searcher by optimizing the agent policy with text and image rewards, contributing to gains of roughly 16 points on KnowGen and 15 points on WISE over the Qwen-Image baseline.
Gen-Searcher-RL-6k is the reinforcement-learning dataset used in the second training stage of Gen-Searcher, a search-augmented image generation agent that performs multi-hop reasoning and search to collect textual knowledge and reference images for grounded generation (Feng et al., 30 Mar 2026). Within the Gen-Searcher framework, the dataset is paired with prior supervised fine-tuning on Gen-Searcher-SFT-10k and supports GRPO-based policy optimization with dual reward feedback. It is defined by 6,000 curated search-intensive instances drawn from a larger 17K pool produced by a four-stage data pipeline, and it is specifically organized around multi-turn trajectories over text and image tools, grounded prompts, reference images, and synthesized target images (Feng et al., 30 Mar 2026).
1. Role within the Gen-Searcher training pipeline
Gen-Searcher adopts a two-stage training procedure: supervised fine-tuning followed by agentic reinforcement learning. Gen-Searcher-RL-6k is the dataset used in the second stage, after the policy has already been initialized by supervised fine-tuning on Gen-Searcher-SFT-10k (Feng et al., 30 Mar 2026).
The broader system is motivated by a limitation of image generation models: they are described as being constrained by frozen internal knowledge and therefore often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. Gen-Searcher addresses this by training a search-augmented image generation agent that can perform multi-hop reasoning and search, rather than relying solely on parametric knowledge (Feng et al., 30 Mar 2026).
Within that design, Gen-Searcher-RL-6k provides the reinforcement-learning complement to the supervised stage. The dataset contains search-intensive prompts together with trajectories over multiple tools and final grounded outputs, enabling the model to optimize policy behavior under reward rather than imitation alone. The paper explicitly attributes the full Gen-Searcher gains to the combination of SFT-10k and RL-6k, with the latter responsible for the improvement from the SFT-only system to the final RL-trained system on KnowGen (Feng et al., 30 Mar 2026).
2. Construction, curation, and dataset composition
Gen-Searcher-RL-6k is drawn from the same 17K high-quality samples produced by a four-stage pipeline. Those stages are text prompt construction, agentic trajectory generation, ground-truth image synthesis, and data filtering and curation (Feng et al., 30 Mar 2026).
In the text prompt construction stage, Gemini 3 Pro is used to auto-generate approximately 10K multi-hop search prompts across more than 20 domains, including Anime, Architecture, Biology, Chemistry, Celebrities, Culture, Engineering, Film, Game, Geography, History, Medicine, Physics, Politics, Posters, Religion, and Sports. A secondary process converts deep-research QA datasets into approximately 6K news-style image-generation prompts. In the trajectory-generation stage, an LLM agent, also Gemini 3 Pro, iteratively issues calls to three tools (search, image_search, and browse), aggregates evidence, and terminates with a final grounded prompt plus selected reference images. Ground-truth image synthesis is then performed by passing grounded prompts and reference images to Nano Banana Pro, yielding 30K raw examples of (prompt, trajectory, grounded prompt, refs, image). Finally, Seed1.8 scores for faithfulness, correctness, aesthetics, and safety, together with rule-based filters on prompt length and consistency, reduce the pool to 17K. Of those, 630 are reserved for the KnowGen benchmark, and the remaining 16K are split into Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k (Feng et al., 30 Mar 2026).
The key statistics reported for Gen-Searcher-RL-6k are as follows.
| Property | Value |
|---|---|
| Total instances | 6,000 |
| Prompt domains | 20+ categories, roughly balanced |
| Reference images per prompt | On average 3.2, capped at 5 |
| Trajectory length | Up to 10 tool calls, mean approximately 6.4 |
| Validation during RL | Held-out slices comprising 5% of RL-6k |
The domain balancing is reported as approximately 250–350 examples per domain. The dataset therefore encodes both breadth across task categories and depth in multi-turn retrieval behavior. A representative example is an educational infographic prompt that requires comparison of the heat of vaporization of Astatine with another halogen, including correct symbols, heat-of-vaporization values in kJ/mol, and the mass number of the most stable Astatine isotope; the final grounded prompt and reference images are then paired with a ground-truth image synthesized by Nano Banana Pro (Feng et al., 30 Mar 2026).
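The reported statistics can be read as constraints on each instance. The following is a minimal sketch of how a record and the 5% validation hold-out might be represented; the class name `RLInstance`, its field names, and the helper `split_holdout` are hypothetical and are not taken from the paper or any released code. Only the caps (10 tool calls, 5 reference images), the three tool names, and the 5% fraction come from the text.

```python
import random
from dataclasses import dataclass, field

TOOLS = ("search", "image_search", "browse")  # tools named in the text
MAX_TOOL_CALLS = 10   # reported trajectory cap
MAX_REFERENCES = 5    # reported reference-image cap


@dataclass
class RLInstance:
    """Hypothetical record layout; field names are illustrative assumptions."""
    prompt: str
    domain: str
    trajectory: list = field(default_factory=list)   # [(tool, query), ...]
    grounded_prompt: str = ""
    reference_images: list = field(default_factory=list)
    target_image: str = ""

    def is_valid(self) -> bool:
        # Check tool names and the two reported caps.
        tools_ok = all(tool in TOOLS for tool, _ in self.trajectory)
        return (tools_ok
                and len(self.trajectory) <= MAX_TOOL_CALLS
                and len(self.reference_images) <= MAX_REFERENCES)


def split_holdout(instances, fraction=0.05, seed=0):
    """Shuffle indices deterministically and hold out `fraction` for validation."""
    order = list(range(len(instances)))
    random.Random(seed).shuffle(order)
    n_val = round(len(instances) * fraction)
    held_out = set(order[:n_val])
    train = [x for i, x in enumerate(instances) if i not in held_out]
    val = [x for i, x in enumerate(instances) if i in held_out]
    return train, val
```

With 6,000 instances and a fraction of 0.05, this split yields 300 validation instances and 5,700 training instances. How the paper actually selects its held-out slices is not specified here, so the random shuffle is an assumption.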
3. Reinforcement-learning formulation on RL-6k
After supervised fine-tuning, the Gen-Searcher policy is refined on RL-6k via Group Relative Policy Optimization (GRPO) with dual rewards (Feng et al., 30 Mar 2026). The agent is a policy $\pi_\theta$, initialized from Qwen3-VL-8B-Instruct, that maps a prompt and past search feedback either to a new tool call from the set {search, image_search, browse} or to a terminal answer consisting of a grounded prompt and reference images (Feng et al., 30 Mar 2026).
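The reward is explicitly dual-signal. The text-based component, $R_{\text{text}}$, is scored by GPT-4.1 on how well the final grounded prompt and references support the target image. The image-based component, $R_{\text{image}}$, is the K-Score of the generated image, with

$$\text{K-Score} = w_F F + w_V V + w_T T + w_A A,$$

where $F$, $V$, $T$, and $A$ denote faithfulness, visual correctness, text accuracy, and aesthetics, and $w_F$, $w_V$, $w_T$, $w_A$ are the corresponding weights. The final episodic reward is defined using a mixing weight $\alpha$:

$$R = \alpha\, R_{\text{text}} + (1 - \alpha)\, R_{\text{image}}.$$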
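The dual reward can be illustrated with a short sketch. Everything numeric here is an assumption: the equal component weights, the default `alpha=0.5`, the convex-combination form, and the convention that all scores are normalized to [0, 1] are placeholders chosen for illustration, not values taken from the paper. The function names `k_score` and `episodic_reward` are likewise hypothetical.

```python
def k_score(faithfulness, visual_correctness, text_accuracy, aesthetics,
            weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted combination of the four image-quality components.

    The equal weights are an illustrative assumption, not the paper's values.
    Inputs are assumed to be normalized to [0, 1].
    """
    components = (faithfulness, visual_correctness, text_accuracy, aesthetics)
    return sum(w * c for w, c in zip(weights, components))


def episodic_reward(r_text, r_image, alpha=0.5):
    """Mix text and image rewards; alpha=0.5 is a placeholder, not a reported value."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * r_text + (1.0 - alpha) * r_image


# Example: a rollout with a strong grounded prompt but a mediocre image.
r_image = k_score(faithfulness=0.8, visual_correctness=0.4,
                  text_accuracy=0.6, aesthetics=0.6)
reward = episodic_reward(r_text=0.9, r_image=r_image)
```

Setting `alpha` to 0 or 1 recovers the image-only and text-only rewards, which is the kind of comparison the "w/o text reward" and "w/o image reward" ablations make.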
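The policy optimization objective follows the GRPO surrogate form. For a rollout under prompt $q$ that produces output $o_i$ with reward $R_i$, the standardized advantage over a group of $G$ rollouts is

$$\hat{A}_i = \frac{R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R_j\}_{j=1}^{G}\right)}.$$

With probability ratio

$$\rho_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},$$

the GRPO loss per sample is

$$\mathcal{L}_i(\theta) = -\min\!\left(\rho_i(\theta)\,\hat{A}_i,\; \operatorname{clip}\!\left(\rho_i(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i\right) + \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right).$$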
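The group-standardized advantage and clipped surrogate can be sketched in plain Python. This is a simplified, sequence-level illustration under stated assumptions: `clip_eps=0.2` is a commonly used default rather than a reported value, the small `eps` added to the standard deviation is a numerical-stability convention, and any KL regularization term is omitted. The function names are hypothetical.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate loss (KL term omitted in this sketch).

    logp_new / logp_old are sequence log-probabilities under the current
    and the rollout policy; clip_eps=0.2 is an assumed default.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return -min(ratio * advantage, clipped * advantage)


# One group of six rollouts, matching the reported group size.
rewards = [0.2, 0.4, 0.4, 0.6, 0.8, 0.6]
advantages = group_advantages(rewards)
losses = [clipped_surrogate_loss(-10.0, -10.0, a) for a in advantages]
```

When the new and old log-probabilities coincide, the ratio is 1 and each loss reduces to the negated advantage, so rollouts with above-average reward are reinforced and below-average ones are suppressed.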
The reported hyperparameters cover the AdamW learning rate, a batch size of 8 sequences, a group size of 6, a maximum of 10 interaction turns per episode, a maximum of 5 images per turn, the GRPO loss coefficients, the reward weight $\alpha$, and no discounting, effectively $\gamma = 1$. Training uses 16 × H800 GPUs for one day total (Feng et al., 30 Mar 2026).
4. Agentic search behavior and multimodal integration
Gen-Searcher-RL-6k is not only a collection of prompts; it is a dataset of search trajectories over heterogeneous tools. The agentic search framework explicitly includes text retrieval, image retrieval, and browsing, reflecting the fact that grounded image generation often depends jointly on factual text and visual reference material (Feng et al., 30 Mar 2026).
The search agent is Qwen3-VL-8B-Instruct, extended with special tokens and heads to emit JSON-style tool_call actions. Downstream image generators such as Qwen-Image, Seedream 4.5, and Nano Banana Pro remain frozen. They are conditioned on the final grounded prompt and the selected reference images via a multimodal editing interface, for example Qwen-Image-Edit. The paper states that multi-hop reasoning is handled entirely by the LLM: each <think> … </think> step updates hidden state with prior evidence, each <tool_call> fetches new text or image shards, and the final <answer> concatenates gathered facts into a single prompt. Reference images are passed as input images in the editing stage, where cross-attention layers fuse them with the grounded prompt (Feng et al., 30 Mar 2026).
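The interaction loop described above can be sketched as a small driver that alternates between policy outputs and tool results. The tag names `<tool_call>` and `<answer>` follow the text, but the JSON schema (`name`, `arguments`, `grounded_prompt`, `reference_images`), the closing tags, and the functions `parse_step` and `run_episode` are illustrative assumptions rather than the paper's actual interface. The policy and tools are passed in as plain callables so the sketch stays self-contained.

```python
import json
import re

ALLOWED_TOOLS = {"search", "image_search", "browse"}
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_step(text):
    """Return ("answer", payload) or ("tool_call", payload) from one model output."""
    match = ANSWER_RE.search(text)
    if match:
        return "answer", json.loads(match.group(1))
    match = TOOL_CALL_RE.search(text)
    if match:
        call = json.loads(match.group(1))
        if call.get("name") not in ALLOWED_TOOLS:
            raise ValueError(f"unknown tool: {call.get('name')!r}")
        return "tool_call", call
    raise ValueError("output contains neither <tool_call> nor <answer>")


def run_episode(policy, tools, prompt, max_turns=10):
    """Roll out one episode; returns the answer payload or None if turns run out.

    `policy` maps the history (a list of strings) to the next model output;
    `tools` maps a tool name to a callable taking the call's arguments.
    """
    history = [prompt]
    for _ in range(max_turns):
        kind, payload = parse_step(policy(history))
        if kind == "answer":
            return payload
        # Append the tool result so the next step can condition on it.
        history.append(tools[payload["name"]](payload["arguments"]))
    return None
```

The `max_turns=10` default mirrors the reported cap on interaction turns per episode. Returning `None` when the cap is reached is a design choice of this sketch; a training setup would need its own rule for rewarding unterminated episodes.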
This division of labor is important for interpreting RL-6k. The dataset supervises and rewards search-and-grounding behavior rather than end-to-end image generator parameter updates. A plausible implication is that the dataset is designed to optimize the agentic interface between reasoning, retrieval, and frozen image generation, rather than to modify the generative backbone itself.
5. Empirical contribution of RL-6k
The paper reports that Gen-Searcher improves Qwen-Image by around 16 points on KnowGen and 15 points on WISE, and these gains are tied to the full training setup that includes RL-6k (Feng et al., 30 Mar 2026). On KnowGen, whose held-out benchmark has 630 samples and uses K-Score on a 0–100 scale, the reported scores are:
- Qwen-Image alone: 14.98
- Qwen-Image + SFT-10k only: 28.15
- Qwen-Image + SFT-10k + RL-6k (full Gen-Searcher): 31.52
This corresponds to an approximately 16.5-point improvement over the baseline and a further gain from the SFT-only stage to the final RL-trained system (Feng et al., 30 Mar 2026).
The paper also reports WISE improvement from 0.62 to 0.77 using Gen-Searcher. Similar approximately 16-point gains are reported when transferring the trained agent to Seedream 4.5 and Nano Banana Pro (Feng et al., 30 Mar 2026).
The ablation table isolates the impact of RL-6k:
| Method | K-Score |
|---|---|
| Qwen-Image | 14.98 |
| + Manual search workflow | 22.91 |
| + Gen-Searcher-SFT-10k | 28.15 |
| + Gen-Searcher w/o text reward | 29.59 |
| + Gen-Searcher w/o image reward | 29.36 |
| + Gen-Searcher (full) | 31.52 |
The paper states that the jump from 28.15 to 31.52 comes solely from RL on RL-6k, and presents this as evidence for both the importance of the reinforcement-learning stage and the utility of dual rewards (Feng et al., 30 Mar 2026). Qualitative examples described in the paper involve failures of generic generators, such as misrendering a museum placard or omitting a landmark name, where Gen-Searcher instead retrieves the correct text and images and grounds the final output.
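The increments implied by the ablation table can be checked with a few lines of arithmetic. The K-Score values below are copied from the table; the dictionary keys and the helper `gain` are illustrative names only.

```python
K_SCORES = {
    "qwen_image": 14.98,
    "manual_search": 22.91,
    "sft_10k": 28.15,
    "no_text_reward": 29.59,
    "no_image_reward": 29.36,
    "full": 31.52,
}


def gain(start, end):
    """K-Score difference between two rows, rounded to two decimals."""
    return round(K_SCORES[end] - K_SCORES[start], 2)


print(gain("qwen_image", "full"))        # 16.54: total gain over the baseline
print(gain("qwen_image", "sft_10k"))     # 13.17: contribution of the SFT stage
print(gain("sft_10k", "full"))           # 3.37: contribution of RL on RL-6k
print(gain("no_text_reward", "full"))    # 1.93: drop when the text reward is removed
print(gain("no_image_reward", "full"))   # 2.16: drop when the image reward is removed
```

Both single-reward variants still improve on the SFT-only score of 28.15, while the full dual-reward system scores highest, which is consistent with the paper's reading that the two rewards are complementary.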
6. Relation to adjacent search-agent RL frameworks
Gen-Searcher-RL-6k belongs to a broader research trajectory on reinforcement-learned search agents, but it is specialized for grounded image generation rather than text-only QA. ZeroSearch studies how to incentivize search capability without live search during training by using a simulated retrieval module that can generate useful or noisy documents under a curriculum-based strategy (Sun et al., 7 May 2025). s3 instead proposes a decoupled search-only agent trained with a Gain Beyond RAG reward and emphasizes strong downstream performance with only 2.4k training samples (Jiang et al., 20 May 2025). MemSearcher addresses the context-growth problem in multi-turn search agents by maintaining a compact memory and training reasoning, search, and memory management jointly through multi-context GRPO (Yuan et al., 4 Nov 2025).
These systems share family resemblances with Gen-Searcher-RL-6k at the level of RL optimization, multi-turn search, and policy learning over tool use, but their objectives and data differ. ZeroSearch focuses on simulated search for LLM reasoning; s3 focuses on a decoupled searcher for frozen generators in QA; MemSearcher focuses on bounded-context search agents with explicit memory management; Gen-Searcher-RL-6k focuses on prompts, trajectories, grounded prompts, reference images, and synthesized images for knowledge-intensive image generation (Feng et al., 30 Mar 2026). This suggests that Gen-Searcher-RL-6k extends search-agent RL into a multimodal regime where successful behavior must jointly optimize textual grounding and image-level fidelity.
A common misconception is to treat Gen-Searcher-RL-6k as merely an image dataset. The description in the paper is more specific: it is a reinforcement-learning dataset built around agentic search trajectories and grounded synthesis targets, with image generation quality entering the reward alongside text-grounding quality (Feng et al., 30 Mar 2026). Another possible misconception is that the gains arise from manual workflows or prompt engineering alone; the reported ablations instead separate manual search workflow, supervised fine-tuning, and RL-6k, and assign a distinct incremental gain to the RL stage.
7. Significance and limitations implied by the reported design
Gen-Searcher-RL-6k is significant because it operationalizes search-grounded image generation as a reinforcement-learning problem with multimodal rewards. The dataset couples search-intensive prompts, tool trajectories, grounded prompts, reference images, and synthesized outputs in a single training resource, allowing policy learning over how to search, what evidence to keep, and when to terminate (Feng et al., 30 Mar 2026).
Its design also indicates a particular view of grounded generation. Rather than requiring the image generator itself to search or update its knowledge, the framework keeps downstream generators frozen and concentrates adaptation in the search agent. This aligns Gen-Searcher with modular search-agent paradigms that optimize tool use and context construction rather than direct generator fine-tuning. A plausible implication is that RL-6k is intended to support transfer across generators because the search policy learns to produce better grounded prompts and reference-image sets, which can then be consumed by multiple frozen generation systems.
At the same time, the dataset’s scope is defined by its construction pipeline. The prompts are generated and curated through specific upstream models and filtering criteria, and the rewards combine GPT-4.1-based text scoring with image-based K-Score. The reported results therefore establish the utility of RL-6k within that training and evaluation protocol, especially on KnowGen and WISE, rather than constituting a general claim about all possible search-grounded image-generation settings (Feng et al., 30 Mar 2026). Within that scope, Gen-Searcher-RL-6k functions as the central RL resource that converts search-augmented image generation from a manually specified workflow into an optimizable agentic policy.