Gen-Searcher-SFT-10k: Multimodal SFT Dataset

Updated 4 July 2026

Gen-Searcher-SFT-10k is a supervised fine-tuning dataset that equips multimodal agents with multi-hop search, tool-use, and evidence retrieval capabilities.
It employs an XML-like trajectory format to guide agents through search, image retrieval, and final grounded prompt composition in a two-stage training framework.
In the paper's KnowGen ablation, the SFT-trained agent scores 28.15, compared with 22.91 for a hand-designed prompt-based workflow and 14.98 for the Qwen-Image baseline.

Gen-Searcher-SFT-10k is the supervised fine-tuning dataset introduced for Gen-Searcher, a search-augmented image-generation agent that performs multi-hop reasoning and tool use to collect textual knowledge and reference images before producing a grounded generation prompt (Feng et al., 30 Mar 2026). In the paper’s two-stage training design, Gen-Searcher-SFT-10k is the first-stage corpus used to teach a multimodal LLM how to behave as an agent rather than as a conventional text-to-image system: it learns when to search, how to browse, how to retrieve visual references, how to interpret tool outputs, and how to terminate with a structured final answer containing a search-grounded prompt and selected reference images (Feng et al., 30 Mar 2026). The dataset contains about 10K curated samples and was created because standard text-to-image resources do not provide prompts that truly require external search, multi-turn tool-use trajectories, retrieved textual evidence, retrieved reference images, or aligned synthesis images (Feng et al., 30 Mar 2026).

1. Functional role in the Gen-Searcher framework

Gen-Searcher-SFT-10k is the supervised dataset for stage 1 of the Gen-Searcher pipeline, whose stage 2 applies agentic reinforcement learning on a separate corpus, Gen-Searcher-RL-6k (Feng et al., 30 Mar 2026). The base model is Qwen3-VL-8B-Instruct, and during training the image generator is fixed while the trainable component is the agent that emits search trajectories and final grounded prompts. The paper states that supervised fine-tuning “equips the model with basic tool-use abilities, enabling multi-step search, browsing, and reasoning for image generation,” which positions the dataset as a behavioral initialization resource rather than merely a collection of prompt–image pairs.

The task setting is explicitly search-grounded, multi-hop image generation. The intended prompts are “search-intensive”: they require external knowledge acquisition rather than closed-book generation. The paper characterizes such prompts through needs such as up-to-date facts, multi-hop fact aggregation, entity disambiguation, fine-grained visual grounding, text rendering constraints, and cross-modal grounding. This means the supervision target is not only factual correctness in text but also the retrieval and selection of visual evidence appropriate for downstream image synthesis.

A central implication is that Gen-Searcher-SFT-10k belongs to the broader family of “searcher-style” SFT datasets, but its domain is multimodal generation rather than text-only deep research. The final objective is not a QA answer or a search report; it is a grounded image-generation specification plus reference images, conditioned on a tool-mediated evidence-gathering process.

2. Data construction pipeline

The paper places Gen-Searcher-SFT-10k inside a four-stage construction pipeline that is shared with Gen-Searcher-RL-6k and the KnowGen benchmark (Feng et al., 30 Mar 2026). The first stage is text prompt construction. One source uses Gemini 3. Pro to generate multi-hop search-intensive prompts across around 20 categories, including Anime, Architecture, Art, Astronomy, Biology, Celebrities, Chemistry, Culture, Engineering, Film, Game, Geography, History, Industry, Medicine, Physics, Politics, Posters, Religion, and Sports. A second source converts examples from existing deep research QA datasets into image-generation-oriented prompts using Gemini 3. Pro; this source primarily contributes General News scenarios.

The second stage is agentic trajectory generation. For each prompt, Gemini 3. Pro is used together with tools to generate a full multi-turn search trajectory. The tool set includes search, which performs text web search and returns top-$k$ URLs and snippets; image_search, which retrieves relevant images from a text query; and browse, which reads and summarizes webpage content. The agent repeatedly reasons about missing information, issues a tool call, observes the result, updates its internal state, and either continues searching or terminates.
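The reason, call, observe, continue-or-terminate cycle described above can be sketched as a small loop. This is only an illustration: the three tool functions are hypothetical stubs that return placeholder strings, and `Action`, `run_agent`, `scripted_policy`, and `max_turns` are names invented here, not part of the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stubs: the real search, image_search, and browse back ends are not specified.
def search(query: str) -> str:
    return f"top-k URLs and snippets for '{query}'"

def image_search(query: str) -> str:
    return f"IMG_001: image matching '{query}'"

def browse(url: str) -> str:
    return f"summary of {url}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "search": search,
    "image_search": image_search,
    "browse": browse,
}

@dataclass
class Action:
    tool: Optional[str]  # None marks the terminal action
    payload: str         # tool argument, or the final answer text

History = List[Tuple[str, str]]

def run_agent(prompt: str, policy: Callable[[History], Action], max_turns: int = 8) -> History:
    """Alternate policy decisions and tool observations until the policy terminates."""
    history: History = [("user", prompt)]
    for _ in range(max_turns):
        action = policy(history)
        if action.tool is None:
            history.append(("answer", action.payload))
            break
        observation = TOOLS[action.tool](action.payload)
        history.append(("tool_call", f"{action.tool}: {action.payload}"))
        history.append(("observation", observation))
    return history

def scripted_policy(history: History) -> Action:
    """Toy policy: one text search, one image search, then stop."""
    calls = sum(1 for role, _ in history if role == "tool_call")
    script = [
        Action("search", "event mascot official name"),
        Action("image_search", "event mascot official artwork"),
        Action(None, "grounded prompt goes here"),
    ]
    return script[min(calls, len(script) - 1)]

trajectory = run_agent("Draw the official mascot of the event.", scripted_policy)
```

In the actual pipeline the policy role is played by the trajectory-generating model; the scripted policy here only makes the control flow visible. If the turn budget runs out, this sketch returns the history without a terminal answer.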

The third stage is ground-truth image synthesis. After the trajectory yields a grounded generation prompt and selected reference images, the authors feed them into Nano Banana Pro, a proprietary image generator, and treat the resulting image as the synthesis ground truth. The raw pipeline produces roughly 30K samples containing the original prompts, search trajectories, grounded prompts, reference images, and synthesized target images.

The fourth stage is filtering and curation. Samples are scored with Seed1.8 from several angles: whether the prompt genuinely requires search, correctness of generated content, faithfulness to the prompt, visual aesthetics, text rendering clarity, and safety. Rule-based filtering also removes samples whose prompts are excessively long or whose search results are inconsistent. After filtering, about 17K high-quality samples remain. From these, 630 human-verified samples are held out for the KnowGen benchmark, and the remaining roughly 16K training samples are split into Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k. The paper explicitly states that there is no overlap between training and benchmark data.
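A minimal sketch of this filter-then-split step follows, under loud assumptions: the score keys, the 0-to-1 score scale, the thresholds, and the random partition are all hypothetical. The paper's benchmark subset was human-verified rather than randomly drawn, and its actual thresholds are not given.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical keys mirroring the six review angles; names and scale are assumptions.
SCORE_KEYS = (
    "requires_search", "correctness", "faithfulness",
    "aesthetics", "text_clarity", "safety",
)

def keep_sample(sample: Dict, min_score: float = 0.5, max_prompt_tokens: int = 512) -> bool:
    """Apply the rule-based length check, then require every score to clear the bar."""
    if sample["prompt_tokens"] > max_prompt_tokens:
        return False
    return all(sample["scores"][key] >= min_score for key in SCORE_KEYS)

def partition(ids: List[str], n_benchmark: int, n_sft: int,
              seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Split IDs into three disjoint groups: benchmark, SFT, and RL (the remainder)."""
    shuffled = ids[:]
    random.Random(seed).shuffle(shuffled)
    benchmark = shuffled[:n_benchmark]
    sft = shuffled[n_benchmark:n_benchmark + n_sft]
    rl = shuffled[n_benchmark + n_sft:]
    return benchmark, sft, rl

# Toy run: 20 surviving samples split 2 / 10 / 8.
ids = [f"sample_{i:02d}" for i in range(20)]
benchmark_ids, sft_ids, rl_ids = partition(ids, n_benchmark=2, n_sft=10)
```

Slicing a single shuffled list guarantees that the three groups cannot overlap, which is the property the paper asserts for its training and benchmark data.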

3. Sample structure, action space, and serialization

Gen-Searcher-SFT-10k contains aligned multimodal-agent supervision rather than ordinary prompt–image tuples (Feng et al., 30 Mar 2026). A typical sample includes the original user prompt, a multi-turn agent trajectory, tool feedback, intermediate reasoning, a final grounded generation output, reference image annotations, and a ground-truth synthesized image.

The trajectory uses an XML-like interaction format. Model outputs contain <think> ... </think> followed either by <tool_call> ... </tool_call> or by a terminal <answer> ... </answer>. The action space supervised during SFT consists of the tools search, image_search, and browse, plus the terminal action represented by the final <answer>. Tool feedback may contain search snippets and URLs, image search results with image IDs such as IMG_001, and browse summaries. The examples described in the paper also include intermediate progress notes indicating what information has been confirmed, what remains uncertain, and what should be searched next.
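As a rough illustration of this serialization, the sketch below splits one model turn into tagged segments and checks that a <think> block is followed by exactly one <tool_call> or <answer>. The function names are invented here, and the JSON payload inside the example <tool_call> is an assumed shape; the paper does not specify how tool arguments are encoded.

```python
import re
from typing import List, Tuple

# Matches the three tag types named in the text; the back-reference \1
# pairs each opening tag with its own closing tag.
TAG_RE = re.compile(r"<(think|tool_call|answer)>(.*?)</\1>", re.DOTALL)

def parse_turn(text: str) -> List[Tuple[str, str]]:
    """Return (tag, content) pairs in the order they appear in one model turn."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_RE.finditer(text)]

def is_valid_turn(text: str) -> bool:
    """A turn is reasoning followed by exactly one tool call or one final answer."""
    tags = [tag for tag, _ in parse_turn(text)]
    return tags in (["think", "tool_call"], ["think", "answer"])

# Hypothetical turn; the tool_call payload format is an assumption.
turn = (
    "<think>The mascot name is confirmed; a visual reference is still missing.</think>\n"
    '<tool_call>{"name": "image_search", "arguments": {"query": "mascot artwork"}}</tool_call>'
)
segments = parse_turn(turn)
```

A check of this kind is useful when preparing trajectories for training, since a turn that lacks reasoning or mixes a tool call with a final answer would teach the wrong output pattern.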

The final answer is a JSON object with two fields:

{
  "gen_prompt": "...",
  "reference_images": [
    {
      "img_id": "IMG_###",
      "note": "..."
    }
  ]
}

The paper specifies several constraints on this schema. The gen_prompt must be a grounded prompt suitable for an image generator; it should refer to selected images only through ordinal phrases such as “the first reference image” or “the second reference image”; it must not include URLs; and it must not include raw IMG_### identifiers in the text. The reference_images list contains 1–5 items, is sorted by image ID in ascending order, and includes short notes describing what to copy from each image.
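The mechanical parts of these constraints can be checked automatically. The validator below is a sketch, not the authors' tooling: `validate_answer` and its messages are invented here, the URL pattern is a simple heuristic, and the requirement to use ordinal phrases is not checked because it cannot be reduced to a simple pattern.

```python
import json
import re
from typing import List

URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)  # heuristic URL detector
IMG_ID_RE = re.compile(r"IMG_\d+")

def validate_answer(raw: str) -> List[str]:
    """Return a list of constraint violations; an empty list means the answer passes."""
    try:
        answer = json.loads(raw)
    except json.JSONDecodeError:
        return ["answer is not valid JSON"]
    if not isinstance(answer, dict):
        return ["answer is not a JSON object"]

    problems: List[str] = []
    prompt = answer.get("gen_prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        problems.append("gen_prompt is missing or empty")
    else:
        if URL_RE.search(prompt):
            problems.append("gen_prompt contains a URL")
        if IMG_ID_RE.search(prompt):
            problems.append("gen_prompt contains a raw image ID")

    refs = answer.get("reference_images")
    if not isinstance(refs, list) or not 1 <= len(refs) <= 5:
        problems.append("reference_images must hold 1 to 5 items")
        return problems

    numbers = []
    for ref in refs:
        img_id = ref.get("img_id") if isinstance(ref, dict) else None
        if not isinstance(img_id, str) or not IMG_ID_RE.fullmatch(img_id):
            problems.append("reference image has a malformed img_id")
            continue
        if not isinstance(ref.get("note"), str) or not ref["note"].strip():
            problems.append(f"{img_id} has no note")
        numbers.append(int(img_id.split("_")[1]))
    if numbers != sorted(numbers):
        problems.append("reference_images are not sorted by image ID")
    return problems
```

Calling `validate_answer` on the content of a final <answer> returns an empty list when the object satisfies every checked rule, so it can serve as a quick filter over generated trajectories.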

A common misunderstanding is to treat the dataset as a collection of answer-only prompts. In fact, the supervision target is the full tool-using trajectory. The paper is explicit that Gen-Searcher-SFT-10k is designed to teach multi-turn search state maintenance, textual evidence gathering, visual reference retrieval, reference image selection, and final structured answer formatting.

4. Supervised fine-tuning semantics

During supervised fine-tuning, the model is trained to imitate the complete agent behavior exhibited in Gen-Searcher-SFT-10k (Feng et al., 30 Mar 2026). What is supervised includes multi-turn reasoning in <think>, tool invocation through <tool_call>, query construction, interpretation of search and image results, reference image selection, grounded prompt composition, and final JSON answer formatting. The paper explicitly states that the optimization target is not merely the final gen_prompt; it is the entire tool-using trajectory.

This stage functions as a cold-start policy initialization. Without it, the subsequent RL stage would be optimizing a policy that does not yet know how to use tools coherently. The paper therefore presents Gen-Searcher-SFT-10k as the mechanism by which the model acquires basic search-agent competence for image generation: issuing search queries for factual knowledge, issuing image_search queries for visual grounding, using browse when search snippets are insufficient, maintaining multi-turn search context, and terminating with a structured grounded output.

The paper does not explicitly print an SFT loss equation. It only states that SFT is performed before RL. Likewise, it does not provide a full chat-template specification for training conversations beyond the XML-like forms and final answer schema. This under-specification is important for reproduction, but it does not change the central point that the dataset is intended for trajectory-level imitation learning over a tool-augmented multimodal search process.
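For orientation only, trajectory-level imitation is conventionally implemented as a token-level negative log-likelihood. The expression below is that common convention, written as an assumption and not as the paper's formula. Here $x$ is the user prompt, $y_t$ is the $t$-th token of the serialized trajectory, $\pi_\theta$ is the agent policy, and $\mathcal{A}$ is the set of token positions emitted by the agent (reasoning, tool calls, and the final answer). Whether tool-feedback tokens are excluded from $\mathcal{A}$ is not stated in the paper.

$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t \in \mathcal{A}} \log \pi_\theta(y_t \mid x, y_{<t}).$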

The second training stage, which uses Gen-Searcher-RL-6k, is defined by the dual reward

$R = (1 - \alpha) R_{\text{image}} + \alpha R_{\text{text}},$

with $\alpha = 0.5$. The paper’s framing makes clear that this RL stage depends on the prior SFT stage: Gen-Searcher-SFT-10k teaches the model to produce coherent trajectories, after which RL refines search strategy quality.
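The dual reward is a convex combination of its two terms, which a few lines make concrete. The function name and the component values below are hypothetical; how $R_{\text{image}}$ and $R_{\text{text}}$ are themselves computed is not covered by this sketch.

```python
def combined_reward(r_image: float, r_text: float, alpha: float = 0.5) -> float:
    """Blend image and text rewards as R = (1 - alpha) * R_image + alpha * R_text."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * r_image + alpha * r_text

# Hypothetical component rewards, chosen only to show the arithmetic.
reward = combined_reward(r_image=0.8, r_text=0.4)
```

With $\alpha = 0.5$ the combination reduces to the plain average of the two rewards, so neither the image-side nor the text-side signal dominates.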

5. Empirical contribution and benchmark relevance

The clearest empirical evidence for the importance of Gen-Searcher-SFT-10k comes from the paper’s ablation on KnowGen (Feng et al., 30 Mar 2026). The reported scores are 14.98 for the Qwen-Image baseline, 22.91 for Qwen-Image with a prompt-based workflow, 28.15 for Qwen-Image + Gen-Searcher-SFT, and 31.52 for the full Gen-Searcher system. This shows that the SFT dataset alone provides a large gain over both the base generator and a hand-designed prompting workflow, while RL yields an additional improvement on top of the SFT initialization.

The paper’s abstract reports that the full Gen-Searcher system improves Qwen-Image by around 16 points on KnowGen and 15 points on WISE. In the paper’s logic, Gen-Searcher-SFT-10k is the main source of the initial jump from prompt engineering to learned agent behavior, and the RL stage then further improves long-horizon search policy quality.
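The size of each step in the ablation follows directly from the four reported KnowGen scores. The short calculation below uses only those numbers; the dictionary labels and the `gain` helper are written for this illustration.

```python
# KnowGen scores as reported in the ablation.
knowgen_scores = {
    "baseline": 14.98,         # Qwen-Image
    "prompt_workflow": 22.91,  # Qwen-Image with a prompt-based workflow
    "sft": 28.15,              # Qwen-Image + Gen-Searcher-SFT
    "full": 31.52,             # full Gen-Searcher system
}

def gain(later: str, earlier: str) -> float:
    """Score difference between two configurations, rounded to two decimals."""
    return round(knowgen_scores[later] - knowgen_scores[earlier], 2)

sft_over_baseline = gain("sft", "baseline")         # 13.17
sft_over_workflow = gain("sft", "prompt_workflow")  # 5.24
rl_over_sft = gain("full", "sft")                   # 3.37
full_over_baseline = gain("full", "baseline")       # 16.54
```

The 16.54-point total matches the roughly 16-point KnowGen improvement quoted from the abstract, and 13.17 of those points are already present after SFT alone.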

KnowGen is designed to evaluate the same competencies that Gen-Searcher-SFT-10k is meant to teach. Its aggregate metric is

$\text{K-Score} = 0.1 \cdot \text{Faithfulness} + 0.4 \cdot \text{Visual Correctness} + 0.4 \cdot \text{Text Accuracy} + 0.1 \cdot \text{Aesthetics}.$

This alignment matters. The dataset does not merely improve conventional image quality; it targets search-grounded prompt construction, factual correctness, text accuracy, and visual grounding through retrieved references. The ablation therefore supports the interpretation that curated search trajectories are more effective than manually written workflows for this task family.
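The K-Score weighting can be written out as a small function. The component values in the example are hypothetical and only demonstrate the arithmetic; the dictionary keys and the function name are chosen here for readability.

```python
# Weights taken from the K-Score definition; they sum to 1.
K_SCORE_WEIGHTS = {
    "faithfulness": 0.1,
    "visual_correctness": 0.4,
    "text_accuracy": 0.4,
    "aesthetics": 0.1,
}

def k_score(components: dict) -> float:
    """Weighted sum of the four component scores."""
    missing = set(K_SCORE_WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(weight * components[name] for name, weight in K_SCORE_WEIGHTS.items())

# Hypothetical component scores, not results from the paper.
example = {
    "faithfulness": 50.0,
    "visual_correctness": 20.0,
    "text_accuracy": 30.0,
    "aesthetics": 60.0,
}
score = k_score(example)
```

Visual correctness and text accuracy together carry 0.8 of the total weight, so a system cannot score well on aesthetics and prompt faithfulness alone.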

6. Limitations, caveats, and position within the searcher-SFT literature

Several limitations are explicit in the paper (Feng et al., 30 Mar 2026). First, the “ground-truth” images are synthetic targets generated by Nano Banana Pro rather than human-authored gold images. Second, the pipeline depends on proprietary systems: Gemini 3. Pro for prompt and trajectory generation, Nano Banana Pro for synthesis, and Seed1.8 for scoring and filtering. Third, the training data is automatically constructed, so even after filtering it may still contain retrieval errors, imperfect factual aggregation, suboptimal reference image choices, or synthetic image artifacts. Fourth, the paper does not report many detailed dataset diagnostics, such as per-category counts, hop-depth distributions, exact tool-use frequencies, or detailed image-count distributions.

These caveats also delimit what Gen-Searcher-SFT-10k is not. It is not a generic web-search benchmark, not a human-annotated image corpus, and not a fully specified reproducible recipe independent of proprietary infrastructure. It is best understood as a curated agent-trajectory dataset for search-grounded image generation.

Within the broader search-agent literature, Gen-Searcher-SFT-10k occupies a distinctive position. OpenSeeker and OpenSeeker-v2 show that approximately 11.7k and 10.6k synthesized trajectories can train strong text-based frontier search agents with supervised fine-tuning alone, while DR-Venus shows that roughly 10K open trajectories can support a 4B deep research agent with agentic SFT plus RL refinement (Du et al., 16 Mar 2026, Du et al., 5 May 2026, Team et al., 21 Apr 2026). R1-Searcher++ and DLLM-Searcher, by contrast, use much smaller SFT initialization sets—805 examples for the Stage-1 cold start in R1-Searcher++ and 3977 curated teacher trajectories for Agentic SFT in DLLM-Searcher—before RL-style refinement (Song et al., 22 May 2025, Zhao et al., 3 Feb 2026). FORT-Searcher further argues that long trajectories alone are an insufficient proxy for deep-search supervision and emphasizes shortcut-resistant task synthesis, later answer-hit time, and lower prior-shortcut rate as stronger indicators of supervision quality (Deng et al., 10 Jun 2026).

This comparison suggests that Gen-Searcher-SFT-10k should be read as the multimodal analogue of the compact searcher-SFT paradigm: a roughly 10K-scale, trajectory-centered dataset whose primary function is to teach an agentic policy before reinforcement learning. Its distinguishing feature is that the terminal artifact is a grounded image-generation specification with selected reference images, not a textual answer or research report.