UniREdit-Data-100K Dataset Overview
- UniREdit-Data-100K is a large-scale synthetic dataset designed for training and benchmarking image editing models with structured chain-of-thought annotations.
- It systematically covers 18 reasoning sub-dimensions across real-world and game-world scenarios, enabling complex multi-object and logical interactions.
- The dataset underpins fine-tuning of dual-modality architectures like Bagel, achieving state-of-the-art performance on reasoning-centric editing benchmarks.
UniREdit-Data-100K is a large-scale, synthetic dataset designed to train and benchmark image editing models that require complex, step-by-step reasoning across both real-world and game-world scenarios. Developed as part of the UniREditBench effort, this dataset addresses core limitations of prior image editing corpora by systematically covering diverse reasoning dimensions and providing high-quality chain-of-thought (CoT) reasoning annotations. Its scale and structure enable the effective fine-tuning of multi-modal diffusion architectures that can perform instruction-based edits with logic, multi-object interactions, and constraint satisfaction, as well as spatial and temporal reasoning (Han et al., 3 Nov 2025).
1. Scope, Structure, and Design Goals
UniREdit-Data-100K comprises 100,421 samples, each representing an image editing task augmented with structured reasoning and multiple forms of supervision. Its coverage is balanced across both real-world and synthetic "game-world" settings.
Each sample in the dataset is structured as a quintuple $(O, I, R_{\text{cot}}, E, R_t)$ (a minimal schema sketch follows the list):
- $O$: original image,
- $I$: natural-language instruction specifying the edit,
- $R_{\text{cot}}$: chain-of-thought text explicating the reasoning required for the edit,
- $E$: ground-truth edited image,
- $R_t$: concise textual description of the intended edit effect.
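A plausible way to represent one such quintuple in code is the sketch below; the class name, field names, and example values are illustrative assumptions, not the dataset's actual release format.

```python
from dataclasses import dataclass

@dataclass
class UniREditSample:
    original_image: str   # O: path or identifier of the source image
    instruction: str      # I: natural-language edit instruction
    cot: str              # R_cot: chain-of-thought reasoning text
    edited_image: str     # E: path to the ground-truth edited image
    effect_text: str      # R_t: concise description of the intended effect

# Hypothetical game-world example (Tic-Tac-Toe sub-dimension).
sample = UniREditSample(
    original_image="boards/ttt_0421.png",
    instruction="Place the winning move for X.",
    cot="Step 1: X holds the top-left and center cells. "
        "Step 2: the bottom-right cell completes the diagonal. "
        "Step 3: draw an X in the bottom-right cell.",
    edited_image="boards/ttt_0421_edited.png",
    effect_text="An X appears in the bottom-right cell, completing the diagonal.",
)
```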
Coverage encompasses $18$ sub-dimensions within $8$ primary reasoning dimensions, with each sub-dimension represented by approximately $4,000$ examples. Real-world scenario categories include single-object transformations (e.g., viewpoint, pose, temporal, material) and multi-object interactions (e.g., collisions, fluid dynamics, spatial rearrangements), while game-world scenarios encompass planning, logical puzzle solving, strategic reasoning, and spatial intelligence (e.g., 3D reconstruction).
The dataset’s design was motivated by the need to:
- Support models that reason over multiple objects and context-dependent game rules,
- Enable learning of compositional edits and non-trivial effects (e.g., tool use or solving Sokoban),
- Provide reliable CoT supervision for training reasoning-augmented editors,
- Allow for dual-modality evaluation in concert with UniREditBench.
2. Synthetic Data Generation Pipeline
The generation of UniREdit-Data-100K involves two distinct branches: real-world edits and game-world reasoning scenarios.
A. Real-World Synthesis:
- Seed $(O_0, I_0, R_{t,0})$ triples are handcrafted.
- A vision-LLM (e.g., Gemini-2.5) is leveraged for prompt paraphrasing and variation, creating diverse instructions for the same editing goal.
- Image pairs $(O, E)$ are rendered by a generative model (e.g., GPT-4o) conditioned on the instruction $I$ and intended effect $R_t$.
- Gemini-2.5-Pro filters out low-fidelity or hallucinatory samples, and then annotates each sample with a stepwise CoT $R_{\text{cot}}$ (this branch is sketched below).
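As a rough illustration of this branch, the loop below strings the four stages together; the callables `paraphrase`, `render_pair`, `keep`, and `annotate_cot` stand in for the Gemini-2.5 and GPT-4o stages and are assumed names, not APIs documented in the paper.

```python
def synthesize_real_world(seed_triples, paraphrase, render_pair, keep, annotate_cot,
                          variants_per_seed=5):
    """Sketch of the real-world branch: seed triples -> filtered quintuples."""
    samples = []
    for original, instruction, effect in seed_triples:             # seed (O_0, I_0, R_t0)
        # Diversify the instruction while keeping the editing goal fixed.
        for variant in paraphrase(instruction, n=variants_per_seed):
            o_img, e_img = render_pair(original, variant, effect)   # render the (O, E) pair
            if not keep(o_img, e_img, variant, effect):             # drop low-fidelity edits
                continue
            r_cot = annotate_cot(o_img, variant, e_img, effect)     # stepwise reasoning text
            samples.append((o_img, variant, r_cot, e_img, effect))  # (O, I, R_cot, E, R_t)
    return samples
```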
B. Game-World Synthesis:
- For each puzzle or planning sub-dimension (Maze, Sokoban, Sudoku, etc.), a procedural Python program generates image pairs $(O, E)$ together with minimal instructions $I$, explicit effect text $R_t$, and a code-level reasoning trace (a toy generator is sketched after this list).
- A vision-LLM converts program traces into human-readable CoTs.
- Post-processing filters out logically inconsistent or visually erroneous samples.
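A toy version of such a generator program for a maze-like sub-dimension is sketched below; the grid encoding, the BFS solver, and the naive trace-to-text conversion are illustrative assumptions (in the actual pipeline a vision-LLM rewrites the program trace into a fluent CoT).

```python
from collections import deque

def solve_maze(grid, start, goal):
    """BFS over a 0/1 grid (0 = free, 1 = wall); returns the move-name trace."""
    moves = {(-1, 0): "up", (1, 0): "down", (0, -1): "left", (0, 1): "right"}
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        for (dr, dc), name in moves.items():
            nxt = (cell[0] + dr, cell[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in parents):
                parents[nxt] = (cell, name)
                queue.append(nxt)
    if goal not in parents:
        return []                                   # unreachable goal
    trace, cell = [], goal
    while parents[cell] is not None:
        cell, name = parents[cell]
        trace.append(name)
    return list(reversed(trace))                    # code-level reasoning trace

def trace_to_cot(trace):
    """Naive trace-to-text conversion standing in for the vision-LLM rewrite."""
    steps = [f"Step {i + 1}: move {m}." for i, m in enumerate(trace)]
    return " ".join(steps + ["The agent marker is redrawn at the goal cell."])

# Example: a 3x3 maze with one wall row partially blocking the path.
print(trace_to_cot(solve_maze([[0, 0, 0], [1, 1, 0], [0, 0, 0]], (0, 0), (2, 2))))
```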
C. Dataset Composition:
| Reasoning category | Sub-dimensions | Samples per sub-dimension | Example sub-dimensions |
|---|---|---|---|
| Real-World | 9 | ~4,000 | material, spatial |
| Game-World | 9 | ~4,000 | maze, sudoku |

In total, the 18 sub-dimensions yield 100,421 samples.
All samples are manually or programmatically verified for fidelity, logical coherence, and absence of spurious correlations.
3. Chain-of-Thought Annotation and Role in Training
A central feature of UniREdit-Data-100K is the inclusion of CoT ($R_{\text{cot}}$) supervision. Each CoT provides a detailed, stepwise decomposition of the reasoning process:
- For real-world edits: $R_{\text{cot}}$ describes how to map $O$ and $I$ to the sequence of visual transformations required to obtain $E$.
- For game-world tasks: $R_{\text{cot}}$ traces, for example, the moves in Sokoban, logical deductions in Sudoku, or step-by-step cell filling in Tic-Tac-Toe.
This explicit CoT annotation supports the "think-then-edit" paradigm used for training unified editing architectures such as Bagel, fostering robust generalization on tasks demanding sequential, symbolic, or spatial reasoning.
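A schematic of the paradigm, assuming a unified model that exposes separate reasoning and generation calls, is given below; `reason` and `generate_edit` are placeholder names, not Bagel's actual interface.

```python
def think_then_edit(model, original_image, instruction):
    """Schematic two-step inference: produce R_cot first, then condition the edit on it."""
    # 1. The language head reasons explicitly about the requested edit.
    r_cot = model.reason(image=original_image, instruction=instruction)
    # 2. The diffusion head renders E conditioned on (O, I, R_cot).
    edited_image = model.generate_edit(
        image=original_image, instruction=instruction, cot=r_cot
    )
    return r_cot, edited_image
```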
4. Model Training, Objectives, and Implementation
UniREdit-Data-100K underpins the fine-tuning of Bagel—a unified image understanding and editing model leveraging a two-stage architecture: an LLM-based reasoning module and a diffusion-based image generator.
Training objective:
- CoT text negative log-likelihood: $\mathcal{L}_{\text{cot}} = -\log p_\theta\!\left(R_{\text{cot}} \mid O, I\right)$
- Diffusion matching for images, conditioned on $(O, I, R_{\text{cot}})$: a denoising objective of the form $\mathcal{L}_{\text{img}} = \mathbb{E}_{t,\epsilon}\!\left[\left\lVert \epsilon - \epsilon_\theta\!\left(E_t, t \mid O, I, R_{\text{cot}}\right)\right\rVert^2\right]$
- Joint loss: $\mathcal{L} = \mathcal{L}_{\text{cot}} + \lambda\,\mathcal{L}_{\text{img}}$, with $\lambda$ balancing the two terms
Optimization:
- Optimizer: Adam
- Learning rate: cosine schedule with 500 warm-up steps, decaying from a peak value to a minimum value
- 5000 iterations, updating all Bagel weights except the frozen VAE
This training regime directly leverages the $(O, I, R_{\text{cot}}, E, R_t)$ structure to maximize both reasoning fidelity and image realism.
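A compact sketch of how such a joint objective could be applied in one optimization step, assuming a PyTorch-style interface, is shown below; `cot_nll`, `diffusion_loss`, and `lambda_img` are assumed names, and only the overall structure (text NLL plus conditioned diffusion matching) mirrors the described objective.

```python
def training_step(model, optimizer, batch, lambda_img=1.0):
    """One joint update over a batch of (O, I, R_cot, E, R_t) samples."""
    o, ins, r_cot, e = batch["original"], batch["instruction"], batch["cot"], batch["edited"]

    loss_cot = model.cot_nll(r_cot, condition=(o, ins))            # -log p(R_cot | O, I)
    loss_img = model.diffusion_loss(e, condition=(o, ins, r_cot))  # denoising term for E
    loss = loss_cot + lambda_img * loss_img                        # joint objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```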
5. Empirical Impact and Benchmark Results
Models trained with UniREdit-Data-100K demonstrate state-of-the-art performance on UniREditBench, particularly when evaluated on tasks requiring multi-step reasoning or cross-modal understanding.
- UniREdit-Bagel, trained on UniREdit-Data-100K, achieves a weighted overall score of $78.87$ on UniREditBench, exceeding GPT-4o.
- The largest improvements are observed in game-world reasoning scenarios, where UniREdit-Bagel outperforms the next-best model by a clear margin and achieves strong accuracy on Sokoban, Maze, Sudoku, and Tic-Tac-Toe tasks.
- Out-of-distribution generalization is also enhanced:
- On RISEBench (temporal/causal/spatial/logical reasoning), UniREdit-Bagel improves over both Bagel-Think and Gemini-2.0-Flash-exp.
- On KRISBench (knowledge grounding), it improves overall performance over Bagel-Think.
Visual consistency—preservation of regions outside the edit—remains robust, while strategic and adversarial planning (e.g., Pacman, Space Invader) and certain real-world mechanics continue to pose challenges, particularly for open-source models not trained with CoT supervision.
6. Significance and Directions for Future Research
UniREdit-Data-100K serves as the central enabling resource for studying and advancing reasoning-based image editing:
- By providing task-level diversity spanning spatial, logical, physical, and symbolic transformations, it allows systematic benchmarking of visual reasoning and multi-object manipulation capabilities.
- The CoT annotation format catalyzes research into LLM-based planning for image transformation, fostering new algorithmic advances in the "think-then-edit" paradigm.
- Its multimodal structure and game-world coverage pave the way for benchmarking and developing models capable of simulating rule-governed edits, a prerequisite for applications in vision-based planning, robotics, and synthetic agent environments.
Proposed future directions stemming from this work include:
- Expanding the paradigm to video editing with temporal consistency requirements,
- Simulating more complex physical phenomena (e.g., fluid mechanics, cloth deformation),
- Open-sourcing standardized dual-reference evaluators,
- Integrating symbolic planners or reinforcement learning modules to enhance long-horizon and strategic reasoning.
The methodology and structure established by UniREdit-Data-100K situate it as a cornerstone for the next generation of reasoning-augmented vision editing architectures and evaluation protocols (Han et al., 3 Nov 2025).