UniREdit-Data-100K Dataset Overview
- UniREdit-Data-100K is a large-scale synthetic dataset designed for training and benchmarking image editing models with structured chain-of-thought annotations.
- It systematically covers 18 reasoning sub-dimensions across real-world and game-world scenarios, enabling complex multi-object and logical interactions.
- The dataset underpins fine-tuning of dual-modality architectures like Bagel, achieving state-of-the-art performance on reasoning-centric editing benchmarks.
UniREdit-Data-100K is a large-scale, synthetic dataset designed to train and benchmark image editing models that require complex, step-by-step reasoning across both real-world and game-world scenarios. Developed as part of the UniREditBench effort, this dataset addresses core limitations of prior image editing corpora by systematically covering diverse reasoning dimensions and providing high-quality chain-of-thought (CoT) reasoning annotations. Its scale and structure enable the effective fine-tuning of multi-modal diffusion architectures that can perform instruction-based edits with logic, multi-object interactions, and constraint satisfaction, as well as spatial and temporal reasoning (Han et al., 3 Nov 2025).
1. Scope, Structure, and Design Goals
UniREdit-Data-100K comprises 100,421 samples, each representing an image editing task augmented with structured reasoning and multiple forms of supervision. Its coverage is balanced across both real-world and synthetic "game-world" settings.
Each sample in the dataset is structured as a quintuple $(O, I, R_{\text{cot}}, E, R_t)$ (a minimal schema sketch follows the list):
- $O$: original image,
- $I$: natural-language instruction specifying the edit,
- $R_{\text{cot}}$: chain-of-thought text explicating the reasoning required for the edit,
- $E$: ground-truth edited image,
- $R_t$: concise textual description of the intended edit effect.
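A plausible way to represent one such quintuple in code is the sketch below; the class name, field names, and example values are illustrative assumptions, not the dataset's actual release format.

```python
from dataclasses import dataclass

@dataclass
class UniREditSample:
    original_image: str   # O: path or identifier of the source image
    instruction: str      # I: natural-language edit instruction
    cot: str              # R_cot: chain-of-thought reasoning text
    edited_image: str     # E: path to the ground-truth edited image
    effect_text: str      # R_t: concise description of the intended effect

# Hypothetical game-world example (Tic-Tac-Toe sub-dimension).
sample = UniREditSample(
    original_image="boards/ttt_0421.png",
    instruction="Place the winning move for X.",
    cot="Step 1: X holds the top-left and center cells. "
        "Step 2: the bottom-right cell completes the diagonal. "
        "Step 3: draw an X in the bottom-right cell.",
    edited_image="boards/ttt_0421_edited.png",
    effect_text="An X appears in the bottom-right cell, completing the diagonal.",
)
```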
Coverage encompasses $18$ sub-dimensions within $8$ primary reasoning dimensions, with each sub-dimension represented by approximately $4,000$ examples. Real-world scenario categories include single-object transformations (e.g., viewpoint, pose, temporal, material) and multi-object interactions (e.g., collisions, fluid dynamics, spatial rearrangements), while game-world scenarios encompass planning, logical puzzle solving, strategic reasoning, and spatial intelligence (e.g., 3D reconstruction).
The dataset’s design was motivated by the need to:
- Support models that reason over multiple objects and context-dependent game rules,
- Enable learning of compositional edits and non-trivial effects (e.g., tool use or solving Sokoban),
- Provide reliable CoT supervision for training reasoning-augmented editors,
- Allow for dual-modality evaluation in concert with UniREditBench.
2. Synthetic Data Generation Pipeline
The generation of UniREdit-Data-100K involves two distinct branches: real-world edits and game-world reasoning scenarios.
A. Real-World Synthesis:
- Seed $(O_0, I_0, R_{t,0})$ triples are handcrafted.
- A vision-LLM (e.g., Gemini-2.5) is leveraged for prompt paraphrasing and variation, creating diverse instructions for the same editing goal.
- Image pairs $(O, E)$ are rendered by a generative model (e.g., GPT-4o) conditioned on the instruction $I$ and intended effect $R_t$.
- Gemini-2.5-Pro filters out low-fidelity or hallucinatory samples, and then annotates each sample with a stepwise CoT $R_{\text{cot}}$ (this branch is sketched below).
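As a rough illustration of this branch, the loop below strings the four stages together; the callables `paraphrase`, `render_pair`, `keep`, and `annotate_cot` stand in for the Gemini-2.5 and GPT-4o stages and are assumed names, not APIs documented in the paper.

```python
def synthesize_real_world(seed_triples, paraphrase, render_pair, keep, annotate_cot,
                          variants_per_seed=5):
    """Sketch of the real-world branch: seed triples -> filtered quintuples."""
    samples = []
    for original, instruction, effect in seed_triples:             # seed (O_0, I_0, R_t0)
        # Diversify the instruction while keeping the editing goal fixed.
        for variant in paraphrase(instruction, n=variants_per_seed):
            o_img, e_img = render_pair(original, variant, effect)   # render the (O, E) pair
            if not keep(o_img, e_img, variant, effect):             # drop low-fidelity edits
                continue
            r_cot = annotate_cot(o_img, variant, e_img, effect)     # stepwise reasoning text
            samples.append((o_img, variant, r_cot, e_img, effect))  # (O, I, R_cot, E, R_t)
    return samples
```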
B. Game-World Synthesis:
- For each puzzle or planning sub-dimension (Maze, Sokoban, Sudoku, etc.), a procedural Python program generates image pairs $(O, E)$ together with minimal instructions $I$, explicit effect text $R_t$, and a code-level reasoning trace (a toy generator is sketched after this list).
- A vision-LLM converts program traces into human-readable CoTs.
- Post-processing filters out logically inconsistent or visually erroneous samples.
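A toy version of such a generator program for a maze-like sub-dimension is sketched below; the grid encoding, the BFS solver, and the naive trace-to-text conversion are illustrative assumptions (in the actual pipeline a vision-LLM rewrites the program trace into a fluent CoT).

```python
from collections import deque

def solve_maze(grid, start, goal):
    """BFS over a 0/1 grid (0 = free, 1 = wall); returns the move-name trace."""
    moves = {(-1, 0): "up", (1, 0): "down", (0, -1): "left", (0, 1): "right"}
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        for (dr, dc), name in moves.items():
            nxt = (cell[0] + dr, cell[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in parents):
                parents[nxt] = (cell, name)
                queue.append(nxt)
    if goal not in parents:
        return []                                   # unreachable goal
    trace, cell = [], goal
    while parents[cell] is not None:
        cell, name = parents[cell]
        trace.append(name)
    return list(reversed(trace))                    # code-level reasoning trace

def trace_to_cot(trace):
    """Naive trace-to-text conversion standing in for the vision-LLM rewrite."""
    steps = [f"Step {i + 1}: move {m}." for i, m in enumerate(trace)]
    return " ".join(steps + ["The agent marker is redrawn at the goal cell."])

# Example: a 3x3 maze with one wall row partially blocking the path.
print(trace_to_cot(solve_maze([[0, 0, 0], [1, 1, 0], [0, 0, 0]], (0, 0), (2, 2))))
```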
C. Dataset Composition:
| Reasoning category | Sub-dimensions | Samples per sub-dimension | Example sub-dimensions |
|---|---|---|---|
| Real-World | 9 | ~4,000 | material, spatial |
| Game-World | 9 | ~4,000 | maze, sudoku |

In total, the 18 sub-dimensions yield 100,421 samples.
All samples are manually or programmatically verified for fidelity, logical coherence, and absence of spurious correlations.
3. Chain-of-Thought Annotation and Role in Training
A central feature of UniREdit-Data-100K is the inclusion of CoT ($R_{\text{cot}}$) supervision. Each CoT provides a detailed, stepwise decomposition of the reasoning process:
- For real-world edits: $R_{\text{cot}}$ describes how to map $O$ and $I$ to the sequence of visual transformations required to obtain $E$.
- For game-world tasks: $R_{\text{cot}}$ traces, for example, the moves in Sokoban, logical deductions in Sudoku, or step-by-step cell filling in Tic-Tac-Toe.
This explicit CoT annotation supports the "think-then-edit" paradigm used for training unified editing architectures such as Bagel, fostering robust generalization on tasks demanding sequential, symbolic, or spatial reasoning.
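A schematic of the paradigm, assuming a unified model that exposes separate reasoning and generation calls, is given below; `reason` and `generate_edit` are placeholder names, not Bagel's actual interface.

```python
def think_then_edit(model, original_image, instruction):
    """Schematic two-step inference: produce R_cot first, then condition the edit on it."""
    # 1. The language head reasons explicitly about the requested edit.
    r_cot = model.reason(image=original_image, instruction=instruction)
    # 2. The diffusion head renders E conditioned on (O, I, R_cot).
    edited_image = model.generate_edit(
        image=original_image, instruction=instruction, cot=r_cot
    )
    return r_cot, edited_image
```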
4. Model Training, Objectives, and Implementation
UniREdit-Data-100K underpins the fine-tuning of Bagel—a unified image understanding and editing model leveraging a two-stage architecture: an LLM-based reasoning module and a diffusion-based image generator.
Training objective:
- CoT text negative log-likelihood: $\mathcal{L}_{\text{cot}} = -\log p_\theta\!\left(R_{\text{cot}} \mid O, I\right)$
- Diffusion matching for images, conditioned on $(O, I, R_{\text{cot}})$: a denoising objective of the form $\mathcal{L}_{\text{img}} = \mathbb{E}_{t,\epsilon}\!\left[\left\lVert \epsilon - \epsilon_\theta\!\left(E_t, t \mid O, I, R_{\text{cot}}\right)\right\rVert^2\right]$
- Joint loss: $\mathcal{L} = \mathcal{L}_{\text{cot}} + \lambda\,\mathcal{L}_{\text{img}}$, with $\lambda$ balancing the two terms
Optimization:
- Optimizer: Adam
- Learning rate: cosine schedule with 500 warm-up steps, decaying from a peak value to a minimum value
- 5000 iterations, updating all Bagel weights except the frozen VAE
This training regime directly leverages the $(O, I, R_{\text{cot}}, E, R_t)$ structure to maximize both reasoning fidelity and image realism.
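A compact sketch of how such a joint objective could be applied in one optimization step, assuming a PyTorch-style interface, is shown below; `cot_nll`, `diffusion_loss`, and `lambda_img` are assumed names, and only the overall structure (text NLL plus conditioned diffusion matching) mirrors the described objective.

```python
def training_step(model, optimizer, batch, lambda_img=1.0):
    """One joint update over a batch of (O, I, R_cot, E, R_t) samples."""
    o, ins, r_cot, e = batch["original"], batch["instruction"], batch["cot"], batch["edited"]

    loss_cot = model.cot_nll(r_cot, condition=(o, ins))            # -log p(R_cot | O, I)
    loss_img = model.diffusion_loss(e, condition=(o, ins, r_cot))  # denoising term for E
    loss = loss_cot + lambda_img * loss_img                        # joint objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```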
5. Empirical Impact and Benchmark Results
Models trained with UniREdit-Data-100K demonstrate state-of-the-art performance on UniREditBench, particularly when evaluated on tasks requiring multi-step reasoning or cross-modal understanding.
- UniREdit-Bagel, trained on UniREdit-Data-100K, achieves a weighted overall score of $78.87$ on UniREditBench, exceeding GPT-4o.
- The largest improvements are observed in game-world reasoning scenarios, where UniREdit-Bagel outperforms the next-best model by a clear margin and achieves strong accuracy on Sokoban, Maze, Sudoku, and Tic-Tac-Toe tasks.
- Out-of-distribution generalization is also enhanced:
- On RISEBench (temporal/causal/spatial/logical reasoning), UniREdit-Bagel improves over both Bagel-Think and Gemini-2.0-Flash-exp.
- On KRISBench (knowledge grounding), it improves overall performance over Bagel-Think.
Visual consistency—preservation of regions outside the edit—remains robust, while strategic and adversarial planning (e.g., Pacman, Space Invader) and certain real-world mechanics continue to pose challenges, particularly for open-source models not trained with CoT supervision.
6. Significance and Directions for Future Research
UniREdit-Data-100K serves as the central enabling resource for studying and advancing reasoning-based image editing:
- By providing task-level diversity spanning spatial, logical, physical, and symbolic transformations, it allows systematic benchmarking of visual reasoning and multi-object manipulation capabilities.
- The CoT annotation format catalyzes research into LLM-based planning for image transformation, fostering new algorithmic advances in the "think-then-edit" paradigm.
- Its multimodal structure and game-world coverage pave the way for benchmarking and developing models capable of simulating rule-governed edits, a prerequisite for applications in vision-based planning, robotics, and synthetic agent environments.
Proposed future directions stemming from this work include:
- Expanding the paradigm to video editing with temporal consistency requirements,
- Simulating more complex physical phenomena (e.g., fluid mechanics, cloth deformation),
- Open-sourcing standardized dual-reference evaluators,
- Integrating symbolic planners or reinforcement learning modules to enhance long-horizon and strategic reasoning.
The methodology and structure established by UniREdit-Data-100K situate it as a cornerstone for the next generation of reasoning-augmented vision editing architectures and evaluation protocols (Han et al., 3 Nov 2025).