UniREdit-Data-100K Dataset Overview

Updated 9 November 2025
  • UniREdit-Data-100K is a large-scale synthetic dataset designed for training and benchmarking image editing models with structured chain-of-thought annotations.
  • It systematically covers 18 reasoning sub-dimensions across real-world and game-world scenarios, enabling complex multi-object and logical interactions.
  • The dataset underpins fine-tuning of dual-modality architectures like Bagel, achieving state-of-the-art performance on reasoning-centric editing benchmarks.

UniREdit-Data-100K is a large-scale, synthetic dataset designed to train and benchmark image editing models that require complex, step-by-step reasoning across both real-world and game-world scenarios. Developed as part of the UniREditBench effort, this dataset addresses core limitations of prior image editing corpora by systematically covering diverse reasoning dimensions and providing high-quality chain-of-thought (CoT) reasoning annotations. Its scale and structure enable the effective fine-tuning of multi-modal diffusion architectures that can perform instruction-based edits with logic, multi-object interactions, and constraint satisfaction, as well as spatial and temporal reasoning (Han et al., 3 Nov 2025).

1. Scope, Structure, and Design Goals

UniREdit-Data-100K comprises 100,421 samples, each representing an image editing task augmented with structured reasoning and multiple forms of supervision. Its coverage is balanced across both real-world and synthetic "game-world" settings.

Each sample in the dataset is structured as a quintuple:

  • $O$: original image,
  • $I$: natural language instruction specifying the edit,
  • $C$: chain-of-thought text explicating the reasoning required for the edit,
  • $G$: ground-truth edited image,
  • $R_t$: concise textual description of the intended editing effect.
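A minimal Python sketch of this quintuple schema is given below; the class and field names are illustrative stand-ins, not the released file format.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class UniREditSample:
    """One editing example, corresponding to the quintuple (O, I, C, G, R_t).

    Field names are illustrative; the released dataset may use different
    keys or file layouts.
    """
    original_image: Path   # O: the image to be edited
    instruction: str       # I: natural language instruction specifying the edit
    chain_of_thought: str  # C: stepwise reasoning required for the edit
    edited_image: Path     # G: ground-truth edited image
    effect_text: str       # R_t: concise description of the intended editing effect
```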

Coverage encompasses 18 sub-dimensions within 8 primary reasoning dimensions, with each sub-dimension represented by approximately 4,000 examples. The real-world scenario categories include single-object transformations (e.g., viewpoint, pose, temporal, material changes) and multi-object interactions (e.g., collisions, fluid dynamics, spatial rearrangements), while game-world scenarios encompass planning, logical puzzle solving, strategic reasoning, and spatial intelligence (e.g., 3D reconstruction).

The dataset’s design was motivated by the need to:

  • Support models that reason over multiple objects and context-dependent game rules,
  • Enable learning of compositional edits and non-trivial effects (e.g., tool use or solving Sokoban),
  • Provide reliable CoT supervision for training reasoning-augmented editors,
  • Allow for dual-modality evaluation in concert with UniREditBench.

2. Synthetic Data Generation Pipeline

The generation of UniREdit-Data-100K involves two distinct branches: real-world edits and game-world reasoning scenarios.

A. Real-World Synthesis:

  1. Seed triples $(O_0, I_0, R_{t,0})$ are handcrafted.
  2. A vision-LLM (e.g., Gemini-2.5) is leveraged for prompt paraphrasing and variation, creating diverse instructions for the same editing goal.
  3. Image pairs $(O, G)$ are rendered by a generative model (e.g., GPT-4o) conditioned on the instruction $I$ and intended effect $R_t$.
  4. Gemini-2.5-Pro filters out low-fidelity or hallucinatory samples, and then annotates each sample with a stepwise CoT CC.
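A hedged sketch of this four-stage branch is shown below. The four callables passed in (`paraphrase`, `render_pair`, `keep_sample`, `annotate_cot`) are placeholders for the vision-LLM and image-generator calls named above; none of them is a real API.

```python
def synthesize_real_world(seed_triples, paraphrase, render_pair,
                          keep_sample, annotate_cot, n_variants=4):
    """Sketch of the real-world synthesis branch.

    The callables stand in for the vision-LLM (e.g., Gemini-2.5 for
    paraphrasing, filtering, and CoT annotation) and the generative model
    (e.g., GPT-4o for rendering); they are assumptions, not real APIs.
    """
    samples = []
    for original, instruction, effect in seed_triples:     # seed (O_0, I_0, R_t0) triples
        # Step 2: paraphrase the instruction into several variants of the same edit goal.
        for varied in paraphrase(instruction, n=n_variants):
            # Step 3: render the (O, G) pair conditioned on the instruction and intended effect.
            o_img, g_img = render_pair(original, varied, effect)
            # Step 4a: discard low-fidelity or hallucinatory renders.
            if not keep_sample(o_img, g_img, varied, effect):
                continue
            # Step 4b: annotate the surviving sample with a stepwise chain of thought C.
            cot = annotate_cot(o_img, varied, g_img, effect)
            samples.append((o_img, varied, cot, g_img, effect))
    return samples
```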

B. Game-World Synthesis:

  1. For each puzzle or planning sub-dimension (Maze, Sokoban, Sudoku, etc.), a procedural Python program generates $(O, G)$ image pairs with minimal instructions $I$, explicit effect text $R_t$, and a code-level reasoning trace.
  2. A vision-LLM converts program traces into human-readable CoTs.
  3. Post-processing filters out logically inconsistent or visually erroneous samples.
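As a concrete illustration of step 1, the sketch below procedurally builds one maze-style example: the unsolved grid plays the role of $O$, the grid with the solution path marked plays the role of $G$, and the move sequence is the code-level reasoning trace later rewritten into a human-readable CoT. Image rendering is omitted, and the helper is an assumption rather than the authors' generator.

```python
from collections import deque

def generate_maze_sample(grid, start, goal):
    """Build one game-world sample from a 0/1 grid (0 = open cell, 1 = wall).

    Assumes the goal is reachable from the start. Returns the original grid (O),
    the solved grid (G), a minimal instruction (I), the effect text (R_t),
    and the move-by-move reasoning trace.
    """
    moves = {(-1, 0): "up", (1, 0): "down", (0, -1): "left", (0, 1): "right"}
    # Breadth-first search for the shortest path from start to goal.
    queue, parent = deque([start]), {start: None}
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            break
        for dr, dc in moves:
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in parent):
                parent[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    # Reconstruct the path and the move-by-move trace.
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    trace = [moves[(b[0] - a[0], b[1] - a[1])] for a, b in zip(path, path[1:])]
    solved = [row[:] for row in grid]
    for r, c in path:
        solved[r][c] = 2          # mark the solution path in the "edited" grid
    instruction = "Solve the maze and draw the path from start to goal."
    effect_text = f"The shortest path ({len(trace)} moves) is highlighted."
    return grid, solved, instruction, effect_text, trace
```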

C. Dataset Composition:

| Reasoning category | Sub-dimensions | Samples per sub-dimension | Examples |
|---|---|---|---|
| Real-World | 9 | ~4,000 | material, spatial |
| Game-World | 9 | ~4,000 | maze, sudoku |
| Total | 18 | ~4,000 | 100,421 samples |

All samples are manually or programmatically verified for fidelity, logical coherence, and absence of spurious correlations.

3. Chain-of-Thought Annotation and Role in Training

A central feature of UniREdit-Data-100K is the inclusion of CoT ($C$) supervision. Each CoT provides a detailed, stepwise decomposition of the reasoning process:

  • For real-world edits: $C$ describes how to map $O$ and $I$ to the sequence of visual transformations required to obtain $G$.
  • For game-world tasks: $C$ traces, for example, the moves in Sokoban, logical deductions in Sudoku, or step-by-step cell filling in Tic-Tac-Toe.

This explicit CoT annotation supports the "think-then-edit" paradigm used for training unified editing architectures such as Bagel, fostering robust generalization on tasks demanding sequential, symbolic, or spatial reasoning.
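One way such supervision can be serialized for "think-then-edit" training is sketched below: the model is prompted to emit $C$ before the edit, and at inference the reasoning span is left for the model to fill in. The tag names are illustrative assumptions, not Bagel's actual prompt template.

```python
def format_think_then_edit(instruction, chain_of_thought=None):
    """Serialize one 'think-then-edit' text target.

    At training time the chain of thought C is supervised with the text loss
    and the image branch is conditioned on (O, I, C); at inference the model
    generates its own reasoning. The <think>/<edit> tags are hypothetical.
    """
    prompt = f"Instruction: {instruction}\n<think>\n"
    if chain_of_thought is None:
        return prompt                                   # inference: model fills in its reasoning
    return prompt + chain_of_thought + "\n</think>\n<edit>"
```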

4. Model Training, Objectives, and Implementation

UniREdit-Data-100K underpins the fine-tuning of Bagel—a unified image understanding and editing model leveraging a two-stage architecture: an LLM-based reasoning module and a diffusion-based image generator.

Training objective:

  • CoT text negative log-likelihood:

$$\mathcal{L}_{\rm text} = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, O, I\right)$$

  • Diffusion matching for images, conditioned on $(O, I, C)$:

$$\mathcal{L}_{\rm img} = \mathbb{E}_{t\sim\mathcal{U}(0,1)} \left\| u_\theta(z_t, t;\, O, I, C) - u^\star(z_t, t) \right\|_2^2$$

  • Joint loss:

$$\mathcal{L} = \lambda_{\rm text}\,\mathcal{L}_{\rm text} + \lambda_{\rm img}\,\mathcal{L}_{\rm img}$$
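A PyTorch-style sketch of one optimization step under this joint objective follows. The `text_logits`, `noise_latents`, and `predict_velocity` methods are hypothetical interfaces standing in for Bagel's LLM reasoning module and diffusion head; the real model exposes different entry points.

```python
import torch
import torch.nn.functional as F

def joint_training_step(model, batch, optimizer, lambda_text=1.0, lambda_img=1.0):
    """One step of the joint text/image objective (a sketch, not Bagel's code)."""
    # CoT negative log-likelihood: predict each token of C given (O, I) and prior tokens.
    logits = model.text_logits(batch["original"], batch["instruction"],
                               batch["cot_tokens"][:, :-1])
    loss_text = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        batch["cot_tokens"][:, 1:].reshape(-1),
    )

    # Diffusion matching: regress the target velocity u* at a random time t,
    # with the image branch conditioned on (O, I, C).
    t = torch.rand(batch["latents"].size(0), device=batch["latents"].device)
    z_t, u_star = model.noise_latents(batch["latents"], t)
    u_pred = model.predict_velocity(z_t, t, batch["original"],
                                    batch["instruction"], batch["cot_tokens"])
    loss_img = F.mse_loss(u_pred, u_star)

    loss = lambda_text * loss_text + lambda_img * loss_img
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```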

Optimization:

  • Optimizer: Adam
  • Learning rate: cosine decay, 500 warm-up steps, peak $2 \times 10^{-5}$ to minimum $1 \times 10^{-6}$
  • 5000 iterations, updating all Bagel weights except the frozen VAE
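A small sketch of the stated schedule (linear warm-up over 500 steps to a peak of $2 \times 10^{-5}$, then cosine decay to $1 \times 10^{-6}$ over 5,000 iterations); it reproduces the reported hyperparameters, not the authors' exact code.

```python
import math

def lr_at_step(step, total_steps=5000, warmup_steps=500, peak_lr=2e-5, min_lr=1e-6):
    """Cosine-decay learning rate with linear warm-up, matching the reported settings."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps        # linear warm-up to the peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```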

This training regime directly leverages the $(O, I, C, G, R_t)$ structure to maximize both reasoning fidelity and image realism.

5. Empirical Impact and Benchmark Results

Models trained with UniREdit-Data-100K demonstrate state-of-the-art performance on UniREditBench, particularly when evaluated on tasks requiring multi-step reasoning or cross-modal understanding.

  • UniREdit-Bagel, trained on UniREdit-Data-100K, achieves a weighted overall score of 78.87 on UniREditBench, exceeding GPT-4o by +7.23 points.
  • Largest improvements are observed in game-world reasoning scenarios, with UniREdit-Bagel outperforming the next-best model by +17.08 points and exceeding 95% accuracy on Sokoban, Maze, Sudoku, and Tic-Tac-Toe tasks.
  • Out-of-distribution generalization is also enhanced:
    • On RISEBench (temporal/causal/spatial/logical): UniREdit-Bagel obtains 18.3%, a gain of +9.1 points over Bagel-Think and +5.0 points over Gemini-2.0-Flash-exp.
    • On KRISBench (knowledge grounding): 65.45% overall, +4.7 points over Bagel-Think.

Visual consistency—preservation of regions outside the edit—remains robust, while strategic and adversarial planning (e.g., Pacman, Space Invader) and certain real-world mechanics continue to pose challenges, particularly for open-source models not trained with CoT supervision.

6. Significance and Directions for Future Research

UniREdit-Data-100K serves as the central enabling resource for studying and advancing reasoning-based image editing:

  • By providing task-level diversity spanning spatial, logical, physical, and symbolic transformations, it allows systematic benchmarking of visual reasoning and multi-object manipulation capabilities.
  • The CoT annotation format catalyzes research into LLM-based planning for image transformation, fostering new algorithmic advances in the "think-then-edit" paradigm.
  • Its multimodal structure and game-world coverage pave the way for benchmarking and developing models capable of simulating rule-governed edits, a prerequisite for applications in vision-based planning, robotics, and synthetic agent environments.

Proposed future directions stemming from this work include:

  • Expanding the paradigm to video editing with temporal consistency requirements,
  • Simulating more complex physical phenomena (e.g., fluid mechanics, cloth deformation),
  • Open-sourcing standardized dual-reference evaluators,
  • Integrating symbolic planners or reinforcement learning modules to enhance long-horizon and strategic reasoning.

The methodology and structure established by UniREdit-Data-100K situate it as a cornerstone for the next generation of reasoning-augmented vision editing architectures and evaluation protocols (Han et al., 3 Nov 2025).
