UniREdit-Data-100K Dataset Overview

Updated 9 November 2025
  • UniREdit-Data-100K is a large-scale synthetic dataset designed for training and benchmarking image editing models with structured chain-of-thought annotations.
  • It systematically covers 18 reasoning sub-dimensions across real-world and game-world scenarios, enabling complex multi-object and logical interactions.
  • The dataset underpins fine-tuning of dual-modality architectures like Bagel, achieving state-of-the-art performance on reasoning-centric editing benchmarks.

UniREdit-Data-100K is a large-scale, synthetic dataset designed to train and benchmark image editing models that require complex, step-by-step reasoning across both real-world and game-world scenarios. Developed as part of the UniREditBench effort, this dataset addresses core limitations of prior image editing corpora by systematically covering diverse reasoning dimensions and providing high-quality chain-of-thought (CoT) reasoning annotations. Its scale and structure enable effective fine-tuning of multi-modal diffusion architectures that perform instruction-based edits involving logic, multi-object interaction, constraint satisfaction, and spatial and temporal reasoning (Han et al., 3 Nov 2025).

1. Scope, Structure, and Design Goals

UniREdit-Data-100K comprises 100,421 samples, each representing an image editing task augmented with structured reasoning and multiple forms of supervision. Its coverage is balanced across both real-world and synthetic "game-world" settings.

Each sample in the dataset is structured as a quintuple $(O, I, C, G, R_t)$, illustrated in code after the list:

  • $O$: the original image,
  • $I$: a natural-language instruction specifying the edit,
  • $C$: chain-of-thought text explicating the reasoning required for the edit,
  • $G$: the ground-truth edited image,
  • $R_t$: a concise textual description of the intended edited effect.
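
A minimal sketch of one sample as a Python record; the field names and file paths are illustrative, not the dataset's published schema:

```python
from dataclasses import dataclass

@dataclass
class UniREditSample:
    """One editing task from UniREdit-Data-100K (illustrative field names)."""
    original_image: str    # O: path to the input image
    instruction: str       # I: natural-language edit instruction
    chain_of_thought: str  # C: stepwise reasoning for the edit
    edited_image: str      # G: path to the ground-truth edited image
    effect_text: str       # R_t: concise description of the intended effect

# Hypothetical example from the game-world "maze" sub-dimension:
sample = UniREditSample(
    original_image="maze_0421_before.png",
    instruction="Move the agent to the exit along the shortest path.",
    chain_of_thought="Step 1: locate the agent at (2, 1). Step 2: ...",
    edited_image="maze_0421_after.png",
    effect_text="The agent now stands on the exit tile.",
)
```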

Coverage encompasses 18 sub-dimensions within 8 primary reasoning dimensions, with each sub-dimension represented by approximately 4,000 examples. Real-world categories include single-object transformations (e.g., viewpoint, pose, temporal, material) and multi-object interactions (e.g., collisions, fluid dynamics, spatial rearrangements), while game-world scenarios encompass planning, logical puzzle solving, strategic reasoning, and spatial intelligence (e.g., 3D reconstruction).

The dataset’s design was motivated by the need to:

  • Support models that reason over multiple objects and context-dependent game rules,
  • Enable learning of compositional edits and non-trivial effects (e.g., tool use or solving Sokoban),
  • Provide reliable CoT supervision for training reasoning-augmented editors,
  • Allow for dual-modality evaluation in concert with UniREditBench.

2. Synthetic Data Generation Pipeline

The generation of UniREdit-Data-100K involves two distinct branches: real-world edits and game-world reasoning scenarios.

A. Real-World Synthesis:

  1. Seed triples $(O_0, I_0, R_{t_0})$ are handcrafted.
  2. A vision-LLM (e.g., Gemini-2.5) paraphrases and varies each prompt, creating diverse instructions for the same editing goal.
  3. Image pairs $(O, G)$ are rendered by a generative model (e.g., GPT-4o) conditioned on the instruction $I$ and the intended effect $R_t$.
  4. Gemini-2.5-Pro filters out low-fidelity or hallucinatory samples, then annotates each surviving sample with a stepwise CoT $C$.
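
A schematic of this branch in code. The four callables are hypothetical wrappers around the vision-LLM and generative-model APIs named above, not the authors' actual implementation:

```python
def synthesize_real_world(seed_triples, paraphrase, render_pair, filter_and_annotate):
    """Sketch of the real-world branch; `paraphrase`, `render_pair`, and
    `filter_and_annotate` are hypothetical stand-ins for the Gemini-2.5
    and GPT-4o calls described in steps 2-4."""
    samples = []
    for o0, i0, rt0 in seed_triples:
        # Step 2: diversify the instruction while keeping the editing goal fixed.
        for instruction in paraphrase(i0):
            # Step 3: render the (original, edited) image pair for this instruction.
            original, edited = render_pair(o0, instruction, rt0)
            # Step 4: drop low-fidelity outputs; annotate survivors with a CoT.
            cot = filter_and_annotate(original, instruction, edited, rt0)
            if cot is not None:  # None signals a rejected sample
                samples.append((original, instruction, cot, edited, rt0))
    return samples
```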

B. Game-World Synthesis:

  1. For each puzzle or planning sub-dimension (Maze, Sokoban, Sudoku, etc.), a procedural Python program generates $(O, G)$ image pairs with a minimal instruction $I$, explicit effect text $R_t$, and a code-level reasoning trace.
  2. A vision-LLM converts the program traces into human-readable CoTs.
  3. Post-processing filters out logically inconsistent or visually erroneous samples.
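
A minimal sketch of what the procedural branch might look like for a maze-style sub-dimension, assuming a simple grid encoding; the BFS move trace plays the role of the code-level reasoning trace that a vision-LLM later verbalizes:

```python
from collections import deque

def solve_maze(grid, start, goal):
    """BFS over a grid maze ('#' = wall); returns the move trace that a
    vision-LLM would later rewrite as a human-readable chain of thought."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (r, c), trace = queue.popleft()
        if (r, c) == goal:
            return trace  # shortest sequence of moves
        for name, (dr, dc) in moves.items():
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), trace + [name]))
    return None  # unsolvable layouts would be filtered in post-processing

trace = solve_maze(["S..", ".#.", "..G"], (0, 0), (2, 2))
print(trace)  # ['down', 'down', 'right', 'right'] (one shortest path)
```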

C. Dataset Composition:

Reasoning category   Sub-dimensions   Samples per sub-dimension   Example sub-dimensions
Real-world           9                ~4,000                      material, spatial
Game-world           9                ~4,000                      maze, Sudoku

In total, the 18 sub-dimensions contribute 100,421 samples.

All samples are manually or programmatically verified for fidelity, logical coherence, and absence of spurious correlations.

3. Chain-of-Thought Annotation and Role in Training

A central feature of UniREdit-Data-100K is the inclusion of CoT ($C$) supervision. Each CoT provides a detailed, stepwise decomposition of the reasoning process:

  • For real-world edits: $C$ describes how to map $O$ and $I$ to the sequence of visual transformations required to obtain $G$.
  • For game-world tasks: $C$ traces, for example, the moves in Sokoban, the logical deductions in Sudoku, or the step-by-step cell filling in Tic-Tac-Toe.

This explicit CoT annotation supports the "think-then-edit" paradigm used for training unified editing architectures such as Bagel, fostering robust generalization on tasks demanding sequential, symbolic, or spatial reasoning.
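
The control flow of that paradigm, sketched with hypothetical callables (this is not Bagel's actual API, only an illustration of the two-stage conditioning):

```python
def think_then_edit(original_image, instruction, reasoner, editor):
    """'Think-then-edit': first generate the chain of thought C, then
    condition the image generator on (O, I, C). `reasoner` and `editor`
    are hypothetical stand-ins for Bagel's LLM reasoning module and
    diffusion generator."""
    cot = reasoner(original_image, instruction)               # produce C from (O, I)
    edited_image = editor(original_image, instruction, cot)   # produce G from (O, I, C)
    return cot, edited_image
```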

4. Model Training, Objectives, and Implementation

UniREdit-Data-100K underpins the fine-tuning of Bagel—a unified image understanding and editing model leveraging a two-stage architecture: an LLM-based reasoning module and a diffusion-based image generator.

Training objective:

  • CoT text negative log-likelihood:

$$\mathcal{L}_{\rm text} = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t}, O, I\right)$$

  • Diffusion matching for images, conditioned on $(O, I, C)$:

$$\mathcal{L}_{\rm img} = \mathbb{E}_{t\sim\mathcal{U}(0,1)} \left\| u_\theta(z_t, t; O, I, C) - u^\star(z_t, t) \right\|_2^2$$

  • Joint loss:

$$\mathcal{L} = \lambda_{\rm text}\,\mathcal{L}_{\rm text} + \lambda_{\rm img}\,\mathcal{L}_{\rm img}$$
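
A minimal PyTorch-style sketch of this joint objective, assuming a model that exposes token logits for the CoT branch and a velocity prediction for the diffusion branch; all `model.*` method names are hypothetical, not Bagel's interface:

```python
import torch
import torch.nn.functional as F

def joint_loss(model, batch, lambda_text=1.0, lambda_img=1.0):
    """L = lambda_text * L_text + lambda_img * L_img, per the equations
    above. `model` is a hypothetical stand-in for the two-stage architecture."""
    # Text branch: next-token NLL of the CoT tokens y given (O, I).
    logits = model.text_logits(batch["image"], batch["instruction"],
                               batch["cot_tokens"][:, :-1])       # (B, T, V)
    l_text = F.cross_entropy(logits.transpose(1, 2),              # (B, V, T)
                             batch["cot_tokens"][:, 1:])          # (B, T)

    # Image branch: velocity matching at a uniformly sampled time t,
    # conditioned on (O, I, C); z_t is a noised latent of the target G.
    t = torch.rand(batch["latent_g"].size(0), device=batch["latent_g"].device)
    z_t = model.noise_latent(batch["latent_g"], t)
    u_pred = model.velocity(z_t, t, batch["image"], batch["instruction"],
                            batch["cot_tokens"])
    l_img = F.mse_loss(u_pred, model.target_velocity(batch["latent_g"], z_t, t))

    return lambda_text * l_text + lambda_img * l_img
```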

Optimization:

  • Optimizer: Adam
  • Learning rate: cosine decay, 500 warm-up steps, from a peak of $2 \times 10^{-5}$ to a minimum of $1 \times 10^{-6}$
  • 5000 iterations, updating all Bagel weights except the frozen VAE
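
The stated schedule can be reproduced with a standard closed form; a sketch, assuming linear warm-up to the peak followed by cosine decay (the paper does not spell out the warm-up shape):

```python
import math

def lr_at(step, warmup=500, total=5000, peak=2e-5, floor=1e-6):
    """Linear warm-up for `warmup` steps, then cosine decay from `peak`
    to `floor` over the remaining iterations (values from Section 4)."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

# Usage with Adam (as in the text): set the base LR to the peak value and
# wrap with torch.optim.lr_scheduler.LambdaLR(optimizer, lambda s: lr_at(s) / 2e-5)
```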

This training regime directly leverages the $(O, I, C, G, R_t)$ structure to maximize both reasoning fidelity and image realism.

5. Empirical Impact and Benchmark Results

Models trained with UniREdit-Data-100K demonstrate state-of-the-art performance on UniREditBench, particularly when evaluated on tasks requiring multi-step reasoning or cross-modal understanding.

  • UniREdit-Bagel, trained on UniREdit-Data-100K, achieves a weighted overall score of 78.87 on UniREditBench, exceeding GPT-4o by +7.23 points.
  • The largest improvements are observed in game-world reasoning scenarios, where UniREdit-Bagel outperforms the next-best model by +17.08 points and exceeds 95% accuracy on the Sokoban, Maze, Sudoku, and Tic-Tac-Toe tasks.
  • Out-of-distribution generalization is also enhanced:
    • On RISEBench (temporal/causal/spatial/logical reasoning), UniREdit-Bagel obtains 18.3%, a gain of +9.1 points over Bagel-Think and +5.0 over Gemini-2.0-Flash-exp.
    • On KRISBench (knowledge grounding), it scores 65.45% overall, +4.7 over Bagel-Think.

Visual consistency—preservation of regions outside the edit—remains robust, while strategic and adversarial planning (e.g., Pacman, Space Invader) and certain real-world mechanics continue to pose challenges, particularly for open-source models not trained with CoT supervision.

6. Significance and Directions for Future Research

UniREdit-Data-100K serves as the central enabling resource for studying and advancing reasoning-based image editing:

  • By providing task-level diversity spanning spatial, logical, physical, and symbolic transformations, it allows systematic benchmarking of visual reasoning and multi-object manipulation capabilities.
  • The CoT annotation format catalyzes research into LLM-based planning for image transformation, fostering new algorithmic advances in the "think-then-edit" paradigm.
  • Its multimodal structure and game-world coverage pave the way for benchmarking and developing models capable of simulating rule-governed edits, a prerequisite for applications in vision-based planning, robotics, and synthetic agent environments.

Proposed future directions stemming from this work include:

  • Expanding the paradigm to video editing with temporal consistency requirements,
  • Simulating more complex physical phenomena (e.g., fluid mechanics, cloth deformation),
  • Open-sourcing standardized dual-reference evaluators,
  • Integrating symbolic planners or reinforcement learning modules to enhance long-horizon and strategic reasoning.

The methodology and structure established by UniREdit-Data-100K situate it as a cornerstone for the next generation of reasoning-augmented vision editing architectures and evaluation protocols (Han et al., 3 Nov 2025).

References

  • Han et al., 3 Nov 2025.
