CrispEdit-2M Image Editing Dataset

Updated 19 December 2025
  • CrispEdit-2M is a high-resolution image editing dataset comprising 2M annotated example pairs partitioned into seven semantic categories for precise model development.
  • It utilizes a multi-stage pipeline that integrates automated image curation, LLM-driven annotation, dual-pipeline editing, and stringent CLIP-based quality checks.
  • The dataset forgoes additional augmentations beyond resolution normalization, ensuring high spatial fidelity and semantic integrity for robust training and benchmarking.

CrispEdit-2M is a rigorously filtered high-resolution image editing dataset comprising two million annotated example pairs, established as the central training resource for the EditMGT framework. The dataset is partitioned into seven semantically distinct image editing categories and is specifically constructed to support large-scale, high-fidelity, and category-diverse image editing model development and benchmarking. Its construction pipeline integrates automated image curation, annotation with large vision-language models (VLMs), multi-stage instruction synthesis, dual-pipeline editing, and strict semantic quality assessment using CLIP, ensuring both sample integrity and diversity without sacrificing spatial resolution (Chow et al., 12 Dec 2025).

1. Corpus Composition and Resolution Normalization

CrispEdit-2M contains exactly 2,000,000 paired examples, each normalized for high spatial resolution. All source images are resized so that their short side equals 1024 pixels via the scale factor $\alpha = 1024 / \min(H, W)$, followed by $H' = \operatorname{round}(\alpha \cdot H)$ and $W' = \operatorname{round}(\alpha \cdot W)$. After normalization, approximately 60% of samples possess a long edge within the 1280–1665 px range, with a smooth tail extending to 2048 px, ensuring the content is genuinely high resolution. No additional geometric or color augmentations (such as cropping, flipping, or jittering) are reported beyond this mandatory resizing step (Chow et al., 12 Dec 2025).
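The normalization rule can be expressed as a short resizing helper. This is a minimal sketch, assuming Pillow for image handling and a LANCZOS resampling filter, neither of which is specified in the paper.

```python
from PIL import Image

def normalize_resolution(img: Image.Image, target_short: int = 1024) -> Image.Image:
    """Resize so the short side equals target_short, preserving aspect ratio."""
    w, h = img.size                    # PIL reports (width, height)
    alpha = target_short / min(h, w)   # alpha = 1024 / min(H, W)
    new_w = round(alpha * w)           # W' = round(alpha * W)
    new_h = round(alpha * h)           # H' = round(alpha * H)
    # LANCZOS is an assumed resampling filter; the paper does not name one.
    return img.resize((new_w, new_h), Image.LANCZOS)
```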

2. Editing Categories and Internal Statistics

The dataset is partitioned into seven editing categories with the following approximate sample counts:

  • Add new object(s): 300,000
  • Replace object(s): 300,000
  • Remove object(s): 300,000
  • Color alteration: 500,000
  • Background change: 200,000
  • Style transformation: 400,000
  • Motion modification: 34,000

These categories were selected to span a broad range of real-world editing operations, from local semantic compositional edits (adding, replacing, or removing entities) to global restructuring (stylistic and background transformations) and temporal effects (motion modification). Category proportions are visualized in Appendix Fig. A.3 of the reference (Chow et al., 12 Dec 2025).

3. Data Acquisition, Annotation, and Processing Pipeline

The CrispEdit-2M construction pipeline comprises four major stages:

  1. Image Curation: Seed images (~5.5M) are sourced from LAION-Aesthetics (filtered for aesthetic score $> 4.5$), Unsplash-Lite, and JourneyDB. These are restricted to images with short side $\geq 1024$ px and filtered by Qwen3 to eliminate watermarks, text overlays, stickers, and trivial single-object scenes. An additional 0.5M examples from ImgEdit (in the same seven categories) further diversify the seed set.
  2. Instruction Generation: Qwen2.5-VL produces exhaustive image captions describing objects, backgrounds, colors, and relations. GPT-4o rewrites these captions into actionable edit instructions following category-specific templates through a “constrained generation + self-refinement” process; each category leverages its own in-context exemplars, and new pairs are recycled as exemplars to improve linguistic variability.
  3. Edited-Image Synthesis: Two open-source edit pipelines—FLUX.1-Kontext.dev and Step1X-Edit v1.2—generate outputs for each (source, instruction) pair. A lightweight vision–language reranker (InternVL2.5-MPO) then scores both, retaining the higher-ranking output.
  4. Quality Assurance: Quality is enforced in two phases:
    • Instruction Validation: An LLM reviews instructions to reject off-target or illogical requests (e.g., nonsensical edits or attribute mismatches with category intent).
    • Edited Image Verification: CLIP-based thresholds are imposed for semantic edit alignment ($f_{\mathrm{CLIP}}(I_{\mathrm{edit}}, \mathrm{instruction}) \geq \tau_1$) and preservation of non-target content ($f_{\mathrm{CLIP}}(I_{\mathrm{source}}, I_{\mathrm{edit}}) \geq \tau_2$). Both thresholds are empirically set to balance edit fidelity and diversity (Chow et al., 12 Dec 2025).
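The two verification checks can be sketched as follows. This is a minimal illustration assuming the Hugging Face transformers CLIP implementation; the specific CLIP backbone and the numeric values of $\tau_1$ and $\tau_2$ are placeholders, since the paper only states that the thresholds are set empirically.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Backbone choice is an assumption; the paper does not name a specific CLIP variant.
MODEL_ID = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def image_embedding(img: Image.Image) -> torch.Tensor:
    feats = model.get_image_features(**processor(images=img, return_tensors="pt"))
    return torch.nn.functional.normalize(feats, dim=-1)

@torch.no_grad()
def text_embedding(text: str) -> torch.Tensor:
    feats = model.get_text_features(
        **processor(text=[text], return_tensors="pt", truncation=True)
    )
    return torch.nn.functional.normalize(feats, dim=-1)

def passes_verification(source: Image.Image, edited: Image.Image, instruction: str,
                        tau1: float = 0.25, tau2: float = 0.80) -> bool:
    """Apply the two CLIP checks; the tau1/tau2 values here are illustrative only."""
    edit_emb = image_embedding(edited)
    alignment = (edit_emb @ text_embedding(instruction).T).item()   # edit vs. instruction
    preservation = (edit_emb @ image_embedding(source).T).item()    # edit vs. source
    return alignment >= tau1 and preservation >= tau2
```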

4. Dataset Utilization and Splitting Regimen

CrispEdit-2M is allocated exclusively as training data for EditMGT; no validation or test partition is carved from the dataset. Instead, model evaluation is conducted on standard external benchmarks such as EmuEdit, MagicBrush, AnyBench, and GEdit-EN-full. Within the three-stage EditMGT training procedure, the top 12% of samples (ranked by LAION aesthetic score) are reused for high-quality fine-tuning in Stage 3. No separate numerical metrics (e.g., PSNR) are reported over the raw CrispEdit-2M corpus itself (Chow et al., 12 Dec 2025).
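The Stage-3 subset selection amounts to a simple ranking-and-truncation step. The sketch below assumes each training record already carries a LAION aesthetic score; the field name aesthetic_score is illustrative, not taken from the paper.

```python
def select_stage3_subset(samples: list[dict], fraction: float = 0.12) -> list[dict]:
    """Keep the top `fraction` of records, ranked by LAION aesthetic score."""
    ranked = sorted(samples, key=lambda s: s["aesthetic_score"], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```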

5. Aesthetic, Semantic, and Preservation Quality Metrics

CrispEdit-2M integrates both objective and semantic quality constraints:

  • Aesthetic Filtering: All source images must satisfy $\mathrm{Aesthetic}(I_{\mathrm{source}}) \geq 4.5$, as predicted by the LAION-Aesthetics model.
  • Semantic Alignment: The CLIP cosine similarity between the edited image and the textual instruction, $f_{\mathrm{CLIP}}(I_{\mathrm{edit}}, \mathrm{instruction})$, must exceed the threshold $\tau_1$.
  • Preservation Check: The CLIP similarity between the source and edited images, $f_{\mathrm{CLIP}}(I_{\mathrm{source}}, I_{\mathrm{edit}})$, must also exceed the threshold $\tau_2$ to avoid spurious changes to non-target regions.

Thresholds $\tau_1$ and $\tau_2$ are empirically selected to optimize the tradeoff between enforcing fidelity to the requested edit and maximizing overall dataset variety and realism. No additional numeric metrics, such as PSNR, are reported for the corpus (Chow et al., 12 Dec 2025).
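Taken together, these constraints define a per-sample acceptance rule. The conjunction below simply restates the filters above, with the edit instruction written as $c$ and $\tau_1$, $\tau_2$ left symbolic because their values are not reported:

$$\mathrm{accept}(I_{\mathrm{source}}, I_{\mathrm{edit}}, c) \iff \mathrm{Aesthetic}(I_{\mathrm{source}}) \geq 4.5 \;\wedge\; f_{\mathrm{CLIP}}(I_{\mathrm{edit}}, c) \geq \tau_1 \;\wedge\; f_{\mathrm{CLIP}}(I_{\mathrm{source}}, I_{\mathrm{edit}}) \geq \tau_2$$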

6. Annotation Diversity and Post-Processing

No geometric or color jittering augmentations are applied beyond resolution normalization. During the base-model recaptioning (Stage 1), each image receives three alternative InternVL2.5-MPO annotations to enhance linguistic diversity, with one selected at random per training step. The process does not utilize synthetic cropping or flipping. This approach preserves high-fidelity spatial representation and semantic consistency across the dataset (Chow et al., 12 Dec 2025).
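The per-step caption sampling in Stage 1 can be sketched as below. This is a minimal illustration; the record layout (a captions list holding the three InternVL2.5-MPO annotations) is an assumed storage format, not one specified in the paper.

```python
import random

def sample_training_caption(record: dict, rng: random.Random) -> str:
    """Pick one of the alternative captions uniformly at random per training step."""
    # record["captions"] is an illustrative field name for the three annotations.
    return rng.choice(record["captions"])
```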

7. Context and Significance in Training Large-Scale Editors

CrispEdit-2M is distinguished by its combination of large data volume, high spatial fidelity (short side $\geq 1024$ px), rigorous per-example semantic alignment, and diversity of edit operations. Its end-to-end construction pipeline (spanning source curation, dual-stage LLM-driven annotation, text-to-edit transformation with two "expert" models, VLM-based reranking, and dual-phase semantic verification) yields a dataset suited to training modern and future high-resolution generative and editing models. A plausible implication is that this level of sample curation and annotation detail will become standard for future image editing datasets aiming to balance quality, diversity, and task control (Chow et al., 12 Dec 2025).

CrispEdit-2M thus establishes a reference point for scalable, semantically granular image editing datasets, supporting both practical model development and rigorous benchmarking within the image manipulation research community.
