X2Edit: Unified Instruction Image Editing
- X2Edit is a unified dataset designed for arbitrary-instruction image editing, combining source images, edited targets, text instructions, and explicit task labels.
- It employs a mostly automatic pipeline with multi-stage quality control and balancing strategies to generate 3.7M editing pairs across 14 distinct tasks.
- The dataset uses vision-language models for instruction generation and task-specific expert models for image synthesis; its task labels support task-aware contrastive learning, and its quality is assessed with multiple automatic judges.
The X2Edit Dataset is a large-scale, self-constructed corpus for arbitrary-instruction image editing introduced in "X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning" (Ma et al., 11 Aug 2025). It is organized around editing examples built from a realistic source image, a natural-language edit instruction, and a resulting edited or target image, with explicit task labels designed for unified training across heterogeneous editing tasks. The dataset was constructed to address limitations of existing open-source editing corpora, which are described as task-specific, imbalanced across tasks, low quality or distribution-shifted, and weak in complex tasks such as reasoning edits, camera movement, style transfer, and subject-driven generation. X2Edit accordingly emphasizes a single, unified, mostly automatic pipeline, category balance, real-world source images, and explicit support for task-aware representation learning.
1. Scope, identity, and design goals
X2Edit is explicitly positioned as a dataset for unified arbitrary-instruction image editing rather than a collection of isolated task-specific benchmarks. Its basic sample structure comprises a source image, an edited image, a text instruction, and a task label. The paper further characterizes the dataset as more than “image pairs + prompts”: it is task-aware, with task labels baked into both data generation and downstream training, including contrastive loss over task types and MoE routing (Ma et al., 11 Aug 2025).
The dataset is designed around four stated objectives. First, it uses a single, unified, mostly automatic pipeline across 14 editing tasks. Second, it aims for category balance through explicit balancing of instruction types. Third, it prioritizes real-world source images from COYO-700M, Wukong, and LAION-5B, while using carefully generated references only where necessary, particularly for subject-driven generation. Fourth, it explicitly constructs large volumes of complex tasks, including reasoning, camera movement, subject-driven generation, and style transfer.
This design suggests a dual role. At the data level, X2Edit serves as a large corpus for model fitting and benchmarking. At the representation level, its task labels are intended to provide inductive structure for architectures that exploit task-conditioned specialization.
2. Scale, resolution, and task taxonomy
The reported overall scale is 3.7M editing pairs. The release is divided into two resolution subsets:
| Subset | Size | Resolution |
|---|---|---|
| X2Edit(512) | 2.0M pairs | 512×512 |
| X2Edit(1024) | 1.7M pairs | “∼1024” with various aspect ratios |
Each sample is described as a pair (source image, edited image) with associated text instruction and task label. Source images satisfy a minimum-side constraint greater than 512 pixels, and aspect ratios are maintained with min/max sides in [512, 2048]. The edited images are distributed across the 512 and ~1024 subsets, with multi-AR support emphasized especially for the high-fidelity subset (Ma et al., 11 Aug 2025).
The 14 primary editing tasks span local editing, global or appearance editing, complex reasoning or geometric editing, and subject-driven generation:
- Background Change
- Color Change
- Material Change
- Action Change
- Subject Addition
- Subject Deletion
- Subject Replacement
- Text Change
- Portrait Editing
- Style Change
- Tone Transform
- Reasoning Editing
- Camera Movement
- Subject-Driven Generation (incl. reference-style transfer)
The paper reports particularly large volumes for several complex categories: 342K reasoning edits, 94K camera-movement edits, and 460K subject-driven generation samples. At training time, an additional “other” category is included for zero-shot tasks, although the 3.7M figure corresponds to the 14 primary tasks.
A central property of the dataset is category balance. Editing instructions are generated with a load-balancing strategy that dynamically adjusts instruction-type sampling so that the final corpus approaches even coverage across task labels. A plausible implication is that X2Edit is intended not only to broaden task diversity, but also to reduce the dominance of easy or highly automatable edit categories that often distort large web-scale editing corpora.
3. Construction pipeline and task-specific generation
The construction workflow has four stages: source image preparation, editing instruction generation, edited image construction, and quality evaluation and filtering (Ma et al., 11 Aug 2025).
Source image preparation
For general editing tasks, X2Edit draws source images from COYO-700M, Wukong, and LAION-5B to keep the distribution close to real photos and web data. Images are pre-filtered by a high aesthetic score and by minimum side length greater than 512 pixels. For subject-driven generation, the pipeline additionally uses an internal query dataset, whose text prompts are filtered by Qwen3 to ensure the presence of foreground-subject keywords before generating reference images with Shuttle-3-Diffusion.
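The pre-filter described above can be sketched as a simple predicate. This is a minimal illustration, not the paper's implementation: the `CandidateImage` record, the `passes_prefilter` name, and the aesthetic threshold value are assumptions, since the text only states that a "high aesthetic score" and a minimum side greater than 512 pixels are required.

```python
from dataclasses import dataclass


@dataclass
class CandidateImage:
    """Hypothetical record for one candidate source image."""
    width: int
    height: int
    aesthetic_score: float  # output of some aesthetic predictor (assumed field)


# Assumed value: the source text does not give the actual threshold.
AESTHETIC_THRESHOLD = 6.0
MIN_SIDE = 512


def passes_prefilter(img: CandidateImage,
                     aesthetic_threshold: float = AESTHETIC_THRESHOLD) -> bool:
    """Keep images whose shorter side exceeds 512 px and whose aesthetic score is high enough."""
    shorter_side = min(img.width, img.height)
    return shorter_side > MIN_SIDE and img.aesthetic_score >= aesthetic_threshold
```

An image of 512×2048 pixels is rejected under this sketch because its shorter side is not strictly greater than 512.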
Instruction generation via VLM
A key design choice is the use of Qwen2.5-VL-7B to generate editing instructions directly from the image and task definitions, rather than deriving instructions from captions. The model is prompted with the source image, a list of task definitions, and contextual examples per task type, and it is required to output 10 editing instructions appropriate to the given image, selecting only tasks that make sense for that content. During this stage, the instruction cache and task counts are used to dynamically adjust sampling weights, producing a near-balanced distribution across the 14 task labels.
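One way to realize dynamic rebalancing of this kind is inverse-frequency sampling over running task counts. The sketch below is an assumption about how such a scheme could look; the paper's exact weighting rule is not given in the text, and the function names and the `smoothing` parameter are hypothetical.

```python
import random
from collections import Counter


def sampling_weights(counts, tasks, smoothing=1.0):
    """Inverse-frequency weights: tasks with fewer accepted instructions get more weight."""
    raw = {t: 1.0 / (counts.get(t, 0) + smoothing) for t in tasks}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}


def pick_task(candidates, counts, rng):
    """Sample one task among those judged applicable to the current image."""
    weights = sampling_weights(counts, candidates)
    return rng.choices(candidates, weights=[weights[t] for t in candidates], k=1)[0]


if __name__ == "__main__":
    counts = Counter({"Color Change": 9, "Camera Movement": 0})
    rng = random.Random(0)
    for _ in range(5):
        task = pick_task(["Color Change", "Camera Movement"], counts, rng)
        counts[task] += 1  # update the running count after each accepted instruction
    print(dict(counts))
```

Because the weights are recomputed from the running counts, under-represented tasks are favored until the counts even out.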
Edited image construction
The edited-image stage is explicitly task-specific.
For subject addition and subject deletion, the pipeline uses RAM++ for tagging, Grounding DINO for object boxes, SAM2 for masks, and LaMa for inpainting or deletion. Masks are filtered to an area ratio in [2%, 35%] of the image, and prompts from RAM tags are filtered to length greater than 15.
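The mask area constraint can be illustrated with a small check. This is a sketch under stated assumptions: the mask is represented as rows of 0/1 values rather than an actual SAM2 output, and the helper names are hypothetical.

```python
def mask_area_ratio(mask):
    """Fraction of pixels set in a binary mask given as rows of 0/1 values."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("empty mask")
    covered = sum(1 for row in mask for value in row if value)
    return covered / total


def keep_mask(mask, low=0.02, high=0.35):
    """Keep masks covering between 2% and 35% of the image, as stated in the text."""
    return low <= mask_area_ratio(mask) <= high
```

Very small masks tend to produce imperceptible edits and very large ones remove most of the scene, which is a plausible motivation for bounding the ratio on both sides.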
For “normal” editing tasks such as background change, color change, and portrait editing, the source image and instruction are passed to Step1X-Edit.
For subject-driven generation, the workflow samples image descriptions from the internal query dataset, filters them with Qwen, generates clean reference subject images with Shuttle-3-Diffusion, expands the reference description into a richer editing prompt, and then uses Kontext to generate the final target image.
For style transfer, the pipeline samples two textual descriptions from the internal query data, generates a reference style image from one with Shuttle-3-Diffusion, uses the other as a content prompt, and then employs Kontext to synthesize an image that retains the reference style while following the content description.
For style change, Qwen2.5-VL-7B selects or instantiates a style from a predefined list such as Ghibli, illustration, or oil painting, forms an instruction of the form “Convert the image to a [Style],” and OmniConsistency produces the edited image.
For text change, the pipeline uses OCR and region detection to identify scene text and bounding boxes, selects the largest text region with area ratio at least 1%, constrains the target text length to differ from the source by at most 3 characters, generates a binary mask for the selected region, and feeds the source image plus mask to TextFlux, a DiT specialized for scene text synthesis. Qwen2.5-VL-7B is then used again to verify that the target text was successfully placed.
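The two selection rules for text change (largest region with area ratio at least 1%, and a target length within 3 characters of the source) can be sketched as follows. The `TextRegion` record and function names are hypothetical; OCR, mask generation, and the TextFlux call itself are outside this illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class TextRegion:
    """Hypothetical OCR result: recognized text and its box (x0, y0, x1, y1)."""
    text: str
    box: Tuple[int, int, int, int]


def box_area(box):
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def select_text_region(regions: Sequence[TextRegion], image_w: int, image_h: int,
                       min_ratio: float = 0.01) -> Optional[TextRegion]:
    """Return the largest text region if it covers at least min_ratio of the image."""
    if not regions:
        return None
    largest = max(regions, key=lambda r: box_area(r.box))
    ratio = box_area(largest.box) / (image_w * image_h)
    return largest if ratio >= min_ratio else None


def target_text_ok(source_text: str, target_text: str, max_diff: int = 3) -> bool:
    """Accept target text whose length differs from the source by at most max_diff characters."""
    return abs(len(source_text) - len(target_text)) <= max_diff
```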
For reasoning and camera movement, the paper states that Step1X-Edit was found insufficient; instead, Bagel or GPT-4o plus Kontext are used, especially in the 1024-resolution subset. For X2Edit-1024, higher-resolution source images and the combination of GPT-4o and Kontext are associated with high-fidelity edits and strong image quality metrics.
The overall pipeline is therefore not monolithic. It is unified at the level of data specification and balancing, but deliberately heterogeneous at the level of expert edit synthesis.
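The heterogeneous routing can be summarized as a lookup from task label to generator stack. The table below is only a sketch built from the assignments named in this section; the dictionary layout and the `generators_for` helper are assumptions, and tasks the text does not explicitly assign are deliberately left unmapped rather than guessed.

```python
# Each value is a tuple of alternative stacks; each stack is a tuple of components.
DETECT_AND_INPAINT = (("RAM++", "Grounding DINO", "SAM2", "LaMa"),)
STEP1X = (("Step1X-Edit",),)
COMPLEX = (("Bagel",), ("GPT-4o", "Kontext"))

TASK_TO_GENERATORS = {
    "Subject Addition": DETECT_AND_INPAINT,
    "Subject Deletion": DETECT_AND_INPAINT,
    "Background Change": STEP1X,
    "Color Change": STEP1X,
    "Portrait Editing": STEP1X,
    "Subject-Driven Generation": (("Qwen3", "Shuttle-3-Diffusion", "Kontext"),),
    "Style Change": (("Qwen2.5-VL-7B", "OmniConsistency"),),
    "Text Change": (("OCR", "TextFlux", "Qwen2.5-VL-7B"),),
    "Reasoning Editing": COMPLEX,
    "Camera Movement": COMPLEX,
}


def generators_for(task):
    """Return the alternative generator stacks for a task, or None if the text does not assign one."""
    return TASK_TO_GENERATORS.get(task)
```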
4. Annotation structure and task-aware organization
The dataset’s implied schema includes source_image, target_image, instruction, and task_type, with possible additional metadata such as image resolution or aspect ratio, construction-pipeline tags, and filter scores including aesthetics, LIQE, CLIPIQA, ClipScore, and VLM-derived scores (Ma et al., 11 Aug 2025). The paper does not mention per-pixel annotations or masks being stored in the dataset, even though masks are used internally during generation for tasks such as inpainting.
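The implied schema can be written down as a small record type. This is an illustrative sketch only: the four core field names come from the text, while the storage types, the optional metadata fields, the "Other" label string, and the `validate` helper are assumptions rather than the released format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

TASK_TYPES = [
    "Background Change", "Color Change", "Material Change", "Action Change",
    "Subject Addition", "Subject Deletion", "Subject Replacement", "Text Change",
    "Portrait Editing", "Style Change", "Tone Transform", "Reasoning Editing",
    "Camera Movement", "Subject-Driven Generation",
    "Other",  # training-time category for zero-shot tasks (label string assumed)
]


@dataclass
class EditSample:
    source_image: str  # path or URI (assumed representation)
    target_image: str
    instruction: str
    task_type: str
    resolution: Optional[Tuple[int, int]] = None                   # optional metadata
    filter_scores: Dict[str, float] = field(default_factory=dict)  # e.g. "liqe", "clipiqa"


def validate(sample: EditSample) -> List[str]:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    if not sample.instruction.strip():
        problems.append("empty instruction")
    if sample.task_type not in TASK_TYPES:
        problems.append(f"unknown task_type: {sample.task_type}")
    return problems
```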
A distinctive property of X2Edit is the presence of explicit task labels. The 14 editing tasks, together with the training-time “other” category, are encoded as integer labels, one per category (15 values in total).
These labels are not merely descriptive metadata. They are used during model training to drive task embeddings and positive/negative grouping in task-aware contrastive learning. The task embedding matrix holds one embedding vector per task label, so that each integer label selects a row.
Within the MoE gating formulation, the intermediate representation is concatenated with the task embedding and passed to a gating network to obtain expert scores. The paper further defines top-$k$ expert selection and combines expert outputs for $Q$, $K$, and $V$ with a shared expert before attention.
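Task-conditioned gating of this kind can be sketched in a few lines. The code below is an assumption-laden illustration, not the paper's formulation: the linear gate, the softmax before top-k selection, the renormalization of selected weights, and the additive combination with the shared expert are all choices made here for concreteness, and every function name is hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def gate_scores(hidden, task_embedding, gate_weights):
    """Linear gate over the concatenation [hidden; task_embedding]; one weight row per expert."""
    x = list(hidden) + list(task_embedding)
    logits = [sum(w * v for w, v in zip(row, x)) for row in gate_weights]
    return softmax(logits)


def top_k_experts(scores, k):
    """Indices and renormalized weights of the k highest-scoring experts."""
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in idx)
    return [(i, scores[i] / total) for i in idx]


def moe_output(x, experts, shared_expert, selected):
    """Shared expert output plus the weighted sum of the selected expert outputs."""
    out = list(shared_expert(x))
    for i, w in selected:
        y = experts[i](x)
        out = [o + w * v for o, v in zip(out, y)]
    return out
```

In this sketch the task embedding influences only the routing decision; the experts themselves see the unmodified input `x`.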
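The dataset also underpins a task-aware InfoNCE loss. After flattening and L2-normalizing the per-sample features $z_i$, pairwise squared Euclidean distances are computed as
$$D_{ij} = \lVert z_i - z_j \rVert_2^2,$$
with a task mask
$$M_{ij} = \mathbb{1}[t_i = t_j],$$
where $t_i$ denotes the integer task label of sample $i$. The task-aware contrastive objective is then an InfoNCE loss over these distances in which pairs selected by the mask act as positives and the remaining pairs act as negatives, and the full training objective combines this contrastive term with the editing model's main training loss.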
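A task-aware InfoNCE loss of the kind described above can be sketched in plain Python. This is one common way to write such a loss and is an assumption, not the paper's exact objective: the temperature value, the use of negative squared distance as the logit, the summation over all same-task positives, and the skipping of anchors without positives are illustrative choices.

```python
import math


def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def task_aware_info_nce(features, task_labels, temperature=0.1):
    """Mean over anchors of -log(sum over same-task pairs / sum over all other samples)."""
    z = [l2_normalize(f) for f in features]
    n = len(z)
    losses = []
    for i in range(n):
        # Smaller distance means larger logit; the anchor itself is excluded.
        logits = {j: -sq_dist(z[i], z[j]) / temperature for j in range(n) if j != i}
        positives = [j for j in logits if task_labels[j] == task_labels[i]]
        if not positives:
            continue  # anchor has no same-task partner in this batch
        m = max(logits.values())  # subtract the max for numerical stability
        denom = sum(math.exp(v - m) for v in logits.values())
        numer = sum(math.exp(logits[j] - m) for j in positives)
        losses.append(-math.log(numer / denom))
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(task_aware_info_nce(feats, [0, 0, 1, 1]))  # features cluster by task
    print(task_aware_info_nce(feats, [0, 1, 0, 1]))  # labels disagree with the clusters
```

A batch in which no task appears twice contributes no positives at all in this sketch, which illustrates why per-batch task coverage matters for a loss of this form.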
This organization indicates that X2Edit should be understood as a task-labeled editing dataset rather than a generic text-image pair corpus. A plausible implication is that its balancing strategy is structurally important for training, because contrastive mini-batches require repeated positives and negatives across multiple task types.
5. Quality control, automatic evaluation, and comparative performance
The final filtering stage is explicitly multi-metric. For generic image quality, edited images are scored with an aesthetic predictor, LIQE, and CLIPIQA, and samples below tuned thresholds are discarded (Ma et al., 11 Aug 2025).
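Threshold filtering over several quality metrics can be sketched as a conjunction of per-metric checks. The threshold values and metric keys below are placeholders chosen for illustration; the text says only that the thresholds were tuned and does not report them.

```python
# Placeholder thresholds: the actual tuned values are not given in the text.
THRESHOLDS = {"aesthetic": 5.0, "liqe": 3.0, "clipiqa": 0.5}


def passes_quality_filter(scores, thresholds=THRESHOLDS):
    """Keep a sample only if every metric is present and meets its threshold."""
    return all(name in scores and scores[name] >= t for name, t in thresholds.items())
```

Treating a missing score as a failure is a conservative choice made in this sketch; a pipeline could equally recompute the missing metric instead.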
For edit correctness and alignment, the pipeline uses multiple judges. ImgEdit-Judge is used to score alignment with instruction and preservation of irrelevant regions. Qwen2.5-VL-72B follows the ImgEdit-Judge protocol to compute $S_1$ for success in following the instruction, $S_2$ for absence of unintended changes, and $\text{Overall} = S_1 + S_2$. For broader evaluation and comparison, GPT-4o is used to compute VIEScore, consisting of Semantic Consistency (SC), Perceptual Quality (PQ), and an overall score
$$\text{Overall} = \sqrt{\text{SC} \times \text{PQ}}.$$
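The two aggregate scores can be computed as shown in this small sketch. The additive rule for the ImgEdit-Judge protocol is stated in the text; the geometric-mean rule for VIEScore follows the usual VIEScore convention and is treated here as an assumption. Function names are hypothetical.

```python
import math


def imgedit_overall(s1: float, s2: float) -> float:
    """Overall = S_1 + S_2 (instruction following plus absence of unintended changes)."""
    return s1 + s2


def vie_overall(sc: float, pq: float) -> float:
    """Geometric mean of Semantic Consistency and Perceptual Quality (assumed convention)."""
    return math.sqrt(sc * pq)
```

A geometric mean penalizes imbalance: an edit with SC = 9 and PQ = 4 scores 6.0, lower than the arithmetic mean of 6.5.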
The same Qwen2.5-VL-7B model used for instruction generation is also used for self-reflection on instruction sanity. Given a source image, edited image, and instruction, it outputs two scores according to criteria specified in the prompt.
Samples with low scores are discarded to reduce hallucinated or infeasible instructions.
Additional filters are task-specific. For subject-driven generation, CLIP-based similarity and DINO features between reference and edited images are used to measure subject consistency. For style transfer, Qwen2.5-VL-7B rates whether the generated image matches the reference style.
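A feature-similarity check for subject consistency can be sketched with cosine similarity over precomputed embeddings. The threshold values, the requirement that both checks pass, and the function names are assumptions; extracting the CLIP and DINO features is outside this illustration.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        raise ValueError("zero-norm feature vector")
    return dot / (na * nb)


def subject_consistent(clip_ref, clip_edit, dino_ref, dino_edit,
                       clip_min=0.8, dino_min=0.6):
    """Require both similarities to clear their (hypothetical) thresholds."""
    return (cosine_similarity(clip_ref, clip_edit) >= clip_min
            and cosine_similarity(dino_ref, dino_edit) >= dino_min)
```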
At the dataset level, the paper compares X2Edit with AnyEdit, HQ-Edit, UltraEdit, SEED-Data-Edit, ImgEdit, and OmniEdit. On 1K-sample evaluations, X2Edit(512) reports Qwen/ImgEdit overall 7.77, ImgEdit-Judge overall 9.17, and GPT-4o overall 5.87. X2Edit(1024) reports Qwen/ImgEdit overall 8.08, ImgEdit-Judge overall 9.93, and GPT-4o overall 6.33. The paper states that the high-resolution subset is best or near-best on aesthetic and LIQE/CLIPIQA measures.
The paper also reports pipeline-level comparisons for several tasks using the same source images and instructions across four generators: the X2Edit pipeline, Step1X-Edit, Bagel, and Kontext. For subject deletion, the reported GPT-4o VIEScore values are 7.197 for X2Edit data, 6.300 for Step1X-Edit, 6.980 for Bagel, and 6.402 for Kontext. The text states that the X2Edit pipeline generally wins or is competitive for style change and text change as well.
These results support two related claims in the source paper: first, that X2Edit is not only large and diverse but also high quality by multiple automatic judges; second, that its task-specific generation strategy is materially relevant to final data quality.
6. Intended uses, ecosystem position, and limitations
The stated intended uses include training unified arbitrary-instruction image editing models, fine-tuning FLUX.1-based DiT backbones with plug-and-play editing modules, training multi-task editing systems spanning local, global, subject-driven, and complex tasks, and serving as a benchmark corpus over 14 task types (Ma et al., 11 Aug 2025).
The paper explicitly frames the dataset as compatible with FLUX.1 and similar DiTs. It also reports plug-and-play use of the trained editing module with FLUX.1-Schnell, Shuttle-3-Diffusion, PixelWave, and FLUX.1-Krea-dev, as well as with community LoRAs such as FLUX-Super-Realism and FLUX-Midjourney-Mix2. Because the dataset includes subject-driven generation, text editing, style transfer, and style change, it is described as suitable for training modular adapters such as IP-Adapter-like and UNIC-Adapter-like systems.
Within the broader open ecosystem, X2Edit is contrasted with several other datasets. The paper describes AnyEdit as 2.5M, 25 types, 512 px; HQ-Edit as 197K, 6 types, ≥768 px; UltraEdit as 4M, 9 types, 512 px; SEED-Data-Edit as 3.7M, 6 types, 768 px; ImgEdit as 1.2M, 13 types, ≥1280 px; and OmniEdit as 5.2M/1.2M, 7 types, ≥512 px. X2Edit’s claimed advantages are task coverage across 14 tasks, explicit category balancing, multi-stage quality control, heavy use of real-world source images, and a task-aware design aligned with MoE-LoRA and contrastive learning.
Several limitations and curation considerations are also apparent. The paper does not specify a public license in the text, directing readers instead to the GitHub repository. It does not mention official train/validation/test splits for the dataset itself; evaluation is performed on separate benchmarks such as GEdit-Bench++, ImgEdit-Bench, AnyEdit-Test, KontextBench, and DreamBench. It also does not provide an explicit misuse mitigation plan, despite the general risk that realistic image-editing datasets can support systems capable of deepfakes or misinformation.
A further point of nomenclature is worth recording. The term X2Edit Dataset in (Ma et al., 11 Aug 2025) refers to the arbitrary-instruction image editing corpus described above and should not be conflated with X2DB, the experimental 2D materials database introduced in "Large-scale Integration of Experimental and Computational Data for 2D Materials" (Akhound et al., 5 Mar 2026), where “X2Edit” is explicitly stated not to be a defined term. This distinction matters because the two works are unrelated despite the partially overlapping naming convention.
The public access point given for code, checkpoints, and datasets is:
https://github.com/OPPO-Mente-Lab/X2Edit