I2EBench2.0: Image Editing Benchmark
- I2EBench2.0 is a comprehensive evaluation framework for instruction-based image editing that assesses both semantic correctness and low-level restoration quality.
- It uniquely introduces multi-round evaluation to track iterative edits, revealing error propagation and robustness across 7 distinct dimensions.
- The benchmark integrates automated metrics (GPT‑4V, CLIP, SSIM) with extensive human studies to provide actionable insights for advancing image editing models.
I2EBench2.0 is a comprehensive benchmark for instruction-based image editing (IIE), defined as the task in which a model takes an input image and a natural-language instruction and produces an edited image that follows the instruction while preserving irrelevant content and maintaining visual quality. The benchmark extends the original NeurIPS’24 I2EBench by introducing joint evaluation of single-round and multi-round editing, broadening the evaluation space to 16 single-round dimensions and 7 multi-round dimensions, and grounding its automatic scoring procedures in human studies and cross-judge analysis (Ma et al., 14 Jun 2026). Within the benchmark’s framing, the central difficulty of IIE evaluation is that instructions are heterogeneous, the edits range from high-level semantic operations to low-level restoration, a single scalar metric cannot capture all relevant properties, and interactive editing requires robustness across sequences of edits rather than only one-shot transformations (Ma et al., 14 Jun 2026, Ma et al., 2024).
1. Origins, scope, and motivation
I2EBench2.0 was created to address a specific evaluation gap in instruction-based image editing. Prior work had notable limitations: small datasets, limited edit types, focus on mask-guided or otherwise narrow protocols, very limited or no multi-round evaluation, and dependence on generic metrics such as CLIP, PSNR, SSIM, and LPIPS applied uniformly across disparate editing tasks (Ma et al., 14 Jun 2026, Ma et al., 2024). In contrast, I2EBench2.0 is explicitly designed as a benchmark for mask-free instruction-based editing that evaluates both semantic correctness and low-level image quality using dimension-specific automatic metrics rather than a single universal score (Ma et al., 14 Jun 2026).
The benchmark inherits the core idea of the original "I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing" (Ma et al., 2024) and expands it into a framework that simultaneously evaluates single-round and multi-round editing (Ma et al., 14 Jun 2026). A round is one editing step: single-round editing applies one instruction to the original image, whereas multi-round editing applies a sequence of 2–5 instructions, each operating on the previous output (Ma et al., 14 Jun 2026). This distinction matters because practical editing workflows are iterative, and failure in any intermediate step can propagate forward. The benchmark therefore treats multi-round editing not as repeated single-round testing, but as a separate regime with its own success criterion and its own dimensions (Ma et al., 14 Jun 2026).
The benchmark is intended both as an evaluation tool and as what may be called an “analysis benchmark” (Editor’s term): it measures current performance and also exposes systematic weaknesses of existing IIE models, especially under multi-round interaction (Ma et al., 14 Jun 2026). The paper explicitly characterizes it as both a benchmark and a research probe, using comparative evaluation across strong baseline systems to derive guidance for future model design (Ma et al., 14 Jun 2026).
2. Dataset construction and benchmark composition
Each evaluation sample consists of an input image, one or more natural-language instructions, and the output of an IIE model, after which I2EBench2.0 computes dimension-specific scores for instruction following, preservation of non-target regions, low-level quality, and multi-round robustness (Ma et al., 14 Jun 2026). The dataset contains over 2,000 images and over 6,700 editing instructions (Ma et al., 14 Jun 2026). For the 16 single-round dimensions, the paper reports approximately 140 images per dimension, yielding roughly $16 \times 140 = 2{,}240$ image-dimension instances (Ma et al., 14 Jun 2026, Ma et al., 2024).
Images are drawn from public datasets covering both general scenes and restoration settings. The benchmark explicitly mentions MS COCO for natural scenes and objects, as well as task-specific sources for deblurring, dehazing, snow, rain, shadow, low-light, and watermark removal (Ma et al., 14 Jun 2026). The predecessor benchmark also lists Berkeley segmentation, GoPro, HIDE, Dense-Haze, Haze4K, LOL, LHP-Rain, CSD, SRD, and CLWD / WDNet among the sources used for the 16-dimensional framework (Ma et al., 2024). The resulting coverage spans animals, objects, plants, humans, scenery, and global edits (Ma et al., 14 Jun 2026, Ma et al., 2024).
For each image-dimension pair, human annotators first write an original instruction. To increase linguistic diversity, the benchmark then uses ChatGPT to produce diverse paraphrases that preserve intent while varying syntax and vocabulary (Ma et al., 14 Jun 2026, Ma et al., 2024). Each instruction is additionally assigned one of six content categories: Animal, Object, Scenery, Plant, Human, or Global (Ma et al., 14 Jun 2026). This category labeling supports downstream analysis of category-dependent performance differences.
Multi-round annotations are provided for each applicable high-level dimension except Region Accuracy, with 2–5 instructions per image (Ma et al., 14 Jun 2026). The benchmark’s design choice to use atomic instructions in single-round settings and composition through multi-round sequences is deliberate. This suggests that metric reliability and interpretability were prioritized over one-shot compositional complexity, while still approximating realistic editing workflows via sequential interaction (Ma et al., 14 Jun 2026).
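The annotation structure described above can be pictured as a per-sample record. The following is a minimal sketch, not the released schema: the class name `EditSample` and all field names are hypothetical, chosen only to illustrate the six content categories and the 2–5 instruction limit on multi-round sequences.

```python
from dataclasses import dataclass, field
from typing import List

# The six content categories named in the benchmark description.
CATEGORIES = {"Animal", "Object", "Scenery", "Plant", "Human", "Global"}


@dataclass
class EditSample:
    """Hypothetical record layout; field names are illustrative only."""
    image_path: str
    dimension: str
    category: str
    original_instruction: str
    diverse_instruction: str  # paraphrase of the original instruction
    multi_round_instructions: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        n = len(self.multi_round_instructions)
        # A multi-round sequence, when present, holds 2 to 5 instructions.
        if n != 0 and not 2 <= n <= 5:
            raise ValueError(f"multi-round sequences need 2-5 instructions, got {n}")

    @property
    def is_multi_round(self) -> bool:
        return len(self.multi_round_instructions) >= 2
```

Validating the category and sequence length at construction time keeps malformed annotations from reaching the evaluators.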
3. Evaluation dimensions and metrics
The benchmark is organized into 8 high-level single-round dimensions, 8 low-level single-round dimensions, and 7 multi-round high-level dimensions (Ma et al., 14 Jun 2026). The high-level dimensions focus on semantic correctness and localization, while the low-level dimensions capture restoration and enhancement quality. Region Accuracy occupies a bridge position because it evaluates localization and preservation rather than direct semantic correctness (Ma et al., 14 Jun 2026, Ma et al., 2024).
Single-round dimensions
| Group | Dimensions | Metric basis |
|---|---|---|
| High-level | Counting; Direction Perception; Object Removal; Object Replacement; Background Replacement; Color Alteration; Style Alteration; Region Accuracy | GPT‑4V for six semantic dimensions, CLIP for Style Alteration, SSIM for Region Accuracy |
| Low-level | Deblurring; Haze Removal; Lowlight Enhancement; Noise Removal; Rain Removal; Shadow Removal; Snow Removal; Watermark Removal | SSIM against ground-truth clean images |
For Counting, Direction Perception, Object Removal, Object Replacement, Background Replacement, and Color Alteration, the benchmark uses GPT‑4V as an automatic judge with human-designed question–answer templates (Ma et al., 14 Jun 2026). Counting asks questions such as “How many cats are on the shoe rack?” and scores the percentage of correct matches to annotated answers (Ma et al., 14 Jun 2026). Direction Perception uses yes/no or relational questions about object position, while Object Removal asks whether a target object remains present after editing (Ma et al., 14 Jun 2026). Object Replacement checks whether the new target object is present and the original object is absent, Background Replacement checks whether the background matches the requested description, and Color Alteration asks GPT‑4V to identify the edited object’s color (Ma et al., 14 Jun 2026).
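The question–answer scoring for these dimensions reduces to a percentage of matches between the judge's answers and the annotated answers. The sketch below assumes exact matching after light normalization; the benchmark's actual matching procedure is not described here, and the function names `normalize` and `qa_accuracy` are hypothetical. Obtaining the judge's answers from an MLLM is outside the sketch.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase, trim, and drop trailing periods so 'Yes.' matches 'yes'."""
    return answer.strip().lower().rstrip(".")


def qa_accuracy(judge_answers: Sequence[str], annotated_answers: Sequence[str]) -> float:
    """Percentage of judge answers that match the annotated answers."""
    if len(judge_answers) != len(annotated_answers):
        raise ValueError("answer lists must have the same length")
    if not annotated_answers:
        raise ValueError("no samples to score")
    correct = sum(
        normalize(p) == normalize(g)
        for p, g in zip(judge_answers, annotated_answers)
    )
    return 100.0 * correct / len(annotated_answers)
```

For example, `qa_accuracy(["2", "Yes.", "no"], ["2", "yes", "yes"])` counts two matches out of three.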
Style Alteration is evaluated with CLIP image–text similarity, using a text prompt template whose placeholder is filled by the requested style (Ma et al., 14 Jun 2026, Ma et al., 2024). Region Accuracy is computed by masking the intended edit region, whitening that region in both original and edited images, and then measuring SSIM on the remaining pixels so that a high score indicates preservation of non-target regions (Ma et al., 14 Jun 2026, Ma et al., 2024).
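The Region Accuracy procedure (whiten the edit region in both images, then compare what remains) can be sketched as follows. This is an illustration under stated assumptions: it uses a single-window SSIM over the whole grayscale image rather than the locally windowed SSIM of standard implementations, and the names `global_ssim`, `whiten`, and `region_accuracy` are hypothetical.

```python
from typing import List

Image = List[List[float]]  # grayscale image, pixel values in [0, 255]


def global_ssim(x: Image, y: Image, data_range: float = 255.0) -> float:
    """Single-window SSIM over the whole image (simplified, no sliding window)."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    if len(xs) != len(ys) or not xs:
        raise ValueError("images must be non-empty and the same size")
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((v - mu_x) ** 2 for v in xs) / n
    var_y = sum((v - mu_y) ** 2 for v in ys) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / n
    # Standard SSIM stabilizing constants (K1 = 0.01, K2 = 0.03).
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


def whiten(img: Image, mask: List[List[bool]], white: float = 255.0) -> Image:
    """Return a copy with every masked (edit-region) pixel set to white."""
    return [
        [white if m else v for v, m in zip(row, mask_row)]
        for row, mask_row in zip(img, mask)
    ]


def region_accuracy(original: Image, edited: Image, mask: List[List[bool]]) -> float:
    """SSIM after whitening the edit region in both images, scaled to 0-100."""
    return 100.0 * global_ssim(whiten(original, mask), whiten(edited, mask))
```

Because the edit region is identical (white) in both images, any change inside it leaves the score untouched, while changes outside it lower the score.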
For the eight low-level dimensions, the metric is SSIM against a ground-truth clean image: $\text{SSIM}(I_{\text{edit}}, I_{\text{gt}}) \in [0,1]$. These scores are scaled to 0–100 when reported in the benchmark tables (Ma et al., 14 Jun 2026). The low-level tasks are Deblurring, Haze Removal, Lowlight Enhancement, Noise Removal, Rain Removal, Shadow Removal, Snow Removal, and Watermark Removal (Ma et al., 14 Jun 2026).
Multi-round dimensions
The multi-round benchmark evaluates seven high-level dimensions: Counting, Direction Perception, Object Removal, Object Replacement, Background Replacement, Color Alteration, and Style Alteration (Ma et al., 14 Jun 2026). Region Accuracy is excluded because multi-round region masks would have to be recomputed for each model’s intermediate outputs, which is incompatible with a fixed automated benchmark (Ma et al., 14 Jun 2026).
The multi-round scoring rule is intentionally strict. For a sequence of instructions, an example is successful only if all rounds succeed (Ma et al., 14 Jun 2026). For GPT‑4V-based dimensions, the benchmark evaluates each round separately and assigns a sample score of 1 only if every round is correct; otherwise the sample is a failure. The reported dimension score is the multi-round success rate: $\text{Score} = \frac{\text{\# multi-round successes}}{\text{\# samples}} \times 100$. For Style Alteration, the benchmark again uses CLIP-based style similarity across rounds while retaining the same all-rounds-must-succeed interpretation (Ma et al., 14 Jun 2026). This scoring rule makes multi-round performance substantially more demanding than single-round performance because small intermediate errors accumulate.
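The all-rounds-must-succeed rule can be written in a few lines. This is a minimal sketch that assumes per-round pass/fail judgments are already available as booleans; the function name `multi_round_score` is hypothetical.

```python
from typing import Sequence


def multi_round_score(per_round_results: Sequence[Sequence[bool]]) -> float:
    """Percentage of samples in which every round was judged correct."""
    if not per_round_results:
        raise ValueError("no samples to score")
    # A sample counts as a success only if it has rounds and all of them passed.
    successes = sum(1 for rounds in per_round_results if rounds and all(rounds))
    return 100.0 * successes / len(per_round_results)
```

As a purely illustrative calculation, if each round succeeded independently with probability 0.8 (an assumption, not a benchmark result), a three-round sample would succeed with probability $0.8^3 \approx 0.51$, which shows how quickly the strict rule compounds per-round errors.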
4. Human alignment and evaluation protocol
A central design principle of I2EBench2.0 is alignment with human judgment (Ma et al., 14 Jun 2026). To test this, the benchmark conducts a user study for each of the 16 single-round and 7 multi-round dimensions. Human raters are shown the instruction, the original image, and the edited outputs from $N$ models, and are asked to rank all outputs from best to worst according to overall judgment, combining instruction faithfulness and visual quality (Ma et al., 14 Jun 2026). Ranks are converted into human scores by a reversed mapping: rank 1 maps to score $N$, rank $N$ maps to score 1, and in general rank $r$ maps to score $N - r + 1$ (Ma et al., 14 Jun 2026, Ma et al., 2024). The per-dimension human score for a model is then the average of these scores over the sampled examples.
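The rank-to-score conversion and averaging can be sketched as below. The sketch assumes the conversion is a reversed linear mapping in which the best of the ranked outputs receives the highest score and the worst receives score 1, and that every model appears in every ranking; the names `rank_to_score` and `mean_human_scores` are hypothetical.

```python
from typing import Dict, List


def rank_to_score(rank: int, num_models: int) -> int:
    """Assumed reversed mapping: rank 1 -> num_models, rank num_models -> 1."""
    if not 1 <= rank <= num_models:
        raise ValueError("rank out of range")
    return num_models - rank + 1


def mean_human_scores(rankings: List[Dict[str, int]]) -> Dict[str, float]:
    """Average each model's converted score over all ranked examples."""
    if not rankings:
        raise ValueError("no rankings to average")
    totals: Dict[str, float] = {}
    for ranking in rankings:  # one dict of model -> rank per rated example
        n = len(ranking)
        for model, rank in ranking.items():
            totals[model] = totals.get(model, 0.0) + rank_to_score(rank, n)
    return {model: total / len(rankings) for model, total in totals.items()}
```

With two rated examples and three models, a model ranked first and then second receives scores 3 and 2, for a mean of 2.5.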
The benchmark then compares these human-derived rankings with rank-based scores produced by its automatic metrics (Ma et al., 14 Jun 2026). Figure-based analysis in the paper reports strong positive correlations, described in terms of Spearman’s $\rho$, between automatic ranks and human evaluation across all dimensions (Ma et al., 14 Jun 2026). The predecessor benchmark similarly reports significant positive correlations between I2EBench rank scores and human rank scores across its 16 dimensions (Ma et al., 2024). Although the provided summary does not list exact coefficient values, the stated conclusion is that better benchmark scores reliably correspond to better human preference rankings (Ma et al., 14 Jun 2026).
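Spearman's rank correlation between an automatic ranking and a human ranking of the same models can be computed with the standard tie-free formula $\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$, where $d_i$ is the rank difference for model $i$. The sketch below assumes rankings without ties; whether the paper handles ties differently is not stated here, and the function name is hypothetical.

```python
from typing import Sequence


def spearman_rho(ranks_a: Sequence[int], ranks_b: Sequence[int]) -> float:
    """Spearman's rho for two tie-free rankings of the same items."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("need two rankings of equal length >= 2")
    # Sum of squared rank differences across items.
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

Identical rankings give 1.0 and fully reversed rankings give -1.0, so values near 1.0 indicate that the automatic metric orders models the way human raters do.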
I2EBench2.0 further evaluates judge dependence by replacing GPT‑4V with Qwen3VL-8B and LLaVA-1.5-7B in cross-judge experiments (Ma et al., 14 Jun 2026). It computes average ranks across judges to verify consistency, and the paper states that rankings are very consistent across different judges, with small average-rank variations (Ma et al., 14 Jun 2026). This is important because high-level evaluation relies on powerful MLLMs, and the benchmark’s claim is not that judge choice is irrelevant, but that the overall conclusions remain stable across multiple evaluators (Ma et al., 14 Jun 2026).
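The cross-judge consistency check amounts to ranking the models under each judge and averaging the ranks. The following sketch shows one way to do this; the tie-breaking rule and the function names are assumptions, and the judge labels and scores in the usage note are made up for illustration.

```python
from typing import Dict


def ranks_from_scores(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank models by score, highest first; ties broken by model name."""
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {model: i + 1 for i, model in enumerate(ordered)}


def average_rank_across_judges(
    judge_scores: Dict[str, Dict[str, float]]
) -> Dict[str, float]:
    """Mean rank of each model over all judges (same models under each judge)."""
    if not judge_scores:
        raise ValueError("need at least one judge")
    per_judge = [ranks_from_scores(s) for s in judge_scores.values()]
    models = list(per_judge[0])
    return {m: sum(r[m] for r in per_judge) / len(per_judge) for m in models}
```

For instance, with made-up scores `{"judge_a": {"X": 60, "Y": 40, "Z": 20}, "judge_b": {"X": 55, "Y": 30, "Z": 35}}`, model X has average rank 1.0 while Y and Z both average 2.5, so the judges agree on the leader but not on the order below it.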
Operationally, the benchmark’s single-round protocol feeds each image–instruction pair into the model once, obtains one edited image using official inference settings, and applies the dimension-specific evaluator: GPT‑4V-based Q/A, CLIP similarity, Region Accuracy SSIM, or low-level SSIM (Ma et al., 14 Jun 2026). The multi-round protocol starts from the original image, applies the instructions sequentially so that each round edits the previous round’s output, evaluates each round, and counts the sample as successful only if all rounds pass (Ma et al., 14 Jun 2026).
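The multi-round protocol can be sketched as a loop in which each round edits the previous round's output. Both callables are hypothetical placeholders: `edit_fn` stands in for the IIE model under test and `judge_fn` for the dimension-specific evaluator; neither corresponds to a released API.

```python
from typing import Callable, List, Sequence, TypeVar

Img = TypeVar("Img")


def run_multi_round(
    image: Img,
    instructions: Sequence[str],
    edit_fn: Callable[[Img, str], Img],
    judge_fn: Callable[[Img, str], bool],
) -> List[bool]:
    """Apply instructions in order, feeding each output into the next round."""
    current = image
    results: List[bool] = []
    for instruction in instructions:
        current = edit_fn(current, instruction)  # edit the previous round's output
        results.append(judge_fn(current, instruction))  # judge this round
    return results


def sample_succeeds(results: Sequence[bool]) -> bool:
    """A sample passes only if it has at least one round and all rounds pass."""
    return len(results) > 0 and all(results)
```

Because later rounds start from earlier outputs, a defect introduced in one round remains in the image that every subsequent round receives.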
5. Baseline models and empirical findings
The benchmark evaluates a wide set of IIE models using official code and checkpoints, with one edited image generated per instruction (Ma et al., 14 Jun 2026). The reported model set includes HIVE, InstructDiffusion, InstructPix2Pix, MagicBrush, MGIE, InstructEdit, InstructAny2Pix, HQ-Edit, FLUX Kontext, and Qwen-image (Ma et al., 14 Jun 2026). The predecessor benchmark had evaluated eight models, while the extended version’s tables show at least ten (Ma et al., 2024, Ma et al., 14 Jun 2026).
On single-round low-level dimensions, the benchmark reports that low-level tasks are mostly moderately solved, but with substantial diversity across methods (Ma et al., 14 Jun 2026). InstructDiffusion often leads in Deblurring, Haze Removal, Rain Removal, and Watermark Removal, while Qwen-image achieves very high scores in Lowlight Enhancement, including 73.7% in one table and 72.8% in another (Ma et al., 14 Jun 2026). FLUX Kontext scores very high in Haze Removal and Rain Removal / Watermark Removal, including 79.7% on rain and 92.2% on watermark removal in one reported table (Ma et al., 14 Jun 2026). MGIE and HQ-Edit are competitive on some low-level edits but are not consistently best (Ma et al., 14 Jun 2026).
On single-round high-level dimensions, Counting and Direction Perception are strongly discriminative (Ma et al., 14 Jun 2026). Qwen-image reaches roughly 65% on original instructions and 59% on diverse instructions for Counting, and exceeds 80% on Direction Perception (Ma et al., 14 Jun 2026). FLUX Kontext is also strong, with approximately 49–45% on Counting and 76–78% on Direction Perception (Ma et al., 14 Jun 2026). Earlier models such as InstructPix2Pix, InstructDiffusion, and InstructEdit often remain below 20% on counting (Ma et al., 14 Jun 2026). For object-centric operations, Qwen-image is reported as best or near-best in Object Removal, Object Replacement, and Background Replacement, with approximately 78–75%, 88–97%, and 96–93%, respectively (Ma et al., 14 Jun 2026). MagicBrush and MGIE also perform relatively well in Object Replacement and Background Replacement, typically in the 60–80% range (Ma et al., 14 Jun 2026). Color Alteration is comparatively strong for Qwen-image and FLUX Kontext, both above roughly 80%, whereas Style Alteration remains difficult for all models, with maximum scores only around the 27–28% CLIP similarity range (Ma et al., 14 Jun 2026).
Multi-round evaluation reveals much larger weaknesses (Ma et al., 14 Jun 2026). Counting becomes extremely difficult: all models have very low multi-round counting accuracy, often below 5%, with Qwen-image at about 11% (Ma et al., 14 Jun 2026). Direction Perception degrades but remains feasible for the strongest models, with Qwen-image around 57% on original and 53% on diverse instructions, and FLUX Kontext around 45–38% (Ma et al., 14 Jun 2026). Object Removal and Object Replacement also degrade substantially, though Qwen-image still reaches around 39–43% on removal and 64–68% on replacement (Ma et al., 14 Jun 2026). Background Replacement and Color Alteration degrade less severely; Qwen-image retains about 82% on multi-round background replacement versus around 96% in single-round evaluation, and about 66–68% on multi-round Color Alteration (Ma et al., 14 Jun 2026). Style Alteration remains low and similar across models in the multi-round setting, around 23–27% (Ma et al., 14 Jun 2026).
The benchmark also reports trade-offs. Some models are strong at low-level restoration but weaker at nuanced instruction semantics; others are relatively strong in high-level semantics but less stable on low-level metrics (Ma et al., 14 Jun 2026). Qwen-image and FLUX Kontext are singled out as the most balanced across high-level and low-level as well as single-round and multi-round settings (Ma et al., 14 Jun 2026). Qualitative examples further identify failure modes such as over-editing non-target regions, hallucinated objects, incorrect counting despite plausible images, and progressive drift across rounds (Ma et al., 14 Jun 2026).
6. Research implications, comparative position, and limitations
The paper distills several research insights from the benchmark results (Ma et al., 14 Jun 2026). First, multi-round robustness is a major unsolved problem. The gap between single-round and multi-round performance is clear across dimensions, and early sequence errors propagate, especially for object-centric tasks such as counting, removal, and replacement (Ma et al., 14 Jun 2026). This suggests that future models may need explicit mechanisms for edit history, state tracking, and error correction, though such architectural prescriptions are presented as research guidance rather than as part of the benchmark itself.
Second, the benchmark states that aesthetic quality degrades across rounds, using an aesthetic predictor score (AP) similar to CLIP-MLP from LAION-5B (Ma et al., 14 Jun 2026). AP tends to drop in multi-round editing, indicating accumulating artifacts and quality loss (Ma et al., 14 Jun 2026). Third, no single model dominates all dimensions: InstructDiffusion is strong on low-level tasks, MagicBrush and MGIE on some high-level edits, and Qwen-image and FLUX Kontext on complex semantics and balanced performance (Ma et al., 14 Jun 2026). Fourth, instruction wording matters. The benchmark reports that some models perform notably worse under diverse paraphrased instructions, especially in object removal, whereas models leveraging strong LLM or MLLM front-ends are more robust to instruction variation (Ma et al., 14 Jun 2026). The earlier benchmark formalized this sensitivity through an instruction-robustness change rate computed from a model’s scores under original and diverse instructions (Ma et al., 2024).
The benchmark also reports category-dependent performance (Ma et al., 14 Jun 2026, Ma et al., 2024). Scenery and Global categories consistently obtain higher scores because they are easier and often correspond to global edits that do not require precise localization (Ma et al., 14 Jun 2026). Animal and Human categories are harder because they demand fine-grained semantics and geometry (Ma et al., 14 Jun 2026). This category pattern situates I2EBench2.0 not merely as a collection of tasks, but as a controlled testbed for differentiating which kinds of semantic precision remain difficult for current editors.
Relative to prior benchmarks, I2EBench2.0 distinguishes itself by combining 8 high-level semantic dimensions, 8 low-level editing dimensions, and 7 multi-round dimensions in one framework (Ma et al., 14 Jun 2026). Compared with TedBench and TedBench++, it offers broader coverage and explicit multi-round evaluation; compared with EditBench, it is suitable for mask-free instruction-based models; compared with EditVal, it covers more than geometric aspects and relies less on manual scoring; and compared with MagicBrush, Emu Edit, and SmartEdit test sets, it goes beyond single-round evaluation and generic CLIP- or DINO-based scoring (Ma et al., 14 Jun 2026). Its use of multimodal LLM judges with human-designed Q/A templates, human-alignment studies, and cross-judge rank analysis is presented as one of its defining methodological novelties (Ma et al., 14 Jun 2026).
The benchmark’s limitations are also explicit. Domain coverage is broader than prior work but still focused on typical natural images and classical low-level conditions; artistic or domain-specific areas such as medical, satellite, or anime are not explicitly included (Ma et al., 14 Jun 2026). Instructions are in English only, and cross-lingual robustness is not evaluated (Ma et al., 14 Jun 2026). The benchmark does not yet measure diversity of valid outputs under underspecified instructions, nor does it test ambiguous or infeasible instructions (Ma et al., 14 Jun 2026). High-level evaluation depends on powerful MLLMs, which introduces cost and latency (Ma et al., 14 Jun 2026). Complex compositional one-shot instructions are not directly benchmarked; instead, they are approximated through multi-round sequences (Ma et al., 14 Jun 2026). Region Accuracy is absent from multi-round evaluation for practical reasons of annotation and automation (Ma et al., 14 Jun 2026).
The released resources include the dataset, human-annotated instructions and paraphrases, multi-round sequences, category labels, evaluation scripts, MLLM integrations, SSIM and CLIP implementations, and edited images from all evaluated models, all hosted in the public repository associated with the benchmark (Ma et al., 14 Jun 2026, Ma et al., 2024). In that sense, I2EBench2.0 functions as both a standardized benchmark and a reproducible infrastructure for evaluating new IIE models under a common protocol (Ma et al., 14 Jun 2026).