GEditBench v2: Image Editing Benchmark

Updated 3 July 2026

GEditBench v2 is a benchmark of 23 image editing tasks, built from 1,200 real-world user queries and designed for human-aligned evaluation.
It introduces PVC-Judge, a LoRA-tuned pairwise visual consistency model trained on preference pairs synthesized from region-level metrics.
It gives task-level insight into model performance, exposing issues such as weak identity preservation, poor structural coherence, and under-editing.

GEditBench v2 is a comprehensive, human-aligned benchmark for general image editing that addresses intrinsic deficiencies found in prior evaluation frameworks and datasets. It introduces a broad taxonomy of editing tasks, emphasizes visual consistency as a core evaluation principle, and provides an open-source suite for pairwise assessment using learned preference models. The benchmark is designed to reflect real-world human editing demands, moving beyond closed task lists and standard pointwise metrics to deliver more granular and reliable insights into the capabilities and limitations of modern image-editing models (Jiang et al., 30 Mar 2026).

1. Motivation and Scope

Earlier image editing benchmarks offered narrow task coverage, typically 5–14 task categories, and often relied heavily on pointwise vision-language model (VLM) scoring. These approaches fall short in three areas: (1) they do not represent unconstrained, open-set user instructions; (2) they lack systematic evaluation of visual consistency, i.e., the preservation of image identity and structure outside edited regions; and (3) they correlate poorly with actual human preferences.

GEditBench v2 addresses these limitations by explicitly broadening the scope of evaluation to 23 editing tasks, including open-set categories, and provisioning both the data and methodology needed for pairwise, human-aligned comparison of model outputs. The benchmark comprises 1,200 real-world user queries, each consisting of an input image and a natural-language instruction, ensuring high ecological validity and diversity in editing requirements.

2. Dataset Composition and Task Taxonomy

GEditBench v2’s dataset is structured to evaluate a spectrum of editing competencies across:

Local tasks (12 categories): e.g., Subject Addition/Removal/Replacement, Size Adjustment, Color Alteration, Material Modification, Portrait Beautification, Motion Change, Relation Change, Text Editing, In-Image Translation, Chart Editing.
Global tasks (6): e.g., Background Change, Style Transfer, Tone Transfer, Enhancement, Camera Motion, Line2Image.
Reference tasks (3): e.g., Character, Object, or Style Reference.
Hybrid tasks (1): multi-operation compositions.
Open-Set tasks (1): unconstrained instructions from user forums that do not cleanly map onto standardized taxonomies.

Task instances were derived from public forums (Reddit, X) and prior benchmarks. Each editing prompt underwent expert filtering for clarity, duplication, and specificity. User-submitted images were replaced with public domain or Nano Banana Pro–generated images. The dataset is strictly designed as an evaluation set, with no separate training partition. The following table summarizes the distribution:

Category	#Tasks	#Instances
Local	12	~550
Global	6	~300
Reference	3	~150
Hybrid	1	~100
Open-Set	1	~100
Total	23	1,200

This taxonomy encompasses both frequent, well-defined transformations (e.g. subject removal) and rare, complex, or semantically ambiguous edits sourced from open online discourse.

3. PVC-Judge: Pairwise Visual Consistency Model

A central contribution is PVC-Judge, an open-source pairwise assessment model for visual consistency. PVC-Judge is based on the Qwen3-VL-8B-Instruct backbone, augmented via Low-Rank Adaptation (LoRA) across all attention and MLP layers (rank $r = 64$, $\alpha = 128$). It implements a binary preference scorer $s(x_1, x_2)$, with the output probability $\sigma(s(x_1, x_2))$ indicating the model’s belief that $x_1$ is preferred to $x_2$.
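
For reference, under the standard LoRA parameterization (an assumption here, since the source gives only the rank and $\alpha$), each adapted weight matrix $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ is frozen and augmented with a trainable low-rank update:

$W = W_0 + \frac{\alpha}{r} B A, \quad B \in \mathbb{R}^{d_{\text{out}} \times r}, \; A \in \mathbb{R}^{r \times d_{\text{in}}}.$

With $r = 64$ and $\alpha = 128$, the scaling factor is $\alpha / r = 2$, and each adapted matrix adds $r\,(d_{\text{in}} + d_{\text{out}})$ trainable parameters.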

Preference data synthesis for PVC-Judge is multi-pronged:

Object-centric pipeline: Editing targets are localized (via Qwen3-VL-8B) into masks $\Omega_{\text{edit}}$ and $\Omega_{\text{non}}$. Region-specific metrics are then applied: SSIM, LPIPS, and CLIP-EMD on $\Omega_{\text{non}}$ (invariance), and task-specific metrics (e.g., lightness-SSIM, depth-SSIM) on $\Omega_{\text{edit}}$.
Human-centric pipeline: For portraits, regions are partitioned into Face, Hair, and Body. Non-modified subregions are evaluated using ArcFace, DINOv3 embeddings, and hair-texture differences.
VLM-as-judge pipeline: For global edits where precise region masks are unavailable, Gemini 3 Pro is used for annotating candidate output pairs.
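
The region split used by the object-centric pipeline can be illustrated with a minimal sketch. It assumes toy grayscale images stored as nested lists and uses mean absolute pixel difference as a simple stand-in for the SSIM, LPIPS, and CLIP-EMD metrics named above. The function name `region_mean_abs_diff` and all data are hypothetical and not part of the benchmark's code.

```python
def region_mean_abs_diff(src, out, mask, inside):
    """Mean absolute pixel difference over pixels where (mask == 1) equals `inside`.

    Stand-in metric only: the source uses SSIM, LPIPS and CLIP-EMD instead.
    """
    total, count = 0.0, 0
    for src_row, out_row, mask_row in zip(src, out, mask):
        for s, o, m in zip(src_row, out_row, mask_row):
            if bool(m) == inside:
                total += abs(s - o)
                count += 1
    return total / count if count else 0.0


# Toy 3x3 grayscale images in [0, 1]; edit_mask marks Omega_edit with 1.
source = [[0.2, 0.2, 0.2], [0.2, 0.5, 0.2], [0.2, 0.2, 0.2]]
edit_mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]

# Candidate A changes only the edit region; candidate B also shifts the background.
faithful = [[0.2, 0.2, 0.2], [0.2, 0.9, 0.2], [0.2, 0.2, 0.2]]
leaky = [[0.3, 0.3, 0.3], [0.3, 0.9, 0.3], [0.3, 0.3, 0.3]]

# Drift on Omega_non (lower means better visual consistency).
drift_faithful = region_mean_abs_diff(source, faithful, edit_mask, inside=False)
drift_leaky = region_mean_abs_diff(source, leaky, edit_mask, inside=False)

# Change on Omega_edit (shows the edit actually happened).
edit_change = region_mean_abs_diff(source, faithful, edit_mask, inside=True)

print(drift_faithful, drift_leaky, edit_change)
```

Both candidates apply the same change inside the edit region, so only the drift on the non-edited region separates them, which is the signal a visual consistency preference is meant to capture.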

Preference pairs are filtered by z-score normalization, Pareto dominance (improvement on at least one primary metric and non-inferiority on others), and consistency among auxiliary metrics (majority vote). Training is performed with a standard pairwise ranking loss:

$L_{\text{pref}} = -\log \sigma(s(x^+, x^-)) - \log(1-\sigma(s(x^-, x^+))),$

where $x^+$ is the preferred output and $x^-$ is the inferior one.
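
The filtering rules and the loss above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: all metrics are taken to be "higher is better", z-scores are computed across the candidates for one prompt, and the helper names (`zscore_columns`, `pareto_dominates`, `majority_agrees`, `select_pairs`, `pref_loss`) are hypothetical.

```python
import math
from statistics import mean, pstdev


def zscore_columns(rows):
    """Z-score each metric (column) across candidates (rows)."""
    out_cols = []
    for col in zip(*rows):
        mu, sd = mean(col), pstdev(col)
        out_cols.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return [list(r) for r in zip(*out_cols)]


def pareto_dominates(a, b):
    """a is non-inferior to b on every metric and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def majority_agrees(aux_a, aux_b):
    """More than half of the auxiliary metrics also favour a over b."""
    votes = sum(x > y for x, y in zip(aux_a, aux_b))
    return votes > len(aux_a) / 2


def select_pairs(primary, auxiliary):
    """Return (winner, loser) index pairs that pass both filters."""
    z = zscore_columns(primary)
    pairs = []
    for i in range(len(z)):
        for j in range(len(z)):
            if (i != j and pareto_dominates(z[i], z[j])
                    and majority_agrees(auxiliary[i], auxiliary[j])):
                pairs.append((i, j))
    return pairs


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pref_loss(s_pos_neg, s_neg_pos):
    """L_pref = -log sigma(s(x+, x-)) - log(1 - sigma(s(x-, x+)))."""
    return softplus(-s_pos_neg) + softplus(s_neg_pos)


# Three candidates, two primary metrics and three auxiliary metrics (toy values).
primary = [[0.90, 0.80], [0.70, 0.80], [0.95, 0.60]]
auxiliary = [[0.5, 0.6, 0.7], [0.4, 0.5, 0.8], [0.1, 0.2, 0.3]]
pairs = select_pairs(primary, auxiliary)
print(pairs)

# The loss is small when the scorer ranks the pair correctly in both orders.
print(pref_loss(5.0, -5.0), pref_loss(-5.0, 5.0))
```

In this toy data only candidate 0 dominates candidate 1; candidate 2 trades one metric against the other and therefore yields no training pair. The loss scores each pair in both presentation orders, so an uninformative scorer ($s = 0$ in both orders) incurs $2 \log 2$.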

4. VCReward-Bench and Human Alignment

To systematically validate automated preference judgments, VCReward-Bench provides 3,506 human-annotated pairwise comparisons across 21 tasks. Candidate pairs were generated from 7 open-source models and Nano Banana Pro (at most 28 unique pairs per prompt). Expert annotators rate each pair across Instruction Following (IF), Visual Quality (VQ), Visual Consistency (VC), and Overall. Pareto filtering ensures that VC preference pairs differ only on the VC dimension while being non-inferior on all others. The interface allows annotators to select “Prefer A,” “Prefer B,” or explicit tie options.
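
The bound of 28 pairs per prompt follows from choosing 2 outputs out of 8 models. A quick check, using placeholder names for the seven open-source models since the source does not list them here:

```python
from itertools import combinations

# Placeholder names: the seven open-source models are not enumerated in this section.
models = [f"open_model_{i}" for i in range(1, 8)] + ["Nano Banana Pro"]

# Every unordered pair of distinct models yields one candidate comparison.
pairs = list(combinations(models, 2))
print(len(pairs))
```

Fewer than 28 pairs can remain for a prompt once duplicates or filtered outputs are removed, which is consistent with the "at most" wording.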

This careful construction facilitates robust meta-evaluation of PVC-Judge and other metrics, providing a high-fidelity alignment signal relative to human preference.

5. Evaluation of Editing and Assessment Models

PVC-Judge is compared against a range of proprietary and open-source metrics on two meta-benchmarks: EditReward-Bench and VCReward-Bench. Accuracy in correctly reflecting human VC preferences is as follows:

Model	EditReward-Bench (%)	VCReward-Bench (%)
Gemini 3 Pro	87.33	87.13
GPT-5.1	78.68	76.89
EditReward (MiMo-7B)	78.34	67.41
EditScore-Avg@4	57.83	49.20
Qwen3-VL-8B	72.27	73.07
PVC-Judge	82.44	81.82

On both benchmarks, PVC-Judge outperforms every open-source metric listed and also GPT-5.1, trailing only Gemini 3 Pro. Task-level analysis shows especially strong performance on local tasks with complex instance-level constraints (e.g., color alteration: 94.2% accuracy versus 86.96% for the prior best).
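
The accuracy figures above measure agreement between a judge and human pairwise labels. A minimal sketch of that computation follows, assuming the judge emits a probability that the first image is preferred and that ties are excluded beforehand. The function name `judge_accuracy` and the data are hypothetical.

```python
def judge_accuracy(judge_probs, human_prefers_first):
    """Fraction of pairs where the thresholded judge output matches the human label."""
    if len(judge_probs) != len(human_prefers_first):
        raise ValueError("inputs must have the same length")
    if not judge_probs:
        raise ValueError("need at least one pair")
    hits = sum((p > 0.5) == bool(h)
               for p, h in zip(judge_probs, human_prefers_first))
    return hits / len(judge_probs)


# Toy data: judge probabilities sigma(s(x1, x2)) and human labels for four pairs.
probs = [0.9, 0.2, 0.6, 0.4]
labels = [True, False, False, False]
accuracy = judge_accuracy(probs, labels)
print(f"{100 * accuracy:.2f}%")
```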

Sixteen editing models were evaluated, both closed-source (e.g., GPT-Image 1.5, Nano Banana Pro, Seedream 4.5) and open-source (e.g., FLUX.2, Qwen-Image-Edit, BAGEL). Rankings use Bradley–Terry capability scores, which convert pairwise win/tie/loss outcomes into Elo ratings. Results for five of the models are shown below:

Model	IF Elo	VQ Elo	VC Elo	Overall Elo
GPT-Image 1.5	1260	1149	846	1071
Nano Banana Pro	1126	1066	1108	1096
Seedream 4.5	1111	1142	1030	1089
FLUX.2 [klein] 9B	1083	1025	1019	1039
Qwen-Image-Edit 2511	1095	1060	972	1038

Closed-source systems tend to lead in IF and VQ, but not always in VC, with certain models (e.g., GPT-Image 1.5) penalized for excessive non-target modifications ("collateral changes"). Some open-source models demonstrate that efficient architectures can enable competitive overall performance. A salient failure mode is "under-editing" (minimal model output change), which can artificially inflate VC scores.
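
The Bradley–Terry scoring behind these rankings can be sketched as follows. The source does not specify the fitting procedure, tie handling, or Elo anchoring, so this example assumes a standard minorization-maximization update, ties counted as half a win for each side, and ratings centred at 1000 with the usual 400-point scale. The function names and the win matrix are illustrative.

```python
import math


def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a win matrix.

    wins[i][j] is the number of times model i beat model j (a tie adds 0.5
    to both wins[i][j] and wins[j][i]). Assumes every model has at least
    one win, otherwise its strength collapses to zero.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom if denom > 0 else p[i])
        # Fix the scale by normalising the geometric mean of strengths to 1.
        log_mean = sum(math.log(x) for x in new_p) / n
        p = [x / math.exp(log_mean) for x in new_p]
    return p


def to_elo(strengths, base=1000.0, scale=400.0):
    """Map strengths to Elo so that P(i beats j) = 1 / (1 + 10 ** ((Rj - Ri) / scale))."""
    return [base + scale * math.log10(x) for x in strengths]


# Toy win counts for three hypothetical models, ten comparisons per pair.
wins = [
    [0, 7, 8],
    [3, 0, 6],
    [2, 4, 0],
]
elo = to_elo(bradley_terry(wins))
print([round(r) for r in elo])
```

Because ratings are fitted per dimension, a model can rank high on IF and low on VC, as GPT-Image 1.5 does in the table above.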

6. Discussion, Limitations, and Prospects

Persistent challenges remain in three areas: identity preservation, especially for small faces and background instances during local edits; structural coherence in complex relation tasks (e.g., coordinated object repositioning); and semantic comprehension, where open-set and cross-taxonomy instructions frequently lead to partial or ambiguous edits.

Proposed future directions include:

Integration of PVC-Judge as a fine-tuning reward (reinforcement learning "in-the-loop") for editing models, enhancing output consistency with true human perception.
Extension to multi-image editing tasks as open-source VLMs mature.
Mitigation of automated pipeline biases by advancing segmentation, feature extraction, and decoupling methods.
Development of differentiable region-aware metrics to allow direct, end-to-end training on composite editing objectives.

GEditBench v2 in combination with PVC-Judge and VCReward-Bench establishes a unified, open-source ecosystem for benchmark-driven research in image editing. This suite comprehensively addresses real-world instruction diversity, robustly measures visual consistency through pairwise preference, and exposes both advances and limitations in the preservation of identity, structure, and semantics in model-edited images (Jiang et al., 30 Mar 2026).

References (1)

GEditBench v2: A Human-Aligned Benchmark for General Image Editing (2026)
