GEditBench v2: Image Editing Benchmark
- GEditBench v2 is a comprehensive benchmark defining 23 image editing tasks to evaluate real-world, human-aligned editing outcomes.
- The framework introduces PVC-Judge, a pairwise visual consistency model that is LoRA-tuned on preference pairs built from region-aware metrics so that its judgments track human preference.
- It delivers granular insights into model performance, addressing issues like identity preservation, structural coherence, and under-editing.
GEditBench v2 is a comprehensive, human-aligned benchmark for general image editing that addresses intrinsic deficiencies found in prior evaluation frameworks and datasets. It introduces a broad taxonomy of editing tasks, emphasizes visual consistency as a core evaluation principle, and provides an open-source suite for pairwise assessment using learned preference models. The benchmark is designed to reflect real-world human editing demands, moving beyond closed task lists and standard pointwise metrics to deliver more granular and reliable insights into the capabilities and limitations of modern image-editing models (Jiang et al., 30 Mar 2026).
1. Motivation and Scope
Earlier image-editing benchmarks have narrow task coverage, typically supporting only 5–14 task categories, and often rely heavily on pointwise scoring by a vision-language model (VLM). These approaches fall short in three areas: (1) they do not represent unconstrained, open-set user instructions; (2) they do not systematically evaluate visual consistency, i.e., the preservation of image identity and structure outside the edited regions; and (3) their scores correlate poorly with actual human preferences.
GEditBench v2 addresses these limitations by explicitly broadening the scope of evaluation to 23 editing tasks, including open-set categories, and provisioning both the data and methodology needed for pairwise, human-aligned comparison of model outputs. The benchmark comprises 1,200 real-world user queries, each consisting of an input image and a natural-language instruction, ensuring high ecological validity and diversity in editing requirements.
2. Dataset Composition and Task Taxonomy
GEditBench v2’s dataset is structured to evaluate a spectrum of editing competencies across:
- Local tasks (12 categories): Subject Addition/Removal/Replacement, Size Adjustment, Color Alteration, Material Modification, Portrait Beautification, Motion Change, Relation Change, Text Editing, In-Image Translation, Chart Editing.
- Global tasks (6): Background Change, Style Transfer, Tone Transfer, Enhancement, Camera Motion, Line2Image.
- Reference tasks (3): Character, Object, or Style Reference.
- Hybrid tasks (1): multi-operation compositions.
- Open-Set tasks (1): unconstrained instructions from user forums that do not cleanly map onto standardized taxonomies.
Task instances were derived from public forums (Reddit, X) and prior benchmarks. Each editing prompt underwent expert filtering for clarity, duplication, and specificity. User-submitted images were replaced with public domain or Nano Banana Pro–generated images. The dataset is strictly designed as an evaluation set, with no separate training partition. The following table summarizes the distribution:
| Category | #Tasks | #Instances |
|---|---|---|
| Local | 12 | ~550 |
| Global | 6 | ~300 |
| Reference | 3 | ~150 |
| Hybrid | 1 | ~100 |
| Open-Set | 1 | ~100 |
| Total | 23 | 1,200 |
This taxonomy covers both frequent, well-defined transformations (e.g., subject removal) and rare, complex, or semantically ambiguous edits sourced from open online discourse.
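The taxonomy and the table above can be cross-checked with a small data structure. This is only a sketch: the dictionary layout and the names `TAXONOMY`, `APPROX_INSTANCES`, and `summarize` are illustrative assumptions, "Subject Addition/Removal/Replacement" is read as three separate tasks, and the instance counts are the approximate values from the table.

```python
# Task names as listed in the text; the dict layout is an illustrative assumption.
TAXONOMY = {
    "Local": [
        "Subject Addition", "Subject Removal", "Subject Replacement",
        "Size Adjustment", "Color Alteration", "Material Modification",
        "Portrait Beautification", "Motion Change", "Relation Change",
        "Text Editing", "In-Image Translation", "Chart Editing",
    ],
    "Global": [
        "Background Change", "Style Transfer", "Tone Transfer",
        "Enhancement", "Camera Motion", "Line2Image",
    ],
    "Reference": ["Character Reference", "Object Reference", "Style Reference"],
    "Hybrid": ["Hybrid"],
    "Open-Set": ["Open-Set"],
}

# Approximate instance counts per category, copied from the table.
APPROX_INSTANCES = {
    "Local": 550, "Global": 300, "Reference": 150, "Hybrid": 100, "Open-Set": 100,
}


def summarize(taxonomy, instances):
    """Return (category, #tasks, #instances) rows followed by a Total row."""
    rows = [(cat, len(tasks), instances[cat]) for cat, tasks in taxonomy.items()]
    total = ("Total", sum(r[1] for r in rows), sum(r[2] for r in rows))
    return rows + [total]


for row in summarize(TAXONOMY, APPROX_INSTANCES):
    print(row)
```

The final row sums to 23 tasks and 1,200 instances, matching the totals reported in the table.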
3. PVC-Judge: Pairwise Visual Consistency Model
A central contribution is PVC-Judge, an open-source pairwise assessment model for visual consistency. PVC-Judge is built on the Qwen3-VL-8B-Instruct backbone and adapted with Low-Rank Adaptation (LoRA) across all attention and MLP layers (rank = 64, α = 128). It implements a binary preference scorer over two candidate edits of the same input: the output probability indicates the model's belief that the first candidate is preferred to the second.
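To give a feel for what rank = 64 and α = 128 imply, the arithmetic below works through one adapter. It is a rough sketch that assumes the standard LoRA formulation (update scaled by α / rank) and a hypothetical 4096 × 4096 projection; the actual layer shapes of the backbone are not stated in the text.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA."""
    return alpha / rank


RANK, ALPHA = 64, 128      # values reported in the text
D_IN = D_OUT = 4096        # hypothetical layer shape, for illustration only

extra = lora_extra_params(D_IN, D_OUT, RANK)
full = D_IN * D_OUT
print("scaling:", lora_scaling(ALPHA, RANK))
print("adapter params:", extra, "vs full matrix:", full)
print("fraction trained:", extra / full)
```

Under these assumptions the adapter trains about 3% of the parameters of the layer it wraps, with the update scaled by a factor of 2.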
Preference data synthesis for PVC-Judge is multi-pronged:
- Object-centric pipeline: Editing targets are localized (via Qwen3-VL-8B) into an edit-region mask and its complementary non-edit-region mask. Region-specific metrics are then enforced: SSIM, LPIPS, and CLIP-EMD on the non-edited region (invariance), and task-specific metrics (e.g., lightness-SSIM, depth-SSIM) on the edited region.
- Human-centric pipeline: For portraits, regions are partitioned into Face, Hair, and Body. Non-modified subregions are evaluated using ArcFace, DINOv3 embeddings, and hair-texture differences.
- VLM-as-judge pipeline: For global edits where precise region masks are unavailable, Gemini 3 Pro is used for annotating candidate output pairs.
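The region-aware idea behind the object-centric pipeline can be illustrated with a toy example. The sketch below is not the benchmark's implementation: it swaps in a plain masked mean absolute difference as a stand-in for SSIM, LPIPS, or CLIP-EMD, treats images as small grayscale grids, and uses the hypothetical helper names `masked_mean_abs_diff` and `invert`.

```python
def masked_mean_abs_diff(source, edited, mask):
    """Mean absolute pixel difference over the positions where mask is True."""
    total, count = 0.0, 0
    for src_row, out_row, mask_row in zip(source, edited, mask):
        for s, o, m in zip(src_row, out_row, mask_row):
            if m:
                total += abs(s - o)
                count += 1
    return total / count if count else 0.0


def invert(mask):
    """Complement of a boolean mask (the region that should stay unchanged)."""
    return [[not m for m in row] for row in mask]


# Toy 3x3 grayscale images: the center pixel is the intended edit target.
source = [[0.2, 0.2, 0.2], [0.2, 0.8, 0.2], [0.2, 0.2, 0.2]]
edited = [[0.2, 0.2, 0.2], [0.2, 0.1, 0.2], [0.2, 0.2, 0.5]]
edit_mask = [[False, False, False], [False, True, False], [False, False, False]]

change_inside = masked_mean_abs_diff(source, edited, edit_mask)
drift_outside = masked_mean_abs_diff(source, edited, invert(edit_mask))
print("change inside edit region:", change_inside)
print("drift outside edit region:", drift_outside)
```

A visually consistent edit should show a large change inside the mask and near-zero drift outside it; here the altered corner pixel shows up as nonzero drift.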
Preference pairs are filtered in three steps: z-score normalization of the metrics, Pareto dominance (improvement on at least one primary metric and non-inferiority on the others), and a majority vote among auxiliary metrics for consistency. Training is performed with a standard pairwise ranking loss, which rewards the model for ranking the preferred output of each pair above the inferior one.
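The filtering steps and the training objective can be sketched as follows. Everything here is a minimal illustration under stated assumptions: all metrics are oriented so that higher is better, the function names are hypothetical, and the loss is the common logistic (Bradley–Terry style) form, which may differ in detail from the one used to train PVC-Judge.

```python
import math
from statistics import mean, pstdev


def zscores(values):
    """Normalize one metric across candidates; a constant metric maps to zeros."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]


def pareto_dominates(a, b):
    """True if a is no worse than b on every primary metric and better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def majority_agrees(aux_a, aux_b):
    """True if more than half of the auxiliary metrics also favor candidate a."""
    votes = sum(1 for x, y in zip(aux_a, aux_b) if x > y)
    return votes > len(aux_a) / 2


def pairwise_ranking_loss(score_preferred, score_inferior):
    """-log(sigmoid(margin)), written in a numerically stable form (assumed loss)."""
    margin = score_preferred - score_inferior
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy example: two primary metrics measured for candidates A and B.
ssim_like = zscores([0.95, 0.80])
neg_lpips_like = zscores([-0.05, -0.20])
cand_a = (ssim_like[0], neg_lpips_like[0])
cand_b = (ssim_like[1], neg_lpips_like[1])

keep_pair = pareto_dominates(cand_a, cand_b) and majority_agrees((0.9, 0.7, 0.4), (0.6, 0.5, 0.5))
print("keep (A preferred over B):", keep_pair)
print("loss when scores are tied:", pairwise_ranking_loss(0.0, 0.0))
```

A pair is kept only when both the dominance check and the auxiliary vote agree, which trades dataset size for cleaner preference labels.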
4. VCReward-Bench and Human Alignment
To systematically validate automated preference judgments, VCReward-Bench provides 3,506 human-annotated pairwise comparisons across 21 tasks. Candidate pairs were generated from 7 open-source models and Nano Banana Pro, i.e., 8 systems and therefore at most C(8, 2) = 28 unique pairs per prompt. Expert annotators rate each pair on Instruction Following (IF), Visual Quality (VQ), Visual Consistency (VC), and Overall. Pareto filtering ensures that VC preference pairs differ only on the VC dimension while being non-inferior on all others. The interface allows annotators to select “Prefer A,” “Prefer B,” or explicit tie options.
This careful construction facilitates robust meta-evaluation of PVC-Judge and other metrics, providing a high-fidelity alignment signal relative to human preference.
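The pair count and the Pareto filter described in this section can be sketched in a few lines. The model names are placeholders (the text only gives the number of open-source models), and `is_vc_pair` is one plausible reading of the filter rather than the benchmark's exact rule.

```python
from itertools import combinations

# Placeholder names: the text specifies 7 open-source models plus Nano Banana Pro.
MODELS = [f"open_source_model_{i}" for i in range(1, 8)] + ["Nano Banana Pro"]


def candidate_pairs(models):
    """All unordered pairs of model outputs for a single prompt."""
    return list(combinations(models, 2))


def is_vc_pair(labels):
    """Keep a pair as a VC example if VC has a strict winner that is not worse on IF or VQ.

    labels maps each dimension ("IF", "VQ", "VC") to "A", "B", or "tie".
    """
    winner = labels["VC"]
    if winner == "tie":
        return False
    return all(labels[dim] in (winner, "tie") for dim in ("IF", "VQ"))


print(len(candidate_pairs(MODELS)))
print(is_vc_pair({"IF": "tie", "VQ": "A", "VC": "A"}))
print(is_vc_pair({"IF": "B", "VQ": "tie", "VC": "A"}))
```

The second call is rejected because the VC winner loses on instruction following, so the pair would not isolate the visual consistency signal.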
5. Evaluation of Editing and Assessment Models
PVC-Judge is compared against a range of proprietary and open-source metrics on two meta-benchmarks: EditReward-Bench and VCReward-Bench. Accuracy in correctly reflecting human VC preferences is as follows:
| Model | EditReward (%) | VCReward (%) |
|---|---|---|
| Gemini 3 Pro | 87.33 | 87.13 |
| GPT-5.1 | 78.68 | 76.89 |
| EditReward (MiMo-7B) | 78.34 | 67.41 |
| EditScore-Avg@4 | 57.83 | 49.20 |
| Qwen3-VL-8B | 72.27 | 73.07 |
| PVC-Judge | 82.44 | 81.82 |
PVC-Judge outperforms every open-source metric, and also GPT-5.1, on both meta-benchmarks, trailing only Gemini 3 Pro. Task-level analysis shows especially strong performance on local tasks with complex instance-level constraints (e.g., color alteration: 94.2% accuracy versus 86.96% for the prior best).
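The accuracy figures in the table measure how often a judge picks the same side as the human annotators. A minimal sketch of that computation is shown below; the function name, the 0.5 decision threshold, and the exclusion of human ties are assumptions, since the text does not spell out the protocol.

```python
def pairwise_accuracy(judge_probs, human_prefs, threshold=0.5):
    """Percentage of pairs where the judge's choice matches the human label.

    judge_probs: probability that candidate A is preferred over candidate B.
    human_prefs: "A" or "B" for each pair (ties assumed to be filtered out).
    """
    if len(judge_probs) != len(human_prefs):
        raise ValueError("judge_probs and human_prefs must have the same length")
    if not human_prefs:
        raise ValueError("at least one pair is required")
    correct = sum(
        ("A" if prob > threshold else "B") == human
        for prob, human in zip(judge_probs, human_prefs)
    )
    return 100.0 * correct / len(human_prefs)


# Toy data: the judge disagrees with the annotators on the third pair only.
probs = [0.9, 0.3, 0.6, 0.2]
humans = ["A", "B", "B", "B"]
print(pairwise_accuracy(probs, humans))
```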
Sixteen editing models, both closed-source (e.g., GPT-Image 1.5, Nano Banana Pro, Seedream 4.5) and open-source (e.g., FLUX.2, Qwen-Image-Edit, BAGEL), were evaluated. Rankings use Bradley–Terry capability scores, which convert pairwise win/tie/loss outcomes into Elo ratings; selected models are shown below:
| Model | IF Elo | VQ Elo | VC Elo | Overall Elo |
|---|---|---|---|---|
| GPT-Image 1.5 | 1260 | 1149 | 846 | 1071 |
| Nano Banana Pro | 1126 | 1066 | 1108 | 1096 |
| Seedream 4.5 | 1111 | 1142 | 1030 | 1089 |
| FLUX.2 [klein] 9B | 1083 | 1025 | 1019 | 1039 |
| Qwen-Image-Edit 2511 | 1095 | 1060 | 972 | 1038 |
Closed-source systems tend to lead in IF and VQ, but not always in VC, with certain models (e.g., GPT-Image 1.5) penalized for excessive non-target modifications ("collateral changes"). Some open-source models demonstrate that efficient architectures can enable competitive overall performance. A salient failure mode is "under-editing" (minimal model output change), which can artificially inflate VC scores.
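The Bradley–Terry ranking behind the Elo table above can be sketched with a simple iterative fit. This is an illustration under assumptions, not the benchmark's code: ties are counted as half a win for each side, ratings are centered on 1000 with the usual 400-point logarithmic scale, and the models "X", "Y", "Z" and their results are made up.

```python
import math


def bradley_terry(models, results, iters=200):
    """Fit Bradley-Terry strengths with the classic minorization-maximization update.

    results maps (a, b) -> (wins_a, ties, wins_b); ties count as half a win each (assumption).
    """
    wins = {m: 0.0 for m in models}
    games = {}  # (i, j) -> number of comparisons between i and j
    for (a, b), (wins_a, ties, wins_b) in results.items():
        wins[a] += wins_a + 0.5 * ties
        wins[b] += wins_b + 0.5 * ties
        n = wins_a + ties + wins_b
        games[(a, b)] = games.get((a, b), 0) + n
        games[(b, a)] = games.get((b, a), 0) + n
    if any(w <= 0 for w in wins.values()):
        raise ValueError("each model needs a positive win total for a finite estimate")

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(n / (strength[i] + strength[j])
                        for (x, j), n in games.items() if x == i)
            updated[i] = wins[i] / denom
        # Rescale so the geometric mean of the strengths stays at 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}
    return strength


def to_elo(strength, anchor=1000.0, scale=400.0):
    """Map strengths to an Elo-like scale whose average equals the anchor."""
    return {m: anchor + scale * math.log10(s) for m, s in strength.items()}


# Made-up results for three hypothetical models: (wins_a, ties, wins_b).
results = {("X", "Y"): (6, 2, 2), ("X", "Z"): (7, 1, 2), ("Y", "Z"): (5, 2, 3)}
elo = to_elo(bradley_terry(["X", "Y", "Z"], results))
for name, rating in sorted(elo.items(), key=lambda kv: -kv[1]):
    print(name, round(rating))
```

Fitting a separate model per dimension (IF, VQ, VC, Overall) would produce one Elo column each, which is how a system can rank first on instruction following yet last on visual consistency.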
6. Discussion, Limitations, and Prospects
Persistent challenges remain in three areas: identity preservation (especially for small faces and background instances during local edits); structural coherence in complex relation tasks (e.g., coordinated object repositioning); and semantic comprehension, especially for open-set and cross-taxonomy instructions, which frequently lead to partial or ambiguous edits.
Proposed future directions include:
- Integration of PVC-Judge as a fine-tuning reward (reinforcement learning "in-the-loop") for editing models, enhancing output consistency with true human perception.
- Extension to multi-image editing tasks as open-source VLMs mature.
- Mitigation of automated pipeline biases by advancing segmentation, feature extraction, and decoupling methods.
- Development of differentiable region-aware metrics to allow direct, end-to-end training on composite editing objectives.
GEditBench v2 in combination with PVC-Judge and VCReward-Bench establishes a unified, open-source ecosystem for benchmark-driven research in image editing. This suite comprehensively addresses real-world instruction diversity, robustly measures visual consistency through pairwise preference, and exposes both advances and limitations in the preservation of identity, structure, and semantics in model-edited images (Jiang et al., 30 Mar 2026).