PICAEval: Evaluating Physical Realism in Editing

Updated 23 October 2025
  • PICAEval is a protocol for evaluating physical realism in image editing that uses region-specific QA to assess optical, mechanical, and state transition effects.
  • It methodically diagnoses physical consistency by decomposing edits into binary yes/no questions, thereby minimizing hallucination in automated evaluations.
  • PICAEval reveals critical gaps in current models and motivates improved training strategies built on video-derived data and multi-level instruction refinement.

PICAEval is a protocol for evaluating physical realism in image editing, introduced in the context of the PICABench benchmark (Pu et al., 20 Oct 2025). PICAEval is designed to address the shortcomings of previous benchmarks, which primarily focus on semantic instruction fidelity and overlook the crucial physical effects required for realism. The protocol employs a region-level question–answer (QA) framework and utilizes vision-language models (VLMs) as automated judges, aiming to systematically diagnose the degree to which edited images comply with the laws of physics in terms of optics, mechanics, and state transitions.

1. Definition and Motivation

PICAEval is constructed as a reliable and interpretable evaluation protocol targeted at measuring physical realism in the outputs of image editing models. Its core principle is that realistic image manipulation must not only fulfill semantic instructions (e.g., object insertion or removal) but also correctly model the associated physical effects, such as the presence, absence, or correct transformation of shadows, reflections, deformations, and inter-object interactions. PICAEval decomposes evaluation into localized, binary yes/no questions for each physics-relevant region of an edit, moving beyond global scoring approaches that conflate semantic and physical correctness.
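
To make this decomposition concrete, the following is a minimal sketch of how a single region-level test case might be represented in code. The schema and field names are illustrative assumptions, not the benchmark's published format.

```python
from dataclasses import dataclass

@dataclass
class RegionQA:
    """One localized, binary probe of a physical effect (hypothetical schema)."""
    region_bbox: tuple        # (x0, y0, x1, y1) of the physics-relevant region
    question: str             # binary yes/no probe tied to that region
    expected_answer: bool     # human-annotated ground-truth label
    dimension: str            # one of the eight sub-dimensions, e.g. "Reflection"

# Example: after removing an object in front of a mirror, its reflection must be gone.
example = RegionQA(
    region_bbox=(120, 80, 260, 210),
    question="Is the reflection of the removed object still visible in the mirror region?",
    expected_answer=False,
    dimension="Reflection",
)
```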

This approach arises from the observation that state-of-the-art editing models can satisfy user instructions yet frequently produce outputs that violate basic physical laws, for example by leaving a shadow behind after removing an object or failing to update its reflection. PICAEval aims to fill this evaluation gap with fine-grained, diagnostic feedback.

2. Protocol Structure and Workflow

PICAEval operates via a multi-stage workflow:

  • Annotation: Human annotators first delineate key regions in the edited image where physical interactions (such as the presence of a shadow, a reflection, or signs of material deformation) are expected to occur as a consequence of the edit.
  • QA Pair Generation: For each annotated region, sets of region-specific, binary QA pairs are crafted to probe specific physical properties. For example, “Is the reflection of the removed object still visible in the mirror region?”; “Has the shadow of the added object appeared in the correct light direction?”; “Have materials deformed compatibly with object insertion/removal?”
  • VLM-as-a-Judge: The protocol enlists a VLM (e.g., GPT-5) to answer the generated region-specific questions. Each VLM answer is compared against a reference label, yielding a binary classification for each physical criterion.
  • Scoring and Aggregation: Physical realism is quantified as the proportion of correctly answered questions per region or image, yielding a multidimensional view of which physical effects are rendered accurately and where failures remain. A minimal sketch of this scoring stage follows the list.
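
The sketch assumes each QA item records the VLM's predicted answer alongside the human reference label; the dictionary keys are an assumed format for illustration.

```python
from collections import defaultdict

def picaeval_scores(items):
    """Aggregate binary QA outcomes into per-image and per-dimension accuracy.

    items: iterable of dicts with keys 'image_id', 'dimension',
    'vlm_answer' (bool), and 'reference' (bool).
    """
    per_image, per_dim = defaultdict(list), defaultdict(list)
    for item in items:
        correct = item["vlm_answer"] == item["reference"]
        per_image[item["image_id"]].append(correct)
        per_dim[item["dimension"]].append(correct)
    mean = lambda xs: sum(xs) / len(xs)  # proportion of correctly answered questions
    return ({k: mean(v) for k, v in per_image.items()},
            {k: mean(v) for k, v in per_dim.items()})
```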

Unlike traditional metrics that use a single global prompt or holistic score, PICAEval’s region-level QA structure minimizes the risk of hallucination and aligns more closely with rigorous human assessment. The protocol is shown to better differentiate among model outputs, pinpointing specific physical inconsistencies otherwise masked by high-level success metrics.

3. Evaluation Dimensions

Physical realism in PICAEval is cataloged into eight sub-dimensions spanning three high-level domains: optics, mechanics, and state transitions. Each dimension is grounded in observable physical phenomena, codified as region-specific QA criteria.

| Domain | Sub-dimension | Example QA |
|---|---|---|
| Optics | Light Propagation | “Is the shadow direction consistent with the light source?” |
| Optics | Reflection | “Is the mirror image updated after object removal?” |
| Optics | Refraction | “Does the glass distort the background correctly?” |
| Optics | Light-Source Effects | “Does the added lamp change the local illumination?” |
| Mechanics | Deformation | “Do elastic objects bend realistically?” |
| Mechanics | Causality | “Did removing the support cause a collapse?” |
| State Transitions | Global State Transition | “Does the day-to-night change affect all regions?” |
| State Transitions | Local State Transition | “Did melting result in correct local changes?” |

Evaluation is performed on local regions rather than the entire image, often isolating the presence or absence of a single expected visual effect. Complementary quantitative metrics may also be computed, for example PSNR over unchanged regions, to verify that content outside the edit is preserved.
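
As an illustration of the consistency metric, the sketch below computes PSNR restricted to a boolean mask of unchanged pixels; the masking convention is an assumption, since the exact implementation is not specified here.

```python
import numpy as np

def masked_psnr(original, edited, unchanged_mask, max_val=255.0):
    """PSNR over pixels the edit should have left untouched.

    original, edited: HxWxC arrays; unchanged_mask: HxW boolean array.
    """
    orig = original[unchanged_mask].astype(np.float64)
    edit = edited[unchanged_mask].astype(np.float64)
    mse = np.mean((orig - edit) ** 2)
    if mse == 0:
        return float("inf")  # unchanged region preserved exactly
    return 10.0 * np.log10((max_val ** 2) / mse)
```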

4. VLM-based Automated Judging

Within PICAEval, the use of VLMs as “judges” enables scalable and reproducible evaluation. Instead of assigning a single holistic score to an edit, a VLM (e.g., GPT-5) receives the edited image, the annotated region, and each corresponding QA prompt. The binary response is cross-referenced with human-annotated ground truth. PICAEval demonstrates that such narrowly scoped questions greatly reduce hallucination and ambiguity: the protocol ensures that the VLM’s assessment is tied to explicit visual evidence in the given region rather than inferred or imagined global properties.
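
A single judging call might look like the following sketch; `ask_vlm` is a hypothetical placeholder for whatever vision-language model endpoint is used, and the prompt wording is an assumption rather than the protocol’s exact phrasing.

```python
def judge_region(edited_image, region_bbox, question, ask_vlm):
    """Pose one region-specific yes/no question to a VLM judge (schematic)."""
    x0, y0, x1, y1 = region_bbox
    crop = edited_image[y0:y1, x0:x1]  # bind the judge to explicit, local evidence
    prompt = (
        "Answer strictly 'yes' or 'no' based only on the visible evidence "
        f"in this image region. Question: {question}"
    )
    answer = ask_vlm(image=crop, prompt=prompt)  # hypothetical VLM API
    return answer.strip().lower().startswith("yes")
```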

Experimental findings indicate high alignment between PICAEval scores and human judgments. A plausible implication is that region-specific QA with VLMs may become a standard methodology for nuanced visual evaluation tasks.

5. Observed Challenges and Model Deficiencies

Analysis using PICAEval reveals persistent deficiencies in mainstream image editing models, even those employing large, unified multimodal architectures. Notably, models often:

  • Fail to synchronize lighting and shadowing after insertions/removals.
  • Produce unrealistic deformations or inter-object relationship inconsistencies.
  • Leave stale physical cues in edited regions, such as residual reflections, missed causality effects, or neglected material transitions.

This suggests that semantic instruction completion is orthogonal to physical law adherence, and current benchmarks that do not probe beyond semantics may overestimate realism.

6. Proposed Model Training Solutions

To remedy the observed lack of physical coherence, the protocol is paired with an improved model training methodology. The key contributions include:

  • PICA-100K Dataset: A synthetic dataset constructed from video sequences that simulate physically plausible state transitions. The pipeline first uses a text-to-image model to build static scenes, then applies an image-to-video model to synthesize dynamic transitions; a schematic sketch of this pipeline follows the list.
  • Instruction Refinement: GPT-5 is employed to refine editing instructions at multiple abstraction levels (superficial, intermediate, explicit) based on the generated video clips.
  • Editing models (e.g., FLUX.1-Kontext) fine-tuned on PICA-100K demonstrate improved physical consistency without sacrificing semantic fidelity.
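
The following is a schematic sketch of that generation pipeline as described above; `generate_image`, `animate`, and `refine_instructions` are hypothetical placeholders for the text-to-image, image-to-video, and GPT-5 refinement components, not real APIs.

```python
def build_pica_sample(scene_prompt, transition_prompt,
                      generate_image, animate, refine_instructions):
    """Produce one (source, target, instructions) editing sample (schematic)."""
    static_scene = generate_image(scene_prompt)        # text-to-image: static scene
    video = animate(static_scene, transition_prompt)   # image-to-video: physical transition
    source, target = video[0], video[-1]               # frames bracketing the transition
    # Instructions are refined at three abstraction levels from the clip.
    instructions = refine_instructions(
        video, levels=("superficial", "intermediate", "explicit"))
    return {"source": source, "target": target, "instructions": instructions}
```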

This suggests that leveraging temporal context from video data, together with multi-level instruction refinement, enhances models’ capacity for physically realistic editing as measured by PICAEval.

7. Future Directions

Advancing the PICAEval protocol and associated model training frameworks encompasses several avenues:

  • Expanding dataset scale and diversity, and incorporating finer temporal granularity.
  • Incorporating reinforcement learning-based post-training to explicitly encode physical laws rather than relying solely on supervised methods.
  • Extending the framework to support multi-image or multi-condition inputs for scenes with complex or time-varying interactions.
  • Reducing potential VLM hallucinations and sharpening region-level QA precision using richer annotations or newer vision-language architectures.

A plausible implication is that as these areas are actively developed, the gap between semantic completion and physically consistent realism in image editing will narrow, with PICAEval providing the foundational diagnostic and evaluation capabilities.

Summary

PICAEval structures the evaluation of physically realistic image editing as a region- and phenomenon-specific QA task, judged by VLMs and anchored in eight rigorously defined physical dimensions (Pu et al., 20 Oct 2025). Combined with advances in training methodologies involving temporally coherent video data and multi-level instruction refinement, PICAEval sets the standard for benchmarking and improving realism in edited images—underlining both current model limitations and prospective solutions for future research on physically faithful visual content generation.
