PICABench: How Far Are We from Physically Realistic Image Editing? (2510.17681v2)
Abstract: Image editing has achieved remarkable progress recently. Modern editing models can already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are key to generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimensions (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc.). We further propose PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset, PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with large room to explore. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.
Explain it Like I'm 14
Overview
This paper is about making image editing feel truly real. Today’s AI tools can follow instructions like “remove the dog” or “make it a sunny day,” and they often look good at first glance. But they miss important physical details, like the dog’s shadow still being on the ground, or the sunlight not changing the direction and softness of shadows. The authors created a new benchmark, called PICABench, to test whether edited images respect real-world physics, and a new evaluation method, PICAEval, to judge those edits carefully. They also built a training dataset (PICA-100K) from videos to help models learn physics better.
What questions does the paper ask?
The paper asks simple but important questions:
- Are we good at image edits that look physically realistic, not just correct in words?
- How can we measure whether an edited image follows basic laws of light, materials, and changes over time?
- Can training models with video-based examples help them learn physical effects (like shadows, reflections, weight, or weather) more reliably?
How did they study it?
The team designed three main categories to check physical realism in edited images, each with everyday examples:
- Optics (how light behaves):
- Light propagation: Do shadows point the right way and have the right softness?
- Reflection: Do mirrors or shiny surfaces show the right reflections in the right place?
- Refraction: Do things seen through glass or water bend and distort naturally?
- Light-source effects: If you add a lamp, does it cast light that fits the scene’s color and brightness?
- Mechanics (how objects and forces work):
- Deformation: Do materials bend or stay rigid in a realistic way (pillows squish, metal stays firm)?
- Causality: Do objects sit on surfaces properly, not float or intersect awkwardly? Does weight cause dents or pressure marks?
- State Transition (how things change):
- Global: Whole-scene changes, like day to night, summer to winter, rainy to sunny. Does everything update consistently (lighting, plants, ground, sky)?
- Local: Object-level changes, like melting, freezing, burning, wetting, or wrinkling. Do these effects look and spread naturally?
To evaluate edits well, they created PICAEval:
- Instead of asking a model “Is the edit good?”, they ask several small, specific yes/no questions tied to important regions in the image (for example, the mirror, the shadow area, or the contact between a shoe and the ground).
- Human annotators mark the key regions (the exact areas to inspect).
- An AI “judge” (a vision-LLM that can see images and read text) answers the questions only about those regions. This reduces guesswork and makes judgments clearer and more trustworthy.
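A minimal sketch of how such a region-grounded evaluation loop could look, assuming a hypothetical `ask_vlm` helper that sends an image crop and a yes/no question to the judge model; the paper's actual prompts, question format, and scoring details are not reproduced here:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class RegionQuestion:
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) pixel box of the annotated ROI
    question: str                    # yes/no question about the physics in this region
    expected: str                    # "yes" or "no" for a physically correct edit

def picaeval_score(edited_image, checks: List[RegionQuestion], ask_vlm: Callable) -> float:
    """Fraction of region-grounded yes/no checks that the edited image passes.

    `ask_vlm(image_crop, question)` is a hypothetical wrapper around whichever
    VLM judge is used (e.g., GPT-5 or Qwen2.5-VL) and returns "yes" or "no".
    """
    correct = 0
    for check in checks:
        crop = edited_image.crop(check.box)                 # PIL-style crop of the ROI
        answer = ask_vlm(crop, check.question).strip().lower()
        correct += int(answer == check.expected)
    return correct / max(len(checks), 1)

# Example: after removing a dog, its shadow and mirror reflection should be gone too.
checks = [
    RegionQuestion((120, 400, 380, 520), "Is there a dog-shaped shadow on the ground?", "no"),
    RegionQuestion((600, 100, 900, 400), "Does the mirror show a reflection of a dog?", "no"),
]
```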
They also built a training dataset, PICA-100K:
- They used a text-to-image model to create realistic scenes (like “a teapot on a kitchen table”).
- Then they used an image-to-video model to simulate physical changes (like “remove the teapot” or “tilt the vase until it tips over”).
- They took the first and last frames to form “before” and “after” pairs with instructions, creating over 100,000 examples focused on physics-aware edits.
- They fine-tuned a popular image-editing model using this dataset to see if it became more physically accurate.
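A rough sketch of that data pipeline, where `generate_image`, `generate_video`, and `refine_instruction` are hypothetical stand-ins for the text-to-image model, the image-to-video model, and the LLM-based instruction refinement:

```python
def build_pica_pairs(scene_prompts, edit_templates,
                     generate_image, generate_video, refine_instruction):
    """Sketch of the video-derived pair construction described above.

    The three callables are hypothetical stand-ins for the actual components:
      generate_image(prompt)            -> source image (text-to-image model)
      generate_video(image, edit_text)  -> frames simulating the edit (image-to-video model)
      refine_instruction(edit_text)     -> cleaned-up editing instruction (LLM)
    """
    pairs = []
    for scene in scene_prompts:                  # e.g., "a teapot on a kitchen table"
        source = generate_image(scene)
        for edit in edit_templates:              # e.g., "remove the teapot"
            frames = generate_video(source, edit)
            if len(frames) < 2:                  # skip degenerate generations
                continue
            pairs.append({
                "before": frames[0],             # first frame: pre-edit state
                "after": frames[-1],             # last frame: post-edit state with its physical effects
                "instruction": refine_instruction(edit),
            })
    return pairs
```

The point of pairing first and last frames is that the video model, rather than a hand-written rule, decides how shadows, reflections, and contacts should change between the two states.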
What did they find?
- Most models still struggle with physics. Even state-of-the-art systems often keep wrong shadows, reflections, or object support. Many outputs look instruction-aligned but physically off.
- Closed-source models (like GPT-Image-1 and Seedream 4.0) perform slightly better, but overall scores are still not high, showing a wide gap to truly realistic edits.
- Detailed prompts help. When the instruction clearly explains the expected physical changes (explicit prompts), models do better than when the instruction is short and vague.
- Their new evaluation method (PICAEval) agrees more with human judgments than standard “AI-as-a-judge” methods. Focusing on region-level questions makes the evaluation more reliable and interpretable.
- Training with video-derived examples (PICA-100K) improves physical realism without hurting instruction-following. The fine-tuned model became better at lighting, reflections, and deformations, though big scene changes (like global weather shifts) remain tough.
Why is this important?
When you edit images, small physical details make them feel real. Removing an object should remove its shadow and reflection. Adding a heavy dumbbell should dent a pillow. Changing to summer should brighten the light and turn snow into grass. This paper shows that:
- We need benchmarks and tests that catch these physical details, not just whether the instruction was followed.
- Region-focused, question-based evaluation makes judgments closer to what humans care about.
- Learning from video motion and state changes helps models understand how the world behaves, leading to more believable edits.
Implications and future impact
This work pushes image editing toward “physics-aware” realism. It gives researchers:
- A clear benchmark (PICABench) with practical categories that matter in everyday editing.
- A better evaluation protocol (PICAEval) that reduces AI judging errors and matches human expectations.
- A training recipe (PICA-100K) showing that video-based supervision can teach models real-world behavior.
In the future, this could make edited images trustworthy for creative work, ads, movies, and even scientific communication. The authors plan to build larger datasets, explore smarter training (like reinforcement learning), and support more complex inputs (multiple images or conditions), helping models understand and respect the physics of the world even better.
Knowledge Gaps
Below is a single, actionable list of the paper’s unresolved knowledge gaps, limitations, and open questions that future work could concretely address:
- Benchmark scope and representativeness: Expand beyond 900 samples to cover harder, long‑tail physical phenomena (e.g., volumetric lighting, subsurface scattering, caustics, smoke/fog, fluid dynamics, cloth draping, granular materials) and more diverse scene types, materials, and scales.
- Sub-dimension coverage completeness: Formalize additional optics/mechanics/state-transition sub-dimensions (e.g., global illumination consistency, interreflections, contact friction and wear, multi-body interactions, elastic/plastic behavior under varying stresses) with checkable criteria.
- Cross-domain generalization: Quantify how well PICABench performance transfers to real-world, user-supplied edits (different camera models, indoor/outdoor extremes, adverse weather, low light), and identify domain gaps between synthetic training and real scenes.
- Resolution and preprocessing effects: Assess sensitivity of physical realism scores to resizing, cropping, and max-resolution choices (1024-long-side); measure whether higher-resolution inference or tiling improves physics adherence.
- Instruction variability and language robustness: Evaluate robustness to ambiguous, under-specified, noisy, or multilingual prompts; report how performance changes with paraphrases and language styles common in real user workflows.
- Metric validity beyond VLM QA: Develop physically grounded, reference-free metrics that quantify shadow direction error, contact/support plausibility, reflection/refraction correctness (e.g., geometry-aware checks via estimated depth/normals, environment maps, differentiable rendering-based consistency scores) rather than relying primarily on yes/no VLM judgments.
- VLM-as-judge reliability: Systematically study evaluator variance across different VLMs, prompts, and seeds; provide calibration procedures, confidence estimates, and inter-evaluator agreement (with detailed statistics) to ensure reproducibility.
- Region annotation dependency: Analyze how PICAEval accuracy degrades with noisy or missing ROIs; explore methods to automatically detect physics-critical regions (reflective surfaces, contact interfaces, shadow receivers) to scale evaluation without intensive human annotation.
- Context dependence of judgments: Quantify failure rates when cropping removes essential global context (e.g., light direction cues); define criteria for when region-level evaluation is insufficient and whole-image reasoning is required.
- Human study details and significance: Report sample sizes, inter-rater reliability, confidence intervals, and statistical significance for Elo correlations; verify robustness across different participant pools and task designs.
- Statistical rigor of model comparisons: Provide confidence intervals, hypothesis tests, and power analyses for reported improvements (e.g., +1.71% overall accuracy); disclose run-to-run variance and inference budget parity.
- PSNR as “consistency” proxy: Validate whether PSNR on non-edited regions correlates with perceived preservation; compare against stronger perceptual or structural metrics (LPIPS, DISTS) and measure edit localization accuracy to ensure fair masking (a minimal masked-PSNR sketch follows this list).
- Synthetic video pipeline limitations: Address the use of only first/last frames; explore leveraging intermediate frames, temporal constraints, and motion priors to better capture global state transitions and causality.
- Synthetic vs. real video data: Investigate why the real-video-based MIRA400K underperforms; isolate dataset characteristics (motion type, camera motion, compression artifacts, instruction quality) that drive differences; propose hybrid pipelines.
- Annotation quality and bias: Audit GPT-5 generated instructions and labels for error modes, biases, and leakage; create validation subsets with high-quality human labels to benchmark annotation fidelity.
- Evaluator availability and circularity: Reduce dependence on closed-source GPT-5 for both data generation and evaluation; provide protocols that work with open-source VLMs without large performance drops to improve accessibility and replicability.
- Training objectives beyond SFT: Experiment with RL from physics-aware feedback, self-supervised physical constraints, cycle-consistency (before/after edits), energy-based or constraint-augmented losses, and curriculum learning on physics difficulty.
- 3D scene priors and differentiable rendering: Integrate monocular depth/normal estimation, inverse rendering, environment map estimation, or NeRF-style scene reconstructions to enforce lighting/shadow/reflection constraints during generation.
- Architecture-level coupling of understanding and generation: Design mechanisms that tie physical reasoning modules (scene graphs, contact/force predictors, simulators) to the generative pipeline, addressing the observed gap where “understanding ≠ realism.”
- Multi-image/multi-view conditioning: Extend the framework to support multi-view images, reference materials, HDR environment maps, or short image sequences to enforce cross-view physical consistency.
- Edit types beyond add/remove/attribute change: Include occlusion-aware edits, partial object manipulations, material conversions (e.g., wood→metal), topology changes, and multi-step sequential edits requiring consistent physics across steps.
- Adversarial/edge-case robustness: Test whether models and evaluators can be fooled by visually plausible but physically impossible edits; build adversarial subsets to stress-test physics reasoning.
- Leaderboard and continual benchmarking: Establish standardized inference budgets, seeds, and reporting protocols; plan for benchmark versioning to incorporate new sub-dimensions and periodic refreshes.
- Practical deployment constraints: Analyze computational cost of physics-aware editing and evaluation at scale; propose efficient approximations or distillation strategies for real-world systems.
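To make the PSNR-as-consistency point above concrete, a masked PSNR over the preserved region could look like the following NumPy sketch (the paper's exact masking and resolution handling may differ; comparing this against LPIPS or DISTS on the same region is the validation that item calls for):

```python
import numpy as np

def masked_psnr(original: np.ndarray, edited: np.ndarray, edit_mask: np.ndarray) -> float:
    """PSNR (dB) over the non-edited region only.

    original, edited : float arrays in [0, 1], shape (H, W, 3)
    edit_mask        : boolean array, shape (H, W), True where the edit was applied
    """
    keep = ~edit_mask                        # evaluate preservation outside the edited area
    if not keep.any():
        return float("nan")                  # whole image edited: nothing to preserve
    diff = (original - edited)[keep]         # pixels in the preserved region, shape (N, 3)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")                  # identical outside the edit region
    return 10.0 * np.log10(1.0 / mse)        # peak value is 1.0 for images scaled to [0, 1]
```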
Practical Applications
Practical Applications Derived from PICABench, PICAEval, and PICA-100K
Below we outline concrete, real-world applications that leverage the paper’s benchmark (PICABench), evaluation protocol (PICAEval), and video-derived training dataset (PICA-100K). Each item notes sectors, potential tools/workflows, and key assumptions or dependencies.
Immediate Applications
These can be deployed with current tools and modest integration effort.
- Physics-aware QA for image-editing CI/CD (software, creative tech)
- What: Integrate PICAEval as an automated gate in CI/CD to score edits on optics, mechanics, and state transitions before release (a minimal gating sketch appears after this list of immediate applications).
- Tools/workflows: “VLM-as-a-judge” API with region masks, per-case Q&A; nightly regression suites; model release dashboards.
- Assumptions/dependencies: Access to a strong VLM (e.g., GPT-5 or Qwen2.5-VL-72B) and region annotations; standardized evaluation images; compute budget.
- “Physics Check” plugin for editors (software, media/VFX, advertising)
- What: An add-on for Photoshop/After Effects/Blender that highlights regions likely violating shadows, reflections, refraction, or support/causality post-edit.
- Tools/workflows: On-canvas ROI overlays, localized yes/no checks from PICAEval, one-click fix suggestions (e.g., add/match shadow direction).
- Assumptions/dependencies: Plugin ecosystem support; VLM inference latency acceptable for interactive use; curated ROI templates per edit type.
- Prompt optimizer for consumer and pro apps (software, daily life, social/AR)
- What: Auto-rewrite user instructions into more explicit, physics-grounded prompts to boost realism (per Table 2 gains).
- Tools/workflows: In-app “Make prompt more explicit” button; LLM prompt expansion tuned to optics/mechanics cues.
- Assumptions/dependencies: LLM prompt-rewriting service; user consent and UX integration; does not leak user content.
- Preflight checker for VFX and advertising imagery (media/VFX, marketing)
- What: Batch evaluation of hero frames to flag unrealistic lighting, missed reflections, or inconsistent seasonal/weather changes.
- Tools/workflows: Shot/asset preflight scripts; per-subdimension reports with heatmaps; issue tracking for art teams.
- Assumptions/dependencies: Reference ROIs or automatic ROI proposals; high-resolution image handling; buy-in from pipelines.
- E-commerce product image compliance (retail, marketplaces)
- What: Verify edits (object removal/addition) preserve physically consistent shadows/reflections; reduce misleading listings.
- Tools/workflows: Seller-side upload validator; moderation queue prioritization by PICAEval score; automated guidance to fix.
- Assumptions/dependencies: Marketplace content policies; acceptable false-positive/negative rates; transparent appeals process.
- Deepfake and tampering triage via physics cues (security, journalism, policy)
- What: Use region-grounded physical checks to flag potentially manipulated images that defy optics/mechanics.
- Tools/workflows: Forensic triage UI with PICAEval questions, evidence regions, and confidence; routing to human analysts.
- Assumptions/dependencies: Not a standalone authenticity guarantee; complements cryptographic provenance (e.g., C2PA).
- Training recipe to improve in-house editors (software/AI providers)
- What: Fine-tune internal editing models with PICA-100K and LoRA to raise physics realism without hurting semantics.
- Tools/workflows: Adopt the paper’s LoRA config (rank=256), batch size, and optimizer; ablation on subdimensions.
- Assumptions/dependencies: License compliance for T2I/I2V backbones used to synthesize data; hardware availability; domain shift monitoring.
- Synthetic dataset bootstrapping for niche domains (education, design, AEC)
- What: Repurpose the video-to-image pipeline to create domain-specific edit pairs (e.g., interiors with glass, metals).
- Tools/workflows: Subject/scene dictionaries; GPT-based instruction refinement; I2V generation; first/last-frame pairing.
- Assumptions/dependencies: T2I/I2V models capture domain physics well; human spot checks to avoid drift/artefacts.
- AR filter validation for consumer apps (social/AR/VR)
- What: Validate that real-time filters adding/removing objects maintain consistent lighting and contact shadows.
- Tools/workflows: Offline A/B testing with PICAEval; on-device lightweight heuristics trained from labeled ROIs.
- Assumptions/dependencies: Latency constraints; mobile-friendly proxies of PICAEval; privacy-preserving processing.
- Insurance and claims fraud screening (finance/insurance)
- What: Triage photos for edits that violate support/causality (e.g., “floating” dents, conflicting shadows).
- Tools/workflows: Batch scoring; rule-based escalations; coupling with EXIF/provenance signals.
- Assumptions/dependencies: Human review stays in loop; calibrated thresholds to avoid bias and wrongful rejections.
- Editorial standards and procurement criteria (policy, enterprise IT)
- What: Require minimum per-subdimension PICAEval scores for procuring editing engines or approving campaign assets.
- Tools/workflows: RFP language tying acceptance to benchmarked scores; periodic re-audits.
- Assumptions/dependencies: Stable benchmark versions; reproducible evaluation; disclosures of evaluator model.
- Research and teaching aids (academia, education)
- What: Use PICABench to benchmark new methods; create coursework/labs on physics-aware editing with auto-grading via PICAEval.
- Tools/workflows: Public leaderboards; assignment kits with ROI/Q&A templates; ablation notebooks.
- Assumptions/dependencies: Access to evaluator VLMs; licensing of data for teaching; compute quotas for students.
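To make the CI/CD gating idea from the first item in the list above concrete, a minimal release gate could look like the sketch below. The dictionary keys mirror PICABench's eight sub-dimensions; the threshold values are purely illustrative and not taken from the paper:

```python
from typing import Dict

# Illustrative minimum scores per PICAEval sub-dimension; real thresholds
# would be calibrated against a baseline run, not copied from the paper.
THRESHOLDS: Dict[str, float] = {
    "light_propagation": 0.60,
    "reflection": 0.55,
    "refraction": 0.50,
    "light_source": 0.55,
    "deformation": 0.55,
    "causality": 0.50,
    "state_global": 0.45,
    "state_local": 0.50,
}

def release_gate(scores: Dict[str, float]) -> bool:
    """Return True only if every sub-dimension meets its minimum score."""
    failures = {k: v for k, v in scores.items() if v < THRESHOLDS.get(k, 0.0)}
    for name, value in failures.items():
        print(f"FAIL {name}: {value:.2f} < {THRESHOLDS[name]:.2f}")
    return not failures

# Example CI usage: exit non-zero to block a model release.
if __name__ == "__main__":
    import json, sys
    with open(sys.argv[1]) as f:             # e.g., picaeval_scores.json from the nightly run
        scores = json.load(f)
    sys.exit(0 if release_gate(scores) else 1)
```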
Long-Term Applications
These need further research, scaling, or engineering to be production-ready.
- Industry-wide “Physics Realism Score” and certification (policy, standards, media)
- What: A standardized, third-party certification (e.g., ISO-like) for physics plausibility of edited content, embedded in provenance metadata.
- Tools/workflows: Reference test suites; auditor APIs; C2PA extensions to include per-subdimension scores.
- Assumptions/dependencies: Multi-stakeholder governance; robust, attack-resistant evaluators; transparency requirements.
- RL-based post-training with physics rewards (software/AI research)
- What: Optimize editing models with reinforcement learning where PICAEval provides reward signals for optics/mechanics/state transitions.
- Tools/workflows: On-policy/off-policy RL pipelines; curriculum over subdimensions; safety guardrails against reward hacking.
- Assumptions/dependencies: Stable, low-variance evaluator; scalable RL infrastructure; reward-model audits.
- Real-time physics-aware AR/VR editing (AR/VR, mobile silicon)
- What: On-device generation that adapts shadows, reflections, and refractions to user pose and lighting in milliseconds.
- Tools/workflows: Distilled evaluators as auxiliary losses; neural light transport priors; sensor fusion with device IMU/ToF.
- Assumptions/dependencies: Efficient models (edge-friendly); robust scene/light estimation; thermal/power budgets.
- Video editing with coherent physical state changes (media/VFX, creator tools)
- What: Temporal editors that maintain consistent global weather/season/time-of-day shifts and local state transitions across frames.
- Tools/workflows: Video-level PICAEval variant with temporal ROIs; use of intermediate frames as supervision; scene graph constraints.
- Assumptions/dependencies: Temporal VLM judges; better motion/state simulators; memory-efficient training.
- Robotics and autonomous systems data generation (robotics, automotive)
- What: Physics-consistent synthetic imagery for perception training (contact/shadow cues, material properties) to reduce sim-to-real gap.
- Tools/workflows: Domain randomization with physics-aware edit constraints; curriculum on support/occlusion reasoning.
- Assumptions/dependencies: Transfer studies demonstrating gains; sensor-accurate rendering; safety validation.
- Remote sensing and infrastructure planning simulations (energy, urban planning)
- What: Physically plausible edits (e.g., adding solar arrays, changing vegetation seasonality) to plan and communicate projects.
- Tools/workflows: Geo-context-aware state transitions; illumination models for aerial imagery.
- Assumptions/dependencies: Domain-calibrated optics for sun angles/atmospherics; regulatory acceptance for planning artifacts.
- Authenticity scoring for journalism and civic processes (policy, civil society)
- What: Augment provenance with a physics-consistency audit to inform—but not determine—credibility assessments.
- Tools/workflows: Public dashboards with localized evidence; appeals and expert override mechanisms.
- Assumptions/dependencies: Clear communication of uncertainty; avoidance of over-reliance; governance to prevent censorship misuse.
- STEM education platforms using edit-to-learn (education)
- What: Interactive lessons where students apply edits and receive physics-based feedback on shadows, refraction, or material deformation.
- Tools/workflows: Browser-based ROIs and Q&A; mastery tracking per subdimension; teacher dashboards.
- Assumptions/dependencies: Age-appropriate content; offline-capable evaluators; equity of access.
- Domain-specialized medical and scientific simulation (healthcare, science) — high regulation
- What: Physics-aware synthetic imagery for training instruments (e.g., endoscopy lighting/shadows) or educational visualization.
- Tools/workflows: Controlled simulators; clinician-in-the-loop validation; domain-specific optics models.
- Assumptions/dependencies: Strict non-diagnostic use unless clinically validated; regulatory approvals; bias and safety audits.
- Financial KYC/AML and ad integrity checks (finance, ads policy)
- What: Physics-consistency checks to flag manipulated identity photos or deceptive ad creatives (e.g., product “results”).
- Tools/workflows: Batch scoring with risk-tier routing; human adjudication; record-keeping for audits.
- Assumptions/dependencies: Legal frameworks for automated screening; privacy-preserving processing; fairness monitoring.
- Self-serve dataset synthesis platforms (software/SaaS)
- What: No-code services to generate physics-aware edit pairs for niche verticals (e.g., furniture, automotive imagery).
- Tools/workflows: Template libraries for subject/scene; automatic ROI/Q&A generation; export to popular training stacks.
- Assumptions/dependencies: Licensing of generative backbones; cost control; content safety filters.
Cross-cutting Dependencies and Assumptions
- Evaluator strength and bias: PICAEval quality depends on VLM capabilities and can inherit evaluator biases; region grounding mitigates but does not eliminate this.
- Data provenance and licensing: Generative pipelines must respect licenses for T2I/I2V models and training data; synthetic data should be labeled as such.
- Compute and latency: Many workflows require GPU resources; real-time or on-device scenarios need distilled, efficient models and evaluators.
- Generalization limits: PICA-100K is synthetic; domain transfer to rare materials/lighting may require domain-specific augmentation and human QA.
- Governance and ethics: Use in moderation/forensics must include human oversight, clear appeal paths, and transparency about uncertainty; avoid misuse as sole authenticity arbiter.
Glossary
- AdamW optimizer: An optimization algorithm that decouples weight decay from gradient updates to improve training stability. "optimized using the AdamW optimizer with a learning rate of 10⁻⁵."
- Causality: Physical plausibility of interactions, supports, and reactions under laws like gravity and force redistribution. "Causality covers a broader range of physically plausible effects, including structural responses to force redistribution, agent reactions to added or removed stimuli, and environmental changes that alter object behavior, all of which must follow consistent physical or behavioral laws."
- Color temperature: The hue characteristic of a light source (e.g., warm vs. cool), affecting the scene’s color cast. "Common issues include mismatched color temperatures, overly hard shadows, or inconsistent falloff relative to distance."
- Deformation: Changes in object shape that must respect material properties (rigid vs. elastic) with consistent texture warping. "Deformation should follow material properties: rigid objects must retain shape, while elastic ones deform smoothly with consistent texture and geometry."
- Elo ranking: A pairwise preference scoring system often used in human studies to derive relative rankings. "We conduct a human study using Elo ranking to further validate the effectiveness of PICAEval."
- Elastic deformations: Smooth, bounded shape changes characteristic of elastic materials, preserving coherent texture. "elastic deformations should be smooth and bounded."
- Falloff: The decrease in light intensity with distance from a source, affecting brightness distribution. "brightness falloff should integrate naturally with the scene."
- Flow-based diffusion transformer: A generative architecture combining diffusion processes with transformer backbones guided by flow formulations. "a 12B flow-based diffusion transformer for image editing."
- Global state transitions: Scene-wide changes (e.g., season, weather) requiring consistent updates to lighting, shadows, and environment. "Global state transitions, such as changes in time of day, season, or weather, must update all relevant visual cues consistently, ranging from lighting and shadows to vegetation, surface conditions, and atmospheric effects."
- Image-to-video (I2V) model: A generative model that simulates temporal dynamics by expanding static images into videos. "image-to-video (I2V) models such as Wan2.2-14B (Wan et al., 2025) simulate complex dynamic processes with remarkable physical fidelity."
- Light propagation: How light travels and casts shadows consistent with source direction, softness, and occlusion. "Light propagation requires shadows that are geometrically consistent with the dominant light source, including direction, length, softness, and occlusion."
- Light-source effects: Consistency of added or modified light sources with global illumination, including color cast, penumbra, and falloff. "Light-source effects evaluate whether new light-introducing edits (like "add a lamp") are consistent with the global illumination context: color casts, shadow penumbra, and brightness falloff should integrate naturally with the scene."
- LoRA: Low-Rank Adaptation; an efficient fine-tuning method that learns low-rank updates to model weights. "We employ LoRA (Hu et al., 2022) with a rank of 256 for fine-tuning."
- Local state transitions: Physically coherent changes confined to objects or regions (e.g., freezing or melting), integrated with context. "Local state transitions, on the other hand, involve targeted physical changes confined to specific objects or regions."
- Occluders: Objects that block light or view, affecting shadow casting and visibility. "Typical failure modes include misaligned or missing cast shadows and flat shading that ignores occluders."
- Peak Signal-to-Noise Ratio (PSNR): A quantitative metric (in dB) measuring similarity, here used to assess consistency in non-edited regions. "For consistency evaluation, we compute PSNR over the non-edited regions by masking out the predicted edit area"
- Pearson correlation coefficient: A statistical measure of linear correlation between two variables. "Pearson Correlation Coefficient r=0.95."
- Penumbra: The softer, partially shaded region at the edges of a cast shadow. "shadow penumbra"
- Question-answering based metric: An evaluation approach using targeted, localized yes/no questions to assess physical plausibility. "PICAEval, a region-grounded, question-answering based metric designed to assess physical realism in a modular, interpretable manner."
- Reflection: View- and shape-dependent mirror images and highlights that must align with scene geometry. "Reflection consistency demands view-dependent behavior for specular highlights and mirror reflections."
- Refraction: Bending and distortion of background seen through transparent media, continuous with interface geometry. "Refraction requires continuous, coherent background distortion through transparent or translucent media."
- Region-grounded: Evaluation or reasoning constrained to annotated spatial regions to reduce hallucinations and improve accuracy. "PICAEval, a region-grounded, question-answering based metric"
- Region of interest (ROI): Annotated spatial areas containing physics-critical evidence used for targeted evaluation. "anchored to human-annotated regions of interest (ROIs)."
- Semantic fidelity: Accuracy of edits relative to the instruction meaning, independent of physical plausibility. "evaluate physical realism in image editing beyond semantic fidelity."
- Specular highlights: Bright, mirror-like reflections on shiny surfaces that change with viewpoint and curvature. "view-dependent behavior for specular highlights and mirror reflections."
- State Transition: A dimension of physical realism covering both global and local changes to the scene or materials. "State Transition addresses both global and local state changes."
- Text-to-image (T2I) templates: Structured prompts or templates used to generate images from text descriptions. "handcrafted text-to-image (T2I) templates"
- VLM-as-Judge: Using a vision-LLM to automatically rate or judge generated edits. "While existing VLM-as-Judge setups (Wu et al., 2025c; Niu et al., 2025; Sun et al., 2025; Zhao et al., 2025) offer a convenient way to automate evaluation"
- VQA-based evaluator: A vision-question answering model used to answer localized questions about edited regions. "passed to the VQA-based evaluator."
- World-simulator: A generative model aiming to simulate realistic physical dynamics and environments. "video generation approaching world-simulator (Wan et al., 2025)"