PICABench: How Far Are We from Physically Realistic Image Editing? (2510.17681v2)
Abstract: Image editing has achieved remarkable progress recently. Modern editing models can already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are key to generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimensions (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc.). We further propose PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset, PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with large room to explore. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.
Explain it Like I'm 14
Overview
This paper is about making image editing feel truly real. Today’s AI tools can follow instructions like “remove the dog” or “make it a sunny day,” and they often look good at first glance. But they miss important physical details, like the dog’s shadow still being on the ground, or the sunlight not changing the direction and softness of shadows. The authors created a new benchmark, called PICABench, to test whether edited images respect real-world physics, and a new evaluation method, PICAEval, to judge those edits carefully. They also built a training dataset (PICA-100K) from videos to help models learn physics better.
What questions does the paper ask?
The paper asks simple but important questions:
- Are we good at image edits that look physically realistic, not just correct in words?
- How can we measure whether an edited image follows basic laws of light, materials, and changes over time?
- Can training models with video-based examples help them learn physical effects (like shadows, reflections, weight, or weather) more reliably?
How did they study it?
The team designed three main categories to check physical realism in edited images, each with everyday examples:
- Optics (how light behaves):
- Light propagation: Do shadows point the right way and have the right softness?
- Reflection: Do mirrors or shiny surfaces show the right reflections in the right place?
- Refraction: Do things seen through glass or water bend and distort naturally?
- Light-source effects: If you add a lamp, does it cast light that fits the scene’s color and brightness?
- Mechanics (how objects and forces work):
- Deformation: Do materials bend or stay rigid in a realistic way (pillows squish, metal stays firm)?
- Causality: Do objects sit on surfaces properly, not float or intersect awkwardly? Does weight cause dents or pressure marks?
- State Transition (how things change):
- Global: Whole-scene changes, like day to night, summer to winter, rainy to sunny. Does everything update consistently (lighting, plants, ground, sky)?
- Local: Object-level changes, like melting, freezing, burning, wetting, or wrinkling. Do these effects look and spread naturally?
To evaluate edits well, they created PICAEval:
- Instead of asking a model “Is the edit good?”, they ask several small, specific yes/no questions tied to important regions in the image (for example, the mirror, the shadow area, or the contact between a shoe and the ground).
- Human annotators mark the key regions (the exact areas to inspect).
- An AI “judge” (a vision-LLM that can see images and read text) answers the questions only about those regions. This reduces guesswork and makes judgments clearer and more trustworthy.
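A minimal sketch of how such a region-grounded evaluation loop could look, assuming a hypothetical `ask_vlm` helper that sends an image crop and a yes/no question to the judge model; the paper's actual prompts, question format, and scoring details are not reproduced here:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class RegionQuestion:
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) pixel box of the annotated ROI
    question: str                    # yes/no question about the physics in this region
    expected: str                    # "yes" or "no" for a physically correct edit

def picaeval_score(edited_image, checks: List[RegionQuestion], ask_vlm: Callable) -> float:
    """Fraction of region-grounded yes/no checks that the edited image passes.

    `ask_vlm(image_crop, question)` is a hypothetical wrapper around whichever
    VLM judge is used (e.g., GPT-5 or Qwen2.5-VL) and returns "yes" or "no".
    """
    correct = 0
    for check in checks:
        crop = edited_image.crop(check.box)                 # PIL-style crop of the ROI
        answer = ask_vlm(crop, check.question).strip().lower()
        correct += int(answer == check.expected)
    return correct / max(len(checks), 1)

# Example: after removing a dog, its shadow and mirror reflection should be gone too.
checks = [
    RegionQuestion((120, 400, 380, 520), "Is there a dog-shaped shadow on the ground?", "no"),
    RegionQuestion((600, 100, 900, 400), "Does the mirror show a reflection of a dog?", "no"),
]
```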
They also built a training dataset, PICA-100K:
- They used a text-to-image model to create realistic scenes (like “a teapot on a kitchen table”).
- Then they used an image-to-video model to simulate physical changes (like “remove the teapot” or “tilt the vase until it tips over”).
- They took the first and last frames to form “before” and “after” pairs with instructions, creating over 100,000 examples focused on physics-aware edits.
- They fine-tuned a popular image-editing model using this dataset to see if it became more physically accurate.
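A rough sketch of that data pipeline, where `generate_image`, `generate_video`, and `refine_instruction` are hypothetical stand-ins for the text-to-image model, the image-to-video model, and the LLM-based instruction refinement:

```python
def build_pica_pairs(scene_prompts, edit_templates,
                     generate_image, generate_video, refine_instruction):
    """Sketch of the video-derived pair construction described above.

    The three callables are hypothetical stand-ins for the actual components:
      generate_image(prompt)            -> source image (text-to-image model)
      generate_video(image, edit_text)  -> frames simulating the edit (image-to-video model)
      refine_instruction(edit_text)     -> cleaned-up editing instruction (LLM)
    """
    pairs = []
    for scene in scene_prompts:                  # e.g., "a teapot on a kitchen table"
        source = generate_image(scene)
        for edit in edit_templates:              # e.g., "remove the teapot"
            frames = generate_video(source, edit)
            if len(frames) < 2:                  # skip degenerate generations
                continue
            pairs.append({
                "before": frames[0],             # first frame: pre-edit state
                "after": frames[-1],             # last frame: post-edit state with its physical effects
                "instruction": refine_instruction(edit),
            })
    return pairs
```

The point of pairing first and last frames is that the video model, rather than a hand-written rule, decides how shadows, reflections, and contacts should change between the two states.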
What did they find?
- Most models still struggle with physics. Even state-of-the-art systems often keep wrong shadows, reflections, or object support. Many outputs look instruction-aligned but physically off.
- Closed-source models (like GPT-Image-1 and Seedream 4.0) perform slightly better, but overall scores are still not high, showing a wide gap to truly realistic edits.
- Detailed prompts help. When the instruction clearly explains the expected physical changes (explicit prompts), models do better than when the instruction is short and vague.
- Their new evaluation method (PICAEval) agrees more with human judgments than standard “AI-as-a-judge” methods. Focusing on region-level questions makes the evaluation more reliable and interpretable.
- Training with video-derived examples (PICA-100K) improves physical realism without hurting instruction-following. The fine-tuned model became better at lighting, reflections, and deformations, though big scene changes (like global weather shifts) remain tough.
Why is this important?
When you edit images, small physical details make them feel real. Removing an object should remove its shadow and reflection. Adding a heavy dumbbell should dent a pillow. Changing to summer should brighten the light and turn snow into grass. This paper shows that:
- We need benchmarks and tests that catch these physical details, not just whether the instruction was followed.
- Region-focused, question-based evaluation makes judgments closer to what humans care about.
- Learning from video motion and state changes helps models understand how the world behaves, leading to more believable edits.
Implications and future impact
This work pushes image editing toward “physics-aware” realism. It gives researchers:
- A clear benchmark (PICABench) with practical categories that matter in everyday editing.
- A better evaluation protocol (PICAEval) that reduces AI judging errors and matches human expectations.
- A training recipe (PICA-100K) showing that video-based supervision can teach models real-world behavior.
In the future, this could make edited images trustworthy for creative work, ads, movies, and even scientific communication. The authors plan to build larger datasets, explore smarter training (like reinforcement learning), and support more complex inputs (multiple images or conditions), helping models understand and respect the physics of the world even better.
Knowledge Gaps
Below is a single, actionable list of the paper’s unresolved knowledge gaps, limitations, and open questions that future work could concretely address:
- Benchmark scope and representativeness: Expand beyond 900 samples to cover harder, long‑tail physical phenomena (e.g., volumetric lighting, subsurface scattering, caustics, smoke/fog, fluid dynamics, cloth draping, granular materials) and more diverse scene types, materials, and scales.
- Sub-dimension coverage completeness: Formalize additional optics/mechanics/state-transition sub-dimensions (e.g., global illumination consistency, interreflections, contact friction and wear, multi-body interactions, elastic/plastic behavior under varying stresses) with checkable criteria.
- Cross-domain generalization: Quantify how well PICABench performance transfers to real-world, user-supplied edits (different camera models, indoor/outdoor extremes, adverse weather, low light), and identify domain gaps between synthetic training and real scenes.
- Resolution and preprocessing effects: Assess sensitivity of physical realism scores to resizing, cropping, and max-resolution choices (1024-long-side); measure whether higher-resolution inference or tiling improves physics adherence.
- Instruction variability and language robustness: Evaluate robustness to ambiguous, under-specified, noisy, or multilingual prompts; report how performance changes with paraphrases and language styles common in real user workflows.
- Metric validity beyond VLM QA: Develop physically grounded, reference-free metrics that quantify shadow direction error, contact/support plausibility, reflection/refraction correctness (e.g., geometry-aware checks via estimated depth/normals, environment maps, differentiable rendering-based consistency scores) rather than relying primarily on yes/no VLM judgments.
- VLM-as-judge reliability: Systematically study evaluator variance across different VLMs, prompts, and seeds; provide calibration procedures, confidence estimates, and inter-evaluator agreement (with detailed statistics) to ensure reproducibility.
- Region annotation dependency: Analyze how PICAEval accuracy degrades with noisy or missing ROIs; explore methods to automatically detect physics-critical regions (reflective surfaces, contact interfaces, shadow receivers) to scale evaluation without intensive human annotation.
- Context dependence of judgments: Quantify failure rates when cropping removes essential global context (e.g., light direction cues); define criteria for when region-level evaluation is insufficient and whole-image reasoning is required.
- Human study details and significance: Report sample sizes, inter-rater reliability, confidence intervals, and statistical significance for Elo correlations; verify robustness across different participant pools and task designs.
- Statistical rigor of model comparisons: Provide confidence intervals, hypothesis tests, and power analyses for reported improvements (e.g., +1.71% overall accuracy); disclose run-to-run variance and inference budget parity.
- PSNR as “consistency” proxy: Validate whether PSNR on non-edited regions correlates with perceived preservation; compare against stronger perceptual or structural metrics (LPIPS, DISTS) and measure edit localization accuracy to ensure fair masking (a minimal masked-PSNR sketch follows this list).
- Synthetic video pipeline limitations: Address the use of only first/last frames; explore leveraging intermediate frames, temporal constraints, and motion priors to better capture global state transitions and causality.
- Synthetic vs. real video data: Investigate why the real-video-based MIRA400K underperforms; isolate dataset characteristics (motion type, camera motion, compression artifacts, instruction quality) that drive differences; propose hybrid pipelines.
- Annotation quality and bias: Audit GPT-5 generated instructions and labels for error modes, biases, and leakage; create validation subsets with high-quality human labels to benchmark annotation fidelity.
- Evaluator availability and circularity: Reduce dependence on closed-source GPT-5 for both data generation and evaluation; provide protocols that work with open-source VLMs without large performance drops to improve accessibility and replicability.
- Training objectives beyond SFT: Experiment with RL from physics-aware feedback, self-supervised physical constraints, cycle-consistency (before/after edits), energy-based or constraint-augmented losses, and curriculum learning on physics difficulty.
- 3D scene priors and differentiable rendering: Integrate monocular depth/normal estimation, inverse rendering, environment map estimation, or NeRF-style scene reconstructions to enforce lighting/shadow/reflection constraints during generation.
- Architecture-level coupling of understanding and generation: Design mechanisms that tie physical reasoning modules (scene graphs, contact/force predictors, simulators) to the generative pipeline, addressing the observed gap where “understanding ≠ realism.”
- Multi-image/multi-view conditioning: Extend the framework to support multi-view images, reference materials, HDR environment maps, or short image sequences to enforce cross-view physical consistency.
- Edit types beyond add/remove/attribute change: Include occlusion-aware edits, partial object manipulations, material conversions (e.g., wood→metal), topology changes, and multi-step sequential edits requiring consistent physics across steps.
- Adversarial/edge-case robustness: Test whether models and evaluators can be fooled by visually plausible but physically impossible edits; build adversarial subsets to stress-test physics reasoning.
- Leaderboard and continual benchmarking: Establish standardized inference budgets, seeds, and reporting protocols; plan for benchmark versioning to incorporate new sub-dimensions and periodic refreshes.
- Practical deployment constraints: Analyze computational cost of physics-aware editing and evaluation at scale; propose efficient approximations or distillation strategies for real-world systems.
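To make the PSNR-as-consistency point above concrete, a masked PSNR over the preserved region could look like the following NumPy sketch (the paper's exact masking and resolution handling may differ; comparing this against LPIPS or DISTS on the same region is the validation that item calls for):

```python
import numpy as np

def masked_psnr(original: np.ndarray, edited: np.ndarray, edit_mask: np.ndarray) -> float:
    """PSNR (dB) over the non-edited region only.

    original, edited : float arrays in [0, 1], shape (H, W, 3)
    edit_mask        : boolean array, shape (H, W), True where the edit was applied
    """
    keep = ~edit_mask                        # evaluate preservation outside the edited area
    if not keep.any():
        return float("nan")                  # whole image edited: nothing to preserve
    diff = (original - edited)[keep]         # pixels in the preserved region, shape (N, 3)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")                  # identical outside the edit region
    return 10.0 * np.log10(1.0 / mse)        # peak value is 1.0 for images scaled to [0, 1]
```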
Practical Applications
Practical Applications Derived from PICABench, PICAEval, and PICA-100K
Below we outline concrete, real-world applications that leverage the paper’s benchmark (PICABench), evaluation protocol (PICAEval), and video-derived training dataset (PICA-100K). Each item notes sectors, potential tools/workflows, and key assumptions or dependencies.
Immediate Applications
These can be deployed with current tools and modest integration effort.
- Physics-aware QA for image-editing CI/CD (software, creative tech)
- What: Integrate PICAEval as an automated gate in CI/CD to score edits on optics, mechanics, and state transitions before release (a minimal gating sketch appears after this list of immediate applications).
- Tools/workflows: “VLM-as-a-judge” API with region masks, per-case Q&A; nightly regression suites; model release dashboards.
- Assumptions/dependencies: Access to a strong VLM (e.g., GPT-5 or Qwen2.5-VL-72B) and region annotations; standardized evaluation images; compute budget.
- “Physics Check” plugin for editors (software, media/VFX, advertising)
- What: An add-on for Photoshop/After Effects/Blender that highlights regions likely violating shadows, reflections, refraction, or support/causality post-edit.
- Tools/workflows: On-canvas ROI overlays, localized yes/no checks from PICAEval, one-click fix suggestions (e.g., add/match shadow direction).
- Assumptions/dependencies: Plugin ecosystem support; VLM inference latency acceptable for interactive use; curated ROI templates per edit type.
- Prompt optimizer for consumer and pro apps (software, daily life, social/AR)
- What: Auto-rewrite user instructions into more explicit, physics-grounded prompts to boost realism (per Table 2 gains).
- Tools/workflows: In-app “Make prompt more explicit” button; LLM prompt expansion tuned to optics/mechanics cues.
- Assumptions/dependencies: LLM prompt-rewriting service; user consent and UX integration; does not leak user content.
- Preflight checker for VFX and advertising imagery (media/VFX, marketing)
- What: Batch evaluation of hero frames to flag unrealistic lighting, missed reflections, or inconsistent seasonal/weather changes.
- Tools/workflows: Shot/asset preflight scripts; per-subdimension reports with heatmaps; issue tracking for art teams.
- Assumptions/dependencies: Reference ROIs or automatic ROI proposals; high-resolution image handling; buy-in from pipelines.
- E-commerce product image compliance (retail, marketplaces)
- What: Verify edits (object removal/addition) preserve physically consistent shadows/reflections; reduce misleading listings.
- Tools/workflows: Seller-side upload validator; moderation queue prioritization by PICAEval score; automated guidance to fix.
- Assumptions/dependencies: Marketplace content policies; acceptable false-positive/negative rates; transparent appeals process.
- Deepfake and tampering triage via physics cues (security, journalism, policy)
- What: Use region-grounded physical checks to flag potentially manipulated images that defy optics/mechanics.
- Tools/workflows: Forensic triage UI with PICAEval questions, evidence regions, and confidence; routing to human analysts.
- Assumptions/dependencies: Not a standalone authenticity guarantee; complements cryptographic provenance (e.g., C2PA).
- Training recipe to improve in-house editors (software/AI providers)
- What: Fine-tune internal editing models with PICA-100K and LoRA to raise physics realism without hurting semantics.
- Tools/workflows: Adopt the paper’s LoRA config (rank=256), batch size, and optimizer; ablation on subdimensions.
- Assumptions/dependencies: License compliance for T2I/I2V backbones used to synthesize data; hardware availability; domain shift monitoring.
- Synthetic dataset bootstrapping for niche domains (education, design, AEC)
- What: Repurpose the video-to-image pipeline to create domain-specific edit pairs (e.g., interiors with glass, metals).
- Tools/workflows: Subject/scene dictionaries; GPT-based instruction refinement; I2V generation; first/last-frame pairing.
- Assumptions/dependencies: T2I/I2V models capture domain physics well; human spot checks to avoid drift/artefacts.
- AR filter validation for consumer apps (social/AR/VR)
- What: Validate that real-time filters adding/removing objects maintain consistent lighting and contact shadows.
- Tools/workflows: Offline A/B testing with PICAEval; on-device lightweight heuristics trained from labeled ROIs.
- Assumptions/dependencies: Latency constraints; mobile-friendly proxies of PICAEval; privacy-preserving processing.
- Insurance and claims fraud screening (finance/insurance)
- What: Triage photos for edits that violate support/causality (e.g., “floating” dents, conflicting shadows).
- Tools/workflows: Batch scoring; rule-based escalations; coupling with EXIF/provenance signals.
- Assumptions/dependencies: Human review stays in loop; calibrated thresholds to avoid bias and wrongful rejections.
- Editorial standards and procurement criteria (policy, enterprise IT)
- What: Require minimum per-subdimension PICAEval scores for procuring editing engines or approving campaign assets.
- Tools/workflows: RFP language tying acceptance to benchmarked scores; periodic re-audits.
- Assumptions/dependencies: Stable benchmark versions; reproducible evaluation; disclosures of evaluator model.
- Research and teaching aids (academia, education)
- What: Use PICABench to benchmark new methods; create coursework/labs on physics-aware editing with auto-grading via PICAEval.
- Tools/workflows: Public leaderboards; assignment kits with ROI/Q&A templates; ablation notebooks.
- Assumptions/dependencies: Access to evaluator VLMs; licensing of data for teaching; compute quotas for students.
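To make the CI/CD gating idea from the first item in the list above concrete, a minimal release gate could look like the sketch below. The dictionary keys mirror PICABench's eight sub-dimensions; the threshold values are purely illustrative and not taken from the paper:

```python
from typing import Dict

# Illustrative minimum scores per PICAEval sub-dimension; real thresholds
# would be calibrated against a baseline run, not copied from the paper.
THRESHOLDS: Dict[str, float] = {
    "light_propagation": 0.60,
    "reflection": 0.55,
    "refraction": 0.50,
    "light_source": 0.55,
    "deformation": 0.55,
    "causality": 0.50,
    "state_global": 0.45,
    "state_local": 0.50,
}

def release_gate(scores: Dict[str, float]) -> bool:
    """Return True only if every sub-dimension meets its minimum score."""
    failures = {k: v for k, v in scores.items() if v < THRESHOLDS.get(k, 0.0)}
    for name, value in failures.items():
        print(f"FAIL {name}: {value:.2f} < {THRESHOLDS[name]:.2f}")
    return not failures

# Example CI usage: exit non-zero to block a model release.
if __name__ == "__main__":
    import json, sys
    with open(sys.argv[1]) as f:             # e.g., picaeval_scores.json from the nightly run
        scores = json.load(f)
    sys.exit(0 if release_gate(scores) else 1)
```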
Long-Term Applications
These need further research, scaling, or engineering to be production-ready.
- Industry-wide “Physics Realism Score” and certification (policy, standards, media)
- What: A standardized, third-party certification (e.g., ISO-like) for physics plausibility of edited content, embedded in provenance metadata.
- Tools/workflows: Reference test suites; auditor APIs; C2PA extensions to include per-subdimension scores.
- Assumptions/dependencies: Multi-stakeholder governance; robust, attack-resistant evaluators; transparency requirements.
- RL-based post-training with physics rewards (software/AI research)
- What: Optimize editing models with reinforcement learning where PICAEval provides reward signals for optics/mechanics/state transitions.
- Tools/workflows: On-policy/off-policy RL pipelines; curriculum over subdimensions; safety guardrails against reward hacking.
- Assumptions/dependencies: Stable, low-variance evaluator; scalable RL infrastructure; reward-model audits.
- Real-time physics-aware AR/VR editing (AR/VR, mobile silicon)
- What: On-device generation that adapts shadows, reflections, and refractions to user pose and lighting in milliseconds.
- Tools/workflows: Distilled evaluators as auxiliary losses; neural light transport priors; sensor fusion with device IMU/ToF.
- Assumptions/dependencies: Efficient models (edge-friendly); robust scene/light estimation; thermal/power budgets.
- Video editing with coherent physical state changes (media/VFX, creator tools)
- What: Temporal editors that maintain consistent global weather/season/time-of-day shifts and local state transitions across frames.
- Tools/workflows: Video-level PICAEval variant with temporal ROIs; use of intermediate frames as supervision; scene graph constraints.
- Assumptions/dependencies: Temporal VLM judges; better motion/state simulators; memory-efficient training.
- Robotics and autonomous systems data generation (robotics, automotive)
- What: Physics-consistent synthetic imagery for perception training (contact/shadow cues, material properties) to reduce sim-to-real gap.
- Tools/workflows: Domain randomization with physics-aware edit constraints; curriculum on support/occlusion reasoning.
- Assumptions/dependencies: Transfer studies demonstrating gains; sensor-accurate rendering; safety validation.
- Remote sensing and infrastructure planning simulations (energy, urban planning)
- What: Physically plausible edits (e.g., adding solar arrays, changing vegetation seasonality) to plan and communicate projects.
- Tools/workflows: Geo-context-aware state transitions; illumination models for aerial imagery.
- Assumptions/dependencies: Domain-calibrated optics for sun angles/atmospherics; regulatory acceptance for planning artifacts.
- Authenticity scoring for journalism and civic processes (policy, civil society)
- What: Augment provenance with a physics-consistency audit to inform—but not determine—credibility assessments.
- Tools/workflows: Public dashboards with localized evidence; appeals and expert override mechanisms.
- Assumptions/dependencies: Clear communication of uncertainty; avoidance of over-reliance; governance to prevent censorship misuse.
- STEM education platforms using edit-to-learn (education)
- What: Interactive lessons where students apply edits and receive physics-based feedback on shadows, refraction, or material deformation.
- Tools/workflows: Browser-based ROIs and Q&A; mastery tracking per subdimension; teacher dashboards.
- Assumptions/dependencies: Age-appropriate content; offline-capable evaluators; equity of access.
- Domain-specialized medical and scientific simulation (healthcare, science) — high regulation
- What: Physics-aware synthetic imagery for training instruments (e.g., endoscopy lighting/shadows) or educational visualization.
- Tools/workflows: Controlled simulators; clinician-in-the-loop validation; domain-specific optics models.
- Assumptions/dependencies: Strict non-diagnostic use unless clinically validated; regulatory approvals; bias and safety audits.
- Financial KYC/AML and ad integrity checks (finance, ads policy)
- What: Physics-consistency checks to flag manipulated identity photos or deceptive ad creatives (e.g., product “results”).
- Tools/workflows: Batch scoring with risk-tier routing; human adjudication; record-keeping for audits.
- Assumptions/dependencies: Legal frameworks for automated screening; privacy-preserving processing; fairness monitoring.
- Self-serve dataset synthesis platforms (software/SaaS)
- What: No-code services to generate physics-aware edit pairs for niche verticals (e.g., furniture, automotive imagery).
- Tools/workflows: Template libraries for subject/scene; automatic ROI/Q&A generation; export to popular training stacks.
- Assumptions/dependencies: Licensing of generative backbones; cost control; content safety filters.
Cross-cutting Dependencies and Assumptions
- Evaluator strength and bias: PICAEval quality depends on VLM capabilities and can inherit evaluator biases; region grounding mitigates but does not eliminate this.
- Data provenance and licensing: Generative pipelines must respect licenses for T2I/I2V models and training data; synthetic data should be labeled as such.
- Compute and latency: Many workflows require GPU resources; real-time or on-device scenarios need distilled, efficient models and evaluators.
- Generalization limits: PICA-100K is synthetic; domain transfer to rare materials/lighting may require domain-specific augmentation and human QA.
- Governance and ethics: Use in moderation/forensics must include human oversight, clear appeal paths, and transparency about uncertainty; avoid misuse as sole authenticity arbiter.
Glossary
- AdamW optimizer: An optimization algorithm that decouples weight decay from gradient updates to improve training stability. "optimized using the AdamW optimizer with a learning rate of 10⁻⁵."
- Causality: Physical plausibility of interactions, supports, and reactions under laws like gravity and force redistribution. "Causality covers a broader range of physically plausible effects, including structural responses to force redistribution, agent reactions to added or removed stimuli, and environmental changes that alter object behavior, all of which must follow consistent physical or behavioral laws."
- Color temperature: The hue characteristic of a light source (e.g., warm vs. cool), affecting the scene’s color cast. "Common issues include mismatched color temperatures, overly hard shadows, or inconsistent falloff relative to distance."
- Deformation: Changes in object shape that must respect material properties (rigid vs. elastic) with consistent texture warping. "Deformation should follow material properties: rigid objects must retain shape, while elastic ones deform smoothly with consistent texture and geometry."
- Elo ranking: A pairwise preference scoring system often used in human studies to derive relative rankings. "We conduct a human study using Elo ranking to further validate the effectiveness of PICAEval."
- Elastic deformations: Smooth, bounded shape changes characteristic of elastic materials, preserving coherent texture. "elastic deformations should be smooth and bounded."
- Falloff: The decrease in light intensity with distance from a source, affecting brightness distribution. "brightness falloff should integrate naturally with the scene."
- Flow-based diffusion transformer: A generative architecture combining diffusion processes with transformer backbones guided by flow formulations. "a 12B flow-based diffusion transformer for image editing."
- Global state transitions: Scene-wide changes (e.g., season, weather) requiring consistent updates to lighting, shadows, and environment. "Global state transitions, such as changes in time of day, season, or weather, must update all relevant visual cues consistently, ranging from lighting and shadows to vegetation, surface conditions, and atmospheric effects."
- Image-to-video (I2V) model: A generative model that simulates temporal dynamics by expanding static images into videos. "image-to-video (I2V) models such as Wan2.2-14B (Wan et al., 2025) simulate complex dynamic processes with remarkable physical fidelity."
- Light propagation: How light travels and casts shadows consistent with source direction, softness, and occlusion. "Light propagation requires shadows that are geometrically consistent with the dominant light source, including direction, length, softness, and occlusion."
- Light-source effects: Consistency of added or modified light sources with global illumination, including color cast, penumbra, and falloff. "Light-source effects evaluate whether new light-introducing edits (like "add a lamp") are consistent with the global illumination context: color casts, shadow penumbra, and brightness falloff should integrate naturally with the scene."
- LoRA: Low-Rank Adaptation; an efficient fine-tuning method that learns low-rank updates to model weights. "We employ LoRA (Hu et al., 2022) with a rank of 256 for fine-tuning."
- Local state transitions: Physically coherent changes confined to objects or regions (e.g., freezing or melting), integrated with context. "Local state transitions, on the other hand, involve targeted physical changes confined to specific objects or regions."
- Occluders: Objects that block light or view, affecting shadow casting and visibility. "Typical failure modes include misaligned or missing cast shadows and flat shading that ignores occluders."
- Peak Signal-to-Noise Ratio (PSNR): A quantitative metric (in dB) measuring similarity, here used to assess consistency in non-edited regions. "For consistency evaluation, we compute PSNR over the non-edited regions by masking out the predicted edit area"
- Pearson correlation coefficient: A statistical measure of linear correlation between two variables. "Pearson Correlation Coefficient r=0.95."
- Penumbra: The softer, partially shaded region at the edges of a cast shadow. "shadow penumbra"
- Question-answering based metric: An evaluation approach using targeted, localized yes/no questions to assess physical plausibility. "PICAEval, a region-grounded, question-answering based metric designed to assess physical realism in a modular, interpretable manner."
- Reflection: View- and shape-dependent mirror images and highlights that must align with scene geometry. "Reflection consistency demands view-dependent behavior for specular highlights and mirror reflections."
- Refraction: Bending and distortion of background seen through transparent media, continuous with interface geometry. "Refraction requires continuous, coherent background distortion through transparent or translucent media."
- Region-grounded: Evaluation or reasoning constrained to annotated spatial regions to reduce hallucinations and improve accuracy. "PICAEval, a region-grounded, question-answering based metric"
- Region of interest (ROI): Annotated spatial areas containing physics-critical evidence used for targeted evaluation. "anchored to human-annotated regions of interest (ROIs)."
- Semantic fidelity: Accuracy of edits relative to the instruction meaning, independent of physical plausibility. "evaluate physical realism in image editing beyond semantic fidelity."
- Specular highlights: Bright, mirror-like reflections on shiny surfaces that change with viewpoint and curvature. "view-dependent behavior for specular highlights and mirror reflections."
- State Transition: A dimension of physical realism covering both global and local changes to the scene or materials. "State Transition addresses both global and local state changes."
- Text-to-image (T2I) templates: Structured prompts or templates used to generate images from text descriptions. "handcrafted text-to-image (T2I) templates"
- VLM-as-Judge: Using a vision-LLM to automatically rate or judge generated edits. "While existing VLM-as-Judge setups (Wu et al., 2025c; Niu et al., 2025; Sun et al., 2025; Zhao et al., 2025) offer a convenient way to automate evaluation"
- VQA-based evaluator: A vision-question answering model used to answer localized questions about edited regions. "passed to the VQA-based evaluator."
- World-simulator: A generative model aiming to simulate realistic physical dynamics and environments. "video generation approaching world-simulator (Wan et al., 2025)"