PhysicEdit: Physics-Aware Image Editing

Updated 28 February 2026
  • PhysicEdit is a comprehensive framework that redefines visual editing as predictive physical state transitions across multiple domains.
  • It integrates a textual-visual dual-thinking mechanism with learnable transition queries to ensure adherence to physical laws during synthesis.
  • The framework leverages the PhysicTran38K synthetic video dataset to supervise dynamic transitions, enhancing both physical realism and knowledge compliance.

PhysicEdit is an end-to-end framework for instruction-based, physics-aware image editing that formulates visual editing tasks as predictive physical state transitions. In contrast to static image-pair paradigms, PhysicEdit leverages dynamic supervision across multiple physical domains—mechanical, biological, thermal, optical, and material—to synthesize edits respecting complex causal dynamics, such as refraction and material deformation. It employs a textual-visual dual-thinking mechanism, integrating physically grounded reasoning with timestep-adaptive visual guidance through learnable transition queries. This method achieves state-of-the-art performance in physical realism and knowledge-grounded editing for open-source solutions, substantially mitigating prior models' propensity for physically implausible results (Zhao et al., 25 Feb 2026).

1. Dataset Foundation: PhysicTran38K

The PhysicEdit framework is underpinned by PhysicTran38K, a large-scale, video-based dataset purpose-built to supervise and benchmark physics-aware editing grounded in continuous transition trajectories rather than static boundary conditions. PhysicTran38K comprises 38,620 synthetic videos at 256×256 resolution, hierarchically categorized into five primary domains (Mechanical: 10,245; Biological: 10,242; Thermal: 6,602; Optical: 6,245; Material: 5,286), 16 sub-domains, and 46 distinct physical transition types (e.g., refraction, melting, germination). Each trajectory is typically 16–30 frames at approximately 24 fps, featuring a strictly static camera constraint to isolate pixel changes to physical state evolution.
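As a quick consistency check, the per-domain counts quoted above can be summed and turned into shares. The snippet below is only an arithmetic sketch; the dictionary name is illustrative and not part of the dataset release.

```python
# Per-domain video counts exactly as quoted in the text above.
DOMAIN_COUNTS = {
    "Mechanical": 10_245,
    "Biological": 10_242,
    "Thermal": 6_602,
    "Optical": 6_245,
    "Material": 5_286,
}

total = sum(DOMAIN_COUNTS.values())
shares = {name: count / total for name, count in DOMAIN_COUNTS.items()}

for name, count in DOMAIN_COUNTS.items():
    print(f"{name:<11}{count:>7}  {shares[name]:6.1%}")
print(f"{'Total':<11}{total:>7}")  # the five counts add up to 38620
```

The mechanical and biological domains together account for a little over half of the videos.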

Dataset samples, serialized in directory-based format, include:

  • Source image ($I_\mathrm{src}$) and target ($I_\mathrm{tgt}$) frames
  • Six uniformly sampled intermediate keyframes
  • Edit instruction ($T_\mathrm{edit}$)
  • Structured physics-reasoning trace ($R = \{S_0, S_1, \ldots, S_T\}$)
  • Rich metadata: domain/sub-domain labels, principle verification scores (via GPT-5-mini), contradiction tags, and source prompts

This configuration enforces explicit supervision over dynamic transitions, addressing the underspecification inherent in pairwise datasets and enabling data-driven learning of physically plausible transformations.
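One way to picture a sample is as a small record type. The sketch below is hypothetical: the class name, field names, and the keyframe-sampling rule are assumptions made for illustration, since the text only states which items a sample contains and that the six keyframes are uniformly sampled.

```python
from dataclasses import dataclass, field
from typing import List


def uniform_keyframe_indices(num_frames: int, num_keyframes: int = 6) -> List[int]:
    """Evenly spaced interior frame indices (first and last frame excluded).

    Assumed sampling rule; the source does not specify the exact scheme.
    """
    if num_frames < num_keyframes + 2:
        raise ValueError("trajectory too short for the requested keyframes")
    last = num_frames - 1
    # Cut [0, last] into num_keyframes + 1 equal spans and keep the interior cuts.
    return [round(k * last / (num_keyframes + 1)) for k in range(1, num_keyframes + 1)]


@dataclass
class TransitionSample:
    """Hypothetical in-memory view of one PhysicTran38K sample directory."""
    source_image: str              # path to I_src
    target_image: str              # path to I_tgt
    keyframes: List[str]           # paths to the six intermediate keyframes
    edit_instruction: str          # T_edit
    reasoning_trace: List[str]     # R = {S_0, ..., S_T}, one string per step
    domain: str
    sub_domain: str
    verify_score: float            # principle verification score
    contradiction_tags: List[str] = field(default_factory=list)
    source_prompt: str = ""


indices = uniform_keyframe_indices(16)
sample = TransitionSample(
    source_image="frame_000.png",
    target_image="frame_015.png",
    keyframes=[f"frame_{i:03d}.png" for i in indices],
    edit_instruction="Let the ice cube melt on the table.",
    reasoning_trace=["solid ice", "surface melting", "puddle of water"],
    domain="Thermal",
    sub_domain="phase change",
    verify_score=0.67,
)
```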

2. Dataset Construction and Annotation Procedure

The dataset construction pipeline for PhysicTran38K involves:

  1. Synthesis: Raw video generation using Wan2.2-T2V-A14B with a structured "Wan Prompt" syntax: [[Start State]] + [[Trigger Event]] + [[Transition Description]] + [[Final State]].
  2. Two-Stage Filtering:
    • Stage 1: Viewpoint Stability assessed by ViPE [Huang et al., 2025], adaptively filtering for unwanted camera motion.
    • Stage 2: Principle Verification with GPT-5-mini, generating $N=3$ candidate transition-specific physics principles and classifying them (align/contradict/unknown) per keyframe; the acceptance threshold is $S_\mathrm{verify} \geq 0.5$, where $S_\mathrm{verify} = N_\mathrm{align} / N_\mathrm{total}$.
    • Contradicted principles are stored as hard negatives for future evaluation.
  3. Constraint-Aware Annotation: For filtered samples, Qwen2.5-VL-7B generates $T_\mathrm{edit}$ (the edit instruction) and a stepwise reasoning trace $R$, narrating mid-transition physics with explicit principle alignment; contradiction with physics laws is explicitly forbidden.

Together, these pipeline elements ensure high-quality, physics-compliant supervision and a rich set of structured annotations for downstream training.
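The Stage 2 acceptance rule can be sketched in a few lines. This is a minimal illustration under stated assumptions: each (principle, keyframe) judgement is counted once toward $N_\mathrm{total}$, and a principle is kept as a hard negative if any keyframe contradicts it. The source does not spell out either convention, and the function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

LABELS = {"align", "contradict", "unknown"}


def verify_sample(
    judgements: Dict[str, List[str]], threshold: float = 0.5
) -> Tuple[bool, float, List[str]]:
    """Return (accepted, S_verify, hard_negative_principles).

    `judgements` maps each candidate principle to its per-keyframe labels.
    """
    flat = [label for labels in judgements.values() for label in labels]
    if not flat:
        raise ValueError("no judgements provided")
    bad = set(flat) - LABELS
    if bad:
        raise ValueError(f"unexpected labels: {sorted(bad)}")

    # S_verify = N_align / N_total (assumed to count every judgement once).
    score = flat.count("align") / len(flat)
    # Assumed rule: any contradicted keyframe marks the principle as a hard negative.
    hard_negatives = [p for p, labels in judgements.items() if "contradict" in labels]
    return score >= threshold, score, hard_negatives


example = {
    "ice absorbs heat and melts": ["align"] * 6,
    "melt water pools downward": ["align", "align", "contradict", "unknown", "align", "align"],
    "mass is conserved": ["unknown"] * 6,
}
accepted, score, negatives = verify_sample(example)
```

In this example 10 of 18 judgements align, so the sample passes the 0.5 threshold while one principle is retained as a hard negative.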

3. Physical State Representation and Transition Modeling

Each temporal step $t$ in a trajectory is encoded using complementary latent features:

  • $F_\mathrm{DINO}^t = \mathrm{DINOv2}(I_t)$ (semantic/structural representation)
  • $F_\mathrm{VAE}^t = \mathrm{VAE}(I_t)$ (fine-grained texture)

Supervision centers on deltas relative to the initial state:

  • $\Delta F_\mathrm{DINO}^t = F_\mathrm{DINO}^t - F_\mathrm{DINO}^0$
  • $\Delta F_\mathrm{VAE}^t = F_\mathrm{VAE}^t - F_\mathrm{VAE}^0$

For transition guidance within the diffusion backbone, a timestep-sensitive combination is constructed:

  • $F_\mathrm{tran}(t) = t \cdot \Delta F_\mathrm{DINO}^t + (1-t) \cdot \Delta F_\mathrm{VAE}^t$
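The delta features and their timestep-weighted blend can be illustrated with plain Python lists. This sketch assumes that $t$ is normalized to $[0, 1]$ and that both deltas have already been projected to a common dimension so they can be added; neither assumption is stated explicitly in the text, and the function names are illustrative.

```python
from typing import List, Sequence

Vector = List[float]


def delta(feature_t: Sequence[float], feature_0: Sequence[float]) -> Vector:
    """Delta relative to the initial state: F^t - F^0."""
    if len(feature_t) != len(feature_0):
        raise ValueError("feature vectors must have the same length")
    return [a - b for a, b in zip(feature_t, feature_0)]


def transition_feature(t: float, delta_dino: Sequence[float], delta_vae: Sequence[float]) -> Vector:
    """F_tran(t) = t * dF_DINO + (1 - t) * dF_VAE, with t assumed in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if len(delta_dino) != len(delta_vae):
        raise ValueError("deltas must share a dimension (assumed pre-projected)")
    return [t * d + (1.0 - t) * v for d, v in zip(delta_dino, delta_vae)]


d_dino = delta([3.0, 5.0], [1.0, 1.0])   # [2.0, 4.0]
d_vae = delta([1.0, 9.0], [1.0, 1.0])    # [0.0, 8.0]
blended = transition_feature(0.25, d_dino, d_vae)
```

At $t = 1$ the guidance reduces to the structural (DINOv2) delta, and at $t = 0$ to the texture (VAE) delta.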

Physics is enforced through both continuous and discrete formulations, including compliance with conservation laws (e.g., $E_\mathrm{kin}(t) + E_\mathrm{pot}(t) = \mathrm{const}$), momentum continuity ($\|p_{t+1} - p_t\| \leq \alpha \Delta t$), and specific physical principles (e.g., Snell's law: $n_1 \sin \theta_1 = n_2 \sin \theta_2$).
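To make two of these constraints concrete, the sketch below evaluates Snell's law and the momentum-continuity bound numerically. It is only an illustration of the formulas; the text does not describe how such checks are implemented in PhysicEdit, and the function names are hypothetical.

```python
import math
from typing import Optional, Sequence


def refraction_angle(n1: float, n2: float, theta1_deg: float) -> Optional[float]:
    """Solve n1 * sin(theta1) = n2 * sin(theta2) for theta2 in degrees.

    Returns None when no refracted ray exists (total internal reflection).
    """
    s = n1 * math.sin(math.radians(theta1_deg)) / n2
    if abs(s) > 1.0:
        return None
    return math.degrees(math.asin(s))


def momentum_is_continuous(
    p_next: Sequence[float], p_prev: Sequence[float], alpha: float, dt: float
) -> bool:
    """Check ||p_{t+1} - p_t|| <= alpha * dt using the Euclidean norm."""
    return math.dist(p_next, p_prev) <= alpha * dt


theta2 = refraction_angle(1.0, 1.5, 30.0)   # air into glass, about 19.47 degrees
blocked = refraction_angle(1.5, 1.0, 60.0)  # beyond the critical angle, so None
smooth = momentum_is_continuous((3.0, 4.0), (0.0, 0.0), alpha=12.0, dt=0.5)
```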

The composite training objective includes a standard diffusion loss $L_\mathrm{diff}$ and a transition alignment term:

$$L_\mathrm{tran} = \sum_t \left[ t \cdot \|E_Q(q_t)_D - \Delta F_\mathrm{DINO}^t\|_2 + (1-t) \cdot \|E_Q(q_t)_V - \Delta F_\mathrm{VAE}^t\|_2 \right]$$

where $q_t$ are the learnable transition queries and $E_Q$ their respective projection heads; the subscripts $D$ and $V$ select the outputs compared against the DINOv2 and VAE deltas, respectively.
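The alignment term can be written out directly from the formula. The following is a minimal sketch using plain lists; it assumes $t$ is normalized to $[0, 1]$, uses the unsquared L2 norm as written above, and takes the projected query outputs as given rather than modelling the projection heads. All names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

# One entry per step: (t, projected DINO output, DINO delta, projected VAE output, VAE delta)
Step = Tuple[float, Sequence[float], Sequence[float], Sequence[float], Sequence[float]]


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (unsquared L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transition_loss(steps: List[Step]) -> float:
    """L_tran = sum_t [ t * ||E_D - dF_DINO|| + (1 - t) * ||E_V - dF_VAE|| ]."""
    total = 0.0
    for t, pred_dino, target_dino, pred_vae, target_vae in steps:
        total += t * l2_distance(pred_dino, target_dino)
        total += (1.0 - t) * l2_distance(pred_vae, target_vae)
    return total


steps: List[Step] = [
    (0.0, [0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]),  # DINO error ignored at t = 0
    (0.5, [3.0, 4.0], [0.0, 0.0], [0.0, 0.0], [6.0, 8.0]),  # 0.5 * 5 + 0.5 * 10
    (1.0, [1.0, 1.0], [1.0, 1.0], [9.0, 9.0], [0.0, 0.0]),  # VAE error ignored at t = 1
]
loss = transition_loss(steps)
```

The weighting mirrors $F_\mathrm{tran}(t)$: the structural term dominates as $t$ approaches 1 and the texture term as $t$ approaches 0.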

4. Architecture and Inference

PhysicEdit’s architecture is distinguished by its textual-visual dual-thinking design:

  • During training, 64 learnable transition queries (transition tokens) are fed into the frozen Qwen2.5-VL backbone together with the initial frame $I_\mathrm{src}$ and the textual edit instruction $T_\mathrm{edit}$. The resulting query embeddings are directly supervised to align with the per-timestep deltas $(\Delta F_\mathrm{DINO}^t, \Delta F_\mathrm{VAE}^t)$.
  • At inference, only $I_\mathrm{src}$ and $T_\mathrm{edit}$ are provided as inputs. Qwen2.5-VL infers a structured physics reasoning trace, while the learned transition queries implicitly supply prior knowledge of typical physical trajectories. A cascaded diffusion model (MMDiT) is then guided simultaneously by the textual trace and the stepwise visual priors $F_\mathrm{tran}(t)$.

This arrangement enables physically grounded, temporally consistent synthesis without generating explicit intermediate frames (which can accumulate errors and incur high computational cost).
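The inference-time data flow can be outlined as a short control loop. Everything here is a hypothetical sketch: the three callables stand in for Qwen2.5-VL reasoning, the learned transition priors, and one MMDiT denoising step, and the linear timestep schedule running from 1 toward 0 is an assumption rather than a detail given in the text.

```python
from typing import Any, Callable, List


def physics_aware_edit(
    source_image: Any,
    instruction: str,
    init_latent: Any,
    infer_trace: Callable[[Any, str], List[str]],             # stand-in for Qwen2.5-VL reasoning
    transition_prior: Callable[[Any, str, float], Any],       # stand-in for F_tran(t)
    denoise_step: Callable[[Any, float, List[str], Any], Any],  # stand-in for one MMDiT step
    num_steps: int = 4,
) -> Any:
    """Run a denoising loop guided by a textual trace and per-step visual priors."""
    if num_steps < 1:
        raise ValueError("num_steps must be positive")
    # Textual thinking: the reasoning trace is inferred once from (I_src, T_edit).
    trace = infer_trace(source_image, instruction)
    latent = init_latent
    for i in range(num_steps):
        t = 1.0 - i / num_steps  # assumed schedule: 1.0, ..., 1 / num_steps
        # Visual thinking: a timestep-dependent prior; no intermediate frames are rendered.
        prior = transition_prior(source_image, instruction, t)
        latent = denoise_step(latent, t, trace, prior)
    return latent


# Toy stand-ins that only record the order of calls.
visited: List[float] = []
result = physics_aware_edit(
    source_image="img",
    instruction="melt the ice",
    init_latent=0,
    infer_trace=lambda img, text: ["solid", "melting", "liquid"],
    transition_prior=lambda img, text, t: t,
    denoise_step=lambda latent, t, trace, prior: (visited.append(t), latent + 1)[1],
)
```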

5. Evaluation Metrics and Empirical Results

PhysicEdit is quantitatively assessed using:

  • Physical Realism (PICABench): Aggregates eight physical plausibility subscores (light propagation, source effects, reflection, refraction, deformation, causality, global/local transitions) into an Overall Physical Realism Score. Formalized as:

$$\text{Physical Realism} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\text{M obeys physical laws on task}_i]$$

  • Knowledge-Grounded Editing (KRISBench): Scores factual, conceptual, and procedural alignment, operationalizing:

$$\text{Knowledge Score} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\text{M cites correct factual or conceptual principles on task}_i]$$
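Both formalizations are averages of per-task indicator values. A minimal sketch, assuming each task yields a single pass/fail judgement (the benchmarks' actual scoring protocols may be more fine-grained, and the function name is illustrative):

```python
from typing import Sequence


def indicator_mean(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks whose indicator is 1, i.e. (1/N) * sum of indicators."""
    if not outcomes:
        raise ValueError("at least one task outcome is required")
    return sum(1 for passed in outcomes if passed) / len(outcomes)


# Four tasks, three of which pass the check.
score = indicator_mean([True, True, False, True])
score_percent = 100.0 * score  # scores are commonly reported on a 0-100 scale
```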

PhysicEdit raises overall physical realism from 61.26 to 64.86, a gain of 3.60 points (about 5.9% in relative terms), and achieves a reported 10.1 percentage point gain in knowledge-grounded editing over static pair baselines, surpassing open-source competitors and approaching proprietary model performance.
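The two physical realism scores quoted above allow the size of the gain to be checked directly. The snippet is plain arithmetic on those two numbers; the variable names are illustrative.

```python
baseline_score = 61.26    # static-pair baseline, overall physical realism
physicedit_score = 64.86  # PhysicEdit, overall physical realism

absolute_gain = physicedit_score - baseline_score  # in score points
relative_gain = absolute_gain / baseline_score     # as a fraction of the baseline

print(f"{absolute_gain:.2f} points, {relative_gain:.1%} relative")
# prints "3.60 points, 5.9% relative"
```

The difference between the scores is 3.60 points, so the 5.9 figure corresponds to the relative improvement over the baseline.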

6. Synthesis of Mechanisms and Limitations

PhysicEdit demonstrates that jointly modeling textual constraints (for logical physical transitions and principle compliance) and implicit visual transitions (capturing deformation and optical subtleties) yields synergistic improvements in both realism and knowledge compliance. The use of implicit latent transition priors avoids the limitations of explicit intermediate-frame synthesis, mitigating error accumulation and reducing computation.

However, the framework is subject to several limitations:

  • All PhysicTran38K data are synthetic; real-world scenarios may present more complex lighting, noise, and object interactions.
  • The enforced static-camera requirement ensures background stability but restricts camera viewpoint variety, which may be a limitation for broader visual editing tasks.
  • Prospective extensions include scaling to higher image resolutions, longer temporal horizons, multimodal (stereo/RGB-D) data, and incorporating real-capture datasets to close the sim-to-real gap.

A plausible implication is that future developments along these axes will enhance the generalization of physics-aware editing solutions to unconstrained, real-world settings, further narrowing the divide between simulation-trained and real-scene capable models (Zhao et al., 25 Feb 2026).
