PhysicEdit: Physics-Aware Image Editing

Updated 28 February 2026

PhysicEdit is a comprehensive framework that redefines visual editing as predictive physical state transitions across multiple domains.
It integrates a textual-visual dual-thinking mechanism with learnable transition queries to ensure adherence to physical laws during synthesis.
The framework leverages the PhysicTran38K synthetic video dataset to supervise dynamic transitions, enhancing both physical realism and knowledge compliance.

PhysicEdit is an end-to-end framework for instruction-based, physics-aware image editing that formulates visual editing tasks as predictive physical state transitions. In contrast to static image-pair paradigms, PhysicEdit leverages dynamic supervision across multiple physical domains—mechanical, biological, thermal, optical, and material—to synthesize edits respecting complex causal dynamics, such as refraction and material deformation. It employs a textual-visual dual-thinking mechanism, integrating physically grounded reasoning with timestep-adaptive visual guidance through learnable transition queries. This method achieves state-of-the-art performance in physical realism and knowledge-grounded editing for open-source solutions, substantially mitigating prior models' propensity for physically implausible results (Zhao et al., 25 Feb 2026).

1. Dataset Foundation: PhysicTran38K

The PhysicEdit framework is underpinned by PhysicTran38K, a large-scale, video-based dataset purpose-built to supervise and benchmark physics-aware editing grounded in continuous transition trajectories rather than static boundary conditions. PhysicTran38K comprises 38,620 synthetic videos at 256×256 resolution, hierarchically categorized into five primary domains (Mechanical: 10,245; Biological: 10,242; Thermal: 6,602; Optical: 6,245; Material: 5,286), 16 sub-domains, and 46 distinct physical transition types (e.g., refraction, melting, germination). Each trajectory is typically 16–30 frames at approximately 24 fps, featuring a strictly static camera constraint to isolate pixel changes to physical state evolution.

Dataset samples, serialized in directory-based format, include:

Source image $(I_\mathrm{src})$ and target $(I_\mathrm{tgt})$ frames
Six uniformly sampled intermediate keyframes
Edit instruction $(T_\mathrm{edit})$
Structured physics-reasoning trace $(R = \{S_0, S_1, ..., S_T\})$
Rich metadata: domain/sub-domain labels, principle verification scores (via GPT-5-mini), contradiction tags, and source prompts

This configuration enforces explicit supervision over dynamic transitions, addressing the underspecification inherent in pairwise datasets and enabling data-driven learning of physically plausible transformations.
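The sample layout listed above can be pictured as a small record type. This is a minimal sketch only: the class name `TransitionSample`, its field names, and the `validate` helper are hypothetical and are not taken from the dataset release. The domain counts are the ones quoted in this section.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Per-domain video counts as quoted in the text.
DOMAIN_COUNTS: Dict[str, int] = {
    "Mechanical": 10_245,
    "Biological": 10_242,
    "Thermal": 6_602,
    "Optical": 6_245,
    "Material": 5_286,
}


@dataclass
class TransitionSample:
    """Hypothetical in-memory view of one PhysicTran38K sample."""
    source_image: str            # path to I_src
    target_image: str            # path to I_tgt
    keyframes: List[str]         # six uniformly sampled intermediate keyframes
    edit_instruction: str        # T_edit
    reasoning_trace: List[str]   # R = {S_0, ..., S_T}
    metadata: Dict[str, object] = field(default_factory=dict)

    def validate(self) -> None:
        """Check the structural constraints stated in the text."""
        if len(self.keyframes) != 6:
            raise ValueError("expected exactly six intermediate keyframes")
        if not self.reasoning_trace:
            raise ValueError("reasoning trace must contain at least one state")


# The five domain counts add up to the stated dataset size.
assert sum(DOMAIN_COUNTS.values()) == 38_620
```

Keeping the keyframes and the reasoning trace next to the source and target frames reflects the point made above: the supervision signal is the whole trajectory, not just its endpoints.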

2. Dataset Construction and Annotation Procedure

The dataset construction pipeline for PhysicTran38K involves:

Synthesis: Raw video generation using Wan2.2-T2V-A14B with a structured "Wan Prompt" syntax: [Start State] + [Trigger Event] + [Transition Description] + [Final State].
Two-Stage Filtering:
- Stage 1: Viewpoint Stability assessed by ViPE [Huang et al., 2025], adaptively filtering for unwanted camera motion.
- Stage 2: Principle Verification with GPT-5-mini, which generates $N=3$ candidate transition-specific physics principles and classifies each as align, contradict, or unknown per keyframe. A sample is accepted when $S_\mathrm{verify} \geq 0.5$, where $S_\mathrm{verify} = N_\mathrm{align} / N_\mathrm{total}$.
- Contradicted principles are stored as hard negatives for future evaluation.
Constraint-Aware Annotation: For samples that pass filtering, Qwen2.5-VL-7B generates the edit instruction $T_\mathrm{edit}$ and a stepwise reasoning trace $R$ that narrates the mid-transition physics with explicit principle alignment; statements that contradict physical laws are explicitly forbidden.

Together, these pipeline elements ensure high-quality, physics-compliant supervision and a rich set of structured annotations for downstream training.
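The Stage 2 acceptance rule can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: it assumes one label per candidate principle (so $N_\mathrm{total} = 3$ when $N = 3$), and the function names `verify_score`, `accept`, and `hard_negatives` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List

VALID_LABELS = {"align", "contradict", "unknown"}


def verify_score(labels: Iterable[str]) -> float:
    """S_verify = N_align / N_total over the classified principles."""
    counts = Counter(labels)
    unexpected = set(counts) - VALID_LABELS
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one classified principle is required")
    return counts["align"] / total


def accept(labels: Iterable[str], threshold: float = 0.5) -> bool:
    """Keep a video when S_verify meets the acceptance threshold."""
    return verify_score(labels) >= threshold


def hard_negatives(principles: List[str], labels: List[str]) -> List[str]:
    """Collect contradicted principles, which the pipeline stores for later evaluation."""
    return [p for p, label in zip(principles, labels) if label == "contradict"]
```

Under the one-label-per-principle assumption, a threshold of 0.5 with three principles means that at least two of them must be judged as aligned for the video to be kept.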

3. Physical State Representation and Transition Modeling

Each temporal step $t$ in a trajectory is encoded using complementary latent features:

$F_\mathrm{DINO}^t = \mathrm{DINOv2}(I_t)$ (semantic/structural representation)
$F_\mathrm{VAE}^t = \mathrm{VAE}(I_t)$ (fine-grained texture)

Supervision centers on deltas relative to the initial state:

$\Delta F_\mathrm{DINO}^t = F_\mathrm{DINO}^t - F_\mathrm{DINO}^0$
$\Delta F_\mathrm{VAE}^t = F_\mathrm{VAE}^t - F_\mathrm{VAE}^0$

For transition guidance within the diffusion backbone, a timestep-sensitive combination is constructed:

$F_\mathrm{tran}(t) = t \cdot \Delta F_\mathrm{DINO}^t + (1-t)\cdot \Delta F_\mathrm{VAE}^t$
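The delta features and their timestep-weighted blend can be written out directly. The sketch below is illustrative and rests on two assumptions that the text does not state: that $t$ is normalized to $[0, 1]$, and that the DINO and VAE deltas have already been brought to a common dimension. Plain Python lists stand in for feature tensors, and the function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def delta(feature_t: Sequence[float], feature_0: Sequence[float]) -> Vector:
    """Delta relative to the initial state: F^t - F^0."""
    if len(feature_t) != len(feature_0):
        raise ValueError("features must have the same dimension")
    return [a - b for a, b in zip(feature_t, feature_0)]


def transition_feature(t: float,
                       delta_dino: Sequence[float],
                       delta_vae: Sequence[float]) -> Vector:
    """F_tran(t) = t * dF_DINO^t + (1 - t) * dF_VAE^t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t is assumed to be normalized to [0, 1]")
    if len(delta_dino) != len(delta_vae):
        raise ValueError("deltas are assumed to share one dimension")
    return [t * d + (1.0 - t) * v for d, v in zip(delta_dino, delta_vae)]
```

Read off the formula, the structural (DINO) delta dominates as $t$ approaches 1 and the texture (VAE) delta dominates as $t$ approaches 0.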

Physics is enforced through both continuous and discrete formulations, including compliance with conservation laws (e.g., $E_\mathrm{kin}(t) + E_\mathrm{pot}(t) = \mathrm{const}$), momentum continuity ($\|p_{t+1} - p_t\| \leq \alpha \Delta t$), and specific physical principles (e.g., Snell's law: $n_1 \sin \theta_1 = n_2 \sin \theta_2$).
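The three constraints named above are simple to check numerically. The following sketch shows what such checks might look like; it is not a description of how the framework applies them. The function names, the scalar (one-dimensional) momentum, and the tolerance value are assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def refraction_angle(n1: float, n2: float, theta1_deg: float) -> Optional[float]:
    """Solve Snell's law n1*sin(theta1) = n2*sin(theta2) for theta2 in degrees.

    Returns None when no refracted ray exists (total internal reflection).
    """
    s = n1 * math.sin(math.radians(theta1_deg)) / n2
    if abs(s) > 1.0:
        return None
    return math.degrees(math.asin(s))


def energy_conserved(kinetic: Sequence[float],
                     potential: Sequence[float],
                     rel_tol: float = 1e-6) -> bool:
    """Check that E_kin(t) + E_pot(t) stays constant across the trajectory."""
    totals = [k + p for k, p in zip(kinetic, potential)]
    return all(math.isclose(e, totals[0], rel_tol=rel_tol) for e in totals)


def momentum_continuous(momenta: Sequence[float], alpha: float, dt: float) -> bool:
    """Check |p_{t+1} - p_t| <= alpha * dt for consecutive (scalar) momenta."""
    return all(abs(b - a) <= alpha * dt for a, b in zip(momenta, momenta[1:]))
```

For example, a ray passing from air ($n_1 = 1.0$) into water ($n_2 \approx 1.33$) at 30 degrees of incidence refracts to roughly 22 degrees, which is the kind of bend an edited image of a submerged object should exhibit.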

The composite training objective includes a standard diffusion loss $L_\mathrm{diff}$ and a transition alignment term:

$L_\mathrm{tran} = \sum_t [ t\cdot \|E_Q(q_t)_D - \Delta F_\mathrm{DINO}^t\|_2 + (1-t)\cdot \|E_Q(q_t)_V - \Delta F_\mathrm{VAE}^t\|_2 ]$

where $q_t$ are the learnable transition queries and $E_Q$ denotes their projection heads, with the subscripts $D$ and $V$ marking the outputs aligned to the DINO and VAE deltas, respectively.
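The transition alignment term can be evaluated as follows. This is a minimal sketch assuming that $t$ is normalized to $[0, 1]$ and that the projected query outputs are available as plain vectors; the argument names `pred_dino` and `pred_vae` (standing for $E_Q(q_t)_D$ and $E_Q(q_t)_V$) are hypothetical.

```python
import math
from typing import Sequence


def l2_norm(v: Sequence[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def transition_loss(timesteps: Sequence[float],
                    pred_dino: Sequence[Sequence[float]],
                    pred_vae: Sequence[Sequence[float]],
                    delta_dino: Sequence[Sequence[float]],
                    delta_vae: Sequence[Sequence[float]]) -> float:
    """L_tran = sum_t [ t*||pred_D - dF_DINO||_2 + (1-t)*||pred_V - dF_VAE||_2 ]."""
    loss = 0.0
    for t, pd, pv, dd, dv in zip(timesteps, pred_dino, pred_vae, delta_dino, delta_vae):
        dino_err = l2_norm([a - b for a, b in zip(pd, dd)])
        vae_err = l2_norm([a - b for a, b in zip(pv, dv)])
        loss += t * dino_err + (1.0 - t) * vae_err
    return loss
```

The weights mirror those in $F_\mathrm{tran}(t)$, so each query is penalized most heavily on the feature stream that dominates the guidance at that timestep.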

4. Architecture and Inference

PhysicEdit’s architecture is distinguished by its textual-visual dual-thinking design:

During training, 64 learnable transition queries (referred to here as transition tokens) are concatenated with the initial frame $I_\mathrm{src}$ and the textual edit instruction $T_\mathrm{edit}$ as input to the frozen Qwen2.5-VL backbone. The resulting embeddings are directly supervised to align with the per-timestep deltas $(\Delta F_\mathrm{DINO}^t, \Delta F_\mathrm{VAE}^t)$.
At inference, only $I_\mathrm{src}$ and $T_\mathrm{edit}$ are inputs. Qwen2.5-VL infers a structured physics reasoning trace. The learned transition queries implicitly supply prior knowledge of typical physical trajectories. A cascaded diffusion model (MMDiT) is then guided simultaneously by the textual trace and the stepwise visual priors $F_\mathrm{tran}(t)$.

This arrangement enables physically grounded, temporally consistent synthesis without generating explicit intermediate frames (which can accumulate errors and incur high computational cost).

5. Evaluation Metrics and Empirical Results

PhysicEdit is quantitatively assessed using:

Physical Realism (PICABench): Aggregates eight physical plausibility subscores (light propagation, source effects, reflection, refraction, deformation, causality, global/local transitions) into an Overall Physical Realism Score. Formalized as:

$\text{Physical Realism} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\text{M obeys physical laws on task}_i]$

Knowledge-Grounded Editing (KRISBench): Scores factual, conceptual, and procedural alignment, operationalizing:

$\text{Knowledge Score} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\text{M cites correct factual or conceptual principles on task}_i]$
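Both formulas are averages of binary indicators over $N$ tasks, so they share one computation. The sketch below assumes that per-task pass/fail judgments are already available; how each benchmark produces those judgments is outside its scope, and the function name `indicator_mean` is hypothetical.

```python
from typing import Iterable


def indicator_mean(outcomes: Iterable[bool]) -> float:
    """Fraction of tasks on which the model satisfies the criterion."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("at least one task outcome is required")
    return sum(1 for passed in outcomes if passed) / len(outcomes)


# Illustrative outcomes only, not benchmark data: three of four tasks pass.
example_score = indicator_mean([True, False, True, True])
```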

PhysicEdit achieves a 5.9% relative improvement in overall physical realism (64.86 vs. 61.26, an absolute gain of 3.60 points) and a 10.1 percentage point gain in knowledge-grounded editing over static pair baselines, surpassing open-source competitors and approaching proprietary model performance.
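The two physical realism scores quoted above (64.86 and 61.26) can be compared in absolute and in relative terms, and the two readings give different numbers. The short calculation below makes the distinction explicit; the function names are illustrative.

```python
def absolute_gain(new: float, old: float) -> float:
    """Difference in score, in points."""
    return new - old


def relative_gain(new: float, old: float) -> float:
    """Improvement relative to the baseline, in percent."""
    return (new - old) / old * 100.0


points = absolute_gain(64.86, 61.26)    # about 3.60 points
percent = relative_gain(64.86, 61.26)   # about 5.9 percent of the baseline
```

The 5.9 figure therefore matches the relative improvement over the baseline, while the absolute difference between the two scores is about 3.6 points.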

6. Synthesis of Mechanisms and Limitations

PhysicEdit demonstrates that jointly modeling textual constraints (for logical physical transitions and principle compliance) and implicit visual transitions (capturing deformation and optical subtleties) yields synergistic improvements in both realism and knowledge compliance. The use of implicit latent transition priors avoids the limitations of explicit intermediate-frame synthesis, mitigating error accumulation and reducing computation.

However, the framework is subject to several limitations:

All PhysicTran38K data are synthetic; real-world scenarios may present more complex lighting, noise, and object interactions.
The enforced static-camera requirement ensures background stability but restricts camera viewpoint variety, which may be a limitation for broader visual editing tasks.
Prospective extensions include scaling to higher image resolutions, longer temporal horizons, multimodal (stereo/RGB-D) data, and incorporating real-capture datasets to close the sim-to-real gap.

A plausible implication is that future developments along these axes will enhance the generalization of physics-aware editing solutions to unconstrained, real-world settings, further narrowing the divide between simulation-trained and real-scene capable models (Zhao et al., 25 Feb 2026).

References (1)

From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors (2026)
