PhysicEdit: Physics-Aware Image Editing
- PhysicEdit is a comprehensive framework that redefines visual editing as predictive physical state transitions across multiple domains.
- It integrates a textual-visual dual-thinking mechanism with learnable transition queries to ensure adherence to physical laws during synthesis.
- The framework leverages the PhysicTran38K synthetic video dataset to supervise dynamic transitions, enhancing both physical realism and knowledge compliance.
PhysicEdit is an end-to-end framework for instruction-based, physics-aware image editing that formulates visual editing as predictive physical state transitions. In contrast to static image-pair paradigms, PhysicEdit leverages dynamic supervision across five physical domains (mechanical, biological, thermal, optical, and material) to synthesize edits that respect complex causal dynamics, such as refraction and material deformation. It employs a textual-visual dual-thinking mechanism that integrates physically grounded reasoning with timestep-adaptive visual guidance through learnable transition queries. The method achieves state-of-the-art performance among open-source methods in physical realism and knowledge-grounded editing, substantially reducing the tendency of prior models to produce physically implausible results (Zhao et al., 25 Feb 2026).
1. Dataset Foundation: PhysicTran38K
The PhysicEdit framework is underpinned by PhysicTran38K, a large-scale, video-based dataset purpose-built to supervise and benchmark physics-aware editing grounded in continuous transition trajectories rather than static boundary conditions. PhysicTran38K comprises 38,620 synthetic videos at 256×256 resolution, hierarchically categorized into five primary domains (Mechanical: 10,245; Biological: 10,242; Thermal: 6,602; Optical: 6,245; Material: 5,286), 16 sub-domains, and 46 distinct physical transition types (e.g., refraction, melting, germination). Each trajectory is typically 16–30 frames at approximately 24 fps, featuring a strictly static camera constraint to isolate pixel changes to physical state evolution.
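As a quick consistency check, the per-domain counts quoted above can be summed and converted to shares of the dataset. The short sketch below uses only the figures stated in the text; the variable names are illustrative.

```python
# Per-domain video counts as stated in the text.
DOMAIN_COUNTS = {
    "Mechanical": 10_245,
    "Biological": 10_242,
    "Thermal": 6_602,
    "Optical": 6_245,
    "Material": 5_286,
}

# The five domains should add up to the stated dataset size.
total = sum(DOMAIN_COUNTS.values())

# Share of each domain, in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in DOMAIN_COUNTS.items()}

print(total)  # 38620
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The total matches the stated 38,620 videos, with the mechanical and biological domains each contributing roughly a quarter of the data.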
Dataset samples, serialized in a directory-based format, include:
- Source image and target frames
- Six uniformly sampled intermediate keyframes
- Edit instruction
- Structured physics-reasoning trace
- Rich metadata: domain/sub-domain labels, principle verification scores (via GPT-5-mini), contradiction tags, and source prompts
This configuration enforces explicit supervision over dynamic transitions, addressing the underspecification inherent in pairwise datasets and enabling data-driven learning of physically plausible transformations.
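The sample layout listed above could be mirrored in code roughly as follows. This is a minimal sketch, not the dataset's actual schema: the class name, field names, and the index-selection rule for the six keyframes are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

NUM_KEYFRAMES = 6  # six uniformly sampled intermediate keyframes, per the text


def uniform_keyframe_indices(num_frames: int, k: int = NUM_KEYFRAMES) -> List[int]:
    """Pick k interior frame indices spread evenly between first and last frame.

    Assumed scheme (hypothetical): split the index range into k + 1 equal
    intervals and take the interior boundaries.
    """
    if num_frames < k + 2:
        raise ValueError("need at least k + 2 frames for k interior keyframes")
    return [round(i * (num_frames - 1) / (k + 1)) for i in range(1, k + 1)]


@dataclass
class TransitionSample:
    """One record; field names are hypothetical, contents follow the list above."""
    source_image: str          # path to the source frame
    target_frames: List[str]   # path(s) to the target frame(s)
    keyframes: List[str]       # six intermediate keyframes
    instruction: str           # edit instruction
    reasoning_trace: str       # structured physics-reasoning trace
    domain: str                # one of the five primary domains
    sub_domain: str
    metadata: Dict[str, object] = field(default_factory=dict)  # scores, tags, prompts

    def __post_init__(self) -> None:
        if len(self.keyframes) != NUM_KEYFRAMES:
            raise ValueError(f"expected {NUM_KEYFRAMES} keyframes, got {len(self.keyframes)}")


print(uniform_keyframe_indices(30))  # [4, 8, 12, 17, 21, 25]
```

Validating the keyframe count at construction time makes malformed records fail early, which is useful when a loader walks a large directory tree.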
2. Dataset Construction and Annotation Procedure
The dataset construction pipeline for PhysicTran38K involves:
- Synthesis: Raw video generation using Wan2.2-T2V-A14B with a structured "Wan Prompt" syntax: Start State + Trigger Event + Transition Description + Final State.
- Two-Stage Filtering:
  - Stage 1: Viewpoint stability is assessed by ViPE [Huang et al., 2025], adaptively filtering out videos with unwanted camera motion.
  - Stage 2: Principle verification with GPT-5-mini generates candidate transition-specific physics principles and classifies each of them (align/contradict/unknown) per keyframe; a sample is retained only if it meets an acceptance threshold on these verdicts.
  - Contradicted principles are stored as hard negatives for future evaluation.
- Constraint-Aware Annotation: For filtered samples, Qwen2.5-VL-7B generates the edit instruction and a stepwise reasoning trace that narrates the mid-transition physics with explicit principle alignment; contradicting physical laws is explicitly forbidden.
Together, these pipeline elements ensure high-quality, physics-compliant supervision and a rich set of structured annotations for downstream training.
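The prompt structure and the verdict-based filtering described above can be sketched as follows. This is an illustration under stated assumptions: the function names, the acceptance rule (fraction of "align" verdicts), and the `min_align_ratio` parameter are hypothetical, and the threshold used by the authors is not reproduced here.

```python
from collections import Counter
from typing import Dict, List, Tuple

VERDICTS = {"align", "contradict", "unknown"}  # labels named in the text


def build_wan_prompt(start_state: str, trigger_event: str,
                     transition: str, final_state: str) -> str:
    """Compose Start State + Trigger Event + Transition Description + Final State."""
    parts = [start_state, trigger_event, transition, final_state]
    # Normalize each part to end with exactly one period.
    return " ".join(p.strip().rstrip(".") + "." for p in parts)


def verify_sample(verdicts: Dict[str, List[str]],
                  min_align_ratio: float) -> Tuple[bool, List[str]]:
    """Return (accepted, hard_negatives) for one video.

    verdicts maps each candidate principle to one verdict per keyframe.
    ASSUMPTION: acceptance is decided by the share of "align" verdicts;
    the actual rule and threshold are not specified in this sketch.
    """
    counts = Counter(v for per_frame in verdicts.values() for v in per_frame)
    invalid = set(counts) - VERDICTS
    if invalid:
        raise ValueError(f"unexpected verdict labels: {sorted(invalid)}")
    total = sum(counts.values())
    align_ratio = counts["align"] / total if total else 0.0
    # Principles contradicted in any keyframe are kept as hard negatives.
    hard_negatives = [p for p, per_frame in verdicts.items() if "contradict" in per_frame]
    return align_ratio >= min_align_ratio, hard_negatives


prompt = build_wan_prompt(
    "An ice cube rests on a warm plate",
    "The plate heats up",
    "The cube gradually melts into a puddle",
    "Only water remains",
)
example_verdicts = {
    "ice melts above its melting point": ["align"] * 6,
    "water volume is conserved": ["align", "align", "unknown", "align", "contradict", "align"],
}
accepted, negatives = verify_sample(example_verdicts, min_align_ratio=0.8)
print(prompt)
print(accepted, negatives)
```

Returning the contradicted principles alongside the accept/reject decision keeps the hard negatives available even for samples that pass the filter.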
3. Physical State Representation and Transition Modeling
Each temporal step in a trajectory is encoded using two complementary latent features:
- a semantic/structural representation
- a fine-grained texture representation
Supervision centers on the deltas of these features relative to the initial state.
For transition guidance within the diffusion backbone, a timestep-sensitive combination of the two is constructed.
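To make the idea of delta supervision and timestep-sensitive guidance concrete, the sketch below computes a feature delta against the initial state and blends two delta vectors with a noise-level-dependent weight. The linear schedule (semantic weight equal to the normalized timestep) and the assumption that both features share one width are illustrative choices, not details taken from the paper.

```python
from typing import List

Vector = List[float]


def delta(state: Vector, initial: Vector) -> Vector:
    """Feature change of a later state relative to the initial state."""
    if len(state) != len(initial):
        raise ValueError("state and initial must have the same length")
    return [s - s0 for s, s0 in zip(state, initial)]


def blend_guidance(semantic_delta: Vector, texture_delta: Vector, t: float) -> Vector:
    """Blend two deltas with a timestep-dependent weight.

    t is a normalized diffusion timestep in [0, 1], with 1 the noisiest step.
    ASSUMPTION: a linear schedule that favors the semantic/structural delta
    at high noise and the texture delta at low noise.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if len(semantic_delta) != len(texture_delta):
        raise ValueError("deltas must have the same length")
    w_sem = t
    return [w_sem * a + (1.0 - w_sem) * b for a, b in zip(semantic_delta, texture_delta)]


d = delta([1.0, 3.0], [0.5, 1.0])
print(d)  # [0.5, 2.0]
print(blend_guidance([1.0, 0.0], [0.0, 1.0], t=0.25))  # [0.25, 0.75]
```

Any monotone schedule could replace the linear one; the point is only that the mix of coarse and fine guidance changes as denoising proceeds.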
Physics is enforced through both continuous and discrete formulations, including compliance with conservation laws, momentum continuity, and specific physical principles (e.g., Snell's law: $n_1 \sin\theta_1 = n_2 \sin\theta_2$).
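As a worked example of one such principle, Snell's law, $n_1 \sin\theta_1 = n_2 \sin\theta_2$, fixes the refraction angle once the incidence angle and the two refractive indices are known. The helper below is a small illustration (the function name is hypothetical); it shows the kind of constraint a refraction edit has to satisfy and is not part of the PhysicEdit pipeline.

```python
import math
from typing import Optional


def refraction_angle(theta_incident_deg: float, n1: float, n2: float) -> Optional[float]:
    """Refraction angle in degrees from Snell's law.

    Returns None when no refracted ray exists (total internal reflection).
    """
    sin_theta2 = n1 * math.sin(math.radians(theta_incident_deg)) / n2
    if abs(sin_theta2) > 1.0:
        return None
    return math.degrees(math.asin(sin_theta2))


# Air (n ~ 1.00) into water (n ~ 1.33) at 30 degrees: the ray bends toward the normal.
print(refraction_angle(30.0, 1.00, 1.33))

# Water into air at 60 degrees exceeds the critical angle, so there is no refracted ray.
print(refraction_angle(60.0, 1.33, 1.00))  # None
```

For the air-to-water case the refracted ray comes out at roughly 22 degrees, which is why a straw placed in a glass of water should appear bent at the surface in a physically plausible edit.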
The composite training objective combines a standard diffusion loss with a transition alignment term, which matches the learnable transition queries, passed through their respective projection heads, to the target deltas.
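One way such a two-term objective could be written is sketched below. The notation is entirely assumed for illustration: $q_k$ denotes the transition query output for keyframe $k$, $h_{\text{sem}}$ and $h_{\text{tex}}$ the two projection heads, $\Delta^{\text{sem}}_k$ and $\Delta^{\text{tex}}_k$ the semantic and texture deltas relative to the initial state, $K$ the number of supervised keyframes, and $\lambda$ a weighting coefficient. The squared-error distance is likewise an assumption, and the paper's exact formulation may differ.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{diff}} \;+\; \lambda\,\mathcal{L}_{\text{trans}},
\qquad
\mathcal{L}_{\text{trans}} \;=\; \sum_{k=1}^{K}
\Big(
  \big\lVert h_{\text{sem}}(q_k) - \Delta^{\text{sem}}_k \big\rVert_2^2
  \;+\;
  \big\lVert h_{\text{tex}}(q_k) - \Delta^{\text{tex}}_k \big\rVert_2^2
\Big)
```

Under this reading, the diffusion term trains the image generator as usual, while the alignment term is what forces the queries to encode how the scene changes over the transition.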
4. Architecture and Inference
PhysicEdit’s architecture is distinguished by its textual-visual dual-thinking design:
- During training, learnable transition queries (64 tokens, referred to here as transition tokens) are concatenated with the initial frame and the textual edit instruction as input to the frozen Qwen2.5-VL backbone. The resulting embeddings are directly supervised to align with the per-timestep deltas.
- At inference, only the source image and the edit instruction are provided. Qwen2.5-VL infers a structured physics reasoning trace, while the learned transition queries implicitly supply prior knowledge of typical physical trajectories. A cascaded diffusion model (MMDiT) is then guided simultaneously by the textual trace and the stepwise visual priors.
This arrangement enables physically grounded, temporally consistent synthesis without generating explicit intermediate frames (which can accumulate errors and incur high computational cost).
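The control flow implied by this design can be sketched with placeholder callables: the reasoning trace is produced once, a visual prior is looked up at every denoising step, and no intermediate frame is ever decoded. Everything named here (`run_edit`, `reason`, `visual_prior`, `denoise_step`) is a hypothetical stand-in, not the interface of Qwen2.5-VL or MMDiT.

```python
from typing import Callable, List, Tuple

Latent = List[float]


def run_edit(
    source: Latent,
    instruction: str,
    initial_latent: Latent,
    reason: Callable[[Latent, str], str],
    visual_prior: Callable[[int], Latent],
    denoise_step: Callable[[Latent, str, Latent, int], Latent],
    num_steps: int,
) -> Tuple[Latent, str]:
    """Dual-guided denoising loop (illustrative stand-ins only)."""
    # Textual thinking: one reasoning trace for the whole edit.
    trace = reason(source, instruction)
    latent = list(initial_latent)
    for step in range(num_steps):
        # Visual thinking: a step-specific prior from the learned queries.
        prior = visual_prior(step)
        # Each update is conditioned on both signals at once.
        latent = denoise_step(latent, trace, prior, step)
    # Only the final latent is returned; no intermediate frames are generated.
    return latent, trace


# Toy usage with trivial stand-in functions.
final, trace = run_edit(
    source=[0.0],
    instruction="melt the ice",
    initial_latent=[1.0],
    reason=lambda img, text: "trace: " + text,
    visual_prior=lambda step: [0.1 * step],
    denoise_step=lambda z, tr, prior, step: [z[0] - 0.25],
    num_steps=4,
)
print(final, trace)  # [0.0] trace: melt the ice
```

Keeping the trace outside the loop and the prior inside it reflects the split described above: the textual constraint is global, while the visual guidance varies with the denoising step.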
5. Evaluation Metrics and Empirical Results
PhysicEdit is quantitatively assessed using:
- Physical Realism (PICABench): Aggregates eight physical plausibility subscores (light propagation, light source effects, reflection, refraction, deformation, causality, and global and local state transitions) into an Overall Physical Realism Score.
- Knowledge-Grounded Editing (KRISBench): Scores factual, conceptual, and procedural alignment.
PhysicEdit achieves a 5.9% relative improvement in overall physical realism (64.86 vs. 61.26, a gain of 3.60 points) and a reported 10.1% gain in knowledge-grounded editing over static pair baselines, surpassing open-source competitors and approaching proprietary model performance.
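The relationship between the two physical-realism scores quoted above is simple arithmetic, shown here as a short check. The function name is illustrative; the inputs are the scores from the text.

```python
from typing import Tuple


def improvement(new: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = new - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


abs_gain, rel_gain = improvement(64.86, 61.26)
print(f"{abs_gain:.2f} points, {rel_gain:.1f}% relative")  # 3.60 points, 5.9% relative
```

The scores differ by 3.60 points on the benchmark scale, which corresponds to a relative gain of about 5.9% over the baseline.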
6. Synthesis of Mechanisms and Limitations
PhysicEdit demonstrates that jointly modeling textual constraints (for logical physical transitions and principle compliance) and implicit visual transitions (capturing deformation and optical subtleties) yields synergistic improvements in both realism and knowledge compliance. The use of implicit latent transition priors avoids the limitations of explicit intermediate-frame synthesis, mitigating error accumulation and reducing computation.
However, the framework is subject to several limitations:
- All PhysicTran38K data are synthetic; real-world scenarios may present more complex lighting, noise, and object interactions.
- The enforced static-camera requirement ensures background stability but restricts camera viewpoint variety, which may be a limitation for broader visual editing tasks.
Prospective extensions include scaling to higher image resolutions, longer temporal horizons, multimodal (stereo/RGB-D) data, and incorporating real-capture datasets to close the sim-to-real gap.
A plausible implication is that future developments along these axes will enhance the generalization of physics-aware editing solutions to unconstrained, real-world settings, further narrowing the divide between simulation-trained and real-scene capable models (Zhao et al., 25 Feb 2026).