PhysicTran38K: Video-Based Physics Editing Dataset
- PhysicTran38K is a large-scale dataset comprising 38,620 video clips that capture diverse physical state transitions for supervised image editing.
- The dataset employs an automated synthesis and filtering pipeline using pretrained text-to-video models, GPT-5-mini, and Qwen2.5-VL-7B for precise annotation.
- Integration with PhysicEdit leverages dual-stream latent encoding and structured reasoning to achieve high physical realism and knowledge grounding across five domains.
PhysicTran38K is a large-scale video-based dataset consisting of 38,620 transition trajectories constructed to facilitate physics-aware image editing through supervised learning of physical state transitions. Designed to address the limitations of discrete editing paradigms that underspecify causal dynamics, PhysicTran38K provides extensive supervision over multi-frame transitions, enabling the modeling of physical processes such as refraction, material deformation, and complex biophysical events. It serves as the foundation for the PhysicEdit framework, which leverages both textual and visual reasoning to achieve physically plausible and knowledge-grounded image editing (Zhao et al., 25 Feb 2026).
1. Dataset Construction Pipeline
PhysicTran38K is structured to convert physics-aware image editing into a supervised predictive problem over video-based state transitions. Construction begins with a physics taxonomy encompassing five domains: Mechanical, Thermal, Material, Optical, and Biological. These domains are further refined into 16 sub-domains and 46 transition types, including events such as refraction, fracture, and germination.
For each transition type, approximately 1,000 short video clips are automatically synthesized using a pretrained text-to-video model (Wan2.2-T2V-A14B, 256×256 pixels, 16–20 frames at 30 fps). Prompts follow a structured template: [Start State] + [Trigger Event] + [Transition Description] + [Final State], with GPT-5-mini instantiating concrete objects from curated "object pools" to promote scene diversity.
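The four-part prompt template can be sketched as a small data structure. This is a minimal illustration, not the authors' code: the class `TransitionTemplate`, the helper `build_prompts`, the `{obj}` placeholder, and the example wording are all hypothetical, and a seeded random choice stands in for the object instantiation that the source attributes to GPT-5-mini.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TransitionTemplate:
    """Hypothetical container for the four template slots named in the text."""
    start_state: str
    trigger_event: str
    transition_description: str
    final_state: str

    def render(self, obj: str) -> str:
        # Concatenate the slots in template order, filling in the object name.
        parts = (
            self.start_state,
            self.trigger_event,
            self.transition_description,
            self.final_state,
        )
        return " ".join(part.format(obj=obj) for part in parts)


def build_prompts(template, object_pool, n, seed=0):
    # Assumption: objects are drawn uniformly at random from the pool.
    rng = random.Random(seed)
    return [template.render(rng.choice(object_pool)) for _ in range(n)]


melting = TransitionTemplate(
    start_state="A solid {obj} rests on a metal tray.",
    trigger_event="A heat lamp switches on above it.",
    transition_description="The {obj} gradually softens and melts into a puddle.",
    final_state="Only a pool of liquid remains on the tray.",
)
object_pool = ["ice cube", "wax candle", "chocolate bar"]
prompts = build_prompts(melting, object_pool, n=3)
```

Keeping the slots separate makes it straightforward to vary the object while holding the trigger and final state fixed, which is one plausible way to obtain scene diversity within a single transition type.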
Raw clips are filtered in two stages. First, geometric stability is enforced with the ViPE filter, which computes a view–pose consistency score and compares it against a threshold; the threshold is adaptively relaxed for transitions involving significant non-rigid deformation. Second, principle-driven verification is performed by GPT-5-mini, which proposes a bounded number of physical principles per clip. Following a visual critique of keyframes, each principle is labeled as align, contradict, or unknown. A verification score derived from these labels determines which clips are retained. Contradicted principles do not cause outright rejection; they are preserved as hard-negative constraints, which maintains both data utility and physical rigor.
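The principle-driven verification stage can be sketched as follows. The scoring rule is an assumption: the exact verification score and its threshold are not reproduced here, so this sketch uses the fraction of principles labeled `align` and a caller-supplied `min_score`. The names `VerifiedClip`, `verification_score`, and `filter_clip` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VerifiedClip:
    clip_id: str
    score: float
    # Contradicted principles are kept as hard-negative constraints.
    negative_constraints: List[str] = field(default_factory=list)


def verification_score(labels: Dict[str, str]) -> float:
    """Assumed score: fraction of proposed principles labeled 'align'."""
    if not labels:
        return 0.0
    aligned = sum(1 for label in labels.values() if label == "align")
    return aligned / len(labels)


def filter_clip(clip_id: str, labels: Dict[str, str],
                min_score: float) -> Optional[VerifiedClip]:
    score = verification_score(labels)
    if score < min_score:
        return None  # clip rejected by the (assumed) threshold rule
    negatives = [p for p, label in labels.items() if label == "contradict"]
    return VerifiedClip(clip_id, score, negatives)


# Illustrative labels for one clip (not taken from the dataset).
labels = {
    "ice melts when heated above its melting point": "align",
    "meltwater pools downward under gravity": "align",
    "total mass is conserved": "contradict",
    "specular highlights follow the light source": "unknown",
}
kept = filter_clip("clip_0001", labels, min_score=0.5)
rejected = filter_clip("clip_0001", labels, min_score=0.9)
```

In this sketch a clip with one contradicted principle is still retained when enough principles align, and the contradiction travels with the clip as a negative constraint, mirroring the behavior described above.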
Retained clips undergo constraint-aware annotation with Qwen2.5-VL-7B. The first and last frames form the editing pair (source and target images), while intermediate keyframes are uniformly sampled. Qwen2.5-VL generates a concise instruction describing the trigger and net effect, and produces a structured reasoning trace covering the initial state, the intermediate transitions, and the final state. Any artifacts flagged as "contradict" are turned into explicit negative constraints enforced during annotation. This pipeline yields 38,620 high-quality video–instruction–reasoning triplets.
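Frame selection for annotation can be illustrated with a short sketch: the first and last frames become the editing pair, and interior keyframes are spaced evenly between them. The function name `sample_keyframes` and the number of intermediate keyframes are assumptions; the source does not state how many keyframes are sampled.

```python
def sample_keyframes(num_frames: int, num_intermediate: int) -> dict:
    """Pick the editing pair plus evenly spaced interior keyframe indices."""
    if num_frames < 2:
        raise ValueError("a clip needs at least two frames")
    last = num_frames - 1
    # Evenly spaced positions strictly between the first and last frame.
    step = last / (num_intermediate + 1)
    candidates = [int(step * (i + 1) + 0.5) for i in range(num_intermediate)]
    # Drop duplicates and anything that collides with the endpoints.
    interior = sorted({i for i in candidates if 0 < i < last})
    return {"source": 0, "keyframes": interior, "target": last}


# A 16-frame clip with three intermediate keyframes (count is an assumption).
selection = sample_keyframes(num_frames=16, num_intermediate=3)
```

For the 16-frame example, the interior positions fall at 3.75, 7.5, and 11.25 frames, which round to indices 4, 8, and 11.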
2. Dataset Composition and Structure
PhysicTran38K catalogues transitions across five principal physical domains:
| Domain | # Clips | Representative Transitions |
|---|---|---|
| Mechanical | 10,245 | Translation, rotation, oscillation, collision, deformation |
| Biological | 10,242 | Germination, decay, mold growth, life–death transitions |
| Thermal | 6,602 | Heating, cooling, melting, freezing, evaporation |
| Optical | 6,245 | Reflection, refraction, scattering, intensity shifts |
| Material | 5,286 | Surface wear, hardening, softening, integrity change |
Each clip is organized as a folder containing full-resolution frames (256×256 PNGs) and a JSON metadata file. The metadata records the domain, sub-domain, transition type, object pool, scores, the editing instruction, the structured reasoning trace, explicit principle alignments or contradictions, sampling timestamps, and any negative constraints. Clips are synthesized under a fixed-camera assumption, so all pixel displacements originate from state changes rather than viewpoint shifts, which simplifies downstream modeling of physical dynamics.
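A per-clip metadata record might look like the sketch below. The key names, the nesting, and every value are hypothetical placeholders chosen to mirror the fields listed above; the released files may use a different schema.

```python
import json

# Hypothetical key names, one per field listed in the text.
REQUIRED_KEYS = {
    "domain", "sub_domain", "transition_type", "object_pool", "scores",
    "instruction", "reasoning_trace", "principles", "timestamps",
    "negative_constraints",
}

# Illustrative placeholder record; none of these values come from the dataset.
example = {
    "domain": "Thermal",
    "sub_domain": "phase_change",
    "transition_type": "melting",
    "object_pool": ["ice cube", "wax candle"],
    "scores": {"geometric_stability": 0.0, "verification": 0.0},
    "instruction": "Turn on the heat lamp so the ice cube melts.",
    "reasoning_trace": {
        "initial": "A solid ice cube sits on a tray.",
        "transitions": ["The surface becomes wet.", "The cube loses its edges."],
        "final": "A shallow pool of water remains.",
    },
    "principles": {"heat transfer melts ice": "align"},
    "timestamps": [0, 4, 8, 11, 15],
    "negative_constraints": [],
}


def missing_keys(record: dict) -> list:
    """Return the required keys that a record lacks, in sorted order."""
    return sorted(REQUIRED_KEYS - record.keys())


# Round-trip through JSON to confirm the record is serializable.
roundtrip = json.loads(json.dumps(example, indent=2))
```

A check such as `missing_keys` is a cheap guard when loading many clip folders, since a single malformed record is easier to report by name than to debug later in training.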
3. Physical State Representation and Latent Encoding
The physical state at each timestep is represented by the corresponding video frame. In the downstream PhysicEdit framework, each frame is encoded into two complementary modalities:
- Structural features, extracted by a frozen DINOv2 encoder.
- Texture features, extracted by the diffusion backbone’s VAE encoder.
State transitions are modeled with a discrete-time approximation of Newtonian evolution, which generalizes to a continuous-time formulation.
PhysicEdit distills these dynamics using learnable "transition queries", trained to reconstruct the latent difference between two states of a trajectory. The transition loss is a timestep-aware mixture of a structural term and a texture term.
Larger diffusion timesteps weight structural alignment more heavily, while smaller timesteps focus on texture refinement, reflecting the coarse-to-fine generative trajectory of diffusion models. Explicit physical constraints (e.g., conservation of energy) could be enforced, but in the current release such properties are captured only implicitly, through the data structure and the dual-stream loss.
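The idea of a timestep-aware mixture can be sketched in a few lines. The exact loss is not reproduced here, so everything specific is an assumption: the weight is taken to be linear in the timestep, each stream uses a mean squared error, and the function names are hypothetical.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(pred) != len(target):
        raise ValueError("feature vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def structure_weight(t: int, num_steps: int) -> float:
    # Assumed schedule: linear in t, so noisier (larger) timesteps
    # emphasize the structural stream.
    return t / num_steps


def transition_loss(pred_struct, target_struct, pred_tex, target_tex,
                    t: int, num_steps: int) -> float:
    w = structure_weight(t, num_steps)
    structural = mse(pred_struct, target_struct)  # DINOv2-like stream
    texture = mse(pred_tex, target_tex)           # VAE-like stream
    return w * structural + (1.0 - w) * texture


# Toy latent deltas: structural error is 1.0, texture error is 4.0.
pred_s, tgt_s = [1.0, 1.0], [0.0, 0.0]
pred_x, tgt_x = [2.0, 2.0], [0.0, 0.0]
early = transition_loss(pred_s, tgt_s, pred_x, tgt_x, t=1000, num_steps=1000)
mid = transition_loss(pred_s, tgt_s, pred_x, tgt_x, t=500, num_steps=1000)
late = transition_loss(pred_s, tgt_s, pred_x, tgt_x, t=0, num_steps=1000)
```

With these toy values the loss equals the structural error at the noisiest step, the texture error at the final step, and their average halfway through, which is the coarse-to-fine behavior the text describes.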
4. Integration with PhysicEdit and Benchmark Evaluation
Within PhysicEdit, PhysicTran38K is central to two mechanisms:
- Physically-Grounded Reasoning: At both training and inference, Qwen2.5-VL is conditioned to generate a structured reasoning trace specifying explicit laws and causal chains (e.g., "angle of incidence equals angle of reflection" for reflection). These textual reasoning blocks, concatenated with the editing instruction, condition the diffusion backbone.
- Implicit Visual Thinking: The transition queries, initialized randomly and optimized on latent deltas from video trajectories, offer timestep-adaptive visual guidance to the diffusion model. At inference (with no input video), queries supply guidance consistent with observed transitions.
The fused latent feature computed at each denoising step is injected into the model along with the textual conditioning.
Evaluation is conducted using:
- PICABench for physical realism, scoring across eight axes (light propagation, source effects, reflection, refraction, deformation, causality, global and local state transitions) that are aggregated into an overall score.
- KRISBench for knowledge grounding, which covers Factual, Conceptual, and Procedural categories.
PhysicEdit outperforms Qwen-Image-Edit, which scores 61.26% and 65.56% on the two benchmarks, by 5.9% and 10.1% absolute, respectively (Zhao et al., 25 Feb 2026).
5. Insights, Strengths, and Limitations
Training on explicit transition trajectories as provided by PhysicTran38K enables models to internalize the temporal and causal mechanisms underpinning physical events, supporting improved generalization in physical realism and knowledge-grounding. The dual-stream architecture ensures global physical consistency through textual reasoning while capturing fine-grained motion and deformation patterns via latent queries.
Ablation studies show that removing either the textual or the visual stream degrades performance: textual reasoning primarily contributes to mechanical and causal understanding, whereas visual queries improve the modeling of optics and localized transitions.
PhysicTran38K's reliance on synthetic data poses limitations regarding the representation of complex real-world physics, such as turbulent fluid phenomena or intricate multi-object interactions. The taxonomy includes 46 transition types but excludes certain important phenomena (e.g., aerodynamic lift, electromagnetism). Current resolutions are fixed at 256×256 with single-camera viewpoints, possibly restricting the transferability to diverse real-world settings. Prospective directions include the integration of real-world video, higher spatial resolution, expansion of domains (e.g., acoustics, electromagnetics), and explicit enforcement of physical laws using physics-informed neural networks (Zhao et al., 25 Feb 2026).
6. Relevance and Implications for Physics-Aware Editing
PhysicTran38K addresses a critical bottleneck in instruction-based image editing—the underspecification of causal, physically plausible transitions—by providing richly annotated, temporally resolved supervision. Its integration into frameworks such as PhysicEdit has yielded state-of-the-art performance for open-source approaches, rivalling proprietary solutions in both physical realism and knowledge grounding. The dataset offers a scalable foundation for further research into physics-aware generative systems, especially those requiring deep semantic alignment with physical law and causal reasoning.