
PhysicTran38K: Video-Based Physics Editing Dataset

Updated 28 February 2026
  • PhysicTran38K is a large-scale dataset comprising 38,620 video clips that capture diverse physical state transitions for supervised image editing.
  • The dataset employs an automated synthesis and filtering pipeline using pretrained text-to-video models, GPT-5-mini, and Qwen2.5-VL-7B for precise annotation.
  • Integration with PhysicEdit leverages dual-stream latent encoding and structured reasoning to achieve high physical realism and knowledge grounding across five domains.

PhysicTran38K is a large-scale video-based dataset consisting of 38,620 transition trajectories constructed to facilitate physics-aware image editing through supervised learning of physical state transitions. Designed to address the limitations of discrete editing paradigms that underspecify causal dynamics, PhysicTran38K provides extensive supervision over multi-frame transitions, enabling the modeling of physical processes such as refraction, material deformation, and complex biophysical events. It serves as the foundation for the PhysicEdit framework, which leverages both textual and visual reasoning to achieve physically plausible and knowledge-grounded image editing (Zhao et al., 25 Feb 2026).

1. Dataset Construction Pipeline

PhysicTran38K is structured to convert physics-aware image editing into a supervised predictive problem over video-based state transitions. Construction begins with a physics taxonomy encompassing five domains: Mechanical, Thermal, Material, Optical, and Biological. These domains are further refined into 16 sub-domains and 46 transition types, including events such as refraction, fracture, and germination.

For each transition type, approximately 1,000 short video clips are automatically synthesized using a pretrained text-to-video model (Wan2.2-T2V-A14B, 256×256 pixels, 16–20 frames at 30 fps). Prompts follow a structured template: [Start State] + [Trigger Event] + [Transition Description] + [Final State], with GPT-5-mini instantiating concrete objects from curated "object pools" to promote scene diversity.
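The four-slot template can be sketched as a simple string builder. This is an illustration only: the function names and the example scene are hypothetical, and a plain string substitution stands in for the object instantiation that GPT-5-mini performs in the actual pipeline.

```python
def build_prompt(start_state: str, trigger_event: str,
                 transition: str, final_state: str) -> str:
    """Join the four template slots into one text-to-video prompt.

    Slot order follows the template in the text:
    [Start State] + [Trigger Event] + [Transition Description] + [Final State].
    """
    slots = (start_state, trigger_event, transition, final_state)
    # Normalise each slot to end with exactly one full stop.
    return " ".join(s.strip().rstrip(".") + "." for s in slots)


def melting_prompt(obj: str) -> str:
    """Hypothetical 'melting' template filled with one object from a pool."""
    return build_prompt(
        f"A {obj} rests on a wooden table",
        "a heat lamp switches on above it",
        f"the {obj} slowly softens and melts into a puddle",
        "a shallow pool of liquid remains on the table",
    )


prompt = melting_prompt("wax candle")
print(prompt)
```

Keeping the slots separate until the final join makes it easy to swap objects while holding the physical transition fixed, which is the purpose of the object pools.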

Raw clips are filtered in two stages. Geometric stability is enforced using the ViPE filter, which computes a view–pose consistency score ($S_{\mathrm{vipe}}$); the threshold is adaptively relaxed for transitions involving significant non-rigid deformation. Principle-driven verification is then conducted by GPT-5-mini, which proposes up to $N = 3$ physical principles per clip and, following a visual critique of keyframes, labels each as align, contradict, or unknown. Only clips with a verification score $S_{\mathrm{verify}} = N_{\text{align}} / N_{\text{total}} \geq 0.5$ are retained. Contradicted principles are preserved as hard-negative constraints instead of causing outright rejection, which maintains both data utility and physical rigor.
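The second filtering stage can be sketched as follows. This is a minimal sketch under two assumptions not stated in the source: that $N_{\text{total}}$ counts every proposed principle (including those labeled unknown), and that a clip with no proposed principles is rejected. The function names are hypothetical.

```python
from collections import Counter


def verification_score(labels):
    """Return S_verify = N_align / N_total for one clip.

    `labels` holds one of 'align', 'contradict', 'unknown' per proposed
    principle. Assumption: N_total counts all labels, including 'unknown'.
    """
    if not labels:
        return 0.0  # assumption: no principles means nothing was verified
    return Counter(labels)["align"] / len(labels)


def keep_clip(labels, threshold=0.5):
    """Retain the clip only if S_verify reaches the 0.5 threshold."""
    return verification_score(labels) >= threshold


def negative_constraints(principles, labels):
    """Contradicted principles become hard-negative constraints
    rather than triggering rejection on their own."""
    return [p for p, lab in zip(principles, labels) if lab == "contradict"]


labels = ["align", "align", "contradict"]
principles = ["light bends at the interface",
              "the straw appears displaced",
              "the water level rises"]
print(verification_score(labels), keep_clip(labels))
print(negative_constraints(principles, labels))
```

In this example the clip is kept because two of three principles align, and the one contradicted principle is carried forward as a constraint for the annotation step.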

Retained clips undergo constraint-aware annotation with Qwen2.5-VL-7B. The first and last frames establish the editing pair $(I_{\mathrm{src}}, I_{\mathrm{tgt}})$, while $T = 8$ intermediate keyframes are uniformly sampled. Qwen2.5-VL generates a concise instruction describing the trigger and net effect ($T_{\mathrm{edit}}$), and produces a structured reasoning trace for all states $S_0$ (initial), $S_1, \ldots, S_{T-1}$ (transitions), and $S_T$ (final). Any artifacts flagged as "contradict" lead to explicit negative constraints enforced during annotation. This pipeline yields 38,620 high-quality video–instruction–reasoning triplets.
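Uniform keyframe sampling can be illustrated with a few lines of index arithmetic. This sketch assumes the keyframes are taken strictly between the first and last frames (which form the editing pair) and that indices are rounded to the nearest frame; the source does not specify either detail.

```python
def sample_keyframes(num_frames: int, num_keyframes: int = 8):
    """Return evenly spaced interior frame indices.

    Frame 0 and frame num_frames - 1 are reserved for the editing pair
    (I_src, I_tgt), so only indices strictly between them are returned.
    """
    if num_frames < num_keyframes + 2:
        raise ValueError("clip too short for the requested keyframes")
    # Divide the span [0, num_frames - 1] into num_keyframes + 1 equal gaps.
    step = (num_frames - 1) / (num_keyframes + 1)
    return [round(i * step) for i in range(1, num_keyframes + 1)]


indices = sample_keyframes(16)
print(indices)
```

Because the step is at least one frame whenever the length check passes, the rounded indices never collide.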

2. Dataset Composition and Structure

PhysicTran38K catalogues transitions across five principal physical domains:

| Domain | # Clips | Representative Transitions |
|---|---|---|
| Mechanical | 10,245 | Translation, rotation, oscillation, collision, deformation |
| Biological | 10,242 | Germination, decay, mold growth, life–death transitions |
| Thermal | 6,602 | Heating, cooling, melting, freezing, evaporation |
| Optical | 6,245 | Reflection, refraction, scattering, intensity shifts |
| Material | 5,286 | Surface wear, hardening, softening, integrity change |

Each clip is organized as a folder containing $T = 16$ full-resolution frames $\{I_0, \ldots, I_{15}\}$ (256×256 PNGs) and a JSON metadata file detailing domain, sub-domain, transition type, object pool, $S_{\mathrm{vipe}}$ and $S_{\mathrm{verify}}$ scores, the editing instruction $T_{\mathrm{edit}}$, the structured reasoning trace, explicit principle alignments or contradictions, sampling timestamps, and any negative constraints. Clips are synthesized with a fixed-camera assumption, such that all pixel displacements originate from state changes, not viewpoint shifts, simplifying downstream modeling of physical dynamics.
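To make the per-clip metadata concrete, the sketch below shows one possible layout and a loader that checks for the fields listed above. Every key name and value here is hypothetical: the source describes what the metadata covers but not its JSON schema.

```python
import json

# Hypothetical schema and values, for illustration only.
example_metadata = {
    "domain": "Thermal",
    "sub_domain": "phase_change",
    "transition_type": "melting",
    "object_pool": ["ice cube", "wax candle"],
    "s_vipe": 0.91,
    "s_verify": 0.67,
    "edit_instruction": "Heat the ice cube until it melts into a puddle.",
    "reasoning_trace": ["S0: solid ice cube", "S1: edges soften",
                        "S8: puddle of water"],
    "principles": [
        {"text": "Ice melts when heated above 0 °C", "label": "align"},
        {"text": "Melt water pools under gravity", "label": "align"},
        {"text": "Mass is conserved while melting", "label": "contradict"},
    ],
    "sampling_timestamps": [0.0, 0.25, 0.5],
    "negative_constraints": ["Liquid must not appear from nowhere"],
}

REQUIRED_KEYS = set(example_metadata)


def load_metadata(text: str) -> dict:
    """Parse a metadata JSON string and check that no field is missing."""
    meta = json.loads(text)
    missing = REQUIRED_KEYS - meta.keys()
    if missing:
        raise ValueError(f"missing metadata fields: {sorted(missing)}")
    return meta


meta = load_metadata(json.dumps(example_metadata))
print(meta["transition_type"], meta["s_verify"])
```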

3. Physical State Representation and Latent Encoding

Physical states $S_t$ at each timestep $t$ are represented via the corresponding image $I_t$. In the downstream PhysicEdit framework, each frame is encoded into two complementary modalities:

  • Structural features via a frozen DINOv2 encoder: $F_{\mathrm{DINO}}(I_t) \in \mathbb{R}^{K \times d_1}$
  • Texture features via the diffusion backbone’s VAE encoder: $F_{\mathrm{VAE}}(I_t) \in \mathbb{R}^{K \times d_2}$

The modeling of state transitions adopts a discrete-time approximation of Newtonian evolution:

$$x_{t+1} = x_t + \Delta t \cdot v_t, \quad \Delta t = 1$$

Generalizing to continuous time:

$$S_{\mathrm{final}} = S_0 + \int_{0}^{T} f(S_t, T_{\mathrm{edit}}) \, dt$$
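The link between the two formulas is that the discrete update is a forward Euler step, and unrolling it with a shrinking step size approaches the integral. The sketch below illustrates this on a scalar state; the rate function and its constant are hypothetical stand-ins for $f(S_t, T_{\mathrm{edit}})$, chosen only because the exact solution is known.

```python
import math


def rollout(s0: float, f, num_steps: int, dt: float = 1.0):
    """Unroll s_{t+1} = s_t + dt * f(s_t) and return every visited state."""
    states = [s0]
    for _ in range(num_steps):
        states.append(states[-1] + dt * f(states[-1]))
    return states


# Constant velocity: the discrete rule with dt = 1 is exact.
linear = rollout(1.0, lambda s: 0.5, num_steps=8)

# Hypothetical exponential decay ds/dt = -0.1 * s over a horizon of 8.
decay = lambda s: -0.1 * s
exact = math.exp(-0.8)
coarse = rollout(1.0, decay, num_steps=8, dt=1.0)[-1]
fine = rollout(1.0, decay, num_steps=800, dt=0.01)[-1]
print(linear[-1], coarse, fine, exact)
```

With unit steps the decay example visibly undershoots the exact value, while the finer rollout lands much closer, which is the sense in which the continuous form generalizes the discrete one.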

PhysicEdit distills these dynamics using $K$ learnable "transition queries" $Q = \{q_1, \ldots, q_K\}$, trained to reconstruct the latent difference between $I_t$ and $I_0$. The transition loss is a timestep-aware mixture:

$$L_{\mathrm{tran}} = \sum_{t=1}^{T} \left[ \frac{t}{T} \cdot \| F_{\mathrm{DINO}}(I_t) - F_{\mathrm{DINO}}(I_0) \|^2 + \left(1 - \frac{t}{T}\right) \cdot \| F_{\mathrm{VAE}}(I_t) - F_{\mathrm{VAE}}(I_0) \|^2 \right]$$

Larger $t$ weights structural alignment, while smaller $t$ focuses on texture refinement, reflecting the generative trajectory of diffusion models from coarse to fine detail. Explicit physical constraints (e.g., conservation of energy) could in principle be enforced, but the current release captures such properties only implicitly, through the structure of the data and the dual-stream loss.
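A toy version of the timestep-aware loss is sketched below. It assumes the structural weight is normalized to $t/T$ so that the two weights sum to one, matching the weighting used for the fused feature in Section 4. Short Python lists stand in for DINOv2 and VAE features; this is not the authors' implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def transition_loss(dino_feats, vae_feats):
    """Sum the weighted feature deltas relative to frame 0.

    dino_feats[t] and vae_feats[t] are the (toy) features of frame I_t,
    for t = 0..T. Assumed weights: t/T on structure, 1 - t/T on texture.
    """
    T = len(dino_feats) - 1
    loss = 0.0
    for t in range(1, T + 1):
        w_struct = t / T
        loss += w_struct * sq_dist(dino_feats[t], dino_feats[0])
        loss += (1.0 - w_struct) * sq_dist(vae_feats[t], vae_feats[0])
    return loss


# Toy trajectories with T = 2 (three frames each).
dino = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
vae = [[0.0], [1.0], [3.0]]
print(transition_loss(dino, vae))
```

At the final step the texture weight is zero, so only the structural delta contributes there.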

4. Integration with PhysicEdit and Benchmark Evaluation

Within PhysicEdit, PhysicTran38K is central to two mechanisms:

  1. Physically-Grounded Reasoning: At both training and inference, Qwen2.5-VL is conditioned to generate a structured reasoning trace specifying explicit laws and causal chains (e.g., "angle of incidence equals angle of reflection" in reflection). These textual reasoning blocks, concatenated with $T_{\mathrm{edit}}$, condition the diffusion backbone.
  2. Implicit Visual Thinking: The $K$ transition queries, initialized randomly and optimized on latent deltas from video trajectories, offer timestep-adaptive visual guidance to the diffusion model. At inference (with no input video), queries supply guidance consistent with observed transitions.

The fused latent feature at step $t$ is:

$$F_{\mathrm{tran}}(t) = \frac{t}{T} \cdot F_{\mathrm{DINO}} + \left(1 - \frac{t}{T}\right) \cdot F_{\mathrm{VAE}}$$

This is injected into the model at each denoising step along with textual conditioning.
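The fusion rule is a convex blend of the two streams, as the sketch below shows on toy vectors. It assumes both feature streams have already been projected to a common dimension, since an element-wise sum requires matching shapes; the source does not describe that projection.

```python
def fuse(f_dino, f_vae, t: int, T: int):
    """Blend structural and texture features with weights t/T and 1 - t/T."""
    if len(f_dino) != len(f_vae):
        raise ValueError("streams must share a dimension before fusion")
    if not 0 <= t <= T:
        raise ValueError("t must lie in [0, T]")
    w = t / T
    return [w * a + (1.0 - w) * b for a, b in zip(f_dino, f_vae)]


f_dino = [1.0, 1.0]   # toy structural feature
f_vae = [0.0, 2.0]    # toy texture feature
for t in (0, 2, 4):
    print(t, fuse(f_dino, f_vae, t, T=4))
```

At $t = 0$ the output is purely the texture stream and at $t = T$ purely the structural stream, with a linear interpolation in between.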

Evaluation is conducted using:

  • PICABench for physical realism, scoring across eight axes: light propagation (LP), light source effects (LSE), reflection (RFL), refraction (RFR), deformation (DFM), causality (CSL), global state transitions (GST), and local state transitions (LST). The overall score:

$$R_{\mathrm{phys}} = \frac{LP + LSE + RFL + RFR + DFM + CSL + GST + LST}{8}$$

  • KRISBench for knowledge grounding, which covers Factual, Conceptual, and Procedural categories:

$$R_{\mathrm{kg}} = \frac{R_{\mathrm{factual}} + R_{\mathrm{conceptual}} + R_{\mathrm{procedural}}}{3}$$

PhysicEdit achieves $R_{\mathrm{phys}} = 64.86\%$ (vs. 61.26% for Qwen-Image-Edit) and $R_{\mathrm{kg}} = 72.16\%$ (vs. 65.56%), gains of 3.60 and 6.60 percentage points, or roughly 5.9% and 10.1% in relative terms, respectively (Zhao et al., 25 Feb 2026).
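The distinction between percentage-point and relative gains can be checked directly from the four reported scores. The helper names below are hypothetical; the numbers are those quoted above.

```python
def absolute_gain(new: float, base: float) -> float:
    """Difference in percentage points."""
    return new - base


def relative_gain(new: float, base: float) -> float:
    """Improvement as a percentage of the baseline score."""
    return (new - base) / base * 100.0


scores = {
    "R_phys": (64.86, 61.26),  # (PhysicEdit, Qwen-Image-Edit)
    "R_kg": (72.16, 65.56),
}
for name, (new, base) in scores.items():
    print(name,
          round(absolute_gain(new, base), 2),
          round(relative_gain(new, base), 1))
```

The 5.9 and 10.1 figures come out of the relative calculation; the absolute gaps are 3.60 and 6.60 points.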

5. Insights, Strengths, and Limitations

Training on explicit transition trajectories as provided by PhysicTran38K enables models to internalize the temporal and causal mechanisms underpinning physical events, supporting improved generalization in physical realism and knowledge-grounding. The dual-stream architecture ensures global physical consistency through textual reasoning while capturing fine-grained motion and deformation patterns via latent queries.

Ablation studies show that removing either the textual or the visual stream degrades performance: textual reasoning primarily contributes to mechanical and causal understanding, whereas visual queries improve the modeling of optics and localized transitions.

PhysicTran38K's reliance on synthetic data poses limitations regarding the representation of complex real-world physics, such as turbulent fluid phenomena or intricate multi-object interactions. The taxonomy includes 46 transition types but excludes certain important phenomena (e.g., aerodynamic lift, electromagnetism). Current resolutions are fixed at 256×256 with single-camera viewpoints, possibly restricting the transferability to diverse real-world settings. Prospective directions include the integration of real-world video, higher spatial resolution, expansion of domains (e.g., acoustics, electromagnetics), and explicit enforcement of physical laws using physics-informed neural networks (Zhao et al., 25 Feb 2026).

6. Relevance and Implications for Physics-Aware Editing

PhysicTran38K addresses a critical bottleneck in instruction-based image editing, namely the underspecification of causal, physically plausible transitions, by providing richly annotated, temporally resolved supervision. Its integration into frameworks such as PhysicEdit has yielded state-of-the-art performance for open-source approaches, rivalling proprietary solutions in both physical realism and knowledge grounding. The dataset offers a scalable foundation for further research into physics-aware generative systems, especially those requiring deep semantic alignment with physical law and causal reasoning.

