
Dynamix: Language-Based Physical Simulation

Updated 3 July 2026
  • Dynamix is a language-based framework that converts monocular video into structured YAML scene configurations for rigid-body physics simulation.
  • It leverages optical flow and natural-language reasoning to capture motion details and physical parameters like geometry, state, and material properties.
  • The framework bridges perception and simulation, enabling editable, counterfactual simulations and outperforming general-purpose VLM baselines on segmentation IoU and optical flow EPE.

Dynamix, in this usage, refers to ΔYNAMICS, a language-based vision-language framework for inferring rigid-body dynamics from monocular videos and converting them into structured scene configurations that a physics engine can simulate. Its central premise is to treat physical inference not as regression of a narrow parameter vector, but as generation of a textual representation of the full physical scene, including object geometry, state, material properties, camera, and gravity, so that simulation becomes a language generation problem (Kao et al., 20 May 2026). The framework is presented as an interpretable, editable bridge between perception and physics simulation, aimed at settings with multiple object types, multiple interacting objects, varying viewpoints or camera poses, and real-world scenes.

1. Problem formulation

ΔYNAMICS addresses the problem of recovering physical scene parameters and motion dynamics from a single video. Given a video $\mathbf{X}$, the model predicts a configuration $\mathbf{c} = \mathcal{F}_\theta(\mathbf{X})$, which is then passed to a simulator $\mathcal{S}$ to reconstruct the video: $\hat{\mathbf{X}} = \mathcal{S}(\mathbf{c})$.

The objective is for the simulated rollout to faithfully reproduce the observed motion (Kao et al., 20 May 2026).

The paper positions this as a more general problem than prior physics-from-video formulations because the intended scope includes multiple object types, multiple interacting objects, varying viewpoints or camera poses, and real-world scenes rather than constrained toy settings. A plausible implication is that the framework is designed less as a task-specific estimator and more as a general scene-to-simulation interface.

This problem setting is operationalized through simulator-based reconstruction rather than direct supervision on a fixed physical state vector. That choice makes the inferred output directly consumable by a physics engine, which in turn permits comparison between observed and simulated motion through rendered outputs such as masks and flows.
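The comparison between observed and simulated motion can be illustrated with the two metrics named later in the text, segmentation IoU and optical flow end-point error (EPE). The following is a minimal sketch using plain nested lists; the function names `mask_iou` and `flow_epe`, the toy masks and flows, and the empty-mask convention are illustrative assumptions, not the paper's evaluation code.

```python
import math

def mask_iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union

def flow_epe(pred, target):
    """Mean Euclidean distance between per-pixel (dx, dy) flow vectors."""
    errors = [
        math.hypot(pu - tu, pv - tv)
        for row_p, row_t in zip(pred, target)
        for (pu, pv), (tu, tv) in zip(row_p, row_t)
    ]
    return sum(errors) / len(errors)

# Toy data standing in for rendered simulator outputs and observations.
observed_mask = [[0, 1, 1], [0, 1, 1]]
simulated_mask = [[0, 0, 1], [0, 1, 1]]
observed_flow = [[(1.0, 0.0), (0.0, 0.0)]]
simulated_flow = [[(4.0, 4.0), (0.0, 0.0)]]

iou = mask_iou(simulated_mask, observed_mask)
epe = flow_epe(simulated_flow, observed_flow)
print(iou, epe)
```

Higher IoU and lower EPE both indicate that the simulated rollout is closer to the observed video.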

2. Language as a scene representation

The main representational move in ΔYNAMICS is to recast physical inference as text generation. Instead of regressing a fixed numeric parameter vector, the model generates a YAML-formatted scene description (Kao et al., 20 May 2026).

The structured scene representation includes object properties, initial state, and global parameters. The paper summarizes the parameter categories as geometry or inertial parameters, material parameters, kinematics, orientation, camera, and environment. Concretely, these include radius, height, width, depth, mass, rolling or sliding friction, damping, position, linear and angular velocity, quaternion, camera pose or field of view, and gravity. This representation is intended to encode a dynamical scene as language.

A typical schema contains entries for rigid bodies and for global entities such as camera and gravity, and the paper gives a representative YAML format of this kind.
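To make the idea of a textual scene description concrete, the sketch below builds a small scene as a Python dictionary and prints it as block-style YAML. Every key name (`objects`, `shape`, `sliding_friction`, `fov`, and so on), every value, and the helper `to_yaml` are hypothetical choices made for illustration; they follow the parameter categories listed above but are not the paper's actual schema.

```python
def to_yaml(node, indent=0):
    """Serialize nested dicts and lists of dicts into simple block-style YAML lines."""
    pad = "  " * indent
    lines = []
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(value, indent + 1))
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            lines.append(f"{pad}{key}:")
            for item in value:
                item_lines = to_yaml(item, indent + 2)
                # Rewrite the first key of each item as a "- key: value" entry.
                item_lines[0] = f"{pad}  - {item_lines[0].lstrip()}"
                lines.extend(item_lines)
        else:
            # Scalars and numeric lists; Python's list repr is a valid YAML flow sequence.
            lines.append(f"{pad}{key}: {value}")
    return lines

# Hypothetical scene: one sphere plus global camera and gravity entries.
scene = {
    "objects": [
        {
            "name": "ball_0",
            "shape": "sphere",
            "radius": 0.1,
            "mass": 0.5,
            "sliding_friction": 0.8,
            "position": [0.0, 0.0, 0.1],
            "quaternion": [1.0, 0.0, 0.0, 0.0],
            "linear_velocity": [1.5, 0.0, 0.0],
            "angular_velocity": [0.0, 0.0, 0.0],
        }
    ],
    "camera": {"position": [0.0, -3.0, 1.5], "fov": 45.0},
    "gravity": [0.0, 0.0, -9.81],
}

yaml_text = "\n".join(to_yaml(scene))
print(yaml_text)
```

Because the description is plain text, a counterfactual edit amounts to changing a value, for example `scene["gravity"]`, and serializing again.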

The significance of this representation is not only that it is structured, but that it is human-readable, directly editable, scalable to many objects, and immediately consumable by a simulator (Kao et al., 20 May 2026). This suggests that the framework is designed for inference and for downstream editing or counterfactual simulation, rather than only for recognition.

3. Pipeline and learning objective

The end-to-end pipeline is video → text → simulation. The model takes a monocular video, using 10 sampled frames from a 1-second clip in training. For motion-aware preprocessing, the preferred input is optical flow computed by RAFT rather than RGB alone. The backbone is Qwen2.5-VL-3B, fine-tuned on 400K synthetic MuJoCo videos, and the output is a YAML scene configuration (Kao et al., 20 May 2026).

The paper also defines a reasoning-enhanced variant. In the vanilla target format, the model emits

{<answer> configuration </answer>}.

In the reasoning-augmented format, the model first emits a natural-language description that summarizes the motion and then the structured configuration.
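Downstream code has to pull the configuration out of the generated text before it can be simulated. The sketch below extracts the content of the `<answer>` tags shown in the vanilla format and treats any text preceding them as the motion description. How the description is delimited in the actual reasoning-augmented format is not assumed here; the helper names and example strings are hypothetical.

```python
import re

# Non-greedy match across newlines for the configuration payload.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract_configuration(generated):
    """Return the text between <answer> and </answer>, stripped of whitespace."""
    match = ANSWER_RE.search(generated)
    if match is None:
        raise ValueError("no <answer>...</answer> block found")
    return match.group(1).strip()

def extract_preamble(generated):
    """Return any text that appears before <answer>, such as a motion description."""
    return generated.split("<answer>", 1)[0].strip()

# Hypothetical model outputs for the two target formats.
vanilla = "<answer>\ngravity: [0, 0, -9.81]\n</answer>"
reasoning = "The ball rolls right and stops.\n<answer>\ngravity: [0, 0, -9.81]\n</answer>"

print(extract_configuration(vanilla))
print(extract_preamble(reasoning))
```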

After generation, the YAML is converted into MuJoCo XML and simulator state. The simulator renders a reconstructed video $\hat{\mathbf{X}}$, along with masks and flows. Evaluation then compares simulated outputs to ground truth using segmentation IoU and optical flow EPE (Kao et al., 20 May 2026).
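The conversion step can be sketched with Python's standard `xml.etree.ElementTree`, emitting a small MJCF document. The MJCF elements used here (`option` with `gravity`, `worldbody`, `body` with `pos` and `quat`, `freejoint`, and `geom` with `type`, `size`, `mass`) are standard MuJoCo XML, but the input dictionary layout, the sphere-only handling, and the added floor plane are simplifying assumptions rather than the paper's converter.

```python
import xml.etree.ElementTree as ET

def fmt(values):
    """Format a numeric sequence as a space-separated MJCF attribute."""
    return " ".join(str(v) for v in values)

def scene_to_mjcf(scene):
    """Build a minimal MJCF string from a parsed scene dictionary (spheres only)."""
    root = ET.Element("mujoco", model="scene")
    ET.SubElement(root, "option", gravity=fmt(scene["gravity"]))
    world = ET.SubElement(root, "worldbody")
    # Assumed ground plane so that objects have a surface to rest on.
    ET.SubElement(world, "geom", name="floor", type="plane", size="5 5 0.1")
    for obj in scene["objects"]:
        body = ET.SubElement(
            world, "body",
            name=obj["name"],
            pos=fmt(obj["position"]),
            quat=fmt(obj["quaternion"]),
        )
        # A free joint gives the body six degrees of freedom.
        ET.SubElement(body, "freejoint")
        ET.SubElement(
            body, "geom",
            type="sphere", size=str(obj["radius"]), mass=str(obj["mass"]),
        )
    # Initial velocities belong to simulator state, not XML, and are not handled here.
    return ET.tostring(root, encoding="unicode")

scene = {
    "gravity": [0.0, 0.0, -9.81],
    "objects": [
        {"name": "ball_0", "radius": 0.1, "mass": 0.5,
         "position": [0.0, 0.0, 0.1], "quaternion": [1.0, 0.0, 0.0, 0.0]},
    ],
}

xml_text = scene_to_mjcf(scene)
print(xml_text)
```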

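Training uses the standard autoregressive text likelihood, $p_\theta(\mathbf{c} \mid \mathbf{X}) = \prod_{t=1}^{T} p_\theta(c_t \mid c_{<t}, \mathbf{X})$, where $c_t$ denotes the $t$-th token of the target text, with negative log-likelihood loss

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(c_t \mid c_{<t}, \mathbf{X}).$$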
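As a small numeric illustration of an autoregressive negative log-likelihood, the sketch below sums the negative log of per-token probabilities for one target sequence. The probabilities are made-up values, and `sequence_nll` is a hypothetical helper, not part of any training code described in the source.

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of a sequence given each token's conditional probability."""
    return -sum(math.log(p) for p in token_probs)

# Made-up conditional probabilities for a three-token target.
probs = [0.5, 0.25, 0.5]
nll = sequence_nll(probs)

# The sequence likelihood is the product of the per-token terms.
likelihood = math.exp(-nll)
print(nll, likelihood)
```

Minimizing this quantity over the training set is equivalent to maximizing the likelihood of the target text given the video.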

Within the paper’s formulation, the structured scene description is the prediction target, and simulation is the mechanism that turns that prediction into measurable physical reconstruction.

4. Motion-aware inputs and natural-language reasoning

The framework integrates motion information in two complementary ways: optical flow as input and natural-language motion reasoning (Kao et al., 20 May 2026).

Optical flow is treated as a semantics-agnostic representation of motion. The paper explicitly states that this strips away appearance and semantics, focusing the model on motion, and reports that using optical flow instead of RGB improves full-sequence segmentation IoU on CLEVRER by 26%, from 0.19 to 0.24. On the in-distribution synthetic benchmark, the same transition is associated with full-sequence IoU increasing from 0.32 to 0.49, and object composition accuracy increasing from 0.60 to 0.97.
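The 26% figure is a relative gain, which can be checked directly from the reported IoU values. The sketch below does that arithmetic; the synthetic-benchmark percentage it prints is derived here from the reported 0.32 and 0.49 and is not a number stated in the source.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before

# CLEVRER full-sequence IoU, RGB input vs. optical flow input.
clevrer_gain = relative_gain(0.19, 0.24)

# Synthetic benchmark full-sequence IoU, same comparison (derived, not quoted).
synthetic_gain = relative_gain(0.32, 0.49)

print(round(100 * clevrer_gain), round(100 * synthetic_gain))
```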

Natural-language motion reasoning is introduced as an intermediate textual representation derived from simulator traces. The model is trained to produce a motion summary describing visibility changes, motion changes such as stopping, and collisions or contact events, before producing the YAML scene configuration. The reasoning text is inserted ahead of the configuration in the model output.

The paper states that this intermediate text gives the model a more structured causal representation of the scene and improves downstream configuration prediction.

Quantitatively, the reasoning-augmented variant improves synthetic full-sequence IoU from 0.49 to 0.54 and object composition accuracy from 0.97 to 0.99. On CLEVRER, the same progression raises full-sequence IoU from 0.24 to 0.30 (Kao et al., 20 May 2026). A plausible implication is that the method benefits from making causal and event-level regularities explicit before emitting the simulator-facing configuration.

5. Evaluation, transfer, and test-time optimization

The paper evaluates ΔYNAMICS on synthetic MuJoCo data, on CLEVRER as a cross-engine transfer benchmark, and on a new dataset of 235 real-world rigid-body videos (Kao et al., 20 May 2026).

On the synthetic benchmark, the reported full-sequence IoU values are 0.32 for the RGB model, 0.49 for the optical flow model, and 0.54 for optical flow with motion reasoning. The same table reports object composition accuracy of 0.60, 0.97, and 0.99, respectively. The model is also compared against InternVL3-8B, Qwen2.5-VL-7B, and Claude-4-Sonnet, which are described as having very low segmentation IoU and much worse flow EPE than ΔYNAMICS.

On CLEVRER, which is rendered in Blender and therefore tests transfer beyond MuJoCo, the reported full-sequence IoU values are 0.02 for InternVL3-8B, 0.01 for Qwen2.5-VL-7B, 0.04 for Claude-4-Sonnet, 0.19 for ΔYNAMICS RGB, 0.24 for ΔYNAMICS optical flow, and 0.30 for ΔYNAMICS with motion reasoning. The 0.30 segmentation IoU is the headline result summarized as a 7x improvement over leading VLMs.

The paper further studies test-time optimization. Under Best-of-32 sampling on CLEVRER, the base model’s full-sequence IoU improves from 0.24 to 0.28. For the reasoning model, first-frame IoU improves from 0.30 to 0.38, and flow EPE also improves substantially. The abstract summarizes the gains as 27% improvement from test-time sampling and 120% improvement from evolutionary search. The table also indicates that CMA-ES gives the best full-sequence alignment, with very low EPE and strong IoU, especially when initialized from Best@32. Preference optimization is reported in the appendix to provide a modest improvement, such as first-frame EPE improving from 2.22 to 1.85 under Best@32 for the reasoning model, but it is weaker than CMA-ES.
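Best-of-N sampling can be sketched generically: draw N candidate configurations, simulate each, score the rollout against the observation, and keep the best. In the sketch below the sampler, simulator, and scoring function are toy stand-ins (a single hypothetical velocity parameter `vx` and a 1-D final position) chosen only to make the loop runnable; the source describes scoring through rendered outputs such as masks and flows, and this is not the paper's implementation or its CMA-ES procedure.

```python
import random

def best_of_n(sample_config, simulate, score, n=32, seed=0):
    """Sample n candidate configurations and return the highest-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n):
        config = sample_config(rng)
        candidate_score = score(simulate(config))
        if candidate_score > best_score:
            best_config, best_score = config, candidate_score
    return best_config, best_score

# Toy setup: the "observation" is a final x-position after one second.
OBSERVED_X = 1.5

def sample_config(rng):
    # Stand-in for sampling a configuration from the model.
    return {"vx": rng.uniform(0.0, 3.0)}

def simulate(config):
    # Stand-in for a physics rollout: constant velocity for one second.
    return config["vx"] * 1.0

def score(final_x):
    # Higher is better: negative distance to the observed position.
    return -abs(final_x - OBSERVED_X)

single_config, single_score = best_of_n(sample_config, simulate, score, n=1)
best_config, best_score = best_of_n(sample_config, simulate, score, n=32)
print(single_score, best_score)
```

With a shared seed, the 32-sample run includes the single-sample candidate, so its best score can only match or exceed it.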

For real-world transfer, the paper evaluates on 235 real-world rigid-body videos captured with iPhone 13 and Canon cameras across indoor and outdoor surfaces. The method generalizes to shoeboxes, balls, containers, massage rollers, apples, and some irregular objects. Reported results are 0.26 IoU and 0.67 flow EPE for the base model, 0.29 IoU and 0.58 EPE for motion reasoning, 0.41 IoU and 0.46 EPE for Best@32, and 0.65 full-sequence IoU and 0.36 EPE for CMA-ES (Kao et al., 20 May 2026).

The paper also reports a pilot human study in which humans achieved mean IoU 0.44 and EPE 1.38, while the model reached 0.61 and 0.71 on the corresponding metric setup. This is presented as evidence that the model can outperform humans on the reconstruction task.

6. Interpretation, scope, and common misunderstandings

The framework is described as bridging perception and simulation because it perceives dynamics from video via a VLM, represents them in an explicit language format, executes them in a physics engine, and can support editing and counterfactual simulation (Kao et al., 20 May 2026). The YAML scene description therefore functions as a practical common language between visual observation and physical rollout.

A common misunderstanding would be to view Dynamix as only a recognizer of rigid-body properties. The paper instead presents it as a text-based physics interface that can infer, simulate, and eventually edit real-world rigid-body motion. Another possible misunderstanding is to treat the method as ordinary video captioning. The generated text is not free-form description alone, but a structured scene configuration designed for conversion into MuJoCo XML and simulator state.

The scope of the method remains specifically rigid-body dynamics from monocular video. The evaluation uses segmentation IoU and optical flow EPE, and the reported representation explicitly covers object geometry or inertia, material parameters, kinematics, orientation, camera pose or field of view, and gravity. This suggests a formulation centered on simulator-compatible scene recovery rather than unconstrained physical reasoning.

Within that scope, ΔYNAMICS advances a particular thesis: language can serve as a unified representation of rigid-body dynamics. The paper’s quantitative results on CLEVRER, synthetic MuJoCo evaluation, and the 235-video real-world dataset support the claim that this representation can generalize across engines and transfer beyond synthetic training conditions (Kao et al., 20 May 2026).
