
Inverse Dynamic Model Turing Test

Updated 14 January 2026
  • The Inverse Dynamic Model Turing Test quantitatively benchmarks whether generated videos encode sufficient physical dynamics to yield viable robotic control commands.
  • It employs a Gripper-Centric Inverse Dynamics Model (GC-IDM) trained solely on real robot trajectories and assesses performance via success rates on real manipulation tasks.
  • Empirical findings reveal that high visual realism does not guarantee actionable control, highlighting the need for dynamic grounding in video synthesis.

The Inverse Dynamic Model Turing Test is a quantitative benchmark for assessing whether videos produced by video foundation models (VFMs) in embodied AI are not merely visually plausible, but encode sufficient physical and dynamical grounding to enable actionable control in real robotic systems. Unlike perceptual Turing Tests that measure human beliefs about realism, the IDM Turing Test probes whether generated videos can be accurately parsed by a robot-trained inverse dynamics model to yield control sequences that succeed when executed in the physical world. This framework exposes a disjunction between appearance-based evaluation and execution-grounded competency, providing a rigorous criterion for embodied world model fidelity (Fan et al., 7 Jan 2026).

1. Formal Definition and Objective

The core question addressed by the Inverse Dynamic Model Turing Test is: given a video $V_i$ generated by a world model $M_i$ (conditioned on an initial image and a natural language instruction), can a purely real-robot–trained inverse dynamics model IDM* generate a control command sequence $\hat{a}_{1:T}$ that, when replayed on a physical robot, accomplishes the intended manipulation task? Success is formally measured by a binary indicator $R(\hat{a}_{1:T}) \in \{0, 1\}$, where “1” marks successful execution. The aggregate metric is

$$\text{SuccessRate}(M_i) = 100\% \cdot \frac{1}{N} \sum_{n=1}^{N} R\big(\text{IDM}^*(V_i^{(n)})\big)$$

where $N$ is the number of evaluation episodes. A high SuccessRate implies generated videos are indistinguishable from real ones in terms of actionable dynamics, thus “passing” the IDM Turing Test.
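The metric above can be sketched in a few lines of Python. Note that `idm` and `execute_on_robot` are hypothetical stand-ins for IDM* and the physical replay step, not APIs from the paper:

```python
def success_rate(generated_videos, idm, execute_on_robot):
    """Fraction (in %) of generated videos whose inferred action
    sequences succeed when replayed on the real robot."""
    outcomes = []
    for video in generated_videos:
        actions = idm(video)                        # \hat{a}_{1:T} = IDM*(V_i)
        outcomes.append(execute_on_robot(actions))  # R(\hat{a}_{1:T}) in {0, 1}
    return 100.0 * sum(outcomes) / len(outcomes)
```

Because $R$ is binary, the metric is simply the empirical success frequency over the $N$ evaluation episodes, scaled to a percentage.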

2. IDM Architecture and Training

The deployed inverse dynamics model is the Gripper-Centric Inverse Dynamics Model (GC-IDM), trained exclusively on real manipulation trajectories $\{(I_{1:T}, a_{1:T})\}$, where $I_{1:T}$ are video frames and $a_t \in \mathbb{R}^d$ encodes continuous gripper pose and open/close commands. GC-IDM learns a mapping

$$\hat{a}_t = f_\theta(I_{1:T})_t$$

Optimization leverages per-frame visual encodings (via convolutional backbones or transformers) aggregated temporally (using RNNs or self-attention) and decoded into action predictions. Training is driven by an $\ell_2$ mean-squared error loss:

$$L(\theta) = \frac{1}{T} \sum_{t=1}^{T} \left\| a_t - f_\theta(I_{1:T})_t \right\|^2$$

A critical protocol constraint is that GC-IDM is exposed exclusively to real video streams—never to generated samples—ensuring that its predictions directly reflect the physical plausibility encoded in the generative model’s output.
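The training objective can be illustrated with a deliberately simplified sketch: frames are treated as pre-encoded feature vectors and $f_\theta$ as a single linear map per frame, optimized by plain gradient descent on the $\ell_2$ loss. The paper's GC-IDM uses convolutional/transformer encoders with temporal aggregation; the dimensions and optimizer here are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, feat_dim, act_dim = 16, 32, 7            # horizon, feature size, gripper DoF (assumed)
frames = rng.normal(size=(T, feat_dim))     # I_{1:T} as per-frame encodings
actions = rng.normal(size=(T, act_dim))     # ground-truth a_{1:T}

theta = np.zeros((feat_dim, act_dim))
lr = 0.02
for _ in range(1000):
    pred = frames @ theta                   # \hat{a}_t = f_theta(I_{1:T})_t
    # Gradient of L(theta) = (1/T) sum_t ||a_t - pred_t||^2
    grad = frames.T @ (pred - actions) * (2.0 / T)
    theta -= lr * grad

loss = np.mean(np.sum((actions - frames @ theta) ** 2, axis=1))  # L(theta)
```

On real data, the same loss is minimized over all trajectories in the training set; the per-frame linear map is merely the simplest stand-in for the learned encoder–decoder stack.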

3. Evaluation Protocol

The evaluation pipeline encompasses the following steps:

  • Task and Data Selection: Nine real-robot manipulation tasks (ranging from simple placement to complex insertion) form the evaluative corpus. For each, held-out real execution videos serve for calibration and as conditions for generative rollout.
  • Video Generation: Distinct world models (Kling, Hailuo, CogVideoX, Cosmos-Predict1/2, Wan2.1, WoW-wan, WoW-cosmos2) receive identical image and instruction input, producing synthesized 5s videos for each task.
  • Action Sequence Inference: Generated videos are parsed by IDM*, yielding control sequences $\hat{a}_{1:T}$.
  • Real-World Robot Replay: The robot executes $\hat{a}_{1:T}$ on hardware, with outcome $R(\hat{a}_{1:T})$ set to 1 if the goal is achieved, 0 otherwise.
  • Ground-Truth Calibration: To validate IDM* reliability, real videos are replayed through the model, achieving ≈90% success across all tasks. This safeguard confirms that failures on generated videos stem from the generative model, not from an inept IDM.
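The calibration step above can be sketched as a simple gate that must pass before generated videos are scored. Here `gc_idm` and `replay_on_robot` are hypothetical placeholders for the trained model and the hardware replay, and the 0.9 threshold mirrors the ≈90% figure reported in the protocol:

```python
def calibrate_idm(real_videos, gc_idm, replay_on_robot, threshold=0.9):
    """Sanity check: real execution videos replayed through the IDM
    should succeed at or above the threshold rate before any
    generated videos are evaluated."""
    successes = sum(replay_on_robot(gc_idm(v)) for v in real_videos)
    return successes / len(real_videos) >= threshold
```

If this gate fails, low success rates on generated videos could not be attributed to the generative models, which is why calibration precedes evaluation.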

4. Execution-Accuracy Metrics

The evaluation metric is the per-model Real-World Success Rate:

$$\text{SuccessRate}(M_i) = 100\% \cdot \frac{1}{N} \sum_{n=1}^{N} R\big(\text{IDM}^*(V_i^{(n)})\big)$$

where $N = 9$ (one trial per task is reported in the primary results). No further statistical aggregation is specified, but the drastic disparities between models serve as robust indicators.

Model Performance Table

| Model | Real-World SuccessRate (%) |
|---|---|
| Kling | 9.88 |
| Hailuo | 2.47 |
| CogVideoX | 0.00 |
| Cosmos-Predict1 | 0.00 |
| Wan2.1 | 0.00 |
| Cosmos-Predict2 | 8.64 |
| WoW-wan | 40.74 |
| WoW-cosmos2 | 18.52 |

GC-IDM achieves ≈90% success on real videos, solidifying its reliability for this protocol.

5. Empirical Findings and Analysis

Despite several models achieving high scores on perceptual realism and physics-consistency metrics (notably Kling’s 68.02 “Physical Law” score) and routinely fooling human raters at rates above 50%, nearly all collapse to 0% success when evaluated under the IDM Turing Test. The only two models to exceed a 10% success threshold, WoW-wan (40.74%) and WoW-cosmos2 (18.52%), share explicit training on real-robot data and built-in biases for contact dynamics. A plausible implication is that time-aligned, force-motion encoding, which is absent from purely visual training, is strictly necessary for actionable control. Superficial resemblance to real events is shown to be insufficient for grounded manipulation.

The authors highlight several critical insights:

  • Appearance fidelity does not guarantee dynamical plausibility; executable priors demand grounding in real-world interaction data.
  • Passing the IDM Turing Test correlates strongly with both physical-law and instruction-understanding metrics, suggesting that successful models must achieve synergy across physical and semantic axes.
  • VFMs, despite progress, are not yet reliable as universal priors for embodied agents, motivating research in integrating simulators, real-world trajectories, and action-conditioned rollouts.

6. Implications for Embodied AI and Future Directions

The IDM Turing Test establishes a principled, quantitative bridge between generative video synthesis and executable robotics, surfacing a fundamental chasm between perceptual realism and true physical utility. This suggests that embodied AI systems must couple high-capacity video modeling with rigorous dynamic priors, incentivizing research in hybrid architectures, data-driven simulation augmentation, and more granular action–conditioned generation. The metric’s sensitivity to execution failure, in direct contrast with human judgments of realism, indicates that grounding generative models in practical control protocols may be essential for closing the gap between simulated and deployable robotic intelligence in future embodied AI systems (Fan et al., 7 Jan 2026).
