Test-Time Consistency via Inverse Dynamics
- The paper proposes a test-time consistency method that uses an inverse dynamics model to verify alignment between predicted actions and corresponding state transitions.
- It details a framework where a consensus-based best-of-N selection at inference and inverse dynamics rewards during training enhance rollout fidelity.
- Empirical results in single-arm and bimanual robotic tasks demonstrate improved success rates and kinematic plausibility, underscoring its practical utility.
Test-time consistency via inverse dynamics is a methodology for verifying and improving the reliability of predictions made by World Action Models (WAMs) in sequential decision-making systems, particularly in robotics. By employing an inverse dynamics model (IDM), this approach evaluates whether the predicted future states generated by a WAM are compatible with the actions associated with those transitions. This diagnostic serves both as a filtering criterion for selecting plausible imagined rollouts at inference time and as a signal for reinforcement learning alignment of generative video world models. The framework is model-agnostic, requiring no value estimation, and is applicable to both joint-prediction and factorized WAM architectures (Ruan et al., 8 May 2026, Wang et al., 18 Mar 2026).
1. Foundations: World Action Models and Inverse Dynamics
World Action Models (WAMs) parameterize distributions over future state-action trajectories conditioned on history and task. In discrete time, at each step $t$, the agent observes state $s_t$, executes action $a_t$, and transitions to $s_{t+1}$.
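The joint-prediction form models the rollout over a horizon $H$ as a single distribution, $p_\theta(s_{t+1:t+H}, a_{t:t+H-1} \mid s_{\le t}, g)$, where $g$ denotes the task, while the inverse-dynamics-factorized form splits this into a future-state model and an action decoder: $p_\theta(s_{t+1:t+H} \mid s_{\le t}, g) \prod_{k=t}^{t+H-1} p_\phi(a_k \mid s_k, s_{k+1})$. An inverse dynamics model $f_{\mathrm{IDM}}$ is trained to map a state transition $(s_k, s_{k+1})$ to the most plausible action $\hat{a}_k = f_{\mathrm{IDM}}(s_k, s_{k+1})$. For a rollout $\tau = (s_t, a_t, s_{t+1}, \dots, a_{t+H-1}, s_{t+H})$, the consistency metric $C(\tau)$ quantifies the alignment between predicted actions and observed state transitions. Hard consistency uses exact equality,
$$C_{\mathrm{hard}}(\tau) = \frac{1}{H} \sum_{k=t}^{t+H-1} \mathbb{1}\left[a_k = \hat{a}_k\right],$$
while soft metrics average a graded per-step agreement between $a_k$ and $\hat{a}_k$ or employ an exponential decay in a metric space.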
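The hard and soft variants can be illustrated with a minimal Python sketch. It assumes the IDM-decoded actions are already available as an array, treats "exact equality" as agreement within a small tolerance `tol`, and uses a Euclidean distance with decay scale `sigma` for the soft variant; the function name and all three parameters are illustrative choices, not taken from the cited papers.

```python
import numpy as np

def idm_consistency(pred_actions, idm_actions, mode="hard", tol=1e-6, sigma=1.0):
    """Agreement between a rollout's predicted actions and IDM-decoded actions.

    Both inputs have one row per step (shape (H, action_dim) or (H,)).
    `tol` and `sigma` are assumed hyperparameters for this sketch.
    """
    pred = np.asarray(pred_actions, dtype=float)
    dec = np.asarray(idm_actions, dtype=float)
    if pred.shape != dec.shape:
        raise ValueError("action sequences must have the same shape")
    # Flatten each step to a vector, then take the per-step Euclidean distance.
    pred = pred.reshape(len(pred), -1)
    dec = dec.reshape(len(dec), -1)
    dist = np.linalg.norm(pred - dec, axis=1)
    if mode == "hard":
        # Fraction of steps where the two actions coincide (up to tol).
        return float(np.mean(dist <= tol))
    if mode == "soft":
        # Exponential decay in the action metric space, averaged over steps.
        return float(np.mean(np.exp(-dist / sigma)))
    raise ValueError("mode must be 'hard' or 'soft'")

# Toy rollout: the third predicted action disagrees with the IDM.
predicted = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
decoded = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
hard_score = idm_consistency(predicted, decoded, mode="hard")
soft_score = idm_consistency(predicted, decoded, mode="soft")
```

The soft score degrades gradually with the size of the disagreement, whereas the hard score only counts matching steps.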
2. Consistency Metrics and the Executability Gap
Test-time consistency assesses dynamic compatibility, distinguishing visually plausible but physically unattainable trajectories from realizable ones. In video world models, the executability gap denotes the discrepancy between visually realistic rollouts and those that induce feasible joint-space actions when decoded by an IDM (Wang et al., 18 Mar 2026). The executability gap is measured via kinematic penalties imposed on the decoded action sequence $\hat{a}_{1:T}$, and an associated reward is derived from those penalties.
A high executability gap indicates that the rollout, though possibly visually convincing, contains kinematic inconsistencies or physically implausible commands.
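The exact penalty terms are not reproduced in this section, so the following Python sketch is only one plausible instantiation. It assumes the decoded actions are joint positions sampled every `dt` seconds and penalizes finite-difference velocities and accelerations that exceed hypothetical limits `v_max` and `a_max`; the reward is taken to be the negative penalty, which is also an assumption.

```python
import numpy as np

def kinematic_penalty(actions, dt=0.1, v_max=1.0, a_max=5.0):
    """Sum of velocity and acceleration limit violations (assumed penalty form).

    actions: array of shape (T, dof) of decoded joint positions.
    dt, v_max, a_max: hypothetical sampling period and limits.
    """
    q = np.asarray(actions, dtype=float)
    vel = np.diff(q, axis=0) / dt      # finite-difference joint velocity
    acc = np.diff(vel, axis=0) / dt    # finite-difference joint acceleration
    v_excess = np.clip(np.abs(vel) - v_max, 0.0, None)
    a_excess = np.clip(np.abs(acc) - a_max, 0.0, None)
    return float(v_excess.sum() + a_excess.sum())

def executability_reward(actions, **limits):
    """Assumed reward: higher when the decoded actions violate fewer limits."""
    return -kinematic_penalty(actions, **limits)

smooth = [[0.0], [0.05], [0.1]]   # steady motion within the limits
jumpy = [[0.0], [1.0], [0.0]]     # abrupt reversal far beyond the limits
smooth_penalty = kinematic_penalty(smooth)
jumpy_penalty = kinematic_penalty(jumpy)
```

Under these assumptions a rollout that looks fine frame by frame can still receive a large penalty if the decoded joint commands jump between steps.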
3. Test-Time Consistency for Rollout Selection
At inference, test-time consistency serves as a value-free criterion for best-of-N selection. Given $N$ candidate rollouts, each is scored according to its action-state consistency. In the "future-consensus" strategy, predicted futures are averaged to form a consensus state, and each candidate $i$ receives a score $c_i$ computed against that consensus. The branch $i^* = \arg\max_i c_i$ is selected, and its first action $a_t^{(i^*)}$ is executed. No environment resets are needed, and there is no reliance on a trained value function or reward model (Ruan et al., 8 May 2026). This method is both training-agnostic and deployable across WAM instantiations.
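A rough Python sketch of future-consensus selection follows. The scoring rule is an assumption made for illustration: the IDM decodes the action implied by moving from the current state to the consensus future, and each candidate is scored by how close its own first action is to that decoded action. The `toy_idm` function is a hypothetical stand-in for a trained IDM.

```python
import numpy as np

def toy_idm(state, next_state):
    """Hypothetical IDM for a point mass: the action is the displacement."""
    return next_state - state

def select_by_future_consensus(state, cand_futures, cand_actions, idm=toy_idm):
    """Pick the candidate whose action best matches the consensus future.

    cand_futures: (N, state_dim) predicted next states, one per candidate.
    cand_actions: (N, action_dim) first actions, one per candidate.
    """
    state = np.asarray(state, dtype=float)
    futures = np.asarray(cand_futures, dtype=float)
    actions = np.asarray(cand_actions, dtype=float)
    consensus = futures.mean(axis=0)            # average predicted future
    consensus_action = idm(state, consensus)    # action implied by consensus
    # Assumed score: negative distance to the consensus-implied action.
    scores = -np.linalg.norm(actions - consensus_action, axis=1)
    best = int(np.argmax(scores))
    return best, actions[best], scores

# Toy example: three similar candidates and one outlier.
state = [0.0, 0.0]
futures = [[1.0, 0.0], [1.1, 0.0], [0.9, 0.0], [0.0, 2.0]]
actions = futures  # with state at the origin, displacement equals the future
best, first_action, scores = select_by_future_consensus(state, futures, actions)
```

No value function or environment reset appears anywhere in the procedure; the only learned components are the WAM that proposes candidates and the IDM that decodes the consensus.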
4. Training-Time Alignment with Inverse Dynamics Rewards
Beyond test-time filtering, inverse dynamics models can align training distributions in generative video world models. The Executable Video Alignment (EVA) framework leverages a frozen IDM as a dense kinematic reward model, assigning each generated rollout a reward $R$ computed from the actions the IDM decodes from it. The generator is fine-tuned to maximize $R$ via Group Relative Policy Optimization (GRPO), with an additional KL regularization toward the reference model to avoid catastrophic drift. This alignment reduces the executability gap upstream, lessening the need for rejection sampling at inference (Wang et al., 18 Mar 2026).
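The group-relative step of GRPO can be sketched briefly. For a group of rollouts sampled from the same conditioning, each reward is normalized by the group mean and standard deviation to give an advantage. The reward values below are made-up numbers standing in for IDM-based kinematic rewards, and the policy-gradient update and KL term are omitted.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style advantages).

    rewards: one scalar reward per rollout in the group.
    eps: small constant that avoids division by zero for identical rewards.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical IDM rewards for three rollouts generated from the same prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Rollouts with above-average kinematic reward receive positive advantages and are reinforced, while a group with identical rewards contributes no learning signal.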
5. Empirical Evaluation and Metrics
Empirical studies on RoboCasa (single-arm, Cosmos-Policy) and RoboTwin 2.0 (bimanual, LingBot-VA) support a strong correlation between action-state consistency and downstream task success. In Ruan et al. (8 May 2026), a logistic regression on episode-level consistency achieved AUC = 0.77 (single-arm) and AUC = 0.88 (bimanual); per-task ROC AUC frequently exceeded 0.9 for tasks with substantive motion. Applying test-time consensus improved performance without further training: the Cosmos-Policy success rate rose from 66.6% to 67.3%, and that of LingBot-VA from 90.2% to 93.0%.
Comparable improvements were observed for EVA-aligned models (Wang et al., 18 Mar 2026): structured human judgment found that kinematic plausibility increased from 70.5% to 91.4% and perfect execution from 68.1% to 83.8% after IDM-reward RL. RoboTwin simulation success rose from 46.2% to 52.6%.
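To make the AUC figures concrete, the sketch below computes a ROC AUC directly from episode-level consistency scores and binary success labels using the pairwise (Mann-Whitney) formulation. This simplifies the logistic-regression analysis described above: with a single feature and a positive coefficient, the fitted probabilities are a monotone transform of the raw score and give the same ranking. The scores and labels are toy values, not data from the cited studies.

```python
def roc_auc(scores, labels):
    """Probability that a successful episode outscores a failed one.

    scores: episode-level consistency values.
    labels: truthy for success, falsy for failure.
    Ties between a success and a failure count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy episodes: consistency scores and whether the task succeeded.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
```

An AUC of 0.5 means consistency carries no information about success, and 1.0 means every successful episode scores above every failed one.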
6. Failure Modes, Limitations, and Extensions
Action-state consistency is sensitive to background collapse, where nearly static predictions inflate the metric despite task failure. This effect shows up as a small latent change magnitude $\|\Delta z\|$ between consecutive predicted states; consistency is negatively correlated with latent change magnitude. In low-dynamic (quasi-static) tasks, the AUC of consistency as a predictor of success degrades toward 0.5. Proposed extensions include weighting consistency by latent change to down-weight trivial transitions, joint training of the IDM to emphasize task-relevant transitions, and hybrid criteria combining consistency with value estimation or learned verifiers (Ruan et al., 8 May 2026).
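The latent-change weighting idea can be sketched as a weighted average in which each step's consistency counts in proportion to how much the predicted latent state moved. The weighting scheme, the choice to return 0 for a fully static rollout, and the function name are all assumptions made for illustration rather than the formulation proposed in the cited work.

```python
import numpy as np

def latent_weighted_consistency(step_scores, latents, eps=1e-8):
    """Average per-step consistency, weighted by latent change magnitude.

    step_scores: (H,) per-step consistency values in [0, 1].
    latents: (H + 1, latent_dim) predicted latent states along the rollout.
    Returns 0.0 for a (near-)static rollout, an assumed convention that
    prevents background collapse from earning a high score.
    """
    scores = np.asarray(step_scores, dtype=float)
    z = np.asarray(latents, dtype=float)
    change = np.linalg.norm(np.diff(z, axis=0), axis=1)  # one value per step
    if change.shape != scores.shape:
        raise ValueError("need exactly one latent transition per step score")
    total = change.sum()
    if total < eps:
        return 0.0
    return float(np.dot(change, scores) / total)

# A static rollout with perfect per-step agreement no longer scores 1.0.
static_score = latent_weighted_consistency([1.0, 1.0], [[0.0], [0.0], [0.0]])
# A moving rollout: the inconsistent third step carries the most weight.
moving_score = latent_weighted_consistency([1.0, 1.0, 0.0],
                                           [[0.0], [1.0], [1.0], [3.0]])
```

With this weighting, steps where nothing changes contribute nothing, so agreement on trivial transitions cannot mask disagreement on the transitions that matter.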
Test-time rejection sampling via an IDM-based filter is less efficient than upstream alignment strategies such as EVA, which incorporates IDM-based kinematic rewards directly into generator training, greatly raising acceptance rates and rollout fidelity at inference (Wang et al., 18 Mar 2026).
7. Significance and Applications
Test-time consistency via inverse dynamics introduces a model-agnostic, value-free axis for evaluating and selecting imagined futures in WAMs. It directly measures whether predicted actions generate their corresponding states, capturing decision-relevant structure beyond visual realism. This methodology facilitates reliable deployment of decision-making agents in robotics by filtering implausible actions at inference and by aligning generative models with actionable, physically consistent rollouts. It yields performance gains in both simulated and real-robot tasks, improves robustness to visual artifacts, and can be extended to complement reward- or value-based signals for greater planning reliability (Ruan et al., 8 May 2026, Wang et al., 18 Mar 2026).