SC3-Eval: Self-Consistent Video Generation for Robot Policy Evaluation
This presentation introduces SC3-Eval, a novel framework that evaluates robot manipulation policies by generating action-conditioned videos with enforced self-consistency. By jointly training forward and inverse dynamics models and maintaining coherence across multiple camera views, SC3-Eval achieves state-of-the-art fidelity in predicting real-world policy performance without requiring physics simulators or geometric reconstruction. The system demonstrates robust generalization to out-of-distribution tasks and provides fine-grained diagnostic insights into policy failure modes.
Script
Evaluating robot foundation models is expensive. Every policy checkpoint requires hours of real-world trials, burning through hardware and human supervision just to know if your model improved.
The authors tackled this with video world models that predict what a policy will do before deploying it. But autoregressive video prediction has a fatal flaw: tiny errors compound with every frame, and rollouts drift into physically impossible futures that tell you nothing about real performance.
SC3-Eval solves drift through self-consistency. The system jointly trains forward dynamics to generate future frames from actions, inverse dynamics to recover actions from those frames, and cross-view inpainting to enforce coherence across multiple cameras. This triple constraint anchors rollouts to physically plausible trajectories.
The results are striking. SC3-Eval achieves a 0.929 correlation between predicted and actual policy performance, outperforming every baseline across in-distribution and out-of-distribution tasks. It even reproduces specific failure modes (language-following errors, failed lifts, and incorrect placements), giving you diagnostic detail that no aggregate metric can match.
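To make that headline number concrete: a correlation of this kind compares, policy by policy, the success rate estimated from generated rollouts with the success rate measured on the real robot. Here is a minimal sketch, assuming the Pearson coefficient (the script does not say which correlation is reported). The function name `pearson` and every success rate below are made up for illustration and are not SC3-Eval results.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered products and centered squares.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-policy success rates (illustrative values only).
predicted = [0.20, 0.45, 0.60, 0.85]  # estimated from generated rollouts
real = [0.25, 0.40, 0.65, 0.80]       # measured in real-world trials

r = pearson(predicted, real)
print(f"correlation between predicted and real success: {r:.3f}")
```

A value near 1 means the world model ranks and scales policies much as real trials would. A rank-based coefficient would be the natural swap if only the ordering of policies matters.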
The inverse dynamics head does double duty at test time. After generating each video chunk, the model asks: what actions would produce these frames? High discrepancy triggers early termination, stopping rollouts before drift poisons your evaluation. This uncertainty signal is automatic, ablation-free, and essential for handling distribution shift.
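The check described here can be sketched as a simple loop. This is a toy version under stated assumptions, not the authors' implementation: `forward_model`, `inverse_model`, the mean Euclidean distance used as the discrepancy, and the fixed `threshold` are all hypothetical stand-ins, since the script gives neither the actual metric nor the stopping rule.

```python
import math
from typing import Callable, List, Sequence, Tuple

Action = Sequence[float]
Chunk = List[Action]  # one action per generated frame in a chunk


def action_discrepancy(commanded: Chunk, recovered: Chunk) -> float:
    """Mean Euclidean distance between commanded and recovered actions."""
    if not commanded or len(commanded) != len(recovered):
        raise ValueError("chunks must be non-empty and of equal length")
    dists = [math.dist(a, b) for a, b in zip(commanded, recovered)]
    return sum(dists) / len(dists)


def rollout(
    action_chunks: List[Chunk],
    forward_model: Callable[[list, Chunk], list],  # hypothetical: history + actions -> frames
    inverse_model: Callable[[list], Chunk],        # hypothetical: frames -> recovered actions
    threshold: float,                              # assumed fixed cutoff
) -> Tuple[list, bool]:
    """Generate chunk by chunk; stop once recovered actions disagree too much.

    Returns (accepted_frames, stopped_early).
    """
    frames: list = []
    for commanded in action_chunks:
        chunk_frames = forward_model(frames, commanded)
        recovered = inverse_model(chunk_frames)
        if action_discrepancy(commanded, recovered) > threshold:
            # Discard the suspect chunk and end the rollout here.
            return frames, True
        frames.extend(chunk_frames)
    return frames, False
```

In this sketch the chunk that fails the check is discarded rather than kept, so only frames whose actions were recovered within the threshold feed into the evaluation.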
SC3-Eval delivers human-level policy ranking without physics simulators or per-scene reconstruction, making scalable robot evaluation finally practical.